Fast and Scalable Distributed Tensor Decompositions

Tensor decomposition is a prominent technique for analyzing multi-attribute data and is being increasingly used for data analysis in different application areas. Tensor decomposition methods are computationally intense and often involve irregular memory accesses over large-scale sparse data. Hence, it is critical to optimize the execution of such data-intensive computations and the associated data movement to reduce the overall time-to-solution in data analysis applications. With the growing use of advanced high-performance computing (HPC) systems for data analysis, it is becoming increasingly important to provide fast and scalable implementations of tensor decompositions and to execute them efficiently on modern HPC systems. In this paper, we present distributed tensor decomposition methods that achieve faster, memory-efficient, and communication-reduced execution on HPC systems. We demonstrate that our techniques reduce the overall communication and execution time of tensor decomposition methods when they are used to analyze datasets of varied sizes from real applications. We illustrate our results on an HPE Superdome Flex server, a high-end modular system offering large-scale in-memory computing, and on a distributed cluster of Intel Xeon multi-core nodes.
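As a minimal, illustrative sketch only (not the paper's distributed implementation), the following Python snippet shows one common instance of tensor decomposition: a rank-R CP decomposition of a small dense 3-way tensor computed via alternating least squares. The function names (unfold, khatri_rao, cp_als) are ours, and the snippet assumes only NumPy; it is meant to convey the kind of kernel whose memory accesses and communication the paper optimizes at scale.

import numpy as np

def unfold(T, mode):
    # Matricize a 3-way tensor along `mode` (C-order column indexing).
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def khatri_rao(A, B):
    # Column-wise Khatri-Rao product of A (I x R) and B (J x R) -> (I*J x R).
    R = A.shape[1]
    return (A[:, None, :] * B[None, :, :]).reshape(-1, R)

def cp_als(T, rank, iters=50, seed=0):
    # Rank-`rank` CP decomposition of a 3-way tensor via alternating least squares.
    rng = np.random.default_rng(seed)
    factors = [rng.standard_normal((dim, rank)) for dim in T.shape]
    for _ in range(iters):
        for mode in range(3):
            others = [factors[m] for m in range(3) if m != mode]
            kr = khatri_rao(others[0], others[1])  # ordering matches unfold()
            gram = (others[0].T @ others[0]) * (others[1].T @ others[1])
            factors[mode] = unfold(T, mode) @ kr @ np.linalg.pinv(gram)
    return factors

# Tiny usage example: recover a synthetic rank-3 tensor.
A, B, C = (np.random.rand(d, 3) for d in (10, 12, 14))
T = np.einsum('ir,jr,kr->ijk', A, B, C)
factors = cp_als(T, rank=3)
T_hat = np.einsum('ir,jr,kr->ijk', *factors)
print('relative error:', np.linalg.norm(T - T_hat) / np.linalg.norm(T))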