Memory-efficient Parallel Tensor Decompositions



Publication Source: 2017 IEEE High Performance Extreme Computing Conference (HPEC '17), Waltham, MA, USA. (to appear) [Nominated for Best Paper]

Tensor decompositions are a powerful technique for comprehensive analysis of real-world data. Data analysis through tensor decompositions involves intensive computations over large-scale, irregular, sparse data, and optimizing the execution of such data-intensive computations is key to reducing the time-to-solution (or response time) of real-world data analysis applications. As high-performance computing (HPC) systems are increasingly used for data analysis, it is becoming correspondingly important to optimize sparse tensor computations so that they execute efficiently on modern and advanced HPC systems. Beyond utilizing the large processing capability of HPC systems, it is crucial to improve their memory performance (memory usage, communication, synchronization, memory reuse, and data locality). In this paper, we present multiple optimizations targeted toward faster and more memory-efficient execution of large-scale tensor analysis on HPC systems. We demonstrate that our techniques reduce the memory usage and execution time of tensor decomposition methods applied to multiple datasets of varied size and structure from different application domains, achieving up to an 11x reduction in memory usage and up to a 7x improvement in performance. More importantly, our optimizations make it feasible to apply large tensor decompositions to some important datasets on a multi-core system where this would otherwise not have been possible.
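For illustration, below is a minimal Python sketch of the matricized tensor times Khatri-Rao product (MTTKRP), the computational bottleneck of CP decomposition and the kind of sparse kernel such optimizations target. It is a plain reference implementation over a COO-format sparse tensor, not the optimized code described in the paper; the function name and data layout are illustrative assumptions.

import numpy as np

def mttkrp_coo(coords, vals, factors, mode):
    # Reference MTTKRP for a 3-way sparse tensor stored in COO format
    # (an illustrative baseline, not the paper's optimized kernel).
    #   coords  : (nnz, 3) integer array of nonzero indices
    #   vals    : (nnz,) array of nonzero values
    #   factors : three factor matrices; factors[m] has shape (dim_m, rank)
    #   mode    : the mode whose factor matrix is being updated
    out = np.zeros_like(factors[mode])
    a, b = [m for m in range(3) if m != mode]
    for idx, v in zip(coords, vals):
        # Scale the Hadamard product of the other modes' factor rows by the
        # nonzero value and scatter-add into the output row for this index.
        out[idx[mode]] += v * factors[a][idx[a]] * factors[b][idx[b]]
    return out

The irregular, data-dependent accesses into the factor matrices in the loop above are what make memory usage, reuse, and locality the central concerns.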


A Quantitative and Qualitative Analysis of Tensor Decompositions on Spatiotemporal Data



Publication Source: 2017 IEEE High Performance Extreme Computing Conference (HPEC '17), Waltham, MA, USA. (to appear)

With the recent explosion of systems capable of generating and storing large quantities of GPS data, there is an opportunity to develop novel techniques for analyzing such data and gaining meaningful insights from it. In this paper we examine the application of tensor decompositions, a high-dimensional data analysis technique, to georeferenced data sets. Guidance is provided on fitting spatiotemporal data into the tensor model and on analyzing the results. We find that tensor decompositions can provide insight and that further research into spatiotemporal tensor decompositions for pattern detection, clustering, and anomaly detection is warranted.
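As a hedged sketch of the kind of model construction such guidance covers, the Python snippet below bins georeferenced points into a (latitude cell x longitude cell x hour-of-day) count tensor to which a decomposition could then be fit. The 50x50 grid and hour-of-day binning are illustrative assumptions, not the discretization used in the paper.

import numpy as np

def build_spatiotemporal_tensor(lats, lons, hours, n_lat=50, n_lon=50):
    # Bin GPS points into a dense (n_lat, n_lon, 24) count tensor.
    tensor = np.zeros((n_lat, n_lon, 24))
    lat_edges = np.linspace(lats.min(), lats.max(), n_lat + 1)
    lon_edges = np.linspace(lons.min(), lons.max(), n_lon + 1)
    i = np.clip(np.digitize(lats, lat_edges) - 1, 0, n_lat - 1)
    j = np.clip(np.digitize(lons, lon_edges) - 1, 0, n_lon - 1)
    k = hours.astype(int) % 24
    np.add.at(tensor, (i, j, k), 1)  # accumulate repeated (i, j, k) bins
    return tensor

In practice such tensors are extremely sparse, so they would typically be stored in a sparse format before decomposition.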

Multiresolution Priority Queues



Publication Source: 2017 IEEE High Performance Extreme Computing Conference (HPEC '17), Waltham, MA, USA. (to appear)

Priority queues are container data structures essential to many high-performance computing (HPC) applications. In this paper, we introduce multiresolution priority queues, a data structure that improves on the performance of standard heap-based implementations by trading off a controllable amount of resolution in the space of priorities. The new data structure reduces the worst-case cost of inserting an element from O(log(n)) to O(log(r)), where n is the number of elements in the queue and r is the number of resolution groups in the priority space. The worst-case cost of removing the top element is O(1), and when the number of elements in the queue is high, the amortized cost of inserting an element also becomes O(1).
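The following Python sketch illustrates only the core trade-off, not the paper's actual data structure: priorities are quantized into fixed-width resolution groups, a bisection over the non-empty groups bounds insertion by the number of groups r rather than the number of elements n, and the top element is taken from the first non-empty group. The group width and the FIFO ordering within a group are assumptions.

import bisect
from collections import deque

class MultiresolutionPQ:
    # Min-queue that quantizes priorities into fixed-width groups.
    def __init__(self, resolution):
        self.resolution = resolution  # width of one resolution group
        self.groups = {}              # group index -> FIFO of items
        self.active = []              # sorted indices of non-empty groups

    def insert(self, priority, item):
        g = int(priority // self.resolution)
        if g not in self.groups:
            # bisect locates the group position in O(log r); the Python
            # list insertion shift is a simplification the paper avoids.
            bisect.insort(self.active, g)
            self.groups[g] = deque()
        self.groups[g].append(item)   # no ordering within a group

    def pop_top(self):
        g = self.active[0]            # first non-empty group holds the top
        item = self.groups[g].popleft()
        if not self.groups[g]:
            del self.groups[g]
            self.active.pop(0)
        return item

For example, with resolution 0.5 the priorities 0.2 and 0.3 fall into the same group and come out in insertion order; that loss of ordering within a group is exactly the resolution being traded away for speed.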

A Mathematical Framework for the Detection of Elephant Flows



Publication Source: Extended Mathematical Report

How large is a network flow? Traditionally this question has been addressed using metrics such as the number of bytes, the transmission rate, or the duration of a flow. We argue that a formal mathematical definition of flow size should account for the impact a flow has on the performance of a network: flows that have the largest impact should have the largest size. In this paper we present a theory of flow ordering that reveals the connection between the abstract concept of flow size and the QoS properties of a network. The theory is generalized to accommodate the case of partial information, allowing us to model real computer network scenarios such as those found in involuntarily lossy environments or voluntary packet-sampling protocols (e.g., sFlow). We explore one application of this theory to the problem of detecting elephant flows at very high traffic rates. The algorithm uses the information-theoretic properties of the problem to reduce the computational cost by a factor of one thousand.
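Purely as a toy illustration of detection under packet sampling (in the spirit of sFlow, and not the algorithm developed in the paper), the Python sketch below scales sampled byte counts back up to unbiased per-flow estimates and flags flows whose estimated share of total traffic exceeds a threshold; the sampling rate and the 5% share threshold are illustrative assumptions.

import random
from collections import defaultdict

def detect_elephants(packets, sample_rate=0.01, share_threshold=0.05):
    # packets: iterable of (flow_id, n_bytes) pairs.
    estimates = defaultdict(float)
    for flow_id, n_bytes in packets:
        if random.random() < sample_rate:                # sFlow-style sampling
            estimates[flow_id] += n_bytes / sample_rate  # unbiased scale-up
    total = sum(estimates.values()) or 1.0
    # Flag flows whose estimated share of total traffic is large.
    return {f for f, b in estimates.items() if b / total >= share_threshold}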

Automatic Code Generation and Data Management for an Asynchronous Task-based Runtime



Publication Source: 5th Workshop on Extreme-scale Programming Tools (ESPT-2016) at The International Conference for High Performance Computing, Networking, Storage and Analysis (SC16)

Hardware scaling and low-power considerations associated with the quest for exascale and extreme-scale computing are driving system designers to consider new runtime and execution models, such as event-driven-task (EDT) models, that enable more concurrency and reduce the amount of synchronization. Further, for performance, productivity, and code-sustainability reasons, there is an increasing demand for auto-parallelizing compiler technologies that automatically produce code for EDT-based runtimes. However, achieving scalable performance on extreme-scale systems with auto-generated code is a non-trivial challenge. The key requirements for achieving good scalable performance across many EDT-based systems are: (1) scalable dynamic creation of the task-dependence graph and spawning of tasks, (2) scalable creation and management of data and communications, and (3) dynamic scheduling of tasks and movement of data for scalable asynchronous execution. In this paper, we develop capabilities within R-Stream - an automatic source-to-source optimization compiler - for automatic generation and optimization of code and data management targeting the Open Community Runtime (OCR), an exascale-ready asynchronous task-based runtime. We demonstrate the effectiveness of our techniques through performance improvements on various benchmarks and proxy application kernels relevant to the extreme-scale computing community.
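To make the EDT notion concrete, here is a minimal, runtime-agnostic Python sketch of dependence-driven task execution in the spirit of requirement (1) above: a task is spawned with named input events and becomes runnable only once all of them are satisfied. This is a conceptual illustration only; it does not reflect the OCR API (which is a C interface) or the code that R-Stream generates.

class Task:
    def __init__(self, fn, deps):
        self.fn = fn
        self.remaining = set(deps)  # input events not yet satisfied
        self.inputs = {}            # event name -> delivered value

class Scheduler:
    def __init__(self):
        self.waiting, self.ready = [], []

    def spawn(self, fn, deps=()):
        # Tasks and their dependences are created dynamically.
        task = Task(fn, deps)
        (self.ready if not task.remaining else self.waiting).append(task)

    def satisfy(self, event, value=None):
        # Deliver an event; tasks with all inputs met become runnable.
        for task in list(self.waiting):
            if event in task.remaining:
                task.remaining.discard(event)
                task.inputs[event] = value
                if not task.remaining:
                    self.waiting.remove(task)
                    self.ready.append(task)

    def run(self):
        while self.ready:
            task = self.ready.pop()  # no imposed order: asynchronous
            task.fn(task.inputs)     # may spawn tasks or satisfy events

For example:

sched = Scheduler()
sched.spawn(lambda ins: print("sum:", ins["a"] + ins["b"]), deps=("a", "b"))
sched.satisfy("a", 1)
sched.satisfy("b", 2)
sched.run()  # prints "sum: 3"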
