Polyhedral Optimization of TensorFlow Computation Graphs



We present R-Stream·TF, a polyhedral optimization tool for neural network computations. R-Stream·TF transforms the computations performed in a neural network graph into C programs suited to the polyhedral representation and uses R-Stream, a polyhedral compiler, to parallelize and optimize them. R-Stream·TF can exploit the optimizations available in R-Stream to generate a highly optimized version of the computation graph, specifically mapped to the target architecture. In our experiments, R-Stream·TF automatically reached performance levels close to those of hand-optimized implementations, demonstrating its utility in porting neural network computations to parallel architectures.


Memory-efficient Parallel Tensor Decompositions



Publication Source: 2017 IEEE High Performance Extreme Computing Conference (HPEC '17), Waltham, MA, USA. [Best Paper Award Winner]

Tensor decompositions are a powerful technique for enabling comprehensive analysis of real-world data. Data analysis through tensor decompositions involves intensive computations over large-scale, irregular, sparse data. Optimizing the execution of such data-intensive computations is key to reducing the time-to-solution (or response time) in real-world data analysis applications. As high-performance computing (HPC) systems are increasingly used for data analysis, it is becoming crucial to optimize sparse tensor computations so that they execute efficiently on modern HPC systems. Beyond utilizing the large processing capability of these systems, it is essential to improve their memory performance (memory usage, communication, synchronization, memory reuse, and data locality). In this paper, we present multiple optimizations targeted at faster and more memory-efficient execution of large-scale tensor analysis on HPC systems. We demonstrate that our techniques reduce the memory usage and execution time of tensor decomposition methods when applied to multiple datasets of varied size and structure from different application domains, achieving up to an 11x reduction in memory usage and up to a 7x improvement in performance. More importantly, we enable the application of large tensor decompositions to important datasets on a multi-core system, which would not have been feasible without our optimizations.
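The abstract does not detail the paper's specific optimizations, but the kind of sparse computation it targets can be illustrated with the MTTKRP (matricized tensor times Khatri-Rao product) kernel, the dominant operation in CP tensor decomposition. The sketch below is illustrative only (names, shapes, and the COO layout are assumptions, not the paper's implementation):

```python
import numpy as np

def mttkrp_mode0(coords, vals, B, C, I):
    """Mode-0 MTTKRP over a sparse 3-way tensor in COO form.

    coords: (nnz, 3) integer array of (i, j, k) indices.
    vals:   (nnz,) nonzero values of the tensor.
    B, C:   factor matrices of shapes (J, R) and (K, R).
    I:      size of mode 0 (number of output rows).
    Computes M[i, :] += vals * (B[j, :] * C[k, :]) over all nonzeros.
    """
    R = B.shape[1]
    M = np.zeros((I, R))
    i, j, k = coords[:, 0], coords[:, 1], coords[:, 2]
    # Scatter-accumulate: duplicate i indices are summed correctly.
    np.add.at(M, i, vals[:, None] * B[j] * C[k])
    return M
```

Memory behavior here is dominated by the irregular, data-dependent accesses into B, C, and M — exactly the kind of access pattern whose locality and memory footprint the paper's optimizations aim to improve.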


A Quantitative and Qualitative Analysis of Tensor Decompositions on Spatiotemporal Data



Publication Source: 2017 IEEE High Performance Extreme Computing Conference (HPEC '17), Waltham, MA, USA. (to appear)

With the recent explosion of systems capable of generating and storing large quantities of GPS data, there is an opportunity to develop novel techniques for analyzing these data and gaining meaningful insights from them. In this paper we examine the application of tensor decompositions, a high-dimensional data analysis technique, to georeferenced data sets. Guidance is provided on fitting spatiotemporal data into the tensor model and on analyzing the results. We find that tensor decompositions can provide insight and that future research into spatiotemporal tensor decompositions for pattern detection, clustering, and anomaly detection is warranted.

Multiresolution Priority Queues



Publication Source: 2017 IEEE High Performance Extreme Computing Conference (HPEC '17), Waltham, MA, USA. (to appear)

Priority queues are container data structures essential to many high-performance computing (HPC) applications. In this paper, we introduce multiresolution priority queues, a data structure that improves on the performance of standard heap-based implementations by trading off a controllable amount of resolution in the space of priorities. The new data structure can reduce the worst-case cost of inserting an element from O(log(n)) to O(log(r)), where n is the number of elements in the queue and r is the number of resolution groups in the priority space. The worst-case cost of removing the top element is O(1). When the number of elements in the queue is high, the amortized cost of inserting an element becomes O(1).
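The core idea above — quantizing the priority space into r resolution groups and keeping only FIFO order within a group — can be sketched as follows. This is a hypothetical minimal reading of the abstract, not the paper's implementation: with uniform group widths the group index is found in O(1) (the paper's O(log(r)) bound covers the general search over groups), and pop is O(1) per element once the lowest non-empty group is tracked.

```python
import collections

class MultiresolutionPQ:
    """Sketch of a multiresolution priority queue (illustrative API).

    Priorities in [lo, hi) are quantized into r uniform resolution
    groups; elements within a group are FIFO-ordered, trading exact
    priority order for cheap insert and O(1) removal of the minimum.
    """

    def __init__(self, lo, hi, r):
        self.lo, self.r = lo, r
        self.width = (hi - lo) / r
        self.groups = [collections.deque() for _ in range(r)]
        self.lowest = r  # index of the lowest possibly non-empty group

    def insert(self, priority, item):
        # Uniform quantization: group lookup is O(1) here.
        g = min(self.r - 1, int((priority - self.lo) / self.width))
        self.groups[g].append(item)
        if g < self.lowest:
            self.lowest = g

    def pop(self):
        # Advance past emptied groups, then remove the oldest element
        # of the lowest non-empty group.
        while self.lowest < self.r and not self.groups[self.lowest]:
            self.lowest += 1
        if self.lowest == self.r:
            raise IndexError("pop from empty queue")
        return self.groups[self.lowest].popleft()
```

Within a group, elements with nearby priorities may come out in insertion order rather than strict priority order; that is the controllable resolution the abstract trades for performance.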

A Mathematical Framework for the Detection of Elephant Flows



Publication Source: Extended Mathematical Report

How large is a network flow? Traditionally this question has been addressed using metrics such as the number of bytes, the transmission rate, or the duration of a flow. We reason that a formal mathematical definition of flow size should account for the impact a flow has on the performance of a network: flows that have the largest impact should have the largest size. In this paper we present a theory of flow ordering that reveals the connection between the abstract concept of flow size and the QoS properties of a network. The theory is generalized to accommodate the case of partial information, allowing us to model real computer network scenarios such as those found in involuntary lossy environments or voluntary packet-sampling protocols (e.g., sFlow). We explore one application of this theory to the problem of elephant flow detection at very high traffic rates, where the resulting algorithm exploits the information-theoretic properties of the problem to reduce the computational cost by a factor of one thousand.
