June 8, 2020
In this paper, we introduce the Theory of Bottleneck Ordering, a mathematical framework that reveals the bottleneck structure of data networks. This theoretical framework provides insights into the inherent topological properties of a network in at least three areas: (1) it identifies the regions of influence of each bottleneck; (2) it reveals the order in which bottlenecks (and the flows traversing them) converge to their steady-state transmission rates under distributed congestion control algorithms; and (3) it provides key insights into the design of optimized traffic engineering policies. We demonstrate the efficacy of the proposed theory in TCP congestion-controlled networks for two broad classes of algorithms: congestion-based algorithms (BBR) and loss-based additive-increase/multiplicative-decrease algorithms (Cubic and Reno). Among other results, our network experiments show that: (1) qualitatively, both classes of congestion control algorithms behave as predicted by the bottleneck structure of the network; (2) flows compete for bandwidth only with other flows operating at the same bottleneck level; (3) BBR flows achieve higher performance and fairness than Cubic and Reno flows due to their ability to operate at the right bottleneck level; (4) the bottleneck structure of a network is ever-changing, and its levels can be folded due to variations in the flows' round-trip times; and (5) against conventional wisdom, low-hitter flows can have a large impact on the overall performance of a network.
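The idea of bottleneck levels converging in order can be illustrated with the classical progressive-filling (water-filling) procedure for max-min fair allocation, which saturates links one at a time: the order in which links saturate is a simple stand-in for the bottleneck ordering the abstract describes. This is a minimal sketch, not the paper's actual framework; the link names, capacities, and flow routes below are made up for the example.

```python
# Hypothetical illustration: progressive filling computes max-min fair flow
# rates and, as a by-product, an ordering in which links become bottlenecks.

def max_min_rates(capacities, routes):
    """capacities: {link: capacity}; routes: {flow: [links traversed]}.
    Returns ({flow: rate}, [links in the order they saturate])."""
    remaining = dict(capacities)                       # unused capacity per link
    active = {f: list(r) for f, r in routes.items()}   # flows not yet frozen
    rates, order = {}, []
    while active:
        # Fair share each still-used link could give its active flows.
        share = {l: remaining[l] / sum(l in r for r in active.values())
                 for l in remaining
                 if any(l in r for r in active.values())}
        # The tightest link saturates first: it is the next bottleneck level.
        bneck = min(share, key=share.get)
        order.append(bneck)
        for f in [f for f, r in active.items() if bneck in r]:
            rates[f] = share[bneck]                    # freeze flows at this level
            for l in active.pop(f):
                remaining[l] -= share[bneck]
    return rates, order
```

In the three-flow example below, flows f1 and f2 are bottlenecked first at link l1 and only then does f3 settle at l2, matching the abstract's claim that flows compete only with other flows at the same bottleneck level.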
This paper has not yet been published. If you would like to receive a copy once available, please contact us or check back here in December.
November 17, 2019
Congestion control algorithms for data networks have been the subject of intense research for the last three decades. While most of this work has focused on characterizing a flow's bottleneck link, the interactions among links and the ripple effects that perturbations in one link can cause on the rest of the network have remained much less understood. The Theory of Bottleneck Ordering is a recently developed mathematical framework that reveals the bottleneck structure of a network and provides a model for understanding such effects. In this paper we present G2, the first operational network optimization framework that utilizes this new theoretical framework to characterize with high precision the performance of bottlenecks and flows. G2 generates an interactive graph structure that describes how perturbations in links and flows propagate, providing operators with new optimization insights and traffic engineering recommendations to help improve network performance. We provide a description of the G2 implementation and a set of experiments using real TCP/IP code to demonstrate its operational efficacy.
This paper has not yet been published. If you would like to receive a copy once available, please contact us or check back here after the 2019 SuperComputing Conference. For more information on G2 technology, visit https://www.reservoir.com/research/tech/networking/.
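One way to picture how perturbations propagate between links and flows is as reachability over the bipartite graph of links and the flows that traverse them. The sketch below is only a coarse illustration under that assumption: it computes undirected reachability, whereas G2's actual graph structure is more refined, and the routes shown are invented for the example.

```python
# Hypothetical sketch: which links and flows can a perturbation at one link
# reach, treating the network as a bipartite flow/link graph and walking it
# with BFS. This over-approximates a real region of influence.
from collections import deque

def region_of_influence(routes, start_link):
    """routes: {flow: [links it traverses]}. Returns (links, flows) reachable
    from a perturbation at start_link."""
    flows_of = {}                       # link -> set of flows crossing it
    for f, links in routes.items():
        for l in links:
            flows_of.setdefault(l, set()).add(f)
    seen_links, seen_flows = {start_link}, set()
    q = deque([('link', start_link)])
    while q:
        kind, node = q.popleft()
        if kind == 'link':              # a perturbed link affects its flows
            for f in flows_of.get(node, ()):
                if f not in seen_flows:
                    seen_flows.add(f)
                    q.append(('flow', f))
        else:                           # a perturbed flow affects its links
            for l in routes[node]:
                if l not in seen_links:
                    seen_links.add(l)
                    q.append(('link', l))
    return seen_links, seen_flows
```

For instance, a perturbation at l1 reaches l2 and l3 through the flows that share them, while an isolated link l4 stays unaffected.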
Combining Tensor Decompositions and Graph Analytics to Provide Cyber Situational Awareness at HPC Scale
September 24, 2019
This paper describes MADHAT (Multidimensional Anomaly Detection fusing HPC, Analytics, and Tensors), an integrated workflow that demonstrates the applicability of HPC resources to the problem of maintaining cyber situational awareness. MADHAT combines two high-performance packages: ENSIGN for large-scale sparse tensor decompositions and HAGGLE for graph analytics. Tensor decompositions isolate coherent patterns of network behavior in ways that common clustering methods based on distance metrics cannot. Parallelized graph analysis then uses directed queries on a representation that combines the elements of identified patterns with other available information (such as additional log fields, domain knowledge, network topology, whitelists and blacklists, prior feedback, and published alerts) to confirm or reject a threat hypothesis, collect context, and raise alerts. MADHAT was developed using the collaborative HPC Architecture for Cyber Situational Awareness (HACSAW) research environment and evaluated on structured network sensor logs collected from Defense Research and Engineering Network (DREN) sites using HPC resources at the U.S. Army Engineer Research and Development Center DoD Supercomputing Resource Center (ERDC DSRC). To date, MADHAT has analyzed logs with over 650 million entries.
September 24, 2019
With the end of Moore’s law, asynchronous task-based parallelism has seen growing support as a parallel programming paradigm, with the runtime system offering such advantages as dynamic load balancing, locality, and scalability. However, there has been a proliferation of such programming systems in recent years, each of which presents different performance tradeoffs and runtime semantics. Developing applications on top of these systems thus requires not only application expertise but also deep familiarity with the runtime, exacerbating the perennial problems of programmability and portability. This work makes three main contributions to this growing landscape. First, we extend a polyhedral optimizing compiler with techniques to extract task-based parallelism and data management for a broad class of asynchronous task-based runtimes. Second, we introduce a generic runtime layer for asynchronous task-based systems with representations of data and tasks that are sparse and tiled by default, which serves as an abstract target for the compiler backend. Finally, we implement this generic layer using OpenMP and Legion, demonstrating the flexibility and viability of the generic layer and delivering an end-to-end path for automatic parallelization to asynchronous task-based runtimes. Using a wide range of applications from deep learning to scientific kernels, we obtain geometric mean speedups of 23.0× (OpenMP) and 9.5× (Legion) using 64 threads.
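The task-based idiom the abstract targets can be sketched with nothing more than a standard-library thread pool: each tile of work becomes an independent task, and the scheduler assigns tasks to workers dynamically as they finish rather than statically. This is a minimal stand-in, not the paper's generic layer or the OpenMP/Legion backends, and the tile kernel below is invented for the example.

```python
# Hypothetical sketch of asynchronous task-based parallelism: tiled work
# submitted as independent tasks, dynamically load-balanced by the runtime.
from concurrent.futures import ThreadPoolExecutor

def process_tile(tile):
    # Stand-in for a tiled kernel; a real runtime would also track the data
    # each task reads and writes in order to schedule dependent tasks.
    return sum(x * x for x in tile)

def run_tasks(tiles, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each tile is one task; workers pick up tasks as they become free.
        futures = [pool.submit(process_tile, t) for t in tiles]
        return sum(f.result() for f in futures)
```

A real task-based runtime such as Legion adds what this sketch omits: declared data regions per task, so the scheduler can derive dependences and place tasks for locality instead of treating every task as independent.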
September 24, 2019
Tensor decomposition is a prominent technique for analyzing multi-attribute data and is being increasingly used for data analysis in different application areas. Tensor decomposition methods are computationally intense and often involve irregular memory accesses over large-scale sparse data. Hence it becomes critical to optimize the execution of such data-intensive computations and the associated data movement to reduce the eventual time-to-solution in data analysis applications. With the prevalence of advanced high-performance computing (HPC) systems for data analysis applications, it is becoming increasingly important to provide fast and scalable implementations of tensor decompositions and to execute them efficiently on modern HPC systems. In this paper, we present distributed tensor decomposition methods that achieve faster, memory-efficient, and communication-reduced execution on HPC systems. We demonstrate that our techniques reduce the overall communication and execution time of tensor decomposition methods when they are used to analyze datasets of varied sizes from real applications. We illustrate our results on an HPE Superdome Flex server, a high-end modular system offering large-scale in-memory computing, and on a distributed cluster of Intel Xeon multi-core nodes.
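The irregular memory accesses mentioned above can be seen in MTTKRP (matricized tensor times Khatri-Rao product), the core kernel of sparse CP decomposition: every nonzero scatters into one output row while gathering rows of two factor matrices. The sketch below shows only this serial kernel under made-up shapes and values; it is not the paper's distributed method.

```python
# Hypothetical sketch of sparse MTTKRP for a 3-way tensor in coordinate
# form: M = X_(0) (C ⊙ B). The coordinate-driven gathers and scatters here
# are the irregular access pattern that HPC implementations must optimize.

def mttkrp_mode0(coords, vals, B, C, num_rows, rank):
    """coords[n] = (i, j, k) and vals[n] = X[i, j, k] for each nonzero;
    B, C are factor matrices stored as row-major lists of lists."""
    M = [[0.0] * rank for _ in range(num_rows)]
    for (i, j, k), v in zip(coords, vals):
        for r in range(rank):
            # Gather row j of B and row k of C, scatter into row i of M:
            # the rows touched depend entirely on the sparsity pattern.
            M[i][r] += v * B[j][r] * C[k][r]
    return M
```

Distributed implementations partition the nonzeros and factor rows across nodes, which is where the communication and data-movement costs the abstract targets arise.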