Data Analytics



Computationally Efficient CP Tensor Decomposition Update Framework for Emerging Component Discovery in Streaming Data



Publication Source: 2018 IEEE High Performance Extreme Computing Conference (HPEC '18), Waltham, MA, USA [Best Paper Award]

We present streaming CP update, an algorithmic framework for updating CP tensor decompositions that possesses the capability of identifying emerging components and can produce decompositions of large, sparse tensors streaming along multiple modes at a low computational cost. We discuss a large-scale implementation of the proposed scheme integrated within the ENSIGN tensor analysis package, and we evaluate and demonstrate the performance of the framework, in terms of computational efficiency and capability to discover emerging components, on a real cyber dataset.
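As background on the CP model named in this abstract, here is a minimal sketch of what a CP decomposition represents: a tensor written as a weighted sum of rank-one components, each the outer product of one column per mode. It is not the streaming update framework or the ENSIGN implementation. The function name `cp_reconstruct` and the toy values are illustrative assumptions.

```python
def cp_reconstruct(weights, A, B, C):
    """Rebuild a 3-way tensor from CP weights and factor matrices.

    A, B, C are lists of rows; each row holds one value per component.
    Entry (i, j, k) is the sum over components r of
    weights[r] * A[i][r] * B[j][r] * C[k][r].
    """
    rank = len(weights)
    return [
        [
            [
                sum(weights[r] * A[i][r] * B[j][r] * C[k][r] for r in range(rank))
                for k in range(len(C))
            ]
            for j in range(len(B))
        ]
        for i in range(len(A))
    ]


# Toy rank-1 example (hypothetical values): a 2 x 2 x 2 tensor.
weights = [2.0]
A = [[1.0], [2.0]]
B = [[1.0], [3.0]]
C = [[1.0], [0.0]]
tensor = cp_reconstruct(weights, A, B, C)
```

A decomposition algorithm works in the opposite direction: given the tensor, it estimates the weights and factor matrices that best reproduce it.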

All-at-once Decomposition of Coupled Billion-scale Tensors in Apache Spark



Publication Source: 2018 IEEE High Performance Extreme Computing Conference (HPEC '18), Waltham, MA, USA

As the scale of unlabeled data rises, it becomes increasingly valuable to perform scalable, unsupervised data analysis. Tensor decompositions, which have been empirically successful at finding meaningful cross-dimensional patterns in multidimensional data, are a natural candidate to test for scalability and meaningful pattern discovery in these massive real-world datasets. Furthermore, the production of big data of different types necessitates the ability to mine patterns across disparate sources. The coupled tensor decomposition framework captures this idea by decomposing several tensors from different data sources together. We present a scalable implementation of coupled tensor decomposition on Apache Spark. We introduce nonnegativity and sparsity constraints, and perform all-at-once quasi-Newton optimization of all factor matrix parameters. We present results showing the billion-scale scalability of this novel implementation and also demonstrate the high level of interpretability in the components produced, suggesting that coupled, all-at-once tensor decompositions on Apache Spark represent a promising framework for large-scale, unsupervised pattern discovery.

High Speed Elephant Flow Detection Under Partial Information



Publication Source: 2018 IEEE International Symposium on Networks, Computers and Communications

In this paper we introduce a new framework to detect elephant flows at very high speeds and under uncertainty. The framework provides exact mathematical formulas to compute the detection likelihood and introduces a new flow reconstruction lemma under partial information. These theoretical results lead to the design of BubbleCache, a new elephant flow detection algorithm designed to operate near the optimal tradeoff between computational scalability and accuracy by dynamically tracking the traffic's natural cutoff sampling rate. We demonstrate on a real-world 100 Gbps network that the BubbleCache algorithm helps reduce the computational cost by a factor of 1000 and the memory requirements by a factor of 100 while detecting the top flows on the network with very high probability.
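The general idea of detecting large flows from partial information can be illustrated with plain packet sampling. The sketch below is not BubbleCache: it uses a fixed sampling rate rather than tracking a cutoff rate dynamically, and the function name `detect_top_flows`, the flow labels, and the traffic mix are all hypothetical.

```python
import random
from collections import Counter


def detect_top_flows(packets, sampling_rate, k, seed=0):
    """Estimate the k largest flows while inspecting only a sample of packets.

    packets is an iterable of flow identifiers, one per packet.
    Each packet is inspected with probability sampling_rate.
    """
    rng = random.Random(seed)
    counts = Counter()
    for flow_id in packets:
        if rng.random() < sampling_rate:
            counts[flow_id] += 1
    return [flow for flow, _ in counts.most_common(k)]


# Hypothetical traffic: one elephant flow among 100 small "mouse" flows.
packets = ["elephant"] * 10000
packets += [f"mouse-{i}" for i in range(100) for _ in range(10)]
random.Random(1).shuffle(packets)

# Inspect roughly 10% of the packets and ask for the single largest flow.
top = detect_top_flows(packets, sampling_rate=0.1, k=1)
```

Because the elephant flow dominates the traffic, it still stands out clearly when only about one packet in ten is counted, which is why sampling can cut processing cost without losing the top flows.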


Topic Modeling for Analysis of Big Data Tensor Decompositions



Publication Source: SPIE Proceedings Volume 10652, Disruptive Technologies in Information Sciences; 1065208 (2018), doi: 10.1117/12.2306933

Tensor decompositions are a class of algorithms used for unsupervised pattern discovery. Structured, multidimensional datasets are encoded as tensors and decomposed into discrete, coherent patterns captured as weighted collections of high-dimensional vectors known as components. Tensor decompositions have recently shown promising results when addressing problems related to data comprehension and anomaly discovery in cybersecurity and intelligence analysis. However, analysis of Big Data tensor decompositions is currently a critical bottleneck owing to the volume and variety of unlabeled patterns that are produced. We present an approach to automated component clustering and classification based on the Latent Dirichlet Allocation (LDA) topic modeling technique and show example applications to representative cybersecurity and geospatial datasets.
 
Copyright 2018 Society of Photo-Optical Instrumentation Engineers (SPIE). One print or electronic copy may be made for personal use only. Systematic reproduction and distribution, duplication of any material in this paper for a fee or for commercial purposes, or modification of the content of the paper are prohibited.


Algorithms and Data Structures to Accelerate Network Analysis (Extended Version)



Publication Source: Elsevier: Future Generation Computer Systems Volume 86, September 2018

As the sheer amount of computer-generated data continues to grow exponentially, new bottlenecks are unveiled that require rethinking our traditional software and hardware architectures. In this paper, we present five algorithms and data structures (long queue emulation, lockless bimodal queues, tail early dropping, LFN tables, and multiresolution priority queues) designed to optimize the process of analyzing network traffic. We integrated these optimizations on R-Scope, a high-performance network appliance that runs the Bro network analyzer, and present benchmarks showcasing performance speed-ups of 5X at traffic rates of 10 Gbps.
