Compiler Optimization

Enabling the Next Generation of Hardware

The end of Moore’s law has spurred substantial recent interest in gaining performance through architectural improvements, and many companies, from established firms to startups, are innovating with new architectures. This is particularly true for specialized deep learning processors, but also includes architectures specialized for other application domains such as molecular dynamics, climate simulation, geophysical simulation, automated reasoning, signal processing, and graph processing.

Reservoir Labs is uniquely positioned to assist companies and clients with their compiler needs for these architectures. Reservoir’s team of compiler engineers has three decades of experience with specialized architectures, complemented by expertise in tools at the leading edge of artificial intelligence and neural networks. Reservoir is an established leader in applying compiler expertise to high performance computing (HPC), AI, and special purpose architectures.

Learn more about our offerings

Core Capabilities

Instruction Scheduling
We have extensive experience with instruction scheduling, including DAG scheduling and modulo scheduling techniques, and have done advanced work on scheduling in the context of predication and looping instructions.
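
As a rough illustration (a minimal hand-written sketch, not output from any particular scheduler), modulo scheduling overlaps operations from different loop iterations so the machine’s functional units stay busy. The hypothetical scale kernel below shows the resulting prologue/kernel/epilogue structure at the source level; a compiler performs the equivalent overlapping at the instruction level.

    /* Software pipelining by hand: the load for iteration i+1 is issued
       while iteration i is still being computed, mirroring what a modulo
       scheduler does with machine instructions. */
    void scale(float *restrict out, const float *restrict in, float k, int n) {
        if (n <= 0) return;
        float x = in[0];                 /* prologue: prime the pipeline */
        for (int i = 0; i < n - 1; i++) {
            float next = in[i + 1];      /* stage 1: load for iteration i+1 */
            out[i] = x * k;              /* stage 2: compute/store iteration i */
            x = next;
        }
        out[n - 1] = x * k;              /* epilogue: drain the pipeline */
    }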

Loop Optimization
We have experience with LLVM’s basic loop optimization passes. For advanced loop optimization, we offer R-Stream, a polyhedral optimizer that plugs into LLVM.
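
To give a flavor of the transformations involved (a generic tiling sketch, not actual R-Stream output), a polyhedral optimizer can restructure a loop nest so that each block of the iteration space fits in cache:

    /* Original matrix multiply: poor locality once n is large. */
    void matmul(int n, float C[n][n], const float A[n][n], const float B[n][n]) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < n; k++)
                    C[i][j] += A[i][k] * B[k][j];
    }

    /* Tiled version: T is chosen so a T x T block fits in cache. */
    #define T 64
    void matmul_tiled(int n, float C[n][n], const float A[n][n], const float B[n][n]) {
        for (int ii = 0; ii < n; ii += T)
            for (int jj = 0; jj < n; jj += T)
                for (int kk = 0; kk < n; kk += T)
                    for (int i = ii; i < ii + T && i < n; i++)
                        for (int j = jj; j < jj + T && j < n; j++)
                            for (int k = kk; k < kk + T && k < n; k++)
                                C[i][j] += A[i][k] * B[k][j];
    }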

Code Generator
Our team has expertise in developing support for new instruction sets, including instruction sets with novel instructions, predication, and vector extensions. Novel instructions can be generated in various ways: via instruction selection, intrinsics, builtins, and inline assembly. We have improved TableGen usability, including for the management of large instruction sets and for the debuggability of code generation. Our team has also written and delivered new back ends for processors with novel ISAs tailored to domain-specific optimizations.
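
As a small illustration of two of these routes from C source to a machine instruction, the sketch below uses a real GCC/Clang builtin and an extended inline-assembly wrapper; the “dotq” mnemonic is hypothetical, shown only to demonstrate the mechanism:

    #include <stdint.h>

    /* Via a builtin: __builtin_popcountll lowers to a single population-
       count instruction on targets that provide one; a new back end can
       expose target-specific builtins in the same style. */
    static inline int ones(uint64_t x) { return __builtin_popcountll(x); }

    /* Via extended inline assembly: "dotq" is a hypothetical mnemonic
       standing in for a novel instruction on a new ISA. */
    static inline int32_t dotq(int32_t a, int32_t b) {
        int32_t r;
        __asm__("dotq %0, %1, %2" : "=r"(r) : "r"(a), "r"(b));
        return r;
    }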

Vectorization
We have experience implementing vector instruction, vector register, and vector memory features.
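
The sketch below shows the kind of loop these features serve (a generic saxpy kernel, not tied to any particular target). With restrict ruling out aliasing, a vectorizing compiler can replace the scalar body with vector loads, multiplies, adds, and stores over vector registers:

    /* saxpy: y = a*x + y. The restrict qualifiers assert that x and y
       do not alias, so iterations are independent and the loop can be
       executed several elements at a time with vector instructions. */
    void saxpy(float *restrict y, const float *restrict x, float a, int n) {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }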

Optimizations and transformations often become impractical or even infeasible to write by hand. The talent of application domain experts is rare and expensive; giving those programmers a “power tool,” an automatic compiler, allows that talent to have maximum impact. Specialized architectures expose many microarchitectural details, and the tradeoffs and interactions involved in getting correct, high-performance code are complex. Automation using modern optimization techniques can efficiently explore those tradeoffs while respecting constraints and conflicts to find optimal solutions.

In the traditional compiler world, it has long been recognized that optimizing assembly code by hand is not practical in terms of programmer productivity compared with automatic compiler optimization. Modern compilers perform many optimizations automatically, such as instruction scheduling, register allocation, and vectorization.

Automated optimization can search a much larger space than a human can, potentially discovering more efficient implementations.

Instead of requiring every optimization to be written by hand for each specific piece of hardware, a compiler approach allows reuse of existing optimization passes and infrastructure from the open-source community.

Neural network problems (also referred to as “deep learning networks” or “models”) are encoded in a format that is not directly executable; they need to be translated into an executable form. This translation is the job of the neural network toolchain. Such translation is a highly complex task, involving multiple iterations over the problem description, applying sophisticated mathematical transformations at multiple levels of abstraction.

While traditional programming language toolchains are mature technologies with established solutions, deep learning toolchains are being developed at a rapid pace. Thus, it is more difficult to judge which deep learning toolchain is best, not only in terms of existing capability, but also future potential. Reservoir Labs has extensive experience with traditional compiler toolchains and applies that background to deep learning toolchains.

While building a simple neural network compiler is straightforward, getting good performance and high hardware utilization requires weighing a range of design choices that significantly affect performance, scalability, and the ability to perform advanced optimizations. Effects such as processor utilization, memory latency, and concurrency create a complex web of dependencies, requiring investigation of a large optimization space.

Reservoir is working with the leading neural network toolchains, such as TVM and Glow, to develop solutions for neural network hardware architectures. Our engineers adapt these general frameworks to specific architectures, accounting for particular memory hierarchies and instruction sets, and perform high-level optimizations such as operator fusion. A neural network toolchain achieves its greatest leverage when it incorporates robust low-level compiler support, such as that provided by LLVM, a complementary capability of Reservoir.
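
As a concrete picture of operator fusion (a C-level sketch, independent of any particular toolchain), fusing a bias add with the ReLU that follows it eliminates an intermediate buffer and a full pass over memory:

    /* Unfused: two operators, two passes over memory, one temporary. */
    void bias_relu_unfused(float *y, float *tmp, const float *x,
                           const float *b, int n) {
        for (int i = 0; i < n; i++) tmp[i] = x[i] + b[i];           /* bias add */
        for (int i = 0; i < n; i++) y[i] = tmp[i] > 0 ? tmp[i] : 0; /* ReLU */
    }

    /* Fused: one operator, one pass over memory, no temporary. */
    void bias_relu_fused(float *y, const float *x, const float *b, int n) {
        for (int i = 0; i < n; i++) {
            float t = x[i] + b[i];
            y[i] = t > 0 ? t : 0;
        }
    }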

The path to an optimal toolchain starts with understanding goals, the existing solution, and the gaps therein. Reservoir’s engineers study each firm’s architecture and existing software tools, assessing current and planned approaches to make informed, contextually sensitive recommendations for future development toward the desired performance. Reservoir provides a thorough report outlining the options, including the relative strengths and weaknesses of various approaches, along with a recommendation and a scoping of the investment required to implement it.

A solid toolchain is foundational software for any architecture. Reservoir’s engineers have experience building end-to-end toolchain solutions, starting from front ends that accept standard languages such as C and C++ all the way through optimization, machine-specific code generation, and packaging for the platform. Reservoir brings up appropriate runtime libraries (including libc and those for C++) to the level required. Reservoir also provides an array of supporting tools that integrate with the compiler, such as assemblers, disassemblers, linkers, debuggers, and profilers, all customized for the specific target architecture.

Reservoir builds on the industry-leading LLVM compiler framework, used today by the majority of large architecture providers. Reservoir also has experience developing the GNU toolchain (GCC).

Reservoir supports the toolchains we develop, offering full lifecycle maintenance.

Meet Some of our Experts

Muthu Baskaran

Fellow & Managing Engineer
Bio

Jonathan Springer

SVP
Bio

George Whiteside

Senior Engineer
Bio

Madeline Lea

Senior Compiler Engineer
Bio

Benjamin Huang

Compiler Engineer
Bio

Adithya Dattatri

Engineer
Bio

Get in touch with one of our experts today

The Latest

LLVM Virtual Developers’ Meeting Oct 6-8, 2020

The LLVM Developers’ Meeting gathers developers and users of LLVM, Clang, and related subprojects to learn the latest in novel compiler & toolchain technology through technical talks, BoFs, posters, networking and more. Reservoir Labs is proud to sponsor this

Read More »

Recent Publications

Efficient and scalable computations with sparse tensors

In a system for storing in memory a tensor that includes at least three modes, elements of the tensor are stored in a mode-based order for improving locality of references when the elements are accessed during an operation on the tensor. To facilitate efficient data reuse in a tensor transform

Read More »

Static Versioning in the Polyhedral Model

We present an approach to enhancing the optimization process in a polyhedral compiler by introducing compile-time versioning, i.e., the production of several versions of optimized code under varying assumptions on its run-time parameters. We illustrate this process by enabling versioning in the polyhedral processor placement pass. We propose an efficient

Read More »