Trading off Latency for Memory Bandwidth in Spatial Architectures using the Polyhedral Model

We present a fast algorithm for transforming loop nests represented in the polyhedral model, which have been placed onto a spatial grid of processors. The goal of the transformation is to avoid broadcasts, which can put extra pressure on remote memory bandwidth, and turn them mainly into neighbor-to-neighbor reuse. The transformation preserves existing parallelism and locality properties of the original program as much as possible, and in particular does not reduce its amount of parallelism.