Image and video processing benchmarks for data locality optimization
By Maurice Peemen

The performance of dedicated hardware accelerators for embedded systems is often severely limited by data movement. In these accelerators the computations are usually fast and parallel; the challenge is to move the required data in and collect the results quickly enough to keep the data path well utilized. Data movement is key for performance and energy efficiency, especially in the very popular domain of image and video processing, which often requires high-bandwidth data streams that must be processed under real-time constraints. A very effective approach to reducing the communication requirements is data locality optimization, because these applications often contain a substantial amount of data reuse that can be exploited in a local buffer. We publish three real-world embedded application benchmarks from the image and video processing domain, all with extensive data transfer requirements:
1. Demosaicing: converting Bayer pattern images to RGB.
2. Motion Estimation: an essential but complex step in the video coding process for data compression.
3. Convolutional Network: state-of-the-art visual object recognition.

We evaluated our inter-tile reuse optimization methodology on these benchmarks. In this work we run the benchmarks on a MicroBlaze soft core in an FPGA and offload the loop nests with large workloads to HLS-generated accelerators. With this methodology we show that we can reduce external communication by 2.1x compared to the best case of intra-tile optimization. Furthermore, we demonstrate that our small accelerators (1-3% of the FPGA resources) can boost a simple MicroBlaze to the performance level of a high-end Intel i7 processor.
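
To make the difference between intra-tile and inter-tile reuse concrete, the sketch below counts how many input pixels a k x k sliding-window filter fetches from external memory under three buffering schemes. It is a simplified model for illustration only, not the methodology or the accelerators evaluated here. It assumes tiles are processed left to right within each row of tiles, and that the local buffer keeps only the columns shared with the left neighbour tile. The function name `external_loads` and all its parameters are hypothetical.

```python
def external_loads(height, width, k, tile_h, tile_w):
    """Count input pixels fetched from external memory for a k x k filter.

    Returns a tuple (no_reuse, intra_tile, inter_tile).
    Assumes a 'valid' filter, so the output is (height-k+1) x (width-k+1),
    and output tiles of tile_h x tile_w that divide the output evenly.
    """
    out_h, out_w = height - k + 1, width - k + 1
    if out_h % tile_h or out_w % tile_w:
        raise ValueError("tile size must divide the output size")
    tiles_y, tiles_x = out_h // tile_h, out_w // tile_w

    # No local buffer: every output pixel fetches its full k x k window.
    no_reuse = out_h * out_w * k * k

    # Intra-tile reuse: each tile fetches its input footprint exactly once,
    # but the halo shared with neighbouring tiles is fetched again.
    footprint = (tile_h + k - 1) * (tile_w + k - 1)
    intra_tile = tiles_y * tiles_x * footprint

    # Inter-tile reuse (horizontal only, an assumption of this sketch):
    # the k-1 columns shared with the left neighbour stay in the buffer,
    # so every tile after the first in a row fetches only tile_w new columns.
    new_columns = (tile_h + k - 1) * tile_w
    inter_tile = tiles_y * (footprint + (tiles_x - 1) * new_columns)

    return no_reuse, intra_tile, inter_tile


if __name__ == "__main__":
    no_reuse, intra, inter = external_loads(66, 66, k=3, tile_h=8, tile_w=8)
    print(no_reuse, intra, inter)  # 36864 6400 5280
```

In this toy setting the gain from inter-tile reuse is modest because the filter window is small relative to the tile. The overlap between tiles, and with it the potential saving, grows with larger windows and smaller tiles.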

Files: