Tools

Bones: A Parallelizing Source-to-Source Compiler Based on Algorithmic Skeletons

By Cedric Nugteren
Bones is a source-to-source compiler based on algorithmic skeletons and the presented algorithm classification. The compiler takes C code annotated with class information as input and generates parallelized target code. Currently, targets include NVIDIA GPUs (through CUDA), AMD GPUs (through OpenCL) and x86 CPUs (through OpenCL). The compiler is based on the C parser CAST (http://cast.rubyforge.org/), which is used to parse the input code into an abstract syntax tree (AST) ...


Algorithmic Species: A Classification of Program Code for Parallel Programming

By Pieter Custers, Cedric Nugteren
Algorithmic species is an algorithm classification technique targeted at parallel programming. Species are classes of algorithms that capture, among other things, the amount and structure of parallelism and data reuse in nested for-loops. Algorithmic species are derived from a polyhedral representation of code, but provide a more intuitive description of classes, similar to skeletons or pattern languages. ASET (algorithmic species extraction tool) is a tool to automatically annotate C program code with ...


Algorithm mappings

Image and video processing benchmarks for data locality optimization

By Maurice Peemen
The performance of dedicated hardware accelerators for embedded systems is often severely limited by data movement. In these accelerators the computations are often fast and parallel; the challenge is to move the required data in and collect the results quickly enough to keep the data path well utilized. Data movement is key for performance and energy efficiency, especially in the very popular domain of image and video processing. This domain often requires high-bandwidth data streams that must be processed under real-time constraints. A very effective approach to reduce the communication requirements is data locality optimization, because these applications often contain a substantial amount of data reuse that can be exploited in a local buffer. We publish three real-world embedded application benchmarks from the image and video processing domain, with extensive data transfer requirements:
1. Demosaicing: converting Bayer pattern images to RGB.
2. Motion Estimation: an essential but complex step in the video coding process for data compression.
3. Convolutional Network: state-of-the-art visual object recognition.
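
The effect of exploiting data reuse in a local buffer can be illustrated with a small sketch. The example below is not taken from the benchmarks; it assumes a hypothetical 1D filter and simply counts how many samples are fetched from "external" memory with and without a sliding-window buffer. The function names and the load counter are illustrative assumptions.

```python
from collections import deque


def filter_naive(signal, taps):
    """Apply a 1D filter, re-fetching every input sample for each output."""
    k = len(taps)
    loads = 0  # number of samples fetched from the (hypothetical) external memory
    out = []
    for i in range(len(signal) - k + 1):
        acc = 0
        for j in range(k):
            acc += taps[j] * signal[i + j]
            loads += 1
        out.append(acc)
    return out, loads


def filter_buffered(signal, taps):
    """Apply the same filter, keeping the last k samples in a local buffer."""
    k = len(taps)
    window = deque(maxlen=k)  # local buffer that holds the reused samples
    loads = 0
    out = []
    for sample in signal:
        window.append(sample)  # each sample is fetched exactly once
        loads += 1
        if len(window) == k:
            out.append(sum(t * w for t, w in zip(taps, window)))
    return out, loads


signal = list(range(10))
taps = [1, 2, 1]
naive_out, naive_loads = filter_naive(signal, taps)
buffered_out, buffered_loads = filter_buffered(signal, taps)
print(naive_loads, buffered_loads)
```

Both versions produce the same output, but the buffered version fetches each input sample only once, so its external traffic no longer grows with the filter size.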



Camera Throughput Exploration for the Reconfigurable ZedBoard

By , Shakith Fernando, Maurice Peemen
A recent innovation in heterogeneous platforms is the Zynq-7000 all programmable system-on-chip, which offers an embedded processor combined with FPGA-based reconfigurable logic. This work utilises this platform to capture live video data from a USB camera and apply hardware-accelerated operations on the data. A naive implementation has a limited frame rate. Therefore, a bottleneck analysis is done and multiple optimisations are proposed and applied. This study demonstrates that there are multiple opportunities to improve the throughput from camera to accelerator. The best implementation results in a 32x speed-up over an OpenCV implementation.



Hough Transform on GPU

By Gert-Jan van den Braak
Mapping the Hough Transform to a GPU can be tricky, especially when you want to achieve maximum performance. In the paper "Fast Hough Transform on GPUs: Exploration of Algorithm Trade-Offs" (see the Publications page) we introduced three different methods to calculate the Hough transform for lines on a GPU. The first implementation is basic, and is (just a bit) slower than an optimized CPU implementation. The second implementation is aimed at speed ...
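
For readers unfamiliar with the algorithm, the voting step of the Hough transform for lines can be sketched as follows. This is a plain sequential sketch under simple assumptions (integer rho bins, one-degree theta steps by default), not any of the three GPU implementations from the paper; the function name and parameters are illustrative.

```python
import math


def hough_lines(edge_pixels, width, height, n_theta=180):
    """Vote in a (rho, theta) accumulator for every edge pixel.

    Returns the accumulator and the rho offset needed to index it,
    since rho ranges over [-max_rho, max_rho].
    """
    max_rho = int(math.ceil(math.hypot(width, height)))
    n_rho = 2 * max_rho + 1
    acc = [[0] * n_theta for _ in range(n_rho)]

    # Precompute the trigonometric tables once for all pixels.
    cos_t = [math.cos(math.pi * t / n_theta) for t in range(n_theta)]
    sin_t = [math.sin(math.pi * t / n_theta) for t in range(n_theta)]

    for x, y in edge_pixels:
        for t in range(n_theta):
            rho = int(round(x * cos_t[t] + y * sin_t[t]))
            acc[rho + max_rho][t] += 1  # one vote per pixel per angle
    return acc, max_rho


# A horizontal line y = 5 across a 20x20 image.
pixels = [(x, 5) for x in range(20)]
acc, max_rho = hough_lines(pixels, 20, 20)
votes_horizontal = acc[5 + max_rho][90]  # rho = 5, theta = 90 degrees
print(votes_horizontal)
```

In a parallel mapping, the accumulator updates are the contended part: many pixels may vote for the same bin at the same time, which is one reason a straightforward GPU port can be slow.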



Speed Sign Detection and Recognition by Convolutional Neural Networks

By Maurice Peemen
Dataset for training and testing a speed sign detection and recognition application: A fully trainable application for speed sign detection and recognition from a video stream is developed in this work. We show that a fully trainable solution can perform reliable classification under varying circumstances (day and night). When such a parallel neural network is mapped to a parallel platform such as a GPU, real-time detection is achieved at 35 fps ...



Highly efficient and predictable histogramming for GPUs

By Cedric Nugteren
Histogramming has been mapped to GPUs prior to this work. Although significant research effort has been spent on optimizing the mapping, we show that the performance and performance predictability of existing methods can still be improved. We present two novel histogramming methods, both achieving higher performance and predictability than existing methods ...
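
As background, one common way to parallelize histogramming is to give each worker a private sub-histogram and merge the results afterwards. The sketch below illustrates that general idea sequentially; it is not one of the two methods presented in this work, and the function name and the round-robin work distribution are assumptions made for illustration.

```python
def histogram_privatized(values, n_bins, n_workers=4):
    """Build a histogram from per-worker private sub-histograms."""
    # Each (hypothetical) worker updates only its own sub-histogram,
    # so no two workers ever write to the same counter.
    partial = [[0] * n_bins for _ in range(n_workers)]
    for i, v in enumerate(values):
        partial[i % n_workers][v] += 1

    # Reduction step: sum the private sub-histograms bin by bin.
    return [sum(p[b] for p in partial) for b in range(n_bins)]


hist = histogram_privatized([0, 1, 1, 3, 3, 3], n_bins=4)
print(hist)  # [1, 2, 0, 3]
```

Privatization avoids conflicting updates to shared bins at the cost of extra memory and a final reduction. How much conflict a shared histogram sees depends on the input data, which is why performance predictability is a concern for this kind of workload.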
