IEEE Transactions on Computers (TC) has selected our paper Configurable XOR Hash Functions for Banked Scratchpad Memories in GPUs as the July 2016 Spotlight Paper. This month you can download the article for free from the IEEE TC website.
The first microserver from IBM/ASTRON has arrived at the TU/e. We have started benchmarking the microserver. Next, we plan to update our Bones source-to-source compiler and add the microserver as a target in the coming weeks.
More news on the microserver: IBM and ASTRON provide microserver prototypes to three Dutch partners, Nieuwe microserver van ASTRON kan Noord-Nederland honderden banen opleveren (in Dutch) and Data niet meer naar computer, maar computer naar de data (in Dutch).
test setup: microserver in a box
Today we released version 1.6.0 of our A-Darwin and Bones tools. This release makes it possible to have multiple scops in a single source file. Some bugs have also been fixed, including the processing of empty scops and a skeleton argument mismatch. The source code is available on GitHub; the documentation can be found online or in PDF.
If you have any questions, suggestions or bug reports, feel free to contact us.
Bones is a source-to-source compiler based on algorithmic skeletons and a new algorithm classification. The compiler takes C code annotated with class information as input and generates parallelized target code. Targets include NVIDIA GPUs (through CUDA), AMD GPUs (through OpenCL) and x86 CPUs (through OpenCL and OpenMP). More information is available on the Bones project page or in the paper Bones: An Automatic Skeleton-Based C-to-CUDA Compiler for GPUs.
This year’s EuroPar was held in Porto, a city at the mouth of the river Douro in the north of Portugal.
The main program included the following interesting talks:
Oh, and here is a picture of my presentation:
Last week we presented our paper on a GPU cache model at the 20th IEEE International Symposium on High Performance Computer Architecture in Orlando, Florida. The slides of the presentation are now available. The source code of the cache model is also available on GitHub. You can find the full publication on our publications page.
As modern GPUs rely partly on their on-chip memories to counter the imminent off-chip memory wall, the efficient use of their caches has become important for performance and energy. However, optimising cache locality systematically requires insight into and prediction of cache behaviour. On sequential processors, stack distance or reuse distance theory is a well-known means to model cache behaviour. However, it is not straightforward to apply this theory to GPUs, mainly because of the parallel execution model and fine-grained multi-threading. This work extends reuse distance to GPUs by modelling: 1) the GPU’s hierarchy of threads, warps, threadblocks, and sets of active threads, 2) conditional and non-uniform latencies, 3) cache associativity, 4) miss-status holding registers, and 5) warp divergence. We implement the model in C++ and extend the Ocelot GPU emulator to extract lists of memory addresses. We compare our model with measured cache miss rates for the Parboil and PolyBench/GPU benchmark suites, showing a mean absolute error of 6% and 8% for two cache configurations. We show that our model is faster and even more accurate than the GPGPU-Sim simulator.
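To make the starting point of the abstract concrete, here is a minimal sketch of classic reuse distance on a sequential address trace. It is not the GPU model from the paper: it covers only the sequential baseline and assumes a fully associative LRU cache. The function names `reuse_distances` and `miss_rate` and the example trace are hypothetical, chosen for illustration.

```python
from collections import OrderedDict


def reuse_distances(trace):
    """Return the reuse (LRU stack) distance of every access in `trace`.

    The distance is the number of distinct other cache lines touched since
    the previous access to the same line. None marks a first (cold) access.
    """
    stack = OrderedDict()  # keys ordered from least to most recently used
    distances = []
    for line in trace:
        if line in stack:
            keys = list(stack)
            # Lines that sit above `line` in the LRU stack
            distances.append(len(keys) - 1 - keys.index(line))
            stack.move_to_end(line)
        else:
            distances.append(None)
            stack[line] = True
    return distances


def miss_rate(trace, cache_lines):
    """Predicted miss rate of a fully associative LRU cache (assumption).

    An access hits only if its reuse distance is smaller than the number
    of lines the cache can hold; cold accesses always miss.
    """
    distances = reuse_distances(trace)
    misses = sum(1 for d in distances if d is None or d >= cache_lines)
    return misses / len(distances)


if __name__ == "__main__":
    trace = ["a", "b", "c", "a", "b", "d", "a"]  # hypothetical cache-line IDs
    print(reuse_distances(trace))
    for size in (2, 3):
        print(size, miss_rate(trace, size))
```

A single pass over the trace yields the distance histogram, from which the miss rate for any cache size follows without re-simulating. The paper's contribution lies in extending this idea to parallel threads, latencies, associativity, miss-status holding registers, and divergence, none of which this sketch attempts.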
In the last group meeting, we had a casual discussion about the unreasonable effectiveness of simple laws in computing. It turns out that the December 2013 issue of IEEE Computer has a special section on Computing Laws: Origins, Standing, and Impact. It covers several classic laws:
Three Fingered Jack: Productively Addressing Platform Diversity is the PhD thesis of David Sheffield from ParLab. This thesis addresses the issue of implementing computer vision applications (among other applications) on different targets, including multicore processors, data-parallel processors, custom hardware, etc. This work is related to some of our ongoing research projects.
Now that daylight saving time has ended, the days get shorter and the evenings longer, so more time becomes available for some reading by the fireplace. In case you would like to read up on GPU architectures, you will find below an introduction to the GPGPU architectures of the last couple of years.
Programmable GPU architectures have been around for about seven years now. In November 2006 NVIDIA launched its first fully programmable GPU architecture, the G80-based GeForce 8800. In June 2008 a major revision was introduced, the GT200. This first architecture is described in detail in IEEE Micro volume 28, issue 2 (March-April 2008). The article NVIDIA Tesla: A Unified Graphics and Computing Architecture describes not only the history of NVIDIA GPUs from dedicated graphics accelerators to a unified architecture suitable for GPGPU workloads, but also the CUDA programming model. Many architecture details of the GT200 have been revealed by benchmarks in the paper Demystifying GPU Microarchitecture through Microbenchmarking (PDF).
In 2010 NVIDIA launched its next big architecture: Fermi. Many details are described in the Fermi whitepaper and in the AnandTech article NVIDIA’s GeForce GTX 480 and GTX 470. Later that year an update of the Fermi architecture, oriented more towards gaming than GPGPU compute, was introduced: the GF104 in the GeForce GTX 460. More (architecture) details are described by AnandTech in NVIDIA’s GeForce GTX 460.
The latest GPGPU architecture by NVIDIA, Kepler, was released in 2012. Another whitepaper by NVIDIA describes the GK110 architecture used in the Tesla K20 GPGPU compute card. A gaming version of Kepler has also been made: the GK104 used in the GeForce GTX 680. A couple of articles on AnandTech describe the architecture in more detail: the GK104 and the GK110.
For the history of AMD’s programmable GPGPU architecture, the best place to start is the AMD Graphics Core Next (GCN) Architecture Whitepaper. It describes the evolution of AMD GPUs from fixed-function GPUs to the programmable VLIW5 and VLIW4 GPUs and finally the GCN architecture. Again AnandTech gives some nice insights into the transition from VLIW5 to VLIW4 in the article AMD’s Radeon HD 6970 & Radeon HD 6950, and from VLIW to GCN in AMD’s Graphics Core Next Preview.