Embedded Computer Architecture
In this course we treat different processor architectures: DSPs (digital signal processors), VLIWs (very long instruction word architectures, including Transport Triggered Architectures), ASIPs (application-specific instruction-set processors), and highly tuned, weakly programmable processors. In all cases it is shown how to program these architectures. Code generation techniques, especially for VLIWs, are treated, including methods to optimize code at the source or assembly level. Furthermore, the design of advanced data and instruction memory hierarchies is detailed, and a methodology is discussed for the efficient use of the data memory hierarchy.

(visit webpage)
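
To give a flavour of the source-level memory optimizations the course refers to, below is a minimal sketch (an illustration, not course material) of loop tiling: restructuring a matrix transpose so each BLOCK x BLOCK tile stays cache-resident while it is worked on. The values of N and BLOCK are hypothetical.

    /* Illustrative sketch, not course code: tile a matrix transpose
       so the working set fits in the data cache. N and BLOCK are
       hypothetical; BLOCK would be tuned to the cache size. */
    #include <stdio.h>

    #define N     1024
    #define BLOCK 32    /* assumption: one tile fits in L1 */

    static float a[N][N], b[N][N];

    void transpose_tiled(void)
    {
        for (int ii = 0; ii < N; ii += BLOCK)
            for (int jj = 0; jj < N; jj += BLOCK)
                /* process one cache-resident tile at a time */
                for (int i = ii; i < ii + BLOCK; i++)
                    for (int j = jj; j < jj + BLOCK; j++)
                        b[j][i] = a[i][j];
    }

    int main(void)
    {
        a[1][2] = 3.0f;
        transpose_tiled();
        printf("%f\n", b[2][1]);   /* prints 3.000000 */
        return 0;
    }

An untiled transpose walks one of the two arrays column-wise over the whole matrix; the tiled version keeps the working set small, which is the kind of trade-off a data memory hierarchy methodology reasons about.
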
Advanced Computer Architecture
Basic principles (such as instruction set design); pipelining and its consequences; VLIW (very long instruction word), superpipelined, superscalar, SIMD (single instruction, multiple data, as used in vector and subword-parallel processors), and MIMD (multiple instruction, multiple data) architectures; SMT (simultaneous multithreading); out-of-order and speculative execution; branch prediction; data (value) prediction; design of advanced memory hierarchies; memory coherency and consistency; multithreading; exploiting task-level and instruction-level parallelism; inter-processor communication models; input and output; network communication architecture; and networks-on-chip.

(visit webpage)
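
As a taste of one listed topic, subword parallelism, here is a minimal sketch (an illustration, not course material) of the "SIMD within a register" idea: adding four packed 8-bit lanes in one 32-bit word while keeping carries from crossing lane boundaries.

    /* Illustrative subword-parallel (SWAR) sketch, not course code:
       add four 8-bit lanes packed in a 32-bit word. */
    #include <stdio.h>
    #include <stdint.h>

    uint32_t add_bytes(uint32_t a, uint32_t b)
    {
        /* add the low 7 bits of each lane; carries stay in-lane */
        uint32_t low = (a & 0x7f7f7f7fu) + (b & 0x7f7f7f7fu);
        /* restore each lane's most significant bit */
        return low ^ ((a ^ b) & 0x80808080u);
    }

    int main(void)
    {
        /* lanes: 01+10, 02+20, 03+30, ff+01 (wraps to 00) */
        printf("%08x\n", add_bytes(0x010203ffu, 0x10203001u)); /* 11223300 */
        return 0;
    }
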
Processor Design
The course treats both the architecture and the implementation of current RISC-style processors. RISC processors have a reduced instruction set, which enables the pipelined execution of instructions. This gives them a very high instruction throughput and performance, while their implementation is not overly complex. RISC processors are used everywhere: not only in general-purpose processors (although a Pentium is not a RISC processor, its core also follows RISC design principles), but also in billions of embedded systems. This course not only treats how to design these processors, but also details the required memory hierarchy, interfacing, input and output peripherals, and the role of the operating system. In total this gives a thorough understanding of a complete processor system. Part of the course consists of laboratory assignments, in which students exercise assembly-level programming and implement (parts of) a RISC processor in SystemC. The implementation has to be verified with real programs. The MIPS architecture is used as the guiding example throughout the course.

(visit webpage)
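
The laboratory work itself is done in SystemC; purely to illustrate the idea, the sketch below is a hypothetical plain-C fetch/decode/execute loop for two MIPS instructions (addu and addiu), far simpler than the lab model.

    /* Minimal illustrative sketch, not the lab's SystemC model:
       fetch, decode, and execute two MIPS instructions. */
    #include <stdio.h>
    #include <stdint.h>

    static uint32_t reg[32];

    void run(const uint32_t *imem, int n)
    {
        for (int pc = 0; pc < n; pc++) {             /* fetch */
            uint32_t ir = imem[pc];
            uint32_t op = ir >> 26;                  /* decode */
            uint32_t rs = (ir >> 21) & 31, rt = (ir >> 16) & 31;
            uint32_t rd = (ir >> 11) & 31;
            int32_t  imm = (int16_t)(ir & 0xffff);   /* sign-extend */
            if (op == 0 && (ir & 63) == 0x21)        /* addu rd,rs,rt */
                reg[rd] = reg[rs] + reg[rt];
            else if (op == 0x09)                     /* addiu rt,rs,imm */
                reg[rt] = reg[rs] + imm;
            reg[0] = 0;                              /* $zero is hardwired */
        }
    }

    int main(void)
    {
        uint32_t prog[] = {
            0x24080005,  /* addiu $t0,$zero,5 */
            0x24090007,  /* addiu $t1,$zero,7 */
            0x01095021,  /* addu  $t2,$t0,$t1 */
        };
        run(prog, 3);
        printf("$t2 = %u\n", reg[10]);   /* prints 12 */
        return 0;
    }
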
CUDA GPU assignment
Graphics processing units (GPUs) can contain up to hundreds of processing engines (PEs) and achieve performance levels of hundreds of GFLOPS (10^9 floating-point operations per second). In the past, GPUs were very dedicated devices: they were not generally programmable and could only be used to speed up graphics processing. Today they are becoming more and more general purpose. The latest GPUs of ATI and NVIDIA can be programmed in C and OpenCL. For this lab we will use NVIDIA GPUs together with the CUDA programming environment (based on C).

(visit webpage)
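
Below is a minimal CUDA sketch (hypothetical code, not the actual assignment) of the pattern such a lab revolves around: copy data to the GPU, launch a kernel with one thread per element, and copy the result back.

    /* Illustrative CUDA sketch, not the assignment code:
       element-wise vector addition on the GPU. */
    #include <stdio.h>
    #include <cuda_runtime.h>

    __global__ void vadd(const float *a, const float *b, float *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  /* one thread per element */
        if (i < n) c[i] = a[i] + b[i];
    }

    int main(void)
    {
        const int n = 1024;
        size_t bytes = n * sizeof(float);
        float ha[1024], hb[1024], hc[1024];
        for (int i = 0; i < n; i++) { ha[i] = i; hb[i] = 2.0f * i; }

        float *da, *db, *dc;
        cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
        cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

        vadd<<<(n + 255) / 256, 256>>>(da, db, dc, n);  /* 4 blocks x 256 threads */

        cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);
        printf("hc[10] = %f\n", hc[10]);   /* prints 30.000000 */
        cudaFree(da); cudaFree(db); cudaFree(dc);
        return 0;
    }
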
Multiprocessor assignment
The purpose of this assignment is to get familiar with multiprocessor architectures and their programming models. State-of-the-art multicore processors, such as IBM's POWER7, may contain dozens of cores on a single die. The trend of going multicore poses new challenges to both computer architects and programmers: putting hundreds of cores on a die is not the hard part, but designing a memory hierarchy that keeps them all busy is. On the other hand, programming dozens of cores requires programmers to think 'parallel'. In this assignment we will try to tackle these challenges from the viewpoints of both computer architects and programmers.

(visit webpage)
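
As an illustration of the 'think parallel' point (hypothetical code, not the actual assignment), the sketch below splits an array sum over four POSIX threads, each working on a disjoint slice so the threads never write shared data.

    /* Illustrative sketch, not the assignment code: parallel sum
       with POSIX threads. Compile with -pthread. */
    #include <stdio.h>
    #include <pthread.h>

    #define N 1000000
    #define T 4

    static double data[N];
    static double partial[T];

    static void *worker(void *arg)
    {
        long t = (long)arg;
        double s = 0.0;
        for (long i = t * (N / T); i < (t + 1) * (N / T); i++)
            s += data[i];
        partial[t] = s;   /* each thread writes only its own slot */
        return NULL;
    }

    int main(void)
    {
        pthread_t th[T];
        for (long i = 0; i < N; i++) data[i] = 1.0;
        for (long t = 0; t < T; t++)
            pthread_create(&th[t], NULL, worker, (void *)t);
        double sum = 0.0;
        for (long t = 0; t < T; t++) {
            pthread_join(th[t], NULL);
            sum += partial[t];
        }
        printf("sum = %.0f\n", sum);   /* prints 1000000 */
        return 0;
    }
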
Image Processing with OpenCL (Bachelor students)
Students map their own image processing algorithm onto the GPU using OpenCL. Similar to a tutorial, they perform the following steps: (1) copy data to and from the GPU's memory, (2) create a basic kernel implementation in OpenCL that produces correct results, and (3) improve the performance of the kernel. Finally, the students are free to test their OpenCL implementation on different GPUs from AMD and NVIDIA. They are then given a reference CUDA implementation, so they can compare the performance of both programming languages on different architectures.
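
To show what such a kernel looks like, here is a hypothetical sketch (not the assignment's code, and written in CUDA C rather than OpenCL; the kernel structure is the same in both): an RGBA-to-grayscale conversion with one thread per pixel, wrapped in the copy-in/launch/copy-out pattern of step (1).

    /* Illustrative sketch, not the assignment code:
       grayscale conversion, one thread per pixel. */
    #include <stdio.h>
    #include <cuda_runtime.h>

    __global__ void to_gray(const unsigned char *rgba, unsigned char *gray,
                            int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height) {
            int i = y * width + x;
            const unsigned char *p = &rgba[4 * i];
            /* integer approximation of the Rec.601 luma weights */
            gray[i] = (unsigned char)((299 * p[0] + 587 * p[1] + 114 * p[2]) / 1000);
        }
    }

    int main(void)
    {
        const int w = 2, h = 2;
        unsigned char hrgba[16] = { 255,0,0,255,   0,255,0,255,
                                    0,0,255,255,   255,255,255,255 };
        unsigned char hgray[4];
        unsigned char *drgba, *dgray;
        cudaMalloc(&drgba, 16); cudaMalloc(&dgray, 4);
        cudaMemcpy(drgba, hrgba, 16, cudaMemcpyHostToDevice);

        dim3 block(16, 16), grid((w + 15) / 16, (h + 15) / 16);
        to_gray<<<grid, block>>>(drgba, dgray, w, h);

        cudaMemcpy(hgray, dgray, 4, cudaMemcpyDeviceToHost);
        for (int i = 0; i < w * h; i++)
            printf("%d ", hgray[i]);   /* prints 76 149 29 255 */
        cudaFree(drgba); cudaFree(dgray);
        return 0;
    }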