Xilinx Introducing the Vivado Design Suite
Posted on 25-4-2012 by Zhenyu Ye
Xilinx has introduced the Vivado Design Suite in a press release. According to the press release and the product page, Vivado will support the 7-series and Zynq devices but not older devices. ISE will continue to support both the current and the older devices, but it seems it will be phased out in the future.
Posted on 10-4-2012 by Cedric Nugteren
Tags: GPU, Tegra, supercomputing
A recently published article on the web gives some more details on Barcelona’s GPU/ARM-based supercomputer, the Mont-Blanc. According to the article, they’ll start assembling their first prototype soon. It will be based on Tegra 3 SoCs (4 ARM cores), but the main processing will be done by low-power GPUs on a separate chip.
More information is also available on their website.
GeForce GTX 680 launched by NVIDIA
Posted on 23-3-2012 by Gert-Jan van den Braak
Yesterday NVIDIA launched its latest GPU, the GeForce GTX 680. It has a new architecture, called Kepler. A detailed description of the GTX 680 and the new Kepler architecture can be found in the NVIDIA whitepaper.
According to AnandTech it performs very well in gaming benchmarks, but not in compute benchmarks. I guess we have to wait for someone to optimize a CUDA program for Kepler before we can draw any real conclusions here.
FPGAs in 2032: Challenges and Opportunities in the next 20 years
Posted on 17-3-2012 by Zhenyu Ye
This is a pre-conference workshop of FPGA 2012. Many big names from industry and academia predict the future of FPGAs for the coming 20 years. The slides and videos are available from the workshop website.
Selection of GPGPU/ASPLOS papers
Posted on 4-3-2012 by Cedric Nugteren
Tags: CPU, GPU, FPGA, architecture, High Level Synthesis, conference
Below is a selected list of papers presented at the ASPLOS ’12 conference and its workshops (in particular the GPGPU workshop and the CCPC workshop).
The GPGPU workshop started with an interesting keynote given by Norm Rubin from AMD on the Graphics Core Next GPU architecture. A summary of changes compared to the old VLIW4/5 architecture:
- One SIMD lane of VLIW4 processors will make way for 4 SIMD lanes in one ‘compute unit’. The motivation is that the compiler has more and more trouble packing VLIW instructions, even for graphics workloads (e.g. DirectX10).
- One scalar processor is added per compute unit (see diagram) to perform control flow more efficiently. The scalar core has 4-way hyper-threading.
- A message unit is added in hardware to communicate with the host (designed for debugging purposes).
- L1 and L2 caches have been added at locations similar to those in the Fermi architecture.
- Next to the LDS (‘local data share’ – the ‘local’ or ‘shared’ memory) there is now a GDS (‘global data share’) at the location of the L2 cache. It is programmable through a new OpenCL extension. Synchronization is not possible at this level though.
A selection of interesting papers at the GPGPU workshop:
- Introducing ‘Bones’: A Parallelizing Source-to-Source Compiler Based on Algorithmic Skeletons: A C-to-CUDA/OpenCL compiler based on algorithmic skeletons. The work includes a comparison of different existing tools (Par4All, PGI Accelerator, hiCUDA, SkePU).
- FLAT: A GPU Programming Framework to Provide Embedded MPI: An interesting approach to embed MPI in CUDA kernels. It uses a source-to-source compiler to split kernels with embedded MPI code into two parts, so that the MPI code executes on the host processor.
- High‐Performance Sparse Matrix‐Vector Multiplication on GPUs for Structured Grid Computation: Sparse matrix-vector computations revisited. The work evaluates different packing techniques.
- Paragon: Collaborative Speculative Loop Execution on GPU and CPU: Executes loops speculatively on the GPU and performs correctness checks. Meanwhile, the CPU executes the code sequentially in case the result was incorrect.
- Enabling Task‐level Scheduling on Heterogeneous Platforms: Introduces ‘work pools’ for execution of tasks on heterogeneous systems. It uses the SURF algorithm as an example.
- Auto‐tuning Interactive Ray Tracing using an Analytical GPU Architecture Model: An interesting approach that aims to maintain a constant throughput for a given example application (ray tracing in this case). It uses a model of the GPU which is iteratively updated after each frame using a feedback controller. It uses Hong & Kim’s existing GPU model.
- Full System Simulation of Many‐Core Heterogeneous SoCs using GPU and QEMU Semihosting: Simulates a system-on-chip with a host (on a CPU) and a many-core processor (on a GPU). The GPU runs one thread per simulated core, each of which is an in-order ARM processor.
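The speculate-check-fallback idea summarized for Paragon above can be illustrated with a minimal, sequential Python sketch. It is not the paper's implementation: the function name `speculative_loop`, the `load`/`store` callbacks, and the conflict rule are assumptions made for illustration, and the sequential fallback here runs after a failed check rather than concurrently on the CPU.

```python
def speculative_loop(n, body, data):
    """Run body(i, load, store) for i in range(n) as if iterations were independent.

    Returns True if the speculative result was committed, False if a
    cross-iteration conflict forced a sequential re-execution.
    """
    snapshot = list(data)          # state every speculative iteration starts from
    reads, writes = [], []         # per-iteration read set and write buffer

    for i in range(n):             # stands in for one GPU thread per iteration
        r, w = set(), {}

        def load(k, r=r, w=w):
            r.add(k)
            return w.get(k, snapshot[k])   # an iteration sees its own writes

        def store(k, v, w=w):
            w[k] = v                       # buffered, not yet visible to others

        body(i, load, store)
        reads.append(r)
        writes.append(w)

    # Correctness check: no location written by one iteration may be
    # read or written by a different iteration.
    owner, ok = {}, True
    for i, w in enumerate(writes):
        for k in w:
            if k in owner:
                ok = False
            owner[k] = i
    for i, r in enumerate(reads):
        for k in r:
            if k in owner and owner[k] != i:
                ok = False

    if ok:
        for w in writes:                   # commit the speculative result
            for k, v in w.items():
                data[k] = v
        return True

    for i in range(n):                     # fallback: plain sequential execution
        body(i, data.__getitem__, data.__setitem__)
    return False


def double(i, load, store):
    store(i, load(i) * 2)                  # independent iterations


def running_sum(i, load, store):
    if i > 0:
        store(i, load(i - 1) + load(i))    # depends on the previous iteration
```

With `double` the speculation succeeds and is committed. With `running_sum` the check detects the loop-carried dependence and the sequential result is used instead.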
A selection of presentations at the CCPC (Compiling Complete Programs into Circuits) workshop (no publications available, slides will become available later):
- Mapping algorithms to hardware using ClaSH: Transforms a functional programming language into HDL. Work from the University of Twente.
- Compiling OpenCL kernels into FPGA-based processor networks: Actually maps OpenCL threads on hardware similar to soft-cores. Afterwards it optimizes the cores for bit-width and such.
- Hardware Design in Lime: Adds a new target to the Lime project/language. Now, at run-time a selection between execution on a CPU, GPU or FPGA is possible.
A selection of interesting papers at the ASPLOS main conference:
- Clearing the Clouds: A Study of Emerging Workloads on Modern Hardware: A study on the future processor architecture for data centers, with reduced OoO complexity, reduced cache sizes, no on-chip connectivity, and less off-chip memory bandwidth.
- HICAMP: Architectural Support for Efficient Concurrency-safe Shared Structured Data Access: Rethinking memory systems: storage of lines only once. The technique requires no memory allocations, no de-duplication, no synchronization, and has no concurrency bugs. Goodbye von Neumann? Probably not.
- Architecture Support for Disciplined Approximate Programming: Introduces an approximate/precise mixed-precision programming language, hardware and ISA. Both operations and data can be imprecise. A dual-voltage architecture allows for fine-grained interleaving between precise and approximate.
- Automatic Generation of Hardware/Software Interfaces: Uses the BCL (Bluespec Codesign Language) to target both software (C++ backend) and hardware (BSV backend). The connection logic is generated automatically (FIFOs). The guard rules are implemented in software to guarantee atomicity.
- Green-Marl: A DSL for Easy and Efficient Graph Analysis: A new DSL for graph traversal algorithms using ‘templates’ to generate efficient code for multiple targets. Uses breadth-first search as an example template. Currently only one backend (C++) is available, but GPUs and clusters will be added in the future.
- SIMD Defragmenter: Efficient ILP Realization on Data-parallel Architectures: Merges a group of instructions into a single SIMD instruction. SIMD lane assignment is investigated to minimize SIMD shuffling requirements in between instructions. It automatically detects possible candidates.
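The phrase "storage of lines only once" in the HICAMP summary above can be pictured with a small content-addressed store. This is only a software analogy under stated assumptions: the class `DedupLineStore` and its methods are hypothetical names, and the sketch says nothing about how the actual hardware memory system is organized.

```python
class DedupLineStore:
    """Toy content-addressed store: each distinct line is kept exactly once."""

    def __init__(self):
        self._id_of = {}   # line content -> line id
        self._lines = []   # line id -> line content

    def intern(self, line: bytes) -> int:
        """Return the id of `line`, storing it only if it is not present yet."""
        if line not in self._id_of:
            self._id_of[line] = len(self._lines)
            self._lines.append(line)
        return self._id_of[line]

    def read(self, line_id: int) -> bytes:
        """Look up the (immutable) content behind a line id."""
        return self._lines[line_id]

    def __len__(self) -> int:
        """Number of distinct lines physically stored."""
        return len(self._lines)
```

Because identical content always maps to the same id, comparing two lines reduces to comparing their ids, and stored lines never need to be modified in place.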
Selection of PPoPP papers
Posted on 1-3-2012 by Cedric Nugteren
Tags: CPU, GPU, programming, conference
Below is a selected list of papers presented at the PPoPP 2012 conference and the PMAM workshop.
PMAM workshop papers:
A selection of GPU related papers presented at PPoPP 2012:
- Scalable Framework for Mapping Streaming Applications onto Multi-GPU Systems: Uses compute threads and memory threads to map StreamIt applications onto multiple GPUs.
- A Performance Analysis Framework for Identifying Potential Benefits in GPGPU Applications: An extension of an existing GPU model to perform automated kernel analysis. The authors introduce 4 performance metrics.
- Efficient Performance Evaluation of Memory Hierarchy for Highly Multithreaded Graphics Processors: Uses a source-to-source transformation to instrument CUDA kernels to understand memory performance. The work uses memory traces and stochastic models.
- A GPU implementation of Inclusion-Based Points-to Analysis: An irregular graph algorithm with a dynamic set of nodes and edges has been mapped successfully to a GPU. It uses a 128-byte wide data representation with 4 bytes per thread to get coalesced memory accesses. It overlaps CPU-GPU memory transfers with computation.
- Scalable GPU Graph Traversal: Uses prefix-sum and local scan to map a graph algorithm on a GPU.
- GKLEE: Concolic Verification and Test Generation for GPUs: Implements a virtual GPU which is used to test correctness and profile CUDA applications.
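The prefix-sum idea mentioned for Scalable GPU Graph Traversal above can be illustrated with a minimal, sequential Python sketch. It is not the paper's implementation: the names `exclusive_scan` and `bfs_levels` and the adjacency-dict graph representation are assumptions. An exclusive scan over the degrees of the frontier vertices gives every vertex a private range in the output array, so on a GPU each thread could write its neighbours without contending for an output position.

```python
def exclusive_scan(values):
    """Return (offsets, total), where offsets[i] = sum(values[:i])."""
    offsets, total = [], 0
    for v in values:
        offsets.append(total)
        total += v
    return offsets, total


def bfs_levels(adj, source):
    """Level-synchronous BFS; adj maps each vertex to a list of neighbours."""
    depth = {source: 0}
    frontier = [source]
    level = 0
    while frontier:
        # Each frontier vertex needs len(adj[v]) output slots.
        degrees = [len(adj[v]) for v in frontier]
        offsets, total = exclusive_scan(degrees)

        # Gather step: slot ranges are disjoint, so this loop could run
        # with one thread per frontier vertex.
        gathered = [None] * total
        for slot, v in enumerate(frontier):
            for j, nbr in enumerate(adj[v]):
                gathered[offsets[slot] + j] = nbr

        # Filter step: keep only vertices that are visited for the first time.
        level += 1
        next_frontier = []
        for nbr in gathered:
            if nbr not in depth:
                depth[nbr] = level
                next_frontier.append(nbr)
        frontier = next_frontier
    return depth
```

For the small graph `{0: [1, 2], 1: [3], 2: [3], 3: []}` and source 0, vertices 1 and 2 end up at depth 1 and vertex 3 at depth 2.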
Other interesting papers presented at PPoPP 2012:
OpenGPU slides available
Posted on 29-2-2012 by Cedric Nugteren
Tags: GPU, programming, conference, SIMD, VLIW
The slides of the OpenGPU tutorial at the HiPEAC conference are now online at the OpenGPU.net website.
The presentation given by AMD is particularly interesting. It describes the new ‘Graphics Core Next’ architecture as used in the HD7000-series, including a motivation for the end of the VLIW design.
Other talks include PGI’s OpenACC, a directive-based approach, and an overview of Par4All.
ISPASS 2012 Preliminary Program and Pre-Prints
Posted on 29-2-2012 by Zhenyu Ye
Tags: GPU, FPGA, architecture, multicore, modeling
The program of the International Symposium on Performance Analysis of Systems and Software (ISPASS) 2012 has been released. Some interesting preprints are already available:
- Automated Regression-based GPU Design Space Exploration (PDF)
- A Mechanistic Performance Model for Superscalar In-Order Processors (PDF)
- Comparing the Power and Performance of Intel SCC to State-of-the-Art CPUs and GPUs (PDF)
- Bandwidth Bandit: Understanding Memory Contention. It looks similar to a previous tech report from the same authors: Design and Evaluation of the Bandwidth Bandit
- Speedup Stacks: Identifying Scaling Bottlenecks in Multi-Threaded Applications (PDF)
- Performance Analysis of Thread Mappings with a Holistic View of the Hardware Resources (PDF)