Biomedical Imaging Application
A branch of biomedical image
processing involves analyzing images containing elongated structures.
The enhancement of these structures in noisy image data is often
required to enable automatic image analysis. A framework for such noise
reduction based on Coherence Enhancing Diffusion (CED) using
Orientation Scores (OS) has been developed. However, owing to the high computational
complexity and high memory consumption of this approach, the current
implementation is not able to process sizeable images in reasonable
time. This paper presents a GPU/CPU-based optimization of the OSCED framework.
The primary goal of this work was to reduce the execution time of the framework by harnessing the
processing capabilities offered by an existing GPU cluster. First, the
bottlenecks were identified. These were subsequently addressed by
applying a number of CPU- and GPU-based optimizations. Using a set of
reference images, we show that the performance of the framework
improved by at least an order of magnitude following
our optimizations. In addition, we present a ‘split-and-merge’ approach
and illustrate its potential for further performance improvement using
the existing GPU cluster as a reference. We conclude that significant
performance gains can be obtained by applying our approach to a
suitable cluster configuration. Furthermore, there is
still room for optimizing the parts that are currently executed on the CPU.
This webpage summarizes our results.
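The 'split-and-merge' idea mentioned above can be illustrated with a minimal C++ sketch. This is not the framework's actual implementation: the `Image` type, the function names, the choice of horizontal strips, and the stand-in filter `verticalMean3` are all assumptions made for illustration. The sketch only shows the general pattern of cutting an image into strips padded with halo rows, processing each strip independently (in a cluster setting, each strip could go to a different GPU), and merging the strip interiors back into one result.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major grayscale image (hypothetical type for this sketch).
struct Image {
    std::size_t rows = 0, cols = 0;
    std::vector<float> data;
    Image(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0f) {}
    float& at(std::size_t r, std::size_t c) { return data[r * cols + c]; }
    float at(std::size_t r, std::size_t c) const { return data[r * cols + c]; }
};

// Stand-in for the real per-strip work: a vertical 3-tap mean with clamped
// borders. It reads one neighbouring row on each side, so it needs a halo of 1.
Image verticalMean3(const Image& in) {
    Image out(in.rows, in.cols);
    for (std::size_t r = 0; r < in.rows; ++r) {
        const std::size_t up = (r == 0) ? 0 : r - 1;
        const std::size_t down = std::min(r + 1, in.rows - 1);
        for (std::size_t c = 0; c < in.cols; ++c)
            out.at(r, c) = (in.at(up, c) + in.at(r, c) + in.at(down, c)) / 3.0f;
    }
    return out;
}

// Split the image into `parts` horizontal strips, each padded with up to
// `halo` extra rows, process every strip independently, then merge only the
// interior rows of each processed strip into the output.
Image splitAndMerge(const Image& in, std::size_t parts, std::size_t halo) {
    assert(parts > 0 && parts <= in.rows);
    Image out(in.rows, in.cols);
    for (std::size_t p = 0; p < parts; ++p) {
        const std::size_t begin = p * in.rows / parts;        // first interior row
        const std::size_t end = (p + 1) * in.rows / parts;    // one past last interior row
        const std::size_t lo = (begin >= halo) ? begin - halo : 0;
        const std::size_t hi = std::min(end + halo, in.rows);

        // Split: copy the strip, including its halo rows.
        Image strip(hi - lo, in.cols);
        std::copy(in.data.begin() + static_cast<std::ptrdiff_t>(lo * in.cols),
                  in.data.begin() + static_cast<std::ptrdiff_t>(hi * in.cols),
                  strip.data.begin());

        // Process: in a cluster setting this call could run on a separate device.
        const Image processed = verticalMean3(strip);

        // Merge: keep interior rows only; halo rows are discarded.
        for (std::size_t r = begin; r < end; ++r)
            for (std::size_t c = 0; c < in.cols; ++c)
                out.at(r, c) = processed.at(r - lo, c);
    }
    return out;
}
```

For a filter whose footprint fits inside the halo, `splitAndMerge(img, parts, 1)` yields the same pixels as `verticalMean3(img)` applied to the whole image. The halo width needed by the real diffusion steps is not stated here and would have to be derived from their kernel sizes.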
The application was
profiled to identify the bottlenecks. For the purpose of this study,
two reference images were used as benchmarks for quantitative analysis
of performance improvements. These reference images, boat
and noisy_fibers, took approximately 147 seconds and one hour, respectively, to execute with the original code on one of the gpunode machines.
The profiling was done using the CPU timing utility already present
in the mathvisioncpp library (at an earlier stage, Intel VTune
Amplifier XE 2011 was very useful, especially for identifying control [...]).
The profiling results are shown below:
|#||Function||Time for boat image||Time for noisy_fibers image
| || || 4.85s|| 154.37s
| || || 2.30s|| 71.69s
| || || 2.04s|| 63.44s
|5||Rest|| 29.11s|| 886.29s
As the profiling results show, the
calculateLevel_() function dominates the execution time for both images.
This function is at the heart of the orientation score
approach. It is used to perform convolution and correlation on lines
from the different image orientations.
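To make the two operations concrete, the following is a minimal C++ sketch of 1D correlation and convolution along a single line of samples. It is not the body of calculateLevel_(); the function names `correlateLine` and `convolveLine`, the zero-padding at the line ends, and the odd-length kernel are assumptions chosen to keep the example short.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Correlation of a 1D line with a kernel (the kernel is NOT flipped).
// Samples outside the line are treated as zero; the output has the same
// length as the input line.
std::vector<float> correlateLine(const std::vector<float>& line,
                                 const std::vector<float>& kernel) {
    assert(kernel.size() % 2 == 1 && "sketch assumes an odd-length kernel");
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(line.size());
    const std::ptrdiff_t k = static_cast<std::ptrdiff_t>(kernel.size());
    const std::ptrdiff_t radius = k / 2;
    std::vector<float> out(line.size(), 0.0f);
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        float acc = 0.0f;
        for (std::ptrdiff_t j = 0; j < k; ++j) {
            const std::ptrdiff_t src = i + j - radius;  // input sample under tap j
            if (src >= 0 && src < n)
                acc += line[static_cast<std::size_t>(src)] *
                       kernel[static_cast<std::size_t>(j)];
        }
        out[static_cast<std::size_t>(i)] = acc;
    }
    return out;
}

// Convolution is correlation with the kernel reversed.
std::vector<float> convolveLine(const std::vector<float>& line,
                                std::vector<float> kernel) {
    std::reverse(kernel.begin(), kernel.end());
    return correlateLine(line, kernel);
}
```

With the line `{1, 2, 3, 4}` and the kernel `{1, 0, -1}`, `correlateLine` returns `{-2, -2, -2, 3}` and `convolveLine` returns `{2, 2, 2, -3}`. Every output sample is independent of the others, which is the property that makes this kind of loop a natural candidate for GPU parallelization.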
We compare different versions of our
implementation with the original version. Each version incorporates the
optimizations in the previous one.
The results shown in the table below are based on the average of 5
executions per input image, measured using CPU timers and/or CUDA
| ||Time for boat image||Speedup (w.r.t. previous)||Time for noisy_fibers image||Speedup (w.r.t. previous)
| || 40.51s||3.48|| 1121.67s
|Single machine, single GPU
|Single machine, four GPUs
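The measurement procedure described above (averaging 5 executions and reporting speedup relative to the previous version) can be sketched in C++ as follows. This is an illustrative harness based on `std::chrono`, not the timing utility from the mathvisioncpp library; the names `averageSeconds` and `speedup` are hypothetical.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Average wall-clock time in seconds over `runs` executions of `work`.
// If `work` launches GPU kernels, it must wait for them to finish before
// returning, because kernel launches are asynchronous with respect to the CPU.
double averageSeconds(const std::function<void()>& work, int runs = 5) {
    assert(runs > 0);
    using clock = std::chrono::steady_clock;
    double total = 0.0;
    for (int i = 0; i < runs; ++i) {
        const auto start = clock::now();
        work();
        total += std::chrono::duration<double>(clock::now() - start).count();
    }
    return total / runs;
}

// Speedup of the current version relative to the previous one.
double speedup(double previousSeconds, double currentSeconds) {
    assert(currentSeconds > 0.0);
    return previousSeconds / currentSeconds;
}
```

A call such as `averageSeconds([&] { processImage(img); })` would time a hypothetical `processImage` function five times and return the mean. Dividing the mean of the previous version by the mean of the current version gives the value reported in a "Speedup (w.r.t. previous)" column.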