Optimizing a Biomedical Imaging Orientation Score Framework

Biomedical Imaging Application

One branch of biomedical image processing involves analyzing images that contain elongated structures. Enhancing these structures in noisy image data is often required to enable automatic image analysis. A framework for such noise reduction, based on Coherence Enhancing Diffusion (CED) using Orientation Scores (OS), has been developed. However, owing to the high computational complexity and memory consumption of this approach, the current implementation cannot process sizeable images in reasonable time. This paper presents a GPU/CPU-based optimization of the OSCED framework.

The primary goal of this work was to reduce the execution time of the framework by harnessing the processing capabilities of an existing GPU cluster. First, the bottlenecks were identified. These were subsequently addressed by applying a number of CPU- and GPU-based optimizations. Using a set of reference images, we show that the performance of the framework improved by at least an order of magnitude following our optimizations. In addition, we present a ‘split-and-merge’ approach and illustrate its potential for further performance improvement using the existing GPU cluster as a reference. We conclude that significant performance gains can be obtained by applying our approach to a suitable cluster configuration. Furthermore, there is still room for optimizing the parts that are currently executed on the CPU.

This webpage summarizes our results.


Profiled Application

The application was profiled to identify the bottlenecks. For the purpose of this study, two reference images were used as benchmarks for quantitative analysis of performance improvements. These reference images, boat and noisy_fibers, took approximately 147 seconds and one hour, respectively, to execute with the original code on one of the gpunode machines.

Profiling was done using the CPU timing utility already present in the mathvisioncpp library. At an earlier stage, Intel VTune Amplifier XE 2011 was also very useful, especially for identifying control flow.
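
The interface of the mathvisioncpp timing utility is not shown here, so the following is only a minimal sketch of how per-function CPU timing of this kind could be collected in standard C++. The names Profiler and ScopedTimer are hypothetical and are not part of mathvisioncpp.

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <string>
#include <utility>

// Hypothetical profiler (not the mathvisioncpp utility):
// accumulates wall-clock seconds per label.
class Profiler {
public:
    void add(const std::string& label, double seconds) {
        assert(seconds >= 0.0);
        totals_[label] += seconds;
    }

    // Accumulated time for one label, or 0 if the label was never timed.
    double total(const std::string& label) const {
        auto it = totals_.find(label);
        return it == totals_.end() ? 0.0 : it->second;
    }

    // Sum over all labels, comparable to a "Total" row in a profile table.
    double grandTotal() const {
        double sum = 0.0;
        for (const auto& entry : totals_) sum += entry.second;
        return sum;
    }

private:
    std::map<std::string, double> totals_;
};

// RAII timer: measures its own lifetime and reports it on destruction.
class ScopedTimer {
public:
    ScopedTimer(Profiler& profiler, std::string label)
        : profiler_(profiler),
          label_(std::move(label)),
          start_(std::chrono::steady_clock::now()) {}

    ~ScopedTimer() {
        std::chrono::duration<double> elapsed =
            std::chrono::steady_clock::now() - start_;
        profiler_.add(label_, elapsed.count());
    }

    ScopedTimer(const ScopedTimer&) = delete;
    ScopedTimer& operator=(const ScopedTimer&) = delete;

private:
    Profiler& profiler_;
    std::string label_;
    std::chrono::steady_clock::time_point start_;
};
```

Placing a ScopedTimer at the top of a function body charges the whole call to that function's label. Note that such timers measure wall-clock time on the CPU side only.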

The profiling results are shown below:

#  Function                                               Time (boat)  Time (noisy_fibers)
1  Convolve::calculateLevel_                                  102.48s             2499.39s
2  ScaleSpace::ExplicitDiffuser::Calculate                      4.85s              154.37s
3  OrientationScores::OSConvertDiffusionTermsToCartesian        2.30s               71.69s
4  OrientationScores::SteerOrientationScoreDerivatives          2.04s               63.44s
5  Rest                                                        29.11s              886.29s

   Total                                                      140.78s             3675.18s

As the profiling results show, the calculateLevel_() function dominates the execution time for both images. This function is at the heart of the orientation score approach: it performs convolution and correlation on lines from the different image orientations.
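
To make the two operations concrete, here is a minimal sketch of 1D correlation and convolution on a single line of samples. It is not the calculateLevel_() implementation; the function names, the zero padding at the borders, and the odd-length kernel are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Correlation of one line with a kernel (kernel is not flipped).
// Assumptions: odd kernel length, samples outside the line count as zero,
// output has the same length as the input line.
std::vector<float> correlateLine(const std::vector<float>& line,
                                 const std::vector<float>& kernel) {
    assert(kernel.size() % 2 == 1);
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(line.size());
    const std::ptrdiff_t m = static_cast<std::ptrdiff_t>(kernel.size());
    const std::ptrdiff_t center = m / 2;

    std::vector<float> out(line.size(), 0.0f);
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        float acc = 0.0f;
        for (std::ptrdiff_t k = 0; k < m; ++k) {
            const std::ptrdiff_t j = i + k - center;  // input sample index
            if (j >= 0 && j < n) acc += line[j] * kernel[k];
        }
        out[i] = acc;
    }
    return out;
}

// Convolution is correlation with the kernel reversed.
std::vector<float> convolveLine(const std::vector<float>& line,
                                const std::vector<float>& kernel) {
    const std::vector<float> flipped(kernel.rbegin(), kernel.rend());
    return correlateLine(line, flipped);
}
```

Each output sample depends only on the input line and the kernel, so different lines and different output samples can be computed independently.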

Results

We compare different versions of our implementation with the original version. Each version incorporates the optimizations in the previous one.

The results shown in the table below are based on the average of 5 executions per input image, measured using CPU timers and/or CUDA events.

Version                     Time (boat)  Speedup  Time (noisy_fibers)  Speedup
Reference code                  140.78s     1.00             3675.18s     1.00
CPU-improved                     40.51s     3.48             1121.67s     3.28
Single machine, single GPU       17.90s     2.26              523.28s     2.14
Single machine, four GPUs        17.76s     1.01              473.17s     1.11
GPU cluster                       8.33s     2.13              174.23s     2.72
Overall speedup                             16.9                          21.1

Speedup is given with respect to the previous version; the overall speedup is with respect to the reference code.
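
The speedup figures follow directly from the measured times: each step divides the previous version's time by the current one, and the overall speedup divides the reference time by the final time. A minimal sketch of that arithmetic, with hypothetical helper names:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Speedup of each version relative to the previous one.
// The first entry is 1.0 because the reference has no predecessor.
std::vector<double> stepSpeedups(const std::vector<double>& times) {
    assert(!times.empty());
    std::vector<double> speedups{1.0};
    for (std::size_t i = 1; i < times.size(); ++i) {
        assert(times[i] > 0.0);
        speedups.push_back(times[i - 1] / times[i]);
    }
    return speedups;
}

// Speedup of the last version relative to the first (reference) version.
double overallSpeedup(const std::vector<double>& times) {
    assert(!times.empty() && times.back() > 0.0);
    return times.front() / times.back();
}
```

For the boat image, the times 140.78, 40.51, 17.90, 17.76 and 8.33 seconds give an overall speedup of 140.78 / 8.33, about 16.9, which also equals the product of the stepwise speedups.
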
Authors
  • Chidiebere Okwudire
  • Martin Palatnik
  • Xu Zhang
  • Tanya Kudchadker