SIFT Object Recognition - Team 1

GPU cluster project (Team 1)

This page contains the results achieved by team 1 during the two-week Software Technology project "Exploiting massive parallelism of a GPU-based compute cluster" (02.10.2012 - 12.10.2012), organized by the Parallel Architecture Research Group at the Faculty of Electrical Engineering.

The goal of the project is to identify a coffee mug in a video of a cluttered office desk. An image of the coffee mug (the target) and the video of the office serve as inputs to the problem. A video which highlights the mug on the desk is expected as output.

The algorithm used to recognize objects in images in this project is the Scale-Invariant Feature Transform (SIFT). It extracts so-called keypoints from the object that are robust to changes in scale, orientation, shear, position, and illumination. These keypoints are then used to identify the object in the image. Because SIFT offers substantial potential for parallelization across multiple CPUs and GPUs, it is an interesting subject for optimization on the compute cluster.
Approach

As a first step, we profiled the initial, unmodified application to determine which functions take up most of the time. Using OpenMP timers and the perf tool we measured the following results:

[Table: per-function execution times]

Total execution time (for one frame): 4683 ms

Based on these results, we split into two separate groups to work on the most time-consuming functions. The following sections describe the GPU and CPU techniques we used to optimize the application.
GPU optimization with CUDA

The first optimization technique we used was CUDA, a parallel computing platform and programming model created by NVIDIA. We used CUDA to port functions to the GPU in order to exploit the huge number of threads it can run simultaneously.

We considered the smooth and prepareGrad functions for optimization with CUDA.
Inspection of the smooth function showed that the sub-function econvolve, which is called 144 times (twice per call of smooth), accounts for almost all of its execution time.
The optimization process is detailed below.

  • econvolve: The function convolves the source image with a 1D Gaussian kernel. It convolves along columns and saves the transposed result, so it was originally called twice, once with the original image and once with the transposed intermediate. Execution time was reduced when the function was ported to the GPU's global memory, and reduced further by using shared memory. The shared-memory version consists of two kernels, one filtering over columns and one filtering over rows, while keeping the intermediate result image in the GPU's global memory; additionally, the filter coefficients are placed in GPU registers. As a result, the initial 144 function calls were reduced to 72.
  • prepareGrad: The function computes the modulus and the angle of the gradient of the specified octave. Its execution time was reduced by porting it to the GPU's global memory, with every thread working on a separate pixel. Using shared memory introduced too many special cases for computing the gradients of border pixels, so no further reduction was achieved. If the original prepareGrad implementation is called multiple times on the same data, the function exits immediately; by moving the if-statement that decides whether the function needs to run outside of the function body, the initial 5169 function calls were reduced to 14.
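To illustrate the structure that was ported, the following is a minimal CPU sketch of the econvolve idea: convolve each column with a 1D kernel and store the result transposed, so that calling the same routine twice filters both dimensions of the image. The function name, signature, and border handling here are illustrative assumptions, not the project's actual code:

```cpp
#include <cassert>
#include <vector>

// CPU sketch of the econvolve idea (hypothetical signature): convolve each
// column with a 1D kernel and write the result transposed. Calling this
// twice therefore filters columns, then rows, of the original image.
static std::vector<float> econvolve_cpu(const std::vector<float>& src,
                                        int width, int height,
                                        const std::vector<float>& kernel) {
    const int r = static_cast<int>(kernel.size()) / 2;  // kernel radius
    std::vector<float> dst(width * height);             // transposed layout
    for (int x = 0; x < width; ++x) {
        for (int y = 0; y < height; ++y) {
            float acc = 0.0f;
            for (int k = -r; k <= r; ++k) {
                int yy = y + k;
                if (yy < 0) yy = 0;                     // clamp at borders
                if (yy >= height) yy = height - 1;
                acc += src[yy * width + x] * kernel[k + r];
            }
            dst[x * height + y] = acc;                  // transposed store
        }
    }
    return dst;
}
```

On the GPU, each (x, y) pair of the two outer loops maps naturally to one thread, which is what makes this function such a good candidate for CUDA.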

    The time measured for all econvolve calls in the application is as follows:
    • Initial execution time: 3095 ms
    • Global device memory: 557 ms
    • Using CPU for smaller image sizes and GPU for larger ones: 354 ms
    • Shared memory: 130 ms

    The time measured for all prepareGrad calls in the application is as follows:
    • Initial execution time: 529 ms
    • Global device memory: 173 ms
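As a rough sketch of what prepareGrad computes per pixel (the names and the central-difference scheme below are our assumptions, not the project's exact code): each pixel gets a gradient modulus and angle, and the border pixels skipped here are exactly the special cases that made a shared-memory port unattractive.

```cpp
#include <cmath>
#include <vector>

// Sketch of the per-pixel work in prepareGrad (illustrative names):
// central-difference gradient, then modulus and angle. On the GPU each
// thread handles one pixel; here we simply loop over the interior.
struct Grad { float mod, angle; };

static std::vector<Grad> prepare_grad_cpu(const std::vector<float>& img,
                                          int width, int height) {
    std::vector<Grad> out(width * height, Grad{0.0f, 0.0f});
    for (int y = 1; y < height - 1; ++y) {
        for (int x = 1; x < width - 1; ++x) {
            float gx = 0.5f * (img[y * width + x + 1] - img[y * width + x - 1]);
            float gy = 0.5f * (img[(y + 1) * width + x] - img[(y - 1) * width + x]);
            out[y * width + x].mod   = std::sqrt(gx * gx + gy * gy);
            out[y * width + x].angle = std::atan2(gy, gx);
        }
    }
    return out;
}
```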
An overview of the improvements gained with CUDA can be seen in the graph below.

[Graph: execution-time improvements with CUDA]
CPU optimization with OpenMP and SSE

In order to optimize some functions on the CPU, we used OpenMP and SSE.

  • OpenMP is an API for shared-memory parallel programming, which allows us to use the multi-core processors present on the cluster by applying multi-threading.

  • SSE, Intel's Streaming SIMD Extensions, uses SIMD instructions to greatly increase performance when exactly the same operations are performed on multiple data elements.
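A minimal sketch of how the two techniques combine, assuming a simple element-wise operation (the function and its name are hypothetical, not one of the project's functions): OpenMP distributes loop chunks across cores, while SSE intrinsics process four floats per instruction within each chunk. The pragma requires compiling with -fopenmp (it is harmlessly ignored otherwise), and the intrinsics require an x86 target:

```cpp
#include <immintrin.h>   // SSE intrinsics (x86 only)
#include <vector>

// Hypothetical example: scale an array of floats by a constant factor.
// OpenMP splits the vectorized loop across threads; SSE handles 4 floats
// per instruction; a scalar tail covers lengths not divisible by 4.
static void scale_sse_omp(std::vector<float>& data, float factor) {
    const __m128 f = _mm_set1_ps(factor);
    const int n = static_cast<int>(data.size());
    const int vec_end = n - (n % 4);
    #pragma omp parallel for
    for (int i = 0; i < vec_end; i += 4) {
        __m128 v = _mm_loadu_ps(&data[i]);            // load 4 floats
        _mm_storeu_ps(&data[i], _mm_mul_ps(v, f));    // multiply and store
    }
    for (int i = vec_end; i < n; ++i)                 // scalar tail
        data[i] *= factor;
}
```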

The functions that we improved using one or a combination of these methods were match, rgb2y, computeKeypointDescriptor, and highlight.
The measured performance gains can be seen in the graphs below.

[Graphs: computeKeypointDescriptor, highlight, match, rgb2y]
Cluster optimization with MPI
MPI is used to take full advantage of the 4 machines in the cluster, each of which has 4 CPUs and 4 GPUs.
The approach we used to fully utilize the cluster consisted of:
  • The master node sends a message to the slave nodes.
  • Each slave node processes one frame and returns the location of the coffee mug.
  • The master node highlights the frame.
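The dispatch logic behind these steps can be sketched without actual MPI calls as follows (all names here are illustrative; in the real application the hand-out and return steps become MPI send/receive pairs between the master rank and the slave ranks):

```cpp
#include <string>
#include <vector>

// Simulation of the master/slave protocol. With MPI, the master would
// MPI_Send a frame index to a slave, the slave would MPI_Send back the
// detected location, and the master would then highlight the frame.
struct Location { int x, y; };

static Location process_frame(int frame) {      // slave-side work (stub)
    return Location{frame * 10, frame * 10};    // stand-in detection result
}

static std::vector<std::string> run_master(int num_frames, int num_slaves) {
    std::vector<std::string> highlighted;
    for (int frame = 0; frame < num_frames; ++frame) {
        int slave = frame % num_slaves;         // round-robin dispatch
        Location loc = process_frame(frame);    // would run on `slave`
        highlighted.push_back("frame " + std::to_string(frame) +
                              " (slave " + std::to_string(slave) +
                              "): mug at (" + std::to_string(loc.x) + "," +
                              std::to_string(loc.y) + ")");
    }
    return highlighted;
}
```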
[Diagram: MPI master/slave structure]


The measured performance on both a single frame and a video can be seen in the graphs below.
The Single measurement corresponds to the performance when using one machine and one GPU.
The GPUs measurement corresponds to the performance when using one machine with 4 GPUs.
The 2-, 3- and 4-node measurements correspond to the performance when using 2, 3 or 4 machines, each with all four GPUs.

[Graphs: single-frame and video performance measurements]
Conclusion
  • Porting to the GPU's shared memory is very difficult when there are many exceptional situations (such as data dependencies), and optimal use of the shared-memory variables must always be considered.
  • Using the GPU's global memory alone already provides very good performance gains.
  • Using OpenMP is easy, and its performance increases with the number of cores.
  • Using SSE instructions is complex, but can yield a speed-up of up to 4x.
  • Developing a messaging protocol for the cluster is challenging, but the provided examples helped.
  • Gaining a 16x speed-up (4 nodes with 4 GPUs each) is not straightforward, because of messaging overhead and because the master node is itself busy with computations.