SIFT Object Recognition - Team 2
The Software Technology project "Exploiting massive parallelism of a GPU-based compute cluster" is organized by the "Parallel Architecture Research Group" at the faculty of Electrical Engineering.
The team members are:
- Vincent van der Weele
- Matija Lukic
- Athanasios Papakostopoulos
- Selamawit Demewoz
- Spyridon Strouzas
Pure CPU implementation
Using the perf tool, we took first measurements of the naive, pure-CPU implementation:
Total execution time: 4692.26 ms
After analyzing the perf report we identified the main bottlenecks of the application. Most of the functions in the graph above internally use the functions below:
Further investigation showed that this function relies on econvolve(), which is executed twice for every smooth and each time runs through the whole image and transposes the data. smooth is called multiple times in Sift(). Optimizing econvolve() was therefore considered the first priority, so it was ported to the GPU.
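The pattern described above can be sketched on the CPU as follows. This is a minimal sketch under our reading of the description, not the actual econvolve() implementation: each row is convolved with a 1-D kernel and the result is written transposed, so calling it twice yields a full 2-D separable smooth with the original orientation restored.

```cpp
#include <vector>
#include <cstddef>

// Sketch of the econvolve() pattern (signature assumed): convolve every row
// of `src` (w x h) with a 1-D kernel and write the result transposed (h x w).
// Reads past the image border are clamped for simplicity.
std::vector<float> econvolve_sketch(const std::vector<float>& src,
                                    int w, int h,
                                    const std::vector<float>& kernel) {
    int r = (int)kernel.size() / 2;            // kernel radius
    std::vector<float> dst((size_t)w * h);     // transposed output, h x w
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            float acc = 0.f;
            for (int k = -r; k <= r; ++k) {
                int xx = x + k;
                if (xx < 0) xx = 0;            // clamp at the left border
                if (xx >= w) xx = w - 1;       // clamp at the right border
                acc += src[(size_t)y * w + xx] * kernel[k + r];
            }
            dst[(size_t)x * h + y] = acc;      // transposed write: dst is h x w
        }
    }
    return dst;
}
```

Because the output is transposed, a second call convolves what were originally the columns, which is why the function runs twice per smooth.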
This function runs for every image octave and through each octave's scales, computing the gradient magnitude and orientation for each pixel.
Econvolve() port to GPU
Econvolve() is called many times during the execution of the program. Therefore, instead of timing each individual econvolve() call, the timing was placed around Sift(), which calls econvolve() multiple times. By measuring the differences in Sift()'s execution time we can see how much better econvolve() performs.
Sift() is called for two images: the small target image and the large image in which we try to find the target.
The timings below show how the execution improved by porting econvolve() to the GPU and using global and shared GPU memory.
Global Memory GPU port
The first port to the GPU was made by reading the source image from the GPU global memory.
Shared memory GPU port
Then we ported econvolve() to use the GPU's shared memory: the data is first loaded into shared memory and the calculations then read from there.
Timing results for econvolve()
Complete results from improvement steps on econvolve()
PrepareGrad() port to GPU
Global Memory GPU port
The first port to the GPU was made by reading the source image from the GPU global memory. The execution was faster than the CPU approach, especially as the image size grew.
The logic is the same as in the CPU version: the neighboring elements are loaded by accessing the left, right, above, and below pixels, and the magnitude and orientation are then calculated.
We offset the index by one on the x-axis and one on the y-axis to account for the one-pixel padding of the original picture, and threads that would go beyond the image size are stopped.
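The per-pixel computation described above can be sketched as below. The helper name is assumed; the caller is responsible for the one-pixel padding, so only interior pixels are passed in.

```cpp
#include <cmath>

// Sketch of the per-pixel gradient computation (helper name assumed).
// dx/dy come from the left/right and above/below neighbours; magnitude
// and orientation follow directly.
struct Grad { float mag; float theta; };

Grad pixel_grad(const float* img, int w, int x, int y) {
    // caller guarantees 1 <= x < w-1 and 1 <= y < h-1 (one-pixel padding)
    float dx = 0.5f * (img[y * w + (x + 1)] - img[y * w + (x - 1)]);
    float dy = 0.5f * (img[(y + 1) * w + x] - img[(y - 1) * w + x]);
    return { std::sqrt(dx * dx + dy * dy),   // gradient magnitude
             std::atan2(dy, dx) };           // gradient orientation (radians)
}
```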
Shared memory GPU port
Then we ported the prepareGrad() using the shared memory of the GPU. We loaded the data on the shared memory and then we did the calculations by reading from there.
The number of blocks is calculated automatically by dividing the picture size by the number of threads per block, which is defined at the top. The default size is 16x16 threads per block.
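The division above has to round up so that edge pixels which do not fill a whole 16x16 tile still get a (partially active) block; a minimal sketch:

```cpp
// Sketch: blocks per axis via rounding-up division, so leftover edge pixels
// still fall into a final, partially filled block.
const int THREADS_PER_BLOCK = 16;             // per axis, as in the text

int blocks_for(int pixels, int threads = THREADS_PER_BLOCK) {
    return (pixels + threads - 1) / threads;  // ceil(pixels / threads)
}
```

The out-of-range threads in that last block are the ones stopped by the bounds check mentioned earlier.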
Each block's shared memory has an extra pixel on each side, used to load the extra pixels needed for the neighbor calculations. A thread on the border of the block must load not only its own pixel but also the neighbor that fills up the apron.
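The apron idea can be illustrated with a CPU sketch (function name assumed): a (bs+2) x (bs+2) tile, i.e. the block plus one extra pixel on each side, is copied into a local buffer, clamping reads at the image border. On the GPU, threads on the block border load the extra neighbor in the same way.

```cpp
#include <vector>
#include <algorithm>

// Sketch of loading a block's tile plus its one-pixel apron into a local
// buffer. (bx, by) is the block index, bs the block size per axis.
std::vector<float> load_tile_with_apron(const std::vector<float>& img,
                                        int w, int h,
                                        int bx, int by, int bs) {
    int tw = bs + 2;                               // tile width incl. apron
    std::vector<float> tile((size_t)tw * tw);
    for (int ty = 0; ty < tw; ++ty) {
        for (int tx = 0; tx < tw; ++tx) {
            // global coordinates, shifted by -1 for the one-pixel apron,
            // clamped at the image border
            int gx = std::clamp(bx * bs + tx - 1, 0, w - 1);
            int gy = std::clamp(by * bs + ty - 1, 0, h - 1);
            tile[(size_t)ty * tw + tx] = img[(size_t)gy * w + gx];
        }
    }
    return tile;
}
```

The gradient calculation then reads only from the tile, never from global memory.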
Timing results for prepareGrad()
Optimizing for different sizes
From the per-picture timings of CPU versus GPU, it is noticeable that below a certain picture size it is more efficient to run on the CPU. This is because the overhead of sending the image to the GPU and the communication outweighs the benefit of the faster calculations. Because of this, the
final version of prepareGrad() is a hybrid that runs on the GPU for images wider than 125 pixels and on the CPU for smaller images.
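The hybrid dispatch amounts to a single threshold check; a sketch (names assumed, the 125 px break-even point is from our timings):

```cpp
// Sketch of the hybrid dispatch: images wider than the measured break-even
// point go to the GPU path, smaller ones stay on the CPU.
const int GPU_WIDTH_THRESHOLD = 125;   // break-even width from our timings

bool use_gpu_path(int width) {
    return width > GPU_WIDTH_THRESHOLD;
}
```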
Complete results from improvement steps on prepareGrad()
For CPU optimization, OpenMP was used to parallelize some for loops. For functions that do not consume much time and operate on a limited amount of
data, the overhead of moving the data to the GPU would exceed the gains in execution time, so the decision was to make use of the available CPU threads instead.
The omp parallel for
directive was used to speed up for loops in several functions.
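A minimal example of the pattern (the loop body is a placeholder, not one of the project's actual loops): when the iterations are independent, `#pragma omp parallel for` splits the range across the CPU threads; without `-fopenmp` the pragma is ignored and the loop runs serially with the same result.

```cpp
#include <vector>

// Sketch: an independent-iteration loop parallelized with OpenMP.
// Each element is touched exactly once, so the threads never conflict.
void scale_in_place(std::vector<float>& v, float s) {
    #pragma omp parallel for
    for (long i = 0; i < (long)v.size(); ++i)
        v[i] *= s;
}
```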
The performance gains can be seen in the following tables:
MPI is used to split up execution to cluster-nodes. The procedure is:
- Distribute the video frames across the nodes
- Each node finds the keypoints in its frames
- Each node sends its keypoints back to the host
- The host uses the keypoints to highlight the matches
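The frame split in the first step can be sketched as a contiguous partition (function name assumed; this is the splitting logic only, without the actual MPI calls): each of the `nodes` ranks gets a near-equal range of frames, with the first `total % nodes` ranks taking one extra frame.

```cpp
#include <utility>

// Sketch of the frame split: rank -> half-open frame range [begin, end).
// The first (total % nodes) ranks each take one extra frame so the split
// covers every frame exactly once.
std::pair<int, int> frames_for_rank(int total, int nodes, int rank) {
    int base = total / nodes, extra = total % nodes;
    int begin = rank * base + (rank < extra ? rank : extra);
    int end   = begin + base + (rank < extra ? 1 : 0);
    return { begin, end };
}
```

Each rank then runs keypoint detection only on its own range and sends the results back to the host.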
Easy to port
- Big data sets with low data dependencies
- Algorithms based on simple arithmetic operations
- Predictable memory access
Hard to port
- Code optimized for CPU
- Algorithms with write hotspots