Optimization of a 3D Reconstruction Application

The Internet provides access to a vast, ever-growing collection of photographs of cities and landmarks from all over the world. A simple search for a city name on Flickr returns millions of photographs capturing different parts of the city from various viewpoints. Creating accurate 3D models of cities is a problem of great interest with broad applications. The 3D Scene Reconstruction application [1] that we use is software aimed at reconstructing a 3D model from an unstructured collection of photographs, which it achieves by applying modern computer vision techniques.
The application has a pipeline architecture, shown in Figure 1, which consists of four parts: SiftGPU [2], KeyMatchFull [3], Bundler [4] and PMVS [5]. Each of these parts is an independent application, and they communicate along the pipeline through input and output files. The final output is a file in the PLY format that can be visualized as a 3D model.
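The four stages can be sketched as a simple driver script; the binary names, flags and intermediate file names below are illustrative placeholders, not the actual command-line interfaces of the tools:

```sh
# Pseudocode driver: each stage reads the files the previous stage wrote.
sift_gpu     images/*.jpg -o keys/            # one key file per image
keymatchfull keys/list.txt matches.txt        # pairwise key matches
bundler      images.txt matches.txt bundle/   # sparse SfM reconstruction
pmvs         bundle/ model.ply                # dense model in PLY format
```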
The work presented in this project is a solution for optimizing an existing 3D Scene Reconstruction application [1]. The application aims to match and reconstruct scenes from extremely large unstructured collections of photographs. The reconstruction process, which runs complex algorithms ranging from image matching to large-scale optimization, can take days. In this paper we propose a way to speed up the application by porting it to the GPU. Our experimental results demonstrate that it is now possible to reconstruct 3D objects 4 to 5.6 times faster.
Context
The solution we propose in this project involves optimizing each of the three main steps of the 3D Scene Reconstruction pipeline: SiftGPU, KeyMatchFull and Bundler:
- SiftGPU is the first step of the application pipeline. SIFT, which stands for Scale-Invariant Feature Transform, is a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene. SiftGPU is an implementation of SIFT for the GPU.
- KeyMatchFull is an intermediate step between the key extraction in SiftGPU and the Bundler. It associates images that share common features: it compares two images by computing the nearest neighbors of all their keys and produces a list of matches.
- The Bundler application is an implementation of a Structure-from-Motion (SfM) algorithm that takes the set of images, a list of image features and a list of image matches as input, and produces a 3D reconstruction of the camera poses and a (sparse) scene as output.
These three applications are independent and communicate through input and output files. Therefore, in each of them we apply a different approach to reduce the execution time, as described in more detail in Section 3.
For measuring the performance of the application we used three
different data sets:
- Xu31: 31 images of 3456x5184 pixels.
- Car31: 31 images of 2048x1536 pixels.
- Car61: 61 images of 2048x1536 pixels.
Analysis and Results
Figure 2 shows the improvement achieved in the total execution time of the pipeline.

SiftGPU
SiftGPU processes pixels in parallel to build Gaussian pyramids and detect Difference-of-Gaussians key points. The result of the process contains the orientations and descriptors of the key points. The original implementation of SiftGPU extracts the key points of every image on a single GPU. We modified the SiftGPU implementation to process four images in parallel, using the maximum number (four) of GPUs available on a node, and used OpenMP for thread management.

The results in Figure 3 show significant improvements between the two implementations. The key point extraction process was sped up by a factor of 4.53 to 5.1, as shown in Figure 4. This also demonstrates the benefit of using OpenMP to split the SiftGPU work across multiple processes. Moreover, Figure 3 shows different execution times of SiftGPU for the same number of images (31) in the Xu31 and Car31 data sets, because the image dimensions and key point counts of those data sets differ. The achieved speedup was even higher than expected (more than four times) because running multiple processes also amortizes the time spent on SiftGPU initialization and memory release.
KeyMatchFull
The initial implementation uses the ANN library [2] to match the key points between two images. Although ANN is highly optimized to minimize the number of calculations needed to find the nearest neighbors, this task can be ported to the GPU. We came across Fast k-Nearest Neighbor (KNN) Search using GPU [6], a CUDA library that performs the same task. Although its search algorithm is much simpler, resulting in many more calculations, it benefits greatly from parallelization. KNN was very promising, since its documentation reports performance up to 64 times faster than ANN. We replaced the ANN library with the KNN library in the KeyMatchFull implementation; the KNN version that we used is heavily customized to meet our needs. The output we obtain with KNN is slightly worse than the original, but this does not significantly affect the overall process and is a fair trade-off for the speedup that we obtain.

Figure 4 shows the timing results obtained by running the three versions of KeyMatchFull (ANN-CPU, KNN-GPU, KNN-4GPU) on our test data sets. The measurements do not include the loading phase but do include the saving time. The GPU version is about 2.5 times faster than the initial one. The large gap between the speedup that KNN promised and our final result is explained by the fact that the test data used in KNN's research paper [6] differ significantly from ours.
Bundler
From the profiler results we noticed that the step that takes the most time during execution on the CPU is the Levenberg-Marquardt (LM) algorithm. It solves the system of equations for bundle adjustment built from the parameters given as input to the Structure-from-Motion (SfM) process. The function that solves this system is in the sba-1.5 library [8], and it accounts for 22.1% of the execution time.
Based on this result we decided to parallelize the LM procedure. To do so, we used an existing implementation of the Bundler [9] that provides multi-core CPU and single-GPU implementations of the SfM step. That implementation focuses on reducing the memory use and execution time of computing the Hessian, the Schur complement and the Jacobian, which LM typically uses to reduce the size of the linear system [10]. The code is contained in the libpba library, which we integrated into the application in place of sba-1.5 [7].
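For reference, the size reduction mentioned above comes from the standard block structure of the bundle-adjustment normal equations; the notation below is ours, not quoted from [10]. With camera block U, point block V and coupling block W, one LM step solves

```latex
% Normal equations of one LM step
% (\delta_c: camera update, \delta_p: point update)
\begin{bmatrix} U & W \\ W^{\top} & V \end{bmatrix}
\begin{bmatrix} \delta_c \\ \delta_p \end{bmatrix}
=
\begin{bmatrix} \varepsilon_c \\ \varepsilon_p \end{bmatrix}

% Eliminating the point block via the Schur complement gives the
% reduced camera system, after which the points are back-substituted:
\left( U - W V^{-1} W^{\top} \right) \delta_c
    = \varepsilon_c - W V^{-1} \varepsilon_p,
\qquad
\delta_p = V^{-1} \left( \varepsilon_p - W^{\top} \delta_c \right)
```

Because V is block-diagonal (one small block per 3D point), inverting it is cheap, and the reduced system involves only the camera parameters, which are far fewer than the points.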
After the integration, we ran experiments to compare the performance of the Single-CPU, Multi-CPU and Single-GPU approaches, using the three data sets described in the introduction of this paper. In every experiment two measurements were taken: the execution time of the SfM routine (which was parallelized) and the total execution time of the Bundler. The results are presented in Figure 5.

As can be observed in Figure 5, after parallelizing the Structure-from-Motion routine, the reduction in its execution time carried over to the total execution time of the Bundler. The total time decreased in line with the speedup factors of the Structure-from-Motion routine, which are shown in Figure 6.
