Optimizing the Proba-V mapping algorithm implementation for a heterogeneous (CPU + GPU) environment
Satellite Image Projection
Proba-V is a small earth observation satellite for global vegetation monitoring.
The satellite is a project of the European Space Agency
(ESA mission link)
and is expected to be fully operational by the beginning of 2013.
The User Segment of the Proba-V project was developed by VITO, Belgium
(VITO project link).
Its tasks include data processing: the vegetation data captured by the satellite is mapped to a given user projection.
The goal of the project was to
optimize the performance of the mapping algorithm using CPU and GPU parallelization.
To briefly illustrate the algorithm, the mapping is essentially composed of two steps:
- Locate the coordinates in the unprojected image that correspond to a (longitude, latitude) coordinate pair in the user projection.
(Basically, an inverse location process).
This is achieved using predictor functions.
- Determine the value of the pixel located in the previous step. This is done by applying a filter (such as bilinear or bicubic) to the surrounding pixels, as sketched after this list.
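As an illustration, a minimal C++ sketch of these two steps for a single output pixel could look like the following fragment; the type and function names, the placeholder predictor, and the bilinear weights are illustrative assumptions, not the actual Proba-V User Segment code:

    #include <vector>

    // Illustrative sketch of the two mapping steps for a single output pixel.
    struct InputImage {
        int width = 0, height = 0;
        std::vector<float> data;                      // row-major pixel values
        float at(int p, int l) const { return data[l * width + p]; }
    };

    struct PixelCoord { double p, l; };               // position in the unprojected image

    // Step 1: inverse location -- stand-in for the predictor-function based locate.
    PixelCoord locate(double lon, double lat) {
        return { lon * 10.0, lat * 10.0 };            // placeholder prediction
    }

    // Step 2: filtering -- here a simple bilinear filter over the four neighbours.
    float applyFilter(const InputImage& img, PixelCoord c) {
        int p0 = static_cast<int>(c.p), l0 = static_cast<int>(c.l);
        double fp = c.p - p0, fl = c.l - l0;
        return static_cast<float>(
            (1 - fp) * (1 - fl) * img.at(p0,     l0)     +
            fp       * (1 - fl) * img.at(p0 + 1, l0)     +
            (1 - fp) * fl       * img.at(p0,     l0 + 1) +
            fp       * fl       * img.at(p0 + 1, l0 + 1));
    }

    float mapOutputPixel(double lon, double lat, const InputImage& input) {
        PixelCoord pl = locate(lon, lat);             // inverse location
        return applyFilter(input, pl);                // resample the surrounding pixels
    }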
Analysis and refactoring of initial application
The execution of the mapping was measured using code instrumentation.
The execution time is shown below:
The input/output (reading and writing of files) takes about one-third of the total execution
time. This is measured on a small set of input data (small segment). On a big input segment, the I/O takes close to 40%.
The input files can be so large that not all of the input data can be kept in memory at once. Therefore the given implementation reads the files in parts: at any single point in time, only the portion of the input data needed to compute one line of the output is kept in memory. The input files are compressed HDF5 files; when the mapping reads them, the data must first be decompressed, which leads to a large file-reading time.
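Reading only the lines needed for the current output row can be expressed with an HDF5 hyperslab selection; the sketch below is a minimal example of that pattern (the file path, dataset name, and element type are assumptions, not the actual Proba-V product layout):

    #include <hdf5.h>
    #include <vector>

    // Read only the rows [firstLine, firstLine + nLines) of a 2-D dataset, so that
    // just the input needed for the current output line is held in memory.
    std::vector<float> readLines(const char* path, const char* dataset,
                                 hsize_t firstLine, hsize_t nLines, hsize_t lineWidth)
    {
        std::vector<float> buffer(nLines * lineWidth);

        hid_t file   = H5Fopen(path, H5F_ACC_RDONLY, H5P_DEFAULT);
        hid_t dset   = H5Dopen2(file, dataset, H5P_DEFAULT);
        hid_t fspace = H5Dget_space(dset);

        // Select the block of rows we need (decompression happens inside H5Dread).
        hsize_t start[2] = { firstLine, 0 };
        hsize_t count[2] = { nLines, lineWidth };
        H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, nullptr, count, nullptr);

        hid_t mspace = H5Screate_simple(2, count, nullptr);
        H5Dread(dset, H5T_NATIVE_FLOAT, mspace, fspace, H5P_DEFAULT, buffer.data());

        H5Sclose(mspace);
        H5Sclose(fspace);
        H5Dclose(dset);
        H5Fclose(file);
        return buffer;
    }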
Writing takes a very small portion (about 1%) of the total execution time. This is because a complete line is written at a time, giving a very regular file-access pattern and therefore an efficient write operation. In addition, unlike the input files, the output files are not compressed.
The computations are made up of the locate and filtering operations. All the locate and filtering operations can be done
independently for each pixel, so these operations are theoretically completely parallelizable over the output data. In practice, however,
we can only keep enough data in memory for a single line of output.
Therefore only pixels in a single output line can be produced
in parallel and hence the parallelism is restricted by the amount of
memory needed to keep input data in cache.
Originally, the complete functionality of the image projection was
done sequentially for every pixel. We refactored the code into an outer row loop that calls functions operating on the current row of pixels, as shown below:
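A minimal sketch of that restructured outer loop is given below; the per-row stage names are illustrative stand-ins for the real functions:

    #include <cstdio>

    // Illustrative stubs standing in for the real per-row stages.
    static void cacheInputForRow(int row)        { (void)row; /* read the input lines this row needs */ }
    static void locateRow(int row)               { (void)row; /* inverse location for each pixel     */ }
    static void filterRadiometricRow(int row)    { (void)row; /* radiometric resampling filter       */ }
    static void filterNonRadiometricRow(int row) { (void)row; /* non-radiometric resampling filter   */ }
    static void writeRow(int row)                { std::printf("row %d done\n", row); }

    // Sketch of the refactored structure: an outer loop over output rows, with
    // each function operating on the whole current row.
    void mapSegment(int outputRows) {
        for (int row = 0; row < outputRows; ++row) {
            cacheInputForRow(row);
            locateRow(row);
            filterRadiometricRow(row);
            filterNonRadiometricRow(row);
            writeRow(row);
        }
    }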

The code was restructured this way so that:
- Each function can be executed in parallel with the others. This
allows the possibility of forming an execution pipeline.
- Parallelizing the computations on a row level can be tackled in
a different manner in each function.
For every output line that represents a row in the output data, the input that has to be read is a curved line crossing the input data. This is illustrated in the image below: the array represents the input data, and the red lines are the data that has to be read for the corresponding output lines.

From this read-access pattern we can see that, if data is cached separately for each output line, some input lines have to be read multiple times. Since the time needed for reading is very high (due to HDF5 decompression), an initial caching mechanism was implemented to limit the number of input lines read multiple times.
To port the computation to the GPU, we need the input data on the GPU as well. Therefore we mirrored the cache mechanism in GPU memory.
In addition, to be able to run calculations in parallel, data must
be collected in advance. The input data needed for an output line is
completely cached before the computation part is started.
To achieve that we modified the caching algorithm: instead of caching, for each output pixel, the input lines it requires, we calculate the whole input data range needed (minimum and maximum line numbers) and
cache it all at once. By doing this the time spent caching dropped
by almost a factor of two.
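A sketch of that range computation, assuming the located input line coordinates for the output row are already available, might look like this (names are illustrative):

    #include <algorithm>
    #include <climits>
    #include <vector>

    // Sketch of the modified caching step: instead of caching the required input
    // lines pixel by pixel, compute the full [first, last] line range needed by
    // the output row and fetch it with a single read, e.g. via the readLines()
    // sketch shown earlier.
    struct LineRange { int first, last; };

    LineRange inputRangeForRow(const std::vector<double>& locatedLines, int filterRadius)
    {
        int minLine = INT_MAX, maxLine = INT_MIN;
        for (double l : locatedLines) {
            minLine = std::min(minLine, static_cast<int>(l) - filterRadius);  // filter also needs
            maxLine = std::max(maxLine, static_cast<int>(l) + filterRadius);  // neighbouring lines
        }
        return { minLine, maxLine };   // cache this whole range at once instead of per pixel
    }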
After those modifications, with data cached for a whole line at a time, the I/O time dropped to 20% of the execution time. In the end, copying the data to GPU memory added extra delays, so I/O became almost 25% of the execution time.
The reading of files is done through the HDF5 library. To make this library compatible with our multithreading approach we had to rebuild it with the configure options --enable-threadsafe and --with-pthread.
We also tried to run multiple caching operations (for different files) in parallel using multiple threads, but the execution time stayed the same. It seems that the HDF5 library serializes accesses, even to different files.
We can conclude that file reading (25% of the initial execution
time) cannot be parallelized and is the minimum execution time
needed for the mapping application.
The
flowchart of the locate function is displayed below:

Initially, the (lon, lat) coordinates are projected onto the intermediate (u, v) space (an operation referred to as reproject).
After that, an initial (p0, l0) value is predicted using (u, v), and
then an iterative prediction takes place that converges on a (p, l)
value. In practice, with the currently used accuracy
threshold the value will converge after one or two iterations.
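A minimal sketch of this locate flow is shown below; reproject() and the two predictor functions are placeholders, since the real implementation goes through PROJ.4 and calibrated predictor functions:

    #include <cmath>

    struct UV { double u, v; };
    struct PL { double p, l; };

    static UV reproject(double lon, double lat) { return { lon, lat }; }             // placeholder
    static PL initialPredict(UV uv)             { return { uv.u * 10, uv.v * 10 }; } // placeholder
    static PL refinePredict(UV uv, PL pl)       { return { pl.p * 0.5 + uv.u * 5,    // placeholder
                                                           pl.l * 0.5 + uv.v * 5 }; }

    PL locatePixel(double lon, double lat, double threshold = 1e-3, int maxIter = 10)
    {
        UV uv = reproject(lon, lat);          // (lon, lat) -> intermediate (u, v)
        PL pl = initialPredict(uv);           // initial (p0, l0)
        for (int i = 0; i < maxIter; ++i) {   // iterative prediction
            PL next = refinePredict(uv, pl);
            double diff = std::hypot(next.p - pl.p, next.l - pl.l);
            pl = next;
            if (diff < threshold)             // with the current threshold this usually
                break;                        // converges after one or two iterations
        }
        return pl;
    }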
The execution time measurements yield the following results:

The locate function operates on each pixel independently from other pixels' computations.
Therefore it is possible to parallelize the function on a per
pixel basis.
The bulk of the computation (70%) is taken by the reproject, so the biggest performance gain would come from optimizing that operation. However, the reproject calls the external PROJ.4 library, and the call graph branching within the library is too complex to make porting only the parts that are actually used feasible. Moreover, the library is not provided as a GPU port by its creators, and porting it ourselves was beyond the scope of this project.
If GPU usage is ruled out for reproject, its performance can still
be increased by using CPU parallelization. Its computation can be
split among all available CPU threads in a straightforward manner,
as there are no race conditions or data dependencies.
To be able to parallelize the reproject and predict operations separately, we restructured the code so that each part of the function operates on
a whole row of pixels.
The reproject loop was ported to CPU threads using OpenMP. The only performance bottleneck here is the number of threads available. We use as many threads as possible (out of the total of 8); if we combine this solution with the final parallelized pipeline (described below), we are left with 6 threads.
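A sketch of the OpenMP-parallelized reproject loop over one output row could look as follows; projectToUV() is an illustrative stand-in for the PROJ.4 call:

    #include <omp.h>
    #include <vector>

    // Sketch of the OpenMP-parallelized reproject loop over one output row.
    // Pixels are independent, so a plain parallel for with a static schedule suffices.
    struct UV { double u, v; };
    static UV projectToUV(double lon, double lat) { return { lon, lat }; }   // placeholder

    void reprojectRow(const std::vector<double>& lon, const std::vector<double>& lat,
                      std::vector<UV>& uv, int numThreads)
    {
        #pragma omp parallel for num_threads(numThreads) schedule(static)
        for (long i = 0; i < static_cast<long>(uv.size()); ++i)
            uv[i] = projectToUV(lon[i], lat[i]);     // no race conditions or dependencies
    }

With 8 hardware threads in total and two of them reserved for the filter pipeline, numThreads ends up being 6 in the final configuration.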
The iterative prediction is implemented in the following way for performance reasons: after the iteration for a pixel finishes and a (p, l) value is found, the difference between (p0, l0) and (p, l) is stored. This difference is added to the initial prediction of the next pixel to reduce the number of iterations needed. If
every pixel calculation is done in parallel, then this mechanism can
no longer be used since every pixel value is calculated separately.
This also means that the results of the two methods will differ. Comparing how the coordinates of the same pixel are calculated in each case explains why: the iteration (polynomial interpolation) starts from different initial values and thus reaches the difference threshold with slightly different results. Of course, this difference will be smaller than the threshold.
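To make the difference concrete, the sketch below contrasts the two start strategies, reusing the illustrative types and placeholder predictors from the earlier locate sketch:

    #include <vector>

    struct UV { double u, v; };
    struct PL { double p, l; };
    static PL initialPredict(UV uv)    { return { uv.u * 10, uv.v * 10 }; }   // placeholder
    static PL iterate(UV uv, PL start) { (void)uv; return start; }            // placeholder for the
                                                                              // converged (p, l) value
    // Sequential variant: the correction found for one pixel warm-starts the next,
    // which usually saves an iteration but couples neighbouring pixels.
    void locateRowSequential(const std::vector<UV>& uv, std::vector<PL>& out) {
        PL delta = { 0, 0 };
        for (size_t x = 0; x < uv.size(); ++x) {
            PL guess = initialPredict(uv[x]);
            guess.p += delta.p; guess.l += delta.l;                // warm start
            out[x] = iterate(uv[x], guess);
            delta = { out[x].p - initialPredict(uv[x]).p,          // store the difference
                      out[x].l - initialPredict(uv[x]).l };        // for the next pixel
        }
    }

    // Parallel variant: every pixel starts from its own prediction, so the results
    // can differ from the sequential ones, but only within the convergence threshold.
    void locateRowParallel(const std::vector<UV>& uv, std::vector<PL>& out) {
        #pragma omp parallel for
        for (long x = 0; x < static_cast<long>(uv.size()); ++x)
            out[x] = iterate(uv[x], initialPredict(uv[x]));
    }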
To be able to verify the correctness of the above GPU solution, a tool that can read the output images is needed, so that they can be compared with the original ones to verify that pixels differ only within the allowed threshold. It was decided not to spend project time on this, since improving the other parts of the mapping would yield a bigger performance gain. Instead, the predict operation was ported to CPU threads, similarly to the reproject. The final solution for improving the locate performance, as well as the achieved execution times, is displayed below.
Two filters are used per output pixel, one for the
radiometric bands and the other for the non-radiometric bands.
Each of these filters is performed for a row of
output on the GPU before output data is copied back from the GPU
and written to output files. The memory on the GPU devices is only
enough to keep data for the production of a single row of output.
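A host-side sketch of that per-row flow is shown below; the kernel here is a nearest-neighbour placeholder for the actual band filters, and the cached input lines are assumed to already reside on the GPU:

    #include <cuda_runtime.h>
    #include <vector>

    // Placeholder kernel standing in for the radiometric / non-radiometric filters.
    __global__ void filterKernel(const float* input, int inWidth,
                                 const int* p, const int* l,
                                 float* outRow, int rowWidth)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        if (x < rowWidth)
            outRow[x] = input[l[x] * inWidth + p[x]];          // placeholder resampling
    }

    // Filter one output row: upload the located coordinates, launch the kernel,
    // and copy the finished row back so it can be written to the output file.
    void filterRowOnGpu(const float* d_inputCache, int inWidth,
                        const std::vector<int>& p, const std::vector<int>& l,
                        std::vector<float>& outRow)
    {
        int rowWidth = static_cast<int>(outRow.size());
        int *d_p, *d_l; float* d_out;
        cudaMalloc(&d_p, rowWidth * sizeof(int));
        cudaMalloc(&d_l, rowWidth * sizeof(int));
        cudaMalloc(&d_out, rowWidth * sizeof(float));
        cudaMemcpy(d_p, p.data(), rowWidth * sizeof(int), cudaMemcpyHostToDevice);
        cudaMemcpy(d_l, l.data(), rowWidth * sizeof(int), cudaMemcpyHostToDevice);

        int block = 256, grid = (rowWidth + block - 1) / block;
        filterKernel<<<grid, block>>>(d_inputCache, inWidth, d_p, d_l, d_out, rowWidth);

        cudaMemcpy(outRow.data(), d_out, rowWidth * sizeof(float),
                   cudaMemcpyDeviceToHost);                    // row back on the host
        cudaFree(d_p); cudaFree(d_l); cudaFree(d_out);         // ready to be written out
    }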
The table below shows initial execution time compared with the optimized one.
The radiometric and non-radiometric filters are data-bound rather than compute-bound. For each output pixel in a row, nested for loops calculate running sums. In these loops, most of the data that is read is used only once in a computation, which limits the prospect of using shared memory to avoid reading input data multiple times from global memory and thereby gaining performance.
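For illustration, a generic weighted-window filter kernel with this structure might look as follows; the window size, weights, and band handling are assumptions rather than the production filters. Each thread accumulates its own running sum, and most global-memory reads serve only that one thread, which is why shared-memory reuse buys little here:

    #include <cuda_runtime.h>

    // One thread per output pixel; nested loops accumulate a running weighted sum
    // over the surrounding input pixels located at (p0[x], l0[x]).
    __global__ void windowFilterKernel(const float* __restrict__ input, int inWidth,
                                       const int* __restrict__ p0, const int* __restrict__ l0,
                                       const float* __restrict__ weights, int radius,
                                       float* out, int rowWidth)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        if (x >= rowWidth) return;

        float sum = 0.0f, wsum = 0.0f;
        for (int dl = -radius; dl <= radius; ++dl) {             // lines of the window
            for (int dp = -radius; dp <= radius; ++dp) {         // pixels in the line
                float w = weights[(dl + radius) * (2 * radius + 1) + (dp + radius)];
                sum  += w * input[(l0[x] + dl) * inWidth + (p0[x] + dp)];  // global read,
                wsum += w;                                                 // used only here
            }
        }
        out[x] = sum / wsum;
    }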
As the final step of optimization, the locate operation and the filters can be executed in parallel. We have three threads running: two threads execute the radiometric and non-radiometric filters in parallel, each associated with its own GPU device. The third thread runs one iteration ahead, executing the locate function for the next line of output. The figure below illustrates the execution flow:
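In code terms, one step of this pipeline could be organized roughly as in the sketch below (using std::thread; the per-row stage calls are the illustrative names used earlier and are left as comments):

    #include <cuda_runtime.h>
    #include <thread>

    // One pipeline step: the two filter threads each drive their own GPU for row N
    // while the locate thread already prepares row N + 1.
    void runPipelineStep(int row, int lastRow)
    {
        std::thread radiometric([row] {
            (void)row;                               // used by the real filter call
            cudaSetDevice(0);                        // first GPU device
            // filterRadiometricRow(row);
        });
        std::thread nonRadiometric([row] {
            (void)row;
            cudaSetDevice(1);                        // second GPU device
            // filterNonRadiometricRow(row);
        });
        std::thread locateAhead([row, lastRow] {
            if (row + 1 <= lastRow) {
                // locateRow(row + 1);               // one iteration ahead of the filters
            }
        });

        radiometric.join();
        nonRadiometric.join();
        locateAhead.join();
        // writeRow(row);                            // the finished row can now be written
    }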
If we compare the final execution time results from the previous sections, we see that the radiometric thread
takes most of the time and hides the other two threads. The locate thread is close to the execution time of the radiometric thread,
while the non-radiometric thread takes about
half.
The end result of the performance improvement, factoring all optimizations, can be viewed in the table below:
Initially the mapping spends about one-third of its execution time on I/O operations, which cannot be parallelized. With a serial fraction of roughly 1/3, Amdahl's law therefore limits the achievable speedup to about 1/(1/3) = 3, and we see that we are very close to reaching this limit.
It is worth noting that the speedup is not equal for the
small and big segments. As the segment size grows, the execution
time for I/O and computations grows non-linearly. Due to this, the
speedup for the big segment is smaller.
The graph below shows the execution time contributions of
the different parts of the mapping for the small segment input. The
performance gain comes both from the CPU optimizations applied to the whole program and from the GPU optimizations of the filters. We see that performance already doubles from the CPU optimizations alone; the GPU optimizations further improve it, for a total gain of about three.
Authors
- Dimitris Theodorou
- Roshan Kotian
- Tudor Mihordea
- Ashenafi Tesfay Gebreweld