Optimizing the Proba-V mapping algorithm implementation for a heterogeneous (CPU + GPU) environment

Satellite Image Projection

Proba-V Satellite

Proba-V is a small Earth observation satellite for global vegetation monitoring. The satellite is a project of the European Space Agency (ESA mission link) and is expected to be fully operational by the beginning of 2013.

The User Segment of the Proba-V project was developed by VITO, Belgium (VITO project link). It includes the processing of the data: the vegetation data captured by the satellite is mapped to a given user projection. The goal of this project was to optimize the performance of the mapping algorithm using CPU and GPU parallelization.

To briefly illustrate the algorithm, the mapping essentially consists of two steps (a minimal sketch follows the list):
  • Locate the coordinates in the unprojected image that correspond to a (longitude, latitude) coordinate pair in the user projection (essentially an inverse location process). This is achieved using predictor functions.
  • Determine the value of the pixel located in the previous step. This is done by applying filters, such as bilinear or bicubic interpolation, to the surrounding pixels.
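
As a minimal sketch of these two steps, consider the example below; the helper names, the placeholder affine predictor, and the bilinear weights are illustrative assumptions, not the actual Proba-V code.

```cpp
#include <cmath>
#include <vector>

// Illustrative stand-ins for the real predictor functions and filters.
struct PixelCoord { double p, l; };  // column (p) and line (l) in the unprojected image

// Step 1 (inverse location): predict the input-image coordinates for an output
// (lon, lat). A placeholder affine predictor is used here.
PixelCoord locatePixel(double lon, double lat) {
    return { (lon + 180.0) * 10.0, (90.0 - lat) * 10.0 };
}

// Step 2 (filtering): bilinear interpolation of the four surrounding pixels.
// The caller is assumed to pass coordinates that stay inside the image.
float bilinearFilter(const std::vector<float>& img, int width, PixelCoord c) {
    int p0 = static_cast<int>(std::floor(c.p));
    int l0 = static_cast<int>(std::floor(c.l));
    double dp = c.p - p0, dl = c.l - l0;
    auto at = [&](int p, int l) { return img[l * width + p]; };
    return static_cast<float>(
        (1 - dl) * ((1 - dp) * at(p0, l0)     + dp * at(p0 + 1, l0)) +
        dl       * ((1 - dp) * at(p0, l0 + 1) + dp * at(p0 + 1, l0 + 1)));
}

float mapOutputPixel(const std::vector<float>& img, int width, double lon, double lat) {
    PixelCoord c = locatePixel(lon, lat);   // inverse location via predictor
    return bilinearFilter(img, width, c);   // resample the surrounding input pixels
}
```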
Analysis and refactoring of initial application

The execution of the mapping was measured using code instrumentation. The execution time is shown below:

Original Execution Division

The input/output (reading and writing of files) takes about one-third of the total execution time, measured on a small set of input data (a small segment). On a big input segment, the I/O takes close to 40%.

The input files can be so large that all of the input data cannot be kept in memory at once. Therefore, the given implementation reads the files in parts: at any point in time, only the portion of the input data needed to compute one line of the output is kept in memory. The input files are compressed HDF5 files; when the mapping reads them they are first decompressed, which leads to a large file reading time.
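
For illustration, reading one band of input lines through the HDF5 C API might look roughly like the sketch below; the dataset layout, the element type, and the absence of error checking are assumptions made for brevity.

```cpp
#include <hdf5.h>
#include <vector>

// Sketch only: reads input lines [firstLine, firstLine + numLines) of a 2-D
// float dataset. Error checking is omitted for brevity.
std::vector<float> readLineBand(const char* path, const char* dataset,
                                hsize_t firstLine, hsize_t numLines, hsize_t width) {
    std::vector<float> buffer(numLines * width);

    hid_t file  = H5Fopen(path, H5F_ACC_RDONLY, H5P_DEFAULT);
    hid_t dset  = H5Dopen2(file, dataset, H5P_DEFAULT);
    hid_t space = H5Dget_space(dset);

    // Select only the lines needed for the current part of the output.
    hsize_t start[2] = { firstLine, 0 };
    hsize_t count[2] = { numLines, width };
    H5Sselect_hyperslab(space, H5S_SELECT_SET, start, nullptr, count, nullptr);
    hid_t memspace = H5Screate_simple(2, count, nullptr);

    // Decompression of the selected data happens inside H5Dread, which is why
    // reading dominates the I/O cost.
    H5Dread(dset, H5T_NATIVE_FLOAT, memspace, space, H5P_DEFAULT, buffer.data());

    H5Sclose(memspace);
    H5Sclose(space);
    H5Dclose(dset);
    H5Fclose(file);
    return buffer;
}
```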

Writing takes a very small portion (1%) of the total execution time. This is because only complete lines are written at a time, giving a very regular file access pattern and therefore an efficient writing operation. In addition, unlike the input files, the output files are not compressed.

The computations are made up of the locate and filtering operations. All locate and filtering operations can be done independently for each pixel, so in theory they are completely parallelizable over the output data. In practice, however, we can only keep enough input data in memory for a single line of output. Therefore only the pixels of a single output line can be produced in parallel, and the parallelism is restricted by the amount of memory needed to keep the input data in cache.

Originally, the complete functionality of the image projection was executed sequentially for every pixel. We refactored the code into an outer row loop that contains functions operating on the current row of pixels, as shown below:

Mapping Flowchart
The code was restructured in this way so that:
  • Each function can be executed in parallel with the others. This opens up the possibility of forming an execution pipeline.
  • Parallelizing the computations at the row level can be tackled in a different manner in each function (a sketch of the restructured loop follows this list).
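
A hedged sketch of the restructured loop, with illustrative stubs standing in for the actual per-row operations:

```cpp
#include <cstddef>

// Illustrative stubs: the actual per-row functions live in the VITO code base.
static void cacheInputForRow(std::size_t /*row*/) {}   // read and cache the needed input lines
static void locateRow(std::size_t /*row*/)        {}   // inverse location for every pixel of the row
static void filterRow(std::size_t /*row*/)        {}   // radiometric / non-radiometric filtering
static void writeRow(std::size_t /*row*/)         {}   // write one complete output line

// Outer row loop: each step operates on a whole row, so the steps can later be
// pipelined or parallelized independently of each other.
void mapImage(std::size_t numRows) {
    for (std::size_t row = 0; row < numRows; ++row) {
        cacheInputForRow(row);
        locateRow(row);
        filterRow(row);
        writeRow(row);
    }
}
```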
I/O and caching

For every output line that represents a row in the output data, the input that has to be read is a curved line that crosses the input data. This is illustrated in the image below: the array represents the input data, and the red lines are the data that has to be read for the corresponding output lines.

Reading Access Pattern
From the reading access pattern we can see that if data is cached separately for each output line, some input lines will have to be read multiple times. Since the time needed for reading is very high (due to HDF5 decompression), an initial caching mechanism was implemented to limit the number of input lines read multiple times.

To port the computation to the GPU, we need the input data on the GPU as well, so we mirrored the cache mechanism in GPU memory. In addition, to be able to run calculations in parallel, data must be collected in advance: the input data needed for an output line is completely cached before the computation part is started. To achieve that we modified the caching algorithm: instead of caching, for each output pixel, the lines required from the input, we calculate the whole input data range needed (minimum and maximum line numbers) and cache it all at once. This reduced the time spent caching by almost a factor of two. After these modifications, with data cached one whole line at a time, the I/O time dropped to 20% of the execution. Copying the data to GPU memory added extra delay, so in the end I/O amounted to almost 25% of the execution time.
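
A minimal sketch of the range-based caching idea, assuming hypothetical names and that the located input line numbers for the output row are already available:

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <vector>

// Sketch of the modified caching strategy: instead of caching the required
// input lines pixel by pixel, the whole input line range needed by one output
// row is computed first and then cached in a single pass.
struct InputRange {
    std::size_t minLine = std::numeric_limits<std::size_t>::max();
    std::size_t maxLine = 0;
};

InputRange inputRangeForRow(const std::vector<double>& locatedInputLines) {
    InputRange r;
    for (double l : locatedInputLines) {
        auto line = static_cast<std::size_t>(l);
        r.minLine = std::min(r.minLine, line);
        r.maxLine = std::max(r.maxLine, line + 1);   // filters also touch neighbouring lines
    }
    return r;
}
```

Caching one contiguous band of lines per output row also makes it straightforward to mirror the cache in GPU memory with a single copy per row.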

The reading of files is done through the HDF5 library. To make this library compatible with our multithreading approach we had to recompile it with the --enable-threadsafe and --with-pthread configure options. We also tried to run multiple caching operations (for different files) in parallel using multiple threads, but the execution time stayed the same; the HDF5 library appears to serialize accesses even to different files. We can conclude that file reading (25% of the initial execution time) cannot be parallelized and is the minimum execution time needed for the mapping application.
The locate function

The flowchart of the locate function is displayed below:

Locate (p,l) flowchart
Initially, (lon, lat) is projected onto the intermediate (u, v) space (referred to as reproject). After that, an initial (p0, l0) value is predicted from (u, v), and then an iterative prediction takes place that converges to a (p, l) value. In practice, with the currently used accuracy threshold, the value converges after one or two iterations. The execution time measurements yield the following results:

Locate (p,l) initial execution time
The locate function operates on each pixel independently of the other pixels' computations, so it is possible to parallelize it on a per-pixel basis. The bulk of the computation (70%) is taken by the reproject, and the biggest performance gain would come from optimizing that operation. However, the reproject calls the external PROJ.4 library, and the call graph that branches within the library is not simple enough to make it feasible to port only the parts of the library that are actually used. Moreover, no GPU port of the library is provided by its creators, and porting it ourselves was beyond the scope of this project. Even with GPU usage ruled out for the reproject, its performance can still be increased using CPU parallelization: its computation can be split among all available CPU threads in a straightforward manner, as there are no race conditions or data dependencies. To be able to parallelize the reproject and predict operations separately, we restructured the code so that each part of the function operates on a whole row of pixels.

The reproject loop was ported to CPU threads using OpenMP, as sketched below. The only performance bottleneck here is the number of threads available: we use as many threads as possible (out of the total of 8). When this solution is combined with the final parallelized pipeline (see the Parallel pipeline section), we are left with 6 threads.
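
A sketch of what the OpenMP-parallelized reproject loop can look like; reprojectPixel is a placeholder for the PROJ.4 call, not the actual function name:

```cpp
#include <omp.h>

// Placeholder for the PROJ.4-based reprojection of a single (lon, lat) pair;
// the real code calls into the PROJ.4 library here.
static void reprojectPixel(double lon, double lat, double* u, double* v) {
    *u = lon;
    *v = lat;
}

// Every pixel is independent, so the row can simply be split among the threads.
void reprojectRow(const double* lon, const double* lat,
                  double* u, double* v, int rowWidth, int numThreads) {
    #pragma omp parallel for num_threads(numThreads) schedule(static)
    for (int i = 0; i < rowWidth; ++i) {
        reprojectPixel(lon[i], lat[i], &u[i], &v[i]);
    }
}
```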

The iterative prediction is implemented in the following way for performance reasons: after an iteration for a pixel finishes and a (p, l) value is found, the difference between (p0, l0) and (p, l) is stored. This difference is added to the input of the next pixel's iteration to reduce the iteration time. If every pixel calculation is done in parallel, this mechanism can no longer be used, since every pixel value is calculated separately. This also means that the results of the two methods will differ, which is explained by comparing how the coordinates of the same pixel are calculated in the two methods: the iteration (polynomial interpolation) starts from different initial values and thus reaches the accuracy threshold with different results. Of course, this difference will be smaller than the threshold. To verify the correctness of such a GPU solution, a tool would be needed that can read the output images, so that we can compare them with the initial ones and verify that the differing pixels only differ by the allowed threshold. It was decided not to allocate project time to implementing this solution, since improving the other parts of the mapping would yield a bigger performance gain. Instead, the predict operation was ported to CPU threads in the same way as the reproject. The final solution for improving the locate performance, as well as the achieved execution times, are displayed below.
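
For reference, a minimal sketch of the sequential difference-carrying scheme described above; all names are illustrative and the predictors are reduced to trivial placeholders:

```cpp
struct PL { double p, l; };

// Initial prediction of (p0, l0) from (u, v); trivial placeholder predictor.
static PL initialPrediction(double u, double v) { return { u, v }; }

// Iterative prediction refining a starting guess until the accuracy threshold
// is reached; reduced to returning the guess to keep the sketch minimal.
static PL iteratePrediction(double /*u*/, double /*v*/, PL guess) { return guess; }

void predictRowSequential(const double* u, const double* v, PL* out, int rowWidth) {
    double dp = 0.0, dl = 0.0;                    // correction carried between pixels
    for (int i = 0; i < rowWidth; ++i) {
        PL p0 = initialPrediction(u[i], v[i]);
        PL guess{ p0.p + dp, p0.l + dl };         // reuse the previous pixel's correction
        out[i] = iteratePrediction(u[i], v[i], guess);
        dp = out[i].p - p0.p;                     // store the difference for the next pixel
        dl = out[i].l - p0.l;
    }
}
// A per-pixel parallel version has to start every pixel from its own (p0, l0),
// which is why its results can differ from the sequential version by up to the
// accuracy threshold.
```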

Locate (p,l) optimized flowchart

Locate (p,l) optimized execution time
The filters

Two filters are used per output pixel: one for the radiometric bands and one for the non-radiometric bands. Each of these filters is executed for a row of output on the GPU before the output data is copied back from the GPU and written to the output files. The memory on the GPU devices is only large enough to hold the data for producing a single row of output. The table below compares the initial execution time with the optimized one.

Filters execution times
The radiometric and non-radiometric filters are data bound rather than computation bound. For each output pixel in a row, nested for loops calculate running sums. In these loops, most of the data read is used only once in a computation, which limits the prospect of using shared memory to avoid reading input data multiple times from global memory and thereby gaining performance. A simplified sketch of this access pattern is shown below.
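
A simplified, CPU-side sketch of the access pattern; the window size and weights are assumptions, and the real filters run as GPU kernels:

```cpp
#include <vector>

// For every output pixel a small neighbourhood of input pixels is read once
// into running sums, so there is little data reuse to exploit with shared memory.
float filterPixel(const std::vector<float>& input, int width,
                  int pCenter, int lCenter, const float weight[3][3]) {
    float sum = 0.0f, weightSum = 0.0f;
    for (int dl = -1; dl <= 1; ++dl) {            // nested loops over the filter window
        for (int dp = -1; dp <= 1; ++dp) {
            float w = weight[dl + 1][dp + 1];
            sum       += w * input[(lCenter + dl) * width + (pCenter + dp)];
            weightSum += w;
        }
    }
    return sum / weightSum;
}
```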
Parallel pipeline

As the final step of the optimization, the locate operation and the filters can be executed in parallel. We have three threads running: two threads execute the radiometric and non-radiometric filters in parallel, each associated with a single GPU device. The third thread runs one iteration ahead and executes the locate function for the next line of output. The figure below illustrates the execution flow:

Threaded execution
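
A minimal sketch of the three-thread pipeline, with illustrative stubs for the per-row steps; GPU device selection and stream synchronization are omitted:

```cpp
#include <cstddef>
#include <thread>

// Illustrative stubs, not the actual implementation.
static void locateRow(std::size_t /*row*/)              {}  // runs one row ahead on the CPU
static void radiometricFilterRow(std::size_t /*row*/)    {}  // runs on one GPU device
static void nonRadiometricFilterRow(std::size_t /*row*/) {}  // runs on the other GPU device
static void writeRow(std::size_t /*row*/)               {}  // write the finished output row

// While the two filter threads process row r on their GPUs, the locate thread
// already prepares row r + 1.
void runPipeline(std::size_t numRows) {
    locateRow(0);                                        // prime the pipeline
    for (std::size_t row = 0; row < numRows; ++row) {
        std::thread radio(radiometricFilterRow, row);
        std::thread nonRadio(nonRadiometricFilterRow, row);
        std::thread locateNext([&] { if (row + 1 < numRows) locateRow(row + 1); });
        radio.join();
        nonRadio.join();
        locateNext.join();
        writeRow(row);
    }
}
```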
Comparing the final execution time results from the previous sections, we see that the radiometric thread takes most of the time and hides the other two threads. The locate thread is close to the execution time of the radiometric thread, while the non-radiometric thread takes about half of it.

Threaded execution times
The end result of the performance improvement, factoring in all optimizations, can be viewed in the table below:

Overall performance improvement
Initially, about one-third of the mapping's execution time goes to I/O operations. Based on Amdahl's law, that puts an upper bound of about 3 on the achievable speedup, and we come very close to reaching this limit. It is worth noting that the speedup is not equal for the small and big segments: as the segment size grows, the execution time for I/O and computations grows non-linearly, and as a result the speedup for the big segment is smaller.
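In rough numbers: with a serial (I/O) fraction s ≈ 1/3, Amdahl's law gives the speedup on n processors as

    S(n) = 1 / (s + (1 − s)/n)  →  1/s ≈ 3   as n → ∞,

so a speedup of about 3 is the best that can be achieved no matter how many processors are added.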

The graph below shows the execution time contributions of the different parts of the mapping for the small segment input. The performance gain comes both from the CPU optimizations applied to the whole program and from the GPU optimizations of the filters. Performance already doubles from the CPU optimizations alone; the GPU optimizations further improve it, for a total speedup of about three.

Execution time contributions (small segment)
Authors
  • Dimitris Theodorou
  • Roshan Kotian
  • Tudor Mihordea
  • Ashenafi Tesfay Gebreweld