Introduction

The algorithm is the recognition part of a vision system based on a Convolutional Neural Network (CNN). It uses the trained network structure produced by the training part of the same vision system.
The trained network structure consists of four layers: the first two layers (depicted in the red rectangle) perform feature extraction and the last two layers (depicted in the blue rectangle) perform classification. Every layer works on the same principle but with a different workload. Each layer is composed of several outmaps. An outmap is computed by convolving outmaps of the previous layer with predefined sets of weights: one convolution is performed for each connection between the outmap and the previous layer, and the intermediate results are accumulated. A constant value, called the bias, is added to the sum, and the result is passed through a sigmoid activation function to generate the final output value. The input of the algorithm is a 1280x720 HD video frame, which is processed by the trained network structure. The output indicates whether a sign is present and, if so, which type of sign was detected.
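
The computation of a single outmap described above can be sketched in plain Python. This is only an illustrative CPU reference, not the code of the vision system: the function names are hypothetical, and it assumes a stride of 1, no padding, and no kernel flipping, none of which are stated in the text.

```python
import math


def sigmoid(x):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-x))


def convolve_valid(image, kernel):
    """Sliding-window weighted sum without padding (assumed stride 1, no flip)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = [[0.0] * out_w for _ in range(out_h)]
    for y in range(out_h):
        for x in range(out_w):
            acc = 0.0
            for ky in range(kh):
                for kx in range(kw):
                    acc += image[y + ky][x + kx] * kernel[ky][kx]
            out[y][x] = acc
    return out


def compute_outmap(prev_outmaps, kernels, bias):
    """One outmap: one convolution per connected input map, accumulated,
    then bias added and sigmoid applied to every pixel."""
    total = None
    for image, kernel in zip(prev_outmaps, kernels):
        partial = convolve_valid(image, kernel)
        if total is None:
            total = partial
        else:
            total = [[a + b for a, b in zip(row_a, row_b)]
                     for row_a, row_b in zip(total, partial)]
    return [[sigmoid(v + bias) for v in row] for row in total]
```

Every output pixel depends only on a small window of the input maps, which is what makes the computation a good candidate for a GPU.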

Application profile

In the given code, all the computations for one image were performed by a single function. The total time to process one HD image is 3580ms. Different parts of this function were timed in order to identify where parallelism could be exploited. The second column of the table shows the time required to process each layer; the third layer is the most time consuming because of the network structure. The third column shows the time needed to convolve one set of weights with one image from the previous layer. The fourth column shows the time needed to perform the activation function.
The time needed to convert the image from unsigned char to float and back was also measured. The functions uc2f and f2uc take approximately 6ms and 0.7ms, respectively. It was decided to implement these two functions on the GPU as well.

Single CPU-GPU implementation - First approach

Initially, the most time-consuming layer, the third one, was mapped to the GPU. Separate kernels were created for the functions that make up the layer:
- the convolution between an image of the previous layer and the corresponding set of weights, which produces an intermediate result for a feature map of the current layer.
- the accumulation of all the intermediate feature maps in order to obtain the final feature map.
- the addition of the bias value.
- the sigmoid activation function.

Shared memory is used to reduce the number of transactions to and from the slow global memory. As many lines of the image as the height of the convolution window are stored in shared memory, so that the convolution kernel can reuse the data shared by overlapping windows. Further speedup is achieved by computing the activation function on the special function units of the GPU, which provide faster single-precision floating-point division than the division operator.
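
The benefit of keeping as many image lines as the window height in fast memory can be illustrated with a small CPU simulation. This is a sketch under assumptions, not the actual kernel: a `deque` stands in for shared memory, the counter `row_loads` stands in for global-memory reads, and the function name is hypothetical.

```python
from collections import deque


def convolve_with_row_cache(image, kernel):
    """Convolve while holding only `kh` image rows in a small cache.

    Each image row is fetched once; every window that overlaps the row
    reuses the cached copy instead of reading it again.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_w = len(image[0]) - kw + 1
    cache = deque(maxlen=kh)  # stand-in for shared memory: kh rows at most
    row_loads = 0             # stand-in for global-memory row reads
    out = []
    for row in image:
        cache.append(list(row))  # oldest row is dropped automatically
        row_loads += 1
        if len(cache) == kh:
            out.append([
                sum(cache[ky][x + kx] * kernel[ky][kx]
                    for ky in range(kh) for kx in range(kw))
                for x in range(out_w)
            ])
    return out, row_loads
```

Without the cache, each output row would need `kh` row reads of its own, so the number of reads would grow with the window height instead of staying equal to the number of image rows.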

After the third layer, all the remaining layers were mapped to the GPU as well. The speedup gained is 8.75 times; the execution time per frame is now 420ms.

Single CPU-GPU implementation - Second approach

The idea of the second approach is to exploit parallelism across all the outmaps: all the intermediate outmaps needed to compute the final outmaps of a layer are processed in parallel. This required defining, in every layer, a matrix structure that contains the set of weights needed for every final outmap. A second matrix structure stores the intermediate feature maps resulting from the convolutions. The accumulation kernel was modified accordingly, since it now has to add together the intermediate results associated with each final outmap. The speedup gained is 3.5 times; the execution time per frame is now 120ms.
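
The two matrix structures can be sketched as follows. This is a minimal CPU illustration with a hypothetical layout: `weights[o][i]` holds the kernel linking input map `i` to final outmap `o` (or `None` when they are not connected), and the intermediate maps are stored in a matrix of the same shape. The actual data layout on the GPU is not described in the text.

```python
def convolve_valid(image, kernel):
    """Sliding-window weighted sum without padding (assumed stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    return [[sum(image[y + ky][x + kx] * kernel[ky][kx]
                 for ky in range(kh) for kx in range(kw))
             for x in range(len(image[0]) - kw + 1)]
            for y in range(len(image) - kh + 1)]


def layer_intermediates(prev_outmaps, weights):
    """Compute every intermediate map of a layer.

    Each (o, i) entry is independent of all the others, so on a GPU
    all of them could be computed concurrently.
    """
    return [[convolve_valid(prev_outmaps[i], k) if k is not None else None
             for i, k in enumerate(row)]
            for row in weights]


def accumulate(intermediates):
    """Add together the intermediate maps that belong to each final outmap."""
    finals = []
    for row in intermediates:
        maps = [m for m in row if m is not None]
        finals.append([[sum(values) for values in zip(*rows)]
                       for rows in zip(*maps)])
    return finals
```

Separating the convolutions from the accumulation is what exposes the parallelism: the first step has no dependencies between entries, and only the second step combines results per outmap.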

Single CPU-GPU implementation - Third approach

In the previous approach, each CNN layer had to allocate extra GPU memory, transfer data to it, and free it afterwards. To avoid this per-layer overhead for storing the intermediate results, the memory is now allocated only once, with a size large enough for every layer. The GPU memory allocated for the image is used for both intermediate and final outmaps.
According to the CUDA profiler, the occupancy of the convolution kernel was relatively low. For this reason, the 2D CUDA grid of the convolution kernel was extended to a 3D grid, so that the image height is divided into tiles that are computed in parallel by a greater number of thread blocks. The best results were obtained with tiles of 64 lines.
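
The effect of tiling the height on the number of thread blocks can be sketched with a small helper. The mapping of grid dimensions used here (x for column chunks, y for outmaps, z for height tiles) and the value of 128 threads per block are assumptions made for illustration; only the tile height of 64 lines comes from the text.

```python
def conv_grid_dims(width, height, num_maps, threads_per_block=128, tile_height=64):
    """Return a hypothetical (x, y, z) grid for the convolution kernel.

    x: blocks needed to cover one image line
    y: one entry per map computed in parallel
    z: number of height tiles (the dimension added by the 3D grid)
    """
    blocks_x = -(-width // threads_per_block)  # integer ceiling division
    tiles = -(-height // tile_height)
    return (blocks_x, num_maps, tiles)


def tile_rows(height, tile_index, tile_height=64):
    """Return the [start, end) row range handled by one tile."""
    start = tile_index * tile_height
    return start, min(start + tile_height, height)
```

For a 1280x720 image and 6 maps, this hypothetical mapping gives a grid of (10, 6, 12), that is 12 times as many blocks as the same grid without height tiling. The last tile covers rows 704 to 719 only, so a kernel written this way would have to check the row bound.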

Additionally, loop unrolling with templates was applied to the convolution kernel. Kernels were also merged: the bias kernel was merged into the convolution kernel, and the activation kernel into the accumulation kernel. The speedup gained is 3.5 times; the execution time per frame is now 33ms, which corresponds to 30fps. The total speedup with respect to the original version is 108 times.
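
The speedup figures can be recomputed from the reported times per frame. The short script below uses only the times quoted in the text; the function name is hypothetical. Small differences from the quoted per-step figures are presumably due to rounding of the reported times.

```python
# Reported execution time per frame, in milliseconds, for each version.
TIMES_MS = [
    ("original CPU version", 3580),
    ("first approach", 420),
    ("second approach", 120),
    ("third approach", 33),
]


def stage_speedups(times):
    """Speedup of each version relative to the version before it."""
    return [prev / cur for (_, prev), (_, cur) in zip(times, times[1:])]


steps = stage_speedups(TIMES_MS)
total = TIMES_MS[0][1] / TIMES_MS[-1][1]  # original time / final time
fps = 1000 / TIMES_MS[-1][1]              # frames per second at 33 ms per frame

for (name, _), s in zip(TIMES_MS[1:], steps):
    print(f"{name}: {s:.2f}x")
print(f"total: {total:.1f}x, {fps:.1f} fps")
```

The product of the per-step speedups equals the total speedup, which is why three moderate improvements combine into a factor of roughly 108.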

Single CPU-GPU implementation - Results

The CUDA profiler was used to further analyze the kernel characteristics highlighted in the table. For the convolution kernel, the third layer remains the most time consuming. The low memory bandwidth of this kernel indicates that it is limited by the large number of computations it performs. This is not the case for the accumulation kernel, which has a high memory bandwidth: it performs significantly fewer computations than the convolution kernel and is therefore limited by memory accesses. Both kernels have a relatively high occupancy. The lower result for the first layer in the convolution kernel is related to its CNN structure.

Multiple CPU-GPU implementation

Because of the data dependencies between the layers of the CNN, the layers of one frame cannot be processed concurrently. Given the large number of frames to be processed, it was decided to exploit parallelism across frames instead. Using OpenMP, the frames are distributed equally among the available GPUs. This approach reduces the total execution time, whereas the time to process one frame remains the same.
To evaluate the advantages of using multiple GPUs, the application was run on a set of 120 input images. The results were compared with those obtained using the single-GPU implementation and the original, non-optimized CPU version. The total speedup when using 4 GPUs is 357 times.
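
The equal distribution of frames among GPUs can be sketched as follows. The original implementation uses OpenMP; here Python threads stand in for the OpenMP threads, each assumed to drive one GPU, and `process_frame` is a hypothetical placeholder for the single-GPU pipeline.

```python
from concurrent.futures import ThreadPoolExecutor


def split_frames(num_frames, num_gpus):
    """Split frame indices into contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(num_frames, num_gpus)
    chunks, start = [], 0
    for gpu in range(num_gpus):
        size = base + (1 if gpu < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks


def process_all(frames, num_gpus, process_frame):
    """Process every frame, one worker per GPU, keeping the original frame order."""
    def worker(chunk):
        return [process_frame(frames[i]) for i in chunk]

    with ThreadPoolExecutor(max_workers=num_gpus) as pool:
        parts = pool.map(worker, split_frames(len(frames), num_gpus))
    return [result for part in parts for result in part]
```

With 120 frames and 4 GPUs each worker receives 30 frames. Each frame still passes through the whole network on one GPU, so the latency per frame does not change; only the total time for the set is reduced.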

Given that at most four GPUs can be used with OpenMP, the number of GPUs was varied from 1 to 4 to measure the performance in each case. The set of 120 images was used for these measurements. The graph illustrates the results obtained.

Conclusion

- Gradual mapping of the recognition algorithm to GPU.
- Suitable for GPU:
  - Independent pixel computation
  - CNN structure
  - Large amount of FLOP
- Time per frame = 33ms and total speedup = 108x
- Satisfaction of real-time requirements.
- Acceleration of the total time using multiple GPUs.

Authors

- Panagiotis Afxentis
- Alicia Sánchez Crespo
- Ying Zhang