Efficient mapping of the training of Convolutional Neural Networks to a CUDA-based cluster

Convolutional Neural Network

Figure 1. Convolutional Neural Network Architecture Model

Artificial Neural Networks (ANNs) have emerged as a powerful machine-learning technique for solving practical problems such as pattern classification and recognition, medical imaging, speech recognition, and control.

A Convolutional Neural Network (CNN) is an extension of an ANN optimized for two-dimensional pattern recognition: it uses shared weights and fewer connections, which greatly reduces the solution space. The architecture of a CNN is shown in Figure 1.

The biggest problem with CNN implementations is the long training time, since training is a highly compute- and data-intensive process; on large data sets it can take days or weeks. However, every training step performs a large number of floating-point operations with relatively little data transfer, which makes the training well suited to modern graphics processing units (GPUs).

The goal of the project was to improve overall application performance by converting the existing implementation into a parallel version that makes the best use of the available resources.
The solution

Figure 2. Image tile Processing

Online-mode learning

The process of online-mode learning in CNNs consists of three phases:
  • a 'Convolution' phase
  • a 'Compute local gradients' phase
  • an 'Update weights' phase
Each pattern in a training epoch passes through all three phases and updates the network. After each epoch, the network is tested by running the convolution phase on the test patterns to compute an error.
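The epoch structure above can be outlined as follows. This is a hypothetical sketch in plain Python (the method names `convolve`, `compute_local_gradients`, `update_weights`, and `error` are ours, not from the original code): each training pattern runs all three phases and updates the network immediately, and the test patterns then run only the convolution phase to compute the epoch error.

```python
# Hypothetical outline of one online-mode epoch; all method names are
# illustrative assumptions, not the original implementation's API.
def run_epoch_online(net, train_patterns, test_patterns):
    for p in train_patterns:
        out = net.convolve(p)                      # phase 1: convolution
        grads = net.compute_local_gradients(out)   # phase 2: local gradients
        net.update_weights(grads)                  # phase 3: weight update
    # Test pass after the epoch: forward (convolution) only, no updates.
    errors = [net.error(net.convolve(p)) for p in test_patterns]
    return sum(errors) / len(errors)
```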

The general approach to parallelization splits the image into tiles, each represented by one thread block per output feature map (Figure 2). The convolution phase is implemented with this scheme. Each operation in this phase is applied per kernel, which adds extra loops inside each CUDA kernel. To mitigate the resulting performance cost, and given the relatively small loop trip counts, the loops are unrolled using the Boost preprocessor library. The same technique is applied to the sigmoid function at the end of this phase.
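The tiling scheme can be illustrated with a small sketch. This is plain Python rather than CUDA, and all names and the tile size are our own assumptions: the output feature map is split into fixed-size tiles, each tile standing in for one CUDA thread block, and each pixel within a tile for one thread. The small inner kernel loops are the ones that the real implementation unrolls with the Boost preprocessor.

```python
# Illustrative tile-based 2D convolution (valid padding). The two outer
# loops enumerate tiles (one CUDA block each); the two middle loops
# enumerate pixels within a tile (one CUDA thread each); the innermost
# kernel loops are the ones unrolled on the GPU.
def convolve_tiled(image, kernel, tile=4):
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(image) - kh + 1, len(image[0]) - kw + 1
    out = [[0.0] * ow for _ in range(oh)]
    for ty in range(0, oh, tile):
        for tx in range(0, ow, tile):
            for y in range(ty, min(ty + tile, oh)):
                for x in range(tx, min(tx + tile, ow)):
                    acc = 0.0
                    for ky in range(kh):          # small loop, unrolled on the GPU
                        for kx in range(kw):
                            acc += image[y + ky][x + kx] * kernel[ky][kx]
                    out[y][x] = acc
    return out
```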

Figure 3. Backward calculation

The second phase uses the back-propagation method to calculate local gradients from the succeeding layers and then, from those local gradients, the gradients of the weights. The algorithm in this phase follows the connections from the Feature Extraction Layers (FELs) of each layer to the FELs of its preceding layer: each layer pushes data to its preceding layer for the calculation (Figure 3).

The information about the connections between the layers is stored in a connection matrix. This matrix can encode an arbitrary, network-dependent set of connections, so it is not known in advance when, or how many times, a pixel of a given FEL will be updated. As a result, the kernel could not be parallelized over the pixels of each FEL, which forced the summation of the gradients to be serialized and brought no improvement in overall processing time.

To overcome this problem, the concept of an inverted connection matrix was introduced. With the inverted matrix, each FEL pulls the relevant data from the FELs of the succeeding layer, which guarantees there are no memory access conflicts when pixels are updated, since a read from the matrix for a given FEL writes only to that FEL. This behaviour, denoted as forward calculation (Figure 4), allows each FEL to be processed in parallel.
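The difference between the push and pull strategies can be sketched in a few lines. This is a hypothetical illustration in plain Python (the representation of the connection matrix as adjacency lists is our assumption): inverting the matrix turns "several writers per target pixel" into "one writer per target", which is what makes the pull version safe to parallelize.

```python
# connections[j] = list of preceding-layer FELs that feed succeeding FEL j.
# Inverting it gives, for each preceding FEL i, the succeeding FELs it must
# read from -- so each i can be computed by an independent thread with no
# write conflicts.
def invert_connections(connections, n_prev):
    inverted = [[] for _ in range(n_prev)]
    for j, sources in enumerate(connections):
        for i in sources:
            inverted[i].append(j)
    return inverted

def pull_gradients(succ_gradients, inverted):
    # Each preceding FEL i sums exactly the gradients of the succeeding
    # FELs it is connected to; no two iterations write the same location.
    return [sum(succ_gradients[j] for j in inverted[i])
            for i in range(len(inverted))]
```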
Figure 4. Forward calculation

The last phase of the training process is the weight update. It consists of two parts: a summation of local gradients followed by the update of the weights, and the calculation of the bias values per layer. The bias values are computed with the standard optimized GPU array-summation (reduction) pattern. Updating the delta weights requires writing to the same memory address as many times as there are pixels in the output map, so these sections of the algorithm are mapped to a single thread block. The remaining dimensions handle output maps, algorithm kernels, and inputs per output map, avoiding synchronization issues between CUDA threads that would otherwise write to the same address. Each thread iterates over a subset of the input pixels associated with the output map; once all threads complete, their partial results are summed and stored in the final weight location.
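The per-thread accumulation described above can be sketched as follows. This is a plain Python stand-in, not CUDA, and the thread count and function name are our assumptions: each simulated thread accumulates a private partial sum over its slice of contributions, and the partials are combined only once at the end, mimicking how a single thread block avoids concurrent writes to the same weight address.

```python
# Each "thread" owns one slot in `partials`, so no two accumulate into the
# same location; the final sum stands in for the combine step done after a
# block-level barrier on the GPU.
def sum_delta_weight(contribs, n_threads=4):
    partials = [0.0] * n_threads
    for idx, c in enumerate(contribs):
        partials[idx % n_threads] += c   # thread idx % n_threads owns this slot
    return sum(partials)                 # combined once, after all threads finish
```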

Batch learning

In batch mode, the network is trained on batches of images: for each batch, the delta values for the biases and weights are accumulated, and only after the whole batch has been processed are the previous bias and weight values updated. A known trade-off is that batch learning may take more iterations to reach the desired error minimum. However, since the speed of convergence depends on many factors, such as the learning rate, batch size, and number of neuron layers, no general conclusion can be drawn about which mode is better.
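The batch update rule can be written as a minimal sketch. All names here are ours (in particular `compute_delta`, a hypothetical per-image delta function): the per-image deltas are only accumulated, and the weights change once per batch instead of once per image as in online mode.

```python
# Batch-mode update: accumulate per-image deltas, apply them once.
def train_batch(weights, images, compute_delta, lr=0.1):
    acc = [0.0] * len(weights)
    for img in images:
        for k, d in enumerate(compute_delta(weights, img)):
            acc[k] += d                  # accumulate only; do not apply yet
    # Single update per batch, scaled by the learning rate.
    return [w - lr * a for w, a in zip(weights, acc)]
```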

Batch training maps naturally onto multiple GPUs. First, all GPUs start with the same weight and bias values; then the images of the batch are split equally between them. These two steps comprise the setup for batch training. Once the images are distributed, each GPU processes its share and generates the delta values locally. When a GPU finishes processing, it sends its delta values to the GPU in charge of summing the results. The sum of the delta values from all GPUs is then added to the previous bias and weight values, and finally the updated values are copied back to all GPUs. The update is performed on a single GPU because the summation itself is highly parallelizable. CUDA 2.x introduces a mechanism for copying memory between devices, which gives a substantial speedup: by using one device as a master, the unnecessary overhead of copying through host memory is avoided.
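The multi-GPU scheme can be summarized in a sketch. This is a plain Python stand-in (one list of deltas per simulated device; `compute_delta` and all other names are our assumptions): every device computes deltas for its slice of the batch, a designated master sums them and applies the update, and the result is broadcast back, mirroring the device-to-device copies described above.

```python
# Simulated multi-GPU batch step: per_device_images is one image list per
# simulated device.
def multi_gpu_step(weights, per_device_images, compute_delta, lr=0.1):
    # 1. Each device processes its slice of the batch locally.
    device_deltas = []
    for images in per_device_images:
        local = [0.0] * len(weights)
        for img in images:
            for k, d in enumerate(compute_delta(weights, img)):
                local[k] += d
        device_deltas.append(local)
    # 2. The master device sums the per-device deltas...
    total = [sum(dev[k] for dev in device_deltas) for k in range(len(weights))]
    # 3. ...applies the update; the result would then be copied back to
    #    every device.
    return [w - lr * t for w, t in zip(weights, total)]
```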
Performance gains

Table 1 contains the execution time of each function call on the CPU and GPU, together with the performance gain from parallelization. The function run_convolution_layer shows the largest gain, mostly because the sigmoid function has no data dependencies, so it executes fully in parallel on the GPU.

Table 2 shows the execution times of the operations when run after the first invocation. We used static variables to avoid allocating and freeing memory on every execution of the GPU functions.

Table 3 shows the result of processing four images with the batch method on four GPUs (one image per device). Since the amount of data transferred to and summed on the master GPU is independent of the number of images, and since the training of each image does not depend on the other devices, the batch-method speed-up grows linearly with the number of images processed per GPU.

However, the accuracy of the algorithm decreases accordingly. Furthermore, the testing of the network, performed after learning, cannot be parallelized over multiple GPUs, so the testing time equals that of online training. Processing the same number of images with the online method takes 2.68745 ms and grows linearly with the input size.

One of the biggest bottlenecks of the proposed solution is that all data has to be copied to a master GPU, which computes the weight and bias values. Every GPU has to wait until the copying is finished before the delta values can be processed, and the master GPU also has to wait before it sends the updated values to the next device. However, the latency introduced by this synchronization is smaller than the gain obtained from the faster training.

Batch learning should be preferred for neural network training because it allows more GPUs to take part in the processing. The number of images per GPU, as well as the number of iterations needed to train the network, should be chosen carefully so that accuracy is not affected. Unfortunately, there are no established techniques or heuristics for determining these parameters, as this field is still under heavy research.

In conclusion, Convolutional Neural Networks are highly suitable for parallelization on GPUs, which reduces the training time from several days or weeks to the order of minutes or hours, depending on the network size and the image resolution. The standard implementation of the compute-local-gradients phase does not exploit the full parallelism a CUDA device provides, due to the nature of the pushing algorithm; the pulling version, in contrast, achieves a considerable improvement in execution time.

By using batch-mode learning, the training was distributed over multiple GPUs within a single cluster node. Running the training on multiple GPUs decreases the overall execution time even further with respect to online-mode training on a single GPU. However, the potential performance improvement from running batch learning across several cluster nodes would be negligible because of the overhead of copying data between them.

The optimal solution would be to train a different network in each cluster node using batch mode. Since these networks are independent, different configurations are possible; after training, the networks can be compared to decide which one has the highest accuracy. Such an approach would utilize the full computational power of the cluster.
  • Jonatan Ward
  • Sergey Andreev
  • Francisco Heredia
  • Bogdan Lazar
  • Zlatka Manevska