Convolutional Neural Network
Figure 1. Convolutional Neural Network Architecture Model
Artificial Neural Networks (ANNs) have emerged as a powerful machine learning technique for solving practical
problems such as pattern classification and recognition, medical imaging, speech recognition, and control.
A Convolutional Neural Network (CNN) is an extension of an ANN that is optimized for two-dimensional pattern recognition:
it uses shared weights and fewer connections, which greatly reduces the solution space.
The architecture of a CNN is shown in Figure 1.
The biggest problem with CNN implementations is the long training time, since training is a computationally and
data-intensive process; on large data sets it can take several days or weeks. However, each training step involves
a large number of floating-point operations and relatively little data transfer, which makes the training well
suited for modern graphics processing units.
The goal of the project was to improve overall application performance by converting the existing implementation into
a parallel version that makes the best use of the available resources.
The solution

Figure 2. Image tile processing
Online-mode learning
The process of online-mode learning in CNNs consists of three phases:
- a 'Convolution' phase
- a 'Compute local gradients' phase
- an 'Update weights' phase
Each pattern in a training epoch goes through all three phases and updates the network.
After each epoch, the network is tested by running the convolution phase on the test patterns
and computing an error.
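As a rough illustration only, the online-mode loop can be organized along the following lines. The type and function names are hypothetical placeholders for the three phases described above, not the actual API of the implementation.

// Hypothetical outline of the online-mode training loop.
// Type and function names are illustrative placeholders only.
#include <vector>

struct Pattern { /* image pixels and target label */ };
struct Network { /* layers, weights, biases */ };

void run_convolution(Network&, const Pattern&);         // 'Convolution' phase
void compute_local_gradients(Network&, const Pattern&); // 'Compute local gradients' phase
void update_weights(Network&, float learning_rate);     // 'Update weights' phase
float compute_error(const Network&, const Pattern&);

void train_online(Network& net, const std::vector<Pattern>& train,
                  const std::vector<Pattern>& test, int epochs, float lr)
{
    for (int e = 0; e < epochs; ++e) {
        for (const Pattern& p : train) {   // every pattern updates the network
            run_convolution(net, p);
            compute_local_gradients(net, p);
            update_weights(net, lr);
        }
        float error = 0.0f;                // after each epoch: test by running
        for (const Pattern& t : test) {    // only the convolution phase
            run_convolution(net, t);
            error += compute_error(net, t);
        }
    }
}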
The general approach to parallelization is to split each image into tiles, where each tile is handled by one thread block
per output feature map (Figure 2).
The convolution phase is implemented using this approach. Each operation in this phase is applied per convolution kernel,
which adds extra loops inside each CUDA kernel. To mitigate the performance cost of these loops, and given their
relatively small size, they are unrolled using the Boost preprocessor library.
The same approach is applied to the sigmoid function at the end of this phase.
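To give an idea of the tiling and unrolling scheme, a convolution kernel of roughly the following shape could be used. This is a simplified sketch, not the project's code: it assumes a single input map per output map and a fixed 5x5 kernel, it uses #pragma unroll instead of the Boost preprocessor, and all names are illustrative.

#define KERNEL_SIZE 5   // assumed fixed, small convolution kernel

// One thread block covers one tile of one output feature map;
// blockIdx.z selects the output map.
__global__ void convolve_layer(const float* in, int inW,
                               float* out, int outW, int outH,
                               const float* weights, const float* bias)
{
    int x   = blockIdx.x * blockDim.x + threadIdx.x;  // pixel within the tile
    int y   = blockIdx.y * blockDim.y + threadIdx.y;
    int map = blockIdx.z;                             // output feature map
    if (x >= outW || y >= outH) return;

    float sum = bias[map];
    const float* w = weights + map * KERNEL_SIZE * KERNEL_SIZE;

    #pragma unroll                                    // small fixed-size loops
    for (int ky = 0; ky < KERNEL_SIZE; ++ky)
        #pragma unroll
        for (int kx = 0; kx < KERNEL_SIZE; ++kx)
            sum += w[ky * KERNEL_SIZE + kx] * in[(y + ky) * inW + (x + kx)];

    // Sigmoid activation at the end of the phase, also fully data-parallel.
    out[map * outW * outH + y * outW + x] = 1.0f / (1.0f + expf(-sum));
}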

Figure 3. Backward calculation
The second phase uses the back-propagation method to calculate the local gradients from the succeeding layers and then, using those local gradients,
to calculate the gradients of the weights. The algorithm in this phase follows the connections from the Feature Extraction Layers (FELs) of
each layer to the FELs of its preceding layer; in other words, each layer pushes its data to the preceding layer for the calculation (Figure 3).
The information about the connections between the layers is stored in a
connection matrix. This matrix can hold an arbitrary, network-dependent combination of connections,
so it is not known in advance when, or how many times, the pixels of a FEL in a layer will be updated.
As a result, the kernel could not be parallelized over the pixels of each FEL, which forced the summation of the gradients to be
serialized and brought no improvement in the overall processing time.
To overcome this problem, the concept of an inverted connection matrix was introduced.
With this matrix, each FEL pulls the relevant data from the FELs of the succeeding layer,
which guarantees that there are no memory access conflicts when its pixels are updated
while the matrix is read for that FEL.
This behaviour, denoted as forward calculation (Figure 4), allows the work within a FEL to be parallelized.

Figure 4. Forward calculation
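A simplified sketch of what such a pull-based kernel could look like is shown below; the data layout, the one-to-one pixel correspondence, and all names are assumptions made for illustration only.

// Each thread owns one pixel of one FEL in the current layer and pulls
// the contributions it needs from the succeeding layer through the
// inverted connection matrix, so no two threads write the same address.
__global__ void pull_local_gradients(float* grad,           // current layer FELs
                                     const float* nextGrad, // succeeding layer FELs
                                     const float* weights,
                                     const int* invConn,    // inverted connection matrix
                                     int connPerFel,        // entries per FEL
                                     int felW, int felH)
{
    int x   = blockIdx.x * blockDim.x + threadIdx.x;
    int y   = blockIdx.y * blockDim.y + threadIdx.y;
    int fel = blockIdx.z;                         // FEL being computed
    if (x >= felW || y >= felH) return;

    float sum = 0.0f;
    for (int c = 0; c < connPerFel; ++c) {
        int src = invConn[fel * connPerFel + c];  // connected FEL in next layer
        if (src < 0) continue;                    // unused slot in the matrix
        sum += weights[src] * nextGrad[src * felW * felH + y * felW + x];
    }
    grad[fel * felW * felH + y * felW + x] = sum; // single writer per pixel
}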
The last phase in the training process is the update-weights phase. It consists of two parts: a summation
of the local gradients followed by an update of the weights, and a calculation of the bias values per layer.
The bias values are calculated using the standard optimized way of summing arrays on GPUs.
Updating the delta weights requires writing to the same memory address a number of times equal to the size of
the output map, so these sections of the algorithm are mapped to a single thread block. The remaining block dimensions
are used to handle output maps, algorithm kernels, and inputs per output map, which avoids synchronization issues
between CUDA threads writing to the same memory address. Each thread iterates over a set of pixels of the input
associated with the output map. Once all threads complete, their results are added and stored in the final weight location.
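The following sketch illustrates this one-block-per-weight pattern together with the shared-memory reduction that is also used for the bias sums; the indexing and names are assumptions, and the block size is assumed to be a power of two with the shared-memory size set at launch time.

// One thread block per weight, so every write to that weight stays inside
// the block. Each thread accumulates a strided subset of the output-map
// pixels; a shared-memory tree reduction combines the partial results.
__global__ void update_delta_weights(float* deltaW,
                                     const float* localGrad,
                                     const float* input,
                                     int pixelsPerMap, float learningRate)
{
    extern __shared__ float partial[];
    int w   = blockIdx.x;                    // one block per weight
    int tid = threadIdx.x;

    float sum = 0.0f;
    for (int p = tid; p < pixelsPerMap; p += blockDim.x)
        sum += localGrad[p] * input[w * pixelsPerMap + p];
    partial[tid] = sum;
    __syncthreads();

    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) partial[tid] += partial[tid + stride];
        __syncthreads();
    }
    if (tid == 0) deltaW[w] += learningRate * partial[0];
}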
Batch learning
In batch mode, the network is trained on batches of images. For each batch, the delta values for the biases and weights are accumulated,
and only after the whole batch has been processed are the previous bias and weight values updated. This raises the open issue that batch
learning may take more iterations to reach the desired error minimum. However, since the convergence speed depends on many
factors, such as the learning rate, the batch size, and the number of neuron layers, no general conclusion can be drawn about which mode
is better.
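Reusing the hypothetical Network and Pattern types from the earlier online-mode sketch, the batch-mode accumulation could be outlined as follows; zero_deltas, accumulate_deltas, and apply_deltas are illustrative helpers, not actual functions of the implementation.

// Hypothetical outline of batch-mode accumulation: the per-pattern deltas
// are summed over the whole batch and applied once at the end.
void zero_deltas(Network&);                   // reset accumulated dW and dB
void accumulate_deltas(Network&);             // add this pattern's gradients
void apply_deltas(Network&, float learning_rate);

void train_batch(Network& net, const std::vector<Pattern>& batch, float lr)
{
    zero_deltas(net);
    for (const Pattern& p : batch) {
        run_convolution(net, p);
        compute_local_gradients(net, p);
        accumulate_deltas(net);
    }
    apply_deltas(net, lr);                    // single weight/bias update per batch
}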
Batch training maps naturally onto multiple GPUs. First, all GPUs start with the same weights and
bias values; then the images of the batch are split equally between the GPUs. These two steps make up the setup
for batch training.
Once the images are distributed, each GPU processes its share and generates the delta values locally.
When a GPU finishes processing, it sends its delta values to the GPU that is in charge of summing
the results. The delta values from all GPUs are then summed and added to the
previous bias and weight values. Finally, the updated bias and weight values are copied back to all GPUs.
The weights and bias update is performed on a single GPU because the summation of the values
is itself highly parallelizable. Devices of CUDA compute capability 2.x provide a mechanism for copying memory
directly between devices, which gives a substantial speedup. By using one device as a master, the unnecessary
overhead of copying between host and device memory is avoided.
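As an illustration of this gathering step, the master GPU could pull each worker's delta buffer with a peer copy and accumulate it with a small kernel. The buffer names and the add_arrays kernel are assumptions, and direct peer copies require devices of compute capability 2.x with CUDA 4.0 or later.

#include <cuda_runtime.h>

// Small helper kernel: element-wise accumulation on the master GPU.
__global__ void add_arrays(float* dst, const float* src, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] += src[i];
}

// The master GPU (device 0) pulls each worker's delta buffer with a
// direct device-to-device copy and accumulates it before the weights
// and biases are updated and broadcast back.
void gather_deltas(float* d_masterDeltas,   // lives on device 0
                   float** d_workerDeltas,  // one buffer per GPU
                   float* d_scratch,        // staging buffer on device 0
                   int numGpus, int n)
{
    cudaSetDevice(0);
    for (int dev = 1; dev < numGpus; ++dev) {
        cudaMemcpyPeer(d_scratch, 0, d_workerDeltas[dev], dev,
                       n * sizeof(float));  // bypasses host memory
        add_arrays<<<(n + 255) / 256, 256>>>(d_masterDeltas, d_scratch, n);
    }
    cudaDeviceSynchronize();
}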
Performance gains
Table 1 contains the execution time of each function call on the CPU and the GPU, respectively, as well as the performance gain
from parallelization. The function
run_convolution_layer shows the largest gain, mostly because the sigmoid function
has no data dependencies,
so its execution is completely parallel on the GPU.
Table 2 shows the execution times of the same operations on subsequent runs.
We used static variables to avoid allocating and freeing memory on every execution of the GPU functions.
Table 3 shows the results of processing four images with the batch method on four GPUs (one image per device).
Since the amount of data that has to be transferred to and summed on the master GPU is independent of the number of images,
and the training of one image does not depend on the other devices, the speed-up of the batch method increases linearly with
the number of images processed per GPU.
However, the accuracy of the algorithm decreases accordingly.
Furthermore, the testing of the network, which is done after learning, cannot be parallelized across multiple GPUs, so the
testing time is the same as in online training. Processing the same number of images with the
online method takes 2.68745 ms, and this time increases linearly with the input size.
One of the biggest bottlenecks of the proposed solution is that all data has to be copied to the master GPU, which
computes the weight and bias values. Every GPU has to wait until this copying is finished before the delta values can be processed, and
the master GPU in turn has to wait before it can send the updated values back to the other devices.
However, the latency introduced by this synchronization is smaller than the gain obtained from the faster training.
Batch learning should be preferred when training the neural network because it allows more GPUs to be involved in
the processing. The number of images per GPU, as well as the number of iterations needed to train the network,
should be chosen carefully so that the accuracy is not affected. Unfortunately, there are no established
techniques or heuristics for determining these parameters, since this field is still under heavy research.
Conclusion
In conclusion, Convolutional Neural Networks are highly suitable for parallelization on
GPUs. This makes it possible to reduce the training time from several days or weeks
to the order of minutes or hours, depending on the network size and the image resolution.
The standard implementation of the compute-local-gradients phase does not exploit the full
parallelism that a CUDA device provides, owing to the nature of the pushing algorithm.
In contrast, the pulling version achieves a considerable improvement in execution time.
By using batch-mode learning, the training was distributed over multiple GPUs within a single
node of the cluster. Running the training on multiple GPUs decreases the overall execution
time even further compared with the results obtained with online-mode training, which uses
a single GPU.
However, the potential performance improvement achievable by running batch learning across several
cluster nodes would be negligible because of the overhead generated by copying data
between them.
The optimal solution would be to train a different network in each cluster node using batch
mode. Since these networks would be independent, different configurations are possible.
After training, these networks can be compared in order to decide which one has the highest
accuracy. Such an approach would utilize the full computational power of the cluster.