---
title: Training Speedup of Deep Neural Networks
author: Parnian Zameni
date: April 13, 2023
---

   # Introduction
1. [BPPSA: Scaling Back-propagation by Parallel Scan Algorithm](https://proceedings.mlsys.org/paper/2020/file/96da2f590cd7246bbde0051047b0d6f7-Paper.pdf)
2. [Scans as Primitive Parallel Operations](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=42122)
3. [SLIDE: In Defense of Smart Algorithms over Hardware Acceleration for Large-Scale Deep Learning Systems](https://proceedings.mlsys.org/paper/2020/file/65b9eea6e1cc6bb9f0cd2a47751a186f-Paper.pdf)

Deep learning models have achieved outstanding performance in many fields of artificial intelligence, such as computer vision and natural language processing. This performance is owed to the massive amount of data available on the internet and to powerful computation resources. As the size of the available data grows, the complexity of the models must grow as well to capture the intricate patterns and relations in the data. This increase in complexity often results in longer training times and the need for more powerful computational resources.
   
Deep learning models are a type of machine learning model that uses neural networks with multiple layers to learn and extract features from input data, and they are trained on data to predict a desired output. Their performance largely depends on the data available and on the computational resources used to train them. Training deep neural networks consists mostly of large matrix multiplications and is a memory- and computation-intensive task; however, the resources available to train these models are limited.

Because of the high amount of computation involved in training deep learning models, two general lines of research have emerged to make the training process of these large models efficient. One approach is to parallelize the computations using graphics processing units (GPUs). GPUs are specialized hardware that can perform many operations in parallel, making them well suited for the large matrix multiplications required in deep learning. By utilizing GPUs, the training time of deep neural networks can be significantly reduced, allowing for faster experimentation and iteration. Another approach is to design algorithms that efficiently use the available computational resources, including central processing units (CPUs).
   
This report discusses and compares two main approaches to optimizing the training process. The Background section reviews several concepts that are essential for the rest of the paper. The Approach section describes common approaches to optimizing the training of deep neural networks. The Strengths and Weaknesses section compares the methods and discusses their strengths and weaknesses.


   # Background
This section reviews several background concepts used in the rest of the paper: neural networks, the training of neural networks, and parallel scans.

   ## Neural Networks
Neural networks are computing systems consisting of units called neurons, inspired by neurons in the brain. A neural network is made of several layers, and each layer stacks several neurons together. A neuron is a computing unit that takes several inputs, computes a weighted sum of them, and applies an activation function to produce its output.

There are various layer architectures in neural networks, such as fully connected layers, convolutional layers, and recurrent cells. Their common characteristic is that they have learnable weights. These weights can be optimized so that the network produces better predictions on a task such as classification or regression. Optimization is defined as minimizing a loss function with respect to the weights of the neural network.
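As a concrete illustration (not taken from any of the papers), the sketch below implements a single fully connected layer whose neurons apply an activation function to a weighted sum of their inputs; the layer sizes and the ReLU activation are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

class DenseLayer:
    """A fully connected layer: activation(W @ x + b)."""
    def __init__(self, in_dim, out_dim):
        # Learnable weights and biases, randomly initialized.
        self.W = rng.normal(scale=0.1, size=(out_dim, in_dim))
        self.b = np.zeros(out_dim)

    def forward(self, x):
        # Each neuron applies the activation to a weighted sum of the inputs.
        return relu(self.W @ x + self.b)

layer = DenseLayer(in_dim=4, out_dim=3)
print(layer.forward(rng.normal(size=4)))  # three neuron outputs
```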

   ## Training Neural Networks
Training deep learning models is composed of two parts: the forward pass and the backward pass. In the forward pass, the data is passed through the network layers, and the activation values of each layer are computed. In the backward pass, the gradients of the loss function with respect to all of the parameters in the network are computed, and the parameters are updated. The backward pass can be costly; it uses the backpropagation algorithm to compute the gradient of the loss function with respect to each parameter of the model.

In the backward pass, the backpropagation algorithm first computes the gradient with respect to the parameters of the last layer. The gradients of each earlier layer are then computed using the gradients already computed for the layer after it. The appealing characteristic of this algorithm is that the gradient of each layer does not have to be computed independently of the other layers: the gradients computed for the later layer are reused in the calculation for the current layer, so computation is shared.
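To make the reuse explicit, write layer $i$ as $x_i = f_i(x_{i-1}; \theta_i)$ with loss $\ell$ (this notation is introduced here only for illustration). The chain rule gives

$$
\frac{\partial \ell}{\partial x_{i-1}}
= \left(\frac{\partial x_i}{\partial x_{i-1}}\right)^{\!\top}\frac{\partial \ell}{\partial x_i},
\qquad
\frac{\partial \ell}{\partial \theta_i}
= \left(\frac{\partial x_i}{\partial \theta_i}\right)^{\!\top}\frac{\partial \ell}{\partial x_i},
$$

so each $\partial \ell / \partial x_i$ is computed once and reused both for the parameter gradient of layer $i$ and for the recursion down to layer $i-1$.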

   ## Parallel Scan
Paper [2] presents a parallel scan algorithm to be used as a primitive operation on a Parallel Random Access Machine (PRAM). According to the paper, using the proposed parallel scan as a primitive can reduce the running time of some algorithms by a factor of O(log n) and reduce the number of steps they require. The scan presented in paper [2] consists of two phases, an upsweep and a downsweep, and in total the algorithm takes 2 log n steps.

The parallel scan is performed on a binary tree structure in two phases: upsweep and downsweep. The leaves of the tree are the elements of the array for which we want to compute the scan.

Each step of the upsweep processes one level of the binary tree. Each internal unit takes its two children, outputs the sum of their values, and also stores the value of its left child. The algorithm therefore starts at the leaves of the binary tree and reaches the root at the end.

In the downsweep, at each level, the value coming from the parent of a unit is passed to its left child, and the value coming from the parent is added to the value stored in the unit and passed to its right child. The values that arrive at the leaves of the tree are the result of the scan operation on the input array.
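The following is a minimal array-based sketch of the two-phase scan described above, written as sequential Python for clarity; it assumes the input length is a power of two and uses addition as the scan operator, which corresponds to the exclusive prefix-sum variant rather than any particular implementation from paper [2].

```python
def blelloch_exclusive_scan(a):
    """Work-efficient exclusive prefix-sum scan (upsweep + downsweep).

    Assumes len(a) is a power of two. Within each level the inner-loop
    iterations are independent, so on a parallel machine every level takes
    one step: about 2*log2(n) steps in total.
    """
    x = list(a)
    n = len(x)

    # Upsweep: each internal node ends up holding the sum of its subtree.
    d = 1
    while d < n:
        for i in range(0, n, 2 * d):            # independent -> parallelizable
            x[i + 2 * d - 1] += x[i + d - 1]
        d *= 2

    # Downsweep: push prefixes back down the tree.
    x[n - 1] = 0                                # identity placed at the root
    d = n // 2
    while d >= 1:
        for i in range(0, n, 2 * d):            # independent -> parallelizable
            left = x[i + d - 1]                 # left-subtree sum stored during the upsweep
            x[i + d - 1] = x[i + 2 * d - 1]     # parent's value goes to the left child
            x[i + 2 * d - 1] += left            # parent's value plus left sum goes to the right child
        d //= 2
    return x


print(blelloch_exclusive_scan([3, 1, 7, 0, 4, 1, 6, 3]))
# -> [0, 3, 4, 11, 11, 15, 16, 22]
```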



   # Approach

There have been several approaches to improving the training efficiency of deep learning models. These approaches can be grouped into data parallelism, model parallelism, and pipeline parallelism; optimizing algorithms on GPUs; designing special hardware for deep learning models; and optimizing algorithms on CPUs.

In data parallelism, the whole model is replicated on every device, and the data is partitioned among the devices. In this method, the cost of synchronizing the updated parameters between devices can be kept very low; however, the method does not work if the model is too large to fit on a single device. In model parallelism, the model itself is distributed across several devices. Naive model parallelism is not practical because it underutilizes the devices, as only one device is active at a time. Pipeline parallelism has been proposed to address this underutilization, but it is limited by the memory of each device.
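As a toy illustration of data parallelism (not taken from any of the papers), the sketch below splits a batch across four simulated devices, lets each compute a local gradient for a simple linear model, and averages the results; the averaging stands in for the all-reduce step that real systems perform across devices.

```python
import numpy as np

rng = np.random.default_rng(0)

# A linear model y = X @ w with a squared-error loss, replicated on 4 "devices".
w = rng.normal(size=8)
X, y = rng.normal(size=(64, 8)), rng.normal(size=64)

def local_gradient(w, X_shard, y_shard):
    # Gradient of 0.5 * mean((X w - y)^2) computed on this shard only.
    residual = X_shard @ w - y_shard
    return X_shard.T @ residual / len(y_shard)

# Data parallelism: split the batch across devices, average the local gradients.
shards = zip(np.array_split(X, 4), np.array_split(y, 4))
grads = [local_gradient(w, Xs, ys) for Xs, ys in shards]
grad = np.mean(grads, axis=0)   # stands in for an all-reduce across devices
w -= 0.1 * grad                 # every replica applies the same update
```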

Designing hardware specific to deep learning models has also been explored as a way to optimize the training of neural networks. This approach is not considered very practical because deep learning model designs evolve quickly: methods that once had outstanding performance are replaced by better models and designs and fall out of use. Hardware built for one specific design becomes useless once newer techniques are proposed.

While different parts of training a deep neural network can benefit from improvements, paper [1] identifies the backward pass as the bottleneck and targets optimizing it on the GPU. Another category of approaches avoids GPUs and instead relies on the computing power of many CPU cores working in parallel; this approach is presented in paper [3].

The following two subsections discuss the strategies that target optimizing algorithms on the GPU and on the CPU, respectively.

## Optimizing Algorithms on GPU

Paper [1] proposes a method to speed up the training of neural networks on the GPU. After the training data is passed through a deep learning model, a loss function is computed from the model's outputs. In the backward pass, the gradient of the loss with respect to the outputs of the last layer of the network is computed first. Then, for each layer, the gradients of that layer's output with respect to its parameters are computed. The gradient of an earlier layer can be computed from the gradient of a later layer using the chain rule.
   
The backpropagation algorithm is sequential by nature: the gradients calculated in the later layers are used in the earlier layers. To reuse the computation from the later layers, one has to wait for the calculation in those layers to finish before computing the gradients of the earlier layers. Because of this characteristic, the original formulation of the backpropagation algorithm cannot be run in parallel.

The novelty of paper [1] is that, by using the existing parallel scan algorithm presented in paper [2], the backpropagation algorithm is reformulated so that it can run in parallel. The difference between the scan in paper [2] and the one in paper [1] is the scan operation: paper [2] uses operations such as sum or scalar multiplication, whereas the operations used in paper [1] in the upsweep and downsweep are matrix multiplications.
   
The reformulation is as follows. A deep learning model can be modeled as a function of the model's inputs, and each layer can be modeled as a function with learnable parameters. After this function is evaluated, the loss function is used to evaluate the model's output. To update the parameters of each layer, the derivative of the loss function with respect to the parameters of that layer must be calculated.
   
It can be shown that the gradients with respect to the output of each layer can be computed recursively. Furthermore, the gradients with respect to the parameters can be written, using the chain rule, as the product of two terms:
   1. the gradient of the loss function with respect to the output of a layer.
   2. the gradient of the output of a layer with respect to the parameters of the layer. 
   
As a result, the backpropagation algorithm can be reformulated as a scan operation on an array defined as follows. The first element of the array is the gradient of the loss function with respect to the activations of the last layer. The remaining elements are the gradients of each layer's activations with respect to the layer before it. The scan algorithm described in paper [2] is then used to compute the cumulative multiplication of these gradients.
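To illustrate why this reformulation is possible, the sketch below uses a toy deep linear network (an assumption made only for this example, not the formulation of paper [1]): the gradients with respect to the activations are cumulative products of transposed layer Jacobians, and because matrix multiplication is associative these products can be regrouped freely, which is exactly what a parallel scan exploits.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy deep linear network: x_i = W_i @ x_{i-1}, so the Jacobian dx_i/dx_{i-1} is W_i.
n_layers, dim = 6, 4
Ws = [rng.normal(scale=0.5, size=(dim, dim)) for _ in range(n_layers)]
g_last = rng.normal(size=dim)                   # dL/dx_n: gradient w.r.t. the last activation

# Linear scan (ordinary backpropagation): one layer after another.
linear_scan = [g_last]
for W in reversed(Ws):
    linear_scan.append(W.T @ linear_scan[-1])   # dL/dx_{i-1} = W_i^T  dL/dx_i

# The k-th result is the product W_{n-k+1}^T ... W_n^T g_last. Matrix multiplication
# is associative, so the partial products can be regrouped, e.g. by first combining
# neighbouring Jacobians in parallel, as a parallel scan's upsweep would do.
Js = [W.T for W in reversed(Ws)]
pair_products = [Js[i + 1] @ Js[i] for i in range(0, n_layers, 2)]   # combinable in parallel
regrouped, regrouped_results = g_last.copy(), []
for P in pair_products:
    regrouped = P @ regrouped
    regrouped_results.append(regrouped)

# Every second entry of the sequential scan agrees with the regrouped products.
for k, r in enumerate(regrouped_results, start=1):
    assert np.allclose(r, linear_scan[2 * k])
```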

Since paper [1] proposes a reformulation of the backpropagation algorithm and not an approximation, the algorithm produces the same updates as the original backpropagation algorithm would. As a result, the convergence of the proposed method does not differ from that of the baseline backpropagation algorithm, and the results shown in the paper reflect this property.
   
The step complexity is the minimum number of steps needed to finish execution. For the reformulated backpropagation, the step complexity is Θ(n/p + log p) when the number of workers p is smaller than the number of elements n in the array, and Θ(log n) otherwise. For the linear scan, which corresponds to the original way of performing backpropagation, the step complexity is Θ(n). Thus, in the best case, where there are at least as many workers as entries in the input array, the runtime of the algorithm is reduced significantly.
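Writing $n$ for the number of elements in the array and $p$ for the number of workers (symbols introduced here only to restate the paragraph above in formula form), the step complexities compared are

$$
S_{\text{linear}}(n) = \Theta(n),
\qquad
S_{\text{scan}}(n, p) =
\begin{cases}
\Theta\!\left(\tfrac{n}{p} + \log p\right) & \text{if } p < n,\\
\Theta(\log n) & \text{if } p \ge n.
\end{cases}
$$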

Paper [1] evaluates the method on recurrent neural networks (RNNs) and GRUs. Using their method, the backward pass of training is sped up by a factor of 16, and overall training is sped up by a factor of 2.73. According to the experiments, when the length of the input sequence grows far beyond the number of workers, the speedup gain is no longer significant; the speedup of the method is therefore bounded by the number of workers.

## Optimizing Algorithms on CPU

Paper [3] proposes a method called SLIDE to make the training of neural networks faster on the CPU, a different path from the approach in the previous subsection. By using randomized algorithms and parallelism on the CPU, SLIDE reduces the time required to train a neural network.

The proposed method is based on locality-sensitive hashing (LSH). Locality-sensitive hash functions are a family of functions for which similar inputs collide with higher probability than dissimilar inputs. A sufficient property for a family of hash functions to be an LSH family is that the probability of collision increases monotonically with the similarity of the inputs. LSH functions are used to solve nearest-neighbor search in sub-linear time, since they map similar inputs to the same buckets.
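As a small illustration of the LSH property, the sketch below uses signed random projections (SimHash), one standard LSH family for cosine similarity; it is shown only as an example and is not necessarily the exact hash family used in paper [3].

```python
import numpy as np

rng = np.random.default_rng(0)

def simhash(x, planes):
    """Signed random projections: one bit per random hyperplane."""
    return tuple((planes @ x) > 0)

dim, n_bits = 16, 8
planes = rng.normal(size=(n_bits, dim))

a = rng.normal(size=dim)
b = a + 0.1 * rng.normal(size=dim)      # similar to a
c = rng.normal(size=dim)                # unrelated to a

# Similar vectors agree on most bits (collide with high probability);
# dissimilar ones agree on roughly half of them.
print(sum(p == q for p, q in zip(simhash(a, planes), simhash(b, planes))))  # close to 8
print(sum(p == q for p, q in zip(simhash(a, planes), simhash(c, planes))))  # around 4
```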

Paper [3] speeds up training by using LSH to sample neurons, so that the set of neurons involved in the computations is sparse and limited. In other words, the LSH scheme is used to sample neurons, and only the outputs of those neurons are computed. To construct the LSH structure, L independent hash tables are created, and each hash table uses a meta hash function formed by concatenating K hash functions. Since only close neighbors are likely to match all K hash values, the buckets become sparse. It must be noted that LSH can be slow when used for exact search; therefore, an exact search is not performed.
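The sketch below illustrates this construction on toy data: `L` tables, each keyed by the concatenation of `K` sign-bit hashes, map buckets to neuron ids, and a query returns the union of the matching buckets as candidate active neurons. The helper names (`make_meta_hash`, `candidate_neurons`) and the sizes are hypothetical, not taken from the paper's implementation.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(1)

K, L, dim, n_neurons = 4, 3, 16, 256

def make_meta_hash(K, dim):
    planes = rng.normal(size=(K, dim))
    return lambda x: tuple((planes @ x) > 0)      # K concatenated sign bits

meta_hashes = [make_meta_hash(K, dim) for _ in range(L)]
weights = rng.normal(size=(n_neurons, dim))       # one weight vector per neuron

# Build: hash every neuron's weight vector into each of the L tables.
tables = [defaultdict(set) for _ in range(L)]
for table, h in zip(tables, meta_hashes):
    for neuron_id, w in enumerate(weights):
        table[h(w)].add(neuron_id)

# Query: the layer input retrieves candidate neurons whose weights collide with it.
def candidate_neurons(x):
    out = set()
    for table, h in zip(tables, meta_hashes):
        out |= table.get(h(x), set())
    return out

print(len(candidate_neurons(rng.normal(size=dim))), "of", n_neurons, "neurons retrieved")
```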

Paper [3] uses three different strategies for sampling neurons from the hash tables, in order to optimize the sampling process. These sampling methods are as follows:

1. Vanilla sampling: a table is chosen at random, and the neurons in the bucket corresponding to the query are retrieved. This continues until the desired number of neurons has been selected or all the tables have been looked up.

2. TopK sampling: the neurons that occur most frequently across all the hash tables are chosen. First, the input is used as a query, and all the neurons in the corresponding bucket of each hash table are retrieved. The frequencies of these neurons are accumulated and sorted, and the neurons with the top frequencies are selected.

3. Hard thresholding: the neurons are retrieved in the same way as in TopK sampling, but instead of sorting, only the neurons that are retrieved at least a certain number of times are chosen.

Among the three strategies, vanilla sampling is shown to be more time efficient than the other two; a small sketch of it follows.
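This rough sketch of vanilla sampling reuses the `tables` and `meta_hashes` structures from the sketch above; the function name and the `budget` parameter are hypothetical. Tables are visited in random order and neurons are collected from the matching bucket until enough have been gathered or all tables have been visited.

```python
import random

def vanilla_sample(tables, meta_hashes, x, budget):
    """Collect candidate neuron ids until `budget` is reached or tables run out."""
    active = set()
    order = list(range(len(tables)))
    random.shuffle(order)                        # visit the tables in random order
    for t in order:
        bucket_key = meta_hashes[t](x)           # hash the query for table t
        active |= tables[t].get(bucket_key, set())
        if len(active) >= budget:
            break
    return active
```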

In the SLIDE algorithm, each layer consists of a list of neurons together with the LSH sampling hash tables. The neurons are hashed into buckets, and each hash table stores the ids of neurons. In the initialization phase, the weights of the layers are randomly initialized, after which the LSH functions are initialized. In the feed-forward pass, not all of a layer's output values are calculated; instead, a set of active neurons is chosen, and only the outputs of those neurons are computed.

To select the active neurons, the input of the layer is passed to the hash functions, and the resulting hash serves as a query to retrieve the ids of the active neurons. The outputs of the neurons that are not chosen are set to zero and are not used in later layers. Likewise, if there is a softmax layer after the last layer of the network, only the active neurons are used to calculate the softmax, and the other outputs are treated as zero. The softmax function contains a normalization term that sums over the exponentials of all of the network's outputs; with the modifications in this algorithm, that sum is computed only over the active neurons.
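The sketch below illustrates this sparse output computation (the names `W`, `b`, and `sparse_forward` are illustrative, not the paper's implementation): only the rows of the weight matrix belonging to active neurons are multiplied with the input, inactive outputs are treated as zero, and the softmax normalization sums only over the active set.

```python
import numpy as np

rng = np.random.default_rng(2)

n_neurons, dim = 256, 16
W = rng.normal(size=(n_neurons, dim))
b = np.zeros(n_neurons)

def sparse_forward(x, active_ids):
    """Compute outputs only for the active neurons; all others are treated as zero."""
    active_ids = np.fromiter(active_ids, dtype=int)
    logits = W[active_ids] @ x + b[active_ids]    # only |active| dot products
    # Softmax normalized over the active set only.
    exp = np.exp(logits - logits.max())
    probs = np.zeros(n_neurons)
    probs[active_ids] = exp / exp.sum()
    return probs

out = sparse_forward(rng.normal(size=dim), active_ids={3, 42, 77, 100})
print(out.nonzero()[0], out.sum())                # only the active ids; sums to 1
```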

Because conventional matrix-multiplication approaches are not used in this algorithm, training with it differs from other strategies in several ways. For instance, as opposed to most approaches implemented on GPUs, only one sample at a time is passed through the network. Furthermore, as illustrated above, the method makes the computations over the weight matrices of the feed-forward network sparse.

After the forward pass is done, the error is backpropagated through the network, and the backward pass is sparse as well: not all of a layer's weights are updated. Only the neurons that were active during the forward pass are updated, and the rest are left unchanged. Rather than using vector multiplication to calculate the updates in the backward pass, a message-passing implementation of backpropagation is used.

Since message passing is used to calculate the network outputs, the inactive neurons are never accessed in the computations. After the network weights are updated, the positions of the neurons in the hash tables must also be updated, which may require removing a neuron from one bucket and adding it to another. Even though each sample is passed through the feed-forward phase one at a time, batch gradient descent is used: the samples of a batch go through the network independently of one another and in parallel, on different threads.

Additionally, to ensure that the computations performed for different samples are independent, each neuron keeps track of the batch using an array whose length equals the batch size. This array contains the neuron's input information and error gradients. Every input has its own unique id, and the activations and gradient errors are accessed using that id as an index. This approach makes the updates to the network weights sparse, so the probability of two threads updating the weights of the same neuron is very low. Even so, no synchronization is performed between parallel updates.
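A minimal sketch of this per-neuron bookkeeping, with a hypothetical `Neuron` class (not the paper's implementation): activation and gradient arrays of length `batch_size` are indexed by the sample id, so threads processing different samples of the same batch never write to the same slot, and a neuron's weights are touched only when it was active for that sample.

```python
import numpy as np

class Neuron:
    """Per-neuron state for one batch; one slot per sample id."""
    def __init__(self, in_dim, batch_size, rng):
        self.w = rng.normal(size=in_dim)
        self.activation = np.zeros(batch_size)   # indexed by sample id
        self.grad = np.zeros(batch_size)         # indexed by sample id

    def forward(self, sample_id, x):
        # ReLU of a weighted sum, stored in this sample's slot only.
        self.activation[sample_id] = max(0.0, self.w @ x)
        return self.activation[sample_id]

    def backward(self, sample_id, upstream_grad, x, lr=0.01):
        # Only neurons that were active for this sample ever reach this point,
        # so the weight updates across a batch stay sparse.
        self.grad[sample_id] = upstream_grad * (self.activation[sample_id] > 0)
        self.w -= lr * self.grad[sample_id] * x

rng = np.random.default_rng(3)
neuron = Neuron(in_dim=8, batch_size=4, rng=rng)
x = rng.normal(size=8)
neuron.forward(sample_id=2, x=x)                 # thread handling sample 2 writes slot 2
neuron.backward(sample_id=2, upstream_grad=1.0, x=x)
```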

   # Strengths and Weaknesses

   In this section, the strengths and weaknesses of paper [1] and paper [3] are enumerated.  

Considering the limitations of computer systems, while the approach proposed in paper [1] shows excellent results in improving the training time of one category of neural networks on GPUs, it has several drawbacks. Paper [1] uses a binary tree structure to perform the scan algorithm, and in a binary tree the number of nodes in one level is half the number of nodes in the level below it. At the start of the upsweep, the number of workers needed is ideally equal to the number of leaves of the tree, since at each step two children are taken and processed. After one step, the number of nodes in the level is halved, and half of the workers are no longer used because of the structure of the binary tree. In other words, at each iteration the number of active workers is half of the active workers in the previous iteration, and the other half sit idle. As a result, the hardware resources are not fully utilized. The same holds for the downsweep: in the first few iterations most of the workers are idle, and more workers become utilized only as the algorithm approaches the leaves of the tree.

Another limitation of the algorithm in paper [1] is the assumption that the number of entries in the list on which the scan is performed is a power of two, so that it can be modeled as a binary tree. This is not the case for every input: the number of layers of a model plus one may well not be a power of two. One solution could be to partition the list into two parts, one whose length is a power of two and one containing the rest of the array, and to perform the parallel scan on the power-of-two part. Overall, the number of steps needed to complete the algorithm would then be higher than the number reported in the paper.

Additionally, the paper evaluates the performance of the algorithm only on recurrent neural networks, and all the results are reported on recurrent models. In recurrent neural networks, the chain along which the gradients are accumulated can be very long. This is not true for other neural networks, such as convolutional neural networks and fully connected layers, where most of the computation in the backward pass happens within each layer rather than in the accumulation of gradients across layers. For example, consider a large convolutional neural network such as ResNet-101, which has about 100 layers. If the algorithm proposed in paper [1] computes the gradients at each layer, the scan is performed on an array of size of only about 100. It is unclear whether the algorithm would achieve the same speedup on convolutional and fully connected models as it does on recurrent neural networks.

One of the crucial strengths of paper [1] is that the method is not limited by the memory capacity of each device, unlike other mechanisms such as pipeline parallelism. Additionally, paper [1] achieves considerable speedup in the backward pass of training recurrent neural networks by building on existing algorithms.

Paper [3] proposes a method to speed up training on the CPU. The authors implement their algorithm on fully connected neural networks and show improved training performance. However, similar to paper [1], they do not show results on other types of neural networks; convolutional neural networks, for instance, can be viewed as a sparse, weight-shared version of fully connected layers. It also needs to be clarified how the convergence of training is affected: it is possible that the proposed algorithm gains training speed by sacrificing the accuracy of the neural network.

There is another drawback in the method proposed in paper [3]. The authors assume that memory access is essentially random and that, under this assumption, the cache is still used efficiently. However, it is possible that the cache is not utilized in the most efficient way. CPUs have L1, L2, and L3 caches, with L1 being the fastest. If the CPU requests data that is not in the L1 cache, the L2 cache is accessed; if the data is not in L2, the L3 cache is accessed; and if the requested data is in none of the caches, it is fetched from main memory. Furthermore, a processor may fetch not just the specific cache line requested but also the consecutive cache lines, since it assumes that data accesses will be sequential and that nearby addresses will be requested soon. Since the SLIDE algorithm does not perform sequential reads, the cache may not be fully utilized.


   # Conclusion
This survey highlights two distinct methods for improving the efficiency of training neural networks. The emergence of powerful GPUs has fueled a significant focus on parallelizing the various training operations. Paper [1] specifically targets the backward pass of training on the GPU; to accomplish this, the backward pass is restructured as a parallel scan, as proposed in paper [2]. Paper [3], on the other hand, puts forth a new training algorithm that exclusively employs CPUs, with no use of GPUs. This algorithm suggests that it is possible to train neural networks without computing the output of all neurons: a sampling approach selects a fraction of the neurons, only the activated neurons are used in the forward pass, and gradient updates are performed only for these active neurons. The survey concludes by assessing the pros and cons of both papers. Overall, these findings offer valuable insight into the ongoing efforts to improve the efficiency of neural network training.