Title: Survey of Deep Learning Acceleration

Author: Shih-Po Lee

Abstract

This survey paper explores the acceleration of Deep Learning (DL) applications in the training and inference phases. It mainly focuses on three works: "SLIDE: In Defense of Smart Algorithms over Hardware Acceleration for Large-Scale Deep Learning Systems", "Accelerating SLIDE Deep Learning on Modern CPUs: Vectorization, Quantizations, Memory Optimizations, and More", and "Fine-Grained GPU Sharing Primitives for Deep Learning Applications".

The first work proposed a flexible and efficient CPU-based system called SLIDE (Sub-LInear Deep learning Engine). It makes DL training 3.5x faster than a Tesla V100 running TensorFlow (TF) on the same network architecture. The second work is an enhanced version of the first, called Optimized SLIDE. It implemented SLIDE on the x86 architecture with more modern CPU functionalities such as vectorization, quantizations, and memory optimizations, and it outperforms the first work with a 2-7x training speedup on the same CPU. The third work addressed the issue of GPU underutilization. It proposed fast job switching so that DL applications on the same GPU can switch quickly, and memory sharing to pack different DL tasks onto the same GPU. All of these works introduced solutions to optimize DL training and achieved state-of-the-art performance in terms of faster training speed and better memory usage while sustaining the original DL models' accuracy.

Introduction

Deep Learning (DL) techniques have been widely studied and applied due to the advancement of Graphics Processing Units (GPUs). They are mostly based on Neural Networks (NNs), which involve a large number of matrix multiplications. One classic type of NN is the fully-connected model. It consists of multiple layers, which are learnable matrices serving as the parameters of the model. Each layer is usually followed by a non-linear activation function such as ReLU, Sigmoid, or Tanh. Every training iteration contains a forward pass and a backward pass.

In the forward pass, inputs are passed through all the layers of the fully-connected model to get the final outputs. The outputs and the targets are then processed by a pre-defined loss function (e.g., Cross-entropy, Mean Square Error) to obtain the loss. Based on the loss, the model launches the backward pass, using back-propagation to get the gradient of every parameter and update it accordingly. Since input data can be processed independently, the more processing units the machine has, the faster the training can be.
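
To make the iteration concrete, the following is a minimal NumPy sketch of one training step for a toy single-hidden-layer network with ReLU and a mean-square-error loss; the sizes, learning rate, and plain SGD update are illustrative choices, not taken from any of the surveyed works.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy fully-connected network: input -> hidden (ReLU) -> output
x = rng.standard_normal((8, 16))        # batch of 8 inputs, 16 features each
y = rng.standard_normal((8, 4))         # regression targets
W1 = rng.standard_normal((16, 32)) * 0.1
W2 = rng.standard_normal((32, 4)) * 0.1
lr = 1e-2

# Forward pass
h_pre = x @ W1                          # pre-activation of hidden layer
h = np.maximum(h_pre, 0.0)              # ReLU
out = h @ W2                            # network output
loss = np.mean((out - y) ** 2)          # mean-square-error loss

# Backward pass (back-propagation)
d_out = 2.0 * (out - y) / y.size        # dLoss/dOut
dW2 = h.T @ d_out
d_h = (d_out @ W2.T) * (h_pre > 0)      # gradient flows only through active ReLU units
dW1 = x.T @ d_h

# Gradient update (plain SGD step)
W1 -= lr * dW1
W2 -= lr * dW2
print(f"loss = {loss:.4f}")
```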

Modern GPUs are equipped with large memory and massive numbers of processing units that can accelerate the computations of DL, especially matrix multiplication. However, several issues prevent DL training from taking full advantage of GPUs. First, the algorithmic structure of matrix multiplication on GPUs is underexplored. Most current approaches simply put all the data onto GPUs and utilize the GPU units, hoping that memory and resource sharing can be solved by parallelism alone. Nonetheless, there are too many difficulties for this to happen. For example, multiple training tasks on the same GPU can extremely slow down the training speed. The imbalanced allocation of computational resources between large and small tasks also wastes resources. Some methods have focused on this issue but have had limited impact.

Besides, dense tensor operations on sparse neural network parameters hinder computational savings. In fully-connected models, only a few parameters are meaningfully updated after each training iteration. Executing matrix multiplication over an entire model and performing back-propagation therefore wastes a huge amount of computational resources, since most of the operations do not affect the results at all. How to efficiently select these effective, or active, neurons from the model is critical.

Finally, the job scheduling and resource allocation methods of GPUs are coarse. Each GPU can handle one application well. However, if multiple applications are allocated to the same GPU, the computational speed can decrease dramatically. During computation, tensors are swapped between GPUs and CPUs, which also slows down the speed. Both observations show that GPUs do not have a fine-grained scheduling scheme for handling their resources. Recent works have been proposed to resolve these issues and will be detailed in the following sections.

To deal with the issues described above, many recent works have been proposed. This paper categorizes them into two classes: CPU-based methods and GPU-based methods.

CPU-based methods, instead of focusing on optimization for GPUs, which are naturally better suited to matrix multiplication, investigate computational optimization on CPUs. CPUs can process a sequence of instructions more efficiently but offer limited parallelism; GPUs, on the other hand, are not good at handling long sequences of instructions but are strong at parallelism. The key question is how the sparse weights of a model can be efficiently updated, specifically how to select the active neurons of the neural network. If large-scale parallelism is no longer needed, CPUs have the chance to better utilize their computational resources for DL training and even to outperform the training speed of GPUs.

On the other hand, GPU-based methods explicitly explore better memory utilization and computational allocation on GPUs. There are several ways to boost GPU utilization. For example, to fully exploit GPU resources, multiple model training tasks can be placed on a single GPU. Recent works, however, have discovered that this extremely slows down the training speed of each task. The phenomenon can be easily observed with recent DL frameworks, e.g., TensorFlow (TF) and PyTorch. This is mainly because the scheduling methods of GPUs are coarse: training jobs cannot be efficiently packed into one single GPU, and even when they can, the processing speed drops.

Another practical concern is how GPUs should allocate processing time to each task. What kinds of tasks should be finished first and which later? How much computational resource is needed for a specific task? Should GPUs give more resources to tasks that are almost finished or to those that have just started? These are the real situations users face when they train their models on GPUs. As a result, GPU-based methods look into the details of these issues.

Background

This section reviews related work on DL training acceleration, including Locality Sensitive Hashing, AVX-512, and GPU sharing methods.

Locality Sensitive Hashing

Locality Sensitive Hashing (LSH) is a technique that hashes similar inputs into the same buckets with high probability. Note that the number of buckets is smaller than the number of possible inputs, so only a few inputs are selected for each query. Because similar inputs fall into the same buckets, LSH is useful for nearest-neighbor search and data clustering. It differs from conventional hashing techniques in that hash collisions are maximized rather than minimized.

LSH can be regarded as a dimension-reduction method for high-dimensional data: such data can be reduced to low-dimensional representations while preserving similarity. Because of this property, it can be applied to updating sparse neural networks via nearest-neighbor search. Several LSH families and hash-table schemes have been proposed. SimHash is a popular LSH for the cosine similarity measure. DWTA hash transforms the input feature space into binary codes such that the Hamming distance in the resulting space closely correlates with the rank similarity measure for sparse data.
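
As an illustration of the idea, here is a minimal SimHash sketch in NumPy, assuming random signed hyperplanes; the number of bits and the dimensionality are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def simhash(x, planes):
    """SimHash: sign pattern of projections onto random hyperplanes -> bucket id."""
    bits = (planes @ x) >= 0                              # one bit per hyperplane
    return int(np.dot(bits, 1 << np.arange(bits.size)))   # pack bits into an integer

dim, n_bits = 64, 8
planes = rng.standard_normal((n_bits, dim))               # shared random hyperplanes

a = rng.standard_normal(dim)
b = a + 0.05 * rng.standard_normal(dim)                   # nearly parallel to a
c = -a                                                    # opposite direction

# Similar vectors (high cosine similarity) collide with high probability.
print(simhash(a, planes), simhash(b, planes), simhash(c, planes))
```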

AVX-512

AVX-512 is a set of 512-bit extensions to the 256-bit Advanced Vector Extensions (AVX) SIMD instructions for the x86 instruction set architecture. The architecture introduces various new operations, such as new data conversions, scatter operations, and permutations, and therefore provides increased parallelism. AVX-512 instructions allow a single operation, such as an add, to be applied to a vector of values that fits into a 512-bit register. For instance, suppose we have two arrays of 16 32-bit integers each and would like to add them pairwise. Using AVX-512, we load each array into a 512-bit register, add them in a single operation, and write the summed values back to a new array.
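
The pairwise-add pattern described above can be sketched in NumPy, whose element-wise operations are dispatched to SIMD code paths on supporting CPUs; actual AVX-512 code would be written in C/C++ with compiler intrinsics, as noted in the comments.

```python
import numpy as np

# Two arrays of sixteen 32-bit integers: exactly the amount of data that
# fits in one 512-bit register (16 lanes x 32 bits).
a = np.arange(16, dtype=np.int32)
b = np.arange(100, 116, dtype=np.int32)

# One logical operation performs all sixteen pairwise additions. NumPy
# dispatches element-wise ufuncs to SIMD code paths on supported CPUs;
# in C/C++ the same idea is written with intrinsics such as
# _mm512_loadu_si512 / _mm512_add_epi32 / _mm512_storeu_si512.
c = a + b
print(c)
```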

GPU sharing methods

The straightforward way to share GPUs would be to disable the exclusive access mode and statically partition the GPU memory among DL jobs. However, this cannot address the underutilization problem due to the high peak-to-average memory usage of DL jobs. There are some existing techniques for sharing GPUs. NVIDIA's Multi-Process Service (MPS) can be used to speed up the static partitioning approach for GPU sharing by avoiding costly GPU context switches. Nonetheless, MPS has limited support for DL frameworks such as TF and PyTorch. Both the vanilla static partitioning approach and MPS fail to provide performance isolation.

The main issue is that co-located DL jobs can cause large and hard-to-predict interference. A recent work, Gandiva (Xiao et al., 2018), uses trial and error and falls back to a non-sharing mode. Xu et al. (2019) proposed using a machine learning model to predict and schedule GPU-using VMs in a cluster to minimize interference. NVIDIA's TensorRT Inference Server (Migacz, 2017) and the prior work by Samanta et al. (2019) achieve simultaneous DL inference in parallel on a single GPU, but all of them lack support for DL training.

The aforementioned earlier works on fine-grained GPU sharing have two disadvantages. Some attempt to intercept GPU driver API calls and dynamically introduce concurrency by time-slicing kernel execution at runtime. Others call for new APIs for GPU programming. These solutions are designed for jobs with a few GPU kernels; in other words, they are not scalable to DL applications, where the number of unique kernels can easily reach several hundred.

Approach

This paper mainly focuses on three DL acceleration methods: SLIDE, Optimized SLIDE, and SALUS. The first two are CPU-based methods, and the third is a GPU-based method.

SLIDE

SLIDE aims at developing a faster implementation of Locality Sensitive Hashing (LSH) tables compared to GPU acceleration. Given an input, the LSH algorithm searches for its nearest neighbors based on a similarity measure and probability. SLIDE builds on LSH to identify sparse (active) neurons in each iteration and accelerates the gradient update in parallel without affecting convergence. The implementation of SLIDE supports four types of hash functions from the LSH family: (1) SimHash, (2) WTA hash, (3) DWTA hash, and (4) MinHash. Each of these hash families preserves a different similarity and hence is useful in different scenarios, which will be discussed in a later section.

In SLIDE, every layer contains a set of neurons and a set of LSH hash tables. Each table maps bucket entries to the ids of the neurons hashed into them. The LSH process begins with initialization, where the hash functions and tables are randomly initialized. In each step, the id of a neuron is saved into the hash bucket that the LSH function maps it to. This constructs the hash tables in each layer, a step that is easily parallelized because neurons are processed independently.

After constructing the hash tables, the next step is the sparse feed-forward pass. Instead of calculating all the activations, SLIDE computes the hash codes of the input and uses them as queries to retrieve the ids of active neurons from the matching buckets. Only the retrieved neurons are passed to the next layer; the others are set to 0. The third step is sparse back-propagation (the gradient update): in every training step, neurons propagate gradients only to the active neurons of the previous layers, so only a small set of neurons is updated, saving computation. Finally, SLIDE updates the hash tables after the weight update, removing neuron ids from their old buckets and adding them to the new ones.
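
The following is a simplified Python sketch of these steps (not the authors' C++ implementation): a layer keeps an LSH table from hash codes to buckets of neuron ids, and a query activates only the neurons retrieved from the matching bucket. SimHash is used here for concreteness, and all sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

dim, n_neurons, n_bits = 32, 1000, 6
W = rng.standard_normal((n_neurons, dim)) * 0.1   # one weight row per neuron
planes = rng.standard_normal((n_bits, dim))       # SimHash hyperplanes for this layer

def hash_code(v):
    bits = (planes @ v) >= 0
    return int(np.dot(bits, 1 << np.arange(n_bits)))

# Build the hash table: bucket -> ids of neurons whose weight vector hashes there.
table = {}
for nid in range(n_neurons):
    table.setdefault(hash_code(W[nid]), []).append(nid)

# Sparse forward pass: query with the input, activate only the retrieved neurons.
x = rng.standard_normal(dim)
active = table.get(hash_code(x), [])
out = np.zeros(n_neurons)
out[active] = np.maximum(W[active] @ x, 0.0)      # ReLU only on active neurons

print(f"{len(active)} of {n_neurons} neurons active")
```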

To maintain parallelism in training, SLIDE uses OpenMP parallelization across a batch. It uses stochastic (batch) gradient descent with the Adam optimizer. Each training instance runs in a separate thread and its gradients are computed in parallel. Independence is secured by storing two additional arrays whose length equals the batch size: every input is assigned an id, which can be used as an index to locate its activation (or error gradient) on any neuron. Furthermore, the extreme sparsity and randomness in gradient updates allow SLIDE to asynchronously parallelize the accumulation step of the gradient across different training data without causing a considerable amount of overlapping updates. This asynchronous update avoids the synchronization during batch accumulation that is otherwise sequential across the batch.

Given SLIDE's use of LSH, the paper presents the key idea of using LSH for adaptive sampling of neurons with three strategies: 1) Vanilla Sampling, 2) TopK Sampling, and 3) Hard Thresholding. Vanilla Sampling uses a parameter beta as the number of active neurons to retrieve in a layer. After obtaining the hash codes, it randomly chooses a table and retrieves only the neurons in the corresponding bucket. The method continues retrieving neurons from another random table until beta neurons are selected or all the tables have been checked. It can be regarded as a random-based sampling method.

Different from this random-based sampling, TopK Sampling aims at obtaining the neurons that occur most frequently among all hash tables. After querying with the input, it first retrieves all the neurons from the corresponding bucket in each hash table and aggregates their frequencies across all hash tables. The frequencies are sorted, and only the neurons with the top beta frequencies are selected. Hard Thresholding skips the sorting step of TopK Sampling and instead selects the neurons that appear at least m times in the retrieved buckets.
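
A compact sketch of the three strategies, assuming the buckets retrieved from each table are already given; the parameters beta and m are illustrative.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

def vanilla_sampling(buckets, beta):
    """Visit tables in random order; collect neurons until beta are found."""
    active = set()
    for t in rng.permutation(len(buckets)):
        active |= set(buckets[t])
        if len(active) >= beta:
            break
    return active

def topk_sampling(buckets, beta):
    """Keep the beta neurons that occur most frequently across all tables."""
    freq = Counter(nid for bucket in buckets for nid in bucket)
    return {nid for nid, _ in freq.most_common(beta)}

def hard_threshold(buckets, m):
    """Keep every neuron that appears in at least m retrieved buckets."""
    freq = Counter(nid for bucket in buckets for nid in bucket)
    return {nid for nid, c in freq.items() if c >= m}

# Retrieved buckets (one per hash table) for some query input.
buckets = [[1, 4, 7], [4, 7, 9], [2, 4], [4, 7, 11]]
print(vanilla_sampling(buckets, beta=4))
print(topk_sampling(buckets, beta=2))   # {4, 7}
print(hard_threshold(buckets, m=3))     # {4, 7}
```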

Since SLIDE relies on the LSH process, the overhead of updating the hash tables becomes critical. It is expensive to recompute the hash tables after each gradient update, so SLIDE dynamically changes the update frequency of the hash tables to reduce the overhead. The key idea is to update more often in the initial training stage and less often when the model is close to convergence. SLIDE also needs a policy for adding a new neuron to a bucket when it is full; it uses Vitter's reservoir sampling algorithm as the replacement strategy.
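
The bucket-replacement idea can be sketched with Algorithm R, the classic reservoir sampling procedure; the bucket capacity and the stream of neuron ids below are made up for illustration.

```python
import random

def reservoir_insert(bucket, item, n_seen, capacity):
    """Insert item into a fixed-size bucket using Algorithm R.

    n_seen is the number of items offered to this bucket so far, including
    the current one. Each offered item ends up kept with probability
    capacity / n_seen.
    """
    if len(bucket) < capacity:
        bucket.append(item)
    else:
        j = random.randrange(n_seen)      # uniform in [0, n_seen)
        if j < capacity:
            bucket[j] = item              # evict a random existing entry
    return bucket

bucket, capacity = [], 4
for n_seen, neuron_id in enumerate(range(100, 120), start=1):
    reservoir_insert(bucket, neuron_id, n_seen, capacity)
print(bucket)   # 4 neuron ids, each offered id kept with equal probability
```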

Optimized SLIDE

Building on the success of SLIDE, Optimized SLIDE further explored the sparsity issue and proposed an optimized implementation on modern CPUs. Optimized SLIDE first focuses on memory coalescing and cache utilization. The training process of DL is memory-driven, including reading data and updating parameters. The best scenario for threads executing a load instruction is to access consecutive memory locations: the accesses are coalesced into a single access to DRAM, and most threads can load from the cache. However, SLIDE, being a randomized scheme, did not consider this, resulting in memory fragmentation and increased data-access latency, which leaves CPUs underutilized.

Optimized SLIDE found two major sources of fragmentation in SLIDE: data memory fragmentation and parameter memory fragmentation. In SLIDE, each thread accesses data from a different location instead of relying on the cache, because the data are stored in many separate arrays, which easily creates cache misses. Optimized SLIDE instead creates one long contiguous vector that stores the values and indices of active neurons, with another vector of offsets to track those neurons. Therefore, when the first instance is requested, the following consecutive data elements are loaded into the cache along with it.
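
A small sketch of this kind of layout (essentially a CSR-style packing), with made-up sample data: values and indices of active neurons live in one contiguous array each, and an offsets vector marks where each sample starts.

```python
import numpy as np

# Ragged per-sample data: active-neuron indices and values differ per sample.
per_sample_ids  = [[3, 17, 42], [5, 8], [1, 2, 3, 4]]
per_sample_vals = [[0.2, 1.1, 0.7], [0.5, 0.9], [1.0, 2.0, 3.0, 4.0]]

# Pack everything into contiguous arrays plus an offsets vector.
lengths = np.array([len(v) for v in per_sample_ids])
offsets = np.concatenate(([0], np.cumsum(lengths)))     # where each sample starts
ids  = np.concatenate(per_sample_ids).astype(np.int64)
vals = np.concatenate(per_sample_vals).astype(np.float32)

# Accessing sample i touches one contiguous slice, which is cache friendly.
i = 1
start, end = offsets[i], offsets[i + 1]
print(ids[start:end], vals[start:end])
```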

The second source concerns the parameters: SLIDE also updates the parameters of active neurons through the hash tables, but it does not exploit the fact that the same or nearby neurons activated in different threads should trigger coalescing. When a thread fetches a neuron from memory, nearby neurons are likely to be used by other threads. Optimized SLIDE ensures that the neuron weights of the same layer are located contiguously in memory, so that other threads can readily load nearby neurons from the cache.

Besides the memory issue, Optimized SLIDE works on vectorization with AVX-512. Specifically, AVX-512 reduces the sum instructions by combining products into a single sum that accumulates over iterations, based on two lemmas. Lemma 1 states that if a weight matrix is stored in column-major order, then its transpose is in row-major order, and vice versa. Lemma 2 states that the matrix multiplication between the weights and the input can be vectorized if the input matrix is dense and the weight matrix is in row-major order. Based on these two lemmas, Optimized SLIDE proposes AVX-512 solutions for both dense and sparse matrices.
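
The layout intuition behind the lemmas can be sketched as follows: with a row-major weight matrix, the weights of each active neuron form one contiguous row, so the product with a dense input reduces to dot products over consecutive memory, which SIMD units (and NumPy) vectorize well. The sizes and the set of active ids below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

n_neurons, dim = 512, 128
# Row-major (C-contiguous) weights: row i holds the weights of neuron i.
W = np.ascontiguousarray(rng.standard_normal((n_neurons, dim)).astype(np.float32))
x = rng.standard_normal(dim).astype(np.float32)          # dense input vector

active = np.array([3, 40, 41, 42, 200])                  # ids of active neurons

# W[active] gathers whole contiguous rows, and each dot product streams
# through consecutive memory, which SIMD units handle efficiently.
out_active = W[active] @ x                                # shape: (len(active),)
print(out_active)
```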

After handling matrix multiplication, Optimized SLIDE addresses another bottleneck of SLIDE: the computation of the hash tables. To mitigate this bottleneck, it uses Densified Winner-Takes-All (DWTA) hashing, which computes a random hash map of every non-zero index in the list. The map is then partitioned into a fixed number of bins, and the index with the maximum coordinate value in each bin is the hash value.
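
Below is a simplified sketch of the winner-takes-all step as described above, where non-zero indices are mapped through a random permutation into bins and the index with the largest value wins each bin; the densification step that handles empty bins for very sparse inputs is omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)

def wta_hash(indices, values, n_bins, perm):
    """Simplified WTA step: map each non-zero index into a bin via a random
    permutation and keep, per bin, the index with the largest value.
    (Densification for empty bins is omitted in this sketch.)"""
    codes = [-1] * n_bins
    best = [-np.inf] * n_bins
    for idx, val in zip(indices, values):
        b = perm[idx] % n_bins              # random bin for this coordinate
        if val > best[b]:
            best[b], codes[b] = val, idx
    return codes

dim, n_bins = 100, 4
perm = rng.permutation(dim)                  # shared random permutation of coordinates
indices = [2, 15, 40, 77]                    # non-zero coordinates of a sparse vector
values  = [0.3, 2.0, 1.1, 0.8]
print(wta_hash(indices, values, n_bins, perm))
```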

SALUS

In addition to exploring the utilization of CPUs for accelerating DL training, other works have sought new GPU-side solutions. SALUS, a GPU-based method, proposed two GPU sharing primitives: fast job switching and memory sharing. Both aim to exploit GPU computational resources in a fine-grained manner while processing multiple DL applications. SALUS highlights three main resource observations.

First, memory usage can differ greatly between models. For example, resnet152 costs 13.8 GB while a Variational Autoencoder (VAE) costs only 1 GB, suggesting that such training tasks could be loaded onto the same GPU. Second, temporal memory usage is predictable, since training iterations are repetitive, which helps when designing scheduling methods. Third, each training iteration has a persistent but low memory footprint for storing the model parameters; keeping it resident lets repeated iterations avoid reloading the data, which is a good starting point for improving memory sharing and utilization.

SALUS addresses these observations with two GPU sharing primitives. Efficient job switching is made fast by not moving persistent memory and by an iteration-granularity job scheduler, which leads to a tradeoff between maximum memory utilization and efficiency. On the other hand, spatial sharing via GPU lanes provides a better way to utilize unused memory. The goal is to limit dynamic allocation to the ephemeral region, where a DL job's ephemeral memory goes. Since each ephemeral region contains lanes, continuous memory spaces reserved for iterations, isolating memory allocations across lanes achieves maximum compatibility.

SALUS accordingly developed three scheduling policies: the PACK policy, the SRTF policy, and the FAIR policy. The PACK policy aims at improving resource utilization by packing jobs with different GPU memory requirements into separate GPU lanes; note that the peak memory usage across lanes must stay smaller than the total GPU memory to avoid crashes. In addition, SALUS allows job priorities based on any criterion, so a shortest-remaining-time-first (SRTF) policy can be applied. Finally, some DL applications need to share time, and SALUS is able to equalize resource usage over time for different tasks with the FAIR policy.
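
The following is an illustrative sketch of these two ideas (not SALUS's actual scheduler): jobs are greedily packed into lanes subject to a peak-memory capacity, and each lane is then ordered by shortest remaining time. All job names and numbers are made up.

```python
from dataclasses import dataclass, field

@dataclass
class Job:
    name: str
    peak_mem_gb: float       # peak memory of one training iteration
    remaining_s: float       # estimated remaining training time

@dataclass
class Lane:
    capacity_gb: float
    jobs: list = field(default_factory=list)

    def used_gb(self):
        return sum(j.peak_mem_gb for j in self.jobs)

def pack(jobs, lane_capacity_gb):
    """First-fit packing: total peak memory per lane must stay within capacity."""
    lanes = []
    for job in sorted(jobs, key=lambda j: -j.peak_mem_gb):
        target = next((l for l in lanes
                       if l.used_gb() + job.peak_mem_gb <= l.capacity_gb), None)
        if target is None:
            target = Lane(lane_capacity_gb)
            lanes.append(target)
        target.jobs.append(job)
    for lane in lanes:                          # SRTF order within each lane
        lane.jobs.sort(key=lambda j: j.remaining_s)
    return lanes

jobs = [Job("resnet", 13.8, 3600), Job("vae", 1.0, 600), Job("lstm", 4.0, 1200)]
for i, lane in enumerate(pack(jobs, lane_capacity_gb=16.0)):
    print(i, [j.name for j in lane.jobs])
```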

Strengths, Weaknesses, and Potentials

In this section, I describe the experimental settings, the strengths of the three works, their weaknesses from various points of view, and the potential directions that could benefit the DL community.

Experimental settings

This section includes the details of the datasets and baselines. SLIDE trained its models on Delicious-200K and Amazon-670K. The former is a sub-sampled dataset generated from a corpus of about 150 million bookmarks from Social Bookmarking Systems; Amazon-670K contains product-to-product recommendations with 670K labels. The baselines are the optimized TF framework for CPU and for GPU, referred to as TF-CPU and TF-GPU. The language models used are skip-gram models, which aim to predict nearby words in a document by learning continuous representations of words.

Optimized SLIDE was also trained on Amazon-670K, and additionally on WikiLSHTC-325K and Text8. WikiLSHTC-325K is extracted from Wikipedia and has about a million training samples with 325K labels. Text8 contains 100 million tokens of English Wikipedia with a 253K-word vocabulary. The work uses CPX to run Optimized SLIDE with BF16, Naive SLIDE, and the TF implementation of full-softmax, referring to them as Optimized SLIDE CPX, Naive SLIDE CPX, and TF FullSoftmax CPX, respectively.

SALUS was integrated with TensorFlow and evaluated using a collection of training, hyper-parameter tuning, and inference workloads. The baselines are FIFO scheduling, which is commonly used in GPU clusters, and NVIDIA MPS.

Strengths

The experimental results show that these three works achieve state-of-the-art performance in terms of training convergence speed and computational resource usage. Specifically, the convergence time of SLIDE is 2.7x lower than that of TF-GPU (2 hours vs. 5.5 hours), demonstrating that SLIDE's implementation takes advantage of CPU power to accelerate training beyond the common GPU scheme.

Furthermore, Optimized SLIDE is much faster than both TF-GPU and SLIDE. On an Intel Cascade Lake server, it is 3.5x faster than TF-GPU on a V100 and 4.4x faster than Naive SLIDE, showing that Optimized SLIDE further accelerates training with modern CPU functionalities. Finally, consider 14 different models with 3 instances each: TF alone needs 42 GPUs to hold these DL models, and MPS needs 6 GPUs, while SALUS requires only one GPU for the same number of training models.

On the other hand, ablation studies have been conducted to fully explore the benefits of these works from various aspects. SLIDE runs experiments with different batch sizes and studies scalability with varying numbers of CPU cores. The larger the batch size, the higher SLIDE's accuracy. The convergence time of SLIDE decreases roughly linearly as the number of CPU cores increases. Furthermore, adding CPU cores gives SLIDE a shorter convergence time than TF-CPU.

The Optimized SLIDE ablation shows the impact of BF16 and AVX-512 on average wall-clock time and training time per epoch. Specifically, AVX-512 vectorization decreases average training time by up to 1.2x with almost the same accuracy. Using BF16 for both activations and weights boosts performance by up to 1.28x on Amazon-670K and 1.39x on WikiLSHTC-325K.

Finally, SALUS offers a perspective on automatic hyper-parameter exploration. Since low-quality models are killed during exploration, the surviving runs can be gathered onto one single GPU. Compared to FIFO in TF, SALUS shows a 1.07x improvement for resnet5050 and 2.38x for superres128. The work also explored how many GPUs SALUS can save during inference, showing that SALUS strongly reduces the number of GPUs needed while maintaining reasonable response latency; the highest latency is about 50 ms.

Weaknesses

Though these three works demonstrate their strengths in computational savings and better utilization, some issues remain.

First of all, is it possible to apply SLIDE and Optimized SLIDE to Convolutional Neural Networks (CNNs)? Both works focused on accelerating the training of fully-connected models, specifically matrix multiplication. However, most recent DL models, such as ResNet, VGG16, and AlexNet, are CNN-based. Most of their operations during training and testing are convolutions with small kernels. The convolution operation computes, with a sliding window, the interaction between the input and a small kernel, traversing all the units of the input.

Taking RGB images for instance, suppose the input is a 32 x 32 image and the convolution kernel is 3 x 3. The kernel performs a convolution, which is actually a dot product between a 3 x 3 pixel region of the input and the kernel. After each computation, the kernel moves right or down according to the stride (step size) to the next region, until all pixels are traversed. This differs from large, sparse matrix multiplication: it instead involves a massive number of small matrix operations in each iteration. SLIDE and Optimized SLIDE do not describe whether their methods can be generalized to such CNN-based models.
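
A naive sketch of this sliding-window computation on a single channel (one channel is used for simplicity; DL frameworks actually compute cross-correlation, as below):

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d_single_channel(image, kernel, stride=1):
    """Naive sliding-window convolution (cross-correlation) on one channel."""
    kh, kw = kernel.shape
    oh = (image.shape[0] - kh) // stride + 1
    ow = (image.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)   # dot product of patch and kernel
    return out

image = rng.standard_normal((32, 32))              # one 32x32 channel
kernel = rng.standard_normal((3, 3))               # 3x3 kernel
print(conv2d_single_channel(image, kernel).shape)  # (30, 30) with stride 1
```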

Second, although SLIDE and Optimized SLIDE conducted extensive experiments on traditional fully-connected models, they did not explore the case of Transformers. Transformers have been widely studied and used as language models for natural language processing applications, e.g., machine translation, text summarization, and visual question answering. To handle such language tasks, which usually require extensive training data, the models are enormous, with billions of parameters. It would be reasonable for both works to test their methods in this scenario to see whether they also apply to these state-of-the-art models. The two works, however, did not further explore the capacities of their methods on such modern models.

Third, though CNN-based models have been used to evaluate the effectiveness of SALUS, none of them are designed for complicated tasks such as object detection, object tracking, or image generation. Those tasks often need much more complicated CNN designs, and their targets are not just classification; they rather involve multi-task learning with multiple branches of losses to compute.

Several questions then arise: How fair is the sharing among these models? What will the latencies look like with these kinds of large models? These questions are thought-provoking and can make DL acceleration more connected to modern DL model designs. The reasoning is discussed further in the next section.

Potentials

The strengths and weaknesses of these three works show that the potential of DL acceleration is large but still underexplored. I discuss several promising directions in detail below.

First of all, better memory utilization and switching between CPU memory and GPU memory is worth exploring. To evaluate model performance after each training epoch, the model has to produce quantitative or qualitative results on the test set. Those results often require CPU rather than GPU resources, such as drawing figures or calculating statistics. In this case, data has to be moved back and forth between GPU memory and CPU memory. It would be interesting to find scheduling and switching methods that handle this movement so that training can fully exploit both GPU and CPU resources.

Second, DL training nowadays is sometimes combined with Reinforcement Learning (RL). RL is a trial-and-error process that includes many exploration and exploitation steps. These steps can be expanded in parallel, and each step has its own training process with rewards. The processed data are frequently moved between CPU and GPU. In this case, how the recent CPU-based and GPU-based methods can accelerate the generation of roll-outs, the back-propagation, and the weight updates is critical.

Third, splitting a batch into several sub-batches across multiple GPUs is common in large-scale Computer Vision tasks but not yet explored by these works. Recent Computer Vision models based on DL are so enormous that even one GPU (e.g., a V100) cannot handle one training iteration with a full batch. At the same time, setting a very small batch size may hurt the convergence of the model. In this case, users usually split the training batch into multiple mini-batches and send them to different GPUs to be computed independently.

After the computation, the model collects the gradients from those GPUs and updates the parameters. How GPU-based methods can properly gather the gradients with better memory utilization and allocation is critical. Generally speaking, how do GPU-based methods simultaneously handle multiple tasks on a single GPU and multiple GPUs for a single task?
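
A minimal NumPy sketch of this data-parallel pattern, with a linear model standing in for the real network: the batch is split into sub-batches, per-shard gradients are computed independently (as separate GPUs would), averaged, and then applied in a single update.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "model": linear regression, so the per-shard gradient is easy to write out.
W = rng.standard_normal((16, 4)) * 0.1
X = rng.standard_normal((64, 16))                # full batch of 64 samples
Y = rng.standard_normal((64, 4))

def shard_gradient(x, y, W):
    """Gradient of the mean-square-error loss for one sub-batch (one 'GPU')."""
    pred = x @ W
    return 2.0 * x.T @ (pred - y) / x.shape[0]

# Split the batch across 4 workers, compute gradients independently, then
# average them (the all-reduce step) before a single parameter update.
grads = [shard_gradient(x, y, W) for x, y in zip(np.split(X, 4), np.split(Y, 4))]
W -= 1e-2 * np.mean(grads, axis=0)
```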

The aforementioned weaknesses and potentials of these three works are critical and meaningful to the DL community. Most recent DL models are not only CNN-based but also complicated, in order to deal with real-life tasks. Those tasks usually involve large-scale inputs or outputs, which makes the entire training process expensive. To deal with them, the models have to grow accordingly, becoming very large models with millions of parameters.

These three works, though they can serve as fundamental schemes of DL training acceleration, actually have limited impact on accelerating the training of such modern, large models: they are all designed for basic DL models. The DL community now needs advances in training acceleration to speed up research on various Computer Vision tasks.

For example, these works barely accelerate the training and inference of image super-resolution (ISR), one such Computer Vision task, which requires many powerful GPUs to train. The goal of ISR is to convert a low-resolution image into a high-resolution version. Because the output is a high-resolution image, the training process requires more GPU resources and takes a longer time to obtain a converged model.

To develop a better ISR model, the training process must be run over and over to find the best designs, settings, and hyper-parameters. Extensive experiments have to be conducted to verify any assumption that has been made. Nonetheless, each training run requires a large amount of computational resources and memory, which means the best hyper-parameters and settings are not easy to find, not to mention the architectural designs, which need even more.

Conclusion

This survey organizes three DL acceleration works by introducing the background, detailing the solutions, discussing their strengths and weaknesses, and analyzing possible future directions.

The survey eventually emphasizes the importance of developing better solutions that match the designs of modern DL models. The bottleneck in developing better acceleration solutions is not the inferiority of researchers and scientists but the limitations of hardware and the non-optimized acceleration process for training DL models. If the acceleration process were improved, it would consequently speed up the development of modern DL models. As a result, it is critical for DL training acceleration to catch up with the rapid growth of DL model designs.