Abstract

This survey paper provides a short overview of three tested acceleration techniques for deep learning (DL) systems. The three techniques are: new GPU-sharing primitives for memory allocation and resource scheduling, worker communication scheduling in distributed DL systems, and Locality Sensitive Hashing (LSH) for adaptive sampling of neural networks. All three techniques fall within system-level optimization but can be further broken down into hardware-aware and software optimization strategies. Lastly, I provide ideas for future work on combining these strategies and discuss which DL architectures, computation strategies, and phases are best suited to each method.

1 Introduction

Recently, deep learning (DL) has grown rapidly now that we have sufficiently powerful computational resources and large enough datasets to extract high-accuracy inferences. This increase in computational power and dataset size has led to the development of increasingly complex DL models, such as the large language models that power ChatGPT: GPT-3.5 and GPT-4. Both models have achieved unprecedented accuracy on natural language processing tasks through few-shot learning [@DBLP:journals/corr/abs-2005-14165].

Graphics Processing Units (GPUs) have played a crucial role in the success of DL and machine learning by providing the computational power needed to train and run large-scale models like ChatGPT effectively. Prior to their widespread use, such models lacked the necessary resources for efficient training. Despite the benefits of GPU-accelerated DL systems, training these models efficiently remains a significant challenge. As DL models and datasets continue to grow in size and complexity, researchers and engineers have developed various software and hardware-aware optimization strategies to speed up DL systems. In this survey paper, I provide an overview of three such methods: new fine-grained GPU-sharing primitives, worker communication scheduling, and the Locality Sensitive Hashing (LSH) algorithm for adaptive sampling of neural networks.

In this paper, I will first briefly overview the necessary background concepts: what DL is and the different architectures it encompasses. Next, I will review the two most commonly used DL frameworks: TensorFlow and PyTorch. Additionally, I will provide background on DL jobs, memory usage, and distributed DL techniques, along with their advantages and challenges.

After providing some background, I will provide overviews of each of the three papers [@yu2020fine] [@hashemi2019tictac] [@chen2020slide] regarding their system designs to improve DL acceleration through either software or hardware-aware optimization. Lastly, I provide my thoughts on potential future work and whether or not to combine these strategies based on different DL architectures, hardware, and computation types.

2 Background

2.1 Deep Learning

Deep Learning (DL) is a subfield of machine learning that uses deep neural networks with multiple layers, inspired by the structure of the human brain. Each node, or neuron, in a hidden layer receives input from the nodes in the previous layer, computes an output using an activation function, and passes that output to the nodes in the next connected layer. The layers closer to the input extract simple features, while the layers deeper in the network extract more complex and abstract features. Finally, the output layer generates the final prediction or classification for the task at hand.
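
To make the layer-by-layer computation concrete, here is a minimal sketch of a forward pass through a small fully connected network; the layer sizes, ReLU activation, and softmax output are illustrative choices rather than anything prescribed by the surveyed papers.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Illustrative layer sizes: 4 inputs -> 8 hidden units -> 3 output classes.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)
W2, b2 = rng.normal(size=(3, 8)), np.zeros(3)

def forward(x):
    h = relu(W1 @ x + b1)         # hidden layer: simpler features
    return softmax(W2 @ h + b2)   # output layer: final prediction

print(forward(np.array([0.5, -1.0, 2.0, 0.1])))
```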

There are various DL architectures, each with its own feature-extraction characteristics and use cases. One such architecture is the Convolutional Neural Network (CNN), commonly used for image and video processing tasks. CNNs stack layers of different sizes, where each layer may derive a different set of features, some more complex than others. Another DL architecture is the Recurrent Neural Network (RNN), which is used for sequential and time-series data.

Large-scale DL models heavily rely on large amounts of data, as well as the computational power required to train these models. This need for computational power has driven the development of hardware that specializes in training and running DL models, such as the Tensor Processing Unit (TPU) and DL-centered GPUs like the NVIDIA Tesla K80s.

2.2 Deep Learning Frameworks: TensorFlow vs. PyTorch

DL frameworks provide developers with tools and libraries to quickly build, train, and deploy DL models. Two of the most widely used DL frameworks are TensorFlow by Google and PyTorch by Facebook.

TensorFlow has been available longer and is widely used in industry, with broad community support. PyTorch, on the other hand, is more widely used in academia and research and has gained popularity due to its dynamic computational graph, which allows researchers and developers to debug more easily. Given PyTorch's strong focus within the research community, models described in recent research papers are most likely implemented in PyTorch [@wadawadagi_2023].

Each framework supports the implementation of most DL architectures, as well as most other common machine learning models. They also offer efficient distributed training using data parallelism and model parallelism; in addition, TensorFlow characterizes distributed training as synchronous or asynchronous depending on how workers coordinate their updates.

2.3 Deep Learning Jobs

DL jobs can be split into two groups: training and inference. Training involves updating the model parameters using back-propagation at each iteration. These jobs typically involve many iterations of feeding large amounts of data through the neural network, adjusting its weights, and evaluating the results. The number of iterations per epoch is calculated by dividing the dataset size by the batch size. Each iteration requires a significant amount of computation, which is why training is typically performed on GPUs, given their ability to perform massively parallel operations such as matrix multiplication.
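
As a quick illustration of the iteration count, using made-up numbers:

```python
dataset_size = 1_000_000          # hypothetical number of training examples
batch_size = 256
iterations_per_epoch = dataset_size // batch_size
print(iterations_per_epoch)       # 3906 iterations per pass over the data
```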

Inference, on the other hand, involves using a trained model to make predictions on new data. As a result, inference jobs are typically faster and require less computational power than training jobs.

2.4 GPU Memory Usage

DL models can require significant amounts of GPU memory. In DL jobs, the entire model and its associated data must reside in memory for the GPU to perform any computation. As models grow larger and batch sizes increase, the memory requirements of DL jobs increase as well.

Peak memory usage varies widely across DL jobs on a GPU, with some models peaking as high as 13.8 GB and others staying below 1 GB [@yu2020fine]. However, each DL job iteration is highly predictable, with a well-defined peak memory usage and a trough between iterations [@yu2020fine]. This predictability allows for the identification of scheduler invocation points.
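
A minimal sketch of how a scheduler might exploit this predictability is shown below; it assumes access to a per-job memory trace, and the threshold and trace format are hypothetical rather than taken from SALUS.

```python
def find_invocation_points(memory_trace, trough_threshold_mb=1000):
    """Return timestamps where memory usage drops into a trough between iterations.

    memory_trace: list of (timestamp, memory_mb) samples from one DL job.
    """
    points = []
    for i in range(1, len(memory_trace) - 1):
        _, prev_mb = memory_trace[i - 1]
        t, cur_mb = memory_trace[i]
        _, next_mb = memory_trace[i + 1]
        # A trough: usage is low and no higher than its neighbors.
        if cur_mb < trough_threshold_mb and cur_mb <= prev_mb and cur_mb <= next_mb:
            points.append(t)       # a safe point to invoke the scheduler
    return points
```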

Another important characteristic of memory usage in DL jobs is persistent memory, which holds the parameters of a model and usually consists of a few large chunks. Peak memory usage can be very high, but most of it is temporary data created and destroyed within the same iteration. The size of persistent memory is often small in comparison to the peak: the authors of the first paper observed persistent memory usage ranging from 110.9 MB to 822.2 MB [@yu2020fine].

In addition to persistent memory, ephemeral memory is also important to consider when optimizing. Ephemeral memory holds temporary variables and intermediate results produced during computation that are not saved for future use, and it can contribute significantly to the overall memory usage of DL jobs. The authors of the first paper show that efficiently managing both persistent and ephemeral memory improves performance and prevents out-of-memory errors. Other research has explored alternatives to ephemeral in-memory representations, such as lightweight dictionary data structures, because ephemeral in-memory representations are "difficult to persist and/or not scalable enough under concurrent access" [@nicolae2022scalable].

2.5 Distributed Deep Learning Techniques

Distributed DL is crucial for training large models on big datasets. The benefits of distributing the work for a single DL job are that it usually produces faster training times, better utilization of resources, and the ability to handle larger datasets [@verbraeken2020survey]. Distributed training can be broken down into two main techniques: data parallelism and model parallelism.

Data parallelism involves distributing the data of a dataset across multiple devices or workers and performing the same computation on each piece of data independently. Therefore, each worker or device receives a copy of the model and computes the gradients of the loss function for its part of the data. These gradients are then aggregated across all devices or workers to update the model parameters.
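
Below is a minimal sketch of this data-parallel pattern, with a toy linear model and an averaging function standing in for the framework's gradient aggregation; all names here are illustrative.

```python
import numpy as np

def all_reduce_mean(grads_per_worker):
    """Average gradients across workers (a stand-in for NCCL/gRPC aggregation)."""
    return sum(grads_per_worker) / len(grads_per_worker)

def compute_grad(w, shard):
    X, y = shard
    return 2 * X.T @ (X @ w - y) / len(y)     # gradient of a squared loss

def data_parallel_step(w, shards, lr=0.01):
    local_grads = [compute_grad(w, s) for s in shards]   # each worker: its own shard
    return w - lr * all_reduce_mean(local_grads)         # identical update everywhere

rng = np.random.default_rng(0)
X, y = rng.normal(size=(400, 3)), rng.normal(size=400)
shards = [(X[i::4], y[i::4]) for i in range(4)]          # 4 workers, 4 data shards
w = data_parallel_step(np.zeros(3), shards)
```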

On the other hand, model parallelism involves partitioning the model across multiple devices or workers and performing different parts of the computation on each device. Furthermore, each device or worker is responsible for computing the forward and backward pass for a specific part of the model.

As powerful as distributed training can be, it can introduce significant communication overhead between workers, limiting the system's performance. Reducing this communication overhead is the focus of the second paper. Another technique that can reduce communication overhead is pipelining, which aims to overlap computation and communication as much as possible [@narayanan2021memory]. Pipelining keeps every worker computing instead of waiting for communication from other workers. It is effective for models with a large number of layers and can reduce training time significantly.

2.6 GPU vs. CPU

GPUs and CPUs have different hardware architectures suited to different computations. GPUs have far more cores (often thousands) than CPUs (typically on the order of tens) and are optimized for running highly parallel computations over large amounts of data. CPUs, by contrast, have fewer but more powerful and versatile cores and are better suited to serial computations that require high clock speeds and low latency. CPUs are also well suited to general-purpose tasks and computations.

GPUs have been the staple device for running DL models due to their high parallelism and memory bandwidth. The parallel processing nature of GPUs enables them to perform matrix multiplication much faster than CPUs. Additionally, the high memory bandwidth of GPUs allows them to move data efficiently between memory and computing units.

CPUs, though, can still be especially useful for training and inference on smaller DL models with fewer layers and neurons. Recent work has shown that optimized software can achieve competitive performance on CPUs, as is the case for the third paper. However, one of the challenges of training on a CPU is the relatively small amount of memory available per core compared to GPUs, which makes it harder to fit large-scale DL models into CPU memory. One solution to this limited memory is sub-linear algorithms, whose computation and memory footprint grow sub-linearly with model size; the third paper's findings show this to be effective.

While CPUs and GPUs are each powerful in their own ways, recent work has shown that DL tasks benefit from combining both, resulting in better utilization of computational resources. For example, one study improved inference performance in terms of deadline miss rate without sacrificing inference accuracy [@li2022enabling]; its authors proposed a dynamic-programming-based framework to solve the real-time deep neural network partitioning and scheduling problem via GPU-CPU collaborative execution.

3 A Survey on System-Level Optimization Strategies for Accelerating Deep Learning Systems

In this section, I cover the three papers I surveyed on their system-level optimization strategies for accelerating DL systems: (1) SALUS: Fine-Grained GPU Sharing Primitives for Deep Learning Applications; (2) TICTAC: Accelerating Distributed Deep Learning with Communication Scheduling; (3) SLIDE: In Defense of Smart Algorithms over Hardware Acceleration for Large-Scale Deep Learning Systems.

3.1 Themes & Differences

The general overarching theme between all three papers is that their strategies are based on system-level optimization. However, each paper specializes in a certain part of the DL system to optimize: software optimization vs. hardware-aware optimization. The first two papers focus on optimizing hardware usage for GPUs, while the third paper focuses on optimizing the software DL algorithm for CPUs.

The first two papers, SALUS and TICTAC, are similar in that they both optimize scheduling via certain policies. However, they are fundamentally different. For instance, SALUS optimizes the scheduling of multiple DL jobs, while TICTAC optimizes the scheduling of communication between multiple workers on a single DL job.

SALUS also optimizes memory sharing when running multiple DL jobs on the same GPU by keeping each job's persistent and ephemeral memory in the GPU's internal memory at the same time. Memory is juggled among the jobs, enabling them to run concurrently. As a result, the GPU utilizes its memory more effectively, improving its overall capacity.

In other words, SALUS optimizes the sharing of memory and scheduling of computation between multiple DL jobs, while TICTAC optimizes the scheduling of communication between workers on a job. One involves multiple DL jobs, and another involves multiple workers per DL job. In the discussion section, I go into my theory of the feasibility of combining both strategies to optimize multiple DL models on a distributed DL system.

The third paper, SLIDE, takes a very different approach from the other two. Instead, SLIDE focuses on optimizing the algorithm used to train the DL models themselves. It does this by using the LSH algorithm to implement adaptive sampling over the neurons of the deep neural network. Essentially, it prunes and takes advantage of the sparsity of neural networks by only updating the weights of activated neurons during the back-propagation stage.

I will now go through summaries of the three strategies used by each paper by explaining the methodologies, relevant concepts, and results.

3.2 SALUS: Fine-Grained GPU Sharing Primitives for Deep Learning Applications

SALUS is a solution developed to enable the efficient sharing of GPUs while maintaining compatibility with existing DL frameworks. It addresses temporal and spatial memory management aspects by enabling two GPU-sharing primitives. The first primitive is fine-grained time sharing via efficient job switching among ongoing DL jobs. The second is dynamic memory sharing via the GPU lane.

SALUS is implemented as a singleton execution service, consolidating all GPU accesses and enabling sharing while avoiding costly context switches among processes on the GPU. The SALUS execution service achieves GPU sharing via iteration-granularity scheduling of DL jobs. SALUS allows for preemption or running multiple DL jobs in a time- or space-shared manner, which can be utilized by a GPU cluster scheduler. The SALUS API abstracts away low-level details and can be viewed as another (virtual) computation device by DL frameworks, while user scripts will work the same as before.

3.2.1 Fine-grained Scheduling Computation

SALUS offers fine-grained GPU-sharing primitives that allow various scheduling policies for running multiple DL jobs on a single GPU. Three simple scheduling policies are implemented: (1) PACK, which maximizes efficiency by packing jobs together based on their memory usage; (2) SRTF, which enables prioritization by implementing the shortest-remaining-time-first policy; (3) FAIR, which equalizes job progress by time-sharing between multiple DL jobs during high-contention periods. These policies respect safety conditions ensuring that the total peak memory usage across all lanes is smaller than the GPU memory capacity. SALUS opens up a huge design space for future work to explore more scheduling policies.
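
To give a flavor of such a policy, here is a hedged sketch of an SRTF-style admission loop with a memory safety check; the Job fields, capacity value, and admission rule are simplifications of my own, not SALUS's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    remaining_time_s: float    # estimated remaining training time
    persistent_mb: float       # model parameters kept resident on the GPU
    peak_ephemeral_mb: float   # per-iteration temporary memory

GPU_CAPACITY_MB = 16_000       # hypothetical GPU memory size

def srtf_admit(jobs, capacity_mb=GPU_CAPACITY_MB):
    """Admit jobs shortest-remaining-time-first while a safety condition holds:
    total persistent memory plus the largest admitted peak stays under capacity."""
    admitted, used_persistent, largest_peak = [], 0.0, 0.0
    for job in sorted(jobs, key=lambda j: j.remaining_time_s):
        new_persistent = used_persistent + job.persistent_mb
        new_peak = max(largest_peak, job.peak_ephemeral_mb)
        if new_persistent + new_peak <= capacity_mb:
            admitted.append(job)
            used_persistent, largest_peak = new_persistent, new_peak
    return admitted
```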

3.2.2 Fine-grained Memory Sharing

Fine-grained memory sharing via the GPU lane is a key feature of SALUS, allowing for more efficient GPU memory utilization. The GPU lane is designed based on classic memory management techniques, where the memory space is divided into two regions: ephemeral and persistent. The ephemeral region is further divided into lanes, which are contiguous memory spaces that hold ephemeral memory allocations for iterations. In addition, each lane can be assigned to multiple DL jobs, which are time-shared within the lane.

The GPU lane provides fine-grained memory sharing by isolating memory allocations across lanes, ensuring maximum compatibility while achieving adequate flexibility. Fine-grained memory sharing is achieved by limiting dynamic allocation in the ephemeral region and ensuring enough capacity for the persistent memory of all admitted jobs, plus enough remaining memory for the iteration with the largest temporary memory requirement. This guarantees that at least one job in each lane can always proceed while GPU memory utilization is maximized.

Moreover, the lane auto-defragmentation feature moves fragmentation within lanes to fragmentation at the lane level, which is much easier to handle. The allocations are released completely at the end of each iteration, and defragmentation happens almost automatically at no cost. When a job finishes, its lane space is quickly reclaimed at the iteration boundary by the job allocated below it.
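
The sketch below models the lane idea in a simplified form: lanes are fixed-size regions of the ephemeral area that are time-shared by jobs and reclaimed when a job finishes; the data structures and assignment rule are illustrative assumptions, not SALUS's actual design.

```python
class Lane:
    """A contiguous slice of the ephemeral region, time-shared by jobs."""
    def __init__(self, size_mb):
        self.size_mb = size_mb
        self.jobs = []

class EphemeralRegion:
    def __init__(self, total_mb):
        self.total_mb = total_mb
        self.lanes = []

    def assign(self, job_name, job_peak_mb):
        # Prefer time-sharing an existing lane that is already large enough.
        for lane in self.lanes:
            if job_peak_mb <= lane.size_mb:
                lane.jobs.append(job_name)
                return lane
        # Otherwise open a new lane if enough contiguous space remains.
        used = sum(l.size_mb for l in self.lanes)
        if used + job_peak_mb <= self.total_mb:
            lane = Lane(job_peak_mb)
            lane.jobs.append(job_name)
            self.lanes.append(lane)
            return lane
        return None   # admitting this job would violate the safety condition

    def reclaim(self, job_name):
        # At an iteration boundary, a finished job's lane space is reusable.
        for lane in self.lanes:
            if job_name in lane.jobs:
                lane.jobs.remove(job_name)
        self.lanes = [l for l in self.lanes if l.jobs]
```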

3.3 TICTAC: Accelerating Distributed Deep Learning with Communication Scheduling

TICTAC is a system that improves the performance of distributed DL by optimizing communication scheduling between workers. TICTAC achieves this by introducing a new communication scheduling technique that takes into account the communication patterns of different deep learning workloads.

3.3.1 Heuristics

The paper presents two heuristics for scheduling receive (recv) operations in distributed DL: Timing-Independent Communication Scheduling (TIC) and Timing-Aware Communication scheduling (TAC).

TIC assigns priorities to transfers based only on vertex dependencies in the directed acyclic graph (DAG), ignoring the execution time of each operation. It prioritizes transfers that are least blocking on computation, assuming that all operations have equal costs. TIC uses a simple universal time oracle to generate the schedule.

On the other hand, TAC prioritizes transfers that maximize the computation/communication overlap. TAC uses information on the execution time of each operation, estimated with a timing oracle, and the dependencies among operations specified by the computational DAG. The algorithm considers the opportunity for overlapping communication and computation and the completion time of dependent operations.
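
As a rough illustration of the TIC idea, the sketch below assigns earlier transfer slots to recv operations that unblock more downstream computation, using only the DAG structure and no timing information; the graph representation and tie-breaking are my own assumptions, not TICTAC's exact algorithm.

```python
def downstream_count(dag, node):
    """Count the operations transitively dependent on `node`.
    `dag` maps each op to the list of ops that consume its output."""
    seen, stack = set(), list(dag.get(node, []))
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(dag.get(n, []))
    return len(seen)

def tic_priorities(dag, recv_ops):
    """Lower priority value = transferred earlier; transfers that are
    least blocking on computation get the smallest numbers."""
    ranked = sorted(recv_ops, key=lambda r: -downstream_count(dag, r))
    return {op: i for i, op in enumerate(ranked)}

# Toy DAG: recv_a feeds two matmuls, recv_b feeds one.
dag = {"recv_a": ["mm1", "mm2"], "recv_b": ["mm3"],
       "mm1": ["loss"], "mm2": ["loss"], "mm3": ["loss"], "loss": []}
print(tic_priorities(dag, ["recv_a", "recv_b"]))   # recv_a is scheduled first
```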

3.3.2 System Design

The TICTAC system has four main components: the tracing module, the time oracle estimator, the ordering wizard, and the enforcement module.

The tracing module collects runtime statistics from an execution, which are later fed to the time oracle estimator. The time oracle is responsible for estimating the runtime of each operation in the system based on the execution timing statistics. The runtime may vary depending on the platform, device characteristics, input data, and even across iterations on the same hardware/software. The time oracle implementation chooses the minimum of all measured runs for a given operation.
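
A minimal sketch of this rule, assuming a simple trace format of my own choosing:

```python
def time_oracle(trace_runs):
    """trace_runs maps op name -> list of measured runtimes (seconds).
    The oracle returns the minimum observed runtime for each op."""
    return {op: min(times) for op, times in trace_runs.items()}

print(time_oracle({"matmul_1": [2.1, 1.9, 2.0], "recv_w1": [0.40, 0.35, 0.50]}))
```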

The ordering wizard assigns priorities to recv operations on a single worker, computed offline before execution. The schedule may be computed with either TIC or TAC. In TAC, the ordering module relies on the times estimated by the time oracle; in TIC, the order is determined from the DAG alone. Finally, the computed priorities are sent to the enforcement module.

The enforcement module takes as input the priority list computed by the ordering module and enforces this order on the network transfers of each worker. This module is implemented over the gRPC submodule of TensorFlow, which provides one channel per worker-PS pair and directs transfers between that pair to the same queue. Moreover, the enforcement module is implemented at the sender, before a transfer is handed to gRPC.

The priorities from the computed priority list are normalized to consecutive integers in the range [0, n). The sender maintains a counter for each worker per iteration, which is incremented when a corresponding transfer is handed to gRPC. Before a transfer is handed to gRPC, it is blocked until the corresponding counter reaches its normalized priority number.
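
Below is a simplified, single-process sketch of this counter-based gating, using a condition variable in place of the hook into TensorFlow's gRPC sender; the class and method names are illustrative assumptions.

```python
import threading

class TransferGate:
    """Blocks each transfer until all transfers with smaller priority numbers
    for the same worker have been handed to the network layer."""
    def __init__(self):
        self.counter = 0                       # next priority allowed to send
        self.cv = threading.Condition()

    def send(self, priority, payload, network_send):
        with self.cv:
            # Wait until it is this transfer's turn in the enforced order.
            self.cv.wait_for(lambda: self.counter == priority)
            network_send(payload)              # hand off to gRPC (stand-in)
            self.counter += 1                  # unblock the next transfer
            self.cv.notify_all()

# Example: three transfers arriving out of order are serialized as 0, 1, 2.
gate = TransferGate()
for p in (2, 0, 1):
    threading.Thread(target=gate.send, args=(p, f"tensor_{p}", print)).start()
```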

3.4 SLIDE: In Defense of Smart Algorithms over Hardware Acceleration for Large-Scale Deep Learning Systems

SLIDE is a deep learning system that aims to reduce the computational cost of training deep neural networks while maintaining accuracy. The approach taken by SLIDE is to use LSH to sample a small subset of active neurons in each layer of the network and only compute the activations of these active neurons. This sparsity allows for significant computational savings, as computations are performed on only a fraction of the total number of neurons in the network.

3.4.1 SLIDE Architecture

The SLIDE architecture consists of multiple layers containing a list of neurons and a set of LSH sampling hash tables. The network weights are initialized randomly during the network initialization, and K LSH hash functions are initialized for each layer. The LSH hash codes of the weight vectors of neurons in each layer are computed according to the hash functions. The ids of the neurons are saved into the hash buckets mapped by the LSH function. This construction of LSH hash tables in each layer is a one-time operation that can be easily parallelized independently with multiple threads over different neurons in the layer.
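
Below is a hedged sketch of this construction step using SimHash (signed random projections) as the LSH family; SLIDE supports LSH families of this general kind, but the parameters and data structures here are illustrative simplifications.

```python
import numpy as np

def build_lsh_tables(weights, num_tables=4, hashes_per_table=6, seed=0):
    """weights: (num_neurons, fan_in) weight matrix of one layer.
    Returns, per table, the random projections and buckets mapping hash code -> neuron ids."""
    rng = np.random.default_rng(seed)
    tables = []
    for _ in range(num_tables):
        projections = rng.normal(size=(hashes_per_table, weights.shape[1]))
        # Hash every neuron's weight vector; the sign pattern is the bucket key.
        codes = (weights @ projections.T) > 0          # (num_neurons, K) booleans
        buckets = {}
        for neuron_id, code in enumerate(codes):
            buckets.setdefault(tuple(code), []).append(neuron_id)
        tables.append((projections, buckets))
    return tables
```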

3.4.2 Feed-Forward & Back-Propagation Phases

In the feed-forward phase, given a single training instance, the input to each layer is fed into hash functions to compute the hash codes. The hashing serves as a query to retrieve ids of active (or sampled) neurons from the matching buckets in hash tables. Only the activations of active neurons are calculated and passed on as the inputs to the next layer.

After computing the network output, the back-propagation phase follows, where the errors are propagated layer by layer to calculate the gradient and update the weights. After updating the weights of any given neuron for every training data instance, the neuron propagates the partial gradients (using error propagation) back to only active neurons in previous layers via the connection weights. Updating active neurons ensures that we take full advantage of sparsity. Furthermore, the computation over each input is only of the order of active neurons and weights rather than the total number of parameters.
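
Continuing the sketch above, a feed-forward step would hash the layer input with the same projections and compute activations only for neurons found in the matching buckets; again, this is an illustrative simplification of SLIDE's engine rather than its actual code.

```python
import numpy as np

def active_neuron_forward(x, weights, bias, tables):
    """Return a sparse activation map {neuron_id: activation} for one layer."""
    active = set()
    for projections, buckets in tables:
        key = tuple((projections @ x) > 0)     # query with the input's hash code
        active.update(buckets.get(key, []))
    # Activations are computed only for the sampled (active) neurons.
    return {i: max(0.0, weights[i] @ x + bias[i]) for i in active}

# Usage with the tables built in the previous sketch (build_lsh_tables):
rng = np.random.default_rng(1)
W, b = rng.normal(size=(1024, 128)), np.zeros(1024)
out = active_neuron_forward(rng.normal(size=128), W, b, build_lsh_tables(W))
```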

SLIDE uses standard batch gradient descent with the Adam optimizer, where the batch size is generally on the order of hundreds. Each data instance in the batch runs in a separate thread, and its gradients are computed in parallel. This ensures that gradient computation is independent across threads, and the memory overhead is negligible on CPUs.
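
A minimal sketch of this per-instance parallelism, using a Python thread pool as a stand-in for SLIDE's OpenMP threads; the function names are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor

def batch_gradients(batch, grad_fn, num_threads=8):
    """Compute each instance's gradient in its own task; because the active-neuron
    sets rarely overlap, the per-instance gradients remain largely independent."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(grad_fn, batch))
```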

4 Discussion

In this discussion section, I cover the possibilities of combining the strategies from all three papers in various ways and pairings. Given the differences among the three papers, I also discuss which types of DL applications each strategy, or combination of strategies, would suit.

4.1 Feasibility of Combining SALUS and TICTAC for Distributed DL System Optimization

In this section, I discuss the feasibility of combining the optimization strategies of SALUS and TICTAC for distributed deep learning optimization. Both SALUS and TICTAC specialize in different areas of hardware-aware optimization. For example, SALUS optimizes memory sharing and scheduling policies to maximize GPU utilization and reduce contention among multiple DL jobs running on a single GPU. On the other hand, TICTAC optimizes communication scheduling between workers to reduce communication overhead in distributed DL.

Combining these two optimization strategies can potentially provide a more comprehensive solution for optimizing distributed DL systems. Using SALUS to optimize memory sharing and scheduling policies can reduce contention among multiple DL jobs running on the same GPU and improve GPU utilization. Using TICTAC to optimize communication scheduling between workers can reduce the communication overhead in distributed DL systems and improve the system's overall performance.

However, there are several challenges to consider when combining these two optimization strategies. One challenge is the potential increase in complexity when implementing both strategies together. SALUS and TICTAC require modifications to the DL framework and may introduce additional overhead when combined. Additionally, there may be trade-offs between the two strategies that need to be considered, such as prioritizing memory sharing over communication scheduling or vice versa.

Another challenge is the potential impact of combining both strategies on the system's overall performance. While SALUS and TICTAC are designed to optimize different aspects of the DL system, they may unexpectedly interact when combined. Therefore, future work should test the effectiveness of combining these two techniques. It is important to ensure that the system's performance still holds and that the benefits of each strategy are not diminished when combined.

4.2 Feasibility of Combining SLIDE with SALUS and TICTAC

Combining SLIDE with either of the two other techniques may pose a problem, as SLIDE was tested and developed to work well on a CPU rather than a GPU. SLIDE takes a different approach to training a deep neural network, pruning unnecessary neurons by organizing them into buckets within hash tables using hash functions. This pruning is only possible by avoiding dense matrix multiplication, which would require representing entire layers as vectors and matrices.

GPUs are designed for matrix multiplication, so it is unlikely that implementing SLIDE on a GPU would provide the additional computational efficiency and training-time reduction seen on a CPU. However, this should be tested in future work, as certain models may be more amenable to running SLIDE on a GPU. In addition, SLIDE may be better utilized as a standalone acceleration technique in scenarios where hardware acceleration or distributed training is not available or feasible.

4.3 Multiple Workers (Distributed Computing)

The TICTAC optimization strategy is designed to optimize communication scheduling between workers in distributed DL systems. In this scenario, TICTAC can effectively reduce communication overhead by scheduling communication in a way that minimizes idle time and reduces synchronization delays. However, the SALUS and SLIDE optimization strategies may not be as effective in this scenario, as they focus on GPU sharing and computational efficiency, respectively, rather than communication overhead.

4.4 Complexity of the Model

The SALUS, TICTAC, and SLIDE optimization strategies should be effective regardless of the complexity of the DL model being trained. However, this should be tested on a greater variety of DL models to ensure similar performance. The effectiveness of each strategy may vary depending on the model's specific requirements.

For example, SALUS may be more effective for models with large memory requirements, as it can optimize memory sharing and reduce contention among multiple jobs running on the same GPU. TICTAC may be more effective for models with a large number of parameters that require distributed training across multiple workers, as it can optimize communication scheduling to reduce communication overhead. SLIDE, on the other hand, may be more effective for very large DL models, where neuron activations are more likely to be sparse and the computational cost of training can therefore be reduced.

4.5 Training vs. Inference

The SALUS, TICTAC, and SLIDE optimization strategies are all focused on optimizing the training process of DL models and were not tested extensively during the inference phase. However, they may be able to improve inference performance by reducing the computational cost of prediction, given the optimizations made during the training phase.

I envision that the SLIDE optimization strategy could be particularly effective for the inference phase, as the final model utilizes fewer neurons by having trained on only the most important ones.

5 Limitations and Future Work

There are two limitations that I was able to identify when reviewing the three papers and their strategies.

The first limitation concerns TICTAC: when the authors tested their system design, they discarded the first two iterations because they took a very long time to finish. They removed these from their analysis, attributing them to a warm-up effect involving the initialization of the GPUs and cache. I would have liked to see these iterations included in the analysis, as substantial time spent in them could introduce enough overhead to not warrant using the strategy. For instance, if the method is applied to smaller models, the initialization phase could make up a more significant percentage of the overall training makespan.

Furthermore, computing the TIC and TAC heuristics takes approximately 10 seconds. Given this cost, there is even more evidence that for smaller DL models that train quickly, the overhead introduced by TICTAC may be too high to warrant its use. The authors even discuss this possible limitation in their paper: "small networks with small number of workers and parameters servers, the overhead associated with scheduling may overshadow the benefits of better overlap. In rare cases, we observe a slowdown of up to 4.2%" [@hashemi2019tictac]. These may not just be rare cases; future work should try to answer this question.

The second limitation concerns SLIDE: it was only tested on a CPU. Testing only on a CPU makes sense, given that the authors compared CPU speedups against the best benchmarks a top-of-the-line GPU can achieve. However, I would have liked to see whether the strategy could be implemented on a GPU; if not, the authors should have at least explained why it may not work, for example because the algorithm avoids matrix multiplication. In addition, the authors only tested SLIDE on a neural network with a single hidden layer; future work should therefore test the SLIDE implementation on a variety of architectures and devices.

6 Conclusion

In this survey paper, I presented an overview of three system-level strategies for accelerating DL systems: memory & resource sharing, worker scheduling, and algorithmic optimization. I provided some background followed by a general overview of each of the three strategies.

I also theorized about which combinations of the three strategies and which use cases make the most sense. Because SLIDE was tested and developed solely on a CPU, it may not combine effectively with the other two strategies. However, I expect future tests to show a positive, if not additive, interaction effect when combining TICTAC and SALUS on a distributed, multi-job DL system.

For future researchers who retest these strategies in different scenarios and combinations, results will depend heavily on the specific application of the DL system and the hardware used for testing. Future work should also explore additional strategies for accelerating DL systems, as well as investigate the challenges and limitations of implementing these three strategies in real-world applications and across various DL architectures.

References