title: Efficient Deep Learning, the "Systems" way.
author: Gerard Donahue
date: 4-7-2023
...

Introduction

Three Papers being Surveyed:
- Bandana: Using Non-Volatile Memory for Storing Deep Learning Models
- Fine-Grained GPU Sharing Primitives for Deep Learning Applications
- SLIDE: In Defense of Smart Algorithms over Hardware Acceleration for Large-Scale Deep Learning Systems

Specialized Computer Systems

While a computer comprises many vital layers of components, it is arguable that the systems layer has the most widespread impact on the success of computation. This is because computer systems act as an "algorithmic middle-man" between the hardware and the software: software is programmed and optimized against the abstractions the system provides, and hardware is engineered to serve the system's needs. The processes that coordinate computer execution make key components of the computer feasible, such as the operating system, specialized peripherals, memory management, caching, and much more.

While the methods in computer systems are typically grounded in algorithmic theory, it is useful to discuss the relationship between computer systems and hardware. Hardware engineering is the foundation that supports computer systems: hardware must be engineered to align with the standards of the field that uses it most closely. For example, the prevalence of caching in computer systems has led processors to include three separate layers of on-chip cache. This is what I will refer to as "hierarchical specialization".

The most effective way to develop computer systems methods for software applications is with this idea of hierarchical specialization: building the system for a specific application. For example, gaming software may use a protocol for streaming video to the monitor while updating the values on which the stream is based. In this case, the memory system must support several considerations to properly enable this optimization. While such a protocol may lead to low-latency video streaming, it would not be suitable for other applications such as deep learning.

Computer Systems for Deep Learning

Deep learning requires vast amounts of parallelization for forward passes during inference and backpropagation during training. To build systems that better support these two processes, researchers are interested in specializing their systems-level optimization techniques for deep learning. These techniques range from memory management to GPU sharing.

While there is much room for improvement in the efficiency of deep learning (DL) algorithms, specializing the computer systems that support DL is arguably just as important. For any computer program, the computer system that facilitates its execution can make or break the run-time. When the systems that facilitate DL jobs are specialized for DL, they can use domain-specific knowledge to achieve the best run-times. Since DL jobs often take a long time to train (sometimes weeks), minimizing the run-time allows more iterations of backpropagation in the same amount of time, which can lead to lower cost and better DL models.

An example of specialized computer systems techniques for deep learning comes from large language and recommendation models. To use these deep learning models, their parameters (or "weights") must be stored somewhere proximal to the processing unit. Deep neural networks contain multiple "layers" of neurons that progressively embed the data into features. Recommender systems in particular rely on a vocabulary of feature embeddings, which must be stored and readily available. Because DNNs conventionally use DRAM to store these embeddings, there is a need to optimize how DRAM is used for them.

Summary of Paper Contributions

The first paper, "Bandana," focuses on improving the storage of deep learning models. The authors propose using non-volatile memory (NVM) to store these models, which can lead to faster loading times and improved energy efficiency compared to traditional storage methods.

The second paper, "SLIDE," argues that using smart algorithms that are designed specifically for large-scale deep learning systems can be more effective than relying solely on hardware acceleration. The authors propose a new algorithm that is optimized for sparse data and can improve the efficiency of deep learning tasks.

The third paper, "Fine-Grained GPU Sharing Primitives," addresses the issue of resource sharing in deep learning systems. The authors propose a set of techniques for sharing GPU resources among multiple deep learning applications, which can improve overall system efficiency and reduce resource waste.

While each paper focuses on a unique aspect of hierarchical specialization for DL jobs, they all share the common goal of improving system efficiency and effectiveness. A recurring theme, particularly in Bandana and SLIDE, is caching and nearest-neighbor search.

Background

Word embeddings

As with any form of data used as input to statistical algorithms, raw data must be transformed into a mathematical representation. Consider colors as input. Computer scientists long ago realized that a color can be quantified with a vector of size 3, where the dimensions represent red, green, and blue intensities. This RGB representation of a color is a mathematical representation derived from the raw one. Similar to color, words can be mathematically represented as well.

Word feature embedding vectors can be obtained from a one-hot (sometimes called bag-of-words, BOW) indexing approach or from other feature extraction methods. In the one-hot approach, the size of the vector equals the size of the vocabulary, and a single value at the word's index is set to 1 while the others are 0. The issue with this approach is that the vectors are sparse and carry little semantic meaning. Hence, dense embeddings have risen in popularity for their ability to encode feature correlations after being trained on a large corpus of language data. These dense embeddings take up less space in memory and encode semantic meaning through learned features.
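
To make the contrast concrete, here is a minimal sketch in Python/NumPy of a one-hot vector versus a dense embedding lookup. The toy vocabulary, the embedding dimension, and the random table are illustrative assumptions, not taken from any of the surveyed papers.

```python
import numpy as np

# Hypothetical toy vocabulary; real vocabularies hold millions of entries.
vocab = ["deep", "learning", "systems", "cache", "memory"]
word_to_id = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Sparse representation: |V|-dimensional, a single 1 at the word's index."""
    vec = np.zeros(len(vocab))
    vec[word_to_id[word]] = 1.0
    return vec

# Dense embeddings: a learned |V| x d table (random here for illustration, d = 4).
embedding_table = np.random.randn(len(vocab), 4)

def dense_embedding(word):
    """Dense representation: a small learned vector looked up by index."""
    return embedding_table[word_to_id[word]]

print(one_hot("cache"))          # length-5 sparse vector
print(dense_embedding("cache"))  # length-4 dense vector
```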

To explore the systems approach for handling word-based feature embeddings, I will discuss how they are used by social media companies implementing large-scale language and recommendation systems. For example, Meta is known for its abundance of user accounts, user interactions, user posts, and more. Specific to Facebook, the platform distinguishes between user and post embeddings. The paper claims that these embeddings are stored in dedicated database tables, where the column ID represents an embedding's place within the table. Models receive an ID for a post (or user), extract the embedding from the database, and process it with their deep learning toolkits.

For both user and post embeddings, similar embeddings are meant to sit close to each other in Euclidean space. User embeddings signify the features of a user: they encode the interests, disinterests, and activity of the user on the application. Post embeddings represent individual posts from a user. Since there are many more post embeddings than user embeddings, much more compute is necessary to properly evaluate and encode the features of posts.

Backpropagation

Backpropagation is a supervised learning algorithm used to optimize the parametric weights of a deep neural network. This process allows the model to iteratively make predictions, analyze the loss (or quality) of its predictions, and change the model in order to decrease this loss. The algorithm works by propagating the loss back through the layers of the network, adjusting the weights based on the calculated error, and optimizing the network's performance.

Backpropagation was developed as a way to efficiently train multi-layer perceptron neural networks, which are feedforward networks consisting of multiple layers of neurons. Before backpropagation, training these networks was a laborious process that required manual adjustment of the weights. Backpropagation made it possible to adjust the weights automatically and train these networks on large datasets.

The backpropagation algorithm works by calculating the gradient of the loss function with respect to each weight in the network. The negative gradient indicates the direction of steepest descent, which is the direction that will result in the most significant decrease in the loss. The algorithm then adjusts the weights in the direction of the negative gradient, using an optimization algorithm like stochastic gradient descent.
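
As a minimal illustration of that update rule (not code from any of the surveyed papers), the sketch below performs one stochastic gradient descent step on a single linear layer with a squared-error loss; the data, layer size, and learning rate are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 3))     # a tiny batch of inputs
y = rng.standard_normal((8, 1))     # targets
W = rng.standard_normal((3, 1))     # weights of one linear layer
lr = 0.1                            # learning rate (hyperparameter)

pred = X @ W                        # forward pass
loss = np.mean((pred - y) ** 2)     # squared-error loss to be decreased

# Gradient of the loss with respect to W (chain rule for this simple layer).
grad_W = 2.0 * X.T @ (pred - y) / len(X)

# Step in the direction of the negative gradient (steepest descent).
W -= lr * grad_W
```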

One of the main benefits of backpropagation is that it can be used to train networks with many layers, which are known as deep neural networks. These networks can be used for a wide range of applications, including image recognition, natural language processing, and speech recognition. Backpropagation has been critical to the development of these applications, as it allows networks to learn complex representations of data and perform tasks that were previously impossible.

A primary issue with backpropagation is that it can be computationally expensive, especially for large datasets or deep networks with many layers. During training, the algorithm must propagate the error backwards through the network for each input, which can take a significant amount of time. This is particularly true for convolutional neural networks, which can have millions of weights and require massive amounts of data to train accurately.

Additionally, backpropagation can be sensitive to the choice of hyperparameters, such as the learning rate, which affects the runtime and performance of the algorithm. To address the cost, computer systems researchers have developed specialized techniques to increase parallelization and decrease the number of operations (NOO). In this paper we look at two methods that build computer systems to speed up backpropagation.

GPU Utilization

The recent surge in deep learning execution and the democratization of this technology have enabled more people to access GPU resources with greater affordability and ease. However, GPUs are designed to work with a single program at a time, which limits their potential for utilization. As such, there is ongoing research aimed at exploring ways to navigate this exclusivity and enable clusters of servers to work together, utilizing multiple GPUs for training a single model.

As a result of the growing use of DL algorithms in practice, the technological community has worked tirelessly to provide expensive and specialized Graphics Processing Units (GPUs). These specialized GPUs have led to a boom in DL feasibility and are the hallmark of many notable achievements in artificial intelligence and machine learning. For fully connected deep learning architectures, the bulk of the work is matrix multiplication: the outputs of one layer form a matrix, and the weights of the next layer are also a matrix. As such, large numbers of matrix multiplications are required to complete forward passes on these networks.
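
To show why matrix multiplication dominates, here is a schematic NumPy forward pass through a small fully connected network; the batch size and layer widths are made-up values for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
batch = rng.standard_normal((32, 128))      # a batch of 32 input vectors

# Weight matrices for three fully connected layers (sizes are illustrative).
W1 = rng.standard_normal((128, 256))
W2 = rng.standard_normal((256, 256))
W3 = rng.standard_normal((256, 10))

def relu(x):
    return np.maximum(x, 0.0)

# Each layer is "outputs of the previous layer" times "weights of this layer":
# exactly the chain of matrix multiplications that GPUs parallelize so well.
h1 = relu(batch @ W1)
h2 = relu(h1 @ W2)
logits = h2 @ W3
```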

One of the key issues with current algorithms is that they often fail to fully utilize the available GPU resources. For instance, a model trained on ten GPUs with only 50% utilization could have been trained using only five or six GPUs. This inefficiency can result in significant additional costs for users. By developing algorithms that enable multiple GPUs to share computational power and optimize computations, GPU utilization can be maximized to save users thousands of dollars.

One domain which would benefit greatly from better GPU utilization with resource sharing is automatic model tuning. To find the optimal hyperparameters for the training of a DL model, deployers use automatic model tuning to train many models at once with varying hyperparameters. If the community can develop algorithms to maximize the ability for many GPUs to share computational power and coordinate the optimal computations, then more models can be trained during automatic model tuning. This will allow for a more precise model tuning process and better DL execution in general. In addition to automatic model tuning, more effective GPU sharing will optimize a system's ability to perform tasks such as cross-validation and multi-model evaluation.

The primary GPU resources that remain to be fully optimized are compute and memory. GPUs provide an abundance of computational capability that can perform many calculations simultaneously. GPU memory is also important for DL models, since it stores the entire model, including every weight vector for each layer; during backpropagation or forward passes, the GPU loads the weights from this memory for computation.

To enhance the performance of multiple GPUs for training or inference in deep learning, fast job switching and memory sharing are crucial. Memory sharing requires a system that supports computation customization and optimization. By optimizing GPU usage, DL models can perform better, train faster, and infer faster during deployment, driving breakthroughs in AI research and opening up new possibilities for the future of deep learning.

Similarity-Based Systems for DL

There are countless methods that use similarity-based algorithms to optimize DL algorithms. The k-means clustering algorithm is one example: it groups a dataset of vectors based on their Euclidean distance to a given number of "vector prototypes" (centroids). While such algorithms are helpful within DL applications themselves, there is also much opportunity to use similarity-based algorithms to optimize the computer systems that are specialized for DL.

In this section, we introduce two similarity-based computer systems methods from two different papers and discuss where their approaches converge and diverge from an intuitive standpoint. First, we discuss Bandana, which uses similarity-based algorithms to optimize the usage of DRAM for word embeddings in large language models. Second, we discuss SLIDE, which uses the similarity of the weights of a neural network to sparsify the backpropagation algorithm and decrease the number of operations (NOO).

Optimizing DRAM for Large Language Models

The first topic I will explore regarding hierarchical specialization for DL comes from the paper Bandana: Using Non-Volatile Memory for Storing Deep Learning Models. The paper presents a storage scheme that simulates dozens of small caches to increase locality with respect to feature embeddings. Bandana uses non-volatile memory (NVM) as the primary storage for feature embeddings. Its caching strategy brings feature embeddings into DRAM based on locality and likeliness of co-occurrence, which allows the system to decrease the number of transactions with the NVM.

While DRAM and NVM are both essential for DL job execution, the cost of DRAM is increasing. Many sources attribute this to a global supply shortage: as supply decreases, prices are driven sky high, and methods are needed to adapt to this market shift. Many large companies (including Meta) store the entire set of feature embeddings in DRAM. Given DRAM's high cost, maintaining this practice will become harder and harder for companies deploying large-scale language models.

To alleviate the dependency on DRAM, Bandana explores a caching strategy to decide which embeddings are best to load from NVM into DRAM. This approach will be familiar to anyone well-versed in computer systems: locality. The concept is common in standard L1 caching, where an access to one address causes the hardware to load the neighboring addresses in the same cache line, since they are likely to be accessed soon. An example would be a program accessing the first index of an array and the cache loading other elements of that array as well.

Similar to traditional fully-associative or direct-mapped caching on the CPU, feature embeddings for word representation benefit from a locality-based approach. Traditional CPU caching mirrors data from a higher-latency storage medium (DRAM) in a much lower-latency one (the on-chip L1 cache). For word embeddings, Bandana shifts this caching up one level of the hierarchy, mirroring data from high-latency storage (NVM) in lower-latency storage (DRAM).

While NVM offers clear advantages in cost and capacity, its latency is still much higher than that of DRAM and its bandwidth is limited. Moreover, NVM reads are performed at a granularity of 4 kilobyte (KB) blocks, whereas user embeddings are only 64-128 bytes (B). Simply substituting NVM for DRAM for user embeddings would therefore waste most of each read. Bandana works to utilize the full 4 KB read with two approaches: pre-fetching embeddings using a Social Hash Partitioner (SHP) or k-means clustering, and caching word embeddings in DRAM with an efficient eviction policy.
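
The sketch below illustrates the packing problem under the numbers stated above (4 KB blocks, 128 B embeddings): if related embeddings are laid out in the same block, one block read serves many future lookups. The grouping function here is a stand-in for Bandana's k-means or SHP partitioning, not the paper's actual implementation.

```python
BLOCK_SIZE = 4096        # NVM read granularity in bytes
EMBEDDING_SIZE = 128     # bytes per embedding (64-128 B in the paper)
PER_BLOCK = BLOCK_SIZE // EMBEDDING_SIZE   # 32 embeddings fit in one block

def pack_into_blocks(embedding_ids, group_of):
    """Lay out embeddings so members of the same group share a 4 KB block.

    `group_of` is any grouping function (k-means cluster id, SHP bucket, ...);
    it is a parameter here because Bandana experiments with several schemes.
    """
    groups = {}
    for eid in embedding_ids:
        groups.setdefault(group_of(eid), []).append(eid)

    blocks = []
    for members in groups.values():
        for i in range(0, len(members), PER_BLOCK):
            blocks.append(members[i:i + PER_BLOCK])
    return blocks

# Toy usage: group ids by a fake "cluster" (id modulo 4).
layout = pack_into_blocks(range(100), group_of=lambda eid: eid % 4)
```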

NVM and DRAM Usage

The Bandana paper's metric of choice is effective bandwidth: the percentage of total NVM read bandwidth that is actually used by the application. The authors also note that naively caching every vector that happens to sit physically next to an accessed vector in NVM is a poor policy; without a deliberate placement scheme, those neighboring vectors are essentially random, so bringing them into DRAM provides no locality benefit. The main purpose of Bandana is to maximize the effective bandwidth.

NVM devices can be either byte-addressable or block-addressable; this paper focuses on the block-addressable kind. The authors ran a collection of I/O tests using the popular I/O benchmarking tool FIO and concluded that NVM device performance degrades as the write rate increases; the devices can sustain up to about 30 device writes per day. Fortunately, this is not a problem for this work, because Facebook only updates its embeddings 10-20 times per day.

Bandana explores using semantic reasoning to find similar vectors to cache in DRAM. There are many ways to group a dataset of vectors based on semantic partitioning. A natural choice is the k-means clustering algorithm, an iterative algorithm that takes a parameter k, the number of groups or "clusters" the data is assumed to contain. The algorithm iteratively chooses a centroid for each cluster, assigns embeddings to their nearest centroid, and then recomputes the centroids from the previous assignment of embedding vectors.
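
For reference, a bare-bones k-means loop looks like the following NumPy sketch; the random data, iteration count, and lack of a convergence check are simplifying assumptions, not Bandana's actual implementation.

```python
import numpy as np

def kmeans(vectors, k, iters=10, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centroids by sampling k embeddings at random.
    centroids = vectors[rng.choice(len(vectors), size=k, replace=False)]
    for _ in range(iters):
        # Assign each embedding to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned embeddings.
        for c in range(k):
            members = vectors[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    return assign, centroids

embeddings = np.random.randn(1000, 64).astype(np.float32)
clusters, _ = kmeans(embeddings, k=8)
```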

The issue with k-means is that its run-time grows steeply as the value of k increases. Figure 7 of the paper shows this run-time problem, while Figure 8 shows the effective bandwidth of the NVM when different numbers of k-means sub-clusters are used. Sub-clusters make the algorithm recursive and specialize the optimization procedure; it is a more hierarchical version of the algorithm.

While relying on Euclidean distance to predict semantically similar vectors is intuitive, there is no guarantee that it is effective in practice, and running k-means every time the embedding vectors are updated is expensive. Social Hash Partitioning (SHP), by contrast, does not depend on Euclidean distance; it learns to group vectors by index based on past access patterns for each vector. Since this approach does not rely on the embedding values themselves, it does not need to be recomputed when those values are updated.

The paper then looks at the eviction policy of the DRAM cache. The issue with evicting at the block level is that some embeddings already in the cache may be hotter than the ones arriving to replace them. The paper therefore experiments with changing where pre-fetched vectors are inserted in the eviction queue, so that hot vectors already in the cache are not pushed out. The authors found no significant improvement from this policy alone.

To further improve this method, the authors introduce another cache called the shadow cache, which stores the indices of embeddings that have already been read. When pre-fetching, only vectors that have previously been read (and are therefore in the shadow cache) are admitted. The shadow cache is larger than the DRAM embedding cache itself.
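
A simplified sketch of that admission idea follows; it is my own toy model under the assumptions above (an LRU DRAM cache gated by a larger shadow cache of recently read IDs), not Bandana's actual data structures.

```python
from collections import OrderedDict

class ShadowAdmissionCache:
    """Toy LRU cache whose prefetch admission is gated by a shadow cache of IDs."""

    def __init__(self, capacity, shadow_capacity):
        self.cache = OrderedDict()          # id -> embedding (the DRAM cache)
        self.shadow = OrderedDict()         # id only (larger than the cache)
        self.capacity = capacity
        self.shadow_capacity = shadow_capacity

    def _touch_shadow(self, eid):
        self.shadow[eid] = None
        self.shadow.move_to_end(eid)
        if len(self.shadow) > self.shadow_capacity:
            self.shadow.popitem(last=False)

    def access(self, eid, fetch):
        """An explicit read: always cached, and recorded in the shadow cache."""
        self._touch_shadow(eid)
        if eid in self.cache:
            self.cache.move_to_end(eid)
            return self.cache[eid]
        value = fetch(eid)                  # e.g. read the 4 KB block from NVM
        self._insert(eid, value)
        return value

    def prefetch(self, eid, value):
        """A speculative insert: admitted only if the shadow cache has seen it."""
        if eid in self.shadow:
            self._insert(eid, value)

    def _insert(self, eid, value):
        self.cache[eid] = value
        self.cache.move_to_end(eid)
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used entry

# Toy usage: a fetch function standing in for an NVM block read.
cache = ShadowAdmissionCache(capacity=1000, shadow_capacity=4000)
vec = cache.access(42, fetch=lambda eid: b"\x00" * 128)
```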

Decreasing NOO for Backpropagation

The Sub-linear Deep Learning Engine (SLIDE) paper claims that matrix multiply hardware capabilities are reaching a limit, while the size of networks is growing exponentially. This unbalanced rate of growth will soon cause hardware to bottleneck the innovation that deep learning presents to the community. As a result, there is a push to create specialized DL hardware to handle the massive amount of computation. However, specialized hardware is, well, specialized. Investing in hardware meant for one task is expensive and less generalizable.

Just as Bandana uses similarity-based algorithms to optimize DRAM caching for word embeddings, SLIDE uses locality sensitive hashing (LSH) to drive a similarity-based optimization. Locality sensitive hashing is a family of hash functions that increase the probability of similar inputs colliding; the LSH family is formally defined in the paper in Definition 2.1. This randomized algorithmic property allows provably efficient query time for nearest-neighbor search under a given similarity metric. All in all, using an LSH family allows nearest-neighbor search in sub-linear time.

Rather than using similarity-based algorithms to cache word vectors, SLIDE uses LSH to find similar weight vectors in order to sparsify gradient updates during backpropagation. Sparsification here means a non-exhaustive procedure that selects only a subset of items out of the universal set. Every gradient update involves a backpropagation pass that recursively updates the weights of the fully connected neural network to minimize the loss function. In practice, GPUs are used to maximize the parallelism of the backpropagation and forward passes. The authors aim to remove the dependency on expensive, specialized GPU hardware by investigating a "non-exhaustive" procedure for backpropagation.

There are two phases to the sparsification algorithm. First, L hash tables are created to store the dataset of weight vectors from the neural network. Second, a query Q is hashed into the LSH tables and a union of L hash buckets is returned. The nearest neighbor is then computed by comparing the distance between the items in that union and the original query itself. This drastically reduces the overhead of computing nearest neighbors within the neural network.
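
The two phases can be sketched with a simple signed-random-projection (SimHash-style) family; the table count L, hash length, and Euclidean re-ranking step below are illustrative assumptions rather than SLIDE's actual hash family and settings.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, L, BITS = 64, 4, 8              # vector dim, number of tables, bits per hash

# Phase 1: build L hash tables over the dataset of weight vectors.
projections = [rng.standard_normal((BITS, DIM)) for _ in range(L)]

def simhash(vec, proj):
    bits = (proj @ vec) > 0
    return bits.tobytes()            # use the sign pattern as the bucket key

def build_tables(vectors):
    tables = [dict() for _ in range(L)]
    for idx, v in enumerate(vectors):
        for t, proj in enumerate(projections):
            tables[t].setdefault(simhash(v, proj), []).append(idx)
    return tables

# Phase 2: a query probes one bucket per table and unions the candidates.
def query(tables, vectors, q):
    candidates = set()
    for t, proj in enumerate(projections):
        candidates.update(tables[t].get(simhash(q, proj), []))
    # Exact distances are computed only against the (small) candidate set.
    return min(candidates, key=lambda i: np.linalg.norm(vectors[i] - q), default=None)

weights = rng.standard_normal((1000, DIM))
tables = build_tables(weights)
nearest = query(tables, weights, rng.standard_normal(DIM))
```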

The issue with vanilla LSH is that it needs a large number of hash tables L, and the overhead of the algorithm is high. To overcome this, researchers developed a better sampling procedure for LSH in which the algorithm only probes a few hash buckets. This newer version of the algorithm enables what is called "adaptive dropout" for neural networks.

Algorithm 1 of the SLIDE paper uses a hash table at each layer of the neural network to encode the weighted parameters of that layer into many buckets; all neurons are inserted into their layer's hash table according to the hash function. The algorithm then iterates n times. Each iteration has two phases: first, inputs are chosen from the input dataset, and second, neurons are sampled from the hash table at each layer according to the LSH sampling procedure.
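
Below is a heavily simplified sketch of that per-layer sampling, not SLIDE's C++/OpenMP implementation: only the neurons returned by an LSH lookup are activated, so the forward pass touches a small fraction of each weight matrix. The `lsh_sample` argument is a placeholder for a bucket-probing routine like the one sketched earlier; the random sampler in the usage example is purely illustrative.

```python
import numpy as np

def sparse_forward(x, layers, lsh_sample):
    """Forward pass that only evaluates LSH-sampled ("active") neurons per layer.

    `layers` is a list of (W, b) pairs; `lsh_sample(layer_idx, activations)` is
    assumed to return the indices of neurons whose weights collide with the input.
    """
    active_prev = np.arange(x.shape[0])          # all input features are active
    activations = x
    for li, (W, b) in enumerate(layers):
        active = lsh_sample(li, activations)     # the sampled subset of neurons
        # Only the sampled rows of W (and the active inputs) participate.
        z = W[np.ix_(active, active_prev)] @ activations + b[active]
        activations = np.maximum(z, 0.0)
        active_prev = active
    return active_prev, activations

# Toy usage with a random sampler standing in for the LSH tables.
rng = np.random.default_rng(0)
layers = [(rng.standard_normal((256, 128)), np.zeros(256)),
          (rng.standard_normal((10, 256)), np.zeros(10))]
sampler = lambda li, act: rng.choice(
    layers[li][0].shape[0], size=min(16, layers[li][0].shape[0]), replace=False)
active, out = sparse_forward(rng.standard_normal(128), layers, sampler)
```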

Strengths and Limitations

Both SLIDE and Bandana employ similarity-based algorithms to create cohesive computer systems in support of deep learning. Bandana offloads some DRAM utilization to NVM by using a caching strategy based on nearest-neighbor word embeddings. In a similar manner, SLIDE provides a sub-linear algorithm for finding nearest-neighbor parameters to sparsify backpropagation. Bandana improves the utilization of the NVM-DRAM system, and SLIDE decreases the NOO needed during backpropagation. Furthermore, SLIDE relinquishes the need for powerful GPUs by making backpropagation feasible to run on a CPU alone.

Thanks to the emergence of GPUs, DL has become more accessible and affordable for individuals and businesses alike. To further the use of AI in data centers and production environments, systems like Bandana and SLIDE have been developed to speed up training. Notably, both are evaluated on large-scale language workloads. Bandana employs caching together with clustering and nearest-neighbor techniques to proactively stage word embeddings in DRAM, reducing the need to communicate with the NVM. Meanwhile, SLIDE uses hashing to sample a set of weights for each query and minimize the amount of computation required during backpropagation.

While SLIDE and Bandana demonstrate optimized computer systems that support DL jobs, there are limitations in the experiments they run. Both papers focus on fully connected DL architectures. While fully connected layers are still widely used, more modern architectures are common: convolutional neural networks, LSTMs, and transformers are used frequently, especially for temporal data. In the large field of natural language processing in particular, transformers power many of the best models.

Optimizing GPU Utilization

GPUs contain numerous processors that can perform many computations simultaneously, making them well suited for the matrix operations that DL jobs require. By using GPUs, DL models can be trained, and can run inference, much faster than on CPUs, reducing the time and computational resources needed to train large neural networks. These run-time advantages also allow for the execution of multiple models with different architectures and hyperparameters, which helps researchers and data scientists iterate more quickly and find better solutions faster. Overall, optimal GPU utilization can significantly improve the performance and efficiency of deep learning, making it possible to tackle increasingly complex and challenging problems.

In this section, I explore a method called Salus. Salus enables two GPU sharing primitives: fast job switching and memory sharing. The purpose of these primitives is to allow GPU sharing among multiple DL applications and modules, so that GPU utilization improves during both training and inference. Salus performs iteration-level scheduling and addresses the memory management issues that come with it.

There are two specific components of Salus, both mentioned above. The first is efficient job switching, which allows jobs to seamlessly share a GPU's computational resources. The second is memory sharing, in which GPU memory is segmented; finer segmentation of GPU memory gives deployers the ability to customize memory usage. As a result of these two components, DL workloads can be pre-empted, and many DL training jobs can run at once.

SALUS: time sharing

Salus is a recent method that takes advantage of GPUs and enables time sharing. With Salus, users can place multiple tasks on a single GPU and the system will manage the resources efficiently. Salus is designed to improve GPU utilization while maintaining performance and fairness for each task. This has significant implications for companies and individuals who require large-scale deep learning training, as it enables faster and more efficient training. In this way, Salus demonstrates how systems for deep learning are evolving and adapting to meet the needs of an increasingly diverse user base.

Time sharing allows different programs to use the same compute resources at different times. This is useful because multiple DL jobs can then be executed on the same machine without interfering with each other. When jobs are switched in this manner, one program must pause for another to resume execution; this pausing functionality is often implemented via "checkpointing". Checkpointing DL models is useful in certain cases, but investigating better methods for time sharing can remove the need for checkpointing and allow better DL parallelism.

Modern computer programs that do deep learning use checkpointing frequently. Checkpointing allows for DL jobs to pause and resume, which is important when DL jobs take long amounts of time to execute. It can allow for any desired scheduling policy amongst a collection of these jobs.
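
For context, checkpointing in a typical PyTorch training loop is simply saving and restoring state. This is a generic sketch rather than anything tied to Salus; the model, optimizer, and file path are arbitrary stand-ins.

```python
import torch

model = torch.nn.Linear(128, 10)                 # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Pause: persist enough state to resume training later.
torch.save({"model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "step": 1000},
           "checkpoint.pt")

# Resume: reload the state and continue from the saved step.
state = torch.load("checkpoint.pt")
model.load_state_dict(state["model"])
optimizer.load_state_dict(state["optimizer"])
step = state["step"]
```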

While checkpointing has proven helpful for time sharing, the resulting context switch can be slow because it involves moving a lot of data between the host's RAM and the GPU. This is especially a problem when a DL model is used at inference time, where the data movement can take longer than the prediction itself. The authors note that DL jobs cannot afford to switch between jobs this way too often.

SALUS: memory sharing

GPU memory usage in a DL job is primarily composed of two types of data. The first is the model parameters; because these are usually of fixed size and data type, they account for a predictable, static amount of storage. The second is ephemeral memory: the intermediate, temporary data used during computation, such as hidden-layer outputs, the recurrent state of an LSTM, or scratch space and metadata for convolution and max-pooling kernels.

The authors designed a memory layout scheme called "GPU Lane" that improves memory sharing and utilization for deep learning jobs. The scheme divides GPU memory into two regions: ephemeral and persistent. The ephemeral region is further divided into lanes, contiguous memory spaces that hold the ephemeral allocations of an iteration. Each lane can be assigned multiple deep learning jobs, which are time-shared within the lane, while parallelism across lanes is achieved using GPU streams.
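
A toy model of that layout is sketched below: pure Python bookkeeping, not Salus's allocator, with made-up sizes and a naive first-fit lane assignment. It only shows the split into a persistent region for parameters and contiguous ephemeral lanes that jobs share.

```python
class GPULaneLayout:
    """Toy bookkeeping for a Salus-style split of GPU memory (sizes in MB)."""

    def __init__(self, total_mb, persistent_mb, lane_mb):
        self.persistent = {"size": persistent_mb, "jobs": []}   # model parameters
        ephemeral_mb = total_mb - persistent_mb
        self.lanes = [{"size": lane_mb, "jobs": []}             # ephemeral lanes
                      for _ in range(ephemeral_mb // lane_mb)]

    def admit(self, job, peak_ephemeral_mb):
        """Assign a job to the first lane that can hold its peak iteration memory.

        Jobs sharing a lane are time-shared; jobs in different lanes can run
        in parallel on separate GPU streams.
        """
        for lane in self.lanes:
            if peak_ephemeral_mb <= lane["size"]:
                lane["jobs"].append(job)
                self.persistent["jobs"].append(job)
                return lane
        return None   # no lane is large enough: the job cannot be admitted

layout = GPULaneLayout(total_mb=16000, persistent_mb=4000, lane_mb=3000)
layout.admit("resnet-training", peak_ephemeral_mb=2500)
layout.admit("lstm-inference", peak_ephemeral_mb=800)
```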

By using GPU Lane, the authors aim to address the issue of memory fragmentation, which can cause out-of-memory errors even if there is enough memory available for two iterations' peak memory usage. While framework-internal memory allocations are small in size, they can have a large impact on the overall memory layout and may create more memory fragmentation, leading to inefficient use of memory. Therefore, the GPU Lane scheme provides a solution to this problem by effectively managing the memory space and improving memory utilization. This, in turn, can help optimize the performance of deep learning models and reduce the cost of training and inference.

Synergies

In this section I discuss the relationships (or "synergies") between these three methods. Specifically, I look at where the methods complement each other (converge) and where they conflict (diverge). By analyzing more deeply how Bandana, SLIDE, and Salus converge, I reach some conclusions about the benefits of their synergy. In addition, I caution practitioners by analyzing how SLIDE and Salus diverge.

Where do these methods converge?

The convergence of these methods is primarily facilitated by Bandana in the field of NLP. Bandana is a caching optimization technique for word embeddings, and it is independent of both the SLIDE and Salus methods; it creates a system between NVM and DRAM to maximize the effective bandwidth of the NVM. As such, I expect that when running deep architectures for NLP, Bandana can optimize DRAM caching for word embeddings while SLIDE decreases the NOO of backpropagation and Salus enables GPU sharing. In summary, combining the three enables more efficient cross-validation and automatic model tuning (Salus) for NLP models (Bandana) with fewer operations in backpropagation (SLIDE).

Where do these methods diverge?

Firstly, it is important to note that Bandana is optimized for large language models and is not specialized for other DL tasks such as reinforcement learning or computer vision. While this is a limitation of Bandana, the other two methods provide more general advantages across a variety of DL tasks.

Secondly, a major claim of the SLIDE method is that CPUs can run DL jobs faster than even the strongest GPUs. On the other hand, Salus provides new capabilities for GPU sharing, claiming that sharing yields many benefits when training many models. Essentially, Salus claims an M → G relationship between machine learning models (M) and GPUs (G) where M > G. SLIDE does not explain how the number of CPUs scales with the number of DL jobs being performed; my main takeaway is that SLIDE implies an M → C relationship between machine learning models (M) and CPUs (C) where M > C.

Conclusions

In this paper, I analyzed three methods: Bandana, SLIDE, and Salus. I began by providing background on word embeddings, backpropagation, and GPU utilization. I then explained Bandana and SLIDE and how they use similarity-based algorithms to build specialized computer systems for DL. Subsequently, I explained the need for better GPU utilization with the method Salus, approaching the discussion from an intuitive standpoint, and showed how GPU sharing can provide valuable capabilities for cross-validation and automatic model tuning.

Through this survey, I have been exposed to three strong papers that use hierarchical specialization to optimize DL jobs, and I have learned that systems papers are genuinely interesting to read. My research is in machine learning on temporal time series data. While my specialty may be separate from these papers, my work is heavily influenced by research in specialized computer systems for DL.

The models that I utilize for my research are trained on some intense GPUs. I have been able to conduct my research and train models without knowing anything at all about the GPU. I am happy that I chose the SALUS paper to give me some more information on the GPU.

Acknowledgements and Thank you

I would like to say thank you to Professor Gene Cooperman for his valuable lectures and philosophies which guided my learning. After taking this course I understand the inherent need for computer systems research, and a valuable methodology for technical writing. I look forward to utilizing these skills in my future life work in computer science.