title: Memory Optimizations for (Parallel) Deep Learning Applications
author: Francesca Lucchetti
date: 04-07-2023

Introduction

[Bandana: Using Non-Volatile Memory for Storing Deep Learning Models. Assaf Eisenman, Maxim Naumov, Darryl Gardner, Misha Smelyanskiy, Sergey Pupyrev, Kim Hazelwood, Asaf Cidon, Sachin Katti. 2019. In Proc. of Machine Learning and Systems 1.](https://proceedings.mlsys.org/paper/2019/file/34173cb38f07f89ddbebc2ac9128303f-Paper.pdf)

[Fine-Grained GPU Sharing Primitives for Deep Learning Applications. Peifeng Yu, Mosharaf Chowdhury. 2020. In Proc. of Machine Learning and Systems 2.](https://proceedings.mlsys.org/paper/2020/file/f7177163c833dff4b38fc8d2872f1ec6-Paper.pdf)

[ZeRO: Memory Optimizations Toward Training Trillion Parameter Models. Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He. 2019. SC20: International Conference for High Performance Computing, Networking, Storage and Analysis.](https://arxiv.org/abs/1910.02054)

A turning point for the field of Deep Learning (DL) was the adoption of the Graphics Processing Unit (GPU) as its primary workhorse. The GPU’s specialised architecture offered increased power for DL computations; however, standard CPU optimization techniques such as checkpointing, caching, and sharing primitives were not native to the GPU’s operating system.

In recent years, memory requirements for DL applications have scaled along with the growth in neural network parameters. The increase in network size has made the memory constraints of GPUs an obstacle to running larger models. So far, the naive solution to the memory problem has been to add more and more GPUs, an approach that is unsustainable given the high cost and limited supply of these units.

This survey examines three different techniques for memory optimization in DL applications, namely:

  1. Storing network parameters in non-volatile memory;
  2. Partitioning the GPU to create a GPU virtual memory;
  3. Removing redundancy in DL computations.

Background

As a new technology, GPUs are constantly evolving with innovations in hardware and the introduction of novel killer applications. Due to the constant updates to GPU architecture and operating systems, research in GPU memory optimization is not a mature field: publications quickly become outdated or otherwise adopt widely different approaches. One way to categorise research in this field is by application. Intuitively, since GPUs are a specialised piece of hardware, research on how to best utilise GPUs is tailored around specific applications such as Physics Modeling, Graphics, or Deep Learning.

The efficient utilisation of GPUs for DL requires the use of data parallelism. Data parallelism is what makes GPUs much more convenient for DL than CPUs, since neural networks are trained by applying the same small computation to a vast amount of data. This means the many GPU cores can each carry out the exact same operation on different data in parallel, without contention over shared data.
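
As a concrete illustration, here is a minimal sketch of data parallelism in PyTorch (assuming the `torch` package is installed; the snippet falls back to CPU if no GPU is present): the same small model is replicated across the visible devices, and each replica applies the identical computation to a different slice of the batch.

```python
import torch
import torch.nn as nn

# A small model; every replica applies the same computation.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))

if torch.cuda.is_available():
    # nn.DataParallel splits each batch across the visible GPUs, runs the
    # same forward pass on every replica, and gathers the outputs.
    model = nn.DataParallel(model).cuda()
    batch = torch.randn(64, 128).cuda()
else:
    batch = torch.randn(64, 128)

outputs = model(batch)   # each device processed a different slice of the batch
loss = outputs.sum()
loss.backward()          # gradients from all replicas accumulate on the source device
```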

In practice, data parallelism in GPUs is often a suboptimal policy because of the heterogeneity of DL job requests. For example, running a model in inference mode underutilises the GPU because the forward pass is not expensive and most of the memory lies idle. Conversely, a training job occupies the majority of the GPU for a long time due to the computational demands of backprop. This is not an issue if only one user owns the GPU at a time; however, if we add more users and requests, the training job suddenly creates head-of-line blocking in the job queue. Furthermore, the data type of the inputs fed to neural networks can impact memory requirements, since image encodings require more memory to store than word encodings.

Taking into consideration the variable model size, resource usage and data access patterns of DL jobs, an appropriate way to categorise the domain is by the following features:

  1. The size of the models used;
  2. The number of models loaded onto one GPU;
  3. The number of GPUs used, either a single node or a cluster;
  4. The running mode of DL jobs: training, fine-tuning or inference.

The combination of these features determines if certain memory optimizations can be made, for example the model size determines how many models can fit into a single GPU. Likewise the job running mode determines whether multiple jobs can be packed onto a single GPU.

Part of the reason it is challenging to compare contributions to this field is that authors do not make these distinctions in the taxonomy of the field. As such, it can be hard for readers to identify the target usage of a paper: if a method is designed specifically for training large models on GPU clusters, can it also be applied to parallel inference on many small models? Where do different use cases overlap, and which are fundamentally at odds with each other? This survey will consider these cases of overlap and incompatibility while comparing approaches to memory optimization.

Theme: The Cost of Large Language Models

Current GPUs cannot load large models onto one chip. As such, these models need to be split across multiple GPUs, which creates issues both for running the model and for ensuring GPU memory is properly handled and serialised. This takes a toll on the performance of the whole system. It is possible that in the near future hardware will support efficient communication between GPUs, in which case prior research on the efficient handling of GPU memory will be rendered obsolete. However, there is still an argument to be made for the cost savings that efficient GPU memory utilisation allows.

Due to the emergence of large-scale models, a big motivator for GPU optimization is reducing cost. The current cost of a GPU unit is prohibitive for most single users, and even companies and researchers may have a hard time acquiring units due to supply chain issues and long waiting lists. Whether the cost of GPUs will decrease once demand is met with new supply is as yet unknown. However, it is still in the general interest to reduce the number of required GPUs, not only for cost reasons but for environmental preservation as well, since GPUs can consume significant power even when idle.

This survey examines three systems designed for the efficient management of memory in DL applications, namely Bandana, Salus and ZeRO. The biggest obstacle when creating systems for GPU memory management is the need for fault tolerance. All the systems reviewed here are deficient in their discussion of bad DL jobs which crash and render memory unusable until cleared. Some types of DL jobs are especially prone to faults, such as deep interventions or activation patching, which involve directly editing model parameters in ways that can be unsafe.

Not only are DL jobs prone to crashing due to programmer or model errors, but the GPU hardware itself may cause errors regardless of perfect use. In a parallel setting where multiple users send job requests to the GPU, these errors become more likely, and disrupt more users when they occur. Crucially, GPU kernels do not provide native resource scheduling primitives like CPUs do, and in particular they lack interrupts or clock cycles which allow the system to detect bad jobs and remove them promptly.

One established approach to enabling fault tolerance in GPUs is the creation of GPU-compatible checkpointing. Checkpointing is the process of saving an image of the GPU at runtime before an interruption occurs. This is done so that the context can be restored at a later time and the GPU may continue computation as if no error happened. Checkpointing can also be useful outside of job errors, for example to allow a smart scheduler to switch between jobs if one job is running for longer than expected and hogging resources. This is especially useful in DL jobs where predicting the duration of a job based on its history and run patterns is not very accurate.
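
True GPU-level checkpointing (saving the device context itself) requires driver support, but the idea can be illustrated at the framework level. The sketch below is a simplified PyTorch example, with hypothetical file and key names, of saving and restoring a training job's state so that it can resume after an interruption or a scheduler-initiated job switch.

```python
import torch
import torch.nn as nn

model = nn.Linear(32, 4)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Save a checkpoint: enough state to resume the job as if nothing happened.
checkpoint = {
    "model_state": model.state_dict(),
    "optimizer_state": optimizer.state_dict(),
    "iteration": 1000,
}
torch.save(checkpoint, "job_checkpoint.pt")

# ... the job is preempted or crashes here ...

# Restore the context later, possibly after another job has used the GPU.
restored = torch.load("job_checkpoint.pt")
model.load_state_dict(restored["model_state"])
optimizer.load_state_dict(restored["optimizer_state"])
start_iteration = restored["iteration"]
```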

The drawback of checkpointing is that the context needs to be restored: this is not a problem when dealing with small models, but it becomes a significant operation with large models. Loading a large model onto a GPU effectively makes that GPU unusable to all users for a considerable amount of time (several minutes). For this reason, it is important in parallel settings to prevent errors that corrupt the “persistent” memory where DL model parameters are stored, as otherwise they may need to be reloaded. The systems surveyed in this paper each offer alternatives for protecting persistent model memory.

Approaches

All three systems surveyed share the goal of reducing the cost of DL applications by reducing memory requirements. This is motivated by the desire to democratise access to state-of-the-art (SOA) models, which are currently accessible only to corporations with the monetary means to host them.

There are two kinds of memory needed to run neural networks: persistent and volatile. Persistent memory refers to the memory used to store model parameters such as weight matrices, activation thresholds, etc. Model parameters persist in memory because the model must never be offloaded mid-inference or mid-training. Volatile memory refers to memory needed for storing intermediate data or computations like gradient calculations, which are discarded after backprop is completed for a given iteration.
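
The distinction can be made concrete with a rough PyTorch sketch (a simplification; real allocators also hold caches and workspace buffers): parameters occupy memory for the lifetime of the job, while activations and gradients appear during an iteration and can be released once the step completes.

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1024)).to(device)

# "Persistent" memory: parameters that stay resident for the whole job.
persistent_bytes = sum(p.numel() * p.element_size() for p in model.parameters())

# "Volatile" memory: activations and gradients created during one iteration.
x = torch.randn(256, 1024, device=device)
out = model(x)                      # the forward pass stores activations for backprop
loss = out.sum()
loss.backward()                     # gradients are materialised here ...
model.zero_grad(set_to_none=True)   # ... and released once the update is done

print(f"persistent (parameters): {persistent_bytes / 2**20:.1f} MiB")
if device == "cuda":
    print(f"currently allocated on GPU: {torch.cuda.memory_allocated() / 2**20:.1f} MiB")
```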

All three systems make tradeoffs with respect to the treatment of persistent and volatile memory. Allocating more space to persistent memory means fewer GPUs are required, but the possibility of errors increases with the reduced computation space. Allocating more space to volatile memory means more space for computation, but the supported model size for a given number of GPUs decreases.

Systems Overview

While Salus and ZeRO directly tackle GPU persistent and volatile memory management, Bandana does not deal with GPU memory directly; rather it proposes an alternative storage option for persistent memory. Bandana defines the persistent memory of its models as the space allocated to sparse input embeddings. Embeddings are the model’s encoded representation of data, and can be sparse or dense. One criticism is that not many models suffer from memory concerns due to input embeddings, since these are typically small and dense. The embeddings discussed in Bandana are specific to Facebook recommender systems, and system decisions are tailored according to this data.

Bandana is specialised for the workflow of Facebook recommender systems and is not necessarily generalisable to other applications. That being said, the solutions proposed by Bandana can be applied to a novel use case which the authors did not anticipate: the offloading of model computation data to the CPU to prevent out-of-memory faults. This is especially useful in the case of Large Language Models, which easily run out of memory during simple operations like inference.

In order to decrease idle memory and increase GPU utilisation, the Salus authors propose a scheduler with the objective of packing as many jobs as possible onto one GPU. Salus is designed specifically to confer CPU-like sharing primitives on GPUs. One example is fair time sharing, which allows multiple users to access resources in a given time interval. This is achieved through the implementation of GPU preemption, or the ability to stop a job mid-run and resume it later in order to be fair to other jobs.

The authors claim that the Salus scheduler increases GPU utilisation by allowing fine-grained memory access. However, the challenge they face is enabling sharing in a memory safe manner. Salus needs mechanisms to handle rogue jobs with higher memory usage than projected.

ZeRO proposes a different approach to memory saving: eliminating redundancy in the training of a model. The training of large language models involves a high number of redundant copies of data, such as model parameters and optimizer states. This redundancy scales with the number of parameters in a model, such that modern billion-parameter models require hundreds of GPUs to train.

Not only do large language models need to be split across GPUs due to their size, but they are also trained on large amounts of data. This generates many copies of the same model, each running on a fraction of the data, meaning that a large part of the model states loaded into memory are not accessed in a given training iteration. The authors of ZeRO propose to eliminate this duplication by efficiently partitioning model states across GPUs. This ensures that only the states required for an iteration are present on a GPU. They argue that ZeRO makes the optimal tradeoff between efficiency, inter-GPU communication load and the number of GPUs required.

Discussion

The following sections analyse the three system designs in detail, evaluating their findings incrementally. These three systems can be seen as building on each other progressively, each providing solutions to the others’ shortcomings.

Bandana Implementation Details

Bandana is a system for leveraging non-volatile memory (NVM) to efficiently store the input embeddings used by Facebook recommender systems. The authors set out to solve the issue of storing embeddings in DRAM, a costly form of storage that is in large part occupied by these static embeddings. Facebook embeddings are long, sparse vectors of occurrence-based counts which are computed for each user and post on Facebook. The issue with sparse embeddings is that they take up space while having mostly zero-valued entries, making them inefficient to store.

The Bandana authors propose a novel approach: storing these embeddings in a cheaper form of storage, NVM, which frees up DRAM for other data. Individual embeddings are then loaded back into DRAM for computation as needed. The problem with naively moving embeddings from DRAM to NVM is that loading them back from NVM can take much longer. Non-volatile memory is a persistent storage medium that is more cost-effective than DRAM but has slower I/O times. In particular, NVM has limited read bandwidth, which means its rate of transfer is slower.

The authors’ contribution with Bandana is to create a system which improves the normally impractical NVM bandwidth, making it usable in recommender systems. They achieve this through two techniques:

  1. Storing embeddings that are likely to be requested together adjacently in NVM;
  2. Dynamically deciding which frequently-accessed embeddings to keep preloaded in the DRAM cache and which to evict.

The first intuition behind Bandana is that NVM loading bandwidth needs to be improved. The main bottleneck with this type of storage is that NVM can only read 4 KB data blocks at a time, while most embeddings are much smaller. Furthermore, more than half of the 4 KB block loaded from NVM to DRAM may be unused and discarded by the running query. Thus a single query requires multiple loads from NVM, which is inefficient.

Bandana tackles this challenge by identifying which embeddings are frequently queried one after the other and storing these adjacently in NVM. With this heuristic in place, when a 4 KB block is pre-fetched from NVM to DRAM, it is likely that following queries will request embeddings in the pre-fetched block and forgo additional loads. To identify which embeddings should be stored together in memory, Bandana clusters embeddings by access pattern. The authors test various partitioning algorithms for this task, such as K-means and Social Hash Partitioner (SHP), and found that SHP gave better bandwidth utilisation than K-means for their target workloads.
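
A minimal sketch of the block-packing idea follows (not the SHP algorithm itself, and with hypothetical sizes): given clusters of embeddings that tend to be accessed together, each cluster is laid out contiguously in 4 KB blocks so that a single NVM read services several upcoming queries.

```python
BLOCK_BYTES = 4096       # NVM is read in whole 4 KB blocks
EMBEDDING_BYTES = 1024   # hypothetical embedding size, much smaller than a block
PER_BLOCK = BLOCK_BYTES // EMBEDDING_BYTES

def pack_into_blocks(clusters):
    """Lay out co-accessed embeddings adjacently on NVM.

    `clusters` is a list of lists of embedding ids produced by some
    co-access partitioner (Bandana uses SHP; K-means is another option).
    Each cluster starts on a fresh block so that one 4 KB read brings in
    embeddings that are likely to be requested together.
    """
    blocks = []
    for cluster in clusters:
        for i in range(0, len(cluster), PER_BLOCK):
            blocks.append(cluster[i:i + PER_BLOCK])
    return blocks

# Embeddings 0-3 and 4-7 are frequently queried together, so each group
# ends up in its own block and is fetched with a single NVM read.
print(pack_into_blocks([[0, 1, 2, 3], [4, 5, 6, 7]]))   # [[0, 1, 2, 3], [4, 5, 6, 7]]
```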

The second intuition behind Bandana is that frequently requested embeddings should be prefetched into DRAM for additional speed. At the same time, an eviction policy needs to be in place which keeps the oft-requested embeddings in DRAM while evicting others. To support this behaviour, the authors implement an additional LRU caching queue such that Bandana inserts prefetched objects into the queue only if they have been accessed t times in past runs. The LRU cache queue thus defines which embeddings count as frequently accessed.
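
The caching behaviour can be sketched as an LRU cache with an admission threshold (a simplification of Bandana's design; the class and parameter names are illustrative): an embedding is only admitted into DRAM once it has been accessed at least `t` times, and the least recently used entry is evicted when the cache is full.

```python
from collections import OrderedDict, defaultdict

class AdmissionLRUCache:
    """LRU cache that only admits items accessed at least `t` times."""

    def __init__(self, capacity, t):
        self.capacity = capacity
        self.t = t
        self.cache = OrderedDict()           # embedding id -> vector
        self.access_counts = defaultdict(int)

    def get(self, emb_id, load_from_nvm):
        self.access_counts[emb_id] += 1
        if emb_id in self.cache:
            self.cache.move_to_end(emb_id)   # mark as most recently used
            return self.cache[emb_id]

        vector = load_from_nvm(emb_id)       # miss: fall back to a slow NVM read
        if self.access_counts[emb_id] >= self.t:
            if len(self.cache) >= self.capacity:
                self.cache.popitem(last=False)   # evict the least recently used entry
            self.cache[emb_id] = vector
        return vector

# Only embedding 7, which is requested repeatedly, earns a slot in DRAM.
cache = AdmissionLRUCache(capacity=2, t=2)
for emb_id in [7, 3, 7, 9, 7]:
    cache.get(emb_id, load_from_nvm=lambda i: [0.0] * 4)
print(list(cache.cache))   # [7]
```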

Bandana Evaluation

Bandana’s performance depends on the value of hyperparameter t, which determines if an embedding qualifies as “frequently-accessed”. To identify the optimal value of t, Bandana runs a cache simulation of the typical workload using miniature caches. The authors claim that the t-thresholds from the simulations are good approximations for larger cache sizes since the hit-rate curve is convex. The use of cache simulations is a novel approach which could be used to simulate memory usage of DL jobs in order to predict resource consumption.

The biggest weakness in Bandana’s argument is that the authors do not quantitatively compare Bandana’s performance to the SOA. The authors imply that Bandana’s cost efficiency outweighs the speed of the SOA, but they do not provide comparison tests between DRAM performance and NVM performance. They also fail to mention that the sparse embeddings used in Bandana are less common than dense embeddings, which do not have large space requirements. Since dense embeddings are much smaller, there is no advantage to storing them in NVM.

However, Bandana’s ideas can be applied to a different context that the authors did not anticipate. Large Language Models sometimes rely on CPU-offloading to free up memory. CPU offloading is a technique where certain model parameters or activations are loaded to the CPU in order to leave space for continued computations and prevent GPU out of memory errors. CPU offloading is slow, and could potentially benefit from being swapped for NVM-offloading using Bandana.

NVM offloading could be especially efficient for parallel intervention jobs. Intervention jobs are popular in the field of mechanistic interpretability, where it is useful to extract the activations of a specific layer of the model in order to examine them. This requires loading tensors onto the CPU. A problem arises when a user wants to edit the activation weights and perform inference on the edited weights: the original weights need to be saved so they can be restored for other users.
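
To make the shape of such a job concrete, the sketch below (plain PyTorch, not the workflow of any particular system) extracts the activations of one layer with a forward hook and offloads them to the CPU; an intervention would then edit these activations or the weights that produced them, which is why an untouched copy of the original parameters must be kept somewhere safe for other users.

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 10)).to(device)

captured = {}

def save_activation(module, inputs, output):
    # Offload the activation to CPU so GPU memory is not held during analysis.
    captured["layer0"] = output.detach().to("cpu")

handle = model[0].register_forward_hook(save_activation)
with torch.no_grad():
    model(torch.randn(8, 64, device=device))
handle.remove()

# Keep a pristine copy of the weights so the model can be restored after the
# intervention; Bandana-style NVM offloading would be a natural home for it.
original_weights = {k: v.detach().clone().cpu() for k, v in model.state_dict().items()}
print(captured["layer0"].shape, len(original_weights))
```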

Should a job switch occur in the middle of an intervention, all memory needs to be checkpointed, cleared and restored to the clean version for other users. By saving the original weights into NVM, we can shortcut this process by having persistent copies that do not need to be checkpointed, cleared and restored each time.

Salus Implementation Details

Bandana proposes a new form of storage to separate volatile and persistent memory; a different strategy is to partition GPU memory directly by designing a GPU virtual memory. This is what the authors of Salus set out to do with the creation of GPU lanes, which allow the safe sharing of GPU memory among different jobs. The motivation behind Salus is that GPUs typically run one job at a time, which underutilises resources; furthermore, large models make context switching between jobs an expensive operation, so it becomes crucial to pack multiple jobs onto one GPU to use up idle space and avoid job switches. Salus’ advantage over Bandana is that the system is less dependent on loading time. However, Salus needs to ensure stronger memory safety, since multiple jobs rely on the persistence of the model loaded into memory.

To ensure memory safety, Salus divides the GPU into persistent and volatile memory, and volatile memory is further divided into GPU lanes. Jobs within a lane are serialised, while parallelisation is achieved across lanes using GPU streams; safety is upheld by running only one job in a lane at a time and blocking any swapping among lanes. GPU lanes are assigned based on a job’s history of peak memory usage; to prevent oversubscription of memory, the memory available to a job is set to twice its historical peak usage. This allows space for unpredicted memory growth.
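
The lane-assignment policy described above can be sketched as a simple allocator (illustrative only; the real Salus implementation also manages persistent memory and lane reshaping): each job receives a lane sized at twice its historical peak usage, carved out of the GPU's volatile region, and is rejected if no lane of that size fits.

```python
class GpuLaneAllocator:
    """Toy allocator: one job per lane, lane size = 2x the job's peak usage."""

    def __init__(self, volatile_capacity):
        self.volatile_capacity = volatile_capacity
        self.lanes = {}      # job id -> lane size in bytes
        self.used = 0

    def assign_lane(self, job_id, historical_peak):
        lane_size = 2 * historical_peak      # headroom for unpredicted growth
        if self.used + lane_size > self.volatile_capacity:
            return False                     # no safe lane available on this GPU
        self.lanes[job_id] = lane_size
        self.used += lane_size
        return True

    def release_lane(self, job_id):
        self.used -= self.lanes.pop(job_id)

# 16 GB of volatile memory (persistent model memory is managed separately).
allocator = GpuLaneAllocator(volatile_capacity=16 * 2**30)
print(allocator.assign_lane("train-job", historical_peak=6 * 2**30))   # True  (12 GB lane)
print(allocator.assign_lane("infer-job", historical_peak=1 * 2**30))   # True  (2 GB lane)
print(allocator.assign_lane("big-job", historical_peak=4 * 2**30))     # False (would need 8 GB)
```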

Salus relies on the idea that DL jobs have heterogeneous memory usage across their lifespan. For example, a training job will show peaks and troughs of memory usage throughout. The authors intuit that by analysing this heterogeneous memory pattern, an optimal memory division strategy may be achieved which outperforms naive static memory partitions. Salus assumes that a job’s history can help predict sudden changes in its memory needs, which helps avoid costly checkpoints and context switches.

Since prediction of job usage patterns is not always exact, Salus includes a scheduler that intervenes to ensure fair resource sharing among jobs. The Salus scheduler supports fairness by implementing a shortest-remaining-time-first (SRTF) scheduling policy, which reduces head-of-line blocking from long-running jobs. This approach is specifically targeted at improving GPU utilisation for hyperparameter fine-tuning, since the scheduler raises an interrupt after every training iteration. If the interrupted job is not close to completion, it is preempted with fast job switching, or moved to a different GPU entirely. This strategy leverages the fact that GPU request rates are highly variable; many poor jobs are killed off quickly, so iteration interrupts can periodically step in and free up this otherwise unusable memory.
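
The scheduling policy can be sketched with a small priority queue (a simplification that ignores lane placement and switching costs): at every iteration boundary, the job with the shortest remaining time runs next, which keeps long-running jobs from blocking the head of the queue.

```python
import heapq

def srtf_schedule(jobs):
    """Simulate shortest-remaining-time-first scheduling at iteration granularity.

    `jobs` maps a job name to its remaining number of iterations.
    Returns the order in which iterations were executed.
    """
    heap = [(remaining, name) for name, remaining in jobs.items()]
    heapq.heapify(heap)
    trace = []
    while heap:
        remaining, name = heapq.heappop(heap)            # job closest to completion
        trace.append(name)
        if remaining - 1 > 0:
            heapq.heappush(heap, (remaining - 1, name))  # re-queue after one iteration
    return trace

# The short fine-tuning job finishes before the long training job monopolises the GPU.
print(srtf_schedule({"long-training": 4, "short-finetune": 2}))
```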

Salus Evaluation

Salus makes the decision to isolate volatile memory with GPU lanes to ensure safety in a parallel setting. This choice eliminates memory fragmentation since lanes are cleared of volatile memory after every iteration. Furthermore, the Salus scheduler takes care of lane assignment such that fairness is preserved among jobs, and the condition holds that volatile memory does not cross over into persistent memory.

While it promises dynamic space allocation that approaches optimal GPU utilisation, Salus’ strength is also its flaw, because this flexible policy requires the scheduler to enforce strict compliance with safety conditions. Salus is designed to be optimal for a specific workload: several small models loaded onto one GPU, and many iteration-based jobs that can be interrupted at timed intervals. However, this is a very specific subset of DL jobs. Unlike Bandana’s offloading strategy, Salus’ partitioning cannot support intervention jobs, which are unpredictable and require access to persistent model memory. Intervention jobs are becoming increasingly important for understanding how language models make predictions, and researchers are looking for cost-effective, efficient memory management that allows them to perform such experiments safely.

Notwithstanding these design trade-offs, Salus’ intuition of partitioning the GPU directly is a step in the right direction toward more control over GPU memory management. The ZeRO architecture is an example of how a different kind of partitioning, data-parallel (DP) and model-parallel (MP) partitioning, can be implemented to optimise model memory usage on GPUs.

ZeRO Implementation Details

ZeRO is a system designed with the needs of large multi-billion parameter models in mind. The authors go as far as presenting a theoretical projection of ZeRO enabling trillion-parameter model training, believing that trillion-parameter models will become the SOA for language modelling in the next few years. ZeRO tackles the challenge of splitting multi-billion parameter models across GPUs, focusing on flattening the curve of the number of GPUs required as model size grows.

The motivation behind ZeRO is that modern training of large models relies on partitioning models and replicating them across hundreds of GPUs. Along with model parameters, gradients and optimizer states also have to be partitioned. ZeRO introduces two subsystems, ZeRO-DP and ZeRO-R, which eliminate redundant copies of data. ZeRO-DP targets the replication of the entire model across many GPUs for each batch of data.

ZeRO-DP eliminates training redundancy by ensuring that only model states required for the current batch exist on the GPU. ZeRO-R is a memory management unit that optimises memory usage in four ways:

  1. Setting the maximum size of temporary buffers used in computation;
  2. Preventing failed memory allocation due to fragmentation by monitoring the lifespan of tensors;
  3. Eliminating duplicate activations stored in the forward pass for backprop;
  4. Offloading activations that are awaiting backprop to the CPU.
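
To make the partitioning idea concrete, below is a minimal, illustrative sketch (not the DeepSpeed implementation, and with made-up helper names) of the ZeRO-DP intuition: instead of every data-parallel rank holding optimizer state for the whole model, each rank owns only its shard, cutting the redundant copies roughly by the number of ranks.

```python
import torch.nn as nn

def partition_parameters(model, world_size):
    """Greedily assign each parameter tensor to one data-parallel rank.

    In classic data parallelism every rank stores optimizer state for all
    parameters; in a ZeRO-style scheme each rank keeps state only for the
    shard it owns and gathers the rest when needed.
    """
    shards = {rank: [] for rank in range(world_size)}
    sizes = {rank: 0 for rank in range(world_size)}
    for name, param in model.named_parameters():
        owner = min(sizes, key=sizes.get)    # give it to the least-loaded rank
        shards[owner].append(name)
        sizes[owner] += param.numel()
    return shards, sizes

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 128))
shards, sizes = partition_parameters(model, world_size=4)
for rank in shards:
    print(f"rank {rank}: {sizes[rank]} optimizer-state elements -> {shards[rank]}")
```

Libraries such as DeepSpeed and PyTorch's ZeroRedundancyOptimizer apply this kind of sharding at scale, using collective communication to gather shards when they are needed.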

The authors show that combining ZeRO-R with ZeRO-DP and MP creates optimal memory savings without sacrificing performance. Implementing ZeRO frees up much of the space on a single GPU as duplicates are removed. This minimises the per-device memory footprint, which is what allows scaling to trillion-parameter models as less total memory is required.

ZeRO Evaluation

ZeRO’s goal is to democratise access to billion-parameter models by reducing the total number of required GPUs. Many of the techniques proposed by ZeRO are similar to those of Salus, namely memory management strategies like limiting buffer size and preventing fragmentation. Unlike Salus, ZeRO does not provide mechanisms for sharing the freed-up memory among users, which would reduce the cost per user. This is because ZeRO is designed solely for training and does not consider parallel job requests. The authors argue that memory sharing is out of scope for ZeRO’s use case, and further criticise some memory sharing partition strategies like Pipeline Parallelism (PP).

One could argue that the authors’ dismissal of memory sharing and Pipeline Parallelism is too hasty. Pipeline Parallelism is a different mode of partitioning that slices the model horizontally. As opposed to Model Parallelism, PP splits the model by layers and places each block on a GPU. The forward pass through the model proceeds sequentially through the GPUs. The authors of ZeRO find fault in this approach because it requires developers to edit their models in order to fit the PP system.
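
For contrast, the sketch below shows the basic shape of pipeline parallelism in plain PyTorch (illustrative; real pipeline schedulers also split batches into micro-batches to shrink the bubble): the model is cut by layers into stages, each stage lives on its own device, and activations flow sequentially from one stage to the next.

```python
import torch
import torch.nn as nn

# Two pipeline stages; the sketch falls back to CPU if fewer than two GPUs exist.
if torch.cuda.device_count() >= 2:
    dev0, dev1 = "cuda:0", "cuda:1"
else:
    dev0 = dev1 = "cpu"

stage0 = nn.Sequential(nn.Linear(256, 256), nn.ReLU()).to(dev0)   # lower layers
stage1 = nn.Sequential(nn.Linear(256, 10)).to(dev1)               # upper layers

def pipelined_forward(batch):
    hidden = stage0(batch.to(dev0))   # runs on the first device
    return stage1(hidden.to(dev1))    # activations hop to the next device

print(pipelined_forward(torch.randn(32, 256)).shape)
```

Because prompts or micro-batches can be fed in back-to-back, the first stage can start on the next input while the second is still finishing the previous one; the idle time at the edges of this schedule is the “bubble” discussed below.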

The ZeRO authors justly emphasise the importance of transparency, defined as not needing to modify a model to fit a system. However, they do not provide valid reasons to dismiss PP beyond the need for tying model weights and relying on inter-GPU communication.

Pipeline parallelism offers an important advantage over ZeRO: if a slice of the model fails, the input can simply be rerouted to the slice on the next GPU. This is an imperfect approach, since the model output will become corrupted, but it allows other users to keep using the lower layers of the model uninterrupted. Pipeline Parallelism also enables sharing at inference time, since multiple prompts can be processed in a serialised manner: as one prompt leaves a GPU, the next one enters. The ZeRO authors further criticise the “bubble” problem typical of PP: a bubble of wasted resources forms as one job terminates and another enters the pipe, leaving one GPU idle. Yet the PP bubble is far less wasteful than not implementing GPU sharing at all.

By overfitting their system to the training scenario, the ZeRO authors fail to bring attention to the problem of GPU errors. Although training is an automated process with less user error than inference or interventions, GPU errors are still possible; automated hyperparameter tuning, for example, is a type of training that frequently generates errors. The model partitioning used in ZeRO has the unfortunate corollary that the loss of a single GPU can have a domino effect on the whole system. ZeRO demonstrates how designing optimization techniques around a single DL job mode (training) can render a powerful idea virtually inapplicable elsewhere.

Conclusion

The different partitioning modes in Salus and ZeRO demonstrate two perspectives on the trade-off between efficient memory utilisation and memory safety. Salus enables sharing among jobs, thus packing the GPU as much as possible. However this setup could generate GPU out-of-memory errors that compromise the safety of all jobs and models. ZeRO by contrast partitions the model carefully so as to isolate all the resources on one GPU. While this decreases per-device memory usage by eliminating redundancy, it does not offer ways to utilise all the freed-up space.

Bandana, by contrast, does not partition GPU space but introduces a different kind of storage as an offloading destination, focusing on improving the loading bandwidth from this storage to DRAM. Given the different approaches of these three systems, it is important to consider the target use case when evaluating the merits and downsides of each. Because the target domains vary so widely, it is hard to determine which of the three systems is most efficient.

In practice, the optimal strategy for managing GPU memory is likely a combination of these three systems. The question of whether a general framework exists for managing memory across all kinds of DL applications remains open. Certainly, it is easier to make a system that is task or model specific. However, even with a single target use, there are still many factors to balance, such as the tradeoff between persistent and volatile storage, varying job lengths, and unpredictable incoming requests. The search for optimal memory management is observably not a convex problem.

Framed in these terms, we may consider ZeRO to be the optimal model for large scale training with access to hundreds of GPUs; Salus may be the optimal model for parallel inference or fine-tuning of smaller models over a limited number of GPUs; Bandana may be the optimal choice for parallel intervention jobs that require persistent copies of model states.

As GPU architectures and models evolve, an operating system capable of handling all kinds of parallel DL jobs may soon become a reality. Such an operating system would require a massive GPU cluster, each node specialized for a job type, as well as considerable engineering effort to support large models and parallel requests. It is conceivable that this system would utilize Salus to ensure memory safety, Bandana to provide offloading space and ZeRO to support large scale training.

This “operating system of the future” may seem like a distant pipe dream, yet some projects are already working towards this goal. At our own Northeastern University, the National Deep Inference Facility (NDIF) led by Dr. David Bau aims to enable parallel intervention experiments on a shared GPU node. NDIF could boost research in LLMs by allowing scientists with limited budgets to share the cost of running LLMs. As LLMs are increasingly incorporated into our daily lives, it is of utmost importance to promote research in this field; decreasing GPU memory utilization is but the first step towards this important goal.