Introduction

Deep learning models are becoming increasingly large: models like ResNet-101 (45M parameters) and BERT (350M parameters) were considered huge less than five years ago. However, such models are "small" by today's standards, with models like GPT-3 (175B parameters) and PaLM (540B parameters) dominating most benchmarks and powering advanced NLP applications, including translation systems, chatbots, coding assistants, and much more. Indeed, scaling up deep learning models (both in their size and in the amount of data used for training) has proven to be an effective strategy for improving performance on tasks that one would previously have considered impossible to solve through machine learning. For instance, current performance on benchmarks such as MMLU surpasses that of the average human and exceeds what many deep learning experts had predicted would be possible even by 2025.

Nevertheless, advances in the development of more powerful GPUs have not kept pace with the growth of deep learning models, and the current generation of GPUs offers only a ~4x improvement in raw performance over the previous one. On top of that, there have been no major improvements in the total VRAM available on the most recent GPUs: a single H100 GPU has merely 80GB of VRAM (the same as the previous-generation A100). Such a GPU can only fit models of up to 40B parameters in 16-bit floating point precision, without leaving any room for the input samples, the gradients, the activations, or the optimizer states (which, in the case of the popular AdamW, can amount to up to 8 times the memory footprint of the model itself). To make matters worse, most of the large models developed today are based on attention mechanisms, whose memory usage grows quadratically with the length of the sequences (typically text) they process.
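To make these numbers concrete, here is a rough back-of-the-envelope estimate of the memory needed to train a model with AdamW in mixed precision (a simplified sketch; the exact breakdown depends on the framework and ignores activations, which scale with batch size and sequence length):

```python
def training_memory_gb(n_params: float, dtype_bytes: int = 2) -> dict:
    """Rough per-component memory estimate for mixed-precision training
    with AdamW (fp16 weights/gradients, fp32 master weights and moments)."""
    gb = 1e9
    weights = n_params * dtype_bytes          # fp16 weights
    grads = n_params * dtype_bytes            # fp16 gradients
    optimizer = n_params * 4 * 3              # fp32 master weights + 2 moment estimates
    return {
        "weights_gb": weights / gb,
        "gradients_gb": grads / gb,
        "optimizer_gb": optimizer / gb,
        "total_gb": (weights + grads + optimizer) / gb,
    }

# A 40B-parameter model already needs ~80 GB for the fp16 weights alone,
# and several times that once gradients and optimizer states are counted.
print(training_memory_gb(40e9))
```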

Large models have had a huge impact across fields such as computer vision, natural language processing, biology, and others. Thus, it is no surprise that the development of more efficient approaches for training them has gained significant attention in recent years. Some researchers have devoted their efforts to optimizing specific primitives (such as FlashAttention) or to developing more efficient architectures (like state space models). However, many have focused on developing distributed strategies for using multiple GPUs across nodes to train or deploy large models. These approaches can be broadly categorized into data parallelism and model parallelism; in addition, hybrid approaches combining both techniques have also been developed. There also exist methodologies for reducing the memory footprint of models, such as quantization. Moreover, one can make training and inference less memory intensive by offloading parts of the model to the CPU and even to SSDs.

In this survey, I will cover recent progress toward making training and inference with large deep learning models more efficient through parallelism and offloading. In particular, I will focus on work proposing 1) memory optimizations for training huge models, 2) collaborative training of huge models over the internet, and 3) efficient approaches for large language model inference on a single GPU. The motivation behind this survey is to highlight how different research directions have arisen from use cases ranging from training large models with massive compute to running inference on such models with limited resources. Depending on the user, this might entail using a single computer, a professional cluster, or resources collaboratively pooled from all across the world. Such developments are critical as big companies continue to amass GPUs and strive to build a monopoly on the development of huge models.

Background

During the first years of the deep learning era (starting with the 2012 'AlexNet' moment), there were two main approaches for training or running inference on large models when a single GPU was not enough: data parallelism and model parallelism. In addition, it is important to highlight offloading approaches, which can be combined with either of these to leverage CPU RAM or even disk storage. By covering these, I want to provide a brief introduction to the simpler approaches that have enabled researchers to scale up deep learning models, and to highlight their downsides compared to the proposals that this survey focuses on.

In data parallelism, the input data is split into multiple subsets, with each subset assigned to a different GPU. Each GPU runs an iteration and computes gradients on its assigned data, and the model parameters are updated by averaging the gradients computed by all GPUs in a 'main' process, which then propagates the weight updates. The advantage of data parallelism is that it allows for parallel processing of large datasets, leading to significant speedups in training time, as communication only needs to happen between iterations. However, this approach is limited by the amount of memory available on each GPU; namely, each GPU must be capable of holding, at the very least, the model, the optimizer states, and the activations for a single datum.
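As a minimal illustration of this pattern (a hand-written sketch of the all-reduce step rather than the API of any particular framework, and assuming the default torch.distributed process group has already been initialized):

```python
import torch
import torch.distributed as dist

def data_parallel_step(model, optimizer, loss_fn, local_batch):
    """One data-parallel training step: each process computes gradients on its
    own shard of the batch, then gradients are averaged across all processes
    before the optimizer update."""
    optimizer.zero_grad()
    inputs, targets = local_batch
    loss = loss_fn(model(inputs), targets)
    loss.backward()

    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # Sum gradients across GPUs, then divide to obtain the average.
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size

    optimizer.step()
    return loss.item()
```

In practice one would use a library implementation such as PyTorch's DistributedDataParallel, which overlaps this communication with the backward pass.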

In model parallelism, the model is split into multiple sub-models, with each sub-model assigned to a different GPU. Each GPU is responsible for computing a different part of the output, and the outputs are combined to produce the final result. This approach can help overcome memory limitations, as each GPU only needs to hold a subset of the model parameters. However, when carried out naively, it can result in slower processing times due to increased communication overhead. For instance, the authors of one of the surveyed papers highlight that using two DGX-2 nodes to train a 10B-parameter model results in less than 5% GPU utilization. Another issue with model parallelism is that the user usually needs to specify how the model is split into sub-models, making the implementation more complex than data parallelism.
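A minimal sketch of this naive splitting in PyTorch, placing two halves of a network on two hypothetical devices (cuda:0 and cuda:1), could look as follows:

```python
import torch
import torch.nn as nn

class TwoStageModel(nn.Module):
    """Naive model parallelism: the first half of the network lives on one GPU
    and the second half on another; activations are moved between devices
    during the forward pass."""

    def __init__(self, hidden: int = 4096):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU()).to("cuda:0")
        self.stage2 = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU()).to("cuda:1")

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.stage1(x.to("cuda:0"))
        # The device-to-device transfer below is where the communication overhead
        # comes from; while cuda:1 computes, cuda:0 sits idle (and vice versa).
        x = x.to("cuda:1")
        return self.stage2(x)
```

This also illustrates the low utilization problem: without pipelining, only one of the two GPUs is ever busy at a time.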

Finally, there is offloading, which is not a separate approach per se. Instead, it can be combined with parallelism techniques to enable training larger models. Essentially, offloading refers to any method for moving either data or operations from the GPU to the CPU (sometimes it even entails moving data to disk). One simple example consists of offloading activations (which are needed during the backward pass) to the CPU until they are required again. When used naively, offloading incurs significant GPU-CPU-disk communication overhead and can result in GPU under-utilization, so work on improving parallelism via offloading often focuses on selecting the best moment to move data to slower devices concurrently with computation, so that GPUs do not sit idle for long.
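As a toy illustration of activation offloading (a hand-rolled sketch, not the mechanism used by any of the surveyed frameworks), one can park saved activations in CPU memory between the forward and backward passes:

```python
import torch

class OffloadedActivation:
    """Keeps an activation tensor on the CPU while it is not needed and moves
    it back to its original device right before the backward pass."""

    def __init__(self, tensor: torch.Tensor):
        self.device = tensor.device
        # non_blocking transfers can overlap with computation when the
        # tensor lives in pinned (page-locked) memory.
        self.cpu_copy = tensor.detach().to("cpu", non_blocking=True)

    def restore(self) -> torch.Tensor:
        return self.cpu_copy.to(self.device, non_blocking=True)

# Usage sketch: stash an activation during the forward pass ...
#   saved = OffloadedActivation(hidden_states)
# ... and bring it back when the backward pass reaches this layer:
#   hidden_states = saved.restore()
```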

ZeRO: scaling up training by partitioning optimizer states and model parameters into different GPUs

In their 2020 paper, Rajbhandari et al. describe ZeRO (Zero Redundancy Optimizer), an approach for optimizing memory usage and training speed while vastly increasing the model size that can be trained in a multi-GPU/multi-node setup. As part of their work, they highlight the benefits of ZeRO in comparison to existing parallel training techniques, such as naive data and model parallelism. ZeRO addresses the limitations of both data parallelism and model parallelism by optimizing memory usage and communication overhead. To achieve this, the authors begin by analyzing two key aspects of large model training: 1) the fact that the majority of the memory occupied during training is taken up by model states (parameters, gradients, and optimizer states), and 2) the communication overhead between devices.

ZeRO: the approach

To address memory usage, ZeRO introduces optimization techniques that reduce the amount of memory required to store these model states. The core idea is to partition the model states into smaller pieces that are distributed across devices, so that only a subset needs to be resident on any single GPU at a given time. As for communication overhead, ZeRO keeps the number of communications low by letting each GPU work independently on its own mini-batch while still sharing the same (partitioned) optimizer states: the optimizer states are broken into smaller pieces and distributed across the GPUs, which exchange only what is needed at each step. The communication overhead is further reduced by overlapping communication with computation, allowing for more efficient use of the available resources.

As a result of the above improvements, ZeRO allows for training much larger models more efficiently, as the memory requirements per GPU are reduced and the communication overhead is minimized. In addition, ZeRO allows for training across multiple nodes, which is essential for training the largest models, by carefully scheduling the communication between devices and nodes. Overall, ZeRO represents a significant step forward in large-scale deep learning, enabling researchers to train models with billions of parameters in a multi-GPU/multi-node setup. Going deeper into how ZeRO works, let us analyze each component in depth.

First, there is state partitioning: the model and optimizer states usually consume the most memory during training. To address this, the authors propose the ZeRO-DP component, which enables data parallelism without any redundancy. This entails partitioning the model states across devices/nodes instead of replicating them, and avoids an increase in communication costs by using a dynamic communication schedule. ZeRO-DP comprises three stages: the first two, optimizer state partitioning and gradient partitioning, reduce memory usage by up to 4x and 8x respectively while keeping the same communication volume as standard data parallelism, whereas the third one, parameter partitioning, reduces memory usage linearly with the number of devices at the cost of only a 50% increase in communication volume.
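To make the savings concrete, here is a small calculator based on the memory model described in the ZeRO paper for mixed-precision Adam, where model states take roughly (2 + 2 + K) bytes per parameter with K = 12; treat it as a back-of-the-envelope sketch rather than an exact accounting:

```python
def zero_model_state_memory_gb(n_params: float, n_devices: int, stage: int) -> float:
    """Per-GPU memory (GB) for model states under ZeRO-DP stages 0-3.

    Mixed-precision training keeps 2 bytes/param of fp16 weights, 2 bytes/param
    of fp16 gradients, and K = 12 bytes/param of optimizer states
    (fp32 master weights, momentum, and variance)."""
    gb = 1e9
    p, g, k = 2.0, 2.0, 12.0
    if stage == 0:      # plain data parallelism: everything replicated
        per_param = p + g + k
    elif stage == 1:    # partition optimizer states
        per_param = p + g + k / n_devices
    elif stage == 2:    # partition optimizer states + gradients
        per_param = p + (g + k) / n_devices
    else:               # stage 3: partition parameters as well
        per_param = (p + g + k) / n_devices
    return n_params * per_param / gb

# A 7.5B-parameter model on 64 GPUs: ~120 GB per GPU without ZeRO,
# dropping to roughly 31 GB, 17 GB, and 2 GB for stages 1, 2, and 3.
for stage in range(4):
    print(stage, round(zero_model_state_memory_gb(7.5e9, 64, stage), 1), "GB per GPU")
```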

The above alone removes the redundancy in model states, but further improvements can be achieved, so the authors also propose optimizing residual state memory: the memory used by activations, temporary buffers, and fragmented allocations. For this, they develop ZeRO-R, which reduces activation memory by also partitioning activations across devices instead of replicating them, using CPU offloading when appropriate. It also defines appropriate sizes for temporary buffers and manages memory during training to prevent fragmentation. Even though ZeRO can be used without vertical model parallelism, there are still a few cases where combining model parallelism with ZeRO is useful, such as very large models where activation memory becomes a bottleneck, or smaller models where activation memory is not an issue but data parallelism alone would result in an unwieldy aggregate batch size.

ZeRO as a framework is the result of combining ZeRO-DP and ZeRO-R. The framework is evaluated with benchmarks showing that, combined with model parallelism, ZeRO can train 175B-parameter models, whereas frameworks like Megatron alone can only scale up to around 40B parameters. Furthermore, ZeRO achieves up to a 10x improvement in training speed compared to the state of the art for models of the same size on the same hardware. The authors also claim that ZeRO can significantly increase the maximum batch size, thus improving training throughput and enabling users to compute model updates over larger portions of the data.

Takeaways from ZeRO

Certainly, one of the most significant contributions of ZeRO is the democratization of large model training. Users of the framework can train models in multi-node setups without having to worry about the engineering intricacies of memory management and communication optimization. The authors also provide an open-source implementation of the framework, making it accessible to the wider deep learning community. Overall, the ZeRO framework represents a significant step forward in massively distributed training of deep learning models.

Petals: collaborative inference and fine-tuning of large models

Next, I review PETALS: Collaborative Inference and Fine-tuning of Large Models, by Borzunov et al. (2023). In this paper, the authors highlight the lack of a methodology for users to pool their resources together to run inference on, or fine-tune, large models (in particular, large language models). Here, I should note that most large models today are language models, so it makes sense that some proposals for massive-scale training and inference focus on them. For instance, the largest vision model today has 22B parameters, whereas the largest language models reach up to 500B parameters.

In PETALS, the authors propose a framework for collaborative inference and fine-tuning of large language models. It works through a decentralized platform that allows users to run a server, a client, or both. Each server holds a subset of the model layers and serves requests from clients, whereas a client can ask a chain of servers to process its request, typically for inference. For fine-tuning, the authors additionally support clients training either adapters (very small modules that can be updated for a fraction of the cost of training the whole model) or specific layers. To frame this, they separate the usage of large language models by non-industry users into two scenarios: inference and parameter-efficient adaptation to downstream tasks.

PETALS: the approach

For inference, the client stores the token embeddings (the first layer), which comprise a small portion of the model and can fit in the RAM of most modern consumer-grade hardware. The client looks up a series of servers that collectively hold all the layers and then sends the query (or prompt, as NLP practitioners call it) through the token embedding layer and on to successive layers across servers. Once the client receives the output from the final block, it computes the probabilities for the next token and repeats the process. While the session is active, the servers store a cache of past tokens' representations. The client also stores intermediate representations so that replacement servers can be found in case one goes offline. As an example of the system requirements, consider the 176B-parameter BLOOM model: the client needs at least 12 GB of RAM to hold the token embeddings, whereas each server requires 16 GB of CPU RAM, a 100 Mbit/s connection, and at least 8 GB of GPU memory.
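A highly simplified sketch of this client-side generation loop is shown below; the helper names (open_inference_session, embed, lm_head, server.forward) are hypothetical, and the real PETALS client, built on the hivemind library, is considerably more involved:

```python
import torch

def generate(client, prompt_ids, max_new_tokens: int = 32):
    """Client-side autoregressive loop: the client owns the embeddings and the
    LM head, while the transformer blocks run on a chain of remote servers."""
    session = client.open_inference_session()  # hypothetical: pins a chain of servers
    tokens = list(prompt_ids)

    def run_remote(hidden):
        # Each server applies its block of layers and keeps an attention
        # cache for this session.
        for server in session.servers:
            hidden = server.forward(hidden)
        return hidden

    # Prefill: push the whole prompt through the chain once.
    hidden = run_remote(client.embed(torch.tensor([tokens])))
    for _ in range(max_new_tokens):
        logits = client.lm_head(hidden[:, -1])          # local projection to the vocabulary
        next_token = int(torch.argmax(logits, dim=-1))  # greedy decoding for simplicity
        tokens.append(next_token)
        # Only the newly generated token is sent on subsequent steps;
        # the servers reuse their caches for everything before it.
        hidden = run_remote(client.embed(torch.tensor([[next_token]])))
    return tokens
```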

For parameter-efficient fine-tuning, things are more complex, as one also needs to consider the backward pass. In fact, the authors of the PETALS paper do not focus on full fine-tuning of the models' layers, since fine-tuning BLOOM with AdamW would require almost 3TB of GPU memory. Instead, they focus on methods that fine-tune the model by updating only a handful of weights. Even so, there are difficulties: servers simultaneously serve many clients, so they cannot store client-specific weight updates. The servers can run backpropagation and return gradients, but the trainable parameters must be kept and updated on the client side. For this purpose, the authors propose using approaches like soft prompt tuning, where a randomly initialized token representation is prepended to the query and updated with backpropagation. Since this token is prepended at the very first stage of modelling, it only needs to be held by the client.
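A minimal standalone sketch of soft prompt tuning in PyTorch (not the PETALS client API), where only the prepended prompt embeddings receive gradient updates while the base model stays frozen:

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Learnable 'virtual token' embeddings prepended to the input embeddings.
    Only these parameters are trained; the base model remains frozen."""

    def __init__(self, n_virtual_tokens: int, hidden_size: int):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(n_virtual_tokens, hidden_size) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        batch_size = input_embeds.shape[0]
        prompt = self.prompt.unsqueeze(0).expand(batch_size, -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)

# Usage sketch (hidden_size of 14336 corresponds to BLOOM-176B):
#   soft_prompt = SoftPrompt(n_virtual_tokens=16, hidden_size=14336)
#   optimizer = torch.optim.AdamW(soft_prompt.parameters(), lr=1e-3)
#   hidden = base_model(soft_prompt(embedding_layer(input_ids)))
```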

Of course, the authors note that performance is a significant consideration in this distributed setup. Namely, they highlight the following aspects:

To run inference using RTX 3070 GPUs, 44 of them would be required, meaning 44 communication rounds. The authors therefore propose using quantization to store more information per GPU, reducing the number of devices and communication rounds required. More specifically, they compress both the model weights and the hidden states exchanged between nodes. The hidden states are compressed using dynamic blockwise quantization, which halves the bandwidth requirements. The model weights are compressed with 8-bit mixed matrix decomposition, which reduces the memory footprint by separating the weights into 0.1% 16-bit outliers and 99.9% 8-bit values. Both of these techniques can be applied with a negligible loss of accuracy across many benchmarks. For a model like BLOOM, this reduces the number of nodes required for inference from 44 to 22. In terms of speed, the authors show that these techniques achieve throughput similar to 16-bit precision while using half the GPUs.
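To illustrate the idea behind blockwise quantization of the hidden states, here is a simplified 8-bit absmax scheme written from scratch (the actual implementation used by PETALS relies on the bitsandbytes library):

```python
import numpy as np

def blockwise_quantize(x: np.ndarray, block_size: int = 64):
    """Quantize a flat float array to int8 with one scale per block.
    Per-block scaling prevents a few outliers from destroying the
    precision of the rest of the tensor."""
    blocks = x.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 127.0
    q = np.round(blocks / np.maximum(scales, 1e-12)).astype(np.int8)
    return q, scales

def blockwise_dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) * scales).reshape(-1)

hidden = np.random.randn(4096).astype(np.float32)   # a toy hidden-state vector
q, scales = blockwise_quantize(hidden)
error = np.abs(blockwise_dequantize(q, scales) - hidden).mean()
print(f"mean absolute quantization error: {error:.5f}")
```

Sending int8 values plus one scale per 64 elements instead of fp16 activations is what roughly halves the bandwidth between peers.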

Regarding communication, the authors focus on reliability. To address it, they leverage the hivemind library for decentralized training, which provides fault-tolerant protocols for servers and clients. They also carry out server load balancing: all clients and servers periodically check, via a distributed hash table, for peers that leave or fail. Finally, they use client-side routing to ensure that the client finds a sequence of servers that can process the query in the shortest time possible. Namely, they run a beam search over measured pings to find the path most likely to yield minimal latency. While this approach is not perfect, it runs fast enough that computing a path does not become a bottleneck in the inference/fine-tuning pipeline.
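As a toy stand-in for this routing step (a greedy selection rather than the beam search described in the paper, with a hypothetical server-metadata structure), the idea looks roughly like this:

```python
from dataclasses import dataclass

@dataclass
class ServerInfo:
    name: str
    start_layer: int   # first transformer block held by this server
    end_layer: int     # one past the last block held
    ping_ms: float     # measured round-trip latency from the client

def plan_route(servers: list[ServerInfo], n_layers: int) -> list[ServerInfo]:
    """Greedily pick, at each point, the lowest-latency server whose blocks
    start exactly where the current path ends (a simplification: real servers
    may hold overlapping layer ranges)."""
    path, covered = [], 0
    while covered < n_layers:
        candidates = [s for s in servers if s.start_layer == covered]
        if not candidates:
            raise RuntimeError(f"no server currently offers layer {covered}")
        best = min(candidates, key=lambda s: s.ping_ms)
        path.append(best)
        covered = best.end_layer
    return path
```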

For benchmarking, the authors test several scenarios ranging from ideal conditions to more realistic user setups. Specifically, they run inference on local servers with very high bandwidth and also on 14 servers distributed across North America. They show that performance does not depend much on bandwidth or sequence length but does degrade with higher latency. In any case, compared with CPU offloading (which would still require access to a huge amount of RAM), PETALS is an order of magnitude faster under most conditions, which shows the benefits of the approach.

However, some negative aspects remain: the first server to receive a query could easily recover the input tokens from the embeddings, meaning that the system cannot be used with data that requires privacy. Moreover, the authors do not consider the case of a malicious server that returns incorrect results to the client. On top of that, there are concerns regarding the incentives for users to participate, as there will likely be many more clients than servers; the authors suggest a system of rewards for servers that provide continuous service, for example by giving them higher priority when it is their turn to run inference. Finally, they highlight that their approach is not yet compatible with fine-tuning methods that require modifying the main model's weights, which could be a limitation when soft prompt tuning is not good enough.

Takeaways from PETALS

Overall, this is a very interesting paper that shows the potential of collaborative distributed training and inference for large language models. There is a massive pool of consumer-grade GPUs that could join the proposed community effort and it is likely that this approach will become more popular as more and more people start to use large language models fueled by the advent of ChatGPT and GPT-4. However, there are still many security, usability, and reliability challenges to overcome, such as the ones mentioned above, and it remains to be seen whether this can be leveraged for other types of models such as diffusion models or more general computer vision neural networks.

FlexGen: improving inference throughput on single GPUs

Finally, I would like to cover a paper by Sheng et al. (2023): High-throughput Generative Inference of Large Language Models with a Single GPU. The authors of this paper highlight the need to run inference on large language models on a single GPU, a common use case for many users. Indeed, running the largest GPT-scale models requires 325GB of GPU memory just for the weights, meaning that at least five 80GB A100 GPUs would be required. Instead of focusing on optimizing interactive use, the authors focus on increasing throughput: that is, sacrificing real-time inference for a higher number of parallel operations. Cases where this is relevant include benchmarking, information extraction, and data wrangling, among others. These tasks are not sensitive to the latency at which tokens are generated and are usually carried out over a large number of documents.

Previous research directions striving to lower required resources for LLM inference include model compression, collaborative inference as in the PETALS paper, and offloading/model parallelism to leverage CPU and disk memory. Some frameworks for such tasks are FasterTransformer, Orca, LightSeq, PaLM inference, TurboTransformers, DeepSpeed Inference, and Hugging Face Accelerate. However, these frameworks are designed for latency-oriented scenarios with high-end accelerators, limiting their deployment for throughput-oriented inference on easily accessible hardware. To enable LLM inference on such commodity hardware, offloading is necessary, but only DeepSpeed Zero-Inference and Hugging Face Accelerate currently support it.

Although these systems inherit offloading techniques from training systems, they fail to exploit the structure of throughput-oriented LLM inference and miss opportunities for efficient scheduling of I/O traffic. Indeed, the former two directions assume that the model fits in GPU memory (which is almost never the case for GPT-3-scale models on a single device), whereas the latter suffers severe throughput penalties due to the overhead of moving data between disk and CPU (~2 GB/s) and between CPU and GPU (~12 GB/s). Moreover, these frameworks can often handle only a batch size of 1 or 2 for the largest models, resulting in further throughput penalties and poor utilization of the GPU's parallelization capabilities.

On the algorithm side, sparsification and quantization have been adopted for LLM inference. Prior works have shown that weights can be compressed down to 3 bits without compressing activations, or that both weights and activations can be compressed to 8 bits, but this remains insufficient for the largest models if the goal is to run inference on a single GPU. For example, even if it were possible to compress the weights of GPT-3, BLOOM or OPT to 4 bits, one would still require more than 80 GB of GPU VRAM just to hold them. To make sense of this number, it is more than three times the 24GB available on the highest-end consumer GPUs and exceeds even what a single industrial-grade A100 or H100 GPU can fit.
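The arithmetic behind this is a quick sanity check (175B parameters at 4 bits per weight):

```python
n_params = 175e9          # GPT-3 / OPT-175B scale
bits_per_weight = 4
weight_gb = n_params * bits_per_weight / 8 / 1e9
print(f"{weight_gb:.1f} GB for the weights alone")   # ~87.5 GB, above a single 80GB GPU
```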

FlexGen: the approach

Considering all of the above, Sheng et al. focus on designing a framework that can run inference on a single consumer-grade GPU. To achieve this, they build upon the offloading paradigm, in which only parts of the model are on the GPU at a given time and the rest lie in CPU memory or on disk until they are needed. This makes it important to highlight the three-level memory hierarchy: GPU memory is the fastest but most scarce (usually on the order of 16-24 GB on consumer machines), CPU memory is slower but larger (usually around 32 to 64 GB), and disk is slower still yet much more abundant (around 1TB). To achieve better performance in throughput-oriented scenarios, a large batch size can be used and the expensive I/O operations can be amortized across this memory hierarchy.

To this end, the key step is designing an efficient offloading strategy, which they call FlexGen. This strategy aggregates GPU, CPU, and disk memory and efficiently schedules I/O operations, combining them with compression methods and pipeline parallelism (in which different parts of the model are processed in sequence, pipeline-style, using the same resources). They make several contributions:

First, they define a search space of possible offloading strategies by taking into account computation scheduling, tensor placement, and computation delegation. Their search space captures computation orders within 2x of the optimal setup and is optimized using a linear programming based search algorithm that finds the best throughput within that search space. This algorithm can be configured for a variety of hardware configurations and can consider latency and throughput constraints to help find better solutions for specialized use cases. In contrast with existing strategies, this approach unifies the placement of weights, activations, and the KV cache, which greatly increases the upper bound of the batch size.
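To give a flavor of what such a cost-model-driven search might look like, below is a heavily simplified brute-force stand-in for the paper's linear-programming formulation; the hardware constants and the cost function are purely illustrative:

```python
from itertools import product

# Illustrative hardware constants (GB and GB/s), not the paper's exact values.
GPU_MEM, CPU_MEM = 16, 208
CPU_GPU_BW, DISK_CPU_BW = 12, 2
WEIGHTS_GB = 325          # e.g. an OPT-175B-sized model in fp16

def io_cost(gpu_frac: float, cpu_frac: float, disk_frac: float) -> float:
    """Estimated seconds spent moving weights for one full pass over the layers."""
    cpu_to_gpu = (cpu_frac + disk_frac) * WEIGHTS_GB / CPU_GPU_BW
    disk_to_cpu = disk_frac * WEIGHTS_GB / DISK_CPU_BW
    return cpu_to_gpu + disk_to_cpu

def search_placement(step: float = 0.05):
    """Enumerate weight placements (gpu%, cpu%, disk%) and keep the cheapest
    one that respects the memory capacities (with headroom left for
    activations and the KV cache)."""
    best = None
    fracs = [round(i * step, 2) for i in range(int(1 / step) + 1)]
    for g, c in product(fracs, fracs):
        d = round(1 - g - c, 2)
        if d < 0:
            continue
        if g * WEIGHTS_GB > 0.8 * GPU_MEM or c * WEIGHTS_GB > 0.8 * CPU_MEM:
            continue
        cost = io_cost(g, c, d)
        if best is None or cost < best[0]:
            best = (cost, g, c, d)
    return best

print(search_placement())   # -> (estimated seconds, gpu%, cpu%, disk%)
```

The real FlexGen policy search additionally decides the placement of activations and the KV cache, the batch and block sizes of the schedule, and whether to delegate part of the computation to the CPU.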

In terms of problem formulation, one can say that there is a machine that contains three devices: a GPU, a CPU, and a disk. The GPU and CPU have the ability to perform computations, while the disk cannot. These three devices form a memory hierarchy with the GPU having the fastest and smallest memory, and the disk having the slowest and largest memory. If an LLM (Large Language Model) cannot fit entirely within the GPU, it must be offloaded to secondary storage, and computation must be done part by part by partially loading the LLM.

Generative inference with offloading is formulated as a graph traversal problem where tokens must go through layers, assuming a dataset with an infinite number of prompts to process. A valid path is defined as one that traverses (i.e., computes) all tokens (or squares) while being subject to certain constraints. These include the requirement that a token can only be computed if all tokens to its left on the same sequence were computed, and that to compute a token on a device, all its inputs (weights, activations, cache) must be loaded to it. After being computed, a token produces two outputs: activations and KV cache. The activations are stored until its right neighbor is computed, and the KV cache should be stored until the rightmost token on the same row is computed. Additionally, the size of tensors stored on a device cannot exceed its memory capacity at any time. The ultimate goal is to find a valid path that minimizes the total execution time, including the compute cost and I/O cost when moving tensors between devices.

Secondly, they show that it is possible to compress components such as weights and the KV cache of huge LLMs like OPT 175B to as few as 4 bits without any retraining or model calibration while maintaining practically the same accuracy across a variety of benchmark tasks. Specifically, they use group-wise quantization to reduce the costs of I/O operations when doing offloading. The weights and KV cache are compressed to 4 bits with a group size of 64, with weights grouped along the output channel dimension and the KV cache grouped along the hidden dimension. The tensors are stored in the quantized format and converted back to FP16 before computation. However, the fine-grained group-wise quantization in FlexGen causes some overhead in compression and decompression, which is significant when run on a CPU, so CPU delegation is turned off when quantization is enabled.
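A minimal from-scratch sketch of group-wise quantization with group size 64 (an illustration of the idea, not FlexGen's actual implementation):

```python
import numpy as np

def groupwise_quantize(x: np.ndarray, bits: int = 4, group_size: int = 64):
    """Min-max quantization applied independently to each group of values."""
    levels = 2 ** bits - 1
    groups = x.reshape(-1, group_size)
    mins = groups.min(axis=1, keepdims=True)
    scales = (groups.max(axis=1, keepdims=True) - mins) / levels
    q = np.round((groups - mins) / np.maximum(scales, 1e-12)).astype(np.uint8)
    return q, mins, scales

def groupwise_dequantize(q, mins, scales, shape):
    return (q.astype(np.float32) * scales + mins).reshape(shape)

w = np.random.randn(128, 256).astype(np.float32)   # a toy weight matrix
q, mins, scales = groupwise_quantize(w.reshape(-1))
w_hat = groupwise_dequantize(q, mins, scales, w.shape)
print("mean absolute error:", float(np.abs(w - w_hat).mean()))
```

As described above, the quantized tensors are what get stored and moved across the memory hierarchy, and they are converted back to FP16 right before the computation.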

The article presents results on the OPT family of models and shows that accuracy is preserved while the method remains efficient at runtime. On top of that, they show that FlexGen is very efficient by running OPT-175B on a single NVIDIA T4 GPU, which has merely 16 GB of VRAM. FlexGen often allows batch sizes that are orders of magnitude larger under the same time/resource constraints. Namely, it performs well even on non-enterprise setups such as a workstation with a T4 GPU, 208 GB of CPU DRAM, and a 1.5 TB SSD, with an input sequence length of 512 and an output sequence length of 32.

Their evaluations showed that when both FlexGen and DeepSpeed Inference have the same latency of 5000 seconds, FlexGen, with an effective batch size of 64, can achieve 40 times higher throughput than DeepSpeed Inference, which has a batch size of 1. In contrast, Hugging Face Accelerate cannot even complete a single batch. Furthermore, they found that when allowing for a higher latency of 12000 seconds, FlexGen was able to achieve 69 times higher maximum throughput compared to the baselines. This was because FlexGen could enlarge the effective batch size to 256 (8192 tokens generated in total), whereas DeepSpeed Inference and Hugging Face Accelerate could not use a batch size > 2 due to memory issues.

Moreover, the study shows that if 4-bit compression is allowed, FlexGen can reach a maximum throughput that is 100 times higher than the baselines. This can be achieved with an effective batch size of 144 (or 4608 tokens generated in total) and a latency of 4000 seconds, and is made possible by holding all weights in the CPU and getting rid of disk offloading. Finally, they compare offloading with the decentralized inference used in PETALS and find that FlexGen outperforms a PETALS cluster in terms of per-GPU throughput, particularly when PETALS has slow server connections. All in all, FlexGen is a promising framework for single-GPU inference that can achieve 1 token/s with an effective batch size of 144 on a single 16GB GPU. Moreover, the approach can be extended to multi-GPU scenarios, which is a very promising direction for future work and promises to help democratize large model data wrangling and evaluation.

Conclusions

This survey has thus covered approaches for making both large-scale training and large-scale inference more accessible outside big industry labs with access to thousands of GPUs. Recent advances have helped make training accessible to research institutions, whereas inference is today accessible even to individual users with consumer-grade hardware, thanks to the collective effort of researchers who have made their work open source. I believe that, if approaches like FlexGen can be combined with collaborative inference proposals like PETALS, inference at large scale will become truly accessible for the majority of users. Of course, this would be limited to high-throughput scenarios, but that is already a very useful capability for many users, particularly those that want to use large language models for processing many documents at once.

It remains to be seen whether similar advances can be made for training, or for tasks that require very low generation latency, such as chatbots. I think there is potential for these ideas on memory allocation and compression to be applied to training. In particular, FlexGen's throughput-first approach could be very useful for training scenarios, as one usually cares more about throughput than latency when training a large model. However, it is important to note that FlexGen was designed with auto-regressive generation in mind, whereas training of large language models relies on the parallelization provided by teacher forcing. In any case, there are valuable ideas to borrow in terms of compression: other researchers have already made 8-bit training possible with certain optimizers. If the same were achieved with 4-bit compression, it would be very interesting to see how this would affect the training time of large language models and whether the LP-based search used in FlexGen could be applied.