Efficiency in deep learning: full-stack systems efficiency
Author: Millicent Li

Introduction

The papers that I am surveying are:
8-bit Optimizers via Block-wise Quantization
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
TVM: An Automated End-to-End Optimizing Compiler for Deep Learning

Efficiency has been a longstanding issue in machine learning, especially when building systems that can be used by individuals who do not have access to large amounts of compute. Models are being developed at an ever faster pace and at ever larger scale. In particular, scale has been shown to improve natural language models and has given rise to “emergent properties” in these large models. However, most regular users cannot run these models, since few people have access to thousands of GPUs. Machine learning therefore requires efficiency on both the hardware and the software side. This realization has become increasingly important for leveling the playing field between academia (regular users, in this case) and industry (where most of the large compute resides).

So, given that many users are excluded from these ever-scaling models, what can be done to make machine learning accessible to all? The answer lies in both hardware and software, and in how the two can work together. I start the survey at the bottom of the stack, focusing on optimizing deep learning for different hardware architectures with the paper “TVM: An Automated End-to-End Optimizing Compiler for Deep Learning”. I then turn to mechanisms that improve the efficiency of machine learning models running on these systems: better optimizers and better matrix multiplication for transformer models, both of which reduce the memory required to train and run these models on various hardware.

The main goal of these papers is to deploy models efficiently on different devices and systems, from cloud servers to self-driving cars to phones [3]. However, existing deep learning frameworks delegate graph-level optimizations to vendor-specific operator libraries, which makes it difficult to support multiple devices [3]. This connects to [1] and [2]: ideally, machine learning would be efficient and simple to use across many platforms, but in practice most devices are far from optimized for the GPUs and other hardware these models require. That is where [3] steps in. They develop TVM, “a compiler that takes a high-level specification of a deep learning program from existing frameworks and generates low-level optimized code for a diverse set of hardware back-ends” [3], so that different devices can use the same framework at, ideally, the same performance. On the training side, [1] applies block-wise quantization, which runs in parallel across cores, combined with dynamic quantization and a stable embedding layer to sharply reduce the memory footprint of optimizer state. It is not just the model that matters, but also the tools required to train it. Finally, [2] reduces the memory required for inference at the software level, with little to no performance degradation. Together, these three papers form the pipeline needed for efficient and ubiquitous access to machine learning models anywhere and anytime.

Background

The three papers cover different topics, but their ideas work well together. [1] focuses on the optimizer states used to maintain gradient statistics over time. When training machine learning models, a large amount of memory must be allocated to hold these statistics alongside the model itself, and as models scale this bookkeeping becomes increasingly expensive. To solve this problem, they develop a fast, high-precision, non-linear quantization method, block-wise quantization, that enables 8-bit optimizers [1], and in doing so they achieve the same accuracy as a 32-bit optimizer with a fraction of the memory. This is a hard problem mainly because as the number of bits used to represent the optimizer state decreases, large models become harder to train: an 8-bit state can take only 2^8 = 256 distinct values, compared with 2^16 = 65,536 for 16-bit and roughly 4.3 billion for 32-bit.

Using a smaller range is also difficult because of quantization accuracy, computational efficiency, and large-scale stability. Simply put, shrinking the optimizer state lowers memory usage, but accuracy takes a hit. In particular, a good 8-bit optimizer needs not only good mean quantization error but excellent worst-case performance, since a single large quantization error can ruin training and cause divergence [1]. The block-wise quantization they propose splits each tensor into blocks and computes the quantization on each block independently. This confines outliers to the blocks that contain them, improves stability, and improves performance. In addition, because the blocks are processed independently, the quantization can be computed in parallel across cores, greatly speeding up the process.

In contrast to [1], [2] targets a different aspect of efficiency: efficiency within the model itself. While [1] addresses the optimizer, which is one part of the equation for accessible machine learning, [2] asks how to run inference on large language models with far less GPU memory. In the era of GPT-3, ChatGPT, and beyond, running large language models has drifted further away from individual use because of the compute requirements, and in NLP especially, these models require a fleet of GPUs to train. [2] focuses on optimizing inference (not yet training!) of these large models.

So, how can these models be made cheap enough for general users to actually run? [2] proposes to quantize the parameters to fewer bits and to use low-bit-precision matrix multiplication. 8-bit quantization methods existed before, but as [2] points out, prior work had not studied models larger than 350M parameters and was often not very successful: it degraded performance and required further quantization tuning after training. The main focus, then, is to understand how quantization behaves and can be tuned as models continue to scale. They are able to quantize models with billions of parameters with little to no loss in performance.

[2] show that billion-parameter models can be quantized with no loss in performance by solving two challenges: 1) maintaining quantization precision at larger and larger scales, and 2) handling the outlier features that emerge at around 6.7B parameters and beyond. For 1), they use vector-wise quantization, which provides more scaling constants than existing methods. For 2), they discover that outlier features emerge at higher parameter counts in a systematic way: they are concentrated in particular dimensions of the transformer, and separating these systematic outliers from the non-outliers, so that matrix multiplication can be applied to each group separately, is what lets them retain full performance with mostly 8-bit precision. This makes it possible to run very large language models, up to 175B parameters, on 8 consumer GPUs. This is the other area of interest in efficiency for machine learning: reducing the memory used during inference without sacrificing accuracy, and they open source the implementation for the research community.

While [1] and [2] make models more efficient in software, there is also the question of running machine learning models ubiquitously across diverse hardware. [3] addresses this direction, allowing users to run models on different platforms without having to write vendor-specific operator libraries for each device. TVM is a compiler built to make this easier in the long run: a “compiler that exposes graph-level and operator-level optimizations to provide performance portability to deep learning workloads across diverse hardware back-ends” [3].

CPUs, GPUs, and TPU-like accelerators have very different architectures: some code runs faster on GPUs than on TPUs and vice versa, and the devices also diverge in memory organization and functional units. Existing deep learning frameworks such as TensorFlow, MXNet, Caffe, and PyTorch “rely on a computational graph intermediate representation to implement optimizations, e.g., auto differentiation and dynamic memory management” [3]. This becomes problematic because the operator implementations underlying that graph differ from platform to platform.

Graph-level optimizations alone cannot bridge these hardware differences, because most frameworks focus “on a narrow class of server-class GPU devices and delegate target-specific optimizations to highly engineered and vendor-specific operator libraries” [3]. Those operator libraries require significant manual tuning for each new hardware device, which makes porting very difficult. Imagine taking a library written for a large deep learning rig and porting it to a smartphone: the requirements of the two platforms are completely different. In essence, there is a mismatch between the optimizations a framework can apply and the operators defined in the vendor’s operator library, and this is what TVM tries to solve.

In addition, even when operator implementations exist, they may not be well optimized for a specific piece of hardware, so there must be a strategy for figuring out which implementations perform best. TVM takes a high-level specification of a deep learning program from an existing framework and generates low-level optimized code for a diverse set of hardware back-ends. This lets users target new hardware without hand-writing operator implementations, which makes the system usable for people outside of machine learning and systems research. Because these hardware optimizations boil down to choices over data layouts, loop structures, and many matrix multiplications, the space of possible implementations is combinatorially large, and searching it naively would be very costly.

In the end, TVM addresses these challenges by introducing (1) a tensor expression language for writing operators, together with program transformation primitives that can generate many program variants with different optimizations; (2) an automated program optimization framework for finding optimized tensor operators; and (3) a graph rewriter that takes full advantage of both high-level (graph) and low-level (operator) optimizations. Put together, these pieces form an end-to-end compiler that can take high-level specifications directly. This kind of efficiency matters because being able to run the same machine learning model efficiently on different devices paves the way for embedding machine learning in future tools. For instance, assistants like Siri or Cortana could run on smaller devices with the same accuracy and efficiency as on larger ones.

These three papers form a trifecta of what research on machine learning efficiency across different machines needs to focus on to move forward. The interesting part of the combination is that each plays its own role in improving current systems, and in tandem they could all be used together in a larger deep learning system. A present-day example of the limitation they address is models trained via RLHF (Reinforcement Learning from Human Feedback), which cannot be trained without large compute and, of course, large swaths of human feedback data.

Eventually, one can imagine individuals wanting to use these models on consumer systems such as phones and smart-home devices, particularly for chat-like activities. TVM [3] would let these models run on such devices, while [1] and [2] would make training and inference on them affordable. That said, a consumer system is unlikely to need online learning and inference simultaneously: these tools are generally deployed in production, and online learning is costly because it requires continually collecting data and retraining the model.

Methodology

Paper 1

So, how are these methods actually implemented, and what experiments does each paper run? The method in [1] breaks down into three parts: (1) block-wise quantization, (2) dynamic quantization, and (3) a stable embedding layer to improve stability. For (1), the tensor is first normalized into an appropriate range, in their case [-1, 1]. To keep this normalization cheap, the input tensor is chunked into small blocks and each block is normalized independently, which saves both cost and time. Block-wise quantization has several advantages for stability and efficiency. Because the blocks are independent, the quantization can be computed in parallel across many cores, letting the optimizer take full advantage of modern parallel hardware.
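
To make the block-wise idea concrete, here is a minimal NumPy sketch of block-wise quantization with a per-block absmax constant; the block size and the simple linear 8-bit mapping are illustrative choices rather than the paper's exact implementation, which uses a non-linear codebook:

```python
import numpy as np

def blockwise_quantize(x, block_size=2048):
    """Quantize a float tensor to int8 with one absmax scale per block."""
    flat = x.ravel()
    pad = (-len(flat)) % block_size                      # pad so the length divides evenly
    blocks = np.pad(flat, (0, pad)).reshape(-1, block_size)
    absmax = np.abs(blocks).max(axis=1, keepdims=True)   # per-block normalization constant
    absmax[absmax == 0] = 1.0                            # avoid division by zero
    q = np.round(blocks / absmax * 127).astype(np.int8)  # map [-absmax, absmax] -> [-127, 127]
    return q, absmax, x.shape, pad

def blockwise_dequantize(q, absmax, shape, pad):
    blocks = q.astype(np.float32) / 127 * absmax
    flat = blocks.ravel()
    return (flat[:-pad] if pad else flat).reshape(shape)

state = np.random.randn(10_000).astype(np.float32)       # stand-in for an optimizer state tensor
q, scale, shape, pad = blockwise_quantize(state)
approx = blockwise_dequantize(q, scale, shape, pad)
print(np.abs(state - approx).max())                      # small per-block quantization error
```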

Secondly, the quantization makes the optimizer more robust to outliers in the input tensor: because each block is a smaller subproblem, a single outlier only inflates the normalization constant of its own block, so the remaining blocks keep their precision. Block-wise quantization also represents the maximum-magnitude value of each block without any error, which guarantees that the largest optimizer states are quantized with full precision and leads to more robust, precise behavior. For dynamic quantization (2), they repurpose and extend an existing approach called dynamic tree quantization. The trick is that the second Adam state is strictly positive, so its sign bit is never needed; they reuse that bit as an extra fixed bit for the fraction. This is effective because the second Adam state spans only 3-5 orders of magnitude, whereas dynamic tree quantization covers 7 orders of magnitude, so spending the freed-up bit on the fraction gives a more precise reading of the optimizer state.
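
Non-linear schemes such as dynamic tree quantization can be thought of as quantizing each normalized value against a fixed 256-entry codebook. The real dynamic tree codebook allocates bits between an exponent and a linear fraction; the sketch below only illustrates the lookup mechanics with a hypothetical non-uniform codebook that is denser near zero:

```python
import numpy as np

def make_nonuniform_codebook(num_values=256):
    """Hypothetical non-uniform codebook: denser near zero, sparser near +/-1.
    Cubic spacing only mimics the 'more precision near zero' property of the
    real dynamic tree codebook for illustration."""
    lin = np.linspace(-1.0, 1.0, num_values)
    return np.sign(lin) * np.abs(lin) ** 3

def quantize(x_normalized, codebook):
    """Map each normalized value to the index of its nearest codebook entry."""
    return np.abs(x_normalized[:, None] - codebook[None, :]).argmin(axis=1).astype(np.uint8)

def dequantize(idx, codebook):
    return codebook[idx]

codebook = make_nonuniform_codebook()
x = np.tanh(np.random.randn(8)).astype(np.float32)   # values already normalized to [-1, 1]
idx = quantize(x, codebook)                           # 8-bit indices
print(x, dequantize(idx, codebook))
```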

Finally, they introduce (3), a stable embedding layer, which is a variation of the standard BERT-style embedding layer. It supports more aggressive quantization by normalizing the highly non-uniform distribution of input tokens, which avoids extreme gradient variation. This keeps the layer's variance at roughly one both at initialization and during training, which ensures stability when using the 8-bit optimizer. Notably, the embedding layer itself is still trained with 32-bit optimizer states, since prior work has shown that doing so improves performance drastically; the goal is to use less memory overall with an 8-bit optimizer while matching the performance of using 32-bit states for all layers. For experiments, they compare their 8-bit optimizer against existing 32-bit optimizers on public benchmarks. They find that the 8-bit optimizer saves up to 8.5 GB of GPU memory for their largest, 1.5B-parameter model and 2.0 GB for RoBERTa (a “small” model by today's standards). So, the method works and saves a substantial amount of training-time memory from the optimizer alone.
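
In practice, these pieces are packaged in the bitsandbytes library released alongside [1]; a minimal usage sketch, assuming a CUDA-capable GPU and with arbitrary layer sizes and a placeholder loss, might look like this:

```python
import torch
import torch.nn as nn
import bitsandbytes as bnb  # library released alongside [1]; requires a CUDA-capable GPU

# Tiny language-model-style network; vocabulary size and width are arbitrary choices.
model = nn.Sequential(
    bnb.nn.StableEmbedding(num_embeddings=32_000, embedding_dim=512),  # stable embedding layer from [1]
    nn.Linear(512, 512),
    nn.ReLU(),
    nn.Linear(512, 32_000),
).cuda()

# Drop-in replacement for torch.optim.Adam that stores its state in 8 bits
# using block-wise dynamic quantization.
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-3, betas=(0.9, 0.995))

tokens = torch.randint(0, 32_000, (8, 16), device="cuda")  # fake batch of token ids
loss = model(tokens).sum()                                 # placeholder loss for illustration
loss.backward()
optimizer.step()                                           # 8-bit state update
optimizer.zero_grad()
```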

Paper 2

Again, [2] moves toward the model itself and how much memory its parameters use, which goes hand in hand with the optimizer question above. Two specific questions guide their experiments: first, at what scale and why do quantization techniques fail, and second, how does this relate to quantization precision [2]? They answer these by scaling model sizes upward until the quantization techniques break the model.

More specifically, the paper builds on three existing techniques. First, there is absmax quantization, which scales inputs into the 8-bit range [-127, 127] by multiplying by 127 divided by the absolute maximum of the entire tensor. Second, zeropoint quantization applies an affine transformation: the input distribution is scaled by a normalized dynamic-range factor and shifted by a zeropoint value so that it fills the full range [-127, 127]. This lets outputs constrained to one side of zero (such as ReLU outputs) use the entire range, whereas absmax quantization would use only half of it. Third, there is int8 matrix multiplication with 16-bit float inputs and outputs, built on top of either absmax or zeropoint quantization. These techniques have been used in the past; the paper's contribution is to study how int8 matrix multiplication behaves as models get bigger and bigger.
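
A small NumPy sketch of the two quantization schemes (a generic formulation of symmetric versus asymmetric 8-bit quantization, not necessarily the paper's exact equations):

```python
import numpy as np

def absmax_quantize(x):
    """Symmetric 8-bit quantization: scale by 127 / max|x|."""
    scale = 127.0 / np.abs(x).max()
    q = np.round(x * scale).astype(np.int8)
    return q, scale                       # dequantize with q / scale

def zeropoint_quantize(x):
    """Asymmetric 8-bit quantization: affine map of [min, max] onto [-127, 127]."""
    scale = 254.0 / (x.max() - x.min())   # normalized dynamic-range factor
    zeropoint = np.round(-x.min() * scale) - 127.0
    q = np.clip(np.round(x * scale + zeropoint), -127, 127).astype(np.int8)
    return q, scale, zeropoint            # dequantize with (q - zeropoint) / scale

x = np.maximum(np.random.randn(6), 0).astype(np.float32)  # ReLU-like, non-negative values
qa, sa = absmax_quantize(x)               # only uses [0, 127]
qz, sz, zp = zeropoint_quantize(x)        # spreads values over the full [-127, 127]
print(qa, qz)
```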

To scale matrix multiplication to bigger models, their method combines (1) vector-wise quantization (absmax specifically) with (2) mixed-precision decomposition. For (1), they increase the number of scaling constants in the matrix multiplication by treating it as a sequence of independent inner products: each row of the hidden states and each column of the weight matrix gets its own quantization constant, and each inner-product result is denormalized by dividing by the product of the corresponding row and column constants. For (2), which is the core of LLM.int8(), they observe that large transformers develop features of very large magnitude, which their solution for (1) does not handle well. Because these outlier features are sparse and systematic, they develop a new decomposition technique: feature dimensions whose values exceed a chosen magnitude threshold are separated into their own set (they occur in only a few dimensions) and multiplied in 16-bit precision, while the remaining dimensions are multiplied in 8-bit. This gives high-precision multiplication exactly where it is needed while keeping the matrix multiplication memory-efficient, and the decomposition itself adds very little memory overhead.

With (1) and (2) in place, they measure the robustness of quantization methods as the models scale, which is what they wanted to show to begin with. There are two evaluations: 1) language modeling perplexity and 2) zero-shot accuracy. They find that zeropoint LLM.int8() (vector-wise + decomposition) is competitive with the 32-bit float baseline, matching or even slightly exceeding its scores. Efficiency-wise, they also note that LLM.int8() runs about two times faster for large matrix multiplications of the size found in 175B-parameter models. As a result, LLM.int8() lets individuals without many GPUs run these models, a step in the right direction for improving access for everyone.
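
A minimal NumPy sketch of the mixed-precision decomposition idea; the magnitude threshold of 6.0 follows the spirit of the paper, but the per-tensor int8 quantization here is simplified relative to the paper's vector-wise scheme:

```python
import numpy as np

def int8_matmul_with_decomposition(X, W, threshold=6.0):
    """Multiply hidden states X (n, d) by weights W (d, m), routing outlier
    feature dimensions through float16 and the rest through int8."""
    outlier_dims = np.where(np.abs(X).max(axis=0) >= threshold)[0]    # feature dims with big values
    regular_dims = np.setdiff1d(np.arange(X.shape[1]), outlier_dims)

    # 16-bit path for the few outlier dimensions.
    out_fp16 = X[:, outlier_dims].astype(np.float16) @ W[outlier_dims, :].astype(np.float16)

    # 8-bit path for everything else (simplified per-tensor absmax quantization).
    Xs, Ws = X[:, regular_dims], W[regular_dims, :]
    cx, cw = 127.0 / np.abs(Xs).max(), 127.0 / np.abs(Ws).max()
    Xq = np.round(Xs * cx).astype(np.int8)
    Wq = np.round(Ws * cw).astype(np.int8)
    out_int8 = (Xq.astype(np.int32) @ Wq.astype(np.int32)) / (cx * cw)  # accumulate in int32, then rescale

    return out_int8 + out_fp16.astype(np.float32)

X = np.random.randn(4, 64).astype(np.float32)
X[:, 3] *= 20.0                                    # inject an outlier feature dimension
W = np.random.randn(64, 32).astype(np.float32)
# Quantization error stays small relative to the exact float32 result.
print(np.abs(int8_matmul_with_decomposition(X, W) - X @ W).max())
```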

Paper 3

Last but not least, [3] is a different flavor of work. Instead of optimizing a specific model, it builds an end-to-end compiler platform that covers as many use cases as possible, so there are multiple components to walk through. The flow goes like this: the system takes an input model from an existing framework, transforms it into a computational graph representation, and performs high-level dataflow rewriting to generate an optimized graph. The operator code generated for that graph must then be efficient on the target hardware. Rather than having the user specify these low-level optimizations, TVM identifies them automatically, using a machine-learning-based cost model to decide how each operator should be implemented. At the end of the pipeline, the generated code is packaged into a deployable module.
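
To make the flow concrete, here is a rough sketch using TVM's Python API; the tiny hand-built Relay graph stands in for a model imported through a frontend such as relay.frontend.from_onnx, and the target string, shapes, and layer sizes are arbitrary (exact API details vary across TVM versions):

```python
import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_executor

# A tiny stand-in for an imported model: in practice this graph would come from
# a frontend importer such as relay.frontend.from_onnx or from_pytorch.
x = relay.var("x", shape=(1, 64), dtype="float32")
w = relay.var("w", shape=(32, 64), dtype="float32")
y = relay.nn.relu(relay.nn.dense(x, w))
mod = tvm.IRModule.from_expr(relay.Function([x, w], y))
params = {"w": tvm.nd.array(np.random.rand(32, 64).astype("float32"))}

# High-level graph rewriting plus operator code generation for a chosen back-end.
target = "llvm"  # e.g. "cuda" for NVIDIA GPUs, "llvm -mtriple=aarch64-linux-gnu" for ARM
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)

# Deploy the compiled module and run inference.
dev = tvm.device(target, 0)
module = graph_executor.GraphModule(lib["default"](dev))
module.set_input("x", np.random.rand(1, 64).astype("float32"))
module.run()
print(module.get_output(0).numpy().shape)   # (1, 32)
```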

So, the end-to-end pipeline for TVM breaks down into three parts, each explained in more detail below: 1) optimization of computational graphs, 2) generation of tensor operations, and 3) automation of the optimization process. For 1), TVM exploits a computational graph representation and applies high-level optimizations: operator fusion, which fuses several small operations into one kernel; constant folding, which pre-computes the static parts of the graph; a static memory planning pass, which pre-allocates memory to hold each intermediate tensor; and data layout transformations, which convert internal data layouts into back-end-friendly forms. In effect, (1) is itself a pipeline of passes for optimizing computational graphs.
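
Two of these graph-level passes, constant folding and operator fusion, are exposed directly in TVM's Relay pass infrastructure; the small sketch below applies them to a toy graph (the pass names are real, but the example graph and option values are illustrative, and APIs shift slightly between TVM versions):

```python
import tvm
from tvm import relay

# A small graph: y = relu((x + 1.0) * c), where c is a constant subexpression.
x = relay.var("x", shape=(1, 16), dtype="float32")
c = relay.add(relay.const(1.0), relay.const(1.0))        # constant subexpression to fold
y = relay.nn.relu(relay.multiply(relay.add(x, relay.const(1.0)), c))
mod = tvm.IRModule.from_expr(relay.Function([x], y))

# Graph-level passes: pre-compute constants, then fuse elementwise ops into one kernel.
passes = tvm.transform.Sequential([
    relay.transform.FoldConstant(),
    relay.transform.FuseOps(fuse_opt_level=2),
])
with tvm.transform.PassContext(opt_level=3):
    optimized = passes(mod)

print(optimized)   # the add/multiply/relu chain now appears as a single fused function
```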

For (2), TVM produces efficient code for each operator by generating many valid implementation options. The challenge is the sheer number of code permutations that could be created for each operator; fortunately, the system does not have to enumerate all of them, only to find a few options well suited to the situation at hand. To produce efficient operator code, they introduce a tensor expression language that supports automatic code generation. (Automatic code generation may sound routine in the era of large language models, but when this paper was published in 2018 it was quite difficult, especially for this specific task.) The tensor expression language has to be flexible enough to accommodate any framework yet general enough to cover as many cases as possible. On top of it, they define schedule primitives that denote specific mappings from a tensor expression to low-level code, essentially the different ways a given computation can be realized.
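
As a sketch of what the tensor expression language and schedule primitives look like in TVM's Python API (the matrix sizes and the tiling factor of 32 are arbitrary choices):

```python
import tvm
from tvm import te

# Declare the computation: C[i, j] = sum_k A[i, k] * B[k, j]
n, m, k = 1024, 1024, 1024
A = te.placeholder((n, k), name="A")
B = te.placeholder((k, m), name="B")
kk = te.reduce_axis((0, k), name="kk")
C = te.compute((n, m), lambda i, j: te.sum(A[i, kk] * B[kk, j], axis=kk), name="C")

# A schedule maps the expression to low-level code; primitives such as split and
# reorder describe one of many possible loop structures for the same math.
s = te.create_schedule(C.op)
io, ii = s[C].split(C.op.axis[0], factor=32)   # tile the row loop
jo, ji = s[C].split(C.op.axis[1], factor=32)   # tile the column loop
s[C].reorder(io, jo, ii, ji, kk)

# Lower and build for a back-end (here LLVM on CPU).
func = tvm.build(s, [A, B, C], target="llvm", name="matmul")
print(tvm.lower(s, [A, B, C], simple_mode=True))   # inspect the generated loop nest
```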

They also embed parallelism in the scheduling language to make compute-intensive deep learning kernels efficient. Because new accelerators are always being built, the set of primitives must stay extensible; to support this, they introduce a tensorize schedule primitive that replaces a unit of computation with a corresponding hardware intrinsic. Essentially, the compiler matches the computation pattern in the program against a hardware declaration and lowers it to the corresponding intrinsic of the target accelerator.

Lastly, they introduce a virtual threading schedule primitive that lets the programmer specify a high-level data-parallel program. This enables latency hiding: the compiler lowers the multi-threaded program schedule into a single instruction stream with explicit low-level synchronization operations, interleaving the virtual threads so that the hardware can recover the available pipeline parallelism from the instruction stream.

Finally, for (3), they search for the best operator implementation automatically. The automated schedule optimizer has two components: a schedule explorer that proposes configurations and a machine-learning cost model that predicts the performance of a given configuration. For the explorer, they develop a schedule template specification API that lets a developer (with or without domain knowledge) declare the tunable knobs of the schedule space. The cost model takes a statistical approach to the modeling problem: given a schedule configuration, it predicts its run time, and it is trained online on measurements collected during search, so its accuracy improves as more configurations are explored.
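
The following is a conceptual sketch of that explore-measure-update loop in plain Python; it is not TVM's actual AutoTVM/auto-scheduler API, and the schedule knobs, toy cost model, and fake hardware measurement are all hypothetical stand-ins:

```python
import random

# Hypothetical knobs of a schedule space: tile sizes and an unroll flag.
SCHEDULE_SPACE = [
    {"tile_x": tx, "tile_y": ty, "unroll": u}
    for tx in (8, 16, 32, 64)
    for ty in (8, 16, 32, 64)
    for u in (0, 1)
]

def measure_on_hardware(config):
    """Stand-in for compiling a schedule and timing it on the real device."""
    return 1.0 / (config["tile_x"] * config["tile_y"]) + 0.01 * random.random()

class CostModel:
    """Toy online cost model: remembers measurements, predicts via the nearest known config."""
    def __init__(self):
        self.history = []                       # (config, measured_time) pairs

    def predict(self, config):
        if not self.history:
            return random.random()              # no data yet: explore randomly
        def distance(c):
            return sum(abs(c[k] - config[k]) for k in config)
        return min(self.history, key=lambda h: distance(h[0]))[1]

    def update(self, config, time):
        self.history.append((config, time))

model = CostModel()
best_config, best_time = None, float("inf")
for _ in range(20):                             # search budget
    # Explorer: among random candidates, pick the one the cost model thinks is fastest.
    candidate = min(random.sample(SCHEDULE_SPACE, 8), key=model.predict)
    time = measure_on_hardware(candidate)       # only promising candidates hit hardware
    model.update(candidate, time)               # online training of the cost model
    if time < best_time:
        best_config, best_time = candidate, time

print(best_config, best_time)
```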

The cost model is a core part of TVM, since it is what allows the best schedule to be chosen automatically. In the evaluation, they look to answer several questions, but the overarching one is whether TVM as a whole outperforms existing frameworks on real machine learning workloads. Across four types of platforms, TVM delivers better relative speedups (and lower run times) than other frameworks such as TensorFlow Lite. The hoped-for result of building this end-to-end compiler is the adoption of a dynamically extensible system for optimizing deep learning and machine learning workloads.

Limitations

Now that the main contributions of each paper have been covered, the remainder of the survey focuses on their limitations and how each limitation might be addressed by the other papers in the set of three. For instance, [1] is limited in where it reduces GPU memory load: the authors note that it does not reduce the activation memory footprint or other memory costs unrelated to the number of parameters of the model.

[2] is similar to [1] in that it reduces memory consumption tied to the number of model parameters, but for inference rather than for the optimizer. Both papers share the same downside: they leave open how to save further memory in training and inference beyond what scales with the parameter count. What is interesting about how [2] accomplishes its task compared to [1] is the observation that models beyond a certain size show instability and divergence, with exploding gradients; one discovery from [2] is that outlier features may be behind this behavior. This is notable because [2] was not a direct successor project to [1], yet the concepts introduced in [2] could have informed [1] and helped resolve the instabilities holding back progress in the area. [1] hypothesized that outliers could be the reason for the extreme instability; [2] shows that this phenomenon is real to some extent, although achieving perfectly stable training or inference remains difficult overall.

The limitations of [3] are harder to relate to [1] and [2]. One presumable limitation is that ever newer hardware specifications will keep appearing in ways that cannot be predicted, along with the burden of keeping up with existing frameworks. Since there are few ways to combat the aging of technology, the only option is to keep software and hardware in step with each other. By creating more efficient ways to train and run models, as in [1] and [2], benchmarking on a compiler like TVM [3] can stay consistent with existing state-of-the-art methods.

Do the papers stand the test of time?

Another question of interest is whether each paper's approach can stand the test of time, in the sense that its methods will still be applied in future work. Low-bit optimization strategies are likely to remain relevant in machine learning for a long while, so there is a good chance that [1] and [2] will endure. However, if the state of the art in machine learning (and natural language processing in particular) shifts to different kinds of models, these methods could become obsolete; then again, matrix multiplication remains the crux of machine learning, so there is a good chance they carry over into whatever comes next. The idea of outlier features may also stay relevant as scientists continue to scale models, though if future interest shifts toward smaller models (which could become more prevalent in industry), the need to understand outliers may decrease as a whole.

On the other hand, a system such as TVM is likely to stay relevant in nearly every context, since our future seems to involve embedding machine learning into as many devices as possible. Whether that is inherently a good idea hardly matters, given how many “smart” tools keep being introduced, such as smart fridges (who actually even needs a smart fridge?). In the end, efficiency is a very important problem to tackle, especially as models continue to scale with the ongoing hype for large models and as researchers keep throwing more data at them, training for longer and discovering newer “emergent” abilities.

Potential for widespread adoption

Finally, a more general question is whether these papers can be used outside the circle of domain experts who are already interested in such tools. For instance, will improving the efficiency of machine learning models encourage the average user to learn how machine learning works? Most likely yes, since these papers enable more model training and inference on consumer hardware. But are the tools easy to set up and use? Something like TVM [3] is unlikely to be easy for a non-expert to pick up, since the compiler seems geared toward the seasoned programmer. In contrast, [1] and [2] are exposed in a user-friendly fashion (via Hugging Face's APIs), so they are poised for wide adoption. To make a deep learning compiler like TVM more accessible to laypeople, it will be necessary to design entry points that are less daunting than what is introduced in the paper.
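
For example, assuming recent-enough versions of the transformers, accelerate, and bitsandbytes packages (newer releases fold the flag into a BitsAndBytesConfig object), loading a model with LLM.int8() can be as short as the following sketch; the checkpoint name is just an example:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-1.3b"   # example checkpoint; other causal LMs on the Hub work similarly
tokenizer = AutoTokenizer.from_pretrained(model_name)

# load_in_8bit=True pulls in bitsandbytes and applies LLM.int8() to the linear layers,
# substantially reducing the GPU memory needed compared to fp16 weights.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    load_in_8bit=True,
)

inputs = tokenizer("Efficiency in deep learning is", return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```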

Conclusion

To come back to the main point: efficiency in machine learning is extremely important. As these papers show, building scalable systems and methods for efficiency lowers the barrier to entry and gives more people the opportunity to understand how these systems work and to support machine learning research. However, without the software stack and hardware stack working in harmony, it is very difficult to achieve real efficiency in deep learning and machine learning. To keep paving the way for innovation in this area, we must continue to optimize our hardware alongside our software and make machine learning more efficient for all.