A Survey on Optimizations in Machine Learning Pipelines - Intensive Systems - CS 7600

Zohair Shafi

Introduction

In the past decade, machine learning has made significant strides, leading to the development of specialized hardware such as Tensor Processing Units (TPUs) and Graphics Processing Units (GPUs) that offer larger inbuilt memory to accommodate the ever-increasing size of models. In tandem with these hardware advancements, sophisticated distributed software has also emerged to support the development and deployment of machine learning workflows at scale. However, despite these developments, there remains a significant gap between the hardware and software available today, hindering the ability to fully utilize the potential of modern hardware.

One of the primary reasons for this disconnect is the need for software to be versatile enough to run across different types of hardware with minimal code modifications. While this approach ensures compatibility across a wide range of hardware, it also limits the ability to optimize software for specific hardware configurations. However, recent research has shown that it is possible to bridge this gap by incorporating optimizations that improve performance while maintaining generality in terms of hardware support.

In this paper, we explore three recent works that seek to reduce the gap between hardware and software, with the goal of improving the utilization of GPUs and TPUs and enhancing overall performance in machine learning workflows. Two of these works aim to improve performance across machine learning workflows in general, while the third focuses on a specific type of machine learning problem, namely, graph neural networks, that presents unique challenges.

We begin by examining the use of virtualization as a means of improving performance (the Virtualization section). By making hardware transparent, virtualization enables the benefits of standard distributed computing to be leveraged in machine learning workflows. Next, in the Pipeline Analysis section, we delve into the details of the machine learning pipeline to identify bottlenecks and enhance efficiency. Finally, in the Model Specific Speed Ups section, we apply the same pipeline analysis approach to improve performance specifically for graph neural networks.

Overall, this paper aims to provide a comprehensive overview of recent research in the field of hardware-software co-design for machine learning. By highlighting the potential for hardware-aware optimization techniques to enhance performance in machine learning workflows, we hope to inspire further exploration and innovation in this exciting field.

Virtualization

This section discusses in detail work by Or et al. titled ``VirtualFlow: Decoupling deep learning models from the underlying hardware''. The authors note that current machine learning libraries like PyTorch and TensorFlow tightly couple the model and the underlying hardware, which gives rise to a host of challenges:

1. **Adapting to Heterogeneous Environments**
Jobs are currently restricted to a single type of accelerator. Decoupling the model from the underlying hardware and tying it instead to "virtual nodes" would help utilize leftover hardware.

2. **Dynamic Resource Allocation**
Currently, adjusting a running job to utilize more resources requires interrupting and restarting the job, which is time consuming.

3. **Lack of Reproducibility**
Reproducing the same model convergence behavior across different hardware typically requires readjusting important hyperparameters such as the batch size and the learning rate.

The authors argue that these challenges can be solved by decoupling systems-level constraints from application-level semantics, i.e., a model should be able to converge to the same performance regardless of the underlying hardware it is trained on. To that end, they introduce ``virtual nodes'', where each batch of the input data is partitioned among virtual nodes instead of hardware accelerators. One or more virtual nodes are then mapped to each hardware accelerator and processed sequentially on that accelerator, producing one or more MapReduce-style waves of execution within each step of training or inference.

All virtual nodes share the same model parameters. This allows the model to be cached in each accelerator's memory at the beginning of each step and efficiently reused by all virtual nodes mapped to that accelerator. The gradients produced by these virtual nodes are then aggregated into a shared memory buffer on the accelerator, adding a small, constant overhead independent of the number of virtual nodes. Note that this preserves model convergence behavior across different hardware by fixing the total number of virtual nodes, and thus the batch size and other hyperparameters; only the mapping between virtual nodes and hardware accelerators needs to be adjusted.

The authors claim that this approach improves cluster utilization by 20%, reduces job completion time by 48% with elasticity, and improves job throughput by 42% with heterogeneous training when implemented on top of TensorFlow and evaluated on a set of representative models such as ResNet, BERT, and Transformers.

Current Practices and Solutions

To understand why these issues have not been addressed earlier, we take a look at the current state of implementations.

Hyperparameters

Hyperparameters play a crucial role in the convergence of a model. Batch size, learning rate, and dropout rate are some of the key hyperparameters that significantly affect the model's performance. Larger batch sizes generally lead to better training and inference throughput, but this is constrained by the memory limits of hardware accelerators. As a result, reproducing existing results on different hardware can be challenging as different GPUs may have different memory limits, leading to different optimal batch sizes.
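To make the virtual-node mechanism concrete, the following is a minimal sketch of the idea, written in PyTorch for brevity even though the authors implement VirtualFlow on top of TensorFlow; all names and sizes are illustrative, and the cross-accelerator gradient aggregation is not shown. Fixing the number of virtual nodes fixes the global batch size, and hence the parameter update, regardless of how many virtual nodes end up processed sequentially on one accelerator.

```python
import torch

NUM_VIRTUAL_NODES = 8                      # fixed across hardware configurations
GLOBAL_BATCH = 256                         # therefore the effective batch size is fixed too

model = torch.nn.Linear(128, 10)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.CrossEntropyLoss()

def train_step(x, y, assigned_virtual_nodes):
    """Process this accelerator's virtual nodes sequentially (one 'wave' each),
    accumulating gradients so the final update matches the full-batch update."""
    opt.zero_grad()
    xs, ys = x.chunk(NUM_VIRTUAL_NODES), y.chunk(NUM_VIRTUAL_NODES)
    for v in assigned_virtual_nodes:
        loss = loss_fn(model(xs[v]), ys[v]) / NUM_VIRTUAL_NODES
        loss.backward()                    # gradients accumulate in the shared parameters
    opt.step()

x = torch.randn(GLOBAL_BATCH, 128)
y = torch.randint(0, 10, (GLOBAL_BATCH,))
# A small GPU might be assigned all 8 virtual nodes; a larger one only 2 of them,
# with the rest handled by other accelerators before gradients are aggregated.
train_step(x, y, assigned_virtual_nodes=range(NUM_VIRTUAL_NODES))
```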

Model Graphs

One area where the model and hardware are linked is the model graph, which outlines the sequence of operations to be performed on input data. This graph is compiled and optimized at the beginning of training and then reused for the entire training process. Additionally, the graph contains information about the underlying cluster configuration; consequently, its synchronization operations involve a fixed set of hardware accelerators. Any change to the allocation of resources requires the entire model graph to be rebuilt from scratch, along with reloading any previously trained models.

Fortunately, virtual nodes can help solve this problem. They maintain a dynamic mapping between virtual nodes and hardware accelerators, which can be redistributed as needed based on changes in cluster demand. During redistribution, the total number of virtual nodes remains the same, ensuring that any adjustments to resource allocation are seamless from the perspective of both the application and the model graph. In order to scale out, however, certain virtual node state must be migrated to the new accelerators. This includes model parameters as well as stateful kernels, which contain important information such as the moving mean and variance of batch normalization layers. Overall, the use of virtual nodes can help optimize the use of hardware resources and streamline the training process.
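A hypothetical sketch of the dynamic mapping described above: the set of virtual nodes is fixed, while their assignment to whatever accelerators are currently available can be recomputed at any time. The accelerator names and the round-robin policy are illustrative, not VirtualFlow's actual scheduler.

```python
def assign_virtual_nodes(num_virtual_nodes, accelerators):
    """Spread a fixed set of virtual nodes over whatever accelerators are available.
    The total number of virtual nodes (and hence the batch size) never changes."""
    mapping = {acc: [] for acc in accelerators}
    for v in range(num_virtual_nodes):
        mapping[accelerators[v % len(accelerators)]].append(v)
    return mapping

# Scaling out from 2 to 4 GPUs only changes the mapping; model parameters and
# stateful kernels for the migrated virtual nodes would be copied to the new devices.
print(assign_virtual_nodes(8, ["gpu:0", "gpu:1"]))
print(assign_virtual_nodes(8, ["gpu:0", "gpu:1", "gpu:2", "gpu:3"]))
```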

Types of Parallelizations

In distributed deep learning workloads, the most widely adopted form of parallelism is data parallelism, which involves distributing the input batch across multiple accelerators, and each accelerator processes its own share of the batch independently. Model parallelism, on the other hand, is used for very large models that cannot fit in the memory of a single accelerator and involves partitioning, rather than replicating, the model graph across multiple accelerators.

This study focuses specifically on data parallelism in synchronous training, where gradients are synchronized across different data chunks. This method has been widely adopted in distributed deep learning because it provides a high degree of parallelism while also being easy to implement. However, the drawback of synchronous training is that it requires all the accelerators to wait for the slowest one to finish processing its batch, leading to underutilization of some accelerators.

Limitations and Discussions

The authors of the paper present virtual nodes as a solution for dynamically changing the type and number of hardware accelerators during the training process without disrupting it. However, this method has a limitation - it only works if the largest batch size used can fit into the GPU or TPU with the smallest memory limit. Although it may not completely resolve the issue of changing hyperparameters due to memory constraints, it does enable the data to be efficiently partitioned if there is at least one accelerator available that can accommodate the batch size.

To optimize the data split across different types of accelerators, the authors propose a linear programming formulation informed by profiling the machine learning task on each available accelerator type. The objective is to find the most efficient allocation of batch sizes to each accelerator type so that every accelerator is maximally utilized. By doing so, the authors aim to reach the setting that minimizes training time and cost.
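As a sketch of what such a formulation could look like (the throughput numbers are invented and the exact objective and constraints in the paper may differ), one can minimize the per-step time subject to profiled per-accelerator throughputs and the fixed global batch size:

```python
import numpy as np
from scipy.optimize import linprog

global_batch = 1024
throughput = np.array([1800.0, 900.0, 450.0])   # examples/sec per accelerator type (profiled)
n = len(throughput)

# Variables: x = [b_0, ..., b_{n-1}, t], where b_i is the share of the batch given to
# accelerator i and t is the step time (the makespan we minimize).
c = np.zeros(n + 1)
c[-1] = 1.0                                                        # minimize t
A_ub = np.hstack([np.diag(1.0 / throughput), -np.ones((n, 1))])    # b_i / r_i <= t
b_ub = np.zeros(n)
A_eq = np.hstack([np.ones((1, n)), np.zeros((1, 1))])              # sum_i b_i = global batch
b_eq = np.array([global_batch])

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, None)] * (n + 1))
print("batch split:", np.round(res.x[:n]), "step time (s):", round(res.x[-1], 4))
```

The solution allocates batch shares proportional to each accelerator's throughput, so no device waits on a slower peer at the synchronization barrier.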

Pipeline Analysis

In this section, we delve into the work of Kuchnik et al. titled ``Plumber: Diagnosing and removing performance bottlenecks in machine learning data pipelines''. Unlike the previous work on virtualization, this research focuses more on detailed analysis and profiling of machine learning workflows. The authors analyzed more than two million jobs in Google data centers and concluded that a significant fraction of model training jobs could benefit from faster input data pipelines. Additionally, they found that most jobs do not saturate host hardware, meaning that most bottlenecks result from software-based inefficiencies. In response, they propose ``Plumber'', an extensible and interpretable analytical model based on operational analysis. Plumber is designed to automatically tune parallelism, prefetching, and caching under host resource constraints and can obtain speedups of up to 47x for misconfigured pipelines.

It is essential to differentiate between the two types of bottlenecks that arise in machine learning pipelines: hardware and software bottlenecks. Hardware bottlenecks occur when input data processing saturates the host CPU and/or memory resources, while software bottlenecks arise due to poor configuration or I/O and do not saturate the host. The authors also present a linear programming formulation that uses rates traced from the machine learning pipeline at run time to predict an upper bound on performance. Throughout the work, the authors focus mostly on the TensorFlow library.

Current Practices in Pipeline Analysis

One common bottleneck in machine learning pipelines is the input bottleneck, which occurs when the input pipeline cannot generate training examples as quickly as the training computation can consume them. This can cause a data stall and significantly impact performance. To address this issue, the TensorFlow Profiler has added a bottleneck discovery feature that identifies the Iterator object with the highest impact on the critical path of a Dataset object. However, this approach has limitations, as it can only rank Datasets by slowness and cannot predict their effect on performance.
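For concreteness, the input pipelines under discussion are typically expressed with an API like tf.data. The following minimal pipeline (the shard name and parse function are placeholders) makes explicit the parallelism and prefetching knobs that the tuners discussed next try to set automatically:

```python
import tensorflow as tf

def parse_example(record):                   # placeholder decode step
    return tf.io.parse_single_example(
        record, {"x": tf.io.FixedLenFeature([128], tf.float32),
                 "y": tf.io.FixedLenFeature([], tf.int64)})

ds = (tf.data.TFRecordDataset(["train-00000.tfrecord"])   # placeholder shard
        .map(parse_example, num_parallel_calls=8)          # degree of parallelism (tunable)
        .batch(256)
        .prefetch(4))                                       # prefetch buffer size (tunable)
# Passing tf.data.AUTOTUNE for either knob asks tf.data's own tuner to set it at run time.
```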

To improve pipeline performance, tuners that adjust the degree of parallelism have been developed. For instance, `tf.data' applies dynamic optimization of pipeline parameters when users specify AUTOTUNE for supported parameters, such as the degree of parallelism and the prefetch buffer size. This algorithm models each Iterator in a pipeline as an M/M/1/k queue and uses an analytical model to determine the queue's latency. Runtime statistics are then recorded and combined with the analytical model to tune the latency of the pipeline with respect to the performance parameters. However, the algorithm has limitations, such as being hard to understand and extend, and requiring heuristic constraints to prevent the output latency function from being driven to zero. Overall, while several approaches have been developed to address pipeline bottlenecks, challenges remain in understanding and optimizing pipeline performance under different resource constraints.

Proposed Solutions

Plumber adopts a layered approach to performance analysis with the aim of abstracting basic Dataset-level statistics into comparable and extensible costs. The layers are designed to facilitate optimization and analysis of these costs.

Tracing

During the tracing process, various statistics at the Dataset level are collected, including the number of elements processed, CPU time consumption, and the size of each element in bytes. Plumber then saves these statistics into a file together with the serialized pipeline program at regular intervals.
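Plumber's actual tracing interface is not reproduced here; the following hypothetical shim merely illustrates the kind of per-Dataset statistics being collected (elements produced, CPU time, and bytes):

```python
import sys
import time

def traced(iterable, name, stats):
    """Wrap an iterator and record how many elements it yields, how much CPU time
    producing them costs, and roughly how many bytes they occupy."""
    stats[name] = {"elements": 0, "cpu_seconds": 0.0, "bytes": 0}
    it = iter(iterable)
    while True:
        start = time.process_time()
        try:
            item = next(it)
        except StopIteration:
            return
        stats[name]["cpu_seconds"] += time.process_time() - start
        stats[name]["elements"] += 1
        stats[name]["bytes"] += sys.getsizeof(item)
        yield item

stats = {}
for _ in traced((x * x for x in range(10000)), "square_map", stats):
    pass
print(stats["square_map"])
```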

Analysis

In order to identify bottlenecks, Plumber uses analytical modeling to analyze the traced data and convert Dataset-level statistics into comparable cost units. Plumber views the pipeline as a closed system, in which each component operates asynchronously but shares the same resource budget.

Optimizer

Plumber offers support for various optimizations, such as CPU and disk parallelism, caching, and prefetch injection. Achieving the optimal level of CPU and I/O parallelism involves allocating enough resources to ensure that the pipeline remains balanced in terms of throughput capacity, which is done using a Linear Program. Caching helps to reduce the workload by avoiding the need to re-compute data dependencies, and the optimal cache is placed as high in the pipeline as possible to minimize the overall work. Additionally, prefetching is applied proportionally to the idleness in the pipeline under a benchmark workload as a subsequent pass.
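As an illustration of the caching and prefetch placement described above, expressed directly in tf.data rather than through Plumber's automated rewriting (the map functions are stand-ins), a cache is inserted after the expensive deterministic work and before any randomized transformation:

```python
import tensorflow as tf

AUTOTUNE = tf.data.AUTOTUNE

def expensive_decode(x):            # stands in for parsing/decoding (deterministic)
    return tf.cast(x, tf.float32) / 255.0

def random_augment(x):              # stands in for randomized augmentation
    return x + tf.random.uniform(tf.shape(x))

ds = (tf.data.Dataset.range(10000)
        .map(expensive_decode, num_parallel_calls=AUTOTUNE)
        .cache()                                            # avoid recomputing the decode each epoch
        .map(random_augment, num_parallel_calls=AUTOTUNE)   # must stay below the cache
        .batch(256)
        .prefetch(AUTOTUNE))                                # cover idle time at the pipeline's end
```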

To derive an optimized input pipeline configuration, it is necessary to understand how each Dataset's performance is influenced by changes in the fraction of hardware resources allocated to it, such as changing the degree of parallelism of a Dataset or introducing prefetching or caching Datasets that consume additional memory. Three resource types are of interest in input pipeline execution: CPU, disk bandwidth, and memory capacity. To solve for the optimal CPU resource allocation, a Linear Program (LP) formulation is employed first, before moving on to disk and memory capacity.
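Under simplifying assumptions (a linear pipeline and per-core processing rates estimated from tracing), the balanced allocation that such an LP searches for can be sketched directly; the operator names and rates below are invented for illustration:

```python
# Give each op cores in proportion to the work it must do per element, so every
# stage of the pipeline sustains the same throughput and none starves its consumer.
rate_per_core = {"read": 800.0, "decode": 120.0, "augment": 500.0}   # elements/sec/core
total_cores = 16

work = {op: 1.0 / r for op, r in rate_per_core.items()}   # core-seconds per element
scale = total_cores / sum(work.values())
allocation = {op: w * scale for op, w in work.items()}     # cores per op
throughput = min(rate_per_core[op] * cores for op, cores in allocation.items())

print(allocation)   # balanced core allocation
print(throughput)   # resulting end-to-end rate (elements/sec)
```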

Discussion

In this paper, the authors offer a comprehensive approach to analyzing and optimizing machine learning pipelines to identify and address potential bottlenecks. By abstracting basic Dataset-level statistics into costs, Plumber allows for the comparison, optimization, and extension of pipeline components. Through tracing and analysis, bottlenecks can be identified and resolved through optimizing CPU and disk parallelism, caching, and prefetching. These optimizations result in significant improvements in overall run time performance and hardware utilization. Overall, the approach presented in this paper provides a valuable tool for optimizing machine learning pipelines and improving their efficiency.

Model Specific Speed Ups

This section discusses work by Kaler et al. titled ``Accelerating training and inference of graph neural networks with fast sampling and pipelining''. Unlike the previous work seen in this survey, the authors focus on a particular type of machine learning model, namely graph neural networks (GNNs), since these models present their own set of interesting challenges. Much of the methodology again lies in profiling existing workflows, identifying bottlenecks, and fixing them.

Graph neural networks pose interesting challenges when it comes to processing graphs in mini-batches, a common practice in machine learning tasks. This is due to the exponential increase in neighborhood size with respect to the number of network layers, which can lead to a prohibitively large expanded neighborhood. Additionally, the memory consumption of the features and intermediate representations of nodes in the expanded neighborhood can be substantial. This is particularly problematic when using accelerators such as GPUs, as the neighborhood data size can exceed the accelerator's memory capacity. To address this issue, neighborhood sampling has become a popular remedy.

However, the authors identify batch preparation and transfer as major bottlenecks in this process. Batch preparation involves expanding the sampled neighborhood for a mini-batch of nodes and slicing out the feature vectors of all involved nodes. The resulting subgraph and feature vectors must then be transferred to the GPUs, since the entire graph and feature data are often too large to fit in GPU memory. Interestingly, the authors discovered that batch preparation and transfer take substantially longer than the core GNN training operations, such as loss, gradient, and model parameter computation. Therefore, they propose several optimizations to accelerate these operations, which lead to significant improvements in overall run time performance and hardware utilization for training and inference of graph neural networks. Three optimization approaches, implemented in their system SALIENT, are discussed: a fast neighborhood sampler, shared-memory parallelization, and pipelined batch transfer and computation.

Fast Neighborhood Sampling

The authors focus on PyG's GNN implementations and find that among the design choices most impactful to performance is the data structure used for the global-to-local node ID mapping between the input graph and the sampled subgraph. They observe that the space of possible design choices and optimizations is too large to explore manually, and hence design a parameterized implementation of sampled subgraph generation to systematically explore this optimization space and identify the configurations that yield high performance across compute architectures. Analyzing the results shows that the most impactful changes, relative to the baseline PyG code, are related to data structures. Changing the C++ STL hash map and hash set to a flat swiss-table implementation yields a 2× speedup. Using an array instead of a hash table for the set provides a further 17% improvement; despite its linear search complexity, the array set benefits from cache locality.
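The following Python sketch illustrates the global-to-local relabeling step and the two data-structure choices discussed; the authors' implementation lives in C++ inside the sampler, so these functions are simplified stand-ins:

```python
def relabel_with_hash_map(sampled_ids, edges_global):
    """Hash-map mapping: global node ID -> local ID in the sampled subgraph."""
    local = {g: i for i, g in enumerate(sampled_ids)}
    return [(local[u], local[v]) for u, v in edges_global]

def relabel_with_array_set(sampled_ids, edges_global):
    """Array 'set' with linear search: despite O(n) lookups, the small, contiguous
    layout is cache-friendly for the neighborhoods produced by sampling."""
    ids = list(sampled_ids)
    return [(ids.index(u), ids.index(v)) for u, v in edges_global]

sampled = [7, 2, 9, 4]                  # node IDs drawn by the neighborhood sampler
edges = [(7, 2), (9, 4), (2, 4)]        # edges among the sampled nodes (global IDs)
assert relabel_with_hash_map(sampled, edges) == relabel_with_array_set(sampled, edges)
```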

Shared memory parallel batch preparation

SALIENT parallelizes batch preparation through the use of shared-memory multithreading. Shared-memory parallelization has several key advantages over PyTorch’s multi-processing, including lower synchronization overheads and, critically, the ability to perform zero-copy communication with the main training process.

To parallelize batch preparation across mini-batches, SALIENT uses C++ threads which prepare batches end-to-end, each performing sampling and slicing sequentially. Since these threads run C++ code, they are not affected by Python's global interpreter lock. By using serial tensor-slicing code, which is otherwise parallelized in PyTorch by default, SALIENT improves cache locality and avoids contention between threads.

A particularly impactful optimization enabled by shared-memory parallelization is the ability to perform slicing while the main process is blocked on GPU training. A batch preparation thread writes sliced tensors directly into pinned memory accessible by the main process. By comparison, slicing in PyTorch multiprocessing workers would require copying the sliced data from the worker process to the main process via POSIX shared memory, effectively halving the observed memory bandwidth and inhibiting parallel scaling.
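A minimal PyTorch sketch of the pinned-memory slicing pattern described above; tensor shapes and names are illustrative, and SALIENT's actual implementation performs this inside its C++ worker threads:

```python
import torch

features = torch.randn(1_000_000, 128)                 # full feature matrix, CPU resident
batch_nodes = torch.randint(0, 1_000_000, (4096,))     # nodes in the expanded neighborhood

# Reusable pinned (page-locked) staging buffer; slicing writes straight into it,
# avoiding an extra copy before the host-to-device transfer.
pinned = torch.empty(4096, 128, pin_memory=True)
torch.index_select(features, 0, batch_nodes, out=pinned)

if torch.cuda.is_available():
    on_gpu = pinned.to("cuda", non_blocking=True)      # asynchronous copy from pinned memory
```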

Data transfer pipelining

To mitigate data-transfer-related bottlenecks, SALIENT employs two optimizations to minimize data transfer latency and overlap data transfer with GPU computation. Detailed profiling reveals redundant CPU-GPU round trips, which create idle time between data transfers of the sampled subgraph edges. These round trips are attributed to assertions in PyG's sparse tensor library that check the validity of the sparse adjacency matrix after it is transferred. These blocking assertions are unnecessary for data transfers, since the checks have already been performed when the sparse tensor was constructed on the CPU. Adding an option to skip the assertions in such circumstances allows SALIENT to reach 99% of peak data transfer throughput.

SALIENT further increases GPU utilization by overlapping data transfers with GPU training computations. Specifically, SALIENT uses separate GPU streams for computation and data transfer, synchronizing those streams to ensure a training iteration begins after the necessary data is transferred.
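A PyTorch-flavored sketch of this stream-overlap pattern follows; the model and batches are placeholders, not SALIENT's actual code:

```python
import torch

assert torch.cuda.is_available()
device = torch.device("cuda")
copy_stream = torch.cuda.Stream()                       # dedicated stream for transfers

model = torch.nn.Linear(128, 10).to(device)
next_batch_cpu = torch.randn(4096, 128, pin_memory=True)

# Launch the copy of the *next* batch on its own stream...
with torch.cuda.stream(copy_stream):
    next_batch_gpu = next_batch_cpu.to(device, non_blocking=True)

# ...do this iteration's compute on the default stream...
current = torch.randn(4096, 128, device=device)
out = model(current)

# ...then make the default stream wait for the copy before the next iteration uses it.
torch.cuda.current_stream().wait_stream(copy_stream)
next_batch_gpu.record_stream(torch.cuda.current_stream())
out_next = model(next_batch_gpu)
```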

Discussions

The study aimed to pinpoint and resolve bottlenecks within the training pipeline of a specific machine learning model. To achieve this, the authors carried out an in-depth analysis of the pipeline, examining elements such as data structures and assert statements that impeded data transfer. Although the research was limited to a particular type of model, the analysis approach has the potential to be applied to other pipelines.

Cross Cutting Themes

While all three works look to improve machine learning workflows, there are some cross-cutting themes common to all three:

1. **Profiling**
In all three works, profiling plays an essential role in identifying bottlenecks in the pipeline. By profiling the pipeline, the authors can pinpoint which operations are consuming the most resources and causing the most significant delays. This knowledge allows them to focus their optimization efforts on the most critical areas. 

2. **Optimization techniques**
The three works aim to optimize pipeline performance by addressing the bottlenecks identified through profiling. The authors use various techniques, such as parallelism, caching, prefetching, and pipeline redesign, to improve performance. Another source of optimization is the choice of data structures, which can have a significant impact on the pipeline's efficiency. 

3. **Computational Resource Management**
In all three works, the authors focus on managing computational resources effectively to improve pipeline performance. This includes managing CPU and disk parallelism, GPU memory usage, and balancing resource allocation across pipeline stages. An interesting approach to resource management stems from linear programming: two of the three works use a linear programming model to optimally distribute either the data across a given amount of compute, or the compute across a given set of data. 

4. **Machine Learning Model Specificity**
The three works highlight how different machine learning models pose unique challenges in terms of optimizing their training pipelines. The authors must consider the specific features of each model when identifying bottlenecks and devising optimization strategies.

Conclusion

In conclusion, the past decade has seen significant progress in the field of machine learning, with the development of specialized hardware and sophisticated software to support the development and deployment of machine learning workflows at scale. However, the gap between the hardware and software remains significant, hindering the ability to fully utilize the potential of modern hardware.

This survey paper has explored three recent works that seek to reduce the gap between hardware and software, with the goal of improving the utilization of GPUs and TPUs and enhancing overall performance in machine learning workflows. These works have demonstrated that hardware-aware optimization techniques can significantly enhance performance in machine learning workflows while maintaining hardware compatibility. They have also shown that making the hardware ``transparent'' through virtual compute nodes can help alleviate issues that are particular to machine learning workloads. By highlighting the potential for hardware-aware optimization techniques to enhance performance in machine learning workflows, we hope to inspire further exploration and innovation in this exciting field. Ultimately, these developments will enable the development and deployment of even more complex and sophisticated machine learning models that can help solve some of the most challenging problems faced by society today.