List of Papers:

  1. TicTac: Accelerating Distributed Deep Learning with Communication Scheduling
  2. Towards Federated Learning at Scale: System Design
  3. SLIDE: In Defense of Smart Algorithms over Hardware Acceleration for Large-Scale Deep Learning Systems

Introduction

Machine learning has been the driving force behind many modern technological advancements. However, training these models has become challenging with the ever-increasing complexity of models and input data. To overcome this challenge, distributed machine learning has emerged, enabling model training on a large corpus of decentralized data. This approach has paved the way for innovative methods such as Federated Learning, one of the topics I will discuss in this survey.

Along with distributing the data and the model among different processing units in the cloud, the machine learning algorithms themselves should be made as efficient as possible. Motivated by these ideas, I explore several methods that have been used to improve the efficiency and reduce the time taken to train deep learning models. Later, I will investigate how these methods have been applied to inference in recent literature.

The first paper I will examine is "TicTac: Accelerating Distributed Deep Learning with Communication Scheduling." This paper addresses the high variance in iteration time in systems that use a graph representation for computation, such as TensorFlow and PyTorch, which increases iteration time in communication-heavy deep learning workloads. The authors introduce TicTac, a system that addresses this issue by identifying and enforcing an order of network transfers, improving iteration time through prioritization. They implement TicTac on top of TensorFlow and show that it improves throughput by up to 37.7% in inference and 19.2% in training while reducing the straggler effect by up to 2.3x.

The second paper I will examine is "Towards Federated Learning at Scale: System Design." The paper describes a scalable production system for Federated Learning in the domain of mobile devices based on TensorFlow. The authors highlight the high-level design of the system, address some of the challenges and their solutions, and touch upon open problems and future directions in Federated Learning.

Finally, the third paper I will discuss is "SLIDE: In Defense of Smart Algorithms over Hardware Acceleration for Large-Scale Deep Learning Systems." This paper highlights that specialized hardware for model training is expensive and hard to generalize to many tasks. By exploring this paper, I want to cover the importance of algorithm efficiency alongside scaling the programs up and out.

The authors of this paper propose SLIDE (Sub-LInear Deep Learning Engine), which blends intelligent randomized algorithms with multi-core parallelism and workload optimization. SLIDE runs entirely on a CPU and drastically reduces the computations required during training and inference, outperforming an optimized implementation of TensorFlow on the best available GPU. The authors evaluate SLIDE on industry-scale recommendation datasets with large, fully connected architectures and show that training with SLIDE on a 44-core CPU is more than 3.5 times faster than the same network trained using TensorFlow on a Tesla V100 at any given accuracy level. Additionally, on the same CPU hardware, SLIDE is over 10x faster than TensorFlow.

Overall, in this survey, I will explore these three papers in depth, examine the contributions of each, and identify the key takeaways. I will also compare and contrast the approaches taken in each paper and discuss how they fit into the broader effort to scale distributed machine learning and to optimize machine learning algorithms rather than simply adding computing resources such as GPUs and TPUs.

Related Works

Distributed deep learning is a machine learning technique that allows multiple machines to work together to solve complex learning problems. In traditional deep learning, a single machine trains a model on a moderate or large dataset. While small machine learning models can be trained with limited data, larger models such as deep neural networks require substantially more data and computation as the number of parameters increases.

The computational power of a single machine has not kept up with the growing demand for processing training data. To meet this demand, the machine learning workload must be distributed among multiple machines, shifting from a centralized system to a distributed one. In distributed deep learning, the workload is divided among multiple machines that work in parallel to train the model. Each machine processes a portion of the data and communicates with the other machines to share information about the model's progress. By breaking the workload into smaller pieces, distributed deep learning can significantly reduce the time needed to train a model.

TicTac

The paper "TicTac: Accelerating Distributed Deep Learning with Communication Scheduling" aims to improve the performance of distributed deep learning systems. The authors identify a shortcoming in distributed deep learning systems with graph representation for computation, such as Tensor-Flow and PyTorch, that results in a high variance in iteration time. Specifically, the random order of received parameters across workers can lead to significant communication overhead and straggler effects, which can reduce the system's overall throughput.

To address this problem, the authors propose TicTac, a system that improves iteration time in distributed deep learning with Parameter Servers while guaranteeing near-optimal overlap of communication and computation. Operating within this scale-out paradigm, TicTac identifies and enforces an order of network transfers, improving iteration time and decreasing network overhead through prioritization.

Prior to this work, several other studies have tried to optimize the performance of DNNs during training. One approach is to reduce computation time using more efficient algorithms and techniques; for example, low-precision arithmetic or sparse representations can significantly reduce computation time without sacrificing accuracy. Another approach is to reduce communication time by minimizing the amount of data that needs to be transferred between workers. This can be achieved through techniques such as compressing gradients, using decentralized optimization, or reducing the frequency of communication.
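
As a concrete illustration of the gradient-compression idea, the sketch below uniformly quantizes a gradient tensor to 8 bits before transfer and dequantizes it on the receiving side. This is a generic, hedged example rather than the scheme used in any of the surveyed papers; the function names and the NumPy-based setup are assumptions made purely for illustration.

```python
import numpy as np

def quantize_uniform(grad, num_bits=8):
    """Uniformly quantize a float32 gradient tensor to num_bits integer codes.

    Returns the codes plus the (scale, minimum) needed to dequantize. Real
    systems use more elaborate schemes (stochastic rounding, error feedback),
    but the bandwidth saving follows the same principle.
    """
    levels = 2 ** num_bits - 1
    g_min, g_max = float(grad.min()), float(grad.max())
    scale = (g_max - g_min) / levels if g_max > g_min else 1.0
    codes = np.round((grad - g_min) / scale).astype(np.uint8)
    return codes, scale, g_min

def dequantize_uniform(codes, scale, g_min):
    """Reconstruct an approximate float32 gradient from the integer codes."""
    return codes.astype(np.float32) * scale + g_min

# An 8-bit code uses ~4x fewer bytes on the wire than float32,
# at the cost of a small quantization error.
grad = np.random.randn(1024).astype(np.float32)
codes, scale, g_min = quantize_uniform(grad)
approx = dequantize_uniform(codes, scale, g_min)
print("max abs error:", float(np.abs(grad - approx).max()))
```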

Federated Learning

In the next part of this survey, we will focus more on Federated Learning (FL) and introduce a paper in this area. FL is an emerging distributed machine learning paradigm that enables model training on a large corpus of decentralized data. This approach is beneficial in settings where data privacy and security are paramount and where centralizing all of the data is impractical or infeasible due to its large size or geographical distribution.

In the "Towards Federated Learning at Scale: System Design" paper, the authors describe a scalable production system for Federated Learning in the domain of mobile devices based on TensorFlow. The framework proposed by the authors is early-stage work and offers a useful sketch of what can be done in Federated Learning on Android devices. The proposed system is designed to allow thousands or even billions of mobile devices to participate in the Federated Learning process while ensuring that the privacy and security of the data are protected at all times.

The authors first introduce the high-level design of their Federated Learning system, which consists of Android phones and a cloud-based distributed FL server. Devices notify the server that they can perform an FL task for a given population, which is identified by a globally unique name. The server selects a subset of devices to work on a specific FL task and sends them an FL plan, which includes a TensorFlow graph and instructions for executing it. The devices perform local computations based on the global state and their local dataset and send updates as an FL checkpoint back to the server. The server incorporates these updates into its global state, and the process repeats.
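
The device-side part of this round can be sketched as follows. This is a minimal, hypothetical illustration using a toy linear model and NumPy; in the actual system the device executes a TensorFlow graph shipped in the FL plan, and the names used here (client_update and its parameters) are assumptions made only for illustration.

```python
import numpy as np

def client_update(global_weights, local_x, local_y, lr=0.1, epochs=1):
    """One device's contribution to an FL round (hedged, toy sketch).

    The device starts from the server's global weights, runs a few epochs of
    local SGD on data that never leaves the device, and reports back only the
    weight delta together with the number of examples it trained on.
    """
    w = global_weights.copy()
    for _ in range(epochs):
        for x, y in zip(local_x, local_y):
            pred = x @ w                 # toy linear model
            grad = (pred - y) * x        # gradient of squared error
            w -= lr * grad
    return w - global_weights, len(local_x)
```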

SLIDE

Given the discussions on distributed deep learning and privacy-preserving FL and the works presented, it is clear that these methods effectively reduce the time needed to train and run inference on deep learning models, especially as input datasets and model sizes grow. However, although these methods cope well with increasing data sizes and can achieve high accuracy, they typically require expensive specialized hardware for model training.

While algorithmic advances have generally not demonstrated a significant advantage over powerful hardware like NVIDIA V100 GPUs, the "SLIDE: In Defense of Smart Algorithms over Hardware Acceleration for Large-Scale Deep Learning Systems" paper presents an exception. The authors propose a solution called SLIDE, which combines intelligent randomized algorithms, multi-core parallelism, and workload optimization to reduce the computational cost of training and inference on just a CPU. Their evaluations on industry-scale recommendation datasets show that SLIDE is over 3.5 times faster than an optimized implementation of TensorFlow on the best available GPU and over 10x faster than TensorFlow on the same CPU hardware. This work provides a promising alternative to expensive specialized hardware.

The authors begin by discussing the significant interest in Deep Learning (DL) in the research community and its success in improving many applications, particularly image and text classification and speech recognition. However, the exponential growth in model sizes and data volumes has made hardware acceleration essential for DL. While advances in Graphics Processing Units (GPUs) have accelerated the training of DL models, the community is also heavily investing in dedicated hardware to take DL beyond the limits of matrix multiplication. Although designing dedicated hardware is risky, this investment has been seen as justified because of the lack of significant progress in algorithmic alternatives for years.

The paper then turns to algorithmic alternatives to matrix multiplication, noting that earlier attempts to replace it with cheaper operations have shown minimal practical benefit. Instead, the authors exploit adaptive sparsity: during every gradient update, most neurons are skipped and only a small set is selected based on their activations. The proposed algorithm employs Locality Sensitive Hash (LSH) tables to efficiently identify this sparse set of neurons for each update, which makes the gradient updates HOGWILD-style parallel. However, prior work on this idea had yet to demonstrate that the computational advantage translates into a faster implementation when compared directly with hardware-accelerated matrix multiplication. The paper provides the first such implementation for large, fully connected neural networks.

Approach

TicTac

The paper "TicTac: Accelerating Distributed Deep Learning with Communication Scheduling" presents a system called TicTac that aims to improve the performance of distributed deep learning systems. The authors identify a significant issue with graph representation computation systems, such as Tensor-Flow and PyTorch, which causes a high variance in iteration time due to the random order of received parameters across workers. This can result in significant communication overhead and straggler effects, decreasing the system's overall throughput. TicTac solves this problem by enforcing a fixed order of network transfers, resulting in improved iteration time and decreased network overhead through prioritization. The system guarantees near-optimal overlap of communication and computation and is a scale-out paradigm.

A deep learning system in the training phase works in iterations. The time it takes to complete an iteration depends on three factors: (i) computation time, (ii) communication time, and (iii) the degree of overlap between the two. At the start of each iteration, workers receive parameters from the parameter server, but they do not use all of them at once; instead, parameters are consumed according to the dependencies of the computational directed acyclic graph (DAG).
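
A toy calculation makes the role of overlap concrete: with no overlap, an iteration costs roughly the sum of computation and communication time, while with perfect overlap it approaches the maximum of the two. The numbers below are invented purely for illustration.

```python
# Hypothetical per-iteration costs (seconds), chosen only for illustration.
t_compute = 0.80
t_communicate = 0.60

no_overlap = t_compute + t_communicate           # fully serialized: 1.40 s
perfect_overlap = max(t_compute, t_communicate)  # transfers hidden: 0.80 s

print(f"no overlap:      {no_overlap:.2f} s per iteration")
print(f"perfect overlap: {perfect_overlap:.2f} s per iteration")
# A poor transfer order lands somewhere in between, because computation
# stalls whenever a parameter it depends on has not yet arrived.
```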

Identifying a good schedule of parameter transfers is crucial for reducing blockages during computation, which are determined by the DAG dependencies, and thereby improving overlap and iteration time. A particular schedule of transfers over the complete set of parameters in a given model may speed up part of the computation during an iteration while still causing blockages elsewhere. Therefore, different strategies must be explored to find a transfer schedule that reduces blockages and enhances overlap, thereby minimizing iteration time.

Optimizing communication cost is critical because it grows as the system scales out. When aiming for maximum GPU utilization, it is advantageous for communication time to be no longer than computation time, and effective coordination of communication and computation is crucial for achieving high throughput. Several prior approaches have been suggested to improve system performance:

  1. Increasing computation time: The batch size can be increased, which raises the fraction of time spent on computation relative to communication. However, larger batches might reduce accuracy and require additional correction mechanisms, and this approach may not be feasible under resource constraints.

  2. Decreasing communication time: There are various ways to decrease communication time in networks. One is modifying the machine learning algorithm itself to reduce communication cost. Another is reducing the precision of the parameter representation. Changing the network primitives to collective operations (such as all-reduce) or broadcast is another way to reduce network communication.

  3. Smarter interleaving of computation and communication: A more intelligent approach to combining computation and communication has been used in layer-by-layer systems, where the models are sequential and the ordering is easy to obtain. However, these solutions cannot be applied generally to current DAG-based systems such as TensorFlow and PyTorch, and prior work that considers inter-resource dependencies, such as GPU memory in one study and the network in another, is limited to layer-by-layer models. TicTac follows this third direction, enhancing iteration time by achieving a more efficient and predictable overlap of communication and computation in the Parameter Server architecture. Optimization techniques for reducing communication and computation time are orthogonal to TicTac and can be used alongside it.

The authors formulate a scheduling problem whose objective is to find the optimal network transfer schedule: one that minimizes iteration time by improving the communication/computation overlap. The inputs to this optimization problem are the partitioned computational graph, with a resource tag associated with each op, and the characteristics of the underlying platform, represented by a time oracle. TicTac itself is a heuristic that produces near-optimal scheduling results for this problem.

The time oracle predicts the execution time of a given op. The output of the scheduling algorithm is a feasible schedule of ops that minimizes iteration time. The algorithm assigns priority numbers to ops in the partitioned graph and enforces the order of execution, with the op holding the lowest priority number executed next. The authors propose two heuristics to approximate the optimal schedule of network transfers: Timing-Independent Communication scheduling (TIC) and Timing-Aware Communication scheduling (TAC).

The TIC heuristic assigns priorities using only the dependency structure of the computational DAG and ignores op execution times. Parameter transfers are prioritized according to how soon the computation that depends on them appears in the dependency order, so parameters needed by earlier operations are transferred first. Because it does not rely on timing measurements, TIC is simple and remains useful when execution-time estimates are unavailable or unreliable.

The TAC heuristic additionally uses the time oracle's execution-time estimates together with the DAG dependencies. For each pending transfer, it estimates how much computation the transfer would unblock and how soon that computation could start, and it prioritizes the transfers that allow dependent computation to begin earliest. By ordering transfers so that communication is hidden behind computation as much as possible, TAC minimizes the time workers spend blocked on pending parameters, which reduces overall iteration time.
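
To make priority-based transfer scheduling concrete, here is a minimal sketch in the spirit of TIC: each parameter-receive op is prioritized by the depth of the earliest computation that consumes it, using only the DAG structure. The graph encoding, the recv_ naming convention, and the helper function are hypothetical and far simpler than TensorFlow's partitioned graphs; TAC would additionally weight this ordering with the time oracle's execution-time estimates.

```python
from functools import lru_cache

def tic_like_priorities(deps):
    """Assign transfer priorities from DAG structure only (TIC-flavored sketch).

    `deps` maps each op to the ops it depends on. Each parameter-receive op
    (name starting with 'recv_') is prioritized by the depth of the earliest
    computation that consumes it, so transfers feeding early compute ops are
    issued first and downstream work is unblocked as soon as possible.
    """
    @lru_cache(maxsize=None)
    def depth(op):
        # Longest chain of compute ops leading to `op` (recv ops excluded).
        parents = [p for p in deps[op] if not p.startswith("recv_")]
        return 0 if not parents else 1 + max(depth(p) for p in parents)

    earliest_use = {}
    for op, parents in deps.items():
        for p in parents:
            if p.startswith("recv_"):
                earliest_use[p] = min(earliest_use.get(p, depth(op)), depth(op))

    ranked = sorted(earliest_use, key=earliest_use.get)
    return {recv: priority for priority, recv in enumerate(ranked)}

# Toy graph: layer1 consumes recv_w1; layer2 consumes recv_w2 and layer1.
deps = {
    "recv_w1": [], "recv_w2": [],
    "layer1": ["recv_w1"],
    "layer2": ["recv_w2", "layer1"],
}
print(tic_like_priorities(deps))  # {'recv_w1': 0, 'recv_w2': 1}
```

In this toy graph, recv_w1 is transferred first because layer1 can start as soon as its parameter arrives, whereas layer2 must wait for layer1's output anyway.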

Federated Learning

This part of the survey focuses on Federated Learning (FL), a distributed machine learning paradigm used for model training on a large corpus of decentralized data. FL is beneficial in settings where data privacy and security are essential concerns or when centralizing all data is impractical due to its size or geographical distribution. In the paper "Towards Federated Learning at Scale: System Design," the authors propose a scalable production system for Federated Learning on mobile devices based on TensorFlow. The system is designed to enable thousands or even billions of mobile devices to participate in the Federated Learning process while ensuring data privacy and security. The paper presents an early-stage framework for FL on Android devices.

The communication protocol enables devices to advance the global, singleton model of an FL population between rounds, where each round consists of three phases: selection, configuration, and reporting. In the selection phase, devices that meet eligibility criteria, such as being charged and connected to an unmetered network, check into the server by opening a bidirectional stream, and the server selects a subset of them based on goals such as the optimal number of participating devices. In the configuration phase, the server configures itself for the chosen aggregation mechanism and sends each selected device an FL plan and an FL checkpoint containing the global model. In the reporting phase, the server waits for the participating devices to report updates; as updates are received, it aggregates them using Federated Averaging and instructs the reporting devices when to reconnect. If enough devices report in time, the server updates its global model; otherwise, the round is abandoned.
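
The reporting-phase aggregation can be sketched as follows: updates from devices that reported in time are combined with Federated Averaging, weighting each update by the number of local examples it was computed on. This is a minimal illustration that assumes NumPy weight vectors and the (weight_delta, num_examples) pairs produced by the hypothetical client_update shown earlier; it is not the production TensorFlow-based implementation.

```python
import numpy as np

def federated_averaging_round(global_weights, reported_updates, min_reports=2):
    """Aggregate one FL round (hedged sketch of the reporting phase).

    `reported_updates` is a list of (weight_delta, num_examples) pairs from
    devices that reported back in time. If too few devices report, the round
    is abandoned and the global model is left unchanged, mirroring the
    behaviour described in the paper.
    """
    if len(reported_updates) < min_reports:
        return global_weights, False               # round abandoned

    total_examples = sum(n for _, n in reported_updates)
    weighted_delta = sum(delta * (n / total_examples)
                         for delta, n in reported_updates)
    return global_weights + weighted_delta, True   # round committed
```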

The protocol has a certain tolerance for straggling devices that do not report back in time or do not react to configuration by the server; this tolerance is configurable per FL task. The selection and reporting phases are specified by a set of parameters that define flexible time windows. Pace steering is a flow-control mechanism that regulates the pattern of device connections, enabling the FL server to scale down to small FL populations and up to very large ones. Pace steering suggests to each device the optimal time window in which to reconnect. It takes into account the diurnal oscillation in the number of active devices, avoiding excessive activity during peak hours without hurting FL performance during other times of the day.
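
The sketch below conveys the flavor of such flow control, not the production algorithm: a hypothetical server-side helper spreads device reconnections over a wider window when the active population is large relative to the number of devices needed per round. All names and parameters here are assumptions made for illustration.

```python
import random

def suggest_reconnect_delay(active_devices, devices_per_round, round_period_s=300):
    """Hypothetical pace-steering helper (not the production algorithm).

    When far more devices are active than a round needs, check-ins are spread
    over a proportionally longer window; each device then picks a random point
    in that window, which flattens spikes such as the diurnal peak.
    """
    rounds_to_spread_over = max(1, active_devices // devices_per_round)
    window_s = rounds_to_spread_over * round_period_s
    return random.uniform(0, window_s)

# Example: 1,000,000 active devices but only 10,000 needed per 5-minute round
# suggests reconnecting at a random time within the next ~8 hours.
print(suggest_reconnect_delay(1_000_000, 10_000))
```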

The authors then describe some of the challenges they faced in building this FL system and the solutions they developed to overcome them. One of the biggest challenges was ensuring the privacy and security of the data, which is a critical concern when working with sensitive personal information. The authors implemented several privacy-preserving techniques to address this challenge, including differential privacy, secure aggregation, and client-side encryption. These techniques help protect the data's privacy while still allowing the clients to contribute their updates to the global model.
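
As a rough illustration of one of these privacy-preserving ideas, the toy aggregator below clips each device's update and adds Gaussian noise to the average, in the spirit of differentially private federated averaging. It is a hedged sketch rather than the exact mechanism used in the paper, and the function name and parameters are assumptions.

```python
import numpy as np

def dp_noised_average(updates, clip_norm=1.0, noise_multiplier=1.0, rng=None):
    """Toy differentially private aggregation (illustrative only).

    Each device's update is clipped to a maximum L2 norm, and Gaussian noise
    proportional to `noise_multiplier * clip_norm` is added to the average,
    so no single contribution dominates or is exactly recoverable.
    """
    rng = rng or np.random.default_rng()
    clipped = []
    for u in updates:
        norm = np.linalg.norm(u)
        clipped.append(u * min(1.0, clip_norm / norm) if norm > 0 else u)
    avg = np.mean(clipped, axis=0)
    noise = rng.normal(0.0, noise_multiplier * clip_norm / len(updates),
                       size=avg.shape)
    return avg + noise
```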

Another major challenge was ensuring that the Federated Learning system could scale to handle the large number of mobile devices that could participate in the training process. The authors developed several optimizations and improvements to the TensorFlow framework to address this challenge. The optimizations include a dynamic batching algorithm that adjusts the batch size based on the available computational resources on each client device. Additionally, they developed a parallelized update process that allows multiple clients to update the global model simultaneously.

In addition to these technical challenges, the authors also discuss some open problems and future directions for Federated Learning at scale. One of the biggest open problems is handling heterogeneous data sources, such as different types of mobile devices or different data formats. The authors suggest that this could be addressed through transfer learning techniques, which allow models to be trained on one set of data and then adapted to another. Another future direction is the development of more sophisticated scheduling algorithms, which could improve the efficiency and scalability of Federated Learning even further.

SLIDE

The paper "SLIDE: In Defense of Smart Algorithms over Hardware Acceleration for Large-Scale Deep Learning Systems" proposes a new solution called SLIDE that combines smart randomized algorithms, multi-core parallelism, and workload optimization to reduce the computational cost of training and inference on just a CPU. The authors evaluate the performance of SLIDE on industry-scale recommendation datasets and show that it outperforms an optimized implementation of TensorFlow on the best available GPU by over 3.5 times and over ten times on the same CPU hardware. This work presents a promising alternative to expensive specialized hardware.

The authors make four main contributions. First, they introduce a new system called SLIDE that uses C++ and OpenMP with LSH-based sparsification to train fully connected neural networks faster on a standard CPU than on a powerful V100 GPU. Second, they make novel algorithmic and data-structural choices that minimize computational overhead and keep update conflicts negligible, creating ideal conditions for asynchronous SGD. Third, they thoroughly evaluate the system on two large benchmark datasets, showing that SLIDE can be up to 2.7x faster than the best available alternative at any accuracy. Finally, they perform a CPU-efficiency analysis of SLIDE and suggest optimizations that further speed it up by roughly 1.3x, making it 10x faster than TF-CPU.

The paper builds on Locality Sensitive Hashing (LSH) and adaptive dropout in neural networks. LSH is a technique for approximate nearest-neighbor search that uses a family of functions to hash similar input objects to the same bucket. The LSH algorithm uses two parameters, (K, L): L independent hash tables are constructed, each with a meta-hash function that concatenates K random, independent hash functions. Candidate generation works in two phases, a pre-processing phase and a query phase. The pre-processing phase constructs the L hash tables from the data; in the query phase, given a query, its near neighbors are found by taking the union of all the buckets collected from the L hash tables. Finally, the nearest neighbor is identified by comparing the distance between each item in the candidate set and the query.
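
A compact sketch of this (K, L) candidate-generation scheme is shown below, using signed random projections (SimHash) as the hash family. The class name, the choice of hash family, and the NumPy-based implementation are assumptions for illustration; SLIDE itself supports other LSH families and uses a highly optimized C++ implementation.

```python
import numpy as np
from collections import defaultdict

class SimHashLSH:
    """(K, L)-parameterized LSH index, sketched with signed random projections.

    L tables are built; each is keyed by a meta-hash that concatenates K
    sign-of-projection bits. Similar vectors tend to land in the same bucket,
    so the union of matching buckets forms a small candidate set.
    """

    def __init__(self, dim, K=6, L=8, seed=0):
        rng = np.random.default_rng(seed)
        self.projections = rng.standard_normal((L, K, dim))  # one (K x dim) per table
        self.tables = [defaultdict(list) for _ in range(L)]

    def _keys(self, x):
        # K sign bits per table, packed into a tuple used as the bucket key.
        bits = np.einsum('lkd,d->lk', self.projections, x) > 0
        return [tuple(row) for row in bits]

    def index(self, item_id, x):
        """Pre-processing phase: insert vector x into all L tables."""
        for table, key in zip(self.tables, self._keys(x)):
            table[key].append(item_id)

    def query(self, x):
        """Query phase: return the union of the L matching buckets."""
        candidates = set()
        for table, key in zip(self.tables, self._keys(x)):
            candidates.update(table[key])
        return candidates
```

In SLIDE, the indexed vectors are the weight vectors of a layer's neurons and the query is that layer's input, so the returned candidate set is precisely the small group of neurons expected to have large activations.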

The authors then introduce the SLIDE algorithm in detail, covering initialization, feed-forward, and backpropagation steps, and they examine different sampling methods for reducing sampling overhead. The key idea is to use LSH to adaptively sample neurons, and three strategies for sampling neurons with large activations are presented: Vanilla Sampling, TopK Sampling, and Hard Thresholding. In Vanilla Sampling, after computing the hash codes of the input, a random table is chosen and neurons are retrieved from the corresponding bucket; this continues until the desired number of neurons is retrieved or all tables have been looked up. In TopK Sampling, neurons are retrieved from the corresponding bucket in each hash table, their frequencies are aggregated across all hash tables, and the neurons with the highest frequencies are selected.
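
The first two strategies can be sketched roughly as follows, reusing the hypothetical SimHashLSH index from the previous example and treating neuron ids as the indexed items. As before, the function names and the budget parameter are assumptions made for illustration.

```python
import random
from collections import Counter

def vanilla_sampling(lsh, x, budget):
    """Probe tables in random order until `budget` neuron ids are collected."""
    keys = lsh._keys(x)
    order = list(range(len(lsh.tables)))
    random.shuffle(order)
    active = set()
    for t in order:
        active.update(lsh.tables[t][keys[t]])
        if len(active) >= budget:
            break
    return active

def topk_sampling(lsh, x, budget):
    """Aggregate bucket frequencies across all tables and keep the top `budget`."""
    keys = lsh._keys(x)
    counts = Counter()
    for table, key in zip(lsh.tables, keys):
        counts.update(table[key])
    return {neuron for neuron, _ in counts.most_common(budget)}
```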

The authors have done empirical evaluations of SLIDE's performance on various fronts. They compared SLIDE's performance with TF-GPU and TF-CPU, evaluated adaptive sampling against sampled softmax, and checked its scalability against TF-CPU with CPU core count. They also studied the effect of batch size and the benefits of design choices. As a result, they established several CPU optimizations that improved SLIDE's performance by approximately 30%.

Using different datasets, baselines, and architectures, they showed that SLIDE reaches a given accuracy faster on a CPU than TF-GPU does on a V100 and TF-CPU does on the same CPU; in their plots, the curve for SLIDE stays consistently ahead of the curves for TF-GPU and TF-CPU. The advantage of the GPU over the CPU was not always noticeable because of the sparsity of the datasets.

SLIDE outperforms TF-GPU by 1.8 times on Delicious-200K and by 2.7 times on the larger Amazon-670K dataset. The computational benefit of SLIDE comes from sampling a small subset of active neurons in the output layer, fewer than 0.5% of them. After a few iterations, the average number of neurons sampled in the output layer is around 1,000 for Delicious-200K and around 3,000 for Amazon-670K. Even with such a small subset, SLIDE outperforms TF-GPU and TF-CPU with AVX2 instructions. With OpenMP parallelism and intelligent randomized algorithms, SLIDE is 8x faster than TF-CPU, which is striking given that the prototype contains no rigorous low-level optimizations.

In extreme classification tasks, computing logits for all classes during training in TensorFlow can become expensive. This has led to sampling-based methods that shortlist a candidate set of classes for each batch of training data, significantly reducing the number of logits computed. When comparing TensorFlow's optimized implementation of sampled softmax to SLIDE, which uses a dynamically changing sampling probability via input-specific hash tables, SLIDE outperforms sampled softmax in both time and iteration-wise performance. This is because the static sampling strategy of sampled softmax leads to poor accuracy even when it uses significantly more neurons than SLIDE.

In summary, SLIDE is 1.8 times faster than TF-GPU on Delicious-200K and 2.7 times faster on the larger Amazon-670K dataset, and most of its computational benefit comes from sampling fewer than 0.5% of the neurons in the output layer. Even though the prototype lacks rigorous low-level optimization, SLIDE outperforms both baselines by combining intelligent randomized algorithms with OpenMP parallelism.

Conclusion

In conclusion, the paper emphasizes the significance of communication scheduling in distributed deep learning systems and proposes two heuristics for efficient scheduling. The authors show that significant improvements in iteration throughput can be achieved through communication scheduling, leading to significant savings in compute power for DNN training that runs for days to weeks. The paper calls for further research in network scheduling for parameter servers and other transfer patterns, as well as exploring additional metrics such as network congestion for better network performance. The initial results also motivate extending the scheduling to additional resources such as memory and storage. Overall, the paper presents an essential contribution to distributed deep learning and highlights the potential benefits of communication scheduling for improving the efficiency of large-scale training.

However, because TicTac takes a DAG-based approach to scheduling computation and data movement, the idea can be reapplied in other settings. For example, combining Federated Learning with TicTac-style scheduling could optimize the timing of communication in FL training as well.

In the second paper, "Towards Federated Learning at Scale: System Design," the authors demonstrated the feasibility and scalability of the protocol by implementing it in a mobile app that enabled users to train models on their data without sharing it with the app developer. The app allowed users to select a model and a dataset and to train the model locally or collaboratively with other users. The authors evaluated the app's performance on several datasets and showed that it achieved comparable or better accuracy than centralized training while preserving user privacy. They also analyzed the app's scalability by simulating the behavior of thousands of users and showed that it could handle large FL populations with low latency and high throughput. The authors concluded that the proposed protocol was a promising approach to FL on mobile devices and opened up new opportunities for privacy-preserving machine learning applications.

Effectively implementing Federated Learning can be challenging for several reasons. Interruptions caused by users' phones or lost connections may disrupt training, and the server may experience a heavy workload, adding to the complexity of the process. The paper highlights these challenges, which must be addressed to ensure the success of Federated Learning.

For the third paper, the authors present SLIDE, an intelligent algorithm that, using modest CPU-based OpenMP parallelism, outperforms an NVIDIA V100, one of the best available hardware options for training sizeable deep learning architectures. SLIDE uses carefully tailored randomized hashing algorithms and suitable data structures for asynchronous parallelism. The system shows up to a 3.5x gain over TF-GPU and a 10x gain over TF-CPU in training time at similar accuracy on popular extreme classification datasets. The authors plan to extend SLIDE to include convolutional layers, and they anticipate that a distributed implementation of SLIDE would be very appealing because of its unique benefits in random memory access and parallelism with minimal communication cost.

In the end, although this paper establishes a compelling approach to using CPUs, it requires many CPU cores, which makes the contribution hard to realize, or its benefit marginal, on conventional CPU systems. Additionally, the approach currently applies only to fully connected (linear) layers and has not been extended to convolutional layers, which are widely used in computer vision applications.