Survey on Optimizing Deep Learning

Philip Yao

Introduction

This paper surveys four papers on deep learning systems: Paper 1, "Exploring the Limits of Concurrency in ML Training on Google TPUs"; Paper 2, "Bandana: Using Non-Volatile Memory for Storing Deep Learning Models"; Paper 3, "Towards Federated Learning at Scale: System Design"; and Paper 4, "TicTac: Accelerating Distributed Deep Learning with Communication Scheduling." Papers 1, 3, and 4 deal with distributed deep learning, while Paper 2 proposes a novel storage system built on a new memory technology. As deep learning models grow ever larger, robust machine learning systems that increase computing efficiency, speed, and automation become imperative. Anyone who has worked with distributed training knows that reliable, robust infrastructure is critical because errors can be very hard to diagnose, and errors in distributed deep learning are even more difficult to debug.

State-of-the-art deep learning language models like ChatGPT and Bard are trained for months on hundreds to thousands of GPUs. These SOTA ML models are known as Large Language Models (LLMs). Aptly named, these massive neural networks have billions of parameters tuned on enormous quantities of text. It is not possible to train a state-of-the-art model quickly or efficiently on a single GPU, so distributed training is a necessity. For example, the LaMDA family of LLMs, developed by Google and powering Bard's backend, has instances that reach 137 billion parameters and is pre-trained on 1.56 trillion words. Pre-training took 57.7 days on 1024 TPU-v3 chips.

EXPLORING THE LIMITS OF CONCURRENCY IN ML TRAINING ON GOOGLE TPUS

Paper 1 uses several techniques to scale training on their multi-pod system. The first is model parallelism with XLA's Single Program Multiple Data (SPMD) partitioner, applied when data parallelism alone is not enough. In SPMD, every device runs the same program but operates on a different shard of the data. XLA's SPMD partitioner can shard model graphs, the spatial dimensions of images, and feature dimensions.
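
As a rough illustration of the SPMD idea, the sketch below simulates devices with a Python list and NumPy arrays; the names, shapes, and the final concatenation (standing in for the collectives XLA would insert) are illustrative assumptions, not details from the paper.

    import numpy as np

    # Simulate 4 "devices"; XLA's SPMD partitioner targets real TPU cores.
    NUM_DEVICES = 4

    def layer(x_shard, w_shard):
        # The *same* program runs on every device, but each device sees only
        # its shard of the feature dimension (model parallelism).
        return np.maximum(x_shard @ w_shard, 0.0)  # ReLU(x W) on one shard

    x = np.random.randn(8, 16)          # batch of 8, 16 input features
    w = np.random.randn(16, 32)         # weight matrix, 32 output features

    # Partition the weight's output dimension across the devices.
    w_shards = np.split(w, NUM_DEVICES, axis=1)

    # Each device executes the identical program on its own shard.
    out_shards = [layer(x, w_shard) for w_shard in w_shards]

    # XLA would insert collective ops here; concatenation stands in for them.
    out = np.concatenate(out_shards, axis=1)
    print(out.shape)  # (8, 32)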

Another optimization is weight update sharding. Sharding means dividing data into multiple pieces that are stored or processed on different devices. Here, a global reduce-scatter leaves each replica with a shard of the summed gradients, which that replica uses to compute its shard of the updated weights; the new weights are then broadcast to all replicas. The authors evaluate their use of TPU pods on the MLPerf benchmarks. MLPerf is a benchmark suite maintained by a group of AI practitioners spanning industry and academia, and its benchmarks evaluate systems by training a fixed model on a fixed dataset.
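
A simplified simulation of weight update sharding under plain SGD is sketched below; NumPy arrays stand in for TPU cores and collectives, and the learning rate and sizes are made up.

    import numpy as np

    NUM_REPLICAS = 4
    LR = 0.1

    weights = np.random.randn(8)                               # replicated on every core
    grads = [np.random.randn(8) for _ in range(NUM_REPLICAS)]  # one gradient per replica

    # Reduce-scatter: each replica ends up with one shard of the *summed* gradient.
    summed = np.sum(grads, axis=0)
    grad_shards = np.split(summed, NUM_REPLICAS)

    # Each replica updates only its own shard of the weights (weight update sharding).
    weight_shards = np.split(weights, NUM_REPLICAS)
    new_shards = [w - LR * g for w, g in zip(weight_shards, grad_shards)]

    # All-gather: the updated shards are broadcast so every replica has full weights.
    new_weights = np.concatenate(new_shards)
    print(new_weights.shape)  # (8,)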

TOWARDS FEDERATED LEARNING AT SCALE: SYSTEM DESIGN

Paper 3, "Towards Federated Learning at Scale: System Design," implements a federated learning system that addresses several problems related to edge device availability, poor connectivity, inconsistent execution, and limited storage and compute resources. Although various federated learning frameworks like OpenFL and IBM Federated Learning already exist, few work well with current approaches in the Google ecosystem. The system they designed consists of an FL server and many Android phone devices.

There are several phases in their communication protocol. Selection is the first phase: a subset of devices connects to the server. These devices are chosen by the server and typically must satisfy a few criteria, such as currently being charged. Not all connecting devices are selected, for various reasons (e.g., the server wants a specific optimal number of devices).

The second phase is configuration. The server configures itself and dispatches instructions and state to the edge devices. The final phase is reporting. The server receives updates and updates its model if the round was a success; if too few devices report, the round is considered a failure and no update occurs. The number of missing reports the server will tolerate is configurable. The server also tells the edge devices when to reconnect.
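
A minimal sketch of one such round is given below; the eligibility criteria, thresholds, and dropout rate are purely illustrative, and the real protocol runs over a network with Android clients rather than in a single process.

    import random

    def run_round(devices, target=100, min_reports=80):
        # Selection: pick devices that satisfy the eligibility criteria.
        eligible = [d for d in devices if d["charging"] and d["on_wifi"]]
        selected = random.sample(eligible, min(target, len(eligible)))

        # Configuration: send the plan (model + instructions) to each device.
        updates = []
        for d in selected:
            if random.random() < 0.9:            # some devices drop out
                updates.append(d["id"])          # stand-in for a model update

        # Reporting: commit the round only if enough devices reported back.
        if len(updates) >= min_reports:
            return "success", updates
        return "failure", []

    devices = [{"id": i, "charging": True, "on_wifi": True} for i in range(500)]
    status, updates = run_round(devices)
    print(status, len(updates))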

In production, the federated learning server needs to handle hundreds of millions of devices, so the authors center its design around the Actor Programming Model. In this way, actors can be temporarily scaled up across various data centers when needed. There are several kinds of actors in their architecture. One is the Coordinator, whose job is to synchronize and advance rounds. There are multiple Coordinators, each dedicated to a particular population of devices centered around an FL task. If a Coordinator actor fails, a Selector restarts it.

Selectors are actors that receive information from a Coordinator in order to decide which devices are forwarded to Aggregators. Separating Selectors from Coordinators allows Selector actors to be physically placed closer to devices. The Master Aggregator manages the Aggregator actors, which perform the actual aggregation work. If an Aggregator or Selector fails, only the devices connected to that actor are lost, so failures stay contained and the system remains fault tolerant. An important mechanism Paper 3 discusses is pace steering, which balances the number of connecting devices: if too many devices connect, the server can become overloaded, and if too few connect, no update happens.
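
Pace steering can be approximated by having the server hand each device a randomized reconnect delay sized to the desired connection rate; the toy sketch below is only an assumption about how such a window might be computed, not the production mechanism described in Paper 3.

    import random

    def reconnect_window(population_size, desired_per_round, round_secs=300):
        # Spread reconnections so that, on average, only desired_per_round
        # devices check in during any one round window.
        rounds_between_checkins = population_size / desired_per_round
        mean_wait = rounds_between_checkins * round_secs
        # Jitter the wait so devices do not reconnect in synchronized bursts.
        return random.uniform(0.5 * mean_wait, 1.5 * mean_wait)

    # 10 million devices, but the server only wants ~10,000 per 5-minute round.
    print(round(reconnect_window(10_000_000, 10_000)), "seconds until reconnect")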

On the analytics side, the authors emphasize the importance of tracking statistics on the edge devices, such as battery and computation usage, which can degrade phone performance. They log these attributes and events and present the data in dashboards; logged attributes include the number of devices accepted and rejected, upload/download data throughput, errors, and so on. These analytics are important to ensure that federated learning does not negatively impact the end user.

Some issues Paper 3 tackles are that training examples are not inspectable since they reside on edge devices, that models cannot be run interactively, and that model attributes need to be verified by the infrastructure. The authors address these issues by supplying tools for model engineers. The first set of tools is a library for building and testing FL tasks. These tools let engineers model and simulate tasks by providing functions that map input data to output metrics; the input is test data during development and edge device data in production.

Regarding plan generation, Paper 3 notes that in ordinary datacenter training the plan would simply be a Python program. Here, the authors generate two FL plans: one for the device, containing the TensorFlow graph, and one for the server, containing the aggregation logic. Regarding versioning, TensorFlow runtimes and graphs can be rebuilt as necessary in a normal environment, but in the federated setting devices may be running very old versions of the TensorFlow runtime. The FL infrastructure handles this by "generating versioned FL plans for each task".

To evaluate the performance of their system design, the authors conducted experiments on a large-scale dataset with thousands of clients and a complex neural network model. They compared their system with two baseline approaches: a centralized learning approach that trains the model on a single server, and a naive federated learning approach that aggregates the model updates from all clients without selection or optimization. The results show that their system outperforms both baselines in terms of model accuracy, convergence speed, and communication efficiency. Moreover, their system is more robust to network heterogeneity, client availability, and data distribution. The authors also demonstrate the effectiveness of their client selection and communication optimization techniques in improving the performance and privacy of the system.

BANDANA: USING NON-VOLATILE MEMORY FOR STORING DEEP LEARNING MODELS

In Paper 2, Bandana, the authors address the storage of deep learning models. Model features, also known as embeddings, are kept in memory at large web companies like Facebook so that they can be accessed in real time to compute recommendations. For example, Facebook's post recommendation system aims to surface relevant content to users; these recommendations must be computed quickly, so the relevant embeddings are stored in memory. Given the number of users active on Facebook at any moment, the amount of RAM consumed is significant.

These embedding vectors are stored in tables in which each id maps to one embedding vector. Facebook has two types of embeddings: post embeddings and user embeddings. The average number of vector lookups per request varies across tables. Embedding tables are a critical component of recommender systems: they map users and items to low-dimensional vectors, which are then used to calculate recommendations. The authors show that these tables can be several gigabytes in size, making it difficult to store them entirely in DRAM.
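
In code, such a table is essentially a mapping from ids to fixed-length vectors. The sketch below uses made-up sizes, and a dot product stands in for the real ranking model.

    import numpy as np

    DIM = 64  # user embeddings are on the order of 64-128 bytes

    # One table per feature type; each id maps to one embedding vector.
    user_table = {uid: np.random.rand(DIM).astype(np.float32) for uid in range(1000)}
    post_table = {pid: np.random.rand(DIM).astype(np.float32) for pid in range(5000)}

    def score(uid, pid):
        # A ranking request looks up one user vector and many post vectors;
        # a dot product stands in for the real ranking model.
        return float(user_table[uid] @ post_table[pid])

    candidate_posts = [10, 42, 97, 512]
    ranked = sorted(candidate_posts, key=lambda pid: score(7, pid), reverse=True)
    print(ranked)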

The authors propose storing deep learning models in non-volatile memory (NVM) rather than dynamic RAM (DRAM). DRAM is a volatile memory in which bits are stored in capacitors that are either charged or not; it is volatile because the capacitors gradually lose their charge. The main issue with DRAM is its price: NVM is roughly an order of magnitude cheaper per bit. The big drawback of NVM, though, is its limited bandwidth, even though its latency is typically acceptable.

They mention that NVM has been proposed as a substitute for DRAM in various contexts, such as databases and file systems. They reference MyNVM, a SQL database that uses block-level NVM as a second-level cache for flash and a lower-cost substitute for DRAM. Other databases, such as CDDS, Echo, FPTree, and HiKV, simulate NVM in its byte-addressable form. There have also been several prior projects that use NVM in its byte-addressable form for file systems, including NOVA, Fortis, LAWN, and ByVFS.

Non-volatile memory is a new technology that performs better than flash but worse than DRAM. Notably, NVM devices can only endure roughly 30 full writes per day before errors start occurring. Facebook embedding vectors are typically updated only 10-20 times a day, so NVM is suitable for this application.

The authors developed Bandana as a solution. Fundamentally, it is an NVM storage system that optimizes for NVM's limited bandwidth and uses a DRAM cache in front of the NVM to serve embedding vectors. The design rests on the observation that only a small portion of the embedding tables is accessed frequently. The authors find that the optimal read granularity is 4 KB, but user embeddings are typically only 64-128 B, so Bandana groups embedding vectors into blocks according to how likely they are to be accessed together. They find that the Social Hash Partitioner (SHP), a hypergraph partitioning algorithm, is the best way to partition vectors into blocks. By treating embedding vectors as graph nodes and their co-access relationships as edges, the problem of maximizing the colocation of related vectors becomes a graph partitioning problem.
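
The effect of this grouping can be illustrated with a much simpler greedy stand-in for SHP: packing vectors that co-occurred in past requests into the same 4 KB block. The request trace and greedy rule below are assumptions for illustration only; the paper's actual algorithm is hypergraph partitioning.

    VECTOR_BYTES = 128
    BLOCK_BYTES = 4096
    VECTORS_PER_BLOCK = BLOCK_BYTES // VECTOR_BYTES  # 32 vectors per NVM block

    # Past requests: each request touches a set of embedding ids.
    requests = [
        {1, 2, 3, 4}, {1, 2, 5}, {3, 4, 6}, {7, 8}, {1, 5, 9}, {7, 8, 9},
    ]

    # Count how often each pair of ids was read together.
    cooc = {}
    for req in requests:
        for a in req:
            for b in req:
                if a != b:
                    cooc[(a, b)] = cooc.get((a, b), 0) + 1

    # Greedy packing: seed each block with an id, then pull in the ids that
    # co-occurred with it most often, until the block is full.
    unassigned = {i for req in requests for i in req}
    blocks = []
    while unassigned:
        seed = unassigned.pop()
        block = [seed]
        neighbors = sorted(unassigned, key=lambda j: cooc.get((seed, j), 0), reverse=True)
        for j in neighbors[:VECTORS_PER_BLOCK - 1]:
            block.append(j)
            unassigned.discard(j)
        blocks.append(block)

    print(blocks)  # vectors that tend to be read together share a block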

To determine which vectors to keep in DRAM, Bandana uses a Least Recently Used (LRU) queue. A vector from a prefetched block is only inserted into the queue if it has been accessed at least t times in the past. The system runs miniature caches that simulate the hit rate curve for different values of t for each embedding table, with very low overhead, and based on these simulations Bandana picks the optimal threshold for each table. The authors demonstrate that Bandana significantly improves the effective bandwidth of NVM, enabling it to serve as a primary storage medium for embeddings.
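
A minimal sketch of the admission idea follows, using Python's OrderedDict as an LRU and a simplified hit-rate simulation; the trace, cache size, and threshold range are illustrative assumptions, not values from the paper.

    from collections import OrderedDict

    class LRUCache:
        def __init__(self, capacity):
            self.capacity = capacity
            self.data = OrderedDict()
            self.hits = self.misses = 0

        def access(self, key):
            if key in self.data:
                self.data.move_to_end(key)   # refresh recency on a hit
                self.hits += 1
                return True
            self.misses += 1
            return False

        def insert(self, key):
            self.data[key] = True
            if len(self.data) > self.capacity:
                self.data.popitem(last=False)  # evict least recently used

    def simulate(trace, capacity, t):
        # Admit a prefetched vector only after it has been accessed t times.
        cache, counts = LRUCache(capacity), {}
        for key in trace:
            if not cache.access(key):
                counts[key] = counts.get(key, 0) + 1
                if counts[key] >= t:
                    cache.insert(key)
        return cache.hits / len(trace)

    trace = [1, 2, 1, 3, 1, 2, 4, 1, 2, 5, 1, 2, 3, 1] * 50
    # Miniature caches: try several thresholds and keep the best-performing one.
    best_t = max(range(1, 4), key=lambda t: simulate(trace, capacity=3, t=t))
    print("chosen admission threshold:", best_t)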

The contributions of this paper are three-fold: (1) Bandana is the first system that leverages NVM to store deep learning models and one of the first published systems to use NVM in large-scale production workloads; (2) it uses hypergraph partitioning to determine which vectors are likely to be accessed together in the future; and (3) it applies recent techniques from key-value caching to run lightweight simulations of dozens of miniature caches to determine how aggressively to cache prefetched vectors for different cache sizes.

The authors of Paper 2 also analyze how their approach would behave in production. They studied a production workload containing over one billion embedding vector lookups, representing one hour of traffic for a single model. Post lookups are far more common, comprising about 95% of all embedding reads, because multiple posts are ranked for a single user; user embeddings, however, contain more features and consume about 75% of total DRAM capacity. User embeddings are frequently accessed for tasks like ranking posts, and each table typically represents a different class of user behavior.

The authors also analyze the performance of Bandana in different scenarios. They show that Bandana performs well when the workload consists of random reads, which is common in recommender systems. They also show that Bandana performs well when the workload consists of sequential reads and writes, which is common in deep learning training.

Bandana also investigated storing related vectors together based on semantic partitioning and k-means clustering. Semantic partitioning assumes two vectors are similar if they are close in Euclidean space, and k-means clusters vectors into k sets by Euclidean distance.

TICTAC: ACCELERATING DISTRIBUTED DEEP LEARNING WITH COMMUNICATION SCHEDULING

In Paper 4, the authors reduce both training and inference time for deep learning models in distributed settings. They accomplish this by identifying inefficiencies in the communication scheduling of these systems and fixing them by overlapping communication with computation. If an operation is blocked waiting on data, the transfers are ordered so that the data it needs arrives as early as possible: for example, if op1 must execute before op2, the parameters required by op1 are sent before those required by op2.
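
The scheduling idea can be sketched as a toy priority ordering of parameter transfers by the earliest step that consumes each tensor; the tensor names and step numbers below are made up, and this is not the paper's actual heuristic, which is more refined.

    # Earliest computation step that consumes each parameter (illustrative values).
    first_use_step = {
        "conv1/W": 0,
        "conv2/W": 1,
        "fc1/W": 5,
        "fc2/W": 6,
    }

    # Naive ordering: transfer in whatever order the graph happens to list them.
    naive_order = ["fc2/W", "conv2/W", "fc1/W", "conv1/W"]

    # Scheduled ordering: transfer the parameters needed earliest first, so the
    # first layers are never blocked waiting on data they need immediately.
    scheduled_order = sorted(first_use_step, key=first_use_step.get)

    print("naive:    ", naive_order)
    print("scheduled:", scheduled_order)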

Background

TensorFlow and Google infrastructure users are likely familiar with Tensor Processing Units (TPUs) and TPU pods. A TPU is a Google-built hardware accelerator similar to a GPU. GPUs were historically designed for graphics workloads, whereas TPUs are optimized for machine learning; they are faster than standard GPUs not specialized for deep learning and generally more energy efficient. A TPU pod is a collection of TPUs across which training or evaluation can be distributed, so TPU pods are a common tool for modern distributed deep learning, especially for TensorFlow researchers.

Distributed deep learning is the process of training a deep learning model across multiple machines. It is meant to decrease training time, or to split the model and data when they are simply too large to fit on one machine. The technique exploits either data or model parallelism. Data parallelism splits the training and evaluation data across machines: the popular ImageNet dataset is only about 150 GB, but modern state-of-the-art models are trained on datasets that can reach hundreds of terabytes. Model parallelism splits a single model across machines; for example, some neural network layers can be placed on one machine and the rest on another. Model parallelism is generally used to fit large models rather than to increase speed.

For data parallelism, we can further divide the distributed approaches into either synchronous training or asynchronous training. In synchronous training, the workers begin computing gradients at the same time using copies of the model and a portion of the data. After every worker is finished, they combine their gradients together and update their model weights using the combined gradients. The all-reduce algorithm is used for gradient aggregation. The downside of synchronous training is that all workers must finish before moving onto the next round.
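
A minimal simulation of one synchronous data-parallel step is sketched below; NumPy stands in for the workers, an in-process mean stands in for the network all-reduce, and the toy loss and shapes are assumptions for illustration.

    import numpy as np

    NUM_WORKERS, LR = 4, 0.1
    weights = np.random.randn(3)                 # every worker holds a replica
    data_shards = np.split(np.random.randn(16, 3), NUM_WORKERS)

    def local_gradient(w, x):
        # Gradient of a toy loss 0.5 * mean ||x w||^2 over the worker's shard.
        return x.T @ (x @ w) / len(x)

    # Every worker computes a gradient on its own shard of the batch...
    grads = [local_gradient(weights, shard) for shard in data_shards]

    # ...then an all-reduce averages them so all replicas apply the same update.
    avg_grad = np.mean(grads, axis=0)
    weights -= LR * avg_grad
    print(weights)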

In asynchronous training, workers do not wait for one another. This approach is commonly implemented with a parameter server (PS): one or more workers are designated as the PS and hold the current model parameters. As usual, each worker has a copy of the model and a subset of the data. When a worker begins training, it requests the latest parameters from the PS, performs its computation, and sends its update back to the PS. The downside of asynchronous training is staleness: because workers do not synchronize, a worker may compute gradients against parameters that other workers have already updated.
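
A toy, single-process sketch of the parameter server interaction follows; the class and method names are hypothetical, and a real PS runs across machines with workers pushing updates concurrently.

    import numpy as np

    class ParameterServer:
        def __init__(self, dim):
            self.params = np.zeros(dim)

        def pull(self):
            return self.params.copy()      # worker gets a (possibly stale) copy

        def push(self, grad, lr=0.1):
            self.params -= lr * grad       # apply whenever an update arrives

    ps = ParameterServer(dim=3)

    def worker_step(ps, data):
        params = ps.pull()                                   # 1. fetch parameters
        grad = data.T @ (data @ params - 1.0) / len(data)    # 2. toy gradient
        ps.push(grad)                                        # 3. send update back

    for _ in range(100):                   # real workers would do this in parallel
        worker_step(ps, np.random.randn(8, 3))
    print(ps.params)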

Federated learning (FL) is a form of distributed deep learning in which edge devices like mobile phones hold the training data and collaboratively train a model. Its most significant benefit is data privacy and security, because the data never leaves its original source. The alternative would be to copy the data to a central repository for training; instead, each edge device receives a copy of the model, computes an update locally, and sends only that update to the cloud. In other words, rather than moving data from the edge devices to a central system, federated learning moves the model to the edge devices, and only the model updates travel back to the cloud.
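
A minimal sketch of this flow in the spirit of federated averaging is shown below; all names, sizes, and the toy loss are assumptions, and the simulated device data never leaves the local_training function.

    import numpy as np

    def local_training(global_w, local_data, lr=0.1, steps=5):
        w = global_w.copy()                      # device receives the global model
        for _ in range(steps):
            grad = local_data.T @ (local_data @ w - 1.0) / len(local_data)
            w -= lr * grad                       # training stays on the device
        return w - global_w                      # only the *update* leaves the device

    global_w = np.zeros(3)
    device_data = [np.random.randn(20, 3) for _ in range(10)]  # stays on devices

    for round_num in range(3):
        updates = [local_training(global_w, d) for d in device_data]
        global_w += np.mean(updates, axis=0)     # server averages the updates
    print(global_w)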

This assurance of privacy is particularly important in areas with strong privacy laws and concerns, such as health care and finance. For example, consider an ML model that estimates a user's risk of certain diseases from their smart watch sensors: activity levels, heart rate, and so on. Or consider a model that learns to recognize faces. Most users would not want this kind of data sent off their devices to a third party. Federated learning is also useful when it is simply better for the edge device to perform the computation, for instance when the device needs results quickly and cannot wait on a round trip to a server, as in on-device ranking of apps or other device features. With federated learning, models can be trained on decentralized data at its source, and such a technique requires a robust, scalable system.

Synergistic approaches

Although Papers 1 and 3 may at first appear very different because Paper 1 addresses in-datacenter TPU pods and Paper 3 addresses edge devices, they are fundamentally equivalent: the individual TPUs in a TPU pod are analogous to the edge devices in federated learning. Their techniques may therefore be combined, and approaches discussed in one paper are applicable to the other.

Of course, there are still major differences between the issues encountered in datacenters and those encountered in federated learning and edge computing. First, the privacy concern is significantly reduced in datacenter workloads. TPUs in a datacenter are typically free to communicate with each other and share sensitive data, because they sit in close physical proximity on the same infrastructure, so the data on any one TPU is no less secure than the data on the others. A malicious actor who compromised the datacenter would likely gain access to all the TPUs, not just a subset.

One can conceive of a situation where the TPUs in a single TPU pod are not physically close to each other, so that one TPU is more vulnerable than another by being in a separate room, but this problem rarely arises given how datacenters are laid out. Another obvious difference is bandwidth and delay. Datacenter TPUs have far more freedom to communicate with each other; they are not subject to the bandwidth constraints and propagation delays that edge devices face. Edge devices are spread across large distances and rely on the internet, which is built on a great deal of unreliable third-party infrastructure.

The main problem setup of Paper 3 and federated learning assumes that data originates and resides on the edge devices. The general datacenter computing approach can be cast as federated learning by first dispatching the data onto the edge devices, at the cost of only one extra round of communication. This removes the need to build and maintain two different ML infrastructures, reducing cost and engineering time.

Paper 1 discusses several efforts to improve training on its compute cluster, and these can be applied to federated learning as well. One improvement they make for BERT is using the bfloat16 data type, which reduces memory bandwidth. The same improvement could be made on edge devices, where it would be even more beneficial because edge devices communicate at significantly lower speeds than TPUs in a datacenter.
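
As a rough illustration of the bandwidth saving, the sketch below halves the bytes moved by casting gradients to a 16-bit format before communication; NumPy has no native bfloat16, so float16 is used here purely as a stand-in.

    import numpy as np

    grads = np.random.randn(1_000_000).astype(np.float32)

    # Cast to a 16-bit format before communicating; bfloat16 keeps float32's
    # exponent range, while the float16 used here is only a stand-in.
    compressed = grads.astype(np.float16)

    print("float32 bytes:", grads.nbytes)        # 4,000,000
    print("16-bit bytes: ", compressed.nbytes)   # 2,000,000 -- half the traffic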

A model benchmarked in MLPerf and used by paper 1 is the popular ResNet-50. Paper 1 applied distributed evaluation and batch normalization on this model as an optimization. This approach can also be handy for the work done in Paper 3 by allowing the edge devices, rather than the server, to perform evaluation and normalization.

In the reverse direction, the techniques of Paper 3 can be used in datacenters as well. The fault-tolerant system design of their federated learning approach lends itself well to TPUs: TPUs fail just as edge devices do, and a task on one TPU can take longer than the same task on another.

On the analytics side, Paper 3's approach to aggregating and displaying data can also be applied to Paper 1: activity on the TPUs themselves can be logged, although attributes Paper 3 records such as phone model and OS would not be needed. Additionally, as mentioned, the privacy concerns that motivate federated learning are far less pressing inside a datacenter, so there would be no need to avoid the personally identifiable information (PII) that Paper 3 cautions against logging.

Furthermore, Paper 2 can be combined with the distributed learning approach described in Paper 1. TPU pods contain a large number of TPUs, and each TPU carries a large amount of memory; exchanging some of that memory for a combination of DRAM and NVM could yield significant savings. For example, a TPU typically has 32 GB of memory, so a pod of 1000 TPUs holds 32,000 GB. The memory alone for such a pod can cost upwards of $200,000, and reducing that cost by an order of magnitude to roughly $20,000 would be a substantial saving. Thus Paper 1 can benefit from the work described in Paper 2.

Paper 4's technique is an optimization that can be applied both to Paper 1's TPU compute cluster and to Paper 3's federated learning. Its communication scheduling was designed specifically for datacenter training, so Paper 1 stands to benefit the most: Paper 1 reported significant communication bottlenecks, and implementing Paper 4's scheduling would improve the TPU pod's performance.

The communication scheduling optimization applies to federated learning as well. The delay to reach edge devices from a central server is much greater than intra-cluster delay, so an edge device stalled on unreceived data can wait far longer. The two are also compatible because both the federated learning approach of Paper 3 and the distributed training in Paper 4 use synchronous training.

Weaknesses

There are several issues with Paper 3's federated learning design. A large one is that compute results are discarded when they do not arrive within a given round. It should be possible to recycle such results and patch them in during the next round: suppose in round 1 some devices fail to report their results in time; in round 2 those devices could report their results, but instead the work is thrown away. The system should not waste computational effort, and should be able to take results arriving in round 2 and fold them into the model as if the update had occurred in round 1. Since the work is performed by edge devices, any computation is a scarce resource and should not be discarded lightly. There is no reason the model cannot be partially updated in a later round beyond the added system complexity.

Additionally, the design is overcomplicated. The Selector and Aggregator actors could be merged into one actor; splitting them means the Selector must forward to the Aggregator, adding propagation delay and over-engineering. Likewise, the Master Aggregator could be combined with the Coordinator actor. With these simplifications there are only two kinds of actors rather than the four described.

Although analytics are useful for developers, they open a security risk: reverse-engineered signals can reconstruct user data. Signals sent back include device state, when and how long training ran, phone model, and so on. Although this data appears innocuous, it can give attackers a clear picture of the end user. If an attacker knows a model can only be trained when the phone is fully charged and connected to Wi-Fi, then the attacker can infer the user's schedule: when they sleep, when they work, and so on. Despite the authors' claim that this information is not personally identifiable information (PII), the signals can still be used to identify users.

The core data sent back to the server are the model weights, and research has shown that model weights can be reverse engineered back into the underlying data. However, this problem is inherent to federated learning in general and is not specific to, or caused by, Paper 3's approach.

A significant issue in Bandana is the accuracy of prefetching and the latency incurred when a prefetch misses. Although accuracy may appear high, in the cases where prefetching performs poorly the user must wait noticeably longer. For casual browsing, user attention is a scarce resource: if recommendations do not load in time, users lose interest, which would be a critical issue for Facebook.

Even if prefetching is 99% accurate, for a user base as large as Facebook's this would still leave a large number of users experiencing delayed content. Additionally, the switch to NVM is risky since it is a new technology that has yet to mature; unexpected problems may occur, and scaling up NVM production would take time. The authors also acknowledge that more than roughly 30 writes a day causes errors. Even though embeddings average only around 20 writes a day, the standard deviation of write rates may be large, so a significant portion could exceed the limit and experience issues. This is a critical problem with NVM.

TicTac's approach has several issues. Its framework is based on synchronous training, which is inherently inefficient since all workers must finish their tasks before proceeding. Moreover, by not giving their infrastructure asynchronous training capability, the authors restrict their framework to a smaller subset of users. Their design also includes a Time Oracle component that estimates the runtime of operations and determines the schedule based on those estimates; this is another weakness, because the predictions are not guaranteed to remain accurate across varied workloads.

Additionally, the core problem they solve could be tackled with different approaches. For example, the order of operations could be switched: rather than operation 2 being blocked behind operation 1 because operation 1 has not yet received its data, operation 2 could simply be executed first. Another solution is to pipeline data transmission across rounds of synchronous training: instead of sending the data for a single round, the parameter server could send multiple rounds of data at once, so every operation always has at least one round of data available and is never blocked.

In TicTac's performance evaluation, the authors briefly mention that they exclude the first two rounds from their timing calculations due to warm-up effects, but they do not say whether the baselines they compare against also excluded those warm-up rounds; if the baselines did not, the comparison would be invalid. In addition, they exclude the time for TicTac's heuristic computation. This computation takes 10 seconds, which is significant, and it must be redone whenever the model changes or the server crashes.

TicTac uses the ImageNet dataset for its experiments, but ImageNet is a very small workload compared to current SOTA tasks: in modern distributed deep learning datacenters, workloads involve terabytes of data and months of training time, eclipsing ImageNet. The authors also find that scaling up the number of workers greatly reduces the benefit of their communication scheduling. Although they obtain a 37% speedup with a small number of workers, the gain quickly diminishes, and since modern workloads use hundreds of GPUs or more (and federated learning uses hundreds of edge devices or more), their infrastructure adds engineering complication for little benefit.

Conclusion

Distributed systems is a mature topic, but distributed machine learning is an emerging field. As deep learning grows, there will be greater problems for ML infrastructure to solve. In particular, ML utilization is increasing rapidly, and ML infrastructure will need to support training and inference on vast amounts of data. Real-time and safety-critical ML systems will also become more prominent, so ML infrastructure must not only be fast but also robust to failure in order to prevent catastrophic outcomes (for example, in autonomous driving). ML systems also need to become more energy efficient: today's state-of-the-art models consume significant amounts of energy and have a massive environmental impact. ML systems must therefore support multiple modes of operation, fast when needed but energy efficient most of the time.