Papers:
* Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding
* On-Device Neural Net Inference with Mobile GPUs
* Towards Federated Learning at Scale: System Design

Introduction

This summary focuses on three important aspects of mobile machine learning systems: model compression, on-device inference, and collaborative model training via federated learning. The first paper, Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, proposes a method for compressing deep neural networks by combining pruning, trained quantization, and Huffman coding. The second paper, On-Device Neural Net Inference with Mobile GPUs, discusses the challenges of running inference on deep learning models with mobile CPUs, then proposes and implements a framework that accelerates inference using the non-specialized mobile GPUs already present in most phones. The third paper, Towards Federated Learning at Scale: System Design, discusses the difficulties of deploying federated learning at scale and proposes a three-pronged system that mitigates some of the systems-level challenges that arise when collaboratively training machine learning models across a large number of mobile devices. Together, these three papers provide important advances toward effective and scalable machine learning on mobile devices, paving the way for more private, secure, and low-latency machine learning systems.

Background and Approaches

Compression of Machine Learning Models

Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding is a paper by Song Han, Huizi Mao, and William J. Dally that was published at the International Conference on Learning Representations (ICLR) in 2016. The paper proposes a method for compressing deep neural networks by combining pruning, trained quantization, and Huffman coding. The goal of the method is to reduce the size of deep neural networks so that they can be more easily deployed on mobile and embedded devices.

The method proposed in the paper has three main components: pruning, trained quantization, and Huffman coding. Pruning involves removing connections between neurons in the network that are deemed to be redundant. This is done by setting small weights to zero, effectively removing the corresponding connections. Trained quantization is used to reduce the precision of the remaining weights in the network. This involves learning a set of quantization levels for each layer of the network during training, so that each weight can be represented using fewer bits. Lastly, Huffman coding is used to further compress the quantized weights by assigning shorter codes to more frequently occurring values.
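To make the pruning step concrete, the sketch below shows magnitude-based pruning of a single weight matrix in Python. The sparsity target and quantile-based threshold are illustrative assumptions; the paper prunes connections below a fixed threshold and iteratively retrains the surviving weights, which is not shown here.

```python
# Minimal sketch of magnitude-based pruning (illustrative, not the paper's code).
import numpy as np

def prune_by_magnitude(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude weights until `sparsity` fraction is removed."""
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = np.abs(weights) > threshold
    return weights * mask

layer = np.random.randn(256, 512).astype(np.float32)
pruned = prune_by_magnitude(layer, sparsity=0.9)  # keep roughly 10% of connections
print(f"Nonzero weights: {np.count_nonzero(pruned)} / {pruned.size}")
```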

The authors of the paper evaluated their method on several benchmark datasets, including ImageNet and MNIST, and deep learning architectures such as AlexNet, LeNet, and VGG-16. They found that their method was able to achieve compression rates of up to 49x on VGG-16, a large 552MB convolutional neural network, without significant loss of accuracy. The authors found similar results for the other architectures evaluated. The compressed models were also shown to have lower memory and computational requirements, making them more suitable for deployment on mobile and embedded devices.

One of the key contributions of this paper is the use of trained quantization. Quantization is the process of representing a continuous range of values with a finite set of discrete values. Previous methods for reducing the precision of weights in deep neural networks relied on fixed quantization levels, which could result in suboptimal performance. By learning the quantization levels during training, the proposed method is able to achieve better compression rates while maintaining or even improving accuracy.
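The following sketch illustrates the weight-sharing idea behind trained quantization: the surviving weights of a layer are clustered with k-means and then stored as small indices into a shared codebook. This is a simplified illustration under stated assumptions; the paper additionally fine-tunes the centroids during retraining by accumulating gradients per cluster, which is omitted here.

```python
# Sketch of weight sharing via k-means clustering (centroid fine-tuning omitted).
import numpy as np

def kmeans_quantize(weights: np.ndarray, n_clusters: int = 16, iters: int = 20):
    w = weights[weights != 0]                               # quantize only surviving weights
    centroids = np.linspace(w.min(), w.max(), n_clusters)   # linear initialization, as in the paper
    for _ in range(iters):
        idx = np.argmin(np.abs(w[:, None] - centroids[None, :]), axis=1)
        for k in range(n_clusters):
            if np.any(idx == k):
                centroids[k] = w[idx == k].mean()
    # each nonzero weight is now stored as a small integer index into the codebook
    return centroids, idx

codebook, indices = kmeans_quantize(np.random.randn(10000), n_clusters=16)
print(f"{len(codebook)} shared values; each index needs only 4 bits")
```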

Another important contribution of the paper is the use of pruning to reduce the size of neural networks. Pruning has been shown to be an effective technique for reducing the size of deep learning models, but previous methods did not combine pruning with other compression techniques such as quantization and Huffman coding. By combining these techniques, the proposed method achieves even higher compression rates than prior work.
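As a final illustration of the pipeline, the snippet below builds Huffman code lengths over a skewed distribution of quantization indices using the standard heap-based construction, showing how variable-length codes shrink the total bit count relative to fixed-length codes. This is a generic sketch of Huffman coding, not the authors' implementation.

```python
# Generic Huffman code-length construction over quantization indices (illustrative).
import heapq
from collections import Counter

def huffman_code_lengths(symbols):
    freq = Counter(symbols)
    heap = [(count, i, {sym: 0}) for i, (sym, count) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        c1, _, a = heapq.heappop(heap)
        c2, _, b = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**a, **b}.items()}  # merging adds one bit
        heapq.heappush(heap, (c1 + c2, next_id, merged))
        next_id += 1
    return heap[0][2]  # symbol -> code length in bits

indices = [0] * 700 + [1] * 200 + [2] * 90 + [3] * 10      # skewed index distribution
lengths = huffman_code_lengths(indices)
bits = sum(lengths[s] for s in indices)
print(f"Huffman: {bits} bits vs fixed 2-bit codes: {2 * len(indices)} bits")
```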

Overall, this paper presents a method for compressing deep neural networks that is both effective and efficient. The method combines pruning, trained quantization, and Huffman coding to achieve high compression rates while maintaining or even improving accuracy. The work presented in this paper has had a significant impact on the prospect of performing deep learning on less capable devices, and has motivated several follow-up papers and extensions.

On-Device Machine Learning

The paper On-Device Neural Net Inference with Mobile GPUs was published by a group of researchers from Google in the Efficient Deep Learning for Computer Vision Workshop at the Conference on Computer Vision and Pattern Recognition (CVPR) in 2019. This paper discusses the challenges of running intensive tasks, such as inference on deep learning models, on mobile CPUs due to limited computing power, thermal constraints, and energy consumption. To this end, the authors propose a solution by using mobile GPUs for on-device inference of machine learning models.

In the paper's introduction, the authors highlight that device manufacturers are adding neural processing units to high-end phones for on-device inference, but these account for only a small fraction of hand-held devices. On-device inference of deep learning models is desirable on a mobile phone because of lower latency and increased privacy, but such workloads are difficult to support when only a small subset of devices contains specialized hardware to execute them.

Using the non-specialized GPUs that are already part of most mobile systems-on-chip (SoCs), the authors propose several methods to accelerate on-device inference. They present the architectural design of a new on-device inference engine for TensorFlow Lite (TFLite GPU) and include their implementation in the TensorFlow Lite package, which has since been made publicly available. This approach offers several advantages over using the mobile CPU alone or relying on specialized hardware accelerators like neural processing units.

The authors provide an explanation of how on-device inference works with their TFLite GPU architecture. TFLite GPU first checks which of the neural network model's operators can be executed with the GPU delegate. The graph is then partitioned into sub-graphs, with the GPU backend executing the supported sub-graphs and the CPU computing any unsupported operators. Before inference, the GPU backend requires an initialization step in which shader source code is generated for each program and compiled and optimized by the driver. During inference, input tensors are reshaped, shader programs are linked and dispatched, and the GPU driver schedules and executes all shader programs in the queue, with the result made available to the CPU.
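For readers unfamiliar with the delegate mechanism, the sketch below shows roughly how a model might be run through a GPU delegate using the TFLite Python API. The delegate library path and model file are assumptions that depend on the platform and build; on Android and iOS the same functionality is exposed through the Java and Objective-C/Swift bindings.

```python
# Hedged sketch: running a TFLite model with a GPU delegate (paths are assumed).
import numpy as np
import tensorflow as tf

gpu_delegate = tf.lite.experimental.load_delegate("libtensorflowlite_gpu_delegate.so")
interpreter = tf.lite.Interpreter(
    model_path="mobilenet_v1.tflite",          # assumed model file
    experimental_delegates=[gpu_delegate],     # supported ops are handed to the GPU backend
)
interpreter.allocate_tensors()                 # shader compilation/initialization happens up front

inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]
interpreter.set_tensor(inp["index"], np.zeros(inp["shape"], dtype=np.float32))
interpreter.invoke()                           # unsupported operators fall back to the CPU
print(interpreter.get_tensor(out["index"]).shape)
```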

The TFLite GPU inference engine uses a specific memory layout called PHWC4, which is optimized to reduce cache misses. This memory layout stores 3D tensors as 4-channel slices that are stored sequentially in memory, with padding added if the number of channels is not divisible by 4. During inference, the order of computation affects memory load instructions, and optimizations focus on neighboring threads within a work group since threads inside a work group execute in a particular order, picking channels sequentially. This memory layout and thread execution optimization are part of the initialization process for the GPU backend that occurs before inference.
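A small example helps clarify the PHWC4 layout: the conversion below groups an HWC tensor's channels into 4-channel slices, padding with zeros when the channel count is not divisible by 4. The function name and exact slice ordering are illustrative; the real engine performs this conversion internally in the GPU backend.

```python
# Illustrative HWC -> PHWC4 conversion: channels grouped into padded 4-channel slices.
import numpy as np

def hwc_to_phwc4(t: np.ndarray) -> np.ndarray:
    h, w, c = t.shape
    padded_c = ((c + 3) // 4) * 4                 # round channel count up to a multiple of 4
    padded = np.zeros((h, w, padded_c), dtype=t.dtype)
    padded[:, :, :c] = t                          # zero-padding for the extra channels
    # split channels into slices of 4 and store the slices sequentially
    return padded.reshape(h, w, padded_c // 4, 4).transpose(2, 0, 1, 3)

tensor = np.random.rand(8, 8, 6).astype(np.float32)   # 6 channels -> padded to 8
print(hwc_to_phwc4(tensor).shape)                      # (2, 8, 8, 4): slice, H, W, 4 channels
```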

To evaluate the benefits of using mobile GPUs over CPUs, the authors benchmarked various mobile-compatible neural networks (e.g., MobileNet v1) using both TFLite GPU and CPU inference on a range of smartphones, including the Samsung S9, Huawei P20 Pro, iPhone XS, and Google Pixel 3. They generally found a 2-9x speedup in inference latency across devices and model architectures. These speedups are particularly notable because inference sustained over long periods causes thermal throttling, which slows inference even further. Additionally, the authors show that TFLite GPU is often bound by memory bandwidth, resulting in low ALU utilization, and that the larger cache sizes on iOS devices yield better performance than the standard OpenGL backend.

In summary, this paper presents a new architecture for performing on-device inference without specialized hardware, named TFLite GPU. The framework utilizes the GPUs that are commonly integrated into smartphone SoCs to enhance the speed of machine learning tasks and facilitate on-device machine learning to provide users with more privacy and lower latency.

Scalable Federated Learning Systems

Federated learning is a collaborative machine learning technique, where each data contributor locally and individually computes model updates, then publishes them to a central server or their peers. The paper Towards Federated Learning at Scale: System Design was published in SysML 2019 by a team of researchers from Google, and it presents a detailed description of a practical and scalable system design for federated learning.

The authors begin by highlighting the limitations of traditional machine learning approaches, which require data to first be collected and curated in a central location and may therefore raise privacy concerns. In the federated learning setting, users can contribute to a central or shared model without their data ever leaving their devices, which reduces the need for centralized data storage, preserves privacy, and reduces the risk of data breaches. Federated learning is thus a highly desirable technique for enabling privacy-preserving machine learning.

To perform federated learning at scale, the authors propose a system architecture consisting of three key components: a client, a server, and a scheduler. The client represents an individual mobile device participating in the collaborative learning process. The server represents the centralized location where the shared model is stored and where individual or aggregated updates are sent after each round of training. The scheduler is an on-device component that determines when a specific client will be selected to participate in a round of training and how the data will be aggregated.
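At a very high level, one round of this client/server interaction resembles federated averaging. The sketch below is a deliberately simplified, hypothetical illustration using a linear model; the production system adds secure aggregation, pace steering, versioning, and failure handling that are not shown.

```python
# Hypothetical, highly simplified round of federated averaging (illustration only).
import numpy as np

def client_update(global_weights, local_data, lr=0.1):
    x, y = local_data
    preds = x @ global_weights
    grad = x.T @ (preds - y) / len(y)              # gradient of a least-squares linear model
    return global_weights - lr * grad              # locally updated weights

def server_round(global_weights, selected_clients):
    updates = [client_update(global_weights, data) for data in selected_clients]
    return np.mean(updates, axis=0)                # average the client models

rng = np.random.default_rng(0)
clients = [(rng.normal(size=(32, 5)), rng.normal(size=32)) for _ in range(10)]
weights = np.zeros(5)
for _ in range(3):                                 # three training rounds
    weights = server_round(weights, clients)
print(weights)
```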

The authors then describe the communication protocol used in their federated learning system, which is designed to minimize communication overhead while still ensuring privacy and security. They also describe several techniques used to improve the efficiency of the training process, such as quantization and compression of the model updates sent by the clients.
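As an illustration of the kind of update compression described here, the sketch below quantizes a model update to 8-bit integers before upload and dequantizes it on the server. The specific scheme is an assumption made for illustration, not the protocol used in the authors' system.

```python
# Hypothetical 8-bit quantization of a model update to reduce upload size (illustrative).
import numpy as np

def quantize_update(update: np.ndarray, bits: int = 8):
    max_abs = float(np.abs(update).max())
    scale = max_abs / (2 ** (bits - 1) - 1) if max_abs > 0 else 1.0
    q = np.round(update / scale).astype(np.int8)   # 4x smaller than float32
    return q, scale

def dequantize_update(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

update = np.random.randn(1000).astype(np.float32)
q, scale = quantize_update(update)
print(np.abs(update - dequantize_update(q, scale)).max())   # small quantization error
```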

While the interactions between the client and server seem simple in a vacuum, creating a robust and reliable scheduling system is the primary challenge when attempting federated learning at scale. Scheduling is difficult for several reasons: devices differ in processing capability, battery life, and network conditions, all of which affect their ability to participate in the federated learning process at any given time. Along with these challenges, the system must maintain data privacy and security, balance the frequency of updates against limited device resources, and minimize communication overhead.

To this end, the authors propose a scheduler that addresses these challenges through adaptive device selection and client weighting, dynamically selecting devices based on their performance and available resources. The proposed scheduler also handles the additional considerations described earlier by balancing the frequency of updates with limited device resources, minimizing communication overhead, securely aggregating model updates, and detecting potential adversarial behavior from participating devices to ensure data privacy and security. To ensure good performance, the scheduler takes into account factors such as a device's battery level, network conditions, and processing capabilities, and prioritizes devices that are in an ideal state, such as charging and connected to a stable network.
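The sketch below illustrates the flavor of eligibility checking and participant selection such a scheduler might perform. The field names, thresholds, and selection policy are illustrative assumptions, not the system's actual implementation.

```python
# Toy sketch of device eligibility and participant selection for a training round.
from dataclasses import dataclass
import random

@dataclass
class DeviceStatus:
    device_id: str
    battery_pct: float
    charging: bool
    on_unmetered_wifi: bool
    idle: bool

def eligible(d: DeviceStatus) -> bool:
    # only train when it will not degrade the user's experience or data plan
    return d.charging and d.idle and d.on_unmetered_wifi and d.battery_pct > 0.5

def select_participants(devices, round_size):
    candidates = [d for d in devices if eligible(d)]
    return random.sample(candidates, min(round_size, len(candidates)))

fleet = [DeviceStatus(f"dev{i}", random.random(), random.random() > 0.5,
                      random.random() > 0.3, random.random() > 0.4)
         for i in range(1000)]
print(len(select_participants(fleet, round_size=100)))
```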

To evaluate the proposed federated learning system, the authors consider several real-world applications within Google, in which the on-device data is more relevant or privacy-sensitive, making it undesirable or infeasible to transmit to servers. These applications focus on supervised learning tasks, typically using labels inferred from user activity such as clicks or typed words. The effectiveness of the system is demonstrated across a variety of tasks, including on-device item ranking, content suggestions for on-device keyboards, and next-word prediction.

While the authors' system design addresses several issues with the deployment of federated learning, the evaluation also reveals some limitations. One scalability limitation is that the convergence time of the system may increase with the number of devices and the complexity of the model. Additionally, heterogeneity in device bandwidth and performance can lead to biased model updates, which could negatively impact the accuracy of the collaboratively trained model. While the system provides a promising approach to federated learning at scale, these remain pervasive challenges when performing federated learning in the real world.

In conclusion, federated learning is a desirable approach to perform machine learning with user data when privacy must be taken into consideration. The system design proposed in Towards Federated Learning at Scale: System Design addresses several challenges, and it proposes solutions to the scheduling problems that render current realizations of federated learning systems unscalable. Real-world applications have shown promising results, but scalability issues such as convergence time and biased model updates continue to exist across several federated learning algorithms. In spite of these challenges, the authors' proposed system marks a significant advancement in enabling scalable collaborative learning.

Limitations

In the previous section, I discussed the strengths of each of these technologies. Now, I will discuss their limitations in isolation when working toward the goal of machine learning on mobile devices.

The paper Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding presents a state-of-the-art approach for compressing deep neural networks to reduce their inference latency and memory footprint, but achieving this level of compression comes with trade-offs. The compression method requires a multifaceted and complex implementation comprising pruning, trained quantization, and Huffman coding. These techniques require significant computational resources, which makes them difficult to apply on mobile devices with limited resources. Even though the compressed models are smaller, they may still be too large to store on resource-constrained devices. Additionally, compressing a model is itself an expensive operation, which may limit the method's scalability in large-scale applications such as federated learning.

While On-Device Neural Net Inference with Mobile GPUs proposes a solution for on-device inference using mobile GPUs, it also highlights a pervasive challenge in mobile machine learning: device heterogeneity. Heterogeneity among mobile devices complicates performing gradient updates on-device. Even with the use of GPUs and improved inference latency, less capable devices may not be able to track and compute gradients. Because each mobile device has a potentially unique hardware and software configuration, it is difficult to develop a generalized solution that parallelizes matrix computations and model updates without disproportionately favoring devices with specialized hardware.

The approaches proposed in Towards Federated Learning at Scale: System Design also have limitations that must be considered. One limitation is the security and privacy risk that arises due to the distributed nature of the data sources and the need to aggregate model updates in a central location. Users could potentially "poison" their data to mount an integrity attack on the central server, and users must trust the central server to respect their privacy when aggregating updates. While the authors suggest that differential privacy can be used to mitigate these issues, implementing privacy-preserving techniques based on additive noise or shuffling can add complexity to the system and affect its utility. Furthermore, the system proposed in this paper involves multiple components such as a scheduler, a server, and a communication framework, which require sufficient computational resources and expertise for their implementation and deployment.

Cross-cutting Themes and Synthesis

The three papers, Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, On-Device Neural Net Inference with Mobile GPUs, and Towards Federated Learning at Scale: System Design, all focus on different aspects of enabling deep learning on mobile devices, and they are interconnected in their contributions towards this goal.

One of the primary challenges associated with deploying deep learning models on mobile devices is their limited processing power and memory. The first paper addresses this issue through compression techniques that significantly reduce the memory footprint and computational overhead required for inference and training in mobile applications. By reducing the size and precision of a model, its compressed variant can be efficiently deployed on resource-constrained devices, which is critical for mobile deep learning applications such as regularly publishing federated learning updates.

However, deploying compressed models on mobile devices does not by itself address the issues of updating these models with new data and performing low-latency inference, which are common requirements in many machine learning applications. The second paper directly addresses these issues by accelerating such computations using hardware that is already present in a wide range of mobile devices. While compression plays a significant part in enabling deep learning on mobile devices, generic CPUs are not designed to perform matrix (and more generally tensor) multiplications and manipulations in a highly parallel fashion. No matter how much one compresses or optimizes a model, CPUs are at an inherent disadvantage for these operations because they execute them on only a handful of cores, whereas GPUs can compute many elements of the result simultaneously thanks to their massively parallel architecture, dramatically reducing wall-clock time.

The third paper addresses this issue through federated learning, a decentralized approach to training in which multiple devices collaboratively train a shared model without sharing their data. The system proposed in this paper will always be constrained by the inability of mobile devices to match the deep learning performance of dedicated desktop and server hardware. Because of this, the kinds of systems optimizations presented in the papers on deep compression and on-device inference are essential to making scalable federated learning a realizable goal. Without compression, the collaboratively trained model could not be shared efficiently among participants, and without acceleration, model updates could not be computed in a timely manner without interfering with the client device's health and resources.

Altogether, the contributions of these papers highlight the importance of a systems-level approach to enabling deep learning on mobile devices. By compressing models to reduce their size, deploying them through federated learning, and performing on-device training and inference with mobile GPUs, mobile developers and researchers can significantly reduce communication overhead, improve energy efficiency, and improve privacy and security. These improvements to mobile machine learning systems are not only important for smartphones, but can also play a significant role in accelerating, powering, and securing applications that require real-time processing, such as autonomous driving, virtual assistants, and augmented reality.

All in all, the papers discussed in this summary address different aspects of enabling deep learning on mobile devices. Together they provide several interconnected and comprehensive systems-level approaches to problems that plague this area. By leveraging a variety of optimization techniques, like GPU acceleration and compression, and higher level systems techniques, like scalable deployments of federated learning, the developers of mobile machine learning systems can overcome the challenges associated with limited processing power and memory, improve energy efficiency, and enhance privacy and security.

Conclusion

In conclusion, the papers discussed in this summary highlight important advances in the design of systems that enable mobile machine learning. These papers address several of the challenges associated with deploying scalable and efficient machine learning on mobile devices, in particular model compression, on-device inference, and collaborative model training via federated learning.

The first paper, Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, provides systems designers with a method for compressing a wide variety of neural networks by combining pruning, trained quantization, and Huffman coding. The methods discussed in this paper achieve high compression rates while maintaining the vast majority of a model's accuracy. This has significant implications for deploying deep learning on less capable devices, and it paves the way for wireless transfer of models over mobile networks.

The second paper, On-Device Neural Net Inference with Mobile GPUs, designs and implements a solution to the challenges of running intensive machine learning tasks on mobile CPUs by using generic mobile GPUs for on-device inference. Not only does this increase the efficiency of inference computations, but it also enables gradient computations on devices with less capable CPUs.

The third paper, Towards Federated Learning at Scale: System Design, addresses the difficulties deploying federated learning at scale and proposes a three-pronged system that mitigates some of the systems-level challenges that arise when attempting to train machine learning models across heterogeneous devices in a collaborative way. Federated learning has the potential to enable truly privacy-preserving machine learning, and the systems proposed in this paper will help create scalable systems to do so.

Overall, while machine learning on mobile devices is a relatively new challenge, these three papers demonstrate significant progress towards making mobile machine learning more accessible, efficient, and secure.

Looking toward the future, it is likely that this field will see continued advances toward better model compression techniques, more ubiquitous specialized machine learning hardware, and the refinement of the systems that underlie federated learning, such as scheduling.

As mobile devices become ever more ingrained in our lives, the demand for low latency, private mobile machine learning will continue to grow. Nevertheless, the progress made by these papers alone suggests that the future of mobile machine learning systems is bright, and that there is still ample room for innovation.