Abstract

This paper surveys three papers in the domain of large-scale distributed machine learning. It describes the domain and the core mechanisms of the frameworks proposed in the surveyed papers (DistBelief, the advanced parameter server, and TensorFlow dataflow graphs) and shows the connections between these frameworks. The paper then compares and contrasts the approaches taken by the frameworks, describes the strengths and weaknesses of these approaches, and discusses potential future advances in the domain.

1. Introduction

Machine learning is a powerful tool with profound potential to advance various applications and fields. As more sophisticated models are developed, it has been found that increasing the scale of machine learning models can further enhance their overall performance. This paper surveys three frameworks in the domain of developing large-scale models, proposed in the following papers:

  1. Large Scale Distributed Deep Networks
  2. Scaling Distributed Machine Learning with the Parameter Server
  3. TensorFlow: A System for Large-Scale Machine Learning

Large-scale models can have many advantages, such as higher accuracy, flexibility, usage efficiency, and fault tolerance. For example, a large-scale model can include more parameters, which in return can provide more accurate inferences, because the model can analyze more features and data examples. As another example, a large-scale model can recover from a machine failure more easily: because the model is spread across many machines, it can use techniques such as checkpointing to quickly recover from the failure, which greatly increases its fault tolerance.

Several frameworks have been proposed to further advance the scalability and flexibility of machine learning models in order to achieve better performance. For example, one framework proposes training models on large machine clusters, with each model partitioned across different machines (Dean et al., 2012). This responds to a limitation of the earlier approach of training large models on GPUs: the training speed-up is small when the model does not fit in GPU memory (Dean et al., 2012).

The better model performance achieved by these frameworks also comes from utilizing many CPU cores, which increases efficiency: the added scalability and flexibility allow the user to train and operate larger models in less time. One framework that uses this concept is DistBelief, which supports model parallelism and data parallelism through the training algorithms Downpour SGD and Sandblaster L-BFGS.

Building on the DistBelief design, alternative frameworks have been developed that focus on improving the scalability and practicality of the parameter server. For instance, one framework provides server nodes with additional functionality: range push and pull, user-defined server functions, and consistency models (Li et al., 2014). This makes the model's parameter server faster and more reliable in communicating and sharing parameters.

Although a model might be highly scalable, it might not be able to operate in many different system environments. For example, a model might train on tens of thousands of parameters and data examples, yet fail to operate across varied system environments, such as different operating systems and hardware platforms. Hence, instead of reusing the parameter server design from the DistBelief framework, the TensorFlow framework is built on a dataflow graph architecture (Abadi et al., 2016). In this approach, the dataflow graph represents the algorithm's computations and operation states, which allows the model to run in heterogeneous system environments. Dataflow graphs can also map nodes both across machines and within machines, which facilitates developing new model architectures and training large models.

2. Background

Machine learning is widely used in many areas. A machine learning model learns from training data and makes predictions on testing data. The model can be based on classification or regression, depending on the task. For a classification task, a model needs a classification function, an objective function, and a learning algorithm (Jurafsky & Martin, 2022).

The classification function computes the estimated probability of a class label given the data; for nonlinear classification, this can be a function such as sigmoid, tanh, or ReLU. The objective function measures the error of the predicted value relative to the true value. The learning algorithm minimizes the error computed by the objective function to achieve better prediction accuracy. There are many learning algorithms, including stochastic gradient descent, batch gradient descent, and distributed subgradient descent. The goal of learning is to find the weights that produce the best prediction accuracy.

A neural network is composed of layers of activation units. Each unit produces its activation by taking the dot product of the unit's input and weights and then adding a bias (Jurafsky & Martin, 2022). The activations of a layer's units, transformed by a nonlinear function, are fed as input to the next layer. If stochastic gradient descent is used as the learning algorithm, the gradient of the objective function is computed to update the weights, which reduces the prediction error in each iteration. At the output layer, the model selects the class label with the highest estimated probability.
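
As a concrete illustration of these pieces, the following is a minimal sketch (in NumPy, not taken from any of the surveyed papers) of a sigmoid classification function, a cross-entropy objective, and a single stochastic gradient descent update for one training example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(y_true, y_hat):
    # Objective: error between the predicted probability and the true label.
    return -(y_true * np.log(y_hat) + (1 - y_true) * np.log(1 - y_hat))

def sgd_step(w, b, x, y_true, lr=0.1):
    # Forward pass: dot product of input and weights, plus a bias,
    # transformed by the classification function.
    y_hat = sigmoid(np.dot(w, x) + b)
    # Gradient of the cross-entropy objective for logistic regression.
    grad_w = (y_hat - y_true) * x
    grad_b = (y_hat - y_true)
    # Update the weights in the direction that reduces the error.
    return w - lr * grad_w, b - lr * grad_b

w, b = np.zeros(3), 0.0
w, b = sgd_step(w, b, x=np.array([1.0, 0.5, -0.2]), y_true=1.0)
```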

In distributed subgradient descent, each worker node (replica) uses its own portion of the training data to compute a subgradient that determines the weight updates (Jurafsky & Martin, 2022). The updates from the different workers are then aggregated, which allows them to mix together.
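
The sketch below illustrates this idea in a toy form: each worker computes a subgradient (here, of the hinge loss) on its own data shard, and the shared weights are updated with the aggregated result. The sharding and aggregation scheme is illustrative and does not follow the exact protocol of any surveyed framework.

```python
import numpy as np

def worker_subgradient(w, X_shard, y_shard):
    # Hinge-loss subgradient on this worker's shard (labels in {-1, +1}).
    margins = y_shard * (X_shard @ w)
    mask = margins < 1.0
    return -(y_shard[mask][:, None] * X_shard[mask]).sum(axis=0) / len(X_shard)

def distributed_step(w, shards, lr=0.1):
    # The server sums the workers' subgradients, "mixing" their updates together.
    total = sum(worker_subgradient(w, X, y) for X, y in shards)
    return w - lr * total / len(shards)

# Toy usage: two workers, each holding its own shard of the data.
rng = np.random.default_rng(0)
shards = [(rng.normal(size=(100, 3)), rng.choice([-1.0, 1.0], size=100))
          for _ in range(2)]
w = distributed_step(np.zeros(3), shards)
```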

3. Methods

This section starts by introducing some general approaches to developing large-scale machine learning models. It then describes the three surveyed frameworks and compares their underlying mechanisms, and it ends by discussing a synergistic framework produced by combining the three.

3.1. Domain Approaches

Different approaches can be taken to develop large-scale machine learning models. For example, a model can use dimensionality reduction methods, such as Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), and t-distributed Stochastic Neighbor Embedding (t-SNE), to reduce the number of data features while preserving the most important ones. This reduces the data complexity when the number of parameters and data inputs is large, and it allows the model to predict based on the selected features that carry the most information.
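
As a brief illustration of this approach, the sketch below uses scikit-learn's PCA (one assumed choice of library; the data here are random placeholders) to project high-dimensional inputs onto a small number of components before a larger model is trained on them.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random placeholder data: 10,000 examples with 512 raw features.
X = np.random.rand(10_000, 512)

# Keep only the 50 directions that preserve the most variance.
pca = PCA(n_components=50)
X_reduced = pca.fit_transform(X)            # shape: (10000, 50)

print(X_reduced.shape)
print(pca.explained_variance_ratio_.sum())  # fraction of variance retained
```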

An alternative approach is to use a hash-based classification function. By hashing the output values of the classification function, the model can easily store and retrieve the transformed activations of each activation unit. This can reduce the time spent passing gradients through the neural network, which makes the classification process more efficient.

3.2. DistBelief

DistBelief is a framework that supports large-scale distributed neural networks. In this framework, the user develops large-scale models by partitioning the model across different machines (Dean et al., 2012). This increases the available computation power, because the node computations of the different partitions can be carried out in parallel on the CPU cores of the different machines.

The framework proposes two distributed optimization algorithms, Downpour SGD and Sandblaster L-BFGS, that further improve the training of large-scale neural networks (Dean et al., 2012). Both algorithms enable parallel computation across different model instances, and both use a parameter server through which the model instances share and update parameters.

Downpour SGD is a variant of asynchronous stochastic gradient descent that uses multiple replicas of a model instance (Dean et al., 2012). Each model replica is trained on a subset of the training data and communicates updates with the other replicas through the parameter server: a replica acquires an updated copy of the parameters from the parameter server, computes its gradient, and sends the gradient back to the parameter server.
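
The loop below sketches this cycle. The `param_server`, `data_shard`, and `compute_gradient` objects are hypothetical stand-ins used for illustration; they are not the actual DistBelief interfaces.

```python
def run_replica(param_server, data_shard, compute_gradient):
    # Each replica loops over mini-batches from its own subset of the data.
    for batch in data_shard.minibatches():
        w = param_server.pull()            # fetch an updated copy of the parameters
        grad = compute_gradient(w, batch)  # gradient on this replica's mini-batch
        param_server.push(grad)            # send the gradient back; there is no
                                           # barrier with other replicas (asynchronous)
```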

The asynchronous nature of Downpour SGD gives it higher fault tolerance than synchronous SGD. Because Downpour SGD has a large number of machines working asynchronously, the failure of one machine has little impact on the training process. In contrast, a machine failure can greatly delay synchronous SGD, since all the model replicas follow the same schedule and the operations of the other replicas are held up by the failed machine.

Furthermore, several techniques have been proposed to improve Downpour SGD. For instance, one can reduce the communication overhead by specifying how frequently the model replicas request parameters from and push updates to the parameter server (Dean et al., 2012). One can also use the Adagrad adaptive learning rate procedure to further increase the performance of Downpour SGD. This increases the maximum number of model replicas the framework can use productively, which also improves the scalability of the model.
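
A rough sketch of these two refinements is shown below: `n_fetch` and `n_push` control how often a replica communicates with the parameter server, and the Adagrad accumulator keeps a per-parameter sum of squared gradients that scales the learning rate. As above, the `param_server` and `data_shard` objects are hypothetical placeholders.

```python
import numpy as np

def run_replica_adagrad(param_server, data_shard, compute_gradient,
                        lr=0.01, n_fetch=5, n_push=5, eps=1e-8):
    w = param_server.pull()
    accum = np.zeros_like(w)       # Adagrad: running sum of squared gradients
    pending = np.zeros_like(w)     # gradients accumulated between pushes
    for step, batch in enumerate(data_shard.minibatches(), start=1):
        grad = compute_gradient(w, batch)
        accum += grad ** 2
        w -= lr * grad / (np.sqrt(accum) + eps)   # per-parameter learning rate
        pending += grad
        if step % n_push == 0:     # push accumulated gradients less frequently
            param_server.push(pending)
            pending[:] = 0.0
        if step % n_fetch == 0:    # refresh the local parameter copy less frequently
            w = param_server.pull()
```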

Sandblaster L-BFGS, on the other hand, uses a batch procedure in which the model replicas do less pushing and pulling of parameters to and from the parameter server (Dean et al., 2012). Sandblaster L-BFGS has a coordinator component that sends messages to the model replicas and the parameter server, so the information stored in the parameter server is accessed only when needed. This reduces the overhead of the model replicas' accesses to the parameter server.
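
The pseudocode-style sketch below conveys the coordinator idea. All of the objects and method names (`coordinator`, `workers`, `param_server`, and their methods) are illustrative placeholders rather than the actual Sandblaster interfaces.

```python
def sandblaster_batch_step(coordinator, workers, param_server):
    # The coordinator issues small portions of work to the workers.
    coordinator.issue_work(workers, portion="gradient")
    grads = coordinator.collect(workers)            # gather the partial gradients
    direction = coordinator.lbfgs_direction(grads)  # batch L-BFGS search direction
    param_server.apply_update(direction)            # the server is touched only here
```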

3.3. Advanced Parameter Server

A later framework proposes a more advanced and sophisticated parameter server for large-scale machine learning models (Li et al., 2014). Compared to the parameter servers in the earlier DistBelief framework, the structure of this parameter server is most similar to that used by Sandblaster L-BFGS: both support batch optimization, both support data parallelism across model replicas and model parallelism within each replica, and both have a resource coordinator that helps the server and worker nodes share parameters.

However, the advanced parameter server has a more sophisticated structural design. It contains a server manager in the server group and a task scheduler in each worker group (Li et al., 2014). During operation, the server nodes keep the shared parameters, and the server manager stores the server metadata. The worker nodes execute the application, and the task scheduler assigns tasks to workers and monitors their progress.

This version of the parameter server also supports many functionalities. For instance, it supports range push and pull, in which only a range of keys is communicated, further increasing operational efficiency. It also supports user-defined functions on the server, which gives the user more flexibility in using the information stored in the shared parameters and more control over the distributed system. Moreover, the parameter server allows the user to choose a consistency model, which helps manage the data inconsistency that can arise from tasks running independently. The user can choose among three consistency models: Sequential, Eventual, and Bounded Delay. In the Sequential model, tasks are executed one after another. In the Eventual model, all tasks may execute at the same time. In the Bounded Delay model, a task can only start after all tasks issued more than a bounded delay before it have completed.
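
The toy sketch below illustrates two of these features: a range push that sends only the key-value pairs within a key range, and a bounded-delay check that blocks a task until every task issued more than `tau` steps earlier has finished. The data structures are simple Python containers, not the actual parameter server API.

```python
def range_push(server_kv, local_kv, key_range):
    # Send only the keys inside [lo, hi) to the server's key-value store.
    lo, hi = key_range
    for key, value in local_kv.items():
        if lo <= key < hi:
            server_kv[key] = value

def bounded_delay_ok(task_id, finished_tasks, tau):
    # Bounded Delay: task t may start once tasks 0 .. t - tau - 1 are finished.
    # tau = 0 gives the Sequential model; a very large tau approximates Eventual.
    return all(t in finished_tasks for t in range(0, task_id - tau))
```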

3.4. TensorFlow Dataflow Graphs

TensorFlow uses dataflow graphs to support the development of large-scale models in heterogeneous environments (Abadi et al., 2016). Instead of using the parameter server architecture of its predecessor DistBelief, TensorFlow uses dataflow graphs to represent the model components (Abadi et al., 2016). There are several reasons for this. First, DistBelief requires C++ to define new layers, which creates a barrier for users who conduct machine learning research in other programming languages, such as Python. Second, DistBelief requires modifying the parameter server implementation in order to experiment with new optimization methods. Finally, DistBelief has a fixed execution pattern and is hard to scale down, which makes it difficult to run in different environments.

A dataflow graph contains several kinds of elements: tensors, operations, variables, and queues (Abadi et al., 2016). Tensors are multi-dimensional arrays that carry the values flowing along the edges of the graph; in the model, all data are represented as tensors for higher transmission efficiency. Operations, represented as the vertices of the graph, are the elements that compute on and update the algorithm's state. For example, if a variable 'x' needs to be multiplied with another variable 'w', a vertex connected to these variables stores the multiplication operation. Variables are operations that own buffers storing the model's shared parameters. Queues are elements that support more advanced forms of data access in the graph.
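
The short TensorFlow sketch below builds such a graph with the current public API (which post-dates the 2016 paper): `tf.function` traces the Python code into a dataflow graph whose edges carry tensors, whose vertices are operations such as the matrix multiplication, and whose `tf.Variable` holds mutable state.

```python
import tensorflow as tf

w = tf.Variable(tf.random.normal([3, 1]))   # a variable: an operation with a buffer

@tf.function
def predict(x):
    return tf.sigmoid(tf.matmul(x, w))      # matmul and sigmoid become graph vertices

x = tf.constant([[1.0, 0.5, -0.2]])         # a tensor: a multi-dimensional array
print(predict(x))

# Inspect the traced dataflow graph and its operation vertices.
graph = predict.get_concrete_function(x).graph
print([op.name for op in graph.get_operations()])
```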

To further advance its usability and model performance, TensorFlow also provides many features (Abadi et al., 2016). For instance, it supports partial and concurrent execution, which allows the user to declare which subgraphs to execute. If a user only needs to execute some of the subgraphs involved in reading input data, the model can focus on executing those subgraphs and save the resources it would spend executing all of them. This division of the system into subgraphs makes the framework more distributed and more efficient in allocating resources, which builds a foundation for developing large-scale models. The user can also run many subgraphs concurrently, which further saves the computation resources and time needed to finish a task. Another feature that advances distributed execution in TensorFlow is that each operation resides on a device, which executes a kernel for that operation (Abadi et al., 2016). This provides more flexibility in mapping operations to devices and increases device utilization when executing the dataflow graph.
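
Partial execution can be sketched with TensorFlow's graph-and-session interface (`tf.compat.v1`), the closest public API to the 2016 design: the caller names the fetches it wants, and only the subgraph needed to compute those fetches runs.

```python
import tensorflow as tf

tf.compat.v1.disable_eager_execution()

x = tf.compat.v1.placeholder(tf.float32, shape=[None, 3], name="x")
w1 = tf.compat.v1.get_variable("w1", shape=[3, 8])
hidden = tf.matmul(x, w1)                   # one subgraph
w2 = tf.compat.v1.get_variable("w2", shape=[8, 2])
logits = tf.matmul(hidden, w2)              # a further subgraph, unused below
summary = tf.reduce_mean(hidden)            # a small side branch

with tf.compat.v1.Session() as sess:
    sess.run(tf.compat.v1.global_variables_initializer())
    # Fetching only `summary` executes just the operations it depends on;
    # the `logits` subgraph is never run.
    print(sess.run(summary, feed_dict={x: [[1.0, 0.5, -0.2]]}))
```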

The TensorFlow framework is highly extensible. Even though its architecture differs from DistBelief, TensorFlow provides a general way to compute gradients: it differentiates the symbolic expression for the objective function and produces a new symbolic expression representing the gradients (Abadi et al., 2016). TensorFlow is also fault-tolerant, which is achieved through the checkpointing operations save and restore (Abadi et al., 2016). The save operation writes tensors to a checkpoint file, and the restore operation reads tensors from a checkpoint file. Suppose a power outage interrupts training: a model developed with TensorFlow will have saved its progress as tensors in a checkpoint file, and it can be restored by reading those tensors back. This mechanism gives the model high usability and fault tolerance and saves the time that would be needed to retrain the model after an interruption. In addition, TensorFlow supports synchronous training through blocking queues (Abadi et al., 2016). The queues ensure that workers read the same parameter values and apply gradient updates synchronously, which can improve training speed with GPUs.
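
The save and restore cycle can be sketched with the `tf.train.Checkpoint` API; the checkpoint path below is an arbitrary example.

```python
import tensorflow as tf

w = tf.Variable(tf.zeros([3, 1]))
ckpt = tf.train.Checkpoint(weights=w)

path = ckpt.save("/tmp/survey_demo_ckpt")   # save: write the variable's tensors to disk
w.assign_add(tf.ones([3, 1]))               # training continues, then is interrupted
ckpt.restore(path)                          # restore: read the saved tensors back
print(w.numpy())                            # the variable holds the saved (zero) values
```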

3.5. Cross-Cutting Themes

The frameworks of DistBelief, the advanced parameter server, and TensorFlow dataflow graphs all target large-scale distributed machine learning models, and they share several ideas in their approaches.

DistBelief's Sandblaster L-BFGS and the advanced parameter server both support parallelism within and across model replicas. Because each model replica runs on a different set of machines, this helps the frameworks achieve high speed and good resource utilization when training large-scale models. Both frameworks also use a similar distributed architecture, derived from DistBelief, that is composed of a parameter server and worker nodes. This gives the worker nodes an organized, distributed way to acquire updated parameters and push gradients to the parameter server, leading to efficient parameter sharing across the worker nodes.

Furthermore, the three frameworks of DistBelief, the advanced parameter server, and TensorFlow dataflow graphs share some common elements. First, they have all been influenced by the DistBelief architecture. The advanced parameter server framework is closely related to DistBelief because it focuses on improving the parameter server part of that architecture. Likewise, the TensorFlow dataflow graph framework is the successor to DistBelief and was inspired by its distributed model design.

Second, all three frameworks support fault tolerance. For instance, the asynchronous nature of DistBelief's Downpour SGD allows model replicas to train independently on many different machines, so when one machine fails, the other machines can continue their work, which reduces the negative impact of the failure. TensorFlow also provides fault tolerance, but it achieves this by implementing checkpointing operations in the dataflow graph. When training is interrupted, the model's previous progress can be restored from the saved tensors, allowing training to continue from where it was interrupted.

Third, all of the frameworks involve some form of system partitioning to make the models distributed. In DistBelief, each neural network is partitioned across several machines, and each machine is only responsible for transmitting the state associated with the nodes inside its partition boundary. This produces model parallelism, since each machine manages the nodes in its own partition in parallel. In a similar fashion, TensorFlow partitions its dataflow graphs: during training, the dataflow graph is partitioned into subgraphs, which execute tasks concurrently and share information with each other. To further extend the user's control over training, TensorFlow also allows the user to specify which subgraphs to execute, which gives the user more freedom in training the model and makes execution more efficient.

Finally, the three frameworks all manifest parallelism through distributed execution. For example, DistBelief has several machines that execute different parts of the model, which lets it execute the model in parallel, increasing training speed and reducing the computation load on each machine. Likewise, each operation in a TensorFlow dataflow graph is placed on a device, such as a CPU or GPU, and during training the device executes a kernel for its assigned operation. In addition to distributing computation across machines, TensorFlow thus distributes computation across devices. This makes training more flexible, since operations in the dataflow graph are mapped to devices in a self-contained and organized way, and it helps the model achieve parallelism during training.

3.6. Synergistic Approaches

Since the three frameworks share several ideas, some of their approaches can be merged to create a new large-scale distributed machine learning framework.

To begin with, the new framework can combine the architectures of DistBelief and TensorFlow dataflow graphs, so that it is composed of multiple dataflow graphs running on different machines and devices. Each dataflow graph is partitioned into subgraphs, and each machine is responsible for managing one of the subgraphs. Replicas of the dataflow graph also run on multiple machines, so the model retains execution parallelism both across and within model replicas. To make the model more distributed, each operation resides on a device, which gives the model both model parallelism and device parallelism.

Furthermore, a server manager and task scheduler can be deployed inside each dataflow graph. While each dataflow graph keeps its own set of parameters and does not need to share a common parameter server with other replicas, it can still benefit from having a server manager and task scheduler. The server manager provides the model with metadata about the parameters, which makes the parameters more manageable, while the task scheduler assigns tasks to the workers. This makes the process of reading parameters and applying gradients more targeted and efficient, since each worker has a clear task and does not need to search through all the parameters in the system.

Moreover, the new framework would provide fault tolerance through both machine redundancy and user-level checkpointing. As in DistBelief, the new framework can reduce the influence of a machine failure by running multiple model replicas on many machines: the size of the machine cluster can offset the loss caused by a failed machine. Periodic checkpointing further increases the robustness of the framework's fault tolerance, because writing checkpoint files periodically directly saves the state of the model for later restoration. Even if the majority of the machines fail at once, the model can still be restored from the tensors in the saved checkpoint file, so the training process would not suffer a large delay.

In addition, the new framework would support user-declared execution subgraphs to increase flexibility. Not all subgraphs in a dataflow graph are needed in every execution. For example, a user may want an asynchronous model, in which case the blocking queue subgraph is not needed. By excluding this subgraph, the user develops a model with better resource utilization. Letting the user specify which subgraphs should be executed thus provides more freedom and flexibility in defining and executing the model, which can further improve model performance.

Last but not least, the new framework can include a lookup table in which the dataflow graph stores the backward paths that lead to each operation vertex and the summed gradients associated with each path. Instead of using breadth-first search to find all the paths and gradients that lead to a target operation vertex, the differentiation algorithm would only need to perform a table lookup, which can substantially decrease the time and complexity of the search. This also makes the model more scalable: when training a large-scale model with many operations and a large number of data inputs, the lookup table lets the model find the paths and gradients to the different operation vertices more efficiently.
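
A toy version of this proposed lookup table is sketched below: the accumulated gradient for every operation vertex is computed once, working backwards from the output, and memoized in a dictionary so that later queries are a single lookup rather than a fresh graph search. Everything here is hypothetical and not part of any surveyed framework.

```python
def build_gradient_table(consumers, local_grad, output):
    """
    consumers: maps a vertex to the vertices that consume its result
    local_grad: maps an edge (vertex, consumer) to its local derivative
    output: the final output/loss vertex of the dataflow graph
    """
    table = {output: 1.0}                   # d(output)/d(output) = 1

    def grad(vertex):
        if vertex not in table:             # chain rule with memoization
            table[vertex] = sum(local_grad[(vertex, c)] * grad(c)
                                for c in consumers.get(vertex, []))
        return table[vertex]

    return grad, table

# Toy graph: out = x * w, with x = 2 and w = 3, so d(xw)/dx = 3 and d(xw)/dw = 2.
consumers = {"x": ["xw"], "w": ["xw"], "xw": ["out"]}
local_grad = {("x", "xw"): 3.0, ("w", "xw"): 2.0, ("xw", "out"): 1.0}
grad, table = build_gradient_table(consumers, local_grad, "out")
print(grad("x"), grad("w"))                 # 3.0 2.0
```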

4. Strengths and Weaknesses

While the three frameworks in the surveyed papers all share the common goal of developing large-scale distributed models, each framework has its own strengths and weaknesses.

As one of the foundational frameworks, DistBelief has various strengths. For instance, DistBelief supports model and data parallelism, which helps build a basis for constructing large-scale models more efficiently. DistBelief partitions each neural network model and distributes each partition to a machine; by having multiple machines execute their partitions in parallel, it conserves the computation resources of each machine while completing the execution of the entire model. Furthermore, DistBelief uses multiple replicas of the model instance, which allows parallelism not only within a model replica but also across replicas. In this way, the failure of one machine does not necessarily impact the operations of the others, which creates an efficient and robust foundation for building large-scale models. Additionally, DistBelief contains a parameter server that is shared among the model replicas. This provides a structured way for the replicas to acquire updated parameters, push their computed gradients, and share parameters with one another, which leads to a more automatic and organized system.

However, DistBelief also has some weaknesses. One is that it lacks flexibility and provides few user-control functionalities. For instance, DistBelief does not provide user-declared features in the model architecture or operation: its architecture is rather fixed, and it does not allow the user to specify elements such as consistency models on the server or the isolation of shared parameters.

Another weakness of DistBelief is that it might not adapt to all system environments. Even though it can use many machines to run model replicas, DistBelief's operation might be limited by the environment it runs in; for example, it might run in one operating system environment but not in another. To address this, DistBelief would need a more flexible architecture that allows it to operate in heterogeneous environments.

By improving on the parameter server of the DistBelief-style architecture, the advanced parameter server framework gains various advantages. First, it uses the server manager to strengthen the coordination among the server nodes, which enables the model to share parameters on the server more efficiently. Second, the framework provides many user-declared features, such as user-defined server functions, server consistency models, and user-defined filters. This makes the model more flexible and usable, because it allows the user more control and freedom in deciding how the model operates.

Nevertheless, the advanced parameter server also has some weaknesses. For example, it focuses mostly on the parameter server itself: the work mainly covers the high-level structure of the server group and worker group, but it does not cover the low-level, internal structure of each model replica. This reduces the clarity of the proposed architecture, because the internal model structure shared among the machines also provides the important feature of model parallelism. The framework would benefit from a fuller description of the internal structure of each model replica.

Another weakness of the advanced parameter server is that it might not support heterogeneous environments. Similar to DistBelief, this framework is essentially an extension in which the parameter server is more robust and controllable, but there is no guarantee that the architecture can operate in all system environments, which might also limit the model.

Inspired by its predecessor DistBelief, the TensorFlow dataflow graph framework has several strengths. First, it places operations on different devices, which allows the operations to be computed in parallel; consequently, the model can pass the computed tensors through the dataflow graph more efficiently.

Second, the dataflow graph is divided into subgraphs, each responsible for a particular task. This makes the architecture more distributed, because only the subgraphs that the user selects are executed, which saves computation resources and makes the model more flexible.

Third, the framework has high fault tolerance, because it uses periodic checkpointing to actively save the model state. The model can efficiently restore its previous state from the saved checkpoint file, which reduces the impact of an interruption and makes the model more robust.

Finally, the dataflow graphs in the framework support heterogeneous environments, because the dataflow graph architecture is easier to scale down to different system environments. In contrast, TensorFlow's predecessor DistBelief has a fixed structure and execution pattern, which makes it hard to scale down or perform well in different environments.

Although the TensorFlow dataflow graph framework has a more flexible architecture than DistBelief, it also has some weaknesses. One is that the framework has not yet determined default policies that work well for all users: different users need different optimizations, and a fixed policy does not always perform well for every user. Automatic optimization should be included to address this issue.

Another weakness of TensorFlow dataflow graphs is that the framework might not support applications with strong consistency requirements. Although many applications have weak consistency requirements, some need stronger consistency to operate the model, so policies for applications with strong consistency requirements would also need to be included.

Finally, the static nature of TensorFlow dataflow graphs might limit complex deep learning algorithms that require dynamic control flow, so the framework might need to include more dynamic features in its architecture.

5. Conclusion

This paper surveys three proposed frameworks in the domain of developing large-scale distributed models. It compares and contrasts the approaches of the three frameworks and analyzes their strengths and weaknesses.

While there is some existing research in the domain of developing large-scale distributed models, the domain is relatively new, and it is likely to see various advances in the future. For example, there may be continued work on improving model parallelism through the use of multiple machines and devices. New frameworks might include more flexible user-declared features, such as subgraph and parameter selection. Future frameworks might also develop new architectures that enable models to operate in more varied system environments. In addition, more mechanisms might be developed to strengthen model fault tolerance, such as more advanced machine allocation and periodic checkpointing.

6. References