Cold Start, Heterogeneity, and Scalability: A Survey of Serverless Computing

author: Rui Li
date: 4/21/2023

# Introduction

Serverless computing has become increasingly popular due to its ease of use and cost-effectiveness. It is a modern cloud computing paradigm that uses single-purpose services or functions as the basic computation unit. It offers several benefits: developers can focus on core logic, the pay-as-you-go model charges at a fine granularity, and cloud providers can manage resources more efficiently. Major cloud providers all offer serverless products, such as AWS Lambda, Azure Functions, Google Cloud Functions, Alibaba Serverless Application Engine, and Huawei Cloud Functions. This model is more economical since the platform only bills for executed functions and not for idle time.

The demand for serverless platforms has been increasing, leading to challenges related to scalability and performance. Firstly, cold start, launching a container from scratch for each function, is a key obstacle to fast auto-scaling, because the start time can be orders of magnitude higher than the execution time of ephemeral serverless functions. Secondly, the rise of heterogeneous hardware requires serverless platforms to run on diverse computers. Thirdly, the need to transfer state between serverless nodes has become more common, and simple stateless approaches are no longer sufficient.

In this survey paper, I will explore three recent papers that address these challenges with innovative solutions. The first paper, SOCK: Rapid Task Provisioning with Serverless-Optimized Containers, introduces a container system optimized for serverless workloads that avoids kernel scalability bottlenecks, resulting in significant speedups. The second paper, Serverless Computing on Heterogeneous Computers, proposes Molecule, the first serverless computing system that utilizes heterogeneous computers to improve function density and application performance. Lastly, the paper No Provisioned Concurrency: Fast RDMA-codesigned Remote Fork for Serverless Computing introduces MITOSIS, an operating system primitive that leverages RDMA to reduce function tail latency and memory usage in serverless workflows that require state transfer. Together, these papers offer valuable insights into improving the scalability and performance of serverless computing platforms, making them valuable reading for anyone interested in this area.

The SOCK paper analyzes the scalability bottlenecks related to storage and network isolation in Linux container primitives that can affect serverless platforms. Additionally, the paper studies the impact of importing popular Python libraries on cold start time and shows that imports alone can add around 100 ms. To address these issues, the paper proposes SOCK, a container system optimized for serverless workloads that avoids kernel scalability bottlenecks, resulting in an 18x speedup. Furthermore, a generalized-Zygote provisioning strategy and a Zygote-based three-tier caching strategy are implemented, yielding additional 3x and 45x speedups, respectively.

The Molecule paper highlights the limitations of existing serverless computing platforms built upon homogeneous computers. Molecule, the first serverless computing system that utilizes heterogeneous computers, significantly improves function density and application performance by supporting both general-purpose devices and domain-specific accelerators. To this end, the paper proposes XPU-Shim, a distributed shim that bridges the gap between the underlying multi-OS systems, and the vectorized sandbox, a sandbox abstraction that hides hardware heterogeneity. Additionally, the paper reviews state-of-the-art serverless optimizations for cold start and communication latency.

The MITOSIS paper introduces an operating system primitive that leverages the fast remote read capability of RDMA and partial state transfer. The tradeoff between container cold start time and provisioned concurrency in serverless platforms is exacerbated by the need for frequent remote container initialization, and MITOSIS bridges the performance gap between local and remote container initialization. MITOSIS can fork over 10,000 new containers from one instance across multiple machines within a second while efficiently transferring the pre-materialized states of the forked instance. This approach reduces function tail latency and memory usage and improves the execution time of serverless workflows that require state transfer by 86%.

In the following sections, I will provide detailed introductions to these papers, followed by discussion sections.

# Background and Challenges

In this section, I will describe the background and challenges of serverless computing discussed in the papers I surveyed.

## Cold Start

Both the SOCK and MITOSIS papers emphasize the cold start problem, which can stem from various causes such as resource allocation overhead, runtime loading, and library imports. Previous works have attempted to address this issue by implementing a "warm start" as opposed to a cold start. However, "warm start" means the provider has to maintain a number of live instances at all times, which increases cost. MITOSIS and SOCK each propose new methods to address this problem.

### Package and Runtime Overhead

The SOCK paper focuses on the cold start problem caused by language runtimes and package loading. Firstly, process cold start becomes more frequent and expensive with serverless techniques. Secondly, languages such as Python and JavaScript, which are commonly used in serverless computing, require heavy runtimes, making cold start over 10 times slower than launching an equivalent C program. Fast cold start matters to both tenants and providers, since avoiding cold starts by keeping instances warm is costly for providers. Currently, most serverless platforms wait for minutes or even hours before recycling idle, unbilled lambda instances.

The SOCK paper further conducts two detailed studies to better understand the sandboxing and application characteristics that interfere with efficient cold start in serverless computing. Firstly, the paper analyzes the performance and scalability of various Linux isolation primitives. Secondly, the paper studies 876,000 Python projects from GitHub and analyzes 101,000 unique packages from the PyPI repository. The study finds that many popular packages take around 100 ms to import and seconds to install, which can significantly impact cold start time.

The first study in SOCK is a container performance study, with several implications for the design of serverless containers. Firstly, the flexible stacking of union file systems may not be worth the performance cost relative to bind mounts, and file-system tree transformations that rely on copying the mount namespace are costly at scale. Secondly, network namespaces are a major scalability bottleneck, and network namespacing is of little value in serverless platforms such as AWS Lambda that execute handlers behind a Network Address Translator. Finally, reusing cgroups is twice as fast as creating new cgroups, so SOCK reuses them to reduce cold start latency and improve overall throughput.

The second study in SOCK is a Python initialization study, which shows that language runtimes and package dependencies can still make cold start slow. To better understand this issue, the authors scraped 876K Python projects from GitHub and identified likely dependencies on packages from the popular Python Package Index (PyPI) repository. They found that downloading and installing a package and its dependencies from a local mirror can take seconds, and importing installed packages can take over 100 ms. However, the authors also note that storing large package repositories locally on disk is feasible, and the strong popularity skew of certain packages creates opportunities to pre-import a subset of packages into interpreter memory. The goal of this analysis is to identify potential obstacles that may prevent applications from being ported to lambdas in the future.
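
To make the import-cost observation concrete, the following is a minimal sketch (not taken from the paper) of how per-package import latency could be measured inside a warm Python interpreter; the module list here is illustrative, whereas the SOCK study measured popular PyPI packages.

```python
import importlib
import time

# Illustrative module list; the SOCK study measured popular PyPI packages instead.
PACKAGES = ["json", "sqlite3", "email", "xml.dom.minidom"]

def measure_import(name: str) -> float:
    """Return the wall-clock time (in ms) to import a module for the first time."""
    start = time.perf_counter()
    importlib.import_module(name)
    return (time.perf_counter() - start) * 1000

for pkg in PACKAGES:
    print(f"{pkg:20s} {measure_import(pkg):8.2f} ms")
```

A Zygote that has already paid these import costs lets forked handlers skip them entirely, which is the opportunity SOCK exploits.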

### Provisioned Concurrency Overhead

The MITOSIS paper, on the other hand, examines the cost of the state-of-the-art remedy for the cold start problem, namely provisioned concurrency. Accelerating cold start has become a hot topic in both academia and industry, with most approaches resorting to some form of "warm start" through provisioned concurrency. However, these approaches require non-trivial resources when scaling functions to a distributed setting, where each machine must deploy many cached containers. Unfortunately, scaling functions to multiple machines is common and not well addressed.

The authors of MITOSIS introduce remote fork as a promising primitive for efficient function launching and fast function state sharing across machines. The authors cite the efficiency of the fork mechanism for launching containers on a single machine and the potential for remote fork to provide transparent intermediate state sharing between remote functions as their motivations. They argue that the state-of-the-art systems that use Checkpoint/Restore techniques can only achieve a conservative remote fork, which is not efficient for serverless computing.

MITOSIS employs kernel-space RDMA for remote fork, which requires fast and scalable RDMA connection establishment, efficient access control over the parent container's physical memory, and efficient parent container lifecycle management at scale. The authors address these challenges by: (i) retrofitting the advanced RDMA feature DCT for fast and scalable connection establishment, (ii) proposing a new connection-based memory access control method designed specifically for remote fork, and (iii) co-designing container lifecycle management with the serverless platform.

MITOSIS further conducted experiments to measure the cold start overhead of existing provisioned concurrency ("warm start") methods. Starting a container from scratch is slow and costly, taking hundreds of milliseconds. To reduce this cost, warm start techniques have been developed, such as caching finished containers and reusing them with nearly no cold start cost. However, this method requires a large amount of in-memory resources.

Another warm start technique uses a cached container to call the fork system call to start new containers, reducing the resources provisioned for caching. However, the resources required are still proportional to the number of machines, so this method cannot generalize to a distributed setting. Checkpoint/Restore starts containers from container checkpoints stored in a file and only requires keeping a single copy of the state, making it optimal in resource usage, but it is much slower than caching and fork.

## Heterogeneous Serverless

Another challenge is the need for high function density on a single machine to support auto-scalability and low communication latency. The authors of the Molecule paper believe future serverless platforms will need to be deployed on heterogeneous computers. Homogeneous computers are limited by the prospect of dark silicon, which reduces the effectiveness of general-purpose parallelism in computers that only have CPUs as processing units. A further challenge is that many important applications, such as machine learning, artificial intelligence, video classification, and genome analysis, rely on heterogeneous accelerators for faster computation; the inability to leverage these accelerators restricts serverless computing to limited scenarios. Additionally, co-locating I/O stacks with computations on the CPU can lead to worse resource utilization and break performance isolation.

Molecule is the first serverless computing system to take both general-purpose devices and domain-specific accelerators into account. By leveraging DPUs for better function density and FPGAs for better performance, Molecule provides an easy-to-use programming model that allows developers to utilize heterogeneous processing units in their serverless applications. Additionally, Molecule retains the benefits promised by serverless computing, such as auto-scalability. Thus, Molecule is a promising solution for overcoming the limitations of homogeneous serverless computing systems.

Designing serverless computing systems on heterogeneous computers faces several challenges, such as multi-OS systems, hardware and software abstractions, and communication complexities. Molecule proposes two solutions to overcome these challenges: (i) a generic serverless abstraction called the vectorized sandbox and (ii) an indirection layer called XPU-Shim. Built upon these, Molecule enables the utilization of general-purpose devices and domain-specific accelerators and significantly improves function density and application performance while retaining the benefits of serverless computing.

## State Transfer Overhead

Although traditional serverless computing is stateless, state transfer is an emerging requirement of newer and larger applications. MITOSIS observes that dependent functions run in separate containers and need to transfer state between them. In the absence of direct state transfer, dependent functions must resort to message passing or cloud storage. State transfer can account for up to 95% of the function execution time for dependent functions, which is unacceptable.

MITOSIS further conducted experiments on state transfer overhead. Transferring state via messages or cloud storage incurs overheads, causing a significant slowdown. Existing work proposes serverless-optimized messaging primitives and specialized storage systems, but none eliminate the overhead. Fastlane co-locates functions in the same container as threads to bypass the overhead with shared memory accesses, but threads cannot generalize to a distributed setting, so it falls back to message passing when the upstream and downstream functions are on different machines.

# Approach

In this section, I will discuss detailed approaches for solving the cold start problem, building a heterogeneous serverless architecture, and mitigating state transfer overhead.

## Serverless-Optimized Containers (SOCK) for Cold Start

SOCK (serverless-optimized containers) is a special-purpose container system that aims at low-latency invocation for Python handlers that import libraries and efficient sandbox initialization for high steady-state throughput. SOCK is based on three novel techniques: (i) lightweight isolation primitives that avoid the performance bottlenecks identified in the Linux primitive study, resulting in an 18x speedup over Docker; (ii) a generalized Zygote-provisioning strategy that avoids the Python initialization costs identified in the package study; and (iii) a three-tiered package-aware caching system built using the Zygote mechanism, achieving 45x speedups relative to SOCK containers without Zygote initialization.

### Lean Containers

The first new technology is lean containers, which provide lightweight isolation primitives. SOCK creates lean containers for lambdas by avoiding expensive operations that are only necessary for general-purpose containers. To do this, SOCK uses bind mounts to stitch together a root file system from four host directories, including a base Ubuntu image, a package caching directory, handler code, and a writable scratch directory. The container is then created using the chroot operation, which is faster and simpler than creating a new mount namespace.
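
The root-stitching idea can be sketched in user space as follows; this is only an illustration of bind mounts plus chroot under assumed paths (it must run as root), not SOCK's actual implementation, and read-only protection of the shared directories is omitted for brevity.

```python
import os
import subprocess
import tempfile

# Illustrative host directories; SOCK's real layout, flags, and naming differ.
LAYERS = {
    "base": "/srv/sock/ubuntu-base",      # base Ubuntu image
    "packages": "/srv/sock/pypi-cache",   # package caching directory
    "handler": "/srv/sock/handlers/foo",  # handler code
    "scratch": None,                      # per-container writable scratch space
}

def build_lean_root() -> str:
    """Stitch a container root from host directories using bind mounts."""
    root = tempfile.mkdtemp(prefix="lean-")
    for name, src in LAYERS.items():
        dst = os.path.join(root, name)
        os.makedirs(dst, exist_ok=True)
        if src is None:
            continue  # scratch stays an empty, writable directory
        subprocess.run(["mount", "--bind", src, dst], check=True)
    return root

def enter(root: str) -> None:
    """Enter the stitched root via chroot, which is cheaper than a mount namespace."""
    os.chroot(root)
    os.chdir("/")
```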

Communication between the OpenLambda manager and processes inside the container is done through a Unix domain socket mounted in the container's scratch space. Isolation is achieved through a combination of cgroup and namespace primitives, with cgroups drawn from a pool upon SOCK container creation and returned to the pool after container termination. The "init" process is the first to run in a SOCK container, creating a set of new namespaces with a call to unshare.

### The Zygote-Provisioning Strategy

The second technique is the Zygote-provisioning strategy, in which new processes are started as forks of an initial process, the Zygote, that has already pre-imported various libraries likely to be needed by applications. This saves child processes from repeating the same initialization work and from consuming excess memory with multiple identical copies. SOCK Zygotes scale to very large package sets by maintaining multiple Zygotes with different pre-imported packages, and provisioning is fully integrated with containers.

Handlers are therefore not vulnerable to malicious packages they did not import: SOCK protects innocent lambdas by never initializing them from a Zygote that has pre-imported modules the lambda does not require. The key challenge is using Linux APIs to ensure that the forked process lands in a new container, distinct from the container housing the Zygote helper. The SOCK protocol provisions a handler from a helper Zygote, and new Zygotes are themselves provisioned from existing Zygotes, so initialization of the Python runtime and packages is done only once and subsequent initialization is faster.
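
A minimal fork-server sketch of the Zygote idea follows; the control socket path and the handler loader are hypothetical, and the step where SOCK relocates the forked child into its own container (namespaces, cgroups) is omitted.

```python
import importlib
import os
import socket

# Illustrative pre-import set; SOCK chooses packages based on popularity data.
PRE_IMPORTED = ["json", "math", "sqlite3"]
SOCK_PATH = "/tmp/zygote.sock"   # hypothetical control socket

def run_handler(name: str) -> None:
    """Hypothetical handler entry point; a real child would enter its container first."""
    print(f"running {name} with pre-imported packages: {PRE_IMPORTED}")

def zygote_server() -> None:
    """Pre-import packages once, then fork one warm child per provisioning request."""
    for pkg in PRE_IMPORTED:
        importlib.import_module(pkg)          # paid once, shared copy-on-write with children

    if os.path.exists(SOCK_PATH):
        os.unlink(SOCK_PATH)
    srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    srv.bind(SOCK_PATH)
    srv.listen()
    while True:
        conn, _ = srv.accept()
        handler = conn.recv(4096).decode()    # name of the handler to provision
        if os.fork() == 0:                    # child inherits the warm interpreter
            srv.close()
            run_handler(handler)
            os._exit(0)
        conn.close()                          # parent: back to waiting for requests
```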

### Three-Tier Caching System

The third technique is a three-tier caching system. The first tier is a handler cache that maintains paused containers, which can be unpaused faster than creating new ones but consume memory while paused. The second tier is an install cache containing a static set of pre-installed packages on disk, mapped read-only into each container. The third tier is an import cache that manages Zygotes and selects the entry with the most matching packages, breaking ties randomly. The import cache also uses a simple runtime model to estimate potential memory reclamation and evicts entries with the highest benefit-to-cost ratio when memory utilization surpasses a limit.
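
The lookup and eviction policy of the third tier can be sketched as follows, under assumptions of my own: the field names and the cost model are illustrative, not SOCK's actual accounting.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Optional

@dataclass
class ZygoteEntry:
    packages: FrozenSet[str]   # packages this Zygote has pre-imported
    memory_mb: float           # benefit of eviction: memory reclaimed
    recreate_ms: float         # cost of eviction: time to rebuild this Zygote

@dataclass
class ImportCache:
    limit_mb: float
    entries: Dict[str, ZygoteEntry] = field(default_factory=dict)

    def lookup(self, needed: FrozenSet[str]) -> Optional[str]:
        """Pick the Zygote with the most matching packages, but never one that
        pre-imported a package the handler did not request."""
        safe = {k: e for k, e in self.entries.items() if e.packages <= needed}
        if not safe:
            return None
        return max(safe, key=lambda k: len(safe[k].packages & needed))

    def maybe_evict(self) -> None:
        """Evict highest benefit-to-cost entries until back under the memory limit."""
        while sum(e.memory_mb for e in self.entries.values()) > self.limit_mb:
            victim = max(self.entries,
                         key=lambda k: self.entries[k].memory_mb / self.entries[k].recreate_ms)
            del self.entries[victim]
```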

## MITOSIS for Cold Start and Fast State Transfer

MITOSIS is an operating system primitive that provides a fast remote fork by deeply co-designing with RDMA. The authors highlight the key insight that the OS can directly access the physical memory of remote machines via RDMA-capable NICs (RNICs). RNICs enable remote fork to imitate local fork by mapping a child container's virtual memory to its parent container's physical memory without checkpointing the memory.

### MITOSIS Design

MITOSIS uses kernel-space RDMA to achieve efficient remote forking. RDMA allows the kernel to read and write the physical memory of remote machines with low latency and high bandwidth. MITOSIS imitates the local fork with RDMA by condensing the parent's metadata into a descriptor and copying it to the child via RDMA. During execution, the child's remote memory accesses trigger page faults, and the kernel reads the remote pages accordingly. MITOSIS uses one-sided RDMA READ to read remote physical memory, bypassing software overheads.

MITOSIS can be used in a decentralized architecture and does not require dedicated resources to fork containers. The system adds four components to the kernel: the fork orchestrator, network daemon, OS virtual memory subsystems, and fallback daemon. MITOSIS preserves the security model of containers. MITOSIS addresses issues with connection establishment and remote physical memory control through the use of the DCT feature and a registration-free memory control method, respectively. Lastly, MITOSIS addresses the problem of parent container lifecycle management by offloading it to the serverless platform.

Specifically, MITOSIS implements two primitives, fork_prepare and fork_resume. Users call fork_prepare to generate metadata related to the remote fork, identified by a locally unique handle-id and key. The fork_resume primitive starts a child on another machine by fetching the parent descriptor and restoring it. The container descriptor captures the parent states, including CPU register values, the page table and virtual memory areas, and opened file information. MITOSIS uses one-sided RDMA for fast descriptor fetch and generalized lean containers for fast restore.
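
The control flow can be sketched in user space as a rough analogy; everything below runs locally in Python and stands in for kernel structures, whereas the real system fetches the descriptor and pages from the parent's physical memory over one-sided RDMA. All names and data layouts are illustrative.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass
class Descriptor:
    registers: Dict[str, int]          # condensed CPU state of the parent
    page_table: Dict[int, bytes]       # virtual page -> parent page contents
    open_files: Tuple[str, ...]

_seeds: Dict[Tuple[int, int], Descriptor] = {}

def fork_prepare(parent: Descriptor, handle_id: int, key: int) -> Tuple[int, int]:
    """Record a condensed descriptor of the parent and return its (handle_id, key)."""
    _seeds[(handle_id, key)] = parent
    return handle_id, key

def fork_resume(handle_id: int, key: int) -> "Child":
    """Fetch the descriptor (one RDMA READ in the real system) and start a child
    whose memory is populated lazily on first access."""
    return Child(_seeds[(handle_id, key)])

class Child:
    def __init__(self, desc: Descriptor) -> None:
        self.desc = desc
        self.local_pages: Dict[int, bytes] = {}   # pages pulled so far

    def read(self, vpage: int) -> bytes:
        # A page fault on first touch: pull the page from the parent on demand.
        if vpage not in self.local_pages:
            self.local_pages[vpage] = self.desc.page_table[vpage]
        return self.local_pages[vpage]
```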

### MITOSIS Application on Serverless

The authors of MITOSIS also describe how to apply MITOSIS to Fn, a popular open-source serverless platform, to accelerate function cold start and state transfer. Seeds prepared via fork_prepare are used to boost function cold start and accelerate state transfer, and the platform is responsible for reclaiming the seeds. Seeds fall into two classes: long-lived seeds and short-lived seeds.

Short-lived seeds are created by a fork-aware coordinator, which sends prepare/resume requests to the invoker and looks up an available seed for a single function call. The coordinator dynamically creates short-lived seeds based on state transfer relationships during workflow execution. Upstream function results are piggybacked in the reply of the function, and downstream functions inherit the pre-materialized results. The user must specify which function to fork when a function has multiple upstream functions.

Long-lived seeds, in contrast, are deployed as cached containers and loaded into memory. The invoker generates a seed by calling fork_prepare when caching a container. The system adjusts its cache policy to be fork-aware and only caches the first container facing a cold start across the platform. A seed store records a mapping between the function name and the corresponding seed's RDMA address, handle_id, and key. Long-lived seeds are reclaimed by timeout, and coordinators can renew them if necessary.

The seed store also records the time when each seed was deployed so that long-lived seeds can be reclaimed by timeout. The fork tree structure is maintained at the coordinator: upper-layer nodes correspond to upstream functions, and lower-layer nodes represent downstream functions. The fork tree is constructed when the coordinator forks a new child from a short-lived seed, and it is destroyed after all functions in the tree finish. The fork tree is fault-tolerant, and a simple timeout-based mechanism is used to tolerate failures.
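
A small sketch of the seed-store bookkeeping described above; the fields mirror the description (RDMA address, handle_id, key, deployment time), but the class and its methods are my own illustration, not MITOSIS's API.

```python
import time
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

@dataclass
class Seed:
    rdma_addr: str
    handle_id: int
    key: int
    deployed_at: float

class SeedStore:
    def __init__(self, ttl_seconds: float) -> None:
        self.ttl = ttl_seconds
        self.seeds: Dict[str, Seed] = {}

    def register(self, function: str, rdma_addr: str, handle_id: int, key: int) -> None:
        self.seeds[function] = Seed(rdma_addr, handle_id, key, time.time())

    def lookup(self, function: str) -> Optional[Tuple[str, int, int]]:
        seed = self.seeds.get(function)
        if seed is None:
            return None
        return seed.rdma_addr, seed.handle_id, seed.key

    def renew(self, function: str) -> None:
        """Coordinators can renew a long-lived seed before its timeout expires."""
        if function in self.seeds:
            self.seeds[function].deployed_at = time.time()

    def reclaim_expired(self) -> None:
        """Long-lived seeds are reclaimed by timeout."""
        now = time.time()
        for f in [f for f, s in self.seeds.items() if now - s.deployed_at > self.ttl]:
            del self.seeds[f]
```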

## Molecule for Heterogeneous Serverless and Cold Start

Recall that Molecule uses a generic serverless abstraction called the vectorized sandbox and an indirection layer called XPU-Shim to implement its heterogeneous serverless platform. Here I provide the details of those new techniques and then introduce the architecture of Molecule.

### Vectorized Sandbox

The Open Container Initiative (OCI) runtime specification is widely used as a sandbox abstraction in serverless computing. To handle sandboxes on FPGAs, Molecule implements a new sandbox runtime named runf.

The OCI specification, however, is limited in handling the hardware heterogeneity of serverless computing, particularly for serverless functions running on FPGAs. To address this, runf maintains FPGA serverless instance states and, when creating a sandbox on an FPGA, downloads the corresponding FPGA image and programs it into the device. The start operation is invoked when the serverless runtime needs to handle a request for the FPGA function; runf transfers the arguments to the FPGA device, issues a command to execute the function, and waits for the results. Similarly, runf can erase the FPGA device to delete a sandbox. Overall, the vectorized sandbox abstraction extends the OCI runtime specification to handle the hardware heterogeneity of serverless computing.

The vectorized sandbox approach enables efficient use of accelerators for serverless computing. It has three extensions: (i) vectorized sandbox creation, allowing direct invocation of the target sandbox when requests arrive; (ii) a vectorized start interface for concurrent execution of sandboxes; and (iii) no explicit destruction of sandboxes, with the real destroy operations deferred to the next create operation. This improves performance and scalability without adding noticeable overhead to the next create operation.
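
A toy runtime sketch of these three extensions follows, assuming a stub FPGA device; the interface below is my illustration of the idea rather than the OCI extension Molecule actually defines.

```python
from typing import Dict, List

class FpgaDevice:
    """Stub device standing in for a real FPGA programming/driver interface."""
    def program(self, image: str) -> None: ...
    def run(self, fn: str, args: dict) -> dict: return {"fn": fn, **args}

class VectorizedRunf:
    def __init__(self, device: FpgaDevice) -> None:
        self.device = device
        self.live: Dict[str, str] = {}       # sandbox id -> FPGA image
        self.pending_destroy: List[str] = [] # destruction deferred to next create

    def create(self, sandboxes: Dict[str, str]) -> None:
        """Vectorized create: program all requested images in one batch and
        lazily clean up sandboxes whose destruction was deferred."""
        for sid in self.pending_destroy:
            self.live.pop(sid, None)
        self.pending_destroy.clear()
        for sid, image in sandboxes.items():
            self.device.program(image)
            self.live[sid] = image

    def start(self, calls: List[tuple]) -> List[dict]:
        """Vectorized start: dispatch a batch of (sandbox id, fn, args) calls."""
        return [self.device.run(fn, args) for _, fn, args in calls]

    def destroy(self, sid: str) -> None:
        """No explicit destruction; the real work happens in the next create."""
        self.pending_destroy.append(sid)
```

Deferring destruction means the cost of tearing down an instance is hidden behind the next batch of creations rather than sitting on the critical path of a request.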

### XPU-Shim

The second technique, XPU-Shim, is a distributed shim that acts as an indirection layer between a serverless runtime and the multiple operating systems (OSes) of a heterogeneous computer. It provides a unified abstraction, called XPU calls, to manage and utilize resources on different processing units (PUs), such as CPUs, DPUs, and FPGAs. XPU-Shim runs on each local OS and uses the vectorized sandbox interfaces to manage heterogeneous functions, making it portable across different local OSes and PUs. It also maintains the global states of the heterogeneous computer and provides efficient communication between applications on different PUs using distributed capabilities and neighbor IPC primitives.

Moreover, XPU-Shim maintains global resources and states for user-space applications using distributed objects and capabilities. It achieves global process identification by assigning a globally unique ID to each process based on a static partitioning scheme. Permission control is managed through the distributed capability system, with each CAP_Group (per process) maintaining a list of capabilities and permissions. XPU-Shim checks for permissions in XPU calls to ensure proper access control.
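
A toy model of a capability-checked XPU call follows; the ID scheme, call names, and permission strings are illustrative, not Molecule's actual interface.

```python
from dataclasses import dataclass, field
from typing import Dict, Set

@dataclass
class CapGroup:                       # one per process
    gpid: int                         # globally unique ID (static partitioning)
    permissions: Set[str] = field(default_factory=set)

class XpuShim:
    def __init__(self) -> None:
        self.groups: Dict[int, CapGroup] = {}

    def register_process(self, node_id: int, local_pid: int, perms: Set[str]) -> int:
        # Global IDs come from a static partition of the ID space per node.
        gpid = node_id * 1_000_000 + local_pid
        self.groups[gpid] = CapGroup(gpid, perms)
        return gpid

    def xpu_call(self, gpid: int, call: str, **kwargs):
        """Every XPU call is checked against the caller's CAP_Group first."""
        group = self.groups.get(gpid)
        if group is None or call not in group.permissions:
            raise PermissionError(f"process {gpid} may not invoke {call}")
        return {"call": call, "args": kwargs}   # stand-in for the real dispatch

shim = XpuShim()
pid = shim.register_process(node_id=2, local_pid=41, perms={"xSpawn", "nIPC_send"})
shim.xpu_call(pid, "xSpawn", image="python-runtime")
```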

### Neighbor IPC (nIPC)

To let processes running on different processing units communicate with each other, neighbor IPC (nIPC) is implemented inside XPU-Shim. nIPC is achieved through a simple software stack that carries network-style communication over PCIe. Compared to remote communication methods like HTTP over sockets, the connection between two PUs on a single machine is much more reliable, so a serverless runtime using XPU-Shim can avoid the need for an API gateway.

The performance of nIPC is evaluated against Linux FIFO in Molecule. Three nIPC variants based on different XPUcall implementations are evaluated: nIPC-Base and nIPC-MPSC have higher latency than Linux IPC, while nIPC-Polling achieves lower latency than Linux IPC on the DPU. The XPUcall optimizations are efficient on the devices while incurring little cost on the CPU.

Based on these two techniques, the authors of Molecule build a heterogeneous serverless platform architecture. The Molecule architecture serves serverless requests from a global manager, which can run on any processing unit (PU) in a heterogeneous computer. It manages functions on other PUs using XPU-Shim and launches executors through xSpawn. Executors manage local function instances using the vectorized sandbox abstraction, receive commands from Molecule through nIPC, execute them on the local OS, and return the results. For accelerators like FPGAs that cannot launch a generic program, a virtual XPU-Shim instance is started on a neighboring CPU/DPU to run the corresponding executor and manage the accelerator.

### Molecule for Cold Start

Beyond building a heterogeneous serverless platform, Molecule also provides mechanisms for optimizing cold start and communication overhead. Molecule's container fork (cfork) enables container-level fork on heterogeneous computers, overcoming challenges such as multi-threading, migration of forked instances to new containers, and multi-OS systems. To handle multi-threading, cfork uses a forkable language runtime, temporarily merging threads into a single thread and saving their contexts in memory. To migrate forked instances, Molecule prepares a new container for each forked instance. Lastly, cfork supports multi-OS systems by having each PU prepare a template container for each language runtime, utilizing nIPC to create function containers and cfork new instances.
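
The multi-threading issue can be illustrated with a crude user-space sketch: worker threads are waited out before fork and re-created afterwards in both parent and child. This is only an analogy for the forkable-runtime idea (which saves and restores real thread contexts); all names here are mine.

```python
import os
import threading
from typing import Callable, List

class ForkableRuntime:
    """Toy stand-in for a forkable language runtime: fork() is only safe once the
    process is single-threaded, so worker threads are drained first."""

    def __init__(self) -> None:
        self.contexts: List[Callable[[], None]] = []   # saved "thread contexts"
        self.threads: List[threading.Thread] = []

    def spawn(self, work: Callable[[], None]) -> None:
        self.contexts.append(work)
        t = threading.Thread(target=work)
        t.start()
        self.threads.append(t)

    def cfork(self) -> int:
        # 1. Merge into a single thread: wait for the workers to finish
        #    (a crude stand-in for saving their contexts in memory).
        for t in self.threads:
            t.join()
        self.threads.clear()
        # 2. Fork the now single-threaded process (cheap, copy-on-write memory).
        pid = os.fork()
        # 3. Re-create the workers from the saved contexts in parent and child.
        for work in self.contexts:
            t = threading.Thread(target=work)
            t.start()
            self.threads.append(t)
        return pid
```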

Molecule caches function instances on the FPGA, instead of forking them, to mitigate cold-boot costs. It utilizes keep-alive policies to predict which instances to cache and prepares an FPGA image containing them. When a request arrives, Molecule can directly invoke the cached function instance without re-programming the FPGA. The number of cached instances depends on the FPGA wrapper's design, which should provide isolation and fair sharing. The state-of-the-art system Coyote can achieve performance isolation with DRAM striping and is a good candidate for Molecule to support serverless computing; Molecule currently uses a simpler scheme that statically partitions and protects resources.

Molecule also provides a solution for function DAG communication. nIPC-based DAG calls establish a full-duplex connection between caller and callee function instances using FIFOs, and CPU-DPU heterogeneous computers are supported through the XPU-FIFO abstraction provided by nIPC. FPGAs are supported by a zero-copy method that loads a new FPGA image without erasing the data in the FPGA-attached DRAM, eliminating the need to copy data between the caller and callee FPGA functions; the FPGA wrapper is responsible for clearing sensitive data.
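
A minimal sketch of a full-duplex caller/callee channel built from two FIFOs, in the spirit of the FIFO-based DAG calls described above; the paths and one-shot framing are illustrative, and cross-PU XPU-FIFOs would go through XPU-Shim instead of the local filesystem.

```python
import os

CALL_FIFO = "/tmp/dag_call"     # caller -> callee
REPLY_FIFO = "/tmp/dag_reply"   # callee -> caller

def ensure_fifos() -> None:
    for path in (CALL_FIFO, REPLY_FIFO):
        if not os.path.exists(path):
            os.mkfifo(path)

def caller(payload: bytes) -> bytes:
    """Send the request over one FIFO and block on the other for the reply."""
    with open(CALL_FIFO, "wb") as call:
        call.write(payload)
    with open(REPLY_FIFO, "rb") as reply:
        return reply.read()

def callee(handler) -> None:
    """Read one request, run the downstream function, and write the result back."""
    with open(CALL_FIFO, "rb") as call:
        request = call.read()
    with open(REPLY_FIFO, "wb") as reply:
        reply.write(handler(request))
```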

# Evaluation

In this section, I will present evaluations of the proposed methods in the surveyed papers. Through this evaluation, I can determine the performance gains achieved by each part of the methods and compare them with the state-of-the-art techniques.

## SOCK: Container and Package Optimization

In the container performance evaluation, SOCK outperformed Docker in request throughput and average latency, and Zygote-style pre-initialization further improved SOCK's throughput. SOCK also keeps recently used, idle handlers in a paused state to avoid cold starts, since unpausing is faster than creating a new container. Although SOCK enables more aggressive resource reclamation, it is still beneficial to pause idle handlers before evicting them.

To evaluate the impact of the package optimization of SOCK, a simple workload was used where a single task sequentially invokes different lambdas that use the same single library but perform no work. The results showed that without optimizations, downloading, installing, and importing packages took at least a second. However, with import and install caching, latency was reduced to 20 ms, a significant 45x improvement.

## SOCK: Case Study and Comparison

The authors use a case study to evaluate the performance of three serverless platforms (SOCK, AWS Lambda, and OpenWhisk) for on-demand image resizing using the Pillow package. They use 1 GB lambdas for AWS Lambda and a pair of m4.xlarge AWS EC2 instances for SOCK and OpenWhisk. They exercise cold-start performance by measuring request latency after re-uploading the code as a new handler. The study finds that SOCK outperforms AWS Lambda and OpenWhisk with a platform latency of 365 ms, which is 2.8× and 5.3× faster, respectively. In addition, SOCK performs package initialization work as part of the platform, which reduces compute time. The study also evaluates a scenario where a different handler using the Pillow package has recently run, which further reduces the SOCK platform latency by 3×, to 120 ms.

## Molecule: Cold Start and Communication Optimization

Molecule's cfork can significantly outperform baseline cold boot in terms of function cold start latency on CPU and DPU. With XPU-Shim support, remote template forking can be done with negligible costs. For FPGA, erasing the old image is unnecessary in most cases, and Molecule can achieve better performance with a vectorized sandbox design and cached functions.

They compare Molecule with the baseline method across different cases: CPU only, DPU only, and cross CPU and DPU. In all cases, IPC-based DAG optimizations achieve significantly better latency than the baseline, and nIPC outperforms the baseline method by 10-13x. The authors also discuss the use of DMA to transfer data between CPU and FPGA functions in Molecule's nIPC, which incurs low cost. Comparing the basic approach with a shared-memory optimization based on data retention shows that the optimization can effectively improve end-to-end performance by mitigating unnecessary data movement.

## Molecule: Overall Performance and Comparison

Molecule improves function density on CPU-DPU computers by utilizing DPUs, achieving more than 50% higher density with 2 BlueField DPUs for a Python image processing function, while not improving per-PU density. The implementation of matrix operations as CPU and FPGA functions shows that the latter achieves 2.15x to 2.82x lower latency than the former.

Molecule's cfork is the first optimization that can fork a container-based serverless function and can overcome the challenges of forking multi-threaded processes. Additionally, it is also the first to support cross-PU fork, which is necessary for heterogeneous serverless computing. Molecule uses IPC for communication and neighbor IPC to achieve the fastest cross-PU communication latencies.

## MITOSIS: Prepare Time, Cold Start Time, and Execution Time

The preparation time is the time for the parent to prepare a remote fork, and it differs between different techniques. CRIU-local and CRIU-remote take time to checkpoint a container, while MITOSIS has the fork_prepare time. Caching and FaasNET do not have this phase. MITOSIS is significantly faster in preparation than CRIU-local and CRIU-remote, reducing the prepare time by 94%. CRIU variants are slower due to copying the container state from memory to filesystems.

The cold start time is measured as the time between receiving the function request and the first line of code executing; MITOSIS is slower only than caching. Caching is the fastest, taking only 0.5 ms. MITOSIS comes next at only 6 ms and is faster than CRIU-local, CRIU-remote, and FaasNET by up to 99%, 94%, and 97%, respectively. The cold start time of MITOSIS is dominated by container setup, while CRIU-local is dominated by copying the entire checkpoint file. The cold start cost of FaasNET is dominated by the runtime initialization of the function.

For execution time, MITOSIS is slower than caching, CRIU-local, and FaasNET, except for the hello (H) function. The overhead is mainly due to page faults and reading remote memory; performance is most affected for recognition (R), which reads 321 MB of parent memory. MITOSIS+cache reduces the number of remote memory accesses and improves performance by up to 17%, making MITOSIS close to or better than CRIU-local and FaasNET. Caching is always faster than FaasNET and CRIU-local as it has no page fault overhead. Finally, MITOSIS is faster than CRIU-remote thanks to bypassing the DFS when reading remote pages.

## MITOSIS: Bottleneck Analysis and Throughput Comparison

The bottleneck analysis shows that while using a single seed function is optimal for resource usage, the parent-side network bandwidth and the two RPC threads can become bottlenecks in MITOSIS, and the aggregated client-side CPU resources processing the function logic can also become a bottleneck. MITOSIS achieves lower throughput than caching when the parent-side network bandwidth is the bottleneck, but similar throughput when the children's CPUs are the bottleneck. The RPC is never a bottleneck, as the two kernel threads can handle up to 1.1 million requests per second.

The throughput comparison shows that MITOSIS is up to 8 times faster than CRIU-local and up to 20.4 times faster than CRIU-remote, except for R, which is mostly affected by network latency due to its large working set. The comparison between MITOSIS and caching was already covered in the bottleneck analysis and is not repeated here.

## MITOSIS: State Transfer Performance

MITOSIS uses a microbenchmark comparing different approaches to state transfer between two remote functions, based on the data-transfer test case in ServerlessBench. MITOSIS is shown to be 1.4x to 5x faster than Fn, which uses Redis to transfer data. Compared to CRIU-local/remote, MITOSIS is faster due to its design for fast remote fork. The paper also presents the performance of MITOSIS on FINRA, where it is significantly faster than Fn, CRIU-local, and CRIU-remote, and can scale to a distributed setting with minimal cost.

# Discussion

In this section, I will provide some discussion on the three papers surveyed and the development of serverless technology.

One critical issue in serverless optimization is the cold start problem. Although the three papers have different focuses, they all evaluate the cold start problem. Cold start will always be a challenge in auto-scaling, since there is an inherent tradeoff between performance and the cost of keeping serverless functions warm. Despite many efforts to solve this problem, cold start will continue to be a research topic in the future.

Traditional system techniques such as caching, checkpoint/restore, and fork play important roles in these novel serverless platforms. SOCK uses caching to warm up containers and reduce cold start time. MITOSIS uses remote fork to scale serverless functions across machines and compares against several CRIU-based methods for optimizing cold start overhead.

Heterogeneous serverless is a promising area, but the authors should provide more evidence of its usefulness for serverless users. Although cloud providers like AWS offer different hardware (such as x86 or ARM CPUs) for users to build their serverless functions, that is not truly heterogeneous. For such a new serverless architecture, performance should not be the only measure; the authors should also consider the cost of building such a system, including the initial cost and the running cost. This is the aspect cloud providers and users may care about most.

# Conclusion

In conclusion, the three papers reviewed in this survey propose innovative solutions to the scalability and performance challenges of serverless computing platforms. The first paper introduced SOCK, a container system optimized for serverless workloads that avoids kernel scalability bottlenecks, resulting in significant speedups. The second paper proposed Molecule, the first serverless computing system that utilizes heterogeneous computers to improve function density and application performance. Lastly, the paper on MITOSIS introduced an operating system primitive that leverages RDMA to reduce function tail latency and memory usage in serverless workflows that require state transfer.

Overall, these papers demonstrate that serverless computing platforms are still evolving, and there is much room for improvement. The proposed solutions have shown promising results and have the potential to shape the future of serverless computing. As the demand for serverless computing continues to grow, it is crucial to address these scalability and performance challenges to ensure that serverless platforms remain cost-effective and efficient.