ACM Transactions on Computer Systems (TOCS)

Latest Articles

Pivot Tracing: Dynamic Causal Monitoring for Distributed Systems

Monitoring and troubleshooting distributed systems is notoriously difficult; potential problems are complex, varied, and unpredictable. The monitoring and diagnosis tools commonly used today (logs, counters, and metrics) have two important limitations: what gets recorded is defined a priori, and the information is recorded in a...

Building Consistent Transactions with Inconsistent Replication

Application programmers increasingly prefer distributed storage systems with strong consistency and distributed transactions (e.g., Google’s...

Ryoan: A Distributed Sandbox for Untrusted Computation on Secret Data

Users of modern data-processing services such as tax preparation or genomic screening are forced to trust them with data that the users wish to keep secret. Ryoan protects secret data while it is processed by services that the data owner does not trust. Accomplishing this goal in a distributed setting is difficult, because the user has no control...


New Editor-in-Chief

ACM Transactions on Computer Systems (TOCS) welcomes Michael Swift as its new Editor-in-Chief as of November 1, 2018. Michael is a Professor in the Computer Sciences Department at the University of Wisconsin-Madison.

Forthcoming Articles

Derecho: Fast State Machine Replication for Cloud Services

The coming generation of Internet-of-Things (IoT) applications will process massive amounts of incoming data while supporting data mining and online learning. In cases with demanding real-time requirements, such systems behave as smart memories: high-bandwidth services that capture sensor input, process it using machine-learning tools, replicate and store interesting data (discarding uninteresting content), update knowledge models, and trigger urgently needed responses. Derecho is a high-throughput library for building smart memories and similar services. At its core, Derecho implements atomic multicast and state machine replication. Derecho's replicated template defines a replicated type; the corresponding objects are associated with subgroups, which can be sharded into key-value structures. The persistent and volatile storage templates implement version vectors with optional NVM persistence. These support time-indexed access, offering lock-free snapshot isolation that blends temporal precision and causal consistency. Derecho automates application management, supporting multigroup structures and providing consistent knowledge of the current membership mapping. A query can access data from many shards or subgroups, and consistency is guaranteed without any form of distributed locking. Whereas many systems run consensus on the critical path, Derecho requires consensus only when updating membership. The approach results in a software library offering exceptional speed and flexibility.

Lock-Unlock: Is That All? A Pragmatic Analysis of Locking in Software Systems

A plethora of mutex lock algorithms have been designed to mitigate performance bottlenecks. Unfortunately, there is currently no broad study of the behavior of lock algorithms on realistic applications that considers different performance metrics (energy efficiency and tail latency in addition to throughput). In this paper, we perform an analysis of synchronization to provide application developers with enough information to design fast, scalable, and energy-efficient synchronization. First, we study the performance of 28 lock algorithms, on 40 applications, on four multicore machines, considering throughput, energy efficiency, and tail latency. Second, we describe nine lock performance bottlenecks and propose six guidelines to help developers choose a lock algorithm. From our analysis, we make several observations: (i) applications stress the full locking API (e.g., trylocks), (ii) the memory footprint of a lock can affect performance, (iii) the interaction between locks and scheduling is an application performance factor, and (iv) lock tail latencies may or may not affect application tail latency. These findings highlight that locking involves more considerations than the simple lock-unlock interface and call for further research on designing adaptive locks with a low memory footprint that fully and efficiently support the full lock interface and consider all performance metrics.
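As a rough illustration of what "the full locking API" means beyond lock and unlock, the sketch below shows a trylock in Python, where `threading.Lock.acquire(blocking=False)` returns immediately instead of waiting. The helper `try_update` and the `counter` dictionary are hypothetical names chosen for this example; nothing here is taken from the paper's benchmarks.

```python
import threading


def try_update(lock, counter):
    """Attempt a non-blocking update; return True if the lock was obtained."""
    # acquire(blocking=False) behaves as a trylock: it never waits.
    if lock.acquire(blocking=False):
        try:
            counter["value"] += 1
            return True
        finally:
            lock.release()
    # Lock was busy: the caller can do other work instead of blocking.
    return False


lock = threading.Lock()
counter = {"value": 0}

# The lock is free, so the trylock path succeeds.
uncontended = try_update(lock, counter)

# Hold the lock to simulate another owner, then try again.
lock.acquire()
contended = try_update(lock, counter)
lock.release()
```

A lock algorithm that is fast for plain lock and unlock may still handle this non-blocking path poorly, which is one reason the abstract stresses the complete interface.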

Venice: An Effective Resource Sharing Architecture for Data Center Servers

Consolidated server racks are quickly becoming the standard infrastructure for engineering, business, medicine, and science. Such servers are still designed much as they were when organized as individual, distributed systems. Given that many fields increasingly rely on big-data analytics, we can improve cost-effectiveness and performance by flexibly allowing resources to be shared across nodes. Here we describe Venice, a family of data-center server architectures that includes a strong communication substrate as a first-class resource. Venice supports a diverse set of resource-joining mechanisms that enables applications to leverage non-local resources efficiently. We have constructed a hardware prototype to better understand the implications of design decisions about system support for resource sharing. We use it to measure the performance of at-scale applications and to explore performance, power, and resource-sharing transparency tradeoffs (i.e., how many programming changes are needed). We analyze these tradeoffs for sharing memory, accelerators, or NICs. We find that reducing or hiding latency is particularly important, that the communication channels used should match the sharing access patterns of the applications, and that we can improve performance by exploiting inter-channel collaboration.

Mitigating Load Imbalance in Distributed Data Serving Through Rack-Scale Memory Pooling

In this work, we introduce RackOut, a memory pooling technique that leverages the one-sided remote read primitive of emerging rack-scale systems to mitigate load imbalance while respecting service-level objectives. In RackOut, the data is aggregated at rack-scale granularity, with all of the participating servers in the rack jointly servicing all of the rack's micro-shards. We develop a queuing model to evaluate the impact of RackOut at the datacenter scale. In addition, we implement a RackOut proof-of-concept key-value store, evaluate it on two experimental platforms based on RDMA and Scale-Out NUMA, and use these results to validate the model. We devise two distinct approaches to load balancing within a RackOut unit, one based on random selection of nodes (RackOut static) and another based on an adaptive load balancing mechanism (RackOut adaptive). Our results show that RackOut static increases throughput by up to 6× for RDMA and 8.6× for Scale-Out NUMA compared to a scale-out deployment, while respecting tight tail latency service-level objectives. RackOut adaptive improves the throughput by 30% for workloads with 20% of writes over RackOut static...

SPIN: Seamless Operating System Integration of Peer-to-Peer DMA Between SSDs and GPUs

Recent GPUs enable Peer-to-Peer Direct Memory Access (p2p) from fast peripheral devices like NVMe SSDs to exclude the CPU from the data path between them for efficiency. Unfortunately, using p2p to access files is challenging because of the subtleties of low-level non-standard interfaces, which bypass the OS file I/O layers and may hurt system performance. Developers must possess intimate knowledge of low-level interfaces in order to manually handle the subtleties of data consistency and misaligned accesses. We present SPIN, which integrates p2p into the standard OS file I/O stack, dynamically activating p2p where appropriate, transparently to the user. It combines p2p with page cache accesses and re-enables read-ahead for sequential reads, all while maintaining standard POSIX FS consistency, portability across GPUs and SSDs, and compatibility with virtual block devices such as software RAID. We evaluate SPIN on NVIDIA and AMD GPUs using standard file I/O benchmarks, application traces, and end-to-end experiments. SPIN achieves significant performance speedups across a wide range of workloads, exceeding p2p throughput by up to an order of magnitude.

Deca: A Garbage Collection Optimizer for In-Memory Data Processing

In-memory caching of intermediate data and active combining of data in shuffle buffers have been shown to be very effective in minimizing the re-computation and I/O cost in big data processing systems. However, it has also been widely reported that these techniques create a large number of long-living data objects in the heap. These generated objects may quickly saturate the garbage collector, especially when handling a large dataset, and hence limit the scalability of the system. To eliminate this problem, we propose a lifetime-based memory management framework, which, by automatically analyzing the user-defined functions and data types, obtains the expected lifetime of the data objects, and then allocates and releases memory space accordingly to minimize the garbage collection overhead. In particular, we present Deca, a concrete implementation of our proposal on top of Spark, which transparently decomposes and groups objects with similar lifetimes into byte arrays and releases their space all at once when their lifetimes come to an end. When systems are processing very large data, Deca also provides field-oriented memory pages to ensure high compression efficiency.
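The idea of grouping objects with similar lifetimes into byte arrays can be sketched in a few lines. This is only an illustrative toy in Python, not Deca's implementation (which targets Spark on the JVM): the record layout `RECORD`, the class `LifetimeGroup`, and its methods are assumptions made up for this example.

```python
import struct

# Hypothetical fixed-size record: a 64-bit integer key and a 64-bit float value.
RECORD = struct.Struct("<qd")


class LifetimeGroup:
    """Holds records expected to die together in one contiguous bytearray."""

    def __init__(self):
        self.buf = bytearray()
        self.count = 0

    def append(self, key, value):
        # Decompose the record into raw bytes instead of keeping an object.
        self.buf += RECORD.pack(key, value)
        self.count += 1

    def get(self, i):
        # Rebuild the (key, value) pair on demand from its byte offset.
        return RECORD.unpack_from(self.buf, i * RECORD.size)

    def release(self):
        # Drop every record in the group in one step.
        self.buf = bytearray()
        self.count = 0


group = LifetimeGroup()
for k in range(3):
    group.append(k, k * 0.5)

second = group.get(1)
size_before = len(group.buf)
group.release()
```

Because the collector sees one buffer per group instead of one object per record, the number of live objects it must trace stays small however many records are cached.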
