ACM Transactions on Computer Systems (TOCS)

Latest Articles

Derecho: Fast State Machine Replication for Cloud Services

Cloud computing services often replicate data and may require ways to coordinate distributed actions. Here we present Derecho, a library for such tasks. The API provides interfaces for structuring applications into patterns of subgroups and shards, supports state machine replication within them, and...

SPIN: Seamless Operating System Integration of Peer-to-Peer DMA Between SSDs and GPUs

Recent GPUs enable Peer-to-Peer Direct Memory Access (p2p) from fast peripheral devices like NVMe SSDs to exclude the CPU from the data path between them for efficiency. Unfortunately, using p2p to access files is challenging because of the subtleties of low-level non-standard interfaces, which bypass the OS file I/O layers and may hurt system...

Mitigating Load Imbalance in Distributed Data Serving with Rack-Scale Memory Pooling

To provide low-latency and high-throughput guarantees, most large key-value stores keep the data in the memory of many servers. Despite the natural...

NEWS

New Editor-in-Chief

ACM Transactions on Computer Systems (TOCS) welcomes Michael Swift as its new Editor-in-Chief, effective November 1, 2018. Michael is a Professor in the Computer Sciences Department at the University of Wisconsin-Madison.

Forthcoming Articles
Software Prefetching for Indirect Memory Accesses: A Microarchitectural Perspective

Many modern data processing and HPC workloads are heavily memory-latency bound. A tempting proposition to solve this is software prefetching, where special non-blocking loads are used to bring data into the cache hierarchy just before it is required. However, these are difficult to insert in a way that effectively improves performance, and techniques for automatic insertion are currently limited. This paper develops a novel compiler pass to automatically generate software prefetches for indirect memory accesses, a special class of irregular memory accesses often seen in high-performance workloads. We evaluate this across a wide set of systems, all of which benefit from the technique. We then evaluate the extent to which good prefetch instructions are architecture dependent, and the class of programs that are particularly amenable. Across a set of memory-bound benchmarks, our automated pass achieves average speedups of 1.3x for an Intel Haswell processor, 1.1x for both an ARM Cortex-A57 and Qualcomm Kryo, 1.2x for a Cortex-A72 and an Intel Kaby Lake, and 1.35x for an Intel Xeon Phi Knights Landing, each of which is an out-of-order core, and performance improvements of 2.1x and 2.7x for the in-order ARM Cortex-A53 and first generation Intel Xeon Phi.
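
To make the idea of a software prefetch for an indirect memory access concrete, the following is a minimal hand-written sketch in C, not the paper's compiler pass. It assumes a GCC- or Clang-compatible compiler that provides `__builtin_prefetch`; the function names and the `PREFETCH_DISTANCE` value are hypothetical and chosen only for illustration. As the abstract notes, a good look-ahead distance is likely to depend on the target architecture.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical look-ahead distance in loop iterations. This value is an
   assumption for illustration; a suitable value depends on the machine. */
#define PREFETCH_DISTANCE 16

/* Baseline: an indirect access pattern, values[index[i]], with no prefetch. */
long sum_indirect(const long *values, const size_t *index, size_t n)
{
    assert(n == 0 || (values != NULL && index != NULL));
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += values[index[i]];
    return sum;
}

/* Same loop with a manually inserted software prefetch for the element
   that will be needed PREFETCH_DISTANCE iterations from now. */
long sum_indirect_prefetch(const long *values, const size_t *index, size_t n)
{
    assert(n == 0 || (values != NULL && index != NULL));
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        /* Reading index[i + PREFETCH_DISTANCE] is an ordinary load, so it
           must be bounds-checked; only the prefetch itself is a hint. */
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&values[index[i + PREFETCH_DISTANCE]], 0, 0);
        sum += values[index[i]];
    }
    return sum;
}
```

Both functions return the same result; the prefetch is only a hint to the hardware and does not change program semantics. Whether it helps in practice depends on the data size, the access pattern, and the core, which is the kind of tuning the paper proposes to automate.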

The Arm Triple Core Lock-Step (TCLS) Processor

The Arm Triple Core Lock-Step (TCLS) processor is the natural evolution of Arm Cortex-R Dual Core Lock-Step (DCLS) processors to increase reliability, predictability and availability in safety-critical and ultrareliable applications. TCLS is simple, scalable and easy to deploy in applications where Arm DCLS processors are widely used (e.g., automotive), as well as in new applications where the presence of Arm technology is incipient (e.g., enterprise) or almost non-existent (e.g., space). This article discusses the fundamentals of the Arm TCLS processor, providing key functioning and implementation details. The article also describes a TRL6 proof-of-concept TCLS-based System-on-Chip (SoC) that has been prototyped and tested in an Airbus Defence and Space telecom satellite on-board computer. The article provides implementation results of the latter SoC using commercial and rad-hard process technology.

An Instruction Set Architecture for Machine Learning

ML techniques are conventionally executed on general-purpose processors, which usually are not energy-efficient since they invest excessive hardware resources in flexibility. Consequently, application-specific hardware accelerators have been proposed to improve energy efficiency. However, such accelerators were designed for a small set of ML techniques sharing similar computational patterns, and they adopt complex and informative instructions (control signals) directly corresponding to high-level functional blocks (such as layers in neural networks). The lack of agility in the instruction set prevents such accelerator designs from supporting a variety of different ML techniques with sufficient flexibility and efficiency. We first propose a novel domain-specific ISA for NN accelerators, called Cambricon, which is a load-store architecture that integrates scalar, vector, matrix, logical, data transfer, and control instructions, based on a comprehensive analysis of existing NN techniques. We then extend the application scope of Cambricon from NN to ML techniques. We also propose an assembly language, an assembler, and a runtime to support programming with Cambricon. Our evaluation over a total of 16 representative yet distinct ML techniques demonstrates that Cambricon exhibits strong descriptive capacity over a broad range of ML techniques, and provides higher code density than general-purpose ISAs such as x86, MIPS, and GPGPU.
