The coming generation of Internet-of-Things (IoT) applications will process massive amounts of incoming data while supporting data mining and online learning. In cases with demanding real-time requirements, such systems behave as smart memories: high-bandwidth services that capture sensor input, process it using machine-learning tools, replicate and store interesting data (discarding uninteresting content), update knowledge models, and trigger urgently-needed responses.
Derecho is a high-throughput library for building smart memories and similar services. At its core, Derecho implements atomic multicast and state machine replication. Derecho's replicated
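The state machine replication core mentioned above can be illustrated with a minimal sketch (this is not Derecho's API, just the general technique): every replica applies the same totally ordered command stream deterministically, so all replicas converge to the same state.

```python
# Minimal state machine replication sketch (illustrative; not Derecho's API).
# An atomic multicast delivers commands in the same total order to every
# replica; applying them deterministically keeps all replicas identical.

class Replica:
    def __init__(self):
        self.state = {}   # replicated key-value state
        self.log = []     # totally ordered command log

    def deliver(self, command):
        """Apply a command; every replica sees the same delivery order."""
        self.log.append(command)
        op, key, value = command
        if op == "put":
            self.state[key] = value
        elif op == "delete":
            self.state.pop(key, None)

# Two replicas receiving the same ordered command stream stay identical.
r1, r2 = Replica(), Replica()
for cmd in [("put", "x", 1), ("put", "y", 2), ("delete", "x", None)]:
    r1.deliver(cmd)
    r2.deliver(cmd)

assert r1.state == r2.state == {"y": 2}
```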
A plethora of mutex lock algorithms have been designed to mitigate performance bottlenecks. Unfortunately, there is currently no broad study of the behavior of lock algorithms on realistic applications that considers different performance metrics (energy efficiency and tail latency in addition to throughput). In this paper, we perform an analysis of synchronization to provide application developers with enough information to design fast, scalable and energy-efficient synchronization. First, we study the performance of 28 lock algorithms, on 40 applications, on four multicore machines, considering throughput, energy efficiency and tail latency. Second, we describe nine lock performance bottlenecks, and propose six guidelines to help developers choose a lock algorithm. From our analysis, we make several observations: (i) applications stress the full locking API (e.g., trylocks), (ii) the memory footprint of a lock can affect performance, (iii) the interaction between locks and scheduling is an application performance factor, and (iv) lock tail latencies may or may not affect application tail latency. These findings highlight that locking involves more considerations than the simple lock/unlock interface and call for further research on designing low-memory-footprint adaptive locks that efficiently support the full lock interface and consider all performance metrics.
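Observation (i) above, that applications stress the full locking API rather than just lock/unlock, can be seen in a small sketch (the workload here is hypothetical; the paper's benchmarks use the full pthread-style interface): a thread that fails a trylock does fallback work instead of blocking.

```python
# Sketch of why trylock support matters: a worker that cannot acquire the
# lock immediately does useful local work instead of blocking.
# (Hypothetical workload, for illustration only.)

import threading

lock = threading.Lock()
fallback_runs = 0

def worker():
    global fallback_runs
    # trylock: acquire(blocking=False) returns immediately with True/False
    if lock.acquire(blocking=False):
        try:
            pass  # fast path: critical section
        finally:
            lock.release()
    else:
        fallback_runs += 1  # contended path: do local work, retry later

lock.acquire()  # simulate contention: lock is held elsewhere
t = threading.Thread(target=worker)
t.start()
t.join()
lock.release()

assert fallback_runs == 1  # the worker took the non-blocking fallback path
```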
Consolidated server racks are quickly becoming the standard infrastructure for engineering, business, medicine, and science. Such servers are still designed much as they were when organized as individual, distributed systems. Given that many fields increasingly rely on big-data analytics, we can improve cost-effectiveness and performance by flexibly allowing resources to be shared across nodes. Here we describe Venice, a family of data-center server architectures that includes a strong communication substrate as a first-class resource. Venice supports a diverse set of resource-joining mechanisms that enable applications to leverage non-local resources efficiently. We have constructed a hardware prototype to better understand the implications of design decisions about system support for resource sharing. We use it to measure the performance of at-scale applications and to explore performance, power, and resource-sharing transparency tradeoffs (i.e., how many programming changes are needed). We analyze these tradeoffs for sharing memory, accelerators, or NICs. We find that reducing/hiding latency is particularly important, that the communication channels used should match the applications' sharing access patterns, and that we can improve performance by exploiting inter-channel collaboration.
In this work, we introduce RackOut, a memory pooling technique that leverages the one-sided remote read primitive of emerging rack-scale systems to mitigate load imbalance while respecting service-level objectives. In RackOut, the data is aggregated at rack-scale granularity, with all of the participating servers in the rack jointly servicing all of the rack's micro-shards. We develop a queuing model to evaluate the impact of RackOut at the datacenter scale. In addition, we implement a RackOut proof-of-concept key-value store, evaluate it on two experimental platforms based on RDMA and Scale-Out NUMA, and use these results to validate the model. We devise two distinct approaches to load balancing within a RackOut unit, one based on random selection of nodes (RackOut static) and another based on an adaptive load-balancing mechanism (RackOut adaptive). Our results show that RackOut static increases throughput by up to 6× for RDMA and 8.6× for Scale-Out NUMA compared to a scale-out deployment, while respecting tight tail latency service-level objectives. RackOut adaptive improves the throughput by 30% for workloads with 20% of writes over RackOut static.
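The two load-balancing policies described above can be sketched as follows (names and structure are illustrative simplifications, not the paper's implementation): since any server in the rack can serve any micro-shard via one-sided remote reads, the static policy picks a uniformly random server, while the adaptive policy steers each request toward the least-loaded one.

```python
# Toy model of load balancing within a RackOut unit (illustrative only).
# Any server can serve any micro-shard, so the balancer is free to choose.

import random

SERVERS = list(range(8))  # one rack-scale RackOut unit

def rackout_static(rng=random):
    """Static policy: uniformly random server in the rack."""
    return rng.choice(SERVERS)

def rackout_adaptive(load):
    """Adaptive policy (one simple realization): least-loaded server."""
    return min(SERVERS, key=lambda s: load[s])

load = {s: 0 for s in SERVERS}
for _ in range(1000):
    s = rackout_adaptive(load)
    load[s] += 1

# The adaptive policy keeps load perfectly balanced in this toy model,
# whereas random selection would show binomial imbalance.
assert max(load.values()) - min(load.values()) <= 1
```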
Recent GPUs enable Peer-to-Peer Direct Memory Access (p2p) from fast peripheral devices like NVMe SSDs to exclude the CPU from the data path between them for efficiency. Unfortunately, using p2p to access files is challenging because of the subtleties of low-level non-standard interfaces, which bypass the OS file I/O layers and may hurt system performance. Developers must possess intimate knowledge of low-level interfaces in order to manually handle the subtleties of data consistency and misaligned accesses. We present SPIN, which integrates p2p into the standard OS file I/O stack, dynamically activating p2p where appropriate, transparently to the user. It combines p2p with page cache accesses and re-enables read-ahead for sequential reads, all while maintaining standard POSIX FS consistency, portability across GPUs and SSDs, and compatibility with virtual block devices such as software RAID. We evaluate SPIN on NVIDIA and AMD GPUs using standard file I/O benchmarks, application traces and end-to-end experiments. SPIN achieves significant performance speedups across a wide range of workloads, exceeding p2p throughput by up to an order of magnitude.
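The key idea of routing each read either through the page cache or directly via p2p can be sketched as a simple dispatch policy (this heuristic is a stand-in for illustration; SPIN's actual policy is more sophisticated):

```python
# Sketch of per-request path selection between the page cache and p2p DMA
# (illustrative heuristic, not SPIN's actual policy).

PAGE = 4096

def choose_path(offset, length, cached_ranges):
    """Return 'page_cache' or 'p2p' for a file read request."""
    # Misaligned requests need the kernel's alignment handling.
    if offset % PAGE or length % PAGE:
        return "page_cache"
    # If the range is already resident, copying from RAM beats an SSD DMA.
    if any(lo <= offset and offset + length <= hi for lo, hi in cached_ranges):
        return "page_cache"
    # Large, aligned, uncached reads go peer-to-peer, bypassing the CPU.
    return "p2p"

assert choose_path(0, 8 * PAGE, []) == "p2p"                  # aligned, cold
assert choose_path(100, PAGE, []) == "page_cache"             # misaligned
assert choose_path(0, PAGE, [(0, 4 * PAGE)]) == "page_cache"  # cached
```

The point of folding this decision into the standard file I/O path, rather than exposing it to applications, is that unmodified POSIX programs get p2p acceleration only when it actually helps.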
In-memory caching of intermediate data and active combining of data in shuffle buffers have been shown to be very effective in minimizing the re-computation and I/O cost in big data processing systems. However, it has also been widely reported that these techniques would create a large amount of long-living data objects in the heap. These generated objects may quickly saturate the garbage collector, especially when handling a large dataset, and hence, limit the scalability of the system. To eliminate this problem, we propose a lifetime-based memory management framework, which, by automatically analyzing the user-defined functions and data types, obtains the expected lifetime of the data objects, and then allocates and releases memory space accordingly to minimize the garbage collection overhead. In particular, we present Deca, a concrete implementation of our proposal on top of Spark, which transparently decomposes and groups objects with similar lifetimes into byte arrays and releases their space altogether when their lifetimes come to an end. When systems are processing very large data, Deca also provides field-oriented memory pages to ensure high compression efficiency.
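The grouping of same-lifetime objects into byte arrays can be sketched as follows (illustrative only; Deca does this transparently on the JVM for Spark data, whereas the pool below is a hand-written Python analogue): records sharing a lifetime are serialized into one flat buffer, so the garbage collector never traces them individually and the whole group is released in one step.

```python
# Sketch of lifetime-based grouping: pack same-lifetime records into one
# byte array and release them together (illustrative analogue of Deca's idea).

import struct

class LifetimePool:
    """Fixed-size records in one bytearray; freed all at once."""
    RECORD = struct.Struct("<qd")  # e.g. (key: int64, value: float64)

    def __init__(self):
        self.buf = bytearray()

    def append(self, key, value):
        self.buf += self.RECORD.pack(key, value)

    def get(self, i):
        return self.RECORD.unpack_from(self.buf, i * self.RECORD.size)

    def release(self):
        # One deallocation for the whole group: no per-object GC work.
        self.buf = bytearray()

pool = LifetimePool()
for k in range(3):
    pool.append(k, k * 0.5)

assert pool.get(2) == (2, 1.0)
pool.release()
assert len(pool.buf) == 0
```

Because the records live inside a single buffer rather than as individual heap objects, the collector's tracing cost no longer grows with the number of cached records, which is the scalability problem the abstract describes.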