Nikitas Dimopoulos,
University of Victoria, Canada
On Neural Computing and Architectures
Abstract. Computational Neuroscience explores brain function in terms of the structures that make up the nervous system through mathematical modeling, simulation and the analysis of experimental data. As a discipline, it differs from machine learning and neural networks, although all three cross-fertilize each other. The main difference is that Computational Neuroscience strives to develop models that are as close as possible to the observed physiology and structures, while neural networks, focusing on learning, develop models that may not be physiologically plausible. In this talk we shall present some of the standard neuron models and our experience using these to model complex structures such as regions of the hippocampus. We shall show the performance and energy requirements of these models on current processors. We shall consider the scalability of the current models towards more complex structures and eventually whole brain simulations and determine that current processors cannot be used to scale these computations. We shall survey efforts in developing architectures that hold the promise of scalability and identify areas of future research.
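As a rough illustration of what a "standard neuron model" can look like, the sketch below simulates a single leaky integrate-and-fire neuron with forward-Euler integration. The abstract does not say which models the talk covers, so the choice of model, the function name `simulate_lif` and all parameter values are assumptions made for illustration only.

```python
def simulate_lif(current, steps=1000, dt=1e-4, tau=0.02, r=1e7,
                 v_rest=-0.065, v_thresh=-0.050, v_reset=-0.065):
    """Forward-Euler simulation of one leaky integrate-and-fire neuron.

    current is a constant input current in amperes; all other parameter
    values are hypothetical defaults (seconds, ohms, volts).
    Returns the list of spike times in seconds.
    """
    v = v_rest
    spikes = []
    for step in range(steps):
        # Membrane equation: tau * dV/dt = -(V - v_rest) + R * I
        v += dt * (-(v - v_rest) + r * current) / tau
        if v >= v_thresh:
            # Threshold crossed: record a spike and reset the membrane.
            spikes.append(step * dt)
            v = v_reset
    return spikes


weak = simulate_lif(1e-9)    # R*I = 10 mV, below the 15 mV gap to threshold
strong = simulate_lif(2e-9)  # R*I = 20 mV, above the gap, so the neuron fires
print(len(weak), len(strong))
```

Even this simplest model costs one state update per neuron per time step, which hints at why scaling to large structures raises the performance and energy questions discussed in the talk.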
Manolis Katevenis,
FORTH-ICS and University of Crete, Greece
Cluster Communication Latency: towards approaching the Minimum Hardware Limits, on Low-Power Platforms
Abstract. Cluster computing is a necessity for High Performance Computing (HPC) beyond the scale reachable by hardware cache coherence (the scale of a single "node"). Unfortunately, still today, inter-node communication in many clusters (other than expensive and power-hungry ones) is based on old-time hardware and software architectures, dating back to when such communication was viewed as "I/O rather than memory" and hence non-latency-critical. Although we know how to reduce latency when accessing "memory", the majority of low-cost and low-power commercial platforms, still today, do not care to do this when "performing I/O". Within a series of 5 projects, we are working towards fixing this problem for modern energy-efficient ARM-based platforms, which hold great promise for Data Centers and HPC; this talk will present a summary of our efforts, under the general name "UNIMEM", and in particular the following points. To reduce latency and energy consumption during communication, we need to avoid interrupts and system calls and reduce the number of data copies. Single-word (e.g. control/sync) communication can be efficiently performed via remote load/store instructions, provided there is a Global Address Space (GAS). Block-data communication is efficiently performed via remote DMA (RDMA), where again a GAS is required in order to save a data copy at the remote node. DMA and I/O have to be cache-coherent (not always provided, even today); at the arrival node, it is desirable to be able to control whether or not arriving data is allocated in the cache. RDMA initiation without a system call requires virtual address arguments, which we translate using the IOMMU (aka System MMU). For scalability, address translation should be performed at the destination, i.e. we advocate a 64-bit Global Virtual Address Space (GVAS); coupled with inter-node (process) protection, this requires 80 to 96 bits within network packets.
These are incompatible with current MMUs, which always translate load/store addresses at the issuer, and with current DMA engines and I/O maps, which only handle 40 or 48 address bits. Allowing RDMA between arbitrary (user) memory areas (as opposed to only registered and pinned ones), without any extra data copies, requires RDMA to tolerate and recover from page faults. RDMA completion detection is tricky, especially in the presence of multipath routing: it requires hardware support at the receiver NI, in order to avoid one network round-trip, and mailboxes (hardware queues) for notification without interrupts. Finally, when trying to have MPI communicate with zero-copy and single-network-trip latency, we would like either the receive call to accept data at a system-selected address, contrary to the current standard, or send-receive matching to be performed at the sender, which cannot be done in the case of MPI_ANY_SOURCE.
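The 80 to 96 bits quoted in this abstract follow from adding a protection field of 16 to 32 bits to a 64-bit global virtual address. The sketch below packs such an address into a packet field to make the arithmetic concrete. The field layout (protection identifier in the upper bits), the function name `pack_global_address` and the example values are hypothetical, and are not taken from the UNIMEM design.

```python
def pack_global_address(protection_id, virtual_address, protection_bits=16):
    """Pack a protection id and a 64-bit global virtual address into bytes.

    Hypothetical layout: protection id in the high bits, address in the
    low 64 bits, big-endian on the wire.
    """
    if not 0 <= virtual_address < (1 << 64):
        raise ValueError("virtual address must fit in 64 bits")
    if not 0 <= protection_id < (1 << protection_bits):
        raise ValueError("protection id does not fit in its field")
    total_bits = 64 + protection_bits
    value = (protection_id << 64) | virtual_address
    # Round up to whole bytes for the packet field.
    return value.to_bytes((total_bits + 7) // 8, "big")


small = pack_global_address(0x12, 0x7FFF_0000_1000, protection_bits=16)
large = pack_global_address(0x12, 0x7FFF_0000_1000, protection_bits=32)
print(len(small) * 8, len(large) * 8)  # field widths in bits
```

Either width exceeds the 40 or 48 address bits that current DMA engines and I/O maps handle, which is the incompatibility the abstract points out.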
Wayne Luk,
Imperial College London, United Kingdom
Progress towards self-optimising and self-verifying design
Abstract. This talk describes recent research on self-optimising and self-verifying design, a vision first articulated in the 2007 Symposium on the Future of Computing in memory of Professor Stamatis Vassiliadis. Developments based on machine learning and transparent assertions will be presented; the promise of the proposed approaches will be discussed.
Trevor Mudge,
University of Michigan, United States
Machine Learning Processors — Deja Vu?
Abstract. Recently we have seen the emergence of processors targeted at machine learning. The best example, or at least the one about which there is the most detail, is Google’s Tensor Processing Unit. This new class of processors bears a striking similarity to existing digital signal processors and graphics processing units. This talk will outline similarities and differences and comment on the possibility of a common solution.
Walid Najjar,
University of California - Riverside, United States
Embracing Rather Than Fighting Memory Latency
Abstract. Memory latency remains one of the most daunting challenges in computer architecture. Most modern multicore systems mitigate it through the use of massive cache hierarchies that take up over 80% of the chip area and a proportional fraction of its power budget. This approach, however, presupposes some form of temporal or spatial locality. A growing fraction of modern workloads, such as databases, data and graph analytics, data mining, bioinformatics, etc., does not exhibit much spatial or temporal locality. These workloads are particularly well suited for an alternative solution: latency masking multithreading that trades off memory bandwidth for latency. In this talk I describe how latency masking multithreaded execution on FPGAs can achieve a higher throughput than CPUs and/or GPUs on sparse linear algebra and database operations.
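The trade-off between bandwidth and latency mentioned above can be estimated with Little's law: the number of memory requests that must be in flight equals latency times bandwidth, divided by the request size. The sketch below assumes each thread keeps exactly one request outstanding; the function name `threads_to_mask_latency` and the example figures are hypothetical and do not come from the talk.

```python
import math


def threads_to_mask_latency(latency_ns, bandwidth_gb_s, bytes_per_request):
    """Threads needed to keep the memory channel busy (Little's law).

    Assumes one outstanding request per thread. Since 1 GB/s equals
    1 byte/ns, latency_ns * bandwidth_gb_s is the number of bytes that
    must be in flight to saturate the channel.
    """
    bytes_in_flight = latency_ns * bandwidth_gb_s
    return math.ceil(bytes_in_flight / bytes_per_request)


# Hypothetical figures: 100 ns latency, 10 GB/s channel, 64-byte requests.
print(threads_to_mask_latency(100, 10, 64))  # prints 16
```

With enough ready threads to cover this count, the memory channel stays busy without relying on locality, which is the property that makes the approach attractive for irregular workloads.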
Yale Patt,
The University of Texas at Austin, United States
Economics Be Damned!
Abstract. I have been advocating breaking the arbitrary barriers between layers of my transformation hierarchy for decades, insisting for example that until the algorithm people talk to the microarchitects, we are never going to get the full benefit of what computing can provide. And I have given examples such as predication, where the compiler, the ISA and the microarchitect can get rid of the misprediction penalty of conditional branches. As long as Moore's Law gave us more and faster transistors, we could be lazy. But now that that is about to end, we need to look elsewhere for continued performance benefit. Ergo, my mantra: break the layers! My critics argue that portability will go out the window and that economics will prevent industry from getting on board. To which I say: Economics Be Damned! I will explain why I believe this position makes sense.
Per Stenström,
Chalmers University of Technology, Sweden
Efficient Computing in the Post-Moore Era
Abstract. The free lunch of leveraging the growth in transistor count offered by Moore’s Law over about five decades is now over. Fortunately, there is significant headroom for computer architects to use compute and memory resources more efficiently. Two such opportunities will be addressed in this talk. A first opportunity is to enable optimizations across the compute stack while retaining well-defined functional interfaces. I will first talk about a concept called across-the-stack cache optimization in which static information from the parallel programming model level is used by the runtime together with run-time statistics collected at the architecture level to enable a range of new cache optimizations, among them global dead-block cache management. A second opportunity I will cover is to remove data value redundancies in the cache/memory hierarchy. We have pioneered statistical cache/memory compression as a means to use cache and memory resources substantially more efficiently, often achieving a compression ratio of more than 3X without imposing any harmful overhead on the critical memory access path. These are examples of many opportunities that will be explored in the years to come.
Uri Weiser,
Technion - Israel Institute of Technology, Israel
When to Process in Storage?
Abstract. The memory hierarchy in modern computing systems performs well for workloads that exhibit temporal data locality. Thus, data that is accessed frequently is brought to DRAM and caches that are close to the computing cores, allowing fast access to repeatedly used data, high data bandwidth and a reduction in the energy wasted on data movement. However, this architecture does not efficiently support some Big Data programs. Furthermore, many Big Data applications demonstrate data access patterns without temporal locality (a.k.a. “read once”). When running these applications on modern computing systems, large amounts of “read once” data are nevertheless transmitted and copied to memory hierarchy levels, leading to energy waste and bandwidth pressure. In this talk we’ll ask the question: under what conditions will processing data in storage be more effective?
Mateo Valero,
Barcelona Supercomputing Center, Spain
From Classical to Runtime Aware Architectures
Abstract. In recent years, the traditional ways of keeping hardware performance increasing at the rate predicted by Moore's Law have vanished. When uni-cores were the norm, hardware design was decoupled from the software stack thanks to a well-defined Instruction Set Architecture (ISA). This simple interface allowed developing applications without worrying too much about the underlying hardware, while computer architects proposed techniques to aggressively exploit Instruction-Level Parallelism (ILP) in superscalar processors. Current multi-cores are designed as simple symmetric multiprocessors on a chip. While these designs are able to compensate for the clock frequency stagnation, they face multiple problems in terms of power consumption, programmability, resilience or memory. The solution is to give more responsibility to the runtime system and to let it tightly collaborate with the hardware. The runtime has to drive the design of future multi-core architectures. In this talk, we will introduce an approach towards a Runtime-Aware Architecture (RAA), a massively parallel architecture designed from the runtime's perspective. RAA aims at supporting the parallel runtime system by enabling fine-grain tasking or managing hybrid memory hierarchies.