(With a little effort, this page might work itself into reasonable shape. Due to perpetual lack of time, I am always behind in updating it. The information below, though, should suffice to whet your appetite. If you want more information on any of these projects, please drop me a note.)
The overarching research umbrella of the Parallel Architecture Group @ Northwestern (PARAG@N) is energy-efficient computing. At the macro scale, computers consume inordinate amounts of energy, negatively impacting the economics and environmental footprint of computing. At the micro scale, power constraints prevent us from riding and extending Moore's Law. We attack both problems by identifying sources of energy inefficiencies and working across the hardware and software stacks to address them. Thus, our work extends from novel devices all the way to application software through circuit and hardware designs, compilers and runtimes, OS optimizations, and programming languages.
Our path has taken us from conventional computing, to nanophotonics in computer architectures, and more recently to quantum systems. The sections below elaborate on our research directions. We are also always in need of brilliant, passionate Ph.D. students to work with us. If you find any of the projects below interesting, consider applying to our program.
A (quite old by now) overview of our research at PARAG@N was presented in an invited talk at IBM T.J. Watson Research Center and Google in March 2012. Many exciting new developments have happened since then, but the talk is a good starting point.
QSys: Innovating the Quantum Systems Layer
[Image: The last few stages of an IBM Q dilution refrigerator, used for cooling quantum chips down to temperatures of a few millikelvin.]
There has been great progress in quantum computing hardware, and the number of qubits per device is growing exponentially. However, qubits are still very fragile: transmon qubits decohere within a few microseconds, gates are plagued by high error rates, error-correcting codes require orders of magnitude more qubits than are available today, and many useful quantum algorithms require even more. Algorithms and systems software can work together with the hardware to alleviate many of the problems affecting current noisy intermediate-scale quantum (NISQ) hardware, and can achieve orders-of-magnitude more efficient quantum computation. The QSys project seeks to innovate at the intersection of physical machines, systems software, and architecture, aiming to make quantum computing practical decades before hardware alone could achieve this goal. As a starting point, we collaborated with researchers at Princeton University and the University of Chicago on the design of SupermarQ, a scalable, hardware-agnostic quantum benchmark suite that uses application-level metrics to measure performance. SupermarQ was motivated by the scarcity of techniques to reliably measure and compare the performance of quantum computations across today's wide variety of quantum architectures and devices. QSys is a new project, and we are looking for Ph.D. students who are passionate about contributing to this nascent field. Students with combined physics and computer science or engineering backgrounds are especially encouraged to apply.
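To make the application-level-metrics idea concrete, here is a minimal sketch of scoring a circuit with normalized features in the spirit of SupermarQ; the circuit representation and both feature formulas are simplified illustrations, not SupermarQ's actual definitions.

    # Illustrative sketch: score a quantum circuit with normalized,
    # application-level features in the spirit of SupermarQ. The Gate
    # type and both formulas are simplified stand-ins, not the suite's
    # actual feature definitions.
    from dataclasses import dataclass

    @dataclass
    class Gate:
        name: str
        qubits: tuple  # indices of the qubits the gate acts on

    def entanglement_ratio(circuit):
        """Fraction of gates that touch two or more qubits (0..1)."""
        if not circuit:
            return 0.0
        return sum(len(g.qubits) > 1 for g in circuit) / len(circuit)

    def parallelism(circuit, n_qubits, depth):
        """How densely the program fills the n_qubits x depth gate slots (0..1)."""
        return min(1.0, len(circuit) / (n_qubits * depth)) if depth else 0.0

    # Toy 3-qubit GHZ-style circuit: one H, then a CNOT chain (depth 3).
    ghz = [Gate("h", (0,)), Gate("cx", (0, 1)), Gate("cx", (1, 2))]
    print(entanglement_ratio(ghz))                 # 2/3 of gates are two-qubit
    print(parallelism(ghz, n_qubits=3, depth=3))   # 3 gates over 9 slots

A feature vector built this way lets very different devices be compared along the same application-derived axes, rather than by raw qubit counts alone.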
Software & Hardware for Scalable Frictionless Parallelism
Parallelism should be frictionless, allowing every developer to start with the assumption of parallelism instead of being forced to take it up once performance demands it. Considerable progress has been made in achieving this vision on the language and training front; it has been demonstrated that sophomores can learn basic data structures and algorithms in a "parallel first" model enabled by a high-level parallel language. However, achieving both high productivity and high performance on current and future heterogeneous systems requires innovation throughout the hardware/software stack. This project brings two distinct perspectives to this problem: the "theory down" approach, focusing on high-level parallel languages and the theory and practice of achieving provable performance bounds within them; and the "architecture up" approach, focusing on rethinking abstractions at the architectural, operating system, runtime, and compiler levels to optimize raw performance. This is a new project starting in Fall 2021, and we are looking for Ph.D. students in computer systems (architecture, FPGAs, operating systems, compilers) to help us turn our vision into reality. This work is supported by NSF awards SPX-2028851 and SPX-2119069.
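To give a flavor of the "parallel first" style, here is a minimal fork-join sketch; Python and its thread pool merely stand in for a high-level parallel language and its scheduler, and the grain is kept coarse so the shallow task tree suits Python's simple pool (a real work-stealing runtime would handle much finer-grained tasks).

    # A minimal fork-join sketch: the parallel decomposition is the default
    # way to write the reduction, not a retrofit. Python's thread pool is a
    # stand-in for a parallel language runtime; the grain is kept coarse so
    # the shallow task tree cannot exhaust the pool's worker threads.
    from concurrent.futures import ThreadPoolExecutor

    def psum(xs, lo, hi, pool, grain=25_000):
        """Sum xs[lo:hi] by forking the left half and joining the results."""
        if hi - lo <= grain:                    # base case: run serially
            return sum(xs[lo:hi])
        mid = (lo + hi) // 2
        left = pool.submit(psum, xs, lo, mid, pool, grain)   # fork
        right = psum(xs, mid, hi, pool, grain)               # continue inline
        return left.result() + right                         # join

    data = list(range(100_000))
    with ThreadPoolExecutor() as pool:
        print(psum(data, 0, len(data), pool))   # 4999950000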
Galaxy: Computer Architecture Meets Silicon Photonics
[Image: The Andromeda Galaxy (M31), the largest member of the Local Group.]
This project combines advances in parallel computer architecture and silicon photonics to develop architectures that break past the power, bandwidth, and utilization walls (dark silicon) that plague modern processors. The Galaxy architecture of optically-connected disintegrated processors argues that instead of building monolithic chips, we should split them into several smaller chiplets and form a "virtual macro-chip" by connecting them with optical links. The optical links provide bandwidth high enough to break the bandwidth wall entirely, and latency low enough that the virtual macro-chip behaves as a single tightly-coupled chip. As each chiplet has its own power budget and the optical links eliminate the traditional chip-to-chip communication overheads, the macro-chip behaves as an oversized multicore that scales beyond single-chip area limits, while maintaining high yield and reasonable cost (only faulty chiplets need replacement). Our preliminary results indicate that Galaxy scales seamlessly to 4000 cores, making it possible to shrink an entire rack's worth of computational power onto a single wafer. Galaxy was first proposed at WINDS 2010, long before the industry jumped onto chiplet-based designs, and the full design was presented at an EPFL talk in 2014 and published at ICS-2014. This project has advanced the state of the art in silicon photonic interconnects by designing a family of laser power-gating NoCs (EcoLaser, LaC, EcoLaser+), co-designing the on-chip NoC with the architecture in ProLaser, escalating laser power-gating to datacenter optical networks with SLaC and projecting the resulting datacenter energy savings, and overcoming the thermal transfer problems of 3D-stacked electro-optical processor/photonics chips with Parka. Even more exciting, we designed Pho$, a multicore optical cache hierarchy that replaces all private L1/L2 caches with a single, shared, single-cycle-access optical L1 cache. Compared to conventional all-electronic cache hierarchies, Pho$ achieves 1.41x application speedup (4x max) and 31% lower energy-delay product (90% max). To the best of our knowledge, Pho$ is the first practical design of an optical cache that can reach a useful capacity (several MBs). This work was nominated for a Best Paper Award at ISLPED 2021. We are now expanding our design to include optical phase-change memories (non-volatile) for last-level caching. A full list of publications appears on the NSF CCF-1453853 project web page on energy-efficient and energy-proportional silicon photonic manycore architectures, which partially funded this work.
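To see why laser power gating pays off, here is a back-of-the-envelope sketch of the core trade-off (all parameters below are illustrative placeholders, not numbers from the EcoLaser papers): an always-on laser burns static power through every idle cycle, while a gated laser pays a small turn-on overhead per wake-up.

    # Back-of-the-envelope model of laser power gating on an optical link.
    # All numbers are illustrative placeholders, not measured results.
    def laser_energy(power_w, window_s, busy_frac, gated, wakeups=0, turn_on_s=1e-9):
        """Joules the laser burns in a window, always-on vs. power-gated."""
        if not gated:
            return power_w * window_s                   # on even when idle
        on_time_s = busy_frac * window_s + wakeups * turn_on_s
        return power_w * on_time_s                      # pay per wake-up

    window = 1e-3  # 1 ms of execution, link busy 20% of the time
    always_on = laser_energy(0.5, window, 0.2, gated=False)
    gated = laser_energy(0.5, window, 0.2, gated=True, wakeups=10_000)
    print(f"always-on: {always_on * 1e3:.3f} mJ")       # 0.500 mJ
    print(f"gated:     {gated * 1e3:.3f} mJ")           # 0.105 mJ, ~79% saved

The hard design problems, which EcoLaser and its successors tackle, lie in deciding when to wake the laser so the turn-on latency stays off the communication critical path.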
Interweaving the Hardware-Software Parallel Stack
[Image: Core rope memory from the Apollo spacecraft. By threading wires through or around a magnetic ring, knitters encoded 1s and 0s, and hence a ROM-stored program. This is arguably the first example of software "woven" into hardware.]
The Interweaving project seeks to advance the state of the art for parallel systems. Usually, the layers of a parallel system (compiler, runtime, operating system, and hardware) are treated as separate entities with a rigid division of labor. This project investigates an alternative model, Interweaving, in which these layers are integrated as needed to improve the performance, scalability, and efficiency of the specific parallel system. Our ROSS paper at Supercomputing 2021 presents the case for an interwoven parallel hardware/software stack. We designed fast barriers by blending hardware and software on an Intel HARP system that integrates x64 cores and an FPGA fabric in the same package. We studied the prospects of functional address translation for parallel systems, and developed CARAT, a system that performs address translation as an OS/compiler co-design rather than a contract between OS and hardware, and CARAT CAKE, a system that brings CARAT into the kernel and fully replaces OS paging via compiler/kernel cooperation. We developed and implemented TPAL, a task-parallel assembly language that leverages existing kernel and hardware support for interrupts to allow parallelism to remain latent until a heartbeat (a fast user-level timing interrupt), when it can be manifested at low cost. We discovered spatio-temporal value correlation, an important but overlooked software behavior in which the values computed by the same line of code tend to be of similar magnitude as the instruction repeatedly executes. We capitalized on this software property to design ST2 GPU, a GPU architecture that employs specialized adders for energy efficiency. To evaluate ST2 GPU, we developed and released AccelWattch, an open-source, highly accurate power model for NVIDIA Volta GPUs that is within 7.5% of hardware power measurements and is the first power modeling tool that can be driven entirely by software simulation (e.g., Accel-Sim), by hardware performance counters, or by a hybrid combination of the two. This work has been partially supported by NSF CNS-1763743.
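To make the heartbeat idea concrete, here is a heavily simplified sketch (TPAL itself operates at the compiler and kernel level; the timer thread below merely stands in for the fast user-level timing interrupt): a loop runs sequentially, and only when a heartbeat arrives does it promote its remaining iterations into a spawned task.

    # Sketch of heartbeat scheduling: parallelism stays latent (an ordinary
    # sequential loop) until a heartbeat fires, and only then is the rest of
    # the iteration space split off as a real task. One beat is enough here.
    import threading
    from concurrent.futures import ThreadPoolExecutor

    heartbeat = threading.Event()

    def loop(xs, lo, hi, pool):
        """Sum xs[lo:hi]; split the remaining range when a heartbeat arrives."""
        total = 0
        for i in range(lo, hi):
            if heartbeat.is_set() and hi - i > 1:
                heartbeat.clear()
                mid = (i + hi) // 2
                rest = pool.submit(loop, xs, mid, hi, pool)      # promote!
                return total + loop(xs, i, mid, pool) + rest.result()
            total += xs[i]
        return total

    data = list(range(1_000_000))
    threading.Timer(0.01, heartbeat.set).start()   # the "interrupt", 10 ms in
    with ThreadPoolExecutor() as pool:
        print(loop(data, 0, len(data), pool))      # 499999500000 either way

Because promotion only happens once per heartbeat period, the sequential path stays nearly overhead-free; TPAL obtains the same property with far less overhead through compiler and kernel support instead of an explicit flag check.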
SeaFire: Application-Specific Design for Dark Silicon
While Elastic Fidelity and Elastic Memory Hierarchies cut back on energy consumption, they do not push the power wall far enough. To gain another order of magnitude in energy efficiency, we must minimize the overheads of modern computing. The idea behind the SeaFire project is that instead of building conventional high-overhead multicores that we cannot fully power, we should repurpose the dark silicon for specialized energy-efficient cores. A running application powers up only the cores most closely matching its computational requirements, while the rest of the chip remains off to conserve energy. Preliminary results on SeaFire have been published in a highly-cited IEEE Micro article in July 2011, an invited USENIX ;login: article in April 2012, the ACLD workshop in 2010, a keynote at ISPDC in 2010, an invited presentation at the NSF Workshop on Sustainable Energy-Efficient Data Management in 2011 (the abstract is here), and an invited presentation at HPTS in 2011. This work was partially funded by an ISEN Booster award and later continued as part of the Intel Parallel Computing Center at Northwestern that I co-founded with faculty from the IEMS department.
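As a toy illustration of that dispatch idea (the core names, power budgets, and efficiency scores below are entirely hypothetical), the sketch picks the one specialized core worth lighting up for a given workload:

    # Toy SeaFire-style dispatch: of the many specialized cores on the die,
    # power up only the one best matched to the running application.
    # Core names, power numbers, and efficiency scores are hypothetical.
    CORES = {
        "simd_fp":    {"power_w": 2.0, "eff": {"dense_linalg": 10.0, "graph": 1.0}},
        "scalar_int": {"power_w": 0.5, "eff": {"dense_linalg": 1.0, "graph": 4.0}},
        "crypto":     {"power_w": 1.0, "eff": {"aes": 50.0}},
    }

    def pick_core(workload, power_budget_w):
        """Best-matching specialized core within the budget, else None."""
        score, best = max(
            (spec["eff"].get(workload, 0.0), name)
            for name, spec in CORES.items()
            if spec["power_w"] <= power_budget_w
        )
        return best if score > 0 else None   # None: fall back to a general core

    print(pick_core("graph", power_budget_w=1.0))         # scalar_int
    print(pick_core("dense_linalg", power_budget_w=3.0))  # simd_fp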
Elastic Fidelity: Disciplined Approximate Computing
At the circuit level, shrinking transistor geometries and the race for energy-efficient computing result in significant error rates at smaller technology nodes, due to process variation and low operating voltages (especially with near-threshold computing). Traditionally, these errors are handled at the circuit and architectural layers, as computations expect 100% reliability. Elastic Fidelity computing is based on the observation that not all computations and data require 100% fidelity; we can judiciously let errors manifest in the error-resilient data, and handle them higher in the stack. We envision programming-language extensions that allow data objects to be instantiated with certain accuracy guarantees, which are recorded by the compiler and communicated to hardware, which then steers computations and data to separate ALU/FPU blocks and cache/memory regions that relax the guardbands and run at lower voltage to conserve energy. Our vision was first presented in a poster at ASPLOS 2011. To accurately model the impact of errors we developed b-HiVE, a bit-level history-based error model for functional units which, for the first time, accounts for the value correlation inherently found in software. We then developed Lazy Pipelines, a microarchitecture that utilizes vacant functional-unit cycles to reduce the computation error rate under lower-than-nominal voltage. We showed how elastic fidelity can lead to significant energy savings in real-world graph applications through a novel edge-importance identification technique based on locality-sensitive hashing, which allows low-importance edges to be processed with elastic fidelity operations. We further developed the concept of elastic fidelity through Temporal Approximate Function Memoization, a compiler transformation that replaces function executions with historical results when the function output is stable. Our work on Elastic Fidelity also formed the stepping stone for VaLHALLA, a variable-latency speculative lazy adder that saves 70% of the nominal power while guaranteeing correctness. This work was partially funded by NSF CCF-1218768 and NSF CCF-1217353.
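As a minimal sketch of the memoization idea (the tolerance and streak thresholds below are illustrative knobs, not the paper's): once a function's recent outputs have settled, subsequent calls return the remembered result instead of re-executing.

    # Minimal sketch of temporal approximate function memoization: after the
    # output has been stable for a few calls, skip execution and reuse the
    # historical result. Tolerance and streak length are illustrative knobs;
    # a real system would also periodically re-execute to re-validate.
    import math

    class ApproxMemo:
        def __init__(self, fn, rel_tol=0.05, stable_after=3):
            self.fn, self.rel_tol, self.stable_after = fn, rel_tol, stable_after
            self.last = None     # most recently computed result
            self.streak = 0      # consecutive results within tolerance

        def __call__(self, x):
            if self.streak >= self.stable_after:
                return self.last                 # approximate: reuse history
            y = self.fn(x)
            stable = self.last is not None and math.isclose(
                y, self.last, rel_tol=self.rel_tol)
            self.streak = self.streak + 1 if stable else 0
            self.last = y
            return y

    smooth = ApproxMemo(lambda x: 0.9 * x + 1.0)
    for reading in [10.0, 10.1, 10.05, 10.02, 10.4]:
        print(smooth(reading))   # last call reuses history: no re-execution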
Elastic Memory Hierarchies
In this project we develop adaptive cache designs and memory-hierarchy sub-systems that minimize the overheads of storing, retrieving, and communicating data to/from memories and other cores. Reactive NUCA, an incarnation of Elastic Memory Hierarchies for near-optimal data placement, was published at ISCA 2009 and won an IEEE Micro Top Picks award in 2010, while newer papers on Dynamic Directories at DATE 2012 and in the IEEE Computer special issue on multicore coherence in 2013 present an instance of Elastic Memory Hierarchies that minimizes interconnect power by co-locating directory metadata with sharer cores. You can also find an interview on Dynamic Directories conducted by Prof. Srini Devadas (MIT) here. Later, we designed SCP, an instance of Elastic Memory Hierarchies that stores the prefetching engine's metadata in the cache space saved by cache compression, leading to 13-22% application speedup. Through this project we also investigated DRAM thermal management techniques, which have been largely overlooked by the community, even though more than a third of system energy is consumed by memory, and thermal events play an important role in overall DRAM power consumption and reliability. Together with fellow faculty Seda and Gokhan Memik, we recognized the importance of the problem and devised techniques to shape the power and thermal profile of DRAMs using OS-level optimizations. We published some of our results on DRAM thermal management at HPCA 2011. This thrust currently focuses on revisiting memory-hierarchy designs, optical memories, and new hardware-software co-designs for virtual-to-physical address mapping. This work is partially funded by NSF CCF-1218768 and CCF-1453853.
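As a condensed illustration of the usage-based placement at the heart of Reactive NUCA, the sketch below maps a cache block to a slice according to its classification; the tile count, cluster size, and address hash are simplified stand-ins for the actual mechanism.

    # Condensed sketch of Reactive NUCA-style placement: private data stays
    # in the local slice, shared data is interleaved chip-wide, and
    # instructions are replicated at the granularity of small tile clusters.
    # Tile count, cluster size, and the address hash are simplified.
    N_TILES = 16
    CLUSTER = 4   # instructions get one copy per 4-tile neighborhood

    def place(block_addr, block_class, requesting_tile):
        """Return the tile whose cache slice should hold this block."""
        if block_class == "private":
            return requesting_tile                      # keep it local
        if block_class == "instruction":
            base = (requesting_tile // CLUSTER) * CLUSTER
            return base + block_addr % CLUSTER          # nearby replica
        return block_addr % N_TILES                     # shared: interleave

    print(place(0x2f40, "private", requesting_tile=5))      # 5: local slice
    print(place(0x2f40, "instruction", requesting_tile=5))  # 4: tiles 4-7 cluster
    print(place(0x2f40, "shared", requesting_tile=5))       # 0: chip-wide hash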