Machine Learning in Hardware

The past few decades have seen unprecedented growth in the information processing capabilities of electronic systems such as desktops, laptops, and mobile phones. The emergence of these advanced data processing systems has revolutionized several industries and has led to the availability of vast amounts of data. Recent advances in machine learning and big data explore ways of deriving useful conclusions from the available data, but at a significant cost in silicon. Hence, it has now become crucial to ask, “What is the best way to build information processing systems for the future?” This session invites researchers working on various aspects of this question, including but not limited to: advances in state-of-the-art digital and analog CMOS-based designs; advances in state-of-the-art computer architectures and compilers; ways of addressing challenges such as high device variability and leakage power; alternative computing paradigms such as bio/neuro-inspired computing or computing with beyond-CMOS devices; alternative storage paradigms such as in-memory computing; and novel memories such as RRAM and MRAM.

Time and place – Friday Feb. 8, 9am-12pm, CSL B02 

Keynote Speaker: Song Han – MIT

Hardware-Centric AutoML: Design Automation for Efficient Deep Learning Computing

Summary – In the post-ImageNet era, researchers are solving more complicated AI problems using larger data sets, which drives the demand for more computation. However, Moore’s Law is slowing down. The mismatch between the supply of and demand for computation highlights the need to co-design efficient machine learning algorithms and domain-specific hardware architectures. We introduce our recent work on using machine learning to optimize the machine learning system itself (hardware-centric AutoML): learning the optimal pruning strategy (AMC) and quantization strategy (HAQ) on the target hardware; learning the optimal neural network architecture specialized for a target hardware architecture (ProxylessNAS); and learning to optimize analog circuit parameters, rather than relying on experienced analog engineers to tune those transistors (L2DC). For hardware-friendly machine learning algorithms, I’ll introduce the Temporal Shift Module (TSM) for efficient video understanding, which offers 8x lower latency and 12x higher throughput than 3D-convolution-based methods. Next, we introduce Defensive Quantization (DQ), which balances robustness and efficiency. I’ll conclude the talk with an outlook on design automation for efficient deep learning computing.
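The shift operation at the heart of TSM is simple enough to sketch. The toy version below is illustrative only, not the authors' implementation; the 1/8-forward, 1/8-backward channel split is an assumed default. It mixes information across frames by moving a fraction of channels along the time axis, at zero multiply cost:

```python
import numpy as np

def temporal_shift(x, fold_div=8):
    """Sketch of a temporal shift: x has shape (T, C, H, W).

    One fold of channels is shifted toward earlier time steps, one fold
    toward later time steps (zero-padded at the boundary), and the
    remaining channels pass through unchanged.
    """
    t, c, h, w = x.shape
    fold = c // fold_div
    out = np.zeros_like(x)
    out[:-1, :fold] = x[1:, :fold]               # shift one fold backward in time
    out[1:, fold:2 * fold] = x[:-1, fold:2 * fold]  # shift one fold forward in time
    out[:, 2 * fold:] = x[:, 2 * fold:]          # remaining channels untouched
    return out
```

Because the shift is pure data movement, temporal modeling is added to a 2D backbone without the FLOPs of 3D convolution.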

Bio – Song Han is an assistant professor in the EECS Department at the Massachusetts Institute of Technology (MIT) and PI of HAN Lab: Hardware, AI and Neural-nets. Dr. Han’s research focuses on energy-efficient deep learning and domain-specific architectures. He proposed “Deep Compression,” which has widely impacted the industry. He was the co-founder and chief scientist of DeePhi Tech, a company based on his PhD thesis. Prior to joining MIT, Song Han received his PhD from Stanford University.

MIT HAN Lab’s research focuses on:

H: High performance, High energy efficiency Hardware

A: Architectures and Accelerators for Artificial Intelligence

N: Novel Algorithms for Neural Networks

Glenn Gihyun Ko – Harvard

FlexGibbs: Reconfigurable Parallel Gibbs Sampling Accelerator for Structured Graphs

Summary – Many regard compatibility with existing accelerators, mainly GPUs, as a key component of the success of deep learning. While GPUs are great at handling the linear algebra kernels commonly found in deep learning, they are not the optimal architecture for unsupervised learning methods such as Bayesian learning and probabilistic graphical models and inference. As a step towards a better understanding of architectures for probabilistic models, we study Gibbs sampling, one of the most commonly used algorithms for probabilistic inference, with a focus on parallelism and parameterized components. We propose FlexGibbs, a reconfigurable, parallel Gibbs sampling inference accelerator for structured graphs. We designed an architecture optimized for solving Markov Random Field tasks using massively parallel Gibbs samplers, enabled by chromatic sampling. We show that for a sound source separation application, FlexGibbs configured on a Zynq ARM/SoC FPGA platform achieved a Gibbs sampling inference speedup of 1048x and a 99.86% reduction in energy over running it on an ARM Cortex-A53. We further show that FlexGibbs can be used for other perceptual tasks such as image segmentation, restoration, and stereo vision.
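The chromatic sampling idea behind the parallelism can be illustrated with a toy software sketch (this is an illustration, not the FlexGibbs design): color the graph so that no two neighbors share a color; then all sites of one color are conditionally independent given the rest and can be updated together. For a grid MRF a checkerboard 2-coloring suffices; the Ising model below is an assumed example potential:

```python
import math
import random

def chromatic_gibbs_ising(width, height, beta=0.5, sweeps=10, seed=0):
    """Chromatic Gibbs sampling on a 2D grid Ising MRF.

    The grid is 2-colorable (checkerboard), so every site of one color
    is conditionally independent of all others of that color -- a
    hardware implementation could sample an entire color class in
    parallel. Here the color classes are swept sequentially.
    """
    rng = random.Random(seed)
    spins = [[rng.choice([-1, 1]) for _ in range(width)] for _ in range(height)]
    for _ in range(sweeps):
        for color in (0, 1):
            for y in range(height):
                for x in range(width):
                    if (x + y) % 2 != color:
                        continue
                    # Sum of neighboring spins (free boundary conditions).
                    s = 0
                    if x > 0: s += spins[y][x - 1]
                    if x < width - 1: s += spins[y][x + 1]
                    if y > 0: s += spins[y - 1][x]
                    if y < height - 1: s += spins[y + 1][x]
                    # Conditional P(spin = +1 | neighbors) for the Ising model.
                    p_up = 1.0 / (1.0 + math.exp(-2.0 * beta * s))
                    spins[y][x] = 1 if rng.random() < p_up else -1
    return spins
```

The sequential-scan baseline would visit sites one at a time; the coloring is what makes the massively parallel samplers in the talk correct rather than approximate.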

Bio – Glenn Gihyun Ko is a postdoctoral researcher at Harvard University working with Professor Gu-Yeon Wei and Professor David Brooks. He received his B.S. and M.S. in Electrical and Computer Engineering from the University of Illinois at Urbana-Champaign in 2004 and 2006, respectively. He then joined Samsung Electronics, where he engaged in research and development of Samsung Exynos mobile processor SoCs. He returned to Illinois and received his Ph.D. in Electrical and Computer Engineering in 2017 before joining Harvard University in 2018. He has also spent summers at Qualcomm Research and IBM Research working on machine learning accelerators. His current research interests are machine learning, algorithm-hardware co-design, and scalable accelerator architectures for cloud and edge devices.

Rachneet Kaur – UIUC

Virtual Reality, Visual Cliffs and Movement Disorders

Summary – Virtual reality provides a safe space for therapy dealing with physical disabilities. The goal of our work is understanding and mitigating fear of falling, particularly among the elderly. A treadmill and harness protect the subject while a non-lab environment is simulated. Our project combines this setup with modern electroencephalography (EEG) techniques. We created VR environments designed either to provoke an anxious reaction or to soothe the subject. Currently we have more than 20 different terrains, varying from flat ground to steep cliffs. Besides the shapes of the terrains, we also applied different textures, such as ice, grass, rock, and snow, to better mimic the real world. Subjects wear an EEG headset while engaging with the simulated environment, and their brain activity is recorded. From the data, an anxiety level between -1 and 1 is calculated using several techniques: filtering of the signals, independent component analysis, and studying the power spectral density of signals from the left and right frontal cortex. The virtual world adaptively modifies itself to be correspondingly more or less soothing by changing the upcoming terrains based on the condition of the subject. To make these transitions less noticeable to subjects, we use a fog effect between two adjacent terrains. This process helps train subjects to manage the mental aspects of dealing with a movement disorder. We hope that our setup may help subjects develop mechanisms to compensate for a significant fear of falling while walking.
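As one illustration of how a [-1, 1] index could be derived from left/right frontal power, here is a minimal sketch. This is an assumption for illustration only: the talk's actual pipeline includes filtering and independent component analysis, which are omitted here, and the normalized band-power asymmetry below is a common choice in the EEG literature, not necessarily the authors' method:

```python
import numpy as np

def alpha_band_power(signal, fs, band=(8.0, 13.0)):
    """Band power of one EEG channel via a simple periodogram.

    `signal` is a 1D array of samples; `fs` is the sampling rate in Hz.
    The alpha band (8-13 Hz) is an assumed band of interest.
    """
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    psd = np.abs(np.fft.rfft(signal)) ** 2 / len(signal)
    mask = (freqs >= band[0]) & (freqs <= band[1])
    return psd[mask].sum()

def anxiety_index(left, right, fs):
    """Normalized frontal asymmetry in [-1, 1]: (R - L) / (R + L)."""
    l = alpha_band_power(left, fs)
    r = alpha_band_power(right, fs)
    return (r - l) / (r + l + 1e-12)
```

Because both band powers are non-negative, the ratio is bounded in [-1, 1] by construction, matching the range quoted in the summary.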

Bio – Rachneet Kaur is a Ph.D. student in the Department of Industrial and Enterprise Systems Engineering at University of Illinois at Urbana-Champaign. Rachneet earned a B.S. (with honours) in mathematics from the University of Delhi and an M.S. in Mathematics from the Indian Institute of Technology (IIT), Delhi. Her research interests are signal processing, big data, health analytics and control.

Sitao Huang – UIUC

Triangle Counting and Truss Decomposition using FPGA

Summary – Triangle counting and truss decomposition are two essential procedures in graph analysis. As graphs grow larger, designing highly efficient, low-power graph analysis systems becomes increasingly urgent. In this talk, Sitao will present triangle counting and truss decomposition solutions using a low-power Field-Programmable Gate Array (FPGA). The work leverages the flexibility of FPGAs to achieve low-latency, high-efficiency implementations. Evaluation on the SNAP datasets shows that the triangle counting and truss decomposition implementations achieve 43.5x on average (up to 757.7x) and 6.4x on average (up to 68.0x) higher performance per watt, respectively, over GPU solutions. Sitao will also discuss how these efficient graph analysis techniques can be applied to implementing efficient deep learning systems.
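The core kernel of triangle counting is a neighbor-list intersection, which maps naturally onto hardware pipelines. The Python reference below is an illustrative sketch of the standard degree-ordered algorithm, not the FPGA design from the talk:

```python
from collections import defaultdict

def count_triangles(edges):
    """Count triangles by intersecting the neighbor sets of edge endpoints.

    Each vertex keeps only neighbors that rank higher in (degree, id)
    order, so every triangle is counted exactly once -- via its
    lowest-ranked edge.
    """
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    # Total order on vertices: by degree, breaking ties by id.
    rank = {v: (len(adj[v]), v) for v in adj}
    out = {v: {w for w in adj[v] if rank[w] > rank[v]} for v in adj}
    # For each ordered edge (u, v), common higher-ranked neighbors close a triangle.
    return sum(len(out[u] & out[v]) for u in out for v in out[u])
```

The degree ordering keeps the out-neighbor sets small on skewed real-world graphs, which is what makes the intersection kernel cheap in both software and hardware.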

Bio – Sitao Huang is a Ph.D. candidate in the Department of Electrical and Computer Engineering at the University of Illinois at Urbana-Champaign, under the supervision of Prof. Deming Chen and Prof. Wen-mei Hwu. His research interests include hardware acceleration, high-level synthesis, and heterogeneous systems. He is a recipient of the 2018 Rambus Computer Engineering Fellowship. His research won the Student Innovation Award at the 2018 IEEE-HPEC Graph Challenge. Sitao Huang received his B.S. in Electronics Engineering from Tsinghua University in 2014 and his M.S. in Computer Engineering from the University of Illinois at Urbana-Champaign in 2017.

Kartik Hegde – UIUC

Morph: Towards Highly Flexible Machine Learning Accelerators

Summary – As machine learning becomes ubiquitous, we have seen unprecedented advances in both the algorithmic and hardware worlds. With the increasing heterogeneity in neural network types and parameters, maintaining compute performance and power efficiency across networks, and even across different layers of a deep neural network, is a challenging task. Earlier works implement a fixed design strategy that defines a dataflow paradigm which cannot be changed across layers or networks, leading to sub-optimal inference efficiency.

In this light, we design and implement Morph, a highly flexible hardware accelerator that adapts to the requirements of the network, supported by the Morph software optimizer, which performs a design space search to identify the optimal run-time parameters such as loop order, tiling, and parallelism. Evaluated on state-of-the-art 3D CNNs for video understanding, Morph achieves up to 3.4x (2.5x average) reduction in energy consumption and improves performance per watt by up to 5.1x (4x average) compared to a baseline 3D CNN accelerator, with a minimal area overhead of 5%. Morph further achieves a 15.9x average energy reduction on 3D CNNs compared to Eyeriss.
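The flavor of such a design space search can be sketched in a few lines. The sketch below uses a deliberately toy cost model (iteration count as a proxy for off-chip traffic), not Morph's actual optimizer or cost model: enumerate per-dimension tile sizes under a buffer capacity constraint and keep the cheapest:

```python
import itertools
import math

def search_tiling(extents, buffer_limit, candidates=(1, 2, 4, 8, 16, 32)):
    """Brute-force tiling search for a loop nest.

    `extents` gives each loop dimension's trip count; a tiling is kept
    only if its footprint (product of tile sizes, in abstract units)
    fits in `buffer_limit`. Cost is the total number of tile
    iterations, a toy stand-in for off-chip data movement.
    """
    best = None
    for tiles in itertools.product(candidates, repeat=len(extents)):
        footprint = math.prod(tiles)
        if footprint > buffer_limit:
            continue
        # Ceil-division: how many tiles cover each dimension.
        iters = math.prod(-(-e // t) for e, t in zip(extents, tiles))
        if best is None or iters < best[0]:
            best = (iters, tiles)
    return best  # (iteration count, chosen tile sizes)
```

A real optimizer would also search loop order and parallelism and use a calibrated energy/latency model per layer; the point of the sketch is only that per-layer re-search is cheap enough to make run-time flexibility worthwhile.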

In this talk, I hope to excite the community about a new trend in machine learning accelerator design that heavily deploys hardware-software co-design techniques to dramatically improve efficiency. Imparting run-time flexibility to hardware accelerators is the way forward to building highly efficient accelerators that serve a large range of deep learning algorithms.

Bio – Kartik Hegde is a second-year PhD student at the University of Illinois at Urbana-Champaign, advised by Chris Fletcher. His current research at CS@Illinois focuses on developing efficient programmable hardware accelerators for sparse-dense tensor algebra and deep learning. Prior to this, he worked at Arm, where he contributed to developing hardware accelerators for ML at the edge and their integration into the System-on-Chip. He is a Facebook Fellow with their Hardware for Machine Learning research group.