Machine Learning and Signal Processing

Machine Learning and Signal Processing Session

2:00 pm to 5:00 pm, February 24 on Zoom

The goal of Machine Learning is to understand the fundamental principles and capabilities of learning from data, as well as to design and analyze machine learning algorithms. We invite you to the Machine Learning and Signal Processing Session of the CSL student conference if you are curious about when, how, and why machine learning algorithms work. Besides the theoretical aspects of machine learning, this session covers topics including (but not limited to) computer vision, deep learning, acoustics, and signal processing. The Machine Learning and Signal Processing session will have a keynote talk by Prof. Rebecca Willett, a prominent researcher in the area.

Keynote Speaker – Prof. Rebecca Willett, University of Chicago


“The Role of Linear Layers in Nonlinear Interpolating Networks”

Time: 2:00 pm to 3:00 pm, February 24

Talk Abstract: An outstanding problem in understanding the performance of overparameterized neural networks is to characterize which functions are best represented by neural networks of varying architectures. Past work explored the notion of representation costs, i.e., how much does it “cost” for a neural network to represent some function? For instance, given a set of training samples, consider finding the interpolating function that minimizes the representation cost; how is that interpolant different for a network with three layers instead of two layers? Both functions have the same values on the training samples, but they may have very different behaviors elsewhere in the domain. In this talk, I will describe the representation cost of a family of networks with L layers in which L-1 layers have linear activations and the final layer has a ReLU activation. Our results show that the linear layers in this network yield a representation cost that reflects a complex interplay between the alignment and sparsity of ReLU units. For example, using a neural network to fit training data with minimum representation cost yields an interpolating function that is constant in directions perpendicular to a low-dimensional subspace on which a parsimonious interpolant exists. We will explore these effects and their implications for future work on generalization. This is joint work with Greg Ongie.

Biography: Rebecca Willett is a Professor of Statistics and Computer Science at the University of Chicago. Her research is focused on machine learning, signal processing, and large-scale data science. Prof. Willett received the National Science Foundation CAREER Award in 2007, was a member of the DARPA Computer Science Study Group 2007-2011, received an Air Force Office of Scientific Research Young Investigator Program award in 2010, was named a Fellow of the Society for Industrial and Applied Mathematics in 2021, and was named a Fellow of the IEEE in 2022. She is a co-principal investigator and member of the Executive Committee for the Institute for the Foundations of Data Science, helps direct the Air Force Research Lab University Center of Excellence on Machine Learning, and currently leads the University of Chicago’s AI+Science Initiative. She serves on advisory committees for the National Science Foundation’s Institute for Mathematical and Statistical Innovation, the AI for Science Committee for the US Department of Energy’s Advanced Scientific Computing Research program, the Sandia National Laboratories Computing and Information Sciences Program, and the University of Tokyo Institute for AI and Beyond. She completed her PhD in Electrical and Computer Engineering at Rice University in 2005 and was an Assistant then tenured Associate Professor of Electrical and Computer Engineering at Duke University from 2005 to 2013. She was an Associate Professor of Electrical and Computer Engineering, Harvey D. Spangler Faculty Scholar, and Fellow of the Wisconsin Institutes for Discovery at the University of Wisconsin-Madison from 2013 to 2018.
Prof. Willett has also held visiting researcher positions at the Institute for Pure and Applied Mathematics at UCLA in 2004, the University of Wisconsin-Madison 2003-2005, the French National Institute for Research in Computer Science and Control (INRIA) in 2003, and the Applied Science Research and Development Laboratory at GE Medical Systems (now GE Healthcare) in 2002. She is also an instructor for FEMMES (Females Excelling More in Math, Engineering, and Science) and a local exhibit leader for Sally Ride Festivals. She was a recipient of the National Science Foundation Graduate Research Fellowship, the Rice University Presidential Scholarship, the Society of Women Engineers Caterpillar Scholarship, and the Angier B. Duke Memorial Scholarship.

Invited Student Speaker – Vishnu Lokhande, University of Wisconsin-Madison


“Handling Correlations and Nuisance Variables in Deep Learning Models Robustly”

Time: 3:00 pm to 3:40 pm, February 24

Talk Abstract: Training vision and machine learning models has been a successful endeavor in the past decade. While this success is appealing, several issues of bias and fairness in learning-based models have received negative attention in recent years. Bias in AI models can be systematically studied by measuring the degree of impartiality the model provides with respect to certain attributes. These attributes could be sensitive, such as gender, race, or age. They could also be “nuisance attributes” that are collinear with the response variable but otherwise should or should not be used in the decision-making process. Moreover, bias manifests differently across different domains. In tasks such as image classification or object detection, which are fundamental in computer vision, a nuisance attribute could correspond to the background information in an image or a co-existing object. In a biomedical application, a deployed AI system may utilize irrelevant or extraneous features that conform to the scanner rather than the disease to make a diagnosis. Therefore, the study of mechanisms that mitigate bias over sensitive and nuisance attributes is crucial to the advancement of machine learning. Handling nuisance attributes systematically will make the models more robust and trustworthy. In this talk, we will study a few methods that help identify and mitigate bias. We will cover a few different problem domains and their corresponding attribute types.

Biography: Vishnu Lokhande is a PhD student in Computer Sciences at the University of Wisconsin-Madison. He works at the interface of Computer Vision and Machine Learning on one end and numerical optimization on the other, in the context of high-impact needs in fairness algorithms, data pooling and harmonization in biomedical applications, and semi-supervised learning. Over the previous summers, Vishnu pursued research internships at Google AI and Microsoft Research. He was a finalist for the 2021 Microsoft Research PhD Fellowship. Prior to his graduate studies, he obtained his bachelor’s degree from the Indian Institute of Technology Kanpur.


Student Speakers

Rucha Deshpande, Washington University in St. Louis (primary) and UIUC (visiting student)


“Evaluation of the Capacity of GANs to Reproduce High-order Spatial Context via Stochastic Object Models for Diagnostic Applications”

Time: 3:40 pm to 4:00 pm, February 24

The popularity of generative adversarial networks (GANs) in proposed medical imaging pipelines necessitates the development of domain-specific evaluation methods to ensure that clinical inferences are not impacted negatively. However, such evaluations are challenging because there is not an unambiguous definition of the ground truth in many clinical use cases. We propose purposefully designed stochastic object models (SOMs) that enable the algorithmic encoding of high-order spatial context that remains recoverable in post-hoc analysis of GAN-generated images, enabling task-relevant error-rate quantification. Designed SOMs allow the representation of clinically important image properties without formulaically specifying features or anatomy. We found that despite high visual similarity and low FID-10k scores, substantial per-realization errors in feature prevalence and feature-specific intensity distributions are observed at multiple spatial orders; these are of critical importance in diagnostic images. Thus, the GAN-generated ensemble might not retain the diagnostic value of the training ensemble. Although the proposed methodology is demonstrated on network instances of two popular GANs, it is applicable to other GAN architectures and generative models. Beyond unconditional image synthesis, such tailored design of SOMs opens avenues for evaluation in areas such as medical domain transfer and data augmentation.


Yu-Lin Wei


“Personalized Sound Zone”

Time: 4:00 pm to 4:20 pm, February 24

We consider the problem of creating personalized sound zones (PSZ), where a loudspeaker array must play sounds that are audible at a given location L1 but inaudible at another given location L2. In terms of applications, this means that TV speakers could filter their sound such that Alice can hear the TV from her location on the couch (L1), while Bob remains undisturbed at his desk (L2). Existing solutions use a microphone array (collocated with the speakers) to listen to the user’s voice, and then utilize channel reciprocity to estimate the filter parameters to create the PSZ. Our research is focused on improving the viability of PSZs. Two problems are of interest: (1) The user speaks from her mouth but listens at her ears, which implies that the filters need to be extrapolated to different locations. (2) The radiation patterns of the speakers and microphones are not identical, hence the filter must cope with such variations. We propose an algorithm that (a) interpolates the acoustic channels in a neighborhood region of L1, and (b) compensates for the radiation patterns, without any calibration. Thus, we are able to create a sound bubble at each ear of the user with a 6dBSL difference.


Liming Wang


“Multimodal phoneme discovery for low-resource speech recognition”

Time: 4:20 pm to 4:40 pm, February 24

Phonemes are defined by their relationship to words: changing a phoneme changes the word. Learning a phoneme inventory with little supervision has been a longstanding challenge with important applications to under-resourced speech technology. In this paper, we bridge the gap between the linguistic and statistical definitions of phonemes and propose a novel neural discrete representation learning model for self-supervised learning of a phoneme inventory from raw speech and word labels. Under mild assumptions, we prove that the phoneme inventory learned by our approach converges to the true one with an exponentially low error rate. Moreover, in experiments on the TIMIT and Mboshi benchmarks, our approach consistently learns a better phoneme-level representation and achieves a lower error rate in a zero-resource phoneme recognition task than previous state-of-the-art self-supervised representation learning algorithms.


Zhijian Yang


“PoseKernelLifter: Metric Lifting of 3D Human Pose using Sound”

Time: 4:40 pm to 5:00 pm, February 24

Reconstructing the 3D pose of a person in metric scale from a single-view image is a geometrically ill-posed problem. We cannot measure the exact distance of a person to the camera from a single-view image without additional scene assumptions (e.g., known height). Existing learning-based approaches circumvent this issue by reconstructing the 3D pose up to scale. However, there are many applications such as virtual telepresence, robotics, and augmented reality that require metric scale reconstruction. In this paper, we show that audio signals recorded along with an image provide complementary information to reconstruct the metric 3D pose of the person. The key insight is that as the audio signals traverse the 3D space, their interactions with the body provide metric information about the body’s pose. Based on this insight, we introduce a time-invariant transfer function called the pose kernel, the impulse response of audio signals induced by the body pose, which corresponds closely to the body geometry. We design a multi-stage 3D CNN that fuses audio and visual signals and learns to reconstruct 3D pose in a metric scale. We show that our multi-modal method produces accurate metric reconstruction in real-world scenes, which is not possible with state-of-the-art lifting approaches including parametric mesh regression and depth regression.



For more information, please contact the session chairs, Anadi Chaman or Corey Snyder.