2023



Improving Continual Learning by Accurate Gradient Reconstructions of the Past
Erik Daxberger, Siddharth Swaroop, Kazuki Osawa, Rio Yokota, Richard E Turner, José Miguel Hernández-Lobato, Mohammad Emtiyaz Khan
TMLR 2023 [TMLR]

Abstract: Weight-regularization and experience replay are two popular continual-learning strategies with complementary strengths: while weight-regularization requires less memory, replay can more accurately mimic batch training. How can we combine them to get better methods? Despite the simplicity of the question, little is known or done to optimally combine these approaches. In this paper, we present such a method by using a recently proposed principle of adaptation that relies on a faithful reconstruction of the gradients of the past data. Using this principle, we design a prior which combines two types of replay methods with a quadratic weight-regularizer and achieves better gradient reconstructions. The combination improves performance on standard task-incremental continual learning benchmarks such as Split-CIFAR, Split-TinyImageNet, and ImageNet-1000, achieving > 80% of the batch performance by simply utilizing a memory of < 10% of the past data. Our work shows that a good combination of the two strategies can be very effective in reducing forgetting.
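
For intuition only, here is a minimal PyTorch-style sketch of the generic weight-regularization + replay combination the abstract refers to. All names (`model`, `old_params`, `tau`, the batches) are placeholders, and the paper's actual prior, built from gradient reconstructions of the past, is more refined than this:

```python
import torch
import torch.nn.functional as F

def combined_loss(model, old_params, new_batch, memory_batch, tau=1.0):
    """Sketch: replay on stored examples + quadratic pull towards old weights.

    This only illustrates the general weight-reg + replay combination from
    the abstract; the paper's prior is constructed differently in detail.
    """
    x_new, y_new = new_batch
    x_mem, y_mem = memory_batch

    # Loss on the current task's data.
    loss = F.cross_entropy(model(x_new), y_new)

    # Experience replay: loss on a small memory of past data.
    loss = loss + F.cross_entropy(model(x_mem), y_mem)

    # Quadratic weight regularizer anchored at the previous solution.
    for p, p_old in zip(model.parameters(), old_params):
        loss = loss + 0.5 * tau * (p - p_old.detach()).pow(2).sum()
    return loss
```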


Memory Maps to Understand Models
Dharmesh Tailor, Paul Chang, Siddharth Swaroop, Eric Nalisnick, Arno Solin, Mohammad Emtiyaz Khan
Duality Principles for Modern Machine Learning Workshop @ ICML 2023 [Not available yet]

Abstract: What do models know and how? Answering this question requires exploratory analyses comparing many models, but existing techniques are specialized to specific models and analyses. We present memory maps as a general tool to understand a wide range of models by visualizing their sensitivity to data. Memory maps are extensions of residual-leverage plots where the two criteria are modified by easy-to-compute dual parameters obtained by using a Bayesian framework. The new criteria are used to understand a model's memory through a 2D scatter plot where tail regions often contain examples with high prediction-error and variance. All sorts of models can be analyzed this way, including not only those arising in kernel methods, Bayesian methods, and deep learning but also the ones obtained during training. We show use cases of memory maps to diagnose overfitting, compare various models, and analyze training trajectories.
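
For orientation, a small NumPy sketch of the classical residual-leverage criteria for ridge regression, the plot that memory maps extend. The paper's dual/Bayesian replacements for the two axes are not reproduced here, and all data below are synthetic:

```python
import numpy as np
import matplotlib.pyplot as plt

# Classical residual-leverage criteria for ridge regression: the starting
# point that memory maps generalize (the paper replaces both axes with
# easy-to-compute dual quantities; this sketch is the baseline only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=200)

lam = 1.0
H = X @ np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T)  # hat matrix
leverage = np.diag(H)            # sensitivity of the fit to each example
residual = y - H @ y             # prediction error per example

plt.scatter(leverage, residual)  # tail regions flag influential / poorly-fit points
plt.xlabel("leverage"); plt.ylabel("residual"); plt.show()
```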


Adaptive interventions for both accuracy and time in AI-assisted human decision making
Siddharth Swaroop, Zana Buçinca, Finale Doshi-Velez
AI&HCI Workshop @ ICML 2023 [arXiv]

Abstract: In settings where users are both time-pressured and need high accuracy, such as doctors working in Emergency Rooms, we want to provide AI assistance that both increases accuracy and reduces time. However, different types of AI assistance have different benefits: some reduce time taken while increasing overreliance on AI, while others do the opposite. We therefore want to adapt what AI assistance we show depending on various properties (of the question and of the user) in order to best trade off our two objectives. We introduce a study where users have to prescribe medicines to aliens, and use it to explore the potential for adapting AI assistance. We find evidence that it is beneficial to adapt our AI assistance depending on the question, leading to good tradeoffs between time taken and accuracy. Future work would consider machine-learning algorithms (such as reinforcement learning) to adapt AI assistance automatically and quickly.


Discovering User Types: Mapping User Traits by Task-Specific Behaviors in Reinforcement Learning
Lars L. Ankile, Brian S. Ham, Kevin Mao, Eura Shin, Siddharth Swaroop, Finale Doshi-Velez, Weiwei Pan
AI&HCI Workshop @ ICML 2023 (Honorable mention for best paper award), Interactive Learning with Implicit Human Feedback Workshop @ ICML 2023 [arXiv]

Abstract: When assisting human users in reinforcement learning (RL), we can represent users as RL agents and study key parameters, called "user traits", to inform intervention design. We study the relationship between user behaviors (policy classes) and user traits. Given an environment, we introduce an intuitive tool for studying the breakdown of "user types": broad sets of traits that result in the same behavior. We show that seemingly different real-world environments admit the same set of user types and formalize this observation as an equivalence relation defined on environments. By transferring intervention design between environments within the same equivalence class, we can help rapidly personalize interventions.


Soft prompting might be a bug, not a feature
Luke Bailey, Gustaf Ahdritz, Anat Kleiman, Siddharth Swaroop, Finale Doshi-Velez, Weiwei Pan
Challenges of Deploying Generative AI Workshop @ ICML 2023 [OpenReview]

Abstract: Prompt tuning, or "soft prompting," replaces text prompts to generative models with learned embeddings (i.e. vectors) and is used as an alternative to parameter-efficient fine-tuning. Prior work suggests analyzing soft prompts by interpreting them as natural language prompts. However, we find that soft prompts occupy regions in the embedding space that are distinct from those containing natural language, meaning that direct comparisons may be misleading. We argue that because soft prompts are currently uninterpretable, they could potentially be a source of vulnerability of LLMs to malicious manipulations during deployment.
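
A minimal sketch of what "soft prompting" means mechanically, under the usual prompt-tuning setup (learned embeddings prepended to the token embeddings of a frozen model). Names and dimensions are illustrative, not the paper's setup:

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Prompt tuning in miniature: a learned block of embeddings is prepended
    to the input token embeddings while the base model stays frozen."""

    def __init__(self, n_prompt_tokens: int, embed_dim: int):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(n_prompt_tokens, embed_dim) * 0.02)

    def forward(self, token_embeds: torch.Tensor) -> torch.Tensor:
        # token_embeds: (batch, seq_len, embed_dim)
        batch = token_embeds.shape[0]
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        # Nothing constrains self.prompt to stay near the embeddings of real
        # tokens, which is why soft prompts can drift into regions of
        # embedding space that no natural-language prompt occupies.
        return torch.cat([prompt, token_embeds], dim=1)
```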


Differentially private partitioned variational inference
Mikko A. Heikkilä, Matthew Ashman, Siddharth Swaroop, Richard E Turner, Antti Honkela
TMLR 2023 [TMLR]

Abstract: Learning a privacy-preserving model from sensitive data which are distributed across multiple devices is an increasingly important problem. The problem is often formulated in the federated learning context, with the aim of learning a single global model while keeping the data distributed. Moreover, Bayesian learning is a popular approach for modelling, since it naturally supports reliable uncertainty estimates. However, Bayesian learning is generally intractable even with centralised non-private data and so approximation techniques such as variational inference are a necessity. Variational inference has recently been extended to the non-private federated learning setting via the partitioned variational inference algorithm. For privacy protection, the current gold standard is called differential privacy. Differential privacy guarantees privacy in a strong, mathematically well-defined sense. In this paper, we present differentially private partitioned variational inference, the first general framework for learning a variational approximation to a Bayesian posterior distribution in the federated learning setting while minimising the number of communication rounds and providing differential privacy guarantees for data subjects. We propose three alternative implementations in the general framework, one based on perturbing local optimisation runs done by individual parties, and two based on perturbing updates to the global model (one using a version of federated averaging, the second one adding virtual parties to the protocol), and compare their properties both theoretically and empirically. We show that perturbing the local optimisation works well with simple and complex models as long as each party has enough local data. However, the privacy is always guaranteed independently by each party. In contrast, perturbing the global updates works best with relatively simple models. Given access to suitable secure primitives, such as secure aggregation or secure shuffling, the performance can be improved by all parties guaranteeing privacy jointly.
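
The common core of all three implementations is the Gaussian mechanism: clip an update to bound its sensitivity, then add calibrated noise. A schematic sketch of that core, not any of the paper's three variants verbatim (`clip_norm` and `noise_mult` are generic knobs):

```python
import numpy as np

def dp_update(update: np.ndarray, clip_norm: float, noise_mult: float,
              rng=None) -> np.ndarray:
    """Gaussian mechanism on an update vector: clip to bound the L2
    sensitivity, then add noise scaled to the clipping bound. The paper
    applies this idea at different points (local optimisation runs vs.
    global model updates); this is only the shared building block."""
    if rng is None:
        rng = np.random.default_rng()
    norm = np.linalg.norm(update)
    clipped = update * min(1.0, clip_norm / (norm + 1e-12))
    noise = rng.normal(0.0, noise_mult * clip_norm, size=update.shape)
    return clipped + noise
```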


Modeling Mobile Health Users as Reinforcement Learning Agents
Eura Shin, Siddharth Swaroop, Weiwei Pan, Susan Murphy, Finale Doshi-Velez
AAAI Workshop on AI for Behavior Change (Contributed talk) 2023 [arXiv]

Abstract: Mobile health (mHealth) technologies empower patients to adopt/maintain healthy behaviors in their daily lives, by providing interventions (e.g. push notifications) tailored to the user's needs. In these settings, without intervention, human decision making may be impaired (e.g. valuing near-term pleasure over their own long-term goals). In this work, we formalize this relationship with a framework in which the user optimizes a (potentially impaired) Markov Decision Process (MDP) and the mHealth agent intervenes on the user's MDP parameters. We show that different types of impairments imply different types of optimal intervention. We also provide analytical and empirical explorations of these differences.
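
As a toy illustration of the framework (not the paper's experiments), one impairment type can be modelled as a myopic discount: the user solves the same MDP with a smaller gamma, so their policy deviates from the far-sighted one, and an agent intervening on the user's MDP parameters can close that gap. All quantities below are synthetic:

```python
import numpy as np

def value_iteration(P, R, gamma, n_iters=500):
    """Standard value iteration. P: (A, S, S) transitions, R: (S, A) rewards."""
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    for _ in range(n_iters):
        Q = R + gamma * np.einsum("ast,t->sa", P, V)
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

# Hypothetical impairment: the user is myopic (gamma_user < gamma_true),
# so their optimal policy differs from the far-sighted one in some states.
A, S = 2, 4
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(S), size=(A, S))
R = rng.normal(size=(S, A))

farsighted = value_iteration(P, R, gamma=0.99)
myopic = value_iteration(P, R, gamma=0.5)
print("policies differ at states:", np.nonzero(farsighted != myopic)[0])
```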


2022



Probabilistic Continual Learning using Neural Networks
Siddharth Swaroop
PhD thesis 2022 [PhD thesis]

Abstract: Neural networks are being increasingly used in society due to their strong performance at a large scale. They excel when they have access to all data at once, requiring multiple passes through the data. However, standard deep-learning techniques are unable to continually adapt as the environment changes: either they forget old data or they fail to sufficiently adapt to new data. This limitation is a major barrier to applications in many real-world settings, where the environment is often changing, and also in stark contrast to humans, who continuously learn over their lifetimes. The study of learning systems in these settings is called continual learning: data examples arrive sequentially and predictions must be made online. In this thesis we present new algorithms for continual learning using neural networks. We use the probabilistic approach, which maintains a distribution over beliefs, naturally handling continual learning by recursively updating from priors to posteriors. Although previous work has been limited by approximations to this idealised scheme, we scale our probabilistic algorithms to large-data settings and show strong empirical performance. We also theoretically analyse why our algorithms perform well in continual learning.
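
The recursive prior-to-posterior update underlying the probabilistic approach, written out for reference (standard Bayes plus its usual variational approximation, not a result specific to the thesis):

```latex
% Recursive Bayesian updating over tasks: yesterday's posterior is today's prior,
p(\theta \mid \mathcal{D}_{1:t}) \;\propto\; p(\mathcal{D}_t \mid \theta)\, p(\theta \mid \mathcal{D}_{1:t-1}),
% which continual-learning algorithms approximate, e.g. with a variational q_t:
q_t(\theta) \;=\; \operatorname*{arg\,min}_{q \in \mathcal{Q}} \;
  \mathrm{KL}\!\left[\, q(\theta) \,\middle\|\, \tfrac{1}{Z_t}\, p(\mathcal{D}_t \mid \theta)\, q_{t-1}(\theta) \right].
```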


Partitioned Variational Inference: A Framework for Probabilistic Federated Learning
Matthew Ashman, Thang D Bui, Cuong V Nguyen, Stratis Markou, Adrian Weller, Siddharth Swaroop, Richard E Turner
Preprint 2022 [arXiv]

Abstract: The proliferation of computing devices has brought about an opportunity to deploy machine learning models on new problem domains using previously inaccessible data. Traditional algorithms for training such models often require data to be stored on a single machine with compute performed by a single node, making them unsuitable for decentralised training on multiple devices. This deficiency has motivated the development of federated learning algorithms, which allow multiple data owners to train collaboratively and use a shared model whilst keeping local data private. However, many of these algorithms focus on obtaining point estimates of model parameters, rather than probabilistic estimates capable of capturing model uncertainty, which is essential in many applications. Variational inference (VI) has become the method of choice for fitting many modern probabilistic models. In this paper we introduce partitioned variational inference (PVI), a general framework for performing VI in the federated setting. We develop new supporting theory for PVI, demonstrating a number of properties that make it an attractive choice for practitioners; use PVI to unify a wealth of fragmented, yet related literature; and provide empirical results that showcase the effectiveness of PVI in a variety of federated settings.


2021



Knowledge-Adaptation Priors
Mohammad Emtiyaz Khan & Siddharth Swaroop
NeurIPS 2021 [arXiv] [NeurIPS]

Abstract: Humans and animals have a natural ability to quickly adapt to their surroundings, but machine-learning models, when subjected to changes, often require a complete retraining from scratch. We present Knowledge-adaptation priors (K-priors) to reduce the cost of retraining by enabling quick and accurate adaptation for a wide variety of tasks and models. This is made possible by a combination of weight and function-space priors to reconstruct the gradients of the past, which recovers and generalizes many existing, but seemingly-unrelated, adaptation strategies. Training with simple first-order gradient methods can often recover the exact retrained model to an arbitrary accuracy by choosing a sufficiently large memory of the past data. Empirical results show that adaptation with K-priors achieves performance similar to full retraining, but only requires training on a handful of past examples.
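
Schematically, and only as far as the abstract describes it, a K-prior combines a function-space divergence on a memory M with a weight-space quadratic; in the GLM case its gradient takes roughly the form below (notation mine; the exact constants and divergence choices are in the paper):

```latex
% Sketch of the K-prior idea: function-space term on memory points plus a
% weight-space quadratic anchored at the old solution w_*,
\mathcal{K}(w) \;=\; \sum_{i \in \mathcal{M}}
    \mathrm{D}\!\left[\, f_w(x_i) \,\big\|\, f_{w_*}(x_i) \right]
  \;+\; \frac{\delta}{2}\, \lVert w - w_* \rVert^2,
% whose gradient reconstructs the gradient of the full past loss (GLM case,
% with link function sigma):
\nabla \mathcal{K}(w) \;=\; \sum_{i \in \mathcal{M}}
    \big[\sigma(f_w(x_i)) - \sigma(f_{w_*}(x_i))\big]\, \nabla f_w(x_i)
  \;+\; \delta\,(w - w_*).
```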


Collapsed Variational Bounds for Bayesian Neural Networks
Marcin B Tomczak, Siddharth Swaroop, Andrew YK Foong, Richard E Turner
NeurIPS 2021 [NeurIPS]

Abstract: Recent interest in learning large variational Bayesian Neural Networks (BNNs) has been partly hampered by poor predictive performance caused by underfitting, and their performance is known to be very sensitive to the prior over weights. Current practice often fixes the prior parameters to standard values or tunes them using heuristics or cross-validation. In this paper, we treat prior parameters in a distributional way by extending the model and collapsing the variational bound with respect to their posteriors. This leads to novel and tighter Evidence Lower Bounds (ELBOs) for performing variational inference (VI) in BNNs. Our experiments show that the new bounds significantly improve the performance of Gaussian mean-field VI applied to BNNs on a variety of data sets, demonstrating that mean-field VI works well even in deep models. We also find that the tighter ELBOs can be good optimization targets for learning the hyperparameters of hierarchical priors.
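
A concrete, simpler-than-the-paper instance of the collapsing move: for a prior N(0, λI) and a mean-field posterior N(μ, diag(σ²)) over D weights, the KL term can be minimised over λ in closed form, and substituting the minimiser back yields a tighter, λ-free bound:

```latex
% Minimising the KL over the prior scale lambda (D = number of weights):
\lambda^{*} \;=\; \operatorname*{arg\,min}_{\lambda > 0}\;
  \mathrm{KL}\!\left[\,\mathcal{N}\!\big(\mu, \operatorname{diag}(\sigma^{2})\big)
  \,\middle\|\, \mathcal{N}(0, \lambda I)\right]
  \;=\; \frac{\lVert \mu \rVert^{2} + \sum_{d=1}^{D} \sigma_{d}^{2}}{D}.
% Substituting lambda* back into the ELBO removes the prior parameter and
% tightens the bound: the "collapsing" move, in its simplest instance.
```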


Generalized Variational Continual Learning
Noel Loo, Siddharth Swaroop, Richard E Turner
ICLR 2021 [OpenReview]

Abstract: Continual learning deals with training models on new tasks and datasets in an online fashion. One strand of research has used probabilistic regularization for continual learning, with two of the main approaches in this vein being Online Elastic Weight Consolidation (Online EWC) and Variational Continual Learning (VCL). VCL employs variational inference, which in other settings has been improved empirically by applying likelihood-tempering. We show that applying this modification to VCL recovers Online EWC as a limiting case, allowing for interpolation between the two approaches. We term the general algorithm Generalized VCL (GVCL). In order to mitigate the observed overpruning effect of VI, we take inspiration from a common multi-task architecture, neural networks with task-specific FiLM layers, and find that this addition leads to significant performance gains, specifically for variational methods. In the small-data regime, GVCL strongly outperforms existing baselines. In larger datasets, GVCL with FiLM layers outperforms or is competitive with existing baselines in terms of accuracy, whilst also providing significantly better calibration.
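
Schematically, the tempered objective for task t looks as below (my paraphrase of the abstract; the paper's exact scaling, and the constants in the Online-EWC limit, differ):

```latex
% Likelihood-tempered VCL objective for task t:
\mathcal{L}_\beta(q_t) \;=\; \mathbb{E}_{q_t}\!\left[ \log p(\mathcal{D}_t \mid w) \right]
  \;-\; \beta\, \mathrm{KL}\!\left[\, q_t(w) \,\middle\|\, q_{t-1}(w) \right],
% with beta = 1 recovering standard VCL, and the limit beta -> 0 (after
% suitable rescaling) recovering Online EWC's quadratic penalty around the
% previous posterior mean.
```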


2020



Continual Deep Learning by Functional Regularisation of Memorable Past
Pingbo Pan, Siddharth Swaroop, Alexander Immer, Runa Eschenhagen, Richard E Turner, Mohammad Emtiyaz Khan
NeurIPS Oral presentation (top 1% of submissions) 2020 [arXiv] [NeurIPS]

Abstract: Continually learning new skills is important for intelligent systems, yet most deep learning methods suffer from catastrophic forgetting of the past. Recent works address this with weight regularisation. Functional regularisation, although computationally expensive, is expected to perform better, but rarely does so in practice. In this paper, we fix this issue by proposing a new functional-regularisation approach that utilises a few memorable past examples that are crucial to avoid forgetting. By using a Gaussian Process formulation of deep networks, our approach enables training in weight-space while identifying both the memorable past and a functional prior. Our method achieves state-of-the-art performance on standard benchmarks and opens a new direction for life-long learning where regularisation and memory-based methods are naturally combined.


Efficient Low Rank Gaussian Variational Inference for Neural Networks
Marcin B Tomczak, Siddharth Swaroop, Richard E Turner
NeurIPS 2020 [NeurIPS]

Abstract: Bayesian neural networks are enjoying a renaissance driven in part by recent advances in variational inference (VI). The most common form of VI employs a fully factorized or mean-field distribution, but this is known to suffer from several pathologies, especially as we expect posterior distributions with highly correlated parameters. Current algorithms that capture these correlations with a Gaussian approximating family are difficult to scale to large models due to computational costs and high variance of gradient updates. By using a new form of the reparametrization trick, we derive a computationally efficient algorithm for performing VI with a Gaussian family with a low-rank plus diagonal covariance structure. We scale to deep feed-forward and convolutional architectures. We find that adding low-rank terms to parametrized diagonal covariance does not improve predictive performance except on small networks, but low-rank terms added to a constant diagonal covariance improves performance on small and large-scale network architectures.
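
The key computational point is that a low-rank-plus-diagonal Gaussian can be sampled by reparametrization in O(pk) without ever forming the full covariance. A minimal sketch of that idea (the paper's estimator and parametrization details are not reproduced):

```python
import torch

def sample_low_rank_gaussian(mu, d, V, n_samples=1):
    """Reparametrized samples from N(mu, diag(d**2) + V @ V.T).

    mu: (p,), d: (p,), V: (p, k) with k << p. Two independent standard
    normals give the diagonal and low-rank parts of the covariance, so
    sampling costs O(pk) per draw.
    """
    p, k = V.shape
    eps1 = torch.randn(n_samples, p)   # drives diag(d**2)
    eps2 = torch.randn(n_samples, k)   # drives V @ V.T
    return mu + eps1 * d + eps2 @ V.T
```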


Combining Variational Continual Learning with FiLM Layers
Noel Loo, Siddharth Swaroop, Richard E Turner
LifeLongML Workshop Oral presentation (ICML) 2020 [OpenReview]

Abstract: The standard architecture for continual learning is a multi-headed neural network, which has shared body parameters and task-specific heads. Features for each task are generated in the same way. This could be too restrictive, particularly when tasks are very distinct. We propose combining FiLM layers, a flexible way to enable task-specific feature modulation in CNNs, with an existing algorithm, Variational Continual Learning (VCL). We show that this addition consistently improves performance, particularly when tasks are more varied. Furthermore, we demonstrate how FiLM Layers can mitigate VCL's tendency to over-prune and help it use more model capacity. Finally, we find that FiLM Layers perform feature modulation as opposed to gating, making them more flexible than binary mask based approaches.
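
For reference, a minimal FiLM layer of the kind combined with VCL here: each task owns a per-channel scale and shift applied to the shared body's features (a sketch; the paper's architecture details differ):

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Task-specific feature-wise linear modulation: task t rescales and
    shifts the shared body's channels with its own (gamma_t, beta_t)."""

    def __init__(self, n_tasks: int, n_channels: int):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(n_tasks, n_channels))
        self.beta = nn.Parameter(torch.zeros(n_tasks, n_channels))

    def forward(self, x: torch.Tensor, task: int) -> torch.Tensor:
        # x: (batch, channels, H, W) feature maps from the shared body.
        g = self.gamma[task].view(1, -1, 1, 1)
        b = self.beta[task].view(1, -1, 1, 1)
        return g * x + b   # continuous modulation, not binary gating
```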


2019



Practical Deep Learning with Bayesian Principles
Kazuki Osawa, Siddharth Swaroop, Anirudh Jain, Runa Eschenhagen, Richard E Turner, Rio Yokota, Mohammad Emtiyaz Khan
NeurIPS 2019 [NeurIPS] [arXiv]

Abstract: Bayesian methods promise to fix many shortcomings of deep learning, but they are impractical and rarely match the performance of standard methods, let alone improve them. In this paper, we demonstrate practical training of deep networks with natural-gradient variational inference. By applying techniques such as batch normalisation, data augmentation, and distributed training, we achieve similar performance in about the same number of epochs as the Adam optimiser, even on large datasets such as ImageNet. Importantly, the benefits of Bayesian principles are preserved: predictive probabilities are well-calibrated, uncertainties on out-of-distribution data are improved, and continual-learning performance is boosted. This work enables practical deep learning while preserving benefits of Bayesian principles. A PyTorch implementation is available as a plug-and-play optimiser.
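
For flavour, the diagonal natural-gradient VI update that this line of work scales up has roughly the following shape (my loose recollection of the VOGN-style update; consult the paper for the exact factors):

```latex
% With q = N(mu, diag(sigma^2)) and sigma^2 = 1 / (N (s + delta)):
s_{t+1} \;=\; (1-\beta)\, s_t + \beta\, \hat{h}(\theta_t), \qquad
\mu_{t+1} \;=\; \mu_t - \alpha\,
  \frac{\hat{g}(\theta_t) + \tilde{\delta}\,\mu_t}{\, s_{t+1} + \tilde{\delta} \,},
% where theta_t ~ q_t is a weight sample, \hat{g} a minibatch gradient, and
% \hat{h} a Gauss-Newton curvature estimate built from per-example gradients;
% the Adam-like form is what makes the method a plug-and-play optimiser.
```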


Differentially Private Federated Variational Inference
Mrinank Sharma, Michael Hutchinson, Siddharth Swaroop, Antti Honkela, Richard E Turner
Privacy in Machine Learning Workshop (NeurIPS) 2019 [arXiv]

Abstract: In many real-world applications of machine learning, data are distributed across many clients and cannot leave the devices they are stored on. Furthermore, each client's data, computational resources and communication constraints may be very different. This setting is known as federated learning, in which privacy is a key concern. Differential privacy is commonly used to provide mathematical privacy guarantees. This work, to the best of our knowledge, is the first to consider federated, differentially private, Bayesian learning. We build on Partitioned Variational Inference (PVI) which was recently developed to support approximate Bayesian inference in the federated setting. We modify the client-side optimisation of PVI to provide an (ϵ, δ)-DP guarantee. We show that it is possible to learn moderately private logistic regression models in the federated setting that achieve similar performance to models trained non-privately on centralised data.


Improving and Understanding Variational Continual Learning
Siddharth Swaroop, Thang D Bui, Cuong V Nguyen, Richard E Turner
Continual Learning Workshop Oral presentation (NeurIPS) 2018 [arXiv]

Abstract: In the continual learning setting, tasks are encountered sequentially. The goal is to learn whilst i) avoiding catastrophic forgetting, ii) efficiently using model capacity, and iii) employing forward and backward transfer learning. In this paper, we explore how the Variational Continual Learning (VCL) framework achieves these desiderata on two benchmarks in continual learning: split MNIST and permuted MNIST. We first report significantly improved results on what was already a competitive approach. The improvements are achieved by establishing a new best practice approach to mean-field variational Bayesian neural networks. We then look at the solutions in detail. This allows us to obtain an understanding of why VCL performs as it does, and we compare the solution to what an 'ideal' continual learning solution might be.


2018



Partitioned Variational Inference: A unified framework encompassing federated and continual learning
Thang D Bui, Cuong V Nguyen, Siddharth Swaroop, Richard E Turner
Preprint, Bayesian Deep Learning Workshop spotlight (NeurIPS) 2018 [arXiv]

Abstract: Variational inference (VI) has become the method of choice for fitting many modern probabilistic models. However, practitioners are faced with a fragmented literature that offers a bewildering array of algorithmic options. First, the variational family. Second, the granularity of the updates, e.g. whether the updates are local to each data point (employing message passing) or global. Third, the method of optimization (bespoke or blackbox, closed-form or stochastic updates, etc.). This paper presents a new framework, termed Partitioned Variational Inference (PVI), that explicitly acknowledges these algorithmic dimensions of VI, unifies disparate literature, and provides guidance on usage. Crucially, the proposed PVI framework allows us to identify new ways of performing VI that are ideally suited to challenging learning scenarios including federated learning (where distributed computing is leveraged to process non-centralized data) and continual learning (where new data and tasks arrive over time and must be accommodated quickly). We showcase these new capabilities by developing communication-efficient federated training of Bayesian neural networks and continual learning for Gaussian process models with private pseudo-points. The new methods significantly outperform the state-of-the-art, whilst being almost as straightforward to implement as standard VI.
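
The structural heart of PVI is a product of per-partition approximate likelihood factors together with a local refinement rule; in schematic form (refinements such as damping and stochastic variants are omitted):

```latex
% The global approximate posterior factorises over per-partition factors,
q(\theta) \;\propto\; p(\theta) \prod_{k=1}^{K} t_k(\theta),
% and partition k refines its factor by a local variational step against its
% own data y_k, holding the rest of the approximation fixed:
q^{\mathrm{new}}(\theta) \;=\; \operatorname*{arg\,max}_{q \in \mathcal{Q}} \;
  \mathbb{E}_{q}\!\left[ \log
    \frac{ p(y_k \mid \theta)\, q^{\mathrm{old}}(\theta) / t_k^{\mathrm{old}}(\theta) }
         { q(\theta) } \right],
\qquad
t_k^{\mathrm{new}}(\theta) \;=\; t_k^{\mathrm{old}}(\theta)\,
  \frac{ q^{\mathrm{new}}(\theta) }{ q^{\mathrm{old}}(\theta) }.
```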


Neural network ensembles and variational inference revisited
Marcin B Tomczak, Siddharth Swaroop, Richard E Turner
Advances in Approximate Bayesian Inference Symposium 2018 [pdf]

Abstract: Ensembling methods and variational inference provide two orthogonal methods for obtaining reliable predictive uncertainty estimates for neural networks. In this work, we compare and combine these approaches, finding that: i) variational inference outperforms ensembles of neural networks, and ii) ensembled versions of variational inference bring further improvements. The first finding appears at odds with previous work (Lakshminarayanan et al., 2017), but we show that the previous results were due to an ambiguous experimental protocol in which the model and inference method were simultaneously changed.


2017



Understanding Expectation Propagation
Siddharth Swaroop, Richard E Turner
Advances in Approximate Bayesian Inference workshop (NIPS) 2017 [pdf]

Abstract: Understanding and characterising the properties of approximate inference schemes is extremely important, but arguably understudied. This report continues work on characterising Expectation Propagation (EP), an approximate Bayesian inference scheme, looking at four toy cases of interest. We initially focus on the empirically motivated conjecture stating that EP's approximation for the model evidence is an underestimate of the true model evidence. The first two toy cases apply EP to a simple classification example. They indicate why EP tends to underestimate the model evidence on realistic datasets, even though there are counter-examples to the conjecture, which we show analytically for the first time. The third toy case uses the link between the Fully Independent Training Condition algorithm (FITC, a sparse approximation method for Gaussian Process regression) and EP to find another analytic counter-example. This toy case also raises interesting questions as to how and why FITC works, which we consider mathematically. The final toy example compares mean field EP to mean field and structured Variational Inference (VI) on a small time-series model. We find that EP's uncertainty estimates do not collapse pathologically as they do for mean field VI.
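
For readers unfamiliar with EP, the standard site update the report analyses is the following (textbook EP, not a contribution of the report):

```latex
% One EP iteration for site i: remove the site to form the cavity, tilt with
% the true factor, moment-match, then update the site approximation.
q^{\backslash i}(\theta) \;\propto\; \frac{ q(\theta) }{ \tilde{t}_i(\theta) },
\qquad
\hat{p}_i(\theta) \;\propto\; q^{\backslash i}(\theta)\, t_i(\theta),
\qquad
q^{\mathrm{new}} \;=\; \operatorname{proj}\!\left[ \hat{p}_i \right],
\qquad
\tilde{t}_i^{\mathrm{new}}(\theta) \;\propto\;
  \frac{ q^{\mathrm{new}}(\theta) }{ q^{\backslash i}(\theta) },
% where proj matches moments within the chosen exponential family; the
% model-evidence approximation whose bias the report studies accumulates the
% site normalising constants produced by these updates.
```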