2023



Improving Continual Learning by Accurate Gradient Reconstructions of the Past
Erik Daxberger, Siddharth Swaroop, Kazuki Osawa, Rio Yokota, Richard E Turner, José Miguel Hernández-Lobato, Mohammad Emtiyaz Khan
TMLR 2023 [TMLR]

Abstract: Weight-regularization and experience replay are two popular continual-learning strategies with complementary strengths: while weight-regularization requires less memory, replay can more accurately mimic batch training. How can we combine them to get better methods? Despite the simplicity of the question, little is known or done to optimally combine these approaches. In this paper, we present such a method by using a recently proposed principle of adaptation that relies on a faithful reconstruction of the gradients of the past data. Using this principle, we design a prior which combines two types of replay methods with a quadratic weight-regularizer and achieves better gradient reconstructions. The combination improves performance on standard task-incremental continual learning benchmarks such as Split-CIFAR, Split-TinyImageNet, and ImageNet-1000, achieving > 80% of the batch performance by simply utilizing a memory of < 10% of the past data. Our work shows that a good combination of the two strategies can be very effective in reducing forgetting.
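
For intuition only, here is a minimal PyTorch-style sketch of the generic weight-regularization + replay combination the abstract refers to. All names (`model`, `old_params`, `tau`, the batches) are placeholders, and the paper's actual prior, built from gradient reconstructions of the past, is more refined than this:

```python
import torch
import torch.nn.functional as F

def combined_loss(model, old_params, new_batch, memory_batch, tau=1.0):
    """Sketch: replay on stored examples + quadratic pull towards old weights.

    This only illustrates the general weight-reg + replay combination from
    the abstract; the paper's prior is constructed differently in detail.
    """
    x_new, y_new = new_batch
    x_mem, y_mem = memory_batch

    # Loss on the current task's data.
    loss = F.cross_entropy(model(x_new), y_new)

    # Experience replay: loss on a small memory of past data.
    loss = loss + F.cross_entropy(model(x_mem), y_mem)

    # Quadratic weight regularizer anchored at the previous solution.
    for p, p_old in zip(model.parameters(), old_params):
        loss = loss + 0.5 * tau * (p - p_old.detach()).pow(2).sum()
    return loss
```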


Memory Maps to Understand Models
Dharmesh Tailor, Paul Chang, Siddharth Swaroop, Eric Nalisnick, Arno Solin, Mohammad Emtiyaz Khan
Duality Principles for Modern Machine Learning Workshop @ ICML 2023 [Not available yet]

Abstract: What do models know and how? Answering this question requires exploratory analyses comparing many models, but existing techniques are specialized to specific models and analyses. We present memory maps as a general tool to understand a wide range of models by visualizing their sensitivity to data. Memory maps are extensions of residual-leverage plots where the two criteria are modified by easy-to-compute dual parameters obtained by using a Bayesian framework. The new criteria are used to understand a model's memory through a 2D scatter plot where tail regions often contain examples with high prediction-error and variance. All sorts of models can be analyzed this way, including not only those arising in kernel methods, Bayesian methods, and deep learning but also the ones obtained during training. We show use cases of memory maps to diagnose overfitting, compare various models, and analyze training trajectories.
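
For orientation, a small NumPy sketch of the classical residual-leverage criteria for ridge regression, the plot that memory maps extend. The paper's dual/Bayesian replacements for the two axes are not reproduced here, and all data below are synthetic:

```python
import numpy as np
import matplotlib.pyplot as plt

# Classical residual-leverage criteria for ridge regression: the starting
# point that memory maps generalize (the paper replaces both axes with
# easy-to-compute dual quantities; this sketch is the baseline only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=200)

lam = 1.0
H = X @ np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T)  # hat matrix
leverage = np.diag(H)            # sensitivity of the fit to each example
residual = y - H @ y             # prediction error per example

plt.scatter(leverage, residual)  # tail regions flag influential / poorly-fit points
plt.xlabel("leverage"); plt.ylabel("residual"); plt.show()
```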


Adaptive interventions for both accuracy and time in AI-assisted human decision making
Siddharth Swaroop, Zana Buçinca, Finale Doshi-Velez
AI&HCI Workshop @ ICML 2023 [arXiv]

Abstract: In settings where users are both time-pressured and need high accuracy, such as doctors working in Emergency Rooms, we want to provide AI assistance that both increases accuracy and reduces time. However, different types of AI assistance have different benefits: some reduce time taken while increasing overreliance on AI, while others do the opposite. We therefore want to adapt what AI assistance we show depending on various properties (of the question and of the user) in order to best trade off our two objectives. We introduce a study where users have to prescribe medicines to aliens, and use it to explore the potential for adapting AI assistance. We find evidence that it is beneficial to adapt our AI assistance depending on the question, leading to good tradeoffs between time taken and accuracy. Future work would consider machine-learning algorithms (such as reinforcement learning) to adapt AI assistance automatically and quickly.


Discovering User Types: Mapping User Traits by Task-Specific Behaviors in Reinforcement Learning
Lars L. Ankile, Brian S. Ham, Kevin Mao, Eura Shin, Siddharth Swaroop, Finale Doshi-Velez, Weiwei Pan
AI&HCI Workshop @ ICML 2023 (Honorable mention for best paper award), Interactive Learning with Implicit Human Feedback Workshop @ ICML 2023 [arXiv]

Abstract: When assisting human users in reinforcement learning (RL), we can represent users as RL agents and study key parameters, called "user traits", to inform intervention design. We study the relationship between user behaviors (policy classes) and user traits. Given an environment, we introduce an intuitive tool for studying the breakdown of "user types": broad sets of traits that result in the same behavior. We show that seemingly different real-world environments admit the same set of user types and formalize this observation as an equivalence relation defined on environments. By transferring intervention design between environments within the same equivalence class, we can help rapidly personalize interventions.


Soft prompting might be a bug, not a feature
Luke Bailey, Gustaf Ahdritz, Anat Kleiman, Siddharth Swaroop, Finale Doshi-Velez, Weiwei Pan
Challenges of Deploying Generative AI Workshop @ ICML 2023 [OpenReview]

Abstract: Prompt tuning, or "soft prompting," replaces text prompts to generative models with learned embeddings (i.e. vectors) and is used as an alternative to parameter-efficient fine-tuning. Prior work suggests analyzing soft prompts by interpreting them as natural language prompts. However, we find that soft prompts occupy regions in the embedding space that are distinct from those containing natural language, meaning that direct comparisons may be misleading. We argue that because soft prompts are currently uninterpretable, they could potentially be a source of vulnerability of LLMs to malicious manipulations during deployment.
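
A minimal sketch of what "soft prompting" means mechanically, under the usual prompt-tuning setup (learned embeddings prepended to the token embeddings of a frozen model). Names and dimensions are illustrative, not the paper's setup:

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Prompt tuning in miniature: a learned block of embeddings is prepended
    to the input token embeddings while the base model stays frozen."""

    def __init__(self, n_prompt_tokens: int, embed_dim: int):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(n_prompt_tokens, embed_dim) * 0.02)

    def forward(self, token_embeds: torch.Tensor) -> torch.Tensor:
        # token_embeds: (batch, seq_len, embed_dim)
        batch = token_embeds.shape[0]
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        # Nothing constrains self.prompt to stay near the embeddings of real
        # tokens, which is why soft prompts can drift into regions of
        # embedding space that no natural-language prompt occupies.
        return torch.cat([prompt, token_embeds], dim=1)
```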


Differentially private partitioned variational inference
Mikko A. Heikkilä, Matthew Ashman, Siddharth Swaroop, Richard E Turner, Antti Honkela
TMLR 2023 [TMLR]

Abstract: Learning a privacy-preserving model from sensitive data which are distributed across multiple devices is an increasingly important problem. The problem is often formulated in the federated learning context, with the aim of learning a single global model while keeping the data distributed. Moreover, Bayesian learning is a popular approach for modelling, since it naturally supports reliable uncertainty estimates. However, Bayesian learning is generally intractable even with centralised non-private data and so approximation techniques such as variational inference are a necessity. Variational inference has recently been extended to the non-private federated learning setting via the partitioned variational inference algorithm. For privacy protection, the current gold standard is called differential privacy. Differential privacy guarantees privacy in a strong, mathematically well-defined sense. In this paper, we present differentially private partitioned variational inference, the first general framework for learning a variational approximation to a Bayesian posterior distribution in the federated learning setting while minimising the number of communication rounds and providing differential privacy guarantees for data subjects. We propose three alternative implementations in the general framework, one based on perturbing local optimisation runs done by individual parties, and two based on perturbing updates to the global model (one using a version of federated averaging, the second one adding virtual parties to the protocol), and compare their properties both theoretically and empirically. We show that perturbing the local optimisation works well with simple and complex models as long as each party has enough local data. However, the privacy is always guaranteed independently by each party. In contrast, perturbing the global updates works best with relatively simple models. Given access to suitable secure primitives, such as secure aggregation or secure shuffling, the performance can be improved by all parties guaranteeing privacy jointly.
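
The common core of all three implementations is the Gaussian mechanism: clip an update to bound its sensitivity, then add calibrated noise. A schematic sketch of that core, not any of the paper's three variants verbatim (`clip_norm` and `noise_mult` are generic knobs):

```python
import numpy as np

def dp_update(update: np.ndarray, clip_norm: float, noise_mult: float,
              rng=None) -> np.ndarray:
    """Gaussian mechanism on an update vector: clip to bound the L2
    sensitivity, then add noise scaled to the clipping bound. The paper
    applies this idea at different points (local optimisation runs vs.
    global model updates); this is only the shared building block."""
    if rng is None:
        rng = np.random.default_rng()
    norm = np.linalg.norm(update)
    clipped = update * min(1.0, clip_norm / (norm + 1e-12))
    noise = rng.normal(0.0, noise_mult * clip_norm, size=update.shape)
    return clipped + noise
```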


Modeling Mobile Health Users as Reinforcement Learning Agents
Eura Shin, Siddharth Swaroop, Weiwei Pan, Susan Murphy, Finale Doshi-Velez
AAAI Workshop on AI for Behavior Change (Contributed talk) 2023 [arXiv]

Abstract: Mobile health (mHealth) technologies empower patients to adopt/maintain healthy behaviors in their daily lives, by providing interventions (e.g. push notifications) tailored to the user's needs. In these settings, without intervention, human decision making may be impaired (e.g. valuing near-term pleasure over their own long-term goals). In this work, we formalize this relationship with a framework in which the user optimizes a (potentially impaired) Markov Decision Process (MDP) and the mHealth agent intervenes on the user's MDP parameters. We show that different types of impairments imply different types of optimal intervention. We also provide analytical and empirical explorations of these differences.
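
As a toy illustration of the framework (not the paper's experiments), one impairment type can be modelled as a myopic discount: the user solves the same MDP with a smaller gamma, so their policy deviates from the far-sighted one, and an agent intervening on the user's MDP parameters can close that gap. All quantities below are synthetic:

```python
import numpy as np

def value_iteration(P, R, gamma, n_iters=500):
    """Standard value iteration. P: (A, S, S) transitions, R: (S, A) rewards."""
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    for _ in range(n_iters):
        Q = R + gamma * np.einsum("ast,t->sa", P, V)
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

# Hypothetical impairment: the user is myopic (gamma_user < gamma_true),
# so their optimal policy differs from the far-sighted one in some states.
A, S = 2, 4
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(S), size=(A, S))
R = rng.normal(size=(S, A))

farsighted = value_iteration(P, R, gamma=0.99)
myopic = value_iteration(P, R, gamma=0.5)
print("policies differ at states:", np.nonzero(farsighted != myopic)[0])
```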


2022



Probabilistic Continual Learning using Neural Networks
Siddharth Swaroop
PhD thesis 2022 [PhD thesis]

Abstract: Neural networks are being increasingly used in society due to their strong performance at a large scale. They excel when they have access to all data at once, requiring multiple passes through the data. However, standard deep-learning techniques are unable to continually adapt as the environment changes: either they forget old data or they fail to sufficiently adapt to new data. This limitation is a major barrier to applications in many real-world settings, where the environment is often changing, and also in stark contrast to humans, who continuously learn over their lifetimes. The study of learning systems in these settings is called continual learning: data examples arrive sequentially and predictions must be made online. In this thesis we present new algorithms for continual learning using neural networks. We use the probabilistic approach, which maintains a distribution over beliefs, naturally handling continual learning by recursively updating from priors to posteriors. Although previous work has been limited by approximations to this idealised scheme, we scale our probabilistic algorithms to large-data settings and show strong empirical performance. We also theoretically analyse why our algorithms perform well in continual learning.
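
The recursive prior-to-posterior update underlying the probabilistic approach, written out for reference (standard Bayes plus its usual variational approximation, not a result specific to the thesis):

```latex
% Recursive Bayesian updating over tasks: yesterday's posterior is today's prior,
p(\theta \mid \mathcal{D}_{1:t}) \;\propto\; p(\mathcal{D}_t \mid \theta)\, p(\theta \mid \mathcal{D}_{1:t-1}),
% which continual-learning algorithms approximate, e.g. with a variational q_t:
q_t(\theta) \;=\; \operatorname*{arg\,min}_{q \in \mathcal{Q}} \;
  \mathrm{KL}\!\left[\, q(\theta) \,\middle\|\, \tfrac{1}{Z_t}\, p(\mathcal{D}_t \mid \theta)\, q_{t-1}(\theta) \right].
```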


Partitioned Variational Inference: A Framework for Probabilistic Federated Learning
Matthew Ashman, Thang D Bui, Cuong V Nguyen, Stratis Markou, Adrian Weller, Siddharth Swaroop, Richard E Turner
Preprint 2022 [arXiv]

Abstract: The proliferation of computing devices has brought about an opportunity to deploy machine learning models on new problem domains using previously inaccessible data. Traditional algorithms for training such models often require data to be stored on a single machine with compute performed by a single node, making them unsuitable for decentralised training on multiple devices. This deficiency has motivated the development of federated learning algorithms, which allow multiple data owners to train collaboratively and use a shared model whilst keeping local data private. However, many of these algorithms focus on obtaining point estimates of model parameters, rather than probabilistic estimates capable of capturing model uncertainty, which is essential in many applications. Variational inference (VI) has become the method of choice for fitting many modern probabilistic models. In this paper we introduce partitioned variational inference (PVI), a general framework for performing VI in the federated setting. We develop new supporting theory for PVI, demonstrating a number of properties that make it an attractive choice for practitioners; use PVI to unify a wealth of fragmented, yet related literature; and provide empirical results that showcase the effectiveness of PVI in a variety of federated settings.


2021



Knowledge-Adaptation Priors
Mohammad Emtiyaz Khan & Siddharth Swaroop
NeurIPS 2021 [arXiv] [NeurIPS]

Abstract: Humans and animals have a natural ability to quickly adapt to their surroundings, but machine-learning models, when subjected to changes, often require a complete retraining from scratch. We present Knowledge-adaptation priors (K-priors) to reduce the cost of retraining by enabling quick and accurate adaptation for a wide variety of tasks and models. This is made possible by a combination of weight and function-space priors to reconstruct the gradients of the past, which recovers and generalizes many existing, but seemingly-unrelated, adaptation strategies. Training with simple first-order gradient methods can often recover the exact retrained model to an arbitrary accuracy by choosing a sufficiently large memory of the past data. Empirical results show that adaptation with K-priors achieves performance similar to full retraining, but only requires training on a handful of past examples.
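
Schematically, and only as far as the abstract describes it, a K-prior combines a function-space divergence on a memory M with a weight-space quadratic; in the GLM case its gradient takes roughly the form below (notation mine; the exact constants and divergence choices are in the paper):

```latex
% Sketch of the K-prior idea: function-space term on memory points plus a
% weight-space quadratic anchored at the old solution w_*,
\mathcal{K}(w) \;=\; \sum_{i \in \mathcal{M}}
    \mathrm{D}\!\left[\, f_w(x_i) \,\big\|\, f_{w_*}(x_i) \right]
  \;+\; \frac{\delta}{2}\, \lVert w - w_* \rVert^2,
% whose gradient reconstructs the gradient of the full past loss (GLM case,
% with link function sigma):
\nabla \mathcal{K}(w) \;=\; \sum_{i \in \mathcal{M}}
    \big[\sigma(f_w(x_i)) - \sigma(f_{w_*}(x_i))\big]\, \nabla f_w(x_i)
  \;+\; \delta\,(w - w_*).
```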


Collapsed Variational Bounds for Bayesian Neural Networks
Marcin B Tomczak, Siddharth Swaroop, Andrew YK Foong, Richard E Turner
NeurIPS 2021 [NeurIPS]

Abstract: Recent interest in learning large variational Bayesian Neural Networks (BNNs) has been partly hampered by poor predictive performance caused by underfitting, and their performance is known to be very sensitive to the prior over weights. Current practice often fixes the prior parameters to standard values or tunes them using heuristics or cross-validation. In this paper, we treat prior parameters in a distributional way by extending the model and collapsing the variational bound with respect to their posteriors. This leads to novel and tighter Evidence Lower Bounds (ELBOs) for performing variational inference (VI) in BNNs. Our experiments show that the new bounds significantly improve the performance of Gaussian mean-field VI applied to BNNs on a variety of data sets, demonstrating that mean-field VI works well even in deep models. We also find that the tighter ELBOs can be good optimization targets for learning the hyperparameters of hierarchical priors.
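
A concrete, simpler-than-the-paper instance of the collapsing move: for a prior N(0, λI) and a mean-field posterior N(μ, diag(σ²)) over D weights, the KL term can be minimised over λ in closed form, and substituting the minimiser back yields a tighter, λ-free bound:

```latex
% Minimising the KL over the prior scale lambda (D = number of weights):
\lambda^{*} \;=\; \operatorname*{arg\,min}_{\lambda > 0}\;
  \mathrm{KL}\!\left[\,\mathcal{N}\!\big(\mu, \operatorname{diag}(\sigma^{2})\big)
  \,\middle\|\, \mathcal{N}(0, \lambda I)\right]
  \;=\; \frac{\lVert \mu \rVert^{2} + \sum_{d=1}^{D} \sigma_{d}^{2}}{D}.
% Substituting lambda* back into the ELBO removes the prior parameter and
% tightens the bound: the "collapsing" move, in its simplest instance.
```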


Generalized Variational Continual Learning
Noel Loo, Siddharth Swaroop, Richard E Turner
ICLR 2021 [OpenReview]

Abstract: Continual learning deals with training models on new tasks and datasets in an online fashion. One strand of research has used probabilistic regularization for continual learning, with two of the main approaches in this vein being Online Elastic Weight Consolidation (Online EWC) and Variational Continual Learning (VCL). VCL employs variational inference, which in other settings has been improved empirically by applying likelihood-tempering. We show that applying this modification to VCL recovers Online EWC as a limiting case, allowing for interpolation between the two approaches. We term the general algorithm Generalized VCL (GVCL). In order to mitigate the observed overpruning effect of VI, we take inspiration from a common multi-task architecture, neural networks with task-specific FiLM layers, and find that this addition leads to significant performance gains, specifically for variational methods. In the small-data regime, GVCL strongly outperforms existing baselines. In larger datasets, GVCL with FiLM layers outperforms or is competitive with existing baselines in terms of accuracy, whilst also providing significantly better calibration.
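
Schematically, the tempered objective for task t looks as below (my paraphrase of the abstract; the paper's exact scaling, and the constants in the Online-EWC limit, differ):

```latex
% Likelihood-tempered VCL objective for task t:
\mathcal{L}_\beta(q_t) \;=\; \mathbb{E}_{q_t}\!\left[ \log p(\mathcal{D}_t \mid w) \right]
  \;-\; \beta\, \mathrm{KL}\!\left[\, q_t(w) \,\middle\|\, q_{t-1}(w) \right],
% with beta = 1 recovering standard VCL, and the limit beta -> 0 (after
% suitable rescaling) recovering Online EWC's quadratic penalty around the
% previous posterior mean.
```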


2020



Continual Deep Learning by Functional Regularisation of Memorable Past
Pingbo Pan, Siddharth Swaroop, Alexander Immer, Runa Eschenhagen, Richard E Turner, Mohammad Emtiyaz Khan
NeurIPS Oral presentation (top 1% of submissions) 2020 [arXiv] [NeurIPS]

Abstract: Continually learning new skills is important for intelligent systems, yet most deep learning methods suffer from catastrophic forgetting of the past. Recent works address this with weight regularisation. Functional regularisation, although computationally expensive, is expected to perform better, but rarely does so in practice. In this paper, we fix this issue by proposing a new functional-regularisation approach that utilises a few memorable past examples that are crucial to avoid forgetting. By using a Gaussian Process formulation of deep networks, our approach enables training in weight-space while identifying both the memorable past and a functional prior. Our method achieves state-of-the-art performance on standard benchmarks and opens a new direction for life-long learning where regularisation and memory-based methods are naturally combined.


Efficient Low Rank Gaussian Variational Inference for Neural Networks
Marcin B Tomczak, Siddharth Swaroop, Richard E Turner
NeurIPS 2020 [NeurIPS]

Abstract: Bayesian neural networks are enjoying a renaissance driven in part by recent advances in variational inference (VI). The most common form of VI employs a fully factorized or mean-field distribution, but this is known to suffer from several pathologies, especially as we expect posterior distributions with highly correlated parameters. Current algorithms that capture these correlations with a Gaussian approximating family are difficult to scale to large models due to computational costs and high variance of gradient updates. By using a new form of the reparametrization trick, we derive a computationally efficient algorithm for performing VI with a Gaussian family with a low-rank plus diagonal covariance structure. We scale to deep feed-forward and convolutional architectures. We find that adding low-rank terms to parametrized diagonal covariance does not improve predictive performance except on small networks, but low-rank terms added to a constant diagonal covariance improves performance on small and large-scale network architectures.
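
The key computational point is that a low-rank-plus-diagonal Gaussian can be sampled by reparametrization in O(pk) without ever forming the full covariance. A minimal sketch of that idea (the paper's estimator and parametrization details are not reproduced):

```python
import torch

def sample_low_rank_gaussian(mu, d, V, n_samples=1):
    """Reparametrized samples from N(mu, diag(d**2) + V @ V.T).

    mu: (p,), d: (p,), V: (p, k) with k << p. Two independent standard
    normals give the diagonal and low-rank parts of the covariance, so
    sampling costs O(pk) per draw.
    """
    p, k = V.shape
    eps1 = torch.randn(n_samples, p)   # drives diag(d**2)
    eps2 = torch.randn(n_samples, k)   # drives V @ V.T
    return mu + eps1 * d + eps2 @ V.T
```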


Combining Variational Continual Learning with FiLM Layers
Noel Loo, Siddharth Swaroop, Richard E Turner
LifeLongML Workshop Oral presentation (ICML) 2020 [OpenReview]

Abstract: The standard architecture for continual learning is a multi-headed neural network, which has shared body parameters and task-specific heads. Features for each task are generated in the same way. This could be too restrictive, particularly when tasks are very distinct. We propose combining FiLM layers, a flexible way to enable task-specific feature modulation in CNNs, with an existing algorithm, Variational Continual Learning (VCL). We show that this addition consistently improves performance, particularly when tasks are more varied. Furthermore, we demonstrate how FiLM Layers can mitigate VCL's tendency to over-prune and help it use more model capacity. Finally, we find that FiLM Layers perform feature modulation as opposed to gating, making them more flexible than binary mask based approaches.
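
For reference, a minimal FiLM layer of the kind combined with VCL here: each task owns a per-channel scale and shift applied to the shared body's features (a sketch; the paper's architecture details differ):

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Task-specific feature-wise linear modulation: task t rescales and
    shifts the shared body's channels with its own (gamma_t, beta_t)."""

    def __init__(self, n_tasks: int, n_channels: int):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(n_tasks, n_channels))
        self.beta = nn.Parameter(torch.zeros(n_tasks, n_channels))

    def forward(self, x: torch.Tensor, task: int) -> torch.Tensor:
        # x: (batch, channels, H, W) feature maps from the shared body.
        g = self.gamma[task].view(1, -1, 1, 1)
        b = self.beta[task].view(1, -1, 1, 1)
        return g * x + b   # continuous modulation, not binary gating
```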


2019



Practical Deep Learning with Bayesian Principles
Kazuki Osawa, Siddharth Swaroop, Anirudh Jain, Runa Eschenhagen, Richard E Turner, Rio Yokota, Mohammad Emtiyaz Khan
NeurIPS 2019 [NeurIPS] [arXiv]

Abstract: Bayesian methods promise to fix many shortcomings of deep learning, but they are impractical and rarely match the performance of standard methods, let alone improve them. In this paper, we demonstrate practical training of deep networks with natural-gradient variational inference. By applying techniques such as batch normalisation, data augmentation, and distributed training, we achieve similar performance in about the same number of epochs as the Adam optimiser, even on large datasets such as ImageNet. Importantly, the benefits of Bayesian principles are preserved: predictive probabilities are well-calibrated, uncertainties on out-of-distribution data are improved, and continual-learning performance is boosted. This work enables practical deep learning while preserving benefits of Bayesian principles. A PyTorch implementation is available as a plug-and-play optimiser.
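
For flavour, the diagonal natural-gradient VI update that this line of work scales up has roughly the following shape (my loose recollection of the VOGN-style update; consult the paper for the exact factors):

```latex
% With q = N(mu, diag(sigma^2)) and sigma^2 = 1 / (N (s + delta)):
s_{t+1} \;=\; (1-\beta)\, s_t + \beta\, \hat{h}(\theta_t), \qquad
\mu_{t+1} \;=\; \mu_t - \alpha\,
  \frac{\hat{g}(\theta_t) + \tilde{\delta}\,\mu_t}{\, s_{t+1} + \tilde{\delta} \,},
% where theta_t ~ q_t is a weight sample, \hat{g} a minibatch gradient, and
% \hat{h} a Gauss-Newton curvature estimate built from per-example gradients;
% the Adam-like form is what makes the method a plug-and-play optimiser.
```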


Differentially Private Federated Variational Inference
Mrinank Sharma, Michael Hutchinson, Siddharth Swaroop, Antti Honkela, Richard E Turner
Privacy in Machine Learning Workshop (NeurIPS) 2019 [arXiv]

Abstract: In many real-world applications of machine learning, data are distributed across many clients and cannot leave the devices they are stored on. Furthermore, each client's data, computational resources and communication constraints may be very different. This setting is known as federated learning, in which privacy is a key concern. Differential privacy is commonly used to provide mathematical privacy guarantees. This work, to the best of our knowledge, is the first to consider federated, differentially private, Bayesian learning. We build on Partitioned Variational Inference (PVI) which was recently developed to support approximate Bayesian inference in the federated setting. We modify the client-side optimisation of PVI to provide an (ϵ, δ)-DP guarantee. We show that it is possible to learn moderately private logistic regression models in the federated setting that achieve similar performance to models trained non-privately on centralised data.


Improving and Understanding Variational Continual Learning
Siddharth Swaroop, Thang D Bui, Cuong V Nguyen, Richard E Turner
Continual Learning Workshop Oral presentation (NeurIPS) 2018 [arXiv]

Abstract: In the continual learning setting, tasks are encountered sequentially. The goal is to learn whilst i) avoiding catastrophic forgetting, ii) efficiently using model capacity, and iii) employing forward and backward transfer learning. In this paper, we explore how the Variational Continual Learning (VCL) framework achieves these desiderata on two benchmarks in continual learning: split MNIST and permuted MNIST. We first report significantly improved results on what was already a competitive approach. The improvements are achieved by establishing a new best practice approach to mean-field variational Bayesian neural networks. We then look at the solutions in detail. This allows us to obtain an understanding of why VCL performs as it does, and we compare the solution to what an 'ideal' continual learning solution might be.


2018



Partitioned Variational Inference: A unified framework encompassing federated and continual learning
Thang D Bui, Cuong V Nguyen, Siddharth Swaroop, Richard E Turner
Preprint, Bayesian Deep Learning Workshop spotlight (NeurIPS) 2018 [arXiv]

Abstract: Variational inference (VI) has become the method of choice for fitting many modern probabilistic models. However, practitioners are faced with a fragmented literature that offers a bewildering array of algorithmic options. First, the variational family. Second, the granularity of the updates, e.g. whether the updates are local to each data point (employing message passing) or global. Third, the method of optimization (bespoke or blackbox, closed-form or stochastic updates, etc.). This paper presents a new framework, termed Partitioned Variational Inference (PVI), that explicitly acknowledges these algorithmic dimensions of VI, unifies disparate literature, and provides guidance on usage. Crucially, the proposed PVI framework allows us to identify new ways of performing VI that are ideally suited to challenging learning scenarios including federated learning (where distributed computing is leveraged to process non-centralized data) and continual learning (where new data and tasks arrive over time and must be accommodated quickly). We showcase these new capabilities by developing communication-efficient federated training of Bayesian neural networks and continual learning for Gaussian process models with private pseudo-points. The new methods significantly outperform the state-of-the-art, whilst being almost as straightforward to implement as standard VI.
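
The structural heart of PVI is a product of per-partition approximate likelihood factors together with a local refinement rule; in schematic form (refinements such as damping and stochastic variants are omitted):

```latex
% The global approximate posterior factorises over per-partition factors,
q(\theta) \;\propto\; p(\theta) \prod_{k=1}^{K} t_k(\theta),
% and partition k refines its factor by a local variational step against its
% own data y_k, holding the rest of the approximation fixed:
q^{\mathrm{new}}(\theta) \;=\; \operatorname*{arg\,max}_{q \in \mathcal{Q}} \;
  \mathbb{E}_{q}\!\left[ \log
    \frac{ p(y_k \mid \theta)\, q^{\mathrm{old}}(\theta) / t_k^{\mathrm{old}}(\theta) }
         { q(\theta) } \right],
\qquad
t_k^{\mathrm{new}}(\theta) \;=\; t_k^{\mathrm{old}}(\theta)\,
  \frac{ q^{\mathrm{new}}(\theta) }{ q^{\mathrm{old}}(\theta) }.
```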


Neural network ensembles and variational inference revisited
Marcin B Tomczak, Siddharth Swaroop, Richard E Turner
Advances in Approximate Bayesian Inference Symposium 2018 [pdf]

Abstract: Ensembling methods and variational inference provide two orthogonal methods for obtaining reliable predictive uncertainty estimates for neural networks. In this work, we compare and combine these approaches, finding that: i) variational inference outperforms ensembles of neural networks, and ii) ensembled versions of variational inference bring further improvements. The first finding appears at odds with previous work (Lakshminarayanan et al., 2017), but we show that the previous results were due to an ambiguous experimental protocol in which the model and inference method were simultaneously changed.


2017



Understanding Expectation Propagation
Siddharth Swaroop, Richard E Turner
Advances in Approximate Bayesian Inference workshop (NIPS) 2017 [pdf]

Abstract: Understanding and characterising the properties of approximate inference schemes is extremely important, but arguably understudied. This report continues work on characterising Expectation Propagation (EP), an approximate Bayesian inference scheme, looking at four toy cases of interest. We initially focus on the empirically motivated conjecture stating that EP's approximation for the model evidence is an underestimate of the true model evidence. The first two toy cases apply EP to a simple classification example. They indicate why EP tends to underestimate the model evidence on realistic datasets, even though there are counter-examples to the conjecture, which we show analytically for the first time. The third toy case uses the link between the Fully Independent Training Condition algorithm (FITC, a sparse approximation method for Gaussian Process regression) and EP to find another analytic counter-example. This toy case also raises interesting questions as to how and why FITC works, which we consider mathematically. The final toy example compares mean field EP to mean field and structured Variational Inference (VI) on a small time-series model. We find that EP's uncertainty estimates do not collapse pathologically as they do for mean field VI.
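
For readers unfamiliar with EP, the standard site update the report analyses is the following (textbook EP, not a contribution of the report):

```latex
% One EP iteration for site i: remove the site to form the cavity, tilt with
% the true factor, moment-match, then update the site approximation.
q^{\backslash i}(\theta) \;\propto\; \frac{ q(\theta) }{ \tilde{t}_i(\theta) },
\qquad
\hat{p}_i(\theta) \;\propto\; q^{\backslash i}(\theta)\, t_i(\theta),
\qquad
q^{\mathrm{new}} \;=\; \operatorname{proj}\!\left[ \hat{p}_i \right],
\qquad
\tilde{t}_i^{\mathrm{new}}(\theta) \;\propto\;
  \frac{ q^{\mathrm{new}}(\theta) }{ q^{\backslash i}(\theta) },
% where proj matches moments within the chosen exponential family; the
% model-evidence approximation whose bias the report studies accumulates the
% site normalising constants produced by these updates.
```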