# Vincent Zoonekynd's Blog

## Mon, 15 Jun 2020: ICLR 2020

Here are some of the topics I found interesting at the recent ICLR 2020 conference.

While most neural networks are based on linear transformations and element-wise nonlinearities, one deceptively simple operation is at the core of many innovations (LSTM, attention, hypernetworks, etc.): multiplication.

The notion of gradient is omnipresent in deep learning, but we can compute several of them (gradient of the output or of the loss, wrt the input or the parameters), and they all have their uses.

There is an increasing interplay between classical algorithms (numeric algorithms, machine learning algorithms, etc.) and deep learning: we can use algorithms as layers inside neural nets, we can use neural nets as building blocks inside algorithms, and we can even put neural nets inside neural nets -- anything in a neural net can be replaced by another neural net: weights, learning rate schedule, activation functions, optimization algorithm, loss, etc.

Our understanding of why deep learning works has improved over the years, but most of the results have only been proved for unrealistic models (such as linear neural nets). Since there was also empirical evidence for more realistic models, it was widely believed that those results remained valid for general networks. That was wrong.

We often consider that the set of possible weights for a given neural net is Euclidean (in particular, flat) -- it is an easy choice, but not a natural one: it is possible to deform this space to make the loss landscape easier to optimize on. The geometry of the input and/or the output may not be Euclidean either: for instance, the orientation of an object in 3D space is a point on a sphere.

Convolutional neural networks look for patterns regardless of their location in the image: they are translation-equivariant. But this is not the only desirable group action: we may also want to detect scale- and rotation-invariant patterns.

Neural networks are notorious for making incorrect predictions with high confidence: ensembles can help detect when this happens -- but there are clever and economical ways of building those ensembles, even from a single model.

Supervised learning needs data. Labeled data. But the quality of those labels is often questionable: we expect some of them to be incorrect, and want our models to account for those noisy labels.

With the rise of the internet of things and mobile devices, we need to reduce the energy consumption and/or the size of our models. This is often done by pruning and quantization, but naively applying those on a trained network is suboptimal: an end-to-end approach works better.

Adversarial attacks, and the ease with which we can fool deep learning models, are frightening, but research on more robust models progresses -- and so does that on attacks to evade those defenses: it is still a cat-and-mouse game.

The conference also covered countless other topics: small details to improve the training of neural nets; transformers, their applications to natural language processing, and their generalizations to images or graphs; graph neural networks, their applications to knowledge graphs; compositionality; disentanglement; reinforcement learning; etc.

The usual caveats apply: my selection is biased by my centers of interest, my understanding (or lack thereof) of certain fields, and the ease with which I can summarize them. In each paper, I mention what I understood, which is incomplete, sometimes beside the point (the "prior work" section is often very good -- I may stop there), and occasionally incorrect, if I misunderstand or overlook an important point: consider this summary as a list of potentially useful ideas, not a list of truths.

## Multiplication

Neural networks tend to only use simple operations: linear transformations and element-wise nonlinearities. Once in a while, someone tries to add a more complicated operation: multiplication. This may sound underwhelming, but this was the main innovation of LSTMs or, more generally, of gating mechanisms (a gating mechanism is just a multiplication by a number between 0 and 1, thanks to a sigmoid function). Attention is also a multiplication (between query, key and value). Hypernetworks can also be formulated in this way.

In a standard neural net, one could replace concatenation with multiplication or, more generally, linear maps (x,z) ↦ W [x;z] + b with bilinear ones (x,z) ↦ z' W x + z' U + V x + b, where W is a 3-dimensional tensor.

Multiplicative Interactions and Where to Find Them
S.M. Jayakumar et al.
https://openreview.net/forum?id=rylnK6VtDH
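
To make the bilinear map above concrete, here is a minimal sketch in plain Python. The shapes and the index order of W are my own choices for illustration (the paper may lay them out differently): x has n entries, z has m, the output has k.

```python
# Sketch of (x, z) -> z' W x + z' U + V x + b with plain Python lists.
# Assumed shapes: W is (m, k, n), U is (m, k), V is (k, n), b has k entries.

def bilinear(x, z, W, U, V, b):
    m, k, n = len(z), len(b), len(x)
    out = []
    for j in range(k):
        # The genuinely multiplicative term: every z[i] meets every x[l].
        multiplicative = sum(z[i] * W[i][j][l] * x[l]
                             for i in range(m) for l in range(n))
        from_z = sum(z[i] * U[i][j] for i in range(m))
        from_x = sum(V[j][l] * x[l] for l in range(n))
        out.append(multiplicative + from_z + from_x + b[j])
    return out

# Tiny example with n = 2, m = 2, k = 1: the output is z0*x0 + z1*x1.
W = [[[1.0, 0.0]], [[0.0, 1.0]]]
U = [[0.0], [0.0]]
V = [[0.0, 0.0]]
b = [0.0]
y = bilinear([2.0, 3.0], [10.0, 100.0], W, U, V, b)
```

With W set to zero, only the terms z' U + V x + b remain, which is the usual linear layer applied to the concatenation [x;z].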

## Gradients

The notion of "gradient" is omnipresent in deep learning, but it is ambiguous: we can compute several gradients. What are we differentiating? The loss (a scalar)? or the output (some tensor, e.g., an image or a probability distribution)? What are we differentiating it with respect to? The parameters? or the input?

All those gradients are useful.

For training, we use the gradient of the loss wrt the parameters.

The gradient of the output wrt the parameters provides a latent representation of the input.

Gradients as Features for Deep Representation Learning
F. Mu et al.
https://openreview.net/forum?id=BkeoaeHKDS

The gradient of the loss wrt the input is used to mount adversarial attacks. One way to thwart those attacks is to use non-differentiable layers (differentiable wrt the parameters, for training, but not wrt the input).

Enhancing Adversarial Defense by k-Winners-Take-All
C. Xiao et al.
https://openreview.net/forum?id=Skgvy64tvr
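
As an illustration, a k-winners-take-all activation can be sketched in a few lines: it keeps the k largest activations and zeroes the rest. The function name and the tie-breaking rule are mine, not the paper's.

```python
def k_winners_take_all(activations, k):
    """Keep the k largest activations, set the others to zero."""
    if k <= 0:
        return [0.0 for _ in activations]
    # Indices of the k largest values (ties broken by position).
    winners = sorted(range(len(activations)),
                     key=lambda i: activations[i], reverse=True)[:k]
    keep = set(winners)
    return [a if i in keep else 0.0 for i, a in enumerate(activations)]

h = [0.3, -1.2, 2.5, 0.9, 1.1]
out = k_winners_take_all(h, 2)

# A tiny change of the input can swap the winners: the output jumps.
before = k_winners_take_all([1.00, 1.01], 1)
after = k_winners_take_all([1.01, 1.00], 1)
```

The jump between `before` and `after` is the point: as a function of the input, the layer is discontinuous, which is what makes gradient-based attacks harder.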

In the case of images, that gradient (of the loss wrt the input) is itself an image, which can help tell how robust a model is: for robust models, it looks like the input, i.e., like an actual picture, while for non-robust models, it looks like noise. During training, adding a discriminator to make this gradient image look more like a picture improves robustness.

Jacobian Adversarially Regularized Networks for Robustness
A. Chan et al.
https://openreview.net/forum?id=Hke0V1rKPS

An oblique decision tree is a decision tree whose decision nodes are linear classifiers (i.e., of the form w'x<α instead of xᵢ<α). Locally constant neural networks model oblique decision trees, in a parsimonious way (N neurons suffice for 2^N nodes); they can be obtained as the gradient (of the output wrt the input) of a ReLU network.

Oblique Decision Trees from Derivatives of ReLU Networks
G.H. Lee and T.S. Jaakkola
https://openreview.net/forum?id=Bke8UR4FPB
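
Here is a small sketch of why the gradient of a ReLU network is locally constant. I use a single hidden layer for simplicity (the paper's architecture is more elaborate); the weights are made up.

```python
def relu_net_input_gradient(x, W, b, v):
    """Gradient wrt x of f(x) = sum_j v[j] * relu(W[j] . x + b[j])."""
    grad = [0.0] * len(x)
    pattern = []
    for w_j, b_j, v_j in zip(W, b, v):
        # Each hidden unit is an oblique split: w'x + b > 0.
        active = sum(w * xi for w, xi in zip(w_j, x)) + b_j > 0
        pattern.append(active)
        if active:
            for i, w in enumerate(w_j):
                grad[i] += v_j * w
    return grad, pattern

W = [[1.0, 0.0], [0.0, 1.0]]
b = [-1.0, -1.0]
v = [2.0, -3.0]

g1, p1 = relu_net_input_gradient([2.0, 0.5], W, b, v)
g2, p2 = relu_net_input_gradient([3.0, 0.0], W, b, v)   # same activation pattern
g3, p3 = relu_net_input_gradient([2.0, 2.0], W, b, v)   # different pattern
```

The gradient only depends on the activation pattern, so it is constant on each cell, like the value of a leaf; with N units there are at most 2^N patterns.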

## Neural networks inside algorithms, algorithms inside neural networks, neural networks inside neural networks

There is an interplay between machine learning (or, more generally, traditional "algorithms") and deep learning: we can use one inside the other. This goes both ways: we can often use traditional algorithms (k-means, optimization, etc.) as layers of neural nets, or use neural nets to replace some steps in traditional algorithms.

The same applies to neural nets: some of the building blocks of a neural net can be provided by another neural net. Examples include the initialization (meta-learning), the weights (hypernetworks), a gradient pre-conditioner (warped gradient), the activation functions, the network architecture (NAS), the optimization algorithm (learning to learn-by-gradient-descent by gradient descent), the learning rate decay, the hyperparameters, the loss function (GAN), etc.

Here are more such examples.

### Decision trees in neural nets

Many traditional machine learning algorithms can be used as layers of deep neural networks, either as is, if they are differentiable, or after regularization if they involve discrete choices. For instance, in oblivious decision trees (decision trees which use the same features and thresholds at each tree level), one can use soft thresholds, i.e., replace xᵢ>b with σ((∑λᵢxᵢ-b)/s), and sparsemax instead of softmax for sparsity. Those layers can be stacked.

Neural Oblivious Decision Ensembles for Deep Learning on Tabular Data
S. Popov et al.
https://arxiv.org/abs/1909.06312
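
A minimal sketch of a soft oblivious tree, using only the soft threshold σ((∑λᵢxᵢ-b)/s) mentioned above (no sparsemax); the parameter names and the leaf ordering are mine.

```python
import math
from itertools import product

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def soft_oblivious_tree(x, lambdas, thresholds, scales, leaf_values):
    """Depth-d oblivious tree with soft splits sigma((sum_i lambda_i x_i - b) / s)."""
    # One soft decision per level, shared by all the nodes of that level.
    c = [sigmoid((sum(l * xi for l, xi in zip(lam, x)) - b) / s)
         for lam, b, s in zip(lambdas, thresholds, scales)]
    output, total = 0.0, 0.0
    # Enumerate the 2^d leaves; bit 1 means "went right" at that level.
    for leaf, bits in enumerate(product([0, 1], repeat=len(c))):
        p = 1.0
        for ci, bit in zip(c, bits):
            p *= ci if bit else 1.0 - ci
        output += p * leaf_values[leaf]
        total += p
    return output, total

# Depth 2: first split on x0 > 0.5, second split on x1 > 0.5.
lambdas = [[1.0, 0.0], [0.0, 1.0]]
thresholds = [0.5, 0.5]
leaves = [10.0, 20.0, 30.0, 40.0]
sharp, total = soft_oblivious_tree([0.2, 0.9], lambdas, thresholds, [0.01, 0.01], leaves)
smooth, _ = soft_oblivious_tree([0.2, 0.9], lambdas, thresholds, [1.0, 1.0], leaves)
```

With a small scale s, the soft tree behaves like the hard one (here, "left then right", i.e., the second leaf); with a larger s, the output is a smooth blend of the leaves, which is what makes it differentiable.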

### Discrete optimization in neural nets

The usual trick to use discrete operations as a layer in a neural net is to find a smooth relaxation, but use it only in the backward pass.

Differentiation of Blackbox Combinatorial Solvers
M.V. Pogančić et al.
https://openreview.net/forum?id=BkevoJSYPB
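
As far as I understand the scheme in this paper, the backward pass calls the solver a second time, on costs perturbed in the direction of the incoming gradient, and returns a finite difference. The sketch below uses a toy "solver" (pick the cheapest item) of my own; λ is a hyperparameter.

```python
def solver(costs):
    """Stand-in combinatorial solver: one-hot vector of the cheapest item."""
    best = min(range(len(costs)), key=lambda i: costs[i])
    return [1.0 if i == best else 0.0 for i in range(len(costs))]

def backward(costs, y, grad_y, lam):
    """Gradient wrt the costs, from a second solver call on perturbed costs."""
    perturbed = [c + lam * g for c, g in zip(costs, grad_y)]
    y_lam = solver(perturbed)
    return [-(a - b) / lam for a, b in zip(y, y_lam)]

costs = [1.0, 2.0, 3.0]
y = solver(costs)                    # forward pass: exact and discrete
grad_y = [5.0, 0.0, 0.0]             # pretend the loss is 5 * y[0]
grad_costs = backward(costs, y, grad_y, lam=1.0)

# One gradient step on the costs makes the solver avoid the penalized item.
updated = [c - 1.5 * g for c, g in zip(costs, grad_costs)]
y_after = solver(updated)
```

The forward pass stays exact; only the gradient is "relaxed".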

### Neural nets in Gaussian processes

In Bayesian optimization, one could replace the acquisition function with a neural network, trained on similar objective functions.

Meta-Learning Acquisition Functions for Transfer Learning in Bayesian Optimization
M. Volpp et al.
https://openreview.net/forum?id=ryeYpJSKwr

### Neural nets as heuristics for search algorithms

Many classical algorithms rely on some "heuristic" to speed them up: this is in particular the case of search algorithms (A*, DPLL, etc.), which explore a tree or a graph, and need to choose which node to explore next.

Those heuristics can be replaced with a neural net.

Learning Space Partitions for Nearest Neighbor Search
Y. Dong et al.
https://openreview.net/forum?id=rkenmREFDr

A similar idea applies to symbolic operations, such as computing an integral or solving a differential equation: we can use a neural net to suggest a solution. Of course, it is not guaranteed to be correct but neural nets give a few possible solutions, very quickly, which can easily be checked for correctness -- if they do not find anything, we can then fall back on a traditional (slower) algorithm.

Deep Learning For Symbolic Mathematics
G. Lample and F. Charton
https://openreview.net/forum?id=S1eZYeHFDS

### Latent representations for multi-hop symbolic reasoning

There are three main types of logic: propositional logic ("and", "or", "not" operators, but no quantifiers: SAT solvers, such as minisat), first-order logic (with quantifiers over variables: SMT solvers, e.g., z3), and higher-order logic (with quantifiers over variables, functions and propositions: theorem provers, such as coq, agda, isabelle). HOL Light is an interactive theorem prover, with a database of 19,000 mathematical statements, covering algebra, topology, calculus, and measure theory. Like most theorem provers, it relies on rewrite rules -- but it is very time-consuming to check which rules can be applied and where they lead.

With a graph neural net, one can learn an embedding that helps tell if a rewrite rule applies to a given statement (by taking the scalar product of the embeddings) and perform approximate reasoning, by moving several steps in the latent space.

This turns a logic problem (proving a statement) into a geometric one (finding a path in the latent space learned by the network).

Mathematical Reasoning in Latent Space
D. Lee et al.
https://openreview.net/forum?id=Ske31kBtPr

Learning Heuristics for Quantified Boolean Formulas through Reinforcement Learning
G. Lederman et al.
https://openreview.net/forum?id=BJluxREKDB

Rewrite rules are also used by compilers, to simplify expressions. Reinforcement learning can help choose a good sequence of rules.

Deep Symbolic Superoptimization Without Human Knowledge
H. Shi et al.
https://openreview.net/forum?id=r1egIyBFPS

### Neural nets instead of hyperparameter search

One can also use a neural net to predict the best hyperparameters of a classical algorithm -- here, hierarchical clustering.

Learning to Link
M.F. Balcan et al.
https://arxiv.org/abs/1907.00533

### Neural nets to interpret neural nets

Counterfactuals are a way of explaining the forecasts of a model: ways of changing the input to change the output -- adversarial examples. They are usually computed as an optimization problem, but one could use a neural network to automatically compute them.

Explanation by Progressive Exaggeration
S. Singla et al.
https://openreview.net/forum?id=H1xFWgrFPS

### Hypernetworks

Hypernetworks are neural networks computing the weights of another network. Here are some possible applications. They can be used to apply a different decoder for each type of object when reconstructing it from a cloud of points; to uncompress, on the fly, the weights of a neural network; to generate a family of image compression models, using compression+λ·reconstruction as loss, with the weights depending on λ; or for continual learning, preventing catastrophic forgetting by computing an embedding of the task and constraining the weights of the previous tasks to remain fixed.

Higher-Order Function Networks for Learning Composable 3D Object Representations
E. Mitchell et al.
https://openreview.net/forum?id=HJgfDREKDB

Neural Epitome Search for Architecture-Agnostic Network Compression
D. Zhou et al.
https://openreview.net/forum?id=HyxjOyrKvr

You Only Train Once: Loss-Conditional Training of Deep Networks
A. Dosovitskiy and J. Djolonga
https://openreview.net/forum?id=HyxY6JHKwr

Continual learning with hypernetworks
J. von Oswald et al.
https://openreview.net/forum?id=SJgwNerKvB
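
The basic mechanism fits in a few lines. This is only a toy sketch, with a linear hypernetwork and a one-unit main network; all names and numbers are made up.

```python
def hypernetwork(task_embedding, A, c):
    """Linear hypernetwork: maps an embedding to the weights of the main net."""
    return [sum(a * e for a, e in zip(row, task_embedding)) + ci
            for row, ci in zip(A, c)]

def main_network(x, weights):
    """Main net: a single linear unit whose weights are not free parameters."""
    return sum(w * xi for w, xi in zip(weights, x))

A = [[1.0, 0.0], [0.0, -1.0]]   # the trainable parameters live here
c = [0.5, 0.5]
x = [2.0, 3.0]

y_task_a = main_network(x, hypernetwork([1.0, 0.0], A, c))
y_task_b = main_network(x, hypernetwork([0.0, 1.0], A, c))
```

The same input gives different outputs depending on the embedding (a task, an object type, a value of λ), while only A and c would be trained.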

### Meta-learning

The goal of meta-learning, or "learning to learn", is to be able to learn a new task with little data ("few-shot learning") and only a few steps of gradient descent. MAML (model-agnostic meta learning) is one of the popular algorithms for this: it learns a good initialization, from which new tasks can be learned quickly -- in other words, it is a neural network that computes good initial weights for the training of another network.

If you are not familiar with that topic, check C. Finn's Meta Learning course:

https://cs330.stanford.edu/
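
To fix ideas, here is a toy sketch of MAML on one-dimensional tasks with loss (θ-c)², where each task has its own optimum c. The tasks, step sizes and function name are my own choices; the gradients are written by hand, so no autodiff library is needed.

```python
def maml_init(task_centers, alpha=0.1, beta=0.1, steps=200, theta=0.0):
    """Meta-learn an initialization theta for tasks with loss (theta - c)^2."""
    for _ in range(steps):
        meta_grad = 0.0
        for c in task_centers:
            # Inner loop: one gradient step on the task loss.
            adapted = theta - alpha * 2.0 * (theta - c)
            # Outer gradient: differentiate the post-adaptation loss
            # (adapted - c)^2 through the inner step,
            # using d(adapted)/d(theta) = 1 - 2 * alpha.
            meta_grad += 2.0 * (adapted - c) * (1.0 - 2.0 * alpha)
        theta -= beta * meta_grad / len(task_centers)
    return theta

theta0 = maml_init([-1.0, 3.0])
```

For these symmetric quadratic tasks, the learned initialization ends up halfway between the task optima: the point from which one gradient step helps most, on average.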

Many small improvements or variants were proposed, such as using "task features" to select hyperparameters (learning rate, number of epochs), to do task-specific data augmentation, or to fine-tune the initial weights prior to training (as with hypernetworks).

Learning to Balance: Bayesian Meta-Learning for Imbalanced and Out-of-distribution Tasks
H.B. Lee et al.
https://openreview.net/forum?id=rkeZIJBYvr

Meta Dropout: Learning to Perturb Latent Features for Generalization
H.B. Lee et al.
https://openreview.net/forum?id=BJgd81SYwr

Empirical Bayes Transductive Meta-Learning with Synthetic Gradients
S.X. Hu et al.
https://openreview.net/forum?id=Hkg-xgrYvH

Rapid Learning or Feature Reuse? Towards Understanding the Effectiveness of MAML
A. Raghu et al.
https://arxiv.org/abs/1909.09157

## Negative results

### Common knowledge

Many commonly-held beliefs about neural networks, only proved for unrealistic architectures, such as deep linear networks, are actually wrong for general neural nets: there are sub-optimal local minima; low L^2 norm local minima need not be better; low-rank models need not be better or more robust (maximizing the rank actually works better). In spite of this, deep neural networks work well, but only thanks to their initialization.

Truth or backpropaganda? An empirical investigation of deep learning theory
M. Goldblum et al.
https://openreview.net/forum?id=HyxyIgHFvr

Other papers confirm the importance of initialization.

Neural networks may appear too expressive to lead to good generalization, but the training procedures do not stray far from the initialization, considerably reducing the risk of overfitting. Increasing the number of parameters decreases the distance from initialization.

Generalization bounds for deep convolutional neural networks
P.M. Long and H. Sedghi
https://openreview.net/forum?id=r1e_FpNFDr

### Loss landscape

There were a few other results on the loss landscape of neural networks.

ReLU networks divide the input space into "cells". In each cell, the local minima are connected but, contrary to common belief, there are spurious minima.

Piecewise linear activations substantially shape the loss surfaces of neural networks
F. He et al.
https://openreview.net/forum?id=B1x6BTEKwr

We already knew that sufficiently over-parametrized shallow ReLU networks were easy to train (the loss landscape is "nice" -- check the paper for a rigorous definition). The converse is true: with networks that are not sufficiently over-parametrized, the optimization can easily get stuck.

Bounds on Over-Parameterization for Guaranteed Existence of Descent Paths in Shallow ReLU Networks
A. Sharifnassab et al.
https://openreview.net/forum?id=BkgXHTNtvS

### Network architecture search

A few doubts were also cast on network architecture search (NAS) algorithms: evaluating them is tricky, and they do not seem to perform much better than random search. They are biased towards fast-converging cells (shallow and wide), which do not guarantee better generalization.

Evaluating The Search Phase of Neural Architecture Search
K. Yu et al.
https://openreview.net/forum?id=H1loF2NFwr

NAS evaluation is frustratingly hard
A. Yang et al.
https://arxiv.org/abs/1912.12522

Understanding Architectures Learnt by Cell-based Neural Architecture Search
Y. Shu et al.
https://arxiv.org/abs/1909.09569

### Information theory: mutual information

When learning a latent representation, we want the latent representation to be simpler (lower-dimensional) but to keep as much information as possible from the input. This can be achieved by maximizing the mutual information MI(x,f(x)) between the input x and the latent representation f(x) (InfoMax).

This may sound right, but the mutual information is maximized by bijections, which are not good representations. In practice, the mutual information cannot be computed, and we use various approximations: those approximations may explain why we nonetheless find good representations.

On Mutual Information Maximization for Representation Learning
M. Tschannen et al.
https://openreview.net/forum?id=rkxoh24FPH

Understanding the Limitations of Variational Mutual Information Estimators
J. Song et al.
https://arxiv.org/abs/1910.06222
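
The remark about bijections is easy to check on a discrete toy example, where the mutual information can be computed exactly. The helper below and the three "representations" are mine, for illustration only.

```python
import math
from collections import Counter

def mutual_information(pairs):
    """MI in bits from a list of equally likely (x, z) pairs."""
    n = len(pairs)
    p_xz = Counter(pairs)
    p_x = Counter(x for x, _ in pairs)
    p_z = Counter(z for _, z in pairs)
    return sum((c / n) * math.log2((c / n) / ((p_x[x] / n) * (p_z[z] / n)))
               for (x, z), c in p_xz.items())

xs = [0, 1, 2, 3]                       # X uniform on four values: 2 bits
scramble = {0: 2, 1: 0, 2: 3, 3: 1}     # a bijection, useless as a representation
mi_bijection = mutual_information([(x, scramble[x]) for x in xs])
mi_parity = mutual_information([(x, x % 2) for x in xs])    # keeps one bit
mi_constant = mutual_information([(x, 0) for x in xs])      # keeps nothing
```

The arbitrary scrambling gets the maximum score (all 2 bits), although it is no simpler than the input.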

Rényi correlation, ρ(X,Y) = sup E[ f(X) g(Y) ], where the supremum is over functions f and g such that E[f(X)]=E[g(Y)]=0 and E[f(X)²]=E[g(Y)²]=1, is an alternative to mutual information: for discrete variables, it is tractable, especially when one is binary.

Rényi Fair Inference
S. Baharlouei et al.
https://arxiv.org/abs/1906.12005
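
In the simplest case, when both variables are binary (an assumption of this sketch, stronger than in the text), a zero-mean, unit-variance function of a binary variable is unique up to sign, so the supremum is just the absolute value of the usual correlation.

```python
import math

def renyi_correlation_binary(joint):
    """Rényi correlation of two binary variables, joint[i][j] = P[X=i, Y=j].

    Assumes non-degenerate marginals (0 < P[X=1] < 1 and 0 < P[Y=1] < 1).
    """
    px = joint[1][0] + joint[1][1]       # P[X=1]
    py = joint[0][1] + joint[1][1]       # P[Y=1]
    cov = joint[1][1] - px * py
    return abs(cov) / math.sqrt(px * (1 - px) * py * (1 - py))

independent = renyi_correlation_binary([[0.25, 0.25], [0.25, 0.25]])
identical = renyi_correlation_binary([[0.5, 0.0], [0.0, 0.5]])
opposite = renyi_correlation_binary([[0.0, 0.5], [0.5, 0.0]])
```

It is 0 for independent variables and 1 as soon as one variable determines the other, whatever the sign of the relation.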

## Geometry

### Warped gradient descent

Gradient descent tends to zigzag towards the minimum of the objective function. This is due to the loss landscape: close to the optimum, the level surfaces are ellipsoids; since the gradient directions are orthogonal to them, if they are elongated, the gradients will switch direction at each step. Momentum tries to smooth those direction changes by averaging the gradients.

But another, much more straightforward way of addressing the problem would be to "straighten" those ellipsoids into spheres, with a linear transformation: the gradients would then be in the correct direction, towards the minimum. This is what natural gradient descent and second order optimization algorithms are doing, but estimating that linear transformation is difficult.

Instead, WarpGrad uses a neural network to compute this transformation -- it is also more flexible, because it is no longer constrained to be linear.

Meta-Learning with Warped Gradient Descent
S. Flennerhag et al.
https://arxiv.org/abs/1909.00025
https://slideslive.com/38923089/contributed-talk-1-metalearning-with-warped-gradient-descent
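
The zigzag, and the effect of "straightening" the ellipsoids, can be seen on a two-dimensional quadratic. This sketch is not WarpGrad: the transformation is the exact inverse Hessian, known in closed form here, and the objective and step sizes are made up.

```python
def gradient(p):
    """Gradient of f(x, y) = (x^2 + 25 y^2) / 2: elongated elliptic level sets."""
    x, y = p
    return (x, 25.0 * y)

def descend(p, lr, precondition=(1.0, 1.0), steps=10):
    path = [p]
    for _ in range(steps):
        g = gradient(p)
        p = (p[0] - lr * precondition[0] * g[0],
             p[1] - lr * precondition[1] * g[1])
        path.append(p)
    return path

plain = descend((1.0, 1.0), lr=0.07)
# Rescaling by the inverse Hessian diag(1, 1/25) turns the ellipses into circles.
warped = descend((1.0, 1.0), lr=1.0, precondition=(1.0, 1.0 / 25.0), steps=1)

# Number of times the y coordinate changes sign along the plain path.
sign_flips = sum(a[1] * b[1] < 0 for a, b in zip(plain, plain[1:]))
```

Plain gradient descent overshoots in the steep direction at every step, while the preconditioned step points straight at the minimum.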

### Equivariance

CNNs are translation-equivariant, but not scale-equivariant. We can easily make them scale- (and rotation-) equivariant by using a predefined basis of filters, at different scales (and orientations), and learning a (scale- and rotation-independent) linear combination of them.

Scale-Equivariant Steerable Networks
I. Sosnovik et al.
https://openreview.net/forum?id=HJgpugrKPS
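
Translation equivariance can be checked numerically on a one-dimensional signal with circular padding. The sketch below also uses subsampling as a crude stand-in for a change of scale (my simplification, not the paper's setup) to show that the same property fails there.

```python
def circular_conv(signal, kernel):
    """1D cross-correlation with circular padding."""
    n = len(signal)
    return [sum(k * signal[(i + j) % n] for j, k in enumerate(kernel))
            for i in range(n)]

def shift(signal, s):
    """Circular shift to the right by s positions."""
    n = len(signal)
    return [signal[(i - s) % n] for i in range(n)]

def downsample(signal):
    """Keep every other sample: a crude change of scale."""
    return signal[::2]

x = [0.0, 1.0, 3.0, 0.0, 0.0, 2.0]
kernel = [1.0, -1.0, 0.5]

# Shifting then filtering is the same as filtering then shifting.
shift_then_conv = circular_conv(shift(x, 2), kernel)
conv_then_shift = shift(circular_conv(x, kernel), 2)

# Rescaling then filtering is not the same as filtering then rescaling.
scale_then_conv = circular_conv(downsample(x), kernel)
conv_then_scale = downsample(circular_conv(x, kernel))
```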

Other papers explored similar ideas.

B-Spline CNNs on Lie Groups
E.J. Bekkers
https://arxiv.org/abs/1909.12057

Building Deep, Equivariant Capsule Networks
S. Venkatraman et al.
https://arxiv.org/abs/1908.01300

DeepSphere: a graph-based spherical CNN
M. Defferrard et al.
https://openreview.net/forum?id=B1e3OlStPB

Co-Attentive Equivariant Neural Networks: Focusing Equivariance On Transformations Co-Occurring in Data
D.W. Romero and M. Hoogendoorn
https://openreview.net/forum?id=r1g6ogrtDr

Other papers looked at equivariance under the symmetric group 𝔖ₙ.

On Universal Equivariant Set Networks
N. Segol and Y. Lipman
https://arxiv.org/abs/1910.02421

Permutation Equivariant Models for Compositional Generalization in Language
J. Gordon et al.
https://openreview.net/forum?id=SylVNerFvr

### Non-Euclidean spaces

The Stiefel manifold is the set of orthonormal (rectangular) matrices. Optimization on such non-Euclidean spaces can often be done by some parametrization of the space, while keeping information about the curvature of the space (Riemannian optimization).

Efficient Riemannian Optimization on the Stiefel Manifold via the Cayley Transform
J. Li et al.
https://openreview.net/forum?id=HJxV-ANKDH

Some (more technical) papers also stressed the difference between "weight space" and "function space" -- their geometry is different, and the latter is more relevant.

Pure and Spurious Critical Points: a Geometric Study of Linear Networks
M. Trager et al.
https://arxiv.org/abs/1910.01671

## Uncertainty

Deep learning models tend to be over-confident: they do not know when they do not know.

### Ensembles

To see if a model is extrapolating (and if the forecasts are unreliable), train several models (with the same structure) and check if they agree. This can be done with a single model as well: at the local minimum, changing the weights in the direction of the smallest eigenvalues of the Hessian of the loss does not change the loss much, and provides a "self-ensemble" of models.

Detecting Extrapolation with Local Ensembles
D. Madras et al.
https://openreview.net/forum?id=BJl6bANtwH

To estimate the uncertainty of a neural network, train several networks, with different random initializations, and check if they match.

Conservative Uncertainty Estimation By Fitting Prior Networks
K. Ciosek et al.
https://openreview.net/forum?id=BJlahxHYDS

If the models also provide some measure of uncertainty, e.g., if they output probabilities, we have more information: if the models are all unsure, we are in-distribution, but close to the decision boundary; if the models are sure but disagree, the data is out-of-distribution; if the models are sure and agree, everything is fine.

Ensemble Distribution Distillation
A. Malinin et al.
https://arxiv.org/abs/1905.00076
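
One common way of separating "unsure" from "disagree" is to split the entropy of the average prediction into the average entropy of the members plus a remainder measuring their disagreement. The sketch below uses that decomposition; the function names and the example probabilities are mine.

```python
import math

def entropy(p):
    return -sum(q * math.log(q) for q in p if q > 0)

def uncertainty(member_probs):
    """Return (how unsure the members are, how much they disagree)."""
    n = len(member_probs)
    mean = [sum(p[k] for p in member_probs) / n
            for k in range(len(member_probs[0]))]
    total = entropy(mean)                                # entropy of the average
    unsure = sum(entropy(p) for p in member_probs) / n   # average entropy
    return unsure, total - unsure

near_boundary = uncertainty([[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]])
out_of_dist = uncertainty([[0.99, 0.01], [0.01, 0.99], [0.99, 0.01], [0.01, 0.99]])
confident = uncertainty([[0.99, 0.01], [0.98, 0.02], [0.99, 0.01]])
```

The first component is large near the decision boundary, the second one when the members are sure but disagree, and both are small when everything is fine.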

### Adding another neural network...

More empirically, given a trained neural net, we could train another neural net to predict the uncertainty of the forecasts, (x,ŷ) ↦ ‖y-ŷ‖².

Quantifying Point-Prediction Uncertainty in Neural Networks via Residual Estimation with an I/O Kernel
X. Qiu et al.
https://openreview.net/forum?id=rkxNh1Stvr

### Bayes's formula

Our models do not really model P[y|x], but P[y|x,in-distribution]. We can actually estimate P[y|x], using Bayes's formula.

Towards neural networks that provably know when they don't know
A. Meinke and M. Hein
https://arxiv.org/abs/1909.12180
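
Here is how I read that use of Bayes's formula, as a sketch: P[y|x] = P[y|x,in] P[in|x] + P[y|x,out] P[out|x], where P[in|x] comes from densities of in- and out-of-distribution data. The uniform prediction for out-of-distribution inputs, the prior, and the density values are assumptions of this example.

```python
def full_posterior(p_y_in, density_in, density_out, prior_in=0.5):
    """Mix the in-distribution prediction with a uniform one, weighted by P[in|x]."""
    p_in = density_in * prior_in
    p_out = density_out * (1.0 - prior_in)
    in_given_x = p_in / (p_in + p_out)          # Bayes's formula
    k = len(p_y_in)
    return [p * in_given_x + (1.0 - in_given_x) / k for p in p_y_in]

classifier_output = [0.98, 0.01, 0.01]
near_data = full_posterior(classifier_output, density_in=1.0, density_out=1e-6)
far_away = full_posterior(classifier_output, density_in=1e-9, density_out=1.0)
```

Close to the training data, the classifier's output is kept; far away, the prediction falls back to (almost) uniform, however confident the classifier is.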

### Extrapolation with (neural) symbolic regression

Neural networks are good at interpolation, but not at extrapolation: arithmetic expressions perform better. For instance, the neural arithmetic logic unit (NALU) is defined as

w = tanh × sigmoid        (sign and scale)
a = ∑ wᵢ xᵢ                (sum)
b = exp ∑ wᵢ log(|xᵢ|+ε)   (product)
g: sigmoid                 (gate)
y = g a + (1-g) b          (output)

But it does not consistently find the correct solution (and, with this parametrization, negative numbers are a problem). Instead, the neural arithmetic unit biases w towards -1, 0 and +1, and uses an actual multiplication, with a gating mechanism to select what to multiply.

Neural Arithmetic Units
A. Madsen and A.R. Johansen
https://openreview.net/forum?id=H1gNOeHKPS
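
The NALU definition above translates almost line by line into code. In this sketch, the gate g is passed directly as a number (in the real unit, it is a sigmoid of a learned function of the input), and the parameter names `w_hat`, `m_hat` are mine.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def nalu(x, w_hat, m_hat, g, eps=1e-7):
    """One NALU output: gated mix of a weighted sum and a weighted product."""
    w = [math.tanh(a) * sigmoid(b) for a, b in zip(w_hat, m_hat)]    # sign and scale
    a = sum(wi * xi for wi, xi in zip(w, x))                          # sum
    b = math.exp(sum(wi * math.log(abs(xi) + eps)                     # product
                     for wi, xi in zip(w, x)))
    return g * a + (1.0 - g) * b                                      # output

big = [20.0, 20.0]                        # tanh(20) * sigmoid(20) is about 1
added = nalu([2.0, 3.0], big, big, g=1.0)
multiplied = nalu([2.0, 3.0], big, big, g=0.0)
sign_lost = nalu([-2.0, 3.0], big, big, g=0.0)
```

The last call shows the problem with negative numbers: because of the log(|xᵢ|+ε), the product of -2 and 3 comes out positive.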

## Noisy labels, anomaly detection

### Identifying noisy observations

With low-quality (crowd-sourced) training data, some of the labels may be incorrect. To account for this, we can train only on reliable labels: to identify them, compare the predictions of the current model and the model a few epochs ago: more reliable labels have more stable predictions.

SELF: Learning to Filter Noisy Labels with Self-Ensembling
D.T. Nguyen et al.
https://openreview.net/forum?id=HkgsPhNYPS

### Detecting noisy observations to make the network more robust

To make a network more robust, we can train it on the (100-e)% of the observations with the smallest loss (in statistics, this is the idea behind "truncated regression").

We can furthermore identify noisy examples, among those small-loss examples, by perturbing the network: if an image was not noisy (learned by generalization), the prediction does not change, but if it was noisy (learned by memorization), it changes.

An ensemble of neural nets, obtained by adding noise, at test time, to the input or to the network (e.g., with dropout, at test time), or by taking the same network from previous epochs, is more robust than a single network.

Robust training with ensemble consensus
J. Lee and S.Y. Chung
https://openreview.net/forum?id=ryxOUTVYDH

### Modeling noisy observations

The traditional approach, in statistics, to account for noisy observations, is to model the noise: the per-sample loss is then a two-component mixture (noisy samples have a larger loss). One can then discard the noisy labels and use semi-supervised learning. To avoid confirmation bias, train two networks, each filtering the other's noise.

DivideMix: Learning with Noisy Labels as Semi-supervised Learning
J. Li et al.
https://arxiv.org/abs/2002.07394
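
The mixture on the per-sample losses can be fitted with a few lines of EM. This sketch assumes Gaussian components and uses made-up loss values; the initialization and the variance floor are my own choices.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def clean_probability(losses, iterations=50, min_var=1e-4):
    """Fit a two-component Gaussian mixture by EM; return P[clean | loss]."""
    means = [min(losses), max(losses)]      # component 0: small losses
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: responsibility of each component for each loss.
        resp = []
        for x in losses:
            p = [w * normal_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
            s = sum(p)
            resp.append([q / s for q in p])
        # M-step: update the weights, means and variances.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(losses)
            means[k] = sum(r[k] * x for r, x in zip(resp, losses)) / nk
            variances[k] = max(min_var,
                               sum(r[k] * (x - means[k]) ** 2
                                   for r, x in zip(resp, losses)) / nk)
    return [r[0] for r in resp]

losses = [0.10, 0.12, 0.09, 0.11, 0.10, 2.0, 2.1, 1.9]   # the last three look noisy
probs = clean_probability(losses)
```

Samples with a low probability of being clean would then have their labels discarded and be treated as unlabeled data.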

### Data augmentation

Traditional data augmentation applies one transformation to the image. Instead, one can apply several data augmentations, and mix the results; one can also add a consistency penalty to the loss to ensure that different augmentations yield the same result.

AugMix: A Simple Data Processing Method to Improve Robustness and Uncertainty
D. Hendrycks et al.
https://arxiv.org/abs/1912.02781

MixMatch performs semi-supervised learning by training on the labeled data, running the model on many augmentations of the unlabeled data, and using the resulting label distribution. ReMixMatch enforces consistency among augmentations, makes the label distribution close to that of the training set, and adds a self-supervised loss (rotation prediction).

ReMixMatch: Semi-Supervised Learning with Distribution Matching and Augmentation Anchoring
D. Berthelot et al.
https://openreview.net/forum?id=HklkeR4KPB

### Anomaly detection

Autoencoders can be used to detect anomalies: outliers have a large reconstruction error. They can be improved by not only comparing the input and the output of the network, but by feeding the output back to the input and also comparing the intermediate layers.

RaPP: Novelty Detection with Reconstruction along Projection Pathway
K.H. Kim et al.
https://openreview.net/forum?id=HkgeGeBYDB

However, autoencoders are not robust to outliers: add a "robust subspace recovery" layer z↦Az between the encoder and the decoder, which projects to an even lower-dimensional subspace, with ‖AA'-I‖ and ‖z-A'Az‖ as penalties.

Robust Subspace Recovery Layer for Unsupervised Anomaly Detection
C.H. Lai et al.
https://iclr.cc/virtual_2020/poster_rylb3eBtwr.html

A similar idea can be applied to many other models: for out-of-distribution samples, the error is often larger. For instance, GOAD picks M random affine transformations, transforms the data with them, and tries to recover which transformation was used: the prediction accuracy is an anomaly score.

Classification-Based Anomaly Detection for General Data
L. Bergman and Y. Hoshen
https://openreview.net/forum?id=H1lK_lBtvS

The likelihood of a generative model could also be used, but it is correlated with the complexity of the data: an out-of-distribution sample can have a higher likelihood than actual data if it is simpler (e.g., in terms of Kolmogorov complexity). The complexity-adjusted likelihood can be used for anomaly detection.

Input Complexity and Out-of-distribution Detection with Likelihood-based Generative Models
J. Serrà et al.
https://openreview.net/forum?id=SyxIWpVYvr

Random network distillation (RND) takes two networks: a randomly-initialized, fixed network g, and a network f trained to replicate g on the training data. Since they only match on the training data, their difference ‖f(x) - g(x)‖ is larger for out-of-distribution samples. Blurring the data helps.

Novelty Detection Via Blurring
S. Choi and S.Y. Chung
https://openreview.net/forum?id=ByeNra4FDB

### Titration

Several unrelated papers used a form of "titration" to gain insight into neural networks (but I do not think they use that word). For instance, progressively adding noise to images and checking the output, or the confusion matrix, can help understand classification biases.

White Noise Analysis of Neural Networks
A. Borji and S. Lin
https://arxiv.org/abs/1912.12106

Feature Interaction Interpretability: A Case for Explaining Ad-Recommendation Systems via Neural Interaction Detection
M. Tsang et al.
https://openreview.net/forum?id=BkgnhTEtDS

## Pruning, quantization and energy consumption

With the rise of the internet of things, and the omnipresence of mobile devices, the energy consumption and the size of neural nets have become a concern. Approaches include pruning (making the network smaller, by discarding unneeded connections or neurons), quantization (using low-precision computations), and distillation (training a large neural network, because they are easier to train, and using it to teach a smaller "student" network).
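
To make "low-precision" concrete, here is a sketch of a uniform, symmetric quantizer applied to a list of weights. The 1-bit case (sign times mean absolute value) is one common choice among several, and the function name and weights are made up.

```python
def quantize(weights, bits):
    """Uniform symmetric quantization (assumes at least one nonzero weight)."""
    if bits == 1:
        # One bit: keep only the sign, with a shared magnitude.
        mean_abs = sum(abs(w) for w in weights) / len(weights)
        return [mean_abs if w >= 0 else -mean_abs for w in weights]
    scale = max(abs(w) for w in weights)
    levels = 2 ** (bits - 1) - 1          # e.g., 127 for 8 bits
    step = scale / levels
    return [round(w / step) * step for w in weights]

w = [0.82, -0.31, 0.05, -0.77]
w8 = quantize(w, 8)     # barely changed
w2 = quantize(w, 2)     # only -scale, 0, +scale survive
w1 = quantize(w, 1)     # only the signs survive
```

With 8 bits, the rounding error is at most half a step; with 2 bits, the small weights are rounded to zero, so that coarse quantization also acts as a crude form of pruning.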

### Pruning after training

Neural nets are often pruned by keeping the largest weights: instead, we could look at the "contribution" of each weight to the output and keep the weights with the largest contributions.

Data-Independent Neural Pruning via Coresets
B. Mussay et al.
https://openreview.net/forum?id=H1gmHaEKwB

Provable Filter Pruning for Efficient Neural Networks
L. Liebenwein et al.
https://openreview.net/forum?id=BJxkOlSYDH

Instead of looking all the way down to the output, we can just look at the weights in the previous and next layer, and prune weights that are small and/or connected to small weights.

Lookahead: A Far-sighted Alternative of Magnitude-based Pruning
S. Park et al.
https://openreview.net/forum?id=ryl3ygHYDB

### Rewinding

After pruning a network, one usually fine-tunes it, but, since the network is now different, the optimal hyperparameters are no longer the same: they have to be re-estimated. Rewinding the weights (i.e., using old weights) or just the learning rate, before fine-tuning, works as well as hyperparameter tuning.

Comparing Rewinding and Fine-tuning in Neural Network Pruning
A. Renda et al.
https://openreview.net/forum?id=S1gSj0NKvB

This was not the only use of rewinding in the conference. Another paper tried to rewind a layer (only one) all the way back to its initialization: the impact was sometimes visible, sometimes not. This suggests that some layers are more important than others. This can be explained by the "loss landscape", from the initialization to the final weights, which looks like a tunnel for some layers (less important), or like a funnel (more important). The width of that funnel ("network criticality") is related to generalization.

The intriguing role of module criticality in the generalization of deep networks
N.S. Chatterji et al.
https://arxiv.org/abs/1912.00528

### Pruning during training

Pruning can be done before training, after training, several times during training, but also dynamically during training, potentially reactivating prematurely pruned weights. This can be done with a binary mask on the weights (use the mask in the forward pass, but not in the backward pass; the mask can change at each step) or a threshold vector (layer-wise, filter-wise or neuron-wise).

Dynamic Model Pruning with Feedback
T. Lin et al.
https://openreview.net/forum?id=SJem8lSFwB

Dynamic Sparse Training: Find Efficient Sparse Network From Scratch With Trainable Masked Layers
J. Liu et al.
https://openreview.net/forum?id=SJlbGJrtDB

Drawing early-bird tickets: Towards more efficient training of deep networks
H. You et al.
https://arxiv.org/abs/1909.11957

### Pruning before training

Surprisingly, it is possible to prune the network before training, by looking at the change in loss when removing a parameter (it can be computed as a gradient); this works better if the singular values of the weights matrices of each layer (strictly speaking, the singular values of the jacobian of each layer) are close to 1.
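
As a rough illustration of "the change in loss when removing a parameter, computed as a gradient": if we multiply each weight by a gate c, the derivative of the loss with respect to c, at c=1, is w·∂L/∂w. The sketch below uses that quantity as a saliency on a toy linear model; the function names and the toy data are mine, and the singular-value condition is not addressed.

```python
import numpy as np

def connection_saliency(w, grad):
    """First-order estimate of the loss change when each weight is removed."""
    # d/dc L(c * w) at c = 1 equals w * dL/dw (elementwise)
    return np.abs(w * grad)

def prune_at_init(w, grad, keep_ratio):
    """Keep the weights with the largest saliency; return a binary mask."""
    n_keep = max(1, int(round(keep_ratio * w.size)))
    saliency = connection_saliency(w, grad).ravel()
    keep = np.argsort(saliency)[-n_keep:]      # indices of the most salient weights
    mask = np.zeros(w.size)
    mask[keep] = 1.0
    return mask.reshape(w.shape)

# Toy example: squared loss of a linear model at a random initialization.
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 10))
y = X[:, 0] - 2.0 * X[:, 1]                    # only two inputs matter
w0 = rng.normal(size=10)
grad0 = X.T @ (X @ w0 - y) / len(y)            # dL/dw for L = (mean squared error) / 2
mask = prune_at_init(w0, grad0, keep_ratio=0.3)
```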

Signal Propagation Perspective for Pruning Neural Networks at Initialization
N. Lee et al.
https://arxiv.org/abs/1906.06307

### Partial computations, early exit

To reduce the energy consumption of a neural net, one can use only part of it, with a gating mechanism to choose which features to compute, or a halting mechanism, at different depths of the network, to decide where to stop the computations (as an added benefit, multi-output networks seem more robust to adversarial attacks).

Batch-shaping for learning conditional channel gated networks
B.E. Bejnordi et al.
https://openreview.net/forum?id=Bke89JBtvB

Depth-Adaptive Transformer
M. Elbayad et al.
https://openreview.net/forum?id=SJg7KhVKPH

Triple Wins: Boosting Accuracy, Robustness and Efficiency Together by Enabling Input-Adaptive Inference
T.K. Hu et al.
https://openreview.net/forum?id=rJgzzJHtDB

### Pruning and quantization

"Pruning" is the removal of edges or neurons in a neural net. "Quantization" is a reduction of the precision of weights and activations, from 32 bits down to 16 bits or less -- sometimes even just one bit.

After training, neural networks are sometimes used in a way that was not intended during training -- in a way in which the network is not optimal. It makes more sense to take into account how we plan to use the network during training. For instance, here are a few ways of accounting for post-training quantization.

To learn layer-specific quantization, with gradient descent, use the floor function in the forward pass, and relax it to the identity in the backward pass (using a different function in the forward and backward pass, to deal with locally constant functions, is a common trick).
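
Here is a minimal sketch of that trick (often called the "straight-through estimator"), with the forward and backward passes written by hand. The step size is fixed here, whereas the paper below learns it; the function names and the one-weight toy problem are only for illustration.

```python
import numpy as np

def quantize_forward(x, step):
    """Forward pass: snap x to a grid of spacing `step` with the floor function."""
    return step * np.floor(x / step)

def quantize_backward(grad_output):
    """Backward pass: pretend the quantizer was the identity."""
    # The true derivative of floor is 0 almost everywhere, which would block learning.
    return grad_output

# Fit a single weight w so that quantize(w) * x approximates y = 0.7 * x.
x, y = 2.0, 1.4
w, step, lr = 0.0, 0.25, 0.05
for _ in range(100):
    q = quantize_forward(w, step)
    grad_q = (q * x - y) * x                    # d/dq of (q x - y)^2 / 2
    w -= lr * quantize_backward(grad_q)         # straight-through update of w
```

With the exact derivative of the floor function, the weight would never move from its initial value; with the relaxed one, it ends up oscillating around a grid point next to the target.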

Learned Step Size Quantization
S.K. Esser et al.
https://openreview.net/forum?id=rkgO66VKDS

In a mixed precision deep neural net, the bitwidth is different for each layer (for the weights and the activations). The quantization parameters can be trained: add a penalty for the total memory required for the weights and activations and, for the backward pass, set the derivative of the floor function to 1.

Mixed Precision DNNs: All you need is a good parametrization
S. Uhlich et al.
https://openreview.net/forum?id=Hyx0slrFvH

It is also possible to make the quantization data-dependent, computing most features in low precision and a few (input-dependent) important ones in high precision, using a gating mechanism.

Precision Gating: Improving Neural Network Efficiency with Dynamic Dual-Precision Activations
Y. Zhang et al.
https://openreview.net/forum?id=SJgVU0EKwS

Neural networks can be trained to be more robust to post-training quantization (a bounded, additive perturbation) by adding an L^1 penalty for the gradient of the loss (wrt the weights).

Gradient L^1 Regularization for Quantization Robustness
M. Alizadeh et al.
https://openreview.net/forum?id=ryxK0JBtPr

Robust Learning with Jacobian Regularization
J. Hoffman et al.
https://openreview.net/forum?id=ryl-RTEYvB

Quantization (after clipping) can be uniform (write the number with $k$ bits), power-of-two (less precision for larger numbers), or sum of powers of two, $q = 2^x + 2^y + 2^z$ (with $k$ terms).
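
To make the first two schemes concrete, here is a small sketch for values clipped to [0,1]; the exact choice of levels (and the helper names) is my own simplification, and the sum-of-powers-of-two variant is not shown.

```python
import numpy as np

def nearest_level(x, levels):
    """Map each entry of x to the closest value in `levels`."""
    levels = np.asarray(levels)
    idx = np.argmin(np.abs(x[..., None] - levels), axis=-1)
    return levels[idx]

def uniform_levels(k):
    """2^k evenly spaced levels in [0, 1]."""
    return np.linspace(0.0, 1.0, 2 ** k)

def power_of_two_levels(k):
    """Zero plus 2^k - 1 powers of two: 1, 1/2, 1/4, ..."""
    return np.concatenate([[0.0], 2.0 ** -np.arange(2 ** k - 1)])

x = np.clip(np.array([0.03, 0.3, 0.8, 1.7]), 0.0, 1.0)   # clip, then quantize
q_uniform = nearest_level(x, uniform_levels(2))           # levels 0, 1/3, 2/3, 1
q_pow2 = nearest_level(x, power_of_two_levels(2))         # levels 0, 1, 1/2, 1/4
```

The power-of-two levels are dense near zero and sparse near one: 0.8 is rounded to 2/3 by the uniform scheme, but all the way up to 1 by the power-of-two scheme.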

Additive Powers-of-Two Quantization: An Efficient Non-uniform Discretization for Neural Networks
Y. Li et al.
https://openreview.net/forum?id=BkgXT24tDS

Binary neural nets, i.e., neural nets whose weights and activations are ±1, are difficult to train. Approaches include coupling them with a higher-precision network, or using a surrogate, such as a stochastic binary neural net.

BinaryDuo: Reducing Gradient Mismatch in Binary Activation Network by Coupling Binary Activations
H. Kim et al.
https://openreview.net/forum?id=r1x0lxrFPS

Critical initialisation in continuous approximations of binary neural networks
G. Stamatescu et al.
https://openreview.net/forum?id=rylmoxrFDH

### Structured matrices

To reduce the number of parameters in a neural net, one can use "structured matrices" instead of arbitrary, dense matrices: low-rank, sparse, Kronecker-factored, DFT, etc. Those matrices are products of sparse matrices, and they can be obtained from "butterfly matrices" (block matrices, whose blocks are diagonal).

Kaleidoscope: An Efficient, Learnable Representation For All Structured Linear Maps
T. Dao et al.
https://openreview.net/forum?id=BkgrBgSYDS

## Attacks

Countless presentations started with the very same slide, taken from I. Goodfellow's 2015 paper on adversarial attacks: adding some imperceptible noise to an image can make it comically impossible to recognize for a neural net.

There have since been many attempts to design more robust models ("adversarial training" is the machine learning translation of "robust estimator") and to design better attacks.

### Building adversarial attacks

Computer vision neural networks can be trained to be robust to adversarial attacks, in the sense that images close to the input will give the same output, where "closeness" is measured with the L^\infty or L^2 norm.

But this leaves enough room for photo-realistic attacks, manipulating colour and texture or adding barely perceptible shadows.

Unrestricted Adversarial Examples via Semantic Manipulation
A. Bhattad et al.
https://openreview.net/forum?id=Sye_OgHFwH

Breaking certified defenses: semantic adversarial examples with spoofed robustness certificates
A. Ghiasi et al.
https://openreview.net/forum?id=HJxdTxHYvB

Many models are built from publicly available pre-trained networks: only the "head" of the network is unknown to the attacker. Inputs with a single very large activation provide good adversarial attacks.

A Target-Agnostic Attack on Deep Models: Exploiting Security Vulnerabilities of Transfer Learning
S. Rezaei and X. Liu
https://arxiv.org/abs/1904.04334

More generally, using the latent feature distributions, from the intermediate layers, leads to more transferable attacks.

Transferable Perturbations of Deep Feature Distributions
N. Inkawhich et al.
https://openreview.net/forum?id=rJxAo2VYwr

### Model stealing attacks

If a model, trained by fine-tuning a publicly available pre-trained model (e.g., BERT -- the availability of the pre-trained model is important) on proprietary data, has a public API (e.g., Google Translate), it is economically feasible to extract it, by feeding it data (e.g., random words, with no grammatical structure, or Wikipedia excerpts), collecting the results, and training a new model. This model stealing attack allows intellectual property theft, (white-box) adversarial attacks, and data leakage.

Thieves on Sesame Street! Model Extraction of BERT-based APIs
K. Krishna et al.
https://iclr.cc/virtual_2020/poster_Byl5NREFDr.html

To protect a trained model f:x↦y against stealing attacks, try to perturb the attacker's gradient: do not return y=f(x), but y+δ, where δ is small, |δ|<ε, and maximizes the angle between the gradients ∇Loss(y) and ∇Loss(y+δ).

Prediction Poisoning: Towards Defenses Against DNN Model Stealing Attacks
T. Orekondy et al.
https://openreview.net/forum?id=SyevYxHtDB

### Defenses against adversarial attacks

Low-precision networks are less accurate but more robust to adversarial attacks than full-precision ones: use an ensemble of both.

EMPIR: Ensembles of Mixed Precision Deep Networks for Increased Robustness Against Adversarial Attacks
S. Sen et al.
https://openreview.net/forum?id=HJem3yHKwH

To defend against adversarial attacks, one can use data augmentation at test time (not only at training time), i.e., feed several altered images to the classifier, and use a majority vote -- but accuracy degrades. Instead, train the model that will actually be used: the model including data augmentation.

Enhancing Transformation-Based Defenses Against Adversarial Attacks with a Distribution Classifier
C. Kou et al.
https://openreview.net/forum?id=BkgWahEFvr

Training against rectangular occlusion attacks improves robustness to physically realizable attacks.

Defending Against Physically Realizable Attacks on Image Classification
T. Wu et al.
https://openreview.net/forum?id=H1xscnEKDr

### Adversarial training

The following papers review some of the classical approaches to adversarial attacks and adversarial training.

Fast is better than free: Revisiting adversarial training
E. Wong et al.
https://openreview.net/forum?id=BJx040EFvH

FreeLB: Enhanced Adversarial Training for Natural Language Understanding
C. Zhu et al.
https://openreview.net/forum?id=BygzbyHFvB

### Lipschitz regularization

Networks with a low Lipschitz norm tend to be more robust to adversarial attacks.

The Lipschitz regularization, E[ |f(y)-f(x)| / |y-x| ], tends to diverge when estimated using pairs of points from a minibatch: instead, use adversarial examples, y=x+r, |r|≤ε (one could also use a gradient penalty, as in WGANs).

Adversarial Lipschitz Regularization
D. Terjék
https://openreview.net/forum?id=Bke_DertPB

The Lipschitz constraint can also be replaced by "consistency regularization", as in semi-supervised learning: deformed images should have the same output.

Consistency Regularization for Generative Adversarial Networks
H. Zhang et al.
https://iclr.cc/virtual_2020/poster_S1lxKlSKPH.html

### Robustness certification with interval bound propagation

Interval arithmetic can provide robustness certification, i.e., a proof that a given model is robust to a certain kind of attack (often, L^∞-bounded attacks -- but this may not be the norm you want).
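
The basic mechanism can be sketched in a few lines for a fully connected ReLU network: propagate a box around the input through each layer, and check that the lower bound of the correct logit exceeds the upper bounds of all the others. The function names and the (W, b) layer format are assumptions for this example; the check is sufficient but conservative, so a False answer does not mean an attack exists.

```python
import numpy as np

def linear_bounds(lower, upper, W, b):
    """Propagate the box [lower, upper] through x -> W x + b."""
    center = (upper + lower) / 2
    radius = (upper - lower) / 2
    new_center = W @ center + b
    new_radius = np.abs(W) @ radius            # worst case over the box
    return new_center - new_radius, new_center + new_radius

def relu_bounds(lower, upper):
    """ReLU is monotone, so it can be applied to both ends of the interval."""
    return np.maximum(lower, 0.0), np.maximum(upper, 0.0)

def certify(x, eps, layers, label):
    """True if the bounds prove the prediction is constant on the L-infinity ball."""
    lower, upper = x - eps, x + eps
    for i, (W, b) in enumerate(layers):
        lower, upper = linear_bounds(lower, upper, W, b)
        if i < len(layers) - 1:                # no ReLU after the last layer
            lower, upper = relu_bounds(lower, upper)
    others = np.delete(upper, label)
    return bool(lower[label] > others.max())
```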

Universal Approximation with Certified Networks
M. Baader et al.
https://openreview.net/forum?id=B1gX8kBtPr

Certified Defenses for Adversarial Patches
P.Y. Chiang et al.
https://openreview.net/forum?id=HyeaSkrYPH

Towards Stable and Efficient Training of Verifiably Robust Neural Networks
H. Zhang et al.
https://arxiv.org/abs/1906.06316

## Training

### Learning rate

There are countless learning rate schedules: decreasing, triangular, cosine, etc. Even weirder ones work, such as an exponentially increasing one (yes, increasing, not decreasing: multiply the learning rate by (1+c), c>0, at each iteration); this assumes the presence of normalizing layers, which make the loss function scale-invariant.

If your training budget is limited and known in advance, linearly decreasing the learning rate, down to zero, may be good enough.
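
For concreteness, both schedules fit in one line each; the function and parameter names below are mine.

```python
def linear_decay(step, total_steps, base_lr):
    """Learning rate decreasing linearly from base_lr to 0 over the budget."""
    return base_lr * (1.0 - step / total_steps)

def exponential_increase(step, base_lr, c):
    """Learning rate multiplied by (1 + c) at each iteration."""
    return base_lr * (1.0 + c) ** step
```

The linear schedule needs the total number of steps in advance, which is exactly the "known budget" assumption.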

An Exponential Learning Rate Schedule for Deep Learning
Z. Li and S. Arora
https://arxiv.org/abs/1910.07454

Budgeted Training: Rethinking Deep Neural Network Training Under Resource Constraints
M. Li et al.
https://arxiv.org/abs/1905.04753

What matters is not really the learning rate, but the "effective learning rate", η/(1-m), which combines learning rate and momentum.
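
One way to see where this ratio comes from, assuming the heavy-ball form of momentum without dampening (the notation $v_t$, $g_t$, $\theta_t$ is mine): the updates are

$$
v_{t+1} = m\,v_t + g_t, \qquad \theta_{t+1} = \theta_t - \eta\,v_{t+1}.
$$

If the gradient is roughly constant, $g_t \approx g$, the velocity converges to a geometric series,

$$
v_t \to g\,(1 + m + m^2 + \cdots) = \frac{g}{1-m},
\qquad\text{so that}\qquad
\theta_{t+1} - \theta_t \to -\frac{\eta}{1-m}\,g .
$$

In that regime, SGD with momentum $m$ and learning rate $\eta$ takes the same steps as plain SGD with learning rate $\eta/(1-m)$.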

Rethinking the Hyperparameters for Fine-tuning
H. Li et al.
https://openreview.net/forum?id=B1g8VkHFPH

### Semi-supervised learning, student-teacher networks, and distillation

Semi-supervised learning often works as follows: train a "teacher network" on real data; train a "student network" on data labeled by the teacher; fine-tune on real data; iterate (the student becomes teacher).

Revisiting Self-Training for Neural Sequence Generation
J. He et al.
https://openreview.net/forum?id=SJgdnAVKDH

Distillation is similar, but instead of matching the output of the teacher network, i.e., the class labels, the student network tries to match the logits. Another variant would be to match the features, one layer before, after mapping them to the same vector space (teacher and student latent representations may have different dimensions).

Contrastive Representation Distillation
Y. Tian et al.
https://openreview.net/forum?id=SkgpBJrtvS

### Penalties

Instead of the lasso sparsifying regularizer, try the "Hoyer-square" regularizer, H(w) = ( ‖w‖₁/‖w‖₂ )². It still trims small values, but preserves large ones.
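
The formula is easy to check numerically; here is a sketch (the small eps, to avoid dividing by zero, is my addition).

```python
import numpy as np

def hoyer_square(w, eps=1e-12):
    """Hoyer-square sparsity measure: (sum |w_i|)^2 / sum w_i^2."""
    w = np.ravel(w)
    return np.sum(np.abs(w)) ** 2 / (np.sum(w ** 2) + eps)
```

It ranges from 1 (a single non-zero entry) to n (n entries of equal magnitude), and it does not change when w is rescaled: unlike the L^1 penalty, it cannot be reduced by merely shrinking all the weights, which is why large values are preserved.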

DeepHoyer: Learning Sparser Neural Network with Differentiable Scale-Invariant Sparsity Measures
H. Yang et al.
https://openreview.net/forum?id=rylBK34FDS

### Activation functions

The activation functions (ReLU, sigmoid, tanh, etc.) are usually fixed. Instead, we can choose a very flexible parametrization, that allows a large variety of shapes, and learn the activation functions. The Padé approximations, i.e., rational functions (quotients of polynomials) are often used by computers in numerical computations (e.g., for special functions).

Padé Activation Units: End-to-end Learning of Flexible Activation Functions in Deep Networks
A. Molina et al.
https://openreview.net/forum?id=BJlBSkHtDS

Most theoretical studies assume ReLU activations. Experiments suggest that non-smooth activation functions (e.g., ReLU, ELU) lead to faster convergence (than swish, tanh). The situation improves with depth.

Effect of Activation Functions on the Training of Overparametrized Neural Nets
A. Panigrahi et al.
https://arxiv.org/abs/1908.05660

### Output

Here is another way of adding nonlinearities to a neural net: do not output an absolute value (age, distance, pitch, score, etc.), but compare inputs, with a siamese network; at test time, the numeric value can be computed by interpolation. This approach can be combined with clustering, if there are distinct groups in the input requiring different models: train separate models for random clusters, update cluster membership from the model performance, iterate.

Order Learning and Its Application to Age Estimation
K. Lim et al.
https://openreview.net/forum?id=HygsuaNFwr

### L^4 loss

Several papers suggested using the L^4 norm, or "spikiness", instead of the usual L^2 or L^1.

Geometric Analysis of Nonconvex Optimization Landscapes for Overcomplete Learning
Q. Qu et al.
https://openreview.net/forum?id=rygixkHKDH

Understanding l4-based Dictionary Learning: Interpretation, Stability, and Robustness
Y. Zhai et al.
https://openreview.net/forum?id=SJeY-1BKDS

### Dropout

Drop-connect sets some of the weights to zero. Instead of setting them to zero, set them to the weights of a pre-trained network.
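
A minimal sketch of the idea, with hypothetical function names; the rescaling that keeps the expected value of the weights unchanged, which a real implementation would likely want, is omitted.

```python
import numpy as np

def mixout(w, w_pretrained, p, rng):
    """Replace each weight by its pre-trained value with probability p."""
    replace = rng.random(w.shape) < p
    return np.where(replace, w_pretrained, w)

def dropconnect(w, p, rng):
    """Drop-connect is the special case where the pre-trained weights are zero."""
    return mixout(w, np.zeros_like(w), p, rng)
```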

Mixout: Effective Regularization to Finetune Large-scale Pretrained Language Models
C. Lee et al.
https://openreview.net/forum?id=HkgaETNtDB

### SGD variants

There are still countless new variants of stochastic gradient descent (SGD)...

Accelerating SGD with momentum for over-parameterized learning
C. Liu and M. Belkin
https://arxiv.org/abs/1810.13395

On the Variance of the Adaptive Learning Rate and Beyond
L. Liu et al.
https://openreview.net/forum?id=rkgz2aEKDr

Towards Better Understanding of Adaptive Gradient Algorithms in Generative Adversarial Nets
M. Liu et al.
https://arxiv.org/abs/1912.11940

Variance Reduction With Sparse Gradients
M. Elibol et al.
https://openreview.net/forum?id=Syx1DkSYwB

Coherent Gradients: An Approach to Understanding Generalization in Gradient Descent-based Optimization
S. Chatterjee
https://openreview.net/forum?id=ryeFY0EFwS

### Distributed SGD

In distributed stochastic gradient descent, several workers process the data, compute the gradients, and send them to a centralized "parameter server".

There are several approaches to synchronization. One could wait until all workers have finished -- but one of them will be slower, and all the others will have to wait for it. One could ignore the synchronization problems. One could adjust the learning rate for the delay, to lower the influence of stale gradients.

Gap-Aware Mitigation of Gradient Staleness
S. Barkai et al.
https://openreview.net/forum?id=B1lLw6EYwB

At Stability's Edge: How to Adjust Hyperparameters to Preserve Minima Selection in Asynchronous Training of Neural Networks?
N. Giladi et al.
https://openreview.net/forum?id=Bkeb7lHtvH

One can also try to lower the communication burden by only sending the largest gradients, but keeping track of the accumulated errors, to send them later.
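
The "send the largest entries, remember the rest" idea can be sketched as follows, for a single worker and a 1-dimensional gradient vector; the class and method names are mine, and the decentralized communication scheme of the paper below is not modeled.

```python
import numpy as np

def top_k_sparsify(v, k):
    """Keep the k largest-magnitude entries of v and zero out the rest."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

class ErrorFeedbackCompressor:
    """Send only the largest entries; remember what was not sent."""

    def __init__(self, dim):
        self.error = np.zeros(dim)             # accumulated, not-yet-sent gradient

    def compress(self, grad, k):
        corrected = grad + self.error          # add back what was left out before
        message = top_k_sparsify(corrected, k) # what is actually communicated
        self.error = corrected - message       # keep the remainder for later
        return message
```

Nothing is lost: at any time, the sum of the messages sent so far plus the stored error equals the sum of the gradients seen so far.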

Decentralized Deep Learning with Arbitrary Communication Compression
A. Koloskova et al.
https://openreview.net/forum?id=SkgGCkrKvH

### Label smoothing

Label smoothing can be applied to GANs: the "realness GAN" considers that samples are not fully real or fully fake, but only partially so.

Real or Not Real, that is the Question
Y. Xiangli et al.
https://openreview.net/forum?id=B1lPaCNtPB

## Code

A few Python libraries were mentioned.

### Gaussian processes

Neural Tangents provides infinitely wide layers, as a drop-in replacement for stax, the JAX neural network library. Since these are Gaussian processes, they do not scale well with dataset size.

Neural Tangents: Fast and Easy Infinite Neural Networks in Python
R. Novak et al.
https://openreview.net/forum?id=SklD9yrFPS
https://github.com/google/neural-tangents

Harnessing the Power of Infinitely Wide Deep Nets on Small-data Tasks
S. Arora et al.
https://arxiv.org/abs/1910.01663

### Digital signal processing

DDSP provides digital signal processing operations (oscillators, filters, etc. -- everything you would do in an audio editor such as Audacity or with a music programming language such as ChucK) as TensorFlow layers. The applications presented include changing the timbre of a sound (flute → piano, voice → violin, etc.) and removing reverb.

DDSP: Differentiable Digital Signal Processing
J. Engel et al.
https://openreview.net/forum?id=B1x1ma4tDr
https://magenta.tensorflow.org/ddsp

### More gradients

BackPACK is a PyTorch library to compute more quantities during back-propagation: the individual gradients from a mini-batch (for instance to identify which samples are informative and which are not, for importance sampling), the variance of the gradients in a mini-batch (to monitor the signal-to-noise ratio and increase the batch size if needed), and an approximation of the Fisher information matrix.

BackPACK: Packing more into Backprop
F. Dangel et al.
https://openreview.net/forum?id=BJlrF24twB
pip install backpack-for-pytorch

## Transformers

The Transformer has supplanted recurrent neural nets to process sequential data (text, time series).

### More transformer models

There are so many variants or applications of the Transformer model that it is difficult to keep track of them. Here are some of them (I have probably forgotten several).

The reformer is a memory-efficient variant of the transformer, using reversible residual layers (there is no longer any need to keep all the activations in memory), and locality-sensitive hashing (LSH) to locate and compute the largest elements of softmax(Q'K) (which is almost sparse).

Reformer: The Efficient Transformer
N. Kitaev et al.
https://openreview.net/forum?id=rkgNKkHtvB
https://huggingface.co/transformers/model_doc/reformer.html

The attention matrix can be decomposed into "local attention" (dense, close to the diagonal) and "global attention" (sparse, off-diagonal): one can use a lower-dimensional attention mechanism for the global part, and a CNN for the local part.

Lite Transformer with Long-Short Range Attention
Z. Wu et al.
https://iclr.cc/virtual_2020/poster_ByeMPlHKPH.html

Instead of masked pretraining, pretrain with "replaced token detection". The replacements come from a small BERT, trained at the same time. (The model "only" requires 4 GPU·days.)

ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators
K. Clark et al.
https://openreview.net/forum?id=r1xMH1BtvB

ALBERT is a smaller BERT-like model, with a projection between the 1-hot word encodings and their vector embeddings, to reduce the number of parameters, shared parameters across layers, a sentence-order prediction loss, no dropout (there is enough data: it is not needed), and more data.

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
Z. Lan et al.
https://openreview.net/forum?id=H1eA7AEtvS

### Extracting structural information

Transformer-based language models present visible patterns (vertical or horizontal lines, rectangles) in their self-attention heatmaps, which can be used to construct parse trees.

FSPool: Learning Set Representations with Featurewise Sort Pooling
Y. Zhang et al.
https://openreview.net/forum?id=HJgBA2VYwH

### Transformers for images

Transformers (self-attention with positional encoding) also work for images: they can compute convolutions (you need as many heads as there are pixels in the receptive field).

On the Relationship between Self-Attention and Convolutional Layers
J.B. Cordonnier et al.
https://openreview.net/forum?id=HJlnC1rKPB

Not only can the (1-dimensional) positional encoding used in transformers be generalized to 2 dimensions, it can also be learned: it is an embedding (x,y)↦latent_representation.

Multi-Scale Representation Learning for Spatial Feature Distributions using Grid Cells
G. Mai et al.
https://arxiv.org/abs/2003.00824

### Problems with language models: lack of diversity and reliance on shallow cues

Standard language models produce too many frequent tokens and too few rare tokens. This may be due to maximum likelihood estimation: add an "unlikelihood" penalty to explicitly penalize "negative" tokens, defined as tokens previously used in the sentence.

Text-generation models (e.g., GPT2), relying on beam search, generate text less surprising than humans do. Instead of beam search, which is an approximation of maximum likelihood, use sampling, but discard the most unlikely words, not by retaining the top-k proposals, but by retaining the top-p (probability mass) ones (p=0.95).
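
Top-p sampling can be sketched as follows, given the vector of next-token probabilities; the function names are mine.

```python
import numpy as np

def top_p_filter(probs, p=0.95):
    """Keep the smallest set of most likely tokens whose total mass reaches p."""
    order = np.argsort(probs)[::-1]            # tokens, most likely first
    cumulative = np.cumsum(probs[order])
    n_keep = int(np.searchsorted(cumulative, p)) + 1
    n_keep = min(n_keep, len(probs))           # guard against rounding errors
    filtered = np.zeros_like(probs)
    filtered[order[:n_keep]] = probs[order[:n_keep]]
    return filtered / filtered.sum()           # renormalize the retained tokens

def sample_top_p(probs, p, rng):
    """Sample one token index from the truncated distribution."""
    return int(rng.choice(len(probs), p=top_p_filter(probs, p)))
```

Contrary to top-k, the number of retained tokens adapts to the shape of the distribution: one or two tokens when the model is confident, many when it is not.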

Neural Text Generation With Unlikelihood Training
S. Welleck et al.
https://openreview.net/forum?id=SJeYe0NtvH

The Curious Case of Neural Text Degeneration
A. Holtzman et al.
https://openreview.net/forum?id=rygGQyrFvH

Language models suffer from undersensitivity: deleting parts of a sentence can make a model more confident, because it exploits shallow clues (negation, premise-only entailment, etc.). To avoid those problems, ensure that removing words does not increase the probability, e.g., with interval bound propagation (IBP).

Towards Verified Robustness under Text Deletion Interventions
J. Welbl et al.
https://openreview.net/forum?id=SyxhVkrYvr

Text generation evaluation metrics are often based on n-grams (BLEU, ROUGE, METEOR, chrF), but they struggle with synonyms. More recent metrics (MEANT, YiSi, BERTScore) are based on embeddings. BERTScore computes the BERT embeddings of the two sentences (generated and reference), then all the cosine similarities; it then greedily matches the words (in each direction, to separate precision from recall), averages the similarities, and computes an F_1 score.
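
The matching step is short enough to sketch, taking the token embeddings as given (one row per token); computing them with BERT, and the optional token weighting, are left out, and the function name is mine.

```python
import numpy as np

def greedy_f1(candidate_emb, reference_emb):
    """F1 score from greedy matching of token embeddings (one row per token)."""
    # Normalize the rows so that dot products are cosine similarities
    c = candidate_emb / np.linalg.norm(candidate_emb, axis=1, keepdims=True)
    r = reference_emb / np.linalg.norm(reference_emb, axis=1, keepdims=True)
    sim = c @ r.T                               # all pairwise cosine similarities
    precision = sim.max(axis=1).mean()          # best match for each candidate token
    recall = sim.max(axis=0).mean()             # best match for each reference token
    return 2 * precision * recall / (precision + recall)
```

Since each token is matched independently to its best counterpart, the score does not depend on word order, except through the contextual embeddings themselves.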

BERTScore: Evaluating Text Generation with BERT
T. Zhang et al.
https://openreview.net/forum?id=SkeHuCVFDr

The word embeddings of current language models do not span the whole available space; this can be seen in the distribution of the singular values. Instead of learning an embedding x↦Wx from 1-hot-encoded vectors, learn its SVD decomposition W=UΣV', with penalties to enforce the orthogonality of U and V and to make the k-th singular value close to c k^-γ or c₁ exp( - c₂ k^γ ).

Improving Neural Language Generation with Spectrum Control
L. Wang et al.
https://openreview.net/forum?id=ByxY8CNtvr

### Multilingual models

Non-parallel (i.e., monolingual) data can help train better NMT systems, e.g., using back-translation (translated sentences should sound idiomatic). The 4 models (source and target language models, and the translation models in both directions) can be linked with a VAE whose latent variable models the contents of the sentence, trained iteratively (one direction at a time).

Mirror-Generative Neural Machine Translation
Z. Zheng et al.
https://openreview.net/forum?id=HkxQRTNYPH

Many other papers looked at various ways of improving the alignment of language models for different languages.

Cross-lingual Alignment vs Joint Training: A Comparative Study and A Simple Unified Framework
Z. Wang et al.
https://openreview.net/forum?id=S1l-C0NtwS

Multilingual Alignment of Contextual Word Representations
S. Cao et al.
https://openreview.net/forum?id=r1xCMyBtPS

## Graphs

### Transformers for graphs

The attention mechanism can also be generalized to graph neural nets (GNN), or hierarchical structures, e.g., parsed sentences.

Learning deep graph matching with channel-independent embedding and Hungarian attention
T. Yu et al.
https://openreview.net/forum?id=rJgBd2NYPH

Tree-Structured Attention with Hierarchical Accumulation
X.P. Nguyen et al.
https://openreview.net/forum?id=HJxK5pEYvr

Attention can also be generalized to higher-order relations (between 3 objects instead of 2), by using two keys instead of one, and replacing the scalar product of query and key with the tetrahedron volume, ⟨a,b,c⟩ = (a·b)c - (a·c)b + (b·c)a.

Logic and the 2-Simplicial Transformer
J. Clift et al.
https://openreview.net/forum?id=rkecJ6VFvr

### Modeling elements and relations with vectors and matrices

Neural networks can be used to explain an image by mapping its elements (e.g., car, wheel, window, etc.) to vectors, and relations between those elements to matrices, and using an attention mechanism to find first-order-logic formulas matching the scene.

Learn to Explain Efficiently via Neural Logic Inductive Learning
Y. Yang and L. Song
https://iclr.cc/virtual_2020/poster_SJlh8CEYDB.html

### Graph neural nets (GNN) to model computer code

Computer code can be parsed into a tree (an "abstract syntax tree", or AST), which can be fed to a graph neural net, for various prediction tasks: type inference, bug detection, etc.

LambdaNet: Probabilistic Type Inference using Graph Neural Networks
J. Wei et al.
https://openreview.net/forum?id=Hkx6hANtwH

Hoppity: learning graph transformations to detect and fix bugs in programs
E. Dinella et al.
https://openreview.net/forum?id=SJeqs6EFvB

Global Relational Models of Source Code
V.J. Hellendoorn et al.
https://openreview.net/forum?id=B1lnbRNtwr

Learning Execution through Neural Code Fusion
Z. Shi et al.
https://openreview.net/forum?id=SJetQpEYvB

### Training graph neural nets

To limit negative transfer in graph neural nets, pre-train both node embeddings (predict (masked) node labels, or context) and graph embeddings (various supervised tasks: attribute prediction, structural similarity).

Strategies for Pre-training Graph Neural Networks
W. Hu et al.
https://openreview.net/forum?id=HJlWWJSFDH

To process knowledge graphs with GNN, learn both node and edge embeddings.

Composition-based Multi-Relational Graph Convolutional Networks
S. Vashishth et al.
https://openreview.net/forum?id=BylA_C4tPr

### Multihop reasoning on a knowledge graph

Information retrieval often starts with a query, looks for documents "similar", in some sense, to the query, and tries to extract the information from those documents. But this approach struggles with questions requiring several documents: recurrent neural nets (RNN), reinforcement learning, or more traditional approaches such as sparse matrix products and k-nearest-neighbour queries can help with multi-hop reasoning.

Learning to Retrieve Reasoning Paths over Wikipedia Graph for Question Answering
A. Asai et al.
https://arxiv.org/abs/1911.10470

Differentiable Reasoning over a Virtual Knowledge Base
B. Dhingra et al.
https://openreview.net/forum?id=SJxstlHFPH

### Implementation matters

Implementation details matter and make paper results difficult to compare: small (recent) tricks can make old models (designed when those tricks were not known) look as good as recent models (designed after those tricks became widespread).

You CAN Teach an Old Dog New Tricks! On Training Knowledge Graph Embeddings
D. Ruffinelli et al.
https://openreview.net/forum?id=BkxSmlBFvr

## Convolutional neural nets (CNN)

### Shape of the CNN filters

CNNs gather information from fixed neighbourhoods: instead, the "fixed grouping layer" (FGL) uses neighbourhoods defined by data, e.g., with some clustering algorithm (fMRI data) or external data (industries, in finance), and ensures that the receptive fields do not span group boundaries.

Towards a Deep Network Architecture for Structured Smoothness
H. Habeeb and O. Koyejo
https://openreview.net/forum?id=Hklr204Fvr

The filters used by convolutions need not be square: for some applications, T-shaped ones make sense.

Deep 3D Pan via adaptive "t-shaped" convolutions with global and local adaptive dilations
J.L.G. Bello and M. Kim
https://arxiv.org/abs/1910.01089

### U-Net autoencoders and denoising

An auto-encoder (U-Net) trained on a single image (n=1) first learns the image, then the noise -- early stopping provides denoising. This comes from the linear upsampling, which fits the smooth part (low frequency) before the noise.

Denoising and Regularization via Exploiting the Structural Bias of Convolutional Generators
R. Heckel and M. Soltanolkotabi
https://openreview.net/forum?id=HJeqhA4YDS

Auto-encoders (U-Net) can denoise images, but not sound: sounds that are close to frequency f are not only in [(1-ε)f,(1+ε)f], but also around integer multiples of f. Harmonic convolutions (a form of dilated convolutions) work better.

Deep Audio Priors Emerge From Harmonic Convolutional Networks
Z. Zhang et al.
https://openreview.net/forum?id=rygjHxrYDB

## Miscellaneous

### Biological applications: Cryo-EM

Cryo-EM (electron microscopy) is probably one of the most impressive applications of image processing -- and the worst signal-to-noise ratio I have ever seen. Each Cryo-EM image contains only one molecule, in a random orientation and location, but they may not all be in the same configuration. We want to recognize the shapes of those molecules.

The images look like white noise.

With enough data (10⁵ images) it is however possible to recover the shapes of the molecules, by clustering them, i.e., by assuming there is only a finite number of configurations.

Instead, we can assume that the configurations form a 2-dimensional manifold, and use an optimization in SO(3)×R^2 (for the orientation and the configuration).

Reconstructing continuous distributions of 3D protein structure from cryo-EM images
E.D. Zhong et al.
https://openreview.net/forum?id=SJxUjlBtwB

### Biologically plausible neural nets

Artificial neural nets are not biologically plausible. Several papers tried to design alternatives to artificial neural nets, closer to their biological counterparts, by using spiking neurons, and/or by replacing backpropagation.

SpikeGrad: An ANN-equivalent Computation Model for Implementing Backpropagation with Spikes
J.C. Thiele et al.
https://arxiv.org/abs/1906.00851

Spike-based causal inference for weight alignment
J. Guerguiev et al.
https://arxiv.org/abs/1910.01689

Enabling Deep Spiking Neural Networks with Hybrid Conversion and Spike Timing Dependent Backpropagation
N. Rathi et al.
https://openreview.net/forum?id=B1xSperKvH

### Self-organizing systems

To find interesting and diverse patterns in a self-organizing system (snowflakes, animal skin patterns, or, as in this example, the continuous game of life -- time and space are still discrete, but the values are in [0,1]), start with a random initial state and look after 200 steps.

To ensure diversity in the output, learn a latent representation (8-dimensional VAE) of the outputs; pick a point at random in this latent space; find an input whose output is close to that point; iterate (and retrain the VAE once in a while).

Intrinsically Motivated Discovery of Diverse Patterns in Self-Organizing Systems
C. Reinke et al.
https://arxiv.org/abs/1908.06663
https://automated-discovery.github.io/

### Dynamical systems

Koopman operator theory takes a non-linear dynamical system x[n+1]=F(x[n]) and finds an embedding into a higher-dimensional space y=g(x) such that the dynamics become linear y[n+1]=K·y[n]. Systems made of many parts are often modeled (and controlled) with interaction graphs. Those two approaches can be combined, the graph structure corresponding to a block Koopman matrix.
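
As a small illustration (a textbook-style toy system, not taken from the paper), consider x1[n+1] = a·x1[n], x2[n+1] = b·x2[n] + c·x1[n]². With the embedding y = g(x) = (x1, x2, x1²), the dynamics become exactly linear. The system, its coefficients and all the names below are assumptions chosen for this sketch.

```python
# Toy system (an assumption for illustration): nonlinear in x, linear in y = g(x).
A, B, C = 0.9, 0.5, 0.3

def F(x):
    """Nonlinear dynamics x[n+1] = F(x[n])."""
    x1, x2 = x
    return (A * x1, B * x2 + C * x1 ** 2)

def g(x):
    """Embedding into a 3-dimensional space: y = (x1, x2, x1^2)."""
    x1, x2 = x
    return (x1, x2, x1 ** 2)

# Koopman matrix for this embedding: y[n+1] = K · y[n]
K = [
    [A,   0.0, 0.0],
    [0.0, B,   C],
    [0.0, 0.0, A * A],
]

def matvec(M, v):
    """Matrix-vector product on plain Python lists/tuples."""
    return tuple(sum(m * vi for m, vi in zip(row, v)) for row in M)

def max_embedding_error(x0, steps=20):
    """Largest gap between g(F^n(x0)) and K^n g(x0) along the trajectory."""
    x, y, worst = x0, g(x0), 0.0
    for _ in range(steps):
        x, y = F(x), matvec(K, y)
        worst = max(worst, max(abs(u - v) for u, v in zip(g(x), y)))
    return worst
```

Here the embedding is known in closed form; in general it has to be learned, which is where neural networks come in.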

Learning Compositional Koopman Operators for Model-Based Control
Y. Li et al.
https://openreview.net/forum?id=H1ldzA4tPr

### Time series

The dynamic time-lag regression, y( t + g(x_t) ) = f(x_t), (with f and g unknown) has a closed-form log-likelihood. It has applications in astronomy (solar wind prediction).
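
To make the model concrete, here is a minimal sketch that generates data from it. The choices f(x) = x and g(x) = 4/x (a faster signal arrives sooner, as one would expect for solar wind) are assumptions made up for this example, as is the function name.

```python
def simulate_time_lag(xs, f, g):
    """Simulate y(t + g(x_t)) = f(x_t) for inputs observed at t = 0, 1, 2, ...

    Returns the (arrival time, value) pairs, sorted by arrival time.
    """
    events = [(t + g(x), f(x)) for t, x in enumerate(xs)]
    return sorted(events)

# Hypothetical inputs: the "speed" of the signal emitted at times 0, 1 and 2.
speeds = [1.0, 4.0, 8.0]
arrivals = simulate_time_lag(speeds, f=lambda x: x, g=lambda x: 4.0 / x)
```

The slow signal emitted at t=0 arrives last: the order of the effects need not match the order of the causes, which is why both the value (what) and the lag (when) have to be predicted.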

Dynamic Time Lag Regression: Predicting What & When
M. Chandorkar et al.
https://openreview.net/forum?id=SkxybANtDB

### Point processes and normalizing flows

When modeling temporal point processes (TPP), instead of using an RNN to estimate the intensity function λ_t, estimate the density function of the time until the next event, with normalizing flows (to transform a Gaussian density into the desired density) or a Gaussian mixture.
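
The two descriptions are equivalent: if p is the density of the time τ until the next event and F its cumulative distribution function, the intensity is λ(τ) = p(τ) / (1 - F(τ)). The sketch below uses a mixture of log-normal densities (a Gaussian mixture on log τ, since waiting times are positive); the parameters are made up and fixed, whereas in an actual model they would depend on the history of past events.

```python
import math

# Hypothetical mixture parameters: (weight, mean of log tau, std of log tau).
MIXTURE = [(0.6, -1.0, 0.5), (0.4, 1.0, 0.8)]

def density(tau, mixture=MIXTURE):
    """Density p(tau) of the time until the next event (log-normal mixture)."""
    return sum(
        w * math.exp(-0.5 * ((math.log(tau) - m) / s) ** 2)
        / (tau * s * math.sqrt(2.0 * math.pi))
        for w, m, s in mixture
    )

def cdf(tau, mixture=MIXTURE):
    """Cumulative distribution F(tau): probability that the next event occurs within tau."""
    return sum(
        w * 0.5 * (1.0 + math.erf((math.log(tau) - m) / (s * math.sqrt(2.0))))
        for w, m, s in mixture
    )

def intensity(tau, mixture=MIXTURE):
    """Intensity (hazard) recovered from the density: p(tau) / (1 - F(tau))."""
    return density(tau, mixture) / (1.0 - cdf(tau, mixture))

def log_likelihood(inter_event_times, mixture=MIXTURE):
    """Log-likelihood of observed inter-event times; no integral of the intensity is needed."""
    return sum(math.log(density(tau, mixture)) for tau in inter_event_times)
```

The likelihood only needs the density, evaluated in closed form, which is the main practical advantage over intensity-based models.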

Intensity-Free Learning of Temporal Point Processes
O. Shchur et al.
https://openreview.net/forum?id=HygOjhEYDH

Variational Autoencoders for Highly Multivariate Spatial Point Processes Intensities
B. Yuan et al.
https://openreview.net/forum?id=B1lj20NFDS

### Fairness

Sensitive attributes can be censored using adversarial training (it should be impossible to predict the sensitive attribute from the latent representation) or information theory (by minimizing the mutual information between the latent representation and the sensitive attribute). Unfortunately, such censoring is either ineffective or harmful to the model.

Overlearning Reveals Sensitive Attributes
C. Song and V. Shmatikov

### Controlling generative models

One can learn transformations (rotation, translation, zoom, colour changes, etc.) in the latent space of GANs, to some extent (not beyond the training distribution).

On the "steerability" of generative adversarial networks
A. Jahanian et al.
https://openreview.net/forum?id=HylsTT4FvB

Controlling generative models with continuous factors of variations
A. Plumerault et al.
https://openreview.net/forum?id=H1laeJrKDB

Adjustable Real-time Style Transfer
M. Babaeizadeh and G. Ghiasi
https://openreview.net/forum?id=HJg4E8IFdE

The desire to control the output of generative models is not limited to images: we also want to control sentiment, topic, style, etc. of (GPT2-)generated text.

Plug and Play Language Models: A Simple Approach to Controlled Text Generation
S. Dathathri et al.
https://openreview.net/forum?id=H1edEyBKDS

### Minimax problem

The nonconvex-nonconcave minimax problem

Min_x Max_y f(x,y)

can be solved with the follow-the-ridge algorithm,

x ← x - η ∇_x f(x,y)
y ← y + η ∇_y f(x,y) + η H_yy⁻¹ H_yx ∇_x f(x,y)

which requires second-order information (H_yy and H_yx are blocks of the Hessian of f): the correction term keeps the iterates on the ridge, where the gradient wrt y is zero.
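
Here is a minimal sketch on a one-dimensional toy problem, with gradient descent in x, gradient ascent in y (since y maximizes the objective), and the second-order correction added to the y update. The objective f(x,y) = -3x² - y² + 4xy, the step size and the function names are assumptions chosen for illustration; for this f, the ridge is y = 2x and (0,0) is a local minimax.

```python
# Toy objective (an assumption for illustration): f(x, y) = -3x^2 - y^2 + 4xy.
# For fixed x, f is maximal at y = 2x (the ridge), where f = x^2.

def grad(x, y):
    """Gradient (df/dx, df/dy) of the toy objective."""
    return (-6.0 * x + 4.0 * y, 4.0 * x - 2.0 * y)

H_YY = -2.0  # second derivative wrt y
H_YX = 4.0   # mixed derivative

def gda_step(x, y, eta):
    """Plain gradient descent in x, gradient ascent in y."""
    gx, gy = grad(x, y)
    return x - eta * gx, y + eta * gy

def follow_the_ridge_step(x, y, eta):
    """Same as gda_step, plus the correction eta * H_yy^-1 * H_yx * df/dx in y."""
    gx, gy = grad(x, y)
    return x - eta * gx, y + eta * gy + eta * (H_YX / H_YY) * gx

def run(step, x=1.0, y=1.0, eta=0.05, n=500):
    """Apply the update rule n times from the starting point (x, y)."""
    for _ in range(n):
        x, y = step(x, y, eta)
    return x, y
```

On this example, plain gradient descent-ascent started at (1,1) moves away from the origin, while follow-the-ridge converges to it; a point on the ridge stays on the ridge after a follow-the-ridge step.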

On Solving Minimax Optimization Locally: A Follow-the-Ridge Approach
Y. Wang et al.
https://openreview.net/forum?id=Hkx7_1rKwS

### Semi-supervised learning

Jointly learn a representation and a clustering (or labeling) of the data, with a constraint that each label be used the same number of times, by iteratively computing the optimal labeling (optimal transport) and training a classifier.
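
The equal-usage constraint can be approximated by alternately rescaling the columns and the rows of the matrix of predicted probabilities (a Sinkhorn-Knopp-style iteration). The sketch below is an assumption about how this step could look, with made-up numbers and function names, not the paper's code.

```python
def balanced_soft_labels(probs, n_iter=200):
    """Rescale classifier probabilities so that every label receives the same total mass.

    Alternately normalizes the columns (each of the K labels gets mass N/K)
    and the rows (each of the N samples gets mass 1).
    """
    n, k = len(probs), len(probs[0])
    q = [row[:] for row in probs]
    for _ in range(n_iter):
        col_sums = [sum(q[i][j] for i in range(n)) for j in range(k)]
        q = [[q[i][j] * (n / k) / col_sums[j] for j in range(k)] for i in range(n)]
        row_sums = [sum(row) for row in q]
        q = [[v / s for v in row] for row, s in zip(q, row_sums)]
    return q

def hard_labels(q):
    """Label with the largest mass, for each sample."""
    return [max(range(len(row)), key=row.__getitem__) for row in q]

# A classifier that prefers label 0 for every sample (made-up numbers).
probs = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4]]
q = balanced_soft_labels(probs)
labels = hard_labels(q)
```

Taking the largest entry of each row of the rescaled matrix is a simplification: the soft assignment is balanced, but the resulting hard labels are not guaranteed to be exactly balanced in general.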

Self-labelling via simultaneous clustering and representation learning
Y.M. Asano et al.
https://openreview.net/forum?id=Hyx-jyBFPr

## Appendix: code used to download the videos

### Slideslive

The presentations are available on Slideslive, but I find this website very unpleasant to use -- probably one of the worst I have seen in terms of "user experience" (UX).

The presentations show, side by side, a video of the speaker and the slides. The video of the speaker is of little interest, but you can, through a convoluted menu, show only the slides while keeping the sound -- but you need to do that again for each and every presentation.

The slides themselves are JPEG images (a bad choice for text), never in their native resolution (they are rescaled to the browser's width), guaranteeing blurry text. If you want to pause the video to read what is written on the slide, or to look more carefully at a formula, navigation icons appear on top of the slide, hiding most of the contents. This makes Slideslive, in spite of its name, unsuitable for slides.

If the slides contain animations, or videos, they are not in the "slides" frame, but in the "video" frame: if you had put the "slides" frame full-screen, to better read them, you need to switch back to the video, which requires several clicks through those badly-designed menus. You end up focusing on the UI navigation rather than the presentation, and you lose 10 to 20 seconds of the video.

The in-browser, streaming experience is awful: the stream sometimes stalls (this may be my Wifi); if you want to skip ahead, you have to wait for the buffer to fill up; you cannot easily control the video speed. Being able to download the videos to play them with a "normal" video player, which allows you to move instantly to any part of the video, would be helpful. That would also cut down the time spent clicking and waiting between videos: even if it only takes 30 seconds, with more than 600 5-minute videos, that amounts to 5 hours...

### Code

I ended up downloading the videos, but Slideslive makes it difficult: the webpages are dynamically generated (if you just wget them, they are virtually empty); the slides are separate JPG images; the videos are private videos on Vimeo (only visible if you come from the Slideslive or ICLR website).

## Create a new virtual environment
conda create --name slideslive python=3.6
. activate slideslive

## Install slideslive-slides-dl, to download the slides (not the videos), and youtube-dl (for the videos)
git clone https://github.com/PeterTheOne/slideslive-slides-dl
cd slideslive-slides-dl
pip install --upgrade youtube_dl scipy
pip install -r requirements.txt

## Get the list of presentations
## The URLs are of the form https://iclr.cc/virtual_2020/poster_r1g87C4KwB.html
## Not sure where that came from.
## I probably went to https://iclr.cc/virtual_2020/papers.html?filter=keywords
## with a web browser (wget returns an almost empty page),
## right-clicked, selected "inspect element", copied the HTML code, and extracted the URLs.
## The URLs are stored, one per line, in ICLR_2020_posters.urls

## Loop over all the presentations
for url in $( cat ICLR_2020_posters.urls )
do
(

## Get the title of the presentation
wget -O index.html "$url"
title=$( grep citation_title index.html | perl -p -e 's/.*content="//; s/".*//' )
id=$( grep presentationId index.html | perl -p -e 's/[^0-9]+//g' )
name=$( echo "$title" | perl -p -e 's/\s+$//; y/A-Z/a-z/; s/[^a-z]/-/g' )
slides_url="https://slideslive.com/$id/$name"

## To gain access to the video URL, we need to run the Javascript code in the page.
## wget.py uses Selenium to retrieve a webpage; waits 10 seconds; prints its contents; then prints the contents of the first iframe.
python wget.py "$url" > index0.html
video_url=$( grep vimeo index0.html | grep /video/ | perl -p -e 's/.*src="//; s/".*//' )

## Create a directory to store the slides and the video
mkdir -p "$id-$name"
cd "$id-$name"

## Download the slides
## (the script is in the parent directory, i.e., the cloned repository)
python ../slideslive-slides-dl.py "$slides_url"

## Download the video (only the audio)
youtube-dl --extract-audio --referer "$slides_url" "$video_url"

## Merge the audio and the slides
ffmpeg -f concat -i ffmpeg_concat.txt -i *m4a slides-video.mp4

)
done

Here is the wget.py script:

from selenium import webdriver
import sys, time

def LOG(*o):
    print( time.strftime("%Y-%m-%d %H:%M:%S "), *o, file=sys.stderr, sep="", flush = True )

LOG( "Command-line arguments" )
if len(sys.argv) >= 2:
    url = sys.argv[1]
else:
    url = "https://iclr.cc/virtual_2020/poster_r1g87C4KwB.html"
LOG( f"  URL: {url}" )

LOG( "Chrome" )
options = webdriver.ChromeOptions()
options.add_argument( "user-data-dir=/tmp/Chrome" )
options.add_argument( "--headless" )
options.add_argument( "window-size=1400,2100" )
options.add_argument( "user-agent=Mozilla" )
driver = webdriver.Chrome( options = options, executable_path='chromium.chromedriver' )

LOG( url )
driver.get(url)

LOG( "Wait 10 seconds" )
time.sleep(10)

LOG( "Page source" )
print( driver.page_source )

LOG( "Screenshot (for debugging): /tmp/debug.png" )
driver.save_screenshot('/tmp/debug.png')

LOG( "iframe" )
el = driver.find_element_by_tag_name("iframe")
driver.switch_to.frame(el)

LOG( "Page source (iframe)" )
print( driver.page_source )

LOG( "Quit" )
driver.quit()

LOG( "Done." )

posted at: 15:33 | path: /ML