Vincent Zoonekynd's Blog

Fri, 10 Jan 2020: 2019 in Machine Learning

Rather than a review of 2019 in machine learning, this is a review of (some of) what I read this year: this is biased towards my interests, incomplete, and some of the topics may already be a few years old.


1a. The main transformation in the machine learning landscape was undoubtedly the rise of language models: not only are they more powerful and trained on more data, they are also more directly usable, thanks to the HuggingFace transformers library.

It is difficult, however, to keep track of which model is currently the best. When you read this, it may no longer be one of the following: BERT and its many variants (ALBERT (lightweight), RoBERTa (Facebook), DistilBERT, CamemBERT (French), ERNIE (Baidu, Chinese), etc.), XLNet, XLM, GPT-2, etc.

1b. Those models provide contextual word embeddings – in contrast to word2vec, GloVe, FastText, which are not contextual.

Visualizing and measuring the geometry of BERT

Contextual string embeddings for sequence labeling

1c. There has also been some work on multilingual models, in particular multilingual embeddings: for languages with less training data, one can simply learn a single model for several (related) languages.

Image search using multilingual texts: a cross-modal learning approach between image and text

1d. Some (traditional) word embeddings can account for polysemy.

Linear algebraic structure of word senses with applications to polysemy

Multimodal word distributions

2. "Self-supervised learning" is a special case of unsupervised learning (or a first, pretraining, step in semi-supervised learning, or pre-training for supervised learning) in which we turn the unlabeled data into a supervised problem. For instance, given unlabeled images, we can train a model to tell if two patches come from the same image. Given sound data, we can train a model to tell if two patches are close in time or not – the features learned will identify phonemes if the patches are very close, or speakers for slightly more distant ones.
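A toy sketch of such a pretext task (my own illustration, with random arrays standing in for images): the "same image or not" labels are generated automatically, without any human annotation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Unlabeled "images": 10 random 32x32 arrays (stand-ins for real images).
images = rng.normal(size=(10, 32, 32))

def random_patch(img, size=8):
    """Crop a random size x size patch out of a 2-D image."""
    r = rng.integers(0, img.shape[0] - size)
    c = rng.integers(0, img.shape[1] - size)
    return img[r:r + size, c:c + size]

def make_pair(positive):
    """Return (patch_a, patch_b, label), with label 1 when both
    patches come from the same image: the label comes for free."""
    i = rng.integers(len(images))
    j = i if positive else (i + 1 + rng.integers(len(images) - 1)) % len(images)
    return random_patch(images[i]), random_patch(images[j]), int(positive)

# The pretext data set a model would then be trained on:
pairs = [make_pair(positive=bool(k % 2)) for k in range(100)]
labels = [label for _, _, label in pairs]
```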

MixMatch: a holistic approach to semi-supervised learning

3. Adversarial training (a new deep-learning name for the more classical notion of "robust optimization") tries to make neural networks (slightly) more robust to adversarial attacks.

Provably Robust Deep Learning via Adversarially Trained Smoothed Classifiers

Adversarial Examples Are Not Bugs, They Are Features

4. We have a better understanding of the loss landscape of deep neural networks:

- The set of global minima is connected

- Wide extrema generalize better

- Stochastic gradient descent (SGD) tends to wander around one of those minima: average the model weights rather than the forecasts

- Different SGD runs will wander around different minima: average the forecasts, not the weights
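The last two points can be illustrated on a toy regression problem (a sketch of my own, not from the papers): tail-averaging the weights within one run, versus averaging the forecasts across runs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression problem: y = 2x + noise; the loss surface has a
# single minimum near w = 2, around which SGD keeps bouncing.
X = rng.normal(size=(200, 1))
y = 2 * X[:, 0] + 0.1 * rng.normal(size=200)

def sgd_run(seed, steps=500, lr=0.05):
    """Noisy SGD on the squared loss; returns all iterates."""
    r = np.random.default_rng(seed)
    w = r.normal()
    ws = []
    for _ in range(steps):
        i = r.integers(len(X))
        grad = 2 * (w * X[i, 0] - y[i]) * X[i, 0]
        w -= lr * grad
        ws.append(w)
    return np.array(ws)

# Within one run, the iterates wander around a single minimum:
# average the weights of the second half of the run (cf. SWA).
ws = sgd_run(seed=1)
w_avg = ws[len(ws) // 2:].mean()

# Across runs, different minima may be reached: average the
# forecasts (a plain ensemble) rather than the weights.
finals = [sgd_run(seed=s)[-1] for s in range(5)]
x_new = 1.0
forecast = np.mean([w * x_new for w in finals])
```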

The following website shows very nice pictures of those landscapes.

5a. Pure stochastic gradient descent (SGD) is not the only way to train a neural network. The gradient descent step is very similar to the Euler discretization of an ODE: many SGD variants can be interpreted as other ODE discretizations.
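The correspondence is easiest to see in one dimension (my own toy illustration): a gradient-descent step with learning rate h is exactly one explicit Euler step, with time step h, of the gradient-flow ODE.

```python
import math

# f(w) = w^2 / 2, with gradient f'(w) = w.
grad = lambda w: w

def gd(w0, h, steps):
    """Plain gradient descent with learning rate h."""
    w = w0
    for _ in range(steps):
        w -= h * grad(w)
    return w

def euler(w0, h, steps):
    """Explicit Euler discretization of the gradient flow
    w'(t) = -grad f(w(t)) with time step h: the same update."""
    w = w0
    for _ in range(steps):
        w += h * (-grad(w))
    return w

# Both track the exact flow w(t) = w0 * exp(-t), up to O(h) error.
w_gd = gd(1.0, 0.01, 100)
w_ode = math.exp(-0.01 * 100)
```

Other discretizations of the same (or a second-order) ODE give momentum and Nesterov-type methods.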

A differential equation for modeling Nesterov's accelerated gradient method: theory and insights

Neural Tangent Kernel: Convergence and Generalization in Neural Networks

5b. Since the successive estimates of the weights of a network form a sequence, we can use all the sequence-related tricks we learned in elementary calculus: Cesàro summation, convergence acceleration, etc.

Bridging the gap between constant step size stochastic gradient descent and Markov chains

Constant-step-size least-mean-square and optimal sampling distributions

5c. Not all training examples are equally relevant: importance sampling gives more weight to the most informative samples, identified by the norm of their gradient.
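A minimal sketch of the idea (toy linear regression, my own illustration; in practice the gradient norms are approximated, e.g. with an upper bound, rather than recomputed on the whole data set at every step):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear regression; per-sample gradients are cheap to compute here.
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

w = np.zeros(3)
for _ in range(500):
    residuals = X @ w - y                  # shape (100,)
    grads = 2 * residuals[:, None] * X     # per-sample gradients, (100, 3)
    norms = np.linalg.norm(grads, axis=1) + 1e-12
    probs = norms / norms.sum()
    # Sample informative examples (large gradient norm) more often,
    # and reweight by 1 / (n p_i) to keep the update unbiased.
    i = rng.choice(len(X), p=probs)
    w -= 0.01 * grads[i] / (len(X) * probs[i])
```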

Not all samples are created equal: deep learning with importance sampling

5d. With the rise of (clusters of) GPUs, distributed training can speed up computations but, because of the communication bandwidth required, the gains may not be as large as one would expect.

Byzantine-tolerant machine learning

Deep gradient compression: reducing the communication bandwidth for distributed training

5e. There are also alternatives to back-propagation, for instance by learning to forecast the gradients instead of actually computing them, or by training the layers separately, using auxiliary objectives.

Decoupled neural interfaces using synthetic gradients

Putting An End to End-to-End: Gradient-Isolated Learning of Representations

6a. Monte Carlo simulations usually rely on Metropolis-Hastings, Gibbs sampling, or Hamiltonian Monte Carlo (HMC), but these are not the only choices. Langevin Monte Carlo is very similar to gradient descent: move in the direction of the gradient of the log posterior probability, but add noise.
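A sketch of the resulting "unadjusted Langevin algorithm", on a standard normal target (a toy example of my own):

```python
import numpy as np

rng = np.random.default_rng(0)

# Target: standard normal, with log p(x) = -x^2 / 2 + constant.
grad_log_p = lambda x: -x

# Unadjusted Langevin: a gradient step on log p, plus Gaussian noise.
eps = 0.1
x = 0.0
samples = []
for _ in range(50_000):
    x += (eps / 2) * grad_log_p(x) + np.sqrt(eps) * rng.normal()
    samples.append(x)

samples = np.array(samples[5_000:])  # discard burn-in
# The samples are approximately N(0, 1); adding a Metropolis
# accept/reject step (MALA) removes the O(eps) discretization bias.
```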

Langevin diffusions and the Metropolis-adjusted Langevin algorithm

Riemann manifold Langevin and Hamiltonian Monte Carlo methods

6b. Most Monte Carlo simulation methods are reversible, but that is not actually needed: non-reversible alternatives may converge faster.

Stochastic bouncy particle sampler

The bouncy particle sampler: a nonreversible rejection-free Markov chain Monte Carlo method

Non-reversible Metropolis-Hastings

Irreversible Monte Carlo algorithms for efficient sampling

Liftings of tree-structured Markov chains

The zig-zag process and super-efficient sampling for Bayesian analysis of big data

A piecewise deterministic scaling limit of lifted Metropolis-Hastings in the Curie-Weiss model

7. Graph data is becoming easier to use.

7a. A simple way to use graph data is to use a graph kernel, i.e., a way of defining how similar two nodes (or two graphs) are.

Kernels and regularization on graphs

Weisfeiler-Lehman graph kernels

A persistent Weisfeiler-Lehman procedure for graph classification

Another way is to embed the nodes in a Euclidean space, for instance using word2vec on random walks on the graph.

DeepWalk: online learning of social representations

node2vec: scalable feature learning for networks

Anonymous walk embeddings

Structural deep network embedding

Link prediction via subgraph embedding-based convex matrix completion

GraRep: learning graph representations with global structural information

DeepTrax: embedding graphs of financial transactions

Holographic embeddings of knowledge graphs
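The first step of those random-walk methods is easy to sketch (toy graph, my own illustration): each walk is treated as a "sentence" and each node as a "word", and the resulting corpus is fed to word2vec.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy undirected graph, as an adjacency list: two loosely linked triangles.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
       3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}

def random_walk(start, length):
    """Uniform random walk on the graph (DeepWalk); node2vec
    biases these transitions to interpolate between BFS and DFS."""
    walk = [start]
    for _ in range(length - 1):
        walk.append(rng.choice(adj[walk[-1]]).item())
    return walk

# The "corpus" of walks to feed to word2vec: nodes of the same
# triangle co-occur far more often, so they end up embedded nearby.
corpus = [random_walk(node, 10) for node in adj for _ in range(20)]
```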

Many calculus notions (derivatives, Fourier analysis, etc.) can be discretized using a grid, which is a special type of graph. It turns out that those notions can be generalized to arbitrary graphs: signal processing on graphs.

Multi-dimensional graph Fourier transform

Calculus on graphs

In particular, the notion of convolution can be generalized from grids to graphs.

CayleyNets: graph convolutional neural networks with complex rational spectral filters

Link prediction based on graph neural networks

Modeling relational data with graph convolutional networks

Graph convolutional neural networks for web-scale recommender systems

Convolutional neural networks on graphs with fast localized spectral filtering

Graph2Seq: graph to sequence learning with attention-based neural networks
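One popular instantiation, the GCN propagation rule of Kipf and Welling (not specifically one of the papers above), can be sketched in a few lines of numpy:

```python
import numpy as np

# Adjacency matrix of a 4-node path graph: 0 - 1 - 2 - 3.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

# GCN propagation: add self-loops, then symmetrically normalize,
# P = D^{-1/2} (A + I) D^{-1/2}.
A_hat = A + np.eye(4)
D_inv_sqrt = np.diag(1 / np.sqrt(A_hat.sum(axis=1)))
P = D_inv_sqrt @ A_hat @ D_inv_sqrt

# One graph-convolution layer: mix each node's features with its
# neighbours', then apply a learned linear map and a nonlinearity.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))   # node features
W = rng.normal(size=(3, 2))   # layer weights
H = np.maximum(P @ X @ W, 0)  # ReLU(P X W)
```

Stacking a few such layers lets information propagate a few hops along the graph, the analogue of a growing receptive field in image CNNs.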

GANs are still everywhere and can be used to generate random graphs, hopefully more realistic than the classical models (Erdős–Rényi, Barabási–Albert (preferential attachment), Watts–Strogatz (rewired ring lattice), exponential random graph, stochastic block model).

Learning deep generative models of graphs

8a. Reinforcement learning is still a very hot topic, but it remains highly technical: contrary to mainstream machine learning algorithms and neural networks, very few of us actually use it.

Here is some learning material:

Algorithms for reinforcement learning (2009)

Stanford course

Berkeley course

8b. There are two main concerns with many of the recent reinforcement learning "successes":

- They are not sample-efficient – they require an insane amount of training, equivalent to 1000 or 10,000 years of human time;

- They focus on a single, very specific task, and cannot easily generalize to other tasks.

Meta-learning tries to respond to those challenges, by training a network to learn new tasks from just a few samples (few-shot learning).

Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks

How to train your MAML

RL²: Fast Reinforcement Learning via Slow Reinforcement Learning

Here is a simple meta-learning benchmark, which can be used as an intelligence test (for computers):

On the Measure of Intelligence

8c. Bayesian methods still have their role to play in the low-data regime.

9. Variational inference is another of those technical topics. The basic idea is simple: we want to compute a probability distribution p, usually the posterior from some Bayesian model, but it is intractable. Instead, we will try to find a simpler distribution q, tractable, but close to p. To measure closeness, we use the KL divergence KL(q‖p) – even if p is intractable, KL(q‖p) may be easier to compute.

For instance, we can search for q as a product of independent Gaussians ("mean field approximation"). Since this completely forgets the dependence structure in p, it looks like a bad idea – but even if the variational distribution q is far from the target p, most of what we want to do with it (in particular, computing expectations) will give similar results.
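For a Gaussian target, the mean-field optimum is available in closed form, which makes its well-known variance underestimation easy to see (a sketch of my own):

```python
import numpy as np

# Target p: a correlated 2-D Gaussian.
mu = np.array([0.0, 1.0])
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])
Lambda = np.linalg.inv(Sigma)   # precision matrix

# Mean-field approximation: q is a product of independent Gaussians.
# For a Gaussian target, minimizing KL(q || p) has a closed form:
# q_i = N(mu_i, 1 / Lambda_ii).
q_mean = mu
q_var = 1 / np.diag(Lambda)

# The means are exact, but the variances are underestimated:
# here 1 / Lambda_ii = Sigma_ii * (1 - rho^2) = 0.36, not 1.
```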

Unfortunately, this requires ad hoc computations, specific to each model.

A tutorial on variational Bayesian inference

Automatic differentiation variational inference

10. Geometry

10a. The Kullback-Leibler (KL) divergence is often used to compute the "distance" between distributions. For instance, maximizing the log-likelihood of a model is equivalent to minimizing the KL divergence between the data and the model. But it is often insufficient, because it ignores the geometry of the underlying space: if two distributions have disjoint support, they will be maximally distant for the KL divergence, regardless of the distance between their supports.

The Wasserstein distance (contrary to the KL divergence, it is actually a distance) addresses those problems but, unfortunately, it is difficult to compute. The Sinkhorn distance is an entropic regularization of the Wasserstein distance. The sliced Wasserstein distance computes the Wasserstein distance after a random projection on a 1-dimensional subspace (where it is easy to compute) and averages several of those distances.
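The Sinkhorn iterations themselves are short (a toy sketch of my own; real implementations work in the log domain for numerical stability):

```python
import numpy as np

# Two discrete distributions on the points 0..4 of the real line.
x = np.arange(5.0)
a = np.array([0.5, 0.5, 0.0, 0.0, 0.0])   # mass on {0, 1}
b = np.array([0.0, 0.0, 0.0, 0.5, 0.5])   # mass on {3, 4}

C = np.abs(x[:, None] - x[None, :])   # cost: distance on the line
eps = 0.05                            # entropic regularization strength
K = np.exp(-C / eps)                  # Gibbs kernel

# Sinkhorn iterations: alternately rescale rows and columns so that
# the plan u_i K_ij v_j has marginals a and b.
u = np.ones(5)
for _ in range(500):
    v = b / (K.T @ u)
    u = a / (K @ v)

plan = u[:, None] * K * v[None, :]
sinkhorn_cost = (plan * C).sum()      # close to the Wasserstein-1 cost, 3
```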

Sinkhorn distances: lightspeed computation of optimal transportation distances

Fast computation of Wasserstein barycenters

Sliced Gromov-Wasserstein

Large-scale optimal transport and mapping estimation

The Wasserstein distance is also used in GANs: contrary to that of the KL divergence, its gradient still tells the generator in which direction to go, even when the discriminator is good.

Improved training of Wasserstein GANs

10b. Topological data analysis (TDA) remains an underground topic, in spite of the existence of the "TDA" R package which hides all the complexity from the end user. It has been applied to study the complexity of neural networks, and to regularize them.

On characterizing the capacity of neural networks using algebraic topology

Neural persistence: a complexity measure for deep neural networks using algebraic topology

Connectivity-optimized representation learning via persistent homology

A topological regularizer for classifiers via persistent homology

An introduction to topological data analysis: fundamental and practical aspects for data scientists

10c. Other bits of algebraic topology find applications in statistics.

Statistical inference using the Morse-Smale complex

Morse-Smale regression

Visual exploration of high-dimensional scalar functions

Data analysis with the Morse-Smale complex: the msr package for R

10d. Deep learning uses a lot of linear transformations and Euclidean distances but, for some types of data, Euclidean space is not the best choice. For instance, trees are difficult to embed in flat spaces: if you start from the root, add three children to it, add three children to each of them, and repeat a few times, the number of nodes grows exponentially while the available volume only grows polynomially, so the nodes become closer and closer. Hyperbolic spaces do not have that problem and are therefore a good choice to compute embeddings of hierarchies, trees or more general graphs.
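The Poincaré ball makes this concrete (a sketch of my own): points at almost the same Euclidean distance from the origin can be hyperbolically very far apart, which leaves exponentially more room near the boundary for the leaves of a tree.

```python
import numpy as np

def poincare_distance(u, v):
    """Geodesic distance between two points of the Poincare ball ||x|| < 1."""
    duv = np.dot(u - v, u - v)
    denom = (1 - np.dot(u, u)) * (1 - np.dot(v, v))
    return np.arccosh(1 + 2 * duv / denom)

origin = np.array([0.0, 0.0])
a = np.array([0.9, 0.0])
b = np.array([0.99, 0.0])

# Euclidean distances from the origin: 0.9 and 0.99, almost equal.
# Hyperbolic distances: log 19 ~ 2.94 and log 199 ~ 5.29.
d_a = poincare_distance(origin, a)
d_b = poincare_distance(origin, b)
```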

On the tree-likeness of hyperbolic spaces

Hyperbolic deep learning for Chinese natural language understanding

Hyperbolic neural networks

Hyperbolic attention networks

Representation trade-offs for hyperbolic embeddings

Reconstructing approximate tree metrics

Learning continuous hierarchies in the Lorentz model of hyperbolic geometry

10e. When training a neural network, we often consider the space in which the weights live as a vector space. But points in that space represent models of the data, and the Euclidean distance between those points does not correspond to a useful notion of distance between those models. This is already visible with a simple model, univariate Gaussians. There are two parameters, μ and σ (or μ and σ², or μ and log σ, or anything you want – the fact that there are several meaningful reparametrizations leading to different Euclidean distances is already troubling). We are tempted to say that the distance between N(0,1) and N(1,1) is 1, and so is the distance between N(0,σ²) and N(1,σ²), regardless of the variance. But when the variance is low, N(0,σ²) and N(1,σ²) are very easy to distinguish, while if the variance is high, they are almost indistinguishable: to reflect this, we need something other than the Euclidean distance.

It is possible to define a "metric" (a Riemannian structure) using the Fisher information matrix that behaves as we expect: the distance between N(0,σ²) and N(1,σ²) decreases as σ increases.
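A quick computation makes this concrete (my own illustration, using the closed-form KL divergence between equal-variance Gaussians as a proxy for that information distance):

```python
# KL( N(mu1, s^2) || N(mu2, s^2) ) = (mu1 - mu2)^2 / (2 s^2),
# which is how distinguishable the two distributions are.
def kl_gauss(mu1, mu2, sigma):
    return (mu1 - mu2) ** 2 / (2 * sigma ** 2)

# The Euclidean distance between the parameter pairs (0, s) and (1, s)
# is always 1, but the statistical distinguishability is not:
sharp = kl_gauss(0, 1, sigma=0.1)  # ~50:   trivially distinguishable
flat = kl_gauss(0, 1, sigma=10)    # 0.005: nearly indistinguishable
```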

The same idea applies to neural networks, and has applications to their training: the Riemannian structure defines a more geometric gradient, called the "natural gradient", which speeds up convergence.

Information-geometric optimization algorithms: a unifying picture via invariance principles

The Riemannian geometry of deep generative models

Limitations of the empirical Fisher approximation

11a. Interpretable machine learning

11b. Researchers are more and more concerned by ethical issues, for instance, that of "fairness". Unfortunately, there are several legitimate, but incompatible notions of "fairness" (statistical parity, equal opportunities, disparate impact, average odds, etc.).

Long list of fairness measures and mitigation algorithms (pre-, in- and post-processing)

While you should worry about those topics, and even assess the fairness of your models, you should leave the choice of which definition of "fairness" to use and of the trade-off between fairness and performance to someone else (stakeholders, legal department, management, etc.) – otherwise, your choices will be challenged.

12. Other topics

12a. New layer types

Lattice layers can enforce monotonicity constraints

Deep lattice networks and partial monotonic functions

Monotonic calibrated interpolated look-up tables

Lattice regression

Fast and flexible monotonic functions with ensembles of lattices

Constrained optimization layer

Differentiable Convex Optimization Layers

Variance layers

Variance networks: when expectation does not meet your expectations

12b. Data structures, in computer science (think of hash tables), are usually designed to provide the correct answer. For a few years, probabilistic data structures, which return the correct answer with high probability, have risen in popularity (for instance, Bloom filters). Those probabilistic data structures can now be learned from data.
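For reference, the classic (hand-designed) structure that the learned variants build on, the Bloom filter, fits in a few lines (a toy implementation of my own, not from the paper):

```python
import hashlib

class BloomFilter:
    """Classic Bloom filter: approximate set membership with
    no false negatives and a small, tunable false-positive rate."""

    def __init__(self, m=1024, k=3):
        self.bits = [False] * m
        self.m, self.k = m, k

    def _positions(self, item):
        # k hash functions, derived from one salted SHA-256.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def __contains__(self, item):
        # False is always correct; True may be a false positive.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
for word in ["cat", "dog", "bird"]:
    bf.add(word)
```

The learned version replaces the hash functions with a model trained on the key distribution, backed by a small classical filter to eliminate false negatives.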

The case for learned index structures

12c. Tensor decompositions

Tensor networks

12d. RKHS

A primer on reproducing kernel Hilbert spaces

My summary of those papers (and many more, in reverse chronological order of reading time) is here:

posted at: 11:36 | path: /R | permanent link to this entry