Vincent Zoonekynd's Blog

Fri, 10 Jan 2020: 2019 in Machine Learning

Rather than a review of 2019 in machine learning, this is a review of (some of) what I read this year: it is biased towards my interests, incomplete, and some of the topics may already be a few years old.


1a. The main transformation in the machine learning landscape was undoubtedly the rise of language models: not only are they more powerful and trained on more data, they are also more directly usable, thanks to the HuggingFace transformers library.

https://huggingface.co/transformers/index.html

It is difficult, however, to keep track of which model is currently the best. When you read this, it may no longer be one of the following: BERT and its many variants (ALBERT (lightweight), RoBERTa (Facebook), DistilBERT, CamemBERT (French), ERNIE (Baidu, Chinese), etc.), XLNet, XLM, GPT-2, etc.
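
To give an idea of how directly usable those models have become, here is a minimal sketch with the transformers pipeline API (the task name, default model and output format may differ across library versions):

    from transformers import pipeline

    # Masked-language-model demo with a pretrained BERT;
    # the pipeline downloads the model on first use.
    fill = pipeline("fill-mask", model="bert-base-uncased")
    print(fill("Paris is the [MASK] of France."))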

1b. Those models provide contextual word embeddings – in contrast to word2vec, GloVe, FastText, which are not contextual.

Visualizing and measuring the geometry of BERT
https://arxiv.org/abs/1906.02715

Contextual string embeddings for sequence labeling
https://pdfs.semanticscholar.org/421f/c2556836a6b441de806d7b393a35b6eaea58.pdf

1c. There has also been some work on multilingual models, in particular multilingual embeddings: for languages with less training data, one can simply learn a single model for several (related) languages.

Image search using multilingual texts: a cross-modal learning approach between image and text
https://arxiv.org/abs/1903.11299

https://github.com/facebookresearch/MUSE

1d. Some (traditional) word embeddings can account for polysemy.

Linear algebraic structure of word senses with applications to polysemy
https://arxiv.org/abs/1601.03764

Multimodal word distributions
https://arxiv.org/abs/1704.08424

2. "Self-supervised learning" is a special case of unsupervised learning (or a first, pretraining, step in semi-supervised learning, or pre-training for supervised learning) in which we turn the unlabeled data into a supervised problem. For instance, given unlabeled images, we can train a model to tell if two patches come from the same image. Given sound data, we can train a model to tell if two patches are close in time or not – the features learned will identify phonemes, if we want very close patches, or speakers, for slightly more distant ones.

MixMatch: a holistic approach to semi-supervised learning
https://arxiv.org/abs/1905.02249
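
As a toy illustration of the pretext-task idea (made-up data, and much simpler than MixMatch above):

    import numpy as np

    rng = np.random.default_rng(0)
    images = rng.random((100, 32, 32))            # unlabeled images (toy data)

    def random_patch(img, size=8):
        i, j = rng.integers(0, img.shape[0] - size, 2)
        return img[i:i+size, j:j+size]

    def make_pair(same):
        # Pretext task: do these two patches come from the same image?
        a, b = rng.integers(len(images), size=2)
        if same:
            b = a
        return random_patch(images[a]), random_patch(images[b]), int(same)

    pairs = [make_pair(same=rng.random() < 0.5) for _ in range(1000)]
    # Any binary classifier trained on those (patch, patch, label) triples
    # learns image features without ever seeing a human-provided label.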

3. Adversarial training (a new deep-learning name for the more classical notion of "robust optimization") tries to make neural networks (slightly) more robust to adversarial attacks.

Provably Robust Deep Learning via Adversarially Trained Smoothed Classifiers
https://arxiv.org/abs/1906.04584

Adversarial Examples Are Not Bugs, They Are Features
https://arxiv.org/abs/1905.02175
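
The simplest attack (the fast gradient sign method) perturbs the input in the direction of the sign of the loss gradient; adversarial training then also minimizes the loss on such perturbed examples. A toy sketch, with a fixed logistic-regression model:

    import numpy as np

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))

    w = np.array([2.0, -1.0])                 # toy logistic-regression weights
    x, y = np.array([0.5, 0.5]), 1.0          # an input and its true label

    grad_x = (sigmoid(w @ x) - y) * w         # gradient of the cross-entropy loss w.r.t. the input
    x_adv = x + 0.1 * np.sign(grad_x)         # FGSM-style adversarial example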

4. We have a better understanding of the loss landscape of deep neural networks:

- The set of global minima is connected

- Wide minima generalize better

- Stochastic gradient descent (SGD) tends to circle around one of those minima: average the model weights rather than the forecasts

- Different SGD runs will circle around different minima: average the forecasts, not the weights

The following website shows very nice pictures of those landscapes.

https://losslandscape.com/
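
A toy sketch of the last two points of the list above, with a linear model trained by SGD on made-up data:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    y = X @ np.array([1., -2., 0.5, 0., 3.]) + 0.1 * rng.normal(size=200)

    def sgd_run(seed, n_steps=2000, lr=0.01, tail=500):
        r = np.random.default_rng(seed)
        w = r.normal(size=5)
        tail_ws = []
        for t in range(n_steps):
            i = r.integers(len(y))
            w -= lr * 2 * (X[i] @ w - y[i]) * X[i]     # SGD on the squared error
            if t >= n_steps - tail:
                tail_ws.append(w.copy())
        return w, np.mean(tail_ws, axis=0)

    # Within one run (same basin): average the weights over the tail of the trajectory.
    w_last, w_averaged = sgd_run(0)
    # Across runs (possibly different basins): average the forecasts, not the weights.
    forecasts = np.mean([X @ sgd_run(s)[0] for s in range(5)], axis=0)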

5a. Pure stochastic gradient descent (SGD) is not the only way to train a neural network. The gradient descent step is very similar to the Euler discretization of an ODE: many SGD variants can be interpreted as other ODE discretizations.

A differential equation for modeling Nesterov's accelerated gradient method: theory and insights
https://arxiv.org/abs/1503.01243

Neural Tangent Kernel: Convergence and Generalization in Neural Networks
https://arxiv.org/abs/1806.07572
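
For instance, plain gradient descent is exactly the explicit Euler discretization of the gradient flow dw/dt = -grad L(w); a toy sketch:

    import numpy as np

    def grad_L(w):                 # toy loss L(w) = 0.5 * ||w||^2
        return w

    def euler_step(w, h=0.1):      # explicit Euler on dw/dt = -grad L(w)
        return w - h * grad_L(w)   # ... which is just the gradient descent update

    w = np.array([1.0, -2.0])
    for _ in range(100):
        w = euler_step(w)
    # Other discretizations of the same (or related, second-order) ODEs
    # give momentum, Nesterov acceleration, etc.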

5b. Since the successive estimates of the weights of a network form a sequence, we can use all the sequence-related tricks we learned in elementary calculus: Cesàro summation, convergence acceleration, etc.

Bridging the gap between constant step size stochastic gradient descent and Markov chains
https://arxiv.org/abs/1707.06386

Constant-step-size least-mean-square and optimal sampling distributions
https://arxiv.org/abs/1412.0156

5c. Not all training examples are equally relevant: importance sampling gives more weight to the most informative samples, identified by the norm of their gradient.

Not all samples are created equal: deep learning with importance sampling
https://arxiv.org/abs/1803.00942
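
A minimal sketch of the idea (the gradient norms are made up; in practice they come from the model, or from a cheap approximation):

    import numpy as np

    rng = np.random.default_rng(0)
    grad_norms = rng.exponential(size=1000)   # stand-in for per-sample gradient norms
    p = grad_norms / grad_norms.sum()         # sample proportionally to the norms
    idx = rng.choice(len(p), size=32, p=p)    # importance-sampled mini-batch
    weights = 1 / (len(p) * p[idx])           # reweight to keep the gradient estimate unbiased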

5d. With the rise of (clusters of) GPUs, distributed training can speed up computations but, because of the communication bandwidth required, the gains may not be as large as one would expect.

Byzantine-tolerant machine learning
https://arxiv.org/abs/1703.02757

Deep gradient compression: reducing the communication bandwidth for distributed training
https://arxiv.org/abs/1712.01887
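
The main trick to reduce the bandwidth is gradient sparsification; a minimal sketch, much simpler than the deep gradient compression paper above:

    import numpy as np

    def compress_topk(grad, residual, k):
        # Send only the k largest entries; keep the rest locally,
        # to be added back at the next round.
        g = grad + residual
        idx = np.argsort(np.abs(g))[-k:]
        sparse = np.zeros_like(g)
        sparse[idx] = g[idx]
        return sparse, g - sparse             # (what is sent, new local residual)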

5e. There are also alternatives to back-propagation, for instance by learning to forecast the gradients instead of actually computing them, or by training the layers separately, using auxiliary objectives.

Decoupled neural interfaces using synthetic gradients
https://arxiv.org/abs/1608.05343

Putting An End to End-to-End: Gradient-Isolated Learning of Representations
https://papers.nips.cc/paper/8568-putting-an-end-to-end-to-end-gradient-isolated-learning-of-representations

6a. Monte Carlo simulations usually rely on Metropolis-Hastings, Gibbs sampling, or Hamiltonian Monte Carlo (HMC), but these are not the only choices. Langevin Monte Carlo is very similar to gradient descent on the negative log posterior: move in the direction of the gradient of the log posterior probability, but add noise.

Langevin diffusions and the Metropolis-adjusted Langevin algorithm
https://arxiv.org/abs/1309.2983

Riemann manifold Langevin and Hamiltonian Monte Carlo methods
https://pdfs.semanticscholar.org/16c5/06c5bb253f7528ddcc80c72673fabf584f32.pdf
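
A minimal sketch of the (unadjusted) Langevin algorithm on a toy Gaussian target; MALA adds a Metropolis accept/reject step on top of this:

    import numpy as np

    rng = np.random.default_rng(0)

    def grad_log_p(w):             # toy target: standard Gaussian, log p(w) = -||w||^2/2 + const
        return -w

    h, w, samples = 0.01, np.zeros(2), []
    for _ in range(5000):
        w = w + 0.5 * h * grad_log_p(w) + np.sqrt(h) * rng.normal(size=2)
        samples.append(w.copy())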

6b. Most Monte Carlo simulation methods are reversible, but that is not actually needed: non-reversible alternatives may converge faster.

Stochastic bouncy particle sampler
https://arxiv.org/abs/1609.00770

The bouncy particle sampler: a nonreversible rejection-free Markov chain Monte Carlo method
https://arxiv.org/abs/1510.02451

Non-reversible Metropolis-Hastings
https://arxiv.org/abs/1401.8087

Irreversible Monte Carlo algorithms for efficient sampling
https://arxiv.org/abs/0809.0916

Liftings of tree-structured Markov chains
https://people.eecs.berkeley.edu/~sinclair/liftings.pdf

The zig-zag process and super-efficient sampling for Bayesian analysis of big data
https://arxiv.org/abs/1607.03188

A piecewise deterministic scaling limit of lifted Metropolis-Hastings in the Curie-Weiss model
https://arxiv.org/abs/1509.00302

7. Graph data is becoming easier to use.

7a. A simple way to use graph data is a graph kernel, i.e., a way of defining how close two nodes (or two whole graphs) are.

Kernels and regularization on graphs
https://people.cs.uchicago.edu/~risi/papers/SmolaKondor.pdf

Weisfeiler-Lehman graph kernels
http://www.jmlr.org/papers/volume12/shervashidze11a/shervashidze11a.pdf

A persistent Weisfeiler-Lehman procedure for graph classification
http://proceedings.mlr.press/v97/rieck19a/rieck19a.pdf

Another way is to embed the nodes in a Euclidean space, for instance using word2vec on random walks on the graph.

DeepWalk: online learning of social representations
https://arxiv.org/abs/1403.6652

node2vec: scalable feature learning for networks
https://arxiv.org/abs/1607.00653

Anonymous walk embeddings
https://arxiv.org/abs/1805.11921

Structural deep network embedding
https://www.kdd.org/kdd2016/papers/files/rfp0191-wangAemb.pdf

Link prediction via subgraph embedding-based convex matrix completion
http://gerard.demelo.org/papers/link-prediction-subgraphembeddings.pdf

GraRep: learning graph representations with global structural information
https://paperswithcode.com/paper/grarep-learning-graph-representations-with

DeepTrax: embedding graphs of financial transactions
https://arxiv.org/abs/1907.07225

Holographic embeddings of knowledge graphs
https://arxiv.org/abs/1510.04935
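
A DeepWalk-style sketch, on a toy graph; the walks can then be fed to any word2vec implementation (e.g., gensim) to obtain node embeddings:

    import random

    graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}     # toy adjacency list

    def random_walk(start, length=10):
        walk = [start]
        for _ in range(length - 1):
            walk.append(random.choice(graph[walk[-1]]))
        return [str(n) for n in walk]                        # word2vec expects "words"

    walks = [random_walk(node) for node in graph for _ in range(20)]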

Many calculus notions (derivatives, Fourier analysis, etc.) can be discretized using a grid, which is a special type of graph. It turns out that those notions can be generalized to arbitrary graphs: signal processing on graphs.

Multi-dimensional graph Fourier transform
https://arxiv.org/abs/1712.07811

Calculus on graphs
https://arxiv.org/abs/cs/0408028
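
For instance, the graph Fourier transform is just the decomposition of a signal on the eigenvectors of the graph Laplacian:

    import numpy as np

    A = np.array([[0, 1, 1, 0],
                  [1, 0, 1, 0],
                  [1, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=float)   # toy undirected graph
    L = np.diag(A.sum(axis=1)) - A              # combinatorial graph Laplacian
    eigenvalues, U = np.linalg.eigh(L)          # eigenvectors = "Fourier modes", eigenvalues = "frequencies"

    x = np.array([1.0, 2.0, 0.0, -1.0])         # a signal on the nodes
    x_hat = U.T @ x                             # graph Fourier transform
    x_back = U @ x_hat                          # inverse transform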

In particular, the notion of convolution can be generalized from grids to graphs.

CayleyNets: graph convolutional neural networks with complex rational spectral filters
https://arxiv.org/abs/1705.07664

Link prediction based on graph neural networks
https://arxiv.org/abs/1802.09691

Modeling relational data with graph convolutional networks
https://arxiv.org/abs/1703.06103

Graph convolutional neural networks for web-scale recommender systems
https://arxiv.org/abs/1806.01973

Convolutional neural networks on graphs with fast localized spectral filtering
https://arxiv.org/abs/1606.09375

Graph2Seq: graph to sequence learning with attention-based neural networks
https://arxiv.org/abs/1804.00823
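
In its simplest (Kipf-and-Welling-style) form, a graph convolution layer multiplies the node features by a normalized adjacency matrix before the usual dense layer; a sketch with made-up data:

    import numpy as np

    rng = np.random.default_rng(0)
    A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)   # toy graph
    X = rng.normal(size=(3, 4))                                    # node features
    W = rng.normal(size=(4, 2))                                    # layer weights

    A_hat = A + np.eye(3)                                          # add self-loops
    d = 1 / np.sqrt(A_hat.sum(axis=1))
    A_norm = d[:, None] * A_hat * d[None, :]                       # symmetric normalization
    H = np.maximum(0, A_norm @ X @ W)                              # one graph-convolution layer (ReLU)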

GANs are still everywhere and can be used to generate random graphs, hopefully more realistic than those produced by the classical models (Erdős–Rényi, Barabási–Albert (preferential attachment), Watts–Strogatz (rewired ring lattice), exponential random graphs, stochastic block models).

Learning deep generative models of graphs
https://arxiv.org/abs/1803.03324

8a. Reinforcement learning is still a very hot topic, but it remains highly technical: unlike mainstream machine learning algorithms and neural networks, it is something very few of us actually use.

Here is some learning material:

Algorithms for reinforcement learning (2009)
https://sites.ualberta.ca/~szepesva/papers/RLAlgsInMDPs.pdf

Stanford course
https://www.youtube.com/playlist?list=PLoROMvodv4rOSOPzutgyCTapiGlY2Nd8u

Berkeley course
https://www.youtube.com/playlist?list=PLkFD6_40KJIwhWJpGazJ9VSj9CFMkb79A

8b. There are two main concerns with many of the recent reinforcement learning "successes":

- They are not sample-efficient – they require an insane amount of training, equivalent to 1000 or 10,000 years of human time;

- They focus on a single, very specific task, and cannot easily generalize to other tasks.

Meta-learning tries to address those challenges by training a network to learn new tasks from just a few samples (few-shot learning).

Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks
https://arxiv.org/abs/1703.03400

How to train your MAML
https://arxiv.org/abs/1810.09502

RL²: Fast Reinforcement Learning via Slow Reinforcement Learning
https://arxiv.org/abs/1611.02779
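
A first-order MAML-like sketch, on toy quadratic tasks (the real algorithm differentiates through the inner update; this ignores those second-order terms):

    import numpy as np

    rng = np.random.default_rng(0)

    def task_grad(theta, c):                  # toy task: L_c(theta) = ||theta - c||^2
        return 2 * (theta - c)

    theta = np.zeros(2)                       # meta-parameters
    alpha, beta = 0.1, 0.01                   # inner and outer learning rates

    for step in range(1000):
        meta_grad = np.zeros_like(theta)
        for _ in range(4):                    # a batch of tasks
            c = rng.normal(size=2)            # sample a task
            theta_task = theta - alpha * task_grad(theta, c)   # inner adaptation (few-shot step)
            meta_grad += task_grad(theta_task, c)              # first-order approximation
        theta -= beta * meta_grad / 4         # outer (meta) update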

Here is a simple meta-learning benchmark, which can be used as an intelligence test (for computers):

On the Measure of Intelligence
https://arxiv.org/abs/1911.01547

8c. Bayesian methods still have their role to play in the low-data regime.

9. Variational inference is another of those technical topics. The basic idea is simple: we want to compute a probability distribution p, usually the posterior of some Bayesian model, but it is intractable. Instead, we try to find a simpler distribution q, tractable, but close to p. To measure closeness, we use the KL divergence KL(q‖p) – even if p is only known up to an intractable normalizing constant, KL(q‖p) can still be minimized, through the evidence lower bound (ELBO).

For instance, we can search for q as independent Gaussians ("mean field approximation"). Since this completely forgets the dependence structure in p, it looks like a bad idea – but even if the variational distribution q is far from the target p, most of what we want to do with it, in particular, computing expectations, will give similar results.

Unfortunately, this requires ad hoc computations, specific to each model.
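
A minimal sketch of the core computation, for a mean-field Gaussian q (log_joint is whatever your model provides, i.e., log p(x,z) up to a constant; maximizing this ELBO estimate amounts to minimizing KL(q‖p)):

    import numpy as np

    def elbo_estimate(log_joint, mu, log_sigma, n_samples=1000, rng=np.random.default_rng(0)):
        # q(z) = N(mu, diag(sigma^2)); reparameterize z = mu + sigma * eps.
        sigma = np.exp(log_sigma)
        eps = rng.normal(size=(n_samples, mu.size))
        z = mu + sigma * eps
        log_q = -0.5 * np.sum(eps**2 + 2*log_sigma + np.log(2*np.pi), axis=1)
        log_p = np.array([log_joint(zi) for zi in z])
        return np.mean(log_p - log_q)          # ELBO = E_q[log p(x,z)] - E_q[log q(z)]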

A tutorial on variational Bayesian inference
http://www.orchid.ac.uk/eprints/40/1/fox_vbtut.pdf

Automatic differentiation variational inference
https://arxiv.org/abs/1603.00788

10. Geometry

10a. The Kullback-Leibler (KL) divergence is often used to compute the "distance" between distributions. For instance, maximizing the likelihood of a model is equivalent to minimizing the KL divergence between the empirical distribution of the data and the model. But it is often insufficient, because it ignores the geometry of the underlying space: if two distributions have disjoint supports, they will be maximally distant for the KL divergence, regardless of the distance between those supports.

The Wasserstein distance (contrary to the KL divergence, it is actually a distance) addresses those problems but, unfortunately, it is difficult to compute. The Sinkhorn distance is a regularization of the Wasserstein distance. The sliced Wasserstein distance computes the Wasserstein distance after a random projection onto a 1-dimensional subspace (where it is easy to compute) and averages several of those distances.

Sinkhorn distances: lightspeed computation of optimal transportation distances
https://arxiv.org/abs/1306.0895

Fast computation of Wasserstein barycenters
https://arxiv.org/abs/1310.4375

Sliced Gromov-Wasserstein
https://arxiv.org/abs/1905.10124

Large-scale optimal transport and mapping estimation
https://arxiv.org/abs/1711.02283
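
A minimal sketch of the Sinkhorn iterations, for two histograms a, b and a cost matrix C:

    import numpy as np

    def sinkhorn(a, b, C, eps=0.1, n_iter=200):
        # Entropic regularization: alternately rescale rows and columns of exp(-C/eps).
        K = np.exp(-C / eps)
        u = np.ones_like(a)
        for _ in range(n_iter):
            v = b / (K.T @ u)
            u = a / (K @ v)
        P = u[:, None] * K * v[None, :]       # approximate optimal transport plan
        return np.sum(P * C)                  # regularized transport cost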

The Wasserstein distance is also used in GANs: contrary to the KL divergence, its gradient keeps telling the generator in which direction to go, even when the discriminator is good.

Improved training of Wasserstein GANs
https://arxiv.org/abs/1704.00028

10b. Topological data analysis (TDA) remains an underground topic, in spite of the existence of the "TDA" R package which hides all the complexity from the end user. It has been applied to study the complexity of neural networks, and to regularize them.

On characterizing the capacity of neural networks using algebraic topology
https://arxiv.org/abs/1802.04443

Neural persistence: a complexity measure for deep neural networks using algebraic topology
https://arxiv.org/abs/1812.09764

Connectivity-optimized representation learning via persistent homology
https://arxiv.org/abs/1906.09003

A topological regularizer for classifiers via persistent homology
https://arxiv.org/abs/1806.10714

An introduction to topological data analysis: fundamental and practical aspects for data scientists
https://arxiv.org/abs/1710.04019

10c. Other bits of algebraic topology find applications in statistics.

Statistical inference using the Morse-Smale complex
https://arxiv.org/abs/1506.08826

Morse-Smale regression
https://www.ncbi.nlm.nih.gov/pubmed/23687424

Visual exploration of high-dimensional scalar functions
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.226.3032&rep=rep1&type=pdf

Data analysis with the Morse-Smale complex: the msr package for R
https://www.jstatsoft.org/article/view/v050i02/v50i02.pdf

10d. Deep learning uses a lot of linear transformations and Euclidean distances but, for some types of data, Euclidean space is not the best choice. For instance, trees are difficult to embed in flat spaces: if you start from the root, add three children to it, add three children to each of them, and repeat a few times, the number of nodes grows exponentially while the available volume only grows polynomially, so the nodes become closer and closer. Hyperbolic spaces do not have that problem and are therefore a good choice to compute embeddings of hierarchies, trees or more general graphs.

On the tree-likeness of hyperbolic spaces
https://arxiv.org/abs/1105.3925

Hyperbolic deep learning for Chinese natural language understanding
https://arxiv.org/abs/1812.10408

Hyperbolic neural networks
https://arxiv.org/abs/1805.09112

Hyperbolic attention networks
https://arxiv.org/abs/1805.09786

Representation trade-offs for hyperbolic embeddings
https://arxiv.org/abs/1804.03329

Reconstructing approximate tree metrics
http://www.cs.yale.edu/homes/mahesh/papers/approx-tree.pdf

Learning continuous hierarchies in the Lorentz model of hyperbolic geometry
https://arxiv.org/abs/1806.03417
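
For reference, the distance in the Poincaré ball model (points of norm < 1) is easy to compute; points near the boundary are far apart even when they are Euclidean-close, which leaves room for the exponentially many leaves of a tree:

    import numpy as np

    def poincare_distance(x, y):
        # Poincare ball model: points with ||.|| < 1.
        sq = np.sum((x - y) ** 2)
        denom = (1 - np.sum(x ** 2)) * (1 - np.sum(y ** 2))
        return np.arccosh(1 + 2 * sq / denom)

    print(poincare_distance(np.array([0.0, 0.0]), np.array([0.5, 0.0])))   # near the centre: small
    print(poincare_distance(np.array([0.9, 0.0]), np.array([0.0, 0.9])))   # near the boundary: large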

10e. When training a neural network, we often consider the space in which the weights live as a vector space. But points in that space represent models of the data, and the Euclidean distance between those points does not correspond to a useful notion of distance between those models. This is already visible with a simple model, univariate Gaussians. There are two parameters, μ and σ (or μ and σ², or μ and log σ, or anything you want – the fact that there are several meaningful reparametrizations leading to different Euclidean distances is already troubling). We are tempted to say that the distance between N(0,1) and N(1,1) is 1, and so is the distance between N(0,σ²) and N(1,σ²), regardless of the variance. But when the variance is low, N(0,σ²) and N(1,σ²) are very easy to distinguish, while if the variance is high, they are almost indistinguishable: to reflect this, we need something other than the Euclidean distance.

It is possible to define a "metric" (a Riemannian structure) using the Fisher information matrix that behaves as we expect: the distance between N(0,σ²) and N(1,σ²) decreases as σ increases.
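
For the univariate Gaussian, the Fisher information matrix is diag(1/σ², 2/σ²) in the (μ, σ) parametrization, so the (local) distance between N(0,σ²) and N(1,σ²) is roughly 1/σ:

    import numpy as np

    def fisher_gaussian(sigma):
        # Fisher information of N(mu, sigma^2) in the (mu, sigma) coordinates.
        return np.diag([1 / sigma**2, 2 / sigma**2])

    step = np.array([1.0, 0.0])                   # from N(0, sigma^2) to N(1, sigma^2)
    for sigma in [0.1, 1.0, 10.0]:
        F = fisher_gaussian(sigma)
        print(sigma, np.sqrt(step @ F @ step))    # 10, 1, 0.1: shrinks as sigma grows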

The same idea applies to neural networks, and has applications to their training: the Riemannian structure defines a more geometric gradient, called the "natural gradient", which speeds up convergence.

Information-geometric optimization algorithms: a unifying picture via invariance principles
https://arxiv.org/abs/1106.3708

The Riemannian geometry of deep generative models
https://arxiv.org/abs/1711.08014

Limitations of the empirical Fisher approximation
https://arxiv.org/abs/1905.12558

11a. Interpretable machine learning

https://christophm.github.io/interpretable-ml-book/

11b. Researchers are more and more concerned about ethical issues, for instance that of "fairness". Unfortunately, there are several legitimate but incompatible notions of "fairness" (statistical parity, equal opportunity, disparate impact, average odds, etc.).

Long list of fairness measures and mitigation algorithms (pre-, in- and post-processing)
https://github.com/IBM/AIF360

While you should worry about those topics, and even assess the fairness of your models, you should leave the choice of which definition of "fairness" to use and of the trade-off between fairness and performance to someone else (stakeholders, legal department, management, etc.) – otherwise, your choices will be challenged.
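
Two of the fairness measures mentioned above are easy to compute by hand (toy data below): the statistical parity difference compares selection rates across groups, and the disparate impact is their ratio (the "80% rule"):

    import numpy as np

    y_pred = np.array([1, 0, 1, 1, 0, 1, 0, 0])   # model decisions (toy data)
    group  = np.array([0, 0, 0, 0, 1, 1, 1, 1])   # protected attribute

    rate_0 = y_pred[group == 0].mean()            # selection rate in group 0
    rate_1 = y_pred[group == 1].mean()            # selection rate in group 1
    statistical_parity_difference = rate_1 - rate_0
    disparate_impact = rate_1 / rate_0            # compare to 0.8 ("80% rule")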

12. Other topics

12a. New layer types

Lattice layers can enforce monotonicity constraints

Deep lattice networks and partial monotonic functions
https://arxiv.org/abs/1709.06680

Monotonic calibrated interpolated look-up tables
https://arxiv.org/abs/1505.06378

Lattice regression
https://papers.nips.cc/paper/3694-lattice-regression.pdf

Fast and flexible monotonic functions with ensembles of lattices
https://papers.nips.cc/paper/6377-fast-and-flexible-monotonic-functions-with-ensembles-of-lattices

Constrained optimization layer

Differentiable Convex Optimization Layers
http://web.stanford.edu/~boyd/papers/pdf/diff_cvxpy.pdf
https://github.com/cvxgrp/cvxpylayers

Variance layers

Variance networks: when expectation does not meet your expectations
https://arxiv.org/abs/1803.03764

12b. Data structures, in computer science (think of hash tables), are usually designed to provide the correct answer. In the past few years, probabilistic data structures, which only return the correct answer with high probability (for instance, Bloom filters), have risen in popularity. Those probabilistic data structures can now be learned from data.

The case for learned index structures
https://arxiv.org/abs/1712.01208
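
For reference, a classical (non-learned) Bloom filter fits in a few lines; it may report false positives but never false negatives:

    import hashlib

    class BloomFilter:
        def __init__(self, n_bits=1024, n_hashes=3):
            self.n_bits, self.n_hashes = n_bits, n_hashes
            self.bits = bytearray(n_bits)

        def _positions(self, item):
            # k hash functions, simulated by salting a single hash.
            for i in range(self.n_hashes):
                h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
                yield int(h, 16) % self.n_bits

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos] = 1

        def __contains__(self, item):
            return all(self.bits[pos] for pos in self._positions(item))

    f = BloomFilter()
    f.add("alice")
    print("alice" in f, "bob" in f)     # True, (almost certainly) False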

12c. Tensor decompositions

Tensor networks
https://tensornetwork.org/

12d. RKHS

A primer on reproducing kernel Hilbert spaces
https://arxiv.org/abs/1408.0952

My summary of those papers (and many more, in reverse chronological order of reading time) is here:

http://zoonek.free.fr/Ecrits/articles.pdf
