Here is, as every year, an overview of some of the papers and books I read this year. The main topics are causality, explainability, uncertainty quantification (conformal learning), and data-centric machine learning.
For a longer reading list:
http://zoonek.free.fr/Ecrits/articles.pdf
We all know the mantra "correlation is not causation" -- but how do we assess causation?
The goal of causal discovery is to recover the data generation process or, at least, the corresponding (causal) relations between the variables.
A structural causal model (SCM) is a data generation process, i.e., a list of equations defining each variable from the previous ones, e.g.,
X₁ = f₁(ε₁)
X₂ = f₂(X₁,ε₂)
X₃ = f₃(ε₃)
X₄ = f₄(X₂,X₃,ε₄)
X₅ = f₅(X₄,ε₅)
The dependencies between the variables form a directed acyclic graph (DAG).
We often hear of an opposition between the "DAG approach" and the "potential outcomes" approach. There is no opposition: these are separate steps in the causal analysis of data: the DAG approach assumes that the DAG is unknown or complex, while the potential outcomes approach assumes that the DAG is known and simple (only treatment, outcome and confounders, sometimes also mediators or instrumental variables) -- one can use the DAG approach to simplify the DAG and make the problem amenable to the potential outcomes approach.
Let us review a few (classical) causal discovery algorithms.
The "PC" algorithm is the most commonly-used causal discovery algorithm: it performs conditional independence tests, between all the variables, and tries to find a DAG compatible with (most of) the test results. Since confounders and colliders lead to different conditional independence relations, it is possible to detect the orientation of some of the edges (but usually, not all).
The "GES" algorithm (greedy equivalence search) looks for a DAG with a high "score" (for instance, the BIC of a Gaussian model), adding, one by one, the edges that improve the score the most. Exact search is possible (with dynamic programming or, better, with the A* search algorithm), but more time-consuming.
A* Lasso for learning a sparse Bayesian network structure for continuous variables J. Xiang and S. Kim (2013) https://papers.nips.cc/paper_files/paper/2013/hash/8ce6790cc6a94e65f17f908f462fae85-Abstract.html https://en.wikipedia.org/wiki/A*_search_algorithm
The last classical causal discovery algorithm, LiNGAM, assumes the model is linear, with non-Gaussian noise, and uses ICA (independent component analysis) to recover it.
Causal discovery looks like a combinatorial optimization problem (a DAG is a discrete object), which would prevent us from using differentiable optimization algorithms, such as gradient descent.
But the DAG constraint, on the adjacency matrix of a graph, can actually be written in a differentiable way: the diagonal elements of the powers of the adjacency matrix count cycles -- we just have to require that the trace of the sum of those powers be zero. The NOTEARS constraint uses the matrix exponential, but any polynomial with positive coefficients and a term for each power, up to the number of nodes, would do. This works on paper, but in practice, it only helps avoid small cycles.
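As an illustration, here is a minimal sketch of the NOTEARS acyclicity penalty (the function name is mine), which is zero exactly when the weighted adjacency matrix W describes an acyclic graph:

import numpy as np
from scipy.linalg import expm

def notears_acyclicity(W):
    # h(W) = trace(exp(W∘W)) - d: the (i,i) entries of the powers of W∘W count
    # weighted cycles through node i, so h(W) = 0 iff the graph is acyclic
    d = W.shape[0]
    return np.trace(expm(W * W)) - d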
A survey on causal discovery: theory and practice A. Zanga and F. Stella (2023) https://arxiv.org/abs/2305.10032 Methods and tools for causal discovery and causal inference A.R. Nogueira et al. (2022) https://wires.onlinelibrary.wiley.com/doi/10.1002/widm.1449 Review of causal discovery methods based on graphical models C. Glymour et al. (2019) https://www.frontiersin.org/journals/genetics/articles/10.3389/fgene.2019.00524/full
NOTEARS assumes the SCM is linear, but it can be made non-linear, with neural networks (cGAN, GranDAG, MCSL, DAG-GNN).
A graph autoencoder approach to causal structure learning I. Ng et al. (2019) https://arxiv.org/abs/1911.07420 Gradient-based neural DAG learning S. Lachapelle et al. (2019) https://arxiv.org/abs/1906.02226 Masked gradient-based causal structure learning I. Ng et al. (2022) https://arxiv.org/abs/1910.08527 DAG-GNN: DAG structure learning with graph neural networks Y. Yu et al. (2019) https://arxiv.org/abs/1904.10098
The causal discovery problem can be decomposed into two steps: first, find a topological order on the variables, then find the edges.
This is what CAM does: first, if there are many variables, select 10 potential parents for each node, using GAM boosting; then, use unpenalized GES to find a complete DAG (the topological order); finally, prune that DAG, using the GAM p-values.
CAM: Causal additive models, high-dimensional order search and penalized regression P. Bühlmann et al. (2014) https://arxiv.org/abs/1310.1533
One can also use reinforcement learning to search for a DAG with a good score: the state could be a DAG, or a bootstrap sample of the data, the action a new DAG, and the reward is the score (with a penalty to ensure it is a DAG).
Causal discovery with reinforcement learning S. Zhu et al. (2019) https://arxiv.org/abs/1906.04477
Alternatively, one can use RL to progressively build the DAG (or the topological order), one edge (or node) at a time.
Ordering-based causal discovery with reinforcement learning X. Wang et al. (2021) https://arxiv.org/abs/2105.06631
Lastly, we can use an LLM -- not with the data, but just the meta-data: give it the name of the columns of your table or data-frame, with their description if available, and ask for the causal relations. You may want to proceed step by step, first asking for a node with no parents, then asking for its children, thereby building a topological order one node at a time. You can also provide the LLM with some numeric data derived from the data, such as correlations or partial correlations.
Efficient causal graph discovery using large language models T. Jiralerspong et al. (2024) https://arxiv.org/abs/2402.01207 Towards automating causal discovery in financial markets and beyond A. Sokolov et al. (2023) https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4679414 Causal structure learning supervised by large language models T. Ban et al. (2023) https://arxiv.org/abs/2311.11689
One could also use LLMs to read text (e.g., research papers in a given domain) and extract causal relations from those texts: they become a knowledge graph.
Causal discovery for time series often relies on (generalizations of) Granger causality,
Neural Granger causality A. Tank et al. https://arxiv.org/abs/1802.05842 Economy statistical recurrent units for inferring nonlinear Granger causality S. Khanna and V.Y.F. Tan (2019) https://arxiv.org/abs/1911.09879 Discovering nonlinear relations with minimum predictive information regularization T. Wu et al. (2020) https://arxiv.org/abs/2001.01885
the PC algorithm (PCMCI),
Causal inference for time series analysis: problems, methods and evaluation R. Moraffah et al. (2021) https://arxiv.org/abs/2102.05829 Causality-inspired models for financial time series forecasting D.C. Oliveira et al. (2024) https://arxiv.org/abs/2408.09960
structural equation models (LiNGAM on VAR residuals),
Estimation of a structural vector autoregression model using non-Gaussianity A. Hyvärinen et al. (2010) https://www.jmlr.org/papers/v11/hyvarinen10a.html The evolving causal structure of equity risk factors G. D'Acunto et al. (2021) https://arxiv.org/abs/2111.05072 Python package for causal discovery based on LiNGAM T. Ikeuchi et al. (2023) https://jmlr.org/papers/v24/21-0321.html https://github.com/cdt15/lingam
or NOTEARS (Dynotears).
Dynotears: structure learning from time series data R. Pamfil et al. (2020) https://arxiv.org/abs/2002.00498
To account for non-stationarity, adding a time variable may suffice.
Causal discovery in financial markets: a framework for nonstationary time series data A. Sadeghi et al. (2024) https://arxiv.org/abs/2312.17375 Causal discovery from heterogeneous/nonstationary data B. Huang et al. (2019) https://arxiv.org/abs/1903.01672
The algorithms we have mentioned so far make a few assumptions which may not always hold: in particular, that we observe all the confounders.
It is possible to modify the algorithms we saw to account for possible unobserved latent variables.
For instance, the presence or absence of an (unobserved) variable can change some of the conditional independence relations tested by the PC algorithm; it also affects the rank of the cross-covariance matrices.
A versatile causal discovery framework to allow causally related hidden variables X. Dong et al. (2023) https://arxiv.org/abs/2312.11001
If you suspect pervasive confounding, i.e., unobserved confounders affecting many observed variables, PCA can help recover them.
The DeCAMFounder: nonlinear causal discovery in the presence of hidden variables R. Agrawal et al. (2021) https://arxiv.org/abs/2102.07921
Unfortunately, the results of those algorithms are rarely useful: quite often, the conclusion is that there might be a causal relation in one direction, or in the other, or there might be a confounder -- most edges are left undirected.
It is possible, however, to estimate the strength a confounder should have to change the sign of an effect -- we may have overlooked confounders with a weak effect, but overlooking a strong one is less likely.
Causal discovery usually assumes that the data generation process defines a directed acyclic graph (DAG). This sounds like a reasonable assumption but, in some cases, we can end up with cycles. Here are a few examples.
Consider two variables, A and B, evolving over time, the current value of each influencing the next value of the other. If we only observe the equilibrium state of that system, we have a cycle, A(∞)⇆B(∞).
In the same situation, A(t)→B(t+1) and B(t)→A(t+1), if we only observe aggregated data, a(t)=A(t)+A(t-1) and b(t)=B(t)+B(t-1), we have a circular relation between a and b. If the time scales are very different (e.g., seconds and months), we would have to add millions of unobserved variables: that is unreasonable.
On the three demons in causality in finance: time resolution, nonstationarity and latent factors X. Dong et al. (2024) https://arxiv.org/abs/2401.05414
Cyclic relations can also come from constraints: if A+B=1, then any intervention on A changes B, and conversely: each variable causes the other (in finance, this happens with portfolio weights).
Structural equation models (SEM) generalize structural causal models (SCM) by allowing cycles (and unobserved variables). If the model is linear, Gaussian, with known structure (the sparsity pattern, i.e., the directed graph, is known), it can be estimated by comparing the sample variance matrix of the observed variables with that of the model.
semopy: a Python package for structural equation modeling G. Meshcheryakov and A.A. Igolkina (2019) https://arxiv.org/abs/1905.09376 semopy 2: a structural equation modeling package with random effects in Python G. Meshcheryakov et al. (2021) https://arxiv.org/abs/2106.01140
To assess the quality of a causal discovery algorithm, we need to compare the predicted DAG with the ground-truth one.
The structural Hamming distance (SHD), the L¹ distance between the adjacency matrices, i.e., the number of edges to add or remove to recover the true graph, is an obvious choice, but it does not reflect the way the graph will be used -- to compute the causal impact of one variable on another.
Instead, the structural intervention distance (SID) counts, over ordered pairs of nodes, how often the parents of a node in the predicted graph fail to form a valid adjustment set in the true graph.
Structural intervention distance (SID) for evaluating causal graphs J. Peters and P. Bühlmann (2013) https://arxiv.org/abs/1306.1043
This can be done with other valid adjustment sets, e.g., all previous nodes in a topological order, or the "optimal adjustment set" (optimal in terms of variance).
Adjustment identification distance: a gadjid for causal structure learning L. Henckel et al. (2024) https://arxiv.org/abs/2402.08616 Graphical criteria for efficient total effect estimation via adjustment in causal linear models L. Henckel et al. (2020) https://arxiv.org/abs/1907.02435
Lesser-used distances include the number of conditional independencies entailed by one graph and not the other, or the number of causes (pairs of nodes with a causal path between them).
Investment decisions under almost complete causal ignorance J. Simonian (2022)
Causal discovery algorithms are... disappointing. They tend to give very different results. To try to reduce the variability in their outputs, we could try to average them. But how can we average DAGs? Some majority vote, e.g., keeping the edges present in 50% (or some other threshold) of the graphs, gives a directed graph, but there is no guarantee that it will be acyclic.
This can be formulated as a combinatorial optimization problem: minimize the sum of the L¹ distances to the DAGs. The constraint that the graph be a DAG looks tricky, but it is actually easy to enforce: solve the problem without any constraint, check if there is a cycle and, if there is one, add a constraint forbidding that specific cycle, solve the problem again, and iterate, adding a new constraint each time you find a cycle. This is a sequence of mixed-integer linear programs (MIP).
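Here is a minimal sketch of that lazy-constraint loop (the function is mine), assuming 0/1 adjacency matrices and the pulp and networkx packages:

import networkx as nx
import pulp

def average_dags(adjacency_matrices):
    # "median" DAG: minimize the sum of L1 distances to the input adjacency matrices,
    # adding cycle-forbidding constraints lazily until the solution is acyclic
    n = adjacency_matrices[0].shape[0]
    prob = pulp.LpProblem("average_dag", pulp.LpMinimize)
    e = {(i, j): pulp.LpVariable(f"e_{i}_{j}", cat="Binary")
         for i in range(n) for j in range(n) if i != j}
    # the L1 distance to each input graph decomposes edge by edge
    prob += pulp.lpSum((1 - e[i, j]) if A[i, j] else e[i, j]
                       for A in adjacency_matrices for (i, j) in e)
    while True:
        prob.solve(pulp.PULP_CBC_CMD(msg=False))
        G = nx.DiGraph()
        G.add_edges_from((i, j) for (i, j), v in e.items() if v.value() > 0.5)
        try:
            cycle = nx.find_cycle(G)
        except nx.NetworkXNoCycle:
            return G
        prob += pulp.lpSum(e[i, j] for i, j in cycle) <= len(cycle) - 1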
Similarly, one can use a MaxSAT solver to combine background knowledge with the output of a causal discovery algorithm.
Introduction to the foundations of causal discovery F. Eberhardt (2016) https://link.springer.com/article/10.1007/s41060-016-0038-6
Assessing the quality of causal discovery algorithms is tricky: with real-world data, the ground truth causal graph is rarely known.
Instead, we can use simulated data, but it is very easy to introduce artefacts, which would not be present in real-world data, and help causal discovery -- for instance, variance tends to increase along a topological order.
Beware of the simulated DAG! Causal discovery benchmarks may be easy to game A.G. Reisach et al. (2021) https://arxiv.org/abs/2102.13647
The goal of causal inference is (once you have the causal graph) to estimate the causal effect of one variable ("treatment") T on another ("outcome") Y.
A straightforward approach is to control for the confounders (with more complex causal graphs, the notion of confounder should be replaced by "valid adjustment set"), i.e., to fit a model Y ~ T + X.
But controlling for confounders (aka backdoor adjustment, direct method) is not the only (or even the best) option. Inverse propensity weighting first forecasts the treatment from the confounders, T ~ X, and then uses the predicted treatment probability ("propensity") as weights to forecast the outcome.
"Doubly robust" methods combine controlling for confounders and inverse propensity weighting.
These are different estimators of the same quantity (the average treatment effect, or the conditional average treatment effect) -- and there are many more: S-learner, T-learner, X-learner, double machine learning (DML), causal forests, etc.
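As an illustration, here is a minimal sketch of a doubly robust (AIPW) estimate of the average treatment effect, with scikit-learn models standing in for the propensity and outcome models (the function name is mine):

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

def aipw_ate(X, t, y):
    # doubly robust (AIPW) estimator: combines an outcome model (backdoor adjustment)
    # with inverse propensity weighting; consistent if either model is correct
    e = LogisticRegression().fit(X, t).predict_proba(X)[:, 1]        # propensity, T ~ X
    mu1 = LinearRegression().fit(X[t == 1], y[t == 1]).predict(X)    # outcome model, treated
    mu0 = LinearRegression().fit(X[t == 0], y[t == 0]).predict(X)    # outcome model, control
    return np.mean(mu1 - mu0
                   + t * (y - mu1) / e
                   - (1 - t) * (y - mu0) / (1 - e))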
Applied causal inference powered by ML and AI V. Chernozhukov et al. (2024) https://causalml-book.org/ Estimation and inference of heterogeneous treatment effects using random forests S. Wager and S. Athey (2017) https://arxiv.org/abs/1510.04342
For a concise and rigorous introduction to causal inference, check:
Introduction to causal inference B. Neal (2020) https://www.bradyneal.com/causal-inference-course
For more details on the different causal inference procedures (controlling for confounders, inverse propensity weighting, doubly robust methods, etc.), check the following.
Causal inference in R workshop https://www.youtube.com/watch?v=RyIyqV2LRvY https://r-causal.github.io/causal_workshop_website/
Software for causal discovery include gcastle (which provides a unified API to the classical algorithms) and causal-learn (each algorithm has a different API, but it is easier to examine and change their inner components, e.g., the conditional independence tests for the PC algorithm).
https://github.com/huawei-noah/trustworthyAI/tree/master/gcastle https://causal-learn.readthedocs.io/en/latest/
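For instance, a minimal sketch of running the PC algorithm with causal-learn (assuming its current API) on a simulated chain x1 → x2 → x3:

import numpy as np
from causallearn.search.ConstraintBased.PC import pc

rng = np.random.default_rng(0)
x1 = rng.normal(size=2000)
x2 = 0.8 * x1 + rng.normal(size=2000)
x3 = 0.8 * x2 + rng.normal(size=2000)
X = np.column_stack([x1, x2, x3])

cg = pc(X, alpha=0.05)    # conditional independence tests, then edge orientation
print(cg.G)               # estimated CPDAG -- some edges may remain undirected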
For causal inference, check dowhy.
https://github.com/py-why/dowhy
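A minimal sketch, assuming the current dowhy API, on simulated data (variable names and the true effect of 2 are mine):

import numpy as np
import pandas as pd
from dowhy import CausalModel

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 2))
t = (x[:, 0] + rng.normal(size=1000) > 0).astype(int)
y = 2 * t + x[:, 0] + x[:, 1] + rng.normal(size=1000)
df = pd.DataFrame({"x1": x[:, 0], "x2": x[:, 1], "t": t, "y": y})

model = CausalModel(data=df, treatment="t", outcome="y", common_causes=["x1", "x2"])
estimand = model.identify_effect()                     # backdoor / frontdoor / IV identification
estimate = model.estimate_effect(estimand, method_name="backdoor.linear_regression")
print(estimate.value)                                  # should be close to the true effect of 2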
Machine learning and deep learning models are becoming larger, and it is increasingly difficult to understand what they are doing. We can attempt to address this issue in two ways: explainability and interpretability.
Explainability (XAI, explainable AI) takes a trained model, and tries to infer what it is doing, how it is using its inputs. For instance, partial dependency plots look at the average effect of changing one of the inputs while keeping the others constant; LIME approximates the model, in the neighbourhood of an observation, with a linear model; Shapley values decompose the output (separately for each sample) into a sum of contributions of the inputs. But XAI is only a rationalization of what the model is doing -- a plausible, convincing explanation, which can be completely wrong.
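For concreteness, here is a minimal sketch of computing Shapley values with the shap package, assuming a fitted tree-based model:

import shap
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=5, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)      # model-specific, fast explainer for tree ensembles
shap_values = explainer.shap_values(X)     # one additive decomposition of the output per sample
shap.summary_plot(shap_values, X)          # global view of the per-feature contributions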
Interpretable machine learning C. Molnar https://christophm.github.io/interpretable-ml-book/
In contrast, interpretability seeks models that are interpretable by construction. Those models tend to be additive and/or sparse and/or monotonic and/or modular.
Designing inherently interpretable machine learning models A. Sudjianto and A. Zhang (2021) https://arxiv.org/abs/2111.01743
Let us list some of those models.
Interpretable and explainable machine learning: a methods-centric overview with concrete examples R. Marcinkevičs and J.E. Vogt (2023) https://wires.onlinelibrary.wiley.com/doi/full/10.1002/widm.1493
Linear models are interpretable, provided there is a small number of predictors: if there are many, a lasso or elasticnet penalty can help select a subset of them.
Statistical learning with sparsity T. Hastie et al. (2015) https://hastie.su.domains/StatLearnSparsity/
Interpretability can be improved further if you include some prior knowledge on the predictors: for instance, you often know whether they have a positive or negative effect on the target variable -- you can include those sign constraints in the regression. OLS is a quadratic optimization problem; you are just adding linear constraints: this is easily solved by a convex optimizer.
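A minimal sketch of sign-constrained least squares with cvxpy (the function and its "signs" argument are mine):

import cvxpy as cp

def sign_constrained_ols(X, y, signs):
    # signs[i] = +1, -1 or 0 (unconstrained): one prior sign per predictor
    beta = cp.Variable(X.shape[1])
    constraints = [beta[i] >= 0 for i, s in enumerate(signs) if s > 0]
    constraints += [beta[i] <= 0 for i, s in enumerate(signs) if s < 0]
    cp.Problem(cp.Minimize(cp.sum_squares(X @ beta - y)), constraints).solve()
    return beta.value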
If the model is designed to be used by humans (this is often the case in medicine: "if this condition is satisfied, add 1 point; if that condition is satisfied, add 4 points; etc.; if the total is 10 or more, follow this procedure"), you can also ask for integer coefficients in the regression (this is a mixed-integer optimization problem, which is still easy to solve; for the sign constraints alone, scikit-learn's LinearRegression has a "positive" argument).
Supersparse linear integer models for interpretable classification B. Ustun et al. (2014) https://arxiv.org/abs/1306.6677 Supersparse linear integer models for optimized medical scoring systems B. Ustun and C. Rudin https://arxiv.org/abs/1502.04269
Some machine learning models readily accept monotonicity constraints, e.g., decision trees, or boosting (XGBoost, LightGBM, CatBoost all have an argument for that).
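For instance, with LightGBM (XGBoost and CatBoost have similar parameters), a minimal sketch on simulated data:

import numpy as np
from lightgbm import LGBMRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = np.exp(X[:, 0]) - X[:, 1] ** 3 + rng.normal(size=1000)

# +1 / -1 / 0: the response must be increasing / decreasing / unconstrained in each feature
model = LGBMRegressor(monotone_constraints=[1, -1, 0]).fit(X, y)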
Decision trees are interpretable, but only if they are very shallow. Instead of learning balanced decision trees, whose branches quickly spread out, we can learn "1-dimensional" trees, or "rule lists": ask a question; if the answer is yes, return a forecast; if no, ask another question; iterate. "Falling rule lists" also add the condition that the forecast (say, the probability of a disease) decreases as you move along the list of questions.
An optimization approach to learning falling rule lists C. Chen and C. Rudin (2018) https://arxiv.org/abs/1710.02572
Linear regression does not account for... nonlinearities. Generalized additive models (GAM) are a straightforward generalization of linear models: each variable is first transformed, separately, in a nonlinear way; the transformed variables are then combined linearly. It is possible to add monotonicity constraints, and it is also possible to combine GAMs with boosting: we do not end up with a single model, but a family of models (a "regularization path"), increasingly complex, in which the predictors enter one by one, initially in a linear fashion.
https://cran.r-project.org/web/packages/mboost/vignettes/mboost_tutorial.pdf
It is possible to add interactions, but care should be taken to clearly separate the 1-variable effects from the interactions. GAMs can also be estimated with neural nets, or trees.
GAMI-Net: an explainable neural network based on generalized additive models with structured interactions Z. Yang et al. (2020) https://arxiv.org/abs/2003.07132 Interpretable machine learning based on functional Anova framework: algorithms and comparison L. Hu et al. (2023) https://arxiv.org/abs/2305.15670 Using model-based trees with boosting to fit low-order functional Anova models L. Hu et al. (2022) https://arxiv.org/abs/2207.06950
Neural nets are the archetypal example of non-interpretable models -- but this can be fixed, at least to some extent. For instance, we can constrain a neural network to be monotonic in its inputs.
The easiest way to do that is to use monotonic activation functions and positive weights. You should not (or not only) use ReLU for this: the network would be monotonic and convex in each of its inputs.
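A minimal sketch in PyTorch (the layer name is mine): positive weights via softplus, with a bounded monotonic activation such as tanh to avoid the everywhere-convex behaviour of ReLU.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MonotonicLinear(nn.Module):
    # weights are forced positive through softplus, so the layer is non-decreasing in every input
    def __init__(self, d_in, d_out):
        super().__init__()
        self.w = nn.Parameter(torch.randn(d_out, d_in) * 0.1)
        self.b = nn.Parameter(torch.zeros(d_out))
    def forward(self, x):
        return x @ F.softplus(self.w).T + self.b

model = nn.Sequential(MonotonicLinear(5, 32), nn.Tanh(), MonotonicLinear(32, 1))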
Constrained monotonic neural networks D. Runje and S.M. Shankaranarayana (2022) https://arxiv.org/abs/2205.11775
Alternatively, in particular if the model is more complex than a simple MLP, we can add a penalty on the gradient, to make it positive -- not the gradient of the loss wrt the parameters, which is used for training, but the gradient of the output wrt the input. The resulting model will be close to monotonic, but there is no guarantee.
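A minimal sketch of such a penalty in PyTorch (the function is mine); it penalizes negative partial derivatives of the output with respect to the inputs, on a batch:

import torch

def monotonicity_penalty(model, x):
    # gradient of the output wrt the inputs (not of the loss wrt the parameters)
    x = x.clone().requires_grad_(True)
    y = model(x).sum()
    (grad,) = torch.autograd.grad(y, x, create_graph=True)
    return torch.relu(-grad).mean()   # positive wherever the model is locally decreasing

# total loss: task_loss + lam * monotonicity_penalty(model, x_batch)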
How to incorporate monotonicity in deep networks while preserving flexibility? A. Gupta et al. (2019) https://arxiv.org/abs/1909.10662
If you need a guarantee that the model is monotonic (e.g., for legal reasons), monotonicity can be formulated as a satisfiability problem: this will give you a proof that the model is monotonic, regardless of how it was trained (or a proof that it is not).
Certified monotonic neural networks X. Liu et al. (2020) https://arxiv.org/abs/2011.10219
It is also possible to look for monotonicity counter-examples during training, and use them to make the model more monotonic.
Counterexample guided learning of monotonic neural networks A. Sivaraman et al. (2020) https://arxiv.org/abs/2006.08852
Making neural networks sparse (making them use only a small subset of their inputs) looks easy: just add an L¹ penalty to the coefficients of the first layer. In practice, however, it does not really work: none of the coefficients is exactly zero. To fix this, you can separate the L¹ penalty from the rest of the loss, and use proximal gradient.
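A minimal sketch of the proximal step (the soft-thresholding operator for the L¹ penalty), to be applied to the first-layer weights after each gradient step on the smooth part of the loss:

import torch

def prox_l1(layer, lr, lam):
    # soft-thresholding: shrinks the first-layer weights towards zero and produces exact zeros
    with torch.no_grad():
        w = layer.weight
        w.copy_(torch.sign(w) * torch.clamp(w.abs() - lr * lam, min=0.0))

# optimizer.step()                         # gradient step on the un-penalized loss
# prox_l1(first_layer, lr=1e-3, lam=1e-2)  # then the proximal (soft-thresholding) step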
Sparse-input neural networks for high-dimensional nonparametric regression and classification J. Feng and N. Simon (2019) https://arxiv.org/abs/1711.07592
Symbolic regression looks for a formula explaining the target variable from the predictors, using all the mathematical operators you provide. It may look appealing, but it is limited to situations with a small number of variables, and either a lot of data or very little noise -- it is good to find physical laws, but disappointing in many situations.
Interpretable machine learning with PySR and SymbolicRegression.jl M. Cranmer (2023) https://arxiv.org/abs/2305.01582
Additive index models (AIM) combine dimension reduction and GAM: they first project the input on a (learnt) low-dimensional space, and then fit a GAM on this space; the whole model is trained end-to-end.
Explainable neural networks based on additive index models J. Vaughan et al. (2018) https://arxiv.org/abs/1806.01933 Adaptive explainable neural networks (AxNNs) J. Chen et al. (2020) https://arxiv.org/abs/2004.02353
A varying-coefficient model (VCM) is a locally linear model: a linear model in which the coefficients vary, but very slowly. To fit one, use a neural network, with a penalty to ensure the coefficients vary slowly.
Towards robust interpretability with self-explaining neural networks D. Alvarez-Melis and T.S. Jaakkola https://arxiv.org/abs/1806.07538 Varying coefficient models T. Hastie and R. Tibshirani (1993)
Some of those approaches can be combined, for instance, GAM and VCM.
Granger-causal attentive mixtures of experts: learning important features with neural nets P. Schwab et al. (2019) https://arxiv.org/abs/1802.02195
For a Python implementation of some of those models, check imodels (open source) or PiML (closed-source).
https://github.com/csinva/imodels https://github.com/SelfExplainML/PiML-Toolbox
Under certain (reasonable) assumptions, the disagreement rate of the models in an ensemble (on unseen, unlabeled data) is a good estimator of the expected test accuracy.
On the joint interaction of models, data and features Y. Jiang et al. (2024) https://arxiv.org/abs/2306.04793 Assessing generalization via disagreement Y. Jiang et al. (2021) https://arxiv.org/abs/2106.13799
When you model a function with a deep learning model, one input gives one output: the model is always certain of its answers. Even if the output of the model is a probability distribution, it is rarely calibrated.
Surprisingly, it is possible to automatically calibrate those models -- any model. This is what conformal prediction does: it transforms any model into a model returning "confidence sets", guaranteed to contain the correct value with a prescribed probability (say, 90%).
One idea (inductive conformal prediction, split conformal prediction) is to split the data in two, define a "non-conformity score" with the first split, and use the second split to decide how to use it to form confidence intervals. There are many possible choices for the non-conformity score; they can also be defined without splitting, with cross-validation or the Jackknife.
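Here is a minimal sketch of split conformal prediction for regression, with absolute residuals as the non-conformity score (the model choice is arbitrary):

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

def split_conformal(X, y, X_new, alpha=0.1):
    X_tr, X_cal, y_tr, y_cal = train_test_split(X, y, test_size=0.3, random_state=0)
    model = GradientBoostingRegressor().fit(X_tr, y_tr)
    scores = np.abs(y_cal - model.predict(X_cal))                   # non-conformity scores
    level = min(1.0, np.ceil((1 - alpha) * (len(scores) + 1)) / len(scores))
    q = np.quantile(scores, level)                                  # finite-sample corrected quantile
    pred = model.predict(X_new)
    return pred - q, pred + q                                       # intervals with ~(1-alpha) coverage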
Another idea (transductive conformal prediction, full conformal prediction) is, given a new value of X, to refit the model with that new observation and all possible values of y.
Conformal prediction assumes exchangeability, i.e., that the order of the observations does not matter, which seems to exclude time series data, but there are a few work-arounds.
Distribution-free predictive uncertainty quantification: strength and limits of conformal prediction (ICML 2024 tutorial) https://icml.cc/virtual/2024/tutorial/35231 Introduction to conformal prediction with Python C. Molnar (2023)
For Python implementations, check mapie, crepes, nixtla.
https://mapie.readthedocs.io/en/stable/ https://crepes.readthedocs.io/en/latest/ https://nixtlaverse.nixtla.io/statsforecast/docs/tutorials/conformalprediction.html
Recent, large models seem to require a huge amount of data. But it may be possible to achieve the same performance, with less data, provided we choose the training data wisely.
This is done, to some extent, when cleaning the data: deduplication, boilerplate removal (for text), identification of generated content, etc.
Textbooks are all you need S. Gunasekar et al. (2023) https://arxiv.org/abs/2306.11644 The curse of recursion: training on generated data makes models forget O. Shumailov et al. (2023) https://arxiv.org/abs/2305.17493
But we can do more. For instance, during training, we can drop low-loss samples, or low-gradient samples, and train on the rest (you may need to adjust the gradients to reduce their bias).
InfoBatch: lossless training speed up by unbiased dynamic data pruning Z. Qin et al. (2023) https://arxiv.org/abs/2303.04947
You can also identify which samples are easy to learn and which are hard, by looking at the loss of a smaller proxy model, and then decide which ones to train on (if you want to drastically reduce the size of the training set, focus on the easy samples; if you can afford a larger training set, focus on the hard ones).
Towards a statistical theory of data selection under weak supervision G. Kolossov et al. https://arxiv.org/abs/2309.14563
For more details, and for applications to contrastive learning, check the following ICML tutorial.
Data-efficient machine learning (ICML 2024 Tutorial) https://icml.cc/virtual/2024/tutorial/35234
Data attribution looks at the opposite problem: once the model has been trained, which samples contributed the most?
Data attribution at scale (ICML 2024 tutorial) https://icml.cc/virtual/2024/tutorial/35228
The solution of the OLS problem is often written
ŷ = x' β
i.e., the forecast ŷ is a linear combination of the predictors x. But it can also be written
ŷ = x' β = [ x' (X' X )⁻¹ X' ] Y = w' Y
i.e., the forecast is a linear combination of the training set y's (the coefficients are data-specific: they depend on the new value x).
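A minimal numpy sketch of those weights (the function name is mine), checking that the usual OLS forecast and the weighted sum of the training targets agree:

import numpy as np

def ols_weights(X, x_new):
    # w = X (X'X)^{-1} x_new, so that the forecast at x_new is w' Y
    return X @ np.linalg.solve(X.T @ X, x_new)

X, Y = np.random.randn(100, 5), np.random.randn(100)
x_new = np.random.randn(5)
beta = np.linalg.lstsq(X, Y, rcond=None)[0]
assert np.isclose(x_new @ beta, ols_weights(X, x_new) @ Y)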
Those coefficients, known as "relevance" (of each training sample to the new observation) can be decomposed into a sum of 3 terms:
- The similarity (opposite of the Mahalanobis distance) between the new observation and each training sample;
- The informativeness of the new observation (distance to the average of the training data);
- The informativeness of each training sample.
Relevance-based forecasting uses those relevance weights, but only for the most relevant training samples.
Relevance M. Czasonis et al. (2022) https://globalmarkets.statestreet.com/research/service/public/v1/article/insights/pdf/v2/ecf31e52-64f3-49c7-84d7-19d6b70d5d0f/joim_relevance.pdf Relevance-based prediction: a transparent and adaptive alternative to machine learning M. Czasonis et al. (2023) https://globalmarkets.statestreet.com/research/service/public/v1/article/insights/pdf/v2/d0cdca66-3c68-4fd5-9608-f8953dc0ac78/jfds_relevance_based_prediction.pdf A transparent alternative to neural networks with an application to predicting volatility M. Czasonis et al. (2024) https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4961917 The virtue of transparency: how to maximize the utility of data without overfitting M. Czasonis et al. (2024) https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4903395 Portfolio construction when regimes are ambiguous M. Kritzman et al. (2023) https://globalmarkets.statestreet.com/research/service/public/v1/article/insights/pdf/v2/61df8160-4810-4cbe-8827-e6dc83a43dc8/jpm_portfolio_construction_when_regimes_are_ambiguous.pdf
We are familiar with supervised learning: we are given X (e.g., the features of a house: area, number of rooms, number of floors, presence of a pool, etc.) and y (the price of the house), for many examples, and we are asked to find a mapping from X to y.
In contrast, reinforcement learning focuses on decision making: we are given a "state" s (e.g., the location of the pieces on a chessboard) and we are asked to find the best "action" a (the best move). To help us make that decision, we are given occasional "rewards" r (in the chess example, we only receive the reward at the end of the game, +1 or -1, depending on who wins).
The data used by supervised learning and reinforcement learning is very different. In supervised learning, we are given a table with X and y columns; in reinforcement learning, we are given a table with 4 columns: current state s, action chosen a, reward r received immediately after that action, and next state s'.
The task is also very different. In supervised learning, we want to find a mapping from X to y. In reinforcement learning, there is also a mapping from state-action pairs, to reward and next state
s, a ↦ r, s'
but the goal is not to find that mapping, but to find the best action – a mapping ("policy") from state to action
π: s ↦ a
That task is fraught with difficulties.
- Contrary to supervised learning, we are never told what the best action is: the reward only tells us how good it is, not if it could have been better, or how to improve it -- the model will have to figure that out by itself.
- The rewards are delayed: we do not know immediately how good the action was: we may have to wait. Consequently, when we receive a reward, we do not know which action (or actions) contributed to it: some past actions could have been bad, some could have been good, but we do not know which ones. There could also be interactions between them: actions that would have been bad in isolation, but proved beneficial in combination with others.
- The state is constantly changing: an action is not good or bad in itself -- it could be good in one state and bad in another (e.g., in chess, losing a piece can be good if followed by other actions -- a gambit).
The multi-armed bandit (MAB) problem is a special case of reinforcement learning, in which there is no state: at each time step, we must choose an action ("arm") a∈⟦1,n⟧; we then receive a reward r ~ pₐ from some (unknown) distribution pₐ. We want a way of choosing the action (a "policy") to maximize the sum of the discounted rewards.
Here are a few policies.
ε-greedy keeps track of the average reward from each arm, picks the best one with probability 1-ε ("exploitation") and picks a random arm with probability ε ("exploration").
Thompson sampling (in its simplest form) assumes the rewards are Gaussian and keeps track of the (Bayesian) posterior distributions of the mean reward of each arm: to choose an arm, it samples a value from each of those posterior distributions, and picks the arm with the largest sampled value.
UCB keeps track of a confidence interval for the mean of each arm, and picks that with the largest upper confidence bound.
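A minimal sketch of ε-greedy and UCB (the function names are mine); "counts" and "sums" are the per-arm pull counts and cumulative rewards:

import numpy as np

def eps_greedy(counts, sums, eps=0.1, rng=np.random.default_rng()):
    means = sums / np.maximum(counts, 1)
    return rng.integers(len(counts)) if rng.random() < eps else int(np.argmax(means))

def ucb(counts, sums, t, c=2.0):
    means = sums / np.maximum(counts, 1)
    bonus = np.sqrt(c * np.log(t + 1) / np.maximum(counts, 1))   # wide for rarely-pulled arms
    return int(np.argmax(means + bonus))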
Contextual bandits add state (but, contrary to reinforcement learning, there are no state transitions). The previous algorithms are straightforward to generalize: fit a model to predict the rewards of each arm from the context (the predictors, the state), e.g., with a linear regression (or some other machine learning model) and:
Pick the arm with the best forecast with probability 1-ε, and a random arm with probability ε (ε-greedy);
Use a Bayesian model to forecast the rewards, sample from the posterior expected reward, and pick the arm with the largest sampled reward (LinTS, Thompson sampling);
Use frequentist models, compute a confidence interval on their forecasts, and pick the arm with the largest upper confidence bound (LinUCB).
A contextual bandit approach to personalized news article recommendation L. Li et al. (2010) https://arxiv.org/abs/1003.0146 Thompson sampling for contextual bandits with linear payoffs S. Agrawal and N. Goyal (2013) https://arxiv.org/abs/1209.3352 Value-difference based exploration: adaptive control between epsilon-greedy and softmax M. Tokic and G. Palm (2011) http://www.tokic.com/www/tokicm/publikationen/papers/KI2011.pdf Algorithms for the multi-armed bandit problem V. Kuleshov and D. Precup (2009) https://arxiv.org/abs/1402.6028
Traditional reinforcement learning maximizes the sum of discounted future rewards -- but we may want to maximize some other function of the rewards, e.g., the maximum, or some quantile (instead of the sum).
It is possible to adjust reinforcement learning algorithms to arbitrary objective functions, e.g., by formulating the objective as an expectation (as with CVaR -- risk-seeking RL) or, as a last resort, by using f(r(≤t))-f(r(<t)) as rewards -- but you may need to augment the state space to ensure the decision process is still Markov.
Tackling decision processes with non-cumulative objectives using reinforcement learning M. Nägele et al. https://arxiv.org/abs/2405.13609
Large language models are traditionally trained in 2 steps: they are first trained to complete text, and then fine-tuned to work as an "assistant", answering queries, as helpfully as possible. That second step, "reinforcement learning with human feedback" (RLHF), uses reinforcement learning (RL): the model receives a reward, higher if the answer is more helpful. It would be impractical to ask humans to give those rewards directly: instead, there is an intermediate step (that makes 3 steps), in which we learn the reward function, from a few (human-annotated) examples; that reward function is then used for RL.
Learning the reward function is easy, but RL is hard. It turns out that those two steps can be fused: they then become a simple optimization problem (DPO).
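For reference, a minimal sketch of the resulting DPO loss in PyTorch, given the (summed) log-probabilities of the preferred and rejected answers under the trained model and the frozen reference model:

import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # implicit reward = beta * log(pi/pi_ref); push up the margin between chosen and rejected
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(beta * margin).mean()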
Direct preference optimization: your language model is secretly a reward model R. Rafailov et al. (2023) https://arxiv.org/abs/2305.18290
Transformers were popularized for their applications to text, but they are not specifically targeted at sequence data -- on the contrary, they are permutation-invariant, i.e., the order of the tokens is ignored, unless it is provided separately, with a positional encoding.
Transformers can therefore also be applied to images, by using patches as tokens, and by replacing the 1-dimensional positional encoding with a 2-dimensional one (vision transformer, ViT). Surprisingly, the attention maps are not as informative as with text: tokens in low-information regions (e.g., the sky) appear important. It turns out that, since they were not useful to the model, the model decided to use them for internal computations. It is possible to recover the interpretability of the attention map by providing additional tokens ("registers") for the model to use for its internal computations.
Vision transformers need registers T. Darcet et al. https://arxiv.org/abs/2309.16588
Some applications of large language models require a modification of the sampling probabilities, for instance, for infilling q(B|A,C) ∝ p(ABC), or to add constraints q(A) ∝ p(A) c(A) (not on the next token, but on the next sequence of tokens). This can be done with GFlowNets, a diversity-seeking reinforcement learning algorithm.
Amortizing intractable inference in large language models E.J. Hu et al. (2024) https://openreview.net/forum?id=Ouj6p4ca60
The bag of tricks to train or use LLMs gets longer and longer. Here is a (far from exhaustive) random list.
To generate instruction-answer pairs for RLHF, use an LLM: start with the answers (good quality web pages) and ask an LLM to produce instructions that would lead to those answers.
Self-alignment with instruction back-translation X. Li et al. https://arxiv.org/abs/2308.06259
LLMs can be used to compress text: just ask them for a short prompt that will generate the desired text -- the result may not be human readable, but another model should be able to uncompress it.
Semantic compression with large language models H. Gilbert et al. (2024) https://arxiv.org/abs/2304.12512
One can also compress images in the same way.
If you do not know if you need RAG (retrieval-augmented generation) for a given query, you can just ask the LLM if it needs it.
Self-RAG: learning to retrieve, generate, and critique through self-reflection A. Asai et al. (2023) https://arxiv.org/abs/2310.11511
Instead of storing information in a vector store and querying it with RAG, one could store data in smaller LLMs -- ask each of those small LLMs to generate a document for the desired query, and proceed as if their answers came from a vector store.
Knowledge card: filling LLM's knowledge gaps with plugin specialized language models S. Feng et al. (2024) https://arxiv.org/abs/2305.09955
It is possible to recover most of a text from its embedding, by learning an error-correcting model and applying it a few times.
Text embeddings reveal (almost) as much as text J.X. Morris et al. (2023) https://arxiv.org/abs/2310.06816
The LLMs we currently use are "foundational models" for text: they can be repurposed to any text-related task. There are efforts to build foundational models for other modalities: in particular for images, for tabular data, or for time series.
The current candidates for a "time series foundational model" (the examples below are not an exhaustive list) are still unconvincing: simpler approaches (ARIMA, N-BEATS, etc.) still perform better.
MOMENT: A family of open time-series foundation models M. Goswami et al. https://arxiv.org/abs/2402.03885 Lag-Llama: towards foundation models for probabilistic time series forecasting K. Rasul et al. https://arxiv.org/abs/2310.08278 A decoder-only foundation model for time series forecasting A. Das et al. https://arxiv.org/abs/2310.10688
Instead, one could try to re-purpose LLMs, from text to time series: freeze the transformer blocks (the "intelligence") of the model, and only retrain the encoder (patch embedding) and decoder, for time series tasks.
One Fits All: power general time series analysis by pretrained LM T. Zhou et al. https://arxiv.org/abs/2302.11939
Transformers can be used with time series, perhaps with a few modifications, such as using patches
A time series is worth 64 words: long-term forecasting with transformers Y. Nie et al. (2022) https://arxiv.org/abs/2211.14730
or replacing the products with convolution products,
Autoformer: decomposition transformers with autocorrelation for long-term series forecasting H. Wu et al. (2021) https://arxiv.org/abs/2106.13008
or operating in the frequency domain.
FEDFormer: frequency enhanced decomposed transformer for long-term series forecasting T. Zhou et al. (2022) https://arxiv.org/abs/2201.12740
Considering that foundational LLMs are a general form of intelligence, we can re-purpose them for time series: remove the first and last layers, which are text-specific, and replace them with time-series ones, and only train those, keeping the bulk of the network frozen. For the input, we can add both the numeric values of the time series (patches) and a textual description, if available.
Time-LLM: time series forecasting by reprogramming large language models M. Jin et al. https://arxiv.org/abs/2310.01728
Recurrent neural networks (RNN) are a very natural way of processing sequence data, but they have a short memory -- generalizations such as GRU or LSTM improve the situation, but not enough: they have been superseded by the transformer.
Unfortunately, the complexity of the transformer is quadratic in the context size. State space models may be able to revive RNNs and make them competitive with transformers.
A (continuous) state space model is of the form
ẋ = Ax + Bu
y = Cx + Du
where x is latent (hidden). It can be discretized into
xₖ = Ā xₖ₋₁ + B̄ uₖ
yₖ = C̄ xₖ
and solved with a convolution
y = K̄ * u.
Small technical details (FFT, diagonal-plus-low-rank parametrizations, gating, etc.) make it efficient and competitive with transformers. There is now a large number of such models: S4, Mamba, Hyena, etc.
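A minimal numpy sketch of that convolutional view, assuming the discretized matrices Ā, B̄, C̄ are given: the kernel stacks C̄ Āᵏ B̄ and the output is its convolution with the input (the function name is mine).

import numpy as np

def ssm_kernel(A_bar, B_bar, C_bar, length):
    # K̄ = [C̄ B̄, C̄ Ā B̄, C̄ Ā² B̄, ...], so that y = K̄ * u (discrete convolution)
    K, Ak = [], np.eye(A_bar.shape[0])
    for _ in range(length):
        K.append((C_bar @ Ak @ B_bar).item())
        Ak = A_bar @ Ak
    return np.array(K)

# y = np.convolve(ssm_kernel(A_bar, B_bar, C_bar, len(u)), u)[:len(u)]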
Mamba: linear time sequence modeling with selective state spaces A. Gu and T. Dao (2023) https://arxiv.org/abs/2312.00752
Topological data analysis (TDA) looks for "holes" in data: for instance, a circle is a 1-dimensional hole. This can be used to detect periodic patterns in time series.
Topological time series analysis J.A. Perea (2018) https://arxiv.org/abs/1812.05143 Sliding windows and persistence: an application of topological methods to signal analysis J.A. Perea and J. Harer (2013) https://arxiv.org/abs/1307.6188
To detect a "changepoint" in a time series, look at the performance of a classification model predicting if t>t₀ from xₜ.
Random forests for change point detection M. Londschien et al. (2023) https://arxiv.org/abs/2205.04997
If there are potentially several change-points, use dynamic programming.
A kernel multiple change-point algorithm via model selection S. Arlot et al. (2012) https://arxiv.org/abs/1202.3878
When trying to minimize a function ℓ with gradient-based algorithms, its symmetries can help: if ℓ(g·x) = ℓ(x) for all x, we can replace the current solution candidate x with g·x; and we can choose g to make the gradient steeper -- this speeds up convergence, and may also improve generalization.
Improving convergence and generalization using parameter symmetries B. Zhao et al. (2023) https://arxiv.org/abs/2305.13404 Symmetry teleportation for accelerated optimization B. Zhao et al. (2022) https://arxiv.org/abs/2205.10637 Symmetries, flat minima, and the conserved quantities of gradient flow B. Zhao et al. (2022) https://arxiv.org/abs/2210.17216
To minimize a function f, we can consider a ball, moving on the graph of f, subject to both friction γ and acceleration: ẍ+γẋ+∇f(x)=0.
Stochastic heavy ball S. Gadat et al. (2016) https://arxiv.org/abs/1609.04228
We are familiar with "disciplined convex programming": a set of rules ensuring that an optimization problem is convex and amenable to convex solvers. This is what you are doing if you are using cvxpy.
This can be generalized to the search for saddle points.
Disciplined saddle programming P. Schiele et al. (2023) https://openreview.net/forum?id=KhMLfEIoUm https://github.com/cvxgrp/dsp
Alpha mining, searching for simple models with good predictive powers, is often done with symbolic regression and genetic algorithms. Instead, one can use reinforcement learning to explore the space of formulas, maximizing, not the expected cumulative reward, but some quantile of it ("risk-seeking" reinforcement learning).
RiskMiner: discovering formulaic alphas via risk-seeking Monte Carlo tree search T. Ren et al. https://arxiv.org/abs/2402.07080
The stylized facts of financial time series are usually listed as fat tails, volatility clustering and, sometimes, the leverage effect. There are actually more.
Modeling financial time series with generative adversarial networks S. Takahashi et al. (2019)
Those stylized facts could be used to generate synthetic data, but the results are still unconvincing: synthetic financial data does not look or behave like real data.
Synthetic data applications in finance V.K. Potluru et al. (2024) https://arxiv.org/abs/2401.00081 Neural factors: a novel factor learning approach to generative modeling of equities A. Gopal (2024) https://arxiv.org/abs/2408.01499
In the long list of performance and risk measures, the entropic value at risk is becoming more popular.
Properties of the entropic risk measure EVaR in relation to selected distributions Y. Mishura et al. (2024) https://arxiv.org/abs/2403.01468
Risk may depend on the time scale (investing in stocks is risky in the short term, but less so in the long term): it can be decomposed into contributions of different time scales.
A new measure of risk using Fourier analysis M. Grabinski and G. Klinkova (2024) https://arxiv.org/abs/2408.10279
To build an investment strategy, one could forecast future stock returns and buy stocks with good forecasts. But not all those forecasts will be good: you may want to estimate how reliable those forecasts are, and prefer stocks with good and reliable forecasts.
This can also be done at the portfolio level: look for a portfolio whose returns are (positive and) easy to forecast.
Canonical portfolios: optimal asset and signal combination N. Firoozye et al. (2022) https://arxiv.org/abs/2202.10817
Graph transformers are not really graph neural networks: they are transformers whose positional encoding is given by the (eigenfunctions of the) Laplacian of the graph -- once that encoding is computed, the graph structure is discarded. But there are other options: you can use another graph kernel to compute the positional encoding; you can use a masked transformer (each node only attends to its k-hop neighbours); you can alternate transformer layers and message passing layers.
GraphiT: encoding graph structure in transformers G. Mialon et al. (2021) https://arxiv.org/abs/2106.05667 Recipe for a general, powerful, scalable graph transformer L. Rampášek et al. https://arxiv.org/abs/2205.12454
To make your graph neural net robust, divide the graph (the set of edges and/or features) into (non-overlapping) subgraphs, apply the model on each, and take the majority vote.
GNNCert: deterministic certification of graph neural networks against adversarial perturbations Z. Xia et al. (2024) https://openreview.net/forum?id=IGzaH538fz
For more on graph neural nets, and spectral methods (the graph Laplacian and its eigenvalues), check this recent ICML tutorial.
Graph learning (ICML 2024 Tutorial) https://icml.cc/virtual/2024/tutorial/35233
Since neural networks are graphs, they can be processed by GNNs; applications include implicit neural representation (INR) classification; predicting the generalization performance of a network; improving the weights of a neural net (learning to optimize).
Graph neural networks for learning equivariant representations of neural networks M. Kofinas et al. https://arxiv.org/abs/2403.12143
Topological deep learning generalizes GNNs to higher-dimensional objects (simplicial complexes, hypergraphs, etc.).
Architectures of topological deep learning: a survey on topological neural networks M. Papillon et al. https://arxiv.org/abs/2304.10031
Diffusion models take samples from a distribution p₀ (the data, e.g., cat images), and add noise to them, until we end up with samples from another distribution, p₁ (Gaussian). They then try to learn the reverse process: progressively denoising samples from p₁ until they look like samples from p₀.
They can do it using a stochastic process or a deterministic one.
Elucidating the design space of diffusion-based generative models T. Karras et al. https://arxiv.org/abs/2206.00364 Flow matching for generative modeling Y. Lipman et al. https://arxiv.org/abs/2210.02747
There are many choices available for those stochastic or deterministic processes.
Stochastic interpolants: a unifying framework for flows and diffusions M.S. Albergo et al. https://arxiv.org/abs/2303.08797
Adding more dimensions can help.
Poisson flow generative models Y. Xu et al. https://arxiv.org/abs/2209.11178 PFGM++: unlocking the potential of physics-inspired generative models Y. Xu et al. https://arxiv.org/abs/2302.04265
Diffusion models learn a denoising function (xₜ₊₁,t+1) ↦ xₜ; consistency models directly learn (xₜ,t) ↦ x₀.
Improved techniques for training consistency models Y. Song and P. Dhariwal (2024) https://arxiv.org/abs/2310.14189
Synthetic tabular data generation is still unconvincing.
Mixed-type tabular data synthesis with score-based diffusion in latent space H. Zhang et al. https://arxiv.org/abs/2310.09656
Research is still mostly focused on what to do with a single table -- but in real applications, we have several tables, sometimes even thousands of them...
Retrieve, merge, predict: augmenting tables with data lakes (experiment, analysis and benchmark paper) R. Cappuzzo et al. https://arxiv.org/abs/2402.06282
Optimal transport looks for a deterministic way of transforming one distribution p₀ into another, p₁ (often, empirical distributions). The Schrödinger problem looks for stochastic processes instead.
A survey of the Schrödinger problem and some of its connections with optimal transport C. Léonard (2014) https://arxiv.org/abs/1308.0215
The optimal transport distance can be computed on mini-batches, but the result is biased.
Minibatch optimal transport distances; analysis and applications K. Fatras et al. https://arxiv.org/abs/2101.01792
Neural networks, when applied to time series or images, can be seen as working on the discretization of 1- or 2-dimensional functions.
We can replace the linear transformations used in neural nets with linear integral transforms: this allows the input to be provided on an arbitrary grid, and the output to be generated on an arbitrary grid.
Neural operator learning (ICML 2024 tutorial) https://arxiv.org/abs/2108.08481
ODEs are often used to model physical phenomena, but they may leave out some details: if the model is ẍ=f(x;θ), where f is known and θ is not (and has to be estimated from observational data), reality may be closer to ẍ=f(x;θ)+ε(x), where ε is a (small but) unknown function. Universal differential equations use neural nets to model ε.
Universal differential equations for scientific machine learning C. Rackauckas et al. (2021) https://arxiv.org/abs/2001.04385
Quantum computers do linear algebra in exponentially large vector spaces. This is what TDA needs (computing the rank of an exponentially large matrix): when quantum computers are available, they will provide more efficient TDA computations.
Topological data analysis on noisy quantum computers I.Y. Akhalwaya et al. https://arxiv.org/abs/2209.09371
While we wait for quantum computers, check giotto-tda, scikit-tda, gudhi, dionysus
giotto-tda: a topological data analysis toolkit for machine learning and data exploration G. Tauzin et al. https://giotto-ai.github.io/gtda-docs/0.5.1/library.html https://docs.scikit-tda.org/en/latest/ https://gudhi.inria.fr/python/latest/ https://github.com/mrzv/dionysus
In federated learning, the training data is not in a centralized location but is kept on "nodes" of a network, each node having access to only a subset of the data. A model can be trained with the "token" algorithm: each node updates the model parameters using its local gradient and sends the new values to one of its neighbours, at random. Instead of a random walk, we can use a self-repellent random walk: increase the probability of sending the token to nodes that have been visited less often.
Accelerating distributed stochastic optimization via self-repellent random walks J. Hu et al. https://arxiv.org/abs/2401.09665 Self-repellent random walks on general graphs: achieving minimal sampling variance via nonlinear Markov chains V. Doshi et al. (2023) https://arxiv.org/abs/2305.05097
Low-rank adaptation (LoRA) is a way of fine-tuning large language models with a small number of parameters: use W+BA' as weights, where W are the pre-trained weights (frozen) and A, B are smaller, trainable, rectangular matrices. One can also try other ways of modifying the weights matrix W, e.g., W⊙(BA').
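A minimal PyTorch sketch of a LoRA-wrapped linear layer (the class is mine): the pretrained weight W is frozen and only the low-rank factors A, B are trained.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                        # frozen pretrained W (and bias)
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))   # zero init: starts exactly at W
        self.scale = alpha / r
    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# usage: layer = LoRALinear(pretrained_linear_layer, r=8)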
Batched low-rank adaptation of foundation models Y. Wen and S. Chaudhuri (2024) https://arxiv.org/abs/2312.05677
There are many other variants or extensions of LoRA.
P-hacking is the search for small changes in an experiment (e.g., changes to hyperparameters) to make its results seem significant. But it can be sneakier: robust p-hacking looks for hyperparameters whose neighbourhood leads to significant results -- we could even ask for robustness to p-hacking detection tests.
Forking paths in financial economics G. Coqueret (2023) https://arxiv.org/abs/2401.08606
Decision trees, if they are too deep, will overfit the data. It is possible to have deep trees that overfit less, by introducing some shrinkage. Each node of a decision tree can output a forecast, which gets more precise but noisier (lower bias but higher variance) as we move deeper into the tree. The final forecast (from the leaf) can be written as a telescopic sum, starting at the root, each term adding a correction to move one step down the branch.
yₙ = y₀ + ∑(yₖ-yₖ₋₁)
we can shrink those terms, e.g., with exponentially decreasing weights.
yₙ = y₀ + ∑γᵏ(yₖ-yₖ₋₁)
Hierarchical shrinkage: improving the accuracy and interpretability of tree-based methods A. Agarwal et al. (2022) https://arxiv.org/abs/2202.00858 https://github.com/csinva/imodels
If you want a (big) machine learning textbook from a Bayesian viewpoint, I recommend:
Probabilistic machine learning K.P. Murphy (2022) https://probml.github.io/pml-book/book1.html
(I have not read volume 2 yet...)
I have already mentioned some of the ICML tutorials:
Distribution-free predictive uncertainty quantification: strength and limits of conformal prediction https://icml.cc/virtual/2024/tutorial/35231 Data-efficient machine learning https://icml.cc/virtual/2024/tutorial/35234 Data attribution at scale https://icml.cc/virtual/2024/tutorial/35228
The following may also be of interest.
Strategic machine learning is a generalization of adversarial machine learning: the users know the model (or parts of it) and change their features accordingly. (It is not always adversarial: for a recommendation system, the user will try to tweak the model to have better recommendations).
Strategic ML: learning with data that behaves https://icml.cc/virtual/2024/tutorial/35230
Transformer-based models often include mixture-of-experts layers: the model can take several paths ("experts"), and decides which ones to take (with a "gating" mechanism).
Mixtures of experts in the era of LLMs https://slideslive.com/39021919/mixtureofexperts-in-the-era-of-llms-a-new-odyssey?ref=folder-147931
In convex analysis, we would like to claim "every convex function has a minimum" -- but that is not true: convex functions can be unbounded. It is however possible to extend ℝⁿ in a way useful for convex analysis. (I am not convinced of the practical applications, but the tutorial was clear and interesting.)
Convex analysis at infinity: introduction to astral space https://icml.cc/virtual/2024/tutorial/35232 https://arxiv.org/abs/2205.03260