Last week, I attended the useR! 2006 conference: here is a (long) summary of some of the talks that took place in Vienna -- since there were up to six simultaneous talks, I could not attend all of them...

In this note:

0. General remarks
1. Tutorial: Bayesian statistics and marketing
2. Tutorial: Rmetrics
3. Recurring topics
4. Other topics
5. Conclusion

For more information about the contents of the conference and about the previous ones (the useR! conference was initially linked to the Directions in Statistical Computing (DSC) ones):

http://www.r-project.org/useR-2006/
http://www.ci.tuwien.ac.at/Conferences/useR-2004/
http://www.stat.auckland.ac.nz/dsc-2007/
http://depts.washington.edu/dsc2005/
http://www.ci.tuwien.ac.at/Conferences/DSC-2003/
http://www.ci.tuwien.ac.at/Conferences/DSC-2001/
http://www.ci.tuwien.ac.at/Conferences/DSC-1999/

I attended two tutorials prior to the conference: one on bayesian statistics (in marketing) and one on Rmetrics.

There were 400 participants, 160 presentations.

Among the people present, approximately 50% were using Windows (perhaps less: it is very difficult to distinguish between Windows and Linux), 30% MacOS, 20% Linux (mostly Gnome, to my great surprise).

The goal of statistical inference is to make probability statements from various information sources: the data, but also prior sources, for instance, "this parameter should be positive" or "this parameter should have reasonable values". The difference between marketing and econometrics is that those statistical statements lead to actions -- see J. Berger's book, Statistical Decision Theory and Bayesian Analysis.

Bayesian methods produce the whole posterior distribution of the parameters, from which you can extract any information -- not simply "the most likely value" of this parameter, as with maximum likelihood (ML) estimators. This is akin to the sampling distribution, i.e., the distribution of the estimated (say, ML) parameters if we had run the experiment millions of times.

Bayes's theorem simply says that the posterior probability (or probability density function) is (proportional to) the product of the prior and the likelihood.
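As an illustration, the binomial case can be computed by brute force on a grid -- a minimal base-R sketch (the data and the beta prior are made up):

```r
# Posterior proportional to prior * likelihood, on a grid of parameter values.
theta <- seq(0, 1, by = 0.001)
prior <- dbeta(theta, 2, 2)                   # "reasonable values" prior
lik   <- dbinom(7, size = 10, prob = theta)   # say, 7 successes out of 10 trials
posterior <- prior * lik
posterior <- posterior / (sum(posterior) * 0.001)  # normalize to a density
plot(theta, posterior, type = "l", ylab = "posterior density")
```

The whole posterior is available: point estimates, intervals, or any functional of theta can be read off it.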

Bayesian statistics can be applied to any kind of model, e.g., binomial, regression, multiple regressions, probit, logit, hierarchical, etc.

The beta distribution is a good candidate for parameters in [0,1]: it can be symmetric or skewed, with a large or narrow peak, or even U-shaped.

The inverted chi squared distribution is often used as a prior for variances, because it is amenable to computations: it is said to be conjugate -- but in the computer era, there is no reason to limit ourselves to conjugate priors.

In the prior distribution, one often has to choose parameters: these are called hyper-parameters.

Bayesian estimators are often said to be shrinkage estimators: they are "between" the prior and the maximum likelihood estimators (MLE).

The larger the sample size, the smaller the influence of the prior.

Sampling from multivariate distributions can be tricky: it is often easier to sample from univariate distributions, i.e., to sample one dimension at a time. The idea of Gibbs sampling is to replace sampling from (x1, x2) by sampling from x1|x2 and then x2|x1. This bears a resemblance to the EM algorithm -- it might even be exactly the same: for instance, both can be used to fill in missing data.

The following plot shows a Gibbs sampler sampling from a bivariate gaussian distribution -- if we draw the intermediary steps, only one coordinate changes at a time and the path looks like a staircase.
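Such a sampler is a few lines of base R -- a sketch, using the closed-form conditionals of the standard bivariate gaussian:

```r
# Gibbs sampler for a standard bivariate gaussian with correlation rho:
# x1 | x2 ~ N(rho * x2, 1 - rho^2), and symmetrically for x2 | x1.
rho <- 0.9
n <- 10000
x <- matrix(NA, n, 2)
x[1, ] <- c(0, 0)
for (i in 2:n) {
  x[i, 1] <- rnorm(1, rho * x[i - 1, 2], sqrt(1 - rho^2))
  x[i, 2] <- rnorm(1, rho * x[i, 1],     sqrt(1 - rho^2))
}
plot(x, pch = ".")
lines(x[1:20, ], type = "s")   # the staircase path of the first few steps
```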

When using Gibbs sampling, be sure to check the following:

- the autocorrelation and the cross-correlation

- the time series of the sampled parameters: are they stationary?

- plot the bayesian estimators versus the ML ones; plot the bayesian estimators versus the bayesian estimators with a different prior

- run several chains (the sampled time series form Markov chains, and the method is called Markov chain Monte Carlo (MCMC)) and check whether they mix (i.e., whether they look interchangeable).
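A minimal sketch of those checks, on a stand-in autocorrelated chain (base R only -- in practice you would apply this to the actual sampled parameters):

```r
# Toy diagnostics on sampled parameters: trace plot, autocorrelation,
# and a second chain to compare mixing.
set.seed(1)
chain1 <- as.numeric(arima.sim(list(ar = 0.8), n = 5000))
chain2 <- as.numeric(arima.sim(list(ar = 0.8), n = 5000))
op <- par(mfrow = c(2, 1))
ts.plot(chain1)   # does the trace look stationary?
acf(chain1)       # how correlated are successive draws? (how much thinning?)
par(op)
qqplot(chain1, chain2)   # do the two chains look interchangeable?
```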

Having the full distribution of the parameters allows you to extract a lot of information: e.g., you can investigate the distribution of X1*X2 or X1/X2 (with standard gaussian distributions, the latter does not even have moments).

In the bivariate gaussian example, since X1 and X2 are correlated, successive values of (X1, X2) will be correlated: the samples do not contain as much information as an independent sample of the same size.

With the probit model,

Z = X beta + epsilon
Y = ifelse( Z > 0, 1, 0 )
epsilon ~ N(0,1)

(this can be seen as a censored (or truncated) model: Z is replaced by the intervals (-infinity,0] or (0,+infinity) -- there is less information in censored data than in complete data), bayesian methods provide more information: we also have the distribution of the latent variable Z -- more generally, bayesian methods provide an estimation of the distribution of the parameters, latent variables and missing values.
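As an illustration, here is a sketch of the data-augmentation Gibbs sampler for this probit model -- with a flat prior on beta for simplicity, so not exactly the algorithm of the tutorial, and on made-up data:

```r
# Data-augmentation Gibbs sampler for a probit model (flat prior on beta).
set.seed(1)
n <- 200
X <- cbind(1, rnorm(n))
beta.true <- c(-0.5, 1)
y <- ifelse(X %*% beta.true + rnorm(n) > 0, 1, 0)

XtXinv <- solve(t(X) %*% X)
U <- chol(XtXinv)                 # XtXinv = t(U) %*% U
beta <- c(0, 0)
R <- 2000
draws <- matrix(NA, R, 2)
for (r in 1:R) {
  m <- X %*% beta
  # Latent z ~ N(m, 1), truncated to (0,Inf) if y = 1, to (-Inf,0] if y = 0,
  # sampled by inverting the truncated CDF.
  u <- runif(n)
  z <- ifelse(y == 1,
              m + qnorm(pnorm(-m) + u * (1 - pnorm(-m))),
              m + qnorm(u * pnorm(-m)))
  # beta | z ~ N( (X'X)^-1 X'z, (X'X)^-1 )
  beta <- XtXinv %*% t(X) %*% z + t(U) %*% rnorm(2)
  draws[r, ] <- beta
}
colMeans(draws[-(1:500), ])   # posterior means, after discarding the burn-in
```

The draws of z are precisely the distribution of the latent variable mentioned above.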

Mixtures of gaussians can be tackled in the same way: the latent variable is the number of the cluster. But there, the situation is much worse: a permutation of the numbering of the components does not change anything -- as a result, if there are n components, the likelihood has n! modes... Actually, it is not a problem: the Markov chain will switch the labels from time to time, but this will have no consequence.

The multinomial probit model (the variable to predict is not binary but can take n values) is almost intractable with classical methods but is amenable to bayesian methods. Some care is needed, though: see the book.

The same goes for the multivariate probit model (the variable to predict is a subset of those n values).

The Metropolis algorithm (an alternative to the Gibbs sampler, which lets you easily sample from a multivariate distribution, and which is preferable when the variables are too dependent) was not tackled -- see the book.

Finally, the interesting part: panel data and hierarchical models. Since this is getting more intricate, I prefer to refer you to the book.

The presenter developed the bayesm package, which provides, among others, the following functions:

- runireg samples from the posterior distribution of the parameters of a regression y ~ x with an inverted chi squared prior on the variance sigma^2 and a gaussian prior on beta|sigma

- rmultireg samples from the posterior distribution of the parameters of a family of regressions Y ~ x with an inverted Wishart prior on the variance Sigma and a gaussian prior on beta|Sigma

- rbiNormGibbs samples from a bivariate gaussian, using a Gibbs sampler

- numEff computes the effective sample size of a time series, i.e., the size of the series of independent variables that would contain the same information. It tells you how much thinning you should use.

- runiregGibbs: samples from the posterior distribution of the parameters of a regression y ~ x with an inverted chi squared prior on the variance sigma^2 and a gaussian prior on beta (and not beta|sigma)

- rbprobitGibbs: samples from the posterior distribution of the parameters of a binary probit regression y ~ x with a gaussian prior on beta (and sigma=1)

- rnmixGibbs: samples from the posterior distribution of the parameters of a mixture of gaussians with a Dirichlet prior on the probabilities of the components, a gaussian prior on their means, an inverted Wishart prior on their variances.

- rmnpGibbs: samples from the posterior distribution of the parameters of a multinomial probit regression (beware of the results: the raw draws are only identified up to scale -- consider beta / sqrt(sigma_{1,1}) and Sigma / sigma_{1,1} instead)

- createX: ancillary function to create the design matrix given to rmnpGibbs.

- rmnlIndepMetrop: multinomial logit

- rhierBinLogit, etc.: there are also a lot of hierarchical models: this is a slippery slope, do not use them unless you really know what you are doing.
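A hedged usage sketch of the simplest of those functions (the argument and component names -- Data, Mcmc, betadraw -- are from my reading of the bayesm documentation; check ?runireg for your version):

```r
# Sample from the posterior of a univariate regression with bayesm,
# on made-up data.
library(bayesm)
set.seed(1)
X <- cbind(1, runif(100))
y <- as.vector(X %*% c(1, 2) + rnorm(100))
out <- runireg(Data = list(y = y, X = X),
               Mcmc = list(R = 2000))
# Posterior quantiles of the coefficients:
apply(out$betadraw, 2, quantile, probs = c(0.025, 0.5, 0.975))
# Effective sample size of the first coefficient's chain:
numEff(out$betadraw[, 1])
```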

Mixed models are only partially bayesian methods: you have to provide a prior, but you do not look at the whole posterior distribution, which might be misleading if it is not gaussian. Given the power of current computers, there is no need for such a restriction: we can afford a fully bayesian method.

There are more general bayesian packages (Bugs, JAGS), with which you can simulate any kind of model, but that generality comes at a price: on special cases, the computations are not as fast as they could be -- by several orders of magnitude...

Rmetrics is a set of R packages for quantitative finance and econophysics, initially developed for educational purposes. This tutorial reviewed those packages one at a time.

The fBasics package helps study the stylized facts of financial time series.

The fCalendar package is devoted to time manipulations. The notion of time zone (TZ) is replaced by that of financial center, which encompasses daylight saving time (DST) rules and holidays (to perform operations such as "next business day"). This is still imperfect: different markets in the same financial center have different holidays (e.g., Chicago/Equities and Chicago/Bonds).

The timeSeries package defines the timeSeries class: those objects contain one or several time series, having the same set of timestamps.

The fSeries package provides the garchFit function, to fit GARCH models and their variants. GARCH models are a bit of a problem for statistical software: for a long time, there was no benchmark against which to assess an implementation, yielding very disparate results across systems (Ox is not that good, but the others, including SPlus and SAS, are worse).

The fExtremes package is devoted to extreme values. It contains a set of functions to (visually) study distribution tails (emdPlot, lilPlot, mePlot, msratioPlot, qqPlot, recordsPlot, sllnPlot, etc.)

To compute the Value at Risk (VaR) or the Expected Shortfall (ES) one can try to fit a distribution to the tail of the data, chosen from the family of limit distribution of tails of distributions: the Generalized Pareto Distribution (GPD).

gpdriskmeasures(gpdFit( x, threshold = .95, method = c("pwm", "mle", "obre") ))

("obre" stands for "optimally biased robust estimator")

Extreme Value Theory (EVT) also studies the distribution of the maximum of iid random variables: there is a limit theorem, similar to the central limit theorem (with "max" instead of "mean"), that identifies the limit distribution as one of the GEV (Generalized Extreme Value) distributions (of which the Gumbel, Fréchet and Weibull are special cases).

The fCopulae package is devoted to copulas. The implementation is more reliable than that of SPlus (SPlus seems to use numeric differentiation, which is unstable in extreme cases; Rmetrics uses formal derivatives).

Copulas address the following fallacies.

- Fallacy 1: Marginal distributions and their correlation matrix uniquely determine the joint distribution.

- Fallacy 2: Var(X1 + X2) is maximal when Cor(X1,X2) is maximal.

- Fallacy 3: Cor(X1,X2) small implies that X1 and X2 are almost independent.

The fOptions package is the best-known part of Rmetrics: it prices all the (equity-based) options, using exact formulas (when available), binomial trees, Monte Carlo simulations (with antithetic variables, low-discrepancy sequences), or PDEs.
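For instance, a European call can be priced with the Black-Scholes formula -- the GBSOption arguments below are from memory of the fOptions documentation, so check ?GBSOption:

```r
# Black-Scholes price of an at-the-money European call.
library(fOptions)
GBSOption(TypeFlag = "c",   # call
          S = 100,          # spot price
          X = 100,          # strike
          Time = 1,         # maturity, in years
          r = 0.05,         # risk-free rate
          b = 0.05,         # cost of carry
          sigma = 0.2)      # volatility
```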

The fBonds package is devoted to bonds (but I am not familiar with bonds).

The fBrowser is an Rcmdr-based GUI that provides the above functionalities. You can extend it and add your own menus.

Conclusion: The coverage is impressive, and Rmetrics should be considered if we plan to use options or if/when we start to investigate Econophysics.

If you do not already regularly read them, there is a selection of Econophysics preprints, regularly updated:

http://www.unifr.ch/econophysics/

You may also want to have a look at the Rmetrics website

http://www.itp.phys.ethz.ch/econophysics/R/

The following topics were tackled in several talks.

Windows users are typically intimidated by the almost empty starting screen of R and wonder "where is the GUI?".

Novice users, who do not want to tamper with the command line, are probably better off with an Rcmdr-like interface: indeed several projects build on John Fox's Rcmdr (which provides basic statistics) to provide domain-specific functionalities with a menu-driven interface: fBrowser in Rmetrics for finance, GEAR for econometrics, etc.

Programmers also complain about the difficulties of debugging R code and the lack of a VisualStudio-like IDE (Integrated Development Environment). Note that SPlus is addressing this concern by providing an Eclipse-based workbench -- there is an Eclipse plug-in for R, http://www.walware.de/goto/statet but many features are still missing.

Let us also mention ESS, Texmacs and JGR.

Some people claim that the lack of interactive graphics is one of the major drawbacks of R: they would like to be able to have several plotting windows, presenting different views of the same data set, to be able to select points in one plot and see the corresponding points highlighted in the others (this is called brushing).

iPlots (built with rJava) provides those facilities, together with a (portable) GUI, but is still under development -- it seemed perfectly usable, though.

GGobi can already do all that, but it is a separate application (that can talk to R) and it does not provide user-defined plots.

The rgl package leverages OpenGL (one of the technologies used by the graphics card found in most computers and needed to play most video games -- a large, untapped source of computational power) to produce 3-dimensional plots, that can be interactively rotated; but their elements cannot be selected, brushed, etc.

Of course, one can still use Tk widgets (and the tkrplot package) to produce plots that are automatically updated when the user moves a slider.

http://bioinf.wehi.edu.au/folders/james/index.html

Though the R documentation is often better than that of other software, it still has a few problems: the manual pages are terse reference manuals, often unsuitable for beginning users; the contributed manuals, which cater to users with very specific backgrounds, are not updated as promptly as R is.

To tackle this problem, some people suggested writing collaborative documentation, in the spirit of Wikipedia: after several months of discussions on the R-SIG-Wiki mailing list (about which wiki engine to use, how to have it understand R, etc.), Philippe Grosjean opened the R Wiki as the conference started, with already some contents (R tips, the R Tk tutorial, and the first chapter of my "Statistics with R"):

http://wiki.r-project.org/

Wikis are also used in several e-Learning projects: the teacher sets up the structure of the site and the students fill it in, with the notes they have taken, the statistical analyses they have carried out, the problems they have had, how they have solved them, etc.

Some companies are using a corporate wiki for their intranet and part of their web site -- in particular large companies, when they want to show that their employees are human, that their projects are steadily progressing. (If you want another buzzword for that, you can use the more general term Web 2.0.)

Incidentally, we already have a "collaborative writing" software: Lotus Notes -- contrary to what many people think, it is not a mail user agent, but a "groupware framework", with which you can build collaborative applications, the mail client being only an example of how great (cough) that framework is.

Frank Harrell suggests using Wikis for knowledge management -- his web site is a wiki, and has been so for years. He also encourages us to use Sweave for document management.

For more about Wikis, CMS (Content Management Systems), and web sites in general, check, for instance GNU/Linux Pratique HS 5 (in French -- there is nothing similar here, in the UK...).

Several people advocated the need of reproducible results, mainly with Sweave, and indeed, most of the presentations were made with Sweave and Beamer.

http://www.ci.tuwien.ac.at/~leisch/Sweave/ http://latex-beamer.sourceforge.net/

Some people even suggested cryptographically fingerprinting the datasets used (do not do that in the US, though: someone managed to patent it -- you can fingerprint files, but not files containing data).

I did not attend all those presentations.

Some explained that R could be used as a component in a larger process, scheduled in an automated way: they usually resort to Rserve or rJava to access R as a web service or as a Java class.

Some of those systems exhibited pretty, impressive but utterly useless (Java) graphical front-ends.

Some explained how to exchange structured data between R and other systems (for simple data, such as a data.frame, simply use a CSV file or a database), using an XML schema (this is sometimes called a DTD) to store data.frames, lists, lists of lists, etc. They provide an R package (StatDataML) and a Java class (JStatDataML) to this end.

(There used to be an XML schema to store and exchange statistical models and data between statistical applications, called PMML, but it was not mentioned and I do not know if the project is still alive.)

Some explained how to extend XSLT to have it call R and perform statistical computations on the data being transformed.

Some explained how to embed R into a web server.

Of course, in this area, the most important thing is the number of acronyms and buzzwords you can fit in a single sentence: as an exercise, try to form a sentence using the words XML, XSLT, JAXB, PyXML, R/Apache, POI, JDBC, Jython, Struts, Hibernate, YAWL, Ruby on Rails.

When your computational needs grow, you will want to run computations in parallel, on several computers, or several processors on the same machine: these could be completely different processes, similar processes on different data, or a single computation that can be split up into several pieces.

A few packages can facilitate this parallelization: rpvm (uses PVM), Rmpi (uses MPI), snow (to transparently parallelize parallelizable code) or nws.
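A minimal snow sketch, assuming a socket cluster can be started on the local machine:

```r
# Farm out four independent simulations to four workers.
library(snow)
cl <- makeCluster(4, type = "SOCK")
res <- parLapply(cl, 1:4, function(i) mean(rnorm(1e6)))
stopCluster(cl)
unlist(res)   # one (near-zero) mean per worker
```

The same pattern works with rpvm or Rmpi clusters; only the makeCluster call changes.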

The problem of large datasets, that do not fit into memory, was not tackled -- the advice did not change, use a database to store the data and/or buy more memory (and use an operating system that can use it).

There are often several packages on the same subject, each providing similar but different, partly complementary, partly incompatible capabilities. In several areas, such as robust statistics (with the robustbase package) or econometrics (with the GEAR package, which will provide basic econometric functions and a GUI), people are starting to unify all this.

Also note the forthcoming book, Applied Econometrics with R, by C. Kleiber and A. Zeileis.

One of the challenges faced by R is the increasing amount of data to process and the timeliness of that processing: more and more, we will want real-time results or plots, that pop up as soon as the data arrive, that are updated as soon as the data is.

There is some progress in that direction (such as algorithms to compute a moving median, a moving quantile; or frameworks for enterprise processes that encompass R), but the path to a real stream-processing engine will be long.

(I only attended one of those talks, so I do not know if the following was mentioned: it is possible to write triggers in R for PostgreSQL and thus launch computations when the data arrive.)

As non-statisticians progressively want to harness the power of R, they will want to access it from the software they are familiar with, such as spreadsheets or databases.

It is already possible to access R from Gnumeric (a spreadsheet) and from PostgreSQL (a DBMS).

(There were also presentations and tutorials about R and Windows, but I did not attend them: they usually assume that you are already a proficient Windows programmer.)

There was a whole session on machine learning, with emphasis on Support Vector Machines (SVM).

Bayesian networks and neural networks were not forgotten, though: an R neural networks toolbox is being developed, similar to the Matlab one.

http://www.mathworks.com/access/helpdesk/help/pdf_doc/nnet/nnet.pdf

Bayesian methods rely on two ideas.

First, before doing an experiment or before looking at the data, we have some information: it can be, for instance, a "reasonable" range for the quantities to estimate. This information is called the "prior".

Second, instead of computing the single "best" value for the parameters of interest, we want the full "sampling distribution" of those parameters, i.e., the distribution of the "best" parameters that we would observe if we could repeat the experiment tens of thousands of times: this is the "posterior" distribution.

Those methods used to require lengthy simulations, but they are becoming more and more amenable to commodity PCs.

People have long been using Bugs (WinBugs, or the supposedly portable OpenBugs, or its open-source replacement JAGS) to sample from the posterior distribution. These programs can accommodate any kind of model but, because of that generality, the computations can take a lot of time.

For very specific models, the computations can be greatly sped up: this is what the bayesm and MCMCpack packages do -- MCMCpack also provides you with the building blocks needed to sample from other models.

Bayesian methods can also be used to compare models, as a replacement for p-values: check the BayesFactor and PostProbMod functions in the MCMCpack package.

When using bayesian methods, one should not forget to perform a few diagnostic tests or plots: this is what the coda package does.

A robust statistical method is one that is not sensitive to outlying data: even if part of the data is outrageously wrong, it has little impact on the results.

This is often measured by the breakdown point of an estimator (say, a regression): the largest proportion of observations you can tamper with without being able to make the estimator arbitrarily large.

The influence of an observation is the change its inclusion or deletion induces in the result.

One can sometimes spot outliers with the Pearson residuals:

Pearson residual = sample density / density according to the model - 1

(a presenter showed this with circular data, where outliers are not really "far"...).
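A sketch of such a residual plot, with a gaussian model (wrongly) fitted to skewed data, in base R:

```r
# Pearson residuals: sample density / model density - 1.
set.seed(1)
x <- rexp(1000)                       # skewed data
d <- density(x)                       # kernel estimate of the sample density
model <- dnorm(d$x, mean(x), sd(x))   # the (wrong) gaussian model
plot(d$x, d$y / model - 1, type = "l",
     xlab = "x", ylab = "Pearson residual")
abline(h = 0, lty = 2)   # large residuals point at potential outliers
```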

The robustbase package is an attempt to unify the elementary robust methods currently scattered across various packages: it will provide robust regression (lmrob, glmrob), replacements for the Median Absolute Deviation (MAD) (Qn, Sn), the MCD (Minimum Covariance Determinant) covariance matrix, etc.

One can "robustify" the (linear, gaussian) Kalman filter by replacing the matrix estimations it performs (mainly expected values and covariances) with robust equivalents (median and MCD covariance).

One can robustify Principal Component Analysis (PCA): this is then called projection pursuit. PCA finds the direction in which the "dispersion", as measured by the variance, is largest; robust PCA replaces the variance with a robust equivalent.

One can generalize this to other measures of dispersion, or measures of non-gaussianity (this is called Independent Component Analysis, ICA).

One problem with robust covariance matrices is that their robustness decreases with size (since projection pursuit uses 1-dimensional subspaces, it might not be much of an issue, but for other applications, it will).

There is a mailing list devoted to robust methods in R: R-SIG-Robust.

A shrinkage estimator is an estimator somewhere on the path between a prior estimation (very stable, reliable, but hardly informative) and a data-driven estimator (e.g., the maximum likelihood estimator (MLE): it contains all the information, but can be extremely noisy).

The "best" position on the path can be chosen by 10-fold cross-validation (CV).

There are variants of this idea:

- Principal component regression is a regression on the first k principal components (the path is discrete, indexed by the number of components retained).

- Forward variable selection (here again, the path is discrete and corresponds to an order on the set of variables)

- Ridge regression is a regression with a penalty on the amplitude of the coefficients (if some of the predictors are correlated, the corresponding coefficients can be extremely large, with opposite signs: the penalty tries to avoid this)

- Lasso regression: idem with an L^1 penalty

- Forward stagewise regression: instead of completely adding the variables, as in forward variable selection, just add a small part of them, say 0.1*X_i (the same variable may be added several times, to increase its coefficient)

Least angle regression (LARS) is very similar to forward stagewise regression:

- find the variable X_i the most correlated with the variable to predict Y

- add it in the model, with a small coefficient, and increase the coefficient until another variable, X_j, becomes more correlated with the residuals: then, progressively change the coefficients of the two variables, X_i and X_j, until a third variable...

Lasso and forward stagewise regressions are actually special cases of LARS.
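A hedged sketch with the lars package (which, as I recall, ships the diabetes example dataset and exposes the path types by name):

```r
# Fit the LARS path and choose a position on it by cross-validation.
library(lars)
data(diabetes)   # example dataset shipped with lars (x matrix, y vector)
fit <- lars(diabetes$x, diabetes$y, type = "lar")
plot(fit)        # the regularization path
# "lasso" and "forward.stagewise" are the other path types;
# cv.lars picks the position on the path by 10-fold CV:
cv.lars(diabetes$x, diabetes$y)
```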

The regularization path of LARS is more stable, less chaotic than that of the lasso.

The number of degrees of freedom of a LARS regression is the number of variables that have been included -- with, say, variable selection, it is much more!

There are further generalizations of LARS: elasticnet (a mixture of L^1 and L^2 penalties, which tends to select the variables in groups); glmpath (e.g., for logistic regression); pathseeker (take the top k variables instead of the best); Cosso (we know, a priori, that the variables are grouped); svmpath.

The first speaker tried to convince us that using R on Windows, installing packages from source, or even writing your own R packages on Windows was not difficult. He almost made his point: he only needed one slide to list the prerequisite software (without mentioning how to install it, and forgetting about the incompatibilities with other already installed software) and two more slides to explain how to install a package (targeted at advanced Windows users: he tells you to change environment variables without reminding us how) -- a stark contrast with similar explanations for a Unix platform where, if you do not understand, you simply copy and paste the instructions.

He also noted that using Windows instead of Linux "only" reduced the speed by 10% -- which is even more impressive if you consider that 64-bit R on Linux no longer runs slower than 32-bit R on Linux.

However, his talk was followed by a similar talk, that tried to do the same thing on MacOSX: the differences are amazing (the only instruction is "do not forget to install R"; R is well integrated with other MacOSX applications).

After those two talks, it really seems insane to use R on Windows (or anything else than R, for that matter -- most of the problems are not specific to R).

By the way, most of the developers of R ("R-core", not the people writing the add-on packages, but the people writing the core of the R system itself) are on MacOSX...

You might already know that there are four ways to use Object Oriented Programming (OOP) with R: S3, S4, R.oo and proto.

Similarly, there is now a third way of producing graphics: after the old graphics (with the "plot" function), the lattice graphics (with the "xyplot" and "grid.*" functions), there is now ggplot, that implements the ideas of the book "The Grammar of Graphics".

This Windows-only package fits ARMA models to time series, and can infer the order by itself.

Two talks presented methods to automate the fitting of time series (for instance, with an ARIMA model, you have to select the order(s) of the model).

http://www.r-project.org/useR-2006/Abstracts/Hyndman.pdf http://www.r-project.org/useR-2006/Abstracts/Unkuri.pdf

Some multivariate generalizations of the GARCH model were presented, such as Constant Conditional Correlation (CCC: same equation, with diagonal matrices); Dynamic Conditional Correlation (DCC: idem, but those matrices are allowed to change over time); Smooth Transition Conditional Correlation; Extended Conditional Correlation (ECC: the matrices are no longer diagonal, but in order to ensure that the variance matrix is positive definite, you have to add an infinite number of conditions -- people usually replace these conditions by a single one, but this is too restrictive); BEKK.

All the code presented was developed with a 2-dimensional (or low-dimensional) case in mind.

A particle filter is very similar to a Kalman filter, but it neither assumes that the underlying process is linear nor that the noise is gaussian.

The basic idea is that of an MCMC simulation. Instead of performing 10,000 (independent) simulations, one can try to mix them: at each step, the particles are simply resampled (it sounds trivial, but it is the only difference with an MCMC simulation).

They were applying that to FX data.

It looked interesting, it is related to the use of dynamic programming to build portfolios and rebalance them over time, but they only had five minutes and I did not understand anything.

They use those ideas to trade currencies.

For more information, google for Markov Decision Processes (MDP), temporal difference learning (TD-learning), Q-learning, reinforcement learning.

You are probably familiar with my rant against T-values: stating T-values assumes that your readers are very familiar with the T distribution and know how the number of degrees of freedom affects those values, and it does not extend to other tests -- you should rather use p-values, which simply assume that your readers understand the uniform distribution on [0,1]. Moreover, the significance level, usually 5%, has a completely different meaning depending on the size of the sample, and neither you nor your reader knows how to interpret it: you can state a difference of BIC instead of a p-value. That presentation (which I did not attend) gives more details about the meaning of the significance level depending on the sample size.

The riffle package provides yet another clustering algorithm.

This talk highlighted several areas where cluster analysis is not yet a mature subject: transaction data (i.e., clustering subsets, e.g., clustering shopping baskets) and time series.

http://www.cs.ucr.edu/~eamonn/

To find a consensus price for a commodity, across a multitude of local markets, in order to set up a futures market, they use an adaptively trimmed mean.

To check if a portfolio manager is a good portfolio manager, one can compare his performance with that of a "random" portfolio: simply generate permuted portfolios from the actual one.

The problem is that these permuted portfolios breach all the constraints the portfolio manager has to abide by. To enforce those constraints again, one can feed the permuted portfolios to an optimizer, with no alpha and no variance matrix -- just the constraints.

According to the speaker, the resulting portfolios are not as uniformly distributed as they should be...

http://www.burns-stat.com/

Sparklines are word-like plots, that can be used inside a sentence or in a table.

http://www.edwardtufte.com/bboard/q-and-a-fetch-msg?msg_id=0001OR&topic_id=1 http://www.stat.yale.edu/~jay/R/Vienna/Vienna.pdf

The dataset is that of the American Statistical Association (ASA) visualization contest.

http://www.amstat-online.org/sections/graphics/dataexpo/2006.php

Histograms fail to spot ties in the data: a barcode plot is similar to a rug plot, but the other dimension is used to indicate ties. It can be seen as an alternative to the boxplot.
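A crude barcode-style plot can be mimicked in base graphics -- a sketch, where tick height encodes the number of ties:

```r
# Rug-like plot whose tick heights count ties at each value.
set.seed(1)
x <- round(rnorm(200), 1)        # rounding creates ties
tab <- table(x)
v <- as.numeric(names(tab))
plot(range(v), c(0, max(tab)), type = "n",
     xlab = "x", ylab = "number of ties")
segments(v, 0, v, as.numeric(tab))
```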

The pairs() function in R only accommodates quantitative variables: one can modify it to account for qualitative variables as well, with boxplots or barcode plots when one variable is qualitative and mosaic plots when both are.

http://www.stat.yale.edu/~jay/R/Vienna/Vienna.pdf

When displaying information (here, about mortgages in the UK) on a map, one usually divides it into regions that are coloured according to the average (or total) value of the variables. This is called a choropleth map.

But this is misleading:

- the information conveyed by the plot can change depending on the colour scheme chosen (the boundaries between the colours can be placed at round values (say, 1, 2, 5, 10, 20, 50, 100, etc.) or at the quantiles of the variable displayed);

- it also changes with the regions chosen (there is a wealth of different, incompatible, administrative (or not) decompositions into regions);

- larger regions (which tend to have a low density, in our example) are over-emphasized;

- there is no indication of estimation uncertainty (confidence intervals).

A linked micromap plot is actually a table, with the maps in the first column, and the variables in the others; each row corresponds to four regions, highlighted in the map, with a dotchart in the remaining columns.

See also:

http://www.amstat-online.org/sections/graphics/newsletter/Volumes/v132.pdf

After the talk, someone asked an interesting question: "how do we explain to our clients that producing those plots is a time-consuming activity that should be charged accordingly?" -- "Surprise them. The results should be unexpected, they should learn something from the plots."

(There were a lot of talks about psychometrics.)

With large datasets, boxplots are not as informative as they should be: in particular, too many points are unduly labeled as "outliers" -- such points are supposed to be examined one by one...

Letter-value plots generalize boxplots by displaying the 1/2, 1/4, 1/8, 1/16, etc. fractiles instead of just the median and the quartiles; the zone between two such fractiles is represented by a box of decreasing width (and/or changing colour).
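
The letter values themselves are easy to compute in base R. A sketch of the computation only, not of the plot; the stopping rule is my own simplification:

```r
# Letter values: the 1/2, 1/4, 1/8, ... fractiles and their upper
# counterparts; stop when the next pair would summarize too few points
# (the min_points stopping rule is a made-up simplification).
letter_values <- function(x, min_points = 10) {
  x <- sort(x)
  out <- NULL
  k <- 1
  repeat {
    p <- 1 / 2^k
    if (k > 1 && length(x) * p < min_points) break
    out <- rbind(out, quantile(x, c(p, 1 - p), names = FALSE))
    k <- k + 1
  }
  rownames(out) <- paste0("1/", 2^seq_len(nrow(out)))
  out
}
set.seed(1)
letter_values(rnorm(1e4))   # one row per letter value pair
```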

The denpro package helps you visualize high-dimensional datasets, as a 2-dimensional plot, stressing its multimodality, its dispersion or its tails.

This looks great, but the presentation was impossible to understand. The original articles are easier to read:

http://www.vwl.uni-mannheim.de/mammen/klemela/

Grid graphics can now draw X-splines, connect non-rectangular elements with arrows, clip rectangular regions and (with the grImport package) import vector graphics (SVG, PDF, PS -- you should convert them to PostScript first).
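
For instance, grid.xspline() draws a curve relative to a set of control points, with a shape parameter moving between interpolation and approximation (a minimal example; the control points are arbitrary):

```r
library(grid)

# X-splines through arbitrary control points: shape = -1 interpolates
# the points, shape = 1 approximates them.
x <- c(0.1, 0.3, 0.5, 0.7, 0.9)
y <- c(0.2, 0.8, 0.2, 0.8, 0.2)
grid.newpage()
grid.points(unit(x, "npc"), unit(y, "npc"), pch = 16, size = unit(2, "mm"))
grid.xspline(x, y, shape = -1)                         # passes through the points
grid.xspline(x, y, shape = 1, gp = gpar(col = "red"))  # approximates them
```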

The logit and probit link functions are not always sufficient: tests exist to check if they reflect the data, and if they are rejected, we need to look beyond them.

This talk presented the cauchit link function (which is more tolerant to surprising observations), the Gosset family (based on the Student t distribution), and the Pregibon family (or Tukey lambda family).

Those links are implemented in the gld package.
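The cauchit link is also accepted by binomial() in base R; a quick comparison with the logit on simulated data (the simulation settings are mine, chosen so that the true model uses a Cauchy CDF):

```r
# Compare the cauchit and logit links on data simulated from a Cauchy
# CDF (hypothetical simulation, just to exercise the links).
set.seed(42)
x <- rnorm(500)
y <- rbinom(500, 1, pcauchy(0.5 + 2 * x))   # true model: Cauchy CDF

fit_cauchit <- glm(y ~ x, family = binomial(link = "cauchit"))
fit_logit   <- glm(y ~ x, family = binomial(link = "logit"))
c(cauchit = AIC(fit_cauchit), logit = AIC(fit_logit))   # lower is better
```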

If R started as a system "not unlike S" (S is the ancestor of SPlus), the situation is now reversed: to survive, SPlus cannot afford to ignore R.

The next version of SPlus will have a package system very similar to (and compatible with) that of R.

They also try to keep the lead they have in the user interface, by providing an "SPlus workbench" based on Eclipse.

In case you do not know, Eclipse, developed by IBM, is a Java IDE (Integrated Development Environment), i.e., a text editor for Java programmers, that can be extended to accommodate other languages, such as C or C++. There is already an R Eclipse plugin, but it still lacks many features.

Do not overfit the data: use shrinkage, use penalized likelihood estimators.

Respect continuous variables, do not bin them before computation.

Use non-parametric methods (not everything is linear).

Account for multiple tests: when performing several tests, i.e., when you have several p-values, correct those p-values.
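
In R, such a correction is a single call to p.adjust() (the p-values below are made up):

```r
# Five made-up p-values, corrected for multiple testing with p.adjust()
p <- c(0.001, 0.01, 0.03, 0.04, 0.20)
p.adjust(p, method = "bonferroni")  # 0.005 0.050 0.150 0.200 1.000
p.adjust(p, method = "BH")          # 0.005 0.025 0.050 0.050 0.200
```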

Be honest with multi-step procedures: you might feel safe when you perform a gaussianity test and decide on the path to follow next (e.g., parametric versus non-parametric tests) depending on the results of this test, but this is actually a multiple test, whose final p-value should be corrected -- and do not be fooled by the power of non-parametric tests.

Use effective graphics, routinely.

The notion of depth of a point in a cloud of points (the minimum, over all half-spaces whose boundary passes through that point, of the number of points of the cloud inside the half-space) can be used to define bagplots, a 2-dimensional generalization of boxplots.
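
A brute-force version of that depth in two dimensions -- my own sketch, scanning a finite set of directions rather than computing the depth exactly:

```r
# Brute-force halfspace (Tukey) depth in 2D: for each of many boundary
# directions through pt, count the points on the smaller side, then
# take the minimum over directions.
halfspace_depth <- function(pt, cloud, ndir = 360) {
  angles <- seq(0, pi, length.out = ndir)
  min(sapply(angles, function(a) {
    u <- c(cos(a), sin(a))                 # normal of the halfplane boundary
    proj <- sweep(cloud, 2, pt) %*% u      # signed side of each point
    min(sum(proj >= 0), sum(proj <= 0))
  }))
}
set.seed(1)
cloud <- matrix(rnorm(200), ncol = 2)
halfspace_depth(colMeans(cloud), cloud)   # central point: high depth
halfspace_depth(c(5, 5), cloud)           # far outside the cloud: 0
```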

This was the funniest and fastest-paced talk.

This talk explained how to use FFT (Fast Fourier Transform) to estimate a loss distribution. The data gathering part of their process required a lot of social engineering.

I did not attend this talk: the author explains how to build a robust (stock) index for a country or a region -- this can be seen as a robust portfolio.

I did not attend this talk: they cluster discrete time series, using transition matrices.

I did not attend this talk: to assess a clustering algorithm, you can apply it on the initial data and on resampled data, and check if the results are similar.

(not attended)

(not attended)

(not attended)

In the forthcoming years, R will face the following challenges:

- real time data processing

- embeddability in other software (spreadsheets, databases)

- large scale computations (distributed or not)

To that list, I would like to add:

- relational data (data that do not fit in a single rectangular table)

- large datasets

Tackling those challenges may require drastic changes to R, that will trigger incompatibilities with existing code (the situation could be similar to the switch from Perl 4 to Perl 5, for those of you who lived through it).

If you need more information about any of the subjects mentioned above, or about R in general, feel free to peruse the following resources:

- The R-help mailing list

https://stat.ethz.ch/pipermail/r-help/

- More specialized mailing lists, such as the R-SIG-Finance list

https://stat.ethz.ch/pipermail/r-sig-finance/ http://www.r-project.org/mail.html

- RNews

http://cran.r-project.org/doc/Rnews/

- The R-wiki

http://wiki.r-project.org/

- The R gallery

http://addictedtor.free.fr/graphiques/

- Some finance forums, such as Wilmott or NuclearPhynance, mention R from time to time.

http://www.wilmott.com/ http://www.nuclearphynance.com/

- The Journal of Statistical Software

http://www.jstatsoft.org/
