Experts in data analysis, statistics and machine learning for physics gathered on 9–12 September at Imperial College London for PHYSTAT's Statistics meets Machine Learning workshop. The objective of the meeting, part of the PHYSTAT series, was to discuss recent developments in machine learning (ML) and their impact on the statistical data-analysis techniques used in particle physics and astronomy.
Particle physics experiments typically produce large amounts of very complex data. Extracting information about the properties of fundamental physical interactions from these data is a non-trivial task. The general availability of simulation frameworks makes it relatively simple to model the forward process of data analysis: moving from an analytically formulated theory of nature to a sample of simulated events that describe the observation of that theory for a given collider and particle detector in minute detail. The reverse process – inferring from a set of observed data what one learns about a theory – is much more difficult, since predictions at the detector level are only available as "point clouds" of simulated events, rather than as the analytically formulated distributions needed by most statistical inference methods.
Traditionally, statistical techniques have found various ways to address this problem, primarily focused on simplifying the data to summary statistics that can be modeled empirically in an analytical form. A wide range of ML algorithms, from neural networks to boosted decision trees trained to classify events as signal or background, have been used over the past 25 years to construct such summary statistics.
The broader field of ML has seen very rapid development in recent years, moving from relatively simple models capable of describing a handful of observable quantities to neural models with advanced architectures such as normalizing flows, diffusion models and transformers. These have millions or even billions of parameters, potentially capable of describing hundreds or even thousands of observables – and can now extract features from data with performance an order of magnitude better than traditional approaches.
New generation
These advances are driven by newly available computational strategies that calculate not only the learned functions, but also their analytical derivatives with respect to all model parameters, significantly speeding up training times, especially in combination with modern hardware equipped with graphics processing units (GPUs), which facilitate massively parallel calculations. This new generation of ML models offers great potential for new uses in physics data analysis, but has not yet found its place in the mainstream of large-scale published physics results. Nevertheless, significant progress has been made in the particle physics community in mastering the necessary technology, and many new developments using it were presented at the workshop.
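As a minimal illustration of this ingredient – not code from the workshop – the sketch below uses JAX, one of several frameworks providing automatic differentiation, to obtain the analytical gradient of a toy model's loss with respect to all its parameters; every name and shape here is illustrative.

```python
# Minimal sketch (illustrative, not workshop code): automatic differentiation
# of a loss with respect to all model parameters, the ingredient that
# greatly speeds up training on modern hardware.
import jax
import jax.numpy as jnp

def model(params, x):
    # Toy one-layer network: params is a (weights, bias) pair.
    w, b = params
    return jnp.tanh(x @ w + b)

def loss(params, x, y):
    # Mean squared error between predictions and targets.
    return jnp.mean((model(params, x) - y) ** 2)

key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (128, 4))       # 128 toy events, 4 observables
y = jnp.sin(x.sum(axis=1, keepdims=True))  # toy targets
params = (0.1 * jax.random.normal(key, (4, 1)), jnp.zeros(1))

# jax.grad builds the analytical gradient of the loss with respect to the
# parameters; jax.jit compiles it for fast, GPU-friendly parallel execution.
grad_fn = jax.jit(jax.grad(loss))
grads = grad_fn(params, x, y)  # same (weights, bias) structure as params
```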
This new generation of machine learning models offers great potential for new uses in physics data analysis
Many of these ML developments showcase the ability of modern ML architectures to learn multidimensional distributions from point cloud training samples to a very good approximation, even when the number of dimensions is large, for example between 20 and 100.
A primary use case for these ML models is an emerging statistical analysis strategy known as simulation-based inference (SBI), in which learned approximations of the probability density of signal and background over the full high-dimensional observable space are used, dispensing with the concept of summary statistics to simplify the data. Many examples were presented at the workshop, with applications ranging from particle physics to astronomy, indicating significant improvements in sensitivity. Work is still underway on procedures for modeling systematic uncertainties, and no published results in particle physics exist to date. Examples from astronomy showed that SBI can give results of comparable accuracy to the default Markov chain Monte Carlo approach for Bayesian computations, but with computation times several times faster.
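One ingredient behind many SBI methods is the well-known "likelihood-ratio trick": a classifier trained to separate simulated signal from background events approximates their density ratio over the full observable space, with no analytical distribution required. The sketch below is a hedged toy version, with scikit-learn and Gaussian point clouds standing in for the real simulations and architectures.

```python
# Toy sketch of the likelihood-ratio trick used in many SBI approaches:
# a classifier trained on simulated signal vs background point clouds
# yields an approximate density ratio, usable as a test statistic
# without any analytically formulated distribution.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
dim = 20                                    # high-dimensional observables
signal = rng.normal(0.3, 1.0, (5000, dim))  # simulated signal "point cloud"
background = rng.normal(0.0, 1.0, (5000, dim))

X = np.vstack([signal, background])
y = np.concatenate([np.ones(5000), np.zeros(5000)])

clf = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=300, random_state=0)
clf.fit(X, y)

# For equal-size samples, s/(1-s) estimates p_signal(x)/p_background(x):
s = clf.predict_proba(X[:5])[:, 1]
ratio = s / (1.0 - s)
```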
Beyond binning
A commonly used alternative approach to full theory-parameter inference from observed data is known as deconvolution or unfolding. Here, the aim is to publish intermediate results in a form where the detector response has been removed, but without going so far as to interpret the result within a particular theoretical framework. The classical approach to unfolding requires estimating a response matrix that captures the smearing effect of the detector on a particular observable, and applying its inverse to obtain an estimate of the theory-level distribution. This approach is, however, difficult and limited in scope, because the inversion is numerically unstable and requires a low-dimensional binning of the data. Results from several ML-based approaches were presented, which either learn the response directly by modeling the distributions (the generative approach) or learn classifiers that reweight simulated samples (the discriminative approach). Both approaches show very promising results and do not have the binning and dimensionality limitations of the classical response-inversion approach.
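The numerical instability of classical response inversion is easy to demonstrate. The toy sketch below (not from any presented analysis) smears a steeply falling spectrum with a Gaussian response matrix, adds Poisson fluctuations, and inverts, which amplifies the noise into large oscillations.

```python
# Toy illustration of why classical unfolding is numerically unstable:
# inverting a smearing (response) matrix amplifies statistical
# fluctuations in the observed histogram.
import numpy as np

rng = np.random.default_rng(1)
n_bins = 20
truth = 1000 * np.exp(-np.linspace(0, 3, n_bins))  # steeply falling spectrum

# Response matrix: each truth bin smears into neighbouring reco bins.
idx = np.arange(n_bins)
R = np.exp(-0.5 * ((idx[:, None] - idx[None, :]) / 1.5) ** 2)
R /= R.sum(axis=0, keepdims=True)   # columns (truth bins) normalized

expected = R @ truth
observed = rng.poisson(expected)    # statistical fluctuations

unfolded = np.linalg.solve(R, observed.astype(float))
# 'unfolded' oscillates wildly (even going negative) although 'observed'
# differs from 'expected' only by Poisson noise, hence the need for
# regularization, careful binning, or the ML approaches described above.
```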
A third area where ML is enabling great progress is anomaly searches, where an anomaly can be either a single observation that does not fit a distribution (mainly in astronomy) or a set of events that collectively do not fit a distribution (mainly in particle physics). Several analyses highlighted both the power of ML models in such searches and the limits imposed by statistical theory: it is impossible to optimize sensitivity for single-event anomalies without knowing the distribution of outliers, and unsupervised anomaly detectors require a semi-supervised statistical model to interpret ensembles of outliers.
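As a rough sketch of the unsupervised case – illustrative only, with toy data – an autoencoder-style network trained to reconstruct background-like events assigns large reconstruction errors to events unlike its training set. As noted above, turning an ensemble of such outliers into a physics statement still requires a semi-supervised statistical model.

```python
# Toy sketch of an unsupervised anomaly detector: a regressor with a
# narrow bottleneck is trained to reconstruct "normal" events, and the
# per-event reconstruction error serves as the anomaly score.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(2)
normal = rng.normal(0.0, 1.0, (5000, 10))   # background-like events
outliers = rng.normal(4.0, 1.0, (10, 10))   # a few anomalous events

ae = MLPRegressor(hidden_layer_sizes=(32, 3, 32),  # narrow bottleneck
                  max_iter=500, random_state=0)
ae.fit(normal, normal)                      # learn to reconstruct the input

def score(x):
    # Reconstruction error: large for events unlike the training sample.
    return np.mean((ae.predict(x) - x) ** 2, axis=1)

print(score(normal[:5]))    # small scores
print(score(outliers[:5]))  # much larger scores
```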
A final application of machine-learned distributions that was much discussed is data augmentation – sampling a new, larger data sample from a learned distribution. If the synthetic sample is significantly larger than the training sample, its statistical power will be greater, but that power derives from the smooth interpolation of the model, potentially introducing what is known as inductive bias. The validity of the assumed smoothness depends on its realism in a particular setting, for which there is no generic validation strategy: using a generative model always amounts to a trade-off between bias and variance.
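The trade-off can be made concrete with a toy sketch in which a simple kernel density estimate stands in for a learned neural density: the smoothness assumption (here the kernel bandwidth) is exactly where both the extra statistical power and the potential bias come from.

```python
# Toy sketch of data augmentation with a generative model (a kernel
# density estimate standing in for a learned neural density): a much
# larger synthetic sample gains statistical power only through the
# model's smoothness assumption, i.e. its inductive bias.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(3)
train = rng.exponential(1.0, 500)       # small training sample

model = gaussian_kde(train)             # smoothness set by the bandwidth
synthetic = model.resample(50_000)[0]   # much larger synthetic sample

# The synthetic mean fluctuates less than the training mean over repeated
# experiments, but any mismatch between the smoothing and the true
# distribution biases it: a bias-variance trade-off.
print(train.mean(), synthetic.mean())
```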
Interpretable and explainable
Beyond the various new applications of ML, there were lively discussions about the most fundamental aspects of artificial intelligence (AI), including the notion of, and need for, AI to be interpretable or explainable. Explainable AI aims to elucidate what input information was used and with what relative importance, but this goal has no unambiguous definition. The debate over the need for explainability centers largely on trust: would you trust a result if it is unclear what information the model used and how it was used? Can you convince your peers of the validity of your result? The notion of interpretable AI goes beyond this. It is a quality often sought by scientists, because human knowledge derived from AI-based science is generally desired to be interpretable, for example in the form of theories based on symmetries, or on simple or low-rank structures. However, interpretability has no formal criteria, making it an impractical requirement. Beyond the practical aspect, there is also a fundamental point: why should nature be simple? Why should the models that describe it be limited to interpretable ones? The almost philosophical nature of this question made the discussion on interpretability one of the liveliest of the workshop, though for the moment it remains without conclusion.
It is generally desired that human knowledge from AI-based science be interpretable.
In the longer term, several interesting developments are underway. In the design and training of new neural models, two techniques have shown great promise. The first is the concept of foundation models: very large models pre-trained on very large datasets to learn generic features of the data. When these generic pre-trained models are retrained to perform a specific task, they outperform models trained specifically for that task. The second concerns the encoding of domain knowledge in the network: networks with known symmetry principles encoded into the model can significantly outperform models trained generically on the same data.
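A minimal, purely illustrative sketch of the second idea: if the physics is known to be invariant under rotations about the beam axis, feeding a model only rotation-invariant quantities makes the symmetry exact by construction, rather than hoping a generic model learns it from data. The function name and feature choice below are hypothetical.

```python
# Illustrative sketch of encoding a known symmetry into a model: any
# network trained on these features is exactly invariant under azimuthal
# rotations of the momentum, by construction.
import numpy as np

def invariant_features(px, py, pz, e):
    # Quantities unchanged by a rotation of (px, py) about the beam axis:
    pt = np.hypot(px, py)                   # transverse momentum
    m2 = e**2 - (px**2 + py**2 + pz**2)     # invariant mass squared
    return np.stack([pt, pz, m2], axis=-1)
```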
The evaluation of systematic effects is still mostly handled in a statistical post-processing step. Future ML techniques could incorporate systematic uncertainties more fully, for example by reducing sensitivity to them through adversarial training or pivoting methods. Beyond this, future methods could also fold the currently separate step of propagating systematic uncertainties ("profiling") into the learning procedure. Truly global end-to-end optimization of the complete analysis chain could ultimately become tractable and computationally feasible for models that provide analytical derivatives.
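A highly simplified sketch of the adversarial idea, in the spirit of "learning to pivot" (toy data, all names illustrative): a classifier is trained jointly against an adversary that tries to predict a nuisance parameter from the classifier output, penalizing any such dependence.

```python
# Toy sketch of adversarial training against a systematic: the classifier
# is rewarded for classifying well AND for making its output useless to
# an adversary that tries to predict the nuisance parameter z.
import torch
import torch.nn as nn

clf = nn.Sequential(nn.Linear(5, 32), nn.ReLU(), nn.Linear(32, 1))
adv = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))
opt_c = torch.optim.Adam(clf.parameters(), lr=1e-3)
opt_a = torch.optim.Adam(adv.parameters(), lr=1e-3)
bce, mse = nn.BCEWithLogitsLoss(), nn.MSELoss()

x = torch.randn(2000, 5)                            # toy observables
y = (x[:, 0] > 0).float().unsqueeze(1)              # toy labels
z = 0.5 * x[:, 1:2] + 0.1 * torch.randn(2000, 1)    # toy nuisance parameter

for step in range(200):
    # 1) Train the adversary to predict z from the (frozen) classifier output.
    opt_a.zero_grad()
    a_loss = mse(adv(clf(x).detach()), z)
    a_loss.backward()
    opt_a.step()
    # 2) Train the classifier to classify well while fooling the adversary.
    opt_c.zero_grad()
    c_loss = bce(clf(x), y) - 1.0 * mse(adv(clf(x)), z)
    c_loss.backward()
    opt_c.step()
```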