IOP : The LHC Olympics 2020: a community challenge for anomaly detection in high energy physics

Kasieczka, Gregor; Amram, Oz; Kamenik, Jernej F.; Khosa, Charanjit K.; Nachman, Benjamin; Yunus, Mikaeel; Mikuni, Vinicius; Murphy, Christopher W.; Kahn, Alan; Collins, Jack H.; Suarez, Cristina Mantilla; Harris, Philip; Park, Sang Eon; Le Pottier, Luc; Dong, Zhongtian; Faroughy, D.A.; Sanz, Veronica; Matevc, Andrej; Donini, Julien; Dillon, Barry M.; Williams, Daniel; Dai, Biwei; Metodiev, Eric; Bortolato, Blaz; Thaler, Jesse; Udrescu, Silviu-Marian; Tsan, Steven; Pierini, Maurizio; Szewc, Manuel; Dinu, Ioan-Mihail; Sarda, Nilai; Ochoa, Inês; Canelli, Florencia; Seljak, Uroš; Duarte, Javier; Shih, David; Vlimant, Jean-Roch; Gonski, Julia; Smolkovic, Aleks; Rankin, Dylan; Martín-Ramiro, Pablo; Komiske, Patrick; De Freitas, Felipe F.; Benkendorfer, Kees; Vaslin, Louis; Stein, George; Brooijmans, Gustaaf; Andreassen, Anders

doi:10.1088/1361-6633/ac36b9

The LHC Olympics 2020: a community challenge for anomaly detection in high energy physics - Kasieczka, Gregor et al - arXiv:2101.08320

A Variational Recurrent Neural Network cell. The $x(t)$ and $y(t)$ layers represent respectively the input constituent and reconstructed constituents' four-momentum components $p_\text{T}$, $\eta$, and $\phi$. The $\phi_{x}$ and $\phi_{z}$ layers are \textit{feature-extracting layers} which encode a representation of the features in the input layer $x(t)$ and latent space $z$ respectively. $h(t-1)$ represents the current time-step's hidden state, which is updated each iteration via a transition function between $h(t-1)$, $\phi_{x}$, and $\phi_{z}$ carried out by a Gated Recurrent Unit (GRU). At each time-step, the prior distribution defined by $\mu_{t}$ and $\sigma_{t}$ is determined from the current hidden state.

Dijet invariant mass distributions before (left) and after (right) a selection on the Event Score, with a two-prong Z' signal contamination of 0.5\%.

Dijet invariant mass distributions before (left) and after (right) a selection on the Event Score from the Black Box 1 dataset. The signal present is a $Z'$ boson with a mass of 3800 GeV.

Dijet invariant mass distributions before (left) and after (right) a selection on the Event Score from the Black Box 2 dataset. No signal is present, and the dataset shown consists entirely of multijet background events.

Dijet invariant mass distributions before (left) and after (right) a selection on the Event Score from the Black Box 3 dataset. The signal present is a new boson with a mass of 4200 GeV.

Scatter plot of $R(x|m)$ versus $\log p_\text{background}(x|m)$ across the test set in the SR. Background events are shown (as a two-dimensional histogram) in grayscale and individual signal events are shown in red. Figure reproduced from Ref.~\cite{Nachman:2020lpy}.

Receiver Operating Characteristic (ROC) curve (left) and Significance Improvement Characteristic (SIC) curve (right). Figure reproduced from Ref.~\cite{Nachman:2020lpy}.

ROC curve obtained with the VAE classifier on the R\&D data.

The invariant mass distribution for the Black Box 1 data after applying the VAE classifier.

The jet mass distributions for the Black Box 1 data after applying the VAE classifier and restricting to the invariant mass range $[3.6,4.0]$ TeV.

Euclidean distance distributions and ROC curves obtained for the R\&D dataset.

Euclidean distance distributions and ROC curves obtained for the Black Box datasets.

Shaping function obtained for each black box. From left to right, Black Box 1, 2 and 3.

Result of the BumpHunter scan obtained for each black box. From left to right, Black Box 1, 2 and 3.

In-distribution anomaly detection through conditional density estimation. Consider samples of a 1D feature $x$ and a conditional parameter of interest $M$ (left panel), drawn from a smooth Gaussian `background' with a small number of anomalous `signal' events added (inside red circle for clarity). The conditional density values at each data point do not allow the anomaly to be distinguished from the background (center left panel), as they only identify the outliers. However, the local over-density anomaly ratio $\mathrm{\alpha}$ peaks at the anomalous data points (center right panel), and implementing a minimum cut on the anomaly ratio reveals the anomalous events (right panel).

The anomaly score for each event as a function of the invariant mass of the leading two jets. A number of anomalous events are clearly seen near $\mathrm{M_{JJ}\approx 3750\ GeV}$.

Parameter distributions of the events that remain after imposing cuts on the anomaly score $\alpha$, and limiting the mass range to $\mathrm{3600\ GeV < M_{JJ} < 3900\ GeV}$. Vertical dashed lines are the true anomalous events that were unveiled after the close of the competition.

The eight most anomalous events in the black box. Each pair of images visualizes the particles belonging to the leading two jets. Images were constructed by binning the transverse momentum of each particle belonging to the jet in ($\eta$, $\phi$), and oriented along the y axis using the $p_\text{T}$-weighted moment of inertia. Color is log scaled.
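The image construction described in this caption can be sketched as follows. This is a minimal illustration, not the submitting group's code: the function name `jet_image`, the grid size, the image half-width, and the choice of the $p_\text{T}$-weighted centroid as the image centre are all assumptions made here for the example.

```python
import numpy as np

def jet_image(pt, eta, phi, n_bins=33, half_width=1.0):
    """Bin constituent pT in a centred, rotated (eta, phi) grid.

    Hypothetical helper: grid size, half-width and centring convention
    are illustrative assumptions, not values taken from the paper.
    """
    pt = np.asarray(pt, dtype=float)
    eta = np.asarray(eta, dtype=float)
    phi = np.asarray(phi, dtype=float)

    # Wrap phi differences into [-pi, pi), measured from the hardest constituent.
    dphi = (phi - phi[np.argmax(pt)] + np.pi) % (2.0 * np.pi) - np.pi
    # Centre the jet on its pT-weighted centroid.
    coords = np.stack(
        [eta - np.average(eta, weights=pt), dphi - np.average(dphi, weights=pt)],
        axis=1,
    )

    # pT-weighted second-moment ("moment of inertia") tensor of the constituents.
    tensor = (coords * pt[:, None]).T @ coords / pt.sum()
    _, vecs = np.linalg.eigh(tensor)  # eigenvalues in ascending order

    # Project so that the principal (largest-spread) axis becomes the y axis.
    x = coords @ vecs[:, 0]
    y = coords @ vecs[:, 1]

    edges = np.linspace(-half_width, half_width, n_bins + 1)
    image, _, _ = np.histogram2d(x, y, bins=[edges, edges], weights=pt)
    return image
```

The sign of each rotated axis is arbitrary in this sketch. For display, a logarithmic color scale can be applied to the returned array, for instance by plotting `np.log1p(image)`.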

Best inferred latent distributions of the two themes (left and right column) for Black Box 1 with the LDA method. Shown is the $m_0, m_1/m_0$ plane of the mass-basis for the heavier (top row) and the lighter (bottom row) of the two jets.

Invariant mass event distribution of the simulated background (left) and Black Box 1 (right) after performing an LDA-based cut along with the background estimation using the uncut invariant mass distribution. Bottom row displays the corresponding excess found by BumpHunter.

Schematic of the particle graph autoencoder model proposed. Each input jet is represented as a graph in which each particle of the jet is a node, and each node has an edge connecting it to every other particle in the jet. After an edge convolution layer~\cite{DGCNN}, each particle is encoded in a reduced two-dimensional latent space, before another edge convolution layer reconstructs each particle's four-momentum $(E, p_x, p_y, p_z)$.

Comparison of input and reconstructed features $E$ (far left), $p_x$ (center left), $p_y$ (center right), and $p_z$ (far right) for the models trained with MSE (top) and Chamfer (bottom) loss functions on the QCD testing dataset.
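The difference between the two loss functions named in this caption can be illustrated with a short sketch. The Chamfer definition below (mean nearest-neighbour squared distance in both directions) is one common convention and is an assumption here; the paper's exact normalization may differ. The function names are hypothetical.

```python
import numpy as np

def mse_loss(x, y):
    """Mean squared error between particles matched by index."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.mean((x - y) ** 2))

def chamfer_loss(x, y):
    """Symmetric Chamfer distance between two sets of particle four-momenta.

    x has shape (n, 4) and y has shape (m, 4); the result does not depend
    on the order of the particles in either set.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Pairwise squared Euclidean distances, shape (n, m).
    d2 = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    # Nearest neighbour in y for each x, plus nearest neighbour in x for each y.
    return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())
```

A reconstruction that returns the correct particles in a different order is penalized by the MSE but not by the Chamfer loss, which is the practical distinction between the two rows of the figure.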

ROC curves for the PGAE trained with the MSE (left) and Chamfer loss (right).

Illustration of the simplified background estimation procedure in BB 2 for the GAE trained with MSE loss. A comparison between the nonoutlier and outlier jet mass distribution is shown (upper left). The ratio of the two distributions is fit with a fourth-order polynomial to derive a transfer factor (lower left). The corresponding postfit prediction is also shown (upper right). The postfit ratio is randomly scattered around one as expected for BB 2, which contains no signal.
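The transfer-factor procedure in this caption can be sketched in a few lines. This is a simplified illustration under stated assumptions: the ratio is taken as outlier over nonoutlier, the polynomial fit is unweighted, and the bin centres are rescaled to $[-1, 1]$ for numerical stability. The function name and arguments are hypothetical.

```python
import numpy as np

def transfer_factor_prediction(bin_centers, n_nonoutlier, n_outlier, degree=4):
    """Predict the outlier histogram from the nonoutlier histogram.

    Assumes every nonoutlier bin is populated. The unweighted least-squares
    fit is an illustrative choice, not necessarily the one used in the paper.
    """
    centers = np.asarray(bin_centers, dtype=float)
    n_non = np.asarray(n_nonoutlier, dtype=float)
    n_out = np.asarray(n_outlier, dtype=float)

    # Rescale the bin centres to [-1, 1] so the polynomial fit is well conditioned.
    lo, hi = centers.min(), centers.max()
    x = 2.0 * (centers - lo) / (hi - lo) - 1.0

    ratio = n_out / n_non
    coeffs = np.polyfit(x, ratio, degree)   # fourth-order polynomial by default
    transfer_factor = np.polyval(coeffs, x)

    # Background prediction in the outlier region.
    return transfer_factor * n_non
```

Dividing the observed outlier histogram by the returned prediction gives the postfit ratio, which should scatter around one when no signal is present.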

Bump hunt in the dijet invariant mass in BB 1 (left) and 2 (right) using MSE (top) and Chamfer (bottom) as the loss functions. Panel labels: BB 1, MSE, $2.1\,\sigma$ at $3.9$~TeV; BB 2, MSE, $0.8\,\sigma$ at $3.3$~TeV; BB 1, Chamfer, $1.5\,\sigma$ at $2.8$~TeV; BB 2, Chamfer, $-1.4\,\sigma$ at $5.1$~TeV. Outlier jets have a reconstruction loss in the top 10\% with respect to the corresponding BB. Outlier events are required to have both jets be outliers. BB 1 has an anomalous large-radius dijet signal $\PZpr \to \PX\PY \to (\Pq\Pq)(\Pq\Pq)$ injected at $m_\PZpr=3823$~GeV (with $m_\PX = 732$~GeV and $m_\PY= 378$~GeV), while BB 2 has no injected anomalies.
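The outlier definition in this caption (jets in the top 10\% of reconstruction loss, and events with both jets flagged) can be sketched as below. Pooling both jets of every event to derive a single threshold is an assumption made for this example; the caption only states that the threshold is relative to the corresponding black box.

```python
import numpy as np

def select_outlier_events(loss_jet1, loss_jet2, top_fraction=0.10):
    """Return a boolean mask of events in which both jets are outliers.

    Hypothetical helper: the pooled threshold is an illustrative choice.
    """
    loss_jet1 = np.asarray(loss_jet1, dtype=float)
    loss_jet2 = np.asarray(loss_jet2, dtype=float)

    # Loss value above which a jet lies in the top `top_fraction` of all jets.
    pooled = np.concatenate([loss_jet1, loss_jet2])
    threshold = np.quantile(pooled, 1.0 - top_fraction)

    # An event is an outlier only if both of its jets exceed the threshold.
    return (loss_jet1 > threshold) & (loss_jet2 > threshold)
```

The dijet invariant mass of the events passing this mask is then the input to the bump hunt.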

An example representation of dependencies between the data $\mathbf{x}$, latent variables $\mathbf{u}$, $\mathbf{v}$ and the normally distributed variable $\mathbf{z}$. Here the example data has 8 dimensions and the latent space has 5 dimensions. The bijective transformations are learned with Masked Autoregressive Flows (MAFs).

Signal detection ROC curves in the R\&D dataset for different anomaly scores.

Overlapping $m_{jj}$ distributions below (left) and above (right) two threshold cuts on $\mathcal{R}_{m_{jj}}$. Distributions for a $50^\text{th}$ percentile cut are in blue, while distributions for a $70^\text{th}$ percentile cut are in orange. The $x$ axis is in $\mathrm{GeV}/c^2$.

ABCNet architecture used in UCluster for a batch size N, F input features, and embedding space of size E. Fully connected layers and encoding node sizes are denoted inside ``\{\}''. For each GAPLayer, the number of k-nearest neighbors (k) and heads (H) are given. Full lines represent direct connections while dotted lines denote skip connections.

Visualisation of the embedding space created for anomaly detection for 1000 events. The true labels are shown on the left, while the cluster labels created by UCluster are shown on the right. Figure from Ref. \cite{Mikuni:2020qds}.

Maximum signal-to-background ratio found for different clustering sizes (left) and maximum approximate significance found for UCluster trained and evaluated on different numbers of events with cluster size fixed to 30 (right). The uncertainty corresponds to the standard deviation of five trainings with different random weight initializations. Figure from Ref. \cite{Mikuni:2020qds}.

$p$-values obtained from the analysis in the resonance mass scan for BB2 (left) and BB1 (right) at selection efficiencies 10\%, 1\%, 0.2\%. The dashed black line is the result with no selection cut.

$m_{JJ}$ distributions obtained for BB2 (left) and BB1 (right) for the signal region centered around $3500\,\mathrm{GeV}$ after a series of selection cuts. The top line and data points correspond to no selection cut.

Substructure distributions in the anomalous BB1 signal region for signal-like (red) and background-like (grey) events. For this figure, signal-like is defined by a selection on the classifier output with efficiency 0.5\%.

\textbf{Left plot:} Performance of CWoLa (blue), the Autoencoder trained on Jet 1 (brown) and Jet 2 (green), and their average (orange), as measured by the AUC metric. The error bars denote the standard deviation on the AUC metric. \textbf{Right plot:} Significance of the signal region excess after applying different cuts for CWoLa (blue) and the Autoencoder (orange). The best cuts for CWoLa and the AE ensemble correspond to the $0.3 \%$ and the (Jet 1, Jet 2) = $(80 \, \%, 2.5 \, \%)$ event selections, respectively. The initial significance of the excess ($100 \, \%$ selection) is shown in green. Note that the fit to the raw distribution (i.e. no cut applied) is lower than the naive expected significance $S/\sqrt B$ due to a downward fluctuation in the number of background events in the signal region.

An illustration of the Tag N’ Train technique. Here O1 and O2 represent Object-1 and Object-2, the two components of the data one wishes to train classifiers for.

Events in the first data subset after final selection for Black Box 1. The signal peak can be seen slightly above 3800 GeV. The local significance for just this subset of the data was around 3$\sigma$.

A histogram of the classifier output (left) and the subleading $\tau_{21}$ (right) for a neural network trained to distinguish `data' (Pythia) and `simulation' (Herwig) in the signal region. The ratio between the `simulation' (Herwig) or `simulation + \textsc{Dctr}' and `data' (Pythia) is depicted by orange circles (green squares) in the lower panels. Figure from Ref.~\cite{Andreassen:2020nkr}.

Left: The significance improvement at a fixed 50\% signal efficiency as a function of the signal-to-background ratio ($S/B$) in the signal region. The evaluation of these metrics requires signal labels, even though the training of the classifiers themselves does not use signal labels. Error bars correspond to the standard deviation from training five different classifiers. Each classifier is itself the truncated mean over ten random initializations. Right: The predicted efficiency normalized to the true data efficiency in the signal region for various threshold requirements on the NN. The $x$-axis is the data efficiency from the threshold. The error bars are due to statistical uncertainties. Figure from Ref.~\cite{Andreassen:2020nkr}.

A Receiver Operating Characteristic (ROC) curve (left) and significance improvement curve (right) for various anomaly detection methods described in the text. The significance improvement is defined as the ratio of the signal efficiency to the square root of the background efficiency. A significance improvement of 2 means that the initial significance would be amplified by about a factor of two after employing the anomaly detection strategy. The supervised line is unachievable unless there is no mismodeling and one designed a search for the specific $W'$ signal used in this paper. The curve labeled `Random' corresponds to equal efficiency for signal and background. Figure from Ref.~\cite{1815227}.
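The significance improvement defined in this caption, signal efficiency divided by the square root of the background efficiency, can be computed from classifier scores as in the following sketch. The function name is hypothetical, and tied scores are not given any special treatment.

```python
import numpy as np

def roc_and_sic(scores, labels):
    """Signal efficiency, background efficiency and significance improvement.

    Each entry corresponds to a threshold that keeps every event scoring at
    least as high as one event in the sample, from tightest to loosest.
    """
    scores = np.asarray(scores, dtype=float)
    is_signal = np.asarray(labels, dtype=bool)

    # Sort events from most to least signal-like.
    order = np.argsort(-scores, kind="stable")
    sorted_signal = is_signal[order]

    signal_eff = np.cumsum(sorted_signal) / sorted_signal.sum()
    background_eff = np.cumsum(~sorted_signal) / (~sorted_signal).sum()

    # SIC = eps_S / sqrt(eps_B); undefined where no background passes.
    sic = np.full(signal_eff.shape, np.nan)
    passed = background_eff > 0
    sic[passed] = signal_eff[passed] / np.sqrt(background_eff[passed])
    return signal_eff, background_eff, sic
```

With no selection applied, both efficiencies equal one and the significance improvement is one, which is also the maximum reached by the `Random' curve; values above one indicate that the selection amplifies the initial significance.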

Left: BDT scores using the kinematic observables and the scores from ResNet-34. Right: BDT scores using the kinematic observables only.

Left: ROC curve for a BDT using the kinematic observables and the scores from ResNet-34. Right: ROC curve for a BDT using the kinematic observables only.

Data Features for the Blackbox data 1. The dark blue line (background) refers to the labeled dataset, whereas the other three lines are distributions from the blackbox.

Data Features for the Blackbox data 1.

The paired observables from a dijet sample can be represented as a histogram, shown as the matrix $\bbm{D}$. The generative process we describe can be visualized as the matrix product $\bbm{PFP}^\intercal$, shown as a decomposition on the right. This example is for separating dijet events into quark and gluon categories, where the observable is jet constituent multiplicity.

The components retrieved from factorized topic modeling of the LHC Olympics R\&D dataset, using jet mass as our observable. Panel labels: anomalous components at 10\% signal; background component at 10\% signal; anomalous components at 1\% signal; background component at 1\% signal. Our method shows good agreement between the learned topics and the ground truth on the jet mass observable. We are able to recover both of the new physics resonant masses (at 100 GeV and 500 GeV) with signal fraction of 10\% (top row) and 1\% (bottom row). The dips in the background model at the resonance masses arise because the topic finding procedure attempts to identify the most orthogonal components.

The QUAK approach

(Left) Receiver Operating Characteristic (ROC) for signal versus background selection for different test priors. Performance comparison of the 1D (QCD prior only), 2D (QCD prior and two-prong $(m_{jj},m_{j1},m_{j2}) = (4500,500,150)$ prior), and 3D (QCD prior, two-prong $(m_{jj},m_{j1},m_{j2}) = (4500,500,150)$ prior, and three-prong $(m_{jj},m_{j1},m_{j2}) = (5000,500,500)$ prior) approaches with fully supervised training on the correct signal prior (red). Jet masses $(m_{j1}, m_{j2})$ are excluded in the training of the supervised classifier to mitigate model dependence and to allow for the potential of signal extraction through mass fits. (Right) ROC for signal versus background selection for 2D QUAK (solid) and a fixed supervised network (dashed). For both QUAK and the supervised network a signal prior of $(m_{jj},m_{j1},m_{j2}) = (4500,500,150)$ is used in the training.

Performance in separating the digit 5 from 9 for a fully supervised network (full) compared with: (left) (orange) QUAK 1D, (green) the space of digits 0-8 as input to a small MLP trained with 5 against 9, (red) the space of 5 and 9 trained with a small MLP to separate 5 and 9, (purple) the space of digits 0-8 with a linear discriminant (LDA) using 7 as a signal proxy; (center) a linear discriminant on the space of digits 0-8 using a proxy signal that is X\% 7 and the opposite percent 0, where `all' corresponds to all digits; (right) an N-dimensional latent space from the VAE trained with a supervised network of 5 vs Signal(9)/Proxy(7), compared to the QUAK space using an LDA with either Signal(9)/Proxy(7).

Results of unblinding the first black box. Shown are the predicted resonance mass (top left), the number of signal events (top right), the mass of the first daughter particle (bottom left), and the mass of the second daughter particle (bottom right). Horizontal bars indicate the uncertainty (only if provided by the submitting groups). In a smaller panel the pull (answer-true)/uncertainty is given. Descriptions of the tested models are provided in the text.
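The pull quoted in this caption is a one-line calculation. The sketch below uses the true resonance mass of 3823 GeV for Black Box 1; the submitted value and uncertainty in the usage note are hypothetical numbers chosen only for illustration.

```python
def pull(answer, true_value, uncertainty):
    """Pull of a submitted estimate: (answer - true) / uncertainty."""
    if uncertainty <= 0:
        raise ValueError("uncertainty must be positive")
    return (answer - true_value) / uncertainty
```

For example, a hypothetical submission of $3900 \pm 50$ GeV for the resonance mass would give `pull(3900.0, 3823.0, 50.0)`, roughly 1.5, meaning the estimate sits about one and a half quoted standard deviations above the true value.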

The organization of physics analysis groups in ATLAS and CMS. The large circles on the left represent analysis groups that are primarily focused on measuring properties of the Standard Model. The group called SM is focused on the electroweak and QCD aspects of the SM that are not covered by the other groups. The large circles on the right represent the analysis groups primarily focused on searches for new particles. Selected supporting organizations that are connected to both measurement and search groups are depicted in smaller circles in the middle. The ATLAS CWoLa hunting search was performed in the HDBS analysis group in ATLAS (as a `model agnostic extension of the diboson resonance search') and the ATLAS and CMS data-versus-simulation analyses are performed in the Exotics/Exotics groups.

An illustration of the nested loops required for signal model-dependent interpretation of a model-agnostic search. The parenthetical remark for the signal cross section refers to the fact that if the number of predicted signal events is small, one may need to repeat the injection many times due to the large statistical fluctuations in the phase space. This is not a problem for a model-dependent search, where one can use all simulated signal events and scale by the predicted cross section. Unsupervised approaches may be able to avoid certain steps if they do not change in response to modifications in the data.
