MUSiC—An Automated Scan for Deviations between Data and Monte Carlo Simulation

A model independent analysis approach is presented, systematically scanning the data for deviations from the standard model Monte Carlo expectation. Such an analysis can contribute to the understanding of the CMS detector and the tuning of event generators. The approach is sensitive to a variety of models of new physics, including those not yet thought of.

The startup of the LHC opens up unknown territory in particle physics. However, the initial understanding of the detector and the validation of the Monte Carlo (MC) generator predictions are important tasks which precede most searches for new physics. In addition, it is not clear which effects of new physics will appear in the first data, and which theory beyond the standard model (SM) will describe them. Therefore it is planned to systematically analyze the data with as little bias as possible. A dedicated algorithm called "MUSiC" (Model Unspecific Search in CMS) has been developed for this purpose. Details of the strategy, the algorithm and some representative results are presented in [1] and summarized here, assuming an integrated luminosity of 1 fb$^{-1}$ at $\sqrt{s} = 14$ TeV. Similar strategies have already been applied successfully at other experiments [2,3].
The present study is restricted to events containing at least one lepton (e or µ). Events are grouped into event classes according to their final state topology. Each class is defined by the number of physics objects of each type in the event, e.g. 1µ 3jet. Both exclusive and inclusive event classes are considered: an exclusive class requires exactly the stated objects, whereas an inclusive class requires only a minimal number of objects (e.g. 1µ 3jet + X, at least one muon and 3 jets). The following physics objects are included: muons (µ), electrons (e), photons (γ), hadronic jets, and missing transverse energy ($E_T^{\mathrm{miss}}$). This leads to approximately 300 to 400 event classes. Inclusion of additional objects such as τ leptons or b-jets is planned for the future. It is assumed that new physics will appear in events with high-$p_T$ objects.
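The classification described above can be illustrated with a minimal Python sketch. It is not the MUSiC implementation: the object labels, the naming order, and the requirement that every inclusive class itself contains at least one lepton are assumptions made for this illustration.

```python
from collections import Counter
from itertools import product

# Object types named in the text; the ordering used in class names is an assumption.
OBJECT_ORDER = ("e", "mu", "gamma", "jet", "MET")
LEPTONS = ("e", "mu")


def _name(counts):
    """Build a class label such as '1mu 3jet MET' from per-type counts."""
    parts = []
    for obj, n in zip(OBJECT_ORDER, counts):
        if n > 0:
            parts.append("MET" if obj == "MET" else f"{n}{obj}")
    return " ".join(parts)


def exclusive_class_name(objects):
    """Exclusive class: exactly the objects found in the event."""
    c = Counter(objects)
    return _name([c[obj] for obj in OBJECT_ORDER])


def inclusive_class_names(objects):
    """All inclusive classes ('... + X') to which the event contributes.

    Assumption: an inclusive class is kept only if it requires at least one lepton.
    """
    c = Counter(objects)
    ranges = [range(c[obj] + 1) for obj in OBJECT_ORDER]
    names = []
    for counts in product(*ranges):
        n_leptons = sum(n for obj, n in zip(OBJECT_ORDER, counts) if obj in LEPTONS)
        if n_leptons >= 1:
            names.append(_name(counts) + " + X")
    return names


event = ["mu", "jet", "jet", "jet"]
print(exclusive_class_name(event))
print(inclusive_class_names(event))
```

In this sketch a single event enters exactly one exclusive class but several inclusive classes, which is why deviations in inclusive classes are not statistically independent of each other.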
The CMS software framework, including a full detector simulation, is used to process the simulated samples and to reconstruct the physics objects. The SM backgrounds W+jets, Z+jets and $t\bar{t}$+jets are simulated with the ALPGEN [4] generator. Diboson samples, lepton-enriched QCD, bottomonia and charmonia events as well as QCD multi-jet, γ+jets, and minimum bias events are simulated using PYTHIA [5]. Some benchmark supersymmetry (SUSY) samples are generated using the SoftSusy [6] program to calculate the mass spectrum, and PYTHIA for the event generation. For the dominant backgrounds of many new physics signals ($t\bar{t}$+jets and W/Z+jets), a constant k-factor has been applied consistently to all subsamples in order to reweight the leading order cross section to the next-to-leading order (NLO) prediction obtained from MCFM [7]. For SUSY, the Prospino 2 [8] NLO cross sections are used.
Event selection. Single and dilepton triggers (e, µ) are used. Muons measured by both the muon system and the inner tracker are selected with $p_T(\mu) > 30$ GeV (well above the trigger threshold) and $|\eta(\mu)| < 2.1$. Isolation is required, mainly to reject non-isolated muons from heavy flavor decays. Additional offline criteria are applied to reject misreconstructed muon candidates. Electrons are identified using tracker and calorimeter information. Isolated, well-identified electrons with $p_T(e) > 30$ GeV and $|\eta(e)| < 2.5$ are selected. Jets are reconstructed with the "iterative cone" algorithm with a cone radius of $R = \sqrt{\Delta\phi^2 + \Delta\eta^2} = 0.5$. Jet energy scale (JES) corrections are applied, and jets with $p_T(\mathrm{jet}) > 60$ GeV and $|\eta(\mathrm{jet})| < 2.5$ are selected. The hadronic energy fraction of jets is required to be $E_{\mathrm{had}}/E_{\mathrm{tot}} > 0.05$. Finally, the missing transverse energy is considered when $E_T^{\mathrm{miss}} > 100$ GeV, after accounting for JES corrections and subtracting muon momenta. Further cuts are applied to reduce ambiguities, misidentifications and duplications of the various physics objects.
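The kinematic thresholds quoted above can be collected in a short Python sketch. Only the numerical cuts are taken from the text; the function names and the argument layout are hypothetical, and isolation, identification and ambiguity-removal criteria are deliberately left out because the text does not specify them.

```python
import math

# (minimum pT in GeV, maximum |eta|) per object type, as quoted in the text.
CUTS = {"mu": (30.0, 2.1), "e": (30.0, 2.5), "jet": (60.0, 2.5)}
MET_MIN = 100.0           # GeV, threshold on the missing transverse energy
MIN_HAD_FRACTION = 0.05   # lower bound on E_had / E_tot for jets


def passes_kinematics(kind, pt, eta, had_fraction=None):
    """Apply only the pT, |eta| and hadronic-fraction cuts (hypothetical helper)."""
    pt_min, eta_max = CUTS[kind]
    if pt <= pt_min or abs(eta) >= eta_max:
        return False
    if kind == "jet":
        return had_fraction is not None and had_fraction > MIN_HAD_FRACTION
    return True


def has_met(met):
    """Missing transverse energy counts as an object only above threshold."""
    return met > MET_MIN


def delta_r(eta1, phi1, eta2, phi2):
    """R = sqrt(dphi^2 + deta^2), with dphi wrapped into [-pi, pi]."""
    dphi = math.remainder(phi1 - phi2, 2.0 * math.pi)
    return math.hypot(eta1 - eta2, dphi)


print(passes_kinematics("mu", 35.0, 1.0))
print(passes_kinematics("jet", 80.0, 0.5, had_fraction=0.4))
print(delta_r(0.0, 3.0, 0.0, -3.0))
```

The wrapping of the azimuthal difference matters for objects on either side of $\phi = \pm\pi$, which are close in the detector even though their raw $\phi$ values differ by almost $2\pi$.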
The search algorithm. The composition of a selected event, i.e. the number of muons, jets, etc., determines the event class to which it is assigned. At present, three distributions are investigated for each event class: the scalar sum of the transverse momenta, $\sum p_T$, of all physics objects; the invariant mass $M_{\mathrm{inv}}$ of all physics objects (the transverse mass $M_T$ for classes with $E_T^{\mathrm{miss}}$); and, for classes with missing transverse energy, $E_T^{\mathrm{miss}}$ itself. The $\sum p_T$ distribution is the most general observable. The invariant mass has an obvious advantage for new particles produced as resonances. Models beyond the SM which aim to provide a dark matter candidate might be spotted with the $E_T^{\mathrm{miss}}$ distribution. The implementation of additional variables is straightforward.
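As an illustration of the first two observables, the following Python sketch computes $\sum p_T$ and $M_{\mathrm{inv}}$ from a list of four-vectors. The tuple layout (E, px, py, pz) in GeV is an assumption of this sketch; the transverse mass is omitted because its exact definition is not given here.

```python
import math


def sum_pt(objects):
    """Scalar sum of transverse momenta; each object is (E, px, py, pz) in GeV."""
    return sum(math.hypot(px, py) for (_, px, py, _) in objects)


def invariant_mass(objects):
    """Invariant mass of the summed four-vector of all objects."""
    e = sum(o[0] for o in objects)
    px = sum(o[1] for o in objects)
    py = sum(o[2] for o in objects)
    pz = sum(o[3] for o in objects)
    m2 = e * e - (px * px + py * py + pz * pz)
    # Guard against small negative values caused by rounding.
    return math.sqrt(max(m2, 0.0))


# Two massless, back-to-back objects of 50 GeV each in the transverse plane.
objs = [(50.0, 50.0, 0.0, 0.0), (50.0, -50.0, 0.0, 0.0)]
print(sum_pt(objs), invariant_mass(objs))
```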
All distributions are input to the MUSiC algorithm (similar to [3]), which scans them systematically for deviations by comparing the MC prediction with the measured data. Every connected bin region within a distribution is considered, from individual bins to broad regions. For each connected region, a counting experiment is performed: the various Monte Carlo contributions are added up ($N_{\mathrm{SM}}$) and this sum is compared to the data ($N_{\mathrm{data}}$). In addition to these two numbers, the systematic uncertainty of the prediction, $\delta N_{\mathrm{SM}}$, is used. A Poisson probability (p-value) is computed, which quantifies how likely it is for the prediction to fluctuate to the number seen in the data. The systematic uncertainties, taking correlations into account, are included by convolution with a Gaussian function. From all possible combinations of connected bins, the region with the smallest p-value ($p^{\mathrm{data}}_{\mathrm{min}}$) is chosen; it is called the region of interest. This approach is sensitive to an excess of data as well as to a deficit. A statistical penalty factor derived from toy MC experiments is included to account for the number of regions investigated, thus obtaining the event class significance P. The value of P can be translated into standard deviations and is comparable to the widely used CL$_b$ [9].
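The region scan can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the MUSiC code: the p-value uses $P(N \geq N_{\mathrm{data}})$ for an excess and $P(N \leq N_{\mathrm{data}})$ for a deficit, the Gaussian is truncated to positive means within $\pm 5\,\delta N_{\mathrm{SM}}$ and integrated on a simple grid, and a single relative systematic uncertainty `rel_sys`, fully correlated across bins, stands in for the full treatment of correlations. All function names are hypothetical.

```python
import math


def poisson_pmf(k, mean):
    """Poisson probability of k counts for a positive mean."""
    return math.exp(-mean + k * math.log(mean) - math.lgamma(k + 1))


def poisson_tail(n_data, mean, excess):
    """P(N >= n_data) for an excess, P(N <= n_data) for a deficit."""
    if not excess:
        return min(1.0, sum(poisson_pmf(k, mean) for k in range(n_data + 1)))
    total, k = 0.0, n_data
    while True:  # sum the upper tail directly to keep small values accurate
        term = poisson_pmf(k, mean)
        total += term
        if k > mean and term <= 1e-17 * total:
            return min(1.0, total)
        k += 1


def p_value(n_data, n_sm, sigma, steps=200):
    """Poisson tail probability averaged over a Gaussian-smeared expectation."""
    excess = n_data >= n_sm
    if sigma <= 0.0:
        return poisson_tail(n_data, n_sm, excess)
    lo, hi = max(0.0, n_sm - 5.0 * sigma), n_sm + 5.0 * sigma
    width = (hi - lo) / steps
    num = norm = 0.0
    for i in range(steps):
        theta = lo + (i + 0.5) * width  # midpoint rule, theta is always > 0
        weight = math.exp(-0.5 * ((theta - n_sm) / sigma) ** 2)
        norm += weight
        num += weight * poisson_tail(n_data, theta, excess)
    return num / norm


def region_of_interest(data, mc, rel_sys):
    """Scan all connected bin regions; return (p_min, (first_bin, last_bin))."""
    best = None
    for i in range(len(data)):
        for j in range(i, len(data)):
            n_data, n_sm = sum(data[i:j + 1]), sum(mc[i:j + 1])
            if n_sm <= 0.0:
                continue  # no prediction in this region
            p = p_value(n_data, n_sm, rel_sys * n_sm)
            if best is None or p < best[0]:
                best = (p, (i, j))
    return best


data = [10, 11, 30, 28]          # toy observed counts per bin
mc = [10.0, 10.0, 10.0, 10.0]    # toy SM expectation per bin
print(region_of_interest(data, mc, rel_sys=0.1))
```

The sketch stops at $p_{\mathrm{min}}$. To obtain the event class significance P, the same scan would be repeated on pseudo-data generated from the SM expectation, and P taken as the fraction of pseudo-experiments whose smallest p-value is at most the one found in data.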
The following systematic uncertainties are included in the present study. The algorithm accounts for known correlations within a single error source when computing p-values and when generating pseudo-data. A 10% cross section uncertainty on all SM processes reflects the current theoretical knowledge, which might improve in the future. The uncertainty on the integrated luminosity is assumed to be 5%. The JES uncertainty is expected to be 5%, and is also propagated to the value of $E_T^{\mathrm{miss}}$. The reconstruction and identification efficiencies for electrons and muons are assumed to be known with a precision of 2% (1% for jets), while the uncertainty of the misidentification rate for electrons and muons is 100%. Finally, the statistical uncertainty of the MC prediction is included. The contributions of other background sources can be regarded as another systematic uncertainty.
Results. The sensitivity of MUSiC is tested with typical models. In addition to producing pseudo-data for the background-only hypothesis, one can also assume signal + background, i.e. add a signal to the SM distributions. Several pseudo-experiments are performed, and the expected significance of a possible signal is determined. MUSiC has been tested for various use cases and benchmark signals. Three main scenarios fit the concept of a model independent search: detector effects and generator tuning, a prominent single deviation, and complex multiple deviations. In this paper, only the last is discussed, while examples for the first two are found in [1]. An example of a detector effect is an underestimation of the JES uncertainty, and examples of a single prominent deviation include new gauge bosons, which would show up in only one or a few event classes. In MUSiC, as a reasonable threshold for a significant deviation, event classes with $P < 1 \cdot 10^{-3}$ ($> 3.3\sigma$) are considered to be "interesting".
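The translation of P into standard deviations can be checked with a few lines of Python. The two-sided Gaussian convention used below is an assumption of this sketch; it is chosen because it reproduces the quoted correspondence between $P = 1 \cdot 10^{-3}$ and about $3.3\sigma$, whereas a one-sided convention would give about $3.1\sigma$.

```python
from statistics import NormalDist


def p_to_sigma(p):
    """Convert a probability into standard deviations (two-sided convention, assumed)."""
    return NormalDist().inv_cdf(1.0 - p / 2.0)


def sigma_to_p(z):
    """Inverse conversion under the same two-sided convention."""
    return 2.0 * (1.0 - NormalDist().cdf(z))


print(round(p_to_sigma(1e-3), 2))  # approximately 3.29
```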
There are several reasons why a generic analysis strategy is a good supplement to SUSY searches. The free parameters in most SUSY models lead to a large parameter space from which nature can pick the scenario realized. Through their decay chains, SUSY particles often lead to spectacular cascades with high multiplicities of leptons and jets and a large amount of $E_T^{\mathrm{miss}}$ due to the LSP (lightest SUSY particle). SUSY often does not predominantly favor a single topology, but contributes to a multitude of event classes. The generic search can give an overall picture of the SUSY signatures. The results shown here use a typical mSUGRA point ("LM4") with $m_0 = 210$ GeV, $m_{1/2} = 285$ GeV, $\tan\beta = 10$, $\mathrm{sgn}(\mu) = +$, $A_0 = 0$, and $\sigma(\mathrm{NLO}) = 27.7$ pb. For LM4, the decay of the $\tilde{\chi}^0_2$ into on-shell Z bosons is characteristic. The production of $\tilde{q}\tilde{g}$ is dominant, contributing about half of the total cross section. The squark and gluino masses are below $\sim 700$ GeV, leading to relatively large total cross sections.
A global scan of all event classes is performed. In total, 375 inclusive classes and 315 exclusive classes are populated (by signal and/or background MC). Deviations are found in a multitude of places: LM4 contributes to 160 (260) exclusive (inclusive) classes, including 94 (170) classes with $E_T^{\mathrm{miss}}$; 15% (36%) show significant deviations with P (expected) $< 1 \cdot 10^{-3}$ in $\sum p_T$, and 38% (59%) show significant deviations with P (expected) $< 1 \cdot 10^{-3}$ in $E_T^{\mathrm{miss}}$. In the case of inclusive classes, the deviations partially overlap, since 1µ 5jet events contribute to 1µ 2jet + X, 1µ 3jet + X and so on. When comparing similar final state topologies, the inclusive classes tend to have smaller expected values of P than the exclusive ones. The two kinematic distributions examined, $\sum p_T$ and $M_{\mathrm{inv}}$ ($M_T$), lead to similar results. However, one can observe a clear gain when using the $E_T^{\mathrm{miss}}$ distribution, which is a prominent signature of the LSP.
In Fig. 1(a) the $E_T^{\mathrm{miss}}$ distribution for a single pseudo-experiment is shown. A deviation well above $4\sigma$ is found, despite the 20% to 30% systematic uncertainties on the $t\bar{t}$+jets and W+jets backgrounds, which are due in particular to the JES uncertainties. Besides the single lepton plus jets event classes, another type of class with significant deviations contains multileptons plus jets (+ $E_T^{\mathrm{miss}}$). Many combinations of 2 or 3 leptons show small values of P. One example is the 1e 1µ 3jet $E_T^{\mathrm{miss}}$ + X class, where a deviation $> 4.4\sigma$ is found in $\sum p_T$ (region of interest 1000 to 2650 GeV with $N_{\mathrm{data}} = 188$ and $N_{\mathrm{MC}} = 61 \pm 18$). In order to quantify the global compatibility of data and SM expectation, one can plot the frequency distribution of the P values using all event classes. If there is a signal leading to deviations in several event classes, one would expect the tails of this distribution to differ from the SM-only scenario: more entries than expected should be observed at small P. Fig. 1(b) is an example of such a distribution, using the $\sum p_T$ distribution in the exclusive case. Here the P values of all event classes with pseudo-data entries are shown as $-\log_{10} P$, so that a value of 3 corresponds to $3.3\sigma$. One can clearly see that SUSY leads to significant deviations in numerous classes. Note that classes where only an upper limit can be set (P < X, indicated by the red arrow) all contribute to the rightmost bin.
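The frequency distribution of P values described above can be sketched in Python as a histogram of $-\log_{10} P$. The binning, the axis range, and the function name are assumptions of this illustration; classes with only an upper limit on P are counted in the rightmost bin, as in the text.

```python
import math


def log_p_histogram(p_values, n_upper_limits=0, n_bins=6, max_value=6.0):
    """Histogram of -log10(P); values beyond the range go into the last bin."""
    counts = [0] * n_bins
    width = max_value / n_bins
    for p in p_values:
        x = -math.log10(p)                       # p must lie in (0, 1]
        index = min(int(x / width), n_bins - 1)  # clamp overflow to the last bin
        counts[index] += 1
    counts[-1] += n_upper_limits                 # classes with only "P < X"
    return counts


# Toy P values for a handful of event classes (illustrative numbers only).
print(log_p_histogram([0.5, 0.05, 5e-4, 1e-9], n_upper_limits=2))
```

Comparing such a histogram for pseudo-data with signal against the SM-only expectation shows whether the tail at large $-\log_{10} P$ is more populated than expected.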