CMS Software and Offline preparation for future runs

The next LHC runs, nominally Run III and Run IV, pose significant challenges to the CMS offline and computing systems. Run IV in particular will need completely different solutions, given the current estimates of the LHC conditions and trigger rates. We report on the R&D process CMS has established in order to gain insight into the needs and the possible solutions for 2020+ CMS computing.


Introduction
The CMS [1] experiment has been taking data at the CERN LHC collider since late 2009, setting year-over-year records in both the number of events collected and the collision energy. To date, CMS has published almost 1000 physics papers, at 7, 8 and 13 TeV.
The computing, based on a Tiered Distributed e-Infrastructure model, has been able to support the physics program quite comfortably, even allowing for unplanned operational modes in the last year of Run II, 2018.

A wrap-up of 2018 data taking
During 2018, the last year of Run II, the LHC delivered to CMS an integrated luminosity in excess of 67 fb⁻¹, out of which more than 64 fb⁻¹ were recorded for offline utilization. Apart from the planned pp data taking at ∼1 kHz, during 2018 CMS experimented with two novel operational modes.
For most of the data-taking period, a special trigger designed to collect "bb" events was activated, with rates up to 6 kHz in the last part of the fill (where the pile-up, also called PU in the following, and the event size are at their minimum). This sample will be the basis for an intense program of b-quark physics, to be completed largely before the restart of the LHC in 2021. In this mode, 12 billion events were collected, resulting in a b-quark sample several times larger than the full samples collected by BaBar and Belle over their lifetimes.
The end-of-year Pb-Pb heavy-ion run collected events at a much larger rate than in previous years, with the aim of accumulating a large sample of unbiased Minimum Bias events (at 6 kHz) on top of the standard central physics triggers. Overall, a sample of 4.5 billion Minimum Bias events has been collected, and is currently being analyzed.
Overall, the CMS computing model showed its maturity in 2018, allowing agile data-taking operations, an intensive Monte Carlo production (totalling 24 billion full-simulation events) and analysis operations at a constant level of roughly 50 thousand cores. The overall resources deployed by CMS in 2019, according to WLCG REBUS [2], are shown in Table 1.

The challenges ahead
With the computing system performing as detailed in the previous section, the situation for the near future appears under control. Table 2 summarizes the expected future LHC data-taking periods. Indeed, the next LHC run (Run III, 2021-2024) has projected computing needs similar to those of the recently completed Run II, and CMS expects the current computing model to be able to support it.
The longer-term scenario is completely different: Run IV, tentatively starting in 2026, is currently expected to deliver instantaneous luminosities up to 7.5 × 10³⁴ cm⁻²s⁻¹, a factor 7.5 higher than the initial LHC design, and a factor 4 greater than what is expected to be reachable in Run III. This translates into an average number of superimposed inelastic pp collisions per beam crossing (pile-up) of up to 200. On top of that, CMS currently extrapolates the need for a selection rate to offline of up to 7.5 kHz, in order not to lose performance on the relatively low-mass Higgs boson precision studies.
A back-of-the-envelope estimate of the computing resources, assuming that all categories scale linearly with the number of events collected and their size, can be extrapolated from the 2018 conditions by applying a factor of 7.5 for the trigger rate and a factor of 200/37 for the event size and complexity, thus yielding a global factor exceeding 40. On top of that, event complexity is expected to increase further due to the new detectors CMS plans to deploy by Run IV [3], introducing another factor of up to 2. Overall, a simple estimate of the computing resource needs, in case the same computing, data-taking and analysis models are applied to Run IV, is close to 100 times the 2018 (or 2019) resources. In perspective, technology evolution has always been considered the main ingredient of a sustainably growing computing infrastructure, allowing in the past 1-2 decades for year-by-year performance increases at constant cost of up to +50%. Unfortunately, there is strong evidence [4] that such a steep evolution has slowed down considerably in the last 5 years, with newer and more realistic extrapolations limited to +15%/year; over a period of 8 years, technology is thus not expected to help by more than a factor of 3.
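The scaling above can be written out explicitly. The following sketch simply multiplies the factors quoted in the text (7.5 kHz vs ∼1 kHz trigger rate, pile-up 200 vs 37, a factor 2 for the new detectors, and +15%/year technology gain over 8 years):

```python
# Back-of-the-envelope Run IV resource scaling, using the factors quoted in the text.
trigger_factor = 7.5 / 1.0      # up to 7.5 kHz at HL-LHC vs ~1 kHz in 2018
pileup_factor = 200 / 37        # event size/complexity scaling with average pile-up
detector_factor = 2.0           # new Phase-2 detectors: up to another factor 2

naive = trigger_factor * pileup_factor            # global factor exceeding 40
with_detector = naive * detector_factor           # close to 100

# Expected help from technology: +15%/year price-performance gain over 8 years
tech_gain = 1.15 ** 8                             # about a factor 3

print(f"naive scaling:          {naive:.1f}x")
print(f"with new detectors:     {with_detector:.1f}x")
print(f"technology gain (8y):   {tech_gain:.1f}x")
print(f"residual budget factor: {with_detector / tech_gain:.0f}x")
```

The residual factor of 20-30 is the budget gap discussed in the next section.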

CMS R&D activities for HL-LHC
Even assuming a quite optimistic technology gain of a factor 3, HL-LHC computing would on paper need a budget exceeding today's by a factor of 20-30. This is clearly unfeasible, and would force the experiments to implement a much smaller physics program in order to fit within the available resources (for example, by drastically reducing the trigger selections on low-energy objects).
CMS has started an intense research program on the HL-LHC computing requirements, steered via the "Evolution of Computing Model 202X" (ECoM2X) task force, which includes efforts from the physics, trigger and detector communities, in order to better understand the needs of CMS at HL-LHC and eventually propose solutions, changes, and further directions of study. The task force is structured into 7 working groups, covering aspects from technology tracking to the modelling of computing needs and the evolution of the infrastructure and of the computing environment.
In the rest of the paper we present some highlights of the ECoM2X work in progress, and more generally of the R&D efforts CMS is putting in place in order to plan for sustainable computing operations during HL-LHC.

Technology Tracking
It is difficult to predict the 2026+ technology scenario, but trends already under way are likely to negatively impact CMS computing at HL-LHC.
The most cost-effective computing architectures (e.g. when measured in Flops/$) will not be standard multi-core CPUs, but simpler chips with a higher level of parallelism (SIMD).
General Purpose Graphics Processing Units (GPGPUs) are a derivative of the graphics cards whose market has exploded in recent years, mostly driven by video games. They are vector processors, with thousands of available cores and very limited capabilities for serial programming. They are best suited to extremely parallel algorithms, where the same operation has to be performed on a series of input data (SIMD [5]); they also impose an extremely rigid memory model, with accesses to external memory often costing more time than the actual compute operations. While it is very difficult to objectively compare the performance of CPUs and GPGPUs in general, given the different programming models, some selected applications have been ported and show large speedups; see [6,7] for HEP (High Energy Physics) specific examples of such comparisons.
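The data-parallel style that GPGPUs reward can be illustrated with a toy example (not CMS code; NumPy vectorization stands in for the SIMD execution a GPGPU performs in hardware, using the standard transverse-energy relation E_T = E/cosh(η) as the per-element operation):

```python
import numpy as np

# Toy SIMD-style computation: one operation applied uniformly to a large array.
# A GPGPU executes such element-wise work across thousands of cores in lock-step;
# per-element serial logic or heavy branching would break this efficiency.
energies = np.random.default_rng(42).uniform(1.0, 100.0, size=1_000_000)

def transverse_serial(e, eta=1.2):
    # Scalar, serial formulation: what a plain CPU loop would do.
    return [x / np.cosh(eta) for x in e]

def transverse_vectorized(e, eta=1.2):
    # Vectorized, data-parallel formulation: SIMD-friendly, maps onto accelerators.
    return e / np.cosh(eta)

# Both formulations compute the same physics quantity.
assert np.allclose(transverse_serial(energies[:100]),
                   transverse_vectorized(energies[:100]))
```

Only the vectorized form exposes the parallelism an accelerator can exploit; porting HEP code largely means restructuring algorithms into this shape.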
Field-Programmable Gate Arrays (FPGAs) offer a way to port algorithms into silicon, either via low-level languages like VHDL [8], or via synthesis from higher-level languages [9]. The main interest in the technology comes from the acquisition of ALTERA (one of the biggest FPGA producers) by Intel [10]: this paves the way for a tight integration of current x86_64 and FPGA technologies, potentially on the same chip and with large communication bandwidths. FPGA-based technologies have been common for many years in the online systems of experiments; their availability in offline systems opens the way to their use as accelerators in standard workflows. Examples in this direction are [11,12].
Tensor Processing Units (TPUs) are chips designed for fast matrix manipulation. While not a completely new idea, they have gained renewed interest in recent years due to the emerging field of Artificial Intelligence, where matrix algebra is a key tool in algorithms like gradient descent; Google uses TPUs to power its internal tools, from search to decision systems. Unfortunately, Google's TPUs are not available on the open market, but only via direct partnership.
It is difficult to imagine yet another technology emerging now and becoming relevant within the HL-LHC timeline, so CMS has decided to focus its studies on software solutions targeting these architectures.
Concerning storage, the price differential between rotating disks and solid-state disks is not shrinking as capacities increase. Standard disks, using MAMR or HAMR technologies [15], are probably here to stay, with solid-state disks limited to specific uses like fast analysis systems or caches.
Tape technology is still evolving at a good pace, but suffers from a shrinking market, with by now essentially a single manufacturer [16].
Global network availability (in terms of bandwidth, number of links and their quality) has never been a problem for the LHC up to now, with performance exceeding our needs; in our computing models, networking has generally been treated as an infinitely available resource. This could change by Run IV, at least on the expensive and difficult-to-deploy intercontinental routes, which have seen yearly traffic increases of up to +40%/year. The need for proper network modelling, for example in order to avoid unnecessary multiple transatlantic transfers, has clearly emerged.

Physics choices
As detailed above, it is difficult to imagine any gain coming from a reduction of the trigger rates to offline, unless a reduction of the CMS physics capabilities is accepted.
A reduction of the computing needs could still be possible by implementing smarter data handling approaches, in principle with a small effect on the experiment's physics reach:
• Park to tape, with no prompt reconstruction, a large fraction of the selected events, processing only the fraction needed to ensure good data quality; the rest could be processed either during the winter shutdowns, or at the end of the LHC run. While feasible and with no long-term effect on the physics output, this clearly slows down analysis activities with respect to the competition.
• Implement scouting triggers (as already done on a small scale), for which the trigger objects are directly used for analysis; eventually the original raw data can be discarded, making these samples not reprocessable in the future.
• Switch a large(r) part of the Monte Carlo production to fast simulation, investing for example in realistic GAN-based tools [17].
• Invest in collaborations with the authors of event generators, in order to make sure those tools scale well on modern hardware.

Towards heterogeneous architectures
There is general consensus in the HEP community that a large part of the solution to the HL-LHC computing scaling can come from cost-effective computing architectures, such as those listed in the technology tracking section. In the last year CMS has invested in a systematic effort to include non-CPU architectures as first-class citizens in its software. The strategy uses the concept of "multiple equivalent modules" able to perform a given task, with the module selection possible at submission time, at site level, or even on an event-by-event basis [18]. When accelerators, or in any case technologies different from standard CPUs, are involved, the CMS framework is non-blocking, allowing a full utilization of the hardware in all cases. At the same time, CUDA [19] has been included in the standard deployment of the CMS software, in order to ease the development effort. The next step will be a survey of the available tools for automatic code translation across architectures.
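A minimal sketch of the "multiple equivalent modules" idea follows; all names are hypothetical (this is not the CMSSW API), and the point is only the pattern: interchangeable implementations of one task are registered, and one is chosen at runtime from the hardware actually available:

```python
# Hypothetical sketch of "multiple equivalent modules": interchangeable
# implementations of the same task, selected at runtime. Not the CMSSW API.
TASK_REGISTRY = {}

def register(task, backend):
    """Register fn as the implementation of `task` on `backend`."""
    def wrap(fn):
        TASK_REGISTRY.setdefault(task, {})[backend] = fn
        return fn
    return wrap

@register("clustering", "cpu")
def cluster_cpu(hits):
    return sorted(hits)          # stand-in for the reference CPU algorithm

@register("clustering", "gpu")
def cluster_gpu(hits):
    return sorted(hits)          # must stay physics-equivalent to the CPU version

def run(task, data, available=("cpu",), preferred=("gpu", "cpu")):
    # Pick the first preferred backend that the local hardware offers.
    for backend in preferred:
        if backend in available and backend in TASK_REGISTRY.get(task, {}):
            return TASK_REGISTRY[task][backend](data)
    raise RuntimeError(f"no usable backend for {task}")

# On a CPU-only site the CPU module runs; on a GPU node the GPU one is chosen.
assert run("clustering", [3, 1, 2]) == [1, 2, 3]
assert run("clustering", [3, 1, 2], available=("cpu", "gpu")) == [1, 2, 3]
```

The requirement that every backend produce equivalent physics output is what makes the multi-architecture validation effort, mentioned later in the paper, so important.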

Reduced data formats
A large part of the disk storage needed in computing operations hosts data and Monte Carlo samples for user analysis. Over the years, with a better understanding of the LHC environment, of the CMS detectors and of the needs of typical analyses, CMS has been able to drastically shrink the data format in which samples are presented to analysis users. Table 3 shows how effective such shrinking has been, for a total reduction of a factor 3000 since the start of beam activities. This is an ingredient of primary importance in keeping disk requests low. Moreover, having formalized a data format like NANOAOD, currently expected to be adequate for ∼50% of the studies, also reduces the computing needs, since most users will be able to find preprocessed data for their analyses. The NANOAOD format is now produced for all CMS data and Monte Carlo samples, but its utilization is at an early stage and will need to be monitored over the next years.

Common tools
Even if not strictly included in the computing requests, CMS needs to deploy and maintain in production a large set of software tools, covering databases, toolkits for detector simulation and description, and data and workload management infrastructures. The human cost behind these is not easy to quantify, but it is not a small correction to the overall needs.
In order to reduce the long-term maintenance effort, CMS is evaluating the adoption of tools that are standard at least within HEP. Current targets are:
• Rucio [20] for data management;
• CRIC [21] as information system;
• DD4HEP [22] as geometry description toolkit.
CMS expects to complete the evaluation of, and migration to, such tools by 2020, in order to be ready for the Run III data taking.

Changes to the infrastructure
In the context of WLCG, a general effort towards a new-generation data infrastructure is strongly active, also via groups like DOMA [23] and European projects like ESCAPE [24] and XDC [25]. The "data-lake" approach aims to create a solid, secure and curated data infrastructure, using HEP-owned data centers linked via middleware that lets them appear as much as possible as a single logical system. Strong reliance on network links is needed both for intra-lake communications and for data delivery to computing centers external to the lake, which can be:
• standard "GRID-like" centers;
• disk-less "GRID-like" centers;
• commercial cloud providers;
• High Performance Computing centers;
• short-lived centers coming from grants, collaborations, etc.
The concept behind the data-lake is to make sure our data is safe, and to profit from every computing opportunity, lowering the bar for the acceptance of such centers in terms of lifetime, local storage and support level.
Three R&D programmes are expected to drive down the resource needs, and are detailed in the following paragraphs.
The data-lake storage approach [datalake] is a sharp deviation from the initial LHC computing model, in which storage and CPU resources had to be deployed in a symmetric and balanced way among sites, such that computing tasks would essentially read local site data. That approach was driven by the (lack of) trust in general-purpose networking between distributed computing sites, which required input data to be present locally at the processing site; as a side effect, in order not to make match-making inefficient, multiple copies of the input data needed to be available. The data-lake model comes from the realization that, mostly thanks to services like Netflix and YouTube, the general connectivity between sites has much improved, and is in many cases not inferior to costly dedicated lines. This allows for a centralization of storage into fewer and bigger sites, with the processing sites accessing the input data remotely, via streaming or through caches. The decoupling between storage and processing resources is welcome not only for the expected reduction of input copies, natural in the model, but also in order to serve atypical processing resources, like disk-less temporary facilities or grant-based HPC systems.
As discussed previously, RAW data from the detectors during pp collisions are 1-10 MB in size. The complete output of the reconstruction, including all the charged tracks, the calorimetric deposits and the single-detector responses, can be up to a factor 10 bigger, and is rarely needed at the level of physics analyses. The experiments are dedicating a big effort to the definition of a reduced set of physics objects still suitable for analysis activities. This would have two substantial effects:
• a reduction of the disk storage space needed, proportional to the decrease in size with respect to the standard datasets;
• a reduction of the processing resources dedicated to analysis, since the reduced set would be preprocessed with close-to-final quantities, not requiring sizeable postprocessing.
CMS is currently leading this R&D effort, followed closely by the ALICE Collaboration. Table 3 shows the evolution of the main analysis input format as a function of time; the increased understanding of the accelerator conditions, of the detector calibrations and of the analysis patterns has allowed for a reduction by a factor 3000 from the early commissioning days to the 2019 analysis scenario. A similar pattern is expected to hold for HL-LHC: after a commissioning period, CMS expects at least 50% of the analysis activities to be possible with the smallest NANOAOD data format [nano].
In order to utilize even small chunks of processing time, for example on HPC systems when backfilling tasks during the preemption time of large parallel workflows, the granularity of tasks, as events processed per job, must be as elastic as possible. ATLAS is pioneering the utilization of an Event Service [atlasevs], which can serve events in bunches as small as 1; such an approach, complicated from the point of view of bookkeeping but very effective in the utilization of spare CPU cycles, is also being evaluated by the other experiments.
On the processing side, R&D is ongoing on the utilization of high-throughput accelerators such as those detailed in a previous section. In general, economically sustainable processing at HL-LHC will need to be able to utilize heterogeneous hardware, when made available for example at HPC sites, or at online farms offered to offline use when the beam is off. The experiments are trying to modify their core frameworks in order to allow for the asynchronous utilization of external hardware, local or remote, even deferring the decision to process startup via auto-identification of the available hardware [het]. The biggest worries about the capability to use different processing architectures concern the manpower needed to write a large part of the algorithms multiple times, and the physics-level validation between those versions, for which some initial solutions have been developed [alicerohr].
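The cache-mediated access pattern behind the data-lake model can be sketched with a toy read-through cache (all names here are illustrative; production deployments rely on dedicated services such as XRootD caching proxies, not anything like this):

```python
# Toy read-through cache illustrating cache-mediated access to a data lake.
# Names are illustrative only; real setups use dedicated caching services.
class LakeCache:
    def __init__(self, fetch_remote):
        self.fetch_remote = fetch_remote   # function reading from the data lake
        self.store = {}                    # stand-in for a local disk cache
        self.remote_reads = 0              # wide-area transfers performed

    def read(self, filename):
        if filename not in self.store:     # cache miss: stream from the lake
            self.store[filename] = self.fetch_remote(filename)
            self.remote_reads += 1
        return self.store[filename]        # cache hit: served locally

lake = {"2018/data_001.root": b"...event data..."}
cache = LakeCache(lambda name: lake[name])

# Many jobs at a disk-less site read the same input; only the first read
# crosses the wide-area network, all subsequent ones are served locally.
for _ in range(100):
    cache.read("2018/data_001.root")
assert cache.remote_reads == 1
```

This is why the model tolerates disk-less and short-lived processing sites: the cache, not the site, carries the storage responsibility.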

Current CMS extrapolated resource needs for HL-LHC
Part of the solutions described in the previous sections have already been inserted into the CMS computing model simulation used for long-term planning. The latest public figures are shown in Figure 1. Starting from the back-of-the-envelope resource increases of up to 100x with respect to 2019, the current best estimates are 15x for storage and 22x for CPU (which does not yet include any reliance on GPGPUs or similar); more information and details can be found in [26].

Conclusions
The current understanding of the CMS computing needs and strategies at HL-LHC has been detailed, delineating a possible evolution of the current infrastructure and introducing changes at the level of computing architectures, physics operations and analyses. While CMS cannot yet demonstrate a viable solution for HL-LHC computing, a large effort is ongoing, involving all the Collaboration sub-projects, with the already evident result of year-by-year decreasing projected needs.

Acknowledgments
This paper is partially supported by the EU Project ESCAPE, G.A. 824064.