COMPUTING

Introduction

Just two months after the “LHC First Physics” event of 30th March, the analysis of the O(200) million 7 TeV collision events accumulated in CMS during the first 60 days is well under way.

The consistency of the CMS computing model has been confirmed during these first weeks of data taking. The model is based on a hierarchy of use-cases deployed across the different tiers, in particular the distribution of RECO data to T1s, which then serve the data on request to T2s, along a topology known as a “fat tree”. During this period the model was further extended by commissioning an almost full “mesh”, meaning that RECO data were shipped directly to T2s whenever possible, enabling additional physics analyses compared with the “fat tree” model. Computing activities at the CERN Analysis Facility (CAF) have been marked by good response times for a load shared almost evenly between ALCA (alignment and calibration tasks, the highest priority), commissioning and physics analyses. Latencies, in particular at the T0 and CAF, were well within design goals, allowing prompt reconstruction to be performed and calibration constants to be produced in a timely fashion.
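To illustrate the difference between the two distribution topologies, the sketch below contrasts “fat tree” routing (each T2 is served only by its associated T1) with full “mesh” routing (any T2 may be served by any T1 holding the data). The site names and the association map are hypothetical and purely illustrative; this is not CMS production code.

```python
# Illustrative sketch (not CMS production code): how "fat tree" and "mesh"
# routing differ when choosing T2 destinations for a RECO dataset.
# The site names and the T1->T2 association map are made-up examples.

T1_TO_T2 = {
    "T1_DE_KIT":  ["T2_DE_DESY", "T2_DE_RWTH"],
    "T1_IT_CNAF": ["T2_IT_Bari", "T2_IT_Pisa"],
    "T1_US_FNAL": ["T2_US_MIT", "T2_US_Wisconsin"],
}

def fat_tree_destinations(custodial_t1: str) -> list[str]:
    """Fat tree: a T2 only receives data from its associated (regional) T1."""
    return T1_TO_T2.get(custodial_t1, [])

def mesh_destinations() -> list[str]:
    """Full mesh: any T2 may receive data from any T1 holding the dataset."""
    return [t2 for t2s in T1_TO_T2.values() for t2 in t2s]

if __name__ == "__main__":
    print(fat_tree_destinations("T1_IT_CNAF"))  # only the associated T2s
    print(mesh_destinations())                  # every T2 in the map
```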

There has been a continuous export of data from CERN, with high peaks during the first LHC “squeeze” at the end of April, when the initial transfer rate tripled without any difficulties. Aggregated transfer rates of processed data from CERN to all T1s and T2s were in the range of a few GB/s, and the system showed flexibility in dealing with occasional backlogs. The observed quality of service at T1s for prompt skimming (selecting samples of data for particular analyses) and for reprocessing is satisfactory.

The vibrant activity at the T2s is an excellent indication of the expectations of the physics community; a sense of the scale of the transfers is given in the Data Operations and Facilities Operations sections. The very high proportion of successful jobs can be directly linked to the readiness of the T2s: this key factor in a distributed environment has remained consistently high during the past 12 months, an achievement only possible with the commitment and high-quality work of the staff at the remote sites, the CMS computing shifters and the Facility Operators.

Four new operators joined Core Computing during the last three months, a result of high turnover and of the re-engineering of tasks that has taken place in Facility Operations and Data Operations in order to support mission-critical tasks with limited resources.

In conclusion, the whole system, hardware and software, is stable and reliable. The data volume so far amounts to more than 100 TB of raw data, which is still modest compared with what is expected over the whole period of CMS operations at higher luminosities (an increase of around three orders of magnitude is expected in the coming 18 months). But the past few weeks have given good indications of the capacity of the CMS computing system to deliver more, and to support the needs of the thousands of CMS physicists.

Facilities and operations

The bi-weekly ‘European/Asian CMS T2 support meeting’, in addition to the already existing support meeting on the OSG side, provides central support to sites whenever needed. The average CMS Site Readiness for T2 sites is improving. In addition, dedicated datasets were produced, and the JobRobot and SAM tests were moved to the latest CMSSW and CRAB releases. Complementing the Site Readiness activity, the FacOps team is establishing a group to follow up Tier-1 production performance and resource utilization.
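The sketch below gives a simplified picture of how a daily site-readiness flag can be derived from such test results. It is only an illustration of the idea, with invented thresholds and inputs; it is not the actual CMS Site Readiness algorithm.

```python
# Simplified illustration of a daily "site readiness" decision based on test
# results; the thresholds and inputs are invented for the example and are not
# the real CMS Site Readiness criteria.

def site_is_ready(sam_availability: float,
                  jobrobot_efficiency: float,
                  good_transfer_links: int) -> bool:
    """Flag a T2 site as 'ready' for a given day (illustrative criteria)."""
    return (sam_availability    >= 0.80 and   # SAM tests mostly passing
            jobrobot_efficiency >= 0.80 and   # JobRobot jobs mostly succeeding
            good_transfer_links >= 2)         # enough working transfer links

# Example: 95% SAM availability, 85% JobRobot efficiency, 4 good links.
print(site_is_ready(0.95, 0.85, 4))  # True
```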

In this first period of 2010 data taking, the CAF job and user monitoring has been improved. The CAF is heavily used in bursts, with jobs always able to start almost immediately in the low-latency queues. Resource usage is broadly distributed among the different groups, with approximately 130 active users.

The pool of CMS Computing Shift Persons was further extended to 60 people covering 3 time zones, distributed over ~10 remote centers around the world. Procedures for 24/7 coverage of Critical Services are being deployed. The Computing Run Coordinator (CRC) now acts as liaison for the WLCG daily operations calls, and feedback to CMS is reported at Monday’s Computing Operations meeting. All CRCs have been given TEAM/ALARM roles to open GGUS tickets, and the Savannah-GGUS bridge is now fully operational.


Fig.1: Site readiness for CMS Tier 2 sites

The HTTP group is being created to provide a single oversight body and to implement service-operation best practices, and in particular security measures, for all CMS offline services that are delivered over HTTP/S and centrally supported and hosted at CERN by CMS. The project will act as a bridge between the Offline project, under DMWM, and the Computing project, under Facility Operations.

Finally, we would like to report that the CERN Facilities Operations team is now fully staffed and has been fully operational since April 2010.

Data Operations

T0 Operations.  
Data collection from collisions at 7 TeV started at the end of March under the acquisition era Commissioning10. At that time MinimumBias was the only primary dataset used by physics; several skims from commissioning/PVT (for example GOODCOLL) and from physics (SD, CS) were run on MinimumBias as well. We recently moved to a physics setup of 8 PDs (Primary Datasets): Mu, MuMonitor, EG, EGMonitor, JetMETTau, JetMETTauMonitor, MinimumBias and ZeroBias, plus the default commissioning PDs (Commissioning, Cosmics, etc.), with Run2010A as the acquisition era. By the end of May, we had recorded almost 580 million events in RAW format, yielding a total data volume in Commissioning10 of over 430 TB, including all re-reconstructions.
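Primary datasets are defined by groups of trigger paths, so each event is written to every PD whose trigger selection it satisfies. The sketch below illustrates this routing logic only; the trigger names and their mapping to PDs are hypothetical and do not reproduce the actual CMS trigger menu.

```python
# Illustrative sketch: routing an event to primary datasets (PDs) based on
# the trigger paths it fired. Trigger names and the PD mapping are made up.

PD_TRIGGERS = {
    "Mu":          {"HLT_Mu9"},
    "EG":          {"HLT_Ele15", "HLT_Photon20"},
    "JetMETTau":   {"HLT_Jet30", "HLT_MET45"},
    "MinimumBias": {"HLT_MinBias"},
}

def primary_datasets(fired_triggers: set[str]) -> list[str]:
    """Return every PD whose trigger selection overlaps the fired triggers."""
    return [pd for pd, paths in PD_TRIGGERS.items() if paths & fired_triggers]

# An event can satisfy several selections and thus appear in several PDs.
print(primary_datasets({"HLT_Mu9", "HLT_Jet30"}))  # ['Mu', 'JetMETTau']
```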


Fig. 2: Simulated MC events per month.

Fig. 3: Simulated MC, size in GB per month.

Tier-1 Operations

Various re-RECO passes on real data and MC have been performed, as well as a complete redigi/re-reco pass on the Summer09 MC sample. 500 workflows generated roughly 1500 output datasets; 500 million input events were processed in over 500 thousand jobs; the total input data size amounted to 400 TB, while the output data size reached ~400 TB for RAW and 220 TB for RECO.
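As a rough cross-check of these (rounded) totals, the implied averages are about 1000 events per job and a little under 1 MB per input event:

```python
# Rough averages implied by the quoted (rounded) Tier-1 reprocessing totals;
# decimal units (1 TB = 1e6 MB) are assumed.
input_events = 500e6   # ~500 million input events
jobs         = 500e3   # ~500 thousand jobs
input_tb     = 400.0   # ~400 TB of input data

print(f"events per job: {input_events / jobs:.0f}")                 # ~1000
print(f"MB per event  : {input_tb * 1e6 / input_events:.2f} MB")    # ~0.80 MB
```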

MC production
350 million events were simulated during the past 3 months, for a total volume of over 430 TB. The completed samples are announced promptly on hn-cms-datasets@cern.ch as soon as they have been archived on tape at a T1 site.

RelVal
We produced almost 240 million events, consuming over 35 TB of tape space (at the Tier-0 and Tier-1s), in 3189 individual datasets for 20 releases.

Transfers  
In the last 90 days, we transferred over 0.8 PB from the T0 to the T1 sites, and over 3 PB from T1 sites to T2 sites (transfers of datasets for analysis).
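For scale, these cumulative volumes correspond to the sustained averages computed below (a rough calculation assuming decimal units); the instantaneous peaks quoted in the Introduction are naturally well above these averages.

```python
# Back-of-the-envelope conversion of the 90-day cumulative transfer volumes
# into average rates (decimal units assumed: 1 PB = 1e6 GB).

seconds = 90 * 24 * 3600   # 90 days in seconds

t0_to_t1_gb = 0.8e6        # ~0.8 PB from T0 to T1 sites
t1_to_t2_gb = 3.0e6        # ~3 PB from T1 to T2 sites

print(f"T0 -> T1 average: {t0_to_t1_gb / seconds:.2f} GB/s")  # ~0.10 GB/s
print(f"T1 -> T2 average: {t1_to_t2_gb / seconds:.2f} GB/s")  # ~0.39 GB/s
```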


Fig. 4: Cumulative transfer volume over last 90 days from T0 to T1 sites.
Fig. 5: Cumulative transfer volume over last 90 days from T1 to T2 sites.

User support

The User Support, together with the Physics Analysis Toolkit (PAT) team, has set up extensive study material for a remote e-learning course on "Using PAT in your analysis". The course is "virtual", i.e. there are no lectures; all material and exercises will be available on the Web. A tutoring service will be available for registered participants during the tutorial week of 21-25 June.

This is a follow-up to the successful and well-received series of PAT courses with lectures and exercises, which will be continued later this year. The opportunity has been taken to consolidate the existing material and to add a thorough set of preparatory exercises for those with little or no knowledge of CMSSW. This approach was found to be very useful during the EJTerm course at Fermilab in January.

A team of motivated and committed experts and tutors has been formed around the PAT course, ensuring that the new, updated material reflects the commitment of its providers. Maintaining CMS software and computing documentation requires continuous work. The CMS WorkBook has benefited greatly from the PAT course: not only has the PAT part been restructured and updated, but the general material participants need to go through before going into the details has also been streamlined. Feedback is welcome as usual.

The CMSSW reference manual has also been updated. Major improvements have been made to make access to the class documentation quick and easy. The main entry page now presents the CMSSW directory structure based on the assignment of packages to the different projects, and quick links have been added to the SWGuide for general documentation and to the CVS browser for the source code.

Distributed Facilities and Services (DFS)

In 2010, Computing Integration has been split into two areas: CERN Facilities and Services (CFS) Integration and Distributed Facilities and Services (DFS) Integration. Stephen Gowdy and David Mason have been appointed coordinators of the former, while Claudio Grandi and Jose Hernandez coordinate the latter.

The DFS Integration activity will act as liaison between the various Computing Operations groups (Analysis, Facilities and Data Operations) and the Offline DMWM developers for matters related to distributed computing. It will coordinate the collection of requirements for the DMWM tools, bug reporting and release validation.

DFS Integration will also act as liaison between Computing Operations and the Offline CMSSW developers, collecting requirements and, more generally, observations from Computing Operations related to the use of CMSSW. This includes, for example, memory usage, I/O access patterns and issues related to CMSSW deployment at the computing sites.

DFS Integration will report to CMS Computing & Offline about the activities of WLCG and related projects on issues of potential interest and impact on the CMS software, e.g. middleware features, security constraints, or changes in the infrastructure that may require changes in the procedures. It will collect requirements from CMS Computing & Offline and report them to WLCG and related projects, follow the implementation of the required functionality and report back to CMS. DFS Integration will also advise Computing and Offline on the definition of procedures and policies in accordance with the CMS Computing Model, and on possible modifications to the model itself arising from the needs of Computing, Offline or other bodies.

Analysis operations

The Level-2 task "Analysis Operations" in Computing is focused on the operational aspects of enabling physics data analysis at Tier-2 and Tier-3 centers worldwide. Its activities are divided into three subtasks: data movement, access and validation; CRAB server operations and related analysis support; and metrics and evaluation of the global analysis system.

Fig.6: Analysis jobs/day in last 3 months at CMS T2s.
Fig.7: Data transferred in the last 3 months from T1s to T2s; about half of this is data placement by Analysis Operations, which by now has placed about 1500 TB of data at more than 50 T2s.

The last months have seen the transition to a new version of CRAB and of the CRAB server. Both are now essentially complete with respect to the functionality needed to analyse the data in the current LHC run. The volume of operations has increased substantially, both on the user side and in central data placement, with resource usage now close to, or exceeding, the Computing TDR expectations.

Fig.8: Mail volume (#messages) handled by Analysis Operations on the CrabFeedback forum in 2010.