COMPUTING

Introduction

It has been a very active quarter in Computing, with interesting progress in all areas. The activity level at the computing facilities, driven by both organised processing from Data Operations and user analysis, has been steadily increasing. The large-scale production of simulated events that has been progressing throughout the fall is wrapping up, and reprocessing with pile-up will continue. A large reprocessing of all the proton-proton data has just been released, and another will follow shortly. The number of analysis jobs submitted by users each day, which was already at the computing model expectations at the time of ICHEP, is now 33% higher. We are expecting a busy holiday break to ensure samples are ready in time for the winter conferences.

Heavy Ion

The Tier 0 infrastructure was able to repack and promptly reconstruct the heavy-ion collision data. Two copies of the data were made at CERN using a large CASTOR disk pool, and the core physics sample was replicated to tape at a Tier 1. The decision of CMS not to zero-suppress the Tracker led to raw event sizes of greater than 10 MB/event, which placed a very heavy load on CASTOR. During repacking, where nearly equally sized events are read and written, and during reconstruction, where large raw events are read to produce smaller reconstructed events, the IO load on CASTOR was large: CMS routinely requested 5 GB/s of reads and 1 GB/s of writes, including to tape. The excellent performance of the CMS reconstruction code further increased the load on CASTOR during reconstruction. CMS was able to promptly reconstruct the complete heavy-ion datasets; our original estimate was that reconstruction would need to stretch into the technical stop and holiday break.

The performance of the accelerator during the first heavy-ion run was an interesting indication of the running scenarios we will see in proton-proton collisions in 2011. The machine was able to set up and collide within a few hours of dumping the previous beam. The reconstruction stretched into the inter-fill periods, and good resource utilisation was achieved.

Facilities and operations

The Facilities and Infrastructure Operations team has made important progress. Operational procedures covering the testing, deployment and maintenance of the new CMS workflow management tool, WMAgent, have been established in close collaboration with the CMS Computing Integration and Data Operations teams. Responsibility for the ongoing deployment of a new GlideIn WMS fabric at CERN continues. The team has participated actively in the testing of improved data management tools, in particular the PhEDEx data transfer service and the Frontier/Launchpad/Squid monitoring. Responsibility for migrating many central CMS services from real to virtual machines (VMs) at CERN, in collaboration with the CMS Offline project and CERN/IT, has continued. Advantage was taken of the LHC winter break to improve various important monitoring aspects of the project, in particular: testing and deploying Lemon alarming/recovery procedures for services running on VOBoxes at CERN; further improving the computing shift procedures, critical service recovery procedures and computing shift monitoring; and reinforcing the CMS site status and site downtime monitoring, in close collaboration with the CERN/Dashboard team.

The new Site Availability Monitoring machinery based on "Nagios" has been tested and is ready to deploy. Responsibility also continued for migrating the CMS analysis data currently stored on disk at CERN, together with the related data access patterns, from CASTOR-based storage to the new "EOS" storage solution proposed by CERN. Tests are ongoing, in close collaboration with the CERN/IT Department, with the goal of having the full CMS analysis data migrated to "EOS" by the end of 2011.

Data Operations

The Data Operations team worked very hard during the Christmas break to provide data and MC samples for the winter conferences. While producing samples for new analyses, Data Operations has also been actively involved in cleaning up samples that are no longer needed. Such clean-outs of older derived data samples will be a regular feature of 2011 and 2012. The oldest reconstruction passes for data and MC, as well as the full 8 TeV MC, were on the list for clean-up.

The Tier 0 successfully supported heavy-ion data-taking in November and December, and ran zero suppression for the heavy-ion data in February and March. It should be noted that two attempts were necessary because of a software problem. Once the data had been reprocessed, a skim was created from the new datasets.

In addition to the heavy-ion activities, the Tier 1s processed all 2010 data twice (on 4th and 22nd December). The Tier 1s were also engaged in processing all of the Fall ’10 MC, adding pile-up events.

At the Tier 1 level, a lot of work was invested in using the old MC production infrastructure to account for 100% of the re-reconstructed data and the re-digitised/re-reconstructed MC.

User Support

The User Support group conducted a tutorial and an analysis school in January 2011. One was the RooStats tutorial (21st January 2011). RooStats provides tools for the high-level statistics questions in ROOT and is built on RooFit, which provides the basic building blocks for these statistical questions. RooStats has been distributed with the ROOT release since version 5.22 and is being updated continuously.
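For readers unfamiliar with these packages, the following is a minimal illustrative sketch (not taken from the tutorial material) of the kind of workflow RooStats enables: a Gaussian model built with RooFit, a fit to a toy dataset, and a profile-likelihood interval computed with RooStats. All variable and object names are chosen purely for the example.

```python
# Minimal RooFit/RooStats sketch (illustrative only, not from the tutorial).
# Requires a ROOT build with RooFit/RooStats enabled (ROOT >= 5.22).
import ROOT

# Build a simple Gaussian model with RooFit: one observable and two parameters
x     = ROOT.RooRealVar("x", "observable", -10.0, 10.0)
mean  = ROOT.RooRealVar("mean", "mean", 0.0, -5.0, 5.0)
sigma = ROOT.RooRealVar("sigma", "sigma", 1.0, 0.1, 5.0)
model = ROOT.RooGaussian("model", "Gaussian PDF", x, mean, sigma)

# Generate a toy dataset and fit the model to it
data = model.generate(ROOT.RooArgSet(x), 1000)
model.fitTo(data)

# Use RooStats to compute a 68% profile-likelihood interval on the mean
plc = ROOT.RooStats.ProfileLikelihoodCalculator(data, model, ROOT.RooArgSet(mean))
plc.SetConfidenceLevel(0.68)
interval = plc.GetInterval()
print("mean interval: [%.3f, %.3f]" % (interval.LowerLimit(mean),
                                        interval.UpperLimit(mean)))
```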

The other tutorial was the very first CMS Data Analysis School (CMSDAS), held at the LHC Physics Center (LPC) at Fermilab. The school was designed to help CMS physicists from across the collaboration learn, or learn more, about CMS analysis and thereby participate in significant ways in the discovery and elucidation of new physics. The innovative classes allowed the students, in some cases with no prior experience, to get hands-on experience with real data and with physics measurements that CMS had published just days before, to make those measurements more precise, and to search for new processes that the collaboration has not yet tackled. The students are expected to continue the work on the measurements they started and to see it through to publication after the school is over. The school gives new members the opportunity to meet many of the experts in person and is a very good way of bringing a new generation of people into the experiment. Of the 100 participants, some came from as far away as Brazil, Korea and Europe. The school was a huge success and caught the interest of CMS management; plans are now in the works to hold the school annually at CMS institutions on different continents. The same school was held at the LPC, Fermilab, a year ago under the name Extended-JTerm, but the name has been changed to emphasise its primary focus: the analysis of real data and the opportunity to search for new physics. The material presented at the school is part of the CMS WorkBook.

The next periodic PAT tutorial will take place at CERN (4th-8th April 2011), following the current CMS week. There are plans to include other very useful physics analysis tools, such as tag-and-probe, Lumitools, Edmtools and FWLite, as part of the tutorials and to hold these on a regular basis, which is not the case now.

We wish to thank all the users involved in the tutorials for demonstrating a true collaborative spirit. An up-to-date list of tutorials held by the User Support group can be found at https://twiki.cern.ch/twiki/bin/view/CMS/Tutorials.

On the documentation side, the User Support team is working on updating the data format documentation, and all RECO and AOD format tables will be updated for CMSSW_4_2_0. To check the completeness of the tables, the event content is read out from the data files with the framework tools and compared to the documentation pages.
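As an illustration of how such a completeness check might be scripted (a hypothetical sketch, not the actual User Support tooling), one can list the branches of a RECO or AOD file with the CMSSW command-line tool edmDumpEventContent and compare the module labels found in the file against those on the documentation pages; the input file name and the documented-collection list below are placeholders.

```python
# Hypothetical documentation-completeness check; file names and the documented-
# collection list are placeholders, not the actual User Support tooling.
import subprocess

# Collections listed on the documentation pages (placeholder input file,
# one module label per line).
documented = set(l.strip() for l in open("documented_collections.txt") if l.strip())

# edmDumpEventContent (a CMSSW framework tool) prints the branches stored in an
# EDM file; each data row contains the C++ type followed by the quoted module
# label, product instance and process name.
output = subprocess.check_output(["edmDumpEventContent", "RECO_sample.root"]).decode()

in_file = set()
for line in output.splitlines():
    fields = line.split('"')
    if len(fields) >= 2 and fields[1]:
        in_file.add(fields[1])          # module label of the collection

print("In the file but not documented:", sorted(in_file - documented))
print("Documented but not in the file:", sorted(documented - in_file))
```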

Integration of Distributed Facilities and Services

The Distributed Computing Integration group has concentrated in the last few months on the integration, deployment, validation and testing of the new CMS workload management system, WMAgent. It is planned that WMAgent will replace the current production system (ProdAgent) within the next few months and will later serve as the basis for the next generation of the user analysis job management system (CRAB). This work has been done in close collaboration with the development and operations teams. The new system uses a common framework that is easier to maintain, and it is expected to be more reliable and to reduce the operational load. WMAgent integration test instances have been installed at CERN and FNAL. The system has been thoroughly exercised by means of large-scale tests and long-running continuous workflows at the Tier 1 sites. The testing helped identify various aspects of the system that can be improved in terms of reliability, scalability and ease of use. Changes are being implemented, and WMAgent is expected to be used for the reprocessing of the 2010 data at the beginning of April. The Distributed Computing Integration group will increase its personnel with a new Cat-A person from the beginning of April.

Analysis Operations

The volume of analysis activity in CMS has now stably reached the Computing TDR expectations, with about 400 different users every week and more than 100,000 grid jobs submitted daily, and has surpassed the Data Operations volume in number of jobs (i.e. operational units of service). Analysis activity briefly decreased over the December holidays before returning stably to the same level as in Fall 2010.



Figure 5: Total number of jobs per week in CMS Computing infrastructure from 1st January 2010 to 1st March 2011. Analysis has peaked at almost 2M jobs/week.

Figure 6: Number of CMS analysis jobs per week from 1st January 2010 to 1st March 2011. The activity level in 2011 is only modestly reduced with respect to the Fall 2010 peak.

Analysis Operations’ focus in the last few months has been on keeping the current system running and dealing with users’ problems. Some people have left the project, and while new effort is arriving in Spring 2011, the scope has so far had to be cut to the bare minimum.

Problem solving for users is still a substantial drain on effort, as the volume of mail on the user support forum has stayed constant in the 400-600/month range. To help in this respect, we have deployed a new log-collecting service and provided a number of fixes for the CRAB client and server to improve error diagnostics and reporting and to cure some common causes of failure.


Figure 7: Mail volume (number of messages) handled by Analysis Operations on the CrabFeedback forum in 2010 (left) and 2011 (right).

Analysis Operations is now managing about 2.5 PB of disk across all Tier 2s, and a massive clean-up campaign is about to start to free space for 2011 data. Effort in this area is stable and the procedure is under control.


by I. Fisk, P. Kreuzer, J. Flix, O. Gutsche, M. Klute, P. McBride, K. Lassila-Perini, S. Malik, G. Grandi, J. Hernandez, S. Belforte and F. Wuerthwein