LHC data to be made public via Open Access initiative

CMS has collected around 64 petabytes of analysable proton-proton data so far. Along with published papers, these data constitute the scientific legacy of the CMS collaboration, and preserving the data for future generations is crucial.


High-school students analysing CMS data. Image: Marzena Lapka.

“To preserve not only the data but also the information on how to use them, we intend to make available through open access our data that are no longer under active analysis,” says Kati Lassila-Perini, head of the CMS Data Preservation and Open Access project at the Helsinki Institute of Physics.

Although open scientific data potentially allow anyone to perform their own analyses, doing so is very difficult. CMS scientists working in groups take many months or even years to perform a single analysis. Each analysis must be scrutinised by the whole collaboration before a scientific paper can be published.

CMS therefore decided to launch a pilot project for its open data aimed at education. This project, in partnership with Finland’s IT Center for Science (the CSC) and partially funded by the Finnish Ministry of Education and Culture, will integrate CMS data into the physics curriculum of Finnish high schools.

CMS data are classified into four levels in increasing order of complexity. Level 1 is all data in CMS publications. Level 2 data are small samples selected for education programmes; while students get a feel for how physics analyses work, they cannot do any in-depth studies.

Level 3 is what CMS scientists use: it includes meaningful representations of the data along with simulations, documentation and software tools. CMS is making these analysable Level-3 data available publicly, in a first for high-energy physics. Level 4 consists of the so-called “raw” data – all the original collision data, in which no physics objects such as electrons and particle jets have yet been identified. These data will remain available only to the members of the collaboration.

Example of public CMS Level-2 data being used in an online event display. Image: Achintya Rao/Tom McCauley.

CMS wants to enable people outside the collaboration to build educational tools on top of its data, but performing a physics analysis requires lots of digital storage and distributed computing facilities. “If someone wants to download and play with our data,” cautions Lassila-Perini, “you can’t tell them to first download the CMS virtual-machine running environment, ensure that it is working and so on. We therefore need data centres like the CSC to be intermediate providers for applications that mimic our research environment on a small scale.”

Finland is an ideal place to pilot this programme. Some 75% of Finnish high schools have classes that have visited CERN, and thanks to CERN's teacher programmes many teachers are familiar with the basics of particle physics. An ongoing survey of the teachers will help the project understand their perspectives on teaching data analysis and take on board their ideas for potential applications.

Lassila-Perini has big dreams. “Imagine a repository of particle physics data to which schools can sign in,” she says. “They collaborate with other high schools, develop code together and perform analyses, much like how we work. It is important to teach not just the science but also how science works: particle physics research isn’t done in isolation but by people contributing to a common goal.”

Taking LHC data into classrooms

LHC data are also openly available through the CERN Masterclasses. This programme provides real experimental data from the ATLAS, LHCb, CMS and ALICE experiments for analysis.


by Achintya Rao