Data defenders

Uniting high-energy physics institutes, experimental collaborations and funding agencies, the Data Preservation in High Energy Physics (DPHEP) initiative has set out to change the way we save “information”.

When we talk about preserving physics results, our minds first turn to preserving raw data. But data preservation is much more than just keeping the bits; it also means preserving the software used to analyse them. Data needs to remain available once experiments end, and it needs to remain interpretable. Suppose a new theory or discovery arises and we need to revisit previous data sets with our new understanding. This could occur five, ten or fifty years from now... how can we ensure that the full potential of our data will still be accessible then?

Back in 2009, collaborations at CERN, DESY, SLAC and FNAL had a similar revelation. Their colliders were coming to the end of their lives and, if no action was taken, the data would effectively be lost forever. To tackle this issue, laboratories and experiments worldwide established a study group now known as DPHEP. The group published a thorough description of the problem, the DPHEP Blueprint, drawing the HEP community’s attention back to the issue and putting it at the top of the agenda. “The main priority was to ensure that data is not lost, since this has happened many times in the past,” says Cristinel Diaconu, who chairs the DPHEP initiative.

"We have a clear picture of the problem at hand, and while there are many projects that have been working out solutions to similar issues, they have yet to be put into place in the HEP community," confirms Jamie Shiers, CERN IT Department member and current DPHEP Project Manager. "That's where our initiative comes in. It is not only providing the IT support and resources but - more importantly - it sets out to change the paradigm we have about data preservation. We already know we can keep the bits, but unless physicists are involved, no one will know what to do with them! Data preservation is something that needs to be considered right from the start of an experiment, looking decades ahead if possible." With many funding agencies now requiring new projects to present a data and software management plan that includes preservation, there's also a financial motivation.

One of the solutions proposed by the DPHEP initiative is to introduce data preservation certification for all experimental projects, based on industry standards. “Rather than insisting on a single area of data preservation, the certification focuses on verifying the data’s overall accessibility using a balanced set of criteria,” explains Shiers.

With technology changing so rapidly, whatever hardware looks like the solution today may well be obsolete tomorrow. That said, one option that looks particularly promising is virtual machines (VMs). “CernVM takes a snapshot of an experiment’s software environments,” says Shiers. “These can be used for data preservation, with snapshots repackaged and accessed in the distant future. The first pilot project using CernVM is packaging together some of the 2010 CMS data and software environments. We want to prove that virtual machines will work over the long haul, and will check in on the packages in five years’ time to see whether any issues have arisen.” The CMS data package will be available for CERN’s 60th-anniversary outreach activities. Similar projects from ATLAS, ALICE and LHCb are in the works.
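The idea behind the CernVM approach is that the bits alone are not enough: the data must travel together with a frozen description of the software able to read them. As a rough illustration only (this is not DPHEP’s actual tooling; the script, the build_manifest helper and the manifest fields are all hypothetical), a minimal preservation manifest in Python might pair dataset checksums with a pointer to the VM snapshot:

```python
import hashlib
import json
import platform
import sys
from pathlib import Path


def sha256(path: Path) -> str:
    """Checksum one file so future readers can verify the bits are intact."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir: Path, vm_image: str, sw_release: str) -> dict:
    """Pair every data file with a pointer to the frozen software environment."""
    return {
        "vm_image": vm_image,            # hypothetical: identifier of the VM snapshot
        "software_release": sw_release,  # hypothetical: the experiment's software version
        "written_on": platform.platform(),
        "files": {
            str(p): sha256(p)
            for p in sorted(data_dir.rglob("*"))
            if p.is_file()
        },
    }


if __name__ == "__main__":
    # Usage: python manifest.py <data_dir> <vm_image> <sw_release>
    data_dir, vm_image, sw_release = sys.argv[1:4]
    print(json.dumps(build_manifest(Path(data_dir), vm_image, sw_release), indent=2))
```

Checking the stored checksums years later is then a mechanical test of whether the bits survived, while booting the referenced VM image tests whether the context needed to interpret them survived too.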

by Katarina Anthony