Another way of managing large amounts of data

Jeff Hammerbacher is Vice President of Products and Chief Scientist at Cloudera, a US software company that provides solutions for managing and analysing very large data sets. His invited talk on 21 August was a good opportunity to exchange views with the CERN experts who face similar problems.

Although still relatively young, Jeff has considerable experience in developing tools for storing and processing large amounts of data. Before joining Cloudera, he conceived, built and led the Data team at Facebook. He has also worked as a quantitative analyst on Wall Street. Jeff holds a Bachelor's degree in mathematics from Harvard University.

At CERN, handling large amounts of data is the job of the Grid; Hadoop, the software platform Cloudera develops, addresses the same problem but with different technical features and a different implementation. "The Grid software products are designed for many organisations to collaborate on large-scale data analysis across many data centres. In contrast, Hadoop is designed to optimize large-scale data storage and processing for a single organisation using many servers in a single data centre", explains Jeff. "We do not use Grid software at Cloudera. However, at the University of Nebraska-Lincoln, they export data stored in their Hadoop cluster to the Grid via the GridFTP software (see http://www.cloudera.com/blog/2009/05/01/high-energy-hadoop/ for more details), so there is some opportunity for Hadoop clusters to serve as a single site within a larger Grid".
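For readers curious about what such an export looks like in practice, the sketch below uses Hadoop's standard Java FileSystem API to stage a file out of HDFS onto local disk, from where a GridFTP client such as globus-url-copy could transfer it to another Grid site. The NameNode address and file paths are hypothetical; this is only an illustration of the general idea, not the Nebraska-Lincoln configuration.

    // Minimal sketch: stage a dataset out of HDFS before a GridFTP transfer.
    // The NameNode address and file paths below are hypothetical.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsToGridStaging {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Point at the cluster's NameNode (hypothetical host and port).
            conf.set("fs.default.name", "hdfs://namenode.example.org:8020");
            FileSystem fs = FileSystem.get(conf);

            // Copy the file from HDFS to local disk; a GridFTP client would
            // then handle the wide-area transfer to a remote Grid site.
            Path src = new Path("/data/cms/run042/events.root");    // hypothetical HDFS path
            Path dst = new Path("file:///tmp/staging/events.root"); // hypothetical local path
            fs.copyToLocalFile(src, dst);
            fs.close();
        }
    }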

A lot of research and development has been carried out in several high-energy physics (HEP) laboratories to cope with ever-increasing data volumes, and the LHC, which will produce some 15 petabytes of data every year, will be a very powerful test bench. "At Cloudera, we've been in close contact with a few HEP labs storing hundreds of terabytes of data in HDFS, the storage component of the Hadoop software. In fact, HDFS is now installed at two CMS Tier-2 sites in the US, two CMS Tier-3 sites and one non-LHC Grid site", says Jeff. "Given the success of Hadoop at other sites, we have reason to believe that the experts at CERN will find some value in the software".

The core of Cloudera’s offerings is based on open source software. Why? "In my experience, a talented team in a vacuum cannot produce great, mature software," says Jeff. "You need a difficult problem to serve as a foil. Making your code open source provides the best means of exposing software to demanding users and difficult problems. Building a map of all documents and links on the web was an immense problem that Yahoo! was able to solve with Hadoop, and the project is far better today because of it. Similarly, building a multi-petabyte data warehouse with millions of users was a problem Facebook was able to solve with Hadoop, and the rest of the community now benefits from their contributions.

"Another reason open source software is at the core of our offerings is our belief in intellectual honesty and showing your work. No matter what you read, you can always download the source of our distribution of Hadoop and try things out for yourself. The team at Yahoo! has done a great job benchmarking Hadoop, breaking world records in the process, and has made their benchmarking code and configuration available for you to run yourself. If you're going to store petabytes of data for many years, that sort of transparency is critical. In this regard, the value of open source is similar to the value of reproducible research in science".

Jeff’s visit to CERN provided the opportunity to start an informal collaboration between CERN and Cloudera. "We’re all big data junkies at Cloudera, having come from places like Google, Facebook, and Yahoo!, and we’re always on the lookout for bigger data problems to solve—and it doesn’t get much bigger than the LHC!" he concludes.

The video of Jeff Hammerbacher’s presentation at CERN