More Perspectives on Sharing Large Open Research Data Sets: Physics
“We’re probably going to see an increasing number of reports like Genetic Drivers of Immune Response to Cancer Discovered Through ‘Big Data’ Analysis where access to and analysis of a large body of previously collected data leads to significant findings.”
One area where data sharing is already active is in physics, as described by the University of Notre Dame’s Thomas McCauley in his presentation HEP Data for Everyone: CERN open data and the ATLAS and CMS experiments.
CERN launched its Open Data Portal for data from its Large Hadron Collider experiments in 2014. McCauley’s presentation puts this into historical context.
What I found most interesting about McCauley’s presentation is how CERN’s open data policies and practices are intertwined with CERN’s dual mission of research and education. The topics of CERN’s research, the data it generates, and the communities its programs serve are all complex. There are, as I like to say, “many moving parts.”
McCauley provides a nuanced view of open data in the CERN context that I believe helps make sense of a very complex situation. Slide number 5 describes four “levels of access to data”:
- Level 1: data directly related to publications
- Level 2: simplified data formats suitable for education and outreach
- Level 3: “analysis-level” reconstructed data, simulation, and software
- Level 4: raw data and associated software
The rest of the presentation is devoted to the tools and approaches for making different types of data available for different uses to different groups. It makes for instructive reading even if your open data program doesn’t reach the volume or complexity of CERN data.
Based on my own reading of the presentation I had some reactions that might be useful when applying these ideas to other programs.
It’s clear from McCauley’s presentation that what has evolved at CERN is significantly more than just “throwing data over the fence and hoping people will analyze it.” Serious “wrapper” services are required to make the data useful in different ways to different users, and it’s clear that time, attention, and money have been devoted to creating and sustaining those services. As I’ve noted in other open data contexts, addressing the “who pays for what” cost issues head-on is a must-have part of your strategy; this seems to have been done at CERN.
It’s also likely that what looks like a well-organized open data program now actually evolved not smoothly but in fits and starts. That’s not a criticism but the reality of what happens when you enter into a program where both the research and the methods for sharing data are evolving. You have to experiment to see what works. You also have to be ready for surprises – and failures. This is especially true when new methods and approaches for making data available and analyzable are being introduced (e.g., see Informatica Unveils Hourly-Priced AWS Data Management Tools).
Having a governance framework also helps. I don’t just mean a data governance framework that defines and maintains the quality and currency of data and metadata, but an approach to overall data program governance that is empowered to orchestrate, coordinate, and, where appropriate, require action. Such program governance will only be effective if it is closely aligned with the program itself, not just its “open data” goals.
Related reading:
- The Commerce Data Advisory Council's 2nd Meeting: Storytelling, Staff Recruiting, and Complex Processes
- Data Program Governance and the Success of Shared Digital Services
- Developing a Basic Model for Data Analytics Project Selection
- Introduction to PLANNING AND MANAGING BIG DATA PROJECTS: SELECTED ARTICLES
- Learning From General Electric’s Big Data Challenges
- Managing Data-Intensive Programs and Projects: Selected Articles
- Problems and Opportunities with Big Data: a Project Management Perspective
- Should Clinical Trial Data Sharing Be a Precondition for Refereed Journal Article Acceptance?
- Some Perspectives on Sharing Large Open Research Data Sets
- The Tip of the Spear: Connecting Big Data Project Management with Enterprise Data Strategy
- What Kind of Management Structure Is Needed to Govern a Data Analytics Program?
Copyright (c) 2016 by Dennis D. McDonald, Ph.D. An independent consultant located in Alexandria Virginia, Dennis’ interests include project, program, and data management; market assessment, digital strategy, and program planning; change management; and, technology adoption. Clients have included HHS CMS, U.S. Dept. of Veterans Affairs, National Academy of Engineering, the World Bank, and the U.S. Environmental Protection Agency. His professional web site is here: http://www.ddmcd.com. Follow Dennis on LinkedIn, Twitter, and Google+. Reach him by email at firstname.lastname@example.org.