Scholarly Communication in High Energy Physics (HEP): Data Management

Other Resources

NSF web pages

U.S. Department of Energy


Most data in the sciences are created by and for research purposes.

The vast majority of scientific data in documentary form (e.g., text, numbers, and images) now are 'born digital,' including many types that previously were created in media such as film (e.g., X-ray images). While laboratory and field notes still exist on paper, a growing proportion of research notes are taken on handheld devices or laptop computers.

-(Borgman 2007, 182)

Particle Physics and Big Data

A year of collisions at a single LHC experiment generates close to 1 million petabytes of raw data. If they kept it all, scientists would be looking at enough data to fill roughly 1 billion good-sized hard drives from a computer retailer.

The LHC experiments took the concept of grid computing to the next level, building a globe-spanning network of distributed computing.

In short, says Amber Boehnlein, who leads scientific computing at SLAC, “CERN made an integrated, seamless system that allowed them to take advantage of global resources.”

Today the Worldwide LHC Computing Grid comprises some 200,000 processing cores and 150 petabytes of storage distributed across more than 150 computing centers in 36 countries. (Learn more about this distributed system in Deconstruction: big data.)

By sharing the load of storing, distributing and analyzing the LHC’s staggering data set, the grid allows each member to focus on only part of the data. This is especially important because once a researcher begins to analyze a piece of raw data to recreate particle collisions, that process creates new data, increasing the size of the entire data set exponentially. It’s helpful for this to happen only after the mammoth LHC data set has been chopped up according to collaborators’ needs.

The LHC Computing Grid also allows researchers to access LHC data a little closer to home.

(Symmetry Magazine)

National Science Foundation

Effective January 18, 2011, the National Science Foundation (NSF) requires a two-page Data Management Plan as a supplement to all NSF grant proposals. While several NSF Directorates, such as Engineering, already required a Data Management Plan, this is a new requirement for Directorates such as the Social, Behavioral and Economic Sciences. "This supplementary document should describe how the proposal will conform to NSF policy on the dissemination and sharing of research results" (NSF web site).

NSF suggests these basic elements for a Data Management Plan:

  1. the types of data, samples, physical collections, software, curriculum materials, and other materials to be produced in the course of the project;
  2. the standards to be used for data and metadata format and content (where existing standards are absent or deemed inadequate, this should be documented along with any proposed solutions or remedies);
  3. policies for access and sharing including provisions for appropriate protection of privacy, confidentiality, security, intellectual property, or other rights or requirements;
  4. policies and provisions for re-use, re-distribution, and the production of derivatives; and
  5. plans for archiving data, samples, and other research products, and for preservation of access to them.
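One way to put these five elements into practice is to draft the supplement as a short structured outline. The section headings below are a hypothetical arrangement mirroring the NSF list above, not an official NSF template:

```
Data Management Plan (max. 2 pages)

1. Expected data and materials
   - Types of data, samples, physical collections, software, and
     curriculum materials to be produced during the project
2. Data and metadata standards
   - Formats and content standards to be used; document any gaps in
     existing standards and proposed remedies
3. Access and sharing policies
   - Provisions for privacy, confidentiality, security, and
     intellectual-property protection
4. Re-use and re-distribution policies
   - Terms for re-use, re-distribution, and production of derivatives
5. Archiving and preservation
   - Where data, samples, and other research products will be archived,
     and how access to them will be preserved
```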


Data Solutions: Trigger Systems

Since the 1950s, high-energy physicists have had to contend with culling data immediately after each particle collision. By the 1970s they had come up with highly sophisticated and automated ways—called triggers—to separate interesting collisions from the more common ones.

For example, at the Large Hadron Collider, where proton bunches collide 20 million times in a single second, only a few out of every 1,000 collisions produce the sought-after physics.

So researchers set up a trigger system that examines every particle collision to conclude, based on pre-determined criteria, whether the collision produced particles of interest. If so, the data from that collision continue on to be recorded. If not, they’re discarded.

Over the years, physicists have improved trigger systems to the point that they can evaluate all the data from a collision and discard the information they don’t need in millionths of a second. At that rapid pace, no unwanted data ever clogs the pipes.
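The selection logic described above can be sketched as a simple software filter: examine each collision event against pre-determined criteria and keep only those that pass. This is an illustrative toy, not any experiment's actual trigger; the event fields (`energy_gev`, `n_tracks`) and the thresholds are made-up stand-ins for real selection criteria.

```python
import random

def passes_trigger(event):
    """Hypothetical pre-determined criteria: keep the event only if the
    deposited energy and track count both exceed illustrative thresholds."""
    return event["energy_gev"] > 100 and event["n_tracks"] >= 2

def run_trigger(events):
    """Evaluate every collision; record those of interest, discard the rest."""
    return [e for e in events if passes_trigger(e)]

# Simulate a batch of 1,000 collision events with mostly low-energy outcomes,
# so that (as in the text) only a few per thousand survive the trigger.
random.seed(0)
events = [
    {"energy_gev": random.expovariate(1 / 20), "n_tracks": random.randint(0, 5)}
    for _ in range(1000)
]
kept = run_trigger(events)
print(f"kept {len(kept)} of {len(events)} events")
```

In a real experiment the first selection stage runs in custom hardware and decides in microseconds, with later software stages applying progressively more detailed criteria; this sketch only shows the filtering idea.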
