Most data in the sciences are created by and for research purposes.
The vast majority of scientific data in documentary form (e.g., text, numbers, and images) are now 'born digital,' including many types that previously were created in media such as film (e.g., X-ray images). While laboratory and field notes still exist on paper, a growing proportion of research notes are taken on handheld devices or laptop computers.
A year of collisions at a single LHC experiment generates close to 1 million petabytes of raw data. If they kept it all, scientists would be looking at enough data to fill roughly 1 billion good-sized hard drives from a computer retailer.
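The arithmetic behind these round figures is easy to check. The sketch below assumes a 1-terabyte drive as the "good-sized" consumer unit; that capacity is our assumption, not a figure given in the text:

```python
# Back-of-the-envelope check of the data-volume claim above.
# Assumes decimal units and a 1 TB consumer drive (illustrative only).
PETABYTE = 10**15  # bytes
TERABYTE = 10**12  # bytes

raw_data = 1_000_000 * PETABYTE   # ~1 million petabytes of raw data per year
drive_capacity = 1 * TERABYTE     # one "good-sized" hard drive

drives_needed = raw_data // drive_capacity
print(f"{drives_needed:,} drives")  # → 1,000,000,000 drives — roughly 1 billion
```

One million petabytes at one terabyte per drive works out to the billion drives quoted above.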
The LHC experiments took the concept of grid computing to the next level, building a globe-spanning network of distributed computing.
In short, says Amber Boehnlein, who leads scientific computing at SLAC, “CERN made an integrated, seamless system that allowed them to take advantage of global resources.”
Today the Worldwide LHC Computing Grid comprises some 200,000 processing cores and 150 petabytes of storage distributed across more than 150 computing centers in 36 countries. (Learn more about this distributed system in Deconstruction: big data.)
By sharing the load of storing, distributing and analyzing the LHC’s staggering data set, the grid allows each member to focus on only part of the data. This is especially important because once a researcher begins to analyze a piece of raw data to recreate particle collisions, that process creates new data, increasing the size of the entire data set exponentially. It’s helpful for this to happen only after the mammoth LHC data set has been chopped up according to collaborators’ needs.
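The idea of chopping the data set up so each member handles only its share can be sketched in a few lines. The site names and the round-robin assignment below are purely illustrative, not how the Worldwide LHC Computing Grid actually schedules work:

```python
# Minimal sketch of partitioning data chunks among grid sites so that
# each member analyzes only part of the full data set.
# Site names and round-robin assignment are hypothetical.
from collections import defaultdict

def distribute(chunks, sites):
    """Assign data chunks to sites in round-robin order."""
    assignment = defaultdict(list)
    for i, chunk in enumerate(chunks):
        assignment[sites[i % len(sites)]].append(chunk)
    return assignment

chunks = [f"run-{n}" for n in range(10)]
sites = ["CERN", "Fermilab", "SLAC"]
plan = distribute(chunks, sites)
# Each site ends up holding only a slice of the whole data set
```

With ten chunks and three sites, no site holds more than four chunks; derived data produced during analysis stays local to the site that generated it.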
The LHC Computing Grid also allows researchers to access LHC data a little closer to home.
Effective January 18, 2011, the National Science Foundation (NSF) requires a two-page Data Management Plan as a supplement to all NSF grant proposals. While several NSF Directorates, such as Engineering, already required a Data Management Plan, this is a new requirement for Directorates such as the Social, Behavioral and Economic Sciences. "This supplementary document should describe how the proposal will conform to NSF policy on the dissemination and sharing of research results (NSF web site)."
NSF suggests these basic elements for a Data Management Plan:
Since the 1950s, high-energy physicists have had to contend with culling data immediately after each particle collision. By the 1970s they had come up with highly sophisticated and automated ways—called triggers—to separate interesting collisions from the more common ones.
For example, at the Large Hadron Collider, where proton bunches collide 20 million times in a single second, only a few out of every 1,000 collisions produce the sought-after physics.
So researchers set up a trigger system that examines every particle collision and decides, based on predetermined criteria, whether the collision produced particles of interest. If so, the data from that collision are recorded. If not, they're discarded.
Over the years, physicists have improved trigger systems to the point that they can evaluate all the data from a collision and discard the information they don’t need in millionths of a second. At that rapid pace, no unwanted data ever clogs the pipes.
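A trigger of this kind is, at heart, a fast predicate applied to every event. The toy model below invents an energy field and a threshold for illustration; real LHC triggers run in custom hardware and layered software, not plain Python:

```python
# Toy model of a trigger: inspect each collision event and keep only
# those meeting a predetermined criterion. The "energy" field and the
# threshold are invented for illustration.
import random

random.seed(42)  # deterministic toy data

def trigger(event, energy_threshold=95.0):
    """Keep an event only if its energy exceeds the threshold."""
    return event["energy"] > energy_threshold

# Simulate 1,000 collisions with random energies between 0 and 100
events = [{"id": i, "energy": random.uniform(0, 100)} for i in range(1000)]
kept = [e for e in events if trigger(e)]
discarded = len(events) - len(kept)
print(f"kept {len(kept)} of {len(events)} events")
```

As with the real system, the vast majority of events fail the criterion and are thrown away immediately, so only a small fraction ever reaches storage.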