Wednesday, October 9, 2013

Data Driven: The New Big Science: Chapter 4: Biology in the Era of Big Data

The Simons Foundation is Jim Simons' baby. Originally conceived to fund research in mathematics (Mr. Simons' specialty), the foundation has expanded its areas of interest to all basic science.
They put out a magazine called Quanta, which we last visited in "Maybe Space-Time is an Illusion".

From Quanta: 
Data Driven: The New Big Science
Our Bodies, Our Data
New technologies have launched the life sciences into the age of big data. Biologists must now make sense of their windfall.

Twenty years ago, sequencing the human genome was one of the most ambitious science projects ever attempted. Today, compared to the collection of genomes of the microorganisms living in our bodies, the ocean, the soil and elsewhere, each human genome, which easily fits on a DVD, is comparatively simple. Its 3 billion DNA base pairs and about 20,000 genes seem paltry next to the roughly 100 billion bases and millions of genes that make up the microbes found in the human body.

And a host of other variables accompanies that microbial DNA, including the age and health status of the microbial host, when and where the sample was collected, and how it was collected and processed. Take the mouth, populated by hundreds of species of microbes, with as many as tens of thousands of organisms living on each tooth. Beyond the challenges of analyzing all of these, scientists need to figure out how to reliably and reproducibly characterize the environment where they collect the data.

“There are the clinical measurements that periodontists use to describe the gum pocket, chemical measurements, the composition of fluid in the pocket, immunological measures,” said David Relman, a physician and microbiologist at Stanford University who studies the human microbiome. “It gets complex really fast.”

Ambitious attempts to study complex systems like the human microbiome mark biology’s arrival in the world of big data. The life sciences have long been considered a descriptive science — 10 years ago, the field was relatively data poor, and scientists could easily keep up with the data they generated. But with advances in genomics, imaging and other technologies, biologists are now generating data at crushing speeds.

One culprit is DNA sequencing, whose costs began to plunge about five years ago, falling even more quickly than the cost of computer chips. Since then, thousands of human genomes, along with those of thousands of other organisms, including plants, animals and microbes, have been deciphered. Public genome repositories, such as the one maintained by the National Center for Biotechnology Information, or NCBI, already house petabytes — millions of gigabytes — of data, and biologists around the world are churning out 15 petabases (a base is a letter of DNA) of sequence per year. If these were stored on regular DVDs, the resulting stack would be 2.2 miles tall.
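
A quick back-of-the-envelope check on that DVD figure (our arithmetic, not Quanta's). The article does not say how it did the conversion, so the sketch below assumes one byte per base, single-layer 4.7 GB discs, and bare discs 1.2 mm thick with no cases. All the names in it are ours.

```python
import math

# Assumptions (ours, not stated in the article):
BYTES_PER_BASE = 1.0        # one ASCII character per base, uncompressed
DVD_CAPACITY_BYTES = 4.7e9  # single-layer DVD
DVD_THICKNESS_M = 1.2e-3    # bare disc, no case
METERS_PER_MILE = 1609.344


def dvd_stack_miles(bases, bytes_per_base=BYTES_PER_BASE):
    """Height in miles of the DVD stack needed to hold `bases` DNA letters."""
    total_bytes = bases * bytes_per_base
    discs = math.ceil(total_bytes / DVD_CAPACITY_BYTES)
    return discs * DVD_THICKNESS_M / METERS_PER_MILE


if __name__ == "__main__":
    petabases_per_year = 15e15  # the article's figure: 15 petabases per year
    print(f"{dvd_stack_miles(petabases_per_year):.1f} miles at 1 byte per base")
    print(f"{dvd_stack_miles(petabases_per_year, 0.25):.1f} miles at 2 bits per base")
```

Under these assumptions the stack comes out at about 2.4 miles, in the same neighborhood as the article's 2.2. Packing each base into 2 bits (four letters need only two bits) would shrink it to well under a mile, so the quoted figure looks like it assumes roughly a byte per base.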

“The life sciences are becoming a big data enterprise,” said Eric Green, director of the National Human Genome Research Institute in Bethesda, Md. In a short period of time, he said, biologists are finding themselves unable to extract full value from the large amounts of data becoming available.

...MUCH MORE

Previously in the Data Driven: The New Big Science series:
Chapter 1: Who's Driving?
Imagining Data Without Division

Chapter 2: Digits in the Sky
A Digital Copy of the Universe, Encrypted

Chapter 3: Revolutionary Algorithms
The Mathematical Shape of Things to Come