Biopharma is Hungry for Data Scientists

November 9, 2016

November 17, 2016
By Mark Terry, BioSpace.com Breaking News Staff

Big data is big, and biopharma knows it. Data science is a relatively new field that has tended to be the purview of IT companies (and probably intelligence agencies). Now biopharma wants in and is willing to poach top people from Silicon Valley.

One example, as profiled by Ron Leuty in the San Francisco Business Times, is Amy Gershkoff, who has worked for eBay and most recently Zynga, best known for making games like FarmVille and Words With Friends. Next month, she will be joining Ancestry.com as its first chief data officer. Ancestry is a company with a lot of data—80 million family trees, supporting historical records, and 2.5 million genomic samples from its AncestryDNA database.

Big data is just that: enormous amounts of data. In IT, advances in computer processing have allowed it to be stored, handled and manipulated. And since the Human Genome Project got under way in the 1990s, followed by the advent of affordable gene sequencing platforms, enormous amounts of biological data have been generated.

“Decades ago, this could be processed by humans,” Gershkoff told Leuty. “Now we have algorithms to extract insights. We’re seeing a real shift to a doubling down in data science, because it’s not just about finding the insight but surfacing it for the consumer in a timely enough fashion that it’s still useful for them and easy to consume.”

Ryan McBride, writing recently for FierceBiotech, laid out opportunities for big data in biopharma. It is a list supported by MastersInDataScience.org. Let’s take a look at just a few of the uses of big data analytics in biopharma.

1. Genomics.

Each human genome is made up of three billion base pairs, which are organized into 20,000 to 25,000 genes. This totals around three gigabytes of data. That’s for a single person!
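The three-gigabyte figure can be checked with simple arithmetic. The sketch below assumes each base is stored as one text character (one byte), which is how the number is usually reached. The two-bit packing is shown only for comparison, since four possible bases fit in two bits. The function name is illustrative and not taken from any tool mentioned in this article.

```python
# Back-of-the-envelope storage estimate for a single human genome.
BASE_PAIRS = 3_000_000_000  # three billion base pairs, as quoted above


def genome_size_bytes(base_pairs: int, bits_per_base: int) -> int:
    """Return the storage needed in bytes, rounded up to a whole byte."""
    return (base_pairs * bits_per_base + 7) // 8


# Assumption: one ASCII character (8 bits) per base, as in a plain text file.
as_text = genome_size_bytes(BASE_PAIRS, 8)
# Assumption: a packed encoding using 2 bits per base (A, C, G, T).
packed = genome_size_bytes(BASE_PAIRS, 2)

print(f"One byte per base: {as_text / 1e9:.2f} GB")
print(f"Two bits per base: {packed / 1e9:.2f} GB")
```

Real sequencing output is typically far larger than either estimate, because raw reads cover each position many times and carry quality scores alongside the bases.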

In addition, these genes do not act in isolation; they interact with each other and with the environment. How they are turned on and off is another complicated variable.

2. The Human Microbiome.

Part of the above-mentioned environment is the microbes (bacteria, viruses and fungi) that live in or on the body. The NIH’s Human Microbiome Project has found more than 10,000 microbial species in the human body, which together carry more than 100 times as many genes as the human genome.

The Harvard School of Public Health used data science to identify about 350 of the most important organisms. Using DNA sequencing, researchers analyzed 3.5 terabytes of genomic data to pinpoint genetic “name tags” that show where those organisms are found and how they behave in a healthy population.

3. Crowdsourcing.

There is an online game called Foldit. In 2011, its players created an accurate 3D model of the M-PMV retroviral protease enzyme in the game. Before that, researchers had spent 15 years trying unsuccessfully to work out the structure.

They followed that up the next year by redesigning a protein to increase its activity more than 18-fold.

4. Synthesizing Diverse Data.

What the previous three points suggest is that biopharma has a lot of data from many different sources: proteins, DNA/RNA, clinical, social and environmental. Synthesizing it has obvious implications for drug research, but the same approach has applications on the business side as well. Master’s In Data Science notes that “Cambridge Semantics has developed semantic web technologies that help pharmaceutical companies sort and select which businesses to acquire and which drug compounds to license.”

5. Drug Recycling.

Atul Butte, a researcher at Stanford University, recently helped found NuMedii, which uses data science to sift through molecular data in order to discover new uses for old drugs. McBride wrote, “National Institutes of Health chief Francis Collins has endorsed such approaches, which offer a way to rapidly bring new (and affordable) treatments to patients using drugs with known safety profiles.”

6. Clinical Data.

The federal government has been pushing electronic health records (EHRs), which have digitized millions of medical histories. Although HIPAA privacy laws can create some boundaries, several companies are working on big data applications that de-identify patient information so the data can be searched and correlated for research purposes.
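To give a rough idea of what de-identification can involve, here is a minimal sketch. It is not any particular company's method and is not enough on its own to satisfy HIPAA. It assumes a record is a simple dictionary, and the field names (`patient_id`, `name`, `ssn` and so on) are hypothetical. Direct identifiers are dropped, and the patient ID is replaced with a keyed hash so that records for the same patient can still be linked.

```python
import hashlib
import hmac

# Hypothetical field names; real EHR schemas differ and contain many more
# identifying fields (dates, ZIP codes and so on) that would need handling.
DIRECT_IDENTIFIERS = {"name", "ssn", "address", "phone"}


def deidentify(record: dict, secret_key: bytes) -> dict:
    """Drop direct identifiers and swap patient_id for a keyed-hash pseudonym."""
    cleaned = {
        key: value
        for key, value in record.items()
        if key not in DIRECT_IDENTIFIERS and key != "patient_id"
    }
    # HMAC-SHA256 keeps the pseudonym stable for a given key and patient,
    # while the original ID cannot be looked up without the key.
    digest = hmac.new(
        secret_key, str(record["patient_id"]).encode("utf-8"), hashlib.sha256
    )
    cleaned["pseudonym"] = digest.hexdigest()
    return cleaned


# Made-up example record, for illustration only.
example = {
    "patient_id": 1001,
    "name": "Jane Doe",
    "ssn": "000-00-0000",
    "diagnosis": "type 2 diabetes",
}
result = deidentify(example, secret_key=b"demo-key-not-for-production")
print(sorted(result))
```

Because the same key always maps a patient to the same pseudonym, researchers can correlate visits and outcomes across records without seeing who the patient is.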

In many ways, the applications are endless.

In Star Wars: Episode II – Attack of the Clones, Obi-Wan Kenobi visits an old friend, Dexter Jettster. He’s interested in identifying a poisoned dart that’s not showing up in the Jedi archives. “Dex” comments, “I should think that you Jedi would have more respect for the difference between knowledge and wisdom.”

That distinction is part of the issue with all the data being generated these days. How should it be handled? And how can good data be sorted from bad, particularly as data keeps growing along the dimensions known as the 3Vs: volume, velocity and variety?

A good data scientist who can answer those questions will find himself or herself in huge demand in the biopharma industry.
