Recursion Releases Open-Source Data from Largest Ever Dataset of Biological Images, Inviting Data Science Community to Develop New and Improved Machine Learning Algorithms for the Life Sciences Industry
Clinical-stage biotech offering unique access to cell biology images with the goal of driving more effective artificial intelligence methods in drug discovery and development
SALT LAKE CITY--(BUSINESS WIRE)-- Recursion, a Fast Company “Most Innovative Company” and a leader in the movement to apply artificial intelligence to drug discovery, today announced it will open-source a glimpse of the massive biological dataset the company has been building for more than five years. At more than two petabytes, and across more than 10 million different biological contexts, Recursion’s data is the world’s largest image-based dataset designed specifically for the development of machine learning algorithms in experimental biology and drug discovery.
The announcement was made at the global machine learning conference, ICLR 2019, and will be accompanied by a competition available through the NeurIPS 2019 Competition Track and co-sponsored by NVIDIA and Google Cloud. The goal of the competition is to inspire the development of effective machine learning methods that can identify representations of biology from the complex experimental dataset, called RxRx1.
“To answer fundamental questions facing biology and disease, and reimagine the drug discovery paradigm, we’re building the world’s largest, relatable, empirical biological dataset,” said Chris Gibson, Ph.D., CEO, Recursion. “The RxRx1 dataset we’re announcing today represents an important resource for the machine learning community, with more than 100,000 images and 300-plus gigabytes of data representing diverse biological contexts. Yet despite the massive scale of this dataset, it represents just 0.4 percent of what we generate at Recursion on a weekly basis. We expect that the richness of this dataset, combined with the context surrounding the scale of our efforts, will inspire the world’s machine learning and AI community to help us in our mission to decode biology to radically improve lives.”
Added Gibson, “If we are successful in our collective efforts, not only will new treatments make it to market faster, but more companies will be incentivized to develop new drugs for smaller markets, such as rare diseases, where many patients still face a major unmet need.”
The RxRx1 dataset is composed of images of human cells from more than 1,000 experimental conditions with dozens of biological replicates produced weeks and months apart in a variety of human cell types. These data were generated at multiple Recursion sites under the highly controlled experimental procedures characteristic of Recursion’s process. However, each batch of experimental data contains unique experimental variations, giving data scientists a rich proving ground to experiment with methods to tackle the noise inherent in even the most well-run empirical studies.
Experimental complexity and variability are major challenges in the application of machine learning to biological datasets, particularly in drug discovery. While machine learning approaches have the potential to accelerate drug discovery, fundamental challenges remain in combating that complexity and variability and in ensuring that algorithms are tuned in to fundamental biology rather than to experimental heterogeneity in the data.
“This dataset provides a great playground for those working in multiple areas of machine learning research, such as domain adaptation and k-shot learning,” said Berton Earnshaw, Vice President of Data Science, Recursion. “Developing methods to account for the non-random experimental noise is something that should be of interest to those beyond just the life science community.”
New methods – including those derived from the NeurIPS competition – that effectively control for experimental heterogeneity in machine learning datasets will revolutionize large-scale biological data analysis and lead to greatly improved drug discovery applications and insights.
“Advances in machine learning methods outside of the life sciences have already been accelerated through the availability of large-scale public datasets, such as ImageNet and COCO, among many others,” said Mason Victors, Chief Technology Officer and Chief Product Officer, Recursion. “Like these initiatives, we aim to create resources that will enable the community to collectively identify and adopt new machine learning methods that benefit the entire life sciences industry. We are excited to provide the data science community with the first longitudinally-generated, human cell biology image dataset to facilitate new machine learning applications. Best of luck to those in the competition, we’re rooting for you.”
Recursion is a clinical-stage biotechnology company combining experimental biology and automation with artificial intelligence in a massively parallel system to efficiently discover potential drugs for diverse indications, including genetic disease, inflammation, immunology, and infectious disease. Recursion applies causative perturbations to human cells to generate disease models and associated biological image data. Recursion’s rich, relatable database of more than two petabytes of biological images generated in-house on the company’s robotics platform enables advanced machine learning approaches to reveal drug candidates, mechanisms of action, and potential toxicity, with the eventual goal of decoding biology and advancing new therapeutics to radically improve lives. Recursion is headquartered in Salt Lake City and in 2019 was designated a Fast Company “Most Innovative Company.” Learn more at www.recursionpharma.com, or connect on Twitter, Facebook, and LinkedIn.
Media contact: Jessica Yingling, Ph.D., President, Little Dog Communications Inc.