INTRODUCTION

Building Maps of Biology and Chemistry

At Recursion, we build maps of biology and chemistry to explore uncharted areas of disease biology, unravel its complexity, and industrialize drug discovery. Just as a map helps to navigate the physical world, our maps are designed to help us understand as much as we can about the connectedness of human biology so we can navigate the path to new medicines more efficiently.

Our maps are built from high-dimensional, image-based data generated in-house. We conduct up to 2.2 million experiments every week in our highly automated labs, where deep learning models embed billions of images of human cells that have been manipulated by CRISPR/Cas9-mediated gene knockouts, compounds, or other reagents. The resulting representations can be compared and contrasted to predict trillions of relationships across biology and chemistry, even without physically testing every possible combination. Recursion's Maps and associated applications help navigate complex biology and chemistry by revealing relationships across genes and chemical compounds.
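As a rough illustration of what comparing embedded representations can look like, the sketch below ranks perturbations by cosine similarity to a query perturbation. Everything in it is an assumption made for illustration: the perturbation names, the four-dimensional toy vectors, and the choice of cosine similarity are hypothetical and are not taken from Recursion's pipeline or the RxRx3 file format.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_neighbors(query, embeddings):
    """Return (name, similarity) for every other perturbation, most similar first."""
    scores = [
        (name, cosine_similarity(embeddings[query], vector))
        for name, vector in embeddings.items()
        if name != query
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical toy embeddings; real embeddings would have far more dimensions.
embeddings = {
    "gene_A_knockout": [0.9, 0.1, -0.3, 0.2],
    "gene_B_knockout": [-0.7, 0.4, 0.5, -0.1],
    "compound_X_1uM": [0.8, 0.2, -0.4, 0.1],
}

ranking = rank_neighbors("compound_X_1uM", embeddings)
for name, similarity in ranking:
    print(f"{name}: {similarity:.3f}")
```

In this toy setup the compound's embedding points in nearly the same direction as the gene A knockout and away from the gene B knockout, which is the kind of signal a map could surface as a candidate relationship without the two ever being tested together.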

RxRx3 is a publicly available map of biology that represents a small subset – less than 1% – of Recursion’s total dataset. MolRec™ is a simple demo of the kind of application that can be built on this type of map.

17,063 genes profiled*
spanning CRISPR knockouts of most of the human genome

2.2M images of HUVEC cells
associated deep learning embeddings of each image also included

1,674 known chemical entities at 8 concentrations each
FDA-approved and commercially available bioactive compounds, plus tens of thousands of control images

<1% of Recursion’s total dataset

*Approximately 16,000 of these genes are anonymized in the dataset, enabling people to explore and learn from this massive dataset while protecting Recursion’s business interests. Recursion may de-anonymize genes in this dataset in the future.
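To give a sense of scale for the figures above, the short calculation below counts how many distinct perturbation conditions they imply and how many pairwise comparisons are possible among them. It is a back-of-the-envelope sketch under a simplifying assumption: each gene knockout and each compound-concentration pair is treated as a single condition, and controls or replicate structure are ignored.

```python
from math import comb

# Figures quoted on this page.
GENE_KNOCKOUTS = 17_063
COMPOUNDS = 1_674
CONCENTRATIONS_PER_COMPOUND = 8

# Assumption: one condition per gene knockout and per compound-concentration pair.
compound_conditions = COMPOUNDS * CONCENTRATIONS_PER_COMPOUND
total_conditions = GENE_KNOCKOUTS + compound_conditions

# Number of unordered pairs of distinct conditions: n choose 2.
pairwise_comparisons = comb(total_conditions, 2)

print(f"compound conditions:  {compound_conditions:,}")
print(f"total conditions:     {total_conditions:,}")
print(f"pairwise comparisons: {pairwise_comparisons:,}")
```

Under these assumptions the subset alone yields roughly 464 million possible pairwise comparisons, which illustrates why the number of candidate relationships grows so quickly as a map expands.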

THE POWER OF DATASET RELEASES

Progress in machine learning is punctuated by seminal dataset releases. Perhaps the most famous of these is ImageNet, which helped usher in the next generation of computer vision models. Fei-Fei Li, creator of ImageNet, set out to “...map out the entire world of objects” so that models would be trained on realistic data. Just as ImageNet mapped out the world of objects, RxRx3 and the broader RxRx.ai dataset family are mapping out biological and chemical space.

RxRx3 is one of the largest collections of cellular screening data, if not the largest, and as far as we know it is the largest generated consistently in a single process at a single site. Our goal is to enable the next generation of machine learning methodologies on these data and to foster research, methods development, and collaboration.

Comparison with Other Computer Vision Datasets


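Category (approx. size) | Dataset | Released | # of Samples
Bio/Chem Phenomic Maps (~100 TB) | RxRx3 | 2023 | 2.2M
Bio/Chem Phenomic Maps (~100 TB) | JUMPCP | 2023 | 823,438
Autonomous Driving (~1-5 TB) | Waymo Open Dataset | 2018 | ~105,000
Autonomous Driving (~1-5 TB) | nuScenes | 2018 | 1,000
Image / Object Recognition (10 GB - ~1 TB) | ImageNet (21k) | 2009 | 14M
Image / Object Recognition (10 GB - ~1 TB) | COCO | 2014 | 330,000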

The RxRx3 dataset is closely related to datasets previously released by Recursion, although there are some key differences. For ease of comparison and understanding, we provide the following table highlighting the primary differences:

Dataset | RxRx1 | RxRx2 | RxRx19a | RxRx19b | RxRx3
Release Date | June 2019 | August 2020 | April 2020 | August 2020 | January 2023
Cell Types | HUVEC, RPE, U2OS, HepG2 | HUVEC | HRCE, Vero | HUVEC | HUVEC
Stains (Channels) | Hoechst, ConA, Phalloidin, Syto14, MitoTracker, WGA | Hoechst, ConA, Phalloidin, Syto14, MitoTracker, WGA | Hoechst, ConA, Phalloidin, Syto14, WGA | Hoechst, ConA, Phalloidin, Syto14, MitoTracker, WGA | Hoechst, ConA, Phalloidin, Syto14, MitoTracker, WGA
Plate Density | 384-well | 1536-well | 1536-well | 1536-well | 1536-well
Imaging Sites per Well | 2 | 4 | 4 | 1 | 1
Perturbations Evaluated | 1,138 siRNAs | 434 soluble factors at 6 concentrations | 1,672 small molecules at 6+ concentrations; three viral conditions (active virus, irradiated, mock) | 1,856 small molecules at 4-6 concentrations in three COVID-19-associated cytokine storm conditions (severe storm, healthy, and no cytokines) | 17,063 CRISPR/Cas9-mediated gene knockouts; 1,674 compounds at 8 concentrations each
Total Number of Images | 125,510 | 131,953 | 305,520 | 70,384 | ~2.2M
Image Dimension | 512x512x6 | 1024x1024x6 | 1024x1024x5 | 2048x2048x6 | 2048x2048x6
Compressed Dataset Size | ~46 GB | ~185 GB | ~450 GB | ~409 GB | ~83,100 GB

DOWNLOAD

Download links for RxRx3 are currently only available in the MolRec application.

You can view the README for the dataset here.


