RxRx1

NeurIPS 2019 competition
coming soon:

CellSignal: Disentangling biological signal from experimental noise in cellular images.

There is a clear link in machine learning innovation between the availability of better data and significant technological advancements.

Recursion is releasing the RxRx1 dataset to kickstart a flurry of innovation in machine learning on large biological datasets to impact drug discovery and development. RxRx1 is a dataset consisting of 296 GB of 16-bit fluorescent microscopy images, the result of the same experimental design being run multiple times with the primary differences between experiments being technical noise unrelated to the underlying biology. As such, RxRx1 provides a significant sample of controlled biological variability that is prime for training models to discern classes of cell morphology, independent from experimental batch variation. It’s important to note that RxRx1 has been created in a controlled manner to provide the appropriate data for discerning biological variation in its common context of changing experimental conditions.

Since 2013, Recursion has been generating the industry’s largest fully-relatable dataset of biological images representing human disease biology and pharmaceutical chemistry.

To date, we have generated over 2 petabytes of image data. RxRx1 represents a glimpse into the massive and truly unique dataset that is being generated at Recursion. RxRx1 is approximately 296 GB, consisting of 125,510 total images representing 1,108 classes. This is comparable to datasets such as ImageNet (ILSVRC2012), which is approximately 155 GB with 1.2 million images in 1,000 classes, and to other biological datasets such as BBBC017 (among others) from the Broad Institute of MIT and Harvard, which is about 56 GB with 64,512 total images representing 4,903 classes.

10+

Finding new drugs can take over 10 years and cost hundreds of millions of dollars.

Artificial intelligence has the potential to dramatically reframe the challenge of understanding how drugs interact with human cells. Recursion is reinventing drug discovery and development using machine learning and rich biological datasets generated in-house and purpose-built for machine learning algorithms. RxRx1 is a curated sample of this data that represents less than 1% of Recursion’s current weekly data generation.

One of the great promises of machine learning image classification is extending vision models to perform tasks that are not possible for humans to do.

RxRx1 presents such a task. Figure 1 demonstrates the complexity of identifying relevant biological variation and separating it from technical noise caused by batch effects. Even when experiments are designed to control for technical variables such as temperature, humidity, and reagent concentration, batch effects unavoidably enter into the data, resulting in images that contain factors of variation due to either biologically relevant variables or irrelevant technical variables. Batch effects threaten to confound any set of experiments across the entire field of biology. Machine disentanglement of batch effects from relevant biological variables would be applicable across the field and could have broad impacts on accelerating drug discovery and development.

Figure 1: Images of two different genetic conditions (rows) in HUVEC cells across four experimental batches (columns). Notice the visual similarities of images from the same batch.

THE BIOLOGY

The 6-channel fluorescent microscopy images that comprise the RxRx1 dataset illuminate different organelles of the cell - the nucleus, endoplasmic reticulum, actin cytoskeleton, nucleolus, mitochondria, and golgi apparatus.

The experiment uses a modified Cell Painting staining protocol (Cell Painting, a high-content image-based assay for morphological profiling using multiplexed fluorescent dyes, Bray et al., 2016), which uses 6 different stains that adhere to different parts of the cell. The stains fluoresce at different wavelengths and are therefore captured by different imaging channels; thus there are 6 images per imaging site in a well. Each image captures different morphology of the same segment of the well, like layers of a three-dimensional structure.

The images in RxRx1 are generated by carrying out biological experiments using reagents known as siRNAs. A small interfering RNA (siRNA) is a biological reagent used to knock down a particular gene, and every genetic perturbation used in the RxRx1 dataset is carried out via an siRNA. To understand these biological reagents, it’s important to review some key biological concepts.

Primer on the Central Dogma

Recall that each gene in our DNA encodes for a specific protein (sometimes several), and the process by which proteins are created involves transcription (reading the DNA to create complementary mRNA) and translation (reading the mRNA to link amino acids together to create a protein). These chains of amino acids are folded and further modified to yield functional proteins. This notion of information flowing from DNA to mRNA to proteins is called the central dogma of biology.

Figure 2a: Transcription: the synthesis of an mRNA copy of a segment of DNA

Figure 2b: Translation: the process of generating a polypeptide (protein) from mRNA

Modeling Genetic Loss-of-function

Loss of a gene's function can be modeled experimentally in a number of ways. One is to directly target the DNA and create a mutation that leads to a lack of protein production. Alternatively, one can target the mRNA molecules, degrading them before they are translated into proteins by the ribosome. This is, effectively, what siRNAs do, and so they provide a useful construct by which to model the loss-of-function of a particular gene (see Fig. 3). An siRNA is a 21-nucleotide RNA strand designed to be fully complementary to a specific section of an mRNA, enabling efficient binding and ultimately cleavage of the target mRNA. This allows siRNAs to target specific genes and significantly reduce the amount of the corresponding mRNA in a cellular system, effectively reducing the total amount of the associated protein.

Figure 3a: Depiction of full complementarity (intended) of an siRNA to an mRNA to knock down a particular target gene.

Figure 3b: Depiction of partial complementarity in the seed region of an siRNA, leading to partial knockdown of hundreds of additional genes.

However, siRNAs are known to have severe off-target effects - they not only degrade the targeted mRNA, but also can block translation of hundreds of additional mRNAs. This is done via the miRNA pathway, and such off-target effects are driven by the seed region (nucleotides 2-8) of the siRNA. These seed-based off-target effects dominate the signal in any siRNA-involved study, and thus to effectively model gene loss-of-function, one must use multiple siRNAs targeting each gene and a number of computational methods to determine if there is any particular gene-driven effect in an assay.
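The seed-region idea above can be illustrated with a few lines of string handling. This is only a sketch under simplifying assumptions: the function names are hypothetical, the sequences are arbitrary and not taken from RxRx1, and a "seed match" is reduced to an exact reverse-complement substring, which is a simplification of real targeting behavior.

```python
# Watson-Crick pairing for RNA: A-U and C-G.
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna: str) -> str:
    """Return the reverse complement of an RNA sequence written 5' to 3'."""
    return rna.translate(COMPLEMENT)[::-1]


def seed_region(guide: str) -> str:
    """Nucleotides 2-8 (1-based, counted from the 5' end) of an siRNA strand."""
    return guide[1:8]


def has_seed_match(guide: str, mrna: str) -> bool:
    """True if the mRNA contains a site complementary to the seed region only."""
    return reverse_complement(seed_region(guide)) in mrna


def is_fully_complementary(guide: str, mrna: str) -> bool:
    """True if the mRNA contains a site complementary to the whole strand."""
    return reverse_complement(guide) in mrna


# Arbitrary 21-nucleotide sequence, for illustration only.
guide = "UCGAAGUACUCAGCGUAAGUG"
intended_target = "AAA" + reverse_complement(guide) + "AAA"
unrelated_mrna = "AAAUACUUCGAAA"  # carries only the 7-nucleotide seed site

print(is_fully_complementary(guide, intended_target))
print(has_seed_match(guide, unrelated_mrna), is_fully_complementary(guide, unrelated_mrna))
```

A 7-nucleotide site is far more likely to occur by chance in an mRNA than a full 21-nucleotide site, which is one way to see why seed-driven effects can reach hundreds of genes.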

The purpose of this explanation is not to dive into the details of how to mitigate off-target effects, but rather to ensure that researchers working with the RxRx1 dataset sufficiently understand the underlying biology so as to focus on the right problems that can be addressed via this dataset. As no gene is targeted by more than 1 siRNA in the RxRx1 dataset, this dataset should not be used to try to identify gene-specific knockdown effects.

The combined effects of targeted knockdown and seed-based effects lead to observable morphology of a cell culture called a phenotype. Figure 4 shows 4 phenotypes of distinct siRNAs from a single experiment and plate. The phenotype is sometimes visually recognizable from the images, but often the difference in cell morphology is subtle and hard to detect by the human eye.

Since the images in RxRx1 are generated by carrying out biological experiments using siRNAs, which are designed to target and knock down a specific gene, it is tempting to use this data to identify gene-specific morphological changes. However, since siRNAs are known to have significant off-target effects, you would need data from many different siRNAs targeting the same gene, combined with computational methods for deconvolving the target signal from off-target effects. Since this dataset only includes one siRNA per gene, the data provided is insufficient for drawing gene-specific morphological conclusions.

Figure 4: Images of four different siRNA phenotypes in HUVEC (same experiment and plate).

THE DATA

RxRx1 includes data from 51 instances of the same experiment design executed in different experimental batches. In this experiment, we use 1,108 different siRNAs to knock down 1,108 different genes.

The experiment uses 384-well plates (see Fig. 5) to isolate populations of cells into wells, where exactly one of 1,108 different siRNAs is introduced into each well to create distinct genetic conditions. A well is like a single test tube at a small scale, 3.3 mm². The outer rows and columns of the plate are not used because they are subject to greater environmental effects, so there are 308 used wells on each plate. Each plate holds the same 30 control siRNA conditions, 277 different non-control siRNAs, and one untreated well. Covering all 1,108 non-control siRNAs at 277 per plate therefore takes 4 plates per experiment. The location of each of the 1,108 non-control siRNA conditions is randomized in each experiment to prevent confounding effects of the location of a particular well (see Plate Effects). Each well in each plate contains two 512 x 512 x 6 images, acquired from two non-overlapping regions of the well. Each of the 6 channels can be assigned a consistent color and composited for ease of reviewing (see Fig. 6); however, RxRx1 contains the 6-channel images and not the composite images.
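The plate arithmetic in this section can be checked in a few lines. The sketch below uses only numbers stated on this page (a 16 x 24 grid, 30 controls, one untreated well, 4 plates, 51 experiments, 2 sites per well); the variable names are illustrative.

```python
# A 384-well plate is a 16 x 24 grid; the outermost rows and columns are unused.
ROWS, COLS = 16, 24
used_wells_per_plate = (ROWS - 2) * (COLS - 2)

CONTROL_SIRNAS = 30
UNTREATED_WELLS = 1
non_control_per_plate = used_wells_per_plate - CONTROL_SIRNAS - UNTREATED_WELLS

PLATES_PER_EXPERIMENT = 4
non_control_sirnas = non_control_per_plate * PLATES_PER_EXPERIMENT

EXPERIMENTS = 51
SITES_PER_WELL = 2
# Upper bound on the image count if every site of every used well is present.
max_images = EXPERIMENTS * PLATES_PER_EXPERIMENT * used_wells_per_plate * SITES_PER_WELL

print(used_wells_per_plate, non_control_per_plate, non_control_sirnas, max_images)
# prints: 308 277 1108 125664
```

The stated total of 125,510 images is slightly below this upper bound, which suggests that a small number of sites are not part of the release.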

Figure 5: Schematic of a 384-well plate demonstrating imaging sites and 6-channel images. The 4-plate experiments in RxRx1 were run in the wells of such 384-well plates. We are releasing 2 6-channel imaging sites per well.

Figure 6: The top-left image is a composite of the 6 channels. It is followed by each of the 6 individual channel faux-colored images of HUVEC cells: nuclei (blue), endoplasmic reticuli (green), actin (red), nucleoli (cyan), mitochondria (magenta), and golgi apparatus (yellow). The overlap in channel content is due in part to the lack of complete spectral separation between fluorescent stains.
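A composite like the one in Figure 6 can be approximated by mapping each channel to its faux color and summing. This is a minimal sketch assuming NumPy and a channel order that matches the caption; the exact RGB triples, the per-channel scaling, and the function name are assumptions, not the procedure used to produce the figure.

```python
import numpy as np

# One RGB triple per channel, in the order given in the Figure 6 caption.
CHANNEL_COLORS = np.array(
    [
        [0, 0, 1],  # nuclei: blue
        [0, 1, 0],  # endoplasmic reticuli: green
        [1, 0, 0],  # actin: red
        [0, 1, 1],  # nucleoli: cyan
        [1, 0, 1],  # mitochondria: magenta
        [1, 1, 0],  # golgi apparatus: yellow
    ],
    dtype=np.float32,
)


def composite_rgb(site: np.ndarray) -> np.ndarray:
    """Blend an (H, W, 6) image into an (H, W, 3) RGB array with values in [0, 1]."""
    if site.ndim != 3 or site.shape[2] != 6:
        raise ValueError("expected an array of shape (H, W, 6)")
    x = site.astype(np.float32)
    # Scale each channel by its own maximum so that no single stain dominates.
    peak = x.max(axis=(0, 1), keepdims=True)
    x = x / np.where(peak > 0, peak, 1.0)
    # (H, W, 6) @ (6, 3) -> (H, W, 3): weighted sum of the six faux colors.
    rgb = x @ CHANNEL_COLORS
    return np.clip(rgb, 0.0, 1.0)


# Synthetic example: only the nuclei channel carries signal, so the result is blue.
demo = np.zeros((4, 4, 6), dtype=np.uint16)
demo[..., 0] = 1000
print(composite_rgb(demo)[0, 0])
```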

Each batch represents a single cell type: 24 in HUVEC, 11 in RPE, 11 in HepG2, and 5 in U2OS. Figure 7 shows the phenotype of a single siRNA in the four different cell types. For each image, the accompanying metadata provides the following information about the associated well: 1) its cell type, 2) its experiment, 3) its plate within the experiment, 4) its location on the plate, and 5) its siRNA. Since each of the 51 experiments was run in different batches, the images exhibit technical effects common to their batch and distinct from other batches; these batch effects are discussed further below.
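The five metadata fields listed above map naturally onto a small record type. The sketch below is illustrative only: the field names, the example values, and the `validate` helper are hypothetical and do not describe the actual file format of the release.

```python
from dataclasses import dataclass

# Number of experimental batches per cell type, as listed in the text.
BATCHES_PER_CELL_TYPE = {"HUVEC": 24, "RPE": 11, "HepG2": 11, "U2OS": 5}
PLATES_PER_EXPERIMENT = 4


@dataclass(frozen=True)
class WellMetadata:
    """The five metadata fields described above; field names are hypothetical."""

    cell_type: str
    experiment: str
    plate: int
    well: str
    sirna: int


def validate(record: WellMetadata) -> WellMetadata:
    """Reject records whose cell type or plate number falls outside the design."""
    if record.cell_type not in BATCHES_PER_CELL_TYPE:
        raise ValueError(f"unknown cell type: {record.cell_type!r}")
    if not 1 <= record.plate <= PLATES_PER_EXPERIMENT:
        raise ValueError(f"plate must be in 1..{PLATES_PER_EXPERIMENT}")
    return record


# Made-up example values, for illustration only.
example = validate(WellMetadata("HUVEC", "HUVEC-01", 1, "B03", 17))
print(example, sum(BATCHES_PER_CELL_TYPE.values()))
```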

Figure 7: Images of the same siRNA across four cell types: HUVEC, RPE, HepG2, U2OS.

When the images were originally created by Recursion, they were of size 2048 x 2048 x 6, but in order to make the dataset size more manageable, they were downsampled by a side-length factor of 2 and only the center 512 x 512 crop is provided.
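The size reduction described above (2048 to 1024 by downsampling, then a 512 x 512 center crop) can be sketched with NumPy. The text does not state which resampling method was used, so the 2 x 2 average pooling here is an assumption, as is the function name.

```python
import numpy as np


def downsample_and_crop(full: np.ndarray, factor: int = 2, crop: int = 512) -> np.ndarray:
    """Average-pool an (H, W, C) image by `factor`, then take the center crop."""
    h, w, c = full.shape
    if h % factor or w % factor:
        raise ValueError("height and width must be divisible by factor")
    # Group pixels into factor x factor blocks and average each block.
    # Average pooling is an assumption; the source does not name the method.
    pooled = full.reshape(h // factor, factor, w // factor, factor, c).mean(axis=(1, 3))
    ph, pw = pooled.shape[:2]
    if crop > ph or crop > pw:
        raise ValueError("crop is larger than the downsampled image")
    top, left = (ph - crop) // 2, (pw - crop) // 2
    return pooled[top:top + crop, left:left + crop]


# Small synthetic example: 8 x 8 -> 4 x 4 after pooling -> 2 x 2 center crop.
tiny = np.arange(64, dtype=np.float64).reshape(8, 8, 1)
print(downsample_and_crop(tiny, factor=2, crop=2)[..., 0])
```

With the default arguments, a 2048 x 2048 x 6 input yields a 512 x 512 x 6 output that covers only the central quarter of the original field of view.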

Batch Effects

As described above, each of the 51 experiments was executed in a different experimental batch. A batch is a set of experiment plates that are executed together, at the same time with the same materials. This means that all the plates within a batch are similar in their reagent synthesis, environmental conditions, etc., and plates from one batch differ from those from another batch in a consistent way. There are changes from batch to batch in environmental and experimental conditions that cause these effects. Examples of environmental conditions include humidity and temperature. Examples of experimental conditions include synthesis and concentration of reagents, as well as cell culture density. As seen in Figure 8, the batch effects are more visually salient than the relevant biological variation introduced by different siRNAs.

These batch effects are an inherent feature of experimentation and are unavoidably introduced into data collected across multiple batches. Any scientific conclusions drawn from such data should rely on the relevant biological variation in the data rather than on these incidental effects. A machine learning approach to separating batch effects from biological variation could be used widely in the field to extend the comparability of large image sets without a biologist needing to deconvolute the biological variation manually. RxRx1 therefore has the potential to spur innovation in models that overcome a problem that has long plagued the pharmaceutical industry.

Figure 8: Images of two different genetic conditions (rows) in HUVEC cells across four experimental batches (columns). Notice the visual similarities of images from the same batch.

Plate Effects

One particular set of metadata descriptors worth discussing more fully are experiment, plate, well, and site (see Fig. 5). These describe information about the physical location of each image in terms of the data generation process. Every image is taken of a particular site of a cell culture well on a 384-well plate. These cell cultures are distributed across a 16x24 grid of wells on a plate, and there are 4 plates per experiment in the RxRx1 dataset. Each experiment (set of 4 plates) was run in a different batch than the other experiments in RxRx1, such that the experimental noise that occurs due to slightly different conditions in the lab will take on a different form for each experiment. These are the batch effects referenced above. But there can be additional noise within an experiment driven by both inter- and intra-plate effects. An inter-plate effect is any effect primarily driven by the plate assignment within a batch (differences between plates), and an intra-plate effect is any effect primarily driven by the well assignment within a plate (differences between wells, or locations, within the same plate). All three of these sources of experimental variation may prove important to properly model the RxRx1 data, and the dataset has been generated in such a way that there are very few instances where a perturbation will be in the same well twice.
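When modeling intra-plate effects, it helps to turn a well label into grid coordinates. The sketch below assumes a row-letter plus column-number address such as "B03" (rows A to P, columns 1 to 24); that address format and the function names are assumptions for illustration.

```python
import string

ROWS, COLS = 16, 24
ROW_LETTERS = string.ascii_uppercase[:ROWS]  # "A" through "P"


def parse_well(address: str) -> tuple[int, int]:
    """Return zero-based (row, col) for an address such as 'B03'."""
    row = ROW_LETTERS.index(address[0].upper())  # raises ValueError if out of range
    col = int(address[1:]) - 1
    if not 0 <= col < COLS:
        raise ValueError(f"column out of range in {address!r}")
    return row, col


def is_outer_well(address: str) -> bool:
    """True for wells in the outermost rows or columns, which are left unused."""
    row, col = parse_well(address)
    return row in (0, ROWS - 1) or col in (0, COLS - 1)


all_wells = [f"{r}{c:02d}" for r in ROW_LETTERS for c in range(1, COLS + 1)]
inner_wells = [w for w in all_wells if not is_outer_well(w)]
print(parse_well("B03"), len(all_wells), len(inner_wells))
# prints: (1, 2) 384 308
```

The resulting (row, col) pair could then serve as a context feature alongside the plate and experiment identifiers.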

Positive and Negative Controls

In each experiment, the same 30 siRNAs appear on every plate as positive controls. In addition, there is one well per plate that is left untreated as a negative control. The 30 control siRNAs target 30 different genes and produce a variety of morphological effects. Together, these wells provide a set of reference controls on each plate.

RESEARCH AREAS

There are a number of areas of active machine learning research that could be furthered by the use of the RxRx1 dataset.

Generalization

Of obvious note are areas of generalization, as this dataset (and any biological dataset) contains non-random experimental effects which make generalization challenging. This dataset is well suited for tasks such as transfer learning (e.g. to a new cell type), domain adaptation (treating a new batch as a new target domain) and K-shot learning (a number of perturbations are present across every plate). While generalizability is important in every ML problem, it is of particular importance in working with biological datasets as mentioned above.

Context Modeling

Given the metadata associated with each image, the RxRx1 dataset provides a good opportunity for further research in context modeling. This could include using contexts such as cell types, plate and well assignments. The exploration of methods to use these contexts to enhance machine learning methods in their ability to represent the biological perturbations is an additional avenue of research with RxRx1.

Computer Vision

While much research has been done in computer vision across many domains, this dataset is large and rich and presents a very different data distribution than is found in most publicly available imaging datasets. Some of these differences include the relative independence of many of the channels (unlike RGB images) and the fact that each example is one of a population of objects treated similarly as opposed to singletons. The RxRx1 dataset presents an opportunity for further fundamental research in computer vision techniques.

THE COMPETITION

Using the RxRx1 dataset, we are sponsoring a NeurIPS 2019 competition called CellSignal to encourage researchers to explore methods of separating biological and technical factors in biological data.

The task is to correctly classify the perturbation present in each image in a held out set of experiments that were run in batches different from the experiments in the training set. Thus, in order for the classifier to generalize well to unseen batches, it must learn to separate biological and technical factors and make predictions only on the biology of the perturbation.
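The key property of this setup is that no experiment contributes images to both the training and the held-out set. A minimal sketch of such a split is shown below; the record layout (dictionaries with an "experiment" key) and the experiment identifiers are hypothetical.

```python
def split_by_experiment(records, held_out_experiments):
    """Split records so that whole experiments land on one side only."""
    held_out = set(held_out_experiments)
    train, test = [], []
    for record in records:
        (test if record["experiment"] in held_out else train).append(record)
    return train, test


# Made-up records: two images per experiment, three experiments.
records = [
    {"experiment": "E1", "sirna": 1},
    {"experiment": "E1", "sirna": 2},
    {"experiment": "E2", "sirna": 1},
    {"experiment": "E2", "sirna": 2},
    {"experiment": "E3", "sirna": 1},
    {"experiment": "E3", "sirna": 2},
]
train, test = split_by_experiment(records, {"E3"})

train_experiments = {r["experiment"] for r in train}
test_experiments = {r["experiment"] for r in test}
print(sorted(train_experiments), sorted(test_experiments))
# prints: ['E1', 'E2'] ['E3']
```

Splitting at random over individual images instead would leak batch information into the evaluation, since images from the same batch would appear on both sides.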

The evaluation metric will be the siRNA classification accuracy averaged over images. This metric is useful as an overall measure of the goodness of the classifier since the training and hold-out sets are approximately balanced across the siRNA classes, and the metric improves with each correctly classified image. And because the hold-out experiments are from entirely different batches than the training experiments, classifiers will have to generalize well to unseen experimental batches in order to score well on accuracy. In addition, since we will not have a separate task for accuracy on individual cell types, results will improve as the classifiers learn to do well on each cell type.
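Accuracy averaged over images is simply the fraction of images whose predicted siRNA equals the true one. The sketch below shows that computation with made-up labels; the function name is illustrative.

```python
NUM_CLASSES = 1108  # number of siRNA classes stated in the text


def accuracy(y_true, y_pred):
    """Fraction of images whose predicted siRNA label matches the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot score an empty set of images")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up labels for four images; three of the four predictions are correct.
print(accuracy([1, 2, 3, 4], [1, 2, 0, 4]))
# prints: 0.75

# With approximately balanced classes, uniform random guessing scores about 1/1108.
print(1 / NUM_CLASSES)
```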

This competition will be of interest to the rapidly growing community of researchers looking to apply machine learning methods to complex biological data sets, and especially those working on biological images. The specific task of removing experimental batch effects is highly relevant to the broader life sciences scientific community and can provide insights that enable researchers to develop improved methods for working with other experimental datasets. However, the competition itself should be of great interest to the larger community of machine learning researchers since the image set is large, systematically produced, and useful in more general areas of machine learning research as mentioned above.

The competition was held on Kaggle. Visit the Kaggle site to check out the leaderboard and forums.

THE ORGANIZING COMMITTEE