Recursion is releasing the RxRx1 dataset to kickstart a flurry of innovation in machine learning on large biological datasets to impact drug discovery and development. RxRx1 is a dataset consisting of 296 GB of fluorescent microscopy images, the result of the same experimental design being run multiple times with the primary differences between experiments being limited to technical noise, and unrelated to the underlying biology. As such, RxRx1 provides a significant sample of controlled biological variability that is prime for training models to discern classes of cell morphology, independent from experimental batch variation. It’s important to note that RxRx1 has been created in a controlled manner to provide the appropriate data for discerning biological variation in its common context of changing experimental conditions.
To date, we have generated over 2 PB of image data. RxRx1 represents a glimpse into the massive and truly unique dataset that is being generated at Recursion. RxRx1 is approximately 296 GB, consisting of 125,568 total images representing 1,108 classes. This is comparable to datasets such as ImageNet (ILSVRC2012), which is approximately 155 GB with 1.2 million images in 1,000 classes, and to other biological datasets such as BBBC017 (among others) from the Broad Institute of MIT and Harvard, which is about 56 GB with 64,512 total images representing 4,903 classes.
Artificial Intelligence has the potential to dramatically reframe the challenge of understanding how drugs interact with human cells. Recursion is reinventing drug discovery and development using machine learning and rich biological datasets that are generated in-house and purpose-built for machine learning algorithms. RxRx1 is a curated sample of this data that represents 0.4% of Recursion’s current weekly data generation.
RxRx1 presents such a task. The following images demonstrate the complexity of identifying relevant biological variation and separating it from technical noise caused by batch effects. Even when experiments are designed to control for technical variables such as temperature, humidity, and reagent concentration, batch effects unavoidably enter into the data, resulting in data that contain factors of variation due to either biologically relevant variables or irrelevant technical variables. Batch effects threaten to confound any set of experiments across the entire field of biology. Machine disentanglement of batch effects from relevant biological variables would be applicable across the field and could have broad impacts on accelerating drug discovery and development.
As the images in RxRx1 are generated by carrying out biological experiments using reagents known as siRNAs, which are designed to target and knock down a specific gene (more on this in another section), some may be tempted to use this to identify gene-specific morphological changes. DO NOT DO THIS. siRNAs are known to have significant off-target effects, which can only be overcome, if at all, by combining a number of computational methods with multiple siRNAs per gene. As this dataset only includes one siRNA per gene for a random subset of genes, do not attempt to identify gene-specific signal. There are many ways you can convince yourself you have succeeded in this. You will be wrong. The data provided is insufficient for that task, and should thus be used to conduct research focused on alternative problems only. Just for clarity because we know somebody will ignore the warnings above, we’ll state it again more clearly: DO NOT USE THIS DATASET TO TRY TO GET AT GENE-SPECIFIC CHANGES. IT WILL NOT WORK.
As mentioned elsewhere, the images in RxRx1 are generated by carrying out biological experiments using reagents known as siRNAs. A small interfering RNA (siRNA) is a biological reagent used to knock down a particular gene, and every genetic perturbation used in the RxRx1 dataset is carried out via an siRNA. To understand these biological reagents, it’s important to review some key biological concepts.
Recall that each gene in our DNA encodes a specific protein (sometimes several), and the process by which proteins are created involves transcription (reading the DNA to create complementary mRNA) and translation (reading the mRNA to link amino acids together to create a protein). These chains of amino acids linked together during translation are folded and further modified to yield functional proteins.
The loss of the product(s) of a single gene in a cellular system can have devastating effects, leading to any of thousands of diseases. This often happens via mutations in the DNA that then prevent proper transcription or translation.
There are a number of ways to model this experimentally, one of which is to directly target the DNA and create a mutation that will lead to a lack of protein production. Alternatively, one can target the mRNA, degrading it before it is translated into protein by the ribosome. This is, effectively, what siRNAs do, and so they provide a useful construct by which to model the loss-of-function of a particular gene. An siRNA is a 21-nucleotide RNA strand designed to be fully complementary to a specific section of an mRNA, enabling efficient binding and ultimately cleavage of the target mRNA. This allows siRNAs to target specific genes and significantly reduce the amount of the corresponding mRNA present in a cellular system, effectively reducing the total amount of the associated protein.
However, siRNAs are known to have severe off-target effects - they not only degrade the targeted mRNA, but also can block translation of hundreds of additional mRNAs. This is done via the miRNA pathway, and such off-target effects are driven by the seed region (nucleotides 2-7) of the siRNA. These seed-based off-target effects dominate the signal in any siRNA-involved study, and thus to effectively model gene loss-of-function, one must use multiple siRNAs targeting each gene and a number of computational methods to determine if there is any particular gene-driven effect in an assay.
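To make the seed region concrete, here is a minimal sketch that extracts nucleotides 2-7 from an siRNA sequence and groups siRNAs that share a seed. It assumes each sequence is written 5' to 3'. The sequences are made up for illustration and are not RxRx1 reagents, and the function names are hypothetical.

```python
from collections import defaultdict


def seed_region(sequence: str) -> str:
    """Return nucleotides 2-7 (1-indexed, inclusive) of a 5'->3' sequence."""
    if len(sequence) < 7:
        raise ValueError("sequence is too short to contain a seed region")
    # Positions 2..7 in 1-indexed terms are the slice [1:7] in Python.
    return sequence[1:7]


def group_by_seed(sirnas: dict) -> dict:
    """Map each seed to the names of the siRNAs that carry it."""
    groups = defaultdict(list)
    for name, sequence in sirnas.items():
        groups[seed_region(sequence)].append(name)
    return dict(groups)


# Made-up 21-nucleotide sequences, for illustration only.
EXAMPLE_SIRNAS = {
    "sirna_a": "UUCGAAGUACUCAGCGUAAGU",
    "sirna_b": "AUCGAAGCCGUAUGCAUCGAU",
    "sirna_c": "GGAUCCAUGCUAGCUAGGCAU",
}

seed_groups = group_by_seed(EXAMPLE_SIRNAS)
for seed, names in seed_groups.items():
    print(seed, names)
```

In this toy input, sirna_a and sirna_b differ almost everywhere outside positions 2-7 yet share a seed, which is the situation in which seed-driven off-target effects would be expected to look alike even though the intended targets differ.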
The purpose of this explanation is not to dive into the details of how to mitigate off-target effects, but rather to ensure that researchers working with the RxRx1 dataset sufficiently understand the underlying biology so as to focus on the right problems that can be addressed via this dataset. As no gene is targeted by more than 1 siRNA in the RxRx1 dataset, this dataset should not be used to try to identify gene-specific knockdown effects.
For each experiment, the dataset usually contains two 512 x 512 x 6 images for each of the 1,108 siRNAs. The images were acquired from two non-overlapping regions of each well. The 6 channels of an image illuminate different organelles of the cell - the nucleus, endoplasmic reticulum, actin cytoskeleton, nucleolus, mitochondria, and Golgi apparatus. Our experiment uses a modified Cell Painting staining protocol (Cell Painting, a high-content image-based assay for morphological profiling using multiplexed fluorescent dyes, Bray et al., 2016), which uses 6 different stains that label different morphological features of the cell. The stains fluoresce at different wavelengths and are therefore captured by different imaging channels; thus there are 6 images per imaging site in a well. Each image captures different morphology of the same segment of the well, like layers of a three-dimensional structure. These images are assigned a consistent color and can be composited for ease of reviewing (see Fig. 2). RxRx1 contains the 6-channel images and not the composite images.
RxRx1 includes data from 51 instances of the same experiment design executed in different experiment batches. In this experiment, we use 1,108 different siRNAs to knock down 1,108 different genes. The experiment uses 384-well plates (see Fig. 3) to isolate populations of cells into wells where exactly one of 1,108 different siRNAs is introduced into the well to create distinct genetic conditions. A well is like a single test tube at a small scale, 3.3 mm². The outer rows and columns of the plate are not used because they are subject to greater environmental effects, so there are 308 used wells on each plate. Each plate holds the same 30 control siRNA conditions, 277 different non-control siRNAs, and one untreated well, so covering all 1,108 non-control siRNAs takes 4 plates per experiment. The location of each of the 1,108 non-control siRNA conditions is randomized in each experiment to prevent confounding effects of the location of a particular well (see Plate Effects).
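The plate arithmetic above can be checked with a short sketch. All constants come from the text; the variable names are illustrative. The final figure is only an upper bound: it assumes that the reported total counts one 6-channel image per imaging site and includes the control and untreated wells.

```python
ROWS, COLS = 16, 24              # layout of a 384-well plate
PLATES_PER_EXPERIMENT = 4
EXPERIMENTS = 51
SITES_PER_WELL = 2               # two non-overlapping regions per well
CONTROL_SIRNAS_PER_PLATE = 30
UNTREATED_WELLS_PER_PLATE = 1
REPORTED_IMAGES = 125_568        # total stated for RxRx1

total_wells = ROWS * COLS                                  # 384
# Dropping the outer rows and columns removes 2 rows and 2 columns.
used_wells = (ROWS - 2) * (COLS - 2)                       # 308
non_control_wells = (
    used_wells - CONTROL_SIRNAS_PER_PLATE - UNTREATED_WELLS_PER_PLATE
)                                                          # 277
non_control_sirnas = non_control_wells * PLATES_PER_EXPERIMENT  # 1108

# Upper bound on site images if every used well yields both sites.
max_site_images = (
    EXPERIMENTS * PLATES_PER_EXPERIMENT * used_wells * SITES_PER_WELL
)                                                          # 125664
shortfall = max_site_images - REPORTED_IMAGES              # 96

print(total_wells, used_wells, non_control_wells, non_control_sirnas)
print(max_site_images, shortfall)
```

Under those assumptions the reported total falls 96 images short of the upper bound, which would be consistent with a small number of sites being absent from the release.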
As already described, siRNA is a biological technology designed to match a specific sequence of mRNA; it will bind to that segment and cause it to be degraded by the cell. This causes a reduction in the protein product of the gene targeted by the siRNA. As noted above, the siRNA does not completely eliminate expression of the gene in the cell - some mRNA remain and are translated into proteins, but at significantly reduced levels. The observable morphology of a cell culture is called its phenotype. Figure 4 shows 4 phenotypes of distinct siRNAs from a single experiment and plate. The phenotype is sometimes visually recognizable from the images, but often the difference in cell morphology is subtle and hard to detect by the human eye.
Further complicating the signal are the unpredictable (but consistent) off-target effects where the siRNA has interacted with unintended mRNA species to varying degrees. Each siRNA is imprecise in that it partially knocks down both the intended gene and other, unrelated genes, causing incomplete on-target and off-target effects on the cell morphology.
Each batch represents a single cell type: 24 in HUVEC, 11 in RPE, 11 in HepG2, and 5 in U2OS. Figure 5 shows the phenotype of a single siRNA in the four different cell types. For each image, the accompanying metadata provides the following information about the associated well: 1) its cell type, 2) its experiment, 3) its plate within the experiment, 4) its location on the plate, and 5) its siRNA. Since each of the 51 experiments was run in different batches, the images exhibit technical effects common to their batch and distinct from other batches; these batch effects are discussed further below.
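As a rough illustration of the five metadata fields listed above, the sketch below models one well's metadata as a small record and tallies experiments per cell type. The field names, identifier formats, and example values are hypothetical and do not reflect the actual column names or identifiers in the released metadata files.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class WellMetadata:
    """Hypothetical record holding the five fields described in the text."""
    cell_type: str    # e.g. "HUVEC"
    experiment: str   # identifier of the experiment (batch)
    plate: int        # plate number within the experiment, 1-4
    well: str         # location on the plate
    sirna: str        # siRNA introduced into the well


# Batch counts per cell type, as stated in the text.
BATCHES_PER_CELL_TYPE = {"HUVEC": 24, "RPE": 11, "HepG2": 11, "U2OS": 5}
total_batches = sum(BATCHES_PER_CELL_TYPE.values())  # 51


def experiments_per_cell_type(records):
    """Count distinct experiments for each cell type in a set of records."""
    seen = {(r.cell_type, r.experiment) for r in records}
    return Counter(cell_type for cell_type, _ in seen)


# Made-up records for illustration.
records = [
    WellMetadata("HUVEC", "HUVEC-01", 1, "B02", "sirna_0001"),
    WellMetadata("HUVEC", "HUVEC-01", 2, "C05", "sirna_0300"),
    WellMetadata("HUVEC", "HUVEC-02", 1, "B02", "sirna_0417"),
    WellMetadata("U2OS", "U2OS-01", 4, "O23", "sirna_0001"),
]
counts = experiments_per_cell_type(records)
print(total_batches, dict(counts))
```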
When the images were originally created by Recursion, they were of size 2048 x 2048 x 6, but in order to make the dataset size more manageable, they were downsampled by a side-length factor of 2 and only the center 512 x 512 crop is provided.
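The geometry of this preprocessing can be worked through in a few lines. The sketch below assumes the crop is exactly centered and that side lengths divide evenly; the function name is illustrative, and this is not the pipeline actually used to produce the dataset.

```python
def crop_geometry(original_side=2048, downsample_factor=2, crop_side=512):
    """Return the downsampled side length, the crop bounds along one axis,
    and the fraction of the original field of view that the crop retains."""
    downsampled_side = original_side // downsample_factor
    offset = (downsampled_side - crop_side) // 2
    bounds = (offset, offset + crop_side)   # half-open interval [start, stop)
    area_fraction = (crop_side / downsampled_side) ** 2
    return downsampled_side, bounds, area_fraction


downsampled_side, bounds, area_fraction = crop_geometry()
print(downsampled_side, bounds, area_fraction)  # 1024 (256, 768) 0.25
```

Under these assumptions, each released image covers one quarter of the area of the originally acquired field of view, at half the original linear resolution.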
As described above, each of the 51 experiments was executed in a different experimental batch. A batch is a set of experiment plates that are executed together, at the same time with the same materials. This means that all the plates within a batch are similar in their reagent synthesis, environmental conditions, etc., and plates from one batch differ from those from another batch in a consistent way. There are changes from batch to batch in environmental and experimental conditions that cause these effects. Examples of environmental conditions include humidity and temperature. Examples of experimental conditions include synthesis and concentration of reagents, as well as cell culture density. As seen in Figure 1, the batch effects are significantly more visually salient than the relevant biological variation introduced by different siRNAs.
These batch effects are an inherent feature of experimentation and are unavoidably introduced into data collected across multiple batches. Any scientific conclusions drawn from such data should rely on the relevant biological variation in the data rather than on these incidental effects. A machine learning approach to separating batch effects from biological variation could be used widely in the field to extend the comparability of large image sets without a biologist needing to deconvolute the biological variation manually. RxRx1 therefore has the potential to spur innovation in models that overcome a problem that plagues the pharmaceutical industry.
One particular set of metadata descriptors worth discussing more fully is experiment, plate, well, and site (see Fig. 2). These describe the physical location of each image in terms of the data generation process. Every image is taken of a particular site of a cell culture well on a 384-well plate. These cell cultures are distributed across a 16x24 grid of wells on a plate, and there are 4 plates per experiment in the RxRx1 dataset. Each experiment (set of 4 plates) was run in a different batch than the other experiments in RxRx1, such that the experimental noise that occurs due to slightly different conditions in the lab will take on a different form for each experiment. These are the batch effects referenced above. But there can be additional noise within an experiment driven by both inter- and intra-plate effects. An inter-plate effect is any effect primarily driven by the plate assignment within a batch (differences between plates), and an intra-plate effect is any effect primarily driven by the well assignment within a plate (differences between wells, or locations, within the same plate). All three of these sources of experimental variation may prove important to properly model the RxRx1 data, and the dataset has been generated in such a way that very few instances of each perturbation should appear in the same well twice.
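To illustrate the plate grid, the sketch below enumerates the wells that are actually used on a 16x24 plate. It assumes the common labeling convention of row letters A-P and column numbers 1-24, written as addresses such as "B02"; the text does not state which address format the RxRx1 metadata uses, so treat the format and the function names as assumptions.

```python
import string

ROW_LABELS = string.ascii_uppercase[:16]   # "A" through "P"
COLUMN_NUMBERS = range(1, 25)              # 1 through 24


def is_edge_well(row: str, column: int) -> bool:
    """True for wells in the outer rows or columns, which are left unused."""
    return row in (ROW_LABELS[0], ROW_LABELS[-1]) or column in (1, 24)


def interior_wells() -> list:
    """List the addresses of all non-edge wells in row-major order."""
    return [
        f"{row}{column:02d}"
        for row in ROW_LABELS
        for column in COLUMN_NUMBERS
        if not is_edge_well(row, column)
    ]


wells = interior_wells()
print(len(wells), wells[0], wells[-1])  # 308 B02 O23
```

A mapping like this is one way to turn a well address into row and column indices, which could then serve as features when modeling intra-plate effects.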
In each experiment, the same 30 siRNAs appear on every plate as positive controls. In addition, there is one well per plate that is left untreated as a negative control. The 30 control siRNAs target 30 different genes and produce a variety of morphological effects. Together, these wells provide a set of reference controls on each plate.
There are a number of areas of active machine learning research that could be furthered by the use of the RxRx1 dataset.
Of obvious note are areas of generalization, as this dataset (and any biological dataset) contains non-random experimental effects which make generalization challenging. This dataset is well suited for tasks such as transfer learning (e.g. to a new cell type), domain adaptation (treating a new batch as a new target domain), and k-shot learning (a number of perturbations are present on every plate). While generalizability is important in every ML problem, it is of particular importance in working with biological datasets, as mentioned above.
Given the metadata associated with each image, the RxRx1 dataset provides a good opportunity for further research in context modeling. This could include using contexts such as cell types, plate and well assignments, in order to more effectively model the data.
While much research has been done in computer vision across many domains, this dataset is large and rich and presents a very different data distribution than is found in most publicly available imaging datasets. Some of these differences include the relative independence of many of the channels (unlike RGB images) and the fact that each example is one of a population of objects treated similarly as opposed to singletons. The RxRx1 dataset presents an opportunity for further fundamental research in computer vision techniques.
Using the RxRx1 dataset, we are sponsoring a NeurIPS 2019 competition called CellSignal to encourage researchers to explore methods of separating biological and technical factors in biological data. The task is to correctly classify the perturbation present in each image in a held out set of experiments that were run in batches different from the experiments in the training set. Thus, in order for the classifier to generalize well to unseen batches, it must learn to separate biological and technical factors and make predictions only on the biology of the perturbation.
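The key property of this setup is that no experiment contributes images to both the training and the held-out set. A minimal sketch of such a split follows; the record layout and experiment identifiers are hypothetical, and the actual competition split is defined by the organizers rather than by this code.

```python
def split_by_experiment(records, held_out_experiments):
    """Split records so that whole experiments (batches) are held out."""
    held_out = set(held_out_experiments)
    train = [r for r in records if r["experiment"] not in held_out]
    test = [r for r in records if r["experiment"] in held_out]
    return train, test


# Made-up records: each dict stands for the metadata of one image.
records = [
    {"experiment": "HUVEC-01", "sirna": "sirna_0001"},
    {"experiment": "HUVEC-01", "sirna": "sirna_0002"},
    {"experiment": "HUVEC-02", "sirna": "sirna_0001"},
    {"experiment": "U2OS-01", "sirna": "sirna_0002"},
]
train, test = split_by_experiment(records, ["HUVEC-02", "U2OS-01"])

train_experiments = {r["experiment"] for r in train}
test_experiments = {r["experiment"] for r in test}
print(train_experiments, test_experiments)
```

Splitting at random over images instead would place images from the same batch on both sides, so a classifier could score well by exploiting batch-specific cues rather than biology.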
The evaluation metric will be the siRNA classification accuracy averaged over images. This metric is useful as an overall measure of the goodness of the classifier since the training and hold-out sets are approximately balanced across the siRNA classes, and the metric improves with each correctly classified image. And because the hold-out experiments are from entirely different batches than the training experiments, classifiers will have to generalize well to unseen experimental batches in order to score well on accuracy. In addition, since we will not have a separate task for accuracy on individual cell types, results will improve as the classifiers learn to do well on each cell type. The participants with the three highest accuracies on the test set win the competition.
This competition will be of interest to the rapidly growing community of researchers looking to apply machine learning methods to complex biological datasets, and especially those working on biological images. The specific task of removing experimental batch effects is highly relevant to the broader life sciences community and can provide insights that enable researchers to develop improved methods for working with other experimental datasets. The competition should also be of great interest to the larger community of machine learning researchers, since the image set is large, systematically produced, and useful in more general areas of machine learning research such as domain adaptation, transfer learning, and k-shot learning.
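As a sketch of the metric described above, accuracy averaged over images is the fraction of images whose predicted siRNA class matches the true class. The per-cell-type breakdown is not part of the stated metric and is included only as a diagnostic; the function names and toy labels are illustrative, and this is not the competition's official scoring code.

```python
from collections import defaultdict


def accuracy(predicted, actual):
    """Fraction of images whose predicted class equals the true class."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy on an empty set")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def accuracy_by_cell_type(predicted, actual, cell_types):
    """Diagnostic only: accuracy computed separately for each cell type."""
    grouped = defaultdict(lambda: ([], []))
    for p, a, c in zip(predicted, actual, cell_types):
        grouped[c][0].append(p)
        grouped[c][1].append(a)
    return {c: accuracy(ps, labels) for c, (ps, labels) in grouped.items()}


# Toy labels: integers stand in for siRNA classes.
predicted = [3, 7, 7, 1]
actual = [3, 7, 2, 1]
cell_types = ["HUVEC", "HUVEC", "U2OS", "U2OS"]

overall = accuracy(predicted, actual)
per_type = accuracy_by_cell_type(predicted, actual, cell_types)
print(overall, per_type)  # 0.75 {'HUVEC': 1.0, 'U2OS': 0.5}
```

Because every image carries equal weight, cell types with more held-out images influence the overall score more than those with fewer.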