EQAPOL - Methods

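Our approach to automated analysis has four stages:

1. Building a statistical model of the data
2. Assigning events to clusters based on their statistical properties
3. Classifying clusters into basic lymphocyte cell subsets
4. Defining positivity thresholds for cytokine positive events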

The present implementation has automated procedures for steps 1, 2 and 4, but for the results presented here step 3 still relies on expert interpretation (i.e. gates).

Building a statistical model of the data

The basic concept is to treat flow cytometry events as a sample from a multi-dimensional probability distribution. Since the shape of this distribution is highly complex and unknown, we need a flexible multi-dimensional model to fit the data. Such a model can be constructed by adding together many simpler models that are easy to define and specify. We use the multivariate normal (or Gaussian) distribution as the simple building block [1] [2]. It is known that any multi-dimensional distribution, no matter how complex, can be approximated very well by a combination (or mixture) of a sufficient number of multivariate normals. Technically, the model used for analysis is a Dirichlet process Gaussian mixture model (DPGMM), and the algorithm infers the number of building blocks (Gaussian components) it needs from the data automatically. Each Gaussian component has mean, covariance and weight parameters, which are also estimated automatically from the data. We use the DPGMM only for the dimensions needed to establish basic cell subsets, namely FSC, SSC, CD3, CD4 and CD8; cytokine positivity is determined in a separate process described in Step 4.
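As a rough illustration of the "adding together simpler models" idea, the sketch below evaluates a mixture of two normal densities on a single channel. It is a simplification under stated assumptions: one dimension instead of five, a fixed number of components instead of the number inferred by the DPGMM, and hypothetical weights, means and variances that are not taken from any EQAPOL data set.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a univariate normal with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def mixture_pdf(x, components):
    """Weighted sum of normal densities; components are (weight, mean, var)."""
    return sum(w * normal_pdf(x, m, v) for w, m, v in components)

# Hypothetical two-component mixture: a large population centred at 1.0
# and a smaller one centred at 4.0 on one fluorescence channel.
components = [(0.7, 1.0, 0.25), (0.3, 4.0, 0.5)]

# The weights must sum to one for the mixture to be a probability density.
total_weight = sum(w for w, _, _ in components)

# Numerically integrate the density on a grid from -5 to 10 as a sanity check.
step = 0.01
grid = [-5.0 + i * step for i in range(1500)]
area = sum(mixture_pdf(x, components) for x in grid) * step
```

Although each building block is a simple bell curve, the sum is bimodal: the density is high near 1.0 and 4.0 and low in between. Adding more components allows more complex shapes to be represented in the same way.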

Assigning events to clusters based on their statistical properties

After fitting a DPGMM to flow data, each event is characterized by its statistical properties, the most important being its probability of coming from each of the multivariate Gaussian components. By assigning each event to the Gaussian component it is most likely to come from, we assign each event to a specific cluster. Clusters group events that are close to each other in multi-dimensional space, similar to how gates are used to group events that are close together in 2D dot plots. As an example of how clustering can be useful, consider dead or dying cells that bind antibodies non-specifically. Since the 4 color EQAPOL ICS panel does not include a viability dye, gating out such events may be challenging with 2D plots. However, non-specific binding events cluster together along a diagonal in multi-dimensional space, and it is simple to identify such clusters and filter them from further analysis.
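The assignment rule described above can be sketched as follows: compute, for each event, the posterior probability of each component and pick the largest. This is a minimal illustration, not the project's implementation. It assumes diagonal covariances for brevity (a fitted model would generally use full covariance matrices), and the component parameters and events are hypothetical.

```python
import math

def log_diag_normal(x, mean, var):
    """Log density of a normal with diagonal covariance (per-dimension variances)."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def responsibilities(x, components):
    """Posterior probability that event x came from each component."""
    logs = [math.log(w) + log_diag_normal(x, m, v) for w, m, v in components]
    top = max(logs)  # subtract the maximum for numerical stability
    unnorm = [math.exp(value - top) for value in logs]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def assign(x, components):
    """Index of the component with the highest posterior probability."""
    probs = responsibilities(x, components)
    return max(range(len(probs)), key=probs.__getitem__)

# Hypothetical fitted components in two dimensions: (weight, mean, variances).
components = [
    (0.6, (1.0, 1.0), (0.3, 0.3)),
    (0.4, (4.0, 3.5), (0.5, 0.4)),
]
events = [(0.8, 1.2), (4.2, 3.1), (1.5, 0.7)]
labels = [assign(e, components) for e in events]
```

Here the first and third events are assigned to component 0 and the second to component 1. A cluster can then be filtered out as a whole by dropping every event that carries its label.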

Classifying clusters into basic lymphocyte cell subsets

To interpret the clusters, we need to assign biological labels to them, for example "T helper cell". This classification necessarily requires additional information, not directly available from the data itself, about the properties that define the biological cell subsets. This information can be supplied in several ways: 1) direct expert evaluation (e.g. by gating on clusters), 2) heuristic rules (e.g. common pre-specified gates) and 3) use of a data set pre-labeled by experts to train an automated classifier that can subsequently classify new data sets correctly. Currently, we use procedure 1 to map clusters to biological cell subsets. Common gates are at present infeasible because of the wide variation in fluorescence intensity distributions among laboratories, even after statistical data normalization. Strategies to increase the consistency of data distributions across laboratories, as well as implementations of procedure 3, will be discussed at the end of the appendix.
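One way to picture procedure 1 is that the expert draws gates as usual, but the gates are applied to cluster centers rather than to individual events. The sketch below is an assumption about how such a mapping could look, not a description of the actual tool: the function name `label_cluster`, the rectangular gate bounds and the cluster means are all hypothetical.

```python
def label_cluster(center, gates):
    """Return the first label whose rectangular gate contains the cluster center.

    center: dict mapping marker name to the cluster mean on that marker.
    gates:  dict mapping label to {marker: (low, high)} bounds.
    """
    for label, bounds in gates.items():
        if all(low <= center[marker] < high for marker, (low, high) in bounds.items()):
            return label
    return "unclassified"

# Hypothetical gates, as an expert might draw them for one sample.
INF = float("inf")
gates = {
    "CD4+ T cell": {"CD3": (2.0, INF), "CD4": (2.0, INF), "CD8": (-INF, 2.0)},
    "CD8+ T cell": {"CD3": (2.0, INF), "CD4": (-INF, 2.0), "CD8": (2.0, INF)},
}

# Hypothetical cluster means on transformed fluorescence scales.
centers = [
    {"CD3": 3.1, "CD4": 3.4, "CD8": 0.5},
    {"CD3": 3.0, "CD4": 0.4, "CD8": 3.6},
    {"CD3": 0.3, "CD4": 0.2, "CD8": 0.1},
]
labels = [label_cluster(c, gates) for c in centers]
```

Every event inherits the label of its cluster, so the expert only has to judge a modest number of clusters instead of drawing boundaries through individual events.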

Defining positivity thresholds for cytokine positive events

At this stage, we want to find the frequency of CD4+ or CD8+ T cells that are cytokine positive. It is standard to define a positivity threshold on the cytokine channel and count all events above the threshold as cytokine positive. Defining a threshold implicitly requires balancing sensitivity against specificity, and is ideally performed by iterating between backgating and threshold definition while looking at all the data samples. However, this is time-consuming, and thresholds are often simply set visually by an expert operator without backgating analysis, sometimes on the negative control alone. Consequently, thresholds may not be set optimally, and there is usually no consensus on where to draw positivity thresholds in general.
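The counting step itself is simple, as the sketch below shows. The threshold rule used here, a fixed quantile of the negative control, is only an illustrative assumption standing in for whatever rule an operator or algorithm applies; it is not the method used in this project. The function names and intensity values are hypothetical.

```python
def quantile_threshold(negative_control, q=0.999):
    """Value at quantile q of the sorted negative-control intensities."""
    ordered = sorted(negative_control)
    index = min(int(q * len(ordered)), len(ordered) - 1)
    return ordered[index]

def positive_frequency(values, threshold):
    """Fraction of events strictly above the threshold."""
    if not values:
        return 0.0
    return sum(v > threshold for v in values) / len(values)

# Hypothetical cytokine intensities for CD4+ events.
negative_control = [0.1, 0.2, 0.15, 0.3, 0.25, 0.05, 0.4, 0.35, 0.2, 0.1]
stimulated = [0.2, 0.1, 1.8, 0.3, 2.4, 0.15, 0.25, 0.45, 0.05, 0.3]

threshold = quantile_threshold(negative_control, q=0.9)
frequency = positive_frequency(stimulated, threshold)
```

Raising `q` moves the threshold up, which reduces false positives from the negative population but can miss dim positive events. This is the sensitivity and specificity trade-off mentioned above.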

For further information about positivity thresholding, see "Algorithm implementation" and "An example application".

Literature cited

[1] Chan, C.; Feng, F.; Ottinger, J.; Foster, D.; West, M. & Kepler, T. B. Statistical mixture modeling for cell subtype identification in flow cytometry. Cytometry Part A, 2008, 73, 693-701.
[2] Suchard, M. A.; Wang, Q.; Chan, C.; Frelinger, J.; Cron, A. & West, M. Understanding GPU Programming for Statistical Computation: Studies in Massively Parallel Massive Mixtures. Journal of Computational and Graphical Statistics, 2010, 19, 419-438.