Department of Plant Sciences, University of Cambridge, Downing Street, Cambridge, UK

Abstract

Background

High-throughput sequencing has become an important technology for studying expression levels in many types of genomic, and particularly transcriptomic, data. One key way of analysing such data is to look for elements of the data which display particular patterns of differential expression in order to take these forward for further analysis and validation.

Results

We propose a framework for defining patterns of differential expression and develop a novel algorithm, baySeq, which uses an empirical Bayes approach to detect these patterns of differential expression within a set of sequencing samples. The method assumes a negative binomial distribution for the data and derives an empirically determined prior distribution from the entire dataset. We examine the performance of the method on real and simulated data.

Conclusions

Our method performs at least as well, and often better, than existing methods for analyses of pairwise differential expression in both real and simulated data. When we compare methods for the analysis of data from experimental designs involving multiple sample groups, our method again shows substantial gains in performance. We believe that this approach thus represents an important step forward for the analysis of count data from sequencing experiments.

Background

The development of high-throughput sequencing technologies in recent years has greatly expanded the scale at which such count data can be acquired.

This type of data first emerged from the serial analysis of gene expression (SAGE).

We develop here an empirical Bayesian approach that is able to increase the accuracy of predictions by borrowing information across the dataset, but which removes the restriction of only considering pairwise comparisons and allows us to analyse more complex experimental designs. We are able to show that our method gives equivalent or improved performance in both simulated and biological data when compared to existing methods for the discovery of differential expression in pairwise comparisons, and offers improvements in performance for more complex designs.

In order to address the problem of more complex experimental designs involving multiple groups of samples, we develop our method in a very general form by first establishing a framework for describing diverse patterns of differential expression within a dataset. Using this framework to define a set of models, we seek to establish posterior probabilities of each model. Finally, we demonstrate the applicability of our method to these experimental designs on simulated data, and are able to show substantial improvements in performance using our method.

Methods

We adopt and adapt the nomenclature of Robinson and Smyth; in particular, we refer to the counts for a single genomic element (a gene, small RNA sequence, or similar) across all samples as a 'tuple'.

Approach

We take an empirical Bayesian approach to estimate the posterior probabilities of each of a set of models that define patterns of differential expression for each tuple. This approach begins by defining each of our models in terms of similarity and difference between samples. For a given model, we seek to define which samples behave similarly to each other, and for which sets of samples there are identifiable differences. In order to assess the posterior probabilities of each model for each tuple, we consider a distribution for the tuple defined by a set of underlying parameters for which some prior distribution exists. Samples behaving similarly to each other should possess the same prior distribution on the underlying parameters of the tuple, while samples behaving differently should possess different prior distributions. We develop our method based on the negative binomial distribution for the tuple data, and derive an empirical distribution on the set of underlying parameters from the whole of the data set.

An important advantage of our method is that the evaluation of posterior probability for multiple models is simply achieved. For this reason, the techniques described are developed in a very general form.

Model definitions

In forming a set of models for the data, we consider which patterns are biologically likely. In the simplest case of a pairwise comparison, we have count data from some samples from each of two conditions, A and B; that is, samples A_1, A_2, B_1, B_2, where A_1, A_2 and B_1, B_2 are the replicates. In most cases, it is reasonable to suppose that at least some of the tuples may be unaffected by our experimental conditions; for such tuples, the data from all four samples will share the same set of underlying parameters. For tuples that are differentially expressed, the data from samples A_1 and A_2 will share the same set of underlying parameters, the data from samples B_1 and B_2 will share the same set of underlying parameters, but, crucially, these sets of parameters will not be identical. We can thus treat our models as non-overlapping sets of samples. Our first model, of no differential expression, is thus defined by the single set of samples {A_1, A_2, B_1, B_2}. Our second model, of differential expression between conditions A and B, is defined by the pair of sets {A_1, A_2} and {B_1, B_2}.
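The published baySeq implementation is an R package, but the framework itself is language-neutral. The following Python sketch (illustrative only; the sample names follow the pairwise example above, and `is_valid_model` is a hypothetical helper, not part of baySeq) represents models as partitions of the sample set:

```python
# A model is a partition of the samples into non-overlapping sets; samples
# in the same set are assumed to share the parameters of their underlying
# distribution. Sample names follow the pairwise example in the text.
samples = ["A1", "A2", "B1", "B2"]

# Model of no differential expression: every sample in a single set.
model_nde = [{"A1", "A2", "B1", "B2"}]

# Model of differential expression between conditions A and B.
model_de = [{"A1", "A2"}, {"B1", "B2"}]

def is_valid_model(model, samples):
    """Check that a model partitions the samples: each sample appears
    in exactly one set of the model."""
    covered = [s for part in model for s in part]
    return sorted(covered) == sorted(samples)

print(is_valid_model(model_nde, samples), is_valid_model(model_de, samples))
# → True True
```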

More complex models

In the simple example described, only two models are plausible, and this framework may seem overly complex. However, in experimental designs involving multiple sample groups, many more models are possible. As an example, we consider the next most complex experimental design, involving samples from three distinct conditions, A, B and C.

In the first of these, all samples are equivalently distributed, and so the model is defined by the single set {A_1, A_2, ..., B_1, B_2, ..., C_1, C_2, ...}. We then need to consider the three models under which there is equivalent distribution under two conditions but not the third. The first of these models can be described by the sets {A_1, A_2, ..., B_1, B_2, ...}, {C_1, C_2, ...}, in which the data from condition C are distributed differently to the data from conditions A and B. Similarly, we have the models {A_1, A_2, ..., C_1, C_2, ...}, {B_1, B_2, ...} and {B_1, B_2, ..., C_1, C_2, ...}, {A_1, A_2, ...}. Finally, we need to consider the model defined by the sets {A_1, A_2, ...}, {B_1, B_2, ...}, {C_1, C_2, ...}, in which the data from all three conditions are differently distributed.

It is clear from considering even this relatively simple example that the number of potential models rises rapidly as the number of different experimental conditions increases. We should also note, however, that in many cases we will be able to exclude particular models based on biological knowledge (if, for example, we know that conditions A and B cannot plausibly be equivalently distributed while condition C differs, we can exclude the model {A_1, A_2, ..., B_1, B_2, ...}, {C_1, C_2, ...}), and so the complexity of the system need not grow too rapidly. Our task is now to determine the posterior probability of each of our models, given the data, for each tuple. This will allow us to form ranked lists of the tuples, ordered by the posterior probabilities of a particular model (for instance, a model of differential expression between experimental conditions).
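The growth in the number of candidate models can be made concrete: the models on k conditions correspond to the set partitions of those conditions, counted by the Bell numbers. A small Python sketch (illustrative only; `partitions` is a hypothetical helper, not part of baySeq):

```python
def partitions(items):
    """Enumerate every partition of a list of condition labels; each
    partition corresponds to one candidate model of (non-)differential
    expression between the conditions."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for p in partitions(rest):
        # place `first` into each existing block of the partition...
        for i in range(len(p)):
            yield p[:i] + [[first] + p[i]] + p[i + 1:]
        # ...or give it a block of its own
        yield [[first]] + p

# The number of candidate models is the Bell number of the condition count.
print(len(list(partitions(["A", "B"]))))       # → 2
print(len(list(partitions(["A", "B", "C"]))))  # → 5
print(len(list(partitions(list("ABCDE")))))    # → 52
```

Two conditions admit the two models described earlier; three conditions admit the five enumerated above; and the count grows rapidly from there, which is why excluding biologically implausible models is useful in practice.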

One interesting advantage of determining posterior probabilities, rather than significance values (p-values), of differential expression is that the posterior probabilities of distinct models can meaningfully be combined; for example, summed across models to give the probability of differential expression of any kind.

Equivalence of distributions

Suppose we have the count data from a set of n samples {s_1, ..., s_n}, such that the observed data for a particular tuple c are the counts (u_{1c}, ..., u_{nc}), where u_{ic} is the count for tuple c in sample s_i. For each sample s_i, we also have the library size scaling factor l_i. For each tuple, then, we can consider the data to be D_c = {(u_{1c}, ..., u_{nc}), (l_1, ..., l_n)}.

Now we consider some model M, defined by a collection of sets {E_1, ..., E_m}. If, in this model, the samples s_i and s_j are in the same set E_q, then we know that they have the same parameters of underlying distribution θ_q. We can define a set of underlying parameters K = {θ_1, ..., θ_m}. For notational simplicity, we will also define the data associated with the set E_q as D_{qc} = {(u_{ic} : s_i ∈ E_q), (l_i : s_i ∈ E_q)}. Given a model M, we seek the posterior probability of M given the data D_c, that is, ℙ(M | D_c).

We can then attempt to calculate ℙ(M | D_c) by Bayes' rule (Eqn 1): ℙ(M | D_c) = ℙ(D_c | M) ℙ(M) / ℙ(D_c).
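Once the likelihoods ℙ(D_c | M) are available, this computation is mechanical; a minimal Python sketch (the likelihood and prior values below are invented for illustration, not taken from any real analysis):

```python
def model_posteriors(likelihoods, priors):
    """Posterior probability of each model M via Bayes' rule,
    P(M | D_c) = P(D_c | M) P(M) / P(D_c), where the scaling factor
    P(D_c) is the sum of P(D_c | M) P(M) over the finite set of models."""
    joint = [l * p for l, p in zip(likelihoods, priors)]
    evidence = sum(joint)  # the scaling factor P(D_c)
    return [j / evidence for j in joint]

# Hypothetical likelihoods for the two pairwise models (NDE, DE).
post = model_posteriors([1.0e-6, 4.0e-6], [0.9, 0.1])
print(round(sum(post), 10))  # → 1.0 (posteriors sum to one)
```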

Negative binomially distributed data

There are a number of possible distributions which could be used to model the data D_c and hence to evaluate the likelihood ℙ(D_c | M). True count data from sequencing experiments, however, typically display overdispersion relative to the Poisson distribution, and so we consider the negative binomial distribution.

In the case of equal library sizes, it is possible under an assumption of a negative binomial distribution to develop an exact test for the likelihood of observing the data given non-differential expression. The problem of unequal library sizes can be approached by generating 'pseudodata' that are approximately identically distributed to the real data but have a common library size. This is the approach taken by Robinson and Smyth. Consider a sample s_i belonging to the set E_q, with library size l_i. We now assume that the count in this sample at tuple c, u_{ic}, is distributed negative binomially, with mean μ_q l_i and dispersion φ_q, where θ_q = (μ_q, φ_q). Then one parametrization can be defined as

ℙ(u_{ic} | θ_q) = Γ(u_{ic} + 1/φ_q) / (Γ(1/φ_q) · u_{ic}!) · (1 / (1 + μ_q l_i φ_q))^(1/φ_q) · (μ_q l_i φ_q / (1 + μ_q l_i φ_q))^(u_{ic})

so that the variance of u_{ic} is μ_q l_i + φ_q (μ_q l_i)².
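This mean/dispersion parametrization is straightforward to code; the following Python sketch (an illustrative stand-in, not the baySeq implementation) checks that a pmf with mean mu and dispersion phi is properly normalised and has the stated mean:

```python
import math

def nb_pmf(u, mu, phi):
    """Negative binomial pmf with mean mu and dispersion phi, so that the
    variance is mu + phi * mu**2; this matches the parametrization in
    which count u_ic has mean mu_q * l_i and dispersion phi_q."""
    r = 1.0 / phi                 # the 'size' parameter
    p = r / (r + mu)              # the probability parameter
    log_pmf = (math.lgamma(u + r) - math.lgamma(r) - math.lgamma(u + 1)
               + r * math.log(p) + u * math.log(1.0 - p))
    return math.exp(log_pmf)

# The pmf sums to (approximately) one, and its mean is mu.
probs = [nb_pmf(u, 10.0, 0.1) for u in range(1000)]
print(round(sum(probs), 6))                                  # → 1.0
print(round(sum(u * pr for u, pr in enumerate(probs)), 4))   # → 10.0
```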

There is unfortunately no obvious conjugacy that can be applied as in the Poisson-Gamma case. However, if we can define an empirical distribution on K, we are able to approximate the likelihood ℙ(D_c | M) numerically. To do so, we make the assumption that the parameters θ_q ∈ K are independent of one another.

This assumption reduces the dimensionality of the integral and thus improves the accuracy of the numerical approximation to the integral.

Next we suppose that for each E_q ∈ M we have some set Θ_q of values that are sampled from the distribution of θ_q. Then we can derive the approximation

ℙ(D_{qc} | M) ≈ (1 / |Θ_q|) · Σ_{θ ∈ Θ_q} ℙ(D_{qc} | θ)

The task that then remains is to derive the set Θ_{q }from the data.
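The averaging approximation above can be sketched in Python (illustrative only; `set_likelihood` and the parameter values below are hypothetical, not part of baySeq):

```python
import math

def nb_log_pmf(u, mu, phi):
    """Negative binomial log-pmf with mean mu and dispersion phi."""
    r = 1.0 / phi
    p = r / (r + mu)
    return (math.lgamma(u + r) - math.lgamma(r) - math.lgamma(u + 1)
            + r * math.log(p) + u * math.log(1.0 - p))

def set_likelihood(counts, libsizes, theta_samples):
    """Approximate P(D_qc | M) as the average, over the empirically
    sampled parameter set Theta_q, of the likelihood of the counts
    belonging to the set E_q."""
    total = 0.0
    for mu, phi in theta_samples:
        log_lik = sum(nb_log_pmf(u, mu * l, phi)
                      for u, l in zip(counts, libsizes))
        total += math.exp(log_lik)
    return total / len(theta_samples)

# A tuple whose counts sit near one of the sampled means scores higher
# than one far from all of them.
thetas = [(10.0, 0.1), (50.0, 0.1)]
near = set_likelihood([9, 11], [1.0, 1.0], thetas)
far = set_likelihood([400, 420], [1.0, 1.0], thetas)
print(near > far)  # → True
```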

Empirically derived distributions on K

We can derive an empirical distribution on K from the data themselves. For each set E_q, we would like to find some estimate of the mean and dispersion of the distribution underlying the data from a single tuple, D_{qc}. By similarly finding estimates of the mean and dispersion for a large number of tuples, we would have our sampling Θ_q. The chief difficulty here lies in properly estimating the dispersion. For example, suppose that the data from a given tuple show genuine differential expression. If the model that we are testing assumes that there is no differential expression, then the dispersion will be substantially over-estimated for this tuple. Since we do not know in advance which tuples are genuinely differentially expressed and which are not, we need to consider the replicate structure of the data in order to properly estimate the dispersions. We define the replicate structure by considering the sets {R_1, ..., R_s}, where samples s_i, s_j ∈ R_r if and only if sample s_j is a replicate of s_i.

Given this structure for the data, we can estimate the dispersion of the data in a tuple c by quasi-likelihood methods, choosing a value φ_c such that the quasi-likelihood estimating equation (Eqn 4), which equates the total deviance across replicate sets to the residual degrees of freedom, is satisfied.

Taking this value for φ_c, we can then re-estimate the means of the distributions underlying the data, choosing the value that maximises the likelihood of the data in replicate set R_r, for each r. We then iterate on our estimates of φ_c and the means until convergence.

This gives us a value for φ_c. We then need to estimate the mean of the distribution underlying the data D_{qc}, that is, for the set of samples in E_q, which we can easily do by fixing the value acquired for φ_c and estimating the mean μ_{qc} by maximum likelihood methods, choosing the value of μ_{qc} that maximises the likelihood ℙ(D_{qc} | μ_{qc}, φ_c)

for each q.

We can then form the set Θ_q = {(μ_{qc}, φ_c)} by repeating this process for multiple tuples c, giving an empirical distribution with which to evaluate the likelihoods ℙ(D_c | M).
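The estimation of the pairs (μ_{qc}, φ_c) can be sketched as follows. This Python illustration assumes equal library sizes and replaces the quasi-likelihood iteration described above with a simple profile-likelihood grid search, so it is a simplified stand-in rather than the method as published:

```python
import math

def nb_log_lik(counts, mus, phi):
    """Log-likelihood of counts, each with its own mean and a shared
    dispersion phi."""
    r = 1.0 / phi
    ll = 0.0
    for u, mu in zip(counts, mus):
        p = r / (r + mu)
        ll += (math.lgamma(u + r) - math.lgamma(r) - math.lgamma(u + 1)
               + r * math.log(p) + u * math.log(1.0 - p))
    return ll

def estimate_theta(replicate_sets):
    """Estimate (mean per replicate set, shared dispersion) for one tuple.
    With equal library sizes the ML estimate of each set's mean is the
    sample mean; the shared dispersion phi_c is then chosen by a grid
    search on the profile likelihood."""
    means = [sum(g) / len(g) for g in replicate_sets]
    counts = [u for g in replicate_sets for u in g]
    mus = [m for g, m in zip(replicate_sets, means) for _ in g]
    grid = [0.005 * 1.4 ** k for k in range(30)]
    phi = max(grid, key=lambda f: nb_log_lik(counts, mus, f))
    return means, phi

means, phi = estimate_theta([[8, 12, 10, 10], [30, 34, 28, 32]])
print(means, phi > 0)  # → [10.0, 31.0] True
```

Because these replicate sets show little within-set variability, the grid search settles on a small dispersion, as expected.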

This method of estimating the dispersion assumes that the dispersion of a tuple is constant across different sets of samples. In most cases, where the number of samples is low, this is likely to be the best approach. Where there is some expectation that the dispersion will be substantially different between sets of replicates, there may be advantages to estimating the dispersions individually for each of the different sets of samples in each model, while still considering the replicate structure within these sets. This is easily done by restricting the data (and corresponding replicate structure) to D_{qc} when estimating the dispersion in Eqn 4. We found no substantial differences between these approaches in simulation studies (unpublished data) and so show only the results acquired when the dispersion of each tuple is assumed constant.

Estimation of prior probabilities of each model

A number of options are available when considering the prior probabilities ℙ(M) of each model. In some circumstances, it may be reasonable to assume each model equally likely a priori. In general, however, we can estimate the priors from the data: beginning with equal prior probabilities, we compute the posterior probabilities of each model for each tuple and then take the mean of these posterior probabilities over all N tuples,

ℙ(M) ≈ (1/N) · Σ_c ℙ(M | D_c),

for the prior probability of model M, iterating this procedure until the estimates converge.

An alternative to this approach would be to establish some distribution on the prior probabilities of our models and find the marginal posterior probability of the data based on this distribution. One approach to this might be to use the distribution of posterior probabilities as an approximation to a distribution on the priors. We could then use a numerical integration method to re-estimate the posterior probabilities, and iterate as before. However, in practice this method is extremely computationally intensive and offers little improvement in the accuracy of the predictions made (unpublished data).
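The iterative re-estimation of priors from mean posterior probabilities can be sketched in Python (illustrative only; the likelihood values below are invented):

```python
def estimate_priors(lik_by_model, n_iter=50):
    """Iteratively re-estimate model priors as the mean posterior across
    tuples: lik_by_model[c][m] holds P(D_c | M_m) for tuple c, model m."""
    n_models = len(lik_by_model[0])
    priors = [1.0 / n_models] * n_models   # start from equal priors
    for _ in range(n_iter):
        post_sums = [0.0] * n_models
        for liks in lik_by_model:
            joint = [l * p for l, p in zip(liks, priors)]
            z = sum(joint)                 # the scaling factor for this tuple
            for m in range(n_models):
                post_sums[m] += joint[m] / z
        priors = [s / len(lik_by_model) for s in post_sums]
    return priors

# Three tuples favouring model 0 and one favouring model 1.
liks = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.2, 0.8]]
pri = estimate_priors(liks)
print(round(sum(pri), 6))  # → 1.0
```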

The scaling factor ℙ(D_c)

Finally, we need to consider the scaling factor ℙ(D_c) in Eqn. 1. Since the number of possible models on the data is finite, ℙ(D_c) can be determined by summing over all possible models: ℙ(D_c) = Σ_M ℙ(D_c | M) ℙ(M).

Results and Discussion

We use both simulated and real data to compare the method we have developed to the previously developed methods of Robinson and Smyth (implemented in the edgeR package) and to the overdispersed log-linear and overdispersed logistic approaches.

Comparison of methods for pairwise comparisons: simulated data

We begin by applying the methods being evaluated to the simulation studies described in Robinson and Smyth.

Random dispersion simulations

Robinson and Smyth simulate data in which the library sizes, l_i, are sampled from a uniform distribution between 30,000 and 90,000. These library sizes are considerably smaller than those available from the current generation of sequencing technologies. However, increasing the library size to better reflect current levels does not significantly alter the conclusions drawn, because the 'library size' is, in effect, a scaling factor. All tuples are simulated from a negative binomial distribution, and we simulate differential expression by varying the means of the distribution from which they are sampled.

For a non-differentially expressed tuple c, the data for sample s_i are simulated with mean λ_c l_i, where the λ_c are sampled randomly from a set of empirically estimated values.

Ten percent of the ten thousand simulated tuples are differentially expressed. In order to produce both over- and under-expression in our simulated data, we simulate the differentially expressed data in one of two ways, the alternative being chosen at random for each tuple: either the first n_1 samples are simulated with an increased mean relative to the n_2 samples of the second condition, or the n_2 samples of the second condition are simulated with an increased mean relative to the first n_1 samples.

Small (n_1 = n_2 = 2) and moderate (n_1 = n_2 = 5) numbers of libraries are compared, with both large and small degrees of differential expression.
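The negative binomial simulation scheme above can be reproduced with the standard Gamma-mixed Poisson construction. The following Python sketch (assuming unit library sizes for brevity; not the simulation code used in the paper) checks that the simulated counts have approximately the intended mean and variance:

```python
import math, random

def poisson(lam, rng):
    """Poisson sample by pmf inversion (adequate for the moderate means
    used in this sketch); the term > 0 guard protects against float
    underflow in extreme cases."""
    k, term, u = 0, math.exp(-lam), rng.random()
    cum = term
    while cum < u and term > 0.0:
        k += 1
        term *= lam / k
        cum += term
    return k

def sim_tuple(mu, phi, libsizes, rng):
    """Simulate negative binomial counts (mean mu * l_i, dispersion phi)
    for one tuple as a Gamma-mixed Poisson: lambda ~ Gamma(1/phi), scaled
    so that E[lambda] = mu * l_i."""
    return [poisson(rng.gammavariate(1.0 / phi, phi * mu * l), rng)
            for l in libsizes]

rng = random.Random(42)
counts = [sim_tuple(20.0, 0.1, [1.0] * 4, rng) for _ in range(2000)]
flat = [u for row in counts for u in row]
mean = sum(flat) / len(flat)
var = sum((u - mean) ** 2 for u in flat) / len(flat)
print(round(mean, 1), round(var))  # mean ≈ 20, variance ≈ 20 + 0.1 * 20**2 = 60
```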

For the baySeq method, we define two models: one describing differential expression between the first n_1 libraries and the second n_2 libraries, and one describing no differential expression between any library. Figure 1 shows the estimated posterior probabilities of differential expression plotted against observed fold-change for a single simulation with n_1 = n_2 = 5. We see a 'wine glass' shaped plot, characteristic of this analysis.

Estimated posterior probabilities of differential expression against observed fold-change

**Estimated posterior probabilities of differential expression against observed fold-change**. Estimated posterior probabilities of differential expression against observed fold-change from a single simulation of ten thousand tuples, of which one thousand are truly differentially expressed (DE) and nine thousand are not differentially expressed (non-DE).

The 'stem' of the wine glass is made up of tuples with low fold change and reasonably high levels of expression. Such tuples are relatively easy to identify as non-differentially expressed, and so they have low posterior probability of differential expression. However, some tuples with low fold change also have very low absolute counts. With low absolute counts in a tuple, it becomes harder to determine whether the tuple is genuinely differentially expressed, and so these tuples tend to have slightly higher posterior probabilities of differential expression than tuples with high absolute counts but low fold change. The top of the stem, with a posterior probability of differential expression of around 0.2, is thus composed of tuples that have only one or two counts observed in any sample. For these very low expression tuples, changes of only one or two counts in a sample can lead to a relatively large fold change. However, these small changes do not substantially affect the posterior probability and so, although we see a spread in the fold change at the top of the stem, the posterior probability of differential expression remains low for these tuples. We tend not to see a similar spread near the base of the stem, as the tuples there tend to have high expression. For a highly expressed tuple to show a high fold change but nevertheless have a low posterior probability of differential expression, it must have a very high dispersion, which will not often occur.

In the arms of the wine glass, we see that as the fold change increases, the posterior probability of differential expression also increases, although there is a wide range of posterior probabilities for (for example) a fold change of 4. This range arises because the posterior probability also depends heavily on both the dispersion observed within the data and the level of expression of the tuple, since, as before, it is easier to tell whether a highly expressed tuple is genuinely differentially expressed. At high posterior probabilities of differential expression, we see an increased density of tuples, predominantly consisting of truly differentially expressed tuples.

As in Robinson and Smyth, we evaluate the performance of each method using false discovery rate (FDR) curves, averaged over one hundred simulations (Figure 2).
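An FDR curve of this kind is simple to compute from a ranked list; a Python sketch (illustrative, with toy scores):

```python
def fdr_curve(scores, truth):
    """False discovery rate among the top-k tuples, for each k, when
    tuples are ranked by decreasing score (e.g. posterior probability
    of differential expression). truth[i] is True for genuinely DE tuples."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    fdr, false = [], 0
    for k, i in enumerate(order, start=1):
        false += 0 if truth[i] else 1
        fdr.append(false / k)
    return fdr

curve = fdr_curve([0.9, 0.8, 0.3, 0.7], [True, False, False, True])
print(curve)  # → [0.0, 0.5, 0.3333333333333333, 0.5]
```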

Mean FDR curves for different numbers of libraries and degrees of differential expression

**Mean FDR curves for different numbers of libraries and degrees of differential expression**. Mean FDR curves, based on 100 simulations, comparing the performance of multiple methods in identifying pairwise differential expression. The data contain 1000 truly DE tuples and 9000 non-DE tuples and are simulated with varying numbers of libraries n_1 and n_2 and differing degrees of differential expression.

In these simulations, the baySeq method performs comparably to the existing methods for small numbers of libraries (n_1 = n_2 = 2). For larger numbers of libraries, our method shows improved performance over the existing methods.

To establish whether this difference in performance between the methods is meaningful in a practical sense, we estimate from these analyses the number of false positives that would be found were we to validate the top 200 tuples identified by each method; in each of the simulation scenarios (n_1 = n_2 = 2 and n_1 = n_2 = 5, for each degree of differential expression), the baySeq method yields on average fewer false positives than edgeR.

Fixed dispersion simulations

For completeness of comparison with previous methods, we also consider a less realistic simulation first developed by Lu et al., in which the dispersion is held constant for all tuples. Data from ten libraries are simulated with mean λ_c l_i, and 5000 tuples are chosen to be differentially expressed; for these, the data from libraries 1-5 are again simulated with mean λ_c l_i while those from libraries 6-10 are simulated with an increased mean, and so we see only over-expression of libraries 6-10 in the data. These simulations are applied with a range of fixed values of the dispersion.

As in Robinson and Smyth, we evaluate performance in these simulations using mean ROC curves (Figure 3).

Mean ROC curves for data with constant dispersion

**Mean ROC curves for data with constant dispersion**. Mean ROC curves, based on 100 simulations, comparing the performance of multiple methods in identifying pairwise differential expression. The data contain 5000 truly DE tuples and 5000 non-DE tuples and are simulated from a negative binomial distribution with constant dispersion for all tuples.

Of the remaining methods, we see that as the dispersion increases, the performance of all the methods decreases; however, the baySeq method continues to perform at least as well as the alternatives.

Comparison of methods for pairwise comparisons: biological data

We next apply the methods to a set of data acquired by Illumina sequencing of small RNAs (20-24 nucleotides) from leaf samples of Arabidopsis thaliana; two samples are from wild-type plants and two from RDR6 knockout mutants.

We consider only those sequence reads that perfectly matched the Arabidopsis thaliana genome.

(Log) p-values of real sequence data under null hypothesis of no overdispersion against mean expression levels of each sequence

**(Log) p-values of real sequence data under null hypothesis of no overdispersion against mean expression levels of each sequence**. (Log) p-values of real sequence data under the null hypothesis of no overdispersion and alternative hypothesis of overdispersion. We acquire these for each sequence by performing likelihood-ratio tests on the fit of a Poisson model and an alternative negative binomial model, allowing for both differences in library size and differential expression between the two sample types. Although a number of sequences show no significant variation from the Poisson model, a substantial number show very significant variation. The sequences for which overdispersion is particularly significant are those with high mean expression levels, as these are the sequences for which overdispersion can most easily be detected.
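The likelihood-ratio test described in this caption can be sketched as follows. This Python illustration ignores library size and sample-type differences (which the analysis above accounts for), fits the negative binomial dispersion by a simple grid search, and obtains the chi-squared(1) upper tail via `erfc`, so it is a simplified stand-in rather than the test as performed:

```python
import math

def pois_ll(counts, mu):
    """Poisson log-likelihood with common mean mu."""
    return sum(-mu + u * math.log(mu) - math.lgamma(u + 1) for u in counts)

def nb_ll(counts, mu, phi):
    """Negative binomial log-likelihood with mean mu and dispersion phi."""
    r = 1.0 / phi
    p = r / (r + mu)
    return sum(math.lgamma(u + r) - math.lgamma(r) - math.lgamma(u + 1)
               + r * math.log(p) + u * math.log(1.0 - p) for u in counts)

def overdispersion_test(counts):
    """Likelihood-ratio test of a Poisson null against a negative binomial
    alternative; returns an approximate p-value from the chi2(1) tail."""
    mu = sum(counts) / len(counts)
    ll0 = pois_ll(counts, mu)
    grid = [1e-4 * 1.5 ** k for k in range(35)]
    ll1 = max(nb_ll(counts, mu, f) for f in grid)
    stat = max(0.0, 2.0 * (ll1 - ll0))
    return math.erfc(math.sqrt(stat / 2.0))  # upper tail of chi2(1)

p_null = overdispersion_test([100, 102, 98, 101])  # Poisson-like counts
p_alt = overdispersion_test([10, 200, 5, 400])     # clearly overdispersed
print(p_null > 0.05, p_alt < 0.01)  # → True True
```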

We identified 678 different small RNA sequences that perfectly matched the tasRNA loci (TAS1a, TAS1b, TAS1c, TAS2, TAS3 and TAS3b) and matched nowhere else in the genome. Twenty-one of these small RNA sequences showed higher expression in the RDR6 mutant than in the wild-type samples and were excluded, leaving 657 potential true positives. We applied the methods to the count data for each small RNA sequence, seeking differential expression between the wild-type samples and the RDR6 knockout samples. We then ranked the sequences by the extent to which they are reported as differentially expressed by each method. We would expect a sizeable fraction of our 657 potential true positives to appear near the top of the list.

Figure 5 shows, for each method, the number of tasRNA-associated small RNAs found amongst the most highly ranked differentially expressed sequences.

Number of tasRNA-associated small RNAs identified as differentially expressed in RDR6 knockout experiment

**Number of tasRNA-associated small RNAs identified as differentially expressed in RDR6 knockout experiment**. Number of tasRNA-associated small RNAs against the number of differentially expressed small RNAs at the top of each list acquired by each method in an analysis of small RNA data from two wild-type samples and two RDR6 knockout samples. We expect tasRNA-associated small RNAs to be under-expressed in the RDR6 knockout samples, and hence to find these amongst the differentially expressed tuples.

Multi-group experimental designs

We next illustrate the application of our method to a more complex experimental design involving multiple experimental conditions. We return to the example discussed in the Methods section, in which we have sequence data from three conditions, A, B and C.

We investigate the ability of our method to detect such patterns of differential expression by adapting the more realistic simulations, proposed by Robinson and Smyth and described above, to the three-condition design.

Five hundred tuples are simulated to have equivalently distributed data in conditions A and B, with the data from condition C differently distributed.

Another five hundred tuples are simulated similarly, such that the data are equivalently distributed in two of the conditions, with the data from the remaining condition differently distributed.

A further five hundred tuples are simulated in such a way that the data from all three conditions are differently distributed. For a given tuple, we simulate the data from condition A from a distribution with mean b_1 λ_c l_i, the data from condition B from a distribution with mean b_2 λ_c l_i, and the data from condition C from a distribution with mean b_3 λ_c l_i, choosing distinct scaling factors b_1, b_2 and b_3 for each tuple, and so we see various patterns of differential expression between these samples.

We again evaluate the methods by looking at the false discovery rates. In this analysis, we are interested in the ability of our method to accurately identify each of the different types of differential expression by simultaneously considering all possible models for the data. We can also consider the ability of our method to detect differential expression of any kind by taking the sum, for each tuple, of the posterior probabilities of all four models describing differential expression. We can thus consider four FDR curves, one for each type of differential expression present in the data, and an additional FDR curve for data showing differential expression of any kind.

For the pre-existing methods, in the overdispersed log-linear and the overdispersed logistic approaches, we are able to form linear models that describe all possible patterns of differential expression present in the data. The edgeR method, however, is designed for pairwise comparisons and cannot readily be applied to designs involving more than two groups.

We present the data (Figure 6) as mean FDR curves for each of the types of differential expression considered.

Mean FDR curves for analyses of more complex experimental designs

**Mean FDR curves for analyses of more complex experimental designs**. Mean FDR curves, based on 100 simulations, comparing the performance of multiple methods in identifying more complex patterns of differential expression. The data are simulated from samples coming from three experimental conditions A, B and C. Curves are shown for the identification of tuples where one experimental condition differs from the other two ({A_1, ..., A_n, B_1, ..., B_n} {C_1, ..., C_n}) and for the identification of tuples where all three experimental conditions are different ({A_1, ..., A_n} {B_1, ..., B_n} {C_1, ..., C_n}). The data are simulated with varying numbers of libraries.

Figure 7 compares the performance of the baySeq method in identifying each of the different types of differential expression, and differential expression of any kind, within these simulations.

Comparison of the baySeq method's performance for different models in complex experimental designs

**Comparison of the baySeq method's performance for different models in complex experimental designs**. Mean FDR curves, based on 100 simulations, comparing the performance of the baySeq method in identifying tuples where one experimental condition differs from the other two ({A_1, ..., A_n, B_1, ..., B_n} {C_1, ..., C_n}) and tuples where all three experimental conditions are different ({A_1, ..., A_n} {B_1, ..., B_n} {C_1, ..., C_n}). We also show false discovery rates for the identification of tuples showing differential expression of any kind.

Conclusions

We present an empirical Bayes method, baySeq, for identifying patterns of differential expression in count data from high-throughput sequencing experiments.

In developing this method, we have established a well-defined framework for describing diverse patterns of differential expression between samples. We then take an empirical Bayes approach in order to establish posterior probabilities of each model for each tuple. We achieve this by assuming that the data for each tuple are negative binomially distributed. This assumption is supported by the presence of over-dispersion in true data (Figure 4).

Our method is relatively computationally intensive, but has been implemented to take advantage of parallel processing, such that an analysis of pairwise differential expression of ten thousand tuples coming from ten samples takes approximately 7.5 minutes running on a machine with eight 2 GHz processors. We compare the performance of our method to that of existing methods on both simulated and real biological data.

Comparisons of the methods on pairwise data are made on the basis of previously developed simulation studies.

For analyses of data with random dispersions (Figure 2), our method performs at least as well as, and in several cases better than, the existing methods.

Analysis of real biological data (Figure 5) again suggests that our method performs at least as well as, and potentially better than, edgeR, while both methods appear to substantially outperform the overdispersed log-linear and logistic methods.

The chief advantage of the empirical Bayes method developed here, however, is its ready applicability to more complex experimental designs. At present the method remains limited to comparisons involving multiple groups, and is not able to account for paired samples, for example; one possible extension to this work is thus the generalisation of the method to some form of generalised linear model approach. Nevertheless, our method is able to simultaneously identify multiple types of differential expression from a single experiment. In comparisons of the methods using simulations of an experimental design involving multiple groups (Figures 6 and 7), our method shows substantial gains in performance over the existing approaches.

Our method thus provides performance as good as or better than previous methods whilst enabling experimenters to simultaneously consider many diverse sample types in a single sequencing experiment. We believe that this is a valuable approach representing an important step forward for the analysis of count data from sequencing experiments.

Availability and Requirements

The empirical Bayes method developed in this paper is implemented in the software package baySeq, an R package available from the Bioconductor project.

Authors' contributions

TJH designed and implemented the baySeq method.

Acknowledgements

The authors wish to thank Ericka R. Havecker and David C. Baulcombe for valuable discussions. David C. Baulcombe and Nataliya Yelina supplied the biological data. We would like to thank two anonymous reviewers for their helpful suggestions.

Thomas J. Hardcastle is supported by the European Commission Seventh Framework Programme grant number 233325. This work was supported by the European Commission Sixth Framework Programme Integrated Project SIROCCO; contract number LSHG-CT-2006-037900.