Department of Plant Sciences, University of Cambridge, Downing Street, Cambridge, CB2 3EA, UK

Abstract

Background

Pairing of samples arises naturally in many genomic experiments; for example, gene expression in tumour and normal tissue from the same patients. Methods for analysing high-throughput sequencing data from such experiments are required to identify differential expression, both within paired samples and between pairs under different experimental conditions.

Results

We develop an empirical Bayesian method based on the beta-binomial distribution to model paired data from high-throughput sequencing experiments. We examine the performance of this method on simulated and real data in a variety of scenarios. Our methods are implemented as part of the R

Conclusions

We compare our approach to alternatives based on generalised linear modelling approaches and show that our method offers significant gains in performance on simulated data. In testing on real data from oral squamous cell carcinoma patients, we discover greater enrichment of previously identified head and neck squamous cell carcinoma associated gene sets than has previously been achieved through a generalised linear modelling approach, suggesting that similar gains in performance may be found in real data. Our methods thus show real and substantial improvements in analyses of high-throughput sequencing data from paired samples.

Background

High-throughput sequencing technologies

Analysis methods for an important class of experimental design, that involving paired data, are less well developed. In a paired experimental design, we are generally interested in examining how the ratio of expression between paired counts varies, a scenario that arises naturally in a number of important settings. For example, in oncological studies we may take normal and tumour tissue from the same patient and wish to determine whether the ratio of gene expression differs from a one-to-one ratio between patients within a treatment group, or whether this ratio varies between treatment groups. Similarly, we may wish to compare individuals pre- and post-infection to establish how different strains of a species respond to infection. Paired samples provide a useful approach to such problems as even when the expression of particular genes varies substantially between individuals, the effect of treatment may be relatively consistent. By using paired samples, we can account for individual-specific effects and consequently better detect treatment effects.

Two key questions arise in analyses of paired data. Firstly, we can examine differential expression

We present here an empirical Bayesian method based on an over-dispersed binomial distribution, the beta-binomial, for addressing the problem of detecting both types of differential expression in paired sequencing data. The beta-binomial distribution has previously been suggested as a suitable model for the analysis of unpaired high-throughput sequencing data

Analyses that account for paired data have thus far employed simplifying assumptions that neglect the full structure of the data. The only published method that has attempted the analysis of paired data is the generalised linear model approach implemented in the

Methods

The data from high-throughput sequencing experiments used in differential expression analysis may be thought of as a set of

In analyses of paired data, we introduce the concept of a _{i} and _{1c}, ⋯, _{nc}) where _{ic} is the count of the _{i}, and the data for the sample pairs as

Model definitions

In forming a set of models for the data, we consider which patterns are biologically likely. In the simple case of a pairwise comparison, we have count data for some sample pairs from condition _{1}, _{2}, _{1}, _{2} paired with, respectively, counts from sequencing libraries

We can represent the models described in terms of the sets of samples for which the data are equivalently distributed under the model. Thus, the model of no differential expression between experimental conditions can be represented by a single set

The model for differential expression between the two experimental conditions can similarly be represented by the two sets

This set-based description of the models allows great flexibility in constructing multiple models that may describe the observed data. The evaluation of the posterior likelihood of such a model based on the observed data for a single tuple pair is described below.

Posterior likelihood of a model

Consider some model _{1}, ⋯, _{m}}. If, in this model, the _{q}, then for these sample pairs, the data at tuple pair _{q}, and are conditionally independent given these parameters. The _{q} are in turn drawn from some underlying distribution Θ_{q}. For computational simplicity, we assume that the _{q} are independently sampled from the distribution Θ_{q} for each set _{q}.

Given a model _{c}, that is

We can then calculate

The assumption of independence of the _{q} reduces the dimensionality of the integral allowing a numerical approximation to this integral to be more easily calculated. We suppose that for each Θ_{q} we have a set of values Θ_{q} that are sampled from the distribution of Θ_{q}. Then, following Evans & Swartz

The task that then remains is to derive the set Θ_{q} from the data.
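The numerical approximation described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the package implementation: the marginal likelihood of the data under a model is approximated by averaging the likelihood over parameter values sampled from the prior. The function names, the use of a plain binomial likelihood, and the example counts and sampled proportions are all hypothetical.

```python
import math


def log_binomial_pmf(k, n, p):
    """Log probability of k successes in n trials with success probability p."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + k * math.log(p) + (n - k) * math.log(1 - p)


def pair_log_lik(obs, p):
    """Log-likelihood of one sample pair given as (first count, pair total)."""
    return log_binomial_pmf(obs[0], obs[1], p)


def log_marginal_likelihood(data, sampled_params, log_lik):
    """Approximate log P(data | model) by the log of the mean likelihood
    taken over parameter values sampled from the prior distribution."""
    logs = [sum(log_lik(obs, theta) for obs in data) for theta in sampled_params]
    top = max(logs)
    # log-sum-exp keeps the average stable when the likelihoods are tiny
    return top + math.log(sum(math.exp(v - top) for v in logs)) - math.log(len(logs))


# Hypothetical data: (first count, pair total) for three sample pairs.
pairs = [(12, 30), (9, 25), (15, 40)]
# Hypothetical sampled proportions standing in for an empirically derived set.
proportions = [0.2, 0.3, 0.4, 0.5, 0.6]

log_ml = log_marginal_likelihood(pairs, proportions, pair_log_lik)
```

Working on the log scale matters in practice because the product of many small probabilities underflows quickly; the average itself is unchanged.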

Beta-binomially distributed data

There are a number of possible distributions which could be used for _{q}∣Θ_{q}. We develop our method based on the beta-binomial distribution for the tuple pair data, and derive an empirical distribution for the set of underlying parameters using the whole data set. We justify the use of the beta-binomial through the assumption of a Poisson distribution for the number of sequenced reads for a given tuple

If the count _{ic} is Poisson distributed, and the count of the paired library _{ic} is binomially distributed with parameter
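The standard result linking Poisson counts to a binomial distribution can be written out as below. The symbols are assumptions introduced for illustration only: $U$ and $V$ denote two independent Poisson counts with means $\lambda$ and $\mu$, and $n$ is their observed sum.

```latex
\begin{aligned}
P(U = u \mid U + V = n)
  &= \frac{P(U = u)\,P(V = n - u)}{P(U + V = n)} \\
  &= \frac{\dfrac{e^{-\lambda}\lambda^{u}}{u!}\cdot\dfrac{e^{-\mu}\mu^{n-u}}{(n-u)!}}
          {\dfrac{e^{-(\lambda+\mu)}(\lambda+\mu)^{n}}{n!}} \\
  &= \binom{n}{u}\left(\frac{\lambda}{\lambda+\mu}\right)^{u}
     \left(\frac{\mu}{\lambda+\mu}\right)^{n-u}
\end{aligned}
```

Conditional on the pair total, the first count is therefore binomial with proportion $\lambda/(\lambda+\mu)$; allowing that proportion to vary between biological replicates according to a beta distribution gives the beta-binomial.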

We suppose that the expected proportion of reads from which _{ic} is sampled is _{i} and _{ic} and _{i} and _{i} and

Using the beta-binomial as a model for over-dispersion, we adopt the following parameterisation for the distribution

where _{ic} and

The variance of the binomial distribution under this parametrisation is
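As an illustration of a mean-and-dispersion parameterisation of the beta-binomial, the sketch below assumes the mapping $\alpha = p(1-\phi)/\phi$, $\beta = (1-p)(1-\phi)/\phi$, so that $\phi = 1/(\alpha+\beta+1)$. This mapping, the function names and the example values are assumptions, and library scaling factors are ignored. Under this mapping the variance is $np(1-p)(1+(n-1)\phi)$, which the code checks numerically.

```python
import math


def log_beta(a, b):
    """Log of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_binomial_pmf(k, n, p, phi):
    """P(K = k) for a beta-binomial with mean proportion p and dispersion phi.

    Assumed mapping: alpha = p (1 - phi) / phi, beta = (1 - p) (1 - phi) / phi.
    """
    alpha = p * (1 - phi) / phi
    beta = (1 - p) * (1 - phi) / phi
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return math.exp(log_choose + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta))


# Hypothetical example values.
n, p, phi = 20, 0.3, 0.1
probs = [beta_binomial_pmf(k, n, p, phi) for k in range(n + 1)]
mean = sum(k * q for k, q in enumerate(probs))
var = sum((k - mean) ** 2 * q for k, q in enumerate(probs))
# Variance implied by the assumed parameterisation.
expected_var = n * p * (1 - p) * (1 + (n - 1) * phi)
```

As $\phi$ approaches zero the variance approaches the binomial variance $np(1-p)$, which is why $\phi$ can be read as a measure of over-dispersion.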

Empirically derived distributions

We can derive an empirical distribution for the parameters of a model _{q}, we would like to find an estimate of the mean and dispersion of the distribution underlying the data from a single tuple pair; _{c}. By finding estimates of the mean and dispersion for a large number of tuple pairs, we create the sampling Θ_{q}. The chief difficulty here lies in properly estimating the dispersion. Suppose that the data from a given tuple pair shows genuine differential expression. If the model that we are testing assumes that there is no differential expression, then the dispersion will be substantially over-estimated for this tuple pair. Since we do not know in advance which tuple pairs are genuinely differentially expressed and which are not, we need to consider the replicate structure of the data in order to properly estimate the dispersions. We define the replicate structure by considering the sets {_{1}, ⋯_{s}} where _{r} if and only if sample pair

Given this structure for the data, we can estimate the dispersion of the data in a tuple pair _{c} by maximum-likelihood methods. We consider the likelihood of the tuple pair _{c} under the replicate structure to be

and choose _{rc} and _{c} to maximise this likelihood. This gives us a value for _{c}, the dispersion of the
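The effect of respecting the replicate structure when estimating dispersion can be sketched with a crude grid search. This is an illustration only: the mean-and-dispersion mapping, the grids, the function names and the counts are assumptions, library scaling factors are ignored, and a real implementation would use a proper optimiser rather than a grid.

```python
import math


def log_beta_binomial(k, n, p, phi):
    """Log beta-binomial probability with mean proportion p and dispersion phi
    (assumed mapping: alpha = p (1 - phi) / phi, beta = (1 - p) (1 - phi) / phi)."""
    alpha = p * (1 - phi) / phi
    beta = (1 - p) * (1 - phi) / phi
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + math.lgamma(k + alpha) + math.lgamma(n - k + beta)
            - math.lgamma(n + alpha + beta)
            - math.lgamma(alpha) - math.lgamma(beta) + math.lgamma(alpha + beta))


def fit_shared_dispersion(replicate_groups, p_grid, phi_grid):
    """Grid-search maximum likelihood: one proportion per replicate group,
    one dispersion shared by all groups. Returns (phi_hat, [p_hat, ...])."""
    best = None
    for phi in phi_grid:
        total, proportions = 0.0, []
        for group in replicate_groups:
            # best proportion for this replicate group at this dispersion
            log_lik, p_hat = max(
                (sum(log_beta_binomial(k, n, p, phi) for k, n in group), p)
                for p in p_grid
            )
            total += log_lik
            proportions.append(p_hat)
        if best is None or total > best[0]:
            best = (total, phi, proportions)
    return best[1], best[2]


P_GRID = [i / 20 for i in range(1, 20)]
PHI_GRID = [0.001, 0.005, 0.01, 0.05, 0.1, 0.2, 0.3, 0.5, 0.7]

# Hypothetical (first count, pair total) data for two replicate groups.
group_a = [(10, 50), (12, 50), (8, 50)]
group_b = [(40, 50), (38, 50), (42, 50)]

phi_structured, p_structured = fit_shared_dispersion([group_a, group_b], P_GRID, PHI_GRID)
phi_pooled, p_pooled = fit_shared_dispersion([group_a + group_b], P_GRID, PHI_GRID)
```

With these example counts, pooling the two groups forces the difference between them to be absorbed by the dispersion, so `phi_pooled` comes out larger than `phi_structured`.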

In analysis of paired data, one question of interest may be to identify tuple pairs which show a particular ratio of expression between the sample pairs. The most usual case will be a one-to-one ratio (after accounting for variation in library scaling factor), indicating that there is no differential expression of the tuple pair between the sample pairs. To model this, we simply set the _{qc} as the constant proportion of expression to be examined for all

Alternatively, we may wish to consider a model in which we are not interested primarily in the value of the ratios of expression between sample pairs, but only in whether these ratios are similar or different amongst various experimental groups defined by the sets _{q}. To approximate a distribution on the Θ_{q} for such a model, we can estimate the proportion _{qc} of reads in the first count of each pair of samples for the tuple pair _{c} and estimating _{qc} by maximum likelihood methods. For notational simplicity, we define the data associated with the set _{q} as _{qc} to be

We then choose _{qc} to maximise this likelihood for each _{q} = {(_{qc}, _{c})} by repeating one of these processes for multiple sampled tuple pairs. We can then calculate

This method of estimating the dispersion assumes that the dispersion of a tuple pair is constant across experimental groups. Where the number of samples is small, this is likely to be the best approach. Where there is an expectation that the dispersion will be substantially different between experimental groups, and there are adequate numbers of replicates, there may be advantages to estimating the dispersions individually for each of the different sets of samples in each model, while still considering the replicate structure within these sets. This is easily done by restricting the data (and corresponding replicate structure) to _{qc} when estimating the dispersion in Eqn 4.

Estimation of prior probabilities of each model

A number of options are available when considering the prior probabilities of each model _{c} we apply the BIC to select the most likely model based on the calculated likelihoods _{c}|_{c}) for any individual tuple pair _{c}.

The scaling factor

Finally, we need to consider the scaling factor

False discovery rates from posterior likelihoods

False discovery rates can be estimated directly from the posterior likelihoods estimated for each model. If the likelihood of a model _{m} is the set of the top
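One common way to estimate a false discovery rate from posterior likelihoods is sketched below, on the assumption that the expected proportion of false discoveries in a selected list is the mean of one minus the posterior likelihood over that list. The function names and the example posteriors are hypothetical.

```python
def estimated_fdr(posteriors, top_n):
    """Expected false discovery rate among the top_n highest posteriors."""
    ranked = sorted(posteriors, reverse=True)[:top_n]
    return sum(1.0 - p for p in ranked) / len(ranked)


def largest_list_at_fdr(posteriors, threshold):
    """Size of the largest top-ranked list whose estimated FDR is at most threshold."""
    ranked = sorted(posteriors, reverse=True)
    cumulative, best = 0.0, 0
    for n, p in enumerate(ranked, start=1):
        cumulative += 1.0 - p
        if cumulative / n <= threshold:
            best = n
    return best


# Hypothetical posterior likelihoods of one model for five tuple pairs.
posteriors = [0.97, 0.20, 0.99, 0.60, 0.90]
fdr_top3 = estimated_fdr(posteriors, 3)
n_selected = largest_list_at_fdr(posteriors, 0.05)
```

Because the list is ranked by posterior likelihood, the estimated rate can only stay level or rise as the list grows, so the largest list meeting a threshold is well defined.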

Results and discussion

We use both simulated and real data to compare the beta-binomial method described to the edgeR-GLM

Simulated data

We base our simulations on those described by Robinson & Smyth

We assess the performance of the methods by ranking the tuple pairs by their strength of association with each type of differential expression and computing the true and false positive rates using these ranked lists. For increased robustness, we estimate the mean of these rates over one hundred simulations under each set of conditions.
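The evaluation described above can be sketched as follows, assuming each method returns one score per tuple pair and the simulated truth is known. The function names and example values are hypothetical, ties in the scores are not treated specially, and averaging is done rank by rank across simulations of equal size.

```python
def roc_points(scores, is_true):
    """(false positive rate, true positive rate) after each position
    in the list ranked by decreasing score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_pos = sum(1 for t in is_true if t)
    n_neg = len(is_true) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for i in order:
        if is_true[i]:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def mean_curve(curves):
    """Average several equal-length curves position by position."""
    count = len(curves)
    return [
        (sum(c[j][0] for c in curves) / count, sum(c[j][1] for c in curves) / count)
        for j in range(len(curves[0]))
    ]


# Hypothetical scores and truth labels for two small simulations.
curve_1 = roc_points([0.9, 0.8, 0.7, 0.1], [True, False, True, False])
curve_2 = roc_points([0.9, 0.8, 0.7, 0.1], [True, True, False, False])
average = mean_curve([curve_1, curve_2])
```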

For the _{ic} and _{c} _{i} _{ic} _{ic} and _{c}, which define a baseline of expression for the tuple pair when scaled by the library size, are sampled randomly from a set of values empirically estimated by the _{i} and

We simulate individual effects in the data by allowing _{ic} to vary for each sample pair

The _{ic} allow us to introduce differences between experimental groups of sample pairs and between the members of a pair, while allowing for variation between biological replicates. They are sampled from a beta distribution with shape parameters _{ic} and _{c}, which indicates the level of dispersion (and hence, biological noise) are drawn from a beta distribution with shape parameters 1, 10.

We begin by simulating the simplest case of a paired analysis. In this scenario we are interested only in the differences _{c} is drawn from a uniform distribution between -

We examined the performance of the methods on the basis of ROC curves. Figure

**Comparison of methods identifying differential expression within paired counts.** ROC curves showing the performance of the beta-binomial, edgeR-GLM and DESeq-GLM methods in identifying differential expression within paired counts in simulated data for various combinations of

For low false positive rates, the performance of the methods is approximately equal, as each identifies the ‘low-hanging fruit’: those tuple pairs showing high differential expression with relatively low biological variation. However, for higher false positive rates the beta-binomial method shows a clear and consistent gain in performance over the generalised linear modelling approaches. The DESeq-GLM in general performs better than edgeR-GLM, especially for higher numbers of sequenced libraries. For high library numbers, the performance of DESeq-GLM approaches that of our beta-binomial approach.

We next consider the more complex case where differential expression exists both within paired counts, and between experimental groups. This is equivalent to an experimental set-up in which we have sample pairs from condition _{1}, …, _{n} paired with samples _{1}, …, _{n} paired with

We again simulate ten thousand tuple pairs. For one thousand of these, we simulate differential expression within paired counts as before. For a second group of one thousand tuple pairs, we also simulate differential expression between experimental conditions. We simulate differential expression between experimental conditions by applying a scaling factor _{c} to one of the two experimental conditions. This is applied such that for a differentially expressed tuple pair _{c} is simulated as before and _{c} is drawn from a uniform distribution between -_{ic} is an indicator variable randomly sampled from {0,1} for each tuple pair

Both the beta-binomial approach and the generalised linear modelling approaches are capable of simultaneously detecting both types of differential expression; however, the form of the results produced by these two approaches differs. For the beta-binomial approach, posterior likelihoods are calculated for each available model, and hence only one model for each tuple pair can be assigned a high posterior likelihood. If the true differential expression of a tuple pair involves changes in expression ratios between experimental groups, the model for consistent change from a one-to-one ratio between paired counts will have a low posterior likelihood as the change will not be consistent across the tuple pair. For the generalised linear modelling approaches, both a pair effect and an experimental group effect, and the significance with which these differ from zero, are calculated for each tuple pair. Consequently, both effects can be present with high significance even when changes in expression are driven primarily by a change in expression ratios between experimental groups.

If those tuple pairs simulated as showing differential expression ratios between experimental groups are treated as false positives when considering differences from a one-to-one ratio between paired counts, this heavily penalises the generalised linear model methods. If they are treated as true positives, the generalised linear modelling approaches are evaluated on the basis of two thousand true positives where the beta-binomial method is evaluated on the basis of one thousand true positives, making performance comparisons difficult. To allow fair comparisons between the methods, we therefore exclude the thousand tuple pairs simulated as showing differential expression ratios between experimental groups when calculating the true and false positive rates for detection of differences from a one-to-one ratio within paired counts.

Figure

**Comparison of methods identifying differential expression within paired counts and between experimental groups.** ROC curves showing the performance of the methods in simultaneously identifying differences from a one-to-one ratio within paired counts (solid lines) and differential expression ratios between experimental groups (dashed lines) in simulated data for various combinations of

In this more complex case, the difference between the performance of the methods is considerably more pronounced. Particularly in identifying differential expression between experimental groups, the beta-binomial method shows considerably better performance than that of both generalised linear modelling approaches. In identifying differential expression from a one-to-one ratio within paired counts, the performance of the beta-binomial method is similar to that shown in Figure

The simulated data described above are drawn from sets of Poisson distributions whose parameters are a multiple of a random variable drawn from a beta distribution. Therefore, the simulated data have a beta-binomial distribution, the model proposed for the analysis. We can examine the robustness of the model by considering an alternative distribution for the simulations. Since the Poisson distribution is a well established model for the technical effects observed in high-throughput sequencing data _{ic}. The minimax distribution is also a two-parameter distribution on (0,1) with density

The moments of this distribution are given in terms of the beta function such that
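For reference, if the minimax distribution is taken to be the two-parameter distribution on (0,1) commonly known as the Kumaraswamy distribution (an assumption on our part, as are the parameter symbols $a$ and $b$), its density and raw moments are:

```latex
f(x; a, b) = a\,b\,x^{a-1}\,(1 - x^{a})^{b-1}, \qquad 0 < x < 1,\; a > 0,\; b > 0,

\mathrm{E}\!\left[X^{n}\right] = b\,B\!\left(1 + \frac{n}{a},\, b\right),
\qquad B(s, t) = \frac{\Gamma(s)\,\Gamma(t)}{\Gamma(s + t)} .
```

The mean and variance follow from the first two raw moments, but neither can be inverted in closed form to recover $a$ and $b$.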

Consequently, it is not possible to establish closed-form expressions for the parameters _{ic}, nor is it possible to define a dispersion parameter for this distribution. In order to select parameters for the minimax distributions used to simulate the data, we therefore calculate the parameters for the beta distribution as described above. We then (numerically) calculate the parameters of the minimax distribution such that the mean and variance of each random variable are identical to those which would be used in the case of the beta distribution. This approach has the advantage that, for given parameters of simulation, the results are directly comparable between those data simulated using a beta distribution and those simulated using a minimax distribution.
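The numerical matching of mean and variance might be sketched as below. The sketch assumes that the minimax distribution is the Kumaraswamy distribution with raw moments $b\,B(1+n/a, b)$, and it uses nested bisection with hypothetical search brackets. It is not the procedure used to generate the simulations, and the brackets would need widening for some targets.

```python
import math


def kumaraswamy_moment(n, a, b):
    """n-th raw moment b * B(1 + n/a, b), evaluated on the log scale."""
    return math.exp(math.log(b) + math.lgamma(1 + n / a) + math.lgamma(b)
                    - math.lgamma(1 + n / a + b))


def bisect_log(f, lo, hi, iters=200):
    """Root of f on [lo, hi] by bisection on the log scale.
    Requires f(lo) and f(hi) to have opposite signs."""
    f_lo = f(lo)
    if f_lo * f(hi) > 0:
        raise ValueError("no sign change in bracket")
    for _ in range(iters):
        mid = math.sqrt(lo * hi)
        f_mid = f(mid)
        if f_lo * f_mid <= 0:
            hi = mid
        else:
            lo, f_lo = mid, f_mid
    return math.sqrt(lo * hi)


def b_for_mean(a, mean):
    """For fixed a, the b whose Kumaraswamy mean equals the target mean."""
    return bisect_log(lambda b: kumaraswamy_moment(1, a, b) - mean, 1e-6, 1e9)


def match_beta(alpha, beta):
    """Kumaraswamy (a, b) with the same mean and variance as Beta(alpha, beta)."""
    mean = alpha / (alpha + beta)
    var = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))

    def variance_gap(a):
        b = b_for_mean(a, mean)
        return kumaraswamy_moment(2, a, b) - mean ** 2 - var

    a = bisect_log(variance_gap, 0.2, 10.0)  # hypothetical bracket for a
    return a, b_for_mean(a, mean)


# Hypothetical target: match the first two moments of Beta(2, 5).
a_hat, b_hat = match_beta(2.0, 5.0)
matched_mean = kumaraswamy_moment(1, a_hat, b_hat)
matched_var = kumaraswamy_moment(2, a_hat, b_hat) - matched_mean ** 2
```

The inner search fixes the mean for each candidate $a$, so the outer search only has to adjust the variance; this keeps the problem one-dimensional at each step.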

Results for the application of the three methods to data simulated using the minimax distribution are shown in Additional file

**Supplementary Figures and Tables.**

**Figure S1:** ROC curves showing the performance of the beta-binomial, edgeR-GLM and DESeq-GLM methods in identifying differential expression within paired counts in data simulated using the minimax distribution to simulate biological variation. Simulations are carried out for various combinations of

**Figure S2:** ROC curves showing the performance of the beta-binomial, edgeR-GLM and DESeq-GLM methods in simultaneously identifying differential expression from a one-to-one ratio within paired counts (solid lines) and differential expression between experimental groups (dashed lines) in data simulated using the minimax distribution to simulate biological variation. Simulations are carried out for various combinations of

**Table S1:** The top twenty-nine genes (FDR < 0.05) showing consistent ratios within patients of differential expression between normal and tumour samples.

**Table S2:** The twenty-five genes identified by Yu

**Table S3:** The nine genes identified by Tuch

**Table S4:** Gene sets showing enrichment (

**Table S5:** Gene sets showing enrichment (

**Table S6:** Gene sets showing enrichment (top fifty) in the 2033 down-regulated in tumour genes showing any differential expression at FDR > 0.05. Head and neck squamous cell carcinoma gene sets are highlighted.

**Table S7:** Gene sets showing enrichment (top fifty) in the 572 up-regulated in tumour genes showing any differential expression at FDR > 0.05. Head and neck squamous cell carcinoma gene sets are highlighted.

Biological data

We examine a set of paired data from a recent study of oral squamous cell carcinoma

We analyse these data to find both genes displaying a consistent fold-change between tumour and normal tissue, and those genes which show heterogeneity in fold-change between the paired counts belonging to the individual patients. The patients are treated as biological replicates for the purposes of dispersion estimation (Eqn. 4) despite the presence of some genes displaying patient-specific effects. In the absence of true biological replicates, this is required in order to carry out a meaningful analysis. We construct a set of models testing both for consistent differential expression between tumour and normal tissue and for differing ratios of expression between individuals. We acquire posterior likelihoods for each of these models of differential expression and hence can evaluate either the likelihood that each gene displays consistent differential expression between normal and tumour samples, or the likelihood that a gene displays differential expression of any kind (by taking the sum of the posterior likelihoods of all models describing differential expression).

We identify 29 genes displaying a consistent ratio of differential expression between tumour and normal samples at a false discovery rate (FDR) of 0.05 (Additional file

Comparisons with the highest-ranked differentially expressed genes discovered by the edgeR-GLM approach show a more consistent picture. Of the reported ten most significant genes from their analysis, five are also selected in our list of the twenty-nine genes showing consistent differential gene expression ratios at an FDR of 0.05, while the remainder still have an estimated likelihood of consistent differential expression greater than 90%. Rank correlation between the gene lists produced by the two methods is 0.59 if the genes are ranked by the likelihood of consistent differential expression but 0.88 if they are ranked by the likelihood of differential expression of any kind.

As in McCarthy

Conclusions

We have presented here an empirical Bayesian approach to analysing differential expression in paired sample high-throughput sequencing data based on the beta-binomial distribution. The distributions of the parameters of the beta-binomial distribution are estimated by repeated sampling from the observed data, and these distributions are used to estimate posterior likelihoods for each proposed model of expression for each tuple pair. Estimating the distributions of the prior parameters in this way creates a ‘borrowing’ of information across tuple pairs, as the posterior likelihoods for each tuple are calculated using the observed data for all sampled tuple pairs. In analyses with large numbers of outliers, it may be advantageous to ‘squeeze’

Our method is implemented as part of the software package

As with the most successful approaches to analysis of unpaired sequencing data

A key assumption made in developing this method concerns the nature of the over-dispersion between samples caused by biological variation. In the absence of available data from which to infer the precise nature of the over-dispersion, we have assumed for computational convenience that the beta distribution is a suitable model for the biological variation in ratios of expression between sample pairs and hence that the distribution of the count data may be modelled with the beta-binomial. The beta distribution is remarkably flexible and is thus likely to be capable of accounting for the behaviour of most paired data, although in certain circumstances this assumption may fail. We note, however, that the principles of the empirical Bayesian approach may be applied for any underlying distribution, and so might be adapted to meet this circumstance.

We demonstrate the performance of our methods on both simulated and real data. In analyses of simulated data using a range of parameters, we show considerable gains in performance compared with two implementations of a generalised linear modelling approach, especially when more complex patterns of differential expression are present in the data. The gain in performance using our methods is particularly marked for larger numbers of samples, a result that is likely to be increasingly important as the cost of sequencing experiments declines, allowing larger studies. This gain in performance is also found when the minimax distribution

The analysis of the biological data from Tuch

The comparison of paired mRNA-Seq samples is a major application for our method. However, there are other key applications. In particular, paired data arise naturally in studies of epigenetic markers, such as chromatin and methylation marks, where the prevalence of a particular marker is compared to a baseline measurement for each marker. Our method is, therefore, likely to have wide applicability not only in cancer and other areas of medicine but also in fundamental life science research.

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

TJH designed and implemented the methods described and drafted the manuscript. KAK helped to draft the manuscript. Both authors read and approved the final manuscript.

Acknowledgements

Thomas J. Hardcastle is supported by the European Commission Seventh Framework Programme grant number 233325.