Statistical Laboratory, Centre for Mathematical Sciences, Wilberforce Road, Cambridge, UK

DAMTP, Centre for Mathematical Sciences, Wilberforce Road, Cambridge, UK

Cancer Research UK, Cambridge Research Institute, Li Ka Shing Centre, Robinson Way, Cambridge UK

Department of Oncology, University of Cambridge, Li Ka Shing Centre, Robinson Way, Cambridge, UK

Abstract

Background

High-throughput sequencing technology has become popular and widely used to study protein and DNA interactions. Chromatin immunoprecipitation, followed by sequencing of the resulting samples, produces large amounts of data that can be used to map genomic features such as transcription factor binding sites and histone modifications.

Methods

Our proposed statistical algorithm, BayesPeak, uses a fully Bayesian hidden Markov model to detect enriched locations in the genome. The structure accommodates the natural features of the Solexa/Illumina sequencing data and allows for overdispersion in the abundance of reads in different regions. Moreover, a control sample can be incorporated in the analysis to account for experimental and sequence biases. Markov chain Monte Carlo algorithms are applied to estimate the posterior distributions of the model parameters, and posterior probabilities are used to detect the sites of interest.

Conclusion

We have presented a flexible approach for identifying peaks from ChIP-seq reads, suitable for use on both transcription factor binding and histone modification data. Our method estimates probabilities of enrichment that can be used in downstream analysis. The method is assessed using experimentally verified data and is shown to provide high-confidence calls with low false positive rates.

Background

The importance of DNA-binding proteins in molecular functions such as transcription, replication, DNA repair and chromosome segregation highlights the significance of identifying the locations of their binding sites throughout the genome. The most widely used method for mapping these genomic locations is chromatin immunoprecipitation (ChIP). This process involves shearing the DNA and isolating the fragments to which proteins have bound

In this paper we focus on a method of analyzing ChIP-seq data to identify protein-binding locations and the presence of specific histone modifications in the genome. Such data consist of the locations of the ends of the protein-bound and background fragments from the sample of interest as well as often containing control data from a sample that contains fragments of DNA with no preference for the regions to which the specific protein binds. Various statistical tools have been developed to interpret the data resulting from these techniques, but the set of available tools is not yet mature.

Our algorithm models the positions of the sequenced fragments and determines the locations of enriched areas, such as binding sites, by using HMMs and Bayesian statistical methodology. In this (and other) ways, it differs from previously published methods that we now briefly review.

ChipSeq Peak Finder

The extended set method XSET

A feature of ChIP-seq is that, by examining only the start of protein-bound fragments, we can identify peaks offset on the forward and reverse strands of the DNA, the true binding site lying somewhere in between. Model-based Analysis for ChIP-seq (MACS)

Quantitative enrichment of sequence tags (QuEST)

In Kharchenko et al.

Comparison of different peak-calling algorithms

**Method**

**A**

**B**

**C**

**D**

**E**

**F**

**G**

CSPF

control or IP only

read length

no orientation

merge strands

no shift

N

simple height criteria

ROC curve (empirically)

both

XSET

IP only

fragment length

orientation

merge strands

no shift

Y

simple height criteria

FDR estimate using Poisson distribution

both

Mikkelsen et al.

IP only

no orientation

no merge

no shift

Y

no official FDR

both

MACS

control or IP only

fragment length

orientation

no duplicated reads

shift reads

merge strands

N

Poisson

FDR estimate by peaks in control:IP

both

QuEST

control

orientation

shift reads

merge strands

N

kernel density estimation

FDR estimate by permutations of the control

better for TF

FindPeaks

IP only

fragment length

orientation

no merge

no shift

N

simple height criteria

FDR estimate by permutations of the IP

both

SISSR

control or IP only

fragment length

orientation

no merge

no shift

N

compares reads on different strands

FDR estimate by peaks in background:IP

better for TF

Kharchenko et al.

control

orientation

no merge

no shift

N

Poisson distribution

FDR estimate by permutations of the control

better for TF

PeakSeq

control

fragment length

orientation

merge strands

Y

sample normalisation Binomial distribution

FDR estimate, q-values (BH correction)

both

BayesPeak

control or IP only

fragment length

orientation

no merge

no shift

N

negative binomial distribution, Bayesian posterior probabilities

posterior enrichment probabilities

both

The methods shown are compared with respect to the following features:

A. whether they require a control sample (control) or whether they only use the ChIP sample (IP only)

B. whether they take into account the (average) length of the reads/fragments and their orientation

C. whether they take into account the different DNA strands or if they merge the reads, and whether the reads are shifted towards the 3' end

D. whether an externally estimated mappability file is used

E. how the scores, on which the classifications are based, are estimated

F. whether/how any FDR or sensitivity/specificity estimates are calculated

G. whether or not the method is applicable to both transcription factor (TF) and histone mark data.

Results

Algorithm

During chromatin immunoprecipitation, the proteins are cross-linked with the DNA, the cells are lysed, and the DNA is randomly sheared. The fragments bound by the protein of interest are isolated using specific antibodies to immunoprecipitate the protein and the cross-links of protein and DNA are reversed to liberate the DNA fragments. The resulting sample is enriched in the target immunoprecipitated areas but consists mainly of background DNA fragments.

Following the experiment, high throughput sequencing is used to reveal the identity of a sample of the fragments. The fragments are size-selected beforehand to improve the throughput and reproducibility of the sequencing reaction. We use the Illumina Genome Analyzer platform, in which the samples are placed on flow cells and go through several cycles of preparation, imaging and identification. The short reads from the ends of fragments are then mapped back to the reference genome to give the chromosomal position and strand of each read. The length of each fragment is unknown, since only one end is sequenced, but the average fragment length can be estimated experimentally (

An important part of the process is the addition of a control sample, such as an Input preparation, which undergoes the same cross-linking, fragmentation and sequencing procedure, the key difference being that the bound fragments are

HMM model description

We have constructed a fully probabilistic model that takes into account the natural features of the data and incorporates them in a hidden Markov framework. The method divides the genome into equidistant regions, or windows, whose size is not less than half the mean fragment length; depending on the experiment and the length of the sites of interest, the resolution can be modified as desired. Counts for each window are defined as the number of 5' fragment ends (the end that was sequenced) that map to that region, either on the forward or the reverse DNA strand. We define these counts as

We use a hidden Markov model (HMM) that assigns a state to each region _{t }= 1, if there is a binding site or modification in that region increasing relative fragment abundance, and _{t }= 0 if not. We assume that the dependence between adjacent windows is the same throughout the genomic region under study, _{t+1 }= 1|_{t }= 0) = _{t+1 }= 1|_{t }= 1) =

As the fragment counts _{t }and _{t+1}, the working states correspond to the set of paired combinations

for all windows _{t }= {1, 2, 3} to have the same enrichment effect and the state _{t }= 0 to have none. The initial state distribution for

Illustration of the model

**Illustration of the model**. This figure shows how the reads (arrows) on the forward and reverse strand, indicated by red and blue respectively, are counted as _{t }state corresponds to the underlying ones _{t }and _{t+1}.

Conditional on the parameters _{t }(defined below) and on the hidden state _{t }= 0, the counts are negative binomially (NB) distributed, and given _{t }= 1 the counts are the sum of two independent negative binomial random variables, corresponding to background and foreground counts. Using this distribution avoids estimation problems caused by overdispersion of the data when greater variability than expected is observed. Such issues would arise if we used the simple Poisson model that implies equality between the mean and the variance of the counts. In addition, the NB can be expressed as a Poisson-Gamma mixture to make parameter estimation and additions to the model more straightforward. When a control sample is available, it is included in the analysis by introducing a parameter via Poisson regression, as shown below. In this way external factors causing high or low read concentration can be quantified using the density of the Input reads and protein-related enrichment can be correctly identified.

The emission distributions of the model are

where Γ (^{α-1 }exp(-_{t }is the number of Input fragments with 5' ends in windows _{0 }and _{1 }are the parameters corresponding to relative fragment abundance in the unenriched and enriched regions respectively, _{0}, _{0}, _{1 }and _{1 }are hyperparameters.

Parameter and state estimation

Within the Bayesian framework of this paper we use efficient Markov chain Monte Carlo (MCMC) algorithms, as opposed to the Expectation-Maximization (EM) algorithm

MCMC algorithms take an approach similar to EM by using the complete data (_{0}, _{0}, _{1 }and _{1 }and

We evaluate the likelihood expression using a recursive technique introduced by Baum and Welch

To sample from the posterior distribution of the states we use another recursive technique, called forward-backward (FB) Gibbs

The nature of the hidden states is then estimated using their marginal posterior probabilities, which indicate whether each window is enriched or not, according to the model and the data. More details on how these are calculated can be found in the Methods section. Once the posterior probabilities for the ** Z **states are estimated, it is very easy to calculate the equivalent ones for the

Implementation

We applied the method to ChIP-seq data to study both transcription factor binding and histone modification assays. We used ChIP samples for the liver-enriched transcription factor HNF4

The reads were aligned using Illumina's Eland program, allowing for up to two mismatches in the first 32 bases of each read. To improve efficiency of the algorithm we split the genome into 5 Mb-long regions and analysed them separately using non-overlapping windows. Furthermore, we ran the method twice, the second time using an offset of half a window's length, to classify correctly all the regions and avoid any bias due to the position of the windows. We then merged all the called peaks between the two runs.

Choice of window length

As expected, the transcription factor and histone mark data look very different. The first have narrow peaks at the locations of the binding sites, and the latter have wider enriched regions that are usually covered by multiple peaks. Figure

A view of the HNF4

**A view of the HNF4 α, H3K4me3 and Input data**. This diagram shows the read distribution on a region of chromosome 16 for the HNF4

A closer view of some HeK4me3 and HNF4

**A closer view of some HeK4me3 and HNF4 α peaks**. These histograms present the counts of the 5' ends of the reads from the H3K4me3 and the HNF4

To ensure the algorithm is suitable for both analyses, we applied it using different window-lengths. The length of the library fragments as reported by the experimental procedure was in the range 110-260 bp with a slight preference for shorter fragments. The mean fragment length of approximately 190 bp places a lower limit on the window size of a little under 100 bp, and our desire for fine resolution places an upper limit of approximately 300 bp on the window size, for which reason we investigated window sizes of 100 bp, 200 bp and 300 bp.

We observed that, as the length of the windows increased, fewer enriched regions were identified for both samples. For example, for HNF4

Inclusion of the control sample

Our method can be implemented in a manner that makes use of a control sample or not, as the situation dictates. For the HNF4

Checking adequacy of the model

Since several assumptions have been used in the construction of the algorithm, it is important to check the practical fit of the model. In a classical setting, a goodness-of-fit test compares observed and fitted values by quantifying how extreme the data are if we assume the model to be true. In a Bayesian framework, model fit can be tested using posterior predictive distributions of test statistics that can be functions of both the data and the parameters

To investigate relevant features of ChIP-seq data, some sensible discrepancy variables include the mean number of reads in each window, the corresponding standard deviation, and the maximum possible number of reads in one location. We plot histograms of the simulated values, which represent the posterior predictive distributions, and visually compare them to the observed values.

Since the Poisson distribution has been widely used in modelling the distribution of reads in the genome in ChIP-seq applications, we compare it with the more flexible model we propose. In Figure

Checking the model fit: Histograms of the discrepancy statistics for the HNF4

**Checking the model fit: Histograms of the discrepancy statistics for the HNF4 α data**. The red lines represent the value we observe from the data for the mean number of peaks per window, their standard deviation, and the maximum possible read score. The histograms are plots of the same variables estimated using the 3,000 simulated values of the parameters during the run of the algorithm. The closer the real value is to the simulated distribution, the better the model explains those aspects of the data. The first row shows the three histograms of the values generated using the Poisson model and the second row shows the corresponding values for the negative binomial model. The latter shows a notable improvement in explaining these features of the data.

Testing

Application to exampled data

We present the results from region 90,000,000-95,000,000 bp on mouse chromosome 16, which accommodates a large number of reads and is likely to have more sites than a randomly selected genomic region. As can be seen in Figure

Posterior probability plots

**Posterior probability plots**. These plots show the distributions of the posterior probabilities _{t }= 1),

Our method called 149 peaks for HNF4

We used the algorithms ChIPSeq Peak-Finder (CSPF), MACS and PeakSeq for comparison. We chose those three because they take different approaches and vary in complexity, thus giving a wide spectrum of possible results. CSPF uses simple height criteria in comparison with the control sample to call peaks, whereas MACS takes a model-based approach by first scanning for peaks on opposite DNA strands and then using Poisson probabilities to detect enrichment. As mentioned in the introduction, PeakSeq takes into account the mappability of the underlying regions and makes the control and ChIP samples comparable by normalising the first with respect to the background noise of the latter. Then it uses binomial probabilities to generate p-values and detect significantly large read concentrations. The results are presented as Venn diagrams in Figure

Algorithm comparison for HNF4

**Algorithm comparison for HNF4 α and H3K4me3**. This figure shows how the peaks called by BayesPeak, ChipSeq Peak Finder, MACS and PeakSeq compare. For CSPF we used a maximum gap of 100 bp between reads to create clusters, which were then identified as peaks if they contained more than 9 reads, more than 5 of which overlapped at the same base, and if the total count was larger than the corresponding number in the Input sample by a factor of 5. To identify the enriched regions using the other methods, we chose p-values smaller than 0.05 for both MACS and PeakSeq. The number of peaks called by BayesPeak seems smaller in these diagrams than we report in the paper because in some cases more than one region is covered by a long peak identified by another method; in this case, they are merged into a single peak covering the entire region.

In our examples, we note that all but one of the peaks called by BayesPeak are identified by at least one other method, thus giving us confidence that BayesPeak is calling only true peaks. Peaks that other methods call, but that BayesPeak does not, may show discrepancy from the model that underpins BayesPeak, as will be discussed in the conclusions. MACS reports more peaks than any other method, suggesting that MACS may be returning peaks that represent false positives. Note that BayesPeak can return more peaks by accepting a lower posterior probability threshold, but these additional peaks are more likely to be background features.

Motif analysis

Transcription factors bind to a set of specific DNA sequences, many of which have been identified by previous studies. We used motif analysis to check whether the called peaks contain the known motif of the transcription factor, indicating that they represent proper transcription factor binding sites.

We searched for the HNF4

We can interpret these proportions as p-values, and the smaller they are, the greater the evidence that the peaks contain the transcription factor binding sequence. In Figure

Box-plots of motif enrichment p-values

**Box-plots of motif enrichment p-values**. We grouped all the sequences of the regions called by BayesPeak (including the ones called by MACS) and the ones detected only by MACS. Here we compare boxplots for the groups of called peaks, which summarize the p-values corresponding to the significance of the motif enrichment in each of the respective sequences. We observe that the regions identified by BayesPeak give lower p-values, on average, compared to the ones called by MACS only, strengthening the indication that enrichment is less likely to have occurred by chance and that binding does take place at those sites. A separate analysis with CLOVER used on the same groups, yielded p-values of less than 10^{-6 }for the regions called by BayesPeak and greater than 0.10 for those identified only by MACS.

In addition, we took another approach to check for over-representation of the motifs in the same two groups of enriched regions. For that purpose we used the program CLOVER ^{-6}, and no other of the available motifs was detected. In addition, the program did not report significant enrichment of any transcription factor motifs for the regions identified only by MACS, as the p-values for the corresponding tests were larger than 0.10. According to these findings, the regions identified by BayesPeak, most of which also called by MACS, have stronger evidence for binding than the ones that are uniquely identified by MACS.

Validation data sets

The objective testing of ChIP-seq peak calling methods is somewhat challenging, since spike-in data sets cannot yet be generated due to the difficulty of replicating the nature of the data, and simulated data sets do not give a close match to an experimental outcome. However there are two small data sets for which enriched and background regions have been validated, and we apply our methods to these.

The first such set that we consider was presented by Johnson et al.

We ran BayesPeak on their unamplified data and used the posterior probability threshold of 0.50 to call peaks. (The majority of identified peaks had probabilities greater than 0.98, as was also the case with the HNF4

In Figure

Algorithm comparison for NRSF data

**Algorithm comparison for NRSF data**. This figure shows how the known-positive and known-negative regions called by the same four methods compare. We merged those results for CSPF and PeakSeq that were identical.

As a final test, we analysed another transcription factor data set, namely Tal1 (also known as Scl), with a set of 24 experimentally-validated enriched regions

We ran BayesPeak using the Tal1 sample only and identified the enriched regions using the same posterior probability threshold of 0.50 as before. Our method did find all 24 enriched regions, 20 of them having posterior probabilities greater than 0.90. Comparing with the other algorithms as before, CSPF also gave the full list of known positives, whereas PeakSeq could not be used due to the absence of a control sample. MACS did not make any of the expected calls and reported a problem in building its model.

Discussion

A lot of binding peaks are obvious and will be identified by any sensible method. This, coupled with the small amount of available validated data, makes it difficult to show definitively that one method is better than another. We have however shown that our method performs well, and that it has several qualities that make it attractive for inclusion in an analysis suite.

The key advantages of our model, that other methods do not offer, are the Bayesian approach to parameter and state estimation, and the use of the negative binomial distribution. Within the Bayesian paradigm, we are returning posterior probabilities as our measure of certainty. Therefore we do not call regions as enriched simply because they do not look like background, but only if they look more like a peak than background. This offers the greatest scope for interpretation, as well as allowing for the use of probabilities as weights in subsequent analyses (

After submitting this manuscript, we became aware of another Bayesian implementation

The negative binomial distribution allows for overdispersion and provides a better fit to the data than the Poisson distribution that has been widely used by other methods. Not only did the Poisson distribution appear inferior in the comparison of discrepancy variables, but peak calling methods that incorporate it failed to identify some peaks in both the HNF4

Other features of our method that are desirable, but shared to some degree by some other methods, are the accounting for strandedness and orientation of the fragments and the ability to identify binding regions more precisely. Accounting for strandedness is essential for modelling the peaks, but our use of this information differs from that of previous methods. BayesPeak also tends to identify narrower regions than other methods, which is a particular advantage when searching for transcription factor binding sites.

Peaks identified by BayesPeak appear to be less likely to contain false positives. The high-confidence regions identified by our method tend also to be called by other methods. Additionally, the motif analysis we carried out showed high enrichment for the transcription factor's binding sequence. However, for the H3K4me3 data set, there were peaks that the other methods call, but that ours does not; we present examples of those in Figure

Some peaks called only by the other three methods and an example of a peak called by BayesPeak

**Some peaks called only by the other three methods and an example of a peak called by BayesPeak**. In this figure, there are three regions (labelled 1, 2, 3) that BayesPeak did not identify as peaks that the three other algorithms did. The plots show the first 32 bases of each fragment, the ones corresponding to the forward strand in red and the reverse strand in blue. In addition, we include an image of a peak that was called by our method to show how the shape compares with the others. The images were generated using the UCSC genome browser.

The other advantage of the modelling framework that we have used is the ability to adapt and extend the model as required. More states could be introduced to cope with a different physical model of fragment length to binding-site length, and simple modifications could allow for the use of paired end data. Finally, it is becoming more common for the locations of multiple transcription factor binding sites and locations of histone modifications to be investigated together for the same sample. The tendency is for each to be compared separately to a control sample, and due to financial and experimental pressures the same control sample tends to be used for each - a situation that is clearly not ideal. Our approach can be extended to model all samples simultaneously, sharing information about background levels and preventing the inappropriate over-influence of the control sample.

Conclusion

We have presented a flexible and adaptable method for the detection of enriched regions of the genome that offers advantages over methods currently in use and performs well for those few data sets with validated peaks.

Methods

Availability

The code is available from our website

Algorithm description

1. The genomic region is divided into windows and the data are converted into numbers of reads per window for each sample.

2. Starting values are assigned to the parameters of the model.

3 i. The likelihood is calculated recursively using forward and backward variables; the information is updated with every new state and the distribution of states is updated using the observed data and the likely transitions.

The forward variable can be defined as ** θ **= (

where _{ji }= _{t+1 }= _{t }=

Then, the likelihood can be represented by

and the probability that the state of window ** Y**, is

The probability that _{t }= 1, _{t }= 2) + _{t }= 3). To prevent underflow, we use the normalisation constant _{t}^{-1 }= Σ_{j}_{t}(_{t}(_{t}_{t}(_{t}(_{t}_{t}(

3 ii. The states are then simulated from the distribution _{(1, T)}|_{(1, T)}, ** θ**), which is the joint posterior mass function of all the states given

4. Given the complete data set (observed read counts and simulated states), each parameter is updated conditionally on the values of the remaining parameters using Gibbs updates. For most of them the form of the likelihood and the conjugate priors lead to closed-form posterior distributions such as Beta for _{0}, _{1}, _{0 }and _{1}. As this is not the case for _{0}, _{1 }and

5. Steps 3 and 4 are repeated a number of times and the updated values of the model parameters and state probabilities are recorded at each simulation. Averages of those probabilities give estimates of how likely the states are to take the values 0, 1, 2 or 3 and since any region _{t}, the significance score for each region is equal to the sum of those two probabilities.

State estimation and classification

Given the Gibbs draws of the parameters {^{(1)},..., ^{(m)}} and the corresponding hidden states {^{(1)},..., ^{(m)}}, where for the ^{th }draw ^{(j) }= {^{(j)}, ^{(j)}, _{0}^{(j)}, _{0}^{(j)}, _{0}^{(j)}, _{1}^{(j)}, _{1}^{(j)}, _{1}^{(j)}, ^{(j)}} and ^{(j) }= {_{1}^{(j)},..., _{T-1}^{(j)}}, our aim is to estimate the marginal posterior distributions of the states defined by _{t}'(_{t }= ** Y**) and decide on their classification. The obvious estimator would be

which can be improved by using expectations of the terms to become

where _{t}'(^{(j)}) is just the posterior probability calculated for the ^{th }simulation of the MCMC algorithm. This process introduces a layer of Monte Carlo variability, since it averages probabilities rather than events simulated with those probabilities.

Prior distributions

According to the transition probabilities of our model,

For the remaining parameters, we used more informative prior distributions, based on the nature of the enriched and background regions. We expected very few reads to map to regions where no binding is taking place, thus _{0 }should be close to zero. On the other hand, enriched areas have a wide range of fragment concentrations, reflecting the different binding affinities of the protein; therefore _{1 }should be allowed to vary considerably. We controlled for these features by tuning the scale and rate parameters of the Gamma priors of the hyperparameters _{0}, _{0}, _{1 }and _{1}.

More specifically, for both datasets we used the prior distributions

where a Beta(^{α-1}(1 - ^{β-1 }for

Implementation details

The algorithm was coded in C, and Perl was used to pre-process the data and do some of the motif analyses. The current implementation of the code analyses 5 Mb at a time. This allows for the parallelization of a genome-wide or chromosome-wide analysis and saves the need for either a) making genome-wide assumptions regarding the consistency of parameters or b) adding complexity to the model. The code can be altered to change the length of sequence being examined, however the current implementation is unlikely to scale to genome-wide within typical computational constraints. Other implementations of the model would be possible if this were desired.

We ran the MCMC algorithms for 100,000 iterations, thinning the chains and saving every 10th value to avoid correlation between consecutive draws and save computational space. To use the updated values as samples from the posterior distribution, we made sure that the values had reached stationarity and that the number of simulations was large enough to represent adequately the full range of the distribution. To do so, we ran three chains starting with different parameter values, discarded the first 2,000 values from the thinned sample and applied some formal diagnostic tools that check for convergence and good mixing between parallel chains. The Bayesian Output Analysis package (BOA)

Checking for convergence of the MCMC chains: Plots of the Brooks and Gelman shrink factor

**Checking for convergence of the MCMC chains: Plots of the Brooks and Gelman shrink factor**. These plots show how some of the parameters of the model from different runs of the algorithm (on HNF4

Authors' contributions

CS developed and implemented the model. RS compared the results between the different methods. AGL and ST supervised statistical aspects of the model. All authors drafted and approved the final manuscript.

Acknowledgements

The authors thank Duncan Odom, Michael Wilson and Dominic Schmidt for the HNF4