Department of Zoology, University of Cambridge, Cambridge CB2 3EJ, UK

Department of Computer Science, Brigham Young University, Provo, Utah 84602, USA

Abstract

Background

Long branch attraction (LBA) is a problem that afflicts both the parsimony and maximum likelihood phylogenetic analysis techniques. Research has shown that parsimony is particularly vulnerable to inferring the wrong tree in Felsenstein topologies. The long branch extraction method is a procedure for detecting a data set that suffers from this problem, so that Maximum Likelihood can be used instead of Maximum Parsimony.

Results

The long branch extraction method has been widely cited and used by many authors in their analyses, but its accuracy has not been rigorously validated. We performed such a validation through an extensive search of the branch length space under two six-taxon topologies, a Felsenstein-like topology and a Farris-like topology. We also examined a long branch shortening method.

Conclusions

The long branch extraction method seems to mask the majority of the search space, rendering it ineffective as a method for detecting LBA. A proposed alternative, the long branch shortening method, is also ineffective in predicting long branch attraction for all tree topologies.

Background

Due to its speed and simplicity, one of the most common methods used in phylogenetics is Maximum Parsimony

LBA is the foundation for many of the arguments against the use of MP in phylogenetics. One foundational study showed that MP can be positively misleading when two non-sister taxa have long branches compared to the rest of the tree

LBA has been found in many real-world examples; one review found 112 examples in a search on the Web of Science

1. If, after completing a full parsimony search, you obtain a tree with a questionable grouping of certain taxa that appears basal and makes the formal classification polyphyletic, suspect LBA.

2. Exclude the outgroup and re-run the analysis: do the questionable taxa form a monophyletic clade of the formal classification?

3. Return the outgroup, remove the questionable taxa, and re-run the analysis: does this root the tree differently than in step 1 (later compare to steps 4 and 5 as well)?

4. Return the questionable taxa and reanalyze the data set by separating the gene information from the morphological data: does the morphological data form a monophyletic group of the formal classification while the gene data place the questionable taxa basal in the tree?

5. Analyze the gene data using a method that takes branch lengths into account (e.g. Bayesian or likelihood inference): does this method form a monophyletic group of the formal classification?

6. Using the same analysis of step 5: are the branch lengths of the questionable taxa and the outgroup some of the longest in the tree?

If you can answer yes to all the previous questions, LBA is the least refuted hypothesis. We have chosen to automate this technique with a few modifications and evaluate it on a series of synthetic six-taxon data sets spanning a variety of branch lengths with verified LBA.
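The checklist above reduces to a conjunction: LBA is kept as the least refuted hypothesis only when every answer is yes. The following is a minimal sketch of that decision rule, not the Java implementation evaluated here; the class name and all field names are hypothetical labels chosen for illustration.

```python
from dataclasses import dataclass, fields


@dataclass
class LbeAnswers:
    """Yes/no answers to the checklist (all field names are illustrative)."""
    basal_polyphyletic_grouping: bool          # step 1: suspicious basal placement
    monophyletic_without_outgroup: bool        # step 2
    rooting_changes_without_qtaxa: bool        # step 3
    morphology_monophyletic_genes_basal: bool  # step 4
    model_based_monophyletic: bool             # step 5
    qtaxa_and_outgroup_longest: bool           # step 6


def lba_least_refuted(answers: LbeAnswers) -> bool:
    """Return True only if every checklist answer is yes."""
    return all(getattr(answers, f.name) for f in fields(answers))


all_yes = LbeAnswers(True, True, True, True, True, True)
one_no = LbeAnswers(True, True, True, True, True, False)
print(lba_least_refuted(all_yes), lba_least_refuted(one_no))
```

A single "no" is enough to reject the LBA hypothesis under this rule, which is why any one step that answers incorrectly can mislead the whole procedure.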

The six taxa synthetic data sets were used for two main reasons. Six taxa data sets are small enough to be calculated in reasonable time but large enough for the LBE method to work. This gave us an

Methods

Synthetic data sets

To evaluate long branch extraction, datasets that allow for extraction of long branches and differentiation of topological location are necessary. We chose to perform the analysis using six taxa in a star shape for consistency and comparability to the more prevalent studies using four taxa cases of the Felsenstein and Farris (or reverse-Felsenstein) zone topologies

**The Felsenstein-like and Farris-like topologies used to simulate the data.** The six taxa star shape Felsenstein-like and Farris-like topologies.

To produce these data sets we used the program Dawg

Dawg generated data sets for trees under both topologies where the

Evaluation of LBA area

Each data set was then analyzed by comparing the best parsimony tree from an exhaustive search. With six taxa this means scoring all 105 possible trees to find the best one. This best tree was then compared to the parsimony tree and the percentage of the trials out of 100 that the two matched was recorded. Then for each of the permutations of

Steps of LBE

To perform LBE, the target tree, in our case the resultant heuristically derived parsimony tree, and data set are given as parameters along with a list of outgroup taxa and questionable taxa to our Java version of LBE. Of the two

The first step of LBE is to remove the outgroup from the tree and the data set and rerun a parsimony search. The second step is to add the outgroup back and remove the questionable taxa. To increase the sensitivity, according to the recommendations of Bergsten (see “Concluding discussion: suggestions” from

Results and discussion

Areas under LBA

To detect the areas most affected by LBA, we ran an analysis of the six taxa data sets (see section ) over a range of branch lengths and with two scenarios for the position of the long branches (see Figure

**Maximum Parsimony and the Felsenstein topology** Percentage of runs in which MP identifies the true tree under the Felsenstein-like topology. Trees found in the upper left corner (the black area) suffer from LBA. The dark outer edge is where the signal is lost because the branches are too long.

**Maximum Parsimony and the Farris topology** Percentage of runs that MP identifies the true tree under the Farris-like topology. Note the large area predicted correctly by MP; this is the area where the sister taxa have long branches and are correctly placed together based on MP’s bias.

One problem with most phylogenetic algorithms is the loss of detectable signal in extremely long trees. The length of a tree is the sum of all its branch lengths, and trees of extreme length are difficult to resolve. This problem is clearly visible in the upper right of the figures under both topologies. We hypothesize that as the branch lengths get longer the percentage correct will converge to 0.95%, the expected success rate of a random guess among the 105 possible topologies.
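The count of 105 topologies and the 0.95% baseline follow from the standard formula for the number of unrooted binary trees on n labeled taxa, (2n - 5)!!. A short sketch that checks the arithmetic (the function name is our own label):

```python
def num_unrooted_binary_trees(n_taxa: int) -> int:
    """Number of unrooted binary topologies on n labeled taxa: (2n - 5)!!."""
    if n_taxa < 3:
        raise ValueError("need at least three taxa")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5.
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


n_trees = num_unrooted_binary_trees(6)
chance_percent = 100.0 / n_trees
print(n_trees, round(chance_percent, 2))
```

For six taxa this gives 3 x 5 x 7 = 105 trees, so a uniformly random guess is correct about 0.95% of the time.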

This analysis served as a search space basis for where LBA should be detected. By comparing the differences between the Felsenstein-like and Farris-like topologies it is clearly visible which areas should be detected. When analyzed with ML these regions do not appear but the loss of signal is still present (see Figure

**Maximum Likelihood and the Felsenstein topology** Percentage of runs that ML identifies the true tree under the Felsenstein-like topology. Notice that the area of high accuracy is much larger and covers most of the LBA region. ML is not as susceptible to LBA.

LBE is not functioning as theory predicts

For a method to accurately detect LBA, it needs to discern between these two types of topologies and find the area of LBA. The region found by searching the branch length space should be the same as the region predicted by LBE. Surprisingly, this was not the case.

As is seen in Figures

**Long Branch Attraction and the Felsenstein topology** Felsenstein-like topology. The color gradient represents the percentage of trees that were not predicted to have LBA.

**Long Branch Attraction and the Farris topology** Farris-like topology. There should be no detection of LBA in this scenario because the long-branches are sister taxa.

Further, LBE predicted LBA under the Farris-like topology, where we know

It is consistently classifying the wrong area of the Felsenstein-like topology as LBA and the same area of the Farris-like topology. In reality, this is an area suffering from loss of signal. But even in other areas of loss of signal, i.e. the lower right corner of Figure

What is more bothersome is that the LBE does not seem to consistently categorize based on specific examples of branch length. Under the full method of LBE with the branch length step included (see section ), the method only categorizes a maximum of 25% of any permutation of

**Long Branch Extraction without branch length estimation** Felsenstein-like topology. This was a reanalysis without the third step, which looks at a branch length estimator ML. The detection is further biased and the area with less signal is confused with LBA.

Why it may not work

Siddall and Whiting make the claim that, “... if each of the two branches individually group in precisely the same place as the other when they are allowed to stand alone in an analysis, one can hardly argue that they are attracted to this placement by the absent branch.

We can thus split the branch length search space into three major areas: the area masked by the ML step (I), the area misled by the artificially long branch (II), and the area that is correct until it reaches a point of loss of signal (III), as seen in Figure

**Hypothetical explanation of branch length search space** I) This area is caused by the ML step predicting the

Area II is much more hypothetical but seems to fit the data reasonably well. When examining Figures

Finally, area III is where the LBE method is actually mostly correct, or the area not suffering from some other artifact. Unfortunately, this area is not suffering from LBA, but eventually it loses phylogenetic signal. It is most clearly seen in Figure

Long branch shortening

Due to the problems associated with Long Branch Extraction, an alternate approach could be used. Rather than removing the suspected long branch that would cause changes in the overall phylogeny, a series of iterative steps are taken to shorten the branch to diminish the phylogenetic signal being sent from the questionable branch and then see if that changes the phylogeny. If the phylogeny changes, long branch attraction is suspected.

Assuming the questionable taxon (qtaxa) falls basal in the MP analysis and is suspect (this is similar to step 1 of LBE), LBS performs the following three step test:

1. Rather than sampling from all the other taxa, construct the ancestral sequence to all taxa excluding the outgroup and qtaxa. With this sequence, you have the combined signal of all the other taxa, or a summary of that clade.

2. Using the constructed sequence and the questionable taxa, hybridize the two in a random fashion. We are not implying crossing over, albeit that should be tested as well, but using a binomial distribution, characters are exchanged between the sampled ancestor and the suspected long-branch taxon. This causes the branch to be shortened by reducing the differences between the taxon and its hypothetical ancestor. However, since this ancestor is unknown, the characters for the questionable taxon are modified by sampling from the hypothetical ancestor. In Figures

**Long Branch Shortening using 0% Sampling Frequency** Long Branch Shortening using different sampling frequencies. Black indicates an incorrect diagnosis from LBS while gray indicates LBS successfully determined whether or not long branch attraction was present in the resulting phylogeny.

**Long Branch Shortening using 30% Sampling Frequency** Long Branch Shortening using different sampling frequencies. Black indicates an incorrect diagnosis from LBS while gray indicates LBS successfully determined whether or not long branch attraction was present in the resulting phylogeny.

**Long Branch Shortening using 60% Sampling Frequency** Long Branch Shortening using different sampling frequencies. Black indicates an incorrect diagnosis from LBS while gray indicates LBS successfully determined whether or not long branch attraction was present in the resulting phylogeny.

**Long Branch Shortening using 90% Sampling Frequency** Long Branch Shortening using different sampling frequencies. Black indicates an incorrect diagnosis from LBS while gray indicates LBS successfully determined whether or not long branch attraction was present in the resulting phylogeny.

3. Re-run the analysis with the hybridized sequence included in place of the qtaxa. If the taxon moves after reducing its own signal and adding some signal from the monophyletic clade, you have some evidence of LBA. The switching probability of the binomial distribution is then increased, and steps 2 and 3 are repeated until either the probability reaches 1 or multiple runs consistently show the hybridized qtaxa grouping with the hypothetical clade.
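The hybridization in step 2 can be sketched as an independent per-site exchange, so that the number of exchanged characters follows a binomial distribution. This is an illustration under stated assumptions rather than the implementation used for the figures: the function name and parameters are hypothetical, the sequences are assumed to be aligned, and the ancestral sequence is taken as given because its reconstruction is not shown.

```python
import random


def hybridize(qtaxon_seq: str, ancestor_seq: str, p_switch: float,
              rng: random.Random) -> str:
    """Replace each character of the questionable taxon with the ancestral
    character at the same site, independently with probability p_switch."""
    if len(qtaxon_seq) != len(ancestor_seq):
        raise ValueError("sequences must be aligned to the same length")
    if not 0.0 <= p_switch <= 1.0:
        raise ValueError("p_switch must lie in [0, 1]")
    return "".join(
        anc if rng.random() < p_switch else own
        for own, anc in zip(qtaxon_seq, ancestor_seq)
    )


rng = random.Random(42)  # fixed seed only to make the sketch repeatable
hybrid = hybridize("ACGTACGTAC", "ACCTACGAAC", 0.3, rng)
print(hybrid)
```

With `p_switch` at 0 the questionable sequence is returned unchanged, and at 1 it is replaced entirely by the ancestral sequence, which matches the two ends of the sampling range shown in the figures.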

One weakness of such an approach is the lack of an absolute answer. You don’t get a final yes or no as to whether LBA is occurring, only added evidence that there is a problem. This evidence comes in the form of a probability, or percentage of the branch, that needs to be shortened to form the monophyletic clade. If the probability comes out high, 0.9 to 1.0, you can be fairly sure that LBA is not occurring and that there is strong phylogenetic signal supporting the current position in the phylogeny. If it is very low, then long branch attraction has occurred and is causing an incorrect tree to be inferred. This evidence can help the researcher understand whether the questionable taxon (qtaxa) is sending a strong or a weak signal for its current location. A weak signal implies that the location is inferred only because of analogous evolution and not homology. This implication can then be interpreted as the detection of LBA.
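The iteration over switching probabilities can be sketched as a scan that reports the smallest probability at which the hybridized taxon consistently joins the clade. Several details here are assumptions and are not specified in the text: the step size, the number of replicates, and the `joins_clade` callback, which stands in for a full parsimony re-analysis and is purely hypothetical.

```python
import random
from typing import Callable, Optional


def shortening_threshold(qtaxon_seq: str, ancestor_seq: str,
                         joins_clade: Callable[[str], bool],
                         step: float = 0.1, replicates: int = 10,
                         seed: int = 0) -> Optional[float]:
    """Return the smallest switching probability at which every replicate
    hybrid joins the clade, or None if that never happens up to 1.0."""
    rng = random.Random(seed)
    n_steps = round(1.0 / step)
    for i in range(1, n_steps + 1):
        p = min(1.0, round(i * step, 10))
        # Build replicate hybrids lazily; all() stops at the first failure.
        hybrids = (
            "".join(a if rng.random() < p else q
                    for q, a in zip(qtaxon_seq, ancestor_seq))
            for _ in range(replicates)
        )
        if all(joins_clade(h) for h in hybrids):
            return p
    return None


# Toy stand-in for the re-analysis: "joins" once the hybrid equals the ancestor.
ancestor = "C" * 200
print(shortening_threshold("A" * 200, ancestor, lambda h: h == ancestor))
```

Under the interpretation given above, a returned value near 1.0 suggests strong signal for the current placement, while a small value suggests the placement is weakly supported.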

In Figures

Maximum parsimony and maximum likelihood

Lastly, as this paper addresses algorithms to detect regions where Maximum Parsimony would report the incorrect tree, it is important to compare Maximum Parsimony and Maximum Likelihood in terms of the best scoring tree versus the true phylogeny. Figures

**Parsimony vs. Likelihood using 4 taxa** Comparison of Parsimony and Maximum Likelihood using a 4 taxa Felsenstein Zone tree.

**Parsimony vs. Likelihood using 6 taxa** Comparison of Parsimony and Maximum Likelihood using a 6 taxa Felsenstein-like Zone tree.

Conclusions

Long Branch Extraction (LBE) and Long Branch Shortening (LBS) are not reliable methods for detecting Long Branch Attraction (LBA) and should not be used in phylogenetic inquiries about LBA. Under a variety of branch lengths for six-taxon synthetic data sets, LBE incorrectly and inconsistently predicts LBA because of its inability to distinguish between artificially created long branches and the correct tree topology. The artificial long branch is created by removing the outgroup or the questionable taxa: the remaining sister taxon becomes artificially long once the taxa that would have broken up its branch are gone. An additional problem is that the ML step masks a large area of the branch length space, denying the method the specificity it needs to be effective. This was shown by an in-depth search over two topologies: the Felsenstein-like topology, which is easily susceptible to LBA, and the Farris-like topology, in which the long branches are correctly grouped together. The results support our conclusion that LBE is ineffective in detecting LBA.

LBS is not effective because it incorrectly estimates the sequence present at the ancestral node. Statistical sampling of the other sequences artificially causes the target taxon to appear like all the taxa rather than shortening its branch. This results in a loss of accuracy in the detection of LBA.

Both LBE and LBS suffer from a secondary effect. When a branch is extracted from the phylogeny or shortened, other branches are free to become the longest branch and will potentially draw other similarly long branches away from their correct locations. Both Maximum Likelihood and Maximum Parsimony are subject to LBA in Felsenstein topologies and Likelihood provides superior results in only a small part of the Felsenstein Zone.

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

All authors designed, analyzed, implemented and tested the proposed algorithm. Each author contributed equally in writing the paper. All authors read and approved the final manuscript.

Acknowledgements

This project was supported by the National Science Foundation under Grant No. 0120718 and by the Brigham Young University Office of Research and Creative Activity. Publication of this supplement was made possible with support from the International Society of Intelligent Biological Medicine (ISIBM).

This article has been published as part of