Investigation of the Conservation of Biological Function in Transposable Elements using a Multiple Alignment Strategy


Abstract

The non-coding genome has been relatively understudied since the dawn of modern genetics because of its sheer size and complexity. Only in the last two decades have breakthroughs in sequencing technology overcome those difficulties. Among many interesting findings, more than half of the non-coding genome was found to be occupied by transposable elements (TEs), DNA sequences with transposition capability that can either benefit or disrupt gene function depending on where they insert. TEs are considered foreign, virus-like intruders that can affect genome integrity and therefore need to be silenced by defense mechanisms such as KRAB zinc finger proteins (KZFPs).

Despite being repressed and eventually losing their ability to mobilise through neutral drift, many TEs persist in the human genome long after they can no longer transpose. This suggests that they may have been “domesticated”, meaning that they contribute to gene regulation through co-option of the transcription factor binding sites they contain. Together with many recent discoveries of TE involvement in gene regulatory pathways and cell development, this makes the conservation landscape of TEs an interesting subject to explore. Since coding and regulatory regions are more likely to be conserved during evolution because of their functions, we hypothesize that functional regions of TEs are also more likely to be conserved than non-functional regions, and we leveraged multiple novel approaches and datasets to test this hypothesis.

However, compared with coding regions, the conservation patterns of TEs are harder to detect because TEs are younger sequences found in fewer species. This makes it difficult to explore conservation on single TE sequences. In this project, we overcome the issue by exploiting a characteristic property of TEs: they occur as a large number of highly similar sequences dispersed across the genome. Because TEs in a group are closely related, a multiple sequence alignment approach can be used to analyze genome-wide data for these TEs as a group. Here, I present a new Multiple Alignment Mapping framework that facilitates this process and helps map genome-wide data onto TE alignments. I then use it in later chapters to test the hypothesis above, using inter-species (phyloP) and intra-species (gnomAD allele frequency) conservation metrics, which we contrast with genome-wide functional datasets such as transcription factor (TF) binding motifs, KZFP binding activity, and chromatin accessibility.
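The idea of mapping genome-wide data onto a TE alignment can be illustrated with a minimal sketch. It is not the thesis's Multiple Alignment Mapping implementation: the function names `map_scores_to_alignment` and `column_means`, the toy sequences, and the toy scores are all hypothetical, and the sketch assumes each TE copy already has one score per base (for example a phyloP value) in the same orientation as its aligned sequence.

```python
import math

def map_scores_to_alignment(aligned_seqs, scores_per_seq, gap="-"):
    """Place per-base scores into alignment columns; gap positions become NaN.

    Hypothetical helper: aligned_seqs are gapped rows of one alignment,
    scores_per_seq holds one score per non-gap base of the matching row.
    """
    matrix = []
    for aligned, scores in zip(aligned_seqs, scores_per_seq):
        n_bases = sum(1 for ch in aligned if ch != gap)
        if n_bases != len(scores):
            raise ValueError("score count must equal the number of non-gap bases")
        remaining = iter(scores)
        matrix.append([math.nan if ch == gap else next(remaining) for ch in aligned])
    return matrix

def column_means(matrix):
    """Mean score per alignment column, ignoring gap positions (NaN)."""
    means = []
    for column in zip(*matrix):
        values = [v for v in column if not math.isnan(v)]
        means.append(sum(values) / len(values) if values else math.nan)
    return means

# Toy data only: three gapped copies and made-up per-base scores.
aligned = ["AC-GT", "A-TGT", "ACTG-"]
scores = [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 2.0, 3.0], [2.0, 2.0, 2.0, 2.0]]

matrix = map_scores_to_alignment(aligned, scores)
profile = column_means(matrix)
```

Once every copy is expressed in alignment coordinates, any per-base dataset can be summarised column by column, which is what makes a group-level view of many short, dispersed sequences possible.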

While doing this, we found that the multi-species alignment (Zoonomia), which is the source of the phyloP dataset used in this project, could be used for other purposes. From observation of phyloP score patterns and the Zoonomia alignment, we found that TEs whose positions are aligned with sequences from many species (old) generally have a wider range of phyloP scores than those aligned with fewer species (new). This led to the creation of the TEA-TIME framework, a novel method that derives TE age from the divergence time of the most anciently diverged species in the alignment that aligns significantly, according to a similarity scoring method inspired by NCBI BLAST. The framework separates TEs into discrete age subgroups to improve subsequent functional analysis.
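The age-derivation step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the TEA-TIME implementation: the functions `estimate_te_age` and `age_bin`, the score threshold, the bin edges, the placeholder species names, and all numbers are hypothetical, and the per-species similarity score is simply taken as given rather than computed with the BLAST-inspired scheme.

```python
def estimate_te_age(alignment_scores, divergence_mya, min_score):
    """Return the largest divergence time among species that pass the threshold.

    alignment_scores: species -> similarity score for this TE copy (assumed given).
    divergence_mya:   species -> divergence time from the reference, in million years.
    Returns 0.0 when no species passes, treating the copy as lineage-specific
    (an assumption made for this sketch).
    """
    passing = [
        divergence_mya[species]
        for species, score in alignment_scores.items()
        if score >= min_score
    ]
    return max(passing) if passing else 0.0

def age_bin(age_mya, bin_edges):
    """Index of the first bin whose upper edge is >= age; last index if none."""
    for index, edge in enumerate(bin_edges):
        if age_mya <= edge:
            return index
    return len(bin_edges)

# Placeholder values only; not real divergence times or alignment scores.
divergence = {"species_A": 10.0, "species_B": 40.0, "species_C": 90.0}
scores = {"species_A": 80.0, "species_B": 55.0, "species_C": 12.0}

age = estimate_te_age(scores, divergence, min_score=50.0)
group = age_bin(age, bin_edges=[25.0, 75.0])
```

In this toy case the copy is assigned the divergence time of `species_B`, the most distant species whose score passes the threshold, and then falls into the middle age group.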

I then combine the two frameworks to test the hypothesis that some TE regions are conserved, focusing on selected TE families that have been described as potentially domesticated and that carry known transcription factor binding motifs. I find that THE1C elements in the human genome with TF motifs are significantly more conserved in other mammalian species than elements without motifs, especially at the motif positions. Within the same motifs, some positions in the consensus have stronger conservation signals than others even though all sequences evolved independently after integrating at different locations in the genome, suggesting that some of these regions might have biological importance. Interestingly, the conservation patterns within transcription factor motifs on the consensus alignment also agree with structural expectations. For example, motifs bound by BMAL1/CLOCK, which contain a basic helix-loop-helix (bHLH) domain, show accelerated patterns in the central region, which is known to be structurally flexible, flanked by more conserved positions. This pattern supports the idea that certain positions within these motifs can tolerate variation, which is consistent with their biological role. Moreover, we found potential co-evolution signals: positions outside known transcription factor motifs show significant conservation, suggesting the presence of unknown binding sites. When multiple motifs were used for grouping, we found that some motifs (NFkB or CLOCK) are co-conserved with other motifs (AP-1 or BMAL1, respectively) but not the other way around. This suggests a possible hierarchy of importance among motifs in the same pathway.

These results are interesting in the context of other observations. For example, TEs enriched for ATAC-seq signal (chromatin accessibility) in human datasets are more conserved in mammalian genomes than inaccessible TEs, indicating that there is likely a genuine link between the conservation patterns and the biological functions of TEs.

All these results demonstrate the potential of the two frameworks as tools for TE analysis. Although experimental confirmation and validation are still needed, the scalability and high-throughput potential of the frameworks I built will pave the way for future large-scale analysis.

Description

Date

2024-10-31

Advisors

Imbeault, Michael

Qualification

Doctor of Philosophy (PhD)

Awarding Institution

University of Cambridge

Rights and licensing

Except where otherwise noted, this item's license is described as Attribution 4.0 International (CC BY 4.0)

Sponsorship

The Royal Thai Government Scholarship