Rapid speciation of cichlids fishes may be explained by evolutionary divergence of novel open reading frames

Cichlids fishes exhibit extensive phenotypic diversification and speciation. Encounters with new environments alone are not sufficient to explain this striking diversity of cichlid radiation because other taxa coexistent with the Cichlidae demonstrate lower species richness. Wagner et al analyzed cichlid diversification in 46 African lakes and reported that both extrinsic environmental factors and intrinsic lineage-specific traits related to sexual selection have strongly influenced the cichlid radiation ​, which indicates the existence of molecular mechanisms responsible for rapid phenotypic diversification and events leading to reproductive isolation. However discovery of the molecular mechanisms in terms of genetic or transcriptomic or proteomic diversity responsible for the rapid speciation in a short geological time has remained elusive. In this study we integrated transcriptomic and proteomic signatures from two cichlids species, identified novel open reading frames (nORFs) and performed evolutionary analysis on these nORF regions. Our results suggest that the time scale of speciation of the two species can be better explained by the evolutionary divergence of these nORF genomic regions. Therefore this study has revealed the potential functional and evolutionary role of nORFs, which has far reaching implications for such evolutionary and speciation studies that have traditionally been focussed on known protein coding regions.

the cichlid radiation 1 , which indicates the existence of molecular mechanisms responsible for rapid phenotypic diversification and events leading to reproductive isolation. However discovery of the molecular mechanisms in terms of genetic or transcriptomic or proteomic diversity responsible for the rapid speciation in a short geological time has remained elusive. In this study we integrated transcriptomic and proteomic signatures from two cichlids species, identified novel open reading frames (nORFs) and performed evolutionary analysis on these nORF regions. Our results suggest that the time scale of speciation of the two species can be better explained by the evolutionary divergence of these nORF genomic regions. Therefore this study has revealed the potential functional and evolutionary role of nORFs, which has far reaching implications for such evolutionary and speciation studies that have traditionally been focussed on known protein coding regions.

Main
A genome-wide study of 73 Malawi cichlid species reported a low (0.1-0.25%) average sequence divergence between species pairs, indicating highly similar genomes 2 . Further, a comparative genomic analysis of three morphologically and ecologically distinct cichlid species from Lake Victoria found highly similar degrees of genetic distance and polymorphism consistent with conservation of protein-coding regions 3 . A study between two East African cichlid species, Astatotilapia burtoni and Ophthalmotilapia ventralis, reported a genetic distance of 1.75% when annotated and unannotated transcripts were considered, but only 0.95% when including protein-coding sequences 4 . The genetic differences responsible for some specific cichlid traits are known; for example, bmp4 's influence on jaw morphology 5,6 , the expression patterns of egg-spots and blotches on fin tissue 7 , role of gonadotropin-releasing hormone (GnRH) 8 and multiple steroid receptors (estrogen, androgen and corticosteroid receptors) 9 ; in chemosensory and auditory plasticity respectively and the divergence in visual pigmentation 'opsin' genes affecting mate selection 10 11 . However the genetic, transcriptomic and proteomic diversity among the cichlids appears too low to account for the striking phenotypic diversity of the taxon.
Sequencing of the genomes of five cichlid species revealed accelerated protein-coding sequence evolution, divergence of regulatory elements, regulation by novel micro-RNAs, and divergence in gene expression associated with transposable element insertions 12 . These studies suggest that the observed genomic diversity between the cichlids could be largely the result of divergence in non-coding regions. Previous work from our lab has revealed functionally active regions in the non-coding genome of many species that have not yet been classified as genes 13 . We call these regions novel open reading frames (nORFs). Therefore we embarked on a proteogenomic analysis to identify nORFs in two tissues of two cichlid species Oreochromis niloticus ( Nile tilapia, ON ) and Pundamilia nyererei (Makobe Island, PN) ( Fig. 1a ) and to investigate whether these nORFs can help explain the speciation of these two species in a short geological time scale. The entire workflow is illustrated in Fig. 1b . These species are genetically similar but phenotypically divergent. PN is a rock-dwelling lacustrine fish 14 , whilst ON dwells in rivers 15 . They differ also in diet, ON is an omnivore with a primarily plant-based diet and PN is a carnivore 16 , 17 . ON has a more plain colouration whereas PN males have yellow flanks and red dorsal regions, a trait that is subject to sexual selection 14,18 . We compared the expression of transcripts between the species in two metabolically-active tissues, the testes and liver. We chose to study the latter under the rationale that the distinct diets of the species might be accompanied by divergent liver transcriptomes. Analysis of transcript expression in the testes allowed comparison of extent of divergence in sex and non-sex traits.

Selection of the reference genome and evaluation of read alignment and transcript assembly methods
The reference genomes of ON and PN feature gaps and mis-assemblies, and are not completely annotated. This made it necessary to first examine the extent to which the poorer quality of existing assemblies for these species might affect alignment and quantitation of RNA-seq reads. We aligned the reads to their respective genomes and to the better annotated genome of a closely related cichlid species, Metriaclima zebra , with fewer gaps and mis-assemblies. We then compared overall and concordant alignment rates. PN liver reads had 4.9% and 1.8% higher overall and concordant alignment rate, respectively, to the M. zebra genome than to its own genome. Whereas, ON liver reads had 30% and 40.5% lower overall and concordant alignment rates, respectively, to the M. zebra genome than to its own genome ( Fig. 1c) . As ON reads had a higher overall and concordant alignment rate on aligning to its own genome, we decided to align the reads to the species' respective genomes for transcriptome alignment and assembly. The PN derived reads had a higher alignment rate to M. zebra than itself, while it was lower in ON derived reads, may be because M. zebra is more closely related to PN than ON. The two commonly used RNA-seq read alignment methods: TopHat and HISAT were compared by aligning the liver tissue reads of both the species to their respective genomes. The overall alignment (Fig. 1d left) and concordant (Fig. 1d right) alignment rates for both methods were very similar, but HISAT2 took approximately half the computational time compared to TopHat. Hence, HISAT2 was chosen to align the reads for the rest of the analysis.
We then evaluated several assembly methods ( Fig. 1b ). As there is no consensus in the literature regarding the optimal method for transcriptome assembly 19 20 21 22 , the following assembly methods were evaluated: Trinity -a de novo method, and Stringtie and Cufflinks -two reference based methods. These two reference based methods were run in two modes: with and without providing the reference annotations (Stringtie/Cufflinks WR and NR respectively).
To compare the assembly between the methods three replicates of simulated reads was generated for both the species, using a built-in differential expression model of Polyester v1.14.1. Reads were simulated from ON and PN reference annotation transcripts to produce three replicates of approximately 25 million 75 bp paired-end simulated reads for each species, without incorporating sequencing errors and with uniform transcript expression levels. Simulated reads were aligned to their respective genomes using HISAT2 2.1.0 and then assembled using the five assembly methods. For both ON and PN, the de novo method, Trinity, had much lower precision and sensitivity in assembling transcripts than the reference based methods. The reference-annotation based methods (Stringtie WR and Cufflinks WR), which used the existing genome annotations in transcriptome assembly, showed the highest precision and sensitivity values for both ON and PN. For the PN reads there was no difference between these two methods. However, Stringtie NR had higher mean precision and sensitivity values than Cufflinks NR when assembling ON-derived reads (Fig. 1e) . On the basis of these results, three methods were chosen for assembling the RNA-seq reads: Trinity (TR), Stringtie WR (WR) and Stringtie NR (NR). Trinity was chosen despite its low sensitivity as it was the only method studied that is capable of assembling transcripts that are not present in the reference annotations.
The Stringtie assembled transcriptomes were quantified using Stringtie to generate transcript level abundances. Whereas, RSEM was used to quantify the transcripts assembled de novo by Trinity. The transcripts abundances for all the three methods were then analysed for differential expression using Ballgown and Ebseq. RSEM generated counts for Trinity assembled transcripts were converted to FPKMs required for downstream assembly of Ballgown, using 'ballgownrsem' function.

Identification of orthologous and uniquely expressed transcripts in the two fishes
After the assembly of aligned reads with the three assembly tools; the assembled transcriptomes were processed to remove the unexpressed, duplicate and highly similar transcripts (Supplementary Figure 1) . Post the filter, the total number of transcripts assembled by the three assembly methods, in each tissue of each fish, as depicted in (Fig. 2a) , ranged from 22,879 to 254,399. The assembled transcriptomes were compared between the two fishes for each method and tissue type, to identify the transcripts that were either conserved between the two fishes or were expressed only in one or the other fish.
We identified the 'orthologous' transcripts conserved between the two fishes using reciprocal best hits (RBH) method. The ON transcript sequences for each tissue type and method were mapped to their respective PN transcriptomes and vice versa using blastn v2.7.1+ 23 . Transcript pairs that were each other's highest scoring match were identified as orthologs. And the transcripts if did not have a match in the opposing species with at least 80% identity, were assumed to be expressed uniquely to the species; and were classes as 'species-specific' transcripts.
We identified, for the three assembly methods, around 13,618 -20,837 orthologous transcripts, in the testes transcriptomes of the two fishes. Similarly, around 9,802 -48,041 orthologous transcripts were identified in the liver transcriptomes of the two fishes (Fig. 2b) . In both tissues, the number of orthologous transcripts identified in the Trinity assembled transcriptomes were highest, while were lowest in the Stringtie NR assembled transcriptomes.
Additionally, Fig. 2c depicts the number of species-specific transcripts identified by three assembly methods, per tissue in each fish. The number of ON-specific transcripts, in the two tissues for the three methods, varied from 4,530 -54,669, whereas PN-specific transcripts varied from 2,801 -60,193. Except for PN testes, the number of species-specific transcripts identified in Trinity-assembled transcriptome was highest. No species-specific liver transcripts were commonly found by all three of Stringtie WR, Stringtie NR and Trinity. In contrast, 441 (0.6%) of the species specific ON testes transcripts and 93 (0.9%) of the species-specific PN testes transcripts were identified by all three methods (Supplementary Figure 2) .

Comparative transcriptomes between liver and testes of the two fishes
To determine whether the transcriptome level differences contribute to the diversity in the two fishes, we compared the expression levels of the orthologous transcripts of the equivalent tissues. Principal component analysis (PCA) on the normalised expression levels qualitatively separated the two fishes in both the liver and testes samples for all three transcriptome assembly methods (Supplementary Figure 3) . Next, we carried out differential expression analysis of the orthologous transcripts to identify transcripts whose expression varied between the two fishes. The tools, ballgown and EBseq, used for DE analysis were observed to show strikingly different results. Upon Ballgown's analysis, no transcripts were seen to be differentially expressed between the liver transcriptome of the two fishes; while 2,804 and 2,734 testes transcripts were observed to be differentially expressed, in the Stringtie WR and Stringtie NR assembled transcriptomes, respectively. Of these, 1,232 transcripts are commonly identified by the two stringtie assembly methods. Also, no transcript was DE in Trinity assembled testes transcriptome (Figure 3a-b) . But, when differential expression analysis was done using Ebseq, transcripts from both testes and liver were identified to be DE. 4,591-26,671 and 8,872-13,436 transcripts were identified to be respectively DE in liver and testes transcriptomes assembled by the three assembly pipelines. As large numbers of orthologous transcripts (~30-62%) were identified to be DE by EBseq, we did not further analyse these results.

Functional annotation of the DE and species-specific transcripts .
For annotation by Interproscan and blastp we used union of the transcripts identified by the three assembly methods, and not just the transcripts identified commonly by the three methods.
The majority of DE testes transcripts were annotated as related to cellular or metabolic processes, with a smaller number relating to localisation, biological regulation, regulation of biological processes and response to stimulus (Fig 3c) . Further the annotations of species-specific transcripts, for each tissue type and species showed broadly similar trends, mostly pertaining to cellular processes, metabolic processes, localisation and regulation of biological processes. Additionally, some annotations were also specific to a particular tissue type. Eight of the species-specific transcripts in the PN testes and fifteen of the species-specific transcripts in the ON testes were annotated with the GO term reproductive processes and reproduction, suggesting that the ON and PN reproductive systems have diverged ( Supplementary Figure 4 ).

Identification of the novel transcripts derived from the non-coding regions and their departure from the neutral substitution rate
Further analysing the species-specific transcripts, we observed that a subset of them were transcribed from previously annotated non-coding regions. We call these regions novel Open Reading Frames (nORFs). Of these subset, 100 nORF transcripts had evidence of translation identified using our mass spectrometry-based proteogenomic analysis. We observed that 8-24 and 5-25 number of nORF transcripts were transcribed and translated from intronic and intergenic regions respectively, found for each species and tissue type ( Table 1) . There was little overlap in the species-specific nORF translated products found by each method, with no overlap between Trinity and the Stringtie methods, two intronic and two intergenic species-specific liver ON translation products found by both Stringtie WR and Stringtie NR, six and two intergenic species-specific liver PN and testes ON translation products found using both Stringtie WR and Stringtie NR.
Further investigation of these nORF translated products by InterProScan revealed that one intergenic product from PN testes was annotated with immunity related GO terms. Similarly one intergenic translated product, each from ON testes and PN liver, had immunoglobulin like fold and domain.

Evolutionary analysis of nORF transcripts
In order to determine whether these 100 nORF transcripts, with experimental translational evidence, evolved in a non-neutral manner we next calculated their substitution rates by calculating the genome-wide, base-wise conservation-acceleration (CONACC) scores using phyloP 24 . To do this, existing multiple whole genome alignments of the five cichlids provided by Brawand et al 12 was used . Of the 100 nORF transcript regions; we were able to map the scores for only 41 regions because of the variability in the two different ON assemblies and due to insufficient aligned data. As the ON assembly, ASM185804v2, used in our analysis was different than the one used in the whole genome alignments -Orenil1.1, few of the nORF regions were unmapped during assembly conversion. The regions with the mapped scores were further reduced as no CONACC score is assigned to a site; if there is insufficient data per site or gaps in the alignement (Table 2) . CONACC scores were computed over all branches of the cichlid's phylogeny, and used to detect the departure from neutrality in novel regions and also in the other known annotated features of the genome like CDS, 5'UTR, 3'UTR, introns, intergenes and ancient repeats (AR).
The analysis of the cumulative distributions (Fig. 4a) of the phylop scores of ON's known annotated features showed that the CDS regions (red line) were most conserved while the AR's were least conserved. This is intuitive as the functional coding regions are expected to have more evolutionary constraints than the non-functional repeat regions. The distribution of CONACC scores of all the annotated features were significantly different than that of AR (Welch t-test, P-value < 0.05) (Fig. 4a) .
Conservation scores were also mapped to the 9 ON novel intergenic and 27 ON's novel intronic regions. As these novel regions are very few compared to the AR, we sampled 10,000 times, from all the AR regions, to randomly pick one length-matched AR per nORF transcript. The distribution of CONACC scores for these length-matched, equal sample-sized AR regions were significantly different (Welch t-test, p-value < 0.05) than the novel intergenic regions ( Fig. 4b ) for 7,519/10,000 times; and only 2,338/10,000 times for the novel intronic regions (Fig. 4c) .
Compared to AR, the 9 novel-intergenic regions in ON showed a shift towards more accelerated CONACC scores (gray line in the graph), whereas the 27 novel-intronic regions showed a non-neutral substitution rate with shift towards more conserved CONACC scores (blue line in the graph). This indicates that these regions which are varied in all the cichlids, might contribute to the phenotypic variation in ON.

Phylogenetic divergence time scale analysis of ON and PN
To check whether these accelerated nORF genomic regions can reveal the actual divergence time between ON and PN species and perhaps give us a clue to the speciation process we carried out Bayesian Evolutionary Analysis Sampling Trees model (BEAST) 25 , which was run on BEAST v1.10.4 26 . A strict molecular clock was set, to allow for the most reliable comparison between trees based of nORFs sequences . The molecular clock was time calibrated with a fossil time constraint. The constraint was set as a lognormal prior distribution with a mean in real space of 45.5 million years ago (MYA) and a standard deviation of 0.5 MYA. This time calibration was based on cichlid fossils estimated to be 45 million years old 27 . The substitution rate was fixed to allow better comparison between trees. 9 nORFs from the intergenic region of ON were selected as they deviated from the neutral model substitution rate predicted by AR.
Our analysis ( Table 3

Total RNA extraction from liver tissues and sequencing
Approximately 5-10mg of fresh liver tissue from tank-reared P. nyererei (generation 1; Lake Victoria) and O. niloticus (generation~93; Manzala, Egypt) specimens, snap-frozen upon dissection, was homogenised and used for RNA extraction. Total RNA was isolated using TRIzol (ThermoFisher) and then treated with DNase (TURBO DNase, ThermoFisher) to remove any DNA contamination. Quality and quantity of total RNA extracts were determined using NanoDrop spectrophotometer (ThermoFisher), Qubit (ThermoFisher) and BioAnalyser (Agilent). On average, 11.82±0.42Mio paired-end reads were generated for ON and PN liver samples.

Proteogenomic workflow to investigate evidence of translation
The 36 Thermo mass spectrometry raw files were submitted to be searched against the respective per-fish, tissue-assembled transcriptome databases (for example liver Stringtie WR-assembled, liver Stringtie NR-assembled, liver de novo Trinity-assembled) in six frames utilizing Proteome Discoverer v2.1 and Mascot 2.6. The spectra identification was performed with the following parameters: MS/MS mass tolerance was set to 0.8 Da, and the peptide mass tolerance set to 10ppm. The enzyme specificity was set to trypsin, and two missed cleavages were tolerated. Carbamidomethylation of cysteine was set as a fixed modification, whilst variable modifications consisted of: oxidation of methionine, phosphorylation of serine, threonine and tyrosine, and deamidation of asparagine and glutamine. High confidence peptide identifications were determined using Percolator node, where false discovery rate estimation (FDR) < 0.01 was used. A minimum of two high confidence peptides per protein was required for identification

RNA-Seq Simulation Experiment
An RNA-sequencing experiment was simulated to assess the precision and sensitivity of de novo and reference-based transcriptome assembly methods 29 and therefore to decide which methods to use for transcriptome assembly. Reads were simulated from the O. niloticus and P.
nyererei reference annotation transcripts using Polyester v1.14.1 to produce three replicates of approximately 25 million 75 bp paired-end simulated reads for each species. The reads were simulated without sequencing errors and with uniform transcript expression levels.

Comparison of Methods for RNA-seq Read Alignment
Total RNA-seq read sequences from O. niloticus and P. nyererei testes tissues were obtained from Brawand et al. 12 and total RNA-seq read sequences for liver tissues were generated for this study (see above). These reads were quality-checked using FastQC v0.11.5. It was thought that the RNA-seq reads might align better to the genome of M. zebra, a closely related species, than to the O. niloticus and P. nyererei genomes, as the M. zebra genome has few gaps and mis-assemblies 30

Comparison of methods for transcriptome assembly
Simulated reads were aligned to their respective genomes using HISAT2 2.1.0 32 and were assembled using four reference-based assembly methods and Trinity v2.0.6 33 35 . The Trinity-assembled transcriptomes were mapped to their respective genomes using GMAP version 2017-11-15 36 to provide genomic coordinates of the transcripts for comparison to the reference Annotations.
The precision and sensitivity of the simulated transcriptomes produced using different transcriptome assembly methods were assessed by comparison to the reference annotations. 10 x 10,000 transcripts were randomly sampled with replacement from each simulated transcriptome. These were mapped against the reference transcriptomes from which the reads were derived using GFFcompare v0.10.1 to obtain estimates for the precision and sensitivity of each assembly method at the transcript level. Raw sensitivity values were multiplied by transcriptome size / 1000 to account for the loss of sensitivity produced by using a subset of the data.

RNA-Seq Read Alignment and Assembly
Based on the results of the simulation study, the RNA-Seq reads were assembled using Stringtie WR, Stringtie NR and Trinity. The HISAT2-aligned reads were assembled using Stringtie WR and Stringtie NR . For the reference-based assembly methods, the transcriptomes assembled for each biological replicate were merged using the Stringtie merge utility to produce one transcriptome per method per tissue per species. The Trinity assembled transcriptomes were mapped to their respective genomes using GMAP version 2017-11-15 to provide the genomic coordinates of the transcripts. This was required in order to compare the transcripts found using different methods.

Transcriptome Processing and Database Production
The assembled transcriptomes were processed prior to data analysis and transcriptome database production. Unexpressed transcripts derived from the reference annotations were present in the Stringtie WR transcriptomes. The Stringtie-assembled transcriptomes were therefore filtered to remove unexpressed and duplicated transcripts. At some loci, Stringtie contigs. Two Transrate score thresholds were used to remove low-quality transcripts from the assemblies. A variable threshold was used to produce transcriptomes with optimal Transrate scores, referred to as strongly filtered transcriptomes. A lower threshold of 0.01 was also used to produce the weakly filtered transcriptomes. BUSCO v3 was used before and after filtering by Transrate score to assess the completeness of transcriptomes. This was done by testing for the presence of single copy orthologs that are universal within the metazoa.

Ortholog Identification
Pairs of orthologous transcripts between the two species for each method and tissue type were identified using the reciprocal best hits (RBH) method for use in PCA and differential expression analysis. The ON transcript sequences for each tissue type and method were mapped to their respective PN transcriptomes and vice versa using blastn v2.7.1+ 23 . Transcript pairs that were each other's' highest scoring match were identified as orthologs.

Species-Specific Transcript Identification
To identify transcripts that were only expressed in one species or the other, assembled transcripts from ON were compared to those from PN and vice versa using blastn v2.7.1+ Transcripts were classed as species-specific if they did not have a match in the opposing species with at least 80% identity.

Identification of Novel Species Specific Translation Products
Species-specific transcripts were compared to the reference annotations for their respective species using GFFcompare v0.10.1. to identify species-specific intergenic and intronic transcripts. If these transcripts had evidence of translation then the resulting translation products were classed as species-specific novel translation products.

Principal Component Analysis
Principal component analysis (PCA) was performed in R to separate samples based on the expression of orthologous transcripts. For Stringtie WR and NR expression values were in fragments per kilobase of transcript per million mapped reads (FPKM) and for Trinity expression values were count data for equivalent orthologous transcript sections (explained in more detail below). PCA was also used to separate samples based on the expression values of orthologous proteins.

Differential Transcript Expression Analysis
Differential expression analysis was carried out to compare the expression levels of orthologous transcripts between species and to ascertain whether expression levels had diverged more in the liver or in the testes.
Differential expression analysis for Stringtie-assembled transcriptomes was performed using a custom R script based on the Ballgown Bioconductor package. Sample FPKM values were log 2 transformed and normalised for library size using a 75th percentile normalisation. Linear models were constructed for each pair of orthologous transcripts to predict expression levels either including or excluding species as a predictor variable. The abilities of the two models to explain the normalised expression values were compared using F-tests, with Benjamini-Hochberg multiple testing correction. Expression levels were compared between species for both liver and testes.
For Trinity differential expression analysis, orthologous pairs of transcripts were truncated to remove non-corresponding transcript sections based on the blastn mapping of orthologous transcripts to each other. This was done to account for the large differences in length found between some orthologous transcript pairs. Counts for the truncated transcripts were estimated using RSEM v1.2.31 39 . Differential expression analysis was carried out on the truncated-transcript count data using generalised linear model quasi-likelihood F tests in EdgeR with Benjamini-Hochberg multiple testing correction.

Gene Ontology Annotation
GO annotation was used to assign putative biological functions to the DE transcripts and species-specific transcripts. Amino acid sequences for these proteins were predicted from the longest open reading frames of their transcripts using Virtual Ribosome v2.0 40 . The amino acid sequences were analysed using InterProScan 67.0 to identify families, domains and important sites and assign GO annotations 41 . GO annotations were visualised with Blast2GO v5.0 42 .

Comparison of Transcriptome Assembly Methods
For each stage of the data analysis the results found using each of Stringtie WR, Stringtie NR and Trinity were compared to find the overlap in the transcripts identified as differentially expressed or species specific. The Trinity assembled transcriptomes were mapped to their respective genomes using GMAP version 2017-11-15 to provide the genomic coordinates of the transcripts. The matching transcripts present in the transcriptomes assembled using each of the three methods were identified using GFFcompare v0.10.1.

Substitution rate calculations using phyloP
phyloP from Phylogenetic Analysis with Space/Time Models (PHAST) v1.5 package, was used to identify the genomic sequences that evolve with a rate different than that expected at neutral drift 43 24 . First, a neutral substitution model was constructed using phyloFit in PHAST by fitting a time reversible substitution 'REV' model on the phylogeny obtained from four-fold degenerate (4D) sites (Supplementary figure 6) . This phylogeny has topology and branch lengths similar to the subtree similarly constructed by Brawand et al using 4D sites from alignment of 9 teleost genomes ( which includes the 5 cichlids genomes that we have used) 12 48 and are as follows. Sequence evolution was taken to follow the HKY model 49 and the species-tree prior was set to the Yule speciation process 50 (Yule 1925               regions. The distribution of CONACC scores of the randomized AR subsets were significantly different from that of novel intergenic regions for 2338/10000 times.