
D. simulans s49 and D. sechellia NF100 genomes



This dataset is associated with the following article: "Trans-Regulatory Changes Underpin the Evolution of the Drosophila Immune Response". It contains the variant calling files for D. simulans Sz49 and D. sechellia DenisNF100 and the genomes generated from them. dsim49_24.vcf and dsecNF100_20.vcf are the variant calling files for Sz49 and DenisNF100, respectively. dsim49.fasta and dsecNF100.fasta are the corresponding genomes, generated from the respective variant calling files.

We used publicly available data to produce reference genomes for the Drosophila lines. Whole genome sequencing data for D. simulans Sz49 was obtained from the NCBI Sequence Read Archive under BioProject PRJNA318623 (BioSample: SAMN05157406). Whole genome sequencing data for D. sechellia DenisNF100 was obtained from the NCBI Sequence Read Archive under BioProject PRJNA395473 (BioSample: SAMN07407394).

Genomic sequencing reads of Sz49 and DenisNF100 were mapped to the D. simulans reference genome (FlyBase Dsim r2.01) and the D. sechellia reference genome (FlyBase Dsec r1.3), respectively, using the BWA-MEM algorithm in the BWA package with default parameters. Duplicated read pairs were removed using MarkDuplicates in the Picard toolkit. Subsequently, variants were called against the reference genome using HaplotypeCaller in GATK (v4.1.4.1). Single nucleotide polymorphisms (SNPs) and indels were separated for Base Quality Score Recalibration (BQSR). Each round of BQSR (GATK toolkit) was performed using the hard-filtered variants from the previous round as the “true set”. For SNPs, the hard-filtering criteria were QualByDepth < 2.0, StrandOddsRatio > 3.0, FisherStrand > 60.0, RMSMappingQuality < 40.0, MappingQualityRankSumTest < -12.5 and ReadPosRankSumTest < -8.0. For indels, the hard-filtering criteria were QualByDepth < 2.0, FisherStrand > 200.0 and ReadPosRankSumTest < -20.0.

BQSR was repeated until the number of output variants plateaued or oscillated around a constant number. The final sets of variants after BQSR were then used to modify the reference genomes and generate the Sz49 and DenisNF100 genomes with the GATK FastaAlternateReferenceMaker tool.
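
The hard-filtering thresholds and the stopping rule described above can be written out as a small Python sketch. It is illustrative only and is not the code used to produce this dataset. It assumes that a variant is filtered out when it meets any one of the listed criteria (the usual GATK hard-filtering convention), that annotations are given under the standard GATK INFO keys (QD, SOR, FS, MQ, MQRankSum, ReadPosRankSum), and that a missing annotation does not trigger a filter. The function has_plateaued is a hypothetical stopping rule: the description does not state how the plateau was judged, so its window and tolerance are placeholders.

```python
# Thresholds are taken from the description above. The short keys are assumed
# to be the standard GATK INFO names of the spelled-out annotations.
SNP_FILTERS = {
    "QD": lambda v: v < 2.0,               # QualByDepth
    "SOR": lambda v: v > 3.0,              # StrandOddsRatio
    "FS": lambda v: v > 60.0,              # FisherStrand
    "MQ": lambda v: v < 40.0,              # RMSMappingQuality
    "MQRankSum": lambda v: v < -12.5,      # MappingQualityRankSumTest
    "ReadPosRankSum": lambda v: v < -8.0,  # ReadPosRankSumTest
}

INDEL_FILTERS = {
    "QD": lambda v: v < 2.0,                # QualByDepth
    "FS": lambda v: v > 200.0,              # FisherStrand
    "ReadPosRankSum": lambda v: v < -20.0,  # ReadPosRankSumTest
}


def failed_filters(annotations, filters):
    """Return the names of the criteria that a variant meets.

    Assumption: a variant with a non-empty result would be filtered out,
    and annotations absent from the record are simply skipped.
    """
    return [
        name
        for name, fails in filters.items()
        if name in annotations and fails(annotations[name])
    ]


def has_plateaued(counts, window=3, tolerance=0.01):
    """Hypothetical stopping rule for the repeated BQSR rounds.

    Returns True when the variant counts of the last `window` rounds all lie
    within `tolerance` (as a fraction of the largest of them) of each other.
    """
    if len(counts) < window:
        return False
    recent = counts[-window:]
    return (max(recent) - min(recent)) <= tolerance * max(recent)


# Example with made-up annotation values for a single SNP.
example_snp = {"QD": 1.5, "FS": 10.0, "SOR": 0.9, "MQ": 60.0}
print(failed_filters(example_snp, SNP_FILTERS))  # ['QD']
```

Keeping the thresholds in dictionaries makes the SNP and indel criteria easy to compare side by side. The variant counts passed to has_plateaued would come from each BQSR round.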


Software / Usage instructions

VCFs and genomes were generated using the BWA package and tools implemented in GATK.
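
As a quick check after downloading, the files can be inspected without any specialised software. The following is a minimal Python sketch that assumes the FASTA and VCF files are uncompressed plain text. The function names fasta_lengths and vcf_records_per_contig are hypothetical helpers written for this illustration. Header naming in the generated FASTA files is not described here, so the sequence names returned may differ from the contig names used in the VCF.

```python
from collections import Counter


def fasta_lengths(path):
    """Return {sequence name: length} for a plain-text FASTA file."""
    lengths = {}
    name = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # Use the first whitespace-delimited token of the header.
                fields = line[1:].split()
                name = fields[0] if fields else ""
                lengths[name] = 0
            elif name is not None:
                lengths[name] += len(line)
    return lengths


def vcf_records_per_contig(path):
    """Count variant records per contig in a plain-text VCF file."""
    counts = Counter()
    with open(path) as handle:
        for line in handle:
            # Skip meta-information lines, the column header and blank lines.
            if line.startswith("#") or not line.strip():
                continue
            contig = line.split("\t", 1)[0]
            counts[contig] += 1
    return counts


# Example usage with the file names listed in this dataset:
#   lengths = fasta_lengths("dsim49.fasta")
#   variants = vcf_records_per_contig("dsim49_24.vcf")
#   for contig, n in variants.most_common(5):
#       print(contig, n)
```

Both functions read the files line by line, so memory use stays small even for whole-genome files.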


Drosophila sechellia, Drosophila simulans, genome, vcf