A curated dataset of modern and ancient high-coverage shotgun human genomes

Over the last few years, genome-wide data have been collected for a large number of ancient human samples. Whilst datasets of capture SNPs have been collated, high-coverage shotgun genomes (which are relatively few, but allow certain types of analyses not possible with ascertained capture SNPs) have to be reprocessed from raw reads by individual groups, a computationally intensive task. Here, we release a dataset of 34 whole-genome sequenced samples, previously published and distributed worldwide, together with the genetic pipeline used to process them. The dataset contains 73,435,604 sites called across 18 ancient and 16 modern individuals and includes sequence data from four previously published ancient samples which we sequenced to higher coverage (10-18x). This resource will allow researchers to analyse their new samples with the same genetic pipeline and compare them directly to the reference dataset without re-processing published samples. Moreover, the dataset can easily be expanded to improve the sample distribution across both time and space.


Background & Summary
The number of ancient humans with genome-wide data available has increased from fewer than five a decade ago to more than 3,000, thanks to advances in extraction and sequencing methods for ancient DNA (aDNA) 1 . However, only a few high-quality (coverage > 10x) shotgun whole-genome sequenced ancient samples exist 2 . Moreover, the genetic pipelines used to process shotgun aDNA data are very diverse, making it hard to combine published samples from different studies and research groups. Researchers therefore have to download the raw reads of published samples and reprocess them to build a dataset against which their new samples can be compared without pipeline-associated biases. This problem is less pronounced for modern DNA samples, as the higher DNA quality and sequencing coverage partially reduce the biases introduced by the use of different bioinformatic tools.
Panels including shotgun data for modern samples distributed worldwide have been previously published, such as the Simon Genome Diversity Program 3 , 1000 Genome Project 4 and Human Genome Diversity Project (HGDP-CEPH panel) 5 .
However, the same concept has not yet been applied to ancient samples, or to a mix of modern and ancient samples. This study aims to start filling this gap by creating a dataset that includes both modern and ancient samples distributed across all continents. We therefore fully reprocessed 14 high-quality shotgun-sequenced ancient samples downloaded from the literature, generated additional new data for four previously published ancient samples, and merged them with 16 modern samples. The final dataset includes 34 individuals, and researchers can use it to quickly compare their new samples against a set of individuals distributed across time and space (Figure 1). Moreover, we hope that researchers will add further data processed with the released pipeline to increase the sample resolution in both time and space.

Sample collection
Additional sequence data were generated for four ancient samples that were previously collected and described in the following original publications: ZVEJ25 and ZVEJ31 were published in Jones et al. (2017) 6 , KK1 in Jones et al. (2015) 7 and NE5 in Gamba et al. (2014) 8 . Furthermore, 14 additional ancient samples and 16 modern samples were downloaded from the literature (see Tables 1 and 2). The final dataset includes 34 samples: 18 ancient and 16 modern.

DNA extraction, Library preparation and next-generation sequencing
DNA was extracted and libraries were prepared for ZVEJ25, ZVEJ31, KK1 and NE5 (Table 3) following the protocols described in the original publications, with the exception that DNA extracts were incubated with USER enzyme (5 µl of enzyme per 16.5 µl of extract) for 3 hours at 37°C prior to library preparation, in order to repair post-mortem molecular damage. The libraries were sequenced across 31 lanes of an Illumina HiSeq 2500.

Ancient samples
The following approach was used both for the newly sequenced ancient samples and for the raw fastq files downloaded from previously published ancient samples. Adapters were trimmed with Cutadapt v1.9.1 9 and raw reads were then aligned to the human reference sequence hg19/hs37d5 with bwa aln v0.7.12 10 with seeding disabled (-l 1000), maximum edit distance set to 0.01 (-n 0.01) and maximum number of gap opens set to 2 (-o 2). Sai files were converted into sam files using bwa samse v0.7.12, adding the read group line in the process. Bam files were generated using Samtools view v1.9 11 . Reads from multiple libraries belonging to the same sample were merged with the MergeSamFiles module of Picard v2.9.2 12 . Aligned reads were filtered for minimum mapping quality 20 with Samtools view v1.9. Indexing, sorting and duplicate removal (rmdup) were performed with Samtools v1.9. Indels were realigned using the Genome Analysis Toolkit v3.7 13 (RealignerTargetCreator and IndelRealigner modules) and 2 bp were soft-clipped from the start and end of each read using a custom python script. Final bam files were split by chromosome using Samtools view v1.9 and variant calling was performed with UnifiedGenotyper from the Genome Analysis Toolkit v3.7. All calls were filtered for minimum base quality 20 (-mbq 20) and reference-bias-free priors were used (-inputPrior 0.0010 -inputPrior 0.4995); the same priors were used for the modern samples in the Simon Genome Diversity Panel 3 . We focused on selecting a subset of the genome representing neutral genomic variation for demographic inference 14,15 .
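Our reading of the reference-bias-free priors above can be sketched in a few lines, under the assumption that GATK's -inputPrior semantics apply as documented: for a single diploid sample, the two supplied values are the priors on allele counts 1 and 2, and the allele-count-0 prior is the remainder.

```python
# Sketch: interpreting the UnifiedGenotyper priors quoted in the text
# (our reading of GATK's -inputPrior; not pipeline code).
p_het = 0.0010      # prior on the heterozygous genotype (AC = 1)
p_hom_alt = 0.4995  # prior on the homozygous-alternate genotype (AC = 2)
p_hom_ref = 1.0 - (p_het + p_hom_alt)  # remainder: homozygous-reference (AC = 0)

# The two homozygous genotypes receive identical priors, so the caller
# has no built-in preference for the reference allele.
print(round(p_hom_ref, 4))
```

Because the homozygous-reference and homozygous-alternate genotypes get the same prior mass, low-coverage sites are not systematically pulled towards the reference allele.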
Therefore, specific filters were applied to discard: recombination hotspots (filter_hotspot1000g), regions of poor mapping quality (filter_Map20), recent duplications (RepeatMasker score < 20), recent segmental duplications (filter_segDups), simple repeats (filter_simpleRepeat), gene exons together with 1000 bp flanking and conserved elements together with 100 bp flanking (filter_selection_10000_100), and positions with systematic sequencing errors (filter_SysErrHCB and filter_SysErr.starch). All CpG sites were removed, as well as C and G sites with an adjacent missing genotype. Genotypes were filtered by minimum coverage 8x and a maximum coverage defined as twice the average coverage. Vcf files per chromosome belonging to the same sample were concatenated using vcf-concat from vcftools v0.1.15 16 .
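The per-genotype filters described above (CpG removal, C/G sites next to a missing genotype, and the 8x to twice-average coverage window) can be sketched as follows; the data structures and example calls are illustrative, not the pipeline's actual code:

```python
# Minimal sketch of the per-genotype filters; each site along a chromosome
# carries the called base and its read depth, with 'N' marking a missing call.
def filter_sites(calls, mean_cov):
    """calls: list of (base, depth) tuples; returns indices of kept sites."""
    max_cov = 2 * mean_cov  # maximum depth: twice the average coverage
    n = len(calls)
    kept = []
    for i, (base, depth) in enumerate(calls):
        if base == "N":
            continue
        # drop CpG dinucleotides (the C and the following G)
        if base == "C" and i + 1 < n and calls[i + 1][0] == "G":
            continue
        if base == "G" and i > 0 and calls[i - 1][0] == "C":
            continue
        # drop C/G sites whose neighbour is missing (a possible hidden CpG)
        if base in "CG":
            if (i > 0 and calls[i - 1][0] == "N") or \
               (i + 1 < n and calls[i + 1][0] == "N"):
                continue
        # coverage window: minimum 8x, maximum twice the average
        if depth < 8 or depth > max_cov:
            continue
        kept.append(i)
    return kept

calls = [("A", 30), ("C", 30), ("G", 30), ("T", 30), ("C", 30), ("N", 0), ("T", 5)]
print(filter_sites(calls, mean_cov=30))  # -> [0, 3]
```

In this toy example only the A at index 0 and the T at index 3 survive: the C/G pair is a CpG, the second C sits next to a missing genotype, and the final T falls below the 8x minimum.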

Modern samples
Bam files were downloaded from the Simon Genome Diversity Panel 3 and from McColl et al. 17 (Table 2). Bam files were split by chromosome, and variant calling, CpG-site filtering and coverage filtering were performed as described above for the ancient samples, with the same options and thresholds.

Final dataset
Per-sample vcf files were compressed with bgzip and indexed with tabix from htslib v1.6 11 . The final dataset was assembled by merging the filtered, compressed vcf files of all modern and ancient samples with bcftools merge v1.6 11 . Only sites with called genotypes for all samples were kept, using vcftools v0.1.15 (--max-missing 1). Sites with more than two alleles were also discarded using bcftools view v1.6 (-m1 -M2). Final vcf statistics were generated with bcftools stats v1.6. Downstream analysis and plotting were performed in R v3.6.3 18 .
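As a minimal illustration of the site-level filters applied after merging (an in-memory stand-in for the vcftools/bcftools steps above, with made-up positions and genotypes):

```python
# Sketch of the post-merge site filters; sites maps position -> list of
# per-sample genotypes, with '.' marking a missing allele.
def filter_merged(sites):
    """Keep sites called in every sample (no missing data) with at most
    two alleles (reference plus one alternate)."""
    kept = {}
    for pos, genos in sites.items():
        if any("." in g for g in genos):  # any missing data -> drop (--max-missing 1)
            continue
        alleles = {a for g in genos for a in g.split("/")}
        if len(alleles) > 2:              # drop multiallelic sites
            continue
        kept[pos] = genos
    return kept

sites = {
    100: ["A/A", "A/G", "G/G"],  # biallelic, fully called -> kept
    200: ["A/A", "./.", "A/A"],  # missing genotype -> dropped
    300: ["A/C", "A/G", "C/G"],  # three alleles -> dropped
}
print(sorted(filter_merged(sites)))  # -> [100]
```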

Data Records
All newly generated sequencing raw reads have been deposited in the NCBI Sequence Read Archive XXX.

Technical Validation
Summary of newly generated data
DNA was extracted from four previously published samples (ZVEJ25, ZVEJ31, KK1 and NE5) and sequence data were generated with an average coverage between 10x and 18x (Table 3). Endogenous DNA content was estimated at between 0.48 and 0.71 across all libraries (Table 4). Each library yielded between 150 and 425 million reads, corresponding to 15.2 and 42.9 Gb respectively (Table 4).
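As a rough consistency check on the figures above, the quoted read counts and yields imply a read length of around 100 bp, as expected for a HiSeq 2500 run (the read length itself is our inference, not a value stated in the text):

```python
# Implied read length from the two extreme libraries quoted in the text:
# (reads in millions, yield in Gb).
for reads_millions, gigabases in [(150, 15.2), (425, 42.9)]:
    read_len = gigabases * 1e9 / (reads_millions * 1e6)
    print(round(read_len, 1))  # both come out close to 100 bp
```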

Summary of the whole dataset including ancient and modern samples
The final dataset includes 34 samples with 509,348,047 sites in neutral regions before filtering (see the Methods section for a detailed description of which regions were considered for variant calling). Sites not called across all samples (0% missing data allowed) were then discarded and 73,439,415 sites were retained. Multiallelic sites (3,811) were also removed, bringing the final number of filtered sites to 73,435,604 (Table 5). The minimum and maximum coverage per sample within the final dataset (within filtered intervals) are 11.3x and 55x respectively, with an average coverage across all samples of 30.1x (Table 5). We calculated the number of transitions (ts), transversions (tv) and the ts/tv ratio per sample (Table 5). As expected, all eight ancient samples that were not subjected to UDG treatment showed a higher ts/tv ratio than their UDG-treated counterparts (see Figure 2), consistent with higher levels of DNA damage in these samples. The Brazilian sample Sumidouro 5 shows the largest excess of transitions, possibly due to poor DNA preservation caused by environmental conditions. All other samples (both modern and UDG-treated ancient) showed similar ts/tv ratios, with an average of 1.73 and maximum and minimum of 1.76 and 1.63 respectively (see Table 5, Figure 2).
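The ts/tv summary above can be computed from any list of SNPs in a few lines; the sketch below uses a toy SNP list (the published per-sample counts come from bcftools stats, not from this code):

```python
# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T);
# all other substitutions are transversions.
TRANSITIONS = {frozenset("AG"), frozenset("CT")}

def ts_tv_ratio(snps):
    """snps: list of (ref, alt) allele pairs; returns (ts, tv, ts/tv)."""
    ts = sum(1 for ref, alt in snps if frozenset((ref, alt)) in TRANSITIONS)
    tv = len(snps) - ts
    return ts, tv, ts / tv

snps = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("T", "G")]
print(ts_tv_ratio(snps))  # -> (3, 2, 1.5)
```

An inflated ts/tv ratio in non-UDG-treated ancient samples is expected because post-mortem cytosine deamination produces spurious C-to-T (and G-to-A) changes, which this classification counts as transitions.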

Code Availability
The pipeline used to process the data with all scripts is available at XXX.