Source code of simulations for Hurles et al. Origins of chromosomal rearrangement hotspots in the human genome". ********************************************* ; GC_Diverge.pro is a program to investigate the impact of gene conversion between paralogs on sequence divergence ; Simulate evolution of two pairs of paralogous sequence using 0-3 to represent A,C,G,T, avoiding use of strings, which IDL is not good at handling ; Use same starting sequence for all four sequences and then at end score sequence similarity ; Includes some variants between the paralogs to start within, i.e. duplication preceeds speciation ; Each pair of paralogous sequence constrains gene conversion to a portion so as to allow ; normal divergence to occur in the same sequence, as a control ; Directionality parameter allows investigation of directionality changes ; Uses Jukes Cantor sequence model - equal probability of base changing to any other base ; The results (sequence similarities) are plotted and are also output to a text file for further analysis ;;;;;;;;MODIFICATION HISTORY;;;;;;;; ; created 9/3/03 with single GC model of geometric tract distribution ; modified 11/3/03 ;;;;;SET PARAMETERS;;;;;;;; length=10000; length of paralogous sequence, use multiples of windowlength GClength=5000; length of paralogous sequence undergoing gene conversion (GC), starting at 5' end, use multiples of windowlength windowlength=200.0; size of sliding window to examine divergence in resulting sequence comparisons GCrate=0.1; rate of GC per locus relative to per locus rate of base substitution variants=2; % variant between proximal and distal ancestral copies polarity=[0.5,0.5]; ratio of proximal to distal GC and distal to proximal GC. 0.5 equals non-directional ; The two values allow different polarities for the different daughter species Ncycles=100; number of cycles to continue simulation BSrate=0.8; probability of a base substitution occuring in each cycle ; if each cycle involves a per locus base substitution rate of BSrate, it represents simulation over Ncycles*BSrate/(mu*L) generations mu=4E-8; rate of base substitution per nucleotide per generation Gentime=20; generation time Nreps=500; number of replications; conversionmodel=1; set model for gene conversion, 1=geometric tract length meantract=352; mean GC tract length described by geometric distribution, mean =1/p geometric=(meantract-1.0)/meantract; geometric CDF is y=1-q^x, the equation underlying GC tract length, where q=1-p ; probability of getting a tract length n is p*q^(n-1), which is the same as p*(1-p)^(n-1) ; 0.99717 is from Hilliker et al 1994, and gives a mean of 353bp ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; start=systime(); to monitor time taken ;;;;;;PRINT INITIAL INFORMATION print, 'simulations proceeding over ', Ncycles*BSrate/(mu*length), ' generations' print, 'equivalent to ', Ncycles*BSrate*gentime/(mu*length), ' years' print, 'Gene conversion rate equivalent to ', Ncycles*BSrate*GCrate/(Ncycles*BSrate/(mu*length)), ' per locus per generation' ;;;;;;;;;;;;;;;;;;;;;;;;;;;;; ;;;;;;;;;input parameters of Geometric gene conversion model;;;;;;;;;;;;; ; Use Genrand procedure for generating random conversion tract lengths according to this distribution ; Genrand.pro and included procedures are from the SDSS library of IDL routines ; to use genrand need a set of matching X and f values covering the given distribution ; generate random numbers using: GENRAND, fvalues, xvalues, nrand, array ; where 'nrand' is the number of random numbers required and 'array' is the array in which they are stored ; the resulting distribution has been tested using the plothist procedure ;generate paired f and x values for geometric distribution to allow random numbers to be generated from it ;start at 0 and work out f at intervals of 10 of x until f = less than 1 in 100000 if (conversionmodel eq 1) then begin flimit=round(alog(0.00001)/alog(geometric));work out limit of f xvalues =findgen(round(flimit/10)) xvalues=10*xvalues fvalues=geometric^xvalues ;how many random numbers required? ; Mean = Nreps*cycles*BSrate*GCrate*2 ; therefore conservative mean is 2*Nreps*cycles*BSrate*GCrate*2 = 2*100*100*0.5*0.1 = 2000 ; what about variance around the mean? therefore generate twice as many number as conservative estimate needrand=0L needrand=2L*Nreps*Ncycles*BSrate*GCrate*2+1; conservative estimate of random numbers needed genrand, fvalues, xvalues, needrand, georand countgeorand=0L; counter to ensure each random number only used once print, 'tract lengths set' endif ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; ;;;; set arrays to contain data from replications Simrep=fltarr(2,Nreps) windows=length/windowlength; windows for calculatng sequence similarity plotsim=fltarr(windows, 4, Nreps); array to hold % similarity over 200bp windows ;;;;; ;;; Start replications For x=0, Nreps-1 do begin sequences=intarr(length,4); set two pairs of identical paralogs into array with order ; sequences(*,0)= proximal 1 ; sequences(*,1)= distal 1 ; sequences(*,2)= proximal 2 ; sequences(*,3)= distal 2 ; introduce some variants into the distal copies in both species if (variants gt 0) then begin Nvariants=fix(length/100.0*variants); calculate number of variant sites to add variantsites=fix(randomu(h,Nvariants)*10000.0); choose random sites for variants sequences(variantsites,1)=1; introduce into distal 1 sequences(variantsites,3)=1; introduce into distal 2 endif ; gene conversion may occur between proximal 1 and distal 1, and proximal 2 and distal 2 countconv=intarr(2,2); count conversion events in each pair of paralogs For y=0, Ncycles-1 do begin random=randomu(d,6); generate random numbers, four for base substitution and two for gene conversion For z=0,3 do begin; generate base substitution or not in all four sequences if (random(z) lt BSrate) then begin ; define base substitution event bs=randomu(q,2); first number defines position, second defines new base site=fix(bs(0)*length); index of site to be changed, between 0 and length-1 newbase=fix(bs(1)*3); index of new base, 0-2, three options; need to set depending on ancestral state case sequences(site,z) of 0: newbase=newbase+1; change base 0 to 1,2 or 3 1: case newbase of ; change base 1 to 0,2 or 3 0: newbase=0 1: newbase=2 2: newbase=3 endcase 2: case newbase of ; change base 2 to 0,1 or 3 0: newbase=0 1: newbase=1 2: newbase=3 endcase 3: newbase=newbase ; change base 3 to 0,1 or 2 endcase sequences(site,z)=newbase endif endfor ; done base substitution for all four sequences ; gene conversion or not for each pair of paralogs For c=0,1 do begin; gene conversion between each pair of paralogs in turn if (random(4+c) lt (GCrate*BSrate)) then begin ; define gene conversion event tract=round(georand(countgeorand)); get tract length from predefined array countgeorand=countgeorand+1L; ensure each tract length is only used once gc=randomu(r,2); first number defines polarity, second defines start site startsite=fix(gc(1)*(GClength-tract)) if (gc(0) lt polarity(c)) then begin ; proximal to distal conversion, distal= acceptor sequences(startsite:startsite+tract, (c*2+1))=sequences(startsite:startsite+tract, (c*2)) ;print, 'gene conversion proximal ', c+1, 'to distal', c+1 countconv(0,c)=countconv(0,c)+1 endif else begin ; distal to proximal conversion, proximal=acceptor sequences(startsite:startsite+tract, (c*2))=sequences(startsite:startsite+tract, (c*2+1)) countconv(1,c)=countconv(1,c)+1 endelse endif Endfor; done gene conversion for both pairs of paralogs Endfor ; end this cycle ;;;; Store data for this replication ;;;; measure sequence similarity over whole sequence similarity=intarr(4) similarity(0)=n_elements(where(sequences(0:(GClength-1),0) eq sequences(0:(GClength-1),2))) similarity(1)=n_elements(where(sequences(0:(GClength-1),1) eq sequences(0:(GClength-1),3))) similarity(2)=n_elements(where(sequences(GClength:(length-1),0) eq sequences(GClength:(length-1),2))) similarity(3)=n_elements(where(sequences(GClength:(length-1),1) eq sequences(GClength:(length-1),3))) GCsimilarity=total(similarity(0:1))/(2*GClength) nonGCsimilarity=total(similarity(2:3))/(2*(length-GClength)) ;store for each replication simrep(0,x)=GCsimilarity simrep(1,x)=nonGCsimilarity ;;;;;;; ;;;; measure sequence similarity over sliding windows for f=0,windows-1 do begin ;p1 v p2 orthologs plotsim(f,0,x)=n_elements(where(sequences(windowlength*f:(windowlength*f+(windowlength-1)),0) eq sequences(windowlength*f:(windowlength*f+(windowlength-1)),2))) ; d1 v d2 orthologs plotsim(f,1,x)=n_elements(where(sequences(windowlength*f:(windowlength*f+(windowlength-1)),1) eq sequences(windowlength*f:(windowlength*f+(windowlength-1)),3))) ; p1 v d1 paralogs plotsim(f,2,x)=n_elements(where(sequences(windowlength*f:(windowlength*f+(windowlength-1)),0) eq sequences(windowlength*f:(windowlength*f+(windowlength-1)),1))) ;p2 v d2 paralogs plotsim(f,3,x)=n_elements(where(sequences(windowlength*f:(windowlength*f+(windowlength-1)),2) eq sequences(windowlength*f:(windowlength*f+(windowlength-1)),3))) endfor ;;;;;;; Endfor; end replications print, mean(simrep(0,*)), ' % similarity in GC region' print, mean(simrep(1,*)), ' % similarity in non-GC region' print, 100.0*n_elements(where(simrep(0,*) lt simrep(1,*)))/nreps, '% of reps when GC more diverged than non-GC' ;;;; average sequence similarity over sliding windows over all replications avgsim=fltarr(windows,2); array to contain ortholog and paralog averages for each window pairwisesim=fltarr(windows,4); array to contain pairwise ortholog and paralog divergences for each window For g=0,1 do begin; for each average pairwise comparison For h=0,windows-1 do begin; for each sliding window avgsim(h,g)=(mean(plotsim(h,2*g,*))+mean(plotsim(h,(2*g+1),*)))/2/windowlength*100.0 endfor endfor For j=0,3 do begin; for each average pairwise comparison For h=0,windows-1 do begin; for each sliding window pairwisesim(h,j)=mean(plotsim(h,j,*))/windowlength*100.0 endfor endfor ;;;;;;; window, 0 plot, avgsim(*,0), yrange=[(min(avgsim(*,0))-2),100], title='Sequence similarity between orthologs (p1 and p2; d1 and d2)' window, 1 plot, avgsim(*,1), yrange=[(min(avgsim(*,1))-2),100], title= 'Sequence similarity between paralogs (p1 and d1; p2 and d2)' print, start, systime() ;;;;;;;;;;;;;;; CHOOSE OUTPUT FILE FOR WRITING RESULTS TO output=dialog_pickfile(/write) IF output EQ '' THEN print, 'no file selected' openw, lun, output, /get_lun ;;;;;;;;;;;;;;;; module for printing averaged paralog and ortholog divergence to file printf, lun, 'sequence similarity between orthologs' printf, lun, avgsim(*,0) printf, lun, 'sequence similarity between paralogs' printf, lun, avgsim(*,1) ;;;;;;;;;;;;;;;; module for printing pairwise paralog and ortholog divergences to file ;printf, lun, 'sequence similarity between orthologs p1 and p2' ;printf, lun, pairwisesim(*,0) ;printf, lun, 'sequence similarity between orthologs d1 and d2' ;printf, lun, pairwisesim(*,1) ;printf, lun, 'sequence similarity between paralogs p1 and d1' ;printf, lun, pairwisesim(*,2) ;printf, lun, 'sequence similarity between paralogs p2 and d2' ;printf, lun, pairwisesim(*,3) free_lun, lun beep END