Fusion genes
in breast cancer
Elizabeth M. Batty
Clare College, University of Cambridge
A dissertation submitted to the University of
Cambridge in candidature for the degree of
Doctor of Philosophy
November 2010
Declaration
This dissertation contains the results of experimental work carried out between October 2006
and October 2010 in the Department of Pathology, University of Cambridge. This dissertation is
the result of my own work and includes nothing which is the outcome of work done in
collaboration except where specifically indicated in the text. It has not been submitted whole or
in part for any other qualification at any other University.
Summary
Fusion genes in breast cancer
Elizabeth Batty
Fusion genes caused by chromosomal rearrangements are a common and important feature in
haematological malignancies, but have until recently been seen as unimportant in epithelial
cancers. The discovery of recurrent fusion genes in prostate and lung cancer suggests that
fusion genes may play an important role in epithelial carcinogenesis, and that they have been
previously under-reported due to the difficulties of cytogenetic analysis of solid tumours. In
particular, breast cancers often have complex, highly rearranged karyotypes which have proved
difficult to analyse using classical cytogenetic techniques.
The aim of this project was to search for fusion genes in breast cancer by using high-resolution
mapping of chromosome rearrangements in breast cancer cell lines. Mapping the chromosome
rearrangements was initially done using high-resolution DNA microarrays and fluorescence in-
situ hybridisation, but moved to high-throughput sequencing as it became available. Interesting
candidate genes identified from the mapped chromosome rearrangements were investigated
on a larger set of cell lines and primary tumours.
The complete karyotypes of two breast cancer cell lines were constructed using a combination
of microarrays, fluorescence microscopy, and high-throughput sequencing. A number of
potential fusion genes were identified in these two cell lines. Although no expressed fusion
genes were found, the complete karyotypes gave insight into the number and mechanisms of
chromosome rearrangement in breast cancer, and identified interesting candidate genes which
may be of importance in tumourigenesis. Two genes which were fused in other breast cancer
cell lines, BCAS3 and ODZ4, were disrupted by chromosome rearrangements and identified as
interesting candidate genes in tumourigenesis.
A bioinformatic pipeline to process high-throughput sequencing data was set up and validated,
and was shown to predict fusion genes more accurately than other methods. It can be used to
investigate further cell lines and tumours for recurrent fusion genes. The pipeline was used to
analyse data from three other breast cancer cell lines and predict chromosomal rearrangements
and fusion genes, several of which were found to be expressed. Of the fusions predicted in the
cell line ZR-75-30, seven were found to be expressed, and these may have functional
significance in breast cancer.
Acknowledgements
I would like to thank all my colleagues and collaborators, especially Karen and Susie for all their
support in the lab, and Carole for advising me. I acknowledge the contributions of Karen
Howarth, Suet-Feung Chin, Ina Schulte, Jess Pole, Susanne Flach, Scott Newman, and Kevin
Howe to the work in this thesis.
I am grateful to Clare College, the Medical Research Council, and especially Breast Cancer
Campaign for their financial support.
Thanks to my family and friends, especially my parents for their continued support, and my
housemates for their continual provision of coffee.
My final thanks go to my supervisor, Paul, for providing support, inspiration and enthusiasm for
my project for four years, and for giving me the chance to work in his lab.
Contents
Declaration
Summary
Acknowledgements
List of Figures
List of Tables
Abbreviations
Chapter 1 – Introduction
1.1 Introduction
1.2 Genes and pathways altered in cancer
1.2.1 How many genetic changes are needed to progress to cancer?
1.2.2 The role of genomic instability
1.3 Breast cancer
1.3.1 Genes commonly altered in breast cancer
1.3.2 Breast cancer susceptibility genes
1.4 Classification of breast cancer
1.4.1 Breast cell lineages
1.5 Cytogenetics of breast cancer
1.5.1 Mechanisms of chromosome amplification
1.6 Chromosome translocations
1.7 Research techniques
1.7.1 Karyotyping
1.7.2 Fluorescence in-situ hybridisation
1.7.3 Chromosome painting and spectral karyotyping
1.7.4 Comparative genomic hybridisation
1.7.5 Array comparative genomic hybridisation
1.7.6 Sequencing
1.7.7 Cell lines
1.8 Hypothesis
Chapter 2 – Materials and methods
2.1 Cell culture
2.1.1 Sources
2.1.2 Culturing cells
2.2 RNA extraction
2.3 Protein extraction
2.4 DNA extraction
2.5 cDNA synthesis
2.6 PCR
2.7 Real-time PCR
2.8 Sequencing
2.9 Flow sorting of chromosomes
2.9.1 Genomiphi amplification of sorted chromosomes
2.10 Metaphase preparations from cell lines
2.10.1 Metaphase spreads
2.11 BAC growth and DNA extraction
2.12 FISH probes and hybridisation
2.12.1 Nick translation
2.12.2 Hybridisation
2.12.3 Detection
2.13 Array painting
2.13.1 Labelling
2.13.2 Hybridisation
2.13.3 Washing
2.13.4 Scanning
2.14 Western blotting
2.14.1 Blotting
2.14.2 Detection
2.15 High-throughput sequencing
Chapter 3 – Rearrangements of BCAS3 in breast cancer
3.1 Introduction
3.2 Results
3.2.1 High-resolution array painting
3.2.2 FISH mapping of the derivative chromosome
3.2.3 Fine mapping and sequencing of the breakpoint
3.2.4 BCAS3 in other cell lines and tumours
3.3 Discussion
Chapter 4 – The complete karyotype of HCC1806
4.1 Introduction
4.1.1 Previous work
4.2 Results
4.2.1 High-resolution breakpoints from SNP6.0 arrays
4.2.2 Determining the breakpoints
4.2.3 Comparison of array painting and the SNP6.0 arrays
4.2.4 Identification of previously undetectable copy number changes
4.2.5 Assembly of the complete karyotype
4.2.6 Discrepancies between array painting and whole genome SNP6.0 array
4.2.7 Systematic search for fusion genes
4.2.8 New fusion genes identified by high-resolution arrays
4.2.9 Fusion genes caused by small deletions
4.2.10 Fusion genes caused by tandem duplications
4.3 Discussion
Chapter 5 – The complete karyotype of MDA-MB-134 obtained using array painting and high-throughput sequencing
5.1 Introduction
5.1.1 Previous work
5.2 Results
5.2.1 Assembling a complete karyotype of MDA-MB-134
5.2.2 Array painting of rearranged chromosomes
5.2.3 High-throughput sequencing
5.2.4 Paired-end read high-throughput sequencing of MDA-MB-134
5.2.5 Detection of structural variants
5.2.6 Validation of structural variants
5.2.7 Sequencing of structural variant junctions
5.2.8 Potential fusion genes found by high-throughput sequencing
5.2.9 ODZ4 as a potential fusion gene
5.3 Discussion
Chapter 6 – Bioinformatics of high-throughput sequencing of breast cancer
6.1 Introduction
6.2 Alignment of high-throughput sequencing reads
6.2.1 Calling normal and aberrant reads
6.2.2 Processing mate-pair data
6.3 Clustering and structural variant calling
6.4 Fusion gene prediction
6.4.1 Validation of predicted fusion genes
6.5 Copy number variation
6.5.1 Correcting for mappability
6.5.2 Correcting for GC content
6.5.3 Segmentation and copy number analysis
6.6 Discussion
Chapter 7 – Discussion
7.1 How prevalent are fusion genes in breast cancer?
7.2 How important are fusion genes in breast cancer?
7.3 How can fusion genes in breast cancer be found?
7.4 Mechanisms of chromosome rearrangement in breast cancer
7.5 Future directions
7.5.1 ODZ4
7.5.2 High-throughput sequencing and bioinformatic analysis
7.6 Conclusion
References
Appendix 1 – Primers used
Appendix 2 – BACs used
Appendix 3 – Manufacturers and suppliers
Appendix 4 – Bioinformatic pipeline scripts
Appendix 5 – Documentation for bioinformatic pipeline
List of figures
Chapter 1 – Introduction
Figure 1.1 Breakage-fusion-bridge cycles
Chapter 3 – Rearrangements of BCAS3 in breast cancer
Figure 3.1 Array painting
Figure 3.2 Position of the chromosome 17
Figure 3.3 Breaks in BCAS3 in MCF7 and HCC1806
Figure 3.4 Position of the deletion on chromosome 7
Figure 3.5 Position of the breakpoint on chromosome 8
Figure 3.6 Possible orientations of chromosome fragments
Figure 3.7 FISH for orientation of chromosome 7 fragment
Figure 3.8 FISH for possible inversion on chromosome 7
Figure 3.9 Tiling path array of chromosome 7
Figure 3.10 Junction sequence for the 7;17 junction
Figure 3.11 Final orientation of chromosome fragments in the der(7)t(8;7;17)
Figure 3.12 FISH for BCAS3 on a SUM52 metaphase spread
Figure 3.13 RT-PCR on BCAS3 exons in cell lines
Figure 3.14 Location of BAC probes used for TMA FISH
Figure 3.15 Western blot of BCAS3
Chapter 4 – The complete karyotype of HCC1806
Figure 4.1 SKY karyotype of HCC1806
Figure 4.2 Flow karyotype of HCC1806 chromosomes
Figure 4.3 Segmentation of SNP6.0 arrays by circular binary segmentation
Figure 4.4 Breakpoint duplication from SNP6.0 arrays
Figure 4.5 Related derivative chromosome in HCC1806
Figure 4.6 A deletion identified from the SNP6.0 array
Figure 4.7 Array painting of chromosome 2 in fraction A compared to SNP6.0 array
Figure 4.8 Array painting of chromosome 2 in fraction G compared to SNP6.0 array
Figure 4.9 FISH to confirm chromosome 2 amplification in fraction G
Figure 4.10 FISH to confirm chromosome 2 breakpoint in fraction G
Figure 4.11 Fusion gene caused by translocation
Figure 4.12 Readthrough fusion gene caused by translocation
Figure 4.13 Primer design for amplification of fusion products
Figure 4.14 PCR for fusion genes caused by translocations in HCC1806
Figure 4.15 Fusion gene caused by intrachromosomal deletion
Figure 4.16 PCR for fusion genes caused by small deletions
Figure 4.17 Fusion genes caused by tandem duplications 1
Figure 4.18 Fusion genes caused by tandem duplications 2
Figure 4.19 Fusion genes caused by tandem duplications 3
Figure 4.20 PCR for fusion genes caused by tandem duplications
Chapter 5 – The complete karyotype of MDA-MB-134 obtained using array painting and high-throughput sequencing
Figure 5.1 SKY karyotype of MDA-MB-134
Figure 5.2 Schematic of chromosome 8 and 11 amplification
Figure 5.3 Flow karyotype of MDA-MB-134 chromosomes
Figure 5.4 Reverse chromosome painting of chromosome fractions A and C
Figure 5.5 Reverse chromosome painting of chromosome fraction 14
Figure 5.6 Array painting of MDA-MB-134 fraction C
Figure 5.7 Array painting of MDA-MB-134 fraction 14
Figure 5.8 FISH to show orientation of der(15)t(15;17)
Figure 5.9 FISH to show orientation of der(16)t(16;18)
Figure 5.10 Array painting of MDA-MB-134 fraction A
Figure 5.11 Comparison of array painting and whole genome SNP6.0 data
Figure 5.12 Position of amplifications on chromosome 8
Figure 5.13 Position of deletions on chromosome 8
Figure 5.14 Position of amplifications on chromosome 11
Figure 5.15 Illumina high-throughput sequencing
Figure 5.16 FISH to confirm chromosome X translocation
Figure 5.17 Circos plot of structural variants
Figure 5.18 Circos plot of structural variants on chromosomes 8 and 11
Figure 5.19 Structure of the 8;11 amplicon
Figure 5.20 Fusion gene PCR strategy
Figure 5.21 PCR to detect fusion transcripts 1
Figure 5.22 PCR to detect fusion transcripts 2
Figure 5.23 PCR to detect readthrough fusion transcripts
Figure 5.24 The two predicted KLHL35 fusions
Figure 5.25 PCR to detect internal gene deletions
Figure 5.26 ODZ4 breakpoints in breast cancer cell lines
Figure 5.27 PCR for ODZ4 in cell lines
Figure 5.28 Real-time PCR for ODZ4 in cell lines
Figure 5.29 Comparison of segmented copy number
Chapter 6 – Bioinformatics of high-throughput sequencing of breast cancer
Figure 6.1 High-throughput sequencing pipeline steps
Figure 6.2 Small-insert sequencing libraries
Figure 6.3 Construction of mate-pair libraries
Figure 6.4 Clustering of reads to call structural variants 1
Figure 6.5 Clustering of reads to call structural variants 2
Figure 6.6 Structural variant calling from small-insert libraries
Figure 6.7 Structural variant calling from mate-pair library
Figure 6.8 Fragment size distribution from a mate-pair library
Figure 6.9 Fusion gene prediction from small-insert library
Figure 6.10 Fusion gene prediction from mate-pair library
Figure 6.11 Reads in a window versus GC percentage of the window
Figure 6.12 GC bias in copy number plots
Figure 6.13 Correction of copy number data for GC bias
Figure 6.14 Segmented copy number plot showing small region of amplification
Chapter 7 – Discussion
Figure 7.1 Artefactual versus real structural variants
Figure 7.2 Improvements to structural variant calling
List of tables
Chapter 2 – Materials and methods
Table 2.1 Origin of cell lines
Table 2.2 Cell line growth conditions
Table 2.3 Extended chromosome preparations
Chapter 4 – The complete karyotype of HCC1806
Table 4.1 Potential fusion genes caused by translocations and large deletions
Table 4.2 Potential fusion genes caused by small deletions
Table 4.3 Potential fusion genes caused by tandem duplications
Chapter 5 – The complete karyotype of MDA-MB-134 obtained using array painting and high-throughput sequencing
Table 5.1 Categories of structural variants in MDA-MB-134
Table 5.2 Predicted structural variants in MDA-MB-134
Table 5.3 Validated structural variants in MDA-MB-134
Table 5.4 Exact breakpoints and homology of structural variants in MDA-MB-134
Table 5.5 Predicted gene fusions from high-throughput sequencing
Table 5.6 Predicted readthrough gene fusions from high-throughput sequencing
Table 5.7 Predicted internal gene deletions from high-throughput sequencing
Chapter 6 – Bioinformatics of high-throughput sequencing of breast cancer
Table 6.1 Predicted fusion genes in the cell line ZR-75-30
Table 6.2 Predicted fusion genes in the paired cell lines VP229 and VP267
Chapter 7 – Discussion
Table 7.1 Fusion genes currently known in breast cancer
Abbreviations
API application programming interface
ATCC American Type Culture Collection
BAC bacterial artificial chromosome
BSA bovine serum albumin
CGH comparative genomic hybridisation
DAPI 4',6-diamidino-2-phenylindole
DMEM Dulbecco's Modified Eagle medium
DMSO dimethyl sulphoxide
DOP-PCR degenerate oligonucleotide-primed polymerase chain reaction
FBS foetal bovine serum
FISH fluorescence in-situ hybridisation
ITS insulin-transferrin-selenium supplement
LB Luria Bertani
MAQ Mapping and Assembly with Qualities
M-FISH multiplex fluorescence in-situ hybridisation
MMTV mouse mammary tumour virus
PBS phosphate buffered saline
PMT photomultiplier tube
RPMI Roswell Park Memorial Institute
SKY spectral karyotyping
SSC sodium chloride sodium citrate
SST sodium chloride sodium citrate 0.05% Tween 20
SV structural variant
TE tris-EDTA
Chapter 1
Introduction
1.1 Introduction
Cancer is caused by the accumulation of genetic changes in genes which control cell
death and proliferation, but the number of changes which are necessary to progress to
malignancy, which genes or pathways they affect, and the different mechanisms of
genetic change are a subject of much debate.
1.2 Genes and pathways altered in cancer
Hanahan and Weinberg (2000) described six key processes which must be deregulated
in the cell for progression to malignancy, and suggested that the large numbers of genes
implicated in cancer represent different ways to evade the anti-cancer defences of the
cell. As I am primarily interested in breast cancer, I have looked at how breast tumours
may exhibit genetic changes which contribute to the deregulation of these six processes.
The first process which must be overcome is the dependence on external growth factors
to signal the cell to proliferate. This can be overcome by the cell generating its own
growth factors, or by altering the requirements of the growth factor receptors and their
downstream pathways. An example of this process is the amplification
and overexpression of the ERBB2 receptor in breast cancer (Slamon et al., 1987), which
may act by making the cell hypersensitive to small amounts of growth factors.
The cell must also ignore the antiproliferative signals which attempt to block cell
proliferation. At the G1–S phase transition of the cell cycle, the cell decides whether
to continue to proliferate, or whether to stop dividing and become quiescent or
differentiate. The RB tumour suppressor gene is important in the control of this
transition, and changes in the RB pathway in breast cancer may remove the block on cell
proliferation. RB expression is lost in 20-35% of breast tumours (Bosco and Knudsen,
2007). In tumours which retain RB expression, it may be inactivated by phosphorylation
by cyclin/CDK complexes, and the amplification of CCND1 in breast cancer may
contribute to aberrant phosphorylation. Estrogen also upregulates the promoter of
CCND1, and anti-estrogenic therapies may act by inhibiting cell cycle progression
(Foster et al., 2001).
Tumour cells must also evade apoptosis, either by deregulating the machinery which
senses the signals which trigger apoptosis, or by turning off the pathways which respond
to these signals and cause the cell to die. TP53 is a sensor of DNA damage, and
upregulates other pro-apoptotic genes. In breast cancer, the TP53 gene is one of the
two most commonly mutated genes, and its downstream targets are also commonly
mutated (Pharoah et al., 1999).
The cell may also have to evade the signals which limit its multiplicative potential and
become immortal. As the telomeres of chromosomes become shorter with each cell
division, telomere length acts as a brake on unlimited replication: eventually the telomeres
are lost and the ends of the chromosomes fuse, usually leading to cell death.
Tumours often evade this check by upregulating the expression of telomerase; telomerase
activity was found in over 90% of breast tumours (Hiyama et al.,
1996).
For a tumour to progress and grow to greater size, it must recruit new blood vessels to
supply oxygen and nutrients to the tumour, and it must deregulate the mechanisms
which control angiogenesis and vasculogenesis in the cell, by upregulating the inducers
of angiogenesis or downregulating the suppressors. Whether this is achieved by
alteration of the genes involved in angiogenesis pathways, or whether angiogenesis is
upregulated in the tumour as part of the natural response to hypoxia is unclear, and
molecules which regulate angiogenesis are produced not only by the cancerous cell but
by the normal cells surrounding the tumour (Carmeliet and Jain, 2000). Regardless of
whether the mechanism is genetic or a part of normal homeostatic processes, pro-
angiogenic genes are a common drug target in cancer (Banerjee et al., 2007). In breast
cancer, the amplification and overexpression of ERBB2 can increase angiogenesis and
expression of VEGF (Kumar and Yarmand-Bagheri, 2001), and VEGF is a
target of antiangiogenic agents in breast cancer (Banerjee et al., 2007).
The final barrier to tumour progression is to acquire the capability for invasion and
metastasis. The genetic changes which lead to invasion and metastasis are not well
known, and one explanation for the difficulty in understanding these changes is that
genetic changes which specifically lead to invasion and metastasis do not exist. Bernards
and Weinberg (2002) argue that a change which leads to metastasis, unlike the changes
which lead to tumour growth and immortality, does not confer an advantage to the
primary tumour and would remain rare. This implies that the changes which enable
metastasis are already present in the tumour, and confer some early selective
advantage as well as the ability to metastasize later in tumour progression. This is
supported by evidence from gene expression profiling, which shows that breast primary
tumours and their distant metastases show similar expression patterns, suggesting that
the dominant clone in the primary tumour may have already acquired the capability for
metastasis long before it occurs (Weigelt et al., 2003). A further argument suggests that
metastasis occurs by the chance event of a malignant cell escaping into the vasculature
and finding a site suitable for growth, without any additional genetic changes (Edwards,
2002). However, this argument remains controversial, and further research is needed to
give a definitive answer.
1.2.1 How many genetic changes are needed to progress to cancer?
It was noted as early as 1957 that tumours progressed through different stages to
metastasis in a stepwise manner (Foulds, 1957), and that this was consistent with a
tumour which emerged from a single cell and progressed by acquiring more genetic
changes. Although early modelling of the age distribution of cancer suggested that four
to twelve mutational events would be necessary to cause cancer (Armitage and Doll,
1954), this assumed that the mutational events were independent. Later work suggests
that a better model is a two-stage theory of carcinogenesis, where the first stage gives
the cell a selective advantage, and makes it more likely to accumulate the mutation or
mutations necessary to progress to a second stage (Armitage and Doll, 1957). A detailed
analysis of this two-stage model suggested that three rate-determining events were
needed for cancer to arise (Stein and Stein, 1990).
This model also suggests that while most cancers follow this two-stage pattern, some
cancers, such as retinoblastoma, have single-stage kinetics. This was explained by work
on the occurrence of retinoblastoma by Knudson, which suggested that a “two-hit”
model was in effect, and both copies of the RB gene must be lost for retinoblastoma to
occur. A predisposition to retinoblastoma was due to an inherited mutation which
knocked out one copy of the gene, requiring only a somatic mutation in the other copy
to cause cancer rather than two mutations in the same cell (Knudson, 1971). The
germline mutation of RB serves as the first event, and a further somatic mutation allows
the tumour to progress to the second stage.
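As a rough illustration of the age-distribution reasoning behind the multistage estimates above, the sketch below assumes the simple power-law form usually associated with the multistage model, in which incidence at age t is proportional to t^(k-1) for k rate-limiting events, so that the slope of log incidence against log age gives k - 1. The function names, the scale constant and the simulated incidence values are hypothetical and are not taken from the studies cited.

```python
import math


def multistage_incidence(age, stages, scale=1.0):
    """Incidence under an assumed power-law multistage model: scale * age**(stages - 1)."""
    return scale * age ** (stages - 1)


def estimate_stages(ages, incidences):
    """Estimate the number of stages as (least-squares log-log slope) + 1."""
    xs = [math.log(a) for a in ages]
    ys = [math.log(i) for i in incidences]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx + 1.0


# Hypothetical data: incidence simulated for a six-stage process (not real figures).
ages = [30, 40, 50, 60, 70]
simulated = [multistage_incidence(a, stages=6, scale=1e-12) for a in ages]

print(f"estimated number of stages: {estimate_stages(ages, simulated):.2f}")
```

Real incidence curves depart from a pure power law, which is one reason such estimates span a wide range.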
Further evidence for a multi-step model of carcinogenesis comes from studies of
colorectal carcinoma by Fearon and Vogelstein (1990). They used the model system of
colorectal tumours, which were known to progress from benign adenomas to
carcinomas to metastatic disease, to suggest that tumours began with a mutation in a
single cell, which acquired more mutations as the disease progressed, and that while the
genetic changes often occurred in a similar order in different tumours it was the overall
number of mutations which was important for progression. Their early estimate, based
on evidence from colorectal cancer progression as well as mathematical models of cancer
progression, was that three to seven hits were necessary for malignancy. This figure was
based on a few known mutations such as KRAS and TP53, and low-resolution data which
could only identify loss of heterozygosity over whole chromosome arms as a putative
mechanism for deletion of a particular tumour suppressor gene (Vogelstein and Kinzler,
1993).
Finding the mutations important for tumour initiation and progression is made more difficult
by the presence of non-functional “passenger” mutations which occur by chance
and are carried through successive rounds of clonal expansion. This problem is
especially prevalent in genome-wide studies which do not look at specific candidate
genes found by functional studies or linkage analysis. Sjöblom et al. (2006) found that
the number of coding mutations in a series of breast tumours and cell lines is higher
than the background rate of mutation would suggest, and statistical methods are
needed to distinguish important “driver” mutations from passengers by finding genes
which are found mutated in a higher proportion of tumours than would be expected by
chance. Using this method, Sjöblom et al. predict that up to 20 of the somatic mutations
found in breast tumours may be driving mutations, with another 80 being
passengers, a much higher figure than previously reported, although these
calculations rely on an accurate assessment of the background rate of mutation (Forrest
and Cavet, 2007), and may suffer from a high false discovery rate (Getz et al., 2007). A
study of mutations in a greater number of tumour types but looking only at protein
kinase mutations showed a wide variation in the mutation rates between tumour types,
suggesting that a background rate of mutation would be difficult to estimate (Greenman
et al., 2007). Greenman et al. do not predict a number of driver mutations per tumour,
but estimate that 119 of their 518 sequenced genes contain a driving mutation, leading
them to a similar conclusion to Sjöblom et al. - the number of driver mutations is larger
than previously estimated. Mathematical modelling supports the experimental evidence
and suggests that this large number of driving mutations indicates that while there are
certain common pathways which are mutated and give a large selective advantage, such
as TP53 and APC, the majority of driving mutations will confer only a small selective
advantage, and that the stochastic nature of these mutations contributes to the
heterogeneity of cancer (Beerenwinkel et al., 2007).
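The logic of asking whether a gene is mutated in more tumours than would be expected by chance can be sketched with a simple binomial model. This is a simplified illustration and not the statistical method used by Sjöblom et al.; the gene length, background mutation rates and tumour counts are hypothetical values, chosen only to show how strongly the answer depends on the assumed background rate.

```python
from math import comb


def passenger_probability(gene_length_bp, background_rate_per_bp):
    """Chance that a gene carries at least one passenger mutation in one tumour,
    assuming independent mutations at a uniform background rate (an assumption)."""
    return 1.0 - (1.0 - background_rate_per_bp) ** gene_length_bp


def binomial_tail(n_tumours, n_mutated, p):
    """P(X >= n_mutated) for X ~ Binomial(n_tumours, p)."""
    return sum(
        comb(n_tumours, i) * p ** i * (1.0 - p) ** (n_tumours - i)
        for i in range(n_mutated, n_tumours + 1)
    )


# Hypothetical example: a 1,500 bp coding sequence mutated in 4 of 100 tumours.
GENE_LENGTH_BP = 1500
N_TUMOURS = 100
N_MUTATED = 4

for rate in (1e-6, 1e-5):  # two assumed background rates per base pair
    p = passenger_probability(GENE_LENGTH_BP, rate)
    p_value = binomial_tail(N_TUMOURS, N_MUTATED, p)
    print(f"background rate {rate:g}: P(>= {N_MUTATED} mutated tumours by chance) = {p_value:.3g}")
```

With these made-up numbers, the same observation looks far less surprising under the higher assumed background rate, which is why an accurate estimate of the background rate matters so much.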
Recent studies of complete cancer genomes are consistent with the idea that there are
many driving mutations. The earliest complete breast cancer sequence was of a
metastasis and reported 32 non-synonymous mutations in coding sequences, none of
which were in genes reported as candidate cancer genes by Sjöblom et al. (Shah et al.,
2009). Eleven of the 32 mutations could be found in the primary tumour, and 6 of these
mutations were present at low levels in the primary tumour, suggesting heterogeneity
of somatic mutations in the primary tumour. The only complete sequence of both a
breast primary tumour and a metastasis published so far reports 50
mutations in coding regions, with a ratio of synonymous to non-synonymous mutations
similar to that which would be expected by chance, suggesting that the majority of
coding mutations are not strongly selected for and are not driving mutations (Ding et al.,
2010). The complete sequence of a melanoma cell line (Pleasance et al., 2010a) found
292 coding mutations, with a similar lack of selection for non-synonymous mutations,
and over 33,000 mutations in non-coding regions of the genome, and similar figures
were obtained for a small-cell lung cancer with 134 coding mutations and over 22,000
non-coding mutations (Pleasance et al., 2010b).
Although studies have focused on the number of mutations needed to progress to
cancer, genes can be altered by other mechanisms such as copy-number alteration. A
study of copy-number changes focusing only on major copy-number changes (defined as
deletion of all copies, or amplification to >11 copies) showed on average 17 genes
altered by major copy number changes per tumour (Leary et al., 2008). Stephens et al.
(2009) generated the most comprehensive study of somatic rearrangements in 24
breast cancer samples and found large differences in the number and type of
rearrangements present, from a tumour with only a single rearrangement to tumours
with hundreds of tandem duplications present in the genome. The rearrangements
were enriched for those affecting genes, although it is not clear whether this is due to
the rearrangements being selected for, or due to the mechanism of genome
rearrangements favouring coding regions.
These studies are beginning to uncover the number and depth of the changes present in
cancer genomes, but the complete picture is still not clear. The most comprehensive
study of rearrangements in breast cancer yet published estimates that it detected only
50% of the changes present in each sample (Stephens et al., 2009), and studies of
somatic mutation may miss small indels, which were difficult to detect with the
alignment methods used in these studies (Li and Durbin, 2009). To observe the overall
picture of the number of changes needed for progression will require an integrated
analysis combining mutation, copy-number alteration, identification of fusion genes and
epigenetic changes to identify key pathways altered in cancer (Teschendorff and Caldas,
2009).
1.2.2 The role of genomic instability
Cells must accumulate a number of genetic changes in order to progress to cancer; the
question of whether these changes can arise given the normal human mutation rate or
whether there must be an underlying genetic instability to explain the number of
mutations is still subject to debate. It has been suggested that genomic instability may
not be a requirement for tumour development, but a secondary effect of mutations
whose primary effect is to protect against apoptosis (Bodmer, 2008) or carcinogens
(Bardelli et al., 2001). Even if genomic instability is not a requirement, carcinogenesis
may proceed more quickly when the genome is unstable, especially if the number of
changes required for progression is large and the genomic instability arises early
(Beckman and Loeb, 2006).
Tumours demonstrate a number of mechanisms of genomic instability, which affect the
genome at different levels, ranging from single nucleotide changes to rearrangement of
chromosomes and the gain and loss of chromosome arms and whole chromosomes.
Patterns of genomic instability were first seen in colorectal tumours, where a small
proportion of tumours with a near-normal karyotype display microsatellite instability,
due to defects in genes in the mismatch repair pathways. Other colorectal tumours have
an aneuploid karyotype and show loss and gain of whole chromosomes as well as loss of
heterozygosity (Lengauer et al., 1997).
Many breast tumours display genomic instability. Aneuploid karyotypes are found in 47% of
breast tumours (Teixeira et al., 2002), and BRCA1 and BRCA2 act to suppress genomic
instability: BRCA1- and BRCA2-deficient cells exhibit chromosomal instability
(Venkitaraman, 2002).
1.3 Breast cancer
Breast cancer exhibits considerable molecular heterogeneity, with many different genes
associated with the disease and few recurring mutations, and unlike other common
epithelial tumours, no single pathway has emerged as the dominant pathway in breast
cancer tumourigenesis.
1.3.1 Genes commonly altered in breast cancer
Studies of somatic mutations in breast cancer support the model that there are few
commonly mutated genes, and many genes which are mutated much less frequently.
Two genes stand out as often mutated in breast cancer across all subtypes: TP53 and
PIK3CA.
PIK3CA has been reported to be mutated in 8-40% of breast tumours, and its mutation may be a
relatively early event in tumourigenesis (Miron et al., 2010). The mutations cluster
around exon 9 and exon 20, and result in increased kinase activity (Samuels et al., 2005).
Wood et al. (2007), in a screen of coding mutations in 20,000 genes, found mutations in
a number of genes in pathways involved in PIK3CA signalling. These mutations are often
mutually exclusive, suggesting that only one mutation is needed to disrupt the pathway
sufficiently to drive tumourigenesis (Velculescu, 2008).
TP53 is a tumour suppressor gene mutated in 20-40% of breast tumours (Pharoah et al.,
1999). It plays an important role in the cellular response to stress, and acts by inducing
cell cycle arrest and apoptosis. Most cancers have lost TP53 activity by point mutation,
with few deletions or frameshifts (Vousden and Lu, 2002). The mutant forms of TP53 are
often more stable than the wild-type and found at high levels in the cell, and may act as
dominant-negative inhibitors when they form complexes with the wild-type protein.
Tumours with high expression of HER2 and accumulation of TP53 have considerably
decreased overall survival (Yamashita et al., 2004). Similarly to PIK3CA, even in TP53
wild-type tumours, regulators and targets of TP53 are often altered: MDM2, which
targets TP53 for degradation and is inhibited in response to stress, is amplified in up to 6% of
breast tumours (Al-Kuraya et al., 2004). Tumours with mutations in the breast cancer
susceptibility genes BRCA1 and BRCA2 are more likely to have TP53 mutations
(Greenblatt et al., 2001), and show a different spectrum of mutations than in sporadic
cancers, suggesting that the inactivation of particular functions of TP53 may be
important in BRCA-deficient tumours (Venkitaraman, 2002).
1.3.2 Breast cancer susceptibility genes
Known breast cancer susceptibility alleles can be divided into three classes, based on
the penetrance of the alleles (Turnbull and Rahman, 2008). BRCA1 and BRCA2 are high-
penetrance genes, and carriers of a mutant allele have a greater than tenfold increased
risk of breast cancer. BRCA1 and BRCA2 are involved in the DNA damage response, and
many of the disease-associated mutations result in loss of function (Gudmundsdottir
and Ashworth, 2006). Between them these two genes represent 15-20% of the excess
familial risk. TP53 is also mutated in Li-Fraumeni syndrome, which gives a high risk of
developing several types of cancer, but families with Li-Fraumeni syndrome are rare
and account for only a small part of the increased familial breast
cancer risk.
Four alleles are known which give a 2-4-fold relative risk of breast cancer and are classed as
intermediate-penetrance alleles. All four genes (CHEK2 (Meijers-Heijboer et al., 2002),
BRIP1 (Seal et al., 2006), ATM (Renwick et al., 2006) and PALB2 (Rahman et al., 2007))
are involved in the DNA damage response, and have roles in the same pathways as
BRCA1 and BRCA2. Eight low-penetrance variants that give a relative risk of <1.5 are
currently known from genome-wide association studies (Easton et al., 2007; Cox et al.,
2007; Stacey et al., 2007). Most of these variants do not lie within protein-coding genes,
and it is not known how they cause increased breast cancer risk.
1.4 Classification of breast cancer
Breast cancer appears to be a heterogeneous disease which shows wide variation in gene
expression, point mutations and structural variation. Tumours can be classified based on
histopathological grade, immunohistochemical staining, and lately gene expression
profiling, which groups tumours into subtypes based on gene expression levels, and may
distinguish between histologically similar tumours which are molecularly different
(Rouzier et al., 2005). Gene expression profiling suggests that the different subtypes of
breast cancer vary widely, harbouring different gene alterations and responding
differently to therapy, and the different subtypes may even be distinct diseases
(Herschkowitz et al., 2007).
Sørlie et al. (2001) carried out gene expression profiling and used the results to cluster
tumours. This approach grouped tumours into two classes largely based on ER status,
and each class was divided into three subtypes. Of the ER negative tumours, the ERBB2+
subgroup shows high expression of ERBB2 and other genes present in the ERBB2
amplicon, while the basal subgroup expresses basal-type cytokeratins, laminins and
fatty acid binding proteins, and the normal-like subgroup shows high expression of basal
epithelial genes and low expression of luminal epithelial genes. The ER positive/luminal
tumours were split into at least two subgroups: luminal A tumours showed the
highest expression of the ER-related genes, while the luminal B subtype could be further
separated into luminal B and luminal C, with the luminal C group expressing a set of
genes of unknown function but which are also highly expressed in the basal and ERBB2+
subtypes. The basal and ERBB2+ subtypes were also correlated with poor prognosis and
mutations in TP53.
Subsequent studies have replicated some of the initial classification, and further
molecular classifications have been suggested. The divide between ER positive and ER
negative tumours is consistent and the two groups have distinct gene expression
profiles (Gruvberger et al., 2001). The luminal A and luminal B subgroups have been
found to have differences in proliferation, histological grade and prognosis, with luminal
B having a poorer prognosis (Weigelt et al., 2010), but the initial separation of luminal B
into two further subgroups is not always repeated in subsequent studies. This suggests
that the distinction between the different luminal subgroups is less clear than the divide
between other subgroups, and the luminal group represents a continuum of gene
expression which can be arbitrarily divided into different subgroups (Wirapati et al.,
2008). In the ER negative category, the basal-like and ERBB2+ classes are highly
reproducible (Rouzier et al., 2005), but the normal-like category may be an artefact of
high normal tissue contamination of tumours (Parker et al., 2009). Other groupings of
ER negative tumours have been suggested, such as the ‘claudin-low’ subtype, which
shows low expression of the genes involved in cell-cell adhesion (Herschkowitz et al.,
2007), and a subtype showing low genomic instability, discovered using integrated gene
expression and copy number profiling (Chin et al., 2007).
1.4.1 Breast cell lineages
A question underlying the classification of breast cancers is whether the different
subtypes reflect a difference in the cell types which give rise to them, or whether the
subtypes are independent of the cell of origin.
The human mammary epithelium probably contains two general lineages, the luminal
cells and the myoepithelial cells (Stingl and Caldas, 2007). A population of
undifferentiated basally-positioned cells may represent the mammary gland stem cell,
and gives rise to progenitor cells which may be multilineage, or produce luminal or
myoepithelial cells only. These stem and progenitor cells are thought to be important as
the initial cells which give rise to tumours, as any mutation in the progenitor will be
passed to the daughter cells, which will acquire further mutations through subsequent
rounds of cell division (Cairns, 2002).
Some evidence for tumours arising from different progenitor cells has been found. Cell
lines with a luminal gene expression pattern show no cells with basal characteristics,
suggesting that a luminal progenitor cell gave rise to the tumour (Stingl and Caldas,
2007). Different precursor cell types transformed using the same combination of oncogenes
gave rise to tumours with different phenotypes, suggesting the cell type of the precursor influences
the type of tumour produced (Ince et al., 2007).
Although the exact number of breast cancer subtypes varies between the approaches, it
is clear that a number of different subtypes exist, with different gene expression and
patterns of chromosomal rearrangement, and few genetic changes have been found
which are common to all subtypes. Whether this is due to a difference in the cell of
origin, or whether the tumours originate from the same cell type but follow a different
mutational path is not yet clear, but there is some evidence to suggest that breast
cancer is not one disease but a set of heterogeneous diseases which arise from the same
tissue, and further research into the development of the mammary gland may help to
determine which of the two possibilities is correct.
1.5 Cytogenetics of breast cancer
Classical cytogenetic analysis of breast cancer is difficult due to the technical difficulties
of obtaining good karyotypes. The results are often dependent on the culture
techniques used (Teixeira et al., 2002), and may be biased towards those malignant tumours
that divide better in culture. The karyotypes produced are often complex and difficult to
interpret. A further difficulty arises from the heterogeneity of individual tumours, as
analysis of a single sample may not give an accurate picture of the tumour karyotype.
The studies of the cytogenetics of breast tumours which have attempted to overcome
these technical difficulties show very heterogeneous karyotypes within breast cancers,
ranging from near-diploid with few chromosome alterations, to tumours with complex
highly-rearranged karyotypes (Teixeira et al., 2002). Among the near-diploid tumours,
certain rearrangements were often present as the sole chromosome aberration, such as
deletion of 3p13-14, which may delete the candidate tumour suppressor gene FHIT, and
a der(1;16)(p10;q10) rearranged chromosome.
While the karyotype of an individual tumour provides only information on the state of
the karyotype at that time and not the evolutionary history of the tumour, studies of
large numbers of tumours and the different clones within a tumour provide an overall
picture of the karyotypic evolution of breast cancer. By looking at the number of
chromosomes and the number of rearrangements across a series of breast cancers,
Dutrillaux et al. (1991) suggest that chromosomes are lost early on, due to whole
chromosome loss and unbalanced rearrangements, followed by endoreduplication of
the whole genome and further chromosome loss and rearrangement. The presence of
hyperploid sidelines in near-diploid tumours supports this pathway, as does the trend
towards increasing number of rearrangements as chromosome number decreases,
confirmed by Teixeira et al. (2002). Although many tumours follow this pathway, it is not
the only way for a breast cancer karyotype to evolve, as seen by the presence of near-
diploid tumours which have not followed the pathway of chromosome loss and
endoreduplication.
Higher-resolution array CGH studies have validated the results from earlier cytogenetic
studies. Regions of the genome which are commonly gained, lost, and amplified in
breast cancer can be defined at higher resolution than chromosome banding provides,
and a number of recurrent amplicons have been found, including regions on 8p12,
11p13, and 17q21, as well as regions of frequent low copy number gain or loss
(Fridlyand et al., 2006).
A set of copy number subtypes have been defined according to patterns and frequency
of copy number change, although different studies have produced slightly different
subtypes. One subtype includes tumours with the simplest karyotype, gain of 1q and loss
of 16q, which occurs in ER positive tumours with low histological grade and was seen
by Fridlyand et al. (2006). A second subtype includes tumours with low-level gains and losses with
occasional peaks of amplification, and was seen by both Fridlyand et al. (2006) and Hicks
et al. (2006), who found that 60% of tumours fall into this “simplex” subtype. Tumours with
complex karyotypes with frequent gains and losses, in which few regions of the genome
are present at normal copy number, were found in both studies, and were termed
“sawtooth” tumours by Hicks et al. (2006). The tumours tend to be ER negative and
have a significantly worse outcome than tumours in the other subtypes (Fridlyand et al.,
2006). Hicks et al. also identify a fourth group of tumours displaying a “firestorm”
pattern of amplification, with clustered, narrow peaks of high-level amplification in a
relatively simple karyotype. 11q and 17q are among the regions most often found in
“firestorm” amplifications.
Much research has focused on finding the genes which drive amplifications in breast
cancer. A well-studied example is the amplification of 17q12, which leads to
overexpression of the ERBB2 gene, which is correlated with poor prognosis, and has
been successfully targeted for chemotherapy (Järvinen and Liu, 2003). The targets of
other amplicons are less clear.
Amplification of 17q23 is found in up to 20% of breast tumours (Bärlund et al., 2000)
and is associated with poor prognosis. While 17q23 is gained in a number of different
tumour types, high-level amplification is only seen in breast tumours (Andersen et al.,
2002). The consensus region of amplification covers over 5Mb and a number of genes
have been implicated as the drivers of amplification, including APPBP2, RAD51C,
THRAP1, and PPM1D (Bärlund et al., 2000; Monni et al., 2001; Lambros et al., 2010),
mainly by correlation of mRNA overexpression with copy number gain. It is possible that
no single gene is the driver of amplification, but rather that a number of genes at the
core of the amplicon are the target of amplification (Parssinen et al., 2007), and that
high-level amplification is required for significant overexpression. An alternate
hypothesis looks for the minimal region of amplification on high resolution arrays and
suggests a minimal region of 250Kb centred around the microRNA mir-21 (Haverty et al.,
2008).
8p11-12 is another region of common amplification in breast cancer, found in 10-25% of
tumours (Garcia et al., 2005). Early studies suggested FGFR1 as a candidate to be the
driving oncogene in this region (Ugolini et al., 1999), but while FGFR1 is overexpressed
at the mRNA level when amplified, additional FGFR1 protein is not seen in cell lines with
amplification, and inhibition of FGFR1 does not slow the growth of cell lines (Ray et al.,
2004). Refinement of the minimal region of amplification using higher resolution array
CGH suggests that FGFR1 is outside the minimal region, and suggests a 1.5Mb region of
minimal amplification with ZNF703, ERLIN2, BRF2 and RAB11FIP1 as candidate driving
oncogenes (Garcia et al., 2005). Other studies have suggested that rather than one
simple amplicon, the 8p11-12 region contains four separate regions of amplification,
one of which overlaps with the 1.5Mb region of Garcia et al. (Gelsi-Boyer et al., 2005),
which is contradicted by the findings of Haverty et al. (2008) who used high-resolution
Affymetrix arrays to narrow the region of minimal amplification to 400Kb containing,
among other genes, BRF2, RAB11FIP1, and ZNF703.
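The notion of a minimal region of amplification used in the studies above can be illustrated with a short sketch. It assumes that each tumour's amplicon has already been reduced to a single start and end coordinate (in Mb) and simply takes the overlap shared by all of them. The function name and the coordinates are hypothetical and are not taken from any of the cited studies, whose actual methods differ in detail.

```python
def minimal_common_region(amplicons):
    """Return the (start, end) interval shared by every amplicon, or None.

    amplicons: list of (start, end) tuples on one chromosome arm, in Mb.
    """
    if not amplicons:
        return None
    start = max(s for s, _ in amplicons)  # latest start among the samples
    end = min(e for _, e in amplicons)    # earliest end among the samples
    return (start, end) if start < end else None


# Hypothetical amplified intervals from three tumours (illustrative values only).
example = [(36.8, 38.9), (37.0, 39.5), (37.5, 40.0)]
region = minimal_common_region(example)
```

Under this simple definition, adding samples can only shrink the shared region, which is one reason why larger series and higher-resolution arrays tend to report narrower minimal regions. It also means that a single tumour with poorly placed boundaries could exclude a genuine driver gene.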
11q13 is found amplified in around 19% of breast cancers, but it is rarely found
amplified alone, and is often found co-amplified with other commonly amplified regions
such as 8p12 and 17q12 (Letessier et al., 2006). CCND1 is the best supported candidate
driving oncogene, as it is within the most frequently amplified region, and CCND1
overexpression promotes mammary tumours in mice (Wang et al., 1994).
Co-amplification of different regions in the same tumour suggests that genes in different
amplicons may collaborate to drive tumourigenesis. FGFR1/CCND1 co-amplification
results in poorer prognosis than when the genes are amplified separately, as does
ERBB2/MYC co-amplification (Cuny et al., 2000). The amplified regions are often
physically associated and arranged in complex structures (Paterson et al., 2007).
Although no correlation between genes amplified on 8p and 11q has been found,
expression of CCND1 on 11q13 may induce expression of ZNF703 on 8p12 (Kwek et al.,
2009). Another hypothesis is that a translocation between chromosomes 8 and 11 is an
early event which is then amplified, and that a fusion gene at the translocation junction
may be the driving event (Paterson et al., 2007), but as yet this hypothesized fusion
gene has not been found, and the amplicons are often physically separated.
1.5.1 Mechanisms of chromosome amplification
A number of mechanisms have been proposed to explain how chromosome
amplification occurs. If the amplification is at a distant site to the original gene, the
proposed mechanism involves duplication of the gene and excision of the duplicated
copy, which replicates extrachromosomally and reintegrates into the DNA at a different
site (Schwab, 1999). A common example of this method of replication is the MYCN locus
in neuroblastoma, where double minute chromosomes containing multiple copies of the
MYCN locus can be seen in tumours. The amplified copies of MYCN in cell lines are more
often found as homogeneously staining regions, which are more common in cell lines
than tumours (Benner et al., 1991), but the site of integration of the MYCN amplification
is never at the locus on chromosome 2 where MYCN is normally found (Schwab et al.,
2003; Storlazzi et al., 2010).
For amplifications where the extra material resides where the single-copy gene would
normally be found, different mechanisms of amplification have been proposed, of which
the breakage-fusion-bridge cycle is the best known (Schwab, 1999). The initiating event
is a double chromatid break which is repaired to form a fusion (Figure 1.1). During
anaphase this forms a bridge between sister chromatids, which must be broken to allow
cell division to continue. If the break is not in the same place as the original break and
fusion occurred, the daughter products will have either a duplication or a deletion.
Further rounds of this breakage-fusion-bridge cycle will result in further amplification of
material. The signature of this mechanism is that the amplified genes are inverted
(Schwab, 1999).
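The breakage-fusion-bridge cycle described above can be sketched as a toy model. This is a deliberate simplification: a chromatid is assumed to be a list of named segments running from the centromere to the broken end, each with an orientation of +1 or -1, and the position of the anaphase break is supplied by hand. The function names and the segment labels A, B and C are hypothetical.

```python
def invert(segments):
    """Reverse the order of the segments and flip each orientation."""
    return [(name, -strand) for name, strand in reversed(segments)]


def bfb_cycle(chromatid, break_index):
    """Model one breakage-fusion-bridge cycle.

    chromatid: list of (name, strand) from the centromere to the broken end.
    break_index: position in the fused dicentric where the bridge breaks.
    Returns the two daughter chromatids, each read from its own centromere.
    """
    # Sister chromatids fuse at their broken ends, giving a dicentric
    # chromosome whose second half is an inverted copy of the first.
    dicentric = chromatid + invert(chromatid)
    if not 0 < break_index < len(dicentric):
        raise ValueError("break must fall between the two centromeres")
    daughter_a = dicentric[:break_index]
    daughter_b = invert(dicentric[break_index:])
    return daughter_a, daughter_b


def copies(chromatid, name):
    """Count the copies of one segment on a chromatid."""
    return sum(1 for segment, _ in chromatid if segment == name)


# Telomere already lost distal to C; the bridge breaks one segment past the fusion.
start = [("A", 1), ("B", 1), ("C", 1)]
gained, lost = bfb_cycle(start, 4)
# A second cycle on the daughter carrying the duplication.
gained_again, _ = bfb_cycle(gained, 6)
```

In this model the first cycle leaves one daughter with two copies of C in opposite orientations and the other with C deleted, and a second cycle on the first daughter gives four copies of C with alternating orientation, matching the inverted arrangement expected of this mechanism.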
Figure 1.1. Breakage-fusion-bridge cycles as a mechanism of oncogene amplification. A
break occurs in both chromatids, which fuse and resolve unequally to give two possible
daughter products, one with a deletion and one with a duplication. The red gene will be
duplicated and one of the copies will be inverted.
Another mechanism for chromosome rearrangement is the replication fork stalling and
template switching (FoSTeS) model proposed by Lee et al. (2007). This occurs when the
lagging single strand of DNA during replication forms a secondary structure, blocking the
progress of the replication fork, which then switches to another template with
microhomology. Depending on the position of the other replication fork, this can cause
a duplication, but does not explain high-level amplifications, unlike the breakage-fusion-
bridge cycle which will continue until the chromosome acquires a telomere (Hastings et
al., 2009).
1.6 Chromosome translocations
Chromosome aberrations have been seen in cancer for many years. Abnormal
segregation of chromosomes was seen by Boveri in the early 1900s and proposed
to be a cause of malignancy, and in the 1950s it was shown that nearly all tumour cell
lines had chromosome aberrations (Rowley, 2001), but these chromosome abnormalities
were assumed to be a result of chromosome instability as the events did not appear to
recur in different tumours from the same origin, and not an important event in their
own right.
The first recurrent chromosome abnormality in a human cancer was found in 1960, with
the discovery of the Philadelphia chromosome in chronic myeloid leukaemia. The nature
of the translocation was not discovered until chromosome banding techniques showed
it was a reciprocal translocation between chromosome 9 and chromosome 22, which produced
a fusion of the BCR locus on chromosome 22 to the ABL tyrosine kinase on chromosome
9 (Shtivelman et al., 1985). This creates a fusion of the two genes which leads to mRNA
and protein containing domains from both BCR and ABL, under the control of the BCR
promoter. The fused protein product was shown to have tyrosine kinase activity and to
induce leukaemia when expressed in mouse bone-marrow cells (Daley et al., 1990).
Imatinib, a specific inhibitor of the BCR-ABL fusion, is used to treat leukaemia patients
(Deininger et al., 2005). The BCR-ABL fusion is also found in acute lymphoblastic
leukaemia, with a different breakpoint which includes less of the BCR protein in the
fused product (Hermans et al., 1987), with even greater tyrosine kinase activity than in
CML. The success of the BCR-ABL fusion as an effective therapeutic target linked to a
specific cancer led to a search for other chromosomal aberrations which could produce
similar specific therapeutic targets.
Although the BCR-ABL fusion was the first to be seen cytogenetically, the genes involved
in another recurrent translocation in cancer were discovered first. The karyotypes of
cells taken from Burkitt's lymphoma showed an extra band on chromosome 14, while one
was missing from the end of chromosome 8 (Zech et al., 1976). When the oncogene
MYC was located to chromosome 8, it was shown that the coding region of MYC was
juxtaposed with the promoter region of the immunoglobulin heavy chain (IGH) (Dalla-
Favera et al., 1982; Taub et al., 1982). This does not create a direct fusion of the two
genes as is the case for BCR-ABL, but changes the expression levels and pattern of MYC,
which leads to tumorigenesis (ar-Rushdi et al., 1983). Translocations causing fusions
between oncogenes and members of the immunoglobulin family are a hallmark of B-cell
lymphomas, and have been discovered in mantle cell lymphoma (CCND1-IGH) and
follicular lymphoma (BCL2-IGH) at high frequency, and at lower frequencies in other
lymphomas (Kuppers, 2005).
Subsequently, hundreds of other recurrent (present in 1% or more of cases) gene fusions
of both types have been discovered in common haematological malignancies (Mitelman
et al., 2007), although fusions which produce a fusion protein are more common than
promoter insertions (Rowley, 2001).
Until recently, fusion genes caused by recurrent chromosome aberration were thought
to be a feature of haematological and soft tissue cancers, but not of solid tumours,
where other mechanisms such as deletion and point mutation were thought to be more
important, and recurrent chromosomal aberrations were rarely seen. This may be due
to tissue-specific mechanisms causing chromosome rearrangements, such as
recombination in haematopoietic progenitor cells which then give rise to leukaemias
(Albertson et al., 2003), but there is some evidence that this stems from the difficulties
of performing cytogenetic analysis on epithelial tumours rather than from a lack of
recurrent gene fusions. Additionally, the prevalence of recurrent rearrangements in
haematological malignancies may have been overestimated due to selection bias for
patients reported in the literature due to a cytogenetic abnormality (Mitelman et al.,
2005). The actual proportion of malignancies with recurrent rearrangements may be as
high in epithelial tumours as in haematological malignancies, but may be made up of a large
number of rare rearrangements rather than the common rearrangements seen in
leukaemias (Mitelman et al., 2004).
There are several technical difficulties which make it more difficult to find fusion genes
in solid tumours. Karyotyping solid tumours is more difficult due to poor chromosome
morphology, and the karyotypes are often so complex they cannot be characterized
completely (Mitelman et al., 2004). Obtaining metaphases is difficult as the carcinoma
cells may not divide, and contaminating normal cells or minor clones may grow better
than the dominant clone (Persson et al., 1999). Further evidence that the lack of
reported fusion genes was a technical artefact and not a difference between the two
types of tumour is that the fusion genes which were reported in rare epithelial cancers
often included the same genes involved in haematological fusions. The ETV6-NTRK3
fusion found in secretory breast cancer (Tognon et al., 2002) is also seen in congenital
fibrosarcoma (Knezevich et al., 1998) and acute myeloid leukaemia (Eguchi et al., 1999).
Some fusion genes were known in rare epithelial cancers. Fusions of RET are found in
papillary thyroid cancer, with the most common fusion being caused by inversions of
chromosome 10 which fuse the 3’ end of RET to the 5’ portion of H4, although there are a
number of other 5’ partners. Fusions of NTRK1 to a number of partners (Alberti et al.,
2003) and an AKAP9-BRAF fusion produced by an intrachromosomal inversion have also
been reported (Ciampi et al., 2005). The AKAP9-BRAF fusion was found more often in
radiation-induced cancers while sporadic thyroid carcinoma often carried a BRAF point
mutation, and RET fusions are also more common in radiation-induced cancers,
suggesting the mechanisms of gene activation are linked to environmental factors. A
fusion of BRAF and KIAA1549, which shows constitutive kinase activity, has also been
found in 66% of pilocytic astrocytomas, a common paediatric brain tumour (Jones et al.,
2008).
The first recurrent fusion to be discovered in a common epithelial cancer was the fusion
of TMPRSS2 to members of the ETS transcription factor family in prostate cancer
(Tomlins et al., 2005). Previously, fusion genes had been found using cytogenetics or by
transfection assays, but this fusion was discovered using a bioinformatic approach,
working on the assumption that a gene fusion should result in overexpression of an
oncogene in a subset of cases, and that the two genes involved in the fusion would both
be overexpressed. Applying this Cancer Outlier Profile Analysis to prostate cancer gave
two strong candidates, ERG and ETV1, and overexpression of these two genes was
mutually exclusive, suggesting they played a similar role in prostate cancer
development. 5' RACE on ERG and ETV1 transcripts showed fusions to the prostate-
specific androgen-sensitive gene TMPRSS2. TMPRSS2-ETV4 fusions have also been found
(Tomlins et al., 2006). Fusions of TMPRSS2 and members of the ETS family have been
found in up to 80% of prostate cancers (Tomlins et al., 2007). Although TMPRSS2 fusions
appeared to be driving the majority of ERG overexpression, many prostate cancers had
ETV1 overexpression without a fusion to TMPRSS2, and ETV1 was found fused to a
number of different 5’ partners, including the housekeeping gene HNRPA2B1, the
androgen-induced gene SLC45A3, and the androgen-repressed gene c15orf21 (Tomlins
et al., 2007). These 5’ partners do not contribute coding sequence to the fusion, but
place ETV1 under the control of promoter and enhancer elements of distant genes.
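The outlier-based reasoning behind this approach can be sketched in a few lines. The sketch assumes the commonly described form of the method, in which each gene's expression values are centred on the median, scaled by the median absolute deviation, and scored by an upper percentile so that genes strongly overexpressed in only a subset of samples rank highly. The percentile used, the function names and the expression values are illustrative assumptions and are not details taken from the original study.

```python
import math
from statistics import median


def copa_transform(values):
    """Centre on the median and scale by the median absolute deviation."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        raise ValueError("median absolute deviation is zero")
    return [(v - med) / mad for v in values]


def percentile(values, q):
    """Nearest-rank percentile (q between 0 and 100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def copa_score(values, q=90):
    """Score a gene by an upper percentile of its transformed values."""
    return percentile(copa_transform(values), q)


# Hypothetical expression values for two genes across ten samples.
outlier_gene = [1.0, 1.1, 0.9, 1.2, 0.8, 1.0, 1.1, 0.9, 8.0, 9.0]
flat_gene = [1.0, 1.1, 0.9, 1.2, 0.8, 1.0, 1.1, 0.9, 1.3, 0.7]
ranked = sorted(
    {"outlier_gene": outlier_gene, "flat_gene": flat_gene}.items(),
    key=lambda item: copa_score(item[1]),
    reverse=True,
)
```

The median and median absolute deviation are used in place of the mean and standard deviation so that the outlying samples themselves do not inflate the scale against which they are measured.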
Prostate cancer fusions demonstrate another reason why fusion genes may have been
difficult to find in common epithelial cancers. The BCR-ABL fusion is atypical in being
present in the majority of cases of CML (Mitelman et al., 2005), and fusions in other
cancers are often found at lower frequency. Additionally, fusion genes which involve the
same gene fused to a range of partners are well known, such as the fusions of EWS to
multiple members of the ETS family in Ewing’s sarcoma (Arvand and Denny, 2001), but
the ETV1 fusions in prostate involve a number of 5’ partners from different gene
families, including both androgen-induced and androgen-repressed genes, prostate-
specific partners and ubiquitously expressed genes, and looking for fusions which
involve such a wide range of partners with no commonality could be more difficult than
looking for fusions which involve genes from the same family or pathway.
A fusion between EML4 and ALK was subsequently discovered in around 10% of non-
small cell lung cancer by searching a retroviral cDNA library for inserts which would
transform mouse fibroblast cells (Soda et al., 2007). ALK was already known to form
fusions with NPM in anaplastic lymphoma (Morris et al., 1994), and the kinase domain is
retained in both fusions. A KIF5B-ALK fusion has also been found in NSCLC and shows
transforming potential (Takeuchi et al., 2009). The EML4-ALK fusion was also found by a
study of phosphorylation of tyrosine kinases in NSCLC, which also found a fusion of
SLC34A2 to ROS (Rikova et al., 2007).
Gene fusions have been discovered in breast cancer cell lines and tumours but so far
none have been shown to be recurrent. The cell line MDA-MB-175 has a fusion of ODZ4
to NRG1 (Liu et al., 1999), and FHIT has a fusion to MACROD2 in BrCa-MZ-02 (Popovici et
al., 2002), although this is associated with a lack of FHIT protein rather than a fusion
product. Howarth et al. (2008) found two fusion genes in a study of three cell lines,
TAX1BP1-AHCY, and RIF1-PKD1L1. Recent high-throughput sequencing studies have
found a number of other fusions – Hampton et al. (2009) found four expressed fusion
genes in MCF7, including the previously discovered BCAS4-BCAS3 fusion (Bärlund et al.,
2002), and suggest that the fusions may be suppressing wild-type expression of the
genes by dominant-negative effects. Stephens et al. (2009) found 21 expressed fusion
genes in a study of 24 breast cancer cell lines and tumours, none of which were
recurrent.
Although the prevailing view has previously been that fusion genes in epithelial cancers
are rare and unimportant, this is being increasingly challenged by the number of fusion
genes which are being identified in common tumours. It is likely that the fusion genes in
epithelial cancers are not like the common, recurrent fusions found in leukaemia, but
will involve individually rarer fusions which act on genes in the same pathway, or fusions
where the fusion partners differ but all have the same effect on the important gene in
the fusion, and to find these rarer fusions will require large-scale studies of cancer
genetics which can only be achieved through high-resolution microarray and sequence
analysis.
1.7 Research techniques
1.7.1 Karyotyping
Early karyotype analysis was hampered by an inability to accurately distinguish different
chromosomes. The discovery that treatment with trypsin and staining with Giemsa
produced a banding pattern which allowed all human chromosomes to be identified
enabled the detection of structural aberrations such as translocations, deletions, and
inversions. However, even high-resolution G-banding could only provide a resolution of
~3Mb at best, with ~6Mb being more usual, and any aberrations which did not have a
clear banding pattern were impossible to identify (Smeets, 2004).
1.7.2 Fluorescence in-situ hybridisation
Fluorescence in-situ hybridisation (FISH) is a method of visualising the location of DNA
using a fluorescently-labelled probe which binds to the target DNA. The probe consists
of genomic DNA which hybridizes to the target region of the genome. The DNA is either
directly labelled with a fluorophore, or a reporter molecule such as biotin is
incorporated into the DNA and fluorescently-labelled antibodies are used to visualize it.
It was developed as an alternative to the visualization of nucleic acids by radiolabelled
probes, as fluorescent labelling offers better resolution and is safer to use, and
multiple fluorophores can be used to visualize more than one sequence at a time
(Levsky and Singer, 2003). FISH was first performed in 1982 (Van Prooijen-Knegt et al.,
1982), with a probe hybridized to metaphase chromosomes.
Metaphase FISH has a resolution of around 3Mb (Raap, 1998). An advantage of FISH
over traditional cytogenetics is that it can be performed on interphase nuclei, with a
resolution of up to 100kb, and fibre-FISH using DNA fibres attached to a slide can
resolve probes down to 1kb apart (Ersfeld, 2004). This allows FISH to be used to resolve
even small-scale genomic rearrangements.
1.7.3 Chromosome painting and spectral karyotyping
The development of chromosome flow sorting allowed whole human chromosomes to
be amplified, labelled and used as FISH probes. Spectral karyotyping (Schröck et al.,
1996) is a technique which uses combinations of different fluorophores to label all 24
human chromosomes. The combination of fluorophores present at each pixel of the
image is measured using an interferometer and used to classify each chromosome. SKY
and the similar M-FISH technique can easily identify the chromosomes involved in
translocations, including small pieces of chromosomes and homogeneously staining
regions which cannot be identified by G-banding. However, the resolution of SKY is
around 10Mb and translocations smaller than this cannot be identified, and small
chromosome pieces can be misidentified due to overlap of fluorescence. As SKY is based
on chromosome painting, internal deletions, duplications, and inversions cannot be
detected.
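The combinatorial labelling idea behind SKY and M-FISH can be illustrated with a short sketch. It is not a description of any commercial probe set: the fluorophore names A to E and the particular assignment of combinations to chromosomes are placeholders chosen for the example. It shows only that n fluorophores give 2^n - 1 distinguishable non-empty combinations, so at least five are needed to give each of the 24 chromosomes a unique label.

```python
from itertools import combinations

# The 24 distinct human chromosomes: 22 autosomes plus X and Y.
CHROMOSOMES = [str(i) for i in range(1, 23)] + ["X", "Y"]


def min_fluorophores(n_targets):
    """Smallest number of fluorophores whose non-empty combinations
    (2**n - 1 of them) can label n_targets chromosomes uniquely."""
    n = 1
    while 2 ** n - 1 < n_targets:
        n += 1
    return n


def assign_labels(chromosomes, fluorophores):
    """Give each chromosome a unique combination of fluorophores.

    The order of assignment is arbitrary (an assumption for illustration).
    """
    combos = [
        combo
        for size in range(1, len(fluorophores) + 1)
        for combo in combinations(fluorophores, size)
    ]
    if len(combos) < len(chromosomes):
        raise ValueError("not enough fluorophore combinations")
    return dict(zip(chromosomes, combos))


labels = assign_labels(CHROMOSOMES, "ABCDE")  # hypothetical fluorophores A-E
```

Classifying a pixel then amounts to looking up which combination of fluorophores is present, which is why overlapping signals from adjacent chromosome pieces can lead to misidentification.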
1.7.4 Comparative genomic hybridisation
Comparative genomic hybridization (Kallioniemi et al., 1992) is another technique for
using fluorescently-labelled DNA to determine tumour karyotypes. Tumour and normal
reference DNA are labelled with two different fluorophores, and hybridized together to
a normal metaphase spread. The amount of labelled DNA which binds to each locus is
proportional to the abundance of the locus in the two samples, and deletions and
amplifications change the ratio of the two signals at each locus. By analysing the ratio of
the two signals at each locus, the patterns of gain and loss along each chromosome can
be plotted.
1.7.5 Array comparative genomic hybridisation
Array comparative genomic hybridisation improves the resolution of CGH by hybridizing
the labelled reference and tumour DNA to a microarray of DNA probes and measuring
the ratio for each probe in one experiment. Initial experiments used BAC clones (Pinkel
et al., 1998), but later arrays have used smaller inserts from cosmids and fosmids, and
modern array CGH uses short oligonucleotides, with the limit of resolution being
determined by the spacing of the oligonucleotides on the array. Oligonucleotides can
also be designed to avoid repeats, reducing the noise caused by hybridisation to
repetitive regions (Beaudet and Belmont, 2008). The sensitivity of oligonucleotide
hybridisation also allows them to be used for large-scale SNP calling. Oligonucleotides
are designed specifically to hybridise to the different SNP alleles (Kennedy et al., 2003),
and can be used to find areas of uniparental disomy and loss of heterozygosity. Bignell
et al. (2004) demonstrated the use of arrays originally designed to detect SNPs to detect
genotype and copy number at once. High-density commercial arrays such as the
Affymetrix SNP6.0 array include up to 2 million probes for simultaneous genotyping and
detection of copy number aberrations at high resolution.
1.7.6 Sequencing
A limitation of array-based techniques for mapping chromosome rearrangements is that
the sequences which are juxtaposed at the breakpoints are not known. Sequencing-
based approaches overcome this limitation by using a paired-end approach to sequence
both sides of a breakpoint.
End-sequence profiling was used to map chromosome rearrangements by creating BAC
libraries from a genome and sequencing from the ends of the BACs to identify
rearrangements, as the end sequences of a BAC containing a chromosome
rearrangement will align to the genome in the wrong position or orientation (Volik et al.,
2003). Copy number can also be determined from the density of the end-sequences
across the genome, although the resolution is determined by the size of the BAC
fragments.
High-throughput paired-end sequencing uses the same principle as end-sequence
profiling but the sequenced DNA fragments are much smaller and give a
correspondingly higher resolution than end-sequence profiling (Campbell et al., 2008).
At high levels of coverage, sequencing can also be used to identify point mutations and
small insertions and deletions in the genome (Pleasance et al., 2010a; Pleasance et al.,
2010b).
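The principle shared by end-sequence profiling and paired-end sequencing, that the two ends of a fragment spanning a rearrangement align in the wrong position or orientation, can be sketched as follows. This is a minimal illustration rather than the analysis pipeline used in this work: the `ReadPair` record, the 500bp maximum insert size, and the assumption that a normal pair aligns with the left end on the + strand and the right end on the - strand are all choices made for the example.

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    chrom1: str
    pos1: int
    strand1: str  # '+' or '-'
    chrom2: str
    pos2: int
    strand2: str


def classify_pair(pair, max_insert=500):
    """Label a read pair as concordant or as evidence of a rearrangement."""
    if pair.chrom1 != pair.chrom2:
        return "interchromosomal"
    # Order the two ends by genomic position.
    (left_pos, left_strand), (right_pos, right_strand) = sorted(
        [(pair.pos1, pair.strand1), (pair.pos2, pair.strand2)]
    )
    if left_strand == right_strand:
        return "inversion"  # both ends on the same strand
    if left_strand == "-" and right_strand == "+":
        return "everted"  # ends point away from each other, e.g. tandem duplication
    if right_pos - left_pos > max_insert:
        return "deletion"  # ends further apart than the library allows
    return "concordant"
```

In practice several independent pairs supporting the same junction would be required before calling a rearrangement, as single discordant pairs can arise from library or alignment artefacts.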
Transcriptome sequencing can be used to find the consequences of chromosome
rearrangement such as fusion genes or internal rearrangements by finding transcripts
which align to two different genes, and may be produced by a genomic rearrangement.
Transcriptome sequencing may find fusion genes which would not be detected by
genome sequencing, as they are produced by read-through transcripts produced from
neighbouring genes, such as the SLC45A3-ELK4 fusion found in prostate cancer which
has no detectable DNA rearrangement (Maher et al., 2009).
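As a rough sketch of how transcripts aligning to two different genes might be collected, the function below counts read pairs whose two ends have been assigned to different genes. The input format (one tuple of gene names per read pair, with `None` for an unassigned end) and the threshold of two supporting pairs are assumptions for illustration, and the gene names in the example input are toy data, not results.

```python
from collections import Counter


def fusion_candidates(pair_genes, min_pairs=2):
    """Return gene pairs supported by at least min_pairs chimeric read pairs.

    pair_genes: iterable of (gene_of_read1, gene_of_read2) tuples.
    """
    counts = Counter(
        tuple(sorted((gene1, gene2)))  # ignore which end hit which gene
        for gene1, gene2 in pair_genes
        if gene1 is not None and gene2 is not None and gene1 != gene2
    )
    return {genes: n for genes, n in counts.items() if n >= min_pairs}


# Toy input: two pairs link BCAS3 and BCAS4, one links SLC45A3 and ELK4.
toy_pairs = [
    ("BCAS4", "BCAS3"),
    ("BCAS3", "BCAS4"),
    ("GAPDH", "GAPDH"),
    ("SLC45A3", "ELK4"),
    (None, "BCAS3"),
]
candidates = fusion_candidates(toy_pairs)
```

A count of this kind cannot by itself distinguish a fusion caused by a genomic rearrangement from a read-through transcript between neighbouring genes, so candidates would still need to be checked against the genomic data.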
1.7.7 Cell lines
Cell lines derived from tumours are commonly used in the laboratory to overcome the
difficulties with the use of primary tumours. Cell lines offer unlimited material for study,
are free of stromal contamination, and can be replaced from fresh stocks if they become
contaminated (Burdall et al., 2003). Common concerns about the use of cell lines
include problems of genetic drift, the use of ‘false’ cell lines contaminated with other
cell lines (MacLeod et al., 1999), and cell lines which are not from the supposed tissue of
origin, such as MDA-MB-435, commonly thought to be a breast cancer cell line which is
in fact derived from the M14 melanoma cell line (Rae et al., 2006). Furthermore, as
breast cell lines are often derived from post-treatment metastases or pleural effusions,
not primary breast tumours, they may not be representative of the disease as a whole
but model primarily the later-stage aggressive disease.
A study of the HCC series of breast cancer cell lines showed excellent concordance
between primary tumours and the cell lines established from them, including cell
morphology, ploidy, expression of ER and PR, and loss of heterozygosity (Wistuba et al.,
1998). There was also no correlation between the length of time in culture of the cell
lines and the concordance with the primary tumour, and studies of colorectal and
ovarian cancer cell lines have found that a stable karyotype is maintained over many
generations (Roschke et al., 2002).
Comparisons between CGH on cancer cell lines and primary tumours showed that the
chromosome gains and losses found in the cell lines are a good model for those found in
real tumours (Greshock et al., 2007). Some specific rearrangements are more often
found in cell lines, such as loss of chromosome 18 (Neve et al., 2006) and amplification
of the MYC locus (Greshock et al., 2007), and the subset of breast cancer with simple
1q/16 rearrangements is under-represented in cell lines. In general, breast cancer cell
lines recapitulate events commonly found in primary tumours, such as patterns of high-
level amplification (Neve et al., 2006), and the pattern of chromosome loss followed by
endoreduplication known to occur in many breast tumours (Dutrillaux et al., 1991) is
recapitulated in breast cancer cell lines (Morris et al., 1997).
1.8 Hypothesis
The importance of fusion genes in leukaemias and lymphomas has been known for
many years, but the importance and the prevalence of fusion genes in solid tumours has
been underestimated due to the difficulty of finding them. Using breast cancer cell lines
as a model, the aim of this project was to investigate chromosomal rearrangements and
find any fusion genes which may occur, and to investigate the recurrence and
importance of any fusion genes in other cell lines and tumours. This involved mapping
all the chromosomal rearrangements in breast cancer cell lines using high-resolution
techniques, which can be validated against our existing knowledge of the
rearrangements, and using the resulting karyotypes to provide insight into the
cytogenetics of breast cancer.
Chapter 2
Materials and methods
2.1 Cell culture
2.1.1 Sources
The origin of the cell lines used is given in Table 2.1.
Cell line       Supplier    Reference
MDA-MB-134      O’Hare      Cailleau et al., 1978
HCC1143         ATCC        Gazdar et al., 1998
HCC1806         ATCC        Gazdar et al., 1998
HCC2218         ATCC        Gazdar et al., 1998
VP229/VP267     McCallum    McCallum and Lowther, 1996
ZR-75-30        O’Hare      Engel et al., 1978
HB4a            O’Hare      Stamps et al., 1994
Table 2.1. Origin of cell lines. O’Hare: cell lines were a kind gift from Professor MJ O’Hare,
(LICR/UCL Breast Cancer Laboratory, University College Medical School, London, UK).
HB4a is a cell line derived from normal human breast epithelium by immortalization with SV40
large T-antigen (Stamps et al., 1994) which has been shown by gene expression profiling to have
similar gene expression to normal human breast epithelium (Git et al., 2008).
2.1.2 Culturing cells
Ampoules of cells frozen in liquid nitrogen were thawed at 37°C, centrifuged to remove residual
DMSO, and resuspended in warm culture medium in a 25cm2 flask. Once the adherent cells were
confluent they were washed with 2ml of Versene, then 2ml of Versene with trypsin (0.5mg/ml
except for HCC1806 which was 1mg/ml) was added and incubated at 37°C for 2 – 5 minutes until
cells detached. 2ml of media was added to the flask, and the cell suspension was centrifuged at
1600g for 3 minutes to pellet the cells. The cells were resuspended in the appropriate volume of
media for the new flask (6ml for a T75, 12ml for a T150).
All cells were grown with 100U/ml penicillin and 100μg/ml streptomycin and cultured at 37°C with
5% CO2, except MDA-MB-134 which was cultured in 7.5% CO2.
Cell line       Growth type    Medium            Additives
MDA-MB-134      Adherent       50:50 DMEM-F12    15% FBS
HCC1143         Adherent       RPMI              10% FBS
HCC1806         Adherent       RPMI              10% FBS
HCC2218         Suspension     RPMI              10% FBS
VP229/VP267     Adherent       MCDB-201          2% FBS + 1% ITS
ZR-75-30        Adherent       50:50 DMEM-F12    10% FBS + 1% ITS
HB4a            Adherent       50:50 DMEM-F12    10% FBS
Table 2.2. Cell line growth conditions.
To freeze cells, the cells were pelleted as above, and resuspended in 1.5ml of media with 10%
DMSO in 2ml cryotubes. Tubes were frozen slowly at -80°C and stored in liquid nitrogen.
2.2 RNA extraction
The media was changed 12 hours before harvesting cells at 70% confluence. For adherent cell
lines, 7.5ml (for a T75) or 15ml (for a T150) Trizol reagent (Invitrogen) was added and the cells left
at room temperature for at least 5 minutes before the cells were harvested using a cell scraper and
transferred to a Falcon tube. For suspension cells, the cells were centrifuged at 1600g for 3
minutes and the supernatant removed, and the pellet resuspended in the appropriate volume of
Trizol. 1.5ml of chloroform was added, and the cells were vortexed and centrifuged at 2000g at 4°C
for 15 minutes. The top layer was retained and mixed with 4ml of isopropanol, and centrifuged at
2000g at 4°C for 15 minutes. The pellet was washed twice in 70% ethanol, and either stored under
ethanol at -80°C, or resuspended in 200µl of RNase-free water for immediate use.
2.3 Protein extraction
Cells were trypsinised and the pellet washed with PBS, then lysed by adding 1ml of RIPA buffer
(50mM Tris HCl (pH 8), 150 mM NaCl, 1% NP-40 (v/v), 0.5% sodium deoxycholate (w/v), 0.1% SDS
(w/v), 0.5 mM EDTA, Complete Protease Inhibitor Cocktail (Roche, used according to the
manufacturer’s instructions)) and mixing well. Cells were placed on ice for 20 minutes, then
centrifuged at 16000g at 4°C for 10 minutes, and the supernatant retained.
2.4 DNA extraction
Cells were trypsinised and 1ml of DNAzol reagent (Invitrogen) added, and the cells were lysed with
a P1000 pipette. 0.5ml of 100% ethanol was added and mixed by inversion, and left at room
temperature for 3 minutes. The precipitated DNA was spooled around a pipette tip and transferred
to a clean tube, and washed twice with 1ml of 95% ethanol. The DNA was resuspended in 250μl of
water and quantified on the NanoDrop spectrophotometer.
2.5 cDNA synthesis
The DNA-free kit (Ambion) was used to remove DNA contamination. 10μg of total RNA extracted
as above was treated with 1μl of rDNase I and the rDNase was removed with the DNase Removal
Reagent. First-strand cDNA synthesis was performed using the SuperScript III First-Strand Synthesis
Kit (Invitrogen). 5μg of DNase-treated RNA was mixed with 50ng of random hexamer primers and
1μl of 10mM dNTPs and incubated at 65°C for 5 minutes, then cooled on ice. 2μl of 10X RT buffer,
4μl of 25mM MgCl2, 2μl of 0.1M DTT, 40U RNaseIN (Promega) and 200U SuperScript III were added,
incubated at 25°C for 10 minutes then 50°C for 50 minutes, and the reactions were stopped by
incubating at 85°C for 5 minutes. The cDNA was stored at -20°C until needed.
2.6 PCR
Primers to amplify specific regions of genomic DNA or cDNA were designed using Ensembl
(www.ensembl.org) and Primer3 (Rozen and Skaletsky, 2000) (http://frodo.wi.mit.edu/primer3/).
All standard PCR (for target regions under 2kb) was carried out using HotMaster Taq polymerase
(VWR) with the following reaction mix: 2.5μl of 10X HotMaster buffer, 1μl of 10mM dNTPs, 1μl
each of 100mM forward and reverse primer, 1U Taq, 1μl of 50ng/μl DNA in a 25μl volume. Cycling
conditions were 95°C for 5 minutes, then 35 cycles of 95°C for 30 seconds, 60°C for 30 seconds,
72°C for 1 minute per kb of target, and finally 72°C for 10 minutes. Long range PCR for targets up to
10kb was performed using Elongase polymerase mix (Invitrogen) and a reaction mix containing 1μl
10mM dNTPs, 1μl each of 10μM forward and reverse primer, 2μl of 50ng/μl DNA, 1U of Elongase
polymerase mix and 10μl of a mix of buffer A and B optimised for the best Mg2+ concentration,
made up to 50μl. Cycling conditions were 94°C for 30 seconds, followed by 35 cycles of 94°C for 30
seconds, 60°C for 30 seconds, 68°C for 1 minute per kb of target, and finally 68°C for 10 minutes.
10μl of product was visualised on a 0.8-2% agarose gel with 0.05% ethidium bromide.
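The two cycling programmes differ only in their temperatures, the initial denaturation time, and an extension time that scales with target length. A small helper along the following lines makes the rule of one minute per kb explicit. The function name and the returned data structure are illustrative only, and treating a target of exactly 2kb as long-range is an assumption, as the text specifies only "under 2kb" for standard PCR.

```python
def cycling_program(target_kb):
    """Return the cycling programme for a target of the given size (kb).

    Steps are given as (temperature in degrees C, duration in seconds).
    """
    if not 0 < target_kb <= 10:
        raise ValueError("target must be greater than 0 and at most 10 kb")
    if target_kb < 2:
        # Standard PCR with HotMaster Taq.
        denature_c, extend_c, initial_s = 95, 72, 5 * 60
    else:
        # Long-range PCR with Elongase (assumed to apply from 2 kb upwards).
        denature_c, extend_c, initial_s = 94, 68, 30
    extension_s = round(60 * target_kb)  # 1 minute per kb of target
    return {
        "initial": (denature_c, initial_s),
        "cycles": 35,
        "cycle_steps": [(denature_c, 30), (60, 30), (extend_c, extension_s)],
        "final": (extend_c, 10 * 60),
    }


standard = cycling_program(1.5)   # 90 second extension at 72 degrees C
long_range = cycling_program(5)   # 300 second extension at 68 degrees C
```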
2.7 Real-time PCR
Gene-specific primers were designed as above. The reaction was carried out in a 10μl volume
containing 1X SybrGreen PCR Master Mix (Applied Biosystems), 1μl each of 2.5mM forward and
reverse primer, and 1μl of 50ng/μl cDNA. Cycling was carried out using an ABI Prism 7900HT RT-
PCR machine (Applied Biosystems) and the cycling conditions were 50°C for 2 minutes and 95°C for
ten minutes, then 40 cycles of 95°C for 15 seconds, 60°C for one minute, and a final dissociation
step of 95°C for 15 seconds and 60°C for 15 seconds. Primer pair efficiency was calculated using a
standard curve from cDNA dilutions, and primers with an amplification efficiency of 1.8 or higher
were used. In the experiments in Chapter 3, GAPDH was used as a control cDNA, and expression of
the cDNA of interest was normalised to the value of GAPDH in each cell line. In the experiments in
Chapter 5, 3 genes (GAPDH, UBC, and RPL13a) were used as control cDNAs, and expression of the
cDNA of interest was normalised to the mean of all three genes.
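One common way to carry out this kind of calculation is sketched below. It is not necessarily the exact procedure used here: it assumes that efficiency is expressed as the amplification factor per cycle (2.0 for perfect doubling), derived from the slope of Ct against log10 of the cDNA dilution, and that "the mean" of the control genes is the arithmetic mean of their relative quantities. All function names are illustrative.

```python
from statistics import mean


def efficiency_from_slope(slope):
    """Amplification factor per cycle from a standard curve slope
    (Ct plotted against log10 of input amount; slope is negative)."""
    return 10 ** (-1.0 / slope)


def primers_acceptable(efficiency, threshold=1.8):
    """Apply the cut-off of 1.8 or higher stated in the text."""
    return efficiency >= threshold


def relative_quantity(ct, efficiency=2.0):
    """Relative starting quantity implied by a Ct value."""
    return efficiency ** (-ct)


def normalised_expression(ct_target, ct_refs, eff_target=2.0, eff_refs=None):
    """Target quantity divided by the mean quantity of the control genes.

    ct_refs holds one Ct per control gene, e.g. a single GAPDH value,
    or values for GAPDH, UBC and RPL13a.
    """
    eff_refs = eff_refs or [2.0] * len(ct_refs)
    ref_quantity = mean(
        relative_quantity(ct, eff) for ct, eff in zip(ct_refs, eff_refs)
    )
    return relative_quantity(ct_target, eff_target) / ref_quantity
```

With a single control gene and perfect efficiencies this reduces to the familiar 2^-ΔCt calculation: a target that crosses the threshold two cycles after the control has a normalised expression of 0.25.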
2.8 Sequencing
PCR products under 1kb were purified using the QIAquick PCR Purification Kit (Qiagen) according
to manufacturer’s instructions and capillary sequencing of the products was performed by the
DNA Sequencing Facility, Department of Biochemistry. Longer PCR products were first cloned in a
pCR-XL-TOPO vector using the TOPO XL PCR Cloning Kit (Invitrogen). The plasmid DNA containing
the insert was extracted using the HiSpeed Plasmid Midi-Prep Kit (Qiagen) and sequenced as
above.
2.9 Flow sorting of chromosomes
Flow sorting of chromosomes was performed according to standard methods (Ng and Carter,
2006). The cells were subcultured 1:2 the day before sorting to synchronise the cells. Colcemid was
added to a final concentration of 0.1µg/ml 6 hours before the cells were harvested. Adherent cells
were harvested by banging the flask, and the medium removed to a 50ml Falcon tube. The cells
were pelleted by centrifuging at 250g for 5 minutes, the pellet resuspended in 5ml of PBS, and
incubated at room temperature for 10 minutes.
Cell swelling was monitored by mixing 10μl of cell suspension with 10μl of Turk’s solution. The cell
suspension was spun down at 250g for 5 minutes and resuspended in 1-3ml of polyamine isolation
buffer. After incubation on ice for 10 minutes and vortexing for 20 seconds, a small sample of the
preparation was stained with propidium iodide (5mg/ml) and observed under a fluorescence
microscope to see whether the chromosomes had clumped together, and vortexed until the
chromosomes were free. The chromosome suspension was transferred to a 15ml tube and
centrifuged for 1 minute at 173g, and the supernatant transferred to a fresh 15ml tube. Hoechst
33258, MgSO4.7H2O, and Chromomycin A3 were added to the suspension to a final concentration
of 1μg/ml, 10mM and 80μg/ml respectively, and incubated at 4°C overnight. The next day the
suspension was centrifuged for 2 minutes at 250g and the supernatant removed to a new tube.
Sodium sulphite to a final concentration of 250mM was added one hour before the chromosomes
were sorted. Aliquots of 500 or 2000 chromosomes were sorted on a MoFlo (Cytomation
Bioinstruments) and analysed using Summit software (Beckman Coulter) to count the number of
events in each chromosome fraction.
2.9.1 GenomiPhi amplification of sorted chromosomes
Aliquots of sorted chromosomes (volume ~20μl) were precipitated by adding 0.5μl Pellet Paint co-
precipitant (Merck), 1.5μl of 2.5M sodium acetate pH 5.5, and 50μl ethanol, incubating at -20°C
overnight, and centrifuging at 16000g for 20 minutes at 4°C. The pellet was washed in 70% ethanol
and air-dried, then resuspended in 1μl TE. Chromosomes were amplified using the GenomiPhi DNA
Amplification Kit (GE Healthcare) according to the manufacturer’s protocol and purified using
MicroSpin G50 columns.
2.10 Metaphase preparation from cell lines
The cells were split the day before sorting to synchronise the cells. For standard metaphase
preparations, colcemid was added to a final concentration of 0.1µg/ml 20 hours after the cells
were split and incubated for 90 minutes. To produce extended chromosome preparations, cells
were treated with BrdU, EtBr and colcemid:
Cell line      BrdU (40µg/ml)    EtBr (5µg/ml)    Colcemid (0.1µg/ml)
HCC1806        16.5 hours        1.5 hours        None
MDA-MB-134     20 hours          1.5 hours        0.75 hours
After incubation, cells were trypsinised and centrifuged at 1600g for 3 minutes to pellet the cells.
The supernatant was removed, leaving ~500µl of medium on the pellet, and the cells were
resuspended using a P1000 pipette. 20ml of 0.075M KCl warmed to 37°C was added drop by drop
to the cells with agitation, and the cells incubated at 37°C for 15 minutes. 10-20 drops of ice cold
freshly prepared 3:1 fix (3 parts methanol to 1 part acetic acid) were added, and the cells
centrifuged at 1600g for 3 minutes. The supernatant was removed, again leaving ~500µl of
medium on the pellet, and the cells were resuspended using a P1000 pipette to ensure no cell
clumps were left. 20ml of 3:1 fix was added drop by drop with agitation, and the cells were
incubated on ice for 5 minutes, then centrifuged at 1600g for 3 minutes. The supernatant was
removed leaving ~500µl of fix on the pellet, and the cells were resuspended using a P1000 pipette.
The fixation step was repeated once more with 3:1 fix, and then with 3:2 fix (3 parts methanol to 2
parts acetic acid). The ~500µl of metaphase suspension was transferred to a 2ml Eppendorf tube
and made up to 2ml with 3:2 fix, and stored at -20°C for at least 24 hours before use.
2.10.1 Metaphase spreads
100μl of water was placed on a glass microscope slide, and 10-20μl of metaphase suspension was
dropped onto the slides from a height of ~40cm. The slides were checked under a light microscope
for the presence of metaphases and the area containing metaphases was marked with a diamond
pen. The slides were dehydrated by incubating for 3 minutes each in 70, 90 and 100% ethanol at
room temperature, and the slides allowed to air-dry before incubating at 37°C overnight. Slides
were stored at -20°C until needed.
2.11 BAC growth and DNA extraction
Bacterial artificial chromosomes were stored as glycerol stabs at -80°C. They were streaked onto LB
agar with 20μg/ml chloramphenicol or 25μg/ml kanamycin and grown overnight at 37°C. A single
colony was grown for 6-8 hours at 37°C with shaking in 5ml LB media with 20μg/ml
chloramphenicol or 25μg/ml kanamycin. 1ml of this culture was added to 100ml of
LB/chloramphenicol or LB/kanamycin media and grown overnight at 37°C with shaking. The 100ml
cultures were centrifuged at 3000g for 15 minutes and the BAC DNA was extracted using the
HiSpeed Plasmid Midi-Prep Kit (Qiagen) according to the manufacturer’s instructions. The DNA was
precipitated by adding 1/10 volume of 3M sodium acetate pH 5.2 and 2 volumes of 100% ethanol
before incubating overnight at -20°C. The DNA was centrifuged at 15000g for 30 minutes, and the
pellet was air-dried and resuspended in 50μl of TE buffer and incubated at 65°C for two hours. The
DNA concentration was determined on the NanoDrop spectrophotometer.
2.12 FISH probes and hybridisation
2.12.1 Nick translation
Input material was BAC DNA, GenomiPhi-amplified sorted chromosomes, or sorted chromosomes
amplified by 3 rounds of DOP-PCR. Sorted normal human chromosomes were provided by Patricia
O’Brien and Professor Malcolm Ferguson-Smith, Department of Veterinary Medicine, University of
Cambridge. 500ng – 1µg DNA was nick translated in a 25μl reaction volume containing 2.5μl of nick
translation buffer, 1.9μl of low-C dNTPs (0.1M dATP, dGTP, dTTP, 0.03M dCTP), 0.7μl of labelled
dUTPs, 0.7μl of DNase I and 1μl of DNA polymerase I. dUTPs were labelled with Digoxigenin-11,
Spectrum Orange, or Biotin. The reaction was incubated at 14°C for 2 hours. 4μl of the reaction
was run on a 1.5% agarose gel to check the product, and the reaction was stopped with 2.5μl of
EDTA and incubated at 65°C for ten minutes.
2.12.2 Hybridisation
7μl of each probe or chromosome paint was precipitated overnight at -20°C with 3μl of CoT-1 DNA,
1μl of glycogen and 300μl of ethanol. The probe mixture was centrifuged at 15000g for 30 minutes
and the pellet allowed to air dry before being resuspended in 20μl of hybridization buffer (50% DI
formamide, 10% dextran sulphate, 1X Denhardt’s solution (Sigma), 2XSSC, 23mM Na2HPO4, 17mM
NaH2PO4) and left at 37°C for 30 minutes. Finally, the mixture was incubated at 70°C for ten
minutes, cooled on ice for 2 minutes, and incubated at 37°C for one hour.
Prepared metaphase spreads were incubated overnight at 37°C. The slides were denatured for 1
minute in denaturation solution (70% deionised formamide, 2XSSC) heated to 70°C and placed in
ice-cold 70% ethanol for 5 minutes. The slides were washed in a series of 70, 90 and 100% ethanol
for 3 minutes each and allowed to air-dry before incubation at 37°C for ten minutes.
18μl of the probe mixture was pipetted onto a slide and covered with a clean coverslip, then
sealed with rubber cement. The slides were placed in a humid box and hybridized at 37°C
overnight.
2.12.3 Detection
The rubber cement was removed from the slides with tweezers and the coverslips removed by
soaking in 2xSSC. The slides were washed twice for 5 minutes each in a solution of 50% formamide
and 1xSSC at 42°C and twice for 5 minutes each in a solution of 1xSSC at 42°C, then once for 5
minutes in a solution of 4xSST. The slides were blocked with 100μl 3% BSA in 4xSST for 30 minutes
and washed briefly in 4xSST. Antibody layers were prepared by adding 1μl of antibody per slide to
200μl of 1% BSA in 4xSST, incubating in the dark for 10 minutes, and centrifuging for 10 minutes at
15000g. Digoxigenin-labelled probes were detected with sheep FITC anti-digoxigenin.
Biotin-labelled probes were detected with a layer of Cy5-labelled streptavidin, a layer of biotinylated
anti-streptavidin (Vector Laboratories), and a final layer of Cy5-labelled streptavidin. Each antibody
layer was incubated for 30 minutes at 37°C and the slides were washed 3 times in 4xSST with 0.5%
BSA. After all antibody layers were complete, 20μl of Vectashield with DAPI (Vector Laboratories)
was placed on a clean coverslip, and the slide was inverted onto the coverslip and allowed to dry in
the dark before being sealed with nail varnish. The slides were analysed on a Nikon Eclipse E800
Fluorescence microscope using Cytovision software (Applied Imaging) and stored at 4°C while not
in use.
2.13 Array painting
1Mb genomic arrays produced by the Cancer Research UK DNA Microarray Facility and tiling path
arrays (a gift from Dr K. Ichimura, as described in Ichimura et al., 2006) were used for array
painting. GenomiPhi-amplified chromosomes were labelled with Cy3 and reference DNA (a pool of
normal female DNA) was labelled with Cy5. Custom NimbleGen arrays used for high-resolution
array painting were designed by Dr Karen Howarth and hybridised by Roche-NimbleGen. The
whole-genome array CGH data was produced by Dr Graham Bignell and the Cancer Genome
Project, Wellcome Trust Sanger Institute, using the human SNP6.0 array (Affymetrix).
2.13.1 Labelling
Labelling was carried out using a BioPrime Labelling Kit (Invitrogen). 450ng of GenomiPhi-amplified
sorted chromosome DNA was mixed with 60μl 2.5X Random Primer Solution in a 150μl reaction.
The DNA was heated to 100°C for ten minutes and cooled on ice, and 15μl 1X dNTPs, 1.5μl Cy3 or
Cy5 labelled dCTP (Amersham) and 3μl exo-Klenow polymerase were added while on ice. The
reaction was incubated at 37°C overnight and stopped with 15μl of stop buffer. The
unincorporated nucleotides were removed with a Micro-Spin G50 cleanup column (Amersham)
and the Cy3/Cy5 incorporation was measured on the NanoDrop.
2.13.2 Hybridisation
The Cy3 and Cy5 labelled DNA was mixed with 80μl human CoT1 DNA, 44μl 3M sodium acetate (pH
5.2), 6μl yeast tRNA and 1ml 100% ethanol, mixed well, and precipitated at -20°C overnight. A
hybridisation chamber was prepared by adding a strip of Whatman paper soaked in 2xSSC/20%
formamide solution. The precipitated DNA was spun for 15 minutes at 15000g to pellet the DNA,
the pellet was washed with 500μl 80% ethanol, and the supernatant was removed. The dry pellet
was resuspended in 50μl array hybridisation buffer (50% formamide, 10% dextran sulphate, 0.1%
Tween 20, 2X SSC, 10mM Tris buffer pH 7.4) pre-heated to 70°C. The sample was denatured for 10
minutes at 70°C, then incubated for one hour at 37°C in the dark. The sample was pipetted onto
the array slide and covered with a coverslip, then placed in the prepared hybridisation chamber at
37°C for 24 hours.
2.13.3 Washing
The slide was washed in PBS with 0.05% Tween 20 to remove the coverslip, then washed in a fresh
solution of PBS/0.05% Tween 20 for 10 minutes at room temperature with shaking. The slide was
transferred to 1X SSC/50% formamide pre-heated to 42°C and incubated for 30 minutes at 42°C
with shaking, then washed in fresh PBS/0.05% Tween 20 for 10 minutes at room temperature with
shaking. The slide was dried by centrifugation at 750g for 2 minutes, and stored in a dark box until
ready to scan.
2.13.4 Scanning
The slides were scanned on an Axon 4000B scanner using GenePix Pro 6.0 software (Molecular
Devices), using constant PMT gain settings of 1000 for the Cy5 channel and 800 for the Cy3
channel. The median signal (minus background) for each channel for each probe was used, and any
probe where the signal in the Cy5 channel was not twice the signal from the Drosophila control
probes was rejected. The log2 ratio of the test (Cy3) to the reference (Cy5) signal was plotted and
used to call chromosome losses and gains.
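The probe filter and ratio calculation described above can be written out as a short sketch. The input format is hypothetical, and two details are assumptions: the Drosophila control signal is summarised as a single number, and probes with a non-positive Cy3 signal are skipped because their log2 ratio is undefined.

```python
import math


def log2_ratios(probes, control_cy5, min_fold=2.0):
    """Return {probe name: log2(Cy3 / Cy5)} for probes passing the filter.

    probes: list of dicts with 'name' and the background-subtracted median
            signals 'cy3' (test) and 'cy5' (reference).
    control_cy5: Cy5 signal of the Drosophila control probes.
    """
    ratios = {}
    for probe in probes:
        if probe["cy5"] < min_fold * control_cy5:
            continue  # reference signal less than twice the control: reject
        if probe["cy3"] <= 0:
            continue  # assumption: skip probes where the ratio is undefined
        ratios[probe["name"]] = math.log2(probe["cy3"] / probe["cy5"])
    return ratios


# Made-up signals for three probes, purely to show the calculation.
example = log2_ratios(
    [
        {"name": "probe_a", "cy3": 400, "cy5": 200},
        {"name": "probe_b", "cy3": 100, "cy5": 50},
        {"name": "probe_c", "cy3": 100, "cy5": 400},
    ],
    control_cy5=50,
)
```

In this example probe_b is rejected because its Cy5 signal is below twice the control, while probe_a and probe_c give log2 ratios of 1 and -2 respectively.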
2.14 Western blotting
2.14.1 Blotting
Total protein extracts prepared as above were mixed 1:1 with β-mercaptoethanol and denatured at
99°C for 2 minutes then cooled on ice. 10μl of each sample was loaded onto a 12% Tris-acetate gel
(Invitrogen) and run at 125V for 90 minutes. The gel was trimmed and soaked in transfer buffer,
and transferred onto a PVDF membrane for 2 hours at 250mA at 4°C. The membrane was blocked
overnight in blocking buffer (1X TBS with 0.1% Tween 20 and 5% milk powder) at 4°C.
2.14.2 Detection
The membrane was washed 3 times for 5 minutes each in 1X TBS/0.1% Tween 20. Primary
antibody (diluted 1:1000-1:10000 in blocking buffer) was added and incubated for at least 1 hour,
and washed 3 times for 5 minutes each in 1X TBS/0.1% Tween 20. Secondary antibody (diluted
1:10000 in blocking buffer) was added and incubated for at least 1 hour at room temperature.
Detection was performed using the ECL Plus Western Blotting Detection System (GE Healthcare)
according to manufacturer’s instructions.
2.15 High-throughput sequencing
DNA sequencing libraries were prepared by Drs Jessica Pole and Ina Schulte in the lab from
genomic DNA extracted as above or from GenomiPhi-amplified DNA using Paired-End DNA Sample
Prep Kit or Mate-Pair Library Prep Kit (Illumina) according to the manufacturer’s instructions.
Sequencing was performed on an Illumina GAIIx sequencer at the Cancer Research UK Cambridge
Research Institute.
Chapter 3
Rearrangements of BCAS3 in
breast cancer
3.1 Introduction
Fusion genes are well-known in haematological malignancies. Recurrent fusion genes were first
discovered in chronic myeloid leukaemia (Shtivelman et al., 1985) and Burkitt’s lymphoma
(Dalla-Favera et al., 1982; Taub et al., 1982), and fusion genes which lead to fusion proteins
have subsequently been found in many other common haematological malignancies (Mitelman
et al., 2007). Some recurrent fusion genes have been identified in solid tumours, such as the
TMPRSS2-ERG fusion in prostate cancer (Tomlins et al., 2005), but at the start of my project,
only 3 fusion genes had been found in breast cancer – a fusion of ODZ4 to NRG1 in the cell line
MDA-MB-175 (Liu et al., 1999), a fusion of FHIT to a cDNA later identified as MACROD2 in BrCa-
MZ-02 (Popovici et al., 2002), and a fusion of BCAS4 to BCAS3 in MCF7 (Bärlund et al., 2002). All
of these fusions were found in cell lines, and none were known to be recurrent. However, the
recurrent fusions of TMPRSS2 to members of the ETS transcription factor family had recently
been shown in prostate cancer (Tomlins et al., 2005), suggesting that recurrent gene fusions are
present in epithelial cancers, and that there may be recurrent fusions in breast cancer which
had yet to be discovered.
Work by Dr Karen Howarth in the lab had mapped many of the chromosome rearrangements in
three breast cancer cell lines to look for fusion genes. Two fusion genes had been found in the
breast cancer cell line HCC1806, RIF1-PKD1L1 and TAX1BP1-AHCY (Howarth et al., 2008). The
SKY karyotype of this cell line includes a der(7)t(8;7;17) chromosome, and array painting of this
derivative chromosome to a 1Mb array showed a break in BCAS3 which was joined to part of
chromosome 7. A break in BCAS3 was interesting as it was already known to take part in a
fusion in breast cancer, and a recurrent fusion would be one of the very few recurrent fusions
in common epithelial cancers.
The BCAS4-BCAS3 fusion was discovered by investigating chromosome amplification in the cell
line MCF7. This line shows amplification of 17q23 and 20p13, which are two of the commonly
amplified regions in breast cancer, with amplification of 17q23 found in around 20% of breast
cancers, and 20p13 in between 12 and 39% of cancers (Bärlund et al., 2002). Expression analysis
of the genes in the amplicon in MCF7 identified an expressed sequence tag as the most
overexpressed transcript in this region (Monni et al., 2001).
Further investigation of this novel EST showed that the full-length cDNA
(now called BCAS3, for breast carcinoma amplified sequence 3) was not only overexpressed in
MCF7 but also fused to another novel gene on 20q13, BCAS4 (Bärlund et al., 2002). The BCAS4-BCAS3
fusion joins exon 1 of BCAS4 to exons 23 and 24 of BCAS3, and alters the open reading
frame of BCAS3, resulting in a truncated protein which ends after 21bp of BCAS3 exon 23.
Other studies of BCAS3 have suggested a possible functional role in breast carcinogenesis. High
expression of BCAS3 in breast cancer has been associated with tamoxifen resistance (Gururaj et
al., 2006), and its expression is induced by estrogen receptor alpha (Gururaj et al., 2007).
As it was part of one of the few known fusions in breast cancer, I decided that the break in
BCAS3 was worthy of further investigation. To determine whether BCAS3 was part of a fusion
gene in HCC1806, I first needed to map the other breakpoints on the derivative
chromosome at high resolution to identify the potential fusion partner. I would then
investigate any possible BCAS3 fusion in HCC1806, and look for any signs of a recurrent break or
fusion in other breast cancer cell lines and primary tumours.
3.2 Results
3.2.1 High-resolution array painting
At the start of my project, most of the breakpoints in HCC1806 were known only to low
resolution based on a 1Mb BAC array, which had 3000 probes spaced at roughly 1Mb intervals.
For most breakpoints, this was not high enough resolution to determine the exact gene at the
breakpoint. To find the breakpoints at higher resolution and determine which genes were
broken, the der(7)t(8;7;17) was hybridized to a custom Nimblegen array (array designed by Dr
Karen Howarth). The Nimblegen array was designed around the breakpoints already known
from 1Mb and tiling path array painting, and covers small regions at high resolution. Figure 3.1
shows the 1Mb array painting for the three chromosomes, and the high-resolution Nimblegen
array results for the regions around the breakpoints. On chromosome 17, the breakpoint was
between 56,164,500 and 56,172,000bp, which is within BCAS3 as expected, between exons 4
and 5 (Figure 3.2). This retains the 3’ end of the gene, which is the same end of the gene as is
retained in the known MCF7 fusion (Figure 3.3).
Figure 3.1. 1Mb array painting and Nimblegen array painting for the der(7)t(8;7;17)
chromosome from HCC1806. A – chromosome 7, showing a break at 27,670,000bp and a 400kb
deletion between 110,858,500 and 111,278,500bp. B – chromosome 8, showing a break at
116,586,500bp. C – chromosome 17, showing a break at 56,170,000bp.
On chromosome 7, there was a breakpoint between 27,669,500 and 27,671,500bp, which is
just outside the promoter of HIBADH and 50kb from the gene TAX1BP1, and a deletion between
110,858,500 and 111,278,500bp which deletes part of IMMP2L and DOCK4 (Figure 3.4). On
chromosome 8, the breakpoint was between 116,585,500 and 116,588,500bp, which is in the
gene TRPS1 (Figure 3.5). (See Chapter 4 for investigation of the TRPS1-TAX1BP1 fusion.)
3.2.2 FISH mapping of the derivative chromosome
Although the individual breakpoints were known to high resolution, the arrangement of the
chromosomes was not completely resolved, and the possible fusion partners of BCAS3 could
not be determined from the array data alone. Chromosome 7 was known to be joined to
chromosomes 8 and 17 from the SKY karyotype, but only one breakpoint on chromosome 7 was
seen from the array painting. One possibility was that chromosome 7 was fused to another
chromosome at or very close to the telomere, which would not be seen on the array painting as
there were few probes in regions near the telomere (Figure 3.6A). A second possibility was that
the small deletion was not an interstitial deletion but part of a more complex rearrangement
involving an inversion and a deletion, which would mean one of the breakpoints from the
deletion was joined to another chromosome (Figure 3.6B). The orientation of the chromosome
7 fragment with respect to the other 2 chromosomes was not clear, and from the array painting
it could not be determined whether the known breakpoint was joined to chromosome 17 or
chromosome 8.
Figure 3.6. Possible orientations of the chromosome fragments in the der(7)t(8;7;17)
chromosome based on 1Mb array painting. A – the unknown break on chromosome 7 could be
a near-telomeric fusion. B – the unknown break could be part of a more complicated
rearrangement involving inversion and a deletion, possibly the known deletion on chromosome
7.
To determine the arrangement of the chromosome fragments a series of FISH experiments was
performed. A probe close to the known breakpoint on chromosome 7 was hybridised to a
HCC1806 metaphase with chromosome 17 paint (Figure 3.7). The probe did not co-localise with
chromosome 17, but was found at the other end of the derivative chromosome, showing that
the known chromosome 7 breakpoint was joined to chromosome 8.
Figure 3.7. FISH on HCC1806 metaphase to determine orientation of chromosome 7 fragment.
Chromosome 17 paint labeled with Spectrum Orange is shown in blue. RP4-781A18 on
chromosome 7 is shown in green. RP4-781A18 is near the known breakpoint on chromosome 7
at 27.7Mb, and is present on the opposite end of the derivative chromosome from the
chromosome 17 paint, showing that the known breakpoint is joined to chromosome 8 and not
chromosome 17. The red signal is a mismapped probe.
To determine whether there had been a deletion and inversion, a FISH experiment was
designed with probes near the telomere and the deletion on chromosome 7 (Figure 3.8). The
probe near the telomere was not seen on the derivative chromosome, indicating that there had
been a deletion or breakpoint near the telomere. The probe next to the deletion was present
but was not close to the chromosome 17 fragment, indicating that there had not been the
inversion on chromosome 7 hypothesized in Figure 3.6B.
Figure 3.8. FISH to investigate a possible inversion on chromosome 7 in the der(7)t(8;7;17). A –
the chromosome fragments known to be in the derivative chromosome in some orientation,
and the location of the FISH probes used on chromosome 7. RP11-518I12 is close to the
telomere of chromosome 7, and RP11-563O5 is next to the small deletion at 111Mb. B – FISH
on HCC1806 metaphase. Spectrum orange chromosome 17 paint is shown in blue. RP11-518I12
is shown in red, and RP11-563O5 is shown in green. The chromosome in the lower right shows
the normal arrangement of the green and red probes on chromosome 7. On
the derivative chromosome the telomeric probe (RP11-518I12) is absent, indicating a deletion near the
telomere. The probe at 111Mb (RP11-563O5) is in the expected position and not juxtaposed with the
chromosome 17 paint, indicating that there has not been an inversion.
3.2.3 Fine mapping and sequencing of the breakpoint
The results of the FISH experiments suggested that the chromosome 17 fragment was joined to
the chromosome 7 fragment near the telomere of chromosome 7, but that some material had
been lost from the telomeric region. The der(7)t(8;7;17) had previously been hybridised to a
chromosome 7 tiling path array by Dr Karen Howarth, but the loss of material from the telomere
had not been detected. The analysis of the tiling path arrays was designed to reduce noise by
routinely removing all probes which did not meet a set threshold for signal relative to a set of
Drosophila control probes on the array. Re-analysis of the chromosome 7 tiling path
showed that a number of probes near the telomere of chromosome 7 had been removed
during this noise reduction process, as the signal for the normal reference DNA fell below the
signal threshold. When these probes were included, the array showed a deletion at the telomere which
had previously been overlooked (Figure 3.9A), with the breakpoint between
155,560,009 and 155,669,524bp. There were no genes at this breakpoint, and the nearest gene
was SHH (chromosome 7, 155,288,319-155,297,728), which was over 300kb from the
breakpoint, suggesting that BCAS3 was not part of a gene fusion (Figure 3.9B).
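The way a signal filter of this kind can hide a real deletion can be illustrated with a minimal sketch. The threshold rule, probe positions and signal values below are hypothetical and chosen only for illustration; they are not the actual analysis pipeline or data. The sketch assumes a probe is discarded when its reference signal falls below a multiple of the mean control signal, so weak probes near the telomere are removed whether or not the test DNA is deleted there.

```python
from statistics import mean

def filter_probes(probes, control_signals, factor=2.0):
    """Split probes into kept and dropped by reference-channel signal.

    `probes` is a list of dicts with keys 'pos', 'ref' and 'test'.
    The rule (ref >= factor * mean control signal) is an assumption
    made for this illustration, not the rule used in the thesis.
    """
    threshold = factor * mean(control_signals)
    kept = [p for p in probes if p["ref"] >= threshold]
    dropped = [p for p in probes if p["ref"] < threshold]
    return kept, dropped

# Hypothetical probes: positions in bp, signals in arbitrary units.
probes = [
    {"pos": 155_000_000, "ref": 900, "test": 880},
    {"pos": 155_400_000, "ref": 850, "test": 870},
    {"pos": 155_700_000, "ref": 150, "test": 10},  # weak reference, test signal lost
    {"pos": 156_000_000, "ref": 180, "test": 12},  # weak reference, test signal lost
]
controls = [90, 110, 100]  # hypothetical Drosophila control signals

kept, dropped = filter_probes(probes, controls)
```

In this toy case the two probes that carry the evidence for the deletion are exactly the ones the filter removes, which is why inspecting the dropped probes near chromosome ends is worthwhile.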
Figure 3.9. A - Tiling path array for chromosome 7 of the der(7)t(8;7;17). As well as the
previously detected breakpoint at 27Mb, a further breakpoint at 155.5Mb can be seen. B –
diagram of the deleted region on chromosome 7.
To confirm the result of this mapping, that BCAS3 was not fused to another gene, I cloned and
sequenced the breakpoint junction. Using the breakpoint positions from the tiling path array
painting, and from whole genome SNP6 arrays which had become available since the start of
the project (SNP6.0 data kindly provided by Dr Graham Bignell and colleagues at the Sanger
Institute Cancer Genome Project, later published as Bignell et al., 2010), the breakpoint on
chromosome 7 could be mapped to between 155,709,517 and 155,715,189bp. The breakpoint
on chromosome 17 was already known to be between 56,164,500 and 56,172,000bp. Primers
were designed at 1kb intervals in the breakpoint regions, and long range PCR using
combinations of these primers was carried out to amplify a junction product. A product of
around 2kb was obtained using primers from 155,713,659bp on chromosome 7 and
56,166,019bp on chromosome 17. The product was cloned and sequenced to show the exact
breakpoint was at 155,714,224bp on chromosome 7 and 56,165,019bp on chromosome 17,
with a 1bp overlap between the sequences (Figure 3.10). The final arrangement of the
der(7)t(8;7;17) chromosome is shown in Figure 3.11.
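The primer tiling strategy described above can be sketched as follows. This is only an illustration of the combinatorics: the two intervals are those quoted in the text, but the function name and the placement of primers on an exact 1kb grid are assumptions, and primer sequence and orientation are not modelled.

```python
from itertools import product

def tile_positions(start, end, step=1000):
    """Return candidate primer positions every `step` bp across [start, end]."""
    return list(range(start, end + 1, step))

# Breakpoint intervals quoted in the text (bp).
chr7_interval = (155_709_517, 155_715_189)
chr17_interval = (56_164_500, 56_172_000)

chr7_primers = tile_positions(*chr7_interval)
chr17_primers = tile_positions(*chr17_interval)

# Each chromosome 7 primer is tried with each chromosome 17 primer,
# since the exact junction position within each interval is unknown.
primer_pairs = list(product(chr7_primers, chr17_primers))
```

Because only a pair flanking the true junction can yield a product, the number of reactions grows with the product of the two interval lengths, which is one reason narrowing the intervals first is useful.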
AAGGAGATGAGACACATCTGGTGAACACAGGTGACAGACATGGAGAAGTGAAAATGCTGTACCAAAAT
ATATTCTTCAATATGGTATTAAATATGGTATTTTAACAAATTTCTACGTATTAAATATTAATAGCATGATTT
GAGATCAGGAGAGCTAAGGTACATTATGCTAAATAACATTAAGGTAGAATAGTGAGACCAATATAGACT
GAATCATTCATTCATCAATTTATTCATTCAACAAGCATCTCTTTGGATTAACTATCATTTATTGAGTGCCAA
TTATTATATACTATCAAATATACAATTATACATTTAACAAATATACAATTTATATATTGTTGCCTGGTTAGA
TAAATATGTTATTAACCTTATTTTAAAACGAAACTCAGATTTAGTAAATTTGTATAGCTAATAAGCATANT
CCATTTTCTTTTCTACTA
Figure 3.10. Junction sequence for 7;17 junction of the der(7)t(8;7;17) chromosome. The bases
highlighted in blue are from chromosome 7, 155,714,171 to 155,714,224bp on the positive
strand. The bases highlighted in red are from chromosome 17, 56,165,019 to 56,165,424bp on
the positive strand. The base highlighted in green is a 1bp overlap between the sequences. The
12 bases shown in black are a 12bp insertion into chromosome 17.
Figure 3.11. The correct arrangement of chromosome fragments in the der(7)t(8;7;17)
chromosome in HCC1806. The nearest genes to the breakpoints are shown. Although genes are
broken at the breakpoints on chromosome 8 and 17, they are joined to non-genic regions on
chromosome 7, and no fusions can be found.
3.2.4 BCAS3 in other cell lines and tumours
Microarray data from Chin et al. (2007) suggested that there was also a break in BCAS3 in the
breast cancer cell line SUM52. This was confirmed using FISH probes upstream and
downstream of BCAS3, which showed that there was one extra copy with just the 5’ end of the
gene retained, and 5 extra copies of the 3’ end of the gene (Figure 3.12).
Figure 3.12. FISH with probes 5’ and 3’ to BCAS3 on a SUM52 metaphase. A – diagram showing
probe location. RP11-947H19 is located 100kb upstream of the start of BCAS3, and RP11-160D4
is located 30kb downstream of the end of BCAS3. B – results of hybridization to SUM52
metaphase. RP11-947H19 is shown in red and RP11-160D4 is shown in green. There are two
intact copies of BCAS3, with five copies where only the 3’ end is retained (individual green
signals), and one copy where only the 5’ end is retained (individual red signals).
To further investigate whether BCAS3 was fused in any of the cell lines, I performed real time
PCR using 3 sets of primers from the beginning, middle and end of the gene to look for any cell
lines which differentially expressed part of the gene. The results are shown in Figure 3.13. The
normal human breast cell line HB4a was used as a control. HMT3552 is another normal human
breast cell line.
Figure 3.13. Real time PCR for 3 different exons in BCAS3 on a panel of cell lines. All the values
are normalized to HB4a, a normal breast epithelial cell line.
There are four lines which show twofold or higher overexpression of any part of BCAS3 relative
to HB4a: SUM52, BT549, SUM44, and SkBr3. SUM52 shows fifteen times higher expression of
the first exon of BCAS3 compared to HB4a, with lower expression of the two primer pairs in
exons 9 and 23. There are more copies of the 3’ end than the 5’ end of BCAS3 in SUM52, but
this result suggests that the extra copies of the 3’ end are not affecting expression of the gene.
Whole genome SNP6.0 data for BT549 does not show any chromosome rearrangements in
BCAS3. The highest resolution array data available for SUM44 and SkBr3 is a custom 30k Agilent
array (data kindly provided by Dr Suet-Feung Chin), which shows no unbalanced
rearrangements in BCAS3 in either cell line at the resolution of the array. HCC1806 does not
show overexpression of any exons of BCAS3. Notably, MCF7, which has the original BCAS4-
BCAS3 fusion gene, does not show overexpression of BCAS3. Overexpression of the primer pair
in exon 23 would be expected as it is found in the BCAS4-BCAS3 fusion transcript.
As BCAS3 was broken in multiple cell lines, I decided to look for breaks in a set of tumours. Six
probes were chosen, three overlapping probes each upstream and downstream of BCAS3. Each set
of probes was pooled and hybridized to a normal metaphase to ensure they gave a single
strong signal, and then hybridized to a tissue microarray containing cores from 141 breast
tumours (tissue microarray kindly provided by Dr Suet-Feung Chin, who also performed the
hybridization) (Figure 3.14).
Figure 3.14. Locations of BAC probes used for TMA FISH. The upstream and downstream probes
are just outside the BCAS3 gene, and the three probes on each side overlap to give a single
signal on an interphase nucleus. The upstream probes are RP11-105G8, RP11-381A5, and
RP11-947H19. The downstream probes are RP11-160D4, RP11-466D9, and RP11-180G7.
Each of the tumour cores was scored according to number of signals seen for each pool of
probes. Of the 141 tumours, 107 could be successfully scored. 97 of the cores showed a normal
result, with overlapping signals from the two sets of probes. A further 7 tumours showed
amplification of both sets of probes, indicating the whole of BCAS3 was amplified. Only 3
tumours showed split probes, indicating that there was a break in the gene, and all 3 tumours
had extra copies of the 5’ end of BCAS3 only. No tumours showed isolated signals from only the
3’ end of the gene.
A Western blot was performed to analyse the BCAS3 protein, initially in 3 cell lines, HB4a,
HCC1806, and MCF7 (Figure 3.15). Unfortunately the only available antibody to BCAS3 gave
multiple nonspecific bands and could not be used for any further analysis.
Figure 3.15. A Western blot of BCAS3 on 3 cell lines. HB4a is immortalized normal breast
epithelium, HCC1806 and MCF7 have known breaks in BCAS3. Multiple non-specific bands were
observed in all cell lines. The BCAS3 protein is 101kDa.
3.3 Discussion
Although BCAS3 is broken in HCC1806, it does not form part of a gene fusion. The break in
BCAS3 on the der(7)t(8;7;17) removes the 5’ end of the gene and may inactivate it, but there
are three complete copies of BCAS3 present on other chromosomes in HCC1806, so the
inactivation of one copy is unlikely to have an effect, and the expression of BCAS3 in HCC1806 is
not decreased compared to the normal breast cell line. The nearest gene to the broken copy of
BCAS3 is SHH, but there are no known enhancer elements near the breakpoint which could
affect SHH expression.
In total, 8.2% of the tumours showed amplification of BCAS3, close to the figure of 9.4%
reported by Bärlund et al. (2002). This is consistent with the knowledge that around 20% of
breast tumours show amplification of 17q23, as BCAS3 is just outside the minimal region of
amplification as defined by Pärssinen et al. (2007) and would not be expected to be amplified in
all tumours showing 17q23 amplification. The 3 tumours showing amplification of the 5’ end of
BCAS3 also support this interpretation – as BCAS3 is a large gene just outside the minimally
amplified region, some breaks in this gene would be expected, and may represent cases where
the amplification ends inside BCAS3. As it was the 3’ end of BCAS3 which was fused to BCAS4 in
MCF7, if a similar fusion were present in any of the tumours on the tissue microarray, split signals
which retained the 3’ end of BCAS3 would be expected, and these were not seen in any of the
tumours.
Analysis of the tiling path array for chromosome 7 showed a deletion at the telomere which
had not previously been detected, as the standard filtering procedure had removed the deleted
probes. A section of the chromosome where a number of probes had been lost at any region
other than the telomeres would be noticed, but a small deletion near the telomeres was not
immediately seen. This suggests that the standard protocol developed for array quality control
may cause other real breakpoints to be missed, particularly telomeric breakpoints.
Although BCAS3 appeared to be a good candidate for a recurrent fusion gene in breast cancer,
it does not appear that it is important as a fusion gene. Subsequent work by Dr Ina Schulte has
found a fusion of the 5’ end of BCAS3 to HOXB9 in the cell line ZR-75-30, but this proved to be
out of frame. While there does not appear to be an important recurrent fusion of BCAS3, it is
recurrently broken, both in cell lines and tumours which show amplification of 17q, and in cell
lines which do not show chromosome 17 amplification. This may simply be due to chance, as it
is a large gene near the edge of a common
amplicon, or it may be that overexpression of a truncated form of BCAS3 including the 5’ end
can affect BCAS3 activity.
Chapter 4
The complete karyotype of
HCC1806
Chapter 4 The complete karyotype of HCC1806
66
4.1.1 Introduction
The first cell line I completely analysed for chromosome rearrangements was HCC1806.
HCC1806 is described as a cell line derived from an acantholytic squamous carcinoma of
the breast (Gazdar et al., 1998). It has a heavily rearranged hyper-diploid karyotype
(Figure 4.1), with a median of 51 chromosomes and no normal copies of chromosomes
2, 3, 4, 6, 7, 10, 12, 14, and 21. It is ER, PR and HER2 negative, and has a deletion of the
potential tumour suppressor gene FHIT (Sevignani et al., 2003).
Figure 4.1. SKY karyotype of HCC1806 (Mira Grigorova, unpublished). A typical
metaphase is shown – the consensus karyotype of HCC1806 is 51(49-53), X, -X, 1x1,
der(1;5)(p10;q10), -2, der(2)t(2;5;2)dup(2), del(2)t(2;12), der(2?)t(2;14), der(3)del(3),
der(3)t(3;22)(p12;?), der(3)t(3;20)(p12;?), der(3)t(3;19), der(4)t(4;6)(p15;p12),
der(4)t(1;4)(q11;p15), 5x1, der(5;10)t(p10;p10), der(6)t(4;6)(p15;p12), der(6)t(1;6p),
der(6)del(6)(q10-qter), -7, der(7?)t(2;7), der(7)t(8;7;17), 8x1, der(8)del(8)(p12-pter), 9x1,
der(9)t(9;12)(p21; p12?), -10, der(10)t(6;10)(?;p11), i(10q), 11x1, der(11)t(3;11), -12,
der(12)t(12;13)(p12;?), der(12)t(12;22)(q13-14;q13), der(12)del(12)(q13-qter), 13x1,
der(13)t(13;2;7), der(13)t(13;11;13), -14, der(14)t(6;14)(?;p11.2), 15x1,
isodic(15)t(15;10), der(15)del(15), 16x1, der(16)t(16q11.1;3p11;11p11-pter), 17x1,
i(17q)t(3;17)/i(17q)t(3;17;15), 18x1, 19x1, der(19)t(8;19), der(19)t(18;19),
der(19)t(22;19;22), der(19)t(7;19;10), der(19)del(19), 20x2, -21, der(21)t(3;21), 22x1,
der(22)t(21;22), der(22)t(12;22)(q13;q13)
Previous work analysed all the breakpoints at low resolution using 1Mb array painting,
with higher-resolution tiling path and custom oligonucleotide arrays used to analyse
primarily the balanced breakpoints (Howarth et al., 2008). HCC1806 was chosen as a
good cell line for this analysis as it had a small chromosome number, making it easier to
flow sort each chromosome, and it had a large number of reciprocal balanced
translocations, which we were specifically interested in. Many of the unbalanced
breakpoints had not been analysed at higher resolution than the 1Mb array painting,
which gives breakpoints to a resolution of 3-4Mb, which is normally not enough to
identify the genes which are broken and any fusions or rearrangements which may
result. Using higher-resolution whole genome array CGH from the Affymetrix SNP 6.0
platform (Bignell et al., 2010), I aimed to complete the karyotype of HCC1806 to a high
resolution, including the deletions and amplifications too small to be seen on previous
arrays, and to investigate all the possible gene fusion events resulting from
rearrangements.
4.1.2 Previous work
Previous work on HCC1806 in the lab was carried out by Dr Karen Howarth. Flow
karyotypes of a chromosome preparation from HCC1806 cells were used to separate the
chromosomes and each aberrant chromosome was hybridized separately to a 1Mb
array. Flow sorting produced 51 separate chromosome fractions, labelled A to o (Figure
4.2) (Howarth et al., 2008).
Figure 4.2. Flow karyotype of HCC1806 chromosomes. The chromosomes are sorted by
separating the different fractions based on the staining intensity of Hoechst 33258 and
Chromomycin A3. The 51 fractions are labelled A to o (Howarth et al., 2008).
A number of the fractions contained more than one derivative chromosome, as
chromosomes of approximately the same size and similar GC composition co-localize
on the flow karyotype. There was one case in
which the two derivative chromosomes in the same fraction contained pieces of the
same chromosome, meaning that the mapped breaks could not be assigned to a single
chromosome, but in all other cases the two co-sorted chromosomes did not contain
pieces of the same chromosome. As the chromosome pieces involved in each derivative
chromosome were known from the SKY karyotype, the breaks could be assigned to one
of the derivative chromosomes.
Each fraction was labelled and hybridized to a 1Mb array; as 3 consecutive probes were
considered necessary to call a change in copy number, the breakpoints could be
mapped to a resolution of around 1Mb but rearrangements smaller than 3Mb would not
affect 3 consecutive probes and were not identified. In total, 1Mb array painting
revealed 93 breakpoints in HCC1806, of which 21 rearrangements were balanced to
1Mb resolution. Tiling path arrays which had probes every ~100Kb were available for
chromosomes 6, 7 and 22, and 14 breakpoints on chromosomes 6, 7, and 22 were
mapped to higher resolution using these arrays. A further 22 breakpoints, including all
the balanced breakpoints, were mapped using custom Nimblegen oligonucleotide arrays
designed to give probes every 200bp in the breakpoint regions. It was not practical to
map all the breakpoints on the custom Nimblegen arrays, so the balanced breaks were
prioritized as they would not be detected using whole genome array CGH even at high
resolution.
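The three-consecutive-probe rule described above can be sketched as a simple run-length filter. The encoding of probe states (0 for baseline, 1 for gain, -1 for loss) and the function name are assumptions made for this illustration; the original calls were made on array painting plots rather than with this code.

```python
def call_changes(states, min_run=3):
    """Return (first_index, last_index, state) for each run of a non-zero
    state that spans at least `min_run` consecutive probes.

    `states` holds one value per probe in genomic order:
    0 = baseline, 1 = gain, -1 = loss (hypothetical encoding).
    """
    calls = []
    i = 0
    while i < len(states):
        j = i
        # Extend j to the end of the run of identical states starting at i.
        while j < len(states) and states[j] == states[i]:
            j += 1
        if states[i] != 0 and j - i >= min_run:
            calls.append((i, j - 1, states[i]))
        i = j
    return calls

# Hypothetical probe states: a two-probe loss followed by a four-probe gain.
states = [0, 0, -1, -1, 0, 0, 1, 1, 1, 1, 0]
calls = call_changes(states)
```

Here the two-probe loss is discarded and only the four-probe gain is called, which mirrors how changes smaller than about 3Mb went unreported on the 1Mb array.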
Many of the breaks that were mapped to high-resolution with the Nimblegen arrays
were within genes. Two fusion products were identified, both at balanced
rearrangements: TAX1BP1-AHCY and RIF1-PKD1L1 (Howarth et al., 2008).
4.2 Results
4.2.1 High-resolution breakpoints from SNP6 arrays
A total of 71 unbalanced breakpoints were not mapped at a high enough resolution by
Howarth et al. (2008) to identify the genes broken. To map these breakpoints, I used
whole genome array CGH data from the Affymetrix SNP6.0 platform, kindly provided by
Dr Graham Bignell and colleagues from the Cancer Genome Project, Wellcome Trust
Sanger Institute (later published as Bignell et al., 2010). This data was used to map at
high resolution all the unbalanced breakpoints in HCC1806 which were previously
known only to 1Mb or tiling path resolution, and to confirm the mapping previously
performed using the Nimblegen arrays.
The Affymetrix SNP array 6.0 includes over 1,800,000 25bp probes. 900,000 probes
detect SNPs, 200,000 copy number probes detect regions of copy number variation,
while a further 700,000 probes are evenly spaced across the genome. This gives a
median probe separation of 700bp. In addition, the SNP probes provide information on
the genotype, allowing determination of regions of heterozygosity.
4.2.2 Determining the breakpoints
The provided segmentation of the whole genome SNP6.0 array using circular binary
segmentation (Venkatraman and Olshen, 2007) was unreliable and missed several
known breakpoints (Figure 4.3). Instead, the breakpoints were estimated by eye. 259
possible unique breakpoints were identified, without reference to the array painting in
order to prevent selection bias towards already-known breakpoints. The size of the
estimated interval containing the breakpoint varied according to the number of probes
in the region of interest and the noise around the breakpoint, but the median size was
20kb.
Figure 4.3. Incorrect segmentation of SNP6.0 arrays by circular binary segmentation. A
plot of the SNP6.0 array for part of chromosome 1 with segmentation performed by
circular binary segmentation. The position of the break is known from array painting to
be at 15.5Mb, which is incorrectly assigned by the SNP6.0 segmentation.
4.2.3 Comparison of array painting and the SNP6.0 array
The whole genome SNP6.0 array was matched to our existing array painting data. 75 of
the 259 breakpoints corresponded to breaks already identified by array painting.
Comparison of the breakpoint regions called by eye on the SNP6 data with breakpoints
which were known to tiling path or oligonucleotide array resolution showed that the
SNP6 regions always agreed with the previous data, suggesting that the breakpoints
called by eye are reliable.
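Matching breakpoint calls between the two data sets amounts to testing whether the estimated intervals overlap. A minimal sketch is given below; the array painting intervals are the Nimblegen intervals quoted in Chapter 3, while the SNP6.0 intervals and the function names are hypothetical and serve only to show the comparison.

```python
def overlaps(a, b):
    """True if closed intervals a = (start, end) and b = (start, end) share any base."""
    return a[0] <= b[1] and b[0] <= a[1]

def match_breakpoints(snp_calls, painting_calls):
    """Split SNP array calls into those overlapping an array painting call
    on the same chromosome, and those with no counterpart.

    Each call is a tuple (chromosome, start, end).
    """
    matched, unmatched = [], []
    for chrom, start, end in snp_calls:
        hit = any(
            chrom == p_chrom and overlaps((start, end), (p_start, p_end))
            for p_chrom, p_start, p_end in painting_calls
        )
        (matched if hit else unmatched).append((chrom, start, end))
    return matched, unmatched

# Array painting intervals from Chapter 3 (bp).
painting = [("17", 56_164_500, 56_172_000), ("7", 27_669_500, 27_671_500)]
# Hypothetical by-eye SNP6.0 intervals, for illustration only.
snp = [
    ("17", 56_160_000, 56_170_000),
    ("7", 27_660_000, 27_680_000),
    ("9", 123_000_000, 123_020_000),
]
matched, unmatched = match_breakpoints(snp, painting)
```

Calls left in `unmatched` are the candidates for breakpoints that array painting did not detect.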
The majority of the balanced breakpoints could not be seen as a copy number change
on the whole genome SNP6.0 data, as they were balanced to the resolution of the array.
There were three exceptions where the breakpoints could be seen as they were not
perfectly balanced: the balanced breaks at 16p21.1 and 3p21.1 in the chromosome
fractions L and I, and the balanced break at 7p15 in the fractions L and M. These breaks
appeared perfectly balanced to 1Mb resolution, but with the higher resolution whole
genome SNP6.0 data a small copy number gain of 100-200kb could be seen at
the position of each breakpoint (Figure 4.4). This gain could be caused by a duplication
of material at the breakpoint, with the region of copy number gain being present on
both derivative chromosomes, or it could represent a small duplication on one of the
products, or an unrelated duplication on another copy of that chromosome. Subsequent
work by Dr Karen Howarth confirmed using FISH and PCR that the duplicated material is
present at the breakpoint on chromosomes from both fractions, and does not represent
a tandem duplication on one of the translocation products.
Figure 4.4. Breakpoint duplications from SNP6.0 arrays. The plots show the SNP6.0 array
for HCC1806. A – part of chromosome 16, B – part of chromosome 3. The copy number
changes marked in red are at the location of a balanced translocation, and represent
duplicated sequences present in both products of the reciprocal translocation.
There was one other balanced break which could be seen on the whole genome SNP6.0
data, the balanced break at 6p22. This break was found in three chromosome
fractions (B, V and Z). The chromosome fragment from 6pter to 6p22 was found in two
fractions (V and Z) and the reciprocal fragment was only found in fraction B. This gave
an extra copy of one side of the balanced break, which can be seen as a copy number step
on the whole genome SNP6.0 array.
The only unbalanced breaks seen in the array painting data which were not present on
the whole genome SNP6.0 array were the chromosome 15 and 17 breaks in
chromosome fraction Q. Chromosome fraction Q contains a der(17)t(3;17;15) which is
likely to be a further rearrangement of the der(17)t(3;17) chromosome in chromosome
fraction X, as the chromosome 3 and 17 breaks were in the same locations (Figure 4.5).
The absence of a copy number step corresponding to the der(17)t(3;17;15) in the SNP6
array data may represent a difference between the sample of HCC1806 used in our
experiments and that used for the whole genome SNP6.0 array, suggesting the
der(17)t(3;17;15) rearrangement may have occurred in culture, or it is possible that the
breaks are actually balanced and that the reciprocal fragments have been lost in our
sample and retained in the sample used for the SNP6.0 array. The predicted copy
number from the SNP6.0 array is consistent with two copies of the der(17)t(3;17) and no
copies of the der(17)t(3;17;15).
Figure 4.5. Related derivative chromosomes in HCC1806. The der(17)t(3;17;15) in
chromosome fraction Q shares breaks on chr17 and chr3 with the der(17)t(3;17) in
chromosome fraction X, and is assumed to be a further rearrangement of the same
chromosome. The SNP6.0 copy number is consistent with the sample of HCC1806 used
for the SNP6.0 array having two copies of the der(17)t(3;17) and no copies of the der(17)t(3;17;15).
4.2.4 Identification of previously undetectable copy number changes
The breakpoints identified on the whole genome SNP6.0 array were used to find gains
and losses which were not identified on the 1Mb array painting. As three consecutive
clones at the same level were considered necessary to be sure of a break on the 1Mb
array painting, any copy number gains or losses which spanned fewer than three probes
would not have been called as a breakpoint from the 1Mb array. The effective resolution of
the array depends on the spacing of the probes in that region, but any gains or
losses under 5Mb are likely to have been overlooked on the 1Mb array painting. I
identified previously undescribed gains and losses by looking for any increase or
decrease in copy number where neither of the boundaries was a known translocation
and the region was under 5Mb. The SNP6 array showed 24 copy number gains, with a
size range from 91kb to 2.11Mb and a median size of 1.01Mb, and 23 copy number
losses ranging from 103kb to 4.5Mb with a median size of 577kb. An example of a loss
found on chromosome 9 which was not called by array painting can be seen in Figure
4.6.
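The selection criteria for these small gains and losses can be written as a short filter. The sketch below is illustrative only: the segment records are hypothetical (the chromosome 9 loss loosely follows Figure 4.6), and boundaries are compared with known translocation breakpoints by exact position, whereas real breakpoints are intervals and would need a tolerance.

```python
def small_copy_number_changes(segments, known_translocations, max_size=5_000_000):
    """Return segments shorter than `max_size` whose two boundaries are
    both absent from the set of known translocation breakpoints.

    Each segment is (chromosome, start, end, change), change being
    'gain' or 'loss'. `known_translocations` is a set of
    (chromosome, position) pairs. Exact matching is a simplification.
    """
    selected = []
    for chrom, start, end, change in segments:
        size = end - start
        touches_known = (
            (chrom, start) in known_translocations
            or (chrom, end) in known_translocations
        )
        if size < max_size and not touches_known:
            selected.append((chrom, start, end, change))
    return selected

# Hypothetical segments for illustration.
segments = [
    ("9", 123_000_000, 124_000_000, "loss"),  # small, no known boundary
    ("7", 27_670_000, 30_000_000, "gain"),    # starts at a known translocation
    ("3", 10_000_000, 18_000_000, "loss"),    # larger than 5Mb
]
known = {("7", 27_670_000)}

novel = small_copy_number_changes(segments, known)
```

Only the chromosome 9 segment passes both criteria in this toy input.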
Figure 4.6. Example of a deletion identified from the SNP6.0 array. Whole genome
SNP6.0 data for part of chromosome 9 is shown in gray, with the 1Mb array painting for
chromosome fraction R overlaid in red. A deletion can be seen between the arrows at
around 123 and 124Mb, which was not detected as 3 consecutive probes were not
called as deleted. The sequence of the BAC at the left hand edge of the deletion
probably overlaps the edge of the breakpoint.
4.2.5 Assembly of the complete karyotype
Using the SKY karyotype, array painting, and whole genome SNP6.0 array, a complete
picture of the derivative chromosomes was constructed. Several assumptions were
made in assembling the karyotype. First, I assumed that telomere fusions would be rarer
than non-telomere fusions, and so two broken chromosomes are likely to join at the
breakpoints rather than at the telomeres. A fusion at the telomeres would also leave
two broken ends without telomeres. I further assumed that the karyotype that involves
the fewest chromosome rearrangements to produce a derivative chromosome is
correct, and a break that appears at the same location in two derivative chromosomes is
joined to the same other chromosome, as it is more likely that the break has arisen once
and undergone further rearrangement than for the same break to have arisen twice. For
example, chromosome fraction O contains a der(20)t(3;20) chromosome, and
chromosome fraction M contains a der(20)t(3;20;7) chromosome, and as the breaks on
chromosome 3 and chromosome 20 are in the same locations in both derivative
chromosomes I assumed they were joined to each other in both chromosomes and the
der(20)t(3;20;7) had undergone a further translocation with chromosome 7. As the
higher-resolution CGH allowed the breakpoint intervals to be defined more precisely,
this assumption is likely to be correct.
4.2.6 Discrepancies between array painting and whole genome SNP6.0 array
When the whole genome SNP6.0 array data were matched to the array painting, the overall
agreement was good, but some breakpoints could not be accounted for either as known
breakpoints from the array painting or as gains and losses too small to be seen at the
low resolution of the array painting. I investigated
some of these discrepancies in order to determine if they represented a true
discrepancy between the two data sources.
One of the discrepancies I investigated was extra breaks on chromosome 2 (Figure 4.7).
HCC1806 does not have a normal copy of chromosome 2, but there are 6 derivative
chromosomes which contain pieces of chromosome 2, which were sorted into 5
different chromosome fractions. When the whole genome SNP6.0 data and the array
painting were compared for chromosome 2, there were several breaks which were not
part of small rearrangements and did not agree with any of the breaks previously called
from the 1Mb array painting.
One such break was seen as a step down in copy number on the SNP6.0 data at 75Mb.
This region of chromosome 2 was present on only the der(2)t(5;2;5) chromosome found
in fraction A. A closer inspection of the 1Mb array painting data showed that the break
at 75Mb appeared to be present on the array but had been missed (Figure 4.7). This
may be due to the magnitude of the changes seen in array painting, as the change in
log2 ratio between the regions of chromosome 2 which are not present in the derivative
and those which are present at one copy is much greater than the shift between one
and two copies present.
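The difference in magnitude between the two kinds of step can be illustrated numerically. The sketch below is an assumption-laden toy model rather than a description of the real array data: it treats the signal as proportional to copy number plus a small constant `background` term (an invented value standing in for non-specific hybridisation), and the function name `log2_shift` is hypothetical.

```python
import math


def log2_shift(copies_from, copies_to, background=0.05):
    """Change in log2 ratio when copy number changes in a sorted fraction.

    background is an assumed non-specific signal, expressed in copy units,
    that keeps the ratio finite where the region is absent.
    """
    return math.log2((copies_to + background) / (copies_from + background))


shift_0_to_1 = log2_shift(0, 1)  # region absent vs present at one copy
shift_1_to_2 = log2_shift(1, 2)  # one copy vs two copies
```

Under this assumed background the step from one to two copies is just under one log2 unit, while the step from absent to one copy is several units, which is consistent with the smaller shift being easier to overlook.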
Figure 4.7. 1Mb array painting of chromosome 2 in chromosome fraction A (above)
compared to the whole genome SNP6.0 array for chromosome 2 (below). The green
lines show the breakpoints which were originally called from the 1Mb array painting and
the matching breakpoints in the SNP6.0 data. The red line at 75Mb marks an additional
breakpoint which was called from the SNP6.0 data and not previously known, and shows
that the breakpoint can be found in the 1Mb array painting and was overlooked due to
the smaller shift in hybridisation intensity from 1 to 2 copies than from 0 to 1 copy.
Another discrepancy involved several breakpoints on the q arm of chromosome 2
(Figure 4.8). In addition to a breakpoint at 150Mb known from the array painting, there
was a break at 178Mb, a small amplification between 178 and 180Mb, and a break at
200Mb. The only chromosomes containing this region of chromosome 2 were the
der(2)t(5;2;5) found in chromosome fraction A, and a der(7)t(2;7) found in chromosome
fraction G.
On closer inspection of the 1Mb array painting, the extra breaks on the q arm of
chromosome 2 could be explained by at least one extra copy of chromosome 2 in
chromosome fraction G, extending from 178Mb to the q terminus (Figure 4.8). FISH showed
that there is an extra copy of that region of chromosome 2, with an amplification of the
178-180Mb region on one copy only (Figure 4.9). This amplification was seen on the array
painting, but as there were only two probes in this region, it was not called as a copy
number change. Although it is unclear from the array painting whether there is a break
at 200Mb, a further FISH experiment showed that there is an extra copy of the region
between 180Mb and 200Mb on both chromosomes (Figure 4.10).
Figure 4.8. 1Mb array painting of chromosome 2 in chromosome fraction G compared
to the whole genome SNP6.0 array for chromosome 2. The green line marks the
breakpoint which was called from the 1Mb array painting. This break is balanced, so no
break is seen on the whole genome SNP6.0 array. The solid red lines show breaks which
were seen on the SNP6.0 array but not previously called on the 1Mb array painting; they
can be seen as a smaller shift on the array painting and may have been overlooked.
The dashed red line shows a breakpoint on the SNP6.0 array which does not seem to
have an associated shift on the 1Mb array painting.
Figure 4.9. FISH to confirm breakpoint and amplification of extra chromosome 2
fragment in chromosome fraction G. A – whole genome SNP6.0 array for a portion of
chromosome 2, with the two BAC probes used for FISH marked in red and green. B -
FISH on interphase and metaphase nuclei shows chromosome 2 paint in blue, BAC RP11-65L3
in green and BAC RP11-67G7 in red. The FISH shows that there are 2 chromosomes
with paired red and green signals, which are the known chromosomes from fractions A
and G, and a chromosome with a single red signal and multiple green signals, which is an
extra chromosome also present in fraction G and has an amplification of the region with
the green probe.
Figure 4.10. FISH to investigate extra chromosome 2 fragment in chromosome fraction
G. A – whole genome SNP6.0 array for a portion of chromosome 2, with the two BAC
probes used for FISH marked in red and green. B - FISH on interphase and metaphase
nuclei shows chromosome 2 paint in blue, BAC RP11-15J24 in green and BAC RP11-59L22
in red. The FISH shows that there are 2 chromosomes that show two red signals
and one green signal.
4.2.7 Systematic search for fusion genes
Once the complete karyotype of HCC1806 had been assembled, all the breakpoints were
known to high resolution, as was, in most cases, which breakpoints were joined together.
It remained possible that there was extra complexity at the breakpoints, such as a
balanced inversion, which would not be detected using either of the array platforms. With higher
resolution array CGH, the genes at many of the unbalanced breakpoints are now known
where previously there were several candidate genes, although there are still several
breakpoints where the breakpoint cannot be mapped to a single gene.
By identifying the genes which are broken at each breakpoint, possible gene fusions
could be predicted.
Fusion genes can be produced by chromosome translocations in two main ways. A
translocation can produce a fusion gene directly by breaking a gene on each chromosome,
so that the translocated chromosome carries a fusion of the 5’ end of one gene and the
3’ end of the second gene (Figure 4.11). A translocation can also cause a fusion product
when only one gene is broken, by removing the transcription termination and poly(A)
addition site of the gene, which causes transcription to continue into an intact
downstream gene and produce a fused transcript (Figure 4.12). I refer to these as
“readthrough” fusions. The
TAX1BP1-AHCY fusion previously identified in HCC1806 (Howarth et al., 2008) is a
readthrough fusion.
Figure 4.11. Translocation between two chromosomes directly forming a fusion gene.
This produces a fusion gene with the 5’ end of the green gene and the 3’ end of the blue
gene. If this is a balanced, reciprocal translocation, the reciprocal fusion gene may also
be present.
Figure 4.12. Translocation between two chromosomes where only one gene is broken to
form a readthrough fusion. With the transcription end site of the green gene removed,
this could produce a fusion containing the 5’ end of the green gene and the whole of the
blue gene (apart from the first exon, which would not normally have a splice acceptor
site). This could alter the regulation of the blue gene, as it is under the control of a
different promoter.
To investigate the fusions, PCR primers were designed according to the example in
Figure 4.13. A pair of primers was designed to each gene to test expression of the gene
in HCC1806 and in the normal breast cell line HB4a. By using a combination of the
forward and reverse primers from different primer pairs, any fusion transcript would
give a product only in HCC1806, and would not be present in the normal cell line.
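The primer combinations described above can be written out as a short Python sketch. It only illustrates the bookkeeping: the `_F` and `_R` primer labels and both function names are hypothetical, and the MGAM/DPP6 pair is used simply because it is the worked example in Figure 4.13.

```python
def reactions_for_candidate(five_prime_gene, three_prime_gene):
    """List the three PCRs for one candidate fusion.

    Two reactions test expression of each gene with its own primer pair; the
    third combines the forward primer of the 5' gene with the reverse primer
    of the 3' gene and should only give a product from a fusion transcript.
    Primer labels (GENE_F, GENE_R) are placeholders, not real primer names.
    """
    return [
        {"target": five_prime_gene,
         "forward": f"{five_prime_gene}_F", "reverse": f"{five_prime_gene}_R"},
        {"target": three_prime_gene,
         "forward": f"{three_prime_gene}_F", "reverse": f"{three_prime_gene}_R"},
        {"target": f"{five_prime_gene}/{three_prime_gene}",
         "forward": f"{five_prime_gene}_F", "reverse": f"{three_prime_gene}_R"},
    ]


def is_fusion_specific(product_in_hcc1806, product_in_hb4a):
    """A fusion product should appear in HCC1806 but not in normal HB4a."""
    return product_in_hcc1806 and not product_in_hb4a


mgam_dpp6 = reactions_for_candidate("MGAM", "DPP6")
```

The same three-reaction pattern (gene, gene, fusion) is what appears in the lane keys for the PCR figures in this chapter.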
Figure 4.13. Primer design for amplification of fusion products on cDNA. A - The MGAM
gene is broken in HCC1806 between exon 29 and exon 38 (breakpoint region defined by
red dotted lines). PCR primers were designed between exon 28 and exon 29. B - The
DPP6 gene is broken between exons 1 and 2. PCR primers were designed between exon
2 and exon 3. C - The hypothetical MGAM-DPP6 fusion protein would include the 5’
exons from MGAM and the 3’ exons from DPP6. By using the forward primer from
MGAM and the reverse primer from DPP6, a product will only be produced if the fusion
transcript is present, and a normal cell line not containing the translocation can be used
as a control.
4.2.8 New fusions identified by high-resolution arrays
Seven new candidate fusion genes were identified from higher resolution mapping of
breakpoints (Table 4.1).
5’ gene     Chromosome   3’ gene     Chromosome   Chromosome fraction
FOXP4       6            HSP90       4            B, E
MGAM        7            DPP6        7            G
TMTC4       13           SUGT1L1     13           J, P
LMO1        11           NAG         2            K
TRPS1       8            TAX1BP1     7            L
CST4        20           EPHA3       3            M, O
BC022036    9            STAB2       12           R
Table 4.1. Potential fusion genes caused by translocations and large deletions.
Chromosome fractions are defined in Howarth et al. (2008).
The results of the PCR for the fusion genes caused by translocations and large deletions
are shown in Figure 4.14. No fusion transcripts were amplified; the genes LMO1, HSP90
and CST4 showed expression in HB4a but not in HCC1806, indicating that expression has
been lost, either by disruption of the gene at a breakpoint, or by some other mechanism
such as promoter methylation. LMO1 is known to be recurrently translocated in T-cell
leukaemia in the t(11;14)(p15;q11) translocation (Boehm et al., 1991), and is
thought to play a role in leukaemogenesis (Tremblay et al., 2010). Many of the genes
did not show expression in either HB4a or HCC1806, and the lack of a fusion product
may be due to lack of expression of the 5’ gene.
Figure 4.14. PCR for fusion genes caused by translocations in HCC1806. All PCRs were
carried out using HCC1806 cDNA.
Row  Column  Primers            Row  Column  Primers
1    1       Ladder             3    1       Ladder
1    2       FOXP4              3    2       BC022036
1    3       HSP90              3    3       STAB2
1    4       FOXP4/HSP90        3    4       BC022036/STAB2
1    5       MGAM               3    5       TRPS1
1    6       DPP6               3    6       TAX1BP1
1    7       MGAM/DPP6          3    7       TRPS1/TAX1BP1
2    1       Ladder             4    1       Ladder
2    2       LMO1               4    2       SUGT1L1
2    3       NAG                4    3       TMTC4
2    4       LMO1/NAG           4    4       SUGT1L1/TMTC4
2    5       CST4               4    5       Negative control
2    6       EPHA3
2    7       CST4/EPHA3
4.2.9 Fusion genes caused by small deletions
Using the whole genome SNP6.0 array data, 23 small deletions which were not
previously picked up by the 1Mb array painting were identified. Some of these small
deletions were at regions known to have copy number variation in the normal
population (Redon et al., 2006), and were assumed to be found in the germline, but
many of the deletions remove parts of genes and could potentially produce fusion
products, as shown in Figure 4.15. The 12 possible fusion genes are shown in Table 4.2.
The results of the PCR for fusion genes caused by small deletions are shown in Figure
4.16. No fusion transcripts were amplified.
Figure 4.15. An intrachromosomal deletion can cause a fusion gene between the 5’ end
of the green gene and the 3’ end of the blue gene. The other ends of the genes are lost.
Gene 1      Gene 2       Chromosome
DISC1       KIAA1383     1
HK2         REG3G        2
GPR39       MGAT5        2
NAP5        BC045801     2
MTDH        VPS13B       8
CTNLN       ADAMTS1L1    9
C5          TTLL1        9
HCCA2       OR52B2       11
SBF2        GALNTL4      11
USP31       ERN2         16
ATAD5       SUZ12        17
CHEK2       PITPNB       22
Table 4.2. Potential fusions from small deletions
Figure 4.16. PCR for fusions caused by small deletions
Key for Figure 4.16:
Row  Column  Primers                   Row  Column  Primers
1    1       Ladder                    4    1       Ladder
1    2       DISC1                     4    2       C5
1    3       KIAA1383                  4    3       TTLL1
1    4       DISC1/KIAA1383 fusion     4    4       C5/TTLL1 fusion
1    5       HK2                       4    5       HCCA2
1    6       REG3G                     4    6       OR52B2
1    7       HK2/REG3G fusion          4    7       HCCA2/OR52B2 fusion
2    1       Ladder                    5    1       Ladder
2    2       GPR39                     5    2       SBF2
2    3       MGAT5                     5    3       GALNTL4
2    4       GPR39/MGAT5 fusion        5    4       SBF2/GALNTL4 fusion
2    5       NAP5                      5    5       USP31
2    6       BC045801                  5    6       ERN2
2    7       NAP5/BC045801 fusion      5    7       USP31/ERN2 fusion
3    1       Ladder                    6    1       Ladder
3    2       MTDH                      6    2       ATAD5 exon 6
3    3       VPS13B                    6    3       ATAD5 exon 14
3    4       MTDH/VPS13B fusion        6    4       SUZ12
3    5       CTNLN                     6    5       ATAD5 exon 6/SUZ12 fusion
3    6       ADAMTS1L1                 6    6       ATAD5 exon 14/SUZ12 fusion
3    7       CTNLN/ADAMTS1L1 fusion    7    1       Ladder
                                       7    2       PITPNB
                                       7    3       CHEK2 exon 2
                                       7    4       CHEK2 exon 10
                                       7    5       PITPNB/CHEK2 exon 2 fusion
                                       7    6       PITPNB/CHEK2 exon 10 fusion
4.2.10 Fusion genes caused by tandem duplication
Small duplications may also produce fusion genes (Jones et al., 2008). Twenty-three small
duplications were seen in the SNP6.0 array CGH. These may be tandem duplications or
they may be an insertion of an extra copy elsewhere in the genome. As it could not be
determined from the array CGH which of these possibilities was correct, it was assumed
for the purposes of predicting fusion genes that all of these duplications were tandem
duplications. The potential fusion genes would then depend on the orientation of the
genes at the breakpoint and the location and orientation of the inserted fragment. The
possibilities are shown in Figures 4.17-4.19 and the predicted fusion genes are shown in
Table 4.3. Figure 4.20 shows the PCR results. In one case, there were 3 possible genes
for one end of a fusion due to a poorly-resolved breakpoint, and all 3 were tested. No
fusion transcripts were found.
Gene 1      Gene 2       Chromosome
EPHB2       MYOM3        1
EPHB2       FUSIP1       1
EPHB2       PNRC2        1
MYOM3       EPHB2        1
FUSIP1      EPHB2        1
PNRC2       EPHB2        1
AFF3        BC156887     2
BC156887    AFF3         2
c6orf105    PHACTR1      6
PHACTR1     c6orf105     6
LAMA2       ARHGAP18     6
ARHGAP18    LAMA2        6
CATSPERB    TC2N         14
SMURF2      CCDC46       17
GPC3        HS6ST2       X
Table 4.3. Potential fusion genes resulting from small tandem duplications
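For the simplest case, a head-to-tail tandem duplication, the dependence of the predicted fusion on gene orientation can be sketched as below. This is a minimal illustration under stated assumptions, not the method used to build Table 4.3: the function name, the gene labels and the strand encoding are hypothetical, each gene is assumed to span exactly one boundary of the duplicated segment, and head-to-head arrangements are not modelled.

```python
def head_to_tail_fusion(left_gene, left_strand, right_gene, right_strand):
    """Predict the (5' gene, 3' gene) fusion at a head-to-tail tandem junction.

    left_gene spans the lower-coordinate boundary of the duplicated segment
    and right_gene spans the higher-coordinate boundary. At the junction the
    end of the first copy (right boundary) is followed by the start of the
    second copy (left boundary). Returns None when no sense fusion is expected.
    """
    for strand in (left_strand, right_strand):
        if strand not in ("+", "-"):
            raise ValueError("strand must be '+' or '-'")
    if left_strand != right_strand:
        # Genes on opposite strands meet 5'-to-5' or 3'-to-3' at the junction.
        return None
    if left_strand == "+":
        # Transcription runs out of the right-boundary gene into the
        # 3' part of the left-boundary gene in the second copy.
        return (right_gene, left_gene)
    # Both on the minus strand: the mirror image of the case above.
    return (left_gene, right_gene)


# Placeholder gene names; strands are invented for illustration.
same_plus = head_to_tail_fusion("GENE_L", "+", "GENE_R", "+")
same_minus = head_to_tail_fusion("GENE_L", "-", "GENE_R", "-")
opposite = head_to_tail_fusion("GENE_L", "+", "GENE_R", "-")
```

In this sketch two genes on the same strand give a single predicted fusion and genes on opposite strands give none, matching the head-to-tail cases in Figures 4.17 to 4.19. Because the real orientation of each duplicated fragment was unknown, both gene orders were tested for most pairs.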
Figure 4.17. The possible fusion genes produced by a tandem duplication which breaks
two genes on the same strand. A - a head-to-tail duplication which produces a fusion of
the 5’ end of the blue gene to the 3’ end of the green gene. B - a head-to-head
duplication which produces no fusion products. C - the other possible head-to-head
duplication which also produces no fusion products.
Figure 4.18. The possible fusion genes produced by a tandem duplication which breaks
two genes on opposite strands, duplicating the 3’ ends of both genes. A - a head-to-tail
duplication which produces no fusion product. B - a head-to-head duplication which
produces a fusion of the 5’ end of the green gene with the 3’ end of the blue gene. C -
the other possible head-to-head duplication which produces a fusion of the 5’ end of
the blue gene and the 3’ end of the green gene.
Figure 4.19. The possible fusion genes produced by a tandem duplication which breaks
two genes on opposite strands, duplicating the 5’ ends of both genes. A - a head-to-tail
duplication which produces no fusion product. B - a head-to-head duplication which
produces a fusion of the 5’ end of the green gene with the 3’ end of the blue gene. C -
the other possible head-to-head duplication which produces a fusion of the 5’ end of
the blue gene and the 3’ end of the green gene.
Figure 4.20. PCR for fusions produced by tandem duplications. All PCR was on HCC1806
cDNA.
Row  Column  Primers            Row  Column  Primers            Row  Column  Primers
1    1       Ladder             3    1       Ladder             5    1       Ladder
1    2       EPHB2 pair a       3    2       ARHGAP18 pair a    5    2       c6orf105/PHACTR1
1    3       EPHB2 pair b       3    3       ARHGAP18 pair b    5    3       PHACTR1/c6orf105
1    4       MYOM3 pair a       3    4       CATSPERB           5    4       LAMA2/ARHGAP18
1    5       MYOM3 pair b       3    5       TC2N               5    5       ARHGAP18/LAMA2
1    6       FUSIP1             3    6       SMURF2             5    6       CATSPERB/TC2N
1    7       PNRC2              3    7       CCDC46             5    7       SMURF2/CCDC46
1    8       AFF3 pair a        3    8       GPC3               5    8       GPC3/HS6ST2
1    9       AFF3 pair b        3    9       HS6ST2
2    1       Ladder             4    1       Ladder
2    2       BC156887 pair a    4    2       EPHB2/MYOM3
2    3       BC156887 pair b    4    3       EPHB2/FUSIP1
2    4       c6orf105 pair a    4    4       EPHB2/PNRC2
2    5       c6orf105 pair b    4    5       MYOM3/EPHB2
2    6       PHACTR1 pair a     4    6       FUSIP1/EPHB2
2    7       PHACTR1 pair b     4    7       PNRC2/EPHB2
2    8       LAMA2 pair a       4    8       AFF3/BC156887
2    9       LAMA2 pair b       4    9       BC156887/AFF3
4.3 Discussion
HCC1806 is described as derived from an acantholytic squamous cell carcinoma of the
breast (Gazdar et al., 1998). Acantholytic squamous cell carcinoma is a rare form of
breast cancer, which accounts for only 0.05% of all breast neoplasms. Although
acantholytic squamous cell carcinomas are rare, a small percentage of invasive ductal
carcinomas show regions of squamous cell metaplasia (Fisher et al., 1983), so HCC1806
may be a rare variant of a true ductal carcinoma. CGH studies which have been carried
out on small numbers of acantholytic squamous cell carcinomas suggest that they show
some chromosome rearrangements which are characteristic of both breast cancer and
squamous cell tumours from other regions (Aulmann et al., 2005).
The combination of array painting and whole genome SNP6.0 array data made it
possible to identify potential fusion genes caused by translocations which could not
have been identified using one method alone. The high-resolution SNP6.0 data
identified the genes at breakpoints, but the array painting was needed to determine
which breaks are found together on a derivative chromosome and may be joined to
each other. Some of this information could be inferred from the SKY karyotype, but the
problems of resolution and overlap of chromosomes at breakpoints make it difficult to
determine the exact arrangement from the SKY data alone.
No further fusion transcripts could be detected in HCC1806. This could be a true
negative result, reflecting that there are few fusion genes in this cell line and that
the two known fusions are the only ones. The number of fusion genes found in cell lines
and tumours in the Stephens et al. study (2009) ranged from zero to eleven, suggesting
that if HCC1806 really has only two fusion genes then this is within the range found in
breast cancers. However, that study did not look for readthrough fusion genes, and may
underestimate the number of fusions in each cell line (see chapter 6 for details of
fusions not found in the cell line HCC1187).
Alternatively, there may be other fusion genes present in HCC1806 which were not
found using the methods I have employed. Although these methods have been
successfully used to find fusion genes before, including the two known fusions in
HCC1806, they may miss fusion genes which are expressed at very low levels which
cannot be detected using standard PCR. The assumption is that fusion genes which are
barely expressed are unimportant, but they may act as dominant negative inhibitors of
one gene in the fusion even at low levels. Fusion genes that have unusual splicing
patterns and do not include any of the exons tested by PCR would also be missed, as
would some fusion genes that included novel exons. Most of the genomic junctions at
breakpoints in HCC1806 have not been sequenced. It is possible that the breakpoints
are more complex than they appear and may contain ‘genomic shards’ (Bignell et al.,
2007; Campbell et al., 2008), small pieces of DNA which have been inserted at the
breakpoint. These pieces are often smaller than a kilobase and would not be seen on
the SNP6.0 array, but in rare cases could affect any fusion gene produced by a
chromosome rearrangement. Another possibility is that there is an inversion at the
breakpoint, like the known inversion at a breakpoint in the breast cancer cell line T47D
(Pole et al., 2006). Such inversions are copy number neutral and would not be identified
using microarrays.
Chapter 5
The complete karyotype of MDA-MB-134 obtained using array painting and
high-throughput sequencing
5.1 Introduction
MDA-MB-134 is a breast cancer cell line derived from a pleural effusion obtained from a patient
with metastatic breast cancer (Cailleau et al., 1974). It is a hypodiploid line with a median of 44
chromosomes. There is a subclone which has endoreduplicated and subsequently lost
chromosomes, with a median of 66 chromosomes. The main rearrangements are two copies of
a large marker chromosome with amplification of chromosome 8 and 11, and the
der(15)t(15;17) and der(18)t(16;18) translocations (Figure 5.1) (Davidson et al., 2000).
Figure 5.1. SKY karyotype of MDA-MB-134 (Davidson et al., 2000).
The aim of the work was to map all the chromosome rearrangements in MDA-MB-134 to high
resolution, and search for gene fusions or other rearrangements which affect gene expression
and function. Chromosome 8 and 11 amplifications are common in breast cancer (Lafage et al.,
1992; Lemieux et al., 1996; Bautista and Theillet, 1998; Paterson et al., 2007), and 8p12 and
11q13 are found co-amplified in 8.2% of cases (Letessier et al., 2006). MDA-MB-134 is a model
for chromosome 8 and 11 amplification and has a simple karyotype with few other
translocations, suggesting that the rearrangements in the amplicon are important driving
events in carcinogenesis, and they will be easier to analyse in a cell line with a low level of other
rearrangements.
5.1.1 Previous work
The large homogeneously staining region of the marker chromosome was shown to contain
sequences from chromosomes 8p11-12 and 11q13 arranged in a complex structure (Lafage et
al., 1992). Microdissection of the hsr and hybridization to normal metaphases suggested that
sequences from 8p12 and 11q13 were co-amplified, with a block of amplified DNA from 8q24
between the co-amplified regions (Guan et al., 1994). Further FISH suggested a rearrangement
between the centromere of chromosome 11 and the juxtacentromeric region of chromosome 8
(Lemieux et al., 1996). A mostly complete copy of 8q forms the short arm of the marker
chromosome, with a deletion of MYC, and there is no amplification of the copy of MYC located
between the co-amplified regions.
Array CGH shows amplification of 8p12 and 11q13 with few other copy number changes. The
amplicon on 8p12 is large and spans 7Mb of chromosome 8 (from 34.7 to 41.5Mb) and includes
FGFR1 and ZNF703, while the chromosome 11 amplicon is smaller and contains two separate
regions of amplification, one covering CCND1 and the other containing EMSY and GARP
(Paterson et al., 2007). FISH shows that the amplicon has a complex interdigitated structure,
with two blocks of 8p12 and 11q13 amplification separated by a region from 8q, and the
chromosome with the amplification is present in two copies (Figure 5.2). Overlapping signals
from chromosome 8 and 11 are seen, but the complete arrangement of the amplicon could not
be derived using FISH.
Figure 5.2. Schematic of the chromosome 8 and 11 amplifications in MDA-MB-134. There is a
single copy of normal 8 and 11, and two copies of the der(11) marker chromosome with two
regions of intermingled 8p12/11q13 separated by a single copy of material from 8q24.
(Adapted from Paterson et al. 2007).
5.2 Results
5.2.1 Assembling a complete karyotype of MDA-MB-134
The aim was to use array painting to characterize the chromosomal rearrangements which
could be seen on the SKY karyotype. The SKY karyotype suggested there were three rearranged
chromosomes: the der(15)t(15;17), the der(18)t(16;18), and two copies of the der(8)t(8;11),
which previous work suggested were identical to the resolution of FISH and SKY and would be
in the same position in the flow karyotype.
Comparison of the chromosome fractions of MDA-MB-134 to the chromosomes sorted from a
normal human cell line showed six fractions in an abnormal position on the flow sort (Figure
5.3). These fractions may contain the rearranged chromosomes, or they may be outlying
regions of the normal chromosome fractions, as the fractions are not as tightly sorted as the
normal comparison. Fractions A and B were both collected as it was not possible to determine
which was the der(8)t(8;11) and which was the normal chromosome 1 from their position on
the flow sort. Fraction F was collected as, while it is common for two different homologues of
chromosome 21 to form separate fractions, as can be seen in the normal flow sort (Figure 5.3A),
it could not be determined from the flow sort whether it was a different homologue of
chromosome 21 or one of the rearranged chromosomes. Fractions C, D and E were collected as
they were in an unexpected position, and may be either rearranged chromosomes or outliers
from the normal fractions.
Figure 5.3. Flow karyotyping of abnormal chromosomes in MDA-MB-134. A: Flow karyotype of
chromosomes from the normal cell line GM11321B with chromosome fractions labelled
(normal karyotype courtesy of Bee Lin Ng, Wellcome Trust Sanger Institute). B: Flow karyotype
of chromosomes from MDA-MB-134. The six labelled fractions look to be in an abnormal
position on the graph compared to the normal chromosomes, and may be the fractions
containing chromosomes with translocations.
To determine which of the six candidate fractions contained the rearranged chromosomes, they
were reverse chromosome painted to normal metaphases. The sorted chromosomes were
amplified using the GenomiPhi DNA Amplification Kit, and hybridised to normal metaphase
spreads.
Four of the candidate chromosome fractions contained normal chromosomes. Fraction B was
chromosome 1, fraction D and fraction E were outlying regions of the fractions for
chromosomes 9-12 (which co-localise) and chromosome 7 respectively, and fraction F was an
extra fraction for 21 caused by the different homologues of the chromosome sorting into
separate fractions. Fractions A and C contained two of the expected derivative chromosomes –
fraction A was the large t(8;11) chromosome and fraction C was the der(15)t(15;17) (Figure
5.4).
Figure 5.4. Reverse chromosome painting of sorted chromosome fractions to normal (DRM)
metaphases. A – chromosome fraction A, showing signal on chromosomes 8 and 11. B –
chromosome fraction C showing signal on chromosomes 15 and 17.
The der(18)t(16;18) was not found in any of the candidate abnormal fractions. It was possible
that the size of the rearranged chromosome caused it to co-localise on the flow karyotype with
a normal chromosome. If this was the case, the count of the number of events in that
chromosome fraction would be higher than expected, as the fraction would contain three
rather than two chromosomes.
Figure 5.5A shows the count of events in each chromosome fraction. The trend towards more
events in the fractions for the shorter chromosomes is due to more of the shorter
chromosomes being retained during the preparation for flow sorting, but even accounting for
that trend the fraction for chromosome 14 showed an unusually high number of events for a
fraction which should contain 2 chromosomes – over 2200 events when 1400 would be
expected. This suggested that the der(18)t(16;18) chromosome may be contained in the
chromosome 14 fraction, and reverse chromosome painting showed that it hybridized to the
whole of chromosome 14, the p arm of chromosome 16, and the q arm of chromosome 18
(Figure 5.5B).
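The reasoning from event counts to chromosome number is a simple proportion, sketched below in Python. The function name is hypothetical and the two counts are the approximate figures quoted in the text ("over 2200" observed, 1400 expected), so the result is indicative only.

```python
def estimated_chromosomes(observed_events, expected_events,
                          expected_chromosomes=2):
    """Scale the expected chromosome number by the excess of sorted events.

    expected_events is the count predicted for a fraction holding
    expected_chromosomes chromosomes, after allowing for the size trend.
    """
    return expected_chromosomes * observed_events / expected_events


# Approximate counts quoted in the text for the chromosome 14 fraction.
estimate = estimated_chromosomes(2200, 1400)
nearest_whole = round(estimate)
```

With these approximate counts the estimate rounds to three chromosomes, which is what would be expected if one derivative chromosome co-sorts with the two copies of chromosome 14.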
Figure 5.5. Locating the der(18)t(16;18) chromosome in MDA-MB-134. A - a graph showing the
counts of each chromosome fraction in MDA-MB-134 during chromosome sorting, showing
more than the expected number of chromosomes in the chromosome 14 fraction. The
trendline shows that more of the longer chromosomes are lost during the preparation for
sorting. B - reverse painting of the chromosome 14 fraction to normal (DRM) metaphases,
showing signal on chromosomes 14, 16, and 18.
5.2.2 Array painting of rearranged chromosomes
The breakpoints on the three rearranged chromosomes were further mapped by array painting
of the sorted chromosome fractions. The arrays used were 1Mb BAC arrays with 3,439 probes
spread across the genome, giving a probe approximately every megabase, and 90 probes at
higher density between 30.9 and 41.4Mb on chromosome 8, giving a probe on average every
120kb across this region. Amplified sorted chromosomes were labeled and hybridized to arrays
against labeled normal female DNA, and the ratio of the signals was used to find regions which
were present or absent in the sorted chromosome compared to normal. The aim of using array
painting instead of whole genome CGH was that the breakpoints could be unambiguously
assigned to the rearranged chromosome as only the chromosomes present in each
chromosome fraction are hybridized to the array.
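As an illustration of how breakpoints might be read off such present/absent data, the sketch below classifies each probe by its log2 ratio and reports a breakpoint only where runs of at least three consecutive probes in the same state meet, borrowing the three-clone rule described for the 1Mb array painting in Chapter 4. The function name, the `present_threshold` value and the example ratios are all assumptions made for the example, not values taken from the experiments.

```python
def call_breakpoints(log2_ratios, present_threshold=-1.0, min_run=3):
    """Return probe indices at which the present/absent state changes.

    Probes are taken in genomic order. A change is reported only when the
    runs on both sides contain at least min_run consecutive probes, so an
    isolated outlier probe does not create a breakpoint.
    """
    states = [ratio > present_threshold for ratio in log2_ratios]

    # Collapse the probe states into runs: [state, first index, length].
    runs = []
    for index, state in enumerate(states):
        if runs and runs[-1][0] == state:
            runs[-1][2] += 1
        else:
            runs.append([state, index, 1])

    breakpoints = []
    for left, right in zip(runs, runs[1:]):
        if left[2] >= min_run and right[2] >= min_run:
            breakpoints.append(right[1])  # first probe after the change
    return breakpoints


# Invented log2 ratios: five probes present, then four absent.
clean = [0.1, 0.0, 0.2, 0.1, 0.0, -3.1, -2.9, -3.0, -3.2]
# Invented log2 ratios with one outlier probe in a retained region.
noisy = [0.1, 0.0, 0.2, -3.0, 0.1, 0.0, 0.2]

clean_calls = call_breakpoints(clean)
noisy_calls = call_breakpoints(noisy)
```

A rule of this kind tolerates single misbehaving probes, at the cost of missing changes that span fewer than three probes.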
As reverse chromosome painting showed that two of the sorted fractions contained only the
derivative chromosome, while the t(16;18) co-localised with chromosome 14, which is not
known to be involved in the rearrangement, the breakpoints could be unambiguously assigned
to a particular derivative chromosome.
Array painting of the fraction containing the der(15)t(15;17) showed the breakpoints on each
chromosome to be centromeric (Figure 5.6). The t(15;17) is formed of 15q joined to 17q. The
array has no probes on the p arm of chromosome 15, but the whole of the q arm was retained,
so the breakpoint is likely centromeric.
Figure 5.6. Array painting of MDA-MB-134 chromosome fraction C. Plots show the log2 ratio of
the hybridization of the test chromosome fraction against a normal reference genome versus
the distance along the genome or chromosome. From top to bottom: whole genome plot,
chromosome 15, chromosome 17. The array has no probes on the p arm on chromosome 15.
The array is known to have a number of misidentified BACs, which probably account for the
probes which do not show the expected signal. Each probe is duplicated on the array, and both
copies are plotted separately on these graphs.
Array painting of the fraction containing the der(18)t(16;18) (as well as chromosome 14)
showed that the breakpoints on chromosomes 16 and 18 were also centromeric (Figure 5.7).
Figure 5.7. Array painting of MDA-MB-134 fraction 14. Plots show the log2 ratio of the
hybridization of the test chromosome fraction against a normal reference genome versus the
distance along the genome or chromosome. From top to bottom: whole genome plot,
chromosome 16, chromosome 18. The whole genome plot shows signal from chromosome 14,
as it co-localises with the derivative chromosome during flow sorting and cannot be separated.
Array painting showed that the translocations involve whole chromosome arms but could not
determine whether the chromosome arms were joined at centromeres, telomeres, or had more
complicated rearrangements which did not affect the copy number. To determine the
orientation of the translocated fragments, FISH was performed using probes near the
centromeres of the chromosome fragments (Figures 5.8 and 5.9). The derivative chromosomes
were joined at the centromeres in both the 15;17 and 16;18 translocations, and did not show a
telomeric fusion.
Figure 5.8. FISH to show orientation of chromosome fragments. Chromosome 17 paint labelled
with Spectrum Orange (blue). Probe on 15q12 is RP11-570N16, labelled with Digoxigenin
(green). Probe on 17q11.2 is RP11-403E9, labelled with Biotin (red). A is a normal (DRM)
metaphase, B is an MDA-MB-134 extended chromosome preparation metaphase. Arrow shows
the der(15)t(15;17) chromosome is formed of 15q and 17q chromosome fragments joined at
the centromeres. There is no telomeric fusion.
Figure 5.9. FISH to show orientation of chromosome fragments. Chromosome 18 paint labelled
with Spectrum Orange (blue). Probe on 18q11.1 is RP11-280C8 labelled with Digoxigenin
(green). Probe on 16p11.2 is RP11-2C24 labelled with Biotin (red). A is a normal (DRM)
metaphase, B is an extended MDA-MB-134 metaphase. The green signal is on 18q near the
centromere, and the red signal is on 16q near the centromere.
Array painting of the t(8;11) derivative chromosome allowed the amplified regions and
breakpoints to be determined to higher resolution than through previous FISH and microarray
studies (Figure 5.10). Using the criterion that a minimum of two adjacent probes showing a
change in hybridisation ratio are needed to confirm a gain or loss, the 8p amplification
appeared to have two separate regions of gain, from 30,945,723 to 31,288,495bp and from
34,631,328 to 40,796,500bp. The positions were taken from the start and end points of the BAC
probes which were gained, but as the probes were spaced at megabase intervals the breakpoints
could only be determined approximately. There was an extra copy of the whole of 8q, with two
deletions between 109,137,354 and 122,721,235bp and between 134,255,043 and 139,307,409bp.
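The two-adjacent-probe criterion used above can be illustrated with a short sketch. This is not the code used for the analysis: the function name `call_segments` and the log2 ratio threshold of 0.3 are assumptions made for the example, and the probes are assumed to be supplied in genomic order with the duplicate spots already combined.

```python
from typing import List, Tuple


def call_segments(log2_ratios: List[float],
                  threshold: float = 0.3,   # assumed cut-off, not taken from the thesis
                  min_probes: int = 2) -> List[Tuple[int, int, str]]:
    """Return (first_index, last_index, 'gain' or 'loss') for every run of at
    least `min_probes` adjacent probes whose ratio is beyond the threshold."""

    def state(ratio: float) -> str:
        if ratio >= threshold:
            return "gain"
        if ratio <= -threshold:
            return "loss"
        return "normal"

    segments = []
    run_start = 0
    for i in range(1, len(log2_ratios) + 1):
        # Close the current run at the end of the data or when the state changes.
        if i == len(log2_ratios) or state(log2_ratios[i]) != state(log2_ratios[run_start]):
            run_state = state(log2_ratios[run_start])
            if run_state != "normal" and i - run_start >= min_probes:
                segments.append((run_start, i - 1, run_state))
            run_start = i
    return segments
```

A change seen in only one probe is ignored by this rule, which is why single-probe changes on the 1Mb array need confirmation on a higher-resolution platform.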
Figure 5.10. Array painting of MDA-MB-134 fraction A. Plots show the log2 ratio of the
hybridization of the test chromosome fraction against a normal reference genome versus the
distance along the genome or chromosome. From top to bottom: whole genome plot,
chromosome 8, chromosome 11. The signal from chromosome 1 and chromosome 7 seen in
the whole genome plot is contamination of the flow-sorted chromosome fraction. The boxes
mark the amplicons on chromosomes 8 and 11.
Chromosome 11 showed a gain of most of 11p from the start to 42,018,028bp, and the q arm
had gained a region between the centromere and 62,403,825bp. The amplicon on 11q showed
a higher log2 ratio of signals, indicating more copies of this region had been gained, and the
amplicon could be divided into two regions, 68,278,585-70,789,798bp and
73,676,966-78,791,818bp. There may also be further rearrangements within the amplicon which could not
be confirmed at this resolution, as two adjacent probes could not be called as gained or lost.
The extent of the amplification was subsequently confirmed using data from Affymetrix SNP6.0
arrays (Bignell et al., 2007), which have probes on average every 700bp and allow breakpoints
to be called to much higher resolution. The low-resolution array painting agreed with the SNP6.0
data (Figure 5.11), except for the small amplicon around 31Mb on chromosome 8, which was
called from two BAC probes and may be due to mismapped BACs. However, further
rearrangements suggested by single probe changes on the 1Mb array could be confirmed on
the higher resolution array, including a further high-level amplification on chromosome 8
between 21,375,101 and 21,983,001bp. This amplification includes FGF17, which is
overexpressed in prostate cancer and associated with poor prognosis (Heer et al., 2004). A
summary of the chromosome rearrangements found on chromosomes 8 and 11 is shown in
Figures 5.12 to 5.14.
Figure 5.11. Comparison of 1Mb array painting data (in red) and whole-genome SNP6.0 data (in
black). Data has been scaled to allow comparison of the two data sets. A – MDA-MB-134
chromosome 8. The two arrays agree on the extent of the 8p amplification, and there is an
additional amplification at 21.3-21.9Mb (indicated by the arrow) which is represented on the
1Mb array by a single BAC. Each BAC is present in duplicate on the array, so the 2 points in the
amplification represent the same BAC. B – MDA-MB-134 chromosome 11, showing agreement
between the two arrays on the amplification.
Figure 5.12. Amplifications on chromosome 8 in MDA-MB-134 found using array painting and
confirmed by SNP6.0 arrays. A – Ideogram of chromosome 8. The red boxes mark the amplified
regions found in the t(8;11) chromosome in MDA-MB-134. B – the genes found in the smaller
amplified region 21.3-21.9Mb. C – the genes found in the larger amplified region 34.6-40.7Mb,
including ZNF703 and FGFR1.
Figure 5.13. Deletions on chromosome 8 in MDA-MB-134 found using array painting and
confirmed by SNP6.0 arrays. A – Ideogram of chromosome 8. The red boxes mark the deletions
found in the t(8;11) chromosome in MDA-MB-134. B – the genes found in the larger deleted
region 109.1-122.7Mb. C – the genes found in the smaller deleted region 134.2-139.3Mb.
Figure 5.14. Amplifications on chromosome 11 in MDA-MB-134 found using array painting and
confirmed by SNP6.0 arrays. A – Ideogram of chromosome 11. The red boxes mark the
amplifications found in the t(8;11) chromosome in MDA-MB-134. B – the genes found in the
larger amplified region, 68.3-70.8Mb, including CCND1. C – the genes found in the smaller
amplified region, 73.7-78.8Mb.
5.2.3 High-throughput sequencing
Array-based approaches to mapping the amplifications allowed me to resolve the positions of
the breakpoints to high resolution, but could not tell which breakpoints are joined together,
which is essential to find rearrangements which may cause fusion genes.
High-throughput paired-end sequencing overcomes many of the limitations of array-based
mapping of structural variation. It can be used to map many of the structural variants in a cell
line in a single experiment, depending on the sequence coverage obtained. High-throughput
sequencing gives millions of short sequence reads from across the genome in one experiment
by sequencing small fragments of the genome in a massively parallel process. Paired-end
sequencing gives reads from both ends of the fragment, which can be aligned to a human
reference genome. Any fragments which contain a genomic rearrangement can be easily
identified, as the aligned reads will show a change in the size or orientation of the fragment
relative to the reference genome. As the fragment straddles the rearrangement,
both sides of the rearrangement can be identified, as well as the orientation of the DNA.
Identification of rearrangements is not dependent on copy number changes, allowing balanced
rearrangements to be identified.
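The way a read pair signals a rearrangement can be sketched in a few lines. This is a simplified illustration rather than the pipeline used here: the class and function names, the category labels and the `max_fragment` parameter are all assumptions made for the example. It relies only on the convention that a normal pair has its first read on the positive strand and its second read on the negative strand, a fragment-sized distance downstream.

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    chrom1: str
    pos1: int
    strand1: str  # '+' or '-'
    chrom2: str
    pos2: int
    strand2: str  # '+' or '-'


def classify_pair(pair: ReadPair, max_fragment: int) -> str:
    """Assign a read pair to a broad class of possible rearrangement."""
    if pair.chrom1 != pair.chrom2:
        return "interchromosomal"
    if pair.strand1 == pair.strand2:
        # Both reads align to the same strand: one end has been flipped.
        return "inversion-like"
    # Order the two reads by position along the chromosome.
    left, right = sorted([(pair.pos1, pair.strand1), (pair.pos2, pair.strand2)])
    if left[1] == "-" and right[1] == "+":
        # The reads point away from each other instead of towards each other.
        return "everted"
    if right[0] - left[0] > max_fragment:
        # Correct orientation, but the ends are too far apart in the reference.
        return "deletion-like"
    return "concordant"
```

Because the classification uses only position and strand, it does not depend on any change in copy number.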
For paired-end sequencing using the Illumina GAII platform, genomic DNA is fragmented and
size-selected for the desired fragment size, which can be up to 800bp (Figure 5.15). Adaptors
are ligated to each end of the fragments and amplified with 20 cycles of PCR. A sample from the
fragment library is placed onto the flow cell where the adaptors adhere to a ‘lawn’ of primers.
Each fragment is amplified on the flow cell surface to produce a cluster of identical single-
stranded DNA strands. For each cycle of sequencing, four fluorescently-labelled nucleotides and
DNA polymerase are added to the flow cell. Each nucleotide has a reversibly-blocked 3’-OH
group so that only one base is incorporated at each step. The flow cell is imaged, then the
nucleotides are unblocked and another round of sequencing can take place. Each base pair is
called from the images with an associated quality score. For paired-end sequencing, each
cluster is re-amplified and a second round of sequencing proceeds from the adaptor ligated to
the other end of the fragments. This gives paired reads where the first read is from one end of
the fragment, and the second read from the opposite end of the fragment (Mardis, 2008).
Figure 5.15. The Illumina high-throughput sequencing strategy. A – library preparation. The genomic
DNA is fragmented and size-selected to give a population of small fragments, which have specific
adaptors ligated to either end. B – cluster formation. Single-strand DNA fragments are bound to the flow
cell surface, and cycles of bridge amplification create clusters of up to a million fragments. C – DNA
polymerase and labelled nucleotides are added to the flow cell, and one fluorescently-labelled base is
incorporated. The remaining nucleotides and polymerase are washed off, and an image of the whole
flow cell is taken. The 3’-OH block and fluorescent label are removed from the incorporated nucleotide,
and further rounds of synthesis take place. D – The images produced from the flow cell are used to call
the base pairs incorporated into each fragment.
5.2.4 Paired-end read high-throughput sequencing of MDA-MB-134
A library of genomic fragments from MDA-MB-134 was prepared by Dr Jessica Pole and 18
million paired-end reads were sequenced using the Illumina GAII sequencer by the Genomics
Facility of the Cancer Research UK Cambridge Research Institute. 37bp was sequenced from
each end of the ~450bp genomic fragments and aligned to the human reference genome by the
Bioinformatics Facility. Pairs where either end did not map uniquely to the genome were
discarded, as they are likely to be fragments produced from repeat regions where the sequences
match multiple regions of the genome. For a set of reads which were exact duplicates of each
other only one read was retained, as these were thought to be either PCR duplicates caused by
amplification of the same fragment during the PCR amplification step of the library preparation,
or optical duplicates caused by the same cluster being read as two clusters during the imaging
step of sequencing. (See Chapter 6 for more detail of the bioinformatics used to process the
sequencing data.)
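The two filtering steps can be sketched as follows. The processing actually used is described in Chapter 6; this example only illustrates the logic. The dictionary keys (`unique1`, `chrom1` and so on) are hypothetical, and it is an assumption of the sketch that duplicates are recognised by identical mapped positions and strands at both ends.

```python
def filter_pairs(pairs):
    """Keep read pairs in which both ends map uniquely, retaining only one
    pair from each set of exact duplicates."""
    seen = set()
    kept = []
    for pair in pairs:
        # Discard the pair if either end does not map uniquely.
        if not (pair["unique1"] and pair["unique2"]):
            continue
        # Pairs with identical coordinates and strands are treated as duplicates.
        key = (pair["chrom1"], pair["pos1"], pair["strand1"],
               pair["chrom2"], pair["pos2"], pair["strand2"])
        if key in seen:
            continue
        seen.add(key)
        kept.append(pair)
    return kept
```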
I analysed the 12 million uniquely-mapping non-duplicated paired reads left after this filtering
process to find structural variants and copy number changes. Although each paired read is only
74bp of sequence, a rearrangement anywhere in the fragment between the reads will be
detected from the end sequences, giving around 1X diploid genome coverage. At this level of
coverage, around 25% of the single-copy rearrangements in the genome would be detected as
we require 2 independent reads to support any rearrangement. The amplified regions will
represent proportionally more of the fragment library, as they are at a higher copy number and
there are 2 copies of the der(8)t(8;11) chromosome, so the coverage will be higher in the
amplified regions and more of the rearrangements in these regions will be detected.
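The coverage and detection figures quoted above can be reproduced approximately with a small calculation. The following is a sketch under stated assumptions: a diploid genome of about 6 x 10^9 bp, and a Poisson distribution for the number of fragments spanning any one junction. The Poisson model is an assumption of this example and is not necessarily how the estimate in the text was derived.

```python
import math


def physical_coverage(n_pairs: int, fragment_bp: int, genome_bp: float) -> float:
    """Average number of fragments spanning any point in the genome."""
    return n_pairs * fragment_bp / genome_bp


def prob_at_least(k: int, mean: float) -> float:
    """P(X >= k) for a Poisson-distributed count with the given mean."""
    below = sum(math.exp(-mean) * mean ** i / math.factorial(i) for i in range(k))
    return 1.0 - below


# 12 million ~450bp fragments over an assumed diploid genome of 6e9 bp.
coverage = physical_coverage(12_000_000, 450, 6.0e9)  # about 0.9

# Chance that a single-copy junction is spanned by two or more fragments at 1X.
detected = prob_at_least(2, 1.0)  # about 0.26
```

Under these assumptions roughly a quarter of single-copy junctions would be spanned by two or more fragments, and the proportion rises quickly with copy number because the mean scales with the number of copies of the junction.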
5.2.5 Detection of structural variants
The 12 million paired reads were searched for pairs which appeared to suggest a structural
variant in the genome of MDA-MB-134. 47,446 paired reads were called as possible reads
across structural variants, as they mapped to an unexpected location or orientation and may have been
sequenced from fragments which contain a genomic rearrangement. To reduce the number of
false positive structural variants which were called due to biological or bioinformatic artefacts,
such as sequencing errors, chimeric fragments created during the library preparation process,
or misaligned reads, two reads were required to call a structural variant from the possible
reads. 679 structural variants supported by 2 or more independent paired reads were
predicted, using 2,234 of the possible reads. The remaining reads not used to support a
structural variant were presumed to be artefacts or reads where only a single read supporting a
structural variant was found.
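The requirement for two supporting reads can be illustrated by grouping abnormal pairs whose ends fall close together. This is a deliberately simple sketch and not the method used here (see Chapter 6): the 500bp window, the greedy grouping and the function name are assumptions made for the example, and the pairs are assumed to have had duplicates removed so that each one is an independent fragment.

```python
from collections import defaultdict


def cluster_pairs(pairs, window=500, min_support=2):
    """Group abnormal read pairs that could support the same junction.

    pairs: list of (chrom1, pos1, strand1, chrom2, pos2, strand2) tuples.
    Returns only the groups with at least `min_support` pairs.
    """
    # Pairs can only support the same junction if they join the same
    # chromosomes in the same orientation.
    groups = defaultdict(list)
    for p in pairs:
        groups[(p[0], p[2], p[3], p[5])].append(p)

    clusters = []
    for members in groups.values():
        members.sort(key=lambda p: (p[1], p[4]))
        current = [members[0]]
        for p in members[1:]:
            last = current[-1]
            # Both ends must lie within `window` bp of the previous pair.
            if p[1] - last[1] <= window and abs(p[4] - last[4]) <= window:
                current.append(p)
            else:
                clusters.append(current)
                current = [p]
        clusters.append(current)
    return [c for c in clusters if len(c) >= min_support]
```

A pair with no neighbour is dropped, which removes most chimeric fragments and misalignments at the cost of missing real junctions that happen to be covered only once.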
The structural variants were divided into 5 categories based on the probable type of
rearrangement which could be inferred from the reads (Table 5.1). (See Chapter 6 for details of
how the types of structural variant were inferred.)
52 variants mapped to regions of known copy number variation found in the human
population, based on the data in Conrad et al. (2010). There is no matching normal cell line for
MDA-MB-134 which would confirm these are germline rearrangements, but it is likely that they
are not somatic rearrangements, and they were removed from further analysis.
Category of structural variant      Number
Interchromosomal translocation          16
Deletions larger than 10kb               6
Deletions between 1kb and 10kb          73
Deletions smaller than 1kb             485
Insertion                               15
Inversion                               14
Inverted tandem repeat                   3
Table 5.1. Structural variants in MDA-MB-134, after removal of structural variants in regions of
common copy number variation but before any further filtering.
Of the 612 remaining structural variants, over 90% were intrachromosomal rearrangements
under 10Kb, and 80% were under 1Kb. It is likely that some of the small rearrangements were
false positives caused by the thresholds chosen to find structural variants. Any read pair with a
fragment size larger than 3 standard deviations from the median was called as a possible
abnormal read, which will incorrectly call some reads as aberrant when they are in the
expected 0.3% of reads which fall outside this size threshold, and any region with two reads
which fall into the tail of the distribution may be called as a small deletion. Raising the
threshold to call an abnormal read would reduce the number of false positives but remove
some real small deletions.
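The effect of the size threshold can be made concrete with a short sketch. It assumes that fragment sizes are approximately normally distributed, which is an assumption of this example, and the function names are hypothetical.

```python
import math


def tail_fraction(n_sd: float) -> float:
    """Two-sided fraction of a normal distribution lying more than
    n_sd standard deviations from the centre."""
    return math.erfc(n_sd / math.sqrt(2))


def is_abnormal_size(fragment_size: int, median: float, sd: float,
                     n_sd: float = 3.0) -> bool:
    """Flag a read pair whose apparent fragment size is beyond the threshold."""
    return abs(fragment_size - median) > n_sd * sd


fraction_at_3sd = tail_fraction(3.0)  # about 0.0027, the 0.3% quoted above
fraction_at_4sd = tail_fraction(4.0)  # much smaller, but small deletions are lost
```

Under the normality assumption, a fraction of about 0.3% applied to 12 million pairs corresponds to tens of thousands of normal pairs being flagged by chance, which is why support from a second read matters most for the smallest predicted deletions.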
5.2.6 Validation of structural variants
As my interest was primarily in the structure of the amplicon, I decided to concentrate on
validating the rearrangements between chromosomes and the intrachromosomal
rearrangements larger than 10kb, which amounted to 42 structural variants. The reads
supporting the predicted structural variants were re-aligned to the reference genome using
BLAT (Kent, 2002), which is a more sensitive but slower alignment tool, and reports multiple
possible alignments while the faster Maq alignment reports only the best possible alignment
that it has found, which may not always be correct. 7 of the predicted structural variants were
removed as the reads were in repeat regions including centromeres, or had other good
alignments which suggested the pair were a normal read and not a structural variant.
The 35 selected structural variants (Table 5.2) were validated using PCR. As the median size of
the fragments in the library was 454bp, the position of the breakpoints was known to ~500bp
resolution. Primers were designed to amplify the breakpoints in genomic DNA from MDA-MB-
134. A pool of normal human female DNA was used as a control.
Type of structural variant  Supporting reads  Chromosome  Breakpoint region start  Breakpoint region end  First read strand  Chromosome  Breakpoint region start  Breakpoint region end  Second read strand
Deletion 2 8 32,799,124 32,799,329 + 8 32,810,825 32,811,014 -
Deletion 12 8 34,902,273 34,902,554 + 8 35,015,339 35,015,607 -
Deletion 3 8 39,350,844 39,350,989 + 8 39,506,397 39,506,530 -
Deletion 2 8 106,144,206 106,144,375 + 8 139,083,092 139,083,275 -
Deletion 3 11 63,284,923 63,285,214 + 11 79,554,718 79,554,997 -
Deletion 4 11 76,836,485 76,836,687 + 11 77,033,372 77,033,540 -
Deletion 2 X 52,908,480 52,908,487 + X 55,695,862 55,695,898 -
Inversion 4 7 70,058,671 70,058,799 + 7 70,076,473 70,076,605 +
Inversion 2 7 70,064,185 70,064,483 - 7 70,076,820 70,077,102 -
Inversion 7 8 36,017,686 36,017,889 + 8 36,548,255 36,548,465 +
Inversion 16 8 41,650,073 41,650,447 - 8 42,088,880 42,089,260 -
Inversion 2 8 41,773,851 41,774,164 + 8 133,032,657 133,032,959 +
Inversion 4 11 63,285,646 63,286,004 - 11 79,551,488 79,551,832 +
Inversion 2 11 63,289,773 63,289,783 + 11 78,702,800 78,702,823 +
Inversion 5 11 66,697,611 66,697,896 - 11 70,224,114 70,224,384 -
Inversion 2 11 70,656,229 70,656,305 - 11 76,410,016 76,410,063 +
Inversion 3 11 70,765,883 70,765,964 + 11 77,329,765 77,329,838 +
Inversion 5 11 73,044,828 73,045,170 - 11 77,572,905 77,573,291 +
Inversion 2 11 74,811,880 74,812,025 - 11 78,266,004 78,266,155 +
Inversion 5 11 74,812,741 74,812,993 - 11 76,838,302 76,838,571 -
Inversion 3 11 76,418,909 76,419,129 - 11 76,984,735 76,984,919 -
Inversion 2 16 21,501,819 21,501,842 + 16 22,617,924 22,617,940 +
Inversion 2 16 33,148,642 33,148,677 - 16 33,201,522 33,201,815 +
Translocation 2 2 41,905,754 41,905,954 + 4 66,096,540 66,096,761 -
Translocation 3 8 42,088,497 42,088,709 + 11 68,454,146 68,454,386 -
Translocation 2 8 124,072,313 124,072,368 - X 136,263,026 136,263,099 -
Translocation 11 11 69,633,071 69,633,448 + 8 38,665,641 38,665,965 +
Translocation 8 11 70,783,301 70,783,583 + 8 21,981,233 21,981,523 +
Translocation 9 11 74,001,145 74,001,470 + 8 36,546,068 36,546,354 -
Translocation 7 11 74,001,293 74,001,424 - 8 36,548,943 36,549,081 -
Translocation 7 11 74,819,515 74,819,840 + 8 21,375,639 21,375,999 -
Translocation 25 11 75,535,691 75,536,039 - 8 34,784,458 34,784,809 -
Translocation 2 11 76,086,980 76,087,112 + 8 39,756,523 39,756,660 -
Translocation 3 12 106,727,070 106,727,245 + 7 110,840,421 110,840,624 -
Translocation 5 17 63,921,789 63,921,813 + 8 35,511,422 35,511,471 -
Table 5.2. Predicted structural variants larger than 10Kb in MDA-MB-134. Structural variants have been filtered to remove
common copy number variants and variants which had a normal mapping suggested by BLAT alignment. The read strand refers to
the alignment of the reads in the read pairs to the genome; the expected strands for a normal read pair were the first read on the
positive strand and the second read on the negative strand. (See Chapter 6 for more details on the expected strands for paired-end
reads.)
Of the 35 selected variants, 4 variants could not be validated by PCR as no bands were
produced using two different primer pairs designed to amplify these regions. Of the 31
remaining variants, 7 produced a PCR product using the control normal female DNA as well as
MDA-MB-134 DNA, and were presumed to be common polymorphisms.
The 24 validated variants not present in the pooled normal DNA are shown in Table 5.3. All but
one of the rearrangements involve chromosomes 8 and 11. This was the expected result based
on the low numbers of copy number changes seen elsewhere in the genome on the Affymetrix
SNP6.0 and array painting data, and the low sequence coverage of the genome. Using copy
number data from the high-throughput sequencing, a further 9 unbalanced rearrangements
were detected by segmentation which have no paired reads supporting them. 5 of these
changes were single-copy deletions, and the other 4 rearrangements were small gains. These
rearrangements may be missed due to low copy number, or because they fall in repeat regions
and the reads spanning the junction cannot be aligned (see Chapter 7 for further discussion of
the problems of finding junctions which fall in repeat regions).
The one validated structural variant which did not involve chromosomes 8 and 11 suggested an
8;X translocation. The break on chromosome X was around 136,263,000bp, and a break at this
location can be seen on a copy number plot generated from the normal paired-end reads
(Figure 5.16B). Chromosome painting showed that 20Mb of distal Xq has been translocated
onto one copy of the marker chromosome (Figure 5.16C). This rearrangement was not seen in
the SKY karyotype, although it could have been missed as it is a small piece of chromosome X or
may have been thought to be caused by chromosome overlap. It was also not present in the
SNP6.0 data (Figure 5.16A), suggesting the rearrangement is present only in our sample of
MDA-MB-134 and may have been a late event in the evolution of the cell line.
Type of structural variant  Supporting reads  Chromosome  Breakpoint region start  Breakpoint region end  Strand  Chromosome  Breakpoint region start  Breakpoint region end  Strand
Deletion 12 8 34,902,273 34,902,554 + 8 35,015,339 35,015,607 -
Deletion 2 8 106,144,206 106,144,375 + 8 139,083,092 139,083,275 -
Deletion 3 11 63,284,923 63,285,214 + 11 79,554,718 79,554,997 -
Deletion 4 11 76,836,485 76,836,687 + 11 77,033,372 77,033,540 -
Inversion 7 8 36,017,686 36,017,889 + 8 36,548,255 36,548,465 +
Inversion 16 8 41,650,073 41,650,447 - 8 42,088,880 42,089,260 -
Inversion 2 8 41,773,851 41,774,164 + 8 133,032,657 133,032,959 +
Inversion 4 11 63,285,646 63,286,004 - 11 79,551,488 79,551,832 +
Inversion 5 11 66,697,611 66,697,896 - 11 70,224,114 70,224,384 -
Inversion 2 11 70,656,229 70,656,305 - 11 76,410,016 76,410,063 +
Inversion 3 11 70,765,883 70,765,964 + 11 77,329,765 77,329,838 +
Inversion 5 11 73,044,828 73,045,170 - 11 77,572,905 77,573,291 +
Inversion 2 11 74,811,880 74,812,025 - 11 78,266,004 78,266,155 +
Inversion 5 11 74,812,741 74,812,993 - 11 76,838,302 76,838,571 -
Inversion 3 11 76,418,909 76,419,129 - 11 76,984,735 76,984,919 -
Translocation 3 8 42,088,497 42,088,709 + 11 68,454,146 68,454,386 -
Translocation 2 8 124,072,313 124,072,368 - X 136,263,026 136,263,099 -
Translocation 11 11 69,633,071 69,633,448 + 8 38,665,641 38,665,965 +
Translocation 8 11 70,783,301 70,783,583 + 8 21,981,233 21,981,523 +
Translocation 9 11 74,001,145 74,001,470 + 8 36,546,068 36,546,354 -
Translocation 7 11 74,001,293 74,001,424 - 8 36,548,943 36,549,081 -
Translocation 7 11 74,819,515 74,819,840 + 8 21,375,639 21,375,999 -
Translocation 25 11 75,535,691 75,536,039 - 8 34,784,458 34,784,809 -
Translocation 2 11 76,086,980 76,087,112 + 8 39,756,523 39,756,660 -
Table 5.3. Validated structural variants larger than 10Kb in MDA-MB-134. All variants were validated by PCR and sequencing. The
read strand refers to the alignment of the reads in the read pairs to the genome – the expected strands for a normal read pair were
the first read on the positive strand and the second read on the negative strand.
Figure 5.16. FISH to investigate an unexpected structural variant between chromosome 8 and
chromosome X. A – copy number plot from whole-genome SNP6.0 array for chromosome X of
MDA-MB-134, showing no copy number step on the q arm. B – copy number plot from high-
throughput sequencing for chromosome X of MDA-MB-134, showing a copy number step
around 136Mb. C - FISH confirming 8;X translocation in MDA-MB-134. Chromosome 8 spectrum
orange-labelled paint is blue, chromosome X FITC-labelled paint is green. One normal 8 and two
normal X chromosomes can be seen, along with two der(11) marker chromosomes, one of
which shows a translocation with chromosome X.
5.2.7 Sequencing of structural variant junctions
All of the validated variants were Sanger sequenced to confirm their positions and to find the
exact sequence at the junctions. All the breakpoints were found in the expected positions, with
the exact breakpoints being within a library insert size of the position of the reads spanning the
breakpoint (Table 5.4).
Figures 5.17-5.19 show all the validated structural variants in the genome, plotted against the
copy number data obtained from sequencing. Many of the breakpoints match the copy number
steps, although there are several structural variants which do not appear to be associated with
a copy number change, such as the junction between 74,811,827 and 78,266,347bp on
chromosome 11, which only shows a copy number step on one side of the junction. These
may be balanced breakpoints where there is no copy number change to be detected.
There are also copy number steps which are not associated with a structural variant. This could
be due to lack of coverage, as few breaks at low copy number would be detected at the current
coverage levels in MDA-MB-134, or they may represent breaks in repeat regions: if the
region is highly repetitive, any read from that region will be rejected as it will be a
perfect match to more than one location in the genome.
Of the junction sequences of the 24 variants, 9 showed no homology at the breakpoint, 13
showed homology of between 1 and 4bp, 1 variant had 13bp of homology, and 1 showed an
insertion of 1bp at the breakpoint (Table 5.4).
Type of structural variant Chromosome Breakpoint Chromosome Breakpoint Overlap/insertion Length Sequence
Translocation 11 74,819,892 8 21,375,631 Overlap 1 GT
Translocation 11 69,633,489 8 38,666,031 Overlap 1 T
Translocation 8 42,088,756 11 68,454,008 None 0 None
Translocation 11 74,001,087 8 36,548,912 Overlap 4 AGGT
Inversion 11 76,418,750 11 76,984,724 Overlap 1 A
Translocation 8 124,072,213 X 136,262,830 Overlap 4 CCCT
Translocation 11 70,783,703 8 21,981,574 Overlap 2 GT
Translocation 11 74,001,597 8 36,546,065 Insertion 1 T
Translocation 11 75,535,675 8 34,784,454 Overlap 3 TTG
Translocation 11 76,087,332 8 39,756,483 None 0 None
Inversion 11 74,811,827 11 78,266,347 Overlap 1 C
Inversion 11 74,812,675 11 76,838,258 Overlap 1 T
Deletion 11 76,836,832 11 77,033,295 None 0 None
Deletion 8 34,902,627 8 35,015,270 None 0 None
Inversion 8 36,017,999 8 36,548,613 Overlap 1 T
Inversion 8 41,650,072 8 42,088,882 Overlap 2 GT
Inversion 8 41,774,211 8 133,033,042 None 0 None
Deletion 8 106,144,619 8 139,083,096 None 0 None
Deletion 11 63,285,330 11 79,554,711 None 0 None
Inversion 11 66,697,564 11 70,224,058 Overlap 1 T
Inversion 11 70,655,988 11 76,410,171 None 0 None
Inversion 11 70,766,233 11 77,329,957 Overlap 13 TTCTTTTTGGAGA
Inversion 11 73,044,824 11 77,573,265 None 0 None
Inversion 11 63,285,642 11 79,551,886 Overlap 3 AAA
Table 5.4. The exact breakpoints and junction homology of the 24 validated structural variants in MDA-MB-134.
Figure 5.17. Structural variants in MDA-MB-134 plotted on a circular genome.
Interchromosomal rearrangements are plotted in green, while intrachromosomal
rearrangements are shown in blue. The plot was generated using the Circos software
(Krzywinski et al., 2009).
Figure 5.18. Structural variants in MDA-MB-134 showing chromosomes 8 and 11 only. The blue
lines mark intrachromosomal rearrangements, and the green lines mark interchromosomal
rearrangements. The histogram in red shows copy number segments predicted using the
DNACopy program. The figure was generated using Circos (Krzywinski et al., 2009).
Figure 5.19. Structure of the 8;11 amplicon in MDA-MB-134. The plots show loess-corrected
copy number from high-throughput sequencing, with the upper plot showing the amplified
regions of chromosome 8, and the lower plot showing the amplified regions of chromosome 11.
The dotted red lines mark the breakpoints of validated structural variants, with the blue and
green lines showing the interchromosomal and intrachromosomal rearrangements. (See
Chapter 6 for details of how the loess correction of copy number was performed.)
5.2.8 Potential fusion genes found by high-throughput sequencing
The potential fusion genes at each breakpoint could be predicted using high-throughput
sequencing, as the breakpoints could be determined to high enough resolution that the genes
at both sides of the breakpoint could be identified.
All of the structural variants were used to test for potential fusion genes, readthrough fusion
genes, and internal exon deletions in MDA-MB-134. This included the small rearrangements
which were not selected for PCR validation, as only a few of these rearrangements affected
exons. The majority of the structural variants did not affect genes, or were small
rearrangements which only rearranged introns.
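The first step in predicting a fusion gene, finding the genes which contain each breakpoint, can be sketched as below. This is an illustration only: the `Gene` record, the function names and the toy annotation are hypothetical, and the sketch ignores gene strand, read orientation and reading frame, all of which must be considered before a fusion transcript can be predicted.

```python
from typing import List, NamedTuple, Optional, Tuple


class Gene(NamedTuple):
    name: str
    chrom: str
    start: int
    end: int


def gene_at(genes: List[Gene], chrom: str, pos: int) -> Optional[str]:
    """Return the name of the first gene containing the position, if any."""
    for gene in genes:
        if gene.chrom == chrom and gene.start <= pos <= gene.end:
            return gene.name
    return None


def candidate_fusion(genes: List[Gene],
                     breakpoint1: Tuple[str, int],
                     breakpoint2: Tuple[str, int]) -> Optional[Tuple[str, str]]:
    """Report a candidate only when both breakpoints fall in different genes."""
    gene1 = gene_at(genes, *breakpoint1)
    gene2 = gene_at(genes, *breakpoint2)
    if gene1 and gene2 and gene1 != gene2:
        return (gene1, gene2)
    return None
```

A junction with only one broken gene returns no candidate here; such cases correspond to the possible readthrough fusions, where the broken gene may read through into a neighbouring intact gene.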
4 fusion genes were predicted (Table 5.5), of which 3 were part of the 8;11 amplicon. Primer
pairs were designed which would amplify each gene separately, to see if it was expressed, and
in combination would amplify a fusion product (Figure 5.20). The results of the PCR to detect
fusion genes are shown in Figures 5.21 and 5.22. No fusion products were detected, but three
genes were expressed in MDA-MB-134 but not in the human immortalized breast epithelial cell
line HB4a. These three genes, ODZ4, SHANK2, and UNC5D, are all in the amplified regions on
chromosomes 8 and 11.
8 readthrough fusions were predicted (Table 5.6), of which 6 were part of the 8;11 amplicon.
No readthrough fusions were found. The results of the PCR are shown in Figure 5.23. KLHL35,
which also forms a potential fusion gene with ODZ4, forms a potential readthrough fusion with
AQP11, but no expression of this readthrough was detected, and it appears to be a separate
event to the rearrangement which causes KLHL35 and ODZ4 to be fused (Figure 5.24).
14 structural variations were predicted to cause deletions of 1 or more exons from a gene
(Table 5.7). All these deletions are small deletions under 10Kb, and in contrast to the predicted
fusion genes and readthrough fusions, none of the rearrangements are part of the 8;11
amplicon. PCR primers were designed to either side of the deletion to test for shorter
transcripts in MDA-MB-134, which would indicate a possible deletion of exons (Figure 5.25). No
such transcripts were detected.
Type of structural variant  Read         Chromosome  Breakpoint region start  Breakpoint region end  Read strand  Genes    Predicted fusion
Insertion                   First read   11          74811880                 74812069               -            KLHL35   5' of KLHL35 into 3' of ODZ4
                            Second read  11          78266004                 78266199               +            ODZ4
Translocation               First read   17          63921789                 63921857               +            ARSG     5' of ARSG into 3' of UNC5D
                            Second read  8           35511422                 35511515               -            UNC5D
Inversion                   First read   8           41773851                 41774208               +            ANK1     5' of EFR3A into 3' of ANK1
                            Second read  8           133032657                133033003              +            EFR3A
Inversion                   First read   11          66697611                 66697940               -            FBXL11   5' of SHANK2 into 3' of FBXL11
                            Second read  11          70224114                 70224428               -            SHANK2
Table 5.5. Gene fusions in MDA-MB-134 predicted from structural variants called from paired-end sequencing. The read strand
refers to the alignment of the reads in the read pairs to the genome.
Type of structural variant  Read         Chromosome  Breakpoint region start  Breakpoint region end  Read strand  Broken gene  Readthrough partner  Predicted fusion gene
Translocation               First read   11          70783301                 70783627               +            EPB49        SHANK2               EPB49 is broken and may read through into SHANK2
                            Second read  8           21981233                 21981567               +
Translocation               First read   11          76086980                 76087156               +            ADAM2        LRRC32               ADAM2 is broken and may read through into LRRC32
                            Second read  8           39756523                 39756704               -
Insertion                   First read   11          70656229                 70656349               -            ACER3        NADSYN1              ACER3 is broken and may read through into NADSYN1
                            Second read  11          76410016                 76410107               +
Translocation               First read   12          106727070                106727289              +            IMMP2L       PRDM4                IMMP2L is broken and may read through into PRDM4
                            Second read  7           110840421                110840668              -
Deletion                    First read   7           30501892                 30501945               +            GGCT         NOD1                 GGCT is broken and may read through into NOD1
                            Second read  7           30502611                 30502711               -
Inversion                   First read   11          74812741                 74813037               -            KLHL35       AQP11                KLHL35 is broken and may read through into AQP11
                            Second read  11          76838302                 76838615               -
Inversion                   First read   11          74812741                 74813037               -            PAK1         SERPINH1             PAK1 is broken and may read through into SERPINH1
                            Second read  11          76838302                 76838615               -
Inversion                   First read   8           41650073                 41650491               -            ANK1         AP3M2                ANK1 is broken and may read through into AP3M2
                            Second read  8           42088880                 42089304               -
Table 5.6. Readthrough gene fusions in MDA-MB-134 predicted from structural variants called from paired-end sequencing. The
read strand refers to the alignment of the reads in the read pairs to the genome.
Chapter 5 The complete karyotype of MDA-MB-134
141
Figure 5.20. Fusion gene PCR strategy. The green and blue arrows represent genes, while the
red jagged line represents a breakpoint in the gene. PCR primers were designed which would
amplify the normal cDNA from each gene, and in combination would amplify only a fusion
product.
Figure 5.21. PCR to detect fusion transcripts in MDA-MB-134. No fusion transcripts were
detected. See next page for key to PCR reactions.
Upper row Lower row
Well Primers cDNA Well Primers cDNA
1 GapDH None 1 ANK1 Reference
2 GapDH Reference 2 ANK1 HB4a
3 GapDH HB4a 3 ANK1 MDA-MB-134
4 GapDH MDA-MB-134 4 EFR3A/ANK1 HB4a
5 ARSG Reference 5 EFR3A/ANK1 MDA-MB-134
6 ARSG HB4a 6 SHANK2 Reference
7 ARSG MDA-MB-134 7 SHANK2 HB4a
8 UNC5D Reference 8 SHANK2 MDA-MB-134
9 UNC5D HB4a 9 FBXL11 Reference
10 UNC5D MDA-MB-134 10 FBXL11 HB4a
11 ARSG/UNC5D HB4a 11 FBXL11 MDA-MB-134
12 ARSG/UNC5D MDA-MB-134 12 SHANK2/FBXL11 HB4a
13 EFR3A Reference 13 SHANK2/FBXL11 MDA-MB-134
14 EFR3A HB4a
15 EFR3A MDA-MB-134
Figure 5.22. PCR to detect fusion transcripts in MDA-MB-134. No fusion transcripts were
detected.
Well Primers cDNA
1 GapDH None
2 GapDH Reference
3 Blank Blank
4 Blank Blank
5 KLHL35 Reference
6 KLHL35 HB4a
7 KLHL35 MDA-MB-134
8 ODZ4 Reference
9 ODZ4 HB4a
10 ODZ4 MDA-MB-134
11 KLHL35/ODZ4 HB4a
12 KLHL35/ODZ4 MDA-MB-134
Figure 5.23. PCR to test for predicted
readthrough fusions in MDA-MB-134. A key
to the wells can be found on the following
page. No fusion transcripts were detected.
Row Column Primers cDNA
1 1 EPB49 Reference
1 2 EPB49 MDA-MB-134
1 3 SHANK2 Reference
1 4 SHANK2 MDA-MB-134
1 5 EPB49/SHANK2 Reference
1 6 EPB49/SHANK2 MDA-MB-134
2 1 ADAM2 Reference
2 2 ADAM2 MDA-MB-134
2 3 LRRC32 Reference
2 4 LRRC32 MDA-MB-134
2 5 ADAM2/LRRC32 Reference
2 6 ADAM2/LRRC32 MDA-MB-134
3 1 ACER3 Reference
3 2 ACER3 MDA-MB-134
3 3 NADSYN1 Reference
3 4 NADSYN1 MDA-MB-134
3 5 ACER3/NADSYN1 Reference
3 6 ACER3/NADSYN1 MDA-MB-134
4 1 IMMP2L Reference
4 2 IMMP2L MDA-MB-134
4 3 PRDM4 Reference
4 4 PRDM4 MDA-MB-134
4 5 IMMP2L/PRDM4 Reference
4 6 IMMP2L/PRDM4 MDA-MB-134
5 1 GGCT Reference
5 2 GGCT MDA-MB-134
5 3 NOD1 Reference
5 4 NOD1 MDA-MB-134
5 5 GGCT/NOD1 Reference
5 6 GGCT/NOD1 MDA-MB-134
6 1 KLHL35 Reference
6 2 KLHL35 MDA-MB-134
6 3 AQP11 Reference
6 4 AQP11 MDA-MB-134
6 5 KLHL35/AQP11 Reference
6 6 KLHL35/AQP11 MDA-MB-134
7 1 ANK1 Reference
7 2 ANK1 MDA-MB-134
7 3 AP3M2 Reference
7 4 AP3M2 MDA-MB-134
7 5 ANK1/AP3M2 Reference
7 6 ANK1/AP3M2 MDA-MB-134
8 1 PAK1 Reference
8 2 PAK1 MDA-MB-134
8 3 SERPINH1 Reference
8 4 SERPINH1 MDA-MB-134
8 5 PAK1/SERPINH1 Reference
8 6 PAK1/SERPINH1 MDA-MB-134
9 1 GapDH Reference
9 2 GapDH MDA-MB-134
9 3 GapDH Reference
Figure 5.24. Two possible KLHL35 fusions predicted in MDA-MB-134. A – fusion predicted
between KLHL35 and ODZ4. B – readthrough fusion predicted between KLHL35 and AQP11.
Type of structural variant Chromosome Deletion start Deletion end Gene
Deletion 11 411136 411904 ANO9
Deletion 11 3081376 3082009 OSBPL5
Deletion 11 7673249 7674173 OVCH2
Deletion 11 92742021 92743103 CCDC67
Deletion 12 131707143 131707992 P2RX2
Deletion 15 87669090 87669745 POLG
Deletion 15 91316598 91317352 CHD2
Deletion 16 1387181 1387880 C16orf28
Deletion 17 37743398 37744175 STAT3
Deletion 19 55820546 55821085 SYT3
Deletion 2 85746607 85747410 SFTPB
Deletion 20 62066274 62068265 ZNF512B
Deletion 6 158468017 158469427 SERAC1
Deletion 7 157080659 157081216 PTPRN2
Table 5.7. Internal gene deletions in MDA-MB-134 predicted from deletions called from paired-
end sequencing.
Figure 5.25. PCR to look for internal deletions of genes in MDA-MB-134. Each PCR was
performed using primers spanning the deletion. The first PCR in each pair is on universal
reference cDNA, and the second is on MDA-MB-134 cDNA to look for a different size product,
which would indicate a possible transcript with exons deleted.
5.2.9 ODZ4 as a potential fusion gene
The candidate fusion gene KLHL35-ODZ4 was of particular interest. ODZ4 (also known as DOC4
or TEN4) was part of one of the few gene fusions known before the start of my project, as it is
fused to NRG1 in the breast cancer cell line MDA-MB-175 (Liu et al., 1999). ODZ4 is part of the
teneurin family of signaling molecules, which can function as transmembrane receptors and
also transcription factors (Tucker and Chiquet-Ehrismann, 2006). ODZ1 overexpression has been
implicated in mouse mammary tumorigenesis by MMTV-insertion experiments (Theodorou et
al., 2007). ODZ4 has not been proposed as one of the important genes in the 11q amplicon in
breast cancer as it is outside the minimal region of amplification. From the SNP6.0 data from
the Cancer Genome Project (Bignell et al., 2010), breaks in ODZ4 can be seen in the cell lines
HCC1599, MRK-nu-1, and MCF7. The genome of MCF7 has been mapped by paired-end
sequencing (Hampton et al., 2009), but no fusions of ODZ4 were found. A diagram of the known
breaks in ODZ4 is shown in Figure 5.26.
ODZ4 was seen to be expressed in MDA-MB-134 but not in the normal immortalized breast cell
line HB4a. To investigate whether ODZ4 was expressed in other breast cancer cell lines, a
primer pair designed to amplify exons 4 and 5 was tested on a panel of cell lines. In this semi-
quantitative assay, out of 28 breast cancer cell lines, 5 showed expression of ODZ4 (Figure
5.27). ODZ4 was also expressed in the non-cancer breast cell line HMT3552, but not detectably
in HB4a. Of the cell lines with known breaks, MDA-MB-134 and MDA-MB-175 showed
expression while HCC1599 and MCF7 did not. No data was available for MRK-nu-1. HCC1419,
HCC1500, and PMC42 also showed expression of ODZ4.
Quantitative real-time PCR was performed on ODZ4. Primers were designed to amplify exon 8,
which is part of the fusion gene in MDA-MB-175, and exon 21, which is not part of the fusion
gene, and tested on a panel of cell lines. The results are shown in Figure 5.28, with expression
compared to the universal human reference cDNA, containing cDNA from a pool of different
cell lines.
HB4a and HMT3552 both express ODZ4 at low but detectable levels. The lack of expression
seen in HB4a using semi-quantitative methods was likely to be due to the expression levels
being at the limit of detection for standard PCR. The cell lines BT-474, HCC1500, PMC42 and
SUM52 showed a low level of upregulation of both exons of ODZ4 compared to the control cell
line HB4a. Semi-quantitative PCR suggested that HCC1500 would have a higher expression level
than was seen by quantitative PCR, which may be due to the standard PCR using different
samples of cDNA which had not been normalized to any housekeeping genes.
MDA-MB-134 and MDA-MB-175 showed higher overexpression of ODZ4 than any other cell
lines. MDA-MB-175 was the only cell line which showed higher expression of exon 8 than exon
21, and this may be due to the fusion of ODZ4, which includes exon 8 but not exon 21.
Figure 5.27. PCR using primers for ODZ4 exons 4-5 on a panel of breast cancer cell lines.
Row 1: HMT3552 (positive), HB4a, SUM44, SUM52, BT20, BT474, BT549, MCF7
Row 2: HCC38, HCC1143, HCC1419 (positive), HCC1500 (positive), HCC1569, HCC1599,
HCC1806, HCC1937.
Row 3: MDA-MB-134 (positive), MDA-MB-175 (positive), MDA-MB-231, MDA-MB-361, MDA-
MB-415, MDA-MB-468, ZR-75-1, ZR-75-30.
Row 4: VP229, VP267, PMC42 (positive), Patu-1, Suit-2, Mia-paca-2, DU4475, SKBr3
Row 5: T47D, universal reference cDNA (positive control), negative control.
Figure 5.28. Real-time PCR on two sets of primers in ODZ4. Expression for each cell line was
normalised to three housekeeping genes (GAPDH, UBC, and RPL13a), and the expression for
each cell line was plotted relative to the expression of the universal reference cDNA.
5.3 Discussion
The karyotype of MDA-MB-134 has now been investigated using low-resolution array painting,
high-resolution SNP6.0 arrays, and high-throughput sequencing. The three methods show good
concordance in determining the structure of amplicons.
The 1Mb array painting offers the lowest resolution data, but it is the only method which allows
the structure of the amplified chromosome to be determined directly from the chromosome
without the possibility of breaks on other copies of the chromosome being included as part of
the amplification. While it appears that the breaks on chromosomes 8 and 11 in MDA-MB-134
are confined to the two amplified chromosomes, in cell lines and tumours with a more
complicated pattern of amplification, spread across multiple different derivative chromosomes,
whole-genome approaches could lead to confusion when trying to assemble complicated
amplicons. The higher resolution data also suggests that requiring 3 consecutive clones gained
or lost to call a copy number change in the 1Mb array painting was conservative, as there were
regions with 1 or 2 clones showing a copy number change which appears to be real.
A limitation of microarray-based approaches is that it is difficult to quantify regions of high and
low copy number in one experiment without the probes at high copy number becoming
saturated (Williams and Thomson, 2010). High-throughput sequencing should be unaffected by
this problem as the copy number is obtained directly from the number of reads. A comparison
of the copy number data obtained from the SNP6.0 array and from high-throughput sequencing
suggests that the structure of the amplicon has more copy number changes than are called
from the SNP6.0 data as the array becomes saturated at high copy number (Figure 5.29).
Figure 5.29. Comparison of segmented copy number for MDA-MB-134 chromosome 8. A –
SNP6.0 array data, with segments predicted by the PICNIC algorithm (Greenman et al., 2010)
shown in red. B – Illumina sequencing copy number data for the same region, with segments
predicted by DNACopy shown in red. The vertical green lines mark known breakpoints
confirmed by PCR and capillary sequencing. In A, the amplification appears to be 7-fold, with a
shift from 2 copies to 14 copies (the SNP6.0 data is taken from the endoreduplicated sideclone
of MDA-MB-134, which is why distal 8p is present at 2 copies rather than 1). In B, the
amplification appears to be 16-fold.
The search for fusion genes in MDA-MB-134 was disappointing, as 12 fusion genes were
predicted and none were found to be expressed. This was particularly disappointing as two of
the predicted fusions involved genes known to be fused in epithelial cancers. An expressed in-
frame fusion of SHANK2 is known in a melanoma cell line (Berger et al., 2010). The potential
fusion partner in MDA-MB-134 is EPB49, also known as dematin, which encodes a cytoskeletal
protein which has been implicated in prostate tumorigenesis. Truncated forms of EPB49 have a
dominant negative effect and cause cytoskeletal abnormalities and altered cell shape
(Lutchman et al., 1999), raising the possibility that a fusion of EPB49 could also act as a
dominant negative inhibitor of the wild-type protein.
ODZ4 was another known fusion partner predicted to be fused in MDA-MB-134. Although no
fusion transcript was detected, ODZ4 was substantially upregulated in MDA-MB-134 compared
to the normal control cell lines. ODZ4 lies outside the highly amplified region, which suggests
another mechanism causes overexpression rather than an increase in expression due to
increased copy number. ODZ4 is also overexpressed in BT-474, for which high-resolution SNP6.0
data is also available (data provided by the Cancer Genome Project (Bignell et al., 2010)).
Although there are rearrangements on chromosome 11, there is no copy number change in
ODZ4. SNP6.0 data is not available for PMC42, HCC1500, or SUM52, the other cell lines which
overexpress ODZ4, but data from a custom oligonucleotide array containing 30,000 probes
suggests that SUM52 may have an amplification of ODZ4, while PMC42 and HCC1500 do not
show a copy number change. (Data kindly provided by Dr Suet-Feung Chin, Cancer Research UK
Cambridge Research Institute.) This suggests that multiple mechanisms may cause ODZ4
overexpression, including copy number gain and expression as part of a fusion gene. Additionally,
PMC42 has been suggested as a good model of normal breast epithelium despite originating
from a breast cancer, as it has similar mRNA and miRNA expression profiles to HB4a (Git et al.,
2008), but it has higher expression of ODZ4 than the cell lines derived from normal breast
epithelium.
Chapter 6
Bioinformatics of
high-throughput sequencing
of breast cancer
Chapter 6 Bioinformatics of high-throughput sequencing of breast cancer
159
6.1 Introduction
The primary aim of the high-throughput sequencing I performed on cancer cell lines was
to identify structural variants in the cancer genome. I chose to perform paired-end
sequencing, which sequences a short read from either end of a fragment of a known
size, rather than single end sequencing, which produces only one sequence read from
each piece of fragmented DNA. Paired-end sequencing is a better method for detecting
structural variants than single-end sequencing, as sequencing from either end of a DNA
fragment will detect a rearrangement occurring anywhere in the fragment. For the same
amount of sequence, this gives much higher coverage of rearrangements than single-end
reads, which detect a rearrangement only if a read crosses the breakpoint itself.
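The difference between the coverage of sequenced bases and the coverage available for detecting rearrangements (often called physical or span coverage) can be illustrated with simple arithmetic. The sketch below is illustrative only: the number of read pairs, read length, fragment size, and genome size are hypothetical values, not figures from this project, and the function names are my own.

```python
def sequence_coverage(n_pairs, read_length, genome_size):
    """Average number of sequenced bases covering each position."""
    return n_pairs * 2 * read_length / genome_size


def physical_coverage(n_pairs, fragment_size, genome_size):
    """Average number of fragments spanning each position."""
    return n_pairs * fragment_size / genome_size


# Hypothetical run: 30 million read pairs of 2 x 36 bp from 400 bp fragments.
GENOME = 3_000_000_000
PAIRS = 30_000_000
seq_cov = sequence_coverage(PAIRS, read_length=36, genome_size=GENOME)
phys_cov = physical_coverage(PAIRS, fragment_size=400, genome_size=GENOME)
print(f"sequence coverage {seq_cov:.2f}x, physical coverage {phys_cov:.2f}x")
```

Under these assumed values, each position is spanned by several times more fragments than it is covered by sequenced bases, which is why a breakpoint is far more likely to be bracketed by a read pair than crossed by a single read.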
Finding the structural variants in a tumour genome usually relies on comparing the
tumour genome to a reference genome by aligning all the sequences from the tumour
to the reference genome and detecting any variation. This is easier than assembling the
tumour genome from scratch as it is difficult to assemble a human genome from current
short sequence reads (Medvedev et al., 2009). Paired-end reads from a structural
variant in the cancer genome will produce an identifiable signature depending on the
type of rearrangement, and these signatures can be detected and used to find the
underlying structural rearrangement. The read coverage across the genome can also be
used to detect unbalanced structural variants, as a change in copy number will cause a
change in the number of reads across that region of the genome.
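The read-depth principle can be sketched as follows. This is a minimal illustration rather than the method used in the pipeline: the fixed bin size, the use of the median bin count as the baseline, and the function name are all assumptions made for the example.

```python
from collections import Counter
from statistics import median


def binned_copy_ratio(read_starts, chrom_length, bin_size):
    """Count reads per fixed-size bin and scale by the median bin count."""
    n_bins = (chrom_length + bin_size - 1) // bin_size
    counts = Counter(pos // bin_size for pos in read_starts)
    per_bin = [counts.get(i, 0) for i in range(n_bins)]
    baseline = median(per_bin)
    if baseline == 0:
        raise ValueError("median bin count is zero; use larger bins")
    return [count / baseline for count in per_bin]


# Toy data: four 100 bp bins, with three times as many reads in the third bin.
starts = [10, 60, 110, 160, 210, 220, 230, 240, 250, 260, 310, 360]
ratios = binned_copy_ratio(starts, chrom_length=400, bin_size=100)
print(ratios)
```

A bin with a ratio well above 1 is a candidate gain and a bin well below 1 is a candidate loss; real data would also need segmentation and correction for mappability and GC content, which this sketch omits.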
Although the idea of sequencing the ends of genomic fragments to find structural
variants has been used in the past (Volik et al., 2003), this was limited by the capacity of
capillary sequencing and produced small numbers of sequences from large DNA
fragments. High-throughput sequencing produces orders of magnitude more sequence
reads than previous approaches, and while some bioinformatics software was available
to analyse high-throughput sequencing data, at the time this project was started there
was no software available which would use paired-end sequencing data to predict
structural variants and copy number changes, and any effect they had on genes. I
developed a bioinformatic analysis pipeline using a combination of available software
packages and some custom software which would process the raw sequence data and
generate copy number and structural variation information. The steps in the pipeline
are shown in Figure 6.1.
Figure 6.1. Outline of the pipeline used to process high throughput sequencing data
6.2 Alignment of high-throughput sequencing reads
The process of DNA alignment for high-throughput sequencing involves comparing a
short sequence read to a reference genome and determining the most likely position
within the genome that produced the sequence. Sequence alignment algorithms
specifically designed for high-throughput sequencing perform differently to those
optimized for other applications, as they can make certain assumptions – for instance,
we can assume that all matches will be perfect or near-perfect, which is not true of
alignment algorithms designed to find alignments between different species (Flicek and
Birney, 2009). Sequence alignment in general involves a trade-off between the
sensitivity of the alignment and the speed of the algorithm, as a faster algorithm may
not find a more distant alignment, or may misalign some sequences. Designing
algorithms specifically for high-throughput sequencing allows them to perform quickly
enough to cope with a high volume of data without too much of a trade-off in accuracy.
To get the raw sequences, the image analysis (FIRECREST) and base calling modules
(BUSTARD) of the standard Illumina pipeline were used. The sequence alignment
program MAQ (Li et al., 2008) was then used to align the paired-end reads. MAQ was
one of the first algorithms developed specifically to handle high volumes of short
sequence reads, and adapts the seed and extension method used by earlier algorithms.
BLAST is an early example of this approach – short exact hits or ‘seeds’ within the
sequence are found, and then each seed is extended into a longer alignment (Altschul et
al., 1990). This allows for fast searching by initially searching for only the short seed
sequences and then searching for longer alignments only where a short match has
already been found, narrowing the search space. PatternHunter (Ma et al., 2002)
improved this method through the use of non-contiguous seeds, which were shown to
increase the sensitivity of matches as they are less affected by a mutation at a single
base pair. MAQ uses the first 28bp of each read to create six non-contiguous seeds,
which increases the match sensitivity, and speeds up the searching by creating six hash
tables, one for each seed. The hash table is created by taking the base pairs at each
position in the seed and using a hash function to generate an integer value based on
them. The reads can then be ordered according to this integer and grouped together in
memory. The same hash function is then applied to a 28bp subsequence of the
reference sequence, which also generates an integer which can be used to look up the
indexed reads which gave the same integer. If a hit is found then the match is extended
beyond 28bp to see if it matches the whole length of the sequence. This is repeated
over the six hash tables for all possible 28bp subsequences of the reference genome to
find the best match.
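The indexing and lookup steps described above can be sketched in miniature. This is not MAQ's implementation: the 8 bp seed length and the three seed templates are hypothetical toy values (MAQ uses 28 bp and six templates), the subsequence string itself stands in for the integer hash value, and only ungapped matches on the forward strand are considered.

```python
from collections import defaultdict

SEED_LEN = 8  # toy value; the text describes 28 bp seeds
# Hypothetical templates: 1 = position used in the key, 0 = position ignored.
TEMPLATES = ["11110000", "00001111", "11001100"]


def seed_key(seq, template):
    """Build a lookup key from the template's selected positions."""
    return "".join(base for base, use in zip(seq, template) if use == "1")


def index_reads(reads):
    """One table per template, mapping seed keys to read names."""
    tables = [defaultdict(list) for _ in TEMPLATES]
    for name, seq in reads.items():
        for table, template in zip(tables, TEMPLATES):
            table[seed_key(seq[:SEED_LEN], template)].append(name)
    return tables


def align(reads, reference, max_mismatches=2):
    """Scan the reference and extend every seed hit over the whole read."""
    tables = index_reads(reads)
    best = {}
    for pos in range(len(reference) - SEED_LEN + 1):
        window = reference[pos:pos + SEED_LEN]
        for table, template in zip(tables, TEMPLATES):
            for name in table.get(seed_key(window, template), []):
                read = reads[name]
                target = reference[pos:pos + len(read)]
                if len(target) < len(read):
                    continue  # read would run off the end of the reference
                mismatches = sum(a != b for a, b in zip(read, target))
                if mismatches <= max_mismatches and (
                    name not in best or mismatches < best[name][1]
                ):
                    best[name] = (pos, mismatches)
    return best


reference = "ACGTACGTTTGACCAGTAGGCTA"
reads = {"r1": "CATTTGACCAGT"}  # reference[5:17] with one substituted base
hits = align(reads, reference)
print(hits)
```

The read carries a mismatch in its second base, so the first template misses it, but the second template, which ignores that position, still finds the hit. This is the sense in which non-contiguous seeds tolerate a single mutated base.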
To find the best match for a sequence in the genome, MAQ searches for the ungapped
match with the lowest mismatch score, which is defined as the sum of the quality scores of the
bases that are mismatched. For greater speed, MAQ only considers hits with two or
fewer mismatches in the first 28 positions of the read. Each alignment is assigned a
quality score, which is a measure of the probability that the alignment reported by MAQ
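is the correct alignment. For each quality score, Q:
\[ Q = -10 \log_{10} P \]
where P is the probability that the read is incorrectly mapped. A quality score of 30
therefore indicates a 1 in 1000 chance that the read is incorrectly mapped.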
If MAQ finds multiple alignments with equally good quality scores, it will return one
alignment at random and give a quality score of 0, allowing these alignments to be
easily identified and filtered out. I used a quality threshold of 35 to decide whether to
retain a read pair for further processing, which removes low quality alignments which
may be misaligned, and also removes any reads that do not have a single unique match
as non-uniquely mapping reads will give spurious results if used for structural variant
calling.
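The relationship between a mapping quality and the probability of a wrong alignment, and the effect of a quality threshold, can be sketched as below. The conversion is the standard Phred scale. The function names are illustrative, and requiring both reads of a pair to pass the threshold is an assumption made for this example rather than a description of how the filter was applied.

```python
import math


def error_probability(quality):
    """Probability that an alignment with this Phred-scaled quality is wrong."""
    return 10 ** (-quality / 10)


def phred_quality(p_error):
    """Phred-scaled quality for a given probability of misalignment."""
    return -10 * math.log10(p_error)


def keep_pair(quality_first, quality_second, threshold=35):
    """Assumed rule: keep a read pair only if both reads reach the threshold."""
    return min(quality_first, quality_second) >= threshold


print(error_probability(30))  # a quality of 30 is a 1 in 1000 chance of error
print(keep_pair(60, 0))       # a read with several equally good hits has quality 0
```

Because reads with more than one equally good alignment are given a quality of 0, any positive threshold removes them along with the low-confidence unique alignments.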
Multiple read pairs that map to identical positions in the genome are likely to be
duplicates of the same fragment created during the PCR amplification step. After the
reads had been aligned with MAQ, the first step was to identify reads that were likely to
be PCR duplicates and remove all but one of the identical read pairs.
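A minimal sketch of duplicate removal is shown below. It assumes each read pair is summarised by the chromosome, position, and strand of its two reads, and it keeps the first pair seen at each set of coordinates; the record layout and the choice of which copy to keep are assumptions for illustration.

```python
from collections import namedtuple

ReadPair = namedtuple("ReadPair", "chrom1 pos1 strand1 chrom2 pos2 strand2")


def remove_pcr_duplicates(read_pairs):
    """Keep one read pair for each distinct pair of mapping positions."""
    seen = set()
    unique = []
    for pair in read_pairs:
        if pair not in seen:
            seen.add(pair)
            unique.append(pair)
    return unique


pairs = [
    ReadPair("11", 74811900, "-", "11", 78266050, "+"),
    ReadPair("11", 74811900, "-", "11", 78266050, "+"),  # identical: duplicate
    ReadPair("11", 74811950, "-", "11", 78266100, "+"),  # independent fragment
]
unique_pairs = remove_pcr_duplicates(pairs)
print(len(unique_pairs))
```

Removing duplicates before clustering matters because two copies of one chimeric fragment would otherwise look like two independent reads supporting the same structural variant.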
6.2.1 Calling normal and aberrant reads
A normal read was defined as a pair of reads which aligned to the genome with the
expected size and orientation (Figure 6.2). The expected size was determined from the
size distribution of the DNA fragments in the library. Anything within 3 standard
deviations of the median was considered inside the normal range, as the library size
distribution is approximately normally distributed. The expected orientation is with the
read aligned to the positive strand in a lower position on the chromosome, and the read
aligned to the negative strand in a higher position on the chromosome.
Any read pair that did not meet the criteria for a normal read was called as an
aberrantly mapping read. Aberrantly mapping reads are candidates to be reads from
DNA fragments that contain a structural difference from the reference human genome.
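The definition of a normal read pair can be expressed as a short sketch. It assumes a small-insert library, approximates the fragment size by the distance between the two mapped positions, and uses the population standard deviation of a sample of fragment sizes; these simplifications and the function names are illustrative assumptions.

```python
from statistics import median, pstdev


def normal_size_range(fragment_sizes, n_sd=3):
    """Range within n_sd standard deviations of the median fragment size."""
    centre = median(fragment_sizes)
    spread = pstdev(fragment_sizes)
    return centre - n_sd * spread, centre + n_sd * spread


def is_normal_pair(chrom1, pos1, strand1, chrom2, pos2, strand2, size_range):
    """True if the pair has the expected chromosome, orientation, and size."""
    if chrom1 != chrom2:
        return False
    lower, higher = sorted([(pos1, strand1), (pos2, strand2)])
    if (lower[1], higher[1]) != ("+", "-"):
        return False
    smallest, largest = size_range
    return smallest <= higher[0] - lower[0] <= largest


size_range = normal_size_range([180, 190, 200, 210, 220])
print(is_normal_pair("11", 1000, "+", "11", 1200, "-", size_range))  # normal
print(is_normal_pair("11", 1000, "+", "11", 6000, "-", size_range))  # aberrant
```

Any pair for which `is_normal_pair` returns False would be passed on as an aberrantly mapping pair.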
Figure 6.2. A - a normal read pair aligned to the reference genome. The blue read is in
the lower position on the chromosome and aligns to the positive strand, while the red
read is in the higher position on the chromosome and aligns to the negative strand. B -
the library size distribution of a typical small insert library showing the range of
fragment sizes which are considered as normal fragments. (Example taken from ZR-75-
30 cell line library.)
6.2.2 Processing mate-pair data
There are two types of paired-end library. Standard small-insert libraries have a
fragment size of up to 800bp. Mate-pair libraries use longer fragments which are
circularized, ligated, and fragmented a second time to produce a library with a larger
insert size than can be achieved using standard small-insert libraries.
Mate-pair libraries need to be processed differently, as a normal read from a mate-pair
library will have a different size and orientation than one from a standard small-insert library
(Figure 6.3). In a mate-pair library, the paired reads are taken from the ends of a longer
DNA fragment. As long DNA fragments cannot be directly sequenced on the Illumina
platform, longer fragments are circularized and the junction labelled with biotin. A
smaller fragment containing the ligated junction is cut out and these junction fragments
are pulled out using avidin-labelled beads. The junction fragments are similar in size to
the fragments from a standard small-insert library and are sequenced in the same way
as a standard library. As the DNA has been circularized, when the sequence reads from
the ends of the small fragments are aligned to the genome, they will align in the
opposite orientation to the normal fragments of a small-insert library (Figure 6.3). The
size range for a normal fragment is also larger, as the size distribution of the fragments
in a mate-pair library is wider than for a small insert library.
Figure 6.3. Construction of mate-pair libraries. After circularization, ligation and
fragmentation of the libraries, there are two populations of short fragments in the
library. The desired small fragments have reads shown as yellow arrows. A population of
unwanted fragments will also be present in the library due to imperfect selection for the
biotin-labelled fragments. The reads from these fragments are shown in green, and have
opposite orientations to the desired reads.
6.3.1 Clustering and structural variant calling
To call a structural variant, two or more independent high-quality reads which
supported the structural variant were needed in order to minimize artefactual
predictions, on the basis that chimeras produced during the library preparation or
misaligned reads are much less likely to occur twice in the same location than two real
reads across a structural variant.
To find multiple read pairs which covered the same structural variant, the reads were
clustered (Figure 6.4). The read pairs were first sorted so that the first read in the pair
came first in the genome, and then the read pairs were sorted into order by
chromosome, position, and strand they aligned to. If there were multiple read pairs
where all the first reads in the pairs were on the same chromosome, aligned to the
same strand, and the distance between them was less than the upper limit of the library
size range, they were clustered together. The clustering process was repeated for the
second reads in the pairs. Only read pairs whose first reads and second reads both
clustered with those of the other pairs were used to call the structural variants (Figure 6.5).
These reads were assumed to span the same breakpoint, so the true position of the
breakpoint lies no further than the upper limit of the library insert size from the position
of the read in the cluster which is furthest from the breakpoint.
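The clustering step can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline's code: each pair is a tuple already ordered so that the first read comes first in the genome, neighbouring reads are chained together when they lie closer than the maximum insert size, and the function names are hypothetical.

```python
from itertools import groupby


def _chains(items, pos_of, max_gap):
    """Split position-sorted items wherever neighbours are max_gap or more apart."""
    chain = []
    for item in sorted(items, key=pos_of):
        if chain and pos_of(item) - pos_of(chain[-1]) >= max_gap:
            yield chain
            chain = []
        chain.append(item)
    if chain:
        yield chain


def cluster_pairs(pairs, max_insert, min_support=2):
    """Cluster on the first reads, then on the second reads within each cluster.

    Each pair is (chrom1, pos1, strand1, chrom2, pos2, strand2).
    """
    def first_key(pair):
        return pair[0], pair[2]

    def second_key(pair):
        return pair[3], pair[5]

    clusters = []
    for _, group in groupby(sorted(pairs, key=first_key), key=first_key):
        for first_chain in _chains(group, lambda p: p[1], max_insert):
            by_second = sorted(first_chain, key=second_key)
            for _, subgroup in groupby(by_second, key=second_key):
                for cluster in _chains(subgroup, lambda p: p[4], max_insert):
                    if len(cluster) >= min_support:
                        clusters.append(cluster)
    return clusters


pairs = [
    ("8", 1000, "+", "11", 5000, "-"),
    ("8", 1100, "+", "11", 5150, "-"),
    ("8", 1050, "+", "11", 90000, "-"),  # only the first read clusters
    ("8", 50000, "+", "11", 5100, "-"),  # only the second read clusters
]
clusters = cluster_pairs(pairs, max_insert=500)
print(clusters)
```

Only the first two pairs are reported, since they are the only ones in which both reads fall in the same regions; the other two pairs share just one end with the cluster and are discarded.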
Figure 6.4. Clustering of reads to call structural variants. Paired reads are clustered if
they map to the same strand and the difference in their positions is smaller than the
maximum library insert size. The position of the breakpoint must lie less than the
maximum library insert size from the end of the read furthest from the breakpoint.
Figure 6.5. Clustering of reads to call structural variants. Only the reads shown in green
support this structural variant, as both reads in the pair support the variant. The reads
shown in blue would not be used to support the structural variant as only one of the
reads in a pair supports the structural variant.
The structural variants were classified based on read strands and positions (Figure 6.6),
and named based on the most likely chromosomal rearrangement which can be inferred
from the reads.
A DIF is called when the two reads in a pair map to different chromosomes. It is most
likely to be an interchromosomal translocation or insertion. A DEL is called when the
two reads in a pair map further apart than the maximum library size, and the read which
maps to the lower position is on the positive strand, and the read in the higher position
is on the negative strand. It is most likely to be a deletion, but it could also be an
insertion of material normally found at a higher position on the chromosome, with no
loss of DNA. An INS is called when the read which maps to the lower position is on the
negative strand, and the read which maps to the higher position is on the positive
strand. It is most likely to represent an insertion. Head-to-tail tandem duplications will
also be in this class, and as the paired reads cover only one side of the insertion it
cannot be distinguished from a larger insertion. An INV is called when both reads in the
pairs map to the same strand, indicating one side of the breakpoint has been inverted.
An ITR is called when both reads in the pair map to the same position and the same
strand, and it is a special variant of the INV class caused by a head-to-head tandem
duplication.
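The classification rules for a small-insert library can be written as a single function. This is a sketch of the rules as stated above, not the original software. The text does not say how close two reads must be to count as mapping to the same position, so the `same_position_tolerance` parameter is a hypothetical addition, as is the "NORMAL" label for pairs that match none of the aberrant classes.

```python
def classify_pair(chrom1, pos1, strand1, chrom2, pos2, strand2,
                  max_insert, same_position_tolerance=0):
    """Assign a read pair from a small-insert library to a variant class."""
    if chrom1 != chrom2:
        return "DIF"  # different chromosomes: translocation or insertion
    if strand1 == strand2:
        if abs(pos1 - pos2) <= same_position_tolerance:
            return "ITR"  # head-to-head tandem duplication
        return "INV"
    lower, higher = sorted([(pos1, strand1), (pos2, strand2)])
    if lower[1] == "-":
        return "INS"  # lower read on negative strand, higher on positive
    if higher[0] - lower[0] > max_insert:
        return "DEL"  # expected orientation but too far apart
    return "NORMAL"   # expected orientation, not larger than the library size


print(classify_pair("17", 63921800, "+", "8", 35511450, "-", max_insert=500))
print(classify_pair("11", 1000, "+", "11", 9000, "-", max_insert=500))
```

The order of the checks matters: the chromosome test comes first, then the same-strand classes, and the size test is applied only to pairs in the normal orientation.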
Just as the strands for a normal read pair are reversed for mate-pair libraries, the
expected strands when calling structural variants must also be reversed (Figure 6.7). A
probable deletion is called from a read pair where the read in the lower position is
aligned to the negative strand, and the read in the higher position is aligned to the
positive strand. A probable insertion is called from a read pair where the read in the
lower position is aligned to the positive strand, and the read in the higher position is
aligned to the negative strand. Probable inversions and inverted tandem repeats will
have both reads aligned to the same strand.
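The reversal of expected strands between the two library types can be captured in a small lookup table. The sketch assumes the read pair maps to a single chromosome and has already been found to be aberrant in size or orientation; the table and function names are illustrative.

```python
# (strand of lower read, strand of higher read) -> probable variant,
# for same-chromosome read pairs already flagged as aberrant.
PROBABLE_CALLS = {
    "small-insert": {("+", "-"): "deletion", ("-", "+"): "insertion"},
    "mate-pair": {("-", "+"): "deletion", ("+", "-"): "insertion"},
}


def probable_variant(lower_strand, higher_strand, library):
    """Probable variant type for an aberrant read pair from the given library."""
    if lower_strand == higher_strand:
        return "inversion or inverted tandem repeat"
    return PROBABLE_CALLS[library][(lower_strand, higher_strand)]


print(probable_variant("-", "+", "mate-pair"))     # deletion
print(probable_variant("-", "+", "small-insert"))  # insertion
```

The same-strand classes need no reversal, because flipping both strands of a pair leaves them equal to one another.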
Figure 6.6. Structural variant calling of small-insert library. For each type of structural
variant, the upper diagram shows the arrangement of the abnormal chromosome, and
the lower diagram shows how the read pair would align to the reference genome.
Figure 6.7. Structural variant calling from mate-pair library. For each type of structural
variant, the upper diagram shows the arrangement of the abnormal chromosome and
the reads produced from a mate-pair library, and the lower diagram shows how the
read pair would align to the reference genome.
Structural variant calling is more difficult from mate-pair libraries because the biotin
selection step of the library preparation is not perfect, and some of the unbiotinylated
fragments will also be retained. These small fragments, which do not cross the
circularization junction, will be retained after size selection (Figure 6.3), and they can be
seen as a second peak in a graph of the library size distribution (Figure 6.8). These
fragments are similar to the fragments produced from a small-insert library, and will
align to the same strands as for a small-insert library – the read in the lower position will
align to the positive strand, and the read in the higher position will align to the negative
strand. This gives the small fragments the same read orientation as insertions in the
mate-pair library, and they could cause a spurious insertion to be called. Requiring two
or more hits to call a structural variant reduces the likelihood of calling these insertions,
as the small fragments make up a smaller percentage of the library than the desired
junction fragments, and the probability of getting two reads across the same region is
lowered. The read pairs which appear to be from the small fragments can also be
filtered out – this will remove a small number of real insertions along with the false
positives. As much larger numbers of small insertions were called from unfiltered mate-
pair libraries than from small-insert libraries, it was likely that many of them were
spurious insertions, and the small fragments were filtered out before any further
analysis was done.
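The filtering step described above can be sketched as follows. This is a minimal illustration rather than the script used in this study: the `ReadPair` record, the function names and the 1000 bp size cut-off are all hypothetical, and in practice the cut-off would be chosen from the second peak of the fragment size distribution.

```python
from typing import NamedTuple


class ReadPair(NamedTuple):
    chrom1: str
    pos1: int
    strand1: str  # '+' or '-'
    chrom2: str
    pos2: int
    strand2: str  # '+' or '-'


# Hypothetical cut-off: pairs closer together than this are treated as
# small non-biotinylated fragments. Not a value taken from the text.
MAX_SMALL_FRAGMENT_BP = 1000


def is_small_fragment(pair, max_size=MAX_SMALL_FRAGMENT_BP):
    """True if the pair has the small-insert orientation at short range."""
    if pair.chrom1 != pair.chrom2:
        return False
    # Order the two reads by genomic position.
    (low_pos, low_strand), (high_pos, high_strand) = sorted(
        [(pair.pos1, pair.strand1), (pair.pos2, pair.strand2)]
    )
    # Small-insert orientation: lower read on '+', higher read on '-'.
    small_insert_orientation = low_strand == '+' and high_strand == '-'
    return small_insert_orientation and (high_pos - low_pos) <= max_size


def filter_small_fragments(pairs, max_size=MAX_SMALL_FRAGMENT_BP):
    """Drop read pairs that look like small-fragment contamination."""
    return [p for p in pairs if not is_small_fragment(p, max_size)]
```

As noted in the text, a filter of this kind also discards any real small insertions that happen to have the same orientation and size.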
Figure 6.8. Fragment size distribution from the mate-pair library of the cell line ZR-75-
30. The histogram shows the percentage of fragments of each size. The blue bars show
the size distribution for the fragments where the reads map in the expected orientation
for the biotinylated fragments from a mate-pair library. The red bars show the size
distribution of the smaller non-biotinylated fragments which were not removed during
library preparation.
6.4 Fusion gene prediction
Once a list of well-supported structural variants was produced from the short
sequences, the next step was to look for any potential fusion genes which may result
from these structural variants.
The list of structural variants was first checked against a list of known human copy
number variations (Conrad et al., 2010) and any which matched were considered to be
normal human variation and not acquired in the tumour.
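A check of this kind can be sketched as below. The text does not state how a structural variant was judged to match a known copy number variation, so the reciprocal-overlap criterion and the 50% threshold used here are assumptions made purely for illustration, as are the function names.

```python
def reciprocal_overlap(start_a, end_a, start_b, end_b):
    """Smaller of the two overlap fractions between two intervals."""
    overlap = min(end_a, end_b) - max(start_a, start_b)
    if overlap <= 0:
        return 0.0
    return min(overlap / (end_a - start_a), overlap / (end_b - start_b))


def remove_known_cnvs(variants, known_cnvs, min_overlap=0.5):
    """Keep variants that do not match any known copy number variation.

    variants and known_cnvs are lists of (chrom, start, end) tuples.
    min_overlap is a hypothetical threshold, not a value from the text.
    """
    kept = []
    for chrom, start, end in variants:
        matches = any(
            cnv_chrom == chrom
            and reciprocal_overlap(start, end, cnv_start, cnv_end) >= min_overlap
            for cnv_chrom, cnv_start, cnv_end in known_cnvs
        )
        if not matches:
            kept.append((chrom, start, end))
    return kept
```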
The set of structural variants with known common copy number variations removed was
used to predict potential fusion genes. The fusions were predicted computationally at
the DNA level by using the Ensembl Application Programming Interface to retrieve all
the genes which overlap the breakpoint region of a structural variant, and predicting
whether a fusion transcript could be formed based on the direction of the reads and the
strands of the genes (Figures 6.9 and 6.10). As well as rearrangements where two genes
are broken and fused, we considered rearrangements which break only one gene and
remove the 3' end including the transcription stop site, which could result in
transcription continuing into a downstream gene until it reaches the stop site of the
downstream gene. As a breakpoint could break two genes on different strands, it is
possible for more than one fusion gene to be predicted for each structural variant. As
the breakpoints are not resolved to base-pair level, it was not possible to predict
whether an in-frame transcript would be produced when a gene was broken inside an exon.
Although it is theoretically possible to predict whether an intronic break will produce an
in-frame fusion, alternate splicing and cryptic exons are found in fusion genes (Howarth
et al., 2008), which complicates the prediction of fusions.
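The strand logic behind the prediction can be sketched as follows. This is an illustrative reconstruction, not the prediction script itself: it assumes a small-insert library with inward-facing reads, so that a gene on the same strand as the read on its side of the junction keeps its 5' end and is transcribed towards the junction, while a gene on the opposite strand keeps its 3' end. For a mate-pair library the read strands are flipped first. The `Gene` record and function name are hypothetical, gene retrieval from Ensembl is not shown, and readthrough fusions are not handled.

```python
from itertools import product
from typing import NamedTuple


class Gene(NamedTuple):
    name: str
    strand: str  # '+' or '-'


FLIP = {'+': '-', '-': '+'}


def predict_fusions(read_strand_a, genes_a, read_strand_b, genes_b,
                    mate_pair=False):
    """Return (5' gene, 3' gene) name pairs that could form a fusion.

    genes_a and genes_b are the genes overlapping the breakpoint region
    on each side of the junction.
    """
    if mate_pair:
        # Mate-pair reads have the reverse orientation to small-insert reads.
        read_strand_a, read_strand_b = FLIP[read_strand_a], FLIP[read_strand_b]
    fusions = []
    for gene_a, gene_b in product(genes_a, genes_b):
        # Assumption: same strand as the read means the 5' end is retained.
        a_keeps_5prime = gene_a.strand == read_strand_a
        b_keeps_5prime = gene_b.strand == read_strand_b
        if a_keeps_5prime and not b_keeps_5prime:
            fusions.append((gene_a.name, gene_b.name))
        elif b_keeps_5prime and not a_keeps_5prime:
            fusions.append((gene_b.name, gene_a.name))
        # Two 5' ends or two 3' ends cannot form a fusion transcript.
    return fusions
```

Because every pair of genes across the junction is tested, a breakpoint that breaks two genes on different strands can return more than one predicted fusion.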
Structural variants can also cause genes to be internally rearranged, deleting or
duplicating exons. As the majority of structural variants are small deletions entirely
within introns, which are expected to have no effect on the coding sequence of the
gene, the number of exons deleted is also returned by the script, so that the deletions
which may affect genes can be easily prioritised over those which do not delete any
coding regions.
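The exon count used for prioritisation can be illustrated with a short sketch. It assumes that an exon counts as affected if the deletion overlaps it at all; whether the original script required the whole exon to be deleted is not stated, and the function names and data layout are hypothetical.

```python
def count_deleted_exons(del_start, del_end, exons):
    """Count exons overlapped by a deletion.

    exons is a list of (start, end) tuples with inclusive coordinates.
    Assumption: any overlap with the deletion counts as an affected exon.
    """
    return sum(1 for start, end in exons if start <= del_end and end >= del_start)


def prioritise_deletions(deletions, exons):
    """Sort (start, end) deletions so those hitting most exons come first."""
    return sorted(
        deletions,
        key=lambda d: count_deleted_exons(d[0], d[1], exons),
        reverse=True,
    )
```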
Figure 6.9. Fusion gene prediction from small-insert library. The strands of the read
alignments allow the orientation of the chromosomes and genes to be determined, and
predict whether a gene fusion will be formed. Depending on the orientation of the
genes, a fusion gene may be predicted, there may be no possible fusion, or there may
be a potential readthrough fusion. As a single breakpoint may break two genes on
different strands, more than one possible gene fusion or readthrough is possible at each
junction.
Figure 6.10. Fusion gene prediction from mate-pair library. The gene prediction is similar
to that for the small-insert library, but the orientations of the chromosome and genes are
reversed relative to the aligned strands of the reads.
6.4.1 Validation of predicted fusion genes
To validate the computational prediction of fusion genes, I used independent structural
variation data sets from the cell lines MCF7 (Hampton et al., 2009) and HCC1187
(Stephens et al., 2009a). In their paper, Hampton et al. predicted that MCF7 had 10
possible in-frame fusions caused by breakpoints occurring in the introns of 2 genes, and
they found that 4 of these were present at the cDNA level. I took their published data
set, which contained all the structural variants they found in MCF7, and put the data
into my fusion gene prediction script. The script successfully found all 10 fusions that
Hampton et al. predicted, as well as 4 other potential fusions, 3 internal deletions and
41 potential readthrough fusions.
Stephens et al. (2009) found a number of fusions and internal rearrangements in the cell
line HCC1187. They found 6 expressed fusion genes, 2 of which were in-frame fusions
and 4 out-of-frame fusions. I put their data set of structural variants into my fusion gene
prediction script, and all 6 of the expressed fusions were found, along with 5 other
predicted fusions not mentioned in their paper. At least one of the fusions I predicted
(PUM1-TRERF1) is known to be expressed (Dr Karen Howarth, unpublished). They also
looked for internal gene rearrangements and exon deletions, and found 5 expressed in-
frame internal gene rearrangements and 7 internal rearrangements which were not
checked for expression. All of these internal rearrangements were predicted
computationally by my script, along with 8 other internal gene rearrangements. One of
the 8 rearrangements not found by Stephens et al. (2009) is a tandem duplication of
RAD51L1, which is known to be expressed (Susanne Flach, unpublished work in the lab).
This validation shows that my method of fusion prediction not only successfully finds
known fusion genes in other data sets, but predicts fusion genes which have been
missed by other methods of fusion gene prediction.
The fusion prediction script was then used to predict fusion genes using a data set of
paired end reads from cell lines and tumour samples. In the cell line ZR-75-30, 20
potential fusion genes were predicted. In collaboration with Dr Ina Schulte I looked for
expression of the fusion genes and found 7 expressed fusion genes, of which 4 produced
in-frame transcripts (Table 6.1).
Type of structural variant 5' Gene 3' Gene Expressed? In-frame?
Insertion PCTK3 NFASC No No
Translocation GRIP2 BCL11A No No
Inversion PREX2 TSNARE1 No No
Translocation HYLS1 TIMM23 No No
Translocation STRN3 PLCE1 No No
Translocation FSIP1 BAZ2A No No
Translocation CBX3 c15orf57 No No
Translocation TRAPPC9 STARD3 No No
Translocation NDRG1 HOXB4 No No
Translocation TRAPPC9 SPAG5 No No
Translocation PPM1D TRAPPC9 No No
Inversion TAOK1 CA10 No No
Insertion SSH2 PLXDC1 No No
Inversion ZMYM4 OPRD1 Yes No
Translocation COL14A1 SKAP1 Yes Yes
Translocation APPBP2 PHF10L1 Yes Yes
Inversion TAOK1 PCGF2 Yes Yes
Deletion USP32 CCDC49 Yes No
Inversion BCAS3 HOXB9 Yes No
Deletion TIAM1 NRIP1 Yes Yes
Table 6.1. Predicted fusion genes in the ZR-75-30 cell line.
In the paired cell lines VP229 and VP267, taken from a patient at different stages of
disease, there were 27 fusions which were predicted to be in both cell lines. 3 were
found to be expressed, of which 2 were out of frame, and 1 was in frame (Scott
Newman and Susanne Flach, unpublished) (Table 6.2).
Type of structural variant 5' Gene 3' Gene Expressed? In-frame?
Insertion NRG3 c10orf11 No No
Translocation GRIK1 CPXM2 No No
Deletion OR52N1 TRIM5 No No
Insertion NRG3 SAMD8 No No
Translocation PDLIM1 ZBBX Yes No
Inversion ACADSB ADAM12 No No
Translocation PDLIM1 TNIK No No
Inversion FAM125B SPTLC1 Yes No
Translocation NRG3 GRIP1 No No
Translocation DLG5 KCNMB2 No No
Translocation MYNN NRG3 No No
Inversion AL356155.1 SORCS1 No No
Inversion ROR2 NEK6 No No
Deletion ZFAND2a c7orf50 No No
Translocation CPLX1 DUSP14 No No
Inversion UBTD1 SLIT1 No No
Inversion PBX3 ROR2 No No
Translocation MDS1 KCNMA1 Yes Yes
Deletion FNBP1 FAM129B No No
Inversion FAM129B NEK6 No No
Insertion APOBEC3G APOBEC3D No No
Translocation DPY19L2 DPY19L2P2 No No
Inversion AATF AC113211.2 No No
Translocation c10orf11 c17orf63 No No
Translocation c17orf63 c10orf11 No No
Translocation ADK KCNMB2 No No
Translocation UBR4 ZFP37 No No
Table 6.2. Predicted fusion genes in both of the paired cell lines VP229 and VP267.
6.5 Copy number variation
The read pairs which map normally were used to find copy-number alterations. The
number of sequence reads within a given interval which align to the genome should be
proportional to the copy number, and the resolution is limited only by the read
coverage across the genome. To get accurate copy number from sequencing, the data
must be corrected for the repeat content and GC percentage across the genome.
6.5.1 Correcting for mappability
Repetitive regions of the genome will have fewer aligned reads than unique regions as
fragments produced from repeat regions during library preparation are likely to align
perfectly to multiple regions of the genome, and will be discarded during data
processing. To account for this ‘mappability’ of the genome, the start positions of all the
genomic locations where a 35bp read would give a single match were simulated. This list
of mappable starts was used to divide the genome up into windows which each contain
the same number of mappable starts, giving windows of variable size. Highly repetitive
regions of the genome give larger windows.
The sequence reads were binned into the windows across the genome, and the number
of reads in each window should be proportional to the copy number.
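The windowing and binning steps can be sketched as follows. This is a simplified illustration for a single chromosome, assuming the list of mappable start positions is already sorted; the function names and the `starts_per_window` parameter are hypothetical, and any trailing partial window is simply dropped.

```python
from bisect import bisect_right


def make_windows(mappable_starts, starts_per_window):
    """Split sorted mappable start positions into equal-count windows.

    Returns (first_start, last_start) tuples. Windows over repetitive
    regions span more bases because mappable starts are sparser there.
    """
    windows = []
    last_full = len(mappable_starts) - starts_per_window + 1
    for i in range(0, last_full, starts_per_window):
        chunk = mappable_starts[i:i + starts_per_window]
        windows.append((chunk[0], chunk[-1]))
    return windows


def count_reads(windows, read_starts):
    """Count the reads whose start position falls inside each window."""
    window_starts = [w[0] for w in windows]
    counts = [0] * len(windows)
    for pos in read_starts:
        i = bisect_right(window_starts, pos) - 1
        if i >= 0 and pos <= windows[i][1]:
            counts[i] += 1
    return counts
```

With mappable starts at 1-4, 10, 20, 30, 40 and 41-44 and four starts per window, the middle window spans 31 bases while the other two span 4, which mirrors how repetitive regions give larger windows.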
6.5.2 Correcting for GC content
The library preparation protocols for the Illumina Genome Analyzer are known to
introduce bias and produce greater numbers of reads from regions with high GC content
(Dohm et al., 2008). This bias was reduced by lowering the melting temperature during
the gel extraction of the library preparation protocol, which has been shown to
considerably reduce the GC bias (Quail et al., 2008). To examine the GC content bias in
our data, the GC percentage for each window was retrieved using the Ensembl
Application Programming Interface, and plotted against the number of reads in each
window (Figure 6.11). This showed a bias in our data at both ends of the scale, with
fewer reads in windows with either a high or a low GC content. Although a bias towards
fewer reads in regions of low GC content was previously seen (Dohm et al., 2008), the
corresponding drop-off in read number at high GC was not seen by previous studies.
This may be because the effect was masked by the larger GC bias produced during the
library preparation, as they had not followed the improved protocol described above, or
because their data was generated from bacterial genomes, and there were few regions
with high enough GC content for the effect to be seen. This bias is also seen as a ‘wave’
in the uncorrected copy number data when plotted alongside the GC percentage of the
genome (Figure 6.12).
The GC bias was noted to be similar to the GC ‘wave’ seen by Marioni et al. (2007) in
array CGH data, and a similar method of loess correction was used to normalise the
data. A loess curve was fitted to a plot of GC content against the number of reads
per window, and the values predicted from the loess curve were used to normalize the
number of reads per window (Figure 6.13). This produced better copy number data as
judged by eye.
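The correction can be illustrated with a small pure-Python sketch. It is not the code used in this study: the local linear fit with tricube weights is a basic form of loess, the span `frac` is a hypothetical choice, and dividing each count by the fitted value and rescaling to the median count is an assumed way of applying the fitted curve. The quadratic running time makes it suitable only for small examples.

```python
from statistics import median


def loess_fit(x, y, frac=0.3):
    """Fit a local linear regression with tricube weights at each x."""
    n = len(x)
    k = max(2, int(frac * n))  # number of neighbours used per local fit
    fitted = []
    for x0 in x:
        dists = sorted(abs(xi - x0) for xi in x)
        h = dists[k - 1] or 1e-12  # bandwidth: distance to k-th neighbour
        w = [(1 - min(abs(xi - x0) / h, 1) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted


def gc_correct(gc, reads, frac=0.3):
    """Normalise read counts by the loess-predicted count for their GC.

    Assumption: counts are divided by the fitted value and rescaled to
    the median count, so that a window with typical GC is left unchanged.
    """
    fitted = loess_fit(gc, reads, frac)
    centre = median(reads)
    return [r * centre / f if f > 0 else r for r, f in zip(reads, fitted)]
```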
Figure 6.11. The number of reads in a window plotted against GC percentage of the
window to show the bias against reads at both low and high GC percentages. Data
shown is from MDA-MB-134, chromosome 1. Most of chromosome 1 is present in two
copies, with a small region present at higher copy number.
Figure 6.12. Copy number plots for MDA-MB-134 chromosome 1 showing GC content
bias. A - the number of reads in each window across MDA-MB-134 chromosome 1,
corrected for mappability but not GC percentage. B - the GC percentage of each
window. C - the GC percentage and reads per window plotted on the same graph.
Figure 6.13. Correction of copy number data for GC content bias. A - Number of reads
per window across MDA-MB-134 chromosome 1, uncorrected for GC content. B -
Number of reads per window across MDA-MB-134 chromosome 1 after loess correction
for GC content bias.
6.5.3 Segmentation and copy number analysis
Segmentation in the context of copy number data refers to the process of
computationally determining breakpoints and copy number alterations from a dataset
of copy number information. A number of segmentation methods have been previously
developed, primarily for analysing array CGH data. The SegSeq algorithm (Chiang et al.,
2009) is currently the only segmentation method designed specifically for high-
throughput sequencing, but it relies on the use of a matched normal sample to call
breakpoints, and was therefore unsuitable for my purposes. Instead I used DNAcopy
(Olshen et al., 2004; Venkatraman and Olshen, 2007), which uses circular binary
segmentation to call breakpoints. DNAcopy has been shown to perform better than
other segmentation methods on both simulated and real CGH data (Lai et al., 2005;
Willenbrock and Fridlyand, 2005).
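The idea behind circular binary segmentation can be illustrated with a heavily simplified sketch. This is not DNAcopy: the published method assesses significance by permutation and offers an "undo" step, whereas this sketch uses a fixed threshold on the test statistic and a noise standard deviation supplied by the caller, both of which are assumptions made for illustration. It searches for the arc of windows whose mean differs most from the rest of the segment, splits there, and recurses.

```python
from math import sqrt


def best_arc(values, noise_sd):
    """Find the arc values[i:j] whose mean differs most from the rest.

    Returns (statistic, i, j). Exhaustive search, quadratic in length.
    """
    n = len(values)
    prefix = [0.0]
    for v in values:
        prefix.append(prefix[-1] + v)
    total = prefix[-1]
    best = (0.0, 0, n)
    for i in range(n):
        for j in range(i + 1, n + 1):
            k = j - i
            if k == n:
                continue  # the arc must leave something outside it
            inside = prefix[j] - prefix[i]
            diff = inside / k - (total - inside) / (n - k)
            stat = abs(diff) / (noise_sd * sqrt(1 / k + 1 / (n - k)))
            if stat > best[0]:
                best = (stat, i, j)
    return best


def segment(values, noise_sd, threshold=4.0, offset=0):
    """Return sorted breakpoint indices found by recursive splitting.

    threshold is a hypothetical fixed cut-off standing in for the
    permutation test used by the published method.
    """
    if len(values) < 2:
        return []
    stat, i, j = best_arc(values, noise_sd)
    if stat < threshold:
        return []
    breaks = [offset + b for b in (i, j) if 0 < b < len(values)]
    for a, b in ((0, i), (i, j), (j, len(values))):
        breaks += segment(values[a:b], noise_sd, threshold, offset + a)
    return sorted(set(breaks))
```

For a profile of 20 windows at a value of 2, then 5 windows at 6, then 20 windows at 2, with `noise_sd=0.5`, the sketch returns breakpoints at indices 20 and 25, the two edges of the small gained region.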
DNAcopy allows the user to set different parameters which determine how the
segmentation is performed, and allow the algorithm to be “tuned” for better
performance on a particular dataset (for example, to reduce false positives from data
with a low signal-to-noise ratio) (Lai et al., 2005). There is little available information on
how to best choose parameters for segmentation, although a study on the GLAD
algorithm concluded that the choice of parameters had minimal effect on their data set
(Rigaill et al., 2008). DNAcopy had been previously used to segment breast cancer cell
line CGH data using the default parameters (Venkatraman and Olshen, 2007), and I
changed only one parameter from the default, which was to use an “undo” method to
remove breakpoints detected due to local trends in the data, as is recommended by the
authors of the method. The parameters were tested by comparing the segments
produced using different parameter sets for chromosomes from MDA-MB-134 and
comparing them to the segmentation produced by the PICNIC algorithm from Affymetrix
SNP6 data. The chosen parameters correctly detected known copy number changes
without adding extra segments which did not appear to be supported by the data. The
exception is at the centromeres where additional copy number segments may be called
due to incorrect mapping of reads to repeat regions. A known weakness of the circular
binary segmentation method is an inability to detect small regions of copy number
change in the middle of chromosomes (Olshen et al., 2004), but DNAcopy successfully
segments the small amplification on MDA-MB-134 chromosome 8 (Figure 6.14).
Figure 6.14. Segmented copy number plot for part of MDA-MB-134 chromosome 8. A
weakness of circular binary segmentation is the inability to detect small regions of copy
number change, but using DNAcopy the small amplification at 21.8Mb is successfully
segmented.
6.6 Discussion
The bioinformatic pipeline was validated by using data from a number of cell lines to
assess how well it performed the analysis we required.
The structural variant calls from MDA-MB-134 have been the most thoroughly analysed
(see Chapter 5 for details). My analysis prioritised the validation of translocations and
large genome rearrangements, as they were more likely to affect genes. Of the 42 large
rearrangements predicted by the pipeline, 7 were filtered out by re-aligning the reads
using BLAT, 4 could not be validated by PCR, and 7 were also present in a pool of normal
female DNA, leaving 24 which could be validated and sequenced, or around 57% of the
predicted variants.
The 3 different ways that false positive variants were identified suggest there are
multiple ways that the analysis could be improved. Improving the alignment steps of the
pipeline is an easy way to remove the structural variants that are caused by
misalignment. The variants that could not be validated by PCR are more difficult to
address, as it is not possible to tell whether they are spurious structural variant calls, or
whether they are true variants which fall in a region that is difficult to PCR. The variants
that are also present in the normal female DNA do not represent a failure of the
bioinformatic pipeline, as they are real variants which are present in the sequencing, but
future sequencing projects will involve tumour genomes sequenced alongside the
matched normal genome. Bioinformatic methods can be used to filter out any variants
that appear in the normal genome. The smaller predicted structural variants have not
been validated, so the number of false positives is unknown.
The fusion gene predictions were tested in cell lines by looking for expression of a fusion
transcript. The number of predicted fusions which were expressed ranges from 0 out of
4 in MDA-MB-134 to 7 out of 20 in ZR-75-30, with 3 out of 27 in VP229 and VP267. These
figures are for gene fusions only, not including readthrough fusions or internal
rearrangements which have not yet been completely tested in any cell line except MDA-
MB-134. The Stephens et al. (2009b) study found a higher proportion of their predicted
fusion genes were expressed than in my study, but their predictions missed at least one
expressed fusion gene, suggesting that my fusion prediction pipeline may predict more
fusion genes which prove not to be expressed, but also find real fusion genes which
other studies would miss. They also found that fewer of their predicted fusion genes
were expressed in amplified regions, and all of these cell lines contain highly rearranged
amplicons which may contribute to the lower number of expressed fusions.
The copy number segments which were predicted for MDA-MB-134 agree with both our
previous knowledge of MDA-MB-134 from FISH and microarray studies (Paterson et al.,
2007), and the breakpoints which are known to base pair resolution fall within the
breakpoint regions predicted by DNAcopy. Although the methods used to produce copy
number segments are accurate, a disadvantage of the methods is that the breakpoints
cannot be determined to a higher resolution than the size of the windows used to divide
up the reads. As sequence coverage increases, a smaller window size can be used to
refine the breakpoints, but a better method is to use a breakpoint calling technique
which is not reliant on fixed window size. The SegSeq algorithm uses hidden Markov
models to predict copy number change points, and even at relatively low coverage
(~15million 36bp reads) it can predict breakpoints to within 1kb (Chiang et al., 2009).
The rSW-seq algorithm, which uses a Smith-Waterman approach to map breakpoints, is
another copy number prediction algorithm developed to avoid the use of windows and
improve resolution. While both these algorithms identify breakpoints at higher
resolution than my window-based approach, they both require normal sequences to
compare against the tumour sequences, and they have so far been tested using cell line
data which will not have the problems of stromal contamination found in tumour
samples. Using a matched normal sample and calling the differences between the
tumour and the normal sample also removes the problem of GC content bias, as the
biases should be the same in both samples.
Chapter 7
Discussion
Chapter 7 Discussion
191
7.1 How prevalent are fusion genes in breast cancer?
Part of the central hypothesis of my thesis is that the prevalence of fusion genes in solid
tumours has been underestimated, and my study of breast cancer cell lines supports this idea.
At the start of my project, only 5 fusion genes had been found in breast cancer, all in cell lines –
a fusion of ODZ4 to NRG1 in the cell line MDA-MB-175 (Liu et al., 1999), a fusion of FHIT to a
cDNA later identified as MACROD2 in BrCa-MZ-02 (Popovici et al., 2002), a fusion of BCAS4 to
BCAS3 in MCF7 (Barlund et al., 2002), and the RIF1-PKD1L1 and TAX1BP1-AHCY fusions recently
found in HCC1806 (Howarth et al., 2008). In HCC1806, which has two known fusion genes, I
found no further fusion genes, and in MDA-MB-134 no fusion genes were found. However, the
predictions of fusion genes I made in the cell lines ZR-75-30 and the paired cell lines
VP267/VP229 using high-throughput sequencing data were validated and showed that ZR-75-30
has at least seven expressed fusion genes and VP267 and VP229 have three expressed fusions.
This agrees with the recent data from Stephens et al. (2009), which found between zero and
eleven expressed fusion genes per tumour or cell line. Stephens et al. estimate they would have
found 50% of the rearrangements present in their samples, which suggests that there are
further fusion genes they have missed due to low coverage of the genome, and this is likely to
be true of the cell lines I have investigated by sequencing – MCF7, which has been investigated
by both genomic sequencing (Hampton et al., 2009) and paired-end transcriptome analysis
(Ruan et al., 2007), has 13 known expressed fusion transcripts (Table 7.1).
One possible explanation for why no further fusion genes were found in HCC1806 is that fusion
genes are more likely to be found at balanced rather than unbalanced breakpoints. HCC1806
was first selected for array painting partly because the SKY karyotype suggested that it had
balanced translocations, and the two known fusion genes in HCC1806, RIF1-PKD1L1 and
TAX1BP1-AHCY, both involve balanced chromosome rearrangements. No further fusion genes
were found when I investigated the unbalanced rearrangements in this cell line. It is probable
that balanced rearrangements, which do not cause a gain or loss of material, are selected
because they have an effect on genes in other ways, and balanced rearrangements may
produce fusion genes more often than unbalanced rearrangements.
Another question is whether the lack of fusion genes found in MDA-MB-134 was due to
technical difficulties of finding fusion genes, or whether it is a true result. The majority of the
rearrangements found in MDA-MB-134 were in the amplicon, which was as expected based on
the low-coverage sequencing, which was unlikely to find all the breakpoints in non-amplified
regions. All of the 24 validated structural variants found by paired-end sequencing in MDA-MB-
134 were part of the amplicon, and none of the 9 rearrangements outside the amplicon
suggested by copy number changes were detected by paired-end sequencing. This is a lower
number of rearrangements detected than might be expected, which may be due to the repetitive
nature of the breakpoint regions (see further discussion below). No structural variants were
found to suggest that MDA-MB-134 contains any balanced rearrangements, which may be
more likely to produce fusion genes. Additionally, the complex nature of the amplifications in
MDA-MB-134 may make it less likely that rearrangements in the amplicon will produce fusion
genes, as the two fusion partners may be further rearranged by a nearby breakpoint, or may be
missing the nearby DNA sequences needed to form a stable transcript (Stephens et al., 2009).
A hypothesis put forward in Paterson et al. (2007) was that a fusion gene formed from co-
amplification of chromosomes 8 and 11 might drive this amplification. No fusion gene has been
found in MDA-MB-134 to support this hypothesis. It is more likely that the driver of
amplification is co-expression of two genes in the amplicon, as suggested by Kwek et al. (2009).
7.2 How important are fusion genes in breast cancer?
In contrast to the 5 known fusion genes at the start of my project, there are now 64 known
fusion genes across a range of breast cancer cell lines and tumours (Table 7.1). Although
increasing numbers of fusion genes have been found, only one fusion is thought to be recurrent
in breast cancer. The EML4-ALK translocation first reported in non-small cell lung cancer (Soda
et al., 2007; Rikova et al., 2007) has also been reported as present in 2.4% of breast tumours
(Lin et al., 2009), although this contradicts an earlier study which found that the transcript is
specific to NSCLC and not found in breast tumours (Fukuyoshi et al., 2008).
5' Gene 3' Gene In-frame? Cell line or tumour Source
ZMYM4 OPRD1 No ZR-75-30 This study
COL14A1 SKAP1 Yes ZR-75-30 This study
APPBP2 PHF10L1 Yes ZR-75-30 This study
TAOK1 PCGF2 Yes ZR-75-30 This study
USP32 CCDC49 No ZR-75-30 This study
BCAS3 HOXB9 No ZR-75-30 This study
TIAM1 NRIP1 Yes ZR-75-30 This study
PDLIM1 ZBBX No VP267/VP229 This study
FAM125B SPTLC1 No VP267/VP229 This study
MDS1 KCNMA1 Yes VP267/VP229 This study
EML4 ALK Yes 5 breast tumours Lin et al., 2009
PLXND1 TMCC1 Yes HCC1187 Stephens et al., 2009
RGS22 SYCP1NM Yes HCC1187 Stephens et al., 2009
EFTUD2 KIF18B Yes HCC1395 Stephens et al., 2009
ERO1L FERMT2 Yes HCC1395 Stephens et al., 2009
PLA2R1 RBMS1 Yes HCC1395 Stephens et al., 2009
CYTH1 PRPSAP1 Yes HCC1599 Stephens et al., 2009
NFIA EHF Yes HCC1937 Stephens et al., 2009
STRADB noP58 Yes HCC1954 Stephens et al., 2009
INTS4 GAB2 Yes HCC2157 Stephens et al., 2009
RASA2 ACPL2 Yes HCC2157 Stephens et al., 2009
SMYD3 ZNF695 Yes HCC2157 Stephens et al., 2009
ACBD6 RRP15 Yes HCC38 Stephens et al., 2009
LDHC SERGEF Yes HCC38 Stephens et al., 2009
MBOAT2 PRKCE Yes HCC38 Stephens et al., 2009
SLC26A6 PRKAR2A Yes HCC38 Stephens et al., 2009
SMF PPARGC1B Yes HCC38 Stephens et al., 2009
RAF1 DAZL Yes PD3664a Stephens et al., 2009
AC141586.2 CCNF Yes PD3670a Stephens et al., 2009
SEPT8 AFF4 Yes PD3670a Stephens et al., 2009
ETV6 ITPR2 Yes PD3688a Stephens et al., 2009
KCNQ5 RIMS1 Yes HCC1395 Stephens et al., 2009
HN1 USH1G Yes PD3693a Stephens et al., 2009
AGPAT5 MCPH1 No HCC1187 Stephens et al., 2009
CTAGE5 SIP1 No HCC1187 Stephens et al., 2009
PLXND1 TMCC1 No HCC1187 Stephens et al., 2009
SUSD1 ROD1 No HCC1187 Stephens et al., 2009
EIF3K CYP39A1 No HCC1395 Stephens et al., 2009
IL6R ATP8B2 No HCC2157 Stephens et al., 2009
RBM14 PACS1 No HCC2157 Stephens et al., 2009
FBXL18 RNF216 No PD3670a Stephens et al., 2009
ITPR2 ETV6 No PD3688a Stephens et al., 2009
GRB7 PERLD1 No HCC2218 Stephens et al., 2009
HDAC11 FBLN2 No PD3670a Stephens et al., 2009
FGFR1 ZNF703 No PD3690a Stephens et al., 2009
SSH2 SUZ12 No PD3693a Stephens et al., 2009
RIF1 PKD1L1 Yes HCC1806 Howarth et al., 2008
TAX1BP1 AHCY Yes HCC1806 Howarth et al., 2008
FHIT MACROD2 Yes BrCa-MZ-02 Popovici et al., 2002
BCAS4 BCAS3 Yes MCF7 Barlund et al., 2002
ARGHEF2 SULF2 Yes MCF7 Hampton et al., 2009
DEPDC1B ELOVL7 Yes MCF7 Hampton et al., 2009
RAD51C ATXN7 Yes MCF7 Hampton et al., 2009
SULF2 PRICKLE2 Yes MCF7 Hampton et al., 2009
NPEPPS USP32 Yes MCF7 Hampton et al., 2009
ASTN2 PTPRG Yes MCF7 Hampton et al., 2009
BCAS3 RSBN1 Yes MCF7 Hampton et al., 2009
ASTN2 TBC1D16 Yes MCF7 Hampton et al., 2009
BCAS4 PRKCBP1 Yes MCF7 Hampton et al., 2009
cXorf15 SYAP1 Yes MCF7 Ruan et al., 2007
RPS6KB1 TMEM49 Yes MCF7 Ruan et al., 2007
BRCC3 FUNDC2 Yes MCF7 Ruan et al., 2007
NRG1 ODZ4 Yes MDA-MB-175 Liu et al., 1999
Table 7.1. Fusion genes currently known to be expressed in breast cancer cell lines and
tumours.
BCAS3 is the only other gene known to be recurrently fused in breast cancer, as although a
fusion could not be found in HCC1806 (Chapter 3), the sequencing of ZR-75-30 discovered a
fusion of BCAS3 to HOXB9. However, this did not produce an in-frame product, and involved
the 5’ end of the gene in the fusion as opposed to the 3’ end retained in the known fusion in
MCF7. Similarly, investigation of ODZ4 as a potential recurrent target of fusion did not find it
fused in any cell line other than the previously-known MDA-MB-175 fusion.
Fusion gene recurrence has been previously used to determine whether gene fusions were
important, but the approaches which were used in haematological malignancies may not be the
best way to find important events in solid tumours. Already there is evidence from prostate
cancer that while there are important recurrent fusions such as TMPRSS2-ERG, the same genes
are found fused to different fusion partners, and some of the variant fusions have so far been
found in only one case (Tomlins et al., 2007). There may be no recurrent fusion genes in breast
cancer, but as the number of known fusion genes grows, we may see fusions involving the same
genes fused to different fusion partners, and fusion genes involving different members of the
same pathway which have the same effect on the cell.
7.3 How can fusion genes in breast cancer be found?
Two different approaches have been taken in the search for fusion genes in solid tumours. One
approach looks for a fusion transcript and then locates the genomic rearrangement which
produces the fusion. This approach has successfully found fusion genes by several different
methods: using expression array data to look for genes which are overexpressed and potentially
fused (Tomlins et al., 2005; Lin et al., 2009); using retroviral expression libraries to find novel
transforming genes (Soda et al., 2007); using proteomic approaches to look at fusion proteins
directly (Rikova et al., 2007); and more recently by using RNA-seq to find fusion genes across
the whole transcriptome (Maher et al., 2009), a technique which will also find fusion genes
that are not the result of a genomic rearrangement, such as readthrough fusions between two
adjacent genes (Berger et al., 2010).
The second approach searches for genomic rearrangements, and looks for the fusion genes
which may result from the genomic rearrangement. This is the approach I have taken. Over
the course of this study, the technology I used to map chromosome rearrangements in breast
cancer moved from low-resolution array painting, to high-resolution whole genome arrays, and
finally to high-throughput sequencing. All of these different approaches were used in turn to
map the rearrangements in MDA-MB-134.
Array painting is useful to determine which chromosome fragments are present in a derivative
chromosome, and hence discover which breakpoints are joined together. Prior to high-
throughput sequencing, this was the only way to determine which chromosome breakpoints
were joined together, and while paired-end sequencing can also determine the two sides of a
breakpoint, it cannot assemble the whole derivative chromosome. High-throughput sequencing
is also unable to detect rearrangements near the centromeres and telomeres by finding the
sequences crossing the breakpoint, as the sequences produced from these repetitive regions
cannot be unambiguously aligned to a reference genome. The der(15)t(15;17) and
der(16)t(16;18) chromosomes in MDA-MB-134 are examples of rearrangements which
would not be detected using high-throughput sequencing. However, the loss of material caused
by the unbalanced translocation would still be seen from copy number plots taken from high-
throughput sequencing, even if the exact breakpoint is not known. It is likely that a breakpoint
in the repetitive regions near the centromeres and telomeres is not affecting genes directly, but
that the loss of material is the important event.
Although the array painting in this study was carried out on low-resolution 1Mb arrays,
individual breakpoints can be mapped using high-resolution custom Nimblegen arrays (Gribble
et al., 2007; Howarth et al., 2008), and higher-resolution arrays such as the SNP6.0 array could
be used for array painting as well as for whole-genome array CGH. As the SNP6.0 array has
probes to detect genotype as well as copy number, the genotypes of different derivative
chromosomes could be compared to give information about karyotype evolution, as
chromosomes which evolved from the same parental copy could be identified.
High-throughput sequencing has the potential to replace array-based methods of mapping
chromosome rearrangements. As well as paired-end sequencing to provide information on both
sides of a breakpoint, the sequences can be used to give high-resolution copy number
information. Array painting could also be replaced with paired-end high-throughput
sequencing, as it is possible to sequence sorted chromosomes individually (Chen et al., 2010).
However, as well as the known biases caused by GC and repeat content, there may be other,
less obvious and as yet undiscovered biases in the sequencing data which could affect copy
number estimates.
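As an illustration of how read depth can be turned into copy number information, and of one
simple way of reducing GC bias, the sketch below counts reads in fixed-size windows and divides
each count by the median count of windows with similar GC content. This is a minimal sketch and
not the pipeline used in this study; the function names, the window size, the GC bin width and
the toy data are all assumptions made for the example.

```python
from statistics import median


def window_counts(read_starts, window_size, n_windows):
    """Count read start positions that fall into each fixed-size window."""
    counts = [0] * n_windows
    for pos in read_starts:
        idx = pos // window_size
        if 0 <= idx < n_windows:
            counts[idx] += 1
    return counts


def gc_corrected_ratios(counts, gc_fractions, gc_bin_width=0.05):
    """Divide each window count by the median count of windows in the same GC bin.

    A ratio near 1 means typical depth for that GC content; a ratio near 2
    would be consistent with a doubling of copy number relative to the median.
    """
    bins = {}
    for count, gc in zip(counts, gc_fractions):
        bins.setdefault(int(gc / gc_bin_width), []).append(count)
    medians = {b: median(values) for b, values in bins.items()}

    ratios = []
    for count, gc in zip(counts, gc_fractions):
        m = medians[int(gc / gc_bin_width)]
        ratios.append(count / m if m > 0 else float("nan"))
    return ratios


# Toy data (hypothetical): six windows in two GC classes, one gained window in each.
counts = [100, 100, 200, 50, 50, 100]
gc = [0.41, 0.42, 0.43, 0.61, 0.62, 0.63]
print(gc_corrected_ratios(counts, gc))
```

In the toy data the high-GC windows have half the raw depth of the low-GC windows, but after
correction the gained window in each class stands out equally. A bias which is not captured by
the chosen covariate would remain in the ratios, which is the concern raised above.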
As noted above, a potential problem with the use of high-throughput sequencing is aligning
sequences to repetitive regions. Although many of the copy number steps seen in MDA-MB-134
had structural variants associated with them, there were copy number changes which did not
have any associated structural variants. One possible explanation for this is low coverage, but
breakpoints at comparable copy number to the missing breaks were found with multiple
supporting reads, and relaxing the criteria for calling a structural variant to require only one
read across the breakpoint did not find any potential structural variants associated with these
breakpoints. Another possibility is that the breakpoints are in a region which is highly repetitive,
so that although there are paired-end reads which span the breakpoint, none of these read pairs
mapped uniquely to the genome and all were discarded.
7.4 Mechanisms of chromosome rearrangement in breast cancer
Analysis of the MDA-MB-134 amplicon supports the breakage-fusion-bridge cycle model of
amplicon formation. Many of the junctions involved in the amplicon are inversions, which are a
signature of breakage-fusion-bridge cycles of amplification. An alternative model is the
translocation-excision-amplification model (Van Roy et al., 2006), which proposes that the
amplified regions are first excised from chromosomes and amplified as double minute
chromosomes, which then re-integrate into the genome (Storlazzi et al., 2010). This model
requires a translocation between chromosomes 8 and 11 in MDA-MB-134, which does appear
to have occurred, but in the translocation-excision-amplification model the sequences which
were excised from the translocated chromosome re-integrate at a different site than they were
excised from (Corvi et al., 1994), and the amplicon in MDA-MB-134 appears to have been
amplified in place.
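To make concrete how an inversion junction is recognised in paired-end data, the sketch below
classifies read pairs by chromosome, orientation and separation, and then reports the fraction
of pairs which suggest an inversion. It assumes a standard forward-reverse paired-end library
and uses leftmost alignment coordinates as a simplification; the function name, the insert size
cut-off and the example pairs are hypothetical and are not taken from the MDA-MB-134 data.

```python
from collections import Counter


def classify_pair(chrom1, pos1, strand1, chrom2, pos2, strand2, max_insert=500):
    """Classify one read pair from a forward-reverse paired-end library.

    Positions are leftmost alignment coordinates; strands are "+" or "-".
    """
    if chrom1 != chrom2:
        return "interchromosomal"
    # Put the leftmost read first so that orientation can be read off directly.
    if pos1 > pos2:
        pos1, strand1, pos2, strand2 = pos2, strand2, pos1, strand1
    if strand1 == strand2:
        return "inversion"            # both reads on the same strand
    if strand1 == "-":
        return "tandem_duplication"   # reads point away from each other
    if pos2 - pos1 > max_insert:
        return "deletion"             # expected orientation, too far apart
    return "normal"


# Hypothetical read pairs: (chrom1, pos1, strand1, chrom2, pos2, strand2).
pairs = [
    ("11", 1000, "+", "11", 1300, "-"),
    ("11", 1000, "+", "11", 90000, "+"),
    ("11", 2000, "-", "11", 70000, "-"),
    ("8", 500, "+", "11", 4000, "-"),
]
tally = Counter(classify_pair(*p) for p in pairs)
print(tally["inversion"] / len(pairs))
```

Each label is only what the pair is consistent with; a real caller would require several
independent pairs supporting the same junction before reporting it.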
The microhomology at the breakpoints in MDA-MB-134 suggests two different mechanisms are
involved. No homology or very short regions of microhomology, with small insertions at the
junction, suggest non-homologous end-joining as a mechanism for double-strand break repair
(Hastings et al., 2009), and this is seen in the majority of the junctions in MDA-MB-134. The
longer region of homology seen at one breakpoint suggests the alternative pathway of
microhomology-mediated end-joining is also operative in MDA-MB-134 (McVey and Lee, 2008).
The mechanisms that control which method of repair is used are not well understood, but studies in
urothelial cancer suggest that cancer cells may preferentially use the more error-prone non-
homologous end-joining even when there is sufficient microhomology for microhomology-
mediated end-joining to be used (Windhofer et al., 2008). Alternatively, microhomology-
mediated end-joining may be a “back-up” mechanism used only when non-homologous end-
joining is unavailable (Lieber, 2010).
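The length of microhomology at a junction can be measured by comparing the reference sequence
on either side of the two breakpoints. The sketch below is one way of doing this, assuming the
derivative chromosome joins the sequence to the left of breakpoint A to the sequence to the
right of breakpoint B. The function names and example sequences are hypothetical, and the
5 bp threshold used to separate the two repair pathways is an illustrative assumption rather
than a value taken from this study.

```python
def microhomology_length(a_left, a_right, b_left, b_right):
    """Count bases at a junction which could have come from either partner.

    a_left / a_right: reference sequence immediately left / right of breakpoint A.
    b_left / b_right: reference sequence immediately left / right of breakpoint B.
    The junction is assumed to join a_left to b_right.
    """
    # Bases after breakpoint A which match the start of the retained B sequence.
    forward = 0
    for x, y in zip(a_right, b_right):
        if x != y:
            break
        forward += 1
    # Bases before breakpoint B which match the end of the retained A sequence.
    backward = 0
    for x, y in zip(reversed(a_left), reversed(b_left)):
        if x != y:
            break
        backward += 1
    return forward + backward


def suggested_pathway(mh_length, mmej_min=5):
    """Label a junction by microhomology length (threshold is an assumption)."""
    return "MMEJ-like" if mh_length >= mmej_min else "NHEJ-like"


# Hypothetical sequences around two breakpoints.
mh = microhomology_length("ACGTTAGC", "GGATCCA", "TTTTCCGC", "GGACTTA")
print(mh, suggested_pathway(mh))
```

Because the position of the breakpoint within a stretch of microhomology is arbitrary, both
directions are counted and summed.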
The karyotype of HCC1806 shows a number of tandem duplications. This could be an example
of the particular “mutator phenotype” suggested by Stephens et al. (2009), where an unknown
mechanism probably related to DNA damage repair produces large numbers of tandem
duplications. HCC1806 is consistent with the observation that these mutator phenotype
tumours are ER and PR negative, and do not have BRCA1 or BRCA2 mutations.
7.5 Future Directions
7.5.1 ODZ4
The high-throughput sequencing of MDA-MB-134 predicted that ODZ4, which is known to be
fused to NRG1 in the breast cancer cell line MDA-MB-175, may be fused to KLHL35 in
MDA-MB-134. Although there is no expression of the predicted fusion in MDA-MB-134, ODZ4 is
overexpressed relative to the normal breast cell line, and it is more highly expressed than might
be expected based on dosage effects, as while it is part of the amplified region on chromosome
11 it is at the outer edge of the amplicon, and is not at high copy number. This suggests that
there is a mechanism other than extra copies driving the overexpression of this gene.
Further studies could investigate the mechanism of ODZ4 overexpression, which may be
unrelated to its presence in an amplified region, but could potentially be due to chromosome
rearrangements placing ODZ4 under the control of a different promoter, or near an amplified
enhancer. Mutations in ODZ4 have also been found in pancreatic cancer (Yachida et al., 2010),
suggesting that ODZ4 may contribute to tumorigenesis by other mechanisms which do not
cause overexpression.
ODZ4 has not previously been suggested as important in 11q13 amplification, but my studies
suggest that it may be upregulated by a combination of amplification and other mechanisms,
and a study which looks for candidate genes using the minimal region of amplification would
not look at ODZ4. A number of other criteria have been suggested for identifying the important
genes in an amplified region, such as correlation with clinical outcome
and analysis of biological activity using siRNA knockdowns (Santarius et al., 2010), and further
studies of ODZ4 could investigate its importance by methods other than looking at gene
amplification and expression.
7.5.2 High-throughput sequencing and bioinformatic analysis
Methods for sequence alignment and analysis are constantly being improved, and since the
bioinformatic pipeline described in this study was developed, it has been updated with a newer
generation of alignment and analysis programs. MAQ cannot align longer reads, so the primary
alignment is now performed using BWA (Li and Durbin, 2009), with a more sensitive re-alignment
step using Novoalign (www.novocraft.com). Picard (picard.sourceforge.net) is used for
duplicate calling and to assess the depth of the library.
Wet-lab validation of the results of high-throughput sequencing is expensive and time-
consuming, and it is important to improve the bioinformatic analysis of the sequencing to
remove as many false positive and misleading results as possible before any downstream
analysis is done.
One area for improvement is in the initial sequence alignment. Sequence misalignment is one
possible cause of artefactual structural variants, and several of the structural variant calls in
MDA-MB-134 proved to be due to sequence misalignment. As sequence coverage increases,
the number of misaligned reads will increase, and depending on the probability of a sequence
being misaligned, the number of artefactual structural variants called may increase faster than
the number of real structural variants (Figure 7.1). A way to decrease the problems of
misalignment is to use a two-stage alignment process, as has been implemented in the latest
version of the bioinformatic pipeline described in Chapter 6. The initial alignment step uses a
fast but less sensitive alignment program, such as BWA (Li and Durbin, 2009) or Bowtie
(Langmead et al., 2009), to align the large numbers of reads produced; such a program would
miss the true alignments of a small number of reads. The possibly aberrant reads called from
this first-pass alignment would be realigned using a more sensitive algorithm, such as BFAST,
which is specifically designed for sensitive alignment of short reads (Homer et al., 2009). As
the aberrant reads are a small percentage of the total, it is possible to re-align them with a
much slower algorithm which would be impractical to use on the total set of reads.
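The control flow of the two-stage process can be sketched as follows. This is a schematic only:
the aligners are passed in as functions, and the toy aligners and the insert size cut-off below
are hypothetical stand-ins rather than calls to BWA, Bowtie, BFAST or Novoalign.

```python
def two_stage_align(read_pairs, fast_align, sensitive_align, is_aberrant):
    """Align every pair quickly, then re-align only the apparently aberrant ones."""
    normal, candidates = [], []
    for pair in read_pairs:
        alignment = fast_align(pair)
        if is_aberrant(alignment):
            # Second opinion from the slower, more sensitive aligner.
            alignment = sensitive_align(pair)
        if is_aberrant(alignment):
            candidates.append(alignment)
        else:
            normal.append(alignment)
    return normal, candidates


# Toy stand-ins (hypothetical): a pair is (name, true_insert_size, hard_to_align).
def toy_fast(pair):
    name, true_insert, hard = pair
    return (name, 100_000 if hard else true_insert)  # misplaces "hard" pairs


def toy_sensitive(pair):
    name, true_insert, _ = pair
    return (name, true_insert)  # always finds the true placement


def toy_is_aberrant(alignment, max_insert=500):
    return alignment[1] > max_insert


pairs = [("p1", 300, False), ("p2", 300, True), ("p3", 80_000, False)]
normal, candidates = two_stage_align(pairs, toy_fast, toy_sensitive, toy_is_aberrant)
print([a[0] for a in normal], [a[0] for a in candidates])
```

Only the pair which is still aberrant after sensitive re-alignment is kept as a candidate
structural variant; the pair which was merely misplaced by the fast aligner is rescued. The
slow aligner is called once per aberrant pair, not once per read pair in the library.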
Figure 7.1. Graph to show how artefactual structural variants could increase at a greater rate
than real structural variants.
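One simple way of seeing why this could happen is to treat the number of read pairs supporting
any junction as a Poisson count. The toy model below is a sketch under assumptions of my own
choosing and is not the calculation behind Figure 7.1: it supposes a fixed number of real
junctions, each gaining supporting pairs at a high rate, and a very large number of possible
false junctions, each gaining misaligned pairs at a very low rate. All parameter values are
hypothetical.

```python
import math


def prob_at_least(k, mean):
    """P(X >= k) for a Poisson-distributed count X with the given mean."""
    below = sum(math.exp(-mean) * mean**i / math.factorial(i) for i in range(k))
    return 1.0 - below


def expected_calls(coverage, n_real=100, real_rate=0.5,
                   n_sites=1_000_000, error_rate=1e-4, min_reads=2):
    """Expected numbers of real and artefactual calls at a given coverage.

    real_rate and error_rate are supporting pairs per junction per unit of
    coverage; a junction is called when it has at least min_reads pairs.
    """
    real = n_real * prob_at_least(min_reads, real_rate * coverage)
    artefact = n_sites * prob_at_least(min_reads, error_rate * coverage)
    return real, artefact


for coverage in (10, 20, 40, 80):
    real, artefact = expected_calls(coverage)
    print(coverage, round(real, 1), round(artefact, 1))
```

Under these assumptions the real calls saturate as soon as most true junctions have enough
supporting pairs, while the artefactual calls roughly quadruple with each doubling of coverage,
because two chance misalignments are needed at the same site. Raising min_reads with coverage
would be one way of holding the artefacts down.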
There are also improvements that could be made to structural variant calling. The strategy I
have used looks at each structural variant individually, and does not link variants together, such
as two structural variants that are on either side of an insertion (Figure 7.2A). Linking structural
variants together can help to identify regions of insertion or inversion, rather than just one side
of an event, but if there is a further rearrangement which has not been detected then the
linkage will be wrong, and so this method is more useful for small rearrangements where it is
less likely that another variant between them has been missed (Medvedev et al., 2009).
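A minimal sketch of linking two junctions into one insertion event is given below. It assumes
each junction has already been oriented so that side A is the putative insertion site and side
B is the source of the inserted segment; the class, the function name and both distance
thresholds are hypothetical choices for the example and do not reproduce the method of
Medvedev et al. (2009).

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Junction:
    chrom_a: str  # insertion-site side
    pos_a: int
    chrom_b: str  # source-segment side
    pos_b: int


def link_insertions(junctions, site_tolerance=50, max_segment=100_000):
    """Pair junctions which share an insertion site and a nearby source region.

    Returns tuples of (site_chrom, site_pos, source_chrom, source_start, source_end).
    """
    linked = []
    for j1, j2 in combinations(junctions, 2):
        same_site = (j1.chrom_a == j2.chrom_a
                     and abs(j1.pos_a - j2.pos_a) <= site_tolerance)
        same_source = (j1.chrom_b == j2.chrom_b
                       and 0 < abs(j1.pos_b - j2.pos_b) <= max_segment)
        if same_site and same_source:
            start, end = sorted((j1.pos_b, j2.pos_b))
            linked.append((j1.chrom_a, min(j1.pos_a, j2.pos_a),
                           j1.chrom_b, start, end))
    return linked


# Hypothetical junctions: the first two flank a segment of chromosome 8
# inserted into chromosome 11; the third is unrelated.
junctions = [
    Junction("11", 1000, "8", 5000),
    Junction("11", 1010, "8", 9000),
    Junction("3", 200, "17", 700),
]
print(link_insertions(junctions))
```

The max_segment limit reflects the caveat above: the larger the segment allowed between the two
junctions, the greater the chance that an undetected rearrangement lies within it and the
linkage is wrong.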
Another improvement to structural variant calling is to use the sequences that span the
breakpoint junction to identify the breakpoint to base pair resolution without PCR validation
and sequencing (Figure 7.2B). A sequence that maps to two regions of the genome will not be
aligned using current alignment algorithms, which do not look for split mapping reads because
this would slow down the alignment. This strategy would not work with 37bp reads because
they would produce too many possible alignments, but as sequence reads become longer they
can be re-aligned to look for split mappings and find the breakpoint junctions. Split mapping of
reads becomes more important as longer sequence reads are produced from small insert
libraries, as more of each fragment is sequenced and there is more chance of a breakpoint
falling within a read rather than in between the paired-end reads.
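The effect of read length can be estimated with a simple calculation. If a breakpoint is
equally likely to fall anywhere in a sequenced fragment, the chance that it lies within one of
the two reads is the fraction of the fragment which is sequenced. The sketch below assumes
this uniform model and ignores the bases needed to anchor each part of a split read; the
function name and the 400bp fragment length are hypothetical.

```python
def prob_breakpoint_in_read(read_length, fragment_length):
    """Fraction of a paired-end fragment covered by its two reads.

    Under a uniform model this is the probability that a breakpoint within
    the fragment falls inside a read and not in the unsequenced gap.
    """
    if read_length <= 0 or fragment_length <= 0:
        raise ValueError("lengths must be positive")
    return min(1.0, 2 * read_length / fragment_length)


# Hypothetical 400bp fragments sequenced with increasing read lengths.
for read_length in (37, 75, 100, 250):
    print(read_length, prob_breakpoint_in_read(read_length, 400))
```

Once the two reads together are as long as the fragment, every breakpoint in the fragment falls
within a read, and split mapping becomes the only way to detect it.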
Figure 7.2. Improvements to structural variant calling. A: how two pairs of reads spanning the
two sides of an insertion would align to the reference genome. B: how a read spanning a
breakpoint would align to the reference genome in two locations. (Diagram adapted from
Medvedev et al. (2009)).
7.6 Conclusion
From the results in this study and others which have been published over the course of my
project, it is clear that there are fusion genes in breast cancer. High-throughput genomic and
transcriptome sequencing makes it easier to find fusion genes than with previous array- and
FISH-based methods, and as the number of sequenced breast cancer genomes increases, the
number of fusion genes found will increase as well. The importance of fusion genes in breast
cancer has yet to be demonstrated, as recurrent gene fusions have yet to be found, and
functional studies will be necessary to link fusion genes to carcinogenesis, but given the
importance of fusion genes in other cancers it seems likely that fusion genes will also be
important in breast cancer. As the number of known fusions increases, it will be easier to find
pathways which are recurrently involved in fusion genes, and a combined analysis of fusion
genes along with copy-number alterations, mutations and epigenetic modifications will provide
an overall picture of the genes which are involved in breast cancer.
References
Alberti, L., Carniti, C., Miranda, C., Roccato, E., and Pierotti, M. A. (2003). RET and NTRK1
proto‐oncogenes in human diseases. Journal of Cellular Physiology 195, 168‐186.
Albertson, D. G., Collins, C., McCormick, F., and Gray, J. W. (2003). Chromosome
aberrations in solid tumors. Nat. Genet. 34, 369‐376.
Al‐Kuraya, K., Schraml, P., Torhorst, J., Tapia, C., Zaharieva, B., Novotny, H., Spichtin, H.,
Maurer, R., Mirlacher, M., Köchli, O., et al. (2004). Prognostic relevance of gene
amplifications and coamplifications in breast cancer. Cancer Res. 64, 8534‐8540.
Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. (1990). Basic local
alignment search tool. J. Mol. Biol. 215, 403‐410.
Andersen, C. L., Monni, O., Wagner, U., Kononen, J., Barlund, M., Bucher, C., Haas, P.,
Nocito, A., Bissig, H., Sauter, G., et al. (2002). High‐throughput copy number analysis of
17q23 in 3520 tissue specimens by fluorescence in situ hybridization to tissue
microarrays. Am. J. Pathol. 161, 73‐79.
Armitage, P., and Doll, R. (1957). A two‐stage theory of carcinogenesis in relation to the
age distribution of human cancer. Br. J. Cancer 11, 161‐169.
Armitage, P., and Doll, R. (1954). The age distribution of cancer and a multi‐stage theory
of carcinogenesis. Br. J. Cancer 8, 1‐12.
ar‐Rushdi, A., Nishikura, K., Erikson, J., Watt, R., Rovera, G., and Croce, C. M. (1983).
Differential expression of the translocated and the untranslocated c‐myc oncogene in
Burkitt lymphoma. Science 222, 390‐393.
Arvand, A., and Denny, C. T. (2001). Biology of EWS/ETS fusions in Ewing's family
tumors. Oncogene 20, 5747‐5754.
Aulmann, S., Schnabel, P. A., Helmchen, B., Dienemann, H., Drings, P., Otto, H. F., and
Sinn, H. P. (2005). Immunohistochemical and cytogenetic characterization of
acantholytic squamous cell carcinoma of the breast. Virchows Archiv 446, 305‐309.
Banerjee, S., Dowsett, M., Ashworth, A., and Martin, L. (2007). Mechanisms of Disease:
angiogenesis and the management of breast cancer. Nat. Clin. Prac. Oncol. 4, 536‐550.
Bardelli, A., Cahill, D. P., Lederer, G., Speicher, M. R., Kinzler, K. W., Vogelstein, B., and
Lengauer, C. (2001). Carcinogen‐specific induction of genetic instability. Proc. Natl. Acad.
Sci. U.S.A. 98, 5770‐5775.
Bärlund, M., Monni, O., Kononen, J., Cornelison, R., Torhorst, J., Sauter, G., Kallioniemi,
O. P., and Kallioniemi, A. (2000). Multiple genes at 17q23 undergo amplification and
overexpression in breast cancer. Cancer Res. 60, 5340‐5344.
Bärlund, M., Monni, O., Weaver, J. D., Kauraniemi, P., Sauter, G., Heiskanen, M.,
Kallioniemi, O. P., and Kallioniemi, A. (2002). Cloning of BCAS3 (17q23) and BCAS4
(20q13) genes that undergo amplification, overexpression, and fusion in breast cancer.
Genes Chromosomes Cancer 35, 311‐317.
Bautista, S., and Theillet, C. (1998). CCND1 and FGFR1 coamplification results in the
colocalization of 11q13 and 8p12 sequences in breast tumor nuclei. Genes
Chromosomes Cancer 22, 268‐277.
Beaudet, A. L., and Belmont, J. W. (2008). Array‐based DNA diagnostics: let the
revolution begin. Annu. Rev. Med. 59, 113‐129.
Beckman, R. A., and Loeb, L. A. (2006). Efficiency of carcinogenesis with and without a
mutator mutation. Proc. Natl. Acad. Sci. U.S.A. 103, 14140‐14145.
Beerenwinkel, N., Antal, T., Dingli, D., Traulsen, A., Kinzler, K. W., Velculescu, V. E.,
Vogelstein, B., and Nowak, M. A. (2007). Genetic progression and the waiting time to
cancer. PLoS Comput. Biol. 3, e225.
Benner, S. E., Wahl, G. M., and Von Hoff, D. D. (1991). Double minute chromosomes and
homogeneously staining regions in tumors taken directly from patients versus in human
tumor cell lines. Anticancer Drugs 2, 11‐25.
Berger, M. F., Levin, J. Z., Vijayendran, K., Sivachenko, A., Adiconis, X., Maguire, J.,
Johnson, L. A., Robinson, J., Verhaak, R. G., Sougnez, C., et al. (2010). Integrative analysis
of the melanoma transcriptome. Genome Res. 20, 413‐427.
Bernards, R., and Weinberg, R. A. (2002). A progression puzzle. Nature 418, 823.
Bignell, G. R., Santarius, T., Pole, J. C., Butler, A. P., Perry, J., Pleasance, E., Greenman, C.,
Menzies, A., Taylor, S., Edkins, S., et al. (2007). Architectures of somatic genomic
rearrangement in human cancer amplicons at sequence‐level resolution. Genome Res.
17, 1296‐1303.
Bignell, G. R., Greenman, C. D., Davies, H., Butler, A. P., Edkins, S., Andrews, J. M., Buck,
G., Chen, L., Beare, D., Latimer, C., et al. (2010). Signatures of mutation and selection in
the cancer genome. Nature 463, 893‐898.
Bignell, G. R., Huang, J., Greshock, J., Watt, S., Butler, A., West, S., Grigorova, M., Jones,
K. W., Wei, W., Stratton, M. R., et al. (2004). High‐resolution analysis of DNA copy
number using oligonucleotide microarrays. Genome Res. 14, 287‐295.
Bodmer, W. (2008). Genetic instability is not a requirement for tumor development.
Cancer Res. 68, 3558‐3561.
Boehm, T., Foroni, L., Kaneko, Y., Perutz, M. F., and Rabbitts, T. H. (1991). The rhombotin
family of cysteine‐rich LIM‐domain oncogenes: distinct members are involved in T‐cell
translocations to human chromosomes 11p15 and 11p13. Proc. Natl. Acad. Sci. U.S.A.
88, 4367‐4371.
Bosco, E. E., and Knudsen, E. S. (2007). RB in breast cancer: at the crossroads of
tumorigenesis and treatment. Cell Cycle 6, 667‐671.
Burdall, S. E., Hanby, A. M., Lansdown, M. R., and Speirs, V. (2003). Breast cancer cell
lines: friend or foe? Breast Cancer Res. 5, 89‐95.
Cailleau, R., Olivé, M., and Cruciger, Q. V. (1978). Long‐term human breast carcinoma
cell lines of metastatic origin: preliminary characterization. In Vitro 14, 911‐915.
Cailleau, R., Young, R., Olivé, M., and Reeves, W. J. (1974). Breast tumor cell lines from
pleural effusions. J. Natl. Cancer Inst. 53, 661‐674.
Cairns, J. (2002). Somatic stem cells and the kinetics of mutagenesis and carcinogenesis.
Proc. Natl. Acad. Sci. U.S.A. 99, 10567‐10570.
Campbell, P. J., Stephens, P. J., Pleasance, E. D., O'Meara, S., Li, H., Santarius, T.,
Stebbings, L. A., Leroy, C., Edkins, S., Hardy, C., et al. (2008). Identification of somatically
acquired rearrangements in cancer using genome‐wide massively parallel paired‐end
sequencing. Nat. Genet. 40, 722‐729.
Carmeliet, P., and Jain, R. K. (2000). Angiogenesis in cancer and other diseases. Nature
407, 249‐257.
Chen, W., Ullmann, R., Langnick, C., Menzel, C., Wotschofsky, Z., Hu, H., Doring, A., Hu,
Y., Kang, H., Tzschach, A., et al. (2010). Breakpoint analysis of balanced chromosome
rearrangements by next‐generation paired‐end sequencing. Eur. J. Hum. Genet. 18, 539‐
543.
Chiang, D. Y., Getz, G., Jaffe, D. B., O'Kelly, M. J., Zhao, X., Carter, S. L., Russ, C.,
Nusbaum, C., Meyerson, M., and Lander, E. S. (2009). High‐resolution mapping of copy‐
number alterations with massively parallel sequencing. Nat. Methods 6, 99‐103.
Chin, S. F., Teschendorff, A. E., Marioni, J. C., Wang, Y., Barbosa‐Morais, N. L., Thorne, N.
P., Costa, J. L., Pinder, S. E., van de Wiel, M. A., Green, A. R., et al. (2007). High‐
resolution aCGH and expression profiling identifies a novel genomic subtype of ER
negative breast cancer. Genome Biol. 8, R215.
Ciampi, R., Knauf, J. A., Kerler, R., Gandhi, M., Zhu, Z., Nikiforova, M. N., Rabes, H. M.,
Fagin, J. A., and Nikiforov, Y. E. (2005). Oncogenic AKAP9‐BRAF fusion is a novel
mechanism of MAPK pathway activation in thyroid cancer. J. Clin. Invest. 115, 94‐101.
Conrad, D. F., Pinto, D., Redon, R., Feuk, L., Gokcumen, O., Zhang, Y., Aerts, J., Andrews,
T. D., Barnes, C., Campbell, P., et al. (2010a). Origins and functional impact of copy
number variation in the human genome. Nature 464, 704‐712.
Corvi, R., Amler, L. C., Savelyeva, L., Gehring, M., and Schwab, M. (1994). MYCN is
retained in single copy at chromosome 2 band p23‐24 during amplification in human
neuroblastoma cells. Proc. Natl. Acad. Sci. U.S.A. 91, 5523‐5527.
Cox, A., Dunning, A. M., Garcia‐Closas, M., Balasubramanian, S., Reed, M. W. R., Pooley,
K. A., Scollen, S., Baynes, C., Ponder, B. A. J., Chanock, S., et al. (2007). A common coding
variant in CASP8 is associated with breast cancer risk. Nat. Genet. 39, 352‐358.
Cuny, M., Kramar, A., Courjal, F., Johannsdottir, V., Iacopetta, B., Fontaine, H., Grenier,
J., Culine, S., and Theillet, C. (2000). Relating genotype and phenotype in breast cancer:
an analysis of the prognostic significance of amplification at eight different genes or loci
and of p53 mutations. Cancer Res. 60, 1077‐1083.
Daley, G. Q., Van Etten, R. A., and Baltimore, D. (1990). Induction of chronic
myelogenous leukemia in mice by the P210bcr/abl gene of the Philadelphia
chromosome. Science 247, 824‐830.
Dalla‐Favera, R., Bregni, M., Erikson, J., Patterson, D., Gallo, R. C., and Croce, C. M.
(1982). Human c‐myc onc gene is located on the region of chromosome 8 that is
translocated in Burkitt lymphoma cells. Proc. Natl. Acad. Sci. U.S.A. 79, 7824‐7827.
Davidson, J. M., Gorringe, K. L., Chin, S., Orsetti, B., Besret, C., Courtay‐Cahen, C.,
Roberts, I., Theillet, C., Caldas, C., and Edwards, P. A. W. (2000). Molecular cytogenetic
analysis of breast cancer cell lines. Br. J. Cancer 83, 1309‐1317.
Deininger, M., Buchdunger, E., and Druker, B. J. (2005). The development of imatinib as
a therapeutic agent for chronic myeloid leukemia. Blood 105, 2640‐2653.
Ding, L., Ellis, M. J., Li, S., Larson, D. E., Chen, K., Wallis, J. W., Harris, C. C., McLellan, M.
D., Fulton, R. S., Fulton, L. L., et al. (2010). Genome remodelling in a basal‐like breast
cancer metastasis and xenograft. Nature 464, 999‐1005.
Dohm, J. C., Lottaz, C., Borodina, T., and Himmelbauer, H. (2008). Substantial biases in
ultra‐short read data sets from high‐throughput DNA sequencing. Nucleic Acids Res. 36,
e105.
Dutrillaux, B., Gerbault‐Seureau, M., Remvikos, Y., Zafrani, B., and Prieur, M. (1991).
Breast cancer genetic evolution: I. Data from cytogenetics and DNA content. Breast
Cancer Res. Treat. 19, 245‐255.
Easton, D. F., Pooley, K. A., Dunning, A. M., Pharoah, P. D. P., Thompson, D., Ballinger, D.
G., Struewing, J. P., Morrison, J., Field, H., Luben, R., et al. (2007). Genome‐wide
association study identifies novel breast cancer susceptibility loci. Nature 447, 1087‐
1093.
Edwards, P. A. W. (2002). Metastasis: the role of chance in malignancy. Nature 419, 559‐
560.
Eguchi, M., Eguchi‐Ishimae, M., Tojo, A., Morishita, K., Suzuki, K., Sato, Y., Kudoh, S.,
Tanaka, K., Setoyama, M., Nagamura, F., et al. (1999). Fusion of ETV6 to neurotrophin‐3
receptor TRKC in acute myeloid leukemia with t(12;15)(p13;q25). Blood 93, 1355‐1363.
Engel, L. W., Young, N. A., Tralka, T. S., Lippman, M. E., O'Brien, S. J., and Joyce, M. J.
(1978). Establishment and characterization of three new continuous cell lines derived
from human breast carcinomas. Cancer Res. 38, 3352‐3364.
Ersfeld, K. (2004). Fiber‐FISH: fluorescence in situ hybridization on stretched DNA.
Methods Mol. Biol. 270, 395‐402.
Fearon, E. R., and Vogelstein, B. (1990). A genetic model for colorectal tumorigenesis.
Cell 61, 759‐767.
Fisher, E. R., Palekar, A. S., Gregorio, R. M., and Paulson, J. D. (1983). Mucoepidermoid
and squamous cell carcinomas of breast with reference to squamous metaplasia and
giant cell tumors. Am. J. Surg. Pathol. 7, 15‐27.
Flicek, P., and Birney, E. (2009). Sense from sequence reads: methods for alignment and
assembly. Nat. Methods 6, S6‐S12.
Forrest, W. F., and Cavet, G. (2007). Comment on "The Consensus Coding Sequences of
Human Breast and Colorectal Cancers". Science 317, 1500a.
Foster, J. S., Henley, D. C., Ahamed, S., and Wimalasena, J. (2001). Estrogens and cell‐
cycle regulation in breast cancer. Trends Endocrinol. Metab. 12, 320‐327.
Foulds, L. (1957). Tumor progression. Cancer Res. 17, 355‐356.
Fridlyand, J., Snijders, A. M., Ylstra, B., Li, H., Olshen, A., Segraves, R., Dairkee, S.,
Tokuyasu, T., Ljung, B. M., Jain, A. N., et al. (2006). Breast tumor copy number
aberration phenotypes and genomic instability. BMC Cancer 6, 96.
Fukuyoshi, Y., Inoue, H., Kita, Y., Utsunomiya, T., Ishida, T., and Mori, M. (2008). EML4‐
ALK fusion transcript is not found in gastrointestinal and breast cancers. Br. J. Cancer 98,
1536‐1539.
Garcia, M. J., Pole, J. C. M., Chin, S., Teschendorff, A., Naderi, A., Ozdag, H., Vias, M.,
Kranjac, T., Subkhankulova, T., Paish, C., et al. (2005). A 1Mb minimal amplicon at 8p11‐
12 in breast cancer identifies new candidate oncogenes. Oncogene 24, 5235‐5245.
Gazdar, A. F., Kurvari, V., Virmani, A., Gollahon, L., Sakaguchi, M., Westerfield, M.,
Kodagoda, D., Stasny, V., Cunningham, H. T., Wistuba, I. I., et al. (1998). Characterization
of paired tumor and non‐tumor cell lines established from patients with breast cancer.
Int. J. Cancer 78, 766‐774.
Gelsi‐Boyer, V., Orsetti, B., Cervera, N., Finetti, P., Sircoulomb, F., Rouge, C., Lasorsa, L.,
Letessier, A., Ginestier, C., Monville, F., et al. (2005). Comprehensive profiling of 8p11‐12
amplification in breast cancer. Mol. Cancer Res. 3, 655‐667.
Getz, G., Hofling, H., Mesirov, J. P., Golub, T. R., Meyerson, M., Tibshirani, R., and
Lander, E. S. (2007). Comment on "The Consensus Coding Sequences of Human Breast
and Colorectal Cancers". Science 317, 1500b.
Git, A., Spiteri, I., Blenkiron, C., Dunning, M., Pole, J., Chin, S., Wang, Y., Smith, J.,
Livesey, F., and Caldas, C. (2008). PMC42, a breast progenitor cancer cell line, has
normal‐like mRNA and microRNA transcriptomes. Breast Cancer Res. 10, R54.
Greenblatt, M. S., Chappuis, P. O., Bond, J. P., Hamel, N., and Foulkes, W. D. (2001).
TP53 mutations in breast cancer associated with BRCA1 or BRCA2 germ‐line mutations.
Cancer Res. 61, 4092‐4097.
Greenman, C. D., Bignell, G., Butler, A., Edkins, S., Hinton, J., Beare, D., Swamy, S.,
Santarius, T., Chen, L., Widaa, S., et al. (2010). PICNIC: an algorithm to predict absolute
allelic copy number variation with microarray cancer data. Biostat. 11, 164‐175.
Greenman, C., Stephens, P., Smith, R., Dalgliesh, G. L., Hunter, C., Bignell, G., Davies, H.,
Teague, J., Butler, A., Stevens, C., et al. (2007). Patterns of somatic mutation in human
cancer genomes. Nature 446, 153‐158.
Greshock, J., Nathanson, K., Martin, A. M., Zhang, L., Coukos, G., Weber, B. L., and Zaks,
T. Z. (2007). Cancer cell lines as genetic models of their parent histology: analyses based
on array comparative genomic hybridization. Cancer Res. 67, 3594‐3600.
Gribble, S. M., Kalaitzopoulos, D., Burford, D. C., Prigmore, E., Selzer, R. R., Ng, B. L.,
Matthews, N. S., Porter, K. M., Curley, R., Lindsay, S. J., et al. (2007). Ultra‐high
resolution array painting facilitates breakpoint sequencing. J. Med. Genet. 44, 51‐58.
Gruvberger, S., Ringnér, M., Chen, Y., Panavally, S., Saal, L. H., Borg, Å., Fernö, M.,
Peterson, C., and Meltzer, P. S. (2001). Estrogen receptor status in breast cancer is
associated with remarkably distinct gene expression patterns. Cancer Res. 61, 5979‐
5984.
Guan, X. Y., Meltzer, P. S., Dalton, W. S., and Trent, J. M. (1994). Identification of cryptic
sites of DNA sequence amplification in human breast cancer by chromosome
microdissection. Nat. Genet. 8, 155‐161.
Gudmundsdottir, K., and Ashworth, A. (2006). The roles of BRCA1 and BRCA2 and
associated proteins in the maintenance of genomic stability. Oncogene 25, 5864‐5874.
Gururaj, A. E., Holm, C., Landberg, G., and Kumar, R. (2006). Breast cancer‐amplified
sequence 3, a target of metastasis‐associated protein 1, contributes to tamoxifen
resistance in premenopausal patients with breast cancer. Cell Cycle 5, 1407‐1410.
Gururaj, A. E., Peng, S., Vadlamudi, R. K., and Kumar, R. (2007). Estrogen induces
expression of BCAS3, a novel estrogen receptor‐alpha coactivator, through proline‐,
glutamic acid‐, and leucine‐rich protein‐1 (PELP1). Mol. Endocrinol. 21, 1847‐1860.
Hampton, O. A., Den Hollander, P., Miller, C. A., Delgado, D. A., Li, J., Coarfa, C., Harris, R.
A., Richards, S., Scherer, S. E., Muzny, D. M., et al. (2009). A sequence‐level map of
chromosomal breakpoints in the MCF‐7 breast cancer cell line yields insights into the
evolution of a cancer genome. Genome Res. 19, 167‐177.
Hanahan, D., and Weinberg, R. A. (2000). The hallmarks of cancer. Cell 100, 57‐70.
Hastings, P. J., Lupski, J. R., Rosenberg, S. M., and Ira, G. (2009). Mechanisms of change
in gene copy number. Nat. Rev. Genet. 10, 551‐564.
Haverty, P. M., Fridlyand, J., Li, L., Getz, G., Beroukhim, R., Lohr, S., Wu, T. D., Cavet, G.,
Zhang, Z., and Chant, J. (2008). High‐resolution genomic and expression analyses of copy
number alterations in breast tumors. Genes Chromosomes Cancer 47, 530‐542.
Heer, R., Douglas, D., Mathers, M. E., Robson, C. N., and Leung, H. Y. (2004). Fibroblast
growth factor 17 is over‐expressed in human prostate cancer. J. Pathol. 204, 578‐586.
Hermans, A., Heisterkamp, N., von Linden, M., van Baal, S., Meijer, D., van der Plas, D.,
Wiedemann, L. M., Groffen, J., Bootsma, D., and Grosveld, G. (1987). Unique fusion of
bcr and c‐abl genes in Philadelphia chromosome positive acute lymphoblastic leukemia.
Cell 51, 33‐40.
Herschkowitz, J. I., Simin, K., Weigman, V. J., Mikaelian, I., Usary, J., Hu, Z., Rasmussen,
K. E., Jones, L. P., Assefnia, S., Chandrasekharan, S., et al. (2007). Identification of
conserved gene expression features between murine mammary carcinoma models and
human breast tumors. Genome Biol. 8, R76.
Hicks, J., Krasnitz, A., Lakshmi, B., Navin, N. E., Riggs, M., Leibu, E., Esposito, D.,
Alexander, J., Troge, J., Grubor, V., et al. (2006). Novel patterns of genome
rearrangement and their association with survival in breast cancer. Genome Res. 16,
1465‐1479.
Hiyama, E., Gollahon, L., Kataoka, T., Kuroi, K., Yokoyama, T., Gazdar, A. F., Hiyama, K.,
Piatyszek, M. A., and Shay, J. W. (1996). Telomerase activity in human breast tumors. J.
Natl. Cancer Inst. 88, 116‐122.
Homer, N., Merriman, B., and Nelson, S. F. (2009). BFAST: An alignment tool for large
scale genome resequencing. PLoS ONE 4, e7767.
Howarth, K. D., Blood, K. A., Ng, B. L., Beavis, J. C., Chua, Y., Cooke, S. L., Raby, S.,
Ichimura, K., Collins, V. P., Carter, N. P., et al. (2008). Array painting reveals a high
frequency of balanced translocations in breast cancer cell lines that break in cancer‐
relevant genes. Oncogene 27, 3345‐3359.
Ichimura, K., Mungall, A. J., Fiegler, H., Pearson, D. M., Dunham, I., Carter, N. P., and Collins,
V. P. (2006). Small regions of overlapping deletions on 6q26 in human astrocytic
tumours identified using chromosome 6 tile path array‐CGH. Oncogene 25, 1261‐1271.
Ince, T. A., Richardson, A. L., Bell, G. W., Saitoh, M., Godar, S., Karnoub, A. E., Iglehart, J.
D., and Weinberg, R. A. (2007). Transformation of different human breast epithelial cell
types leads to distinct tumor phenotypes. Cancer Cell 12, 160‐170.
Järvinen, T. A. H., and Liu, E. T. (2003). HER‐2/neu and topoisomerase II alpha in breast
cancer. Breast Cancer Res. Treat. 78, 299‐311.
Jones, D. T., Kocialkowski, S., Liu, L., Pearson, D. M., Bäcklund, L. M., Ichimura, K., and
Collins, V. P. (2008). Tandem duplication producing a novel oncogenic BRAF fusion gene
defines the majority of pilocytic astrocytomas. Cancer Res. 68, 8673‐8677.
Kallioniemi, A., Kallioniemi, O., Sudar, D., Rutovitz, D., Gray, J., Waldman, F., and Pinkel,
D. (1992). Comparative genomic hybridization for molecular cytogenetic analysis of solid
tumors. Science 258, 818‐821.
Kennedy, G. C., Matsuzaki, H., Dong, S., Liu, W., Huang, J., Liu, G., Su, X., Cao, M., Chen,
W., Zhang, J., et al. (2003). Large‐scale genotyping of complex DNA. Nat. Biotech. 21,
1233‐1237.
Kent, W. J. (2002). BLAT—The BLAST‐Like Alignment Tool. Genome Res. 12, 656‐664.
Knezevich, S. R., McFadden, D. E., Tao, W., Lim, J. F., and Sorensen, P. H. (1998). A novel
ETV6‐NTRK3 gene fusion in congenital fibrosarcoma. Nat. Genet. 18, 184‐187.
Knudson, A. G. (1971). Mutation and cancer: statistical study of retinoblastoma. Proc.
Natl. Acad. Sci. U. S. A. 68, 820‐823.
Krzywinski, M., Schein, J., Birol, I., Connors, J., Gascoyne, R., Horsman, D., Jones, S. J.,
and Marra, M. A. (2009). Circos: An information aesthetic for comparative genomics.
Genome Res. 19, 1639‐1645.
Kumar, R., and Yarmand‐Bagheri, R. (2001). The role of HER2 in angiogenesis. Semin.
Oncol. 28, 27‐32.
Kuppers, R. (2005). Mechanisms of B‐cell lymphoma pathogenesis. Nat. Rev. Cancer 5,
251‐262.
Kwek, S. S., Roy, R., Zhou, H., Climent, J., Martinez‐Climent, J. A., Fridlyand, J., and
Albertson, D. G. (2009). Co‐amplified genes at 8p12 and 11q13 in breast tumors
cooperate with two major pathways in oncogenesis. Oncogene 28, 1892‐1903.
Lafage, M., Pedeutour, F., Marchetto, S., Simonetti, J., Prosperi, M. T., Gaudray, P., and
Birnbaum, D. (1992). Fusion and amplification of two originally non‐syntenic
chromosomal regions in a mammary carcinoma cell line. Genes Chromosomes Cancer 5,
40‐49.
Lai, W. R., Johnson, M. D., Kucherlapati, R., and Park, P. J. (2005). Comparative analysis
of algorithms for identifying amplifications and deletions in array CGH data.
Bioinformatics 21, 3763‐3770.
Lambros, M. B., Natrajan, R., Geyer, F. C., Lopez‐Garcia, M. A., Dedes, K. J., Savage, K.,
Lacroix‐Triki, M., Jones, R. L., Lord, C. J., Linardopoulos, S., et al. (2010). PPM1D gene
amplification and overexpression in breast cancer: a qRT‐PCR and chromogenic in situ
hybridization study. Mod. Pathol. 23, 1334‐1345.
Langmead, B., Trapnell, C., Pop, M., and Salzberg, S. L. (2009). Ultrafast and memory‐
efficient alignment of short DNA sequences to the human genome. Genome Biol. 10,
R25.
Leary, R. J., Lin, J. C., Cummins, J., Boca, S., Wood, L. D., Parsons, D. W., Jones, S.,
Sjöblom, T., Park, B., Parsons, R., et al. (2008). Integrated analysis of homozygous
deletions, focal amplifications, and sequence alterations in breast and colorectal
cancers. Proc. Natl. Acad. Sci. U. S. A. 105, 16224‐16229.
Lee, J. A., Carvalho, C. M., and Lupski, J. R. (2007). A DNA replication mechanism for
generating nonrecurrent rearrangements associated with genomic disorders. Cell 131,
1235‐1247.
Lemieux, N., Apiou, F., Vogt, N., Malfoy, B., and Dutrillaux, B. (1996). Structural
heterogeneity of hsr(11) in the MDA‐MB‐134 mammary carcinoma cell line. Cancer
Genet. Cytogenet. 90, 75‐79.
Lengauer, C., Kinzler, K. W., and Vogelstein, B. (1997). Genetic instability in colorectal
cancers. Nature 386, 623‐627.
Letessier, A., Sircoulomb, F., Ginestier, C., Cervera, N., Monville, F., Gelsi‐Boyer, V.,
Esterni, B., Geneix, J., Finetti, P., Zemmour, C., et al. (2006). Frequency, prognostic
impact, and subtype association of 8p12, 8q24, 11q13, 12p13, 17q12, and 20q13
amplifications in breast cancers. BMC Cancer 6, 245.
Levsky, J. M., and Singer, R. H. (2003). Fluorescence in situ hybridization: past, present
and future. J. Cell Sci. 116, 2833‐2838.
Li, H., Ruan, J., and Durbin, R. (2008). Mapping short DNA sequencing reads and calling
variants using mapping quality scores. Genome Res. 18, 1851‐1858.
Li, H., and Durbin, R. (2009). Fast and accurate short read alignment with Burrows‐
Wheeler transform. Bioinformatics 25, 1754‐1760.
Lieber, M. R. (2010). NHEJ and its backup pathways in chromosomal translocations. Nat.
Struct. Mol. Biol. 17, 393‐395.
Lin, E., Li, L., Guan, Y., Soriano, R., Rivers, C. S., Mohan, S., Pandita, A., Tang, J., and
Modrusan, Z. (2009). Exon array profiling detects EML4‐ALK fusion in breast, colorectal,
and non‐small cell lung cancers. Mol. Cancer Res. 7, 1466‐1476.
Liu, X., Baker, E., Eyre, H., Sutherland, G., and Zhou, M. (1999). γ‐Heregulin: a fusion
gene of DOC‐4 and neuregulin‐1 derived from a chromosome translocation. Oncogene
18, 7110‐7114.
Lutchman, M., Pack, S., Kim, A. C., Azim, A., Emmert‐Buck, M., van Huffel, C., Zhuang, Z.,
and Chishti, A. H. (1999). Loss of heterozygosity on 8p in prostate cancer implicates a
role for dematin in tumor progression. Cancer Genet. Cytogenet. 115, 65‐69.
Ma, B., Tromp, J., and Li, M. (2002). PatternHunter: faster and more sensitive homology
search. Bioinformatics 18, 440‐445.
MacLeod, R. A., Dirks, W. G., Matsuo, Y., Kaufmann, M., Milch, H., and Drexler, H. G.
(1999). Widespread intraspecies cross‐contamination of human tumor cell lines arising
at source. Int. J. Cancer 83, 555‐563.
Maher, C. A., Kumar‐Sinha, C., Cao, X., Kalyana‐Sundaram, S., Han, B., Jing, X., Sam, L.,
Barrette, T., Palanisamy, N., and Chinnaiyan, A. M. (2009). Transcriptome sequencing to
detect gene fusions in cancer. Nature 458, 97‐101.
Mardis, E. R. (2008). Next‐generation DNA sequencing methods. Annu. Rev. Genomics
Hum. Genet. 9, 387‐402.
Marioni, J. C., Thorne, N. P., Valsesia, A., Fitzgerald, T., Redon, R., Fiegler, H., Andrews, T.
D., Stranger, B. E., Lynch, A. G., Dermitzakis, E. T., et al. (2007). Breaking the waves:
improved detection of copy number variation from microarray‐based comparative
genomic hybridization. Genome Biol. 8, R228.
McCallum, H. M., and Lowther, G. W. (1996). Long‐term culture of primary breast cancer
in defined medium. Breast Cancer Res. Treat. 39, 247‐259.
McVey, M., and Lee, S. E. (2008). MMEJ repair of double‐strand breaks (director's cut):
deleted sequences and alternative endings. Trends Genet. 24, 529‐538.
Medvedev, P., Stanciu, M., and Brudno, M. (2009). Computational methods for
discovering structural variation with next‐generation sequencing. Nat. Methods 6, S13‐S20.
Meijers‐Heijboer, H., van den Ouweland, A., Klijn, J., Wasielewski, M., de Snoo, A.,
Oldenburg, R., Hollestelle, A., Houben, M., Crepin, E., van Veghel‐Plandsoen, M., et al.
(2002). Low‐penetrance susceptibility to breast cancer due to CHEK2(*)1100delC in
noncarriers of BRCA1 or BRCA2 mutations. Nat. Genet. 31, 55‐59.
Miron, A., Varadi, M., Carrasco, D., Li, H., Luongo, L., Kim, H. J., Park, S. Y., Cho, E. Y.,
Lewis, G., Kehoe, S., et al. (2010). PIK3CA mutations in in situ and invasive breast
carcinomas. Cancer Res. 70, 5674‐5678.
Mitelman, F., Johansson, B., and Mertens, F. (2004). Fusion genes and rearranged genes
as a linear function of chromosome aberrations in cancer. Nat. Genet. 36, 331‐334.
Mitelman, F., Johansson, B., and Mertens, F. (2007). The impact of translocations and
gene fusions on cancer causation. Nat. Rev. Cancer 7, 233‐245.
Mitelman, F., Mertens, F., and Johansson, B. (2005). Prevalence estimates of recurrent
balanced cytogenetic aberrations and gene fusions in unselected patients with
neoplastic disorders. Genes Chromosomes Cancer 43, 350‐366.
Monni, O., Bärlund, M., Mousses, S., Kononen, J., Sauter, G., Heiskanen, M., Paavola, P.,
Avela, K., Chen, Y., Bittner, M. L., et al. (2001). Comprehensive copy number and gene
expression profiling of the 17q23 amplicon in human breast cancer. Proc. Natl. Acad. Sci.
U. S. A. 98, 5711‐5716.
Morris, J. S., Carter, N. P., Ferguson‐Smith, M. A., and Edwards, P. A. (1997). Cytogenetic
analysis of three breast carcinoma cell lines using reverse chromosome painting. Genes
Chromosomes Cancer 20, 120‐139.
Morris, S., Kirstein, M., Valentine, M., Dittmer, K., Shapiro, D., Saltman, D., and Look, A.
(1994). Fusion of a kinase gene, ALK, to a nucleolar protein gene, NPM, in non‐Hodgkin's
lymphoma. Science 263, 1281‐1284.
Neve, R. M., Chin, K., Fridlyand, J., Yeh, J., Baehner, F. L., Fevr, T., Clark, L., Bayani, N.,
Coppe, J. P., Tong, F., et al. (2006). A collection of breast cancer cell lines for the study of
functionally distinct cancer subtypes. Cancer Cell 10, 515‐527.
Ng, B. L., and Carter, N. P. (2006). Factors affecting flow karyotype resolution. Cytometry
A 69, 1028‐1036.
Olshen, A. B., Venkatraman, E. S., Lucito, R., and Wigler, M. (2004). Circular binary
segmentation for the analysis of array‐based DNA copy number data. Biostatistics 5,
557‐572.
Parker, J. S., Mullins, M., Cheang, M. C., Leung, S., Voduc, D., Vickery, T., Davies, S.,
Fauron, C., He, X., Hu, Z., et al. (2009). Supervised risk predictor of breast cancer based
on intrinsic subtypes. J. Clin. Oncol. 27, 1160‐1167.
Parssinen, J., Kuukasjarvi, T., Karhu, R., and Kallioniemi, A. (2007). High‐level
amplification at 17q23 leads to coordinated overexpression of multiple adjacent genes
in breast cancer. Br. J. Cancer 96, 1258‐1264.
Paterson, A. L., Pole, J. C., Blood, K. A., Garcia, M. J., Cooke, S. L., Teschendorff, A. E.,
Wang, Y., Chin, S. F., Ylstra, B., Caldas, C., et al. (2007). Co‐amplification of 8p12 and
11q13 in breast cancers is not the result of a single genomic event. Genes Chromosomes
Cancer 46, 427‐439.
Persson, K., Pandis, N., Mertens, F., Borg, Å., Baldetorp, B., Killander, D., and Isola, J.
(1999). Chromosomal aberrations in breast cancer: A comparison between cytogenetics
and comparative genomic hybridization. Genes Chromosomes Cancer 25, 115‐122.
Pharoah, P. D., Day, N. E., and Caldas, C. (1999). Somatic mutations in the p53 gene and
prognosis in breast cancer: a meta‐analysis. Br. J. Cancer 80, 1968‐1973.
Pinkel, D., Segraves, R., Sudar, D., Clark, S., Poole, I., Kowbel, D., Collins, C., Kuo, W.,
Chen, C., Zhai, Y., et al. (1998). High resolution analysis of DNA copy number variation
using comparative genomic hybridization to microarrays. Nat. Genet. 20, 207‐211.
Pleasance, E. D., Cheetham, R. K., Stephens, P. J., McBride, D. J., Humphray, S. J.,
Greenman, C. D., Varela, I., Lin, M., Ordonez, G. R., Bignell, G. R., et al. (2010a). A
comprehensive catalogue of somatic mutations from a human cancer genome. Nature
463, 191‐196.
Pleasance, E. D., Stephens, P. J., O’Meara, S., McBride, D. J., Meynert, A., Jones, D., Lin,
M., Beare, D., Lau, K. W., Greenman, C., et al. (2010b). A small‐cell lung cancer genome
with complex signatures of tobacco exposure. Nature 463, 184‐190.
Pole, J. C. M., Courtay‐Cahen, C., Garcia, M. J., Blood, K. A., Cooke, S. L., Alsop, A. E., Tse,
D. M. L., Caldas, C., and Edwards, P. A. W. (2006). High‐resolution analysis of
chromosome rearrangements on 8p in breast, colon and pancreatic cancer reveals a
complex pattern of loss, gain and translocation. Oncogene 25, 5693‐5706.
Popovici, C., Basset, C., Bertucci, F., Orsetti, B., Adelaide, J., Mozziconacci, M. J., Conte,
N., Murati, A., Ginestier, C., Charafe‐Jauffret, E., et al. (2002). Reciprocal translocations
in breast tumor cell lines: cloning of a t(3;20) that targets the FHIT gene. Genes
Chromosomes Cancer 35, 204‐218.
Quail, M. A., Kozarewa, I., Smith, F., Scally, A., Stephens, P. J., Durbin, R., Swerdlow, H.,
and Turner, D. J. (2008). A large genome center's improvements to the Illumina
sequencing system. Nat. Methods 5, 1005‐1010.
Raap, A. K. (1998). Advances in fluorescence in situ hybridization. Mutat. Res.
400, 287‐298.
Rae, J. M., Creighton, C. J., Meck, J. M., Haddad, B. R., and Johnson, M. D. (2006). MDA‐
MB‐435 cells are derived from M14 Melanoma cells––a loss for breast cancer, but a
boon for melanoma research. Breast Cancer Res. Treat. 104, 13‐19.
Rahman, N., Seal, S., Thompson, D., Kelly, P., Renwick, A., Elliott, A., Reid, S., Spanova,
K., Barfoot, R., Chagtai, T., et al. (2007). PALB2, which encodes a BRCA2‐interacting
protein, is a breast cancer susceptibility gene. Nat. Genet. 39, 165‐167.
Ray, M. E., Yang, Z. Q., Albertson, D., Kleer, C. G., Washburn, J. G., Macoska, J. A., and
Ethier, S. P. (2004). Genomic and expression analysis of the 8p11–12 amplicon in human
breast cancer cell lines. Cancer Res. 64, 40‐47.
Redon, R., Ishikawa, S., Fitch, K. R., Feuk, L., Perry, G. H., Andrews, T. D., Fiegler, H.,
Shapero, M. H., Carson, A. R., Chen, W., et al. (2006). Global variation in copy number in
the human genome. Nature 444, 444‐454.
Renwick, A., Thompson, D., Seal, S., Kelly, P., Chagtai, T., Ahmed, M., North, B.,
Jayatilake, H., Barfoot, R., Spanova, K., et al. (2006). ATM mutations that cause ataxia‐
telangiectasia are breast cancer susceptibility alleles. Nat. Genet. 38, 873‐875.
Rigaill, G., Hupe, P., Almeida, A., La Rosa, P., Meyniel, J., Decraene, C., and Barillot, E.
(2008). ITALICS: an algorithm for normalization and DNA copy number calling for
Affymetrix SNP arrays. Bioinformatics 24, 768‐774.
Rikova, K., Guo, A., Zeng, Q., Possemato, A., Yu, J., Haack, H., Nardone, J., Lee, K.,
Reeves, C., Li, Y., et al. (2007). Global survey of phosphotyrosine signaling identifies
oncogenic kinases in lung cancer. Cell 131, 1190‐1203.
Roschke, A. V., Stover, K., Tonon, G., Schäffer, A. A., and Kirsch, I. R. (2002). Stable
karyotypes in epithelial cancer cell lines despite high rates of ongoing structural and
numerical chromosomal instability. Neoplasia 4, 19‐31.
Rouzier, R., Perou, C. M., Symmans, W. F., Ibrahim, N., Cristofanilli, M., Anderson, K.,
Hess, K. R., Stec, J., Ayers, M., Wagner, P., et al. (2005). Breast cancer molecular
subtypes respond differently to preoperative chemotherapy. Clin. Cancer Res. 11,
5678‐5685.
Rowley, J. D. (2001). Chromosome translocations: dangerous liaisons revisited. Nat. Rev.
Cancer 1, 245‐250.
Rozen, S., and Skaletsky, H. J. (2000). Primer3 on the WWW for general users and for
biologist programmers. In Bioinformatics Methods and Protocols: Methods in Molecular
Biology (Humana Press), pp. 365‐386.
Ruan, Y., Ooi, H. S., Choo, S. W., Chiu, K. P., Zhao, X. D., Srinivasan, K. G., Yao, F., Choo, C.
Y., Liu, J., Ariyaratne, P., et al. (2007). Fusion transcripts and transcribed retrotransposed
loci discovered through comprehensive transcriptome analysis using Paired‐End diTags
(PETs). Genome Res. 17, 828‐838.
Samuels, Y., Diaz Jr., L. A., Schmidt‐Kittler, O., Cummins, J. M., DeLong, L., Cheong, I.,
Rago, C., Huso, D. L., Lengauer, C., Kinzler, K. W., et al. (2005). Mutant PIK3CA promotes
cell growth and invasion of human cancer cells. Cancer Cell 7, 561‐573.
Santarius, T., Shipley, J., Brewer, D., Stratton, M. R., and Cooper, C. S. (2010). A census of
amplified and overexpressed human cancer genes. Nat. Rev. Cancer 10, 59‐64.
Schröck, E., du Manoir, S., Veldman, T., Schoell, B., Wienberg, J., Ferguson‐Smith, M. A.,
Ning, Y., Ledbetter, D. H., Bar‐Am, I., Soenksen, D., et al. (1996). Multicolor Spectral
Karyotyping of Human Chromosomes. Science 273, 494‐497.
Schwab, M. (1999). Oncogene amplification in solid tumors. Semin. Cancer Biol. 9, 319‐
325.
Schwab, M., Westermann, F., Hero, B., and Berthold, F. (2003). Neuroblastoma: biology
and molecular and chromosomal pathology. Lancet Oncol. 4, 472‐480.
Seal, S., Thompson, D., Renwick, A., Elliott, A., Kelly, P., Barfoot, R., Chagtai, T.,
Jayatilake, H., Ahmed, M., Spanova, K., et al. (2006). Truncating mutations in the Fanconi
anemia J gene BRIP1 are low‐penetrance breast cancer susceptibility alleles. Nat. Genet.
38, 1239‐1241.
Sevignani, C., Calin, G. A., Cesari, R., Sarti, M., Ishii, H., Yendamuri, S., Vecchione, A.,
Trapasso, F., and Croce, C. M. (2003). Restoration of fragile histidine triad (FHIT)
expression induces apoptosis and suppresses tumorigenicity in breast cancer cell lines.
Cancer Res. 63, 1183‐1187.
Shah, S. P., Morin, R. D., Khattra, J., Prentice, L., Pugh, T., Burleigh, A., Delaney, A.,
Gelmon, K., Guliany, R., Senz, J., et al. (2009). Mutational evolution in a lobular breast
tumour profiled at single nucleotide resolution. Nature 461, 809‐813.
Shtivelman, E., Lifshitz, B., Gale, R. P., and Canaani, E. (1985). Fused transcript of abl and
bcr genes in chronic myelogenous leukaemia. Nature 315, 550‐554.
Sjöblom, T., Jones, S., Wood, L. D., Parsons, D. W., Lin, J., Barber, T. D., Mandelker, D.,
Leary, R. J., Ptak, J., Silliman, N., et al. (2006). The consensus coding sequences of human
breast and colorectal cancers. Science 314, 268‐274.
Slamon, D. J., Clark, G. M., Wong, S. G., Levin, W. J., Ullrich, A., and McGuire, W. L.
(1987). Human breast cancer: correlation of relapse and survival with amplification of
the HER‐2/neu oncogene. Science 235, 177‐182.
Smeets, D. F. C. M. (2004). Historical prospective of human cytogenetics: from
microscope to microarray. Clin. Biochem. 37, 439‐446.
Soda, M., Choi, Y. L., Enomoto, M., Takada, S., Yamashita, Y., Ishikawa, S., Fujiwara, S.,
Watanabe, H., Kurashina, K., Hatanaka, H., et al. (2007). Identification of the
transforming EML4‐ALK fusion gene in non‐small‐cell lung cancer. Nature 448, 561‐566.
Sørlie, T., Perou, C. M., Tibshirani, R., Aas, T., Geisler, S., Johnsen, H., Hastie, T., Eisen, M.
B., van de Rijn, M., Jeffrey, S. S., et al. (2001). Gene expression patterns of breast
carcinomas distinguish tumor subclasses with clinical implications. Proc. Natl. Acad. Sci.
U. S. A. 98, 10869‐10874.
Stacey, S. N., Manolescu, A., Sulem, P., Rafnar, T., Gudmundsson, J., Gudjonsson, S. A.,
Masson, G., Jakobsdottir, M., Thorlacius, S., Helgason, A., et al. (2007). Common variants
on chromosomes 2q35 and 16q12 confer susceptibility to estrogen receptor‐positive
breast cancer. Nat. Genet. 39, 865‐869.
Stein, W. D., and Stein, A. D. (1990). Testing and characterizing the two‐stage model of
carcinogenesis for a wide range of human cancers. J. Theor. Biol. 145, 95‐122.
Stephens, P. J., McBride, D. J., Lin, M., Varela, I., Pleasance, E. D., Simpson, J. T.,
Stebbings, L. A., Leroy, C., Edkins, S., Mudie, L. J., et al. (2009a). Complex landscapes of
somatic rearrangement in human breast cancer genomes. Nature 462, 1005‐1010.
Stingl, J., and Caldas, C. (2007). Molecular heterogeneity of breast carcinomas and the
cancer stem cell hypothesis. Nat. Rev. Cancer 7, 791‐799.
Storlazzi, C. T., Lonoce, A., Guastadisegni, M. C., Trombetta, D., D'Addabbo, P., Daniele,
G., L'Abbate, A., Macchia, G., Surace, C., Kok, K., et al. (2010). Gene amplification as
double minutes or homogeneously staining regions in solid tumors: origin and structure.
Genome Res. 20, 1198‐1206.
Takeuchi, K., Choi, Y. L., Togashi, Y., Soda, M., Hatano, S., Inamura, K., Takada, S., Ueno,
T., Yamashita, Y., Satoh, Y., et al. (2009). KIF5B‐ALK, a novel fusion oncokinase identified
by an immunohistochemistry‐based diagnostic system for ALK‐positive lung cancer.
Clin. Cancer Res. 15, 3143‐3149.
Taub, R., Kirsch, I., Morton, C., Lenoir, G., Swan, D., Tronick, S., Aaronson, S., and Leder,
P. (1982). Translocation of the c‐myc gene into the immunoglobulin heavy chain locus in
human Burkitt lymphoma and murine plasmacytoma cells. Proc. Natl. Acad. Sci. U. S. A.
79, 7837‐7841.
Teixeira, M. R., Pandis, N., and Heim, S. (2002). Cytogenetic clues to breast
carcinogenesis. Genes Chromosomes Cancer 33, 1‐16.
Teschendorff, A., and Caldas, C. (2009). The breast cancer somatic 'muta‐ome': tackling
the complexity. Breast Cancer Res. 11, 301.
Theodorou, V., Kimm, M. A., Boer, M., Wessels, L., Theelen, W., Jonkers, J., and Hilkens,
J. (2007). MMTV insertional mutagenesis identifies genes, gene families and pathways
involved in mammary cancer. Nat. Genet. 39, 759‐769.
Tognon, C., Knezevich, S. R., Huntsman, D., Roskelley, C. D., Melnyk, N., Mathers, J. A.,
Becker, L., Carneiro, F., MacPherson, N., Horsman, D., et al. (2002). Expression of the
ETV6‐NTRK3 gene fusion as a primary event in human secretory breast carcinoma.
Cancer Cell 2, 367‐376.
Tomlins, S. A., Laxman, B., Dhanasekaran, S. M., Helgeson, B. E., Cao, X., Morris, D. S.,
Menon, A., Jing, X., Cao, Q., Han, B., et al. (2007). Distinct classes of chromosomal
rearrangements create oncogenic ETS gene fusions in prostate cancer. Nature 448, 595‐
599.
Tomlins, S. A., Rhodes, D. R., Perner, S., Dhanasekaran, S. M., Mehra, R., Sun, X. W.,
Varambally, S., Cao, X., Tchinda, J., Kuefer, R., et al. (2005). Recurrent fusion of TMPRSS2
and ETS transcription factor genes in prostate cancer. Science 310, 644‐648.
Tomlins, S. A., Mehra, R., Rhodes, D. R., Smith, L. R., Roulston, D., Helgeson, B. E., Cao,
X., Wei, J. T., Rubin, M. A., Shah, R. B., et al. (2006). TMPRSS2:ETV4 Gene Fusions Define
a Third Molecular Subtype of Prostate Cancer. Cancer Res. 66, 3396‐3400.
Tremblay, M., Tremblay, C. S., Herblot, S., Aplan, P. D., Hébert, J., Perreault, C., and
Hoang, T. (2010). Modeling T‐cell acute lymphoblastic leukemia induced by the SCL and
LMO1 oncogenes. Genes Dev. 24, 1093‐1105.
Tucker, R. P., and Chiquet‐Ehrismann, R. (2006). Teneurins: a conserved family of
transmembrane proteins involved in intercellular signaling during development. Dev. Biol.
290, 237‐245.
Turnbull, C., and Rahman, N. (2008). Genetic predisposition to breast cancer: past,
present, and future. Annu. Rev. Genomics Hum. Genet. 9, 321‐345.
Ugolini, F., Adélaïde, J., Charafe‐Jauffret, E., Nguyen, C., Jacquemier, J., Jordan, B.,
Birnbaum, D., and Pébusque, M. J. (1999). Differential expression assay of chromosome
arm 8p genes identifies Frizzled‐related (FRP1/FRZB) and Fibroblast Growth Factor
Receptor 1 (FGFR1) as candidate breast cancer genes. Oncogene 18, 1903‐1910.
Van Prooijen‐Knegt, A. C., Van Hoek, J. F., Bauman, J. G., Van Duijn, P., Wool, I. G., and
Van der Ploeg, M. (1982). In situ hybridization of DNA sequences in human metaphase
chromosomes visualized by an indirect fluorescent immunocytochemical procedure.
Exp. Cell Res. 141, 397‐407.
Van Roy, N., Vandesompele, J., Menten, B., Nilsson, H., De Smet, E., Rocchi, M., De
Paepe, A., Pahlman, S., and Speleman, F. (2006). Translocation‐excision‐deletion‐
amplification mechanism leading to nonsyntenic coamplification of MYC and ATBF1.
Genes Chromosomes Cancer 45, 107‐117.
Velculescu, V. E. (2008). Defining the blueprint of the cancer genome. Carcinogenesis
29, 1087‐1091.
Venkatraman, E. S., and Olshen, A. B. (2007). A faster circular binary segmentation
algorithm for the analysis of array CGH data. Bioinformatics 23, 657‐663.
Venkitaraman, A. R. (2002). Cancer susceptibility and the functions of BRCA1 and
BRCA2. Cell 108, 171‐182.
Vogelstein, B., and Kinzler, K. W. (1993). The multistep nature of cancer. Trends Genet.
9, 138‐141.
Volik, S., Zhao, S., Chin, K., Brebner, J. H., Herndon, D. R., Tao, Q., Kowbel, D., Huang, G.,
Lapuk, A., Kuo, W. L., et al. (2003). End‐sequence profiling: sequence‐based analysis of
aberrant genomes. Proc. Natl. Acad. Sci. U. S. A. 100, 7696‐7701.
Vousden, K. H., and Lu, X. (2002). Live or let die: the cell's response to p53. Nat. Rev.
Cancer 2, 594‐604.
Wang, T. C., Cardiff, R. D., Zukerberg, L., Lees, E., Arnold, A., and Schmidt, E. V. (1994).
Mammary hyperplasia and carcinoma in MMTV‐cyclin D1 transgenic
mice. Nature 369, 669‐671.
Weigelt, B., Baehner, F. L., and Reis‐Filho, J. S. (2010). The contribution of gene
expression profiling to breast cancer classification, prognostication and prediction: a
retrospective of the last decade. J. Pathol. 220, 263‐280.
Weigelt, B., Glas, A. M., Wessels, L. F. A., Witteveen, A. T., Peterse, J. L., and van't Veer,
L. J. (2003). Gene expression profiles of primary breast tumors maintained in distant
metastases. Proc. Natl. Acad. Sci. U. S. A. 100, 15901‐15905.
Willenbrock, H., and Fridlyand, J. (2005). A comparison study: applying segmentation to
array CGH data for downstream analyses. Bioinformatics 21, 4084‐4091.
Williams, A., and Thomson, E. (2010). Effects of scanning sensitivity and multiple scan
algorithms on microarray data quality. BMC Bioinformatics 11, 127.
Windhofer, F., Krause, S., Hader, C., Schulz, W. A., and Florl, A. R. (2008). Distinctive
differences in DNA double‐strand break repair between normal urothelial and urothelial
carcinoma cells. Mutat. Res. 638, 56‐65.
Wirapati, P., Sotiriou, C., Kunkel, S., Farmer, P., Pradervand, S., Haibe‐Kains, B.,
Desmedt, C., Ignatiadis, M., Sengstag, T., Schutz, F., et al. (2008). Meta‐analysis of gene
expression profiles in breast cancer: toward a unified understanding of breast cancer
subtyping and prognosis signatures. Breast Cancer Res. 10, R65.
Wistuba, I. I., Behrens, C., Milchgrub, S., Syed, S., Ahmadian, M., Virmani, A. K., Kurvari,
V., Cunningham, T. H., Ashfaq, R., Minna, J. D., et al. (1998). Comparison of features of
human breast cancer cell lines and their corresponding tumors. Clin. Cancer Res. 4,
2931‐2938.
Wood, L. D., Parsons, D. W., Jones, S., Lin, J., Sjöblom, T., Leary, R. J., Shen, D., Boca, S.
M., Barber, T., Ptak, J., et al. (2007). The genomic landscapes of human breast and
colorectal cancers. Science 318, 1108‐1113.
Yachida, S., Jones, S., Bozic, I., Antal, T., Leary, R., Fu, B., Kamiyama, M., Hruban, R. H.,
Eshleman, J. R., Nowak, M. A., et al. (2010). Distant metastasis occurs late during the
genetic evolution of pancreatic cancer. Nature 467, 1114‐1117.
Yamashita, H., Nishio, M., Toyama, T., Sugiura, H., Zhang, Z., Kobayashi, S., and Iwase, H.
(2004). Coexistence of HER2 over‐expression and p53 protein accumulation is a strong
prognostic molecular marker in breast cancer. Breast Cancer Res. 6, R24‐30.
Zech, L., Haglund, U., Nilsson, K., and Klein, G. (1976). Characteristic chromosomal
abnormalities in biopsies and lymphoid‐cell lines from patients with Burkitt and non‐
Burkitt lymphomas. Int. J. Cancer 17, 47‐56.
Appendix 1 – List of primers used for PCR and sequencing
Name Sequence Used for
BCAS3_56.164_fwd AGATCGCTGGTAGCAAGGAA BCAS3 junction fine mapping
BCAS3_56.164_rev ACCAAGGTGGTAAGCAGCAT BCAS3 junction fine mapping
BCAS3_56.165_fwd GAGGTGGGAGAAATGCTTGA BCAS3 junction fine mapping
BCAS3_56.165_rev AGCCGAGCCATAAACTGAGA BCAS3 junction fine mapping
BCAS3_56.166_fwd CCCAGTAATGCGATCCTAGC BCAS3 junction fine mapping
BCAS3_56.166_rev CTGCTTGAGGCCAAGAGTTC BCAS3 junction fine mapping
BCAS3_56.167_fwd GCTCACTACAACCCCTGCTT BCAS3 junction fine mapping
BCAS3_56.167_rev AGAGGTGAGTGGATCGCTTG BCAS3 junction fine mapping
BCAS3_56.168_fwd TGTCTTTCAAGGGGTTGGTC BCAS3 junction fine mapping
BCAS3_56.168_rev AGCCTCAGCAAAAGAGCAAG BCAS3 junction fine mapping
BCAS3_56.169_fwd CTACAACCCTTGCCTCTTCG BCAS3 junction fine mapping
BCAS3_56.169_rev GAGGCCAACAAGCAGATCAC BCAS3 junction fine mapping
chr7-155709k-fwd CCTCTGCTTGCTGGTGTGTA BCAS3 junction fine mapping
chr7-155709k-rev CAGGTTGAGCACCACTGTGT BCAS3 junction fine mapping
chr7-155710k-fwd ACCCAAGCTCCCTTCTCTTC BCAS3 junction fine mapping
chr7-155710k-rev CTTCATGAAAGCATGCTGGA BCAS3 junction fine mapping
chr7-155711k-fwd AAGAGCCTGACACAGCCATT BCAS3 junction fine mapping
chr7-155711k-rev TCTGTTGTTGGTCAGCCTTG BCAS3 junction fine mapping
chr7-155712k-fwd TCTCCTTGTGAAGCGTGATG BCAS3 junction fine mapping
chr7-155712k-rev TCTGCTGGCTCAGTAGAGCA BCAS3 junction fine mapping
chr7-155713k-fwd AAGCCCCAGGACAAGAAAAT BCAS3 junction fine mapping
chr7-155713k-rev GTCAAGTCCGGGGGTAGATT BCAS3 junction fine mapping
chr7-155714k-fwd GCACCTTCTATGTGCCATCA BCAS3 junction fine mapping
chr7-155714k-rev TTTGAATCCCCAGTGTGTTG BCAS3 junction fine mapping
chr7-155715k-fwd TGCATTACACTGGAACGTGAA BCAS3 junction fine mapping
chr7-155715k-rev CGATGCCAACCCACTTATTA BCAS3 junction fine mapping
chr7-cloning-fwd GCAGAACACAAAATCACCACA BCAS3 junction cloning and sequencing
chr7-cloning-rev TCTCCTCAGGGATGTTAATGTATG BCAS3 junction cloning and sequencing
chr17-cloning-fwd TTTGCATTGTTGTGATAGGACAT BCAS3 junction cloning and sequencing
chr17-cloning-rev CTCCAGTGCATTTTGCCTTT BCAS3 junction cloning and sequencing
BCAS3-ex23-fwd GGATCCGGAACAGAACTTCA BCAS3 real-time PCR
BCAS3-ex23-rev TTGCTGGTACCTACGGGAAG BCAS3 real-time PCR
BCAS3-ex1-fwd ATTCCCCAAGAAGACCCAGT BCAS3 real-time PCR
BCAS3-ex1-rev TCCTGCAGAAAAGTCGAGGAG BCAS3 real-time PCR
BCAS3-ex9-fwd GGAGCCTGTGGAGACAACAT BCAS3 real-time PCR
BCAS3-ex9-rev CTGTGGATGGCAACATCATC BCAS3 real-time PCR
USP31 rev CAGAGCTAAGGTGCGAGAGC HCC1806 small deletion fusions
ERN2 ex8 fwd GACACAGCCACCCTCTTCTC HCC1806 small deletion fusions
ERN2 ex8 rev CTCGCTCTCCTGAGACTTGG HCC1806 small deletion fusions
ERN2 ex19 fwd CTTTATCGCCAGGCAAACAT HCC1806 small deletion fusions
ERN2 ex19 rev ACTCCTTCTCCAGCCAGTCA HCC1806 small deletion fusions
ATAD5-ex6-fwd AGCAGCTGATCCTGTCCCTA HCC1806 small deletion fusions
ATAD5-ex6-rev CAAATGCCACAAACAACACC HCC1806 small deletion fusions
SUZ12-fwd GAGGGGGTGGCAGTTACTC HCC1806 small deletion fusions
SUZ12-rev AGATCTGTGTTGGCTTCTCAAA HCC1806 small deletion fusions
ATAD5-ex14-fwd AGCACTTCCTCCCAAAACCT HCC1806 small deletion fusions
ATAD5-ex14-rev CAGCTCCAACGTCTTTGACA HCC1806 small deletion fusions
PITPNB fwd GGGCAGCTTTACTCTGTTGC HCC1806 small deletion fusions
PITPNB rev ATGCAGGCACTTTGCTCTTT HCC1806 small deletion fusions
CHEK2 ex2 fwd AACTCCAGCCAGTCCTCTCA HCC1806 small deletion fusions
CHEK2 ex2 rev TGTCCCTCCCAAACCAGTAG HCC1806 small deletion fusions
CHEK2 ex10 fwd CTGTTGGGACTGCTGGGTAT HCC1806 small deletion fusions
CHEK2 ex10 rev CGTAAAACGTGCCTTTGGAT HCC1806 small deletion fusions
AFF3-a-fwd AGAAGAGAGCTCCACGCTCA HCC1806 tandem duplication fusions
AFF3-a-rev GCTCCCGTTCCTTTTCTTTC HCC1806 tandem duplication fusions
AFF3-b-fwd TGAAGTCGTCTTCGGAAACC HCC1806 tandem duplication fusions
AFF3-b-rev ACTTTGCCAGGTGCTTGAAT HCC1806 tandem duplication fusions
BC156887-a-fwd GCAGAAGTGGGAGCCAAG HCC1806 tandem duplication fusions
BC156887-a-rev GTCCATGGTGGGAGGTGTC HCC1806 tandem duplication fusions
BC156887-b-fwd AGTTGGCAGCCATCAAAGTT HCC1806 tandem duplication fusions
BC156887-b-rev CAAGCCAGAGTTGGTCATCA HCC1806 tandem duplication fusions
CATSPERB-a-fwd CAGAGAAACGCTTTGCATGT HCC1806 tandem duplication fusions
CATSPERB-a-rev GGAGCAAGTCCTCCTGATGT HCC1806 tandem duplication fusions
TC2N-b-fwd CATTTGTGGTGCCCAAGTTT HCC1806 tandem duplication fusions
TC2N-b-rev AGCGTCGACTCAAATCAGGT HCC1806 tandem duplication fusions
EPHB2-a-fwd ATGCGGAAGAGGTGGATGTA HCC1806 tandem duplication fusions
EPHB2-a-rev TGTTGATGGGACAGTGGGTA HCC1806 tandem duplication fusions
EPHB2-b-fwd TGGTCTTCCTCATTGCTGTG HCC1806 tandem duplication fusions
EPHB2-b-rev CCGATCACCTGCTCAATTTT HCC1806 tandem duplication fusions
FUSIP1-fwd CACGTCTCTGTTCGTCAGGA HCC1806 tandem duplication fusions
FUSIP1-rev TCAGCATCACGAACATCCTC HCC1806 tandem duplication fusions
MYOM3-a-fwd CTGGGAGAGGACTGAGATCG HCC1806 tandem duplication fusions
MYOM3-a-rev AGCGAAGAGGGATCCAGAAC HCC1806 tandem duplication fusions
MYOM3-b-fwd GGACCCCAAAGACTCAGACA HCC1806 tandem duplication fusions
MYOM3-b-rev CCTGAGACGATGCAAGTCAA HCC1806 tandem duplication fusions
PNRC2-fwd ACTGGGTCCCTGTTTCCTTT HCC1806 tandem duplication fusions
PNRC2-rev CACAGTGCACACAACACGAG HCC1806 tandem duplication fusions
LAMA2-a-fwd CTCCTCCTTCTGCTGCTCTC HCC1806 tandem duplication fusions
LAMA2-a-rev TTGCTGATGTGCCTGTGACT HCC1806 tandem duplication fusions
LAMA2-b-fwd CCTGGAAACTGGATTTTGGA HCC1806 tandem duplication fusions
LAMA2-b-rev GCACTTGGTCTCCCATTGAT HCC1806 tandem duplication fusions
ARHGAP18-a-fwd CTCTCCAGTTCCCAGGGAGT HCC1806 tandem duplication fusions
ARHGAP18-a-rev CCTTTGCATGGCTGTTCC HCC1806 tandem duplication fusions
ARHGAP18-b-fwd CTGGAGATCCACAGGAAAGC HCC1806 tandem duplication fusions
ARHGAP18-b-rev TGCCTCGTCATCTCTTCCTT HCC1806 tandem duplication fusions
HS6ST2-b-fwd CCCGGTACTTGAGTGAGTGG HCC1806 tandem duplication fusions
HS6ST2-b-rev GCGGTTGTTGGCTAGATTGT HCC1806 tandem duplication fusions
GPC3-a-fwd GATGCTGCTCAGCTTGGACT HCC1806 tandem duplication fusions
GPC3-a-rev TCCATGTTCAATCGTGCTGT HCC1806 tandem duplication fusions
SMURF2-a-fwd CGAGACGAGAGGAGGGAAA HCC1806 tandem duplication fusions
SMURF2-a-rev GGATGTGCCGAGAGTCG HCC1806 tandem duplication fusions
CCDC46-b-fwd GCAGCTGGTAGAGCTTGGTC HCC1806 tandem duplication fusions
CCDC46-b-rev TTGGCCTTTTCCAGTGTCAT HCC1806 tandem duplication fusions
c6orf105-a-fwd TTGCACACCATTTTCCAAGA HCC1806 tandem duplication fusions
c6orf105-a-rev GTTGTTTTTGGCATGTGCAG HCC1806 tandem duplication fusions
c6orf105-b-fwd TTTTGGCATTCTGGATCCTC HCC1806 tandem duplication fusions
c6orf105-b-rev TACACCCAGGTACCCGTCTC HCC1806 tandem duplication fusions
phactr1-a-fwd GGCCAGGATCTCCTTTAACC HCC1806 tandem duplication fusions
phactr1-a-rev TCGCTTTTCTTCTTCCTCCA HCC1806 tandem duplication fusions
phactr1-b-fwd GGTCACCAAAGCAGGACCTA HCC1806 tandem duplication fusions
phactr1-b-rev GCCAGGGAGCTGGTGTATAA HCC1806 tandem duplication fusions
IMMP2L-fwd GGTCACATCTGGGTTGAAGG HCC1806 large deletion fusion
IMMP2L-rev TAAGCGCTCTGGAGGAAGAA HCC1806 large deletion fusion
DOCK4-fwd GGTGCTGAAGGCACAAGAAT HCC1806 large deletion fusion
DOCK4-rev CCAGACCCTTTGCTCTCTTG HCC1806 large deletion fusion
c21.1 GGCAGCCGGTGAGGAGTTTGG MDA-MB-134 genomic junction PCR and sequencing
c21.2 AGGCTGCCCCACAGAGACCC MDA-MB-134 genomic junction PCR and sequencing
c26.1 GCTGGAGAGGCCGTTGTCTGG MDA-MB-134 genomic junction PCR and sequencing
c26.2 CGACTGCAGTGAGGTCAGGCG MDA-MB-134 genomic junction PCR and sequencing
c28.1 TGTGGCCATTCCCCTGCTGC MDA-MB-134 genomic junction PCR and sequencing
c28.2 CCAAGGACAGCCCACGTGCC MDA-MB-134 genomic junction PCR and sequencing
c35.1 GGCCCACTGTCCTTTAGCATGGC MDA-MB-134 genomic junction PCR and sequencing
c35.2 CCTTTCTGTTCCCTTCCTCCTTCCC MDA-MB-134 genomic junction PCR and sequencing
c8.1 AGAGAAGGGAAGTGGGTGTGGC MDA-MB-134 genomic junction PCR and sequencing
c8.2 TTGCCTATGCTGTCTTTCTGTGACAA MDA-MB-134 genomic junction PCR and sequencing
i228.1 AGGCCAGAGCCCAGGAGTGG MDA-MB-134 genomic junction PCR and sequencing
i228.2 ACTGAGAGGGGGTGAACTGGGC MDA-MB-134 genomic junction PCR and sequencing
i242.1 GAGAGCAGCCCCAGGGAGGG MDA-MB-134 genomic junction PCR and sequencing
i242.2 GGCTTTACCATGTTGGCGTTGAATTGG MDA-MB-134 genomic junction PCR and sequencing
nc1.1 TTTAACGCCTTTTGGTGTCC MDA-MB-134 genomic junction PCR and sequencing
nc1.2 TGCTCCAGAGGTGTGAACAG MDA-MB-134 genomic junction PCR and sequencing
nc2.1 GTGCTGACCTTCTGGTCCAT MDA-MB-134 genomic junction PCR and sequencing
nc2.2 AGTCAGTCCATCCGGTGTTC MDA-MB-134 genomic junction PCR and sequencing
nc3.1 TACCCTCTCAGGTGCTGTCC MDA-MB-134 genomic junction PCR and sequencing
nc3.2 CAGACTACAGGGGCTGCAAT MDA-MB-134 genomic junction PCR and sequencing
nc4.1 CCAAGTGCTCCTGTCCTCTC MDA-MB-134 genomic junction PCR and sequencing
nc4.2 AATGGTTGACCAGGTTCTGC MDA-MB-134 genomic junction PCR and sequencing
nc5.1 AGCGCCTGGTACACAAGAAT MDA-MB-134 genomic junction PCR and sequencing
nc5.2 CACTCTTTGAATTGGCGTGA MDA-MB-134 genomic junction PCR and sequencing
nc6.1 ACTCCCTGTTGTGGGAACAC MDA-MB-134 genomic junction PCR and sequencing
nc6.2 GAAACCATCTGGTCCAGGAA MDA-MB-134 genomic junction PCR and sequencing
nc7.1 TTTTAAGCCTGTCGGAAAAG MDA-MB-134 genomic junction PCR and sequencing
nc7.2 TGGCCCTGAATACTTTTTGG MDA-MB-134 genomic junction PCR and sequencing
ni1.1 TGGCCCTGAATACTTTTTGG MDA-MB-134 genomic junction PCR and sequencing
ni1.2 TTTCTTTTGCCCCACTGTTC MDA-MB-134 genomic junction PCR and sequencing
ni10.1 CTGGAGGTCTCTGCCAGTTC MDA-MB-134 genomic junction PCR and sequencing
ni10.2 ACTGCTCCCTTCTTCCTTCC MDA-MB-134 genomic junction PCR and sequencing
ni11.1 CCAGAGGCAGAGGACAGAAC MDA-MB-134 genomic junction PCR and sequencing
ni11.2 AATAGGGGAATTGGGGTGAG MDA-MB-134 genomic junction PCR and sequencing
ni12.1 GCCCAGCCAAAATAGATTCA MDA-MB-134 genomic junction PCR and sequencing
ni12.2 CAGCTTGGACTCCCTGTGAT MDA-MB-134 genomic junction PCR and sequencing
ni13.1 TTTTGGACACAGAGGGAAGG MDA-MB-134 genomic junction PCR and sequencing
ni13.2 GAGTTTAGCGGCTCACACCT MDA-MB-134 genomic junction PCR and sequencing
ni14.1 TTCAGCCATCTGGATTTTCC MDA-MB-134 genomic junction PCR and sequencing
ni14.2 GGTTGCTTCCTGTGTTTGGT MDA-MB-134 genomic junction PCR and sequencing
ni15.1 CACCACTGAGTCTGGAAGCA MDA-MB-134 genomic junction PCR and sequencing
ni15.2 GTTTTGAAATGGGGGACCTC MDA-MB-134 genomic junction PCR and sequencing
ni16.1 CACCTGTTCTCCCAAACGAT MDA-MB-134 genomic junction PCR and sequencing
ni16.2 GGCAGAATGAAGTGGATTCAA MDA-MB-134 genomic junction PCR and sequencing
ni17.1 GCCACACCAGAAGGTTGTTT MDA-MB-134 genomic junction PCR and sequencing
ni17.2 CATCCACATCTGGAATGCTG MDA-MB-134 genomic junction PCR and sequencing
ni18.1 TTCAGCGAGTAGGGCAGAGT MDA-MB-134 genomic junction PCR and sequencing
ni18.2 TGTCTCCATCACCAGGAAAA MDA-MB-134 genomic junction PCR and sequencing
ni19.1 CTTCTGCAGCTTTGGTCCAT MDA-MB-134 genomic junction PCR and sequencing
ni19.2 GCTCCCTTCTCCATCCCTAC MDA-MB-134 genomic junction PCR and sequencing
ni2.1 CATATTACTTTTGCTGAAGATTCTGA MDA-MB-134 genomic junction PCR and sequencing
ni2.2 ACAACCACTGCAAACCATGA MDA-MB-134 genomic junction PCR and sequencing
ni20.1 AGGGAGAGGAAAAGGGTCAG MDA-MB-134 genomic junction PCR and sequencing
ni20.2 AACTCCCCACAAAGTTGCAC MDA-MB-134 genomic junction PCR and sequencing
ni21.1 ACTTCAGCCCAGGAGTTCAA MDA-MB-134 genomic junction PCR and sequencing
ni21.2 ACTCGCTTCCCGAAACACTA MDA-MB-134 genomic junction PCR and sequencing
ni3.1 AGAGATGATCATGGGCAAGC MDA-MB-134 genomic junction PCR and sequencing
ni3.2 TAGGCTGGCTTGGATTGC MDA-MB-134 genomic junction PCR and sequencing
ni4.1 CTTCCTGTTTGGGAGTTGGA MDA-MB-134 genomic junction PCR and sequencing
ni4.2 AGAGCCTGCATTTCTTGCAT MDA-MB-134 genomic junction PCR and sequencing
ni5.1 TAAACAGACCCCACCCAGAG MDA-MB-134 genomic junction PCR and sequencing
ni5.2 GCCATTTCCAGTTTCGATGT MDA-MB-134 genomic junction PCR and sequencing
ni6.1 TAAGTGCAGTGGCTCACACC MDA-MB-134 genomic junction PCR and sequencing
ni6.2 AGGAGTGGCATTCAATGGAG MDA-MB-134 genomic junction PCR and sequencing
ni7.1 TGTGGCGAAGCTTAGAGGAT MDA-MB-134 genomic junction PCR and sequencing
ni7.2 CAGAGAGGTCATGGTTGTGC MDA-MB-134 genomic junction PCR and sequencing
ni8.1 GATGAGCAGAGGGGGTATCA MDA-MB-134 genomic junction PCR and sequencing
ni8.2 ACTCAGCATACTGCCCCACT MDA-MB-134 genomic junction PCR and sequencing
ni9.1 CCAGGCAGAATGAAGAAAGC MDA-MB-134 genomic junction PCR and sequencing
ni9.2 AAGTGATCTGCCCACCTCAG MDA-MB-134 genomic junction PCR and sequencing
c21nested1a CAAACAGGGTAATCGGAGGA MDA-MB-134 genomic junction PCR and sequencing
c21nested1b TTTTCAACAGCGGAGTAGGC MDA-MB-134 genomic junction PCR and sequencing
c21nested2a CTTCCATCATGGTGATGTGC MDA-MB-134 genomic junction PCR and sequencing
c21nested2b TTGGCTGCTGAGTTTCTCCT MDA-MB-134 genomic junction PCR and sequencing
c35nested1a CCTTTAGCATGGCTTTCTGG MDA-MB-134 genomic junction PCR and sequencing
c35nested1b TGTCTGCAATGGGGACATTA MDA-MB-134 genomic junction PCR and sequencing
c35nested2a AATAATTGGCCATGCTCCTG MDA-MB-134 genomic junction PCR and sequencing
c35nested2b TTCCTCCTTCCCTTTTGGTT MDA-MB-134 genomic junction PCR and sequencing
new-nc7-1 CGAGCCAGGTAAGGGATGT MDA-MB-134 genomic junction PCR and sequencing
new-nc7-2 ACAGGGCTTTCCTGATCAAA MDA-MB-134 genomic junction PCR and sequencing
new-ni1-1 CTGTGGTTGCCTGTCACCTA MDA-MB-134 genomic junction PCR and sequencing
new-ni1-2 ACTGGGCTTTCCATTCACTG MDA-MB-134 genomic junction PCR and sequencing
new-ni13-1 CAGCCTGTGCAAAACGAATA MDA-MB-134 genomic junction PCR and sequencing
new-ni13-2 TTTGAGGCTGCAGTGAGCTA MDA-MB-134 genomic junction PCR and sequencing
new-ni3-1 ACCATTAGTGTGGGCGAAAG MDA-MB-134 genomic junction PCR and sequencing
new-ni3-2 GATTTCTTGGCTGGCTTGA MDA-MB-134 genomic junction PCR and sequencing
new-ni5-1 TCTCCCACAAGCCTCTCACT MDA-MB-134 genomic junction PCR and sequencing
new-ni5-2 TCCAGCAGTGACAACAGAGG MDA-MB-134 genomic junction PCR and sequencing
UNC5D-fwd CGAGAGCTCAGGTTTGAAGG MDA-MB-134 fusions
UNC5D-rev CTTCCCTTCCTTGTGGGTCT MDA-MB-134 fusions
ARSG-fwd GTCACCAGCACTGCCTTGTA MDA-MB-134 fusions
ARSG-rev AGGCCTTGTAACGCTCCAG MDA-MB-134 fusions
EFR3A-fwd ATCCAAAAGATGGCCTTGTG MDA-MB-134 fusions
EFR3A-rev CAGTGCCTCCATAGCAATCA MDA-MB-134 fusions
ANK1-fwd TGCTGCTACCAGCTTTCTGA MDA-MB-134 fusions
ANK1-rev TGTTCCCCTTCTTGGTTGTC MDA-MB-134 fusions
SHANK2-fwd AGACCATTGGGAGCTACGTG MDA-MB-134 fusions
SHANK2-rev GTACTCGAAGGCCGAGAGTG MDA-MB-134 fusions
FBXL11-fwd CCGGATCCAGACTTCACTGT MDA-MB-134 fusions
FBXL11-rev GGCTAAACTCGAGGCTGATG MDA-MB-134 fusions
ANO1-fwd GGCTCTGGTGCACTATGTGA MDA-MB-134 internal rearrangements
ANO1-rev GTACTCGACGCAGTTGCTGA MDA-MB-134 internal rearrangements
OSBPL5-fwd TCGGAGAGAGAGAACCCTGA MDA-MB-134 internal rearrangements
OSBPL5-rev CAGCTTCATGCGGCTGTA MDA-MB-134 internal rearrangements
OVCH2-fwd GAAGCTGCACTTCCCAGAAA MDA-MB-134 internal rearrangements
OVCH2-rev CCTTGTCACTGTAGTTTTCAGGAT MDA-MB-134 internal rearrangements
CCDC67-fwd GCCTAAAGGCTCAATTTTCCA MDA-MB-134 internal rearrangements
CCDC67-rev CCATTTAGTTGAGTTTGGTAACTTTGT MDA-MB-134 internal rearrangements
P2RX2-fwd CCCAAATTCCACTTCTCCAA MDA-MB-134 internal rearrangements
P2RX2-rev GGTGGTGCCATTGATCTTGT MDA-MB-134 internal rearrangements
POLG-fwd CAGCACCTTCCTGGACACC MDA-MB-134 internal rearrangements
POLG-rev CTGTTCGAGACAGTGCTTCCT MDA-MB-134 internal rearrangements
CHD2-fwd GCTCTTGCCAAAGGAACAAG MDA-MB-134 internal rearrangements
CHD2-rev TTCGGATTTCTCCCTTGATG MDA-MB-134 internal rearrangements
c16orf28-fwd GGCCATGATTGAGAAGATCC MDA-MB-134 internal rearrangements
c16orf28-rev GCATGTCGTTGCATTTTGTA MDA-MB-134 internal rearrangements
STAT3-fwd TCAGGATGTCCGGAAGAGAG MDA-MB-134 internal rearrangements
STAT3-rev CGTACTCCATCGCTGACAAA MDA-MB-134 internal rearrangements
SYT3-fwd GGTGGTGCTGGACAACCT MDA-MB-134 internal rearrangements
SYT3-rev CCGATCACCTCGTTGTGC MDA-MB-134 internal rearrangements
SFTPB-fwd CTAGGGCATTGCCTACAGGA MDA-MB-134 internal rearrangements
SFTPB-rev GCATACAGATGCCGTTTGAG MDA-MB-134 internal rearrangements
ZNF512B-fwd GTGTCCAAACTCAGGGTGCT MDA-MB-134 internal rearrangements
ZNF512B-rev GTGTCTGCACTCAGCTGGAA MDA-MB-134 internal rearrangements
SERAC1-fwd TCCTTCAGCAACAGTGGAAA MDA-MB-134 internal rearrangements
SERAC1-rev CATATTTCCAATGACACGCATTA MDA-MB-134 internal rearrangements
PTPRN2-fwd ACATGGAGGACCACCTGAAG MDA-MB-134 internal rearrangements
PTPRN2-rev GTCAGCATGACGATCACCAC MDA-MB-134 internal rearrangements
EPB49-fwd CCTCCCGAGATTCCAGTGT MDA-MB-134 readthrough fusions
EPB49-rev CGTGTTCCAGGAGGGAATAA MDA-MB-134 readthrough fusions
SHANK2-fwd TTGAGGAGAAGACGGTGGTC MDA-MB-134 readthrough fusions
SHANK2-rev GAAGTCCCCGGTCCTTAGTC MDA-MB-134 readthrough fusions
POLD3-fwd CAAATGGCTGAGCTATACACTAGG MDA-MB-134 readthrough fusions
POLD3-rev AACCTTGTGGCAGGAATGTC MDA-MB-134 readthrough fusions
KCNU1-fwd GCCATGTAAGAAGCCTCCAC MDA-MB-134 readthrough fusions
KCNU1-rev ACAGCTTCCAACAGGGTCAG MDA-MB-134 readthrough fusions
ADAM2-fwd CCAGTTGATTGGATTGACGA MDA-MB-134 readthrough fusions
ADAM2-rev CTTGAAAGGTTGCACCAACA MDA-MB-134 readthrough fusions
LRRC32-fwd GCTGCACAACACCAAGACAA MDA-MB-134 readthrough fusions
LRRC32-rev GCTGATCTCATTGGTGCTCA MDA-MB-134 readthrough fusions
ACER3-fwd TCAGCCAGCTCTGCTCTGAT MDA-MB-134 readthrough fusions
ACER3-rev GGCACAACCATGACCTCTCT MDA-MB-134 readthrough fusions
NADSYN1-fwd GCTACGGATGTTGGGATCAT MDA-MB-134 readthrough fusions
NADSYN1-rev GCAGGATCTTCCTGTTGAGG MDA-MB-134 readthrough fusions
IMMP2L-fwd AGTCACAAGGGTGGGTGAAA MDA-MB-134 readthrough fusions
IMMP2L-rev TGGTTCAAAAGCACCACATC MDA-MB-134 readthrough fusions
PRDM4-fwd TACCCTCACCTGGAGAGCAG MDA-MB-134 readthrough fusions
PRDM4-rev ATGAAGACTTTGGGCACCAT MDA-MB-134 readthrough fusions
GGCT-fwd GGAGGGATAGCCACCATTTT MDA-MB-134 readthrough fusions
GGCT-rev ATGGGGGAGCACTTTCGTA MDA-MB-134 readthrough fusions
NOD1-fwd GCGGCGATTACAGAAAACAT MDA-MB-134 readthrough fusions
NOD1-rev CTCTCAGCAGAAGGGCAATC MDA-MB-134 readthrough fusions
KLHL35-fwd TACGACCCCTTCTCCAACAC MDA-MB-134 readthrough fusions
KLHL35-rev GATGGTGTCCTCAAGGGAGA MDA-MB-134 readthrough fusions
AQP11-fwd AGCTTTGGCACTTTCGCTAC MDA-MB-134 readthrough fusions
AQP11-rev CCGGTGTTTTCCATATGAGG MDA-MB-134 readthrough fusions
PAK1-fwd AGTTACCACCTCCTGCCTCA MDA-MB-134 readthrough fusions
PAK1-rev CGAGCTACCGCTTCACTTTC MDA-MB-134 readthrough fusions
SERPINH1-fwd AGCAGCAAGCAGCACTACAA MDA-MB-134 readthrough fusions
SERPINH1-rev AGGACCGAGTCACCATGAAG MDA-MB-134 readthrough fusions
ANK1-fwd GAGCTGCAGTTCAGTGTGGA MDA-MB-134 readthrough fusions
ANK1-rev TCCAGCATGTTCACGATCTC MDA-MB-134 readthrough fusions
AP3M2-fwd AGAAAATGTGCCTCCGGTTA MDA-MB-134 readthrough fusions
AP3M2-rev TGGAAAACCATTGTCAAGCA MDA-MB-134 readthrough fusions
ODZ4-ex1-fwd CGCCGAGAGAGAGAGGAG ODZ4 exons, real time
ODZ4-ex1-rev ATCTCTCAGACACTTGGTCGG ODZ4 exons, real time
ODZ4-ex4-fwd GACTGAAATTACCATGTGTCCAAA ODZ4 exons, real time
ODZ4-ex4-rev AAGGAGAGGAAGCCTTACCG ODZ4 exons, real time
ODZ4-ex6-fwd CACCGAGCATGAAAACACTG ODZ4 exons, real time
ODZ4-ex6-rev TCCATTAACTCCCTGAACCG ODZ4 exons, real time
ODZ4-ex8-fwd GACATTGCAGGACAACCTCA ODZ4 exons, real time
ODZ4-ex8-rev ATCACCAGGGTACCCACTGA ODZ4 exons, real time
ODZ4-ex11-fwd GCCTCCCTCCTTCACATACA ODZ4 exons, real time
ODZ4-ex11-rev TTTGGATTCAGGAATCTGGC ODZ4 exons, real time
ODZ4-ex18-fwd AGGAGCTGGCTGTGACACTT ODZ4 exons, real time
ODZ4-ex18-rev TGATGTCCAGAGGGTTAGGG ODZ4 exons, real time
ODZ4-ex26-fwd GTGGTGGTGAAGGACCTTGT ODZ4 exons, real time
ODZ4-ex26-rev GTGGACAAGTTTGGGCTGAT ODZ4 exons, real time
ODZ4-ex29-fwd GAGACCTCCAGCAAGGATGA ODZ4 exons, real time
ODZ4-ex29-rev AACAGCTACTACATCGGGGC ODZ4 exons, real time
ODZ4-ex21-fwd AGCCCCAGACCTGTCCTATT ODZ4 exons, real time
ODZ4-ex21-rev TTGGACGCGTCAATTTCATA ODZ4 exons, real time
GAPDH_RT_fwd GCAAATTCCATGGCACCGT GAPDH, real-time control
GAPDH_RT_rev TCGCCCCACTTGATTTTGG GAPDH, real-time control
Appendix 2 - BACs
BAC clones used for FISH. All positions are from the hg18/NCBI36 build of the human genome.
Name Chromosome Start position (bp) End position (bp)
RP11-65L3 2 178,966,383 179,139,203
RP11-67G7 2 182,686,311 182,851,083
RP11-59L22 2 192,914,919 193,076,744
RP11-15J24 2 205,174,891 205,347,264
RP4-781A18 7 27,976,454 28,166,805
RP11-563O5 7 114,005,319 114,175,715
RP11-518I12 7 157,549,662 157,756,716
RP11-381A5 17 55,638,614 55,853,525
RP11-947H19 17 55,827,724 56,009,281
RP11-105G8 17 55,904,414 56,060,400
RP11-160D4 17 56,857,614 57,015,721
RP11-466D9 17 56,954,223 57,129,790
RP11-180G7 17 57,115,715 57,267,805
Appendix 3 – Manufacturers and suppliers
Reagent Manufacturer/Supplier
anti-BCAS3 antibody Gift from Dr Jason Carroll, CRUK Cambridge Research Institute, Cambridge, UK
BACs Wellcome Trust Sanger Institute, UK/Invitrogen, Paisley, UK
BioPrime labelling kit Invitrogen, Paisley, UK
Biotin dUTP Roche Diagnostics, Basel, Switzerland
Biotinylated anti-streptavidin Vector Laboratories Inc., Burlingame, CA, USA
Chloramphenicol Sigma-Aldrich, Dorset, UK
Colcemid Sigma-Aldrich, Dorset, UK
Complete Protease Inhibitor Cocktail Roche Diagnostics, Basel, Switzerland
Cryotubes Fisher Scientific, Loughborough, UK
Cy3-labelled dCTP Amersham, Epsom, UK
Cy5-labelled dCTP Amersham, Epsom, UK
Cy5-labelled streptavidin Amersham, Epsom, UK
DAPI in Vectashield Vector Laboratories Inc., Burlingame, CA, USA
Denhardt's Solution Sigma-Aldrich, Dorset, UK
Dextran sulphate Sigma-Aldrich, Dorset, UK
Digoxigenin-11 dUTP Roche Diagnostics, Basel, Switzerland
DMEM-F12 GIBCO Technologies, Invitrogen, Paisley, UK
DMSO Invitrogen, Paisley, UK
DNA polymerase I Sigma-Aldrich, Dorset, UK
DNA-free Kit Ambion, Applied Biosystems, Foster City, USA
DNase I Sigma-Aldrich, Dorset, UK
DNAzol reagent Invitrogen, Paisley, UK
dNTPs Invitrogen, Paisley, UK
ECL Plus Western Blotting Detection System GE Healthcare, Buckinghamshire, UK
Elongase polymerase mix Invitrogen, Paisley, UK
Eppendorf tubes Starlab, Milton Keynes, UK
Ethanol Sigma-Aldrich, Dorset, UK
Falcon tubes Bibby Sterilin, Stone, UK
FBS Sigma-Aldrich, Dorset, UK
FITC-labelled anti-digoxigenin Roche Diagnostics, Basel, Switzerland
Formamide VWR International, Lutterworth, UK
G50 MicroSpin columns GE Healthcare, Buckinghamshire, UK
GenomiPhi Kit GE Healthcare, Buckinghamshire, UK
HiSpeed Plasmid Midi-Prep Kit Qiagen UK, Crawley, UK
HotMaster Taq VWR International, Lutterworth, UK
Hyperladder I Bioline, London, UK
Isopropanol Invitrogen, Paisley, UK
ITS Sigma-Aldrich, Dorset, UK
Kanamycin Sigma-Aldrich, Dorset, UK
LB agar Hutchison/MRC Centre Media Unit
LB broth Hutchison/MRC Centre Media Unit
Mate-Pair Library Prep Kit Illumina, San Diego, CA, USA
MCDB-201 GIBCO Technologies, Invitrogen, Paisley, UK
NaH2PO4 VWR International, Lutterworth, UK
Na2HPO4 VWR International, Lutterworth, UK
NanoDrop spectrophotometer Labtech International, Ringmer, UK
Paired-End DNA Sample Prep Kit Illumina, San Diego, CA, USA
PBS Hutchison/MRC Centre Media Unit
Pellet Paint Merck KGaA, Darmstadt, Germany
Penicillin/streptomycin GIBCO Technologies, Invitrogen, Paisley, UK
Pipette tips Starlab, Milton Keynes, UK
QIAquick PCR Purification Kit Qiagen UK, Crawley, UK
RIPA buffer Hutchison/MRC Centre Media Unit
RNasin Promega, Fitchburg, USA
RPMI-1640 GIBCO Technologies, Invitrogen, Paisley, UK
Rubber cement Heffers Art and Graphics Shop, Cambridge, UK
Sodium acetate Hutchison/MRC Centre Media Unit
Spectrum Orange dUTP Vysis UK Ltd/Abbott Laboratories, Downers Grove IL, USA
SSC Hutchison/MRC Centre Media Unit
SuperScript III First-Strand Synthesis Kit Invitrogen, Paisley, UK
SYBR Green PCR Master Mix Applied Biosystems, Foster City, USA
TE Hutchison/MRC Centre Media Unit
Tissue microarrays Dr Suet-Feung Chin, CRUK Cambridge Research Institute, Cambridge, UK
TOPO XL PCR Cloning Kit Invitrogen, Paisley, UK
Tris-acetate pre-cast gel Invitrogen, Paisley, UK
Trizol reagent Invitrogen, Paisley, UK
Trypsin GIBCO Technologies, Invitrogen, Paisley, UK
Tween 20 QbioGene, Livingston, Scotland
Versene Hutchison/MRC Centre Media Unit
Yeast tRNA Invitrogen, Paisley, UK
Appendix 4 – Bioinformatic pipeline scripts
A – Perl script to predict fusion genes
# Fusion Gene Prediction
# Takes the structural variant calls from sequencing and predicts possible
# fusion genes, readthroughs, and internal gene rearrangements
# Uses the Ensembl API - see
# http://www.ensembl.org/info/docs/api/api_installation.html for installation
# and necessary modules
# Liz Batty emb51@cam.ac.uk
# Last modified August 2010
use warnings;
use strict;
use Bio::EnsEMBL::Registry;
use Getopt::Long;
# set default values
my $matepair = 0;
my $strands = 0;
my $linecounter = 0;
my $columns = 0;
my $cnv = 0;
my $help = 0;
my $input = 'library.lanes.sv_calls.txt';
my $insertsize = 470;
# uncomment one of these to use either the hg18 (may2009 archive) or the hg19
# website in hyperlinks
#my $ensembl_site = "may2009.archive.ensembl.org";
my $ensembl_site = "www.ensembl.org";
my( $type_of_sv,
$support_for_sv,
$node1_chr,
$node1_start,
$node1_end,
$node1_strand,
$extra_support_1,
$node2_chr,
$node2_start,
$node2_end,
$node2_strand,
$extra_support_2,
$node1_cnv,
$node2_cnv );
my $result = GetOptions ("matepair|m" => \$matepair,
"strands|s" => \$strands,
"insertsize|i=i" => \$insertsize,
"columns|c" => \$columns,
"cnv|n" => \$cnv,
"help|h" => \$help);
if ($help) {
print "Options are:\n",
"--matepair\tUse for mate pair libraries where RF is a normal read\n",
"--strands\tThe file uses -1 and 1 for strand directions\n",
"--insertsize\tUse to set the insert size of the library\n",
"--columns\tThe file has extra supporting read columns (see later version of Kevin's script)\n",
"--cnv\tThe file has been checked for CNVs\n";
exit;
}
if (@ARGV) {
$input = $ARGV[0];
}
open (INPUT, '<', $input) || die "failed to open input file $input: $!";
# print correct header line to the output file
{
if ($columns == 0 && $cnv == 0) {
print "Type of SV\tSV Support\tNode 1 chr\tNode 1 start\tNode 1 end\tNode 1 direction\tNode 2 chr\tNode 2 start\tNode 2 end\tNode 2 direction\tGene at node 1\tGene at node 2\tType of fusion\tDetails\n";
}
elsif ($columns == 1 && $cnv == 0) {
print "Type of SV\tSV Support\tNode 1 chr\tNode 1 start\tNode 1 end\tNode 1 direction\tExtra support\tNode 2 chr\tNode 2 start\tNode 2 end\tNode 2 direction\tExtra support\tGene at node 1\tGene at node 2\tType of fusion\tDetails\n";
}
elsif ($columns == 0 && $cnv == 1) {
print "Type of SV\tSV Support\tNode 1 chr\tNode 1 start\tNode 1 end\tNode 1 direction\tNode 2 chr\tNode 2 start\tNode 2 end\tNode 2 direction\tNode 1 CNVs\tNode 2 CNVs\tGene at node 1\tGene at node 2\tType of fusion\tDetails\n";
}
elsif ($columns == 1 && $cnv == 1) {
print "Type of SV\tSV Support\tNode 1 chr\tNode 1 start\tNode 1 end\tNode 1 direction\tExtra support\tNode 2 chr\tNode 2 start\tNode 2 end\tNode 2 direction\tExtra support\tNode 1 CNVs\tNode 2 CNVs\tGene at node 1\tGene at node 2\tType of fusion\tDetails\n";
}
}
#make a connection to the Ensembl database
my $registry = 'Bio::EnsEMBL::Registry';
$registry->load_registry_from_db(
-host => 'ensembldb.ensembl.org',
-user => 'anonymous'
);
# tells it we want to work with a slice of the human genome
my $slice_adaptor = $registry->get_adaptor( 'Human', 'Core', 'Slice' );
#read through the list of SVs
while (<INPUT>)
{
# chomp the newline, replace the + and - with 1 and -1, and read into variable
if ($strands == 0) {
$_=~s/\+/1/g;
$_=~s/\-/-1/g;
}
chomp $_;
my $structural_variant = $_;
my $has_sv_been_printed = 0;
#split up the columns in the SV file
if ($columns == 1 && $cnv == 0) {
( $type_of_sv,
$support_for_sv,
$node1_chr,
$node1_start,
$node1_end,
$node1_strand,
$extra_support_1,
$node2_chr,
$node2_start,
$node2_end,
$node2_strand,
$extra_support_2 ) = split('\t', $_);
}
elsif ($columns == 1 && $cnv == 1) {
( $type_of_sv,
$support_for_sv,
$node1_chr,
$node1_start,
$node1_end,
$node1_strand,
$extra_support_1,
$node2_chr,
$node2_start,
$node2_end,
$node2_strand,
$extra_support_2,
$node1_cnv,
$node2_cnv) = split('\t', $_);
}
elsif ($columns == 0 && $cnv == 1) {
( $type_of_sv,
$support_for_sv,
$node1_chr,
$node1_start,
$node1_end,
$node1_strand,
$node2_chr,
$node2_start,
$node2_end,
$node2_strand,
$node1_cnv,
$node2_cnv) = split('\t', $_);
}
else {
( $type_of_sv,
$support_for_sv,
$node1_chr,
$node1_start,
$node1_end,
$node1_strand,
$node2_chr,
$node2_start,
$node2_end,
$node2_strand, ) = split('\t', $_);
}
#skip header, if it is there
next if ($_=~/^Type/);
#skip LOPs
next if ($type_of_sv eq 'LOP');
#skip ITRs (newer equivalent of LOP)
next if ($type_of_sv eq 'ITR');
#skip over mitochondria and other haplotypes, etc
next if ($node1_chr eq 'M' || $node1_chr eq 'MT' || $node1_chr =~ /Un/ ||
$node2_chr eq 'M' || $node2_chr eq 'MT' || $node2_chr =~ /Un/ );
#mate pair reads have the opposite strand
if ($matepair == 1) {
if ($node1_strand == 1) {
$node1_strand=~s/1/-1/g;
}
else {
$node1_strand=~s/-1/1/g;
}
if ($node2_strand == 1) {
$node2_strand=~s/1/-1/g;
}
else {
$node2_strand=~s/-1/1/g;
}
}
#######################################
## FIND THE GENES AT THE BREAKPOINTS ##
#######################################
my $which_node = 1;
my ($node1_breaks_gene, $has_sv_been_printed_in_node1,
$node1_genearrayref) = get_broken_genes($has_sv_been_printed,
$structural_variant, $node1_chr, $node1_start, $node1_end, $which_node);
$has_sv_been_printed = $has_sv_been_printed_in_node1;
my @node1_genearray = @$node1_genearrayref;
$which_node = 2;
my ($node2_breaks_gene, $has_sv_been_printed_in_node2,
$node2_genearrayref) = get_broken_genes($has_sv_been_printed,
$structural_variant, $node2_chr, $node2_start, $node2_end, $which_node);
$has_sv_been_printed = $has_sv_been_printed_in_node2;
my @node2_genearray = @$node2_genearrayref;
##################################
## PREDICT ANY FUSIONS PRODUCED ##
##################################
# check if there are broken genes at both nodes, ie fusion is possible
if ( $node1_breaks_gene == 1 && $node2_breaks_gene == 1) {
# use a loop to test all genes at node 1 against all genes at node 2
foreach( @node1_genearray ) {
# split up the attributes of the gene from node one
my( $gene1_dbid,
$gene1_displayname,
$gene1_externalname,
$gene1_start,
$gene1_end,
$gene1_strand,
$gene1_stableid ) = split('\t', $_);
foreach( @node2_genearray ) {
# split up the attributes of the gene from node two
my ($gene2_dbid,
$gene2_display,
$gene2_externalname,
$gene2_start,
$gene2_end,
$gene2_strand,
$gene2_stableid) = split('\t', $_);
# test if the genes are the same - ie it is an internal deletion/duplication
if ( $gene1_stableid eq $gene2_stableid ) {
print "=HYPERLINK\(\"http\:\/\/$ensembl_site\/Homo_sapiens\/Gene\/Summary?g\=$gene1_stableid\", \"$gene1_externalname\"\)\t=HYPERLINK\(\"http\:\/\/$ensembl_site\/Homo_sapiens\/Gene\/Summary?g\=$gene2_stableid\", \"$gene2_externalname\"\)\tINTERNAL";
undef my @exon_already_seen;
# only check for exons which are deleted, not amplified/inverted
if ($type_of_sv eq "DEL") {
# set boundaries of deletion - node 2 may be before node 1
my $deleted_slice;
if ($node1_start <
$node2_start) {
$deleted_slice =
$slice_adaptor->fetch_by_region('chromosome', $node1_chr, $node1_start,
$node2_end);
}
else {
$deleted_slice =
$slice_adaptor->fetch_by_region('chromosome', $node1_chr, $node2_start,
$node1_end);
}
my $deleted_exons =
$deleted_slice->get_all_Exons();
my $are_exons_deleted = 0;
my $exoncounter = 0;
my @exon_id_list;
my %exon_seen;
while ( my $exon = shift @{$deleted_exons} ) {
$are_exons_deleted = 1;
my $stable_id = $exon->stable_id();
# count each distinct exon once, comparing stable IDs for equality
next if $exon_seen{$stable_id}++;
push(@exon_id_list, $stable_id);
$exoncounter++;
}
if ($are_exons_deleted == 0)
{
print "\tNO EXONS
DELETED";
}
else {
print "\t$exoncounter
EXONS DELETED";
}
}
}
# test for a head-on collision - ie two genes which could produce readthroughs
elsif ( $gene1_strand == $node1_strand && $gene2_strand == $node2_strand ) {
get_read_through(\@node1_genearray, $node1_strand, $node2_strand, $node2_start, $node2_end, $node2_chr);
get_read_through(\@node2_genearray, $node2_strand, $node1_strand, $node1_start, $node1_end, $node1_chr);
}
# test for a 3' to 5' fusion
elsif ( $gene1_strand != $node1_strand && $gene2_strand == $node2_strand ) {
print "=HYPERLINK\(\"http\:\/\/$ensembl_site\/Homo_sapiens\/Gene\/Summary?g\=$gene1_stableid\", \"$gene1_externalname\"\)\t=HYPERLINK\(\"http\:\/\/$ensembl_site\/Homo_sapiens\/Gene\/Summary?g\=$gene2_stableid\", \"$gene2_externalname\"\)\tFUSION\t5' of $gene2_externalname into 3' of $gene1_externalname";
}
# test for a 5' to 3' fusion
elsif ($gene1_strand == $node1_strand && $gene2_strand != $node2_strand) {
print "=HYPERLINK\(\"http\:\/\/$ensembl_site\/Homo_sapiens\/Gene\/Summary?g\=$gene1_stableid\", \"$gene1_externalname\"\)\t=HYPERLINK\(\"http\:\/\/$ensembl_site\/Homo_sapiens\/Gene\/Summary?g\=$gene2_stableid\", \"$gene2_externalname\"\)\tFUSION\t5' of $gene1_externalname into 3' of $gene2_externalname";
}
else {
print "=HYPERLINK\(\"http\:\/\/$ensembl_site\/Homo_sapiens\/Gene\/Summary?g\=$gene1_stableid\", \"$gene1_externalname\"\)\t=HYPERLINK\(\"http\:\/\/$ensembl_site\/Homo_sapiens\/Gene\/Summary?g\=$gene2_stableid\", \"$gene2_externalname\"\)\tNO FUSION PREDICTED";
}
}
}
if ( $has_sv_been_printed == 1 ) {
print "\n";
}
}
elsif( $node1_breaks_gene == 1 && $node2_breaks_gene == 0 ) {
#broken gene at node 1, want to find the gene it runs into
#node 2 is the test node
get_read_through(\@node1_genearray, $node1_strand,
$node2_strand, $node2_start, $node2_end, $node2_chr);
if ($has_sv_been_printed == 1) {print "\n";};
}
elsif( $node1_breaks_gene == 0 && $node2_breaks_gene == 1 ) {
#broken gene at node 2, want to find the gene it runs into
#node 1 is the test node
get_read_through(\@node2_genearray, $node2_strand,
$node1_strand, $node1_start, $node1_end, $node1_chr);
if ($has_sv_been_printed == 1) {print "\n";};
}
$linecounter++;
}
close INPUT;
print "Processing finished!\n";
sub get_broken_genes
{
my $has_sv_been_printed = shift;
my $structural_variant = shift;
my $node_chr = shift;
my $node_start = shift;
my $node_end = shift;
my $which_node = shift;
#create a chromosome slice for the possible breakpoint region
my $node_chromslice = $slice_adaptor->fetch_by_region('chromosome',
$node_chr, $node_end, $node_end+$insertsize);
my $node_slicestart = $node_chromslice->start();
my $node_sliceend = $node_chromslice->end();
my $node_genes = $node_chromslice->get_all_Genes();
undef my @node_genearray;
my $node_breaks_gene = 0;
while (my $gene = shift @{ $node_genes } ) {
# output the structural variant line if it has not been printed before
if ($has_sv_been_printed != 1) {
print "$structural_variant\t";
$has_sv_been_printed = 1;
}
# call subroutine to get all the gene attributes, returned joined into a
# tab-separated string, and append it so that every gene in the slice is kept
push(@node_genearray, get_gene_attributes($node_slicestart, $node_sliceend, $which_node, $gene));
$node_breaks_gene = 1;
}
undef $node_genes;
return ($node_breaks_gene, $has_sv_been_printed, \@node_genearray);
}
sub get_read_through
{
my $genearray_ref = shift;
my $brokennode_strand = shift;
my $testnode_strand = shift;
my $testnode_start = shift;
my $testnode_end = shift;
my $testnode_chr = shift;
# go and find nearest unbroken gene in correct orientation
foreach(@{$genearray_ref})
{
# split up the attributes of the gene from the broken node
my ($brokengene_dbID,
$brokengene_display,
$brokengene_externalname,
$brokengene_start,
$brokengene_end,
$brokengene_strand,
$brokengene_stableid) = split('\t', $_);
#print "broken gene is $brokengene_externalname\n";
if ($brokennode_strand == $brokengene_strand) {
#fetch 1000bp near test node
my ($testslice_start, $testslice_end);
if ($testnode_strand == 1)
{
$testslice_start = $testnode_start -
1000;
$testslice_end = $testnode_start;
}
else
{
$testslice_start = $testnode_end;
$testslice_end = $testnode_end + 1000;
}
my $iteration_counter = 1;
my $found_a_gene = 0;
# only iterate 1000 times max - ie, will find a readthrough within 1Mb of the break
until ($found_a_gene == 1 || $iteration_counter
== 1000)
{
# search the current 1 kb test window for genes
my @nearbygenes = get_nearby_genes($testnode_chr, $testslice_start, $testslice_end);
foreach(@nearbygenes)
{
my ($nearbygene_dbID,
$nearbygene_display,
$nearbygene_externalname,
$nearbygene_start,
$nearbygene_end,
$nearbygene_strand,
$nearbygene_stableid) = split ("\t", $_);
if ($nearbygene_strand
!= $testnode_strand)
{
print "=HYPERLINK\(\"http\:\/\/$ensembl_site\/Homo_sapiens\/Gene\/Summary?g\=$brokengene_stableid\", \"$brokengene_externalname\"\)\t=HYPERLINK\(\"http\:\/\/$ensembl_site\/Homo_sapiens\/Gene\/Summary?g\=$nearbygene_stableid\", \"$nearbygene_externalname\"\)\tREADTHROUGH\t$brokengene_externalname is broken and may read through into $nearbygene_externalname ",$iteration_counter,"kb away";
$found_a_gene = 1;
}
}
if ($testnode_strand == 1)
{
$testslice_end =
$testslice_start;
$testslice_start -= 1000;
}
else
{
$testslice_start =
$testslice_end;
$testslice_end += 1000;
}
undef @nearbygenes;
$iteration_counter++;
}
}
else {
print "=HYPERLINK\(\"http\:\/\/$ensembl_site\/Homo_sapiens\/Gene\/Summary?g\=$brokengene_stableid\", \"$brokengene_externalname\"\)\t\tNO READTHROUGH WITHIN 1Mb";
}
}
}
sub get_nearby_genes
{
undef my @genearray;
my $chr = shift;
my $start = shift;
my $end = shift;
my $chromslice = $slice_adaptor->fetch_by_region('chromosome', $chr,
$start, $end);
my $genes = $chromslice->get_all_Genes();
while ( my $gene = shift @{$genes} )
{
my $dbID = $gene->dbID();
my $display = $gene->display_id();
my $externalname= $gene->external_name();
my $genestart = $gene->start(); # positions are relative to the slice, not the absolute chromosomal location
my $geneend = $gene->end();
my $genestrand = $gene->strand();
my $stable_id = $gene->stable_id();
my $geneconcat = join("\t", $dbID, $display, $externalname,
$genestart, $geneend, $genestrand, $stable_id);
push (@genearray, $geneconcat);
}
return @genearray;
}
sub get_gene_attributes
{
my $start = shift;
my $end = shift;
my $node = shift;
my $gene = shift;
my $dbID = $gene->dbID();
my $display = $gene->display_id();
my $externalname= $gene->external_name();
my $genestart = $gene->start(); # positions are relative to the slice, not the absolute chromosomal location
my $geneend = $gene->end();
my $genestrand = $gene->strand();
my $stable_id = $gene->stable_id();
my $geneconcat = join("\t", $dbID, $display, $externalname,
$genestart, $geneend, $genestrand, $stable_id);
push (my @genearray, $geneconcat);
Appendix 4 Bioinformatic pipeline scripts
246
return @genearray;
}
Appendix 4 Bioinformatic pipeline scripts
247
B – R script to window sequencing reads
# Makes copy number bins for Illumina copy number analysis
# Input is the for_cnv.txt file divided into chromosomes, and requires the
# appropriate mappable_starts directory
# Liz Batty, last modified July 2010
# how many reads to put in a bin - more reads, larger bins
readsperbinlist <- c(100, 250, 500)
# include lanes in sample name
samplelist <- c("81777.a","83493.a","82674.a","938.a","946.a","81828.a")
for (sample in samplelist) {
  for (i in c(1:22,"X", "Y")) {
    print(paste("processing chromosome ",i," from ",sample,sep=""))
    # read in mappable starts for the chr
    chrstarts <- read.table(file=paste("mappable_starts/chr",i,".mappable.starts", sep=""), header=FALSE)
    # read the file of good reads for CNV analysis (per chromosome)
    cnvreads <- read.table(file=paste(sample,"/",sample,".forcnv.chr",i,".txt", sep=""), header=FALSE)
    for (readsperbin in readsperbinlist) {
      # subset to find the bin edges, determined by readsperbin
      # ie, if this were an idealised genome, this gives bin sizes
      # which would give readsperbin reads in each
      chrstarts.short <- chrstarts[seq(1, nrow(chrstarts), readsperbin),2]
      # adds a final bin to catch all the end of chr reads
      bin.edges <- c(0,chrstarts.short,chrstarts.short[length(chrstarts.short)]+1000000)
      # cuts the file up according to the bin edges, labelling each bin with its midpoint
      res <- cut(
        as.integer(cnvreads[cnvreads[,1]==i,2]),
        breaks=bin.edges-0.1,
        labels=round(bin.edges[1:(length(bin.edges)-1)]+(bin.edges[2:length(bin.edges)] - bin.edges[1:(length(bin.edges)-1)])/2)
      )
      # write the results to output as a table
      result <- table(res)
      write.table(
        result,
        sep="\t",
        file=paste(sample,"/",sample,".chr",i,".raw_bincounts.",readsperbin,"readsperbin",sep=""),
        row.names=FALSE,
        col.names=FALSE,
        quote=FALSE
      )
    }
  }
}
C – Perl script to retrieve GC percentages
# GC percentage fetcher
# Takes the raw_bincounts from the copy number part of the pipeline, and gets
# the GC percentage over those bins
# Liz Batty, last modified July 2010
use warnings;
use strict;
use Bio::EnsEMBL::Registry;
# set default values
my $linecounter = 0;
my $input = 0;
# make a connection to the Ensembl database
my $registry = 'Bio::EnsEMBL::Registry';
$registry->load_registry_from_db(
    -host => 'ensembldb.ensembl.org',
    -user => 'anonymous'
);
# input is copy number bins from copy_number_bins.R
my @files = <*raw_bincounts*>;
foreach my $file (@files)
{
    $input = $file;
    # file names have the form sample.lanes.chr.raw_bincounts.readsperbin
    my ($sample, $lanes, $chr, $bincount, $reads) = split(/\./, $file);
    my $output = "$chr.$reads.gcbins.txt";
    open (INPUT, '<', $input) or die "failed to open input file: $!";
    open (OUTPUT, '>', $output) or die "failed to open output file: $!";
    print "Input file is $input\n";
    print "Output file will be saved as $output\n";
    # tells it we want to work with a slice of the human genome
    my $slice_adaptor = $registry->get_adaptor( 'Human', 'Core', 'Slice' );
    my $bin_start = 1;
    # read through the list of bins
    while (<INPUT>)
    {
        print "Processing line $linecounter of $chr\r";
        # chomp the newline
        chomp $_;
        # split up the columns in the bin count file
        my ( $bin_end, $bin_reads ) = split(/\s+/, $_);
        # strip the leading "chr" to get the Ensembl chromosome name
        my $short_chr = substr $chr, 3;
        #print "bin end is $bin_end, reads is $bin_reads, short chr is $short_chr\n";
        my $chromslice = $slice_adaptor->fetch_by_region('chromosome', $short_chr, $bin_start, $bin_end);
        #print Dumper ($chromslice);
        my $gc_count = $chromslice->get_base_count->{'%gc'};
        print OUTPUT "$bin_start\t$bin_end\t$gc_count\n";
        $bin_start = $bin_end;
        $linecounter++;
    }
    close INPUT;
    close OUTPUT;
}
print "Processing finished!\n";
D – R script to perform GC correction and segmentation of sequencing reads, and produce
graphs and tables
# GC correction, segmentation and output of graphs from Solexa binned reads
# Liz Batty, last modified July 2010
# requires DNAcopy library
library(DNAcopy)
mb <- seq(0,2.5e8,5e6)
shortmb <- seq(0,250,10)
samplelist <- c("83493.a", "82674.a","938.a","81828.a","946.a")
readsperbinlist <- c(100,250,500)
for (sample in samplelist) {
  for (i in c(1:22,"X")) {
    print(paste("processing chromosome ",i," from ",sample,sep=""))
    for (readsperbin in readsperbinlist) {
      # read in binned reads and reformat for DNAcopy
      chrbincount <- read.table(file=paste(sample,"/",sample,".chr",i,".raw_bincounts.",readsperbin,"readsperbin",sep=""), header=FALSE)
      chrbincount[,3] <- array(i,dim=c(length(chrbincount[,1]),1))
      chrCNA <- CNA(chrbincount[,2], chrbincount[,3], chrbincount[,1], data.type="logratio", sampleid=sample)
      # smooth (remove outliers) and segment the data
      smooth.chrCNA <- smooth.CNA(chrCNA)
      segment.chrCNA <- segment(smooth.chrCNA, verbose=1, undo.splits="sdundo", undo.SD=2)
      # read in GC percentages (retrieved from Ensembl - see gcpercent.pl) and add to smoothed data
      chrgc <- read.table(file=paste("gcpercent/chr",i,".",readsperbin,"readsperbin.gcbins.txt",sep=""), header=FALSE)
      smooth.chrCNA[,4] <- chrgc[,3]
      colnames(smooth.chrCNA) <- c("chr", "position", "reads", "gc")
      smooth.chrCNA[,3][smooth.chrCNA[,3]==0] = NA
      print("Performing loess correction")
      # perform loess over chromosome data and predict correction factor, then correct data and plot
      # (the control settings must be passed by name, otherwise they are taken as the data argument)
      smooth.chrloess <- loess(smooth.chrCNA$reads ~ smooth.chrCNA$gc, span=0.3, control=loess.control(iterations=3))
      smooth.chrCNA$loesspred <- predict(smooth.chrloess, smooth.chrCNA$gc)
      smooth.chrCNA$dist_from_median <- (smooth.chrCNA$loesspred/median(smooth.chrCNA$reads, na.rm=TRUE))
      # set the empty bins back to zero (a comparison with == NA never matches, so is.na is used)
      smooth.chrCNA$reads[is.na(smooth.chrCNA$reads)] = 0
      smooth.chrCNA$corrected <- (smooth.chrCNA$reads*(1/smooth.chrCNA$dist_from_median))
      smooth.chrCNA$mbposition <- smooth.chrCNA$position/1000000
      png(file=paste(sample,"/",sample,".chr",i,".",readsperbin,"readsperbin.loesscorrected.png",sep=""), width=1000, height=500)
      plot(smooth.chrCNA$mbposition, smooth.chrCNA$corrected,
        pch=20,
        xlab="Position (Mb)",
        ylab="Normalized reads",
        sub=paste("Corrected copy number plot for chr",i," with ",readsperbin," reads per bin",sep=""),
        xaxt="n"
        #ylim=c(0,500)
      )
      axis(1,shortmb)
      dev.off()
      # redo the DNAcopy segmentation with corrected data
      correctedCNA <- CNA(smooth.chrCNA$corrected, smooth.chrCNA$chr, smooth.chrCNA$position, data.type="logratio", sampleid=sample)
      segment.correctedCNA <- segment(correctedCNA, verbose=1, undo.splits="sdundo", undo.SD=1, alpha=0.005)
      # print the segments to file
      write.table(as.matrix(segment.correctedCNA$output), sep="\t", file=paste(sample,"/",sample,".chr",i,".",readsperbin,"readsperbin.correctedsegments.seg",sep=""), row.names=FALSE, col.names=TRUE, quote=FALSE)
      segmentmatrix <- as.matrix(segment.correctedCNA)
      newsegments <- segmentmatrix[2,1]
      # print segments formatted for circos histogram
      hs <- paste("hs",array(i,dim=c(length(segment.correctedCNA$output$loc.start))), sep="")
      circos <- cbind(hs,segment.correctedCNA$output$loc.start,segment.correctedCNA$output$loc.end, segment.correctedCNA$output$seg.mean)
      write.table(
        circos,
        file=paste(sample,"/",sample,".chr",i,".",readsperbin,"readsperbin.segments.circos",sep=""),
        quote=FALSE,
        sep="\t",
        na="0",
        row.names=FALSE,
        col.names=FALSE
      )
      # plot the segments produced from the corrected copy number manually - easier to change plot than for standard DNAcopy plots
      png(file=paste(sample,"/",sample,".chr",i,".",readsperbin,"readsperbin.correctedsegments.png",sep=""), width=1000, height=500)
      plot(smooth.chrCNA$mbposition, smooth.chrCNA$corrected,
        pch=20,
        xlab="Position (Mb)",
        ylab="Normalized reads",
        main=paste("Corrected segmentation plot for chr",i," with ",readsperbin," reads per bin",sep=""),
        xaxt="n"
        #ylim=c(0,500)
      )
      segment.correctedCNA$output$loc.start.mb <- segment.correctedCNA$output$loc.start/1000000
      segment.correctedCNA$output$loc.end.mb <- segment.correctedCNA$output$loc.end/1000000
      segments(segment.correctedCNA$output$loc.start.mb, segment.correctedCNA$output$seg.mean, segment.correctedCNA$output$loc.end.mb, segment.correctedCNA$output$seg.mean, lwd=2, col="red")
      axis(1,shortmb)
      dev.off()
      # print the corrected bin values to a GFF file
      gff <- paste("chr",array(i,dim=c(length(smooth.chrCNA[,1]),1)),sep="")
      solexa <- array("solexa",dim=c(length(gff),1))
      # one copy of the sample name per row (named samplecol so that samplelist is not overwritten)
      samplecol <- array(sample,dim=c(length(gff),1))
      dots <- array(".",dim=c(length(gff),1))
      colorcol <- array(";color 000000",dim=c(length(gff),1))
      endpos <- smooth.chrCNA$position
      startpos <- endpos[-(length(endpos))]
      startpos <- append(startpos,1,0)
      gffwhole <- cbind(gff, solexa, samplecol, startpos, endpos, smooth.chrCNA$corrected, dots, dots, colorcol)
      write.table(
        gffwhole,
        file=paste(sample,"/",sample,".chr",i,".",readsperbin,"readsperbin.corrected.gff",sep=""),
        quote=FALSE,
        sep="\t",
        na="0",
        row.names=FALSE,
        col.names=FALSE
      )
    }
  }
}
Appendix 5 – Pipeline documentation
Documentation produced for the bioinformatic pipeline described in Chapter 6.
Processing solexa sequencing data
General useful unix information
cd will change directory, like under Windows. To move up a directory, use cd ..
ls lists all the files in a directory.
grep searches for a particular string in a file. This is useful for finding the
original reads from a large file. grep -A 5 will pull out a line and the 5 lines
following it.
To read the help pages for a command, use man command.
If two commands are separated by | , the output of the first command is 'piped' into the second
command. This is used to string commands together, especially when the intermediate files would be
very large.
If a command is followed by > filename , the output will end up in that file.
gunzip unzips compressed files (extension .gz). By default this removes the compressed file and
replaces it with the uncompressed one; to keep it, use gunzip -c input.txt.gz >
output.txt
cat concatenates files. If it is used with only one file, it will just output that file, so it is used as a quick
way to pipe a file into some other command.
Many of the bioinformatics programs are only available under Unix. To access them, you can run a
virtual machine inside OSX, which is slower than running them outside the virtual machine but does
work, although the screen can be slow to respond. To use the virtual machine, run the program
VirtualBox and start up the ubuntu install - the username is liz and the password is Liz. You can also
install programs on OSX by compiling them yourself using the GCC compiler found in the Apple XCode
developer’s tools, or getting Darwin/FinkCommander to install them from a Debian/Ubuntu package if
one exists.
File formats
.sh files are bash files - essentially a list of unix commands which will be run in order. Variables which are
passed to the .sh script are stored as $1, $2, $3, etc. The line datadir=$1 reads the first variable into
datadir, which will be used in the script whenever ${datadir} is used.
.pl files are Perl scripts. If there is no input file specified, they use the file given as a command line
argument.
awk/gawk is a language used to quickly manipulate text files.
.R is an R script.
.sam files are Sequence Alignment/Map files. This is the new standard format for aligned sequences, and
is described in detail here:
http://samtools.sourceforge.net/SAM1.pdf
.bam files (and associated .bai files) are compressed .sam files. They are not human-readable but they
are much smaller than .sam files. To convert SAM to BAM (and vice-versa) requires the SAMtools
utilities http://samtools.sourceforge.net/
To reach GroupDocs from a Mac, use cd /Volumes/Edwards/GroupDocs. To reach GroupDocs under
Linux, it needs to be mounted with the command :
sudo mount -t cifs //datacentre/Edwards -o username=emb51,domain=h-mrc,password=password Documents
This will put GroupDocs in the folder Documents.
Alignment
This is the process of finding the best match in the reference genome for each read. This is usually done
for us by the CRI.
The raw sequences come as FASTQ format files, with extension fq. Usually there are two files per lane,
one for each read in the pair, and straight off the machine they are named for lane and read in the pair -
s1_1.fq, s1_2.fq, s2_1.fq, etc
The fq format looks like
@HWUSI-EAS100R:6:73:941:1973#0/1
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
The first line is the unique id - machine name and run, lane on the flow cell, tile, xy coordinates on the
tile, and /1 or /2 to indicate read in a pair.
The second line is the sequence.
The third line is just a spacer.
The last line is the quality score for each base in the sequence. The quality scores are in Sanger format.
Each character is the quality score + 33 encoded as the ASCII character for that number.
See http://en.wikipedia.org/wiki/FASTQ_format for more details.
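As a rough illustration of the format described above, the following Python sketch splits the example read id into its parts and decodes a Sanger quality string. The function names are made up for this example, and the #0 part of the id (which is not described above) is simply ignored.

```python
def parse_read_id(header):
    """Split a read id such as @HWUSI-EAS100R:6:73:941:1973#0/1 into its parts."""
    name, _, read_in_pair = header.lstrip("@").partition("/")
    # The "#0" part is not described in the text above, so it is dropped here.
    name, _, _ = name.partition("#")
    machine, lane, tile, x, y = name.split(":")
    return {
        "machine": machine,
        "lane": int(lane),
        "tile": int(tile),
        "x": int(x),
        "y": int(y),
        "read_in_pair": int(read_in_pair),
    }


def decode_sanger_qualities(quality_string):
    """Sanger format: each quality score is the character's ASCII code minus 33."""
    return [ord(char) - 33 for char in quality_string]


read_id = parse_read_id("@HWUSI-EAS100R:6:73:941:1973#0/1")
qualities = decode_sanger_qualities("!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65")
print(read_id["lane"], read_id["tile"], read_id["read_in_pair"])
print(qualities[:4])
```

The first four quality characters of the example read, !''*, decode to 0, 6, 6 and 9.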
Human reference genomes for build 18 and 19 are in the human reference genome directories. They are
formatted as FASTA files, one entry per chromosome.
Alignment using MAQ
All the cell line data on GroupDocs was aligned using MAQ. MAQ is slow, does not do proper gapped
alignment, and only supports reads up to 63bp, but it may be the only way to align mate pair reads. (It
may be possible to align them with BWA/Bowtie if you tell it to expect the opposite orientation, but I
haven’t tested this, and it may not work for both populations in the mate pair library.) It aligns
2 million 37bp reads to the reference genome in about 2 hours.
See MAQ manual for details of alignment ( http://maq.sourceforge.net/maq-manpage.shtml ) but the
basic alignment process is as follows:
1. Convert FASTQ files to MAQ's binary FASTQ format (.bfa)
2. Map reads in .bfa format to the reference sequence. This produces a .map file, which is not human-
readable.
3. Output aligned reads as a mapview file.
4. This outputs the reads as a mapviewpair file, which is very similar to a mapview file but puts the two
reads in a pair on one line, and adds an extra column for later de-duplication. This column is just the
chromosome and position for the two reads separated by colons (eg, 1:1256980:1:1345980), and is used
to remove PCR duplicates. NOTE: Some of the very early data does not have the final column on it. If
the file does not have this column, use the add28.pl script.
Alignment using BWA
The current CRI pipeline aligns using BWA. This is much faster, handles 108bp reads, and produces gapped
alignments. It can align millions of reads in a few minutes. This is the alignment method used for the new
tumour data.
BWA can be found at http://bio-bwa.sourceforge.net/bwa.shtml . General alignment procedure:
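1. Index the reference genome: bwa index -p prefix -a is (the BWA manual lists bwtsw as the algorithm to
use for genomes as large as the whole human genome)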
2. Align with command bwa aln . Each end is aligned separately. See manual for options - I have not
played around with anything but the defaults.
3. Generate SAM format output file bwa sampe .
NOTE: BWA is quite fiddly - in particular it dislikes some reference genomes and will give an error shortly
after starting the alignment. Bowtie is a similar alignment program which has a much more user-friendly
manual and may be more useful.
Alignment using novoalign
http://www.novocraft.com/main/index.php
Novoalign is much slower than either MAQ or BWA (a million reads takes ~16 hours) but is more
accurate. In the current CRI pipeline it is used to realign reads called as aberrant in case they are
misalignments from BWA.
Calling structural variants
This can be broken down into the different steps:
1. Generate stats on the file.
2. Remove PCR duplicates.
3. Call the abnormal reads.
4. (optional) Re-align the abnormal reads with a slower/better alignment program.
5. (optional) Remove the bad regions.
6. Cluster the reads and call structural variants.
Calling structural variants under the old pipeline
This pipeline was used for all of the cell line data. All the commands are listed in the file
maq_paired_end_postprocessing.sh
1. Stats are generated using the get_stats_from_mapviewpair.pl script, which takes the
.mapviewpair file as an input. The stats are used to find the upper limit of the library size.
2. The last column of the mapviewpair file is used to remove duplicates as it contains the chromosome
and start position for the two reads in a pair, which should be unique for all pairs. The unix command
uniq is used for this: uniq -f 28 returns all the lines in a file which have a unique column 28. This will
only remove duplicates which are adjacent to each other (so the file needs to be sorted first) and it will
always return the first of the duplicates.
3. The abnormal reads are called using awk. The intention is to find all the pairs where the two reads are
farther apart than the upper limit, or have the wrong strands for the reads (ie, FR is normal, FF, RR, and
RF are abnormal).
The gawk command is:
gawk -v UPPER=${upper} '(and($6, 2) == 0 || $5 < -UPPER || $5 > UPPER) && $9 > 0'
This uses a bitwise and to query the mapviewpair flag – see below for an explanation of the bitwise flag.
4. In the old pipeline the reads are not re-aligned.
5. The bad regions are removed using the script filter_mapview_by_region.pl . This needs a
file of bad regions in the form . The existing file is based on bad regions
found by Susie and Jess. The current file is called regions_to_mask.curated.txt, and there is
a version updated for the new genome build called regions_to_mask.grch37.txt.
6. The structural variants are called using sv_from_mapview.pl. The input should be the mapviewpair file
with the abnormal read pairs in.
The script needs two options, -insertmax which should be the upper library insert size,
and -mappingqualmin which is the minimum quality of read to use to call structural
variants. -bed is optional and specifies a bed file for output as well as a plain text file.
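Steps 2 and 3 above can be sketched in Python as follows. This is only an illustration, not the pipeline code: the rows are simplified to five made-up fields (name, insert size, pair flag, a quality value standing in for column 9, and the de-duplication key), and the function names are hypothetical. The test for abnormal pairs mirrors the gawk expression, checking only the 2 bit of the pair flag.

```python
def remove_adjacent_duplicates(rows, key_index=-1):
    """Keep the first row of each run of adjacent rows that share the same key.

    Like uniq, this only collapses neighbouring rows, so the rows need to be
    sorted by the key column first.
    """
    kept = []
    previous_key = None
    for row in rows:
        key = row[key_index]
        if key != previous_key:
            kept.append(row)
        previous_key = key
    return kept


def is_abnormal_pair(insert_size, pair_flag, quality, upper):
    """Mirror of the gawk test: (2 bit unset, or too far apart) and quality > 0."""
    wrong_orientation = (pair_flag & 2) == 0
    too_far_apart = insert_size < -upper or insert_size > upper
    return (wrong_orientation or too_far_apart) and quality > 0


# Toy rows: (name, insert size, pair flag, quality, de-duplication key).
# Flag 18 has the 2 bit set; flag 1 does not.
rows = [
    ("pairA", 350, 18, 60, "1:1000:1:1350"),
    ("pairB", 350, 18, 60, "1:1000:1:1350"),    # same key as pairA, so dropped
    ("pairC", 90000, 18, 60, "1:5000:1:95000"),  # further apart than the upper limit
    ("pairD", 300, 1, 60, "2:700:2:1000"),       # 2 bit not set
]
unique_rows = remove_adjacent_duplicates(rows)
abnormal = [row[0] for row in unique_rows
            if is_abnormal_pair(row[1], row[2], row[3], upper=600)]
print(abnormal)
```

With an upper limit of 600 this keeps pairC and pairD as abnormal, and pairB is removed as a duplicate before the test is applied.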
Calling structural variants using the new pipeline
This is more complicated because Kevin’s updated pipeline is more closely tied into the CRI cluster, but it
is likely that the CRI will be able to produce most of the necessary files for us. The new tumour data has
all been run through part of this pipeline, and we get the data already processed through step 4. We
have Kevin’s handover documentation which explains a lot of this pipeline: secondary_pipeline.txt and
structural_variation_pipeline.txt.
1. The stats are generated from the .bam file by a CRI script. The script is alignment_stats.pl, but it won’t
run without the cluster.
2. Duplicates are computed using samtools or Picard. Samtools is mentioned above. Picard is a similar
sort of thing but uses a java command-line interface to manipulate .sam/.bam files. You can find it here:
http://picard.sourceforge.net/
This produces a file called .dupreport.txt which has statistics on the duplication rate
in the library. The different statistics are explained here:
http://picard.sourceforge.net/picard-metric-definitions.shtml#DuplicationMetrics
Of particular interest is the Estimated Library Size, which estimates the number of unique molecules in
the library and can give an idea of the depth of the library.
3. I do not have the exact command which produces aberrant reads under the new pipeline. However, it
should be fairly simple to do this with gawk in the same way as the old pipeline. The two requirements
are to pull out pairs where the reads are in the wrong orientation, and pairs where the two reads map
further apart than the maximum library size.
4. The aberrant reads are re-aligned using Novoalign, which is more sensitive. We have been re-aligning
against the whole genome, but also against the different haplotypes, which helps remove false positives.
The file of interest is
${Library}.bwa_aberrant_pairs.novoalign.processed.aberrants.namesorted.sam
5. The bad regions are not removed in the new pipeline at the CRI. I have been masking the list of bad
regions, the centromeres, and 100kb from the telomeres, which removes a lot of SVs in those regions.
Use the command filter_sv_by_region.pl -regions regions_to_mask.grch37.txt
library.aberrantpairs.sam > library.aberrantpairs.filtered.sam
6. The structural variants are called using sv_from_sam.pl . This can find the structural variants in
multiple libraries in one run. To tell the script what samples to expect, we construct a samplesheet,
containing the library prefix, maximum insert size, and library name.
Eg, for the new tumour libraries:
CRIRUN_369:4 504 81823
CRIRUN_306:1 621 83539
The library prefix is the run number and lane, which can be found in the .sam files or the stats file, and
tells the sv caller what sequences belong to which library.
A simple command to call SVs using sv_from_sam.pl:
cat 81823.bwa_aberrant_pairs.novoalign.processed.aberrants.namesorted.sam \
83539.bwa_aberrant_pairs.novoalign.processed.aberrants.namesorted.sam \
| sv_from_sam.pl -samplesheet samplesheet.txt > 81823-83539.sv_calls.txt
This will concatenate the aberrant pairs from 81823 (a tumour) and 83539 (the matched normal)
libraries. This file is then sent to the sv caller, which knows which sequences belong to which sample
using the sample sheet. The output will have a final column showing how many reads support the SV in
each library. Note that if you want 2 reads supporting a variant, the script does not care which library
they come from, so it could be 1 from the tumour and 1 from the normal.
Further options:
-mappingqualmin is the minimum map quality to use, default is 35.
-edgepairs is how many reads are needed to call an SV, default is 2.
-clip will clip all alignments to the specified length, to deal with multiple read lengths in
the same file.
-outputreads will output all the reads which contribute to the SVs to a .sam file.
To filter the SVs, use the script sv_filter.pl . This returns only those SVs with more than N hits in the
tumour and no hits in the normal.
perl sv_filter.pl -tumour -normal -hits
tumour-normal.svcalls.txt > filtered.svcalls.txt
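The filtering rule can be sketched in Python as below. This is not the sv_filter.pl code: the record layout (a dictionary of read support per library) and the function name are assumptions made for the example, and the real SV file keeps the support counts in its final column.

```python
def filter_somatic_svs(sv_calls, tumour, normal, min_hits):
    """Keep SVs with more than min_hits reads in the tumour and none in the normal."""
    kept = []
    for sv in sv_calls:
        support = sv["support"]
        if support.get(tumour, 0) > min_hits and support.get(normal, 0) == 0:
            kept.append(sv)
    return kept


# Hypothetical calls for the tumour library 81823 and its matched normal 83539.
sv_calls = [
    {"id": "sv1", "support": {"81823": 5, "83539": 0}},
    {"id": "sv2", "support": {"81823": 4, "83539": 1}},  # seen in the normal
    {"id": "sv3", "support": {"81823": 2, "83539": 0}},  # not more than 2 hits
]
somatic = filter_somatic_svs(sv_calls, tumour="81823", normal="83539", min_hits=2)
print([sv["id"] for sv in somatic])
```

Only sv1 passes here, because sv2 has a supporting read in the normal and sv3 does not have more than 2 hits in the tumour.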
Calling fusion genes from structural variants
To call fusion genes, a script retrieves the genes at each of the two nodes of a structural variant and
predicts whether any of them are in the right orientation to form a fusion gene.
This relies on the Ensembl API to retrieve the genes. Information about the API can be found here:
http://www.ensembl.org/info/data/api.html . The script will only run if the Ensembl Perl modules are
installed, as well as DBD::mysql, Getopt::Long and BioPerl.
Different versions of the Ensembl Perl modules use different builds of the genome. To check which
genome build you are using, use the ensembldbcheck.pl script, which will output the current versions,
and also the coordinates for BCAS3 to check if it is hg18 or hg19/GRCh37.
The script has different options to cope with different formats of input SV file, as the columns have
changed over time.
--columns: use if the file has extra support columns, as most of the later files do
--cnv: use if the file has been checked against a list of CNVs, see below
--insertsize : max insert size of the library
--strands: use if the file has -1 and 1 as strands instead of + and -.
--matepair: use if the input is from a mate pair library – all it does is flip the strands of the reads
A typical command:
perl fusion_gene_prediction.pl --columns --insertsize 450
81823.abcd.sv_calls.txt > 81823.breaks.txt
The output is a text file, but has automatic hyperlinks for Excel.
There is a script which runs a simple check to see if the nodes of the SV overlap with a list of known CNV
regions (taken from Conrad et al.) and adds an extra column. This is cnvcheck.pl, which takes the SV file
as input and also requires the conradcnv.txt file with the CNVs in it. Use the --columns option if the SV
file has the extra support columns.
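The kind of overlap check described here can be sketched in Python as follows. This is an illustration rather than the cnvcheck.pl code: the record layout, the coordinates and the function names are all invented for the example.

```python
def overlaps_known_cnv(chrom, start, end, cnv_regions):
    """True if start..end on chrom overlaps any (chrom, start, end) CNV region."""
    return any(
        chrom == cnv_chrom and start <= cnv_end and end >= cnv_start
        for cnv_chrom, cnv_start, cnv_end in cnv_regions
    )


def annotate_sv(sv, cnv_regions):
    """Return a copy of the SV with an extra 'cnv' field, set if either node overlaps."""
    flagged = (overlaps_known_cnv(*sv["node1"], cnv_regions)
               or overlaps_known_cnv(*sv["node2"], cnv_regions))
    return {**sv, "cnv": flagged}


# Made-up CNV regions and SV nodes, each given as (chromosome, start, end).
cnv_regions = [("1", 1000, 2000), ("7", 500, 900)]
sv_in_cnv = {"node1": ("1", 1500, 1600), "node2": ("17", 300, 400)}
sv_clear = {"node1": ("1", 2500, 2600), "node2": ("7", 1000, 1100)}
print(annotate_sv(sv_in_cnv, cnv_regions)["cnv"])
print(annotate_sv(sv_clear, cnv_regions)["cnv"])
```

The first SV is flagged because its first node falls inside a listed region; the second is not, as neither node overlaps a region on the same chromosome.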
Copy number pipeline
The input for the copy number calling pipeline is the library.for_cnv.txt file produced by the CRI
alignment pipeline. This contains all the start positions of the reads. Currently there is no script to run
this code all in one go, as it is difficult to run R from the command line in Windows, so I do it in stages
and cut and paste the code into R.
1. Chop up the for_cnv file into the individual chromosomes. This is done using the cnvchopper.pl script.
2. Run the code in copy_number_bins.R . This needs the directory mappable_starts, which was
generated by Kevin and lists all the potential start positions of a read on the chromosome. This is used
to divide up the genome into bins where we would expect the same number of reads – this is not as
simple as dividing up the genome into equal size pieces, as fewer reads will map to repetitive regions.
Then the actual reads are placed into the bins calculated from the genome. The output files are all
named for the library, chromosome, and number of reads per bin used to calculate the genome bins.
To run this code, put the window sizes you want in the readsperbin list:
readsperbinlist <- c(100, 250, 500)
and the names of the libraries in the samplelist:
samplelist <- c("81777.a","83493.a","82674.a","938.a")
250 is a good number for the bin size.
3. Retrieve the GC percentages for each bin across the genome. This only needs doing once for each bin
size. The gcpercent directory contains the files for bin sizes 100 and 250.
Any further GC percent data can be retrieved using the gcpercent.pl script. This will retrieve the GC
percent data for any files with raw_bincounts in the name in the directory it is run from.
4. The code in binsize_graphs.R performs a loess correction on the data using the GC percentage,
segments it with DNAcopy, and outputs some plots, the copy number segments as .seg files and also
formatted for Circos plots, and a .gff file of the corrected data. It needs the DNAcopy R library. Again,
the readsperbinlist and the samplelist need to include the bin sizes used and the libraries to process.
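The binning and correction steps above can be sketched in Python as follows. The bin edges are taken from the mappable starts in the same way as in copy_number_bins.R, but the GC correction is a simplified stand-in: instead of a loess fit, bins are grouped into GC buckets and each count is scaled by the overall median divided by the median of its bucket. The function names and the bucket width are assumptions for the example.

```python
from bisect import bisect_right
from statistics import median


def bin_edges_from_mappable_starts(mappable_starts, reads_per_bin):
    """Use every reads_per_bin-th mappable start as a bin edge."""
    edges = sorted(mappable_starts)[::reads_per_bin]
    # Bracket with 0 and a final catch-all bin 1 Mb past the last edge.
    return [0] + edges + [edges[-1] + 1_000_000]


def count_reads_in_bins(read_starts, edges):
    """Count reads per bin, where bin k covers edges[k] <= start < edges[k + 1]."""
    counts = [0] * (len(edges) - 1)
    for start in read_starts:
        k = bisect_right(edges, start) - 1
        if 0 <= k < len(counts):
            counts[k] += 1
    return counts


def gc_correct(counts, gc_percent, bucket_width=5):
    """Scale each count by overall median / median of the bins with similar GC."""
    overall = median(counts)
    buckets = {}
    for count, gc in zip(counts, gc_percent):
        buckets.setdefault(int(gc // bucket_width), []).append(count)
    corrected = []
    for count, gc in zip(counts, gc_percent):
        expected = median(buckets[int(gc // bucket_width)])
        corrected.append(count * overall / expected if expected else 0.0)
    return corrected


edges = bin_edges_from_mappable_starts(range(10, 101, 10), reads_per_bin=5)
bin_counts = count_reads_in_bins([5, 12, 15, 59, 60, 99], edges)
corrected = gc_correct([100, 100, 50, 50], [40, 41, 60, 61])
print(edges)
print(bin_counts)
print(corrected)
```

In the toy data the high-GC bins have half the reads of the low-GC bins, and after correction all four bins come out at the overall median of 75.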
SAM flags and bitwise and functions
Both the mapview and SAM formats use a bitwise flag. This is a way of encoding multiple pieces of
information in a single number, by looking at the individual bits of the number as a binary number.
For instance, in the bitwise flag in a SAM file, the first three bits represent whether the read is paired,
whether the read is mapped in a proper pair, and whether the read is unmapped.
Bit 1: read is in a pair
Bit 2: read is in a proper pair
Bit 3: read is unmapped
If all these things are true for a read, all three flags would be set. This could be represented in binary as
111, converting this to decimal gives us 1+2+4 = 7.
If the read is in a pair, but the pair is not a proper pair and the read is mapped, only the first bit is
set: the binary representation is 001, and in decimal this is 1.
(For another explanation, see here: http://seqanswers.com/forums/showthread.php?t=2301 )
SAM encodes eleven different bits of information in a single field. These are described in the SAM
specification, and there is a tool to decode them here: http://picard.sourceforge.net/explain-flags.html
The bitwise and function queries the individual bits. For instance, the code and(, 4)
compares the two things in the brackets, in this case the bitwise flag and the 4 bit. If the 4 bit of the
flag is set to 1 (or true), it returns 4, which is non-zero and so counts as true. If the flag is false at
the 4 bit, it returns 0. In this way we can select only the reads where the 4 bit of the flag is 0, ie all
the mapped reads, and reject all the unmapped reads.
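The same test can be written in Python with the & operator, which is the bitwise and. This is a small sketch covering only the three bits listed above; the constant and function names are chosen for the example. In the SAM specification the 4 bit marks an unmapped read, so the mapped reads are those where the bitwise and with 4 gives 0.

```python
PAIRED = 0x1       # bit 1: read is in a pair
PROPER_PAIR = 0x2  # bit 2: read is in a proper pair
UNMAPPED = 0x4     # bit 3: read is unmapped


def describe_flag(flag):
    """Decode the first three bits of a SAM flag into named true/false values."""
    return {
        "paired": bool(flag & PAIRED),
        "proper_pair": bool(flag & PROPER_PAIR),
        "unmapped": bool(flag & UNMAPPED),
    }


def is_mapped(flag):
    """A read is mapped when the 4 bit of its flag is not set."""
    return (flag & UNMAPPED) == 0


flags = [7, 1, 5, 3]
print(describe_flag(5))
print([flag for flag in flags if is_mapped(flag)])
```

Of the four example flags, 7 and 5 have the 4 bit set and are rejected as unmapped, leaving 1 and 3.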