Exploiting network-based 
approaches for understanding gene 
regulation and function 
 
 
 
 
 
Sarath Chandra Janga 
 
 
 
 
A dissertation submitted to the University of Cambridge in  
candidature for the degree of Doctorate of Philosophy 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
April 2010 
 
Darwin College, University of Cambridge 
MRC Laboratory of Molecular Biology 
Cambridge, United Kingdom 
 
  
 
 
 
 
 
 
 
 
  
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
Previous page: A portrait of the transcriptional regulatory network of the budding yeast, Saccharomyces 
cerevisiae. Each circle represents the transcriptional interconnections from all other chromosomes to 
one of the chromosomes. Evidently, every chromosome is transcriptionally controlled by factors encoded 
on many of the 16 chromosomes in this organism, which are marked by the letters ‘a’ through ‘p’. 
Declaration of originality 
This dissertation describes work I carried out at the Medical Research Council 
Laboratory of Molecular Biology in Cambridge between January 2008 and April 2010. 
The contents are my original work, although much has been influenced by the 
collaborations in which I took part. I have not submitted the work in this dissertation 
for any other degree or qualification at any other university. 
 
 
 
 
 
Sarath Chandra Janga 
April, 2010 
Cambridge, United Kingdom 
Acknowledgements 
First of all, I would like to express my gratitude to Dr. Madan Babu, without whose 
continuous support throughout my doctoral work it would have remained just a dream 
for me to carry out my thesis work at the MRC Laboratory of Molecular Biology. Madan 
has not only been an excellent supervisor but a good friend who was always 
supportive of my research interests, by allowing me to work independently on a wide 
range of problems during my stay here. He has been a source of great inspiration on 
various occasions and a great scientific colleague to work with. In short, I probably 
could not have had a more understanding and motivating supervisor. 
 
I am also very grateful to Dr. Sarah Teichmann, whose equally supportive words 
from time to time have been a motivation to finish my doctoral work in a short time. I 
have learnt from her the art of venturing without fear into uncharted territories of 
molecular biology. 
 
I am also thankful for the kind support and warm welcome that I received from Dr. 
Cyrus Chothia from the first day that I came to LMB. 
 
I consider myself very fortunate to be in a wonderful lab with a lot of energetic and 
highly motivating people working on fundamental problems of molecular biology. 
Indeed, I must admit that I have learnt at least as much from my colleagues and 
seminars at LMB, as I have learnt from reading books and papers, not to mention the 
fun that I had during numerous lunch and dinner breaks with various members of the 
lab and TCB group in particular. I especially would like to thank A Wuster, B Lang, AJ 
Venkatakrishnan, D Hebenstreit, D Wilson, E Levy, G Chalancon, J Su, N Mittal, P 
Kota, R Janky, S De, T Perica, V Charoensawan and J Gsponer for making my stay 
at LMB a memorable experience. 
 
I am also greatly indebted to all my scientific friends, collaborators and mentors, both 
in the past and during my PhD, for having helped me learn about and venture into diverse 
areas of molecular biology. In no particular order, I would like to sincerely thank 
Agustino Martinez-Antonio (Irapuato, Mexico) for his confidence in my abilities, 
Ernesto Perez-Rueda (Cuernavaca, Mexico) for his kind hospitality during my visits 
to Mexico, Gabriel Moreno-Hagelsieb (Waterloo, Canada) for being a great mentor 
and an excellent scientific friend, Heladia Salgado (Cuernavaca, Mexico) for her 
energy and patience with my requests for data, Andrew Emili (Toronto, Canada) for 
giving me the opportunity to work on an unsolved mystery, Denis Thieffry (Marseille, 
France) for teaching me to focus on important ideas, and many other colleagues 
for scientific discussions over the years which made me a mature and independent 
scientist. I would also like to take this opportunity to offer my gratitude to all 
colleagues, administrative staff and heads of division, Venki Ramakrishnan and 
Kiyoshi Nagai, at LMB, whose continuous support has made it possible for me to 
develop a career in science.  
 
I am also grateful for the financial support that I received from the Cambridge 
Commonwealth Trust (CCT) and the Medical Research Council during my PhD. 
 
Last but not least, I am most indebted to my family (my parents and sister) and to my 
near and dear ones, who have continuously supported my adventures in science and 
understood my reasons for remaining silent for months. My very presence on this 
planet would not have been possible without my mother, who passed away long 
before I knew what maths and science are all about. I dedicate this thesis 
in her name. 
 
Abbreviations 
3C  Chromosome Conformation Capture 
ArcA  Aerobic respiration control protein A 
BDBH  Bi-Directional Best Hits 
BLAST Basic Local Alignment Search Tool 
cAMP  cyclic Adenosine MonoPhosphate 
ChIP  Chromatin immunoprecipitation 
CLIP  Cross Linking and Immuno-Precipitation 
COGs  Clusters of Orthologous Groups 
CRP  cAMP Receptor Protein 
CT  Chromosomal Territory 
DBTBS DataBase of Transcriptional regulation in Bacillus Subtilis  
DNA  DeoxyriboNucleic Acid 
EC  Enzyme Commission 
FDR  False Discovery Rate 
FIS  Factor for Inversion Stimulation 
FISH  Fluorescent In Situ Hybridization 
FFL  Feed Forward Loop 
FNR  regulator of Fumarate and Nitrate Reduction 
GBA  Guilt By Association 
GC  Genomic Context 
GO  Gene Ontology 
GR  Global Regulator 
GRN  Gene Regulatory Network 
HMM  Hidden Markov model 
hnRNP heterogeneous nuclear RiboNucleoProtein 
HNS  Histone-like Nucleoid Structuring protein 
HU  Heat Unstable protein 
IHF  Integration Host Factor 
LAD  Lamina Associated Domain 
LCMS  Liquid Chromatography-Mass Spectrometry 
LCR  Locus Control Region 
MALDI  Matrix-Assisted Laser Desorption/Ionization 
MCL  Markov CLuster algorithm 
mRNA  Messenger RNA 
NAP  Nucleoid Associated Protein 
PAB  PolyAdenylate-Binding protein 
PI/PPI  Protein Interactions 
PTM  Post-Translational Modification 
PTN  Post-Transcriptional Network 
PTS  PhosphoTransferase System 
RBD  RNA Binding Domain 
RBP  RNA Binding Protein 
RIP  RNP ImmunoPrecipitation 
RNA  RiboNucleic Acid 
RNP  RiboNucleo Protein complex 
RRM  RNA Recognition Motif 
TAP  Tandem Affinity Purification 
TF  Transcription Factor 
TG  Target Gene 
TPI  Target Proximity Index 
TRN  Transcriptional Regulatory Network 
Summary 
It is becoming increasingly clear in the post-genomic era that proteins in a cell do not 
work in isolation but rather in the context of other proteins and cellular entities 
during their lifetime. This has led to the notion that cellular components can be 
visualized as wiring diagrams composed of different molecules like proteins, DNA, 
RNA and metabolites. These systems approaches for quantitatively and qualitatively 
studying dynamic biological systems have provided us with unprecedented insights, at 
varying levels of detail, into cellular organization and the interplay between 
different processes. The work in this thesis uses these systems or 
network-based approaches to understand the design principles governing different 
cellular processes and to elucidate the functional and evolutionary consequences of 
the observed principles. 
 
Chapter 1 is an introduction to the concepts of networks and graph theory. It 
summarizes the properties that are frequently studied in biological 
networks and gives an overview of the different kinds of cellular networks that are 
amenable to graph-theoretical analysis, with particular emphasis on transcriptional, 
post-transcriptional and functional networks. 
 
In Chapter 2, I address the questions of how and why genes are organized in a 
particular fashion on bacterial genomes and what constraints bacterial 
transcriptional regulatory networks impose on their genomic organization. I then 
extend this one step further to unravel the constraints imposed on the network of 
TF-TF interactions and relate them to the numerous phenotypes they can impart to 
growing bacterial populations.  
 
Chapter 3 presents an overview of our current understanding of eukaryotic gene 
regulation at different levels and then shows evidence for the existence of a 
higher-order organization of genes across and within chromosomes that is constrained 
by transcriptional regulation. The results emphasize that a specific organization of 
genes across and within chromosomes, one that allows efficient control of transcription 
within the nuclear space, has been selected during evolution. 
 
Chapter 4 first summarizes different computational approaches for inferring the 
function of uncharacterized genes and then discusses network-based approaches 
currently employed for predicting function. I then present an overview of a recent 
high-throughput study performed to provide a ‘systems-wide’ functional blueprint of 
the bacterial model, Escherichia coli K-12, with insights into the biological and 
evolutionary significance of previously uncharacterized proteins. 
 
In Chapter 5, I focus on post-transcriptional regulatory networks formed by RBPs. I 
discuss the sequence attributes and functional processes associated with RBPs, 
methods used for the construction of the networks formed by them and finally 
examine the structure and dynamics of these networks based on recent publicly 
available data. The results obtained here show that RBPs exhibit distinct gene 
expression dynamics compared to other classes of proteins in a eukaryotic cell. 
 
Chapter 6 provides a summary of the important aspects of the findings presented in 
this thesis and their practical implications. 
 
Overall, this dissertation presents a framework which can be exploited for the 
investigation of interactions between different cellular entities to understand biological 
processes at different levels of resolution. 
TABLE OF CONTENTS 
 
Chapter 1: Introduction 
 
PREAMBLE
OUTLINE OF THE INTRODUCTION
1.1 BASICS OF GRAPH THEORY AND NETWORKS
1.1.1 Local level
1.1.2 Modular level
1.1.3 Global level
1.2 NETWORKS IN MOLECULAR BIOLOGY
1.2.1 Methods to construct transcriptional regulatory networks
1.2.2 Methods to construct functional linkage networks
1.2.3 Methods to construct post-transcriptional regulatory networks
1.2.4 Methods to construct other classes of cellular and biological networks
1.3 OUTLINE OF THE THESIS
REFERENCES
 
Chapter 2: Functional, structural and dynamic constraints on 
bacterial regulatory networks 
 
OUTLINE
CONTRIBUTION TO THE WORK IN THIS CHAPTER
2.1 INTRODUCTION
2.2 RESULTS
2.2.1 Constraints imposed on the network of transcription factors in bacteria
2.2.1.1 Topology of Escherichia coli cross-regulatory transcriptional network
2.2.1.2 Multiple parallel feed-forward loops regulate the use of different carbon sources
2.2.1.3 Long hierarchical cascades regulate developmental processes
2.2.2 Constraints imposed on bacterial genome organization by transcriptional network
2.2.2.1 Genomic co-localization of TFs and target genes is observed in small regulons
2.2.2.2 Transcriptional regulatory flow in the network of TFs
2.2.2.3 Absolute and average mRNA abundance of TFs suggests correlation with regulon size and network hierarchy in E. coli
2.2.2.4 A conceptual model for the structuring of regulatory networks in bacteria
2.3 DISCUSSION & CONCLUSION
2.4 METHODS
2.4.1 Identification of regulon groups
2.4.2 Estimating the statistical significance of the regulon groups
REFERENCES
Chapter 3: Transcriptional regulation constrains the 
organization of genes on eukaryotic chromosomes 
 
OUTLINE
CONTRIBUTION TO THE WORK IN THIS CHAPTER
3.1 INTRODUCTION
3.2 RESULTS
3.2.1 Eukaryotic genome organization and transcriptional regulation
3.2.1.1 Long-range interactions involving distal regulatory elements
3.2.1.2 Inter-chromosomal interactions
3.2.1.3 Chromosomal territories, movement and nuclear organization
3.2.1.4 Association of the genomic loci with the nuclear periphery
3.2.2 Transcriptional regulation constrains genome organization
3.2.2.1 The majority of TFs show a strong preference to regulate genes on specific chromosomes
3.2.2.2 A significant fraction of the TFs tend to have targets on specific regions of the chromosomal arm
3.2.2.3 Most TFs show a strong preference to positionally cluster their targets within a chromosome
3.3 DISCUSSION & CONCLUSION
3.4 MATERIALS AND METHODS
3.4.1 Dataset of Transcription factors in S. cerevisiae and their regulatory interactions
3.4.2 Estimation of statistical significance
3.4.3 Calculation of chromosomal preference
3.4.4 Calculation of regional preference
3.4.5 Calculation of target proximity
REFERENCES
 
Chapter 4: Uncovering the functional architecture of 
uncharacterized proteins in Escherichia coli 
 
OUTLINE
CONTRIBUTION TO THE WORK IN THIS CHAPTER
4.1 INTRODUCTION
4.2 RESULTS
4.2.1 Overview of network-based function prediction
4.2.1.1 Methods and databases for constructing functional association networks
4.2.1.2 Computational methods for predicting function from network context
4.2.2 Uncovering the cellular roles of functional orphans in E. coli
4.2.2.1 The extent of existing functional annotation for E. coli proteins
4.2.2.2 Properties of the functional orphans of E. coli
4.2.2.3 A systematic approach to elucidate biological function
4.2.2.4 Experimental definition of the physical interaction network of the soluble proteome
4.2.2.5 Orphan membership within multiple protein complexes
4.2.2.6 Functional interactions predicted by genomic-context methods
4.2.2.7 Defining the participation of orphans as the components of functional modules
4.2.2.8 Improved functional inference within an integrated network framework
4.2.2.9 Functional neighborhoods
4.3 DISCUSSION & CONCLUSION
4.4 MATERIALS AND METHODS
4.4.1 PI network generation
4.4.2 GC network generation
4.4.3 Clustering
4.4.4 Network-based function prediction and benchmarking
REFERENCES
 
Chapter 5: Structure and dynamics of post-transcriptional 
regulatory networks directed by RNA-binding proteins 
 
OUTLINE
CONTRIBUTION TO THE WORK IN THIS CHAPTER
5.1 INTRODUCTION
5.2 RESULTS
5.2.1 RNA binding proteins and post-transcriptional regulation
5.2.2 Methods to Identify RBPs and their targets
5.2.3 RBPs and post-transcriptional operons
5.2.4 Post-transcriptional network formed by RBPs
5.2.5 Expression dynamics of RBPs in post-transcriptional networks
5.2.5.1 RBPs show high abundance and tight regulation at the protein level
5.2.5.2 The number of distinct targets bound by a RBP is correlated with its cellular abundance
5.2.5.3 RBPs bound to many RNA targets are less frequently degraded and tightly controlled at protein level
5.3 DISCUSSION & CONCLUSION
5.4 MATERIALS AND METHODS
5.4.1 Data on RNA-binding proteins in S. cerevisiae and their interactions
5.4.2 Analysis of the structure and properties of post-transcriptional regulatory network
5.4.3 Data for comparative analysis of expression dynamics
5.4.4 Comparison of the regulatory properties of RBPs with other protein coding genes
5.4.5 Analysis of the relationship between the number of targets of a RBP and its dynamic properties
REFERENCES
 
 
Chapter 6: Conclusions and Perspectives 
 
6.1 Outline
6.2 Major Findings
6.2.1 Constraints imposed by transcriptional regulation on genome organization and regulatory network
6.2.2 Uncovering the functional landscape of a bacterial genome
6.2.3 Structure and dynamics of post-transcriptional networks controlled by RNA binding proteins
Implications and Future Directions
REFERENCES
 
Appendix 
 
A.1 LIST OF PUBLICATIONS
Publications during PhD (January 2008 - April 2010)
Publications under review, revision and in preparation
Publications prior to starting PhD
A.2 REPRINTS
 
 
 
 
Introduction                                                                                                                                                               1-1 
 
 
 
 
 
 
 
1                     Introduction 
 
 
 
 
 
CONTENTS OF CHAPTER 1 
 
PREAMBLE
OUTLINE OF THE INTRODUCTION
1.1 BASICS OF GRAPH THEORY AND NETWORKS
1.1.1 LOCAL LEVEL
1.1.2 MODULAR LEVEL
1.1.3 GLOBAL LEVEL
1.2 NETWORKS IN MOLECULAR BIOLOGY
1.2.1 METHODS TO CONSTRUCT TRANSCRIPTIONAL REGULATORY NETWORKS
1.2.2 METHODS TO CONSTRUCT FUNCTIONAL LINKAGE NETWORKS
1.2.3 METHODS TO CONSTRUCT POST-TRANSCRIPTIONAL REGULATORY NETWORKS
1.2.4 METHODS TO CONSTRUCT OTHER CLASSES OF CELLULAR AND BIOLOGICAL NETWORKS
1.3 OUTLINE OF THE THESIS
REFERENCES
 
 
 
 
PREAMBLE 
Reductionism, which has been the paradigm in biological research for more than a 
century, has provided us with a wealth of knowledge about individual cellular components, 
their functions and mechanisms. Despite the huge success of this approach in the last century, 
post-genomic biology has made it increasingly clear that a discrete biological function can only 
rarely be attributed to an individual molecule. Instead, most biological outcomes in a cell arise 
from a complex interplay between different cellular entities such as proteins, DNA, RNA and 
metabolites. Therefore, a key challenge for biology in the twenty-first century is to understand 
the structure and dynamics of the complex web of interactions in a cell that contribute to its 
proper functioning. Although we cannot yet meet this challenge in full, the analyses, concepts 
and frameworks outlined in this thesis should help the scientific community to interpret and 
better understand the logic behind the several layers of interactions taking place in 
the cell. 
In the last few years there has been rapid development in various high-throughput 
technologies, which has led to the accumulation of a large amount of data from different areas 
of molecular and cellular biology. These developments, together with increasing interest in the 
community in gaining a systems-wide understanding of the cellular machinery, have provided 
us with unprecedented insights into the structure, organization and dynamics of major 
cellular processes such as transcription, translation and degradation. Likewise, efforts to 
understand the interaction of the cell with its external environment have generated global 
phenotypic maps such as those due to small-molecule perturbations. Despite the growing 
amount of data representing each of these processes, it should be recognized that none of these 
cellular processes works in isolation; rather, they form an integrated network of different wiring 
diagrams that is responsible for the observed behavior of the cell. In this thesis, I provide 
evidence that each of the networks associated with a particular cellular 
process can be studied in detail to provide meaningful insights into how it contributes to the 
functioning of the cell, which factors constrain its structure and how it influences the genome 
on which it is encoded. Nevertheless, an open challenge of contemporary biology is to 
integrate these diverse cellular programs, first to understand and model in quantitative terms the 
topological and dynamic properties of such a unified cellular network and then to exploit it for 
the therapeutic benefit of mankind. 
 
 
 
OUTLINE OF THE INTRODUCTION 
An emerging notion in post-genomic biology is that cellular components can be visualized as a 
network of associations between different molecules like proteins, DNA, RNA and metabolites. 
This has led to the application of network theory and network-based approaches to a wide 
range of biological problems, from understanding the regulation of gene expression to predicting 
gene function and phenotype to drug discovery. In this chapter, I first introduce the 
notion of networks and the basic principles of network biology, together with an overview of 
the different kinds of networks that are widely studied in the biological sciences at the systems 
level. In particular, I introduce transcriptional and post-transcriptional networks, in which 
trans-acting elements like TFs, RBPs and sigma factors form one set of nodes and the target 
genes or RNAs whose activity they control form the other set of nodes. The links 
between them, which are directed from the trans-acting elements to their target genes and 
controlled by the cis-regulatory elements of those genes, form a complex directed network of 
interactions. In contrast, the functional linkage networks constructed in function prediction 
pipelines are typically undirected: all nodes are treated essentially the same 
and the edges between them carry no directionality. These networks aim to uncover the broad 
functional role of uncharacterized genes using the annotations of the already characterized 
members to which they are connected. I then give a brief overview of other classes of 
networks, such as small-molecule protein interaction networks (also referred to as 
drug-target networks), to illustrate the generality and applicability of network-guided 
approaches in understanding biological systems. 
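
The contrast drawn above between directed regulatory networks and undirected functional linkage networks can be illustrated with a minimal Python sketch. All node names, edges and annotations below are hypothetical toy data, and the majority-vote rule is only one simple way of transferring annotations from characterized neighbours; it is not presented as the method used in later chapters.

```python
from collections import Counter

# Directed regulatory network (hypothetical): each edge points from a
# trans-acting regulator to the target it controls.
regulatory_edges = [("TF1", "geneA"), ("TF1", "geneB"),
                    ("TF2", "geneB"), ("TF1", "TF2")]


def out_degree(edges):
    """Number of targets controlled by each regulator."""
    return dict(Counter(source for source, _ in edges))


def in_degree(edges):
    """Number of regulators controlling each target."""
    return dict(Counter(target for _, target in edges))


# Undirected functional linkage network (hypothetical): the order of the two
# nodes in an edge carries no meaning.
functional_edges = [("geneA", "geneB"), ("geneA", "geneC"),
                    ("geneX", "geneA"), ("geneX", "geneB"), ("geneX", "geneC")]
annotations = {"geneA": "transport", "geneB": "transport", "geneC": "metabolism"}


def predict_function(gene, edges, known):
    """Assign the most common annotation among the characterized neighbours."""
    neighbours = set()
    for u, v in edges:
        if u == gene:
            neighbours.add(v)
        elif v == gene:
            neighbours.add(u)
    votes = Counter(known[n] for n in sorted(neighbours) if n in known)
    return votes.most_common(1)[0][0] if votes else None


predicted = predict_function("geneX", functional_edges, annotations)
```

In the directed network the in-degree and out-degree of a node differ in meaning (number of regulators versus number of targets), whereas in the undirected network a single neighbour set suffices, which is what the annotation transfer relies on.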
 
 
1.1 BASICS OF GRAPH THEORY AND NETWORKS 
Complex networks describe a wide range of dynamical systems in nature and society. In 
simple terms, a network comprises a set of nodes with connections between them called 
edges. Most real-world systems can be visualized in the form of networks, also called graphs in 
the mathematical literature. Examples include the Internet, the World Wide Web (WWW), social 
networks of acquaintances between individuals, food webs, metabolic networks, transcriptional 
networks, signaling networks, neural networks and many others. Although the study of 
networks, in the form of mathematical graph theory, is one of the fundamental areas of discrete 
mathematics, much of our understanding of their underlying organizational principles has 
come to light only recently. While most complex networks have traditionally been modeled as 
random graphs, it is increasingly recognized that the topology and evolution of real networks are 
governed by robust design principles.  
      A number of biological systems, ranging from metabolic and neuronal networks to food webs and 
ecosystems, can be usefully represented as networks. More generally, the behavior of most 
complex systems emerges from the orchestrated activity of many components that interact 
with each other through pairwise interactions. As such, at a highly abstract level, the 
components can be reduced to a series of nodes that are connected to each other by edges, 
with each edge representing the interaction between two components. The nodes and links 
together form a network or, in more formal mathematical language, a graph, and these 
definitions can be extended to any sub-system of a complex system under study. 
Understanding the network of cellular interactions as a whole is impractical at the moment for at 
least two major reasons, namely the incompleteness of the data representing the wide variety of 
interactions that are possible in a cell and variations in the mode as well as type of interactions. 
Theoreticians have therefore been studying networks by dissecting the biological processes into different 
levels with the most commonly studied being the physical interactions between molecules, such 
as protein-protein, protein-nucleic acids and protein-metabolite, all of which can be 
conceptualized using the node-link nomenclature. Nevertheless, more complex functional 
interactions can also be considered within this representation. A classic example of such a 
representation is the network of metabolic pathways, in which metabolic substrates and 
products are connected by a directed edge if a known metabolic reaction 
acts on the given substrate and produces the given product.    
 Depending on the nature of the interactions, networks can be directed or undirected. In 
directed networks, the interaction between any two nodes has a well-defined direction, which 
represents, for example, the direction of material flow from a substrate to a product in a 
metabolic reaction or the direction of information flow from a transcription factor to the gene that 
it regulates. In undirected networks, the links do not have an assigned direction. For example, in 
protein interaction networks a link represents a mutual binding relationship and hence does not 
have a directionality associated with it. 
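
The distinction between directed and undirected networks can be made concrete with a minimal sketch. The function name build_graph and the toy node labels (TF1, geneA, P1 and so on) are hypothetical and are used only for illustration; they do not correspond to any dataset analysed in this thesis.

```python
from collections import defaultdict


def build_graph(edges, directed):
    """Return an adjacency map {node: set of nodes it links to}.

    For a directed network only the source -> target link is stored;
    for an undirected network the link is stored in both directions.
    """
    adjacency = defaultdict(set)
    for source, target in edges:
        adjacency[source].add(target)
        adjacency[target]  # ensure the target is registered as a node
        if not directed:
            adjacency[target].add(source)
    return dict(adjacency)


# Hypothetical toy data: a TF regulating two genes (directed) and
# two proteins that bind each other (undirected).
regulation = build_graph([("TF1", "geneA"), ("TF1", "geneB")], directed=True)
binding = build_graph([("P1", "P2")], directed=False)
```

In the directed case geneA has no outgoing links, whereas in the undirected case the P1-P2 link is visible from both partners.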
Another important class of biological networks is the genetic regulatory network. The 
expression of a gene, i.e., the production by transcription and translation of the protein that 
the gene encodes, can be controlled by other proteins called transcription 
factors (TFs), which can affect the expression of the gene either positively or negatively. In the 
former case, TFs are considered to act as activators and in the latter as repressors. It is due to 
the regulatory network that the genome can co-ordinate its response to both external and internal 
stimuli by controlling the expression of thousands of genes in appropriate amounts under 
appropriate conditions and at appropriate times. Genetic regulatory networks were in fact one of the first 
networked dynamical systems for which large-scale modeling attempts were made. The early 
work on random Boolean nets by Kauffman (Kauffman, 1969; Kauffman, 1971; Kauffman, 1993) 
is a classic in this field, although substantial advances have come only more recently. The structure of 
transcriptional regulatory networks has been the focus of several recent studies (Babu et al., 
2004; Farkas, 2003; Guelzim et al., 2002; Janga and Collado-Vides, 2007; Thieffry et al., 1998). 
1.1.1 Local level 
A number of properties can be defined for a network representation and these properties can be 
grouped into three major classes namely local, module and global levels. In the following 
sections, I will summarize the major quantitative properties which can be used to define the 
structure of complex networks at each of these three levels. The first of them is the local level, 
which, as the name suggests, refers to the local properties of a node. For instance, as discussed 
above, networks can be directed or undirected depending on the nature of the interactions. 
As such, each node in a directed network has both an outgoing degree and an incoming degree, 
while each node in an undirected network has only a single degree (see Table 
1-1 for a list of local properties of networks). The degree or connectivity of a node in a network 
corresponds to the total number of connections it has with other nodes in the network. In 
directed networks, the degree or connectivity of a node is the sum of its incoming and 
outgoing degrees. Highly connected nodes, i.e., nodes with high degree, in biological networks are 
often referred to as hubs. The degree distribution, P(k), is another property derived 
from the degrees of the nodes in a network; it gives the probability that a selected node has exactly  
 
Table 1-1. Different local properties which can be defined for a node in complex networks. 
 
Property | Definition 
Indegree or incoming degree | In directed networks, where the directionality of an interaction is taken into account, indegree refers to the number of incoming connections to a node of interest. In other words, indegree is the number of arrows that point into the node under investigation. 
Outdegree or outgoing degree | Outdegree refers to the number of edges which start from a node of interest and point to other nodes in the network. It is defined for directed networks, where a direction is associated with each edge. 
Degree or connectivity | Degree or connectivity of a node refers to the total number of interactions it has in a network; the higher the connectivity (as for hub nodes), the larger the number of targets it interacts with. In directed networks degree simply corresponds to the sum of the in and out degrees of a node. 
Clustering coefficient | Clustering coefficient of a node reflects the extent to which the neighbors of a given node are interconnected among themselves, relative to the maximum that is theoretically possible, and indicates the cohesiveness or local modularity of the network. An extension of this metric to the complete network, defined as the average clustering coefficient of all nodes, tells whether the network is modular or sparsely connected. 
Betweenness | Betweenness centrality of a node measures the number of shortest paths between all pairs of nodes in the network that pass through a node of interest; the higher the number of paths that pass through a node, the more important it is. 
Average path length | Average length of the shortest paths between all pairs of nodes in the network. 
Closeness | Closeness centrality is defined as the inverse of the average length of all the shortest paths from a node of interest to all other nodes in the network. Note that closeness centrality defined this way implies that the higher the closeness value, the higher the importance (centrality) of a node. 
Diameter | The diameter of a network is the length of the longest path among all the shortest paths defined between two nodes. It gives an estimate of the distance between the farthest nodes in the network. 
Graph density | The density of a network is the ratio of the number of edges to the total number of possible edges. 
Power law fit (exponent alpha) | Fitting a power-law distribution function to the degree distribution of the network to study whether the network is likely to exhibit a scale-free network structure. 
 
 
k links. P(k) is obtained by counting the number of nodes N(k) with k = 1, 2, ... links and dividing by 
the total number of nodes N. The degree distribution allows us to distinguish between different 
classes of networks. For example, a Poisson degree distribution is seen when P(k) is plotted 
against k for random networks, indicating that most nodes have a roughly equal number of links 
with little deviation from the average degree of a node in the network. By contrast, a power-law 
degree distribution indicates that a few nodes interact with numerous other nodes while most 
interact with rather few nodes (see Global Level). 
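
As a minimal sketch of the definition P(k) = N(k)/N given above, the following computes the degree distribution of an undirected network stored as an adjacency map. The function name and the toy star-shaped network are hypothetical; note that, unlike the text, the sketch would also report k = 0 if isolated nodes were present.

```python
from collections import Counter


def degree_distribution(adjacency):
    """Return {k: P(k)}, where P(k) = N(k) / N for an undirected network.

    `adjacency` maps each node to the set of its neighbours.
    """
    n_nodes = len(adjacency)
    # N(k): number of nodes having exactly k links
    counts = Counter(len(neighbours) for neighbours in adjacency.values())
    return {k: n_k / n_nodes for k, n_k in sorted(counts.items())}


# Hypothetical toy network: a hub "a" linked to three peripheral nodes.
toy = {"a": {"b", "c", "d"}, "b": {"a"}, "c": {"a"}, "d": {"a"}}
p_k = degree_distribution(toy)
```

Here three of the four nodes have a single link and one node has three, so the distribution assigns 0.75 to k = 1 and 0.25 to k = 3.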
Another important property at the local level is the clustering coefficient of a node, which 
tells how interconnected the neighbors of a given node are relative to what would be expected if all the 
neighbors were fully connected. Mathematically, it is defined as the ratio of the number of 
observed links between the neighbors of a node of interest to the total number of feasible links 
between all the immediate neighbors. The average clustering coefficient of a network, calculated as 
the mean of the clustering coefficients of all the nodes in the network, gives a measure of 
cohesiveness in the network, which is also commonly referred to as the extent of modularity. The 
higher the clustering coefficient, the greater the modular nature of the network. To assess the 
extent of cohesiveness in a network, the clustering coefficients of real networks are often 
compared with those of random networks of similar size and degree distribution.    
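
The ratio described above can be sketched for an undirected network as follows. For a node with k neighbours the number of feasible links among them is k(k-1)/2. The function names and the toy network are hypothetical, and assigning a coefficient of zero to nodes with fewer than two neighbours is an assumption of this sketch (a common convention, but not one stated in the text).

```python
from itertools import combinations


def clustering_coefficient(adjacency, node):
    """Observed links among the neighbours of `node` divided by the
    number of feasible links, k * (k - 1) / 2, for an undirected network."""
    neighbours = adjacency[node] - {node}
    k = len(neighbours)
    if k < 2:
        return 0.0  # assumption: undefined cases are reported as zero
    observed = sum(1 for u, v in combinations(neighbours, 2) if v in adjacency[u])
    return observed / (k * (k - 1) / 2)


def average_clustering(adjacency):
    """Mean of the clustering coefficients of all nodes in the network."""
    return sum(clustering_coefficient(adjacency, n) for n in adjacency) / len(adjacency)


# Hypothetical toy network: a triangle a-b-c with a pendant node d on a.
toy = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
c_a = clustering_coefficient(toy, "a")
c_mean = average_clustering(toy)
```

In this toy case node a has three neighbours with one observed link among three feasible ones, giving a coefficient of 1/3, while b and c each score 1 and d scores 0.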
The properties discussed so far concern the nodes in the network; however, 
a number of properties have also been defined for edges in a network. The most important of these 
is the path length between two nodes, which refers to the number of 
edges that one needs to traverse between two nodes of interest. Since there can be many 
alternative paths between two nodes, the shortest path, i.e., the path with the smallest number of 
links between the selected nodes, is often referred to as the path length. In directed networks, 
the path length from node A to node B may not be the same as that from node B to node 
A, reflecting the directionality in the network. An important global property which stems from 
path length is the average or mean path length of a network, which refers to the average over the 
shortest paths between all pairs of nodes and offers a measure of a network’s overall reach. 
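
Shortest path lengths in an unweighted network can be obtained by breadth-first search, as in the minimal sketch below. The function names and the three-node chain are hypothetical, and averaging only over ordered pairs that are actually connected by a path is an assumption of this sketch rather than a convention taken from the text.

```python
from collections import deque


def shortest_path_lengths(adjacency, source):
    """Breadth-first search: number of edges on the shortest path from
    `source` to every node reachable from it."""
    distances = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, ()):
            if neighbour not in distances:
                distances[neighbour] = distances[node] + 1
                queue.append(neighbour)
    return distances


def average_path_length(adjacency):
    """Mean shortest path length over all ordered pairs joined by a path
    (assumption: unreachable pairs are simply left out of the average)."""
    total, pairs = 0, 0
    for source in adjacency:
        for target, dist in shortest_path_lengths(adjacency, source).items():
            if target != source:
                total += dist
                pairs += 1
    return total / pairs if pairs else 0.0


# Hypothetical directed chain A -> B -> C.
chain = {"A": {"B"}, "B": {"C"}, "C": set()}
from_a = shortest_path_lengths(chain, "A")
from_c = shortest_path_lengths(chain, "C")
mean_length = average_path_length(chain)
```

The chain illustrates the asymmetry of directed networks: C lies two steps from A, whereas A cannot be reached from C at all.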
In addition to the degree of a node, which tells how central or important a node in a 
network is, a number of other centrality measures have also been defined in the literature. 
These include betweenness and closeness centrality among other less popular definitions 
(Junker et al., 2006). Betweenness centrality, which is the number of shortest paths going 
through a node, is typically calculated using the Brandes algorithm (Brandes, 2001). Closeness 
is measured as the inverse of the average length of the shortest paths from a node of interest to 
all other nodes in the network. Since both betweenness and closeness use 
the shortest path lengths between all pairs of nodes in a graph, for cases where no path exists 
between a particular pair of nodes, the shortest path length is usually taken as one less than the 
maximum number of nodes in the graph. 
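
A minimal sketch of closeness centrality with a substitute length for unreachable nodes is given below. Reading the convention above as "N - 1 for a network of N nodes" is an interpretation made for this sketch, and the function names, the unreachable_length parameter and the toy network are all hypothetical.

```python
from collections import deque


def bfs_distances(adjacency, source):
    """Shortest path lengths (in edges) from `source` to reachable nodes."""
    distances = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, ()):
            if neighbour not in distances:
                distances[neighbour] = distances[node] + 1
                queue.append(neighbour)
    return distances


def closeness(adjacency, node, unreachable_length=None):
    """Inverse of the average shortest path length from `node` to all others.

    Unreachable nodes contribute `unreachable_length`; by default this is
    N - 1, an assumed reading of the convention described in the text.
    """
    n_nodes = len(adjacency)
    if n_nodes < 2:
        return 0.0
    if unreachable_length is None:
        unreachable_length = n_nodes - 1
    reached = bfs_distances(adjacency, node)
    total = sum(reached.get(other, unreachable_length)
                for other in adjacency if other != node)
    return (n_nodes - 1) / total


# Hypothetical undirected path a - b - c plus an isolated node d.
toy = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}, "d": set()}
closeness_b = closeness(toy, "b")
closeness_d = closeness(toy, "d")
```

With this choice the central node b scores higher than the end node a, and the isolated node d receives the lowest possible value rather than an undefined one.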
 Although a number of these properties have been studied in diverse kinds of cellular 
networks and will be discussed in the respective chapters as appropriate, I summarize 
below some of the observations to give a flavor of their importance in understanding complex 
networks. Studies on the statistical properties of metabolic networks revealed that the 
distributions of the outgoing and incoming degrees follow a power law (Jeong 
et al., 2000). It was also shown using undirected versions of these metabolic graphs that they 
have a short average path length and a large clustering coefficient (Fell and Wagner, 2000). In 
protein-protein interaction networks it was shown that the degree distribution follows a power 
law and that highly connected proteins are more likely to be essential, i.e., lethal when deleted, than poorly connected ones 
(Jeong et al., 2001), and that links between highly connected proteins tend to be suppressed 
while those between highly connected and poorly connected proteins are abundant, which was 
proposed as an attribute of cellular networks to attain robustness and decrease cross-talk 
between different functional modules (Maslov and Sneppen, 2002). This property of highly 
connected proteins avoiding interactions with other highly connected proteins in a network has 
been referred to as the disassortative property. On the other hand, the observation that most 
real-world networks have extremely small average path lengths is referred to as the small-world 
effect (Watts and Strogatz, 1998). 
1.1.2 Modular level 
Another important level at which network organization is often studied is that of modules. 
Modules are seen in all kinds of complex systems, from groups of friends in social networks and 
websites dedicated to similar topics on the Internet to groups of organisms which 
survive in a similar niche in an ecological food web. Modules are also evident in several 
engineered systems, from a simple computer chip to a more sophisticated supercomputer, 
in which they are employed to create order and to organize the tasks dedicated to each of 
these fundamental units. Likewise, cellular processes have been proposed to be carried out in a 
highly modular manner (Hartwell et al., 1999). More generally, modules in biological networks 
refer to a group of genes/proteins or other cellular entities that work together to achieve a 
common task for the proper functioning of the cell (Alon, 2003; Hartwell et al., 1999; Ravasz and 
Barabasi, 2003; Ravasz et al., 2002). In fact, there are numerous examples of modules in a 
cellular context such as protein-protein and protein-RNA complexes which form physical 
modules or co-expressed gene clusters which work together in a given biochemical process or 
signaling modules which gather extracellular cues to prepare an organism for variations in the 
environment. Evidence for the existence of modularity in cellular networks has mostly come 
from the calculation of average clustering coefficient (see Table 1-1) of a wide variety of 
networks, which indicates the occurrence of a high number of interconnections between the 
neighbors of a node of interest. The average clustering coefficient, which is the mean of the 
clustering coefficients of all the nodes, is considered a proxy for modularity in networks. In the 
absence of modularity, the clustering coefficients of the real and the randomized networks are 
comparable. The average clustering coefficient of most real networks is, however, significantly larger than 
that of a random network of equivalent size and degree distribution. For instance, the existence of 
modularity defined in this fashion has been convincingly shown for a number of biological 
networks including metabolic, protein-protein and transcriptional networks (Guelzim et al., 2002; Ravasz 
et al., 2002; Wagner, 2001; Wuchty, 2001). Although there is no definitive agreement on how 
modules in cellular networks can be best identified and what set of genes would constitute a 
module (Wolf and Arkin, 2003), it is now common knowledge that most biological systems can 
be divided into groups of genes which perform discrete biological functions. Part of the problem in 
our ability to precisely determine the components of a module in cellular networks is that 
biological networks are hierarchical and scale-free structures (Ravasz and Barabasi, 2003; 
Ravasz et al., 2002) (see below), and therefore modularity in these settings indicates that the 
network can be split into either many modules, each containing only a few genes, or a 
few modules, each of which can harbor many genes. It is therefore intuitive that the 
hierarchical modular nature of cellular systems naturally permits the definition of a module to be 
plastic, depending on the granularity at which one wishes to dissect a system.  
 The high clustering in cellular networks indicates that they are generally locally 
grouped, with various subgraphs of highly interconnected groups of nodes forming the core, which is 
evidence supporting the occurrence of isolated functional modules. Subgraphs capture specific 
patterns of interconnections that characterize a given network at the modular level. However, 
not all subgraphs are equally significant in real networks, as indicated by a series of recent 
observations (Milo et al., 2002; Shen-Orr et al., 2002).  Some subgraphs or patterns of 
interconnections between nodes in a network appear more often than expected by chance in 
random networks of the same size and degree distribution, and these are often referred to as network motifs. 
Motifs in networks are analogous to sequence motifs in a set of homologous sequences, which 
are defined as patterns of amino acids or DNA stretches that are more conserved than 
expected by chance. Different networks have been shown to be enriched for different motifs 
(Milo et al., 2002). For instance, transcriptional networks have been shown to harbor 
Feed-Forward Loops (FFLs) as the most abundant motif, while protein interaction networks have been 
shown to contain fully connected cliques, i.e., subgraphs in which all the nodes are 
connected to each other (Shen-Orr et al., 2002; Wuchty et al., 2003). The identification of motifs 
not only provides information about the type of local interconnections in the network but also 
allows one to understand their interplay with the rest of the network.  Several lines of evidence support 
the biological relevance of motifs in networks. For example, the high degree 
of evolutionary conservation of motif constituents within the yeast protein interaction network 
and the convergent evolution of motifs observed in the transcription regulatory networks of 
diverse species both support their biological relevance (Conant and Wagner, 2003; Madan Babu 
et al., 2006; Wuchty et al., 2003).  
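
To illustrate what a feed-forward loop looks like as a subgraph, the following minimal sketch counts triples X, Y, Z of distinct nodes with the directed links X to Y, Y to Z and X to Z. It only counts occurrences; establishing that the pattern is a motif would additionally require comparing the count against randomized networks, which is not shown. The function name and the toy network are hypothetical and this is not the procedure used in the cited studies.

```python
def count_feed_forward_loops(adjacency):
    """Count ordered triples (X, Y, Z) of distinct nodes such that
    X -> Y, Y -> Z and X -> Z are all present in a directed network.

    `adjacency` maps each regulator to the set of nodes it targets.
    """
    count = 0
    for x, x_targets in adjacency.items():
        for y in x_targets:
            if y == x:
                continue  # ignore autoregulation
            for z in adjacency.get(y, ()):
                if z != x and z != y and z in x_targets:
                    count += 1
    return count


# Hypothetical toy network: X regulates Y and Z, Y regulates Z, Z regulates W.
toy = {"X": {"Y", "Z"}, "Y": {"Z"}, "Z": {"W"}, "W": set()}
n_ffl = count_feed_forward_loops(toy)
```

The toy network contains exactly one such loop (X, Y, Z); the link from Z to W does not complete a second one because neither X nor Y regulates W directly.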
 In the case of a transcriptional regulatory network, a module is typically defined as a set of 
genes that are regulated by a common set of transcription factors (TFs). Under this definition, 
it is intuitive to expect that various cellular processes can be conveniently regulated by discrete 
and separable modules which can coordinate the activities of many genes and carry out 
complex functions. Therefore, identifying transcriptional modules is useful for understanding 
cellular responses to internal and external signals under different cellular conditions. Datasets of 
genome-wide gene expression and location analysis (ChIP-chip) are frequently used to identify 
transcriptional modules controlling a variety of cellular processes (Bar-Joseph et al., 2003; 
Ihmels et al., 2002; Segal et al., 2003; Stuart et al., 2003; Wu et al., 2006). Several of these 
studies have focused on yeast and other model organisms due to the availability of extensive 
datasets on gene expression and transcriptional regulatory interactions together with their 
binding site information. From a computational perspective, typical approaches for module 
discovery have involved applying clustering and motif-discovery algorithms to gene expression data 
to find sets of co-regulated genes, with variations in the methods to include previously known 
information on cellular functions or promoter sequences. Some studies also used model-based 
approaches such as Bayesian networks to infer modules and understand regulatory network 
architectures (Segal et al., 2003). Although several methods have been developed to 
identify regulatory modules from expression data, the most frequently used implementations rely 
on the assumption that genes co-expressed under similar conditions are likely to belong to the same set of 
regulatory modules (Ihmels et al., 2004; Ihmels et al., 2002; Segal et al., 2003), while more 
sophisticated approaches integrate additional data sources like TF binding data, motif 
information or functional annotation (Bar-Joseph et al., 2003; Ihmels et al., 2002; Pilpel et al., 
2001). 
Although several different approaches to identifying modules have 
provided distinct outcomes in terms of the number and size of the resulting modules, the 
general consensus has been that regulatory networks are highly interconnected and very few 
modules are entirely separable from the rest of the network. Therefore, the major conclusion 
has been that modules are frequently nested within each other in a hierarchical fashion at 
different levels. In fact, an analysis of the distribution of the commonly seen motifs across the 
identified modules in transcriptional networks suggests that network motifs themselves do not 
exist in isolation but rather integrate to form part of the modules by sharing some of their edges 
(Dobrin et al., 2004; Resendis-Antonio et al., 2005). Thus, many small, highly connected motifs 
group into a few larger modules, which in turn integrate into even larger ones.  These nested 
modules are interconnected through local regulatory hubs. Such an organization not only 
explains the hierarchical organization, which is seen in other cellular networks (Ravasz and 
Barabasi, 2003) but also intuitively suggests the capacity for rapid regulatory changes through 
regulatory hubs, with integration and fine tuning of the regulatory processes by downstream 
TFs, thereby linking several modules in a hierarchical manner.  
As the components of a specific motif often interact with nodes that are outside the motif, 
it is important to understand how different motifs interact with each other and with the rest of the 
network for different kinds of networks.  While recent work shows that different motifs aggregate 
to form large motif clusters in transcriptional networks, the generality of these findings is still 
under debate. However, since motifs are present in all kinds of biological networks that have 
been examined to date (Milo et al., 2002), it is likely that the aggregation of motifs into motif 
clusters and modules is a generic property of most biological and real-world networks. 
1.1.3 Global level 
One of the most important developments in our understanding of complex systems is the 
observation that, despite the remarkable diversity of complex networks in nature, 
their architecture is governed by a few simple principles. For example, most 
complex networks were long believed to follow degree distributions like that predicted 
by the Erdos-Renyi model, according to which a plot of the degree distribution, P(k), against the 
degree k of a complex network should follow a Poisson distribution. However, it is now clear that 
most real-world complex systems, including biological networks, follow a scale-free topology with 
a power-law degree distribution, in which the degree distribution, P(k), plotted against the degree k on a 
log-log plot shows a straight line with a negative slope, -γ, where the exponent γ typically varies between 2 and 3. It has 
also been shown that in both the Erdos-Renyi model and the scale-free model proposed by 
Barabasi and Albert (Barabasi and Albert, 1999), the distribution of the clustering coefficient was found 
to be independent of the degree (Barabasi and Oltvai, 2004). Nevertheless, a major difference 
between the two network models is that in the former most nodes have an approximately equal 
number of links, with all of them being close to the average degree in the network, indicative of 
a Gaussian/Poisson-like degree distribution, while the latter is characterized by the presence of a 
large number of nodes which are poorly connected and a relatively small number of nodes 
which are highly connected (also referred to as hubs). Due to the scaling nature of the degree 
distribution, the Barabasi-Albert or scale-free model exhibits a straight line on a log-log plot of 
the degree distribution against the degree of a node.  
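
The straight line on a log-log plot can be illustrated with a minimal sketch that fits a line to log P(k) against log k by ordinary least squares and reads the exponent off the slope. This is a deliberately crude estimator chosen for illustration; more careful statistical approaches to fitting power laws exist and would be preferable for real data. The function name and the synthetic distribution with exponent 2.5 are hypothetical.

```python
import math


def loglog_slope(p_k):
    """Least-squares slope of log P(k) against log k.

    `p_k` maps a degree k to its probability P(k); entries with k = 0 or
    P(k) = 0 are skipped because their logarithm is undefined.
    """
    points = [(math.log(k), math.log(p)) for k, p in p_k.items() if k > 0 and p > 0]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    s_xx = sum((x - mean_x) ** 2 for x, _ in points)
    return s_xy / s_xx


# Synthetic, noise-free distribution with P(k) proportional to k^-2.5.
raw = {k: k ** -2.5 for k in range(1, 51)}
norm = sum(raw.values())
synthetic = {k: value / norm for k, value in raw.items()}
gamma = -loglog_slope(synthetic)  # the exponent is minus the slope
```

For this idealized input the recovered exponent equals the one used to generate the data, up to floating-point error; empirical degree distributions are noisy and heavy-tailed, so the same fit on real networks should be interpreted with caution.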
 Yet another class of networks proposed in the literature is that of 
hierarchical scale-free networks, which possess all the properties of scale-free networks and 
in addition exhibit a slope of -1 when the clustering coefficient is plotted 
against the degree of a node on a log-log scale, indicating an organization in which sparsely 
connected nodes are part of highly clustered areas, with communication between the different 
highly clustered neighborhoods being maintained by a few hubs. It is increasingly believed that 
most real-world complex networks obey this hierarchical scale-free modular structure (Ravasz 
and Barabasi, 2003; Ravasz et al., 2002; Yu and Gerstein, 2006).   
Although the hierarchical nature of networks has not been extensively explored for all 
the cellular networks, there is extensive evidence that most of them, from protein-protein and 
transcriptional regulatory to metabolic networks, at least exhibit a scale-free topology (Giot et al., 
2003; Guelzim et al., 2002; Jeong et al., 2001; Wagner, 2001). In such networks, most proteins 
or cellular entities participate in only a few interactions while a few participate in 
a disproportionately large number of interactions, a signature of scale-free networks with 
an inherent power-law degree distribution. Although a large number of cellular networks have been 
shown to exhibit a scale-free topology in recent years, not all of them are scale-free 
graphs. For instance, in the case of transcriptional regulatory networks the incoming 
connectivity, which is defined as the number of transcription factors regulating a target gene 
and quantifies the combinatorial effect of gene regulation, was observed to follow an 
exponential distribution in both Escherichia coli and Saccharomyces cerevisiae (Guelzim et al., 
2002; Thieffry et al., 1998). The exponential behaviour indicates that most target genes are 
regulated by a similar number of factors and could reflect the limits on the number of transcription 
factors that can affect a target gene due to the constraints on the intergenic space available 
and the number of proteins that can simultaneously affect a promoter region. On the other hand, 
the outgoing connectivity, which is the number of target genes regulated by each transcription 
factor, was found to be distributed according to a power law, in contrast to the incoming 
connectivity. This is indicative of a hub-containing network structure, in which a 
select set of transcription factors participate in the regulation of a disproportionately large 
number of target genes. These hubs can be viewed as ‘global regulators’, as opposed to the 
remaining transcription factors that can be considered as ‘fine tuners’.  
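
The incoming and outgoing connectivities discussed above can be computed directly from a list of regulatory interactions, as in the minimal sketch below. The function name and the toy interactions (TF1, TF2, g1 to g3) are hypothetical and do not represent any curated regulatory network.

```python
from collections import Counter


def in_out_degrees(edges):
    """Return (in_degree, out_degree) counters for a directed edge list.

    Each edge is a (regulator, target) pair; duplicated pairs are counted
    once. The in-degree of a gene is the number of TFs regulating it and
    the out-degree of a TF is the number of genes it regulates.
    """
    in_degree, out_degree = Counter(), Counter()
    for regulator, target in set(edges):
        out_degree[regulator] += 1
        in_degree[target] += 1
    return in_degree, out_degree


# Hypothetical toy network: TF1 is a hub that also regulates TF2.
edges = [("TF1", "g1"), ("TF1", "g2"), ("TF1", "g3"),
         ("TF2", "g3"), ("TF1", "TF2")]
in_degree, out_degree = in_out_degrees(edges)
```

In this toy case TF1 has an outgoing connectivity of four, resembling a global regulator, while no target has an incoming connectivity greater than two.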
In the case of transcriptional regulatory networks it has been shown, by both top-down and 
bottom-up approaches for determining hierarchy, that they possess a multi-layer hierarchical 
modular structure (Ma et al., 2004; Yu and Gerstein, 2006). Interestingly, transcription networks 
do not seem to possess much feedback regulation at the level of transcription, meaning that transcriptional 
regulation of TFs at the top by TFs at the bottom of this hierarchical structure is not frequent, 
indicating the prevalence of alternative forms of feedback control of transcription. Typically 
such feedback occurs through protein-protein interactions at the post-translational 
level or through a complex interplay of cellular entities which control the activity of TFs by 
changing their conformation depending on the continuously varying intra- and extra-cellular 
conditions (Martinez-Antonio et al., 2006; Yu and Gerstein, 2006). It has also been observed 
that the TFs in the middle of this hierarchy (often from levels 2 and 3 measured from the 
bottom) regulate more direct targets than those at the top, suggesting that these middle-level 
TFs act as managers and are indeed control bottlenecks for the cellular transcriptional response 
(Yu and Gerstein, 2006). 
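
The idea of measuring hierarchy levels from the bottom can be illustrated with a simplified sketch. It assumes that the TF-TF regulatory network is acyclic apart from autoregulation, places TFs that regulate no other TF at level 1, and places every other TF one level above the highest TF it regulates. This is an illustrative simplification and not the procedure of Ma et al. (2004) or Yu and Gerstein (2006); the function name and the toy interactions are hypothetical.

```python
def hierarchy_levels(tf_edges):
    """Assign each TF a level counted from the bottom of the hierarchy.

    `tf_edges` is a list of (regulator TF, target TF) pairs. Autoregulatory
    edges are ignored. A ValueError is raised if a feedback cycle between
    different TFs is found, since levels are then not well defined here.
    """
    targets = {}
    for regulator, target in tf_edges:
        targets.setdefault(regulator, set())
        targets.setdefault(target, set())
        if regulator != target:
            targets[regulator].add(target)

    levels, in_progress = {}, set()

    def level(tf):
        if tf in levels:
            return levels[tf]
        if tf in in_progress:
            raise ValueError("feedback cycle detected at " + tf)
        in_progress.add(tf)
        # one level above the highest regulated TF; bottom TFs get level 1
        levels[tf] = 1 + max((level(t) for t in targets[tf]), default=0)
        in_progress.discard(tf)
        return levels[tf]

    for tf in targets:
        level(tf)
    return levels


# Hypothetical toy hierarchy: A regulates B and C, B regulates C,
# and C regulates only itself.
levels = hierarchy_levels([("A", "B"), ("A", "C"), ("B", "C"), ("C", "C")])
```

In the toy example C sits at the bottom (level 1), B in the middle (level 2) and A at the top (level 3). The recursive formulation is adequate for small examples; a large network would call for an iterative traversal.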
While a number of other properties of a network, such as diameter and graph density, 
have also been defined in network biology (see Table 1-1), they are not of immediate 
relevance to the work discussed in this thesis and hence are not discussed in detail. 
1.2 NETWORKS IN MOLECULAR BIOLOGY 
1.2.1 Methods to construct transcriptional regulatory networks 
At an abstract level, regulatory interactions linking TFs to their transcriptionally controlled target 
genes (TGs) in an organism can be viewed as a directed graph, in which the TFs and TGs 
represent the nodes while the regulatory interactions that connect them represent the edges. Typically 
the resulting network is a complex, hierarchical, multilayered graph that can be studied at 
several levels of detail. However, at a more fundamental level, the organization of the transcriptional 
regulatory machinery and the principles involved are considerably different in the two major 
kingdoms of life, bacteria and eukarya. In bacteria, transcription and translation happen in the 
same compartment, i.e., the cytoplasm, and transcriptional control can be considered to be mostly at 
the DNA sequence level through the use of cis-regulatory elements and the organization of 
contiguous genes on the same strand of DNA into operons. In eukaryotic genomes, however, 
the process of transcriptional regulation is highly complex and is co-ordinated at three major 
hierarchical levels. The first is at the DNA sequence level, i.e., the linear organization of 
transcription units and regulatory sequences.  Co-regulated genes organized into clusters in the 
genome constitute part of these individual functional units.  The second is at the chromatin level, 
which allows switching between different functional states, i.e., between a state that suppresses 
transcription and one that is permissive for gene activity. This level involves the changes in the 
chromatin structure that are controlled by the interplay between histone modification, DNA 
methylation, and a variety of repressive and activating mechanisms. This regulatory level is 
linked with the control mechanisms from level one that switch individual genes in the cluster 
on and off, depending on the properties of the promoter. The third level is the nuclear level, 
which includes the dynamic 3D spatial organization of the genome inside the cell nucleus. The 
nucleus is structurally and functionally compartmentalized and epigenetic regulation of gene 
expression may involve repositioning of loci in the nucleus through changes in large-scale 
chromatin structure. All these differences add a layer of complexity and sophisticated control to 
the inherent structure, functionality and dynamics of transcriptional networks in eukarya in 
comparison to their bacterial counterparts. Despite these fundamental differences, several basic 
principles of their organization and structure, viewed from a network perspective, have been shown to be 
similar in both kingdoms (Guelzim et al., 2002; Lee et al., 2002; Milo et al., 2002; Shen-Orr 
et al., 2002; Thieffry et al., 1998; Yu and Gerstein, 2006). 
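The directed-graph view described above can be illustrated with a minimal sketch. It is not part of the analyses in this thesis; the gene names and interactions are hypothetical placeholders, and the adjacency-map representation is only one of several reasonable choices.

```python
from collections import defaultdict

# Hypothetical regulatory interactions (TF -> target gene); all names are placeholders.
interactions = [
    ("tfA", "gene1"),
    ("tfA", "gene2"),
    ("tfA", "tfB"),    # a TF can itself be the target of another TF
    ("tfB", "gene2"),
    ("tfB", "gene3"),
]

def build_network(edges):
    """Return two adjacency maps: targets of each TF and regulators of each target."""
    targets = defaultdict(set)
    regulators = defaultdict(set)
    for tf, tg in edges:
        targets[tf].add(tg)
        regulators[tg].add(tf)
    return targets, regulators

targets, regulators = build_network(interactions)

# Out-degree: number of genes a TF regulates; in-degree: number of TFs regulating a gene.
out_degree = {tf: len(tgs) for tf, tgs in targets.items()}
in_degree = {tg: len(tfs) for tg, tfs in regulators.items()}

print(out_degree)
print(in_degree)
```

Keeping both maps makes it equally cheap to ask which genes a TF controls and which TFs control a gene, the two directions in which such a network is typically queried.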
Despite enormous interest in understanding transcriptional networks across organisms, 
our knowledge of genome-wide transcriptional interaction graphs remains very limited and is 
mostly restricted to model organisms like Escherichia coli and Saccharomyces cerevisiae, for 
which extensive information is available (Gama-Castro et al., 2008; Lee et al., 2002). 
Transcriptional interactions in an organism have traditionally been identified from small-scale 
assays, which are documented in regulatory network databases through extensive manual 
curation efforts (Baumbach et al., 2007; Gama-Castro et al., 2008; Makita et al., 2004; Matys et 
al., 2006), or from high-throughput screens like ChIP-chip or ChIP-seq, which allow 
the identification of regulatory interactions for a large set of TFs in an organism (Grainger et al., 
2005; Lee et al., 2002). Yet another high-throughput approach, of lower resolution, for screening 
the whole genome for the targets of a TF is to knock out the TF gene and perform a whole-genome 
microarray expression analysis (Devaux et al., 2001). Table 1-2 summarizes 
these frequently employed low- and high-throughput experimental techniques for the 
unambiguous identification of regulatory interactions in an organism. 
 
 
 
Table 1-2. Different low- and high-throughput strategies for studying and probing protein-DNA interactions. 
High-throughput technologies such as ChIP-chip, ChIP-seq and PBMs are frequently employed for 
elucidating regulatory networks on a genome-wide scale. 
 
Method: Description 

Band shift: Since DNA molecules are more flexible than proteins, they tend to exhibit much 
higher mobility in a polyacrylamide gel. Thus, under favourable conditions, free DNA 
can be distinguished from DNA bound to proteins due to the difference in molecular 
weight (Garner and Revzin, 1981). 

DNA footprinting: In DNA footprinting, a 5’ end-labeled double-stranded DNA is partially degraded by 
DNase both in the presence and absence of the TF. The degraded fragments are then 
loaded onto a gel and visualized by autoradiography. Since the region where the protein 
has bound the DNA is protected from DNase, no fragments are seen in that 
region. Therefore, by comparing lanes, one can identify the binding site (Galas and 
Schmitz, 1978). 

FRET-based binding site identification: In this method, libraries of double-stranded DNA, each with 
one of two fluorophores attached to its end, are used. Protein binding to two pieces of DNA, one from each 
library and each comprising half of the binding site’s sequence, induces a FRET 
signal, which can then be used to detect protein bound to DNA (Heyduk and Heyduk, 
2002). 

Binding site detection using unnatural base analog: In this approach, a library of DNA sequences 
with an unnatural base analog (one for each base) is used. Following selection for protein-bound DNA 
molecules, the DNA is cleaved specifically at the modified base. The site of incorporation can be identified by 
gel electrophoresis, by running fragments generated from the unbound sample next to the 
fragments generated from the bound sample. Since the presence of an analog in the 
binding site impedes protein binding, this results in a depletion of the protein-bound 
pool (Storek et al., 2002). 

ChIP-chip and ChIP-seq techniques: The DNA-binding protein is tagged with an epitope and is expressed 
in a cell. The bound protein is covalently linked to DNA by using an in vivo cross-linking agent such 
as formaldehyde. After cross-linking, the DNA is sheared and the protein–DNA complex 
is pulled down using an antibody for the tag. Reversal of the cross-link releases the 
bound DNA, allowing the sequence of the fragments to be determined by 
hybridization to a microarray (ChIP-chip) or by sequencing (ChIP-seq). In ChIP-chip 
experiments, intergenic regions are spotted onto a microarray chip. Following a 
chromatin immunoprecipitation step, the bound fragments are reverse cross-linked 
and hybridized onto the microarray chip (Lee et al., 2006). In ChIP-seq experiments, 
the bound fragments are directly sequenced using 454/Solexa/Illumina sequencing 
technology. The sequences are then computationally mapped back to the genome 
sequence (Johnson et al., 2007). 

DNA adenine methyltransferase identification (DamID): In the DamID technique, the protein of interest 
is fused to an E. coli protein, DNA adenine methyltransferase (Dam). Dam methylates the N6 position of 
the adenine in the sequence GATC, which occurs at reasonably high frequency in any genome (1 site in 
256 bases). Upon binding DNA, the Dam protein preferentially methylates adenine in 
the vicinity of binding. Subsequently, the genomic DNA is digested by DpnI, which cleaves within 
methylated GATC sequences, and by DpnII, which cleaves within non-methylated GATC sequences, thereby 
removing fragments that are not methylated. The remaining methylated fragments are 
amplified by selective PCR and quantified using a microarray (Greil et al., 2006). 

Protein binding universal DNA microarrays (PBMs): This is an in vitro method to probe protein–DNA 
interactions. A DNA-binding protein of interest is epitope tagged, purified and bound directly to a 
double-stranded DNA microarray spotted with a large number of potential binding sites. Labeling with 
a fluorophore-conjugated antibody for the tag allows detection of binding sites from the 
significantly bound spots (Bulyk et al., 2004). 
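
The figure of one GATC site per 256 bases quoted for DamID in Table 1-2 follows from a simple calculation, sketched below. The assumption of equiprobable, independent bases is an idealization that real genomes do not meet, and the toy sequence is invented purely for illustration.

```python
def expected_spacing(motif_length, alphabet_size=4):
    """Expected number of bases per motif occurrence, assuming equiprobable independent bases."""
    return alphabet_size ** motif_length

def count_motif(sequence, motif="GATC"):
    """Count occurrences of the motif in the sequence (overlaps allowed)."""
    width = len(motif)
    return sum(
        1
        for i in range(len(sequence) - width + 1)
        if sequence[i:i + width] == motif
    )

# A four-base motif is expected once every 4**4 bases under the idealized model.
print(expected_spacing(4))

# Hypothetical toy sequence, used only to show the counting step.
toy_sequence = "AAGATCTTGATCGATCAA"
print(count_motif(toy_sequence))
```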
 
1.2.2 Methods to construct functional linkage networks 
Traditionally, the function of a protein was defined using a number of low-throughput approaches like 
mutagenesis of residues or whole proteins, which allowed the identification of phenotypes for 
follow-up analysis. However, it is increasingly becoming clear that this rationale is limited in its 
ability to infer the function of proteins, failing for those which exhibit a mild phenotype or those 
which are not expressed under standard experimental conditions. In addition, since most 
proteins associate dynamically with a number of other cellular entities during their lifetime, the 
traditional notion of identifying the function of a protein by isolating it from the rest of the cellular 
machinery can be misleading for a majority of them. This realization, together with the availability of 
experimentally determined protein-protein interaction maps for diverse model organisms, has 
given rise to the use of these datasets for delineating the biological processes, pathways and 
complexes that proteins take part in (Aranda et al., ; Bader et al., 2003; Breitkreutz et al., 2008). 
Indeed, there is now observable overlap and informative variation between different types of 
low- and high-throughput experiments (Shoemaker and Panchenko, 2007), which provides a 
convincing reason for exploiting them as complementary approaches in unraveling the functions 
of proteins. Recent years have also seen an explosion in the number of methods and 
databases which provide functional associations (both direct physical and indirect contextual 
interactions) between proteins using both experimental and computational means. I present an 
extensive list of these resources in Table 4-2 of Chapter 4, wherein I also provide a more in-depth 
discussion of network-based approaches for function prediction. 
Briefly, experimental approaches employed for constructing functional association 
networks mostly comprise data from protein-protein interaction screens, followed by co-expression 
networks consisting of gene pairs showing significant correlation in their expression 
profiles across conditions, derived from microarray datasets (Luo et al., 2007; Ruan et al., ; 
Wang et al., 2009).  More recently, genetic interactions, which measure the fitness defects of 
double mutants compared to those of the individual mutants, are also being employed for 
constructing these functional linkage networks (Butland et al., 2008; Costanzo et al.). These 
high-throughput experimental approaches not only increase the confidence of an association 
but also give the cellular context of the protein, providing a complementary view to the traditional 
function prediction paradigm. 
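As a minimal sketch of how a co-expression network of the kind mentioned above might be derived, the following links gene pairs whose expression profiles are strongly correlated across conditions. The expression values are invented, and the use of the Pearson coefficient with a fixed cut-off of 0.9 is an assumption made for illustration; published studies typically assess statistical significance rather than apply a bare threshold.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (undefined for constant profiles)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Hypothetical expression profiles: one value per condition for each gene.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.1, 3.9, 6.2, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}

def coexpression_edges(profiles, threshold=0.9):
    """Return (gene1, gene2, r) for every pair whose correlation reaches the threshold."""
    edges = []
    for g1, g2 in combinations(sorted(profiles), 2):
        r = pearson(profiles[g1], profiles[g2])
        if r >= threshold:
            edges.append((g1, g2, r))
    return edges

edges = coexpression_edges(expression)
print(edges)
```

Only positively correlated pairs are linked here; whether anti-correlated pairs should also count as functional associations is a separate design choice.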
In addition to the experimental methods, several computational methods have been 
proposed for constructing protein-protein associations from sequence data alone. These include 
the genome context methods, namely gene fusion, gene cluster or gene order conservation, 
operon rearrangement and protein phylogenetic profiles. The gene fusion approach tries to 
detect the fusion of two genes into a single protein-coding gene in one of the sequenced 
genomes and thereby links them through a strong functional association (Enright et al., 1999; 
Marcotte et al., 1999a). The method of gene order conservation aims to identify pairs of genes 
which consistently show a tendency to cluster in immediate vicinity in a number of genomes, 
suggesting a strong functional link in prokaryotic genomes, which are abundant in operons 
(Dandekar et al., 1998; Overbeek et al., 1999). The method of operon rearrangement tries to 
identify a link between any pair of genes in a genome as long as their orthologs are predicted 
to be organized in an operon with high confidence in at least one sequenced genome (Janga 
et al., 2005; Rogozin et al., 2002; Snel et al., 2002). The power of this approach depends on the 
quality of operon prediction methods, which have been shown to reach ~90% 
accuracy in most sequenced genomes (Brouwer et al., 2008; Moreno-Hagelsieb and Collado-
Vides, 2002). Yet another approach, not based on genomic proximity, is that of phylogenetic profiles. In 
this method, a vector recording the presence or absence of a gene across all the analyzed genomes is 
constructed, and the vectors are compared to identify genes which show the most correlated profiles, as a 
measure of functional linkage. The rationale here is that two proteins showing similar profiles, i.e. 
coordinated in their evolutionary gain and loss, are expected to be functionally related 
(Gaasterland and Ragan, 1998; Pellegrini et al., 1999). Modified versions of this approach take 
into account the phylogenetic signal of the genomes employed and/or the redundancy in the 
genome sequence information (Barker and Pagel, 2005; Date and Marcotte, 2003; Moreno-
Hagelsieb and Janga, 2008). 
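The phylogenetic profile method can be sketched as follows. The profiles are hypothetical, and the Jaccard index with a cut-off of 0.8 is a simple stand-in for the correlation measures used in the cited studies (which also correct for phylogenetic signal and genome redundancy); it is an assumption of this sketch, not the published procedure.

```python
from itertools import combinations

# Hypothetical presence (1) / absence (0) of each gene across six genomes.
profiles = {
    "geneA": (1, 1, 0, 1, 0, 1),
    "geneB": (1, 1, 0, 1, 0, 1),
    "geneC": (1, 0, 1, 0, 1, 0),
    "geneD": (1, 1, 0, 1, 1, 1),
}

def jaccard(p, q):
    """Fraction of genomes containing both genes among genomes containing either."""
    both = sum(1 for a, b in zip(p, q) if a and b)
    either = sum(1 for a, b in zip(p, q) if a or b)
    return both / either if either else 0.0

def linked_pairs(profiles, min_similarity=0.8):
    """Return (gene1, gene2, similarity) for pairs with sufficiently similar profiles."""
    pairs = []
    for g1, g2 in combinations(sorted(profiles), 2):
        similarity = jaccard(profiles[g1], profiles[g2])
        if similarity >= min_similarity:
            pairs.append((g1, g2, similarity))
    return pairs

for g1, g2, similarity in linked_pairs(profiles):
    print(g1, g2, round(similarity, 2))
```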
Recently, the integration of different types of interaction data into genome-wide 
functional linkage maps has gained much popularity for functional inference, as these integrated 
maps boost not only the coverage but also the confidence of an association when assessing protein 
function. One of the first studies to demonstrate the power of integrating different types of 
interaction data was by Marcotte and colleagues, who combined diverse kinds of 
computational genome context inferences (Marcotte et al., 1999b). This was followed by a 
number of other methods such as those implemented in the STRING and PROLINKS 
databases, among other focused studies (Bowers et al., 2004; Hu et al., 2009; Jensen et al., 
2009; Massjouni et al., 2006). Typically, in these networks, edge weights correspond to 
integrated interaction probabilities, obtained by first scoring each of the methods 
independently against a set of gold standard interactions and then combining these scores in a Bayesian 
fashion, assuming that the scores obtained by each method are independent of each other. More 
complex methods take into account the dependence and correlation between methods to 
develop a regression model for scoring the integrated interactome (Linghu et al., 2008; Zhao et 
al., 2008). Nevertheless, all of them boil down to constructing a network with either weighted or 
unweighted edges, which is then used for propagating annotations to uncharacterized 
members using the network-based approaches discussed in Chapter 4. 
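The Bayesian integration under the independence assumption described above can be sketched as a naive Bayes combination of likelihood ratios. The numbers below are hypothetical, as is the assumption that each method's score has already been converted into a likelihood ratio by benchmarking against a gold standard; this is not the exact scoring scheme of STRING, PROLINKS or any other cited resource.

```python
def integrate_evidence(likelihood_ratios, prior=0.01):
    """Combine independent lines of evidence into a posterior interaction probability.

    Each likelihood ratio is P(evidence | true interaction) / P(evidence | no interaction),
    assumed to have been estimated beforehand against gold standard interactions.
    """
    odds = prior / (1.0 - prior)          # prior odds of a true functional link
    for ratio in likelihood_ratios:
        odds *= ratio                     # independence lets the ratios simply multiply
    return odds / (1.0 + odds)            # convert posterior odds back to a probability

# Hypothetical likelihood ratios for one protein pair from three methods,
# e.g. gene fusion, gene order conservation and phylogenetic profiles.
evidence = [20.0, 8.0, 3.0]

single_best = integrate_evidence([max(evidence)])
combined = integrate_evidence(evidence)
print(round(single_best, 3), round(combined, 3))
```

Because the ratios multiply, several moderately informative methods can together support an association more strongly than the single best method, which is the gain in confidence that integration is meant to deliver. When the methods are correlated, this product overstates the evidence.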
1.2.3 Methods to construct post-transcriptional regulatory networks 
Gene expression is a highly controlled process which is known to be regulated at several levels in 
eukaryotic organisms. Although messenger RNAs have traditionally been viewed as passive 
molecules in the pathway from transcription to translation, there is increasing evidence that their 
metabolism is controlled by a class of proteins called RNA-binding proteins (RBPs) (Glisovic et 
al., 2008; Keene, 2007; Mata et al., 2005). In eukaryotes, since transcription and translation 
occur in different compartments, there is a plethora of options to control RNA at the 
post-transcriptional level, including splicing, polyadenylation, transport, mRNA stability, 
localization and translational control (Glisovic et al., 2008; Keene, 2007). Although some early 
studies revealed the involvement of RBPs in the transport of mRNA from the nucleus to the site of 
translation, increasing evidence now suggests that RBPs regulate almost all of the 
post-transcriptional steps.  
The development of several high-throughput approaches has increased the amount of data 
on the targets of RBPs in diverse organisms (see Table 5-3 in Chapter 5 for a detailed overview of 
these methods and techniques, which are not discussed here to avoid 
redundancy). These data on RBPs and their targets can be used to construct an RBP-RNA 
interaction network, which is also typically referred to as a post-transcriptional regulatory network. 
This post-transcriptional network is represented as a directed network, with each 
edge corresponding to a regulatory link between two nodes (an RBP and its target RNA), similar to 
the directed networks discussed above for transcriptional regulation. In this directed 
network, one set of nodes comprises the RBPs, which act as the regulators, while the other set 
comprises the target RNAs, encoded by either protein-coding or non-protein-coding genes. 
A regulator node and a target node are joined by an arrow 
starting from the regulator node and pointing towards the target node. The target RNAs may encode 
functionally diverse proteins, including other RBPs. This network can also contain loops, i.e. a link 
starting from an RBP and targeting itself, typically referred to as autoregulation of an RBP. Such a 
loop indicates that the RBP can bind to its own RNA and control its metabolism at 
the transcript level. There are several examples suggesting the autoregulation of RBPs at the 
post-transcriptional level. For instance, in humans, RBPs such as AUF1, HuR, KSRP, NF90, TIA-1 
and TIAR were reported to associate with their own mRNA and other RBPs (Pullmann et al., 
2007).  
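The structure of such a post-transcriptional network, including autoregulatory loops, can be sketched in a few lines. The identifiers are hypothetical, and using the gene identifier for both an RBP and its own transcript (so that autoregulation appears as a self-loop) is a representational assumption of this sketch.

```python
# Hypothetical RBP -> target RNA links; each identifier denotes a gene, standing
# for the RBP when it is the source and for the gene's transcript when it is the target.
interactions = [
    ("rbp1", "rbp1"),    # autoregulation: rbp1 binds its own transcript
    ("rbp1", "rbp2"),    # an RBP targeting the transcript of another RBP
    ("rbp1", "geneX"),
    ("rbp2", "geneX"),
    ("rbp2", "ncRNA1"),  # target transcribed from a non-protein-coding gene
]

def autoregulated(edges):
    """RBPs that have a self-loop, i.e. that target their own transcript."""
    return sorted({source for source, target in edges if source == target})

def cross_regulation(edges):
    """Links in which one RBP targets the transcript of a different RBP."""
    rbps = {source for source, _ in edges}
    return sorted(
        (source, target)
        for source, target in edges
        if target in rbps and source != target
    )

print(autoregulated(interactions))
print(cross_regulation(interactions))
```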
Due to the availability of the network of post-transcriptional interactions for a 
considerable fraction of RBPs in model systems such as S. cerevisiae (Hogan et al., 2008), it 
has become possible to address several questions concerning the structure and organization of 
post-transcriptional networks directed by RBPs. Chapter 5 focuses on studying these properties 
by directly analyzing the currently available post-transcriptional regulatory network in the 
budding yeast.  
1.2.4 Methods to construct other classes of cellular and biological 
networks 
The development of several high-throughput approaches in the last decade has not only increased 
the amount of information that we can gather to reveal important insights into the 
transcriptional, post-transcriptional or functional organization of an organism, but has also 
enabled us to start our journey to uncover the principles which hold them together. This is 
mainly because of the extent of information that it has become possible to collect by 
interrogating the cell’s environment at different levels of detail. For instance, 
modern techniques now enable us to identify the set of protein-protein interactions, genetic 
interactions, metabolic maps and small molecule interactions at a whole-organism level. While a 
complete discussion of all the methods and techniques used to identify their respective 
interactomes is beyond the scope of this thesis, I outline below some of the commonly 
employed approaches for identifying the interaction graphs for each of these types of 
interactions occurring in the cell.   
Perhaps the most common interaction graphs, studied since the 
early days of genome sequencing, are those of protein interactions. A number of approaches for studying 
them have been reported in the literature, and these include the yeast two-hybrid (Y2H) (Fields 
and Song, 1989), protein fragment complementation assay (PCA) (Pelletier et al., 1998), affinity 
purification coupled with mass spectrometry (AP-MS) (Babu et al., 2009a; Babu et al., 2009b; 
Gavin et al., 2002), protein chips (Fasolo and Snyder, 2009; Kung and Snyder, 2006), phage 
display (McCafferty et al., 1990), fluorescence resonance energy transfer (FRET) (Jares-Erijman and Jovin, 
2003) and surface plasmon resonance (SPR) (Slavik and Homola, 2006). For a more extensive 
discussion on the protocols and methods for identifying protein interactions as well as for new 
developments in this area the reader is referred to recent reviews (Levy and Pereira-Leal, 2008; 
Shoemaker and Panchenko, 2007). 
Another class of networks which are commonly studied is that of metabolic networks. 
These represent the metabolites and the enzymes catalyzing metabolic 
reactions as the nodes and edges of a directed network. Most of the work on understanding 
metabolic networks relies on either manually curated or semi-automated metabolic databases 
such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) and MetaCyc, which are 
available for a wide range of model organisms (Caspi et al., 2008; Grossetete et al., ; Kanehisa 
et al., 2008). In addition to the metabolic maps available for diverse organisms, several groups 
also study and compile the metabolic reactions for a model organism of interest, which are then 
used for follow-up analysis of the metabolic circuitry (Duarte et al., 2007; Durot et al., 2009; Ma 
et al., 2007).  
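One way to encode a metabolic network as a directed graph is sketched below, with metabolites as nodes and each substrate-to-product edge labeled by the catalyzing enzyme. The reactions are hypothetical placeholders rather than entries from KEGG or MetaCyc, and other encodings (for example a bipartite graph of metabolites and reactions) are equally common.

```python
# Hypothetical reactions: (enzyme, substrates, products).
reactions = [
    ("enzyme1", ["metA"], ["metB"]),
    ("enzyme2", ["metB", "metC"], ["metD"]),
    ("enzyme3", ["metD"], ["metA"]),
]

def metabolite_graph(reactions):
    """Map each substrate to a list of (product, enzyme) edges."""
    graph = {}
    for enzyme, substrates, products in reactions:
        for substrate in substrates:
            for product in products:
                graph.setdefault(substrate, []).append((product, enzyme))
    return graph

def reachable(graph, start):
    """Metabolites that can be produced from start by following directed edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for product, _enzyme in graph.get(node, []):
            if product not in seen:
                seen.add(product)
                stack.append(product)
    return seen

graph = metabolite_graph(reactions)
print(sorted(reachable(graph, "metA")))
```

Note that the traversal treats every substrate as sufficient on its own, ignoring the requirement that all substrates of a reaction be present; it illustrates graph connectivity rather than true metabolic feasibility.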
 Organisms respond to continuous variations in internal and external cellular conditions 
by orchestrating their responses depending on the environmental challenges they are faced 
with. This involves the usage of a complex network of interactions among different proteins, 
RNA, metabolites and several other cellular entities, which undergo rewiring when perturbed by 
small molecules such as chemicals or drugs. The interactions between different chemicals and 
cellular entities can be represented in the form of a network, the so-called Drug-Target network. 
Recent years have seen the development of a number of approaches, both computational and 
experimental, for the identification and elucidation of the molecular targets of a drug on a 
genomic scale (Apsel et al., 2008; Brewerton, 2008; Fabian et al., 2005; Hillenmeyer et al., 
2008; Ho et al., 2009; Jacob and Vert, 2008; Kuhn et al., 2008; Paolini et al., 2006; Whitehurst 
et al., 2007; Yamanishi et al., 2008). This cellular target space, which contains the targets of 
drugs, can be considered to predominantly comprise three components, namely protein-protein, 
metabolic and transcriptional interaction networks. While the vast majority of drugs 
target the protein-protein and metabolic components, only a limited number of targets have been 
identified to date for the transcriptional pool (Brennan et al., 2008; Goh et al., 2007; Lage et al., 
2007; Lee et al., 2008; Yildirim et al., 2007). Indeed, the most common therapeutic targets for 
established drugs belong to either protein kinase or receptor families, with enzymes and ion 
channels forming the second most predominant class of targets (Wishart et al., 2008). This 
explains the increased attention towards understanding the biophysics of 
protein-protein contacts in the context of drug targets, as these protein classes form major 
players in protein-protein interactions (Archakov et al., 2003). Table 1-3 shows different 
methods which are used for the construction of Drug-Target networks; these can be broadly 
classified into genetics-based, proteomics-based and knowledge-driven approaches. 
 
Table 1-3. Different methods available for identifying drug targets on a genomic scale. Methods can be 
broadly classified into Proteomics-based, Genetics-based and Knowledge-driven. Although most methods 
traditionally are based on experimental screening, there is an increase in the number of computational 
techniques available for small molecule target discovery (grouped as knowledge-driven approaches). 
 
 
Proteomics-based Methods: Description 

Activity-based protein profiling (ABPP) (Speers and Cravatt, 2004): This is a functional proteomic 
technology that uses chemical probes that react 
with mechanistically related classes of enzymes. The basic unit of ABPP is a 
probe that typically consists of a reactive group (electrophile or a photoreactive 
group) that covalently binds to the active site of an enzyme (nucleophilic 
residue) and a tag. The tag can either be a reporter (i.e. fluorophore, 
radioactive group) or a handle (i.e. affinity tags such as biotin). A tag-free 
strategy for activity-based protein profiling has also been introduced that 
utilizes the copper(I)-catalyzed azide-alkyne cycloaddition reaction (click 
chemistry) and gives the advantage of not interfering with biological activity or 
binding affinities of the probes. The activity-based protein profiling and 
multidimensional protein identification technologies (ABPP-MudPIT) can 
provide profiling of inhibitor selectivity, as the potency of an inhibitor can be 
tested against hundreds of targets simultaneously (Jessani et al., 2005). 

Affinity chromatography (Katayama and Oda, 2007): This is a protein separation method based on 
the interaction between target 
proteins and specific immobilized ligands. Traditionally, the ligand is tethered 
on a solid support via a spacer arm followed by the addition of a cellular lysate 
or tissue extract. Only target proteins binding tightly to the ligand are 
selectively purified, eluted off (denaturation or competition with free ligand) and 
subsequently identified by mass spectrometry. To minimize the identification of 
nonspecifically bound proteins, the protein profile that is obtained with an 
inactive ligand analogue is also determined and compared with the relevant 
profile, determined with the desired analogue. More recently, an improved 
method for the identification of proteins that can bind to small molecules and 
drugs has been established which uses quantitative mass spectrometry (MS)-based 
proteomics (utilizing stable isotope labeling with amino acids in cell 
culture (SILAC)) and affinity chromatography (Ong et al., 2009). 

Microarrays (Kingsmore, 2006; Ma and Horiuchi, 2006; Salcius et al., 2007; Wingren and 
Borrebaeck, 2006): Microarrays in drug target discovery provide miniaturized high-throughput tools 
to study binding of specific molecules to immobilized proteins or small 
molecules. In protein microarrays, different recombinant proteins or antibodies 
that are immobilized on a solid substrate are exposed to a drug solution to 
identify the target protein(s) which can bind to the small molecule. In chemical 
microarrays, immobilized drug compounds can be screened for candidate 
drug-target interactions with purified proteins (Ma and Horiuchi, 2006). When 
the target protein is known, small molecule arrays can also be used to identify 
off-target interactions that could have implications for side-effects. 

Genetics-based Methods: Description 

Synthetic lethality/Gene knock-out (Hillenmeyer et al., 2008; Ho et al., 2009): Single gene 
knock-out strains on a genomic scale or for a selected set are 
exposed to small molecules at different concentrations to evaluate the fitness 
defects and fitness levels are compared to wild-type populations exposed to 
the same conditions. This provides an easy means to identify targets on a 
large scale (Hillenmeyer et al., 2008; Ho et al., 2009). 

RNAi: RNA interference pathways in mammalian systems are used for silencing 
genes and similar approaches as above are employed to study the fitness 
defects of cell lines to identify potential drug targets in higher eukaryotes 
(Turner et al., 2008; Whitehurst et al., 2007). 

Knowledge-driven approaches: Description 

Literature-derived interactions (Chen et al., 2008; Frijters et al., 2007; Tsui et al., 
2007; Yildirim et al., 2007): In these approaches, manually curated sets of interactions are obtained from 
the literature to generate a high-confidence set of drug-target relationships to 
either study their overall structure (Yildirim et al., 2007) or focus on a specific 
disease of interest (Chen et al., 2008; Frijters et al., 2007; Tsui et al., 2007; 
Yildirim et al., 2007). 

Network-based approaches (Apsel et al., 2008; Hopkins, 2008): In these approaches, literature-derived 
interactions are exploited to predict 
new interactions based on the principles governing the structure of the 
networks so that new disease targets are identified using comparative 
genomics or other informatics-based methods followed possibly by 
experiments to improve the chemicals (Apsel et al., 2008; Hopkins, 2008). 

In silico chemogenomics (Rognan, 2007): In predictive chemogenomics one predicts relationships between 
genes/proteins and compounds. In silico approaches that are used can be 
classified into ligand-based approaches (ligand comparison for target 
prediction), target-based approaches (target comparison for ligand prediction) 
or ligand-target based approaches (Rognan, 2007). 
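
The comparison of knock-out and wild-type fitness described for the genetics-based methods in Table 1-3 can be sketched as a simple scoring step. The growth values, the log2-ratio score and the cut-off of 1.0 are all assumptions made for illustration; they are not the scoring procedures of the cited screens.

```python
from math import log2

def fitness_defect(wild_type_growth, mutant_growth):
    """log2 ratio of wild-type to knock-out growth under drug; larger means more sensitive."""
    return log2(wild_type_growth / mutant_growth)

# Hypothetical relative growth of knock-out strains exposed to the same drug dose.
growth_with_drug = {"geneA_ko": 0.25, "geneB_ko": 0.95, "geneC_ko": 0.5}
wild_type_growth = 1.0

def sensitive_strains(growth, wild_type, min_defect=1.0):
    """Strains whose fitness defect reaches the cut-off, most sensitive first."""
    scores = {strain: fitness_defect(wild_type, g) for strain, g in growth.items()}
    hits = [strain for strain, score in scores.items() if score >= min_defect]
    return sorted(hits, key=lambda strain: -scores[strain])

print(sensitive_strains(growth_with_drug, wild_type_growth))
```

Strains flagged in this way only point to genes whose loss sensitizes the cell to the compound; relating them to the actual molecular target requires the follow-up analyses described in the cited studies.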
 
1.3 OUTLINE OF THE THESIS 
Now that I have summarized the different tools which can quantify networks at varying levels 
and the numerous kinds of interaction graphs operating in the cell, it is clear that there are enormous 
possibilities to understand a cell’s internal organization and dynamics. In the following chapters, 
I will attempt to address some of these open questions.   
In particular, in Chapter 2, I address the questions of how and why genes are organized in 
a particular fashion on bacterial genomes and what constraints bacterial transcriptional 
regulatory networks impose on their genomic organization. I then extend this one step further to 
unravel the constraints imposed on the network of TF-TF interactions and relate it to the 
numerous phenotypes they can impart to growing bacterial populations.  
In contrast to prokaryotes, regulation of gene expression in eukaryotes is much more 
complex and is known to occur at many different levels even at the stage of transcription. In 
Chapter 3, I first present an overview of our current understanding of eukaryotic gene regulation 
at different levels and then present evidence for the existence of a higher-order organization of 
genes across and within chromosomes that is constrained by transcriptional regulation. These 
results demonstrate that a specific organization of genes across and within chromosomes, one that 
allows for efficient control of transcription within the nuclear space, has been selected during 
evolution. 
Determining the functions of proteins encoded by genome sequences represents a 
major challenge in contemporary biology. With traditional methods for annotation of a genome 
reaching saturation, there is an increasing need to develop alternative and complementary 
approaches for solving the genomic function prediction challenge. As a result, alternate 
computational methods for inferring the protein function such as those which exploit the context 
of a protein in protein association networks have come to be sought after. These network-based 
approaches aim to integrate diverse kinds of functional interactions as a means of boosting 
coverage as well as confidence level of an association. In Chapter 4, I first present an overview 
of different computational approaches for inferring the function of uncharacterized genes and 
discuss network-based approaches currently employed for predicting function. I then summarize 
a recent high-throughput study performed to provide a ‘systems-wide’ functional blueprint of the 
bacterial model, Escherichia coli K-12, with insights into the biological and evolutionary 
significance of previously uncharacterized proteins. Given the volume of high-throughput data 
that is being reported for understanding diverse model systems, the network-based approaches 
presented here would be a useful addition for unraveling the functions of an increasing 
number of uncharacterized proteins accumulating in the genomic databases. 
While control of gene expression in eukaryotes first occurs at the level of transcription, 
there is accumulating evidence that RNA-binding proteins play major roles in controlling the 
expression of a protein by regulating expression at the post-transcriptional level. In Chapter 5, I 
attempt to provide a comprehensive overview and preliminary insights on this rapidly developing 
area of post-transcriptional regulatory networks formed by RBPs. I discuss the sequence 
attributes and functional processes associated with RBPs, methods used for the construction of 
the networks formed by them and finally discuss the structure and dynamics of these post-
transcriptional networks based on recent publicly available data. The results obtained from this 
study show that RBPs exhibit distinct gene expression dynamics compared to other classes of 
proteins in a eukaryotic cell and that these properties are also reflected in an analysis of the 
post-transcriptional networks formed by them. 
In Chapter 6, I first summarize the key findings of all the above chapters and then 
discuss their broader implications in light of recent findings.  
 
REFERENCES 
Alon, U. (2003). Biological networks: the tinkerer as an engineer. Science 301, 1866-7. 
Apsel, B., Blair, J. A., Gonzalez, B., Nazif, T. M., Feldman, M. E., Aizenstein, B., Hoffman, 
R., Williams, R. L., Shokat, K. M. and Knight, Z. A. (2008). Targeted polypharmacology: 
discovery of dual inhibitors of tyrosine and phosphoinositide kinases. Nature Chemical Biology 
4, 691-699. 
Aranda, B., Achuthan, P., Alam-Faruque, Y., Armean, I., Bridge, A., Derow, C., Feuermann, 
M., Ghanbarian, A. T., Kerrien, S., Khadake, J. et al. (2010). The IntAct molecular interaction 
database in 2010. Nucleic Acids Res 38, D525-31. 
Archakov, A. I., Govorun, V. M., Dubanov, A. V., Ivanov, Y. D., Veselovsky, A. V., Lewi, P. 
and Janssen, P. (2003). Protein-protein interactions as a target for drugs in proteomics. 
Proteomics 3, 380-91. 
Babu, M., Butland, G., Pogoutse, O., Li, J., Greenblatt, J. F. and Emili, A. (2009a). 
Sequential peptide affinity purification system for the systematic isolation and identification of 
protein complexes from Escherichia coli. Methods Mol Biol 564, 373-400. 
Babu, M., Krogan, N. J., Awrey, D. E., Emili, A. and Greenblatt, J. F. (2009b). Systematic 
characterization of the protein interaction network and protein complexes in Saccharomyces 
cerevisiae using tandem affinity purification and mass spectrometry. Methods Mol Biol 548, 187-
207. 
Babu, M. M., Luscombe, N. M., Aravind, L., Gerstein, M. and Teichmann, S. A. (2004). 
Structure and evolution of transcriptional regulatory networks. Curr Opin Struct Biol 14, 283-91. 
Bader, G. D., Betel, D. and Hogue, C. W. (2003). BIND: the Biomolecular Interaction Network 
Database. Nucleic Acids Res 31, 248-50. 
Bar-Joseph, Z., Gerber, G. K., Lee, T. I., Rinaldi, N. J., Yoo, J. Y., Robert, F., Gordon, D. B., 
Fraenkel, E., Jaakkola, T. S., Young, R. A. et al. (2003). Computational discovery of gene 
modules and regulatory networks. Nat Biotechnol 21, 1337-42. 
Barabasi, A. L. and Albert, R. (1999). Emergence of scaling in random networks. Science 286, 
509-12. 
Barabasi, A. L. and Oltvai, Z. N. (2004). Network biology: understanding the cell's functional 
organization. Nat Rev Genet 5, 101-13. 
Barker, D. and Pagel, M. (2005). Predicting functional gene links from phylogenetic-statistical 
analyses of whole genomes. PLoS Comput Biol 1, e3. 
Baumbach, J., Wittkop, T., Rademacher, K., Rahmann, S., Brinkrolf, K. and Tauch, A. 
(2007). CoryneRegNet 3.0--an interactive systems biology platform for the analysis of gene 
regulatory networks in corynebacteria and Escherichia coli. J Biotechnol 129, 279-89. 
Bowers, P. M., Pellegrini, M., Thompson, M. J., Fierro, J., Yeates, T. O. and Eisenberg, D. 
(2004). Prolinks: a database of protein functional linkages derived from coevolution. Genome 
Biol 5, R35. 
Brandes, U. (2001). A Faster Algorithm for Betweenness Centrality. Journal of Mathematical 
Sociology 25, 163-177. 
Breitkreutz, B. J., Stark, C., Reguly, T., Boucher, L., Breitkreutz, A., Livstone, M., 
Oughtred, R., Lackner, D. H., Bahler, J., Wood, V. et al. (2008). The BioGRID Interaction 
Database: 2008 update. Nucleic Acids Res 36, D637-40. 
Brennan, P., Donev, R. and Hewamana, S. (2008). Targeting transcription factors for 
therapeutic benefit. Mol Biosyst 4, 909-19. 
Brewerton, S. C. (2008). The use of protein-ligand interaction fingerprints in docking. Curr Opin 
Drug Discov Devel 11, 356-64. 
Brouwer, R. W., Kuipers, O. P. and van Hijum, S. A. (2008). The relative value of operon 
predictions. Brief Bioinform 9, 367-75. 
Bulyk, M. L., McGuire, A. M., Masuda, N. and Church, G. M. (2004). A motif co-occurrence 
approach for genome-wide prediction of transcription-factor-binding sites in Escherichia coli. 
Genome Res 14, 201-8. 
Butland, G., Babu, M., Diaz-Mejia, J. J., Bohdana, F., Phanse, S., Gold, B., Yang, W., Li, J., 
Gagarinova, A. G., Pogoutse, O. et al. (2008). eSGA: E. coli synthetic genetic array analysis. 
Nat Methods 5, 789-95. 
Caspi, R., Foerster, H., Fulcher, C. A., Kaipa, P., Krummenacker, M., Latendresse, M., 
Paley, S., Rhee, S. Y., Shearer, A. G., Tissier, C. et al. (2008). The MetaCyc Database of 
metabolic pathways and enzymes and the BioCyc collection of Pathway/Genome Databases. 
Nucleic Acids Res 36, D623-31. 
Chen, E. S., Hripcsak, G., Xu, H., Markatou, M. and Friedman, C. (2008). Automated 
acquisition of disease drug knowledge from biomedical and clinical documents: an initial study. 
J Am Med Inform Assoc 15, 87-98. 
Conant, G. C. and Wagner, A. (2003). Convergent evolution of gene circuits. Nat Genet 34, 
264-6. 
Costanzo, M., Baryshnikova, A., Bellay, J., Kim, Y., Spear, E. D., Sevier, C. S., Ding, H., 
Koh, J. L., Toufighi, K., Mostafavi, S. et al. (2010). The genetic landscape of a cell. Science 
327, 425-31. 
Dandekar, T., Snel, B., Huynen, M. and Bork, P. (1998). Conservation of gene order: a 
fingerprint of proteins that physically interact. Trends Biochem Sci 23, 324-8. 
Date, S. V. and Marcotte, E. M. (2003). Discovery of uncharacterized cellular systems by 
genome-wide analysis of functional linkages. Nat Biotechnol 21, 1055-62. 
Devaux, F., Marc, P., Bouchoux, C., Delaveau, T., Hikkel, I., Potier, M. C. and Jacq, C. 
(2001). An artificial transcription activator mimics the genome-wide properties of the yeast Pdr1 
transcription factor. EMBO Rep 2, 493-8. 
Dobrin, R., Beg, Q. K., Barabasi, A. L. and Oltvai, Z. N. (2004). Aggregation of topological 
motifs in the Escherichia coli transcriptional regulatory network. BMC Bioinformatics 5, 10. 
Duarte, N. C., Becker, S. A., Jamshidi, N., Thiele, I., Mo, M. L., Vo, T. D., Srivas, R. and 
Palsson, B. O. (2007). Global reconstruction of the human metabolic network based on 
genomic and bibliomic data. Proc Natl Acad Sci U S A 104, 1777-82. 
Durot, M., Bourguignon, P. Y. and Schachter, V. (2009). Genome-scale models of bacterial 
metabolism: reconstruction and applications. FEMS Microbiol Rev 33, 164-90. 
Enright, A. J., Iliopoulos, I., Kyrpides, N. C. and Ouzounis, C. A. (1999). Protein interaction 
maps for complete genomes based on gene fusion events. Nature 402, 86-90. 
Fabian, M. A., Biggs, W. H., 3rd, Treiber, D. K., Atteridge, C. E., Azimioara, M. D., 
Benedetti, M. G., Carter, T. A., Ciceri, P., Edeen, P. T., Floyd, M. et al. (2005). A small 
molecule-kinase interaction map for clinical kinase inhibitors. Nat Biotechnol 23, 329-36. 
Farkas, I. J., Jeong, H., Vicsek, T., Barabasi, A. L. and Oltvai, Z. N. (2003). The topology of 
the transcription regulatory network in the yeast, Saccharomyces cerevisiae. Physica A 318, 
601-612. 
Fasolo, J. and Snyder, M. (2009). Protein microarrays. Methods Mol Biol 548, 209-22. 
Fell, D. A. and Wagner, A. (2000). The small world of metabolism. Nat Biotechnol 18, 1121-2. 
Fields, S. and Song, O. (1989). A novel genetic system to detect protein-protein interactions. 
Nature 340, 245-6. 
Frijters, R., Verhoeven, S., Alkema, W., van Schaik, R. and Polman, J. (2007). Literature-
based compound profiling: application to toxicogenomics. Pharmacogenomics 8, 1521-34. 
Gaasterland, T. and Ragan, M. A. (1998). Microbial genescapes: phyletic and functional 
patterns of ORF distribution among prokaryotes. Microb Comp Genomics 3, 199-217. 
Galas, D. J. and Schmitz, A. (1978). DNAse footprinting: a simple method for the detection of 
protein-DNA binding specificity. Nucleic Acids Res 5, 3157-70. 
Gama-Castro, S., Jimenez-Jacinto, V., Peralta-Gil, M., Santos-Zavaleta, A., Penaloza-
Spinola, M. I., Contreras-Moreira, B., Segura-Salazar, J., Muniz-Rascado, L., Martinez-
Flores, I., Salgado, H. et al. (2008). RegulonDB (version 6.0): gene regulation model of 
Escherichia coli K-12 beyond transcription, active (experimental) annotated promoters and 
Textpresso navigation. Nucleic Acids Res 36, D120-4. 
Garner, M. M. and Revzin, A. (1981). A gel electrophoresis method for quantifying the binding 
of proteins to specific DNA regions: application to components of the Escherichia coli lactose 
operon regulatory system. Nucleic Acids Res 9, 3047-60. 
Gavin, A. C., Bosche, M., Krause, R., Grandi, P., Marzioch, M., Bauer, A., Schultz, J., Rick, 
J. M., Michon, A. M., Cruciat, C. M. et al. (2002). Functional organization of the yeast 
proteome by systematic analysis of protein complexes. Nature 415, 141-7. 
Giot, L., Bader, J. S., Brouwer, C., Chaudhuri, A., Kuang, B., Li, Y., Hao, Y. L., Ooi, C. E., 
Godwin, B., Vitols, E. et al. (2003). A protein interaction map of Drosophila melanogaster. 
Science 302, 1727-36. 
Glisovic, T., Bachorik, J. L., Yong, J. and Dreyfuss, G. (2008). RNA-binding proteins and 
post-transcriptional gene regulation. FEBS Lett 582, 1977-86. 
Goh, K. I., Cusick, M. E., Valle, D., Childs, B., Vidal, M. and Barabasi, A. L. (2007). The 
human disease network. Proc Natl Acad Sci U S A 104, 8685-90. 
Grainger, D. C., Hurd, D., Harrison, M., Holdstock, J. and Busby, S. J. (2005). Studies of the 
distribution of Escherichia coli cAMP-receptor protein and RNA polymerase along the E. coli 
chromosome. Proc Natl Acad Sci U S A 102, 17693-8. 
Greil, F., Moorman, C. and van Steensel, B. (2006). DamID: mapping of in vivo protein-
genome interactions using tethered DNA adenine methyltransferase. Methods Enzymol 410, 
342-59. 
Grossetete, S., Labedan, B. and Lespinet, O. (2010). FUNGIpath: a tool to assess fungal 
metabolic pathways predicted by orthology. BMC Genomics 11, 81. 
Guelzim, N., Bottani, S., Bourgine, P. and Kepes, F. (2002). Topological and causal structure 
of the yeast transcriptional regulatory network. Nat Genet 31, 60-3. 
Hartwell, L. H., Hopfield, J. J., Leibler, S. and Murray, A. W. (1999). From molecular to 
modular cell biology. Nature 402, C47-52. 
Heyduk, T. and Heyduk, E. (2002). Molecular beacons for detecting DNA binding proteins. Nat 
Biotechnol 20, 171-6. 
Hillenmeyer, M. E., Fung, E., Wildenhain, J., Pierce, S. E., Hoon, S., Lee, W., Proctor, M., 
St Onge, R. P., Tyers, M., Koller, D. et al. (2008). The chemical genomic portrait of yeast: 
uncovering a phenotype for all genes. Science 320, 362-5. 
Ho, C. H., Magtanong, L., Barker, S. L., Gresham, D., Nishimura, S., Natarajan, P., Koh, J. 
L., Porter, J., Gray, C. A., Andersen, R. J. et al. (2009). A molecular barcoded yeast ORF 
library enables mode-of-action analysis of bioactive compounds. Nat Biotechnol 27, 369-77. 
Hogan, D. J., Riordan, D. P., Gerber, A. P., Herschlag, D. and Brown, P. O. (2008). Diverse 
RNA-binding proteins interact with functionally related sets of RNAs, suggesting an extensive 
regulatory system. PLoS Biol 6, e255. 
Hopkins, A. L. (2008). Network pharmacology: the next paradigm in drug discovery. Nature 
Chemical Biology 4, 682-690. 
Hu, P., Janga, S. C., Babu, M., Diaz-Mejia, J. J., Butland, G., Yang, W., Pogoutse, O., Guo, 
X., Phanse, S., Wong, P. et al. (2009). Global functional atlas of Escherichia coli 
encompassing previously uncharacterized proteins. PLoS Biol 7, e96. 
Ihmels, J., Bergmann, S. and Barkai, N. (2004). Defining transcription modules using large-
scale gene expression data. Bioinformatics 20, 1993-2003. 
Ihmels, J., Friedlander, G., Bergmann, S., Sarig, O., Ziv, Y. and Barkai, N. (2002). Revealing 
modular organization in the yeast transcriptional network. Nat Genet 31, 370-7. 
Jacob, L. and Vert, J. P. (2008). Protein-ligand interaction prediction: an improved 
chemogenomics approach. Bioinformatics 24, 2149-56. 
Janga, S. C. and Collado-Vides, J. (2007). Structure and evolution of gene regulatory 
networks in microbial genomes. Res Microbiol 158, 787-94. 
Janga, S. C., Collado-Vides, J. and Moreno-Hagelsieb, G. (2005). Nebulon: a system for the 
inference of functional relationships of gene products from the rearrangement of predicted 
operons. Nucleic Acids Res 33, 2521-30. 
Jares-Erijman, E. A. and Jovin, T. M. (2003). FRET imaging. Nat Biotechnol 21, 1387-95. 
Jensen, L. J., Kuhn, M., Stark, M., Chaffron, S., Creevey, C., Muller, J., Doerks, T., Julien, 
P., Roth, A., Simonovic, M. et al. (2009). STRING 8--a global view on proteins and their 
functional interactions in 630 organisms. Nucleic Acids Res 37, D412-6. 
Jeong, H., Mason, S. P., Barabasi, A. L. and Oltvai, Z. N. (2001). Lethality and centrality in 
protein networks. Nature 411, 41-2. 
Jeong, H., Tombor, B., Albert, R., Oltvai, Z. N. and Barabasi, A. L. (2000). The large-scale 
organization of metabolic networks. Nature 407, 651-4. 
Jessani, N., Niessen, S., Wei, B. Q., Nicolau, M., Humphrey, M., Ji, Y., Han, W., Noh, D. Y., 
Yates, J. R., 3rd, Jeffrey, S. S. et al. (2005). A streamlined platform for high-content functional 
proteomics of primary human specimens. Nat Methods 2, 691-7. 
Johnson, D. S., Mortazavi, A., Myers, R. M. and Wold, B. (2007). Genome-wide mapping of 
in vivo protein-DNA interactions. Science 316, 1497-502. 
Junker, B. H., Koschutzki, D. and Schreiber, F. (2006). Exploration of biological network 
centralities with CentiBiN. BMC Bioinformatics 7, 219. 
Kanehisa, M., Araki, M., Goto, S., Hattori, M., Hirakawa, M., Itoh, M., Katayama, T., 
Kawashima, S., Okuda, S., Tokimatsu, T. et al. (2008). KEGG for linking genomes to life and 
the environment. Nucleic Acids Res 36, D480-4. 
Katayama, H. and Oda, Y. (2007). Chemical proteomics for drug discovery based on 
compound-immobilized affinity chromatography. J Chromatogr B Analyt Technol Biomed Life 
Sci 855, 21-7. 
Kauffman, S. A. (1969). Metabolic stability and epigenesis in randomly constructed genetic 
nets. J Theor Biol 22, 437-67. 
Kauffman, S. A. (1971). Gene regulation networks: A theory for their structure and global 
behaviour, in A. Moscana and A. Monroy (eds.), Current Topics in Developmental Biology 6, 
pp.145-182. New York: Academic Press. 
Kauffman, S. A. (1993). The Origins of Order. Oxford: Oxford University Press. 
Keene, J. D. (2007). RNA regulons: coordination of post-transcriptional events. Nat Rev Genet 
8, 533-43. 
Kingsmore, S. F. (2006). Multiplexed protein measurement: technologies and applications of 
protein and antibody arrays. Nat Rev Drug Discov 5, 310-20. 
Kuhn, M., Campillos, M., Gonzalez, P., Jensen, L. J. and Bork, P. (2008). Large-scale 
prediction of drug-target relationships. FEBS Lett 582, 1283-90. 
Kung, L. A. and Snyder, M. (2006). Proteome chips for whole-organism assays. Nat Rev Mol 
Cell Biol 7, 617-22. 
Lage, K., Karlberg, E. O., Storling, Z. M., Olason, P. I., Pedersen, A. G., Rigina, O., Hinsby, 
A. M., Tumer, Z., Pociot, F., Tommerup, N. et al. (2007). A human phenome-interactome 
network of protein complexes implicated in genetic disorders. Nat Biotechnol 25, 309-16. 
Lee, D. S., Park, J., Kay, K. A., Christakis, N. A., Oltvai, Z. N. and Barabasi, A. L. (2008). 
The implications of human metabolic network topology for disease comorbidity. Proc Natl Acad 
Sci U S A 105, 9880-5. 
Lee, T. I., Johnstone, S. E. and Young, R. A. (2006). Chromatin immunoprecipitation and 
microarray-based analysis of protein location. Nat Protoc 1, 729-48. 
Lee, T. I., Rinaldi, N. J., Robert, F., Odom, D. T., Bar-Joseph, Z., Gerber, G. K., Hannett, N. 
M., Harbison, C. T., Thompson, C. M., Simon, I. et al. (2002). Transcriptional regulatory 
networks in Saccharomyces cerevisiae. Science 298, 799-804. 
Levy, E. D. and Pereira-Leal, J. B. (2008). Evolution and dynamics of protein interactions and 
networks. Curr Opin Struct Biol 18, 349-57. 
Linghu, B., Snitkin, E. S., Holloway, D. T., Gustafson, A. M., Xia, Y. and DeLisi, C. (2008). 
High-precision high-coverage functional inference from integrated data sources. BMC 
Bioinformatics 9, 119. 
Luo, F., Yang, Y., Zhong, J., Gao, H., Khan, L., Thompson, D. K. and Zhou, J. (2007). 
Constructing gene co-expression networks and predicting functions of unknown genes by 
random matrix theory. BMC Bioinformatics 8, 299. 
Ma, H. and Horiuchi, K. Y. (2006). Chemical microarray: a new tool for drug screening and 
discovery. Drug Discov Today 11, 661-8. 
Ma, H., Sorokin, A., Mazein, A., Selkov, A., Selkov, E., Demin, O. and Goryanin, I. (2007). 
The Edinburgh human metabolic network reconstruction and its functional analysis. Mol Syst 
Biol 3, 135. 
Ma, H. W., Buer, J. and Zeng, A. P. (2004). Hierarchical structure and modules in the 
Escherichia coli transcriptional regulatory network revealed by a new top-down approach. BMC 
Bioinformatics 5, 199. 
Madan Babu, M., Teichmann, S. A. and Aravind, L. (2006). Evolutionary dynamics of 
prokaryotic transcriptional regulatory networks. J Mol Biol 358, 614-33. 
Makita, Y., Nakao, M., Ogasawara, N. and Nakai, K. (2004). DBTBS: database of 
transcriptional regulation in Bacillus subtilis and its contribution to comparative genomics. 
Nucleic Acids Res 32, D75-7. 
Marcotte, E. M., Pellegrini, M., Ng, H. L., Rice, D. W., Yeates, T. O. and Eisenberg, D. 
(1999a). Detecting protein function and protein-protein interactions from genome sequences. 
Science 285, 751-3. 
Marcotte, E. M., Pellegrini, M., Thompson, M. J., Yeates, T. O. and Eisenberg, D. (1999b). A 
combined algorithm for genome-wide prediction of protein function. Nature 402, 83-6. 
Martinez-Antonio, A., Janga, S. C., Salgado, H. and Collado-Vides, J. (2006). Internal-
sensing machinery directs the activity of the regulatory network in Escherichia coli. Trends 
Microbiol 14, 22-7. 
Maslov, S. and Sneppen, K. (2002). Specificity and stability in topology of protein networks. 
Science 296, 910-3. 
Massjouni, N., Rivera, C. G. and Murali, T. M. (2006). VIRGO: computational prediction of 
gene functions. Nucleic Acids Res 34, W340-4. 
Mata, J., Marguerat, S. and Bahler, J. (2005). Post-transcriptional control of gene expression: 
a genome-wide perspective. Trends Biochem Sci 30, 506-14. 
Matys, V., Kel-Margoulis, O. V., Fricke, E., Liebich, I., Land, S., Barre-Dirrie, A., Reuter, I., 
Chekmenev, D., Krull, M., Hornischer, K. et al. (2006). TRANSFAC and its module 
TRANSCompel: transcriptional gene regulation in eukaryotes. Nucleic Acids Res 34, D108-10. 
McCafferty, J., Griffiths, A. D., Winter, G. and Chiswell, D. J. (1990). Phage antibodies: 
filamentous phage displaying antibody variable domains. Nature 348, 552-4. 
Milo, R., Shen-Orr, S., Itzkovitz, S., Kashtan, N., Chklovskii, D. and Alon, U. (2002). 
Network motifs: simple building blocks of complex networks. Science 298, 824-7. 
Moreno-Hagelsieb, G. and Collado-Vides, J. (2002). A powerful non-homology method for the 
prediction of operons in prokaryotes. Bioinformatics 18 Suppl 1, S329-36. 
Moreno-Hagelsieb, G. and Janga, S. C. (2008). Operons and the effect of genome 
redundancy in deciphering functional relationships using phylogenetic profiles. Proteins 70, 344-
52. 
Ong, S. E., Schenone, M., Margolin, A. A., Li, X., Do, K., Doud, M. K., Mani, D. R., Kuai, L., 
Wang, X., Wood, J. L. et al. (2009). Identifying the proteins to which small-molecule probes 
and drugs bind in cells. Proc Natl Acad Sci U S A 106, 4617-22. 
Overbeek, R., Fonstein, M., D'Souza, M., Pusch, G. D. and Maltsev, N. (1999). The use of 
gene clusters to infer functional coupling. Proc Natl Acad Sci U S A 96, 2896-901. 
Paolini, G. V., Shapland, R. H., van Hoorn, W. P., Mason, J. S. and Hopkins, A. L. (2006). 
Global mapping of pharmacological space. Nat Biotechnol 24, 805-15. 
Pellegrini, M., Marcotte, E. M., Thompson, M. J., Eisenberg, D. and Yeates, T. O. (1999). 
Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc 
Natl Acad Sci U S A 96, 4285-8. 
Pelletier, J. N., Campbell-Valois, F. X. and Michnick, S. W. (1998). Oligomerization domain-
directed reassembly of active dihydrofolate reductase from rationally designed fragments. Proc 
Natl Acad Sci U S A 95, 12141-6. 
Pilpel, Y., Sudarsanam, P. and Church, G. M. (2001). Identifying regulatory networks by 
combinatorial analysis of promoter elements. Nat Genet 29, 153-9. 
Pullmann, R., Jr., Kim, H. H., Abdelmohsen, K., Lal, A., Martindale, J. L., Yang, X. and 
Gorospe, M. (2007). Analysis of turnover and translation regulatory RNA-binding protein 
expression through binding to cognate mRNAs. Mol Cell Biol 27, 6265-78. 
Ravasz, E. and Barabasi, A. L. (2003). Hierarchical organization in complex networks. Phys 
Rev E Stat Nonlin Soft Matter Phys 67, 026112. 
Ravasz, E., Somera, A. L., Mongru, D. A., Oltvai, Z. N. and Barabasi, A. L. (2002). 
Hierarchical organization of modularity in metabolic networks. Science 297, 1551-5. 
Resendis-Antonio, O., Freyre-Gonzalez, J. A., Menchaca-Mendez, R., Gutierrez-Rios, R. 
M., Martinez-Antonio, A., Avila-Sanchez, C. and Collado-Vides, J. (2005). Modular analysis 
of the transcriptional regulatory network of E. coli. Trends Genet 21, 16-20. 
Rognan, D. (2007). Chemogenomic approaches to rational drug design. Br J Pharmacol 152, 
38-52. 
Rogozin, I. B., Makarova, K. S., Murvai, J., Czabarka, E., Wolf, Y. I., Tatusov, R. L., 
Szekely, L. A. and Koonin, E. V. (2002). Connected gene neighborhoods in prokaryotic 
genomes. Nucleic Acids Res 30, 2212-23. 
Ruan, J., Dean, A. K. and Zhang, W. (2010). A general co-expression network-based approach to 
gene expression analysis: comparison and applications. BMC Syst Biol 4, 8. 
Salcius, M., Michaud, G. A., Schweitzer, B. and Predki, P. F. (2007). Identification of small 
molecule targets on functional protein microarrays. Methods Mol Biol 382, 239-48. 
Segal, E., Shapira, M., Regev, A., Pe'er, D., Botstein, D., Koller, D. and Friedman, N. 
(2003). Module networks: identifying regulatory modules and their condition-specific regulators 
from gene expression data. Nat Genet 34, 166-76. 
Shen-Orr, S. S., Milo, R., Mangan, S. and Alon, U. (2002). Network motifs in the 
transcriptional regulation network of Escherichia coli. Nat Genet 31, 64-8. 
Shoemaker, B. A. and Panchenko, A. R. (2007). Deciphering protein-protein interactions. Part 
I. Experimental techniques and databases. PLoS Comput Biol 3, e42. 
Slavik, R. and Homola, J. (2006). Optical multilayers for LED-based surface plasmon 
resonance sensors. Appl Opt 45, 3752-9. 
Snel, B., Bork, P. and Huynen, M. A. (2002). The identification of functional modules from the 
genomic association of genes. Proc Natl Acad Sci U S A 99, 5890-5. 
Speers, A. E. and Cravatt, B. F. (2004). Chemical strategies for activity-based proteomics. 
Chembiochem 5, 41-7. 
Storek, M. J., Ernst, A. and Verdine, G. L. (2002). High-resolution footprinting of sequence-
specific protein-DNA contacts. Nat Biotechnol 20, 183-6. 
Stuart, J. M., Segal, E., Koller, D. and Kim, S. K. (2003). A gene-coexpression network for 
global discovery of conserved genetic modules. Science 302, 249-55. 
Thieffry, D., Huerta, A. M., Perez-Rueda, E. and Collado-Vides, J. (1998). From specific 
gene regulation to genomic networks: a global analysis of transcriptional regulation in 
Escherichia coli. Bioessays 20, 433-40. 
Tsui, I. F., Chari, R., Buys, T. P. and Lam, W. L. (2007). Public Databases and Software for 
the Pathway Analysis of Cancer Genomes. Cancer Inform 3, 389-407. 
Turner, N. C., Lord, C. J., Iorns, E., Brough, R., Swift, S., Elliott, R., Rayter, S., Tutt, A. N. 
and Ashworth, A. (2008). A synthetic lethal siRNA screen identifying genes mediating 
sensitivity to a PARP inhibitor. EMBO J 27, 1368-77. 
Wagner, A. (2001). The yeast protein interaction network evolves rapidly and contains few 
redundant duplicate genes. Mol Biol Evol 18, 1283-92. 
Wang, K., Narayanan, M., Zhong, H., Tompa, M., Schadt, E. E. and Zhu, J. (2009). Meta-
analysis of inter-species liver co-expression networks elucidates traits associated with common 
human diseases. PLoS Comput Biol 5, e1000616. 
Watts, D. J. and Strogatz, S. H. (1998). Collective dynamics of 'small-world' networks. Nature 
393, 440-2. 
Whitehurst, A. W., Bodemann, B. O., Cardenas, J., Ferguson, D., Girard, L., Peyton, M., 
Minna, J. D., Michnoff, C., Hao, W., Roth, M. G. et al. (2007). Synthetic lethal screen 
identification of chemosensitizer loci in cancer cells. Nature 446, 815-9. 
Wingren, C. and Borrebaeck, C. A. (2006). Antibody microarrays: current status and key 
technological advances. OMICS 10, 411-27. 
Wishart, D. S., Knox, C., Guo, A. C., Cheng, D., Shrivastava, S., Tzur, D., Gautam, B. and 
Hassanali, M. (2008). DrugBank: a knowledgebase for drugs, drug actions and drug targets. 
Nucleic Acids Res 36, D901-6. 
Wolf, D. M. and Arkin, A. P. (2003). Motifs, modules and games in bacteria. Curr Opin 
Microbiol 6, 125-34. 
Wu, W. S., Li, W. H. and Chen, B. S. (2006). Computational reconstruction of transcriptional 
regulatory modules of the yeast cell cycle. BMC Bioinformatics 7, 421. 
Wuchty, S. (2001). Scale-free behavior in protein domain networks. Mol Biol Evol 18, 1694-702. 
Wuchty, S., Oltvai, Z. N. and Barabasi, A. L. (2003). Evolutionary conservation of motif 
constituents in the yeast protein interaction network. Nat Genet 35, 176-9. 
Yamanishi, Y., Araki, M., Gutteridge, A., Honda, W. and Kanehisa, M. (2008). Prediction of 
drug-target interaction networks from the integration of chemical and genomic spaces. 
Bioinformatics 24, i232-40. 
Yildirim, M. A., Goh, K. I., Cusick, M. E., Barabasi, A. L. and Vidal, M. (2007). Drug-target 
network. Nat Biotechnol 25, 1119-26. 
Yu, H. and Gerstein, M. (2006). Genomic analysis of the hierarchical structure of regulatory 
networks. Proc Natl Acad Sci U S A 103, 14724-31. 
Zhao, X. M., Chen, L. and Aihara, K. (2008). Protein function prediction with the shortest path 
in functional linkage graph and boosting. Int J Bioinform Res Appl 4, 375-84. 
 
  
 
Constraints imposed on bacterial transcriptional networks                                                                                     2-1 
 
 
 
 
 
 
 
2 Functional, structural and dynamic constraints on 
bacterial regulatory networks 
 
 
CONTENTS OF CHAPTER 2 
 
OUTLINE 
CONTRIBUTION TO THE WORK IN THIS CHAPTER 
2.1 INTRODUCTION 
2.2 RESULTS 
2.2.1 CONSTRAINTS IMPOSED ON THE NETWORK OF TRANSCRIPTION FACTORS IN BACTERIA 
2.2.1.1 TOPOLOGY OF ESCHERICHIA COLI CROSS-REGULATORY TRANSCRIPTIONAL NETWORK 
2.2.1.2 MULTIPLE PARALLEL FEED-FORWARD LOOPS REGULATE THE USE OF DIFFERENT CARBON SOURCES 
2.2.1.3 LONG HIERARCHICAL CASCADES REGULATE DEVELOPMENTAL PROCESSES 
2.2.2 CONSTRAINTS IMPOSED ON BACTERIAL GENOME ORGANIZATION BY TRANSCRIPTIONAL NETWORK 
2.2.2.1 GENOMIC CO-LOCALIZATION OF TFS AND TARGET GENES IS OBSERVED IN SMALL REGULONS 
2.2.2.2 TRANSCRIPTIONAL REGULATORY FLOW IN THE NETWORK OF TFS 
2.2.2.3 ABSOLUTE AND AVERAGE MRNA ABUNDANCE OF TFS SUGGESTS CORRELATION WITH REGULON SIZE AND NETWORK HIERARCHY IN E. COLI 
2.2.2.4 A CONCEPTUAL MODEL FOR THE STRUCTURING OF REGULATORY NETWORKS IN BACTERIA 
2.3 DISCUSSION & CONCLUSION 
2.4 METHODS 
2.4.1 IDENTIFICATION OF REGULON GROUPS 
2.4.2 ESTIMATING THE STATISTICAL SIGNIFICANCE OF THE REGULON GROUPS 
REFERENCES 
 
 
OUTLINE 
One of the most important developments in our understanding of biological systems in the past 
decade is the application of network theory to biological problems. This is particularly true for 
the case of regulation of gene expression. Taking advantage of the currently available 
transcriptional regulatory networks of the model bacteria, Escherichia coli and Bacillus subtilis, a 
comprehensive genomic and structural analysis was performed. It was found that while the 
mode of regulatory interaction between transcription factors (TFs) is predominantly positive, TFs 
are frequently negatively auto-regulated. Furthermore, feedback loops, regulatory motifs and 
regulatory pathways are unevenly distributed in this network. Short pathways, multiple feed-
forward loops and negative auto-regulatory interactions are particularly predominant in the sub-
network controlling metabolic functions such as the use of alternative carbon sources. In 
contrast, long hierarchical cascades and positive auto-regulatory loops are over-represented in 
the sub-networks controlling developmental processes for biofilm and chemotaxis. We propose 
that these long transcriptional cascades coupled with regulatory switches (positive loops) for 
external sensing enable the coexistence of multiple bacterial phenotypes. We also provide a link 
between the transcriptional hierarchy of regulons (TFs) and their genome organization. We 
show that, to drive the kinetics and concentration gradients, TFs belonging to big and small 
regulons, depending on the number of genes they regulate, organize themselves differently on 
the genome with respect to their targets. We then propose a conceptual model that can explain 
how the hierarchical structure of TRNs might be ultimately governed by the dynamic biophysical 
requirements for targeting DNA-binding sites by transcription factors. Our results suggest that 
the main parameters defining the position of a TF in the network hierarchy are the number and 
chromosomal distances of the genes they regulate and their protein concentration gradients. 
These observations give insights into how the hierarchical structure of transcriptional networks
can be encoded on the chromosome to drive the kinetics and concentration gradients of TFs
depending on the number of genes they regulate. This could be a common theme valid for other
prokaryotes, and it points to a role for transcriptional regulation in shaping the organization of
genes on a chromosome.
 
 
CONTRIBUTION TO THE WORK IN THIS CHAPTER 
Please note that the work presented in this chapter is the result of the following three
publications during my doctoral period. My contribution to the work excludes the
construction and layout of the cross-regulatory network of E. coli discussed in section 2.2.1 and
the definition of the conceptual model discussed in section 2.2.2. I am grateful for the input and
collaboration of Dr. Agustino Martinez-Antonio at CINVESTAV, IPN, Mexico, Dr. Denis Thieffry
at University of Marseille, France and Heladia Salgado at CCG, UNAM, Mexico for this analysis.
 
1) Transcriptional regulatory networks 
Sarath Chandra Janga and M. Madan Babu 
Book chapter for Cambridge University Press for an edited book on “Networks in Cell Biology” 
2) Functional organization of Escherichia coli transcriptional regulatory network  
Agustino Martinez-Antonio, Sarath Chandra Janga and Denis Thieffry
Journal of Molecular Biology, 2008, Vol. 381(1):238-247 
3) Transcriptional regulation shapes the organization of genes on bacterial chromosomes 
Sarath Chandra Janga, Heladia Salgado and Agustino Martinez-Antonio 
Nucleic Acids Research, 2009, Vol.37, No. 11, 3680-3688 
 
 
2.1 INTRODUCTION 
One of the most important developments in our understanding of biological systems in the past 
decade is the application of network theory to biological problems. This is particularly true for 
the case of regulation of gene expression. The accumulation of data on many factors that 
control the expression of genes or groups of genes, together with the increased use of high-
throughput techniques, such as DNA arrays and proteomics, has generated an overwhelming 
amount of data that has to be understood to infer relationships between genes, and between 
genes and signals. The reductionist approaches of molecular biology make it impractical
to deal with such large amounts of information, which has given rise to the increasing use of the
notion of networks in biology. Typically, in network approaches to understanding a biological
system, elements are represented as nodes in a graph, which are connected by edges that represent
biological interactions. This approach allows ill-defined descriptions of complexity to be replaced
by objectively quantifiable, numerical parameters, such as connectivity or strengths of
interactions (Jeong et al., 2000; Ronen et al., 2002).
Most network analyses of transcriptional regulatory events in an organism involve
representing genes and the proteins they encode as nodes. However, it should be noted that, in
contrast to protein-protein networks, the links in transcription networks have directionality,
meaning that connections have a starting node and a target node. Normally, an edge in such a
network goes from a transcription factor to the genes it regulates. In most cases such regulation
occurs through a direct effect, i.e., the regulator binds the promoter regions upstream of the
protein-coding genes whose activity it controls. More complex representations include the
incorporation of other entities like small molecules, RNA-encoding genes, signal-transduction
pathways or interacting proteins. However, most of our understanding of transcriptional networks
to date has been limited to the holistic view of Transcription Factors (TFs) and Target Genes
(TGs) as nodes and the regulatory interactions between them as edges. It is this graph in an
organism that is usually referred to as the Gene Regulatory Network (GRN). This network is
also referred to as the “Transcriptional Regulatory Network (TRN)” or simply the “transcriptional
network”.
One of the most important and obvious pieces of information that can be obtained from such a
network is the distribution of connectivity, i.e., how many connections a node has and how many
nodes have a particular number of connections. In the case of transcriptional networks these
parameters actually have two sides, as incoming and outgoing connections must be considered
separately. The incoming connectivity is the number of transcription factors regulating a target
gene, which gives a sense of the combinatorial effect of gene regulation. The fraction of target
genes with a given incoming connectivity was observed to follow an exponential distribution in both
Escherichia coli and Saccharomyces cerevisiae (Guelzim et al., 2002; Thieffry et al., 1998). The
exponential behaviour indicates that most target genes are regulated by a similar number of
factors and apparently reflects the limits on the size of the multiprotein complexes that can be
bound near the promoter, as well as on the amount of DNA sequence in upstream regions of
genes. On the other hand, the outgoing connectivity, which is the number of target genes
regulated by each transcription factor, was found to be distributed according to a power law,
in contrast to the incoming connectivity distribution. This is indicative of a hub-containing network
structure, in which a select set of transcription factors participate in the regulation of a
disproportionately large number of target genes.
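As a minimal sketch of how these two connectivity distributions can be tabulated, the following Python fragment counts incoming and outgoing connections from a list of directed TF-to-target edges. The function name and the toy edge list are hypothetical illustrations and are not taken from RegulonDB or from any dataset analysed in this work.

```python
from collections import Counter


def connectivity_distributions(edges):
    """Tabulate connectivity for a directed edge list of (tf, target) pairs.

    Returns (in_dist, out_dist) where
      in_dist[k]  = number of target genes regulated by exactly k TFs
      out_dist[k] = number of TFs regulating exactly k target genes
    """
    unique_edges = set(edges)  # count each interaction only once
    in_degree = Counter(target for _, target in unique_edges)
    out_degree = Counter(tf for tf, _ in unique_edges)
    return dict(Counter(in_degree.values())), dict(Counter(out_degree.values()))


# Hypothetical toy network, for illustration only.
toy_edges = [
    ("TF_A", "g1"), ("TF_A", "g2"), ("TF_A", "g3"), ("TF_A", "g4"),
    ("TF_B", "g1"), ("TF_B", "g2"),
    ("TF_C", "g4"),
]
in_dist, out_dist = connectivity_distributions(toy_edges)
print("incoming:", in_dist)
print("outgoing:", out_dist)
```

In this toy case one TF (TF_A) regulates four genes while the others regulate one or two, which is the kind of hub-dominated outgoing connectivity described above, only at a much smaller scale.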
At a local level, certain sub-networks in transcriptional networks appear more often than
expected by chance and have been referred to as motifs, by analogy with sequence motifs, which
occur repeatedly in sequences. Motifs were originally described in the E. coli transcriptional
regulatory network but were subsequently found in yeast and other organisms (Alon, 2007b;
Shen-Orr et al., 2002). Three network motifs were found to predominate in most
transcriptional networks: 1) the Feed-Forward Loop (FFL), in which a transcription factor regulates
the expression of another transcription factor which, in turn, regulates a gene that is also
regulated by the first transcription factor; 2) the Single-Input Module (SIM), in which a single
transcription factor regulates several genes, and which is usually also called a simple regulon
(Gutierrez-Rios et al., 2003); 3) Dense Overlapping Regulons (DORs), in which several TFs
regulate overlapping sets of genes; such groups are also called complex regulons. The FFL
appears to be the most abundant motif among the best studied transcriptional networks. FFLs
have been further classified into eight motif sub-types, two of which, namely the coherent type-1
and incoherent type-1 FFLs, appear to be much more predominant than the others (Alon, 2007b;
Mangan and Alon, 2003). The former was shown to act as a sign-sensitive delay element and a
persistence detector, while the latter was demonstrated to function as a pulse generator and
response accelerator (Mangan et al., 2006; Mangan et al., 2003). Although motifs form
over-represented sub-graphs in the entire network of transcriptional regulation, they do not appear
independently but rather integrate to form super-structures or modules that carry out a common
biological function by sharing some of their edges (Dobrin et al., 2004a; Resendis-Antonio et al.,
2005).
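The feed-forward loop definition given above can be made concrete with a short Python sketch that enumerates FFLs in a signed, directed network and labels the two predominant sub-types. This is an illustrative sketch only: the function name, the +1/-1 sign encoding and the toy network are assumptions introduced here, and dual interactions are not handled.

```python
def find_ffls(signed_edges):
    """Enumerate feed-forward loops X -> Y -> Z with a direct X -> Z edge.

    signed_edges maps (regulator, target) to +1 (activation) or -1 (repression).
    Returns a sorted list of (X, Y, Z, label) where label is
      "C1" for coherent type-1 (all three interactions positive),
      "I1" for incoherent type-1 (X activates Y and Z, Y represses Z),
      "other" for the remaining sub-types.
    """
    targets = {}
    for regulator, target in signed_edges:
        if regulator != target:  # auto-regulation is not part of the motif
            targets.setdefault(regulator, set()).add(target)

    ffls = []
    for x, x_targets in targets.items():
        for y in x_targets:
            for z in targets.get(y, ()):
                if z != x and z in x_targets:
                    s_xy = signed_edges[(x, y)]
                    s_yz = signed_edges[(y, z)]
                    s_xz = signed_edges[(x, z)]
                    if (s_xy, s_yz, s_xz) == (1, 1, 1):
                        label = "C1"
                    elif (s_xy, s_yz, s_xz) == (1, -1, 1):
                        label = "I1"
                    else:
                        label = "other"
                    ffls.append((x, y, z, label))
    return sorted(ffls)


# Hypothetical toy network, for illustration only.
toy_network = {
    ("X", "Y"): 1, ("Y", "Z"): 1, ("X", "Z"): 1,   # coherent type-1
    ("X", "W"): 1, ("Y", "W"): -1,                 # incoherent type-1 with X -> Y
    ("X", "X"): -1,                                # auto-repression, ignored
}
for ffl in find_ffls(toy_network):
    print(ffl)
```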
At a global level, transcriptional regulatory networks have been shown to possess a
scale-free, multi-layer, hierarchical, modular structure using both top-down and bottom-up
approaches for determining hierarchy (Ma et al., 2004a; Yu and Gerstein, 2006). Interestingly,
transcription networks do not seem to possess feedback regulation at the level of transcription,
meaning that transcriptional regulation of TFs at the top by TFs at the bottom of this hierarchical
structure is not frequent, indicating the prevalence of alternative forms of feedback control of
transcription. Typically such feedback occurs through protein-protein interactions
at the post-translational level or through a complex interplay of cellular entities which control the
activity of TFs by changing their conformation depending on the continuously varying intra- and
extra-cellular conditions (Martinez-Antonio et al., 2006a; Yu and Gerstein, 2006). The
pyramid-shaped, multi-layer, hierarchical transcriptional network forms the basis for modules and
motifs, which have been determined using a number of approaches for decomposing the regulatory
networks (Bar-Joseph et al., 2003; Dobrin et al., 2004a; Ihmels et al., 2004; Ihmels et al., 2002;
Ma et al., 2004a; Milo et al., 2002; Resendis-Antonio et al., 2005; Segal et al., 2003; Shen-Orr
et al., 2002; Wu et al., 2006) (see Figure 2-1 below for a summary of these properties).
 
[Figure 2-1 panel labels: (A) Basic unit (components): transcriptional interaction between a
transcription factor and a target gene. (B) Motifs (local level): patterns of interconnections.
(C) Modules (intermediate level): interconnections of motifs. (D) Scale-free hierarchical network
(global level): all transcriptional interactions in a cell.]
 
Figure 2-1: Structure of the transcriptional regulatory network. Nodes represent transcription factors (red
nodes) or target genes (black nodes) and a directed edge indicates a regulatory interaction between the
two. (A) Components of a regulatory interaction. (B) The local structure of the network consists of patterns of
inter-connections called network motifs. The three frequently occurring motifs are the feed-forward motif
(top), single input motif (middle) and multiple input motif (bottom). (C) Motifs are interconnected to form
groups of highly connected genes, referred to as regulatory modules (dashed circles). (D) The set of all
regulatory interactions in a cell is referred to as the transcriptional network.
 
Chromosomal proximity of functionally related genes was observed as early as
1960, when Jacob and Monod first reported that genes in bacteria are transcribed as
polycistronic messenger RNAs (Jacob et al., 1960), suggesting the importance of chromosomal
organization. It is now known that genes in bacterial genomes are largely organized into
operons in order to enable co-regulation (Price et al., 2005), and it is for this reason that the
conservation of gene order in prokaryotic genomes is highly non-random (Ermolaeva et al.,
2001; Korbel et al., 2004). However, our progress in understanding how these functional modules
are organized on bacterial chromosomes, and what governs that organization, has been very limited,
despite the enormous number of bacterial genomes that have been sequenced in recent years.
In particular, these observations lead to several unanswered basic questions: does
transcriptional regulation constrain genome structure beyond local operon structure? Are
there constraints on the organization and evolution of transcriptional networks?
Products of genes have different functional roles and hence not all genes are used at the
same time and for the same purpose. This explains why groups of genes are differentially
expressed. For instance, genes encoding enzymes of the Krebs cycle are constitutively
expressed under most growth conditions, while genes responsible for using alternative
carbon sources are only sporadically required. The decision about which genes should be turned on
or off is executed by transcription factors (TFs) that use metabolites/signals as input information
from the environmental state and give a transcriptional response as output (Jacob, 1970;
Martinez-Antonio et al., 2006b; Ptashne and Gann, 1997). As a result, the notion that TFs are
expressed in varying concentrations came into existence. For instance, LacI, a repressor of the
operon for lactose consumption, is expressed in the order of tens of molecules per cell, while
global regulators such as CRP (cAMP receptor protein) or IHF (integration host factor) occur in
the order of thousands of molecules in the course of the cell cycle (Elf et al., 2007; Luijsterburg
et al., 2006). In bacterial cells, where transcription and translation are coupled and happen in the
same compartment, these considerations become especially important for regulating gene
expression. During transcription, regulatory proteins (TFs) should find and bind to specific DNA
sequences in the operator region of their target genes to repress or induce their transcription
(Browning and Busby, 2004). The protein-DNA interaction is a critical step in gene regulation as
TFs find their DNA-binding sites as a result of a passive process. Furthermore, TFs do not use
energy (e.g. ATP hydrolysis) to get DNA-sequence information (Hu et al., 2008), which forces
these systems to use additional strategies for the optimal performance of different TFs. In the
early era of molecular biology, Brownian diffusion was thought to be the determining step in
DNA-binding site recognition by TFs. However, this assumption was challenged when it was
reported that the LacI repressor finds its DNA targets 90-100 times faster than predicted by
a purely diffusive mechanism (Riggs et al., 1970; Wang et al., 2006). This observation led to the
suggestion of a ‘facilitated diffusion’ mechanism. In such a mechanism, TFs alternate between
three-dimensional (3-D) diffusion in the cell, jumping between DNA strands, and one-dimensional
(1-D) sliding along the DNA to rapidly locate their binding sites (Berg et al., 1981; Richter and
Eigen, 1974; Winter et al., 1981). This hypothesis was corroborated by several works, mostly
single-molecule studies in which the authors visualized individual TFs interacting with the
DNA (Elf et al., 2007; Shimamoto, 1999; Wang et al., 2006; Xie et al., 2008). Several groups
have also mathematically modeled the sliding process along the DNA and shown it to be a
plausible way of making the search significantly faster than 3-D diffusion alone, in particular for
TFs at low cellular concentrations (Cherstvy et al., 2008; Gowers et al., 2005; Hu et al., 2008;
Murugan, 2007). However, it is unclear what factors lead a TF to adopt one or the other
strategy discussed above and whether there is an interplay between nucleoid structure, genome
organization and the biophysical aspects of transcriptional regulation in bacterial systems. In
what follows, I will present a summary of the work that was performed to understand the
functional and structural constraints that are imposed on bacterial transcriptional networks.
2.2 RESULTS 
2.2.1 Constraints imposed on the network of transcription factors in 
bacteria 
In bacteria, coupling of gene expression with external conditions is achieved through two 
molecular functions; (i) association of transcription factors (TFs) at specific sites in the genome 
and (ii) recognition of a relevant effector signal or metabolite (Jacob, 1970; Martinez-Antonio et 
al., 2006b). Typically these functions are performed by different domains of a single polypeptide, 
but there are also cases where two interacting proteins are responsible for these functions, as in 
two-component systems (Ulrich et al., 2005).  
At the phenotypic level, there is evidence for the coexistence of multiple phenotypes
in bacterial cultures, e.g., of cells with different morphological and physiological abilities like
motility, biofilm formation, drug resistance, etc. (Balaban et al., 2004; Ehrlich et al., 2005). In
particular, biofilm formation and chemotaxis are considered multi-stage developmental
processes and, in mature biofilms, a mixture of bacterial populations from different
developmental stages coexists (O'Toole et al., 2000; Stoodley et al., 2002). In an attempt to
understand whether these distinct phenotypes in growing populations of bacteria can be linked
and explained in the context of transcriptional regulation, the wealth of experimental data on
transcriptional regulation for the best-characterized bacterium, Escherichia coli, was analyzed
(Gama-Castro et al., 2008; Martinez-Antonio et al., 2008). In particular, a detailed analysis of
the structure of the transcriptional regulatory network of transcription factors allowed us to
unravel several constraints imposed on this network. In the following sections, I summarize the
results of this published study, in which we performed a comprehensive analysis of regulatory
interactions between all the experimentally characterized TFs (defined as the cross-regulatory
network) in E. coli (Martinez-Antonio et al., 2008). Our structural analysis of the transcriptional
cross-regulatory network in Escherichia coli suggests that regulatory interactions between TFs
are predominantly positive, while auto-regulatory interactions are mostly negative. However, this
general trend seems to be reversed in the case of most downstream TFs involved in the
regulation of biofilm/chemotaxis modules.
[Figure 2-2 labels: Regulation of the metabolism of alternative carbon sources; Master regulator for
motility and chemotaxis; Master switch between biofilm and chemotaxis; Master regulator for biofilm
formation; Master regulators for respiration modes. Source: Martinez-Antonio A, Janga SC, Thieffry D,
2008, JMB.]

Figure 2-2: Core transcriptional regulatory network of E. coli. Light blue and pink nodes represent genes
encoding transcription factors and sigma factors, respectively; edges represent regulatory interactions
among TFs and sigma factors (green for activation, red for repression, blue for dual interactions and
yellow for sigma transcription), whereas loops represent transcriptional auto-regulation. Specific
sub-networks responsible for the regulation of different processes are delimited with a dashed line,
including the section showing the regulation of carbon sources.
 
We also note that there are striking topological differences between the sub-networks
controlling carbon metabolism and developmental processes: the former are composed of many
parallel short transcriptional cascades, encompassing multiple feed-forward loops, each
enabling the use of one alternative carbon source, while the latter involve long and intertwined
regulatory cascades (see Figure 2-2). These long transcriptional cascades typically include
multiple auto-activated intermediate TFs, as well as regulatory circuits between TFs and sigma
factors. We further observe that transcription factors acting at the end of these regulatory
cascades often belong to two-component systems. This topology suggests that, on the one hand,
cell homeostasis is maintained through multiple regulatory cascades with commonly
auto-repressed TFs, while the regulatory memory is preserved by the sequential activation of TFs at
the core of the network. On the other hand, downstream of the hierarchical network,
two-component systems can memorise transient external signals through auto-activation loops, thus
acting as molecular switches enabling the coexistence of alternative phenotypes.
2.2.1.1 Topology of Escherichia coli cross-regulatory transcriptional 
network 
Available experimental data point to more than 3000 regulatory interactions between TFs and 
their regulated genes in E. coli. This information is integrated and documented in a specialised 
database called RegulonDB (Salgado et al., 2006). Global analyses of this huge network have 
already been published, emphasizing a hierarchical organisation and statistically over-
represented regulatory motifs (Dobrin et al., 2004b; Ma et al., 2004b; Shen-Orr et al., 2002). 
However, our aim here is to analyse the flow of regulatory information within the network of
transcriptional interactions among TFs and sigma factors (defined here as the E. coli transcriptional
cross-regulatory network). This network encompasses 115 TFs and 7 sigma factors, i.e., around
one third of the total predicted TF proteins in this bacterium (Figure 2-2) (Madan Babu and 
Teichmann, 2003; Perez-Rueda and Collado-Vides, 2000). On average, every TF is connected 
to two other TFs (i.e., more technically, the mean degree of the regulatory graph is 2.74). 
However, the connectivity distribution of TFs is not uniform, with a small fraction of global TFs 
with high out-degrees dominating the network (Martinez-Antonio and Collado-Vides, 2003).  
In order to better visualise the informational flow through the network, the following
graphical conventions have been used in Figure 2-2 (see also its legend): (i) the size of the nodes
representing TFs is proportional to the number of genes they regulate (e.g., CRP regulates 413
genes and is represented by the second biggest node, after the housekeeping sigma factor
rpoD); (ii) arrows and colours refer to the direction and sign of the regulatory interaction; (iii)
arrow thickness is proportional to the impact of the interaction, computed as the number of
genes thereby (in)directly regulated.
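One possible reading of convention (iii) is sketched below in Python: the impact of an interaction from a regulator to a target is taken as the target itself plus every node reachable downstream of it. This reading, the function names and the toy adjacency list are assumptions made for illustration; the exact counting rule used to draw Figure 2-2 is not specified here.

```python
from collections import deque


def downstream_nodes(adjacency, source):
    """Return the set of nodes reachable from source along directed edges."""
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    seen.discard(source)
    return seen


def edge_impact(adjacency, regulator, target):
    """Number of genes regulated directly (target) or indirectly (below it)."""
    if target not in adjacency.get(regulator, ()):
        raise ValueError(f"no interaction {regulator} -> {target}")
    reached = downstream_nodes(adjacency, target)
    reached.add(target)
    return len(reached)


# Hypothetical toy cascade, for illustration only.
toy_adjacency = {"A": ["B", "g1"], "B": ["C", "g2"], "C": ["g3"]}
print(edge_impact(toy_adjacency, "A", "B"))
print(edge_impact(toy_adjacency, "B", "C"))
print(edge_impact(toy_adjacency, "A", "g1"))
```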
The majority of the TFs in this network are auto-regulated (~70%), and about 60% of these
auto-regulatory interactions are negative loops (Table 2-1). This finding is consistent with the
results of an analysis performed with a much smaller number of TFs more than ten years ago
(Thieffry et al., 1998).
This predominance of negative auto-regulatory loops contrasts with the predominance of
positive arcs between different TFs (about 54%, see Table 2-1). The dominance of positive
regulatory interactions in the regulatory network of E. coli is not limited to those among TFs, as
a comparison of the regulation of all the target genes (3017 arcs/edges) shows that about 54%
(1630) are positively regulated, 40% (1206) are repressed and about 6% (171) are
dual regulated. This is especially interesting because the majority of the TFs in bacteria have been
reported to act as repressors (Moreno-Campuzano et al., 2006; Perez-Rueda and Collado-Vides,
2000; Struhl, 1999). The conventions used in Figure 2-2 clearly display the hierarchical
organisation of the network, with master regulators such as CRP, FNR or IHF each (in)directly
regulating a large number of other transcription factors. Furthermore, the layout emphasises
important variations regarding the length of the transcriptional cascades.
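The kind of tally summarised in Table 2-1 can be sketched as follows. The Python fragment separates auto-regulatory loops from arcs between different regulators and reports the fraction of positive, negative and dual interactions in each group. The function name, the sign labels and the toy interaction list are hypothetical and serve only to illustrate the bookkeeping.

```python
from collections import Counter

SIGNS = ("+", "-", "dual")


def sign_summary(interactions):
    """Summarise interaction signs for (regulator, target, sign) triples.

    Returns (auto_fractions, cross_fractions): the fraction of each sign
    among auto-regulatory loops and among arcs between different nodes.
    """
    auto, cross = Counter(), Counter()
    for regulator, target, sign in interactions:
        if sign not in SIGNS:
            raise ValueError(f"unknown sign: {sign!r}")
        (auto if regulator == target else cross)[sign] += 1

    def fractions(counts):
        total = sum(counts.values())
        return {s: (counts[s] / total if total else 0.0) for s in SIGNS}

    return fractions(auto), fractions(cross)


# Hypothetical toy interactions, for illustration only.
toy_interactions = [
    ("A", "A", "-"), ("B", "B", "-"), ("C", "C", "+"),       # auto-regulation
    ("A", "B", "+"), ("A", "C", "+"), ("B", "C", "-"), ("C", "D", "dual"),
]
auto_frac, cross_frac = sign_summary(toy_interactions)
print("auto-regulation:", auto_frac)
print("cross-regulation:", cross_frac)
```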
 
Table 2-1. Distributions of positive, negative and dual (auto-) regulatory interactions and mean path
length computed for the E. coli transcriptional cross-regulatory network displayed in Figure 2-2 and for the
sub-networks controlling the regulation of alternative carbon sources as well as biofilm and chemotaxis
developmental processes. Note that arcs are synonymous with edges in the network.
 
Network section                          All         Carbon sources   Biofilm & motility³
No. of TFs                               115         24               32
Auto-regulatory interactions             80 (70%)    19 (79%)         18 (56%)
Positive auto-regulatory interactions    24 (30%)    5 (26%)          9 (50%)
Negative auto-regulatory interactions    48 (60%)    9 (48%)          9 (50%)
Dual auto-regulatory interactions        8 (10%)     5 (26%)          0
Regulatory arcs¹                         166         27               52
Positive arcs                            90 (54%)    20 (74%)         22 (42%)
Negative arcs                            67 (40%)    6 (22%)          27 (52%)
Dual arcs                                9 (6%)      1 (4%)           3 (6%)
Average path length²                     2.74        1.53             3.12
 
¹ Regulatory interactions from TFs to other TFs or towards sigma factors.
² Average path lengths in the (sub-)network(s) were calculated with the VisANT program (Hu et al., 2007).
³ Only the TFs forming cascades ending on biofilm and chemotaxis modules were counted; the
autoregulation of CRP was included in the carbon sources module.
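To illustrate what an average path length measures, the Python sketch below computes the mean shortest directed path over all ordered pairs of nodes that are connected by a path. This is one common definition and is an assumption here; the program cited in the table footnote may use a different convention, so the values in Table 2-1 should not be expected to be reproduced by this fragment. The toy networks are hypothetical.

```python
from collections import deque


def average_path_length(adjacency):
    """Mean shortest directed path length over reachable ordered node pairs."""
    nodes = set(adjacency) | {v for vs in adjacency.values() for v in vs}
    total_length = 0
    reachable_pairs = 0
    for source in nodes:
        # Breadth-first search gives shortest path lengths from source.
        distance = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nxt in adjacency.get(node, ()):
                if nxt not in distance:
                    distance[nxt] = distance[node] + 1
                    queue.append(nxt)
        total_length += sum(distance.values())
        reachable_pairs += len(distance) - 1  # exclude the source itself
    return total_length / reachable_pairs if reachable_pairs else 0.0


# Hypothetical toy topologies, for illustration only.
long_cascade = {"A": ["B"], "B": ["C"], "C": ["D"]}     # A -> B -> C -> D
parallel_short = {"H": ["a", "b", "c"]}                 # one regulator, three targets
print(average_path_length(long_cascade))
print(average_path_length(parallel_short))
```

The cascade yields a larger value than the star-like topology, which mirrors the qualitative contrast between the long developmental cascades and the short carbon-source cascades reported in Table 2-1.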
 
Although functional annotations on transcription factors are still limited, it is possible to
classify the cross-regulating TFs into broad categories according to the physiological functions
of the target structural genes: initial carbohydrate catabolism, respiration, biofilm formation and
chemotaxis, etc. As shown in Figure 2-2, these broad classes correspond to different local
network topologies. Due to their contrasting topologies, in what follows, we will focus our
discussion on the short regulatory cascades observed in the case of carbohydrate catabolism as
opposed to the long regulatory cascades seen in the case of biofilm and chemotaxis pathways
(marked in Figure 2-2 and shown in Table 2-1). CRP resides at the top of both sub-networks. CRP is
the only global TF acting hierarchically over local TFs for the usage of carbohydrates, whereas
CRP’s activity is comparable to the activity of other global regulators in the rest of the network.
Note that the concentration of its effector metabolite, cAMP (cyclic adenosine monophosphate),
is on a par with that of ATP (adenosine triphosphate), which acts as the energetic currency of the
cell (Bettenbrock et al., 2007). This suggests that CRP not only regulates the use of these
substrates for producing ATP, but also senses the energetic status of the cell to decide the
execution of other cellular programs.
While it is evident from Figure 2-2 and Table 2-1 that there are differences in the topologies of the
subnetworks controlling metabolism versus motility and chemotaxis, which will be the focus of
the following sections, there are other subnetworks which are also worth mentioning. In
particular, all 9 TFs controlling the expression of genes for amino acid biosynthesis seem to be
expressed constitutively by sigma 70. Each TF regulates the transcription of the genes required
for producing different amino acids. At high concentrations of the amino acids, TFs bind
their respective amino acids and undergo allosteric modifications, resulting in TF auto-repression
as well as in the repression of the corresponding biosynthetic genes. Interestingly,
the logic behind negative autoregulation in this case differs from that of the catabolism of
carbohydrates. While in the latter case TFs are autorepressed until the substrate is available, in
the case of amino acids, TFs are autorepressed only in the presence of an excess of the
synthesized final product. Another interesting subnetwork is that for alleviating the stresses caused
by drugs, solvents and weak organic acids. The regulatory logic in this complex subnetwork is
peculiar, as its components form multi-element circuits and their inputs are directed by Rob
and SoxR, two small proteins that are constitutively expressed but have very short half-lives (1-2 min).
Their stability/degradation depends on the presence/absence of their effector signals (Griffith
and Wolf, 2004; Martin et al., 2000; Shah and Wolf, 2004).
2.2.1.2 Multiple parallel feed-forward loops regulate the use of 
different carbon sources 
Cellular feeding, which includes the uptake of carbon and energy sources and their metabolism, 
can be considered as one of the main physiological processes in bacterial systems. The 
regulation of these processes directly affects cellular fitness. The selection of carbon sources is 
regulated by CRP and about 20 more specific TFs (Figure 2-2). The hierarchical organisation of 
the corresponding sub-network is characterized by a short average path length (Table 2-1). 
Regulatory interactions between CRP and the specific TFs result in the occurrence of multiple
Feed-Forward Loops (FFLs) for the use of alternative sugar sources. The FFL is a network motif
recurrently found in transcriptional networks and is defined as a three-gene pattern composed of
two input transcription factors, one of which regulates the other, both jointly regulating a target
gene (Mangan and Alon, 2003; Shen-Orr et al., 2002). Based on the mode of regulation of each
TF, this motif is sub-divided into 8 different sub-types (Mangan and Alon, 2003). In the coherent
type 1 FFL all the regulatory interactions in the motif are positive, while in the
incoherent type 1 FFL the first TF positively regulates both of its targets but the second TF
represses the expression of the target gene, thereby reversing the final effect. The majority of the
feed-forward loops present in the subnetwork for carbon catabolism belong to the coherent and
incoherent type 1 groups (Mangan and Alon, 2003), with both TFs working cooperatively, as a
result of a persistent signal (in this case, cyclic adenosine monophosphate) affecting the global
TF and the presence of a signal affecting a TF corresponding to a sugar alternative to glucose
(Alon, 2007b; Janga et al., 2007b; Mangan and Alon, 2003). This motif structure enables the
filtering of short pulses of the signal affecting the global TF (cAMP) in case of transient glucose
deprivation. Consequently, the target structural genes are activated only in the persistent
absence of glucose and in the presence of an alternative carbon source.
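The pulse-filtering behaviour described above can be illustrated with a deliberately simplified simulation of a generic coherent type 1 FFL with AND logic at the target. In this sketch the input X is active while the signal is present, the intermediate regulator Y accumulates with first-order kinetics, and the target Z is on only when X is active and Y has crossed a threshold. The model form, all parameter values (dt, beta, alpha, threshold) and the function name are assumptions chosen for illustration and are not fitted to the E. coli carbon-source system.

```python
def simulate_coherent_ffl(signal, dt=0.1, beta=1.0, alpha=1.0, threshold=0.5):
    """Simulate Z in a coherent type-1 FFL with AND logic (Euler steps).

    signal: sequence of 0/1 values giving the activity of X at each step.
    Y follows dY/dt = beta * X - alpha * Y; Z is on when X is active
    and Y has reached the threshold.
    Returns a list of booleans with the state of Z at each step.
    """
    y = 0.0
    z_trace = []
    for x_active in signal:
        z_trace.append(bool(x_active) and y >= threshold)
        production = beta if x_active else 0.0
        y += dt * (production - alpha * y)
    return z_trace


# Hypothetical input signals, for illustration only.
short_pulse = [1] * 3 + [0] * 20    # transient signal
long_pulse = [1] * 20 + [0] * 3     # persistent signal

print("Z responds to short pulse:", any(simulate_coherent_ffl(short_pulse)))
print("Z responds to long pulse: ", any(simulate_coherent_ffl(long_pulse)))
```

With these assumed parameters the short pulse ends before Y reaches the threshold, so Z never switches on, whereas the persistent signal activates Z after a delay and Z switches off as soon as the signal disappears.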
The phosphotransferase system (PTS) typically transports and phosphorylates 
certain sugars, including glucose, a preferred carbon source for E. coli, and the presence of 
glucose ultimately results in low levels of cAMP. Consequently, CRP does not activate the 
transcription of the genes responsible for the degradation of alternative sugars. Note that most 
structural genes involved in the transport and initial catabolism of alternative carbon sources 
are encoded in operons, each specifically repressed in the absence of the inducing sugar. 
However, when glucose is lacking, the cAMP level increases and CRP can activate the 
transcription of genes responsible for degrading alternative carbon sources (Deutscher et al., 
2006). Simultaneously, sugars (or processed variants thereof) present in the cell bind their 
specific TF; allosteric interactions then release the TF from DNA, alleviating the repression and 
permitting the transcription of the corresponding target genes. This organisation involving 
multiple parallel feed-forward loops coupled to PTS activity appears optimal to enable rapid 
transcriptional responses to sudden lack of glucose in the presence of alternative carbon 
sources in the milieu (Dekel et al., 2005; Mangan et al., 2005). 
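The pulse-filtering behaviour attributed to these feed-forward loops can be illustrated with a deliberately simplified simulation. The sketch below is a toy model under stated assumptions, not the kinetic model of the cited studies: the input signal X switches on production of the second regulator Y, and the target Z is produced only while X is on and Y exceeds a threshold (an AND gate at the target promoter is assumed). The rate constant, the threshold and the function name are arbitrary illustrative choices.

```python
def peak_target_level(pulse_length, t_end=6.0, dt=0.001, rate=1.0, threshold=0.5):
    """Peak level of target Z after an input pulse of the given length.

    Toy coherent type 1 FFL with AND logic (assumed): X drives Y, and Z is
    produced only while X is on and Y is above `threshold`.
    """
    y = z = peak = 0.0
    for step in range(int(t_end / dt)):
        x_on = step * dt < pulse_length                  # input signal present?
        z_on = x_on and y > threshold                    # AND gate at the target promoter
        y += dt * rate * ((1.0 if x_on else 0.0) - y)    # production minus decay (Euler step)
        z += dt * rate * ((1.0 if z_on else 0.0) - z)
        peak = max(peak, z)
    return peak


short = peak_target_level(0.3)   # pulse ends before Y reaches the threshold
long_ = peak_target_level(3.0)   # persistent signal, Y crosses the threshold
```

With these arbitrary parameters the short pulse never lets Y accumulate to the threshold, so the target stays off, whereas the persistent input switches the target on after a delay set by the accumulation of Y.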
2.2.1.3 Long hierarchical cascades regulate developmental processes 
Biofilm formation and bacterial motility can be seen as the outcome of specialised cell 
differentiation pathways. Biofilm formation involves successive cellular changes at the 
morphological and physiological levels, resulting in bacterial populations with multiple 
phenotypes (Ehrlich et al., 2005; Hooshangi et al., 2005; Shapiro, 1998). Furthermore, bacteria 
living in biofilm communities are present in different developmental stages (at least four defined 
stages) as has been observed in Cryptococcus, Pseudomonas, Staphylococcus, Xanthomonas, 
etc. (Aldridge and Hughes, 2002; Chantratita et al., 2007; Guerrero et al., 2006; Handke et al., 
2004; Kamoun and Kado, 1990; Massey et al., 2001). The part of the E. coli transcriptional 
cross-regulatory network involved in the control of biofilm formation and motility exhibits a 
relatively complex topology with several long cascades from CRP, IHF and FNR to downstream 
specialised TFs (Table 2-1 and Figure 2-2). Several of these cascades converge downstream on 
the master regulators for motility and biofilm formation (FlhDC and CsgD, respectively). 
For instance, CsgD, the master regulator for biofilm formation, is directly involved in several long 
circuits, suggesting a particularly tight coupling of CsgD activity with the intracellular status. In 
contrast, FlhDC, the master compound regulator for motility and chemotaxis, is known to be 
regulated by nine other TFs but has not yet been reported to regulate any other TF. In fact, a 
relatively high proportion of the downstream specialised transcription factors are also 
auto-activated, a feature which is rare at the level of the whole transcriptional network. 
Note that the motility module has its own sigma factor, FliA, regulated by FlhDC. FliA is 
required for the transcription of the genes needed for the last stage of flagellar development and 
for the chemotaxis machinery (Kalir and Alon, 2004; Kalir et al., 2001). In contrast, the genes for 
biofilm development are transcribed by the housekeeping sigma 70 and RpoS, the sigma factor 
expressed in response to general stress (Hengge-Aronis, 2002). 
The execution of such long regulatory cascades requires time. Indeed, complete flagella 
assembly may take a generation time or longer (Aizawa and Kubori, 1998; Macnab, 2003; Pruss 
and Matsumura, 1997). The occurrence of positively auto-regulated TFs at several intermediate 
steps enables informed decisions about the cellular/environmental condition. In some 
conditions, cellular duplication might be faster than the completion of a long regulatory cascade. 
This implies that bacterial populations likely consist of a mixture of bacteria whose transcriptional 
programs are at different stages of long regulatory cascades. 
2.2.2 Constraints imposed on bacterial genome organization by 
transcriptional network 
Simple regulons comprise transcription factors (TFs) and the sets of genes they regulate and 
were defined as early as 1964 (Maas et al., 1964). The genes in a regulon can have diverse 
functional properties, vary in number and be dispersed on the chromosome. 
However, it is unclear if there is any relationship between regulon size and the chromosomal 
positioning of their genetic components, nor is it known how TFs, constituting large and small 
regulons, coordinate their activity in the context of regulatory networks. 
It is now well accepted that most biological networks are scale-free in their structure 
(Barabasi and Albert, 1999; Cases and de Lorenzo, 2005) and modular in their function 
(Hartwell et al., 1999; Kashtan and Alon, 2005; Resendis-Antonio et al., 2005; Slonim et al., 
2006; Snel and Huynen, 2004), but our understanding of how this scale-free structure is 
reflected in chromosomal organization is very limited. Thus, addressing the design behind 
these architectures in the context of genome organization can provide important insights into 
genome structure and function. In bacterial genomes, where 
transcription and translation occur in the same compartment, these questions become especially 
important, as the positioning of TFs on the chromosome could depend on their concentration 
(Cai et al., 2006; Golding et al., 2005; Yu et al., 2006). Recent works have suggested the 
importance of chromosomal distance in bacterial genomes from diverse perspectives, from 
transcription units and operon organization to divergent and convergent transcriptional control 
(Janga et al., 2007a; Korbel et al., 2004; Menchaca-Mendez et al., 2005; Warren and ten Wolde, 
2004). However, no analysis has focused on the link between regulon sizes, their genome 
organization and how this relates to the hierarchical transcriptional network structure 
(Lagomarsino et al., 2007; Ma et al., 2004a; Yu and Gerstein, 2006). In an attempt to address 
the question of how these factors interplay in the larger context of transcriptional networks of 
bacteria, we used the currently available TRNs of the best characterised Gram-negative 
bacterial model, Escherichia coli (Salgado et al., 2006), and of the less well-characterized 
Gram-positive representative, Bacillus subtilis (Makita et al., 2004). 
In short, we studied the dependency between regulon sizes and their chromosomal 
positioning and show that regulons can be classified into 3 distinct groups (big, intermediate 
and small regulons) based on the average chromosomal distance between TFs and their 
respective target genes (Janga et al., 2007a) (Figure 2-3). We note that regulatory flux is 
generally driven from big to small regulons in both E. coli and B. subtilis. Finally, using data 
from two independently reported studies we show that the higher a TF is in the transcriptional 
hierarchy, the greater the number of its mRNA molecules detected per cell, reflecting the need 
for such TFs to be expressed at higher concentrations to regulate targets located distantly on 
the chromosome. In contrast to the TFs heading big regulons, local or dedicated TFs (lower in 
the network hierarchy) were found to be expressed at much lower concentrations, explaining 
their proximity on the chromosome to their target genes. These observations show how the 
scale-free structure of transcriptional networks can be encoded on the chromosome to drive the 
kinetics and concentration gradients of TFs depending on the number of genes they regulate, 
and how this encoding could facilitate the horizontal transfer of local environment-specific 
transcriptional modules. 
[Figure 2-3, plots not reproduced. Panels A (Escherichia coli K-12) and B (Bacillus subtilis): regulon size (gene numbers) against TF-TG average genome distance in regulons, both on logarithmic axes, with regulon groups A, B and C marked. Panels C and D: regulatory flow diagrams.]
Figure 2-3: Relationship between size (defined as the number of target genes) and average chromosomal 
distance for all known regulons in (A) E. coli and (B) B. subtilis. Regulon size is plotted on the Y-axis and is 
normalized with respect to the size of the biggest regulon in each genome for the sake of comparison across 
genomes, and the average chromosomal distance between the TF-encoding gene and its respective 
target genes is shown on the X-axis. Chromosomal distances were calculated as defined earlier (Janga et al., 
2007a) with the maximum distance being half the number of protein coding genes on a circular 
chromosome. Note that both regulon sizes and average chromosomal distances are normalized with 
respect to the maximum and both axes are shown on a logarithmic scale. Flow of regulatory 
interactions between the TFs heading the regulons, grouped according to their size and chromosomal 
distance, in (C) E. coli and (D) B. subtilis; the regulatory flux among TFs typically 
follows the order big to intermediate to small regulons, coloured respectively in red (big), green 
(intermediate) and blue (small). 
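The two quantities plotted in Figure 2-3A and B can be sketched in a few lines. The fragment below is a minimal illustration under stated assumptions, since the exact procedure is the one defined in Janga et al. (2007a) and is not reproduced here: genes are assumed to be indexed by their order along the chromosome, the distance is taken as the difference in gene order along the shorter arc (subtract one if strictly intervening genes are to be counted), and the function and variable names are hypothetical.

```python
def circular_gene_distance(pos_a, pos_b, n_genes):
    """Separation, in gene order positions, along the shorter arc of a circular chromosome."""
    direct = abs(pos_a - pos_b) % n_genes
    return min(direct, n_genes - direct)


def regulon_coordinates(tf_pos, target_positions, n_genes, largest_regulon_size):
    """Return (relative regulon size, relative mean TF-target distance).

    Size is scaled by the largest regulon and distance by the largest possible
    separation, n_genes / 2 (scaling choices assumed from the figure legend).
    """
    distances = [circular_gene_distance(tf_pos, t, n_genes) for t in target_positions]
    mean_distance = sum(distances) / len(distances)
    return len(target_positions) / largest_regulon_size, mean_distance / (n_genes / 2)


# Hypothetical example: a TF at gene position 10 on a 4000-gene chromosome with
# three targets, one of which lies just across the origin.
size, distance = regulon_coordinates(10, [11, 12, 3995], 4000, 400)
```

Taking the minimum of the two arcs is what caps the distance at half the number of genes, as stated in the legend; a linear difference in gene index would overestimate the separation of genes that flank the origin.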
2.2.2.1 Genomic co-localization of TFs and target genes is observed in 
small regulons 
In a previous study we reported a distinct organization of genes coding for transcription factors 
(TFs) and their effector genes (whose products control TFs), depending on whether the effector 
proteins sense signals of endogenous or exogenous origin in Escherichia coli (Janga et al., 
2007a). In this study, we analyze whether this observed distance, when extended to all members 
of a regulon, shows any trends depending on the size of the regulon. It should be noted that 
there is a clear distinction between TF-effector gene pairs and TF-target gene pairs: while the 
product of the former controls the activity of the TF, the latter corresponds to the set of genes 
transcriptionally regulated by the TF (forming part of a regulon). In this work our interest is to 
understand how the chromosomal distances (measured as the number of intervening protein 
coding genes on a circular genome) between a TF and its target genes in different regulons can 
explain or reflect the network structure. To address these questions, we obtained all regulons 
wherein transcription factors regulate at least two genes (excluding auto-regulation) in E. coli 
and in B. subtilis, taken from RegulonDB (Salgado et al., 2006) and DBTBS (Ishii et al., 2001), 
respectively. We included heterodimeric TFs and excluded autoregulatory interactions. In E. 
coli K12, our final dataset contained 141 regulons comprising 1597 regulatory interactions 
between TFs and their regulated genes; in B. subtilis the dataset contained 54 regulons 
comprising 499 genes. First we asked if there is any link between regulon size (the number of 
genes regulated by each TF) and the average chromosomal distance (calculated as the number 
of intervening protein coding genes on the circular chromosome as described earlier (Janga et 
al., 2007a)) between the TF and its target genes in each case. As a result of clustering (see 
Methods), regulons in both organisms can be grouped into three main categories (see Figure 
2-3 panels A and B, and Table 2-2): (A) a few big regulons (10 in E. coli and 7 in B. subtilis) 
regulating more than 50% of the genes in their transcriptional networks (group A in the Figure); 
(B) an intermediate and heterogeneous group of regulons of varying sizes and 
chromosomal distances (group B); and (C) a group of small regulons having short chromosomal 
distances (group C). Notably, small regulons (group C) are smaller than the biggest operons of 
E. coli (15 genes) and B. subtilis (22 genes), possibly suggesting limitations on their sizes to 
act as functional modules either in the context of co-expression or for horizontal transfer (Korbel 
et al., 2004; Pal et al., 2005). The 10 TFs in E. coli with the largest numbers of regulated genes 
are all classified as global regulators according to one or more previous 
studies (Martinez-Antonio and Collado-Vides, 2003), while most of the TFs constituting small 
regulons were found to sense fluctuating external signals, resembling local genetic modules 
(Martinez-Antonio et al., 2006b). In particular, we found that highly connected TFs were either 
nucleoid-associated proteins (NAPs) like IHF, FIS and H-NS, growth-condition-specific regulators 
like CRP (aerobic), FNR and NarL (anaerobic), central intermediary regulators like Lrp and the 
ferric uptake regulator (Fur), or developmental-pathway-associated factors like FlhDC 
responsible for biofilm formation, suggesting that these regulators indeed have key functional 
roles in controlling the transcriptional responses of the cell depending on the condition of 
growth. It is interesting to note that several NAPs, which are known to act as bacterial analogs 
of chromatin remodeling factors, are enriched in this class (see below). Similarly, a functional 
analysis of the TFs from group B suggested that several of them are involved in basic cellular 
activities like regulation of the biosynthesis of amino acids, regulation of cell division and repair, 
regulation of the uptake of elements, cellular stress and response to antibiotics, indicating a 
more limited functional role for these TFs than for those from group A. Finally, an analysis of 
TFs from group C suggested that they are involved in the uptake of carbon sources and the 
degradation of small molecules, and that two-component response regulators are abundant 
among them. To assess whether the average chromosomal distance seen in each group is 
significant, we compared this distance against those seen in randomly generated sets as 
described in Methods. We found that the observed distances for each of the three groups are 
significantly smaller than expected by chance, with regulons from group C being the closest 
(Table 2-2). 
 
Table 2-2. Properties of the main groups of regulons identified in the regulatory network of E. coli, based 
on average chromosomal distance between TF and its target genes. 
 
Regulon   Number of regulons      Regulon size               Total number of regulated   Average distance (in gene    P-value significance
group     (% of total regulons)   (average no. of            genes (% of total           numbers) between the TF      (Z-score)
                                  genes/regulon)             regulated genes)            and the target genes
A         10 (7%)                 76-399 (159)               1595 (99)                   1059.45                      P < 0.001 (-9.98)
B         73 (52%)                2-48 (13)                  953 (59)                    889.69                       P < 0.001 (-10.74)
C         58 (41%)                <12 (4.8)                  281 (17.5)                  2.42                         P < 0.001 (-20.44)
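The Z-scores in Table 2-2 compare the observed average TF-target distance of a group with that of randomly generated sets. The actual randomization scheme is the one described in Methods and is not reproduced in this section, so the sketch below should be read as one plausible null model rather than the procedure used: every target gene is assumed to be re-drawn at a uniformly random position while TF positions and regulon sizes are kept. All names, the number of randomizations and the toy data are hypothetical.

```python
import random
import statistics


def circular_gene_distance(pos_a, pos_b, n_genes):
    """Separation, in gene order positions, along the shorter arc of a circular chromosome."""
    direct = abs(pos_a - pos_b) % n_genes
    return min(direct, n_genes - direct)


def mean_tf_target_distance(regulons, n_genes):
    """`regulons` is a list of (tf_position, [target_positions]) tuples."""
    distances = [circular_gene_distance(tf, t, n_genes)
                 for tf, targets in regulons for t in targets]
    return sum(distances) / len(distances)


def distance_z_score(regulons, n_genes, n_random=1000, seed=0):
    """Z-score and empirical P-value of the observed mean distance.

    Null model (assumed): every target is moved to a uniformly random gene
    position while TF positions and regulon sizes are kept.
    """
    rng = random.Random(seed)
    observed = mean_tf_target_distance(regulons, n_genes)
    null = []
    for _ in range(n_random):
        shuffled = [(tf, [rng.randrange(n_genes) for _ in targets])
                    for tf, targets in regulons]
        null.append(mean_tf_target_distance(shuffled, n_genes))
    z = (observed - statistics.mean(null)) / statistics.stdev(null)
    # One-sided empirical P-value: fraction of random sets at least as compact.
    p = (1 + sum(value <= observed for value in null)) / (n_random + 1)
    return z, p


# Hypothetical group of three small, tightly clustered regulons.
toy_regulons = [(100, [101, 102, 103]), (2000, [1998, 2001]), (3500, [3499, 3501, 3502])]
z, p = distance_z_score(toy_regulons, n_genes=4300)
```

A negative Z-score means the observed distances are shorter than in the random sets, which is the direction reported for all three groups in the table.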
 
2.2.2.2 Transcriptional regulatory flow in the network of TFs 
To find out if there is any coordination between the TFs heading the different groups of regulons 
identified above, we analyzed the regulatory flow among the TFs constituting the regulatory 
network (Dobrin et al., 2004b; Ma et al., 2004a; Yu and Gerstein, 2006). Figures 2-3C and 2-3D 
show the regulatory interactions present between at least two TFs in E. coli and B. subtilis. Note 
that all the TFs of group A are at the top of the network hierarchy, initiating the regulatory 
interactions in the network of TFs. The regulatory flow follows an order, from TF members of 
group A to B to C, and there are no regulatory interactions from members of group C directed to 
B or A, indicating no feedback at the level of transcriptional regulation from the bottom to the 
top. However, there are some regulatory interactions between members of the same group and 
from members of group B towards members of group A. Other approaches for constructing 
hierarchical networks from the TF-TF network, such as the bottom-up strategy (Yu and Gerstein, 
2006), did not change our observation that group A shows a preference to occur at the top of 
the hierarchy while group C appears at the bottom of the hierarchical network. The partitioning 
of the transcriptional network into big, intermediate and small regulons illustrates how the 
network components could be structured on a chromosome in a scale-free distribution, as 
observed in various biological networks (Barabasi and Albert, 1999; Hartwell et al., 1999). It is 
possible to generalize from our observations that the TFs at the bottom of this hierarchy often 
correspond to very specific functional roles, like those sensing specific environmental conditions 
(Lagomarsino et al., 2007). 
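The direction of regulatory flow between the three regulon groups can be tallied directly from the TF-TF edge list. The sketch below is illustrative only: the function name, the input format and the TF names are assumptions, and the toy edges are made up rather than taken from RegulonDB or DBTBS. It counts interactions for every ordered pair of groups, so that upward edges (for example from group C to A or B) would show up as non-zero entries.

```python
from collections import Counter


def group_flow(tf_edges, group_of):
    """Count TF-to-TF regulatory interactions between regulon groups."""
    flow = Counter()
    for regulator, target in tf_edges:
        if regulator == target:
            continue  # autoregulation carries no flow between TFs
        flow[(group_of[regulator], group_of[target])] += 1
    return flow


# Made-up TF names and edges, for illustration only.
group_of = {"bigTF": "A", "midTF": "B", "smallTF": "C"}
tf_edges = [("bigTF", "midTF"), ("bigTF", "smallTF"), ("midTF", "smallTF"),
            ("midTF", "bigTF"), ("smallTF", "smallTF")]
flow = group_flow(tf_edges, group_of)
upward_from_c = flow[("C", "A")] + flow[("C", "B")]
```

In this toy example the tally reproduces the qualitative pattern described above: edges run from A to B and C and from B to C, one edge runs from B back to A, and none leaves group C.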
2.2.2.3 Absolute and average mRNA abundance of TFs suggests 
correlation with regulon size and network hierarchy in E. coli 
It is believed that global regulators should be present at higher concentrations in the cell 
than local or dedicated TFs (Elf et al., 2007). In fact, this is known to hold for 
nucleoid-associated proteins and other global regulators like CRP, Lrp and Fur in E. coli, whose 
protein concentrations reach more than 1000 molecules per cell (Chen et al., 2001; Luijsterburg 
et al., 2006). On the other hand, the number of molecules of LacI, a dedicated TF for lactose 
utilization, rises from around 5 to a maximum of 20 upon induction by lactose (Droge and 
Muller-Hill, 2001). Indeed, early approaches to studying gene expression patterns on a genomic 
scale, which exploited the codon frequency bias of highly expressed cellular machinery such as 
the ribosomal, transcription and chaperone-associated classes, have shown that 
sequence-specific TFs are generally poorly expressed (Karlin and Mrazek, 2000). However, so 
far, no global analysis has been performed to compare TF concentration with connectivity and 
network hierarchy. To address this, we used mRNA profile data from two experiments 
performed in M9+glucose medium, in which the absolute number of mRNA molecules was 
quantified (Covert et al., 2004; Liu et al., 2005; Lu et al., 2007). We obtained the number of 
[Figure 2-4, plots not reproduced. Panel A: out-degree of TFs against mRNA molecules per cell, both on logarithmic axes; fitted trend y = 8.4476x^0.6439, R^2 = 0.2739, P < 10^-4. Panel B: out-degree of TFs (logarithmic axis) against average mRNA expression of TFs; fitted trend y = 0.2073e^(0.4521x), R^2 = 0.2614, P < 10^-4.]
Figure 2-4: A) Relationship between mRNA abundance and out-degree of a TF in the regulatory network 
of E. coli. TFs are colored as per their grouping in Figure 2-3, with big regulons in red, intermediate ones 
in green and small regulons in blue. The bigger the regulon, the stronger the tendency of its TF to be 
expressed at higher concentrations. B) Relationship between out-degree of a TF and its average mRNA 
level, calculated after processing and normalizing the expression data according to RMA normalization, 
as reported by the authors (Faith et al., 2007). 
 
mRNA molecules (per cell) of genes encoding TFs from this dataset, to see if it correlates 
with their connectivity and grouping as identified in Figure 2-3 (see Figure 2-4). We found that 
TFs higher in the network hierarchy had a greater number of mRNA molecules per cell 
associated with them, suggesting that more protein molecules are produced (Figure 2-4A). To 
investigate further the relationship between the concentration of a TF and its position in the 
network hierarchy, we compared each TF’s out-degree against its average gene expression 
using a large compendium of E. coli microarrays reported recently (Faith et al., 2007). We found 
that a TF’s out-degree and its average mRNA level across experiments follow the hierarchy 
described above (Figure 2-4B). Our 
results suggest that several regulators from group C identified in Figure 2-3A are poorly 
expressed, consistent with previous observations that two-component systems, which are 
enriched in group C and are proximal on the chromosome, show poor predicted expression 
values using codon usage measures (Janga et al., 2007a; Karlin and Mrazek, 2000). If we 
assume that mRNA formation is a determining step in protein synthesis, these data might 
correspond to the absolute protein concentrations of the respective TFs per bacterial cell, 
implying a correlation between a TF’s out-degree and its concentration and extending 
previous studies (Lozada-Chavez et al., 2008; Martinez-Antonio et al., 2008; Seshasayee et al., 
2009). These observations indicate that the concentration of a TF is related to the way it 
is encoded on the chromosome with respect to its target genes, with local TFs regulating few 
genes being present in physical proximity to their target genes and global TFs facilitating the 
regulation of many genes by increasing their cellular concentration. Indeed, it has been 
postulated using simulations that low copy number TFs need to colocalize with their targets to 
enable rapid and reliable gene regulation, supporting the need to place low copy local TFs in 
physical proximity to their targets in the genome (Kolesov et al., 2007). Proteome profiles for 
TFs were limited to a small number of proteins until recently, when two large-scale proteomic 
experiments were reported for E. coli (Ishihama et al., 2008; Lu et al., 2007). Excluding the 
nucleoid-associated proteins, which are discussed below, we could obtain protein concentrations 
for 25 TFs belonging to different levels of the E. coli network from these experiments (Figure 2-5). 
Consistent with our observations at the mRNA level, TFs with high intracellular levels 
corresponded to those with high out-degree when protein concentration is plotted as a function 
of the number of target genes. With respect to NAPs, these high-throughput experiments confirm 
their high abundance reported almost ten years ago using quantitative western blot analyses 
(Ali Azam et al., 1999). Indeed, a closer look at the peak expression by the same authors 
suggested that the production of these NAPs is distributed along the bacterial growth phases 
(see Figure 2-6). The high cellular levels of these proteins, with concentrations varying between 
20000 and 50000 molecules per cell, made it possible to estimate that on average one monomer 
may bind every 500 bp along the genomic DNA (Ali Azam et al., 1999). In summary, in 
agreement with the data for mRNAs, we observe that protein abundance correlates with the 
out-degree of a TF in the network, with NAPs being particularly abundant and expressed in a 
growth-phase dependent manner, possibly to re-structure the nucleoid, facilitating the running of 
particular transcriptional programs depending on growth phase status (see below) (Ali Azam et 
al., 1999; Luijsterburg et al., 2006; Marr et al., 2008). Therefore, one can hypothesize from these 
results that the scale-free structure of bacterial transcriptional regulatory networks is encoded in the 
chromosome itself and that genome organization of bacterial chromosomes might indeed be 
influenced by their TRNs. 
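The power-law trend reported for out-degree against mRNA abundance (Figure 2-4A) is the kind of relationship that can be estimated by ordinary least squares on log-transformed values. How the published fit was obtained is not stated in this section, so the sketch below is only one common way of doing it; the function name is hypothetical and the data are synthetic, generated from a known power law rather than taken from the expression datasets.

```python
import math


def fit_power_law(x, y):
    """Least-squares fit of y = a * x**b on log10-transformed data; returns (a, b, r2)."""
    log_x = [math.log10(v) for v in x]
    log_y = [math.log10(v) for v in y]
    n = len(log_x)
    mean_x, mean_y = sum(log_x) / n, sum(log_y) / n
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    syy = sum((v - mean_y) ** 2 for v in log_y)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    b = sxy / sxx                       # slope in log-log space = exponent
    a = 10 ** (mean_y - b * mean_x)     # intercept in log-log space = prefactor
    r2 = sxy ** 2 / (sxx * syy)         # coefficient of determination
    return a, b, r2


# Synthetic, noise-free data generated from y = 2 * x**0.5 (not thesis data).
abundance = [0.5, 1.0, 4.0, 16.0, 64.0]
out_degree = [2 * v ** 0.5 for v in abundance]
a, b, r2 = fit_power_law(abundance, out_degree)
```

Because the fit is done in log space, the coefficient of determination it returns describes the scatter of the log-transformed values, which is the natural reading for a plot drawn on logarithmic axes.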
 
Figure 2-5: Number of proteins/cell for TFs, as a function of the number of genes each TF 
transcriptionally regulates (excluding NAPs, as their protein levels are shown in Figure 2-6D, and 
ArcA and NarL, which are known to be poorly expressed in the aerobic condition in which the 
experiment was performed). Note that proteomic data are available for only 25 TFs. 
 
 
2.2.2.4 A conceptual model for the structuring of regulatory networks 
in bacteria 
In the integrated model we propose here (Figure 2-6), the biophysical aspects of how TFs 
reach their DNA-binding sites might be the main driving force for structuring the regulatory 
networks in bacteria as we presently know them. This conceptual model is supported by the 
following observations and evidence: 
1) TFs governing small regulons are located close to their regulated genes on the chromosome, 
and this spatial arrangement, together with the fact that transcription and translation occur 
simultaneously, should allow the newly synthesized protein to quickly contact its target DNA 
through the sliding and hopping mechanism, as was shown in the case of 
LacI (Elf et al., 2007; Wang et al., 2006) (Figure 2-6C). These local regulators are normally 
expressed at lower cellular concentrations as they would be required only sporadically to 
regulate a few operons whose products have dedicated functions. For instance, the use of 
alternative carbon sources in E. coli is mainly governed by the global regulator CRP and a group 
of local TFs controlling small regulons which are located proximally on the chromosome with 
respect to their target genes (Figure 2-6B). The role of the products encoded in these small 
regulons is to transport alternative sugars and carry out the first steps of their catabolism until 
it converges on the glycolysis pathway. Additionally, note that most of these TFs in bacteria are 
autoregulated (Martinez-Antonio et al., 2008). Thus, this sliding mechanism could be a 
generalized strategy for a quicker and tighter control of TFs over their own expression (Alon, 
2007a). 
[Figure 2-6, schematic not reproduced. Labels recoverable from the panels: sliding (local TFs); diffusion and jump (global TFs); effector metabolite; RNAP; mRNA; analogous regulation. Panel D: NAP molecules/cell (axis from 0 to 60000) for FIS, HU, H-NS and IHF along bacterial growth.]
Figure 2-6: Integrated model of the transcriptional regulatory network in bacteria: (A) combined model 
representing the various factors involved; (B) activity and mechanistic basis for the functioning of local TFs; (C) 
an example of global and local TFs co-regulating genes involved in the uptake of carbon sources in E. 
coli; (D) protein abundance of different nucleoid-associated proteins along the growth phases, acting as 
analog regulators. 
 
2) In contrast, global regulators, which are distantly located with respect to the large number of 
genes they regulate, employ a different strategy. Accurate targeting of DNA seems to be 
achieved by raising the concentration of the respective TFs, and the actual mechanism used for 
binding DNA would be 3-D diffusion and jumping between DNA strands (Figure 2-6A and 
CRP path in Figure 2-6B). The large cellular concentrations of these proteins might be 
maintained, in part, because most global regulators are autoregulated both positively and 
negatively (Martinez-Antonio et al., 2008); such autoregulation would help to keep these 
proteins at high intracellular levels. 
3) A third major player in gene regulation in bacteria is the way the DNA molecule is packed 
into the nucleoid (Ali Azam et al., 1999; Luijsterburg et al., 2006; Zimmerman, 2006). Recent 
studies provide evidence that the DNA molecule is organized into loops of different lengths 
(10-100 kbp), which bring into spatial proximity DNA regions that would 
otherwise be distant on a linear molecule of DNA (Kepes, 2004; Marenduzzo et al., 2007; 
Postow et al., 2004; Riva et al., 2008). Although the exact coordinates of these DNA loops are 
yet to be unveiled even in well-studied systems like E. coli, it is known that nucleoid-associated 
proteins (NAPs) are specifically engaged in structuring DNA depending on the growth condition. 
These proteins bridge or bend the DNA molecule, facilitating DNA looping and the structuring of 
the nucleoid (Luijsterburg et al., 2006; Zimmerman, 2006). In particular, NAPs have been shown 
to be expressed in a growth-phase dependent manner, with FIS at the beginning of stationary 
phase, H-NS in the mid-exponential and IHF in the arrested phase (see Figure 2-6D) (Ali Azam 
et al., 1999). These observations suggest that NAPs might structure the DNA molecule in a 
different way depending on the growth phase, and that this action should make only a section 
of the DNA template accessible to global and local regulators for the running of specific 
transcriptional programs. Accordingly, it has been suggested that NAPs act as analog regulators 
whereas the rest of the TFs, which respond to specific conditions (e.g. by binding signal 
effectors), act as digital regulators (Marr et al., 2008; Travers and Muskhelishvili, 2007) (Figure 2-6C). 
2.3 DISCUSSION & CONCLUSION 
Our structural analysis of the transcriptional cross-regulatory network in E. coli suggests that 
regulatory interactions between TFs are predominantly positive, while autoregulatory 
interactions are mostly negative. We also note that there are striking topological differences 
between the subnetworks controlling metabolic activities, such as carbon metabolism, and those 
controlling developmental processes; the former encompass many parallel short 
transcriptional cascades and multiple FFLs, each enabling the use of one alternative carbon 
source, while the latter involve long and intertwined regulatory cascades. These long 
transcriptional cascades typically include multiple autoactivated intermediate TFs, as well as 
regulatory circuits between TFs and sigma factors in the case of biofilm formation. 
We further observe that TFs acting at the end of these regulatory cascades often belong 
to two-component systems. This topology suggests that cell homeostasis is maintained through 
multiple regulatory cascades with commonly autorepressed TFs, while the regulatory memory 
within the network is preserved by the sequential activation of TFs and by multi-element circuits 
at the core of the network. Downstream of the hierarchical network, two-component systems 
can memorise transient external signals through autoactivation loops, thus acting as molecular 
switches enabling the coexistence of alternative phenotypes. 
As shown in a recent study, the E. coli cross-regulatory network appears robust enough to 
tolerate rewiring between members high and low in the network hierarchy (Isalan et al., 
2008). This study also indicated that allosteric signals are mandatory input elements for 
network function. Thus, TFs present in a condition different from the natural one(s) would have 
limited activity due to the absence of their effector signals. In this respect, a proper global 
understanding of the organisation of the E. coli transcriptional network (combining sigma factors 
and TFs) could contribute to the interpretation of network-rewiring experiments as well as foster 
more efficient design of synthetic regulatory circuits. 
It is important to note that the generality of the observed organization of the E. coli 
transcriptional cross-regulatory network remains to be assessed. Nevertheless, a more 
comprehensive picture of the network organisation in bacteria will progressively be drawn as 
additional regulatory elements such as small RNAs, anti-sigma factors and riboswitches are 
integrated (Gama-Castro et al., 2008). In addition, the combination of transcriptional and 
metabolic networks should provide important insights by linking effector metabolites and 
regulatory elements. Variations in regulatory network topology might be expected in the 
case of bacteria with asymmetric cell division (mostly alpha-proteobacteria), where a transient 
genetic asymmetry between the offspring cells triggers different developmental 
processes, such as the formation of stalked and swarmer cells in Caulobacter or vegetative and 
spore-forming cells in Bacillus (Ausmees and Jacobs-Wagner, 2003; Dworkin, 2003; Dworkin 
and Losick, 2001; Hilbert and Piggot, 2004; Yudkin and Clarkson, 2005). Future comparisons 
between network topologies for different model systems should further enhance our 
understanding of regulatory network organization and its conservation or variations among 
different bacterial phyla. 
Our analysis linking the transcriptional hierarchy, genome organization and expression 
dynamics of TFs suggests that TFs high up in the hierarchy are present at higher numbers of 
mRNA and protein molecules per cell, reflecting their need to be expressed at higher 
concentrations to regulate target genes dispersed across the chromosome. In contrast to TFs with 
big regulons, local or dedicated TFs (lower in the network hierarchy) were found to be expressed 
at much lower concentrations, which could explain their proximity on the chromosome to their 
target genes. These observations give insights into how the scale-free structure of transcriptional 
networks can be encoded on the chromosome to drive the kinetics and concentration gradients 
of TFs, depending on the number of genes they regulate, and could facilitate the horizontal 
transfer of local environment-specific transcriptional modules. Although our distance 
calculations do not take into account the three-dimensional topology of the chromosome under a 
given cellular condition, the chromosomal proximity of TFs to their targets in the case of small 
regulons cannot be explained by chance alone. In the case of global TFs, one can argue that, 
because they regulate many genes, their average linear chromosomal distance could 
overestimate their actual intracellular distance from their targets, considering the dynamic 
nature of the nucleoid. In addition, global TFs, with their fuzzy binding sites in contrast to 
local TFs, could complement their affinity for targets by increasing their concentrations to a 
sufficient degree when needed (Kolesov et al., 2007; Lozada-Chavez et al., 2008). Thus, our 
results suggest that transcriptional regulatory networks play an important role in shaping the 
organization of genes in genomes. These observations illustrate how bacteria, as simple 
biological systems, fit predicted theoretical principles in order to optimize their cellular 
performance with a compact genome. 
2.4 METHODS 
2.4.1 Identification of regulon groups 
To identify the different regulon groups based on normalized regulon size and normalized 
average chromosomal distance between a TF and its target genes (TGs) in a regulon, we used 
K-means clustering as implemented in Cluster (de Hoon et al., 2004). To find the number of 
distinct clusters present in the data, we first varied the number of clusters (the 
number-of-clusters parameter in K-means clustering) and recorded how many times the optimal 
solution was found in 1000 runs using Euclidean distance as the similarity metric. When the 
number of clusters was set to 3, the optimal solution was found 350 times out of 1000 runs, 
whereas when the number of clusters was set to 2, 4 or 5, the optimal solution was reached 120, 
167 and 84 times, respectively, suggesting that the number of clusters in the set is indeed 3. 
Similar approaches have been used by others to assess the significance of clusters obtained 
with other clustering approaches, as principled clustering frequently results in suboptimal 
solutions in a single run (Slonim et al., 2006). 
To determine the composition of the clusters, we ran the K-means clustering algorithm 
using 3 as the number of clusters and 1000 as the number of runs. However, since different 
runs of the K-means clustering algorithm may not give the same final clustering solution, we 
repeated this experiment 10 times and took a consensus of the groupings identified in 
these runs. We repeated the whole approach to identify the distinct clusters in B. subtilis. 
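
The run-counting step described above can be illustrated with a minimal Python sketch. It is not the Cluster implementation used in this work: the function names (kmeans, times_optimum_found), the random initialisation from data points, the tolerance used to decide that two runs reached the same optimum, and the toy data are all assumptions made for illustration only.

```python
import random

def kmeans(points, k, rng, max_iter=100):
    """One K-means run (Lloyd's algorithm, squared Euclidean distance).

    Returns (labels, within-cluster sum of squares)."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    centers = rng.sample(points, k)  # random initial centres drawn from the data
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: sqdist(pt, centers[c]))
                      for pt in points]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:  # an emptied cluster keeps its previous centre
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    wcss = sum(sqdist(pt, centers[lab]) for pt, lab in zip(points, labels))
    return labels, wcss

def times_optimum_found(points, k, n_runs=1000, seed=0, tol=1e-9):
    """Count how many of n_runs independent runs reach the best score seen."""
    rng = random.Random(seed)
    scores = [kmeans(points, k, rng)[1] for _ in range(n_runs)]
    best = min(scores)
    return sum(1 for s in scores if s - best <= tol)

# Hypothetical toy data: (normalized regulon size, normalized distance) pairs.
toy = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
       (1.0, 1.0), (0.9, 1.0), (1.0, 0.9),
       (0.0, 1.0), (0.1, 1.0), (0.0, 0.9)]

for k in (2, 3, 4, 5):
    print(k, times_optimum_found(toy, k, n_runs=200))
```

In this sketch the value of k whose best solution is recovered most often across runs would be taken as the most reproducible number of clusters, mirroring the reasoning used above.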
2.4.2 Estimating the statistical significance of the regulon groups 
To calculate the probability of observing by chance the chromosomal distances seen in each 
regulon group, we compared the average chromosomal distance observed in each regulon group 
against the average chromosomal distances seen in 1000 randomly generated regulon groups 
obtained by preserving the number of regulatory interactions for each TF in a regulon group. 
Such a randomization preserves the number of TFs and interactions in a regulon group but 
associates each TF with randomly selected genes from the complete genome, thus preserving the 
topology of the regulon group while shuffling the genomic organization of the targets with 
respect to their regulating TF. 
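
As a rough illustration of this randomization, the following Python sketch keeps each TF and its number of targets while drawing the targets at random from the genome. The function names, the toy gene positions and, in particular, the use of the shorter arc on a circular chromosome as the distance measure are assumptions made for this example and are not taken from the analysis itself.

```python
import random

def circular_distance(a, b, genome_length):
    """Shorter of the two arcs between positions a and b on a circular chromosome."""
    d = abs(a - b) % genome_length
    return min(d, genome_length - d)

def mean_group_distance(regulons, positions, genome_length):
    """Average TF-target distance over all interactions in a regulon group."""
    dists = [circular_distance(positions[tf], positions[tg], genome_length)
             for tf, targets in regulons.items() for tg in targets]
    return sum(dists) / len(dists)

def randomize_group(regulons, all_genes, rng):
    """Keep every TF and its number of targets; pick the targets at random."""
    return {tf: rng.sample(all_genes, len(targets))
            for tf, targets in regulons.items()}

# Hypothetical toy genome of length 100 with six genes.
positions = {"tfA": 0, "g1": 10, "g2": 90, "g3": 50, "g4": 40, "tfB": 60}
regulons = {"tfA": ["g1", "g2"], "tfB": ["g3", "g4"]}

observed = mean_group_distance(regulons, positions, 100)
rng = random.Random(0)
random_means = [
    mean_group_distance(randomize_group(regulons, list(positions), rng), positions, 100)
    for _ in range(1000)
]
print(observed, sum(random_means) / len(random_means))
```

Note that this sketch allows a TF to be drawn as its own target, which corresponds to autoregulation; whether such cases should be excluded is a design choice left open here.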
Statistical significance was assessed based on (i) the Z-score, calculated as the number of 
standard deviations by which the observed value differs from the randomly expected mean, i.e. 
the ratio of the difference between the observed value, x, and the randomly expected mean, μ, 
to the standard deviation, σ: Z = (x − μ)/σ; and (ii) the p-value, defined as the fraction of 
the 1000 random trials that showed a value ≥ that observed in the real dataset. 
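
Both statistics can be computed directly from the list of values obtained in the random trials, as in the minimal Python sketch below. The function names and the toy numbers are hypothetical, and the use of the population standard deviation (rather than the sample standard deviation) is an assumption, since the text does not specify which was used.

```python
import statistics

def z_score(observed, random_values):
    """Z = (x - mu) / sigma, with mu and sigma taken from the random trials."""
    mu = statistics.mean(random_values)
    sigma = statistics.pstdev(random_values)  # population SD (assumption)
    return (observed - mu) / sigma

def empirical_p_value(observed, random_values):
    """Fraction of random trials with a value >= the observed one."""
    return sum(1 for v in random_values if v >= observed) / len(random_values)

# Hypothetical toy example with five random trials instead of 1000.
random_values = [1, 2, 3, 4, 5]
print(z_score(5, random_values), empirical_p_value(5, random_values))
```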
REFERENCES 
Aizawa, S. I. and Kubori, T. (1998). Bacterial flagellation and cell division. Genes Cells 3, 625-
34. 
Aldridge, P. and Hughes, K. T. (2002). Regulation of flagellar assembly. Curr Opin Microbiol 5, 
160-5. 
Ali Azam, T., Iwata, A., Nishimura, A., Ueda, S. and Ishihama, A. (1999). Growth phase-
dependent variation in protein composition of the Escherichia coli nucleoid. J Bacteriol 181, 
6361-70. 
Alon, U. (2007a). An Introduction to Systems Biology: Design Principles of Biological Circuits. 
London, UK: Chapman & Hall/CRC. 
Alon, U. (2007b). Network motifs: theory and experimental approaches. Nat Rev Genet 8, 450-
61. 
Ausmees, N. and Jacobs-Wagner, C. (2003). Spatial and temporal control of differentiation 
and cell cycle progression in Caulobacter crescentus. Annu Rev Microbiol 57, 225-47. 
Balaban, N. Q., Merrin, J., Chait, R., Kowalik, L. and Leibler, S. (2004). Bacterial persistence 
as a phenotypic switch. Science 305, 1622-5. 
Bar-Joseph, Z., Gerber, G. K., Lee, T. I., Rinaldi, N. J., Yoo, J. Y., Robert, F., Gordon, D. B., 
Fraenkel, E., Jaakkola, T. S., Young, R. A. et al. (2003). Computational discovery of gene 
modules and regulatory networks. Nat Biotechnol 21, 1337-42. 
Barabasi, A. L. and Albert, R. (1999). Emergence of scaling in random networks. Science 286, 
509-12. 
Berg, O. G., Winter, R. B. and von Hippel, P. H. (1981). Diffusion-driven mechanisms of 
protein translocation on nucleic acids. 1. Models and theory. Biochemistry 20, 6929-48. 
Bettenbrock, K., Sauter, T., Jahreis, K., Kremling, A., Lengeler, J. W. and Gilles, E. D. 
(2007). Correlation between growth rates, EIIACrr phosphorylation, and intracellular cyclic AMP 
levels in Escherichia coli K-12. J Bacteriol 189, 6891-900. 
Browning, D. F. and Busby, S. J. (2004). The regulation of bacterial transcription initiation. Nat 
Rev Microbiol 2, 57-65. 
Cai, L., Friedman, N. and Xie, X. S. (2006). Stochastic protein expression in individual cells at 
the single molecule level. Nature 440, 358-62. 
Cases, I. and de Lorenzo, V. (2005). Promoters in the environment: transcriptional regulation 
in its natural context. Nat Rev Microbiol 3, 105-18. 
Chantratita, N., Wuthiekanun, V., Boonbumrung, K., Tiyawisutsri, R., Vesaratchavest, M., 
Limmathurotsakul, D., Chierakul, W., Wongratanacheewin, S., Pukritiyakamee, S., White, 
N. J. et al. (2007). Biological relevance of colony morphology and phenotypic switching by 
Burkholderia pseudomallei. J Bacteriol 189, 807-17. 
Chen, S., Hao, Z., Bieniek, E. and Calvo, J. M. (2001). Modulation of Lrp action in Escherichia 
coli by leucine: effects on non-specific binding of Lrp to DNA. J Mol Biol 314, 1067-75. 
Cherstvy, A. G., Kolomeisky, A. B. and Kornyshev, A. A. (2008). Protein--DNA interactions: 
reaching and recognizing the targets. J Phys Chem B 112, 4741-50. 
Covert, M. W., Knight, E. M., Reed, J. L., Herrgard, M. J. and Palsson, B. O. (2004). 
Integrating high-throughput and computational data elucidates bacterial networks. Nature 429, 
92-6. 
de Hoon, M. J., Imoto, S., Nolan, J. and Miyano, S. (2004). Open source clustering software. 
Bioinformatics 20, 1453-4. 
Dekel, E., Mangan, S. and Alon, U. (2005). Environmental selection of the feed-forward loop 
circuit in gene-regulation networks. Phys Biol 2, 81-8. 
Deutscher, J., Francke, C. and Postma, P. W. (2006). How phosphotransferase system-
related protein phosphorylation regulates carbohydrate metabolism in bacteria. Microbiol Mol 
Biol Rev 70, 939-1031. 
Dobrin, R., Beg, Q. K., Barabasi, A. L. and Oltvai, Z. N. (2004a). Aggregation of topological 
motifs in the Escherichia coli transcriptional regulatory network. BMC Bioinformatics 5, 10. 
Dobrin, R., Beg, Q. K., Barabasi, A. L. and Oltvai, Z. N. (2004b). Aggregation of topological 
motifs in the Escherichia coli transcriptional regulatory network. BMC Bioinformatics 5, 10. 
Droge, P. and Muller-Hill, B. (2001). High local protein concentrations at promoters: strategies 
in prokaryotic and eukaryotic cells. Bioessays 23, 179-83. 
Dworkin, J. (2003). Transient genetic asymmetry and cell fate in a bacterium. Trends Genet 19, 
107-12. 
Dworkin, J. and Losick, R. (2001). Differential gene expression governed by chromosomal 
spatial asymmetry. Cell 107, 339-46. 
Ehrlich, G. D., Hu, F. Z., Shen, K., Stoodley, P. and Post, J. C. (2005). Bacterial plurality as a 
general mechanism driving persistence in chronic infections. Clin Orthop Relat Res, 20-4. 
Elf, J., Li, G. W. and Xie, X. S. (2007). Probing transcription factor dynamics at the single-
molecule level in a living cell. Science 316, 1191-4. 
Ermolaeva, M. D., White, O. and Salzberg, S. L. (2001). Prediction of operons in microbial 
genomes. Nucleic Acids Res 29, 1216-21. 
Faith, J. J., Hayete, B., Thaden, J. T., Mogno, I., Wierzbowski, J., Cottarel, G., Kasif, S., 
Collins, J. J. and Gardner, T. S. (2007). Large-scale mapping and validation of Escherichia 
coli transcriptional regulation from a compendium of expression profiles. PLoS Biol 5, e8. 
Gama-Castro, S., Jimenez-Jacinto, V., Peralta-Gil, M., Santos-Zavaleta, A., Penaloza-
Spinola, M. I., Contreras-Moreira, B., Segura-Salazar, J., Muniz-Rascado, L., Martinez-
Flores, I., Salgado, H. et al. (2008). RegulonDB (version 6.0): gene regulation model of 
Escherichia coli K-12 beyond transcription, active (experimental) annotated promoters and 
Textpresso navigation. Nucleic Acids Res 36, D120-4. 
Golding, I., Paulsson, J., Zawilski, S. M. and Cox, E. C. (2005). Real-time kinetics of gene 
activity in individual bacteria. Cell 123, 1025-36. 
Gowers, D. M., Wilson, G. G. and Halford, S. E. (2005). Measurement of the contributions of 
1D and 3D pathways to the translocation of a protein along DNA. Proc Natl Acad Sci U S A 102, 
15883-8. 
Griffith, K. L. and Wolf, R. E., Jr. (2004). Genetic evidence for pre-recruitment as the 
mechanism of transcription activation by SoxS of Escherichia coli: the dominance of DNA 
binding mutations of SoxS. J Mol Biol 344, 1-10. 
Guelzim, N., Bottani, S., Bourgine, P. and Kepes, F. (2002). Topological and causal structure 
of the yeast transcriptional regulatory network. Nat Genet 31, 60-3. 
Guerrero, A., Jain, N., Goldman, D. L. and Fries, B. C. (2006). Phenotypic switching in 
Cryptococcus neoformans. Microbiology 152, 3-9. 
Gutierrez-Rios, R. M., Rosenblueth, D. A., Loza, J. A., Huerta, A. M., Glasner, J. D., 
Blattner, F. R. and Collado-Vides, J. (2003). Regulatory network of Escherichia coli: 
consistency between literature knowledge and microarray profiles. Genome Res 13, 2435-43. 
Handke, L. D., Conlon, K. M., Slater, S. R., Elbaruni, S., Fitzpatrick, F., Humphreys, H., 
Giles, W. P., Rupp, M. E., Fey, P. D. and O'Gara, J. P. (2004). Genetic and phenotypic 
analysis of biofilm phenotypic variation in multiple Staphylococcus epidermidis isolates. J Med 
Microbiol 53, 367-74. 
Hartwell, L. H., Hopfield, J. J., Leibler, S. and Murray, A. W. (1999). From molecular to 
modular cell biology. Nature 402, C47-52. 
Hengge-Aronis, R. (2002). Signal transduction and regulatory mechanisms involved in control 
of the sigma(S) (RpoS) subunit of RNA polymerase. Microbiol Mol Biol Rev 66, 373-95, table of 
contents. 
Hilbert, D. W. and Piggot, P. J. (2004). Compartmentalization of gene expression during 
Bacillus subtilis spore formation. Microbiol Mol Biol Rev 68, 234-62. 
Hooshangi, S., Thiberge, S. and Weiss, R. (2005). Ultrasensitivity and noise propagation in a 
synthetic transcriptional cascade. Proc Natl Acad Sci U S A 102, 3581-6. 
Hu, L., Grosberg, A. Y. and Bruinsma, R. (2008). Are DNA Transcription Factor Proteins 
Maxwellian Demons? Biophys J. 
Hu, Z., Ng, D. M., Yamada, T., Chen, C., Kawashima, S., Mellor, J., Linghu, B., Kanehisa, 
M., Stuart, J. M. and DeLisi, C. (2007). VisANT 3.0: new modules for pathway visualization, 
editing, prediction and construction. Nucleic Acids Res 35, W625-32. 
Ihmels, J., Bergmann, S. and Barkai, N. (2004). Defining transcription modules using large-
scale gene expression data. Bioinformatics 20, 1993-2003. 
Ihmels, J., Friedlander, G., Bergmann, S., Sarig, O., Ziv, Y. and Barkai, N. (2002). Revealing 
modular organization in the yeast transcriptional network. Nat Genet 31, 370-7. 
Isalan, M., Lemerle, C., Michalodimitrakis, K., Horn, C., Beltrao, P., Raineri, E., Garriga-
Canut, M. and Serrano, L. (2008). Evolvability and hierarchy in rewired bacterial gene 
networks. Nature 452, 840-5. 
Ishihama, Y., Schmidt, T., Rappsilber, J., Mann, M., Hartl, F. U., Kerner, M. J. and 
Frishman, D. (2008). Protein abundance profiling of the Escherichia coli cytosol. BMC 
Genomics 9, 102. 
Ishii, T., Yoshida, K., Terai, G., Fujita, Y. and Nakai, K. (2001). DBTBS: a database of 
Bacillus subtilis promoters and transcription factors. Nucleic Acids Res 29, 278-80. 
Jacob, F. (1970). La Logique du Vivant, Une Histoire de L'Hérédité. Paris: Gallimard. 
Jacob, F., Perrin, D., Sanchez, C. and Monod, J. (1960). [Operon: a group of genes with the 
expression coordinated by an operator.]. C R Hebd Seances Acad Sci 250, 1727-9. 
Janga, S. C., Salgado, H., Collado-Vides, J. and Martinez-Antonio, A. (2007a). Internal 
versus external effector and transcription factor gene pairs differ in their relative chromosomal 
position in Escherichia coli. J Mol Biol 368, 263-72. 
Janga, S. C., Salgado, H., Martinez-Antonio, A. and Collado-Vides, J. (2007b). Coordination 
logic of the sensing machinery in the transcriptional regulatory network of Escherichia coli. 
Nucleic Acids Res 35, 6963-6972. 
Jeong, H., Tombor, B., Albert, R., Oltvai, Z. N. and Barabasi, A. L. (2000). The large-scale 
organization of metabolic networks. Nature 407, 651-4. 
Kalir, S. and Alon, U. (2004). Using a quantitative blueprint to reprogram the dynamics of the 
flagella gene network. Cell 117, 713-20. 
Kalir, S., McClure, J., Pabbaraju, K., Southward, C., Ronen, M., Leibler, S., Surette, M. G. 
and Alon, U. (2001). Ordering genes in a flagella pathway by analysis of expression kinetics 
from living bacteria. Science 292, 2080-3. 
Kamoun, S. and Kado, C. I. (1990). Phenotypic Switching Affecting Chemotaxis, Xanthan 
Production, and Virulence in Xanthomonas campestris. Appl Environ Microbiol 56, 3855-3860. 
Karlin, S. and Mrazek, J. (2000). Predicted highly expressed genes of diverse prokaryotic 
genomes. J Bacteriol 182, 5238-50. 
Kashtan, N. and Alon, U. (2005). Spontaneous evolution of modularity and network motifs. 
Proc Natl Acad Sci U S A 102, 13773-8. 
Kepes, F. (2004). Periodic transcriptional organization of the E.coli genome. J Mol Biol 340, 
957-64. 
Kolesov, G., Wunderlich, Z., Laikova, O. N., Gelfand, M. S. and Mirny, L. A. (2007). How 
gene order is influenced by the biophysics of transcription regulation. Proc Natl Acad Sci U S A 
104, 13948-53. 
Korbel, J. O., Jensen, L. J., von Mering, C. and Bork, P. (2004). Analysis of genomic context: 
prediction of functional associations from conserved bidirectionally transcribed gene pairs. Nat 
Biotechnol 22, 911-7. 
Lagomarsino, M. C., Jona, P., Bassetti, B. and Isambert, H. (2007). Hierarchy and feedback 
in the evolution of the Escherichia coli transcription network. Proc Natl Acad Sci U S A 104, 
5516-20. 
Liu, M., Durfee, T., Cabrera, J. E., Zhao, K., Jin, D. J. and Blattner, F. R. (2005). Global 
transcriptional programs reveal a carbon source foraging strategy by Escherichia coli. J Biol 
Chem 280, 15921-7. 
Lozada-Chavez, I., Angarica, V. E., Collado-Vides, J. and Contreras-Moreira, B. (2008). 
The role of DNA-binding specificity in the evolution of bacterial regulatory networks. J Mol Biol 
379, 627-43. 
Lu, P., Vogel, C., Wang, R., Yao, X. and Marcotte, E. M. (2007). Absolute protein expression 
profiling estimates the relative contributions of transcriptional and translational regulation. Nat 
Biotechnol 25, 117-24. 
Luijsterburg, M. S., Noom, M. C., Wuite, G. J. and Dame, R. T. (2006). The architectural role 
of nucleoid-associated proteins in the organization of bacterial chromatin: a molecular 
perspective. J Struct Biol 156, 262-72. 
Ma, H. W., Buer, J. and Zeng, A. P. (2004a). Hierarchical structure and modules in the 
Escherichia coli transcriptional regulatory network revealed by a new top-down approach. BMC 
Bioinformatics 5, 199. 
Ma, H. W., Kumar, B., Ditges, U., Gunzer, F., Buer, J. and Zeng, A. P. (2004b). An extended 
transcriptional regulatory network of Escherichia coli and analysis of its hierarchical structure 
and network motifs. Nucleic Acids Res 32, 6643-9. 
Maas, W. K., Maas, R., Wiame, J. M. and Glansdorff, N. (1964). Studies On The Mechanism 
Of Repression Of Arginine Biosynthesis In Escherichia Coli. I. Dominance Of Repressibility In 
Zygotes. J Mol Biol 78, 359-64. 
Macnab, R. M. (2003). How bacteria assemble flagella. Annu Rev Microbiol 57, 77-100. 
Madan Babu, M. and Teichmann, S. A. (2003). Evolution of transcription factors and the gene 
regulatory network in Escherichia coli. Nucleic Acids Res 31, 1234-44. 
Makita, Y., Nakao, M., Ogasawara, N. and Nakai, K. (2004). DBTBS: database of 
transcriptional regulation in Bacillus subtilis and its contribution to comparative genomics. 
Nucleic Acids Res 32, D75-7. 
Mangan, S. and Alon, U. (2003). Structure and function of the feed-forward loop network motif. 
Proc Natl Acad Sci U S A 100, 11980-5. 
Mangan, S., Itzkovitz, S., Zaslaver, A. and Alon, U. (2005). The Incoherent Feed-forward 
Loop Accelerates the Response-time of the gal System of Escherichia coli. J Mol Biol. 
Mangan, S., Itzkovitz, S., Zaslaver, A. and Alon, U. (2006). The incoherent feed-forward loop 
accelerates the response-time of the gal system of Escherichia coli. J Mol Biol 356, 1073-81. 
Mangan, S., Zaslaver, A. and Alon, U. (2003). The coherent feedforward loop serves as a 
sign-sensitive delay element in transcription networks. J Mol Biol 334, 197-204. 
Marenduzzo, D., Faro-Trindade, I. and Cook, P. R. (2007). What are the molecular ties that 
maintain genomic loops? Trends Genet 23, 126-33. 
Marr, C., Geertz, M., Hutt, M. T. and Muskhelishvili, G. (2008). Dissecting the logical types of 
network control in gene expression profiles. BMC Syst Biol 2, 18. 
Martin, R. G., Gillette, W. K. and Rosner, J. L. (2000). Promoter discrimination by the related 
transcriptional activators MarA and SoxS: differential regulation by differential binding. Mol 
Microbiol 35, 623-34. 
Martinez-Antonio, A. and Collado-Vides, J. (2003). Identifying global regulators in 
transcriptional regulatory networks in bacteria. Curr Opin Microbiol 6, 482-9. 
Martinez-Antonio, A., Janga, S. C., Salgado, H. and Collado-Vides, J. (2006a). Internal-
sensing machinery directs the activity of the regulatory network in Escherichia coli. Trends 
Microbiol 14, 22-7. 
Martinez-Antonio, A., Janga, S. C., Salgado, H. and Collado-Vides, J. (2006b). Internal-
sensing machinery directs the activity of the regulatory network in Escherichia coli. Trends 
Microbiol 14, 22-27. 
Martinez-Antonio, A., Janga, S. C. and Thieffry, D. (2008). Functional organisation of 
Escherichia coli transcriptional regulatory network. J Mol Biol 381, 238-47. 
Massey, R. C., Buckling, A. and Peacock, S. J. (2001). Phenotypic switching of antibiotic 
resistance circumvents permanent costs in Staphylococcus aureus. Curr Biol 11, 1810-4. 
Menchaca-Mendez, R., Janga, S. C. and Collado-Vides, J. (2005). The network of 
transcriptional interactions imposes linear constrains in the genome. Omics 9, 139-45. 
Milo, R., Shen-Orr, S., Itzkovitz, S., Kashtan, N., Chklovskii, D. and Alon, U. (2002). 
Network motifs: simple building blocks of complex networks. Science 298, 824-7. 
Moreno-Campuzano, S., Janga, S. C. and Perez-Rueda, E. (2006). Identification and analysis 
of DNA-binding transcription factors in Bacillus subtilis and other Firmicutes--a genomic 
approach. BMC Genomics 7, 147. 
Murugan, R. (2007). Generalized theory of site-specific DNA-protein interactions. Phys Rev E 
Stat Nonlin Soft Matter Phys 76, 011901. 
O'Toole, G., Kaplan, H. B. and Kolter, R. (2000). Biofilm formation as microbial development. 
Annu Rev Microbiol 54, 49-79. 
Pal, C., Papp, B. and Lercher, M. J. (2005). Adaptive evolution of bacterial metabolic networks 
by horizontal gene transfer. Nat Genet 37, 1372-5. 
Perez-Rueda, E. and Collado-Vides, J. (2000). The repertoire of DNA-binding transcriptional 
regulators in Escherichia coli K-12. Nucleic Acids Res 28, 1838-47. 
Postow, L., Hardy, C. D., Arsuaga, J. and Cozzarelli, N. R. (2004). Topological domain 
structure of the Escherichia coli chromosome. Genes Dev 18, 1766-79. 
Price, M. N., Huang, K. H., Arkin, A. P. and Alm, E. J. (2005). Operon formation is driven by 
co-regulation and not by horizontal gene transfer. Genome Res 15, 809-19. 
Pruss, B. M. and Matsumura, P. (1997). Cell cycle regulation of flagellar genes. J Bacteriol 
179, 5602-4. 
Ptashne, M. and Gann, A. (1997). Transcriptional activation by recruitment. Nature 386, 569-
77. 
Resendis-Antonio, O., Freyre-Gonzalez, J. A., Menchaca-Mendez, R., Gutierrez-Rios, R. 
M., Martinez-Antonio, A., Avila-Sanchez, C. and Collado-Vides, J. (2005). Modular analysis 
of the transcriptional regulatory network of E. coli. Trends Genet 21, 16-20. 
Richter, P. H. and Eigen, M. (1974). Diffusion controlled reaction rates in spheroidal geometry. 
Application to repressor--operator association and membrane bound enzymes. Biophys Chem 
2, 255-63. 
Riggs, A. D., Bourgeois, S. and Cohn, M. (1970). The lac repressor-operator interaction. 3. 
Kinetic studies. J Mol Biol 53, 401-17. 
Riva, A., Carpentier, A. S., Barloy-Hubler, F., Cheron, A. and Henaut, A. (2008). Analyzing 
stochastic transcription to elucidate the nucleoid's organization. BMC Genomics 9, 125. 
Ronen, M., Rosenberg, R., Shraiman, B. I. and Alon, U. (2002). Assigning numbers to the 
arrows: parameterizing a gene regulation network by using accurate expression kinetics. Proc 
Natl Acad Sci U S A 99, 10555-60. 
Salgado, H., Gama-Castro, S., Peralta-Gil, M., Diaz-Peredo, E., Sanchez-Solano, F., 
Santos-Zavaleta, A., Martinez-Flores, I., Jimenez-Jacinto, V., Bonavides-Martinez, C., 
Segura-Salazar, J. et al. (2006). RegulonDB (version 5.0): Escherichia coli K-12 transcriptional 
regulatory network, operon organization, and growth conditions. Nucleic Acids Res 34, D394-7. 
Segal, E., Shapira, M., Regev, A., Pe'er, D., Botstein, D., Koller, D. and Friedman, N. 
(2003). Module networks: identifying regulatory modules and their condition-specific regulators 
from gene expression data. Nat Genet 34, 166-76. 
Seshasayee, A. S., Fraser, G. M., Babu, M. M. and Luscombe, N. M. (2009). Principles of 
transcriptional regulation and evolution of the metabolic system in E. coli. Genome Res 19, 79-
91. 
Shah, I. M. and Wolf, R. E., Jr. (2004). Novel protein--protein interaction between Escherichia 
coli SoxS and the DNA binding determinant of the RNA polymerase alpha subunit: SoxS 
functions as a co-sigma factor and redeploys RNA polymerase from UP-element-containing 
promoters to SoxS-dependent promoters during oxidative stress. J Mol Biol 343, 513-32. 
Shapiro, J. A. (1998). Thinking about bacterial populations as multicellular organisms. Annu 
Rev Microbiol 52, 81-104. 
Shen-Orr, S. S., Milo, R., Mangan, S. and Alon, U. (2002). Network motifs in the 
transcriptional regulation network of Escherichia coli. Nat Genet 31, 64-8. 
Shimamoto, N. (1999). One-dimensional diffusion of proteins along DNA. Its biological and 
chemical significance revealed by single-molecule measurements. J Biol Chem 274, 15293-6. 
Slonim, N., Elemento, O. and Tavazoie, S. (2006). Ab initio genotype-phenotype association 
reveals intrinsic modularity in genetic networks. Mol Syst Biol 2, 2006 0005. 
Snel, B. and Huynen, M. A. (2004). Quantifying modularity in the evolution of biomolecular 
systems. Genome Res 14, 391-7. 
Stoodley, P., Sauer, K., Davies, D. G. and Costerton, J. W. (2002). Biofilms as complex 
differentiated communities. Annu Rev Microbiol 56, 187-209. 
Struhl, K. (1999). Fundamentally different logic of gene regulation in eukaryotes and 
prokaryotes. Cell 98, 1-4. 
Thieffry, D., Huerta, A. M., Perez-Rueda, E. and Collado-Vides, J. (1998). From specific 
gene regulation to genomic networks: a global analysis of transcriptional regulation in 
Escherichia coli. Bioessays 20, 433-40. 
Travers, A. and Muskhelishvili, G. (2007). A common topology for bacterial and eukaryotic 
transcription initiation? EMBO Rep 8, 147-51. 
Ulrich, L. E., Koonin, E. V. and Zhulin, I. B. (2005). One-component systems dominate signal 
transduction in prokaryotes. Trends Microbiol 13, 52-6. 
Wang, Y. M., Austin, R. H. and Cox, E. C. (2006). Single molecule measurements of repressor 
protein 1D diffusion on DNA. Phys Rev Lett 97, 048302. 
Warren, P. B. and ten Wolde, P. R. (2004). Statistical analysis of the spatial distribution of 
operons in the transcriptional regulation network of Escherichia coli. J Mol Biol 342, 1379-90. 
Winter, R. B., Berg, O. G. and von Hippel, P. H. (1981). Diffusion-driven mechanisms of 
protein translocation on nucleic acids. 3. The Escherichia coli lac repressor--operator 
interaction: kinetic measurements and conclusions. Biochemistry 20, 6961-77. 
Wu, W. S., Li, W. H. and Chen, B. S. (2006). Computational reconstruction of transcriptional 
regulatory modules of the yeast cell cycle. BMC Bioinformatics 7, 421. 
Xie, X. S., Choi, P. J., Li, G. W., Lee, N. K. and Lia, G. (2008). Single-molecule approach to 
molecular biology in living bacterial cells. Annu Rev Biophys 37, 417-44. 
Yu, H. and Gerstein, M. (2006). Genomic analysis of the hierarchical structure of regulatory 
networks. Proc Natl Acad Sci U S A 103, 14724-31. 
Yu, J., Xiao, J., Ren, X., Lao, K. and Xie, X. S. (2006). Probing gene expression in live cells, 
one protein molecule at a time. Science 311, 1600-3. 
Yudkin, M. D. and Clarkson, J. (2005). Differential gene expression in genetically identical 
sister cells: the initiation of sporulation in Bacillus subtilis. Mol Microbiol 56, 578-89. 
Zimmerman, S. B. (2006). Shape and compaction of Escherichia coli nucleoids. J Struct Biol 
156, 255-61. 
 
 
Constraints imposed by eukaryotic transcriptional control                                                                                      3-1 
 
 
 
 
 
 
 
3 Transcriptional regulation constrains the 
organization of genes on eukaryotic chromosomes 
 
 
CONTENTS OF CHAPTER 3 
 
OUTLINE 
CONTRIBUTION TO THE WORK IN THIS CHAPTER 
3.1 INTRODUCTION 
3.2 RESULTS 
3.2.1 EUKARYOTIC GENOME ORGANIZATION AND TRANSCRIPTIONAL REGULATION 
3.2.1.1 LONG-RANGE INTERACTIONS INVOLVING DISTAL REGULATORY ELEMENTS 
3.2.1.2 INTER-CHROMOSOMAL INTERACTIONS 
3.2.1.3 CHROMOSOMAL TERRITORIES, MOVEMENT AND NUCLEAR ORGANIZATION 
3.2.1.4 ASSOCIATION OF THE GENOMIC LOCI WITH THE NUCLEAR PERIPHERY 
3.2.2 TRANSCRIPTIONAL REGULATION CONSTRAINS GENOME ORGANIZATION 
3.2.2.1 THE MAJORITY OF TFS SHOW A STRONG PREFERENCE TO REGULATE GENES ON SPECIFIC CHROMOSOMES 
3.2.2.2 A SIGNIFICANT FRACTION OF THE TFS TEND TO HAVE TARGETS ON SPECIFIC REGIONS OF THE CHROMOSOMAL ARM 
3.2.2.3 MOST TFS SHOW A STRONG PREFERENCE TO POSITIONALLY CLUSTER THEIR TARGETS WITHIN A CHROMOSOME 
3.3 DISCUSSION & CONCLUSION 
3.4 MATERIALS AND METHODS 
3.4.1 DATASET OF TRANSCRIPTION FACTORS IN S. CEREVISIAE AND THEIR REGULATORY INTERACTIONS 
3.4.2 ESTIMATION OF STATISTICAL SIGNIFICANCE 
3.4.3 CALCULATION OF CHROMOSOMAL PREFERENCE 
3.4.4 CALCULATION OF REGIONAL PREFERENCE 
3.4.5 CALCULATION OF TARGET PROXIMITY 
REFERENCES 
 
 
Constraints imposed by eukaryotic transcriptional control                                                                                      3-3 
 
 
OUTLINE 
Recent advances in molecular techniques and high-resolution imaging are beginning to provide 
exciting insights into the higher order chromatin organization within the cell nucleus and its 
influence on eukaryotic gene regulation. This improved understanding of gene regulation also 
raises fundamental questions about how spatial features might have constrained the 
organization of genes on eukaryotic chromosomes and how re-arrangements that affect these 
processes might contribute to disease conditions. In this chapter, I discuss recent studies that 
highlight the role of spatial components in gene regulation and their impact on genome 
evolution. I then present evidence for the existence of a higher-order organization of genes 
across and within chromosomes that is constrained by transcriptional regulation. In particular, I 
show that the target genes of transcription factors (TFs) in the yeast Saccharomyces cerevisiae are
encoded in a highly ordered manner both across and within the sixteen chromosomes by
demonstrating that the target genes of (i) a majority of the TFs are not randomly distributed
across chromosomes but show a strong preference to be encoded on specific chromosomes,
(ii) a significant fraction of the TFs are not randomly distributed within a chromosome, but display
a strong preference (or avoidance) to be encoded in regions containing particular chromosomal
landmarks such as telomeres and centromeres, and (iii) a majority of the TFs are not randomly
scattered but are positionally clustered within a chromosome. These results demonstrate that
a specific organization of genes that allows for efficient control of transcription within the nuclear
space has been selected during evolution. The framework developed here can be exploited to 
uncover such higher-order organizational principles in other eukaryotes to provide insights into 
chromosomal territories, their role in cellular differentiation and transformation, and will have 
implications for understanding disease conditions that involve chromosomal aberrations. 
 
 
CONTRIBUTION TO THE WORK IN THIS CHAPTER 
Please note that the work presented in this chapter is based on the following two publications.
The review on eukaryotic genome organization and transcriptional regulation was written in
collaboration with Dr. Ana Pombo and Ines de Santiago at the MRC Clinical Sciences Centre,
London, and my contribution excludes their part of that work. I performed all other analyses.
I would also like to thank Dr. Julio Collado-Vides at UNAM, Mexico for helpful discussions in
developing this work.
 
1) Eukaryotic gene regulation in three dimensions and its impact on genome evolution 
M. Madan Babu, Sarath Chandra Janga, Ines de Santiago and Ana Pombo 
Curr. Opin. Genet. Dev., 2008, Vol. 18(6):571-582 
2) Transcriptional regulation constrains the organization of genes on eukaryotic chromosomes 
Sarath Chandra Janga, Julio Collado-Vides and M. Madan Babu 
Proc. Natl. Acad. Sci. U S A. 105(41): 15761-6, 2008 
 
 
3.1 INTRODUCTION 
Since the nucleosome was identified as the repeating unit of chromatin in 1974 (Kornberg, 1974;
Olins and Olins, 1974), it has been well established that eukaryotic genomes are compactly
packed into chromatin, the fundamental unit of which is the nucleosome. Such an organization
appears to serve two important purposes: (i) it allows the DNA to be compacted to fit in the
nucleus and (ii) it avoids unnecessary transcription of genes by preventing RNA polymerase
from accessing their promoter regions. Apart from these general functions, chromatin structure
is also known to play an important role in DNA replication and repair (Loizou et al., 2006). A
nucleosome consists of ~146 bp of DNA wrapped almost twice around the core histone octamer
(Luger et al., 1997), whose components can interact with additional chromatin proteins to form
higher order chromosomal structures. Apart from providing a structural basis, components of the
histone octamer can themselves be post-translationally modified by several different proteins.
For instance, a class of proteins called nucleosome remodeling enzymes either remove the
histone octamer from the DNA or physically change the position of the nucleosome to provide
access. Importantly, several studies (both recent and older) have shown that the individual
subunits of the histone octamer in a nucleosome can be chemically modified by acetylation,
methylation (mono-, di-, or tri-), phosphorylation, ADP-ribosylation, ubiquitination, and
sumoylation (Allfrey et al., 1964; Millar and Grunstein, 2006; Nightingale et
al., 2006). Hence, such an organization of DNA into nucleosomes and the plethora of
combinatorial possibilities of the modified state of the nucleosome is believed to provide an 
opportunity to regulate expression of relevant genes in a more sophisticated way, resulting in 
discrete biological outcomes. This combination of modification states that results in distinct 
effects in a cell has been conventionally referred to as the histone code (Turner, 1993; Turner, 
2007). Thus, nucleosomes are critical to the organization and maintenance of genetic material 
and their position and modification state can profoundly influence genetic activities such as 
regulation of gene expression (Kouzarides, 2002; Narlikar et al., 2002). 
More generally, compared to its bacterial counterpart, the eukaryotic genome is a highly
complex system that is regulated at three major hierarchical levels (Lee and Young, 2000;
van Driel et al., 2003). The first is at the DNA sequence level, i.e. the linear organization of 
transcription units and regulatory sequences. Co-regulated genes organized into clusters in the 
genome constitute part of these individual functional units.  The second is at the chromatin level, 
which allows switching between different functional states. This level involves the changes in 
the chromatin structure that are controlled by the interplay of histones and remodeling factors 
along with a variety of repressive and activating mechanisms. This regulatory level is linked with 
the control mechanisms from level one that switch individual genes in the cluster on and off,
depending on the properties of the promoter. The third level is the nuclear level, which includes 
the dynamic 3D spatial organization of the genome inside the cell nucleus. The nucleus is 
structurally and functionally compartmentalized and epigenetic regulation of gene expression 
may involve repositioning of loci in the nucleus through changes in large-scale chromatin 
structure. There is increasing evidence that such higher order organization of chromatin
contributes substantially to the regulation of gene expression and other nuclear
functions (see Cremer and Cremer, 2001; Lanctot et al., 2007; van Driel et al., 2003; Zinner et
al., 2006). The territorial organization of chromosomes was known from very early experiments,
in which damaged regions of micro-irradiated cell nuclei, visualized in the subsequently 
prepared metaphase chromosomes, were found to be locally clustered (Zorn et al., 1979). The 
chromosome territories were later visualized directly by means of in situ hybridization in 
interspecies somatic hybrid cells (Manuelidis, 1985).  There is now convincing evidence that
chromosomes in most eukaryotic nuclei occupy distinct volumes in the nuclear space, called
chromosomal territories, that are separated by inter-chromosomal regions, and that the
positions occupied by the chromosomes are dynamic (Cremer et al., 2000; Gasser,
2002; Heun et al., 2001; Kurz et al., 1996; Taddei et al., 2004). Hence, the cross-talk between
different chromosomes and genes located within them in the context of metabolic,
transcriptional and signaling mechanisms could add a further layer of complexity to our
understanding of the proper functioning of the cell.
Apart from the three-dimensional architecture of the cell nucleus discussed above (and
shown in Figure 3-1), a number of regulatory mechanisms control the movement, organization
and regulation of different loci within the nucleus.  The functions of some of the regulatory
elements responsible for regulation at these different levels, and our current understanding
of them, are discussed in detail in the first half of this chapter. All these observations raise
some fundamental questions: have the requirement for transcriptional regulation and its spatial
considerations constrained the way in which genes are organized on chromosomes? If so, in what
ways does this affect genome evolution? In the second half of this chapter I present
investigations into the constraints placed by transcriptional regulation on the
organization of genes on the chromosomes in eukaryotic organisms.
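
One way to make the first of these questions concrete is to ask whether the targets of a single TF are over-represented on a given chromosome relative to what random placement would give. The sketch below is an illustration only and is not the procedure described in Materials and Methods: it assumes a one-sided hypergeometric test, and the gene names, chromosome labels and function names are hypothetical.

```python
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(X >= k) when n genes are drawn without replacement from N genes,
    K of which lie on the chromosome of interest."""
    upper = min(K, n)
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total


def chromosomal_preference(targets, gene_to_chrom):
    """Return {chromosome: p-value} for enrichment of one TF's targets.

    Assumption: every gene in the genome is an equally likely target
    under the null model.
    """
    N = len(gene_to_chrom)
    n = len(targets)
    # Number of genes encoded on each chromosome.
    chrom_sizes = {}
    for chrom in gene_to_chrom.values():
        chrom_sizes[chrom] = chrom_sizes.get(chrom, 0) + 1
    # Number of this TF's targets observed on each chromosome.
    observed = {}
    for gene in targets:
        chrom = gene_to_chrom[gene]
        observed[chrom] = observed.get(chrom, 0) + 1
    return {
        chrom: hypergeom_sf(observed.get(chrom, 0), N, K, n)
        for chrom, K in chrom_sizes.items()
    }


# Hypothetical toy genome: 20 genes, 10 on each of two chromosomes.
toy_gene_to_chrom = {f"g{i}": ("chrA" if i < 10 else "chrB") for i in range(20)}
# Hypothetical TF whose five targets all lie on chrA.
toy_targets = ["g0", "g1", "g2", "g3", "g4"]

p_values = chromosomal_preference(toy_targets, toy_gene_to_chrom)
```

In a genome-wide setting the same test would be repeated for every TF and chromosome, so some correction for multiple testing would be needed before calling a preference significant.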
 
 
 
           
[Figure 3-1: image with panels A to F. Labels in the image: Level 1, Level 2, Level 3; regulatory
elements and DNA modification; nucleosome modification and remodelling; chromatin structure;
chromosomal organization; nuclear architecture; insulator element; heterochromatin; locus control
region; enhancer; TF binding site; gene; transcription factory; genomic loop; chromosomal
territory; inter-chromosomal regulatory interaction; chromosome A; chromosome B; nuclear
periphery; nuclear pore complex.]
 
Figure 3-1: (A) Hierarchical organization of eukaryotic genetic material. DNA is wrapped into
nucleosomes, which form chromatin that is ultimately packaged into a chromosome residing within
the nucleus. The first level of regulation includes regulatory elements (e.g., enhancers and insulators),
DNA methylation and DNA structure. The second level includes post-translational modification and
remodelling of nucleosomes. The third level of regulation includes chromosomal
organization and the nuclear architecture. (B) Features of genome architecture in 3D. DNA is
shown as a black line, a gene is represented as an arrow and the different classes of regulatory elements
are shown in various shapes and colors. Insulator elements (blue rectangles) block the spread of
heterochromatin (red circles) and prevent inappropriate interaction between enhancers (green oval) and
unrelated genes. Enhancers can facilitate regulation of nearby genes that may still be a few kilobases
away. A locus control region (gray oval) can bring genomic loci that are several kilobases apart close to
each other to co-ordinate gene expression. The bottommost panel shows various aspects of the spatial
component in eukaryotic gene regulation, with the nucleus shown in the center. (C) Different active regions
of the same or different chromosomes can associate with the same transcription factory. (D) Enhancers
from one chromosome may regulate the expression of genes present on another chromosome via
inter-chromosomal interactions. (E) Chromosomes occupy defined volumes within the nucleus, called
chromosome territories, which are depicted in different colours, with significant intermingling mostly at the
edges. (F) Genetic material residing near the nuclear periphery has been correlated with gene silencing.
One theme that stands out is that regions of the chromosome that interact with lamin and the nuclear
inner membrane are largely inactive, in both mammals and yeast, whereas loci that interact with the
components of the nuclear pore appear to be transcriptionally active, as observed mostly in Drosophila and
yeast.
 
 
3.2 RESULTS 
3.2.1 Eukaryotic genome organization and transcriptional regulation 
Though we observe an amazing diversity in the number of chromosomes that eukaryotic
organisms possess (e.g., 32 chromosomes in diploid yeast, over 200 in butterflies and 46 in humans),
they are packaged in similar ways: each DNA molecule is wrapped around histone proteins to 
form nucleosomes, which are then condensed in a complex hierarchical manner to make up an 
entire chromosome (Figure 3-1A). Such an intricate organization of genetic material within the 
eukaryotic nucleus provides ample opportunities to regulate expression of the encoded genes at 
many different hierarchical levels. For instance, eukaryotic transcription is dynamically regulated 
at least at three major levels as shown in Figure 3-1A. The first is at the level of DNA sequence 
where DNA binding proteins (e.g., transcription factors; TFs) associate with cis-regulatory 
elements (e.g., TF binding sites) to regulate transcription. The second is at the level of 
chromatin, which allows segments within a chromosomal arm to switch between different 
transcriptional states, i.e., those that suppress transcription (heterochromatin) and those that 
allow for gene activation (euchromatin). This involves changes in chromatin structure and 
nucleosome occupancy, both of which are controlled by the interplay between several factors 
such as nucleosome remodeling complexes, histone modifications, and a variety of repressive 
and activating mechanisms (Millar and Grunstein, 2006; Razin et al., 2007). The third is at the 
level of the entire chromosome (Figure 3-1A) and includes positioning of chromosomes within 
the nuclear space (e.g., closer to the nuclear periphery or next to internal nuclear 
compartments) and spatial organization of specific chromosomal loci within the nucleus, both of 
which are known to influence gene expression (de Laat and Grosveld, 2007; Fraser and 
Bickmore, 2007; Misteli, 2007; Pombo and Branco, 2007; Schneider and Grosschedl, 2007).  
Several studies have investigated these mechanisms in detail and have revealed that 
such processes involve extensive physical and spatial association between distantly located 
genomic elements and widespread crosstalk between the different levels. Advancements in 
molecular techniques and high-resolution imaging (see Table 3-1) have facilitated investigation 
of the role of the spatial component in gene regulation and have provided valuable insights into its
importance in gene regulation (de Laat and Grosveld, 2007; Fraser and Bickmore, 2007; Misteli,
2007; Pombo and Branco, 2007; Schneider and Grosschedl, 2007). In this chapter I first discuss
recent studies that highlight the importance of the spatial component in gene regulation and then
present a detailed analysis that addresses how the requirements for gene regulation could have 
constrained genome organization. Finally, I discuss implications and outline open questions.  
 
 
Table 3-1: Experimental and computational approaches to study eukaryotic transcriptional regulation. 
Experimental approaches (name, followed by description)
ChIP-chip 
Chromatin-bound proteins are covalently linked to DNA by using an in vivo 
crosslinking agent such as formaldehyde (histones can be detected in unfixed 
chromatin preparations in native ChIP). Chromatin is then sheared and 
immunoprecipitated (ChIP) using an antibody for a native protein, a tagged 
version, or a specific post-translational modification. Reversal of the crosslink 
releases the bound DNA, allowing the enrichment of specific DNA fragments, 
whose identity is determined by hybridization to a microarray (chip).  
ChIP-seq 
In ChIP-seq experiments, the immunoprecipitated DNA is directly sequenced 
using high-throughput sequencing technologies (e.g., Solexa or 454). The 
sequences are then computationally mapped back to the reference genome. 
Fragments that were bound by the protein will be more abundant and sequenced 
several times, providing a direct measure of enrichment. 
DamID 
The DNA binding protein of interest is fused to an E. coli protein, Dam. Dam 
methylates the N6 position of the adenine in the sequence GATC, which is 
expected to occur once in every ~256 bases. Upon binding DNA, the Dam 
protein preferentially methylates adenine in the vicinity of binding. The DNA is
digested with DpnI, which cleaves only methylated GATC sequences, and DpnII, which cleaves
only non-methylated GATC sequences, so that fragments that are not methylated are removed.
The remaining methylated fragments are amplified by selective PCR and 
quantified using a microarray.  
RNA-TRAP 
Newly-made transcripts are detected in crosslinked cells by RNA-FISH using 
biotinylated probes and probe-RNA-chromatin complexes are amplified with 
tyramide or directly immunoprecipitated, before PCR analyses.  
Chromosome Conformation Capture (3C)
3C is used to determine which DNA sequences lie close together in 3D space in 
fixed cells. This typically involves fixation to crosslink DNA sequences that lie 
next to each other (usually through DNA–protein–DNA links), before cutting with 
a restriction enzyme, dilution and ligation at low concentration. This favours the 
ligation of pairs of DNA sequences that are crosslinked after which the reversing 
of crosslinks allows the ligated DNA to be detected by PCR.  
4C 
4C technology [chromosome conformation capture on chip (3C-on-chip) or 
circular chromosome conformation capture (circular-3C)] allows for an unbiased 
genome-wide search for DNA loci that contact a given locus. 
5C 
Chromosome Conformation Capture Carbon Copy (5C) is a massively parallel 
technique, which involves mapping physical interactions between genomic 
elements and sequencing or microarray analysis of the ligated end products of 
the 3C technique. 3C converts physical chromatin interactions into specific ligation
products, which in 5C are quantified using high-throughput microarrays or quantitative
DNA sequencing (e.g., 454 technology).
6C 
Combines ChIP for a specific chromatin bound protein with 3C-based methods 
to correlate specific long-range chromatin interactions with the presence of a 
specific bound protein. 
FISH 
Fluorescent in situ hybridization (FISH) detects specific DNA sequences and 
localizes them on cytogenetic preparations of chromosomes or interphase cell 
nuclei. Cells are hypotonically swollen and dropped on glass slides before 
hybridization, such that fine structural details might be lost. It uses tagged probes 
amplified from specific DNA fragments up to single chromosomes, to detect the 
target sequences. The genomic regions bound by the probe are visualized by 
fluorescence microscopy.  
3D-FISH 
A modified FISH procedure that improves the preservation of 3D nuclear 
structure (3D-FISH), important for spatial mapping of the position of specific 
genomic sequences within the interphase nuclei. This technique can be slow as 
it requires imaging of multiple image stacks on a small number of nuclei and 3D 
reconstruction. It can also be combined with protein and RNA localization. 
Cryo-FISH 
A modified FISH procedure that uses ultra-thin cryosections from sucrose-
embedded fixed samples. Sections are 100-200 nm thick. Preservation of 
ultrastructure is optimized, signal-to-noise ratios are improved and imaging 
artifacts are minimized. It is ideal for imaging short-range interactions between 
specific loci or their associations with specific landmarks with higher resolution 
and faster data collection. Specific cells in their tissue context can be easily 
investigated. It is not suitable to measure general 3D genomic positioning over 
large distances. 
Single Molecule Imaging (Fluorescence Microscopy)
In Single Molecule Imaging (SMI) of live cells, the molecules of interest are 
conjugated with fluorophores and introduced into cells. The behavior of multiple 
fluorescent molecules in cells is then visualized using high-sensitivity video 
microscopy. The observables in SMI are the position or movement of the 
fluorescent spots, the fluorescence intensity of individual spots, the fluorescence 
spectrum or color of individual spots, and the number and distribution of the 
spots.  
Lac-binding-site array
In this approach, in vivo visualization of chromatin dynamics is based on lac 
repressor recognition of direct repeats of the lac operator. The method allows 
tagging of specific chromosomal sites and thus in situ localization in vivo. 
Detection by light microscopy, using GFP-lac-repressor fusion proteins or 
immunofluorescence, can be complemented by higher-resolution electron 
microscopy using immunogold staining. This method facilitates the investigation 
of interphase chromosome dynamics, as well as chromosome segregation 
during cell division in organisms that lack cytologically condensed chromosomes.
Computational approaches (name, followed by description)
Boolean modelling 
In qualitative modelling, kinetic processes are simulated by tracking, over discrete
time, the state of the system, defined in terms of a coarse range for each
variable. The weak specification of such models conserves the computer resources
needed to explore the space of possible behaviours. Moreover, it provides
high-level predictions applying to a whole family of systems. Although simulation of
qualitative models can be fast, even a rough exploration of parameter space can
become intractable as the size of the system increases, highlighting the need for
more computer resources and for methods that accelerate the search of
parameter space. For genes that are naturally found in only two states (e.g., on or
off), the trade-off in accuracy may not be high. On the other hand, simple models 
can, in some cases, predict behaviours that are far from reality. 
Deterministic modelling
Deterministic modeling falls into the class of quantitative models. The most
popular formalism is that of deterministic ordinary differential equations (ODEs),
which are extended to partial differential equations (PDEs) when space is also
modelled. Each equation in a set typically represents the rate of change
of a species' continuous concentration as a sum or product of more or less
empirical terms. This accounts for the effect of biological events on the 
concentration of the species. By definition, the initial state of the system in a 
deterministic model uniquely sets all future states. As analytical solutions seldom 
exist, numerical solutions need to be computed (once for each set of parameter 
values and initial conditions explored). 
Stochastic modelling
Molecular interactions involving a small number of objects in a large volume are 
intrinsically random and cellular behaviour itself sometimes seems to reflect this 
randomness. Indeed, occurrences of “noise” have been found to be exploited by 
cells—for instance, to survive a variety of environmental changes or to increase 
sensitivity in signal transduction processes. To model such stochastic systems, 
two main methods are used. The first comprises using stochastic differential 
equations (SDEs; derived from ODEs by adding noise terms to the equations), 
the solutions for which can be numerically obtained either by computing many 
trajectories (Monte Carlo methods) or approximating their probability distribution 
and then calculating statistical measures (such as mean and variance). The 
second is an exact method which can cope with different reaction time-scales or 
spaces. Within this approach, molecules are modelled individually and reaction 
events are calculated by their probability.  
Monte Carlo simulation
Monte Carlo methods are a class of computational algorithms that rely on 
repeated random sampling to compute their results. Monte Carlo methods are 
often used when simulating physical and mathematical systems. Because of 
their reliance on repeated computation and random or pseudo-random numbers, 
Monte Carlo methods are most suited for computer simulations and tend to be 
used when it is infeasible or impossible to compute an exact result with a 
deterministic algorithm. There is no single Monte Carlo method; instead, the term 
describes a large and widely-used class of approaches. 
Multi-scale modelling
Multi-scale modeling refers to the modeling of a system at several levels of detail
to make its representation more accurate and as close to reality as possible. For
instance, modeling of a chromatin unit, a nucleosome, using a simplified model 
for rapid discrete molecular dynamics simulations and an all-atom model for 
detailed structural investigation, would correspond to this class of modelling.  
Statistical correlations
Statistical models search for patterns in experimental data. Correlation,
regression and cluster analysis are all powerful statistical tools that can identify
relationships among measured variables that probably are not attributable to
chance. Statistics is also a powerful tool for uncovering the prevalence of a
phenomenon and evidence for potentially new mechanisms.
Spatial modelling 
Spatial modelling takes into account that biological processes take place in 
heterogeneous and highly structured environments regulating cellular processes 
in both space and time. While recent technological advances are addressing the 
dearth of spatial data, theoretical advances are improving computational 
methods, making it now possible to simulate spatio-temporal models of biological 
processes in coarse-grained or realistic geometries.  
Kinetic modelling 
Kinetic modelling supports quantitative hypothesis testing by first translating a 
diagram into a mechanistic kinetic model. Diagrams typically consist of 
molecules, complexes, cellular locations and processes. As molecules and 
complexes can exist in several locations, it is often necessary to define several 
states for a single molecule — each state is a set of chemical species in a 
physical place.  
Comparative genomics
Comparative genomics permits addressing questions at the sequence level both 
within and across organisms and their variations across diverse phylogenetic 
groups. Evolutionary aspects of several cellular elements from genes, regulatory 
elements, organellar macromolecular complexes to their chromosomal 
organization can be addressed using computational genome-scale approaches. 
Network based approaches
In a network approach, objects are represented as nodes and interactions 
between objects are represented as links. This permits representation of 
genome-scale information in a convenient way to identify interesting topological 
features. One of the features is the presence of hub nodes which are objects 
connected to an extremely high number of other objects in a system. With the 
amount of data from several high-throughput technologies, representing 
interactions between biological molecules as networks has provided us with a 
general framework to address fundamental biological questions at a systems 
level. Examples of molecular interactions represented as networks include 
protein-DNA (transcription network), protein-RNA (post-transcriptional network), 
protein-protein (post-translational network, signaling and protein complexes) and 
protein-metabolite (metabolic network) interactions. 
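
The hub concept described in the last row of Table 3-1 can be illustrated with a toy transcription network. The following sketch is an example built on stated assumptions rather than the analysis used in this thesis: the TF and gene names are made up, and calling the top 20% of TFs by out-degree "hubs" is an arbitrary cutoff chosen only for illustration.

```python
from collections import defaultdict


def out_degrees(edges):
    """Count the distinct targets of each TF in a list of (TF, target) edges."""
    degree = defaultdict(int)
    for tf, _target in set(edges):  # set() drops duplicate edges
        degree[tf] += 1
    return dict(degree)


def find_hubs(edges, top_fraction=0.2):
    """Return the TFs with the most targets as (TF, out-degree) pairs.

    Assumption: a 'hub' is any TF in the top `top_fraction` of TFs
    ranked by out-degree; ties are broken alphabetically.
    """
    degree = out_degrees(edges)
    ranked = sorted(degree.items(), key=lambda item: (-item[1], item[0]))
    n_hubs = max(1, round(len(ranked) * top_fraction))
    return ranked[:n_hubs]


# Hypothetical network: TF1 regulates five genes, the other TFs one or two.
toy_edges = [
    ("TF1", "g1"), ("TF1", "g2"), ("TF1", "g3"), ("TF1", "g4"), ("TF1", "g5"),
    ("TF2", "g1"), ("TF2", "g2"),
    ("TF3", "g3"),
    ("TF4", "g4"),
    ("TF5", "g5"),
]

hubs = find_hubs(toy_edges)
```

Here nodes are TFs and genes, and each directed link is a regulatory interaction, which matches the protein-DNA (transcription network) representation mentioned in the table.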
 
I also discuss how computational approaches can be helpful in investigating the prevalence of
spatial regulatory mechanisms and in understanding their impact on genome evolution. It is 
important to note that the studies discussed have been carried out in different model systems 
and that further research is necessary to assess whether particular spatial mechanisms are 
universal or specific to each system. 
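
To illustrate how the Monte Carlo and statistical approaches listed in Table 3-1 might be combined to ask whether a spatial pattern is more prevalent than expected by chance, the sketch below tests whether a set of target genes is positionally clustered along one chromosome. It is a hypothetical example, not the method of Section 3.4.5: gene positions are assumed to be simple rank indices, the clustering statistic (mean nearest-neighbour distance) and the function names are choices made here for illustration, and the toy targets are invented.

```python
import random


def mean_nn_distance(positions):
    """Mean distance from each position to its nearest neighbouring position.

    Requires at least two positions.
    """
    pos = sorted(positions)
    distances = []
    for i, p in enumerate(pos):
        left = p - pos[i - 1] if i > 0 else float("inf")
        right = pos[i + 1] - p if i < len(pos) - 1 else float("inf")
        distances.append(min(left, right))
    return sum(distances) / len(distances)


def clustering_p_value(target_positions, n_genes, n_trials=2000, seed=0):
    """Monte Carlo p-value for positional clustering on one chromosome.

    Null model (an assumption): targets are a random subset of the
    n_genes gene ranks on the chromosome. The p-value is the fraction of
    random subsets that are at least as tightly clustered as observed.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = mean_nn_distance(target_positions)
    k = len(target_positions)
    hits = 0
    for _ in range(n_trials):
        sample = rng.sample(range(n_genes), k)
        if mean_nn_distance(sample) <= observed:
            hits += 1
    # Add-one correction avoids reporting a p-value of exactly zero.
    return (hits + 1) / (n_trials + 1)


# Hypothetical chromosome with 200 genes and six adjacent targets.
toy_targets = [10, 11, 12, 13, 14, 15]
p_clustered = clustering_p_value(toy_targets, n_genes=200)
```

Using gene rank rather than base-pair coordinates keeps the example short; a real analysis would need to decide which distance measure is biologically appropriate.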
3.2.1.1 Long-range interactions involving distal regulatory elements 
Regulatory elements in eukaryotes can lie several kilobases away from the
associated gene. These include binding sites for specific TFs, enhancer elements, locus control 
regions (LCRs) and insulator elements (Figure 3-1B). TF binding sites are generally close to 
promoter regions, but enhancer elements, LCRs and insulator elements can be present far 
away on the chromosome and may influence the expression of more than one gene 
simultaneously. Enhancers affect expression of nearby genes, whereas LCRs can affect several 
genes that are distantly located within a genomic locus spanning several kilobases (Dean, 
2006). Insulator elements can block promiscuous enhancer-promoter interaction or act as a 
barrier against the spreading of heterochromatin. The former class of insulators functions by
forming genomic loops via long-range interactions and the latter class prevents inappropriate 
gene expression by recruiting nucleosome modifying enzymes (Dorman et al., 2007).  
The formation of loops mediated by proteins bound to specific elements along a 
chromosome appears to have a central role in several processes as it can affect the expression 
of several genes in a neighborhood (O'Sullivan et al., 2004). Although only a few loops have 
been analyzed in detail and the nature of the molecular forces that maintain them remains
unclear, recent evidence suggests that they are found in several eukaryotes (Dean, 2006) and 
that the transcriptional machinery itself could be a molecular tie (Grimaud et al., 2006; 
Marenduzzo et al., 2007; Osborne et al., 2004) (Figure 3-1C). Several studies that have used 
3D-FISH, chromosome conformation capture (3C) (Dekker et al., 2002) and its variants 4C, 5C 
and 6C (Simonis et al., 2007) and live-cell imaging (Muller et al., 2007) support the idea that 
active transcription units are in close contact within the nuclear space (Osborne et al., 2004; 
Pombo et al., 1999; Pombo et al., 2000) (see Table 3-1). The results are consistent with a 
model for genome organization in which active polymerases cluster into transcription ‘factories’ 
bringing together distal genes (Figure 3-1C) and where active genes are dynamically organized 
into shared nuclear subcompartments (Muller et al., 2007; Osborne et al., 2004; Pombo et al., 
1999; Pombo et al., 2000). They are also consistent with these cis-regulatory elements 
functioning as insulators, enhancers or LCRs, depending on their positions relative to other 
genes. Interestingly, a recent study has shown that specific ‘factories’ produce only a
particular kind of transcript depending on the promoter type and whether or not the gene 
contains an intron, supporting the presence of ‘specialized’ transcription factories (Pombo et al., 
1999; Pombo et al., 2000; Xu and Cook, 2008). 
The genomic loops involving regulatory elements are dynamic, depend on the 
transcriptional status of a gene, vary between cell-types in the same organism and may involve 
several proteins. The beta-globin locus in mouse is the most studied and involves the Hbb-b1 
gene (which encodes beta-globin), its LCR and the Eraf gene (encoding an alpha-globin-
stabilizing protein) on the same chromosome. This LCR is thought to nucleate a chromatin hub 
which correlates with expression of globin-related genes. It has been confirmed, by 3C, 4C and 
RNA-TRAP that the contacts between Hbb-b1, the LCR and Eraf are seen only in erythroid 
nuclei (in which all three are transcribed) but not in brain cell nuclei (in which Hbb-b1 is inactive) 
(Carter et al., 2002; Osborne et al., 2004; Simonis et al., 2006). Moreover, the contacts that 
Hbb-b1 makes with other genomic regions depend on its transcriptional activity; in erythroid 
nuclei, 80% of contacts are with other active genes, but, in brain cells, this falls to only 13% 
(Simonis et al., 2006). Interestingly, the LCR region itself is also transcribed and this might even 
be required for its function (Ho et al., 2006). A range of TFs have been implicated in mediating 
genomic loops in the case of the Hbb-b1 locus and these include Eklf (erythroid Kruppel-like 
factor; aka Klf1), Gata1 (GATA-binding protein 1) and the zinc-finger protein Fog1 (aka Zfpm1) 
(Drissen et al., 2004; Vakoc et al., 2005). 
3.2.1.2 Inter-chromosomal interactions 
While long-range regulatory interactions involving loci from the same chromosome have been 
known for some time, it is only recently that inter-chromosomal regulatory interactions (trans-
interactions) were discovered (Figure 3-1D). Inter-chromosomal interactions may involve 
enhancer elements and genes from different chromosomes and can be cell-type specific. The 
first example of a trans-interaction between chromosomes in mammals (identified by 3C and 
FISH) is the association between the T helper 2 cytokine locus on mouse chromosome 11 and 
the promoter of IFN-gamma gene on chromosome 10 in the nuclei of naïve CD4+ T-cells  
(Spilianakis et al., 2005). This interaction is thought to hold the two loci in a poised state and 
might facilitate a quicker response upon T-cell activation to differentiate into Th1 and Th2 
lineages by expression of either gene-locus. Another interesting example is the regulation of 
olfactory receptor genes. Dual RNA and DNA FISH revealed that the expression of a specific 
olfactory gene is accompanied by inter- or intra-chromosomal interactions between the active 
gene and a genomic region on chromosome 14 containing an enhancer sequence, referred to 
as the H element (Lomvardas et al., 2006).  
  Recently, two different studies showed that inter-chromosomal interactions may involve 
several factors and can be induced upon exposure to specific stimuli or upon viral infection. The 
dynamics of gene association with transcription factories was investigated during immediate 
early gene induction in mouse B lymphocytes and was shown to result in a rapid relocation of 
the Myc proto-oncogene on chromosome 15 to the same factory that transcribed the Igh gene 
located on chromosome 12 (Osborne et al., 2007). The second study, which examined the 
interferon-beta (IFN-beta) gene locus upon viral infection, reported that the stochastic and monoallelic 
expression of the IFN-beta gene depends on inter-chromosomal associations with distinct 
genetic loci that could mediate binding of the transcription factor NF-kappaB to the IFN-beta 
enhancer, thus triggering transcription from this allele (Apostolou and Thanos, 2008). 
Another prominent example comes from the mammalian X-chromosome inactivation 
process, which involves a specific trans-association of the X-inactivation center (Xic) between 
the X chromosomes during early development. Female cells carry two X chromosomes, one of 
which is mostly silenced so that expression levels of X-linked genes are comparable to those in 
male cells. Recent studies detected a transient co-localization of the X inactivation centers of 
the homologous chromosomes that precedes the initiation of inactivation of one of the two 
chromosomes (Augui et al., 2007) and that the chromatin insulator protein (CTCF) is involved in 
mediating this interaction (Xu et al., 2007). CTCF also colocalizes with cohesin at specific sites 
in human and mouse chromosomes (Parelho et al., 2008), which raises the possibility that 
protein-chromatin interactions involving genomic loci from different chromosomes could 
stabilize, at least transiently, a network of inter-chromosomal interactions within the cell nucleus. 
3.2.1.3 Chromosomal territories, movement and nuclear organization 
Despite the growing evidence on inter-chromosomal interactions, it is known that chromosomes 
occupy territories with preferred and non-random positions in the nucleus of mammalian cells, 
so-called chromosome territories (CTs) (Cremer et al., 2006; Meaburn and Misteli, 2007) 
(Figure 3-1E). FISH experiments have revealed the relocation of chromatin domains containing 
activated genes to substantial distances outside their chromosome territory, suggesting that 
positional organization of chromatin domains within the nucleus could impinge on the regulation 
of gene expression. This finding, together with the observation of extensive intermingling of 
DNA from different chromosomes (Branco and Pombo, 2006) raises the issues of how and why 
genes move relative to their chromosome territories and whether the looping out regulates, or is 
regulated by transcriptional activity. Evidence in favour of a role for looping out in the regulation 
of gene expression has come from studies that show the colocalization of genes in the nucleus 
for co-expression or co-regulation (reviewed in (Fraser and Bickmore, 2007)). Active genes on 
decondensed chromatin loops extend outside chromosome territories and can colocalize both in 
cis and in trans at sites within the nucleus to share the same transcription factories (Osborne et 
al., 2004; Osborne et al., 2007) or to sites adjacent to splicing-factor enriched speckles (Chuang 
et al., 2006) or Cajal bodies (Dundr et al., 2007). Extensive relocalization of large genomic 
regions in response to gene activation can depend on actin (Chuang et al., 2006; Dundr et al., 
2007) and myosin (Chuang et al., 2006), suggesting that intranuclear movements of genomic 
regions are, at least in some cases, more directed than previously thought (Kumaran et al., 
2008). 
The spatial organization of chromosome territories in mammalian cells can be described 
by their radial positioning relative to the center of the nucleus, as was recently done for the 
three-dimensional (3D) map of all chromosomes in human male fibroblast nuclei (Bolzer et al., 
2005). The radial positioning of chromosomes correlates with their gene density in spherical 
nuclei such as lymphocytes (Kupper et al., 2007) and is evolutionarily conserved (Mora et al., 
2006; Neusser et al., 2007) but nevertheless tissue-specific to a certain degree (Parada et al., 
2004). Gene-rich regions tend to occupy more interior positions, while gene-poor and late-replicating 
regions tend to be associated with the nuclear periphery (Kupper et al., 2007; 
Neusser et al., 2007). In addition, similar non-random chromatin arrangements with respect to 
the local gene density or GC content have been observed for different cell types (e.g., 
fibroblasts, bone-marrow cells and cell lines) from several eukaryotic lineages such as 
amphibians, reptiles, birds and mammals (Federico et al., 2006; Neusser et al., 2007). 
Surprisingly, a detailed analysis of the position of chromosomes in mouse lymphocytes has 
shown that chromosomes are more likely to form ‘heterologous’ neighbourhoods, where 
homologous chromosomes are preferentially separated from each other, which might facilitate 
more extensive trans-interactions between heterologous chromosomes (Khalil et al., 2007). 
Though the general patterns appear to be evolutionarily conserved, they are 
nevertheless dynamic and are altered during cellular differentiation. Changes in the 
transcriptional program of a cell correlate with specific changes in the organization of individual 
CTs, at the level of intermingling, CT volume and radial position during lymphocyte activation 
(Branco et al., 2008), possibly reflecting an adaptation to the new transcriptional program. 
Similarly, other recent studies have shown that the architecture of chromosome territories 
changes during differentiation (e.g., human adipocyte differentiation (Kuroda et al., 2004) and 
mouse T-cell differentiation (Kim et al., 2004)). 
3.2.1.4 Association of the genomic loci with the nuclear periphery 
In metazoan nuclei, the nuclear envelope is underlaid by a continuous meshwork of lamins and 
lamin-associated proteins, which preferentially associate with inactive chromatin regions and 
facilitate chromatin organization (Akhtar and Gasser, 2007). Pickersgill et al. (2006) 
characterised the regions that interact with the nuclear lamina in Drosophila 
melanogaster and showed an enrichment for gene-poor regions and repressed genes. More 
recently, Guelen et al. (2008a) mapped the interaction sites of the entire genome 
with the nuclear lamina components in human fibroblasts and described over 1,300 
lamina-associated domains (LADs), which were again enriched for genes with low expression levels. 
Although these studies demonstrated an association of silenced genes with the nuclear periphery, 
it remained unclear whether the requirement for gene repression causes the association 
or whether repression is an effect of association with the nuclear lamina. Experimentally induced 
repositioning of human chromosomal regions to the nuclear periphery by Finlan et al. (2008) 
suggests a causative role of the nuclear periphery in suppressing the expression of 
some (but not all) genes as repositioning to the periphery is still compatible with active 
transcription. Another study investigating the consequences of repositioning the immunoglobulin 
loci in mouse fibroblasts to the nuclear periphery supports the notion that such molecular 
interactions may be a mechanism to limit the accessibility to proteins that facilitate 
recombination or transcription (Reddy et al., 2008). 
While the nuclear periphery has been generally associated with repressed genes, 
several studies have shown a correlation with active genes being associated with components 
of the nuclear pore complexes (NPCs), which serve as gates for the transport of molecules 
between the nucleus and cytoplasm. ChIP experiments in yeast for NPC components revealed 
an enrichment for active genes (Casolari et al., 2004).  Several inducible genes such as INO1, 
HXK1, GAL1, GAL2, and HSP104 become stably positioned at the nuclear periphery when 
activated and remain there after transcription is shut off (Brickner et al., 2007; Cabal et al., 
2006; Casolari et al., 2004; Taddei et al., 2006). In the case of GAL1 and INO1, the relocalization 
to pores was found to be dependent upon the SAGA acetyl-transferase complex (Brickner and 
Walter, 2004; Cabal et al., 2006). In humans and Drosophila, the MSL complex can recruit 
transcriptionally active loci to the nuclear pore (Mendjan et al., 2006), although another study 
revealed that the association of silent genes is just as likely as for active genes (Brown et al., 
2008). Most of these results have to be reconciled with the observation that many (if not most) 
transcribed genes, in both yeast (Gartenberg et al., 2004) and mammals (Janicki et al., 2004), 
do not associate stably with pores. Despite this, one common theme that is emerging is a 
tendency for the inner nuclear membrane to be associated with less active genes, whereas the 
NPCs tend to associate with transcriptionally active loci, at least in yeast, possibly in order to 
facilitate efficient transport of mRNAs (Figure 3-1F). 
3.2.2 Transcriptional regulation constrains genome organization 
Unlike in prokaryotes, where the genetic material is primarily packaged in a single circular 
chromosome, the genome of eukaryotes is contained in the nucleus, condensed in a complex, 
hierarchical manner and is encoded in several different linear chromosomes. These distinctions, 
together with the fact that the transcriptional apparatuses are largely different, enforce very 
different ways in which genes are transcribed from the chromosomes. In the case of most 
prokaryotes, the absence of a nucleus and the organization of functionally related genes into 
operons facilitate coupled transcription and translation of polycistronic transcripts. In contrast, 
the presence of a nucleus in eukaryotes imposes the constraint that the transcribed 
monocistronic mRNA needs to be transported to the cytoplasm before translation can occur. 
Although transcription in both prokaryotes and eukaryotes involves the evolutionarily 
conserved core RNA polymerase subunit, the whole process of transcriptional regulation is 
fundamentally different. In contrast to prokaryotes where transcription primarily relies on the cis-
regulatory DNA sequences alone (Browning and Busby, 2004), eukaryotic transcription is 
regulated at many levels (Lee and Young, 2000; van Driel et al., 2003). Therefore, unlike in 
prokaryotes, transcription in eukaryotes is an energy-intensive, multi-step process, requiring a 
large number of molecular events to be coordinated in both space and time. Given the intricacy 
involved in a single transcriptional regulatory interaction, one can ask whether or not the 
complexity of the whole network of transcriptional interactions has imposed a significant 
constraint on the organization of genes across the different eukaryotic chromosomes. This 
becomes particularly interesting in light of recent work, which demonstrated that tuning the 
expression level of a single gene can provide an enormous fitness advantage to an individual in 
a population of cells (Dekel and Alon, 2005). Thus one could extrapolate that optimization of 
transcriptional regulation on a global scale, such as the efficient expression of relevant genes 
under specific conditions, would confer a significant fitness advantage on an individual in a 
genetically heterogeneous population. 
Though several studies have reported that genes with similar expression patterns 
cluster on the genome and that gene order is conserved, no study has investigated how genes 
are organized across and within the chromosomes. Given that eukaryotes contain several 
chromosomes, is the set of genes regulated by a given TF (i) randomly distributed across 
different chromosomes or encoded on specific chromosomes? (ii) distributed in an unbiased 
manner within a chromosomal arm or preferentially encoded in regions containing 
particular chromosomal landmarks? (iii) positionally clustered within a chromosome or not? 
Here, we investigate these questions by using the recently available genome-scale data on 
13,853 high-confidence regulatory interactions (see Materials and Methods). These data cover 
156 TFs and 4,495 target genes for the model eukaryote Saccharomyces cerevisiae, whose 
genetic material is organized into 16 linear chromosomes. 
3.2.2.1 The majority of TFs show a strong preference to regulate 
genes on specific chromosomes 
Several elegant studies have elucidated that the organization of chromosomes within the 
eukaryotic cell nucleus is non-random and that they occupy distinct volumes called 
chromosomal territories (Cremer and Cremer, 2001; Gasser, 2002). In yeast, in addition to the 
ordered movements during cell division, it has been demonstrated that interphase 
chromosomes undergo large, rapid movements (over 0.5 μm in a 10-second interval; nuclear 
diameter of ~2 μm) and that such movements could reflect the metabolic state of the cell 
(Akhtar and Gasser, 2007; Gasser, 2002, and references therein). These observations have 
suggested that the non-random organization of the chromosomes could (i) allow functional 
compartmentalization of the nuclear space, thus potentially enhancing or repressing expression 
of specific genes and (ii) bring co-regulated genes into physical proximity in order to co-ordinate 
gene expression. The above-mentioned observations (also see previous sections) on the non-
random nuclear architecture and chromosomal dynamics together with the fact that 
transcriptional regulation in eukaryotes is an energy-intensive, highly coordinated and time-
intensive process motivated us to ask if such considerations have constrained the positioning of 
genes in specific chromosomes during the course of evolution.  
Given that eukaryotes encode several linear chromosomes, we first investigated if the 
targets of TFs tend to be preferentially encoded on specific chromosomes, or randomly 
distributed on different chromosomes. We therefore analyzed the chromosomal location of the 
targets for each TF in the currently available map of protein-DNA interactions for yeast (see 
Figure 3-2A and methods). We first created a ‘chromosome preference profile’ for every TF, 
which is a vector that contains the number of target genes on each of the 16 chromosomes. By 
comparing this vector to the profile expected by chance (see Methods), we identified the TFs that 
displayed a significant preference to have their targets on specific chromosomes. 
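
The profile comparison described above can be illustrated with a minimal Python sketch. It is not the code used in this work: the function names, the `gene_to_chrom` mapping and the use of the population standard deviation are assumptions made for illustration, and the random profiles are taken as given.

```python
import statistics

def chromosome_profile(targets, gene_to_chrom, n_chrom=16):
    """Count a TF's targets on each chromosome (index 0 is chromosome 1).

    gene_to_chrom is a hypothetical dict mapping gene name -> chromosome
    number (1-based).
    """
    profile = [0] * n_chrom
    for gene in targets:
        profile[gene_to_chrom[gene] - 1] += 1
    return profile

def z_score_profile(observed, random_profiles):
    """Z = (x - mu) / sigma per chromosome, where mu and sigma come from the
    same TF's profiles in the randomized networks.

    Assumption: population standard deviation; chromosomes with sigma == 0
    are given Z = 0 to avoid division by zero.
    """
    z = []
    for i, x in enumerate(observed):
        values = [p[i] for p in random_profiles]
        mu = statistics.mean(values)
        sigma = statistics.pstdev(values)
        z.append((x - mu) / sigma if sigma > 0 else 0.0)
    return z
```

In such a sketch, a TF would be flagged as showing chromosomal preference when any entry of its Z-score profile reaches the threshold used in this chapter (Z ≥ 3).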
 
[Figure 3-2 (schematic; image not reproduced). Recoverable panel content: 
(A) Chromosomal preference: do transcription factors preferentially regulate genes on specific 
chromosomes? Step 1: obtain the number of binding events for a TF on the different chromosomes, 
i.e. the occurrence profile. Step 2: obtain the expected occurrence profile by analyzing 1000 
randomly rewired networks and calculate Z-scores, where Z-score = (x - μ)/σ. Step 3: obtain the 
Z-score profile and identify TFs that show chromosomal affinity (i.e. Z-score ≥ 3). 
(B) Regional preference: do transcription factors preferentially bind or avoid specific regions on 
the chromosomes (i.e. the centromeric, middle or telomeric region)? Step 1: obtain the number of 
binding events on the different regions of the chromosome (C, M, T) and get the occurrence profile. 
Step 2: obtain the expected occurrence profile by analyzing 1000 randomly rewired networks and 
calculate Z-scores. Step 3: obtain the Z-score profile and identify TFs that show regional 
preference (i.e. Z-score ≥ 3). 
(C) Clustering of targets: do transcription factors preferentially regulate genes that are proximal 
on a chromosome? Step 1: obtain the number (N) and position of targets for a TF on the different 
chromosomes. Step 2: create a network where nodes are targets and link genes if they are separated 
by ≤ 10 genes; identify the number of distinct genes that are clustered (C) and calculate 
TPI = C/N. Step 3: obtain the average TPI for all TFs over 1000 randomly rewired networks and 
identify TFs with Z-score ≥ 3. In the illustrated example, C = 8 and N = 11, giving an observed 
TPI of 0.72 against an expected TPI of 0.12 (Z-score +8).] 
 
Figure 3-2: Schematic showing the methods employed to estimate the significance for (A) chromosomal 
preference, (B) regional preference and (C) clustering of target genes. Please see Methods for details. x: 
observed value; μ: mean; σ: standard deviation. 
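
As a hedged illustration of panel C, the TPI of a single TF could be computed along the following lines. The sketch assumes that each target is described by a hypothetical (chromosome, gene-order index) pair, that targets are distinct, and that "separated by ≤ 10 genes" means a difference in gene-order index of at most 10; the schematic does not settle that last point.

```python
def tpi(targets, max_gap=10):
    """Fraction of a TF's targets that have another target of the same TF
    within max_gap genes on the same chromosome (TPI = C / N).

    targets: list of distinct (chromosome, gene_order_index) pairs.
    """
    by_chrom = {}
    for chrom, idx in targets:
        by_chrom.setdefault(chrom, []).append(idx)

    clustered = 0  # C: number of targets with at least one close neighbour
    for indices in by_chrom.values():
        indices.sort()
        for k, idx in enumerate(indices):
            near_prev = k > 0 and idx - indices[k - 1] <= max_gap
            near_next = k + 1 < len(indices) and indices[k + 1] - idx <= max_gap
            if near_prev or near_next:
                clustered += 1
    return clustered / len(targets) if targets else 0.0
```

Only the nearest neighbours in sorted order need to be checked, since a target with no neighbour within `max_gap` on either side cannot be linked to any more distant target.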
 
Since the null model is critical to obtain statistical significance, we ensured that the 
random networks are as close as possible to the real network in terms of the topology and the 
gene distribution on the chromosomes. The random networks were therefore obtained by 
employing a re-wiring procedure, preserving the connectivity distribution and the inherent 
chromosomal distribution of the genes. In other words, the number of targets for each TF and 
the number of TFs regulating a given target gene (TG) in the random networks will be the same 
as what is seen in the real network but the interactions between them are randomly re-wired. As 
this procedure does not randomize the chromosomal position of a gene, any inherent, non-
random clustering of genes on the genome is explicitly maintained. Furthermore, this procedure 
treats every chromosome independently by maintaining the same gene density and the same 
number of genes as seen in the real yeast chromosomes. This therefore allows us to assess 
any preference for binding by the TFs. For all observations reported here, statistical significance 
was assessed based on p-value and Z-score. Only TFs with p ≤ 10⁻³ and |Z| ≥ 3 were 
considered to show a significant difference in comparison to the null model. To correct for 
multiple testing, we calculated q-values as a measure of significance using the q-value package 
in R. We estimate a false discovery rate (FDR) of 0.3% when calling all p ≤ 10⁻³ as significant. 
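
One common way to obtain such degree-preserving random networks is repeated edge swapping. The sketch below is illustrative only: the text does not state which rewiring algorithm was used, so the swap scheme, the function name and the parameters are assumptions.

```python
import random

def rewire(edges, n_swaps, seed=None):
    """Randomize a TF -> target network by swapping the targets of two
    randomly chosen edges.

    Each TF keeps its number of targets and each target keeps its number of
    regulators; gene positions on the chromosomes are never touched.
    edges: list of distinct (tf, target) pairs.
    """
    rng = random.Random(seed)
    edges = list(edges)
    present = set(edges)
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edges)), 2)
        (tf_a, tg_a), (tf_b, tg_b) = edges[i], edges[j]
        new_a, new_b = (tf_a, tg_b), (tf_b, tg_a)
        # Skip swaps that change nothing or would create a duplicate edge.
        if tf_a == tf_b or tg_a == tg_b or new_a in present or new_b in present:
            continue
        present.difference_update((edges[i], edges[j]))
        present.update((new_a, new_b))
        edges[i], edges[j] = new_a, new_b
    return edges
```

Because only the TF-target pairing is shuffled, any inherent clustering of genes along the genome is carried over unchanged into every random network, which is the property the null model relies on.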
Through this analysis, we found that a majority of the TFs (84 TFs, p < 10⁻³ and Z ≥ 3) 
showed a striking preference to have a significant fraction of their target genes encoded on at least one 
particular chromosome. Of these, 78% (66 TFs) showed a preference for only one chromosome, 
18% (15 TFs) showed a preference for two chromosomes and a smaller fraction (4%) of the TFs 
showed a preference for three or more chromosomes. Figure 3-3A shows all 16 chromosomes 
of S. cerevisiae along with the TFs which have been identified to preferentially bind to the 
target loci on them. Several TFs showed a strong preference to 
regulate genes on specific chromosomes. These include (i) the global regulatory hub 
Sok2p, showing a significant preference for binding to chromosome 15 (observed, x: 67, 
expected, μ: 32, Z: 6.7, p < 10⁻³), regulating genes important for pseudohyphal differentiation 
and vesicle trafficking, (ii) Phd1p, showing a preference for binding to chromosome 5 (x: 52, μ: 
23, Z: 7.0, p < 10⁻³) and chromosome 9 (x: 32, μ: 14, Z: 5.1, p < 10⁻³), controlling expression of 
genes required for differentiation and (iii) Msn4p, showing a preference for chromosome 13 (x: 32, 
μ: 13, Z: 5.4, p < 10⁻³), regulating expression of genes involved in stress response. While it is 
interesting to note that all of the 16 chromosomes have a preferred set of TFs binding them 
(Figure 3-3B), the number of TFs showing preference to a particular chromosome does not 
correlate with the physical size of the chromosome (in bp), gene content or the gene density. 
Taken together, these observations indicate that the targets of most TFs are not randomly 
distributed across the different chromosomes. Instead, they are highly ordered and show a 
preference to be encoded on specific chromosomes, independent of the size and the gene 
density of the chromosome. 
Our finding that such a pattern of organization exists for the distribution of targets of TFs 
motivated us to methodically analyze (i) if the TFs themselves show a preference to be encoded 
on specific chromosomes, and in particular, if global regulatory proteins show any such 
preference and (ii) if there are any patterns of higher-order organization of regulatory 
interactions between chromosomes. Our investigation on the first question unambiguously 
revealed that TFs and particularly the global regulatory hubs do not show any preference to be 
encoded on specific chromosomes. Instead the distribution was similar to what is expected by 
chance. However, we identified the existence of a higher-order organization of regulatory 
interactions wherein several TFs which are encoded on specific chromosomes tend to 
preferentially regulate or avoid regulating genes on distinct chromosomes. Figure 3-3C shows 
the links between chromosomes which display a statistically significant tendency to either interact 
(red line; p < 10⁻³; Z ≥ 3) or avoid interaction (blue line; p < 10⁻³; Z ≤ -3) in the context of 
transcriptional regulation. These observations suggest that TFs encoded in specific 
chromosomes can show distinct preferences to regulate targets encoded on particular 
chromosomes and might reflect a coordinated and possibly a combinatorial, effect between TFs 
that are encoded in the same chromosome. 
 
 
 
 
 
 
 
 
 
 
[Figure 3-3 (image not reproduced). Panel A: matrix of TFs (rows, with number of targets) against the 
16 chromosomes (columns), coloured on a Z-score scale. Panel B: one sub-panel per chromosome, Chr-1 to 
Chr-16. Panel C: the 16 chromosomes encoding the transcription factors linked to the 16 chromosomes 
encoding the target genes.] 
 
Figure 3-3: Chromosomal preference for binding by TFs. (A) Each column in the matrix represents one of 
the 16 chromosomes. Each row represents the Z-score significance profile of a particular TF to have its 
targets on the different chromosomes. The top 75 TFs (selected by p-value and higher Z-scores) are 
ordered after hierarchically clustering their Z-score profiles. The number of target genes is shown next to 
the gene name. (B) TFs with target preference for each of the 16 chromosomes. Only those TFs which 
show significant preference and regulate more than 16 genes are shown. Each chromosome has a set of 
TFs that tend to preferentially bind it. The thickness of the red line is proportional to the absolute 
number of target genes for that TF on the chromosome. (C) Higher-order organization of regulatory 
interactions. The top and bottom rows denote the chromosomes encoding the TFs and the TGs, 
respectively. Red and blue lines connecting two chromosomes mean that TFs originating from a specific 
chromosome tend to preferentially regulate or avoid targets on a particular chromosome, respectively. 
The thickness is proportional to the Z-score.  
3.2.2.2 A significant fraction of the TFs tend to have targets on specific 
regions of the chromosomal arm 
Apart from the fact that the nucleus is organized into sub-compartments, creating 
microenvironments that facilitate distinct nuclear functions, several studies that visualized 
precise chromosomal loci have revealed that specific regions of the chromosomes display 
restricted displacement to varying degrees (Akhtar and Gasser, 2007; Gasser, 2002). For 
instance, in yeast, chromosomal ‘landmarks’ such as the telomeres and centromeres show 
marked constraints in their movements within the nuclear space when compared to other 
chromosomal loci. In addition, live microscopy studies have revealed that centromeres tend to 
cluster near the spindle pole body (SPB) whereas the telomeres tend to be tethered to the 
nuclear envelope (Akhtar and Gasser, 2007; Gasser, 2002). Moreover it has been shown that 
yeast chromosomes form chromosomal loops, where the telomeric ends come closer to each 
other than to the centromeres. Such anchoring of chromosomal regions is thought to be 
reversible and is known to involve microtubules that associate with the SPB (for centromeres) 
and the yKu heterodimeric protein, Esc1p and Sir4p (for telomeres)(Akhtar and Gasser, 2007; 
Gasser, 2002). This phenomenon of periodic attachment of distinct regions of the chromosomal 
arms to the nuclear periphery appears to be a conserved mechanism and is believed to regulate 
patterned gene expression, possibly by separating transcriptionally active and inactive 
chromosomal domains (Finlan et al., 2008; Guelen et al., 2008b). These observations motivated 
us to assess if such phenomena, during the course of evolution, could have constrained the 
target genes of TFs to be encoded within distinct regions of the chromosomal arm. 
In particular, we asked if TFs tend to preferentially bind or avoid specific regions on the 
linear chromosomes, such as regions closer to the centromere, the telomere, or the regions in-
between. To investigate this question, we first divided each chromosomal arm into three equal 
regions (in bp): C, containing the centromere, M, the middle region and T, containing the 
telomere. For each TF, we then created a ‘regional preference profile’, which contains the 
number of targets in each of the three regions. Comparing these results with random 
expectation by performing the same calculations on 1000 random networks allowed us to 
assess the statistical significance (see Figure 3-2B and Methods). This enabled the discovery of 
TFs which display a significant bias to bind to particular regions of the chromosomal arm 
independent of the specific chromosome. We found that 29 TFs (Figure 3-4A) showed a 
statistically significant preference (p < 10⁻³; Z ≥ 3 at an FDR of 0.5%) to bind to a particular region 
over others, thus providing the first evidence for the prevalence of such an effect. We show that 
several TFs display a strong preference to bind specific regions on chromosomal arms. For 
instance, Hsf1p, the trimeric heat shock regulatory protein and Msn2p, the multicopy suppressor 
of SNF1 mutation protein tend to preferentially regulate genes that are encoded in regions 
closer to the centromere, whereas the bZIP domain containing TFs Yap5p and Yap6p which are 
required under stress conditions tend to bind to regions closer to the telomere. Additional 
evidence which reinforced our observations that certain TFs do show preference to bind to 
specific regions on the chromosome came from our inspection of the TFs which avoided binding 
to a particular region (Figure 3-4B). We found that certain TFs, such as the osmosis-dependent 
regulator Skn7p and Msn2p clearly avoided binding to the T-region (containing the telomere) 
while the pleiotropic drug regulator Pdr1p and Smp1p avoided regulating genes in the C-region 
(containing the centromere). Interestingly, the suppressor of kinase Sok2p, which regulates 
genes involved in cellular differentiation, avoids binding to both the C and M regions of the 
chromosomes, displaying a clear preference to bind to the region containing the telomere. 
Taken together, these observations suggest that events which allowed clustering of certain 
functionally related genes, based on their usage, accessibility and transcriptional activity, have 
been selected during evolution. Consistent with this proposal, it is interesting to note that 
regions that cluster at the nuclear periphery, such as the telomeres and the mating-type 
loci, are generally transcriptionally silent, whereas internally located regions encoding metabolic 
enzymes on the chromosomal arm get recruited to nuclear pores upon transcriptional activation 
(Cabal et al., 2006; Casolari et al., 2004; Ishii et al., 2002; Taddei et al., 2006).  
We then investigated if (i) the loci encoding TFs, and in particular global regulatory 
proteins, show any regional preference and (ii) there are patterns of higher-order organization of 
regulatory interactions involving specific chromosomal regions, i.e., if TFs encoded in specific 
regions tend to preferentially regulate genes on other chromosomal regions. Though our 
investigation along these lines revealed the absence of any such preferential organizational 
pattern for the loci encoding TFs, we discovered that genes encoding global regulatory hubs 
tend to strongly avoid being encoded in regions closer to the telomere (p = 0.004). 
Investigations to uncover the presence of higher-order interactions between specific 
chromosomal regions revealed that TFs encoded elsewhere in the genome regulate genes 
within the T-region whereas TFs within the T-region appear to preferentially avoid regulating 
genes in the same region (p = 0.007; Figure 3-4C). These observations are consistent with the 
fact that genes on telomeric and sub-telomeric regions are largely repressed. Given the 
dynamic nature of the different chromosomal regions and the differential transcriptional activity 
associated with specific regions, such organization of loci encoding TFs within specific regions 
            
[Figure 3-4 image. Panels A and B: matrices shaded on a Z-score scale. Panel C: regions encoding the TFs (C, M, T) linked to regions encoding the targets (C, M, T).]
 
Figure 3-4: TFs showing significant regional preference or avoidance for binding on the chromosomes 
(see Figure 3-2B). (A) TFs which show a strong tendency to have their targets on the C-region 
(containing the centromere), M-region (containing the middle region) or T-region (containing the telomere) 
on the chromosome. (B) TFs which show a strong avoidance to have their targets on the three regions. 
Green boxes highlight the group of TFs which show significant regional avoidance for one of the three 
regions. In the cartoon next to the matrices, thick black lines indicate preference and broken black lines 
indicate avoidance. Only TFs with p < 10⁻³ and |Z| ≥ 3 are shown in both cases. (C) Higher order 
organization of regulatory interactions. The top column denotes regions on the chromosomal arm where 
the TFs are encoded and the bottom column denotes the regions where the targets are encoded. Lines 
connecting the two regions mean that TFs originating from a specific region tend to preferentially have 
(red lines) or avoid (blue lines) targets on a particular region of the chromosome. The thickness is 
proportional to the Z-score. 
of the chromosomes, and patterns of higher order regulatory interactions may have been 
selected during evolution. Taken together, the findings reported here strongly suggest that such 
regional preferences are not only seen for the targets of specific TFs, but also for global 
regulatory hubs and the regulatory interactions affecting expression of genes in specific 
chromosomal regions. 
3.2.2.3 Most TFs show a strong preference to positionally cluster their 
targets within a chromosome 
Though we report the prevalence of chromosomal preference and regional bias in the 
distribution of the targets of a large fraction of the TFs, it does not answer if the regulated genes 
are proximal to each other on the chromosome or if they are relatively far apart within the same 
region. While several studies have revealed that genes with similar expression profiles (co-
expressed genes) cluster on the chromosome (Cohen et al., 2000; Hurst et al., 2004; Spellman 
and Rubin, 2002), no study has addressed, on a genomic scale, whether the targets of the same 
TF cluster on the chromosome. Although previous studies have unambiguously revealed the 
existence of chromosomal domains that contain genes with similar expression patterns 
(co-expressed genes), it should be kept in mind that clustering of co-expressed genes need not 
always imply regulation by the same TF, because co-expressed genes may be clustered for 
several reasons, such as mechanisms involving chromatin remodeling, transcriptional 
read-through, regulation of genes by the same TF or regulation by different TFs in the same 
transcriptionally active euchromatinic domain (Batada et al., 2007). Therefore, we initiated a 
systematic investigation and analyzed if the targets of most TFs display positional clustering on 
a given chromosome or not. 
 We first defined and calculated the Target Proximity Index (TPI) for each TF (see 
Methods and Figure 3-2C). In short, the TPI for a TF represents the fraction of all the regulated 
genes that show proximal clustering on the chromosome. In our study we defined proximity, D, 
as the number of genes that separate two targets of a TF. We then compared the TPI values for 
the observed and the random networks to obtain the statistical significance. From our analysis, 
we found that most TFs (>75%) showed high TPI values (TPI > 0.6, p < 10⁻³; at an FDR of 0.1% 
for D ≤ 20), suggesting a strong preference for target genes to be clustered within a distance 
range of ~20 genes. In contrast, TPI values in random networks for the same distance 
threshold were found to be significantly lower than 0.2. To ensure that the observations are (i) 
not biased by tandem gene duplications controlled by the same TF or (ii) not biased by 
divergent, bi-directional genes which could artificially increase the TPI score, relevant control 
     
[Figure 3-5 image. Cartoon: Target Proximity Index (TPI) for a TF with N = 11 targets, where C = 2 gives TPI = 2/11 and C = 10 gives TPI = 10/11. Panels A (1 ≤ D ≤ 5), B (5 ≤ D ≤ 30) and C (40 ≤ D ≤ 200): % of Transcription Factors (y-axis, 0 to 50) against Target Proximity Index (x-axis, 0.1 to 1), with real and random series for D ≤ 1, 2, 3, 4, 5 (A); D ≤ 5, 10, 15, 20, 30 (B); and D ≤ 40, 60, 80, 100, 200 (C).]
 
Figure 3-5: Frequency distribution of TPI values. Distribution of Target Proximity Index (TPI) for all TFs in 
the real and randomly constructed networks at different proximity values i.e., D values (see Methods) are 
shown in (A) D ≤ 1 to D ≤ 5; (B) D ≤ 5 to D ≤ 30 and (C) D ≤ 40 to D ≤ 200. Note that in the real network, 
the maximum proportion of TFs have TPI values which are much higher than what is seen for the random 
networks (at around 0.8 for real network and 0.2 for random networks at D ≤ 20), demonstrating that most 
TFs show clustering of their targets in this distance range. 
calculations were performed. In the filtered network, we removed (i) all tandem duplicates from 
our dataset and (ii) randomly chose a target gene from a divergent, bi-directional gene pair and 
calculated the TPI score. Our results did not change after controlling for tandem duplicates and 
bi-directionally transcribed genes, suggesting that what we observe is truly attributable to 
positional clustering of targets on a chromosome. An investigation of how many genes are 
positionally clustered within the window of 20 genes revealed that, on average, such a 
window contains only 2.6 genes regulated by the same TF. This is striking and suggests that all 
three mechanisms, i.e., (a) chromatin remodeling, (b) regulation by different TFs in the same 
euchromatinic domain and (c) regulation by the same TF within a euchromatinic domain, may 
contribute to the previously observed domains of co-expressed genes. 
To test the robustness of the TPI values to our definition of proximity, we 
systematically varied this parameter (D) from 1 to 200 and compared the resulting values with those 
obtained in random networks (Figure 3-5). We found that significant separation between real 
data and random networks occurred for the definition of proximity (D) as being less than 20 
genes, suggesting that this could reflect the average size of a possible open euchromatinic 
domain that is available for transcription in yeast. Our results therefore suggest that evolution 
might have favored certain recombination events which allowed genes that need to be regulated 
by the same TF to be encoded close to each other. Another distinct possibility given that 
transcriptional regulatory networks are likely to be plastic (Borneman et al., 2007) would be that 
selection could have first driven clustering of genes that need to be co-regulated and then new 
transcriptional regulatory interactions could have evolved afterwards. Regardless of the driving 
force, the evolutionary advantages are clear: such a clustering of targets would not demand 
high nuclear concentrations of TFs, which are generally expressed in low quantities, and would 
prevent inappropriate regulation of unrelated target genes. Such an organization has the added 
advantage of minimizing noise in expression levels, which has recently been proposed to be an 
additional driving force for gene order conservation (Batada and Hurst, 2007). 
3.3 DISCUSSION & CONCLUSION 
In conclusion, our study demonstrates that the complexity of transcriptional regulation 
constrains genome organization at several levels. Findings beyond those discussed in detail 
here, such as the strong preference of TFs encoded on specific chromosomes and within distinct 
regions to regulate genes on distinct chromosomes and regions, open up several questions 
and expand our need to understand eukaryotic gene regulation at a higher level. The findings 
reported here are consistent with several molecular mechanisms, such as the genome-wide 
loop model of chromosomes (Francastel et al., 2000), the presence of expression hubs (Kosak 
and Groudine, 2004) and transcription factories (Cook, 1999; Osborne et al., 2004) and the 
nuclear gating hypothesis (Blobel, 1985). 
With the development of experimental methods such as chromosome conformation capture (3C), 4C 
and 5C and the availability of genome-scale data on protein-DNA interactions from 
high-throughput experiments in other eukaryotes (shown in Table 3-1), our work provides a 
fundamental framework by which such questions can be systematically studied for higher 
eukaryotes. In fact, a preliminary analysis in mammalian systems using the stem cell differentiation 
factors Sox2, Oct4 and Nanog has indeed revealed a striking preference for these TFs to 
have their targets encoded on specific chromosomes (SCJ, MMB, Unpublished). We therefore believe 
that our work, which demonstrates that gene organization is constrained by the process of 
transcriptional regulation in yeast, is likely to be a paradigm that is also applicable to other 
eukaryotes.  
The findings reported here have several direct applications. For instance, the map that we 
describe for yeast in this study can serve as a guide and be exploited in genetic engineering 
experiments for identifying the most appropriate region (on the 16 chromosomes) to incorporate 
a gene of interest – particularly if it has to be regulated under the control of a specific TF. 
Describing such maps for higher eukaryotes will have implications in gene therapy and in 
rationally identifying suitable sites to incorporate reporter genes while producing transgenic 
organisms. We anticipate that revealing the presence of such patterns of organization of genes 
within the linear chromosomes of eukaryotes, such as humans, would have significant 
implications in our understanding of transcriptional regulation, chromosomal territories, their role 
in cellular differentiation and of specific chromosomal disorders, such as recombination events 
and copy number variations that are prevalent in diverse diseases such as cancer. 
3.4 MATERIALS AND METHODS 
3.4.1 Dataset of Transcription factors in S. cerevisiae and their 
regulatory interactions 
The transcriptional regulatory network for S. cerevisiae was assembled from the results of 
literature curation and ChIP-chip experiments (Harbison et al., 2004; Horak et al., 2002; Lee et 
al., 2002; Svetlov and Cooper, 1995). This network consists of 4527 genes, comprising 156 
DNA-binding TFs and 4495 target genes, connected by 13,853 regulatory interactions. 31 TFs 
qualified as hubs, which were defined as the top 20% of TFs ranked by out-going connectivity. 
Chromosomal positions of all the protein coding genes on the yeast genome were obtained from 
http://www.yeastgenome.org. Tandem duplicates and bi-directionally transcribed genes were 
identified by employing pair-wise BLAST using an e-value cut-off of 10⁻² and using chromosomal 
position of the genes in the network. 
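As an illustration of the hub definition above, the following minimal Python sketch ranks TFs by their number of distinct targets and keeps the top 20%. It is not the code used in this study: the function name find_hubs, the (TF, target) edge-list input, the rounding down of the cut-off and the alphabetical tie-breaking are all assumptions made for the example.

```python
from collections import Counter


def find_hubs(interactions, top_fraction=0.20):
    """Return the TFs in the top `top_fraction` by out-going connectivity.

    `interactions` is an iterable of (tf, target) pairs (hypothetical format).
    Duplicate pairs are counted once. Ties are broken alphabetically and the
    number of hubs is rounded down; both choices are assumptions.
    """
    out_degree = Counter()
    for tf, _target in set(interactions):
        out_degree[tf] += 1
    # Highest out-degree first, then alphabetical for a stable order.
    ranked = sorted(out_degree, key=lambda tf: (-out_degree[tf], tf))
    n_hubs = int(len(ranked) * top_fraction)
    return ranked[:n_hubs]
```

Under these assumptions, rounding down 20% of 156 TFs gives 31, which agrees with the number of hubs reported above.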
3.4.2 Estimation of statistical significance 
To estimate statistical significance of the properties described here, the reported values for the 
real network of protein-DNA interactions were compared against 1000 randomly generated 
networks obtained by employing the re-wiring procedure. The re-wiring procedure randomly 
reconnects TFs with target genes but ensures that any inherent gene distribution on the 
chromosome and the overall connectivity distribution of the network are maintained. As this 
procedure does not randomize the chromosomal position of a gene, any inherent, non-random 
clustering of genes on the genome is explicitly maintained. Furthermore, it is important to note 
that this procedure maintains the same gene density and the same number of genes on a 
chromosome as what is seen in the real yeast chromosomes. This therefore allows us to assess 
any preference for binding by the TFs reported in our study. To assess if TFs and hubs were 
preferentially encoded in different chromosomes, we carried out 1000 trials, where we randomly 
picked the same number of genes as the number of TFs and hubs seen in the real network and 
analyzed their chromosomal distribution. For all observations reported in our study, 
statistical significance was assessed based on (i) p-value, defined as the fraction of the 1000 
random networks which showed a value ≥ what was observed in the real network and (ii) 
Z-score, calculated as the number of standard deviations the observed value is away from the 
mean of the 1000 random networks. This is obtained as the ratio of the difference between the 
observed value, x, and the random expected (mean) value, μ, to the standard deviation, σ, i.e., 
Z = (x - μ)/σ. TFs with p ≤ 10⁻³ and |Z-scores| ≥ 3 (unless stated otherwise) were considered to 
show a significant difference in comparison to the null model described above. All significance 
values were corrected for multiple testing using the q-value package in R (Arava et al., 2003). In 
particular, the Benjamini & Hochberg step-wise p-value method implemented in the package was used. 
The same package was used to assess the False Discovery Rate (FDR) at a p-value threshold 
of 10⁻³ and to estimate the corresponding q-values.  
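The re-wiring step described above could be sketched as follows. This is a hedged illustration rather than the procedure actually implemented: it assumes that re-wiring is done by repeatedly swapping the targets of two randomly chosen interactions, which keeps the number of targets of every TF and the number of regulators of every gene unchanged and never alters the chromosomal position of a gene. The names rewire, n_swaps and seed are hypothetical.

```python
import random


def rewire(edges, n_swaps=None, seed=None):
    """Degree-preserving randomization of a TF -> target network.

    `edges` is an iterable of (tf, target) pairs. Two interactions
    (tf1, g1) and (tf2, g2) are picked at random and replaced by
    (tf1, g2) and (tf2, g1), unless that would create a duplicate.
    The default of 10 swap attempts per edge is an arbitrary assumption.
    """
    rng = random.Random(seed)
    edge_list = sorted(set(edges))      # sorted for reproducibility with a seed
    edge_set = set(edge_list)
    if n_swaps is None:
        n_swaps = 10 * len(edge_list)
    for _ in range(n_swaps):
        i = rng.randrange(len(edge_list))
        j = rng.randrange(len(edge_list))
        tf1, g1 = edge_list[i]
        tf2, g2 = edge_list[j]
        new1, new2 = (tf1, g2), (tf2, g1)
        # Skip swaps that change nothing or would duplicate an interaction.
        if tf1 == tf2 or g1 == g2 or new1 in edge_set or new2 in edge_set:
            continue
        edge_set.discard((tf1, g1))
        edge_set.discard((tf2, g2))
        edge_set.add(new1)
        edge_set.add(new2)
        edge_list[i], edge_list[j] = new1, new2
    return edge_list
```

Calling such a function 1000 times with different seeds would yield a set of random networks against which any observed statistic can be compared.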
3.4.3 Calculation of chromosomal preference 
To test whether a TF has a preference to bind a specific chromosome more often than expected 
by chance, we first constructed a ‘chromosomal binding profile’. This is a 16 dimensional (one 
for each chromosome) vector describing the number of binding events in each chromosome. 
We then obtained an expected ‘chromosomal binding profile’ by using 1000 randomly re-wired 
networks, and taking it through a similar procedure. The preference for a TF to bind a particular 
chromosome was measured using p-value and Z-score profiles. The p-value for each TF for 
each chromosome was estimated as the fraction of the 1000 random networks that showed an 
equal or higher number of binding events than in the real network. The p-value profile was 
obtained in a similar manner across the different chromosomes for all TFs. The Z-score profile 
was calculated based on average binding frequency and standard deviation from the 1000 
random networks (see Figure 3-2A). Only those TFs which showed a preference to bind to at 
least one chromosome with p ≤ 10⁻³ and Z ≥ 3 were considered significant. A p-value cut-off of 
10⁻³ results in an estimated FDR of 0.3%. 
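The profile comparison described in this section can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: the function names, the (TF, target) edge list and the chrom_of dictionary are hypothetical, the population standard deviation is used for the Z-score (the text does not say which estimator was used), and an undefined Z-score is returned as NaN when all random networks give the same count.

```python
from collections import Counter
from statistics import mean, pstdev


def binding_profile(tf, edges, chrom_of, chromosomes):
    """Number of targets of `tf` on each chromosome, in the order given."""
    counts = Counter(chrom_of[g] for t, g in edges if t == tf)
    return [counts[c] for c in chromosomes]


def p_and_z_profiles(observed, random_profiles):
    """Empirical p-value and Z-score for each entry of a binding profile.

    `observed` is the profile from the real network; `random_profiles`
    holds one profile per re-wired network (1000 in the text).
    """
    p_values, z_scores = [], []
    for k, x in enumerate(observed):
        null = [profile[k] for profile in random_profiles]
        # Fraction of random networks with a count >= the observed count.
        p_values.append(sum(v >= x for v in null) / len(null))
        mu, sigma = mean(null), pstdev(null)
        z_scores.append((x - mu) / sigma if sigma > 0 else float("nan"))
    return p_values, z_scores
```

With 1000 random networks the smallest non-zero p-value such a calculation can return is 0.001, so a cut-off of p ≤ 10⁻³ corresponds to at most one random network reaching the observed count.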
3.4.4 Calculation of regional preference 
To assess if TFs preferentially bind to specific regions of the chromosomes more often than 
expected by chance, we first obtained a ‘regional binding profile’. Every chromosomal arm was 
divided into three regions of equal size (in bp, see Figure 3-2B) to obtain the C-region 
(containing the centromere), M-region (in the middle) and the T-region (containing the 
telomere). Thus the ‘regional binding profile’ is a 3 dimensional (one for each region) vector that 
captures the number of binding events of a TF on all the chromosomes. We then obtained the 
expected ‘regional binding profile’ by using the 1000 randomly re-wired networks and taking it 
through the same set of calculations. The p-value and Z-score profiles were obtained as 
described above. Only those TFs with p ≤ 10⁻³ were considered to show regional preference (or 
avoidance) for binding. We estimate an FDR of 0.5% at a p-value threshold of 10⁻³ for TFs 
showing regional preference.  
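The assignment of a gene to the C-, M- or T-region can be sketched in Python as below. This is an illustrative reading of the description above, not the original implementation: it assumes each gene is represented by a single base-pair coordinate (for example its midpoint), that the centromere is a single coordinate lying strictly inside the chromosome, and that region boundaries fall at one third and two thirds of the arm length measured from the centromere. All names are hypothetical.

```python
def region_of(position, centromere, chrom_length):
    """Return 'C', 'M' or 'T' for a coordinate on a chromosome arm.

    The arm containing `position` is split into three equal parts (in bp),
    counted outward from the centromere towards the telomere.
    """
    if position < centromere:
        arm_length = centromere                  # left arm
    else:
        arm_length = chrom_length - centromere   # right arm
    if arm_length <= 0:
        raise ValueError("centromere must lie strictly inside the chromosome")
    fraction = abs(position - centromere) / arm_length
    if fraction < 1 / 3:
        return "C"
    if fraction < 2 / 3:
        return "M"
    return "T"


def regional_profile(target_coords):
    """Count the targets of one TF in each region, over all chromosomes.

    `target_coords` is an iterable of (position, centromere, chrom_length).
    """
    profile = {"C": 0, "M": 0, "T": 0}
    for position, centromere, chrom_length in target_coords:
        profile[region_of(position, centromere, chrom_length)] += 1
    return profile
```

The resulting three counts play the role of the 'regional binding profile' and can be compared with the profiles from re-wired networks in the same way as the chromosomal profiles.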
3.4.5 Calculation of target proximity 
To assess the positional clustering of targets of a given TF across chromosomes, we calculated 
the Target Proximity Index (TPI) for each TF. This is defined as the ratio of the number of the 
targets that are within a particular distance (proximity is measured as D, the number of genes 
that physically separate two genes regulated by the same TF on the chromosome) to the total 
number of targets regulated by that TF (see Figure 3-2C). The TPI values lie between 0 and 1, 
where high TPI values indicate high clustering of a TF's targets. In order to test 
the significance of clustering of targets for each TF, we obtained the expected TPI values by 
computing the same for 1000 randomly re-wired networks. P-values and Z-scores were 
computed as described above. TFs were considered to display a preference to cluster their 
binding sites if p ≤ 10⁻³ and Z ≥ 3. At these thresholds we estimated an FDR of 0.1%. Since most 
TFs were found to show significant clustering of targets, the TPI score distribution of all the TFs 
was used to demonstrate the differences between the observed and expected behavior.  
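One possible reading of the TPI definition is sketched below in Python. It is an interpretation, not the original code: a target is counted as clustered if at least one other target of the same TF lies on the same chromosome with no more than D genes in between, where genes are identified by their rank along the chromosome. The function name, the gene_index mapping and the treatment of a TF without targets (TPI of 0) are assumptions made for the example.

```python
from collections import defaultdict


def target_proximity_index(targets, gene_index, max_separation):
    """Fraction of a TF's targets that have a neighbouring target nearby.

    `targets` is an iterable of gene identifiers regulated by one TF.
    `gene_index` maps a gene to (chromosome, rank along that chromosome).
    `max_separation` is D, the largest number of intervening genes allowed
    between two targets for them to count as proximal.
    """
    ranks_by_chrom = defaultdict(list)
    for gene in set(targets):
        chrom, rank = gene_index[gene]
        ranks_by_chrom[chrom].append(rank)
    n_total = sum(len(ranks) for ranks in ranks_by_chrom.values())
    if n_total == 0:
        return 0.0
    clustered = 0
    for ranks in ranks_by_chrom.values():
        ranks.sort()
        for k, rank in enumerate(ranks):
            # Only the nearest target on each side needs to be checked.
            near_left = k > 0 and rank - ranks[k - 1] - 1 <= max_separation
            near_right = (k + 1 < len(ranks)
                          and ranks[k + 1] - rank - 1 <= max_separation)
            if near_left or near_right:
                clustered += 1
    return clustered / n_total
```

Evaluating such a function for a range of max_separation values, on both the real and the re-wired networks, would give distributions comparable in form to those summarized in Figure 3-5.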
REFERENCES 
Akhtar, A. and Gasser, S. M. (2007). The nuclear envelope and transcriptional control. Nat 
Rev Genet 8, 507-17. 
Allfrey, V. G., Faulkner, R. and Mirsky, A. E. (1964). Acetylation and Methylation of Histones 
and Their Possible Role in the Regulation of Rna Synthesis. Proc Natl Acad Sci U S A 51, 786-
94. 
Apostolou, E. and Thanos, D. (2008). Virus Infection Induces NF-kappaB-dependent 
interchromosomal associations mediating monoallelic IFN-beta gene expression. Cell 134, 85-
96. 
Arava, Y., Wang, Y., Storey, J. D., Liu, C. L., Brown, P. O. and Herschlag, D. (2003). 
Genome-wide analysis of mRNA translation profiles in Saccharomyces cerevisiae. Proc Natl 
Acad Sci U S A 100, 3889-94. 
Augui, S., Filion, G. J., Huart, S., Nora, E., Guggiari, M., Maresca, M., Stewart, A. F. and 
Heard, E. (2007). Sensing X chromosome pairs before X inactivation via a novel X-pairing 
region of the Xic. Science 318, 1632-6. 
Batada, N. N. and Hurst, L. D. (2007). Evolution of chromosome organization driven by 
selection for reduced gene expression noise. Nat Genet 39, 945-9. 
Batada, N. N., Urrutia, A. O. and Hurst, L. D. (2007). Chromatin remodelling is a major source 
of coexpression of linked genes in yeast. Trends Genet 23, 480-4. 
Blobel, G. (1985). Gene gating: a hypothesis. Proc Natl Acad Sci U S A 82, 8527-9. 
Bolzer, A., Kreth, G., Solovei, I., Koehler, D., Saracoglu, K., Fauth, C., Muller, S., Eils, R., 
Cremer, C., Speicher, M. R. et al. (2005). Three-dimensional maps of all chromosomes in 
human male fibroblast nuclei and prometaphase rosettes. PLoS Biol 3, e157. 
Borneman, A. R., Gianoulis, T. A., Zhang, Z. D., Yu, H., Rozowsky, J., Seringhaus, M. R., 
Wang, L. Y., Gerstein, M. and Snyder, M. (2007). Divergence of transcription factor binding 
sites across related yeast species. Science 317, 815-9. 
Branco, M. R., Branco, T., Ramirez, F. and Pombo, A. (2008). Changes in chromosome 
organization during PHA-activation of resting human lymphocytes measured by cryo-FISH. 
Chromosome Res 16, 413-26. 
Branco, M. R. and Pombo, A. (2006). Intermingling of chromosome territories in interphase 
suggests role in translocations and transcription-dependent associations. PLoS Biol 4, e138. 
Brickner, D. G., Cajigas, I., Fondufe-Mittendorf, Y., Ahmed, S., Lee, P. C., Widom, J. and 
Brickner, J. H. (2007). H2A.Z-mediated localization of genes at the nuclear periphery confers 
epigenetic memory of previous transcriptional state. PLoS Biol 5, e81. 
Brickner, J. H. and Walter, P. (2004). Gene recruitment of the activated INO1 locus to the 
nuclear membrane. PLoS Biol 2, e342. 
Brown, C. R., Kennedy, C. J., Delmar, V. A., Forbes, D. J. and Silver, P. A. (2008). Global 
histone acetylation induces functional genomic reorganization at mammalian nuclear pore 
complexes. Genes Dev 22, 627-39. 
Browning, D. F. and Busby, S. J. (2004). The regulation of bacterial transcription initiation. Nat 
Rev Microbiol 2, 57-65. 
Cabal, G. G., Genovesio, A., Rodriguez-Navarro, S., Zimmer, C., Gadal, O., Lesne, A., Buc, 
H., Feuerbach-Fournier, F., Olivo-Marin, J. C., Hurt, E. C. et al. (2006). SAGA interacting 
factors confine sub-diffusion of transcribed genes to the nuclear envelope. Nature 441, 770-3. 
Carter, D., Chakalova, L., Osborne, C. S., Dai, Y. F. and Fraser, P. (2002). Long-range 
chromatin regulatory interactions in vivo. Nat Genet 32, 623-6. 
Casolari, J. M., Brown, C. R., Komili, S., West, J., Hieronymus, H. and Silver, P. A. (2004). 
Genome-wide localization of the nuclear transport machinery couples transcriptional status and 
nuclear organization. Cell 117, 427-39. 
Chuang, C. H., Carpenter, A. E., Fuchsova, B., Johnson, T., de Lanerolle, P. and Belmont, 
A. S. (2006). Long-range directional movement of an interphase chromosome site. Curr Biol 16, 
825-31. 
Cohen, B. A., Mitra, R. D., Hughes, J. D. and Church, G. M. (2000). A computational analysis 
of whole-genome expression data reveals chromosomal domains of gene expression. Nat 
Genet 26, 183-6. 
Cook, P. R. (1999). The organization of replication and transcription. Science 284, 1790-5. 
Cremer, T. and Cremer, C. (2001). Chromosome territories, nuclear architecture and gene 
regulation in mammalian cells. Nat Rev Genet 2, 292-301. 
Cremer, T., Cremer, M., Dietzel, S., Muller, S., Solovei, I. and Fakan, S. (2006). 
Chromosome territories--a functional nuclear landscape. Curr Opin Cell Biol 18, 307-16. 
Cremer, T., Kreth, G., Koester, H., Fink, R. H., Heintzmann, R., Cremer, M., Solovei, I., 
Zink, D. and Cremer, C. (2000). Chromosome territories, interchromatin domain compartment, 
and nuclear matrix: an integrated view of the functional nuclear architecture. Crit Rev Eukaryot 
Gene Expr 10, 179-212. 
de Laat, W. and Grosveld, F. (2007). Inter-chromosomal gene regulation in the mammalian 
cell nucleus. Curr Opin Genet Dev 17, 456-64. 
Dean, A. (2006). On a chromosome far, far away: LCRs and gene expression. Trends Genet 
22, 38-45. 
Dekel, E. and Alon, U. (2005). Optimality and evolutionary tuning of the expression level of a 
protein. Nature 436, 588-92. 
Dekker, J., Rippe, K., Dekker, M. and Kleckner, N. (2002). Capturing chromosome 
conformation. Science 295, 1306-11. 
Dorman, E. R., Bushey, A. M. and Corces, V. G. (2007). The role of insulator elements in 
large-scale chromatin structure in interphase. Semin Cell Dev Biol 18, 682-90. 
Drissen, R., Palstra, R. J., Gillemans, N., Splinter, E., Grosveld, F., Philipsen, S. and de 
Laat, W. (2004). The active spatial organization of the beta-globin locus requires the 
transcription factor EKLF. Genes Dev 18, 2485-90. 
Dundr, M., Ospina, J. K., Sung, M. H., John, S., Upender, M., Ried, T., Hager, G. L. and 
Matera, A. G. (2007). Actin-dependent intranuclear repositioning of an active gene locus in vivo. 
J Cell Biol 179, 1095-103. 
Federico, C., Scavo, C., Cantarella, C. D., Motta, S., Saccone, S. and Bernardi, G. (2006). 
Gene-rich and gene-poor chromosomal regions have different locations in the interphase nuclei 
of cold-blooded vertebrates. Chromosoma 115, 123-8. 
Finlan, L. E., Sproul, D., Thomson, I., Boyle, S., Kerr, E., Perry, P., Ylstra, B., Chubb, J. R. 
and Bickmore, W. A. (2008). Recruitment to the nuclear periphery can alter expression of 
genes in human cells. PLoS Genet 4, e1000039. 
Francastel, C., Schubeler, D., Martin, D. I. and Groudine, M. (2000). Nuclear 
compartmentalization and gene activity. Nat Rev Mol Cell Biol 1, 137-43. 
Fraser, P. and Bickmore, W. (2007). Nuclear organization of the genome and the potential for 
gene regulation. Nature 447, 413-7. 
Gartenberg, M. R., Neumann, F. R., Laroche, T., Blaszczyk, M. and Gasser, S. M. (2004). 
Sir-mediated repression can occur independently of chromosomal and subnuclear contexts. Cell 
119, 955-67. 
Gasser, S. M. (2002). Visualizing chromatin dynamics in interphase nuclei. Science 296, 1412-
6. 
Grimaud, C., Bantignies, F., Pal-Bhadra, M., Ghana, P., Bhadra, U. and Cavalli, G. (2006). 
RNAi components are required for nuclear clustering of Polycomb group response elements. 
Cell 124, 957-71. 
Guelen, L., Pagie, L., Brasset, E., Meuleman, W., Faza, M. B., Talhout, W., Eussen, B. H., 
de Klein, A., Wessels, L., de Laat, W. et al. (2008a). Domain organization of human 
chromosomes revealed by mapping of nuclear lamina interactions. Nature 453, 948-51. 
Guelen, L., Pagie, L., Brasset, E., Meuleman, W., Faza, M. B., Talhout, W., Eussen, B. H., 
de Klein, A., Wessels, L., de Laat, W. et al. (2008b). Domain organization of human 
chromosomes revealed by mapping of nuclear lamina interactions. Nature. 
Harbison, C. T., Gordon, D. B., Lee, T. I., Rinaldi, N. J., Macisaac, K. D., Danford, T. W., 
Hannett, N. M., Tagne, J. B., Reynolds, D. B., Yoo, J. et al. (2004). Transcriptional regulatory 
code of a eukaryotic genome. Nature 431, 99-104. 
Heun, P., Laroche, T., Shimada, K., Furrer, P. and Gasser, S. M. (2001). Chromosome 
dynamics in the yeast interphase nucleus. Science 294, 2181-6. 
Ho, Y., Elefant, F., Liebhaber, S. A. and Cooke, N. E. (2006). Locus control region 
transcription plays an active role in long-range gene activation. Mol Cell 23, 365-75. 
Horak, C. E., Luscombe, N. M., Qian, J., Bertone, P., Piccirrillo, S., Gerstein, M. and 
Snyder, M. (2002). Complex transcriptional circuitry at the G1/S transition in Saccharomyces 
cerevisiae. Genes Dev 16, 3017-33. 
Hurst, L. D., Pal, C. and Lercher, M. J. (2004). The evolutionary dynamics of eukaryotic gene 
order. Nat Rev Genet 5, 299-310. 
Ishii, K., Arib, G., Lin, C., Van Houwe, G. and Laemmli, U. K. (2002). Chromatin boundaries 
in budding yeast: the nuclear pore connection. Cell 109, 551-62. 
Janicki, S. M., Tsukamoto, T., Salghetti, S. E., Tansey, W. P., Sachidanandam, R., 
Prasanth, K. V., Ried, T., Shav-Tal, Y., Bertrand, E., Singer, R. H. et al. (2004). From 
silencing to gene expression: real-time analysis in single cells. Cell 116, 683-98. 
Khalil, A., Grant, J. L., Caddle, L. B., Atzema, E., Mills, K. D. and Arneodo, A. (2007). 
Chromosome territories have a highly nonspherical morphology and nonrandom positioning. 
Chromosome Res 15, 899-916. 
Kim, S. H., McQueen, P. G., Lichtman, M. K., Shevach, E. M., Parada, L. A. and Misteli, T. 
(2004). Spatial genome organization during T-cell differentiation. Cytogenet Genome Res 105, 
292-301. 
Kornberg, R. D. (1974). Chromatin structure: a repeating unit of histones and DNA. Science 
184, 868-71. 
Kosak, S. T. and Groudine, M. (2004). Gene order and dynamic domains. Science 306, 644-7. 
Kouzarides, T. (2002). Histone methylation in transcriptional control. Curr Opin Genet Dev 12, 
198-209. 
Kumaran, R. I., Thakar, R. and Spector, D. L. (2008). Chromatin dynamics and gene 
positioning. Cell 132, 929-34. 
Kupper, K., Kolbl, A., Biener, D., Dittrich, S., von Hase, J., Thormeyer, T., Fiegler, H., 
Carter, N. P., Speicher, M. R., Cremer, T. et al. (2007). Radial chromatin positioning is shaped 
by local gene density, not by gene expression. Chromosoma 116, 285-306. 
Kuroda, M., Tanabe, H., Yoshida, K., Oikawa, K., Saito, A., Kiyuna, T., Mizusawa, H. and 
Mukai, K. (2004). Alteration of chromosome positioning during adipocyte differentiation. J Cell 
Sci 117, 5897-903. 
Kurz, A., Lampel, S., Nickolenko, J. E., Bradl, J., Benner, A., Zirbel, R. M., Cremer, T. and 
Lichter, P. (1996). Active and inactive genes localize preferentially in the periphery of 
chromosome territories. J Cell Biol 135, 1195-205. 
Lanctot, C., Cheutin, T., Cremer, M., Cavalli, G. and Cremer, T. (2007). Dynamic genome 
architecture in the nuclear space: regulation of gene expression in three dimensions. Nat Rev 
Genet 8, 104-15. 
Lee, T. I., Rinaldi, N. J., Robert, F., Odom, D. T., Bar-Joseph, Z., Gerber, G. K., Hannett, N. 
M., Harbison, C. T., Thompson, C. M., Simon, I. et al. (2002). Transcriptional regulatory 
networks in Saccharomyces cerevisiae. Science 298, 799-804. 
Lee, T. I. and Young, R. A. (2000). Transcription of eukaryotic protein-coding genes. Annu Rev 
Genet 34, 77-137. 
Loizou, J. I., Murr, R., Finkbeiner, M. G., Sawan, C., Wang, Z. Q. and Herceg, Z. (2006). 
Epigenetic information in chromatin: the code of entry for DNA repair. Cell Cycle 5, 696-701. 
Lomvardas, S., Barnea, G., Pisapia, D. J., Mendelsohn, M., Kirkland, J. and Axel, R. 
(2006). Interchromosomal interactions and olfactory receptor choice. Cell 126, 403-13. 
Luger, K., Mader, A. W., Richmond, R. K., Sargent, D. F. and Richmond, T. J. (1997). 
Crystal structure of the nucleosome core particle at 2.8 Å resolution. Nature 389, 251-60. 
Manuelidis, L. (1985). Individual interphase chromosome domains revealed by in situ 
hybridization. Hum Genet 71, 288-93. 
Marenduzzo, D., Faro-Trindade, I. and Cook, P. R. (2007). What are the molecular ties that 
maintain genomic loops? Trends Genet 23, 126-33. 
Meaburn, K. J. and Misteli, T. (2007). Cell biology: chromosome territories. Nature 445, 379-
781. 
Mendjan, S., Taipale, M., Kind, J., Holz, H., Gebhardt, P., Schelder, M., Vermeulen, M., 
Buscaino, A., Duncan, K., Mueller, J. et al. (2006). Nuclear pore components are involved in 
the transcriptional regulation of dosage compensation in Drosophila. Mol Cell 21, 811-23. 
Millar, C. B. and Grunstein, M. (2006). Genome-wide patterns of histone modifications in 
yeast. Nat Rev Mol Cell Biol 7, 657-66. 
Misteli, T. (2007). Beyond the sequence: cellular organization of genome function. Cell 128, 
787-800. 
Mora, L., Sanchez, I., Garcia, M. and Ponsa, M. (2006). Chromosome territory positioning of 
conserved homologous chromosomes in different primate species. Chromosoma 115, 367-75. 
Muller, W. G., Rieder, D., Karpova, T. S., John, S., Trajanoski, Z. and McNally, J. G. (2007). 
Organization of chromatin and histone modifications at a transcription site. J Cell Biol 177, 957-
67. 
Narlikar, G. J., Fan, H. Y. and Kingston, R. E. (2002). Cooperation between complexes that 
regulate chromatin structure and transcription. Cell 108, 475-87. 
Neusser, M., Schubel, V., Koch, A., Cremer, T. and Muller, S. (2007). Evolutionarily 
conserved, cell type and species-specific higher order chromatin arrangements in interphase 
nuclei of primates. Chromosoma 116, 307-20. 
Nightingale, K. P., O'Neill, L. P. and Turner, B. M. (2006). Histone modifications: signalling 
receptors and potential elements of a heritable epigenetic code. Curr Opin Genet Dev 16, 125-
36. 
O'Sullivan, J. M., Tan-Wong, S. M., Morillon, A., Lee, B., Coles, J., Mellor, J. and 
Proudfoot, N. J. (2004). Gene loops juxtapose promoters and terminators in yeast. Nat Genet 
36, 1014-8. 
Olins, A. L. and Olins, D. E. (1974). Spheroid chromatin units (ν bodies). Science 183, 330-2. 
Osborne, C. S., Chakalova, L., Brown, K. E., Carter, D., Horton, A., Debrand, E., 
Goyenechea, B., Mitchell, J. A., Lopes, S., Reik, W. et al. (2004). Active genes dynamically 
colocalize to shared sites of ongoing transcription. Nat Genet 36, 1065-71. 
Osborne, C. S., Chakalova, L., Mitchell, J. A., Horton, A., Wood, A. L., Bolland, D. J., 
Corcoran, A. E. and Fraser, P. (2007). Myc dynamically and preferentially relocates to a 
transcription factory occupied by Igh. PLoS Biol 5, e192. 
Parada, L. A., McQueen, P. G. and Misteli, T. (2004). Tissue-specific spatial organization of 
genomes. Genome Biol 5, R44. 
Parelho, V., Hadjur, S., Spivakov, M., Leleu, M., Sauer, S., Gregson, H. C., Jarmuz, A., 
Canzonetta, C., Webster, Z., Nesterova, T. et al. (2008). Cohesins functionally associate with 
CTCF on mammalian chromosome arms. Cell 132, 422-33. 
Pickersgill, H., Kalverda, B., de Wit, E., Talhout, W., Fornerod, M. and van Steensel, B. 
(2006). Characterization of the Drosophila melanogaster genome at the nuclear lamina. Nat 
Genet 38, 1005-14. 
Pombo, A. and Branco, M. R. (2007). Functional organisation of the genome during 
interphase. Curr Opin Genet Dev 17, 451-5. 
Pombo, A., Jackson, D. A., Hollinshead, M., Wang, Z., Roeder, R. G. and Cook, P. R. 
(1999). Regional specialization in human nuclei: visualization of discrete sites of transcription by 
RNA polymerase III. Embo J 18, 2241-53. 
Pombo, A., Jones, E., Iborra, F. J., Kimura, H., Sugaya, K., Cook, P. R. and Jackson, D. A. 
(2000). Specialized transcription factories within mammalian nuclei. Crit Rev Eukaryot Gene 
Expr 10, 21-9. 
Razin, S. V., Iarovaia, O. V., Sjakste, N., Sjakste, T., Bagdoniene, L., Rynditch, A. V., 
Eivazova, E. R., Lipinski, M. and Vassetzky, Y. S. (2007). Chromatin domains and regulation 
of transcription. J Mol Biol 369, 597-607. 
Reddy, K. L., Zullo, J. M., Bertolino, E. and Singh, H. (2008). Transcriptional repression 
mediated by repositioning of genes to the nuclear lamina. Nature 452, 243-7. 
Schneider, R. and Grosschedl, R. (2007). Dynamics and interplay of nuclear architecture, 
genome organization, and gene expression. Genes Dev 21, 3027-43. 
Simonis, M., Klous, P., Splinter, E., Moshkin, Y., Willemsen, R., de Wit, E., van Steensel, 
B. and de Laat, W. (2006). Nuclear organization of active and inactive chromatin domains 
uncovered by chromosome conformation capture-on-chip (4C). Nat Genet 38, 1348-54. 
Simonis, M., Kooren, J. and de Laat, W. (2007). An evaluation of 3C-based methods to 
capture DNA interactions. Nat Methods 4, 895-901. 
Spellman, P. T. and Rubin, G. M. (2002). Evidence for large domains of similarly expressed 
genes in the Drosophila genome. J Biol 1, 5. 
Spilianakis, C. G., Lalioti, M. D., Town, T., Lee, G. R. and Flavell, R. A. (2005). 
Interchromosomal associations between alternatively expressed loci. Nature 435, 637-45. 
Svetlov, V. V. and Cooper, T. G. (1995). Review: compilation and characteristics of dedicated 
transcription factors in Saccharomyces cerevisiae. Yeast 11, 1439-84. 
Taddei, A., Hediger, F., Neumann, F. R. and Gasser, S. M. (2004). The function of nuclear 
architecture: a genetic approach. Annu Rev Genet 38, 305-45. 
Taddei, A., Van Houwe, G., Hediger, F., Kalck, V., Cubizolles, F., Schober, H. and Gasser, 
S. M. (2006). Nuclear pore association confers optimal expression levels for an inducible yeast 
gene. Nature 441, 774-8. 
Turner, B. M. (1993). Decoding the nucleosome. Cell 75, 5-8. 
Turner, B. M. (2007). Defining an epigenetic code. Nat Cell Biol 9, 2-6. 
Vakoc, C. R., Letting, D. L., Gheldof, N., Sawado, T., Bender, M. A., Groudine, M., Weiss, 
M. J., Dekker, J. and Blobel, G. A. (2005). Proximity among distant regulatory elements at the 
beta-globin locus requires GATA-1 and FOG-1. Mol Cell 17, 453-62. 
van Driel, R., Fransz, P. F. and Verschure, P. J. (2003). The eukaryotic genome: a system 
regulated at different hierarchical levels. J Cell Sci 116, 4067-75. 
Xu, M. and Cook, P. R. (2008). Similar active genes cluster in specialized transcription 
factories. J Cell Biol 181, 615-23. 
Xu, N., Donohoe, M. E., Silva, S. S. and Lee, J. T. (2007). Evidence that homologous X-
chromosome pairing requires transcription and Ctcf protein. Nat Genet 39, 1390-6. 
Zinner, R., Albiez, H., Walter, J., Peters, A. H., Cremer, T. and Cremer, M. (2006). Histone 
lysine methylation patterns in human cell types are arranged in distinct three-dimensional 
nuclear zones. Histochem Cell Biol 125, 3-19. 
Zorn, C., Cremer, C., Cremer, T. and Zimmer, J. (1979). Unscheduled DNA synthesis after 
partial UV irradiation of the cell nucleus. Distribution in interphase and metaphase. Exp Cell Res 
124, 111-9. 
 
 
Functional landscape of E. coli proteins                                                                                                                   4-1 
 
 
 
 
 
 
 
4 Uncovering the functional architecture of 
uncharacterized proteins in E. coli 
 
 
 
CONTENTS OF CHAPTER 4 
 
OUTLINE ..... 4-3
CONTRIBUTION TO THE WORK IN THIS CHAPTER ..... 4-4
4.1 INTRODUCTION ..... 4-5
4.2 RESULTS ..... 4-6
4.2.1 OVERVIEW OF NETWORK-BASED FUNCTION PREDICTION ..... 4-6
4.2.1.1 METHODS AND DATABASES FOR CONSTRUCTING FUNCTIONAL ASSOCIATION NETWORKS ..... 4-9
4.2.1.2 COMPUTATIONAL METHODS FOR PREDICTING FUNCTION FROM NETWORK CONTEXT ..... 4-12
4.2.2 UNCOVERING THE CELLULAR ROLES OF FUNCTIONAL ORPHANS IN E. COLI ..... 4-14
4.2.2.1 THE EXTENT OF EXISTING FUNCTIONAL ANNOTATION FOR E. COLI PROTEINS ..... 4-16
4.2.2.2 PROPERTIES OF THE FUNCTIONAL ORPHANS OF E. COLI ..... 4-17
4.2.2.3 A SYSTEMATIC APPROACH TO ELUCIDATE BIOLOGICAL FUNCTION ..... 4-18
4.2.2.4 EXPERIMENTAL DEFINITION OF THE PHYSICAL INTERACTION NETWORK OF THE SOLUBLE PROTEOME ..... 4-19
4.2.2.5 ORPHAN MEMBERSHIP WITHIN MULTIPLE PROTEIN COMPLEXES ..... 4-21
4.2.2.6 FUNCTIONAL INTERACTIONS PREDICTED BY GENOMIC-CONTEXT METHODS ..... 4-24
4.2.2.7 DEFINING THE PARTICIPATION OF ORPHANS AS THE COMPONENTS OF FUNCTIONAL MODULES ..... 4-27
4.2.2.8 IMPROVED FUNCTIONAL INFERENCE WITHIN AN INTEGRATED NETWORK FRAMEWORK ..... 4-28
4.2.2.9 FUNCTIONAL NEIGHBORHOODS ..... 4-30
4.3 DISCUSSION & CONCLUSION ..... 4-32
4.4 MATERIALS AND METHODS ..... 4-35
4.4.1 PI NETWORK GENERATION ..... 4-35
4.4.2 GC NETWORK GENERATION ..... 4-36
4.4.3 CLUSTERING ..... 4-37
4.4.4 NETWORK-BASED FUNCTION PREDICTION AND BENCHMARKING ..... 4-37
REFERENCES ..... 4-37
 
 
OUTLINE 
Determining the functions of proteins encoded by genome sequences represents a major 
challenge in modern biology. Whole-genome sequencing projects are a major source of proteins 
of unknown function. Annotation of a genome involves assignment of functions to gene 
products, in most cases on the basis of amino-acid sequence alone. Structure-based 
identification of homologues often succeeds where sequence-only methods fail, because 
folding patterns are conserved long after sequence similarity becomes undetectable. 
Nevertheless, predicting protein function from sequence and structure remains a difficult 
problem, because homologous proteins often have different functions and these traditional 
approaches are beginning to reach their limits. As a result, alternative computational 
methods for inferring protein function, such as those that exploit the context of a protein in 
protein association networks, are increasingly sought after. These methods, often referred to as 
network-based functional inference techniques, provide a first approximation of a protein’s 
functional role and complement traditional methods in understanding the function of 
uncharacterized proteins. Most recent network-based approaches aim to integrate diverse kinds 
of functional interactions, because integration boosts both the coverage and the confidence of an 
association, thereby improving the assessment of protein function. In a recent study we 
attempted to characterize one-third of the 4,225 protein-coding genes of Escherichia coli K-12 
which remain functionally unannotated (functional orphans). In particular, to elucidate their 
biological roles, we performed an extensive proteomic survey using affinity-tagged E. coli strains 
and generated comprehensive genomic context inferences to derive a high-confidence 
compendium for virtually the entire proteome consisting of 5,993 putative physical interactions 
and 74,776 putative functional associations, most of which are novel. Clustering of the 
respective probabilistic networks revealed putative orphan membership in discrete multiprotein 
complexes and functional modules, while a machine-learning strategy based on network 
integration implicated the orphans in specific biological processes. In the second half of this 
chapter, I highlight this resource which provides a ‘systems-wide’ functional blueprint of a model 
microbe, with insights into the biological and evolutionary significance of previously 
uncharacterized proteins. Given the volume of high-throughput data now being reported for 
diverse model systems, the time is ripe to employ these network-based 
approaches, which can be applied at the whole-organism level to unravel the functions of the 
growing number of proteins accumulating in genomic databases. 
CONTRIBUTION TO THE WORK IN THIS CHAPTER 
Please note that the work presented in this chapter is drawn from the following two publications. 
Uncovering the functional roles of previously uncharacterized E. coli proteins is an ongoing 
collaborative project with the groups of Dr. Andrew Emili at the University of Toronto and Dr. Gabriel 
Moreno-Hagelsieb at Wilfrid Laurier University, Canada. My contribution to this high-throughput 
study included, but was not limited to, developing novel computational frameworks for 
understanding genome-context functional associations, analyzing raw protein-protein interaction 
data generated by Dr. Emili’s group, integrating data using computational approaches and 
inferring function from the resulting networks. Please note that some of the work on functional 
associations has resulted in other publications in the past; these studies are cited in the 
appendix but not discussed here. 
 
1) Network-based function prediction in post-genomic era : Metabolic enzymes as a case study  
Sarath Chandra Janga and Gabriel Moreno-Hagelsieb 
Metabolic Engineering (Submitted) 
2) Global functional atlas of Escherichia coli encompassing previously uncharacterized proteins 
Pingzhao Hu†, Sarath Chandra Janga†, Mohan Babu†,  J. Javier Díaz-Mejía†, Gareth Butland†, 
Yang W, Pogoutse O, Guo X, Phanse S, Wong P, Chandran S, Christopoulos C, Nazarians-
Armavil A, Nasseri NK, Musso G, Ali M, Nazemof N, Eroukova V, Golshani A, Paccanaro A, 
Greenblatt JF, Moreno-Hagelsieb G, Emili A 
PLoS Biology 2009, 7(4): e96 
4.1 INTRODUCTION 
Determining the functions of proteins encoded by genome sequences represents a major 
challenge in modern biology. As of March 23, 2010, the TrEMBL database contained 
10,618,387 sequences (http://www.ebi.ac.uk/uniprot/TrEMBLstats/). The GOLD database 
(http://www.genomesonline.org) reports more than 1000 published genomes, with over 3700 
genome projects underway; the database also reports more than 100 metagenome projects, 
with Venter’s marine microbial communities project alone contributing more than 6,000,000 
proteins to the growing protein repertoire. Although the pace at which 
sequencing technologies generate genome sequence data is increasing, our 
ability to unravel the functional roles of the proteins encoded in these genomes has been rather 
limited.  
Historically, proteins identified from genome sequencing projects were annotated mostly 
with the aid of BLAST (Altschul et al., 1997) or other sequence comparison tools, followed by 
manual intervention (Gotoh, 1999; Pearson, 1995; Procter et al.). A principal reason 
researchers BLAST protein sequences against databases is to learn about some aspect of 
their function. The researcher looks for significant sequence 
similarity to another protein that is already in the database and whose function has been 
experimentally characterized. This is the most widely used form of computational 
function prediction and is commonly referred to as annotation transfer by sequence similarity or 
simply homology-based transfer. The rationale behind homology-based annotation transfer is 
that, if two sequences have a high degree of similarity, then they have evolved from a common 
ancestor and have similar, if not identical, functions. This might appear obvious; 
however, with the increasing number of sequences, as well as the duplications observed in 
different lineages, the power of homology-based annotation transfer is being challenged. Adding 
to this is the problem of annotation errors, even in human-curated databases, which 
propagate when homology-based approaches are used. In addition, most of 
the newly identified proteins do not show high sequence similarity to an already 
characterized protein, leading to the failure, or rather saturation, of homology-based 
approaches. Together, these factors have made it evident that traditional, manually curated 
annotation cannot keep pace with the exponential increase in the number of proteins, and they 
have been responsible for the growing number of 
automated function inference approaches in recent years (see Table 4-1) (Godzik et al., 
2007; Han et al., 2006; Rentzsch and Orengo, 2009; Zhao et al., 2008a). These automated 
function inference methods are based on a number of features, starting from nucleotide or 
amino acid sequence, sequence patterns/profiles and protein structure patterns to chromosomal 
location, phylogenetic information, expression profiles, molecular interaction data, functional 
associations and gene co-evolution and are summarized in Table 4-1.   
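
The homology-based annotation transfer described above can be illustrated with a minimal sketch. It assumes that sequence-search hits have already been parsed into simple records; the `Hit` record, the `transfer_annotation` function and the E-value and identity cut-offs are all hypothetical choices made for illustration and are not taken from BLAST or from any annotation pipeline.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Hit:
    """One database hit for a query protein (hypothetical record layout)."""
    subject_id: str
    evalue: float
    percent_identity: float
    annotation: Optional[str]  # None when the subject itself is uncharacterized


def transfer_annotation(
    hits: List[Hit],
    max_evalue: float = 1e-10,    # illustrative cut-off, an assumption
    min_identity: float = 40.0,   # illustrative cut-off, an assumption
) -> Optional[Tuple[str, str]]:
    """Return (annotation, source subject) from the most significant
    characterized hit that passes both cut-offs, or None if there is none."""
    candidates = [
        h for h in hits
        if h.annotation is not None
        and h.evalue <= max_evalue
        and h.percent_identity >= min_identity
    ]
    if not candidates:
        return None
    # Lowest E-value wins; higher identity breaks ties.
    best = min(candidates, key=lambda h: (h.evalue, -h.percent_identity))
    return best.annotation, best.subject_id


# Toy hits for a single query protein (made-up identifiers and values).
hits = [
    Hit("P1", 1e-50, 72.0, None),  # strong hit, but itself unannotated
    Hit("P2", 1e-30, 55.0, "serine/threonine protein kinase"),
    Hit("P3", 1e-3, 25.0, "ABC transporter"),  # too weak to trust
]

print(transfer_annotation(hits))
```

The sketch also shows where the approach saturates: when the only significant hits are themselves uncharacterized, or when no hit passes the cut-offs, nothing can be transferred, and any error in the source annotation is copied along with it.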
4.2 RESULTS 
4.2.1 Overview of network-based function prediction 
The very definition of biological function is ambiguous, with its exact meaning depending on the 
context in which it is used and the classification on which it is based (Rison et al., 2000; Whisstock and 
Lesk, 2003). In the post-genomic era it is clear that biological function has many aspects. 
Consider a protein kinase: in the biochemical context it can simply be 
defined as an enzyme or, more precisely, its function can be described as the phosphorylation of the 
hydroxyl group of a specific substrate. While the former gives a very coarse annotation of the 
protein under study, the latter gives finer details about its function. A totally different way to 
understand the role of a protein within the cell is to ask where exactly it occurs in the cell. This 
aspect is equally important, as proteins 
can potentially occur in a number of sub-cellular localizations. In this particular case, kinases 
can be found in either the cytoplasm or the nucleus, and this information is crucial for understanding their 
roles and interactions with other proteins within the cellular environment. Likewise, a mutation in 
the kinase can be associated with a disease phenotype. Therefore, it is becoming increasingly 
clear that when speaking of a protein’s function, we must always specify the aspect or aspects 
of the functional description. In particular, when setting out to develop a function prediction tool 
we must keep in mind which functional aspect or aspects we are trying to predict and use the 
appropriate vocabulary. 
Once functional aspects of a protein are defined, the question is how function can be 
interpreted in computational terms. For instance, protein sequences for a long time have been 
represented as character strings that enable their use for many computational tasks including 
pairwise comparisons and multiple sequence alignments, motif searching, database searching 
and several other tasks aimed at extracting biological information from the sequence. In fact, 
our ability to express protein sequence information as a character string amenable to 
computational processing, together with the availability of algorithms that can exploit this 
information for meaningful interpretation, has changed our view and understanding of cellular  
Table 4-1. Resources currently available for protein function prediction, grouped according to the 
predominant method or approach implemented in them. Note that the list may be incomplete, as 
resources not directly relevant to the methods discussed here may have been omitted. 
 
Approach                      Resource        Webpage
Sequence similarity based     GOtcha          http://www.compbio.dundee.ac.uk/gotcha/gotcha.php
                              PFP             http://dragon.bio.purdue.edu/pfp/
                              GOsling         https://www.sapac.edu.au/gosling/
                              OntoBlast       http://functionalgenomics.de/ontogate/
                              GOblet          http://goblet.molgen.mpg.de
                              Blast2GO        http://www.blast2go.de
Phylogenomics based           SIFTER          http://sifter.berkeley.edu
                              AFAWE           http://bioinfo.mpiz-koeln.mpg.de/afawe/
                              RIO             http://www.rio.wustl.edu/
                              OrthoStrapper   http://www.cgb.ki.se/OrthoGUI
Domain/pattern/profile based  InterProScan    http://www.ebi.ac.uk/tools/interproscan/
                              Pfam            http://pfam.sanger.ac.uk
                              SUPERFAMILY     http://supfam.cs.bris.ac.uk/superfamily/
                              PROSITE         http://www.expasy.ch/prosite/
                              PRINTS          http://www.bioinf.manchester.ac.uk/dbbrowser/PRINTS/
                              SMART           http://smart.embl-heidelberg.de/
                              Gene3D          http://gene3d.biochem.ucl.ac.uk/gene3d/
                              PANTHER         http://www.pantherdb.org/
                              TIGRFAMs        http://www.tigr.org/TIGRFAMs/
                              SCOP            http://scop.mrc-lmb.cam.ac.uk/scop/
                              CATH            http://www.cathdb.info/
                              CatFam          http://www.bhsai.org/downloads/catfam.tar.gz
Sequence clustering based     ProtoNet        http://www.protonet.cs.huji.ac.il/
                              CluSTr          http://www.ebi.ac.uk/clustr/
                              eggNOG          http://eggnog.embl.de
                              COGs            http://www.ncbi.nlm.nih.gov/COG/
                              InParanoid      http://inparanoid.sbc.su.se/cgi-bin/index.cgi
                              MultiParanoid   http://multiparanoid.sbe.su.se/index.html
                              OrthoMCL        http://www.orthomcl.org/cgi-bin/OrthoMclWeb.cgi
Machine learning based        ProtFun         http://www.cbs.dtu.dk/services/ProtFun/
                              GOPET           http://genius.embnet.dkfz-heidelberg.de/menu/biounit/open-husar
                              SVM-Prot        http://jing.cz3.nus.edu.sg/cgi-bin/svmprot.cgi
                              ffPred          http://bioinf.cs.ucl.ac.uk/ffpred/
                              EzyPred         http://www.csbio.sjtu.edu.cn/bioinf/EzyPred/
Network based                 MCODE           http://baderlab.org/Software/MCODE
                              MCL             http://www.micans.org/mcl/
                              SAMBA           http://acgt.cs.tau.ac.il/samba/
                              PRODISTIN       http://crfb.univ-mrs.fr/webdistin/
                              Cytoscape       http://www.cytoscape.org/
                              STRING          http://string.embl.de/
                              VisANT          http://visant.bu.edu/
                              VIRGO           http://whipple.cs.vt.edu/virgo/welcome.cgi
 
entities. However, in contrast to sequence information, the annotation of a protein has until recently 
been written in human language, conveying the complex descriptions and intricacies of its 
function, as well as the experimental evidence supporting it, in custom non-standard 
formats that vary between groups. As a result, vocabulary was invented and 
re-invented, with many terms being synonymous. This synonymy not only causes confusion among 
human curators (re)annotating proteins but also increases the chance of additional 
errors. Over the years, a 
need to convey this information in a more controlled and well-defined fashion has therefore emerged, 
especially so that annotations can be processed automatically. Biochemists were among the 
first to appreciate this need, devising the Enzyme 
Commission (EC) classification (Tipton, 1994). EC classifies enzyme-catalyzed reactions in a four-level 
hierarchy denoted by a four-position identifier, going from the most general in the first 
position to the most specific function of the enzyme in the last position. This classification 
addresses the need not only for a controlled vocabulary but also for well-defined relationships 
between terms, thereby allowing annotations to be compared. While enzymes form one 
of the most commonly occurring protein classes in the cell, they are not the only kind, 
so these definitions are not sufficient for annotating the functions of all the proteins in a cell. 
Therefore, following this classification, Monica Riley and colleagues in 1993 came up with the 
Riley or MultiFun classification system for E. coli (Riley, 1993; Serres and Riley, 2000). Other 
annotation systems followed, including Clusters of Orthologous 
Groups (COG) (Tatusov et al., 1997), based on manual annotation of groups of orthologous 
proteins with hierarchically organized functional descriptions; Swiss-Prot annotations, based on 
human curation of well-annotated proteins (Apweiler, 2001; Kretschmann et al., 2001); 
and, more recently, the Gene Ontology (GO) (Ashburner et al., 2000). The common theme among 
these schemes is the establishment of a controlled vocabulary and in many cases a 
categorization that proceeds from the general to the specific. GO currently 
serves as the dominant cross-species approach for machine-readable functional annotation and 
covers three major aspects of gene products’ function, namely molecular function, biological 
process and cellular component. Each ontology is implemented as a directed acyclic graph 
(DAG) where terms are represented as nodes in the graph and are arranged from the general to 
the specific. The DAG arrangement means that each node may have more than a single parent, 
which enables the description of functions that are associated with more than one biological 
activity or process. By standardizing an annotation and defining the relationships between terms 
using a graph, annotations can be computationally processed. For instance, given a GO-annotated 
genome, a researcher can computationally identify the set of all genes with a given 
annotation; likewise, one can predict functional labels for proteins using such a controlled 
vocabulary. Naturally, such standardized annotations also limit the level of 
detail with which an annotation can be expressed. 
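
To make the DAG representation concrete, the following minimal sketch stores a toy ontology as a child-to-parents mapping and retrieves every gene annotated to a term, either directly or through a more specific descendant term. The term relationships, gene names and function names here are invented for illustration only; they do not reproduce the real GO graph or any GO software interface.

```python
from typing import Dict, Set

# Toy ontology: each term maps to its set of parents (assumed relationships,
# NOT the real GO graph). One term has two parents to show the DAG property.
PARENTS: Dict[str, Set[str]] = {
    "molecular_function": set(),
    "catalytic activity": {"molecular_function"},
    "binding": {"molecular_function"},
    "kinase activity": {"catalytic activity"},
    "ATP binding": {"binding"},
    "protein kinase activity": {"kinase activity", "ATP binding"},
}

# Hypothetical direct annotations of three genes.
ANNOTATIONS: Dict[str, Set[str]] = {
    "geneA": {"protein kinase activity"},
    "geneB": {"ATP binding"},
    "geneC": {"catalytic activity"},
}


def ancestors(term: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """Collect all terms reachable by following parent links from `term`."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents[current]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def genes_annotated_to(
    term: str,
    annotations: Dict[str, Set[str]],
    parents: Dict[str, Set[str]],
) -> Set[str]:
    """Genes annotated to `term` itself or to any of its descendants."""
    result: Set[str] = set()
    for gene, terms in annotations.items():
        if any(t == term or term in ancestors(t, parents) for t in terms):
            result.add(gene)
    return result


print(sorted(genes_annotated_to("binding", ANNOTATIONS, PARENTS)))
print(sorted(genes_annotated_to("catalytic activity", ANNOTATIONS, PARENTS)))
```

Because the toy term with two parents is reachable from both branches, the gene annotated to it is returned for queries on either branch, which is the practical consequence of allowing more than one parent per node.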
Having defined function and the means of describing function, one can start discussing 
function prediction. In particular, function prediction using network-based approaches, which is 
the topic of this chapter, essentially requires two components: a) a network of functional 
associations amenable to graph-theoretic analysis, and b) a network-based function 
prediction algorithm for predicting functional labels for uncharacterized genes in the graph. In 
what follows, I will first discuss different approaches for constructing and integrating functional 
association networks and then outline currently available computational methods for inferring 
function based on them.      
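
The two components named above can be sketched in a few lines. The example below pairs an undirected association network with a simple neighbor majority vote, used here only as a stand-in for a network-based prediction algorithm; it is not the method applied later in this chapter. The protein names, labels and function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Set, Tuple


def build_graph(edges: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Component (a): an undirected functional association network."""
    graph: Dict[str, Set[str]] = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def predict_by_neighbors(
    graph: Dict[str, Set[str]],
    labels: Dict[str, str],
    protein: str,
) -> Optional[str]:
    """Component (b): assign the most common label among annotated neighbors.

    Returns None when no neighbor carries a label. Ties are broken
    alphabetically so that the result is deterministic.
    """
    votes = Counter(labels[n] for n in graph.get(protein, ()) if n in labels)
    if not votes:
        return None
    label, _count = min(votes.items(), key=lambda kv: (-kv[1], kv[0]))
    return label


# Toy network: two uncharacterized proteins and three annotated ones.
edges = [
    ("orphan1", "p1"),
    ("orphan1", "p2"),
    ("orphan1", "p3"),
    ("p1", "p2"),
    ("orphan2", "orphan1"),
]
labels = {"p1": "cell division", "p2": "cell division", "p3": "transport"}

graph = build_graph(edges)
print(predict_by_neighbors(graph, labels, "orphan1"))
print(predict_by_neighbors(graph, labels, "orphan2"))
```

In this toy network the second uncharacterized protein receives no prediction because its only neighbor is itself unannotated, which illustrates why denser and better-annotated networks allow more proteins to be labeled.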
4.2.1.1 Methods and databases for constructing functional association 
networks 
Traditionally, the function of a protein was defined using low-throughput approaches such as 
mutagenesis of residues or whole proteins, which allowed the identification of phenotypes for 
follow-up analysis. However, it is becoming increasingly clear that this rationale is limited in its 
ability to infer the function of proteins: it fails for those that exhibit a mild phenotype or 
are not expressed under standard experimental conditions. In addition, since most 
proteins associate dynamically with a number of other cellular entities during their lifetime, the 
traditional notion of identifying the function of a protein by isolating it from the rest of the cellular 
machinery can be misleading for the majority of proteins. This realization, together with the availability of 
experimentally determined protein-protein interaction maps for diverse model organisms, has 
given rise to the use of these datasets for delineating the biological processes, pathways and 
complexes that proteins take part in (Aranda et al., ; Bader et al., 2003; Breitkreutz et al., 2008). 
There is now observable overlap and informative variation between different types of 
low- and high-throughput experiments (Shoemaker and Panchenko, 2007a), which provides a 
convincing reason for exploiting them as complementary approaches in unraveling the functions 
of proteins. Indeed, recent years have seen an explosion in the number of methods and 
databases that provide functional associations (both direct physical and indirect contextual 
interactions) between proteins using both experimental and computational means (Table 4-2). 
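
As one illustration of how such an association network can be derived computationally, the sketch below builds a co-expression network, one of the approaches summarized in Table 4-2, by linking genes whose expression profiles correlate above a threshold. The expression values, gene names and the 0.9 cut-off are assumptions made for this example; a real analysis would assess statistical significance on far larger datasets.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length profiles (0.0 if either is flat)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def coexpression_edges(
    profiles: Dict[str, Sequence[float]],
    threshold: float = 0.9,  # illustrative cut-off, an assumption
) -> List[Tuple[str, str, float]]:
    """Link every gene pair whose correlation is at least `threshold`."""
    edges = []
    for g1, g2 in combinations(sorted(profiles), 2):
        r = pearson(profiles[g1], profiles[g2])
        if r >= threshold:
            edges.append((g1, g2, r))
    return edges


# Made-up expression profiles across four conditions.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.5],  # rises together with geneA
    "geneC": [4.0, 3.0, 2.0, 1.0],  # anti-correlated with geneA
}

for g1, g2, r in coexpression_edges(profiles):
    print(g1, g2, round(r, 3))
```

Only positively correlated pairs are linked here; whether strong negative correlation should also count as a functional link is a design decision that this sketch leaves aside.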
 
Table 4-2. Different approaches for generating functional linkage maps or networks. Typically, these 
networks, either independently or in integrated form, serve as the input for network-based functional 
inference algorithms.  
 
Approach: Protein-protein interactions 
Description: Physical interactions between proteins identified either by mass spectrometry or 
one of the hybrid approaches are used to generate protein interaction maps on a large-scale 
which are used as input for function prediction algorithms (Shoemaker and Panchenko, 2007a). 
Data sources: 
HPRD (http://www.hprd.org) 
IntAct (http://www.ebi.ac.uk/intact/site/index.jsf) 
MINT (http://cbm.bio.uniroma2.it/mint/index.html) 
BioGRID (http://www.thebiogrid.org) 
DIP (http://dip.doe-mbi.ucla.edu/dip/Main.cgi) 
MPPI (http://mips.gsf.de/proj/ppi) 
 
Approach: Co-expression networks 
Description: In these approaches gene co-expression above a significant correlation threshold 
is considered as a presence of a functional linkage between genes. Genome-wide inspection of 
these gene co-expression networks provides an intuitive way to represent complex co-expression 
patterns between many genes providing functional insights into uncharacterized processes 
(Aoki et al., 2007; Huber et al., 2007). 
Data sources: 
GEO (http://www.ncbi.nlm.nih.gov/geo) 
SMD (http://genome-www5.stanford.edu) 
ArrayExpress (http://www.ebi.ac.uk/arrayexpress) 
caArray (http://caarraydb.nci.nih.gov/caarray) 
M3D (http://m3d.bu.edu/) 
 
Approach: Genetic interaction networks 
Description: (Lasko, 2000) In these approaches interactions between genes are constructed by 
linking gene pairs which show significantly reduced fitness when both the genes are knocked out 
compared to when each gene is knocked out independently. These lethality assays are carried 
out on a high-throughput scale to construct genome-scale interactions (Butland et al., 2008; 
Costanzo et al.). 
Data sources: 
BioGRID (http://www.thebiogrid.org) 
DRYGIN (http://drygin.ccbr.utoronto.ca) 
IM Browser (http://proteome.wayne.edu/PIMdb.html) 
 
Genome 
context 
networks 
These approaches include the gene 
fusion, gene cluster or gene order 
conservation, phylogenetic profile and 
operon rearrangement methods 
(Dandekar et al., 1998; Enright et al., 
1999; Janga et al., 2005; Pellegrini et al., 
1999). See text for further discussion. 
STRING (http://string.embl.de) 
ProLinks (http://prolinks.mbi.ucla.edu/) 
VisANT (http://visant.bu.edu) 
Integration of 
data sources 
These approaches integrate different 
kinds of functional association data using 
machine learning techniques and then 
construct high-confidence functional 
linkage networks which are then used for 
function prediction (Hu et al., 2009; 
Linghu et al., 2008; Marcotte et al., 
1999b; Zhao et al., 2008b). 
STRING (http://string.embl.de) 
ProLinks (http://prolinks.mbi.ucla.edu/) 
VisANT (http://visant.bu.edu) 
Virgo (http://whipple.cs.vt.edu:8080/virgo) 
 
To summarize, experimental approaches employed for constructing functional 
association networks mostly comprise data from protein-protein interaction screens, followed 
by co-expression networks consisting of gene pairs showing significant correlation in their 
expression profiles across conditions, derived from microarray datasets (Luo et al., 2007; Ruan 
et al., ; Wang et al., 2009). More recently, genetic interactions, which measure the fitness defects of 
double mutants compared to those of the individual mutants, are also being employed for 
constructing these functional linkage networks (Butland et al., 2008; Costanzo et al.). These 
high-throughput experimental approaches not only increase the confidence of an association 
but also provide the cellular context of a protein, offering a view complementary to the traditional 
function prediction paradigm. 
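As an illustration of how a co-expression network of the kind described above might be built, the following minimal sketch links gene pairs whose expression profiles exceed a correlation threshold. The gene names, expression values, the use of Pearson correlation and the threshold of 0.8 are all assumptions chosen for illustration; they are not taken from any of the cited studies.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no co-expression signal
    return cov / sqrt(var_x * var_y)


def coexpression_edges(profiles, threshold=0.8):
    """Return {(gene1, gene2): r} for all pairs with correlation >= threshold."""
    edges = {}
    for g1, g2 in combinations(sorted(profiles), 2):
        r = pearson(profiles[g1], profiles[g2])
        if r >= threshold:
            edges[(g1, g2)] = r
    return edges


# Hypothetical expression values across four conditions.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.1, 3.9, 6.2, 8.0],
    "geneC": [4.0, 1.0, 3.5, 0.5],
}
edges = coexpression_edges(profiles)
print(edges)
```

In this toy example only geneA and geneB are linked; in practice the threshold would be chosen from the significance of the correlation over many conditions.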
In addition to the experimental methods, several computational methods have been 
proposed for constructing protein-protein associations from sequence data alone. These include 
the so-called genome context methods, namely gene fusion, gene cluster or gene order 
conservation, operon rearrangements and protein phylogenetic profiles. The gene fusion 
approach tries to detect the fusion of two genes into a single protein coding gene in one of the 
sequenced genomes and thereby links them as a strong functional association (Enright et al., 
1999; Marcotte et al., 1999a). The method of gene order conservation aims to identify pairs of 
genes which consistently show a tendency to cluster in immediate vicinity in a number of 
genomes, suggesting a strong functional link in prokaryotic genomes, which are abundant in 
operons (Dandekar et al., 1998; Overbeek et al., 1999). The method of operon rearrangement 
tries to identify a link between any pair of genes on a genome as long as their orthologs are 
predicted to be organized in an operon with a high confidence in at least one sequenced 
genome (Janga et al., 2005; Rogozin et al., 2002; Snel et al., 2002). The power of this approach 
depends on the predictive quality of operon prediction methods, which have been shown to 
reach ~90% accuracy in most sequenced genomes (Brouwer et al., 2008; Moreno-Hagelsieb 
and Collado-Vides, 2002). Yet another approach, not based on genomic proximity, is that of 
phylogenetic profiles. In this method a presence/absence vector (profile) of a gene across all 
the analyzed genomes is constructed, and profiles are compared to identify the genes which show the most 
correlated profiles, as a measure of functional link. The rationale here is that two proteins 
showing similar profiles, i.e. coordinated in their evolutionary gain and loss, are expected to be 
functionally related (Gaasterland and Ragan, 1998; Pellegrini et al., 1999). Modified versions of 
this approach take into account the phylogenetic signal of the genomes employed and/or the 
redundancy in the genome sequence information (Barker and Pagel, 2005; Date and Marcotte, 
2003; Moreno-Hagelsieb and Janga, 2008). 
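The phylogenetic profile comparison described above can be sketched as follows. This is a minimal example under stated assumptions: the protein names and the eight-genome presence/absence vectors are hypothetical, and mutual information is used here as one possible measure of profile similarity (other correlation measures would serve equally well for illustration).

```python
from collections import Counter
from math import log2


def mutual_information(p, q):
    """Mutual information (in bits) between two presence/absence profiles."""
    n = len(p)
    joint = Counter(zip(p, q))   # counts of (presence in p, presence in q)
    marg_p = Counter(p)
    marg_q = Counter(q)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        mi += p_ab * log2(p_ab / ((marg_p[a] / n) * (marg_q[b] / n)))
    return mi


# Hypothetical profiles: 1 = ortholog present in that genome, 0 = absent.
profiles = {
    "protX": [1, 1, 0, 0, 1, 0, 1, 0],
    "protY": [1, 1, 0, 0, 1, 0, 1, 0],  # gained and lost together with protX
    "protZ": [1, 0, 1, 0, 1, 0, 1, 0],  # only partly coordinated with protX
}
mi_xy = mutual_information(profiles["protX"], profiles["protY"])
mi_xz = mutual_information(profiles["protX"], profiles["protZ"])
print(mi_xy, mi_xz)
```

Pairs with identical, balanced profiles reach the maximum of one bit, whereas partly coordinated profiles score lower; a functional link would be called only above some chosen cutoff.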
 
Recently, the integration of different types of interaction data into genome-wide 
functional linkage maps has gained much popularity for functional inference, as these integrated 
maps boost not only the coverage but also the confidence of an association when assessing protein 
function. One of the first studies to demonstrate the power of integrating different types of 
interaction data was by Marcotte and colleagues, who combined diverse kinds of 
computational genome context inferences (Marcotte et al., 1999b). This was followed by a 
number of other methods such as those implemented in the STRING and PROLINKS 
databases, among other focused studies (Bowers et al., 2004; Hu et al., 2009; Jensen et al., 
2009; Massjouni et al., 2006). Typically, in these networks edge weights correspond to 
integrated interaction probabilities: each method is first scored independently against a set of 
gold standard interactions, and the resulting scores are then combined in a Bayesian 
fashion, assuming that the scores obtained by each method are independent of each other. More 
complex methods take into account the dependence and correlation between methods to 
develop a regression model for scoring the integrated interactome (Linghu et al., 2008; Zhao et 
al., 2008b). Nevertheless, all of them boil down to constructing a network with either weighted or 
unweighted edges, which is then used for propagating annotations to uncharacterized 
members using approaches discussed in the section below.
4.2.1.2 Computational methods for predicting function from network 
context 
Any set of functional associations, whether experimentally derived or predicted by the above 
methods, can be depicted as a network of nodes connected by edges, with nodes representing 
proteins and edges denoting the interactions between these nodes. As such, most network-based 
functional inference algorithms work under the premise that the closer two nodes are 
in the network, the higher the functional similarity between them (Sharan et al., 2007). Indeed, 
most computational approaches for predicting function from networks simply exploit the context 
of a protein within the local or global network neighborhood, analogous to traditional sequence 
or genomic context methods. These approaches also generally tend to infer the broader 
function, such as the biological process a protein is involved in, as opposed to the molecular 
function which is typically inferred by homology-based approaches, making network-based 
approaches complementary methods for annotating genomes. These methods can be grouped 
into two major classes, namely those which use direct network context and those which are 
assisted by module prediction. The former infer the function of a protein based on its 
connections (direct or indirect) in the network, while the latter first identify the modules of related 
proteins and then annotate each protein in the module based on the known functions of its 
members using one of the direct methods (see Table 4-3 for a summary of the methods 
belonging to either class). 
 
Table 4-3. Different methods currently available for network-based function prediction.  
 
Method: Direct
Description: In simpler versions of these methods the function of a protein is assigned based on 
the number of annotated protein neighbors in the immediate network neighborhood which are 
associated with a particular function. Advanced approaches take into account overall network 
topology and are able to give confidence scores for predictions. Techniques based on flow 
simulation and graph theory have been shown to yield high accuracies on some model systems. 
Other methods in this category involve the use of probabilistic Markov random field models.
References: (Chua et al., 2006; Deng et al., 2003; Hishigaki et al., 2001; Karaoz et al., 2004; 
Letovsky and Kasif, 2003; Nabieva et al., 2005; Schwikowski et al., 2000; Vazquez et al., 2003)

Method: Module based
Description: In these approaches, two major steps are involved: 1) identification of modules 
which are functionally coherent using any clustering technique; 2) prediction of the function of 
uncharacterized members in a cluster using any of the direct methods or by computing 
enrichment for characterized functions in a given module and then transferring the annotations 
to other members. The first step follows the notion that genes which work in the same biological 
process should be homogeneous in their functional roles and hence plays a crucial role in these 
methods. Consequently, the majority of the methods in this category differ in the approach taken 
to identify modules.
References: (Altaf-Ul-Amin et al., 2006; Bader and Hogue, 2003; Brun et al., 2003; King et al., 
2004; Pereira-Leal et al., 2004; Rives and Galitski, 2003; Samanta and Liang, 2003; Spirin and 
Mirny, 2003)
 
 Among the direct methods, the simplest and perhaps the most intuitive method for 
function prediction determines the function of a protein based on the known functions of proteins 
lying in its immediate neighborhood, and is commonly referred to as the majority consensus or 
Guilt-By-Association (GBA) method (Schwikowski et al., 2000). Although simple and potentially 
effective for dense networks, the method does not take into account the complete topology of 
the network, nor does it provide a score for the predicted functional label. Therefore, over the 
years more sophisticated methods, like those developed by Hishigaki et al. (Hishigaki et al., 
2001) and Chua et al. (Chua et al., 2006), have tried to address these limitations. Other direct 
methods involve the use of graph theoretical principles such as cuts and flow simulation in the 
networks in order to take advantage of the global and/or local topology of the network under 
consideration (Karaoz et al., 2004; Nabieva et al., 2005; Vazquez et al., 2003). In doing so, 
these methods also aim at maximizing the number of edges (for a protein of interest) which 
connect to other proteins assigned with the same function. Some authors have also employed 
probabilistic approaches to address the caveats of the original methods. These follow the premise 
that the function of a protein is independent of all other proteins given the functions of its 
immediate neighbors, thereby leading to the use of Markov random field models for solving the 
problem of function prediction (Deng et al., 2003; Letovsky and Kasif, 2003; also see Sharan et 
al., 2007). 
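The majority consensus (GBA) rule described above can be illustrated with a short sketch. The network, protein identifiers and functional labels below are hypothetical, and the alphabetical tie-breaking rule is an assumption made only to keep the example deterministic.

```python
from collections import Counter


def majority_vote(network, annotations, protein):
    """Predict a function for `protein` from the annotations of its direct neighbors.

    network:     dict mapping each protein to a list of its neighbors
    annotations: dict mapping annotated proteins to a set of function labels
    Returns the most frequent label among the neighbors, or None if no
    neighbor is annotated. Ties are broken alphabetically (an assumption).
    """
    votes = Counter()
    for neighbor in network.get(protein, ()):
        for function in annotations.get(neighbor, ()):
            votes[function] += 1
    if not votes:
        return None
    top = max(votes.values())
    return sorted(f for f, c in votes.items() if c == top)[0]


# Hypothetical toy network around one uncharacterized protein.
network = {
    "orphan": ["p1", "p2", "p3"],
    "p1": ["orphan", "p2"],
    "p2": ["orphan", "p1"],
    "p3": ["orphan"],
    "isolated": [],
}
annotations = {
    "p1": {"transport"},
    "p2": {"transport"},
    "p3": {"cell division"},
}
print(majority_vote(network, annotations, "orphan"))
```

The sketch makes the limitations noted above concrete: only immediate neighbors contribute, and the prediction carries no confidence score.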
 Biological systems are inherently modular in their functions, with groups of genes being 
associated with a particular biological process or pathway (Hartwell et al., 1999). This has resulted 
in the development of module-based functional inference approaches. In these approaches, 
coherent groups of genes which are predicted to work together to achieve a common biological 
task are first identified by clustering methods, and then the functions of genes within each group are 
assigned. Once modules are identified, simple methods like GBA, or hypergeometric enrichment 
computed for every function associated with the module, are used for transferring the 
annotations to the uncharacterized members. Therefore, in these approaches the initial 
clustering method employed is crucial in determining the quality of the functional predictions. As 
a result, different module-assisted techniques differ mainly in the module detection technique 
employed. Module finding algorithms typically depend on network topology information, 
which is used as a distance metric for clustering techniques that identify 
either a defined number of clusters, as in k-means clustering, or a hierarchy of clusters, as in 
hierarchical clustering. Some of the module detection techniques are also able to detect 
overlapping clusters as a means of revealing the inherent plasticity in biological systems. Table 
4-3 summarizes some of the module-assisted techniques employed for functional inference. 
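The hypergeometric enrichment step mentioned above can be sketched as a one-sided test of whether a function is over-represented in a module. The counts in the example are hypothetical and serve only to show the calculation; any significance cutoff applied to the resulting P-value would be a further choice not specified here.

```python
from math import comb


def hypergeom_pvalue(population, annotated, module_size, hits):
    """P(X >= hits) under the hypergeometric distribution.

    population:  number of proteins in the network
    annotated:   number of proteins in the network carrying the function
    module_size: number of proteins in the module
    hits:        number of module members carrying the function
    """
    upper = min(annotated, module_size)
    total = comb(population, module_size)
    favourable = sum(
        comb(annotated, i) * comb(population - annotated, module_size - i)
        for i in range(hits, upper + 1)
    )
    return favourable / total


# Hypothetical example: 20 proteins, 5 carry the function, and a module of
# 4 proteins contains 3 of them.
p_value = hypergeom_pvalue(population=20, annotated=5, module_size=4, hits=3)
print(p_value)
```

If the function is significantly enriched, it would then be transferred to the uncharacterized members of the module.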
4.2.2 Uncovering the cellular roles of functional orphans in E. coli 
Because of its central position in the microbial research community, the Gram-negative 
bacterium Escherichia coli plays a leading role in investigations of the fundamental molecular 
biology of bacteria (Arifuzzaman et al., 2006; Baba et al., 2006; Barrett et al., 2005; Butland et 
al., 2005; Faith et al., 2007; Feist et al., 2007; Joyce et al., 2006; Riley et al., 2006). This 
experimentally-tractable microbe is a workhorse in basic and applied research aimed at 
elucidating the mechanistic basis of prokaryotic processes and traits, including those of 
pathogens. The ever-expanding availability of genomic resources makes E. coli particularly well-
suited to systematic investigations of microbial protein components and functional relationships 
on a global scale. These include a genome-wide collection of single gene deletion strains (Baba 
et al., 2006) along with extensive knowledge of regulatory circuits (Barrett et al., 2005; Faith et 
al., 2007; Gama-Castro et al., 2008; Joyce et al., 2006) and metabolic pathways (Feist et al., 
2007; Kanehisa and Goto, 2000; Keseler et al., 2005).  
 Yet, although E. coli is the most intensively studied model bacterium, a recent comprehensive 
community annotation effort for the fully sequenced reference K-12 laboratory strains (Riley et 
al., 2006) indicated that only about half (~54%) of the protein-coding gene products of E. coli currently 
have experimental evidence indicative of a biological role. The remaining genes have either only 
generic, homology-derived functional attributes (e.g. ‘predicted DNA-binding’) or no discernible 
physiological significance. Some of these functional ‘orphans’ (not to be confused with 
‘ORFans’, which are genes present only within a single species or closely related species) may have 
eluded characterization in part because they exhibit mild mutant phenotypes, are expressed at 
low or undetectable levels, or have limited homology to annotated genes. 
A key feature of the molecular organization of all organisms, including bacteria, is the 
tendency of gene products to associate into macromolecular complexes, biochemical pathways 
and functional modules that in turn mediate all the major cellular processes. Elaboration of 
these interaction networks via proteomic, genomic and bioinformatic approaches can reveal 
previously overlooked components and unanticipated functional associations (Hawkins and 
Kihara, 2007). For example, a recent integrative analysis of phenotypic, phylogenetic and 
physical interaction data led to the discovery of an evolutionarily conserved set of novel 
bacterial motility-related proteins (Rajagopala et al., 2007). However, while systematic 
integration of diverse high-throughput interaction datasets is routinely performed to reveal new 
functional relationships in model eukaryotes such as yeast, worm and fly (Bandyopadhyay et al., 
2008; Gunsalus et al., 2005; Lee et al., 2008; Myers et al., 2005; Reguly et al., 2006; Sharan 
and Ideker, 2006), few analogous studies of the global functional architecture of E. coli, or of any 
prokaryote for that matter, have been reported to date (Campillos et al., 2006; Slonim et al., 
2006; Yellaboina et al., 2007).  
 To this end, we have combined complementary, highly-sensitive computational and 
experimental procedures to derive extensive high-quality maps of the functional interactions 
inferred by genomic context (GC) methods and physical interactions (PI) deduced by 
proteomics of E. coli. Our results indicate that many previously unannotated bacterial proteins 
are components of functionally cohesive modules and multiprotein complexes linked to well 
known biological processes. A substantive fraction of these associations could be verified by 
independent experimentation and were found to be broadly conserved across prokaryotic phyla, 
indicating homologous systems in other microbes, while others are seemingly restricted to the 
E. coli lineage. In what follows, I present a summary of this large-scale study, in which 
we characterize the broad biological processes of these functional orphans using an integration 
of computational and experimental means. The entire data collection is publicly accessible via a 
searchable web-browser interface (http://ecoli.med.utoronto.ca/) to stimulate exploration of both 
conserved and specialized bacterial proteins within the context of biological processes of 
particular interest. 
4.2.2.1 The extent of existing functional annotation for E. coli proteins 
Since the functional characterization of E. coli, and bacteria in general, has largely been guided 
historically by scientific interests and technical considerations, some bias is expected in terms of 
the coverage and depth of existing biological knowledge as reflected in current gene 
annotations. This biased coverage is likely due to multiple reasons, ranging from the low 
expression of certain proteins to the lack of homologs in other organisms including humans. To 
evaluate the degree to which the physiological functions of the 4,225 putative protein-coding 
sequences of E. coli K-12 are characterized presently, we examined the scope of literature 
reference records curated in the UniProt annotation system (Apweiler et al., 2004). After 
excluding PubMed references corresponding to genomic mapping studies, the average total 
number of papers associated with each of the proteins of E. coli K-12 is surprisingly limited 
(Figure 4-1A), with many proteins apparently still uncited.  
 We next examined recent E. coli K-12 (sub-strains W3110 and MG1655) gene 
annotations in the public databases RefSeq (Pruitt et al., 2005), MultiFun (Serres et al., 2004), 
and EcoCyc (Keseler et al., 2005). Since W3110 is commonly used for high-throughput studies, 
we devoted the bulk of our subsequent analysis to this sub-strain. In total, we found that 2,794 
(66%) of E. coli’s proteins had either proper mnemonic names (Rudd, 1998), experimentally-
derived annotations in the MultiFun multifunction schema, or literature documentation to a well-
defined pathway or multiprotein complex in EcoCyc (Figure 4-1B). This left 1,431 proteins (34%) 
as currently functionally uncharacterized (which constitute our ‘orphans’ set). Of these, 446 
(31%) have at least one putative molecular function defined on the basis of sequence (such as 
the presence of a predicted DNA-binding domain or an enzymatic motif) in the Clusters of 
Orthologous Groups of proteins (COGs) catalog (Tatusov et al., 1997). 
 
4.2.2.2 Properties of the functional orphans of E. coli 
        
Figure 4-1. Annotated and functional orphan genes of the E. coli K-12 reference strain 
(A) Frequency distribution of supporting publications per E. coli protein-coding gene. (B) Summary of 
existing annotations for E. coli, showing proteins of unknown function (orphans) lacking proper names or 
functional annotations in MultiFun or EcoCyc. (C) Although the functional orphans are encoded by 
transcripts with half-lives comparable to those of annotated genes, they tend to be expressed at lower 
levels based on (D) microarray analysis of mRNA and (E) Codon Adaptation Index scores, and (F) have 
lower molecular weights on average. Orthologs of orphans are also less prevalent in sequenced genomes 
than those of annotated genes (G). However, examination of environmental metagenomic libraries (H) 
indicates that the orphans are not necessarily exclusive to the Escherichia lineage. AMO: methane 
oxidizing Archaea; Anammox: anaerobic ammonium oxidation bacteria. t, T-test; p, P-value; NS, not 
statistically significant. 
 
The genes lacking annotation appear to be translated into bona fide proteins as their 
corresponding transcripts (Selinger et al., 2003) were not significantly (p = 0.36) less stable than 
those of annotated genes (Figure 4-1C). However, some differences were evident in 
terms of their biophysical attributes and evolutionary scope relative to annotated genes. Most 
notably, only 21 orphans (1.5%) are required for viability under standard laboratory conditions 
(Baba et al., 2006) in contrast with the 280 annotated genes (10%) previously deemed 
essential. The orphans were also significantly (p < 1e-10) less abundant at both the transcript 
[Figure 4-1D; avg. normalized mRNA expression over 400 microarray experiments (Faith et al., 
2007): 8.0 (orphans) vs. 8.9 (annotated)] and protein levels (Figure 4-1E; avg. codon adaptation 
index: 0.41 vs. 0.47). Furthermore, they tend to encode somewhat smaller proteins (Figure 4-1F; 
avg. MW: 29.4 vs. 38.2 kDa; p < 1e-10) with fewer domain assignments (44%) than for 
annotated proteins (74%) according to the SUPERFAMILY database (Madera et al., 2004).  
Orphans also generally have fewer orthologs in a non-redundant genome dataset, defined 
by filtering at 90% similarity based on the frequency of shared orthologs among genomes 
(Figure 4-1G), with an average of 0.22 as compared with 0.48 for annotated genes (p < 1e-10) 
using a maximum-score E-value cutoff of 1e-6 for BLAST bi-directional best hits (BDBHs). 
Nevertheless, broader sequence comparisons against currently available metagenomes (Figure 
4-1H) indicated that orphan homologs (one-way BLAST hits) are often widely distributed in 
diverse environments (see the online protocols accompanying the published study for a more 
detailed description of the Materials and Methods); for example, a high proportion (0.80) of 
orphans have homologs present in marine metagenomes, anaerobic bacterial populations (farm 
silage, 0.51; whalefall, 0.50; sludges, 0.49), and even in the residents of the mammalian gut 
(union of human and mouse, 0.35), implying participation in core bacterial processes. 
Furthermore, the same high proportion (~99%) of orphan and annotated genes have orthologs 
in the other sequenced E. coli isolates, including pathogenic variants and closely-related 
Shigella strains. Taken together, this argues that the functional significance of the orphans is 
more pervasive than the current annotations suggest. 
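The bi-directional best hit (BDBH) criterion used above for calling orthologs can be illustrated with a minimal sketch. It assumes that BLAST results have already been parsed into (query, subject, E-value, bit score) tuples; the identifiers and values below are hypothetical, and the choice of bit score to rank hits is an assumption. Only the E-value cutoff of 1e-6 comes from the text.

```python
def best_hits(hits, max_evalue=1e-6):
    """Map each query to its best-scoring subject among hits passing the E-value cutoff."""
    best = {}
    for query, subject, evalue, bitscore in hits:
        if evalue > max_evalue:
            continue
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a, max_evalue=1e-6):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b, max_evalue)
    reverse = best_hits(b_vs_a, max_evalue)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Hypothetical parsed BLAST results between genome A and genome B.
a_vs_b = [
    ("a1", "b1", 1e-50, 200.0),
    ("a1", "b2", 1e-10, 60.0),
    ("a2", "b2", 1e-20, 90.0),
    ("a3", "b3", 1e-3, 30.0),   # fails the E-value cutoff
]
b_vs_a = [
    ("b1", "a1", 1e-48, 198.0),
    ("b2", "a1", 1e-12, 65.0),  # b2 prefers a1, so a2-b2 is not reciprocal
    ("b2", "a2", 1e-9, 55.0),
    ("b3", "a3", 1e-3, 30.0),
]
orthologs = bidirectional_best_hits(a_vs_b, b_vs_a)
print(orthologs)
```

The one-way hits used for the metagenome comparison correspond to the `best_hits` step alone, which is why they recover more distant homologs than the reciprocal criterion.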
4.2.2.3 A systematic approach to elucidate biological function 
The scarce existing knowledge regarding the biological roles of the orphans is likely due to 
multiple reasons, ranging from the lower expression, non-essentiality, or smaller sizes of certain 
orphan proteins to their lack of obvious homologs in other organisms including humans. 
Accordingly, integration of multiple data sources is warranted to decipher the specific biological 
roles of this uncharacterized repertory. Since the elucidation of physical and functional 
interaction networks can provide insights into bacterial protein function based on the concept of 
guilt-by-association (Yao and Ruzzo, 2006), we took a multi-pronged approach. We performed 
large-scale proteomic analysis to determine orphan participation as components of stable 
multimeric protein complexes, and inferred functional relationships based on genomic context 
inference, which exploits the patterns of gene conservation across bacterial genomes 
(Shoemaker and Panchenko, 2007b; von Mering et al., 2005). We then predicted the functions 
of the orphans using an integrative machine-learning procedure. Finally, independent low-throughput 
experiments were also performed to validate a subset of high confidence predictions 
related to core biological processes, which will not be discussed here. Key steps (mostly 
computational) in this pipeline are outlined schematically in Figure 4-2.  
 
             
Figure 4-2. Generation and integration of different networks, and orphan function prediction   
 
(A) Construction of a PI network based on protein co-purification and detection by mass spectrometry. 
For the confidence scoring by logistic regression, datasets consisting of PI from low-throughput studies 
curated in DIP, BIND and IntAct (gold positives) and proteins in different subcellular localizations (gold 
negatives) were used for benchmarking. The resulting PI network, with edge weights corresponding to 
likelihood ratios, was clustered using MCL to delimit ‘multiprotein complexes’. (B) Integration of four GC 
methods into a single functional interaction network using a probabilistic model (von Mering et al., 2005), 
whose resulting scores (edge weights) were inputted to MCL to delimit ‘functional modules’. (C) Orphan 
function prediction was conducted using a ‘guilt-by-association’ procedure. After integration of PI and GC 
interactions into a single probabilistic network, a machine learning algorithm (StepPLR) newly developed 
for this study was used to assign functions based on the binary associations of orphans with annotated 
proteins, the respective interaction edge weights and the overall network topology. Correlations between 
vectors of these function predictions (orphans) and the annotations were then used as input to delimit 
‘functional neighborhoods’ by clustering using MCL. 
 
4.2.2.4 Experimental definition of the physical interaction network of 
the soluble proteome 
We performed systematic large-scale tandem-affinity purifications of all endogenous soluble 
orphan and annotated proteins detectably expressed in E. coli W3110 under standard culture 
conditions [see Materials and Methods below and protocols linked with the manuscript for 
further details]. We used an optimized Sequential Peptide Affinity (SPA)-tagging system to 
isolate multiprotein complexes (Zeghouf et al., 2004). This procedure is based on the integration 
of a marker cassette bearing a dual-affinity tag, consisting of three FLAG sequences and a 
calmodulin binding peptide separated by a protease cleavage site, fused to the C-termini of 
targeted open reading frames in E. coli DY330 (W3110 background) via λ-phage “Red”-mediated 
homologous recombination. This system enables recovery of native bacterial protein 
complexes at near-endogenous levels (Butland et al., 2005), minimizing spurious non-specific 
protein associations. Stably interacting polypeptides were subsequently detected using a highly-
sensitive combination of tandem mass spectrometry (LCMS) and peptide mass fingerprinting 
procedures (MALDI) to increase detection coverage and accuracy, just as had been previously 
done in a focused investigation for highly-conserved essential E. coli proteins by my 
collaborators (Butland et al., 2005). We successfully chromosomally-tagged 1,241 new baits, 
aiming to verify putative interactions by reciprocal tagging where possible, for a total of 1,476 
large-scale protein purifications (after including the 235 reported previously), of which 552 
represented orphans. 
 Since proteomic datasets typically contain noise in the form of non-specific associations, 
we performed a careful statistical analysis and quality filtering to determine biologically 
meaningful physical interactions. We considered that the specificity and affinity between any two 
putatively interacting proteins should be correlated with the consistency of co-purification over 
all the experiments in which the proteins were identified (i.e. co-complexed). We therefore used 
an established co-purification metric (Zhang et al., 2008) to assess interaction specificity based 
on the similarity of the protein co-purification patterns. We then generated a single consolidated 
confidence score for each putative pair-wise physical interaction based on the co-purification 
metric together with the primary interaction evidence to penalize inconsistent or promiscuous 
binders (i.e. possible false-positives), using alternatively a logistic regression model or 
Bayesian inference (Suthram et al., 2006). 
 The logistic regression model was trained using a reference set of curated gold-standard 
physical interactions (PI), representing the union of experimentally-verified physical 
interactions derived from low-throughput experiments extracted from the Database of Interacting 
Proteins (DIP) (Xenarios et al., 2000), the Biomolecular Interaction Network Database (BIND) 
(Bader et al., 2003) and the IntAct database (Kerrien et al., 2007). For the negative gold 
standards, we compiled pairs of proteins annotated with different subcellular localizations (i.e. 
one cytoplasmic, the other periplasmic or outer membrane-bound) (Diaz-Mejia et al., 2009). 
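To make the scoring scheme concrete, the sketch below trains a logistic regression model on gold-standard positive and negative pairs and then scores a new pair. It is a minimal illustration only: the two features (a co-purification similarity and a primary evidence score), their values, the learning rate and the number of iterations are all hypothetical, and plain gradient descent stands in for whatever fitting procedure was actually used.

```python
from math import exp


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    ez = exp(z)
    return ez / (1.0 + ez)


def train_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def score(w, b, x):
    """Confidence score (probability of a true interaction) for one protein pair."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Hypothetical features per pair: [co-purification similarity, primary evidence].
X = [[0.9, 0.8], [0.8, 0.9], [0.7, 0.85],   # gold-standard positives
     [0.1, 0.2], [0.2, 0.1], [0.3, 0.15]]   # gold-standard negatives
y = [1, 1, 1, 0, 0, 0]
w, b = train_logistic(X, y)
print(score(w, b, [0.85, 0.9]), score(w, b, [0.15, 0.1]))
```

Each candidate pair receives a score between 0 and 1, which is the quantity to which a confidence cutoff can subsequently be applied.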
 Despite its relative simplicity, the logistic regression model offered better performance 
than the Bayesian method (see Figure 4-3A). We therefore applied the former to our global PI 
network, assigning a probabilistic confidence score for each pair of putatively interacting 
proteins. To minimize false positives without incurring excessive false negatives, we further 
filtered our network using a stringent minimum confidence cutoff of ≥ 0.75 as a high proportion 
(71%) of PI verified by reciprocal purification had likelihood scores at or above this threshold. 
Finally, we removed from consideration the ten most highly connected ‘hub’ proteins, which were 
deemed particularly abundant non-specific contaminants.  
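The two filtering steps can be sketched as follows. This is a minimal illustration on a hypothetical edge list; in particular, computing hub degree after applying the confidence cutoff is an assumption, since the text does not state the order in which degree was measured.

```python
from collections import Counter

def filter_network(scored_edges, cutoff=0.75, n_hubs=10):
    """Keep edges scoring >= cutoff, then drop every edge touching a top hub."""
    kept = [(a, b, s) for a, b, s in scored_edges if s >= cutoff]
    degree = Counter()
    for a, b, _ in kept:
        degree[a] += 1
        degree[b] += 1
    # Assumption: hubs are ranked by degree within the filtered network.
    hubs = {protein for protein, _ in degree.most_common(n_hubs)}
    return [(a, b, s) for a, b, s in kept if a not in hubs and b not in hubs]

# Hypothetical toy network in which protein "A" behaves as a promiscuous hub.
toy_edges = [("A", "B", 0.90), ("A", "C", 0.80), ("A", "D", 0.95),
             ("B", "C", 0.60), ("C", "D", 0.80)]
filtered = filter_network(toy_edges, cutoff=0.75, n_hubs=1)
```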
The resulting final network consisted of 5,993 high-confidence, non-redundant pair-wise 
interactions among 1,757 distinct E. coli proteins, including 451 orphans, or roughly two thirds of 
the predicted soluble cytoplasmic proteome. As summarized in Figure 4-3B, most (3,193, or 
53%) of these physical interactions are novel, while only 47% were already reported in either 
the DIP, BIND or IntAct interaction databases, or previous large-scale proteomic studies 
(Arifuzzaman et al., 2006; Butland et al., 2005). Importantly, the accuracy of our filtered dataset 
was comparable to that of the much smaller set of 716 ‘validated’ PI reported 
previously (Butland et al., 2005) and to that of a genome-scale dataset of 7,123 PI (median confidence of 
0.69) generated using an analogous affinity purification schema in yeast (Krogan et al., 2006). 
The reliability of this dataset was also evident from two additional independent criteria. 
First, the mRNA expression patterns of the putatively interacting proteins were nearly as highly 
correlated as those of the presumably more abundant curated protein pairs determined by low-
throughput experiments (Figure 4-3C). Second, despite the more limited evolutionary 
distribution of the orphans, the putatively interacting proteins exhibited an elevated degree of 
co-occurrence of the respective orthologs across other bacterial species, evident from the high 
mutual information of the corresponding phylogenetic profiles (see Methods), again comparable 
to that of interacting pairs derived from low-throughput experiments (Figure 4-3D). Collectively, 
these results indicate that our physical interaction network is very likely to be informative about 
orphan protein function. 
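For illustration, the mutual information between two phylogenetic profiles can be computed as in the sketch below. It assumes binary presence/absence profiles over the same ordered list of genomes; the profile encoding actually used is described in the Methods and may differ.

```python
import math
from collections import Counter

def mutual_information(profile_a, profile_b):
    """Mutual information (in bits) between two equal-length discrete profiles."""
    n = len(profile_a)
    joint = Counter(zip(profile_a, profile_b))
    marg_a = Counter(profile_a)
    marg_b = Counter(profile_b)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        mi += p_ab * math.log2(p_ab / ((marg_a[a] / n) * (marg_b[b] / n)))
    return mi

# Hypothetical profiles: 1 = ortholog present in a genome, 0 = absent.
co_occurring = mutual_information([1, 0, 1, 0], [1, 0, 1, 0])
independent = mutual_information([1, 1, 0, 0], [1, 0, 1, 0])
```

Proteins whose orthologs are gained and lost together across species yield high values, whereas unrelated profiles yield values near zero.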
4.2.2.5 Orphan membership within multiple protein complexes 
Since macromolecular assemblies mediate biological function in cells, we partitioned our high 
confidence physical interaction network using the Markov clustering algorithm (MCL; see 
Materials and Methods) to define orphan membership as subunits of discrete multiprotein 
complexes. MCL simulates random walks (i.e. flux) to delimit highly connected sub-networks 
based on both the connectivity and the weight of the graph edges (Enright et al., 2002). In this 
case, the weights reflect the interaction likelihood ratios obtained by logistic regression (Figure 
4-2A). The higher the flux within a region, the more likely MCL will delimit the region as a  
 
Figure 4-3. High-confidence physical interactions and putative multiprotein complexes  
 
(A) Benchmarking of the experimentally-derived PI network in E. coli against positive and negative gold 
standards by ROC-curve analysis. (B) Overlap of PI identified in this study with previous proteomic 
reports (Arifuzzaman et al., 2006; Butland et al., 2005) and low-throughput PI obtained from DIP, BIND 
and IntAct. (C) Putatively interacting proteins have highly-correlated gene expression patterns and (D) 
similar phylogenetic profiles based on mutual information as for low-throughput curated PI and in contrast 
to control protein pairs derived from different sub-cellular compartments. (E) Graphical schematic of 
putative stable, soluble multiprotein complexes using the GenePRO Cytoscape plugin (Vlasblom et al., 
2006). Each node represents a complex, whose size reflects the number of contained proteins; edge 
widths reflect the number of interactions between subunits of different complexes. (F) Multiprotein 
complexes implicated in the bacterial translation apparatus; orphans mentioned in the main text are 
highlighted in bold. (G) Reduced rate of total protein synthesis in a strain lacking ybcJ relative to wild-type 
cells (WT). (H) Perturbed ribosome profiles in a yfgB deletion strain. (I) Elevated rates of frame-shifting 
and stop-codon readthrough in yfgB and ybcJ deletion strains relative to wild-type (WT). β-gal activity is 
only produced after the corresponding translational defect has occurred; error bars indicate standard 
deviation. 
 
cluster (in this case, a putative multimeric protein complex). A recent comparative study (Brohee 
and van Helden, 2006) found that MCL is often superior to other clustering algorithms in 
identifying functionally-related groupings in probabilistic molecular interaction graphs and is 
remarkably resilient to spurious graph perturbations (e.g. missing edges). 
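The flow simulation underlying MCL can be sketched in a few lines. This is a simplified, dense-matrix illustration of the expansion and inflation steps on a hypothetical toy graph; it omits the pruning and convergence checks of the reference MCL implementation, and the inflation value and iteration count shown are arbitrary choices rather than the optimized parameters used in this work.

```python
def normalize_columns(m):
    """Rescale each column to sum to 1 (column-stochastic matrix)."""
    n = len(m)
    out = [row[:] for row in m]
    for j in range(n):
        col_sum = sum(m[i][j] for i in range(n))
        for i in range(n):
            out[i][j] = m[i][j] / col_sum if col_sum else 0.0
    return out

def mcl(weights, inflation=2.0, iterations=50):
    """Cluster a weighted, undirected graph given as a square matrix."""
    n = len(weights)
    # Self-loops are added before normalization, as is customary for MCL.
    m = [[weights[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(iterations):
        # Expansion: matrix squaring spreads flow along longer walks.
        m = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
        # Inflation: element-wise power strengthens strong flows.
        m = normalize_columns([[v ** inflation for v in row] for row in m])
    # Rows that retain flow are attractors; their non-zero columns form a cluster.
    clusters = set()
    for row in m:
        members = frozenset(j for j, v in enumerate(row) if v > 1e-6)
        if members:
            clusters.add(members)
    return clusters

# Hypothetical toy graph: two triangles joined by one weak edge (2-3).
edges = {(0, 1): 1.0, (0, 2): 0.9, (1, 2): 0.8,
         (3, 4): 1.0, (3, 5): 0.7, (4, 5): 0.9,
         (2, 3): 0.1}
n_nodes = 6
W = [[0.0] * n_nodes for _ in range(n_nodes)]
for (i, j), w in edges.items():
    W[i][j] = W[j][i] = w
clusters = mcl(W)
```

Raising the inflation parameter generally yields smaller, tighter clusters, which is why it is the main parameter to tune when partitioning an interaction network.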
We optimized the MCL parameters (see Materials and Methods) to partition the 5,993 PI 
network, generating a set of 443 putative multiprotein complexes (Figure 4-3E), most of which 
consist of 2-4 polypeptides. In agreement with previous reports (Brohee and van Helden, 2006), 
alternative clustering algorithms comparable to MCL in terms of accuracy, like Restricted 
Neighborhood Search Cluster algorithm (King et al., 2004), produced similar groupings (data not 
shown). Moreover, as was found in a proteomic survey of yeast multiprotein complexes (Krogan 
et al., 2006), both the subunit number and degree connectivity of the MCL clusters followed a 
power-law distribution. In particular, two hundred and forty four (55%) of these E. coli 
multiprotein complexes contained at least one orphan as a putative subunit, with linkages 
suggestive of a concerted biological function (Figure 4-3E). The complexes 
also showed a significant (p < 0.001) enrichment in terms of functional homogeneity implying 
that both the annotated components and the associated orphans tend to participate in the same 
biological processes.  
For example, 25 orphans were detected as part of a large sub-network of putative 
complexes involved in protein synthesis (Figure 4-3F). These include the orphans YbcJ and 
YncE, which physically interacted with the pseudouridylate synthase RluB, the RNA helicases 
SrmB and DeaD, the ribonucleases E (Rne) and R (Rnr), and other components of the 
ribonucleolytic ‘degradosome’ responsible for mRNA degradation, suggesting a probable role in 
RNA processing and/or turnover. Likewise, YfgB co-purified with three translation-related 
complexes, including the ribosome. Consistent with these observations, the expression of YncE, 
which has similarity to the non-ribosomal peptide synthase AfuA of Aspergillus fumigatus, is 
reduced >9-fold upon exposure of E. coli to the translational inhibitor puromycin (Sabina et al., 
2003). We also determined that deletion of ybcJ results in a significant reduction in the 
incorporation of 35S-labeled methionine in vivo relative to wild-type (Figure 4-3G), indicating a 
decrease in the global rate of protein synthesis. Similarly, ribosome profile analysis (Figure 4-
3H) showed that inactivation of yfgB decreased the level of mature polysomes actively engaged 
in mRNA translation and altered the cellular ratios of 30S and 50S ribosomal subunits relative to 
70S monosomes. Moreover, both the ybcJ and yfgB mutants exhibited reduced translation 
fidelity (Figure 4-3I) as assayed by four reporter plasmids that measure the frequency of 
frameshifts and stop codon readthrough.  
 Other orphans in this translation sub-network include YibL, which co-purified both with 
YfgB and YbcJ, and with RNA processing factors involved in ribosome biogenesis, such as the 
RNA pseudouridine synthetases RluB/RluC and the RNA helicase DeaD, and with RppH 
(formerly NudH), which was recently identified as a regulator of 5'-end-dependent mRNA 
degradation (Barkan et al., 2007; Deana et al., 2008; Jiang et al., 2006). Similarly, the orphan 
YdhQ co-purified with translation elongation factor Tu, while YagJ interacted with lysine tRNA 
synthetase (LysU), and YjcF, which has similarity to phenylalanyl-tRNA synthetase PheT of 
Bacteroides vulgatus, bound ribosomal release factor 2 and another orphan, YbeB, which in 
turn was found to associate with the 50S ribosome subunit, as recently reported (Jiang et al., 
2007). These results confirm that our high-confidence physical interaction network is informative 
about the function of at least certain orphans. 
4.2.2.6 Functional interactions predicted by genomic-context methods 
Although we attempted to tag and purify the entire soluble E. coli interactome, we failed to 
detect 469 orphan proteins by MALDI or LCMS, presumably because they are either 
membrane-associated (~35%), and hence not soluble, or of particularly low abundance (~40%). To 
bypass this limitation, we applied computational methods to discern a network of high-
confidence pair-wise functional interactions for all E. coli proteins, including those not detectable 
by proteomic methods, by examining the natural chromosomal clustering of bacterial genes. As 
illustrated in Figure 4-2B, we used four different genomic context (GC) methods, namely: (i) 
Gene Fusions (Enright et al., 1999; Marcotte et al., 1999a); (ii) similarity between Phylogenetic 
Profiles (Gaasterland and Ragan, 1998; Pellegrini et al., 1999; Tatusov et al., 1997); (iii) 
evolutionary conservation of Gene Order (Dandekar et al., 1998; Janga and Moreno-Hagelsieb, 
2004; Overbeek et al., 1999); and (iv) Intergenic Distances (Janga et al., 2005; Rogozin et al., 
2002; Snel et al., 2002) (see Materials and Methods for details). The latter two methods are 
independent approaches to detect operons and their subsequent rearrangements across 
prokaryotic genomes. In particular, the Intergenic Distances method leads to considerably more 
high-quality predicted functional associations compared with the first three classic GC methods 
(Janga et al., 2005), and does not depend critically on the detection of orthologs in evolutionarily 
distant genomes, making it potentially better suited for detecting functional interactions involving 
orphans. 
 The pair-wise interactions generated by each of these prediction methods were 
independently evaluated by benchmarking using gold standards. Positive gold standards were 
defined as pairs of E. coli genes belonging to the same biological pathway as defined in 
EcoCyc, while the negative gold standards represented pairs of annotated E. coli genes whose 
products participate in different pathways. The results of each GC method were subsequently 
combined to create a single unified functional association score (Figure 4-2B). Although 
different data integration algorithms have been developed (Chua et al., 2006; Lee et al., 2004; 
Nabieva et al., 2005; von Mering et al., 2005), most of these have a similar probabilistic basis 
and assumptions. For this study, we opted for the integration procedure used by Bork and 
colleagues (von Mering et al., 2005) to construct the Search Tool for the Retrieval of Interacting 
Genes/Proteins (STRING) database. This approach treated the reliability of the associations 
generated by each GC method as independent probabilities, such that the likelihood of an 
interaction is proportional to the number of times it was observed and the degree to which each 
GC method contributed to the overall network reliability. Finally, we applied a stringent filter to 
the unified functional network to obtain a set of 74,776 high-confidence (probabilities ≥ 0.80) 
non-redundant interactions (Figure 4-4A).  
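The independence assumption described above can be illustrated with a short sketch. It shows only the core probabilistic combination, in which an association is supported unless every method fails to support it; the per-method calibration against gold standards and any prior correction applied in STRING are not reproduced here, and the input probabilities are hypothetical.

```python
def combine_scores(probabilities):
    """Combine per-method probabilities, assuming the methods are independent."""
    remaining = 1.0
    for p in probabilities:
        remaining *= (1.0 - p)   # chance that this method is also uninformative
    return 1.0 - remaining

# Hypothetical gene pair supported by two GC methods.
unified = combine_scores([0.5, 0.6])
passes_filter = unified >= 0.80
```

Under this scheme a pair supported moderately by several methods can reach the same unified score as a pair supported strongly by a single method.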
Although the orphans tend to exhibit more limited conservation, and GC methods depend on 
the detection of homologs in multiple species (except for operon predictions 
based on intergenic distances (Janga et al., 2005)), our combined GC network implicated 
virtually all (1,367, or 96%) of the orphans in 23,365 pair-wise functional interactions. Moreover, 
relatively few (<18%) of our predicted interactions appear to have been reported previously 
(Figure 4-4B). While we could not meaningfully compare our results to an alternate set of 
putative functional links generated recently (Yellaboina et al., 2007) because of a lack of publicly 
accessible dataset scores, we found that less than 5% (3,368) of our predicted interactions are 
listed in the PROLINKS comparative genomics databank (Bowers et al., 2004) while only ~16% 
(11,842, of which only 2,613 involve an orphan) were present in STRING (v. 7.1) at a more 
liberal 0.7 confidence threshold. More critically, greater than 85% of our predicted orphan 
interactions involve a functionally-annotated E. coli protein, indicating a good potential to make 
functional inferences. The fact that PROLINKS has 1,657 predictions not attained by our 
integrative approach may reflect our use of a higher confidence threshold as well as differences 
in implementation of the GC measures and the identification of putative orthologs. For instance, 
whereas we used BLAST-BDBHs as criteria to detect orthologs between pairs of genomes, 
STRING uses COG-based definitions of orthology, while PROLINKS uses one-way BLAST hits 
(not necessarily orthologs). Conversely, most of the 16,585 predictions exclusive to the STRING 
database were compiled using text mining or alternate experimental criteria such as protein-
protein interactions, whereas the highest numbers of predictions exclusive to our GC datasets 
come from operon rearrangements (Janga et al., 2005). 
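The difference between bidirectional best hits and one-way hits can be made concrete with a small sketch. The nested dictionaries of bit scores are a hypothetical stand-in for parsed BLAST output, and no E-value or coverage thresholds are applied here, although a real ortholog-detection pipeline would normally include them.

```python
def bidirectional_best_hits(hits_a_to_b, hits_b_to_a):
    """Return (a, b) pairs that are each other's top-scoring hit."""
    best_ab = {q: max(hits, key=hits.get)
               for q, hits in hits_a_to_b.items() if hits}
    best_ba = {q: max(hits, key=hits.get)
               for q, hits in hits_b_to_a.items() if hits}
    return {(a, b) for a, b in best_ab.items() if best_ba.get(b) == a}

# Hypothetical bit scores between proteins of genome A and genome B.
a_vs_b = {"a1": {"b1": 200.0, "b2": 50.0}, "a2": {"b2": 80.0}}
b_vs_a = {"b1": {"a1": 190.0}, "b2": {"a1": 60.0, "a2": 40.0}}
orthologs = bidirectional_best_hits(a_vs_b, b_vs_a)
```

In this toy case a2 hits b2 in one direction only, so a one-way criterion would report the pair while the bidirectional criterion does not.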
          
Figure 4-4. High-confidence genomic context associations and putative functional modules 
(A) Benchmarking of unified GC interactions in E. coli against positive and negative gold standards by 
receiver operating characteristic (ROC)-curve analysis; cumulative area-under-the-curve (AUC) is shown 
as an overall performance measure. (B) Overlap of high-confidence functional interactions predicted in 
this study with two other public GC databases. (C) Even after eliminating adjacent gene pairs to control 
for known and predicted E. coli operons, functionally-linked genes have highly-correlated patterns of 
mRNA expression comparable to components of the same curated EcoCyc pathways rather than different 
pathways. (D) Functionally-linked genes are enriched for annotations to the same COG functional 
categories. (E) Graphical representation of putative E. coli functional modules; node size and colors are 
proportional to the number and fraction of orphan and annotated subunits, respectively, while lines 
represent interactions connecting modules. (F) Putative fimbriae-related module. (G) Defective motility of 
mutant strains deleted for orphans linked to fimbriae (as in panel F); single dashes indicate moderately 
impaired motility, while double dashes represent strong repression. Other mutants displaying a normal 
phenotype comparable to the wild-type strain BW25113 (WT) are not shown. (H) Defective biofilm 
formation by mutants deleted for fimbriae-related orphans (as in panel F); significant differences (T-test) 
in cell adhesion (absorbance) between mutant and WT strains are denoted by asterisks (single, p < 0.01; 
double, p < 0.0001); LB, Luria Bertani medium; CFA, colonization factor antigen medium. (I) Metabolic 
modules mentioned in main text. (J) Mutants auxotrophic for shikimic-acid and aromatic amino acids; 
growth on minimal 'drop-out' media is indicated. 
The reliability of our unified functional association network was independently 
corroborated by the high correlation of expression among putatively interacting gene 
pairs, which was comparable to that observed for components of the same 
curated EcoCyc pathway even after eliminating all pairs of genes belonging to an 
experimentally-characterized operon or those which form contiguous gene pairs in E. coli 
(Figure 4-4C). We also observed a marked enrichment for interactions among proteins 
annotated to the same curated Clusters of Orthologous Group (COG) functional categories 
(Figure 4-4D), implicating by extension any associated orphans in these same processes. 
4.2.2.7 Defining the participation of orphans as the components of 
functional modules 
Groups of functionally interacting genes form functional modules centered on a common 
process or biochemical pathway(s). To define orphan participation as components of such 
modules, we partitioned the high-confidence GC network using MCL, generating a total of 507 
putative functional modules consisting of two or more components (Figure 4-4E). Examination 
of the functional homogeneity of these predicted modules (see Materials and Methods) 
indicated, as for our putative multiprotein complexes, that they were highly-enriched (p < 0.0001 
compared with null random models) for concerted annotated biological processes, again 
implicating the associated orphans in these same roles. Module membership followed a 
characteristic power law distribution with most modules having between 2 and 10 components. 
Two hundred and eighty nine (57%) of the modules had at least one of a total of 1,189 
different orphans. One notable example is shown in Figure 4-4F. Diverse lines of experimental 
and bioinformatic evidence support the involvement of this putative module in the biogenesis 
and/or activity of fimbriae, appendages or pili that are shorter than the characteristic flagellum of 
Gram-negative bacteria, which mediate cell adhesion, biofilm formation, motility and host 
invasion (Fronzes et al., 2008; Hahn et al., 2002). For instance, 12 of the 13 orphan 
components possess sequence characteristics of bacterial adhesins and chaperone/Usher pili 
protein families (Madera et al., 2004; Nuccio and Baumler, 2007). Gene expression profiling 
studies (Domka et al., 2007; Domka et al., 2006) have previously established that most of these 
orphans are also coordinately induced during biofilm formation. Perhaps most compellingly, we 
found that single gene E. coli knock-out mutants of 6 of the 13 orphans display markedly 
reduced swarming capabilities in semi-solid agar (Figure 4-4G), while 11 out of 13 mutants were 
significantly impaired for biofilm formation in vitro as compared with a wild type control (Figure 
4-4H). Taken together, these observations strongly implicate this set of orphans in the formation 
and/or proper function of fimbriae. 
 Several other prominent modules are shown in Figure 4-4I. These include the orphans 
YdiN, YdiL and YdiM, predicted (based on operon rearrangements) to functionally interact with 
several members of the Aro operon known to participate in the metabolism of shikimate, a 
precursor of aromatic amino acids. Consistent with this, ydiN, aroD and ydiB are reportedly 
over-expressed when E. coli is grown in media containing shikimate as the sole carbon source 
(Johansson and Liden, 2006). Moreover, we found that deletion of either ydiN or ydiB resulted in 
phenotypic auxotrophy for shikimic acid and aromatic amino acids, comparable to that observed 
after loss of known aromatic amino acid biosynthetic genes (e.g., aroA and aroD). 
 Other functional modules include frlA / frlB, part of the Frl operon of E. coli responsible 
for the import and metabolism of the alternative carbon source fructoselysine, together with the 
orphan YifK, which has sequence characteristics of a transporter (Diaz-Mejia et al., 2009), 
implicating it in electrochemical potential-driven uptake of this sugar. Conversely, two orphans, 
YecC and YecS, had functional associations consistent with linkages to amino acid biosynthesis 
and nucleotide metabolism, four (YagU, YqeG, YhaO and YhaM) were linked to a putative 
module involved in transport and metabolism of threonine and serine, while three others (YjjI, 
YeiM, and YjjJ) were found in a module enriched for factors involved in nucleotide transport and 
degradation of deoxyribonucleosides. Taken as a whole, these results suggest discrete 
functional relationships for many previously unannotated proteins, even implicating certain 
orphans within specific pathways. 
4.2.2.8 Improved functional inference within an integrated network 
framework 
Examination of the extent of overlap between our physical and functional networks, both in 
terms of common binary interactions and shared components among the derived complexes 
(from PI) and modules (from GC), indicated that they are largely complementary. Since a 
similar trend was also evident comparing other existing curated E. coli physical interaction 
datasets (derived from either low-throughput or other high-throughput studies) with 
independent functional predictions (e.g. GC inferences from STRING), this presumably stems 
in part from the incomplete coverage obtained by these different approaches. Regardless, 
these observations imply that the union of PI and GC networks is necessary to capture the 
widest spectrum of biologically-relevant interactions. Indeed, it has been shown previously that 
combination of physical interactions with functional genomic inferences, each statistically-
weighted according to dataset quality, can markedly improve both functional coverage and 
accuracy (Beyer et al., 2007; Ideker and Sharan, 2008; Lee et al., 2004; Myers and 
Troyanskaya, 2007; von Mering et al., 2005). We therefore merged our experimental and 
predicted associations with the same method used to generate the unified GC network (Figure 
4-2C; see Materials and Methods).  
 The resulting combined probabilistic network consisted of 80,370 high-confidence 
(probability ≥ 75%) putative pair-wise interactions encompassing virtually the entire proteome 
of E. coli, including 2,769 (99%) annotated proteins and 1,375 (96%) functional orphans. 
Graph analysis of this final integrated network indicated that the orphans tended to have a 
lower overall connectivity and betweenness centrality, measured as the number of shortest 
paths going through a given node, relative to annotated components, suggesting more 
peripheral positions in the integrated networks. However, the orphans also exhibited lower 
average closeness, defined as the average length of the shortest paths from a given node to all other nodes, 
and had similar overall clustering coefficients, indicating that in general the orphans are 
functionally connected rather than isolated from the annotated gene products. These 
observations implied that consideration of both the individual associations and overall 
placement of the orphans within the integrated interaction network would facilitate functional 
deduction.  
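Two of the topological measures discussed above can be sketched in plain Python as follows. The adjacency structure is a hypothetical toy graph, edges are treated as unweighted, and the path-length function follows the definition used in the text (average shortest-path length from a node) rather than its reciprocal.

```python
from collections import deque

def average_path_length(adj, source):
    """Mean shortest-path length from source to every reachable node (BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    others = [d for node, d in dist.items() if node != source]
    return sum(others) / len(others) if others else float("inf")

def clustering_coefficient(adj, node):
    """Fraction of a node's neighbor pairs that are themselves connected."""
    nbrs = list(adj[node])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if nbrs[j] in adj[nbrs[i]])
    return 2.0 * links / (k * (k - 1))

# Hypothetical toy network: a triangle A-B-C with a peripheral node D on C.
adj = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"}, "D": {"C"}}
peripheral = average_path_length(adj, "D")
central = average_path_length(adj, "C")
```

In this toy graph the peripheral node D has a longer average path length than the central node C, mirroring the kind of comparison made between orphans and annotated proteins.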
 We therefore devised a new network-based function prediction method (termed 
StepPLR; see Figure 4-2C and Materials and Methods) to exploit the global topological similarity 
among all the protein pairs and their corresponding functional annotations in the integrated 
network. Our method assigns functions to unannotated orphans based on the functional 
information from their first-order (direct) and second-order (indirect) annotated neighbors in 
the integrated functional association network using penalized logistic regression models and a 
stepwise variable selection procedure to deduce optimal functional profiles (see 
supplementary methods accompanying the manuscript for a detailed protocol). We based our 
classifications on the discrete COG functional categories and on the hierarchical, 
multifunctional terms of the Gene Ontology (GO) (Ashburner et al., 2000; Camon et al., 2004) 
and MultiFun classification schemas (Serres et al., 2004). To avoid potential sources of false 
predictions, we removed any proteins labeled with the evidence codes IPI (for ‘inferred from 
protein interaction’) and IGC (for ‘inferred from genomic context method’) when generating the 
GO reference set, as well as proteins in poorly characterized categories in COGs and 
MultiFun. 
 We found that StepPLR had better precision and recall compared to several other widely 
used guilt-by-association procedures tested, such as majority-counting and chi-squared-based 
methods. Although the performance achieved for the different functional categories varied, our 
approach generated AUC values of 0.8 or higher for most of the COG (83%), GO (67%) and 
MultiFun (53%) categories and was relatively insensitive to the number of annotated proteins 
per function. Moreover, since our method exploited the correlation among the different 
categories, most orphans had multiple biologically-consistent predicted functions. 
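For context, the majority-counting baseline against which StepPLR was compared can be sketched as below. This is the simple guilt-by-association rule, not StepPLR itself, and it considers only direct neighbors; the adjacency and annotation dictionaries are hypothetical.

```python
from collections import Counter

def majority_vote(adj, annotations, protein):
    """Rank candidate functions by how many direct neighbors carry them."""
    votes = Counter()
    for neighbor in adj.get(protein, ()):
        for function in annotations.get(neighbor, ()):
            votes[function] += 1
    return [function for function, _ in votes.most_common()]

# Hypothetical orphan with three annotated interaction partners.
toy_adj = {"orphan": ["p1", "p2", "p3"]}
toy_annotations = {"p1": {"translation"},
                   "p2": {"translation", "motility"},
                   "p3": {"translation"}}
ranked = majority_vote(toy_adj, toy_annotations, "orphan")
```

Unlike this baseline, the approach described in the text also draws on second-order neighbors and on correlations among functional categories.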
4.2.2.9 Functional neighborhoods 
As displayed graphically in Figure 4-5A, our prediction procedure ultimately linked many of the 
orphans to specific, functionally-related protein ‘neighborhoods’. We again made use of the 
MCL algorithm to objectively delimit functionally highly homogeneous (p < 0.0001) protein 
groupings based on the profile similarity of annotations and predictions shown in this figure. One 
notable example is the protein translation machinery (Figure 4-5B), which has 23 associated 
orphans. To independently verify the functional relevance of these assignments, we examined 
the effects of deleting the corresponding genes in terms of conferring sensitivity to drugs that 
inhibit protein synthesis. Consistent with expectation of a direct role in protein synthesis, and 
similar to loss of bona fide annotated translation factors and tRNA synthetases, the mutant 
strains exhibited statistically significant (p < 0.05) differential sensitivity as compared to wild type 
and unrelated gene mutants to a variety of antibiotics that selectively block protein translation 
(Figure 4-5C).  
We also examined an alternate group of orphans (YafP, YiaD and YbcM) associated 
with the flagellar biogenesis and motility apparatus (Figure 4-5D). Single-gene knockout 
mutants of annotated components in this neighborhood exhibit decreased motility in semi-solid 
agar as compared to wild-type E. coli strains (Rajagopala et al., 2007). Consistent with our 
functional predictions, we likewise found that deletion of yafP ablated cell motility in vitro (Figure 
4-5E), similar to mutants lacking core flagellum motor proteins (e.g. FliH, FliM), while loss of 
yiaD and ybcM reduced swarming (i.e. decreased halo formation) to an extent comparable to 
perturbation of other established flagellar components (e.g. flgJ and fliR). A previous study 
(Bresolin et al., 2006) using phenotypic complementation analysis had suggested that a ybcM-
ortholog in Yersinia enterocolitica is likely an AraC-type regulatory protein involved in controlling 
bacterial motility. These results suggest that, akin to several other recently discovered novel 
motility components (Girgis et al., 2007; Rajagopala et al., 2007), these orphans are required for 
the proper assembly and/or subsequent locomotion of the E. coli flagella. 
  
Figure 4-5. The functional neighborhoods of E. coli 
 
(A) A ‘clustergram’ displaying existing annotations (orange) and the predicted functions (this study; blue) 
for all the protein-coding genes of E. coli (x-axis) and their associated biological processes (y-axis; 
descriptions from the different functional schemas, i.e. COG, MultiFun and GO, are not shown due to lack of 
space). Proteins were clustered using MCL based on the paired similarity of the functional annotations 
and predictions in this matrix to delimit ‘functional neighborhoods’. (B) Putative functional neighborhood 
showing high-confidence integrated (combined PI and GC networks) interactions of select orphans with 
the protein synthesis machinery. For clarity, individual names of ribosomal proteins and tRNA synthetases 
are not shown. (C) Heatmap showing the differential sensitivity of orphan deletion strains to antibiotics 
targeting protein synthesis relative to the colony size in the absence of drug. Mutants deleted for 
annotated proteins from this neighborhood are shown as positive controls, while deletion mutants lacking 
genes not contained within this neighborhood are shown as negative controls. (D) Neighborhood with 
three orphans putatively involved in flagellum assembly and motility. (E) Deletions of the corresponding 
components reduce swarming capability; single dash, moderately impaired motility; double dash, strong 
repression. (F) Sub-network of orphans associated with DNA enzymes. (G) Deletion of the orphan yhcG 
results in synthetic lethality when combined with hypomorphic alleles (*) of three essential DNA replication 
factors (parE, dnaN, dnaB). 
 
Many other orphans were predicted to have roles in other conserved biological systems, 
such as DNA replication. For example, as shown in Figure 4-5F, we identified the orphan YhcG 
in association with DNA processing enzymes, including the restriction complexes HsdMRS and 
McrABC, the integrases IntF and IntS, and the recombinase PinE. YhcG has sequence 
characteristics of the PD-(D/E)XK superfamily of nucleases involved in DNA recombination and 
repair (Kosinski et al., 2005). Consistent with these observations, we found that deletion of yhcG 
results in a synthetic-lethal phenotype (Figure 4-5G) when combined with hypomorphic alleles 
of the replicative primosome (dnaB), DNA polymerase III (dnaN), and DNA topoisomerase IV 
(parE), consistent with a direct role in DNA replication or the resolution of critical intermediates. 
4.3 DISCUSSION & CONCLUSION 
Defining the precise biological roles and relationships of bacterial gene products in an often 
dynamically changing physiological context is a challenging proposition. Historically, systematic 
assessments of protein function in bacteria have tended to rely on molecular inferences based 
on sequence alignments and domain architectures, while experimental characterization has 
traditionally been driven by specific scientific interests rather than with the aim of providing the 
broader community with unbiased collections of functionally-related proteins and phenotypes. 
Since the biological role of a protein is not necessarily reflected in its primary sequence, the 
elucidation of molecular interaction networks can provide an alternate perspective even in the 
absence of detailed phenotypic data (Ideker and Sharan, 2008; Lee et al., 2008). Here, we have 
opted to view a model microbial cell mechanistically as a series of modular molecular interaction 
networks that underlie the major biochemical processes that mediate cell homeostasis and 
proliferation, wherein the functional attributes of particular gene products are reflected in their 
overall patterns of associations.  
 To this end, we have generated an extensive compendium of physical and functional 
linkages covering almost the entire protein-coding complement of E. coli. This led to the 
elucidation of hundreds of putative soluble multiprotein complexes and functional modules 
encompassing virtually all the many gene products currently lacking public annotations. While 
existing integrative probabilistic interaction databases like STRING (von Mering et al., 2005) and 
EcID (Andres-Leon et al., 2009) provide valuable additional binary interactions that are 
potentially useful for protein function prediction or as complementary evidence to those reported 
in this study, our machine learning strategy goes beyond describing binary interactions by 
explicitly describing the most probable biological functions of the orphans. Of particular 
note, our functional predictions and phylogenetic projections associate a sizeable 
fraction of the functional orphans with core bacterial processes, suggesting they may have 
previously eluded detection in part due to prior analytical biases.  
 Since the various methods used in this study discover different types of molecular 
relationships and each has its own intrinsic bias, complementary information was obtained 
through data integration. The limited overlap between the high-confidence physical and 
functional interaction networks presumably stems in part from the incomplete coverage 
typically achieved by high-throughput experiments and their methodological differences 
(Rajagopala et al., 2007; Yu et al., 2008). For example, certain orphans were difficult to evaluate 
by GC methods due to a lack of apparent orthologs at medium-to-high evolutionary distances, 
which hinders comparative genomic inferences. Likewise, although we performed large-scale 
tandem affinity tagging and purification under near-native physiological conditions to generate 
highly purified preparations of stable, endogenous multiprotein complexes, we did not achieve 
complete coverage of the proteome. We did not attempt to purify a large number of membrane-
associated proteins, which require specialized solubilization procedures, while the soluble 
proteins that we failed to tag or detect by mass spectrometry were presumably either of very low 
abundance or not expressed in our growth conditions.   
Comparison of our physical interaction network with analogous public datasets produced 
for other model species, such as worm, fly, yeast and even the bacterium H. pylori, revealed 
very limited (<1%) overlap. These observations are congruent with recent findings by Uetz and 
colleagues (Rajagopala et al., 2007) showing that only a third (49) of the 173 experimentally 
derived PI in the cell motility network of the spirochete Treponema pallidum that are predicted to 
occur in the ε-proteobacterium Campylobacter jejuni on the basis of orthology could subsequently 
be confirmed by targeted two-hybrid testing. The limited overlap between proteomic datasets 
presumably reflects a combination of incomplete coverage by various experimental assays, 
methodological differences and imperfect conservation. 
The observation that the intersection of functional genomics inferences with low-throughput 
curated physical interaction data is somewhat higher might be explained in two ways that are not 
mutually exclusive: first, protein-protein interactions reported in the literature based on 
traditional biochemical methods might be biased towards the most evolutionarily conserved 
multiprotein complexes, which tend to be enriched for essential components with broadly 
distributed phylogenetic profiles that are more easily and accurately predicted by GC methods; 
second, the relatively high sensitivity of the two complementary forms of protein mass 
spectrometry used in this study may have resulted in the detection of lower abundance orphan 
proteins that have previously not been studied in depth.  
The last point is consistent with the notion that different proteomic methods capture 
different physical interaction types (Yu et al., 2008). Hence, alternate proteomic methods, such 
as two-hybrid screens (Parrish et al., 2007; Rain et al., 2001; Rajagopala et al., 2007; Titz et al., 
2008) or in vivo protein-fragment complementation assays (Tarassov et al., 2008), may be 
better suited for detecting certain physical interactions currently underrepresented in our 
dataset. In a similar vein, additional functional relationships will undoubtedly be uncovered by 
different experimental and computational procedures, such as high-throughput comparative 
analysis of mutant cellular phenotypes (Baba et al., 2006), genome-wide genetic interaction 
screens (Butland et al., 2008; Typas et al., 2008), and automated text mining (Hoffmann and 
Valencia, 2004; Rzhetsky et al., 2008).  
The topological properties inherent to biological networks (e.g. their hierarchical 
organization and degree distributions) combined with incomplete interactome coverage make 
establishing definitive functional groupings difficult (Sharan et al., 2007). Our approach was to 
take into account both the correlations among functional categories and the overall topological 
structure of the integrated network to generate a more balanced probabilistic model. While 
alternate methods may provide enhanced interpretations of the organizational properties of the 
PI and GC networks, the functional enrichment and experimental validations established here 
suggest that our network-based computational inferences provide a reasonable perspective for 
exploring bacterial protein function. Similar strategies have resulted in powerful predictors of 
protein function in eukaryotes (Marcotte et al., 1999a; McDermott et al., 2005; Murali et al., 
2006; Myers and Troyanskaya, 2007; Schwikowski et al., 2000). The potential trade-off is that 
additional error or uncertainty may have occasionally been introduced by assuming functional 
similarity among more loosely connected proteins. Moreover, the probabilities associated with 
particular functional terms may not be directly comparable. Functional orphans associated with 
very well-characterized biological processes are more likely to be correctly assigned by 
computational methods (Myers and Troyanskaya, 2007) while those associated with relatively 
poorly studied proteins will tend to remain obscure. Nonetheless, they can be grouped together 
on the basis of specific PI, GC or even other functional associations and hence serve as 
functional groupings rather than isolated entities. 
 In general, the high confidence functional relationships we inferred for E. coli could be 
validated by independent experimental tests, and can be extrapolated to other bacterial species, 
including pathogens. Over 35% of the orphans have orthologs in lineages as distant as Archaea, and 
hence are likely associated with the same basic housekeeping processes we predict for E. coli, 
such as formation of the cell wall and protein synthesis. Conversely, our systematic 
comparisons also revealed some unique aspects of the orphans in the evolutionary history of E. 
coli, such as the potential fimbrial factors that appear to be restricted to Enterobacteriaceae. 
One interpretation is that orphans with limited phylogenetic distributions contribute to fine tuning 
of adaptive physiological responses upon changing environmental conditions, as previously 
suggested for peripheral metabolic genes acquired by horizontal transfer (Pal et al., 2005). 
Alternatively, some orphans might belong to well-conserved biological systems whose 
functional roles have yet to be characterized. 
4.4 MATERIALS AND METHODS 
4.4.1 PI network generation 
Large-scale SPA tagging and purifications were performed essentially as previously described 
(Butland et al., 2005; Zeghouf et al., 2004). Briefly, a DNA cassette encoding the SPA-tag and a 
selectable marker flanked by gene-specific targeting sequences was amplified by PCR using 
primers with homology to a selected locus. The cassette was then transformed and integrated 
using homologous recombination in the lysogenic E. coli strain DY330 (W3110 background), 
which harbors the highly efficient λ-phage-encoded homologous recombination enzymes exo, 
bet, and gam under the control of the temperature-sensitive cI857 repressor (the “Red” 
system), to create a C-terminal fusion with the protein of interest. Strains in which the PCR 
product had integrated were subjected to antibiotic selection, and tagged protein expression 
was confirmed by Western blotting. 
Two complementary mass spectrometry techniques (gel-based MALDI peptide mass 
fingerprinting and gel-free LCMS shotgun sequencing) were used to detect physically interacting 
proteins. Details about the large-scale strain culture, protein extraction and purification, and 
protein identification procedures are provided as supplementary protocols accompanying the 
published manuscript. Scoring of tentative PI from the LCMS and MALDI assays was conducted 
with a logistic regression model, using reference PI obtained by low-throughput experiments 
curated in the DIP, BIND and IntAct databases (Bader et al., 2003; Kerrien et al., 2007; 
Xenarios et al., 2000) as a positive training set. Our negative training set consisted of pairs of 
proteins in which one component was experimentally determined or predicted with high 
confidence to be cytoplasmic and the other to reside in the outer membrane or the periplasm 
(Diaz-Mejia et al., 2009); inner membrane proteins were discarded from this negative dataset 
since they are in physical proximity to (and hence could potentially interact with) 
cytoplasmic and periplasmic proteins. Our logistic regression procedure also took into account 
the degree of consistency of co-purifying protein pairs, balancing the tradeoff between “spoke” 
and “matrix” representation models of interactions within co-purified groups of proteins to 
decrease the false discovery rate. We then combined the scores derived from LCMS and 
MALDI into a single PI network using a previously established procedure for integrating 
probabilistic networks (von Mering et al., 2005), which assumes the reliabilities of associations 
generated by these methods are independent. To facilitate independent critical evaluation, all 
our processed interaction data is available through the website in HUPO-PSI molecular 
interaction reporting format (standard level 2.5). 
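
The integration of independently scored evidence can be illustrated with a minimal sketch. It assumes the combined confidence takes the form commonly used for independent evidence channels, 1 - Π(1 - s_i); the function name and the example scores are hypothetical, and any prior correction applied in the actual pipeline is not reproduced here.

```python
from math import prod


def combine_independent(scores):
    """Combine per-method interaction probabilities assuming independence.

    Each score is the probability that one method's evidence reflects a
    true interaction. Under independence, the pair is a false positive
    only if every method is wrong, hence 1 - product(1 - s_i).
    """
    for s in scores:
        if not 0.0 <= s <= 1.0:
            raise ValueError("scores must lie in [0, 1]")
    return 1.0 - prod(1.0 - s for s in scores)


# Hypothetical scores for one protein pair from the two MS pipelines.
lcms_score, maldi_score = 0.6, 0.5
combined = combine_independent([lcms_score, maldi_score])
print(round(combined, 3))
```

Under this assumption a pair supported by both assays always scores at least as high as its best single-assay score, which is why the independence assumption matters: correlated errors between LCMS and MALDI would make the combined value overconfident.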
4.4.2 GC network generation 
The four GC methods used to predict functional interactions among E. coli proteins were based 
on: (i) functional linkages among genes which fuse to form a single open reading frame in at 
least one other genome i.e. Gene Fusion (Enright et al., 1999); (ii) the mutual information of the 
coordinated presence or absence of pairs of genes across a set of 440 non-redundant genomes 
i.e. Phylogenetic Profiles (Moreno-Hagelsieb and Janga, 2008; Pellegrini et al., 1999); and (iii) 
the natural chromosomal association of bacterial genes in operons as detected by two 
alternative methods, namely (a) the tendency of genes forming operons to show small 
Intergenic Distances (Moreno-Hagelsieb and Collado-Vides, 2002; Salgado et al., 2000), and 
(b) the conservation of Gene Order, in which a confidence value for each adjacent pair of genes 
present in the same strand was used as indicator that those genes likely form an operon as 
compared with the conservation of adjacent genes found in opposite strands (Janga and 
Moreno-Hagelsieb, 2004). For the last two methods, subsequent Operon Rearrangements were 
also detected by genomic mapping of orthologs across 440 non-redundant bacterial genomes 
(Janga et al., 2005).  
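
As an illustration of the phylogenetic profile score, the following sketch computes the mutual information between two binary presence/absence vectors. It is a textbook formulation under stated assumptions (one 0/1 flag per genome, base-2 logarithm); the function name and toy profiles are hypothetical and no genome redundancy correction is modelled.

```python
from collections import Counter
from math import log2


def mutual_information(profile_a, profile_b):
    """Mutual information (in bits) between two presence/absence profiles.

    Each profile is a sequence of 0/1 flags, one per genome, listed in
    the same genome order.
    """
    if len(profile_a) != len(profile_b) or not profile_a:
        raise ValueError("profiles must be non-empty and of equal length")
    n = len(profile_a)
    joint = Counter(zip(profile_a, profile_b))
    count_a = Counter(profile_a)
    count_b = Counter(profile_b)
    mi = 0.0
    for (a, b), n_ab in joint.items():
        p_ab = n_ab / n
        # p_ab / (p_a * p_b) rewritten in terms of raw counts.
        mi += p_ab * log2(n_ab * n / (count_a[a] * count_b[b]))
    return mi


# Toy profiles over four genomes (hypothetical data).
gene_x = [1, 1, 0, 0]
gene_y = [1, 1, 0, 0]   # always co-occurs with gene_x
gene_z = [1, 0, 1, 0]   # distributed independently of gene_x
print(mutual_information(gene_x, gene_y), mutual_information(gene_x, gene_z))
```

Genes that are consistently gained and lost together score high, whereas genes whose distributions are unrelated score close to zero.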
 For all four GC methods, we used the BLAST BDBHs as an operational definition of 
orthology. To avoid circularity, the prediction scores of the four GC methods were benchmarked 
separately, using pairs of proteins belonging to the same metabolic pathway according to EcoCyc 
(Keseler et al., 2005) as the positive reference set and pairs of proteins in different pathways 
as negatives. A single, unified high-confidence functional association network was then constructed by 
integrating the interaction predictions generated by the four genomic context methods using 
the same scoring model (von Mering et al., 2005) used to integrate the MALDI and LCMS data.  
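
The operational definition of orthology by bidirectional best hits can be sketched as follows. The input format (a mapping from each query to a list of subject and bit-score pairs), the function name and the protein identifiers are all assumptions made for illustration; E-value and coverage thresholds, which a real analysis would apply, are not modelled, and ties are resolved by taking the first highest-scoring hit.

```python
def bidirectional_best_hits(hits_a_to_b, hits_b_to_a):
    """Return pairs (a, b) in which a and b are each other's best hit.

    Each argument maps a query protein to a list of (subject, bit_score)
    tuples, as might be parsed from tabular BLAST output.
    """
    def best(hits):
        # Keep only the top-scoring subject for every query with hits.
        return {query: max(subjects, key=lambda h: h[1])[0]
                for query, subjects in hits.items() if subjects}

    best_ab = best(hits_a_to_b)
    best_ba = best(hits_b_to_a)
    return sorted((a, b) for a, b in best_ab.items()
                  if best_ba.get(b) == a)


# Hypothetical hits between genome A ("ecoA", "ecoB") and genome B.
forward = {"ecoA": [("bsuX", 250.0), ("bsuY", 90.0)],
           "ecoB": [("bsuX", 120.0)]}
reverse = {"bsuX": [("ecoA", 245.0), ("ecoB", 118.0)],
           "bsuY": [("ecoB", 60.0)]}
print(bidirectional_best_hits(forward, reverse))
```

In this toy case only one pair is reciprocal; the second query's best hit prefers a different partner in the reverse search, so it is not called an ortholog.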
 
4.4.3 Clustering 
Protein clusters were generated from three different networks using MCL (Enright et al., 2002): 
(i) the PI network (generating protein complexes), (ii) the unified GC network (generating 
functional modules); and (iii) the function prediction/annotation profiles derived from the 
integration of PI and GC networks (generating functional neighborhoods). The core idea of MCL 
is to simulate random walks (i.e. flux) among the proteins (nodes) within each network to delimit 
regions with high flux, taking into account the connectivity and weight of interaction edges. In 
this work, edge weights correspond to the likelihood of pairwise protein interactions in each 
network. In each case, the global MCL inflation parameter, which tunes the granularity of the 
delimited clusters, was optimized by balancing the mass fraction of clusters and efficiency of 
partitions. The resulting clusters were individually assessed for functional homogeneity in terms 
of COG annotations as described previously (Loganantharaj et al., 2006). 
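
The core of the MCL procedure, alternating expansion (random-walk steps) and inflation (strengthening of strong flows), can be sketched in a few lines. This is a didactic pure-Python version, not the MCL program used in this work: the self-loop weight, convergence tolerance, pruning threshold and toy network are assumptions, and the optimization of the inflation parameter described above is not reproduced.

```python
def normalize_columns(matrix):
    """Scale every column to sum to 1 (column-stochastic matrix)."""
    n = len(matrix)
    sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    return [[matrix[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def mcl(weights, inflation=2.0, max_iter=100, tol=1e-9):
    """Cluster a symmetric weighted adjacency matrix (list of lists)."""
    n = len(weights)
    m = [[float(w) for w in row] for row in weights]
    for i in range(n):
        m[i][i] = 1.0  # self-loops (assumed weight) keep the walk aperiodic
    m = normalize_columns(m)
    for _ in range(max_iter):
        # Expansion: one more random-walk step (matrix squaring).
        expanded = [[sum(m[i][k] * m[k][j] for k in range(n))
                     for j in range(n)] for i in range(n)]
        # Inflation: raise entries to a power, then renormalize columns.
        inflated = normalize_columns(
            [[x ** inflation for x in row] for row in expanded])
        delta = max(abs(inflated[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = inflated
        if delta < tol:
            break
    # Rows that retain flow are attractors; their non-zero columns
    # are the members of the corresponding cluster.
    clusters = {frozenset(j for j in range(n) if m[i][j] > 1e-6)
                for i in range(n)}
    clusters.discard(frozenset())
    return sorted(sorted(c) for c in clusters)


def toy_network():
    """Two tightly linked triplets joined by one weak edge (hypothetical)."""
    n = 6
    w = [[0.0] * n for _ in range(n)]
    edges = [(0, 1, 0.9), (0, 2, 0.9), (1, 2, 0.9),
             (3, 4, 0.9), (3, 5, 0.9), (4, 5, 0.9), (2, 3, 0.1)]
    for a, b, p in edges:
        w[a][b] = w[b][a] = p
    return w


print(mcl(toy_network()))
```

Higher inflation values prune weak flows more aggressively and therefore yield smaller, more numerous clusters, which is the granularity trade-off that the parameter optimization addresses.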
4.4.4 Network-based function prediction and benchmarking 
Our algorithm (StepPLR) for assigning biological functions is essentially a network topology-
based method in which the functions of the orphans are predicted based on the functions of 
their associated annotated proteins in the immediate (direct) and adjacent (indirect) network 
vicinity. Briefly, a single network integrating the high-confidence PI and GC probabilistic 
networks was first created using the same scoring model (von Mering et al., 2005) used to 
integrate the PI data and the four GC networks. Then the weighted topological overlap (Zhang 
and Horvath, 2005) between each pair of protein nodes in the integrated network was calculated 
to determine the correlated functional profiles based on a penalized logistic regression model. 
Finally, a stepwise variable selection procedure was used to optimize the function profiles in the 
final logistic regression. Only functional categories with at least 15 annotated E. coli proteins were 
used in our integrated functional association network: 18 COG classes, corresponding to 
widespread bacterial protein functions; 19 biological classes from MultiFun, in which the 
proteins can have multiple annotations based on different classification criteria; and 51 
biological process classes in GO. Other representative guilt-by-association methods (e.g. 
majority-counting and chi-squared-based) were also evaluated (results not shown). 
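
The weighted topological overlap between two nodes can be sketched as below, following the general definition of Zhang and Horvath (2005): the shared-neighbor weight plus the direct edge weight, divided by the smaller of the two connectivities plus one minus the direct edge weight. The function name and toy networks are hypothetical, edge weights are assumed to lie in [0, 1], and the way these overlaps feed into the penalized logistic regression is not reproduced here.

```python
def topological_overlap(adj):
    """Weighted topological overlap matrix for a symmetric adjacency matrix.

    adj[i][j] is the edge weight in [0, 1] between nodes i and j;
    diagonal entries are ignored.
    """
    n = len(adj)
    # Connectivity k_i: total edge weight attached to node i.
    k = [sum(adj[i][u] for u in range(n) if u != i) for i in range(n)]
    tom = [[0.0] * n for _ in range(n)]
    for i in range(n):
        tom[i][i] = 1.0
        for j in range(i + 1, n):
            # Weight shared through common neighbors u (u != i, j).
            shared = sum(adj[i][u] * adj[u][j]
                         for u in range(n) if u not in (i, j))
            value = (shared + adj[i][j]) / (min(k[i], k[j]) + 1.0 - adj[i][j])
            tom[i][j] = tom[j][i] = value
    return tom


# Hypothetical unweighted path 0 - 1 - 2: nodes 0 and 2 are not linked
# directly but share neighbor 1, so their overlap is non-zero.
path = [[0, 1, 0],
        [1, 0, 1],
        [0, 1, 0]]
print(topological_overlap(path)[0][2])
```

Because the measure rewards shared neighbors, two proteins without a direct link can still receive a substantial overlap, which is how indirect network vicinity contributes to the functional profiles.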
REFERENCES 
Altaf-Ul-Amin, M., Shinbo, Y., Mihara, K., Kurokawa, K. and Kanaya, S. (2006). 
Development and implementation of an algorithm for detection of protein complexes in large 
interaction networks. BMC Bioinformatics 7, 207. 
Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W. and Lipman, 
D. J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search 
programs. Nucleic Acids Res 25, 3389-402. 
Andres-Leon, E. A., Ezkurdia, I., Garcia, B., Valencia, A. and Juan, D. (2009). EcID. A 
database for the inference of functional interactions in E. coli. Nucleic Acids Res 37, D629-
D635. 
Aoki, K., Ogata, Y. and Shibata, D. (2007). Approaches for extracting practical information 
from gene co-expression networks in plant biology. Plant Cell Physiol 48, 381-90. 
Apweiler, R. (2001). Functional information in SWISS-PROT: the basis for large-scale 
characterisation of protein sequences. Brief Bioinform 2, 9-18. 
Apweiler, R., Bairoch, A., Wu, C. H., Barker, W. C., Boeckmann, B., Ferro, S., Gasteiger, 
E., Huang, H., Lopez, R., Magrane, M. et al. (2004). UniProt: the Universal Protein 
knowledgebase. Nucleic Acids Res 32, D115-9. 
Aranda, B., Achuthan, P., Alam-Faruque, Y., Armean, I., Bridge, A., Derow, C., Feuermann, 
M., Ghanbarian, A. T., Kerrien, S., Khadake, J. et al. (2010). The IntAct molecular interaction 
database in 2010. Nucleic Acids Res 38, D525-31. 
Arifuzzaman, M., Maeda, M., Itoh, A., Nishikata, K., Takita, C., Saito, R., Ara, T., 
Nakahigashi, K., Huang, H. C., Hirai, A. et al. (2006). Large-scale identification of protein-
protein interaction of Escherichia coli K-12. Genome Res 16, 686-91. 
Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., Davis, A. P., 
Dolinski, K., Dwight, S. S., Eppig, J. T. et al. (2000). Gene ontology: tool for the unification of 
biology. The Gene Ontology Consortium. Nat Genet 25, 25-9. 
Baba, T., Ara, T., Hasegawa, M., Takai, Y., Okumura, Y., Baba, M., Datsenko, K. A., Tomita, 
M., Wanner, B. L. and Mori, H. (2006). Construction of Escherichia coli K-12 in-frame, single-
gene knockout mutants: the Keio collection. Mol Syst Biol 2, 2006.0008. 
Bader, G. D., Betel, D. and Hogue, C. W. (2003). BIND: the Biomolecular Interaction Network 
Database. Nucleic Acids Res 31, 248-50. 
Bader, G. D. and Hogue, C. W. (2003). An automated method for finding molecular complexes 
in large protein interaction networks. BMC Bioinformatics 4, 2. 
Bandyopadhyay, S., Kelley, R., Krogan, N. J. and Ideker, T. (2008). Functional maps of 
protein complexes from quantitative genetic interaction data. PLoS Comput Biol 4, e1000065. 
Barkan, A., Klipcan, L., Ostersetzer, O., Kawamura, T., Asakura, Y. and Watkins, K. P. 
(2007). The CRM domain: an RNA binding module derived from an ancient ribosome-
associated protein. RNA 13, 55-64. 
Barker, D. and Pagel, M. (2005). Predicting functional gene links from phylogenetic-statistical 
analyses of whole genomes. PLoS Comput Biol 1, e3. 
Barrett, C. L., Herring, C. D., Reed, J. L. and Palsson, B. O. (2005). The global transcriptional 
regulatory network for metabolism in Escherichia coli exhibits few dominant functional states. 
Proc Natl Acad Sci U S A 102, 19103-8. 
Beyer, A., Bandyopadhyay, S. and Ideker, T. (2007). Integrating physical and genetic maps: 
from genomes to interaction networks. Nat Rev Genet 8, 699-710. 
Bowers, P. M., Pellegrini, M., Thompson, M. J., Fierro, J., Yeates, T. O. and Eisenberg, D. 
(2004). Prolinks: a database of protein functional linkages derived from coevolution. Genome 
Biol 5, R35. 
Breitkreutz, B. J., Stark, C., Reguly, T., Boucher, L., Breitkreutz, A., Livstone, M., 
Oughtred, R., Lackner, D. H., Bahler, J., Wood, V. et al. (2008). The BioGRID Interaction 
Database: 2008 update. Nucleic Acids Res 36, D637-40. 
Bresolin, G., Neuhaus, K., Scherer, S. and Fuchs, T. M. (2006). Transcriptional analysis of 
long-term adaptation of Yersinia enterocolitica to low-temperature growth. J Bacteriol 188, 2945-
58. 
Brohee, S. and van Helden, J. (2006). Evaluation of clustering algorithms for protein-protein 
interaction networks. BMC Bioinformatics 7, 488. 
Brouwer, R. W., Kuipers, O. P. and van Hijum, S. A. (2008). The relative value of operon 
predictions. Brief Bioinform 9, 367-75. 
Brun, C., Chevenet, F., Martin, D., Wojcik, J., Guenoche, A. and Jacq, B. (2003). Functional 
classification of proteins for the prediction of cellular function from a protein-protein interaction 
network. Genome Biol 5, R6. 
Butland, G., Babu, M., Diaz-Mejia, J. J., Bohdana, F., Phanse, S., Gold, B., Yang, W., Li, J., 
Gagarinova, A. G., Pogoutse, O. et al. (2008). eSGA: E. coli synthetic genetic array analysis. 
Nat Methods 5, 789-95. 
Butland, G., Peregrin-Alvarez, J. M., Li, J., Yang, W., Yang, X., Canadien, V., Starostine, 
A., Richards, D., Beattie, B., Krogan, N. et al. (2005). Interaction network containing 
conserved and essential protein complexes in Escherichia coli. Nature 433, 531-7. 
Camon, E., Magrane, M., Barrell, D., Lee, V., Dimmer, E., Maslen, J., Binns, D., Harte, N., 
Lopez, R. and Apweiler, R. (2004). The Gene Ontology Annotation (GOA) Database: sharing 
knowledge in Uniprot with Gene Ontology. Nucleic Acids Res 32, D262-6. 
Campillos, M., von Mering, C., Jensen, L. J. and Bork, P. (2006). Identification and analysis 
of evolutionarily cohesive functional modules in protein networks. Genome Res 16, 374-82. 
Chua, H. N., Sung, W. K. and Wong, L. (2006). Exploiting indirect neighbours and topological 
weight to predict protein function from protein-protein interactions. Bioinformatics 22, 1623-30. 
Costanzo, M., Baryshnikova, A., Bellay, J., Kim, Y., Spear, E. D., Sevier, C. S., Ding, H., 
Koh, J. L., Toufighi, K., Mostafavi, S. et al. (2010). The genetic landscape of a cell. Science 327, 
425-31. 
Dandekar, T., Snel, B., Huynen, M. and Bork, P. (1998). Conservation of gene order: a 
fingerprint of proteins that physically interact. Trends Biochem Sci 23, 324-8. 
Date, S. V. and Marcotte, E. M. (2003). Discovery of uncharacterized cellular systems by 
genome-wide analysis of functional linkages. Nat Biotechnol 21, 1055-62. 
Deana, A., Celesnik, H. and Belasco, J. G. (2008). The bacterial enzyme RppH triggers 
messenger RNA degradation by 5' pyrophosphate removal. Nature 451, 355-8. 
Deng, M., Zhang, K., Mehta, S., Chen, T. and Sun, F. (2003). Prediction of protein function 
using protein-protein interaction data. J Comput Biol 10, 947-60. 
Diaz-Mejia, J. J., Babu, M. and Emili, A. (2009). Computational and experimental approaches 
to chart the Escherichia coli cell-envelope-associated proteome and interactome. FEMS 
Microbiol Rev 33, 66-97. 
Domka, J., Lee, J., Bansal, T. and Wood, T. K. (2007). Temporal gene-expression in 
Escherichia coli K-12 biofilms. Environ Microbiol 9, 332-46. 
Domka, J., Lee, J. and Wood, T. K. (2006). YliH (BssR) and YceP (BssS) regulate Escherichia 
coli K-12 biofilm formation by influencing cell signaling. Appl Environ Microbiol 72, 2449-59. 
Enright, A. J., Iliopoulos, I., Kyrpides, N. C. and Ouzounis, C. A. (1999). Protein interaction 
maps for complete genomes based on gene fusion events. Nature 402, 86-90. 
Enright, A. J., Van Dongen, S. and Ouzounis, C. A. (2002). An efficient algorithm for large-
scale detection of protein families. Nucleic Acids Res 30, 1575-84. 
Faith, J. J., Hayete, B., Thaden, J. T., Mogno, I., Wierzbowski, J., Cottarel, G., Kasif, S., 
Collins, J. J. and Gardner, T. S. (2007). Large-scale mapping and validation of Escherichia 
coli transcriptional regulation from a compendium of expression profiles. PLoS Biol 5, e8. 
Feist, A. M., Henry, C. S., Reed, J. L., Krummenacker, M., Joyce, A. R., Karp, P. D., 
Broadbelt, L. J., Hatzimanikatis, V. and Palsson, B. O. (2007). A genome-scale metabolic 
reconstruction for Escherichia coli K-12 MG1655 that accounts for 1260 ORFs and 
thermodynamic information. Mol Syst Biol 3, 121. 
Fronzes, R., Remaut, H. and Waksman, G. (2008). Architectures and biogenesis of non-
flagellar protein appendages in Gram-negative bacteria. EMBO J. 
Gaasterland, T. and Ragan, M. A. (1998). Microbial genescapes: phyletic and functional 
patterns of ORF distribution among prokaryotes. Microb Comp Genomics 3, 199-217. 
Gama-Castro, S., Jimenez-Jacinto, V., Peralta-Gil, M., Santos-Zavaleta, A., Penaloza-
Spinola, M. I., Contreras-Moreira, B., Segura-Salazar, J., Muniz-Rascado, L., Martinez-
Flores, I., Salgado, H. et al. (2008). RegulonDB (version 6.0): gene regulation model of 
Escherichia coli K-12 beyond transcription, active (experimental) annotated promoters and 
Textpresso navigation. Nucleic Acids Res 36, D120-4. 
Girgis, H. S., Liu, Y., Ryu, W. S. and Tavazoie, S. (2007). A comprehensive genetic 
characterization of bacterial motility. PLoS Genet 3, 1644-60. 
Godzik, A., Jambon, M. and Friedberg, I. (2007). Computational protein function prediction: 
are we making progress? Cell Mol Life Sci 64, 2505-11. 
Gotoh, O. (1999). Multiple sequence alignment: algorithms and applications. Adv Biophys 36, 
159-206. 
Gunsalus, K. C., Ge, H., Schetter, A. J., Goldberg, D. S., Han, J. D., Hao, T., Berriz, G. F., 
Bertin, N., Huang, J., Chuang, L. S. et al. (2005). Predictive models of molecular machines 
involved in Caenorhabditis elegans early embryogenesis. Nature 436, 861-5. 
Hahn, E., Wild, P., Hermanns, U., Sebbel, P., Glockshuber, R., Haner, M., Taschner, N., 
Burkhard, P., Aebi, U. and Muller, S. A. (2002). Exploring the 3D molecular architecture of 
Escherichia coli type 1 pili. J Mol Biol 323, 845-57. 
Han, L., Cui, J., Lin, H., Ji, Z., Cao, Z., Li, Y. and Chen, Y. (2006). Recent progresses in the 
application of machine learning approach for predicting protein functional class independent of 
sequence similarity. Proteomics 6, 4023-37. 
Hartwell, L. H., Hopfield, J. J., Leibler, S. and Murray, A. W. (1999). From molecular to 
modular cell biology. Nature 402, C47-52. 
Hawkins, T. and Kihara, D. (2007). Function prediction of uncharacterized proteins. J 
Bioinform Comput Biol 5, 1-30. 
Hishigaki, H., Nakai, K., Ono, T., Tanigami, A. and Takagi, T. (2001). Assessment of 
prediction accuracy of protein function from protein--protein interaction data. Yeast 18, 523-31. 
Hoffmann, R. and Valencia, A. (2004). A gene network for navigating the literature. Nat Genet 
36, 664. 
Hu, P., Janga, S. C., Babu, M., Diaz-Mejia, J. J., Butland, G., Yang, W., Pogoutse, O., Guo, 
X., Phanse, S., Wong, P. et al. (2009). Global functional atlas of Escherichia coli 
encompassing previously uncharacterized proteins. PLoS Biol 7, e96. 
Huber, W., Carey, V. J., Long, L., Falcon, S. and Gentleman, R. (2007). Graphs in molecular 
biology. BMC Bioinformatics 8 Suppl 6, S8. 
Ideker, T. and Sharan, R. (2008). Protein networks in disease. Genome Res 18, 644-52. 
Janga, S. C., Collado-Vides, J. and Moreno-Hagelsieb, G. (2005). Nebulon: a system for the 
inference of functional relationships of gene products from the rearrangement of predicted 
operons. Nucleic Acids Res 33, 2521-30. 
Janga, S. C. and Moreno-Hagelsieb, G. (2004). Conservation of adjacency as evidence of 
paralogous operons. Nucleic Acids Res 32, 5392-7. 
Jensen, L. J., Kuhn, M., Stark, M., Chaffron, S., Creevey, C., Muller, J., Doerks, T., Julien, 
P., Roth, A., Simonovic, M. et al. (2009). STRING 8--a global view on proteins and their 
functional interactions in 630 organisms. Nucleic Acids Res 37, D412-6. 
Jiang, M., Datta, K., Walker, A., Strahler, J., Bagamasbad, P., Andrews, P. C. and 
Maddock, J. R. (2006). The Escherichia coli GTPase CgtAE is involved in late steps of large 
ribosome assembly. J Bacteriol 188, 6757-70. 
Jiang, M., Sullivan, S. M., Walker, A. K., Strahler, J. R., Andrews, P. C. and Maddock, J. R. 
(2007). Identification of novel Escherichia coli ribosome-associated proteins using isobaric tags 
and multidimensional protein identification techniques. J Bacteriol 189, 3434-44. 
Johansson, L. and Liden, G. (2006). Transcriptome analysis of a shikimic acid producing 
strain of Escherichia coli W3110 grown under carbon- and phosphate-limited conditions. J 
Biotechnol 126, 528-45. 
Joyce, A. R., Reed, J. L., White, A., Edwards, R., Osterman, A., Baba, T., Mori, H., Lesely, 
S. A., Palsson, B. O. and Agarwalla, S. (2006). Experimental and computational assessment 
of conditionally essential genes in Escherichia coli. J Bacteriol 188, 8259-71. 
Kanehisa, M. and Goto, S. (2000). KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic 
Acids Res 28, 27-30. 
Karaoz, U., Murali, T. M., Letovsky, S., Zheng, Y., Ding, C., Cantor, C. R. and Kasif, S. 
(2004). Whole-genome annotation by using evidence integration in functional-linkage networks. 
Proc Natl Acad Sci U S A 101, 2888-93. 
Kerrien, S., Alam-Faruque, Y., Aranda, B., Bancarz, I., Bridge, A., Derow, C., Dimmer, E., 
Feuermann, M., Friedrichsen, A., Huntley, R. et al. (2007). IntAct--open source resource for 
molecular interaction data. Nucleic Acids Res 35, D561-5. 
Keseler, I. M., Collado-Vides, J., Gama-Castro, S., Ingraham, J., Paley, S., Paulsen, I. T., 
Peralta-Gil, M. and Karp, P. D. (2005). EcoCyc: a comprehensive database resource for 
Escherichia coli. Nucleic Acids Res 33, D334-7. 
King, A. D., Przulj, N. and Jurisica, I. (2004). Protein complex prediction via cost-based 
clustering. Bioinformatics 20, 3013-20. 
Kosinski, J., Feder, M. and Bujnicki, J. M. (2005). The PD-(D/E)XK superfamily revisited: 
identification of new members among proteins involved in DNA metabolism and functional 
predictions for domains of (hitherto) unknown function. BMC Bioinformatics 6, 172. 
Kretschmann, E., Fleischmann, W. and Apweiler, R. (2001). Automatic rule generation for 
protein annotation with the C4.5 data mining algorithm applied on SWISS-PROT. Bioinformatics 
17, 920-6. 
Krogan, N. J., Cagney, G., Yu, H., Zhong, G., Guo, X., Ignatchenko, A., Li, J., Pu, S., Datta, 
N., Tikuisis, A. P. et al. (2006). Global landscape of protein complexes in the yeast 
Saccharomyces cerevisiae. Nature 440, 637-43. 
Lasko, P. (2000). The Drosophila melanogaster genome: translation factors and RNA binding 
proteins. J Cell Biol 150, F51-6. 
Lee, I., Date, S. V., Adai, A. T. and Marcotte, E. M. (2004). A probabilistic functional network of 
yeast genes. Science 306, 1555-8. 
Lee, I., Lehner, B., Crombie, C., Wong, W., Fraser, A. G. and Marcotte, E. M. (2008). A 
single gene network accurately predicts phenotypic effects of gene perturbation in 
Caenorhabditis elegans. Nat Genet 40, 181-8. 
Letovsky, S. and Kasif, S. (2003). Predicting protein function from protein/protein interaction 
data: a probabilistic approach. Bioinformatics 19 Suppl 1, i197-204. 
Linghu, B., Snitkin, E. S., Holloway, D. T., Gustafson, A. M., Xia, Y. and DeLisi, C. (2008). 
High-precision high-coverage functional inference from integrated data sources. BMC 
Bioinformatics 9, 119. 
Loganantharaj, R., Cheepala, S. and Clifford, J. (2006). Metric for Measuring the 
Effectiveness of Clustering of DNA Microarray Expression. BMC Bioinformatics 7 Suppl 2, S5. 
Luo, F., Yang, Y., Zhong, J., Gao, H., Khan, L., Thompson, D. K. and Zhou, J. (2007). 
Constructing gene co-expression networks and predicting functions of unknown genes by 
random matrix theory. BMC Bioinformatics 8, 299. 
Madera, M., Vogel, C., Kummerfeld, S. K., Chothia, C. and Gough, J. (2004). The 
SUPERFAMILY database in 2004: additions and improvements. Nucleic Acids Res 32, D235-9. 
Marcotte, E. M., Pellegrini, M., Ng, H. L., Rice, D. W., Yeates, T. O. and Eisenberg, D. 
(1999a). Detecting protein function and protein-protein interactions from genome sequences. 
Science 285, 751-3. 
Marcotte, E. M., Pellegrini, M., Thompson, M. J., Yeates, T. O. and Eisenberg, D. (1999b). A 
combined algorithm for genome-wide prediction of protein function. Nature 402, 83-6. 
Massjouni, N., Rivera, C. G. and Murali, T. M. (2006). VIRGO: computational prediction of 
gene functions. Nucleic Acids Res 34, W340-4. 
McDermott, J., Bumgarner, R. and Samudrala, R. (2005). Functional annotation from 
predicted protein interaction networks. Bioinformatics 21, 3217-26. 
Moreno-Hagelsieb, G. and Collado-Vides, J. (2002). A powerful non-homology method for the 
prediction of operons in prokaryotes. Bioinformatics 18 Suppl 1, S329-36. 
Moreno-Hagelsieb, G. and Janga, S. C. (2008). Operons and the effect of genome 
redundancy in deciphering functional relationships using phylogenetic profiles. Proteins 70, 344-52. 
Murali, T. M., Wu, C. J. and Kasif, S. (2006). The art of gene function prediction. Nat 
Biotechnol 24, 1474-5; author reply 1475-6. 
Myers, C. L., Robson, D., Wible, A., Hibbs, M. A., Chiriac, C., Theesfeld, C. L., Dolinski, K. 
and Troyanskaya, O. G. (2005). Discovery of biological networks from diverse functional 
genomic data. Genome Biol 6, R114. 
Myers, C. L. and Troyanskaya, O. G. (2007). Context-sensitive data integration and prediction 
of biological networks. Bioinformatics 23, 2322-30. 
Nabieva, E., Jim, K., Agarwal, A., Chazelle, B. and Singh, M. (2005). Whole-proteome 
prediction of protein function via graph-theoretic analysis of interaction maps. Bioinformatics 21 
Suppl 1, i302-10. 
Nuccio, S. P. and Baumler, A. J. (2007). Evolution of the chaperone/usher assembly pathway: 
fimbrial classification goes Greek. Microbiol Mol Biol Rev 71, 551-75. 
Overbeek, R., Fonstein, M., D'Souza, M., Pusch, G. D. and Maltsev, N. (1999). The use of 
gene clusters to infer functional coupling. Proc Natl Acad Sci U S A 96, 2896-901. 
Pal, C., Papp, B. and Lercher, M. J. (2005). Adaptive evolution of bacterial metabolic networks 
by horizontal gene transfer. Nat Genet 37, 1372-5. 
Parrish, J. R., Yu, J., Liu, G., Hines, J. A., Chan, J. E., Mangiola, B. A., Zhang, H., Pacifico, 
S., Fotouhi, F., DiRita, V. J. et al. (2007). A proteome-wide protein interaction map for 
Campylobacter jejuni. Genome Biol 8, R130. 
Pearson, W. R. (1995). Comparison of methods for searching protein sequence databases. 
Protein Sci 4, 1145-60. 
Pellegrini, M., Marcotte, E. M., Thompson, M. J., Eisenberg, D. and Yeates, T. O. (1999). 
Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc 
Natl Acad Sci U S A 96, 4285-8. 
Pereira-Leal, J. B., Enright, A. J. and Ouzounis, C. A. (2004). Detection of functional modules 
from protein interaction networks. Proteins 54, 49-57. 
Procter, J. B., Thompson, J., Letunic, I., Creevey, C., Jossinet, F. and Barton, G. J. (2010). 
Visualization of multiple alignments, phylogenies and gene family evolution. Nat Methods 7, 
S16-25. 
Pruitt, K. D., Tatusova, T. and Maglott, D. R. (2005). NCBI Reference Sequence (RefSeq): a 
curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids 
Res 33 Database Issue, D501-4. 
Rain, J. C., Selig, L., De Reuse, H., Battaglia, V., Reverdy, C., Simon, S., Lenzen, G., Petel, 
F., Wojcik, J., Schachter, V. et al. (2001). The protein-protein interaction map of Helicobacter 
pylori. Nature 409, 211-5. 
Rajagopala, S. V., Titz, B., Goll, J., Parrish, J. R., Wohlbold, K., McKevitt, M. T., Palzkill, T., 
Mori, H., Finley, R. L., Jr. and Uetz, P. (2007). The protein network of bacterial motility. Mol 
Syst Biol 3, 128. 
Reguly, T., Breitkreutz, A., Boucher, L., Breitkreutz, B. J., Hon, G. C., Myers, C. L., 
Parsons, A., Friesen, H., Oughtred, R., Tong, A. et al. (2006). Comprehensive curation and 
analysis of global interaction networks in Saccharomyces cerevisiae. J Biol 5, 11. 
Rentzsch, R. and Orengo, C. A. (2009). Protein function prediction--the power of multiplicity. 
Trends Biotechnol 27, 210-9. 
Riley, M. (1993). Functions of the gene products of Escherichia coli. Microbiol Rev 57, 862-952. 
Riley, M., Abe, T., Arnaud, M. B., Berlyn, M. K., Blattner, F. R., Chaudhuri, R. R., Glasner, 
J. D., Horiuchi, T., Keseler, I. M., Kosuge, T. et al. (2006). Escherichia coli K-12: a 
cooperatively developed annotation snapshot--2005. Nucleic Acids Res 34, 1-9. 
Rison, S. C., Hodgman, T. C. and Thornton, J. M. (2000). Comparison of functional 
annotation schemes for genomes. Funct Integr Genomics 1, 56-69. 
Rives, A. W. and Galitski, T. (2003). Modular organization of cellular networks. Proc Natl Acad 
Sci U S A 100, 1128-33. 
Rogozin, I. B., Makarova, K. S., Murvai, J., Czabarka, E., Wolf, Y. I., Tatusov, R. L., 
Szekely, L. A. and Koonin, E. V. (2002). Connected gene neighborhoods in prokaryotic 
genomes. Nucleic Acids Res 30, 2212-23. 
Ruan, J., Dean, A. K. and Zhang, W. (2010). A general co-expression network-based approach to 
gene expression analysis: comparison and applications. BMC Syst Biol 4, 8. 
Rudd, K. E. (1998). Linkage map of Escherichia coli K-12, edition 10: the physical map. 
Microbiol Mol Biol Rev 62, 985-1019. 
Rzhetsky, A., Seringhaus, M. and Gerstein, M. (2008). Seeking a new biology through text 
mining. Cell 134, 9-13. 
Sabina, J., Dover, N., Templeton, L. J., Smulski, D. R., Soll, D. and LaRossa, R. A. (2003). 
Interfering with different steps of protein synthesis explored by transcriptional profiling of 
Escherichia coli K-12. J Bacteriol 185, 6158-70. 
Salgado, H., Moreno-Hagelsieb, G., Smith, T. F. and Collado-Vides, J. (2000). Operons in 
Escherichia coli: genomic analyses and predictions. Proc Natl Acad Sci U S A 97, 6652-7. 
Samanta, M. P. and Liang, S. (2003). Predicting protein functions from redundancies in large-
scale protein interaction networks. Proc Natl Acad Sci U S A 100, 12579-83. 
Schwikowski, B., Uetz, P. and Fields, S. (2000). A network of protein-protein interactions in 
yeast. Nat Biotechnol 18, 1257-61. 
Selinger, D. W., Saxena, R. M., Cheung, K. J., Church, G. M. and Rosenow, C. (2003). 
Global RNA half-life analysis in Escherichia coli reveals positional patterns of transcript 
degradation. Genome Res 13, 216-23. 
Serres, M. H., Goswami, S. and Riley, M. (2004). GenProtEC: an updated and improved 
analysis of functions of Escherichia coli K-12 proteins. Nucleic Acids Res 32, D300-2. 
Serres, M. H. and Riley, M. (2000). MultiFun, a multifunctional classification scheme for 
Escherichia coli K-12 gene products. Microb Comp Genomics 5, 205-22. 
Sharan, R. and Ideker, T. (2006). Modeling cellular machinery through biological network 
comparison. Nat Biotechnol 24, 427-33. 
Sharan, R., Ulitsky, I. and Shamir, R. (2007). Network-based prediction of protein function. 
Mol Syst Biol 3, 88. 
Shoemaker, B. A. and Panchenko, A. R. (2007a). Deciphering protein-protein interactions. 
Part I. Experimental techniques and databases. PLoS Comput Biol 3, e42. 
Shoemaker, B. A. and Panchenko, A. R. (2007b). Deciphering protein-protein interactions. 
Part II. Computational methods to predict protein and domain interaction partners. PLoS 
Comput Biol 3, e43. 
Slonim, N., Elemento, O. and Tavazoie, S. (2006). Ab initio genotype-phenotype association 
reveals intrinsic modularity in genetic networks. Mol Syst Biol 2, 2006 0005. 
Snel, B., Bork, P. and Huynen, M. A. (2002). The identification of functional modules from the 
genomic association of genes. Proc Natl Acad Sci U S A 99, 5890-5. 
Spirin, V. and Mirny, L. A. (2003). Protein complexes and functional modules in molecular 
networks. Proc Natl Acad Sci U S A 100, 12123-8. 
Suthram, S., Shlomi, T., Ruppin, E., Sharan, R. and Ideker, T. (2006). A direct comparison of 
protein interaction confidence assignment schemes. BMC Bioinformatics 7, 360. 
Tarassov, K., Messier, V., Landry, C. R., Radinovic, S., Molina, M. M., Shames, I., 
Malitskaya, Y., Vogel, J., Bussey, H. and Michnick, S. W. (2008). An in vivo map of the yeast 
protein interactome. Science 320, 1465-70. 
Tatusov, R. L., Koonin, E. V. and Lipman, D. J. (1997). A genomic perspective on protein 
families. Science 278, 631-7. 
Tipton, K. F. (1994). Nomenclature Committee of the International Union of Biochemistry and 
Molecular Biology (NC-IUBMB). Enzyme nomenclature. Recommendations 1992. Supplement: 
corrections and additions. Eur J Biochem 223, 1-5. 
Titz, B., Rajagopala, S. V., Goll, J., Hauser, R., McKevitt, M. T., Palzkill, T. and Uetz, P. 
(2008). The binary protein interactome of Treponema pallidum--the syphilis spirochete. PLoS 
ONE 3, e2292. 
Typas, A., Nichols, R. J., Siegele, D. A., Shales, M., Collins, S. R., Lim, B., Braberg, H., 
Yamamoto, N., Takeuchi, R., Wanner, B. L. et al. (2008). High-throughput, quantitative 
analyses of genetic interactions in E. coli. Nat Methods 5, 781. 
Vazquez, A., Flammini, A., Maritan, A. and Vespignani, A. (2003). Global protein function 
prediction from protein-protein interaction networks. Nat Biotechnol 21, 697-700. 
Vlasblom, J., Wu, S., Pu, S., Superina, M., Liu, G., Orsi, C. and Wodak, S. J. (2006). 
GenePro: a Cytoscape plug-in for advanced visualization and analysis of interaction networks. 
Bioinformatics 22, 2178-9. 
von Mering, C., Jensen, L. J., Snel, B., Hooper, S. D., Krupp, M., Foglierini, M., Jouffre, N., 
Huynen, M. A. and Bork, P. (2005). STRING: known and predicted protein-protein 
associations, integrated and transferred across organisms. Nucleic Acids Res 33, D433-7. 
Wang, K., Narayanan, M., Zhong, H., Tompa, M., Schadt, E. E. and Zhu, J. (2009). Meta-
analysis of inter-species liver co-expression networks elucidates traits associated with common 
human diseases. PLoS Comput Biol 5, e1000616. 
Whisstock, J. C. and Lesk, A. M. (2003). Prediction of protein function from protein sequence 
and structure. Q Rev Biophys 36, 307-40. 
Xenarios, I., Rice, D. W., Salwinski, L., Baron, M. K., Marcotte, E. M. and Eisenberg, D. 
(2000). DIP: the database of interacting proteins. Nucleic Acids Res 28, 289-91. 
Yao, Z. and Ruzzo, W. L. (2006). A regression-based K nearest neighbor algorithm for gene 
function prediction from heterogeneous data. BMC Bioinformatics 7 Suppl 1, S11. 
Yellaboina, S., Goyal, K. and Mande, S. C. (2007). Inferring genome-wide functional linkages 
in E. coli by combining improved genome context methods: comparison with high-throughput 
experimental data. Genome Res 17, 527-35. 
Yu, H., Braun, P., Yildirim, M. A., Lemmens, I., Venkatesan, K., Sahalie, J., Hirozane-
Kishikawa, T., Gebreab, F., Li, N., Simonis, N. et al. (2008). High-Quality Binary Protein 
Interaction Map of the Yeast Interactome Network. Science 322, 104-10. 
Zeghouf, M., Li, J., Butland, G., Borkowska, A., Canadien, V., Richards, D., Beattie, B., 
Emili, A. and Greenblatt, J. F. (2004). Sequential Peptide Affinity (SPA) system for the 
identification of mammalian and bacterial protein complexes. J Proteome Res 3, 463-8. 
Zhang, B. and Horvath, S. (2005). A general framework for weighted gene co-expression 
network analysis. Stat Appl Genet Mol Biol 4, Article17. 
Zhang, B., Park, B. H., Karpinets, T. and Samatova, N. F. (2008). From pull-down data to 
protein interaction networks and complexes with biological relevance. Bioinformatics 24, 979-86. 
Zhao, X. M., Chen, L. and Aihara, K. (2008a). Protein function prediction with high-throughput 
data. Amino Acids 35, 517-30. 
Zhao, X. M., Chen, L. and Aihara, K. (2008b). Protein function prediction with the shortest path 
in functional linkage graph and boosting. Int J Bioinform Res Appl 4, 375-84. 
 
 
  
 
Post-transcriptional networks controlled by RNA-binding proteins                                                                        5-1 
 
 
 
 
 
 
 
5 Structure and dynamics of post-transcriptional 
regulatory networks directed by RNA-binding proteins 
 
 
 
CONTENTS OF CHAPTER 5 
 
OUTLINE 
CONTRIBUTION TO THE WORK IN THIS CHAPTER 
5.1 INTRODUCTION 
5.2 RESULTS 
5.2.1 RNA BINDING PROTEINS AND POST-TRANSCRIPTIONAL REGULATION 
5.2.2 METHODS TO IDENTIFY RBPS AND THEIR TARGETS 
5.2.3 RBPS AND POST-TRANSCRIPTIONAL OPERONS 
5.2.4 POST-TRANSCRIPTIONAL NETWORK FORMED BY RBPS 
5.2.5 EXPRESSION DYNAMICS OF RBPS IN POST-TRANSCRIPTIONAL NETWORKS 
5.2.5.1 RBPS SHOW HIGH ABUNDANCE AND TIGHT REGULATION AT THE PROTEIN LEVEL 
5.2.5.2 THE NUMBER OF DISTINCT TARGETS BOUND BY A RBP IS CORRELATED WITH ITS CELLULAR ABUNDANCE 
5.2.5.3 RBPS BOUND TO MANY RNA TARGETS ARE LESS FREQUENTLY DEGRADED AND TIGHTLY CONTROLLED AT PROTEIN LEVEL 
5.3 DISCUSSION & CONCLUSION 
5.4 MATERIALS AND METHODS 
5.4.1 DATA ON RNA-BINDING PROTEINS IN S. CEREVISIAE AND THEIR INTERACTIONS 
5.4.2 ANALYSIS OF THE STRUCTURE AND PROPERTIES OF POST-TRANSCRIPTIONAL REGULATORY NETWORK 
5.4.3 DATA FOR COMPARATIVE ANALYSIS OF EXPRESSION DYNAMICS 
5.4.4 COMPARISON OF THE REGULATORY PROPERTIES OF RBPS WITH OTHER PROTEIN CODING GENES 
5.4.5 ANALYSIS OF THE RELATIONSHIP BETWEEN THE NUMBER OF TARGETS OF A RBP AND ITS DYNAMIC PROPERTIES 
REFERENCES 
 
 
OUTLINE 
Gene expression is a highly controlled process that is regulated at several levels in 
eukaryotic organisms. Although messenger RNAs have traditionally been viewed as passive 
molecules in the pathway from transcription to translation, there is increasing evidence that their 
metabolism is controlled by a class of proteins called RNA-binding proteins (RBPs). In this 
chapter, I provide an overview of recent developments in our understanding of the repertoire 
of RBPs across diverse model systems and discuss the approaches currently available for the 
construction of the post-transcriptional networks governed by them. I also present the first analysis 
of the network properties of a post-transcriptional system in a model eukaryote using currently 
available data, and discuss the implications of understanding the dynamic properties of this 
important class of regulatory molecules as more data detailing their dynamic, spatial and 
tissue-specific maps across diverse model systems accumulate. I argue that such developments 
would not only allow us to gain a deeper understanding of regulation at a level that has been 
under-appreciated over the past decades but would also allow us to use newly developed 
high-throughput approaches to interrogate the prevalence of these phenomena in different states and 
thereby study their relevance to physiology and disease across organisms.  
CONTRIBUTION TO THE WORK IN THIS CHAPTER 
Please note that the work presented in this chapter is drawn from the following two publications. 
My contribution excludes the organization of the post-transcriptional network in yeast and the 
calculations performed to understand the expression dynamics of RBPs, which were all 
performed by Nitish Mittal. I performed all other analyses. 
 
1) Structure and dynamics of post-transcriptional network directed by RNA-binding proteins 
Sarath Chandra Janga and Nitish Mittal 
Invited book chapter for Landes Bioscience Press for an edited book on “RNA infrastructure: 
RNA processing and regulatory networks” 
2) Dissecting the expression dynamics of RNA-binding proteins in post-transcriptional regulatory 
networks 
Nitish Mittal, Nilanjan Roy, M. Madan Babu and Sarath Chandra Janga 
Proc. Natl. Acad. Sci. U S A. 106(48): 20300-05, 2009 
 
5.1 INTRODUCTION 
Gene expression is a highly regulated process that is controlled at several levels. In eukaryotes, 
control of gene expression first occurs at the level of transcription, where transcription factors 
regulate the synthesis of RNA from specific genes in response to different internal and external 
stimuli. At the protein level, several post-translational modifications, such as phosphorylation 
by kinases and ubiquitination by ubiquitin ligases, are known to spatially and temporally control 
the availability of functional protein products within the cell. However, a much less well understood 
level of gene expression regulation, which lies between these two layers, is the 
post-transcriptional control of RNAs. It is now increasingly recognized that this level is controlled by 
numerous factors, with the RNA-binding proteins (RBPs) being major players (Glisovic et al., 
2008; Keene, 2007; Mata et al., 2005). Intricate co-ordination of regulation across 
these three layers is therefore important for finely controlling the flow of genetic information from 
genes to proteins in different conditions. Indeed, changes in gene expression due to aberrations 
at any of these three levels have been shown to be responsible for a number of 
disorders (Cookson et al., 2009; Cooper et al., 2009; Feinberg and Tycko, 2004; Lukong et al., 
2008; Nica and Dermitzakis, 2008).  
Development of DNA microarray technology has made it possible to measure the 
expression of each annotated gene at the transcript level. Indeed, this technique has been the 
high-throughput approach of choice to efficiently characterize the transcriptomes of several 
model organisms. One common assumption in DNA microarray experiments is that the level of 
mRNA of a particular gene reflects the amount of its protein and that there is little regulation at 
the post-transcriptional level. Studies comparing high-throughput data on mRNA and protein 
abundances challenge this notion, indicating that there is only a very weak correlation between 
the number of transcripts and protein products of a gene (Gygi et al., 1999; Washburn et al., 
2003). This suggests that the regulation of gene expression at the post-transcriptional level is 
predominant. For instance, in the eukaryotic pathogen Trypanosoma cruzi, it is well known that 
gene expression is primarily controlled at the post-transcriptional level through RNA-binding 
proteins (RBPs) (Noe et al., 2008). These studies suggest an extensive role for 
post-transcriptional regulation in controlling gene expression in eukaryotes (Campbell et al., 2003; 
Foth et al., 2008). 
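The comparison of transcript and protein abundances described above can be illustrated with a rank correlation. The following is a minimal sketch, not the procedure used in the cited studies: the choice of Spearman's coefficient, the function names and the toy abundance values are all assumptions made for illustration.

```python
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical abundances (arbitrary units); gene names are placeholders.
mrna = {"g1": 12.0, "g2": 3.0, "g3": 45.0, "g4": 7.0}
protein = {"g1": 900.0, "g2": 1500.0, "g3": 2000.0, "g5": 50.0}

# Only genes measured in both data sets can be compared.
shared = sorted(mrna.keys() & protein.keys())
rho = spearman([mrna[g] for g in shared], [protein[g] for g in shared])
```

A rank-based coefficient is used in this sketch because abundance measurements typically span several orders of magnitude; a coefficient close to zero would correspond to the weak agreement between transcript and protein levels discussed in the text.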
In eukaryotes, transcription and translation occur in different compartments. This allows 
for a plethora of options to control RNA at the post-transcriptional level, including splicing, 
polyadenylation, transport, stability, localization and translational control (Glisovic et al., 
2008; Keene, 2007). Although some early studies revealed the involvement of RBPs in the 
transport of mRNA from the nucleus to the site of translation, increasing evidence now 
suggests that RBPs regulate almost all of the post-transcriptional steps shown in Figure 5-1A. 
For example, in humans, the Nova protein is associated with splicing (Ule et al., 2003), PUF family 
proteins have been shown to play an important role during Caenorhabditis elegans oogenesis 
(Lublin and Evans, 2007), the Tap protein, like its yeast homolog Mex67, was reported as a bona 
fide mRNA nuclear export factor (Gruter et al., 1998), Puf3p in yeast was shown to be 
responsible for localization of mitochondrial transcripts (Saint-Georges et al., 2008) and Pab1 
was reported to regulate the initiation of translation (Kessler and Sachs, 1998). The 
extensive role of RBPs in post-transcriptional control of cellular processes has been reviewed 
by several groups (Glisovic et al., 2008; Keene, 2007; Lukong et al., 2008; Mata et al., 2005). In 
yeast alone, I found that the known RBPs (see Methods) are involved in multiple cellular 
processes and components based on Gene Ontology analysis. All these aspects highlight the 
importance of RBPs in regulating gene expression at the post-transcriptional level.  
Due to their central role in controlling gene expression at the post-transcriptional level, 
altered expression of, or mutations in, either RBPs or their RNA targets (i.e., the transcripts 
which physically associate with the RBP) have been reported to be the cause of several human 
diseases such as muscular atrophies, neurological disorders and cancer (Cooper et al., 2009; 
Kim et al., 2009; Lukong et al., 2008; Musunuru, 2003). In particular, disorders such as 
myotonic dystrophy (DM) and oculopharyngeal muscular dystrophy (OPMD) have been 
attributed to RNA gain-of-function: a CUG repeat expansion in myotonic 
dystrophy protein kinase (DMPK) in the case of DM (Musunuru, 2003) and a GCG repeat expansion in exon 1 of the 
RBP PABPN1 in the case of OPMD (Lukong et al., 2008). On the other hand, 
diseases like paraneoplastic opsoclonus-myoclonus ataxia (POMA) and spinal muscular atrophy (SMA) have 
been reported to be due to loss of function of RBPs (Lukong et al., 2008), suggesting that 
mutations in either an RBP or any of its interacting RNA target sequences can lead to extensive 
variations in their expression patterns and result in a number of diseases. In addition to the 
fitness defects that variations in RBPs can bring about in cells, it has recently been shown in 
yeast that RBPs form an important class of prionogenic proteins (Alberti et al., 2009).  
 
 
Figure 5-1: A) Schematic diagram showing the extensive role of RBPs in various post-transcriptional 
processes at different locations in eukaryotic cells. Circled numbers indicate the processes in which RBPs 
are involved. RBPs are major players in splicing pre-mRNAs into mature mRNAs in the nucleus, which are 
then exported into the cytoplasm by various other RBPs. In addition, RBPs are responsible for the 
localization of mRNAs to distinct sub-cellular compartments such as the mitochondria. In the cytoplasm, 
RBPs are also involved in governing the stability of transcripts by binding the substrate RNAs and in 
controlling the translation of mRNAs into the corresponding protein products. For this reason, RBPs have 
been found to be key players, directly or indirectly responsible for several disorders through the 
changes in regulation they bring about at the post-transcriptional level. B) In this study, an analysis of 
the sequence properties of the RBPs and the structure of the post-transcriptional network formed by 
RBPs is presented, followed by a detailed analysis of the expression dynamics of RBPs at two distinct 
levels. The first involved RBPs as a functional class: we compared the properties of RBPs with those of 
the rest of the protein-coding genes in the entire genome of Saccharomyces cerevisiae, a comparison 
of 561 RBPs against 5685 non-RBPs. Second, we studied the 
relationship between an RBP’s connectivity, defined as the number of target mRNAs bound by 
a given RBP, and its transcript and protein stability, transcript and protein expression, rate of translation 
and expression noise. 
 
All these observations raise two questions: are RBPs finely controlled in terms of their 
expression patterns, and are there constraints on their expression patterns depending on the 
number of distinct RNA targets they control? To address this, in what follows I present an 
overview of the analysis that was performed on the post-transcriptional network formed by RBPs in 
the yeast S. cerevisiae at the two distinct levels shown in Figure 5-1B. The first involved asking 
whether RBPs as a group show distinct dynamic properties in comparison to non-RBPs in the 
whole genome. The second involved understanding the constraints placed on the dynamic 
properties of RBPs in relation to the number of distinct transcripts controlled by them. Our 
analysis at the first level revealed that RBPs, as a functional class, are rapidly turned over (i.e., 
less stable) at the transcript level and are tightly controlled at the protein level. Analysis of the 
post-transcriptional network formed by RBPs indicated that highly connected RBPs are more 
abundant and ubiquitously present within the cell. 
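The two levels of analysis can be sketched in a few lines of code. This is only an illustration under stated assumptions: the function names, the placeholder identifiers and the toy half-life values are hypothetical and are not taken from the data sets analysed in this chapter. Connectivity follows the definition used here, namely the number of distinct target transcripts bound by a given RBP.

```python
from collections import defaultdict
from statistics import median
from typing import Iterable


def rbp_connectivity(interactions: Iterable[tuple[str, str]]) -> dict[str, int]:
    """Map each RBP to the number of distinct transcripts it binds."""
    targets: dict[str, set[str]] = defaultdict(set)
    for rbp, transcript in interactions:
        targets[rbp].add(transcript)  # a set ignores repeated observations
    return {rbp: len(bound) for rbp, bound in targets.items()}


def split_by_class(values: dict[str, float],
                   rbps: set[str]) -> tuple[list[float], list[float]]:
    """Split a gene -> measurement map into RBP and non-RBP value lists."""
    rbp_values = [v for gene, v in values.items() if gene in rbps]
    other_values = [v for gene, v in values.items() if gene not in rbps]
    return rbp_values, other_values


# Hypothetical RBP-transcript edges; one edge is deliberately duplicated.
edges = [("RBP1", "t1"), ("RBP1", "t2"), ("RBP1", "t1"), ("RBP2", "t3")]
connectivity = rbp_connectivity(edges)

# Hypothetical transcript half-lives (minutes) for RBPs and other genes.
half_life = {"RBP1": 8.0, "RBP2": 12.0, "g3": 20.0, "g4": 25.0, "g5": 30.0}
rbp_values, other_values = split_by_class(half_life, set(connectivity))
rbp_median, other_median = median(rbp_values), median(other_values)
```

The first level of analysis then amounts to comparing the two value lists for each dynamic property, and the second to relating each property to the connectivity of the corresponding RBP. Which summary statistic or significance test is appropriate is left open in this sketch; the median is used only to keep the example short.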
In this chapter, I attempt to provide a comprehensive overview of, and preliminary insights 
into, the quickly developing area of post-transcriptional regulatory networks formed by RBPs. The 
work is organized into three major sections: the sequence attributes and functional 
processes associated with RBPs, the methods used for the construction of the networks formed by 
them, and the structure and dynamics of these post-transcriptional networks 
based on recent publicly available data.    
5.2 RESULTS 
5.2.1 RNA binding proteins and post-transcriptional regulation 
RNA binding proteins (RBPs) are key regulators of different steps in the metabolism of RNA in 
eukaryotes. As shown in Figure 5-1A, they participate in the processing of pre-mRNA, which 
includes splicing, poly-adenylation and capping, to produce mature mRNA. Following this, they 
mediate the transport of mRNA from the nucleus to the cytoplasm. RBPs also 
facilitate and control the localization, translation, stability and degradation of mRNA. To 
regulate the different steps of RNA metabolism, RBPs bind to RNA and form ribonucleoprotein 
(RNP) complexes. Depending upon whether RBPs are bound to pre-mRNA or mRNA, RNPs 
are classified as hnRNPs or mRNPs, respectively. RNPs are inherently highly dynamic complexes 
because they associate with and dissociate from various RBPs to mediate different steps of 
RNA metabolism. Some RBPs associated with RNP complexes are known to remain bound to 
their target RNA during all the steps of RNA processing, from splicing to translation. For 
instance, SF2/ASF, a member of the SR class of RBPs in mammals, facilitates 
splicing, export and translation initiation of its target RNA (Sanford et al., 2004; Zhong et al., 
2009). Similarly, Npl3, a yeast SR protein, has been shown to interact with pre-mRNA and 
regulate the events from splicing to translational elongation (Gross et al., 1998). The 
neuronal ELAV protein likewise regulates the fate of its target RNA by mediating the events from 
poly-adenylation to translation (Pascale et al., 2008). On the other hand, several RBPs 
participate in specific steps of RNA metabolism. Examples include the Nova protein, 
which is associated with splicing in neuronal cells (Ule et al., 2003; Ule et al., 2006), and the Tap 
protein, which, like its yeast homolog Mex67, was reported to be a bona fide mRNA nuclear export 
factor (Gruter et al., 1998). All these examples highlight 1) the role of RBPs in regulating the 
expression of genes at multiple steps at the post-transcriptional level and 2) the complex 
combinatorial interplay of different RBPs to integrate various post-transcriptional events and fine 
tune the availability of transcripts both spatially and temporally. 
 
Table 5-1. Common RNA binding domains in putative RBPs of the yeast S. cerevisiae, their frequency in 
RBPs and the domains most often associated with these RNA binding domains according to the Pfam 
(Finn et al.) domain database. 
 
Domain  Pfam accession  Protein frequency  Frequently co-occurring domains  Description 
RRM_1   PF00076         0.105              RRM_1, Lsm_interact              RNA recognition motif (RRM). Many eukaryotic proteins containing one or more copies of a putative RNA-binding domain of about 90 amino acids are known to bind single-stranded RNAs. 
DEAD    PF00270         0.042              Helicase C                       DEAD/DEAH box helicase. Members of this family include the DEAD and DEAH box helicases. 
KH_1    PF00013         0.015              KH_1                             K homology (KH) domain, a domain of 70 amino acids present in diverse RBPs. 
PUF     PF00806         0.013              PUF, RRM_1                       Pumilio-family RNA binding repeat. The Puf domain usually occurs as a tandem repeat of eight domains. 
WD40    PF00400         0.013              WD40                             WD-40 repeats (also known as WD or beta-transducin repeats) are short ~40 amino acid motifs, often terminating in a Trp-Asp (W-D) dipeptide. 
 
RBPs bind to their RNA targets with the help of several domains that differ in 
specificity and affinity. Some of the most common domains are RRM (RNA recognition motif), 
KH (K homology domain), SR (serine arginine domain), Zn-finger, Pumilio/FBF (PUF domain) 
and Sm (Glisovic et al., 2008). Table 5-1 shows the most frequently occurring RNA binding 
domains in the yeast S. cerevisiae, along with the commonly appearing partner domains, in the 
conventional list of 560 RBPs reported recently by Hogan and co-workers (Hogan et al., 2008) 
(see Materials and Methods). A large number of proteins have been predicted as RBPs in 
several model organisms, including humans, on the basis of the presence of these commonly 
occurring domains. The approximate number of RBPs identified in different model organisms 
is shown in Table 5-2, along with a reference to the study reporting it. For instance, in C. elegans 
approximately 500 proteins are annotated as RBPs on the basis of the presence of one or more 
RNA binding domains. In the yeast S. cerevisiae, about 560 proteins have been reported as 
putative RBPs to date. In humans, more than 1000 proteins are considered to be RBPs, of which 
497 contain at least one RRM domain (Maris et al., 2005). In addition to these 
putative RBPs (identified on the basis of previously known RNA binding domains), several metabolic 
enzymes have also been shown to bind to RNA molecules (Ciesla, 2006). For example, Aco1, 
a TCA cycle enzyme in the yeast S. cerevisiae, binds to several RNAs encoded by the mitochondrial 
genome (Hogan et al., 2008). Likewise, recent studies have also shown the ability of RBPs to 
bind to DNA, suggesting that some of the known RBPs might act as unconventional 
DNA-binding proteins (Hu et al., 2009). These examples indicate the potential for the existence of 
novel classes of RBPs in eukaryotes with yet to be discovered functional roles. 
 
Table 5-2. Putative number of RBPs reported in different organisms. 
 
| Organism | Putative RBPs | Approximate number of genes | Reference |
|---|---|---|---|
| S. cerevisiae | 561 | 7000 | (Hogan et al., 2008) |
| C. elegans | 500 | 20000 | (Lee and Schedl, 2006) |
| D. melanogaster | 300 | 13290 | (Lasko, 2000) |
| Mus musculus | 380 | 28287 | (McKee et al., 2005) |
| Human | 800 | 30000 | (Sanchez-Diaz and Penalva, 2006) |

5.2.2 Methods to Identify RBPs and their targets 
Although several RBPs have been identified on the basis of the conservation of domains in 
different organisms, the targets of these RBPs are poorly understood. Therefore, several methods 
have been employed to identify the targets of RBPs, both in vitro and in vivo. Some 
commonly used methods for the identification of RBP targets are described in Table 5-3. 
Traditionally, RNA targets for known RBPs have been identified in vitro by using cross-linking 
immunoprecipitation followed by electromobility shift assays (Pinero et al., 2000; Thomson et 
al., 1999). More recently, one hybrid (Wilhelm and Vale, 1996) and three hybrid assays 
(SenGupta et al., 1996) have been used to identify the in vivo interaction of an RBP and an RNA 
molecule. However, these traditional methods are limited in their ability to identify new targets. 
Therefore, other in vivo assays have been developed to identify novel targets of an RBP, such 
as ultraviolet (UV) cross-linking and immunoprecipitation (CLIP) and RNP 
immunoprecipitation-microarray (RIP-Chip). These assays work on a similar principle, wherein (i) the 
complex of the RBP and its target RNAs is first extracted and (ii) the target RNAs are identified. 
However, they differ in the procedure used for extracting RBP-RNA complexes and for identifying 
the target RNAs. For example, in the ultraviolet (UV) cross-linking and immunoprecipitation (CLIP) 
method, cells are exposed to ultraviolet light to crosslink RBP-RNA molecules inside the cells. 
The cells are then lysed and the cross-linked RBP-RNA complexes are immunoprecipitated using an 
antibody against the RBP of interest. Finally, RNA is isolated from the complexes and identified 
by RT-PCR. For instance, in a study to discover the targets of the splicing factor Nova, thirty-four 
transcripts were identified by using the CLIP method (Ule et al., 2003).  
In the RNP immunoprecipitation-microarray (RIP-Chip) method, cells are not treated with UV 
light to crosslink RBP-RNA complexes. Instead, cells are lysed directly and native RBP-RNA complexes 
for the RBP of interest are purified from the cell lysate by immunoprecipitation. RNA is 
then isolated from the complexes and identified by using high-density oligonucleotide 
microarrays. The targets of the Puf family of RBPs and of other RBPs in the yeast S. cerevisiae have 
been identified by using a modified RIP-Chip method, in which tandem affinity purification (TAP) tagged 
RBPs are used to facilitate the immunoprecipitation (Gerber et al., 2004; Hogan et al., 2008). These 
studies showed that the number of RNA targets ranges from 1 to approximately 1300 for the RBPs 
studied in the yeast S. cerevisiae. For instance, Nop13, responsible for pre-18S rRNA processing, has 2 RNA 
targets, whereas Npl3 and Mex67, both involved in mRNA export, have 1266 and 1150 RNA 
targets, respectively (Hieronymus and Silver, 2003; Hogan et al., 2008). 
Another fundamental area of exploration in elucidating post-transcriptional networks is 
the identification of the repertoire of RBPs across organisms, for which several approaches, both 
computational and experimental, have been developed in recent years. Computational 
approaches involve identifying the set of protein-coding genes which contain 
bona fide RNA-binding domains, followed by manual curation of the collected set 
to obtain a high confidence set of RBPs (Galante et al., 2009; Hogan et al., 2008). 
Experimental techniques employ the protein chip of an organism of interest to 
probe for potential binding of cellular RNA molecules, analogous to the attempts to 
characterize the repertoire of DNA-binding proteins (Fasolo and Snyder, 2009; Hall et al., 2004; 
Hu et al., 2009; Zhu et al., 2001). Another strategy, developed to identify the 
RBPs attached to a known RNA molecule, is the PNA-assisted identification of RBPs (PAIR) 
(Zeng et al., 2006). This assay utilizes a specific mRNA-binding probe (PNA) that is able to 
cross the cell membrane and bind to the RNA of interest. The probe also contains the 
photoactivatable amino acid adduct p-benzoylphenylalanine (Bpa), which can be covalently crosslinked 
to an adjacent RBP on photoactivation. After delivery of the PNA, cells are exposed to ultraviolet light 
to crosslink the PNA to RBPs associated with the RNA of interest. Cells are then lysed and treated 
with RNase, and PNA-RBP adducts are isolated by using magnetic beads coupled to a sense oligo 
(which binds to the PNA). The RBPs are then identified by mass spectrometry. This method has 
been used to identify the RBPs associated with ankylosis (ank) RNA, a pan-neuronal 
dendritically localized RNA (Zielinski et al., 2006). 
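
The computational approach described above (collecting proteins that carry known RNA-binding domains before manual curation) can be illustrated with a minimal sketch. The Pfam accessions are those listed in Table 5-1; using them as the screening set, the protein names, the placeholder accessions and the definition of frequency as the fraction of candidate RBPs carrying a domain are all assumptions made for illustration and do not reproduce the pipeline of the cited studies.

```python
from collections import Counter

# Pfam accessions from Table 5-1; treating them as the screening set
# is an assumption made for this sketch.
RNA_BINDING_DOMAINS = {
    "PF00076": "RRM_1",
    "PF00270": "DEAD",
    "PF00013": "KH_1",
    "PF00806": "PUF",
}

# Hypothetical annotation: protein name -> domain accessions it carries.
# "OTHER_1" and "OTHER_2" are placeholders for domains outside the set.
proteome = {
    "protA": ["PF00076", "PF00076"],  # two RRM copies
    "protB": ["PF00270", "OTHER_1"],  # DEAD box plus a partner domain
    "protC": ["OTHER_2"],             # no domain from the screening set
    "protD": ["PF00806", "PF00076"],  # PUF together with an RRM
}


def candidate_rbps(annotation, domains):
    """Return proteins carrying at least one domain from the screening set."""
    return {name for name, accessions in annotation.items()
            if any(acc in domains for acc in accessions)}


def domain_frequency(annotation, candidates, domains):
    """Fraction of candidates containing each domain (counted once per protein)."""
    counts = Counter()
    for name in candidates:
        for acc in set(annotation[name]) & set(domains):
            counts[domains[acc]] += 1
    return {domain: n / len(candidates) for domain, n in counts.items()}


rbps = candidate_rbps(proteome, RNA_BINDING_DOMAINS)
freq = domain_frequency(proteome, rbps, RNA_BINDING_DOMAINS)
print(sorted(rbps))
print(freq)
```

In practice the candidate set produced by such a screen would still require the manual curation step mentioned above, since domain presence alone does not establish RNA binding.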
 
Table 5-3. Different methods to identify novel RBPs, their targets or RBP-RNA interactions. 
 
| Method | Description | Reference |
|---|---|---|
| Three hybrid | In vivo yeast genetic method to detect and analyze the interaction of a known RNA and RBP. The method is based on the binding of a bifunctional RNA to both of the two hybrid proteins, which activates the expression of a reporter gene. | (SenGupta et al., 1996) |
| RNAcompete | In vitro identification of the RNA binding specificity of an RBP. A high concentration of an RNA pool is incubated with tagged RBP. The high concentration of RNA provides competition for binding, hence the name of the technique. RBP-RNA complexes are purified and a microarray is used to identify the specific binding sites of the RBP. | (Ray et al., 2009) |
| RIP-Chip | In vivo identification of RNA targets for an RBP of interest. Cells are lysed and RBP-RNA complexes are immunoprecipitated in their native state. Target RNAs are extracted from the RBP-RNA complexes and identified by microarray, with total cellular RNA as the control. | (Tenenbaum et al., 2000) |
| CLIP | In vivo identification of RNA targets for an RBP of interest. Cells are treated with ultraviolet light to covalently crosslink RBP-RNA complexes. Cells are lysed, RBP-RNA complexes are immunoprecipitated and the RNAs are identified by RT-PCR. | (Ule et al., 2003) |
| PAIR | In vivo identification of novel RBPs. An mRNA-binding PNA probe is delivered to the cell. Cells are exposed to ultraviolet light, which enables the PNA to crosslink with the RBP. Cells are lysed, PNA-RNA-RBP complexes are immunoprecipitated and RBPs are identified by mass spectrometry. | (Zielinski et al., 2006) |
| SERF | In vitro selection of RNA fragments that bind to an RBP. A random pool of fragmented RNA is generated and incubated with the RBP in a test tube. RBP-RNA complexes are extracted by filtration on a nitrocellulose membrane. The selection cycle is repeated several times, and the selected RNA fragments are cloned to identify the consensus sequences binding to the RBP. | (Stelzl and Nierhaus, 2001) |
| TRAP | In vivo system for identification of RNA-RBP interactions in yeast. A reporter mRNA encoding GFP is transformed and the RBP of interest is expressed. GFP fluorescence intensity is measured to assess binding of the RBP of interest; a stronger interaction leads to lower expression and lower fluorescence intensity. | (Paraskeva et al., 1998) |
| SNAAP | In vitro method to identify mRNAs that bind to a specific RBP. Purified tagged RBP is treated with cell lysate, the mRNP is immunoprecipitated using an antibody against the tag, and target mRNAs are identified by differential display. | (Rodgers et al., 2002) |
| Quantitative proteomics | In vitro method to identify RBPs that bind to a specific RNA sequence. An RNA sequence tagged with an RNA aptamer is incubated with cell lysate. The aptamer-RNA-RBP complex is purified and RBPs are identified by mass spectrometry. | (Butter et al., 2009) |
 
5.2.3 RBPs and post-transcriptional operons 
In prokaryotes, it has long been known that genes involved in similar processes tend to 
cluster on chromosomes and are transcribed together from the same promoter, thus forming 
DNA operons such as the well-studied Gal and Lac operons. In eukaryotes, on the other hand, 
DNA operons are rare. Extending this notion, the concept of 
post-transcriptional operons has recently been proposed for eukaryotes (Keene and Tenenbaum, 2002), 
made possible by the wealth of information now available on RBP-RNA 
interactions. According to this concept, diverse RNAs related to a common biological process 
are regulated by similar RBPs. For instance, in the yeast S. cerevisiae, a study of RBP-RNA 
interactions by the modified RIP-Chip method revealed that each member of the Puf family of RBPs 
binds functionally and cytotopically related RNAs (Gerber et al., 2004). Puf1 and Puf2 have 
been shown to bind to mRNAs of membrane-associated proteins. Similarly, Puf3 binds to 
cytoplasmic mRNAs of mitochondrial proteins. Likewise, the Nova protein was found to regulate 
splicing of pre-mRNAs encoding components of inhibitory synapses, and a stem loop binding 
protein (SLBP) was found to be involved solely in the splicing and translation of replication-dependent 
histone RNAs (Townley-Tilson et al., 2006). Further examples in support of post-transcriptional operons 
have been reviewed extensively elsewhere (Keene, 2007; Keene and Lager, 2005). These 
examples demonstrate how RBPs, acting through post-transcriptional operons, can coordinate 
the expression of functionally related genes in eukaryotes. 
5.2.4 Post-transcriptional network formed by RBPs 
The development of several high throughput approaches has increased the amount of data on the 
targets of RBPs in diverse organisms. These data on RBPs and their targets can be used to 
construct an RBP-RNA interaction network, typically referred to as a post-transcriptional 
regulatory network (see Figure 5-1B). This post-transcriptional network is represented as 
a directed network in which each edge corresponds to a regulatory link between two 
nodes, as shown in Figure 5-2A. In this directed network, one set of nodes comprises the RBPs, 
which are the regulatory proteins, while the other set of nodes, referred to as the target nodes, 
comprises RNAs encoded by either protein-coding or non-protein-coding genes. A regulator node and a 
target node are joined by an arrow starting from the regulator and pointing towards the target. 
The target RNA may encode proteins of diverse function, including other RBPs. The 
network can also contain loops, that is, links that start from an RBP and target the same RBP, typically 
referred to as autoregulation of an RBP (Figure 5-2B). Such a loop indicates that an RBP can bind to 
its own RNA and control its metabolism at the transcript level. There are several examples 
of auto-regulation of RBPs at the post-transcriptional level. For instance, in humans, 
RBPs such as AUF1, HuR, KSRP, NF90, TIA-1 and TIAR were reported to associate with their 
own mRNAs and with other RBPs (Pullmann et al., 2007).  
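
The network representation described above can be sketched with a small data structure. This is a toy illustration only: the node names are hypothetical, and storing the network as a mapping from each regulator to its set of targets is one possible design choice, not the representation used for the analyses in this chapter.

```python
# Toy directed RBP -> RNA network; all names are hypothetical.
edges = [
    ("RBP1", "rnaA"),
    ("RBP1", "rnaB"),
    ("RBP1", "RBP1"),  # loop: RBP1 binds its own transcript
    ("RBP2", "rnaB"),
    ("RBP2", "RBP1"),  # an RBP targeting the transcript of another RBP
]


def build_network(edge_list):
    """Map each regulator (RBP) to the set of targets it binds."""
    targets = {}
    for regulator, target in edge_list:
        targets.setdefault(regulator, set()).add(target)
    return targets


def regulators_of(network, target):
    """Return the set of RBPs with an edge pointing at the given target."""
    return {reg for reg, bound in network.items() if target in bound}


network = build_network(edges)

# Number of distinct targets per RBP (out-degree of each regulator node).
out_degree = {reg: len(bound) for reg, bound in network.items()}

# RBPs whose target set contains their own transcript (autoregulation).
autoregulated = sorted(reg for reg, bound in network.items() if reg in bound)

print(out_degree)
print(autoregulated)
print(regulators_of(network, "rnaB"))
```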
 
 
Figure 5-2: Concept figure showing the RBP mediated post-transcriptional regulatory network. A) Dark 
(Regulator) and light (Target) grey circles denote nodes in the network. These nodes are linked to each 
other via a directional arrow starting from the regulator (which is an RBP in the network) and pointing towards 
the target (which may be an RNA or miRNA) in the directed network. These linked nodes simply indicate that the 
RBP (dark grey circle) binds to the RNA/miRNA of the target gene (light grey circle) and regulates its metabolism. 
B) A toy network representing a dense set of RBP-RNA interactions with different RBPs having 
diverse targets. The targets of one RBP in the network may be the RNA of other genes or miRNA (dark and 
light circle linked by an arrow), the RNA of the RBP itself (loop from a dark circle) or the RNA of other RBPs 
(two dark circles linked by an arrow). 
  
Due to the availability of the network of post-transcriptional interactions for a 
considerable fraction of RBPs in model systems such as S. cerevisiae (see Materials and 
Methods), it has become possible to address several questions concerning the structure and 
organization of post-transcriptional networks directed by RBPs. Table 5-4 summarizes some of 
the properties which govern the structure of this network, obtained as described in the Materials 
and Methods section. It is evident from this table that the majority of the mRNA transcriptome, 
encoded by about 70% of the genes, had significant associations with at least one of the RBPs 
screened for RNA interactions and that, on average, each distinct yeast mRNA was found to interact 
with three of the RBPs, suggesting the potential for a combinatorial and multidimensional 
network of regulation. Indeed, the average connectivity of a node in this 
network was ~7, indicating that a typical node has several targets and/or 
is controlled by several RBPs. 
 
Table 5-4. Properties defining the structure of the post-transcriptional network of RBPs and their target 
RNAs in the model eukaryote, S. cerevisiae. The dataset employed for characterizing the network structure 
was obtained from Hogan et al. (2008), and all the network properties were calculated using 
igraph, a publicly available R package for analyzing graphs [ http://cneurocvs.rmki.kfki.hu/igraph/ & 
http://www.r-project.org]. 
 
 
| Property | Definition | Value* |
|---|---|---|
| No. of edges | Each edge corresponds to a single RBP-RNA interaction. Hence, total edges represent all the interactions in the post-transcriptional network | 19396 |
| No. of vertices/nodes | Total number of nodes, which comprise of both the RBPs as well as the RNAs, encoding for both protein coding and non-coding genes. This network comprises of 41 RBPs which are screened for their RNA targets. | 5398 |
| Degree or Connectivity | Degree or connectivity refers to the number of interactions a protein or RNA has in this network; the higher the connectivity (i.e., hub nodes) the more the number of targets and/or more the number of RBPs controlling it. | 7.18 |
| Clustering coefficient | Clustering coefficient of a node reflects the extent to which the neighbors of a given node are interconnected among themselves to what is expected theoretically and indicates the cohesiveness or local modularity of the network. Average value taken over all nodes reflects the modularity of the network. | 0.37 |
| Betweenness | Betweenness centrality of a node measures the number of shortest paths between all pairs of nodes in the network that pass through a node of interest; the higher the number of paths that pass through a node, the more important it is. | 43.11 |
| Average path length | Average length of the shortest paths between all pairs of nodes in the network. | 2.65 |
| Closeness | Closeness centrality is defined as the inverse of the average length of all the shortest paths from a node of interest to all other nodes in the network; note that closeness centrality defined this way implies that higher the closeness value, the higher the importance (centrality) of a node. | 0.38 |
| Diameter | The diameter of a network is the length of the longest path among all the shortest paths defined between two nodes. It gives an estimation of the distance between nodes in the network. | 6 |
| Graph density | The density of a network is the ratio of the number of edges to the number of total possible edges. | 1.33 x 10^-3 |
| Power law fit (exponent-alpha) | Fitting a power-law distribution function to the degree distribution of the network to study whether the network is likely to exhibit a scale-free network structure. | 1.77 |
 
* Note that average values for the entire network are reported for properties which are defined for a specific 
node or edge.   
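
As a quick consistency check on Table 5-4, the average degree and the graph density can be recomputed from the reported numbers of edges and nodes. The sketch below assumes that both quantities were calculated on the network treated as a simple undirected graph (average degree 2E/N, density 2E/(N(N-1))); this is an assumption, but one under which the arithmetic gives about 7.19 and 1.33 x 10^-3, in close agreement with the tabulated 7.18 and 1.33 x 10^-3.

```python
n_edges = 19396  # number of edges reported in Table 5-4
n_nodes = 5398   # number of nodes reported in Table 5-4

# Each edge contributes to the degree of two nodes (undirected assumption).
average_degree = 2 * n_edges / n_nodes

# Density of a simple undirected graph: edges divided by possible node pairs.
possible_pairs = n_nodes * (n_nodes - 1) / 2
density = n_edges / possible_pairs

print(round(average_degree, 2))
print(f"{density:.2e}")
```

Under a strictly directed definition, with N(N-1) possible edges, the density would be half this value, which is why the undirected reading is assumed here.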
 
Other measures of centrality, such as betweenness and closeness, which provide a measure 
of the importance of a node in a network and are shown in this table, also reflect this trend (see Junker 
et al., 2006, and references therein for comprehensive definitions). For instance, the average 
length of the shortest path between two nodes in this network, which gives an indication of the 
distance between nodes, suggests that most nodes are separated by no more than 3 edges, a 
measure reflecting the dense connectivity of this network. Similarly, the diameter of the network, which 
refers to the longest of all the shortest paths between a pair of nodes, is 6, indicating that 
any two connected nodes in this network are separated by no more than 6 edges. Likewise, the clustering 
coefficient, which is a proxy for the modularity of the network, shows that neighbors of most 
nodes tend to be highly interconnected among themselves, forming a dense and cohesive 
network of regulatory linkages at this level of regulation. Finally, although the network is incomplete, 
its scaling exponent is about 1.8, which suggests that the network might follow a 
scale-free topology with a power-law degree distribution. 
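
The average path length and the diameter discussed above can both be obtained from breadth-first searches started at every node. The following is a minimal sketch on a hypothetical five-node network, treating edges as undirected; it illustrates the definitions in Table 5-4 rather than the igraph routines actually used.

```python
from collections import deque

# Toy undirected network as an adjacency list; node names are hypothetical.
graph = {
    "RBP1": ["rnaA", "rnaB"],
    "RBP2": ["rnaB", "rnaC"],
    "rnaA": ["RBP1"],
    "rnaB": ["RBP1", "RBP2"],
    "rnaC": ["RBP2"],
}


def shortest_path_lengths(adjacency, source):
    """Breadth-first search: edges from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def path_statistics(adjacency):
    """Average shortest path length and diameter over connected node pairs."""
    lengths = []
    for source in adjacency:
        dist = shortest_path_lengths(adjacency, source)
        lengths.extend(d for node, d in dist.items() if node != source)
    return sum(lengths) / len(lengths), max(lengths)


average_path_length, diameter = path_statistics(graph)
print(average_path_length, diameter)
```

For a network of several thousand nodes the same approach remains feasible, since each breadth-first search visits every edge at most twice.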
5.2.5 Expression dynamics of RBPs in post-transcriptional networks 
5.2.5.1 RBPs show high abundance and tight regulation at the protein level 
To compare the gene expression dynamics of RBPs with those of 
other protein coding genes in S. cerevisiae, we first compiled the sets of RBPs and non-RBPs as 
described in Materials and Methods (also see Figure 5-1B). This allowed us to define a set of 
561 yeast proteins as RNA-binding proteins and the remaining 5685 
proteins (from the complete set of protein coding genes) as non-RNA-binding proteins. We also 
collected high-throughput data documenting various dynamic properties of messenger RNA 
transcripts and their translated protein products in yeast from different sources, as described in 
Materials and Methods. These properties included mRNA stability, mRNA copy number, 
ribosome occupancy, protein stability and protein abundance. In addition to these attributes of mRNAs 
and proteins, we also obtained data describing the cell-to-cell variation in protein expression 
in a genetically homogeneous population of cells, typically referred to as protein expression 
noise.  
 
 
 
 
Figure 5-3: Comparing expression dynamics of RBPs with non-RBPs in the entire genome. Box-plots 
showing the distribution of values for various regulatory properties for the two groups of proteins 
(RBPs and non-RBPs) in S. cerevisiae. Blue and red bars correspond to the RBP and non-RBP populations, 
respectively. Each box-plot identifies the middle 50% of the data, the median, and the extreme points. The 
entire set of data points is divided into quartiles and the inter-quartile range (IQR) is calculated as the 
difference between x0.75 and x0.25. The range spanning the 25% of the data points above and the 25% below 
the median (x0.50), i.e., from x0.25 to x0.75, is displayed as a filled box. The horizontal line and the notch 
represent the median and its confidence interval, respectively. Data points lying more than 1.5 IQR above 
x0.75 or below x0.25 are treated as outliers and are shown as dots. The horizontal lines connected by dashed 
lines above and below the filled box (whiskers) represent the largest and smallest non-outlier data points, 
respectively. (A) mRNA half-life (B) mRNA copy number (C) Ribosome occupancy (D) Protein abundance 
(E) Protein half-life (F) Protein noise. In each case, the P-values shown correspond to the significance 
estimated using the Wilcoxon test comparing the RBP and non-RBP groups of proteins. RBPs were found to show 
significantly lower transcript stability and significantly higher mRNA copy number, ribosome occupancy, 
protein stability and abundance. However, protein noise, which reflects the extent of cell-to-cell variation 
in protein levels, was found to be significantly lower for RBPs compared to non-RBPs, suggesting that most 
RBPs are uniformly expressed across a homogeneous population of cells.  
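
The box-plot quantities defined in the caption above (quartiles, IQR, outliers and whiskers) can be computed as in the following sketch. The sample values are hypothetical, and the quartile interpolation method used here (the "inclusive" method of the Python statistics module) is an assumption that may differ slightly from the plotting software used for Figure 5-3.

```python
import statistics


def boxplot_summary(values):
    """Quartiles, IQR, whisker ends and outliers using the 1.5 x IQR rule."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "whiskers": (min(inside), max(inside)),  # extreme non-outlier points
        "outliers": outliers,
    }


# Hypothetical measurements with one extreme value.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 100]
summary = boxplot_summary(sample)
print(summary)
```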
Messenger RNA half-life is a measure of transcript stability in the cell, while mRNA copy 
number reflects its abundance. We first asked whether RBPs as a functional class differ 
from non-RBPs in these properties. 
We found that mRNAs encoding RBPs are significantly less stable (i.e., have shorter half-lives) 
than those of genes that do not encode RBPs (p = 3.1 x 10^-10, Wilcoxon 
test) (Figure 5-3A). In yeast it has been shown that, in general, mRNAs of central physiological 
pathways have longer half-lives and mRNAs encoding regulatory and signaling proteins have 
shorter half-lives (Pombo et al., 1999). The lower mRNA half-life 
of RBPs observed in our analysis is therefore consistent with their regulatory function and rapid 
turnover at the transcript level. However, a comparison of the mRNA copy number of the two groups of genes, 
which is a proxy for mRNA abundance in the cell, indicated that RBPs are encoded by genes 
which exhibit much higher mRNA copy number (p < 2.2 x 10^-16, Wilcoxon test) (Figure 5-3B). 
Exclusion of translation and ribosome associated genes, which form a significant fraction of the 
total repertoire of RBPs and are known to be highly expressed, did not change our results. 
These observations suggest that transcripts encoding RBPs tend to be less stable but more abundant, 
and hence that abundance is a more prominent factor than stability. Together, the mRNA half-life 
and mRNA abundance data indicate that the expression of RBPs at the mRNA level is likely to be 
transient, but that whenever they are transcribed they are produced at high concentrations. 
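
The group comparisons reported here rely on the Wilcoxon test for two independent samples (the rank-sum test). The sketch below shows the underlying calculation under simplifying assumptions: it uses the normal approximation without continuity or tie correction, and the half-life values are hypothetical. Statistical packages may use an exact test for small samples, so p-values can differ from those of this sketch.

```python
import math


def rank_with_ties(values):
    """Assign 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def rank_sum_test(group_a, group_b):
    """Two-sided Wilcoxon rank-sum test using the normal approximation."""
    n_a, n_b = len(group_a), len(group_b)
    ranks = rank_with_ties(list(group_a) + list(group_b))
    rank_sum_a = sum(ranks[:n_a])
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    mean_u = n_a * n_b / 2
    sd_u = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (u_a - mean_u) / sd_u
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return u_a, z, p_value


# Hypothetical mRNA half-lives (minutes) for the two groups.
rbp_half_lives = [8, 10, 12, 14, 15]
non_rbp_half_lives = [16, 18, 20, 22, 25, 30]

u_stat, z_score, p_value = rank_sum_test(rbp_half_lives, non_rbp_half_lives)
print(u_stat, round(z_score, 3), round(p_value, 4))
```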
Ribosome occupancy has been shown to be a measure of the translational efficiency of an 
mRNA. Higher ribosome occupancy relates to higher protein synthesis and lower ribosome 
occupancy indicates a low translation rate of the mRNA. We next asked whether the ribosome 
occupancy, i.e., the rate of translation, of RBPs is higher than that of non-RBPs and whether their protein 
levels are higher within the cell. This analysis clearly revealed that RBPs have high ribosome 
occupancy (p = 2.5 x 10^-13, Wilcoxon test) (Figure 5-3C) and are also present at much higher 
concentrations (p < 2.2 x 10^-16, Wilcoxon test) (Figure 5-3D), with the median abundance of RBPs 
being roughly double that observed for non-RBPs (3895 versus 2132 protein molecules/cell). 
These results indicate that RBPs are abundant and are translated rapidly, supporting the 
versatile nature of their involvement in multiple post-transcriptional control mechanisms at 
different cellular locations. When ribosome and translation associated factors were excluded and the 
remaining non-ribosomal RBPs were compared against non-RBPs, it became clear that ribosomal RBPs contribute 
significantly to the observed differences in the rate of translation and protein abundance of 
RBPs. Nevertheless, non-ribosomal RBPs were still found to be significantly more abundant 
than non-RBPs (p = 2.2 x 10^-2). 
The stability of a protein, measured as its half-life, can be considered a proxy for the lifetime 
of the protein in a cell. Therefore, to understand the degradation rates of RBPs and to 
compare them against non-RBPs, we analyzed their protein half-lives (see Materials and 
Methods). This analysis revealed that RBPs are significantly more stable than non-RBPs, with 
RBPs exhibiting a median half-life of 71 min as against 46 min for non-RBPs (p = 5 x 10^-12, 
Wilcoxon test) (Figure 5-3E). Repeating the analysis after excluding ribosomal RBPs showed a 
consistent trend (p = 4.8 x 10^-2). The increased 
protein stability and concentration of RBPs compared to other proteins in the cell suggest 
that RBPs, whose main functional role is in the processing and localization of their mRNA 
targets, might be required at multiple sub-cellular locations and be used throughout the cell 
cycle. This may warrant their higher abundance and stability at the protein level. It is 
important to note that although RBPs exhibit high protein stability, they also show low transcript 
stability, which suggests that most RBPs that are stable at the protein level might avoid 
cellular crowding of their transcripts through quick turnover at the transcript level. Indeed, it has been 
shown in yeast that most RBPs auto-regulate their own activity at the transcript level (Hogan et 
al., 2008).  
In order to understand how these properties vary with the different processes in which RBPs 
are involved, we divided RBPs into four major categories (translation, transport, RNA 
localization and processing) using GO annotations and compared them with non-RBPs. This 
analysis revealed that the general trends observed for the different categories are similar to those 
seen for RBPs as a whole, although certain categories comprised relatively few RBPs. 
Several RBPs have been shown to be post-translationally modified, which adds a layer of 
flexibility to their function. Many of these post-translational modifications have been shown to 
modify their RNA-binding properties or their sub-cellular localization. Indeed, at least four types 
of post-translational modification, namely phosphorylation, ubiquitination, methylation and 
SUMOylation, have been reported for RBPs (Glisovic et al., 2008). The high stability of RBPs 
indicates the potential that post-translational modifications can offer in the diversification of their 
function. In fact, analysis of the number of kinase substrates in the RBP and non-RBP populations 
using the currently available protein phosphorylation map for yeast (Ptacek et al., 2005) 
suggests not only that some kinases target a higher number of RBPs than non-RBPs (p = 
2.7 x 10^-2) but also that a greater number of kinases are associated with RBPs (p < 2.2 x 10^-16). 
Gene expression is a highly dynamic process, and because of this there can be 
a large variation in a protein’s abundance among different cells in a population. This variation is 
termed biological noise. Genes whose expression varies to a large extent show more noise 
and these are typically involved in stress response, amino acid biosynthesis and heat shock. On 
the other hand, genes which show consistent expression during the cell cycle, such as those 
involved in protein degradation and those encoding ribosomal proteins, tend to show low noise (Newman et al., 
2006). Here, we have explored these noise data to address whether RBPs differ significantly 
from non-RBPs in terms of biological noise. As shown in Figure 5-3F, RBPs were 
found to show significantly lower noise levels than non-RBPs (p = 1.7 x 10^-12, 
Wilcoxon test). Re-analyzing the data after excluding ribosomal proteins still clearly indicated that 
RBPs exhibit much lower noise than other protein coding genes (p = 6.3 x 10^-6, 
Wilcoxon test). This analysis indicates that low noise is a general property of 
RBPs as a class and suggests that RBPs are tightly regulated at the protein level, with little variation in their 
expression from cell to cell. 
5.2.5.2 The number of distinct targets bound by an RBP is correlated with its cellular abundance 
RBPs are the key elements responsible for the post-transcriptional control of gene expression 
and, when combined with their RNA targets, this information can be represented as an RBP-RNA 
network. Although, on a genomic scale, RBPs are believed to control a diverse range of functions, 
with some eukaryotic systems predominantly using post-transcriptional mechanisms for gene 
expression control (Foth et al., 2008; Noe et al., 2008), large-scale elucidation of 
post-transcriptional networks is limited to a few model organisms and a select set of RBPs. In yeast, 
a few recent genome-wide studies identified the targets of several RBPs using RIP-Chip 
technology (Gerber et al., 2004; Hogan et al., 2008). These studies revealed the important roles 
played by different families of RBPs and the structure of the post-transcriptional network formed 
by them. These high-throughput studies showed that the number of targets of an RBP can vary 
widely, from fewer than ten to more than a thousand. In this study we used the network 
discussed above, where nodes represent RBPs or their targets and links represent a distinct 
physical association between an RBP and a target RNA. We then systematically investigated 
the relationship between different dynamic properties of RBPs and the number of distinct RNA 
targets they control.  
 
 
[Figure 5-4, panels A-D. Each panel plots a property of RBPs (y-axis) against connectivity (x-axis, 0 to 2000).
A) mRNA half-life (minutes): R² = 0.18, p < 2.4e-1.
B) mRNA copy number (copies/cell): R² = 0.96, p < 1.0e-3.
C) Ribosome occupancy: R² = 0.77, p < 2.5e-2.
D) Protein abundance (molecules/cell): R² = 0.94, p < 3.0e-3.]
 
Figure 5-4: Relationship between the number of targets of an RBP and its A) transcript turnover, B)
estimated mRNA copy number per cell, C) extent of ribosome occupancy and D) protein abundance. In
each case, except for transcript stability, we found a strong correlation between the connectivity of an RBP
and the regulatory property studied, suggesting that RBPs which regulate a high number of targets are
more abundant at the protein level. RBPs are divided into 5 bins, with approximately equal
numbers of RBPs, based on their connectivity. Points correspond to the median values in the respective
bins, while the error bars show the normalized median deviation, calculated as the ratio between the
Median Absolute Deviation (MAD) and the square root of the number of values in the bin.
 
We first asked whether the number of targets of an RBP is correlated with its transcript
stability by grouping the RBPs into connectivity bins, i.e., groups of RBPs with similar
numbers of distinct RNA targets (see Methods). We found only a weak, statistically
non-significant positive correlation, suggesting that the transcript turnover of RBPs
may not depend on their number of targets (R² = 0.18, p < 0.24) (Figure 5-4A). On the
other hand, a comparison of the mRNA copy number of an RBP with its number of targets
revealed a strong positive correlation, suggesting that RBPs with a high number of
targets are likely to be more highly expressed at the mRNA level (R² = 0.96, p < 1 × 10⁻³) (Figure
5-4B). For instance, PAB1 is a highly connected essential RBP which binds to the poly(A)
tail of an mRNA and regulates its translational initiation through its interaction with the eIF4G protein
(Kessler and Sachs, 1998; Sachs et al., 1987). Indeed, it was reported to bind to 1,994 distinct
RNA targets and was among the genes with very high mRNA copy number (7.1 mRNA
copies/cell). These observations point to a direct link between the number of distinct targets of an
RBP and the number of copies of its mRNA available in the cell. To test whether
connectivity is also correlated with the rate of translation or the absolute protein
abundance of RBPs, we further explored the relationship between them (Figure 5-4C and
5-4D). This comparison uncovered a more general link between the translational efficiency of an
RBP and its degree. For instance, Pub1p is another poly(A)-binding protein (Matunis et al.,
1993) which binds to diverse sets of transcripts involved in ribosome biogenesis, cellular
metabolism and transport (Duttagupta et al., 2005). This protein was reported to be localized to
both the nucleus and the cytoplasm (Anderson et al., 1993). Hence, to be present at different locations
and to bind to a large number of transcripts, it presumably has to be translated more often and be
present in a larger number of copies. Consistent with this, we find that its transcript exhibits high
ribosome occupancy. Indeed, Hogan et al. (2008) demonstrated that the RNA targets
of highly connected RBPs were enriched for multiple processes and sub-cellular localizations.
These results reveal a strong relationship between the concentration of an RBP and the
number of distinct RNA targets it binds, indicating that RBPs responsible for controlling
a wide range of targets tend to occur in a larger number of copies at the protein level. It is important
to note that although RBPs as a group are expressed at significantly higher levels, at both the
transcript and protein levels, than the non-RBP population, the relative abundance of an RBP
is also correlated with its position in the hierarchy, defined by its number of distinct RNA targets. It is
also noteworthy that the RBPs analyzed for connectivity in this section did not include
core ribosomal proteins, strengthening the generality of these observations.
5.2.5.3 RBPs bound to many RNA targets are less frequently degraded
and tightly controlled at the protein level
Although RBPs with a larger number of distinct targets are expressed at a higher level than
those which control fewer targets, it is not evident whether their protein turnover rates follow a
similar trend. Therefore, to understand whether the stability of an RBP depends on
the number of transcripts it controls, we employed the same binning approach as above. This
analysis showed that RBPs which regulate many targets are highly stable at the protein
level (R² = 0.95, p < 3 × 10⁻³) (Figure 5-5A). The link between protein stability and an RBP's degree
indicates that RBPs controlling many targets are less frequently degraded at the protein level
and might be present throughout the cell cycle. Taken together, these observations raise the
question: if highly connected RBPs are consistently expressed at high concentrations and are
less frequently degraded, is their expression also tightly controlled at the protein level? The fact
that RBPs as a group show significantly lower noise than non-RBPs, together with
previous reports that regulatory proteins generally exhibit low noise (Newman et al.,
2006), suggests that highly connected RBPs can be expected to show less noise than
those which are poorly connected. Hence, we compared the connectivity of RBPs with their
noise values. As shown in Figure 5-5B, we found a strong negative correlation between the number of
targets of an RBP and its protein noise. In particular, highly connected RBPs showed minimal
variation in their protein expression across a population of cells (R² = 0.93, p < 4 × 10⁻³). This
suggests that RBPs controlling many targets are very tightly regulated, with little cell-to-cell
variation in their protein expression. These observations indicate that any significant change in
their availability or regulation may result in an imbalance in cellular homeostasis, as it may affect
a vast number of transcripts. Indeed, essential genes were two-fold enriched among RBPs
compared to the whole genome, suggesting a central role for RBPs in
maintaining cellular homeostasis. These lines of evidence reveal that RBPs act as an important
class of regulatory molecules in the cell whose expression is tightly controlled despite their
occurrence at high cellular concentrations and in multiple sub-cellular locations.
 
 
[Figure 5-5, panels A-B. Each panel plots a property of RBPs (y-axis) against connectivity (x-axis, 0 to 2000).
A) Protein half-life (minutes): R² = 0.95, p < 3.0e-3.
B) Protein noise: R² = 0.93, p < 4.0e-3.]
 
Figure 5-5: Relationship between an RBP's connectivity and its A) protein stability and B) noise. RBPs
controlling a larger number of targets showed an increasing tendency to be stable at the protein level and a
decreasing tendency in protein noise. RBPs are divided into 5 bins, with approximately equal numbers of
RBPs, based on their connectivity. Points correspond to the median values, while the error bars show the
normalized median deviation, calculated as the ratio between the Median Absolute Deviation (MAD) and
the square root of the number of values in the bin.
 
5.3 DISCUSSION & CONCLUSION 
RBPs form an important class of evolutionarily conserved proteins (Anantharaman et al., 2002)
and are known to be involved in a wide range of cellular processes. In addition to their functional
roles in diverse processes, as shown in Figure 5-1A, RBPs are implicated in a
number of disorders caused by their mis-expression or by mutations in the sequences
employed to recognize their cognate target RNAs. For instance, in humans, malfunctioning of
RBPs like NOVA, a neuron-specific protein responsible for the alternative splicing of a
subset of pre-mRNAs, is known to be involved in the pathogenesis of the neurodegenerative
syndrome Paraneoplastic Opsoclonus-Myoclonus Ataxia (POMA) (Ule et al., 2003). In line with
this and other observations that changes in the expression levels of RBPs are
associated with diseases and fitness defects (Cooper et al., 2009; Lukong et al., 2008), the results
reported here reveal that RBPs as a functional class show very little variation in their expression
across cells, suggesting the importance of controlling them tightly. In addition, we found that
RBPs which regulate many transcripts show markedly reduced noise, indicating that
variations in the expression levels of these key post-transcriptional regulators could have a
significant impact on the functioning of the cell and thereby lead to a disease phenotype.
The fact that RBPs are generally less stable at the transcript level but exhibit higher
stability and abundance at the protein level suggests that they form a group of proteins
which follow the theoretically proposed time-averaging effect on noise propagation (Paulsson,
2004): if a protein has a long half-life compared to its mRNA, then it
averages over the noisy fluctuations in the mRNA, decreasing the protein expression noise.
These results also suggest that the regulation of RBPs is predominantly exerted at the protein
level through a number of post-translational modifications (PTMs) such as phosphorylation,
arginine methylation and sumoylation, which have been reported to occur in several well-studied
RBPs (Schullery et al., 1999; Vassileva and Matunis, 2004; Yu et al., 2004). Indeed, a
comparison of the number of phosphorylated targets in RBPs and non-RBPs revealed the
predominance of post-translational control in RBPs. Therefore, it is possible that a
wide variety of these PTMs are responsible for the ability of RBPs to spatially and temporally
regulate transcripts in eukaryotic systems. It is possible to speculate from these observations
that the low noise levels of RBPs, together with extensive regulatory flexibility at the protein level,
might allow them to control gene expression at a finer level than
transcriptional control by transcription factors. This might thereby provide a quick and extensive
framework for controlling the expression of a wide range of genes. This is also supported
by the observation that RBPs which are central to the cell are not only required in large
quantities but are also present for a longer time in the cell. All these observations
suggest the importance of a post-transcriptional network of interactions in eukaryotes
and raise several open questions about the regulation of gene expression beyond transcription. It
should be possible to address such questions in the near future as more data from different
levels of regulation become available (Halbeisen et al., 2008; Hieronymus and Silver, 2004;
Lackner et al., 2007).
While the post-genomic era has given us the complete sequences of hundreds of
genomes, it has also left us with several unanswered questions regarding the functional
relevance of the genes an organism encodes and the principles that govern their
regulation. It is noteworthy that even in a model organism like S.
cerevisiae, regulation of gene expression at the post-transcriptional level is rather poorly
understood. Nevertheless, with recent improvements in, and the wider availability of, high-throughput
approaches such as RNA sequencing and immunoprecipitation protocols, the coming years can be
expected to bring a wealth of data detailing the dynamic, spatial and tissue-specific nature of the
interactions governed by this exciting class of regulatory molecules. Such data should
allow us to gain a deeper understanding of regulation at a level which has been under-appreciated
over the past decades. Given the unprecedented detail at which these high-throughput
technologies can reveal the link between the regulatory elements on the target
genes and the RNA-binding proteins specific to environmental conditions, it should be possible to use
these approaches to interrogate the prevalence of these phenomena in different states and
thereby study their relevance to physiology and disease in diverse model systems.
5.4 MATERIALS AND METHODS 
5.4.1 Data on RNA-binding proteins in S. cerevisiae and their 
interactions 
The complete list of annotated RBPs and the data for well-studied RBPs in S. cerevisiae were
obtained from Hogan et al. (2008). The total number of annotated RBPs in yeast
reported in that study was 561, and mRNA targets for 41 RBPs were systematically
identified on a whole-genome scale by employing the RIP-chip technology. This approach
essentially consists of two steps. The first involves generation of two RNA samples: RBP-bound
mRNA, isolated by immunoprecipitation of messenger ribonucleoproteins using affinity
purification, and total cellular RNA, representing the whole set of transcripts in the cell. The
second step involves hybridization of the two isolated RNA samples to dual-color
microarrays, which are analyzed for enriched transcripts to detect the bound targets of an RBP
(Sanchez et al., 2007). A total of 14,312 interactions between 41 RBPs and 5,025 protein-coding
genes in the genome of S. cerevisiae, obtained using this approach, form a network of
post-transcriptional interactions between RBPs and their target RNAs. This network was
used for studying the expression dynamics, while the network properties were studied
using the entire network of 19,396 interactions reported in the original study (Hogan et al., 2008).
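As an illustration of how such an interaction list can be turned into the RBP-RNA network used here, the following minimal Python sketch maps each RBP to its set of distinct targets and derives its connectivity. The function names and the toy identifiers (RBP_A, mRNA_1, and so on) are hypothetical and are not taken from the original analysis, which was carried out in R.

```python
from collections import defaultdict


def build_rbp_network(interactions):
    """Map each RBP to the set of distinct RNA targets it binds.

    `interactions` is an iterable of (rbp, rna) pairs; repeated pairs
    collapse, so only distinct physical associations are counted.
    """
    targets = defaultdict(set)
    for rbp, rna in interactions:
        targets[rbp].add(rna)
    return dict(targets)


def connectivity(targets):
    """Number of distinct RNA targets per RBP (the RBP's degree)."""
    return {rbp: len(rnas) for rbp, rnas in targets.items()}


def network_summary(targets):
    """Count RBPs, distinct target genes and interactions in the network."""
    all_targets = set()
    for rnas in targets.values():
        all_targets |= rnas
    return {
        "rbps": len(targets),
        "target_genes": len(all_targets),
        "interactions": sum(len(rnas) for rnas in targets.values()),
    }


# Hypothetical toy input; real input would be the published RBP-target pairs.
toy = [("RBP_A", "mRNA_1"), ("RBP_A", "mRNA_2"),
       ("RBP_A", "mRNA_1"), ("RBP_B", "mRNA_2")]
network = build_rbp_network(toy)
print(connectivity(network))
print(network_summary(network))
```

Applied to the data described above, a summary of this kind would be expected to report the 41 RBPs, 5,025 genes and 14,312 interactions stated in the text.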
5.4.2 Analysis of the structure and properties of post-transcriptional 
regulatory network 
We used igraph, a publicly available R package [see http://cneurocvs.rmki.kfki.hu/igraph/ and
http://www.r-project.org], to study the properties of this network and to calculate the centrality of
its nodes. In particular, since the network analyzed in this study was
treated as undirected for the sake of simplicity, we used the undirected versions of the
functions degree, transitivity, betweenness and closeness to calculate the degree, clustering
coefficient, betweenness and closeness centralities of a node. Betweenness centrality, which is
the number of shortest paths going through a node, was calculated using the Brandes algorithm
(Brandes, 2001) implemented in R. Similarly, closeness, measured as the inverse of the average
length of the shortest paths to all the other vertices in the graph, was obtained using the
implementation in R. Since the centrality measures betweenness and closeness use the shortest path lengths
between all pairs of nodes in a graph, for cases where no path exists between a particular pair
of nodes, the shortest path length was taken as one less than the total number of nodes in the
graph. Note that this is also the default assumption for calculating centrality measures in igraph.
The clustering coefficient is a property of a node that measures how densely its neighbors
are connected to each other, relative to the case in which all the neighbors are completely connected. An
extension of this metric to the complete network, defined as the average clustering coefficient,
indicates whether the network is modular or sparsely connected. Other network properties were
calculated using the default implementations in igraph or as discussed in the main text.
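The four node-level measures named above can be sketched in plain Python for an unweighted, undirected graph. This is not the igraph code used in the study; it is a minimal re-implementation under stated assumptions: closeness is taken as the inverse of the mean shortest-path length, unreachable pairs are assigned a length of (number of nodes - 1) as described in the text, and betweenness follows the accumulation scheme of Brandes (2001). All function names are hypothetical.

```python
from collections import deque


def build_adjacency(edges):
    """Undirected adjacency sets from (u, v) pairs; self-loops are ignored."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set())
        adj.setdefault(v, set())
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def degree(adj):
    return {node: len(nbrs) for node, nbrs in adj.items()}


def clustering(adj):
    """Links among a node's neighbours divided by the possible number of links."""
    out = {}
    for node, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            out[node] = 0.0
            continue
        links = sum(1 for a in nbrs for b in adj[a] if b in nbrs) // 2
        out[node] = 2.0 * links / (k * (k - 1))
    return out


def closeness(adj):
    """Inverse mean shortest-path length; unreachable nodes count as n - 1."""
    n = len(adj)
    out = {}
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from the source
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total = sum(dist.get(t, n - 1) for t in adj if t != source)
        out[source] = (n - 1) / total if total else 0.0
    return out


def betweenness(adj):
    """Brandes-style betweenness for unweighted, undirected graphs."""
    score = dict.fromkeys(adj, 0.0)
    for s in adj:
        stack, preds = [], {v: [] for v in adj}
        sigma = dict.fromkeys(adj, 0)  # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(adj, 0.0)
        while stack:  # accumulate dependencies in order of decreasing distance
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    return {v: b / 2 for v, b in score.items()}  # each pair was counted twice
```

Note that when several shortest paths connect a pair of nodes, this scheme credits each intermediate node with the fraction of those paths passing through it, which is the usual convention for betweenness.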
5.4.3 Data for comparative analysis of expression dynamics 
To study the expression dynamics of RBPs in comparison to other protein-coding genes in the
genome, and to analyze their relationship with the number of RNAs controlled by RBPs, we
employed a variety of datasets. These include transcript stability, mRNA copy number,
ribosome occupancy, protein half-life, protein abundance and protein noise. Transcript stability,
which is measured as the RNA half-life of a transcript, was obtained from Wang et al. (2002)
and comprised mRNA half-lives for 4,687 genes in the entire genome. A key parameter
describing the translational status of a gene is the fraction of its transcripts engaged in
translation, which is defined as the ribosome occupancy (Arava et al., 2003). Likewise, the
number of mRNA copies of a gene is described by the parameter mRNA copy number
per cell. Both these parameters for genes in S. cerevisiae were obtained from Arava et al.
(2003), who employed velocity sedimentation to separate mRNAs
bound to ribosomes and quantified them using microarray analysis. mRNA copy number could
be obtained for 5,643 genes, while ribosome occupancy could be mapped for 5,700 genes,
allowing us to study the transcript abundance and translation rates of the genes.
The stability of a protein, an estimate of how long it persists within the cell, is
measured as the half-life of the protein. In yeast, protein half-lives have been estimated by Belle
and co-workers for about 3,750 proteins by inhibiting translation (Belle et al., 2006). In this study
we used these data after excluding proteins whose half-lives were obtained by extrapolation.
Protein abundance, which gives the absolute number of protein molecules per cell, was
obtained from Ghaemmaghami et al. (2003). We could obtain
abundance values for 3,868 proteins in the entire genome. Biological noise, which is typically
defined as the variation in the expression of a protein between different cells in a homogeneous
population of cells, was obtained from Newman et al. (2006). We could obtain
noise data for 2,213 genes for cells grown in rich media. The authors of that study employed two
distinct measures of protein noise: the coefficient of variation (CV), which is the ratio of
the standard deviation in the expression of a protein to its mean expression, and the distance from
median (DM), which was calculated as the difference between the CV value of a protein and a
running median of all CV values. In this study we used DM as the measure of protein noise,
as it was indicated to be a more robust measure than CV for comparing noise levels between
proteins (Newman et al., 2006). Since DM is the distance between the
CV of a protein and the running median of all CVs, negative values correspond to relatively low noise while
positive values reflect higher levels of noise in protein expression.
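To make the two noise measures concrete, the sketch below computes a CV from single-cell expression values and a DM value as the CV minus a running median of CVs. The noise values used in this study were taken directly from Newman et al. (2006) and were not recomputed; the ordering of proteins by mean abundance, the window size and the use of the population standard deviation are assumptions of this sketch, and the function names are hypothetical.

```python
from statistics import mean, median, pstdev


def coefficient_of_variation(values):
    """CV: standard deviation of expression divided by mean expression."""
    return pstdev(values) / mean(values)


def distance_from_median(cv_by_protein, abundance_by_protein, window=3):
    """DM: CV of a protein minus the running median of neighbouring CVs.

    Proteins are ordered by mean abundance (an assumption of this sketch)
    and the running median is taken over `window` neighbours, truncated
    at the two ends of the ordering.
    """
    ordered = sorted(cv_by_protein, key=lambda p: abundance_by_protein[p])
    half = window // 2
    dm = {}
    for i, protein in enumerate(ordered):
        lo = max(0, i - half)
        hi = min(len(ordered), i + half + 1)
        running = median([cv_by_protein[q] for q in ordered[lo:hi]])
        dm[protein] = cv_by_protein[protein] - running
    return dm


# Hypothetical toy values: negative DM means less noise than similar proteins.
cvs = {"prot_a": 0.1, "prot_b": 0.5, "prot_c": 0.2}
abundance = {"prot_a": 1.0, "prot_b": 2.0, "prot_c": 3.0}
print(distance_from_median(cvs, abundance))
```

In this toy case prot_b receives a positive DM because its CV lies above the running median of its neighbours, while prot_a and prot_c receive negative values.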
5.4.4 Comparison of the regulatory properties of RBPs with other 
protein coding genes 
To study whether RBPs show differences in dynamic properties when compared to other
protein-coding genes, we defined a non-RBP set of proteins. This set comprised the proteins in
the whole genome after excluding the list of 561 RBPs defined above. To assess whether RBPs
exhibit a different trend compared to non-RBPs for each of the properties studied, we used the
Wilcoxon rank-sum test (also known as the Mann-Whitney U test) available in the R statistical
package to calculate the significance. The Wilcoxon test compares two samples to assess
whether they come from the same distribution. Since this test is non-parametric and does
not assume any particular distribution of the samples, it is well suited to comparing the different
datasets used here. Box plots were used to represent the distribution of values for each property.
Since the RBP set included a number of ribosome-associated proteins, we also excluded them from
this list and repeated the analysis to test the robustness of the observed tendencies in the
absence of ribosomal proteins.
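The comparison described above can be sketched in Python as a two-sided rank-sum test. The study itself used the implementation in R; the version below is a simplified stand-in that uses the normal approximation with a tie correction and no continuity correction, so its p-values will differ slightly from those of R, particularly for small samples. The function names are hypothetical.

```python
import math


def average_ranks(values):
    """1-based ranks in which tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def rank_sum_test(x, y):
    """Return (U statistic of x, two-sided p-value) for samples x and y."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    combined = list(x) + list(y)
    ranks = average_ranks(combined)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    # Variance of U under the null hypothesis, corrected for ties.
    counts = {}
    for v in combined:
        counts[v] = counts.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in counts.values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:
        return u1, 1.0
    z = (u1 - n1 * n2 / 2) / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return u1, p_value


# Hypothetical noise (DM) values for a few RBPs and non-RBPs.
rbp_noise = [-1.2, -0.8, -0.5, -0.3]
other_noise = [0.1, 0.4, 0.9, 1.5, -0.1]
print(rank_sum_test(rbp_noise, other_noise))
```

A small U for the first sample indicates that its values tend to rank below those of the second sample, as would be expected if RBPs show lower noise than other genes.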
5.4.5 Analysis of the relationship between the number of targets of an
RBP and its dynamic properties
To understand the link between the number of targets of an RBP and its dynamic properties,
RBPs were first grouped on the basis of the number of distinct RNA targets to which they were
bound. This grouping was done in such a way that each bin contained a roughly equal
number of RBPs. This resulted in five bins corresponding to varying degrees of connectivity,
with some RBPs controlling as many as 2,000 mRNAs in the RBP-RNA network. To minimize the
effect of outliers in each bin, median values were calculated for the different dynamic properties, and
the correlation was estimated between these median values and the connectivity of the RBPs. P-values
were calculated from the correlation coefficient and the number of data points, based on a linear
fit.
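The binning procedure can be sketched as follows. This is an illustrative reconstruction rather than the original analysis code: it assumes that each bin is represented by its median connectivity, that the correlation is a Pearson correlation over the five bin medians, and that the p-value is a one-sided Student's t tail with n - 2 degrees of freedom. The error term follows the figure legends (MAD divided by the square root of the bin size). All function names are hypothetical.

```python
import math
from statistics import median


def split_into_bins(pairs, n_bins=5):
    """Sort (connectivity, value) pairs and cut them into near-equal bins."""
    ordered = sorted(pairs)
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for b in range(n_bins):
        end = start + size + (1 if b < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins


def normalized_median_deviation(values):
    """Median Absolute Deviation divided by sqrt(number of values)."""
    centre = median(values)
    mad = median([abs(v - centre) for v in values])
    return mad / math.sqrt(len(values))


def pearson_r(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def t_upper_tail(t, df, steps=2000):
    """P(T > t) for t >= 0, by Simpson's rule after u = sqrt(df) * tan(theta)."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(math.pi) * math.gamma(df / 2))
    theta = math.atan(t / math.sqrt(df))
    h = theta / steps
    s = 1.0 + math.cos(theta) ** (df - 1)
    s += sum((4 if i % 2 else 2) * math.cos(i * h) ** (df - 1)
             for i in range(1, steps))
    return max(0.0, 0.5 - c * s * h / 3)


def one_sided_p(r, n):
    """One-sided p-value for a correlation r estimated from n points."""
    if r * r >= 1.0:
        return 0.0
    t = abs(r) * math.sqrt((n - 2) / (1 - r * r))
    return t_upper_tail(t, n - 2)


def binned_correlation(connectivity, prop, n_bins=5):
    """Return (R squared, p-value, error bars) over the bin medians."""
    pairs = [(connectivity[g], prop[g]) for g in connectivity if g in prop]
    bins = split_into_bins(pairs, n_bins)
    xs = [median([c for c, _ in b]) for b in bins]
    ys = [median([v for _, v in b]) for b in bins]
    errors = [normalized_median_deviation([v for _, v in b]) for b in bins]
    r = pearson_r(xs, ys)
    return r * r, one_sided_p(r, n_bins), errors
```

Under these assumptions, a correlation with R² = 0.18 over five bins gives a one-sided p-value of roughly 0.24, in line with the value reported for Figure 5-4A; whether the original calculation was one-sided is not stated in the text.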
REFERENCES 
Alberti, S., Halfmann, R., King, O., Kapila, A. and Lindquist, S. (2009). A systematic survey 
identifies prions and illuminates sequence features of prionogenic proteins. Cell 137, 146-58. 
Anantharaman, V., Koonin, E. V. and Aravind, L. (2002). Comparative genomics and 
evolution of proteins involved in RNA metabolism. Nucleic Acids Res 30, 1427-64. 
Anderson, J. T., Paddy, M. R. and Swanson, M. S. (1993). PUB1 is a major nuclear and 
cytoplasmic polyadenylated RNA-binding protein in Saccharomyces cerevisiae. Mol Cell Biol 13, 
6102-13. 
Arava, Y., Wang, Y., Storey, J. D., Liu, C. L., Brown, P. O. and Herschlag, D. (2003). 
Genome-wide analysis of mRNA translation profiles in Saccharomyces cerevisiae. Proc Natl 
Acad Sci U S A 100, 3889-94. 
Belle, A., Tanay, A., Bitincka, L., Shamir, R. and O'Shea, E. K. (2006). Quantification of 
protein half-lives in the budding yeast proteome. Proc Natl Acad Sci U S A 103, 13004-9. 
Brandes, U. (2001). A Faster Algorithm for Betweenness Centrality. Journal of Mathematical 
Sociology 25, 163-177. 
Butter, F., Scheibe, M., Morl, M. and Mann, M. (2009). Unbiased RNA-protein interaction 
screen by quantitative proteomics. Proc Natl Acad Sci U S A 106, 10626-31. 
Campbell, D. A., Thomas, S. and Sturm, N. R. (2003). Transcription in kinetoplastid protozoa: 
why be normal? Microbes Infect 5, 1231-40. 
Ciesla, J. (2006). Metabolic enzymes that bind RNA: yet another level of cellular regulatory 
network? Acta Biochim Pol 53, 11-32. 
Cookson, W., Liang, L., Abecasis, G., Moffatt, M. and Lathrop, M. (2009). Mapping complex 
disease traits with global gene expression. Nat Rev Genet 10, 184-94. 
Cooper, T. A., Wan, L. and Dreyfuss, G. (2009). RNA and disease. Cell 136, 777-93. 
Duttagupta, R., Tian, B., Wilusz, C. J., Khounh, D. T., Soteropoulos, P., Ouyang, M., 
Dougherty, J. P. and Peltz, S. W. (2005). Global analysis of Pub1p targets reveals a 
coordinate control of gene expression through modulation of binding and stability. Mol Cell Biol 
25, 5499-513. 
Fasolo, J. and Snyder, M. (2009). Protein microarrays. Methods Mol Biol 548, 209-22. 
Feinberg, A. P. and Tycko, B. (2004). The history of cancer epigenetics. Nat Rev Cancer 4, 
143-53. 
Finn, R. D., Mistry, J., Tate, J., Coggill, P., Heger, A., Pollington, J. E., Gavin, O. L.,
Gunasekaran, P., Ceric, G., Forslund, K. et al. (2010). The Pfam protein families database. Nucleic
Acids Res 38, D211-22.
Foth, B. J., Zhang, N., Mok, S., Preiser, P. R. and Bozdech, Z. (2008). Quantitative protein 
expression profiling reveals extensive post-transcriptional regulation and post-translational 
modifications in schizont-stage malaria parasites. Genome Biol 9, R177. 
Galante, P. A., Sandhu, D., de Sousa Abreu, R., Gradassi, M., Slager, N., Vogel, C., de 
Souza, S. J. and Penalva, L. O. (2009). A comprehensive in silico expression analysis of RNA 
binding proteins in normal and tumor tissue: Identification of potential players in tumor 
formation. RNA Biol 6, 426-33. 
Gerber, A. P., Herschlag, D. and Brown, P. O. (2004). Extensive association of functionally 
and cytotopically related mRNAs with Puf family RNA-binding proteins in yeast. PLoS Biol 2, 
E79. 
Ghaemmaghami, S., Huh, W. K., Bower, K., Howson, R. W., Belle, A., Dephoure, N., 
O'Shea, E. K. and Weissman, J. S. (2003). Global analysis of protein expression in yeast. 
Nature 425, 737-41. 
Glisovic, T., Bachorik, J. L., Yong, J. and Dreyfuss, G. (2008). RNA-binding proteins and 
post-transcriptional gene regulation. FEBS Lett 582, 1977-86. 
Gross, T., Richert, K., Mierke, C., Lutzelberger, M. and Kaufer, N. F. (1998). Identification 
and characterization of srp1, a gene of fission yeast encoding a RNA binding domain and a RS 
domain typical of SR splicing factors. Nucleic Acids Res 26, 505-11. 
Gruter, P., Tabernero, C., von Kobbe, C., Schmitt, C., Saavedra, C., Bachi, A., Wilm, M., 
Felber, B. K. and Izaurralde, E. (1998). TAP, the human homolog of Mex67p, mediates CTE-
dependent RNA export from the nucleus. Mol Cell 1, 649-59. 
Gygi, S. P., Rochon, Y., Franza, B. R. and Aebersold, R. (1999). Correlation between protein 
and mRNA abundance in yeast. Mol Cell Biol 19, 1720-30. 
Halbeisen, R. E., Galgano, A., Scherrer, T. and Gerber, A. P. (2008). Post-transcriptional 
gene regulation: from genome-wide studies to principles. Cell Mol Life Sci 65, 798-813. 
Hall, D. A., Zhu, H., Zhu, X., Royce, T., Gerstein, M. and Snyder, M. (2004). Regulation of 
gene expression by a metabolic enzyme. Science 306, 482-4. 
Hieronymus, H. and Silver, P. A. (2003). Genome-wide analysis of RNA-protein interactions 
illustrates specificity of the mRNA export machinery. Nat Genet 33, 155-61. 
Hieronymus, H. and Silver, P. A. (2004). A systems view of mRNP biology. Genes Dev 18, 
2845-60. 
Hogan, D. J., Riordan, D. P., Gerber, A. P., Herschlag, D. and Brown, P. O. (2008). Diverse 
RNA-binding proteins interact with functionally related sets of RNAs, suggesting an extensive 
regulatory system. PLoS Biol 6, e255. 
Hu, S., Xie, Z., Onishi, A., Yu, X., Jiang, L., Lin, J., Rho, H. S., Woodard, C., Wang, H., 
Jeong, J. S. et al. (2009). Profiling the human protein-DNA interactome reveals ERK2 as a 
transcriptional repressor of interferon signaling. Cell 139, 610-22. 
Junker, B. H., Koschutzki, D. and Schreiber, F. (2006). Exploration of biological network 
centralities with CentiBiN. BMC Bioinformatics 7, 219. 
Keene, J. D. (2007). RNA regulons: coordination of post-transcriptional events. Nat Rev Genet 
8, 533-43. 
Keene, J. D. and Lager, P. J. (2005). Post-transcriptional operons and regulons co-ordinating 
gene expression. Chromosome Res 13, 327-37. 
Keene, J. D. and Tenenbaum, S. A. (2002). Eukaryotic mRNPs may represent 
posttranscriptional operons. Mol Cell 9, 1161-7. 
Kessler, S. H. and Sachs, A. B. (1998). RNA recognition motif 2 of yeast Pab1p is required for 
its functional interaction with eukaryotic translation initiation factor 4G. Mol Cell Biol 18, 51-7. 
Kim, M. Y., Hur, J. and Jeong, S. (2009). Emerging roles of RNA and RNA-binding protein 
network in cancer cells. BMB Rep 42, 125-30. 
Lackner, D. H., Beilharz, T. H., Marguerat, S., Mata, J., Watt, S., Schubert, F., Preiss, T. 
and Bahler, J. (2007). A network of multiple regulatory layers shapes gene expression in fission 
yeast. Mol Cell 26, 145-55. 
Lasko, P. (2000). The Drosophila melanogaster genome: translation factors and RNA binding
proteins. J Cell Biol 150, F51-6. 
Lee, M. H. and Schedl, T. (2006). RNA-binding proteins. WormBook, 1-13. 
Lublin, A. L. and Evans, T. C. (2007). The RNA-binding proteins PUF-5, PUF-6, and PUF-7 
reveal multiple systems for maternal mRNA regulation during C. elegans oogenesis. Dev Biol 
303, 635-49. 
Lukong, K. E., Chang, K. W., Khandjian, E. W. and Richard, S. (2008). RNA-binding proteins 
in human genetic disease. Trends Genet 24, 416-25. 
Maris, C., Dominguez, C. and Allain, F. H. (2005). The RNA recognition motif, a plastic RNA-
binding platform to regulate post-transcriptional gene expression. FEBS J 272, 2118-31. 
Mata, J., Marguerat, S. and Bahler, J. (2005). Post-transcriptional control of gene expression: 
a genome-wide perspective. Trends Biochem Sci 30, 506-14. 
Matunis, M. J., Matunis, E. L. and Dreyfuss, G. (1993). PUB1: a major yeast poly(A)+ RNA-
binding protein. Mol Cell Biol 13, 6114-23. 
McKee, A. E., Minet, E., Stern, C., Riahi, S., Stiles, C. D. and Silver, P. A. (2005). A genome-
wide in situ hybridization map of RNA-binding proteins reveals anatomically restricted 
expression in the developing mouse brain. BMC Dev Biol 5, 14. 
Musunuru, K. (2003). Cell-specific RNA-binding proteins in human disease. Trends Cardiovasc 
Med 13, 188-95. 
Newman, J. R., Ghaemmaghami, S., Ihmels, J., Breslow, D. K., Noble, M., DeRisi, J. L. and 
Weissman, J. S. (2006). Single-cell proteomic analysis of S. cerevisiae reveals the architecture 
of biological noise. Nature 441, 840-6. 
Nica, A. C. and Dermitzakis, E. T. (2008). Using gene expression to investigate the genetic 
basis of complex disorders. Hum Mol Genet 17, R129-34. 
Noe, G., De Gaudenzi, J. G. and Frasch, A. C. (2008). Functionally related transcripts have 
common RNA motifs for specific RNA-binding proteins in trypanosomes. BMC Mol Biol 9, 107. 
Paraskeva, E., Atzberger, A. and Hentze, M. W. (1998). A translational repression assay 
procedure (TRAP) for RNA-protein interactions in vivo. Proc Natl Acad Sci U S A 95, 951-6. 
Pascale, A., Amadio, M. and Quattrone, A. (2008). Defining a neuron: neuronal ELAV 
proteins. Cell Mol Life Sci 65, 128-40. 
Paulsson, J. (2004). Summing up the noise in gene networks. Nature 427, 415-8. 
Pinero, D. J., Hu, J. and Connor, J. R. (2000). Alterations in the interaction between iron 
regulatory proteins and their iron responsive element in normal and Alzheimer's diseased 
brains. Cell Mol Biol (Noisy-le-grand) 46, 761-76. 
Pombo, A., Jackson, D. A., Hollinshead, M., Wang, Z., Roeder, R. G. and Cook, P. R. 
(1999). Regional specialization in human nuclei: visualization of discrete sites of transcription by 
RNA polymerase III. EMBO J 18, 2241-53.
Ptacek, J., Devgan, G., Michaud, G., Zhu, H., Zhu, X., Fasolo, J., Guo, H., Jona, G., 
Breitkreutz, A., Sopko, R. et al. (2005). Global analysis of protein phosphorylation in yeast. 
Nature 438, 679-84. 
Pullmann, R., Jr., Kim, H. H., Abdelmohsen, K., Lal, A., Martindale, J. L., Yang, X. and 
Gorospe, M. (2007). Analysis of turnover and translation regulatory RNA-binding protein 
expression through binding to cognate mRNAs. Mol Cell Biol 27, 6265-78. 
Ray, D., Kazan, H., Chan, E. T., Pena Castillo, L., Chaudhry, S., Talukder, S., Blencowe, B. 
J., Morris, Q. and Hughes, T. R. (2009). Rapid and systematic analysis of the RNA recognition 
specificities of RNA-binding proteins. Nat Biotechnol 27, 667-70. 
Rodgers, N. D., Jiao, X. and Kiledjian, M. (2002). Identifying mRNAs bound by RNA-binding 
proteins using affinity purification and differential display. Methods 26, 115-22. 
Sachs, A. B., Davis, R. W. and Kornberg, R. D. (1987). A single domain of yeast poly(A)-
binding protein is necessary and sufficient for RNA binding and cell viability. Mol Cell Biol 7, 
3268-76. 
Saint-Georges, Y., Garcia, M., Delaveau, T., Jourdren, L., Le Crom, S., Lemoine, S., Tanty, 
V., Devaux, F. and Jacq, C. (2008). Yeast mitochondrial biogenesis: a role for the PUF RNA-
binding protein Puf3p in mRNA localization. PLoS ONE 3, e2293. 
Sanchez-Diaz, P. and Penalva, L. O. (2006). Post-transcription meets post-genomic: the saga 
of RNA binding proteins in a new era. RNA Biol 3, 101-9. 
Sanchez, M., Galy, B., Hentze, M. W. and Muckenthaler, M. U. (2007). Identification of target 
mRNAs of regulatory RNA-binding proteins using mRNP immunopurification and microarrays. 
Nat Protoc 2, 2033-42. 
Sanford, J. R., Gray, N. K., Beckmann, K. and Caceres, J. F. (2004). A novel role for shuttling 
SR proteins in mRNA translation. Genes Dev 18, 755-68. 
Schullery, D. S., Ostrowski, J., Denisenko, O. N., Stempka, L., Shnyreva, M., Suzuki, H., 
Gschwendt, M. and Bomsztyk, K. (1999). Regulated interaction of protein kinase Cdelta with 
the heterogeneous nuclear ribonucleoprotein K protein. J Biol Chem 274, 15101-9. 
SenGupta, D. J., Zhang, B., Kraemer, B., Pochart, P., Fields, S. and Wickens, M. (1996). A 
three-hybrid system to detect RNA-protein interactions in vivo. Proc Natl Acad Sci U S A 93, 
8496-501. 
Stelzl, U. and Nierhaus, K. H. (2001). SERF: in vitro election of random RNA fragments to 
identify protein binding sites within large RNAs. Methods 25, 351-7. 
Tenenbaum, S. A., Carson, C. C., Lager, P. J. and Keene, J. D. (2000). Identifying mRNA 
subsets in messenger ribonucleoprotein complexes by using cDNA arrays. Proc Natl Acad Sci 
U S A 97, 14085-90. 
Thomson, A. M., Rogers, J. T., Walker, C. E., Staton, J. M. and Leedman, P. J. (1999). 
Optimized RNA gel-shift and UV cross-linking assays for characterization of cytoplasmic RNA-
protein interactions. Biotechniques 27, 1032-9, 1042. 
Townley-Tilson, W. H., Pendergrass, S. A., Marzluff, W. F. and Whitfield, M. L. (2006). 
Genome-wide analysis of mRNAs bound to the histone stem-loop binding protein. RNA 12, 
1853-67. 
Ule, J., Jensen, K. B., Ruggiu, M., Mele, A., Ule, A. and Darnell, R. B. (2003). CLIP identifies 
Nova-regulated RNA networks in the brain. Science 302, 1212-5. 
Ule, J., Stefani, G., Mele, A., Ruggiu, M., Wang, X., Taneri, B., Gaasterland, T., Blencowe, 
B. J. and Darnell, R. B. (2006). An RNA map predicting Nova-dependent splicing regulation. 
Nature 444, 580-6. 
Vassileva, M. T. and Matunis, M. J. (2004). SUMO modification of heterogeneous nuclear 
ribonucleoproteins. Mol Cell Biol 24, 3623-32. 
Wang, Y., Liu, C. L., Storey, J. D., Tibshirani, R. J., Herschlag, D. and Brown, P. O. (2002). 
Precision and functional specificity in mRNA decay. Proc Natl Acad Sci U S A 99, 5860-5. 
Washburn, M. P., Koller, A., Oshiro, G., Ulaszek, R. R., Plouffe, D., Deciu, C., Winzeler, E. 
and Yates, J. R., 3rd. (2003). Protein pathway and complex clustering of correlated mRNA and 
protein expression analyses in Saccharomyces cerevisiae. Proc Natl Acad Sci U S A 100, 3107-
12. 
Wilhelm, J. E. and Vale, R. D. (1996). A one-hybrid system for detecting RNA-protein 
interactions. Genes Cells 1, 317-23. 
Yu, M. C., Bachand, F., McBride, A. E., Komili, S., Casolari, J. M. and Silver, P. A. (2004). 
Arginine methyltransferase affects interactions and recruitment of mRNA processing and export 
factors. Genes Dev 18, 2024-35. 
Zeng, F., Peritz, T., Kannanayakal, T. J., Kilk, K., Eiriksdottir, E., Langel, U. and Eberwine, 
J. (2006). A protocol for PAIR: PNA-assisted identification of RNA binding proteins in living 
cells. Nat Protoc 1, 920-7. 
Zhong, X. Y., Wang, P., Han, J., Rosenfeld, M. G. and Fu, X. D. (2009). SR proteins in 
vertical integration of gene expression from transcription to RNA processing to translation. Mol 
Cell 35, 1-10. 
Zhu, H., Bilgin, M., Bangham, R., Hall, D., Casamayor, A., Bertone, P., Lan, N., Jansen, R., 
Bidlingmaier, S., Houfek, T. et al. (2001). Global analysis of protein activities using proteome 
chips. Science 293, 2101-5. 
Zielinski, J., Kilk, K., Peritz, T., Kannanayakal, T., Miyashiro, K. Y., Eiriksdottir, E., 
Jochems, J., Langel, U. and Eberwine, J. (2006). In vivo identification of ribonucleoprotein-
RNA interactions. Proc Natl Acad Sci U S A 103, 1557-62. 
 
 
Conclusions and Implications                                                                                                                                    6-1 
 
  
 
 
 
 
 
6           Conclusions and Perspectives 
 
 
 
CONTENTS OF CHAPTER 6 
 
6.1 Outline
6.2 Major Findings
6.2.1 Constraints imposed by transcriptional regulation on genome organization and regulatory network
6.2.2 Uncovering the functional landscape of a bacterial genome
6.2.3 Structure and dynamics of post-transcriptional networks controlled by RNA binding proteins
Implications and Future Directions
References
 
 
6.1 Outline 
An important notion emerging in post-genomic biology is that cellular components can be
visualized as a network of associations between different molecules such as proteins, DNA, RNA
and metabolites. This has led to the application of network theory and network-based
approaches to a wide range of biological problems, from understanding the regulation of gene
expression, to predicting gene function and phenotype, to drug discovery. In Chapter
1, I introduced the notion of networks and the basic principles of network biology, together with
an overview of the different kinds of networks that are widely studied in the biological sciences at
the systems level. For instance, in transcriptional and post-transcriptional networks,
trans-acting elements such as TFs, RBPs and sigma factors typically form one set of nodes, and
the target genes or RNAs whose activity they control form the other set. The
links between them, which run from the trans-acting elements to their target
genes through the cis-regulatory elements of those genes, form a complex, directed network of
interactions. In contrast, the functional linkage networks constructed in function prediction pipelines
are typically undirected: all nodes are treated essentially the same
and the links carry no directionality. These networks aim to uncover the broad
functional role of uncharacterized genes using the annotations of the already characterized
members to which they are connected. I then gave a brief overview of small-molecule protein
interaction networks, also referred to as drug-target networks, to illustrate the
generality and applicability of network-guided approaches in understanding biological
systems.
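The contrast between the two network types can be made concrete with a small sketch. It is illustrative only: the gene names, the annotations and the majority-vote rule are assumptions made for this example and are not data or methods taken from Chapter 1.

```python
from collections import Counter

# Directed regulatory network: each edge points from a trans-acting regulator
# (TF, RBP or sigma factor) to a target it controls. All names are hypothetical.
regulatory_edges = {
    "tfA": ["gene1", "gene2"],
    "tfB": ["gene2", "gene3"],
}

def regulators_of(target, edges):
    """Follow the directed edges backwards to list the regulators of a target."""
    return sorted(reg for reg, targets in edges.items() if target in targets)

# Undirected functional linkage network: all nodes are treated alike,
# so every link is stored in both directions.
functional_links = [("gene1", "orfX"), ("gene2", "orfX"),
                    ("gene3", "orfX"), ("gene3", "gene1")]

def build_undirected(links):
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    return neighbours

def predict_function(node, neighbours, annotations):
    """Guilt by association: most common annotation among annotated neighbours."""
    votes = Counter(annotations[n] for n in neighbours.get(node, ())
                    if n in annotations)
    return votes.most_common(1)[0][0] if votes else None

# Hypothetical annotations of already characterized genes.
annotations = {"gene1": "carbon metabolism",
               "gene2": "carbon metabolism",
               "gene3": "chemotaxis"}
neighbours = build_undirected(functional_links)
predicted = predict_function("orfX", neighbours, annotations)
```

In this toy case the uncharacterized node orfX is linked to two genes annotated with carbon metabolism and one annotated with chemotaxis, so the vote assigns it to carbon metabolism. Real pipelines weight links by confidence rather than counting them equally.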
Gene expression is a highly regulated process that is controlled at several levels. In
prokaryotes, control of gene expression occurs predominantly at the level of transcription, and
TFs play an important role in this process. In Chapter 2, I address how and why
genes are organized in a particular fashion on bacterial genomes, and what constraints
bacterial transcriptional regulatory networks impose on genome organization. I extend this
one step further to unravel the constraints imposed on the network of TF-TF interactions and
relate them to the numerous phenotypes they can impart to growing bacterial populations.
In contrast to prokaryotes, regulation of gene expression in eukaryotes is much more 
complex and is known to occur at many different levels even at the stage of transcription. In 
Chapter 3, I first present an overview of our current understanding of eukaryotic gene regulation 
at different levels and then present evidence for the existence of a higher-order organization of 
genes across and within chromosomes that is constrained by transcriptional regulation. These 
results demonstrate that a specific organization of genes across and within chromosomes, one that
allows efficient control of transcription within the nuclear space, has been selected during
evolution.
Determining the functions of proteins encoded by genome sequences represents a
major challenge in contemporary biology. With traditional methods for genome annotation
reaching saturation, there is an increasing need to develop alternative and complementary
approaches to the genomic function prediction challenge. As a result, alternative
computational methods for inferring protein function, such as those which exploit the context
of a protein in protein association networks, have come to be sought after. These network-based
approaches aim to integrate diverse kinds of functional interactions as a means of boosting
both the coverage and the confidence level of an association. In Chapter 4, I first present an overview
of different computational approaches for inferring the function of uncharacterized genes and
discuss network-based approaches currently employed for predicting function. I then summarize
a recent high-throughput study performed to provide a ‘systems-wide’ functional blueprint of the
bacterial model, Escherichia coli K-12, with insights into the biological and evolutionary
significance of previously uncharacterized proteins. Given the volume of high-throughput data
being reported for diverse model systems, the network-based approaches
presented here should be a useful addition for unravelling the functions of the increasing
number of uncharacterized proteins accumulating in genomic databases.
While control of gene expression in eukaryotes first occurs at the level of transcription,
there is accumulating evidence that an often neglected set of factors called RNA-binding
proteins (RBPs) play major roles in controlling the expression of a protein by regulating expression at
the post-transcriptional level. In Chapter 5, I attempt to provide a comprehensive overview of, and
preliminary insights into, this rapidly developing area of post-transcriptional regulatory networks
formed by RBPs. I discuss the sequence attributes and functional processes associated with
RBPs and the methods used to construct the networks formed by them, and finally discuss the
structure and dynamics of these post-transcriptional networks based on recent publicly available
data. The results obtained from this study show that RBPs exhibit distinct gene expression
dynamics compared to other classes of proteins in a eukaryotic cell and that these properties are
also reflected in an analysis of the post-transcriptional networks formed by them.
In the current chapter, I first summarize the key findings of all the previous chapters and
then discuss their broader implications.
6.2 Major Findings 
6.2.1 Constraints imposed by transcriptional regulation on genome 
organization and regulatory network  
In Chapter 2, using network-guided approaches for understanding the transcriptional regulatory
networks of bacteria, I show that there are at least two kinds of constraints. The first operates on
the network of transcriptional regulatory interactions between transcription factors (TFs): while
the mode of regulatory interaction between TFs is predominantly positive,
TFs are frequently negatively auto-regulated. Furthermore, feedback loops, regulatory motifs
and regulatory pathways are unevenly distributed in this network, with short pathways, multiple
feed-forward loops and negative auto-regulatory interactions being abundant in the sub-network
controlling metabolic functions such as the use of alternative carbon sources. In contrast, long
hierarchical cascades and positive auto-regulatory loops are over-represented in the
sub-networks controlling developmental processes such as biofilm formation and chemotaxis. Based on these
observations, I propose that these long transcriptional cascades, coupled with regulatory
switches (positive loops) for sensing external conditions, enable the coexistence of multiple
bacterial phenotypes in growing bacterial populations (Martinez-Antonio et al., 2008). The second
constraint is a link between the transcriptional hierarchy of regulons (TFs) and their
genome organization. In particular, I show that, to drive the kinetics and concentration gradients,
TFs belonging to big and small regulons (classified based on the number of genes they regulate
in the transcriptional network) organize themselves differently on the genome with respect to
their targets. Using data from independently reported studies in E. coli, I demonstrate that
the higher a TF is in the transcriptional hierarchy, the greater the number of its mRNA and
protein molecules detected per cell, reflecting its need to be expressed at higher concentrations to
regulate target genes dispersed across the chromosome. In contrast to TFs with big regulons, local
or dedicated TFs (lower in the network hierarchy and regulating far fewer genes) were found
to be expressed at much lower concentrations, which explains their proximity on the
chromosome to their target genes (Janga et al., 2009). These observations give insights into
how the scale-free structure of transcriptional networks can be encoded on the chromosome to
drive the kinetics and concentration gradients of TFs, depending on the number of genes they
regulate, and how this could facilitate the horizontal transfer of local environment-specific transcriptional
modules. I then propose a conceptual model based on these observations to explain how the
hierarchical structure of TRNs might ultimately be governed by the dynamic biophysical
requirements for targeting DNA-binding sites by transcription factors. These results suggest that
the main parameters defining the position of a TF in the network hierarchy are the number and
chromosomal distances of the genes it regulates and its protein concentration gradient.
These observations give insights into how the hierarchical structure of transcriptional networks
can be encoded on the chromosome to drive the kinetics and concentration gradients of TFs
depending on the number of genes they regulate. This could be a common theme valid for other
prokaryotes, pointing to a role for transcriptional regulation in shaping the organization of genes
on a chromosome.
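The network features summarized above (negative and positive auto-regulation, feed-forward loops) can be counted directly from a signed, directed edge list. The following is a minimal sketch on a hypothetical toy network; the TF names and edge signs are invented for illustration and do not come from the E. coli network analysed in Chapter 2.

```python
# Signed, directed TF-TF network: (regulator, target) -> +1 (activation)
# or -1 (repression). All names and signs are hypothetical.
edges = {
    ("tfA", "tfA"): -1,   # negative auto-regulation
    ("tfA", "tfB"): +1,
    ("tfA", "tfC"): +1,
    ("tfB", "tfC"): +1,   # closes the feed-forward loop tfA -> tfB -> tfC
    ("tfD", "tfD"): +1,   # positive auto-regulation (switch-like)
    ("tfD", "tfE"): +1,
    ("tfE", "tfF"): +1,   # part of a linear cascade tfD -> tfE -> tfF
}

def autoregulation(edges):
    """Split self-loops into negatively and positively auto-regulated TFs."""
    negative = sorted(r for (r, t), sign in edges.items() if r == t and sign < 0)
    positive = sorted(r for (r, t), sign in edges.items() if r == t and sign > 0)
    return negative, positive

def feed_forward_loops(edges):
    """Return (x, y, z) triples with x->y, y->z and x->z, ignoring self-loops."""
    targets = {}
    for regulator, target in edges:
        if regulator != target:
            targets.setdefault(regulator, set()).add(target)
    loops = []
    for x, x_targets in targets.items():
        for y in x_targets:
            for z in targets.get(y, ()):
                if z != x and z in x_targets:
                    loops.append((x, y, z))
    return sorted(loops)

negative_auto, positive_auto = autoregulation(edges)
ffls = feed_forward_loops(edges)
```

On this toy input the sketch finds one negatively auto-regulated TF (tfA), one positively auto-regulated TF (tfD) and a single feed-forward loop (tfA, tfB, tfC). Comparing such counts between sub-networks is the kind of bookkeeping that underlies the uneven distribution described above, although a proper analysis would also test the counts against randomized networks.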
In Chapter 3, extending these ideas to eukaryotic systems, I first describe our current
understanding of eukaryotic regulation at all three levels (the DNA sequence level, the
chromatin level and the level of nuclear organization) to reinforce the notion that regulation in
higher organisms is much more complex and needs intricate co-ordination of several molecular
events in space and time. I then present evidence, from an analysis of the currently known transcriptional
regulatory network of the single-celled model eukaryote Saccharomyces cerevisiae, for the
existence of a higher-order organization of genes across and within chromosomes that is
constrained by transcriptional regulation. In particular, I show that the target genes (TGs)
of transcription factors (TFs) in S. cerevisiae are encoded in a highly ordered
manner both across and within the 16 chromosomes: (i) the TGs of a majority of
TFs show a strong preference to be encoded on specific chromosomes, (ii) the TGs of a
significant number of TFs display a strong preference (or avoidance) to be encoded in regions
containing particular chromosomal landmarks such as telomeres and centromeres, and (iii) the
TGs of most TFs are positionally clustered within a chromosome (Janga et al., 2008). These
results demonstrate that a specific organization of genes allowing efficient control of
transcription within the nuclear space has been selected during evolution, which has led to the
constraints observed at the different levels reported in that chapter. Further analysis of human and
mouse TFs permitted us to show that the constraints are more general and are not limited
to yeast alone, suggesting that uncovering such higher-order organization of genes in other
eukaryotes will provide insights into nuclear architecture and will have implications for genetic
engineering experiments, gene therapy, and the understanding of disease conditions that involve
chromosomal aberrations.
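A preference of a TF's target genes for a particular chromosome, as in point (i), can be assessed with an over-representation test. The sketch below uses a one-sided hypergeometric test as an assumed stand-in; this section does not state which statistical procedure was used in Chapter 3, and the gene counts in the example are hypothetical.

```python
from math import comb

def hypergeom_sf(k, total_genes, genes_on_chrom, n_targets):
    """P(X >= k): probability that at least k of a TF's n_targets fall on a
    chromosome carrying genes_on_chrom of the total_genes, if targets were
    drawn at random without replacement."""
    upper = min(genes_on_chrom, n_targets)
    favourable = sum(
        comb(genes_on_chrom, i) * comb(total_genes - genes_on_chrom, n_targets - i)
        for i in range(k, upper + 1)
    )
    return favourable / comb(total_genes, n_targets)

# Hypothetical example: a genome of 6000 genes, 400 of them on one chromosome,
# and a TF with 50 targets of which 12 lie on that chromosome
# (about 3.3 would be expected by chance).
p_value = hypergeom_sf(12, 6000, 400, 50)
is_enriched = p_value < 0.001
```

Because one such test would be run for every TF and chromosome pair, the resulting p-values would need correction for multiple testing before any preference is called significant.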
6.2.2 Uncovering the functional landscape of a bacterial genome 
Determining the functions of proteins encoded by genome sequences represents a major 
challenge in contemporary biology. As of now, public databases report more than 1000 
completely sequenced genomes, with over 3700 genome projects underway, leading to a
situation where we know the location of the protein-coding genes on the genome
but have little idea what many of these protein machines do. Add to
this the sequencing of metagenomic samples, which currently number more than 100,
with Venter’s marine microbial community project alone contributing more than 6,000,000
proteins to the growing protein repertoire (Venter et al., 2004). All of this points
to the slow pace at which we are able to understand the protein repertoire of organisms
at the functional level, despite the rapid pace at which sequencing technologies generate
genome sequence data.
For instance, although E. coli is the most highly studied model bacterium, a recent
comprehensive community annotation effort for the fully sequenced reference K-12 laboratory
strains (Riley et al., 2006) indicated that only about half (~54%) of the protein-coding gene products of
E. coli currently have experimental evidence indicative of a biological role. The remaining genes
have either only generic, homology-derived functional attributes (e.g. ‘predicted DNA-binding’)
or no discernible physiological significance. In Chapter 4, I discuss a recent study in which we
attempted to characterize the one-third of the 4,225 protein-coding genes of Escherichia coli K-12
that remain functionally unannotated (functional orphans) (Hu et al., 2009). In particular, to
elucidate their biological roles, we performed an extensive proteomic survey using
affinity-tagged E. coli strains and generated comprehensive genomic context (GC) inferences to derive a
high-confidence compendium for virtually the entire proteome, consisting of 5,993 putative
physical interactions and 74,776 putative functional associations, most of which were novel. We
then clustered the respective probabilistic networks to reveal putative orphan membership in
discrete multiprotein complexes and functional modules, while a machine-learning strategy
based on network integration methods implicated the orphans in specific biological processes.
To uncover the functions of these orphans and to gain an understanding of their biological roles
in E. coli and its close relatives that complements traditional methods,
I highlight this resource in that chapter; it provides a ‘systems-wide’
functional blueprint of a model microbe, with insights into the biological and evolutionary
significance of previously uncharacterized proteins. The network-based methods developed and
the approach adopted in this study can be used not only for understanding the functions of
uncharacterized genes in other prokaryotic systems but also for identifying novel cellular
processes and the interplay between them, a fundamental goal of systems biology which is
at the moment rather under-appreciated.
Defining the precise biological roles and relationships of bacterial gene products in an 
often dynamically changing physiological context is a challenging proposition. Historically, 
systematic assessments of protein function in bacteria have tended to rely on molecular 
inferences based on sequence alignments and domain architectures, while experimental 
characterization has traditionally been driven by specific scientific interests rather than with the 
aim of providing the broader community with unbiased collections of functionally-related proteins 
and phenotypes. Since the biological role of a protein is not necessarily reflected in its primary 
sequence, the elucidation of molecular interaction networks can provide an alternate 
perspective even in the absence of detailed phenotypic data (Ideker and Sharan, 2008; Lee et 
al., 2008). Therefore, the notion of viewing a model microbial cell mechanistically as a series of 
modular molecular interaction networks that underlie the major biochemical processes that 
mediate cell homeostasis and proliferation provides a complementary understanding of 
biological systems with insights into the functional roles of proteins in the context of other 
cellular entities.  
Since the various methods used in this study discover different types of molecular
relationships and each has its own intrinsic bias, complementary information was obtained
through data integration. The limited overlap between the high-confidence physical and
functional interaction networks presumably stems in part from the incomplete coverage
typically achieved by high-throughput experiments and from their methodological differences
(Rajagopala et al., 2007; Yu et al., 2008). For example, certain orphans were difficult to evaluate
by genomic context (GC) methods due to a lack of apparent orthologs at medium-to-high evolutionary distances,
which hinders comparative genomic inferences. Likewise, although large-scale tandem affinity
tagging and purification was performed under near-native physiological conditions to generate
highly purified preparations of stable, endogenous multiprotein complexes, complete coverage
of the proteome was not achieved. For instance, a large number of membrane-associated
proteins, which require specialized solubilization procedures, were not purified, while the soluble
proteins that we failed to tag or detect by mass spectrometry were presumably either of very low
abundance or not expressed in our growth conditions.
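One simple way to see how integration boosts both coverage and confidence is to combine per-method scores for each protein pair. The sketch below uses a noisy-OR rule, 1 - prod(1 - p_i), which assumes the evidence sources are independent; this rule, the method labels and the scores are assumptions made for illustration and are not the machine-learning integration strategy used in the study.

```python
def integrate_scores(scores):
    """Noisy-OR combination of independent evidence probabilities."""
    remaining = 1.0
    for p in scores:
        if not 0.0 <= p <= 1.0:
            raise ValueError("scores must be probabilities in [0, 1]")
        remaining *= 1.0 - p   # probability that every source so far is wrong
    return 1.0 - remaining

# Hypothetical per-method confidence scores for two protein pairs.
evidence = {
    ("orfX", "gene1"): {"affinity_purification": 0.6, "gene_neighbourhood": 0.5},
    ("orfX", "gene2"): {"phylogenetic_profile": 0.3},
}

# Integrated network: one combined score per pair.
combined = {pair: integrate_scores(sources.values())
            for pair, sources in evidence.items()}
```

A pair supported by two moderate scores of 0.6 and 0.5 receives a combined score of 0.8, higher than either source alone, while a pair seen by only one method keeps its single score and is still retained in the integrated network.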
The observation that the intersection of functional genomics inferences with low-throughput
curated physical interaction data is somewhat higher might be explained in two
non-mutually exclusive ways: first, protein-protein interactions reported in the literature based on
traditional biochemical methods might be biased towards the most evolutionarily conserved
multiprotein complexes, which tend to be enriched for essential components with broadly
distributed phylogenetic profiles that are more easily and accurately predicted by GC methods
(like those of ribosomal proteins which form conserved clusters on the genome); second, the 
relatively high sensitivity of the two complementary forms of protein mass spectrometry used in 
this study may have resulted in the detection of lower abundance orphan proteins that have 
previously not been studied in depth.  
In general, the high-confidence functional relationships inferred for E. coli in this study
can be validated by independent experimental tests and can be extrapolated to other bacterial
species, including pathogens. In fact, over 35% of the orphans have orthologs as far away as
Archaea, and hence are likely associated with the same basic housekeeping processes we
predict for E. coli, such as formation of the cell wall and protein synthesis. Conversely, our
systematic comparisons also revealed some unique aspects of the orphans in the evolutionary
history of E. coli, such as the potential fimbrial factors that appear to be restricted to
Enterobacteriaceae. One interpretation is that orphans (and orphan groups) with limited
phylogenetic distributions in any major phyla contribute to the fine-tuning of adaptive physiological
responses to changing environmental conditions and hence might be responsible for as yet
uncharacterized processes in bacterial adaptation. Alternatively, some orphans might belong to
well-conserved biological systems whose functional roles still need to be characterized.
6.2.3 Structure and dynamics of post-transcriptional networks 
controlled by RNA binding proteins 
While transcription factors regulate the synthesis of RNA from specific genes in response to
different internal and external stimuli at the level of transcription, and several post-translational
modifications, such as phosphorylation by kinases and ubiquitination by ubiquitin ligases, are known to spatially
and temporally control the availability of functional protein products within the cell, little is known
about regulation at the post-transcriptional level and the major players involved in it. In contrast to
prokaryotes, where transcription and translation are coupled, in eukaryotes transcription usually
takes place in the nucleus and translation in the cytoplasm. This uncoupling of transcription and
translation provides an additional level of gene regulation at the post-transcriptional level in
eukaryotes. Although long ignored, the presence of this post-transcriptional control has
been evidenced by a number of post-genomic studies which showed that, in general, there is a
poor correlation between the mRNA and protein pools in eukaryotic cells (Greenbaum et al.,
2003; Gygi et al., 1999; Ideker et al., 2001). It is now increasingly recognized that this level is
controlled by numerous factors, with the major players being the RNA-binding proteins (RBPs)
(Glisovic et al., 2008; Keene, 2007; Mata et al., 2005). These observations suggest that
there is a need for intricate co-ordination of regulatory events from these three different layers
to finely control the flow of genetic information from genes to proteins in different conditions.
Indeed, changes in gene expression due to aberrations at any of these three levels have been
shown to be responsible for a number of disorders (Cookson et al., 2009; Cooper
et al., 2009; Feinberg and Tycko, 2004; Lukong et al., 2008; Nica and Dermitzakis, 2008).
In Chapter 5, I introduce RBPs, an important class of post-transcriptional regulators,
and show that they are key regulators of different steps in the metabolism of RNA in
eukaryotes, including splicing, polyadenylation and capping to produce mature mRNA, as well as the localization,
translation, stability and degradation of cellular RNAs. To regulate all these different steps of
RNA metabolism, RBPs bind to RNA and form ribonucleoprotein complexes (RNPs). RNPs are
inherently highly dynamic complexes due to their ability to associate with and dissociate from various
RBPs to mediate different steps of RNA metabolism. I then summarize, based on current
knowledge, that RBPs control almost all steps at the post-transcriptional level, with some RBPs
able to participate in multiple steps of a post-transcriptional regulatory cascade. I
also argue that the complex combinatorial interplay of different RBPs to integrate various
post-transcriptional events is an inherent property of these post-transcriptional controllers, as this
property enables them to fine-tune the availability of transcripts both spatially and temporally.
I then provide an overview of the recent developments in our understanding of the
repertoire of RBPs across diverse model systems and discuss the approaches currently
available for constructing the post-transcriptional networks governed by them. Following that,
I present, for the first time, an in-depth analysis of the properties of the post-transcriptional network
governed by RBPs and discuss a study in which we compared the expression
dynamics of RBPs with those of other protein-coding genes in yeast (Mittal et al., 2009). The analysis of
expression dynamics showed that RBPs are generally less stable at the transcript level but
exhibit higher stability and abundance at the protein level, demonstrating that they form a group
of proteins which follow the theoretically proposed time-averaging effect on noise propagation
(Paulsson, 2004): if a protein has a long half-life compared to its mRNA,
it averages over the noisy fluctuations in the mRNA, thereby decreasing protein
expression noise. These results also indicate that regulation of RBPs is predominantly
exerted at the protein level through a number of post-translational modifications
(PTMs) such as phosphorylation, arginine methylation and sumoylation, which have been reported to
occur in several well-studied RBPs (Schullery et al., 1999; Vassileva and Matunis, 2004; Yu et
al., 2004). Indeed, I also show that a comparison of the number of phosphorylated targets in
RBPs and non-RBPs reveals the predominance of post-translational control in RBPs. Based on
this, I suggest that a wide variety of these PTMs might be responsible for their ability to spatially
and temporally regulate transcripts in eukaryotic systems. It is possible to speculate from these 
observations that the low noise levels of RBPs, together with extensive regulatory flexibility at 
the protein level, might allow them to control gene expression at a finer level than 
transcriptional control by transcription factors. This might thereby provide a quick 
and extensive framework for controlling the expression of a wide range of genes. This is also 
supported by the observations presented in this chapter that RBPs which are central to 
the cell are not only required in large quantities but are also present for a longer 
time in the cell. 
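The time-averaging argument above can be illustrated with a minimal sketch. It assumes the 
standard two-stage (mRNA, then protein) model of gene expression, for which, to my 
understanding, Paulsson (2004) gives the squared coefficient of variation of the protein level 
as 1/<p> + (1/<m>) * tau_m / (tau_m + tau_p). The function name and all numerical values below 
are hypothetical and chosen only for illustration; they are not taken from the yeast data 
analysed in this work. 

```python
def protein_noise_cv2(mean_protein, mean_mrna, mrna_half_life, protein_half_life):
    """Squared coefficient of variation of protein level in a two-stage model.

    Assumed form (two-stage model, after Paulsson, 2004):
        CV^2 = 1/<p> + (1/<m>) * tau_m / (tau_m + tau_p)
    Half-lives can be used in place of mean lifetimes because only their
    ratio enters the time-averaging factor.
    """
    values = (mean_protein, mean_mrna, mrna_half_life, protein_half_life)
    if min(values) <= 0:
        raise ValueError("all arguments must be positive")
    # Intrinsic noise from the random birth and death of protein molecules.
    intrinsic = 1.0 / mean_protein
    # Fraction of mRNA noise that survives averaging over the protein lifetime.
    time_averaging = mrna_half_life / (mrna_half_life + protein_half_life)
    # mRNA noise (1/<m>) propagated to the protein level.
    propagated = time_averaging / mean_mrna
    return intrinsic + propagated


# Hypothetical numbers: same abundances, different protein half-lives (minutes).
short_lived = protein_noise_cv2(mean_protein=1000, mean_mrna=5,
                                mrna_half_life=20, protein_half_life=20)
long_lived = protein_noise_cv2(mean_protein=1000, mean_mrna=5,
                               mrna_half_life=20, protein_half_life=180)
print(f"CV^2, short-lived protein: {short_lived:.3f}")
print(f"CV^2, long-lived protein:  {long_lived:.3f}")
```

Under these assumptions, lengthening the protein half-life relative to that of the mRNA shrinks 
only the propagated term, which is the sense in which a stable protein averages over 
fluctuations in an unstable transcript. 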
Implications and Future Directions 
The observation that short regulatory pathways composed of multiple feed-forward 
loops with negative auto-regulatory interactions (of TFs) are abundant in the sub-network 
controlling metabolic functions, such as the use of alternative carbon sources in E. coli, 
indicates that free-living bacteria which can take up a wide range of sugars to 
adapt to diverse conditions must harbor a high number of such circuits as a means 
of switching between different carbon sources. Likewise, organisms living in extremely 
fluctuating environments might contain a higher number of long hierarchical cascades, 
providing them with developmental-like pathways so that a mixture of phenotypes 
can be generated to survive variations in conditions. Alternatively, network structure in 
such fluctuating environments might be complemented with longer cascades or even 
bifurcations or divisions in the already established circuits as a means of adding novelty to 
the existing developmental programs. Part of the plasticity in such extended network structures 
could come from the presence of multiple auto-regulatory TFs at different stages, so that 
decisions can be made at multiple stages, enhancing the number of phenotypes and hence the 
adaptive potential of microbes. Therefore, while variations in regulatory network topology might 
be expected, for instance in bacteria with asymmetric cell division (mostly 
alpha-proteobacteria), where the asymmetric daughter cells cause a transient genetic asymmetry 
that triggers different developmental processes, such as the formation of stalked and swarmer 
cells in Caulobacter or vegetative and spore-forming cells in Bacillus (Ausmees and Jacobs-Wagner, 
2003; Dworkin, 2003; Dworkin and Losick, 2001; Hilbert and Piggot, 2004; Yudkin and Clarkson, 
2005), future comparisons between network topologies for different model systems should 
further enhance our understanding of regulatory network organization and its conservation or 
variation among different bacterial phyla. 
 
It is now clear that regulatory networks are plastic, with TFs evolving faster than their target 
genes (Borneman et al., 2007; Hogues et al., 2008; Lozada-Chavez et al., 2006; Madan Babu et 
al., 2006; Tuch et al., 2008). For instance, TFs, TF families and global regulators have been 
shown not to be conserved between major groups of bacteria such as E. coli and B. 
subtilis, with different families being distinctly expanded in different lineages (Janga and 
Perez-Rueda, 2009; Lozada-Chavez et al., 2006; Madan Babu et al., 2006). This suggests that 
although the general topological properties of the regulatory network, such as power-law degree 
distribution and hierarchical organization, as well as the gene repertoire might be well 
conserved across organisms, variations might occur in the wiring of the 
interconnections between the TFs and their targets in different organisms. This implies 
that both genomic organization and architecture on the one hand, and TFs and their binding 
specificities and locations across the genome on the other, play major roles in enabling a 
significant rewiring of the network across organisms, as is evident from some recent studies 
(Borneman et al., 2007; De et al., 2009; Tuch et al., 2008). It is easy to imagine that this 
genomic and network rewiring together can explain the conservation of the constraints across 
genomes, suggesting that the observed constraints might be generic principles valid for all genomes. 
Nevertheless, it remains to be learnt whether this rewiring holds for higher eukaryotes and, if 
so, how fast it occurs and what factors might best explain the rewiring while preserving the constraints. 
Recent experimental data show that regulatory networks can be plastic even among members 
of a population, indicating that transcriptional control is much more dynamic in evolution than 
previously thought (Kasowski et al.; McDaniell et al.; Zheng et al.). Naturally, it follows that 
some types of TFs might show greater plasticity than others in closely related organisms 
or within individuals and hence might be major contributors to the rewiring of regulatory 
programs. Therefore, it would be interesting to understand the design principles underlying 
these variations. In light of these recent studies, an interesting open question 
is whether the rewiring of regulatory networks can be explained by phylogenetic distance 
or whether other factors like adaptation play more important roles, as has been seen in bacteria 
(Madan Babu et al., 2006). 
 Observations in Chapter 4 show that the genome-context and network-based methods 
developed here can be employed as a powerful means of automating functional prediction 
pipelines, so that any newly sequenced genome can be studied as soon as the gene coordinates 
are available. With the availability of metagenomic sequences, the power of these computational 
methods for generating functional association networks will increase not only in terms of 
coverage but also in terms of the quality of predicted associations, thereby increasing the 
quality of the function predictions. With the increasingly cheaper availability of RNA sequencing 
technologies, it should be possible to construct expression compendia for bacterial genomes 
as soon as genome sequences are available, thus enabling the addition of these high-throughput 
transcriptomic profiling data to function prediction pipelines, just as microarrays 
did in the past decade. Integration of all these high-throughput computational methods 
with genetic, physical and small-molecule perturbation experiments, which provide different 
kinds of associations in a condition-specific manner, should enable screening for 
the phenotypes of newly identified genes faster than was possible in the previous century and 
help elucidate their detailed functional inter-relationships with the rest of the cellular machinery. 
In a similar vein, the availability of in vivo crosslinking assays and cheaper sequencing 
should enable the identification of RNA targets of RBPs from different families at 
single-nucleotide resolution, which should in turn enable the elucidation of genome-wide RBP-RNA maps. 
Such high-density maps will not only permit an understanding of the mechanism of action of 
RBPs but also, when generated in a high-throughput manner for many RBPs, allow an 
understanding of their interplay in ribonucleoproteins (RNPs) to mediate different RNA 
processing events. In addition, such maps will allow variations in the binding of different 
RBPs to be compared across conditions and between cell types, so that tissue-specific variations 
and aberrations can be identified and exploited for therapeutic use. The data generated 
using these high-throughput techniques will also improve our understanding 
of the cross-talk between different post-transcriptional events and processes. Understanding the 
links between different layers of regulation can also improve our global understanding of 
regulatory processes, enabling better modeling of eukaryotic systems. 
In general, the vast amount of data that will be generated using next-generation 
technologies in the coming years will form a foundation not only to test many of the hypotheses 
generated during my doctoral work but also to improve our understanding of 
the interpretations of these constraints at different levels. Improved understanding of the design 
principles governing biological systems will improve our ability to model disease phenotypes in a 
larger context. Such developments will allow the development of disease treatment strategies 
using modern systems approaches (Janga and Tzakos, 2009). 
 
REFERENCES 
Ausmees, N. and Jacobs-Wagner, C. (2003). Spatial and temporal control of differentiation 
and cell cycle progression in Caulobacter crescentus. Annu Rev Microbiol 57, 225-47. 
Borneman, A. R., Gianoulis, T. A., Zhang, Z. D., Yu, H., Rozowsky, J., Seringhaus, M. R., 
Wang, L. Y., Gerstein, M. and Snyder, M. (2007). Divergence of transcription factor binding 
sites across related yeast species. Science 317, 815-9. 
Cookson, W., Liang, L., Abecasis, G., Moffatt, M. and Lathrop, M. (2009). Mapping complex 
disease traits with global gene expression. Nat Rev Genet 10, 184-94. 
Cooper, T. A., Wan, L. and Dreyfuss, G. (2009). RNA and disease. Cell 136, 777-93. 
De, S., Teichmann, S. A. and Babu, M. M. (2009). The impact of genomic neighborhood on 
the evolution of human and chimpanzee transcriptome. Genome Res 19, 785-94. 
Dworkin, J. (2003). Transient genetic asymmetry and cell fate in a bacterium. Trends Genet 19, 
107-12. 
Dworkin, J. and Losick, R. (2001). Differential gene expression governed by chromosomal 
spatial asymmetry. Cell 107, 339-46. 
Feinberg, A. P. and Tycko, B. (2004). The history of cancer epigenetics. Nat Rev Cancer 4, 
143-53. 
Glisovic, T., Bachorik, J. L., Yong, J. and Dreyfuss, G. (2008). RNA-binding proteins and 
post-transcriptional gene regulation. FEBS Lett 582, 1977-86. 
Greenbaum, D., Colangelo, C., Williams, K. and Gerstein, M. (2003). Comparing protein 
abundance and mRNA expression levels on a genomic scale. Genome Biol 4, 117. 
Gygi, S. P., Rochon, Y., Franza, B. R. and Aebersold, R. (1999). Correlation between protein 
and mRNA abundance in yeast. Mol Cell Biol 19, 1720-30. 
Hilbert, D. W. and Piggot, P. J. (2004). Compartmentalization of gene expression during 
Bacillus subtilis spore formation. Microbiol Mol Biol Rev 68, 234-62. 
Hogues, H., Lavoie, H., Sellam, A., Mangos, M., Roemer, T., Purisima, E., Nantel, A. and 
Whiteway, M. (2008). Transcription factor substitution during the evolution of fungal ribosome 
regulation. Mol Cell 29, 552-62. 
Hu, P., Janga, S. C., Babu, M., Diaz-Mejia, J. J., Butland, G., Yang, W., Pogoutse, O., Guo, 
X., Phanse, S., Wong, P. et al. (2009). Global functional atlas of Escherichia coli 
encompassing previously uncharacterized proteins. PLoS Biol 7, e96. 
Ideker, T. and Sharan, R. (2008). Protein networks in disease. Genome Res 18, 644-52. 
Ideker, T., Thorsson, V., Ranish, J. A., Christmas, R., Buhler, J., Eng, J. K., Bumgarner, R., 
Goodlett, D. R., Aebersold, R. and Hood, L. (2001). Integrated genomic and proteomic 
analyses of a systematically perturbed metabolic network. Science 292, 929-34. 
Janga, S. C., Collado-Vides, J. and Babu, M. M. (2008). Transcriptional regulation constrains 
the organization of genes on eukaryotic chromosomes. Proc Natl Acad Sci U S A 105, 15761-6. 
Janga, S. C. and Perez-Rueda, E. (2009). Plasticity of transcriptional machinery in bacteria is 
increased by the repertoire of regulatory families. Comput Biol Chem 33, 261-8. 
Janga, S. C., Salgado, H. and Martinez-Antonio, A. (2009). Transcriptional regulation shapes 
the organization of genes on bacterial chromosomes. Nucleic Acids Res 37, 3680-8. 
Janga, S. C. and Tzakos, A. (2009). Structure and organization of drug-target networks: 
insights from genomic approaches for drug discovery. Mol Biosyst 5, 1536-48. 
Kasowski, M., Grubert, F., Heffelfinger, C., Hariharan, M., Asabere, A., Waszak, S. M., 
Habegger, L., Rozowsky, J., Shi, M., Urban, A. E. et al. Variation in transcription factor 
binding among humans. Science 328, 232-5. 
Keene, J. D. (2007). RNA regulons: coordination of post-transcriptional events. Nat Rev Genet 
8, 533-43. 
Lee, I., Lehner, B., Crombie, C., Wong, W., Fraser, A. G. and Marcotte, E. M. (2008). A 
single gene network accurately predicts phenotypic effects of gene perturbation in 
Caenorhabditis elegans. Nat Genet 40, 181-8. 
Lozada-Chavez, I., Janga, S. C. and Collado-Vides, J. (2006). Bacterial regulatory networks 
are extremely flexible in evolution. Nucleic Acids Res 34, 3434-45. 
Lukong, K. E., Chang, K. W., Khandjian, E. W. and Richard, S. (2008). RNA-binding proteins 
in human genetic disease. Trends Genet 24, 416-25. 
Madan Babu, M., Teichmann, S. A. and Aravind, L. (2006). Evolutionary dynamics of 
prokaryotic transcriptional regulatory networks. J Mol Biol 358, 614-33. 
Martinez-Antonio, A., Janga, S. C. and Thieffry, D. (2008). Functional organisation of 
Escherichia coli transcriptional regulatory network. J Mol Biol 381, 238-47. 
Mata, J., Marguerat, S. and Bahler, J. (2005). Post-transcriptional control of gene expression: 
a genome-wide perspective. Trends Biochem Sci 30, 506-14. 
McDaniell, R., Lee, B. K., Song, L., Liu, Z., Boyle, A. P., Erdos, M. R., Scott, L. J., Morken, 
M. A., Kucera, K. S., Battenhouse, A. et al. Heritable individual-specific and allele-specific 
chromatin signatures in humans. Science 328, 235-9. 
Mittal, N., Roy, N., Babu, M. M. and Janga, S. C. (2009). Dissecting the expression dynamics 
of RNA-binding proteins in posttranscriptional regulatory networks. Proc Natl Acad Sci U S A 
106, 20300-5. 
Nica, A. C. and Dermitzakis, E. T. (2008). Using gene expression to investigate the genetic 
basis of complex disorders. Hum Mol Genet 17, R129-34. 
Paulsson, J. (2004). Summing up the noise in gene networks. Nature 427, 415-8. 
Rajagopala, S. V., Titz, B., Goll, J., Parrish, J. R., Wohlbold, K., McKevitt, M. T., Palzkill, T., 
Mori, H., Finley, R. L., Jr. and Uetz, P. (2007). The protein network of bacterial motility. Mol 
Syst Biol 3, 128. 
Riley, M., Abe, T., Arnaud, M. B., Berlyn, M. K., Blattner, F. R., Chaudhuri, R. R., Glasner, 
J. D., Horiuchi, T., Keseler, I. M., Kosuge, T. et al. (2006). Escherichia coli K-12: a 
cooperatively developed annotation snapshot--2005. Nucleic Acids Res 34, 1-9. 
Schullery, D. S., Ostrowski, J., Denisenko, O. N., Stempka, L., Shnyreva, M., Suzuki, H., 
Gschwendt, M. and Bomsztyk, K. (1999). Regulated interaction of protein kinase Cdelta with 
the heterogeneous nuclear ribonucleoprotein K protein. J Biol Chem 274, 15101-9. 
Tuch, B. B., Galgoczy, D. J., Hernday, A. D., Li, H. and Johnson, A. D. (2008). The evolution 
of combinatorial gene regulation in fungi. PLoS Biol 6, e38. 
Vassileva, M. T. and Matunis, M. J. (2004). SUMO modification of heterogeneous nuclear 
ribonucleoproteins. Mol Cell Biol 24, 3623-32. 
Venter, J. C., Remington, K., Heidelberg, J. F., Halpern, A. L., Rusch, D., Eisen, J. A., Wu, 
D., Paulsen, I., Nelson, K. E., Nelson, W. et al. (2004). Environmental genome shotgun 
sequencing of the Sargasso Sea. Science 304, 66-74. 
Yu, H., Braun, P., Yildirim, M. A., Lemmens, I., Venkatesan, K., Sahalie, J., Hirozane-Kishikawa, 
T., Gebreab, F., Li, N., Simonis, N. et al. (2008). High-quality binary protein 
interaction map of the yeast interactome network. Science 322, 104-10. 
Yu, M. C., Bachand, F., McBride, A. E., Komili, S., Casolari, J. M. and Silver, P. A. (2004). 
Arginine methyltransferase affects interactions and recruitment of mRNA processing and export 
factors. Genes Dev 18, 2024-35. 
Yudkin, M. D. and Clarkson, J. (2005). Differential gene expression in genetically identical 
sister cells: the initiation of sporulation in Bacillus subtilis. Mol Microbiol 56, 578-89. 
Zheng, W., Zhao, H., Mancera, E., Steinmetz, L. M. and Snyder, M. Genetic analysis of 
variation in transcription factor binding in yeast. Nature. 
 
 
Appendix                                                                                                                                                                    A-1 
 
 
APPENDIX 
for 
EXPLOITING NETWORK-BASED 
APPROACHES FOR UNDERSTANDING 
GENE REGULATION AND FUNCTION 
 
SARATH CHANDRA JANGA 
 
 
CONTENTS OF APPENDIX 
 
A.1 LIST OF PUBLICATIONS 
PUBLICATIONS DURING PHD (JANUARY 2008 - APRIL 2010) 
PUBLICATIONS UNDER REVIEW, REVISION AND IN PREPARATION 
PUBLICATIONS PRIOR TO STARTING PHD 
A.2 REPRINTS 
 
 
A.1 LIST OF PUBLICATIONS 
Publications during PhD (January 2008 - April 2010) 
* indicates corresponding author, either jointly or alone 
† indicates joint first author 
** indicates papers which are not discussed in the thesis but are appended as reprints here 
 
• Ten simple rules for organizing a scientific meeting ** 
Manuel Corpas, Nils Gehlenborg, Sarath Chandra Janga and Philip E Bourne 
      PLoS Comput Biol 2008 4(6):e1000080 
 
• Highlights from the Fourth International Society for Computational Biology Student Council 
Symposium ** 
Lucia Peixoto, Nils Gehlenborg and Sarath Chandra Janga  
      BMC Bioinformatics, 2008 
 
• Functional organization of Escherichia coli  transcriptional regulatory network 
Agustino Martinez-Antonio, Sarath Chandra Janga and Denis Thieffry  
      Journal of Molecular Biology, 2008, Vol. 381(1):238-247 
 
• Transcriptional regulation constrains the organization of genes on eukaryotic chromosomes 
Sarath Chandra Janga*, Julio Collado-Vides and M. Madan Babu 
Proc. Natl. Acad. Sci. U S A. 105(41): 15761-6, 2008 
+Featured on the news and highlights section of the journal Molecular Biosystems 
 
• Eukaryotic gene regulation in three dimensions and its impact on genome evolution 
M. Madan Babu, Sarath Chandra Janga, Ines Santiago and Ana Pombo 
      Curr. Opin. Genet. Dev., 2008, Vol. 18(6):571-582 
      +Featured on the cover page of the issue with a cover image 
 
• Network-based approaches for linking metabolism with environment ** 
Sarath Chandra Janga* and M. Madan Babu  
      Genome Biology, 2008, 9(11):239  
      +Featured on the journal’s home page 
 
• Transcript stability in the protein interaction network of Escherichia coli ** 
Sarath Chandra Janga* and M. Madan Babu 
      Molecular Biosystems, 2009, 5(2):154-62 
      +Featured on the cover page of the issue with a cover image 
 
• Transcriptional regulation shapes the organization of genes on bacterial chromosomes 
Sarath Chandra Janga*, Heladia Salgado and Agustino Martinez-Antonio  
      Nucleic Acids Research, 2009, Vol.37, No. 11, 3680-3688 
 
• Global functional atlas of Escherichia coli encompassing previously uncharacterized 
proteins 
Pingzhao Hu†, Sarath Chandra Janga†, Mohan Babu†, J. Javier Díaz-Mejía†, Gareth 
Butland†, et al. 
PLoS Biology 2009, 7(4): e96 
+Featured in the journal ‘Nature Methods’ 
 
• Scaling relationship in the gene content of transcriptional machinery in bacteria ** 
Ernesto Perez-Rueda, Sarath Chandra Janga* and Agustino Martinez-Antonio  
      Molecular Biosystems, 2009, 5(12):1494-501 
 
• Plasticity of transcriptional machinery in bacteria is increased by the repertoire of regulatory 
families ** 
Sarath Chandra Janga* and Ernesto Perez-Rueda 
Computational Biology and Chemistry, 2009, Vol. 33, No. 4, 261-268  
 
• Structure and organization of drug-target networks : Insights from genomic approaches for 
drug discovery ** 
Sarath Chandra Janga* and Andreas Tzakos  
      Molecular Biosystems, 2009, 5(12):1536-48 
 
• Interfacing systems biology and synthetic biology ** 
Allyson Lister, Varodom Charoensawan, Subhajyoti De, Katherine James, Sarath Chandra 
Janga and Julian Huppert 
      Genome Biology, 2009, 10(6):309 
  
• Dissecting the expression dynamics of RNA-binding proteins in post-transcriptional 
regulatory networks  
Nitish Mittal, Nilanjan Roy, M. Madan Babu and  Sarath Chandra Janga* 
      Proc. Natl. Acad. Sci. U S A. 106(48): 20300-05, 2009 
 
• Protein Complexes and Functional Pathways in S. cerevisiae and E. coli 
Mohan Babu,  Gareth Butland, J. Javier Díaz-Mejía, Pingzhao Hu, S Pu, Gabriel Moreno-
Hagelsieb, Sarath Chandra Janga, Shoshana Wodak, Andrew Emili, Jack Greenblatt  
      Mol Cell Proteomics, S27-27, 2009 (Meeting Abstract)  
 
• Identification and genomic analysis of transcription factors in archaeal genomes exemplifies 
their functional architecture and evolutionary origin ** 
Ernesto Perez-Rueda and  Sarath Chandra Janga* 
      Mol Biol Evol.,  2009. 27(4): 1-11, 2010 

• Transcriptional regulatory networks 
Book chapter for the edited book ’Networks in Cell Biology’, 2010 for Cambridge University 
Press (Ed: Michele Vendruscolo, Department of Chemistry, University of Cambridge) 
Sarath Chandra Janga* and M. Madan Babu 
 
• Operons and bacterial genome organization ** 
Book chapter for the edited book “Bacterial Gene Regulation and Transcriptional Networks”, 
2010 for Horizon Scientific Press (Ed: M. Madan Babu, MRC Laboratory of Molecular 
Biology, University of Cambridge) 
Sarath Chandra Janga* and Gabriel Moreno-Hagelsieb 
 
• Construction, structure and dynamics of post-transcriptional networks directed by RNA-
binding proteins 
Book chapter for the edited book “RNA infrastructure: RNA processing and regulatory 
networks”, 2010 for Springer/Landes Bioscience Press (Ed: Lesley Collins, Allan Wilson 
Centre for Molecular Biology and Ecology & Institute of Molecular BioSciences, Massey 
University) 
      Sarath Chandra Janga* and Nitish Mittal  

Publications under review, revision and in preparation 
• Coordination of bacterial transcription and the role of the RNA polymerase omega subunit 
Marcel Geertz, Andrew Travers, Sarath Chandra Janga, Sanja Mehandziska, Nobuo 
Shimamoto and Georgi Muskhelishvili 
      Mol. Microbiology, 2009 (Submitted) 
 
• Network-based function prediction in post-genomic era : Metabolic enzymes as a case study 
Sarath Chandra Janga* and Gabriel Moreno-Hagelsieb 
      Metabolic Engineering, 2010 (Submitted) 
 
• Dissecting the expression patterns of transcription factors across conditions using an 
integrated network-based approach 
Sarath Chandra Janga* and Bruno Contreras-Moreira  
      Nucleic Acids Research, 2010 (Submitted) 
 
• Polypharmacological approaches to fight antibacterial resistance : Insights from drug-target 
networks  
Sarath Chandra Janga* and Andreas Tzakos  
      Trends in Biotechnology (Submitted) 
 
• Transcriptional profiling of fetal hypothalamic TRH neurons 
Magdalena Guerra-Crespo, Carlos Pérez-Monter, Sarath Chandra Janga, Santiago 
Castillo-Ramírez, Rosa Maria Gutierrez-Rios, Patricia Joseph-Bravo, Leonor Pérez-Martínez 
and Jean-Louis Charli  
      BMC Genomics, 2010 (Submitted) 
 
• Genome-wide analysis of RNA decay patterns during early Drosophila development 
Stefan Thomsen, Simon Anders, Sarath Chandra Janga, Wolfgang Huber and Claudio R. 
Alonso 
      Genome Biology, 2010 (Submitted) 
 
• Systematic identification of RNA-binding proteins in yeast suggests dual functions for 
enzymes 
Tanja Scherrer, Nitish Mittal, Sarath Chandra Janga, André P Gerber 
      Nature Molecular Systems Biology (Submitted) 
 
• Intrinsic modularity in the genomic organization of eubacterial transcription factors 
Sarath Chandra Janga* and Gabriel Moreno-Hagelsieb  
      (To be Submitted) 
 
• Dissecting the interactome of RNA-binding proteins in post-transcriptional regulatory 
networks 
Sarath Chandra Janga*, Nitish Mittal and M. Madan Babu  
       (To be Submitted) 
 
Publications prior to starting PhD 
• Conservation of adjacency as evidence of paralogous operons 
Sarath Chandra Janga and Gabriel Moreno-Hagelsieb   
Nucleic Acids Research, 2004 Vol.32, No. 18, 5392-5397 
 
• Nebulon: a system for the inference of functional relationships of gene products from the 
rearrangement of predicted operons  
Sarath Chandra Janga, Julio Collado-Vides and Gabriel Moreno-Hagelsieb  
Nucleic Acids Research, 2005 Vol.33, No. 8, 2521-2530 
 
• The network of transcriptional interactions imposes linear constrains in the genome 
Ricardo Menchaca-Mendez†, Sarath Chandra Janga† and Julio Collado-Vides 
OMICS: A Journal of Integrative Biology Jun 2005, Vol.9, No. 2: 139-145 
 
• Internal sensing machinery directs the activity of the regulatory network in Escherichia coli 
Agustino Martínez-Antonio, Sarath Chandra Janga, Heladia Salgado and Julio Collado-
Vides 
Trends in Microbiology, 2006 Vol.14, No. 1, 22-27 
 
• The Partitioned Rhizobium etli Genome: Genetic and Metabolic Redundancy in Seven 
Interacting Replicons 
Víctor González, Rosa I. Santamaría, Patricia Bustos, Ismael Hernández-González, Arturo 
Medrano-Soto, Gabriel Moreno-Hagelsieb, Sarath Chandra Janga, Miguel A. Ramírez, 
Verónica Jiménez-Jacinto, Julio Collado-Vides and Guillermo Dávila  
Proc. Natl. Acad. Sci. U S A. 103(10): 3834-9, 2006 
 
• The distinctive signatures of promoter regions and operon junctions across Prokaryotes  
Sarath Chandra Janga*, Warren F. Lamboy, Araceli M. Huerta and Gabriel Moreno-
Hagelsieb 
Nucleic Acids Research, 2006 Vol.34, No. 14, 3980-3987 
 
• Bacterial regulatory networks are extremely flexible in evolution  
Irma Lozada-Chávez, Sarath Chandra Janga* and Julio Collado-Vides 
Nucleic Acids Research, 2006 Vol.34, No. 12, 3434-3445 
+Featured in the list of hot research papers on NAR website 
 
• Identification and analysis of DNA-binding Transcription Factors in Bacillus subtilis and other 
Firmicutes- A genomic approach  
Samadhi Moreno-Campuzano, Sarath Chandra Janga and Ernesto Perez-Rueda 
BMC Genomics. 2006 Jun 13;7(1):147 
+Accessed over 800 times in less than 5 months according to BMC report 
 
• Prediction and evolution of transcription factors and their evolutionary families in prokaryotes  
Sarath Chandra Janga* 
BMC Systems Biology, 2007, 1(Suppl 1):P3 (tutorial presentation as part of the proceedings 
of the BioSysBio conference) 
 
• Internal versus external effector and transcription factor gene pairs differ in their relative 
chromosomal position in Escherichia coli. 
Sarath Chandra Janga*, Heladia Salgado, Julio Collado-Vides and Agustino Martinez-
Antonio 
Journal of Molecular Biology, 2007 Vol. 368(1):263-72 
 
• Operons and the effect of genome redundancy in deciphering functional relationships using 
phylogenetic profiles  
Gabriel Moreno-Hagelsieb and Sarath Chandra Janga 
      Proteins: Structure, Function, and Bioinformatics, 2007 70(2):344-352 
 
• Conservation of Transcriptional Sensing Systems in prokarya: A perspective from 
Escherichia coli 
Heladia Salgado, Agustino Martinez-Antonio and Sarath Chandra Janga* 
FEBS Letters, 2007 Vol. 581:3499-3506 
 
• Structure and evolution of gene regulatory networks in microbial genomes  
Sarath Chandra Janga* and Julio Collado-Vides 
      Research in Microbiology, 2007 158(10):787-94   
      +Featured on the cover page of the issue & selected as the journal’s representative image 
 
• Co-ordination logic of sensing machinery in the transcriptional regulatory network of 
Escherichia coli 
Sarath Chandra Janga*, Heladia Salgado, Agustino Martinez-Antonio and Julio Collado-
Vides  
Nucleic Acids Research, 2007 Vol.35, No. 20, 6963-6972 
 
• Highlights from the Third International Society for Computational Biology Student Council 
Symposium 
Nils Gehlenborg, Manuel Corpas and Sarath Chandra Janga*  
      BMC Bioinformatics, 2007, 8(Suppl 8):I1 
 
A.2 REPRINTS 
(Please see next pages for publications not discussed in the thesis)  
Editorial
Ten Simple Rules for Organizing a Scientific Meeting
Manuel Corpas1, Nils Gehlenborg1,2, Sarath Chandra Janga3, Philip E. Bourne4*
1 European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, United Kingdom, 2Graduate School of Life Sciences, University of Cambridge,
Cambridge, United Kingdom, 3Medical Research Council–Laboratory of Molecular Biology, University of Cambridge, Cambridge, United Kingdom, 4 Skaggs School of
Pharmacy and Pharmaceutical Science, University of California San Diego, La Jolla, California, United States of America
Scientific meetings come in various flavors, from one-day focused workshops of 1–20 people to 
large-scale multiple-day meetings of 1,000 or more delegates, including keynotes, sessions, 
posters, social events, and so on. These ten rules are intended to provide insights into 
organizing meetings across the scale.

Scientific meetings are at the heart of a scientist’s professional life since they provide an 
invaluable opportunity for learning, networking, and exploring new ideas. In addition, meetings 
should be enjoyable experiences that add exciting breaks to the usual routine in the laboratory. 
Being involved in organizing these meetings later in your career is a community responsibility. 
Being involved in the organization early in your career is a valuable learning experience [1]. 
First, it provides visibility and gets your name and face known in the community. Second, it is 
useful for developing essential skills in organization, management, team work, and financial 
responsibility, all of which are useful in your later career. Notwithstanding, it takes a lot of 
time, and agreeing to help organize a meeting should be considered in the context of your need 
to get your research done and so is also a lesson in time management. What follows are the 
experiences of graduate students in organizing scientific meetings with some editorial oversight 
from someone more senior (PEB) who has organized a number of major meetings over the years.

The International Society for Computational Biology (ISCB) Student Council [2] is an 
organization within the ISCB that caters to computational biologists early in their career. The 
ISCB Student Council provides activities and events to its members that facilitate their 
scientific development. From our experience in organizing the Student Council Symposium [3,4], a 
meeting that so far has been held within the context of the ISMB [5,6] and ECCB conferences, we 
have gained knowledge that is typically not part of an academic curriculum and which is embodied 
in the following ten rules.
Rule 1: The Science Is the Most
Important Thing
Good science, above all else, defines a
good meeting; logistics are important, but
secondary. Get the right people there,
namely the best in the field and those who
will be the best, and the rest will take care
of itself. When choosing a topic for your
conference, map it to the needs of your
target audience. Make sure that you have
a sufficiently wide range of areas, without
being too general. The greater the number
of topics covered, the more likely people
are to come, but the less time you will have
to focus on particular subject matter.
Emerging areas can attract greater inter-
est; try to include them in your program as
much as possible; let your audience decide
the program through the papers they
submit to the general call for papers. This
can be done with broad and compelling
topic areas such as “Emerging Trends in
…” or “New Developments in …”.
Rule 2: Allow for Plenty of
Planning Time
Planning time should range from nine
months to more than a year ahead of the
conference, depending on the size of your
event. Allow plenty of time to select your
meeting venue; to call for, review, and
accept scientific submissions; to arrange
for affordable/discounted hotel rooms; to
book flights and other transportation
options to the conference. Having outstanding
keynote speakers at your event
will also require you to contact them months
in advance; the bigger the name, the
more time is required.
Rule 3: Study All Potential
Financial Issues Affecting Your
Event
Sponsors are usually your primary
source of funds, next to the delegates’
registration fees. To increase the chances
of being sponsored by industry, write them
a clear proposal stating how the money
will be spent and what benefits they can
expect to get in return. You may also want
to reserve a few time slots for industry talks
or demos as a way of attracting more
sponsors, but be wary that the scientific
flavor of the meeting is not impacted by
blatant commercialism. Make sure you
first approach the sponsors that match
your interest topics the closest. If they say
they are not interested this year, keep their
contact information, as they might be able
to sponsor you in future events. In any case,
approach them sooner rather than later.
The cost of your conference will be
proportional to the capacity of the venue;
therefore, a good estimation of the number
of attendees will provide you with a good
estimate of your costs. You will need to
include meals and coffee breaks together
with the actual cost of renting your venue.
Be aware that audiovisual costs can be
additional as well as venue staff—look out
for hidden costs. Aside from venue-related
costs, additional expenditures might in-
clude travel fellowships, publication costs
for proceedings in a journal, and awards
for outstanding contributors. All these
issues will determine how much you need
to charge your participants to attend. Map
all this out on a spreadsheet and do the
math. Allow for contingencies, such as
currency fluctuations and world-changing
events that will impact attendance. For
large meetings, consider insurance against
such events. Starting with a template that
others have used for previous similar
conferences can be a big help.
Published June 27, 2008
Copyright: © 2008 Corpas et al. This is an open-access article distributed under the terms of the Creative
Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium,
provided the original author and source are credited.
Funding: The authors have received no specific funding for this article.
Competing Interests: The authors have declared that no competing interests exist.
* E-mail: bourne@scsd.edu
Citation: Corpas M, Gehlenborg N, Janga SC, Bourne PE (2008) Ten Simple Rules for Organizing a Scientific
Meeting. PLoS Comput Biol 4(6): e1000080. doi:10.1371/journal.pcbi.1000080
PLoS Computational Biology | www.ploscompbiol.org 1 June 2008 | Volume 4 | Issue 6 | e1000080
Rule 4: Choose the Right Date
and Location
Your conference needs to be as far away
as possible from established conferences
and other related meetings. Alternatively,
you may want to organize your event
around a main conference, in the form of
a satellite meeting or Special Interest
Group (SIG). Teaming up with established
conferences may increase the chances of
attracting more people (especially if this is
your first time) and also save you a great
deal of administrative work. If you decide
to do it on your own, you should consider
how easy it is to travel to your chosen
location, whether it has a strong local
community in your field, and whether it
has cultural or other tourist attractions.
Inexpensive accommodation and airfares
to your conference are always a plus.
Rule 5: Create a Balanced
Agenda
A conference is a place for people
wanting to share and exchange ideas.
Having many well-known speakers will
raise the demand for your event (and the
cost) but that has to be balanced with
enough time for presentation of submitted
materials. A mix of senior scientists and
junior scientists always works for the better.
Young researchers may be more enthusias-
tic and inspiring for students, while top
senior scientists will be able to present a
more complete perspective of the field.
Allow plenty of time for socializing, too;
breaks, meals, and poster sessions are ideal
occasions to meet potential collaborators
and to foster networking among peers.
Rule 6: Carefully Select Your
Key Helpers: the Organizing
Committees
A single person will not have all the
skills necessary to organize a large meet-
ing, but the organizing committee collec-
tively needs to have the required expertise.
You might want to separate the areas of
responsibilities between your aides de-
pending on their interests and availability.
Some potential responsibilities you might
delegate are: 1) content and design of the
Web site promoting the meeting; 2)
promotion materials and marketing; 3)
finance and fundraising; 4) paper submis-
sions and review; 5) posters; 6) keynotes; 7)
local organization; 8) program and speak-
ers; 9) awards. Your organizing committee
should be large enough to handle all the
above but not too large, avoiding free-
loaders and communication issues. It is
invaluable to have a local organizing
committee since they know local institu-
tions, speakers, companies, and tourist
attractions. Local organizations may also
help you with administrative tasks; for
example, dealing with registration of
attendees and finding suitable accommo-
dations around the venue.
Rule 7: Have the Members of
the Organizing Committees
Communicate Regularly
It is good to have planning sessions by
teleconference ahead of the meeting. As
far as possible, everyone should be familiar
with all aspects of the meeting organiza-
tion. This collective wisdom will make it
less likely that important issues are forgot-
ten. The local organizers should convince
everyone that the venue will work. Use
these sessions to assign responsibilities
ahead of the meeting. Tasks such as
manning the registration tables, carrying
microphones for attendees to ask ques-
tions, introducing sessions and speakers,
checking presentations ahead of time, and
having poster boards, materials to attach
posters, etc., are easily overlooked. In
short, good communication will lead to
you covering all the little things so easily
forgotten.
Good communication continues
throughout the meeting. All organizers
should be able to contact each other
throughout the meeting via mobile phone
and e-mail. Distribute to all organizers the
names and contact information of caterers,
building managers, administrative person-
nel, technicians, and the main conference
organizer if you are having your event as
part of another conference. Onsite chang-
es that incur additional costs, however,
should require the approval of a single, key
organizer rather than all organizers oper-
ating independently of one another. This
will ensure there are no financial surprises
in the end. It is also important that you
have a designated meeting point where
someone from the organizing committee is
going to be available at all times to help
with problems.
Rule 8: Prepare for Emergencies
Attendees need to be aware of all
emergency procedures in terms of evacu-
ation, etc. This should be discussed with
the venue managers. All attendees should
be reachable as far as possible during the
conference. If an attendee has an emer-
gency at home, his or her family should be
able to reach them through the conference
desk—mobile phones are not perfect after
all.
Rule 9: Wrap Up the Conference
Properly
At the end of the conference, you
should give credit to everyone who helped
to make the event a success. If you have
awards to present, this is the right time for
the awards ceremony. Dedicate some time
to thank your speakers and sponsors as
well as everyone involved in the organiza-
tion of the conference. Also collect feed-
back about the event from the delegates
through questionnaires. This evaluation
will help you to understand the strengths
and weaknesses of your conference and
give you the opportunity to improve
possible future events. Have a party or
some other event for all those organizing
the conference.
Rule 10: Make the Impact of
Your Conference Last
Published proceedings are the best way
to make the results of your conference last.
Negotiate with journals far in advance of
the conference to publish the proceedings.
Make those proceedings as widely acces-
sible as possible. Upload photos and videos
of the event to the conference Web site
and post the names of presenters who have
received awards or travel fellowships. It is
also a good idea to link the results of your
evaluation to the Web site. Send one last
e-mail to all delegates, including a sum-
mary of the activities since the conference
and thanking them for their participation.
This is particularly important if you are
considering holding the conference again
in future years, in which case include some
information on your plans for the next
event.
As always, we welcome your comments
and experiences that you think would
enrich these ten rules so that they might
be useful to others. The comment feature
now supported by this journal makes it
easy to do this.
Acknowledgments
We would like to acknowledge the International
Society for Computational Biology (ISCB) for
their support in the organization of the Student
Council Symposiums, in particular BJ Morri-
son-McKay and Steven Leard. Thanks to
Michal Linial and Rita Casadio (our liaisons
at the ISCB Board of Directors), Burkhard Rost
(the ISCB President), and all the ISCB Board of
Directors for being so supportive of the work of
the Student Council. We are also grateful to all
the Student Council leadership and current and
past Student Council members for their enthu-
siasm and hard (unpaid) work. You all have
made the Student Council a great organization.
References
1. Tomazou EM, Powell GT (2007) Look who’s
talking, too: Graduates developing skills through
communication. Nat Rev Genet 8: 724–726.
doi:10.1038/nrg2177.
2. The International Society for Computational
Biology Student Council. Available: http://www.iscbsc.org.
Accessed 22 April 2008.
3. Corpas M (2005) Scientists and societies. Nature
436: 1204. doi:10.1038/nj7054-1204b.
4. Gehlenborg N, Corpas M, Janga SC (2007)
Highlights from the Third International Society
for Computational Biology (ISCB) Student Coun-
cil Symposium at the Fifteenth Annual Interna-
tional Conference on Intelligent Systems for
Molecular Biology (ISMB). BMC Bioinformatics
8 (Supplement 8):I1.
5. Lengauer T, McKay BJM, Rost B (2007) ISMB/
ECCB 2007: The premier conference on com-
putational biology. PLoS Comput Biol 3: e96.
doi:10.1371/journal.pcbi.0030096.
6. Third ISCB Student Council Symposium. Available:
http://www.iscbsc.org/scs3. Accessed 22
April 2008.
Genome Biology 2008, 9:239
Minireview
Network-based approaches for linking metabolism with environment
Sarath Chandra Janga and M Madan Babu
Address: MRC Laboratory of Molecular Biology, Hills Road, Cambridge CB2 0QH, UK.
Correspondence: Sarath Chandra Janga. Email: sarath@mrc-lmb.cam.ac.uk
Abstract
Progress in the reconstruction of genome-wide metabolic maps has led to the development of
network-based computational approaches for linking an organism with its biochemical habitat.
Published: 24 November 2008
Genome Biology 2008, 9:239 (doi:10.1186/gb-2008-9-11-239)
The electronic version of this article is the complete one and can be
found online at http://genomebiology.com/2008/9/11/239
© 2008 BioMed Central Ltd 
The sequential nature of the reactions in metabolic pathways
means that they can be modeled in the form of a graph
(network) of enzymes and chemical transformations, and
network theory can be used to represent and understand
metabolism [1,2]. The connected collection of metabolic
pathways, describing the set of all enzymatic
interconversions of one small molecule into another, is defined as
the metabolic network of an organism (Figure 1a).
The most commonly used network representations are
‘metabolite-centric’. They consider metabolites as the nodes
of the graph and two metabolites are linked if one can be
converted into the other by an enzymatic reaction (Figure 1b,
left). An alternative network representation is ‘enzyme-centric’.
It considers the enzymes as nodes and links enzymes that
catalyze successive reactions (Figure 1b, right). Although
several studies have provided insights into the structure and
evolution of a metabolic network, very few have addressed
the influence of environment on metabolic network struc-
ture in species from diverse environmental conditions. The
availability of many completely sequenced genomes means
that metabolic-network analysis can now be extended from a
few model organisms to species from different branches of
the tree of life and living in very different environments. This
should enable the elucidation of general principles
underlying metabolic networks.
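The two representations described above can be made concrete with a short sketch. The Python below is illustrative only: the four reactions are a toy set loosely modeled on Figure 1a (one substrate and one product per enzyme is an assumption made for simplicity), and the names `REACTIONS`, `metabolite_centric` and `enzyme_centric` are hypothetical helpers rather than part of any published tool.

```python
# Toy reactions (assumed for illustration): enzyme -> (substrate, product).
REACTIONS = {
    "E1": ("M1", "M2"),
    "E2": ("M2", "M3"),
    "E3": ("M3", "M4"),
    "E4": ("M2", "M4"),
}


def metabolite_centric(reactions):
    """Metabolites are nodes; an edge A -> B means some enzyme converts A into B."""
    graph = {}
    for substrate, product in reactions.values():
        graph.setdefault(substrate, set()).add(product)
        graph.setdefault(product, set())  # keep end products as nodes too
    return graph


def enzyme_centric(reactions):
    """Enzymes are nodes; an edge E -> F means F consumes the product of E."""
    graph = {enzyme: set() for enzyme in reactions}
    for upstream, (_, product) in reactions.items():
        for downstream, (substrate, _) in reactions.items():
            if upstream != downstream and product == substrate:
                graph[upstream].add(downstream)
    return graph


if __name__ == "__main__":
    print(metabolite_centric(REACTIONS))
    print(enzyme_centric(REACTIONS))
```

In this toy set, M2 is converted into both M3 and M4, so in the enzyme-centric view E1 is linked to the two enzymes (E2 and E4) that catalyze the successive reactions.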
Two recent studies, published in the Proceedings of the
National Academy of Sciences by Eytan Ruppin and colleagues
(Kreimer et al. [3] and Borenstein et al. [4]), provide
important insights into links between the environment of an
organism and the structure of its metabolic network. Using
data from a large number of bacterial metabolic networks,
Kreimer et al. address the question of how the topologies of
the metabolic networks from different species reflect both
genome size and the diversity of environmental conditions
the species would encounter. Borenstein et al. set out to
identify the ‘seed set’ - that set of small molecules that are
absolutely needed from the external environment - of each
species and how this seed set differs across species from
different environments.
A network view of metabolism
Several studies have addressed a wide range of questions
using network representations of small-molecule metabolism
[5-7]. For instance, at the structural level, the metabolic
network of an organism has been shown to have a scale-free
topology with a few nodes (for example, pyruvate or coenzyme
A) reacting with many other substrates [8,9]. A distinguishing
feature of such scale-free networks is the existence of a
few highly connected metabolites, which participate in a very
large number of metabolic reactions. By definition, when a
large number of links integrate several substrates into a
single highly connected component, fully separated modules
will not exist. This has led to the notion of hierarchical
modular structures within the fully connected metabolic
network, where a ‘module’ is defined as a group of nodes
that are more connected to each other than to other nodes in
the network [10].
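The notion of a few highly connected metabolites can be illustrated by counting how many reactions each metabolite takes part in. The sketch below is a minimal example under stated assumptions: the reaction list `TOY_REACTIONS` is invented for illustration and is not real stoichiometry, and the hub threshold of three reactions is arbitrary.

```python
from collections import Counter

# Hypothetical reactions as (substrates, products); not real stoichiometry.
TOY_REACTIONS = [
    (["glucose"], ["pyruvate"]),
    (["pyruvate", "CoA"], ["acetyl-CoA"]),
    (["pyruvate"], ["lactate"]),
    (["pyruvate"], ["alanine"]),
    (["acetyl-CoA"], ["citrate", "CoA"]),
]


def metabolite_degrees(reactions):
    """Count, for each metabolite, the number of reactions it participates in."""
    degree = Counter()
    for substrates, products in reactions:
        for metabolite in set(substrates) | set(products):
            degree[metabolite] += 1
    return degree


degrees = metabolite_degrees(TOY_REACTIONS)

# Degree distribution: how many metabolites have each degree.
distribution = Counter(degrees.values())

# Hubs: metabolites taking part in at least three reactions (arbitrary cutoff).
hubs = [m for m, k in degrees.most_common() if k >= 3]

if __name__ == "__main__":
    print(dict(degrees))
    print(dict(distribution))
    print(hubs)
```

Even in this tiny example most metabolites appear in one or two reactions while a single one dominates, which is the qualitative pattern that a scale-free degree distribution describes at genome scale.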
Kreimer et al. [3] have carried out a comprehensive, large-
scale characterization of metabolic-network modularity
(defined as in [11]) using 325 prokaryotic species with
sequenced genomes and metabolic networks in the KEGG
pathway database [12]. They found that network size was an
important topological determinant of modularity, with
larger genomes exhibiting higher modularity scores (that is,
a higher proportion of edges in the network forming part of
modules than would be expected by chance). In addition,
several environmental factors were shown to contribute to
the variation in metabolic-network modularity across species.
In particular, the authors found that endosymbionts and
mammal-specific pathogens have lower modularity scores
than bacterial species that occupy a wider range of niches.
Moreover, among the pathogens, those that alternate
between two distinct niches, such as insect and mammal,
were found to have relatively high metabolic-network
modularity. This supports the notion previously put forward
by Parter et al. [13] that variability in the natural habitat of
an organism promotes modularity in its metabolic network.
Kreimer et al. [3] also reconstructed likely ancestral states,
and found that modularity tends to decrease from ancestors
to descendants; they attribute this to niche specialization
and incorporation of peripheral metabolic reactions.
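To make the modularity score concrete, the sketch below computes Newman's modularity Q [11] for a given assignment of nodes to modules in an undirected network: for each module, the fraction of edges that fall inside it minus the fraction expected by chance from the degrees of its nodes. This is only a scoring function for a partition supplied by hand; the studies discussed here also search for the best partition, which the sketch does not attempt. The function name `modularity` and the toy network are assumptions made for illustration.

```python
def modularity(edges, module_of):
    """Newman modularity Q of a partition of an undirected, unweighted graph.

    edges     -- list of (u, v) pairs, each undirected edge listed once
    module_of -- dict mapping every node to a module label
    """
    m = len(edges)
    intra_edges = {}   # number of edges with both ends in the module
    degree_sum = {}    # sum of node degrees in the module
    for u, v in edges:
        cu, cv = module_of[u], module_of[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            intra_edges[cu] = intra_edges.get(cu, 0) + 1
    # Q = sum over modules of (observed intra fraction - expected fraction).
    return sum(
        intra_edges.get(c, 0) / m - (d / (2 * m)) ** 2
        for c, d in degree_sum.items()
    )


# Toy network: two triangles joined by a single edge.
EDGES = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
TWO_MODULES = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
ONE_MODULE = {node: 0 for node in TWO_MODULES}

if __name__ == "__main__":
    print(modularity(EDGES, TWO_MODULES))  # positive: more intra-module edges than chance
    print(modularity(EDGES, ONE_MODULE))   # zero: a single module carries no information
```

Placing every node in one module gives Q = 0, whereas the two-triangle split gives Q = 5/14, reflecting the excess of within-module edges over the chance expectation.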
In line with the above effects of environmental diversity on
network structure, Pal et al. [14] observed that bacterial
metabolic networks grow by retaining horizontally acquired
genes (genes acquired from other species) involved in the
transport and catalysis of external nutrients, and that evolu-
tionary changes in networks are primarily driven by adap-
tation to changing environments. Accordingly, horizontally
transferred genes were found to be integrated at the
periphery of the network, whereas the central parts remain
evolutionarily stable. Indeed, genes encoding physiologically
coupled reactions were often found to be transferred
together, frequently in operons. This suggests that bacterial
metabolic networks evolve by direct uptake of peripheral
reactions in response to changing environments [14].
In this regard, a recent genome-wide study in yeast found
that central and highly connected enzymes evolve more
slowly than less connected ones and that duplicates of highly
connected enzymes tend to have a higher likelihood of
retention [16]. Enzymes carrying high metabolic fluxes
under natural biological conditions were also found to
experience greater evolutionary constraints. Interestingly,
however, it was shown that highly connected enzymes are no
more likely to be essential to survival than the less connected
ones [16].
The functional and evolutionary modularity of the Homo
sapiens metabolic network has also been investigated from a
topological point of view and was shown to be organized
with a highly modular, ‘core and periphery’ topology [15]. In
such a structure, the core modules are tightly linked together
and perform basic metabolic functions, whereas the
peripheral modules only interact with few other modules
and accomplish relatively independent and specialized
functions. Interestingly, as in bacteria and yeast, peripheral
modules were found to evolve more cohesively and faster
than core modules [15].
Figure 1
Metabolic networks. (a) A set of related metabolic reactions can be
represented as a network. M1, M2, and so on are metabolites and E1, E2,
and so on are the enzymes that catalyze the conversion of one metabolite
into another. The arrows represent the direction of the reaction. (b)
Different ways of representing a metabolic network: left, with the
metabolites as nodes; right, with the enzymes as nodes. (c) Representation
of seed compounds in a hypothetical metabolic network. The metabolic
boundary of the organism is represented by the gray oval. Metabolites
(the nodes in the network) are represented by colored circles. The set of
compounds that cannot be internally synthesized but must be obtained
from the environment is referred to as the seed set, and is represented
here as red circles. Seed metabolites form the interface between the
environment and the metabolic system and link the metabolic habitats of
an organism with its core metabolic processes. In this hypothetical
network, it is possible to reach any of the internal nodes (open green
nodes) from any other node except those that have to be obtained from
the environment (blue arrows).
[Figure 1 panel labels: Metabolic reactions; Metabolic network; Metabolite-centric metabolic network; Enzyme-centric metabolic network; Environment. Nodes are labeled M1-M4 (metabolites) and E1-E4 (enzymes).]
Linking external environment to the metabolic
circuitry
Microorganisms constantly monitor their surroundings for
the availability of nutrients and other chemicals, using both
external and internal sensors to respond dynamically to
environmental changes [17]. Integration of the external
environment with metabolism occurs through the import of
compounds from the environment and results, for example,
in a transcriptional response or an allosteric interaction with
an enzyme [18-20]. In the second of the recent studies from
Ruppin and co-workers, Borenstein et al. [4] propose a
graph-theoretical approach to define these exogenously
acquired compounds - the seed set of an organism - and
have identified their repertoire across the tree of life (Figure 1c).
This is one of the most comprehensive studies so far that links
organisms’ metabolic circuitry with their environment.
The authors represent the metabolic network of a given
species as a directed graph with nodes representing metabo-
lites and edges corresponding to the linking reactions
converting substrates to products. Using this, they identify
the maximal set of metabolites that can be synthesized from
a particular precursor metabolite. This graph-based repre-
sentation of the metabolic network then enabled them to
discover the seed-set compounds for each of the 478 pro-
karyotic species with available metabolic networks in the
KEGG database [12]. On the whole, they found that about
8-11% of the compounds in the metabolic network of an
organism correspond to the seed set. Their predictive ability
to correctly identify seed compounds reached a precision of
95% when benchmarked against a set of compounds
experimentally characterized as being taken up from the
environment by the rickettsia that cause the disease
ehrlichiosis in humans and animals. Recall values (defined
as the percentage of correctly identified seeds of all exoge-
nously acquired compounds) based on the same dataset
were low, suggesting that other factors might have a role in
the identification of seed compounds of an organism, such as
the incompleteness of the metabolic network or ways of
acquiring an exogenous compound that cannot be captured
by currently available metabolic maps. The resulting
compilation, which represents the overall static metabolic
interface of each organism characterizing its biochemical
habitat, enabled Borenstein et al. to trace the evolutionary
history of both metabolic networks and growth environments.
Box 1. Models of metabolic pathway evolution
The most influential models of metabolic pathway evolution have been the ‘retrograde model’ proposed by Horowitz in
1945 [24] and the ‘patchwork model’ proposed by Ycas in 1974 [25] and later improved by Jensen in 1976 [26].
The retrograde model
In the retrograde model, pathways evolve bottom-up from a key metabolite, which is assumed to be initially abundant in
the ancestral condition. The model presupposes the existence of a chemical environment in which both the key
metabolite and potential intermediates are available. An organism primarily dependent on molecule Z will use up
environmental reserves of the metabolite to the point at which its growth is restricted; in such an environment, an
organism capable of synthesizing molecule Z from environmental precursors X and Y will have a selective advantage.
Any natural variant evolving an enzyme that catalyzes this synthesis will have a fitness advantage in such an
environment. As a result, with the drop in environmental concentration of X or Y, the process will be repeated, with the similar
recruitment of further enzymes.
The retrograde model also proposes that the simultaneous unavailability of two intermediates (say X and Y) would favor
symbiotic association between two mutants, one capable of synthesizing X and the other of synthesizing Y from other
environmental precursors. One of the major assumptions of this model is that the evolution of metabolic pathways occurs in
an environment rich in metabolic intermediates, and it therefore cannot explain their evolution during major environmental
transitions in the history of life such as, for example, the depletion of organic molecules from the environment [24,27]. The
retrograde model also fails to explain the development of pathways that include labile metabolites, which could not have
accumulated in the environment for long enough for retrograde recruitment to take place.
The patchwork model
In light of these limitations, Ycas [25] and Jensen [26] proposed the patchwork model of metabolic pathway evolution, in
which pathway evolution depends on the initial existence of broad-specificity enzymes. In its original formulation [25],
such enzymes catalyze whole classes of reactions, forming a large network of possible pathways. The broad specificities
would mean that many metabolic chains, synthesizing key metabolites, may have existed, although short and incomplete
compared with the pathways observed today. The duplication of genes in such pathways (advantageous because
increased levels of the enzyme would generate more of the key metabolites), followed by their specialization, would
account for extant pathways. Jensen [26] subsequently pointed out that the fortuitous evolution of a novel chemistry,
together with the biological leakiness of such a system, could allow the production of a key metabolite from a novel
intermediate, even if it is several enzymatic steps away from the original product.
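The seed-set idea described above, with metabolites as nodes and reactions as directed edges, can be sketched in a few lines. One way to realize it, assumed here rather than taken from the text, is to group metabolites into strongly connected components (sets of compounds that can all be made from one another) and to report the components that receive no edge from outside: nothing else in the network can produce them, so at least one member of each must come from the environment. The function names and the toy network are hypothetical.

```python
def strongly_connected_components(graph):
    """Kosaraju's algorithm. graph maps node -> iterable of successor nodes.

    Returns a dict mapping every node to a representative of its component.
    """
    nodes = set(graph) | {v for vs in graph.values() for v in vs}

    # Pass 1: iterative depth-first search recording nodes in finishing order.
    order, seen = [], set()
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack = [(start, iter(graph.get(start, ())))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(graph.get(nxt, ()))))
                    break
            else:  # all successors explored: node is finished
                order.append(node)
                stack.pop()

    # Pass 2: search the reversed graph in decreasing finishing order.
    reverse = {n: [] for n in nodes}
    for u, vs in graph.items():
        for v in vs:
            reverse[v].append(u)
    component = {}
    for root in reversed(order):
        if root in component:
            continue
        component[root] = root
        stack = [root]
        while stack:
            node = stack.pop()
            for prev in reverse[node]:
                if prev not in component:
                    component[prev] = root
                    stack.append(prev)
    return component


def seed_sets(graph):
    """Components with no incoming edge from any other component."""
    component = strongly_connected_components(graph)
    has_incoming = set()
    for u, vs in graph.items():
        for v in vs:
            if component[u] != component[v]:
                has_incoming.add(component[v])
    groups = {}
    for node, root in component.items():
        if root not in has_incoming:
            groups.setdefault(root, set()).add(node)
    return sorted(groups.values(), key=sorted)


# Toy network: A feeds a B/C cycle; D and E interconvert and feed C; C yields F.
TOY_NETWORK = {
    "A": ["B"],
    "B": ["C"],
    "C": ["B", "F"],
    "D": ["E"],
    "E": ["D", "C"],
    "F": [],
}

if __name__ == "__main__":
    print(seed_sets(TOY_NETWORK))
```

In the toy network, A cannot be produced by any reaction, and D and E can only be produced from each other, so the sketch reports two seed groups. B, C and F are all reachable from those groups and are therefore not seeds.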
When the seed sets identified in each organism were
analyzed in detail, species living in variable environments
were found to have more versatile seed sets, in terms of
variability of size and diversity of composition. On the other
hand, obligate parasites like Buchnera aphidicola and those
microorganisms, such as archaea, that live in extreme and
narrowly defined environments, were found to have much
smaller seed set sizes. These results suggest that although
organisms surviving in predictable environments can take
up many compounds from their surroundings, this
capability is still significantly smaller than in organisms that
have to survive in a wide range of niches.
Borenstein et al. [4] carried out a phylogenetic analysis of
the seed sets across different taxa, which suggested not only
that an accurate tree of life can be reconstructed from them
but that such a tree can provide insights into the evolu-
tionary dynamics of seed compounds. In particular, the
study revealed that novel compounds can be integrated into
the metabolic network of an organism as either non-seeds or
seeds, and that seed compounds are more likely to be lost
during evolution than non-seed compounds. From the
comparison with ancestral metabolic networks, Borenstein
et al. [4] suggest that the transition from seed to non-seed
compound occurs 2.5 times more often than the reverse.
This suggested that, of the two main current hypotheses of
metabolic network evolution - the ‘patchwork’ and ‘retrograde’
models (see Box 1) - the retrograde model, in which
pathways evolve in a direction opposite to the metabolic
flow, might best explain the observed events. However, the
observations of Borenstein et al. [4] on the high overall rate
of integration of non-seed compounds and the relatively
high rate of transition of non-seed compounds into seed
metabolites, suggest that some aspects of network evolution
could be explained by the patchwork and other models. The
results highlight the fact that these models are not mutually
exclusive, but complementary, and might have contributed
to pathway evolution to different extents [21,22].
It should be noted that there are limitations to studies such
as those reported here, in that the incompleteness of meta-
bolic maps, the reversibility of reactions, possible alternative
mechanisms controlling metabolic import, and the ignoring
of the distinction between catabolic and anabolic pathways
can all potentially result in false positives in the identified
seed sets. Nevertheless, it is exciting to note that seed sets
obtained using the approach developed in these studies not
only reflect the metabolic environments of the species
themselves but also provide insight into their natural
biochemical habitats - the union of all the metabolic
environments an organism encounters.
Hence, such approaches can be exploited to study the
interaction and association of microbes with other species
thriving in similar habitats. This may help in the identifi-
cation of host-parasite and symbiotic relationships between
organisms and also enable the prediction and design of
drugs that can precisely target an organism of interest
without adversely affecting the host. With the availability of
metagenomic data ranging from viromes to biomes [23], we
anticipate that similar approaches can be applied to study
metagenomic environments to decipher species relationships
and dependencies occurring in large ecological niches, thereby
providing insights into ecological imbalances or tradeoffs.
Acknowledgements
SCJ and MMB acknowledge financial support from the MRC Laboratory of
Molecular Biology. SCJ acknowledges financial support from Cambridge
Commonwealth Trust. MMB thanks Darwin College and Schlumberger
Ltd for generous support. We thank A Wuster, R Janky, K Weber, V
Espinosa-Angarica and JJ Díaz-Mejía for critically reading the manuscript
and providing helpful comments.
References
1. Papin JA, Price ND, Wiback SJ, Fell DA, Palsson BO: Metabolic path-
ways in the post-genome era. Trends Biochem Sci 2003, 28:250-258.
2. Feist AM, Palsson BO: The growing scope of applications of genome-
scale metabolic reconstructions using Escherichia coli. Nat Biotech-
nol 2008, 26:659-667.
3. Kreimer A, Borenstein E, Gophna U, Ruppin E: The evolution of
modularity in bacterial metabolic networks. Proc Natl Acad Sci USA
2008, 105:6976-6981.
4. Borenstein E, Kupiec M, Feldman MW, Ruppin E: Large-scale recon-
struction and phylogenetic analysis of metabolic environments. Proc
Natl Acad Sci USA 2008, 105:14482-14487.
5. von Mering C, Zdobnov EM, Tsoka S, Ciccarelli FD, Pereira-Leal JB,
Ouzounis CA, Bork P: Genome evolution reveals biochemical net-
works and functional modules. Proc Natl Acad Sci USA 2003, 100:
15428-15433.
6. Spirin V, Gelfand MS, Mironov AA, Mirny LA: A metabolic network
in the evolutionary context: multiscale structure and modularity.
Proc Natl Acad Sci USA 2006, 103:8774-8779.
7. Guimera R, Nunes Amaral LA: Functional cartography of complex
metabolic networks. Nature 2005, 433:895-900.
8. Wagner A, Fell DA: The small world inside large metabolic net-
works. Proc Biol Sci 2001, 268:1803-1810.
9. Jeong H, Tombor B, Albert R, Oltvai ZN, Barabasi AL: The large-
scale organization of metabolic networks. Nature 2000, 407:651-654.
10. Ravasz E, Somera AL, Mongru DA, Oltvai ZN, Barabasi AL: Hierar-
chical organization of modularity in metabolic networks. Science
2002, 297:1551-1555.
11. Newman ME: Modularity and community structure in networks.
Proc Natl Acad Sci USA 2006, 103:8577-8582.
12. Okuda S, Yamada T, Hamajima M, Itoh M, Katayama T, Bork P, Goto
S, Kanehisa M: KEGG Atlas mapping for global analysis of metabolic
pathways. Nucleic Acids Res 2008, 36(Web Server issue):W423-
W426.
13. Parter M, Kashtan N, Alon U: Environmental variability and modular-
ity of bacterial metabolic networks. BMC Evol Biol 2007, 7:169.
14. Pal C, Papp B, Lercher MJ: Adaptive evolution of bacterial metabolic
networks by horizontal gene transfer. Nat Genet 2005, 37:1372-
1375.
15. Zhao J, Ding GH, Tao L, Yu H, Yu ZH, Luo JH, Cao ZW, Li YX:
Modular co-evolution of metabolic networks. BMC Bioinformatics
2007, 8:311.
http://genomebiology.com/2008/9/11/239 Genome Biology 2008, Volume 9, Issue 11, Article 239 Janga and Babu 239.4
16. Vitkup D, Kharchenko P, Wagner A: Influence of metabolic network
structure and function on enzyme evolution. Genome Biol 2006,
7:R39.
17. Martinez-Antonio A, Janga SC, Salgado H, Collado-Vides J: Internal-
sensing machinery directs the activity of the regulatory network in
Escherichia coli. Trends Microbiol 2006, 14:22-27.
18. Seshasayee AS, Fraser GM, Babu MM, Luscombe NM: Principles of
transcriptional regulation and evolution of the metabolic system in E.
coli. Genome Res 2008. doi: 10.1101/gr.079715.108.
19. Balaji S, Babu MM, Aravind L: Interplay between network structures,
regulatory modes and sensing mechanisms of transcription factors in
the transcriptional regulatory network of E. coli. J Mol Biol 2007,
372:1108-1122.
20. Janga SC, Salgado H, Martinez-Antonio A, Collado-Vides J: Coordina-
tion logic of the sensing machinery in the transcriptional regulatory
network of Escherichia coli. Nucleic Acids Res 2007, 35:6963-6972.
21. Diaz-Mejia JJ, Perez-Rueda E, Segovia L: A network perspective on
the evolution of metabolism by gene duplication. Genome Biol 2007,
8:R26.
22. Teichmann SA, Rison SC, Thornton JM, Riley M, Gough J, Chothia C:
Small-molecule metabolism: an enzyme mosaic. Trends Biotechnol
2001, 19:482-486.
23. Dinsdale EA, Edwards RA, Hall D, Angly F, Breitbart M, Brulc JM,
Furlan M, Desnues C, Haynes M, Li L, McDaniel L, Moran MA,
Nelson KE, Nilsson C, Olson R, Paul J, Brito BR, Ruan Y, Swan BK,
Stevens R, Valentine DL, Thurber RV, Wegley L, White BA, Rohwer
F: Functional metagenomic profiling of nine biomes. Nature 2008,
452:629-632.
24. Horowitz NH: On the evolution of biochemical syntheses. Proc Natl
Acad Sci USA 1945, 31:153-157.
25. Ycas M: On earlier states of the biochemical system. J Theor Biol
1974, 44:145-160.
26. Jensen RA: Enzyme recruitment in evolution of new function. Annu
Rev Microbiol 1976, 30:409-425.
27. Lazcano A, Miller SL: On the origin of metabolic pathways. J Mol Evol
1999, 49:424-431.
Transcript stability in the protein interaction network of Escherichia coli
Sarath Chandra Janga* and M. Madan Babu
Received 25th September 2008, Accepted 21st November 2008
First published as an Advance Article on the web 9th December 2008
DOI: 10.1039/b816845h
Gene expression is a dynamic process which can be controlled by a number of mechanisms as
genetic information flows from nucleic acids to proteins. The study of gene expression in the
steady state, while informative, overlooks the underlying dynamics of the processes. Steady-state
transcript levels are a result of both RNA synthesis and degradation, and as such, measurements
of degradation rates can be used to determine their rates of synthesis as well as reveal regulation
that occurs via changes in RNA stability. Messenger RNA degradation plays a central role in
diverse cellular processes and is controlled primarily by the activity of the degradosome in
prokaryotes. In this study, we use the currently available network of protein–protein interactions
(PPIs) and mRNA half-lives in Escherichia coli to demonstrate that centrality of a protein in the
PPI network is strongly correlated with its mRNA half-life. We find that interacting proteins tend
to show similar half-lives, commonly referred to as assortative behavior in networks, which is
frequently found in biological and social networks. While a major fraction of the interacting
proteins show significantly lower differences in mRNA stabilities, a smaller but significant number
of protein pairs tend to show higher differences than expected by chance. Pairs with higher differences in
transcript stability often involved genes encoding transcription factors and enzymes,
suggesting a feedback link at the post-translational level. We also note that although essential
genes, which act as a proxy for in vivo centrality in PPI networks, are highly expressed compared
to non-essential ones, they do not encode more stable transcripts than non-essential genes. Our
results provide a direct link between mRNA stability and the centrality of a protein in the PPI network,
indicating the importance of post-transcriptional mechanisms on nascent RNAs in the cell.
Introduction
RNAs can be classified by their stability in the cell. The best-
known stable RNAs are the tRNAs and rRNAs. mRNAs are
unstable, with half-lives in Escherichia coli ranging from 2 to
25 min (see Fig. 1A). In eukaryotic cells, mRNA turnover is
slower, but the half-lives are usually shorter than the genera-
tion time. The instability of mRNA is an important property
permitting timely adjustments to changes in growth conditions
or to genetically controlled programs of expression. Until
recently, tRNAs and rRNAs were believed to be protected
by their rapid folding and assembly into compact structures.
This simplistic view seems unlikely because of the discovery of
ribonucleolytic multienzyme complexes capable of unwinding
and degrading structured RNA. Another widely held precon-
ception was that the enzymes involved in the processing of
stable RNA would be distinct from those involved in the
degradation of mRNA. With the discovery in E. coli and
Saccharomyces cerevisiae that ribonucleases involved in the
processing of rRNA are also important in the degradation of
mRNA, it is now clear that there is a close connection between
processing and degradation.1–4
mRNA instability is an intrinsic property that permits
timely changes in gene expression by limiting the lifetime of
a transcript and acts as a regulator for controlling the produc-
tion of a protein product at the post-transcriptional level. It is
becoming increasingly clear that in eubacteria like E. coli,
RNase E, a single-strand-specific endonuclease, is involved in
the processing of rRNA and the degradation of mRNA.5–7
A nucleolytic multienzyme complex now known as the RNA
degradosome was discovered during the purification and
characterization of RNase E.8,9 Two other major components
of this complex include a 3′ exoribonuclease (polynucleotide
phosphorylase, PNPase) and a DEAD-box RNA helicase
(RNA helicase B, RhlB). RNase E is a large multidomain
protein with N-terminal ribonucleolytic activity, an RNA-
binding domain and a C-terminal ‘scaffold’ that binds
PNPase, enolase and RhlB. The association of RNase E and
PNPase in a complex provides a direct physical link for their
co-operation in the degradation of mRNA. Other associated
proteins, present in substoichiometric amounts, include poly-
phosphate kinase (PPK), DnaK and GroEL. Interactions with
other enzymes, such as E. coli poly(A) polymerase and
the ribosomal protein S1, have also been described, although
the role of enolase, PPK and other associated proteins in the
degradation of mRNA is still unknown.7 However, a ‘minimal’
degradosome containing RNase E, RhlB, PNPase and enolase
can be reconstituted from purified components and has been
proposed to comprise the degradosome complex.
In E. coli, the degradation of mRNA is mediated by the
combined action of endo- and exo-ribonucleases, RNase E
and PNPase, respectively, which degrade RNA in a 3′–5′ pathway.10
MRC Laboratory of Molecular Biology, Hills Road, Cambridge,
UK CB2 0QH. E-mail: sarath@mrc-lmb.cam.ac.uk;
Fax: +44 (0)1223 213556; Tel: +44 (0)1223 402479
154 | Mol. BioSyst., 2009, 5, 154–162 This journal is © The Royal Society of Chemistry 2009
PAPER www.rsc.org/molecularbiosystems | Molecular BioSystems
Enzymes related to RNase E and PNPase are
widespread in both eubacteria and eukaryotes.11 Recent
studies have shown the existence of endonuclease binding
proteins like RraA and RraB that can modulate the remodelling
of degradosome composition in bacteria and can result in
dramatic, distinct, and inhibitor-specific changes in degrado-
some composition. These effects have also been shown to be
associated with alterations in RNA decay and global
transcript abundance profiles. These profiles were found to be
dissimilar to those observed during simple RNase E deficiency,
and it has been suggested that degradosome
remodelling serves as a mechanism for the differential regulation of
RNA cleavages in E. coli.12,13 In addition, recent whole
genome microarray studies have revealed the importance
of the contribution from different components of the
degradosome, the relation between mRNA stability and its
abundance and the higher order cleavage characteristics
implying the importance of mRNA stability and the role of
post-transcriptional regulation in mediating cellular interac-
tions and cross-talk.14–17
Two key factors for controlling the concentration of a
protein in a bacterial cell include the number of transcripts
per cell cycle and the stability of the transcript. Evidence
points to the fact that most transcripts in bacteria and eukarya
are produced only once per cell cycle suggesting that stability
of a transcript during the cell cycle might play a more
important role than the actual number of mRNA molecules,
which is already low.18 As the transcription rate is generally
low, it follows that cells must depend on their mother’s
mRNAs and/or proteins for survival. Therefore, transcript
half-life might enforce a constraint on transcripts that have
critical roles in important cellular processes, which may take
place throughout the cell cycle or longer. On the other hand,
smaller cellular sizes in bacteria would force infrequently
used transcripts to be degraded rapidly. In fact, it has been
shown in the E. coli transcriptional regulatory network that
highly connected transcription factors tend to be less stable
with short half-lives although they are highly expressed,19,20
indicating that stability of a transcript might play a vital role
in several cellular processes. Given these observations, it is
imperative to understand how stability of a transcript can
constrain the interaction of its protein product with other
cellular components. With the availability of data from high-throughput
technologies like affinity purification and two-hybrid
systems, it has become possible to address such questions
on large-scale PPI maps.21–23 In this study, we use for the
first time the PPIs of a bacterial model organism, E. coli, from
the Database of Interacting Proteins (DIP)24 and ask how the
in silico and in vivo measures of centrality of a protein are
related to the stability of its transcript (Fig. 1B). In silico
centrality of a protein in a PPI network refers to its number
of connected neighbors and other network based topology
measures which indicate its importance while the in vivo
centrality indicates whether a protein is essential for survival
in specific experimental conditions.
Results and discussion
mRNA half-lives of proteins correlate positively with their PPI
network centrality
The control of mRNA degradation plays a central role in
diverse cellular processes and is regulated primarily by the
activity of the degradosome in prokaryotes.25–27 mRNA decay
has been studied in a range of organisms, and much has been
learned about the substrate features and ribonucleolytic
enzymes that influence mRNA stability based on data from
small sets of transcripts.27,28 However, due to the availability of
DNA microarrays, it has recently become possible to screen and
measure the mRNA levels of transcripts of thousands of
individual genes, enabling the determination of mRNA abun-
dance and stability on a genome-wide scale. A common strategy
used to determine mRNA half-lives is to block new transcrip-
tion and monitor expression levels of transcripts over a period
of time to obtain rates of decay of individual transcripts using a
microarray. Typically, rifampicin, a drug known to prevent the
initiation of new transcripts by binding to the β subunit of RNA
polymerase is used.15,16 In this study, we used the repertoire of
mRNA half-lives determined by Selinger et al.16 in E. coli
(see Materials and methods). The data on PPIs were obtained
from the DIP, which contains a high quality set of interactions24
(see Materials and methods). The final PPI network consisted of
5667 interactions involving 998 proteins, encompassing about
25% of the predicted proteome of E. coli.
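The transcription-block strategy described above can be illustrated with a minimal sketch. It assumes simple first-order (exponential) decay of a transcript after rifampicin addition and estimates the half-life from a least-squares fit of log abundance against time. The function name and the toy time course are hypothetical and are not the procedure or data of Selinger et al.

```python
import math

def half_life_from_timecourse(times_min, abundances):
    """Estimate a half-life (min), assuming A(t) = A0 * exp(-k * t)."""
    logs = [math.log(a) for a in abundances]
    n = len(times_min)
    mean_t = sum(times_min) / n
    mean_y = sum(logs) / n
    # Ordinary least-squares slope of log(abundance) versus time; slope = -k
    sxx = sum((t - mean_t) ** 2 for t in times_min)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times_min, logs))
    slope = sxy / sxx
    return math.log(2) / -slope

# Hypothetical time course in which the signal halves every 5 minutes
times = [0, 2.5, 5, 10, 20]
signal = [100 * 0.5 ** (t / 5) for t in times]
estimate = half_life_from_timecourse(times, signal)
print(round(estimate, 2))
```

Real microarray time courses are noisy and may not follow a single exponential, so a fit of this kind should be read as a first approximation only.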
Fig. 1 Schematic showing the (A) distribution of mRNA half-lives (in minutes) for all the protein coding genes in E. coli analyzed in this study.
(B) Concept of mRNA stability, protein–protein interaction (PPI) network and its relationship with in silico and in vivo centrality and essentiality
measures addressed in this study.
By integrating the data on mRNA half-life and protein
interaction network, we first asked if there is a correlation
between the mRNA half-life of the transcript and the impor-
tance of a protein (encoded by the transcript) in the protein
interaction network. To obtain the importance of a protein in
the PPI network, we calculated different centrality measures29
for every protein in the network as described in the Materials
and methods. In brief, three centrality measures have been
described in the literature: (i) degree or connectivity, which is
the number of interactions a protein has in the PPI network—
the higher the connectivity (i.e., hub nodes) the more impor-
tant a protein is, (ii) betweenness centrality, which measures
the number of shortest paths between all pairs of
proteins in the network that pass through a protein of
interest—the higher the number of paths that pass through a
protein, the more important it is, (iii) closeness centrality,
which provides the average length of all the shortest paths
from a protein of interest to all other proteins in the
network—note that closeness centrality defined this way
implies that the lower the closeness value, the higher the importance
(centrality) of a node. We used all of these parameters
measuring the importance of a protein in a PPI network and
compared them against the mRNA half-lives of the encoding
transcripts. As shown in Fig. 2, we found that all the centrality
measures correlate positively with the mRNA half-lives,
indicating that the transcripts of the highly central proteins
in the PPI network tend to be more stable in the cell. It should
be noted that the results presented here are insensitive to the
removal of up to 10% of the interactions in the network
suggesting that the findings presented here are generally robust
(see Materials and methods; data not shown). This implies that
proteins that are more central in the PPI network of E. coli
(e.g., hubs—those which interact with a large number of other
proteins) tend to have much more stable transcripts in order to
enable their availability for most of the cell cycle. This might
ensure that proteins that are important to co-ordinate cellular
activity by interacting with several other proteins can be
synthesized from their corresponding transcripts in required
concentrations at different times. It is interesting to note that
the finding reported here is in contrast to what is observed for
hubs in the transcriptional network of E. coli19 (i.e., transcription
factors which regulate the expression of several genes) possibly
as a result of different functional constraints and mechanisms
governing the roles for hubs in different networks. In other
words, since hubs in the transcriptional network have to be
transcription factors only (which regulate gene expression by
binding to upstream region) while hubs in the PPI network can
be any protein that need not be a transcription factor, this may
introduce very different constraints on transcript stability for
hubs in the two networks.
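The three centrality measures discussed above can be sketched in a few lines of plain Python. This is an illustrative implementation under stated assumptions, not the igraph code used in the study: closeness is reported as the average shortest-path length (so lower values mean higher centrality, as in the text), unreachable nodes are simply ignored rather than assigned a substitute path length, and the toy edge list and protein names are hypothetical.

```python
from collections import deque

def build_graph(edges):
    """Undirected adjacency sets from a list of (protein_a, protein_b) pairs."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def degree(adj):
    """Number of interaction partners of each protein."""
    return {v: len(nbrs) for v, nbrs in adj.items()}

def bfs_distances(adj, source):
    """Shortest-path lengths from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

def average_path_length(adj):
    """Mean shortest-path length from each node to the nodes it can reach."""
    result = {}
    for v in adj:
        others = [d for u, d in bfs_distances(adj, v).items() if u != v]
        result[v] = sum(others) / len(others) if others else float("inf")
    return result

def betweenness(adj):
    """Brandes' algorithm for an unweighted, undirected graph."""
    bc = dict.fromkeys(adj, 0.0)
    for s in adj:
        stack = []
        preds = {v: [] for v in adj}
        sigma = dict.fromkeys(adj, 0)   # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(adj, 0.0)
        while stack:                     # accumulate dependencies back to s
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair was counted from both endpoints
    return {v: b / 2 for v, b in bc.items()}

# Hypothetical toy network: H is a hub, D hangs off C
edges = [("H", "A"), ("H", "B"), ("H", "C"), ("C", "D")]
adj = build_graph(edges)
print(degree(adj)["H"], betweenness(adj)["H"], average_path_length(adj)["H"])
```

In this toy network the hub H has the highest degree and betweenness and the lowest average path length, which is the pattern the three measures are meant to capture.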
In addition to these centrality measures, we also computed
the clustering coefficient of a node, which reflects the extent to
which the neighbors of a given node are interconnected among
themselves and indicates the cohesiveness or local modularity
of the network. It is interesting to note from Fig. 2D that half-
lives of these very stable transcripts are inversely correlated
with their clustering coefficient implying that highly stable
transcripts may not form cohesive local modules in the inter-
action network. This result also suggests that highly connected
nodes in PPI network may not form part of any particular
Fig. 2 Relationship between network properties of proteins in the PPI network and their corresponding mRNA stability measured as the
transcript's half-life: (A) degree of a node in the PPI network versus its mRNA half-life, (B) betweenness of a node versus its mRNA half-life,
(C) closeness of a node versus its mRNA half-life and (D) clustering coefficient of a node versus its mRNA half-life. All the centrality measures
indicate that proteins with high centrality tend to exhibit high mRNA half-lives. Clustering coefficient of highly stable nodes in the PPI network
decreases, indicating that although central proteins are more stable, they may not form multi-protein assemblies. Error bars are shown in each case
to show the extent of variation of the network property in each bin. p-values correspond to the significance level of the correlations.
module but rather might be involved in multiple modules, due
to the hierarchical nature of biological systems previously
demonstrated for metabolic networks.30
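As a brief illustration of the clustering coefficient used above, the following sketch computes, for each node, the fraction of its neighbor pairs that are themselves connected. The convention of returning 0 for nodes with fewer than two neighbors is an assumption made here (some tools report an undefined value instead), and the toy network is hypothetical.

```python
from itertools import combinations

def local_clustering(adj):
    """Fraction of a node's neighbor pairs that interact with each other."""
    cc = {}
    for v, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            cc[v] = 0.0  # assumed convention for nodes with < 2 neighbors
            continue
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        cc[v] = 2 * links / (k * (k - 1))
    return cc

# Hypothetical network: a hub whose partners are mostly not interconnected
adj = {
    "hub": {"a", "b", "c", "d"},
    "a": {"hub", "b"},
    "b": {"hub", "a"},
    "c": {"hub"},
    "d": {"hub"},
}
cc = local_clustering(adj)
print(cc["hub"], cc["a"])
```

Here the hub has a low coefficient because only one of its six neighbor pairs interacts, while node "a" sits in a fully connected triangle. This mirrors the observation that highly connected nodes need not belong to a single cohesive module.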
mRNA half-lives of interacting proteins tend to have similar
degradation rates with few pairs showing significant differences
in half-lives
We then investigated if the transcripts of interacting proteins
in the PPI network tend to have similar or dissimilar half-lives.
To investigate this question, we calculated the ‘‘assortativity
value’’ for the PPI network based on the half-life values
associated with a node by adapting the formula described by
Newman31,32 (see Materials and methods). As originally
described by Newman,31,32 assortativity value is a single
global measure that tries to capture the dominant type of
interaction in a network. For instance, positive assortativity
value (assortative mixing) based on the degrees of a node in a
network would mean that there is a preference for interacting
nodes to have a similar degree, while negative assortativity
value (disassortative mixing) would imply that there is a
preference for interacting nodes to have dissimilar degree
(e.g., high-connectivity nodes interacting with low-connectivity
ones). Negative assortativity values have been reported for
scale-free graphs such as the world-wide web, the
internet and protein interaction networks, with values ranging
from −0.06 to −0.18.31 We calculated the assortativity value
based on the mRNA half-life (rather than the degree) of a
protein in the PPI network and found a low but positive value
of 0.03, suggesting that proteins which interact with each other
tend to have comparable half-lives.
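One way to picture an assortativity value based on a scalar node attribute is as the Pearson correlation of that attribute across the two ends of every edge, with each undirected edge counted in both directions. The sketch below follows that general form of Newman's coefficient; whether it matches the exact adaptation used in this study is an assumption, and the gene names and half-lives are hypothetical.

```python
def attribute_assortativity(edges, value):
    """Correlation of a node attribute across the two ends of each edge."""
    xs, ys = [], []
    for a, b in edges:
        if a in value and b in value:
            # Count each undirected edge in both directions (symmetric measure)
            xs += [value[a], value[b]]
            ys += [value[b], value[a]]
    n = len(xs)
    mean = sum(xs) / n                     # identical for xs and ys by symmetry
    cov = sum((x - mean) * (y - mean) for x, y in zip(xs, ys)) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    return cov / var                       # undefined if all values are equal

# Hypothetical half-lives (min)
half_life = {"p1": 2.0, "p2": 2.0, "p3": 10.0, "p4": 10.0}
similar_partners = [("p1", "p2"), ("p3", "p4")]
dissimilar_partners = [("p1", "p3"), ("p2", "p4")]
print(attribute_assortativity(similar_partners, half_life))
print(attribute_assortativity(dissimilar_partners, half_life))
```

Values near +1 indicate that interacting proteins have similar half-lives and values near -1 indicate the opposite, so a value such as 0.03 corresponds to a weak preference for similar partners.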
To investigate this observation in more detail, we computed
the differences in half-lives of interacting proteins and com-
pared the distribution with that observed from randomly
selected pairs of proteins as described in Materials and
methods. As a result of this analysis (Fig. 3), we found that
in general interacting proteins tend to show lower differences
in half-lives than what is expected by chance (P ≤ 1.43 × 10^-3),
although a small fraction of interacting proteins did show high
differences in half-lives (Table 1). This calculation shows that
interacting proteins on average differ in half-life
by about 4 minutes, with a vast majority of them falling in
the difference range of <3 minutes (marked with a black
arrow in Fig. 3). Likewise, very few interacting proteins were
found to show high differences in half-lives (threshold value of
~13 minutes, above which a much smaller fraction of interacting
pairs was found compared to random pairs, marked
with a yellow arrow in Fig. 3) and they corresponded to
interactions between and among regulators and enzymes
involved in global regulatory processes (see Table 1). Sensi-
tivity analysis to test the robustness of the results indicated
that the results are reproducible with networks where up to
10% of the interactions are randomly removed (see Materials
and methods; data not shown). An analysis of the function of
interacting proteins that show large differences in half-lives
reveals that some highly stable transcripts belong to
the enzyme or regulator functional classes. These might be
involved in PPIs as a means of linking cellular processes as
diverse as metabolism, replication, repair and regulation. One
possible explanation for such large differences in stabilities
might be the usage of the stable transcripts (which typically
encode hubs) as feedback controllers at different stages of
the cell cycle.
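The comparison between interacting and random protein pairs can be sketched as follows. The exact randomization used in the study is described in its Materials and methods and is not reproduced here; this sketch assumes random pairs are drawn uniformly from genes with half-life data and uses a simple empirical p-value on mean differences. All names and values are hypothetical.

```python
import random

def half_life_differences(pairs, half_life):
    """Absolute half-life difference for each pair with data for both genes."""
    return [abs(half_life[a] - half_life[b])
            for a, b in pairs
            if a in half_life and b in half_life]

def random_pair_differences(half_life, n_pairs, seed=0):
    """Differences for randomly drawn gene pairs (assumed null model)."""
    rng = random.Random(seed)
    genes = sorted(half_life)
    diffs = []
    for _ in range(n_pairs):
        a, b = rng.sample(genes, 2)
        diffs.append(abs(half_life[a] - half_life[b]))
    return diffs

def empirical_p(observed_mean, null_means):
    """Fraction of null means at or below the observed mean (add-one corrected)."""
    hits = sum(1 for m in null_means if m <= observed_mean)
    return (hits + 1) / (len(null_means) + 1)

# Hypothetical half-lives (min) and interactions
hl = {"g1": 2.0, "g2": 3.0, "g3": 9.0, "g4": 20.0}
ppi = [("g1", "g2"), ("g2", "g3")]
obs = half_life_differences(ppi, hl)
null = random_pair_differences(hl, 50)
print(obs, sum(null) / len(null))
```

A low empirical p-value under this scheme would indicate that interacting pairs differ less in half-life than random pairs do.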
Since it has been known that gene duplication is an
important mechanism for genome evolution which has also
contributed to the growth of the PPI networks, we investigated
if duplicate genes have similar half-lives. By investigating the
sequences of the proteins in the network, we found that only
2% of the interacting protein pairs (108 of 5667 interactions)
are composed of duplicated protein partners at a BLAST
e-value threshold of 1e-5. Of these, we had half-life data for
39 pairs and they did not show any significant tendency for
high or low differences in half-lives compared to overall
distribution (data not shown). Since the fraction of duplicated
genes and the corresponding interactions in our network is
very low and does not show any inherent trends, our results
were robust to removal of these duplicate proteins. Varying
the BLAST e-value thresholds by 2 orders of magnitude to
detect duplicate genes did not change our end results (data not
shown). However, as more data on protein interaction net-
works become available, it should be possible to address this
question in greater detail.
Hubs in PPI network tend to be essential although transcripts
of essential genes are not more stable than non-essential ones
While the importance of a protein can be assessed by measur-
ing the centrality of a node in the PPI network, it can also be
inferred by experimentally testing if removal of the gene
is lethal to the cell or not. To complement our understanding
of the relationship between the in silico centrality of a node in
the PPI network against the stability of a transcript from an
experimental perspective, we investigated the following ques-
tions: do the experimentally determined essential genes tend to
be important proteins in the PPI network? Are
Fig. 3 Distribution of the differences in mRNA half-lives of inter-
acting proteins compared against random pairs of proteins, indicating
that interacting protein pairs exhibit a higher tendency to have lower
differences in half-lives. Marked in black and yellow arrows are the
half-life thresholds where interacting protein pairs show significantly
lower and higher differences in half-lives, respectively. Very few
interacting protein pairs showed high differences in half-lives as shown
in Table 1.
transcripts of essential genes more stable or highly expressed
compared to the non-essential genes? To address these ques-
tions, we obtained a list of essential and non-essential genes in
the E. coli genome from a recent study conducted by Mori and
co-workers.33
To address the first question, we integrated essentiality data
with the connectivity data of proteins in the PPI network. In
particular, we compared the proportion of essential genes in
different connectivity bins (see Materials and methods), after
dividing the proteins in the PPI network into three different
groups based on their degree, D: low (D < 9), intermediate
(9 ≤ D < 44) and high (D ≥ 44). Fig. 4A shows the
proportion of essential genes for each of the three groups of
proteins. We note that the bin with the highly connected
proteins also has a higher proportion of essential genes,
while the bin with lowly connected proteins shows a
depletion in the fraction of essential genes, indicating that
essentiality (a qualitative measure of in vivo centrality) of a
gene correlates with its degree in the PPI network, similar
to what has been observed in the yeast PPI network.34
We found that a total of 173 essential genes (58% of all
essential genes) formed part of these bins (82, 64 and 27 in
the low, intermediate and high-connectivity bins, respectively)
signifying that most of the essential genes in E. coli also form
part of the PPI network. Note that although the absolute
number of essential genes in the highly connected bin is low, it
overlaps with a significant fraction of the hubs (i.e., proteins
with degree ≥ 44).
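The binning analysis above can be illustrated with a short sketch that assigns each protein to a connectivity bin using the degree thresholds given in the text (D < 9, 9 ≤ D < 44, D ≥ 44) and reports the fraction of essential genes per bin. The function names, gene identifiers and degrees are hypothetical.

```python
def degree_bin(d):
    """Connectivity bin for a degree value, using the thresholds from the text."""
    if d < 9:
        return "low"
    if d < 44:
        return "intermediate"
    return "high"

def essential_fraction_by_bin(degrees, essential):
    """Fraction of proteins in each degree bin that are essential."""
    counts = {"low": [0, 0], "intermediate": [0, 0], "high": [0, 0]}
    for gene, d in degrees.items():
        b = degree_bin(d)
        counts[b][0] += 1 if gene in essential else 0   # essential in bin
        counts[b][1] += 1                               # all proteins in bin
    return {b: (e / n if n else 0.0) for b, (e, n) in counts.items()}

# Hypothetical degrees and essential gene set
degrees = {"g1": 3, "g2": 8, "g3": 9, "g4": 43, "g5": 44, "g6": 120}
essential = {"g3", "g5", "g6"}
fractions = essential_fraction_by_bin(degrees, essential)
print(fractions)
```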
To address the second question on the relationship between
essentiality and transcript stability, we integrated the gene
essentiality data with mRNA half-life and expression data.
Though it is commonly believed that essential genes which
comprise core proteins necessary for the survival of the cell
need to be expressed in higher concentrations,35,36 this has not
been tested so far. Thus we investigated if essential genes
would be highly expressed compared to non-essential genes
and how hubs (proteins with degree ≥ 44) in the interaction
network compare in their expression with respect to essential
genes. It should be noted that though hubs tend to be essential
genes, not all essential genes are hubs. Hence it becomes
important to make this distinction and test them indepen-
dently. Fig. 4B shows the expression levels of essential and
non-essential genes in E. coli along with hubs in the interaction
network using most of the publicly available expression data
generated on the Affymetrix platform (see Materials and
methods).37 Both hubs and essential genes were found to be
significantly more highly expressed than non-essential genes
(t-test, p < 2.2 × 10^-16) and hubs were found to be more
highly expressed than essential genes (p < 8.87 × 10^-10).
Given these observations that both hubs and essential genes
are significantly more abundant than non-essential genes at
the mRNA level, we asked whether this difference is also
reflected in the stability of their transcripts. To address this,
we compared their mRNA half-lives and found that although
the transcripts of essential genes are not more stable than non-essential
genes (p < 0.057), transcripts encoding hubs exhibited
significantly higher stabilities compared to those encoding
non-essential genes (p < 8.9 × 10^-5) (Fig. 4C). These results
suggest that although essential genes are highly expressed, they
do not tend to be more stable than non-essential genes. These
observations together with biological processes enriched in
these classes of genes (see Table 2) suggested that the differences
may stem from the nature of proteins with distinct functions
in the two groups. While hubs predominantly encode
proteins involved in translation and protein synthesis,
the essential gene set comprises genes belonging to various
metabolic and biosynthetic processes important for cellular
growth. Thus the higher abundance of essential genes
compared to non-essential genes may be a result of higher
transcription of essential genes rather than increased stability
Table 1 Interacting protein pairs exhibiting highest differences in their mRNA half-lives. Most of these interactions are between or among
regulatory factors and enzymes suggesting that these interactions contribute to the cellular integrity by linking different cellular processes and
pathways with the core machinery of the cell. All interacting pairs which showed high differences in half-lives and are under-represented in
proportion with increasing differences in half-lives threshold compared to random pairs, are shown below (see Fig. 3 for additional details)
Columns: Gene 1 (higher half-life); Function; Gene 2 (lower half-life); Function; Difference (min)

Gene 1 (higher half-life): hupA|b4000
  Function: Factor; basic proteins-synthesis, modification; HU, DNA-binding transcriptional regulator, alpha subunit
Gene 2 (lower half-life): galR|b2837
  Function: Regulator; degradation of small molecules: Carbon compounds; DNA-binding transcriptional repressor
Difference (min): 18.1

Gene 1 (higher half-life): rpoD|b3067
  Function: Factor; global regulatory functions; RNA polymerase, sigma 70 (sigma D) factor
Gene 2 (lower half-life): crp|b3357
  Function: Regulator; global regulatory functions; DNA-binding transcriptional dual regulator
Difference (min): 13.9

Gene 1 (higher half-life): pflB|b0903
  Function: Enzyme; energy metabolism, carbon: Anaerobic respiration; pyruvate formate lyase I
Gene 2 (lower half-life): yjeE|b4168
  Function: ATPase with strong ADP affinity
Difference (min): 20.9

Gene 1 (higher half-life): ybdN|b0602
  Function: Conserved protein
Gene 2 (lower half-life): sspB|b3228
  Function: Regulator; global regulatory functions; ClpXP protease specificity-enhancing factor
Difference (min): 14.6

Gene 1 (higher half-life): recG|b3652
  Function: Enzyme; DNA-replication, repair, restriction/modification; ATP-dependent DNA helicase
Gene 2 (lower half-life): ssb|b4059
  Function: Factor; DNA-replication, repair, restriction/modification; single-stranded DNA-binding protein
Difference (min): 21.6

Gene 1 (higher half-life): sgbH|b3581
  Function: Putative enzyme; central intermediary metabolism: pool, multipurpose conversions; 3-keto-L-gulonate 6-phosphate decarboxylase
Gene 2 (lower half-life): nudF|b3034
  Function: ADP-ribose pyrophosphatase
Difference (min): 13

Gene 1 (higher half-life): hupA|b4000
  Function: Factor; basic proteins-synthesis, modification; HU, DNA-binding transcriptional regulator, alpha subunit
Gene 2 (lower half-life): nudH|b2830
  Function: Putative factor; not classified; nucleotide hydrolase
Difference (min): 18.2
of their transcripts. Since cells appear to degrade transcripts
encoding essential genes in the same way as those of non-essential
genes, this also suggests that increased transcript stability of
essential genes, which would result in very high levels of these transcripts,
might be generally unfavorable or expensive for the cell.
Materials and methods
Data on mRNA half-lives, gene essentiality and PPI network
in E. coli
Half-life of RNA acts as a direct measure of transcript stability
and is frequently used for measuring the stability of messenger
RNAs. mRNA half-lives of protein coding genes in E. coli
were obtained from a previous study where the authors
analyzed the global patterns of RNA degradation on a
genome-wide scale using high-density, subgenic-resolution
oligonucleotide microarrays.16 Half-life data could be
obtained for a total of 2680 genes whose distribution (in minutes)
is shown in Fig. 1A. Data on gene essentiality were obtained
from a recent genome-wide knockout study, where the
authors generated a whole-genome single-gene knockout library.33
We collected a total of 298 genes in E. coli K12 whose deletion was
reported to be lethal according to this study. The remaining
genes were considered as non-essential. Manually curated and
high quality data on PPIs in E. coli were obtained from the
DIP,24 which included data from traditional independent
studies and high-throughput studies such as Butland et al.21
Our final dataset used for this study consisted of 5667 PPIs
with 49 proteins qualifying as hubs (top 5% of the nodes with
highest connectivity). Since not all the genes had half-life data
associated with it, only those set of interactions for which both
the genes had half-life information available were considered
for studying the differences in half-lives. This subset composed
of 447 interacting protein pairs.
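The filtering step described above can be illustrated with a minimal sketch. The gene names, half-life values and the function names `pairs_with_half_lives` and `half_life_differences` are hypothetical and only show the idea of keeping interactions for which both partners have a measured half-life; this is not the authors' code.

```python
# Minimal sketch (hypothetical data): keep only interactions for which
# both partners have half-life data, then compute their differences.

def pairs_with_half_lives(interactions, half_life):
    """Return the interactions whose two genes both have a half-life value."""
    return [(a, b) for a, b in interactions if a in half_life and b in half_life]


def half_life_differences(pairs, half_life):
    """Absolute difference in half-life (minutes) for each interacting pair."""
    return [abs(half_life[a] - half_life[b]) for a, b in pairs]


# Toy example with invented genes and values (minutes).
half_life = {"geneA": 5.2, "geneB": 7.0, "geneC": 3.1}
interactions = [("geneA", "geneB"), ("geneA", "geneD"), ("geneB", "geneC")]

usable = pairs_with_half_lives(interactions, half_life)
diffs = half_life_differences(usable, half_life)
```

In this toy case the pair involving `geneD` is dropped because no half-life is available for it, mirroring how the 5667 interactions were reduced to the 447 pairs analyzed.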
Network properties of the proteins in the PPI network
To study the properties of the PPI network and their dependence
on mRNA half-life, we used igraph, a publicly available
R package for analyzing graphs [see http://cneurocvs.rmki.kfki.hu/igraph/
and http://www.r-project.org]. In particular,
since the network of PPIs analyzed in this study is undirected,
we used the undirected versions of the functions degree,
transitivity, betweenness and closeness for calculating the
degree, clustering coefficient, betweenness and closeness
centralities of a node. Betweenness centrality, which is the number
of shortest paths going through a node, was calculated using
the Brandes algorithm38 implemented in R. Similarly,
closeness, which is based on the average length of the shortest paths
to all the other vertices in the graph, was obtained using
the implementation in R. Since the centrality measures
betweenness and closeness use the shortest path lengths
between all pairs of nodes in a graph, for cases where no path
exists between a particular pair of nodes, the shortest path length
was taken as one less than the maximum number of nodes in
the graph. Note that this is also the default assumption for
calculating centrality measures in igraph. Hubs were defined as
the nodes with degrees greater than two standard deviations
above the average degree of a node in the PPI network
(i.e., degree ≥ 44), while poorly connected nodes were defined
as nodes with degrees less than the average degree of a node in the
network (i.e., degree < 9). To assess the robustness of the
results, we performed a sensitivity analysis by randomly
removing 10% of the PPI network in 10 independent trials and
recalculated the network properties and other results reported in
this study in each case. We found that the general observations
were consistent in each run, indicating that incompleteness of the
network or the existence of false positives is unlikely to affect the
findings presented here (data not shown).
Fig. 4 (A) Hubs tend to be essential in the protein interaction
network of E. coli, similar to what has been observed in yeast.34 Each
bin corresponds to a range of the degree of a protein in the PPI
network and shows the proportion of essential genes in the bin on the
Y-axis. (B) Both hubs and essential genes were found to be
significantly more expressed than non-essential genes (p < 2.2 ×
10⁻¹⁶) when their expression levels were compared using the publicly
available compendium of more than 400 microarray
experiments performed on E. coli. (C) However, only hubs show a tendency to
be more stable at the mRNA level compared to non-essential genes
(p < 8.9 × 10⁻⁵ comparing the stability of hubs against non-essential
genes versus p < 0.057 for essential genes against non-essential genes).
p-values were calculated using the t-test function in the R statistical
package.
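The hub definition and the robustness check described in this section can be sketched as follows. This is an illustrative Python version using only the standard library, not the igraph/R code used in the study. It assumes that "removing 10% of the PPI network" means removing 10% of the edges and that the population standard deviation is used for the hub cut-off; both are assumptions, and the function names and the toy star-shaped network are hypothetical.

```python
import random
from collections import defaultdict
from statistics import mean, pstdev


def degrees(edges):
    """Degree of every node in an undirected edge list (no duplicate edges assumed)."""
    deg = defaultdict(int)
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1
    return dict(deg)


def classify(deg):
    """Hubs: degree > mean + 2 SD. Poorly connected nodes: degree < mean."""
    avg = mean(deg.values())
    sd = pstdev(deg.values())  # assumption: population SD
    hubs = {n for n, d in deg.items() if d > avg + 2 * sd}
    poor = {n for n, d in deg.items() if d < avg}
    return hubs, poor


def perturbed_hub_sets(edges, fraction=0.10, trials=10, seed=0):
    """Remove a random fraction of edges in each trial and recompute the hub set."""
    rng = random.Random(seed)
    keep = len(edges) - round(fraction * len(edges))
    results = []
    for _ in range(trials):
        sample = rng.sample(edges, keep)  # edges surviving this trial
        hubs, _ = classify(degrees(sample))
        results.append(hubs)
    return results


# Toy network (hypothetical): one highly connected node and 20 partners.
toy_edges = [("h", f"n{i}") for i in range(20)]
toy_hubs, toy_poor = classify(degrees(toy_edges))
toy_runs = perturbed_hub_sets(toy_edges, fraction=0.10, trials=10, seed=1)
```

Comparing the hub sets across trials gives a simple indication of whether a conclusion depends on a small number of possibly spurious or missing interactions.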
Calculation of assortativity value for the mRNA half-lives
associated network of PPIs
To calculate the assortativity value, r, for the PPI network
with mRNA half-lives associated with nodes, we used the
formula defined by Newman31,32 as below, wherein the variables
j_i and k_i were taken to be the mRNA half-lives of the
interacting proteins of the ith edge, with i varying from 1 to M,
which stands for the total number of edges or interactions in
the network. The range of r is the closed interval [−1, 1], with
positive values corresponding to assortative behavior while
negative values suggest disassortativity of the network.
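$$
r = \frac{\dfrac{1}{M}\sum_i j_i k_i - \left[\dfrac{1}{M}\sum_i \dfrac{1}{2}\,(j_i + k_i)\right]^2}
         {\dfrac{1}{M}\sum_i \dfrac{1}{2}\,(j_i^2 + k_i^2) - \left[\dfrac{1}{M}\sum_i \dfrac{1}{2}\,(j_i + k_i)\right]^2}
$$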
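As an illustration of how this coefficient can be evaluated, the following is a minimal Python sketch of Newman's formula applied to a list of edges, where each edge is given as the pair of half-lives of its two interacting proteins. The function name `assortativity` and the toy values are hypothetical.

```python
def assortativity(pairs):
    """Newman's assortativity r for edges given as (j_i, k_i) value pairs.

    `pairs` holds, for each of the M edges, the half-lives of the two
    interacting proteins.
    """
    m = len(pairs)
    mean_product = sum(j * k for j, k in pairs) / m
    mean_half_sum = sum(0.5 * (j + k) for j, k in pairs) / m
    mean_half_squares = sum(0.5 * (j * j + k * k) for j, k in pairs) / m
    denominator = mean_half_squares - mean_half_sum ** 2
    if denominator == 0:
        raise ValueError("r is undefined when all node values are identical")
    return (mean_product - mean_half_sum ** 2) / denominator


# Toy checks (hypothetical half-lives, in minutes).
r_assortative = assortativity([(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)])
r_disassortative = assortativity([(1.0, 3.0), (3.0, 1.0)])
```

Edges joining nodes with identical values give r = 1, while edges that always join a low value to a high value give r = −1, matching the stated range of the coefficient.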
Estimating the significance in the differences of observed
half-lives of interacting proteins
To assess the significance of the observed differences in mRNA
half-lives of interacting proteins, we compared the average
value of the observed differences against the same value in a
collection of randomly selected pairs of genes whose half-life
values were available. In each randomization, we generated
447 pairs of genes, equal to the number of interactions
in the real dataset, and obtained the average of their differences.
A total of 100 000 random networks were generated to
estimate the statistical significance of the observed difference
in half-lives. Statistical significance was assessed based on
p-value estimation, defined as the fraction of the 100 000
random networks which showed a value ≥ what was
observed in the real network. Since the p-value of the average
of the differences in mRNA half-lives when compared against
random networks was ≤ 1.43 × 10⁻³, the results
were considered to show a significant difference in comparison
to the null model described above, suggesting that interacting
proteins tend to show lower differences in mRNA half-lives
than what is expected by chance.
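The randomization procedure described in this section can be sketched as below. This is a minimal illustration, not the original implementation: the function names, the `tail` argument and the toy data are assumptions, and a much smaller number of randomizations is used than the 100 000 reported. The tail of the null distribution to use depends on the direction of the hypothesis being tested, so both options are exposed.

```python
import random


def mean_abs_difference(pairs, half_life):
    """Average absolute half-life difference over a set of gene pairs."""
    return sum(abs(half_life[a] - half_life[b]) for a, b in pairs) / len(pairs)


def random_pair_p_value(observed, half_life, n_pairs,
                        n_random=1000, tail="lower", seed=0):
    """Fraction of random pair sets whose mean difference is as extreme as `observed`.

    tail="lower" counts random values <= observed;
    tail="upper" counts random values >= observed.
    """
    if tail not in ("lower", "upper"):
        raise ValueError("tail must be 'lower' or 'upper'")
    rng = random.Random(seed)
    genes = list(half_life)
    hits = 0
    for _ in range(n_random):
        # Draw n_pairs random pairs of distinct genes with known half-lives.
        pairs = [tuple(rng.sample(genes, 2)) for _ in range(n_pairs)]
        value = mean_abs_difference(pairs, half_life)
        if (tail == "lower" and value <= observed) or \
           (tail == "upper" and value >= observed):
            hits += 1
    return hits / n_random


# Toy data (hypothetical): 50 genes with distinct half-lives of 0..49 minutes.
toy_half_life = {f"g{i}": float(i) for i in range(50)}
p_lower = random_pair_p_value(0.0, toy_half_life, n_pairs=20, n_random=200, tail="lower")
p_upper = random_pair_p_value(0.0, toy_half_life, n_pairs=20, n_random=200, tail="upper")
```

With an observed mean difference of 0, no random set of distinct genes can reach a value that low in the toy data, so the lower-tail fraction is 0 and the upper-tail fraction is 1.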
Analysis of expression data for protein coding genes in E. coli
To compare the expression levels of essential and non-essential
genes, we obtained a large compendium comprising 445
microarray datasets available as a public resource for E. coli.37
These data were available in the form of Robust Multi-array
Average (RMA) normalized profiles, thus enabling us to directly calculate
the average expression value of protein-coding genes across
all experimental conditions tested. These averaged
gene expression values were used to compare the levels of
expression of essential genes, non-essential genes and genes
encoding hubs.
Statistical analysis for comparing gene expression and mRNA
half-lives of essential, non-essential and hub encoding genes
To test the significance of the observed higher expression of
hubs and essential genes over non-essential genes and to
investigate whether essential genes produce stable transcripts
compared to non-essential ones, we used the Welch two-sample
t-test as implemented in the R statistical package
[see http://www.r-project.org]. We found a significantly low
p-value (<2.2 × 10⁻¹⁶) when we compared expression values
of hubs and essential genes with respect to non-essential ones,
suggesting increased expression of the former groups across
different conditions. In contrast, essential genes did not show a
significant difference in their half-lives compared to the
non-essential gene set, while hubs did.
Table 2 Gene ontology biological processes enriched in hubs and essential genes along with their significance values. p-values were calculated
using BINGO,46 a Java-based tool for calculating predominant functional categories in a collection of genes. p-values were corrected for multiple
testing using the same package, at a false discovery rate (FDR) of 0.05, and only those less than 1 × 10⁻³ are shown. The hub class, which comprised 49
proteins, showed enrichment for only translation and protein metabolism, while the essential-gene class, comprising 243 genes, showed enrichment
for many other metabolic and biosynthetic processes
Biological process ontology                    p-Value           Corrected p-value
Enriched biological processes in hubs
Translation                                    1.9492 × 10⁻⁹     2.0077 × 10⁻⁷
Protein metabolic process                      1.4225 × 10⁻⁶     4.8841 × 10⁻⁵
Cellular protein metabolic process             1.4225 × 10⁻⁶     4.8841 × 10⁻⁵
Cellular macromolecule metabolic process       8.6460 × 10⁻⁶     2.2263 × 10⁻⁴
Gene expression                                2.1656 × 10⁻⁵     4.4611 × 10⁻⁴
Enriched biological processes in essentials
Cellular biosynthetic process                  5.4841 × 10⁻¹³    1.6672 × 10⁻¹⁰
Translation                                    5.2333 × 10⁻¹²    7.9547 × 10⁻¹⁰
Oxidoreduction coenzyme metabolic process      7.8414 × 10⁻⁹     7.9460 × 10⁻⁷
Coenzyme biosynthetic process                  2.4621 × 10⁻⁸     1.6639 × 10⁻⁶
Biosynthetic process                           2.7367 × 10⁻⁸     1.6639 × 10⁻⁶
Cellular macromolecule metabolic process       1.5135 × 10⁻⁷     7.6686 × 10⁻⁶
Protein metabolic process                      3.4365 × 10⁻⁷     1.3059 × 10⁻⁵
Cellular protein metabolic process             3.4365 × 10⁻⁷     1.3059 × 10⁻⁵
Cofactor biosynthetic process                  4.5055 × 10⁻⁷     1.5219 × 10⁻⁵
Coenzyme metabolic process                     1.1352 × 10⁻⁶     3.4511 × 10⁻⁵
Ubiquinone biosynthetic process                3.1663 × 10⁻⁶     8.0213 × 10⁻⁵
Ubiquinone metabolic process                   3.1663 × 10⁻⁶     8.0213 × 10⁻⁵
tRNA aminoacylation                            1.0490 × 10⁻⁵     2.1260 × 10⁻⁴
Amino acid activation                          1.0490 × 10⁻⁵     2.1260 × 10⁻⁴
tRNA aminoacylation for protein translation    1.0490 × 10⁻⁵     2.1260 × 10⁻⁴
Cofactor metabolic process                     1.6198 × 10⁻⁵     3.0776 × 10⁻⁴
tRNA metabolic process                         1.7515 × 10⁻⁵     3.1322 × 10⁻⁴
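The group comparisons described in this section rely on the Welch two-sample t-test. The study used the implementation in R; the sketch below is a standard-library Python illustration of the test statistic and the Welch–Satterthwaite degrees of freedom only. The function name `welch_t` and the toy values are hypothetical, and the p-value, which requires the t distribution, is not computed here.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Welch's t statistic and approximate degrees of freedom for two samples.

    Unlike Student's t-test, the two groups are not assumed to have
    equal variances.
    """
    na, nb = len(sample_a), len(sample_b)
    va = variance(sample_a) / na  # squared standard error of mean A
    vb = variance(sample_b) / nb  # squared standard error of mean B
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


# Toy example (hypothetical expression values for two gene groups).
t_stat, dof = welch_t([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
```

Not assuming equal variances is a reasonable choice here because the groups being compared (for example, 49 hubs against the large non-essential set) differ greatly in size.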
Conclusion
It has long been believed that in bacteria, most of the regulation
of gene expression is at the transcriptional level, with little
involvement of post-transcriptional control. However, recent
studies indicate a widespread role for several novel mechanisms
and molecules in regulating expression at the RNA
level.17,39–41 In this context, the findings presented here
indicate, for the first time, that proteins which are central in the
protein interaction network tend to be encoded by stable mRNA
transcripts, which might be constrained to possess properties
such as the formation of stem–loop structures or protection of
their 3′ ends. This would provide them with enhanced stability
over other transcripts, so that they would be available for a
longer duration in the cell. The findings presented here suggest
that hubs in the protein interaction network, which may need
to interact with multiple proteins possibly at different time
points during the cell cycle, might have been selected during
evolution to have increased transcript stability, thereby
enabling their transcripts to be utilized for multiple rounds of
translation and decreasing the cost of transcription. This hypothesis is
supported by the observation that certain functional
categories like transcription factors and enzymes involved in
core regulatory roles and central metabolism show extensive
differences in their stabilities, so that while one transcript is
available for most of the cell’s lifetime, the other is available
only under appropriate conditions, in order to fine-tune
the interaction between them and/or to prevent undesirable
cross-talk between them. Such large differences in half-lives can
also act as fine-tuned feedback mechanisms under appropriate
conditions through the physical interaction between proteins.
In light of recent studies demonstrating the impact of the
change in expression level of a single gene over generations,42
our results suggest that mRNA stability might not only
provide a fitness advantage but also mediate regulation at
the post-transcriptional level, thereby allowing an organism to
adapt to changing environments.
Our analysis of the expression level of essential genes
suggests that while essential genes are highly expressed compared
to non-essential ones and are enriched in hub-encoding
genes, they do not seem to encode more stable transcripts.
This is in contrast to what we observe for hubs, which were
found to be highly expressed and were encoded by stable
transcripts. These findings are also in contrast to what was
seen in the eukaryotic PPI network, where transcripts encoding
hubs were significantly short-lived.43 These observations
suggest that the rapid turnover of hubs in the eukaryotic PPI
network reported earlier43 might be explained based on the
distinct mechanisms that prokaryotic and eukaryotic cells use
to compartmentalize and regulate their protein availability.
For instance, microRNAs, which are known to regulate the
expression of a significant fraction of the genes at the
post-transcriptional level in higher organisms, are known to
preferentially inhibit the expression of hubs in both transcriptional
and protein–protein interaction networks,44,45 possibly
explaining their high turnover rates. Another possible
explanation for these observed differences is
that in bacteria, cell cycle duration is often short,
so that sometimes the turnover time of RNAs and other
molecules exceeds the lifespan of a single generation. This
may thereby provide the advantage of having higher stabilities
for frequently used transcripts, by allowing them to be carried
over to future generations.
Our results also suggest that essential genes, which are
highly expressed, might compensate for their abundance by
not coding for highly stable transcripts, which might otherwise
cause cellular crowding. In contrast, hubs, which were
found to be highly expressed and to produce stable transcripts,
might be utilized by the cell during most of its cell cycle or be
translationally regulated, so that they are readily available
whenever they are needed. It is also interesting to note that
essential genes are composed of two kinds of genes: a small
fraction forming hubs in the PPI network and showing higher
transcript stability, and a majority which are not highly connected
in the PPI network and show lower half-lives. This composition of
essential genes, with a small but significant fraction forming hubs
and a majority showing lower half-lives, might be the cause of
the observation that essential genes are not more stable than
non-essential ones. Taken together, our results demonstrate for
the first time that mRNA stability has a significant role in
mediating PPIs in bacteria and that physical interactions might be
influenced by a variety of post-transcriptional mechanisms.
Acknowledgements
SCJ and MMB acknowledge financial support from the MRC
Laboratory of Molecular Biology. SCJ acknowledges financial
support from Cambridge Commonwealth Trust. MMB thanks
Darwin College and Schlumberger Ltd for generous support.
We thank Wuster A, De S, Venkatakrishnan AJ and Weber K
for critically reading the manuscript and providing helpful
comments.
References
1 A. Jacobson and S. W. Peltz, Annu. Rev. Biochem., 1996, 65,
693–739.
2 C. A. Beelman and R. Parker, Cell, 1995, 81, 179–183.
3 T. E. LaGrandeur and R. Parker, EMBO J., 1998, 17, 1487–1496.
4 J. S. Anderson and R. P. Parker, EMBO J., 1998, 17, 1497–1506.
5 R. S. Cormack, J. L. Genereaux and G. A. Mackie, Proc. Natl.
Acad. Sci. U. S. A., 1993, 90, 9006–9010.
6 K. J. McDowall and S. N. Cohen, J. Mol. Biol., 1996, 255,
349–355.
7 A. J. Carpousis, Annu. Rev. Microbiol., 2007, 61, 71–87.
8 C. P. Ehretsmann, A. J. Carpousis and H. M. Krisch, Genes Dev.,
1992, 6, 149–159.
9 A. J. Carpousis, G. Van Houwe, C. Ehretsmann and H. M. Krisch,
Cell, 1994, 76, 889–900.
10 Y. Feng, T. A. Vickers and S. N. Cohen, Proc. Natl. Acad. Sci.
U. S. A., 2002, 99, 14746–14751.
11 Y. Zuo and M. P. Deutscher, Nucleic Acids Res., 2001, 29,
1017–1026.
12 K. Lee, X. Zhan, J. Gao, J. Qiu, Y. Feng, R. Meganathan,
S. N. Cohen and G. Georgiou, Cell, 2003, 114, 623–634.
13 J. Gao, K. Lee, M. Zhao, J. Qiu, X. Zhan, A. Saxena, C. J. Moore,
S. N. Cohen and G. Georgiou, Mol. Microbiol., 2006, 61, 394–406.
14 J. A. Bernstein, P. H. Lin, S. N. Cohen and S. Lin-Chao, Proc.
Natl. Acad. Sci. U. S. A., 2004, 101, 2758–2763.
15 J. A. Bernstein, A. B. Khodursky, P. H. Lin, S. Lin-Chao and
S. N. Cohen, Proc. Natl. Acad. Sci. U. S. A., 2002, 99, 9697–9702.
16 D. W. Selinger, R. M. Saxena, K. J. Cheung, G. M. Church and
C. Rosenow, Genome Res., 2003, 13, 216–223.
17 A. Szalewska-Palasz, G. Wegrzyn and A. Wegrzyn, J. Appl. Genet.,
2007, 48, 281–294.
18 M. Bon, S. J. McGowan and P. R. Cook, FASEB J., 2006, 20,
1721–1723.
19 E. Wang and E. Purisima, Trends Genet., 2005, 21, 492–495.
20 A. Martinez-Antonio, S. C. Janga and D. Thieffry, J. Mol. Biol.,
2008, 381, 238–247.
21 G. Butland, J. M. Peregrin-Alvarez, J. Li, W. Yang, X. Yang,
V. Canadien, A. Starostine, D. Richards, B. Beattie, N. Krogan,
M. Davey, J. Parkinson, J. Greenblatt and A. Emili, Nature, 2005,
433, 531–537.
22 M. Arifuzzaman, M. Maeda, A. Itoh, K. Nishikata, C. Takita,
R. Saito, T. Ara, K. Nakahigashi, H. C. Huang, A. Hirai,
K. Tsuzuki, S. Nakamura, M. Altaf-Ul-Amin, T. Oshima,
T. Baba, N. Yamamoto, T. Kawamura, T. Ioka-Nakamichi,
M. Kitagawa, M. Tomita, S. Kanaya, C. Wada and H. Mori,
Genome Res., 2006, 16, 686–691.
23 J. C. Rain, L. Selig, H. De Reuse, V. Battaglia, C. Reverdy,
S. Simon, G. Lenzen, F. Petel, J. Wojcik, V. Schachter,
Y. Chemama, A. Labigne and P. Legrain, Nature, 2001, 409,
211–215.
24 L. Salwinski, C. S. Miller, A. J. Smith, F. K. Pettit, J. U. Bowie and
D. Eisenberg, Nucleic Acids Res., 2004, 32, D449–451.
25 M. J. Marcaida, M. A. DePristo, V. Chandran, A. J. Carpousis
and B. F. Luisi, Trends Biochem. Sci., 2006, 31, 359–365.
26 M. Grunberg-Manago, Annu. Rev. Genet., 1999, 33, 193–227.
27 R. Rauhut and G. Klug, FEMS Microbiol. Rev., 1999, 23, 353–370.
28 D. A. Steege, RNA, 2000, 6, 1079–1090.
29 A. L. Barabasi and Z. N. Oltvai, Nat. Rev. Genet., 2004, 5,
101–113.
30 E. Ravasz, A. L. Somera, D. A. Mongru, Z. N. Oltvai and
A. L. Barabasi, Science, 2002, 297, 1551–1555.
31 M. E. Newman, Phys. Rev. Lett., 2002, 89, 208701.
32 M. E. Newman, Phys. Rev. E: Stat., Nonlinear, Soft Matter Phys.,
2003, 67, 026126.
33 T. Baba, T. Ara, M. Hasegawa, Y. Takai, Y. Okumura, M. Baba,
K. A. Datsenko, M. Tomita, B. L. Wanner and H. Mori, Mol.
Syst. Biol., 2006, 2, 0008.
34 H. Jeong, S. P. Mason, A. L. Barabasi and Z. N. Oltvai, Nature,
2001, 411, 41–42.
35 X. Gong, S. Fan, A. Bilderbeck, M. Li, H. Pang and S. Tao, Mol.
Genet. Genomics, 2008, 279, 87–94.
36 I. K. Jordan, I. B. Rogozin, Y. I. Wolf and E. V. Koonin, Genome
Res., 2002, 12, 962–968.
37 J. J. Faith, B. Hayete, J. T. Thaden, I. Mogno, J. Wierzbowski,
G. Cottarel, S. Kasif, J. J. Collins and T. S. Gardner, PLoS Biol.,
2007, 5, e8.
38 U. Brandes, J. Math. Sociol., 2001, 25, 163–177.
39 G. Storz, S. Altuvia and K. M. Wassarman, Annu. Rev. Biochem.,
2005, 74, 199–217.
40 W. C. Winkler and R. R. Breaker, Annu. Rev. Microbiol., 2005, 59,
487–517.
41 A. Serganov and D. J. Patel, Nat. Rev. Genet., 2007, 8,
776–790.
42 E. Dekel and U. Alon, Nature, 2005, 436, 588–592.
43 N. N. Batada, L. D. Hurst and M. Tyers, PLoS Comput. Biol.,
2006, 2, e88.
44 Q. Cui, Z. Yu, Y. Pan, E. O. Purisima and E. Wang, Biochem.
Biophys. Res. Commun., 2007, 352, 733–738.
45 H. Liang and W. H. Li, RNA, 2007, 13, 1402–1408.
46 S. Maere, K. Heymans and M. Kuiper, Bioinformatics, 2005, 21,
3448–3449.
Scaling relationship in the gene content of transcriptional machinery
in bacteria†‡
Ernesto Pérez-Rueda,a Sarath Chandra Janga*b and Agustino Martínez-Antonio*c
Received 14th April 2009, Accepted 9th June 2009
First published as an Advance Article on the web 17th July 2009
DOI: 10.1039/b907384a
The metabolic, defensive, communicative and pathogenic capabilities of eubacteria depend on their repertoire of genes and
their ability to regulate the expression of those genes. Sigma factors and transcription factors have fundamental roles in controlling these
processes. Here, we show that sigma factors, transcription factors (TFs) and protein-coding genes occur in different
magnitudes across 291 non-redundant eubacterial genomes. We suggest that these differences can be explained by the
fact that the universe of TFs, in contrast to sigma factors, exhibits a greater flexibility for transcriptional regulation, due to
their ability to sense diverse stimuli through a variety of ligand-binding domains, to discriminate over longer regions on
DNA through their diverse DNA-binding domains, and to act combinatorially with other sigmas and TFs. We also note
that the diversity of extra-cytoplasmic sigma factors and TF families is constrained in larger genomes. Our results indicate that
most widely distributed families across eubacteria are small in size, while large families are relatively limited in their distribution
across genomes. Clustering of the distribution of transcription and sigma families across genomes suggests that functional
constraints could force their co-evolution, as was observed in sigma54, IHF and EBP families. Our results also indicate that
large families might be a consequence of lifestyle, as pathogens and free-living organisms were found to exhibit a major
proportion of these expanded families. Our results suggest that understanding proteomes from an integrated perspective, as
presented in this study, can be a general framework for uncovering the relationships between different classes of proteins.
Introduction
Bacteria respond and adapt to diverse environmental conditions
as a consequence of their gene repertoire and regulatory
mechanisms, among other elements.1–3 The availability of
their genome sequences has enabled the investigation of their
differences at genetic, molecular and biochemical levels. Recent
studies have shown that the evolutionary events associated
with regulatory gene families, such as their expansion and
contraction, contribute significantly to shaping the gene
repertoire and genome size of different lineages of prokaryotes.4–7
a Departamento de Ingeniería Celular y Biocatálisis,
IBT-UNAM. AP. 565-A, Cuernavaca, Morelos, 62210, México
b MRC Laboratory of Molecular Biology, Hills Road, Cambridge,
UK CB2 0QH. E-mail: sarath@mrc-lmb.cam.ac.uk
c Departamento de Ingeniería Genética. Centro de Investigación y de
Estudios Avanzados del Instituto Politécnico Nacional, Irapuato,
36500, México. E-mail: amartinez@ira.cinvestav.mx
† This article is part of a Molecular BioSystems themed issue on
Computational and Systems Biology.
‡ Electronic supplementary information (ESI) available: An extensive
set of TFs from all 675 bacterial genomes (including redundant ones)
and further supplementary material associated with this study. See
DOI: http://www.ibt.unam.mx/~erueda/ScalingTranscription.htm
Ernesto Pérez-Rueda
Ernesto Pérez-Rueda has been
a professor at Universidad
Nacional Autonoma de
Mexico (UNAM) since
2004. He obtained his PhD at
the Center for Genomic
Sciences in UNAM and
worked on the identification
of functional residues in
homeoproteins in his
postdoctoral research at the Free
University of Brussels. His
research focuses on the
analysis of DNA-binding
transcription factors in diverse
bacteria, such as E. coli and
B. subtilis, to understand the evolution of TFs and predict
their functional roles. He has published several international
publications on these topics.
Sarath Chandra Janga
Sarath Chandra Janga is a
PhD student at the MRC
Laboratory of Molecular
Biology and University of
Cambridge. Sarath obtained
his Bachelors and Masters in
biochemical engineering and
biotechnology at the Indian
Institute of Technology, Delhi
in 2003. Prior to starting
his PhD, Sarath worked
extensively and co-ordinated
a number of research projects
on transcriptional regulation,
genome organization and
comparative genomics in
bacteria at UNAM in Mexico. He has published more than 25
research manuscripts on various aspects of prokaryotic and
eukaryotic biology in the fields of computational molecular
and systems biology. His current research interests include
understanding the design principles and constraints imposed
on post-transcriptional and post-translational gene control in
prokaryotic and eukaryotic organisms.
1494 | Mol. BioSyst., 2009, 5, 1494–1501 This journal is © The Royal Society of Chemistry 2009
REVIEW www.rsc.org/molecularbiosystems | Molecular BioSystems
Based on comparative genomics, it has been shown that
genes associated with transcriptional regulation increase in a
quadratic proportion with respect to the genome size.8–10
These observations become pertinent given that the regulation
of transcription initiation in bacteria is primarily mediated by
sigma factors (σs), which provide most of the specificity for
promoter recognition and DNA melting needed for transcription
initiation.11–13 In fact, sigma factors perform these functions
only when bound to the RNA polymerase (RNAP). On the
other hand, DNA-binding transcription factors (TFs)14 affect
gene expression by blocking or allowing the access of the
RNAP to the promoter, depending on the operator context
and ligand-binding status.15–18 Usually, most gene transcription
in exponentially growing bacteria is initiated by RNAP carrying
a housekeeping σ, similar to E. coli σ70 or B. subtilis σA.
Alternative σs typically redirect the RNAP towards a subset of
genes required during specific conditions, such as stress
response or growth transitions, among others.11–13 TFs represent
a class of proteins devoted to sensing and binding signals to regulate
genes in response to specific compounds.17,19 Although there
is extensive evidence for the existence of alternative regulatory
mechanisms in diverse bacterial systems, including post-transcriptional
regulation,20–22 they are not considered in this study, as we
focus on the specific role of TFs in mediating regulatory
mechanisms in a wide range of completely sequenced bacterial
genomes.
It has been previously suggested that the abundance of TFs
increases with an increase in an organism’s complexity8,23–26
as a consequence of different evolutionary events, such as
gene expansion, gene loss and lateral gene transfer.24,27,28
On the other hand, the repertoire of TFs, depending on
their hierarchical position in the network of transcriptional
interactions, has also been shown to play an important
role in shaping the organization of genes on bacterial
chromosomes.29–32 In this study, we analyze the repertoires
of σs and TFs in 291 eubacterial genomes and compare
their distribution in relation to the genome size to understand
their contribution to gene regulation in different
lineages and lifestyles. The results obtained here provide
insights into the functional and evolutionary constraints
imposed on different classes of regulatory factors in bacterial
organisms.
Results
The abundance of sigma factors and TFs correlates with genome
size in bacteria
To study the abundance and diversity of regulatory proteins
controlling transcription initiation, the repertoires of σs and
TFs were obtained in 291 non-redundant (NR) bacterial
genomes (see Materials and methods section for details). A
comparison of regulatory elements across genomes suggested
that they increase almost quadratically with genome size
(Fig. 1). In particular, we found that the repertoire of TFs is
roughly 10 times larger than that of σs (hundreds vs. tens) when we
considered the general profiles in all the genomes analyzed,
suggesting a proportion in the order of 1 σ : 10 TFs : 100
annotated ORFs per genome, although some genomes deviate
from this trend. This observation suggests that possible
functional relationships between TFs and σs, on one hand,
and bacterial lifestyles, on the other, could both be influencing
the observed trend. We discuss the impact of both of these
scenarios in the following sections.
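One common way to quantify a statement such as "increase almost quadratically with genome size" is to fit a power law y = a·x^b by least squares on log-transformed counts; an exponent b close to 2 corresponds to quadratic scaling. The sketch below illustrates this under stated assumptions: it is not the analysis performed in this study, and the function name `fit_power_law` and the synthetic counts are hypothetical.

```python
import math


def fit_power_law(x, y):
    """Least-squares fit of y = a * x**b on log-log axes; returns (a, b)."""
    lx = [math.log(v) for v in x]
    ly = [math.log(v) for v in y]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((u - mean_x) ** 2 for u in lx)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    b = sxy / sxx                      # scaling exponent
    a = math.exp(mean_y - b * mean_x)  # prefactor
    return a, b


# Synthetic example (hypothetical): regulator counts growing exactly
# quadratically with the number of ORFs.
orfs = [1000.0, 2000.0, 4000.0, 8000.0]
regulators = [1e-5 * n ** 2 for n in orfs]
prefactor, exponent = fit_power_law(orfs, regulators)
```

With real genome data the points scatter around the fitted line, and the deviation of individual genomes from the trend can be read from the residuals of the fit.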
The variation in the extent of conservation of σs compared to
TFs might be explained based on their regulatory roles at
transcription initiation
Firstly, the differences in the abundance of repertoires of σs
and TFs in bacteria might be attributed to the different
regulatory roles associated with them. Transcription starts
when a σ interacts with RNAP to recognize its specific
promoter sequence (Fig. 1). This promoter recognition stage
imposes the existence of at least one σ per organism, which
typically belongs to the σ70 family.13 As a result, bacterial
systems might be able to switch between different transcriptional
programs based exclusively on their repertoire of σ factors.
Nonetheless, the transcriptional programs mediated uniquely
via σs would be restricted, as a result of their limited repertoire
and the small collection of ligands they can recognize, such as
guanosine tetraphosphate (ppGpp).33 As a consequence, σs
exhibit a limited ability to directly couple the environmental
conditions with gene transcription. In addition, σs have a
constrained DNA-binding region in terms of length and the
diversity of sequences they recognize, as they need to be
structurally coupled to the RNAP in the promoter zone. These
restricted zones of action divide the universe of σs into
promoters recognized by σ70 and those recognized by σ54
(the binding zones correspond to about −10 to −35 bp for
σ70 and −12 to −24 for σ54, relative to the transcription
start site).34,35
Fig. 1 The distribution of the number of TFs and σs in bacterial
genomes as a function of genome size. Genomes are sorted on the
x-axis by the number of ORFs. The abundance of TFs and σs in each
genome is shown on the y-axis (each dot corresponds to one genome).
σs are shown in pink and transcription factors in blue.
Agustino Martínez-Antonio
Agustino Martínez-Antonio
obtained his doctoral degree
in biochemical sciences from
UNAM. After postdoctoral
research at UNAM and
INSERM, he is currently a
professor at the Research and
Advanced Studies Centre of the
National Polytechnic Institute
(CINVESTAV-IPN). His
interests include understanding
the design principles governing
the structure and function
of regulatory networks in
prokaryotes.
On the other hand, TFs define a different regulatory level
compared to σs. These proteins exhibit diverse structural and
functional domains, where one of them specifically binds
to DNA and the other can sense and bind one or more
ligand compounds from endogenous and/or exogenous
sources,17 such as the TyrR of E. coli, which binds to three
aromatic amino acids and ATP.36 In addition, TFs associate
combinatorially, not only with σs, but also with a number of
other TFs and DNA-binding sites,37,38 thus allowing the
rewiring of a transcriptional network depending on the
environmental conditions; for instance, sodA, a gene encoding
superoxide dismutase in E. coli, is regulated by up to eight
different TFs responsible for various cellular responses,
including Fur (ferric uptake regulation protein), Arc (aerobic
respiratory control) and Fnr (fumarate nitrate reduction/
regulator of anaerobic respiration).39,40 Finally, the diversity
of sequences that TFs can recognize is enormous, and their binding
sites can occur anywhere from a few bases downstream of the promoter
zone to up to hundreds of bases upstream of the transcription
start site (Fig. 1).41,42 For instance, the global regulator
CRP (catabolic repressor protein) in E. coli can regulate
promoters associated with four out of the seven possible σs
and co-regulate with more than 50 different TFs.43,44 In
summary, TFs constitute a class of proteins whose space of
action is more flexible than that of σs, not only in sensing
diverse environmental and endogenous stimuli, but also in
recognizing a wide range of binding site sequences over a
larger zone on the DNA around the transcription start site.
Lifestyles explain the abundance of σs and TFs in bigger
genomes
The results of the previous sections suggest that regulatory
complexity should increase in larger genomes and might be
associated with bacterial lifestyles, as the environment should
influence the bacterial genome structure and function. Thus,
we analyzed the genomes in relation to the four global classes
of lifestyles.45 These included extremophiles (21 genomes),
intracellular bacteria (28 genomes), pathogens (109 genomes)
and free-living bacteria (133 genomes). To understand how the
complexity of gene regulation depends on the number of σs
and/or TFs as a function of increasing genome size, and how
these are associated with lifestyle, we calculated the ratios
TFs/number of genes (T/G) and σs/number of genes (S/G)
(Fig. 2). From this analysis, we found that the increase in
regulatory complexity in intracellular (I) and extremophilic (E)
bacteria depends almost exclusively on the TF repertoire
(the number of σs did not correlate with genome
size for these lifestyles). On the other hand, in pathogenic (P)
bacteria, the regulatory repertoire is made up of TFs
and, to some extent, σs. In contrast, σs and TFs contributed
almost equally to the regulatory repertoire in free-living (F)
bacteria. Thus, TFs contribute significantly to the regulatory
complexity of bacteria across all lifestyles,
whereas σs contribute more significantly to the transcriptional
machinery of regulation in pathogens and free-living bacteria.
These results agree with previous observations, which suggest
that the few regulatory elements found in small genomes
compensate for their low number by each controlling
a larger number of DNA-binding sites, in
contrast to the many elements found in large
genomes, each of which controls a smaller proportion of DNA-binding sites
Fig. 2 The ratio of regulatory factors to the total number of ORFs per genome. The numbers of genes encoding TFs and σs were normalized
with respect to the total number of ORFs per genome (T/G and S/G, respectively), and these ratios are shown for bacteria belonging to four
different lifestyles: free-living (F), extremophiles (E), pathogens (P) and intracellular (I).
1496 | Mol. BioSyst., 2009, 5, 1494–1501 This journal is © The Royal Society of Chemistry 2009
on average.10 In addition, genes in small genomes are organized
into large operons, simplifying the transcriptional machinery
necessary for gene expression. This is in contrast to large
genomes, which have a reduced number of genes in operons,
influencing the proportion of σs and TFs in those organisms,46
suggesting that complex lifestyles would require a higher
proportion of TFs and transcription units to better orchestrate
a response to changing conditions.
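The normalization behind Fig. 2 can be illustrated with a minimal sketch. The record layout (`name`, `lifestyle`, `n_orfs`, `n_tfs`, `n_sigmas`) and both function names are assumptions introduced here for illustration; they are not taken from the original analysis, and any numbers passed in would be placeholders rather than data from this study.

```python
from collections import defaultdict


def regulatory_ratios(genomes):
    """Compute T/G and S/G for each genome.

    `genomes` is a list of dicts with the hypothetical keys
    name, lifestyle, n_orfs, n_tfs and n_sigmas.
    """
    rows = []
    for g in genomes:
        if g["n_orfs"] <= 0:
            raise ValueError(f"genome {g['name']} has no ORFs")
        rows.append({
            "name": g["name"],
            "lifestyle": g["lifestyle"],
            "T/G": g["n_tfs"] / g["n_orfs"],      # TFs per gene
            "S/G": g["n_sigmas"] / g["n_orfs"],   # sigma factors per gene
        })
    return rows


def mean_ratio_by_lifestyle(rows, key):
    """Average one ratio ('T/G' or 'S/G') within each lifestyle class."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["lifestyle"]].append(row[key])
    return {lifestyle: sum(v) / len(v) for lifestyle, v in grouped.items()}
```

Dividing by the ORF count is what makes genomes of very different sizes comparable; plotting the resulting ratios against `n_orfs`, separately for each lifestyle, gives the kind of view summarized in Fig. 2.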
The contribution of sigma factors to the transcriptional
machinery trend
In order to assess the contribution of σs to the trends
described in Fig. 1 and Fig. 2, they were divided into three
main groups based on their sequence and function. As
described in the previous section, we then computed the ratio
of the number of σs/number of genes (S/G) in all the genomes
for each group of σs, namely σ54, σ70 and extracytoplasmic
function (ECF) sigma factors.13 From this analysis, we found
that the abundance of σs is primarily determined by the
number of ECFs and σ70s, as the number of σ54 members
was found to be roughly constant and often occurred in no
more than a single copy in most genomes (Fig. 3(a)). ECFs
were highly abundant in free-living and pathogenic bacteria
with genomes containing more than 2000 genes, and might be
the result of massive gene duplications.47,48 The extent of
conservation of different types of σs across bacteria suggests
a functional role for each, depending on their distribution. For
instance, σ70 is indispensable to the adequate maintenance of a
cell and is the only sigma factor identified in small genomes with fewer
than 800 genes, whereas ECFs are factors associated with the
regulation of functional processes beyond the basal ones. In
obligate intracellular pathogens, such as Mycoplasma sp.,
Streptococcus mutans or Lactobacillus plantarum, there is
only one housekeeping σ70 and no alternative σs. σ54 factors
were found to exhibit an almost constant distribution of one
copy per genome, except in some pathogens and free-living
eubacteria, where they were identified in two copies (see the
ESI‡). σ54 factors require the assistance of specialized activators
of the EBP (enhancer binding protein) family of TFs, and this
might have constrained the number of genes regulated by σ54,
i.e. promoters associated with σ54 frequently require the
bending of long intergenic DNA stretches via IHF, resulting
in a specific physical proximity between the RNAP and
TFs.49,50 Thus, evolutionary mechanisms favouring chromosome
compactness might be working against the increased use of σ54
promoters in bacteria.
To analyze the specific contribution of the different
families of σs to gene transcription, we computed the ratio
of the number of σs/genes (S/G) in all the genomes. Fig. 3(b)
shows, as expected, that σ70s have a higher proportion
of genes to transcribe in small genomes, but that as genome
size increases, this proportion diminishes; ECF is the
only family whose proportion of regulated genes increases
in larger genomes. Most of the diversification of ECFs
corresponds to free-living and pathogenic genomes with
~5000 ORFs.
Fig. 3 The distribution of families of σs in bacterial genomes. (a) Genome size is shown on a log scale on the x-axis. The y-axis shows the number
of σ factors in each family per genome. (b) The ratio of the number of sigma factors from each family to the total number of ORFs per genome; the
three outliers, with a high number of ECFs, correspond (from left to right) to a β-proteobacterium (N. europaea) and two Bacteroides (B. fragilis
NCTC9434 and B. thetaiotaomicron VPI-5482).
The abundance of TFs does not correlate with the diversity
of families, and large families are not the most widely distributed
An appealing hypothesis is that a high diversity of TF families
would contribute more significantly to regulatory plasticity
than σs. In line with this hypothesis, an analysis of 93 TF
families, comprising a total of 46 255 TFs across all the
genomes analyzed in this study, showed a reduced diversity of
families in small genomes, which increases
in larger ones, especially in pathogens (P) and free-living
organisms (F) (Fig. 4(a)). The diversity of families reaches a
maximum in genomes with around 5000 ORFs. The higher
number of TFs in larger genomes does not necessarily imply
a greater diversity of families beyond this plateau, but instead an
increase in the size of some families of TFs. Congruent with
this observation, Fig. 4(b) shows that the average number of
TFs per family increases linearly, with a few families of TFs
expanding disproportionately. These families are
LysR and TetR, which represent about 24% of the total set
of TFs identified (11 078 of 46 255 proteins). Members of these
two families increase abruptly in larger genomes, as shown
in Fig. 4(c), which also shows the three other most-populated
families of TFs in eubacteria for the sake of comparison. The
increase in the size of these two families in larger genomes
coincides with the plateauing of the diversity of families in
these bacterial genomes (marked by arrows in Fig. 4(a), (b),
and (c)). Another feature associated with large families is that
they are not widely distributed among bacteria, despite their
role in controlling important processes, such as cell–cell
communication (LuxR), the response to external conditions
by two-component systems (OmpR), the sensing, uptake and
metabolism of external food sources (GntR and LysR), or
resistance to antibiotics (TetR). On the other hand, some
families with an average size of a few copies per genome, such
as DnaA, LexA and IHF from E. coli, proposed to be essential
in standard growth conditions in this bacterium and in keeping
its DNA and nucleoid integrity,51,52 can be considered to be
conserved across bacteria. This is because they were identified
in at least 86% of the genomes, suggesting probable gene loss
events in bacteria where they are absent (Fig. 5).
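The two quantities discussed above, family diversity per genome (Fig. 4(a) and (b)) and family conservation across genomes (Fig. 5), can be sketched as follows. This is a minimal illustration and not the code used in the study; the input layout (one list of family labels per genome, one label per TF) and the function names are assumptions.

```python
from collections import Counter


def family_stats(tf_families_by_genome):
    """Per-genome TF count, number of distinct families and mean family size.

    `tf_families_by_genome` maps a genome name to a list holding one
    family label per predicted TF (hypothetical layout).
    """
    stats = {}
    for genome, families in tf_families_by_genome.items():
        counts = Counter(families)
        n_families = len(counts)
        stats[genome] = {
            "n_tfs": len(families),
            "n_families": n_families,  # diversity of families
            "mean_family_size": len(families) / n_families if n_families else 0.0,
        }
    return stats


def family_conservation(tf_families_by_genome):
    """Fraction of genomes in which each family occurs at least once."""
    n_genomes = len(tf_families_by_genome)
    present = Counter()
    for families in tf_families_by_genome.values():
        present.update(set(families))  # count each family once per genome
    return {family: n / n_genomes for family, n in present.items()}
```

Under this sketch, a family would be called widely conserved when its conservation value is high (the text uses at least 86% of genomes), irrespective of how many copies it has in any one genome.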
In summary, our results suggest that a family’s abundance
and distribution are associated with evolutionary events in
bacteria. For instance, small families widely distributed among
Fig. 4 Characteristics of TF families in bacterial genomes. (a) The number of TF families as a function of the number of ORFs in each genome,
grouped according to the lifestyle of the organism: E (extremophiles), I (intracellular), P (pathogens) and F (free-living bacteria). (b) The average
number of TFs per family as a function of the number of ORFs in each genome, grouped according to the lifestyle of the organism, as in (a).
(c) The ratio of the number of TFs to ORFs per genome for the five most abundant families of TFs in bacterial genomes.
bacteria might be related to ancestral functions beyond tran-
scriptional regulation, such as DNA organization, or nucleoid
integrity or DNA salvage, whereas large families might be
associated with the regulation of dispensable or emergent
processes in bacterial evolution, such as quorum sensing,
belonging to the members of the LuxR family, which are
widely identified in bacteria. Indeed, the evolution of this
mechanism in bacteria has been proposed to be one of the
early steps in the development of multicellularity,53 and may
be correlated with bacterial specialization.
Functional relationships might impose evolutionary constraints
Since some proteins tend to work together in a functional
context, we analyzed the distributions of different families, as
this would give us an indication about the co-evolution of
regulatory factors. Hence, we clustered the co-occurrence of
the regulatory protein families (TFs and σs) in all 291 bacterial
genomes, as shown in Fig. 6. From this analysis, we found that
the distribution of σ54, IHF and EBP families is correlated,
supporting the functional interdependence discussed above
(and inset in Fig. 6) and probable co-evolution, where
members and mechanisms have been preserved along the
course of evolution. A second cluster included σ70, the ECF
family of sigma factors and other highly abundant families
(more than 15 members per genome) responsible for regulating
diverse mechanisms of stress response (MarR), antibiotic
resistance (TetR), osmotic response (OmpR) and quorum
sensing (LuxR), among other processes. This suggests
a strong functional relationship among these σ and TF
families. These clusters, in addition, give insights into the
functional interdependence between regulatory proteins from
different families, which could help in the characterization of
regulators in poorly studied genomes.
Materials and methods
Genome sequences
Predicted proteomes for 291 eubacteria were obtained from
the Entrez Genome database of the NCBI (ftp://ncbi.nlm.nih.gov/genomes/bacteria).54
A complete list of non-redundant genomes can be obtained
at http://popolvuh.wlu.ca/Phyl_Profiles/NR_genomes/REDUNDANCY.html. In brief, two genomes are considered
redundant if they share a genomic similarity score (GSS)
higher than 0.95, where GSS is defined as the ratio of the
sum of all the BLAST bit-scores for protein coding genes that
have orthologs between two genomes being compared and
reaches a maximum of one if all the proteins of one organism
are identical to their corresponding orthologs of another
organism. This would be the case when the proteomes are
identical.55,56 A complete list of genomes analyzed and their
repertoire of TFs is provided as ESI.‡
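The redundancy filter can be sketched in a few lines. The text does not spell out the denominator of the GSS ratio; the sketch below assumes it is the sum of the bit-scores obtained when every protein of the reference genome is aligned against itself, which is consistent with the stated maximum of one for identical proteomes. That assumption, the dictionary layout and the function names are illustrative only.

```python
def genomic_similarity_score(ortholog_bits, self_bits):
    """Sketch of a genomic similarity score (GSS) for one genome pair.

    ortholog_bits: {protein_id: BLAST bit-score against its ortholog in
                    the other genome}; proteins without an ortholog are absent.
    self_bits:     {protein_id: bit-score of the protein against itself},
                    for every protein of the reference genome.
    Assumption: the denominator is the sum of all self bit-scores.
    """
    denominator = sum(self_bits.values())
    if denominator == 0:
        return 0.0
    numerator = sum(
        score for protein, score in ortholog_bits.items() if protein in self_bits
    )
    return numerator / denominator


def is_redundant(gss, threshold=0.95):
    """Two genomes are treated as redundant when GSS exceeds the threshold."""
    return gss > threshold
```

With this definition, a genome pair scores 1.0 only when every protein has an ortholog that aligns as well as the protein aligns to itself; missing orthologs or diverged sequences lower the score.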
Fig. 5 The diversity and conservation of TF families in bacteria. The
occurrence of a TF family across genomes as a function of the total
number of TFs identified. Some families of TFs conserved in a few
copies per genome are circled in pink. Note that these are also the most
conserved families of TFs in the analyzed genomes. In contrast, some
families (circled in blue) are the most populated, though are less
conserved, in comparison to those circled in pink across genomes.
Fig. 6 The clustering of transcription and sigma factor families across bacterial genomes based on their co-occurrence profiles. A clear
co-occurrence distribution is observed for IHF, EBP and σ54 families, suggesting a functional interdependence between them. The co-regulatory
mode of action for these regulatory proteins is shown in the inset.
The identification of families of DNA-binding transcription
factors (TFs)
To identify and analyze the repertoire of TFs in bacterial
genomes, a combination of information from different sources
and bioinformatics tools was used. Firstly, 45 088 putative
TFs were collected from the transcription factor DB,57 a
database devoted to the identification and classification of
DNA-binding TFs by means of the SUPERFAMILY library
and PFAM hidden Markov models (HMMs). In a second
phase, 90 family-specific HMMs previously reported from
E. coli K12 and 57 family-specific HMMs from B. subtilis5,58
were used to scan the complete genome sequences (E-value
threshold = 10⁻³) with the hmmsearch module of the HMMer
suite (http://hmmer.janelia.org). TF families were
identified based on their DNA-binding domains: in a first
step, if a protein shared more than 25% identity in its
DNA-binding region with any of the well-characterized
TFs of E. coli and/or B. subtilis, it was included in the
corresponding family. In order to include distant homologs and
to decrease the bias associated with the over-representation of
TFs from specific organisms, these families were expanded by
BLAST searches59 against the SwissProt database60 using an
E-value threshold of 10⁻⁶. Proteins retrieved were filtered at
100% identity to exclude redundancy using the program CD-hit61 and
aligned with ClustalW.62 Proteins with less than 50% similarity
against their corresponding HMM were excluded. This step is
important to explore potential TFs not identified through the
first approach and vice versa, i.e. the coverage of the DBD
database corresponds to approximately 70% of the universe of
TFs and can be complemented with family-specific HMMs.63
Previous studies using this approach for predicting new TFs
suggest that these models are successful in identifying a
significant fraction of experimentally confirmed TFs in different
lineages,40,64 confirming the value of these predictions for
studying genome-scale patterns. An extensive set of TFs
from all 675 bacterial genomes (including redundant ones)
and supplementary material associated with this study is
available.‡
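The first assignment step described above (an E-value cut-off followed by an identity cut-off in the DNA-binding region) can be sketched as a simple filter over search hits. The `Hit` record, its field names and the rule of keeping the lowest E-value hit per protein are assumptions made for illustration; the sketch does not parse real HMMer output and is not the pipeline used in the study.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One protein-to-family match (hypothetical record layout)."""
    protein: str
    family: str
    evalue: float
    dbd_identity: float  # percent identity within the DNA-binding region


def assign_families(hits, max_evalue=1e-3, min_identity=25.0):
    """Assign each protein to one TF family.

    A hit is kept when its E-value is at most `max_evalue` and its
    identity in the DNA-binding region is strictly above `min_identity`.
    When several hits pass, the one with the lowest E-value wins
    (an illustrative tie-breaking choice).
    """
    best = {}
    for hit in hits:
        if hit.evalue > max_evalue or hit.dbd_identity <= min_identity:
            continue
        current = best.get(hit.protein)
        if current is None or hit.evalue < current.evalue:
            best[hit.protein] = hit
    return {protein: hit.family for protein, hit in best.items()}
```

The default thresholds mirror the values quoted in the text (E-value 10⁻³, more than 25% identity); the later BLAST expansion and redundancy filtering are outside the scope of this sketch.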
The identification of σ factors
Three HMMs were used to identify σ70, σ54 and ECF-like
sigma factors across genomes. σ70 and σ54 models were
retrieved from the PFAM database.65 ECFs have been
considered as a separate group of σ70 proteins because of their
significant sequence divergence from the σ70 family. Thus, we
constructed a specific ECF HMM based on the well-known
repertoire of ECF proteins in B. subtilis. These proteins
were used to run the motif discovery and search system,
MEME/MAST (using default parameters), to identify specific
regions associated with this group. We selected two motifs to
construct HMMs and to scan the whole repertoire of bacterial
genomes. The motifs and HMMs are available in the ESI.‡
Clustering of families of regulatory factors
To analyze the distribution of σs and TF families across the
291 bacterial genomes, their occurrences were first stored as a matrix. This
matrix was then loaded into the Cluster 3.0 program66 to
identify groups of families that correlate in terms of their
occurrence profile across all the bacterial genomes. A
hierarchical complete linkage clustering algorithm was run
with an uncentered correlation as a similarity measure. The
clustering results were then visualized using the Treeview
program.66
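For readers without access to Cluster 3.0, the procedure can be approximated by the following pure-Python sketch. It assumes that each family is represented by a vector of counts, one per genome, that the uncentered correlation is the cosine of the angle between two such vectors, and that the clustering distance is one minus that value. The function names and the stopping rule (a fixed number of clusters instead of a full dendrogram) are illustrative choices, not a description of the Cluster 3.0 implementation.

```python
import math


def uncentered_correlation(x, y):
    """Uncentered Pearson correlation (cosine similarity) of two profiles."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return dot / (norm_x * norm_y)


def complete_linkage(profiles, n_clusters):
    """Agglomerative complete-linkage clustering of occurrence profiles.

    profiles: {family_name: [count in genome 1, count in genome 2, ...]}
    Returns a sorted list of clusters, each a sorted list of family names.
    """
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    names = list(profiles)
    # Pairwise distance = 1 - uncentered correlation.
    dist = {
        (a, b): 1.0 - uncentered_correlation(profiles[a], profiles[b])
        for a in names for b in names if a != b
    }
    clusters = [[name] for name in names]

    def linkage(c1, c2):
        # Complete linkage: distance between the two farthest members.
        return max(dist[(a, b)] for a in c1 for b in c2)

    while len(clusters) > n_clusters:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return sorted(sorted(cluster) for cluster in clusters)
```

Families that co-occur across the same genomes have nearly parallel profiles and therefore small distances, so they merge early; this is the behaviour that groups σ54, IHF and EBP together in Fig. 6.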
Conclusions
To understand the relationship between the expansion
patterns of different regulatory factors involved in gene
regulation at transcription initiation, 291 completely sequenced
bacterial genomes, which represent adaptive designs for
different lifestyles, were analyzed. We showed that the
distribution of σs and TFs follows a trend, with a ratio of
1 σ per 10 TFs and 100 ORFs in all the genomes analyzed,
coinciding with our present knowledge that σs direct RNAP to
a small repertoire of binding sites in sequence and location,
compared to the diversity provided by the collection of TFs at
the promoters in a genome. For instance, in E. coli, around
95% of its genes are transcribed by σ70, with the fine tuning of
their expression mediated by TFs.44 In addition, we found
that large genomes contain fewer
different families of TFs, i.e. a lower diversity of families, than
would otherwise be expected. In this context, abundant
families are not widely distributed across all bacteria. In
contrast, some small families are the most widely distributed.
This difference might be associated with different phenomena,
such as evolutionary constraints by regulatory mechanisms, as
discussed in the case of DnaA or LexA and EBP families.
Our results also suggest that in larger genomes, regulatory
complexity may possibly increase as a result of the increasing
number of members from the ECF family and some TF
families. However, it is unclear if this increase would correspond
to an increase in complexity by means of multiple parallel
switches and feed-forward loops in regulatory networks (as
shown for carbon sources in E. coli67), as long regulatory
cascades, or as a combination of both. Overall, the analyses
presented here will not only contribute to improving our
understanding of the influence of design on the regulation of
gene expression, but also support the basis for a comprehensive
modelling of transcriptional regulatory networks in bacteria.
The observations discussed in this study should be valid for a
wide range of bacteria in most genomic studies; the analysis of
over 100 genomes is reported to be sufficient and robust
enough to be generalized.68
Abbreviations
σs Sigma factors
TFs Transcription factors
EBP Enhancer binding proteins
ECF Extracytoplasmic function sigma factors
Acknowledgements
EP-R was financed by a grant (IN-217508) from DGAPA-
UNAM and by grants given to Lorenzo Segovia. S. C. J.
acknowledges financial support from the Cambridge
Commonwealth Trust and the MRC-Laboratory of Molecular
Biology. A. M.-A. acknowledges financial support from
CONCYTEG (young researcher) and CINVESTAV (multi-
disciplinary). S. C. J. thanks colleagues at MRC-LMB for
critically reading the manuscript and providing helpful
comments. We thank Jose Antonio Ibarra and Gabriel
Moreno-Hagelsieb for their helpful comments in the preparation
of the manuscript, and Derek Wilson for providing us with the
collection of TFs available through the DBD database.
References
1 M. Lynch and J. S. Conery, Science, 2003, 302, 1401–1404.
2 M. Lynch, Annu. Rev. Microbiol., 2006, 60, 327–349.
3 B. O. Bengtsson, J. Theor. Biol., 2004, 231, 271–278.
4 Y. Minezaki, K. Homma and K. Nishikawa, DNA Res., 2005, 12,
269–280.
5 E. Perez-Rueda, J. Collado-Vides and L. Segovia, Comput. Biol.
Chem., 2004, 28, 341–350.
6 D. A. Rodionov, Chem. Rev., 2007, 107, 3467–3497.
7 J. A. Oguiza, K. Kiil and D. W. Ussery, Trends Microbiol., 2005,
13, 565–568.
8 E. van Nimwegen, Trends Genet., 2003, 19, 479–484.
9 O. X. Cordero and P. Hogeweg, Trends Genet., 2007, 23, 488–493.
10 N. Molina and E. van Nimwegen, Genome Res., 2008, 18, 148–160.
11 M. M. Wosten, FEMS Microbiol. Rev., 1998, 22, 127–150.
12 A. Ishihama, Annu. Rev. Microbiol., 2000, 54, 499–518.
13 T. M. Gruber and C. A. Gross, Annu. Rev. Microbiol., 2003, 57,
441–466.
14 D. F. Browning and S. J. Busby, Nat. Rev. Microbiol., 2004, 2,
57–65.
15 N. S. Miroslavova and S. J. Busby, Biochem. Soc. Symp., 2006,
1–10.
16 M. E. Wall, W. S. Hlavacek and M. A. Savageau, Nat. Rev. Genet.,
2004, 5, 34–42.
17 A. Martinez-Antonio, S. C. Janga, H. Salgado and J. Collado-
Vides, Trends Microbiol., 2006, 14, 22–27.
18 S. C. Janga and J. Collado-Vides, Res. Microbiol., 2007.
19 A. Goelzer, F. Bekkal Brikci, I. Martin-Verstraete, P. Noirot,
P. Bessieres, S. Aymerich and V. Fromion, BMC Syst. Biol.,
2008, 2, 20.
20 A. Gutierrez-Preciado, T. M. Henkin, F. J. Grundy, C. Yanofsky
and E. Merino, Microbiol. Mol. Biol. Rev., 2009, 73, 36–61.
21 R. R. Breaker, Science, 2008, 319, 1795–1797.
22 C. A. Wakeman, W. C. Winkler and C. E. Dann, 3rd, Trends
Biochem. Sci., 2007, 32, 415–424.
23 J. H. Brown, V. K. Gupta, B. L. Li, B. T. Milne, C. Restrepo and
G. B. West, Philos. Trans. R. Soc. London, Ser. B, 2002, 357,
619–626.
24 M. Levine and R. Tjian, Nature, 2003, 424, 147–151.
25 M. A. Changizi, J. Theor. Biol., 2001, 211, 277–295.
26 G. B. West and J. H. Brown, J. Exp. Biol., 2005, 208,
1575–1592.
27 L. Aravind, V. Anantharaman, S. Balaji, M. M. Babu and
L. M. Iyer, FEMS Microbiol. Rev., 2005, 29, 231–262.
28 M. Madan Babu, S. A. Teichmann and L. Aravind, J. Mol. Biol.,
2006.
29 S. C. Janga, H. Salgado, J. Collado-Vides and A. Martinez-
Antonio, J. Mol. Biol., 2007, 368, 263–272.
30 S. C. Janga, H. Salgado and A. Martinez-Antonio, Nucleic Acids
Res., 2009, 37, 3680–3688.
31 G. Kolesov, Z. Wunderlich, O. N. Laikova, M. S. Gelfand and
L. A. Mirny, Proc. Natl. Acad. Sci. U. S. A., 2007, 104,
13948–13953.
32 C. Marr, M. Geertz, M. T. Hutt and G. Muskhelishvili, BMC Syst.
Biol., 2008, 2, 18.
33 L. Jores and R. Wagner, J. Biol. Chem., 2003, 278, 16834–16843.
34 J. D. Gralla, Curr. Opin. Genet. Dev., 1996, 6, 526–530.
35 G. Lloyd, P. Landini and S. Busby, Essays Biochem., 2001, 37,
17–31.
36 J. Pittard, H. Camakaris and J. Yang, Mol. Microbiol., 2005, 55,
16–26.
37 S. Adhya, Sci. STKE, 2003, 2003, pe22.
38 A. Barnard, A. Wolfe and S. Busby, Curr. Opin. Microbiol., 2004,
7, 102–108.
39 I. Compan and D. Touati, J. Bacteriol., 1993, 175, 1687–1696.
40 H. Salgado, S. Gama-Castro, M. Peralta-Gil, E. Diaz-Peredo,
F. Sanchez-Solano, A. Santos-Zavaleta, I. Martinez-Flores,
V. Jimenez-Jacinto, C. Bonavides-Martinez, J. Segura-Salazar,
A. Martinez-Antonio and J. Collado-Vides, Nucleic Acids Res.,
2006, 34, D394–397.
41 M. Madan Babu and S. A. Teichmann, Trends Genet., 2003, 19,
75–79.
42 J. Collado-Vides, B. Magasanik and J. D. Gralla, Microbiol. Rev.,
1991, 55, 371–394.
43 S. Gama-Castro, V. Jimenez-Jacinto, M. Peralta-Gil, A. Santos-
Zavaleta, M. I. Penaloza-Spinola, B. Contreras-Moreira,
J. Segura-Salazar, L. Muniz-Rascado, I. Martinez-Flores,
H. Salgado, C. Bonavides-Martinez, C. Abreu-Goodger,
C. Rodriguez-Penagos, J. Miranda-Rios, E. Morett, E. Merino,
A. M. Huerta, L. Treviño-Quintanilla and J. Collado-Vides,
Nucleic Acids Res., 2008, 36, D120–124.
44 A. Martinez-Antonio and J. Collado-Vides, Curr. Opin. Microbiol.,
2003, 6, 482–489.
45 I. Cases, V. de Lorenzo and C. A. Ouzounis, Trends Microbiol.,
2003, 11, 248–253.
46 G. Moreno-Hagelsieb, Curr. Genomics, 2006, 7, 163–170.
47 D. Missiakas and S. Raina, Mol. Microbiol., 1998, 28, 1059–1066.
48 J. D. Helmann, Adv. Microb. Physiol., 2002, 46, 47–110.
49 S. V. Kuznetsov, S. Sugimura, P. Vivas, D. M. Crothers and
A. Ansari, Proc. Natl. Acad. Sci. U. S. A., 2006, 103, 18515–18520.
50 S. Sugimura and D. M. Crothers, Proc. Natl. Acad. Sci. U. S. A.,
2006, 103, 18510–18514.
51 Y. Yamazaki, H. Niki and J. Kato, Methods Mol. Biol., 2008, 416,
385–389.
52 S. Y. Gerdes, M. D. Scholle, J. W. Campbell, G. Balazsi,
E. Ravasz, M. D. Daugherty, A. L. Somera, N. C. Kyrpides,
I. Anderson, M. S. Gelfand, A. Bhattacharya, V. Kapatral,
M. D’Souza, M. V. Baev, Y. Grechkin, F. Mseeh,
M. Y. Fonstein, R. Overbeek, A.-L. Barabási, Z. N. Oltvai and
A. L. Osterman, J. Bacteriol., 2003, 185, 5673–5684.
53 M. B. Miller and B. L. Bassler, Annu. Rev. Microbiol., 2001, 55,
165–199.
54 R. L. Tatusov, D. A. Natale, I. V. Garkavtsev, T. A. Tatusova,
U. T. Shankavaram, B. S. Rao, B. Kiryutin, M. Y. Galperin,
N. D. Fedorova and E. V. Koonin, Nucleic Acids Res., 2001, 29,
22–28.
55 G. Moreno-Hagelsieb and S. C. Janga, Proteins, 2008, 70, 344–352.
56 G. Moreno-Hagelsieb and J. Collado-Vides, Bioinformatics, 2002,
18(suppl. 1), S329–336.
57 S. K. Kummerfeld and S. A. Teichmann, Nucleic Acids Res., 2006,
34, D74–81.
58 S. Moreno-Campuzano, S. C. Janga and E. Perez-Rueda, BMC
Genomics, 2006, 7, 147.
59 S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang,
W. Miller and D. J. Lipman, Nucleic Acids Res., 1997, 25,
3389–3402.
60 A. Bairoch and R. Apweiler, Nucleic Acids Res., 2000, 28, 45–48.
61 W. Li, L. Jaroszewski and A. Godzik, Bioinformatics, 2002, 17,
77–82.
62 J. D. Thompson, D. G. Higgins and T. J. Gibson, Nucleic Acids
Res., 1994, 22, 4673–4680.
63 E. Sonnhammer, S. Eddy and R. Durbin, Proteins, 1997, 28,
405–420.
64 N. Sierro, Y. Makita, M. de Hoon and K. Nakai, Nucleic Acids
Res., 2008, 36, D93–96.
65 J. Mistry and R. Finn, Methods Mol. Biol., 2007, 396, 43–58.
66 M. B. Eisen, P. T. Spellman, P. O. Brown and D. Botstein, Proc.
Natl. Acad. Sci. U. S. A., 1998, 95, 14863–14868.
67 A. Martinez-Antonio, S. C. Janga and D. Thieffry, J. Mol. Biol.,
2008, 381, 238–247.
68 D. E. Whitworth, Trends Microbiol., 2008, 16, 512–519.
Computational Biology and Chemistry 33 (2009) 261–268
Research Article
Plasticity of transcriptional machinery in bacteria is increased
by the repertoire of regulatory families
Sarath Chandra Jangaa,∗,1, Ernesto Pérez-Ruedab,∗,2
a MRC Laboratory of Molecular Biology, Hills Road, Cambridge CB2 0QH, UK
b Departamento de Ingeniería Celular y Biocatálisis, Instituto de Biotecnologia. Universidad Nacional Autónoma de México, Cuernavaca, Morelos 62100, Mexico
Article info
Article history:
Received 30 September 2008
Received in revised form 16 June 2009
Accepted 17 June 2009
Keywords:
Transcription factor families
Regulatory network
Transcription machinery
Prokaryotes
Evolution
Abstract
Escherichia coli K12 and Bacillus subtilis 168 are two of the best characterized bacterial organisms, with
a long history in molecular biology for understanding various mechanisms in prokaryotic species. However,
at the level of transcriptional regulation, little is known on a comparative scale. Here we address
the question of the degree to which transcription factors (TFs) and their evolutionary families are shared
between them. We found that 59 proteins and 28 families are shared between these two bacteria, whereas
different subsets were lineage specific. We demonstrate that the majority of the common families expand in a
lineage-specific manner. More specifically, we found that AraC, ColD, Ebp, LuxR and LysR families are
over-represented in E. coli, while ArsR, AsnC, MarR, MerR and TetR families have significantly expanded in B.
subtilis. We introduce the notion of regulatory superfamilies based on an empirical number of functional
categories regulated by them and show that these families are essentially different in the two bacteria.
We further show that global regulators seem to be constrained to smaller regulatory families and generally
originate from lineage-specific families. We find that although TF families may be conserved across
genomes, their functional roles might evolve in a lineage-specific manner and need not be conserved,
indicating convergence to be an important phenomenon involved in the functional evolution of TFs of the
same family. Although topologically the networks of transcriptional interactions among TF families are
similar in both the genomes, we found that the players are different, suggesting different evolutionary
origins for the transcriptional regulatory machinery in both bacteria. This study provides evidence from
complete repertoires that not only novel families originate in different lineages but conserved TF families
expand/contract in a lineage-specific manner, and suggests that part of the global regulatory mechanisms
might originate independently in different lineages.
© 2009 Elsevier Ltd. All rights reserved.
1. Introduction
The genomes of the two model organisms, Escherichia coli
K12 (Blattner et al., 1997) and Bacillus subtilis 168 (Kunst et al.,
1997), contain different proportions of transcription units (TUs)
(Moreno-Hagelsieb and Collado-Vides, 2002), sigma factors and
promoters (Salgado et al., 2006; Makita et al., 2004). Despite these
basic differences, it has been possible to find some conserved
and unique DNA-binding transcription factors (TFs) acting over
their complete gene repertoires (Makita et al., 2004; Perez-Rueda
et al., 2004). Such TFs have been related to a wide diversity of
functions including catabolite repression, differentiation and cel-
∗ Corresponding authors.
E-mail addresses: sarath@mrc-lmb.cam.ac.uk (S.C. Janga), erueda@ibt.unam.mx
(E. Pérez-Rueda).
1 Tel.: +44 1223 402479; fax: +44 1223 213556.
2 Tel.: +52 56 22 76 10; fax: +52 777 3 17 23 88.
lular maintenance, among others. However, it is unclear how the
collection of proteins performing similar functions (DNA-binding
ability) could have evolved in these two organisms with different
evolutionary history and ancestry (Hedges, 2002). Understanding
the evolution of the transcriptional regulatory machinery across
genomes would improve our knowledge about the evolutionary
constraints that play a role in the formation of regulatory networks
and would also help to decipher the design principles governing
these networks across bacteria (Janga et al., 2009). Although some
recent works have dealt with the evolution of the components
and suggested duplication of genes as the main factor contribut-
ing to the formation of the Transcriptional Regulatory Network
(TRN) (Madan Babu and Teichmann, 2003; Teichmann and Babu,
2004), there has been no comparative analysis of TFs and their
families between genomes to understand the evolutionary constraints,
functional aspects and design principles governing their
formation. Despite an increasing interest
in identifying and understanding the regulatory repertoires of entire
genomes using a variety of computational approaches (Perez-Rueda
1476-9271/$ – see front matter © 2009 Elsevier Ltd. All rights reserved.
doi:10.1016/j.compbiolchem.2009.06.004
et al., 2005; Brune et al., 2005; Moreno-Campuzano et al., 2006;
Kummerfeld and Teichmann, 2006), to our knowledge no genome-scale
comparative study has been reported so far using
representative genomes from distant lineages, especially in the
context of regulatory networks. Here we present the first comprehensive
comparative analysis of the complete repertoires of TFs
from two prokaryotic model organisms, E. coli K12 and B. subtilis.
In this work, we first identify and classify the repertoire of DNA-binding
TFs of E. coli and B. subtilis into families using a previously
reported approach applied to E. coli (Perez-Rueda and Collado-Vides,
2000). We then analyze the collection of TFs and their
families at various levels to deduce the common set of regulatory
genes and families and to infer specific tendencies of TFs. Our
analyses were based on the collection of TFs reported in and collected
from two different databases: RegulonDB (Salgado et al., 2006) for E.
coli K12 and DBTBS (Makita et al., 2004) for B. subtilis. An additional
literature search was performed to retrieve a more complete dataset
of TFs in these organisms. Here, we demonstrate that although E.
coli and B. subtilis contain a similar proportion of DNA-binding TFs,
the majority of the TF families have expanded and evolved independently.
The regulatory networks based on the set of well-known
TFs in both genomes suggest that the functions of genes regulated
by similar families could be different. These findings open diverse
opportunities to understand the complex regulatory systems in different
bacteria, beyond Proteobacteria and Firmicutes.
2. Materials and Methods
2.1. Identification of TFs and Construction of TF Families in B. subtilis
In order to identify the repertoire of TFs in B. subtilis, we used
a combination of information sources and bioinformatics tools as
reported earlier (Moreno-Campuzano et al., 2006). Briefly, 237 TFs
were identified by an exhaustive analysis of three sources: (i) TFs
identified from DBTBS, a database devoted to the gene regulatory
mechanisms in B. subtilis strain 168 (Makita et al., 2004); (ii) TFs
identified by searching with family-specific Hidden Markov Models
(HMMs) reported previously (Perez-Rueda et al., 2004) from E. coli
TFs (E-value threshold ≤1e-3); and (iii) TFs identified with the
library of HMMs from the Superfamily database (E-value ≤1e-3)
(Madera et al., 2004). This HMM library is based on the sequences of
domains collected in the Structural Classification of Proteins (SCOP)
database (Hubbard et al., 1997) and is thus applicable for a structural
classification of proteins. In summary, the final dataset included
those proteins identified by HMMs, Superfamily searches, and the
repertoire (manually curated) of TFs described in DBTBS. These proteins
were classified into families by using HMMs deposited in the
PFAM DB (Bateman et al., 2000), and aligned by using the program
hmmalign from HMMer. Our final collection included 90 families
in E. coli and 51 families for B. subtilis. Additionally, their corre-
sponding HMMs were used to scan a collection of 234 genomes,
including bacterial, archaeal and eukaryotic species, in order to
determine their evolutionary emergence in different lineages (see
Supplementary Material for a complete list of genomes analyzed
and the number of TFs identified across genomes).
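As an illustration only (this is not the pipeline used in the study), the merging of the three sources described above could be sketched as follows. The sketch assumes that the E-value threshold acts as an upper bound on accepted hits; the function names, the identifiers and the E-values in the toy input are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

E_VALUE_CUTOFF = 1e-3  # threshold quoted in the text, read here as an upper bound


def significant_hits(hits: Iterable[Tuple[str, float]],
                     cutoff: float = E_VALUE_CUTOFF) -> Set[str]:
    """Return protein IDs having at least one hit with E-value <= cutoff."""
    return {protein for protein, e_value in hits if e_value <= cutoff}


def merge_tf_sources(curated: Set[str],
                     family_hmm_hits: Iterable[Tuple[str, float]],
                     superfamily_hits: Iterable[Tuple[str, float]]
                     ) -> Dict[str, Set[str]]:
    """Map every candidate TF to the set of sources that support it."""
    sources = {
        "curated": set(curated),
        "family_hmm": significant_hits(family_hmm_hits),
        "superfamily": significant_hits(superfamily_hits),
    }
    support: Dict[str, Set[str]] = {}
    for source_name, proteins in sources.items():
        for protein in proteins:
            support.setdefault(protein, set()).add(source_name)
    return support


# Toy input: hypothetical identifiers and E-values, not real search output.
example = merge_tf_sources(
    curated={"tfA", "tfB"},
    family_hmm_hits=[("tfB", 1e-20), ("tfC", 5e-4), ("orfX", 0.2)],
    superfamily_hits=[("tfC", 1e-8), ("tfD", 1e-3)],
)
```

Keeping the supporting sources for each protein, rather than a bare union, makes it easy to see which candidates rest on a single line of evidence.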
2.2. Data of Regulatory Interactions
Transcriptional regulatory interactions of E. coli K12 were
obtained from RegulonDB (Salgado et al., 2006), which contains
experimental information extracted from literature, whereas the
regulatory interactions of B. subtilis were retrieved from DBTBS
(Makita et al., 2004). Those interactions from the datasets where
a sigma factor is known to control the expression of a gene were
excluded. Therefore, a total of 1816 regulatory interactions were
considered for E. coli while 745 were included from the B. subtilis
TRN.
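The exclusion of sigma-factor interactions can be pictured with a minimal sketch. It assumes that each interaction is available as a (regulator, target) pair and that a list of sigma-factor names is known; the gene labels below are toy values chosen for illustration and are not taken from RegulonDB or DBTBS.

```python
from typing import Iterable, List, NamedTuple, Set


class Interaction(NamedTuple):
    regulator: str
    target: str


def drop_sigma_interactions(interactions: Iterable[Interaction],
                            sigma_factors: Set[str]) -> List[Interaction]:
    """Keep only interactions whose regulator is not a sigma factor."""
    return [i for i in interactions if i.regulator not in sigma_factors]


# Toy data (illustrative labels only).
toy_sigma_factors = {"sigma1", "sigma2"}
toy_interactions = [
    Interaction("tfA", "gene1"),
    Interaction("sigma1", "gene1"),
    Interaction("tfB", "gene2"),
    Interaction("sigma2", "gene3"),
]
filtered = drop_sigma_interactions(toy_interactions, toy_sigma_factors)
```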
2.3. Identification of Orthologs
Orthologs are defined as proteins in different species that
evolved from a common ancestor by speciation (Fitch, 1970) and
usually have the same function. Our working definition of orthology
consisted of BLASTP reciprocal best hits, which is a widely accepted
criterion for identifying functional orthologs; homologous genes
were identified with an E-value cutoff of 1e-6 as described elsewhere
(Janga and Moreno-Hagelsieb, 2004).
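A reciprocal-best-hit assignment of the kind described above can be sketched as follows. This is a simplified illustration, not the cited procedure: it assumes that hits have already been parsed into (query, subject, E-value, bit score) tuples, that the best hit is the one with the highest bit score, and that the 1e-6 cutoff is applied before best hits are chosen. All identifiers are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Hit = Tuple[str, str, float, float]  # (query, subject, e_value, bit_score)


def best_hits(hits: Iterable[Hit], max_e_value: float = 1e-6) -> Dict[str, str]:
    """For each query, return the subject with the highest bit score."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, e_value, bit_score in hits:
        if e_value > max_e_value:
            continue  # not significant under the assumed cutoff
        if query not in best or bit_score > best[query][1]:
            best[query] = (subject, bit_score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b: Iterable[Hit], b_vs_a: Iterable[Hit],
                         max_e_value: float = 1e-6) -> List[Tuple[str, str]]:
    """Pairs (a, b) where a's best hit is b and b's best hit is a."""
    a_best = best_hits(a_vs_b, max_e_value)
    b_best = best_hits(b_vs_a, max_e_value)
    return sorted((a, b) for a, b in a_best.items() if b_best.get(b) == a)


# Toy hit tables (hypothetical proteins a1..a3 and b1..b2).
a_vs_b = [("a1", "b1", 1e-50, 200.0), ("a1", "b2", 1e-10, 60.0),
          ("a2", "b2", 1e-30, 120.0), ("a3", "b1", 1e-8, 50.0)]
b_vs_a = [("b1", "a1", 1e-48, 195.0), ("b2", "a1", 1e-9, 58.0),
          ("b2", "a2", 1e-29, 118.0)]
orthologs = reciprocal_best_hits(a_vs_b, b_vs_a)
```

In the toy data, a3 finds b1 as its best hit, but b1 prefers a1, so a3 is a homolog without an assigned ortholog.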
3. Results and Discussion
3.1. Conserved TFs and TF Families Between E. coli and B. subtilis
Genomes
Two proteins associated to common functions might be a con-
sequence of common origin in different genomes (orthologous) or
gene duplication within a genome after speciation (paralogous).
Thus, we sought to determine the fraction of the total repertoire of
TFs in E. coli and B. subtilis related by orthology and how it compares
with genomic conservation. We found that 59 TFs from E. coli, which
correspond to around 20% of total TFs, had orthologs in B. subtilis,
while around 29% of their total gene products are related by orthol-
ogy which is statistically significant (see Supplementary Material),
as has been previously observed about their conservation patterns
using only a known subset of TFs in these genomes (Madan Babu
et al., 2006; Lozada-Chavez et al., 2006). This finding suggests that
TFs between the two genomes are 30% less conserved than other
protein classes, indicating that TFs are likely lost to a greater extent
at such phylogenetic distances (Lozada-Chavez et al., 2006). These
observations give rise to several questions concerning the evolu-
tionary and functional conservation of TFs between these bacterial
genomes, so in order to have an insight into the commonalities and
differences in the gene regulation between the prokaryotic species
from the perspective of TFs, we used the complete repertoires of
TFs in E. coli and B. subtilis. Based on diverse sequence and HMM
searches, a total of 303 E. coli TFs and 237 B. subtilis TFs were
identified. These repertoires were also classified into families and
compared to understand their evolutionary trends. Fig. 1 shows
the different proportions of TF families identified in the genomes.
However, it can be noted that the ArgR, BirA, DnaA, FrvR, LexA, PrpD and
WrbA families show a very similar distribution in both the genomes.
The similar proportion of these groups suggests the possibility of
an early evolution of these families before the split of Proteobacte-
ria and Firmicutes and no subsequent lineage-specific expansion or
loss. A closer look at the functions of these families indicates that
they are mostly involved in the synthesis of amino acids, replication
and DNA repair mechanisms and metabolism of sugars. In contrast,
the AraC, ColD, DeoR, Ebp, IclR, LacI, LuxR, RpiR, YjhU YdeW
and YeiL families are dominant in E. coli, whereas ArsR, AsnC, GntR,
Fur, MarR, MerR, ROK, TetR and OmpR can be seen to be domi-
nant in B. subtilis. It is interesting to observe that AraC, ColD, Ebp,
LuxR and LysR families are roughly twice as abundant in proportion in E. coli
as in B. subtilis, while ArsR, AsnC, MarR, MerR and TetR show a
marked over-representation in B. subtilis. To test the significance of
this observation and to determine if these distributions are in fact
very different we performed a chi-square test, with the expected
distribution in each genome calculated as the product of the total
TFs from the common families and proportion of the TF family as
seen in the other genome. We observed a P-value <10^-53 when the
familial distribution in B. subtilis was considered as the observed
Fig. 1. Proportion of TFs in the common families identified among E. coli and B.
subtilis. Proportion of TFs in each family was calculated as the fraction of TFs in a
given family normalized against the total TFs that were identified in that genome in
the common families. Chi-square tests against the distributions suggest that they are
significantly different, with P-values lower than 10^-19 observed in both the genomes.
set and a P-value <10^-19 when families in E. coli were treated as
observed, indicating a significant difference in the distribution of
common families. In order to test if some of the well conserved
single-gene families like ArgR, BirA, DnaA, discussed above affect
the P-values, we re-calculated them excluding these families and
found them to be lower than 10^-57 and 10^-22 respectively, suggesting
that these families indeed make the distributions more similar
due to their very similar distribution in both the genomes and are
probably evolutionarily conserved as single copies in most prokaryotic
genomes (Perez-Rueda et al., 2004; Lozada-Chavez et al., 2006;
Makarova et al., 2001; Rodionov et al., 2002; Erill et al., 2004; Fujita
et al., 1989). A plausible explanation for this biased lineage-specific
over-representation of antibiotic related families in B. subtilis and
metabolic and structural components in E. coli could be attributed to
the niche in which these organisms survive. For instance, although
both of them are free-living bacteria, E. coli has adapted to thrive
inside its host and can degrade a wide variety of carbon sources
thereby harboring a number of TFs responsible for degradation of
carbon compounds (Martinez-Antonio et al., 2008), while B. subtilis
has adapted to soil environments and can use only a limited number
of carbon sources, a limitation that is probably compensated by its
ability to form spores and to survive long periods of starvation in the
absence of substrates, which might be driven by the large number of
sigma factors present in this bacterium.
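The chi-square comparison of family distributions described in this section can be sketched as below. The sketch follows the stated recipe (expected count = total TFs in the common families of the observed genome × proportion of that family in the other genome) but computes only the test statistic and the degrees of freedom; converting these to a P-value requires the chi-square distribution, which is not part of the Python standard library. The family labels and counts are invented for illustration.

```python
from typing import Dict, Tuple


def chi_square_vs_other_genome(observed: Dict[str, int],
                               reference: Dict[str, int]) -> Tuple[float, int]:
    """Chi-square statistic of `observed` family counts against the
    family proportions seen in `reference` (the other genome)."""
    if observed.keys() != reference.keys():
        raise ValueError("both genomes must list the same common families")
    total_observed = sum(observed.values())
    total_reference = sum(reference.values())
    statistic = 0.0
    for family, count in observed.items():
        # Expected count: observed total scaled by the other genome's proportion.
        expected = total_observed * reference[family] / total_reference
        statistic += (count - expected) ** 2 / expected
    degrees_of_freedom = len(observed) - 1
    return statistic, degrees_of_freedom


# Toy counts for three hypothetical common families.
genome_1 = {"FamA": 10, "FamB": 20, "FamC": 30}
genome_2 = {"FamA": 20, "FamB": 20, "FamC": 20}
stat_1, dof = chi_square_vs_other_genome(genome_1, genome_2)
stat_2, _ = chi_square_vs_other_genome(genome_2, genome_1)
```

Because the expected distribution is taken from the other genome, the statistic is not symmetric: swapping the roles of the two genomes generally gives a different value, which is consistent with two different P-values being reported.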
We proceeded to analyze those DNA-binding TFs that are com-
mon to both the bacterial genomes. Table 1 shows the distribution
of 59 orthologous TFs between the genomes grouped according
to their association to TF families (Perez-Rueda et al., 2004). We
found that the core of the TFs conserved between the genomes
are involved in the regulation of amino acid biosynthesis, car-
bon sources assimilation, antibiotic resistance, DNA replication
and repair and biosynthesis of membrane components. To iden-
tify shared and species-specific TFs quantitatively, overlooking the
different evolutionary phenomena the organisms might have
undergone over time, homologous proteins were identified in the
complete collection of TFs by using BLASTP matches. This compar-
ative analysis allowed us to assess the extent of variation between
the two bacteria at the level of regulatory proteins. We found that
151 TFs have homologs in both genomes, corresponding to about
52% of the TFs of E. coli, suggesting that the regulatory networks
might be poorly conserved between the species and controlled by
few homologous regulatory functions. It is also interesting to note
Table 1
Distribution of orthologous TFs between E. coli and B. subtilis into different families.
Family      Number of orthologs  Regulatory roles
LysR        7                    Amino acid biosynthesis
GntR        7                    NA
AraC        5                    Carbon metabolism, cell wall synthesis, stress responses and pathogenesis
LuxR        3                    Quorum sensing
TetR        3                    Tetracycline resistance
DeoR        3                    Deoxyribose assimilation
MerR        3                    Mercury resistance
MarR        2                    Multiple antibiotic resistance response
YjeB        2                    NA
Rok         2                    NA
Cold        2                    Low temperature adaptation
LacI        2                    Carbon sources assimilation
Fis         2                    NA
YeiL        1                    Global regulatory functions
PrpD        1                    NA
LexA        1                    DNA repair
AsnC        1                    Leucine repressor protein
Fur         1                    Iron assimilation
Ihf         1                    Integration host factor
OmpR        1                    Biosynthesis of membrane components
YjhU YdeW   1                    NA
RpiR        1                    Ribose phosphate isomerase
IclR        1                    Glyoxylate bypass operon
BirA        1                    Biosynthesis of biotin
DnaA        1                    Transcription and replication regulation
ArsR        1                    Arsenic resistance
ArgR        1                    Arginine biosynthesis
HipB        1                    NA
Functions of the TF families were as described in a previous study on TF families
in E. coli. TF families whose function cannot be determined either because of lack
of information or due to the complexity of the members to regulate more than one
function are represented as not available (NA).
that E. coli and B. subtilis contain 48% and 36% unique TFs at the
BLASTP thresholds used to determine homologs, indicating that the
genomes have considerably changed their repertoires of regulatory
machinery.
3.2. Lineage-specific Families
Diverse families were identified exclusively in E. coli or B. subtilis.
In Table 2 we present 16 families identified as specific to Firmi-
Table 2
Lineage-specific TF families identified in B. subtilis.
TF family  Family size  Regulatory roles
AbrB       3            Transition state genes
CodY       1            Global regulatory mechanisms
ComK       1            Late competence genes
CtsR       1            Class III stress genes
DtxR       1            Manganese transport
DUF24      3            Unknown
HxlR       2            Detoxification system regulation
LytTR      3            Regulation of autolysis
PadR       1            Phenolic acid synthesis
PRD        1            Phosphotransferase system
Psq        1            NA
PucR       1            Purine degradation
Rrf2       3            NA
RsfA       1            Prespore-specific regulation
TenA       1            Extracellular enzyme genes
Xre        17           Prophage, competence development and sporulation associated genes
TF families whose function cannot be determined because of lack of information are
represented as not available (NA).
Table 3
Distribution of global TFs into different families in E. coli K12 and B. subtilis 168.
Family  Family size (E. coli)  Family size (B. subtilis)  Global TFs (E. coli)  Global TFs (B. subtilis)
YeiL    3                      1                          Crp/Fnr               –
Hns     2                      0                          Hns                   –
AsnC    3                      6                          Lrp                   –
IHF     4                      2                          IHF                   –
Fis     8                      3                          Fis                   –
LacI    14                     11                         –                     CcpA
ComK    0                      1                          –                     ComK
AbrB    0                      3                          –                     AbrB
CodY    0                      1                          –                     CodY
OmpR    14                     8                          ArcA                  Spo0A
cutes or closer lineages in terms of their taxonomic distribution.
In these families there are diverse global regulators, such as ComK,
AbrB and CodY, which act as switches between sporulation and the free-living
state in B. subtilis. From the perspective of E. coli, we found
15 characterized families which are specific to this bacterium or
closely related lineages. For instance, MetJ and TrpR, the regulators
of methionine- and tryptophan-related genes, and AlpA and Crl, which
are known to be involved in lipopolysaccharide adhesion
to human gastric tissue and in the regulation of curli surface fibers
respectively, are constrained to enterobacteria while some fami-
lies like CaiF, HycA and HtgA are exclusive to E. coli and Salmonella
strains. This suggests that diverse lineage-specific TFs might be
involved in specific and important processes, such as sporulation in
bacilli or in some specific amino acid biosynthesis routes in enter-
obacterial species. It is interesting to note that the absence of TFs for
several important amino acid biosynthetic routes in B. subtilis and
other Firmicutes is complemented by the invention of novel regulatory
mechanisms such as transcription attenuation, despite the fact
that these genomes might be responding to identical regulatory signals
in the synthesis of these amino acids, suggesting the possibility
for variations even in fundamental processes of the cell (Gollnick
et al., 2005; Gutierrez-Preciado et al., 2005; Winkler et al., 2003;
Merino and Yanofsky, 2005; Rodionov et al., 2004). In other bacteria,
similar lineage-specific TFs and TF families might be expected, as has
been previously reported for Streptomyces coelicolor (Bentley et al.,
2002).
3.3. Evolution of Global TFs in the Context of TF Families
Global TFs, defined as those regulatory proteins which regulate
a wide variety of functional categories and have their influence on
a considerable number of genes (Martinez-Antonio and Collado-
Vides, 2003), provide important insights into the evolution of
regulatory mechanisms in bacterial genomes. Therefore, it was our
interest to understand how this class of TFs is distributed across
TF families in both the genomes (see Table 3). From this table,
we did not find any global TFs in common families, thus although
there are common families between the two genomes, global TFs
have originated from completely different TF families in different
lineages. Some specific examples in this direction have been also
demonstrated in other bacteria, like Crc in Pseudomonas putida
which belongs to the endonuclease/exonuclease/phosphatase fam-
ily (Morales et al., 2004) or ArlR in Staphylococcus aureus and PrrA
in Rhodobacter sphaeroides which are members of two component
response regulators (Liang et al., 2005; Mao et al., 2005). How-
ever, many of the global TFs occur in families identified in both the
genomes except for Hns and ArcA in E. coli and ComK, AbrB and CodY
in B. subtilis which occur in genome or lineage-specific families. A
glance at the functions of these global TFs indicates that they are
specific in their functional roles and might have evolved depend-
ing on the organism specific needs like sporulation in B. subtilis,
essentially implying that different bacteria might have developed
Fig. 2. Distribution of the COG categories of the regulated genes by each family of TFs
in (a) E. coli and (b) B. subtilis. Only those families which have more than 3 regulated
genes per family are shown. The first 8 TF families correspond to the ones which
exist in both the genomes. The fractions in each column are normalized against the
total COG annotated genes. COG functional categories: amino acid transport and
metabolism (E); carbohydrate transport and metabolism (G); energy production
and conversion (C); transcription (K); cell wall/membrane/envelope biogenesis (M);
replication, recombination and repair (L); inorganic ion transport and metabolism
(P); translation, ribosomal structure and biogenesis (J); posttranslational modi-
fication, protein turnover, chaperones (O); signal transduction mechanisms (T);
coenzyme transport and metabolism (H); cell motility (N); nucleotide transport
and metabolism (F); lipid transport and metabolism (I); secondary metabolites
biosynthesis, transport and catabolism (Q); defence mechanisms (V); intracellu-
lar trafficking, secretion, and vesicular transport (U); cell cycle control, cell division,
chromosome partitioning (D). Correlations observed in the distribution of functional
categories of the regulated genes in the common TF families: YeiL (R^2 = 0.0398),
BirA (R^2 = 0.9423), LacI (R^2 = 0.0023), AsnC (R^2 = 0.8036), Fur (R^2 = 0.734), OmpR
(R^2 = 0.4332), GntR (R^2 = 0.1103) and ArgR (R^2 = 0.8598).
Fig. 3. TF family size versus number of regulatory interactions among members of the same family in (a) E. coli and (b) B. subtilis.
at least part of their global regulatory mechanisms independently.
A second observation that can be made from the table is that most
of the global TFs seem to fall into smaller TF families hinting that
global TFs might avoid the cross talk over the binding sites between
different members of their TF family by reducing the number of
family members. Some TFs of the same family are known to bind
to very similar binding sites when the sequences encoding them
have significant sequence similarity as in the case of MarA, Rob and
SoxS (Martin and Rosner, 2002). An alternate explanation for the
observed tendency could be that large families through gene dupli-
cation could have sub-divided their regulatory functions among
many TFs, thus leaving no room for global regulators in larger
families. To test the significance of this observation we compared
the average size of a family for a global regulator (observed to be
3.4 and 5.6 in E. coli and B. subtilis respectively) with the average
family size of a general TF in 1000 randomly sampled collections
each equal to the size of respective total repertoires in both the
genomes. We found that the average family size of a general TF
in the randomized collection followed a normal distribution and
hence used Z-scores to calculate P-values. In both E. coli and B.
subtilis, P-values <10^-37 were observed, indicating that global TFs
have a strong tendency to occur in small families. Whatever the reasons
that best explain this tendency, the above observation
should enhance our ability to predict global TFs in other microbial
genomes.
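The randomization test described above can be sketched as follows. This is a hedged reconstruction, not the original script: it assumes that each random collection is drawn with replacement from the family sizes of all TFs (one value per TF), that the collection has the same size as the full repertoire, and that a one-sided P-value is taken from the lower tail of the normal distribution. The toy family sizes are invented.

```python
import math
import random
import statistics
from typing import Sequence, Tuple


def family_size_z_score(global_sizes: Sequence[int],
                        all_sizes: Sequence[int],
                        n_samples: int = 1000,
                        seed: int = 0) -> Tuple[float, float]:
    """Z-score and lower-tail P-value of the mean family size of global
    TFs against the mean family size in resampled TF collections."""
    rng = random.Random(seed)
    observed = statistics.mean(global_sizes)
    # Each collection is as large as the full repertoire (assumption:
    # sampling with replacement, otherwise every collection would be identical).
    sample_means = [
        statistics.mean(rng.choices(all_sizes, k=len(all_sizes)))
        for _ in range(n_samples)
    ]
    mu = statistics.mean(sample_means)
    sigma = statistics.stdev(sample_means)
    z = (observed - mu) / sigma
    p_lower = 0.5 * math.erfc(-z / math.sqrt(2))  # P(Z <= z) for a standard normal
    return z, p_lower


# Toy repertoire: 20 TFs in one family of size 20, 10 TFs in families of
# size 2, and 5 single-member families; three hypothetical global TFs.
toy_all_sizes = [20] * 20 + [2] * 10 + [1] * 5
toy_global_sizes = [1, 2, 2]
z, p = family_size_z_score(toy_global_sizes, toy_all_sizes)
```

A strongly negative Z-score indicates that the global TFs sit in smaller families than a typical TF, which is the direction of the tendency reported here.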
3.4. Distribution of Functional Classes in the Genes Regulated by TF Families
In order to study the heterogeneity of the TFs in families in a
functional context one has to compare the functions of the regu-
lators in each TF family. However, given the poor annotations for
genes encoding TFs about their specific functional roles it would be
hard to use them for a comparative functional analysis. Moreover,
most of the functional classification schemes for genes do not con-
tain a detailed description for the physiological roles played by the
regulators in the context of the genome being analyzed. Consider-
ing these issues we used the functional categories of the regulated
genes in each family to analyze the extent of functional variation in
TF families in both the organisms.
To understand the variability in the functions of the regulated
genes by each family of TFs in E. coli and B. subtilis we used the
COG annotations of the protein coding genes available from NCBI
(Tatusov et al., 1997; Tatusov et al., 2003). In Fig. 2, we show the distribution
of the COG categories of the regulated genes for TF families
that are known to regulate more than 3 COG annotated genes. The
families YeiL, Fis, AraC and Fur in E. coli and GerE, ComK, LacI, Fur and
AbrB in B. subtilis regulate more than 7 different categories. These
families can be considered as “regulatory superfamilies” in these
organisms because of their ability to control diverse physiological
processes. Fur is the only family which regulates a large number
Fig. 4. Network of transcriptional interactions between different TF families identified in (a) E. coli and (b) B. subtilis. Two nodes are shown to be connected if there exists at
least one regulatory interaction between the nodes.
of categories in both the bacteria, probably indicating its presence
in the vital roles of the cell. Other families like GerE, ComK and
AbrB suggest the evolution and expansion of their functions independently
in Firmicutes. When we examined the TF families which
regulate far fewer categories we found that most of them are either
very restricted in their function or have an ancient origin.
A closer look into the distribution of COG categories of the 8
common families, namely YeiL, BirA, LacI, AsnC, Fur, OmpR, GntR
and ArgR, between the two genomes gave further insights into
the evolution of TF families in the context of functional roles. For
instance, the YeiL family, which comprises the Crp, Fnr and YeiL
TFs in E. coli, regulates 15 different categories there and 4 in B. subtilis, of
which the 2 categories “inorganic ion transport and metabolism”
and “signal transduction mechanisms” are predominantly regu-
lated in the second bacterium. The case of the BirA and ArgR
families is interesting, because they are known to be well con-
served across all the genomes (Makarova et al., 2001; Rodionov et
al., 2002). Accordingly, we found an appreciable overlap in the functional
categories of the regulated genes in these families. TFs from
the families LacI and GntR were found to preferentially regulate
the functions “carbohydrate transport and metabolism” and “Transcription
and energy production and conversion” in B. subtilis, while
in E. coli the dominantly regulated categories included “nucleotide
transport”, “carbohydrate transport” and “amino acid transport and
metabolism” for LacI members and “transcription” and “carbohydrate
transport and metabolism” for GntR members. The case of the
family of Fur regulators seems to be interesting, with the majority of
regulated genes in both E. coli and B. subtilis belonging to “inorganic
ion transport and metabolism” possibly suggesting partial conser-
vation of the regulatory roles of its members. The family of OmpR
regulators are known to be involved in the regulation of genes
related to the biosynthesis of membrane components in E. coli,
accordingly we found them to regulate the categories “inorganic
ion transport and metabolism”, “cell wall/membrane/envelope biogenesis”,
“lipid transport and metabolism” and “transcription” in
both bacteria. These observations lead us to conclude that although
TF families may be conserved across genomes their functional
roles might evolve in an organism-specific or lineage-specific
manner and are not always conserved indicating convergence to
be a major phenomenon involved in the functional evolution of
transcription factors of the same TF family. This finding also sug-
gests that existence of common families between two organisms
could be the result of a common ancestry initially but with spe-
ciation, functional divergence and lineage-specific expansion or
contraction of TF families occur rapidly to adapt to changing
environments.
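The comparison of functional distributions between the two genomes for a common family can be sketched as follows. The sketch assumes that the R^2 values quoted for the common families are squared Pearson correlations between the per-category fractions in the two organisms; this reading is an assumption, and the COG letters assigned to the toy regulated genes are invented.

```python
from collections import Counter
from typing import List, Sequence


def category_fractions(cog_letters: Sequence[str],
                       categories: Sequence[str]) -> List[float]:
    """Fraction of annotated regulated genes falling in each COG category."""
    counts = Counter(cog_letters)
    total = sum(counts[c] for c in categories)
    if total == 0:
        raise ValueError("no COG-annotated genes in the given categories")
    return [counts[c] / total for c in categories]


def r_squared(x: Sequence[float], y: Sequence[float]) -> float:
    """Squared Pearson correlation between two equally long vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for a constant vector")
    return sxy ** 2 / (sxx * syy)


# Toy example: COG letters of genes regulated by one hypothetical family.
categories = ["E", "G", "C", "K", "P"]
genome_1_letters = list("PPPPEGK")
genome_2_letters = list("PPPEGGC")
fractions_1 = category_fractions(genome_1_letters, categories)
fractions_2 = category_fractions(genome_2_letters, categories)
similarity = r_squared(fractions_1, fractions_2)
```

Values close to 1 would correspond to families such as BirA or ArgR with largely conserved functional profiles, and values close to 0 to families such as LacI or YeiL whose regulated functions differ between the genomes.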
Fig. 5. Number of TFs predicted across genomes by using TF family models from E. coli and B. subtilis as the phylogenetic distance with respect to E. coli increases. Intersection
stands for the number of TFs predicted by the models from both the genomes while the predictions identified using models in E. coli only are shown in green and those based
on B. subtilis models only are shown in blue. To facilitate the display of results, we only show 105 complete genomes, obtained by filtering out strains and species of the same
bacterial genus, keeping the strain or species with the maximum number of genes within a given genus. The evolutionary distance from E. coli to all organisms
was obtained according to the evolutionary branching process previously reported (Brown et al., 2001).
3.5. Interaction of TFs Within and Across Families
In order to understand if TFs in a given family interact with each
other to regulate biological processes, we sought to see any relation
between the number of regulatory interactions among members in
a given family and its family size. In Fig. 3 we show the number of
transcriptional interactions between members of the same family
of TFs as the family size increases. It appears from this figure that
interactions among members of a TF family increase as the family
size increases, although the number of interactions is always low
in both the genomes.
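Counting interactions among members of the same family against family size can be sketched as below. The sketch assumes a mapping from each TF to its family and a list of (regulator, target) pairs; autoregulatory interactions are excluded here, which is a choice made for this illustration and is not stated in the text. All names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def intra_family_counts(tf_family: Dict[str, str],
                        interactions: Iterable[Tuple[str, str]]
                        ) -> Dict[str, Tuple[int, int]]:
    """Return {family: (family size, interactions between distinct members)}."""
    sizes = Counter(tf_family.values())
    within: Counter = Counter()
    for regulator, target in interactions:
        regulator_family = tf_family.get(regulator)
        target_family = tf_family.get(target)  # None when the target is not a TF
        if (regulator_family is not None
                and regulator_family == target_family
                and regulator != target):  # skip autoregulation (assumption)
            within[regulator_family] += 1
    return {family: (size, within[family]) for family, size in sizes.items()}


# Toy data: three TFs in family F1, one TF in family F2, one non-TF gene.
toy_tf_family = {"t1": "F1", "t2": "F1", "t3": "F1", "t4": "F2"}
toy_edges = [("t1", "t2"), ("t2", "t3"), ("t1", "t1"), ("t4", "t1"), ("t1", "g9")]
per_family = intra_family_counts(toy_tf_family, toy_edges)
```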
To study the interaction of TFs from different families we
identified the transcriptional interactions between the regulators
belonging to different families. As shown in Fig. 4 we observed a
scale-free like topology when the interactions between TF families
were modeled as a network. Some of the well-connected families
are those containing the global regulators; however, it should be
noted that the families which are responsible for the scale-free
nature in E. coli and B. subtilis are different as has been observed
in the TRNs of these genomes (Madan Babu et al., 2006). It is
interesting to consider, from this figure, the case in which one of the
well-connected hubs (such as Fis or AraC in E. coli, or AbrB or ComK in
B. subtilis), but not the central node, is removed from the genome.
This would remove a branch of the interactions rather than cause
lethality of the whole network (and hence the cell), which is typical
of scale-free networks and has been described as robustness
(Albert et al., 2000). However, in this context,
robustness might refer to the conditions of growth in which
these regulatory families are no longer needed by the cell to reg-
ulate its processes. Although the data of the TRN of B. subtilis is
smaller in size compared to E. coli, these observations allow us to
conclude that scale-free nature in the networks of TF families is
common to both the genomes and might have evolved to choose
different nodes as hubs despite the existence of common families.
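The family-level network can be sketched as follows: two families are connected when at least one regulatory interaction exists between their members, and the degree of each family indicates which ones behave as hubs. The sketch treats edges as undirected and drops within-family interactions; both are assumptions made for this illustration, and the data are invented.

```python
from collections import Counter
from typing import Dict, FrozenSet, Iterable, Set, Tuple


def family_network(tf_family: Dict[str, str],
                   interactions: Iterable[Tuple[str, str]]
                   ) -> Set[FrozenSet[str]]:
    """Undirected edges between TF families linked by >= 1 interaction."""
    edges: Set[FrozenSet[str]] = set()
    for regulator, target in interactions:
        regulator_family = tf_family.get(regulator)
        target_family = tf_family.get(target)
        if regulator_family is None or target_family is None:
            continue  # one end is not a classified TF
        if regulator_family == target_family:
            continue  # within-family interactions are not edges here
        edges.add(frozenset((regulator_family, target_family)))
    return edges


def family_degrees(edges: Iterable[FrozenSet[str]]) -> Counter:
    """Number of distinct partner families for every family."""
    degree: Counter = Counter()
    for edge in edges:
        for family in edge:
            degree[family] += 1
    return degree


# Toy data: family A regulates members of B, C and D.
toy_families = {"a1": "A", "a2": "A", "b1": "B", "c1": "C", "d1": "D"}
toy_links = [("a1", "b1"), ("a2", "b1"), ("a1", "c1"),
             ("a2", "d1"), ("b1", "a1"), ("a1", "a2")]
toy_edges_between = family_network(toy_families, toy_links)
toy_degree = family_degrees(toy_edges_between)
degree_distribution = Counter(toy_degree.values())
```

A degree distribution dominated by low-degree families with a few highly connected ones is the qualitative signature referred to here as scale-free-like.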
3.6. Effect of Lineage Specificity in Predicting TFs From
Comparative Genomics
Despite the poor conservation of the TFs between the two bac-
teria, we wanted to determine how much comparative genomics
can help to identify TFs across organisms using the family-specific
HMMs developed in these bacteria and how much overlap the predictions
based on the models from different organisms might have
in a given genome. We therefore identified the repertoires of TFs
in complete genomes using the models from both the genomes
(see Section 2 and Supplementary Material). In Fig. 5 the num-
ber of predicted TFs across 105 complete non-redundant genomes
from the perspective of both the genomes is shown. It is clear from
the figure that although the number of TFs predicted from B. sub-
tilis perspective is lower than that from E. coli’s, there is almost a
complete overlap in the predictions between the two sets across
genomes suggesting that comparative genomics approaches based
on family-specific HMMs as against homology based approaches
which typically search similarity across the entire length of the
sequence can be very powerful to predict TFs with a high positive
predictive value (calculated as True Positives/(True Positives + False
Positives)). For example in B. subtilis we identified 185 TFs using E.
coli-based models, of which 167 were a subset of the 237 TFs identified in
this bacterium; similarly, we identified 122 TFs in E. coli based on B.
subtilis models, of which 114 were a subset of the collection identified
earlier in E. coli (Perez-Rueda and Collado-Vides, 2000). It is also
easy to note from the figure that as the evolutionary distance with
respect to E. coli increases (in Archaea and Eukarya) the predictive
coverage drops rapidly indicating the loss of domain level signal
at such distances. A second observation to note is that B. subtilis
models tend to predict a slightly higher proportion of TFs in closer lineages
like Bacillales and Lactobacillales, while E. coli models clearly
dominate the number of predictions in all proteobacterial lineages
suggesting the effect of lineage or genome-specific expansion of TFs
playing an important role in identifying TFs across genomes.
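The positive predictive values implied by the counts quoted above can be worked out directly. The sketch below uses only the numbers given in the text (185 predictions with 167 confirmed in B. subtilis; 122 predictions with 114 confirmed in E. coli) and treats the previously identified repertoires (237 and 303 TFs) as the reference sets. The fraction of each reference set that is recovered is an additional quantity computed here for illustration and is not reported in the text.

```python
def positive_predictive_value(true_positives: int, predicted: int) -> float:
    """PPV = TP / (TP + FP), where TP + FP is the number of predictions."""
    return true_positives / predicted


def recovered_fraction(true_positives: int, reference_size: int) -> float:
    """Share of the reference repertoire recovered by the predictions."""
    return true_positives / reference_size


# E. coli models applied to B. subtilis: 185 predicted, 167 in the 237 known TFs.
ppv_b_subtilis = positive_predictive_value(167, 185)
recovered_b_subtilis = recovered_fraction(167, 237)

# B. subtilis models applied to E. coli: 122 predicted, 114 in the 303 known TFs.
ppv_e_coli = positive_predictive_value(114, 122)
recovered_e_coli = recovered_fraction(114, 303)

print(f"PPV in B. subtilis: {ppv_b_subtilis:.3f}, recovered: {recovered_b_subtilis:.3f}")
print(f"PPV in E. coli:     {ppv_e_coli:.3f}, recovered: {recovered_e_coli:.3f}")
```

Both PPVs exceed 0.9, while a considerably smaller share of the E. coli repertoire is recovered with B. subtilis models than the reverse, in line with the lineage-specific expansion discussed in this section.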
These observations suggest that although this approach to iden-
tify TFs can produce high quality predictions, the limiting factor
can be the evolutionary distance because at large evolutionary dis-
tances it would be hard to trace the repertoires of TFs not only due
to the poor conservation of domains but also due to the evolution of
novel TF families as has been demonstrated in this work. However,
as the experimental knowledge about the TFs from lineage-specific
families increases it should be possible to expand the repertoires of
TFs across prokarya beyond the few model organisms that are the
focus of the study.
4. Conclusions
Based on genome analysis we defined the individual set of DNA-
binding TFs in E. coli and B. subtilis genomes and deduced thereof
the common repertoire of transcriptional regulators and regulatory
families of these species. The set of the well-conserved TFs between
the two genomes is involved in fundamental cellular processes
and could have an ancient origin. We show that TF families evolve
rapidly and expand in a lineage-specific manner to adapt to varying
environmental needs of the organisms. Similar trends have been
observed in previous comparative studies on TF families in plants
versus animals and at the level of taxa (Shiu et al., 2005; Coulson
et al., 2001). A more general perspective of lineage-specific expan-
sion of protein families and its implications on the diversification
of organisms has also been shown in eukaryotic species (Lespinet
et al., 2002). Our results show that global TFs responsible for global
regulatory mechanisms in bacteria can evolve independently in
different organisms and from totally different regulatory families,
suggesting that transcriptional regulatory machinery plays a very
important role in the speciation of organisms. We observe that
global regulators have a tendency to occur in smaller and lineage-specific
families, which might be of recent origin, indicating a source
for the innovation of novel regulatory interactions and mechanisms
across different lineages while still keeping the genetic repertoire
well conserved. Our findings show that larger TF families regulate
a disproportionately low number of genes. It is possible that these
large families function as local modules of regulation while the
smaller families act as major hubs of the Transcriptional Regulatory
Network.
It is interesting to speculate on the variation of regulatory networks
across prokaryotic organisms at three different levels (a) variation
of regulon composition due to the re-organization of genomic con-
text of genes accompanied by changes in the cis-regulatory regions
in closely related species, though preserving the mode of action
of TFs (Espinosa et al., 2005) (b) variation at the level of reper-
toire of TFs due to the need for different requirements of regulatory
machinery in different environments (Madan Babu et al., 2006) (c)
variation at the level of the regulatory mechanisms employed to
perform the same biological process, as is seen in the case of atten-
uation mechanisms replacing transcriptional regulation in some
bacteria. While the variations at the first level can be believed to
occur mostly within the same phylogenetic group/lineage and can be
treated as analogous to changes affecting a given valley between
mountains, the variations at the second and third levels could be
major reasons for the differentiation of lineages/phylogenetic groups,
analogous to differences between valleys.
Supplementary Material
Supplementary material can be accessed at: http://tikal.ccg.
unam.mx/sarath/tfevolution/.
268 S.C. Janga, E. Pérez-Rueda / Computational Biology and Chemistry 33 (2009) 261–268
Acknowledgements
We would like to thank Gabriel Moreno-Hagelsieb, Martin
Peralta-Gil and Bruno Contreras-Moreira for helpful discussions
and comments on the initial versions of this manuscript. We would
also like to thank Irma Lozada-Chávez for helping us with the
phylogenetic analysis. SCJ acknowledges support from the Medical Research
Council Laboratory of Molecular Biology and the Cambridge
Commonwealth Trust. EP-R was financed by a grant (IN-217508) from
DGAPA-UNAM, and by grants given to Lorenzo Segovia.
References
Albert, R., Jeong, H., Barabasi, A.L., 2000. Error and attack tolerance of complex
networks. Nature 406, 378–382, PMID: 10935628.
Bateman, A., Birney, E., Durbin, R., Eddy, S.R., Howe, K.L., Sonnhammer, E.L., 2000. The
Pfam protein families database. Nucleic Acids Res. 28, 263–266, PMID: 10592242.
Bentley, S.D., Chater, K.F., Cerdeño-Tárraga, A.M., Challis, G.L., Thomson, N.R., James,
K.D., Harris, D.E., Quail, M.A., Kieser, H., Harper, D., Bateman, A., Brown, S., Chandra,
G., Chen, C.W., Collins, M., Cronin, A., Fraser, A., Goble, A., Hidalgo, J., Hornsby,
T., Howarth, S., Huang, C.H., Kieser, T., Larke, L., Murphy, L., Oliver, K., O’Neil, S.,
Rabbinowitsch, E., Rajandream, M.A., Rutherford, K., Rutter, S., Seeger, K., Saun-
ders, D., Sharp, S., Squares, R., Squares, S., Taylor, K., Warren, T., Wietzorrek, A.,
Woodward, J., Barrell, B.G., Parkhill, J., Hopwood, D.A., 2002. Complete genome
sequence of the model actinomycete Streptomyces coelicolor A3(2). Nature 417,
141–147, PMID: 12000953.
Brown, J.R., Douady, C.J., Italia, M.J., Marshall, W.E., Stanhope, M.J., 2001. Univer-
sal trees based on large combined protein sequence data sets. Nat Genet 28,
281–285.
Blattner, F.R., Plunkett 3rd, G., Bloch, C.A., Perna, N.T., Burland, V., Riley, M., Collado-Vides,
J., Glasner, J.D., Rode, C.K., Mayhew, G.F., Gregor, J., Davis, N.W., Kirkpatrick,
H.A., Goeden, M.A., Rose, D.J., Mau, B., Shao, Y., 1997. The complete genome
sequence of Escherichia coli K-12. Science 277, 1453–1474.
Brune, I., Brinkrolf, K., Kalinowski, J., Puhler, A., Tauch, A., 2005. The individual and
common repertoire of DNA-binding transcriptional regulators of Corynebac-
terium glutamicum, Corynebacterium efficiens, Corynebacterium diphtheriae
and Corynebacterium jeikeium deduced from the complete genome sequences.
BMC Genomics 6, 86.
Coulson, R.M., Enright, A.J., Ouzounis, C.A., 2001. Transcription-associated protein
families are primarily taxon-specific. Bioinformatics 17, 95–97.
Erill, I., Jara, M., Salvador, N., Escribano, M., Campoy, S., Barbe, J., 2004. Differences in
LexA regulon structure among Proteobacteria through in vivo assisted compar-
ative genomics. Nucleic Acids Res 32, 6617–6626.
Espinosa, V., Gonzalez, A.D., Vasconcelos, A.T., Huerta, A.M., Collado-Vides, J., 2005.
Comparative studies of transcriptional regulation mechanisms in a group of eight
gamma-proteobacterial genomes. J Mol Biol 354, 184–199.
Fitch, W.M., 1970. Distinguishing homologous from analogous proteins. Syst Zool 19,
99–113.
Fujita, M.Q., Yoshikawa, H., Ogasawara, N., 1989. Structure of the dnaA region of
Pseudomonas putida: conservation among three bacteria, Bacillus subtilis, Escherichia
coli and P. putida. Mol Gen Genet 215, 381–387.
Gollnick, P., Babitzke, P., Antson, A., Yanofsky, C., 2005. Complexity in regulation of
tryptophan biosynthesis in Bacillus subtilis. Annu Rev Genet 39, 47–68.
Gutierrez-Preciado, A., Jensen, R.A., Yanofsky, C., Merino, E., 2005. New insights into
regulation of the tryptophan biosynthetic operon in Gram-positive bacteria.
Trends Genet 21, 432–436.
Hedges, S.B., 2002. The origin and evolution of model organisms. Nat Rev Genet 3,
838–849.
Kunst, F., Ogasawara, N., Moszer, I., Albertini, A.M., Alloni, G., Azevedo, V., Bertero,
M.G., Bessieres, P., Bolotin, A., Borchert, S., Borriss, R., Boursier, L., Brans, A.,
Braun, M., Brignell, S.C., Bron, S., Brouillet, S., Bruschi, C.V., Caldwell, B., Capuano,
V., Carter, N.M., Choi, S.K., Codani, J.J., Connerton, I.F., Danchin, A., et al., 1997.
The complete genome sequence of the gram-positive bacterium Bacillus subtilis.
Nature 390, 249–256.
Hubbard, T.J., Murzin, A.G., Brenner, S.E., Chothia, C., 1997. SCOP: a structural classi-
fication of proteins database. Nucleic Acids Res 25, 236–239.
Janga, S.C., Moreno-Hagelsieb, G., 2004. Conservation of adjacency as evidence of
paralogous operons. Nucleic Acids Res 32, 5392–5397.
Janga, S.C., Salgado, H., Martinez-Antonio, A., 2009. Transcriptional regulation shapes
the organization of genes on bacterial chromosomes. Nucleic Acids Res. 37,
3680–3688.
Kummerfeld, S.K., Teichmann, S.A., 2006. DBD: a transcription factor prediction
database. Nucleic Acids Res 34, D74–81.
Lespinet, O., Wolf, Y.I., Koonin, E.V., Aravind, L., 2002. The role of lineage-specific gene
family expansion in the evolution of eukaryotes. Genome Res 12, 1048–1059.
Liang, X., Zheng, L., Landwehr, C., Lunsford, D., Holmes, D., Ji, Y., 2005. Global
regulation of gene expression by ArlRS, a two-component signal transduction
regulatory system of Staphylococcus aureus. J Bacteriol 187, 5486–5492.
Lozada-Chavez, I., Janga, S.C., Collado-Vides, J., 2006. Bacterial regulatory networks
are extremely flexible in evolution. Nucleic Acids Res 34, 3434–3445.
Madan Babu, M., Teichmann, S.A., 2003. Evolution of transcription factors and the
gene regulatory network in Escherichia coli. Nucleic Acids Res 31, 1234–1244.
Madan Babu, M., Teichmann, S.A., Aravind, L., 2006. Evolutionary dynamics of
prokaryotic transcriptional regulatory networks. J Mol Biol 358, 614–633.
Madera, M., Vogel, C., Kummerfeld, S.K., Chothia, C., Gough, J., 2004. The SUPER-
FAMILY database in 2004: additions and improvements. Nucleic Acids Res 32,
D235–239.
Makarova, K.S., Mironov, A.A., Gelfand, M.S., 2001. Conservation of the binding
site for the arginine repressor in all bacterial lineages. Genome Biol 2 (4),
RESEARCH0013.
Makita, Y., Nakao, M., Ogasawara, N., Nakai, K., 2004. DBTBS: database of transcriptional
regulation in Bacillus subtilis and its contribution to comparative genomics.
Nucleic Acids Res 32, D75–D77.
Mao, L., Mackenzie, C., Roh, J.H., Eraso, J.M., Kaplan, S., Resat, H., 2005. Combining
microarray and genomic data to predict DNA binding motifs. Microbiology 151,
3197–3213.
Martin, R.G., Rosner, J.L., 2002. Genomics of the marA/soxS/rob regulon of Escherichia
coli: identification of directly activated promoters by application of molecular
genetics and informatics to microarray data. Mol Microbiol 44, 1611–1624.
Martinez-Antonio, A., Collado-Vides, J., 2003. Identifying global regulators in tran-
scriptional regulatory networks in bacteria. Curr Opin Microbiol 6, 482–489.
Martinez-Antonio, A., Janga, S.C., Thieffry, D., 2008. Functional organisation of
Escherichia coli transcriptional regulatory network. J Mol Biol 381, 238–247.
Merino, E., Yanofsky, C., 2005. Transcription attenuation: a highly conserved regula-
tory strategy used by bacteria. Trends Genet 21, 260–264.
Morales, G., Linares, J.F., Beloso, A., Albar, J.P., Martinez, J.L., Rojo, F., 2004. The Pseu-
domonas putida Crc global regulator controls the expression of genes from
several chromosomal catabolic pathways for aromatic compounds. J Bacteriol
186, 1337–1344.
Moreno-Campuzano, S., Janga, S.C., Perez-Rueda, E., 2006. Identification and analysis
of DNA-binding transcription factors in Bacillus subtilis and other Firmicutes—a
genomic approach. BMC Genomics 7, 147.
Moreno-Hagelsieb, G., Collado-Vides, J., 2002. A powerful non-homology method
for the prediction of operons in prokaryotes. Bioinformatics 18 (Suppl. 1),
S329–S336.
Perez-Rueda, E., Collado-Vides, J., 2000. The repertoire of DNA-binding transcrip-
tional regulators in Escherichia coli K-12. Nucleic Acids Res 28, 1838–1847.
Perez-Rueda, E., Collado-Vides, J., Segovia, L., 2004. Phylogenetic distribution of
DNA-binding transcription factors in bacteria and archaea. Comput Biol Chem 28,
341–350.
Rodionov, D.A., Mironov, A.A., Gelfand, M.S., 2002. Conservation of the biotin regu-
lon and the BirA regulatory signal in Eubacteria and Archaea. Genome Res 12,
1507–1516.
Rodionov, D.A., Vitreschak, A.G., Mironov, A.A., Gelfand, M.S., 2004. Comparative
genomics of the methionine metabolism in Gram-positive bacteria: a variety of
regulatory systems. Nucleic Acids Res 32, 3340–3353.
Salgado, H., Gama-Castro, S., Peralta-Gil, M., Diaz-Peredo, E., Sanchez-Solano, F.,
Santos-Zavaleta, A., Martinez-Flores, I., Jimenez-Jacinto, V., Bonavides-Martinez,
C., Segura-Salazar, J., Martinez-Antonio, A., Collado-Vides, J., 2006. RegulonDB
(version 5.0): Escherichia coli K-12 transcriptional regulatory network, operon
organization, and growth conditions. Nucleic Acids Res 34, D394–D397.
Shiu, S.H., Shih, M.C., Li, W.H., 2005. Transcription factor families have much higher
expansion rates in plants than in animals. Plant Physiol 139, 18–26.
Teichmann, S.A., Babu, M.M., 2004. Gene regulatory network growth by duplication.
Nat Genet 36, 492–496.
Tatusov, R.L., Fedorova, N.D., Jackson, J.D., Jacobs, A.R., Kiryutin, B., Koonin, E.V.,
Krylov, D.M., Mazumder, R., Mekhedov, S.L., Nikolskaya, A.N., Rao, B.S., Smirnov,
S., Sverdlov, A.V., Vasudevan, S., Wolf, Y.I., Yin, J.J., Natale, D.A., 2003. The
COG database: an updated version includes eukaryotes. BMC Bioinformatics
4, 41.
Tatusov, R.L., Koonin, E.V., Lipman, D.J., 1997. A genomic perspective on protein fam-
ilies. Science 278, 631–637.
Winkler, W.C., Nahvi, A., Sudarsan, N., Barrick, J.E., Breaker, R.R., 2003. An mRNA
structure that controls gene expression by binding S-adenosylmethionine. Nat
Struct Biol 10, 701–707.
Structure and organization of drug-target networks: insights from
genomic approaches for drug discovery†
Sarath Chandra Janga*a and Andreas Tzakos‡*b
Received 23rd April 2009, Accepted 12th August 2009
First published as an Advance Article on the web 4th September 2009
DOI: 10.1039/b908147j
Recent years have seen an explosion in the amount of “omics” data and the integration of several
disciplines, which has influenced all areas of life sciences including that of drug discovery. Several
lines of evidence now suggest that the traditional notion of “one drug–one protein” for one
disease does not hold any more and that treatment for most complex diseases can best be
attempted using polypharmacological approaches. In this review, we formalize the definition of a
drug-target network by decomposing it into drug, target and disease spaces and provide an
overview of our understanding in recent years about its structure and organizational principles.
We discuss advances made in developing promiscuous drugs following the paradigm of
polypharmacology and reveal their advantages over traditional drugs for targeting diseases such
as cancer. We suggest that drug-target networks can be decomposed to be studied at a variety of
levels and argue that such network-based approaches have important implications in
understanding disease phenotypes and in accelerating drug discovery. We also discuss the
potential and scope network pharmacology promises in harnessing the vast amount of data from
high-throughput approaches for therapeutic advantage.
Introduction
In living organisms, viability and functionality are accomplished
through a constant flow of information transmitted through
interactions between the basic building blocks: DNA, RNA,
proteins and small molecules. This “biological cosmos”,
represented as a global biological network, is inherently
complex yet bound by stability and equilibrium.
Any change that irreversibly distorts the equilibrium in this
network could result in pathological conditions and hence
a MRC Laboratory of Molecular Biology, Hills Road, Cambridge,
UK CB2 0QH. E-mail: sarath@mrc-lmb.cam.ac.uk;
Fax: +44-1223-213556; Tel: +44-1223-402479
b Institut de Biologie Structurale et Microbiologie, CNRS, Marseille,
France. E-mail: atzakos@cc.uoi.gr; Fax: +30-26510-97200;
Tel: +30-26510-08387
† This article is part of a Molecular BioSystems themed issue on
Computational and Systems Biology.
Sarath Chandra Janga
Sarath Chandra Janga is a
PhD student at the MRC
Laboratory of Molecular
Biology and University of
Cambridge. Sarath obtained
his Bachelors and Masters in
Bio-chemical engineering and
Biotechnology at the Indian
Institute of Technology, Delhi
in 2003. Prior to starting
his PhD, Sarath worked
extensively and co-ordinated
a number of research projects,
on transcriptional regulation,
genome organization and
comparative genomics in
bacteria, at UNAM in Mexico. He has published more than
25 research manuscripts on various aspects of prokaryotic and
eukaryotic biology in the fields of computational molecular
and systems biology. His current research interests include
understanding the design principles and constraints imposed
on post-transcriptional and post-translational gene control in
prokaryotic and eukaryotic organisms.
Andreas Tzakos
Andreas Tzakos obtained
his doctoral degree at the
Department of Chemistry,
Section of Organic Chemistry
and Biochemistry, University
of Ioannina, Greece. After
postdoctoral research at
the MRC Laboratory of
Molecular Biology and
CNRS, he is currently (since
2009) a Lecturer at the
University of Ioannina and
a member of the recently
established Human Cancer
Biobank Center, Molecular
Oncology Lab, University of
Ioannina. His current interests include decoding the molecular
mechanisms, which underlie cancer development using multi-
disciplinary approaches in the interfaces of chemistry, biology
and medicine.
‡ Current address: Department of Chemistry, Section of Organic
Chemistry and Biochemistry, University of Ioannina, Ioannina,
Gr-45110, Greece
1536 | Mol. BioSyst., 2009, 5, 1536–1548 This journal is © The Royal Society of Chemistry 2009
REVIEW www.rsc.org/molecularbiosystems | Molecular BioSystems
confer disease. It is increasingly becoming clear that for a drug
to combat a disease effectively, it should target not a single
but several building blocks in this biological network1–4 to
re-establish the equilibrium state. This emerging field is
often referred to as network pharmacology.3,4 Network
pharmacology resembles Le Chatelier’s principle applied to
biological networks, i.e. if a system (biological network) at
equilibrium (healthy state) experiences a change (disease
state), an effective drug will shift the equilibrium so as to
minimize that change (Fig. 1A). For over twenty years the
focus of drug development has been to develop highly selective
molecules. However, recent improvements in high throughput
screening methods and advances in genomics, transcriptomics,
proteomics and metabolomics have enabled us to gather large
amounts of data on drug-target interactions in several model
organisms and this picture is starting to be challenged.5–10 This
is also driven by the fact that the cost for discovery of new
drugs continues to rise disproportionately in comparison to
the approval rates despite the munificent investment in screening
technologies and genomics.11 The costs of discovering and
developing a drug are in the order of US $900 million. The
majority of this cost is incurred in the later stages of the
pharmaceutical pipeline,11–13 as can be depicted in the form
of a drug development clepsydra (Fig. 1B). Unfortunately, the
attrition rates are reported to be higher in the later phases
(Phases IIb and III) of full clinical development, where most of
the cost will have been apparently invested in the wrong
direction due to molecules that failed to pass the final
stages.12,13 The most important reasons for drugs failing in
development are inadequate efficacy or an
inability to meet safety standards, as initial screenings done
on animal models are often poorly predictive, with other factors
such as the complexity of the disease playing an important role as
well.13 Recent surveys suggest that the success rates are less
than 10%, with this figure being even more worrisome for
drugs targeting novel mechanisms, since they have higher
attrition rates.13,14 Attrition rates are not equally distributed
across different therapeutic areas and remarkable differences
have been reported,13 with oncology suffering from higher
failure rates in Phase II and III trials.13,15 Fortunately, these
figures are not irreversible and multidisciplinary research can
identify and remedy the causes of attrition. For instance, it is
clear that the rate of attrition of compounds with novel
mechanisms of action is higher than that of compounds with
precedented mechanisms of action, suggesting a need to focus on
already approved drugs for new therapeutic benefits.13 Indeed,
an emerging, promising and cost-efficient direction of drug
development that mitigates the risks of toxicity and poor efficacy is
the search for new therapeutic uses for approved drugs,
so-called drug repurposing.16 It is anticipated that decoding the
molecular pathophysiology of disease, through global under-
standing of disease heterogeneity and connection between
diseases and targets in the biological network, will eventually
lead to improved target validation. This is evident even in the
therapeutic area of oncology, where attrition rates are in the order of
82% as a whole but only 53% for the subset of
multitargeted kinase inhibitors,
emphasizing the importance of polypharmacological
approaches to drug discovery.17 All these challenges that
pharmaceutical industries are facing call for novel ideas,
approaches and methodologies for speeding up the drug
development process using our current understanding. In
particular, more effective tools will be needed to critically
analyze the information flow in the early stages of the drug
discovery and development pipeline. Network pharmacology,
which is built on the foundations of multi-target drugs, i.e.
polypharmacology, could be a strong asset in treating complex
diseases such as cancer. This paradigm shift, together with the
explosion of information from several multi-disciplinary areas
aggregated into three dimensions, has attracted great attention in
drug discovery circles. The three dimensions are Drug space
(comprising small molecules), Target space (comprising the cellular
interactome available to small molecules) and Disease
space (comprising the disease states an organism encounters). In this
review, we discuss recent advancements in Drug, Target and
Disease spaces in the context of this paradigm and propose
new research avenues in light of these recent findings.
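The decomposition into drug, target and disease spaces can be pictured as two edge lists that share the target space. The following is a minimal Python sketch under stated assumptions: every drug, target and disease name is a hypothetical placeholder rather than data from this review, and the function names are illustrative only.

```python
from collections import defaultdict

# Hypothetical toy edges; all names are placeholders, not real annotations.
DRUG_TARGET = [
    ("drug_A", "kinase_1"),
    ("drug_A", "kinase_2"),
    ("drug_B", "kinase_2"),
    ("drug_C", "receptor_1"),
]
TARGET_DISEASE = [
    ("kinase_1", "disease_X"),
    ("kinase_2", "disease_X"),
    ("kinase_2", "disease_Y"),
    ("receptor_1", "disease_Z"),
]


def adjacency(edges):
    """Map each left-hand node to the set of its right-hand neighbours."""
    adj = defaultdict(set)
    for left, right in edges:
        adj[left].add(right)
    return dict(adj)


def target_degree(drug_target):
    """Number of distinct targets per drug (a crude measure of promiscuity)."""
    return {drug: len(targets) for drug, targets in adjacency(drug_target).items()}


def drug_to_diseases(drug_target, target_disease):
    """Join drug space to disease space through the shared target space."""
    targets_of = adjacency(drug_target)
    diseases_of = adjacency(target_disease)
    linked = {}
    for drug, targets in targets_of.items():
        diseases = set()
        for target in targets:
            diseases |= diseases_of.get(target, set())
        linked[drug] = diseases
    return linked
```

In this toy network, the multi-target drug_A reaches two disease nodes through its two targets, which is the kind of relationship a drug-target network is meant to expose.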
Drug space
Sampling for biologically active compounds in the vast drug
space
Several advances in medicinal chemistry, including parallelization
and miniaturization of compound synthesis, have dramatically
increased the synthesis and screening of thousands of
compounds against a single target.18 Combinatorial chemistry
is widely used to build large libraries of many thousands of
compounds both for the identification and optimization of
lead compounds. The design of these libraries followed different
philosophies and approaches. Initial efforts to chart the global
drug space followed probabilistic approaches implementing
chemical libraries of large size and diversity.19 However, such
approaches succeeded in identifying hits but not lead
compounds.20 The reason is simple: the chemical universe is
vast and may contain 10^20–10^200 molecules.20,21 Thus,
“when trying to find a needle in a haystack, the best strategy
might not be to increase the size of the haystack”.20 Therefore,
over the years a more rational design of chemical libraries was
required for more successful optimization. This included
preserving drug-like properties and generating pharmacophore
mapping libraries that retain the attributes of
drugs with minor variations in the backbone structure.22–24
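To see why enlarging the haystack does not help, a back-of-the-envelope calculation is useful. The sketch below is only illustrative: the screening rate of one million compounds per day is an assumption chosen for the example, not a figure from the text, and the function name is hypothetical. It works in log10 units so that 10^200 does not overflow.

```python
import math


def log10_years_to_screen(log10_molecules, per_day=1e6):
    """Return log10 of the years needed to test 10**log10_molecules compounds.

    per_day is an assumed, illustrative screening throughput.
    """
    log10_days = log10_molecules - math.log10(per_day)
    return log10_days - math.log10(365.0)


# Lower and upper estimates of chemical-space size quoted in the text.
for exponent in (20, 200):
    years = log10_years_to_screen(exponent)
    print(f"10^{exponent} molecules: about 10^{years:.1f} years")
```

Even at the lower estimate, exhaustive screening at this assumed rate would take on the order of 10^11 years, which is why focused library design is needed.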
Due to the enormous size of the chemical space, a thorough
experimental exploration is not feasible and thus novel
methodologies and strategies are required to effectively and
intelligently map the sub-portion of the chemical space that is
of biological relevance. Current trends in the design strategy of
chemical libraries include high diversity in molecular properties
such as hydrophobicity and hydrogen bond donors/acceptors,
variations in the backbone and scaffold of the molecules
(skeletal diversity)25–27 and diversity in the spatial placement
of atoms in the 3D space (stereochemical diversity).26–30 These
efforts aim to sample in an effective manner the chemical and
conformational space and increase the likelihood to discover a
novel hit. Since the number of molecules is large and only
a small fraction can be potential drugs which can satisfy a
number of constraints such as bioavailability, cell membrane
permeability and non-toxicity, a number of in silico approaches
have been developed in recent years to pre-filter candidate drugs in the
early stages of their synthesis. One of the most popular is
Lipinski’s rules, a molecular property filter developed
after analysis of marketed drugs, which describes quantitatively
the cut-offs or upper limits for a number of molecular
properties such as hydrogen bond donors/acceptors, molecular
weight, rotatable bonds, solubility, etc.31,32 Recommendations
in terms of bioavailability, solubility and drug-likeness have
also been constructed on the basis of predicted physicochemical
properties from the 2D structure. Although several properties
and guidelines exist, most commercial drugs satisfy Lipinski’s,22
Veber’s,33 Bergström’s34 and Wenlock’s35 recommendations in
terms of solubility, bioavailability and drug-likeness. Prediction
of pharmacokinetic properties is also considered early on in
the drug development pipeline and chemical libraries are
screened for absorption, distribution, metabolism and excretion
properties (ADME).36,37
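A property filter of this kind is simple to express in code. The sketch below assumes the four classic rule-of-five cut-offs (molecular weight at most 500 Da, logP at most 5, at most 5 hydrogen bond donors and at most 10 acceptors) and the common convention of tolerating one violation; the descriptor values are taken as precomputed inputs, and the class and function names are hypothetical. It does not cover the rotatable-bond or solubility criteria mentioned above.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Precomputed molecular descriptors (values supplied by the caller)."""
    mol_weight: float       # in Da
    log_p: float            # octanol-water partition coefficient
    h_bond_donors: int
    h_bond_acceptors: int


def lipinski_violations(d):
    """Count how many of the four rule-of-five cut-offs are exceeded."""
    checks = [
        d.mol_weight > 500,
        d.log_p > 5,
        d.h_bond_donors > 5,
        d.h_bond_acceptors > 10,
    ]
    return sum(checks)


def passes_rule_of_five(d, max_violations=1):
    """Apply the filter; max_violations=1 is an assumed, commonly used tolerance."""
    return lipinski_violations(d) <= max_violations


# Illustrative, made-up descriptor sets.
small_ligand = Descriptors(mol_weight=300.0, log_p=2.5, h_bond_donors=2, h_bond_acceptors=5)
bulky_ligand = Descriptors(mol_weight=750.0, log_p=6.2, h_bond_donors=6, h_bond_acceptors=12)
```

Applied to a library, such a filter would be run before synthesis or screening so that only compounds within the drug-like region are carried forward.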
There are several databases that have accumulated information
on chemical molecules and drugs (see examples in Table 1).
Such databases contain judicious information on the drug
space currently available to be navigated. Several computational
algorithms have been employed to explore this chemical space
from 1D to 3D.38–41 Artificial intelligence, machine learning
and pattern recognition approaches have been recruited and
now play decisive roles in rational drug design and screening
of candidate molecules.42 There is now a growing tendency to
sculpt a drug for multiple targets since it can be a strong asset
for the treatment of numerous disorders. A common tactic is
to take as a framework a drug that is well-established for a
given disease and introduce additional functionalities to
enhance its efficacy and reduce its side-effects (see the section
on target space for a detailed discussion). The trend is to identify
more promiscuous drugs, i.e. drugs that can recognize multiple
targets, following the notion of polypharmacology. All
these observations indicate the importance of new approaches
to efficiently and intelligently navigate the vast drug space to
identify the desired multi-target drugs.
Guidelines and methods to construct functional ligand
promiscuity
Ligand promiscuity is an asset according to the paradigm of
network pharmacology. However, the goal is not a generic
transformation of compounds affecting a single node in a
disease network into compounds that non-selectively perturb
several nodes.3 On the contrary, the aim is to create functionally
promiscuous ligands that affect the ideal combination of nodes,
perturbing only the disease state so as to restore it to its natural,
undiseased state (Fig. 1A). Given the enormous
size of drug and target space, an extensive exploration seems
impossible and makes rational design of functional ligand
promiscuity rather difficult. Therefore, a more focused navigation
towards fractions or regions of the chemical space with
increased likelihood to contain biologically active and
promiscuous ligands is required. Several studies have identified
general qualitative physicochemical and structural principles
for sculpting promiscuous molecules capable of binding to
multiple targets.43–45 Generally, it has been proposed that
ligand promiscuity is favoured by molecules meeting specific
physicochemical criteria such as: (a) low molecular weight, i.e.
small size; (b) increased hydrophobicity, since such ligands are
closer to the centre of the biological charge space and are
not very sensitive to differences in the shapes of targets;44 (c)
conformational flexibility, which allows for increased binding
affinity to multiple partners but induces higher specificity
for polar and charged ligands;44 and (d) asymmetric groups,
which can also lead to increased promiscuity. Conversely, (e) increased
molecular complexity of a ligand reduces the probability of recognizing
multiple targets, as it increases the probability of a mismatch
between the ligand and the targets.46
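The complexity argument in point (e) can be illustrated with a deliberately simplified toy model, which is an assumption made here for illustration and not the model of the cited work. Suppose a ligand presents n independent features and each one fits a randomly chosen target with probability p. All n features must fit for the ligand to bind, so the match probability is p^n and falls quickly as complexity grows. The function names and the default p = 0.5 are hypothetical.

```python
def match_probability(n_features, p_feature=0.5):
    """Probability that all n independent ligand features fit a random target."""
    return p_feature ** n_features


def expected_targets_hit(n_features, n_targets, p_feature=0.5):
    """Expected number of targets recognized out of n_targets under the toy model."""
    return n_targets * match_probability(n_features, p_feature)


# Expected hits among 100 hypothetical targets as ligand complexity increases.
for n in (2, 4, 8):
    print(n, expected_targets_hit(n, n_targets=100))
```

Under these assumptions each additional feature halves the expected number of targets, which matches the qualitative statement that simpler ligands are more likely to be promiscuous.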
Several methods have been suggested for the construction
of ligands with promiscuity, with the predominant technique
Fig. 1 (A) Network pharmacology resembles Le Chatelier’s principle applied to biological networks, i.e. if a system (biological network) at equilibrium
(healthy state) experiences a change (disease state), an effective drug will shift the equilibrium so as to minimize that change. (B) The drug
development clepsydra over the different phases of drug development (I, II, III and registration). The sizes of the clepsydra correlate with the
number of tested drug candidates, the cost incurred and the attrition rate over the process of drug development. Initially, a huge number of
molecules enter the drug development pipeline, but this number shrinks over time, in contrast to the attrition rate and effective cost.
Table 1 Representative set of databases and resources for chemical molecules, drugs and their targets

Name | Description | Website
DrugBank (ref. 86) | DrugBank is a bioinformatics and cheminformatics resource that combines detailed drug data with comprehensive drug target information in several model organisms. It contains over 4800 drug entries including >1350 FDA-approved small molecule drugs, 123 FDA-approved biotech (protein/peptide) drugs, 71 nutraceuticals and >3243 experimental drugs. | http://www.drugbank.ca/
STITCH (ref. 100) | STITCH integrates information about interactions from metabolic pathways, crystal structures, binding experiments and drug-target relationships. It contains interaction information for over 68 000 different chemicals, including 2200 drugs, and connects them to 1.5 million genes across 373 genomes. | http://stitch.embl.de/
WOMBAT | WOMBAT contains over 11 000 medicinal chemistry papers, over 260 000 records with 550 000 bioactivities on more than 2200 targets. WOMBAT-PK contains 1228 approved drugs and active metabolites, totalling over 10 000 pharmacokinetic, toxicity and drug-target related endpoints, covering more than 550 drug targets. | http://www.sunsetmolecular.com/
BindingDB (ref. 120) | BindingDB is a database containing measured binding affinities, focusing on interactions between proteins considered to be drug targets and small, drug-like molecules. It contains about 28 000 small molecules with activity data (55 000 experimentally determined binding affinities) for about 600 protein targets. | http://www.bindingdb.org
KEGG DRUG (ref. 121) | KEGG DRUG is a chemical structure based information resource for all approved drugs in Japan and the USA, with many also approved in Europe. | http://www.genome.jp/kegg/drug/
ChEBI (ref. 122) | Chemical Entities of Biological Interest (ChEBI) is a dictionary of molecular entities focused on small chemical compounds. The molecular entities are either natural products or synthetic products, used to intervene in the processes of living organisms. In addition to molecular entities, ChEBI contains groups (parts of molecular entities) and classes of entities. | http://www.ebi.ac.uk/chebi/init.do
ChemDB (ref. 123) | ChemDB is a chemical database containing nearly 5 million commercially available small molecules, important for their use as synthetic building blocks, probes in systems biology and as leads for the discovery of drugs and other useful compounds. The chemical data includes predicted or experimentally determined physicochemical properties. | http://cdb.ics.uci.edu
Zinc (ref. 124) | Zinc is a database of over 8 million molecules with their 3D structure and annotated with physicochemical properties. Each molecule in the library contains vendor and purchasing information and is ready for docking using a number of popular docking programs. | http://zinc.docking.org
SuperTarget (ref. 125) | SuperTarget integrates drug-related information about medical indication areas, adverse drug effects, drug metabolism, pathways and Gene Ontology terms of the target proteins. Provides tools for 2D drug screening and sequence comparison of the targets. The database contains more than 2500 target proteins, which are annotated with about 7300 relations to 1500 drugs. | http://insilico.charite.de/supertarget/
MATADOR (ref. 125) | MATADOR is a resource for protein-chemical interactions (both direct and indirect). | http://matador.embl.de/
MMsINC (ref. 126) | MMsINC is a database of non-redundant, annotated, and biomedically relevant chemical structures. The current database contains about 4 million unique compounds. | http://mms.dsfarm.unipd.it/MMsINC/search/
ChemSpider | ChemSpider provides access to millions of chemical structures and integration to a multitude of other online services. | http://www.chemspider.com/
ChemBank (ref. 127) | ChemBank is a database that includes data derived from small molecules and small-molecule screens and resources for studying this data. It stores an increasingly varied set of measurements derived from cells and other biological assay systems treated with small molecules. Analysis tools are available that allow the relationships between small molecules, cell measurements, and cell states to be studied. It stores information on hundreds of thousands of small molecules (1 273 443) and hundreds of biomedically relevant assays (4742). | http://chembank.broad.harvard.edu
MDDR | MDDR contains over 150 000 biologically relevant compounds and well-defined derivatives. Updates add about 10 000 substances a year to the database. MDDR covers patent literature, journals, meetings and conference proceedings. | http://www.symyx.com/products/databases/bioactivity/mddr/index.jsp
PubChem (ref. 128) | PubChem focuses on the chemical, structural and biological properties of small molecules, particularly their application as diagnostic and therapeutic agents. It comprises over 19.6 million compounds with over 11 million unique structures. | http://pubchem.ncbi.nlm.nih.gov/search/search.cgi
This journal is © The Royal Society of Chemistry 2009 Mol. BioSyst., 2009, 5, 1536–1548 | 1539
being referred to as the pharmacophore combination
approach.47–52 In this approach, different pharmacophores
from an array of selective ligands are joined together. This
technique is a knowledge-based approach since it requires the
knowledge of the structure-activity relationship (SAR) of the
functionalities of the ligands to be joined.48–50,52 Different
approaches for connecting pharmacophores into single
entities have been demonstrated in the literature.47–51 Typically
pharmacophore connections are achieved via incorporation of
a cleavable or a non-cleavable conjugated spacer.47–52 Merging
of pharmacophores through cleavable conjugates resembles
drug cocktails, since the different pharmacophore-drugs
are separated after administration into two independent
moieties, commonly via plasma esterases53,54 (Fig. 2A). The
pharmacophore combination approach with linkers normally
fails to provide compounds with good oral drug-like
properties, owing to the inevitable increase in the molecular
weight and complexity of the resulting compounds. Other
approaches, which aim for a high degree of merging through overlapping55
(Fig. 2B) or integration56 (Fig. 2C) of the different pharmacophores,
may lead to smaller and simpler molecules that surmount
the problem of unfavourable physicochemical properties.
An alternative to the SAR knowledge-based approaches is
the screening of diverse or focused compound
libraries for activity on different targets.47,49 The latter
approaches are especially valuable in cases where there is a lack
of selective ligands for the targets of interest or where a combination
of unrelated receptor families is to be targeted. Once a
compound that meets a predetermined set of
requirements is identified, heavy elaboration follows to improve binding
affinity and drug-like properties. This can be performed either
through “fragment evolution”,57 where chemical functionalities
are systematically incorporated into the starting core,
or through “fragment linking”,57 which essentially follows the
approaches of pharmacophore connection mentioned above.
Approaches to generate effective promiscuous drugs
Promiscuous drugs are typically developed by employing
one of the three approaches outlined below. Firstly, in the
drug-repurposing approach, knowledge about existing old
drugs or other historical compounds available from the literature
or proprietary company sources is exploited. This involves
discovering new therapeutic uses for approved drugs.16 Drugs
have traditionally been designed to have a unidirectional character,
interacting with a single target that was relevant to the disease
of interest, and hence during the drug optimization process
very limited attention was given to properly addressing the issue
of target selectivity. One of the most interesting examples in
this direction is that of aspirin, which was originally developed
to combat arthritis but was later found to have
antipyretic, analgesic, anti-inflammatory and anti-platelet activities,
to inhibit the synthesis of prothrombin, to promote apoptosis
and to have cancer preventive effects.58 The capability of
in silico target profiling methods to identify new targets for
old drugs, as well as to alert for potential off-target effects, has
been demonstrated.58–60
Fig. 2 Examples of different pharmacophore combination approaches to design promiscuous ligands. (A) Pharmacophore connection through
cleavable ester linker for a nitric oxide-releasing derivative of aspirin (1)53 and a hydrogen sulfide-releasing derivative of diclofenac (2).54 The drugs
aspirin and diclofenac are coloured in red and blue respectively. (B) Pharmacophore merging through overlapping of pharmacophores from
antagonists of histamine receptors H1 (3), coloured in grey, and H3 (4), coloured in blue, to construct the dual H1/H3 antagonist (5). Merging was
made via the amine moieties that are common to both (3) and (4).55 (C) Pharmacophore merging through integration of pharmacophores from an
endothelin A receptor antagonist (6) and an angiotensin II receptor antagonist (7) to construct the dual endothelin A/angiotensin II receptor
antagonist (8).56 In grey and red are the pharmacophores used from compounds (6) and (7) respectively. This high pharmacophore integration was
achieved since starting compounds shared a common biphenyl core (coloured in red).
Natural products (NPs) have a dominant role in
pharmacology, since almost 60% of anticancer compounds
and 75% of drugs for infectious diseases are either natural
products or natural product derivatives, and hence form
an important source of naturally promiscuous
chemicals.61–63 An important feature exhibited by them is
their ability to interact with multiple targets and modulate
multiple signal transduction pathways. One example is
quercetin, which contributes to cancer prevention at several levels owing
to its favourable anti-mutagenic and anti-proliferative effects
and its role in the regulation of cell signaling, cell cycle
and apoptosis.64 NPs have been evolutionarily selected through
nature’s combinatorial chemistry to have chemical diversity
and to interact with multiple biological target molecules.62,65,66
In addition, natural products often resemble endogenous
metabolites or biosynthetic intermediates, and can thus favourably
exploit active transport mechanisms.67 Therefore, the
investigation of such compound collections in biochemical and
biological screens should yield high hit rates at comparably
small library sizes and will be an important source for the
identification of small molecules as multi-target compounds.65,68
From an analysis of the drug-like properties of NPs that have
been approved as drugs, it was found that they could be
divided into two equal subsets.67 The first subset follows Lipinski’s
rules and the second violates them. Interestingly, nature,
through its multiple combinatorial design efforts, succeeded
in bypassing Lipinski constraints for large compounds in terms of
molecular weight and number of rotatable bonds while maintaining
low hydrophobicity and intermolecular
H-bond donating potential.67 Unexpectedly, both subsets
had identical success rates in delivering an oral drug.67 We
should therefore take lessons from nature’s effort to design
multi-target compounds. A comparison of the molecular
property profiles of combinatorial libraries, natural products
and marketed drugs indicated a broader distribution and
increased diversity in natural products and drugs compared
to combinatorial compounds.69 Therefore, a good strategy to
increase the chemical space charted by combinatorial library
efforts would be to mimic the molecular properties and diversity
found in natural products.69
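The split of a compound collection into a Lipinski-compliant and a Lipinski-violating subset can be illustrated with a minimal sketch. It assumes that the four classic rule-of-five descriptors have already been computed elsewhere; the `Descriptors` container, the function names and the tolerance of one violation are illustrative assumptions rather than the procedure of the cited study, and the rotatable-bond count mentioned above is not part of the four thresholds checked here.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Descriptors:
    """Precomputed molecular descriptors (hypothetical container)."""
    mol_weight: float      # molecular weight in Da
    log_p: float           # octanol-water partition coefficient
    h_bond_donors: int     # number of hydrogen-bond donors
    h_bond_acceptors: int  # number of hydrogen-bond acceptors


def lipinski_violations(d: Descriptors) -> int:
    """Count how many of the four rule-of-five thresholds are exceeded."""
    checks = [
        d.mol_weight > 500,
        d.log_p > 5,
        d.h_bond_donors > 5,
        d.h_bond_acceptors > 10,
    ]
    return sum(checks)


def split_by_lipinski(
    library: Dict[str, Descriptors], max_violations: int = 1
) -> Tuple[List[str], List[str]]:
    """Split a library into compliant and violating compound names.

    max_violations is an assumed tolerance; one violation is commonly allowed.
    """
    compliant, violating = [], []
    for name, d in library.items():
        if lipinski_violations(d) <= max_violations:
            compliant.append(name)
        else:
            violating.append(name)
    return compliant, violating
```

Applied to the approved natural products discussed above, the two returned lists would correspond to the two subsets whose oral-drug success rates were compared.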
Yet another approach to increasing promiscuity
is to create drug cocktails. Although one of the major
disadvantages of a drug cocktail compared to a single multi-target
agent is the risk of drug–drug interactions,70 there is an
increasing interest in developing drugs which can bypass these
issues. Such an approach becomes especially appealing
for evolving targets that generate mutant forms which escape
drug interaction, as it requires the consideration of more than
just one drug binding tightly to the existing target molecule.
Owing to the potential for increased toxicity with each additional
drug in the cocktail and possible cross interactions, the aim should be
the smallest possible cocktail that effectively covers
the different ensembles of the evolving target. To that end,
attention has been given to the development of theoretical
methods to design optimal drug cocktails for targeting
molecular ensembles.71
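Choosing the smallest cocktail that covers every form of an evolving target can be framed as a set-cover problem. The sketch below uses the standard greedy heuristic, which is not guaranteed to find the smallest cocktail and is not the theoretical method of the cited work; the function name and the binding data passed to it are hypothetical.

```python
def greedy_cocktail(coverage, variants):
    """Pick drugs greedily until every target variant is bound by some drug.

    coverage: dict mapping drug name -> set of target variants it binds
    variants: iterable of target variants (e.g. wild type and mutants)
    """
    uncovered = set(variants)
    cocktail = []
    while uncovered:
        # Choose the drug binding the most still-uncovered variants;
        # iterating in sorted order breaks ties alphabetically.
        best = max(
            sorted(coverage),
            key=lambda drug: len(coverage[drug] & uncovered),
            default=None,
        )
        gained = coverage[best] & uncovered if best is not None else set()
        if not gained:
            raise ValueError(f"no drug binds: {sorted(uncovered)}")
        cocktail.append(best)
        uncovered -= gained
    return cocktail
```

Toxicity and cross interactions, which the text names as the reason for keeping the cocktail small, are not modelled here; they could enter as weights on each drug.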
Target space
Major components of the target space
Organisms respond to continuous variations in internal and
external cellular conditions by orchestrating their responses
depending on the environmental challenges they are faced
with. This involves the usage of a complex network of
interactions among different proteins, RNA, metabolites and
several other cellular entities, which undergo rewiring when
perturbed by chemicals or drugs from the drug space
(Fig. 3A). The interactions between different chemicals
and cellular entities can be represented in the form of a
network, the so-called Drug-Target network. Recent years
have seen the development of a number of approaches, both
computational and experimental, for the identification and
elucidation of the molecular targets of a drug on a genomic
scale.72–81 This cellular target space, which contains the targets
of drugs, can be considered to comprise predominantly
three components, namely protein–protein, metabolic and
transcriptional interactions (Fig. 3B). While the vast majority
of drugs target the protein–protein and metabolic
components, only a limited number of targets have been identified
to date for the transcriptional pool.10,82–85 Indeed, the most
common therapeutic targets for established drugs belong to
either protein kinase or receptor families, with enzymes and ion
channels forming the second most predominant class of
targets.86 This explains the increased attention
towards understanding the biophysics of protein–protein
contacts in the context of drug targets, as these protein classes
are major players in protein–protein interactions.87
An interesting possibility for systematic target identification
is that the structures of biological networks may actually
provide valuable information in assessing targets and their
combinations. In recent years, it has been appreciated
that many effective drugs in therapeutic areas as diverse as
oncology, psychiatry and infectious diseases act on multiple
rather than single targets.6 Indeed, this has been confirmed by
network analysis of drug–target interactions, which
found not only that drugs commonly act on multiple targets
but also that drug targets are often involved in multiple diseases,
with over 40% of the drug targets that map onto disease genes being
involved in more than one disease.10 This observation is
further strengthened by an independent analysis of
the genetic origins of diseases using the OMIM database,88
in which the authors found that of the 1284 disorders documented in
OMIM, nearly 70% share at least one gene with another
disorder.82 Taken together, studies employing network
approaches reveal that in most cases exquisitely selective
compounds may exhibit lower than needed efficacy for the
treatment of disorders, and that compounds that selectively act
on two or more targets of interest might be more efficacious
than single-target agents. This rules out the assumption of one
drug for one target in a disease, which significantly
influenced the drug discovery pipeline for more than a decade
before the advent of genomics.
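A statistic such as the share of disorders that have at least one gene in common with another disorder can be computed directly from a disorder-to-gene mapping. The following is a minimal sketch; the function name and the input format are assumptions, and no OMIM data are reproduced here.

```python
from collections import defaultdict


def fraction_sharing_genes(disorder_genes):
    """Fraction of disorders sharing at least one gene with another disorder.

    disorder_genes: dict mapping disorder name -> set of associated genes
    """
    # Invert the mapping so each gene lists the disorders it is linked to.
    gene_to_disorders = defaultdict(set)
    for disorder, genes in disorder_genes.items():
        for gene in genes:
            gene_to_disorders[gene].add(disorder)
    # A disorder "shares" if any of its genes is linked to a second disorder.
    sharing = {
        disorder
        for disorders in gene_to_disorders.values()
        if len(disorders) > 1
        for disorder in disorders
    }
    return len(sharing) / len(disorder_genes) if disorder_genes else 0.0
```

The same inversion, applied to a target-to-disease mapping, gives the fraction of drug targets involved in more than one disease.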
An additional insight gained from network-based approaches
in pharmacology is that, although disease genes play central roles
in the protein–protein interaction networks of the target space,
the vast majority of disease genes are nonessential and show no
tendency to encode hub proteins, and their expression pattern
indicates that they are localized in the functional periphery of the
network. This is in contrast to essential genes, which are more
likely to encode hub proteins in the protein interactome,
have higher transcript levels and are expressed widely in most
tissues.82,89 These studies also show that genes with intermediate
connectivities are likely to harbor germ-line disease mutations,
suggesting that disease genes tend to occupy an intermediate
niche in terms of their physiological and cellular importance.89
Likewise, analysis of the protein interactions of drug
targets suggested that they have more interactions than expected
by chance but fewer than observed for essential genes,
reinforcing the trend observed for disease genes.10
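The comparison of connectivity between gene classes (essential genes, disease genes, drug targets) reduces to computing node degrees in the interactome and averaging them per class. The sketch below assumes an undirected interaction list; the function names are illustrative, and treating genes absent from the network as degree 0 is an assumption that may or may not match the cited analyses.

```python
from collections import defaultdict


def degrees(edges):
    """Return node -> degree for an undirected list of (a, b) interactions."""
    neighbours = defaultdict(set)
    for a, b in edges:
        if a != b:  # ignore self-interactions
            neighbours[a].add(b)
            neighbours[b].add(a)
    return {node: len(nbrs) for node, nbrs in neighbours.items()}


def mean_degree(deg, genes):
    """Mean degree of a gene class; genes absent from the network count as 0."""
    genes = list(genes)
    if not genes:
        return 0.0
    return sum(deg.get(g, 0) for g in genes) / len(genes)
```

Calling `mean_degree` once per class on the same degree table yields the quantities being compared; assessing whether a difference exceeds chance would additionally require a randomisation test, which is not shown.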
Different knowledge-based approaches have also been
employed to prioritize drug targets based on existing
datasets of protein interactomes. For instance, Lage et al.83
constructed a phenome–interactome network comprising gene
products implicated in many different categories of human
disease, which permitted them to identify previously unknown
complexes likely to be associated with disease by using a
phenotype similarity score. Others exploited the topological
features of the protein interactome for predicting novel disease
genes.90 Similarly, the notion of polypharmacology has
influenced computational approaches that associate targets with
well-established drugs based on similarity in protein sequence
space.81 Polypharmacology has also influenced experimental
high-throughput screening, wherein attempts have
been made to understand the relationship between the similarity
of ligands and their targets when they belong to one or more
different sequence clusters.59,80
While much of mainstream research has focused on the
protein–protein interaction component of the target space, few
attempts have been made to exploit the therapeutic potential of the
metabolic and transcriptional components. Many reasons have
been put forward for the slow pace in targeting these components,
including the challenges involved in manipulating them with
ligands and the lack of experimental protocols for studying their
impact when perturbed. For instance, transcription factors
(TFs) play a major role in many human diseases including
cancer, inflammatory and heart diseases; however, very few TFs,
such as those containing ligand binding domains, could
be successfully exploited.84 Nevertheless, new
approaches, such as those which can directly block transcription
factor dimerization or those which can indirectly target specific
DNA and DNA decoys, are being increasingly explored.84
Fig. 3 Different components of the drug-target network. (A) Drug space (marked with the outer box) consists of the small-molecules which can
potentially bind entities within the cell (marked as drug targets in red spheres). In turn cellular interactions between different components (marked
with red spheres and green circles) form the cellular interactome comprising the target space. (B) Target space comprises different components
namely protein–protein interactions, metabolic pathways and transcriptional circuits which together form the biological network or the cellular
interactome.
Methods for identifying drug targets
The availability of genome sequences, together with technologies
for high-throughput screening of chemicals on a large scale, has
revolutionized our ability to understand the biological activity
of novel ligands and to identify their targets in short time
periods. These methods can be broadly classified into genetics-based,
proteomics-based and knowledge-driven approaches
(see Table 2). Genetics-based approaches typically involve the
use of mutant libraries covering a large set of genes, generated either by
exploiting the RNA interference pathways in mammalian
systems or by knocking out the genes. The mutants are exposed to
drugs at different concentrations to study resistance,
measured as the fitness of a strain of interest with respect to
the wild type.76,77 Improvements in these approaches include
bar-coding of deletion strains with unique DNA sequences,
which enables parallel genetic screens of a large number of
drugs.78,91 Alternatives to the mutant-based approaches
include forward chemical genetics techniques, which typically
involve screening of small molecule libraries for their ability
to induce a particular phenotype in cells or cellular extracts. In
these approaches, instead of deleting or impairing protein
function at the genetic level, small molecules generally act by
inhibiting (or activating) a particular protein or set of proteins
directly. Tracing the inhibitor (or activator) back to its target
protein can, in principle, provide a causal link between the
target and its associated phenotype.92
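In the mutant-based screens described above, drug sensitivity is scored as the fitness of each strain relative to the wild type under the same treatment. One simple way to express this, offered here as an assumption rather than the scoring used in the cited studies, is a log2 ratio of growth with and without drug, normalised to the wild type; the function names and the threshold are hypothetical.

```python
import math


def fitness_defect(mut_drug, mut_ctrl, wt_drug, wt_ctrl):
    """log2 of the mutant's relative growth under drug, normalised to wild type.

    Negative values mean the mutant is more drug-sensitive than the wild type.
    All four growth measurements must be positive.
    """
    return math.log2((mut_drug / mut_ctrl) / (wt_drug / wt_ctrl))


def rank_candidate_targets(measurements, threshold=-1.0):
    """Return strains scoring at or below `threshold`, most sensitive first.

    measurements: dict mapping strain -> (mut_drug, mut_ctrl, wt_drug, wt_ctrl)
    """
    scores = {strain: fitness_defect(*m) for strain, m in measurements.items()}
    hits = [strain for strain in scores if scores[strain] <= threshold]
    return sorted(hits, key=lambda strain: scores[strain])
```

Strains at the top of the ranking carry deletions that sensitise the cell to the compound and are therefore candidates for the drug's target or its pathway; replicate measurements and a statistical cut-off would be needed in practice.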
Proteomic methods for drug-target identification can be
classified into three major categories: (a) based on affinity
chromatography, (b) based on small-molecule or protein
microarrays, (c) based on active-site profiling of the proteins
of interest (reviewed in ref. 93). In affinity-based methods,
a small molecule is typically immobilized via a functional
group onto a solid support, followed by the addition of a
protein extract. This is followed by a series of washing steps to
finally isolate and identify proteins which remain in the
column owing to their affinity for the small molecule.94 Protein
microarray based methods comprise recombinant protein
molecules or antibodies immobilized on the surface of a
substrate material like glass or silicon, which is then exposed
to labeled small molecules, and the binding on the chip
is monitored.95 Antibody microarrays form a variant which can
be useful to study ligands which bind to low abundance
proteins.96 Chemical microarrays form a promising class of
methods when the goal is to screen a large number of
small-molecules against a selected set of proteins.97 In activity-based
profiling methods, active-site directed chemical probes
consisting of two components, a moiety which covalently binds
to the active site of an enzyme and a reporter tag for tracking the
modified proteins, are used to measure binding.98 More
recently, there has been increasing interest in using metabolic
approaches on a systems scale for identifying drug targets, owing
to improvements in mass spectrometry techniques.99
Knowledge-driven approaches comprise both literature
derived data sources and informatics methods which employ
our current understanding of the design principles of drug space,
target space or an integration of them.72,73,81,100–102 These
include, but are not limited to, the use of structure-activity
relationships, network-based and genomics-based methods, pathway
analysis and the integration of data sources72,73,80,81,103 (briefly
summarized in Table 2). One of the major limitations of these
approaches is that they are often only incremental and
cannot predict counterintuitive or unexpected outcomes.
Identifying off-target effects in the target space
One of the significant outcomes of the notion of polypharmacology
is that the promiscuity of drugs can be of
therapeutic advantage if the drug under investigation can
perturb the relevant genes in the disease state so as to bring it to
equilibrium. For instance, if the target is a rapidly mutating
agent, a drug that is too specific will quickly lose its efficacy by
not binding well to functional mutants. Therefore, in molecular
design, it is crucial in such cases to tailor the binding specificity of a drug
to all the functional mutants to improve its
efficacy.44 The ideal situation would be to design promiscuous
drugs that are directed towards a desirable multi-target space;
however, this is not straightforward for protein families such as
kinases that are structurally divergent yet need to be
targeted by a single drug.45 Recently, Apsel et al.80 reported
the discovery of dual inhibitors of tyrosine and phosphoinositide
kinases, which form intensely pursued cancer drug target
families. Although tyrosine and phosphoinositide kinases lack
significant sequence similarity, they share several short motifs.
Through iterative chemical synthesis, X-ray crystallography
and kinome-level biochemical profiling, they were able to
develop molecules that adopted dual selectivity to the hydrophobic
pocket conserved in both enzyme classes. Other
approaches employed the phenotypic side-effects of marketed
drugs to cluster drugs which exhibited similar therapeutic
indications, as a means of identifying potential new drug
targets for existing drugs, thereby extending the applicability
of existing drugs to less explored disease phenotypes.104
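Grouping drugs by their phenotypic side-effects requires a similarity measure between side-effect profiles. The sketch below uses the plain Jaccard index as a simple stand-in; the cited work may weight or normalise side-effects differently, and the function names, threshold and input format are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_drug_pairs(side_effects, min_similarity=0.5):
    """Return (drug1, drug2, score) for pairs with overlapping side-effect profiles.

    side_effects: dict mapping drug name -> set of reported side-effects
    """
    pairs = []
    for d1, d2 in combinations(sorted(side_effects), 2):
        score = jaccard(side_effects[d1], side_effects[d2])
        if score >= min_similarity:
            pairs.append((d1, d2, score))
    return pairs
```

Pairs that score highly despite having no known target in common are the interesting cases, since they suggest a shared, as yet unannotated target.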
In addition to using off-target effects for therapeutic benefit,
recent advances have also involved the use of computational
and experimental means to identify them early on.60,105
For instance, Ericson et al.105 employed chemo-genomic
screening to identify the off-target effects of 81 psychoactive
drugs in yeast. The overall finding was that most drugs
showed a propensity to affect multiple cellular functions,
ranging from secretion and protein folding to chromatin remodeling,
suggesting the utility of model organism pharmacogenetic
studies in providing a rational foundation for studying the
off-target effects of clinically important drugs.
Disease space
Decomposing drug-target associations using network-based approaches
An important notion that has emerged in post-genomic drug
discovery is that the large-scale integration of genomic,
proteomic, signalling and metabolomic data can allow us to
construct complex networks of the cell that would provide us
with a new framework for understanding the molecular basis
of physiological or pathophysiological states.106 Such an
integrated view has important implications in improving our
understanding of the disease phenotypes by viewing them as
perturbations in a complex system rather than effects on a
selective set of proteins. Using such a framework, network-
based drug discovery aims to harness this knowledge to
investigate and understand the impact of interventions, such
as candidate drugs, on the molecular networks that define
different states and therefore can significantly complement the
existing drug discovery pipelines.
Table 2 Different methods available for identifying drug-targets on a genomic scale. Methods can be broadly classified into proteomics-based, genetics-based and knowledge-driven
Proteomics-based methods | Description
Activity based protein profiling (ABPP)129 | This is a functional proteomic technology that uses chemical probes that react with mechanistically related classes of enzymes. The basic unit of ABPP is a probe that typically consists of a reactive group (electrophile or a photoreactive group) that covalently binds to the active site of an enzyme (nucleophilic residue) and a tag. The tag can either be a reporter (i.e. fluorophore, radioactive group) or a handle (i.e. affinity tags such as biotin). A tag-free strategy for activity-based protein profiling has also been introduced that utilizes the copper(I)-catalyzed azide-alkyne cycloaddition reaction (click chemistry) and gives the advantage of not interfering with biological activity or binding affinities of the probes. The activity-based protein profiling and multidimensional protein identification technologies (ABPP-MudPIT) can provide profiling of inhibitor selectivity, as the potency of an inhibitor can be tested against hundreds of targets simultaneously.130
Affinity chromatography94 | This is a protein separation method based on the interaction between target proteins and specific immobilized ligands. Traditionally, the ligand is tethered on a solid support via a spacer arm followed by the addition of a cellular lysate or tissue extract. Only target proteins binding tightly to the ligand are selectively purified, eluted off (denaturation or competition with free ligand) and subsequently identified by mass spectroscopy. To minimize the identification of nonspecifically bound proteins, the protein profile that is obtained with an inactive ligand analogue is also determined and compared with the relevant profile, determined with the desired analogue. More recently, an improved method for the identification of proteins that can bind to small-molecules and drugs has been established which uses quantitative mass spectrometry (MS)-based proteomics (utilizing stable isotope labeling with amino acids in cell culture (SILAC)) and affinity chromatography.131
Microarrays95–97,132 | Microarrays in drug target discovery provide miniaturized high-throughput tools to study binding of specific molecules to immobilized proteins or small molecules. In protein microarrays, different recombinant proteins or antibodies that are immobilized on a solid substrate are exposed to a drug solution to identify the target protein(s) which can bind to the small molecule. In chemical microarrays, immobilized drug compounds can be screened for candidate drug–target interactions with purified proteins.97 When the target protein is known, small molecule arrays can be also used to identify off-target interactions that could have implications for side-effects.
Genetics-based methods | Description
Synthetic lethality/Gene knock-out76,78 | Single gene knock-out strains on a genomic scale or for a selected set are exposed to small molecules at different concentrations to evaluate the fitness defects and fitness levels are compared to wild-type populations exposed to the same conditions. This provides an easy means to identify targets on a large scale.76,78
RNAi | RNA interference pathways in mammalian systems are used for silencing genes and similar approaches as above are employed to study the fitness defects of cell lines to identify potential drug targets in higher eukaryotes77,133
Forward chemical genetics92 | Unlike the use of mutants in previous approaches, small molecules are screened for their ability to induce a particular phenotype in cells or cellular extracts. Instead of deleting or impairing protein function at the genetic level, as in classical genetics, small molecules generally act by inhibiting (or activating) a particular protein or set of proteins directly. Tracing the inhibitor (or activator) back to its target protein can, in principle, provide a causal link between the target and its associated phenotype. Forward chemical genetics requires three components: one, a collection or ‘library’ of compounds; two, a biological assay with a quantifiable phenotypic output; and three, a strategy for identifying the target(s) of active compounds.92,134,135
Knowledge-driven approaches | Description
Literature derived interactions10,103,136,137 | In these approaches, manually curated set of interactions are obtained from the literature to generate high confidence set of drug-target relationships to either study their overall structure10 or focus on specific disease of interest.10,103,136,137
Network-based approaches3,80 | In these approaches, literature derived interactions are exploited to predict new interactions based on the principles governing the structure of the networks, so that new disease targets are identified using comparative genomics or other informatics-based methods, followed possibly by experiments to improve the chemicals.3,80
In silico chemogenomics41 | In predictive chemogenomics one predicts relationships between genes/proteins and compounds. In silico approaches that are used can be classified into ligand-based approaches (ligand comparison for target prediction), target-based approaches (target comparison for ligand prediction) or ligand-target based approaches41
In such a framework, the most basic level is the
Drug-Target (DT) network, which consists of a directed
graph connecting the drugs in the drug space to the
targets they can bind in the target space (Fig. 4). Such a
network has been shown to form a bipartite graph with a giant
connected component, suggesting the involvement of most
drugs in targeting more than one target protein.9,10 At the
second level, one can visualize the decomposition of the DT
network into the Drug-Drug (DD) network or Target-Target
(TT) network (Fig. 4). The former is typically constructed by
linking two drugs if they share a significant number of targets,
while the latter comprises associations or links between
targets which are targeted by the same set of drugs. As
expected, raising the level of significance required in the number of
shared targets104 or drugs would improve the quality of the
downstream polypharmacological network. Initial analysis
decomposing the experimentally validated DT network for
FDA-approved drugs into DD or TT networks confirmed the
tendency of the pharmaceutical industry to target already
experimentally validated proteins, leading to an increase in the
number of follow-on drugs.10,104 The authors also found that
a number of drugs in the DD network grouped into distinct
clusters, consistent with their similarity according to the
Anatomical Therapeutic Chemical (ATC) classification.
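The decomposition of the DT network into DD and TT networks is a projection of a bipartite graph onto each of its two node sets. A minimal sketch is given below; the plain count threshold `min_shared` is a crude stand-in for the statistical significance of sharing mentioned in the text, and the function name and edge format are assumptions.

```python
from collections import defaultdict
from itertools import combinations


def project(dt_edges, min_shared=1):
    """Project (drug, target) pairs onto Drug-Drug and Target-Target networks.

    Two drugs are linked if they share at least `min_shared` targets; two
    targets are linked if they share at least `min_shared` drugs. Returns
    (dd, tt), each a dict mapping a sorted node pair to the shared count.
    """
    targets_of = defaultdict(set)
    drugs_of = defaultdict(set)
    for drug, target in dt_edges:
        targets_of[drug].add(target)
        drugs_of[target].add(drug)

    def shared(partners):
        links = {}
        for a, b in combinations(sorted(partners), 2):
            n = len(partners[a] & partners[b])
            if n >= min_shared:
                links[(a, b)] = n
        return links

    return shared(targets_of), shared(drugs_of)
```

Raising `min_shared` prunes weak links from both projections, which mirrors the point above that a stricter sharing criterion improves the quality of the downstream network.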
At the third level in this framework, one can connect diseases
and the therapies associated with them using the already known
network of DD or TT interactions. This typically involves
mapping the known disease phenotype of a drug or a
target onto the DD or TT network, respectively, to generate a
network of disease associations (Fig. 4). Recent attempts to
generate such disease association or common therapy based
networks starting from drug–drug interactions revealed that
the average path length between drugs is shorter than three
steps, reinforcing the polypharmacology notion from the
perspective of drugs, i.e., most chemicals might share
their phenotypic effects to a significant extent when perturbing
the target space.107 An alternative explanation for these
observations is that most drugs currently employed are
follow-ons built on existing knowledge about previously
established drugs, suggesting that there is enormous potential
in the drug space for identifying bioactive compounds with
novel targets in the target space, which is yet to be
exploited. Future studies in this direction can address questions
on the link between different diseases and the role of multi-target
drugs in a range of related disorders to pinpoint the
basis for some of these observations.
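The average path length quoted above is the mean number of steps on the shortest route between pairs of drugs in the network. For an unweighted network it can be obtained by breadth-first search from every node, as in the sketch below; the function name and adjacency format are assumptions, and pairs of drugs with no connecting path are simply excluded from the average.

```python
from collections import deque


def average_path_length(adjacency):
    """Mean shortest-path length over all connected ordered node pairs.

    adjacency: dict mapping node -> set of neighbouring nodes; it should be
    symmetric (undirected) and list every node as a key.
    """
    total, pairs = 0, 0
    for source in adjacency:
        # Breadth-first search gives shortest distances in an unweighted graph.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nbr in adjacency.get(node, ()):
                if nbr not in dist:
                    dist[nbr] = dist[node] + 1
                    queue.append(nbr)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0
```

A value below three on a drug network would indicate, as stated above, that most drugs are separated from one another by only one or two intermediate drugs.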
Exploiting disease networks to study disease associations
Independently drug–drug relations can also be constructed by
obtaining only the structural similarity or prior phenotypic
characteristics of well-exploited drugs in contrast to the
network-based approaches described above. Likewise other
variants such as a Target-Target (TT) network in which
proteins documented in literature to be involved in the same
disease can also be constructed to study relationships between
different entities in the target space or diseases. For instance,
recent studies used the phenotypic information for several
disease associated genes available in the OMIM database88
to construct networks of disease associations.82,89 These
studies unambiguously demonstrated that network properties
of genes in such networks influence the likelihood and
phenotypic consequences of disease mutations, with genes
exhibiting intermediate connectivities having the highest
probability of harboring germ-line disease mutations, suggesting
that disease genes tend to occupy an intermediate niche in
terms of their physiological and cellular importance. In addition,
disease genes were found to show significant functional
clustering in the studied network suggesting the existence of
disorder-specific functional modules.89
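As a rough illustration of how connectivity in such a disease-association network can be measured, the sketch below links genes that are annotated to the same disorder and counts each gene's neighbours. The gene and disorder names are hypothetical placeholders rather than OMIM entries, and the linking rule is an assumption made for illustration, not the procedure of the cited studies.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical gene -> disorders annotations (illustrative only).
gene_disorders = {
    "geneA": {"disorder1", "disorder2"},
    "geneB": {"disorder1"},
    "geneC": {"disorder2", "disorder3"},
    "geneD": {"disorder3"},
    "geneE": {"disorder4"},
}

def disease_gene_network(gene_disorders):
    """Link two genes whenever they are annotated to a common disorder."""
    genes_by_disorder = defaultdict(set)
    for gene, disorders in gene_disorders.items():
        for disorder in disorders:
            genes_by_disorder[disorder].add(gene)
    graph = {gene: set() for gene in gene_disorders}
    for genes in genes_by_disorder.values():
        for a, b in combinations(sorted(genes), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph

def degrees(graph):
    """Connectivity (number of neighbours) of every gene in the network."""
    return {gene: len(neighbours) for gene, neighbours in graph.items()}

network = disease_gene_network(gene_disorders)
print(degrees(network))
```

With real annotations, the resulting degree distribution is the kind of quantity that could be compared against the frequency of disease mutations per gene.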
Conclusion
Data completeness in the drug, disease and target space is
crucial to our understanding of Drug-Target networks.
Thus, efforts should be directed to systematically illuminate
the drug interactome.108,109 Nevertheless, advances in genomics
have influenced the way we understand the action of drugs on
a genomic scale.110–112 One of the challenging aspects of drug
discovery is the evolution of drug-resistant strains emerging in
several human diseases such as malaria, tuberculosis and cystic
fibrosis. Systematic genomic screens have improved the way
we understand combinatorial drug chemistry113,114 and
would enable us to design “hyper-antagonistic” drugs which
can fight drug-resistant strains by working as drug cocktails
when the agents have acquired resistance to individual
drugs.110 Indeed, recent studies show that although synergistically
acting drugs, which are commonly used in clinical settings,
might favor immediate efficacy, they might also favor evolution
of resistance compared to antagonistic combinations, thereby
indicating a need to develop antagonistic combinations,
which might be effective in combating diseases in multi-drug
chemotherapy.115,116
Fig. 4 Decomposition of the Drug-Target (DT) network. The drug-target
network is a directed graph with interactions running from drugs
to targets. Such a directed network can be exploited to study associations
between drugs or targets. The former consists of linking two drugs
(DD network) when they share a significant number of targets, while the
latter involves linking target proteins which share a significant number
of drugs (TT network). Both these decompositions have yielded
significant insights in recent years, in particular into the
polypharmacological nature of drugs. One can study disease
associations using either DD or TT networks by overlaying the
phenotypic knowledge accumulated in databases for either drugs or
targets. Such integrated approaches not only provide insights into
relationships between different diseases but also allow existing
drugs to be applied to less explored disease phenotypes in the paradigm
of network pharmacology.
This journal is © The Royal Society of Chemistry 2009 Mol. BioSyst., 2009, 5, 1536–1548 | 1545
An emerging view of polypharmacology in the post-genomic
era is that drug, target and disease spaces can be correlated
to study the effect of drugs on different spaces and their
interrelationships can be exploited for designing drugs or
cocktails which can effectively target one or more disease
states (Fig. 5). According to such a view, systems-level
understanding of the cell by integration of data from different
omics platforms can be of unprecedented value, as it not only
improves our understanding of pathophysiological states
but also of their relationship to different chemicals,
thereby enabling useful interpretation of existing data and
accelerating hypothesis generation for testing in disease
models.117–119
Acknowledgements
SCJ acknowledges financial support from the MRC Laboratory
of Molecular Biology and Cambridge Commonwealth Trust.
We thank Briasoulis E., De S., Lang B., Mittal N., Perica T.,
Venkatakrishnan A. J., Wuster A. and Michnick S. for
critically reading the manuscript and providing helpful
comments. We apologize to colleagues whose relevant work
could not be cited due to lack of space.
References
1 P. Csermely, V. Agoston and S. Pongor, Trends Pharmacol. Sci.,
2005, 26, 178–182.
2 Y. Chen, J. Zhu, P. Y. Lum, X. Yang, S. Pinto, D. J. MacNeil,
C. Zhang, J. Lamb, S. Edwards, S. K. Sieberts, A. Leonardson,
L. W. Castellini, S. Wang, M. F. Champy and B. Zhang, et al.,
Nature, 2008, 452, 429–435.
3 A. L. Hopkins, Nat. Chem. Biol., 2008, 4, 682–690.
4 A. L. Hopkins, Nat. Biotechnol., 2007, 25, 1110–1111.
5 C. G. Wermuth, Drug Discovery Today, 2004, 9, 826–827.
6 B. L. Roth, D. J. Sheffler and W. K. Kroeze, Nat. Rev. Drug
Discovery, 2004, 3, 353–359.
7 G. R. Zimmermann, J. Lehar and C. T. Keith, Drug Discovery
Today, 2007, 12, 34–42.
8 A. Petrelli and S. Giordano, Curr. Med. Chem., 2008, 15,
422–432.
9 A. Ma’ayan, S. L. Jenkins, J. Goldfarb and R. Iyengar, Mt. Sinai.
J. Med., 2007, 74, 27–32.
10 M. A. Yildirim, K. I. Goh, M. E. Cusick, A. L. Barabasi and
M. Vidal, Nat. Biotechnol., 2007, 25, 1119–1126.
11 J. A. DiMasi, R. W. Hansen and H. G. Grabowski, J. Health
Econ., 2003, 22, 151–185.
12 J. A. DiMasi, Pharmacoeconomics, 2002, 20(supplement 3), 1–10.
13 I. Kola and J. Landis, Nat. Rev. Drug Discovery, 2004, 3,
711–715.
14 P. Ma and R. Zemmel, Nat. Rev. Drug Discovery, 2002, 1,
571–572.
15 I. Walker and H. Newell, Nat. Rev. Drug Discovery, 2009, 8,
15–16.
16 K. A. O’Connor and B. L. Roth, Nat. Rev. Drug Discovery, 2005,
4, 1005–1014.
17 B. Apsel, J. A. Blair, B. Gonzalez, T. M. Nazif, M. E. Feldman,
B. Aizenstein, R. Hoffman, R. L. Williams, K. M. Shokat and
Z. A. Knight, Nat. Chem. Biol., 2008, 4, 691–699.
18 J. Inglese, R. L. Johnson, A. Simeonov, M. H. Xia, W. Zheng,
C. P. Austin and D. S. Auld, Nat. Chem. Biol., 2007, 3, 466–479.
19 J. Alper, Science, 1994, 264, 1399–1401.
20 R. Lahana, Drug Discovery Today, 1999, 4, 447–448.
21 P. D. Leeson and B. Springthorpe, Nat. Rev. Drug Discovery,
2007, 6, 881–890.
22 C. A. Lipinski, F. Lombardo, B. W. Dominy and P. J. Feeney,
Adv. Drug Delivery Rev., 2001, 46, 3–26.
23 J. Sadowski and H. Kubinyi, J. Med. Chem., 1998, 41, 3325–3329.
24 A. Ajay, W. P. Walters and M. A. Murcko, J. Med. Chem., 1998,
41, 3314–3324.
25 M. Soural, I. Bouillon and V. Krchnak, J. Comb. Chem., 2008, 10,
923–933.
26 W. M. Dai and J. Y. Shi, Comb. Chem. High Throughput Screening,
2007, 10, 837–856.
27 M. D. Burke and S. L. Schreiber, Angew. Chem., Int. Ed., 2004,
43, 46–58.
28 M. Peuchmaur and Y. S. Wong, Comb. Chem. High Throughput
Screening, 2008, 11, 587–601.
29 D. A. Spiegel, F. C. Schroeder, J. R. Duvall and S. L. Schreiber,
J. Am. Chem. Soc., 2006, 128, 14766–14767.
30 D. S. Tan, Nat. Chem. Biol., 2005, 1, 74–84.
31 C. A. Lipinski, J. Pharmacol. Toxicol. Methods, 2000, 44,
235–249.
32 C. A. Lipinski, F. Lombardo, B. W. Dominy and P. J. Feeney,
Adv. Drug Delivery Rev., 2001, 46, 3–26.
33 D. F. Veber, S. R. Johnson, H. Y. Cheng, B. R. Smith,
K. W. Ward and K. D. Kopple, J. Med. Chem., 2002, 45,
2615–2623.
34 C. A. Bergstrom, M. Strafford, L. Lazorova, A. Avdeef,
K. Luthman and P. Artursson, J. Med. Chem., 2003, 46, 558–570.
35 M. C. Wenlock, R. P. Austin, P. Barton, A. M. Davis and
P. D. Leeson, J. Med. Chem., 2003, 46, 1250–1256.
36 S. D. Pickett, I. M. McLay and D. E. Clark, J. Chem. Inf.
Comput. Sci., 2000, 40, 263–272.
37 T. I. Oprea and H. Matter, Curr. Opin. Chem. Biol., 2004, 8,
349–358.
38 A. Bender, J. L. Jenkins, M. Glick, Z. Deng, J. H. Nettles and
J. W. Davies, J. Chem. Inf. Model., 2006, 46, 2445–2456.
39 A. Bender, D. W. Young, J. L. Jenkins, M. Serrano,
D. Mikhailov, P. A. Clemons and J. W. Davies, Comb. Chem.
High Throughput Screening, 2007, 10, 719–731.
40 E. Gregori-Puigjane and J. Mestres, Comb. Chem. High Through-
put Screening, 2008, 11, 669–676.
41 D. Rognan, Br. J. Pharmacol., 2007, 152, 38–52.
42 W. Duch, K. Swaminathan and J. Meller, Curr. Pharm. Des.,
2007, 13, 1497–1508.
43 S. M. Lippow, K. D. Wittrup and B. Tidor, Nat. Biotechnol.,
2007, 25, 1171–1176.
44 M. L. Radhakrishnan and B. Tidor, J. Phys. Chem. B, 2007, 111,
13419–13435.
Fig. 5 The Disease-Target-Drug matrix: correlating drug, disease
and target space in three dimensions. Targets are grouped into specific
diseases and drugs, drugs are grouped into specific diseases and
targets. Cyan boxes represent ideal targets that are related to specific
diseases. Red boxes represent off-targets that should not be targeted.
Different diseases can share common targets and different drugs can
aim at different targets. Detailed knowledge of the global Disease-
Target-Drug map can allow targeting different diseases with a single
drug (drug A), avoiding at the same time off-target effects (drug B).
45 A. L. Hopkins, J. S. Mason and J. P. Overington, Curr. Opin.
Struct. Biol., 2006, 16, 127–136.
46 M. M. Hann, A. R. Leach and G. Harper, J. Chem. Inf. Comput.
Sci., 2001, 41, 856–864.
47 R. Morphy and Z. Rankovic, Drug Discovery Today, 2007, 12,
156–160.
48 R. Morphy and Z. Rankovic, J. Med. Chem., 2006, 49,
4961–4970.
49 R. Morphy and Z. Rankovic, J. Med. Chem., 2005, 48,
6523–6543.
50 R. Morphy, C. Kay and Z. Rankovic, Drug Discovery Today,
2004, 9, 641–651.
51 R. Morphy and Z. Rankovic, Curr. Pharm. Des., 2009, 15,
587–600.
52 R. Horuk, Expert Reviews in Molecular Medicine, 2009, 11, e1.
53 J. L. Wallace and P. Del Soldato, Fundam. Clin. Pharmacol.,
2003, 17, 11–20.
54 J. L. Wallace, Trends Pharmacol. Sci., 2007, 28, 501–505.
55 R. Aslanian, M. Mutahi, N. Y. Shih, J. J. Piwinski, R. West,
S. M. Williams, S. She, R. L. Wu and J. A. Hey, Bioorg. Med.
Chem. Lett., 2003, 13, 1959–1961.
56 N. Murugesan, Z. Gu, L. Fadnis, J. E. Tellew, R. A. Baska,
Y. Yang, S. M. Beyer, H. Monshizadegan, K. E. Dickinson,
M. T. Valentine, W. G. Humphreys, S. J. Lan, W. R. Ewing,
K. E. Carlson and M. C. Kowala, et al., J. Med. Chem., 2005, 48,
171–179.
57 D. Fattori, A. Squarcia and S. Bartoli, Drugs R. D., 2008, 9,
217–227.
58 P. C. Elwood, A. M. Gallagher, G. G. Duthie, L. A. Mur and
G. Morgan, Lancet, 2009, 373, 1301–1309.
59 M. J. Keiser, B. L. Roth, B. N. Armbruster, P. Ernsberger,
J. J. Irwin and B. K. Shoichet, Nat. Biotechnol., 2007, 25,
197–206.
60 A. Bender, J. Scheiber, M. Glick, J. W. Davies, K. Azzaoui,
J. Hamon, L. Urban, S. Whitebread and J. L. Jenkins,
ChemMedChem, 2007, 2, 861–873.
61 G. M. Cragg and D. J. Newman, J. Ethnopharmacol., 2005, 100,
72–79.
62 J. D. McChesney, S. K. Venkataraman and J. T. Henri,
Phytochemistry, 2007, 68, 2015–2022.
63 D. J. Newman, G. M. Cragg and K. M. Snader, J. Nat. Prod.,
2003, 66, 1022–1037.
64 A. Murakami, H. Ashida and J. Terao, Cancer Lett., 2008, 269,
315–325.
65 M. A. Koch, L. O. Wittenberg, S. Basu, D. A. Jeyaraj,
E. Gourzoulidou, K. Reinecke, A. Odermatt and
H. Waldmann, Proc. Natl. Acad. Sci. U. S. A., 2004, 101,
16721–16726.
66 J. Clardy and C. Walsh, Nature, 2004, 432, 829–837.
67 A. Ganesan, Curr. Opin. Chem. Biol., 2008, 12, 306–317.
68 M. A. Koch, A. Schuffenhauer, M. Scheck, S. Wetzel,
M. Casaulta, A. Odermatt, P. Ertl and H. Waldmann, Proc.
Natl. Acad. Sci. U. S. A., 2005, 102, 17272–17277.
69 M. Feher and J. M. Schmidt, J. Chem. Inf. Comput. Sci., 2003, 43,
218–227.
70 I. R. Edwards and J. K. Aronson, Lancet, 2000, 356, 1255–1259.
71 M. L. Radhakrishnan and B. Tidor, J. Chem. Inf. Model., 2008,
48, 1055–1073.
72 G. V. Paolini, R. H. Shapland, W. P. van Hoorn, J. S. Mason and
A. L. Hopkins, Nat. Biotechnol., 2006, 24, 805–815.
73 Y. Yamanishi, M. Araki, A. Gutteridge, W. Honda and
M. Kanehisa, Bioinformatics, 2008, 24, i232–240.
74 L. Jacob and J. P. Vert, Bioinformatics, 2008, 24, 2149–2156.
75 S. C. Brewerton, Curr. Opin. Drug Discov. Devel., 2008, 11,
356–364.
76 M. E. Hillenmeyer, E. Fung, J. Wildenhain, S. E. Pierce, S. Hoon,
W. Lee, M. Proctor, R. P. St Onge, M. Tyers, D. Koller,
R. B. Altman, R. W. Davis, C. Nislow and G. Giaever, Science,
2008, 320, 362–365.
77 A. W. Whitehurst, B. O. Bodemann, J. Cardenas, D. Ferguson,
L. Girard, M. Peyton, J. D. Minna, C. Michnoff, W. Hao,
M. G. Roth, X. J. Xie and M. A. White, Nature, 2007, 446,
815–819.
78 C. H. Ho, L. Magtanong, S. L. Barker, D. Gresham,
S. Nishimura, P. Natarajan, J. L. Koh, J. Porter, C. A. Gray,
R. J. Andersen, G. Giaever, C. Nislow, B. Andrews, D. Botstein
and T. R. Graham, et al., Nat. Biotechnol., 2009, 27, 369–377.
79 M. A. Fabian, W. H. Biggs, 3rd, D. K. Treiber, C. E. Atteridge,
M. D. Azimioara, M. G. Benedetti, T. A. Carter, P. Ciceri,
P. T. Edeen, M. Floyd, J. M. Ford, M. Galvin, J. L. Gerlach,
R. M. Grotzfeld and S. Herrgard, et al., Nat. Biotechnol., 2005,
23, 329–336.
80 B. Apsel, J. A. Blair, B. Gonzalez, T. M. Nazif, M. E. Feldman,
B. Aizenstein, R. Hoffman, R. L. Williams, K. M. Shokat and
Z. A. Knight, Nat. Chem. Biol., 2008, 4, 691–699.
81 M. Kuhn, M. Campillos, P. Gonzalez, L. J. Jensen and P. Bork,
FEBS Lett., 2008, 582, 1283–1290.
82 K. I. Goh, M. E. Cusick, D. Valle, B. Childs, M. Vidal and
A. L. Barabasi, Proc. Natl. Acad. Sci. U. S. A., 2007, 104,
8685–8690.
83 K. Lage, E. O. Karlberg, Z. M. Storling, P. I. Olason,
A. G. Pedersen, O. Rigina, A. M. Hinsby, Z. Tumer, F. Pociot,
N. Tommerup, Y. Moreau and S. Brunak, Nat. Biotechnol., 2007,
25, 309–316.
84 P. Brennan, R. Donev and S. Hewamana, Mol. BioSyst., 2008, 4,
909–919.
85 D. S. Lee, J. Park, K. A. Kay, N. A. Christakis, Z. N. Oltvai and
A. L. Barabasi, Proc. Natl. Acad. Sci. U. S. A., 2008, 105,
9880–9885.
86 D. S. Wishart, C. Knox, A. C. Guo, D. Cheng, S. Shrivastava,
D. Tzur, B. Gautam and M. Hassanali, Nucleic Acids Res., 2008,
36, D901–906.
87 A. I. Archakov, V. M. Govorun, A. V. Dubanov, Y. D. Ivanov,
A. V. Veselovsky, P. Lewi and P. Janssen, Proteomics, 2003, 3,
380–391.
88 A. Hamosh, A. F. Scott, J. S. Amberger, C. A. Bocchini and
V. A. McKusick, Nucleic Acids Res., 2005, 33, D514–517.
89 I. Feldman, A. Rzhetsky and D. Vitkup, Proc. Natl. Acad. Sci.
U. S. A., 2008, 105, 4323–4328.
90 J. Xu and Y. Li, Bioinformatics, 2006, 22, 2800–2805.
91 Z. Yan, M. Costanzo, L. E. Heisler, J. Paw, F. Kaper,
B. J. Andrews, C. Boone, G. Giaever and C. Nislow, Nat.
Methods, 2008, 5, 719–725.
92 R. S. Lokey, Curr. Opin. Chem. Biol., 2003, 7, 91–96.
93 L. Sleno and A. Emili, Curr. Opin. Chem. Biol., 2008, 12, 46–54.
94 H. Katayama and Y. Oda, J. Chromatogr., B: Anal. Technol.
Biomed. Life Sci., 2007, 855, 21–27.
95 S. F. Kingsmore, Nat. Rev. Drug Discovery, 2006, 5, 310–320.
96 C. Wingren and C. A. Borrebaeck, OMICS, 2006, 10, 411–427.
97 H. Ma and K. Y. Horiuchi, Drug Discovery Today, 2006, 11,
661–668.
98 H. Schmidinger, A. Hermetter and R. Birner-Gruenberger,
Amino Acids, 2006, 30, 333–350.
99 J. K. Nicholson and J. C. Lindon, Nature, 2008, 455, 1054–1056.
100 M. Kuhn, C. von Mering, M. Campillos, L. J. Jensen and
P. Bork, Nucleic Acids Res., 2008, 36, D684–688.
101 W. Loging, L. Harland and B. Williams-Jones, Nat. Rev. Drug
Discovery, 2007, 6, 220–230.
102 C. G. Wermuth, Drug Discovery Today, 2006, 11, 160–164.
103 I. F. Tsui, R. Chari, T. P. Buys and W. L. Lam, Cancer Inform,
2007, 3, 389–407.
104 M. Campillos, M. Kuhn, A. C. Gavin, L. J. Jensen and P. Bork,
Science, 2008, 321, 263–266.
105 E. Ericson, M. Gebbia, L. E. Heisler, J. Wildenhain, M. Tyers,
G. Giaever and C. Nislow, PLoS Genet., 2008, 4, e1000151.
106 E. E. Schadt, S. H. Friend and D. A. Shaywitz, Nat. Rev. Drug
Discovery, 2009, 8, 286–295.
107 J. C. Nacher and J. M. Schwartz, BMC Pharmacol., 2008, 8, 5.
108 M. Cases and J. Mestres, Drug Discovery Today, 2009, 14,
479–485.
109 J. Mestres, E. Gregori-Puigjane, S. Valverde and R. V. Sole, Nat.
Biotechnol., 2008, 26, 983–984.
110 R. Chait, A. Craney and R. Kishony, Nature, 2007, 446, 668–671.
111 M. T. Holden, E. J. Feil, J. A. Lindsay, S. J. Peacock, N. P. Day,
M. C. Enright, T. J. Foster, C. E. Moore, L. Hurst, R. Atkin,
A. Barron, N. Bason, S. D. Bentley, C. Chillingworth and
T. Chillingworth, et al., Proc. Natl. Acad. Sci. U. S. A., 2004,
101, 9786–9791.
112 J. M. Rolain, P. Francois, D. Hernandez, F. Bittar, H. Richet,
G. Fournous, Y. Mattenberger, E. Bosdure, N. Stremler,
J. C. Dubus, J. Sarles, M. Reynaud-Gaubert, S. Boniface,
J. Schrenzel and D. Raoult, Biology Direct, 2009, 4, 1.
113 J. Lehar, A. Krueger, G. Zimmermann and A. Borisy, Mol. Syst.
Biol., 2008, 4, 215.
114 J. Lehar, A. S. Krueger, W. Avery, A. M. Heilbut,
L. M. Johansen, E. R. Price, R. J. Rickles, G. F. Short, 3rd,
J. E. Staunton, X. Jin, M. S. Lee, G. R. Zimmermann and
A. A. Borisy, Nat. Biotechnol., 2009, 27, 659–666.
115 M. Hegreness, N. Shoresh, D. Damian, D. Hartl and R. Kishony,
Proc. Natl. Acad. Sci. U. S. A., 2008, 105, 13977–13981.
116 J. B. Michel, P. J. Yeh, R. Chait, R. C. Moellering, Jr and
R. Kishony, Proc. Natl. Acad. Sci. U. S. A., 2008, 105,
14918–14923.
117 F. Aguero, B. Al-Lazikani, M. Aslett, M. Berriman,
F. S. Buckner, R. K. Campbell, S. Carmona, I. M. Carruthers,
A. W. Chan, F. Chen, G. J. Crowther, M. A. Doyle, C. Hertz-
Fowler, A. L. Hopkins and G. McAllister, et al., Nat. Rev. Drug
Discovery, 2008, 7, 900–907.
118 E. C. Butcher, E. L. Berg and E. J. Kunkel, Nat. Biotechnol.,
2004, 22, 1253–1259.
119 R. S. Faustino and A. Terzic, Clin. Pharmacol. Ther., 2008, 84,
543–545.
120 T. Liu, Y. Lin, X. Wen, R. N. Jorissen and M. K. Gilson, Nucleic
Acids Res., 2007, 35, D198–201.
121 S. Goto, Y. Okuno, M. Hattori, T. Nishioka and M. Kanehisa,
Nucleic Acids Res., 2002, 30, 402–404.
122 K. Degtyarenko, P. de Matos, M. Ennis, J. Hastings,
M. Zbinden, A. McNaught, R. Alcantara, M. Darsow,
M. Guedj and M. Ashburner, Nucleic Acids Res., 2008, 36,
D344–350.
123 J. H. Chen, E. Linstead, S. J. Swamidass, D. Wang and P. Baldi,
Bioinformatics, 2007, 23, 2348–2351.
124 J. J. Irwin and B. K. Shoichet, J. Chem. Inf. Model., 2005, 45,
177–182.
125 S. Gunther, M. Kuhn, M. Dunkel, M. Campillos, C. Senger,
E. Petsalaki, J. Ahmed, E. G. Urdiales, A. Gewiess, L. J. Jensen,
R. Schneider, R. Skoblo, R. B. Russell, P. E. Bourne and P. Bork,
et al., Nucleic Acids Res., 2008, 36, D919–922.
126 J. Masciocchi, G. Frau, M. Fanton, M. Sturlese, M. Floris,
L. Pireddu, P. Palla, F. Cedrati, P. Rodriguez-Tome and
S. Moro, Nucleic Acids Res., 2009, 37, D284–290.
127 K. P. Seiler, G. A. George, M. P. Happ, N. E. Bodycombe,
H. A. Carrinski, S. Norton, S. Brudz, J. P. Sullivan, J.
Muhlich, M. Serrano, P. Ferraiolo, N. J. Tolliday,
S. L. Schreiber and P. A. Clemons, Nucleic Acids Res., 2008,
36, D351–359.
128 D. L. Wheeler, T. Barrett, D. A. Benson, S. H. Bryant, K. Canese,
V. Chetvernin, D. M. Church, M. Dicuccio, R. Edgar,
S. Federhen, M. Feolo, L. Y. Geer, W. Helmberg, Y. Kapustin
and O. Khovayko, et al., Nucleic Acids Res., 2008, 36, D13–21.
129 A. E. Speers and B. F. Cravatt, ChemBioChem, 2004, 5, 41–47.
130 N. Jessani, S. Niessen, B. Q. Wei, M. Nicolau, M. Humphrey,
Y. Ji, W. Han, D. Y. Noh, J. R. Yates, 3rd, S. S. Jeffrey and
B. F. Cravatt, Nat. Methods, 2005, 2, 691–697.
131 S. E. Ong, M. Schenone, A. A. Margolin, X. Li, K. Do,
M. K. Doud, D. R. Mani, L. Kuai, X. Wang, J. L. Wood,
N. J. Tolliday, A. N. Koehler, L. A. Marcaurelle, T. R. Golub
and R. J. Gould, et al., Proc. Natl. Acad. Sci. U. S. A., 2009, 106,
4617–4622.
132 M. Salcius, G. A. Michaud, B. Schweitzer and P. F. Predki,
Methods Mol. Biol., 2007, 382, 239–248.
133 N. C. Turner, C. J. Lord, E. Iorns, R. Brough, S. Swift, R. Elliott,
S. Rayter, A. N. Tutt and A. Ashworth, EMBO J., 2008, 27,
1368–1377.
134 S. W. Michnick, P. H. Ear, E. N. Manderson, I. Remy and
E. Stefan, Nat. Rev. Drug Discovery, 2007, 6, 569–582.
135 D. A. Bachovchin, S. J. Brown, H. Rosen and B. F. Cravatt,
Nat. Biotechnol., 2009, 27, 387–394.
136 R. Frijters, S. Verhoeven, W. Alkema, R. van Schaik and
J. Polman, Pharmacogenomics, 2007, 8, 1521–1534.
137 E. S. Chen, G. Hripcsak, H. Xu, M. Markatou and C. Friedman,
J Am Med Inform Assoc, 2008, 15, 87–98.
Genome Biology 2009, 10:309
Meeting report
Interfacing systems biology and synthetic biology
Allyson Lister*, Varodom Charoensawan†, Subhajyoti De†, Katherine James*, Sarath Chandra Janga† and Julian Huppert‡
Addresses: *Centre for Integrated Systems Biology of Ageing and Nutrition (CISBAN) and School of Computing Science, Newcastle University, Newcastle upon Tyne NE1 7RU, UK. †MRC Laboratory of Molecular Biology, Hills Road, Cambridge CB2 0QH, UK. ‡Cavendish Laboratory, University of Cambridge, JJ Thomson Ave, Cambridge CB3 0HE, UK.
Correspondence: Julian Huppert. Email: jlh29@cam.ac.uk
Published: 26 June 2009
Genome Biology 2009, 10:309 (doi:10.1186/gb-2009-10-6-309)
The electronic version of this article is the complete one and can be
found online at http://genomebiology.com/2009/10/6/309
© 2009 BioMed Central Ltd 
A report of BioSysBio 2009, the IET conference on Synthetic
Biology, Systems Biology and Bioinformatics, Cambridge, UK,
23-25 March 2009.
The fourth meeting in the BioSysBio conference series
brought together international researchers in the interacting
disciplines of synthetic biology, systems biology and bioinformatics.
This conference was largely student-run, and as
well as the formal talks included workshops, discussion
sessions and a panel session on ethics, public engagement
and biosecurity. A wide range of topics was covered at the
conference, including modeling, biofuels and environmental
bioremediation, metabolomics, structural and computational
genomics, and software tools. Of note was the number of
groups presenting improved models of metabolism, studying
cellular subsystems such as cell death and circadian rhythms.
Others are developing new approaches and standards for
systems and synthetic biology, and significant improvements
were reported for Systems Biology Markup Language (SBML)
and the MIT Registry of Standard Biological Parts. A few
highlights of the meeting are given here.
Synthetic biology and its standardization
Synthetic biology is a newly emerging field, where biological
components are reengineered to provide new, designed
functions. In a keynote lecture, Adam Arkin (University of
California, Berkeley, USA) discussed the origins of synthetic
biology and its scalability, as well as the engineering
challenges that lie beyond the bioreactor. In his view, using
synthetic biology, whether to meet an engineering or
biological challenge, can be transparent, efficient, reliable,
predictable and safe, unlike other human interventions such
as selective breeding and the introduction of non-native
species. Arkin also described ways of reducing the time and
improving the reliability of biosynthesis, such as the use of
standardized parts, computer-assisted design, and methods
for quickly assembling parts. Evolved systems are complex
and subtle, and he highlighted the fact that synthetic
organisms need to deal with the same uncertainty and
competition as do existing organisms.
Among the ‘parts’ required in synthetic biology are switches
that can function, for example, as regulators of gene
expression. Christina Smolke (Stanford University, USA)
presented novel design strategies for constructing RNA-
based molecular switches that can function as both biosensors
and ligand-controlled regulators of gene expression.
Binding of the appropriate ligand leads to a regulated
conformational change in a designed RNA molecule, which
in turn can be linked to an appropriate readout signal,
enabling these molecules to act as sophisticated cellular
biosensors. She also described how such riboswitches can be
used as targeted or ‘intelligent’ therapeutic molecules for
treatment of cancer, allowing them to be carefully tuned to
respond to a precise set of molecular stimuli.
Given the recent explosion in the number of approaches to
synthetic biology and the amount of data at the interface of
genomic and systems biology, there is now an overwhelming
need to organize these data efficiently in appropriate repositories.
An update on current standards for DNA description
by Guy Cochrane (EBI, Cambridge, UK) focused on the
different raw sequencing formats available and, in
particular, the work that is being done at EMBL to integrate
them, via SRS. In an overview of standards and improvements
in SBML, which is the platform for most
software in systems biology, Herbert Sauro (University of
Washington, Seattle, USA) emphasized the need to
incorporate multi-compartment models into the existing
framework of SBML. Randy Rettberg (Massachusetts
Institute of Technology, Cambridge, USA) provided an
overview of the publicly available synthetic biology repository
being developed at MIT
[http://partsregistry.org/Main_Page] as a result of contributions
from participants in iGEM - the international
genetically engineered machine competition.
Systems biology and automation
Because of the complexity of biological systems, it has
always been a challenge to develop predictive dynamic
models that are sensitive to changes in biological inputs, but
at the same time robust to technical noise. A variety of
approaches were described at the meeting. Using a Bayesian
framework to study the inferability of model parameters
under experimental noise, Kamil Erguler (Imperial College
London, UK) introduced sensitivity profiles to identify the
relative impacts of changes in parameters on the global
dynamics of biochemical models. This analysis revealed the
degree of robustness of inferences drawn from different
parts of biochemical pathways and thus provides a guide to
improved data collection. Andre Ribeiro (Tampere
University of Technology, Finland) has developed a delayed
stochastic model to investigate the stepwise elongation
motion of RNA polymerase and its pauses during
transcription. He showed that transcriptional noise level was
affected by the durations of the pauses, which could in turn
be intrinsically encoded within the DNA sequence.
Another challenge is to store all the information being
generated by all the -omic sciences. Catherine Lloyd
(Auckland Bioengineering Institute, New Zealand) described
the language CellML, which is written in XML and uses
existing formats such as MathML and RDF to describe
biological models of cellular function. The CellML model
repository has over 380 models, free to download
[http://www.cellml.org/]. CellML has a number of other
useful features, including modularity and the sharing of
components such as entities and processes. Ulrike Wittig
(EML Research, Heidelberg, Germany) presented SABIO-
RK, a database of information about biochemical reactions
and enzyme kinetics. The reactions in the database are
mainly taken from the Kyoto Encyclopedia of Genes and
Genomes (KEGG) and the literature, and the kinetic data
comes from the literature. SABIO-RK can be accessed via
both a user interface and web services [http://sabio.villa-bosch.de/].
Recent improvements include a new data model
for SABIO-RK that allows the storage of intermediate steps
in a reaction, making SABIO-RK the first database to offer
kinetic information for both biochemical reactions and their
individual steps.
DNA synthesis and sequencing comprise one of the
cornerstones of modern biology, and Tuval Ben Yehezkel
(Weizmann Institute, Rehovot, Israel) described new
strategies for synthesizing completely de novo DNA
fragments using single-molecule PCR in a completely
automated fashion. Single-molecule PCR can be readily
scaled up, and will complement the highly parallel DNA
sequencing technologies such as 454 and Solexa sequencing in
the future.
Steve Oliver (University of Cambridge, UK) and his
colleagues have taken automation even further, describing
an automated experimental system to study yeast
metabolism. He and colleagues have designed a robot, called
Adam, that uses abductive logic programming (ALP) and is
capable of reasoning about hypotheses and data, designing
experiments to test the hypotheses, and then carrying out
those experiments and interpreting the results.
Ethics and security
Scientists in all fields have a duty to consider the public
impact of their work and the conference included a lively
panel discussion covering ethics, public engagement and
biosecurity. Drew Endy (Stanford University, USA) asserted
that while the basics of genetic engineering have not
changed in more than 30 years, synthetic biology is
revolutionary. He raised the question of people trying to
‘hack’ genomes in their garage: how should they be
managed, if indeed they should be managed at all? He also
described how the patent system is flawed with regard to
synthetic biology; for example, patenting the BioBricks
registry of DNA parts encoding basic biological function
would be expensive and counterproductive.
Matthew Harvey (Royal Society, London, UK) cautioned that
we should not assume that the public must be engaged:
sometimes the public simply are not interested. In contrast
to genetically modified organisms, there are no synthetic
biology products queuing up to be sold right now. Therefore,
questioning the public about synthetic biology is currently
less like traditional public engagement and more like social-
intelligence gathering.
Two concerns were discussed by Julian Savulescu (University
of Oxford, UK): that synthetic biology may pose risks in
terms of malevolent use, and that the use of synthetic
biology might undermine the moral status of living things.
For regulators, the challenge is to minimize the risk of malevolent
use. For scientists, it is to make better predictions
about how research will be used in the future. For
philosophers, the challenge is to ascertain criteria for moral
status, and determine how to weigh the risk of future
wrongdoing against the benefits of pursuing research in
synthetic biology. Piers Millet (UN Biological Weapons
Convention Implementation Support Unit, Geneva, Switzerland)
invited scientists to work with security people to
prevent bioterrorism. He highlighted that this engagement
needs to be bottom up, not top down, and that his
organization could help.
A new feature of BioSysBio 2009 was the extension of participation in
the conference to a wider audience by communicating
live content through microblogging (using FriendFeed and
Twitter; Figure 1) and live blogging (providing an immediate
and permanent log) [http://themindwobbles.wordpress.com/tag/biosysbio-2009/].
The fields covered by the
conference are still developing. Researchers are opening up
new topics, discovering that mathematical, physical and
engineering concepts apply to ever more biological
problems. The new generation of researchers increasingly
see themselves as forming a new discipline, and while this is
exciting, they must ensure that they do not cut themselves
off from either of the ‘parent’ disciplines, the physical
sciences (including engineering) and the biological sciences;
in particular, more traditional biologists do have important
knowledge to convey and questions to pose. However, the
results reported at the meeting show that, in most cases, the
best from both disciplines is being matched - and exceeded.
Figure 1
Word cloud of the contents of the BioSysBio Twitter feed, identified via the search term “#biosysbio”. The size of each of the words corresponds to
their usage frequency. Image generated using wordle.net by Simon Cockell [http://www.flickr.com/photos/sjcockell/3389493857/]. Licensed under the
Attribution 2.0 Generic License [http://creativecommons.org/licenses/by/2.0/deed.en_GB].
Identification and Genomic Analysis of Transcription Factors
in Archaeal Genomes Exemplifies Their Functional
Architecture and Evolutionary Origin
Ernesto Pérez-Rueda*,1 and Sarath Chandra Janga*,2
1Departamento de Ingeniería Celular y Biocatálisis, IBT-UNAM, AP 565-A, Cuernavaca, Morelos, México
2MRC Laboratory of Molecular Biology, Cambridge, United Kingdom
*Corresponding author: E-mail: erueda@ibt.unam.mx; sarath@mrc-lmb.cam.ac.uk.
Associate editor: Michele Vendruscolo
Abstract
Archaea, which represent a large fraction of the phylogenetic diversity of organisms, are prokaryotes with eukaryote-like basal
transcriptional machinery. This organization makes the study of their DNA-binding transcription factors (TFs) and their
transcriptional regulatory networks particularly interesting. In addition, there are limited experimental data regarding their
TFs. In this work, 3,918 TFs were identified and exhaustively analyzed in 52 archaeal genomes. TFs represented less than 5% of
the gene products in all the studied species, comparable with the number of TFs identified in parasites or intracellular
pathogenic bacteria, suggesting a deficit in this class of proteins. A total of 75 families were identified, of which the HTH_3, AsnC,
TrmB, and ArsR families were universally and abundantly identified in all the archaeal genomes. We found that archaeal TFs are
significantly smaller than other protein-coding genes in archaea as well as bacterial TFs, suggesting that a large fraction
of these small-sized TFs could compensate for the probable deficit of TFs in archaea, possibly by forming different combinations of
monomers similar to that observed in eukaryotic transcriptional machinery. Our results show that although the DNA-binding
domains of archaeal TFs are similar to those of bacteria, there is an underrepresentation of ligand-binding domains in smaller TFs,
which suggests that protein–protein interactions may act as mediators of regulatory feedback, indicating a chimera of
bacterial and eukaryotic TFs’ functionality. The analysis presented here contributes to the understanding of the details of
the transcriptional apparatus in archaea and provides a framework for the analysis of regulatory networks in these organisms.
Key words: transcription factors, protein families, archaeal genomes, evolution, gene regulation.
Introduction
Regulation of gene expression at the transcriptional level is
a ubiquitous and fine-tuned process observed in all cellular
organisms. The ability to respond and adapt to environmental
changes is defined by the cell’s repertoire of
DNA-binding transcription factors (TFs) through interactions
between the TFs and the cis-regulatory regions of
their target genes in the form of a transcriptional regulatory
network (Babu et al. 2004; Janga and Collado-Vides 2007).
These TFs bind to the promoter regions of specific genes to
regulate expression either positively or negatively. Due to
the crucial role of TFs in coordinating the gene expression
kinetics of a genome, they have been studied in many aspects,
including mutational analysis, sequence comparisons,
and elucidation of numerous 3D structures.
The identification of the TF repertoire in a genome sequence
is a prerequisite to understanding the regulation of
gene expression and, on a global scale, for the elucidation of
regulatory networks. In this context, the organisms with
the best studied transcriptional regulatory networks, where
TFs have been identified, are the eukaryote Saccharomyces
cerevisiae (Lee et al. 2002; Janga et al. 2008) and the bacteria
Escherichia coli K12 (Babu and Teichmann 2003; Gama-Castro
et al. 2008), Bacillus subtilis (Moreno-Campuzano
et al. 2006; Sierro et al. 2008), and more recently Corynebacterium
glutamicum (Brune et al. 2005; Brinkrolf et al.
2006). However, relatively little is known about TFs and
the transcriptional regulatory networks controlled by them
in archaeal genomes, despite the fact that they represent
a large fraction of the phylogenetic diversity of organisms.
Furthermore, archaea are well suited as model organisms
for eukaryotes because of the similarities they share in their
information transfer machinery, due to a common ancestor,
as proposed by the symbiotic theory (Martin and
Muller 1998; Moreira and Lopez-Garcia 1998; Lopez-Garcia
1999; Martin et al. 2001; Esser and Martin 2007).
Archaea constitute one of the three cellular domains in
the universal tree of life (Woese 1998), composed of organisms
highly diverse in morphology, physiology, and natural
habitats (Chaban et al. 2006; Clementino et al. 2007; Nam
et al. 2008; Auguet et al. 2009). Organisms included in this
cellular domain possess basal transcription machinery resembling
that of eukaryotes. For instance, archaea include
a TATA box promoter sequence, a TATA box–binding protein
(TBP), a homologue of the transcription factor TFIIB
(TFB), and an RNA polymerase (RNAp) containing between
8 and 13 subunits (Goede et al. 2006) (see supplementary
fig. S1, Supplementary Material online). In contrast,
© 2010 The Authors
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License
(http://creativecommons.org/licenses/by-nc/2.5), which permits unrestricted non-commercial use, distribution, and
reproduction in any medium, provided the original work is properly cited. Open Access
Mol. Biol. Evol. 27(4):1–11. 2010 doi:10.1093/molbev/msq033 Advance Access publication February 1, 2010
archaeal messenger RNAs (mRNAs) are structurally similar
to bacterial mRNAs, and, most importantly, the majority of
identified TFs in archaeal organisms are homologous to
bacterial activators and repressors (Kyrpides and Woese
1998; Bell 2005). Indeed, very few eukaryotic-like TFs were
found to occur in archaea (Kruger et al. 1998). These observations
raise basic questions with regard to the
mechanisms of transcriptional regulation and the manner
by which bacterial-like TFs may interact or interfere with the
components of the eukaryotic-like basal transcriptional machinery
within an archaeal cell. It is for this reason that archaeal
DNA-binding TFs represent an important class of
proteins to explain the molecular mechanisms that underlie
transcription regulation. Even though the ever-growing number
of archaeal genome sequences reveals an increasing list of
potential regulators (Coulson et al. 2007; Wu et al. 2008), archaeal
transcriptional regulation is still poorly documented,
and the most detailed and advanced studies have been performed
with only a dozen TFs, mainly from the AsnC family
(formerly feast/famine protein family) (see supplementary
table S1, Supplementary Material online) (Napoli et al.
1999; Leonard et al. 2001; Bell 2005). Initial sequence analysis–based
attempts using family-specific models from E. coli
TFs identified only a low proportion of bacterial-like TFs in archaea
(Perez-Rueda et al. 2004; Coulson et al. 2007). One probable
cause for this discrepancy could be that the archaeal TF regulatory
repertoire includes additional classes of DNA-binding
motifs not observed in E. coli, suggesting that our current
knowledge on the repertoire of TFs in archaeal genomes is
far from being complete. Importantly, comparative genomic
analysis of archaea represents an opportunity to fill in this gap
and is an indispensable step toward our understanding of gene
regulation networks in prokaryotes and eukaryotes.
In the present study, an exhaustive analysis of gene sequences
from 52 completely sequenced archaeal genomes
was performed to identify potential DNA-binding TFs. In
addition, a comparative analysis was carried out to deduce
the distribution of TFs and their evolutionary families
among the archaeal genome sequences. Using this repertoire
of TFs, we show that 1) there is an underrepresentation
of the number of TFs in these organisms compared
with bacterial genomes, 2) a considerable number of TF genes
encode short polypeptides, with a significant fraction
encoding single-domain proteins, and 3) a high proportion
of TFs are homologous between archaea and bacteria,
mainly from the class Clostridia of the Firmicutes.
Materials and Methods
List of Archaeal Genomes Analyzed in This Study
The archaeal genomes analyzed in this work are as follows
(see supplementary table S2, Supplementary Material online,
for a more detailed annotation of the genomes): Crenarchaea
(C): Aeropyrum pernix K1, Caldivirga maquilingensis IC-167,
Hyperthermus butylicus DSM 5456, Ignicoccus hospitalis
KIN4/I, Metallosphaera sedula DSM 5348, Nitrosopumilus
maritimus SCM1, Pyrobaculum aerophilum str. IM2, Pyrobaculum
arsenaticum DSM 13514, Pyrobaculum calidifontis JCM
11548, Pyrobaculum islandicum DSM 4184, Staphylothermus
marinus F1, Sulfolobus acidocaldarius DSM 639, Sulfolobus
solfataricus P2, Sulfolobus tokodaii str. 7, Thermofilum pendens
Hrk 5, Thermoproteus neutrophilus V24Sta; Euryarchaea
(E): Methanocorpusculum labreanum Z, Methanoculleus
marisnigri JR1, Methanopyrus kandleri AV19, Methanosaeta
thermophila PT, Methanosarcina acetivorans C2A, Methanosarcina
barkeri str. Fusaro, Methanosarcina mazei Go1,
Methanosphaera stadtmanae DSM 3091, Methanospirillum
hungatei JF-1, Methanothermobacter thermautotrophicus
str. Delta H, Natronomonas pharaonis DSM 2160, Picrophilus
torridus DSM 9790, Pyrococcus abyssi GE5, Pyrococcus
furiosus DSM 3638, Pyrococcus horikoshii OT3, Thermococcus
kodakarensis KOD1, Thermoplasma acidophilum DSM
1728, Thermoplasma volcanium GSS1, uncultured methanogenic
archaeon RC-I, Methanocaldococcus jannaschii
DSM 2661, Methanococcoides burtonii DSM 6242, Methanococcus
aeolicus Nankai-3, Methanococcus maripaludis
C5, Methanococcus maripaludis C6, Methanococcus maripaludis
C7, Methanococcus maripaludis S2, Methanococcus
vannielii SB, Archaeoglobus fulgidus DSM 4304, Candidatus
Methanoregula boonei 6A8, Haloarcula marismortui ATCC
43049, Halobacterium salinarum R1, Halobacterium sp.
NRC-1, Haloquadratum walsbyi DSM 16790, Methanobrevibacter
smithii ATCC 35061; Korarchaeota (K): Candidatus
Korarchaeum cryptofilum OPF8; Nanoarchaeum (N): Nanoarchaeum
equitans Kin4-M.
Identification of DNA-Binding TFs
To identify and analyze the repertoire of TFs in 52 archaeal
genome sequences, we used a combination of information
sources and bioinformatics tools. First, 1,820 putative TFs
were collected from Transcription Factor DB (Kummerfeld
and Teichmann 2006), a database comprising computationally
derived predictions of DNA-binding TFs using
the SUPERFAMILY library and PFAM hidden Markov models
(HMMs). From this data set, 223 proteins, annotated as
transposases, invertases, and integrases, were manually excluded.
In brief, this exclusion was based on sequence comparisons
against the National Center for Biotechnology
Information’s nonredundant (NR) protein database (E
value = $10^{-3}$) by using Blast search followed by the identification
of protein domains with CD-search (E value =
$10^{-3}$) (Marchler-Bauer et al. 2007).
In the second phase, 90 family-specific HMMs previously
reported for E. coli K12 (Perez-Rueda et al. 2004) and 57 family-specific
HMMs for B. subtilis (Moreno-Campuzano et al.
2006) were used to scan all 52 archaeal genome sequences
(E value threshold = $10^{-3}$), with the hmmsearch
module from the HMMer suite of programs (http://HMMER.wustl.edu).
Briefly, these HMMs were constructed by using
the previously identified TF families in E. coli K12 and B. subtilis
as seeds, considering every protein family’s DNA-binding
domain (DBD) sequences (around 60 amino acids). Proteins
with less than 50% similarity in the DNA-binding region
against their corresponding HMM were excluded. At this
stage, 424 proteins were identified as potential TFs. This
was an important step to explore potential TFs not
identified in the first step and vice versa; that is, the coverage
of SUPERFAMILY and PFAM assignments corresponds to approximately
70% of the universe of TFs, whereas the rest
were complemented with these family-specific HMMs.
In the third phase, 70 new TFs were identified with
HMMs constructed from 17 proteins annotated as TFs
and not identified in previous searches. This step essentially
involved retrieving these 17 TFs from the Haloweb server
(http://halo4.umbi.umd.edu/cgi-bin/haloweb/nrc1.pl?operation=nrc1)
and using them as sequence seeds in
Blast searches to retrieve homologous sequences from
the NR database with an E value = $10^{-3}$. Redundancy
was removed using CD-hit (Li and Godzik 2006) at
90%, and the potential DBD was identified with CD-search
(Marchler-Bauer et al. 2007) (varying the E value
from $10^{-3}$ to $10^{-1}$) in the remaining proteins. This region
was then aligned using ClustalW, with parameters set to
default, and the output was manually edited. Finally, 14 HMMs were
constructed with the HMMer suite of programs, corresponding
to the 17 proteins clustered by sequence similarity
into 14 different groups. For two proteins, there was not
enough information to construct an HMM as they appeared
to be lineage specific and no homologues were identified.
In addition, an HMM corresponding to the helix-turn-helix
(HTH) DNA-binding motif, kindly provided by Yan
(2006), was used to identify 686 HTH proteins in the archaeal
genomes. This data set was also filtered to exclude
those proteins described as transposases, ligases, synthases,
synthetases, TFIIB, and TFIIE and those proteins identified
in the previous phases, resulting in a total of 95 new probable
TFs. Finally, COG assignments associated with TFs in archaea
were also used to retrieve new potential archaeal TFs.
This resulted in 491 proteins, which were filtered and compared
against the whole data set of predictions, but only 2
of them were found to be novel predictions.
All data sets were finally compared, and a total set of
3,918 proteins was compiled and used in this study as
the final collection of TFs (see fig. 1 for a summary of
the steps). This collection of proteins was classified into
75 families by using HMMs deposited in the PFAM DB
(Finn et al. 2006) and searches with the CD-search server (E
value = $10^{-1}$) and aligned against their corresponding
models by using the program hmmalign from HMMer.
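The way the successive search phases feed one final collection can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the protein identifiers, annotations, phase names, and the keyword list are hypothetical, and the keyword match stands in for the manual curation described above rather than reproducing it.

```python
# Hypothetical sketch: merge TF candidates from several search phases.
# IDs, annotations, and keywords are illustrative, not data from the study.

EXCLUDED_KEYWORDS = ("transposase", "invertase", "integrase")


def is_excluded(annotation: str) -> bool:
    """Return True if the annotation names an enzyme class to be removed."""
    text = annotation.lower()
    return any(keyword in text for keyword in EXCLUDED_KEYWORDS)


def merge_phases(phases, annotations):
    """Union per-phase candidates, counting only IDs not seen in earlier phases.

    phases: list of (phase_name, iterable of protein IDs), in search order.
    annotations: dict mapping protein ID -> free-text annotation.
    Returns (final_set, dict of phase_name -> number of novel IDs added).
    """
    final = set()
    novel_counts = {}
    for name, candidates in phases:
        kept = {p for p in candidates if not is_excluded(annotations.get(p, ""))}
        novel = kept - final
        novel_counts[name] = len(novel)
        final |= novel
    return final, novel_counts


annotations = {
    "p1": "ArsR family transcriptional regulator",
    "p2": "IS element transposase",
    "p3": "AsnC family regulator",
    "p4": "HTH_3 domain protein",
}
phases = [
    ("DBD database", ["p1", "p2"]),
    ("family-specific HMMs", ["p1", "p3"]),
    ("HTH model", ["p3", "p4"]),
]
final, novel_counts = merge_phases(phases, annotations)
print(sorted(final))
print(novel_counts)
```

Counting only the identifiers that are new at each phase mirrors how the text reports "new" TFs per step, so the per-phase counts add up to the size of the final set.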
Identification of Homologous DNA-Binding TFs in
Bacteria and Eukarya
In order to identify TFs that are homologous to the archaeal
set, we compared the whole repertoire against 291
NR genome sequences (Moreno-Hagelsieb and Janga 2008),
which included bacterial, archaeal, and eukaryotic sequences.
A protein was considered a homologue of a TF in
a given genome if the alignment covered at least 60% of
the query sequence with an E value ≤ $10^{-6}$.
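The homology criterion above combines a coverage threshold with an E-value cutoff. A minimal Python sketch of such a filter follows. The `Hit` record and its field names are hypothetical, coverage is computed from aligned query coordinates, and the E-value cutoff is assumed to be inclusive.

```python
from dataclasses import dataclass

MIN_QUERY_COVERAGE = 0.60  # alignment must span at least 60% of the query
MAX_E_VALUE = 1e-6         # cutoff from the text, assumed inclusive here


@dataclass
class Hit:
    """Hypothetical record for one pairwise alignment."""
    query_id: str
    subject_genome: str
    query_length: int
    aligned_query_start: int  # 1-based, inclusive
    aligned_query_end: int    # 1-based, inclusive
    e_value: float


def query_coverage(hit: Hit) -> float:
    """Fraction of the query sequence covered by the alignment."""
    aligned = hit.aligned_query_end - hit.aligned_query_start + 1
    return aligned / hit.query_length


def is_homologue(hit: Hit) -> bool:
    """Apply both criteria: sufficient coverage and a small enough E value."""
    return query_coverage(hit) >= MIN_QUERY_COVERAGE and hit.e_value <= MAX_E_VALUE


hits = [
    Hit("tf1", "genomeA", 200, 1, 150, 1e-20),   # coverage 0.75, passes
    Hit("tf1", "genomeB", 200, 50, 120, 1e-30),  # coverage 0.355, fails
    Hit("tf2", "genomeA", 100, 1, 90, 1e-3),     # E value too large, fails
]
homologues = [h for h in hits if is_homologue(h)]
print([(h.query_id, h.subject_genome) for h in homologues])
```

Requiring coverage of the query, rather than of the subject, keeps short domain-only matches against long multidomain proteins from counting as homologues of the whole TF.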
Results and Discussion
Identification of DNA-Binding TFs in Archaea
To understand the distribution of TFs in 52 archaeal genomes
(34 Euryarchaea, 16 Crenarchaeota, and 1 each of Korarchaeota
and Nanoarchaeota), we used an HMM-based strategy
in two steps. In the first step, we used a battery of family-specific
HMMs (see Materials and Methods for details) and
DBD assignments characteristic of TFs to scan the archaeal
genomes (see fig. 1 for a complete outline). These steps
allowed the detection of 3,751 TFs in 52 genomes (see Materials
and Methods for a complete list of genomes analyzed),
including 53 of the 72 TFs (75%) from
Halobacterium sp. NRC-1 described so far in the Haloweb
server. Halobacterium sp. NRC-1 is one of the few archaea
whose TF repertoire has been extensively analyzed, and
thus, we used its TF repertoire as a benchmark. In the second
step, in order to increase the sensitivity, the 19 Halobacterium
sp. NRC-1 TFs not identified in the first step
were used as seeds for Blast searches against the NR database
(E value cutoff = $10^{-3}$), and the matched proteins
were used to build new HMMs for a second round of
searches, identifying 70 new TFs. Additionally, archaeal genomes
were scanned for HTH and COG annotations
to identify new potential TFs not identified
previously. Because HTH is known to be one of the most
prominent structures associated with TFs in prokaryotes
(Perez-Rueda and Collado-Vides 2000, 2001), with at least
80% of the TFs containing this DNA-binding structure, we
employed a specific HMM, which considers amino acid residue
identity and solvent accessibility, constructed from
a set of heterogeneous DNA-binding proteins with standard
HTH motifs (Yan 2006). After manually excluding proteins
that can bind to DNA but are unlikely to be TFs,
97 potential TFs that escaped our HMM-based searches
were identified. This composite strategy allowed the detection
of an additional 167 potential archaeal TFs not identified
previously and included all the 72 TFs described in Halobacterium
sp. NRC-1. In total, a set of 3,918 potential
TFs in 52 archaeal genomes was finally identified.
FIG. 1. Flowchart showing the different steps involved in the identification of a high-confidence set of archaeal TFs. Branch points on the vertical
line from top to bottom correspond to the stage at which a particular step was taken in the process of obtaining a cleaner data set.
Boxes in the flowchart: 52 archaeal genome sequences; transcription factors identified with the DBD database; family-specific HMMs
designed using E. coli and B. subtilis TF families; increasing the coverage (family-specific HMMs designed with TFs from Halobacterium
salinarum, COGs, and HTH searches); manual curation (removal of transposases, invertases, replication/repair, and other enzymes);
Pfam assignments; literature look-up; 3,918 TFs identified.
nature, presumably due to de novo invention of TFs whose
DNA-binding models are not included in our seeddata set.
Dissecting the Repertoire of TFs
285Comprehensive identification and characterization of the
repertoire of TFs across archaeal genomes are the first step
toward expanding the possibilities for exploration of their
regulatory networks. Based on our predictions, we found
that smaller archaeal genomes contain fewer TFs than
290larger ones, following a linear correlation (r2 5 0.82), as
has been previously reported for bacteria (Perez-
Ruedaet al. 2004; fig. 2a). This finding might represent
FIG. 2. a) Distribution of TFs identified in 52 archaeal genomes. Nanoarchaeum equitans (Neq), Haloarcula marismortui (Hma),
Methanospirillum hungatei (Mhu), and Methanosarcina acetivorans C2A (Mac) are indicated as a reference. On x axis, genomes are sorted from
smallest to largest size and on y axis the number of TFs is plotted. A linear regression was calculated using the Pearson correlation (r2) between
the number of genes and the total number of TFs. b) Proportion of TFs in all the archaeal genomes. Proportion of TFs was calculated as the
fraction of ORFs encoding for TFs and plotted against the total number of ORFs for each genome. Pyrococcus horikoshii (pho) and Pyrococcus
abyssi (pab) are indicated as a reference. On x axis, genomes are sorted from smallest to largest size and on y axis, the fraction of TFs is plotted.
Pe´rez-Rueda and Chandra Janga · doi:10.1093/molbev/msq033 MBE
4
either an expansion or a contraction of the repertoire of
TFs in archaea, as a consequence of adaptation to partic-
295 ular habitats or lifestyles. Although larger genomes might
be harboring ampler repertoire of TFs to exploit diverse or
more complex habitats, smaller genomes containing fewer
regulators might be associated with specific niches. For in-
stance, E. coli, which thrives on a large number of sugars,
300 was found to harbor a higher number of TFs compared
with B. subtilis, which is similar in genome size (Janga
and Perez-Rueda 2009). Likewise, we found that the sym-
biotic hyperthermophile, N. equitans, has both a reduced
genome and a lower proportion of TFs than other archaea,
305 whereas Haloarcula marismortui, a chemoheterotrophic
halophilic archaea, was found to have the highest propor-
tion of TFs and Methanosarcina acetivorans (an aerobic
chemolitho(aceto)autotrophic methanogen, nitrogen fix-
ing) with one of the largest genomes contained the highest
310 the number of TFs among archaeal genomes sequenced so
far. An interesting case is that of Methanospirillum hunga-
tei, a methanogenic archaea reported to have an unusual
filamentous structure, which was found to have the lowest
proportion of TFs after N. equitans among the archeael ge-
315 nomes studied. Complex lifestyles might require a higher
proportion of genes and TFs to better orchestrate re-
sponses to changing environments, as is the case of Meth-
anosarcina acetivorans that can form aggregate
multicellular structures when passing from anaerobiosis
320 to aerobiosis (Oelgeschlager and Rother 2008) or the case
of Haloarcula marismortui, a halophilic archaea, which are
generally described to be surprisingly different in its nutri-
tional demands and metabolic pathways (Falb et al. 2008).
In fact, the proportion of TFs in larger genomes is consis-
325 tent with the hypothesis that an increase of genome com-
plexity and physiological functionality is generally
associated with a more complex regulation of gene expres-
sion (Woese 1998).
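The two quantities behind figure 2, the squared Pearson correlation between gene number and TF number and the fraction of ORFs that encode TFs, can be computed as in the following Python sketch. The genome names and counts are hypothetical placeholders chosen only to make the example run; they are not values from supplementary table S2.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical (ORF count, TF count) pairs; not data from the study.
genomes = {
    "genome_a": (500, 10),
    "genome_b": (1500, 60),
    "genome_c": (2500, 90),
    "genome_d": (4500, 150),
}

orfs = [orf for orf, _ in genomes.values()]
tfs = [tf for _, tf in genomes.values()]

r = pearson_r(orfs, tfs)
r_squared = r ** 2

# Normalized view: fraction of ORFs that encode TFs in each genome.
tf_fraction = {name: tf / orf for name, (orf, tf) in genomes.items()}

print(f"r^2 = {r_squared:.3f}")
for name, fraction in tf_fraction.items():
    print(f"{name}: {fraction:.1%}")
```

Reporting both views matters because a genome can sit on the linear trend of absolute TF counts and still be an outlier once the count is normalized by the number of ORFs.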
In this context, the number of predicted TFs in archaea is
variable (see supplementary table S2, Supplementary Material
online), ranging from 8 in the archaeon with the
smallest sequenced genome (N. equitans) to 158
TFs in the largest genome, Methanosarcina acetivorans
C2A. A closer look at the normalized distribution of
TFs, calculated as the proportion of genes coding for
TFs, gave further insights into the evolution of TFs in
the context of genome size and lifestyle. For instance,
as shown in figure 2b, less than 5% of the open reading
frames (ORFs) in most archaeal genomes are devoted to
gene regulation, in contrast to about 8–10% observed in
bacterial genomes with a similar number of ORFs (Perez-Rueda
and Collado-Vides 2000, 2001). Indeed, larger archaeal
genomes, such as Methanosarcina acetivorans and
Haloarcula marismortui, with a similar number of ORFs to
E. coli K12, encode a smaller proportion of TFs (4.8% and
3.5% vs. 8% in E. coli K12). Thus, the TF repertoire observed
in archaea is much more similar to that of bacteria associated
with gene loss events, such as intracellular pathogens
and endosymbionts (3.9% on average). Notable exceptions
are Pyrococcus horikoshii and Pyrococcus abyssi, two small
genomes containing 4.8% and 5.1% of TFs, respectively,
comparable with the proportion of TFs in larger archaeal
genomes. In contrast, N. equitans, which was found to follow
the trend in figure 2a, exhibited a clear deviation when
the proportion of genes coding for TFs was compared against
genome size.
Although this intriguingly low proportion of TFs in archaea
compared with bacteria could be partially explained
by our inability to identify lineage- or organism-specific
TFs, it is also possible that other regulatory
strategies in this cellular domain compensate
for this underrepresentation. These could involve,
for example, formation of alternative TBP–TFB–RNAp
complexes, with the possibility of interactions with different
accessory factors (Baliga et al. 2000; Facciotti et al.
2007). However, it cannot be excluded that new classes of TFs not
explored here or archaeal-specific regulatory mechanisms
are responsible for this trend. For
instance, it has been shown recently from a global analysis
of translationally regulated genes in Halobacterium salinarum
and Haloferax volcanii that 20% and 12% of all
genes in these genomes show growth phase–dependent
differential translational regulation (Lange et al. 2007).
However, the overlap between the two sets was found
to be negligible, indicating that archaeal organisms may
use differential translational control for regulation of gene
expression, adding a layer of regulatory complexity at the posttranscriptional
level (Mittal et al. 2009). Therefore, regulatory
strategies that are either found
exclusively in archaea or exploited to
a greater extent in archaea than in bacteria might
be responsible for these differences.
Archaeal Genomes Encode a Large Proportion of
Small TFs
Transcription regulation in archaea appears to be a chimera,
with general TFs being clearly eukaryotic like and
candidates for regulating specific responses being bacterial
like (Aravind and Koonin 1999). We found that a large
proportion (43.5%) of TFs in the archaeal genomes were
small in size (100–200 amino acids). In contrast, 42% of
the bacterial TFs have between 200 and 300 amino acids
(vs. 26.5% of the archaeal TFs with this length). Nonetheless,
287 large TFs with amino acid length greater than 400,
corresponding to about 2.3%, were identified in the archaeal
repertoire (fig. 3). To determine the significance
of these findings, we randomly sampled 1,000 collections
of 3,918 proteins from the archaeal genome sequences
and compared their lengths with those observed in
TFs. As the distribution of average length of proteins in
the random samples followed a normal distribution,
a Z score was used as a test statistic. The Z score was calculated
as the number of standard deviations the observed value
(average length of an archaeal TF) is away from the mean
of the 1,000 random collections. This is obtained as the
ratio of the difference between the observed, $x$, and
the random expected, $\mu$, values to the standard deviation,
$\sigma$, that is, $Z = (x - \mu)/\sigma$. The P value was defined as the
fraction of the 1,000 random collections that showed an
average length greater than or equal to what was observed
in the archaeal TF collection. Using this approach for the
TF population, a Z score of 23.6 (corresponding to a P
value < $10^{-3}$) was found, indicating that TFs in archaea
tend to be significantly smaller than the overall proteome.
In contrast, the repertoire of TFs in E. coli K12 does not
exhibit such a tendency compared with the rest of the
proteome (see supplementary fig. S2, Supplementary Material
online). In fact, a higher proportion of TFs in E. coli
are longer than other proteins, indicating
that archaeal TFs are indeed encoded as small
genes. To test whether this observation is more general,
we compared the lengths of archaeal TFs against a complete
set of bacterial TFs available from the DBD database
(Kummerfeld and Teichmann 2006). We found that archaeal
TFs were significantly shorter than
bacterial ones (median size of 179 vs. 236 amino
acids, P < $2.2 \times 10^{-16}$, Wilcoxon test; see supplementary
fig. S3, Supplementary Material online). Because three of
the abundant families, ArsR, AsnC, and HTH_3, were
found to be composed of small proteins contributing
about 40% of the total TF repertoire (see below), we tested
whether these large families are
responsible for this tendency by excluding this set of TFs
from the complete collection and comparing the length
distribution of the remaining TFs with that of bacterial TFs. This comparison clearly
revealed that, independent of these large families, archaeal
TFs are shorter than bacterial ones
(median size of 190 vs. 236 amino acids, P < $2.2 \times
10^{-16}$, Wilcoxon test; see supplementary fig. S3, Supplementary
Material online). These observations raise the
question: if archaeal TFs are shorter than bacterial TFs,
do they also encode a smaller number of domains?
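The resampling test described above, $Z = (x - \mu)/\sigma$ over random collections of proteins, can be sketched in Python as follows. Several choices here are assumptions of the sketch rather than details taken from the study: the proteome is a toy list of lengths, sampling is done without replacement, the sample standard deviation is used for $\sigma$, and the empirical P value is taken in the direction of the observed deviation.

```python
import random
from statistics import mean, stdev


def resampling_test(all_lengths, observed_lengths, n_samples=1000, seed=0):
    """Compare the mean of an observed set against random sets of equal size.

    Returns (z, p): z is (x - mu) / sigma over the random means, and p is the
    fraction of random means at least as extreme as the observed mean, in the
    direction of the observed deviation (an assumption of this sketch).
    """
    rng = random.Random(seed)
    k = len(observed_lengths)
    x = mean(observed_lengths)
    random_means = [mean(rng.sample(all_lengths, k)) for _ in range(n_samples)]
    mu = mean(random_means)
    sigma = stdev(random_means)
    z = (x - mu) / sigma
    if x <= mu:
        extreme = sum(m <= x for m in random_means)
    else:
        extreme = sum(m >= x for m in random_means)
    return z, extreme / n_samples


# Toy proteome with hypothetical lengths; not data from the study.
toy_rng = random.Random(42)
proteome = [toy_rng.randint(50, 600) for _ in range(5000)]
# Deliberately pick the 200 shortest proteins as the "TF" set.
toy_tfs = sorted(proteome)[:200]

z, p = resampling_test(proteome, toy_tfs)
print(f"Z = {z:.1f}, empirical P = {p}")
```

With 1,000 random collections, an empirical P of 0 can only be reported as P < $10^{-3}$, which is the resolution limit of this kind of test.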
To address this, we compared the number of domains
in archaeal TFs with that seen in
bacterial ones, using all those TFs for which SUPERFAMILY
domain assignments were available (Madera et al.
2004). Of the 2,621 archaeal TFs for which domain assignments
were available, 1,963 were single-domain
proteins (~75%), whereas single-domain
TFs in bacteria comprised 50% of the total data
set analyzed. Further analysis of the distributions of the
number of domains in TFs of both major kingdoms
of life unambiguously revealed that archaeal TFs encode
fewer domains, independent of the exclusion
of the large archaeal families (P < $2.2 \times 10^{-16}$, Wilcoxon
test). These results clearly show that archaeal TFs
comprise a significant proportion of single-domain proteins.
One possibility is that most of these one-domain
proteins encode a DBD and might not contain a ligand-binding
domain, suggesting that although archaeal
TFs contain DBDs similar to those of bacteria, their mechanism of
action might be similar to that of eukaryotic TFs. In light of these
observations, it is possible to hypothesize that archaeal
TFs, although similar to bacterial TFs in their sequence recognition domains
(discussed below), might be similar to eukaryotic
TFs in a mechanistic sense.
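As a quick check on the single-domain proportion quoted above, the fraction can be recomputed from per-protein domain counts. In this Python sketch only the totals 1,963 and 2,621 come from the text; the split of the remaining 658 proteins between two and three domains is hypothetical and does not affect the result.

```python
from collections import Counter


def single_domain_fraction(domain_counts):
    """Fraction of proteins whose assigned domain count is exactly one."""
    counts = Counter(domain_counts)
    return counts[1] / len(domain_counts)


# 1,963 single-domain TFs out of 2,621 (from the text); the 500/158 split of
# the multidomain remainder is a hypothetical placeholder.
archaeal_domain_counts = [1] * 1963 + [2] * 500 + [3] * 158

fraction = single_domain_fraction(archaeal_domain_counts)
print(f"{fraction:.3f}")
```

The result rounds to 0.749, in line with the approximately 75% stated in the text.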
The high proportion of small TFs in archaea, together
with the observation that most archaea have few TFs per
genome, also suggests a dense combinatorial interplay of
TFs for mediating regulation. These data support various
possible scenarios, namely 1) regulation similar to bacteria,
where homodimers can regulate gene expression; 2) formation
of different oligomeric assemblies affected by
the interaction with metabolites associated with a particular
metabolic state, that is, the formation of oligomers of
different sizes (dimers, tetramers, octamers, and so
on), as has been observed for the members of the AsnC
family (with an average length of around 160 amino
acids), whose small TFs can form dimers, tetramers, or octamers
with differing regulatory functions (Koike et al.
2004), such as FL11 of Pyrococcus sp., which can form a disc
or a chromatin-like cylinder upon interaction of two peptides,
and TrmB of Pyrococcus furiosus, which is tetrameric
at ambient temperature and octameric in the presence of
its inducer (maltotriose or maltose) (Lee et al. 2005; Krug
et al. 2006); 3) binding of the same protein to a broad
spectrum of compounds or ligands, enhancing its activity
under different metabolic states, such as TrmB, which binds
maltose, sucrose, maltotriose, and trehalose
in decreasing order of affinity (Koike et al. 2004; Lee
et al. 2005); and 4) alternative physical interactions or co-complex
memberships with TBP–TFB–RNAp, which could also
modulate the structure of the regulatory network in archaea,
similar to eukarya. In this regard, Facciotti et al.
found, using protein coimmunoprecipitation, ChIP-Chip,
global transcriptional factor (GTF) perturbation and
knockout, and measurement of transcriptional changes,
that GTFs can associate with nearly
half of all putative promoters and showed evidence for at
least 7 of the 42 possible functional GTF pairs (Baliga
et al. 2000; Facciotti et al. 2007).
FIG. 3. Distribution of amino acid sequence lengths for TFs. On the x
axis, the intervals of protein size are shown, and on the y axis, the
normalized frequency of TFs per interval is shown. One thousand groups
of 3,918 protein sequences were randomly retrieved from archaeal
genome sequences to compare the length distribution of TFs
against other protein-coding genes. In each length interval, bars
marked as random represent the proportion of proteins in an
interval ± their standard deviations from the average in the random
samples.
Phylogenetic Distribution of TFs in Archaea
It has been previously proposed that DNA-binding TFs can
be grouped into families based on their amino acid sequence
similarity (Perez-Rueda and Collado-Vides 2000).
In order to determine the number of TF families associated
with archaeal genomes, all 3,918 DNA-binding TFs
were grouped into 75 families according to the PFAM database
(Finn et al. 2006). As elaborated below, we explored
the familial abundance in the archaeal genomes and the
relative contribution of each family to the proteome size
and overall proportion of TFs. This analysis also enabled us
to determine the families that are shared between archaea,
bacteria, and eukarya and the main functions of these
families.
The population of TF families was found to follow
a power-law distribution, with 13 families containing more
than 100 members each, representing 71% of the whole TF
repertoire (fig. 4). The three most populated families
are ArsR (721 TFs), AsnC (367 TFs), and HTH_3
(361 TFs), whereas the other ten families contained between
101 and 276 TFs. About 49 families comprised fewer than
30 TFs each, representing in total ~11% of the TF
repertoire. Previous analyses (Moreno-Campuzano et al. 2006;
Janga and Perez-Rueda 2009) suggest that global
regulators (GRs) in bacteria usually belong to small families;
however, in archaea this is apparently not the case, at least for
the GRs identified so far. For instance, ArsR and TrmB were
found to belong to two large families with 721 and 276
members, respectively.
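As a small worked example, the share of the repertoire held by the largest families can be computed directly from the counts quoted in the text (ArsR 721, AsnC 367, HTH_3 361, TrmB 276, out of 3,918 TFs). The helper names below are hypothetical, and the dictionary holds only the four family sizes stated explicitly; the remaining 71 families are not filled in.

```python
def rank_families(family_sizes):
    """Return (family, size) pairs sorted from largest to smallest."""
    return sorted(family_sizes.items(), key=lambda kv: kv[1], reverse=True)


def repertoire_share(family_sizes, total_tfs, min_size=0):
    """Fraction of the TF repertoire found in families with more than
    min_size members."""
    return sum(n for n in family_sizes.values() if n > min_size) / total_tfs


# Family sizes quoted in the text; all other families are omitted here.
reported = {"ArsR": 721, "AsnC": 367, "HTH_3": 361, "TrmB": 276}
TOTAL_TFS = 3918

top3 = dict(rank_families(reported)[:3])
top3_share = repertoire_share(top3, TOTAL_TFS)  # (721 + 367 + 361) / 3918
```

On these numbers the three largest families alone account for roughly 37% of all 3,918 TFs.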
Figure 5 shows that four families are universally
distributed across the four archaeal divisions (Crenarchaea,
Euryarchaea, Nanoarchaea, and Korarchaea), namely
HTH_3 (a family of putative activator proteins), AsnC
(associated with global regulation of amino acid
biosynthesis), TrmB (maltose-specific regulation), and ArsR
(detoxification processes). These families might belong to the
ancestral core of TFs in archaea. A second group of families
(PhoU and RpiR) was detected in all archaeal genomes,
with the exception of the endosymbiont N. equitans,
and hence can also be considered part of the archaeal
TF core set. These families are mainly putative regulators of
phosphate uptake (PhoU) and sugar metabolism (RpiR).
Based on these findings, it is possible to suggest that
archaea from new divisions might carry TFs from these
universal families, potentially regulating central metabolic
processes, as might have been the case in the last common
ancestor of archaea. Other families have a narrow distribution: TrpR was found
exclusively in Metallosphaera sedula, and CopY was found
in diverse Halobacterium strains, suggesting that they
might have been transferred laterally from bacteria to
archaea.
It is possible to speculate from these data that the abundance of
families like ArsR, AsnC, or HTH_3 might be a consequence
of archaeal lifestyles and a response to the deficit of TFs; that is,
archaea might have expanded certain families associated
with small sizes to generate a plethora of combinatorial
possibilities for regulating their gene expression. It is
noteworthy in this context that these three families
contribute around 40% of the total TFs with lengths
between 100 and 200 amino acids.
In order to understand the similarity of TF repertoires
per family among the archaeal genomes, a hierarchical
centroid linkage clustering algorithm (Eisen et al. 1998) was
applied with uncentered correlation as the similarity
measure. The clustering results were visualized using the
TreeView program (Saldanha 2004). From this clustering, six
groups of archaea sharing a common set of TFs were
identified (based on a node correlation value ≥ 0.6), whereas
three organisms could not be included in any cluster
and were hence considered orphans (see fig. 5). It is
evident from this analysis that these six clusters reflect the
major taxonomic positions of the organisms analyzed,
although some exceptions could be observed. The TF
repertoire also reflects the main lifestyle of archaea, such as the
FIG. 4. Abundance of TF families in archaeal genomes. The proportion of TFs in each family was calculated as the fraction of total TFs identified that
belonged to a particular family. The families are displayed from largest to smallest size. Families with fewer than 20 members are not displayed
as they corresponded to less than 6% of the total data set.
Genomic Analysis of Archaeal Transcription Factors · doi:10.1093/molbev/msq033 MBE
first cluster, which includes mainly methanogenic archaea
(such as Methanocaldococcus jannaschii and
Methanococcus maripaludis S2, among others). The intermixing of
organisms in some clusters might be a consequence of
lateral gene transfer events, as has been suggested for the
archaea included in the fourth cluster, that is, N. equitans
(Nanoarchaeum) and I. hospitalis (Desulfurococcales)
(Podar et al. 2008).
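The clustering step described above can be illustrated with a short, dependency-free sketch. It is not the Cluster software of Eisen et al. (1998): it simply merges the two most similar groups (by uncentered correlation of their centroids) until no pair reaches the cutoff, which approximates cutting the tree at that node correlation. The function names, the toy profiles, and the stopping rule are assumptions made for illustration.

```python
import math


def uncentered_correlation(x, y):
    """Uncentered Pearson correlation (cosine of the angle between x and y)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return dot / (norm_x * norm_y)


def centroid_linkage_clusters(profiles, min_corr=0.6):
    """Agglomerative clustering of genomes by their TF-family profiles.

    profiles maps a genome name to a list of TF counts per family. Merging
    stops when the best pair of centroids falls below min_corr.
    """
    clusters = [([name], list(vec)) for name, vec in profiles.items()]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                r = uncentered_correlation(clusters[i][1], clusters[j][1])
                if best is None or r > best[0]:
                    best = (r, i, j)
        r, i, j = best
        if r < min_corr:
            break
        (members_i, cen_i), (members_j, cen_j) = clusters[i], clusters[j]
        n_i, n_j = len(members_i), len(members_j)
        # The new centroid is the size-weighted mean of the merged centroids.
        centroid = [(n_i * a + n_j * b) / (n_i + n_j) for a, b in zip(cen_i, cen_j)]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append((members_i + members_j, centroid))
    return [sorted(members) for members, _ in clusters]


# Hypothetical genomes (A-E) with TF counts for three hypothetical families.
toy_profiles = {
    "A": [10, 1, 0], "B": [8, 2, 0],
    "C": [0, 1, 9], "D": [1, 0, 7],
    "E": [0, 5, 0],
}
toy_clusters = centroid_linkage_clusters(toy_profiles, min_corr=0.6)
```

In the toy data, genomes A and B group together, as do C and D, whereas E stays on its own, analogous to the organisms treated as orphans in the text.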
Comparison of the TF Repertoires of Bacteria and
Archaea
It has been proposed that bacteria and archaea share a great
similarity at the gene regulatory level (Aravind and Koonin
1999), with archaeal TFs clearly being bacterial-like, whereas
their basal transcriptional machinery is clearly related to that of
eukarya. Thus, to understand the degree of conservation of
TFs between archaea, bacteria, and eukarya, the probable
homologues of the repertoire of transcriptional regulators
were identified (see Materials and Methods). From this
analysis, it was found that 53% of the 3,918 archaeal TFs
exhibit at least one homologue in bacterial genomes (fig.
6). In particular, archaea and clostridia share TFs from
the families HTH_3, Xre, and Rrf2, whereas TFs from the
families DeoR, IclR, and cold shock are shared with several
actinobacteria and some gammaproteobacteria. Another
45% of the 3,918 TFs were clearly identified as archaeal
specific, whereas another 6% exhibited homology with bacterial
and eukaryotic TFs and about 2% exhibited homology with
only eukaryotes (mainly Ascomycetes), possibly
suggesting lateral gene transfer. This reinforces the notion
that TFs of bacteria and archaea share a common ancestry
and highlights a close relationship between the TFs of
archaea and firmicutes, motivating experiments
to test whether they are functionally related
as well.
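The homologue assignment behind these percentages could be sketched as follows. The 60% query-coverage requirement is the one given in the legend of figure 6; the E-value cutoff of 1e-6 is an assumed reading of that legend, and the function names, the tuple layout of a hit, and the category labels are hypothetical. Parsing of the actual BLAST output is left out.

```python
from collections import Counter

MIN_QUERY_COVERAGE = 0.60  # alignment must cover at least 60% of the query
MAX_EVALUE = 1e-6          # assumed E-value cutoff


def is_homologue(aligned_query_len, query_len, evalue):
    """Apply the coverage and E-value criteria to a single hit."""
    coverage = aligned_query_len / query_len
    return coverage >= MIN_QUERY_COVERAGE and evalue <= MAX_EVALUE


def classify_tfs(query_lengths, hits):
    """Label each archaeal TF by the domains in which it has a homologue.

    query_lengths maps a TF identifier to its length in amino acids; hits is
    an iterable of (tf_id, domain, aligned_query_len, evalue) tuples, where
    domain is "bacteria" or "eukarya".
    """
    domains = {tf: set() for tf in query_lengths}
    for tf, domain, aligned, evalue in hits:
        if is_homologue(aligned, query_lengths[tf], evalue):
            domains[tf].add(domain)
    labels = {}
    for tf, found in domains.items():
        if found == {"bacteria", "eukarya"}:
            labels[tf] = "bacteria and eukarya"
        elif found == {"bacteria"}:
            labels[tf] = "bacteria only"
        elif found == {"eukarya"}:
            labels[tf] = "eukarya only"
        else:
            labels[tf] = "archaea specific"
    return labels


def category_fractions(labels):
    """Fraction of TFs falling in each category."""
    counts = Counter(labels.values())
    return {category: n / len(labels) for category, n in counts.items()}
```

Given parsed hits, `category_fractions(classify_tfs(query_lengths, hits))` would yield proportions of the kind summarized in the pie chart of figure 6.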
Archaeal TFs Are Predominantly Comprised of
Bacterial DBDs
An important aspect of TFs is their ability to organize into
multidomain proteins, and hence understanding them in
a structural context can provide important clues about
how they coordinate regulation. Therefore, the repertoire
of archaeal TFs was analyzed using the library of HMMs
deposited in the SUPERFAMILY database (Madera et al. 2004).
From this analysis, we found that the most abundant
DBD in these TFs is the winged helix DBD, detected in
45% of the total set, followed by the lambda repressor-like
DBD (~15%). This result is similar to that previously
observed for the repertoire of bacterial TFs, reinforcing the
notion of common ancestry in the transcriptional
regulatory machinery of prokaryotes (Aravind and Koonin 1999;
Aravind et al. 2005). Alternative DBDs, such as the IHF-like
DBD, the PhoU-like domain, the nucleic acid–binding domain
associated with cold shock proteins, or zinc-finger domains,
FIG. 5. Clustering of TF families and archaeal genomes. A hierarchical centroid linkage-clustering algorithm was applied with uncentered
correlation as the similarity measure and complete linkage (Eisen et al. 1998). Brackets indicate the clusters identified by using a correlation
value ≥ 0.6. Nomenclature is as follows: Crenarchaea (C), Euryarchaea (E), Korarchaeota (K), and Nanoarchaeum (N).
were also identified, although in lower proportions
(corresponding to around 12% of the total TFs). Several of these
domains were also identified in bacterial TFs. Zinc fingers
represent an intriguing result because this class of proteins
had otherwise been found exclusively in eukaryotic transcriptional
proteins.
Most TF families have been found to undergo
lineage-specific duplications resulting in the accumulation of
particular families in some microbial species, such as the LysR family in
E. coli (45 TFs; Janga and Perez-Rueda 2009) or ArsR in
Methanosarcina acetivorans C2A (48 TFs). Indeed, this observation
is consistent with the more general notion that a genome
evolves from a set of precursor genes to a mature size by
gene duplications and increasing modifications (Yanai
et al. 2000; Koonin et al. 2002). Therefore, the domain
organization and, more generally, the properties of the TF
repertoire described for archaeal genomes in this study raise
diverse questions, such as whether the evolution of regulatory networks
in archaea differs from that observed in E. coli, B. subtilis,
and/or other biological systems (Aravind and Koonin 1999;
Koike et al. 2004; Lee et al. 2005; Lozada-Chavez et al. 2006;
Janga et al. 2008, 2009; Perez and Groisman 2009).
Conclusions
In this study, 52 archaeal genome sequences representing
a plethora of lifestyles were analyzed to identify the
repertoire of proteins involved in controlling gene
expression. Given that no archaeal
genome is currently completely characterized at the level
of transcriptional regulation, the repertoire of TFs and
the conclusions presented here can be a good starting
point for understanding transcriptional regulatory networks
in archaeal genomes. In particular, because the archaeal
genomes studied here are from different taxa, the results
presented here should be valid with high confidence for a wide
range of archaea.
Our analysis suggests that although there is a correlation
between the number of TFs and genome size, there is also
a deficit of TFs in all the archaeal genomes, and that
this deficit in TFs, and hence in regulatory plasticity, is
possibly compensated by the ability of the small-sized TFs
enriched in archaea to form different assembly
structures. We also note that there is an important fraction
of transcriptional regulators common to archaea and
bacteria. The distribution of TF families common to
prokaryotes points to an ancient evolution of the transcriptional
machinery in bacteria and archaea. We found that the
number of TF families is distributed almost homogeneously
among all archaea, although a small proportion of
them are overrepresented in all archaea but not in
bacteria. Further research is necessary to determine the
physiological function of such species-specific or shared
transcriptional regulators. Nevertheless, the analysis
presented here provides a basis for understanding the
organization and evolution of regulatory networks in archaea.
Supplementary Material
Supplementary figures and tables are available at Molecular
Biology and Evolution online (http://www.mbe.oxfordjournals.org/).
Acknowledgments
E.P.R. was financed by a grant (ASTF 224-2005) from EMBO
and by a grant (IN-217508) from DGAPA-UNAM. E.P.R.
thanks Lorenzo Segovia, Claudia Martinez-Anaya, and
Javier Diaz-Mejia for their helpful comments in the
preparation of the manuscript and Rosa Maria Gutierrez in the
clustering analysis. S.C.J. acknowledges financial support
from MRC Laboratory of Molecular Biology and Cambridge
Commonwealth Trust. We would also like to thank Nitish
Mittal and Arthur Wuster for critically reading the
manuscript and providing helpful comments.
FIG. 6. Distribution of archaeal TFs shared by the three cellular domains, archaea, bacteria, and eukarya. Pie chart showing the distribution of
archaeal TF homologues identified in different domains of life. BLAST searches were performed with all TFs previously identified against the total
sequences of bacterial and eukaryotic genomes. A protein was considered a homologue if the alignment covered at least 60% of the query
sequence, with an E value ≤ 10^-6.
References
Aravind L, Anantharaman V, Balaji S, Babu MM, Iyer LM. 2005. The
many faces of the helix-turn-helix domain: transcription
regulation and beyond. FEMS Microbiol Rev. 29:231–262.
Aravind L, Koonin EV. 1999. DNA-binding proteins and evolution of
transcription regulation in the archaea. Nucleic Acids Res.
27:4658–4670.
Auguet JC, Barberan A, Casamayor EO. 2009. Global ecological
patterns in uncultured archaea. ISME J.
Babu MM, Luscombe NM, Aravind L, Gerstein M, Teichmann SA.
2004. Structure and evolution of transcriptional regulatory
networks. Curr Opin Struct Biol. 14:283–291.
Babu MM, Teichmann SA. 2003. Evolution of transcription factors
and the gene regulatory network in Escherichia coli. Nucleic
Acids Res. 31:1234–1244.
Baliga NS, Goo YA, Ng WV, Hood L, Daniels CJ, DasSarma S. 2000. Is
gene expression in Halobacterium NRC-1 regulated by multiple
TBP and TFB transcription factors? Mol Microbiol. 36:1184–1185.
Bell SD. 2005. Archaeal transcriptional regulation—variation on
a bacterial theme? Trends Microbiol. 13:262–265.
Brinkrolf K, Brune I, Tauch A. 2006. Transcriptional regulation of
catabolic pathways for aromatic compounds in
Corynebacterium glutamicum. Genet Mol Res. 5:773–789.
Brune I, Brinkrolf K, Kalinowski J, Puhler A, Tauch A. 2005. The
individual and common repertoire of DNA-binding
transcriptional regulators of Corynebacterium glutamicum,
Corynebacterium efficiens, Corynebacterium diphtheriae and
Corynebacterium jeikeium deduced from the complete genome
sequences. BMC Genomics. 6:86.
Chaban B, Ng SY, Jarrell KF. 2006. Archaeal habitats—from the
extreme to the ordinary. Can J Microbiol. 52:73–116.
Clementino MM, Fernandes CC, Vieira RP, Cardoso AM,
Polycarpo CR, Martins OB. 2007. Archaeal diversity in naturally
occurring and impacted environments from a tropical region. J
Appl Microbiol. 103:141–151.
Coulson RM, Touboul N, Ouzounis CA. 2007. Lineage-specific
partitions in archaeal transcription. Archaea 2:117–125.
Eisen MB, Spellman PT, Brown PO, Botstein D. 1998. Cluster analysis
and display of genome-wide expression patterns. Proc Natl Acad
Sci U S A. 95:14863–14868.
Esser C, Martin W. 2007. Supertrees and symbiosis in eukaryote
genome evolution. Trends Microbiol. 15:435–437.
Facciotti MT, Reiss DJ, Pan M, et al. (11 co-authors). 2007. General
transcription factor specified global gene regulation in archaea.
Proc Natl Acad Sci U S A. 104:4630–4635.
Falb M, Muller K, Konigsmaier L, Oberwinkler T, Horn P, von
Gronau S, Gonzalez O, Pfeiffer F, Bornberg-Bauer E, Oesterhelt D.
2008. Metabolism of halophilic archaea. Extremophiles 12:177–196.
Finn RD, Mistry J, Schuster-Bockler B, et al. (13 co-authors). 2006. Pfam:
clans, web tools and services. Nucleic Acids Res. 34:D247–D251.
Gama-Castro S, Jimenez-Jacinto V, Peralta-Gil M, et al. (19
co-authors). 2008. RegulonDB (version 6.0): gene regulation model
of Escherichia coli K-12 beyond transcription, active
(experimental) annotated promoters and Textpresso navigation.
Nucleic Acids Res. 36:D120–D124.
Goede B, Naji S, von Kampen O, Ilg K, Thomm M. 2006. Protein-
protein interactions in the archaeal transcriptional machinery:
binding studies of isolated RNA polymerase subunits and
transcription factors. J Biol Chem. 281:30581–30592.
Janga SC, Collado-Vides J. 2007. Structure and evolution of gene
regulatory networks in microbial genomes. Res Microbiol.
158:787–794.
Janga SC, Collado-Vides J, Babu MM. 2008. Transcriptional
regulation constrains the organization of genes on eukaryotic
chromosomes. Proc Natl Acad Sci U S A. 105:15761–15766.
Janga SC, Perez-Rueda E. 2009. Plasticity of transcriptional
machinery in bacteria is increased by the repertoire of regulatory
families. Comput Biol Chem. 33:261–268.
Janga SC, Salgado H, Martinez-Antonio A. 2009. Transcriptional
regulation shapes the organization of genes on bacterial
chromosomes. Nucleic Acids Res. 37:3680–3688.
Koike H, Ishijima SA, Clowney L, Suzuki M. 2004. The archaeal feast/
famine regulatory protein: potential roles of its assembly forms
for regulating transcription. Proc Natl Acad Sci U S A.
101:2840–2845.
Koonin EV, Wolf YI, Karev GP. 2002. The structure of the protein
universe and genome evolution. Nature 420:218–223.
Krug M, Lee SJ, Diederichs K, Boos W, Welte W. 2006. Crystal
structure of the sugar binding domain of the archaeal
transcriptional regulator TrmB. J Biol Chem. 281:10976–10982.
Kruger K, Hermann T, Armbruster V, Pfeifer F. 1998. The
transcriptional activator GvpE for the halobacterial gas vesicle
genes resembles a basic region leucine-zipper regulatory protein.
J Mol Biol. 279:761–771.
Kummerfeld SK, Teichmann SA. 2006. DBD: a transcription factor
prediction database. Nucleic Acids Res. 34:D74–D81.
Kyrpides NC, Woese CR. 1998. Archaeal translation initiation
revisited: the initiation factor 2 and eukaryotic initiation factor
2B alpha-beta-delta subunit families. Proc Natl Acad Sci U S A.
95:3726–3730.
Lange C, Zaigler A, Hammelmann M, Twellmeyer J, Raddatz G,
Schuster SC, Oesterhelt D, Soppa J. 2007. Genome-wide analysis
of growth phase-dependent translational and transcriptional
regulation in halophilic archaea. BMC Genomics. 8:415.
Lee SJ, Moulakakis C, Koning SM, Hausner W, Thomm M, Boos W.
2005. TrmB, a sugar sensing regulator of ABC transporter genes
in Pyrococcus furiosus exhibits dual promoter specificity and is
controlled by different inducers. Mol Microbiol. 57:1797–1807.
Lee TI, Rinaldi NJ, Robert F, et al. (21 co-authors). 2002.
Transcriptional regulatory networks in Saccharomyces
cerevisiae. Science 298:799–804.
Leonard PM, Smits SH, Sedelnikova SE, Brinkman AB, de Vos WM,
van der Oost J, Rice DW, Rafferty JB. 2001. Crystal structure of
the Lrp-like transcriptional regulator from the archaeon
Pyrococcus furiosus. EMBO J. 20:990–997.
Li W, Godzik A. 2006. Cd-hit: a fast program for clustering and
comparing large sets of protein or nucleotide sequences.
Bioinformatics 22:1658–1659.
Lopez-Garcia P. 1999. DNA supercoiling and temperature
adaptation: a clue to early diversification of life? J Mol Evol. 49:439–452.
Lozada-Chavez I, Janga SC, Collado-Vides J. 2006. Bacterial regulatory
networks are extremely flexible in evolution. Nucleic Acids Res.
34:3434–3445.
Madera M, Vogel C, Kummerfeld SK, Chothia C, Gough J. 2004. The
SUPERFAMILY database in 2004: additions and improvements.
Nucleic Acids Res. 32:D235–D239.
Marchler-Bauer A, Anderson JB, Derbyshire MK, et al. (25 co-authors).
2007. CDD: a conserved domain database for interactive domain
family analysis. Nucleic Acids Res. 35:D237–D240.
Martin W, Hoffmeister M, Rotte C, Henze K. 2001. An overview of
endosymbiotic models for the origins of eukaryotes, their ATP-
producing organelles (mitochondria and hydrogenosomes), and
their heterotrophic lifestyle. Biol Chem. 382:1521–1539.
Martin W, Muller M. 1998. The hydrogen hypothesis for the first
eukaryote. Nature 392:37–41.
Mittal N, Roy N, Babu MM, Janga SC. 2009. Dissecting the expression
dynamics of RNA-binding proteins in posttranscriptional
regulatory networks. Proc Natl Acad Sci U S A. 106:20300–20305.
Moreira D, Lopez-Garcia P. 1998. Symbiosis between methanogenic
archaea and delta-proteobacteria as the origin of eukaryotes: the
syntrophic hypothesis. J Mol Evol. 47:517–530.
Moreno-Campuzano S, Janga SC, Perez-Rueda E. 2006. Identification
and analysis of DNA-binding transcription factors in Bacillus subtilis
and other Firmicutes—a genomic approach. BMC Genomics. 7:147.
Moreno-Hagelsieb G, Janga SC. 2008. Operons and the effect of
genome redundancy in deciphering functional relationships
using phylogenetic profiles. Proteins 70:344–352.
Nam YD, Chang HW, Kim KH, Roh SW, Kim MS, Jung MJ, Lee SW,
Kim JY, Yoon JH, Bae JW. 2008. Bacterial, archaeal, and eukaryal
diversity in the intestines of Korean people. J Microbiol. 46:491–501.
Napoli A, van der Oost J, Sensen CW, Charlebois RL, Rossi M,
Ciaramella M. 1999. An Lrp-like protein of the
hyperthermophilic archaeon Sulfolobus solfataricus which binds to its own
promoter. J Bacteriol. 181:1474–1480.
Oelgeschlager E, Rother M. 2008. Carbon monoxide-dependent
energy metabolism in anaerobic bacteria and archaea. Arch
Microbiol. 190:257–269.
Perez JC, Groisman EA. 2009. Evolution of transcriptional regulatory
circuits in bacteria. Cell 138:233–244.
Perez-Rueda E, Collado-Vides J. 2000. The repertoire of DNA-binding
transcriptional regulators in Escherichia coli K-12. Nucleic Acids
Res. 28:1838–1847.
Perez-Rueda E, Collado-Vides J. 2001. Common history at the origin
of the position-function correlation in transcriptional regulators
in archaea and bacteria. J Mol Evol. 53:172–179.
Perez-Rueda E, Collado-Vides J, Segovia L. 2004. Phylogenetic
distribution of DNA-binding transcription factors in bacteria
and archaea. Comput Biol Chem. 28:341–350.
Podar M, Anderson I, Makarova KS, et al. (27 co-authors). 2008. A
genomic analysis of the archaeal system Ignicoccus
hospitalis-Nanoarchaeum equitans. Genome Biol. 9:R158.
Saldanha AJ. 2004. Java Treeview—extensible visualization of
microarray data. Bioinformatics 20:3246–3248.
Sierro N, Makita Y, de Hoon M, Nakai K. 2008. DBTBS: a database of
transcriptional regulation in Bacillus subtilis containing
upstream intergenic conservation information. Nucleic Acids Res.
36:D93–D96.
Woese C. 1998. The universal ancestor. Proc Natl Acad Sci U S A.
95:6854–6859.
Wu J, Wang S, Bai J, Shi L, Li D, Xu Z, Niu Y, Lu J, Bao Q. 2008.
ArchaeaTF: an integrated database of putative transcription
factors in archaea. Genomics 91:102–107.
Yan C. 2006. A hidden Markov model approach to model protein
sequence and structural information: identification of
helix-turn-helix DNA-binding motif. In: Proceedings of IEEE
International Conference on Granular Computing. p. 385–388.
Yanai I, Camacho CJ, DeLisi C. 2000. Predictions of gene family
distributions in microbial genomes: evolution by gene
duplication and modification. Phys Rev Lett. 85:2641–2644.
The original printed version of the thesis includes a copy of a book chapter to be 
printed with Horizon Press in 2011:  
 
Janga, S.C. and Moreno-Hagelsieb, G. 'Operons and bacterial genome organization' to 
be published with Horizon Scientific Press for an edited book on 'Bacterial Gene 
Regulation and Transcriptional Networks' (Ed: M. Madan Babu, MRC Laboratory of 
Molecular Biology, Cambridge, U. K) 
 
This chapter has been removed from the electronic file for copyright reasons. 
Figures 1a and 1b: proportion (y axis) versus transcription unit (TU) size (x axis: 1, 2, 3, 4, 5, >= 6); panel titles: E. coli K12 and B. subtilis
Figure 2
Figure 3
Figure 4a
Figure 4b