PLOS Computational Biology (PLoS Comput Biol). ISSN 1553-734X; eISSN 1553-7358. Public Library of Science, San Francisco, CA USA.
Manuscript: PCOMPBIOL-D-19-01662. DOI: 10.1371/journal.pcbi.1007982
Research Article
Subject areas: Optimization; Simulation and modeling; Differential equations; Transfer functions; Nonlinear dynamics; Protein kinase signaling cascade; Phosphorylation; Convolution
It’s about time: Analysing simplifying assumptions for modelling multi-step pathways in systems biology
Short title: Simplified modelling of multi-step pathways in systems biology
Niklas Korsbo^{1,2} (http://orcid.org/0000-0001-9811-3190): Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Writing – original draft, Writing – review & editing
Henrik Jönsson^{1,2,3}* (http://orcid.org/0000-0003-2340-588X): Conceptualization, Methodology, Project administration, Resources, Writing – review & editing
1 The Sainsbury Laboratory, University of Cambridge, Cambridge, United Kingdom
2 Department of Applied Mathematics and Theoretical Physics, University of Cambridge, Cambridge, United Kingdom
3 Department of Astronomy and Theoretical Physics, Computational Biology and Biological Physics, Lund University, Lund, Sweden
Editor: Christopher V. Rao, University of Illinois at Urbana-Champaign, United States
The authors have declared that no competing interests exist.
* E-mail: henrik.jonsson@slcu.cam.ac.uk
PLoS Comput Biol 16(6): e1007982. Received: 26 September 2019; Accepted: 27 May 2020; Published: 29 June 2020.
Copyright: 2020 Korsbo, Jönsson. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Thoughtful use of simplifying assumptions is crucial to make systems biology models tractable while still representative of the underlying biology. A useful simplification can elucidate the core dynamics of a system. A poorly chosen assumption can, however, either render a model too complicated to draw conclusions from or prevent an otherwise accurate model from describing experimentally observed dynamics. Here, we perform a computational investigation of sequential multi-step pathway models that contain fewer pathway steps than the system they are designed to emulate. We demonstrate when such models will fail to reproduce data and how detrimental truncation of a pathway leads to detectable signatures in model dynamics and its optimised parameters. An alternative assumption is suggested for simplifying such pathways. Rather than assuming a truncated number of pathway steps, we propose to assume that the rate of information propagation along the pathway is homogeneous and, instead, to let the length of the pathway be a free parameter. We first focus on linear pathways that are sequential and have first-order kinetics, and we show how this assumption results in a three-parameter model that consistently outperforms its truncated rival and a delay differential equation alternative in recapitulating observed dynamics. We then show how the proposed assumption allows for similarly terse and effective models of non-linear pathways. Our results provide a foundation for well-informed decision making during model simplifications.
Author summary
Mathematical modelling can be a highly effective way of condensing our understanding of biological processes and highlighting their most important aspects. Effective models are based on simplifying assumptions that reduce complexity while still retaining the core dynamics of the original problem. Finding such assumptions is, however, not trivial. In this paper, we explore ways in which one can simplify long chains of simple reactions wherein each step is only dependent on its predecessor. After generating synthetic data from models that describe such chains in explicit detail, we compare how well different simplifications retain the original dynamics. We show that the most common such simplification, which is to ignore parts of the chain, often renders models unable to account for time delays. However, we also show that when such a simplification has had a detrimental effect it leaves a detectable signature in its optimised parameter values. We also propose an alternative assumption which leads to a highly effective model with only three parameters. By comparing the effects of these simplifying assumptions in thousands of different cases and for different conditions we are able to clearly show when and why one is preferred over the other.
Funding: This work was supported by the Gatsby Charitable Foundation (grant GAT3395-PR4B; https://www.gatsby.org.uk/). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Publication update: 2020-07-10
Data Availability: All code and instructions on how to reproduce all results are available at the Sainsbury Laboratory Gitlab Repository https://gitlab.com/slcu/teamhj/publications/korsbo_et_al_2020.
Introduction
Biochemical reaction networks are often complicated and any attempt to describe them using mathematical models relies heavily on simplifying assumptions [1]. Effective models are often built upon simplifying assumptions that avoid over-fitting by using as few free parameters as possible while still capturing the main properties of the biological system [2, 3]. Thoughtful assumptions, as well as robust methods to identify parameter values, and (semi) global analysis of dynamical behaviour within model spaces, are all essential when evaluating models [4–7]. However, assumptions that are beneficial in one setting may be detrimental in another and it is important, although non-trivial, to identify when this happens [8].
Multi-step processes are ubiquitous in biology. Examples are transcription and translation, where an RNA polymerase or a ribosome can perform thousands of sequential reactions before a protein is produced. Yet, in gene-regulatory networks, this is often reduced to a one or two-step reaction of a transcription factor that may lead to an mRNA before it leads to a finished protein (e.g. [1, 9–11]). Another example is kinase cascades, where the product of a kinase triggers the action of downstream kinases [12]. A well known such cascade can be found in the MAP kinases, which are triggered by the MAP kinase kinases, which in turn are triggered by the MAP kinase kinase kinases [13–18]. There are also molecules which must undergo sequential multi-site phosphorylations before they are activated and can pass on any signalling [19, 20]. This is, for example, important for the Drosophila circadian clock protein CLOCK whose inactivity, activity, and degradation are thought to be governed by its sequential states of phosphorylation [21]. Sequential multi-step reactions can also be important for signal perception and transduction pathways. One example is the receptor-like kinase FLAGELLIN SENSING 2 which, upon detection of a pathogen, triggers a long chain of phosphotransfers, phosphorylations, and subcellular re-localisations that eventually leads to an immune response in Arabidopsis thaliana [22, 23]. Another well studied system is the TGFβ growth factor, which similarly triggers a sequence of phosphorylation steps before affecting the expression of downstream genes [24–26].
Linear pathways represent a large class of multi-step reactions which are both biologically relevant and theoretically approachable. Multi-step pathways can be represented as a chain of state changes where the activation rate of one state is dependent on the activity of the previous state. The mechanisms by which one active state regulates the activation of the next may be complicated but it can be useful to approximate these as being linear. This is partly because it is a minimally complex assumption and partly because many biochemical reactions appear to be linear as long as they operate in a weakly activated manner, far from saturation [12, 27].
Linear pathways are dynamically important and can entirely change the qualitative behaviour of a model. Their main effects are to supply signal amplification/dampening and to provide delays in the signalling [12, 28]. The amplitude modulation of the signalling is governed by the ratio between the activation and inactivation rates; a pathway step will provide amplification if its activation rate exceeds its inactivation rate. The time-delay, on the other hand, is governed by the inactivation rates and by the length of the pathways [12, 27]. These time-delays can have a significant impact on how a biological system works. A striking example of this is that delays are required for oscillations to be possible [29–31].
The modelling of linear pathways poses a specific set of challenges. Full enumeration of the linear pathway greatly increases model complexity yet adds disproportionately little in terms of dynamical range. However, even if the individual steps are of little dynamical significance, the aggregate effect of the full pathway may not be. There is, therefore, a need for a simplifying assumption which reduces the complexity of the linear pathway while still representing its total effect.
A common way of simplifying linear pathways is to ignore most of the reaction steps and assume that a model can recapitulate their effect using only one or a few steps [3]. Such topological model reduction is common and the approach has been rigorously analysed for both simple and complex networks [32–34]. While important, such analyses often presuppose precise knowledge of the system that is being simplified. Nevertheless, while this assumption is often implicit, it is easy to find examples where it has been used to simplify multi-step reactions in real systems—where many details are unavailable—such as protein production (e.g. [7, 35–37]); protein-to-protein signalling networks (e.g. [7, 36–38]); protein modifications such as phosphorylations, methylations, and ubiquitinations (e.g. [37, 38]); and more. A question that remains is then what dynamical behaviour a model is prevented from reproducing when this kind of simplifying assumption is applied to partially unknown systems.
An alternative simplification is to represent the effect of the linear pathway using a fixed time-delay in the model. Focusing on this aspect of the linear pathway and assuming that all other aspects are negligible allows for a terse model description using delayed differential equations (DDEs) [39–42]. However, it is not clear how such an assumption limits a model’s ability to recapitulate the dynamics of the full system.
A third simplification is to make use of a gamma-distributed delay. This approach models the output of the linear pathway as the convolution between the input to the first pathway step and the probability density function of the gamma distribution. This convolution has been used to describe delays in a diverse set of processes, including drug uptake [43, 44], circadian clocks [45–47], population dynamics [48] and even traffic jams [49]. Furthermore, it was recently shown to be effective at simplifying specific models while still allowing them to retain the dynamical properties of the original models [47]. Similar to the fixed-delay approach, it is commonly used as a method to introduce a delay without explicit regard to what the underlying cause of that delay is. It can be derived from a chain of identical linear processes which indicates that it may be particularly relevant for linear pathways. However, an understanding of how well this simplification can represent a general linear pathway is still missing.
The problem of how to simplify non-linear pathways is even more complex. Both the dynamical contribution and the simplification of certain non-linear, sequential, pathways have been studied previously [1, 12, 32]. However, the diversity of possible non-linear pathways and their resistance to analytical exploration have left us with relatively few tools that allow for their simplification without sacrificing the ability to either describe or predict the dynamics of real pathways.
Here, we examine the dynamical effect of sequential multi-step pathways on a system and especially whether certain simplifying assumptions yield models capable of reproducing those dynamical effects. We primarily focus on linear pathways, where each step is linearly dependent on the last, and first analyse what dynamical properties a model will be unable to reproduce when it is simplified using pathway truncation. We show how such models may be incapable of producing an output that is both as delayed and as sharply defined as the output of the full systems that they are trying to emulate. This analysis further led us to a diagnostic tool for revealing when such a model assumption has had detrimental effects.
Thereafter, we suggest the use of an alternative simplifying assumption and demonstrate its effectiveness. Rather than assuming a fixed (and truncated) pathway length, we assume a fixed rate of information propagation along a pathway of dynamic length. While similar ideas have recently been used, their effect has not to our knowledge been systematically studied before [47, 50]. We use this assumption to define a three-parameter model which can recapture the dynamics of arbitrary linear pathways with high fidelity. The assumption allows for a direct derivation of the gamma-distributed delay and it allows the model parameters to be anchored to the underlying biology. Furthermore, it outperforms the use of the reduced step approximation as well as the fixed-delay approximation and it provides a building block for an operational model inference approach [51].
Finally, we expand our analysis to non-linear pathways in the form of kinase cascades where each pathway step can become saturated. Here, we find a similar, albeit smaller, performance difference between different model assumptions when it comes to reproducing data and that the two assumptions both are reasonably good at predicting the results of novel conditions as long as they have been trained on sufficient data.
Results
Linear pathway truncation causes different degrees of error for different underlying distributions of reaction rates
We set out to explore the dynamical consequences of misrepresenting the number of pathway steps in a model of a linear pathway. The main aim was to understand whether and how a short (truncated) linear pathway model fails to reproduce the dynamics generated by a longer pathway.
To investigate this, we first defined a model wherein a sequence of n states (which we will also refer to as steps), with concentrations X_{1}, X_{2}, …, X_{n}, each activates its successor. A step i in such a pathway is linear if
$$\frac{dX_i}{dt} = \alpha_i \cdot X_{i-1} - \beta_i \cdot X_i, \tag{1}$$
for some activation rate constant α_{i} and degradation/inactivation rate constant β_{i}. We define the entire pathway as linear if this equation holds for all steps. In this study, we specifically focus on the relationship between an input and the output of a linear pathway and not on the relative concentrations along the different pathway steps. We can thus define a set of new parameters, such that scaling is done by a single parameter, γ, and the response rate of pathway step i to changes in the previous step is governed by a parameter r_{i} (Methods). This leads to a model that describes how an n-step linear pathway transforms an input signal, I(t), to an output, as defined by the concentration of the n-th step, which is given by
$$\frac{dX_1}{dt} = r_1 \cdot (\gamma \cdot I(t) - X_1), \tag{2}$$
$$\frac{dX_i}{dt} = r_i \cdot (X_{i-1} - X_i) \quad \forall i \in \{2, 3, \ldots, n\}. \tag{3}$$
Synthetic data sets were generated using Eqs 2 and 3 with different pathway lengths, n_{data} ∈ {1, 2, …, 50}. The response rates were drawn from a log-uniform distribution, r_{i} ∼ 10^{U(−2, 1)}, and the scaling parameter, γ, was set to one.
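As a minimal sketch of this sampling step (not the authors' code, which is available in their repository), log-uniform response rates can be drawn by sampling the exponent uniformly on [−2, 1]. The function name `sample_response_rates` and its `seed` argument are illustrative assumptions.

```python
import random


def sample_response_rates(n_data, low_exp=-2.0, high_exp=1.0, seed=None):
    """Draw n_data response rates r_i = 10**u with u ~ U(low_exp, high_exp)."""
    rng = random.Random(seed)
    return [10.0 ** rng.uniform(low_exp, high_exp) for _ in range(n_data)]


# One hypothetical data-generating pathway with ten steps.
rates = sample_response_rates(n_data=10, seed=1)
gamma = 1.0  # scaling parameter, set to one as in the text
```

Sampling the exponent rather than the rate itself gives each order of magnitude between 0.01 and 10 the same probability mass.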
The effects of misrepresenting the pathway length in a model were tested on each set of synthetic data. Models with a fixed pathway step length (fixed-step models) of n_{model} = 1, 2, …, 5, respectively, were each treated with the same input and initial conditions that were used for data generation and had the values of their parameters (γ, r_{i}) optimised to fit the output dynamics of that data (Methods). To best characterise how well a model could perform when n_{model} ≠ n_{data} we focused on studying their response to a unit impulse input, I(t) = δ(t), from an initially inactive state, X_{i}(0) = 0 ∀i. This input is useful since analyses of the impulse response are easily extendible to arbitrary inputs [52] (Methods). It is worth noting that the impulse input is equivalent to the pathway receiving no input at all but instead starting its simulation from an initial condition where X_{1}(0) = γr_{1} in an otherwise fully inactive pathway. This simulates the sudden start of a reaction at t = 0 where X_{1} passes on its signalling while being exponentially depleted itself and could represent the conversion of a depleting responder upon the onset of an external signal.
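The equivalence between the impulse input and the initial condition X_{1}(0) = γr_{1} makes the pathway easy to simulate. The sketch below integrates Eqs 2 and 3 with a hand-written fourth-order Runge-Kutta scheme. The integrator, step size and function name are assumptions made for illustration and are not the numerical scheme used in the paper.

```python
import math


def simulate_impulse_response(rates, gamma=1.0, t_end=10.0, dt=0.001):
    """Integrate Eqs 2 and 3 with RK4 for a unit impulse input.

    The impulse is represented through the equivalent initial condition
    X_1(0) = gamma * r_1, with I(t) = 0 for t > 0.
    Returns (times, output), where output is X_n(t).
    """
    n = len(rates)

    def deriv(x):
        dx = [-rates[0] * x[0]]  # first step: the input is zero after t = 0
        for i in range(1, n):
            dx.append(rates[i] * (x[i - 1] - x[i]))
        return dx

    x = [gamma * rates[0]] + [0.0] * (n - 1)
    times, output = [0.0], [x[-1]]
    for k in range(int(round(t_end / dt))):
        k1 = deriv(x)
        k2 = deriv([xi + 0.5 * dt * ki for xi, ki in zip(x, k1)])
        k3 = deriv([xi + 0.5 * dt * ki for xi, ki in zip(x, k2)])
        k4 = deriv([xi + dt * ki for xi, ki in zip(x, k3)])
        x = [xi + dt / 6.0 * (a + 2 * b + 2 * c + d)
             for xi, a, b, c, d in zip(x, k1, k2, k3, k4)]
        times.append((k + 1) * dt)
        output.append(x[-1])
    return times, output


# Two identical steps with r = 1: the analytic output is t * exp(-t).
times, out = simulate_impulse_response([1.0, 1.0], gamma=1.0, t_end=5.0)
analytic = [t * math.exp(-t) for t in times]
```

Comparing `out` with `analytic` gives a quick check that the initial-condition formulation reproduces the expected impulse response.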
We analysed how well such fixed-step models reproduce the generated data and especially how the model/data fit depends on how many pathway steps were used to generate that data (Fig 1 and S1–S4 Figs). When n_{model} ≥ n_{data}, the model can perfectly reproduce the output dynamics, as expected, and provides a control for our numerical optimisation scheme (Fig 1 and S1–S4 Figs). However, despite the perfection in the input-output correspondence, the fitted model and the data-generating model will not be precisely the same. While every optimised reaction rate in the fitted model will also be found in the data-generating linear pathway (Fig 2B), they do not necessarily appear in the same order (Fig 2A). The order of the response rates along the pathway does not matter for the output [27, 28]. This means that optimising for an input-output relationship will not provide any means of correctly inferring which rate belongs to which step. Not only does the model correctly fit the data when n_{model} = n_{data}, but also when n_{model} > n_{data} (Fig 1 and S2–S4 Figs). During the optimisation procedure, ‘additional’ rates become fast enough to ‘instantaneously’ pass information from the previous step to the next (Fig 2B), confirming that fast steps are less dynamically relevant than slow steps [1, 27]. When the model has fewer steps than the linear pathway that was used to generate the data (n_{model} < n_{data}) it may no longer be possible to find a good fit. Unsurprisingly, the model/data mismatch increases with the length of the pathway that generated the data and decreases with the length of the model used to fit the data (Fig 1 and S1–S4 Figs).
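One way to see both observations is through the Laplace transform of Eqs 2 and 3. This is a standard argument sketched here for illustration, not a derivation taken from the paper's Methods. With zero initial conditions and Laplace variable p (a symbol introduced here), each step contributes one first-order factor:

```latex
\[
\hat{X}_1(p) = \frac{\gamma\, r_1}{p + r_1}\,\hat{I}(p), \qquad
\hat{X}_i(p) = \frac{r_i}{p + r_i}\,\hat{X}_{i-1}(p)
\quad\Longrightarrow\quad
H(p) \equiv \frac{\hat{X}_n(p)}{\hat{I}(p)} = \gamma \prod_{i=1}^{n} \frac{r_i}{p + r_i}.
\]
```

The product is symmetric in the r_{i}, so permuting the rates leaves the input-output relationship unchanged. Each factor also tends to 1 as r_{i} → ∞, which is why a superfluous step with a very fast rate has no visible effect on the output.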
Fig 1. Modelling linear pathways using a truncated number of pathway steps. https://doi.org/10.1371/journal.pcbi.1007982.g001
Two-step linear pathway models (Eqs 2 and 3) were fitted towards synthetic data. The data shown was generated by networks of step-lengths varying from 1 to 50. (A-E) The worst model/data fits for a given length, n_{data}, of the model that generated the data. Black lines show the synthetic data while simulations of the fitted models are overlaid in colour. (F) The cost value for models optimised towards 5000 different sets of synthetic data. Stars are the cost values resulting from data wherein all the steps in the data-generating linear pathway have the same reaction rates, r_{i} = 1 ∀i. The x-axis shows the number of steps in the models that were used to generate the data. Annotations highlight the simulations that are represented in the other subplots. (G-K) Examples of the best model/data fits for different data pathway lengths, n_{data}.
Fig 2. Analysis of optimised parameter values for the fixed-step models. https://doi.org/10.1371/journal.pcbi.1007982.g002
(A) The same linear pathway model run twice, but with the order of its response rate parameters changed. (B) The optimised parameters of a model with three steps, fitted against data generated using only two steps. The model parameter values are shown as circles, and the corresponding parameter values used to generate the data are shown as crosses. The green circles represent the model parameter values of the superfluous pathway step.
There is a high variability in the ability of a truncated model to fit the output dynamics (Fig 1 and S1–S4 Figs). While small models cannot in general represent arbitrary linear pathways, in some cases they do perform well. For example, a two-step model can be good at reproducing the dynamics of even a 10-step linear pathway (Fig 1J), and a five-step model can accurately describe the dynamics of some 25-step pathways (S4 Fig). However, the optimised model performance decreases with the homogeneity of the reaction rates of the data-generating pathway (Fig 3). When all the response rates of the data-generating pathway are the same, the ability of models to fit the data quickly decreases with the length of the pathway (Fig 3, cf. Fig 1A–1E, stars in Fig 1F). Conversely, when the response rates are highly heterogeneous, even a heavily truncated model is able to fit data from a long pathway (Fig 3), again indicating the different contributions of fast and slow steps to the resulting dynamics. Here, we see evidence that the relative distribution of response rates in the data-generating pathway affects the truncated model’s ability to fit data; next, we examine more rigorously why.
Fig 3. Response rate homogeneity of a linear pathway affects how well models can reproduce its dynamics. https://doi.org/10.1371/journal.pcbi.1007982.g003
Models were fitted towards data that was generated with different levels of response rate inhomogeneity. The synthetic data was generated using linear pathway models (Eqs 2 and 3) with parameters γ = 1 and r_{i} = 10^{δ·(2·i/(n_{data} − 1) − 1)} ∀i ∈ {0, 1, …, n_{data} − 1}, where δ is a parameter that governs the inhomogeneity of the rate parameters. For δ = 3, the rate parameters were thus logarithmically spread from 0.001 to 1000. (A) The cost values (lower is better) of models fitted to an 11-step linear pathway, plotted against the inhomogeneity parameter, δ, used to generate the synthetic data. (B-E) Samples of the model/data fit for the optimised parameter sets, as marked in (A). The fixed-rate model (Eqs 8 and 9) is compared with the two-step model (Eqs 2 and 3) since they have the same number of free parameters. (F) The optimised value for the fixed-rate model parameter n compared with the inhomogeneity of the response rates (n_{data} = 11). (G) and (H) The cost values for the two-step model and the fixed-rate model, respectively, when fitted to pathways of different lengths and degrees of inhomogeneity. (I) The ratio of the cost values for the fixed-rate and the two-step models. Blue indicates that the fixed-rate model did better while red indicates that the two-step model did better. White indicates that the two models performed equally well.
Detrimental truncation of models leads to unsharp responses but can be detected by characteristic parameter values
While pathway length and response rate homogeneity are key determinants for whether the truncation of a linear pathway reduces a model’s accuracy, these features may often be unknown from experiments. Hence, it would be useful to find characteristics of the (simplified) model to quantitatively identify when it is performing badly and how that can be detected.
In order to characterise the response curve, we define its delay, duration and sharpness. For signal delay and duration, we adapt definitions and linear-pathway specific relations described in Heinrich et al. [12]. The delay, τ, is defined as the average time it takes for an input to activate the output which can be expressed as
$$\tau \equiv \frac{\int_0^\infty t\, X_n(t)\, dt}{\int_0^\infty X_n(t)\, dt} = \sum_{i=1}^{n} \frac{1}{r_i}, \tag{4}$$
where the second equality holds for linear pathways. This definition is similar to the mean of a statistical distribution. The signal duration is defined in a way that is similar to a standard deviation of a distribution, given by
$$\sigma \equiv \sqrt{\frac{\int_0^\infty t^2\, X_n(t)\, dt}{\int_0^\infty X_n(t)\, dt} - \tau^2} = \sqrt{\sum_{i=1}^{n} \frac{1}{r_i^2}}, \tag{5}$$
where, again, the second equality holds for linear pathways. We also define the sharpness of the output curve, s ≡ τ/σ, as the reciprocal of a normalised signal duration. By using the Cauchy-Schwarz inequality and that the sum of squares is smaller than or equal to the square of a sum we get that
$$\frac{\tau}{\sqrt{n}} \le \sigma \le \tau \tag{6}$$
for linear pathways. This means that for any given signal delay, the signal duration is limited from both above and below. The inequality is equivalent to
$$1 \le s \le \sqrt{n} \tag{7}$$
which shows that the sharpness of the output curve is bounded by the square root of the pathway length. This means that when a linear pathway is simplified by reducing the number of pathway steps, the resulting model may become incapable of producing as sharp an output curve as the original system.
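Both bounds follow from short, standard arguments. The sketch below spells them out using the shorthand a_i = 1/r_i > 0, which is introduced here only for convenience, so that τ = Σ a_i and σ² = Σ a_i².

```latex
\[
\tau^2 = \Big(\sum_{i=1}^{n} 1 \cdot a_i\Big)^2
\le \Big(\sum_{i=1}^{n} 1^2\Big)\Big(\sum_{i=1}^{n} a_i^2\Big) = n\,\sigma^2
\quad\Longrightarrow\quad \frac{\tau}{\sqrt{n}} \le \sigma ,
\]
\[
\sigma^2 = \sum_{i=1}^{n} a_i^2
\le \sum_{i=1}^{n} a_i^2 + 2\sum_{i<j} a_i a_j = \tau^2
\quad\Longrightarrow\quad \sigma \le \tau .
\]
```

Equality in the Cauchy-Schwarz step requires all a_i to be equal, so the maximal sharpness s = √n is reached only for homogeneous response rates.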
Here, we see why a short model is only occasionally able to capture the dynamics of a long one and why the homogeneity of the response rates matters. Even though the maximal sharpness of a longer pathway is greater than what a truncated model can emulate, the actual sharpness might not be. Data from homogeneous pathways (where r_{i} = r ∀i) is hard for a truncated model to fit since the actual sharpness of the data matches its maximal sharpness, s_{data} = \sqrt{n_{data}}. The actual sharpness of the data curve, s_{data}, decreases as the response rates become more heterogeneous and it becomes easier for a truncated model to reproduce the data. The linear pathways which are the most effectively truncated are those where all but one step are infinitely fast since this results in the minimal sharpness s_{data} = 1. Such data can be reproduced by any fixed-step model of length one or more.
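The following sketch evaluates the closed-form expressions for delay, duration and sharpness of a linear pathway and contrasts a homogeneous pathway with a strongly heterogeneous one. The function name and the example rate values are illustrative assumptions.

```python
import math


def delay_duration_sharpness(rates):
    """Return (tau, sigma, s) for a linear pathway with the given response rates."""
    tau = sum(1.0 / r for r in rates)                     # signal delay
    sigma = math.sqrt(sum(1.0 / r ** 2 for r in rates))   # signal duration
    return tau, sigma, tau / sigma                        # sharpness s = tau / sigma


n = 10
homogeneous = [1.0] * n                       # all steps equally fast
heterogeneous = [0.01] + [1000.0] * (n - 1)   # one slow step dominates

_, _, s_hom = delay_duration_sharpness(homogeneous)
_, _, s_het = delay_duration_sharpness(heterogeneous)
print("homogeneous sharpness:", s_hom, "upper bound:", math.sqrt(n))
print("heterogeneous sharpness:", s_het)
```

The homogeneous pathway sits at the upper bound √n, while the pathway dominated by a single slow step has a sharpness close to 1 and could therefore be matched by a much shorter model.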
A consequence of this is that a detrimentally truncated model creates a clear signature in the optimised parameter values. While one could try to detect detrimental truncation by inspecting the maximal sharpness of the model and comparing it to the actual sharpness of the data, this only works for an impulse-like input and it requires that the direct output of the linear pathway is experimentally observable. A more reliable method follows from the observation that, when the data has a sharper peak than a fixed-step model is able to reproduce, the fixed-step model will optimise towards its maximal sharpness, s_{model} → \sqrt{n_{model}} (Fig 4). This limit is achieved when all the model’s response rate parameters have the same value (since τ/σ = \sqrt{n} when r_{i} = r ∀i). We thus see that a fixed-step model that has been simplified to the point that it can no longer capture the dynamics of the data will optimise to have homogeneous rate parameters. Detrimental pathway truncation can thus be detected by inspection of the optimised parameters of a model. If a chain of steps in a model has the same response (or degradation) rates, this indicates that a linear pathway might have been truncated to the detriment of the model/data fit. Importantly, while this analysis is based on the assumption of an impulse input to the model, it holds for essentially all inputs. This is because we can expect the optimal model parameters in just about any relevant scenario to be those that enable the closest possible match between the transfer functions of the model and the data. Since these transfer functions are input-independent, so is this diagnostic.
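In practice, this diagnostic amounts to checking whether the optimised rates of a fitted fixed-step model have collapsed onto a common value. A minimal sketch is given below, using the dispersion measure described for Fig 4 (standard deviation over mean). The use of the population standard deviation and the threshold `tol` are assumptions chosen for illustration, not values from the paper.

```python
import statistics


def rate_dispersion(rates):
    """Standard deviation over mean of the rate parameters of one model instance."""
    return statistics.pstdev(rates) / statistics.mean(rates)


def looks_detrimentally_truncated(fitted_rates, tol=0.05):
    """Flag a fit whose optimised rates are (nearly) homogeneous.

    tol is an arbitrary illustrative threshold, not a value from the paper.
    """
    return len(fitted_rates) > 1 and rate_dispersion(fitted_rates) < tol


# Hypothetical optimised parameters of two fitted two-step models.
print(looks_detrimentally_truncated([0.52, 0.52]))  # rates collapsed together
print(looks_detrimentally_truncated([0.1, 5.0]))    # rates clearly distinct
```

A flagged fit is only a hint that the pathway may have been truncated too far; a pathway that really does have near-identical rates would trigger the same flag.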
Fig 4. The fixed-step model has a maximal peak sharpness that can both prohibit data recapture and provide a diagnostic tool for when this occurs. https://doi.org/10.1371/journal.pcbi.1007982.g004
(A) The sharpness of the output peaks for the synthetic data and fitted two and five-step models. The shaded regions show the sharpness bounds for the data (1 ≤ s_{data} ≤ \sqrt{n_{data}}) and the white diamonds highlight the parameter sets used for the example trajectory in (D). (B) The dimensionless dispersion of rate parameters of the data and fitted models compared to the sharpness of the data. We define the dispersion to be the standard deviation over the mean of the rate parameters of a single model instance. Zero dispersion means that all the rates equal one another. (C) The rate parameter dispersion compared to the length of the pathway that generated the data. (D) The time-course of the model instances marked as diamonds in (A-C).
An alternative assumption for model simplification improves predictability of pathway output dynamics
An alternative approach for parameter reduction is to assume that every pathway step has the same response rate (r_{i} = r ∀i), and to treat the number of pathway steps, n, as a free parameter. With this ‘fixed-rate’ assumption we can represent a linear pathway with the set of equations
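$$\frac{dX_1}{dt} = r \cdot (\gamma \cdot I(t) - X_1), \tag{8}$$
$$\frac{dX_i}{dt} = r \cdot (X_{i-1} - X_i) \quad \forall i \in \{2, 3, \ldots, n\}. \tag{9}$$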
This simplified model has only three free parameters: γ for scaling, n for pathway length, and r for the response rates of the pathway steps.
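For a unit impulse input, a chain of n identical steps has a closed-form output: γ times the density of a gamma distribution with shape n and rate r. The sketch below evaluates that expression. Letting n take non-integer values (through the gamma function) is an assumption of this sketch about how the pathway length can be treated as a continuous free parameter, and the function name is illustrative.

```python
import math


def fixed_rate_impulse_response(t, gamma, n, r):
    """Output X_n(t) of the fixed-rate model for a unit impulse at t = 0.

    Closed form: gamma * r**n * t**(n - 1) * exp(-r * t) / Gamma(n), for n >= 1.
    Evaluated in log space so that large or non-integer n stay numerically stable.
    """
    if t <= 0.0:
        # Only a one-step pathway responds instantly at t = 0.
        return gamma * r if (n == 1 and t == 0.0) else 0.0
    log_x = n * math.log(r) + (n - 1) * math.log(t) - r * t - math.lgamma(n)
    return gamma * math.exp(log_x)


# Three-parameter model evaluated on a time grid (hypothetical parameter values).
ts = [0.01 * k for k in range(1001)]
curve = [fixed_rate_impulse_response(t, gamma=1.0, n=3.5, r=2.0) for t in ts]
```

For integer n this agrees with integrating Eqs 8 and 9 directly. For example, n = 2 gives γ·r²·t·e^{−rt}.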
While the assumption of homogeneous reaction rates is natural when the same process is repeated multiple times, such as a molecular motor walking along a microtubule, or for the assembly of monomers into a polymer, it is not true for most pathways. We analysed the effectiveness of the fixed-rate model by individually fitting its three parameters towards each of the synthetic data sets used above. The resulting model/data fits show that the fixed-rate assumption is indeed well suited for modelling linear pathways, even when the reaction rates of the data-generating network are highly heterogeneous (Figs 5 and 3). The relative effectiveness of the fixed-rate and the fixed-step model depends on the nature of the data itself. For very short or highly heterogeneous pathways, the fixed-step assumption yields better results (Fig 3I). That said, the fixed-rate model outperformed the two-step model, which has the same number of free parameters, in a clear majority of the cases we examined (Fig 5). This holds true for a wide range of different inputs to the linear pathway (Fig 6). Strikingly, the output of the fixed-rate model was almost inseparable from that of the original model in most cases (blue and black lines in Fig 6). It should also be noted that the fixed-rate model is reliably good (cost values less than 0.2), even in the cases where the fixed-step model is better (Fig 3G–3I). In contrast, the fixed-step model can become very bad at representing the data (with cost values above 0.9) when the data peak is sharper than the model can reproduce. Increasing the number of steps in the fixed-step model does increase its ability to fit the data (e.g. Fig 4, S1–S4 Figs). However, this adds more parameters while it does not change the fundamental limitation that when the data is too sharp for the model there is no limit to how bad the fit can become. The fixed-rate model was usually better in our test cases, even compared to longer fixed-step models (S6 Fig).
The fixed-rate model is thus highly effective at simplifying a linear pathway, and its performance decreases only slightly when the underlying pathway has heterogeneous reaction rates (Fig 3). More importantly, the fixed-rate model proved reliably good when its more common counterpart did not.
Fig 5. Modelling linear pathways using the fixed-rate assumption for simplification. https://doi.org/10.1371/journal.pcbi.1007982.g005
The fixed-rate model (Eqs 8 and 9) was fitted towards synthetic data generated by networks of step-lengths varying from 1 to 50. The data sets are the same as those used for Fig 1. (A-E) Examples of the worst model/data fits for the fixed-rate model. Orange lines show simulations of the fitted model and the black line shows the synthetic data. The grey line is the corresponding fit using the two-step truncated model on the same data set. (F) The cost value for models optimised towards 5000 different sets of synthetic data. The x-axis shows the number of steps in the model which was used to generate the data. Blue circles are cost values for the fixed-rate model while grey dots are the cost values for a two-step truncated model. The parameter sets used in figures A-E and G-K are marked accordingly. (G-K) Examples of the best model/data fits. The fixed-rate model (green lines) almost completely matches the data (black lines). (L) The scaling parameter, γ, is accurately identified as 1 in all optimisations. (M) The optimised values of n for different numbers of steps in the pathway underlying the data. (N) A comparison between the optimised value of r and the smallest rate constant of the model that generated the data. (O) A comparison of the cost values when using either the fixed-rate model or a two-step truncated model to fit the same data. Each circle represents a single synthetic data set. Percentages indicate how many of the data sets had a higher (worse) cost value for the respective models.
Fig 6. Comparing the ability of the fixed-rate and the two-step model to reproduce the dynamics of linear pathways that respond to different inputs. (doi: 10.1371/journal.pcbi.1007982.g006)
Each pair of figures (A,B; C,D; …; K,L) demonstrates the performance of the two models for the different model inputs: impulse, piecewise constant, ramp, step, wave and noisy auto-activator, respectively (see Table 1). Figures A, C, E, G, I and K compare the optimised cost values (lower is better) for the fixed-rate model and the two-step model for each synthetic data set (5000 per input). Every data set is represented by a low-opacity dot; colour saturation thus indicates the density of similar values. Percentages indicate how often one model had a worse cost than the other. The orange dot shows the geometric median of the cost values for the different data sets. Figures B, D, F, H, J and L show the model and data dynamics for the median data set of the corresponding input. The two-step model was chosen for the comparison since it has the same number of free parameters as the fixed-rate model.
To verify the applicability of the fixed-rate model beyond our synthetic data, we compared its performance with that of the two-step model when fitting real data. For that, we used data from the immune response of Arabidopsis thaliana. There, a set of different surface receptors respond to different danger-associated inputs and then, through a multi-step pathway, trigger an output of reactive oxygen species (ROS) [23]. The ROS response is transient even though the input is persistent, and this is hypothesised to be due to input-induced degradation of the receptor. We can interpret this as an initially inactive pathway where only the first step (the inactive receptor) has a non-zero concentration. At time t = 0, the receptors start binding to the input and keep on signalling until they are degraded. This case is analogous to the impulse input that we have used in the paper. The multi-step pathway leading to the release of ROS is not perfectly characterised and we do not know whether or not it is linear. We do, however, know that for one of these inputs, flagellin22 (flg22), there is a sequence of at least five reactions that occur before the production of ROS is triggered [23]. Fitting the models to this data clearly shows that the fixed-rate model is better able to capture the experimentally observed dynamics (Fig 7).
Fig 7. Simplified linear pathway models fitted to real data from different immune responses of Arabidopsis thaliana. (doi: 10.1371/journal.pcbi.1007982.g007)
(A-C) Measured and simulated reactive oxygen response to immune elicitations by flg22, elf18 and AtPep1, respectively. Data extracted from Monaghan et al. [53].
The fixed-rate model outperforms a fixed-delay model
A main limitation for the truncated models is that they perform badly in capturing the delay of a signal output. Rather than introducing multiple pathway steps in a model in order for it to capture a time-delay, such time-delays can be introduced explicitly. We next aimed to compare such an approach of modelling linear pathways to the use of the fixed-step and the fixed-rate formulations. In order to make a fair comparison, we defined a DDE with three free parameters. In this ‘fixed-delay’ model, the pathway step that we consider an output, X, responds to an input at a rate governed by r, with a fixed, singular, time delay, t_{delay}. Similarly to the other models, a scaling parameter γ is also defined.
$$\frac{dX(t)}{dt} = r \cdot \left(\gamma \cdot I(t - t_{delay}) - X(t)\right). \tag{10}$$
We optimised the fixed delay model (Eq 10) towards the same data used for the previous models and compared their performance (Fig 8). In most cases, this model performed better than the two-step truncated model but worse than the fixed-rate model. The fixed-delay model is able to adjust its signal delay (Eq 4) without affecting the signal duration (Eq 5). This allows it to fit the delay, duration, and thus, sharpness of the data well. However, it fails to fit the symmetry of the output curve around its mean delay (skewness if we continue the distribution analogy). The model output starts very abruptly at t = t_{delay} and its response to sudden input changes lacks the smoothness that is seen in the data. The fixed-rate model, on the other hand, can achieve the correct time-delay and sharpness while also accurately smoothing out the signalling over time.
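Because only the input is delayed in Eq 10, the fixed-delay model can be integrated as an ordinary differential equation with a time-shifted input. The following is a minimal sketch of that idea, not the solver used in this study: it uses a plain forward Euler scheme, and the names `simulate_fixed_delay` and `input_fn`, as well as the assumption that the input is zero before t = 0, are illustrative choices.

```python
def simulate_fixed_delay(input_fn, r, gamma, t_delay, t_end, dt=1e-3):
    """Integrate dX/dt = r * (gamma * I(t - t_delay) - X) with forward Euler.

    Only the input is delayed, so no history of X has to be stored.
    Assumption: the input is zero before t = 0 and X(0) = 0.
    """
    n_steps = int(round(t_end / dt))
    times = [k * dt for k in range(n_steps + 1)]
    x = [0.0] * (n_steps + 1)
    for k in range(n_steps):
        t_shifted = times[k] - t_delay
        delayed_input = input_fn(t_shifted) if t_shifted >= 0.0 else 0.0
        x[k + 1] = x[k] + dt * r * (gamma * delayed_input - x[k])
    return times, x


# Step input: the output stays at zero until t = t_delay and then
# relaxes exponentially towards gamma, which is the abrupt onset noted above.
times, x = simulate_fixed_delay(lambda t: 1.0, r=2.0, gamma=1.0,
                                t_delay=1.0, t_end=6.0)
```

For a step input the sketch reproduces the abrupt start at t = t_{delay}, after which X(t) follows γ·(1 − e^{−r·(t − t_{delay})}).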
Fig 8. Modelling of linear pathways using a fixed delay DDE model (Eq 10) compared to the two-step and the fixed-rate models. (doi: 10.1371/journal.pcbi.1007982.g008)
(A) A cost value comparison of the fixed-rate model and the DDE model for every data set generated with an impulse input (Table 1, 5000 data sets). Every synthetic data set is represented with a low-opacity dot; colour saturation thus indicates a high density of similar values. The orange dot highlights the geometric median and that median data set is used as an example in figure (B). (B) An example time trajectory where the fixed-rate, two-step and DDE models have all been optimised to reproduce a synthetic data set. (C) A cost value comparison, similar to (A), between the two-step model and the DDE model. The orange dot shows the cost of the trajectory displayed in (B). (D-F and G-I) Repeats of (A-C) but using a step input and a noisy input, respectively (Table 1). The synthetic data was generated with pathway lengths, n_{data}, uniformly distributed between 1 and 50.
The fixed-rate assumption leads to a gamma distributed delay model
The impulse response function for the fixed-rate model is given by
$$g(t) = \gamma \cdot \frac{r^{n} \cdot t^{n-1} \cdot e^{-r \cdot t}}{\Gamma(n)}, \tag{11}$$
as shown in Methods. Here, t is time, Γ is the gamma function [54] and, just like in the fixed-rate model, γ is a scaling factor, n relates to the pathway length and r to the response rates of the steps along the pathway. For the impulse input, the fixed-rate model is easily simulated by solving X_{n}(t) = g(t). Similarly terse analytical solutions to the fixed-rate model for other, specific, inputs can be found in [27]. But the impulse response function is of particular importance since it can be used to derive an expression which is applicable to any kind of input. This is done through a convolution and because g(t) is really identical to the probability density function of the gamma distribution (scaled with γ) the result is called a gamma-distributed delay,
$$X_{n}(t) = g(t) * I(t) = \int_{0}^{t} g(t - t') \cdot I(t') \, dt'. \tag{12}$$
While the fixed-rate model is by its very structure limited to integer-valued n, this ‘gamma model’ is not and allowing real-valued n increases the model’s ability to fit data. To test this, we optimised the gamma model and the fixed-rate model towards the same synthetic data sets. The gamma model, with n ∈ [1, ∞), improved performance compared to the fixed-rate model in nearly all cases (Fig 9, cf. S5 Fig, [27]). In most cases, the two models were approximately equivalent but the performance difference became apparent in some cases, especially when n_{data} was small.
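The gamma-distributed delay can be sketched numerically as follows. This is an illustrative implementation under stated assumptions rather than the code used for the paper: the function names are hypothetical, and the convolution integral is evaluated with a simple trapezoidal rule instead of the Gauss-Kronrod quadrature described in Methods. The impulse response is computed in log space with `math.lgamma` so that real-valued n ≥ 1 is handled without overflow.

```python
import math


def gamma_impulse_response(t, n, r, gamma=1.0):
    """g(t) = gamma * r^n * t^(n-1) * exp(-r*t) / Gamma(n), for real n >= 1."""
    if t < 0.0:
        return 0.0
    if t == 0.0:
        # t^(n-1) equals 1 for n == 1 and 0 for n > 1
        return gamma * r if n == 1 else 0.0
    log_g = n * math.log(r) + (n - 1.0) * math.log(t) - r * t - math.lgamma(n)
    return gamma * math.exp(log_g)


def gamma_delay_output(input_fn, t, n, r, gamma=1.0, n_points=2000):
    """X_n(t) = integral_0^t g(t - t') * I(t') dt' using the trapezoidal rule."""
    if t <= 0.0:
        return 0.0
    h = t / n_points
    total = 0.0
    for k in range(n_points + 1):
        lag = k * h                      # lag = t - t'
        t_prime = max(t - lag, 0.0)
        weight = 0.5 if k in (0, n_points) else 1.0
        total += weight * gamma_impulse_response(lag, n, r, gamma) * input_fn(t_prime)
    return total * h


# A non-integer pathway length is allowed: response to a unit step input.
output = gamma_delay_output(lambda t: 1.0, t=40.0, n=2.5, r=1.0)
```

For a unit step input the output approaches γ at long times, since g(t)/γ integrates to one as a probability density.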
Fig 9. Allowing a real-valued pathway length parameter, n, increases the gamma model’s ability to recapitulate dynamics from arbitrary linear pathways. (doi: 10.1371/journal.pcbi.1007982.g009)
(A) The optimised cost values for the fixed-rate and the gamma models when they are both optimised towards the same data sets in which the pathway receives an impulse input (Table 1, 5000 data sets). The colour of the dots indicates the length of the linear pathway that was used to generate the synthetic data set. (B) Demonstrating the model performances for the median data set, as denoted by a green dot in (A). (C, D) Similar to (A) and (B) but using the generalised gamma model (Eq 12) and a noisy input (Table 1).
Due to the close connection between the gamma model and the gamma distribution, the model’s signal delay and duration are directly given by the statistics of the distribution. The mean of the gamma distribution identifies the signal delay, τ = n/r, and the standard deviation identifies the signal duration, σ = √n/r. The sharpness of the gamma model (as well as of the fixed-rate model) is therefore given by s = τ/σ = √n. This means that the gamma model is always able to fit its signal sharpness exactly to that of the data. The fixed-rate model, on the other hand, is only approximately able to fit the sharpness and the relative error is largest for data with low sharpness due to the limitation of an integer-valued n.
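As a worked illustration of the integer limitation, consider matching a target sharpness by choosing n = s_{data}^2. The two target values below are chosen purely for illustration and are not taken from the data sets of this study.

```latex
\begin{align*}
s_{data} = 1.2:&\quad n = s_{data}^{2} = 1.44, \quad
  n = 1 \Rightarrow s = 1 \ (\text{relative error} \approx 17\%), \quad
  n = 2 \Rightarrow s = \sqrt{2} \approx 1.41 \ (\text{relative error} \approx 18\%), \\
s_{data} = 5.1:&\quad n = s_{data}^{2} = 26.01, \quad
  n = 26 \Rightarrow s = \sqrt{26} \approx 5.099 \ (\text{relative error} \approx 0.02\%).
\end{align*}
```

With a real-valued n both targets are matched exactly, whereas rounding to an integer costs far more at low sharpness than at high sharpness.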
Identifiability of biological parameters using the fixed-rate assumption
It is of interest to analyse how well a fixed-rate assumption approach performs when it comes to identifying the values of the parameters in the underlying biological pathway. When a pathway is using the same reaction rates for all pathway steps, the fixed-rate assumption is exact, and the optimal model parameters truly reflect the underlying parameters of the data-generating system. However, the model performs well even when representing a pathway with heterogeneous reaction rates, and in this case, the connection between the model parameters and the biological system is less clear.
The γ parameter is a simple scaling parameter. The model parameter can be considered a reflection of the total scaling that a linear pathway performs on a signal before that signal reaches the output (Methods, Eq 20). Unlike for the fixed-step model, the optimised values of γ for the gamma and the fixed-rate models were all very close to the true value, one (Fig 5L). All of these models are capable of ensuring that the signalling is properly scaled between the input and the output. However, since the fixed-step model is unable to correctly time its output, the optimisation scheme will sometimes lead to the use of the scaling parameter to mitigate the cost that this timing discrepancy creates. Since the fixed-rate assumption leads to models with much better control over timing, this never became an issue during our study, and those models always identified the correct value for the scaling parameter.
The n parameter represents the number of steps in the approximated pathway but since we relaxed the demands that all rates are equal the connection between n and the number of steps in the biological (synthetic) data is not exact. Nevertheless, a precise relationship between the optimal pathway length parameter, n_{optimal}, and the parameters of the approximated pathway can be found. This can be done if we assume that the model fits the data optimally when the two share both signal delay and duration (τ_{model} = τ_{data} and σ_{model} = σ_{data}). Such optimality is only approximately possible for the fixed-rate model but it is strictly possible for the gamma model. In this case, we find that n_{optimal} can be determined from the rates, r_{i}, of all the pathway steps of the data
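$$n_{optimal} = \frac{\tau_{model}^{2}}{\sigma_{model}^{2}} = \frac{\tau_{data}^{2}}{\sigma_{data}^{2}} = \frac{\left(\sum_{i}^{n_{data}} \frac{1}{r_{i}}\right)^{2}}{\sum_{i}^{n_{data}} \frac{1}{r_{i}^{2}}}. \tag{13}$$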
Here, we see that if the rates of the underlying linear pathway are known, or if the gamma model is used to simplify a known linear pathway model, the optimal value of n is easily obtained. This equation holds with good precision for the numerically optimised parameters of this paper (Pearson correlation for the gamma and the fixed-rate models using the impulse input, ρ_{gamma} = 0.999 and ρ_{fr} = 0.997, respectively). It should, however, be noted that the slight imperfection is not merely due to failures of our numerical optimisation scheme but mostly because our cost function was not perfectly in line with the assumption above. Even so, the relation is still informative and by using the Cauchy-Schwarz inequality it further tells us that
$$n_{optimal} \leq n_{data}. \tag{14}$$
This means that while the value of the optimised n parameter does not give the precise length of the underlying pathway, it does give a lower bound (Fig 5M). Again, equality occurs when all the response rates of the underlying pathway have the same values.
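The Cauchy-Schwarz step can be spelled out by pairing the vector of inverse rates with a vector of ones. This is our expansion of the argument, using the expression for n_{optimal} in Eq 13:

```latex
\begin{align*}
\left(\sum_{i=1}^{n_{data}} \frac{1}{r_{i}} \cdot 1\right)^{2}
  &\leq \left(\sum_{i=1}^{n_{data}} \frac{1}{r_{i}^{2}}\right)
        \left(\sum_{i=1}^{n_{data}} 1^{2}\right)
   = n_{data} \sum_{i=1}^{n_{data}} \frac{1}{r_{i}^{2}} \\
\Longrightarrow \quad
n_{optimal}
  &= \frac{\left(\sum_{i=1}^{n_{data}} \frac{1}{r_{i}}\right)^{2}}
          {\sum_{i=1}^{n_{data}} \frac{1}{r_{i}^{2}}}
   \leq n_{data}.
\end{align*}
```

Equality in the Cauchy-Schwarz inequality requires the two vectors to be proportional, that is, all 1/r_{i} equal, which is the homogeneous-rate case.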
The parameter r of the fixed-rate model is related to the rates at which information is being passed along the linear pathway. By again assuming that the optimal model fit occurs when the signal delay and duration of the model and the data are equal, we get
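$$r_{optimal} = \frac{\tau_{model}}{\sigma_{model}^{2}} = \frac{\tau_{data}}{\sigma_{data}^{2}} = \frac{\sum_{i}^{n_{data}} \frac{1}{r_{i}}}{\sum_{i}^{n_{data}} \frac{1}{r_{i}^{2}}}. \tag{15}$$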
Here, r_{optimal} is the optimal r parameter value of the gamma model and r_{i} are the response rates along the approximated linear pathway. As with n_{optimal}, this r_{optimal} corresponds very closely to the optimised r parameters of this paper (ρ_{gamma} = 0.999, ρ_{fr} = 0.996). While this relationship can be useful, it provides little in terms of intuition. A more intuitive, albeit less precise, heuristic is that the parameter r approximately identifies the slowest part of the pathway that is being simplified. This is because the slowest steps provide the greatest contributions to the overall dynamics [12]. This heuristic becomes exact both when all the rates of the data-generating pathway are the same and when they tend to infinite heterogeneity. For the synthetic data of this paper, which is near neither of these limits, the optimised r parameter still corresponds well with, and only slightly overestimates, the slowest reaction rate of the data (Fig 5N, ρ_{gamma} = 0.993, ρ_{fr} = 0.989).
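The two closed-form expressions (Eqs 13 and 15) are straightforward to evaluate for a known set of step rates. The sketch below is illustrative only; the function name and the example rates are hypothetical and not taken from the study.

```python
def optimal_gamma_parameters(rates):
    """Return (n_optimal, r_optimal) from Eqs 13 and 15 for step rates r_i."""
    sum_inv = sum(1.0 / r for r in rates)            # signal delay of the data
    sum_inv_sq = sum(1.0 / r ** 2 for r in rates)    # squared signal duration
    n_optimal = sum_inv ** 2 / sum_inv_sq
    r_optimal = sum_inv / sum_inv_sq
    return n_optimal, r_optimal


# Heterogeneous four-step pathway (illustrative rates).
rates = [0.1, 1.0, 10.0, 2.0]
n_opt, r_opt = optimal_gamma_parameters(rates)

# n_opt is bounded above by the true pathway length,
# and r_opt is bounded below by the slowest rate.
assert 1.0 <= n_opt <= len(rates)
assert r_opt >= min(rates)
```

For this example the slowest step dominates: n_{optimal} is close to 1.33 and r_{optimal} is close to 0.115, only slightly above the slowest rate of 0.1. With homogeneous rates the function returns the true pathway length and the common rate.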
Counter-intuitively, the connection between parameters and observables in the underlying data is slightly weaker for the fixed-rate model, which does not allow the number of pathway steps to be a non-integer. This is because it may not be possible for both the delay and the duration of the fixed-rate model to equal those of the data simultaneously. The above arguments still apply but only approximately. Adapted versions of Eqs 13 and 15 could still yield good parameters but they may not be optimal.
Altogether, the fixed-rate assumption allows for a relatively close connection between a model’s parameters and the observables that they represent. The optimised parameters can thus strongly indicate the properties of the linear pathway under study.
The fixed-rate assumption is applicable to non-linear pathways
Many multi-step, unbranched, pathways are in fact not linear for all inputs. The linear approximation is often accurate for low inputs but reaction rates tend to saturate at sufficiently high inputs. It would, therefore, be useful to be able to efficiently simplify also non-linear multi-step reactions.
We, again, generated synthetic data, but this time from a non-linear kinase cascade model equivalent to that of Heinrich et al. [12]. Each step, i, in this model consists of the conversion back and forth between an active (X_{i}^{on}) and an inactive (X_{i}^{off}) kinase. The total amount, X_{i}^{tot} ≡ X_{i}^{on} + X_{i}^{off}, is conserved and is treated as a parameter. The conversion from inactive to active kinase is proportional to the activity of the previous kinase, X_{i−1}^{on}, and the availability of X_{i}^{off}, with a proportionality constant \tilde{α}_{i}. The activation of the first step is proportional to the input, I(t), and the activity of the last step, X_{n}^{on}(t), is considered the model’s output. The inactivation of all steps is proportional to their activity X_{i}^{on}, with a proportionality constant β_{i}. Similar to how we redefined the parameters of the linear model to increase their orthogonality, we here define the dimensionless parameter α_{i} ≡ \tilde{α}_{i} · X_{i}^{tot} / β_{i} and express the model as
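\begin{align}
\frac{dX_{1}^{on}}{dt} &= \beta_{1} \cdot \left(\alpha_{1} \cdot I(t) \cdot \left(1 - \frac{X_{1}^{on}}{X_{1}^{tot}}\right) - X_{1}^{on}\right), \tag{16} \\
\frac{dX_{i}^{on}}{dt} &= \beta_{i} \cdot \left(\alpha_{i} \cdot X_{i-1}^{on} \cdot \left(1 - \frac{X_{i}^{on}}{X_{i}^{tot}}\right) - X_{i}^{on}\right) \quad \forall i \in \{2, \ldots, n\}. \tag{17}
\end{align}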
Data sets were generated by simulating the activity of the last step, X_{n}^{on}, using random parameters where α_{i} ∼ 10^{U(−1,2)}, β_{i} ∼ 10^{U(−1,2)}, X_{i}^{tot} ∼ 10^{U(−1,1)}, and n ∼ {1, …, 20}. For sufficiently small inputs or high values of X_{i}^{tot}, this cascade model will be linear since each step will only be weakly activated such that 1 − X_{i}^{on}/X_{i}^{tot} ≈ 1. The cascade model is then dynamically equivalent to its linear counterpart in the study.
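The cascade of Eqs 16 and 17 can be sketched as follows. This is a minimal illustration under stated assumptions, not the simulation code of the study: it uses a fixed-step fourth-order Runge-Kutta scheme, which is only appropriate for non-stiff parameter choices, and the function names and example parameters are hypothetical.

```python
def cascade_rhs(x_on, input_value, alpha, beta, x_tot):
    """Right-hand side of Eqs 16 and 17 for every step of the cascade."""
    derivs = []
    upstream = input_value  # step 1 is driven by the input I(t)
    for xi, a, b, tot in zip(x_on, alpha, beta, x_tot):
        derivs.append(b * (a * upstream * (1.0 - xi / tot) - xi))
        upstream = xi       # step i+1 is driven by the activity of step i
    return derivs


def simulate_cascade(input_fn, alpha, beta, x_tot, t_end, dt=0.01):
    """Fixed-step RK4 integration starting from a fully inactive cascade.

    Returns the trajectory of the last step, X_n^on, sampled every dt.
    """
    x = [0.0] * len(alpha)
    output = [x[-1]]
    for step in range(int(round(t_end / dt))):
        t = step * dt
        k1 = cascade_rhs(x, input_fn(t), alpha, beta, x_tot)
        k2 = cascade_rhs([xi + 0.5 * dt * k for xi, k in zip(x, k1)],
                         input_fn(t + 0.5 * dt), alpha, beta, x_tot)
        k3 = cascade_rhs([xi + 0.5 * dt * k for xi, k in zip(x, k2)],
                         input_fn(t + 0.5 * dt), alpha, beta, x_tot)
        k4 = cascade_rhs([xi + dt * k for xi, k in zip(x, k3)],
                         input_fn(t + dt), alpha, beta, x_tot)
        x = [xi + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
             for xi, a, b, c, d in zip(x, k1, k2, k3, k4)]
        output.append(x[-1])
    return output


# Three identical steps driven by a constant unit input (saturating regime).
saturated = simulate_cascade(lambda t: 1.0, [2.0] * 3, [1.0] * 3, [1.0] * 3, t_end=20.0)
# The same cascade driven by a weak input behaves almost linearly.
weak = simulate_cascade(lambda t: 1e-4, [2.0] * 3, [1.0] * 3, [1.0] * 3, t_end=20.0)
```

With the unit input, the steady-state output is 8/15, well below the value of 8 that a purely linear three-step pathway with the same α_{i} would give. With the weak input, the output stays within about one percent of the linear prediction α_{1}·α_{2}·α_{3}·I = 8·10^{−4}.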
Simplifications of such cascades using the fixed-rate assumption often yield greater descriptive power than those using a two-step assumption. We simplified the cascade model described in Eqs 16 and 17 using either a fixed two-step assumption (n = 2, independent α_{i}, β_{i} and X_{i}^{tot}) or a fixed-rate assumption (n is a free parameter but each step is the same: α_{i} = α_{j}, β_{i} = β_{j} and X_{i}^{tot} = X_{j}^{tot} ∀ i, j). These models were fitted towards reproducing the synthetic data. The fixed-rate cascade model yielded better fits with data than the two-step cascade model in a majority of cases (Fig 10). This holds despite the two-step model having more free parameters than the fixed-rate model (6 compared to 4). Also, the fixed-rate model describes the data well regardless of how saturated that data is (Fig 10A). The two-step model, on the other hand, performs best when either the underlying pathway really is short or when the data is highly saturated (Fig 10B).
Fig 10. The descriptive power of simplified non-linear kinase cascade models depends on the saturation level of the approximated system. (doi: 10.1371/journal.pcbi.1007982.g010)
(A and B) The respective dependency of the fixed-rate and two-step assumptions on the saturation level of the data they try to emulate. The saturation is measured by a quantity that depends on the shape of the final peak (Methods). Linear systems receive a saturation score of between 0.7 (for one-step systems) and 0.83. A score of 1 means that the peak is completely square. Circles represent a single data/model pair and diamonds highlight example curves in (D-G). (C) Comparing the cost values of the data-fits for models using the fixed-rate or two-step assumption.
In the case of linear pathways, any simplification will have predictive power as long as it can emulate the transfer function of the original system. For non-linear models, however, we have no such guarantees. A model that fits the data perfectly under one condition may nevertheless give false predictions for how the system would react to others. We, therefore, analysed the predictive power of the two model simplifications by fitting them to data from one input and then measuring the model/data mismatch that ensued from applying a differently scaled input (Fig 11A–11D). The result shows that while certain dynamical aspects, such as maximal peak height, are often unchanged in both the data-generating model and the simplified model, other aspects, such as signal duration, are often poorly predicted. However, predictiveness was dramatically improved for both models when we instead fitted them to data from three inputs of different scaling (10^{−3}, 1 and 10^{3}) (Fig 11E–11H). In this case, we saw very little difference in the effectiveness between the fixed-rate and the two-step assumptions except that the fixed-rate assumption requires fewer parameters. Finally, it is worth pointing out that both models performed reasonably well in this regard and that while the fixed-rate model does not always outperform the fixed-step model, it still performed reliably well in our test cases in a way that the fixed-step model did not.
Fig 11. The predictive power of simplified non-linear kinase cascade models when fitted towards a single vs. three different inputs. (doi: 10.1371/journal.pcbi.1007982.g011)
(A) Fixed-rate model (blue), two-step model (orange) and synthetic data (black) simulated for three different pulse inputs. The inputs were pulses of concentrations 10^{−3}, 1, and 10^{3}, respectively, each lasting one time unit. The models were only fitted to the unscaled input. (B and C) The cost values of the fixed-rate and two-step models, respectively, for different input scalings. The coloured line shows the mean cost for 2000 fitted models, the shaded region shows the standard deviation and the black line highlights the model used as an example in (A) and (D). (D) Applying a noisy input to models that were trained on a single pulse input. Colours as in (A). The generated noise input was rescaled by a factor 2⋅10^{−6} to avoid the uninteresting case of full and continuous saturation. (E-H) Repeating the above figures but with models trained on data from pulse inputs with three different scalings, 10^{−3}, 1, and 10^{3}. The noise input rescale factor in (H) was 10^{−7}.
Discussion
Biology is full of multi-step pathways where each step is (at least approximately) linearly dependent on the previous step. Transcription, translation, kinase cascades, sequential phosphorylation and signal transduction are all examples of processes where this can apply. The full inclusion of such pathways is seldom advisable when modelling biological systems since the added benefit in dynamical range is outweighed by the disadvantage of an increased model complexity. It is, therefore, common for such pathways to be simplified. However, the manner in which such pathways are simplified is not always particularly effective.
We demonstrate how the coarse-graining of a long linear pathway to a short one (pathway truncation) often leads to a detectably incorrect temporal relationship between the input and the output signal. This discrepancy is important to understand not only because it can cause a model to quantitatively misrepresent time-course data but also because signal timing can qualitatively alter dynamical behaviour. Negative feedback loops, for example, can change from having a stabilising effect to generating oscillations when the feedback is delayed (e.g. [1, 11, 31]). It is, therefore, notable that when this simplification has adverse effects we could identify a detectable signature in the form of homogeneity of the optimised response rate parameters. This can be used as a model diagnostic and could prove especially helpful for complex models where the source of model/data mismatch is not always apparent from their design or output.
Next, we asked whether there might be a way to remedy the shortcoming of a truncated model without increasing the number of model parameters. We propose to assume that the signalling is being passed along the pathway at a constant rate while the number of steps is not fixed. This assumption allows for a three-parameter approximation of arbitrary linear pathways where the pathway length is a tunable parameter. We showed that this ‘fixed-rate’ assumption outperformed both truncated pathway models and a DDE model with an explicit time-delay parameter, even if the original pathway had highly heterogeneous rates for individual pathway steps. Here, we argue for its effectiveness by applying it to both synthetic and experimental data and we argue that it is applicable for any model input. This was neatly exemplified in Tokuda et al. [47], where they showed that introducing distributed delays (fixed-rate assumption) between components of circadian clock models allowed for a parameter reduction while retaining the main dynamics.
A solution of a fixed-rate model allows it to be re-formulated as a gamma-distributed delay. In this model formulation, which does not require a separate differential equation for each pathway step, it is possible to extend the domain of its pathway length parameter, n, to real numbers. Doing so improves model accuracy and simplifies some forms of numerical optimisation since this ‘gamma model’ does not mix integer and real-valued parameters. However, it also disallows the use of ordinary differential equations for numerical simulation which is in many cases more efficient than the alternative. Whether this trade-off is worth it will be dependent on both the case at hand and on the availability of numerical tools for efficient and accurate model simulation.
For a model to be useful, it is important to retain information about the underlying biological system even after the model assumptions have simplified reality. Simplifying assumptions break the precise mapping between model parameters and biological observables. However, much of the connection between the model parameters and real biological processes is retained when the fixed-rate assumption is used for simplification of a linear pathway, and even more so when the gamma model is used. If the precise details of a linear pathway are known, then the optimal parameters of the gamma model can be obtained from simple expressions. From those expressions, we can, for example, see that the optimal pathway length parameter of the gamma model, n, provides a lower bound for the length of the real pathway. While the integer-valued pathway length parameter, n, of the fixed-rate model may seem more natural when modelling a real pathway with a discrete number of steps, its mapping to reality is actually less precise than that of the gamma model. The mapping is similar to that described for the gamma model, but the relationships described are only approximate for the fixed-rate model. However, this should be contrasted with the truncated pathway model, which still uses individual response rate parameters for individual pathway steps. Intuitively, this may seem more closely related to reality, but when such a model is optimised to perform the action of a longer pathway, these individual reaction rates become highly decoupled from the response rates of any real pathway step. Rather than being tuned to represent real pathway steps, they are tuned to delay the response peak while not sacrificing too much of the response sharpness. When the model is unable to reproduce the sharpness of the data, its rate parameters become homogeneous.
This has the consequence that if two different data sets are both too sharp for a truncated model, the optimised model will have the same parameters for the two data sets, even if they are vastly different. Such one-to-many mappings of optimal parameter values to the target data essentially disable inference of biological properties from parameter values. So the fixed-rate assumption not only performs better than the alternatives, but its individual components also retain a closer connection with reality.
A fixed-rate assumption, in line with that proposed for linear pathways, is applicable also to non-linear systems but its relative merit decreases as the degree of saturation increases. When a kinase cascade is weakly activated (linear), a fixed-rate assumption provides a more robust and often more effective simplification than the more common approach of disregarding some intermediate steps. When strongly activated, the choice between these simplifications seems to matter less. The non-linearity means that these results are much harder to generalise from since it is no longer true that the model can be completely characterised by its response to a single input. But, even though the performance difference between the two approaches effectively disappeared under certain conditions, the demonstrated advantage of the fixed-rate assumption under other conditions indicates an overall benefit from its use.
A mathematical model is defined by its underlying assumptions and can be seen as merely acting as a logical device to deduce the consequences of those assumptions [55]. The main part of model development is thus to find a set of assumptions which accurately and concisely captures the nature of the dynamical system under study. However, finding a good balance between detail and simplicity is often non-trivial and requires some degree of craftsmanship. The better we understand the consequences of specific assumptions, the better we become at striking this balance. In this context, we systematically investigate the consequences of simplifying multi-step pathways by including only a few steps in the model. We clearly demonstrate problems such simplifications may cause. More importantly, we present how to detect these issues when they occur and we provide an alternative approximation to remedy them. We hope this will supply a foundation for well-informed decisions regarding when and how to simplify the ubiquitous linear pathway.
Methods
Definition of the linear pathway ODE models
If we model an n–step linear pathway that follows Eq 1 except that it receives some input signal to the first step, X_{1}, the dynamical equations can be written as
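\begin{align}
\frac{dX_{1}}{dt} &= \beta_{1} \cdot \left(\frac{\alpha_{1}}{\beta_{1}} \cdot I(t) - X_{1}\right), \tag{18} \\
\frac{dX_{i}}{dt} &= \beta_{i} \cdot \left(\frac{\alpha_{i}}{\beta_{i}} \cdot X_{i-1} - X_{i}\right) \quad \forall i \in \{2, 3, \ldots, n\}, \tag{19}
\end{align}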
where α_{i} is a production/activation rate, β_{i} a degradation/deactivation rate, I(t) is some upstream input and X_{n}(t) is the output of the pathway. The scaling that step i performs on the signal from step i − 1 is given by α_{i}/β_{i} and the total scaling, γ, between the input and output is [28]
$$\gamma = \prod_{i=1}^{n} \frac{\alpha_{i}}{\beta_{i}}. \tag{20}$$
Since we are only interested in the dynamics of how the output responds to an input we can collapse the scaling into the single parameter γ and write
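\begin{align}
\frac{dX_{1}}{dt} &= \beta_{1} \cdot \left(\gamma \cdot I(t) - X_{1}\right), \tag{21} \\
\frac{dX_{i}}{dt} &= \beta_{i} \cdot \left(X_{i-1} - X_{i}\right) \quad \forall i \in \{2, 3, \ldots, n\}. \tag{22}
\end{align}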
Here the position of the γ parameter along the pathway is irrelevant for the output. The parameter β_{i} is still a degradation/inactivation rate but the scaling effect that such a parameter usually has has been absorbed into γ. We, therefore, chose to instead call it a response rate parameter, r_{i} ≡ β_{i}, which better reflects its revised effect on the model. The result is the linear pathway model we have been using in the paper, Eqs 2 and 3.
The analysis of the models was done using different inputs and initial concentrations (Table 1). Unless otherwise stated, the impulse input was used and while it is defined as using the Dirac delta function the actual simulations were done by setting the initial concentration of X_{1} to γ ⋅ r_{1}. The noisy input was generated using a Gillespie simulation of a single, auto-activating, variable [56]. The noisy time-course that this generated was used as an input both during the generation of synthetic data and during the subsequent model simulations.
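The impulse-as-initial-condition approach can be sketched as below. This is an illustrative fixed-step Runge-Kutta integration, not the adaptive stiffness-switching solver used for the paper, and the function names are hypothetical. For homogeneous rates, the simulated output can be checked against the analytical impulse response with (n − 1)! in the denominator.

```python
import math


def simulate_linear_pathway(rates, gamma, t_end, dt=0.01):
    """RK4 integration of the linear pathway for an impulse input.

    The Dirac delta input is replaced by the initial condition
    X_1(0) = gamma * r_1, with I(t) = 0 for t > 0.
    Returns the trajectory of the last step, sampled every dt.
    """
    def rhs(x):
        derivs = [rates[0] * (0.0 - x[0])]  # gamma * I(t) = 0 after the impulse
        for i in range(1, len(rates)):
            derivs.append(rates[i] * (x[i - 1] - x[i]))
        return derivs

    x = [0.0] * len(rates)
    x[0] = gamma * rates[0]
    trajectory = [x[-1]]
    for _ in range(int(round(t_end / dt))):
        k1 = rhs(x)
        k2 = rhs([xi + 0.5 * dt * k for xi, k in zip(x, k1)])
        k3 = rhs([xi + 0.5 * dt * k for xi, k in zip(x, k2)])
        k4 = rhs([xi + dt * k for xi, k in zip(x, k3)])
        x = [xi + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
             for xi, a, b, c, d in zip(x, k1, k2, k3, k4)]
        trajectory.append(x[-1])
    return trajectory


def fixed_rate_impulse_response(t, n, r, gamma=1.0):
    """Analytical output for n identical steps with rate r (integer n)."""
    return gamma * r ** n * t ** (n - 1) * math.exp(-r * t) / math.factorial(n - 1)


# Three identical steps: the simulated and analytical outputs agree at t = 2.
simulated = simulate_linear_pathway([1.5, 1.5, 1.5], gamma=1.0, t_end=2.0)
analytical = fixed_rate_impulse_response(2.0, n=3, r=1.5)
```

Passing heterogeneous rates to `simulate_linear_pathway` gives the kind of trajectory that the simplified models are fitted towards.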
Table 1. Model definition of linear pathways with different inputs. (doi: 10.1371/journal.pcbi.1007982.t001)
The models are all defined by Eqs 2 and 3, with inputs, I(t), and initial conditions, \bar{X}(0), as specified here.
Solutions to the fixed-rate model and the definition of the gamma model
Laplace transforms and the transfer function provide a way of finding a solution to the fixed-rate model with arbitrary inputs [52, 54]. To find this solution, we first assume that X_{i}(0) = 0 ∀i and apply the Laplace transform, \mathcal{L}, to both sides of the equations for the fixed-rate model, utilising the fact that the Laplace transform of a function’s derivative, f′, follows \mathcal{L}[f′] = s\mathcal{L}[f] − f(0). This leads to
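$$\begin{aligned}
s\mathcal{L}[X_{1}] &= r \cdot \left(\gamma \cdot \mathcal{L}[I(t)] - \mathcal{L}[X_{1}]\right), \\
s\mathcal{L}[X_{i}] &= r \cdot \left(\mathcal{L}[X_{i-1}] - \mathcal{L}[X_{i}]\right),
\end{aligned}$$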
which can be rearranged to get
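$$\begin{aligned}
\mathcal{L}[X_{1}] &= \gamma \cdot \frac{r}{s + r} \cdot \mathcal{L}[I(t)], \\
\mathcal{L}[X_{i}] &= \frac{r}{s + r} \mathcal{L}[X_{i-1}].
\end{aligned}$$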
This can be recursed to solve for n steps, leading to
$$\mathcal{L}[X_{n}] = \gamma \cdot \frac{r^{n}}{(s + r)^{n}} \cdot \mathcal{L}[I(t)].$$
From this, we can easily get the transfer function, G(s), of the fixed-rate model,
$$G(s) = \frac{\mathcal{L}[X_{n}(t)]}{\mathcal{L}[I(t)]} = \gamma \cdot \frac{r^{n}}{(r + s)^{n}}. \tag{23}$$
The impulse response function, g(t), is the inverse Laplace transform of this transfer function, given by
$$g(t) = \mathcal{L}^{-1}[G(s)] = \gamma \cdot \frac{r^{n} \cdot t^{n-1} \cdot e^{-rt}}{(n-1)!}.$$
This impulse response can be used to get a solution to X_{n}(t) for almost any input (the input needs to have a Laplace transform, even if we never have to calculate it). From Eq 23, we have that
$$\mathcal{L}[X_{n}(t)] = G(s)\mathcal{L}[I(t)].$$
This equation is still in the complex domain but we can use the connection between multiplication in the complex domain and convolution in the time domain to get the desired function
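$$\begin{aligned}
\mathcal{L}[X_{n}(t)] &= G(s)\mathcal{L}[I(t)] \\
&\Leftrightarrow \\
X_{n}(t) &= \int_{0}^{t} I(t') \, g(t - t') \, dt' \\
&= \int_{0}^{t} I(t') \, \gamma \cdot \frac{r^{n} \cdot (t - t')^{n-1} \cdot e^{-r \cdot (t - t')}}{(n-1)!} \, dt'.
\end{aligned}$$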
In order to improve the model’s ability to fit data, we can replace (n − 1)! with Γ(n). This makes no difference for integer n but it expands the possible domain of n to all real numbers greater than or equal to 1. After this, we end up with what we call the gamma model which can be expressed in either input-specific forms (e.g. Eq 11) or in its gamma-distributed delay form (Eq 12).
Model simulation
Differential equations were solved using an algorithm with stiffness detection that toggled between Tsitouras5 for non-stiff regions and Rosenbrock23 for stiff regions [57–59]. The integral of the gamma-distributed delay model (Eq 12) was evaluated using Gauss-Kronrod quadrature.
For the wave input we needed an initial concentration of X_{i}(0) = γ ∀i. However, since we assumed an all-zero initial concentration of the pathway in the definition of the model, it is built into the model that it starts from an all-zero state. This limitation can, however, often be circumvented by performing the simulation in two stages. The first stage applies the input required for the model to reach the proper ‘initial’ conditions for the second stage, which simulates the model with the input one actually wishes to study. For the wave input in this paper, this was achieved by using the input function
I(t)={1,t<01+sin(t),t≥0
and by allowing the integration to start at a negative t. The actual value used was based on the signal delay and the duration of the gamma model
$$t_{min} = t - \tau - 3\cdot\sigma = -\frac{n}{r} - 3\cdot\frac{\sqrt{n}}{r}.$$
This discards the effect of inputs that were early enough to have a very small impact on the current output value.
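The two-stage start can be sketched as follows. This is an illustrative Python fragment with hypothetical function names. It assumes that the delay and its spread are the mean n/r and the standard deviation √n/r of a gamma distribution with shape n and rate r, and that the start time is measured relative to t = 0.

```python
import math


def warmup_start_time(r, n, n_sigma=3.0):
    """Negative start time covering the mean delay plus n_sigma standard deviations."""
    tau = n / r                # mean delay of the gamma-distributed delay (assumed)
    sigma = math.sqrt(n) / r   # spread of the gamma-distributed delay (assumed)
    return -tau - n_sigma * sigma


def wave_input(t):
    """Constant input before t = 0, sinusoidal wave afterwards."""
    return 1.0 if t < 0 else 1.0 + math.sin(t)


if __name__ == "__main__":
    print(warmup_start_time(r=2.0, n=4.0))
    print(wave_input(-1.0), wave_input(math.pi / 2))
```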
Data generation
Synthetic data sets were generated using the fixed-step model (Eqs 2 and 3). The procedure was to first draw an integer between 1 and 50 to be used as the number of pathway steps, n_{data}. A response rate, r_{i}, was then randomly drawn for each of the n_{data} steps in the pathway. For most input types, the parameters r_{i} were generated by transforming the uniformly random variable Y_{i} ∼ U(−2, 1) according to r_{i} = 10^{Y_{i}}. However, since reaction rates that are too slow filter out the dynamics of the noise and wave inputs, the reaction rates for those input types were drawn from r_{i} ∼ 10^{U(−1, 1)} and r_{i} ∼ 10^{U(0, 1)}, respectively. The value of the scaling parameter, γ, was in all cases set to 1. The model was then run and the resulting trajectory of the last pathway step, X_{n}(t), was stored for use as the synthetic data set. This was repeated 5000 times for each input.
A similar scheme was used for the non-linear cascade model. There, 2000 sets of parameters were drawn according to n_{data} ∼ U{1, 20}, α_{i} ∼ 10^{U(−0.5, 1.5)}, β_{i} ∼ 10^{U(−1, 1)}, and γ_{i} ∼ 10^{U(−1, 1)}.
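The sampling scheme for the linear pathways can be sketched in a few lines of Python. The function name, the dictionary layout and the input-type labels are hypothetical, and the snippet only draws parameters; it does not run the pathway model.

```python
import random


def draw_linear_pathway(rng, input_type="step"):
    """Draw n_data and log-uniform response rates for one synthetic data set."""
    # Exponent ranges for r_i = 10^Y_i; noise and wave inputs use narrower ranges.
    exponent_ranges = {"noise": (-1.0, 1.0), "wave": (0.0, 1.0)}
    low, high = exponent_ranges.get(input_type, (-2.0, 1.0))
    n_data = rng.randint(1, 50)  # inclusive on both ends
    rates = [10.0 ** rng.uniform(low, high) for _ in range(n_data)]
    return {"n_data": n_data, "rates": rates, "gamma": 1.0}


if __name__ == "__main__":
    rng = random.Random(1)
    pathway = draw_linear_pathway(rng, "wave")
    print(pathway["n_data"], min(pathway["rates"]), max(pathway["rates"]))
```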
Fitness definition
In order to automatically evaluate the fitness of a model, an ‘integral cost’ function was defined. The idea of this cost function is to measure the mismatch in the area under the curve for the model and the data (Fig 12). This is a form of ℓ1 norm that ensures an even sampling of the data for cost evaluation. We define this cost value as
$$C = \frac{\int_{-\infty}^{\infty} |X_{model}(t) - X_{data}(t)|\,dt}{\int_{-\infty}^{\infty} |X_{data}(t)|\,dt}.$$
Fig 12. Demonstrating the cost value and saturation score. (doi: 10.1371/journal.pcbi.1007982.g012)
(A) The cost value measures the area mismatch between the model output and that of the data. Linear models used the area under the data curve for normalisation and the kinase cascade models used the union of the data and model areas. (B) The saturation score indicates the degree to which a kinase cascade is saturated. If the curve is the output of the system, the saturation score is the fraction of the indicated square that is filled by the shaded region under the curve.
Numerically, C was calculated using Riemann sums and interpolations of both the data and the model solution. The evaluations started at t = 0 and, for the ‘Impulse’, ‘Step’, and ‘Piecewise’ input types, they continued until the model derivatives were close to zero (absolute and relative tolerances of 10^{−8} and 10^{−6}, respectively) or until t = 5000, whichever came first. The ‘Piecewise’ input simulations were additionally prohibited from stopping before the end of the piecewise constant input. Since the ‘Ramp’, ‘Wave’ and ‘Noise’ inputs do not allow for equilibration, we set fixed stopping times of t = 500, 30 and 200, respectively.
Much of the analysis was also repeated using the more commonly used normalised least square cost function. While the results were slightly different, this did not change any of the conclusions in this work.
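A Riemann-sum evaluation of the integral cost might look like the Python sketch below. It assumes that the model and data are available as callables (for instance interpolants) on a finite window, and it uses a uniform midpoint rule. The function name and the step count are illustrative choices, not taken from the paper.

```python
import math


def integral_cost(model_fn, data_fn, t_start, t_stop, steps=10000):
    """Area between model and data, normalised by the area under the data."""
    h = (t_stop - t_start) / steps
    mismatch = 0.0
    data_area = 0.0
    for k in range(steps):
        t = t_start + (k + 0.5) * h  # midpoint of the k-th sub-interval
        mismatch += abs(model_fn(t) - data_fn(t)) * h
        data_area += abs(data_fn(t)) * h
    return mismatch / data_area


if __name__ == "__main__":
    data = lambda t: math.exp(-t)
    model = lambda t: 0.5 * math.exp(-t)
    # The model is half the data everywhere, so half of the area is missing.
    print(integral_cost(model, data, 0.0, 20.0))
```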
Model optimisation
The model parameters were all optimised to reproduce the time-trajectory of each synthetic data set. The optimisation target was to minimise the cost value, C, described above. For the actual optimisation, we used an adaptive differential evolution algorithm following the /rand/1/bin/ scheme, with a radius limited sampler that took 2,000 steps per free parameter of the model [60, 61]. The search space for the parameters of the different models was chosen to allow for the time delays that the synthetic data could generate (Table 2). The sampling space was linear for n and logarithmic for the other parameters.
Table 2. Parameter search space for the optimisation of the different models. (doi: 10.1371/journal.pcbi.1007982.t002)

| Model | γ | r | n | τ |
|---|---|---|---|---|
| fixed-step | [10^{−3}, 10^{3}] | [10^{−3}, 10^{3}] | N/A | N/A |
| fixed-rate | [10^{−1}, 10^{1}] | [10^{−2}, 10^{2}] | {1..30} | N/A |
| gamma | [10^{−4}, 10^{2}] | [10^{−3}, 10^{3}] | [1, 31] | N/A |
| DDE | [10^{−4}, 10^{2}] | [10^{−4}, 10^{1}] | N/A | [10^{−2}, 10^{4}] |
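To illustrate the /rand/1/bin/ scheme and the mix of linear and logarithmic search spaces, here is a deliberately simplified Python sketch. It is not the adaptive, radius-limited variant from BlackBoxOptim.jl that the authors used. The population size, the weight F = 0.7, the crossover rate 0.9 and the toy cost function are all assumptions chosen for illustration.

```python
import math
import random


def differential_evolution(cost, bounds, log_scale, pop_size=20, generations=200,
                           weight=0.7, crossover=0.9, seed=0):
    """Plain (non-adaptive) rand/1/bin differential evolution."""
    rng = random.Random(seed)
    dim = len(bounds)
    # Search coordinates: log10 for logarithmic parameters, raw values otherwise.
    lo = [math.log10(b[0]) if is_log else b[0] for b, is_log in zip(bounds, log_scale)]
    hi = [math.log10(b[1]) if is_log else b[1] for b, is_log in zip(bounds, log_scale)]

    def decode(z):
        return [10.0 ** v if is_log else v for v, is_log in zip(z, log_scale)]

    pop = [[rng.uniform(lo[j], hi[j]) for j in range(dim)] for _ in range(pop_size)]
    costs = [cost(decode(z)) for z in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # rand/1: three distinct random individuals, none equal to the target.
            a, b, c = rng.sample([k for k in range(pop_size) if k != i], 3)
            forced = rng.randrange(dim)  # bin: at least one mutated coordinate
            trial = []
            for j in range(dim):
                if j == forced or rng.random() < crossover:
                    v = pop[a][j] + weight * (pop[b][j] - pop[c][j])
                    v = min(max(v, lo[j]), hi[j])  # clip to the search space
                else:
                    v = pop[i][j]
                trial.append(v)
            trial_cost = cost(decode(trial))
            if trial_cost <= costs[i]:  # greedy one-to-one replacement
                pop[i], costs[i] = trial, trial_cost
    best = min(range(pop_size), key=costs.__getitem__)
    return decode(pop[best]), costs[best]


if __name__ == "__main__":
    # Toy cost with its minimum at r = 10^0.5 and n = 7.3 (illustrative only).
    toy = lambda p: (math.log10(p[0]) - 0.5) ** 2 + (p[1] - 7.3) ** 2
    print(differential_evolution(toy, [(1e-3, 1e3), (1.0, 31.0)], [True, False]))
```

Searching in log10 coordinates means that each decade of a rate parameter receives equal sampling weight, which is the usual motivation for a logarithmic search space.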
Non-linear analysis
The non-linear kinase cascade models were defined by Eqs 16 and 17, as described in Results. A pulse input was applied to initially inactive cascades for both the data generation and the model simulations,
$$I(t) = \begin{cases} \xi, & \text{if } 0 < t < 1 \\ 0, & \text{otherwise.} \end{cases}$$
Here, ξ was set to 1 except for when it was used to scale the input, as stated in Results. In total, 2000 data sets were generated in which the data consisted of a single trajectory for a single input, and 500 were generated in which each data set consisted of three trajectories for three differently scaled inputs. Simulations were done using DifferentialEquations.jl’s Rosenbrock23 ODE solver [58, 59]. A few of the proposed parameter sets for data generation caused numerical errors in the differential equation solver and were re-drawn.
The normalisation of the cost function was revised for the non-linear study to ensure that the cost was symmetric with respect to whether the data or the model curve was the larger one. The revised cost function was defined by
$$C = \frac{\int_{-\infty}^{\infty} |X_{model}(t) - X_{data}(t)|\,dt}{\int_{-\infty}^{\infty} \max\left(X_{data}(t), X_{target}(t)\right)dt}.$$
The number of optimisation steps per parameter was increased in the non-linear case to 5,000. The parameter search range is defined in Table 3. The fixed-rate model had a smaller search range since the repeated application of steps with very high signal gain or dampening was neither relevant nor numerically stable. Some parameter sets proposed by the optimiser resulted in prematurely terminated simulations due to numerical issues. When this happened during optimisation, the parameter set was rejected by assigning it an infinite cost. Such errors also occurred during the analysis of model predictiveness (Fig 11), and when they did, the offending parameter set was excluded from the analysis (8% for the two-step model and 3% for the fixed-rate model).
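The symmetric normalisation and the infinite-cost rejection can be sketched together in Python. The sketch assumes that the second curve in the denominator is the model output, which is how the Fig 12 caption describes the normalisation (‘the union of the data and model areas’). The names `symmetric_integral_cost`, `cost_or_reject` and `simulate` are hypothetical.

```python
import math


def symmetric_integral_cost(model_fn, data_fn, t_start, t_stop, steps=10000):
    """Area mismatch normalised by the area under the larger of the two curves."""
    h = (t_stop - t_start) / steps
    mismatch = 0.0
    envelope = 0.0
    for k in range(steps):
        t = t_start + (k + 0.5) * h  # midpoint Riemann sum
        m, d = model_fn(t), data_fn(t)
        mismatch += abs(m - d) * h
        envelope += max(m, d) * h
    return mismatch / envelope


def cost_or_reject(simulate, params, data_fn, t_start, t_stop):
    """Return the cost, or infinity when the simulation fails numerically."""
    try:
        model_fn = simulate(params)  # hypothetical: returns a callable trajectory
        cost = symmetric_integral_cost(model_fn, data_fn, t_start, t_stop)
    except (ArithmeticError, ValueError):
        return math.inf
    return cost if math.isfinite(cost) else math.inf


if __name__ == "__main__":
    one, two = (lambda t: 1.0), (lambda t: 2.0)
    # Swapping model and data leaves the cost unchanged.
    print(symmetric_integral_cost(one, two, 0.0, 1.0),
          symmetric_integral_cost(two, one, 0.0, 1.0))
```

An optimiser that minimises the returned value will never select a parameter set that was assigned an infinite cost.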
Table 3. Parameter search space for the optimisation of the non-linear kinase cascade models. (doi: 10.1371/journal.pcbi.1007982.t003)

| Model | α | β | X^{tot} | n |
|---|---|---|---|---|
| Fixed-step | [10^{−8}, 10^{5}] | [10^{−8}, 10^{5}] | [10^{−8}, 10^{5}] | N/A |
| Fixed-rate | [10^{−2}, 10^{2}] | [10^{−3}, 10^{3}] | [10^{−6}, 10^{1}] | {1..20} |
We defined a ‘saturation score’ to reflect the degree to which the dynamics of a cascade were dominated by saturation. For this, we examined the top half of the output signal peak and divided its area by that of a square which encompasses this half peak (Fig 12B).
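One possible reading of this definition is sketched below in Python. It is an interpretation rather than the authors' implementation: the ‘square’ is taken to be the bounding rectangle of the part of the curve above half of the peak height, and the trajectory is assumed to be sampled at increasing time points with a single peak.

```python
def saturation_score(times, values):
    """Fraction of the half-peak bounding box that lies under the curve."""
    peak = max(values)
    half = 0.5 * peak
    above = [i for i, v in enumerate(values) if v >= half]
    first, last = above[0], above[-1]
    width = times[last] - times[first]
    if peak <= 0 or width <= 0:
        return 0.0
    area = 0.0
    for i in range(first, last):
        # Trapezoidal area of the part of the curve that exceeds the half level.
        y0 = max(values[i] - half, 0.0)
        y1 = max(values[i + 1] - half, 0.0)
        area += 0.5 * (y0 + y1) * (times[i + 1] - times[i])
    return area / (width * (peak - half))


if __name__ == "__main__":
    ts = [(k - 1000) / 1000 for k in range(2001)]
    triangle = [1.0 - abs(t) for t in ts]                   # pointed, unsaturated peak
    plateau = [1.0 if abs(t) <= 0.5 else 0.0 for t in ts]   # flat, saturated peak
    print(saturation_score(ts, triangle), saturation_score(ts, plateau))
```

Under this reading, a flat-topped output fills its box completely and scores close to 1, while a sharp triangular peak scores about 0.5.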
Software availability
All the computational results were generated using the Julia programming language [62]. Differential equations were solved using DifferentialEquations.jl, Gauss-Kronrod quadrature was done with QuadGK.jl, and optimisation was done with BlackBoxOptim.jl [59, 61]. Source code for the reproduction of the results is openly available under the MIT licence at the Sainsbury Laboratory GitLab repository https://gitlab.com/slcu/teamHJ/publications/Korsbo_et_al_2020. This repository also includes documentation and a tutorial aimed at making reproduction and reuse easy.
Supporting information
S1 Fig. Fitting one-step models to data from linear pathways of different length.
(A-E) The worst model/data fits for a given length, n_{data}, of the model that generated the data. Coloured lines show simulations of the fitted model while black lines show the synthetic data. (F) The cost value for 5000 parameter sets, each optimised towards a different set of synthetic data. Circles show the cost values resulting from data wherein all the steps in the data-generating linear pathway were randomly drawn, r_{i} ∼ 10^{U(−2, 1)} ∀i. Stars are the cost values from data wherein the pathway has homogeneous reaction rates, r_{i} = 1 ∀i. The x-axis shows the number of steps in the model which were used to generate the data. (G-K) Examples of the best model/data fits for different data pathway lengths, n_{data}.
(TIF)
S2 Fig. Fitting three-step models to data from linear pathways of different length.
Description otherwise as in S1 Fig.
(TIF)
S3 Fig. Fitting four-step models to data from linear pathways of different length.
Description otherwise as in S1 Fig.
(TIF)
S4 Fig. Fitting five-step models to data from linear pathways of different length.
Description otherwise as in S1 Fig.
(TIF)
S5 Fig. Fitting gamma models to data from linear pathways of different length.
The gamma model (Eq 11) with n_{model} ∈ ℝ efficiently recapitulates linear pathway dynamics. (A-E) Examples of the worst model-data fits for different lengths of the data-generating pathway, n_{data}. Black lines show the synthetic data, orange lines show the results of the gamma model, and grey lines show the results of a two-step model (Eqs 2 and 3). (F) The ability of the gamma model to fit the data changes with the length of the underlying pathway, n_{data}. Blue dots show the cost value of the gamma model. The small grey dots show the corresponding cost values for a two-step model. (G-K) Examples of the best model-data fits.
(TIF)
S6 Fig. A comparison of the cost values when using either the fixed-rate model (Eqs 8 and 9) or a fixed-step model (Eqs 2 and 3) to fit the same data.
Each circle represents a single synthetic data set. 20 synthetic data sets were generated for each n_{data} ∈ {1, …, 50} using the impulse input (Table 1). Each circle’s position indicates the optimised cost for the respective models and its colour indicates the n_{data} value of the data-generating pathway. Percentages indicate how many of the data sets had a higher (worse) cost value for the respective models.
(TIF)
We would like to thank Torkel Loman and members of the Jönsson group for fruitful discussions; Torkel Loman, Ross Carter and James Locke for feedback on the manuscript; and Alexey Chizh, whose master’s project on the simplification of different pathways never made it into the paper but was still informative.
References
1. Uri Alon.
2. Jürgen Mayer, Khaled Khairy, and Jonathon Howard. Drawing an elephant with four complex parameters.
3. Bree B. Aldridge, John M. Burke, Douglas A. Lauffenburger, and Peter K. Sorger. Physicochemical modelling of cell signalling pathways.
4. Juliane Liepe, Chris Barnes, Erika Cule, Kamil Erguler, Paul Kirk, Tina Toni, and Michael P.H. Stumpf. ABC-SysBio—approximate Bayesian computation in Python with GPU support.
5. Nick Pullen and Richard J. Morris. Bayesian model comparison and parameter inference in systems biology using nested sampling.
6. Attila Gábor and Julio R. Banga. Robust and efficient parameter estimation in dynamic models of biological systems.
7. Jérémy Gruel, Benoit Landrein, Paul Tarr, Christoph Schuster, Yassin Refahi, Arun Sampathkumar, Olivier Hamant, Elliot M. Meyerowitz, and Henrik Jönsson. An epidermis-driven mechanism positions and scales stem cell niches in plants.
8. E.H. Flach and S. Schnell. Use and abuse of the quasi-steady-state approximation.
9. Eric H. Davidson.
10. Timothy S. Gardner, Charles R. Cantor, and James J. Collins. Construction of a genetic toggle switch in Escherichia coli.
11. Michael B. Elowitz and Stanislas Leibler. A synthetic oscillatory network of transcriptional regulators.
12. Reinhart Heinrich, Benjamin G Neel, and Tom A Rapoport. Mathematical models of protein kinase signal transduction.
13. D W Hommes, M P Peppelenbosch, and S J H van Deventer. Mitogen activated protein (MAP) kinase signal transduction pathways and novel anti-inflammatory targets.
14. G Pearson, F Robinson, T Beers Gibson, B E Xu, M Karandikar, K Berman, and M H Cobb. Mitogen-activated protein (MAP) kinase pathways: regulation and physiological functions.
15. Jorrit J Hornberg, Bernd Binder, Frank J Bruggeman, Birgit Schoeberl, Reinhart Heinrich, and Hans V Westerhoff. Control of MAPK signalling: from complexity to what really matters.
16. Oliver E. Sturm, Richard Orton, Joan Grindlay, Marc Birtwistle, Vladislav Vyshemirsky, David Gilbert, Muffy Calder, Andrew Pitt, Boris Kholodenko, and Walter Kolch. The mammalian MAPK/ERK pathway exhibits properties of a negative feedback amplifier.
17. Yoosik Kim, Ze’ev Paroush, Knud Nairz, Ernst Hafen, Gerardo Jiménez, and Stanislav Y. Shvartsman. Substrate-dependent control of MAPK phosphorylation in vivo.
18. Alexander T Piala, John M Humphreys, and Elizabeth J Goldsmith. MAP kinase modules: the excursion model and the steps that count.
19. Carlos Salazar and Thomas Höfer. Versatile regulation of multisite protein phosphorylation by the order of phosphate processing and protein-protein interactions: Kinetic models of multisite phosphorylation.
20. Carlos Salazar and Thomas Höfer. Multisite protein phosphorylation–from molecular mechanisms to kinetic models.
21. Hsiu-Cheng Hung, Christian Maurer, Daniela Zorn, Wai-Ling Chang, and Frank Weber. Sequential and compartment-specific phosphorylation controls the life cycle of the circadian CLOCK protein.
22. Sara Ben Khaled, Jelle Postma, and Silke Robatzek. A moving view: subcellular trafficking processes in pattern recognition receptor-triggered plant immunity.
23. Daniel Couto and Cyril Zipfel. Regulation of pattern recognition receptor signalling in plants.
24. Jose MG Vilar, Ronald Jansen, and Chris Sander. Signal processing in the TGF-β superfamily ligand-receptor network.
25. Pontus Melke, Henrik Jönsson, Evangelia Pardali, Peter ten Dijke, and Carsten Peterson. A rate equation approach to elucidate the kinetics and robustness of the TGF-β pathway.
26. Shabnam Khatibi, Hong-Jian Zhu, John Wagner, Chin Wee Tan, Jonathan H. Manton, and Antony W. Burgess. Mathematical model of TGF-β signalling: feedback coupling is consistent with signal switching.
27. Mariano Beguerisse-Díaz, Radhika Desikan, and Mauricio Barahona. Linear models of activation cascades: analytical solutions and coarse-graining of delayed signal transduction.
28. Madalena Chaves, Eduardo D Sontag, and Robert J Dinerstein. Optimal length and signal amplification in weakly activated signal transduction cascades.
29. Béla Novák and John J Tyson. Design principles of biochemical oscillators.
30. Y. N. Kyrychko, K. B. Blyuss, and E. Schöll. Amplitude and phase dynamics in oscillators with distributed-delay coupling.
31. Jan Rombouts, Alexandra Vandervelde, and Lendert Gelens. Delay models for the early embryonic cell cycle oscillator.
32. A.N. Gorban and O. Radulescu. Chapter 3 Dynamic and Static Limitation in Multiscale Reaction Networks, Revisited.
33. O. Radulescu, A. N. Gorban, A. Zinovyev, and V. Noel. Reduction of dynamical biochemical reactions networks in computational biology.
34. Friedrich G. Helfferich. Systematic approach to elucidation of multistep reaction networks.
35. Jean Hausser, Avi Mayo, Leeat Keren, and Uri Alon. Central dogma rates and the trade-off between precision and economy in gene expression.
36. Joanne R. Collier, Nicholas A.M. Monk, Philip K. Maini, and Julian H. Lewis. Pattern Formation by Lateral Inhibition with Feedback: a Mathematical Model of Delta-Notch Intercellular Signalling.
37. Sean P. Gordon, Vijay S. Chickarmane, Carolyn Ohno, and Elliot M. Meyerowitz. Multiple feedback loops through cytokinin signaling control stem cell number within the Arabidopsis shoot meristem.
38. Naama Barkai and Stan Leibler. Robustness in simple biochemical networks.
39. Andreas V M Herz, Sebastian Bonhoeffer, Roy M Anderson, Robert M May, and Martin A Nowak. Viral dynamics in vivo: Limitations on estimates of intracellular delay and virus decay.
40. Gennadii A. Bocharov and Fathalla A. Rihan. Numerical modelling in biosciences using delay differential equations.
41. Hiroshi Momiji and Nicholas A.M. Monk. Dissecting the dynamics of the hes1 genetic oscillator.
42. Xinfeng Liu, Sara Johnson, Shou Liu, Deepak Kanojia, Wei Yue, Udai Singn, Qian Wang, Qing Nie, and Hexin Chen. Nonlinear growth kinetics of breast cancer stem cells: Implications for cancer stem cell targeted therapy.
43. Radojka M Savic, Daniël M Jonker, Thomas Kerbusch, and Mats O Karlsson. Implementation of a transit compartment model for describing drug absorption in pharmacokinetic studies.
44. John E. Mittler, Bernhard Sulzer, Avidan U. Neumann, and Alan S. Perelson. Influence of delayed viral production on viral dynamics in HIV-1 infected patients.
45. Ozgur E Akman, James C W Locke, Sanyi Tang, Isabelle Carré, Andrew J Millar, and David A Rand. Isoform switching facilitates period control in the Neurospora crassa circadian clock.
46. Ozgur E. Akman, David A. Rand, Paul E. Brown, and Andrew J. Millar. Robustness from flexibility in the fungal circadian clock.
47. Isao T. Tokuda, Ozgur E. Akman, and James C.W. Locke. Reducing the complexity of mathematical models for the plant circadian clock by distributed delays.
48. Xinyu Song and Lansun Chen. Persistence and global stability for nonautonomous predator-prey system with diffusion and time delay.
49. Gábor Orosz, R. Eddie Wilson, Róbert Szalai, and Gábor Stépán. Exciting traffic jams: Nonlinear phenomena behind traffic jam formation on highways.
50. Jeremy Dufourt, Antonio Trullo, Jennifer Hunter, Carola Fernandez, Jorge Lazaro, Matthieu Dejean, Lucas Morales, Saida Nait-Amer, Katharine N. Schulz, Melissa M. Harrison, Cyril Favard, Ovidiu Radulescu, and Mounia Lagha. Temporal control of gene expression by the pioneer factor Zelda through transient interactions in hubs.
51. Yaron E. Antebi, Nagarajan Nandagopal, and Michael B. Elowitz. An operational view of intercellular signaling pathways.
52. Katsuhiko Ogata.
53. Jacqueline Monaghan, Susanne Matschi, Oluwaseyi Shorinola, Hanna Rovenich, Alexandra Matei, Cécile Segonzac, Frederikke Gro Malinovsky, John P. Rathjen, Dan MacLean, Tina Romeis, and Cyril Zipfel. The Calcium-Dependent Protein Kinase CPK28 Buffers Plant Immunity and Regulates BIK1 Turnover.
54. K. F. Riley, M. P. Hobson, and S. J. Bence.
55. Jeremy Gunawardena. Models in biology: ‘accurate descriptions of our pathetic thinking’.
56. Daniel T. Gillespie. A general method for numerically simulating the stochastic time evolution of coupled chemical reactions.
57. Ch. Tsitouras. Runge–Kutta pairs of order 5(4) satisfying only the first column simplifying assumption.
58. Lawrence F Shampine and Mark W Reichelt. The MATLAB ODE suite.
59. Christopher Rackauckas and Qing Nie. DifferentialEquations.jl—a performant and Feature-Rich ecosystem for solving differential equations in Julia.
60. Rainer Storn and Kenneth Price. Differential evolution—a simple and efficient heuristic for global optimization over continuous spaces.
61. Robert Feldt and Alexey Stukalov. BlackBoxOptim.jl. https://github.com/robertfeldt/BlackBoxOptim.jl, v0.5.0, 2019.
62. Jeff Bezanson, Alan Edelman, Stefan Karpinski, and Viral B. Shah. Julia: A fresh approach to numerical computing.

Decision Letter 0
doi: 10.1371/journal.pcbi.1007982.r001
Douglas A Lauffenburger, Deputy Editor
Christopher V. Rao, Associate Editor
© 2020 Lauffenburger, Rao. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Submission Version: 0
7 Jan 2020
Dear Dr Jonsson,
Thank you very much for submitting your manuscript 'It's about time: Analysing an alternative approach for reductionist modelling of linear pathways in systems biology' for review by PLOS Computational Biology. Your manuscript has been fully evaluated by the PLOS Computational Biology editorial team and in this case also by independent peer reviewers. The reviewers appreciated the attention to an important problem, but raised some substantial concerns about the manuscript as it currently stands. While your manuscript cannot be accepted in its present form, we are willing to consider a revised version in which the issues raised by the reviewers have been adequately addressed. We cannot, of course, promise publication at that time.
Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.
Your revisions should address the specific points made by each reviewer. Please return the revised version within the next 60 days. If you anticipate any delay in its return, we ask that you let us know the expected resubmission date by email at ploscompbiol@plos.org. Revised manuscripts received beyond 60 days may require evaluation and peer review similar to that applied to newly submitted manuscripts.
In addition, when you are ready to resubmit, please be prepared to provide the following:
(1) A detailed list of your responses to the review comments and the changes you have made in the manuscript. We require a file of this nature before your manuscript is passed back to the editors.
(2) A copy of your manuscript with the changes highlighted (encouraged). We encourage authors, if possible, to show clearly where changes have been made to their manuscript, e.g. by highlighting text.
(3) A striking still image to accompany your article (optional). If the image is judged to be suitable by the editors, it may be featured on our website and might be chosen as the issue image for that month. These square, high-quality images should be accompanied by a short caption. Please note as well that there should be no copyright restrictions on the use of the image, so that it can be published under the Open-Access license and be subject only to appropriate attribution.
Before you resubmit your manuscript, please consult our Submission Checklist to ensure your manuscript is formatted correctly for PLOS Computational Biology: http://www.ploscompbiol.org/static/checklist.action. Some key points to remember are:
- Figures uploaded separately as TIFF or EPS files (if you wish, your figures may remain in your main manuscript file in addition).
- Supporting Information uploaded as separate files, titled Dataset, Figure, Table, Text, Protocol, Audio, or Video.
- Funding information in the 'Financial Disclosure' box in the online system.
While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.
To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see here.
We are sorry that we cannot be more positive about your manuscript at this stage, but if you have any concerns or questions, please do not hesitate to contact us.
Sincerely,
Christopher V. Rao
Associate Editor
PLOS Computational Biology
Douglas Lauffenburger
Deputy Editor
PLOS Computational Biology
Reviewer's Responses to Questions
Comments to the Authors:
Please note here if the review is uploaded as an attachment.
Reviewer #1: General comment:
In biochemical network modeling, it is a common practice to replace a long sequence of reactions with fewer reaction steps. Such simplification often involves lumping and rescaling of rate parameters to keep the reduced network consistent with the original system. This manuscript points out the caveats of such simplification and provides a few alternatives. The main analyses and findings of this work are summarized below.
The manuscript considers a linear pathway consisting of N reaction steps. It also considers another linear pathway consisting of n steps (n ≠ N) that serves as a model for the N-step pathway. By fitting the n-step model to (synthetic) data produced by the N-step pathway, the authors investigate how well the model recapitulates the system’s dynamic behaviors in response to several inputs. It appears that the model performs poorly when n << N (most analyses involve N = 50, n = 2), especially when the steps are assigned with similar or identical rates. The authors then propose an alternative model, where n is treated as a free variable whose value is determined by optimization while the reaction rates of all steps are held fixed. In the majority of the cases, this alternative model (fixed-rate model) outperforms a two-step model. Finally, the authors provide an analytical form for the fixed-rate model for a simple case where the system input is zero. This analytical form, called the beta model, is not subject to the constraint that the number of steps n must be an integer. This beta model outperforms both the fixed-rate and the two-step model.
The analyses are interesting and the reviewer thinks the manuscript is well-written. Nevertheless, I have several questions, as listed below.
Major comments:
1) The equations indicate that the authors only consider first-order reactions in describing their linear pathway. However, most biochemical transformations (steps) are better described by the Michaelis-Menten rate law. For example, a phosphorylation reaction involves a transient complex formation between the substrate protein and an enzyme (kinase). Such steps are prevalent in the examples provided by the authors in the introduction: MAPK activation, transcription or translation by RNA polymerase or ribosome, sequential protein phosphorylation in various pathways, etc. Could the authors provide examples of published models that treated these steps as first-order reactions?
Are the analyses and conclusions still valid if the Michaelis-Menten rate law is used to describe the steps in the sequential pathway?
2) The analyses provided are based on generic or toy examples. It would be better if the authors provided a case study involving a specific biological pathway. Examples could be published models where such simplification was applied.
Most analyses involve a 50-step vs. a 2-step model (N = 50 vs. n = 2). A 50-step linear pathway without an intermediate crosstalk/non-linear component/feedback seems unlikely in cell signaling or gene transcription. The result indicates that the two-step model performs quite well when N < 10, and the two-step model is almost as good as the fixed-rate model in the range N = 5–10 (Fig. 4). In a more realistic scenario, where N < 10, is the two-step model adequate?
3) The optimization evaluated the number of steps for the fixed-rate model. This should lead to many more steps in the fixed-rate model compared to the two-step model when N = 50 (Fig. 5 and 6). Is it not expected that the fixed-rate model outperforms the two-step model because the former involves more steps? The analysis shows that the fixed-rate model outperformed the 2-step model in approximately 90% of the cases. How would this result change if the fixed-rate model was compared against a 5-step model? Or if the data was generated using a 10-step pathway (N = 10)?
4) “---- it is easy to find examples where it has been used to simplify multi-step reactions such as protein production (e.g. 7, 31–33); protein-to-protein signalling networks (e.g. 7, 32–34); protein modifications such as phosphorylations, methylations, and ubiquitinations (e.g. 33, 34)”
--- It seems like these simple models (cited references) did not truncate or simplify a linear multistep reaction, rather the goal could be to use a minimal model to describe specific data or experimentally-observed phenomena. Are these citations relevant in the context of this work?
Minor comments:
1) In the DDE model, the delay time is enforced in the final step. Does it make any difference if the optimization is allowed to choose where to introduce the delay time? What if more steps are allowed with fixed or heterogeneous delay times?
2) Font sizes in the legends of Fig. 5d and 6E are too small to read.
3) In the result section, figures are often referenced randomly without following any specific order.
Reviewer #2: This paper focuses on the impact of the number of reaction steps in linear biochemical pathways. The authors introduce a framework for linear pathway simulations in order to compare several reduction operations over such pathways: reduction of the number of reaction steps, introduction of several classes of delays. Based on their simulations on synthetic datasets, the authors conclude that simplifying to a three-parameter model (more precisely, scaling, pathway length, homogeneous response rate of the pathway) outperforms both state-of-the-art reduction approaches.
The paper is well-written, interesting and technically correct. However, I have doubts on its applicability on real models.
The first issue raised by the paper is that the main assumption of the authors is that a linear chain of reactions can be modeled with a linear differential system, without taking into account any non-linearity effects. The study performed by the authors concerns only the biological systems for which this assumption is true. The paper would be more convincing if the authors could detail the number of biological models (for instance in the BioModels database) for which this linear model assumption is used.
A second issue is that the synthetic data used to conduct all the experimentation were generated according to a linear ODE model with different pathway lengths (from 1 to 50) and a log-uniform distribution for response rates. How valid are these hypotheses with respect to existing ODE models of biological processes in the literature? Simulating several published linear models to prove that the synthetic data are realistic seems necessary to validate the approach.
A third issue is that, as mentioned by the authors, linear pathways occur in biological models combined with self-regulatory controls that may impact the input signal. Therefore, the model of the input signal deserves very special attention to validate the conclusions of the authors. Addressing this issue could be done, for instance, by applying the three reduction methods studied in the paper to a family of complex published and validated models in order to figure out if the conclusions are still valid for input functions controlled by the system dynamics.
Finally, the authors advocate that software is associated with their publication. It appears that this software is code in the Julia programming language allowing one to reproduce the simulations shown in the paper. It should be clarified that the software is not a tool for model reduction (and parameter fitting) as could be expected while reading the paper.
Reviewer #3: The manuscript addresses the problem of the number of steps in a simplified, linear model of signaling with no branching. This is an important issue and may guide the modeler’s choice. The main result of the paper is that a “fixed rate” model with a variable number of steps is well suited for fitting a broad corpus of data. This idea has already been used in the systems biology literature but never tested systematically. The submitted work has the merit of testing this reduction ansatz. Nevertheless, the conclusions on the applicability of the “fixed rate” ansatz are entirely based on numerical simulations instead of precise mathematical estimates. In many places, the manuscript can be substantially improved by precise definitions and more rigorous specification of the domain of validity of the results. A list of issues that need major amendments follows:
1) The idea to use the number of steps as a free parameter has already been used for a chain model of transcription in Dufourt et al Nature comm. (2018) 9: 5194, where the authors chose the number of steps and the homogeneous or heterogeneous rates on the basis of the quality of fitting and the variability of optimal and suboptimal parameters; for the optimal number of steps a heterogeneous rate hypothesis is rejected on the basis of parameter variability. These similar ideas should be cited in the manuscript.
2) The type of model and the meaning of linearity should be made more precise in summary and introduction. Which type of models are dealt with, stochastic or deterministic? The authors use the name “linear pathways” to speak of a special case of first order chemical reaction networks (CRNs), first introduced by Heinrich et al (2002) ref. [12]. In this context linearity may mean both “first-order” and unbranched topology.
3) The notion of “step” is crucial in the paper and needs careful discussion.
4) It is not true that little effort has been made to understand pathway truncation. Mathematical theories of pathway truncation are available for monomolecular CRNs (Gorban and Radulescu Adv.Chem.Eng. (2008) 34:103), first order CRNs (Helfferich J.Phys.Chem. (1989) 93:6676), and more general CRNs (Radulescu et al, Front. Genet. (2012) 3:131).
5) The scaling leading from Eqs (1) to (2) and (3) should be explicitly given in the main text (this is not available in the methods).
6) Clearly specify that the output is Xn(t).
7) Eq. at line 120 is valid only if Xi(0)=0 for i >1.
8) At various places fitting is said to be guaranteed perfect. But fitting results from a numerical scheme; even if theoretically a perfect fit is possible, the scheme may not find it. At other places fitting is said to perform well. What does this mean quantitatively and on which statistical grounds?
9) Some criteria are proposed to identify “detrimentally truncated models”, such as the amplitude and width of the output signal and the variability in the rate parameters. This section is important and should be treated with more care. What quantitative recipes should be used here?
10) Eq (6) corresponds to a singular delay model, whereas the fixed rate model leads to a distributed delay model. This should be clearly stated. The explanation of the failure of the singular delay model (lines 218-220) is obscure.
11) The r parameter of the fixed rate model is said to represent the slowest step. This is rigorously true for separated rates but not necessarily true for similar rates. I don’t think that Fig 4 represents a rigorous proof of this correlation. Computing the correlation with the next slowest steps and with the harmonic mean, and comparing them with the slowest step, would be a proof.
12) A gamma model with a real shape parameter is proposed. This implies a non-integer number of steps. What does it mean?
13) How the restriction to zero initial concentration can be lifted (lines 297-300) is not clearly explained.
14) In the conclusion (also in the introduction) it is said that wrong delays can destabilize oscillations. To which extent is this true? Can the authors provide examples where delays with the same mean, but different distributions have different effect on the stability of oscillations? Some results from delay differential equations could be invoked.
15) Methods should be proofread. Eq.11 seems to be used for proving a number of statements that are by no means clear. What is the meaning of the linear dependence of the production terms (line 380)? The whole paragraph between lines 378 and 390 is obscure. Same for lines 438-443, 447-450 (the l1 norm does not have the same advantages?). What are the “interpolations” at line 452?
16) How is the heterogeneity parameter delta defined for the general distribution of rates? The definition in the figures is based on equidistant rates in log scale, which is a very special choice.
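The distinction raised in points 10 and 12 can be illustrated with a minimal sketch. It is not taken from the manuscript or its Julia code; the function names `gamma_kernel` and `chain_output`, the RK4 integrator, and the parameter values are assumptions made for illustration. The sketch assumes a fixed rate chain dX_1/dt = -r X_1, dX_i/dt = r (X_{i-1} - X_i) with X_1(0) = 1 and X_i(0) = 0 for i > 1, for which r X_n(t) coincides with a gamma (Erlang) density of shape n and rate r, i.e. a distributed delay kernel. The same density formula remains well defined when the shape is not an integer.

```python
import math


def gamma_kernel(t, shape, rate):
    """Gamma probability density; 'shape' may be any positive real number."""
    if t <= 0.0:
        return 0.0
    return rate ** shape * t ** (shape - 1.0) * math.exp(-rate * t) / math.gamma(shape)


def chain_output(n_steps, rate, t_end, dt=1e-3):
    """Integrate the fixed rate chain with classical RK4 and return X_n(t_end).

    Assumed model: dX_1/dt = -rate*X_1, dX_i/dt = rate*(X_{i-1} - X_i),
    with X_1(0) = 1 and X_i(0) = 0 for i > 1.
    """
    def deriv(x):
        # Each step is fed by the previous one; the first step has no inflow.
        return [rate * ((x[i - 1] if i > 0 else 0.0) - x[i]) for i in range(n_steps)]

    def shifted(x, k, h):
        return [xi + h * ki for xi, ki in zip(x, k)]

    x = [1.0] + [0.0] * (n_steps - 1)
    for _ in range(int(round(t_end / dt))):
        k1 = deriv(x)
        k2 = deriv(shifted(x, k1, 0.5 * dt))
        k3 = deriv(shifted(x, k2, 0.5 * dt))
        k4 = deriv(shifted(x, k3, dt))
        x = [xi + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
             for xi, a, b, c, d in zip(x, k1, k2, k3, k4)]
    return x[-1]


if __name__ == "__main__":
    n, r, t = 4, 2.0, 1.5  # illustrative values only
    # Integer shape: the simulated chain and the gamma density should agree.
    print(r * chain_output(n, r, t), gamma_kernel(t, n, r))
    # Non-integer shape: no chain to simulate, but the kernel is still defined.
    print(gamma_kernel(t, 3.5, r))
```

Under these assumptions, a non-integer shape has no literal chain counterpart; it is simply a delay kernel that interpolates between the integer-step cases, whereas a singular delay would concentrate all weight at a single time.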
**********
Have all data underlying the figures and results presented in the manuscript been provided?
Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.
Reviewer #1: None
Reviewer #2: Yes
Reviewer #3: No: The conclusions of the manuscript rely heavily on numerically generated data. This data should be made available on a public repository such as zenodo, for instance.
**********
PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.
If you choose “no”, your identity will remain anonymous but your review may still be made public.
Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.
Reviewer #1: No
Reviewer #2: No
Reviewer #3: No
10.1371/journal.pcbi.1007982.r002 Author response to Decision Letter 0. Submission Version 1
21 Apr 2020
Submitted filename: review.pdf
10.1371/journal.pcbi.1007982.r003 Decision Letter 1. Douglas A Lauffenburger, Deputy Editor; Christopher V. Rao, Associate Editor. 2020 Lauffenburger, Rao. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Submission Version 1
27 May 2020
Dear Dr Jonsson,
We are pleased to inform you that your manuscript 'It's about time: Analysing simplifying assumptions for modelling multi-step pathways in systems biology' has been provisionally accepted for publication in PLOS Computational Biology. The reviewers were mixed regarding its suitability for PLOS Computational Biology. Based on our reading of the manuscript, however, we believe that it is indeed suitable.
Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.
Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.
IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.
Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.
Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology.
Please note here if the review is uploaded as an attachment.
Reviewer #1: I am satisfied with the revision made by the authors in this current manuscript. One of my primary concerns was that the scope of this work could be limited to a linear system only. In this revised manuscript, the authors have demonstrated applications to non-linear systems involving the Michaelis-Menten reaction scheme. This change should broaden the scope of the work as signaling pathways mostly represent nonlinear systems. I also found other revisions in the manuscript satisfactory based on my remaining suggestions.
Reviewer #2: The authors have substantially changed their manuscript to include additional theoretical proofs about systems equivalence.
The main issue raised in my first review was that the proved results may have limited impact if the authors cannot provide a panel of example models to which the reduction they suggest applies. Unfortunately, the authors added a reference to a recent paper (2019) and a validation on a toy-example model of Arabidopsis thaliana, but did not really apply their approach to a convincing panel of existing models. On the contrary, they argue that general theoretical proofs are more convincing than applications to real middle-scale examples.
As a consequence, the authors have substantially changed the manuscript to include theoretical proofs that justify the reduction methods. It seems to me that the present version of the manuscript is not a good fit for PLOS Computational Biology and would be better suited to a more theoretical and mathematical audience.
**********
Have all data underlying the figures and results presented in the manuscript been provided?
Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.
Reviewer #1: None
Reviewer #2: None
**********
PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.
If you choose “no”, your identity will remain anonymous but your review may still be made public.
Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.
Reviewer #1: No
Reviewer #2: No
10.1371/journal.pcbi.1007982.r004 Acceptance letter. Douglas A Lauffenburger, Deputy Editor; Christopher V. Rao, Associate Editor. 2020 Lauffenburger, Rao. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
23 Jun 2020
PCOMPBIOL-D-19-01662R1
It's about time: Analysing simplifying assumptions for modelling multi-step pathways in systems biology
Dear Dr Jönsson,
I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.
The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.
Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.
Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!