Combining different models

Portfolio selection is one of the most important areas of modern finance, both theoretically and practically. Reliance on a single model is fraught with difficulties, so attempting to combine the strengths of different models is attractive; see, for example, Geweke and Amisano (J Econom 164(1):130–141, 2011) and the many references therein. This paper contributes to the model combination literature, but with a difference: the models we consider here are making statements about different sets of assets. There appear to be no studies making this structural assumption, which completely changes the nature of the problem. This paper offers suggestions for principles of model combination in this situation, characterizes the solution in the case of multivariate Gaussian distributions, and provides a small illustrative example.


Introduction
Suppose that you are faced with the problem of choosing a portfolio position in a universe of N assets, where N may be many hundreds. It is generally understood that a simple-minded direct attempt to build a portfolio involving all N of the assets will be a dismal failure, for various reasons, chief among them being the difficulty in forming accurate estimates of the covariance matrix of returns; see, for example, the book by Fan et al. [2]. The dimensionality of the problem requires innovation, and there are many different directions we may look to get traction. For example, we might propose a low-dimensional factor model, where all the returns processes are driven by a handful of factors which should be easier to deal with. The factors could be series which are economically significant, such as the returns on a major stock index, the prices of important commodities, key interest rates or exchange rates; or the factors could be derived from a principal components analysis of an estimated covariance matrix. Or again, we might constrain the portfolio positions to be long-only to try to deal with the extreme long-short positions that generally arise in a simple-minded approach. We might increment the covariance with a multiple of the identity, for the same reason. We might take different models and combine their forecasts in some way, as in Bates and Granger [3], Elliott and Timmermann [4], Geweke and Amisano [1], Pettenuzzo and Ravazzolo [5], and many other papers referred to therein.
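The last two remedies mentioned can be made concrete with a small sketch in pure Python (hypothetical numbers throughout): with fewer observations than assets the sample covariance is singular, so mean-variance weights do not exist, and incrementing the covariance with a multiple of the identity restores invertibility. The shrinkage intensity lam below is an arbitrary illustrative choice, not a recommendation.

```python
# Sketch: why incrementing the covariance with a multiple of the identity helps.
# With fewer observations than assets (T < N) the sample covariance is singular,
# so mean-variance weights V^{-1} mu do not exist; adding lam * I restores
# invertibility.  Toy 3-asset example with made-up return numbers.

def sample_cov(returns):
    """Sample covariance of a list of T return vectors of length N."""
    T, N = len(returns), len(returns[0])
    mean = [sum(r[i] for r in returns) / T for i in range(N)]
    return [[sum((r[i] - mean[i]) * (r[j] - mean[j]) for r in returns) / T
             for j in range(N)] for i in range(N)]

def det3(a):
    """Determinant of a 3x3 matrix by cofactor expansion."""
    return (a[0][0] * (a[1][1] * a[2][2] - a[1][2] * a[2][1])
          - a[0][1] * (a[1][0] * a[2][2] - a[1][2] * a[2][0])
          + a[0][2] * (a[1][0] * a[2][1] - a[1][1] * a[2][0]))

returns = [[0.01, 0.02, -0.01], [-0.02, 0.00, 0.01]]   # T = 2 < N = 3
V = sample_cov(returns)
assert abs(det3(V)) < 1e-12           # singular: rank at most T - 1

lam = 1e-4                             # illustrative shrinkage intensity
V_reg = [[V[i][j] + (lam if i == j else 0.0) for j in range(3)]
         for i in range(3)]
assert det3(V_reg) > 0                 # now positive definite, hence invertible
```

The same device reappears later in the paper, where an ε-weighted entropy term plays an analogous regularizing role.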
A different approach to the high-dimensionality would be to split the universe of assets into smaller sets of assets, and try to do something sensible with those smaller sets. If we believe we can make a reasonable combination of up to ten assets (say) then we could in principle use such a 'divide and conquer' approach, but it would not allow us to exploit the correlations between sets of assets, and the problem still remains of how to weight the different portfolios formed from the subsets.
The approach taken here has this flavour, in that we suppose the universe of N assets is broken down into subsets of assets, but we do not suppose that those subsets are disjoint. It might be for example that we want to build one model for G10 currencies, and another one for European government bonds, stock indices and currencies; we may have insights into the way currencies move together, and we may have separate insights into how a nation's currency, stock index and bonds move together. Now we would like to combine these (hopefully reasonably good) models. So the European G10 currencies are common to both, but each model speaks of variables that are outside the other. How should this be done?
This paper offers some possible answers. In Sect. 2, we introduce notation and formulate the problem. Models speak of different sets of assets, and this introduces an equivalence relation on the assets, two assets being considered equivalent if there is no model which speaks about one of the assets but not the other. The equivalence classes (which for brevity we refer to as tiles) are sets of assets that can be considered together for the purposes of inference. We then propose that all those models which speak about the assets in a given tile are combined by Bayesian model averaging. This tells us how to combine the statements of all our models for any individual tile, but how we combine across different tiles still needs to be specified. We address this in Sect. 3, where we propose to construct a measure with the required tile marginals which solves a relative entropy minimization problem. It turns out that this problem has a unique solution, which can be characterized quite cleanly, albeit implicitly.
This approach operates at an abstract level, so we do not need to make any structural assumptions about any of the variables; we do not even need to assume that they take values in a vector space. However, to be able to apply the results, we develop the form of the solution under the hypothesis that all the predictive distributions are multivariate Gaussians. The hypothesis is unlikely to hold in practical situations, but it is a plausible approximation, and we are able to make the combined distribution reasonably explicit. Identifying the distribution in general requires numerical solution. We briefly discuss a numerical example before concluding.

Problem formulation
We work in discrete time with a universe of N assets. The returns on day t will be denoted by the N -vector X t , and we take the probability space to be the canonical path space Ω ≡ (R N ) Z + . Information available by time t will be denoted by F t . Let the set S = {1, . . . , N } index the assets. Suppose that we have models M 1 , . . . , M K , where model M α makes predictions only about assets with labels in S α ⊆ S. To avoid triviality, we assume that ∪ α S α = S. It is also worth noticing that if there is some set I ⊂ {1, . . . , K } of models such that S I ≡ ∪ α∈I S α is disjoint from S ∼I ≡ ∪ α∉I S α , then there is no connection between the models {M α , α ∈ I } and {M α , α ∉ I }. We could therefore analyse the two sets of models completely separately, and in practice it will probably be worth making such decompositions before we start, though in the account which follows this will not be assumed. Formally, on day t − 1 each model M α delivers a probability measure m α t on R S α , namely its conditional law, given the information F t−1 , of the day-t returns X t [S α ] of the assets in S α . We refer to the m α t as the predictive distributions of the individual models; these are the central objects of study.
Notice that a model M α is not a probability on (Ω, F ), because it makes no statements about the distributions of returns X i for i ∉ S α . Nor is M α a probability defined on the sub-sample space Ω α ≡ (R S α ) Z + , because it may be informed by the behaviour of assets other than those in S α . The number N of assets may be very large, and each set S α may be quite small, perhaps just a single asset. Sets S α and S β may be disjoint, they may have non-empty intersection, one may be contained in the other; anything is possible.
On day t − 1, each model M α makes a prediction m α t of the distribution of the day-t returns of the assets in S α ; the goal is to find some algorithm to combine the m α t into a single predictive distribution for all the assets in S in a reasonable way. We shall require that the combination algorithm should not depend on specific distributional assumptions, and should be compatible with Bayesian principles. Declare two assets to be equivalent if there is no model which speaks about one of them but not the other; this equivalence relation partitions the set S into equivalence classes C k , k = 1, . . . , J , which we will refer to as tiles. These are sets of variables which are not split by any model. A simple Venn diagram of three model universes S 1 , S 2 , S 3 illustrates the situation: the region coloured red is a tile, C r ≡ (S 1 ∩ S 2 )\S 3 , as is the region coloured green, C g ≡ S 1 ∩ S 2 ∩ S 3 . Thus on day t − 1 each of the models M 1 , M 2 , M 3 makes a prediction for day t about the variables in the green tile: how should we combine these? Bearing in mind that we can only compare predictions which speak about the same variables, we propose the following principle for combining the predictive distributions:
(P0) Predictions for a tile are done by Bayesian model averaging over the maximal set of common variables.
So if we consider the green tile, models M 1 , M 2 , M 3 predict for those variables (and others besides), and what we do is take the marginals of the predictive distributions m α t (α = 1, 2, 3) on the green tile C g , and then do a Bayesian model averaging of these marginal distributions.
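The passage from model universes to tiles is purely combinatorial, and can be sketched as follows; the model names and asset labels are invented for illustration, with the sets chosen to mimic a three-model Venn diagram. Two assets land in the same tile exactly when the same collection of models speaks about both.

```python
# Sketch: computing the tiles (equivalence classes) from the model universes.
# Two assets are equivalent iff exactly the same models speak about them, so a
# tile is a maximal set of assets sharing a "membership signature".  The sets
# below are hypothetical, mimicking a three-model Venn diagram.

def tiles(model_sets):
    """model_sets: dict model name -> set of asset labels.  Returns the tiles."""
    by_signature = {}
    for asset in set().union(*model_sets.values()):
        sig = frozenset(name for name, s in model_sets.items() if asset in s)
        by_signature.setdefault(sig, set()).add(asset)
    return list(by_signature.values())

S = {"M1": {1, 2, 3, 4}, "M2": {3, 4, 5, 6}, "M3": {4, 6, 7}}
C = tiles(S)
# {3} is the tile (S1 n S2)\S3, and {4} the tile S1 n S2 n S3
assert {3} in C and {4} in C and {1, 2} in C and len(C) == 6
```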
To explain this in a little more detail, if on day t − 1 model α states that the law of the observation X t to be observed on day t will have density f α t (·), then the posterior probabilities π α t of the different models update according to Bayes' rule:

π α t ∝ f α t (X t ) π α t−1 , (2)

where the constant of proportionality is determined by the requirement that the π α t sum to 1. Thus on day t − 1, each of the models M 1 , M 2 , M 3 states a (marginal) density for the variables X t [S α ], and the predicted law of these variables will be the average of those predicted densities, weighted according to the day-(t − 1) posterior. In practice, the updating relation (2) is modified to

π α t ∝ f α t (X t ) Σ β p βα π β t−1 , (3)

where P = ( p jk ) is a fixed transition matrix. The interpretation is that the data-generating process may change state like a Markov chain with transition matrix P. The reason for introducing this possibility is the tendency of the updating recursion (2) to get stuck at historical average values, which in the context of asset returns is undesirable: we do not believe that asset returns from the distant past should have the same influence on our actions as more recent data, and using (3) instead of (2) reflects that.
To expand a little on this, the numerical example studied later performs a Bayesian averaging of models where the predicted mean return is a geometrically-weighted moving average of recent returns, with different weighting parameters for different models, perhaps with mean lookbacks of 10, 20 and 40 days. So each of these models pays more attention to recent returns than to older data, but what we want to avoid is the situation where after a lot of data has been processed we put almost all the weight on the model with a 20-day lookback, and are unable to change that posterior very much during the next few months. Experience shows that market dynamics can change quite quickly, and a model that does not admit that will go on losing money for too long when such a change happens.
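A minimal sketch of the updating rules (2) and (3), under illustrative assumptions not taken from the paper: one asset, three Gaussian models distinguished only by their predicted means, and a sticky transition matrix P. The point to observe is that the Markov mixing keeps every model's posterior weight bounded away from zero.

```python
import math
import random

# Sketch of the updating rules (2) and (3): posteriors are updated by the
# likelihood of the realized return, optionally mixed through a Markov
# transition matrix P so that no model's weight can collapse permanently.
# Model means, volatility and P are hypothetical illustrative choices.

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def update(pi, likes, P=None):
    """One step of (2) if P is None, else one step of (3)."""
    if P is None:
        w = [p * l for p, l in zip(pi, likes)]                        # rule (2)
    else:
        w = [l * sum(P[b][a] * pi[b] for b in range(len(pi)))         # rule (3)
             for a, l in enumerate(likes)]
    z = sum(w)
    return [x / z for x in w]

pi = [1 / 3, 1 / 3, 1 / 3]
P = [[0.98, 0.01, 0.01], [0.01, 0.98, 0.01], [0.01, 0.01, 0.98]]  # sticky chain
random.seed(0)
for _ in range(250):
    x = random.gauss(0.0, 0.01)                                    # observed return
    likes = [normal_pdf(x, m, 0.01) for m in (0.0, 0.001, -0.001)]  # model densities
    pi = update(pi, likes, P)

assert abs(sum(pi) - 1.0) < 1e-12
assert all(p > 0.003 for p in pi)   # the transition matrix keeps every model alive
```

With `P=None` the same loop implements the plain recursion (2), under which one model's weight can become negligible and stay there for a long time.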
This explains how we use the data gathered by day t − 1 to state what we think the distribution of the C g -variables will be on day t. One further point needs to be made, however, when we look at the variables in the red tile C r ≡ (S 1 ∩ S 2 )\S 3 , because the procedure just detailed for tile C g could be applied in two different ways: do we
• form the Bayesian average of M 1 , M 2 on S 1 ∩ S 2 , and then take the C r -marginal of the resulting mixture; or
• take the C r -marginals of m 1 t and m 2 t , and then form the Bayesian average of those?
Principle (P0) resolves the ambiguity: the model weights are computed on the maximal set of common variables, here S 1 ∩ S 2 , which is the first of the two alternatives.
To summarize then, we have just described how we construct the predictive distribution each day for the variables in each of the tiles separately. But as yet we have not determined how we combine these to make a predictive distribution for all of the variables at once. Doing this is clearly the essence of the problem, because we have to decide what the co-dependence of the variables in different tiles will be in order to make portfolio choices.
Remarks At first sight, we can easily reduce the problem to one where all of the predictive distributions speak about all of the variables-for any variables not in S α , we just say that model M α predicts a large-variance zero-mean return! This gets round the mathematical issues at the cost of turning the problem into nonsense. Why? Suppose that we have just two models, M 1 which makes statements about all of the asset returns, and M 2 which makes statements about only asset 1. We could expand M 2 to speak about all the assets by saying that its predictions for the others are just noise, but the problem is that if M 2 happens to predict asset 1 much better than M 1 , then we would end up with most posterior weight on M 2 . We would then believe that we had no useable information about any of the assets other than asset 1, whereas in fact M 1 might be telling us some quite valuable information about them.

Combining distributions
The situation then is this. We have for each model M α a predictive distribution m α t for the S α -variables on day t; we have for each tile C j a predictive distribution q j t for the C j -variables on day t, obtained by Bayesian model averaging. How do we come up with a distribution q ∈ P(R N ) with the properties:
(P1) For each j = 1, . . . , J , the C j -marginal of q is q j t ;
(P2) The co-dependence of the different tiles is inherited from the co-dependence of the m α t ?
For notational convenience, we will henceforth abbreviate m α t to m α , and q j t to q j , as the time index is immaterial. Of course, property (P2) of the construct q is not defined as yet, and the essential issue is to try to give a reasonable and precise meaning to this.
Let us first look only at those tiles among C 1 , . . . , C J which together make up a single S α ; we want to make a distribution μ for the variables X [S α ] which satisfies (P1) and (P2). We could of course ensure property (P1) simply by taking the product measure μ = q 1 ⊗ · · · ⊗ q J , but this ignores any information about co-dependence which there might be in m α . We want to construct some measure μ which is as 'near' as possible to m α while satisfying property (P1). The sense of closeness we propose here is closeness in relative entropy; other choices could be made, but this is a very natural one, and is widely used. So we will determine the measure μ by solving the problem

min H (μ|m α ) subject to μ C j = q j ∀ j, (4)

where μ A denotes the A-marginal of μ, and as usual H (μ|m) ≡ ∫ log (dμ/dm) dμ when μ is absolutely continuous with respect to m, and H (μ|m) = +∞ otherwise. With a minor abuse of notation, we shall write H ( f |m) when f is a probability density with respect to m, or even H ( f ) if the reference measure m does not need to be clarified.
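In a discrete toy case the solution of problem (4) is the classical I-projection of m α onto the set of laws with the prescribed marginals, computable by iterative proportional fitting (IPF); the solution has the product-correction form m(x) a(x[C 1 ]) b(x[C 2 ]). The sketch below (made-up numbers, two binary tiles) illustrates the principle, not the paper's own algorithm: the minimizer matches the prescribed tile marginals while retaining the positive dependence of m.

```python
# Sketch: problem (4) for one model split into two tiles, in a discrete toy
# case.  The minimizer of H(mu|m) subject to fixed tile marginals is the
# I-projection of m, computable by iterative proportional fitting (IPF); it
# keeps m's co-dependence while matching the marginals.  Numbers are made up.

m  = [[0.4, 0.1], [0.1, 0.4]]     # joint law of two binary variables (correlated)
q1 = [0.3, 0.7]                    # required marginal for tile C1 (rows)
q2 = [0.6, 0.4]                    # required marginal for tile C2 (columns)

mu = [row[:] for row in m]
for _ in range(200):               # alternate row and column rescalings
    for i in range(2):             # match the C1-marginal
        s = mu[i][0] + mu[i][1]
        mu[i] = [q1[i] * v / s for v in mu[i]]
    for j in range(2):             # match the C2-marginal
        s = mu[0][j] + mu[1][j]
        mu[0][j], mu[1][j] = q2[j] * mu[0][j] / s, q2[j] * mu[1][j] / s

assert abs(mu[0][0] + mu[0][1] - q1[0]) < 1e-9    # C1-marginal matched
assert abs(mu[0][0] + mu[1][0] - q2[0]) < 1e-9    # C2-marginal matched
# the positive dependence of m survives (cross-ratio is IPF-invariant):
assert mu[0][0] * mu[1][1] > mu[0][1] * mu[1][0]
```

Taking the product measure q 1 ⊗ q 2 instead would satisfy the marginal constraints but destroy the dependence; the entropy criterion keeps it.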
To ensure that problem (4) is well-posed, we have the following result.

Proposition 1
Assume that there exists some measure μ with μ C j = q j for all j = 1, . . . , J for which the relative entropy H (μ|m α ) is finite. Then the problem (4) has a unique solution.
Proof Let S denote the set of all densities f (with respect to m α ) for which the relative entropy H ( f ) is finite, and for which the C j -marginal of the measure f dm α is the given measure q j for every j; here and in what follows, x[∼ C j ] denotes the subvector of x on the complement of C j . The set S is a convex subset of L 1 + (m α ), non-empty by assumption. Let b ≡ inf{ H ( f ) : f ∈ S }, and take a sequence ( f n ) in S with H ( f n ) ↓ b.
The family ( f n ) is clearly uniformly integrable, so we may apply a version of Komlós' Theorem (see Lemma 2.1 in [6]) to conclude that there exist g n ∈ conv( f n , f n+1 , . . .) which converge in L 1 to some limit g. Since S is convex, the g n are in S , and it is immediate that the limit g is also in S ; in particular, the marginals of g are as required. By Fatou's Lemma, we conclude that H (g) = b, so the infimum is attained; uniqueness of the minimizer then follows from the strict convexity of f → H ( f ).
This tells us what to do if we were just concerned with a single S α which was split into tiles, but we need to deal with the whole set S = ∪ α S α . What we propose to do therefore is to seek some probability μ so as to

min Σ α H (μ S α |m α ) subject to μ C j = q j ∀ j. (8)

The argument of Proposition 1, which showed that the minimum is attained by a unique minimizer, runs into technical issues here, which are illustrated by simple examples.
Example 1 Suppose that f n is the density of a bivariate normal distribution with mean 0 and covariance

( 1 ρ n ; ρ n 1 ),

where the ρ n increase to 1. Then the 1-marginal densities of the f n are a uniformly integrable family, as are the 2-marginal densities, but the family ( f n ) is not uniformly integrable.
This shows that we will not be able to re-run the argument of Proposition 1 for problem (8), because this argument required uniform integrability of the densities, not just of their marginals. A second issue arises, illustrated by the next example.
Example 2 Suppose that there are two models, with S 1 = {1, 2} and S 2 = {2, 3}, and that each predictive distribution m α is a standard bivariate Gaussian N (0, I ). Suppose further that the predictive distributions for each of the three tiles C 1 = {1}, C 2 = {2}, C 3 = {3} are a standard N (0, 1) law. Then for any a ∈ (−1, 1) the law

N (0, Σ a ), Σ a ≡ ( 1 0 a ; 0 1 0 ; a 0 1 ), (9)

is a minimizer of the objective (8) (it achieves value 0), and satisfies the marginal law constraints on each tile. Therefore we cannot hope for uniqueness of the solution to problem (8), and the family of solutions will not in general be uniformly integrable. We return to this example later once we have explained how we plan to get around these issues.
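The degeneracy in Example 2 can be checked directly. The sketch below assumes the setup S 1 = {1, 2}, S 2 = {2, 3}, both models standard bivariate Gaussian, and verifies for several values of a that the S 1 - and S 2 -marginals of a candidate law with corr(X 1 , X 3 ) = a coincide exactly with the models, so every entropy term of the objective (8) vanishes regardless of a.

```python
import math

# Sketch: the degeneracy in Example 2.  Correlating X1 and X3 (parameter a)
# leaves both model marginals untouched, so every term of objective (8) is 0
# for every a in (-1, 1).  Setup: S1 = {1,2}, S2 = {2,3}, m1 = m2 = N(0, I_2).

def kl_gauss(S1, S2):
    """KL( N(0,S1) || N(0,S2) ) for 2x2 covariance matrices."""
    d1 = S1[0][0] * S1[1][1] - S1[0][1] * S1[1][0]
    d2 = S2[0][0] * S2[1][1] - S2[0][1] * S2[1][0]
    inv2 = [[S2[1][1] / d2, -S2[0][1] / d2], [-S2[1][0] / d2, S2[0][0] / d2]]
    tr = sum(inv2[i][k] * S1[k][i] for i in range(2) for k in range(2))
    return 0.5 * (tr - 2.0 + math.log(d2 / d1))

def marginal(S, idx):
    """Sub-covariance of the coordinates in idx."""
    return [[S[i][j] for j in idx] for i in idx]

I2 = [[1.0, 0.0], [0.0, 1.0]]
for a in (-0.5, 0.0, 0.7):
    Sa = [[1.0, 0.0, a], [0.0, 1.0, 0.0], [a, 0.0, 1.0]]   # candidate covariance
    assert kl_gauss(marginal(Sa, (0, 1)), I2) == 0.0        # S1-marginal = m1
    assert kl_gauss(marginal(Sa, (1, 2)), I2) == 0.0        # S2-marginal = m2
    assert all(Sa[i][i] == 1.0 for i in range(3))           # tile marginals match
```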
To deal with these issues, we propose to modify the problem (8) to the following:

min Σ α H (μ S α |m α ) + ε H (μ| ⊗ J j=1 q j ) subject to μ C j = q j ∀ j. (10)

Here, ε > 0 is some chosen positive parameter. This choice is arbitrary, and to that extent undesirable; the inclusion of the entropy term relative to the product measure is required to prevent degeneracy. It may be that we can deal directly with problem (8), but Example 2 shows that even if we could prove that a minimizer exists, we would in general need to make an arbitrary choice of minimizer, so some arbitrary choice cannot be avoided; as we shall see, however, we may be able to get around this. Granted this modification of the objective, we have the following analogue of Proposition 1, whose proof follows the same lines as the proof of Proposition 1, and is therefore omitted. The key point is that any set of μ for which the relative entropy H (μ| ⊗ J j=1 q j ) is bounded is uniformly integrable.
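To see how the extra ε-term acts on Example 2: within the candidate family N (0, Σ a ) the model terms vanish, and a short computation (using det Σ a = 1 − a 2 and tr Σ a = 3) gives ε H (N (0, Σ a )| ⊗ q j ) = −(ε/2) log(1 − a 2 ), which is non-negative and uniquely minimized at a = 0. The sketch below confirms this numerically; the grid of a-values and the value of ε are illustrative.

```python
import math

# Sketch: the epsilon term in (10) breaks Example 2's degeneracy.  Within the
# family N(0, Sigma_a) the model terms vanish, so the objective reduces to
# eps * KL( N(0,Sigma_a) || N(0,I) ) = -(eps/2) * log(1 - a^2),
# which is uniquely minimized at a = 0, i.e. X1 and X3 independent.

def objective(a, eps):
    det = 1.0 - a * a                  # det Sigma_a; tr Sigma_a = 3, d = 3
    return eps * 0.5 * (3.0 - 3.0 - math.log(det))

vals = {a: objective(a, 0.1) for a in (-0.8, -0.3, 0.0, 0.3, 0.8)}
assert min(vals, key=vals.get) == 0.0       # a = 0 is picked out
assert all(v >= 0.0 for v in vals.values())  # the term is a true penalty
```

This is exactly the selection that reappears in the Gaussian treatment of Example 2 later in the paper.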

Proposition 2
Assume that there exists some measure μ with μ C j = q j for all j = 1, . . . , J for which the relative entropy H (μ| ⊗ J j=1 q j ) is finite. Then the problem (10) has a unique solution.
Can we get a clearer picture of what the solution to problem (10) looks like? We can indeed, but to describe it we need some notation. So suppose that with respect to some product measure dx the measures m α have densities f α , and the tile marginals q j have densities ϕ j . We shall abbreviate φ̄(x) ≡ ∏ j ϕ j (x[C j ]) for the reference density. Let g α denote the S α -marginal density of μ, so that the objective to be minimized in (10) can be written

Σ α H (g α | f α ) + ε H (g|φ̄),

where of course the function g is the density of the unknown measure μ. The constraints

g α (x[S α ]) = ∫ g(x) dx[∼ S α ] for all α, (13)

∫ g(x) dx[∼ C j ] = ϕ j (x[C j ]) for all j (14)

must also be satisfied. We then have the following characterization of the optimal density g.

Theorem 1
The density g of the optimal solution to problem (10) is represented as

g(x) ∝ φ̄(x) { ∏ α f α (x[S α ]) / g α (x[S α ]) } 1/ε exp( ε −1 Σ j η j (x[C j ]) ), (15)

where the functions η j are determined up to additive constants by the constraints (14).
Proof We shall absorb the constraints (13) with Lagrange multipliers λ α (x[S α ]) and the constraints (14) with Lagrange multipliers η j (x[C j ]) to construct the Lagrangian

L = Σ α H (g α | f α ) + ε H (g|φ̄) + Σ α ∫ λ α (x[S α ]) { g α (x[S α ]) − ∫ g(x) dx[∼ S α ] } dx[S α ] + Σ j ∫ η j (x[C j ]) { ϕ j (x[C j ]) − ∫ g(x) dx[∼ C j ] } dx[C j ].

Minimizing L over g α gives the first-order condition

log ( g α / f α ) + 1 + λ α = 0,

and minimizing L over g gives the first-order condition

ε { log ( g/φ̄ ) + 1 } = Σ α λ α (x[S α ]) + Σ j η j (x[C j ]).

From these conditions, we deduce expressions for g α and g in terms of the unknown multipliers λ α and η j :

g α ∝ f α e −λ α , g ∝ φ̄ exp( ε −1 { Σ α λ α + Σ j η j } ).

From the first of these, λ α .= log ( f α / g α ), where the symbol .= denotes that the two sides differ by a constant. Summing over α and substituting into the expression for g gives the stated form (15).

Remarks
1. A first glance at the form of the solution (15) for g might lead one to believe that by integrating out x[∼ C j ] and using the constraint (14) we will obtain η j explicitly, and therefore have a very concrete representation of the solution. But unfortunately when we integrate over x[∼ C j ] the integration involves the values of η i for i ≠ j, and so what we obtain is an implicit characterization of the η i .
2. If we take the expression (15) for the optimal g and formally let ε tend to zero, we find a much simpler expression, (24). For the reasons explained above, this formal passage to the limit may not deliver an optimal g, and the η j in any case depend on ε, so there are issues there also. Nevertheless, the simple structural form (24) is appealing, and may provide a good place to start looking.

Multivariate Gaussian distributions
A very important special case, which can be reduced to a matrix equation, is the case where all the distributions are multivariate Gaussian. Suppose then (taking all means to be zero for notational simplicity) that each m α is Gaussian with covariance matrix Σ α , and each tile marginal q j is Gaussian with covariance matrix Q j . In this case the optimal g is itself a Gaussian density, and the functions η j are quadratic,

η j (x[C j ]) .= − 1/2 x[C j ] · W j x[C j ],

for some symmetric (though not necessarily positive-definite) matrices W j (here .= denotes equality up to an additive constant). The form (15) of the optimal g will now give us a Gaussian whose inverse covariance is (V − W)/(1 + ε), where we write W for the unknown block-diagonal matrix assembled from the W j , and V for the remaining (positive-definite) terms of the quadratic form. It is now clear that we must choose the W j so that for all j the C j -diagonal block of (1 + ε)(V − W) −1 is the given covariance matrix Q j . Determining the W j given the Σ α and the Q j appears to be a challenging numerical problem. The approach used in the following numerical study minimized the relative entropy numerically, but a direct method would be good.
Example 2 Revisited Example 2 fits into the multivariate Gaussian framework just presented; let us see how the general results look in this case. We have that each Q j is the 1 × 1 matrix 1, and each Σ α is the 2 × 2 identity matrix. In order to match the C j -marginals for each tile, we see that we must take W 1 = W 3 = 0 and W 2 = 1, values which do not depend on ε. The solution picked out from the family (9) is the one where X 1 and X 3 are independent, perhaps not surprisingly.

A numerical study
This study worked with 10 years of daily price data on 35 major US stocks, and proposed three models, each making statements about exactly 20 of the stocks. In each of the nonempty subsets of the Venn diagram below there were 5 stocks, the allocation of stocks to subsets being arbitrary. Without going into the exact details, each model generated a predictive distribution each day, and the Bayesian combination was used to come up with a predictive distribution for each equivalence class C j of variables.
This predictive distribution was of course a mixture distribution, in fact a mixture of multivariate Gaussians. The combination of the predictive distributions was done by pretending that each predictive distribution was a multivariate Gaussian with matching mean and covariance, and then using the results of the previous section to come up with an overall mean μ̂ and covariance V̂ for the predicted distribution of all 35 asset returns. The portfolio of the assets used was then simply V̂ −1 μ̂. Not very much can be concluded from the output (Fig. 1: cumulative P&L of the mean-variance strategy using the combination predictive distribution), except that the realized gains seem to have behaved quite sensibly, without any large jumps or swings. The allocation of assets to models was quite arbitrary, so we would not expect that anything particularly good would result; with some more guidance over the choice of variables for models, we may well be able to produce some more interesting P&L plots. But the point to note is that this methodology does give a way to combine small models into a sensible algorithm for dealing with a much larger set of assets, which was the goal of the study.
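The final portfolio step of the study is just a linear solve, w = V̂ −1 μ̂. A two-asset sketch with hypothetical numbers (the study's actual estimates are not reproduced here), using Cramer's rule for transparency:

```python
# Sketch: the mean-variance portfolio step w = V^{-1} mu for two assets.
# The combined mean mu and covariance V below are made-up illustrative values,
# standing in for the estimates produced by the combination procedure.

mu = [0.002, 0.001]                          # hypothetical combined mean
V  = [[0.0004, 0.0001], [0.0001, 0.0002]]    # hypothetical combined covariance

det = V[0][0] * V[1][1] - V[0][1] * V[1][0]
w = [(V[1][1] * mu[0] - V[0][1] * mu[1]) / det,   # Cramer's rule for V w = mu
     (V[0][0] * mu[1] - V[1][0] * mu[0]) / det]

# sanity check: the weights really solve V w = mu
assert abs(V[0][0] * w[0] + V[0][1] * w[1] - mu[0]) < 1e-12
assert abs(V[1][0] * w[0] + V[1][1] * w[1] - mu[1]) < 1e-12
```

In the study itself N = 35, so the solve is done with a general linear solver rather than a closed form, but the step is the same.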

Conclusions
The problem of making a good portfolio from a large number of assets is an important and challenging one; this paper offers an approach in which the big problem is first broken down into smaller, more manageable problems. The key feature is that we do not need to suppose that the smaller problems make statements about disjoint sets of assets: understanding of co-dependence can come from multiple separate models, each of which embodies some part of the co-dependence of different assets. The given models naturally partition the assets into equivalence classes (tiles) on which standard Bayesian model averaging can be applied. We have developed an entropy-minimization method for combining the measures on different assets into a consensus measure; performing the optimization under constraints on the marginal laws on the individual tiles leads to the overall combination.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.