\[ \newcommand{\bm}{\mathbf} \newcommand{\bA}{\bm A} \newcommand{\ba}{\bm a} \newcommand{\bB}{\bm B} \newcommand{\bb}{\bm b} \newcommand{\bC}{\bm C} \newcommand{\bc}{\bm c} \newcommand{\bD}{\bm D} \newcommand{\bd}{\bm d} \newcommand{\bE}{\bm E} \newcommand{\be}{\bm e} \newcommand{\bF}{\bm F} \newcommand{\bof}{\bm f} \newcommand{\bG}{\bm G} \newcommand{\bg}{\bm g} \newcommand{\bH}{\bm H} \newcommand{\bh}{\bm h} \newcommand{\bI}{\bm I} \newcommand{\bi}{\bm i} \newcommand{\bJ}{\bm J} \newcommand{\bj}{\bm j} \newcommand{\bK}{\bm K} \newcommand{\bk}{\bm k} \newcommand{\bL}{\bm L} \newcommand{\bl}{\bm l} \newcommand{\bM}{\bm M} \newcommand{\bom}{\bm m} \newcommand{\bN}{\bm N} \newcommand{\bn}{\bm n} \newcommand{\bO}{\bm O} \newcommand{\bo}{\bm o} \newcommand{\bP}{\bm P} \newcommand{\bp}{\bm p} \newcommand{\bQ}{\bm Q} \newcommand{\bq}{\bm q} \newcommand{\bR}{\bm R} \newcommand{\br}{\bm r} \newcommand{\bS}{\bm S} \newcommand{\bs}{\bm s} \newcommand{\bT}{\bm T} \newcommand{\bt}{\bm t} \newcommand{\bU}{\bm U} \newcommand{\bu}{\bm u} \newcommand{\bV}{\bm V} \newcommand{\bv}{\bm v} \newcommand{\bW}{\bm W} \newcommand{\bw}{\bm w} \newcommand{\bX}{\bm X} \newcommand{\bx}{\bm x} \newcommand{\bY}{\bm Y} \newcommand{\by}{\bm y} \newcommand{\bZ}{\bm Z} \newcommand{\bz}{\bm z} % bold Greek lowercase letters for vectors \newcommand{\balpha}{\bm \alpha} \newcommand{\bbeta}{\bm \beta} \newcommand{\bgamma}{\bm \gamma} \newcommand{\bdelta}{\bm \delta} \newcommand{\bepsilon}{\bm \epsilon} \newcommand{\bvarepsilon}{\bm \varepsilon} \newcommand{\bzeta}{\bm \zeta} \newcommand{\boeta}{\bm \eta} \newcommand{\btheta}{\bm \theta} \newcommand{\biota}{\bm \iota} \newcommand{\bkappa}{\bm \kappa} \newcommand{\blambda}{\bm \lambda} \newcommand{\bmu}{\bm \mu} \newcommand{\bnu}{\bm \nu} \newcommand{\bxi}{\bm \xi} \newcommand{\bpi}{\bm \pi} \newcommand{\brho}{\bm \rho} \newcommand{\bsigma}{\bm \sigma} \newcommand{\btau}{\bm \tau} \newcommand{\bupsilon}{\bm \upsilon} \newcommand{\bphi}{\bm \phi} \newcommand{\bchi}{\bm \chi} \newcommand{\bpsi}{\bm \psi} \newcommand{\bomega}{\bm \omega} \]
A standard Bayesian Vector AutoRegression (VAR) takes the form
\[ \mathbf{y}_t = C\bar{\mathbf{y}}_t + \sum_{i=1}^p A_i \mathbf{y}_{t-i} + \mathbf{\varepsilon}_t \]
where \(\mathbf{y}_t\) is an \(m \times 1\) vector of endogenous time series, \(\bar{\mathbf{y}}_t\) is an \(m_c \times 1\) vector of constants and other deterministic trends, and \(A_i\) for \(i=1, ..., p\) and \(C\) are coefficient matrices. In total, there are \(k\equiv mp + m_c\) coefficients in each equation, resulting in a total of \(mk\) coefficients in the model.
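To make the dimensions concrete, here is a minimal simulation sketch in numpy. All numbers (the sizes, the coefficient values, the sample length) are made up for illustration and not tied to anything above.

```python
import numpy as np

# Illustrative sizes: m series, p lags, one deterministic term (a constant).
m, p, m_c, T = 3, 2, 1, 200
k = m * p + m_c                 # coefficients per equation: k = m*p + m_c = 7

rng = np.random.default_rng(0)
C = rng.normal(scale=0.1, size=(m, m_c))                     # constant term
A = [rng.normal(scale=0.1, size=(m, m)) for _ in range(p)]   # lag matrices A_1, ..., A_p
Sigma = np.diag(rng.uniform(0.5, 1.5, size=m))               # error covariance

# y_t = C*ybar_t + sum_i A_i y_{t-i} + eps_t, with ybar_t = 1 and eps_t ~ N(0, Sigma)
y = np.zeros((T + p, m))
for t in range(p, T + p):
    eps = rng.multivariate_normal(np.zeros(m), Sigma)
    y[t] = C @ np.ones(m_c) + sum(A[i] @ y[t - 1 - i] for i in range(p)) + eps
y = y[p:]                       # drop the p presample values; y is T x m

print(y.shape, k, m * k)        # (200, 3) 7 21, i.e. mk coefficients in total
```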
Without further specifying the distribution of the error terms \(\mathbf{\varepsilon}_t\), no likelihood can be derived. The most common choice is to assume that the errors are independent over time and follow a multivariate Normal distribution with mean zero and covariance matrix \(\Sigma\). That is, most commonly we assume that \(\mathbf{\varepsilon}_t \sim N(\mathbf{0}, \Sigma)\), independently across \(t\). This then implies that
\[ \mathbf{y}_t | C, A_1, ..., A_p, \Sigma, \mathbf{y}_{t-1}, ..., \mathbf{y}_{t-p} \sim N(C\bar{\mathbf{y}}_t + \sum_{i=1}^p A_i \mathbf{y}_{t-i}, \Sigma) \]
which is then used to derive the form of the likelihood that is classically used. For the Bayesian setting, it turns out that there is a more convenient form of the likelihood. While both are functionally equivalent, the way Bayesians write down the likelihood allows them to easily calculate the posterior given a prior on all parameters.
To derive the more convenient form, define \(A = [C \quad A_1 \quad ... \quad A_p]'\) and \(\mathbf{x}_t = [\bar{\mathbf{y}}_t', \mathbf{y}_{t-1}', ..., \mathbf{y}_{t-p}']\), where the former is of dimensions \(k \times m\) and the latter is a row vector of dimensions \(1\times k\). Using these definitions, we can write the VAR as
\[ \mathbf{y}_t = A'\mathbf{x}_t' + \mathbf{\varepsilon}_t \Rightarrow \mathbf{y}_t' = \mathbf{x}_t A + \mathbf{\varepsilon}_t' \]
This form is convenient, because it allows us to define
\[ \begin{array}{ccc} Y = \begin{bmatrix}\mathbf{y}_1' \\ \vdots \\ \mathbf{y}_T'\end{bmatrix} & X = \begin{bmatrix}\mathbf{x}_1 \\ \vdots \\ \mathbf{x}_T\end{bmatrix} & E = \begin{bmatrix}\mathbf{\varepsilon}_1' \\ \vdots \\ \mathbf{\varepsilon}_T'\end{bmatrix} \end{array} \]
We can then write the VAR in matrix notation as
\[ \DeclareMathOperator{\vect}{vec} \DeclareMathOperator{\cov}{cov} Y = XA + E \Rightarrow \by = (I_m \otimes X)\balpha + \bvarepsilon \]
where \(\by = \vect(Y)\), \(\balpha = \vect(A)\) and \(\bvarepsilon = \vect(E)\). The vectorised form follows directly from the rules of the \(\vect\) operator, which can be found in, for example, the Matrix Cookbook.
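A quick way to convince yourself that the stacking and vectorisation are consistent is to build \(Y\), \(X\), and \(E\) for some arbitrary data and check the identity numerically. The sketch below uses made-up random data and an arbitrary coefficient matrix \(A\); note that \(\vect\) stacks columns, which corresponds to numpy's Fortran ordering.

```python
import numpy as np

rng = np.random.default_rng(1)
m, p, m_c, T = 3, 2, 1, 50
k = m * p + m_c

# Made-up raw data holding T + p observations of the m series.
y_raw = rng.normal(size=(T + p, m))

# Rows of Y are y_t'; rows of X are x_t = [ybar_t', y_{t-1}', ..., y_{t-p}'].
Y = y_raw[p:]                                                       # T x m
X = np.column_stack(
    [np.ones((T, m_c))] + [y_raw[p - i:T + p - i] for i in range(1, p + 1)]
)                                                                   # T x k

A = rng.normal(size=(k, m))     # arbitrary stacked coefficients A = [C  A_1 ... A_p]'
E = Y - X @ A                   # the errors implied by this A

vec = lambda M: M.reshape(-1, order="F")    # vec() stacks columns
lhs = vec(Y)
rhs = np.kron(np.eye(m), X) @ vec(A) + vec(E)
print(np.allclose(lhs, rhs))    # True: y = (I_m ⊗ X) alpha + eps
```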
What are the distributions of \(E\) and \(\bvarepsilon\) though? I will postpone the discussion of the distribution of \(E\) to another post, since it is of Matrix Normal form and not many seem to know of the Matrix Normal distribution. Instead, I will focus on the distribution of \(\bvarepsilon\), which is Multivariate Normal, directly inherited from the Multivariate Normal distribution of \(\bvarepsilon_t\). To fully describe this distribution, we need its mean and covariance. The mean is easy; since every \(\varepsilon_{it}\) has mean \(0\), \(\bvarepsilon\) must have mean \(\mathbf{0}\). The covariance is a bit more tricky, and it helps to consider a small example.
Let's suppose for a minute that \(T=2\) and \(m=2\). Then \(\bvarepsilon = [\varepsilon_{11}, \varepsilon_{12}, \varepsilon_{21}, \varepsilon_{22}]'\), where the first index denotes the variable and the second the time period. Since errors are uncorrelated over time, we have in general that \(\cov(\varepsilon_{it}, \varepsilon_{jt'})=0\) for all \(i,j \in \{1, ..., m\}\) and all \(t\neq t'\). We also know the covariance of any two errors at the same point in time: \(\cov(\varepsilon_{it}, \varepsilon_{jt})=\sigma_{ij}=[\Sigma]_{ij}\), the \(ij\)-th element of the covariance matrix \(\Sigma\). Thus, in our small example, the covariance matrix of \(\bvarepsilon\) must take the form
\[ \begin{bmatrix} \sigma_{11} & 0 & \sigma_{12} & 0 \\ 0 & \sigma_{11} & 0 & \sigma_{12} \\ \sigma_{21} & 0 & \sigma_{22} & 0 \\ 0 & \sigma_{21} & 0 & \sigma_{22} \end{bmatrix} \]
A careful look at the form above reveals that it is equal to \(\Sigma \otimes I_T\). This generally holds, implying that \(\mathbf{\varepsilon} \sim N(\mathbf{0}, \Sigma \otimes I_T)\).
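This structure is easy to confirm by Monte Carlo. The sketch below uses the small \(T=2\), \(m=2\) example with a made-up \(\Sigma\) and compares the simulated covariance of \(\vect(E)\) to \(\Sigma \otimes I_T\).

```python
import numpy as np

rng = np.random.default_rng(2)
T, m, n_sim = 2, 2, 200_000
Sigma = np.array([[1.0, 0.6],
                  [0.6, 2.0]])

# Each replication draws E (T x m) with rows eps_t' ~ N(0, Sigma), independent over t.
E_draws = rng.multivariate_normal(np.zeros(m), Sigma, size=(n_sim, T))  # (n_sim, T, m)
vecE = E_draws.transpose(0, 2, 1).reshape(n_sim, T * m)                 # vec of each E

print(np.round(np.cov(vecE, rowvar=False), 2))   # simulated covariance of vec(E)
print(np.kron(Sigma, np.eye(T)))                 # should match up to Monte Carlo error
```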
Before we continue our derivation of the likelihood, it is useful to first derive a small algebraic trick. The general problem is that we are given a quadratic form
\[ (\by - X\bbeta)'\Sigma^{-1}(\by - X\bbeta) \]
and we want to split it into one quadratic form involving only \(\bbeta\) and one involving only \(\by\). We thus first write
\[ \begin{split} &(\by - X\bbeta)'\Sigma^{-1}(\by - X\bbeta) \\ =& (\by - X\hat\bbeta + X\hat\bbeta - X\bbeta)'\Sigma^{-1}(\by - X\hat\bbeta + X\hat\bbeta - X\bbeta) \\ =& (\by - X\hat\bbeta)'\Sigma^{-1}(\by - X\hat\bbeta) + (\bbeta - \hat\bbeta)'X'\Sigma^{-1}X(\bbeta - \hat\bbeta) + 2(\by - X\hat\bbeta)'\Sigma^{-1}(X\hat\bbeta - X\bbeta) \end{split} \]
We would like to get rid of the last term by choosing a smart \(\hat\bbeta\). Since the last term equals \(2(\by - X\hat\bbeta)'\Sigma^{-1}X(\hat\bbeta - \bbeta)\), it vanishes for every \(\bbeta\) as long as \(X'\Sigma^{-1}(\by - X\hat\bbeta) = \bm 0\), i.e. as long as the residual \(\by - X\hat\bbeta\) is orthogonal to the columns of \(X\) in the metric defined by \(\Sigma^{-1}\). Solving this condition gives
\[ \begin{split} & X'\Sigma^{-1}(\by - X\hat\bbeta) = \bm 0 \\ \Leftrightarrow & X'\Sigma^{-1}\by = X'\Sigma^{-1}X\hat\bbeta \\ \Leftrightarrow & \hat\bbeta = (X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}\by \end{split} \]
Thus, by choosing \(\hat\bbeta\) to be the GLS estimate, we can split the quadratic form involving both \(\by\) and \(\bbeta\) into two quadratic forms, each involving only one of them. That is, for \(\hat\bbeta = (X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}\by\) we have
\[ (\by - X\bbeta)'\Sigma^{-1}(\by - X\bbeta) = (\by - X\hat\bbeta)'\Sigma^{-1}(\by - X\hat\bbeta) + (\bbeta - \hat\bbeta)'X'\Sigma^{-1}X(\bbeta - \hat\bbeta) \]
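Here is a quick numerical check of this decomposition. All quantities (the dimensions, \(X\), \(\by\), \(\Sigma\)) are made-up random inputs; the point is only that the identity holds once \(\hat\bbeta\) solves \(X'\Sigma^{-1}(\by - X\hat\bbeta)=\bm 0\).

```python
import numpy as np

rng = np.random.default_rng(3)
n, q = 30, 4                              # made-up dimensions
X = rng.normal(size=(n, q))
y = rng.normal(size=n)
B = rng.normal(size=(n, n))
Sigma = B @ B.T + n * np.eye(n)           # some positive definite covariance
Sigma_inv = np.linalg.inv(Sigma)

# beta_hat solves X' Sigma^{-1} (y - X beta_hat) = 0, i.e. the GLS estimate.
beta_hat = np.linalg.solve(X.T @ Sigma_inv @ X, X.T @ Sigma_inv @ y)

beta = rng.normal(size=q)                 # any other beta
lhs = (y - X @ beta) @ Sigma_inv @ (y - X @ beta)
rhs = ((y - X @ beta_hat) @ Sigma_inv @ (y - X @ beta_hat)
       + (beta - beta_hat) @ X.T @ Sigma_inv @ X @ (beta - beta_hat))
print(np.isclose(lhs, rhs))               # True
```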
So we know that \(\by = (I_m \otimes X)\balpha + \bvarepsilon\) with \(\bvarepsilon \sim N(\bm 0, \Sigma \otimes I_T)\). Thus, we also know that \(\by \sim N((I_m \otimes X)\balpha, \Sigma \otimes I_T)\). Looking up the density of the multivariate Normal on, for example, Wikipedia reveals that the likelihood can be written, up to proportionality, as
\[ \mathcal{L}(\balpha, \Sigma) \propto |\Sigma \otimes I_T|^{-1/2}\exp\{-\frac{1}{2}(\by - (I_m \otimes X)\balpha)'(\Sigma \otimes I_T)^{-1}(\by - (I_m \otimes X)\balpha)\} \]
Using the small trick above (with \(\Sigma \otimes I_T\) playing the role of \(\Sigma\) and \(I_m \otimes X\) playing the role of \(X\)), we can rewrite this as
\[ \begin{split} \mathcal{L}(\balpha, \Sigma) &\propto |\Sigma \otimes I_T|^{-1/2}\exp\{-\frac{1}{2}(\by - (I_m \otimes X)\hat\balpha)'(\Sigma \otimes I_T)^{-1}(\by - (I_m \otimes X)\hat\balpha)\} \\ &\times \exp\{-\frac{1}{2}(\balpha - \hat\balpha)'(I_m\otimes X)'(\Sigma \otimes I_T)^{-1}(I_m \otimes X)(\balpha - \hat\balpha)\} \end{split} \]
where \(\hat\balpha = (I_m \otimes (X'X)^{-1}X')\by\). Note that, thanks to the Kronecker structure (using the rules listed below), the GLS estimate from the trick above reduces to exactly this expression: \(\left((I_m\otimes X)'(\Sigma^{-1}\otimes I_T)(I_m\otimes X)\right)^{-1}(I_m\otimes X)'(\Sigma^{-1}\otimes I_T)\by = (\Sigma\otimes(X'X)^{-1})(\Sigma^{-1}\otimes X')\by = (I_m \otimes (X'X)^{-1}X')\by\), which is just equation-by-equation OLS and does not depend on \(\Sigma\). The expression above can be massively simplified by making the following observations (see for example the Matrix Cookbook; a numerical check follows the list)
- \(|A \otimes B| = |A|^{\operatorname{rank}(B)}|B|^{\operatorname{rank}(A)}\) and thus \(|\Sigma \otimes I_T| = |\Sigma|^T\)
- \((A \otimes B)' = A' \otimes B'\) and thus \((I_m \otimes X)' = I_m \otimes X'\)
- \((A \otimes B)^{-1} = A^{-1}\otimes B^{-1}\) if the inverses exist. Thus \((\Sigma \otimes I_T)^{-1} = \Sigma^{-1}\otimes I_T\)
- If all matrices are conformable, then \((A\otimes B)(C\otimes D) = AC\otimes BD\) and thus
\[ (I_m\otimes X)'(\Sigma\otimes I_T)^{-1}(I_m \otimes X) = (\Sigma \otimes (X'X)^{-1})^{-1} \]
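The following sketch checks these Kronecker identities numerically for made-up \(X\), \(\Sigma\), and \(\by\), including the claim above that the GLS estimate reduces to equation-by-equation OLS.

```python
import numpy as np

rng = np.random.default_rng(4)
m, k, T = 3, 4, 25                        # made-up dimensions
X = rng.normal(size=(T, k))
B = rng.normal(size=(m, m))
Sigma = B @ B.T + m * np.eye(m)
y = rng.normal(size=m * T)                # stands in for vec(Y)

Z = np.kron(np.eye(m), X)                                # I_m ⊗ X
Omega_inv = np.kron(np.linalg.inv(Sigma), np.eye(T))     # (Sigma ⊗ I_T)^{-1}

# (I_m ⊗ X)'(Sigma ⊗ I_T)^{-1}(I_m ⊗ X) = (Sigma ⊗ (X'X)^{-1})^{-1}
lhs = Z.T @ Omega_inv @ Z
rhs = np.linalg.inv(np.kron(Sigma, np.linalg.inv(X.T @ X)))
print(np.allclose(lhs, rhs))              # True

# ... and the GLS estimate reduces to equation-by-equation OLS, independent of Sigma.
alpha_gls = np.linalg.solve(lhs, Z.T @ Omega_inv @ y)
alpha_ols = np.kron(np.eye(m), np.linalg.solve(X.T @ X, X.T)) @ y
print(np.allclose(alpha_gls, alpha_ols))  # True
```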
Using all of this in the likelihood, we can write
\[ \begin{split} \mathcal{L}(\balpha, \Sigma) &\propto |\Sigma|^{-T/2}\exp\{-\frac{1}{2}(\by - (I_m \otimes X)\hat\balpha)'(\Sigma^{-1}\otimes I_T)(\by - (I_m \otimes X)\hat\balpha)\} \\ &\times \exp\{-\frac{1}{2}(\balpha - \hat\balpha)'(\Sigma \otimes (X'X)^{-1})^{-1}(\balpha - \hat\balpha)\} \end{split} \]
Before we can come to the conclusion of this derivation, we need to rewrite one more part. For this, consider the first exponent term and write
\[ \begin{split} &(\by - (I_m \otimes X)\hat\balpha)'(\Sigma^{-1}\otimes I_T)(\by - (I_m\otimes X)\hat\balpha)\\ = &[(\Sigma^{-1/2}\otimes I_T)(\by - (I_m \otimes X)\hat\balpha)]'[(\Sigma^{-1/2}\otimes I_T)(\by - (I_m \otimes X)\hat\balpha)] \end{split} \]
where \(\Sigma^{-1/2}\) denotes the symmetric square root of \(\Sigma^{-1}\). Carefully looking at the Kronecker product and \(\vect\) operator properties, in particular \(\vect(ABC) = (C'\otimes A)\vect(B)\), we can then recognise that
\[ (\Sigma^{-1/2}\otimes I_T)(\by - (I_m \otimes X)\hat\balpha) = \vect((Y - X\hat A)\Sigma^{-1/2}) \]
where \(\hat\balpha = \vect(\hat A)\). Thus, the term in the first exponent can be written as
\[ \begin{split} & (\by - (I_m \otimes X)\hat\balpha)'(\Sigma^{-1}\otimes I_T)(\by - (I_m\otimes X)\hat\balpha) \\ = &\vect((Y - X\hat A)\Sigma^{-1/2})'\vect((Y - X\hat A)\Sigma^{-1/2}) \end{split} \]
Using \(\vect(A)'\vect(B) = \operatorname{tr}(A'B)\) and the cyclic property of the trace, this equals
\[ \DeclareMathOperator{\tr}{tr} \begin{split} \vect((Y - X\hat A)\Sigma^{-1/2})'\vect((Y - X\hat A)\Sigma^{-1/2}) &= \tr((\Sigma^{-1/2})'(Y-X\hat A)'(Y-X\hat A)\Sigma^{-1/2}) \\ &= \tr(\Sigma^{-1/2}(\Sigma^{-1/2})'(Y - X\hat A)'(Y-X \hat A)) \\ &= \tr(\Sigma^{-1}(Y - X\hat A)'(Y - X\hat A)) \\ &= \tr(\underbrace{(Y-X\hat A)'(Y - X\hat A)}_{S}\Sigma^{-1}) \end{split} \]
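Both steps, the \(\vect\) rewriting and the trace identity, can again be checked numerically. In the sketch below (made-up \(X\), \(Y\), \(\Sigma\)), \(\Sigma^{-1/2}\) is computed as the symmetric square root of \(\Sigma^{-1}\).

```python
import numpy as np

rng = np.random.default_rng(5)
m, k, T = 3, 4, 25                        # made-up dimensions
X = rng.normal(size=(T, k))
Y = rng.normal(size=(T, m))
B = rng.normal(size=(m, m))
Sigma = B @ B.T + m * np.eye(m)

vec = lambda M: M.reshape(-1, order="F")
A_hat = np.linalg.solve(X.T @ X, X.T @ Y)         # equation-by-equation OLS
alpha_hat, y = vec(A_hat), vec(Y)

# symmetric square root of Sigma^{-1}
w, V = np.linalg.eigh(np.linalg.inv(Sigma))
Sigma_inv_half = V @ np.diag(np.sqrt(w)) @ V.T

lhs = np.kron(Sigma_inv_half, np.eye(T)) @ (y - np.kron(np.eye(m), X) @ alpha_hat)
rhs = vec((Y - X @ A_hat) @ Sigma_inv_half)
print(np.allclose(lhs, rhs))                      # True: the vec identity

S = (Y - X @ A_hat).T @ (Y - X @ A_hat)
print(np.isclose(lhs @ lhs, np.trace(S @ np.linalg.inv(Sigma))))  # True: the trace identity
```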
We then finally arrive at the result that we were heading for
\[ \begin{split} \mathcal{L}(\balpha, \Sigma) &\propto |\Sigma|^{-T/2}\exp\{-\frac{1}{2}\tr(S\Sigma^{-1})\} \\ &\times \exp\{-\frac{1}{2}(\balpha - \hat\balpha)'(\Sigma \otimes (X'X)^{-1})^{-1}(\balpha - \hat\balpha)\} \end{split} \]
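As a sanity check, the proportional form above should differ from the exact multivariate Normal log-density only by a term that does not depend on \((\balpha, \Sigma)\), so log-likelihood differences between any two parameter values must agree. The sketch below verifies this for made-up data using scipy's multivariate_normal.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(6)
m, k, T = 3, 4, 25                        # made-up dimensions and data
X = rng.normal(size=(T, k))
Y = rng.normal(size=(T, m))

vec = lambda M: M.reshape(-1, order="F")
A_hat = np.linalg.solve(X.T @ X, X.T @ Y)
alpha_hat, y = vec(A_hat), vec(Y)
S = (Y - X @ A_hat).T @ (Y - X @ A_hat)
XtX_inv = np.linalg.inv(X.T @ X)

def exact_loglik(alpha, Sigma):
    mean = np.kron(np.eye(m), X) @ alpha
    return multivariate_normal(mean=mean, cov=np.kron(Sigma, np.eye(T))).logpdf(y)

def kernel_loglik(alpha, Sigma):          # log of the proportional form above
    d = alpha - alpha_hat
    return (-T / 2 * np.log(np.linalg.det(Sigma))
            - 0.5 * np.trace(S @ np.linalg.inv(Sigma))
            - 0.5 * d @ np.linalg.inv(np.kron(Sigma, XtX_inv)) @ d)

def random_params():
    B = rng.normal(size=(m, m))
    return rng.normal(size=m * k), B @ B.T + m * np.eye(m)

(a1, S1), (a2, S2) = random_params(), random_params()
print(np.isclose(exact_loglik(a1, S1) - exact_loglik(a2, S2),
                 kernel_loglik(a1, S1) - kernel_loglik(a2, S2)))  # True
```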
Why is this result so useful? Carefully looking at the expression above and comparing it to some known distributions, you might notice that the second exponential looks very much like the kernel of a multivariate Normal distribution, while the first part looks very much like the kernel of an Inverse-Wishart distribution. To actually get there, we still need to multiply and divide by \(|\Sigma \otimes (X'X)^{-1}|^{-1/2}\), which by the rules of the Kronecker product is the same as \(|\Sigma|^{-k/2}|(X'X)^{-1}|^{-m/2}\). Thus, after dropping \(|(X'X)^{-1}|^{-m/2}\) because it does not depend on \(\balpha\) or \(\Sigma\), we can finally write the likelihood as
\[ \begin{split} \mathcal{L}(\balpha, \Sigma) &\propto |\Sigma|^{-(T-k)/2}\exp\{-\frac{1}{2}\tr(S\Sigma^{-1})\} \\ &\times |\Sigma \otimes (X'X)^{-1}|^{-1/2}\exp\{-\frac{1}{2}(\balpha - \hat\balpha)'(\Sigma \otimes (X'X)^{-1})^{-1}(\balpha - \hat\balpha)\} \\ &\propto p(\Sigma)\,p(\balpha|\Sigma) \end{split} \]
In the last line above I have already indicated that this is proportional to the product of a marginal distribution for \(\Sigma\) and a conditional distribution for \(\balpha\). From the functional forms we find that
\[ \begin{split} \Sigma &\sim IW(S, T-k-m-1) \\ \balpha | \Sigma &\sim N(\hat\balpha, \Sigma \otimes (X'X)^{-1}) \end{split} \]
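To make these forms tangible, here is a minimal sketch that draws \(\Sigma\) from the Inverse-Wishart part and then \(\balpha | \Sigma\) from the conditional Normal part, again for made-up data. This is only an illustration of the two building blocks; under a flat prior such draws would coincide with posterior draws, but the prior discussion is left for the future posts mentioned below. Note that scipy's invwishart uses the (df, scale) parametrisation, so \(IW(S, T-k-m-1)\) corresponds to df = T - k - m - 1 and scale = S.

```python
import numpy as np
from scipy.stats import invwishart

rng = np.random.default_rng(7)
m, k, T = 3, 4, 50                        # made-up dimensions and data
X = rng.normal(size=(T, k))
Y = rng.normal(size=(T, m))

vec = lambda M: M.reshape(-1, order="F")
A_hat = np.linalg.solve(X.T @ X, X.T @ Y)
alpha_hat = vec(A_hat)
S = (Y - X @ A_hat).T @ (Y - X @ A_hat)
XtX_inv = np.linalg.inv(X.T @ X)

# Sigma ~ IW(S, T-k-m-1), then alpha | Sigma ~ N(alpha_hat, Sigma ⊗ (X'X)^{-1}).
draws = []
for _ in range(1000):
    Sigma_draw = invwishart.rvs(df=T - k - m - 1, scale=S, random_state=rng)
    alpha_draw = rng.multivariate_normal(alpha_hat, np.kron(Sigma_draw, XtX_inv))
    draws.append(alpha_draw)

print(np.round(np.mean(draws, axis=0), 2))
print(np.round(alpha_hat, 2))             # the alpha draws are centred at alpha_hat
```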
So again, why is this useful? Because the likelihood is proportional to the product of these two distributions. Why is that in turn useful? Because we know how to design smart priors for each part. That is, we know how to design a smart prior for the Inverse-Wishart part and a smart prior for the conditional Normal part. By smart I mean, among other things, that we know conjugate priors for both parts. But even for non-conjugate priors, the form above is often more convenient to use than the classical form of the likelihood. I will show these benefits in future posts.