I found the following description of partial least squares regression to be much better than my own, so I quote it below:

“Partial least squares regression (PLS regression) is a statistical method that bears some relation to principal components regression; instead of finding hyperplanes of maximum variance between the response and independent variables, it finds a linear regression model by projecting the predicted variables and the observable variables to a new space. Because both the X and Y data are projected to new spaces, the PLS family of methods are known as bilinear factor models. Partial least squares Discriminant Analysis (PLS-DA) is a variant used when the Y is categorical.”

—from Wikipedia

PLS algorithm

For the record, I derive the PLS algorithm in this post.

Consider a linear regression problem in vector form as in Eq. (1), given observed data denoted by a vector \(\boldsymbol{x}^{T}=(x_{1},x_{2},\cdots,x_{n})\):

$$
\begin{align}
y &=b_{1}x_{1}+\cdots+b_{n}x_{n}+e \\
&=\boldsymbol{x}^{T}\boldsymbol{b}+e \tag{1}
\end{align}
$$

We aim to work out the unknown vector of regression coefficients \(\boldsymbol{b}\) so that we can predict \(y\) once we observe new data \(\boldsymbol{x}\). To keep the prediction as accurate as possible, we want the residual error \(e\) to be as small as possible; this is the whole idea of the linear regression method. One solution is to use training samples: given \(m\) observations \(\boldsymbol{y}^{T}=(y_{1},\cdots,y_{m})\) and \(X=[x_{ij}],\ i=1,\cdots,m,\ j=1,\cdots,n\), solve for \(\boldsymbol{b}\) while keeping the residual errors sufficiently small:

$$
\begin{pmatrix}
y_{1}\\
y_{2}\\
\vdots\\
y_{m}
\end{pmatrix}=\begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1n}\\
x_{21} & x_{22} & \cdots & x_{2n}\\
\vdots & \vdots & \ddots & \vdots\\
x_{m1} & x_{m2} & \cdots & x_{mn}
\end{pmatrix}\begin{pmatrix}
b_{1}\\
b_{2}\\
\vdots\\
b_{n}
\end{pmatrix}+\begin{pmatrix}
e_{1}\\
e_{2}\\
\vdots\\
e_{m}
\end{pmatrix}\tag{2}
$$

For comparability among different data, all observations are assumed to be centered and normalized in advance. The problem now translates into an optimization problem: find a least-squares solution for the regression coefficients such that the sum of squared residuals is minimized. For simplicity, I will use the matrix form of Eq. (2):

$$
Y=X\boldsymbol{b}+\boldsymbol{e}. \tag{3}
$$

We want to find out \(\boldsymbol{b}\) such that

$$
\min\{\boldsymbol{e}^{T}\boldsymbol{e}\}.
$$

The least-squares solution for \(\boldsymbol{b}\) is given by

$$
\boldsymbol{b}=(X^{T}X)^{-1}X^{T}Y. \tag{4}
$$
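As a quick numerical sanity check of Eq. (4), using toy data and variable names of my own choosing, the normal-equations solution can be compared with numpy's built-in least-squares solver:

```python
import numpy as np

# Toy data, only to illustrate Eq. (4); sizes and noise level are arbitrary.
rng = np.random.default_rng(0)
m, n = 50, 5
X = rng.standard_normal((m, n))
b_true = rng.standard_normal(n)
Y = X @ b_true + 0.1 * rng.standard_normal(m)

b_normal = np.linalg.solve(X.T @ X, X.T @ Y)     # Eq. (4): (X^T X)^{-1} X^T Y
b_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)  # numpy's least-squares solver
assert np.allclose(b_normal, b_lstsq)
```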

The derivation is straightforward. Define the objective

$$
F_{\boldsymbol{b}}=\frac{1}{2}(X\boldsymbol{b}-Y)^{T}(X\boldsymbol{b}-Y) \tag{5}
$$

The gradient of \(F_{\boldsymbol{b}}\) with respect to \(\boldsymbol{b}\) is then

$$
\begin{align}
\nabla_{\boldsymbol{b}}F_{\boldsymbol{b}}
&=\frac{1}{2}\nabla_{\boldsymbol{b}}(X\boldsymbol{b}-Y)^{T}(X\boldsymbol{b}-Y)\\
&=\frac{1}{2}\nabla_{\boldsymbol{b}}(\boldsymbol{b}^{T}X^{T}-Y^{T})(X\boldsymbol{b}-Y)\\
&=\frac{1}{2}\nabla_{\boldsymbol{b}}(\boldsymbol{b}^{T}X^{T}X\boldsymbol{b}-\boldsymbol{b}^{T}X^{T}Y-Y^{T}X\boldsymbol{b}+Y^{T}Y)\\
&=\frac{1}{2}\nabla_{\boldsymbol{b}}tr(\boldsymbol{b}^{T}X^{T}X\boldsymbol{b}-\boldsymbol{b}^{T}X^{T}Y-Y^{T}X\boldsymbol{b}+Y^{T}Y)\\
&=\frac{1}{2}\nabla_{\boldsymbol{b}}\left(tr(\boldsymbol{b}^{T}X^{T}X\boldsymbol{b})-2tr(Y^{T}X\boldsymbol{b})+tr(Y^{T}Y)\right)\\
&=X^{T}X\boldsymbol{b}-X^{T}Y.
\end{align}
$$

By setting the gradient to zero, the solution in Eq. (4) is obtained.
However, this least-squares solution runs into trouble when there are not enough training samples, that is, when \(m<n\). In that case, the matrix \(X^{T}X\) does not have full rank, which means it is singular.
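A short numpy check illustrates the problem (the sizes here are arbitrary; any \(m<n\) will do):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 20                         # fewer samples than features
X = rng.standard_normal((m, n))
G = X.T @ X                          # n x n matrix of rank at most m
print(np.linalg.matrix_rank(G))      # prints 5 < 20, so G is singular and Eq. (4) breaks down
```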

To solve this problem, we can project each measurement \(\boldsymbol{x}_{i}=(x_{i1},\cdots,x_{in})\) onto a lower-dimensional subspace via \(T=XW\). We can think of this as forming a smaller set of features, each a linear combination of the original features. These new features are also called "latent" variables. The linear regression can then be written as a regression system on the latent variables:

$$
\begin{align}
Y &=TQ^{T}+F \tag{6} \\
X &=TP^{T}+E \tag{7}
\end{align}
$$

where \(P,Q\) are the coefficient (loading) matrices and \(E,F\) are matrices of errors. In partial least squares regression (PLSR), the weight matrix \(W\) reflects the covariance structure between the predictor and response variables: maximizing the covariance \(Cov(T,Y)\) between the latent variables and the responses gives us the weight matrix \(W\). Once \(W\) is obtained and \(T\) is constructed, \(Q^{T}\) follows from the least-squares solution of Eq. (6):

$$
Q^{T}=(T^{T}T)^{-1}T^{T}Y.
$$

Plugging \(T=XW\) into the regression equation Eq. (6), we obtain the coefficient vector \(\boldsymbol{b}\) of the model in Eq. (3):

$$
\boldsymbol{b}=W(T^{T}T)^{-1}T^{T}Y.
$$
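In code, the two formulas above amount to the following minimal sketch, assuming a weight matrix \(W\) has already been chosen (how PLS chooses \(W\) is exactly what the NIPALS steps below describe); `latent_regression` is a name of my own:

```python
import numpy as np

def latent_regression(X, Y, W):
    """b = W (T^T T)^{-1} T^T Y with latent variables T = X W, for a given W."""
    T = X @ W                                # project onto the latent subspace
    Qt = np.linalg.solve(T.T @ T, T.T @ Y)   # Q^T, the least-squares fit of Eq. (6)
    return W @ Qt                            # coefficients back in the original x-space
```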

So this is the whole idea of PLS. A commonly used algorithm for computing PLSR is the nonlinear iterative partial least squares (NIPALS) method. The steps are summarized below, followed by a small numerical sketch.

NIPALS

  1. Normalize \(m\) training samples by

$$
\begin{align}
x_{ij} &=\left(x_{ij}-\bar{x}_{\cdot j}\right)/\sigma_{x_{\cdot j}} \\
y_{i} &=\left(y_{i}-\bar{y}\right)/\sigma_{y}\quad\text{for }i=1,\cdots,m,\ j=1,\cdots,n
\end{align}
$$

Compute the scores

  2. Compute the dominant eigenvector of \(X_{0}^{T}Y_{0}Y_{0}^{T}X_{0}\), where \(X_{0}\) and \(Y_{0}\) denote the normalized data matrix and response from step 1, and assign it to \(w_{1}\) (this is the covariance-maximization step). Normalize \(w_{1}\). The scores \(t_{1}\) of \(X_{0}\) are then

$$
t_{1}=X_{0}w_{1}.
$$

Do the same for \(Y_{0}\) to obtain a normalized weight vector \(c_{1}\). The scores \(v_{1}\) of \(Y_{0}\) are

$$
v_{1}=Y_{0}c_{1}.
$$

Compute the loadings

  3. Based on the regression equations

$$
\begin{align}
X &=t_{1}p_{1}^{T}+E \\
Y &=t_{1}q_{1}^{T}+F,
\end{align}
$$

the least-squares method gives us

$$
\begin{align}
p_{1} &=\frac{X^{T}t_{1}}{t_{1}^{T}t_{1}} \\
q_{1} &=\frac{Y^{T}t_{1}}{t_{1}^{T}t_{1}}.
\end{align}
$$

  4. Check whether the residual \(F\) is sufficiently small. If not, set \(X_{1}=E\) and \(Y_{1}=F\), and repeat steps 2-3 with \(X_{1},Y_{1}\) in place of \(X_{0},Y_{0}\).

  5. Finally, after extracting \(k\le n\) components, we obtain

$$
\begin{align}
X &=t_{1}p_{1}^{T}+t_{2}p_{2}^{T}+\cdots+t_{k}p_{k}^{T}+E \\
Y &=t_{1}q_{1}^{T}+t_{2}q_{2}^{T}+\cdots+t_{k}q_{k}^{T}+F
\end{align}
$$

which in matrix form is

$$
\begin{align}
X &=TP^{T}+E \\
Y &=TQ^{T}+F \\
&=XWQ^{T}+F \\
&:=XB+F.
\end{align}
$$
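Below is a minimal numpy sketch of the NIPALS steps above. The function and variable names are my own, `X` and `Y` are assumed to be already centered and scaled as in step 1, and the code follows the eigenvector formulation described here rather than any particular library's implementation:

```python
import numpy as np

def nipals_pls(X, Y, n_components, tol=1e-10):
    """NIPALS sketch following the steps above.

    X : (m, n) centered/scaled predictors; Y : (m, p) centered/scaled responses.
    Returns weights W, scores T, loadings P, Q, and B = W Q^T so that Y ~ X B
    (exact when T = X W, as assumed in the derivation above).
    """
    X_k, Y_k = X.copy(), Y.copy()
    Ws, Ts, Ps, Qs = [], [], [], []
    for _ in range(n_components):
        # Step 2: dominant (unit-norm) eigenvector of X_k^T Y_k Y_k^T X_k -> weight w
        _, eigvecs = np.linalg.eigh(X_k.T @ Y_k @ Y_k.T @ X_k)
        w = eigvecs[:, -1]
        t = X_k @ w                       # scores t = X_k w
        # Step 3: least-squares loadings p = X^T t / (t^T t), q = Y^T t / (t^T t)
        tt = t @ t
        p = X_k.T @ t / tt
        q = Y_k.T @ t / tt
        # Step 4: deflate; the residuals E, F become the next X_k, Y_k
        X_k = X_k - np.outer(t, p)
        Y_k = Y_k - np.outer(t, q)
        Ws.append(w); Ts.append(t); Ps.append(p); Qs.append(q)
        if np.linalg.norm(Y_k) < tol:     # residual F sufficiently small: stop early
            break
    W, T, P, Q = (np.column_stack(v) for v in (Ws, Ts, Ps, Qs))
    return W, T, P, Q, W @ Q.T            # B = W Q^T, as in the final matrix equation
```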

So, when we have a new data point \(x_{new}\), to define the PLS components in our ABC-MCMC algorithm we only need to compute

$$
t_{1}=x_{new}^{T}p_{1},\quad t_{2}=x_{new}^{T}p_{2},\quad\cdots,\quad t_{k}=x_{new}^{T}p_{k}.
$$
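As a hypothetical usage sketch continuing the `nipals_pls` function above (the names `X_train`, `y_train`, and `x_new` are placeholders), the new observation is standardized with the training statistics from step 1, and the components and the prediction are then plain matrix products:

```python
# Standardize with the *training* means and standard deviations (step 1).
X0 = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)
Y0 = (y_train - y_train.mean()) / y_train.std()

W, T, P, Q, B = nipals_pls(X0, Y0[:, None], n_components=3)

x0_new = (x_new - X_train.mean(axis=0)) / X_train.std(axis=0)
t_new = x0_new @ P                     # components t_i = x_new^T p_i, as above
y0_pred = x0_new @ B                   # predicted (standardized) response via Y ~ X B
y_pred = y0_pred * y_train.std() + y_train.mean()   # back to the original scale
```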