S. Tanvir Hossain

Regression Basics - 1

Categories: Regression, Econometrics
Elementary Regression Concepts from an Advanced Point of View
Author

Shaikh Tanvir Hossain

Published

September 14, 2024

Introduction

In this series of posts I will discuss some elementary regression concepts using the idea of the Conditional Expectation Function. Not all econometrics books approach regression from this point of view, but I think it is a very useful way to think about it; notable exceptions are Hansen’s Econometrics and Angrist and Pischke’s Mostly Harmless Econometrics. I draw some material from these books and some from my own understanding. Please correct me if you think something is wrong.

Usually when we think about regression, we think about a function \(f(X)\) to predict \(Y\). Here \(X\) is typically a vector of covariates (also called features or regressors), i.e., \(X:=(X_1, X_2, \ldots, X_k)\), and \(Y\) is a scalar. Most of the time we assume an additive error; in that case we can define the error as \(\epsilon := Y - f(X)\) and write the model as

\[ Y = f(X) + \epsilon \]

In this post we ask: what is the best choice of \(f(X)\)? The question is vague as stated, since we need to be specific about what we mean by “best”. We will pick a simple definition: the best \(f\) is the one that minimizes the mean squared error (MSE), i.e., we want \(f\) to minimize \(\mathbb{E}\left[(Y - f(X))^2\right]\). It turns out that in this minimum-MSE sense the answer is the Conditional Expectation Function (CEF) \(\mathbb{E}[Y \mid X]\). Below we first explain what the CEF is, then discuss some of its properties, and finally show that it is the best predictor of \(Y\) in the minimum-MSE sense.

Conditional Expectation Function (CEF)

Suppose we have two random variables \(X\) and \(Y\). Let \(f_{X, Y}(x, y)\) be the joint density of \(X\) and \(Y\), let \(f_X(x)\) and \(f_Y(y)\) be the marginal densities, and let \(f_{Y \mid X}(y \mid x)=\frac{f_{X, Y}(x, y)}{f_X(x)}\), defined when \(f_X(x)>0\), be the conditional density. The Conditional Expectation Function (CEF) is defined as

\[ m(x):=\mathbb{E}[Y \mid X=x]=\int y f_{Y \mid X}(y \mid x) d y \]
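To make the definition concrete, here is a minimal simulation sketch. The data-generating process is my own assumption for illustration (it does not appear in the post): \(Y = X^2 + e\) with \(X \sim \text{Uniform}(-2,2)\) and \(e \sim N(0,1)\), so the true CEF is \(m(x) = x^2\). We estimate \(m(x)\) nonparametrically by averaging \(Y\) within narrow bins of \(X\):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical DGP (an assumption, not from the post): Y = X^2 + e,
# with X ~ Uniform(-2, 2) and e ~ N(0, 1), so the true CEF is m(x) = x^2.
n = 200_000
X = rng.uniform(-2, 2, size=n)
Y = X**2 + rng.normal(0, 1, size=n)

# Estimate m(x) by averaging Y within narrow bins of X.
bins = np.linspace(-2, 2, 41)
idx = np.digitize(X, bins) - 1            # bin index of each draw (0..39)
centers = (bins[:-1] + bins[1:]) / 2
m_hat = np.array([Y[idx == b].mean() for b in range(len(centers))])

# With this many draws the binned means track the true CEF x^2 closely.
print(np.max(np.abs(m_hat - centers**2)))
```

The binned means are a crude but honest estimate of the CEF: as the sample grows and the bins shrink, they converge to \(m(x)\) pointwise.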

Here we assume \(\mathbb{E}[|Y|]<\infty\) (and, for the MSE results below, \(\mathbb{E}\left[Y^2\right]<\infty\); these are standard assumptions). We will show that the CEF is the best predictor of \(Y\) given \(X\) in the mean squared error sense. That is, for any function \(g\) such that \(\mathbb{E}\left[g(X)^2\right]<\infty\), we have

\[ \mathbb{E}\left[\left(Y-g(X)\right)^2\right]\geq \mathbb{E}\left[\left(Y-m(X)\right)^2\right] \]

We will see the proof of this statement shortly, but first we will see some properties of CEF which will be useful in the proof.
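Before the proof, the inequality can also be checked numerically. The sketch below uses the same hypothetical DGP as before (my assumption, with known CEF \(m(X) = X^2\)) and compares the MSE of the CEF against one particular competitor \(g(X)\), the best-fitting straight line:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical DGP (an assumption for illustration): Y = X^2 + e,
# so the true CEF is m(X) = X^2.
n = 100_000
X = rng.uniform(-2, 2, size=n)
Y = X**2 + rng.normal(0, 1, size=n)

# Predictor 1: the CEF itself. Its MSE estimates Var(e) = 1.
mse_cef = np.mean((Y - X**2) ** 2)

# Predictor 2: the best-fitting straight line, one particular choice of g(X).
b1, b0 = np.polyfit(X, Y, 1)
mse_line = np.mean((Y - (b0 + b1 * X)) ** 2)

print(mse_cef, mse_line)  # the CEF attains the smaller MSE
```

No matter which \(g\) we try, its MSE cannot fall below that of the CEF; the line is just one illustrative competitor.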

Lemma (CEF Properties)
  • Property 1: Law of Iterated Expectations: \(\mathbb{E}[\mathbb{E}[Y \mid X]]=\mathbb{E}[Y]\).
  • Property 2: Iterated conditioning on a smaller information set (the tower property): \(\mathbb{E}\left[\mathbb{E}\left[Y \mid X_1, X_2\right] \mid X_1\right]=\mathbb{E}\left[Y \mid X_1\right]\).
  • Property 3 : “Take out what’s known”: for any function \(g\) of \(X\), \(\mathbb{E}[g(X) Y \mid X]=g(X) \mathbb{E}[Y \mid X]\) and \(\mathbb{E}[g(X) Y]=\mathbb{E}[g(X) \mathbb{E}[Y \mid X]]\).
  • Property 4 : CEF residuals are mean-zero and orthogonal to \(X\): for \(\epsilon:=Y-m(X)\), we have \(\mathbb{E}[\epsilon \mid X]=0\) and \(\mathbb{E}[\epsilon]=0\), and also \(\mathbb{E}[g(X) \epsilon]=0\) for any function \(g\) of \(X\) such that \(\mathbb{E}\left[g(X)^2\right]<\infty\).
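The properties above can be checked on simulated draws. Again assuming the hypothetical DGP \(Y = X^2 + e\) (my illustration, with known CEF \(m(X) = X^2\)), Property 1 says \(\mathbb{E}[m(X)]=\mathbb{E}[Y]\), and Property 4 says the residual is mean-zero and orthogonal to any \(g(X)\), here \(g(X)=X^3\):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical DGP (an assumption): Y = X^2 + e with known CEF m(X) = X^2,
# so the properties can be checked directly on simulated draws.
n = 500_000
X = rng.uniform(-2, 2, size=n)
eps = rng.normal(0, 1, size=n)
Y = X**2 + eps

m_X = X**2        # the CEF evaluated at each draw
resid = Y - m_X   # CEF residual (equals eps by construction here)

# Property 1 (LIE): E[m(X)] should match E[Y] up to sampling noise.
print(m_X.mean(), Y.mean())

# Property 4: residual is mean-zero and orthogonal to g(X), e.g. g(X) = X^3.
print(resid.mean(), np.mean(X**3 * resid))
```

Both sample means agree to a few decimal places, and both residual moments are near zero, as the lemma predicts.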
Proof of CEF Properties

Proof of Property 1:

\[ \begin{gathered} \mathbb{E}[m(X)]=\int m(x) f_X(x) d x=\int\left(\int y f_{Y \mid X}(y \mid x) d y\right) f_X(x) d x=\iint y f_{Y \mid X}(y \mid x) f_X(x) d y d x \\ =\iint y f_{X, Y}(x, y) d y d x=\mathbb{E}[Y] \end{gathered} \]

We will discuss the proofs of the other properties later.
