Ordinary Least Squares

Ordinary least squares (OLS) is one of the most commonly used statistical techniques for forecasting and causal inference.  OLS estimates a linear relationship between a dependent variable and one or more independent variables.  Below is a classic example of a linear model:

Suppose you want to model the relationship between a mother's height and her child's height.  You can assume the relationship is approximately linear (we do not see a 5-foot mother having a 25-foot child).  Also, the child's height should not influence the mother's, so the direction of causation is one-way.  This can be represented as follows:

Y = Alpha + Beta * X + Error

Where:

Y is the child's height at age 18

X is the mother’s height

Beta is the influence of the mother’s height on the child’s height

Alpha is the intercept or base height for the people in the study

Error captures omitted variables, measurement error, and inherent randomness in the relationship.

This model uses cross-sectional data: the observations are individuals, and time is not a relevant factor.  Ordinary least squares can also model time-series data, such as the influence of interest rates on GNP.
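As a concrete illustration, here is a minimal sketch of fitting the height model with Python's statsmodels package.  The data are synthetic: the intercept (20 inches), slope (0.7), and noise level are made-up illustration values, not estimates from any real study.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
mother = rng.normal(64, 2.5, n)       # X: mother's height in inches (synthetic)
error = rng.normal(0, 2.0, n)         # unobserved influences on the child
child = 20 + 0.7 * mother + error     # Y = Alpha + Beta * X + Error

X = sm.add_constant(mother)           # adds the intercept (Alpha) column
result = sm.OLS(child, X).fit()
print(result.params)                  # estimated [Alpha, Beta]

The printed estimates should land near the true values of 20 and 0.7, and the fit tightens as n grows.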

OLS is a powerful tool; however, it comes with many restrictive assumptions, and violating just one of them can render a model invalid.  In the example above, excluding a relevant variable such as poverty level (which may influence the child's health) or a time indicator capturing advances in medical technology may invalidate or weaken the results.
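A short simulation makes the omitted-variable problem concrete.  Below, z stands in for an omitted factor (such as poverty level) that is correlated with the included regressor x; all numbers are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(1)
n = 100_000
x = rng.normal(0, 1, n)
z = 0.5 * x + rng.normal(0, 1, n)              # omitted variable, correlated with x
y = 1.0 + 2.0 * x - 1.5 * z + rng.normal(0, 1, n)

X_full = np.column_stack([np.ones(n), x, z])   # correctly specified model
X_short = np.column_stack([np.ones(n), x])     # z omitted

print(np.linalg.lstsq(X_full, y, rcond=None)[0])   # ~[1.0, 2.0, -1.5]
print(np.linalg.lstsq(X_short, y, rcond=None)[0])  # slope biased toward ~1.25

The short regression's slope absorbs part of z's effect (roughly Beta + (-1.5) * cov(x, z) / var(x) = 1.25), so the estimated influence of x is misleading.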

BLUE

The ordinary least squares estimator is BLUE (best linear unbiased estimator) as long as certain assumptions hold.  The Gauss-Markov theorem defines the conditions under which a least squares estimator is BLUE.  When a linear model violates an assumption required for BLUE, it is no longer guaranteed that the model most accurately reflects the true underlying relationship.

BLUE means the model is unbiased, efficient (it has the minimum variance among linear unbiased estimators), and consistent (the estimates improve as the sample size increases).
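A small Monte Carlo sketch illustrates the unbiasedness and consistency properties on synthetic data (the true slope of 0.7 is an illustration value): across many repeated samples, the average OLS slope sits at the true value, and its spread shrinks as the sample size grows.

import numpy as np

rng = np.random.default_rng(2)
true_beta = 0.7

def slope_estimates(n, reps=2000):
    betas = np.empty(reps)
    for i in range(reps):
        x = rng.normal(64, 2.5, n)
        y = 20 + true_beta * x + rng.normal(0, 2.0, n)
        betas[i] = np.polyfit(x, y, 1)[0]   # OLS slope of y on x
    return betas

for n in (25, 100, 400):
    b = slope_estimates(n)
    print(f"n={n:4d}  mean={b.mean():.3f}  sd={b.std():.3f}")

The mean stays near 0.7 at every sample size (unbiasedness), while the standard deviation falls as n increases (consistency).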

Some of the consequences of violating the BLUE assumptions:

1. Misspecification from including irrelevant independent variables or excluding relevant ones

2. A poor or biased fit

3. Unclear or misleading causal relationships

Assumptions for BLUE

1. Linearity: the relationship between the dependent and independent variables is linear.
   Violations: non-linearity, wrong regressors, changing parameters.
   Effect: coefficients are biased.

2. The error term has a mean of zero and comes from a normal distribution.
   Violation: omitted variables.
   Effect: the intercept is biased.

3. Homoscedasticity: the error terms have a uniform variance.
   Violation: heteroskedasticity.
   Effect: standard errors are biased.

4. No serial correlation: the observations are uncorrelated.
   Violations: errors in variables, autoregression, simultaneous equations.
   Effect: standard errors are biased.

5. Variables are at least interval-level data.
   Violation: a dummy (categorical) dependent variable.
   Effect: coefficients are biased.

6. No exact linear relationship among the independent variables (no perfect multicollinearity), and more observations than independent variables.
   Violation: perfect multicollinearity.
   Effect: standard errors are inflated.

Other important assumptions

No outliers.  Outliers are an issue when examining data with rare events; the question becomes: is this a rare event that must be modeled, or an atypical case?  Outliers can lead to biased estimates.

No measurement error.  Some level of measurement error will always exist, and it can lead to biased coefficients.
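For classical measurement error in a regressor, the bias has a known direction: the slope is attenuated toward zero by the factor var(x) / (var(x) + var(noise)).  A sketch on synthetic data:

import numpy as np

rng = np.random.default_rng(3)
n = 100_000
x_true = rng.normal(0, 1, n)
y = 1.0 + 2.0 * x_true + rng.normal(0, 1, n)
x_noisy = x_true + rng.normal(0, 1, n)   # regressor observed with error

print(np.polyfit(x_true, y, 1)[0])    # ~2.0, the true slope
print(np.polyfit(x_noisy, y, 1)[0])   # ~1.0, attenuated by 1 / (1 + 1)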

Testing for BLUE

The sketch after this table runs several of these diagnostics in Python.

1. Linearity: the relationship between the dependent and independent variables is linear.
   Test: plot the residuals to check for non-linearity.
   Possible solution: transform the variables.

2. The error term has a mean of zero and comes from a normal distribution.
   Test: plot the residuals.
   Possible solution: use a generalized linear model (GLM).

3. Homoscedasticity: the error terms have a uniform variance.
   Test: Breusch-Pagan and White tests for heteroskedasticity.
   Possible solution: use generalized least squares (GLS).

4. No serial correlation: the observations are uncorrelated.
   Test: Durbin-Watson test for autocorrelation.
   Possible solution: include lagged regressors.

5. Variables are at least interval-level data.
   Test: understand how the dependent variable is constructed.
   Possible solution: use logistic regression for a categorical dependent variable.

6. No exact linear relationship among the independent variables, and more observations than independent variables.
   Test: run correlations among the independent variables.
   Possible solution: exclude collinear variables, or use two-stage least squares for strongly correlated data.

7. No outliers.
   Test: Cook's distance, plot the residuals.
   Possible solution: remove the outliers.
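The sketch below runs several of these diagnostics with statsmodels on synthetic data built to violate homoscedasticity; the data and numbers are illustrative assumptions, not a recipe for a real analysis.

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(4)
n = 500
x = rng.normal(0, 1, n)
# Error variance grows with |x|: a deliberate heteroskedasticity violation.
y = 1.0 + 2.0 * x + rng.normal(0, 1, n) * (1 + np.abs(x))

X = sm.add_constant(x)
res = sm.OLS(y, X).fit()

# Assumption 3: Breusch-Pagan (a small p-value flags heteroskedasticity).
lm_stat, lm_pval, f_stat, f_pval = het_breuschpagan(res.resid, X)
print("Breusch-Pagan p-value:", lm_pval)

# Assumption 4: Durbin-Watson (values near 2 suggest no serial correlation).
print("Durbin-Watson:", durbin_watson(res.resid))

# Outliers: Cook's distance for each observation.
cooks_d, _ = res.get_influence().cooks_distance
print("Max Cook's distance:", cooks_d.max())

# Assumptions 1 and 2: plot residuals against fitted values (needs matplotlib):
# import matplotlib.pyplot as plt
# plt.scatter(res.fittedvalues, res.resid); plt.show()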