## Ordinary Least Squares

Ordinary least squares (OLS) estimators are among the most commonly used statistical techniques for forecasting and causal inference. OLS estimates a linear relationship between variables. Below is a classic example of a linear model:

Suppose you want to model the relationship between a mother's height and her child's height. You can assume this relationship is approximately linear (we do not see a 5-foot mother having a 25-foot child). Also, the child's height should not influence the mother's, so the direction of causation is one-way. This can be represented as follows:

Y = Alpha + Beta * X + error

Where:

- Y is the child's height at 18
- X is the mother's height
- Beta is the influence of the mother's height on the child's height
- Alpha is the intercept, or base height for the people in the study
- The error term includes missing variables, measurement error, and inherent randomness in the relationship

This model would use cross-sectional data: the data consist of individuals, and time is not a relevant factor. Ordinary least squares can also model time-series data, such as the influence of interest rates on GNP.
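The height model above can be sketched in a few lines of numpy. The data here are synthetic, and the "true" values of Alpha (20) and Beta (0.7) are invented purely for illustration:

```python
import numpy as np

# Synthetic illustration of the height model: child = Alpha + Beta * mother + error.
# The "true" Alpha and Beta below are made-up values for demonstration only.
rng = np.random.default_rng(0)
n = 500
mother = rng.normal(64.0, 2.5, n)                     # mothers' heights in inches
child = 20.0 + 0.7 * mother + rng.normal(0, 2.0, n)   # true Alpha = 20, Beta = 0.7

# OLS: stack a column of ones for the intercept and solve the least squares problem.
X = np.column_stack([np.ones(n), mother])
coef, *_ = np.linalg.lstsq(X, child, rcond=None)
alpha_hat, beta_hat = coef
print(alpha_hat, beta_hat)   # estimates should be near 20 and 0.7
```

The estimated coefficients recover the values that generated the data, up to sampling noise.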

OLS is a powerful tool; however, it has many restrictive assumptions, and violating just one of them can render a model invalid. In the example above, excluding a relevant variable such as poverty level (which may influence the child's health) or a time indicator capturing advances in medical technology may invalidate or weaken the results.
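Omitted-variable bias can be seen directly in simulation. The sketch below uses invented numbers: the true model has two correlated regressors, but the second regression leaves one of them out, and its coefficient absorbs part of the omitted variable's effect:

```python
import numpy as np

# Sketch of omitted-variable bias with made-up coefficients.
rng = np.random.default_rng(1)
n = 2000
x1 = rng.normal(0, 1, n)
x2 = 0.8 * x1 + rng.normal(0, 1, n)              # x2 is correlated with x1
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(0, 1, n)

def ols(X, y):
    X = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(X, y, rcond=None)[0]

full = ols(np.column_stack([x1, x2]), y)   # both regressors: slope on x1 near 2
short = ols(x1, y)                         # x2 omitted: slope absorbs 3 * 0.8 = 2.4 extra
print(full[1], short[1])                   # roughly 2.0 vs roughly 4.4
```

The short regression's slope is biased upward by exactly the omitted coefficient times the regression of x2 on x1.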

## BLUE

The ordinary least squares estimator is BLUE (best linear unbiased estimator) as long as certain assumptions hold. The Gauss-Markov theorem defines the conditions under which a least squares estimator is BLUE. When a linear model violates the assumptions required for BLUE, it is no longer guaranteed to be the most accurate reflection of the underlying relationship.

BLUE means the estimator is unbiased, efficient (it has the minimum variance among linear unbiased estimators), and consistent (the estimate improves as the sample size increases).
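The unbiasedness property can be sketched with a Monte Carlo experiment: across many simulated samples, the average OLS slope estimate centers on the true slope (0.7 here, an arbitrary choice):

```python
import numpy as np

# Monte Carlo sketch of "unbiased": repeated samples, same true slope.
rng = np.random.default_rng(2)
true_beta = 0.7
estimates = []
for _ in range(1000):
    x = rng.normal(0, 1, 50)
    y = 1.0 + true_beta * x + rng.normal(0, 1, 50)
    X = np.column_stack([np.ones(50), x])
    estimates.append(np.linalg.lstsq(X, y, rcond=None)[0][1])
print(np.mean(estimates))   # close to 0.7: no systematic over- or under-shoot
```

Individual estimates scatter around the truth, but their average does not drift, which is what unbiasedness means.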

Violating the BLUE assumptions can result in:

1. Bias from including irrelevant independent variables or excluding relevant ones.

2. A poor or biased fit.

3. Unclear or misleading causal relationships.

**Assumptions for BLUE**

| Assumption | Examples of Violations | Effect |
| --- | --- | --- |
| 1. Linearity: the relationship between the dependent and independent variables is linear. | Non-linearity, wrong regressors, changing parameters | Coefficients are biased. |
| 2. The error term has mean zero and comes from a normal distribution. | Omitted variables, biased intercept | Intercept is biased. |
| 3. Homoscedasticity: the error terms have uniform variance. | Heteroskedasticity | Standard errors are biased. |
| 4. No serial correlation: observations are uncorrelated. | Errors in variables, autoregression, simultaneous equations | Standard errors are biased. |
| 5. Variables are at least interval-level data. | Dummy (categorical) dependent variables | Coefficients are biased. |
| 6. No exact linear relationship between independent variables (no perfect multicollinearity), and more observations than independent variables. | Perfect multicollinearity | Standard errors are inflated. |

Other important assumptions:

| Assumption | Notes | Effect |
| --- | --- | --- |
| No outliers | This is an issue when examining data with rare events. The question becomes: is this a rare event that must be modeled, or an atypical case? | Can lead to biased estimates. |
| No measurement error | Some level of measurement error will always exist. | Coefficients are biased. |

## Testing if You Are BLUE

**Testing for BLUE**

| Assumption | Test | Possible Solution |
| --- | --- | --- |
| 1. Linearity: the relationship between the dependent and independent variables is linear. | Plot residuals to check for non-linearity. | Transform variables. |
| 2. The error term has mean zero and comes from a normal distribution. | Plot residuals. | Use GLM. |
| 3. Homoscedasticity: the error terms have uniform variance. | Breusch-Pagan and White tests for heteroskedasticity. | Use GLM. |
| 4. No serial correlation: observations are uncorrelated. | Durbin-Watson test for autocorrelation. | Include lagged regressors. |
| 5. Variables are at least interval-level data. | Understand how the dependent variable is constructed. | Use logistic regression. |
| 6. No exact linear relationship between independent variables, and more observations than independent variables. | Run correlations among the independent variables. | Exclude collinear variables, or use two-stage least squares for strongly correlated data. |
| No outliers | Cook's distance; plot residuals. | Remove outliers. |
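Two of the tests in the table can be computed by hand in a few lines, as a sketch. The Durbin-Watson statistic is near 2 when residuals are serially uncorrelated; the Breusch-Pagan LM statistic (n times the R-squared from regressing squared residuals on the regressors) is large under heteroskedasticity. The data below are synthetic, with variance deliberately growing in x:

```python
import numpy as np

def ols_residuals(X, y):
    X = np.column_stack([np.ones(len(y)), X])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    return y - X @ beta

def durbin_watson(resid):
    # Near 2 for uncorrelated residuals; toward 0 under positive autocorrelation.
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

def breusch_pagan_lm(X, resid):
    # LM = n * R^2 from the auxiliary regression of squared residuals on X.
    u2 = resid ** 2
    aux_resid = ols_residuals(X, u2)
    r2 = 1 - np.sum(aux_resid ** 2) / np.sum((u2 - u2.mean()) ** 2)
    return len(resid) * r2

rng = np.random.default_rng(3)
x = rng.uniform(0, 2, 500)
y = 1.0 + 2.0 * x + rng.normal(0, 1, 500) * (1 + x)   # error variance grows with x
resid = ols_residuals(x, y)
print(durbin_watson(resid), breusch_pagan_lm(x, resid))
```

Here Durbin-Watson stays near 2 (no serial correlation was built in), while the Breusch-Pagan statistic is far above the chi-squared critical value, flagging the heteroskedasticity.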

## Generalized Linear Models

Generalized Linear Models (GLM or GLZ) are growing in popularity as an alternative to OLS for predictive and explanatory models. They have less restrictive assumptions about the distribution of the dependent variable than an OLS model, which allows them to model a greater variety of real-world problems. Generalized Linear Models are not to be confused with General Linear Models (also GLM), which assume normality and whose family includes OLS and ANOVA.

Often the assumption of normality for the dependent variable is too restrictive to model real-world problems. A classic example is a binary choice model, which estimates the probability of an event happening, like whether a person will renew their subscription to a magazine. To model this type of problem you would use a GLZ with a logit link function, also known as a logistic regression, which was discussed here: Logistic Models
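A minimal logistic regression for the subscription-renewal example can be fit with iteratively reweighted least squares (the standard GLM fitting algorithm). The data and coefficients below are invented; "tenure" is a hypothetical predictor:

```python
import numpy as np

# Minimal binary-choice (logistic) sketch: renew = 1, lapse = 0.
rng = np.random.default_rng(4)
n = 1000
tenure = rng.uniform(0, 10, n)                        # hypothetical years subscribed
p_true = 1 / (1 + np.exp(-(-2.0 + 0.5 * tenure)))     # true renewal probability
renew = (rng.uniform(size=n) < p_true).astype(float)

X = np.column_stack([np.ones(n), tenure])
beta = np.zeros(2)
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))
    W = p * (1 - p)
    # Newton/IRLS step: solve (X' W X) delta = X' (y - p)
    beta += np.linalg.solve(X.T @ (X * W[:, None]), X.T @ (renew - p))
print(beta)   # roughly recovers (-2.0, 0.5)
```

The fitted coefficients live on the log-odds scale: each additional year of tenure multiplies the odds of renewal by about e^0.5.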

But GLZs are a useful alternative even when the dependent variable is continuous, for example with insurance claims experience. The severity (dollar amount) of claims is a continuous variable; however, the distribution that generates claims experience is not normal. Most policies never report a claim, so the bulk of the data sits at zero. To make matters worse, the tail of claims experience can be very long: a rare event, like a hurricane over a major metropolitan area, can produce extreme values ranging into the billions. Below is a graphical example of such a claims loss experience.

GLZs allow for modeling non-normality by using link functions. A link function transforms the dependent variable and models how it is related to the independent variables. A GLZ with an identity link function will yield identical results to a standard OLS model. An inverse link function would model exponential processes such as acceleration due to gravity. In the claims example, you could use a Weibull link function.
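A non-identity link can be sketched with a Poisson model under a log link, again fit by iteratively reweighted least squares. The claim-count setting and coefficients below are invented for illustration; the point is that the link makes log(mu), not mu itself, linear in the predictor:

```python
import numpy as np

# GLZ sketch: Poisson counts with a log link, fit by IRLS.
rng = np.random.default_rng(5)
n = 2000
x = rng.uniform(0, 3, n)
mu_true = np.exp(0.5 + 0.3 * x)     # log link: log(mu) is linear in x
y = rng.poisson(mu_true)

X = np.column_stack([np.ones(n), x])
beta = np.zeros(2)
for _ in range(25):
    eta = X @ beta
    mu = np.exp(eta)
    z = eta + (y - mu) / mu          # working response on the link scale
    W = mu                           # Poisson variance equals the mean
    beta = np.linalg.solve(X.T @ (X * W[:, None]), X.T @ (W * z))
print(beta)   # roughly recovers (0.5, 0.3)
```

Swapping the link and variance functions in this loop is all it takes to move between members of the GLZ family; an identity link with constant variance reduces the same loop to ordinary least squares.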