Wednesday, August 14, 2019

Assumptions of Linear Regression & Hypothesis Testing

In the previous blog we learnt how to use Linear Regression to predict a response ( dependent ) variable with only one predictor ( independent ) variable. As there was only one response and one predictor variable, it was termed Simple Linear Regression.
For the simplicity of implementing mathematical and computational techniques on the dataset for prediction, we make certain standard assumptions about the dataset for this regression :-

  1. Linearity - The independent and dependent variables should have a linear relationship with each other. This relationship can be understood or tested by plotting the variables against each other on a graph.

  2. Homoscedasticity - The variance of the residuals ( errors ) is constant across all values of the independent variable. A scatter plot of the residuals can show whether the data are homoscedastic or not.

  3. Lack of Multicollinearity - Multicollinearity occurs when two or more independent variables are highly correlated with each other. This doesn't occur here, as there's only one independent variable in the dataset.

  4. Multivariate Normality - The residuals ( and, more generally, the variables ) are normally distributed, clustering around a mean value. This assumption can be tested by plotting a histogram, which usually shows whether the data are normal or skewed ( left / right ).

  5. Independence of Errors - Linear regression analysis requires that the residuals be independent of each other. Autocorrelation occurs when the residuals are not independent, and it can be tested with a hypothesis test such as the Durbin-Watson test ( a sketch of checking these assumptions appears just after this list ).
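
Below is a minimal Python sketch, assuming a small synthetic dataset and the numpy / matplotlib / statsmodels libraries, of how these assumptions might be checked in practice; the data and variable names are purely illustrative.

    # A minimal sketch ( hypothetical data ) of checking the assumptions above
    # using numpy, matplotlib and statsmodels.
    import numpy as np
    import matplotlib.pyplot as plt
    import statsmodels.api as sm
    from statsmodels.stats.stattools import durbin_watson

    # Hypothetical dataset : one independent variable x, one dependent variable y
    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, 100)
    y = 3 * x + 5 + rng.normal(0, 2, 100)

    # Fit a simple linear regression and keep the residuals
    model = sm.OLS(y, sm.add_constant(x)).fit()
    residuals = model.resid

    # 1. Linearity : the scatter plot of y against x should look roughly linear
    plt.scatter(x, y)
    plt.title("Linearity check")
    plt.show()

    # 2. Homoscedasticity : residuals vs fitted values should show a constant spread
    plt.scatter(model.fittedvalues, residuals)
    plt.title("Homoscedasticity check")
    plt.show()

    # 3. Multicollinearity is not a concern with a single independent variable.

    # 4. Normality : a histogram of the residuals should look roughly bell shaped
    plt.hist(residuals, bins=20)
    plt.title("Normality of residuals")
    plt.show()

    # 5. Independence of errors : a Durbin-Watson statistic near 2 suggests no autocorrelation
    print("Durbin-Watson :", durbin_watson(residuals))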

Hypothesis Testing :

A hypothesis is a statement or an assumption about the relationship between variables.
The criteria for constructing a hypothesis are that :
  1. It should be precise and not contradictory.
  2. It should be testable, so that it can be shown to be right or wrong.
  3. It should specify the variables between which the relationship is established.

Types of Hypothesis :

  1. Null Hypothesis ( H0 ) - The Null Hypothesis claims that there is no difference in the population of data.
  2. Alternative Hypothesis ( Ha or H1 ) - The Alternative Hypothesis claims that H0 is false.

Hypothesis testing is also called Significance testing.

The major purpose of hypothesis testing is to choose between two competing hypotheses about the value of a population parameter. For example, one hypothesis might state that the salaries of men and women at every level are equal, while the alternative claims that this is false.
The Null Hypothesis is usually referred to with the symbol H0, and the other hypothesis, which is assumed to be true when the null hypothesis is proved false, is referred to as the Alternative Hypothesis.

Generally, for the sake of convenience, we state the Null Hypothesis as an equality for the population parameter, so that, when it is proved false, the alternative will be either less than or greater than the value associated with the null hypothesis.

Ideally, the true value of the population parameter should be included either in the set specified by H0 or in the set specified by H1.

There's a very famous illustrative "body weight" example, in which men aged 20 - 29 years in the US had a mean body weight of 170 pounds with a standard deviation of σ = 40 pounds.
So, here we test the
  1. Null hypothesis H0 : µ = 170 pounds
  2. Alternative hypothesis H1 : µ > 170 pounds or µ < 170 pounds ( i.e. µ ≠ 170 pounds )

Let's say we take multiple random samples of data to validate the Null hypothesis, each a random sample of size n = 64.
Based on the Sampling Distribution of the Mean ( SDM ) :  Xbar ~ N( µ, SE ),  where SE = σ / √n

Applying this, the test statistic is :

Test Statistic Zstat = ( Xbar - µ0 ) / SE = ( Xbar - µ0 ) / ( σ / √n )   ( where µ0 = population mean under H0 )

Now we take a few random samples from the population; suppose the observed sample means Xbar are 172, 183, and 164.
Substituting µ0 = 170 and σ = 40 gives SE = 40 / √64 = 5 for every sample.
We then get the Zstat scores as follows ( a small sketch reproducing these calculations appears just after the list ) :
  1. Test Statistic Zstat1  =  ( 172 - 170 ) / 5  = 0.4  
  2. Test Statistic Zstat2  = ( 183 - 170 ) / 5   = 2.6
  3. Test Statistic Zstat3  = ( 164 - 170 ) / 5 = -1.2
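
These numbers can be reproduced with a few lines of plain Python; the sketch below simply substitutes the values assumed above ( µ0 = 170, σ = 40, n = 64 ) :

    import math

    mu0, sigma, n = 170, 40, 64        # population mean under H0, standard deviation, sample size
    se = sigma / math.sqrt(n)          # standard error = 40 / 8 = 5

    for xbar in (172, 183, 164):       # the three sample means used above
        zstat = (xbar - mu0) / se
        print(f"sample mean = {xbar}, Zstat = {zstat:.1f}")
    # sample mean = 172, Zstat = 0.4
    # sample mean = 183, Zstat = 2.6
    # sample mean = 164, Zstat = -1.2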

Now we calculate the P-Value, i.e. the AUC ( Area Under the Curve ) associated with each of the test statistics found above.

P-Value :- the probability of obtaining a test statistic at least as extreme as the one observed, assuming that H0 ( the Null hypothesis ) is true.

The P-Value is a quantity we calculate during hypothesis testing to decide whether to reject the null hypothesis or fail to reject it.

How to find the P-Value :   The P-Value is calculated by first determining the Zstat score, then finding the corresponding probability in a standard normal distribution table, and finally interpreting the result by comparing the P-Value to the level of significance.
A standard normal distribution table is printed at the back of most statistics textbooks and is also easily available on the internet.

Calculating the p-Value :-

From the standard normal table we first look up the cumulative probability Φ( Zstat ), i.e. the area under the curve to the left of each test statistic :

  1. Φ( Zstat1 = 0.4 )  = 0.6554
  2. Φ( Zstat2 = 2.6 )  = 0.9953
  3. Φ( Zstat3 = -1.2 ) = 0.1151

Since the alternative hypothesis is two-sided ( µ > 170 or µ < 170 ), the P-Value is the AUC ( Area Under the Curve ) in both tails of the standard Normal Distribution beyond the Zstat, i.e. P = 2 × ( 1 - Φ( |Zstat| ) ) :

  1. P-Value of Zstat1 = 2 × ( 1 - 0.6554 ) ≈ 0.689
  2. P-Value of Zstat2 = 2 × ( 1 - 0.9953 ) ≈ 0.009
  3. P-Value of Zstat3 = 2 × 0.1151 ≈ 0.230
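
For reference, the table lookups and the two-sided P-Values above can be reproduced with scipy.stats ( a small sketch, assuming the same three Zstat values ) :

    from scipy.stats import norm

    for z in (0.4, 2.6, -1.2):
        phi = norm.cdf(z)                  # area to the left of z ( the table lookup )
        p_two_sided = 2 * norm.sf(abs(z))  # area in both tails beyond |z|
        print(f"z = {z:5.1f} : Phi(z) = {phi:.4f}, two-sided P-Value = {p_two_sided:.4f}")
    # z =   0.4 : Phi(z) = 0.6554, two-sided P-Value = 0.6892
    # z =   2.6 : Phi(z) = 0.9953, two-sided P-Value = 0.0093
    # z =  -1.2 : Phi(z) = 0.1151, two-sided P-Value = 0.2301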

As mentioned above, the P-Value is the probability of observing a test statistic at least as extreme as the one obtained when the Null Hypothesis H0 is true.
Hence, the smaller the P-Value, the stronger the evidence against H0.

In this scenario, suppose we accept a 5% probability of erroneously rejecting H0; in other words, we set the threshold ( level of significance ) Ɛ = 5% = 0.05 for erroneously rejecting H0.
  1. Reject H0 if P <= Ɛ
  2. Retain H0 if P > Ɛ

Example :-  If Ɛ = 0.05, then for the first sample P = 0.689 > Ɛ  =>  Retain H0, while for the second sample P = 0.009 <= Ɛ  =>  Reject H0.


SUMMARIZING THE STEPS FOR HYPOTHESIS TESTING :-

  1. State H0 and H1
  2. Specify the level of significance ( Ɛ )
  3. Decide the appropriate sampling distribution
  4. Calculate the Zstat ( test statistic)  using the formula mentioned above.
  5. Look up the probability ( P-Value ) of the Zstat ( test statistic )
  6. If the P-Value is smaller than or equal to Ɛ, reject H0; otherwise retain H0 ( a compact sketch of the full procedure follows this list )
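
One possible way to collect these six steps into a single helper, assuming ( as in the body-weight example ) a known population σ and a two-sided alternative, is the small Python sketch below; the function name z_test is purely illustrative :

    from math import sqrt
    from scipy.stats import norm

    def z_test(xbar, mu0, sigma, n, eps=0.05):
        """One-sample z-test with known population standard deviation ( two-sided ) - a sketch."""
        se = sigma / sqrt(n)                 # step 3 : sampling distribution of the mean
        zstat = (xbar - mu0) / se            # step 4 : test statistic
        p_value = 2 * norm.sf(abs(zstat))    # step 5 : two-sided P-Value
        decision = "Reject H0" if p_value <= eps else "Retain H0"   # step 6
        return zstat, p_value, decision

    print(z_test(172, 170, 40, 64))   # (0.4, 0.689..., 'Retain H0')
    print(z_test(183, 170, 40, 64))   # (2.6, 0.009..., 'Reject H0')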

