In the previous blog we learnt how to use Linear Regression to predict a response variable from only one predictor ( independent variable ). As there was only one response ( dependent ) variable and one independent variable, it is termed Simple Linear Regression.
To keep the mathematical and computational techniques used for prediction simple, we apply certain generic assumptions to the dataset for this regression :-
- Linearity - The independent and dependent variables must have a linear relationship with each other. This can be checked by plotting them against each other on a scatter plot.
- Homoscedasticity - The variance of the errors ( residuals ) is constant across all values of the independent variable. A scatter plot of the residuals can show whether the data are homoscedastic or not.
- Lack of Multicollinearity - Multicollinearity occurs when two or more independent variables are highly correlated with each other. This doesn't occur here as there's only one independent variable in the dataset.
- Multivariate Normality - The errors ( residuals ) should be approximately normally distributed, clustering around a mean value. This can be checked by plotting a histogram, which shows whether the data look normal or are skewed to the left or right.
- Independence of Errors - Linear regression analysis requires that the residuals are independent of each other. Autocorrelation occurs when the residuals are not independent of each other. This can be checked with a hypothesis test, where the Null Hypothesis is that there is no autocorrelation.
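As a rough illustration of how the residual-based assumptions above could be checked numerically, here is a minimal Python sketch. The data are synthetic and made up purely for illustration, and the helper names fit_line and durbin_watson are hypothetical. The Durbin-Watson statistic is just one common check for autocorrelation and is not otherwise covered in this post; values near 2 suggest little first-order autocorrelation.

```python
import random


def fit_line(x, y):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def durbin_watson(residuals):
    """Durbin-Watson statistic: ranges from 0 to 4, about 2 when errors are independent."""
    num = sum((residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals)))
    den = sum(r ** 2 for r in residuals)
    return num / den


# Synthetic data (an assumption for illustration): y = 2x + 1 plus independent noise.
rng = random.Random(0)
x = [float(i) for i in range(50)]
y = [2.0 * xi + 1.0 + rng.gauss(0.0, 3.0) for xi in x]

slope, intercept = fit_line(x, y)
residuals = [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]
dw = durbin_watson(residuals)

print("slope:", round(slope, 3), "intercept:", round(intercept, 3))
print("Durbin-Watson:", round(dw, 3))
```

Plotting the residuals against x ( for homoscedasticity ) and as a histogram ( for normality ) would complete the checks described in the list.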
Hypothesis Testing :
A Hypothesis is a statement or an assumption about the relationship between variables.
The criteria for constructing a Hypothesis are that,
- It should be precise and not contradictory.
- It should be testable, so that it can be shown to be right or wrong.
- It should specify the variables between which the relationship is established.
Types of Hypothesis :
- Null Hypothesis ( H0 ) - The Null Hypothesis claims that there's no difference in the population of data.
- Alternative Hypothesis ( H1 or Ha ) - The Alternative Hypothesis claims that H0 is false.
Hypothesis testing is also called Significance testing.
The major purpose of hypothesis testing is to choose between two competing hypotheses about the value of a population parameter. For example, one hypothesis might say that the salaries of men and women at every level are equal, while the alternative would claim that this is false.
The Null Hypothesis is usually referred to with the symbol H0. The other hypothesis, which is accepted when the null hypothesis is rejected, is referred to as the alternative hypothesis ( H1 ).
Generally, for the sake of convenience, we state the Null Hypothesis as an equality for the population parameter, so that, when it is rejected, the alternative will be either less than or more than the value associated with the null hypothesis.
Ideally, the true value of the population parameter should be included either in the set specified by H0 or in the set specified by H1.
There's a very famous illustrative example of "Body weight", where men in the 20 - 29 age group in the US had a mean body weight of 170 pounds, with a standard deviation σ = 40 pounds.
So, here we test the
- Null hypothesis H0 : µ = 170 pounds
- Alternative hypothesis H1 : µ ≠ 170 pounds ( that is, µ > 170 pounds or µ < 170 pounds ).
Let's say we take multiple random samples of data to test the Null hypothesis, each a random sample of size n = 64.
Based on the Sampling Distribution of a Mean ( SDM ) : Xbar ~ N( µ, SE ), where SE = σ / √n is the standard error of the mean.
Applying this to calculate the Zstat :- Zstat = ( Xbar - µ0 ) / SE ( where µ0 = population mean under H0 )
Test Statistic Zstat = ( Xbar - µ0 ) / ( σ / √n )
Now, we will take a few samples from the population, where the sample means Xbar are ( first sample mean = 172, second sample mean = 183, third sample mean = 164 )
Substituting the values of µ0 = 170, σ = 40, and SE = 40 / √64 = 5 for the various samples, we will get the Zstat scores as follows
- Test Statistic Zstat1 = ( 172 - 170 ) / 5 = 0.4
- Test Statistic Zstat2 = ( 183 - 170 ) / 5 = 2.6
- Test Statistic Zstat3 = ( 164 - 170 ) / 5 = -1.2
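The arithmetic above can be reproduced with a few lines of Python. This is only a small sketch; the function names standard_error and z_stat are hypothetical and chosen for illustration.

```python
import math


def standard_error(sigma, n):
    """Standard error of the mean: SE = sigma / sqrt(n)."""
    return sigma / math.sqrt(n)


def z_stat(sample_mean, mu0, sigma, n):
    """Test statistic: Zstat = (Xbar - mu0) / SE."""
    return (sample_mean - mu0) / standard_error(sigma, n)


# Values from the body weight example in the text.
mu0, sigma, n = 170.0, 40.0, 64

for xbar in (172.0, 183.0, 164.0):
    print("Xbar =", xbar, "Zstat =", round(z_stat(xbar, mu0, sigma, n), 2))
```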
Now we calculate the P-Value, which is the AUC ( Area Under Curve ) beyond the test statistic found above.
P-Value :- is the probability of obtaining a sample at least as extreme as the observed data, assuming that H0 ( Null hypothesis ) is true.
P-Value is a calculation that we make during
the hypothesis testing to determine if we reject the null hypothesis or fail to
reject it.
How to find P-Value : P-Value is calculated by first determining
the Zstat score. Then, we find the
probability in a normal distribution table and then interpret the results by
comparing the p-value to the level of significance.
A standard Normal distribution table for finding the P-Value is printed at the back of most statistics text books and is also easily available on the internet.
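Calculating the P-Value :- The table gives the cumulative probability to the left of each Zstat
Table value for Zstat1 = 0.6554
Table value for Zstat2 = 0.9953
Table value for Zstat3 = 0.1151
The P-Value corresponds to the AUC ( Area Under Curve ) in the tails of the standard Normal Distribution beyond the Zstat. As H1 here is two-sided ( µ > 170 or µ < 170 ), we take the smaller of the table value and 1 minus the table value, and double it
P-Value of Zstat1 = 2 x ( 1 - 0.6554 ) = 0.6892
P-Value of Zstat2 = 2 x ( 1 - 0.9953 ) = 0.0094
P-Value of Zstat3 = 2 x 0.1151 = 0.2302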
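Instead of a printed table, the same areas can be computed directly. The sketch below uses only Python's standard math module; the function names normal_cdf and p_value and the tail argument are hypothetical and chosen for illustration. A table lookup gives the cumulative area to the left of the Zstat, while the P-Value is the tail area beyond it, so results may differ from table-based figures in the last digit because of rounding.

```python
import math


def normal_cdf(z):
    """Cumulative area to the left of z under the standard Normal curve."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def p_value(z, tail="two-sided"):
    """Tail area beyond z. tail is 'left', 'right' or 'two-sided'."""
    if tail == "left":
        return normal_cdf(z)
    if tail == "right":
        return 1.0 - normal_cdf(z)
    if tail == "two-sided":
        return 2.0 * (1.0 - normal_cdf(abs(z)))
    raise ValueError("tail must be 'left', 'right' or 'two-sided'")


for z in (0.4, 2.6, -1.2):
    print("Zstat =", z,
          "table value =", round(normal_cdf(z), 4),
          "two-sided P-Value =", round(p_value(z), 4))
```

The two-sided version matches an alternative hypothesis of the form µ > 170 or µ < 170; a one-sided alternative would use the 'left' or 'right' tail instead.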
As mentioned above, the P-Value is the probability of observing a test statistic at least as extreme as the one obtained, when the Null Hypothesis H0 is true.
Hence, smaller P-Values provide stronger evidence against H0.
In this scenario, suppose we accept a 5% probability of erroneously rejecting H0. In other words, we set the threshold ( level of significance ) Ɛ = 5% = 0.05. The decision rule is then :
- Reject H0 if P <= Ɛ
- Retain H0 if P > Ɛ
Example:-
If Ɛ = 0.05 and P = 0.655 => Retain H0
SUMMARIZING THE EVENTS FOR HYPOTHESIS TESTING :-
- State H0 and H1
- Specify the level of significance ( Ɛ )
- Decide the appropriate sampling distribution
- Calculate the Zstat ( test statistic) using the formula mentioned above.
- Look up the P-Value ( probability ) for the Zstat ( test statistic )
- If the P-Value is smaller than or equal to the Ɛ value - Reject H0, else Retain H0
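The steps above can be strung together in one short Python sketch. It assumes a two-sided alternative ( µ ≠ µ0 ) and a known population standard deviation; the function name z_test and its alpha parameter ( standing in for Ɛ ) are hypothetical and chosen for illustration.

```python
import math


def z_test(sample_mean, mu0, sigma, n, alpha=0.05):
    """Two-sided one-sample z-test; returns (Zstat, P-Value, decision)."""
    se = sigma / math.sqrt(n)                    # standard error of the mean
    z = (sample_mean - mu0) / se                 # test statistic
    cdf = 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0)))
    p = 2.0 * (1.0 - cdf)                        # area in both tails beyond |Zstat|
    decision = "Reject H0" if p <= alpha else "Retain H0"
    return z, p, decision


# Body weight example from the text: mu0 = 170, sigma = 40, n = 64.
for xbar in (172.0, 183.0, 164.0):
    z, p, decision = z_test(xbar, mu0=170.0, sigma=40.0, n=64)
    print(f"Xbar={xbar:.0f}  Zstat={z:+.2f}  P={p:.4f}  {decision}")
```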