Simple Linear Regression is a regression technique for finding the relationship between a dependent and an independent variable, both continuous and quantitative. It is called Simple because only one independent (explanatory, or predictor) variable is used to predict the dependent variable.
The formula for Simple Linear Regression is: Y = β0 + β1 * X + Ɛ
Here Y is the dependent variable and X is the independent variable. For example, to see how the Salary changes with the No.Of.YearsOfExperience, Y is the Salary and X is the years of experience (Yrs.of.experience). Ɛ is the error term, the part of Y that the straight line does not explain.
X is the variable whose change drives the change in the dependent variable: Y changes at a certain rate for each unit change in X.
It is still possible that X does not directly cause Y, and that the dependent and independent variables are only associated with each other.
β1 is the coefficient of X, i.e., the slope of the line. It shows how much Y changes for a unit change in X, which implies that the change in Y is proportional to the change in X and not a one-to-one change in units.
β0 is the constant, or intercept, of the line. It is the value of Y when X is 0: when X = 0, the term β1 * X is 0 and hence Y = β0 (ignoring the error term).
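To make β0 and β1 concrete, here is a minimal sketch of how the two values can be estimated with the usual least-squares formulas. The five data points are made up for illustration and are not taken from Salary.csv, and the helper name fit_simple_linear is my own.

```python
import numpy as np

def fit_simple_linear(x, y):
    """Return (beta0, beta1) for Y = beta0 + beta1 * X using least squares."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # Slope: co-variation of X and Y divided by the variation of X
    beta1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # Intercept: the fitted line passes through the point (x_mean, y_mean)
    beta0 = y_mean - beta1 * x_mean
    return beta0, beta1

# Hypothetical data: years of experience and salary (illustration only)
years = [1.0, 2.0, 3.0, 4.0, 5.0]
salary = [38000.0, 47000.0, 53000.0, 62000.0, 70000.0]

beta0, beta1 = fit_simple_linear(years, salary)
print("Intercept (beta0):", beta0)
print("Slope (beta1):", beta1)
```

For this toy data the slope says how much the salary rises per extra year, and the intercept is the salary the line predicts at zero years of experience.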
Let us look at this in the graph of Salary vs Yrs.of.experience.
For any prediction using classic Machine Learning / Neural Network techniques, there are a few major steps to be followed [Fundamentals of Machine Learning]:
- Finding the data based on the problem statement.
  - Finding appropriate data for correct analysis
- Exploratory data analysis
  - Understand the data spread and outliers
  - Visualize the data
- Preparing your data
  - Use feature engineering on the data as necessary for applying suitable algorithms to train the model.
- Fit / Train the model using an appropriate algorithm
  - Use an appropriate algorithm to train/fit the model
  - Develop a model that does better than a baseline
  - Regularize the model and tune the hyperparameters.
- Apply prediction using the trained model.
  - Use validation & test data for testing the model
- Testing & Data Visualization
  - Validate results / accuracy using test data sets
  - Use data visualization techniques to visualize the results.
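One of the steps above is to develop a model that does better than a baseline. A minimal sketch of that check is shown below, assuming a baseline that always predicts the mean salary of the training data. The data is synthetic (generated with made-up numbers), and the straight line is fitted with np.polyfit just to keep the example short.

```python
import numpy as np

# Synthetic data for illustration only: salary grows roughly linearly with experience
rng = np.random.default_rng(42)
years = rng.uniform(1, 20, size=60)
salary = 30000 + 8000 * years + rng.normal(0, 5000, size=60)

# Simple hold-out split: first 40 rows for training, last 20 for testing
train_x, test_x = years[:40], years[40:]
train_y, test_y = salary[:40], salary[40:]

# Baseline: always predict the mean salary seen in training
baseline_pred = np.full_like(test_y, train_y.mean())

# Model: straight line fitted on the training data (polyfit returns slope first)
slope, intercept = np.polyfit(train_x, train_y, deg=1)
model_pred = intercept + slope * test_x

def rmse(actual, predicted):
    """Root mean squared error between actual and predicted values."""
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

baseline_rmse = rmse(test_y, baseline_pred)
model_rmse = rmse(test_y, model_pred)
print("Baseline RMSE:", baseline_rmse)
print("Model RMSE:", model_rmse)
```

If the model's error on the test data is not clearly lower than the baseline's, the chosen feature or algorithm is probably not adding much.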
We will follow these steps, as applicable based on the sanity of the data. We will start with predicting Salary from the given data on the number of years of experience.
Predicting Salary with Simple Linear Regression using R :-
# Import the dataset
dataset = read.csv("Salary.csv")
#install.packages('caTools')
library(caTools)
set.seed(123)
# Splitting the dataset into training and testing datasets
fragment = sample.split(dataset$Salary, SplitRatio = 2/3, group = NULL)
trainingset = subset(dataset, fragment == TRUE)
testset = subset(dataset, fragment == FALSE)
# Fitting the simple linear model to the training set
LinearRegressor = lm(formula = Salary ~ YearsOfExperience,
                     data = trainingset)
summary(LinearRegressor)
<<<< Output of Summary >>>>
# The above command will show the summary of the fitted model.
Residuals:
    Min      1Q  Median      3Q     Max
-9279.5 -4339.5  -648.5  3620.7  9748.9

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)        31792.0     1935.7   16.42   <2e-16 ***
YearsOfExperience   7739.4      161.9   47.79   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5663 on 31 degrees of freedom
Multiple R-squared:  0.9866, Adjusted R-squared:  0.9862
F-statistic:  2284 on 1 and 31 DF,  p-value: < 2.2e-16

# (Intercept) is β0, i.e., when X = 0, Y = β0. The YearsOfExperience estimate is the slope β1.
# The *s denote the significance of the variable on the response variable.
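As a quick way to read the coefficient table, the sketch below plugs the two estimates from the summary output into Y = β0 + β1 * X, and checks that each t value is roughly the estimate divided by its standard error. The numbers are copied from the output above; the 10 years of experience used for the prediction is just an example value.

```python
# Estimates and standard errors copied from the summary output
beta0, beta0_se = 31792.0, 1935.7   # (Intercept)
beta1, beta1_se = 7739.4, 161.9     # YearsOfExperience

def predict_salary(years_of_experience):
    """Predicted salary from the fitted line (the error term is left out)."""
    return beta0 + beta1 * years_of_experience

# Example value chosen for illustration
predicted = predict_salary(10)
print("Predicted salary for 10 years:", predicted)

# t value = estimate / standard error; the printed values are rounded,
# so these ratios only match the table approximately
t_intercept = beta0 / beta0_se
t_slope = beta1 / beta1_se
print("t value (Intercept):", round(t_intercept, 2))
print("t value (YearsOfExperience):", round(t_slope, 2))
```

The large t values are what lead to the very small p-values and the *** significance markers in the table.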
# Predicting the test set results
Predictor = predict(LinearRegressor, newdata = testset)
# Plotting the graph of the test set results using ggplot2
#install.packages('ggplot2')
library(ggplot2)
ggplot() +
  geom_point(aes(x = testset$YearsOfExperience, y = testset$Salary),
             colour = 'blue') +
  geom_line(aes(x = testset$YearsOfExperience, y = Predictor),
            colour = 'red') +
  ggtitle('Salary vs Experience') +
  xlab('YearsOfExperience') +
  ylab('Salary')
With the graph plotted, we can see that there is a difference between the actual Salary (blue points) and the predicted Salary (red line).
The Multiple R-squared (0.9866) and Adjusted R-squared (0.9862) values tell us how well the model fits: about 98.7% of the variation in Salary in the training set is explained by YearsOfExperience.
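The two R-squared values are closely related. As a small sketch, the adjusted value can be recomputed from the multiple R-squared using the standard adjustment formula, assuming 33 observations in the training set (the 31 residual degrees of freedom plus the 2 estimated coefficients) and 1 predictor.

```python
def adjusted_r_squared(r_squared, n_observations, n_predictors):
    """Adjusted R-squared: penalises R-squared for the number of predictors."""
    return 1 - (1 - r_squared) * (n_observations - 1) / (n_observations - n_predictors - 1)

# 31 residual degrees of freedom + 2 estimated coefficients (intercept and slope)
n_observations = 31 + 2
adjusted = adjusted_r_squared(0.9866, n_observations, n_predictors=1)
print(round(adjusted, 4))
```

With only one predictor the penalty is tiny, which is why the two values are almost identical here.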
Predicting Salary with Simple Linear Regression using Python :-
To execute the code below in Python, you can use any interface, such as Jupyter Notebook, PyCharm, Spyder, or a plain editor like Notepad++.
I used Notepad++ and executed the code through the IPython console. The same steps have been applied on both platforms (R and Python) to predict the response variable.
# Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Importing the data
dataset = pd.read_csv('Salary.csv')
# Validate the dataset
dataset.describe()
# Split into dependent and independent variables
X = dataset.iloc[:, :-1].values
Y = dataset.iloc[:, 1].values
#X = dataset.YearsOfExperience
#Y = dataset.Salary
# No manipulation or imputation of data is needed as this is a simple and clean dataset.
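# Splitting the dataset into train and test data.
# (sklearn.cross_validation was removed from scikit-learn; train_test_split now lives in sklearn.model_selection)
from sklearn.model_selection import train_test_split
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, test_size = 0.3, random_state = 1)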
# Fitting the Simple Linear Regression model
from sklearn.linear_model import LinearRegression
LinearRegressor = LinearRegression()
LinearRegressor.fit(Xtrain, Ytrain)
# Predicting the test set results
YPredictor = LinearRegressor.predict(Xtest)
# Evaluating the intercept
LinearRegressor.intercept_
# Plotting the graph for the predicted test set results
plt.scatter(Xtest, Ytest, color = 'blue')
plt.plot(Xtest, YPredictor, color = 'red')
plt.title('Experience vs Salary')
plt.xlabel('Number of Years of Experience')
plt.ylabel('Salary')
plt.show()
# Predicting for any random value
YPred = LinearRegressor.predict([[16.5]])
print(np.float32(YPred))
# Calculating the R squared value for accuracy
from sklearn.metrics import r2_score
print("r2_score", np.float32(r2_score(Ytest, YPredictor) * 100))
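To connect the fitted scikit-learn model back to the formula Y = β0 + β1 * X, the sketch below rebuilds a prediction by hand from intercept_ and coef_ and compares it with predict(). It uses synthetic data generated with made-up numbers instead of Salary.csv, so the printed values will not match the ones from the dataset above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data for illustration only (not Salary.csv)
rng = np.random.default_rng(0)
years = rng.uniform(1, 20, size=40).reshape(-1, 1)   # 2-D, as scikit-learn expects
salary = 30000 + 8000 * years.ravel() + rng.normal(0, 3000, size=40)

model = LinearRegression().fit(years, salary)

# beta0 is intercept_, beta1 is the single entry of coef_
beta0 = model.intercept_
beta1 = model.coef_[0]

# Prediction for 16.5 years, once by hand and once through the library
manual = beta0 + beta1 * 16.5
library = model.predict([[16.5]])[0]
print("By hand:", manual)
print("predict():", library)
```

Both numbers should agree, which shows that predict() is simply applying the fitted intercept and slope.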
In the next blog, we will talk about the Assumptions of Linear Regression and Hypothesis testing.