sqlskillport & Machine Learning: Simple Linear Regression with R & Python

Simple Linear Regression models the relationship between a dependent variable and an independent variable, both continuous and quantitative. It is called Simple because only one independent (explanatory, or predictor) variable is used to predict the dependent variable.

The Formula for Simple Linear Regression is : Y = β0 + β1 * X + ε, where ε is the random error term.

Here Y is the dependent variable. Example: when modelling how Salary changes with the number of years of experience, Y is the Salary.

X is the independent variable (years of experience): the variable whose change is used to explain the change in the dependent variable, at a rate given by its coefficient.

It is still possible that X does not directly cause Y to change and that the two are only associated; the regression describes the association between the dependent and independent variable either way.

β1 is the coefficient of X (the slope of the line). It gives the change in Y for a one-unit change in X, so the change in Y is proportional to the change in X rather than one-to-one in units.

β0 is a constant, the intercept of the regression line. It is the value of Y when X = 0: the β1 * X term vanishes and Y = β0 (ignoring the error term).
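To make β0 and β1 concrete, here is a minimal sketch of the usual least-squares estimates computed by hand. The function name and the toy data are assumptions made for illustration; they are not taken from the Salary dataset used later.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (beta0, beta1) that minimise the sum of squared errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of X and Y divided by the variation of X.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta1 = sxy / sxx
    # Intercept: the fitted line passes through (mean_x, mean_y).
    beta0 = mean_y - beta1 * mean_x
    return beta0, beta1


# Hypothetical, noise-free data generated from Salary = 30000 + 8000 * years.
years = [1, 2, 3, 4, 5]
salary = [30000 + 8000 * x for x in years]

beta0, beta1 = fit_simple_linear_regression(years, salary)
print(beta0, beta1)
```

Because the toy data has no noise, the estimates recover the intercept and slope used to generate it. With real data the error term keeps the points off the line and the estimates are only approximate.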

Let us look at this in the graph for plotting Salary vs Yrs.of.experience.

For any prediction using classic Machine Learning or Neural Network techniques, there are 5 major steps to follow [ Fundamentals of Machine Learning ]:

1. Finding the data based on the problem statement
   - Find appropriate data for correct analysis.

2. Exploratory data analysis
   - Understand the data spread and outliers.
   - Visualize the data.
   - Prepare your data.
   - Use feature engineering on the data as necessary so that suitable algorithms can be applied to train the model.

3. Fit / Train the model
   - Use an appropriate algorithm to train / fit the model.
   - Develop a model that does better than a baseline.
   - Regularize the model and tune the hyperparameters.

4. Apply prediction using the trained model
   - Use validation & test data for testing the model.

5. Testing & Data Visualization
   - Validate results / accuracy using test data sets.
   - Use data visualization techniques to visualize the results.
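The idea of a model that does better than a baseline can be sketched in a few lines. This is only an illustration under stated assumptions: the numbers are made-up toy values, and the baseline chosen here (always predicting the mean training salary) is one common choice, not one the steps prescribe.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical toy data (years of experience, salary), already split.
train_x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
train_y = [39000.0, 47500.0, 55000.0, 62500.0, 71000.0, 79000.0]
test_x = [2.5, 4.5, 7.0]
test_y = [51000.0, 67000.0, 86500.0]

# Baseline: ignore X and always predict the mean salary seen in training.
mean_y = sum(train_y) / len(train_y)
baseline_mse = mean_squared_error(test_y, [mean_y] * len(test_y))

# Model: least-squares line fitted on the training data only.
mean_x = sum(train_x) / len(train_x)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(train_x, train_y))
         / sum((x - mean_x) ** 2 for x in train_x))
intercept = mean_y - slope * mean_x
model_mse = mean_squared_error(test_y, [intercept + slope * x for x in test_x])

print("baseline MSE:", baseline_mse)
print("model MSE:   ", model_mse)
print("model beats baseline:", model_mse < baseline_mse)
```

Both errors are measured on the test data, which neither the baseline nor the model saw during fitting.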

We will follow these steps where applicable, depending on how clean the data is.

We will start by predicting Salary from the number of years of experience in the given data.

You can find the data file here

You can also find the entire code in this Github repo

Predicting Salary with Simple Linear Regression using R :-

# Import the dataset
dataset = read.csv("Salary.csv")

# install.packages('caTools')
library(caTools)
set.seed(123)

# Split the dataset into training and test sets
fragment = sample.split(dataset$Salary, SplitRatio = 2/3, group = NULL)
trainingset = subset(dataset, fragment == TRUE)
testset = subset(dataset, fragment == FALSE)

# Fit the simple linear model to the training set
LinearRegressor = lm(formula = Salary ~ YearsOfExperience,
                     data = trainingset)

summary(LinearRegressor)

<<<<Output of Summary >>>>

# The above command shows a summary of the fitted model.

Residuals:
    Min      1Q  Median      3Q     Max
-9279.5 -4339.5  -648.5  3620.7  9748.9

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)        31792.0     1935.7   16.42   <2e-16 ***
YearsOfExperience   7739.4      161.9   47.79   <2e-16 ***

---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5663 on 31 degrees of freedom
Multiple R-squared: 0.9866, Adjusted R-squared: 0.9862
F-statistic: 2284 on 1 and 31 DF, p-value: < 2.2e-16

Notes on the Coefficients table:
(Intercept) is β0, the intercept (not the slope), i.e., when X = 0, Y = β0.
YearsOfExperience is β1, the slope. The *s denote the significance of the variable for the response variable.
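Several numbers in this summary follow from one another, which gives a quick sanity check on how to read it. The sketch below recomputes them in Python from the rounded values printed above, so small differences in the last digit are expected. The variable names are illustrative only.

```python
# Values copied from the summary output above (already rounded by R).
intercept_est, intercept_se = 31792.0, 1935.7
slope_est, slope_se = 7739.4, 161.9
r_squared = 0.9866
residual_df = 31      # "on 31 degrees of freedom"
n_predictors = 1      # YearsOfExperience only

# t value = Estimate / Std. Error
t_intercept = intercept_est / intercept_se
t_slope = slope_est / slope_se

# Observations = residual degrees of freedom + estimated coefficients
# (one slope plus the intercept).
n_obs = residual_df + n_predictors + 1

# Adjusted R-squared penalises R-squared for the number of predictors.
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / residual_df

# With a single predictor, the F-statistic is the square of the slope's t value.
f_stat = t_slope ** 2

print("t values:", round(t_intercept, 2), round(t_slope, 2))
print("observations in the training set:", n_obs)
print("adjusted R-squared:", round(adj_r_squared, 4))
print("F-statistic (approx.):", round(f_stat))
```

The recomputed values land close to the t values, Adjusted R-squared and F-statistic that R reports. The remaining gaps come from starting with rounded inputs.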

# Predict the test set results
Predictor = predict(LinearRegressor, newdata = testset)

# Plot the test set results using ggplot2
# install.packages('ggplot2')
library(ggplot2)

ggplot() +
  geom_point(aes(x = testset$YearsOfExperience, y = testset$Salary),
             colour = 'blue') +
  geom_line(aes(x = testset$YearsOfExperience, y = Predictor),
            colour = 'red') +
  ggtitle('Salary vs Experience') +
  xlab('YearsOfExperience') +
  ylab('Salary')

With the graph plotted, we can see the difference between the actual Salary (blue points) and the predicted Salary (red line).

The Multiple R-squared of 0.9866 and Adjusted R-squared of 0.9862 from the summary indicate how well the model fits: it explains about 98.7% of the variation in Salary in the training set.
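For reference, R-squared can be computed directly from actual and predicted values as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses made-up salaries, not the values from Salary.csv. Note that the R-squared in the R summary is computed on the training set, while the plot shows the test set.

```python
def r_squared(actual, predicted):
    """R-squared = 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    # Residual sum of squares: what the model fails to explain.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Total sum of squares: variation of the actual values around their mean.
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical actual and predicted salaries, for illustration only.
actual = [40000.0, 50000.0, 60000.0, 70000.0]
predicted = [41000.0, 49000.0, 61000.0, 69000.0]

# Residuals are the vertical gaps between the points and the line in the plot.
residuals = [a - p for a, p in zip(actual, predicted)]
score = r_squared(actual, predicted)

print(residuals)
print(score)
```

A value close to 1 means the residuals are small compared with the overall spread of the salaries.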

Predicting Salary with Simple Linear Regression using Python :-

To run the code below in Python, you can use any interface such as Jupyter Notebook, PyCharm, Spyder, or a plain editor like Notepad++.

I used Notepad++ and executed the code through the IPython console. The same steps are applied in both R and Python to predict the response variable.

# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Import the data
dataset = pd.read_csv('Salary.csv')

# Validate the dataset
dataset.describe()

# Split into independent (X) and dependent (Y) variables
X = dataset.iloc[:, :-1].values
Y = dataset.iloc[:, 1].values
# X = dataset.YearsOfExperience
# Y = dataset.Salary

# No manipulation or imputation of data is needed as this is a simple and clean dataset.

# Split the dataset into train and test data.
# train_test_split lives in sklearn.model_selection; the older
# sklearn.cross_validation module has been removed from scikit-learn.
from sklearn.model_selection import train_test_split
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, test_size = 0.3, random_state = 1)

# Fit the Simple Linear Regression model
from sklearn.linear_model import LinearRegression
LinearRegressor = LinearRegression()
LinearRegressor.fit(Xtrain, Ytrain)

# Predict the test set results
YPredictor = LinearRegressor.predict(Xtest)

# Evaluate the intercept
LinearRegressor.intercept_

# Plot the graph for the predicted test set results
plt.scatter(Xtest, Ytest, color = 'blue')
plt.plot(Xtest, YPredictor, color = 'red')
plt.title('Experience vs Salary')
plt.xlabel('Number of Years of Experience')
plt.ylabel('Salary')
plt.show()

# Predict for any random value
YPred = LinearRegressor.predict([[16.5]])
print(np.float32(YPred))

# Calculate the R squared value for accuracy
from sklearn.metrics import r2_score
print("r2_score", np.float32(r2_score(Ytest, YPredictor) * 100))
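A prediction for a single value, such as 16.5 years, is simply the intercept plus the slope times that value. The sketch below shows the arithmetic using the coefficients from the R summary earlier as stand-ins. This is an assumption for illustration: the Python model is trained on a different split, so its own `intercept_` and `coef_` will differ somewhat. The function name is hypothetical.

```python
# Stand-in coefficients taken from the R summary output; the scikit-learn
# model above reports its own values in LinearRegressor.intercept_ and
# LinearRegressor.coef_.
intercept = 31792.0   # β0
slope = 7739.4        # β1


def predict_salary(years_of_experience):
    """Apply Y = β0 + β1 * X for a single value of X."""
    return intercept + slope * years_of_experience


predicted = predict_salary(16.5)
print(round(predicted, 1))
```

At 0 years of experience the function returns the intercept, which matches the reading of β0 given at the start of the post.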

In the next blog, we will talk about the Assumptions of Linear Regression and Hypothesis testing.


Thursday, August 8, 2019
