Thursday, August 8, 2019

Simple Linear Regression with R & Python

Simple Linear Regression is a regression technique for modelling the relationship between a dependent variable and an independent variable, both continuous and quantitative. It is called "simple" because only one independent (explanatory, or predictor) variable is used to predict the dependent variable.

The formula for Simple Linear Regression is: Y = β0 + β1 * X + Ɛ, where Ɛ is the random error term, i.e. the part of Y that the straight line does not explain.

Here Y is the dependent variable. For example, when we look at how Salary changes with the number of years of experience, Y is the Salary.

X is the independent variable (Yrs.of.experience). The model describes how Y changes for each unit change in X.
Keep in mind that a fitted line shows an association between the dependent and independent variables; on its own it does not prove that X directly causes Y to change.

β1 is the coefficient of X, i.e. the slope of the line. It gives the expected change in Y for a one-unit change in X, so Y changes in proportion to X rather than one unit for one unit.

β0 is a constant, the intercept of the line. It is the value of Y when X = 0: the β1 * X term vanishes, so Y = β0 (ignoring the error term).
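
To make the roles of β0 and β1 concrete, here is a minimal Python sketch that estimates both with the standard least-squares formulas (β1 = Sxy / Sxx, β0 = mean(Y) - β1 * mean(X)). The five data points and the function name `fit_simple_linear_regression` are made up for illustration and are not taken from Salary.csv.

```python
# Toy data invented for illustration only; NOT taken from Salary.csv.
years = [1.0, 2.0, 3.0, 4.0, 5.0]
salary = [40000.0, 47000.0, 56000.0, 63000.0, 70000.0]


def fit_simple_linear_regression(x, y):
    """Return (beta0, beta1) that minimize the sum of squared residuals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sxy: how X and Y vary together; Sxx: how X varies on its own.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    beta1 = sxy / sxx                # slope: change in Y per unit of X
    beta0 = mean_y - beta1 * mean_x  # intercept: value of Y at X = 0
    return beta0, beta1


beta0, beta1 = fit_simple_linear_regression(years, salary)
print("intercept:", beta0)
print("slope:", beta1)
```

In this toy example the slope is the salary increase per additional year of experience, and the intercept is the fitted salary at zero years.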

Let us look at this in the graph for plotting Salary vs Yrs.of.experience.

For any prediction using Classic Machine Learning / Neural Network techniques, there are 5 major steps to be followed [ Fundamentals of Machine Learning ]
  1. Finding the data based on problem statement.
    1. Finding appropriate data for correct analysis
  2. Exploratory data analysis
    1. Understand the data spread and outliers
    2. Visualize the data
    3. Preparing your data
    4. Use Feature engineering on the data as necessary for applying suitable algorithms to train the model.
  3. Fit / Train the model - using appropriate algorithm
    1. Use of appropriate algorithm to train/Fit the model
    2. Developing a model that does better than a baseline
    3. Regularizing the model and tuning hyperparameters.
  4. Apply prediction using the trained model.
    1. Use validation & test data for testing the model
  5. Testing & Data Visualization
    1. Validate results / accuracy using test data sets
    2. Use data visualization techniques to visualize the results.
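
As a rough illustration of steps 3 to 5, the sketch below splits a small synthetic dataset, fits a line on the training part, and checks on the test part that the model beats a naive baseline that always predicts the mean training salary. Everything here (the data-generating rule, the noise level, the helper names `fit` and `mse`) is an assumption made for the example, not part of the Salary.csv analysis.

```python
import random

random.seed(0)  # fixed seed so the shuffle and the noise are reproducible

# Synthetic stand-in data (hypothetical rule): salary = 30000 + 8000 * years + noise
data = [(x, 30000 + 8000 * x + random.gauss(0, 3000))
        for x in [i * 0.5 for i in range(1, 41)]]

# Shuffle, then keep roughly 2/3 for training and the rest for testing.
random.shuffle(data)
cut = int(len(data) * 2 / 3)
train, test = data[:cut], data[cut:]


def fit(points):
    """Least-squares intercept and slope for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def mse(points, predict):
    """Mean squared error of a prediction function on (x, y) pairs."""
    return sum((y - predict(x)) ** 2 for x, y in points) / len(points)


intercept, slope = fit(train)
mean_salary = sum(y for _, y in train) / len(train)

# Evaluate both the fitted line and the mean-only baseline on unseen data.
model_mse = mse(test, lambda x: intercept + slope * x)
baseline_mse = mse(test, lambda x: mean_salary)
print("model MSE:", model_mse)
print("baseline MSE:", baseline_mse)
```

Comparing against a baseline like this is a quick sanity check before spending time on regularization or tuning.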

We will follow these steps, as applicable, depending on how clean the data is.

We will start by predicting Salary from the number of years of experience in the given data.

You can find the data file here

You can also find the entire code in this Github repo

Predicting Salary with Simple Linear Regression using R :-

#Import the dataset
dataset = read.csv("Salary.csv")


#Splitting the Dataset into Training and Testing dataset.
#sample.split() comes from the caTools package
library(caTools)
fragment = sample.split(dataset$Salary, SplitRatio = 2/3, group = NULL)
trainingset = subset(dataset, fragment == TRUE)
testset = subset(dataset, fragment == FALSE)

#Fitting simple Linear model to training set
LinearRegressor = lm(formula = Salary ~ YearsOfExperience,
                     data = trainingset)

#Summarize the fitted model
summary(LinearRegressor)

<<<<Output of Summary >>>>
#Calling summary(LinearRegressor) shows the following summary of the fitted model.
Residuals:
    Min      1Q  Median      3Q     Max
-9279.5 -4339.5  -648.5  3620.7  9748.9

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)        31792.0     1935.7   16.42   <2e-16 ***    <= the intercept β0, i.e. when X = 0, Y = β0
YearsOfExperience   7739.4      161.9   47.79   <2e-16 ***    <= the slope β1; the *s denote the significance of the variable for the response variable

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5663 on 31 degrees of freedom
Multiple R-squared:  0.9866,        Adjusted R-squared:  0.9862
F-statistic:  2284 on 1 and 31 DF,  p-value: < 2.2e-16

#Predicting the Test Set Results
Predictor = predict(LinearRegressor, newdata = testset)

# Plotting the graph on test set results using ggplot2
library(ggplot2)

ggplot() +
  geom_point(aes(x = testset$YearsOfExperience, y = testset$Salary),
             colour = 'blue') +
  geom_line(aes(x = testset$YearsOfExperience, y = Predictor),
            colour = 'red') +
  ggtitle('Salary vs Experience') +
  xlab('YearsOfExperience') +
  ylab('Salary')

With the graph plotted, we can see that the actual salaries (blue points) differ from the predicted salaries (red line).
The Multiple R-squared of 0.9866 and the Adjusted R-squared of 0.9862 indicate that the model explains about 98.7% of the variance in Salary on the training set, which gives us a measure of how well the line fits.
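
As a small cross-check of the summary output, the Python sketch below recomputes the Adjusted R-squared from the Multiple R-squared using the usual formula 1 - (1 - R²)(n - 1) / (n - p - 1), and also uses the fact that, with a single predictor, the F-statistic is the square of the slope's t value. The inputs are the rounded figures printed in the summary, so this is only an approximate check; the variable names are my own.

```python
# Figures copied from the printed summary (rounded values).
r_squared = 0.9866
residual_df = 31   # "on 31 degrees of freedom"
n_predictors = 1   # only YearsOfExperience
slope_t_value = 47.79

# Residual degrees of freedom = n - p - 1, so n can be recovered from it.
n_obs = residual_df + n_predictors + 1

# Adjusted R-squared penalizes R-squared for the number of predictors.
adjusted_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / residual_df

# With one predictor, the F-statistic equals the squared t value of the slope.
f_from_t = slope_t_value ** 2

print("observations in training set:", n_obs)
print("adjusted R-squared:", round(adjusted_r_squared, 4))
print("F-statistic from t value:", round(f_from_t))
```

Both recomputed values agree with the Adjusted R-squared and F-statistic lines of the summary after rounding.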

Predicting Salary with Simple Linear Regression using Python :-

To execute the code below in Python, you can use any interface, such as Jupyter Notebook, PyCharm, Spyder, or a plain editor like Notepad++.

I used Notepad++ and ran the code through the IPython console. The same steps are applied in both R and Python to predict the response variable.

#Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

#Importing the data
dataset = pd.read_csv('Salary.csv')

# Validate the dataset by looking at the first few rows
print(dataset.head())

# Split into Dependent and Independent variables
X = dataset.iloc[:,:-1].values
Y = dataset.iloc[:, 1].values

#X = dataset.YearsOfExperience
#Y = dataset.Salary

# No Manipulation or Imputation of data is needed as this is a Simple and clean dataset.
# Splitting the dataset to Train and Test data.
# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed)
from sklearn.model_selection import train_test_split
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X,Y, test_size = 0.3, random_state = 1)

#Fitting the Simple Linear Regression Model
from sklearn.linear_model import LinearRegression
LinearRegressor = LinearRegression()
LinearRegressor.fit(Xtrain, Ytrain)

# Predicting the Test Set Results
YPredictor = LinearRegressor.predict(Xtest)

# Evaluating the Intercept (β0) and the Coefficient (β1)
print(LinearRegressor.intercept_)
print(LinearRegressor.coef_)

#Plotting the Graph for Predicted Test Set Results
plt.scatter(Xtest, Ytest, color = 'blue')
plt.plot(Xtest, YPredictor, color = 'red')
plt.title('Experience vs Salary')
plt.xlabel('Number of Years of Experience')
plt.ylabel('Salary')
plt.show()

#Predicting for any random value
YPred = LinearRegressor.predict([[16.5]])

#Calculating the R Squared value for accuracy
from sklearn.metrics import r2_score
print(r2_score(Ytest, YPredictor))
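
For readers who want to see what the R-squared value measures, here is a minimal pure-Python sketch of the same quantity, 1 - SS_res / SS_tot. The `actual` and `predicted` lists are invented numbers used only to show the arithmetic; they are not results from the Salary model, and `r_squared` is a helper name I chose for the example.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    # Residual sum of squares: error left after using the predictions.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Invented values for illustration only.
actual = [40000.0, 50000.0, 60000.0, 70000.0]
predicted = [42000.0, 49000.0, 61000.0, 68000.0]

score = r_squared(actual, predicted)
print("R-squared:", score)
```

A value close to 1 means the predictions leave little of the variation in the actual values unexplained; a perfect prediction gives exactly 1.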

In the next blog, we will talk about the Assumptions of Linear Regression and Hypothesis testing.

