Project author: manoharpavuluri

Project description:
Salary prediction based on Linear Regression & Gradient Boosting Regressor
Main language: Jupyter Notebook
Project URL: git://github.com/manoharpavuluri/salary-prediction--LNR-GBR.git


Salary Prediction using Linear Regression and Gradient Boosting Regressor

Problem

Predict salary based on multiple features.

Data

What we have

  • There are 2 files: a train file and a test file.
  • The train file has 100k observations with 7 features.
  • 4 columns are categorical and 2 are numerical.

A sample of the data is shown below.
[Image: sample rows of the data]

Data Preparation

We ran through the data to check for the following:

  • Nulls
  • Data types, to see whether any numerical columns are marked as object
  • How many categorical and numerical columns the dataframe contains
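
The three checks above can be sketched with pandas roughly as follows. This is a minimal illustration, not the notebook's actual code: the toy dataframe, its column names (`companyId`, `jobType`, `yearsExperience`, `salary`) and the helper `looks_numeric` are all assumptions made for the example.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical toy frame standing in for the train file (column names are assumptions).
df = pd.DataFrame({
    "companyId": ["C1", "C2", "C1", "C3"],
    "jobType": ["JUNIOR", "SENIOR", None, "CEO"],
    "yearsExperience": ["3", "10", "7", "20"],  # numbers stored as strings
    "salary": [60, 120, 95, 210],
})

# 1. Nulls: count missing values per column.
null_counts = df.isnull().sum()

# 2. Data types: find non-numeric columns whose non-null values all parse as numbers.
def looks_numeric(series):
    converted = pd.to_numeric(series, errors="coerce")
    return converted.notna().sum() == series.notna().sum()

mislabelled = [
    col for col in df.columns
    if not is_numeric_dtype(df[col]) and looks_numeric(df[col])
]
for col in mislabelled:
    df[col] = pd.to_numeric(df[col])

# 3. Split the columns into numerical and categorical groups.
numerical_cols = [col for col in df.columns if is_numeric_dtype(df[col])]
categorical_cols = [col for col in df.columns if col not in numerical_cols]

print(null_counts)
print("mislabelled:", mislabelled)
print("numerical:", numerical_cols, "categorical:", categorical_cols)
```

Partitioning with `is_numeric_dtype` rather than matching on the `object` dtype keeps the sketch independent of how a given pandas version stores strings.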

Feature Engineering

One-Hot Encoding the Categorical Values

One-hot encoding was used to convert the categorical values to numerical values, as shown below, because the models only work on numerical columns.
[Image: dataframe after one-hot encoding]
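
As a minimal sketch of this step, `pandas.get_dummies` turns each category into its own 0/1 column. The column names and values here are assumptions for illustration; the notebook may use a different encoder or different columns.

```python
import pandas as pd

# Hypothetical columns; the real dataset's names may differ.
df = pd.DataFrame({
    "jobType": ["JUNIOR", "SENIOR", "JUNIOR"],
    "degree": ["BACHELORS", "MASTERS", "NONE"],
    "yearsExperience": [3, 10, 1],
})

# Each categorical column is replaced by one indicator column per category.
encoded = pd.get_dummies(df, columns=["jobType", "degree"], dtype=int)

print(encoded.columns.tolist())
```

Numerical columns such as `yearsExperience` pass through unchanged, so the result can be fed to a regressor directly.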

Correlation of Features to Salary

Evaluated the correlation of each feature to salary to see which features need to be considered.
[Figures: correlation of each feature with salary]

The correlations show that Company ID has no noticeable impact on salary, so it is ignored.
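
One way to run this kind of check, sketched under assumptions: encode the categorical column, then read off each column's Pearson correlation with the target. The toy data below is fabricated purely so the example runs (it is built so that the hypothetical `companyId` carries no signal); it does not reproduce the project's numbers.

```python
import pandas as pd

# Toy data (an assumption): salary tracks experience, not company.
df = pd.DataFrame({
    "companyId": ["C1", "C2", "C1", "C2", "C1", "C2"],
    "yearsExperience": [1, 2, 5, 6, 10, 11],
    "salary": [50, 52, 80, 78, 130, 128],
})

encoded = pd.get_dummies(df, columns=["companyId"], dtype=int)

# Pearson correlation of every feature column with salary.
corr_to_salary = encoded.corr()["salary"].drop("salary")

# Rank features by the strength of the correlation, ignoring its sign.
ranked = corr_to_salary.abs().sort_values(ascending=False)
print(ranked)
```

Features whose indicator columns all sit near zero in `ranked` are candidates for dropping, which is the reasoning applied to Company ID above.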

Model

Evaluated 2 models: Linear Regression and Gradient Boosting Regressor.

Linear Regression

Predicted vs. Real Plot

[Figure: predicted vs. real salary, Linear Regression]

MSE Evaluation

[Figure: MSE evaluation, Linear Regression]
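
A minimal sketch of fitting a linear regression and scoring it by mean squared error with scikit-learn. The synthetic features, the 80/20 split and the random seed are assumptions for illustration, not the notebook's settings.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption): two numeric features and a noisy linear target.
rng = np.random.default_rng(0)
X = rng.uniform(0, 25, size=(500, 2))
y = 40 + 3.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(0, 5, size=500)

# Hold out 20% of the rows to evaluate on data the model has not seen.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)

lr = LinearRegression().fit(X_train, y_train)
y_pred = lr.predict(X_val)

# MSE: the mean of the squared differences between real and predicted values.
lr_mse = mean_squared_error(y_val, y_pred)
print(f"Linear Regression MSE: {lr_mse:.2f}")
```

Plotting `y_pred` against `y_val` as a scatter plot gives the kind of predicted vs. real view referred to above.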

Gradient Boosting Regressor

Predicted vs. Real Plot

[Figure: predicted vs. real salary, Gradient Boosting Regressor]

MSE Evaluation

[Figure: MSE evaluation, Gradient Boosting Regressor]
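
The same evaluation can be sketched for scikit-learn's `GradientBoostingRegressor`. The hyperparameters shown are the library defaults written out explicitly, and the synthetic data is an assumption; neither is taken from the notebook.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption), as in any small regression demo.
rng = np.random.default_rng(0)
X = rng.uniform(0, 25, size=(500, 2))
y = 40 + 3.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(0, 5, size=500)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Boosting fits shallow trees one after another, each correcting the previous errors.
gbr = GradientBoostingRegressor(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=0
)
gbr.fit(X_train, y_train)

gbr_mse = mean_squared_error(y_val, gbr.predict(X_val))
print(f"Gradient Boosting MSE: {gbr_mse:.2f}")
```

Computing both models' MSE on the same held-out rows is what makes the two numbers comparable.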

Conclusion

Although the Predicted vs. Real plots look similar, the MSE numbers from further evaluation indicate that GBR is the better model.

Using GBR, evaluated the features to see which have the most impact.

[Figure: feature importance from the GBR model]
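
A fitted `GradientBoostingRegressor` exposes a `feature_importances_` attribute, which is one plausible way to produce such a ranking. The sketch below assumes two made-up features, one informative and one pure noise, so the expected ordering is obvious.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Made-up features (an assumption): only yearsExperience drives the target.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "yearsExperience": rng.uniform(0, 25, 500),
    "noiseFeature": rng.uniform(0, 1, 500),
})
y = 40 + 3.0 * X["yearsExperience"] + rng.normal(0, 5, 500)

gbr = GradientBoostingRegressor(random_state=0).fit(X, y)

# Importances are normalized to sum to 1; larger means the feature was more useful in splits.
importances = pd.Series(
    gbr.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```

With one-hot encoded data, each indicator column gets its own importance, so the values for one original feature may need to be summed.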

Finally, the predicted salaries using the GBR model:

[Image: predicted salaries from the GBR model]