Project author: Mega-Barrel

Project description:
You plan to unleash the inner Machine Learning expert in you and build a sophisticated Machine Learning model that predicts selling prices of products based on the mentioned factors.
Primary language: Jupyter Notebook
Project URL: git://github.com/Mega-Barrel/Machine-Learning-Carnival-Wars.git


My Leaderboard

Secured 301st position out of 2,144 participants with a prediction score of 85.17497.


The process is divided into 4 steps

  • Cleaning the Data
  • Removing Outliers
  • EDA on cleaned dataset
  • Model Building

Cleaning The Data-Set

Filling NaN values with Median for Train Dataset

    fill_na_train = train_df['Stall_no'].median()
    train_df['Stall_no'].fillna(fill_na_train, inplace=True)
    fill_na_train = train_df['Discount_avail'].median()
    train_df['Discount_avail'].fillna(fill_na_train, inplace=True)
    fill_na_train = train_df['charges_1'].median()
    train_df['charges_1'].fillna(fill_na_train, inplace=True)
    fill_na_train = train_df['charges_2 (%)'].median()
    train_df['charges_2 (%)'].fillna(fill_na_train, inplace=True)
    fill_na_train = train_df['Minimum_price'].median()
    train_df['Minimum_price'].fillna(fill_na_train, inplace=True)
    fill_na_train = train_df['Maximum_price'].median()
    train_df['Maximum_price'].fillna(fill_na_train, inplace=True)
    fill_na_train = train_df['Selling_Price'].median()
    train_df['Selling_Price'].fillna(fill_na_train, inplace=True)
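The repeated fill steps above can be condensed into a loop. This is a minimal sketch rather than the notebook's own code: the helper name `fill_with_median` and the toy frame are assumptions made for illustration, and only the column names come from the dataset.

```python
import numpy as np
import pandas as pd


def fill_with_median(df, columns):
    """Return a copy of df with NaNs in the given columns replaced by each column's median."""
    filled = df.copy()
    for col in columns:
        # Series.median() skips NaN values by default
        filled[col] = filled[col].fillna(filled[col].median())
    return filled


# Toy data standing in for train_df (values are made up)
toy_df = pd.DataFrame({
    "Stall_no": [1.0, np.nan, 3.0],
    "charges_1": [100.0, 200.0, np.nan],
})
toy_filled = fill_with_median(toy_df, ["Stall_no", "charges_1"])
print(toy_filled.isnull().sum())
```

Assigning the filled column back, instead of calling `fillna(..., inplace=True)` on a column selection, also sidesteps the chained-assignment warning that newer pandas releases raise for that pattern.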

Checking all values were filled

    train_df.isnull().sum()

Output:
```
Product_id            0
Stall_no              0
instock_date          0
Market_Category       0
Customer_name       211
Loyalty_customer      0
Product_Category      0
Grade                 0
Demand                0
Discount_avail        0
charges_1             0
charges_2 (%)         0
Minimum_price         0
Maximum_price         0
Selling_Price         0
dtype: int64
```

Filling NaN values with Median for Test Dataset

```python
fill_na_test = test_df['Stall_no'].median()
test_df['Stall_no'].fillna(fill_na_test, inplace=True)
fill_na_test = test_df['charges_1'].median()
test_df['charges_1'].fillna(fill_na_test, inplace=True)
fill_na_test = test_df['charges_2 (%)'].median()
test_df['charges_2 (%)'].fillna(fill_na_test, inplace=True)
fill_na_test = test_df['Minimum_price'].median()
test_df['Minimum_price'].fillna(fill_na_test, inplace=True)
```

Checking all values were filled

    test_df.isnull().sum()

Output:

    Product_id           0
    Stall_no             0
    instock_date         0
    Market_Category      0
    Customer_name       53
    Loyalty_customer     0
    Product_Category     0
    Grade                0
    Demand               0
    Discount_avail       0
    charges_1            0
    charges_2 (%)        0
    Minimum_price        0
    Maximum_price        0
    dtype: int64



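Removing Outliers

Outliers were removed using the IQR method. The function below returns the lower and upper bounds for a column; values outside them are treated as outliers:

    import numpy as np

    def outlier_removal(column):
        # Quartiles of the column (np.percentile does not need sorted input)
        Q1, Q3 = np.percentile(column, [25, 75])
        IQR = Q3 - Q1
        # Bounds at 1.5 * IQR beyond the quartiles
        lower_value = Q1 - (1.5 * IQR)
        upper_value = Q3 + (1.5 * IQR)
        return lower_value, upper_value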
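The walkthrough shows how the IQR bounds are computed but not how they were applied. One common way to use them is to keep only the values inside the bounds, sketched below. The names `iqr_bounds` and `prices` and the sample values are hypothetical, so this should not be read as the notebook's actual filtering step.

```python
import numpy as np


def iqr_bounds(values):
    """Return (lower, upper) limits at 1.5 * IQR beyond the quartiles."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


# Made-up prices with one obvious outlier
prices = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 300.0])
lower, upper = iqr_bounds(prices)

# Keep only the values inside [lower, upper]
mask = (prices >= lower) & (prices <= upper)
kept = prices[mask]
print(lower, upper)
print(kept)
```

With a DataFrame, the same boolean mask can be built from one column and used to select rows, for example `df[(df[col] >= lower) & (df[col] <= upper)]`.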



Doing Some EDA on the cleaned dataset.

Plotting Distribution for the Data-Set

1. Selling Price Distribution

2. Charges_1 Distribution

3. Maximum_Price Distribution

4. Products Sold Per Category

5. Products Sold As Per Loyalty Customers

6. Products Sold Per Year (2014, 2015, 2016)

7. Loyalty Customers

Model Building

  1. Model Preparation

    Data for X:

    train[['Stall_no', 'Market_Category', 'Grade', 'Demand', 'Discount_avail', 'charges_1', 'charges_2 (%)', 'Minimum_price', 'Maximum_price', 'month', 'year', 'day', 'loyality_customers', 'Product_Category_Cosmetics', 'Product_Category_Educational', 'Product_Category_Fashion', 'Product_Category_Home_decor', 'Product_Category_Hospitality', 'Product_Category_Organic', 'Product_Category_Pet_care', 'Product_Category_Repair', 'Product_Category_Technology']]

    Data for Y:

    train['Selling_Price']
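    The feature list above contains `month`, `year`, `day` and several `Product_Category_*` columns, which suggests that the date was split into parts and the category was one-hot encoded. The notebook's exact steps are not shown here, so the following is only an assumed sketch on made-up rows; `toy`, `X_toy` and `y_toy` are hypothetical names.

```python
import pandas as pd

# Made-up rows; only the column names come from the dataset
toy = pd.DataFrame({
    "instock_date": ["2015-08-22 18:16:00", "2016-03-27 21:19:00"],
    "Product_Category": ["Fashion", "Technology"],
    "Selling_Price": [4185.9, 9271.5],
})

# Split the date into numeric parts
dates = pd.to_datetime(toy["instock_date"])
toy["year"] = dates.dt.year
toy["month"] = dates.dt.month
toy["day"] = dates.dt.day

# One-hot encode the category; the default prefix yields names
# such as Product_Category_Fashion
toy = pd.get_dummies(toy, columns=["Product_Category"])

X_toy = toy.drop(columns=["instock_date", "Selling_Price"])
y_toy = toy["Selling_Price"]
print(X_toy.columns.tolist())
```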


  2. Train_Test_Split

    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=10)


  3. Using Random Forest

    from sklearn.ensemble import RandomForestRegressor
    reg = RandomForestRegressor(max_depth=15, random_state=0, n_estimators=100, verbose=1)
    reg.fit(X_train, y_train)
    reg.score(X_train, y_train)

    Reg.score (on the training split): 0.996330144542303
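    Because `reg.score(X_train, y_train)` measures R^2 on data the model has already seen, it tends to be optimistic. The sketch below compares it with the score on the held-out split, using the same split and model settings on synthetic data. The data and the names `X_demo`, `y_demo` and `demo_reg` are assumptions for illustration, not the competition data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression data (an assumption; not the competition data)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(300, 3))
y_demo = 5.0 * X_demo[:, 0] + 2.0 * X_demo[:, 1] + rng.normal(0, 1.0, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=10
)

demo_reg = RandomForestRegressor(max_depth=15, n_estimators=100, random_state=0)
demo_reg.fit(X_tr, y_tr)

train_r2 = demo_reg.score(X_tr, y_tr)  # R^2 on data the model has seen
test_r2 = demo_reg.score(X_te, y_te)   # R^2 on the held-out 10%
print(train_r2, test_r2)
```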


  • R^2 Value

    Value for rsquare is 0.9722337968809527

  • RMSE Value

    Value for Root Mean Square Error is : 401.4823455561136

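The two metrics reported above follow standard definitions: R^2 is one minus the ratio of the residual sum of squares to the total sum of squares, and RMSE is the square root of the mean squared error. The walkthrough does not show how they were computed or on which split, so the sketch below only illustrates the formulas; the helper names and the sample values are hypothetical.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


def rmse(y_true, y_pred):
    """Root mean square error, in the same units as the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Made-up values, not results from the competition
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 9.0]
print(r_squared(actual, predicted))
print(rmse(actual, predicted))
```

Since RMSE is expressed in the units of `Selling_Price`, it is easier to interpret alongside the typical price range than R^2 alone.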

Hope You Liked My Competition Walkthrough :)


😀 😄 😃