This repository contains files for Udacity's Machine Learning Nanodegree Project: Boston House Price Prediction
You want to be the best real estate agent out there. In order to compete with other agents in your area, you decide to use machine learning. You are going to use various statistical analysis tools to build the best model to predict the value of a given house. Your task is to find the best price your client can sell their house at. The best guess from a model is the one that generalizes the data best.
For this assignment your client has a house with the following feature set: [11.95, 0.00, 18.100, 0, 0.6590, 5.6090, 90.00, 1.385, 24, 680.0, 20.20, 332.09, 12.13]. To get started, use the example scikit implementation. You will have to modify the code slightly to get the file up and running.
Loading the dataset:
```python
from sklearn import datasets

# Load the Boston housing dataset
city_data = datasets.load_boston()

# Get the labels and features from the housing data
housing_prices = city_data.target
housing_features = city_data.data
```
Let us now begin exploring the data using NumPy:
```python
import numpy as np

# Size of the dataset
number_of_houses = housing_features.shape[0]
print "number of houses:", number_of_houses
number_of_features = housing_features.shape[1]
print "number of features:", number_of_features

# Price range of the houses
max_price = np.max(housing_prices)
min_price = np.min(housing_prices)
print "max price of house:", max_price
print "min price of house:", min_price
```
```
number of houses: 506
number of features: 13
max price of house: 50.0
min price of house: 5.0
```
---
- Mean and median Boston housing prices?
```python
mean_price = np.mean(housing_prices)
median_price = np.median(housing_prices)
print "mean price of house:", mean_price
print "median price of house:", median_price

standard_deviation = np.std(housing_prices)
print "standard deviation for prices of house:", standard_deviation
```
```
mean price of house: 22.5328063241
median price of house: 21.2
standard deviation for prices of house: 9.18801154528
```
In a previous iteration of this work I used R^2, but since the project calls for a measure of the error, I believe the mean squared error (MSE) gives us the best solution. The reason is that MSE emphasizes large errors, unlike the mean absolute error or the median absolute error, which is generally what statisticians consider desirable. However, another important consideration for this project is which measure minimizes the error the most. The reasoning alone is not enough, so we need some kind of graphical evidence, compared below.
Mean Absolute Error | Mean Squared Error | Median Absolute Error |
---|---|---|
![]() | ![]() | ![]() |
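As a rough illustration of how these three metrics behave, here is a minimal sketch assuming scikit-learn's metrics module; the `y_true`/`y_pred` values are made up for illustration and are not taken from the project.

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, median_absolute_error

def performance_metric(y_true, y_pred):
    """Score a set of predictions with mean squared error (lower is better)."""
    return mean_squared_error(y_true, y_pred)

# Illustrative values only: one large error shows how strongly MSE reacts
# compared with the absolute-error metrics.
y_true = [22.0, 30.5, 14.2, 19.9]
y_pred = [21.5, 28.0, 15.0, 50.0]
print("MAE: {:.2f}".format(mean_absolute_error(y_true, y_pred)))
print("MSE: {:.2f}".format(mean_squared_error(y_true, y_pred)))
print("Median AE: {:.2f}".format(median_absolute_error(y_true, y_pred)))
```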
One of the biggest problems that could occur is overfitting or underfitting. If we did not split the dataset, we would be using all of our data for training, and there is a high chance that our algorithm becomes tailor-made to that data. Thus, there could be a huge dip in the performance of the algorithm on the test dataset.
Splitting the dataset gives us, among other advantages, a held-out set on which we can measure how well the model generalizes to data it has never seen, and a way to detect overfitting before deploying the model.
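A minimal sketch of such a split, assuming scikit-learn's `train_test_split` (found in `sklearn.cross_validation` in older releases and `sklearn.model_selection` in newer ones) and an illustrative 70/30 split:

```python
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in older versions

# Hold out 30% of the data for testing; the split ratio and random_state are
# illustrative assumptions, not values fixed by the project.
X_train, X_test, y_train, y_test = train_test_split(
    housing_features, housing_prices, test_size=0.3, random_state=42)

print("training examples: {}".format(X_train.shape[0]))
print("testing examples: {}".format(X_test.shape[0]))
```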
Grid search helps with parameter tuning and with selecting an appropriate model based on the parameters we wish to tune. While doing so, it also runs cross-validation folds, which reduces the risk of overfitting/underfitting. It also offers the flexibility of a customized scorer function.
The goal of cross validation is to define a dataset to "test" the model during the training phase (i.e., the validation dataset), in order to limit problems like overfitting and to give insight into how the model will generalize to an independent dataset (i.e., an unknown dataset, for instance from a real problem).
Grid search performs 3-fold cross-validation by default, so we can easily validate the optimized model it produces. Cross-validation ensures that we end up with a sufficiently good model and that we do not badly over- or underfit the data. Since grid search optimizes the parameters and uses cross-validation to select them, we can be reasonably sure that there are no significant problems of underfitting or overfitting.
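A sketch of that setup, reusing `X_train`/`y_train` from the split sketch above and assuming `GridSearchCV` with a `DecisionTreeRegressor`, an illustrative `max_depth` grid, and an MSE scorer (not necessarily the exact grid used in the project):

```python
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, make_scorer
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search in older versions

# Score with MSE; greater_is_better=False because a lower error is better.
mse_scorer = make_scorer(mean_squared_error, greater_is_better=False)

# Illustrative parameter grid: only max_depth is tuned here.
parameters = {"max_depth": list(range(1, 11))}
regressor = DecisionTreeRegressor()

grid = GridSearchCV(regressor, parameters, scoring=mse_scorer, cv=3)
grid.fit(X_train, y_train)

print("best max_depth: {}".format(grid.best_params_["max_depth"]))
```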
As the training size increases, the error reduces and the model begins to predict better, as can be seen from the learning curves. With an increase in the max depth of the decision tree and in the training size, the training error is almost 0, whereas the testing error continues to decrease. Thus, the regression model improves as the training size increases.
This, however, is the general trend for higher depths. At low depths, in contrast, both the training and the testing error flatten out early, so the testing error stops improving even as the training size increases.
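A rough sketch of how such learning curves can be generated, reusing the split from the earlier sketch; matplotlib is assumed, `learning_curve_for_depth` is a hypothetical helper, and this is not the exact code behind the project's figures:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

def learning_curve_for_depth(depth, X_train, y_train, X_test, y_test, n_points=50):
    """Train on progressively larger slices of the training set and record MSE."""
    sizes = np.linspace(1, len(X_train), n_points).astype(int)
    train_err, test_err = np.zeros(len(sizes)), np.zeros(len(sizes))
    for i, s in enumerate(sizes):
        reg = DecisionTreeRegressor(max_depth=depth)
        reg.fit(X_train[:s], y_train[:s])
        train_err[i] = mean_squared_error(y_train[:s], reg.predict(X_train[:s]))
        test_err[i] = mean_squared_error(y_test, reg.predict(X_test))
    return sizes, train_err, test_err

sizes, train_err, test_err = learning_curve_for_depth(5, X_train, y_train, X_test, y_test)
plt.plot(sizes, train_err, label="training error")
plt.plot(sizes, test_err, label="testing error")
plt.xlabel("training size")
plt.ylabel("MSE")
plt.legend()
plt.show()
```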
Let's analyze the scenario for max depth = 1:
In this case, we are underfitting the dataset: both the training and the testing errors are high and flatten out after a while. Thus, there is a high amount of bias when using max depth = 1. This is somewhat intuitive, as we do not let the tree expand and we restrict the entropy, or how much the tree can learn. Hence, with a depth level of 1 we cannot capture the complexity of the dataset.
Let's analyze the scenario for max depth = 10:
Here, we are clearly overfitting. By the same reasoning as above, since we allow the depth to grow up to level 10, we are increasing the entropy levels. Thus, we see an error of 0 for the training dataset. The testing error, however, flattens out with a few occasional spikes.
I think a max_depth of 5 generalizes the dataset the best. It doesn't overfit as in the case of higher depths, where we simply get a training error of 0; that means we are fitting the training data completely and there is a high chance of overfitting. The model complexity graph verifies the above conclusion.
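A sketch of how such a model complexity graph can be produced, reusing the split from the earlier sketch and plotting training/testing MSE against max_depth (again, not the project's exact plotting code):

```python
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Model complexity curve: error as a function of max_depth on a fixed split.
depths = list(range(1, 15))
train_err, test_err = [], []
for d in depths:
    reg = DecisionTreeRegressor(max_depth=d)
    reg.fit(X_train, y_train)
    train_err.append(mean_squared_error(y_train, reg.predict(X_train)))
    test_err.append(mean_squared_error(y_test, reg.predict(X_test)))

plt.plot(depths, train_err, label="training error")
plt.plot(depths, test_err, label="testing error")
plt.xlabel("max depth")
plt.ylabel("MSE")
plt.legend()
plt.show()
```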
Final Model:
```
DecisionTreeRegressor(criterion='mse', max_depth=5, max_features=None,
                      max_leaf_nodes=None, min_samples_leaf=3, min_samples_split=1,
                      min_weight_fraction_leaf=0.0, random_state=None,
                      splitter='best')

House: [11.95, 0.0, 18.1, 0, 0.659, 5.609, 90.0,
        1.385, 24, 680.0, 20.2, 332.09, 12.13]
Prediction: [ 20.96776316]
```
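A sketch of how this prediction can be reproduced from the tuned model, using the `grid` object from the grid-search sketch above; since that illustrative grid only tuned `max_depth`, the number it produces may differ slightly from the output shown here.

```python
# Predict the selling price of the client's house with the tuned regressor.
client_house = [[11.95, 0.00, 18.100, 0, 0.6590, 5.6090, 90.00,
                 1.385, 24, 680.0, 20.20, 332.09, 12.13]]

best_regressor = grid.best_estimator_
print("Prediction: {}".format(best_regressor.predict(client_house)))
```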
The final decision tree regressor uses the optimized parameters shown above, most notably max_depth=5, min_samples_leaf=3, and min_samples_split=1.
The central tendency of the dataset, as given by the mean and the median, is as follows:
mean price of house: 22.5328063241
median price of house: 21.2
Since the predicted price of about 20.97 lies close to both the mean and the median, our prediction for the given house is credible. Therefore we can say that we have a model that fits the data to a reasonable extent.
However, I would say that the model still doesn't completely describe the variance in the dataset. If we plot the residuals, we get a cyclic pattern, which means that a linear model won't be able to generalize the data; we are underfitting the dataset.
![Image of Residual Plot](residual_plot.png)
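A minimal sketch of how such a residual plot can be produced, assuming the `best_regressor` and test split from the earlier sketches (this is not the exact code that generated `residual_plot.png`):

```python
import matplotlib.pyplot as plt

# Residuals: difference between the actual prices and the model's predictions
# on the held-out test set.
predicted = best_regressor.predict(X_test)
residuals = y_test - predicted

plt.scatter(predicted, residuals)
plt.axhline(0, color="gray")
plt.xlabel("predicted price")
plt.ylabel("residual (actual - predicted)")
plt.show()
```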