项目作者: rayzapata

项目描述 :
Regression project for predicting property values with Zillow
高级语言: Jupyter Notebook
项目地址: git://github.com/rayzapata/project_regression_zillow.git
创建时间: 2021-06-11T19:58:08Z
项目社区:https://github.com/rayzapata/project_regression_zillow

开源协议:

下载


Predicting Property Values with Zillow

Table of Contents


I. Project Overview

  1. Description
  2. Deliverables

II. Project Summary

  1. Goals
  2. Initial Thoughts & Hypothesis
  3. Findings & Next Phase

III. Data Context

  1. Database Relationships
  2. Data Dictionary

IV. Process

  1. Project Planning
  2. Data Acquisition
  3. Data Preparation
  4. Data Exploration
  5. Modeling & Evaluation
  6. Product Delivery

V. Modules

VI. Project Reproduction

I. Project Overview


1. Description

This project serves to fulfill the requests made to the Zillow data science department to predict the values of “single unit properties.” With the available data, it is also requested to obtain county specific tax rates and their distributions.

2. Deliverables

  • GitHub repository and README stating project overview, goals, findings, and summary
  • Jupyter Notebook showing detailed process through data science pipeline
  • Slide deck presentation summarizing findings of driver for single unit property values

II. Project Summary


1. Goals

Utilizing the data available from Zillow, this project sets out to best predict value in USD of properties sold in 2017 during the “hot months” of May, June, July, and August. Property value in this context is defined as the tax appraised value of land and any structures built upon it. Through the process of preparing and exploring data, we will find drivers of single unit property value and create predictive regression models to predict that value using the count of bathrooms, bedrooms and the calculated finished square feet of the structure.

Secondary to this main objective, the data science team is also requested to obtain the distributions and estimation of tax rates for each of the counties where these properties exist. Using the Federal Information Processing Standards (FIPS) codes to obtain county names, and the tax assessed value and tax amount, we will calculate and visualize this request.

2. Initial Thoughts & Hypothesis

The initial hypothesis of the project was that as the dimensions of the structure on the property increased, so too would it’s value. The same was considered true for the count of bathrooms and bedrooms on the property as well, but was not believed to be as strong a driver as size. It is also believed that size and location of the land itself would play a role in property value, and this is a potential avenue of future exploration after completing the minimally viable product (MVP).

3. Findings & Next Phase

Through statistical testing, we found evidence that counts of bathrooms and bedrooms as well as structure finished square feet had a positive relationship with property value. In creating our models, we utilized all three requested features as part of training and fitting for the MVP, and created a second degree polynomial regression model which performed well enough to make it through to the testing dataset. With a root mean squared error of $271,768.12 on the final test, this model maintained consistent performance through each stage. With each model created, it was found that the higher the true value, the larger the residuals of predictions grew.

Estimating Tax Rates

Using the FIPS codes within the data, these identifiers were used with data from census.gov to obtain counties for each property. Using the tax assessed value and the tax payment for each property, the distribution of tax rates were obtained, and then a calculated average tax rate for each county as stated below. All counties were located within the US state of California.

County Average Tax Average Value Tax Rate
Los Angeles $5,159.40 $406,432.95 1.38%
Orange $5,631.82 $483,927.33 1.21%
Ventura $5,117.08 $437,588.39 1.20%

With additional time, we will be able to explore further variables as model features. It is a continuing hypothesis that location and lot size will play a strong role in better predicting when the true value is higher.

III. Data Context


1. Database Relationships

The Codeup zillow SQL database contains twelve tables, nine of which have foreign key links with our primary table properties_2017: airconditioningtype, architecturalstyletype, buildingclasstype, heatingorsystemtype, predictions_2017, propertylandusetype, storytype, typeconstructiontype, and unique_properties. Each table is connected by a pointed arrow with the corresponding foreign keys that link them. Many of these tables are unused in this project due to missing values, and this database map serves only to define the database.

2. Data Dictionary

Following acquisition and preparation of the initial SQL database, the DataFrames used in this project contain the following variables. Contained values are defined along with their respective data types.

Variable Definition Data Type
bedrooms count of bedrooms on property integer
bathrooms count of bathrooms and half-bathrooms on property float
county human readable name of county where property exists object
fips federal information processing standards codes integer
property_id unique identifier for each property index
square_feet total calculated square feet in property structure float
tax_amount_usd property taxes based on assessed value in USD float
tax_value_usd * assessed value of property in USD float
  1. * Target variable

IV. Process


1. Project Planning

🟢 Plan ➜ ☐ Acquire ➜ ☐ Prepare ➜ ☐ Explore ➜ ☐ Model ➜ ☐ Deliver

  • Build this README containing:
    • Project overview
    • Initial thoughts and hypotheses
    • Project summary
    • Instructions to reproduce
  • Build and use Trello board for data science pipeline
  • Consider project needs versus project desires

2. Data Acquisition

Plan ➜ 🟢 Acquire ➜ ☐ Prepare ➜ ☐ Explore ➜ ☐ Model ➜ ☐ Deliver

  • Obtain initial data and understand its structure
  • Remedy any inconsistencies, duplications, or structural problems within data
  • Perform data summation

3. Data Preparation

Plan ➜ ✓ Acquire ➜ 🟢 Prepare ➜ ☐ Explore ➜ ☐ Model ➜ ☐ Deliver

  • Address missing or inappropriate values, including outliers
  • Plot distributions of variables
  • Encode categorical variables
  • Consider and create new features as needed
  • Split data into train, validate, and test

4. Data Exploration

Plan ➜ ✓ Acquire ➜ ✓ Prepare ➜ 🟢 Explore ➜ ☐ Model ➜ ☐ Deliver

  • Visualize relationships of variables
  • Formulate hypotheses
  • Perform statistical tests
  • Decide upon features and models to be used

5. Modeling & Evaluation

Plan ➜ ✓ Acquire ➜ ✓ Prepare ➜ ✓ Explore ➜ 🟢 Model ➜ ☐ Deliver

  • Establish baseline prediction
  • Create, fit, and predict with models
  • Evaluate models with out-of-sample data
  • Utilize best performing model on test data
  • Summarize, visualize, and interpret findings

6. Product Delivery

Plan ➜ ✓ Acquire ➜ ✓ Prepare ➜ ✓ Explore ➜ ✓ Model ➜ 🟢 Deliver

  • Prepare Jupyter Notebook of project details through data science pipeline
  • With additional time, continue with exploration beyond MVP
  • Proof read and complete README and project repository
  • [X] Prepare slide deck presentation of project

V. Modules


The created modules used in this project below contain full comments an docstrings to better understand their operation. Where applicable, all functions used random_state=19 at all times. Use of functions requiring access to the Codeup database require an additional module named env.py. See project reproduction for more detail.

  • acquire: contains functions used in initial data acquisition leading into the prepare phase
  • prepare: contains functions used to prepare data for exploration and visualization
  • explore: contains functions to visualize the prepared data and estimate the best drivers of property value
  • model : contains functions to create, test models and visualize their performance

VI. Project Reproduction


To recreate and reproduce results of this project, you will need to create a module named env.py. This file will need to contain login credentials for the Codeup database server stored in their respective variables named host, username, and password. You will also need to create the following function within. This is used in all functions that acquire data from the SQL server to create the URL for connecting. db_name needs to be passed as a string that matches exactly with the name of a database on the server.

  1. def get_connection(db_name):
  2. return f'mysql+pymysql://{username}:{password}@{host}/{db_name}'

After its creation, ensure this file is not uploaded or leaked by ensuring git does not interact with it. When using any function housed in the created modules above, ensure full reading of comments and docstrings to understand its proper use and passed arguments or parameters.

[Return to Top]