Project author: aegis301

Project description:
Data cleaning project using NYC high school data
Primary language: Jupyter Notebook
Repository: git://github.com/aegis301/nyc_high_school_project.git
Created: 2020-12-21T20:48:44Z
Project page: https://github.com/aegis301/nyc_high_school_project

NYC High School Project

Disclaimer: This project is based on the data cleaning walkthrough provided by dataquest.io, though my own take on it may differ from what can be found on their website.

The Project

In this project, I showcase my skills in data cleaning, data exploration, and presentation. I also perform some analyses, but they are kept deliberately simple. For work on more complex problems, please see my other projects.

The Question

In this project, I investigate whether standardized testing in U.S. high schools is efficient and whether certain groups are at a disadvantage.

The Data

To answer this question, I use publicly accessible 2012 SAT data from New York City. The SAT data alone says nothing about demographics, so I combine it with several other datasets. Here is the full list of datasets I use:

  • SAT scores by school - SAT scores for each high school in New York City
  • School attendance - Attendance information for each school in New York City
  • Class size - Information on class size for each school
  • AP test results - Advanced Placement (AP) exam results for each high school (passing an optional AP exam in a particular subject can earn a student college credit in that subject)
  • Graduation outcomes - The percentage of students who graduated, and other outcome information
  • Demographics - Demographic information for each school
  • School survey - Surveys of parents, teachers, and students at each school

Skills Used

  • reading different file formats into pandas
  • condensing data by concatenating and merging pandas DataFrames
  • converting data types into different formats
  • converting, cleaning and recalculating rows and columns using vectorized methods
  • working with geographic data
  • handling and replacing missing data
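
To give a feel for several of the skills listed above, here is a minimal sketch rather than the notebook's actual code. It uses two tiny in-memory stand-ins for the SAT and demographics datasets. The column names (`DBN`, `sat_math`, `frl_percent`, and so on), the `s` marker for suppressed scores, and the choice to fill gaps with column means are all assumptions made for illustration.

```python
import io

import pandas as pd

# Hypothetical miniature stand-ins for two of the datasets. The column
# names and the "s" marker for suppressed scores are assumptions, not
# the real schema of the NYC files.
sat_csv = io.StringIO(
    "DBN,sat_math,sat_reading,sat_writing\n"
    "01M001,400,380,370\n"
    "01M002,s,s,s\n"
    "02M003,520,500,510\n"
)
demo_csv = io.StringIO(
    "DBN,frl_percent\n"
    "01M001,80.5\n"
    "02M003,35.0\n"
)

sat = pd.read_csv(sat_csv)
demo = pd.read_csv(demo_csv)

score_cols = ["sat_math", "sat_reading", "sat_writing"]

# Convert text columns to numbers; entries that cannot be parsed
# (here "s") become NaN instead of raising an error.
sat[score_cols] = sat[score_cols].apply(pd.to_numeric, errors="coerce")

# Vectorized row-wise sum. min_count=3 keeps the total as NaN unless
# all three section scores are present.
sat["sat_total"] = sat[score_cols].sum(axis=1, min_count=3)

# Left merge on the shared school identifier keeps every SAT row,
# even when no demographics row matches.
combined = sat.merge(demo, on="DBN", how="left")

# Replace remaining missing numeric values with the column mean
# (one simple imputation choice among many).
numeric_cols = combined.select_dtypes("number").columns
combined[numeric_cols] = combined[numeric_cols].fillna(
    combined[numeric_cols].mean()
)

print(combined)
```

A left merge is used here so that schools without a demographics record are not silently dropped. Whether that is the right join, and whether mean imputation is appropriate, depends on the analysis.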