NYC High School Project

Disclaimer: This project is based on the data cleaning walkthrough provided by dataquest.io. Though my own take on this project might differ from what can be found on their website.

The Project

In this project I will try to showcase my skills in data cleaning, data exploration and presentation. While I will perform some analyses on this project, they will remain at a lower level of complexity. If you want to see my performance on more complex issues, I would refer you to my other projects.

The Question

In this project, I will try to investigate whether standardized testing in U.S. highschools is efficiant and if certain groups are at a disadvantage.

The Data

In order to answer said question, I am going to use publicly accessible SAT data from 2012 from the city of New York. In order to investigate demographics I need more data though. Here’s a list of all datasets I am going to use:

SAT scores by school - SAT scores for each high school in New York City
School attendance - Attendance information for each school in New York City
Class size - Information on class size for each school
AP test results - Advanced Placement (AP) exam results for each high school (passing an optional AP exam in a particular subject can earn a student college credit in that subject)
Graduation outcomes - The percentage of students who graduated, and other outcome information
Demographics - Demographic information for each school
School survey - Surveys of parents, teachers, and students at each school

Skills Used

reading different file formats into pandas
condensing data by concatenating and merging pandas Data Frames
converting data types into different formats
converting, cleaning and recalculating rows and columns using vectorized methods
working with geographic data
handling and replacing missing data