using some HPC techniques to wrangle a large, irregular housing-market dataset on an ordinary laptop, in a finite amount of time
Extractor/parser for web-scraped Craigslist data provided to BC Stats by Harmari Inc.
Overview
Wrangle a large, irregularly formatted housing-market dataset on an ordinary computer, in a finite amount of time, using some large-data / HPC-style techniques
The challenge
The original data include an irregularly formatted 22 GB CSV file containing approximately 1,000,000 HTML files stuffed into a single CSV, where each HTML-file attribute spans approximately 500 lines. At the time, Python 3's "import csv" and R's "library(vroom)" could not read the data, so a custom out-of-memory slice/extract/parse pipeline was used. In addition, Python 3's BeautifulSoup HTML parsing was accelerated using full machine parallelism. The data contain sensitive information and will not be posted.
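A minimal sketch of the two techniques described above, assuming a record boundary and field names that are purely illustrative (the real delimiters and Craigslist attributes in the Harmari CSV are not shown here). It streams the large CSV one embedded HTML record at a time instead of loading it into memory, then parses the extracted fragments with BeautifulSoup across all CPU cores:

    # sketch only: file name, record marker, and extracted fields are hypothetical
    from multiprocessing import Pool, cpu_count

    from bs4 import BeautifulSoup  # pip install beautifulsoup4


    def stream_records(path, record_marker="<!DOCTYPE html"):
        """Yield one embedded HTML document at a time from the giant CSV,
        never holding more than one record in memory (out-of-memory slicing)."""
        record = []
        with open(path, encoding="utf-8", errors="replace") as fh:
            for line in fh:
                if record_marker in line and record:
                    yield "".join(record)
                    record = []
                record.append(line)
            if record:
                yield "".join(record)


    def parse_record(html):
        """Extract a few fields from one HTML fragment (placeholder fields,
        not the actual Craigslist attributes)."""
        soup = BeautifulSoup(html, "html.parser")
        title = soup.title.get_text(strip=True) if soup.title else ""
        return {"title": title, "n_chars": len(html)}


    if __name__ == "__main__":
        records = stream_records("apartments_slice.csv")  # hypothetical input name
        # full machine parallelism: one worker per core, records fed lazily
        # so the parent process never materialises the whole 22 GB dataset
        with Pool(cpu_count()) as pool:
            for parsed in pool.imap_unordered(parse_record, records, chunksize=100):
                pass  # write each parsed row to the output CSV here

The lazy generator plus imap_unordered keeps memory roughly constant while the parsing workload scales with the number of cores, which is the general shape of the approach described above rather than the repo's exact implementation.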
How to produce separate outputs for Apartments (vs Sublets)
Place only apartment-related (or sublet-related) data input files in the code directory to produce a merged output file that contains only apartment-related (or sublet-related) data, as sketched below.
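A minimal sketch of that convention, assuming the scripts simply glob whatever input CSVs sit in their own directory (the file pattern and output name below are illustrative, not the repo's actual ones):

    # sketch only: merge whichever input CSVs are present in the code directory
    import csv
    import glob
    import os

    code_dir = os.path.dirname(os.path.abspath(__file__))
    output_path = os.path.join(code_dir, "merged_output.csv")  # hypothetical name
    input_files = sorted(
        p for p in glob.glob(os.path.join(code_dir, "*.csv")) if p != output_path
    )

    with open(output_path, "w", newline="") as out:
        writer = None
        for path in input_files:
            with open(path, newline="") as fh:
                reader = csv.DictReader(fh)
                if writer is None:
                    writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
                    writer.writeheader()
                for row in reader:
                    writer.writerow(row)

Because only the files present in the directory are merged, restricting the directory to apartment (or sublet) inputs restricts the merged output accordingly.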
Sample visualization of the process monitor during one of the steps in this "big-data" application
Copyright 2020 Province of British Columbia
Licensed under the Apache License, Version 2.0 (the “License”);
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an “AS IS” BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and limitations under the License.