Project author: bcgov

Project description:
Using some HPC techniques to wrangle a large, irregular housing-market dataset on an ordinary laptop, in a finite amount of time
Language: Python
Repository: git://github.com/bcgov/bcstats_ohcs_craigslist.git
Created: 2020-01-31T21:01:43Z
Project community: https://github.com/bcgov/bcstats_ohcs_craigslist

License: Apache License 2.0


Lifecycle:Retired
status: archive

bcstats ohcs craigslist

Extractor / parser written for web-scraped craigslist data, as provided to BC Stats by Harmari Inc.

Overview

Wrangle a large, irregularly formatted housing-market dataset on an ordinary computer, in a finite amount of time, using some large-data / HPC-ish techniques:

  • out-of-memory (out-of-core) processing
  • parallelism
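
As an illustration of the out-of-memory idea, the following is a minimal sketch and not the project's actual code. It assumes each embedded document can be delimited by the byte markers `<html` and `</html>`; the function name `iter_html_documents` and the chunk size are hypothetical. The file is read in fixed-size chunks, so only one chunk plus one partial document is held in memory at a time.

```python
import io


def iter_html_documents(stream, start=b"<html", end=b"</html>", chunk_size=1 << 20):
    """Yield each start..end span from a binary stream, reading chunk_size bytes at a time.

    The start/end markers are assumptions made for this sketch.
    """
    buf = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buf += chunk
        while True:
            i = buf.find(start)
            if i < 0:
                # No start marker yet: keep only a short tail, in case the
                # marker itself is split across two chunks.
                buf = buf[-(len(start) - 1):] if len(start) > 1 else b""
                break
            j = buf.find(end, i)
            if j < 0:
                # Document is incomplete: drop the bytes before it and read more.
                buf = buf[i:]
                break
            yield buf[i:j + len(end)]
            buf = buf[j + len(end):]
```

With a real file, the stream would be opened in binary mode, for example `open(path, "rb")`, and the generator consumed one document at a time so that the whole input never has to fit in RAM.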

The challenge

The original data included an irregularly formatted CSV file (22 GB) with approx. 1,000,000 HTML files stuffed into it, where each HTML-file attribute spans approx. 500 lines. At the time, neither Python 3’s “import csv” nor R’s “library(vroom)” could read the data, so a custom out-of-memory slice/extract/parse approach was used. Moreover, HTML parsing with Python 3’s BeautifulSoup was accelerated using full machine parallelism. The data contain sensitive information and will not be posted.
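
The parallel-parsing step can be sketched as follows. This is an illustrative example under stated assumptions, not the repository's implementation: to stay dependency-free it swaps in the standard library's `html.parser` in place of BeautifulSoup, it extracts only the `<title>` text, and the names `extract_title` and `parse_in_parallel` are hypothetical. The same pattern applies if the per-document function uses BeautifulSoup instead.

```python
from concurrent.futures import ProcessPoolExecutor
from html.parser import HTMLParser
import os


class _TitleParser(HTMLParser):
    """Collects the text found inside <title> elements."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title(html_text):
    """Per-document work; a top-level function so worker processes can import it."""
    parser = _TitleParser()
    parser.feed(html_text)
    parser.close()
    return parser.title.strip()


def parse_in_parallel(documents, workers=None, chunksize=64):
    """Map extract_title over documents, one worker process per CPU core by default."""
    workers = workers or os.cpu_count()
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # chunksize batches documents per task to reduce inter-process overhead.
        return list(pool.map(extract_title, documents, chunksize=chunksize))
```

Processes are used instead of threads because HTML parsing is CPU-bound. On platforms that start workers by spawning a fresh interpreter, `parse_in_parallel` should be called from inside an `if __name__ == "__main__":` guard.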

How to produce separate outputs for Apartments (vs Sublets)

Place only the apartment-related (or sublet-related) input data files in the code directory to produce a merged output file that contains only apartment-related (or sublet-related) data.

Process analytics

Sample visualization of process monitor for one of the steps in this “big-data” application

License

Copyright 2020 Province of British Columbia

Licensed under the Apache License, Version 2.0 (the “License”);
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an “AS IS” BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and limitations under the License.