Project author: cogeorg

Project description:
Research project to help measure the complexity of legal documents.
Primary language: HTML
Project URL: git://github.com/cogeorg/RegulatoryComplexity.git
Created: 2016-05-22T20:15:08Z
Project community: https://github.com/cogeorg/RegulatoryComplexity

License: GNU General Public License v3.0


RegulatoryComplexity

This is the code repository for the research project “Measuring Regulatory Complexity” by Jean-Edouard Colliard and Co-Pierre Georg. Use this code at your own risk. The code provides a simple dashboard that lets users classify the words of large regulatory texts (in our case the Dodd-Frank Act) into various categories, e.g. as operators or operands. This classification is the input needed to measure the complexity of a regulatory text with the Halstead (1977) measures. The dashboard is still a work in progress.
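
As a rough illustration of how such a classification feeds into the Halstead measures, here is a minimal Python sketch. It is not taken from this repository: the function name halstead_measures and its arguments are hypothetical, and it assumes that every word not classified as an operator counts as an operand, which may not match the categories used in the dashboard. The formulas are the standard Halstead definitions of vocabulary, length, volume, difficulty and effort.

```python
import math
from collections import Counter


def halstead_measures(tokens, operators):
    """Compute standard Halstead measures for a list of word tokens.

    tokens    -- list of words from the text (hypothetical input format)
    operators -- set of words classified as operators; ASSUMPTION: every
                 other word is treated as an operand
    """
    operator_counts = Counter(t for t in tokens if t in operators)
    operand_counts = Counter(t for t in tokens if t not in operators)

    n1 = len(operator_counts)           # distinct operators
    n2 = len(operand_counts)            # distinct operands
    N1 = sum(operator_counts.values())  # total operator occurrences
    N2 = sum(operand_counts.values())   # total operand occurrences

    vocabulary = n1 + n2
    length = N1 + N2
    # Volume: length * log2(vocabulary); defined as 0 for an empty text.
    volume = length * math.log2(vocabulary) if vocabulary > 0 else 0.0
    # Difficulty: (n1 / 2) * (N2 / n2); defined as 0 when there are no operands.
    difficulty = (n1 / 2) * (N2 / n2) if n2 > 0 else 0.0
    effort = difficulty * volume

    return {
        "vocabulary": vocabulary,
        "length": length,
        "volume": volume,
        "difficulty": difficulty,
        "effort": effort,
    }
```

For the toy sentence "a bank shall not lend if a bank shall report" with the operators shall, not and if, this gives a vocabulary of 7, a length of 10 and a difficulty of 2.25.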

Source of raw data
https://www.fdic.gov/regulations/laws/important/

PDF to txt
Converts the PDF documents located in 001_raw_data to txt.
Input: 001_raw_data/pdf/.pdf
Run the shell script 100_code/shells/001_totxt.sh
Output: 001_raw_data/txt/.txt

PDF to xlm (maximum 100 pages per document)
Converts the PDF documents located in 001_raw_data to xlm.
Input: 001_raw_data/pdf/.pdf
Run the shell script 100_code/shells/001_toxlm.sh
Output: 001_raw_data/xlm/.xlm

Clean data for Dodd-Frank
Input: 001_raw_data/txt/DODDFRANK.txt
Run 100_code/python/001_clean_data

Halstead measures

Features of the text such as bullets, definitions, references…
Input: 010_cleaned_data/DODDFRANK.txt
Run 100_code/python/002_regex_op
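
To give an idea of what regex-based feature detection can look like, here is a minimal Python sketch that counts bullets, definitions and references in a text. It is not the code in 002_regex_op: the three patterns, the names BULLET, DEFINITION, REFERENCE and count_features, and the assumed text conventions (bullets such as "(a)" or "(1)" at the start of a line, definitions of the form "the term ... means", references of the form "section 102(a)") are illustrative assumptions only.

```python
import re

# ASSUMPTION: a bullet is a parenthesised letter or number at the start of a line.
BULLET = re.compile(r"^[ \t]*\((?:[A-Za-z]+|\d+)\)", re.MULTILINE)

# ASSUMPTION: a definition has the form "the term <something> means".
DEFINITION = re.compile(r"\bthe term\b.{1,80}?\bmeans\b", re.IGNORECASE)

# ASSUMPTION: a reference has the form "section 102" or "section 102(a)(1)".
REFERENCE = re.compile(
    r"\bsection \d+[A-Za-z]?(?:\([A-Za-z0-9]+\))*", re.IGNORECASE
)


def count_features(text):
    """Count occurrences of each (hypothetical) feature pattern in text."""
    return {
        "bullets": len(BULLET.findall(text)),
        "definitions": len(DEFINITION.findall(text)),
        "references": len(REFERENCE.findall(text)),
    }
```

Calling count_features on the contents of 010_cleaned_data/DODDFRANK.txt would return one count per feature; real legal text will need more careful patterns than these.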