Project author: ekstroem

Project description:
An R package for data screening
Primary language: HTML
Project URL: git://github.com/ekstroem/dataMaid.git
Created: 2016-09-26T11:15:17Z
Project community: https://github.com/ekstroem/dataMaid

dataMaid

dataMaid is an R package for documenting and creating reports on data cleanliness.

dataMaid has become dataReporter

dataMaid has been renamed to dataReporter and is no longer maintained. All future updates and development will go into dataReporter. Install the new package from CRAN like this:

    install.packages("dataReporter")

or install the development version from GitHub:

    devtools::install_github("ekstroem/dataReporter")

Please report bugs at our new repository.

Installation

This GitHub page contains the development version of dataMaid. For the
latest stable version, install the package directly from CRAN using:

    install.packages("dataMaid")

To install the development version of dataMaid, run the following
command from within R (this requires that the devtools package is already installed):

    devtools::install_github("ekstroem/dataMaid")

Package overview

A simple way to get started is to load the package and call the
makeDataReport() function on a data frame. If you generate several
reports for the same data, you may need to add the replace=TRUE
argument to overwrite the existing report.

    library("dataMaid")
    data(trees)
    makeDataReport(trees)

This will create a report with summaries and error checks for each
variable in the trees data frame. The format of the report depends on your OS and on whether
you have a LaTeX installation on your computer, which
is needed for creating pdf reports.
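If you would rather not depend on what is installed locally, the report format can be requested explicitly. The following is a minimal sketch, assuming the `output` and `openResult` arguments behave as described on the makeDataReport() help page; only `replace` is mentioned in this README. Rendering still needs a working rmarkdown/pandoc setup.

```r
library("dataMaid")
data(trees)

# Ask for an HTML report, which does not require LaTeX.
makeDataReport(trees,
               output = "html",     # assumed options: "pdf", "word" or "html"
               replace = TRUE,      # overwrite a report from an earlier run
               openResult = FALSE)  # assumed flag: do not open the report when done
```

The report is written to the current working directory, so set it with setwd() first if the file should end up elsewhere.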

Using dataMaid interactively

The dataMaid package can also be used interactively by running checks
for individual variables or for all variables in the dataset:

    data(toyData)
    check(toyData$events) # Individual check of events
    check(toyData) # Check all variables at once

By default, the standard battery of tests is run depending on the
variable type. If we just want a specific test for, say, a numeric
variable, then we can specify that. All available checks can be viewed
by calling allCheckFunctions(). See the documentation for an overview
of the checks available or how to create and include your own tests.

    check(toyData$events, checks = setChecks(numeric = "identifyMissing"))
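More than one check can be named at a time. The sketch below assumes that `identifyOutliers` is among the functions listed by allCheckFunctions() in your installed version; confirm that before relying on it.

```r
library("dataMaid")
data(toyData)

# List the check functions that ship with the package.
allCheckFunctions()

# Run two named checks on a numeric variable instead of the default battery.
# "identifyOutliers" is assumed to be available; see the list printed above.
res <- check(toyData$events,
             checks = setChecks(numeric = c("identifyMissing",
                                            "identifyOutliers")))
res
```

The result is a list with one entry per requested check, which can be inspected element by element.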

We can also access the graphics or summary tables that are produced for a variable by calling the visualize or summarize functions. One can visualize a single variable or a full dataset:

    # Visualize a variable
    visualize(toyData$events)
    # Visualize a dataset
    visualize(toyData)

The same is true for summaries. Note also that the choice of checks, visualizations and summaries is customizable:

    # Summarize a variable with default settings:
    summarize(toyData$events)
    # Summarize a variable with user-specified settings:
    summarize(toyData$events, summaries = setSummaries(all = c("centralValue", "minMax")))
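To see which names can be passed to setSummaries(), the package provides listing functions analogous to allCheckFunctions(). A minimal sketch, assuming allSummaryFunctions() and allVisualFunctions() are exported by your installed version and that setSummaries() accepts a per-type `numeric` argument:

```r
library("dataMaid")
data(toyData)

# List the available summary and visualization functions.
allSummaryFunctions()
allVisualFunctions()

# Restrict the summaries for numeric variables only, leaving other
# variable types on their defaults.
res <- summarize(toyData$events,
                 summaries = setSummaries(numeric = c("centralValue", "minMax")))
res
```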

Detailed documentation

You can read the main paper accompanying the package at the Journal
of Statistical Software. It provides a detailed introduction to the
dataMaid package.

We also have two blog posts that provide an introduction to the package. They can be found here (the primary one) and here.

Moreover, we have
created a vignette that describes how to extend dataMaid to include
user-defined data screening checks, summaries and visualizations. This
vignette is called extending_dataMaid:

    vignette("extending_dataMaid")
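As a rough illustration of what a user-defined check can look like, here is a minimal sketch. The check `identifyNegative` is hypothetical and not part of dataMaid; the sketch assumes that checkFunction() and checkResult() work as described in the vignette, which remains the authoritative guide.

```r
library("dataMaid")

# Hypothetical check: flag negative values in a numeric variable.
identifyNegative <- function(v, nMax = 10, ...) {
  negVals <- unique(v[!is.na(v) & v < 0])
  problem <- length(negVals) > 0
  message <- if (problem) {
    "Note that the variable contains negative values."
  } else {
    ""
  }
  # Wrap the outcome in the result structure used by the package.
  checkResult(list(problem = problem,
                   message = message,
                   problemValues = negVals))
}

# Register the function as a check so that dataMaid recognises it.
identifyNegative <- checkFunction(identifyNegative,
                                  description = "Identify negative values",
                                  classes = c("numeric", "integer"))

# Call the check directly, then through check() by name.
res <- identifyNegative(c(1, -2, 3))
res
check(c(1, -2, 3), checks = setChecks(numeric = "identifyNegative"))
```

The `...` in the signature lets the function ignore any extra arguments passed along when it is called through check().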

Online app

We are currently working on an online version of the tool, where users
can upload their data and get a report. A prototype
is already up and running - we just need to configure the R server correctly.

Until we have set it up online, you can try it out on your own machine:

    library(shiny)
    runUrl("https://github.com/ekstroem/dataMaid/raw/master/app/app.zip")