Project author: junyuan-chen

Project description:
Read data files from Stata, SAS and SPSS into Julia tables
Programming language: Julia
Repository: git://github.com/junyuan-chen/ReadStatTables.jl.git
Created: 2021-04-20T22:20:25Z
Project community: https://github.com/junyuan-chen/ReadStatTables.jl

License: MIT License


ReadStatTables.jl

Read and write Stata, SAS and SPSS data files with Julia tables

ReadStatTables.jl is a Julia package for reading and writing Stata, SAS and SPSS data files with Tables.jl-compatible tables.
It uses the ReadStat C library developed by Evan Miller to parse and write the data files.
The same C library is also the backend of popular packages in other languages, such as pyreadstat for Python and haven for R.
As the Julia counterpart to these packages, ReadStatTables.jl leverages the Julia ecosystem for usability and performance.
According to the project's benchmark results, its read performance surpasses that of all related packages by a sizable margin, especially when multiple threads are used.




Features

ReadStatTables.jl provides the following features in addition to
wrapping the C interface of ReadStat:

  • Fast multi-threaded data collection from ReadStat parsers to a Tables.jl-compatible ReadStatTable
  • Interface of file-level and variable-level metadata compatible with DataAPI.jl
  • Integration of value labels into data columns via a custom array type LabeledArray
  • Translation of date and time values into Julia time types Date and DateTime
  • Write support for Tables.jl-compatible tables (experimental)

Supported File Formats

ReadStatTables.jl recognizes data files with the following file extensions at this moment:

  • Stata: .dta
  • SAS: .sas7bdat and .xpt
  • SPSS: .sav and .por

Installation

ReadStatTables.jl can be installed with the Julia package manager Pkg.
From the Julia REPL, type ] to enter the Pkg REPL and run:

    pkg> add ReadStatTables

Quick Start

To read a data file located at data/sample.dta:

    julia> using ReadStatTables

    julia> tb = readstat("data/sample.dta")
    5×7 ReadStatTable:
     Row │  mychar    mynum      mydate                dtime         mylabl           myord               mytime
         │ String3  Float64       Date?            DateTime?  Labeled{Int8}  Labeled{Int8?}             DateTime
    ─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────
       1 │       a      1.1  2018-05-06  2018-05-06T10:10:10           Male             low  1960-01-01T10:10:10
       2 │       b      1.2  1880-05-06  1880-05-06T10:10:10         Female          medium  1960-01-01T23:10:10
       3 │       c  -1000.3  1960-01-01  1960-01-01T00:00:00           Male            high  1960-01-01T00:00:00
       4 │       d     -1.4  1583-01-01  1583-01-01T00:00:00         Female             low  1960-01-01T16:10:10
       5 │       e   1000.3     missing              missing           Male         missing  2000-01-01T00:00:00

To access a column from the above table:

    julia> tb.myord
    5-element LabeledVector{Union{Missing, Int8}, Vector{Union{Missing, Int8}}, Union{Char, Int32}}:
     1 => low
     2 => medium
     3 => high
     1 => low
     missing => missing

Notice that for data variables with value labels,
both the original values and the value labels are preserved.
For variables representing date/time,
the translation to Julia Date/DateTime is lazy.
One can access the underlying numerical values as follows:

    julia> tb.mydate.data
    5-element SentinelArrays.SentinelVector{Float64, Float64, Missing, Vector{Float64}}:
     21310.0
     -29093.0
     0.0
     -137696.0
     missing
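
The raw values above are consistent with day counts relative to 1960-01-01, the epoch Stata uses for daily dates. A minimal sketch of that mapping, using only the Dates standard library, might look like the following. The names `STATA_EPOCH` and `stata_days_to_date` are hypothetical and introduced only for illustration; they are not part of ReadStatTables.jl, whose own lazy translation may be implemented differently.

```julia
using Dates

# Epoch for Stata daily dates (an assumption stated in the lead-in,
# not something taken from the ReadStatTables.jl source code)
const STATA_EPOCH = Date(1960, 1, 1)

# Hypothetical helper: convert a day count relative to the epoch into a Date
stata_days_to_date(x::Real) = STATA_EPOCH + Day(round(Int, x))
# Missing values stay missing
stata_days_to_date(::Missing) = missing

# The underlying numerical values shown for tb.mydate.data
raw = [21310.0, -29093.0, 0.0, -137696.0, missing]

dates = stata_days_to_date.(raw)
println(dates)
```

The resulting dates match the `mydate` column in the table printed by `readstat`: 2018-05-06, 1880-05-06, 1960-01-01, 1583-01-01 and `missing`.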

File-level and variable-level metadata can be retrieved and modified
via methods compatible with DataAPI.jl:

    julia> metadata(tb)
    ReadStatMeta:
      row count => 5
      var count => 7
      modified time => 2021-04-23T04:36:00
      file format version => 118
      file label => A test file
      file extension => .dta

    julia> colmetadata(tb, :mylabl)
    ReadStatColMeta:
      label => labeled
      format => %16.0f
      type => READSTAT_TYPE_INT8
      value label => mylabl
      storage width => 1
      display width => 16
      measure => READSTAT_MEASURE_UNKNOWN
      alignment => READSTAT_ALIGNMENT_RIGHT
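
Because the metadata interface follows DataAPI.jl, individual values can presumably also be read and modified by key. The sketch below assumes that `metadatakeys`, `metadata!` and `colmetadata!` are available after `using ReadStatTables`, and that the keys are named `"file_label"` and `"label"`. These key names are assumptions inferred from the printed output above, so the sketch lists the available keys first to allow checking them.

```julia
using ReadStatTables

# Same sample file as in the examples above
tb = readstat("data/sample.dta")

# List the file-level metadata keys to confirm the names used below
println(collect(metadatakeys(tb)))

# Read one file-level value by key ("file_label" is an assumed key name)
println(metadata(tb, "file_label"))

# Modify the file label in place
metadata!(tb, "file_label", "A new file label")

# Modify the label of one variable ("label" is an assumed key name)
colmetadata!(tb, :mylabl, "label", "A variable label")
println(colmetadata(tb, :mylabl, "label"))
```

The changes apply to the table held in memory; the sketch does not write anything back to the file on disk.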

For more details, please see the documentation.