Project author: ingmarboeschen

Project description:
A text extraction and manipulation toolset for NISO-JATS coded XML files
Language: R
Repository: git://github.com/ingmarboeschen/JATSdecoder.git
Created: 2020-10-20T09:25:32Z
Project community: https://github.com/ingmarboeschen/JATSdecoder

License: GNU General Public License v3.0


JATSdecoder

A metadata and text extraction and text manipulation tool set for the statistical programming language R.

JATSdecoder facilitates text mining projects on scientific articles by enabling an individual selection of metadata and text parts.
Its function JATSdecoder() extracts metadata, sectioned text and reference list from NISO-JATS coded XML files.
The function study.character() uses the JATSdecoder() result to perform fine-tuned text extraction tasks that identify key study characteristics, such as the statistical methods used, the alpha error, and the statistical results reported in the text.

Note:

  • PDF article collections can be converted to NISO-JATS coded XML files with the open source software CERMINE.
  • To extract statistical test results reported in simple/unpublished PDF documents with JATSdecoder::get.stats(), the R package pdftools and its function pdf_text() can help to extract the textual content (be aware that tabled content may corrupt the extracted text).

Also note:

  • A minimal web app to extract statistical results from textual resources with get.stats() is hosted at:
    https://get-stats.app
  • An interactive web application to analyze study characteristics of articles stored in the PubMed Central database and to perform an individual article selection by study characteristics is hosted at:
    https://scianalyzer.com/

JATSdecoder supplies some convenient functions to work with textual input in general.
Its function text2sentences() is especially designed to break floating text with scientific content (references, results) into sentences.
text2num() unifies representations of written numbers and special annotations (percent, fraction, e+10) into digits.
ngram() extracts an adjustable number of words around a pattern match in a sentence.
letter.convert() unifies hexadecimal characters to Unicode and, if CERMINE-generated CERMXML files are processed, performs special error correction and letter uniformization, which is essential for get.stats()'s ability to extract and recompute statistical results in text.
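As an illustration, the following sketch applies these text helpers to an invented example (it assumes JATSdecoder is installed; the exact argument names, e.g. for ngram(), should be verified in the CRAN manual):

```r
# hedged sketch of the general text helpers; the input text is invented
library(JATSdecoder)
x <- "Twenty-five percent of participants dropped out. The effect was significant (p<.05)."
# break floating text into sentences
sentences <- text2sentences(x)
sentences
# unify written numbers and percentages to digits
text2num(sentences)
# extract words around a pattern match (see ?ngram for the exact arguments)
ngram(sentences, "significant")
```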

The contained functions are listed below. For a detailed description, see the documentation on CRAN.

  • JATSdecoder::JATSdecoder() uses functions that can also be applied standalone on NISO-JATS coded XML files or text input:
    • get.title() # extracts title
    • get.author() # extracts author/s as vector
    • get.aff() # extracts involved affiliation/s as vector
    • get.journal() # extracts journal
    • get.vol() # extracts journal volume as vector
    • get.doi() # extracts Digital Object Identifier
    • get.history() # extracts publishing history as vector with available date stamps
    • get.country() # extracts country/countries of origin as vector with unique countries
    • get.type() # extracts document type
    • get.subject() # extracts subject/s as vector
    • get.keywords() # extracts keyword/s as vector
    • get.abstract() # extracts abstract
    • get.text() # extracts sections and text as list
    • get.references() # extracts reference list as vector
  • JATSdecoder::study.character() applies several functions to specific elements of the JATSdecoder() result. These functions can be used standalone on any plain text input:

    • get.n.studies() # extracts number of studies from sections or abstract
    • get.alpha.error() # extracts alpha error from text
    • get.method() # extracts statistical methods from method and result section with ngram()
    • get.stats() # extracts statistical results reported in text (abstract and full text, method and result section, result section only) and compares reported with recalculated p-values where possible
    • get.software() # extracts software name/s mentioned in method and result section with dictionary search
    • get.R.package() # extracts mentioned R package/s in method and result section with dictionary search on all available R packages created with available.packages()
    • get.power() # extracts power (1-beta-error) if mentioned in text
    • get.assumption() # extracts mentioned assumptions from method and result section with dictionary search
    • get.multiple.comparison() # extracts correction method for multiple testing from method and result section with dictionary search
    • get.sig.adjectives() # extracts common inadequate adjectives used before "significant" and "not significant"
  • JATSdecoder's helper functions are useful for many text mining projects and straightforward to use on any text input:

    • text2sentences() # breaks floating text into sentences
    • text2num() # converts spelled out numbers, fractions, powers, percentages and numbers denoted with e+num to digits
    • ngram() # creates ±n-gram bag of words around a pattern match in text
    • strsplit2() # splits text at pattern match with option “before” or “after” and without removing the pattern match
    • grep2() # extension of grep(). Allows connecting multiple search patterns with logical AND operator
    • letter.convert() # unifies many and converts most hexadecimal and HTML characters to Unicode and performs CERMINE specific error correction
    • which.term() # returns hit vector for a set of patterns to search for in text (can be reduced to hits only)
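The pattern-search helpers above can be sketched on invented input as follows (assuming JATSdecoder is installed; the argument names are inferred from the descriptions above and should be verified in the CRAN manual):

```r
# hedged sketch of grep2() and which.term(); the input sentences are invented
library(JATSdecoder)
x <- c("A t-test was used for analysis.",
       "Means and standard deviations are reported.",
       "An ANOVA was performed.")
# sentences matching "test" AND "used" (multiple patterns with logical AND)
grep2(c("test","used"), x, value=TRUE)
# hit vector for a set of search terms
which.term(c("t-test","ANOVA","regression"), x)
```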

How to cite JATSdecoder

  1. Böschen, I. (2023). JATSdecoder: A Metadata and Text Extraction and Manipulation Tool Set. R package version 1.2.0.

Resources

Evaluation data and code:

https://github.com/ingmarboeschen/JATSdecoderEvaluation/

JATSdecoder on CRAN:

https://CRAN.R-project.org/package=JATSdecoder/

Getting Started

To install JATSdecoder, choose one of the following options:

Installation

Option 1: Install JATSdecoder from CRAN

  install.packages("JATSdecoder")

Option 2: Install JATSdecoder from github with the devtools package

  if(!require("devtools")) install.packages("devtools")
  devtools::install_github("ingmarboeschen/JATSdecoder")

Usage for a single XML file

Here, a simple download of a NISO-JATS coded XML file is performed with download.file():

  # load package
  library(JATSdecoder)
  # download example XML file via URL
  URL <- "https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0114876&type=manuscript"
  download.file(URL,"file.xml")
  # convert full article to list with metadata, sectioned text and reference list
  JATSdecoder("file.xml")
  # extract specific content (here: abstract)
  JATSdecoder("file.xml",output="abstract")
  get.abstract("file.xml")
  # extract study characteristics as list
  study.character("file.xml")
  # extract specific study characteristics (here: statistical results)
  study.character("file.xml",output=c("stats","standardStats"))
  # reduce to checkable results only
  study.character("file.xml",output="standardStats",stats.mode="checkable")
  # compare with the result of statcheck's function checkHTML() (Epskamp & Nuijten, 2018)
  install.packages("statcheck")
  library(statcheck)
  checkHTML("file.xml")
  # extract results with get.stats() from simple/unpublished manuscripts with pdftools::pdf_text()
  x<-pdftools::pdf_text("path2file.pdf")
  x<-unlist(strsplit(x,"\\n"))
  JATSdecoder::get.stats(x)

Usage for a collection of XML files

The PubMed Central database offers more than 5.4 million documents related to the biology and health sciences. The full repository is bulk downloadable as NISO-JATS coded NXML documents here: PMC bulk download.

  1. Get XML file names from the working directory
     setwd("/home/PMC") # choose a folder with NISO-JATS coded articles in XML files on your device
     files<-list.files(pattern="XML$|xml$",recursive=TRUE)
  2. Apply the extraction of article content to all files (replace lapply() with future_lapply() from the future.apply package for multicore processing)
     library(JATSdecoder)
     # extract full article content
     JATS<-lapply(files,JATSdecoder)
     # extract a single article element (here: abstract)
     abstract<-lapply(files,JATSdecoder,output="abstract")
     # or
     abstract<-lapply(files,get.abstract)
     # extract study characteristics
     character<-lapply(files,study.character)
  3. Working with a list of JATSdecoder() results
     # content of the first article as list
     JATS[[1]]
     character[[1]]
     # names of all extractable elements
     names(JATS[[1]])
     names(character[[1]])
     # extract one element only (here: title, abstract, history)
     lapply(JATS,"[[","title")
     lapply(JATS,"[[","abstract")
     lapply(JATS,"[[","history")
     # extract year of publication from history tag
     unlist(lapply(lapply(JATS,"[[","history"),"[","pubyear"))
  4. Examples for converting, unifying and selecting text with the helper functions
     # extract full text from all documents
     text<-lapply(JATS,"[[","text")
     # convert floating text to sentences
     sentences<-lapply(text,text2sentences)
     sentences
     # select only sentences containing a pattern and unlist article-wise
     pattern<-"significant"
     hits<-lapply(sentences,function(x) grep(pattern,x,value=TRUE))
     hits<-lapply(hits,unlist)
     hits
     # number of sentences with pattern
     lapply(hits,length)
     # unify written numbers, fractions, percentages, powers and numbers denoted with e+num to digits
     lapply(text,text2num)

Exemplary analysis of some NISO-JATS tags

Next, some example analyses are performed on the full PMC article collection. As each variable consumes a lot of memory, you may want to restrict the analysis to a smaller set of articles.

  1. Extract JATS for the article collection (replace lapply() with future_lapply() from the future.apply package for multicore processing)

     # load package
     library(JATSdecoder)
     # set working directory
     setwd("/home/foldername")
     # get XML file names
     files<-list.files(pattern="xml$|XML$")
     # extract JATS
     JATS<-lapply(files,JATSdecoder)

  2. Analyze the distribution of publishing years

     # extract and numerize year of publication from history tag
     year<-unlist(lapply(lapply(JATS,"[[","history"),"[","pubyear"))
     year<-as.numeric(year)
     # frequency table
     table(year)
     # display the absolute number of published documents per year in a barplot
     # with factorized year
     year<-factor(year,min(year,na.rm=TRUE):max(year,na.rm=TRUE))
     barplot(table(year),las=1,xlab="year",main="absolute number of published PMC documents per year")
     # display the cumulative number of published documents in a barplot
     barplot(cumsum(table(year)),las=1,xlab="year",main="cumulative number of published PMC documents")

  3. Analyze the distribution of document types

     # extract document type
     type<-unlist(lapply(JATS,"[","type"))
     # increase left margin of graphic output
     par(mar=c(5,12,4,2)+.1)
     # display in barplot
     barplot(sort(table(type)),horiz=TRUE,las=1)
     # set margins back to default
     par(mar=c(5,4,4,2)+.1)

  4. Find the most frequent authors

     NOTE: author names are not stored fully consistently. Some first and middle names are abbreviated, and first names may be followed by last names and vice versa!

     # extract author
     author<-lapply(JATS,"[","author")
     # top 100 most frequent author names
     tab<-sort(table(unlist(author)),decreasing=TRUE)[1:100]
     # frequency table
     tab
     # display in barplot
     # increase left margin of graphic output
     par(mar=c(5,12,4,2)+.1)
     barplot(tab,horiz=TRUE,las=1)
     # set margins back to default
     par(mar=c(5,4,4,2)+.1)
     # display in a wordcloud with the wordcloud package
     library(wordcloud)
     wordcloud(names(tab),tab)

References



- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM). 2014. Journal Publishing Tag Library - NISO JATS Draft Version 1.1d2.
[https://jats.nlm.nih.gov/publishing/tag-library/1.1d2/index.html].

- Dominika Tkaczyk, Pawel Szostek, Mateusz Fedoryszak, Piotr Jan Dendek and Lukasz Bolikowski.
CERMINE: automatic extraction of structured metadata from scientific literature.
In International Journal on Document Analysis and Recognition (IJDAR), 2015,
vol. 18, no. 4, pp. 317-335, doi: 10.1007/s10032-015-0249-8.
[https://github.com/CeON/CERMINE/].

- Böschen, I. (2021) Software review: The JATSdecoder package—extract metadata, abstract and sectioned text from NISO-JATS coded XML documents; Insights to PubMed central’s open access database. Scientometrics. https://doi.org/10.1007/s11192-021-04162-z

- Böschen, I. (2021). Evaluation of JATSdecoder as an automated text extraction tool for statistical results in scientific reports. Scientific Reports. 11, 19525. https://doi.org/10.1038/s41598-021-98782-3

- Böschen, I. (2023). Evaluation of the extraction of methodological study characteristics with JATSdecoder. Scientific Reports. 13, 139. https://doi.org/10.1038/s41598-022-27085-y

Acknowledgements

This software is part of a dissertation project on the evolution of methodological characteristics in psychological research and was financed by a grant awarded by the Department of Research Methods and Statistics, Institute of Psychology, University of Hamburg, Germany.