Reading and Using Data in Textual Dataset File Format in R

Authors: Johannes Rainer [cre] (https://orcid.org/0000-0002-6977-7147), Michele Filosi [aut] (https://orcid.org/0000-0002-3872-347X)
Last modified: 2023-04-04 07:07:34.433562
Compiled: Tue Apr 4 07:14:10 2023

Introduction

The Textual Dataset File Format (TDFF) stores data from biomedical experiments in simple text files, which makes the data easy to extend, store and edit. A detailed description of the format is provided in the format's documentation. tidyfr is an R package that facilitates importing TDFF data into R and exporting data from R to TDFF.

General usage

Data from different modules (i.e. data sets or data sources) in TDFF are stored in separate folders under a base data directory. Different versions of the same module are also stored in separate folders, which makes it possible to load the data of one specific version of a module.

Below we use some (completely randomized) test data bundled with this package to illustrate the use of tidyfr. First we define the base data folder that contains the data sets.

library(tidyfr)

data_path <- system.file("txt", package = "tidyfr")

For real use of the package, the data_path variable should point to the directory containing the real data. Next, we list the available modules using the list_data_modules function:

list_data_modules(data_path)

##          name version
## 1 db_example1   1.0.0
## 2 db_example2   1.0.0
## 3 db_example2   1.0.1
##                                                  description
## 1 CHRIS baseline dataset: General information (test version)
## 2 CHRIS baseline dataset: General information (test version)
## 3 CHRIS baseline dataset: General information (test version)
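When a module is available in several versions, the listing can be used to pick one programmatically. The sketch below assumes only the name and version columns shown above and rebuilds that table by hand so that it runs without the package. The helper latest_version is hypothetical and not part of tidyfr.

```r
## Hand-built copy of the listing shown above (assumption: the real
## list_data_modules() result has at least these two columns).
mods <- data.frame(
    name = c("db_example1", "db_example2", "db_example2"),
    version = c("1.0.0", "1.0.0", "1.0.1"),
    stringsAsFactors = FALSE
)

## Hypothetical helper: return the highest version available for a module.
latest_version <- function(x, module) {
    v <- x$version[x$name == module]
    if (!length(v))
        stop("no module named '", module, "'")
    ## numeric_version compares versions component-wise, so "1.0.10"
    ## ranks above "1.0.9" (plain string comparison would not).
    as.character(max(numeric_version(v)))
}

latest_version(mods, "db_example2")
```

The returned string could then be passed as the version argument when loading a module.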

We can now choose one of the modules and load it by specifying its name, version and the base data path.

mdl <- data_module(name = "db_example2", version = "1.0.1", path = data_path)

Metadata of a module can be retrieved using the moduleName, moduleDate, moduleDescription and moduleVersion functions.

moduleName(mdl)

## [1] "CHRIS baseline"

moduleDate(mdl)

## [1] "2021-07-01"

moduleDescription(mdl)

## [1] "CHRIS baseline dataset: General information (test version)"

moduleVersion(mdl)

## [1] "1.0.1"

The actual data of a module can be retrieved with the data function. This function returns a data.frame and ensures that the data is correctly formatted: categorical variables are represented as factors with the right categories, dates are formatted correctly and missing values are represented by NA. The AIDs (sample identifiers) provided by the module are used as row names of the data.frame. If the AIDs are not unique (i.e., some are duplicated), they are not used as row names but are provided in an additional column "aid" of the returned data.frame.

d <- data(mdl)
d

##            x0_sex   x0_age x0_ager  x0_birthd         x0_birthpc x0_residpc
## 0010100001 Female 19.61465      20 1983-01-04 Vinschgau district       MALS
## 0010200002   Male 54.44558      54 1956-03-08 Vinschgau district     LATSCH
##              x0_examd x0_workf   x0_note  x0_notesaliva x0_noteint x0_noteself
## 0010100001 2016-01-02        G something something else       <NA>        <NA>
## 0010200002 2012-02-01        E      <NA>           <NA>       <NA>        <NA>
##            x0_notespiro x0_birthm x0_birthy x0_examm x0_examy
## 0010100001         <NA>         2      1983        2     2016
## 0010200002         <NA>         3      1956        1     2012
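The conversion rules described above can be illustrated in base R. This is not the code used by tidyfr, only a minimal sketch of the same ideas: a missing-value code (here -89, the code listed for the numeric variables in this module's annotation) is replaced by NA, and identifiers become row names only when they are unique. The helper names and the toy input are hypothetical.

```r
## Toy raw table as it might be read from a text file (invented values).
raw <- data.frame(
    aid = c("0010100001", "0010200002", "0010200002"),
    x0_ager = c(20, -89, 54),
    stringsAsFactors = FALSE
)

## Hypothetical helper: replace a missing-value code with NA.
code_to_na <- function(x, code) {
    x[!is.na(x) & x == code] <- NA
    x
}

## Hypothetical helper: use the identifiers as row names if they are
## unique, otherwise leave them in the "aid" column.
set_aid <- function(x) {
    if (anyDuplicated(x$aid))
        return(x)
    rownames(x) <- x$aid
    x[, colnames(x) != "aid", drop = FALSE]
}

raw$x0_ager <- code_to_na(raw$x0_ager, -89)
res_dup <- set_aid(raw)          # duplicated AIDs: "aid" column is kept
res_unq <- set_aid(raw[1:2, ])   # unique AIDs: used as row names
```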

Categorical variables are correctly converted to factors:

d$x0_sex

## [1] Female Male  
## Levels: Male Female
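Note that the levels are listed in a fixed order (Male, Female) that does not depend on the order of the values. A base R sketch of this behaviour follows. It assumes a hypothetical coding of 1 = Male and 2 = Female; the codes actually used in the module are not shown in this document.

```r
## Hypothetical category definition: code -> label.
categories <- data.frame(
    code = c(1, 2),
    label = c("Male", "Female"),
    stringsAsFactors = FALSE
)

## Raw coded values for two participants.
sex_raw <- c(2, 1)

## Levels are taken from the category definition, so their order is
## independent of the order in which values occur in the data.
sex <- factor(sex_raw, levels = categories$code, labels = categories$label)
sex
levels(sex)
```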

Dates are converted into the correct format:

d$x0_examd

## [1] "2016-01-02 UTC" "2012-02-01 UTC"
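Because the dates are date-time values rather than strings, they can be compared and used in arithmetic. The sketch below rebuilds the two values shown above as POSIXct in UTC. This class is an assumption based on the printed form; the class returned by the package may differ.

```r
## The two examination dates shown above (assumed class: POSIXct, UTC).
examd <- as.POSIXct(c("2016-01-02", "2012-02-01"), tz = "UTC")

## Which examinations took place on or after a reference date?
recent <- examd >= as.POSIXct("2015-01-01", tz = "UTC")

## Time between the two examinations, in days.
gap <- difftime(examd[1], examd[2], units = "days")

## Calendar year of each examination as an integer.
exam_year <- as.integer(format(examd, "%Y"))
```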

Annotations (labels) for the individual variables can be retrieved using the labels function:

labels(mdl)

##                       label  unit        type  min  max    missing
## x0_age               x0_age  year       float    0  100        -89
## x0_ager             x0_ager  year     integer    0  100        -89
## x0_sex               x0_sex       categorical   NA   NA        -89
## x0_birthpc       x0_birthpc       categorical   NA   NA        -89
## x0_residpc       x0_residpc       categorical   NA   NA        -89
## x0_examd           x0_examd  date        date   NA   NA 0000-00-00
## x0_workf           x0_workf         character   NA   NA    missing
## x0_birthd         x0_birthd              date   NA   NA 0000-00-00
## x0_note             x0_note         character   NA   NA    missing
## x0_notesaliva x0_notesaliva         character   NA   NA    missing
## x0_noteint       x0_noteint         character   NA   NA    missing
## x0_noteself     x0_noteself         character   NA   NA    missing
## x0_notespiro   x0_notespiro         character   NA   NA    missing
## x0_birthm         x0_birthm month     integer    1   12        -89
## x0_birthy         x0_birthy  year     integer 1900 2021        -89
## x0_examm           x0_examm month     integer    1   12        -89
## x0_examy           x0_examy  year     integer 2000 2021        -89
##                                 description
## x0_age           Age at examination (years)
## x0_ager          Age at examination (years)
## x0_sex                                  Sex
## x0_birthpc     Birthplace (offical) - coded
## x0_residpc    Place of recidence (official)
## x0_examd                Date of examination
## x0_workf                           Workflow
## x0_birthd                     Date of birth
## x0_note                 Participation notes
## x0_notesaliva       Notes saliva collection
## x0_noteint                  Notes interview
## x0_noteself                Notes self-admin
## x0_notespiro            Notes spiralography
## x0_birthm                    Month of birth
## x0_birthy                     Year of birth
## x0_examm               Month of examination
## x0_examy                Year of examination
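Because the annotation lists the type and the allowed range of each variable, it can drive simple consistency checks. The sketch below rebuilds three rows of the annotation shown above by hand and tests whether values fall within the annotated min and max. The helper in_range is hypothetical and not a tidyfr function.

```r
## Three rows copied from the annotation table above.
ann <- data.frame(
    type = c("float", "integer", "integer"),
    min = c(0, 1, 1900),
    max = c(100, 12, 2021),
    row.names = c("x0_age", "x0_birthm", "x0_birthy"),
    stringsAsFactors = FALSE
)

## Hypothetical helper: TRUE where a value lies within the annotated
## range of a variable. Missing values (NA) yield NA.
in_range <- function(values, variable, annotation) {
    values >= annotation[variable, "min"] &
        values <= annotation[variable, "max"]
}

in_range(c(19.61465, 54.44558), "x0_age", ann)
in_range(c(2, 13), "x0_birthm", ann)
```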

The groups function retrieves the (optional) assignment of variables to groups, and the grp_labels function returns the description of these groups. A variable can belong to more than one group.

groups(mdl)

##            group         label
## 1         person        x0_sex
## 2         person        x0_age
## 3         person       x0_ager
## 4         person     x0_birthd
## 5         person     x0_birthm
## 6         person     x0_birthy
## 7         person    x0_birthpc
## 8         person    x0_residpc
## 9            age        x0_age
## 10           age       x0_ager
## 11     birthdate     x0_birthd
## 12     birthdate     x0_birthm
## 13     birthdate     x0_birthy
## 14 participation      x0_examd
## 15 participation      x0_examm
## 16 participation      x0_examy
## 17 participation      x0_workf
## 18 participation       x0_note
## 19 participation x0_notesaliva
## 20 participation    x0_noteint
## 21 participation   x0_noteself
## 22 participation  x0_notespiro
## 23      examdate      x0_examd
## 24      examdate      x0_examm
## 25      examdate      x0_examy
## 26         notes       x0_note
## 27         notes x0_notesaliva
## 28         notes    x0_noteint
## 29         notes   x0_noteself
## 30         notes  x0_notespiro

grp_labels(mdl)

##                       group                description
## person               person              Personal data
## participation participation Participation-related data
## age                     age                        Age
## birthdate         birthdate                 Birth date
## examdate           examdate           Examination date
## notes                 notes                      Notes
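The grouping table makes it easy to select all variables that belong to a group. The following minimal sketch uses hand-built excerpts of the grouping table and of the data shown earlier so that it runs on its own. The helper vars_in_group is hypothetical.

```r
## Excerpt of the grouping table shown above.
grp <- data.frame(
    group = c("person", "person", "age", "age", "birthdate"),
    label = c("x0_sex", "x0_age", "x0_age", "x0_ager", "x0_birthd"),
    stringsAsFactors = FALSE
)

## Excerpt of the data shown earlier.
d_sub <- data.frame(
    x0_sex = c("Female", "Male"),
    x0_age = c(19.61465, 54.44558),
    x0_ager = c(20, 54),
    row.names = c("0010100001", "0010200002"),
    stringsAsFactors = FALSE
)

## Hypothetical helper: names of the variables assigned to a group.
vars_in_group <- function(groups, group) {
    groups$label[groups$group == group]
}

## Keep only the columns of the "age" group that are present in the data.
age_vars <- intersect(vars_in_group(grp, "age"), colnames(d_sub))
d_sub[, age_vars, drop = FALSE]
```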

Data export

Data can be exported using the export_tdf function. See the help page of this function (?export_tdf) for more details.

Session information

sessionInfo()

## R Under development (unstable) (2023-03-16 r83996)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.2 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## 
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] tidyfr_0.99.16   BiocStyle_2.27.1
## 
## loaded via a namespace (and not attached):
##  [1] Matrix_1.5-3                jsonlite_1.8.4             
##  [3] compiler_4.3.0              BiocManager_1.30.20        
##  [5] SummarizedExperiment_1.29.1 Biobase_2.59.0             
##  [7] stringr_1.5.0               GenomicRanges_1.51.4       
##  [9] bitops_1.0-7                jquerylib_0.1.4            
## [11] systemfonts_1.0.4           IRanges_2.33.0             
## [13] textshaping_0.3.6           yaml_2.3.7                 
## [15] fastmap_1.1.1               lattice_0.20-45            
## [17] XVector_0.39.0              R6_2.5.1                   
## [19] GenomeInfoDb_1.35.16        knitr_1.42                 
## [21] BiocGenerics_0.45.3         DelayedArray_0.25.0        
## [23] bookdown_0.33               desc_1.4.2                 
## [25] MatrixGenerics_1.11.1       rprojroot_2.0.3            
## [27] GenomeInfoDbData_1.2.10     bslib_0.4.2                
## [29] rlang_1.1.0                 cachem_1.0.7               
## [31] stringi_1.7.12              xfun_0.38                  
## [33] fs_1.6.1                    sass_0.4.5                 
## [35] memoise_2.0.1               cli_3.6.1                  
## [37] pkgdown_2.0.7.9000          magrittr_2.0.3             
## [39] zlibbioc_1.45.0             grid_4.3.0                 
## [41] digest_0.6.31               lifecycle_1.0.3            
## [43] S4Vectors_0.37.4            vctrs_0.6.1                
## [45] evaluate_0.20               glue_1.6.2                 
## [47] ragg_1.2.5                  RCurl_1.98-1.12            
## [49] stats4_4.3.0                rmarkdown_2.21             
## [51] purrr_1.0.1                 matrixStats_0.63.0         
## [53] tools_4.3.0                 htmltools_0.5.5