Authors: Johannes Rainer [cre] (https://orcid.org/0000-0002-6977-7147), Michele Filosi [aut] (https://orcid.org/0000-0002-3872-347X)
Last modified: 2023-04-04 07:07:34.433562
Compiled: Tue Apr 4 07:14:10 2023
The Textual Dataset File Format (TDFF) is a file format that stores data from biomedical experiments in simple text files, which makes the data easier to extend, store and edit. A detailed description of the format is provided here. tidyfr is an R package designed to facilitate import and export of data in TDFF to and from R.
Data from different modules (i.e. data sets/sources) in TDFF are stored in separate folders under a base data directory. Different versions of the same data set (module) are also stored in separate folders, which allows loading data from a specific version of a module.
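The folder organization described above can be pictured with a small base R sketch. The layout used here, `<base>/<module>/<version>/`, and the `tdff_demo` folder are assumptions made for illustration only; the actual layout is defined by the TDFF description, and no tidyfr functions are used.

```r
# Assumed layout for illustration only: <base>/<module>/<version>/
base_dir <- file.path(tempdir(), "tdff_demo")
module_dirs <- file.path(base_dir,
                         c("db_example1/1.0.0",
                           "db_example2/1.0.0",
                           "db_example2/1.0.1"))
for (module_dir in module_dirs)
    dir.create(module_dir, recursive = TRUE, showWarnings = FALSE)

# List all folders below the base directory and keep only the
# <module>/<version> entries (those containing a path separator).
version_dirs <- list.dirs(base_dir, full.names = FALSE, recursive = TRUE)
version_dirs <- version_dirs[grepl("/", version_dirs, fixed = TRUE)]

# One row per module version found on disk.
layout <- data.frame(name = dirname(version_dirs),
                     version = basename(version_dirs))
layout
```

Because each version lives in its own folder, adding a new version of a module does not modify the files of an older one.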
Below we use some (completely randomized) test data bundled with this package to illustrate the use of the tidyfr package.
First we define the base data folder that contains the data sets.
library(tidyfr)
data_path <- system.file("txt", package = "tidyfr")
For real use of the package, this data_path variable should point to the data directory containing the real data. Next, we list the available modules using the list_data_modules function:
list_data_modules(data_path)
## name version
## 1 db_example1 1.0.0
## 2 db_example2 1.0.0
## 3 db_example2 1.0.1
## description
## 1 CHRIS baseline dataset: General information (test version)
## 2 CHRIS baseline dataset: General information (test version)
## 3 CHRIS baseline dataset: General information (test version)
We can now choose one of the modules and load it by specifying its name, version and the base data path.
mdl <- data_module(name = "db_example2", version = "1.0.1", path = data_path)
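When several versions of a module are available, it can be convenient to pick the most recent one programmatically. The sketch below uses only base R on a hand-built stand-in for the table returned by list_data_modules; the helper `latest_version` is hypothetical and not part of tidyfr.

```r
# Stand-in for the name and version columns printed by list_data_modules().
modules <- data.frame(
    name = c("db_example1", "db_example2", "db_example2"),
    version = c("1.0.0", "1.0.0", "1.0.1")
)

# Hypothetical helper (not part of tidyfr): return the highest version
# available for one module, comparing versions component by component.
latest_version <- function(x, module) {
    v <- x$version[x$name == module]
    if (!length(v))
        stop("no module named '", module, "'")
    as.character(max(numeric_version(v)))
}

latest <- latest_version(modules, "db_example2")
latest
```

Comparing with `numeric_version` rather than as plain strings avoids mistakes such as ranking "1.0.10" below "1.0.9". The result could then be passed as the version argument of data_module.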
Various information from a module can be retrieved using the moduleName, moduleDate, moduleDescription and moduleVersion functions.
moduleName(mdl)
## [1] "CHRIS baseline"
moduleDate(mdl)
## [1] "2021-07-01"
moduleDescription(mdl)
## [1] "CHRIS baseline dataset: General information (test version)"
moduleVersion(mdl)
## [1] "1.0.1"
The actual data from a data module can be retrieved with the data function. This function returns a data.frame and ensures that the data is correctly formatted (i.e. categorical variables are represented as factors with the right categories, dates are formatted correctly and missing values are represented by NA). The AIDs (identifiers of the samples) provided by the data module are used as row names of the data.frame. If these AIDs are not unique, i.e. some AIDs are duplicated, they are not used as row names but are provided in an additional column "aid" of the returned data.frame.
d <- data(mdl)
d
## x0_sex x0_age x0_ager x0_birthd x0_birthpc x0_residpc
## 0010100001 Female 19.61465 20 1983-01-04 Vinschgau district MALS
## 0010200002 Male 54.44558 54 1956-03-08 Vinschgau district LATSCH
## x0_examd x0_workf x0_note x0_notesaliva x0_noteint x0_noteself
## 0010100001 2016-01-02 G something something else <NA> <NA>
## 0010200002 2012-02-01 E <NA> <NA> <NA> <NA>
## x0_notespiro x0_birthm x0_birthy x0_examm x0_examy
## 0010100001 <NA> 2 1983 2 2016
## 0010200002 <NA> 3 1956 1 2012
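The handling of AIDs described above (row names if unique, an "aid" column otherwise) can be mimicked in base R. This is a minimal sketch of the described behavior, not the package's implementation; the helper `set_aid` is hypothetical.

```r
# Hypothetical helper (not part of tidyfr) mimicking the described behavior.
set_aid <- function(x, aid) {
    if (anyDuplicated(aid)) {
        # Duplicated AIDs cannot be row names: keep them as a column instead.
        x <- data.frame(aid = aid, x, stringsAsFactors = FALSE)
        rownames(x) <- NULL
    } else {
        rownames(x) <- aid
    }
    x
}

values <- data.frame(x0_ager = c(20L, 54L))

# Unique AIDs become row names.
unique_ids <- set_aid(values, c("0010100001", "0010200002"))
# Duplicated AIDs are returned in an additional column "aid".
repeated_ids <- set_aid(values, c("0010100001", "0010100001"))

rownames(unique_ids)
colnames(repeated_ids)
```

Code that needs the identifiers regardless of uniqueness would have to check for the "aid" column first and fall back to the row names.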
Categorical variables are correctly converted to factors:
d$x0_sex
## [1] Female Male
## Levels: Male Female
Dates are converted into the correct format:
d$x0_examd
## [1] "2016-01-02 UTC" "2012-02-01 UTC"
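To make the kind of formatting described above more concrete, the following base R sketch converts made-up raw text values into typed columns. The raw values, the coding 1 = Male and 2 = Female, and the helper `to_na` are assumptions for illustration; the missing-value codes (-89 and 0000-00-00) follow the variable annotations of the example module. Note that this sketch produces Date objects, whereas the output above shows date-times with a UTC time zone.

```r
# Made-up raw text values, as they might be stored in a plain text file.
raw <- data.frame(
    x0_sex = c("2", "1", "-89"),
    x0_ager = c("20", "-89", "54"),
    x0_examd = c("2016-01-02", "2012-02-01", "0000-00-00"),
    stringsAsFactors = FALSE
)

# Hypothetical helper: replace a variable's missing-value code with NA.
to_na <- function(x, missing) {
    x[x == missing] <- NA
    x
}

clean <- data.frame(
    # Categorical variable: factor with a fixed set of categories
    # (the 1/2 coding is an assumption).
    x0_sex = factor(to_na(raw$x0_sex, "-89"),
                    levels = c("1", "2"), labels = c("Male", "Female")),
    # Integer variable with missing-value code -89.
    x0_ager = as.integer(to_na(raw$x0_ager, "-89")),
    # Date variable with missing-value code 0000-00-00.
    x0_examd = as.Date(to_na(raw$x0_examd, "0000-00-00"))
)
str(clean)
```

Replacing the missing-value codes before the type conversion keeps a code such as -89 from being interpreted as a real measurement or as an extra category.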
Annotations (labels) for the individual variables can be retrieved using the labels function:
labels(mdl)
## label unit type min max missing
## x0_age x0_age year float 0 100 -89
## x0_ager x0_ager year integer 0 100 -89
## x0_sex x0_sex categorical NA NA -89
## x0_birthpc x0_birthpc categorical NA NA -89
## x0_residpc x0_residpc categorical NA NA -89
## x0_examd x0_examd date date NA NA 0000-00-00
## x0_workf x0_workf character NA NA missing
## x0_birthd x0_birthd date NA NA 0000-00-00
## x0_note x0_note character NA NA missing
## x0_notesaliva x0_notesaliva character NA NA missing
## x0_noteint x0_noteint character NA NA missing
## x0_noteself x0_noteself character NA NA missing
## x0_notespiro x0_notespiro character NA NA missing
## x0_birthm x0_birthm month integer 1 12 -89
## x0_birthy x0_birthy year integer 1900 2021 -89
## x0_examm x0_examm month integer 1 12 -89
## x0_examy x0_examy year integer 2000 2021 -89
## description
## x0_age Age at examination (years)
## x0_ager Age at examination (years)
## x0_sex Sex
## x0_birthpc Birthplace (offical) - coded
## x0_residpc Place of recidence (official)
## x0_examd Date of examination
## x0_workf Workflow
## x0_birthd Date of birth
## x0_note Participation notes
## x0_notesaliva Notes saliva collection
## x0_noteint Notes interview
## x0_noteself Notes self-admin
## x0_notespiro Notes spiralography
## x0_birthm Month of birth
## x0_birthy Year of birth
## x0_examm Month of examination
## x0_examy Year of examination
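One possible use of these annotations is a plausibility check of the values against the min and max columns. The sketch below works on hand-built stand-ins for a few rows of the annotation table and of the data shown earlier; the helper `in_range` is hypothetical and not part of tidyfr.

```r
# Stand-in for a few rows and columns of the annotation table printed above.
lbl <- data.frame(
    label = c("x0_ager", "x0_birthm", "x0_examy"),
    type = c("integer", "integer", "integer"),
    min = c(0, 1, 2000),
    max = c(100, 12, 2021),
    row.names = c("x0_ager", "x0_birthm", "x0_examy")
)

# Stand-in for the corresponding columns of the data shown earlier.
values <- data.frame(
    x0_ager = c(20L, 54L),
    x0_birthm = c(2L, 3L),
    x0_examy = c(2016L, 2012L)
)

# Hypothetical helper: TRUE if all non-missing values of a variable
# lie within the annotated [min, max] range.
in_range <- function(variable, data, annotation) {
    x <- data[[variable]]
    all(x >= annotation[variable, "min"] & x <= annotation[variable, "max"],
        na.rm = TRUE)
}

range_ok <- vapply(colnames(values), in_range, logical(1),
                   data = values, annotation = lbl)
range_ok
```

Looking rows up by variable name works here because the row names of the annotation table are the variable names.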
The groups function retrieves the optional grouping of variables, and the grp_labels function retrieves the description/definition of these groups.
groups(mdl)
## group label
## 1 person x0_sex
## 2 person x0_age
## 3 person x0_ager
## 4 person x0_birthd
## 5 person x0_birthm
## 6 person x0_birthy
## 7 person x0_birthpc
## 8 person x0_residpc
## 9 age x0_age
## 10 age x0_ager
## 11 birthdate x0_birthd
## 12 birthdate x0_birthm
## 13 birthdate x0_birthy
## 14 participation x0_examd
## 15 participation x0_examm
## 16 participation x0_examy
## 17 participation x0_workf
## 18 participation x0_note
## 19 participation x0_notesaliva
## 20 participation x0_noteint
## 21 participation x0_noteself
## 22 participation x0_notespiro
## 23 examdate x0_examd
## 24 examdate x0_examm
## 25 examdate x0_examy
## 26 notes x0_note
## 27 notes x0_notesaliva
## 28 notes x0_noteint
## 29 notes x0_noteself
## 30 notes x0_notespiro
grp_labels(mdl)
## group description
## person person Personal data
## participation participation Participation-related data
## age age Age
## birthdate birthdate Birth date
## examdate examdate Examination date
## notes notes Notes
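The grouping table can be used to subset the data to the variables of one group. The sketch below shows this in base R on hand-built stand-ins for part of the grouping table and of the data shown earlier; the helper `group_variables` is hypothetical and not part of tidyfr.

```r
# Stand-in for a few rows of the grouping table printed above.
grp <- data.frame(
    group = c("age", "age", "examdate", "examdate", "examdate"),
    label = c("x0_age", "x0_ager", "x0_examd", "x0_examm", "x0_examy")
)

# Stand-in for some columns of the data shown earlier.
values <- data.frame(
    x0_age = c(19.61465, 54.44558),
    x0_ager = c(20L, 54L),
    x0_examm = c(2L, 1L),
    x0_examy = c(2016L, 2012L)
)

# Hypothetical helper: names of all variables assigned to one group.
group_variables <- function(grp, group) {
    grp$label[grp$group == group]
}

# Keep only the columns of the "age" group that are present in the data.
age_vars <- intersect(group_variables(grp, "age"), colnames(values))
age_data <- values[, age_vars, drop = FALSE]
age_data
```

A variable can belong to several groups (x0_age, for example, is listed under both person and age), so selecting by group can return overlapping sets of columns.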
Session information for the R environment used to compile this document:
## R Under development (unstable) (2023-03-16 r83996)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.2 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
##
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] tidyfr_0.99.16 BiocStyle_2.27.1
##
## loaded via a namespace (and not attached):
## [1] Matrix_1.5-3 jsonlite_1.8.4
## [3] compiler_4.3.0 BiocManager_1.30.20
## [5] SummarizedExperiment_1.29.1 Biobase_2.59.0
## [7] stringr_1.5.0 GenomicRanges_1.51.4
## [9] bitops_1.0-7 jquerylib_0.1.4
## [11] systemfonts_1.0.4 IRanges_2.33.0
## [13] textshaping_0.3.6 yaml_2.3.7
## [15] fastmap_1.1.1 lattice_0.20-45
## [17] XVector_0.39.0 R6_2.5.1
## [19] GenomeInfoDb_1.35.16 knitr_1.42
## [21] BiocGenerics_0.45.3 DelayedArray_0.25.0
## [23] bookdown_0.33 desc_1.4.2
## [25] MatrixGenerics_1.11.1 rprojroot_2.0.3
## [27] GenomeInfoDbData_1.2.10 bslib_0.4.2
## [29] rlang_1.1.0 cachem_1.0.7
## [31] stringi_1.7.12 xfun_0.38
## [33] fs_1.6.1 sass_0.4.5
## [35] memoise_2.0.1 cli_3.6.1
## [37] pkgdown_2.0.7.9000 magrittr_2.0.3
## [39] zlibbioc_1.45.0 grid_4.3.0
## [41] digest_0.6.31 lifecycle_1.0.3
## [43] S4Vectors_0.37.4 vctrs_0.6.1
## [45] evaluate_0.20 glue_1.6.2
## [47] ragg_1.2.5 RCurl_1.98-1.12
## [49] stats4_4.3.0 rmarkdown_2.21
## [51] purrr_1.0.1 matrixStats_0.63.0
## [53] tools_4.3.0 htmltools_0.5.5