Export data in the Textual Dataset File Format

The export_tdf function exports the provided data in the Textual Dataset File Format (TDFF). It first creates all required folders, checks the input, and then writes the data (see below for more information on this format).

The data is organized in the following way:

Within the base directory path, a folder named after the data set is created.
Within that folder, a folder named after the version of the data set (parameter version) is created, containing two folders, data and docs. The actual data files are stored in the data folder, while the docs folder can contain any documentation files related to the data set. The docs folder also contains a file docs.txt that is supposed to describe each added documentation file (this information needs to be added manually).
Within the base folder (with the name of the data set), a NEWS.md file is created, which is supposed to be edited manually to add information or a change log for the currently exported version of the data.
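The folder layout described above can be sketched with base R path helpers. This only builds the expected paths and does not call export_tdf; the variable names base_path, dataset_dir, version_dir and layout are illustrative assumptions, and the names of the files inside the data folder are not shown because they are not listed here.

```r
## Illustrative values; any base path, name and version would do.
base_path <- tempdir()
name <- "test_data"
version <- "1.0.0"

## <path>/<name> holds NEWS.md and one folder per version.
dataset_dir <- file.path(base_path, name)
version_dir <- file.path(dataset_dir, version)

## Paths that the description above says are created.
layout <- c(
    news = file.path(dataset_dir, "NEWS.md"),
    data = file.path(version_dir, "data"),
    docs = file.path(version_dir, "docs"),
    docs_txt = file.path(version_dir, "docs", "docs.txt"))
layout
```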

Automatic conversions performed by the export function are:

Columns in data that are of data type factor are automatically converted to the expected format (i.e. their categories are added to the mapping data.frame and the values are replaced with the indices).
If not specified, columns "min" and "max" in labels are calculated from the provided data.
Missing values in data are automatically replaced with the value of na, and this encoding is recorded in column "missing" of labels.
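The effect of these conversions on individual columns can be illustrated in base R. This is a minimal sketch of the idea, not the package's implementation; in particular, computing the range before encoding missing values is an assumption, and the variable names are illustrative.

```r
na <- -89
sex <- factor(c("Male", "Female", "Female", NA, "Male"))
weight <- c(78.5, NA, 55.2, 67.9, 84.2)

## Factor values are replaced by the index of their level;
## levels are sorted alphabetically, so Female = 1 and Male = 2.
sex_code <- as.integer(sex)

## "min" and "max" are derived from the observed (non-missing) values.
weight_range <- range(weight, na.rm = TRUE)

## Missing values are replaced with the encoding given by na.
sex_code[is.na(sex_code)] <- na
weight[is.na(weight)] <- na

sex_code
weight_range
```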

The labels_from_data function creates a template labels data.frame from the provided data. It retrieves information such as the data type of each column from the provided data and adds the corresponding values to the data.frame. Other columns, such as "description" or "unit", need to be filled out manually.

The mapping_from_data function creates a mapping data.frame for all categorical variables in data (i.e. columns in data with data type factor).
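How such a mapping could be derived from the factor columns is sketched below in base R. The helper sketch_mapping is hypothetical and only mimics the columns "label", "code" and "value" described on this page; it is not the package's mapping_from_data.

```r
d <- data.frame(
    aid = c("00101", "00102", "00103", "00104", "00105"),
    x0_sex = factor(c("Male", "Female", "Female", NA, "Male")),
    x0_age = c(45, 54, 33, 36, 66),
    stringsAsFactors = FALSE)

## Hypothetical helper: one row per level of each factor column.
sketch_mapping <- function(data) {
    is_fct <- vapply(data, is.factor, logical(1))
    res <- lapply(names(data)[is_fct], function(nm) {
        lv <- levels(data[[nm]])
        data.frame(label = nm, code = seq_along(lv), value = lv,
                   stringsAsFactors = FALSE)
    })
    ## Returns NULL if data has no factor columns.
    do.call(rbind, res)
}

m <- sketch_mapping(d)
m
```

Only x0_sex is a factor here, so the result has one row per level of that column; aid and x0_age are ignored.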

export_tdf(
  name = character(),
  description = character(),
  version = character(),
  date = character(),
  path = ".",
  data = data.frame(),
  groups = data.frame(),
  grp_labels = data.frame(),
  labels = labels_from_data(data),
  mapping = mapping_from_data(data),
  na = -89
)

labels_from_data(data, na = -89)

mapping_from_data(data)

Arguments

name: required character(1) with the name of the data module.
description: character(1) providing a description of the data.
version: required character(1) with the version of the data.
date: character(1) providing the date of the data.
path: character(1) with the base path where the folders and data files should be created. Defaults to path = ".".
data: data.frame with the data to export. The required column "aid" is expected to contain the unique identifiers of the participants. All other columns are expected to contain the data of the individual variables.
groups: data.frame with optional grouping of labels (variables) in data. Expected columns are "group" and "label". See the TDF definition for details.
grp_labels: data.frame with the names (descriptions) of the groups defined in groups.
labels: data.frame with annotations to the variables (labels) in data. See the TDF definition for details. Columns "min", "max" and "missing" will be filled by the export_tdf function if not already provided. Defaults to labels = labels_from_data(data), i.e. a labels data.frame is created from the provided data. Any annotations that are not part of the pre-defined set of columns are stored in a separate file labels_additional_info.txt.
mapping: data.frame with the definition of the levels (categories) of the categorical variables in labels. Expected columns are "label", "code" and "value". Defaults to mapping = mapping_from_data(data), i.e. a mapping data.frame is generated from the provided data.
na: the value used to represent missing values in data. Defaults to na = -89.

Value

export_tdf: (invisibly) returns a character(1) with the path to the folder where the data was stored.
labels_from_data: returns a labels data.frame based on the data in data.
mapping_from_data: returns a mapping data.frame for the categorical variables in data.

Short information on the TDF

See the official Textual Dataset Format definition for a complete description of the format.

data: contains the data of the various variables. Columns are variables, rows are study participants. Column "aid" is mandatory and contains the ID of the study participants.
labels: provides information on the variables in data. Columns are "label" (the name of the column in data), "unit" (unit of the measured value), "type" (the data type), "min" (the minimal value), "max" (the maximal value), "missing" (the value with which missing values in data are encoded) and "description" (a name/description of the variable).
mapping: contains the encoding of categorical variables (factors) in data. Required columns are: "label" (the name of the column in data), "code" (the value of the category in data) and "value" (the category, i.e. the level of the factor).
groups: optionally groups variables in data. Expected columns are "group" (the name of the group) and "label" (the name of the column in data).
grp_labels: contains descriptions for the groups. Expected columns are "group" (the name of the group) and "description" (the name/description of the group).
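Because the five tables reference each other through the "label" and "group" columns, a quick consistency check before exporting can be useful. The helper check_tdf_tables below is hypothetical and not part of the package; the rules it applies are assumptions derived from the column descriptions above, not from the official format definition.

```r
## Hypothetical helper: returns a character vector of problems (empty if none).
check_tdf_tables <- function(data, labels, mapping, groups, grp_labels) {
    problems <- character()
    if (!"aid" %in% colnames(data))
        problems <- c(problems, "data: mandatory column 'aid' is missing")
    ## Assumption: every variable column in data has an entry in labels.
    unlabeled <- setdiff(setdiff(colnames(data), "aid"), labels$label)
    if (length(unlabeled))
        problems <- c(problems, paste0("labels: no entry for ", unlabeled))
    ## Labels used in mapping and groups should be known to labels.
    unmapped <- setdiff(mapping$label, labels$label)
    if (length(unmapped))
        problems <- c(problems, paste0("mapping: unknown label ", unmapped))
    ungrouped <- setdiff(groups$label, labels$label)
    if (length(ungrouped))
        problems <- c(problems, paste0("groups: unknown label ", ungrouped))
    ## Each group should have a description in grp_labels.
    undescribed <- setdiff(groups$group, grp_labels$group)
    if (length(undescribed))
        problems <- c(problems,
                      paste0("grp_labels: no entry for ", undescribed))
    problems
}

d <- data.frame(aid = c("00101", "00102"),
                x0_sex = factor(c("Male", "Female")),
                x0_age = c(45, 54), stringsAsFactors = FALSE)
l <- data.frame(label = c("x0_sex", "x0_age"), unit = c(NA, "Year"),
                type = c("categorical", "float"), min = c(NA, 45),
                max = c(NA, 54), missing = -89,
                description = c("Sex", "Age"), stringsAsFactors = FALSE)
m <- data.frame(label = "x0_sex", code = 1:2,
                value = c("Female", "Male"), stringsAsFactors = FALSE)
g <- data.frame(group = "ginfo", label = c("x0_sex", "x0_age"),
                stringsAsFactors = FALSE)
gl <- data.frame(group = "ginfo", description = "General information",
                 stringsAsFactors = FALSE)

check_tdf_tables(d, l, m, g, gl)
```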

Author

Johannes Rainer

Examples


## Exporting a test data set. Creating a *data* data.frame with data on
## 5 individuals.
d <- data.frame(
    aid = c("00101", "00102", "00103", "00104", "00105"),
    x0_sex = factor(c("Male", "Female", "Female", NA, "Male")),
    x0_age = c(45, 54, 33, 36, 66),
    x0_weight = c(78.5, 57.2, 55.2, 67.9, 84.2))

## Generate a *labels* data.frame from the data
l <- labels_from_data(d)
l
#>               label unit        type  min  max missing description
#> x0_sex       x0_sex      categorical   NA   NA     -89            
#> x0_age       x0_age            float 33.0 66.0     -89            
#> x0_weight x0_weight            float 55.2 84.2     -89            

## Add the missing information to labels
l$unit <- c(NA, "Year", "kg")
l$description <- c("Sex", "Age", "Weight")

## Generate a *mapping* data.frame from data
m <- mapping_from_data(d)
m
#>           label code  value
#> x0_sex.1 x0_sex    1 Female
#> x0_sex.2 x0_sex    2   Male

## Create a simple grouping of all variables into a "general information"
## group
g <- data.frame(
    group = c("ginfo", "ginfo", "ginfo"),
    label = c("x0_sex", "x0_age", "x0_weight"))

## Define a description for the group
gl <- data.frame(group = "ginfo", description = "General information")

## Now export all data to a temporary folder
path <- tempdir()

## Export the data specifying the name of the module, the version and other
## information
export_tdf(name = "test_data", description = "Simple test data.",
    version = "1.0.0", date = date(), path = path, data = d,
    groups = g, grp_labels = gl, labels = l, mapping = m)
#> Data set was written to: /tmp/RtmprNgPNi/test_data/1.0.0/data