Extracting TDF-compliant data from a SummarizedExperiment — chris-SummarizedExperiment • tidyfr

The SummarizedExperiment class is a container for data from large-scale assays in biological experiments. In contrast to TDF data, a SummarizedExperiment organizes samples in columns and measurements (variables) in rows. The data, labels and groups methods extract information from such objects in a Textual Dataset File (TDF)-compliant structure:

columns are variables, rows individuals (samples).
variable IDs (labels) follow a standard format (e.g. "x0pt001").
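
The change in orientation can be illustrated with base R alone. The following is a minimal sketch and not the package's implementation: the matrix `m`, its dimnames and the object `tdf_like` are made up for illustration, and only the column name "aid" is taken from the description of the data method.

```r
## Hypothetical assay matrix: 3 variables (rows) x 4 samples (columns),
## i.e. the orientation used by a SummarizedExperiment.
m <- matrix(seq_len(12), nrow = 3, ncol = 4,
            dimnames = list(paste0("var", 1:3), paste0("sample", 1:4)))

## TDF-like orientation: transpose so that rows are samples and columns
## are variables, and keep the sample identifiers in a column "aid".
tdf_like <- data.frame(aid = colnames(m), t(m), row.names = NULL,
                       check.names = FALSE)

dim(m)        # variables x samples
dim(tdf_like) # samples x (aid + variables)
```

Each value keeps its sample/variable pairing, so `tdf_like$var2[3]` holds the same number as `m["var2", "sample3"]`.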

The available methods are:

data: exports the (quantitative) assay data from a SummarizedExperiment as a data.frame with columns representing variables and rows representing samples (study participants). The parameter assayNames. specifies which of the assays of the SummarizedExperiment should be extracted. Each variable (row) in the SummarizedExperiment is assigned a new variable ID (called label), which consists of the labelPrefix followed by an integer representing the index of the variable in the SummarizedExperiment (i.e. its row number). A letter is appended to the IDs of variables from any assay other than the first one. Thus, in the example output below, "x0xx1" corresponds to the first row in the first assay, while "x0xx1a" represents the first row in the second assay. The colnames of the SummarizedExperiment are used as sample identifiers and are returned in column "aid" of the result data.frame.
groups: retrieves a data.frame that specifies the grouping of the variables returned by data from a SummarizedExperiment. Columns (variables) containing data from the same assay of the SummarizedExperiment are assigned to the same group. In the example output below, the variables derived from the same row in different assays additionally share an "analyte_" group.
labels: extracts label annotations for the data extracted with data from a SummarizedExperiment. The returned data.frame is in the labels format of the TDF but contains additional columns with the available annotations from the SummarizedExperiment's rowData(). The rownames of the SummarizedExperiment are returned in column "description".
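
The ID scheme described for data can be sketched in a few lines of base R. This is an illustration only: `make_labels` is a hypothetical helper, not a function of tidyfr, and it mimics the unpadded IDs shown in the example output of this page (whether the package pads the index in other situations is not assumed here).

```r
## Hypothetical helper (not part of tidyfr): build variable IDs from a
## prefix, the row index and a letter suffix for assays after the first.
make_labels <- function(n_rows, assay_names, label_prefix = "x0xx") {
    ## No suffix for the first assay, then "a", "b", ... for the next ones.
    ## This sketch only supports up to 27 assays.
    suffix <- c("", letters)[seq_along(assay_names)]
    ids <- lapply(suffix, function(s) {
        paste0(label_prefix, seq_len(n_rows), s)
    })
    names(ids) <- assay_names
    ids
}

ids <- make_labels(6, c("values", "double"))
ids[["values"]]
ids[["double"]]
```

With these inputs the first element holds the IDs of the first assay without a suffix and the second element holds the same IDs with an "a" appended.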

# S3 method for SummarizedExperiment
groups(x, assayNames. = assayNames(x), labelPrefix = "x0xx", ...)

# S4 method for SummarizedExperiment
labels(object, assayNames. = assayNames(object), labelPrefix = "x0xx")

Arguments

x: a SummarizedExperiment object
assayNames.: character defining the names of the assays of the SummarizedExperiment from which data should be extracted. Defaults to all assays.
labelPrefix: character(1) defining the prefix for the variable IDs (labels) of the object.
...: Further arguments passed to downstream groups method
object: A SummarizedExperiment object.

Value

A data.frame with the extracted assay data (data), the variable grouping (groups) or the label annotations (labels).

Author

Johannes Rainer

Examples


## Create a simple SummarizedExperiment with some random data as one assay
## and a second assay with all values multiplied by 2. In a
## SummarizedExperiment, columns represent samples and rows measurements
## (variables).
mat <- matrix(rnorm(60), ncol = 10, nrow = 6)

## A SummarizedExperiment can also store column and row annotations along
## with the data. We thus define below a data.frame with some annotations
## for the variables.
rowd <- data.frame(analyte_id = paste0("id", 1:6), analyte_name = letters[1:6])
rownames(rowd) <- rowd$analyte_id

library(SummarizedExperiment)
se <- SummarizedExperiment(
    assays = list(values = mat, double = 2 * mat),
    rowData = rowd)
se
#> class: SummarizedExperiment 
#> dim: 6 10 
#> metadata(0):
#> assays(2): values double
#> rownames(6): id1 id2 ... id5 id6
#> rowData names(2): analyte_id analyte_name
#> colnames: NULL
#> colData names(0):

## What assays are available?
assayNames(se)
#> [1] "values" "double"

## Get a data.frame with all variables
data(se)
#>    aid      x0xx1       x0xx2      x0xx3        x0xx4      x0xx5      x0xx6
#> 1    1 -0.8965242  0.90966433 -0.1288219 -0.013031450  0.3597483  1.1864404
#> 2    2  2.2266181 -0.24722918 -0.5157435 -0.816109299  1.4736489  0.1180701
#> 3    3  0.1875334 -0.12046762 -0.6131071  1.118765882 -1.6989048  0.3422766
#> 4    4  0.3422617 -0.92282700 -0.7719974  0.193640410  1.0732061  0.6174077
#> 5    5  0.1515003 -0.71151095 -0.1881688 -0.774229785 -0.8924119  0.7953266
#> 6    6  0.3369002 -0.76522830 -1.2260711 -0.009805471 -0.4122893  1.0963385
#> 7    7  0.9444024  0.14786264  1.2996096  1.306019228 -0.1200253  0.5310052
#> 8    8  0.9207570 -1.22420575  1.3915198 -0.163163564 -0.7596888 -0.3985823
#> 9    9 -1.1429766 -0.87863854  1.4535518 -0.710723674  0.5847751 -0.8690617
#> 10  10  0.3760858 -0.04179954  1.0281860 -1.614637827  0.4474193 -1.2262790
#>        x0xx1a      x0xx2a     x0xx3a      x0xx4a     x0xx5a     x0xx6a
#> 1  -1.7930484  1.81932866 -0.2576439 -0.02606290  0.7194965  2.3728809
#> 2   4.4532362 -0.49445836 -1.0314870 -1.63221860  2.9472977  0.2361401
#> 3   0.3750669 -0.24093523 -1.2262142  2.23753176 -3.3978095  0.6845531
#> 4   0.6845233 -1.84565400 -1.5439949  0.38728082  2.1464122  1.2348153
#> 5   0.3030007 -1.42302190 -0.3763377 -1.54845957 -1.7848239  1.5906533
#> 6   0.6738003 -1.53045660 -2.4521422 -0.01961094 -0.8245786  2.1926771
#> 7   1.8888048  0.29572529  2.5992192  2.61203846 -0.2400506  1.0620104
#> 8   1.8415140 -2.44841150  2.7830395 -0.32632713 -1.5193776 -0.7971645
#> 9  -2.2859533 -1.75727709  2.9071036 -1.42144735  1.1695503 -1.7381233
#> 10  0.7521716 -0.08359908  2.0563720 -3.22927565  0.8948385 -2.4525579

## Get the label information
labels(se)
#>         label unit  type       min       max missing      description
#> x0xx1   x0xx1      float -1.142977 2.2266181     -89              id1
#> x0xx2   x0xx2      float -1.224206 0.9096643     -89              id2
#> x0xx3   x0xx3      float -1.226071 1.4535518     -89              id3
#> x0xx4   x0xx4      float -1.614638 1.3060192     -89              id4
#> x0xx5   x0xx5      float -1.698905 1.4736489     -89              id5
#> x0xx6   x0xx6      float -1.226279 1.1864404     -89              id6
#> x0xx1a x0xx1a      float -2.285953 4.4532362     -89 id1 assay double
#> x0xx2a x0xx2a      float -2.448412 1.8193287     -89 id2 assay double
#> x0xx3a x0xx3a      float -2.452142 2.9071036     -89 id3 assay double
#> x0xx4a x0xx4a      float -3.229276 2.6120385     -89 id4 assay double
#> x0xx5a x0xx5a      float -3.397810 2.9472977     -89 id5 assay double
#> x0xx6a x0xx6a      float -2.452558 2.3728809     -89 id6 assay double
#>        analyte_id analyte_name
#> x0xx1         id1            a
#> x0xx2         id2            b
#> x0xx3         id3            c
#> x0xx4         id4            d
#> x0xx5         id5            e
#> x0xx6         id6            f
#> x0xx1a        id1            a
#> x0xx2a        id2            b
#> x0xx3a        id3            c
#> x0xx4a        id4            d
#> x0xx5a        id5            e
#> x0xx6a        id6            f

## Get the variable grouping
groups(se)
#>            group  label
#> 1   assay_values  x0xx1
#> 2   assay_values  x0xx2
#> 3   assay_values  x0xx3
#> 4   assay_values  x0xx4
#> 5   assay_values  x0xx5
#> 6   assay_values  x0xx6
#> 7   assay_double x0xx1a
#> 8   assay_double x0xx2a
#> 9   assay_double x0xx3a
#> 10  assay_double x0xx4a
#> 11  assay_double x0xx5a
#> 12  assay_double x0xx6a
#> 13 analyte_x0xx1  x0xx1
#> 14 analyte_x0xx1 x0xx1a
#> 15 analyte_x0xx2  x0xx2
#> 16 analyte_x0xx2 x0xx2a
#> 17 analyte_x0xx3  x0xx3
#> 18 analyte_x0xx3 x0xx3a
#> 19 analyte_x0xx4  x0xx4
#> 20 analyte_x0xx4 x0xx4a
#> 21 analyte_x0xx5  x0xx5
#> 22 analyte_x0xx5 x0xx5a
#> 23 analyte_x0xx6  x0xx6
#> 24 analyte_x0xx6 x0xx6a
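
The structure of the grouping table above can be reproduced with base R to make its two kinds of groups explicit. This is a sketch under the assumption that the table is simply the assay-wise groups followed by the analyte-wise groups, as the printed output suggests; the object names (`labels_by_assay`, `by_assay`, `by_analyte`, `grp`) are made up and this is not how tidyfr builds the result internally.

```r
## Labels as shown in the output above: 6 variables in two assays.
assay_names <- c("values", "double")
base_ids <- paste0("x0xx", 1:6)
labels_by_assay <- list(values = base_ids, double = paste0(base_ids, "a"))

## One group per assay ...
by_assay <- data.frame(
    group = rep(paste0("assay_", assay_names), lengths(labels_by_assay)),
    label = unlist(labels_by_assay, use.names = FALSE))

## ... and one group per analyte, linking the same row across assays.
by_analyte <- data.frame(
    group = rep(paste0("analyte_", base_ids), each = length(assay_names)),
    label = as.vector(rbind(labels_by_assay[["values"]],
                            labels_by_assay[["double"]])))

grp <- rbind(by_assay, by_analyte)

## All labels that belong to the first analyte
split(grp$label, grp$group)[["analyte_x0xx1"]]
```

Splitting the labels by group, as in the last line, is a convenient way to look up which columns of the data table belong together.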