The groupByCorrelation() function groups the rows of a numeric matrix based on their pairwise correlation.

Two types of groupings are available:

  • inclusive = FALSE (the default): the algorithm creates small groups of highly correlated members in which every pair of members has a correlation >= threshold. Note that with this algorithm, a row in x can still have a correlation >= threshold with one or more elements of a group it is not part of. See the Note section below for more information.

  • inclusive = TRUE: the algorithm creates large groups containing rows that have a correlation >= threshold with at least one element of that group. For example, if rows 1 and 3 have a correlation above the threshold, and so do rows 3 and 5 (while the correlation between rows 1 and 5 is below the threshold), all three rows (1, 3 and 5) are placed in the same group.
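The inclusive = TRUE behaviour described above can be sketched in base R as finding connected components in the thresholded correlation matrix. This is an illustration only, not the function's actual implementation: group_inclusive() is a hypothetical helper, and the correlation matrix is hand-made so that rows 1, 3 and 5 form the chain from the example.

```r
## Hypothetical helper (not part of the package): rows end up in the same
## group if they are connected through a chain of correlations >= threshold.
group_inclusive <- function(cors, threshold = 0.9) {
    n <- nrow(cors)
    linked <- !is.na(cors) & cors >= threshold
    grp <- seq_len(n)                  # every row starts in its own group
    repeat {
        changed <- FALSE
        for (i in seq_len(n)) {
            ## smallest label among row i and all rows linked to it
            lowest <- min(grp[linked[i, ] | seq_len(n) == i])
            if (lowest < grp[i]) {
                grp[i] <- lowest
                changed <- TRUE
            }
        }
        if (!changed) break
    }
    ## renumber groups in order of first appearance
    factor(match(grp, unique(grp)))
}

## Hand-made correlation matrix: 1-3 and 3-5 are above the threshold,
## 1-5 is below it, rows 2 and 4 are unrelated to everything else.
cors <- diag(5)
cors[1, 3] <- cors[3, 1] <- 0.95
cors[3, 5] <- cors[5, 3] <- 0.93
cors[1, 5] <- cors[5, 1] <- 0.40

res_inclusive <- group_inclusive(cors)
res_inclusive
```

Rows 1, 3 and 5 share one group although rows 1 and 5 are only weakly correlated with each other.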

With parameter f it is also possible to pre-define groups of rows that should be further sub-grouped based on their correlation with each other. If f is provided, correlations are calculated only between rows with the same value in f, and each pre-defined group is then split into sub-groups based on pairwise correlation. The returned factor is f with the sub-group appended, separated by a ".". See examples below.
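How the returned labels combine f with the sub-group can be sketched as follows. The sub-group numbers are copied from the output shown in the Examples section, and the three-digit zero padding is inferred from that same output; the variable names are hypothetical.

```r
## Pre-defined grouping of the 7 rows (same f as in the Examples section).
f <- c(1, 2, 2, 1, 1, 2, 2)

## Correlations would only be evaluated within each of these index sets.
split(seq_along(f), f)

## Sub-group of each row within its pre-defined group, taken from the
## documented example output (not computed here).
sub <- c(1L, 1L, 2L, 1L, 2L, 1L, 1L)

## Compose the final label: value of f, a ".", and the padded sub-group.
labels <- factor(paste0(f, ".", sprintf("%03d", sub)))
labels
```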

groupByCorrelation(
  x,
  method = "pearson",
  use = "pairwise.complete.obs",
  threshold = 0.9,
  f = NULL,
  inclusive = FALSE
)

Arguments

x

numeric matrix whose rows should be grouped. Rows are grouped if the correlation of their values across columns is >= threshold.

method

character(1) with the method to be used for correlation. See cor() for options.

use

character(1) defining which values should be used for the correlation. See cor() for details.

threshold

numeric(1) defining the cut-off value: rows with a correlation >= threshold are considered to be correlated and hence grouped.

f

optional vector of length equal to nrow(x) pre-defining groups of rows in x that should be further sub-grouped. See description for details.

inclusive

logical(1) whether a version of the grouping algorithm should be used that leads to larger, more loosely correlated, groups. The default is inclusive = FALSE. See description for more information.
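As a rough illustration of what method and use control, the sketch below computes row-wise correlations with stats::cor(). It assumes that rows are correlated by transposing the matrix before calling cor(), which is an assumption about the internals, not something this page states.

```r
## Small matrix with one missing value in row 3.
x <- rbind(
    c(1, 3, 2, 5),
    c(2, 6, 4, 7),
    c(1, NA, 3, 1))

## cor() correlates columns, so the matrix is transposed to correlate rows.
## With use = "pairwise.complete.obs" each pair of rows is compared on the
## columns in which both have non-missing values.
cors_pairwise <- cor(t(x), method = "pearson", use = "pairwise.complete.obs")

## With cor()'s own default, use = "everything", any pair involving row 3
## yields NA instead.
cors_everything <- cor(t(x), method = "pearson", use = "everything")

cors_pairwise
cors_everything
```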

Value

factor of length nrow(x) with the group each row is assigned to.

Note

Implementation note of the grouping algorithm:

  • all correlations between rows in x which are >= threshold are identified and sorted decreasingly.

  • starting with the pair with the highest correlation, groups are defined as follows:

      • if neither of the two is in a group, both are put into the same new group.

      • if one of the two is already in a group, the other is put into the same group if all of its correlations to members of that group are >= threshold (and are not NA).

      • if both are already in the same group, nothing is done.

      • if both are in different groups: an element is put into the group of the other if a) all of its correlations to members of the other's group are not NA and >= threshold and b) its average correlation to the other group is larger than its average correlation to its own group.

This ensures that groups are defined in which all elements have a correlation >= threshold with each other and the correlation between members of the same group is maximized.
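The steps above can be sketched in base R as shown below. This is a simplified reading of the note, not the package's code: group_strict() and fits() are hypothetical helpers, the input is a ready-made correlation matrix, and two details are assumptions, namely that an element left alone in its group may always move, and that rows never paired end up as singleton groups.

```r
## Hypothetical helper (not part of the package) following the note's steps.
group_strict <- function(cors, threshold = 0.9) {
    grp <- rep(NA_integer_, nrow(cors))
    ## all pairs with correlation >= threshold, sorted decreasingly
    idx <- which(upper.tri(cors) & !is.na(cors) & cors >= threshold,
                 arr.ind = TRUE)
    idx <- idx[order(cors[idx], decreasing = TRUE), , drop = FALSE]
    ## TRUE if `i` has a non-NA correlation >= threshold with all `members`
    fits <- function(i, members) {
        vals <- cors[i, members]
        !anyNA(vals) && all(vals >= threshold)
    }
    next_id <- 1L
    for (k in seq_len(nrow(idx))) {
        i <- idx[k, 1L]
        j <- idx[k, 2L]
        if (is.na(grp[i]) && is.na(grp[j])) {
            ## neither is grouped: open a new group
            grp[c(i, j)] <- next_id
            next_id <- next_id + 1L
        } else if (is.na(grp[i]) || is.na(grp[j])) {
            ## one is grouped: the other joins only if it fits all members
            new <- if (is.na(grp[i])) i else j
            old <- if (is.na(grp[i])) j else i
            if (fits(new, which(grp == grp[old]))) grp[new] <- grp[old]
        } else if (grp[i] != grp[j]) {
            ## both grouped, different groups: try moving either element
            for (pair in list(c(i, j), c(j, i))) {
                el <- pair[1L]
                other <- pair[2L]
                if (grp[el] == grp[other]) next
                own <- setdiff(which(grp == grp[el]), el)
                target <- which(grp == grp[other])
                ## assumption: an element alone in its group may always move
                better <- length(own) == 0L ||
                    mean(cors[el, target]) > mean(cors[el, own])
                if (fits(el, target) && better) grp[el] <- grp[other]
            }
        }
    }
    ## assumption: rows never paired become singleton groups
    lone <- which(is.na(grp))
    grp[lone] <- next_id - 1L + seq_along(lone)
    factor(match(grp, unique(grp)))
}

## Hand-made correlation matrix: row 3 correlates with row 1 (0.95) but
## not with row 2 (0.5), so it cannot join the group of rows 1 and 2.
cors <- diag(5)
cors[1, 2] <- cors[2, 1] <- 0.99
cors[1, 3] <- cors[3, 1] <- 0.95
cors[2, 3] <- cors[3, 2] <- 0.50
cors[3, 4] <- cors[4, 3] <- 0.92

res_strict <- group_strict(cors)
res_strict
```

In this toy input row 3 stays out of the group of rows 1 and 2 despite its high correlation with row 1, which illustrates the caveat mentioned for inclusive = FALSE in the description.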

See also

Other grouping operations: groupEicCorrelation(), groupToSinglePolarityPairs()

Author

Johannes Rainer

Examples

x <- rbind(
    c(1, 3, 2, 5),
    c(2, 6, 4, 7),
    c(1, 1, 3, 1),
    c(1, 3, 3, 6),
    c(0, 4, 3, 1),
    c(1, 4, 2, 6),
    c(2, 8, 2, 12))

## define which rows have a high correlation with each other
groupByCorrelation(x)
#> [1] 1 1 2 3 4 1 1
#> Levels: 1 2 3 4

## assuming we have some prior grouping of rows, further sub-group them
## based on pairwise correlation.
f <- c(1, 2, 2, 1, 1, 2, 2)
groupByCorrelation(x, f = f)
#> [1] 1.001 2.001 2.002 1.001 1.002 2.001 2.001
#> Levels: 1.001 1.002 2.001 2.002