The groupByCorrelation function groups rows of a numeric matrix based on
their pairwise correlation with each other.
Two types of grouping are available:
- inclusive = FALSE (the default): the algorithm creates small groups of highly correlated members, in which every pair of members has a correlation >= threshold. Note that with this algorithm, a row in x can still have a correlation >= threshold with one or more elements of a group it is not part of. See the implementation note below for more information.
- inclusive = TRUE: the algorithm creates large groups in which each row has a correlation >= threshold with at least one other element of that group. For example, if rows 1 and 3 have a correlation above the threshold, and rows 3 and 5 do too (but the correlation between rows 1 and 5 is below the threshold), all three rows (1, 3 and 5) are placed in the same group.
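The inclusive = TRUE behaviour described above resembles finding connected components in a graph whose edges link rows with a correlation >= threshold. The following is a minimal base R sketch of that idea, not the package's implementation: the helper name group_inclusive_sketch and the toy correlation matrix are made up for illustration.

```r
## Hypothetical helper (not part of the package): label the connected
## components of the graph in which two rows are linked when their
## correlation is >= threshold.
group_inclusive_sketch <- function(cors, threshold = 0.9) {
    linked <- !is.na(cors) & cors >= threshold
    n <- nrow(linked)
    grp <- rep(NA_integer_, n)
    id <- 0L
    for (start in seq_len(n)) {
        if (!is.na(grp[start])) next        # already assigned to a group
        id <- id + 1L
        queue <- start
        while (length(queue)) {
            cur <- queue[1L]
            queue <- queue[-1L]
            if (!is.na(grp[cur])) next
            grp[cur] <- id
            ## add all not yet assigned rows linked to the current row
            queue <- c(queue, which(linked[cur, ] & is.na(grp)))
        }
    }
    factor(grp)
}

## Toy correlation matrix for 5 rows (values are made up): rows 1-3 and
## 3-5 exceed the threshold, rows 1-5 do not.
cors <- diag(5)
cors[1, 3] <- cors[3, 1] <- 0.95
cors[3, 5] <- cors[5, 3] <- 0.92
cors[1, 5] <- cors[5, 1] <- 0.5

grp_incl <- group_inclusive_sketch(cors)
grp_incl
```

In this toy case rows 1, 3 and 5 end up in one group, although rows 1 and 5 are not directly correlated above the threshold; rows 2 and 4 each form their own group.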
With parameter f it is also possible to pre-define groups of
rows that should be further sub-grouped based on correlation with each other.
In other words, if f is provided, correlations are calculated only between
rows with the same value in f, and these pre-defined groups of rows
are then split into subgroups based on pairwise correlation. The returned factor
is f with the subgroup appended, separated by a ".". See examples below.
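To make the role of f more concrete, here is a small base R sketch using the data from the examples below. It only illustrates the two ideas stated above: correlations are restricted to rows sharing a value in f, and the final label is f plus a subgroup index. The subgroup indices in sub are copied from the documented example output rather than computed, and the three-digit zero padding is an assumption inferred from that output.

```r
x <- rbind(
    c(1, 3, 2, 5),
    c(2, 6, 4, 7),
    c(1, 1, 3, 1),
    c(1, 3, 3, 6),
    c(0, 4, 3, 1),
    c(1, 4, 2, 6),
    c(2, 8, 2, 12))
f <- c(1, 2, 2, 1, 1, 2, 2)

## Correlations are only computed between rows with the same value in f:
## one correlation matrix per pre-defined group.
cors_by_f <- lapply(split(seq_len(nrow(x)), f), function(idx) {
    cors <- cor(t(x[idx, , drop = FALSE]))
    dimnames(cors) <- list(as.character(idx), as.character(idx))
    cors
})

## Subgroup index of each row within its pre-defined group (taken from the
## documented example output, not computed here).
sub <- c(1L, 1L, 2L, 1L, 2L, 1L, 1L)

## Compose the final labels: value of f, a ".", and the subgroup index.
labels <- factor(paste0(f, ".", sprintf("%03d", sub)))
labels
```

Rows 1 and 4 (both with f equal to 1) are highly correlated and share the label 1.001, while row 5 gets its own subgroup 1.002.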
groupByCorrelation(x, method = "pearson", use = "pairwise.complete.obs", threshold = 0.9, f = NULL, inclusive = FALSE)
Argument | Description |
---|---|
x | |
method | |
use | |
threshold | |
f | optional vector of length equal to |
inclusive | |
A factor with the same length as nrow(x), giving the group each row
is assigned to.
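As a rough check of what the returned factor means, the sketch below takes the grouping reported in the documented example (for the default inclusive = FALSE) and inspects the pairwise correlations with base R. The grouping is typed in by hand from that output; computing the row correlations with stats::cor is an assumption made for illustration.

```r
x <- rbind(
    c(1, 3, 2, 5),
    c(2, 6, 4, 7),
    c(1, 1, 3, 1),
    c(1, 3, 3, 6),
    c(0, 4, 3, 1),
    c(1, 4, 2, 6),
    c(2, 8, 2, 12))

## Grouping as reported in the documented example output.
grp <- factor(c(1, 1, 2, 3, 4, 1, 1))

## Pairwise correlations between rows.
cors <- cor(t(x), method = "pearson", use = "pairwise.complete.obs")

## Smallest pairwise correlation within each group (NA for groups with a
## single member).
min_within <- vapply(split(seq_len(nrow(x)), grp), function(idx) {
    if (length(idx) < 2L) return(NA_real_)
    sub <- cors[idx, idx]
    min(sub[upper.tri(sub)])
}, numeric(1))
min_within

## Row 4 is not part of group 1 although it is highly correlated with row 1,
row4_vs_row1 <- cors[4, 1] >= 0.9
## because its correlation with row 2, another member of group 1, is too low.
row4_vs_row2 <- cors[4, 2] >= 0.9
```

This illustrates the remark made for inclusive = FALSE: a row can exceed the threshold with some members of a group and still be left out because it does not exceed it with all of them.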
Implementation note of the grouping algorithm:
- All correlations between rows in x that are >= threshold are identified and sorted in decreasing order.
- Starting with the pair with the highest correlation, groups are defined as follows:
  - If neither of the two is in a group, both are put into the same new group.
  - If one of the two is already in a group, the other is put into the same group if all of its correlations to the members of that group are >= threshold (and are not NA).
  - If both are already in the same group, nothing is done.
  - If both are in different groups, an element is put into the group of the other if a) all of its correlations to members of the other's group are not NA and >= threshold, and b) its average correlation to the other group is larger than its average correlation to its own group.

This ensures that groups are defined in which all elements have a correlation
>= threshold with each other and in which the correlation between members of the
same group is maximized.
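The steps above can be sketched in base R as follows. This is a simplified illustration under stated assumptions, not the package's code: the helper name group_by_cor_sketch is hypothetical, rows that never join a group are given their own group, an element's own group average excludes the element itself, and in the "different groups" case the first element of the pair is tried before the second.

```r
## Hypothetical helper, written only to illustrate the steps described above.
group_by_cor_sketch <- function(x, threshold = 0.9) {
    cors <- cor(t(x), use = "pairwise.complete.obs")
    grp <- rep(NA_integer_, nrow(cors))
    ## All pairs (i < j) with a correlation >= threshold, sorted decreasingly.
    idx <- which(upper.tri(cors) & !is.na(cors) & cors >= threshold,
                 arr.ind = TRUE)
    idx <- idx[order(cors[idx], decreasing = TRUE), , drop = FALSE]
    ## TRUE if row i has non-NA correlations >= threshold with all members.
    all_above <- function(i, members) {
        vals <- cors[i, members]
        !anyNA(vals) && all(vals >= threshold)
    }
    next_id <- 1L
    for (k in seq_len(nrow(idx))) {
        i <- idx[k, 1L]
        j <- idx[k, 2L]
        gi <- grp[i]
        gj <- grp[j]
        if (is.na(gi) && is.na(gj)) {
            ## Neither is grouped: create a new group with both.
            grp[c(i, j)] <- next_id
            next_id <- next_id + 1L
        } else if (is.na(gi) || is.na(gj)) {
            ## Exactly one is grouped: the other joins if it fits all members.
            new <- if (is.na(gi)) i else j
            old <- if (is.na(gi)) j else i
            if (all_above(new, which(grp == grp[old])))
                grp[new] <- grp[old]
        } else if (gi != gj) {
            ## Both grouped differently: try to move i, then j (assumed order).
            for (p in list(c(i, j), c(j, i))) {
                el <- p[1L]
                own <- setdiff(which(grp == grp[el]), el)
                oth <- which(grp == grp[p[2L]])
                if (all_above(el, oth) &&
                    (length(own) == 0L ||
                     isTRUE(mean(cors[el, oth]) >
                            mean(cors[el, own], na.rm = TRUE)))) {
                    grp[el] <- grp[p[2L]]
                    break
                }
            }
        }
        ## Both already in the same group: nothing to do.
    }
    ## Rows that never joined a group form their own group (assumption).
    nas <- is.na(grp)
    grp[nas] <- next_id - 1L + seq_len(sum(nas))
    ## Relabel groups in order of first appearance.
    factor(match(grp, unique(grp)))
}

x <- rbind(
    c(1, 3, 2, 5), c(2, 6, 4, 7), c(1, 1, 3, 1), c(1, 3, 3, 6),
    c(0, 4, 3, 1), c(1, 4, 2, 6), c(2, 8, 2, 12))
res <- group_by_cor_sketch(x)
res
```

On the example matrix this sketch places rows 1, 2, 6 and 7 in one group and leaves rows 3, 4 and 5 as single-member groups, which agrees with the grouping shown in the documented example for the default settings.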
Other grouping operations: groupEicCorrelation(), groupToSinglePolarityPairs()
Johannes Rainer
x <- rbind(
    c(1, 3, 2, 5),
    c(2, 6, 4, 7),
    c(1, 1, 3, 1),
    c(1, 3, 3, 6),
    c(0, 4, 3, 1),
    c(1, 4, 2, 6),
    c(2, 8, 2, 12))

## define which rows have a high correlation with each other
groupByCorrelation(x)
#> [1] 1 1 2 3 4 1 1
#> Levels: 1 2 3 4

## assuming we have some prior grouping of rows, further sub-group them
## based on pairwise correlation.
f <- c(1, 2, 2, 1, 1, 2, 2)
groupByCorrelation(x, f = f)
#> [1] 1.001 2.001 2.002 1.001 1.002 2.001 2.001
#> Levels: 1.001 1.002 2.001 2.002