Performs guided clustering of observations in the main data modality, informed by a second, "guide" data modality. The main space is explored and clusters are formed so as to maximize the performance of a supervised classifier that predicts cluster membership from the guide features.

stoic_clustering(
  data,
  nstart = 1,
  n_trees = 500,
  n_cores = ifelse(is.na(parallel::detectCores()), 1, max(parallel::detectCores() - 1,
    1)),
  bucket_prop = 0.002,
  imbalance_limit = 0.025,
  top_n_cands_refit = 10,
  cluster_limit = 30,
  initial_cor = 0.8,
  swap_thr = 0.95,
  auc_thr = 0.7,
  n_trees_refine = 500,
  coverage_conv = 5,
  verbose = F,
  seed = NULL,
  importance_type = "permutation_conditional"
)

Arguments

data

Object returned by the `stoic_data` function.

nstart

Number of different initializations. One initialization corresponds to sampling one observation to be used as the first centroid in the procedure. As in k-means, the clustering outcome can be influenced by this random initialization, so stoic can be run for `nstart` initializations, whose results are then combined to provide the best set of clusters (non-redundant and with maximal AUC). Default: 1.

n_trees

Number of trees in the Random Forest used to predict cluster membership based on guide data. Default: 500.

n_cores

Number of cores to use for parallel computing. Default: number of detected CPU cores minus 1, with a minimum of 1; 1 if the number of cores cannot be detected.
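The default expression shown in the usage can be evaluated on its own to see how many cores would be used on the current machine. The following is a minimal sketch that relies only on `parallel::detectCores()`, which may return `NA` when the core count cannot be determined:

```r
# Reproduce the default value of n_cores outside of stoic_clustering().
detected <- parallel::detectCores()  # may be NA on some platforms

# Fall back to 1 core when detection fails; otherwise leave one core free,
# but never go below 1.
n_cores <- if (is.na(detected)) 1 else max(detected - 1, 1)
n_cores
```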

bucket_prop

Proportion of positive class observations in the leaves of the probability Random Forest. Default: 0.002. May be increased when the number of observations is small.

imbalance_limit

Minimal size of a cluster allowed, relative to the total number of observations. Training classes (e.g. cluster vs. other) are not balanced, so this parameter may be used to limit class imbalance. Default: 0.025 (2.5%).
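As a rough illustration of what this limit implies, the sketch below compares invented cluster sizes with the default limit. The sizes, the number of observations, and the use of a non-strict comparison are assumptions made for the example; how stoic applies the limit internally is not shown here.

```r
# Invented numbers for illustration only.
n_obs <- 2000
cluster_sizes <- c(a = 120, b = 40, c = 60)
imbalance_limit <- 0.025

# Fraction of all observations that falls in each cluster.
cluster_fraction <- cluster_sizes / n_obs

# Clusters whose fraction reaches the limit (assumed comparison: >=).
large_enough <- cluster_fraction >= imbalance_limit
large_enough
```

With these numbers, clusters `a` (6%) and `c` (3%) pass the limit, while `b` (2%) does not.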

top_n_cands_refit

Maximal number of models fitted to check whether new candidate observations for the centroid update actually increase the AUC. Default (and recommended): 10.

cluster_limit

Maximal number of clusters to be found for one initialization. Default: 30.

initial_cor

Initial correlation with the centroid above which observations belong to a cluster (and are considered part of the positive class during model training). Default: 0.8. This threshold is then optimized using the AUC during the procedure.
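The idea of assigning observations to a cluster by thresholding their correlation with a centroid can be sketched in base R. This is an illustration only: the toy data are invented, and the use of Pearson correlation (the default of `cor()`) is an assumption, since the correlation measure used by stoic is not stated here.

```r
# Toy main data: one centroid and four observations over five features.
centroid <- c(1, 2, 3, 4, 5)
main_data <- rbind(
  obs_a = c(2, 4, 6, 8, 10),  # perfectly correlated with the centroid
  obs_b = c(1, 2, 3, 5, 4),   # close to the centroid profile
  obs_c = c(5, 4, 3, 2, 1),   # anti-correlated
  obs_d = c(3, 1, 4, 1, 5)    # weakly correlated
)

# Correlation of each observation (row) with the centroid.
# Assumption: Pearson correlation.
cors <- apply(main_data, 1, function(x) cor(x, centroid))

# Observations above the initial threshold form the positive class.
initial_cor <- 0.8
in_cluster <- cors > initial_cor
in_cluster
```

Here `obs_a` and `obs_b` (correlations of 1 and 0.9) fall in the cluster, while `obs_c` and `obs_d` do not.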

swap_thr

Score above which two clusters can be considered redundant in terms of guide features. It is used internally, together with an overlap condition and a correlation condition, to determine whether two clusters are redundant (their centroids have converged to the same region in the main data and are interchangeable with respect to the guide data). Default: 0.95, meaning that no more than a 5% deterioration of AUC is tolerated when swapping the models of two clusters.
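One way to read the 5% tolerance is as a ratio between the AUC obtained with the other cluster's model and the AUC obtained with the cluster's own model. The sketch below follows that reading; the `swap_score` helper and its formula are hypothetical and are not taken from stoic, whose internal score may be defined differently.

```r
# Hypothetical score (assumption): AUC after swapping models, relative to
# the AUC of the cluster's own model.
swap_score <- function(auc_own, auc_swapped) auc_swapped / auc_own

swap_thr <- 0.95

# Small deterioration (0.90 to 0.88): the score stays above the threshold.
small_loss <- swap_score(0.90, 0.88) > swap_thr

# Large deterioration (0.90 to 0.80): the score falls below the threshold.
large_loss <- swap_score(0.90, 0.80) > swap_thr

c(small_loss = small_loss, large_loss = large_loss)
```

Under this reading, only the first pair would be a candidate for redundancy, subject to the overlap and correlation conditions.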

auc_thr

Minimal AUC for a cluster to be kept in stoic once it has been optimized. Default: 0.7.

n_trees_refine

Number of trees for the last model estimation, needed to compute permutation-based feature importance. We recommend increasing the number of trees here to get stable feature importance estimates. Default: 500.

coverage_conv

Number of iterations without coverage being increased by more than 10%, needed to declare convergence and stop optimizing clusters. New clusters will stop being drawn either when this convergence threshold or the cluster limit is reached. Default: 5.
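The stopping rule described above can be sketched as a check on the most recent coverage gains. The `has_converged` helper is hypothetical, and two points are assumptions: that the 10% refers to the absolute change in the coverage fraction, and that the iterations must be consecutive.

```r
# Hypothetical helper, not part of stoic.
# coverage: fraction of observations in at least one cluster, per iteration.
has_converged <- function(coverage, coverage_conv = 5, min_gain = 0.10) {
  gains <- diff(coverage)  # absolute change between iterations (assumption)
  if (length(gains) < coverage_conv) {
    return(FALSE)          # not enough iterations observed yet
  }
  # Converged when none of the last `coverage_conv` gains exceeds min_gain.
  all(tail(gains, coverage_conv) <= min_gain)
}

# Invented coverage trajectory for illustration.
coverage <- c(0.10, 0.35, 0.55, 0.60, 0.62, 0.63, 0.63, 0.64)

early <- has_converged(coverage[1:7])  # a gain of 0.20 is still in the window
late  <- has_converged(coverage)       # last five gains are all small
c(early = early, late = late)
```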

verbose

Print procedure logs to the console? Default: FALSE.

seed

Random seed. Default: NULL.

importance_type

Character: "permutation" or "permutation_conditional" (default). Conditional permutation is a permutation-based measure of feature importance in which permutations are stratified on other features correlated with the feature whose importance is being measured, to alleviate correlation issues in feature importance assessment. See https://gradientforest.r-forge.r-project.org/Conditional-importance.pdf.

Value

A list of results:

+ best: the list of optimized and non-redundant clusters learned by stoic. Each entry of the list is a cluster, with information such as its list of observations (partition), its correlation threshold and centroid, and the model estimated from guide features.

+ aucs: a named vector that gives the AUC (classification performance from guide features) for each cluster.

+ clusters: a named vector that gives the cluster membership of all observations (a number, or "none" if no cluster was found in the area of this observation).

+ coverages: the `nstart` vectors of coverage variation used for the convergence test (coverage is the fraction of observations belonging to at least one cluster).
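The `aucs` and `clusters` entries described above can be summarized with base R. The sketch below uses a mock list rather than real output: the field names follow this page, but the values, the observation names, and the way clusters are named are invented assumptions.

```r
# Mock object mimicking the documented `aucs` and `clusters` entries.
# All values and names are invented for illustration.
stoic_results <- list(
  aucs = c("1" = 0.91, "2" = 0.78),
  clusters = c(obs1 = "1", obs2 = "1", obs3 = "2", obs4 = "none", obs5 = "2")
)

# Number of observations per cluster, including unassigned ones.
cluster_counts <- table(stoic_results$clusters)

# Name of the cluster best predicted from guide features.
best_cluster <- names(which.max(stoic_results$aucs))

# Fraction of observations assigned to some cluster.
assigned_fraction <- mean(stoic_results$clusters != "none")

cluster_counts
best_cluster
assigned_fraction
```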

Examples

if (FALSE) { # \dontrun{
data("neuron_diff")
set.seed(666)
data <- stoic_data(main_data = neuron_diff$main_data,
                   guide_data = neuron_diff$guide_data,
                   sample_order = neuron_diff$sample_order)
stoic_results <- stoic_clustering(data, nstart = 2)
} # }