stoic_neuron_differentiation.Rmd
library(stoic)In this vignette, stoic will be used on a systems biology case study, to extract meaningful associations between:
This single cell dataset (sampled here, and not public yet) was generated as part of a collaboration within the FANTOM6 project, in the lab of Chi Wai Yip (RIKEN).
The methodology of STOIC allows to find groups of co-expressed regulatory regions that can be well discriminated by sequence features. It does so through an exploration of the expression space where co-expression clusters are formed by maximizing the classification performance of predicting cluster membership based on sequence features.
Such a guidance of clustering by sequence information has the advantage to discover stronger associations between the two modalities, and only focus on clusters predictable by sequence.
This approach could be generalized to any question where the clustering of observations in one type of data can be informed by another type of data, and where the user wants to learn the associations between the two.
We start by creating the data used by stoic. It needs two matrices:
The main data matrix, giving, for each observation, values in a set of samples. In our case this is, for 1000 regulatory regions, the transcriptional activity in 391 cellular states ordered along neuron differentiation (metacells). We recommend using several hundreds or even thousands of observations in stoic, and several tenths or hundreds of samples.
The guide data matrix, that gives, for each observation, a set of complementary features used to inform clustering. In our case study this is the sequence characteristics of 1000 regulatory regions, i.e k-mers frequencies with from 1 to 3. We recommend several tenths or thousands of guide features in stoic.
(Optional) The order of the samples in the main data, for visualization purposes only. In our case, this is the numerical metacell order during neuron differentiation. It can be numeric or alphanumeric.
data("neuron_diff")
data <- stoic_data(main_data = neuron_diff$main_data,
guide_data = neuron_diff$guide_data,
sample_order = neuron_diff$sample_order, seed = 666)The guided clustering procedure can then be launched by stoic. Below
is an example with two different initializations (nstart,
as in kmeans), which are later combined to take the best clusters. We
recommend increasing nstart for even more stable results (e.g 5 or
more).
The classification model in stoic is a Random Forest, of which the number of trees can be set.
stoic_results <- stoic_clustering(data, nstart = 2, n_cores = 4, n_trees = 500, seed = 666)
table(stoic_results$clusters)
#>
#> 1 2 3 4 5 none
#> 27 94 17 315 36 511The results contain several objects in a list, the most important being:
clusters: a named vector that gives cluster membership
to all observations (a number, or “none” if no cluster was found in the
area of this observation).aucs : a named vector that gives the AUC
(classification performance) for each cluster.best: a list of all clusters learned by stoic,
containing more details like the list of observations, AUC, the model
and its predictions, the important guide features, the correlation
threshold and centroid of the cluster, and so on.Most notably, we can access the sequence feature importance of a given cluster (number 2 here) as follows:
head(get_feature_importance(stoic_results, 2))
#> feature importance importance_min importance_max direction
#> 1 CA 0.004745341 0.004310244 0.005180438 TRUE
#> 2 CG 0.004040024 0.003541362 0.004538686 FALSE
#> 3 GCA 0.003418567 0.003009708 0.003827426 TRUE
#> 4 CCG 0.002469698 0.002083054 0.002856342 FALSE
#> 5 CGA 0.002211791 0.001850673 0.002572910 FALSE
#> 6 CC 0.001807957 0.001502072 0.002113841 FALSEThe direction column is TRUE when the feature has higher values in the cluster than in other observations, FALSE otherwise.
A few visualization solutions are included in the package, and are assembled together below. This gives an overview of the clustering, by showing the observations in the main space (the expression space, reduced in a tSNE), and colored by cluster. Next to it, the activity profiles of the regulatory regions appear for each cluster, ranked by AUC. Stoic here identified a series of diverse expression profiles during the differentiation process (like activation, repression, peaks in intermediary states, …).
draw_tsne(data, stoic_results) +
draw_profiles(data, stoic_results)
One can also look at a given cluster of interest, as in the following lines, where the important sequence features for the cluster also appear.
for (i in 1:length(stoic_results$best)){
print(
draw_tsne(data, stoic_results, cluster = i) +
draw_profiles(data, stoic_results, cluster = i) +
draw_features(data, stoic_results, cluster = i, only_positive = T,
n_top_features = 20)
)
}




An interpretation of cluster 2 based on stoic results would be: “This set of 93 enhancers and promoters are mostly active and transcribed in the middle of the differentiation process (neural stem cells), and can be characterized in terms of sequence by A-rich k-mers such as CA or AAG. Sequence features, combined together, allow to discriminate these regulatory regions from others transcriptionnally variable regions with an AUC of 0.78”
It can also be helpful to see how similar are the inferred clusters to one another. This is done internally by stoic to prune redundant clusters, but can also be visualized downstream of the clustering. Three similarity measures are used:
A symmetrical swap score (between 0 and 1), where the model of one cluster is used to predict the membership of another cluster. It tells us weather or not the rules learned from the guide data (sequence features) are the same (score = 1), or similar (score close to 1), between two clusters.
The linear correlation of the profile in the main data (expression here), between two clusters.
The Jaccard index (overlap/union) between two clusters (the real positive classes of the clusters are considered, before the final step of assigning observations to one cluster only).
draw_clusters_similarity(data, stoic_results)
This can help the user interpret clusters, but also potentially tune
the swap threshold score used in the stoic_clustering
function. Clusters are deemed redundant if their swap score is above the
swap threshold supplied (0.95 by default), if their Jaccard index is not
0, and if they are correlated over the maximum correlation threshold
learned to define the two clusters.
Stoic provides a heatmap function to show the scores of the important
sequence features within a cluster (selected via an elbow rule), and
compare them to observations outside a cluster. The obs
parameters allow to chose whether we want to show only the observations
within the cluster, outside the cluster, or all observations with an
annotation about cluster information.
For example, in the first cluster:
draw_features_heatmap(stoic_results, data, cluster = 1, obs = "cluster")
draw_features_heatmap(stoic_results, data, cluster = 1, obs = "other")
draw_features_heatmap(stoic_results, data, cluster = 1, obs = "all")
All sequence features are scaled to a z-score within all observations, so red values indicate higher than average values, and blue lower than average. By default, only the important sequence features enriched in the cluster are shown, but all important features including depleted ones can also be shown as follows:
draw_features_heatmap(stoic_results, data, cluster = 1, obs = "cluster", only_positive = F)
draw_features_heatmap(stoic_results, data, cluster = 1, obs = "other", only_positive = F)
draw_features_heatmap(stoic_results, data, cluster = 1, obs = "all", only_positive = F)