co2_gradient.Rmd
library(stoic)In this vignette, stoic will be used on a systems biology case study, to extract meaningful associations between:
Transcriptional profiles of genes in Arabidopsis thaliana exposed to a gradient of CO2 and two types of nitrate nutrition (KNO3: nitrate, Mix: nitrate and ammonium mix). Genes are subsetted to the genes differentially expressed by CO2 under nitrate nutrition by a spline analysis.
The TF binding scores of CO2-responsive transcription factors, scanned in the promoter regions of the CO2 responsive genes (TSS -1000, +200bp).
The fact that nitrate nutrition is tightly linked to the response to elevated CO2 has already been documented in the literature, and is very promising to uncover the pathways involved in CO2 acclimation and mineral status alteration in plants. This dataset was generated in the SIRENE team (IPSIM, 2021, unpublished).
The methodology of STOIC allows to find groups of co-expressed regulatory regions that can be well discriminated by sequence features. It does so through an exploration of the expression space where co-expression clusters are formed by maximizing the classification performance of predicting cluster membership based on sequence features.
We start by creating the data used by stoic. It needs two matrices:
The main data matrix, giving, for each observation, values in a set of samples. In our case this is, for 2092 genes, the RNA-Seq expression in 40 experimental samples (5 CO2 levels and 2 nitrate nutrition conditions). We recommend using several hundreds or even thousands of observations in stoic, and several tenths or hundreds of samples.
The guide data matrix, that gives, for each observation, a set of complementary features used to inform clustering. In our case study this is the TF binding motifs scores of 114 CO2-responsive regulators within the 2092 gene promoters. We recommend several tenths or thousands of guide features in stoic.
(Optional) The order of the samples in the main data, for visualization purposes only. In our case, this is the order of CO2*nitrate conditions to be displayed. It can be numeric or alphanumeric.
data("co2_gradient")
data <- stoic_data(main_data = co2_gradient$main_data,
guide_data = co2_gradient$guide_data,
sample_order = co2_gradient$sample_order,
seed = 666)
#> The observations have been subsetted to the common
#> observations between guide and main data ( 2092 obs)The guided clustering procedure can then be launched by stoic. Below
is an example with two different initializations (nstart,
as in kmeans), which are later combined to take the best clusters. We
recommend increasing nstart for even more stable results
(e.g 5 or more).
In this case we also lower the AUC limit to include a cluster to 0.6.
stoic_results <- stoic_clustering(data, nstart = 5, auc_thr = 0.6,
seed = 666)
table(stoic_results$clusters)
#>
#> 1 2 3 4 5 6 7 none
#> 47 47 56 66 59 68 55 1694The results contain several objects in a list, the most important being:
clusters: a named vector that gives cluster membership
to all observations (a number, or “none” if no cluster was found in the
area of this observation).aucs : a named vector that gives the AUC
(classification performance) for each cluster.best: a list of all clusters learned by stoic,
containing more details like the list of observations, AUC, the model
and its predictions, the important guide features, the correlation
threshold and centroid of the cluster, and so on.Most notably, we can access the sequence feature importance of a given cluster (number 2 here) as follows:
head(get_feature_importance(stoic_results, 2))
#> feature importance importance_min importance_max direction
#> 1 DIV2 0.0004805618 0.0003670969 0.0005940267 TRUE
#> 2 AT5G56840 0.0004210237 0.0003038716 0.0005381758 TRUE
#> 3 MYBH 0.0003256395 0.0002311749 0.0004201040 TRUE
#> 4 AT1G19000 0.0002797340 0.0001607295 0.0003987384 TRUE
#> 5 AT2G38090 0.0002074671 0.0001115663 0.0003033680 TRUE
#> 6 NAC069 0.0001438136 0.0001045429 0.0001830843 FALSEThe direction column is TRUE when the feature has higher values in the cluster than in other observations, FALSE otherwise.
A few visualization solutions are included in the package, and are assembled together below. This gives an overview of the clustering, by showing the observations in the main space (the expression space, reduced in a tSNE), and colored by cluster. Next to it, the expression profiles of the genes appear for each cluster, ranked by AUC. Stoic here identified a series of diverse expression profiles in response to CO2 elevation (like activation, repression), that can vary depending on the type of nitrate nutrition provided. The fact that nitrate nutrition is tightly linked to the response to elevated CO2 has already been documented in the literature, and is very promising to uncover the pathways involved in CO2 acclimation and mineral nutrition alteration in plants.
draw_tsne(data, stoic_results) +
draw_profiles(data, stoic_results)
One can also look at a given cluster of interest, as in the following lines, where the important sequence features for the cluster also appear.
for (i in 1:length(stoic_results$best)){
print(draw_tsne(data, stoic_results, cluster = i) +
draw_profiles(data, stoic_results, cluster = i) +
draw_features(data, stoic_results, cluster = i, only_positive = T,
n_top_features = 20))
}






An interpretation of cluster 4 based on stoic results would be: “A set of 49 genes are mostly activated by elevated CO2 specifically under nitrate nutrition but not a mix of nitrate and ammonium. These genes can be distinguished from other CO2-responsive genes by their enrichment in the binding motifs of TFs like NLP7, DIV2, CDF3, among others. TF binding motifs, combined together, allow to discriminate these genes from others with an AUC of 0.64”.
It can also be helpful to see how similar are the inferred clusters to one another. This is done internally by stoic to prune redundant clusters, but can also be visualized downstream of the clustering. Three similarity measures are used:
A symmetrical swap score (between 0 and 1), where the model of one cluster is used to predict the membership of another cluster. It tells us weather or not the rules learned from the guide data (sequence features) are the same (score = 1), or similar (score close to 1), between two clusters.
The linear correlation of the profile in the main data (expression here), between two clusters.
The Jaccard index (overlap/union) between two clusters (the real positive classes of the clusters are considered, before the final step of assigning observations to one cluster only).
draw_clusters_similarity(data, stoic_results)
This can help the user interpret clusters, but also potentially tune
the swap threshold score used in the stoic_clustering
function. Clusters are deemed redundant if their swap score is above the
swap threshold supplied (0.95 by default), if their Jaccard index is not
0, and if they are correlated over the maximum correlation threshold
learned to define the two clusters.
Stoic provides a heatmap function to show the scores of the important
sequence features within a cluster (selected via an elbow rule), and
compare them to observations outside a cluster. The obs
parameters allow to chose whether we want to show only the observations
within the cluster, outside the cluster, or all observations with an
annotation about cluster information.
For example, in the 2nd cluster:
draw_features_heatmap(stoic_results, data, 2, obs = "cluster")
draw_features_heatmap(stoic_results, data, 2, obs = "other")
draw_features_heatmap(stoic_results, data, cluster = 2, obs = "all")
All sequence features are scaled to a z-score within all observations, so red values indicate higher than average values, and blue lower than average. By default, only the important sequence features enriched in the cluster are shown, but all important features including depleted ones can also be shown as follows:
draw_features_heatmap(stoic_results, data, 2, obs = "cluster", only_positive = F)
draw_features_heatmap(stoic_results, data, 2, obs = "other", only_positive = F)
draw_features_heatmap(stoic_results, data, 2, obs = "all", only_positive = F)