Coracle is an artificial intelligence framework for identifying microbes associated with a continuous physiological variable. In our case, we measured the standardized thermal tolerance of many corals using CBASS assays with subsequent ED50 modeling (Voolstra et al. 2020; Voolstra et al. 2021) and profiled the associated prokaryotes using 16S rRNA metabarcoding. We wanted to determine whether specific prokaryotes (bacteria) are indicative of increased thermal tolerance, and we developed Coracle to answer this question. However, any continuous phenotypic variable can be queried against a microbial assemblage. The framework is designed to make the most of smaller datasets and therefore sacrifices runtime efficiency on larger datasets. Coracle uses an ensemble approach: it combines different preprocessing steps and different machine learning methods and integrates them into one comprehensible score. It is meant to support decision-making by picking prokaryote candidates for further examination.

Voolstra CR, Buitrago-López C, Perna G, Cárdenas A, Hume BCC, Rädecker N, et al. Standardized short-term acute heat stress assays resolve historical differences in coral thermotolerance across microhabitat reef sites. Glob Chang Biol. 2020;26: 4328–4343. doi:10.1111/gcb.15148

Voolstra CR, Valenzuela JJ, Turkarslan S, Cárdenas A, Hume BCC, Perna G, et al. Contrasting heat stress response patterns of coral holobionts across the Red Sea suggest distinct mechanisms of thermal tolerance. Mol Ecol. 2021;30: 4466–4480. doi:10.1111/mec.16064

In the following form you are asked to upload your data in two files. The first file, the continuous physiological variable file, should include one column that specifies the sample IDs and one column that holds the values of the target variable. A column header is necessary but can be empty. The second file is a prokaryote abundance file that should include the sample IDs in the first column, followed by one column per microbial group, with the 'group' name (taxonomic annotation) as the column header and the bacterial abundances as values. The microbial abundance file and the target variable file must have the same number of rows and the same sequence of sample IDs! Data tables can be uploaded as comma-separated files (the .csv ending is required) or as tab-delimited files (either .tsv or .txt). If you try to upload other file types or the dimensions of your files do not match, an error will be shown. Example files can be found in the Tutorial.
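Before uploading, the alignment requirement can be checked locally. The following is a minimal sketch using pandas and is not part of Coracle itself; the function name `check_alignment` and the toy tables are hypothetical and stand in for your own two files, read with the sample IDs as index.

```python
import pandas as pd


def check_alignment(y: pd.DataFrame, x: pd.DataFrame) -> None:
    """Raise ValueError if the target table y and the abundance table x are not aligned."""
    if y.shape[1] != 1:
        raise ValueError("the target file should hold exactly one value column")
    if len(y) != len(x):
        raise ValueError(f"row counts differ: {len(y)} (target) vs {len(x)} (abundance)")
    if list(y.index) != list(x.index):
        raise ValueError("sample IDs differ or are in a different order")


# Toy tables (hypothetical values); with real data use e.g. pd.read_csv(path, index_col=0)
y = pd.DataFrame({"ED50": [35.1, 36.4, 34.8]}, index=["s1", "s2", "s3"])
x = pd.DataFrame(
    {"Endozoicomonadaceae": [120, 30, 88], "Rhodobacteraceae": [5, 61, 12]},
    index=["s1", "s2", "s3"],
)

check_alignment(y, x)  # passes silently because both tables list s1, s2, s3 in this order
```

The comparison is done on ordered lists rather than sets because the form requires the same sequence of sample IDs, not just the same IDs.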

The runtime of Coracle scales significantly with both the number of samples (n) and the number of bacterial groups (k). Although the worst-case complexity is $O(n^{2} k^{3} \log k)$, in practice it seems to be around $O(n k^{2})$. Thus, the number of bacterial groups drives the runtime, and we highly recommend using higher aggregations of taxonomic levels. The current version of Coracle is limited in both runtime (24 h) and the maximum number of microbial groups (10000). We therefore recommend starting at the Family or Order level. Insights at the lowest levels (ASV/OTU) are possible by feeding Coracle only the microbial taxa that belong to groups that were previously successful at a higher level (such as Family or Order).
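To get a feeling for why aggregation helps, the practical estimate of roughly n·k² can be turned into a back-of-the-envelope comparison. The sketch below is only illustrative: the function name `relative_cost`, the sample count and the group counts (10000 ASVs versus 300 families) are made-up example values, and the result is a ratio of cost estimates, not a runtime in seconds.

```python
def relative_cost(n_samples: int, k_groups: int) -> int:
    """Practical cost estimate n * k^2 (unitless; only ratios are meaningful)."""
    return n_samples * k_groups ** 2


n = 80  # hypothetical number of samples
cost_asv = relative_cost(n, 10_000)  # hypothetical ASV-level table
cost_family = relative_cost(n, 300)  # hypothetical family-level table

# The sample count cancels out, so the ratio is (10000 / 300)^2
ratio = cost_asv / cost_family
print(f"ASV level is roughly {ratio:.0f}x more expensive than family level")
```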

Upload your datafiles and run Coracle

UniCor is a feature score for quantitative, hierarchical datasets. It combines a feature’s association with the target and its uniqueness relative to other features in the same group. The score is computed from feature–target correlation and the average feature–feature correlation and lies in the range −0.5 to 1.

UniCorP applies UniCor in a bottom-up propagation across the hierarchy. At each level it evaluates features within their parent group, selects UNICORNs (uniquely correlated features) using a top-k rule (the k highest scores per level) and propagates the selected features to the next higher level (e.g., species → genus). This repeats until the highest level is reached, enriching upper levels with informative features.
Scoring can use Pearson or Spearman correlation. An optional preprocessing step for scoring can apply relative abundance or CLR to reflect compositional structure. These settings affect scoring only; propagation is determined by the selection rule at each level.
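The two optional transformations and the two correlation methods can be sketched in a few lines of pandas/NumPy. This illustrates the general techniques and is not UniCorP's actual implementation: the function names are hypothetical, and the pseudocount of 1 added before the logarithm is an assumption (CLR is undefined for zero counts, and how zeros are handled is not stated here). Spearman correlation is computed as Pearson correlation on ranks.

```python
import numpy as np
import pandas as pd


def relative_abundance(counts: pd.DataFrame) -> pd.DataFrame:
    """Divide each sample (row) by its total so that every row sums to 1."""
    return counts.div(counts.sum(axis=1), axis=0)


def clr(counts: pd.DataFrame, pseudocount: float = 1.0) -> pd.DataFrame:
    """Centered log-ratio: log counts minus the per-sample mean of the log counts."""
    logged = np.log(counts + pseudocount)  # pseudocount is an assumption to handle zeros
    return logged.sub(logged.mean(axis=1), axis=0)


def feature_target_correlation(
    x: pd.DataFrame, y: pd.Series, method: str = "pearson"
) -> pd.Series:
    """Correlation of every feature column with the target."""
    if method == "spearman":
        x, y = x.rank(), y.rank()  # Spearman = Pearson on ranks
    return x.corrwith(y)


# Toy data (hypothetical values)
counts = pd.DataFrame(
    {"ASV1": [10, 0, 4, 7], "ASV2": [5, 7, 1, 3], "ASV3": [1, 2, 9, 6]},
    index=["s1", "s2", "s3", "s4"],
)
ed50 = pd.Series([35.1, 36.4, 34.8, 35.9], index=counts.index)

print(feature_target_correlation(clr(counts), ed50, method="spearman"))
```

Both transformations work per sample (row), which is why they reflect the compositional structure of each sample rather than rescaling individual features.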

In the following form you are asked to upload your data in three files. The first two files (target variable and feature set) are prepared as follows. First, the continuous physiological target variable file should include one column that specifies the sample IDs and one column that holds the values of the target variable. A column header is necessary. The second file is the continuous feature matrix (e.g. a prokaryote abundance file) that should include the 'sample' IDs in the first column, followed by one column per OTU/ASV, with the 'feature' ID (taxonomic annotation at the lowest hierarchical level) as the column header and the features (bacterial abundances) as numeric values. The hierarchical (e.g. taxonomic) structure should be prepared in a third file, with the 'feature' IDs (OTU/ASV) in the first column and the complete hierarchical information in the following columns. The column headers should represent the different hierarchical levels and should be in either ascending or descending order. It is recommended to fill null values within the hierarchy with the next higher annotation. The feature set and the target variable must have the same number of rows and the same sequence of sample IDs (case sensitive)! The feature set and the hierarchical information must have the same feature IDs (case sensitive). Data tables can be uploaded as comma-separated files (the .csv ending is required) or as tab-delimited files (either .tsv or .txt). File types have to match. Example files can be found in the Tutorial.
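The recommendation to fill null values and the requirement of matching feature IDs can both be handled before uploading. The following pandas sketch assumes that the hierarchy columns are ordered from the highest to the lowest level; the toy tables and variable names are hypothetical.

```python
import pandas as pd

# Toy hierarchy (hypothetical): feature IDs as index, levels from highest to lowest
tax = pd.DataFrame(
    {
        "Order": ["Oceanospirillales", "Rhodobacterales", "Vibrionales"],
        "Family": ["Endozoicomonadaceae", "Rhodobacteraceae", None],
        "Genus": ["Endozoicomonas", None, None],
    },
    index=["ASV1", "ASV2", "ASV3"],
)

# Fill every missing annotation with the next higher annotation in the same row.
# For columns ordered from lowest to highest level, bfill(axis=1) would be used instead.
tax_filled = tax.ffill(axis=1)

# Toy feature matrix (hypothetical): sample IDs as index, feature IDs as column headers
x = pd.DataFrame(
    {"ASV1": [10, 0], "ASV2": [5, 7], "ASV3": [1, 2]},
    index=["s1", "s2"],
)

# Feature IDs must match exactly (case sensitive) between the two files
missing_in_tax = set(x.columns) - set(tax_filled.index)
missing_in_x = set(tax_filled.index) - set(x.columns)
if missing_in_tax or missing_in_x:
    raise ValueError(f"feature IDs do not match: {missing_in_tax | missing_in_x}")

print(tax_filled)
```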

In principle, UniCorP follows the cost of computing pairwise correlations across all features, which scales as $O (n m^{2})$ with n = number of samples and m = number of features. With hierarchical grouping, correlations are computed within groups, giving a per-level cost of $O (n \sum_{g} {m_{g}}^{2})$, where $m_{g}$ is the size of group g. Under approximately equal group sizes, the sum $\sum_{g} {m_{g}}^{2}$ becomes $m^{2} / G$ with G = number of groups, which is substantially smaller than $m^{2}$.

The selection rule influences group sizes across levels. With top-k per level, the number of propagated features is fixed by k, which stabilizes group sizes and runtime across levels. Overall, runtime grows linearly with the number of samples and with the number of hierarchical levels processed.
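The effect of grouping on the per-level cost can be checked numerically. The sketch below compares the feature-dependent factor of the cost with and without grouping; the helper name and the feature and group counts are made-up example values.

```python
def grouped_pair_terms(group_sizes):
    """Sum of squared group sizes, the feature-dependent factor of the per-level cost."""
    return sum(m_g ** 2 for m_g in group_sizes)


m, G = 1000, 10  # hypothetical: 1000 features split into 10 groups
equal_groups = [m // G] * G  # ten groups of 100 features each
skewed_groups = [910] + [10] * 9  # same 1000 features, one dominant group

print(m ** 2)  # ungrouped factor m^2
print(grouped_pair_terms(equal_groups))  # equals m^2 / G for equal group sizes
print(grouped_pair_terms(skewed_groups))  # stays close to m^2 when one group dominates
```

The skewed case shows why the saving depends on the "approximately equal group sizes" condition: a single very large group keeps the sum close to the ungrouped value.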

Upload your datafiles and run UniCorP

load bacterial abundance file (x):

load continuous physiological variable (y):

load taxonomic hierarchy:

select correlation method:

select transformation:

select number of top features to propagate (top_k):

enter your e-mail address:

UniCoracle extends Coracle with hierarchical feature selection. It combines a bottom-up enrichment step with a top-down skimming step before final modeling at the lowest level.

Bottom-up (UniCorP): Starting at the lowest (most specific) hierarchical level, UniCorP identifies uniquely correlated features (UNICORNs) within each group and propagates only the selected features upward. This repeats level by level until the highest (least specific) level is enriched.

Top-down skimming (TDS): From the enriched highest level, UniCoracle selects informative groups and propagates them downward through the hierarchy. This reduces the number of features that reach the lowest level.

Modeling: At the lowest level, the reduced feature set is analyzed with Coracle to quantify associations between features and the continuous target variable.

Runtime complexity. UniCoracle is less sensitive to very large feature sets than Coracle because it first restricts computations to within-group correlations during bottom-up propagation and then limits the number of features passed downward during top-down skimming.

Bottom-up (UniCorP): The dominant cost per level is computing correlations within groups, which scales as $O (n \sum_{g} {m_{g}}^{2})$ with n samples and $m_{g}$ features in group g. Using top-k selection per level keeps the number of propagated features fixed and stabilizes runtime.

Top-down skimming (TDS): Runtime is controlled by the cap on features kept per level (n_features). Larger caps select more children and increase cost. Smaller caps reduce cost but pass fewer features to the next level.

Note: For simplicity we set top_k = n_features to reach stable runtime and resolution at the lowest level. Optimal values depend on the dataset and the structure of its hierarchy.

Upload your datafiles and run UniCoracle

load bacterial abundance file (x):

load continuous physiological variable (y):

load taxonomic hierarchy:

select correlation method for UniCorP:

select transformation for UniCorP:

select number of features (UniCor:top_k = UniCoracle:n_features):

enter your e-mail address:

In the following we show how to use Coracle and give a short tutorial on the data handling required for our tools. Code examples are given for the programming languages R and Python, and example datasets for all relevant steps are available for download. Depending on your dataset, not all steps may be necessary, so feel free to skip irrelevant steps.
First we provide the original dataset:

The dataset consists of the 16S OTU abundance file (link) of the CBASS84 study (Voolstra et al. 2021). The data table includes the sample IDs in the first column and the associated ED50 temperature tolerance values (°C), which serve as the continuous physiological variable, in the second column. The third and fourth columns contain metadata, and the subsequent columns hold the ASVs, with the ASV IDs as column headers and their absolute abundances for each sample ID as rows. These tables are downloaded as comma-separated files (.csv).

Voolstra CR, Valenzuela JJ, Turkarslan S, Cárdenas A, Hume BCC, Perna G, et al. Contrasting heat stress response patterns of coral holobionts across the Red Sea suggest distinct mechanisms of thermal tolerance. Mol Ecol. 2021;30: 4466–4480. doi:10.1111/mec.16064

Let's load our datasets to work with them.

Python

### 1. Data read
# Import packages
import pandas as pd

# Set YOUR directory path
directory = "C:/Users/JohnDoe/Downloads/"

# Load CBASS84 dataset (the first column of each file is used as index)
ASV = pd.read_csv(directory + "cbass.csv", index_col=0)
tax = pd.read_csv(directory + "cbass_tax.csv", index_col=0)

R

### 1. Data read
# Load required libraries
library(dplyr)
library(tidyr)

# Set YOUR directory path
directory <- "C:/Users/JohnDoe/Downloads/"

# Load CBASS84 dataset (the first column of each file is used as row names)
ASV <- read.csv(paste0(directory, "cbass.csv"), row.names = 1)
tax <- read.csv(paste0(directory, "cbass_tax.csv"), row.names = 1)

In the next step we split the ASV dataset to obtain our target variable in a separate file:

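Python

### 2. Split target variable and feature set
y = ASV["ED50"].to_frame()
x = ASV.iloc[:, 3:]  # skip ED50 and the two metadata columns
# Save both tables for the upload (the file names are only a suggestion)
y.to_csv(directory + "y.csv")
x.to_csv(directory + "x.csv")

R

### 2. Split target variable and feature set
y <- ASV[, "ED50", drop = FALSE]  # drop = FALSE keeps the sample IDs as row names
x <- ASV[, -c(1:3)]               # skip ED50 and the two metadata columns
# Save both tables for the upload (the file names are only a suggestion)
write.csv(y, file = paste0(directory, "y.csv"), row.names = TRUE)
write.csv(x, file = paste0(directory, "x.csv"), row.names = TRUE)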

You can run UniCorP and UniCoracle directly from these three files: The feature matrix (x), the continuous target variable (y), and the taxonomic hierarchy (tax).
For Coracle analyses, we aggregate to a higher (less specific) taxonomic level (e.g., Family) to reduce dimensionality, since Coracle works best with at most a few hundred features. To access the different taxonomic levels, we first have to merge the ASV dataset (without ED50 and metadata) with the taxonomic information:

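Python

### 3. Combine ASV with taxonomic information
# Transpose the counts so that ASVs are rows, then join on the ASV IDs
merged = tax.merge(ASV.iloc[:, 3:].transpose(), left_index=True, right_index=True)

R

### 3. Combine ASV with taxonomic information
# cbind() combines rows by position, so tax and the ASV columns must be in the same order
merged <- cbind(tax, t(ASV[, 4:ncol(ASV)]))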

... and, if necessary, aggregate the absolute abundances according to the groups of one of the taxonomic levels. In this case we aggregate at the family level to get a good tradeoff between the number of features, the resolution of our dataset and the corresponding performance of our models.

Python

### 4. Aggregate according to taxonomic level (e.g. Family)
# Sum only the sample columns (the index of ASV), leaving out the taxonomic annotation columns
ASV_family = merged.groupby("Family")[list(ASV.index)].sum()
# Transpose so that samples are rows and families are columns
ASV_family = ASV_family.transpose().astype("int32")
ASV_family.to_csv(directory + "x_fam.csv")

R

### 4. Aggregate according to taxonomic level (e.g. Family)
fam_sums <- merged %>%
  group_by(Family) %>%
  summarize(across(where(is.numeric), ~ sum(.x, na.rm = TRUE)))
# Transpose only the numeric columns so that the counts stay numeric
ASV_family <- as.data.frame(t(as.matrix(fam_sums[, -1])))
# Use the family names as column headers
colnames(ASV_family) <- as.character(fam_sums$Family)
write.csv(ASV_family, file = paste0(directory, "x_fam.csv"), row.names = TRUE)

We can now run Coracle with the files y (ED50/target variable) and x_fam (abundance at the family level). Both files meet all requirements: microbial abundance and target variable have the same number of rows and share the same sequence of sample IDs!
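Steps 2 to 4 can also be tried on a small in-memory table before running them on the full dataset. The following sketch uses made-up values; the metadata column names (`Site`, `Species`) and the taxonomy labels are hypothetical placeholders that only mimic the layout described above.

```python
import pandas as pd

# Toy stand-in for cbass.csv: ED50, two metadata columns, then ASV counts
ASV = pd.DataFrame(
    {
        "ED50": [35.1, 36.4],
        "Site": ["A", "B"],
        "Species": ["sp1", "sp1"],
        "ASV1": [10, 0],
        "ASV2": [5, 7],
        "ASV3": [1, 2],
    },
    index=["s1", "s2"],
)
# Toy stand-in for cbass_tax.csv
tax = pd.DataFrame(
    {"Order": ["OrdA", "OrdA", "OrdB"], "Family": ["FamA", "FamA", "FamB"]},
    index=["ASV1", "ASV2", "ASV3"],
)

y = ASV["ED50"].to_frame()
merged = tax.merge(ASV.iloc[:, 3:].transpose(), left_index=True, right_index=True)
# Sum only the sample columns within each family, then put the samples back in rows
x_fam = merged.groupby("Family")[list(ASV.index)].sum().transpose()

# The two requirements: same number of rows and same sequence of sample IDs
assert x_fam.shape[0] == y.shape[0]
assert list(x_fam.index) == list(y.index)
print(x_fam)
```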

Now we can upload the prepared data tables (y and x_fam at the family level), enter an email address (to which the results will be sent) and click on run Coracle.

Coracle might take a few minutes to run. If you choose to leave the tab open, a landing page will be loaded once Coracle is finished. There you can have a first look at your results, read a short explanation and use a button to download your results as a .csv file. Additionally, the explanation and a download link for your results will be sent to the email address provided. No registration is necessary. Timeout errors can occur while waiting for the result page to load.

The results can also be found here:

Micportal
