Coracle

Coracle is an artificial intelligence framework to identify microbes associated with a continuous physiological variable. In our case, we measured a lot of corals for their standardized coral thermal tolerance using CBASS assays with subsequent ED50 modeling (Voolstra et al. 2020; Voolstra et al 2021) and looked at prokaryote association using 16S rRNA metabarcoding. We were curious to determine whether specific prokaryotes (bacteria) were indicative of increased thermal tolerance, so we started Coracle to answer this question. But really any continuous phenotypic variable can be queried against microbial assemblage. The framework is designed to make the most out of smaller datasets and thus sacrifices efficient runtime for larger datasets. Coracle uses an ensemble approach and combines different preprocessing steps and different machine learning methods that are integrated into one comprehensible score. It is meant to be a decision-'maker' by picking prokaryote candidates for further examination.

Voolstra CR, Buitrago-López C, Perna G, Cárdenas A, Hume BCC, Rädecker N, et al. Standardized short-term acute heat stress assays resolve historical differences in coral thermotolerance across microhabitat reef sites. Glob Chang Biol. 2020;26: 4328–4343. doi:10.1111/gcb.15148

Voolstra CR, Valenzuela JJ, Turkarslan S, Cárdenas A, Hume BCC, Perna G, et al. Contrasting heat stress response patterns of coral holobionts across the Red Sea suggest distinct mechanisms of thermal tolerance. Mol Ecol. 2021;30: 4466–4480. doi:10.1111/mec.16064

In the following form you are asked to upload your data in two files. First, the continuous physiological variable file should include one column that specifies sample IDs and one column that holds the values of the target variable. A column header is necessary but can be empty. The second file is a prokaryote abundance file that should include the sample IDs in the first column, followed by the 'group' name (taxonomic annotation) as the column header and the bacterial abundance as values (each column resembles one microbial group). Microbial abundance and target variable should have the same number of rows and should have the same sequence of sample IDs! Datatables can be uploaded as comma separated files (.csv-ending is required) or as tab stop delimited files (either .tsv or .txt). If you try to upload different files or the dimensions of your files don't add up, an error will be shown. Example files can be found in the Tutorial.

The runtime of coracle scales significantly with both the number of samples (n) and the number of bacteria groups (k). Although in the worst case complexity is shown to be [n²k³log(k)], in practice it seems to be around [nk²]. Thus, the number of bacteria groups is the driving the runtime and we highly suggest using higher aggregations of taxonomic levels. The current version of Coracle is limited in both the runtime (24h) and the maximal number of microbial groups (10000). Thus we recommend to use Family- or Order-level first. Insights at the lowest levels (ASV/OTU) are possible by only feeding Coracle with microbial species from previously successfull microbial groups (like Family-/Order-level).



Upload your datafiles and run Coracle



load bacterial abundance file (x):

load continuous physiological variable (y):

enter your e-mail adress:










HiCoracle is is based on Coracle and combines it with different hierarchical feature selection strategies: UniCor, am bottom-up approach that propagates uniquely correlated features up the hierarchical order is applied first. From this enriched dataset at the highest order, a top-down approach selects and propagates the most useful features down the hierachy. The feature set of the remaining lowest hierarchy level is thus significantly smaller. The goal is to reduce the noise and only keep the most promising features. At the lowest level it is again fed into the original Coracle framework to investigate association between the remaining features and the continuous target variable.

Voolstra CR, Buitrago-López C, Perna G, Cárdenas A, Hume BCC, Rädecker N, et al. Standardized short-term acute heat stress assays resolve historical differences in coral thermotolerance across microhabitat reef sites. Glob Chang Biol. 2020;26: 4328–4343. doi:10.1111/gcb.15148

Voolstra CR, Valenzuela JJ, Turkarslan S, Cárdenas A, Hume BCC, Perna G, et al. Contrasting heat stress response patterns of coral holobionts across the Red Sea suggest distinct mechanisms of thermal tolerance. Mol Ecol. 2021;30: 4466–4480. doi:10.1111/mec.16064

In the following form you are asked to upload your data in three files. The first two files (feature set and target variable) are prepared just like with Coracle: First, the continuous physiological variable file should include one column that specifies sample IDs and one column that holds the values of the target variable. A column header is necessary but can be empty. The second file is a prokaryote abundance file that should include the sample IDs in the first column, followed by the 'feature' ID (taxonomic annotation, lowest hierarchical level) as the column header and the bacterial abundance as values (each column resembles one OTU/ASV). The hierarchical (e.g. taxonomic) structure should be prepared in a third file, with the 'feature' IDs (OTU/ASV) in the first column and the complete hierarchical information in the following columns. The column headers should represent the different hierarchical levels and should be in the right order (ascending/descending does not matter). It is recommended to fill null values within that table with the next best value from a different level (e.g. ASV/OTU ID). feature set and target variable should have the same number of rows and should have the same sequence of sample IDs! Feature set and hierarchical information should have the same feature IDs. Datatables can be uploaded as comma separated files (.csv-ending is required) or as tab stop delimited files (either .tsv or .txt). File types and dimensions of the different files have to match. Example files can be found in the Tutorial.

In comparison to Coracle, HiCoracle is less vulnerable towards a large amount of features. The threshold parameters play a big role in runtime complexity. As the uc threshold decreases the amount of features propagated each level will increase, resulting in an increased runtime. As the top-down threshold increases, the amount of features kept each level will increase, resulting in increased runtimes as well. Thus the goal is to set the uc threshold as high as possible without losing information while setting the top-down threshold as high as possible without loosing all features. For a first try we recommend something along uc threshold = 0.15 and top-down threshold = 0.2. Optimal values depend on the dataset. The optimal top-down threshold is mostly impacted by the structure of the hierarchy while the optimal uc threshold is impacted by the correlations present within the dataset.



Upload your datafiles and run HiCoracle

load bacterial abundance file (x):

load continuous physiological variable (y):

load taxonomic hierarchy:

top-down skimming cut-off threshold:

UniCor threshold:

enter your e-mail adress:











In the following we show how to use Coracle and give a short tutorial on the data handling requiered for the optimal use of Coracle. Code examples are given for programming languages R and Python and example datasets for all relevant steps are available for download. Depending on your dataset not all steps may be necessary so feel free to skip irrelevant steps
First we provide the original dataset:




The dataset consists of the 16S OTU abundance file (link) of the CBASS84 study (Voolstra et al 2021). The continuous physiological variable data table includes the sample IDs in the first column and the associated ED50 temperature tolerance values (°C) in the second column. The third and fourth column contain some metadata and the ASVs fill subsequent column headers with their corresponding abundances (absolutes) for each sample ID as rows. These tables are downloaded as comma separated files (.csv).

Voolstra CR, Valenzuela JJ, Turkarslan S, Cárdenas A, Hume BCC, Perna G, et al. Contrasting heat stress response patterns of coral holobionts across the Red Sea suggest distinct mechanisms of thermal tolerance. Mol Ecol. 2021;30: 4466–4480. doi:10.1111/mec.16064


Let's load our datasets to work with them.


### 1. Data read
# Import packages
import pandas as pd

# Set YOUR directory path
directory = "C:/Users/JohnDoe/Downloads/"

# Load CBASS84 dataset
ASV = pd.read_csv(directory + "cbass.csv", index_col=0)
tax = pd.read_csv(directory + "cbass_tax.csv", index_col=0)

Python

### 1. Data Read
# Load required libraries
library(dplyr)
library(tidyr)

# Set YOUR directory path
directory <- "C:/Users/JohnDoe/Downloads/"

# Load CBASS84 dataset
ASV <- read.csv(paste0(directory, "cbass.csv"), row.names = 1)
tax <- read.csv(paste0(directory, "cbass_tax.csv"), row.names = 1)

R


In the next step we split the ASV dataset to obtain our target variable in a separated file:


### 2. Split Target Variable
y = ASV["ED50"].to_frame()

Python

### 2. Split Target Variable
y <- ASV[,"ED50"]

R



In order to access different taxonomic levels we have to merge the ASV dataset (without ED50 and metadata):


### 3. Combine ASV with taxonmic information
merged = tax.merge(ASV.iloc[:,3:].transpose(), right_index=True, left_index=True)

Python

### 3. Combine ASV with taxonomic information
merged <- cbind(tax, t(ASV[, 4:ncol(ASV)]))

R


... and aggregate the absolute abundances according to the groups of one of the taxonomic levels, if necessary. In this case we aggregate at the family level to get a good tradeoff between the number of features, the resolution of our dataset and the corresponding performance of our models.


### 4. Aggregate according to taxonomic level (e.g. Family)
ASV_family = merged.groupby( ["Family"] ).sum()
# Split ASV data from taxonomic information
ASV_family = ASV_family.transpose().iloc[4:, :].astype('int32' )

Python

### 4. Aggregate according to taxonomic level (e.g. Family)
ASV_family <- merged %>%
  group_by(Family) %>%
  summarize( across( where( is.numeric), sum, na.rm = TRUE)) %>%
  t() %>%
  as.data.frame()
# Set the first row as column headers
colnames(ASV_family) <- as.character( ASV_family[1, ])
ASV_family <- ASV_family[-1, ]

R



We can now already run Coracle with the files y (ED50/target variable) and x_fam (abundance at family level). Both files can be used to run coracle as they support all requirements. Microbial abundance and target variable have the same number of rows and share the same sequence of sample IDs! Assuming we already did this and now decide to run it at the ASV level for the top performing families. We have to prepare our dataset accordingly:


### 5. Select according to top performing groups
ASV_from_family = merged[ merged[ "Family" ].isin(("Microbacteriaceae" , "Alteromonadaceae", "Marinobacteraceae"))]
ASV_from_family = ASV_from_family.iloc[:, 5:].transpose().astype('int32')

Python

### 5. Select according to top performing groups
selected_families <- c("Microbacteriaceae", "Alteromonadaceae", "Marinobacteraceae")
ASV_from_family <- merged %>%
  filter(Family %in% selected_families) %>%
  select(-Phylum, -Class, -Order, -Family, -Genus) %>%
  t() %>%
  as.data.frame()

R




If we want to save the new files we can do so:


### 6. Save file in a submittable format (e.g. csv)
ASV_from_family.to_csv(directory + "x_ASVff.csv", float_format="%.4g")
y.to_csv(directory + "y.csv", float_format="%.4g")

Python

### 6. Save files in a submittable format (e.g. csv)
write.csv(ASV_from_family, file = paste0(directory, "x_ASVff.csv"), row.names = TRUE)
write.csv(y, file = paste0(directory, "y.csv"), row.names = TRUE)

R

Now we can upload the prepared data tables (either x_fam at the family level or x_ASVff, the selected ASVs from top performing families) , enter an email-address (to which the results will be sent) and click on run Coracle.

Screenshot dataupload

Coracle might take a few minutes to run. If you choose to leave the tab open a landing page will be loaded once Coracle is finished. There you can have a first look at your results, receive a short explanation and a button to download your files as a .csv-file. Additionally, the explanation and a download link for your results will be sent to you to the email-address provided. No registration is necessary. timeout errors can occur while waiting for the result page to load.


Screenshot dataupload

The results can also be found here: