Coracle is an artificial intelligence framework to identify microbes associated with a continuous physiological variable.
In our case, we measured a lot of corals for their standardized coral thermal tolerance using CBASS assays with subsequent
ED50 modeling (Voolstra et al. 2020; Voolstra et al 2021) and looked at prokaryote association using 16S rRNA metabarcoding.
We were curious to determine whether specific prokaryotes (bacteria) were indicative of increased thermal tolerance,
so we started Coracle to answer this question.
But really any continuous phenotypic variable can be queried against microbial assemblage.
The framework is designed to make the most out of smaller datasets and thus sacrifices efficient runtime for larger datasets.
Coracle uses an ensemble approach and combines different preprocessing steps and different machine learning methods
that are integrated into one comprehensible score.
It is meant to be a decision-'maker' by picking prokaryote candidates for further examination.
In the following form you are asked to upload your data in two files.
First, the continuous physiological variable file should include one column that specifies sample IDs
and one column that holds the values of the target variable.
A column header is necessary but can be empty.
The second file is a prokaryote abundance file that should include the sample IDs in the first column,
followed by the 'group' name (taxonomic annotation) as the column header and the bacterial abundance as
values (each column resembles one microbial group).
Microbial abundance and target variable should have the same number of rows and should have the
same sequence of sample IDs!
Datatables can be uploaded as comma separated files (.csv-ending is required)
or as tab stop delimited files (either .tsv or .txt).
If you try to upload different files or the dimensions of your files don't add up, an error will be shown.
Example files can be found in the Tutorial.
The runtime of coracle scales significantly with both the number of samples (n) and the number of bacteria groups (k).
Although in the worst case complexity is shown to be [n²k³log(k)], in practice it seems to be around [nk²].
Thus, the number of bacteria groups is the driving the runtime and we highly suggest using higher
aggregations of taxonomic levels.
The current version of Coracle is limited in both the runtime (24h) and the maximal number of microbial groups (10000).
Thus we recommend to use Family- or Order-level first.
Insights at the lowest levels (ASV/OTU) are possible by only feeding Coracle with microbial species
from previously successfull microbial groups (like Family-/Order-level).
Upload your datafiles and run Coracle
UniCor is a metric for the importance of a feature in high dimensional datasets.
It combines the feature's correlation to the target variable, measured as Pearson's Correlation Coefficient,
and it's uniqueness, measued as the average correlation to other features.
UniCorP is a propagation algorithm that utilizes the UniCor metric in hierarchical feature sets.
The UniCor metric is computed for each feature with the average correlation only being computed between features
of the same group (e.g. all bacteria species in the same genus).
If the UniCor metric of a feature surpasses the respective threshold, the feature will be propagated to the next higher hierarchical level
(e.g. bacteria species to genus level). This Bottom-Up Propagation is repeated until the highest level is reached,
resulting in a new hierarchy enriched by UNICORNs (UNIquely CORrelated eNtities).
In the following form you are asked to upload your data in three files.
The first two files (feature set and target variable) are prepared:
First, the continuous physiological variable file should include one column that specifies sample IDs
and one column that holds the values of the target variable.
A column header is necessary but can be empty.
The second file is the continuous featureset (e.g. a prokaryote abundance file) that should include the 'sample' IDs in the first column,
followed by the 'feature' ID (taxonomic annotation, lowest hierarchical level) as the column header and the feautes (bacterial abundance) as
values (each column resembling one OTU/ASV).
The hierarchical (e.g. taxonomic) structure should be prepared in a third file, with the 'feature' IDs (OTU/ASV) in the first column and the
complete hierarchical information in the following columns. The column headers should represent the different
hierarchical levels and should be in the right order (ascending/descending does not matter).
It is recommended to fill null values within that table with the value of the next higher level (e.g. ASV/OTU ID).
feature set and target variable should have the same number of rows and should have the
same sequence of sample IDs! Feature set and hierarchical information should have the same feature IDs.
Datatables can be uploaded as comma separated files (.csv-ending is required)
or as tab stop delimited files (either .tsv or .txt).
File types and dimensions of the different files have to match.
Example files can be found in the Tutorial.
In principle, UniCorP follows the runtime complexity of a correlation matrix (
with n = number of samples and m = number of features). Through hierarchical grouping the runtime can be
significantly reduced (
, where
resembles the group with the most features).
Assuming a relatively equal distribution of features to feature-groups we see that
is significanlty
smaller than .
The UniCor threshold parameter thus plays a big role in runtime complexity.
As the threshold decreases the amount of features propagated each level will increase,
resulting in bigger groups at higher hierarchical levels and a potentially increased runtime.
Thus the goal is to set the threshold as high as possible without losing information.
While in theory the threshold can take on values between -0.5 and 1, we recommend only using values above 0.
The optimal theshold depends on the dataset and is impacted by the correlations present within the dataset.
Upload your datafiles and run UniCorP
HiCoracle is is based on Coracle and combines it with different hierarchical feature selection strategies:
UniCor, am bottom-up approach that propagates uniquely correlated features up the hierarchical order is applied first.
From this enriched dataset at the highest order, a top-down approach selects and propagates the most useful features down the hierachy.
The feature set of the remaining lowest hierarchy level is thus significantly smaller. The goal is to reduce the noise and only keep the most promising features.
At the lowest level it is again fed into the original Coracle framework to investigate association between the remaining features and the continuous target variable.
In the following form you are asked to upload your data in three files.
The first two files (feature set and target variable) are prepared just like with Coracle:
First, the continuous physiological variable file should include one column that specifies sample IDs
and one column that holds the values of the target variable.
A column header is necessary but can be empty.
The second file is a prokaryote abundance file that should include the sample IDs in the first column,
followed by the 'feature' ID (taxonomic annotation, lowest hierarchical level) as the column header and the bacterial abundance as
values (each column resembles one OTU/ASV).
The hierarchical (e.g. taxonomic) structure should be prepared in a third file, with the 'feature' IDs (OTU/ASV) in the first column and the
complete hierarchical information in the following columns. The column headers should represent the different
hierarchical levels and should be in the right order (ascending/descending does not matter).
It is recommended to fill null values within that table with the next best value from a different level (e.g. ASV/OTU ID).
feature set and target variable should have the same number of rows and should have the
same sequence of sample IDs! Feature set and hierarchical information should have the same feature IDs.
Datatables can be uploaded as comma separated files (.csv-ending is required)
or as tab stop delimited files (either .tsv or .txt).
File types and dimensions of the different files have to match.
Example files can be found in the Tutorial.
In comparison to Coracle, HiCoracle is less vulnerable towards a large amount of features.
The threshold parameters play a big role in runtime complexity.
As the uc threshold decreases the amount of features propagated each level will increase, resulting in an increased runtime.
As the top-down threshold increases, the amount of features kept each level will increase, resulting in increased runtimes as well.
Thus the goal is to set the uc threshold as high as possible without losing information while setting the top-down threshold as high as possible without loosing all features.
For a first try we recommend something along uc threshold = 0.15 and top-down threshold = 0.2.
Optimal values depend on the dataset. The optimal top-down threshold is mostly impacted by the structure of the hierarchy
while the optimal uc threshold is impacted by the correlations present within the dataset.
Upload your datafiles and run HiCoracle
Coracle is an artificial intelligence framework to identify microbes associated with a continuous physiological variable.
In our case, we measured a lot of corals for their standardized coral thermal tolerance using CBASS assays with subsequent
ED50 modeling (Voolstra et al. 2020; Voolstra et al 2021) and looked at prokaryote association using 16S rRNA metabarcoding.
We were curious to determine whether specific prokaryotes (bacteria) were indicative of increased thermal tolerance,
so we started Coracle to answer this question.
But really any continuous phenotypic variable can be queried against microbial assemblage.
The framework is designed to make the most out of smaller datasets and thus sacrifices efficient runtime for larger datasets.
Coracle uses an ensemble approach and combines different preprocessing steps and different machine learning methods
that are integrated into one comprehensible score.
It is meant to be a decision-'maker' by picking prokaryote candidates for further examination.
UniCor is a metric for the importance of a feature in high dimensional datasets.
It combines the feature's correlation to the target variable, measured as Pearson's Correlation Coefficient,
and it's uniqueness, measued as the average correlation to other features.
UniCorP is a propagation algorithm that utilizes the UniCor metric in hierarchical feature sets.
The UniCor metric is computed for each feature with the average correlation only being computed between features
of the same group (e.g. all bacteria species in the same genus).
If the UniCor metric of a feature surpasses the respective threshold, the feature will be propagated to the next higher hierarchical level
(e.g. bacteria species to genus level). This Bottom-Up Propagation is repeated until the highest level is reached,
resulting in a new hierarchy enriched by UNICORNs (UNIquely CORrelated eNtities).
In the following we show how to use Coracle and give a short tutorial
on the data handling requiered for the optimal use of Coracle.
Code examples are given for programming languages R and Python and example datasets
for all relevant steps are available for download. Depending on your dataset not all steps
may be necessary so feel free to skip irrelevant steps
First we provide the original dataset:
The dataset consists of the 16S OTU abundance file
(link)
of the CBASS84 study (Voolstra et al 2021).
The continuous physiological variable data table includes the sample IDs in the first column and the
associated ED50 temperature tolerance values (°C) in the second column.
The third and fourth column contain some metadata and the ASVs
fill subsequent column headers with their corresponding abundances (absolutes) for each sample ID as rows.
These tables are downloaded as comma separated files (.csv).
In the next step we split the ASV dataset to obtain our target variable in a separated file:
### 2. Split Target Variable and feature set
y = ASV["ED50"].to_frame()
x = ASV.iloc[:, 3:]
Python
### 2. Split Target Variable and feature set
y <- ASV[,"ED50"]
x <- ASV[, -c(1:3)]
R
In order to access different taxonomic levels we have to merge
the ASV dataset (without ED50 and metadata):
### 3. Combine ASV with taxonmic information
merged = tax.merge(ASV.iloc[:,3:].transpose(), right_index=True, left_index=True)
Python
### 3. Combine ASV with taxonomic information
merged <- cbind(tax, t(ASV[, 4:ncol(ASV)]))
R
... and aggregate the absolute abundances according
to the groups of one of the taxonomic levels, if necessary.
In this case we aggregate at the family level to get a good
tradeoff between the number of features, the resolution of our dataset
and the corresponding performance of our models.
### 4. Aggregate according to taxonomic level (e.g. Family)
ASV_family = merged.groupby( ["Family"] ).sum() # Split ASV data from taxonomic information
ASV_family = ASV_family.transpose().iloc[4:, :].astype('int32' )
Python
### 4. Aggregate according to taxonomic level (e.g. Family)
ASV_family <- merged %>%
group_by(Family) %>%
summarize( across( where( is.numeric), sum, na.rm = TRUE)) %>%
t() %>%
as.data.frame() # Set the first row as column headers
colnames(ASV_family) <- as.character( ASV_family[1, ])
ASV_family <- ASV_family[-1, ]
R
We can now already run Coracle with the files
y (ED50/target variable) and x_fam (abundance at family level).
Both files can be used to run coracle as they support all requirements.
Microbial abundance and target variable have the same number of rows and share the same sequence of sample IDs!
Assuming we already did this and now decide to run it at the ASV level
for the top performing families. We have to prepare our dataset accordingly:
### 5. Select according to top performing groups
ASV_from_family = merged[ merged[ "Family" ].isin(("Microbacteriaceae"
, "Alteromonadaceae", "Marinobacteraceae"))]
ASV_from_family = ASV_from_family.iloc[:, 5:].transpose().astype('int32')
Python
### 5. Select according to top performing groups
selected_families <- c("Microbacteriaceae",
"Alteromonadaceae", "Marinobacteraceae")
ASV_from_family <- merged %>%
filter(Family %in% selected_families) %>%
select(-Phylum, -Class, -Order, -Family, -Genus) %>%
t() %>%
as.data.frame()
R
If we want to save the new files we can do so:
### 6. Save file in a submittable format (e.g. csv)
ASV_from_family.to_csv(directory + "x_ASVff.csv",
float_format="%.4g")
y.to_csv(directory + "y.csv",
float_format="%.4g")
Python
### 6. Save files in a submittable format (e.g. csv)
write.csv(ASV_from_family, file = paste0(directory,
"x_ASVff.csv"), row.names = TRUE)
write.csv(y, file = paste0(directory, "y.csv"), row.names = TRUE)
R
Now we can upload the prepared data tables
(either x_fam at the family level or x_ASVff, the selected ASVs from top performing families)
, enter an email-address (to which the results will be sent)
and click on run Coracle.
Coracle might take a few minutes to run.
If you choose to leave the tab open a landing page will be loaded once Coracle is finished.
There you can have a first look at your results, receive a short explanation and a button to download your files as a .csv-file.
Additionally, the explanation and a download link for your results will be sent to you to the email-address provided.
No registration is necessary. timeout errors can occur while waiting for the result page to load.