From 93d7e526769e370c7116aaef30667fa51f378a17 Mon Sep 17 00:00:00 2001 From: JustGag <158193589+JustGag@users.noreply.github.com> Date: Mon, 2 Sep 2024 16:01:01 -0400 Subject: [PATCH] Update main.tex aPhyloGeo steps clarifications --- papers/Gagnon_Kebe_Tahiri/main.tex | 20 ++++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/papers/Gagnon_Kebe_Tahiri/main.tex b/papers/Gagnon_Kebe_Tahiri/main.tex index 98bef5de59..f170c10ecd 100644 --- a/papers/Gagnon_Kebe_Tahiri/main.tex +++ b/papers/Gagnon_Kebe_Tahiri/main.tex @@ -1,5 +1,5 @@ \begin{abstract} -Cumacea (crustaceans: Peracarida) are vital indicators of benthic health in marine ecosystems. This study investigated the influence of envirionmental, climatic and geographic attributes on their genetic variability in the Northern North Atlantic, focusing on Icelandic waters. We analyzed mitochondrial sequences of the 16S rRNA gene from 62 Cumacea specimens. Using the \textit{aPhyloGeo} software, we correlated these sequences with relevant factors such as latitude (decimal degree) at the start of sampling, wind speed (m/s) at the start of sampling, O\textsubscript{2} concentration (mg/L), and depth (m) at the beginning of sampling. +Cumacea (crustaceans: Peracarida) are vital indicators of benthic health in marine ecosystems. This study investigated the influence of environmental, climatic and geographic attributes on their genetic variability in the Northern North Atlantic, focusing on Icelandic waters. We analyzed mitochondrial sequences of the 16S rRNA gene from 62 Cumacea specimens. Using the \textit{aPhyloGeo} software, we correlated these sequences with relevant factors such as latitude (decimal degree) at the start of sampling, wind speed (m/s) at the start of sampling, O\textsubscript{2} concentration (mg/L), and depth (m) at the start of sampling. 
Our analyses revealed variability in geographic and environmental attributes, reflecting diverse ecological requirements and benthic habitats. The most common Cumacea families, Diastylidae and Leuconidae, suggest adaptations to various marine environments. Phylogeographic analysis showed that specific genetic sequences were correlated with wind speed (m/s) at the start of sampling and O\textsubscript{2} concentration (mg/L). This indicates potential local adaptation to these fluctuating conditions. @@ -138,7 +138,7 @@ \subsubsection{Other environmental attributes} To better interpret benthic species' relationship and evolutionary responses, genetic data are required \citep{wilson_speciation_1987, uhlir_adding_2021}. Thus, the aligned DNA sequence of the 16S rRNA mitochondrial gene region from each of the samples was included in our analyses. This region is standard in phylogeny and phylogeography studies \citep{hugenholtz1998impact} and sufficiently conserved over time to guarantee exact alignments between different species or populations \citep{saccone1999evolutionary}. We examined 62 of the 306 aligned DNA sequences used for phylogeographic analyses by \citep{uhlir_adding_2021}. As some specimens in our sample have their DNA sequence duplicated, or even quadruplicated with a difference of one or two nucleotides, we took into account the longest-aligned DNA sequence of each specimen. \subsection{{\textit{aPhyloGeo} software}\label{aPhyloGeo-software}} -We used the cross-platform Python software \textit{aPhyloGeo} for our phylogeographic analyses, designed to analyze phylogenetic trees using ecological and geographic attributes \citep{koshkarov_phylogeography_2022}. 
Developed by My-Linh Luu, Georges Marceau, David Beauchemin, and Nadia Tahiri, \textit{aPhyloGeo} offers tools for the study of correlations between species genetics and habitat characteristics, helping to understand the evolution of species under different environmental conditions \citep{koshkarov_phylogeography_2022}. +We used the cross-platform Python software \textit{aPhyloGeo} for our phylogeographic analyses, designed to analyze phylogenetic trees using ecological and geographic attributes (\autoref{lst:main}). Developed by My-Linh Luu, Georges Marceau, David Beauchemin, and Nadia Tahiri, \textit{aPhyloGeo} offers tools for the study of correlations between species genetics and habitat characteristics, helping to understand the evolution of species under different environmental conditions \citep{koshkarov_phylogeography_2022}. We selected this software for our analysis because, to our knowledge, it is the first phylogeographic tool capable of correlating species genetics with climatic, environmental, and geographic attributes - precisely the objective of our study \citep{koshkarov_phylogeography_2022}. \textit{aPhyloGeo} offers several key functionalities: @@ -148,8 +148,6 @@ \subsection{{\textit{aPhyloGeo} software}\label{aPhyloGeo-software}} \item Investigation of genetic diversity: \textit{aPhyloGeo} measures genetic heterogeneity based on geographic variation, enabling the recognition of potential local adaptations and evolutionary processes \citep{manel_perspectives_2010}. \end{enumerate} -The \textit{aPhyloGeo} Python package is freely and publicly available on \href{https://github.com/tahiri-lab/aPhyloGeo}{GitHub}, and is also available on \href{https://pypi.org/project/aphylogeo/}{PyPi}, to facilitate complex phylogeographic analyses. The software process has three main stages (\autoref{lst:main}). - %\autoref{lst:main}. \begin{lstlisting}[label=lst:main,language=Python,caption=Main script for tutorial using the aPhyloGeo package.] 
# This script outlines the main steps for processing genetic and attribute data using the aPhyloGeo package. @@ -192,12 +190,14 @@ \subsection{{\textit{aPhyloGeo} software}\label{aPhyloGeo-software}} utils.filter_results(attribute_trees, genetic_trees, attribute_data) \end{lstlisting} +The \textit{aPhyloGeo} Python package is freely and publicly available on \href{https://github.com/tahiri-lab/aPhyloGeo}{GitHub}, and is also available on \href{https://pypi.org/project/aphylogeo/}{PyPi}, to facilitate complex phylogeographic analyses. The software process has three main stages: + \begin{enumerate} -\item \textbf{The first step} is to collect cumacean DNA sequences of sufficient quality for the needs of our results \citep{koshkarov_phylogeography_2022}. 62 cumaceans samples were selected to represent 62 16S rRNA mitochondrial gene sequences. We then included two climatic factors, namely wind speed (m/s) at the start and end of the sampling. We also included three environmental factors, such as sampling depth (m) at the start of sampling, temperature ($^\circ$C), and water O\textsubscript{2} concentration (mg/L). Finally, we integrated two geographic attributes, latitude (DD) and longitude (DD) at the start of sampling. +\item \textbf{The first step} was to collect cumacean DNA sequences of sufficient quality for the needs of our results \citep{koshkarov_phylogeography_2022}. In this study, 62 cumacean samples were selected to represent 62 16S rRNA mitochondrial gene sequences. We then included two climatic attributes, namely wind speed (m/s) at the start and end of the sampling; three environmental attributes, namely depth (m) at the start of sampling, temperature ($^\circ$C), and water O\textsubscript{2} concentration (mg/L); and two geographic attributes, latitude (DD) and longitude (DD) at the start of sampling. -\item \textbf{In the second step}, trees were generated with environmental, geographic, climatic, and genetic data. 
Concerning geographic attributes, we calculated the dissimilarity between each data pair from different geographic conditions, producing a symmetrical square matrix. The {neighbor-joining algorithm}\footnote{It is a method used to construct phylogenetic trees using distance matrices.} was used to design the geographic tree from this matrix. The same approach was applied to environmental, climatic, and genetic data. An {iterative phylogenetic reconstruction method}\footnote{This method makes it possible to build phylogenetic trees by progressively reforming them as new data or sequences are added.} was used to construct phylogenetic trees based on 62 16S rRNA mitochondrial sequences, taking into account only data located within an interval that moves along the alignment. This displacement can vary according to the steps and the size of the window defined by the user (their length is determined by the number of base pairs (bp)) \citep{koshkarov_phylogeography_2022}. In our case, we set up the software \textit{aPhyloGeo} as follows: 'pairwiseAligner' for sequence alignment; 'Hamming distance' to measure simple dissimilarities between sequences of identical length; 'Wider Fit by elongating with Gap (starAlignment)' algorithm takes alignment gaps into account, which is often mandatory in the case of major deletions or insertions in the sequences; windows size: 1 nucleotide (nt); and finally, window step: 10 nt. The last two configurations imply that for each 1 nt window, a phylogenetic tree is produced, then the window is moved by 10 nt, creating a new tree. +\item \textbf{In the second step}, trees were generated separately from environmental, geographic, climatic and genetic data. Concerning geographic attributes, we calculated the dissimilarity between each pair of cumaceans from distinct geographic conditions \citep{koshkarov_phylogeography_2022}. This produced a symmetrical square matrix \citep{koshkarov_phylogeography_2022}. 
The {neighbor-joining algorithm}\footnote{It is a method used to construct phylogenetic trees using distance matrices.} was used to design the geographic tree from this matrix \citep{koshkarov_phylogeography_2022}. The same approach was applied to environmental, climatic, and genetic data. Phylogenetic reconstruction was reiterated to build genetic trees based on 62 16S rRNA mitochondrial sequences, taking into account only data within a window that progresses along the alignment \citep{koshkarov_phylogeography_2022}. This displacement can vary according to the steps and the size of the window defined by the user (their length is determined by the number of base pairs (bp)) \citep{koshkarov_phylogeography_2022}. In our case, we set up the \textit{aPhyloGeo} software as follows: 'pairwiseAligner' for sequence alignment; 'Hamming distance' to measure simple dissimilarities between sequences of identical length; the 'Wider Fit by elongating with Gap (starAlignment)' algorithm, which takes alignment gaps into account, as is often required in the case of major deletions or insertions in the sequences; window size: 1 nucleotide (nt); and finally, window step: 10 nt. The last two configurations imply that for each 1 nt window, a phylogenetic tree is produced, then the window is moved by 10 nt, creating a new tree. -\item \textbf{In the third step}, the phylogenetic trees constructed in each sliding window are compared with environmental, climatic, and geographic trees using Robinson-Foulds Distance \citep{robinson_comparison_1981, koshkarov_phylogeography_2022}, normalized Robinson-Foulds Distance, Euclidean Distance, and Least-Squares Distance. These contribute to understanding the correspondence between cumacean genetic sequences and their habitat. 
The results of the metrics used were obtained using the functions leastSquare(tree1, tree2), robinsonFoulds(tree1, tree2), euclideanDist(tree1, tree2) from the \textit{aPhyloGeo} software and were organized by the main function (\autoref{lst:main}). The approach also takes bootstrapping into account. A sliding window allows detailed identification of subtle regions with high rates of genetic mutation. This method requires shifting a fixed-size window over the alignment of genetic sequences, allowing phylogenetic trees to be reconstructed for each part of the sequence. As a result, it recognizes changes in evolutionary relationships along the 16S rRNA mitochondrial gene region of the Cumacea species. This method is critical for determining whether specific gene sequences of cumaceans can be affected by certain ecological or geographic conditions of their habitat (see Figure \ref{fig:fig6} and Figure \ref{fig:fig7}). +\item \textbf{In the third step}, the genetic trees constructed in each sliding window are compared with environmental, climatic, and geographic trees using Robinson-Foulds Distance \citep{robinson_comparison_1981, koshkarov_phylogeography_2022}, normalized Robinson-Foulds Distance, Euclidean Distance, and Least-Squares Distance. These contribute to understanding the correspondence between cumacean genetic sequences and their habitat. The results of the metrics used were obtained using the functions leastSquare(tree1, tree2), robinsonFoulds(tree1, tree2), euclideanDist(tree1, tree2) from the \textit{aPhyloGeo} software and were organized by the main function (\autoref{lst:main}). The approach also takes bootstrapping into account. A sliding-window approach enables the precise localization of subtle sequences with high rates of genetic mutation. This method requires shifting a fixed-size window over the alignment of genetic sequences, allowing phylogenetic trees to be reconstructed for each part of the sequence. 
As a result, it recognizes changes in evolutionary relationships along the 16S rRNA mitochondrial gene region of the Cumacea species. This method is critical for determining whether specific gene sequences of cumaceans can be affected by certain ecological or geographic conditions of their habitat (see Figure \ref{fig:fig6} and Figure \ref{fig:fig7}). \end{enumerate} \subsection{Figure construction} @@ -245,7 +245,7 @@ \subsection{Robinson-Foulds Distance}\label{RF} tree2_newick = ete3.Tree(tree2.format("newick"), format=1) # Calculate the Robinson-Foulds distance - # The method returns the RFD, the maximum possible RFD, and the common leaves (or taxa) between the trees. + # The method returns the RFD, the maximum possible RFD, and the common leaves (i.e., taxa) between the trees. rf, rf_max, common_leaves = tree1_newick.robinson_foulds(tree2_newick, unrooted_trees=True) # If there are no common leaves, set the RFD to 0 @@ -386,7 +386,7 @@ \subsection{Least-Squares Distance}\label{LS} Our phylogeographic study used these four metrics to quantify topological differences between phylogenetic trees and assess dissimilarities between genetic sequences and environmental, geographic, and climatic features. The result is a comprehensive analysis of the evolutionary dynamics of cumacean populations in different contexts. Note that the results in Figure \ref{fig:fig6} and Figure \ref{fig:fig7} were obtained by comparing a tree constructed from habitat attribute 1 with all other trees constructed from the other attributes (2, 3, 4, 5, ...). \section{Results}\label{results} -The violon diagrams shown in Figure \ref{fig:fig2} are used to display summary statistics similar to box plots, showing medians (white dots), interquartile ranges (thickened black bars), and the rest of the distributions (thin black lines), except the "extreme" points. Wider areas indicate a greater probability of the attributes taking a given value. 
They summarize the distribution of geographic (latitude and longitude at the start of sampling (DD)), climatic (wind speed (m/s) at the start of sampling), and environmental (depth (m) at the beginning of sampling, water temperature ($^\circ$C), and O\textsubscript{2} concentration (mg/L)) data. These diagrams are essential for understanding habitat conditions and highlighting unique habitats that can potentially influence cumacean genetic adaptability. +The violin diagrams shown in Figure \ref{fig:fig2} are used to display summary statistics similar to box plots, showing medians (white dots), interquartile ranges (thickened black bars), and the rest of the distributions (thin black lines), except the "extreme" points. Wider areas indicate a greater probability of the attributes taking a given value. They summarize the distribution of geographic (latitude and longitude at the start of sampling (DD)), climatic (wind speed (m/s) at the start of sampling), and environmental (depth (m) at the start of sampling, water temperature ($^\circ$C), and O\textsubscript{2} concentration (mg/L)) data. These diagrams are essential for understanding habitat conditions and highlighting unique habitats that can potentially influence cumacean genetic adaptability. \begin{figure}[htbp] \centering @@ -394,7 +394,7 @@ \section{Results}\label{results} \caption{Violin diagrams of two geographic attributes, one climatic attribute, and three environmental attributes from our sample that provide essential information about the ecological and geographic conditions of cumacean habitats. 
a) Latitude (DD) at the start of sampling (red) suggests that the samples come from two dominant latitudinal regions (around 61.64 and 67.64 DD); b) Longitude (DD) at the start of sampling (yellow) implies that the samples come from two dominant longitudinal regions (around -26.77 and -18.14 DD); c) Depth (m) at the start of sampling (green) suggests that the samples were mainly collected and concentrated at three different depths (around 500, 1500 and 2500 m); d) Wind speed (m/s) at the start of sampling (light blue) indicates stable wind conditions at the start of sampling (around 6 m/s); e) Water temperature ($^\circ$C) (dark blue) suggests that the samples were mostly collected and concentrated at two different temperatures (around 0.06 and 2.66 $^\circ$C); f) O\textsubscript{2} concentration (mg/L) of water (pink) implies that the samples were primarily collected and concentrated at two different O\textsubscript{2} concentrations (around 258.39 and 290.90 mg/L). The mean, median, standard deviation (Std Dev), 1st quartile (Q1), and 3rd quartile (Q3) of the dataset for each attribute are shown in the top right-hand corner of each graph. \label{fig:fig2}} \end{figure} -Our results revealed variability in most cumacean habitat attributes, as shown in Figure \ref{fig:fig2}. For instance, the median latitude at the start of sampling (Figure \ref{fig:fig2}a, 67.15 DD) is higher than the mean (64.83 DD), showing an asymmetric distribution skewed towards lower values. This trend is also observed for depth (m) at the start of sampling (Figure \ref{fig:fig2}c ) and water O\textsubscript{2} concentration (mg/L) (Figure \ref{fig:fig2}f), indicating that sampling took place in regions with varied environmental contexts. The bimodal shape of the latitude distribution curve (Figure \ref{fig:fig2}a) suggests that samples came from two dominant latitudinal regions at the start of sampling. 
This bimodality is reflected in longitude (DD) at the beginning of sampling (Figure \ref{fig:fig2}b), as well as for temperature ($^\circ$C) (Figure \ref{fig:fig2}e) and O\textsubscript{2} concentration (mg/L) of the water (Figure \ref{fig:fig2}f), suggesting fluctuations in climatic and environmental conditions in these areas. +Our results revealed variability in most cumacean habitat attributes, as shown in Figure \ref{fig:fig2}. For instance, the median latitude at the start of sampling (Figure \ref{fig:fig2}a, 67.15 DD) is higher than the mean (64.83 DD), showing an asymmetric distribution skewed towards lower values. This trend is also observed for depth (m) at the start of sampling (Figure \ref{fig:fig2}c) and water O\textsubscript{2} concentration (mg/L) (Figure \ref{fig:fig2}f), indicating that sampling took place in regions with varied environmental contexts. The bimodal shape of the latitude distribution curve (Figure \ref{fig:fig2}a) suggests that samples came from two dominant latitudinal regions at the start of sampling. This bimodality is reflected in longitude (DD) at the start of sampling (Figure \ref{fig:fig2}b), as well as in temperature ($^\circ$C) (Figure \ref{fig:fig2}e) and O\textsubscript{2} concentration (mg/L) of the water (Figure \ref{fig:fig2}f), suggesting fluctuations in climatic and environmental conditions in these areas. The median in Figure \ref{fig:fig2}b (-26.21 DD) is lower than the mean (-23.12 DD), indicating a distribution skewed towards higher values, as is also the case for water temperature ($^\circ$C) (Figure \ref{fig:fig2}e). The standard deviation (5.52 DD) and quartiles Q1 (-26.77 DD) and Q3 (-18.14 DD) indicate a relatively wide spread of the longitude distribution around the mean (-23.12 DD) at the start of sampling (-31.356 to -12.162 DD). This suggests a strong environmental gradient, geographic distribution and sample diversity from east to west in the region studied. 
The standard deviation of Figure \ref{fig:fig2}c is quite high (881.16 m), indicating high variability in sample collection depths (316 – 2568 m) and giving a more global overview of benthic habitats. Unlike all the other diagrams in Figure \ref{fig:fig2}, the curve in Figure \ref{fig:fig2}c has a multimodal shape with three prominent peaks, suggesting that the samples were mainly collected and concentrated at three different depths (around 500, 1500 and 2500 m).
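The skew readings used in this paragraph (a median above the mean read as a tail towards lower values, and a median below the mean read as a tail towards higher values) can be checked numerically. The following is a minimal sketch on synthetic, hypothetical values, not the study data; \texttt{skew\_direction} is an illustrative helper and is not part of \textit{aPhyloGeo}. Comparing mean and median is only a heuristic and can be misleading for strongly multimodal samples.

```python
from statistics import mean, median


def skew_direction(values):
    """Return the side of the longer tail, judged by comparing mean and median.

    Hypothetical helper for illustration only (not an aPhyloGeo function).
    """
    m, med = mean(values), median(values)
    if m < med:
        return "lower"   # mean pulled below the median: tail towards lower values
    if m > med:
        return "higher"  # mean pulled above the median: tail towards higher values
    return "symmetric"


# Synthetic bimodal samples (assumed values, NOT the cumacean dataset).
# Smaller southern cluster, larger northern cluster.
latitudes = [61.6, 61.7, 61.8, 67.0, 67.2, 67.4, 67.6, 67.8]
# Larger western cluster, smaller eastern cluster.
longitudes = [-26.9, -26.8, -26.7, -26.5, -26.3, -18.3, -18.1, -18.0]

for name, sample in (("latitude", latitudes), ("longitude", longitudes)):
    print(name, round(mean(sample), 2), round(median(sample), 2), skew_direction(sample))
```

In this sketch the latitude sample has its median above its mean and the longitude sample has its median below its mean, mirroring the direction of the asymmetries described for Figure \ref{fig:fig2}a and Figure \ref{fig:fig2}b.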