Update main.tex

Corrections step aPhyloGeo
scipy-conference · Sep 15, 2024 · 45cd864
1 parent 2b30a36
commit 45cd864
Showing 1 changed file with 7 additions and 5 deletions.
diff --git a/papers/Gagnon_Kebe_Tahiri/main.tex b/papers/Gagnon_Kebe_Tahiri/main.tex
@@ -35,7 +35,7 @@ \section{Materials and Methods}\label{materials-methods}
 \begin{figure}[htbp]
     \centering
     \includegraphics[width=0.7\textwidth]{diagram.drawio.png}
-    \caption{Flow chart summarizing the Materials and Methods section workflow. Six different colors highlight the blocks. The first block (blue) represents our database. The second block (red) is data pre-processing, where we remove attributes. The third and fourth blocks (orange) implement the \textit{aPhyloGeo} software and its parameters for our phylogeographic analyses. The fifth block (grey) calculates phylogenetic tree comparison distances. The sixth block (yellow) compares the distances between the phylogenetic trees produced. The seventh block (purple) identifies regions with high mutation rates based on the results of the tree comparisons. *See YAML files on \href{https://github.com/tahiri-lab/aPhyloGeo}{GitHub} for more details on these parameters. \label{fig:fig1}}
+    \caption{Flow chart summarizing the Materials and Methods section workflow. Six different colors highlight the seven blocks. The first block (blue) represents our database. The second block (red) is data pre-processing, where we remove attributes. The third and fourth blocks (orange) implement the \textit{aPhyloGeo} software and its parameters for our phylogeographic analyses (see the second step in \autoref{aPhyloGeo-software}). The fifth block (grey) calculates phylogenetic tree comparison distances. The sixth block (yellow) compares the distances between the phylogenetic trees produced. The seventh block (purple) identifies regions with high mutation rates based on the results of the tree comparisons. *See YAML files on \href{https://github.com/tahiri-lab/aPhyloGeo}{GitHub} for more details on these parameters. \label{fig:fig1}}
 \end{figure}
 
 \subsection{Description of the data}
@@ -196,11 +196,13 @@ \subsection{{\textit{aPhyloGeo} software}\label{aPhyloGeo-software}}
 \begin{enumerate}
 \item \textbf{The first step} was to collect cumacean DNA sequences of sufficient quality for our analyses \citep{koshkarov_phylogeography_2022}. In this study, 62 cumacean samples were selected to represent 62 sequences of the 16S rRNA mitochondrial gene. We then included two climatic attributes, namely wind speed (m/s) at the start and end of the sampling; three environmental characteristics, namely depth (m) at the start of sampling, water temperature ($^\circ$C), and O\textsubscript{2} concentration (mg/L); and two geographic variables, latitude (DD) and longitude (DD) at the start of sampling.
 
-\item \textbf{In the second step}, trees were generated separately from biological, spatial, meteorological, and genetic data. Concerning spatial attributes, we calculated the dissimilarity between each pair of cumaceans from distinct spatial conditions \citep{koshkarov_phylogeography_2022}. This produced a symmetrical square matrix \citep{koshkarov_phylogeography_2022}. The {neighbor-joining algorithm}\footnote{It is a method used to construct phylogenetic trees using distance matrices.} was used to build the spatial tree from this matrix \citep{koshkarov_phylogeography_2022}. Each geographic attribute gives rise to a geographic tree. If there are $m$ windows, there will be $m$ geographic trees. The same approach was applied to biological, meteorological, and genetic data. Phylogenetic reconstruction was reiterated to build genetic trees based on 62 16S rRNA mitochondrial sequences, considering only data within a window that progresses along the alignment \citep{koshkarov_phylogeography_2022}. This displacement can vary according to the steps and the size of the window defined by the user (their length is determined by the number of base pairs (bp)) \citep{koshkarov_phylogeography_2022}. 
+\item \textbf{In the second step}, trees were generated separately from biological, spatial, meteorological, and genetic data. Concerning spatial attributes, we calculated the dissimilarity between each pair of cumaceans from distinct spatial conditions \citep{koshkarov_phylogeography_2022}. This produced a symmetrical square matrix \citep{koshkarov_phylogeography_2022}. The {neighbor-joining algorithm}\footnote{It is a method used to construct phylogenetic trees using distance matrices.} was used to build the spatial tree from this matrix \citep{koshkarov_phylogeography_2022}. Each geographic attribute gives rise to a geographic tree. If there are $m$ windows affected by this attribute, there will be $m$ geographic trees. The same approach was applied to biological, meteorological, and genetic data. 
 
-In our case, we set up the \textit{aPhyloGeo} software as follows: 'pairwiseAligner' for sequence alignment; 'Hamming distance' to measure simple dissimilarities between sequences of identical length; 'Wider Fit by elongating with Gap (starAlignment)' algorithm takes alignment gaps into account, which is often mandatory in the case of major deletions or insertions in the sequences; windows size: 1 nucleotide (nt); and finally, window step: 10 nt. The last two configurations imply that for each 1 nt window, a phylogenetic tree is produced using the nucleotide of each cumacean, then the window is moved by 10 nt, creating a new tree. Each window in the alignment will give a genetic tree. If there are $n$ windows, there will be $n$ phylogenetic trees. In turn, genetic trees will be used in an object called $T_1$, while spatial and ecological trees will be used in $T_2$.
+For genetic data, phylogenetic reconstruction was reiterated to build genetic trees based on 62 mitochondrial 16S rRNA sequences, considering only data within a window that progresses along the alignment \citep{koshkarov_phylogeography_2022}. This displacement can vary according to the steps and the size of the window defined by the user (their length is determined by the number of base pairs (bp)) \citep{koshkarov_phylogeography_2022}. 
 
-\item \textbf{In the third step}, the genetic trees constructed in each sliding window are compared with ecosystemic, atmospheric, and spatial trees using Robinson-Foulds Distance \citep{robinson_comparison_1981, koshkarov_phylogeography_2022}, normalized Robinson-Foulds Distance, Euclidean Distance, and Least-Squares Distance. These contribute to understanding the correspondence between cumaceans genetic sequences and their habitat. The approach also takes bootstrapping into account \citep{koshkarov_phylogeography_2022}. The results of the metrics used were obtained using the functions $leastSquare(tree1, tree2)$, $robinsonFoulds(tree1, tree2)$, $euclideanDist(tree1, tree2)$ from the \textit{aPhyloGeo} software and were organized by the main function (\autoref{lst:main}). The metric output tells us which of our attributes has the greatest impact on phylogenetic relationships in our samples, based on the magnitude of the metric distances (see Figure \ref{fig:fig6} and Figure \ref{fig:fig7}). 
+In our case, we set up the \textit{aPhyloGeo} software as follows: \texttt{pairwiseAligner} for sequence alignment; \textit{Hamming distance} to measure simple dissimilarities between sequences of identical length; \textit{All distance methods} to include Least-Squares Distance, Robinson-Foulds Distance, normalized Robinson-Foulds Distance, and Euclidean Distance in our analyses; the \textit{Wider Fit by elongating with Gap (starAlignment)} algorithm, which takes alignment gaps into account, as is often necessary when the sequences contain major deletions or insertions; \texttt{windows\_size}: 1 nucleotide (nt); and finally, \texttt{step\_size}: 10 nt. The last two settings imply that for each 1 nt window, a phylogenetic tree is produced using the nucleotide of each cumacean, then the window is moved by 10 nt, creating a new tree. Each window in the alignment gives one genetic tree: if there are $n$ windows, there will be $n$ phylogenetic trees. Genetic trees will be used in an object called $T_1$, while spatial and ecological trees will be used in $T_2$.
+
+\item \textbf{In the third step}, the genetic trees constructed in each sliding window are compared with ecosystemic, atmospheric, and regional trees using Robinson-Foulds Distance \citep{robinson_comparison_1981, koshkarov_phylogeography_2022}, normalized Robinson-Foulds Distance, Euclidean Distance, and Least-Squares Distance. These contribute to understanding the correspondence between cumacean genetic sequences and their habitat. The approach also takes bootstrapping into account \citep{koshkarov_phylogeography_2022}. The results of these metrics were obtained using the functions \texttt{leastSquare(tree1, tree2)}, \texttt{robinsonFoulds(tree1, tree2)}, and \texttt{euclideanDist(tree1, tree2)} from the \textit{aPhyloGeo} software and were organized by the main function (\autoref{lst:main}). Those for the normalized Robinson-Foulds Distance are obtained by default with the function \texttt{robinsonFoulds(tree1, tree2)}. The metric output tells us which of our attributes has the greatest impact on phylogenetic relationships in our samples, based on the magnitude of the metric distances (see Figure \ref{fig:fig6} and Figure \ref{fig:fig7}). 
 
 In addition to identifying the specific attribute, a sliding-window approach enables the precise localization of subtle sequences with high rates of genetic mutation \citep{koshkarov_phylogeography_2022}. This method requires shifting a fixed-size window over the alignment of genetic sequences, allowing phylogenetic trees to be reconstructed for each part of the sequence. As a result, it recognizes changes in evolutionary relationships along the 16S rRNA mitochondrial gene region of the Cumacea species. This method is critical for determining whether specific gene sequences of cumaceans can be affected by certain ecological or spatial attributes of their habitat (see Figure \ref{fig:fig6} and Figure \ref{fig:fig7}).
 \end{enumerate}
@@ -446,7 +448,7 @@ \section{Results}\label{results}
     \caption{Analysis of fluctuations in four distance metrics using multiple sequence alignment (MSA): a) Least-Squares Distance (LSD), b) Robinson-Foulds Distance (RFD), c) normalized Robinson-Foulds Distance (nRFD), and d) Euclidean Distance (ED). These distance variations are studied to establish their correlation with variation of O\textsubscript{2} concentration (mg/L) at the sampling sites. \label{fig:fig7}}
 \end{figure}
 
-The correlation between the genetic sequences and two attributes, one climatic (wind speed (m/s) at the start of sampling) and the other environmental (O\textsubscript{2} concentration (mg/L)) is presented in Figure \ref{fig:fig6} and Figure \ref{fig:fig7}. All the attributes given in the first step of the \textit{aPhyloGeo} software section (\autoref{aPhyloGeo-software}) were analyzed and their script and figure will be soon available in the $img$ and $script$ python file on \href{https://github.com/tahiri-lab/Cumacea_aPhyloGeo}{GitHub}. However, only these two attributes showed the most interesting mutation rate. Using four metrics (LSD, RFD, nRFD, and ED), we noticed that the ED is particularly sensitive to our data, manifesting considerable sequence variation at the position in MSA 520-529 amino acids (aa) (ED: 0.8 < x < 0.9; Figure \ref{fig:fig6}d) and 1190-199 aa (ED: 1.2 < x < 1.3; Figure \ref{fig:fig7}d). This implies that these genetic sites are subject to selection pressures or evolutionary changes, due to biological (O\textsubscript{2} concentration (mg/L)) and meteorological conditions (wind speed (m/s) at the start of sampling). These results align with our study's aim to identify the genetic region of cumaceans with the highest mutation rate linked to a specific habitat attribute.
+The correlation between the genetic sequences and two attributes, one climatic (wind speed (m/s) at the start of sampling) and the other environmental (O\textsubscript{2} concentration (mg/L)), is presented in Figure \ref{fig:fig6} and Figure \ref{fig:fig7}. All the attributes given in the first step of the \autoref{aPhyloGeo-software} section were analyzed, and their scripts and figures will soon be available in the \texttt{img} and \texttt{script} Python files on \href{https://github.com/tahiri-lab/Cumacea_aPhyloGeo}{GitHub}. However, these two attributes showed the most interesting mutation rates. Using four metrics (LSD, RFD, nRFD, and ED), we noticed that the ED is particularly sensitive to our data, manifesting considerable sequence variation at MSA positions 520--529 amino acids (aa) (ED: $0.8 < x < 0.9$; Figure \ref{fig:fig6}d) and 1190--1199 aa (ED: $1.2 < x < 1.3$; Figure \ref{fig:fig7}d). This implies that these genetic sites are subject to selection pressures or evolutionary changes, due to biological (O\textsubscript{2} concentration (mg/L)) and meteorological conditions (wind speed (m/s) at the start of sampling). These results align with our study's aim to identify the genetic region of cumaceans with the highest mutation rate linked to a specific habitat attribute.
 
 These results provide important insight into the genetic adaptation of cumaceans to their environment. They need to be analyzed in greater depth to confirm this involvement, especially in comparison with \citet{uhlir_adding_2021}, who investigated similar topics of environmental and climatic effects on cumacean distribution and genetics. A more in-depth analysis of the results is available on \href{https://github.com/tahiri-lab/Cumacea_aPhyloGeo}{GitHub} in the supplementary file.
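
The sliding-window configuration described in the second step (window size of 1 nt, step of 10 nt, Hamming distance) can be illustrated with a minimal Python sketch. This is not the \textit{aPhyloGeo} implementation: the function names `window_starts`, `hamming_distance`, and `window_distance_matrices`, as well as the toy alignment, are hypothetical and chosen only for illustration. It assumes the sequences are already aligned to the same length and shows how $n$ windows give $n$ pairwise distance matrices, each of which could then be passed to a tree-building method such as neighbor joining.

```python
from itertools import combinations


def window_starts(alignment_length, window_size, step_size):
    """Return the 0-based start position of every full window."""
    last_start = alignment_length - window_size
    return list(range(0, last_start + 1, step_size))


def hamming_distance(seq_a, seq_b):
    """Count the positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("Hamming distance needs sequences of identical length")
    return sum(a != b for a, b in zip(seq_a, seq_b))


def window_distance_matrices(alignment, window_size, step_size):
    """Build one symmetric distance matrix per window, keyed by window start."""
    names = sorted(alignment)
    length = len(alignment[names[0]])
    matrices = {}
    for start in window_starts(length, window_size, step_size):
        end = start + window_size
        # Initialise a square matrix with zeros on the diagonal.
        matrix = {name: {other: 0 for other in names} for name in names}
        for a, b in combinations(names, 2):
            d = hamming_distance(alignment[a][start:end], alignment[b][start:end])
            matrix[a][b] = matrix[b][a] = d
        matrices[start] = matrix
    return matrices


# Hypothetical toy alignment (21 aligned positions, 3 samples).
toy_alignment = {
    "sample_A": "ACGTACGTACGTACGTACGTA",
    "sample_B": "ACGTACGTACCTACGTACGTT",
    "sample_C": "TCGTACGTACGTACGTACGTA",
}

# Same settings as in the text: 1 nt windows moved by 10 nt.
matrices = window_distance_matrices(toy_alignment, window_size=1, step_size=10)
```

With these settings the toy alignment gives three windows (starting at positions 0, 10, and 20), hence three distance matrices and, in the workflow described above, three genetic trees.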