Leveraging AI and Machine Learning in DNA Sequencing for Tree Phylogenetics

Scientists have been able to uncover evolutionary relationships between plant species using DNA sequencing. Studying these relationships improves our understanding of biodiversity and conservation biology. It can also help us target conservation efforts, detect invasive species, and understand the effects of climate change. Today, artificial intelligence plays a crucial role in this work. It assists with read alignment, variant detection and analysis, and genome assembly, along with other functions that we will explore throughout the article.

Phylogenetic trees allow scientists to group plants (in this article, trees) according to their genetic makeup and evolutionary relationships. These diagrams are composed of nodes connected by branches. On a family tree, each of your parents would be a node, and a branch connects your node to theirs. In a phylogenetic tree, “each node represents a common ancestor shared by two or more terminal taxa” (Hermsen & Hendricks, 2019). A terminal taxon (plural: taxa) is a leaf on a phylogenetic tree that represents a current species, population, or group being studied. Tracing back from two leaves, their branches meet at a node, the point where the two species share a common ancestor.

For better comprehension, think about the base of a tree. The trunk represents the first ancestor from which all the other branches, or species, descend. Looking up at the tree, you see larger branches. These represent the first major genetic splits in that lineage. Further up the tree, you’ll see more branches, representing the different species that have split off since. Finally, you get to the leaves, which are the terminal taxa, or the species in existence today.
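To make the trunk, branch, and leaf picture concrete, here is a minimal sketch of how such a tree might be represented in code. The `Node` class, the helper functions, and the toy tree are all assumptions made for illustration; they are not taken from any of the cited sources, and the grouping shown is only an example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """A node in a phylogenetic tree; a node without children is a terminal taxon."""
    name: str
    children: List["Node"] = field(default_factory=list)


def terminal_taxa(node: Node) -> List[str]:
    """Collect the names of all leaves (terminal taxa) at or below this node."""
    if not node.children:
        return [node.name]
    leaves: List[str] = []
    for child in node.children:
        leaves.extend(terminal_taxa(child))
    return leaves


def common_ancestor(node: Node, a: str, b: str) -> Optional[Node]:
    """Return the deepest node whose leaves include both a and b, or None."""
    leaves = terminal_taxa(node)
    if a not in leaves or b not in leaves:
        return None
    # If one child already contains both taxa, the shared ancestor is deeper.
    for child in node.children:
        found = common_ancestor(child, a, b)
        if found is not None:
            return found
    return node


# Hypothetical toy tree (illustrative only): the root is the "trunk".
tree = Node("root", [
    Node("ancestor_1", [Node("oak"), Node("maple")]),
    Node("pine"),
])

print(terminal_taxa(tree))
print(common_ancestor(tree, "oak", "maple").name)
print(common_ancestor(tree, "oak", "pine").name)
```

In this toy tree, oak and maple meet at `ancestor_1`, while oak and pine only meet at the root, which mirrors the idea that a node marks the point where two species relate.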

So, how does DNA sequencing play a role in all this? First, we must understand what DNA sequencing is. Per the National Human Genome Research Institute, “sequencing DNA means determining the order of the four chemical building blocks – called ‘bases’ – that make up the DNA molecule. The sequence tells scientists the kind of genetic information that is carried in a particular DNA segment” (Genome.gov, n.d.). With this knowledge, scientists can identify similarities and differences between species, allowing them to create phylogenetic trees. How closely or distantly two species are related is estimated by comparing their DNA sequences.
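As a rough illustration of what comparing DNA sequences can mean in practice, the sketch below computes the proportion of positions at which two aligned sequences differ (often called the p-distance). The function name and the short sequences are made up for this example, and real analyses typically work on much longer alignments with more sophisticated substitution models.

```python
from itertools import combinations


def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of aligned positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    mismatches = sum(1 for x, y in zip(seq_a, seq_b) if x != y)
    return mismatches / len(seq_a)


# Made-up, already aligned fragments (assumption: not real data).
sequences = {
    "species_a": "ACGTACGTAC",
    "species_b": "ACGTACGTAT",
    "species_c": "ACGAACCTGT",
}

# Compare every pair; a smaller distance suggests a closer relationship.
for (name_1, seq_1), (name_2, seq_2) in combinations(sequences.items(), 2):
    print(name_1, name_2, p_distance(seq_1, seq_2))
```

Here `species_a` and `species_b` differ at only one of ten positions, so under this simple measure they would be treated as more closely related to each other than either is to `species_c`.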

DNA sequencing can be difficult for many reasons, including error correction, data management, and the complexity of DNA itself. In recent years, however, artificial intelligence has helped mitigate some of those problems. AI and machine learning (ML) algorithms can identify and correct sequencing errors. A study titled “Machine learning empowered next generation DNA sequencing: perspective and prospectus” by Sneha Mittal et al. discusses how machine learning algorithms hold promise for “high throughput DNA sequencing at the single nucleotide level” (Mittal et al., 2024). The paper explains, “ML-aided DNA sequencing is especially appealing, as ML has the potential to decipher complex patterns and extract knowledge from complex datasets” (Mittal et al., 2024).

Another study titled “Machine learning: A powerful tool for gene function prediction in plants” by Elizabeth H. Mahood et al. details different ML models and how they are used in the context of plant species. According to this paper, there are two categories of ML algorithms: supervised and unsupervised. Supervised algorithms “are frequently used for the purposes of binary/multi-class classification of test instances or numerical prediction of the trait values (regression) and require explicit definitions of labels, while unsupervised methods… are label-free and are primarily used for clustering and feature extraction” (Mahood et al., 2020).
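The difference between the two categories can be sketched with a small example. The code below uses a nearest-centroid classifier as a stand-in for a supervised method and a basic k-means loop as a stand-in for an unsupervised one. Both were chosen for brevity and are not the specific models discussed by Mahood et al.; the feature values and the labels are invented for illustration.

```python
from math import dist
from statistics import mean


def centroid(points):
    """Component-wise mean of a list of equal-length numeric tuples."""
    return tuple(mean(axis) for axis in zip(*points))


def fit_nearest_centroid(points, labels):
    """Supervised: use the given labels to compute one centroid per class."""
    by_label = {}
    for point, label in zip(points, labels):
        by_label.setdefault(label, []).append(point)
    return {label: centroid(group) for label, group in by_label.items()}


def predict(centroids, point):
    """Assign a new point to the class whose centroid is closest."""
    return min(centroids, key=lambda label: dist(centroids[label], point))


def k_means(points, k, iterations=10):
    """Unsupervised: group points into k clusters without using any labels."""
    centers = list(points[:k])  # naive initialisation from the first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for point in points:
            nearest = min(range(k), key=lambda i: dist(centers[i], point))
            clusters[nearest].append(point)
        # Keep the previous center if a cluster ends up empty.
        centers = [centroid(c) if c else centers[i] for i, c in enumerate(clusters)]
    return clusters


# Hypothetical two-feature measurements and labels (not real data).
points = [(1.0, 1.2), (5.0, 5.1), (0.8, 1.0), (5.2, 4.9)]
labels = ["conifer", "broadleaf", "conifer", "broadleaf"]

centroids = fit_nearest_centroid(points, labels)
print(predict(centroids, (4.8, 5.0)))  # supervised: returns a known label
print(k_means(points, 2))              # unsupervised: returns unlabeled groups
```

The supervised half needs the `labels` list to learn anything, while the unsupervised half finds the same two groups from the numbers alone but cannot name them.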

Artificial neural networks can “extract features from the training data by themselves” (Mahood et al., 2020). This is extremely helpful because discovering these features from the data is fully automated. The paper also notes that the ability of machine learning to “integrate large volumes of heterogeneous data may improve its accuracy over those of non-machine learning methods” (Mahood et al., 2020). For a more comprehensive explanation of how ML is being utilized, see here. So what’s in it for the trees?

Scientists use DNA sequencing to determine the relationships among different tree species. This allows them to create a phylogenetic tree (ironically, a tree of trees), which gives a better understanding of how trees interact with the environment and with one another. With this knowledge, we can make targeted improvements to our environment. By identifying closely related tree species, we can prioritize conservation efforts for a species whose ancestor struggled with similar environmental challenges, whether from the climate or from invasive species.

Another benefit of understanding trees’ relationships is the preservation of genetic diversity within forests. How? While phylogenetic trees help visualize and document relationships among species, they are also helpful in discovering distantly related species. This allows scientists to select tree species representing different lineages and include them in forest restoration efforts, maintaining genetic diversity and resilience. DNA sequencing also helps to identify and preserve species with unique genetic traits that may be crucial for adapting to a changing environment.
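One way to picture selecting species from different lineages is a greedy "farthest-point" rule: start with the two most distant species, then keep adding whichever remaining species is least similar to everything already chosen. The sketch below is a simple heuristic written for this article, not an established restoration protocol, and the `pick_diverse` function and all distance values are hypothetical.

```python
def pick_diverse(distances, count):
    """Greedily pick `count` species that are far apart from one another.

    `distances` maps a pair of species names to a genetic distance and is
    assumed to contain every pair exactly once, in either order.
    """
    species = sorted({name for pair in distances for name in pair})
    if not 2 <= count <= len(species):
        raise ValueError("count must be between 2 and the number of species")

    def d(a, b):
        return distances[(a, b)] if (a, b) in distances else distances[(b, a)]

    # Start with the most distantly related pair.
    chosen = list(max(distances, key=distances.get))
    while len(chosen) < count:
        remaining = [s for s in species if s not in chosen]
        # Farthest-point rule: maximise the distance to the closest chosen species.
        best = max(remaining, key=lambda s: min(d(s, c) for c in chosen))
        chosen.append(best)
    return chosen


# Hypothetical pairwise distances (illustrative values only).
distances = {
    ("oak", "maple"): 0.30,
    ("oak", "pine"): 0.60,
    ("oak", "spruce"): 0.62,
    ("maple", "pine"): 0.58,
    ("maple", "spruce"): 0.61,
    ("pine", "spruce"): 0.10,
}

print(pick_diverse(distances, 3))
```

With these made-up numbers, pine is passed over for the third slot because it sits very close to spruce, which is already in the selection, so maple adds more diversity.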

You may be asking yourself how important the diversity of tree species in an area really is. Even on a small scale, the answer is that it is very important. A study titled “The significance of tree-tree interactions for forest ecosystem functioning” by Stefan Trogisch et al. discusses this very topic. The paper states, “Our guiding hypothesis is that positive biodiversity effects at the community level emerge from the dominance of positive over negative tree-tree interactions at the neighborhood level” (Trogisch et al., 2021). In short, the authors hypothesize that the positive effects of biodiversity in a community arise because positive interactions between neighboring trees outweigh the negative ones. This suggests that, on balance, different trees help rather than hurt one another.

One other benefit of studying relationships among trees is the ability to discover tree combinations that maximize carbon capture. For instance, say a pine and an oak can together capture 100 pounds of carbon from the atmosphere. After studying phylogenetic trees built from DNA sequencing, scientists may find that an oak and a maple can together capture 200 pounds. Choosing such combinations will help maintain the Earth’s temperature, which is crucial for our survival. Read “Beyond Traditional Reforestation” on SnoQap for more information on that topic.

With the help of artificial intelligence and machine learning algorithms, scientists can infer relationships between tree species more accurately and can analyze larger datasets than traditional methods allow. Understanding trees and their relationships to one another will allow for better reforestation methods, ecosystem monitoring, management of ecosystem health, and responses to climate change. While the study of AI and ML use in DNA sequencing is still in its infancy, there is great promise for future developments in this field.


Works Cited

DNA sequencing fact sheet. Genome.gov. (n.d.). https://www.genome.gov/about-genomics/fact-sheets/DNA-Sequencing-Fact-Sheet 

Hermsen, E. J., & Hendricks, J. R. (2019, October 15). 2.1 Reading trees. Digital Atlas of Ancient Life. https://www.digitalatlasofancientlife.org/learn/systematics/phylogenetics/reading-trees/#:~:text=The%20branches%20are%20the%20line,taxa%2C%20branches%2C%20and%20nodes

Mahood, E. H., Kruse, L. H., & Moghe, G. D. (2020, July 28). Machine learning: A powerful tool for gene function prediction in plants. Applications in Plant Sciences. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7394712/

Mittal, S., Jena, M. K., & Pathak, B. (2024, July 8). Machine learning empowered next generation DNA sequencing: Perspective and prospectus. Chemical Science. https://pubs.rsc.org/en/content/articlehtml/2024/sc/d4sc01714e 

Trogisch, S., et al. (2021, February 8). The significance of tree-tree interactions for forest ecosystem functioning. Basic and Applied Ecology. https://www.sciencedirect.com/science/article/pii/S1439179121000256#:~:text=In%20particular%2C%20a%20better%20understanding,et%20al.%2C%202020
