A genome, an organism’s complete set of genetic instructions, is very large. The human genome has about 3 billion base pairs and contains roughly 20,000 protein-coding genes. Driven by lower sequencing costs, there is more biological data from genomic sequences, molecular pathways, and diverse populations than ever before. To handle the data boom and make sense of the information, artificial intelligence models are coming to the rescue!
AI can analyze and interpret large and complex datasets to reveal patterns, correlations, and insights that would otherwise remain unknown. These insights can help discover new antibiotics, develop drugs, and make agriculture more sustainable.
WHAT’S BIOINFORMATICS? WHAT’S AI?
The development of methods and software tools to understand, analyze, and interpret biological data is known as bioinformatics. Using mathematical and statistical techniques on large and complex data sets, bioinformatics combines the principles of biology, computer science, information engineering, mathematics, and statistics.
In bioinformatics, researchers identify specific genes (sequences of nucleotides in DNA or RNA) and single nucleotide polymorphisms (substitutions of a single nucleotide in the genome) in order to understand the genetics of disease, unique adaptations, desirable properties, or differences between populations.
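To make the idea of a single nucleotide polymorphism concrete, here is a minimal sketch that lists the positions where a sample differs from a reference by one base. It assumes the two sequences are already aligned and the same length; the function name `find_snps` and the toy sequences are made up for illustration, and real variant calling involves alignment and quality filtering that this skips.

```python
def find_snps(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """Return (position, reference_base, sample_base) for every mismatch.

    Assumption: both sequences are already aligned and of equal length.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (position, ref_base, alt_base)
        for position, (ref_base, alt_base) in enumerate(zip(reference, sample))
        if ref_base != alt_base
    ]


# Hypothetical toy sequences (not real genomic data).
reference = "ATGCGTACGT"
sample = "ATGCATACGT"
snps = find_snps(reference, sample)
print(snps)
```

Positions are 0-based here; genomic file formats often use 1-based coordinates, so a conversion would be needed before comparing with real annotations.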
Artificial intelligence (AI) is an extremely popular branch of computer science and an umbrella term for any computer program that mimics human intelligence. Whether it’s YouTube recommendations or Siri replies, self-driving cars or chatbots, AI systems have risen to prominence and infiltrated our day-to-day lives.
Thanks to machine learning (ML), a subset of AI, systems can automatically learn and improve from experience. From sample data (also known as training data), a model is built to make predictions and decisions. Rather than being explicitly programmed for each task, computers using ML discover patterns in large data sets and learn from them. Faster computers, algorithmic improvements, and access to large amounts of data have enabled advances in ML tasks.
Artificial neural networks are loosely inspired by the biological neural networks in our brains and can figure out for themselves which factors in the training data are the most relevant. Simple neural networks have very few hidden layers between the input and output layers, whereas deep learning architectures stack many layers (hence “deep”).
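As a rough illustration of "learning from training data", the sketch below fits a straight line to a handful of points with gradient descent. Nothing here comes from a specific bioinformatics tool: the data, the learning rate, and the function names are assumptions chosen to keep the example tiny. The point is only that the program is never told the rule; it adjusts two numbers until its predictions match the examples.

```python
def mean_squared_error(w, b, xs, ys):
    """Average squared gap between predictions (w*x + b) and targets."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def train(xs, ys, learning_rate=0.05, epochs=2000):
    """Fit y = w*x + b by repeatedly stepping against the error gradient."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Hypothetical training data generated from the hidden rule y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

error_before = mean_squared_error(0.0, 0.0, xs, ys)
w, b = train(xs, ys)
error_after = mean_squared_error(w, b, xs, ys)
print(f"w={w:.3f}, b={b:.3f}, error {error_before:.2f} -> {error_after:.6f}")
```

Real ML models have far more parameters than `w` and `b`, but the loop is the same in spirit: predict, measure the error, nudge the parameters, repeat.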
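The notion of layers can be sketched in a few lines. The example below pushes an input through a network described as a list of layers, so a "deeper" network is simply a longer list. The weights are fixed, made-up numbers (a real network would learn them from training data), and the ReLU activation is one common choice among several.

```python
def dense(inputs, weights, biases):
    """One layer: each output is a weighted sum of the inputs plus a bias."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + bias
        for row, bias in zip(weights, biases)
    ]


def relu(values):
    """Activation function: negative values are clipped to zero."""
    return [max(0.0, v) for v in values]


def forward(inputs, layers):
    """Pass inputs through every layer; ReLU is applied to hidden layers only."""
    activations = inputs
    for index, (weights, biases) in enumerate(layers):
        activations = dense(activations, weights, biases)
        if index < len(layers) - 1:
            activations = relu(activations)
    return activations


# Hypothetical weights: one hidden layer (2 units) and one output unit.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),  # hidden layer
    ([[1.0, 1.0]], [0.1]),                    # output layer
]
output = forward([2.0, 1.0], layers)
print(output)
```

Adding more `(weights, biases)` pairs to `layers` is all it takes to make this toy network deeper.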
Genomics — The study of genomes
While the growth in the size and number of raw biological datasets has been exponential, the actual interpretation of this data is much slower. ML systems are playing a role in determining the location of genes in sequences, a process known as gene prediction. It’s part of genome annotation, a larger process of identifying the locations of genes and coding regions in a genome and determining their function. A newly sequenced genome needs to be annotated to be understood.
How is this done? 🤔
An input DNA sequence is compared against a database of sequences whose genes have already been discovered and their locations annotated. In this way, homology (similarity due to shared ancestry) is determined between the input sequence and the known ones. When that approach can’t identify all the genes in the input sequence because too few known, annotated gene sequences exist, gene prediction programs attempt to identify the remaining genes from the DNA sequence alone.
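Both routes can be sketched in miniature. The first function below scores similarity by the fraction of shared k-mers (short substrings), which is a crude stand-in for the sequence alignment that real homology searches rely on. The second scans for open reading frames (a start codon followed in-frame by a stop codon), one of the simplest signals a sequence-only gene predictor can use. The database contents, names, and thresholds are all hypothetical, and only the forward strand is scanned.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def kmers(sequence, k=4):
    """Set of all substrings of length k."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def best_match(query, annotated, k=4):
    """Return (name, score) of the annotated sequence most similar to query.

    Similarity is the Jaccard index of k-mer sets, not a true alignment.
    """
    query_kmers = kmers(query, k)

    def jaccard(name):
        reference_kmers = kmers(annotated[name], k)
        union = query_kmers | reference_kmers
        return len(query_kmers & reference_kmers) / len(union) if union else 0.0

    name = max(annotated, key=jaccard)
    return name, jaccard(name)


def find_orfs(sequence, min_codons=3):
    """Return (start, end) spans running from ATG to an in-frame stop codon."""
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(sequence) - 2, 3):
            codon = sequence[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs


# Hypothetical annotated database and query (toy sequences).
annotated = {"geneA": "ATGAAACCCGGGTAA", "geneB": "ATGTTTTTTTTTTAG"}
match_name, match_score = best_match("ATGAAACCCGGCTAA", annotated)
orfs = find_orfs("CCATGAAACCCGGGTAACC")
print(match_name, match_score, orfs)
```

Production gene finders combine many more signals than this, such as codon usage and splice sites, but the split between "look it up" and "predict it from the sequence" is the same.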
Proteomics — The study of proteins
Capable of predicting how proteins fold by analyzing their amino acid sequence, ML programs (like AlphaFold) are being developed for protein structure prediction tasks. Researchers are interested in the way proteins fold because their function largely depends on their three-dimensional structure.
Microarrays — A lab tool used to simultaneously detect the expression of thousands of genes
ML can analyze microarray data to monitor the expression of genes across a genome, enabling disease diagnosis based on which genes are expressed.
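One simple way to picture diagnosis from expression data is a nearest-centroid classifier: average the expression profiles of each labeled group, then assign a new sample to the group whose average it sits closest to. The sketch below does that with invented numbers for three genes. It is an illustration of the idea only; the labels, values, and function names are assumptions, and real microarray analysis involves normalization and thousands of genes.

```python
from collections import defaultdict


def fit_centroids(profiles, labels):
    """Average the expression profiles of each label, gene by gene."""
    grouped = defaultdict(list)
    for profile, label in zip(profiles, labels):
        grouped[label].append(profile)
    return {
        label: [sum(values) / len(values) for values in zip(*rows)]
        for label, rows in grouped.items()
    }


def squared_distance(a, b):
    """Squared Euclidean distance between two expression profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def classify(centroids, profile):
    """Return the label whose centroid is closest to the given profile."""
    return min(centroids, key=lambda label: squared_distance(centroids[label], profile))


# Hypothetical expression levels for three genes in four labeled samples.
training_profiles = [
    [8.1, 2.0, 5.5],
    [7.9, 2.3, 5.1],
    [3.0, 6.8, 5.3],
    [3.4, 7.1, 5.6],
]
training_labels = ["healthy", "healthy", "disease", "disease"]

centroids = fit_centroids(training_profiles, training_labels)
prediction = classify(centroids, [3.2, 6.5, 5.0])
print(prediction)
```

In this toy data the third gene barely differs between groups, so it contributes little to the decision; the first two genes carry the signal.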
Systems biology — Analysis or modeling of complex biological systems
The focus of systems biology is on the behaviours that emerge from complex interactions of biological components, such as DNA, RNA, proteins, and metabolites, in a system. Probabilistic graphical models, which represent the dependency structure among the components, are commonly used to model genetic regulatory networks. Other ML techniques address systems biology problems by identifying transcription factor binding sites (DNA sequences where transcription factors and other DNA-binding molecules may bind), by analyzing disease biomarkers, and by predicting the function of enzymes and proteins.
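Binding-site identification is often introduced with a position weight matrix (PWM): count which base appears at each position across known sites, turn the counts into log-odds scores, and slide the matrix along a new sequence to find the best-scoring window. The sketch below follows that recipe. The example sites, the uniform 0.25 background, and the pseudocount are assumptions made for illustration, not values taken from any real transcription factor dataset.

```python
import math


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build per-position log-odds scores from equal-length example sites."""
    pwm = []
    for position in range(len(sites[0])):
        column = [site[position] for site in sites]
        total = len(column) + 4 * pseudocount
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm


def score_window(sequence, pwm, start):
    """Sum the log-odds scores of the window beginning at `start`."""
    return sum(pwm[offset][sequence[start + offset]] for offset in range(len(pwm)))


def best_site(sequence, pwm):
    """Return (start, score) of the highest-scoring window in the sequence."""
    starts = range(len(sequence) - len(pwm) + 1)
    start = max(starts, key=lambda s: score_window(sequence, pwm, s))
    return start, score_window(sequence, pwm, start)


# Hypothetical known binding sites and a sequence to scan.
known_sites = ["TATAAT", "TATAAT", "TATGAT", "TACAAT"]
pwm = build_pwm(known_sites)
start, score = best_site("GGCGTATAATGCGC", pwm)
print(start, round(score, 2))
```

A fixed score threshold would normally decide whether the best window counts as a predicted site; choosing that threshold is where trained models tend to replace hand-tuned rules.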