An Introduction to Data Mining & Its Applications in Bioinformatics

With the growing importance of bioinformatics in agriculture, molecular medicine, microbial genome applications, and other areas, research in this field has gained more momentum than ever before. Bioinformatics, also known as computational biology, deals with interpreting biological data using computer science and information technology. Lately, research in bioinformatics has produced vast amounts of data, and it will continue to generate genomic, proteomic, and other kinds of data. Gaining deep insight into such biological data requires drawing inferences from it, for instance through gene classification, protein structure prediction, clustering of gene expression data, and analysis of protein-protein interactions. These tasks, in turn, increase the need for interaction between bioinformatics and data mining.

Data mining, also regarded as Knowledge Discovery in Databases (KDD), is the automated extraction of patterns that represent the knowledge stored or captured in large data sets. The KDD process includes steps such as data collection, selection, transformation, visualization, and assessment of the extracted knowledge.

The tasks involved in mining precise and meaningful data patterns are:

  • Classification – This involves learning a function that assigns input data items to predefined classes.
  • Estimation – This assigns a value to an unknown variable based on the given input data.
  • Prediction – This builds on classification and estimation, but the data is classified according to an estimated future value.
  • Association rules – Also called dependency modeling, this identifies items that are associated in the data and the outcomes they are likely to lead to.
  • Clustering – This involves segmenting the population into clusters or subgroups of similar items.
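To make the association rule task more concrete, here is a minimal sketch of how the support and confidence of a rule could be computed. The gene names and the sample data are made up for illustration, and the helper functions `support` and `confidence` are hypothetical names, not part of any particular mining system.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) from the transactions."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    return support(transactions, set(antecedent) | set(consequent)) / base


# Hypothetical toy data: each set lists the genes found active in one sample.
samples = [
    {"geneA", "geneB"},
    {"geneA", "geneB", "geneC"},
    {"geneA"},
    {"geneB", "geneC"},
]

rule_support = support(samples, {"geneA", "geneB"})
rule_confidence = confidence(samples, {"geneA"}, {"geneB"})
print(f"support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

A rule such as "geneA active implies geneB active" is usually kept only if both values exceed thresholds chosen by the analyst.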

In bioinformatics, data mining leverages genetic algorithms and other techniques drawn from machine learning, statistics, databases, and artificial intelligence. Additionally, general-purpose mining systems such as SPSS, SAS Enterprise Miner, S-PLUS, Microsoft SQL Server 2000, SGI MineSet, and IBM Intelligent Miner can be used to mine biological data.

So, what is the need for data mining in bioinformatics?

Biologists, after rigorous studies of the human genome, have determined the genetic information carried in human cells. Understanding this genetic information enables biologists to study biological processes such as growth and aging, trace human evolution in relation to other species on the planet, and investigate the nature of diseases and their cures. The applications of data mining in bioinformatics include:

  • Sequence analysis 

DNA sequence analysis is one of the most crucial operations in computational biology. It requires predicting genes in uncharacterized genomic sequences. The goal is to partition a newly sequenced, uncharacterized genome into exons, introns, and repetitive DNA sequences. It also involves determining which parts of the biological sequences differ during genome mapping. To achieve this, several data mining techniques such as neural networks, Bayesian classifiers, support vector machines, and decision trees can be employed. While sequence recognition algorithms each involve performance tradeoffs, clustering algorithms are used to organize structurally related DNA sequences.
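Classifiers such as the ones listed above work on numeric features rather than raw sequence strings. One common way to obtain such features is to count k-mers (substrings of length k). The sketch below is one possible illustration and not a method prescribed by this article; the function names and the example sequence are hypothetical.

```python
from collections import Counter
from itertools import product


def kmer_counts(sequence, k=2):
    """Count every overlapping substring of length k in the sequence."""
    seq = sequence.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_vector(sequence, k=2, alphabet="ACGT"):
    """Return k-mer frequencies in a fixed order, usable as classifier input."""
    counts = kmer_counts(sequence, k)
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


# Hypothetical short fragment; real inputs would be far longer.
fragment = "ACGTAC"
print(kmer_counts(fragment, k=2))
print(len(kmer_vector(fragment, k=2)))  # one entry per possible 2-mer
```

Vectors built this way have the same length for every sequence, so they can be passed to a neural network, a support vector machine, or a decision tree.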

  • Gene expression analysis 

The most commonly used approach here is clustering, which is divided into one-way and two-way methods. One-way clustering groups either samples with similar expression or genes with similar behavior, whereas two-way clustering groups genes and samples simultaneously. Another approach used in microarray analysis is classification. It is used to discriminate between diseases, predict outcomes based on gene expression patterns, and determine the treatment suited to a genetic signature.
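As an illustration of one-way clustering, the sketch below groups genes by their expression profiles using a basic k-means loop. k-means is only one of many clustering algorithms and is chosen here for brevity. The expression values are invented, and the centroids are initialized from the first k profiles to keep the example deterministic, which is a simplification.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_profile(profiles):
    """Column-wise mean of a non-empty list of profiles."""
    n = len(profiles)
    return [sum(col) / n for col in zip(*profiles)]


def kmeans(profiles, k, iterations=20):
    """Basic k-means; returns a cluster label per profile and the centroids."""
    # Simplification: use the first k profiles as the starting centroids.
    centroids = [list(p) for p in profiles[:k]]
    labels = [0] * len(profiles)
    for _ in range(iterations):
        # Assign each profile to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in profiles
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:
                centroids[c] = mean_profile(members)
    return labels, centroids


# Hypothetical expression levels of four genes across three samples.
expression = {
    "gene1": [1.0, 1.2, 0.9],
    "gene2": [5.0, 5.1, 4.8],
    "gene3": [1.1, 0.9, 1.0],
    "gene4": [4.9, 5.2, 5.0],
}
genes = list(expression)
labels, centroids = kmeans([expression[g] for g in genes], k=2)
for gene, label in zip(genes, labels):
    print(gene, "-> cluster", label)
```

Clustering the samples instead of the genes only requires transposing the data, while two-way clustering needs an algorithm that groups rows and columns together.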

  • Analysis of proteomics  

Identifying protein modification sites plays a major role in bioinformatics. However, a simple database scan often cannot determine these sites, which is where neural networks come into use. Neural networks can also be used to identify active sites, predict protein localization, and estimate protein properties such as stability, globularity, and shape. In recent times, data mining techniques have also been applied to predict the secondary structure of proteins. Besides these approaches, techniques based on Bayes' theorem, information theory, and density-based clustering can be leveraged to study protein interactions.
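To show how Bayes' theorem can enter the study of protein interactions, the sketch below updates the probability that two proteins interact after a piece of supporting evidence is observed. All probabilities are invented for illustration, and the function name `posterior` is hypothetical.

```python
def posterior(prior, p_evidence_if_true, p_evidence_if_false):
    """Bayes' theorem: P(interaction | evidence) for a binary hypothesis."""
    numerator = p_evidence_if_true * prior
    evidence = numerator + p_evidence_if_false * (1.0 - prior)
    if evidence == 0:
        raise ValueError("evidence has zero probability under both hypotheses")
    return numerator / evidence


# Made-up numbers: 1% of protein pairs interact; the evidence (for example,
# co-expression) shows up for 80% of interacting pairs and 5% of the rest.
p = posterior(prior=0.01, p_evidence_if_true=0.8, p_evidence_if_false=0.05)
print(f"P(interaction | evidence) = {p:.3f}")
```

Because true interactions are rare, a single piece of evidence leaves the posterior fairly low, which is why several independent sources of evidence are usually combined.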

Data mining and bioinformatics are growing together as an interdisciplinary science. Collaboration between the two is required to determine the characteristics of complex biological data. This, in turn, enables biologists to develop genetic circuits and make advancements in the bioinformatics field.
