Journal of Theoretical and Applied Information Technology
A More Appropriate Protein Classification Using Data Mining
November 30, 2010
Research in bioinformatics is complex because it spans two knowledge domains: biological science and computer science. This paper introduces an efficient data mining approach for classifying proteins into useful groups by representing them in a hierarchical tree structure. Several techniques exist for classifying proteins, but most have drawbacks in how they group them. The most efficient of these grouping techniques is the one used by PSIMAP (Protein Structural Interactome Map). Although PSIMAP successfully incorporates most proteins, it fails to classify proteins with the scale-free property. Our technique overcomes this drawback and maps all proteins into groups, including the scale-free proteins that PSIMAP fails to group. Our approach selects six major attributes of a protein, a) structure comparison, b) sequence comparison, c) connectivity, d) cluster index, e) interactivity, and f) taxonomy, to group the proteins in the databank by generating a hierarchical tree structure. The proposed approach calculates the degree (probability) of similarity between each protein newly entered into the system and the proteins already in the system by applying a probability theorem to each of the six properties. For each property, a function generates a probabilistic value from which the weight for that property is derived. The probabilistic values generated by the six individual functions are added together to calculate the bond factor. The bond factor defines how strongly one protein bonds with another, based on their similarity across the six attributes. Finally, to place a protein in the hierarchy tree, its aggregated probabilistic value is compared with the probabilistic value of the protein at the root. If there is no root protein (i.e., in the initial state), the first protein is taken as the root, and depending on its probabilistic value it can change its relative position.
We apply this technique recursively at each node to find the most probable position for a given protein in the tree.
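The bond factor and the recursive placement described above can be illustrated with a minimal Python sketch. It is one possible reading of the abstract, not the authors' implementation. Several things here are assumptions made for illustration: the names `Node`, `similarity`, `bond_factor`, and `insert`; the representation of each protein as six scores in [0, 1]; the placeholder per-attribute similarity `1 - |a - b|` standing in for the paper's probability functions; and the descent rule, which moves to the child that bonds most strongly with the new protein. Re-rooting, which the abstract mentions, is not modeled.

```python
from dataclasses import dataclass, field

# The six attributes named in the abstract (the key names are hypothetical).
ATTRIBUTES = ("structure", "sequence", "connectivity",
              "cluster_index", "interactivity", "taxonomy")


@dataclass
class Node:
    name: str
    features: dict                      # assumed: attribute -> score in [0, 1]
    children: list = field(default_factory=list)


def similarity(a, b, attribute):
    """Placeholder similarity in [0, 1]; the paper's probability
    functions are not given in the abstract, so this is an assumption."""
    return 1.0 - abs(a[attribute] - b[attribute])


def bond_factor(a, b):
    """Sum of the six per-attribute similarity values."""
    return sum(similarity(a, b, attr) for attr in ATTRIBUTES)


def insert(root, new):
    """Place `new` in the tree and return the root.

    Assumed rule: at each node, descend into the child that bonds most
    strongly with `new` if that bond exceeds the bond with the current
    node; otherwise attach `new` as a child of the current node.
    """
    if root is None:                    # initial state: first protein is the root
        return new
    current = root
    while True:
        here = bond_factor(current.features, new.features)
        best = max(current.children,
                   key=lambda c: bond_factor(c.features, new.features),
                   default=None)
        if best is None or bond_factor(best.features, new.features) <= here:
            current.children.append(new)
            return root
        current = best                  # repeat the comparison one level down


def protein(name, score):
    """Helper for the example: a protein with the same score on every attribute."""
    return Node(name, {attr: score for attr in ATTRIBUTES})


root = None
for p in (protein("P1", 0.5), protein("P2", 0.9),
          protein("P3", 0.95), protein("P4", 0.1)):
    root = insert(root, p)
```

In this toy run, P3 is placed under P2 because it bonds more strongly with P2 than with the root P1, while P4 stays directly under the root. A real system would replace `similarity` with the attribute-specific probability functions of the proposed approach.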