Human Protein Sequence Classification using Machine Learning and Statistical Classification Techniques
ChhoteLal Prasad Gupta1, Anand Bihari2, Sudhakar Tripathi3
1ChhoteLal Prasad Gupta, Computer Science & Engineering, Dr. APJ Abdul Kalam Technical University, Lucknow, India.
2Anand Bihari, School of Information Technology & Engineering, VIT University, Vellore (Tamil Nadu), India.
3Sudhakar Tripathi, Department of Information Technology, Rajkiya Engineering College, Ambedkarnagar, Uttar Pradesh, India.
Manuscript received on 11 March 2019 | Revised Manuscript received on 15 March 2019 | Manuscript published on 30 July 2019 | PP: 3591-3599 | Volume-8 Issue-2, July 2019 | Retrieval Number: B3224078219/19©BEIESP | DOI: 10.35940/ijrte.B3224.078219
© The Authors. Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC-BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Abstract: In computational biology, protein function prediction relies on meaningful and accurate features derived from either profile-based or sequence-based protein data. Predicting the enzyme class of an unknown protein is an active research topic, and machine learning and statistical classification techniques have been applied to it. In this article, we use six machine learning and statistical classification techniques (CRT, QUEST, CHAID, C5.0, ANN and SVM) to classify 4314 human protein sequences. The data were extracted from the UniProtKB databank with the help of the PROFEAT server and are categorized into seven classes. SPSS was used to handle the high-dimensional protein sequence data, which contain some missing values, and to perform the classification and estimate the performance of each technique. The experimental results show that the data for classes C4, C5, C6 and C7 are imbalanced, which affects the overall performance of the classification techniques. This article provides an extensive comparative analysis of the classification techniques on sequence-based protein data. The experimental analysis shows that the SVM and C5.0 techniques give better results than the others and can be used for protein classification and prediction.
Keywords: Protein Function Prediction; Enzyme Classification; Classification Techniques; UniProtKB; FASTA; Protein Sequence.
Scope of the Article: Classification
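
The abstract notes that imbalanced classes (C4 to C7) affect the overall performance of the classifiers. The short Python sketch below is an illustration only, not the authors' SPSS workflow: it uses invented toy labels and two hypothetical helper functions, `accuracy` and `per_class_recall`, to show how a high overall accuracy can hide poor recall on a minority class.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_recall(y_true, y_pred):
    """Return {class: recall} for every class that appears in y_true."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: hits[c] / totals[c] for c in totals}


# Toy labels invented for illustration (not the paper's data):
# 90 samples of a majority class "C1" and 10 of a minority class "C7".
y_true = ["C1"] * 90 + ["C7"] * 10
# A hypothetical classifier that labels every C1 sample correctly
# but recovers only 2 of the 10 C7 samples.
y_pred = ["C1"] * 90 + ["C7"] * 2 + ["C1"] * 8

acc = accuracy(y_true, y_pred)
recall = per_class_recall(y_true, y_pred)
# Macro-averaged recall weights every class equally, regardless of size.
macro_recall = sum(recall.values()) / len(recall)

print(f"accuracy={acc:.2f} macro_recall={macro_recall:.2f} recall={recall}")
```

In this toy case the overall accuracy is 0.92 while recall on the minority class is only 0.2, which is why per-class or macro-averaged measures are often reported alongside accuracy when class sizes differ.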