Authors: | Veljković, Aleksandar; Stojanović, Biljana; Malkov, Saša; Beljanski, Miloš; Pavlović-Lažetić, Gordana; Mitić, Nenad |
Affiliations: | Computer Science; Mathematical Institute of the Serbian Academy of Sciences and Arts |
Title: | Codon Usage-based SARS-CoV-2 protein classification |
Journal: | Biologica Serbica |
Volume: | 43 |
Issue: | 1 (Special Edition) |
First page: | 92 |
Conference: | Belgrade BioInformatics Conference 2021, Virtual Conference, June 21-25, 2021, Vinča, Serbia |
Issue Date: | 2021 |
Rank: | M34 |
ISSN: | 2334-6590 |
URL: | https://belbi.bg.ac.rs/wp-content/uploads/2021/06/Book_of_Abstracts_2021-1.pdf |
Abstract: | Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) appeared in late 2019 and spread across the world, causing a pandemic in humans. Since viruses differ in their specificity toward host organisms, analysis of viral genome organization contributes to a better understanding of their evolution and adaptation in the host. Polymorphism in genomic composition is reflected in codon and amino acid usage patterns, as well as in translation rate (where rare codons are assumed to be translated more slowly than common codons). The same holds for specific coding sequences and the corresponding types of proteins. The goal of the current research is to build a model for classification of proteins (or parts thereof) based on codon usage patterns. As a dataset we used the NCBI dataset of all SARS-CoV-2 isolates and their coding sequences, preprocessed to eliminate those with missing values, ambiguous letters and full duplicates, ending up with around 66,000 isolates and around 770,000 coding sequences (ORFs). We performed cluster analysis of all the coding sequences in the dataset based on codon usage (CU) as a means of identifying the number and “profile” of protein classes. The approach is sound, since both the external and the internal measures of clustering quality are high. Clustering with the TwoStep algorithm (in the IBM SPSS Modeler tool) yields 12 clusters containing almost perfectly separated types of proteins. 
This supports the argument that specific types of proteins have their own codon usage patterns, which may then be used to build a protein classification model. Besides the classification model based on protein clustering, we experiment with clustering virus isolates by following the dynamics of CU patterns as a function of time during the pandemic. |
Keywords: | SARS-CoV-2 | codon usage | CU | protein classification | protein clustering |
Publisher: | University of Novi Sad, Faculty of Sciences. Department of Biology and Ecology, Novi Sad |
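
The abstract clusters coding sequences by their codon usage (CU) but does not say how the CU profile is computed. As a minimal sketch, assuming CU means the relative frequency of each of the 64 codons read in frame, a coding sequence can be turned into a 64-dimensional feature vector as follows. The function name `codon_usage` and the handling of ambiguous codons are illustrative assumptions, not the authors' method, and the clustering step itself (TwoStep in IBM SPSS Modeler) is not reproduced here.

```python
from collections import Counter
from itertools import product

# All 64 codons in a fixed order, so every sequence maps to the same feature layout.
CODONS = ["".join(p) for p in product("ACGT", repeat=3)]
CODON_SET = set(CODONS)


def codon_usage(cds):
    """Return the relative frequency of each of the 64 codons in a coding sequence.

    Assumption: the sequence starts in frame. A trailing partial codon and any
    codon containing a non-ACGT letter are ignored.
    """
    seq = cds.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3  # drop a trailing partial codon
    codons = (seq[i:i + 3] for i in range(0, usable, 3))
    counts = Counter(c for c in codons if c in CODON_SET)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(CODONS)
    return [counts[c] / total for c in CODONS]


if __name__ == "__main__":
    # Toy sequence for illustration only (not taken from the dataset).
    vec = codon_usage("ATGGCTGCTTAA")
    print(dict((c, f) for c, f in zip(CODONS, vec) if f > 0))
```

Vectors of this kind, one per coding sequence, could then be passed to any clustering tool; whether raw frequencies or a normalised measure such as relative synonymous codon usage was used in the study is not stated in the abstract.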