Authors: Veljković, Aleksandar
Stojanović, Biljana 
Malkov, Saša
Beljanski, Miloš
Pavlović-Lažetić, Gordana
Mitić, Nenad
Affiliations: Computer Science 
Mathematical Institute of the Serbian Academy of Sciences and Arts 
Title: Codon Usage-based SARS-CoV-2 protein classification
Journal: Biologica Serbica
Volume: 43
Issue: 1 (Special Edition)
First page: 92
Conference: Belgrade BioInformatics Conference 2021, Virtual Conference, June 21-25, 2021, Vinča, Serbia
Issue Date: 2021
Rank: M34
ISSN: 2334-6590
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) appeared in late 2019 and spread across
the world, causing a pandemic in humans. Since viruses differ in their specificity toward host organisms,
analysis of viral genome organization contributes to a better understanding of their evolution and
adaptation to the host. Polymorphism in genomic composition is reflected in codon and amino acid usage
patterns, as well as in translation rate (rare codons are assumed to be translated more slowly than
common codons). The same holds for specific coding sequences and the corresponding types of proteins.
The goal of the current research is to build a model for classification of proteins (or parts thereof) based on
codon usage patterns.
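To make "codon usage pattern" concrete, the following is a minimal sketch of one common representation: the relative frequency of each of the 64 codons in a coding sequence. The abstract does not specify the exact CU measure used, so the function name `codon_usage` and the choice of plain relative frequencies are assumptions made for illustration only.

```python
from collections import Counter
from itertools import product

# All 64 codons over the DNA alphabet, in a fixed order.
CODONS = ["".join(p) for p in product("ACGT", repeat=3)]
CODON_SET = set(CODONS)


def codon_usage(cds: str) -> dict[str, float]:
    """Return the relative frequency of each of the 64 codons in `cds`.

    Assumption: the sequence is read in frame from its first base.
    A trailing incomplete codon and any triplet containing a letter
    other than A, C, G, T are ignored.
    """
    seq = cds.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3
    triplets = (seq[i:i + 3] for i in range(0, usable, 3))
    counts = Counter(t for t in triplets if t in CODON_SET)
    total = sum(counts.values())
    return {c: (counts[c] / total if total else 0.0) for c in CODONS}
```

Each coding sequence then maps to a 64-dimensional vector, which is one plausible input for the kind of clustering and classification the abstract describes.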
As a dataset we used the NCBI dataset of all SARS-CoV-2 isolates and their coding sequences,
preprocessed to eliminate those with missing values, ambiguous letters, and full duplicates, ending up with
around 66,000 isolates and around 770,000 coding sequences (ORFs). We performed cluster analysis of all
the coding sequences in the dataset based on codon usage (CU) as a means of identifying the number
and “profile” of protein classes. The approach is sound, since both the external and the internal measures of
clustering quality are high. Clustering with the TwoStep algorithm (using the IBM SPSS Modeler tool)
yields 12 clusters containing almost perfectly separated types of proteins. This supports the argument
that specific types of proteins have specific codon usage patterns, which may then be used in a protein
classification model. In addition to the classification model based on protein clustering, we experiment
with clustering virus isolates by following the dynamics of CU patterns over time during the pandemic.
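The preprocessing step mentioned above (dropping sequences with missing values, ambiguous letters, and full duplicates) could be sketched as follows. This is an illustrative assumption, not the authors' pipeline: the function name `clean_sequences`, the `(identifier, sequence)` record shape, and the reading of "full duplicate" as an identical nucleotide sequence are all hypothetical.

```python
def clean_sequences(records):
    """Filter (identifier, sequence) pairs.

    Assumptions (not stated in the abstract):
    - a "missing value" is an empty or None sequence;
    - an "ambiguous letter" is any character other than A, C, G, T;
    - a "full duplicate" is a sequence identical to one already kept.
    """
    allowed = set("ACGT")
    seen = set()
    kept = []
    for ident, seq in records:
        if not seq:
            continue  # missing value
        s = seq.upper()
        if set(s) - allowed:
            continue  # contains an ambiguous letter such as N or R
        if s in seen:
            continue  # exact duplicate of an earlier sequence
        seen.add(s)
        kept.append((ident, s))
    return kept
```

For example, given records with sequences `"ATGAAA"`, `"ATGNAA"`, and `"atgaaa"`, only the first is kept: the second contains `N`, and the third duplicates the first once uppercased.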
Keywords: SARS-CoV-2 | codon usage | CU | protein classification | protein clustering
Publisher: University of Novi Sad, Faculty of Sciences. Department of Biology and Ecology, Novi Sad
