Authors: Malkov, S.; Beljanski, M.; Pavlović Lažetić, G.; Stojanović, Biljana; Maljković, M.; Veljković, A.; Kapunac, S.; Mitić, N.
Affiliations: Computer Science; Mathematical Institute of the Serbian Academy of Sciences and Arts
Title: Clustering and classification of SARS-COV-2 isolates using RSCU
First page: 39
Related Publication(s): Book of Abstracts : 4th Belgrade BioInformatics Conference - BelBI2023
Conference: 4th Belgrade BioInformatics Conference - BelBI2023. 19-23 June 2023 Belgrade, Serbia
Issue Date: 2023
Rank: M34
ISBN: 978-86-82679-14-1
The large number of sequenced SARS-CoV-2 isolates provides an
opportunity to observe genomic variability in a massive sample. The
goal of our research was to use data mining techniques to study a possible
correlation between codon usage and classification by WHO-labels in a
certain period of time. The material includes 745,533 isolates with
12,236,672 coding sequences (proteins) from NCBI (as of 10 August 2022).
Relative synonymous codon usage (RSCU) was used as a measure of codon
usage. Samples were associated with WHO-labels (based on Pango_Id) and
time intervals. For some isolates, the WHO-label was inconsistent with
the period in which the respective strain was actually present; these
isolates were excluded from the sample. Isolates without assigned
WHO-labels were also excluded. In addition, individual coding sequences
containing ambiguous nucleotide codes were eliminated.
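
As an illustration of the measure used above, the following is a minimal sketch of an RSCU computation for a single coding sequence. It is not the authors' code. It assumes the standard genetic code, the usual definition of RSCU (observed count of a codon divided by the mean count over its synonymous codons), exclusion of stop codons, and a value of 0.0 for amino acids absent from the sequence. The function name `rscu` is hypothetical. Sequences with ambiguous nucleotide codes return `None`, mirroring the elimination step described in the text.

```python
from collections import Counter

# Standard genetic code with bases ordered T, C, A, G (assumption: the
# abstract does not state which translation table was used).
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Group codons into synonymous families, leaving out stop codons.
FAMILIES = {}
for codon, amino_acid in CODON_TABLE.items():
    if amino_acid != "*":
        FAMILIES.setdefault(amino_acid, []).append(codon)


def rscu(cds):
    """Return {codon: RSCU} for one coding sequence, or None if the
    sequence contains ambiguous nucleotide codes."""
    cds = cds.upper().replace("U", "T")
    if set(cds) - set("ACGT"):
        return None  # ambiguous codes: the sequence is skipped
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))
    values = {}
    for synonyms in FAMILIES.values():
        total = sum(counts[c] for c in synonyms)
        for c in synonyms:
            # observed count / (family total / number of synonyms)
            values[c] = counts[c] * len(synonyms) / total if total else 0.0
    return values
```

Each coding sequence then becomes a vector of 61 RSCU values, which is the kind of feature vector that clustering and classification methods can work on.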
Clustering was performed for each of the 12 common types of coding
sequences (proteins), using multiple methods and different numbers of
clusters. Neural clustering gave the best results. For different protein
types, different degrees of RSCU variability are observed. In the case of
proteins with a small variation in nucleotide contents, over 95% of the
material belongs to a single cluster, while the other clusters are of
negligible size. In the case of proteins with more variations, a higher
number of pure clusters (by WHO-labels) is obtained, with a small number
of heterogeneous clusters (about 10% of the material). These
heterogeneous clusters contain isolates with different WHO-labels that
were present in parallel at some point, as a kind of transitional form
between two strains.
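
The notion of pure and heterogeneous clusters can be made concrete with a small sketch. It assumes that purity is measured as the share of the most frequent WHO-label within a cluster; the abstract does not define the measure, and the function name `cluster_purity` and the example labels are hypothetical.

```python
from collections import Counter, defaultdict


def cluster_purity(cluster_ids, who_labels):
    """Return {cluster_id: (dominant_label, purity)}, where purity is the
    fraction of the cluster's members carrying the dominant label."""
    by_cluster = defaultdict(Counter)
    for cluster_id, label in zip(cluster_ids, who_labels):
        by_cluster[cluster_id][label] += 1
    result = {}
    for cluster_id, counts in by_cluster.items():
        label, n = counts.most_common(1)[0]
        result[cluster_id] = (label, n / sum(counts.values()))
    return result


# Made-up assignments: cluster 0 is pure, cluster 1 is heterogeneous.
clusters = [0, 0, 0, 1, 1, 1, 1]
labels = ["Delta", "Delta", "Delta",
          "Delta", "Omicron", "Omicron", "Omicron"]
purity = cluster_purity(clusters, labels)
```

A cluster with purity 1.0 contains a single WHO-label, while lower values flag the mixed clusters discussed above.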
Different classification models were created on the same sample. Models
based on protein types with higher diversity between coding sequences
achieved high accuracy (96-100%). The classification models were then
used to assign WHO-labels to isolates that previously had none.
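
The labeling step can be sketched with a deliberately simple stand-in model. The abstract does not name the classification models that were used, so the nearest-centroid classifier below is only an assumption chosen for brevity, and the function names and toy two-dimensional vectors are hypothetical (real inputs would be RSCU vectors).

```python
import math
from collections import defaultdict


def fit_centroids(vectors, labels):
    """Average the feature vectors of each label into one centroid."""
    sums = {}
    counts = defaultdict(int)
    for vec, label in zip(vectors, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        for i, value in enumerate(vec):
            sums[label][i] += value
        counts[label] += 1
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}


def predict(centroids, vec):
    """Assign the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(centroids[label], vec))


# Toy training data standing in for RSCU vectors of labeled isolates.
train = [[2.0, 0.0], [1.8, 0.2], [0.5, 1.5], [0.4, 1.6]]
train_labels = ["Delta", "Delta", "Omicron", "Omicron"]
centroids = fit_centroids(train, train_labels)

# An isolate without a WHO-label receives the label of the nearest centroid.
assigned = predict(centroids, [1.9, 0.1])
```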
Keywords: SARS-COV-2 | RSCU | clustering | classification
Publisher: Institute of Molecular Genetics and Genetic Engineering, University of Belgrade