Clustering and classification of SARS-COV-2 isolates using RSCU

Malkov, S.; Beljanski, M.; Pavlović Lažetić, G.; Stojanović, Biljana; Maljković, M.; Veljković, A.; Kapunac, S.; Mitić, N.

DC Field	Value	Language
dc.contributor.author	Malkov, S.	en_US
dc.contributor.author	Beljanski, M.	en_US
dc.contributor.author	Pavlović Lažetić, G.	en_US
dc.contributor.author	Stojanović, Biljana	en_US
dc.contributor.author	Maljković, M.	en_US
dc.contributor.author	Veljković, A.	en_US
dc.contributor.author	Kapunac, S.	en_US
dc.contributor.author	Mitić, N.	en_US
dc.date.accessioned	2023-12-01T13:55:56Z	-
dc.date.available	2023-12-01T13:55:56Z	-
dc.date.issued	2023	-
dc.identifier.isbn	978-86-82679-14-1	-
dc.identifier.uri	http://researchrepository.mi.sanu.ac.rs/handle/123456789/5239	-
dc.description.abstract	The existence of a large number of sequenced SARS-COV-2 isolates provides an opportunity to observe genomic variability in a massive sample. The goal of our research was to use data mining techniques to study a possible correlation between codon usage and classification by WHO-labels in a certain period of time. The material includes 745,533 isolates with 12,236,672 coding sequences (proteins) from NCBI (10.08.2022). RSCU was used as a measure of codon usage. Samples are associated with WHO-labels (based on Pango_Id) and time intervals. For some isolates, the WHO-label was inconsistent with the period in which the respective strain was actually present; these isolates were excluded from the sample. Isolates without assigned WHO-labels were also excluded. In addition, individual coding sequences containing ambiguous nucleotide codes were eliminated. Clustering was performed for each of the 12 common types of coding sequences (proteins), with multiple methods and different numbers of clusters. Neural clustering gave the best results. Different protein types show different degrees of RSCU variability. For proteins with little variation in nucleotide content, over 95% of the material belongs to a single cluster, while the other clusters are of negligible size. For proteins with more variation, a higher number of pure clusters (by WHO-labels) is obtained, with a small number of heterogeneous clusters (about 10% of the material). These heterogeneous clusters contain isolates with different WHO-labels that were present in parallel at some point, as a kind of transitional form between two strains. Different classification models were created on the same sample. Models based on protein types with higher diversity between coding sequences are highly accurate (96-100%). Using the classification models, the corresponding WHO-labels were associated with isolates without previously assigned WHO-labels.	en_US
dc.publisher	Institute of Molecular Genetics and Genetic Engineering, University of Belgrade	en_US
dc.subject	SARS-COV-2 | RSCU | clustering | classification	en_US
dc.title	Clustering and classification of SARS-COV-2 isolates using RSCU	en_US
dc.type	Conference Paper	en_US
dc.relation.conference	4th Belgrade BioInformatics Conference - BelBI2023. 19-23 June 2023 Belgrade, Serbia	en_US
dc.relation.publication	Book of Abstracts : 4th Belgrade BioInformatics Conference - BelBI2023	en_US
dc.identifier.url	https://belbi.bg.ac.rs/wp-content/uploads/2023/07/BelBi2023-Book-of-Abstracts.pdf	-
dc.contributor.affiliation	Computer Science	en_US
dc.contributor.affiliation	Mathematical Institute of the Serbian Academy of Sciences and Arts	en_US
dc.relation.firstpage	39	-
dc.description.rank	M34	-
item.fulltext	No Fulltext	-
item.cerifentitytype	Publications	-
item.grantfulltext	none	-
item.openairetype	Conference Paper	-
item.openairecristype	http://purl.org/coar/resource_type/c_18cf	-
crisitem.author.orcid	0000-0003-2618-754X	-
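
The abstract names RSCU (relative synonymous codon usage) as the measure of codon usage but does not spell out how it is computed. The sketch below uses the standard definition: for each amino acid, the observed count of a codon divided by the mean count over that amino acid's synonymous codons. It is a minimal illustration, not the authors' pipeline. The function name rscu, the use of the standard genetic code, the exclusion of stop codons and of the single-codon amino acids Met and Trp, and the value 0.0 for amino acids absent from the sequence are all assumptions made here. Rejecting sequences with ambiguous nucleotide codes mirrors the filtering step described in the abstract.

```python
from collections import Counter
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered over "TCAG".
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(cds: str) -> dict[str, float]:
    """Return RSCU values for one coding sequence (hypothetical helper).

    Assumptions: standard genetic code; stop codons, Met and Trp are skipped;
    an amino acid that never occurs gets 0.0 for all of its codons.
    """
    cds = cds.upper().replace("U", "T")
    # Split into complete codons, ignoring any trailing partial codon.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    if any(codon not in CODON_TABLE for codon in codons):
        raise ValueError("sequence contains ambiguous nucleotide codes")
    counts = Counter(codons)

    # Group codons into synonymous families, one family per amino acid.
    families: dict[str, list[str]] = {}
    for codon, aa in CODON_TABLE.items():
        families.setdefault(aa, []).append(codon)

    values: dict[str, float] = {}
    for aa, family in families.items():
        if aa == "*" or len(family) == 1:
            continue  # no synonymous choice to measure
        total = sum(counts[codon] for codon in family)
        for codon in family:
            # observed count / (family total / family size)
            values[codon] = counts[codon] * len(family) / total if total else 0.0
    return values


if __name__ == "__main__":
    example = rscu("GCTGCTGCCGCA")  # four alanine codons
    print({codon: example[codon] for codon in ("GCT", "GCC", "GCA", "GCG")})
```

Under these assumptions each coding sequence yields a vector of 59 values, which could serve as the feature vector for clustering or classification of the kind the abstract describes.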
