DC Field: Value [Language]
dc.contributor.author: Malkov, S. [en_US]
dc.contributor.author: Beljanski, M. [en_US]
dc.contributor.author: Pavlović Lažetić, G. [en_US]
dc.contributor.author: Stojanović, Biljana [en_US]
dc.contributor.author: Maljković, M. [en_US]
dc.contributor.author: Veljković, A. [en_US]
dc.contributor.author: Kapunac, S. [en_US]
dc.contributor.author: Mitić, N. [en_US]
dc.description.abstract: The existence of a large number of sequenced SARS-COV-2 isolates provides an opportunity to observe genomic variability in a massive sample. The goal of our research was to use data mining techniques to study a possible correlation between codon usage and classification by WHO-labels in a certain period of time. The material includes 745,533 isolates with 12,236,672 coding sequences (proteins) from NCBI (10.08.2022.). Relative synonymous codon usage (RSCU) was used as a measure of codon usage. Samples are associated with WHO-labels (based on Pango_Id) and time intervals. Inconsistencies were observed between WHO-labels and the periods in which the respective strains were actually present. Isolates showing this discrepancy were excluded from the sample. Isolates without assigned WHO-labels were also excluded. In addition, individual coding sequences containing ambiguous nucleotide codes were eliminated. Clustering was performed for each of the 12 common types of coding sequences (proteins), with multiple methods and different numbers of clusters. Neural clustering gave the best results. Different protein types show different degrees of RSCU variability. For proteins with small variation in nucleotide content, over 95% of the material belongs to a single cluster, while the other clusters are of negligible size. For proteins with more variation, a higher number of pure clusters (by WHO-labels) is obtained, along with a small number of heterogeneous clusters (about 10% of the material). Those heterogeneous clusters contain isolates with different WHO-labels that were present in parallel at some point, as a kind of transitional form between two strains. Different classification models were created on the same sample. Models based on protein types with higher diversity between coding sequences are highly accurate (96-100%). Using the classification models, WHO-labels were assigned to isolates that previously had none. [en_US]
dc.publisher: Institute of Molecular Genetics and Genetic Engineering, University of Belgrade [en_US]
dc.subject: SARS-COV-2 | RSCU | clustering | classification [en_US]
dc.title: Clustering and classification of SARS-COV-2 isolates using RSCU [en_US]
dc.type: Conference Paper [en_US]
dc.relation.conference: 4th Belgrade BioInformatics Conference - BelBI2023. 19-23 June 2023 Belgrade, Serbia [en_US]
dc.relation.publication: Book of Abstracts : 4th Belgrade BioInformatics Conference - BelBI2023 [en_US]
dc.contributor.affiliation: Computer Science [en_US]
dc.contributor.affiliation: Mathematical Institute of the Serbian Academy of Sciences and Arts [en_US]
item.openairetype: Conference Paper [-]
item.fulltext: No Fulltext [-]
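
The abstract names RSCU as its codon usage measure but does not describe the authors' implementation. As a minimal sketch under stated assumptions, the standard definition (the observed count of a codon divided by the mean count of all synonymous codons for the same amino acid) could be computed per coding sequence roughly as follows. The names `rscu`, `has_ambiguous`, and `SYNONYMS` are hypothetical, and the choices to drop stop codons and the single-codon amino acids (Met, Trp) and to report 0.0 for amino acids absent from the sequence are assumptions, not details taken from the record.

```python
from collections import Counter, defaultdict

# Standard genetic code, codons enumerated in TCAG order.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
_CODONS = [a + b + c for a in _BASES for b in _BASES for c in _BASES]
CODON_TABLE = dict(zip(_CODONS, _AMINO))

# Group synonymous codons by amino acid.
# Assumption: stop codons and single-codon amino acids (M, W) are left out,
# which gives the commonly used 59-codon RSCU vector.
_groups = defaultdict(list)
for _codon, _aa in CODON_TABLE.items():
    _groups[_aa].append(_codon)
SYNONYMS = {aa: cs for aa, cs in _groups.items() if aa != "*" and len(cs) > 1}


def has_ambiguous(cds: str) -> bool:
    """True if the sequence contains any symbol other than A, C, G, T/U."""
    return any(ch not in "ACGTU" for ch in cds.upper())


def rscu(cds: str) -> dict:
    """Return {codon: RSCU} for one in-frame coding sequence.

    RSCU = count(codon) * n_synonyms / total count for that amino acid.
    Assumption: amino acids that never occur get 0.0 for all their codons.
    """
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))
    values = {}
    for codons in SYNONYMS.values():
        total = sum(counts[c] for c in codons)
        for c in codons:
            values[c] = counts[c] * len(codons) / total if total else 0.0
    return values
```

For example, `rscu("GCTGCTGCCGCA")` covers only alanine codons, with GCT used twice out of four, so GCT receives 2.0, GCC and GCA receive 1.0, and GCG receives 0.0. Vectors of this kind, one per coding sequence, are the sort of input a clustering or classification method could consume; which methods and parameters the authors used is not stated beyond what the abstract says.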