Large scale clustering in structural and evolutionary analysis of SARS-CoV-2 proteins

Mitić, N.; Pavlović-Lažetić, G.; Beljanski, M.; Malkov, S.; Maljković, M.; Stojanović, Biljana; Veljković, A.; Kapunac, S.

DC Field	Value	Language
dc.contributor.author	Mitić, N.	en_US
dc.contributor.author	Pavlović-Lažetić, G.	en_US
dc.contributor.author	Beljanski, M.	en_US
dc.contributor.author	Malkov, S.	en_US
dc.contributor.author	Maljković, M.	en_US
dc.contributor.author	Stojanović, Biljana	en_US
dc.contributor.author	Veljković, A.	en_US
dc.contributor.author	Kapunac, S.	en_US
dc.date.accessioned	2022-12-09T12:51:02Z	-
dc.date.available	2022-12-09T12:51:02Z	-
dc.date.issued	2022	-
dc.identifier.uri	http://researchrepository.mi.sanu.ac.rs/handle/123456789/4929	-
dc.description.abstract	Motivation and Aim: To understand the origin and evolution of SARS-CoV-2 and its interaction with host cells, various aspects of viral genome structure and function are under investigation. The codon usage (CU) frequency of viral proteins and non-silent mutations are of special interest, since they may contribute to changing virus characteristics. Previous analyses have shown that rare codons often occur in large clusters within protein-coding sequences. In the case of SARS-CoV-2, previous codon usage analyses show an antagonistic codon usage pattern (i.e., use of rare codons) that reduces translation speed but increases its precision, yielding accurate and correctly folded viral proteins [1]. To that end, clustering of protein sequences is investigated based on Relative Synonymous Codon Usage (RSCU) and on edit distances between amino acid (AAc) sequences, which allows both characterizing (identifying) specific protein groups (types) and following the temporal evolution of proteins within groups (types). The paper presents the results of this analysis. Materials, Methods and Algorithms: A dataset of 423425 complete isolate nucleotide sequences was extracted from https://www.ncbi.nlm.nih.gov/sars-cov-2 on August 25, 2021. After cleaning, 347962 isolates remained, with 225934 unique (2366031 total) SARS-CoV-2 protein-coding nucleotide sequences and the corresponding AAc sequences. A consistency check between the two was performed using the standard genetic code table (transl_table 1). For all the proteins (141926) for which a World Health Organization (WHO) SARS-CoV-2 annotation exists, the submission date and protein sequence metadata are supplied, and RSCU was calculated to measure CU bias in different proteins of different protein classes.
Then different algorithms (including TwoStep clustering in the SPSS Modeler program [2], hierarchical clustering in Cluto [3] and the Python scikit-learn library) were applied for k-clustering of proteins based on RSCU, for k = 2, ..., 40. Proteins of the most heterogeneous protein type, the surface glycoprotein (S-protein), were further clustered based on RSCU for each WHO label and year/month date. Furthermore, edit distances between AAc sequences were calculated for each pair of proteins. Different algorithms (e.g., spectral clustering) were then applied for clustering the proteins in each protein group. Results: Figure 1 presents the 18-cluster solution for the S-protein based on RSCU, labeled by WHO annotations. The clustering quality is fairly good, with a silhouette of 0.47. The figure presents all the groups whose share is higher than 10%. The clustering was performed with the SPSS Modeler program. Figure 2 (for the Epsilon WHO label) is representative of a set of figures presenting specific WHO groups on a year/month scale when S-proteins are clustered by the SPSS Modeler into 18 clusters; these figures present all the groups whose share is higher than 5%. Specific WHO groups mostly dominate in specific clusters in all the time periods (for example, the Epsilon group dominates in cluster 12). Spectral clustering of different types of proteins based on AAc distances gives quite similar results with respect to WHO labels when applied to the S-protein (the clusters are rather homogeneous), and less homogeneous but still representative clusters for other types of proteins. Hierarchical clustering of all the proteins for k = 2, ..., 40 produces highly homogeneous clusters with respect to protein types. Specifically, for k = 12 (the number of different protein types), each type is predominantly represented by its own specific cluster.
Conclusion: Since all the SARS-CoV-2 ORFs group into relatively homogeneous clusters (according to the WHO isolate classification), i.e., isolates with a specific WHO annotation make up most of each cluster, this new approach may be used for annotating/predicting the strains to which isolates belong.	-
dc.title	Large scale clustering in structural and evolutionary analysis of SARS-CoV-2 proteins	en_US
dc.type	Conference Paper	en_US
dc.relation.conference	The 13th International Multiconference, 04–08 July 2022, Novosibirsk, Russia	en_US
dc.relation.publication	Bioinformatics of Genome Regulation and Structure/Systems Biology (BGRS/SB-2022)	en_US
dc.identifier.doi	10.18699/SBB-2021-000	-
dc.contributor.affiliation	Computer Science	en_US
dc.contributor.affiliation	Mathematical Institute of the Serbian Academy of Sciences and Arts	en_US
dc.relation.firstpage	888	-
dc.relation.lastpage	889	-
dc.description.rank	M34	-
item.fulltext	No Fulltext	-
item.cerifentitytype	Publications	-
item.openairecristype	http://purl.org/coar/resource_type/c_18cf	-
item.openairetype	Conference Paper	-
item.grantfulltext	none	-
crisitem.author.orcid	0000-0003-2618-754X	-
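
The abstract above names two sequence features used for clustering: RSCU of the coding sequence and edit distance between amino acid sequences. The record does not include the authors' implementation, so the following is only a minimal illustrative sketch. It assumes the usual definition of RSCU (the observed count of a codon divided by the mean count over all synonymous codons of the same amino acid), the standard genetic code (transl_table 1), and unit-cost Levenshtein distance as the edit distance. The function names `rscu` and `edit_distance` are hypothetical and chosen here for illustration.

```python
from collections import Counter, defaultdict

# Standard genetic code (transl_table 1), built in the conventional TCAG order.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(
    zip((a + b + c for a in _BASES for b in _BASES for c in _BASES), _AMINO)
)

# Group sense codons by the amino acid they encode.
_FAMILIES = defaultdict(list)
for _codon, _aa in CODON_TABLE.items():
    if _aa != "*":
        _FAMILIES[_aa].append(_codon)


def rscu(cds):
    """Return {codon: RSCU} for the 59 codons that have synonymous alternatives.

    Assumption: `cds` is an in-frame coding sequence. Codons containing
    ambiguous bases are ignored, and a trailing partial codon is dropped.
    """
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))
    values = {}
    for codons in _FAMILIES.values():
        if len(codons) < 2:  # Met and Trp have a single codon each
            continue
        total = sum(counts[c] for c in codons)
        for c in codons:
            # observed count / (total / number of synonymous codons)
            values[c] = counts[c] * len(codons) / total if total else 0.0
    return values


def edit_distance(a, b):
    """Unit-cost Levenshtein distance between two sequences (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]
```

Under these assumptions, each protein would be represented by a 59-dimensional RSCU vector (in a fixed codon order) for k-clustering, while the pairwise edit distances would form a distance matrix that has to be converted to a similarity before spectral clustering. The quadratic-time distance shown here is for illustration only and would need a faster implementation at the scale described in the abstract.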
