Large scale clustering in structural and evolutionary analysis of SARS-CoV-2 proteins

Mitić, N.; Pavlović-Lažetić, G.; Beljanski, M.; Malkov, S.; Maljković, M.; Stojanović, Biljana; Veljković, A.; Kapunac, S.

Authors:	Mitić, N.; Pavlović-Lažetić, G.; Beljanski, M.; Malkov, S.; Maljković, M.; Stojanović, Biljana; Veljković, A.; Kapunac, S.
Affiliations:	Computer Science; Mathematical Institute of the Serbian Academy of Sciences and Arts
Title:	Large scale clustering in structural and evolutionary analysis of SARS-CoV-2 proteins
First page:	888
Last page:	889
Related Publication(s):	Bioinformatics of Genome Regulation and Structure/Systems Biology (BGRS/SB-2022)
Conference:	The 13th International Multiconference, 04–08 July, 2022 Novosibirsk, Russia
Issue Date:	2022
Rank:	M34
DOI:	10.18699/SBB-2021-000
Abstract:	Motivation and Aim: To understand the origin and evolution of SARS-CoV-2 and its interaction with host cells, various aspects of viral genome structure and function are under investigation. Codon usage (CU) frequencies of viral proteins, as well as non-silent mutations, are of special interest, since they may contribute to changes in virus characteristics. Previous analyses have shown that rare codons often occur in large clusters within protein-coding sequences. In the case of SARS-CoV-2, previous codon usage analyses show an antagonistic codon usage pattern (i.e., use of rare codons) that reduces translation speed but increases its precision, yielding accurate and correctly folded viral proteins [1]. To that end, clustering of protein sequences is investigated based on Relative Synonymous Codon Usage (RSCU) as well as on edit distances between amino acid (AAc) sequences, allowing both the characterization (identification) of specific protein groups (types) and the tracking of the temporal evolution of proteins within groups (types). The paper presents the results of this analysis. Materials, Methods and Algorithms: A dataset of 423,425 complete isolate nucleotide sequences was extracted from https://www.ncbi.nlm.nih.gov/sars-cov-2 on August 25, 2021. After cleaning, 347,962 isolates remained, with 225,934 unique (2,366,031 total) SARS-CoV-2 protein-coding nucleotide sequences and the corresponding AAc sequences. A consistency check between the two was performed using the standard genetic code table (transl_table 1). For all the proteins (141,926) for which a World Health Organization (WHO) SARS-CoV-2 annotation exists, the submission date and protein sequence metadata are supplied, and RSCU was calculated to measure CU bias in different proteins of different protein classes. 
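As an illustration of the RSCU measure mentioned above, the following is a minimal sketch using the usual definition: the observed count of a codon divided by the count expected if all synonymous codons for that amino acid were used equally. The abstract does not describe the authors' implementation; the function name `rscu`, the choice to report 0.0 for amino acids absent from the sequence, and the toy sequence are assumptions made for this example.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code (transl_table 1), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

# Group the 61 sense codons into synonymous families, one per amino acid.
FAMILIES = defaultdict(list)
for codon, amino_acid in CODON_TABLE.items():
    if amino_acid != "*":
        FAMILIES[amino_acid].append(codon)


def rscu(cds):
    """Return {codon: RSCU} for the 61 sense codons of a coding sequence.

    Codons containing ambiguous bases are skipped. Assumption: when an
    amino acid does not occur at all, its codons are reported as 0.0.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    counts = Counter(
        cds[i:i + 3] for i in range(0, usable, 3) if cds[i:i + 3] in CODON_TABLE
    )
    values = {}
    for family in FAMILIES.values():
        total = sum(counts[c] for c in family)
        for codon in family:
            # observed / expected, where expected = total / family size
            values[codon] = counts[codon] * len(family) / total if total else 0.0
    return values


# Toy sequence (hypothetical): Met, Ala (GCT), Ala (GCT), Ala (GCC)
example = rscu("ATGGCTGCTGCC")
```

Each protein is thus described by a 61-component vector, which is the kind of feature vector a clustering algorithm can work on.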
Different algorithms (including TwoStep clustering in the SPSS Modeler program [2], hierarchical clustering in Cluto [3], and the Python scikit-learn library) were then applied for k-clustering of proteins based on RSCU, for k = 2, ..., 40. Proteins of the most heterogeneous protein type, the surface glycoprotein (S protein), were further clustered based on RSCU for each WHO label and each year/month. Furthermore, edit distances between AAc sequences were calculated for each pair of proteins, and different algorithms (e.g., spectral clustering) were applied for clustering the proteins in each protein group. Results: Figure 1 presents the 18-clustering of S proteins based on RSCU, labeled by WHO annotations. The clustering is reasonably good, with a silhouette score of 0.47. The figure presents all groups whose share exceeds 10%. The clustering was performed with the SPSS Modeler program. Figure 2 (for the Epsilon WHO label) is representative of a set of figures that present specific WHO groups on a year/month scale when S proteins are clustered by SPSS Modeler into 18 clusters; the figures in this set present all groups whose share exceeds 5%. Specific WHO groups mostly dominate specific clusters in all time periods (for example, the Epsilon group dominates cluster 12). Spectral clustering of different types of proteins, based on AAc distances, gives quite similar results with respect to WHO labels when applied to the S protein (clusters are rather homogeneous), while for other types of proteins the clusters are less homogeneous but still representative. Hierarchical clustering of all the proteins for k = 2, ..., 40 produces clusters that are highly homogeneous with respect to protein type. Specifically, for k = 12 (the number of different protein types), each type is predominantly represented by its own cluster. 
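The pairwise edit distances between AAc sequences described above can be sketched as follows. This is a minimal illustration using the classic Levenshtein distance (unit cost for insertion, deletion and substitution); the abstract does not specify which edit-distance variant or implementation the authors used, and the function names and the toy peptide fragments are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences, using two DP rows."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter sequence in the inner loop
    previous = list(range(len(b) + 1))
    for i, residue_a in enumerate(a, start=1):
        current = [i]
        for j, residue_b in enumerate(b, start=1):
            cost = 0 if residue_a == residue_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def pairwise_edit_distances(sequences):
    """Symmetric matrix (list of lists) of edit distances for all pairs."""
    n = len(sequences)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = edit_distance(sequences[i], sequences[j])
    return matrix


# Toy amino acid fragments (hypothetical, not taken from the dataset)
fragments = ["MFVF", "MFVL", "MKVL"]
distances = pairwise_edit_distances(fragments)
```

A distance matrix of this kind would typically be converted to a similarity (affinity) matrix before spectral clustering; how that conversion was done in the study is not stated in the abstract. The quadratic number of pairs also means that, at the scale of the dataset described here, a plain Python loop like this one would only be practical on samples.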
Conclusion: Since all the SARS-CoV-2 ORFs form relatively homogeneous clusters with respect to the WHO isolate classification (i.e., isolates carrying a specific WHO annotation make up most of each cluster), this new approach may be used for annotating or predicting the strains to which isolates belong.
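The idea in the conclusion, using the dominant WHO label of a cluster to annotate unlabeled isolates that fall into it, can be sketched as below. This is only an illustration of the reasoning, not the authors' procedure: the function names, the use of a simple majority vote, and the toy cluster assignments are all assumptions.

```python
from collections import Counter, defaultdict


def cluster_majority_labels(cluster_ids, who_labels):
    """Map each cluster to (dominant WHO label, share of that label)."""
    by_cluster = defaultdict(Counter)
    for cluster_id, label in zip(cluster_ids, who_labels):
        by_cluster[cluster_id][label] += 1
    summary = {}
    for cluster_id, counts in by_cluster.items():
        label, count = counts.most_common(1)[0]
        summary[cluster_id] = (label, count / sum(counts.values()))
    return summary


def predict_label(cluster_id, summary):
    """Annotate an unlabeled isolate with its cluster's dominant label."""
    return summary[cluster_id][0]


# Toy data (hypothetical): cluster assignments and WHO labels of 7 proteins
clusters = [0, 0, 0, 1, 1, 1, 1]
labels = ["Alpha", "Alpha", "Delta", "Epsilon", "Epsilon", "Epsilon", "Alpha"]
summary = cluster_majority_labels(clusters, labels)
```

The share returned for each cluster is a simple homogeneity measure; a prediction from a cluster with a low share would deserve correspondingly little trust.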
