SARS-CoV-2 variants in Vietnam: A comprehensive analysis of nucleotide changes in the spike gene

SARS-CoV-2, the cause of the COVID-19 pandemic, has claimed millions of lives worldwide. Its genome has a high mutation rate, which has produced thousands of variants. The success of SARS-CoV-2 is attributed to mutations in the S gene, which encodes the spike protein that interacts directly with hACE2. Mutations in this gene are known to increase the transmission rate of the virus and its ability to escape the immune system. This study analyzed the nucleotide changes in the S gene of the SARS-CoV-2 variants that appeared in Vietnam. The results showed that Vietnam recorded many VOC variants, including Alpha, Beta, Delta, and Omicron, with Delta and Omicron being the most prevalent. The S1 region of the S gene had the highest mutation rate, with missense and C to T mutations being the most common. The NTD region contained all deletion and insertion mutations, with nucleotide positions 22198 to 22206 being the hotspot for insertions. The RBD region showed positive selection during evolution, indicating that it had undergone intense missense mutation. Overall, this study demonstrates that the S gene in Vietnam has high haplotype and nucleotide diversity. The Omicron variant in Vietnam had the highest nucleotide and haplotype diversity indexes and the highest average number of mutations in the S gene. These findings provide insights into the genetic diversity of SARS-CoV-2 variants in Vietnam and the impact of S gene mutations on the evolution of the virus.


Introduction
The COVID-19 pandemic, caused by SARS-CoV-2, a Betacoronavirus that infects humans, has taken millions of lives. SARS-CoV-2 contains a positive-sense single-stranded RNA genome that encodes four major structural proteins: Spike (S), Envelope (E), Membrane (M), and Nucleocapsid (N). Coronaviruses, like other RNA viruses, have a high mutation rate [1,2], which is the main reason why SARS-CoV-2 has spawned thousands of variants. Because of this diversity, WHO classified SARS-CoV-2 variants into three groups: VUM (Variant Under Monitoring), VOI (Variant Of Interest) and VOC (Variant Of Concern). Of these, the VOCs are the most dangerous because they carry phenotypic advantages over the wild type. Alpha (B.1.1.7) emerged as the first VOC globally and spread rapidly in the United Kingdom at the end of 2020 [3]. The next VOCs, Beta (B.1.351) and Gamma (P.1), appeared in South Africa and Brazil, respectively, and also came to lead transmission in each country [4]. More remarkably, the emergence of the Delta variant changed the global pandemic landscape. Delta's transmission rate was about 63% to 167% higher than that of Alpha [5], so it rapidly became the dominant variant worldwide. Moreover, Delta successfully invaded and spread in over 200 countries/territories, making it more dangerous than the previous VOCs [6]. Interestingly, the success of Alpha, Beta, Gamma, and Delta has been attributed to mutations in the spike protein. Previous studies showed that mutations appearing in the RBD (Receptor Binding Domain), including L452R, T478K, E484K/Q, Q498R and N501Y, enhance affinity for the receptor [7][8][9][10]. In addition, the P681H/R mutations in the FCS (Furin Cleavage Site) are the primary mutations supporting the success of the Alpha and Delta variants [11,12]. Furthermore, the spike protein is located on the viral envelope and therefore acts as an antigen of SARS-CoV-2. Thus, this protein is a crucial candidate for vaccine strategies and a target recognized by antibodies [13].
In the SARS-CoV-2 genome, the spike protein is encoded by the Spike (S) gene, so missense nucleotide mutations in the S gene lead to amino acid substitutions in this protein. Correspondingly, the S gene has the highest mutation rate and has been positively selected during evolution compared with the other structural genes [14,15]. In just one year, by the end of 2020, more than 4000 mutations had appeared in the S gene [16], indicating that this gene region contributes to the transmission of SARS-CoV-2. Indeed, recent research has also shown that the S gene helps SARS-CoV-2 adapt to the host cell [17,18]. In particular, the appearance of a new VOC, Omicron, in late 2021 complicated the COVID-19 epidemic. Omicron appeared with more than 30 amino acid substitutions in the spike protein. Hundreds of Omicron sub-variants have since appeared, including dangerous new variants such as BA.2.75, BA.4, BA.5 and XBB, which carry characteristic mutations in the spike protein [19]. Thus, examining the nucleotide changes in this gene is very important [20]. Recognizing this, PANGO Lineages developed a secondary database of S gene sequences for tracking mutations [21]. Currently, WHO also uses mutations in the spike protein to classify SARS-CoV-2 variants. Therefore, a detailed and timely evaluation of nucleotide changes in the S gene helps to identify the evolutionary direction of the virus and its response to the host cell. Information extracted from the diversity and dynamics of the S gene will inform further studies, particularly in support of epidemiology.
Vietnam is one of the few countries that successfully controlled the early stage of the COVID-19 pandemic [22]. However, the appearance of Delta in the fourth wave markedly worsened the COVID-19 situation. Our previous studies on the genetic diversity of SARS-CoV-2 in Vietnam showed the introduction of numerous variants [23,24]. The Delta and Omicron variants show a high degree of diversity in Vietnam. In the year and more since our report, Vietnam has recorded many new dangerous variants, such as BA.2.75, BA.4, BA.5, and XBB. Therefore, evaluating and investigating novel mutations in the S gene region is necessary in the present complicated situation.

Data collection
The genomes and associated metadata of SARS-CoV-2 isolated in Vietnam were retrieved from the GISAID database. To extract the S gene, we used the SARS-CoV-2 full-length MSA tool of the MAFFT server [25], based on the S gene of the reference sequence (accession number NC_045512.2) in GenBank. All extracted S gene sequences were then aligned in MAFFT for downstream analysis. We used in-house software to identify S gene mutations. Haplotypes were then numbered with DnaSP ver 6.12.03 [26].
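The in-house mutation-calling software is not described in this paper, so the following is only a minimal sketch of the general idea, not the authors' tool. It assumes each genome has already been aligned to NC_045512.2 so that string positions match reference coordinates (insertions relative to the reference are not handled). The S gene coordinates 21563 to 25384 follow the GenBank annotation of the reference; the function names and toy sequences are hypothetical.

```python
# Sketch only: NOT the authors' in-house software.
# Assumption: sequences are reference-aligned strings of equal length.

S_START, S_END = 21563, 25384  # 1-based inclusive S gene coordinates in NC_045512.2


def extract_region(aligned_genome, start=S_START, end=S_END):
    """Slice a reference-aligned genome using 1-based inclusive coordinates."""
    return aligned_genome[start - 1:end]


def call_substitutions(ref, query, offset=S_START):
    """Return labels such as 'C22995A' for every mismatching base.

    `offset` is the genome coordinate of the first base of `ref`.
    Gaps and ambiguous 'N' calls in the query are skipped.
    """
    if len(ref) != len(query):
        raise ValueError("ref and query must have the same aligned length")
    calls = []
    for i, (r, q) in enumerate(zip(ref.upper(), query.upper())):
        if r == q or q in {"N", "-"} or r == "-":
            continue
        calls.append(f"{r}{offset + i}{q}")
    return calls


def call_deletions(ref, query, offset=S_START):
    """Return labels such as 'del22028_22033' for each run of gaps in the query."""
    if len(ref) != len(query):
        raise ValueError("ref and query must have the same aligned length")
    calls, run_start = [], None
    for i, q in enumerate(query):
        if q == "-" and ref[i] != "-":
            if run_start is None:
                run_start = i  # a new deletion run begins here
        elif run_start is not None:
            calls.append(f"del{offset + run_start}_{offset + i - 1}")
            run_start = None
    if run_start is not None:  # deletion runs to the end of the region
        calls.append(f"del{offset + run_start}_{offset + len(query) - 1}")
    return calls


# Toy 10-nt example (illustrative input, not study data).
ref_toy = "ATGTTTGTTT"
qry_toy = "ATGCTT---T"
print(call_substitutions(ref_toy, qry_toy))  # ['T21566C']
print(call_deletions(ref_toy, qry_toy))      # ['del21569_21571']
```

Keeping substitutions and deletions in separate functions mirrors the way the results report missense changes and indels separately; a production tool would also need to translate codons and handle insertions.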

The phylogenetic tree and haplotype network analysis
The IQtree software ver 2.2.0 was used to build the Maximum-Likelihood (ML) phylogenetic tree [27]. We chose GTR+F+I+G4 as the best-fitting substitution model based on the BIC score reported by IQtree and set the reference sequence as the root of the tree. Branch support for the ML tree was assessed with 1000 bootstrap replicates. Figtree ver 1.4.4 (http://tree.bio.ed.ac.uk/software/figtree/) was used to visualize the output tree of IQtree. The haplotype network of the S gene in Vietnam was then built with PopART [28] using the Minimum-Spanning method.

Examination of genetic diversity
The genetic diversity measures, including nucleotide diversity, haplotype diversity, the average number of nucleotide differences within each variant, and the nucleotide differences between variants, were determined with Arlequin ver 3.5 [29]. The t-test in R was then used for statistical analysis.
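For readers unfamiliar with these indices, the sketch below illustrates the standard textbook estimators (Nei's haplotype diversity, the mean number of pairwise differences, and nucleotide diversity per site). It is an assumption-laden illustration, not a reimplementation of Arlequin: gaps and ambiguous bases are treated as ordinary characters, and the function names and toy sequences are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Sketch only: textbook estimators, not Arlequin's exact procedure.
# Assumption: all sequences are aligned and have the same length.


def number_haplotypes(seqs):
    """Label each distinct sequence Hap_1, Hap_2, ... in order of first appearance."""
    labels = {}
    for s in seqs:
        labels.setdefault(s, f"Hap_{len(labels) + 1}")
    return labels


def haplotype_diversity(seqs):
    """Nei's haplotype diversity: n/(n-1) * (1 - sum of squared haplotype frequencies)."""
    n = len(seqs)
    if n < 2:
        raise ValueError("at least two sequences are required")
    counts = Counter(seqs)
    return n / (n - 1) * (1 - sum((c / n) ** 2 for c in counts.values()))


def mean_pairwise_differences(seqs):
    """Average number of differing sites over all pairs of sequences."""
    if len(seqs) < 2:
        raise ValueError("at least two sequences are required")
    pairs = list(combinations(seqs, 2))
    total = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs)
    return total / len(pairs)


def nucleotide_diversity(seqs):
    """Mean pairwise differences divided by the aligned sequence length."""
    return mean_pairwise_differences(seqs) / len(seqs[0])


# Toy alignment (illustrative input, not study data).
toy = ["AAAA", "AAAA", "AAAT", "AATT"]
print(number_haplotypes(toy))
print(round(haplotype_diversity(toy), 4))        # 0.8333
print(round(mean_pairwise_differences(toy), 4))  # 1.1667
print(round(nucleotide_diversity(toy), 4))       # 0.2917
```

The pairwise loop is quadratic in the number of sequences, which is fine for a toy example; for thousands of sequences it would be more efficient to compute the same quantities per alignment column from allele counts.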

Haplotype diversity
As of 10 March 2023, we had collected 8372 S gene sequences from Vietnam, which were classified into 1978 haplotypes. The timeline of variants in Vietnam indicated that each COVID-19 outbreak wave was characterized by particular variants. The A and sub-A lineages and the B and sub-B lineages dominated the 1st and 2nd waves, followed by the Alpha variant as the primary cause of the 3rd wave. The 4th wave involved two VOC variants: Delta emerged in the early stage and was replaced by Omicron variants in March 2022 (Figure 1A). The haplotype diversity analysis showed that the Delta and Omicron variants had a high haplotype diversity index and numerous haplotypes (Figure 1B). This result shows that Delta and Omicron were the main variants contributing to genetic diversity in Vietnam. We also noted the low haplotype diversity of the B and sub-B lineages (Figure 1B), which reflects the limited nucleotide change in these lineages in Vietnam (Figure 3). The A and sub-A lineages had a high level of haplotype diversity but contained a low number of haplotypes. The mean pairwise nucleotide difference among their haplotypes, 3.2 (Figure 3), together with the timeline of variants (Figure 1A), indicated that these lineages entered Vietnam twice, in the 1st and 3rd waves. In addition, the metadata for Vietnam showed that two sequences (EPI_ISL_16034628 and EPI_ISL_16034629) were classified as the Beta variant (B.1.351), which had not been recorded in previous studies. Notably, these sequences were collected on 08/05/2021 but were not submitted to GISAID until 17/12/2022, more than one year later.
(A) Grey, black, light blue, green, yellow and pink represent the A and sub-A lineages, B and sub-B lineages, Alpha, Beta, Delta and Omicron, respectively; the columns of the timeline show the frequency of each variant per month. (B) The haplotype diversity of the S gene in Vietnam. All error bars in this figure show the standard deviation (SD).

Figure 1 The timeline and haplotype diversity of SARS-CoV-2 in Vietnam
A phylogenetic tree, a phylogenetic network, and an analysis of nucleotide differences between haplotypes were used to examine each variant's impact on population diversity. Our ML tree based on the S gene sequences, built with IQtree, shows a clear split between the variants (Figure 2A). The Alpha variant split from the B lineages, and the Omicron variant split from Delta, with bootstrap values of 84 and 100, respectively. The ML tree also reflects the correlation between the high haplotype diversity and the complexity of the Delta and Omicron variants (Figure 2A). In the phylogenetic network, the Delta haplotypes differed by an average of 2.1 nucleotides (Figure 3) and were closely related (Figure 2B). However, Delta had numerous polymorphic sites, which implies that this variant circulated in Vietnam for a long time (Figure 1A), long enough to accumulate novel mutations. The phylogenetic network also demonstrated very complicated nucleotide changes in the S gene of the Omicron variants. While the A lineages, B lineages, Alpha and Delta variants had closely related haplotypes in the phylogenetic network analysis (Figure 2), the Omicron haplotypes were separated by large distances. Therefore, Omicron not only spread rapidly in Vietnam but also introduced new, genetically distant sub-variants into the country.
The nucleotide difference analysis showed the strong impact of the Delta and Omicron variants on population genetic diversity. The B lineages and the Alpha variant had lower nucleotide distances (Figure 3) than the A lineages, which entered Vietnam more than once. In parallel, the comparison between variants indicated that the diversity of the whole population was closest to that of the Omicron variant, as shown by the pairwise nucleotide differences and the population nucleotide distance between Omicron and the full dataset (Figure 3). Overall, the S gene in Vietnam had a low level of haplotype diversity and limited nucleotide change until the introduction of the Delta and Omicron variants.

Figure 2
The Maximum-Likelihood phylogenetic tree and phylogenetic network of the S gene in Vietnam
(In this figure, the nucleotide distance of each variant is displayed at three levels (three diagonals). The meaning of each diagonal is explained in the figure. Each diagonal has its own colour, and the colour scale indicates increasing nucleotide distance.)

Figure 3
The nucleotide pairwise differences in the S gene of each variant

Nucleotide diversity
Evaluating the diversity of the S gene at the nucleotide level showed that this gene contains an average of 34 mutations, with the region encoding the S1 subunit carrying the most mutations (Figure 4A). Within the S1 region, the NTD had more mutations than the RBD. The NTD carried this larger number of mutations because all indels appeared in this region (Figure 5A). Examination of nucleotide changes in the S gene identified 1147 polymorphic sites, including 188 indels, 809 transitions and 394 transversions (Figure 4B). All types of transition and transversion were observed. In detail, the C to T transition was the most frequent mutation in the S gene, followed by T to C (Figure 4C). The Delta and Omicron variants had many polymorphic sites, in line with their high haplotype diversity in Vietnam. The Omicron variant had the highest number of mutations, at 43 ± 7, followed by Alpha (18 ± 1) and Delta (15 ± 2) (Figure 4D). Although Alpha carried more mutations than Delta, it had lower nucleotide diversity (0.000276 ± 0.000205) (Figure 4D) and fewer polymorphic sites (17 sites) than Delta (0.000553 ± 0.000341 and 500 polymorphic sites). The main reason is that Alpha contained more deletion mutations in the NTD than Delta. Overall, the nucleotide diversity of the S gene in Vietnam is very high, with a nucleotide diversity index of 0.00753 ± 0.003641, and the Delta and Omicron variants are the main contributors to this diversity (0.004360 ± 0.002145) (Figure 4D). This result is consistent with diversity at the haplotype level, suggesting that the Delta and Omicron variants had complex nucleotide changes in the S gene.
(A) The number of mutations in the S gene, including the region encoding the S1 subunit (NTD and RBD regions) and the S2 subunit. The t-test was used to identify differences in the number of mutations between regions, with a p-value < 0.05 considered statistically significant. (B) Nucleotide changes in the S gene for each variant and for Vietnam overall (total), including polymorphic sites, indels (insertion and deletion mutations), transitions and transversions. (C) The frequency of each type of transition and transversion. (D) The number of mutations (columns, primary axis) and the nucleotide diversity index (blue line, secondary axis). All error bars in this figure show the standard deviation (SD).

Figure 4 The diversity at nucleotide level of the S gene in Vietnam
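The transition and transversion counts summarized in Figure 4 rest on a simple rule: a change between two purines (A, G) or between two pyrimidines (C, T) is a transition, and any purine/pyrimidine exchange is a transversion. The sketch below shows one way such a tally could be computed from substitution labels written in the notation used in this paper (for example C22995A). The function names and the example list are illustrative assumptions, not the study's pipeline or data.

```python
import re
from collections import Counter

# Sketch only: illustrative classification of substitution labels.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
LABEL = re.compile(r"^([ACGT])(\d+)([ACGT])$")  # e.g. C22995A


def classify(ref, alt):
    """Return 'transition' or 'transversion' for a single-base change."""
    if ref == alt:
        raise ValueError("ref and alt are identical")
    pair = {ref, alt}
    same_class = pair <= PURINES or pair <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def summarize(labels):
    """Count substitution classes and specific change types (e.g. 'C>T')."""
    classes, types = Counter(), Counter()
    for label in labels:
        match = LABEL.match(label)
        if match is None:
            raise ValueError(f"not a substitution label: {label}")
        ref, _position, alt = match.groups()
        classes[classify(ref, alt)] += 1
        types[f"{ref}>{alt}"] += 1
    return classes, types


# Illustrative labels only (not the study's mutation table).
classes, types = summarize(["C22995A", "C21618T", "A23403G", "T22917G"])
print(dict(classes))  # {'transversion': 2, 'transition': 2}
print(dict(types))    # {'C>A': 1, 'C>T': 1, 'A>G': 1, 'T>G': 1}
```

Counting specific change types alongside the two broad classes makes it straightforward to rank them, which is how a statement such as "C to T is the most frequent change" would be derived.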
Examining the mutations located in the S gene, we noted that the NTD retained the deletion mutations (del21633_41ACCCCCTG -del24-26LPP, del22028_22033 -del157-58FR) and that nucleotide positions 22198 to 22206 formed a hot region for insertion mutations, with multiple insertion events including ins22198GAGAGCCCAGAA (example sample: EPI_ISL_15897003), ins22198AATGGTGAG (example sample: EPI_ISL_16073578), ins22204GAGCCAGAA (many sequences, including one sample of the Delta variant) and ins22206CGCAGTGGCAGT (example sample: EPI_ISL_16201111). We also recorded a unique insertion, ins21620CCAGAGCGT (sample: EPI_ISL_13358783). Significantly, the RBD of the S gene carried highly conserved missense mutations (more than 10% frequency), including C22995A (

Discussion
In this study, we provide the whole landscape of the genetic diversity of the S gene in Vietnam. In the metadata, we noticed a delay in the announcement and recognition of new variants in Vietnam. For example, the dataset updated in July 2021 contained no Beta variant [23], whereas this study identified Beta because the corresponding sequences were released late. Such delays are extremely dangerous because Beta can spread faster and is more dangerous than the Wuhan variant [30,31]. Thankfully, Beta did not become widespread, possibly because of Delta's rapid dominance after it first appeared in March 2021. At the haplotype level, the S gene of SARS-CoV-2 in Vietnam showed high genetic diversity. The Delta and Omicron variants are the ones with the most nucleotide changes. The rapid change of Delta and Omicron has also been noted around the world. The Delta variant caused the worst outbreaks in India and the United States [6,32]. This variant also quickly developed into many subvariants, such as Delta Plus worldwide [33] or AY.57, which is well known in Vietnam. In Vietnam, as in the world generally, the Omicron variant is more heterogeneous than Delta. We have recorded more than 130 Omicron subvariants that are more infectious than the ancestral variant (B. At the nucleotide level, detailed analysis showed that the S1 region had the highest variability. This is consistent with the function of the S1 subunit, which serves as both an antigenic factor and a key player in direct interaction with the hACE2 receptor. Therefore, mutations that give SARS-CoV-2 the ability to evade the immune system and increase affinity for the receptor will be favored. Deletion mutations have been shown to help mutants escape neutralizing antibodies [34,35]. This also explains why Omicron is the variant with the most deletions and insertions recorded in our study.
Notably, we identified a sample of the Delta variant (EPI_ISL_11221203) carrying the insertion ins22204GAGCCAGAA, which had not been recorded for Delta either in our previous study or worldwide [36]. This insertion at nucleotide 22204 of the S gene is most commonly found in the Omicron variant, both in Vietnam and worldwide [37]. The Delta sequence (EPI_ISL_11221203) was sampled in February 2022, when the Delta and Omicron variants were circulating in parallel in Vietnam. Therefore, we hypothesize that the most likely explanation is co-infection and recombination between Delta and Omicron, which has been reported worldwide [38]. Based on our results, we highlight that continually updating the data and assessing nucleotide variations in the S gene, including missense mutations as well as indels, is very important.

Conclusion
In conclusion, our study shows that continuously updating nucleotide changes in the S gene region is imperative. We found that the S1 subunit coding region of the S gene has the fastest mutation rate. Furthermore, the S gene region of SARS-CoV-2 in Vietnam was under positive selection during its circulation here. In vaccine research, the S glycoprotein is an excellent target. Therefore, further studies to evaluate mutations affecting the structure and function of the S protein are essential. To date, more than 15 million whole genome sequences of SARS-CoV-2 are available worldwide. Evaluating and reporting mutations in the S gene will provide crucial data on the emergence of possible novel variants and sub-variants. It will also make it possible to evaluate the effectiveness of vaccines under development and to monitor changes related to the pathogenesis of the disease. Determining mutation rates is also helpful because mutations play an important role in the virus escaping the host immune response and thus developing drug resistance.