Evolutionary Frequency of the Available Human Coronavirus Genomes

Background. A novel, human-infecting coronavirus causing COVID-19 was first identified in Wuhan, China, in late December 2019. Within a short span of time, the virus has caused more than 1 million deaths worldwide. This study is designed to address the overall evolutionary process of the complete genomes of the novel coronavirus. To address the complexity and the huge population size, network-based approaches are used to map samples to their reported locations.

Results. A total of 473 complete human coronavirus genomes representing 20 different countries are studied, including 17 states of the United States and samples collected from the Diamond Princess cruise ship. The global-scale phylodynamic network is classified into five clusters, including two clusters, U1 and U2, of USA samples. Cluster B is a shared cluster of China and the USA, while A and C are diverse in nature. We found that Chinese samples aggregated in clusters A and B, which aided in retaining a homogeneous viral genomic pool. In contrast, samples from the USA and Spain were split into distinct clusters, indicating multiple ports of entry and possibly implying a delay in quarantine measures. Among the samples from the USA, we found that sequences reported from Washington and Virginia are scattered, indicating evolutionary diversity.


Background
A novel, human-infecting coronavirus called SARS-CoV2, which causes COVID-19, was first identified with the use of next-generation sequencing in Wuhan, China, in late December 2019 [1]. Contagion in medical workers and family clusters was also reported, confirming human-to-human transmission [2]. Patients infected with COVID-19 exhibit high fever, sore throat and dyspnoea, with invasive lesions present in both lungs as revealed by chest radiography [2,3]. Within a period of 4 months, the virus has spread to more than 210 countries, becoming an international emergency in which the European Region, the Region of the Americas, the Western Pacific Region and the Eastern Mediterranean Region are the worst affected. As of April 13, 2020, more than 1,773,084 confirmed cases have been reported around the world, with 111,640 fatalities (www.cdc.gov). SARS-CoV2 is an RNA virus and therefore has a high mutation rate, which in turn allows the underlying genealogy connecting sampled viruses to be estimated [4]. SARS-CoV2 shares 96.3% genetic similarity with the bat coronavirus RaTG13, which was obtained from bats in Yunnan in 2013 and is used as an outgroup in recent studies [5]. Identifying the origin and transmission pattern of such a pathogen is imperative to block further spread [6].
Several approaches are being employed to combat the pandemic. Treatments with antiviral drugs, chloroquine, corticosteroids and convalescent plasma transfusion are being tested with limited success [7][8][9][10][11][12]. Development of a potential vaccine is a time-consuming process, and until then conventional public health procedures, such as isolation, quarantine, community distancing and social containment, can be used to stop the spread of this viral disease [13]. To employ these tactics successfully, phylogenetic methods can be used in clinical studies to investigate pathogen spread within individuals and in communities. Moreover, understanding the global transmission and phylodynamic pattern of SARS-CoV2 can assist in tracking undocumented COVID-19 infection sources and tracing the route of infection transmission. New cases are being reported every day, and with them sequencing data are also becoming readily accessible. In our study, we included sequence entries from 20 different countries, analyzed and mapped 473 complete CoV2 genomes, and connected them through network-based distances retrieved from whole-genome sequencing.

Results
To understand the spread and evolving dynamics of CoV2, we mapped all the genomes available in the NCBI Virus database (www.ncbi.nlm.nih.gov/labs/virus). A total of 473 complete CoV2 genomes comprising sequence entries from 20 different countries were selected for analysis. Based on available reports, the Bat-CoV genome was used as the outgroup [5]. Our analyses are consistent with other reports showing that samples from Wuhan (MT291831) and Shenzhen/Hong Kong (MN975262) are closest to the source. The former sample spreads out into two clusters, A and B, engaging three samples (MN997409-Arizona, MT106054-Texas and MN938384-Hong Kong/Shenzhen) to connect with cluster B and one sample, MT304489-Texas, to connect with cluster A, sharing one and four mutations, respectively (Figure 1). For better understanding, we have classified the whole network into five clusters, where the distant clusters U1 and U2 are rich in samples from the USA. Cluster B is mainly a shared cluster of China and the USA, while A and C are diverse. The center of cluster A is shared by samples from the USA, China and Taiwan, while the Chinese source shares ancestry (two mutations each) with the Colombian (MT256924) and Indian (MT050493) samples. The sample from Taiwan (MN985325) provides the sole outgroup to cluster U1, which densely contains the sequences from Washington DC, USA. Cluster B is heavily centered on the USA and China and provides direct descendants to Vietnam, Israel, India, Pakistan, Italy, Nepal, Australia, Sweden and Korea, sharing one to four mutations. Interestingly, the Swedish sample connects through the Australian node rather than a Chinese one. The second USA cluster, U2, is connected to cluster B by a rather small cluster C that contains European and South American samples from Spain, France and Peru. The French sample of cluster C provides an outgroup to the U2 cluster, which contains sequences from different states of the USA.
Collectively, our global-scale CoV2 spreading dynamics indicate that countries with multiple or different source entries are assisting viral evolution at a rapid pace.
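The edge labels quoted above ("one to four mutations") can be read as counts of differing sites between two aligned genomes. The following is a minimal sketch of that idea, not the pipeline used in this study: it assumes the sequences have already been aligned to equal length, skips any site where either sequence has an ambiguous base or a gap, and uses short toy sequences with hypothetical names (`ref`, `s1`, `s2`, `s3`) rather than real accessions.

```python
from itertools import combinations

# Only unambiguous nucleotides are compared; N, gaps, etc. are skipped.
VALID_BASES = set("ACGT")


def mutation_count(seq_a, seq_b):
    """Count differing sites between two aligned sequences (Hamming distance).

    Sites where either sequence carries a gap or an ambiguous base
    are ignored, which is an assumption of this sketch.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a in VALID_BASES and b in VALID_BASES and a != b
    )


def pairwise_distances(aligned):
    """Return {(name_a, name_b): mutation count} for every pair of samples."""
    return {
        (p, q): mutation_count(aligned[p], aligned[q])
        for p, q in combinations(sorted(aligned), 2)
    }


# Toy alignment (hypothetical data, 10 sites each).
toy = {
    "ref": "ACGTACGTAC",
    "s1": "ACGTACGTAT",
    "s2": "ACGAACGTAT",
    "s3": "ACGTNCGTAC",
}

distances = pairwise_distances(toy)
for (a, b), d in distances.items():
    print(f"{a} - {b}: {d} mutation(s)")
```

In a real analysis the input would be a whole-genome multiple sequence alignment, and the resulting distance table would feed the network construction step.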
Phylodynamics of the USA
Until April 13, 2020, there were more than 400 sequences from the USA. Here, we analyzed the 355 complete genome samples from the USA reported from seventeen different states, including 24 samples from the cruise ship Diamond Princess, which had 3771 passengers on board, of whom more than 700 were confirmed cases of CoV2 [14]. Since the cruise ship was carrying CoV2-positive patients from Hong Kong, we used the Bat-CoV genome as an outgroup. Interestingly, the cruise samples grouped next to the ancestor; here we call this group the Cruise-cluster. Along with the Cruise-cluster, one sample each from Oregon (OR, MT304487) and Texas (TX, MT276331) stayed close to the ancestor (Figure 2). The OR sample provides a base for one sample each from California (CA) and Georgia (GA) and five from Washington (WA). The central base of the Cruise-cluster is shared with the Arizona sample directly infected from China (discussed above). Overall, the Cruise-cluster shares similarity with the majority of the samples from CA and bifurcates further. The left-side group of WA samples is the same group we previously referred to as U1 and is connected to the Cruise-cluster by an arbitrary ancestor, suggesting that the cruise samples are not the direct source for U1. The only remaining valid source is therefore the sample from Taiwan. A similar case can be observed in the right cluster, where the Cruise-cluster does not provide an actual ancestral link.

Discussion
Phylodynamics has previously been used to describe the use of immunodynamics, epidemiology, and evolutionary biology to understand how infectious diseases are transmitted and evolve [15]. A variety of evolutionary models assume a tree to facilitate the testing and discussion of hypotheses. However, as population size increases, more complex evolutionary scenarios are poorly described by such models [16]. Such limitations have led to the development of a number of different types of phylogenetic networks. To estimate the evolutionary frequency of the available human CoV2 genomes and map them onto their geographical locations, we present our analyses using a median-joining network.
Analyzing the global-scale evolution and spread of human CoV2, we noticed the presence of Chinese samples only in clusters A and B, highlighting the efficacy of the tight quarantine practices of Chinese citizens, which proved efficient in retaining a homogeneous viral genomic pool. On the other hand, samples from the USA were split into distinct clusters, indicating multiple ports of entry of the virus and implying a delay in quarantine measures. Although the USA had restrictions in place on all traffic coming from China, such measures were not applied to traffic coming from the rest of the world; hence the virus was not contained as efficiently as it was in China. A similar phenomenon was observed in the Spanish samples, which are located in three different clusters (A, B and C) and share ancestors from Taiwan, China, the USA and Israel separately. In contrast, genomes reported from the USA population indicate that the passengers from the Diamond Princess cruise ship were efficiently quarantined and treated and are not the major source of the spread of infection in the USA. The clustering of the cruise samples near the ancestral node is explained by two main reasons. Firstly, passengers were carrying the virus from the epicenter, China, and secondly, they remained isolated inside the cruise ship, which restricted viral evolution. Specifically, sequences from WA and VA show diversity and are scattered across almost every cluster. Overall, our data emphasize that the CoV2 spread is higher in the USA due to heterogeneity in the viral pool when compared to the rest of the affected countries. In addition, the US government needs to take strict measures to keep the viral spread limited to the source by restricting the free movement of citizens.
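To make the network idea concrete, the sketch below builds a minimum spanning tree over pairwise mutation counts using Prim's algorithm. This is a deliberately simplified stand-in and not the median-joining method itself: a median-joining network starts from minimum spanning structures and additionally infers unsampled intermediate haplotypes (median vectors), which this example does not attempt. The haplotype names and sequences are hypothetical toy data.

```python
def hamming(seq_a, seq_b):
    """Number of differing sites between two equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)


def minimum_spanning_tree(nodes, dist):
    """Prim's algorithm on a complete graph.

    nodes: iterable of node names; the first one is used as the start.
    dist:  function (u, v) -> non-negative edge weight.
    Returns a list of (u, v, weight) edges in the order they were added.
    Ties are broken by the edge's node names so the result is deterministic.
    """
    nodes = list(nodes)
    if not nodes:
        return []
    in_tree = {nodes[0]}
    edges = []
    while len(in_tree) < len(nodes):
        # Cheapest edge leaving the current tree.
        u, v = min(
            ((a, b) for a in in_tree for b in nodes if b not in in_tree),
            key=lambda e: (dist(e[0], e[1]), e),
        )
        edges.append((u, v, dist(u, v)))
        in_tree.add(v)
    return edges


# Toy haplotypes (hypothetical); "root" plays the role of the ancestral node.
haplotypes = {
    "root": "AAAAAA",
    "h1": "AAAAAT",
    "h2": "AAAATT",
    "h3": "CAAAAA",
}

tree = minimum_spanning_tree(
    haplotypes,
    lambda a, b: hamming(haplotypes[a], haplotypes[b]),
)
for u, v, w in tree:
    print(f"{u} -> {v}: {w} mutation(s)")
```

Here each edge weight corresponds to the number of mutations separating two connected samples, which is how the edges of a haplotype network are usually read.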

Figures

Figure 1 Global