The set was derived from full genomes: 12 genomes from each of 40 virus families plus 2 genomes from each of the 6 Coronaviridae genera. To illustrate this variety, two example genomes each of Alphacoronavirus, Betacoronavirus, Gammacoronavirus, Deltacoronavirus, Torovirus, and Bafinivirus were selected. All domains encoded by the full genomes were identified, and their positions in each virus genome are marked by colored rectangles indicating frequent, moderate, or rare occurrence (Fig.).

The domain content is both extensive and varied by genus, and this information might be used to identify and classify Coronaviridae sequences. The distribution of Pfam domains across Coronaviridae genera. Panel A: Two examples each of Alpha-, Beta-, Gamma-, Delta-, Toro-, and Bafinivirus were selected; all protein domains encoded by the full genomes, as detected by profile HMMs, were identified, and their positions in each virus genome are marked by colored rectangles. Hierarchical clustering of the domain content of all genomes in each genus showed three distribution patterns (Fig.).

The three patterns were: domains present at high frequency in a single genus (upper third of the cluster map), domains present at high frequency in most or all genera (bottom third of the cluster map), and domains with low frequency in some genomes or genera (middle of the cluster map). Cluster map of Pfam protein domains encoded by Coronaviridae genomes. The protein domain repertoire, as detected by profile HMMs, is plotted as the frequency of each domain in all available full genomes from all Coronaviridae genera. Each row represents a protein domain; each column represents a Coronaviridae genus.
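The per-genus domain frequencies plotted in such a cluster map can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the genome identifiers, and the domain names (`DomA`, `DomB`, `DomC`) are hypothetical placeholders, and the input is assumed to be a precomputed list of domains per genome.

```python
from collections import defaultdict


def domain_frequencies(genome_domains, genome_genus):
    """Return {domain: {genus: fraction of that genus's genomes encoding the domain}}.

    genome_domains: {genome_id: list of domain names detected in that genome}
    genome_genus:   {genome_id: genus name}
    """
    genomes_per_genus = defaultdict(int)
    hits = defaultdict(lambda: defaultdict(int))
    for genome, genus in genome_genus.items():
        genomes_per_genus[genus] += 1
        # Count each domain at most once per genome (presence, not copy number).
        for domain in set(genome_domains.get(genome, ())):
            hits[domain][genus] += 1
    return {
        domain: {
            genus: per_genus.get(genus, 0) / total
            for genus, total in genomes_per_genus.items()
        }
        for domain, per_genus in hits.items()
    }


# Hypothetical toy input: three genomes, two genera, placeholder domain names.
genome_genus = {
    "g1": "Alphacoronavirus",
    "g2": "Alphacoronavirus",
    "g3": "Betacoronavirus",
}
genome_domains = {
    "g1": ["DomA", "DomB"],
    "g2": ["DomA"],
    "g3": ["DomA", "DomC"],
}

freq = domain_frequencies(genome_domains, genome_genus)
```

In this toy input, `DomA` is present in every genome of both genera, while `DomB` and `DomC` are each confined to a single genus. Rows of the resulting table correspond to domains and columns to genera, matching the layout described for the cluster map.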

Sensitivity and specificity plot of various triage conditions. The HMM domain content of the forty-one-virus mock contig set (viral genome fragments including 3, Coronaviridae fragments) was determined for each fragment.

The contigs classified as Coronaviridae for each triage condition were then identified to the genus level using RF classification.

The utility of domain content for Coronaviridae classification was first tested by developing a simple triage method to identify potential Coronaviridae sequence contigs. Preliminary work identified four triage conditions as useful for this purpose, and the performance of these conditions in identifying coronavirus contigs was examined. The accuracy of the classification was assessed against the classification in the original GenBank genome annotation (Fig.).
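Assessing a triage condition against the GenBank annotation amounts to computing sensitivity and specificity from a confusion table. The sketch below shows that calculation under stated assumptions: the function name and the contig identifiers are hypothetical, and the toy labels are invented for illustration only, not taken from the study.

```python
def sensitivity_specificity(truth, called_positive):
    """Compute (sensitivity, specificity) for one triage condition.

    truth:           {contig_id: True if the reference annotation says Coronaviridae}
    called_positive: set of contig_ids the triage condition flagged as Coronaviridae
    """
    tp = sum(1 for c, is_cov in truth.items() if is_cov and c in called_positive)
    fn = sum(1 for c, is_cov in truth.items() if is_cov and c not in called_positive)
    fp = sum(1 for c, is_cov in truth.items() if not is_cov and c in called_positive)
    tn = sum(1 for c, is_cov in truth.items() if not is_cov and c not in called_positive)
    # Guard against empty classes so the function never divides by zero.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Hypothetical toy data: two true coronavirus contigs, three from other families.
truth = {"c1": True, "c2": True, "c3": False, "c4": False, "c5": False}
called_positive = {"c1", "c3"}

sens, spec = sensitivity_specificity(truth, called_positive)
```

Repeating this calculation for each triage condition gives one point per condition on a sensitivity and specificity plot.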

We ran each classification process five times to control for the random selection of features. We next applied this protein domain-based method to classify Coronaviridae genomic sequences generated from next-generation sequencing surveillance data. All de novo assembled Coronaviridae contigs that passed quality control and the minimum length cutoff were subjected to Pfam domain content identification, triage by CATD content and length, and RF classification.
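The triage and feature-preparation steps of such a pipeline might look like the sketch below. It is an illustration under assumptions rather than the published implementation: the marker-domain set, the length and marker thresholds, the contig records, and the domain names are all hypothetical, since the text does not give the actual triage parameters. The classifier itself is omitted; the count vectors are simply the kind of fixed-length input an RF classifier accepts.

```python
from collections import Counter


def triage(contigs, marker_domains, min_length, min_markers):
    """Keep contigs that are long enough and carry enough distinct marker domains."""
    kept = []
    for contig in contigs:
        n_markers = len(set(contig["domains"]) & marker_domains)
        if contig["length"] >= min_length and n_markers >= min_markers:
            kept.append(contig)
    return kept


def to_feature_vector(contig, vocabulary):
    """Count occurrences of each vocabulary domain in the contig, in a fixed order."""
    counts = Counter(contig["domains"])
    return [counts[domain] for domain in vocabulary]


# Hypothetical placeholders: domain vocabulary, marker set, and thresholds.
VOCAB = ["DomA", "DomB", "DomC"]
MARKERS = {"DomA", "DomB"}

contigs = [
    {"id": "contig1", "length": 5000, "domains": ["DomA", "DomA", "DomB"]},
    {"id": "contig2", "length": 800, "domains": ["DomA", "DomB"]},  # too short
    {"id": "contig3", "length": 6000, "domains": ["DomC"]},  # no marker domains
]

kept = triage(contigs, MARKERS, min_length=1000, min_markers=2)
vectors = [to_feature_vector(contig, VOCAB) for contig in kept]
```

Only `contig1` passes both filters here. Keeping the vocabulary order fixed means every contig maps to a vector of the same length, so vectors from reference genomes and from query contigs are directly comparable.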

The process identified thirty-four potential Coronaviridae genomes from bat fecal samples and eleven Coronaviridae genomes from rat fecal samples. These forty-five genomes were classified to the Coronaviridae genus level using the Coronaviridae classification tool (Fig.). Identification of Coronaviridae genomes. These sequences were included in the complete set of samples processed for full genome coronavirus handling. Heatmap of predicted Coronaviridae genus probabilities. Table of predicted probabilities.

These are likely to be the same viruses described here at the full genome level. The other group of Alphacoronavirus genomes was distant from any known Alphacoronavirus strains and may represent a new species. The genome organization of the new coronaviruses was similar to that of the closest reference genomes, with similar open reading frame organization and similar Pfam domains (Fig.).

Analyses of identified coronavirus genomes. Open reading frames and domain content of the three classes of coronavirus identified in this study.

Maximum-likelihood phylogenetic tree of the spike protein coding sequences from Alphacoronaviruses from this study (highlighted in red) plus selected reference sequences. Horizontal branch lengths are drawn to the scale of nucleotide substitutions per site. Maximum-likelihood phylogenetic tree of the spike protein coding sequences from Betacoronaviruses plus a collection of spike coding regions from relevant Betacoronaviruses.

To examine the relationship between the reported viruses and known Alpha- and Betacoronaviruses, the spike protein encoding regions of these genomes were compared with the spike coding regions of the most closely related coronavirus genomes in GenBank. The CoVs identified from Vietnamese rats were classified as Betacoronavirus and belonged to two distinct lineages, as shown in the phylogenetic tree (Fig.).

Members of the Coronaviridae family of viruses cause health problems in a variety of animal hosts. SARS-CoV moved from civet cats to humans and caused substantial morbidity and mortality before it was brought under control (Poon et al.).

Given the frequent association of Coronaviridae members with severe diseases, a more comprehensive description of Coronaviridae diversity, especially in animals with frequent human contact, is an important objective. We describe a Coronaviridae sequence classification strategy based on the set of protein domains encoded by the genome sequence. The classification does not depend on a single domain, but rather on the composite score of all domains present in the query sequence. This is a strength of the method: it can limit false positive identifications that might arise, for example, from shorter regions of homology to a bacterial, host, or repetitive sequence.

The requirement for longer sequence contigs is also a weakness of the method as sufficient query sequence must be available to encode multiple protein domains. This also limits the tool to assembled contigs rather than short read data.

In other words, the sensitivity of the classification depends directly on the length of the genomic sequence: genus assignment is more sensitive with longer or complete genome sequences. The classification tool provides a robust, rapid, and alignment-free method to classify large sets of more distantly related sequences. Once a database is generated, the algorithm can be used in the field or in resource-limited settings, and typical contig sets can be classified within minutes on a standard laptop.

With the availability of the platform-independent Docker version of the algorithm (see Section 2), scientists can easily run the analyses on any computing platform. Given the large number of genomes available for most of the Coronaviridae genera, this domain-based classification method can provide a sensitive measure of genome and annotation quality.

One consideration is that the genus classification may be broad, so that the diversity within a genus includes genomes with more distant variations in their protein domains. An additional consideration is that the genus classification of individual Coronaviridae genomes in GenBank may not be correct (mis-annotation), or that the genome sequences may include errors (machine errors, PCR errors, chimeric sequences) or may have been assembled incorrectly or with sequence duplications or deletions (mis-assemblies).

The domain method described here can help identify these patterns. Bats have been suggested to harbor a great diversity of CoVs and to play a key role in the emergence and transmission of pathogenic CoVs causing severe diseases in humans (Menachery, Graham, and Baric). Rats, on the other hand, belong to the largest order of mammalian species and are potentially a major zoonotic source of human infectious diseases (Meerburg, Singleton, and Kijlstra; Luis et al.).

As part of a large-scale zoonotic surveillance in Vietnam Rabaa et al. The samples were collected in Dong Thap province in southern Vietnam, where humans and domestic and farm animals live in close proximity. From this modest surveillance sample, forty-five complete or nearly complete genomes belonging to the Coronaviridae family were identified: thirty-four from bat samples, belonging to the Alphacoronavirus genus, and eleven from ten rat samples, belonging to the Betacoronavirus genus. The bat fecal samples were pooled material from five to ten individuals, thus the total individual bat samples screened ranged from to 1, and the frequency of full genome identification was 1.

In comparison, the screened rat samples were derived from individual fecal pellets and frequency of the coronavirus genome identification was 2. Given this small sample size, the frequency of CoVs identified was not strikingly different between bats and rats.