TaxAss: Leveraging Custom Databases Achieves Fine-Scale Taxonomic Resolution

Taxonomy assignment for microbial community composition studies can be limited by a lack of relevant reference organisms in large taxonomy databases. TaxAss is a taxonomy assignment workflow to classify 16S rRNA gene amplicon data using two taxonomy reference databases: a large comprehensive database such as Greengenes or SILVA, and a small ecosystem-specific database curated by scientists within a field. To test TaxAss performance, we classified five different freshwater datasets using the comprehensive Greengenes database and the freshwater-specific FreshTrain database. TaxAss increased the percent of the dataset classified compared to using only Greengenes. The increase in classifications was highest at fine-resolution taxa levels, where across the freshwater test-datasets the classifications at species-level increased by 24-40 percent reads. A similar increase in classifications was not observed in a control mouse gut dataset, which was not expected to contain freshwater bacteria. TaxAss maintained taxonomic richness compared to using only the FreshTrain. Richness was maintained across all taxa-levels from phylum to species. Without TaxAss, the majority of organisms not represented in the FreshTrain were unclassified, but at finer taxa levels incorrect classifications were also significant. TaxAss splits a dataset's sequences into two groups based on their percent identity to reference sequences in the ecosystem-specific database. Highly similar sequences are classified using the ecosystem-specific database and the others are classified using the comprehensive database. TaxAss metrics help users choose a percent identity cutoff appropriate for their data. TaxAss is free and open source, and available at www.github.com/McMahonLab/TaxAss. IMPORTANCE Microbial communities drive ecosystem processes, but microbial community composition analyses using 16S rRNA gene amplicon datasets is limited by the lack of fine-resolution taxonomy classifications. Course taxonomic groupings at phylum, class, and order level lump ecologically distinct organisms together. To avoid this, many researchers cluster similar sequences into operational taxonomic units (OTUs). These fine-resolution groupings are more ecologically relevant, but OTU definitions are dataset-dependent and therefore cannot be compared between datasets. Microbial ecologists studying a variety of environments have curated small, ecosystem-specific taxonomy databases to provide consistent and up-to-date terminology. We created TaxAss, a workflow that leverages these ecosystem-specific databases to assign taxonomy. We found that TaxAss improves fine-resolution taxonomic classifications (family, genus and species). Fine taxonomic groupings are also ecologically relevant, so they provide an alternative to OTU-based analyses that is consistent and comparable between datasets. TaxAss enables researchers to compare data using ecologically relevant terminology.

lack of relevant reference organisms in large taxonomy databases. TaxAss is a 23 taxonomy assignment workflow to classify 16S rRNA gene amplicon data using two 24 taxonomy reference databases: a large comprehensive database such as Greengenes 25 or SILVA, and a small ecosystem-specific database curated by scientists within a field. 26 To test TaxAss performance, we classified five different freshwater datasets using the 27 comprehensive Greengenes database and the freshwater-specific FreshTrain database. 28 TaxAss increased the percent of the dataset classified compared to using only 29 Greengenes. The increase in classifications was highest at fine-resolution taxa levels, 30 where across the freshwater test-datasets the classifications at species-level increased 31 by 24-40 percent reads. A similar increase in classifications was not observed in a 32 control mouse gut dataset, which was not expected to contain freshwater bacteria. 33 TaxAss maintained taxonomic richness compared to using only the FreshTrain. 34 Richness was maintained across all taxa-levels from phylum to species. Without 35 Taxonomic naming systems allow comparisons between datasets by creating consistent 88 terminology and consistent phylogeny-determined boundaries between organisms. 89 However, taxonomic naming is most useful when sequences can be classified to a fine 90 level (e.g. family, genus, or species). Many abundant taxa are poorly represented in 91 reference taxonomy databases (hereafter "databases"), resulting in only coarse 92 classifications for large proportions of amplicon datasets (e.g. phylum, class, or order). 93 Coarse taxonomic groupings often include diverse organisms with differing ecological 94 roles, so analyses at coarse taxa levels miss underlying ecological dynamics (3). ecosystem-specific classifications onto OTUs, the amplicon dataset is split into two 164 groups prior to classification: OTUs with high percent identity to ecosystem-specific 165 reference sequences and OTUs with low percent identity to ecosystem-specific 166 reference sequences. The two groups are then classified separately using the Wang 167 classifier and the appropriate database ( Figure 1). 168 169

170
To test whether TaxAss improved taxonomic classification over solely using a 171 comprehensive database, we assigned taxonomy to a Lake Mendota amplicon dataset 172 first by using Greengenes alone and then by using TaxAss to leverage both Greengenes and the FreshTrain (Figure 2a and Table 1). We compared the percent of 174 reads classified by both methods and observed a marked improvement in the percent of 175 the dataset classified to the fine taxa levels of family/lineage, genus/clade, and 176 species/tribe. At species/tribe level, the percent of reads classified increased from 1.4% 177 to 41%, at genus/clade level they increased from 23% to 63%, and at family/lineage 178 level they increased from 64% to 78%. In addition to these increases in classifications, 179 TaxAss also improved the quality of classifications because the FreshTrain is curated 180 with terminology consistent with the freshwater microbial ecology literature. At 181 family/lineage level, Greengenes alone could classify a majority of the dataset, but 75% 182 of those Greengenes-classified reads received more meaningful ecosystem-specific 183 nomenclature when using TaxAss.  184   185 The FreshTrain reference sequences come exclusively from temperate lake epilimnia, 186 and many of them come from Lake Mendota itself. Lake Mendota is a eutrophic, 187 temperate lake in Wisconsin, USA; and the Lake Mendota amplicon dataset consists of 188

206
To test whether TaxAss improved taxonomic classification over solely using an 207 ecosystem-specific database, we assigned taxonomy to the Lake Mendota dataset first 208 by using the FreshTrain alone and then by using TaxAss to leverage both the 209 FreshTrain and Greengenes. TaxAss maintained taxonomic richness at all taxa levels 210 not include any Cyanobacteria, which comprised 8.5% of the Lake Mendota dataset. All 218 of Lake Mendota's cyanobacterial OTUs were classified as something else (99% as 219 unclassified), which resulted in a loss of phylum-level richness in the FreshTrain-only 220 classification (Figure 3a). In contrast, TaxAss maintained the taxonomic richness of a 221 Greengenes-only classification and reduced unclassified reads at all taxa levels (Table  222 1). 223

224
In the FreshTrain-only classification scheme, we also observed occasions when an 225 OTU was incorrectly given a taxonomic assignment when it should have been labeled 226 unclassified ( Figure 3b). This "forcing" of OTUs into incorrect classifications by the small 227 FreshTrain database was less common than OTUs being called "unclassified," but had 228 significant effects on taxa relative abundances at finer-resolution taxa levels. Lake 229 Mendota's 5th most abundant lineage, the Bacteroidetes bacI, gained 27% more reads 230 in a FreshTrain-only classification compared to using TaxAss. The forcing errors 231 TaxAss prevented were significant enough to change basic attributes such as rank 232 abundances of top taxa, and had an even larger impact on the freshwater test-datasets 233 that differed more from the FreshTrain references (Supplemental Figure 3).  for our application because they have been highly optimized to find short, highly similar 258 alignments. However, BLAST finds areas of local similarity and there is no way to 259 require BLAST to align the entire length of a query OTU's sequence. 16S rRNA gene 260 amplicon sequences are highly similar, and differences in taxonomic classification can 261 be based on even a single mismatch in the amplified region. Therefore, we recalculated 262 the percent identities BLAST returned into "full-length" percent identities for the entire 263 query OTU's sequence (Supplemental Document 1 and Equation 1). 264 265 We found that recalculating percent identity was necessary to prevent dissimilar OTUs 266 from inclusion in the ecosystem-specific classification. For example, the FreshTrain 267 does not include any reference sequences from the major freshwater phylum 268 Cyanobacteria, so no cyanobacterial OTUs have high true percent identities to any 269 references in the FreshTrain. We found that the percent identity recalculation was 270 necessary to prevent some cyanobacterial OTUs from meeting the percent identity 271 cutoff due to the original BLAST percent identities being based on only a short aligned 272 section of the OTU sequence (Supplemental Figure 1). 273 274 We also found that it was necessary to recalculate the percent identity from several 275 BLAST alignments ("hits") for each OTU because the best BLAST hit did not always 276 have the highest recalculated percent identity. TaxAss examines the top five BLAST 277 hits, recalculates the percent identity of each, and then uses the highest recalculated 278 percent identity to determine if an OTU meets the cutoff. To ensure enough BLAST hits 279 were examined to consistently arrive at the highest possible recalculated percent 280 identity, we calculated the proportion of times each BLAST hit number had the highest 281 recalculated percent identity. In the Lake Mendota amplicon dataset the first BLAST hit 282 almost always also had the best recalculated score, and the contribution of additional 283 BLAST hits was very low, especially when only "good" hits above a stringent percent 284 identity cutoff were considered (Table 2). In the Lake Mendota dataset at the chosen 98 285 percent identity cutoff, 99.69% of the best hits found by BLAST were also the best re-286 calculated hits and only 0.01% of BLAST's 5th hits were used. TaxAss generates a 287 version of Table 2 for users' individual datasets, and if they observe more high-number 288 BLAST hits contributing to the best re-calculated hit they can increase the number of 289 BLAST results used for the calculation. 290

293
The need for curated ecosystem-specific databases has been recognized by microbial 294 ecologists studying many ecosystems. TaxAss was developed specifically to leverage 295 the Freshwater Training Set (FreshTrain) (12), but it could be applied to custom 296 databases curated for other ecosystems: the dictyopteran gut microbiota reference 297 database (DictDb) (10), the rumen and intestinal methanogen database (RIM-DB) (8), 298 the honey bee database (HBDB) (9), the microbial database for activated sludge 299 (MiDAS) (11), and the human oral microbiome database (HOMD) (7). These databases 300 were created by starting with a comprehensive database such as SILVA or Greengenes 301 and then re-curating the reference sequences from the study ecosystem, sometimes 302 also incorporating new reference sequences. Often during curation, phylogenies were 303 collapsed to be monophyletic and incorporate new organisms, and abundant but 304 unnamed organisms were given placeholder names to allow for consistent terminology 305 among researchers. Once an ecosystem-specific database has diverged from the comprehensive database, 324 as occurred with the FreshTrain, leveraging the ecosystem-specific database for 325 taxonomy assignment is no longer straightforward. Reintegrating the ecosystem-specific 326 database into the comprehensive database is more involved than simply concatenating 327 databases and removing duplicated references because conflicting phylogenetic 328 structures must be resolved. Analysis of community amplicon data is a fairly routine part 329 of many studies for which extensive phylogenetic curation would fall outside the scope. 330 The FreshTrain has been used in a variety of ways since it diverged from the current 331 version of Greengenes, and it is often difficult to discern the specifics from cursory 332 sentences in a paper's methods section. TaxAss provides a well-documented and 333 rigorously tested workflow to leverage two conflicting databases without extensive 334 curation. Another straightforward approach has been to classify amplicon datasets sequentially, 349 first using the FreshTrain and then re-classifying the unclassified sequences using a 350 comprehensive database. For example, in a study of Lake Erken, Sweden (22) important decision users make is the cutoff percent identity that determines which 415 database is used to classify each OTU. If an OTU is above the cutoff (i.e. has high percent identity to an ecosystem-specific reference sequence) then it will be classified 417 with the ecosystem-specific database. When the cutoff is higher, fewer OTUs are 418 classified with the ecosystem-specific database and users run the risk of leaving some 419 ecosystem-specific OTUs poorly classified by the comprehensive database. If the cutoff 420 is lower, more OTUs will be classified with the ecosystem-specific database and users 421 run the risk of forcing incorrect classifications onto OTUs and losing taxonomic richness. We found that a percent identity cutoff of 98-99% optimized classifications in our test-427 datasets. The finding that most OTUs match their ecosystem-specific reference 428 sequences with such high percent identity suggests that the commonly chosen 97% 429 sequence identity clustering is too coarse to observe fine taxa level dynamics. This is 430 supported by previous findings that sequence identity-based OTUs can impose artificial 431 delineations between organisms that affect results differently depending on the lineage 432 (29), and that sequence identity-based OTUs can contain temporally discordant 433 sequences (30). We recommend that users planning a taxonomy-centric analysis 434 classify unique sequences without sequence identity-based clustering and use fine-level 435 taxonomic assignment to group their data. This will make the classification take longer, 436 but likely users will save computational time overall by not clustering. For users who 437 also want to emphasize OTU-based analyses and whose datasets are too large for 438 unclustered analyses, we recommend choosing a finer sequence identity-based OTU 439 definition such as 98 or 99% to best leverage the fine-level classification provided by 440 TaxAss and a detailed ecosystem-specific database. When OTUs have been clustered 441 based on sequence identity, we recommend that users choose the same or lower 442 percent identity cutoff in TaxAss  The naïve Bayesian algorithm used for taxonomy assignment (the Wang classifier) (13) 502 can "force" incorrect classifications onto OTUs when a close match does not exist in a 503 small reference database. TaxAss uses the well-accepted Wang classifier, but avoids forcing when classifying with a small ecosystem-specific database by only classifying 505 sequences for which a close reference exists. The National Center for Biotechnology 506 Information's Basic Local Alignment Search Tool (BLAST) (20) is utilized to split the 507 amplicon dataset into two groups prior to classification: one is classified with the 508 ecosystem-specific database, the other with the comprehensive database. 509 510 Blastn queries each OTU sequence against the ecosystem-specific database using the 511 default megablast settings, which are optimized to find highly similar matches between 512 sequences longer than 30 bp (31). However, blast returns the percent identity of the 513 highest scoring pair (the "pident"), which does not necessarily include the full length of 514 the query OTU sequence. OTU sequences are highly similar; a single mismatch can 515 change a classification, so mismatches at the ends of OTU sequences (in the 516 "overhang") that BLAST leaves out of the alignment must be included in the percent 517 identity cutoff used for classification. Therefore, the BLAST pident is recalculated to a 518 full length percent identity with the following equation: where "pident" is the percent identity returned by BLAST, "length" is the length of the 523 alignment, "qlen" is the query length, "qend" is the query end, and qstart is the query 524 start. All of these parameters are returned by BLAST output format 6, and detailed 525 descriptions of what they are, the equation derivation, and an example alignment and 526 calculation are included in Supplemental Document 1. 527

528
The recalculation to full length percent identity is conservative; all query nucleotides not 529 included in the alignment (nucleotides in the "overhang") are considered mismatches. 530 This means that it would be possible to exclude an OTU from the ecosystem-specific 531 classification when its true percent identity is above the cutoff due to matches in 532 unaligned overhangs. An example of this situation is illustrated in Supplemental 533 Document 1. When the highest scoring BLAST alignments contain matches on the 534 overhangs, some of the lower-scoring alignments will be longer, and therefore have a 535 higher recalculated percent identity. To correct for this TaxAss recalculates the percent 536 identity of the top five blast hits and uses the best one for the cutoff decision. TaxAss 537 also shows users the distribution of chosen hits, so that settings can be re-evaluated if 538 BLAST is not primarily returning hits that have the best recalculated percent identities.  Table 1). Phylum-and class-level classifications are more reliable when 561 assigned by a large comprehensive database that includes more diversity, so if 562 ecosystem-specific classifications at these coarse taxa levels disagree with the 563 comprehensive database's assignments it suggests that the percent identity cutoff is too 564 low. Only these coarse levels can be used to check for forcing, because at finer 565 taxonomic levels too many OTUs end up unclassified by the comprehensive database 566 to compare assignments. datasets (Lake Mendota and Trout Bog) were also pre-processed with VSEARCH (32) 582 to trim to uniform lengths and remove low quality sequences with > 0.5 expected errors. 583 During TaxAss, the percent identity cutoff used for all datasets was 98%, and the Wang 584

813
This zip file contains directions for reproducing all data processing in this paper. It 814 includes batch scripts that download each dataset, batch scripts that quality control 815 each dataset, and batch scripts that perform TaxAss on each dataset. It also contains 816 the versions of TaxAss scripts used in this paper. A README file includes directions for 817 how to download the versions of databases and programs used in processing, and how 818 to source the included batch scripts. The zip file does not include the data itself, just the