Comparison of de novo clustering algorithms. Plot of MCC (A), number of OTUs (B), and execution times (C) for the comparison of de novo clustering algorithms when applied to four natural and two synthetic data sets. The first three columns of each panel contain the results of clustering the data sets: (i) seeding the algorithm with one sequence per OTU and allowing the algorithm to proceed until the MCC value no longer changed, (ii) seeding the algorithm with one sequence per OTU and allowing the algorithm to proceed until the MCC changed by less than 0.0001, and (iii) seeding the algorithm with all of the sequences in one OTU and allowing the algorithm to proceed until the MCC value no longer changed. The human data set could not be clustered by the average neighbor, Sumaclust, USEARCH, or OTUCLUST with less than 45 GB of RAM or 50 h of execution time. The median from 10 reorderings of the data is presented for each method and data set. The range of observed values is indicated by the error bars, which are typically smaller than the plotting symbol.
OptiClust performance. Average execution time (A) and memory usage (B) required to cluster the four natural data sets. The confidence intervals indicate the range between the minimum and maximum values. The y axis is scaled by the square root to demonstrate the relationship between the time and memory requirements relative to the number of unique sequences squared.
Effects of taxonomically splitting the data sets on clustering quality. The data sets were split at each taxonomic level based on their classification using a naive Bayesian classifier and clustered using average neighbor, VSEARCH-based DGC, and OptiClust.
Description of data sets used to evaluate the OptiClust algorithm and compare its performance to other algorithmsa
Data set (reference[s])
Read length (nt)
No. of samples
Total no. of sequences
No. of unique sequences
No. of distances
No. of OTUs
Even (34, 36)
Staggered (34, 36)
↵a Each data set contains sequences from the V4 region of the 16S rRNA gene. The number of distances for each data set indicates those that were less than or equal to 0.03. The number of OTUs was determined using the OptiClust algorithm. The even and staggered data sets were generated by extracting the V4 region from full-length reference sequences, and the data sets from the natural communities were generated by sequencing the V4 region using an Illumina MiSeq with paired reads of either 150 or 250 nt. NA, not applicable.
The OptiClust algorithm is able to effectively cluster sequences into OTUs by minimizing or maximizing numerous metrics. Plot of MCC (A), number of OTUs (B), and execution times (C) for the comparison of output from the OptiClust algorithm when minimizing or maximizing a variety of parameters when applied to four natural and two synthetic data sets. Within mothur, OTU assignments can also be made using other metrics, including minimizing false positives and maximizing the specificity, positive predictive value, and true negatives; however, these all resulted in sequences being assigned to separate OTUs, which resulted in no false positives and the maximum number of true negatives. The error bars indicate the range of values observed for 10 replicates. Download FIG S1, EPS file, 0.8 MB.
Summary of the average number of true positives, true negatives, false positives, false negatives, and the resulting Matthews correlation coefficient for each of the clustering methods that were analyzed in this study for each of the six data sets. Blank values indicate that those conditions could not be completed in 50 h with 45 GB of RAM. Download TABLE S1, PDF file, 0.02 MB.
The OptiClust algorithm rapidly converges to optimize the Matthews correlation coefficient. The six data sets were clustered into OTUs using the OptiClust algorithm seeking to maximize the Matthews correlation coefficient. This was repeated 10 times for each data set. Download FIG S2, EPS file, 0.3 MB.