Emergence of a novel Salmonella enterica serotype Reading clone is linked to its expansion in commercial turkey production, resulting in unanticipated human illness in North America

Concurrent separate human outbreaks of Salmonella enterica serotype Reading occurred in 2017-2019 in the United States and Canada, which were both linked to the consumption of raw turkey products. In this study, a comprehensive genomic investigation was conducted to reconstruct the evolutionary history of S. Reading from turkeys, and to determine the genomic context of outbreaks involving this rarely isolated Salmonella serotype. A total of 988 isolates of U.S. origin were examined using whole genome-based approaches, including current and historical isolates from humans, meat, and live food animals. Broadly, isolates clustered into three major clades, with one apparently highly adapted turkey clade. Within the turkey clade isolates clustered into three subclades, including an “emergent” clade that only contained isolates dated 2016 or later, including many of the isolates from these outbreaks. Genomic differences were identified between emergent and other turkey subclades suggesting that the apparent success of currently circulating subclades clade is, in part, attributable to plasmid acquisitions conferring antimicrobial resistance, gain of phage-like sequences with cargo virulence factors, and mutations in systems that may be involved in beta-glucuronidase activity and resistance towards colicins. U.S. and Canadian outbreak isolates were found interspersed throughout the emergent subclade and the other circulating subclade. The emergence of a novel S. Reading turkey subclade, coinciding temporally with expansion in commercial turkey production and with U.S. and Canadian human outbreaks, indicates that emergent strains with higher potential for niche success were likely vertically transferred and rapidly disseminated from a common source. Importance Increasingly, outbreak investigations involving foodborne pathogens are confounded by the inter-connectedness of food animal production and distribution, necessitating high-resolution genomic investigations to determine their basis. Fortunately, surveillance and whole genome sequencing, combined with the public availability of these data, enable comprehensive queries to determine underlying causes of such outbreaks. Utilizing this pipeline, it was determined that a novel clone of Salmonella Reading has emerged that coincides with increased abundance in raw turkey products and two outbreaks of human illness in North America. The rapid dissemination of this highly adapted and conserved clone indicates that it was likely obtained from a common source and rapidly disseminated across turkey production. Key genomic changes may have contributed to its apparent continued success in the barn environment, and ability to cause illness in humans.


Introduction 69
Salmonella enterica subsp. enterica derived from poultry meat continue to serve as a primary 70 cause of salmonellosis infections towards humans in the United States (1, 2). Among the more 71 than 2500 serotypes that have been identified to date, only a handful of them consistently top the 72 list as those causing the majority of cases of human illness. Estimates on human salmonellosis 73 cases from poultry in the U.S. vary depending on the method used from 10-29.1%, and 74 specifically from turkeys 5.5% (3,4). 75 76 S. Reading is a serotype of S. enterica subsp. enterica first identified in 1916 from a water supply 77 in Reading,England (5), and subsequently identified in various animal hosts, including poultry 78 (6-10). Human outbreaks due to S. Reading historically have been rare. In 1956In -1957 outbreak involving S. Reading occurred in the U.S., sickening 325 people across multiple states 80 (11). In 2008, 30 persons were involved in an outbreak linked to iceberg lettuce in Finland (12). 81 In 2014-2015, an outbreak of unknown origin was described, with 31 confirmed cases in Canada 82 involving persons of Mediterranean descent (13). 83 84 Commercial turkey production is commonly identified as a primary reservoir of S. Reading (2,. Given its low isolation frequency, relatively little is known about the biology of S. 86 Reading compared with other serotypes. With that said, S. Reading has been shown to have 87 enhanced ability to form biofilms under stress conditions (18) and seems to have the ability to 88 withstand environmental conditions, as they have been isolated from produce (19). Multidrug 89 resistance phenotypes, including resistance towards third-generation cephalosporins, also appear 90 website (https://pubmlst.org) (25). Based on this scheme, six STs were identified with three 114 dominating: one containing primarily turkey-source and human-source isolates (ST412; 83.7% 115 of isolates), one containing primarily swine/bovine-source and human-source isolates (ST1628; 116 10.2% of isolates), and one containing primarily human-source isolates (ST93; 5.8% of isolates) 117 (Figure 1). Animal host source was strongly correlated with ST, with 99.6% (564/566) of total 118 turkey-source isolates belonging to ST412 and 93.8% (45/48) and 84.1% (37/44) of swine-source 119 and bovine-source isolates belonging to ST1628, respectively. To rule out temporal bias in the 120 clustering of same host-source isolates by ST, isolates were also characterized based on year of 121 isolation using the same ST scheme (Figure S1). This demonstrated evenness with regard to 122 isolation date across the major STs. 123 Reading isolates. Three isolates (swine-, chicken-, and human-source) are not included because 125 their STs could not be determined. Tree is colored based on isolate host source. 126 127 128 Core genome MLST (cgMLST) profiles based upon 3,002 loci were then identified for all 129 isolates, allowing for up to either two allelic differences (Figure S1) or five allelic differences 130 (Figure S1). In all analyses, there was clear and consistent separation based upon animal host 131 source, separating isolates into three major groups. 132 133 To gain further resolution, a whole genome SNP-based phylogenetic tree was constructed for all 134 isolates (Figure 2). The resulting tree contained 11086 core SNPs and resolved isolates into three 135 primary clades (designated Clades 1-3), corresponding to MLST and cgMLST results. Clade 1 (n 136 = 828) was comprised mainly of turkey-source and human-source isolates, and all but one 137 turkey-source isolate fell within this clade. Clade 2 (n = 59) was primarily human-source 138 isolates. Clade 3 (n = 101) contained mainly swine-source and bovine-source isolates, with 139 95.8% (46/48) and 84.1% (37/44) of total swine-source and bovine-source isolates falling within 140 this clade, respectively. Average core SNP distances were investigated between clades (Table  141 S1), revealing that Clades 1-2 were more similar to one another (mean core SNP difference 142 1638.72 ± 8.49) than Clades 1-3 (8165.04 ± 10.91) or Clades 2-3 (9246.30 ± 12.72). 143 Additionally, mean SNP differences for isolates within Clade 1 (7.72 ± 5.61) were lower than 144 those within Clade 2 (59.23 ± 44.10) or Clade 3 (32.87 ± 16.84). To confirm these results were 145 not due to different sample sizes between clades, average core SNP distances were recalculated 146 on a random subsample of each clade (Table S1)

162
A pan-genome approach was then used to investigate specific genomic differences between 163 isolates from Clades 1 and 3, representing the majority of isolates from turkey and swine/bovine 164 sources, respectively. A total of 11366 gene clusters were identified across all 988 isolates, with 165 3246 (28.6%) present in 100% of isolates (i.e. the "core" genes). Using a cutoff requirement of 166 100% prevalence vs. 0% prevalence in the two populations, a total of 225 gene clusters were 167 isolates from Clade 1 were then examined alone to gain further insight towards their evolution 181 over time. All of these isolates (n = 565), except one, belonged to ST412 and were examined at 182 higher resolution using a core SNP-based phylogenetic tree (Figure 3). The core SNP-based 183 phylogenetic tree contained 1093 informative variant sites, and from this, three major clades 184 were designated based upon tree clustering and dates of isolation. The "historical" clade (orange 185 clade in Figure   The same three-clade structure was also observed in a minimum spanning tree from cgMLST 198 data allowing for up to two allelic differences (Figure 4), where isolates clearly separated by 199 clade designation (historical, contemporary, and emergent) and 57.6% of all isolates in the 200 emergent clade were of the same cgMLST profile. A phylogenetic tree constructed from core 201 genome SNPs and a dendrogram based on hierarchical clustering of all pan-genome genes also 202 showed isolates clustered into the same three clades (Figure S2 Based upon average core SNP distances (

Small plasmids and associated resistance genes define differences between turkey-source 222
clades. All Clade 1 turkey-source isolates were examined for their possession of genes and 223 mutations known to confer antimicrobial resistance, and plasmid replicons, known among Gram-224 negative bacteria (Figure 5). When overlaid on the SNP-based phylogenetic tree, several patterns 225 emerged. First, nearly all isolates contained a T57S mutation in parC and ColpVC plasmid 226 replicon. An IncQ1 plasmid replicon was found in 20% (41/201) and 33% (98/295) of isolates 227 belonging to the contemporary and emergent clades, respectively. The possession of this plasmid 228 replicon was significantly associated with possession of sul2, tet(A), strA (aph(3")-Ib), and strB 229 (aph(6)-Id) genes conferring the classical SSuT phenotype (Table S2; all pairwise Fisher's exact  230 test BH-adjusted P-values < 0.05). Possession of these traits were found throughout the emergent 231 clade, with some evidence of trait loss scattered infrequently. In contrast, isolates possessing 232 these traits in the contemporary clade were found clustered in the one half of the clade, and were 233 absent from the other half. 234  Isolates belonging to the emergent clade also frequently possessed co-occurring ColRNAI-like 238 and Col440II-like plasmid replicons, present in 61% (181/295) and 65% (191/295) of isolates 239 within this clade. Isolates in the emergent clade were over 20 times more likely to possess both 240 replicons compared to isolates in the historical and contemporary clades (Fisher's exact test: 241 odds ratio = 0.022, P-value < 0.05). Possession of these replicons was also significantly 242 associated with possession of the beta-lactam resistance gene, bla TEM-1C (Table S2; all pairwise  243 Fisher's exact test BH-adjusted P-values < 0.05). 244 245 Complete sequences of these highly conserved plasmids belonging to IncQ1 and 246 Col440II/ColRNAI-like replicon types were identified and annotated from a representative 247 turkey-source isolate (Figure 6). The IncQ1 replicon and sul2-strAB-tetAR genes were co-248 localized within a 10867-bp mobilizable plasmid containing mobAC. The Col440II-and 249 ColRNAI-like replicons were found on a 10384-bp mobilizable plasmid containing mobAD and 250 bla TEM-1C adjacent to a Tn2 transposon. 251  398 after removal of duplicated sequences) was reconstructed using a GTR nucleotide 330 substitution model, an uncorrelated lognormal relaxed molecular clock, and a constant growth 331 coalescent model ( Figure S4). The model predicted an evolutionary rate of 4.14×10 -7 332 substitutions/site/year (higher posterior density (HPD 95% ) = 3.60-4.77×10 -7 ) and time to most 333 recent common ancestor (TMRCA) for the Turkey clade was dated to 1984 (1975-1992). The 334 branching of the contemporary and emergent clades was dated to 1997 (1994)(1995)(1996)(1997)    products, prompting the question of why this strain and serotype has become successful. The 384 overall genetic differences between the avian subclades were subtle, yet may provide important 385 clues highlighting the success of the contemporary and emergent strains. One distinguishing 386 feature of the emergent strains, and contemporary subsets of the circulating clade, was the 387 presence of mobilizable IncQ1 and Col440II/ColRNAI-like small plasmids. Collectively, these 388 plasmids encode resistance towards ampicillin, streptomycin, sulfamethoxazole, and tetracycline. 389 IncQ1 plasmids are broad host range, highly mobilizable plasmids capable of residing in a 390 variety of Gram-negative bacterial species (29). Similar conformations of this plasmid conferring 391 the same SSuT resistance profile have been identified in S. Typhimurium in Italy (30). While the 392 presence of these two plasmids appears to be a marker of evolution of the clade, they apparently 393 have been frequently lost by isolates in the emerging clade. There was no association between 394 isolate host source and apparent plasmid loss (i.e., human-versus turkey-source isolates), 395 indicating that plasmid loss is not a function of selective pressure in a particular environment, but 396 instead a function of genetic gain followed by plasmid instability or dispensability. 397

398
There was an overall genome size gain between the historical to contemporary/emergent isolates 399 within Clade 1. This was primarily due to acquisition of several phage-like elements within the 400 chromosome. Acquisition of a lambda-like prophage-like element was accompanied by 401 accessory carriage of sopE into the contemporary and emergent clades (Figure 8). All avian 402 strains carried the canonical version of Salmonella pathogenicity-associated island, SPI-1. SopE, 403 along with SopE2, are guanine nucleotide exchange effector molecules for the type III secretion 404 system encoded by SPI-1 (31). Together, these two molecules are able to act differentially on the 405 RhoGTPase signaling cascade and may promote enhanced inflammatory function. SopE has also 406 been shown to enhance murine colitis (32). SopE has previously been identified on a P2 family 407 phage-like element in S. Typhimurium (33) and was associated with persistent epidemic strains 408 in humans and animals. SopE has also been shown to reside on diverse phage types, including 409 lambda-like phage in S. Gallinarum, Enteritidis, Hadar and Dublin (34), and was more common 410 in the most common human serotypes in England (35). Therefore, the acquisition of SopE by 411 contemporary and emergent avian clade isolates may represent an advantage for their persistence 412 and virulence. 413 414 Two gene disruptions were notable between the emergent and contemporary isolates of Clade 1. 415 First, emergent isolates possessed a frameshift insertion of cytosine at position 680 in the cirA 416 gene resulting in a predicted frameshift that was uniform across emergent isolates. Additionally, 417 a portion of the contemporary isolates possessed a truncation of cirA that was independent of the 418 mutation identified in emergent isolates. CirA is a catecholate siderophore receptor that also 419 serves as the receptor for colicin ColIb, a pore-forming toxin produced by some E. coli and 420 Salmonella as a competitive exclusion mechanism (36). ColIb production has been shown to 421 favor producers during competition with ColIb-sensitive strains lacking the plasmid that encodes was conducted for all available raw sequencing data of isolates annotated as Salmonella enterica 480 subsp. enterica serotype Reading. Only isolates that met the following criteria were considered: 481 1) was collected within the United States, 2) had a known isolation year, and 3) had a known 482 isolation source. Raw sequencing reads of all identified isolates (n = 989) were downloaded from 483 the SRA using the SRA Toolkit (v2.8.2). An additional 32 isolates collected from U.S. 484 commercial turkey production facilities were sequenced for this study (see Sample collection and 485 DNA sequencing for details). A series of quality filtering steps within the bioinformatic 486 processing pipeline (described below) were used to obtain a final sample size of 988 high-quality 487 isolate genomes, including 566 from turkey-related sources (Supplementary Dataset,488 https://figshare.com/articles/Supplementary_dataset/10781795). A summary of sample filtering 489 steps is depicted in Figure S6. (https://github.com/tseemann/snippy) and the S. Reading assembly, SRR6374143, as the 528 reference. Separate core SNP alignments were then created for all isolates (n = 988) and for all 529 Clade 1 turkey-source isolates (n = 565). Based on MLST and cgMLST minimum spanning 530 trees, one turkey isolate clustered separately from all other turkey isolates and was therefore not 531 included in the turkey-source alignment. Recombinant regions were identified with Gubbins 532 (v2.3.4) (55) and masked from the core genome alignments using maskrc-svg (v0.5) 533 (https://github.com/kwongj/maskrc-svg). Samples with >25% missing data were removed from 534 further analyses ( Figure S6). The program snp-sites (v2.4.1) was then used to extract all core 535 SNPs and monomorphic sites where the columns did not contain any gaps or ambiguous bases 536 (56). Pairwise core SNP distance matrices were created using snp-dists (v0.6.3) 537 (https://github.com/tseemann/snp-dists) after duplicate core SNP profiles were removed with 538 SeqKit (v0.10.1) (57). 539 540 Maximum likelihood trees for both all isolates and the turkey-source isolates only were 541 reconstructed based on the alignments of core SNPs plus monomorphic sites with IQ-TREE 542 (v1.6.10) (58). ModelFinder was used to identify the most appropriate substitution models (59). 543 For the "all-isolate" tree, the model with the best fit according to the Bayesian information 544 criterion was the three substitution-type model (K3Pu) (60) with empirically-derived unequal 545 base frequencies (+F) and the discrete Gamma model of rate heterogeneity model with four rate 546 categories (+G4) (61). For the "turkey-source" tree, the best model was K3Pu+F+I, where the 547 rate heterogeneity model (+I) allowed for a proportion of invariable sites. Branch support for 548 both trees was estimated by performing 1000 ultrafast bootstrap approximation replicates (62). 549 The resulting trees were visualized and annotated using the online tool iTOL (63). 550

551
To assess the robustness of clades identified in the turkey-source core SNP-based phylogenetic 552 tree, two additional turkey-source trees were constructed using alternative methods based on the 553 pan-genome (see Pan-genome analysis for further details). First, a core genome phylogenetic 554 tree was constructed from the core genome alignment. Core SNPs and monomorphic sites were 555 then extracted from this alignment and used as input into ModelFinder and IQ-TREE. The best 556 model was the transversion substitution model [AG = CT] (TVM) with empirically-derived 557 unequal base frequencies (+F) and allowing for a proportion of invariable sites (+I). Branch 558 support was estimated from 1000 ultrafast bootstrap approximation replicates. Second, a 559 hierarchical clustering dendrogram was generated based on the presence/absence of pan-genome 560 gene clusters. Euclidean distance was calculated using the R package, vegan (v2.5-5) (64), and 561 complete linkage clustering was performed by the hclust function from the R package, stats 562 (v3.6.1). 563 564 A separate maximum likelihood tree of all Clade 1 turkey-source isolates (n = 565) and human-565 source isolates identified as part of the 2017-19 S. Reading outbreaks in the U.S. (n = 139) and 566 Canada (n = 111) was constructed following the same methods outlined above. As with the 567 turkey-only tree, the best model was identified as K3Pu+F+I, with 1000 ultrafast bootstrap 568 approximation replicates to estimate branch support. 569 570 Time-scaled phylogenetic analysis. Non-duplicate turkey-origin isolates were used. A 'temporal 571 signal' of the data was evaluated by generating a linear regression of phylogenetic root-to-tip 572 distances against the sampling dates using Tempest (v1.5) (65), and a positive correlation 573 between root-to-tip distance and collection time (R 2 = 0.46) was demonstrated. In addition, the 574 'temporal signal' was verified using a tip-date randomization test that was conducted using the 575 package 'TipDatingBeast' (v1.0.6) (66) in R (v3.4.3) (67). The evaluated TMRCA for the 576 selected model (below) was compared between the real data and the randomized trials (n = 20), 577 and no overlaps were found between the HPD 95% intervals and/or mean values (data not shown  Col440II-like replicon was used to search the publicly available ENA/SRA databases (through 628 December 2016; 90% threshold) (81). Metadata for sequences positive for the 282-bp target was 629 downloaded from NCBI. Resistance gene content was determined using an in-house database 630 adapted from ResFinder 3.0 (90% identity, 60% cutoff). Sequenced isolates with both serotype 631 and year of collection available were included in the analysis (n = 100). 632 633 Pan-genome analyses. Sample assemblies were annotated with Prokka (v1.13.4) and a core 634 genome alignment was generated using Roary (v3.12.0) (82). Coding sequences were clustered 635 into "gene clusters" using the default 95% sequence identity. "Core genes" were defined as gene 636 clusters identified in 100% of isolates, while an "accessory genes" were defined as clusters 637 present in <100% of isolates. A presence/absence matrix heatmap of accessory genes was created 638 using the roary_plots.py script (https://github.com/sanger-639 pathogens/Roary/tree/master/contrib/roary_plots). Scoary (v1.6.16) (83) was then used to 640 conduct a pan-genome-wide association analysis comparing the prevalence of gene clusters 641 between phylogenetic clades. Specifically, in the all-isolate trees, "Clade 1" isolates were 642 compared to "Clade 3" isolates, and in the turkey-source tree, "contemporary clade 1b" isolates 643 were compared separately to both "emergent clade 1c" and "historical clade 1a" isolates. Genes 644 identically distributed across samples were collapsed into a single gene cluster with the collapse 645 Figure S5. Phylogenetic tree of S. Reading isolates from turkey (n = 565) and humans (n = 250) 944 based on 1242 core SNPs in nonrecombinant genome regions. Only human-source isolates 945 classified as part of the 2017-19 S. Reading outbreaks in the U.S. and Canada were included. The 946 inner colored ring around the tree denotes the core SNP-based phylogenetic tree clade 947 designations (based on Figure 3). Colored stars indicate country of origin for each human-source 948 isolate (USA-blue, Canada-red). The outer four rings denote the presence of genetic elements 949 sopE, uidA, full cirA, frameshifted cirA, and truncated cirA. The tree is rooted with a turkey-950 source isolate collected in 2002 (SRR1195634). 951 952 Figure S6. Schematic depicting quality filtering steps within the bioinformatic processing 953 pipeline. Thirty-three isolates were removed during filtering for a final sample size of 988 high-954 quality S. Reading isolate genomes. 955 956