Complete Assembly of Escherichia coli Sequence Type 131 Genomes Using Long Reads Demonstrates Antibiotic Resistance Gene Variation within Diverse Plasmid and Chromosomal Contexts

Drug-resistant bacteria are a major cause of illness worldwide, and a specific subtype called Escherichia coli ST131 causes a significant number of these infections. ST131 bacteria become resistant to treatments by modifying their DNA and by transferring genes among one another via large packages of genes called plasmids, like a game of pass-the-parcel. Tackling infections more effectively requires a better understanding of what plasmids are being exchanged and their exact contents. To achieve this, we applied new high-resolution DNA sequencing technology to six ST131 samples from infected patients and compared the output to that of an existing approach. A combination of methods shows that drug resistance genes on plasmids are highly mobile because they can jump into ST131’s chromosomes. We found that the plasmids are very elastic and undergo extensive rearrangements even in closely related samples. This application of DNA sequencing technologies illustrates at a new level the highly dynamic nature of ST131 genomes.


R eported cases of bloodstream and urinary tract infections caused by extraintestinal
pathogenic Escherichia coli (ExPEC) are increasing globally at an alarming rate (1). As a key source of ExPEC isolates worldwide, E. coli sequence type 131 (ST131) is regarded as a serious threat to public health, given its high level of antimicrobial resistance (AMR), as well as the broad spectrum of infections it causes in community and hospital settings (2,3).
E. coli ST131 is virulent (4) and has an expansive range of virulence factors (5,6), especially those linked to uropathogenic E. coli (UPEC) (3,7,8). AMR and virulence genes allow ST131 to adapt to drug selection pressure and to survive in extraintestinal niches and are often encoded on mobile genetic elements (MGEs) (9), which means that the exact set of virulence and AMR genes in a single ST131 isolate may vary (8,10). ST131 encodes a range of extended-spectrum ␤-lactamases (ESBLs) that hydrolyze third-line drugs, including cephalosporins, the most common of which is cefotaximase bla CTX-M- 15 . Within ST131, clade C2 has more AMR genes than other clades and is typically bla CTX-M-15 positive, differentiating it from clade C1, which can be bla  or bla CTX-M-27 positive (3,8).
Most ST131 AMR genes are reported to be carried on plasmids, circular selfreplicating double-stranded DNA molecules that constitute part of the bacterial accessory genome (11)(12)(13). Plasmids can reduce bacterial cell fitness, but a number of postsegregation killing and stable plasmid inheritance mechanisms allow the stable maintenance of IncF plasmids in ST131 (14)(15)(16). The chromosomal integration of plasmid genes is most commonly facilitated by transposons, which can ensure the acquisition and conservation of such elements if there is no subsequent local recombination (17,18).
Identifying plasmid conjugation, recombination, and transposition could have value in tracking AMR genes associated with disease outbreaks and antibiotic treatment failures. Plasmids may be classified using incompatibility (Inc), relaxase (MOB), and mating pair formation system typing (19), but difficulties in plasmid genetic analysis and reconstruction arise with short-read data due to rearrangements driven by recombination, dense arrays of repetitive elements including transposable elements (TEs), changes in gene copy numbers, and high sequence variation. Methods using short reads alone may fail to detect genomic segments exchanged between plasmids and the chromosome, limiting evaluation of the core and accessory genomes.
Whole-genome sequencing has provided a high resolution of the genomic epidemiology of ST131 and plasmid-mediated AMR outbreaks (20). However, short reads alone are insufficient to resolve plasmids that often have numerous small MGEs of ϳ1 kb or less in size, e.g., TEs and insertion sequences (ISs) (21). Complex transposable units (TUs) consisting of multiple TEs or ISs can mobilize AMR genes by transposition, and this can sometimes be followed by recombination within the TU between one of the inverted repeats (IRs) flanking the TE and the IR of another local TE or an adjacent homologous sequence, resulting in different TU structures, locations, and copy numbers. At present, the exact resolution of complex structural rearrangements of repetitive TUs containing AMR genes may be impossible with short reads (22). Consequently, plasmid assembly is a challenge, requiring accurate long reads and sufficient coverage to distinguish between independent plasmids with regions of sequence identity (21,23).
Long reads, such as those generated using Oxford Nanopore Technologies (ONT) or Pacific Biosciences platforms, can provide a solution to this plasmid assembly problem (24)(25)(26). Here, we sequenced six ST131 using the ONT GridION X5 platform. Using the resulting high-coverage sequence data, we reconstructed and annotated the plasmids and chromosomal regions carrying bla CTX-M genes, as well as their genetic context and copy numbers.

RESULTS
Oxford Nanopore long-read quality control and filtering. High-molecular-weight DNA from six E. coli ST131 isolates was sequenced using long Oxford Nanopore reads and short Illumina reads to assemble their genomes, allowing for plasmid reconstruction and resolution of AMR genes, MGEs, and associated rearrangements. The ONT GridION X5 sequencing generated 8.9 Gbases in total across 1,406,087 reads (mean length of 6.3 kb) ( Table 1). The number of reads generated per hour, total yield of bases over time, read length distribution, and read Q score distribution were examined (see Fig. S2 at https://ndownloader.figshare.com/files/14983688). Half of the bases with Q values of Ն7 were on reads of 18 kb or longer (Fig. S3). These metrics indicated sufficient GridION data in terms of quantity and quality. Initial screening removed reads with Q values of Ͻ7, leaving 1,142,067 reads with 8.2 Gbases with a mean Q score of 10.2 and a mean length of 7.2 kb (Table 1) for analysis. This included 81 reads longer than 100 kb, including one of 155,312 bases. This corresponded to a 257-fold theoretical coverage for six 5.3-Mb genomes.
The initial number of reads per library ranged from 127,118 to 510,253, and these were filtered using a series of steps to ensure that the reads used for each of the six assemblies had high quality. Bases were successfully called at an average of 97.9% of reads (Table 2). Identifying the consensus demultiplexed, duplicate-free, and adapterfree reads from Porechop v0.2.4 eliminated a further 2.9% of the base-called reads, yielding 120,123 to 487,482 reads per library ( Table 2).
Long-read genome assembly illuminates highly diverse accessory genomes. All six genome assemblies produced chromosomes of 4.81 to 5.38 Mb, with differing numbers of plasmids with lengths spanning 4 to 156 kb (see Fig. S4 at https:// ndownloader.figshare.com/files/14983688; Table 3). The numbers of contigs produced by long-read assemblies of two samples (VREC0693, VRES0739) corresponded exactly to the chromosome and plasmids. The others had either one (VREC1073, VRES1160, VREC1013) or two (VREC1428) additional chromosomal contigs (Table S2).
Contigs were classified as chromosomal or plasmid derived using mlplasmids given a probability threshold of 60% (22), with further screening for plasmid-related gene content using the Multiple Antibiotic Resistance Annotator (MARA), the Comprehensive Antibiotic Resistance Database (CARD), and PlasmidFinder (Table S2). The largest plasmid was a 156.3-kb IncFIA plasmid in VREC1073, its sole plasmid. VREC1428 and VRES1160 had 92.8-and 61.9-kb IncFIA plasmids, respectively, along with three small Col plasmids each (Table 3). VREC0693 had a 132.0-kb IncFIB plasmid and an 88.8-kb   (27). VREC3013 had one 89.9-kb IncFII plasmid. VRES0739 alone had no large plasmid, which was verified with the short-read data. By mapping the long reads to the optimal assemblies, the read coverage of each chromosome and plasmid was estimated (see Table S2 at https://ndownloader.figshare .com/files/14983688). Each chromosome had 126-to 310-fold median coverage, and the median coverage levels of large plasmids ranged from 85-to 282-fold, except for VREC1013's IncFII plasmid, which had 1,015-fold coverage and a normalized depth of 3.3-fold. The normalized depth of plasmids compared to chromosomes suggested that some cells in VREC1428 and VREC1073 may have lost their IncFIA plasmid, and the same was true for VREC0693 and its IncFIB plasmid. However, the IncFIA plasmid in VRES1160 and the IncB plasmid in VREC0693 had higher than expected copy numbers (by 9% after normalization), potentially indicating stable plasmid retention.
Across five assemblies in the Unicycler normal mode, the median insertion/deletion (indel) error rates for short reads and hybrid assemblies were similar (0.21 and 0.28 per 100 kb, respectively) but were much higher for long-read assemblies (265.0 per 100 kb) (Table S3). Likewise, the median mismatch error rates for short reads and hybrid assemblies were comparable (4.25 and 2.28 per 100 kb, respectively), but the error rate was much higher for long-read assemblies (332.8 per 100 kb) (Table S3). These rates excluded VREC1073, for which some Quast metrics were zero values. Similarly, the recovery of conserved BUSCO genes was far higher for hybrid assemblies (Ͼ99.5%) than for long-read ones (Ͼ82.3%).
The dynamic locations and genomic contexts of bla CTX-M genes in long-read assemblies. The optimized assemblies provided an improved view of the genomic context of each bla CTX-M allele, whose effectiveness as a marker for ST131 clade classification and origin (8) were explored in this study. The deeper resolution of genome architecture revealed surprising differences in bla CTX-M gene context ( Fig. 1; see Table S2 at https://ndownloader.figshare.com/files/14983688), including the discovery of chromosomal bla CTX-M genes in VREC0693 (three copies of bla CTX-M-15 ) and VREC1073 (one copy of bla CTX-M-14 ). All bla CTX-M genes were complete (876 bp) with adjacent ISEcp1 (1,658 bp with flanking IRs of 14 to 16 bp) and Tn2 (5.8-kb) elements; ISEcp1 and Tn2 can transpose bla CTX-M and other ESBL genes (28,29). The VRES0739 genome did not contain any region homologous to bla CTX-M , most likely because it had lost an IncF plasmid, unlike the other isolates.
VRES1160, VREC0693, and VREC1013 all had bla CTX-M-15 genes linked to isoforms of ISEcp1, IS26, and Tn2, implicating them in driving the transposition of the TU (Fig.  S5). Each was similar to the ST131 clade C2 ISEcp1-bla CTX-M-15 -orf477Δ TU (8, 30) but with distinct structural differences. VRES1160's single bla CTX-M-15 gene was at bp 2296 on its IncFIA plasmid and was flanked by ISEcp1 to its 5= end and by Tn2 followed by IS26 at its 3= end, with another Tn2 5= of ISEcp1. VREC0693's three chromosomal bla CTX-M-15 genes were not tandem repeats (chromosomal locations at positions 2781074 bp, 3696068 bp, and 3970927 bp), but each of these TUs were a Each assembly had seven or fewer contigs, and in three cases, no fewer contigs were possible, consistent with full genome assembly (for VREC0693, VRES0739, and VREC1073). The optimal assembly with Unicycler used long reads alone (in bold mode), with the exception of VREC1013, for which a hybrid combining short Illumina reads with long Oxford Nanopore reads was best, with minor manual screening (see supplemental results at https://ndownloader.figshare.com/files/14983688).
identical: all had ISEcp1 at the 5= ends and truncated Tn2's at the 3= ends. VREC1013's sole bla CTX-M-15 gene was located at bp 13226 on its IncFII plasmid and was flanked by a truncated ISEcp1 at its 5= end and Tn2 at its 3= end, with IS26 copies 5= and 3= of these segments. VREC1428's single bla CTX-M-27 gene was on its IncFIA plasmid at position 6018, and VREC1073's single chromosomal bla CTX-M-14 gene started at contig position 19746 (see Fig. S5 at https://ndownloader.figshare.com/files/14983688). Both their bla CTX-M genes were flanked by a truncated ISEcp1 at the 5= ends and a shortened IS903B at the 3= ends, suggesting that ISEcp1 and IS903B may have facilitated the transposition of the TU from the plasmid. Similar bla CTX-M gene transposition events have been observed in ST131 clade C1 (8).
Alignment of the plasmid-derived contigs of VRES1160 (IncFIA) to VREC1013 (IncFIB) showed that the bla CTX-M-15 -positive plasmids were much more similar (Ͼ83% identity) than VREC1428's bla CTX-M-27 -positive IncFIA plasmid, which was more distinct (Fig. 2). In addition, VREC1428's plasmid had traI and traD genes, indicating conjugation machinery (Table 4), as well as high homology to at least one published plasmid, unlike VRES1160's and VREC1013's plasmids (see supplemental results at https://ndownloader .figshare.com/files/14983688). This suggested that the VRES1160 and VREC1013 plasmids had homology corresponding well with bla CTX-M gene and subclade classification and that they were structurally different from published plasmids due to recombination.
Phylogenetic context of analyzed isolates. A comparison of these six samples with 119 published ST131 isolates (8, 31) as short-read assemblies scaffolded using reference genome NCTC13441 showed that all clustered in ST131 clade C (Fig. S6). There was sufficient resolution across 4,457 core genome single-nucleotide polymorphisms (SNPs) to confidently assign them to subclade C1 (n ϭ 1) or C2 (n ϭ 5) ( Fig. 3). VRES1160, VREC0693, VREC1013, VRES0739, and VREC1073 clustered with C2, whereas the bla CTX-M-27 -positive VREC1428 was in C1. VRES1160, VREC0693, and VREC1013 all had IncF plasmids (IncFIA, IncFIB, IncFII) and bla CTX-M-15 genes, consistent with C2 isolates, which are typically bla CTX-M-15 positive; 77% of the C2 isolates here (48 out of 62) were observed to be bla CTX-M-15 positive. However, VREC1073 was in subclade C2 but had a bla CTX-M-14 gene, contradicting this pattern, and was the sole bla CTX-M-14 -positive C2 isolate found here. The core genomes of VRES0739 and VREC0693 were identical, implying that VRES0739 has very recently lost its (bla CTX-M -positive IncF) plasmid. The sole isolate clustering with C1 was VREC1428, which had an IncFIA plasmid with a bla CTX-M-27 gene and so may belong to the emerging subclade C1-M27, as evidenced by the presence of prophage-like regions like M27PP1/2 (31).

DISCUSSION
Our study resolved the plasmid architecture of several recent E. coli ST131 isolates, allowing investigation of AMR gene location, copy number, and potential transposon-driven rearrangements. This advance was facilitated by careful DNA handling during extraction to produce large volumes of high-molecular-weight DNA that was pure and free from contamination, which was avoided by performing separate extraction steps to obtain small plasmids (32), overcoming a limitation for MinION sequencing (21).
The long-read genome assemblies illuminated significant variations in plasmids, MGEs, and bla CTX-M gene composition that were not captured by short reads. ST131 is a globally pandemic E. coli clonal group (15) with diverse sources of transmission (25). Phylogenetic comparison with published genomes (8,31) showed that five out of six isolates were from subclade C2, with one from C1. The emergence of clade C has been associated with IncF plasmids, and subclade C2 has been associated with ISEcp1 and Tn2 elements flanking bla CTX-M-15 genes (33,34). Our long-read assemblies showed the excision of the entire TU from the IncFIB plasmid and chromosomal integration at three distinct locations for VREC0693 and, similarly, chromosomal translocation of the  Table 5. bla CTX-M-14 gene from an IncFIA plasmid for VREC1073, mediated by ISEcp1 and IS903B based on previous work (8). These transposition events were likely driven by recombination at adjacent transposable elements. This highlights the value of long-read sequencing to resolve the location of bla CTX-M genes and that chromosomal translocations are not rare in ST131.
A high resolution of the AMR gene structure, context, and copy number is highly predictive of AMR phenotypes (35) and could lead to new insights into AMR mechanisms. However, the high indel and mismatch errors in long Oxford Nanopore reads (32,(36)(37)(38) limit the power to identify AMR isoforms that could permit genome-based antimicrobial susceptibility testing (27,39,40). Here, the five ONT assemblies together had an average of 447-fold higher indel and 48-fold higher mismatch error rates than those for the corresponding Illumina reads, similar to previous work with MinION reads (23), and this impacted gene identification. Consequently, short reads and assembly polishing methods remain important for SNP identification and error detection until long-read error rates can be reduced (41).
Our findings illustrate the diversity of AMR gene context even within recently emerged clones such as ExPEC ST131. The detection of multiple instances of chromosomally integrated ESBL genes using long reads here for bla CTX-M-15 in E. coli has parallels elsewhere for bla  in bla CTX-M-15 -positive Klebsiella pneumoniae (42) and so highlights chromosomal ESBL gene ISEcp1-mediated transposition as a potential adaptive mechanism in Enterobacteriaceae. Further studies with larger sample sizes are needed to identify the rates and mechanisms of these dynamic changes.

MATERIALS AND METHODS
Sample collection. Six ESBL-producing E. coli ST131 clinical strains were isolated in June to October 2015 from patients at Addenbrooke's Hospital, Cambridge, United Kingdom, as part of a study on antibiotic resistance (see Table S1 at https://ndownloader.figshare.com/files/14983688). Five samples were from feces, and one was from blood. These were short-read libraries in a multiplex run on an Illumina HiSeq 2500 platform and were processed as previously outlined (43).

FIG 3
The phylogenetic context of the six ST131 genomes (names in large bold font) shows that all except that of VREC1428 are in ST131 subclade C2 (red inner ring, VRES1160, VREC1073, VRES0739, VREC0693, and VREC1013). VREC1428 clustered in subclade C1 (purple inner ring). No new isolate clustered in C0 (green inner ring), B (blue inner ring), or an intermediate cluster (gray inner ring). Clade classification was based on phylogenetic analysis (8) by including the reference NCTC13441, 63 isolates from reference 8, and 56 isolates from reference 31 with associated classification and bla CTX-M allele data. VREC1073 and VREC0693 had chromosomal bla CTX-M genes. The outer ring shows bla CTX-M-15 (red), bla CTX-M-14 (purple), and bla CTX-M-27 alleles (green). The phylogeny was built with RAxML v8.2.11 using 4,457 SNPs from a core genome alignment generated with Roary v3.11.2 and was visualized with iTOL v4.3. Branch support was performed by 100 bootstrap replicates, and the scale bar indicates the number of substitutions per site. This midpoint-rooted phylogeny includes reference genome isolates EC958 and NCTC13441 (both in C2).
High-molecular-weight DNA extraction. Frozen stocks of the six isolates were streaked onto LB agar plates and grown overnight at 37°C. Single colonies were subcultured onto LB agar plates and incubated overnight at 37°C. DNA was extracted using a Lucigen MasterPure complete DNA and RNA purification kit. For each sample, a swab was used to sweep half a plate of pure colonies, which was suspended in 1ϫ phosphate buffer solution (PBS). Samples were processed according to the manufacturer's instructions, with elution in 70 l of nuclease-free water. Pipetting was minimized to reduce shearing of the DNA prior to sequencing.
Oxford Nanopore library preparation and sequencing. DNA was quantified using a Quant-iT HS (high-sensitivity) kit (Invitrogen). DNA purity was checked using a NanoDrop (Thermo Fisher), and fragment size was confirmed by a Femto Pulse system (Nano Life Quest). The sequencing libraries were prepared using 1 g DNA per sample and ligation sequencing kit 1D SQK-LSK109 with the barcoding extension kit EXP-NPB104 in accordance with the ONT protocols. The samples were combined using equimolar pooling and loaded onto a single 9.4.1 MIN-106 flow cell and sequenced on the GridION X5 platform under standard conditions. Illumina library preparation and sequencing. The short reads used in this study were created as follows: bacterial genomic DNA was extracted using a QIAxtractor (Qiagen, Valencia, CA, USA) according to the manufacturer's instructions. Library preparation was conducted according to the Illumina protocol and sequenced (96-plex) on an Illumina HiSeq 2500 platform (Illumina, San Diego, CA, USA) using 100-bp paired-end reads.
Genome assembly and improvement. We assembled the genomes using the conservative, normal, and bold modes of the long-read-only assembly pipeline in Unicycler v4.6. Previous work has suggested that Unicycler outperforms alternatives (21) that struggle to resolve plasmids (36). This workflow included the assembly polisher, Racon, which ran iteratively to minimize error rates of called bases (45). For comparison, short read-only and hybrid assemblies were also created using Unicycler v4.6. Briefly, during short-read-only assembly, Unicycler v4.6 employed SPAdes v3.12 to assemble short reads and then used Pilon v1.22 to polish the assembly. In hybrid assemblies, Unicycler v4.6 used Miniasm to piece together long reads first and applied SPAdes v3.12 to incorporate short reads and bridge gaps. Pilon was run 3 to 10 times for short-read assemblies and 5 to 10 times for hybrid ones, until no further changes were required to achieve the most contiguous and complete genome assemblies. The average numbers of Genome assembly assessment and error rate quantification. The quality of resulting assemblies was assessed using Quast 3.0 (47) according to the total assembly length, number of contigs, N 50 , GC content, and degree of replicon circularization. Assembly graphs were visualized with Bandage (48). The resulting contigs in each assembly were classified as chromosomal or plasmid using machine learning algorithms implemented in mlplasmids (22). Genome completeness was examined using the numbers of single-copy universal orthologous genes using BUSCO (Benchmarking Universal Single-Copy Orthologs) v3 with the gammaproteobacteria_odb9 database (49). Read depth estimation. The read depth of each replicon was estimated by aligning the short Illumina reads and the long Oxford Nanopore reads to the completed genomes by use of Smalt v0.7.6 and BWA-MEM v0.7.17 (with the flag -x ont2d for ONT reads), respectively. SAMtools v1.7 was used to process the SAM files to BAM format, remove duplicates, and identify the coverage at each base of each assembly. The median value for each replicon was noted and was normalized using the median chromosomal depth of the same assembly.
Genome annotation. The genomes were annotated using Prokka v1.13.3 (50). bla CTX-M alleles and their contexts were detected using the Multiple Antibiotic Resistance Annotator (MARA) (51) and by aligning the assemblies against the Comprehensive Antibiotic Resistance Database (CARD v3.0) to screen for matches with 100% identification only. Information on the detected AMR features and MGEs are retrieved from Galileo AMR (https://galileoamr.arcbio.com/mara/feature/list). Plasmid identification and typing were carried out using PlasmidFinder v2.0 (52). The plasmid-derived contigs from the assembled genomes were compared using BLAST v2.6.0 with a database of 10,892 complete plasmids (53). Their homology and annotation were visualized using EasyFig v2.2.2 (54).
Phylogenetic analysis. To provide a phylogenetic context for these six isolates, the Illumina read libraries of 63 published ST131 samples from references 8 and 56 from reference 31 were cleaned and trimmed using Fastp v0.12.3 (55), as were the short-read libraries of the six isolates from this study. These 125 libraries were de novo assembled with Unicycler v4.6 using NCTC13441 as a reference and annotated using Prokka. The 126 genomes were processed using Roary v3.11.2 (56) with a 95% BLAST v2.6.0 identity threshold to create a core genome alignment containing 4,457 SNPs using MAFFT v7.310 (57) spanning 3,250,343 bases and 3,350 genes of the NCTC13441 chromosome (a length similar to that described previously [20]). This core genome was used to construct a maximum likelihood phylogeny using RAxML v8.2.11 with the general time-reversible (GTR) model with gamma rate heterogeneity (58). Clade classification of the six isolates was based on a published ST131 phylogenetic analysis (8)

ACKNOWLEDGMENTS
We acknowledge Anne Parle-McDermott and Emma Finlay at Dublin City University (DCU, Ireland) for guidance on DNA extraction protocols and Emma Betteridge, Karen Oliver, and the Long Read sequencing and data teams at the Wellcome Sanger Institute (United Kingdom) for their assistance with sequencing. J.P. is a consultant to Next Gen Diagnostics LLC.