Rapid bioinformatics- and machine learning-based identification of respiratory bacterial pathogens to assist antimicrobial stewardship

Yesmine Sahnoun1,2,3, Mathieu Thériault1,2, Dominique Boudreau1,2, Martine Bastien1,2, Ève Bérubé1,2, Maria Christina Mallet1,2, Elsa Rousseau3,4,5, Sandra Isabel1,2,6
1 Centre de recherche en infectiologie, Université Laval, Québec, Québec, Canada
2 Axe Maladies Infectieuses et Immunitaires, Centre de Recherche du Centre Hospitalier Universitaire de Québec-Université Laval, Québec, Québec, Canada
3 Département d’Informatique et de Génie Logiciel, Faculté des Sciences et de Génie, Université Laval, Québec, Québec, Canada
4 Centre de Recherche en Données Massives de l’Université Laval, Québec, Québec, Canada
5 Centre Nutrition, Santé et Société (NUTRISS), Institute of Nutrition and Functional Foods (INAF), Université Laval, Québec, Québec, Canada
6 Département de pédiatrie, Faculté de médecine, Université Laval, Québec, Québec, Canada
 
Antimicrobial resistance (AMR) is a recognized public health concern, driven in part by the use of broad-spectrum antibiotics when rapid species-level identification of bacterial infections is unavailable. Accurate species-level identification is needed to guide targeted antibiotic therapy. Next-generation sequencing of bacterial DNA has emerged as an approach for rapid pathogen identification and resistance profiling, offering a more timely and precise alternative to traditional methods.
 
This study evaluated and compared two optimized approaches for rapid species-level classification of bacterial reads from targeted nanopore sequencing: an alignment-based bioinformatic (BIF) pipeline and an explainable k-mer-based machine learning (ML) model. Sequences from 116 samples encompassing 60 bacterial species were analyzed with the BIF pipeline, which aligns raw reads to a lab-constructed, primer-trimmed reference database covering 392 bacterial species. In parallel, a logistic regression classifier was trained on k-mer profiles derived from the same reference sequences and then applied to predict the species of reads from the samples.
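
As an illustration of what a k-mer profile is, the following minimal Python sketch converts a single read into a fixed-length vector of normalized k-mer frequencies. The function name `kmer_profile`, the choice of k = 4, the frequency normalization, and the skipping of k-mers containing ambiguous bases are all assumptions made for illustration; the abstract does not specify these details, and this is not the authors' implementation.

```python
from collections import Counter
from itertools import product


def kmer_profile(read: str, k: int = 4) -> list[float]:
    """Return normalized frequencies of all A/C/G/T k-mers in `read`.

    Hypothetical helper: k, the normalization, and the handling of
    ambiguous bases are illustrative assumptions, not taken from the study.
    """
    read = read.upper()
    # Count every overlapping window of length k in the read.
    counts = Counter(read[i:i + k] for i in range(len(read) - k + 1))
    # Fixed vocabulary in lexicographic order, so every read maps to a
    # vector of the same length (4**k entries).
    vocabulary = ["".join(p) for p in product("ACGT", repeat=k)]
    # Only k-mers made purely of A/C/G/T contribute (e.g. windows with N are skipped).
    total = sum(counts[kmer] for kmer in vocabulary)
    return [counts[kmer] / total if total else 0.0 for kmer in vocabulary]


profile = kmer_profile("ACGTACGT", k=2)
print(len(profile))  # 16 possible 2-mers
```

Vectors of this kind, computed for reference sequences labeled by species, could serve as input features for a logistic regression classifier (for example, scikit-learn's `LogisticRegression`); the learned per-k-mer coefficients are one reason such a model can be considered explainable.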
 
The BIF pipeline achieved a species-level accuracy of 98% and the ML model 97%; both approaches reached 100% sensitivity, with specificities of 97% and 96%, respectively. On a 90 MB dataset, the BIF pipeline completed identification within 40 minutes, whereas the ML model produced classifications in under 4 minutes.
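
For readers less familiar with these metrics, the sketch below shows the standard definitions of accuracy, sensitivity and specificity computed from confusion-matrix counts. The function name `binary_metrics` is hypothetical, the counts in the usage line are made-up placeholders rather than study data, and how the study aggregated these metrics across species is not stated in the abstract.

```python
def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Compute accuracy, sensitivity and specificity from confusion counts.

    tp/fn: positives correctly/incorrectly called; tn/fp: negatives
    correctly/incorrectly called. Returns 0.0 for any undefined ratio.
    """
    total = tp + fp + tn + fn
    return {
        # Fraction of all calls that are correct.
        "accuracy": (tp + tn) / total if total else 0.0,
        # Fraction of true positives that are detected.
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        # Fraction of true negatives that are correctly rejected.
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
    }


# Placeholder counts for illustration only (not results from the study).
print(binary_metrics(tp=8, fp=1, tn=9, fn=2))
```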
 
The BIF pipeline offers highly accurate and traceable species-level calls, whereas the ML model reduces runtime roughly tenfold (under 4 minutes versus 40 minutes) at the cost of one percentage point of accuracy (97% versus 98%). Together, they support rapid diagnostics for respiratory bacterial infections and can contribute to antimicrobial stewardship programs by enabling targeted prescribing and limiting the spread of antimicrobial resistance. Future directions include expanding the panel of targeted genes to broaden pathogen coverage and further improve species-level identification, and incorporating resistance markers such as mecA, pbp2b and pbp2x to characterize bacterial species and their antimicrobial resistance profiles simultaneously.