Since the introduction of spectrographs and mass spectrometers in the early 1900s,1 mass spectrometry (MS) has undergone enormous technological improvements. Once a methodology primarily used by chemists, MS is now an incredibly versatile analytical technique with applications across research, including structural biology, clinical diagnostics, environmental analysis, forensics, food and beverage analysis, omics and beyond.
MS produces a large amount of data that needs to be analyzed. Managing, processing and interpreting this voluminous data is computationally intensive and often prone to errors, particularly when manual or semi-automated processes are used. Therefore, artificial intelligence (AI) and machine learning (ML) have become extremely popular for MS-generated data processing and statistical analysis, as they can be applied to various biological disciplines,2 limit errors and improve data analysis.
This article examines what MS data analysis involves and the challenges associated with it, how AI/ML can facilitate analyses and where future developments in the field may lead, with a specific focus on proteomics and metabolomics research.
What does MS data analysis involve?
“Data analysis in proteomics and metabolomics is a complex, multi-step process that begins with the collection of biological samples and culminates in the extraction of meaningful biological information,” said Dr. Wout Bittremieux, assistant professor in the Adrem Data Lab at the University of Antwerp.
After laborious sample preparation to extract the proteins, peptides or metabolites of interest,3 the analytes are ionized and introduced into a mass spectrometer, where they are detected based on their mass-to-charge ratio (m/z), producing a mass spectrum. Coupling MS with other analytical tools, such as gas chromatography and liquid chromatography, allows for further separation and identification of these analytes.
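To make the mass-to-charge ratio concrete, the following minimal sketch computes the m/z of a protonated peptide from standard monoisotopic residue masses. The function names are illustrative, and the example ignores modifications and isotopes, so it should be read as a simplified illustration rather than a production calculation.

```python
# Monoisotopic residue masses in daltons (standard values for the 20 common amino acids).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # one water is added for the free N- and C-termini
PROTON = 1.007276   # mass of each charge-carrying proton


def peptide_mass(sequence: str) -> float:
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER


def peptide_mz(sequence: str, charge: int) -> float:
    """m/z of the peptide carrying `charge` protons."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return (peptide_mass(sequence) + charge * PROTON) / charge


if __name__ == "__main__":
    # The doubly charged ion of PEPTIDE appears near m/z 400.687.
    print(round(peptide_mz("PEPTIDE", 2), 3))
```

The same peptide therefore appears at different m/z values depending on how many protons it carries, which is one reason charge-state assignment matters during data processing.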
“One of the main challenges in MS is the accurate annotation of MS spectra to corresponding molecules,” said Dr. Bittremieux.
“In proteomics, the dominant method for this task is sequence database searching. This relies on comparing experimental spectra with theoretical spectra simulated from the peptides predicted to be present. However, these theoretical spectra are often overly simplistic and do not capture detailed information about fragment ion intensity, which can lead to significant ambiguities and false identifications.”
Once the data have been quantified, either relatively or absolutely, statistical analyses can take place to facilitate biological interpretation.
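The comparison described here can be sketched in a few lines. The example below generates singly charged b- and y-ion m/z values for a candidate peptide and counts how many are matched by observed peaks within a tolerance. It is a deliberately simplified illustration: the function names and the 0.02 Da tolerance are assumptions, the mass table covers only the residues used in the example, and real search engines use considerably more elaborate scoring. Note that every theoretical peak is treated as equally likely, which is exactly the missing-intensity limitation raised in the quote.

```python
# Monoisotopic residue masses in daltons (subset needed for this example only).
RESIDUE_MASS = {"P": 97.05276, "E": 129.04259, "T": 101.04768,
                "I": 113.08406, "D": 115.02694}
WATER = 18.010565
PROTON = 1.007276


def theoretical_fragments(sequence: str) -> list[float]:
    """Singly charged b- and y-ion m/z values, with no intensity information."""
    masses = [RESIDUE_MASS[aa] for aa in sequence]
    fragments = []
    for i in range(1, len(masses)):
        fragments.append(sum(masses[:i]) + PROTON)           # b-ion (N-terminal part)
        fragments.append(sum(masses[i:]) + WATER + PROTON)   # y-ion (C-terminal part)
    return sorted(fragments)


def shared_peak_count(observed_mz: list[float], sequence: str, tol: float = 0.02) -> int:
    """Number of theoretical fragments matched by at least one observed peak."""
    return sum(
        any(abs(obs - theo) <= tol for obs in observed_mz)
        for theo in theoretical_fragments(sequence)
    )


if __name__ == "__main__":
    # Simulated "observed" spectrum: the ideal fragments of PEPTIDE.
    observed = theoretical_fragments("PEPTIDE")
    for candidate in ("PEPTIDE", "PETPIDE"):
        print(candidate, shared_peak_count(observed, candidate))
```

In this toy case the incorrect candidate, which merely swaps two residues, still matches most peaks, which hints at why peak positions alone can leave identifications ambiguous.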
“To contextualize results, pathway analysis tools can be used to map identified proteins or metabolites onto known biological pathways to help understand the functional implications of changes observed in the data. Alternatively, biomarker candidates can be identified based on their ability to distinguish different biological conditions or groups,” explained Bittremieux.
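As a rough illustration of how biomarker candidates might be shortlisted, the sketch below ranks proteins by a log2 fold change and Welch's t-statistic between two groups. The function names, the fold-change threshold and the toy abundance values are all hypothetical, and a real analysis would add p-values and multiple-testing correction, which are omitted here.

```python
import math
from statistics import mean, variance


def welch_t(a: list[float], b: list[float]) -> float:
    """Welch's t-statistic for two samples with possibly unequal variances."""
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    if standard_error == 0:
        raise ValueError("both groups have zero variance")
    return (mean(a) - mean(b)) / standard_error


def rank_candidates(abundances: dict, min_abs_log2fc: float = 1.0) -> list[tuple]:
    """Rank analytes whose log2 fold change passes the threshold.

    `abundances` maps a name to (control_values, case_values), both given
    as log2-transformed intensities.
    """
    rows = []
    for name, (control, case) in abundances.items():
        log2fc = mean(case) - mean(control)   # a difference of logs is a log ratio
        if abs(log2fc) >= min_abs_log2fc:
            rows.append((name, log2fc, welch_t(case, control)))
    return sorted(rows, key=lambda row: abs(row[2]), reverse=True)


if __name__ == "__main__":
    # Made-up log2 intensities for two hypothetical proteins.
    toy = {
        "protein_A": ([10.0, 10.2, 9.8], [12.1, 11.9, 12.0]),
        "protein_B": ([8.0, 8.1, 7.9], [8.05, 8.0, 7.95]),
    }
    for name, log2fc, t_stat in rank_candidates(toy):
        print(name, round(log2fc, 2), round(t_stat, 1))
```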
Applying AI/ML to MS data analysis
Although some scientists still have concerns regarding the large-scale implementation of AI, AI and ML have become indispensable tools for MS data analysis, helping to inform clinical decisions, guide metabolic engineering and drive fundamental biological discoveries.
AI/ML is applied in MS research to minimize the errors associated with data analysis, including high noise levels, batch effects in measurements and missing values,4 and to improve usability and maximize data output.5 Additionally, training ML models on large datasets of empirical MS spectra makes it possible to generate highly accurate predicted spectra that closely match experimental data.6 This overcomes a limitation of traditional sequence database searching, which relies on simplistic theoretical spectra.
Developments in AI/ML have led to more accurate, efficient and comprehensive interpretations of biological data, including de novo peptide sequencing.7
“De novo peptide sequencing, which involves determining the peptide sequence directly from tandem mass (MS/MS) spectra without relying on a reference database, is a difficult problem. ML approaches are beginning to have a significant impact in this field by learning models from known spectra and using them to predict peptide sequences from unknown spectra, making it possible to analyze complex proteomes without relying only on existing protein databases,” said Dr. Bittremieux.
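How closely a predicted spectrum matches an experimental one has to be quantified somehow. The article does not say which measure is used, so the sketch below assumes one common choice, the normalized spectral angle, computed over fragment intensities that have already been aligned peak by peak. The function name and the example intensities are illustrative.

```python
import math


def normalized_spectral_angle(predicted: list[float], observed: list[float]) -> float:
    """Similarity in [0, 1] for non-negative intensities; 1 means identical shape.

    Both lists must hold intensities for the same fragment ions in the same order.
    """
    if len(predicted) != len(observed):
        raise ValueError("intensity vectors must be aligned and equally long")
    dot = sum(p * o for p, o in zip(predicted, observed))
    norm = math.sqrt(sum(p * p for p in predicted)) * math.sqrt(sum(o * o for o in observed))
    if norm == 0:
        return 0.0   # an empty spectrum matches nothing
    cosine = max(-1.0, min(1.0, dot / norm))   # guard against rounding error
    return 1.0 - 2.0 * math.acos(cosine) / math.pi


if __name__ == "__main__":
    observed = [0.9, 0.1, 0.5, 0.0]
    flat_theoretical = [1.0, 1.0, 1.0, 1.0]   # every fragment assumed equally intense
    learned_prediction = [0.8, 0.2, 0.5, 0.1]  # hypothetical model output
    print(normalized_spectral_angle(flat_theoretical, observed))
    print(normalized_spectral_angle(learned_prediction, observed))
```

A flat theoretical spectrum scores lower against the observed intensities than a prediction with realistic peak heights, which is the intuition behind using learned intensity predictions to rescore candidate matches.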
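The idea behind reading a sequence directly from a spectrum can be illustrated with the classical mass-ladder approach: the m/z difference between consecutive fragment ions of the same series corresponds to one amino acid residue. The sketch below is not the ML method discussed in the article. It is a simplified illustration that assumes a clean series of singly charged b-ions, with a hypothetical function name and tolerance, and it cannot distinguish leucine from isoleucine because they have identical masses.

```python
# Monoisotopic residue masses in daltons; "L" stands for both leucine and isoleucine.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "N": 114.04293, "D": 115.02694,
    "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276


def read_ladder(b_ions: list[float], tol: float = 0.01) -> str:
    """Translate a series of singly charged b-ion m/z values into residues.

    Gaps that match no single residue (for example a missing peak) become "?".
    """
    peaks = [PROTON] + sorted(b_ions)   # the ladder starts from a bare proton
    residues = []
    for lower, upper in zip(peaks, peaks[1:]):
        delta = upper - lower
        matches = [aa for aa, mass in RESIDUE_MASS.items() if abs(delta - mass) <= tol]
        residues.append(matches[0] if matches else "?")
    return "".join(residues)


if __name__ == "__main__":
    # Build an ideal b-ion ladder for the prefix PEPTLD, then read it back.
    ladder, total = [], PROTON
    for aa in "PEPTLD":
        total += RESIDUE_MASS[aa]
        ladder.append(total)
    print(read_ladder(ladder))
    # Dropping one peak leaves a gap that no single residue explains.
    print(read_ladder(ladder[:2] + ladder[3:]))
```

Real spectra contain missing peaks, noise and mixed ion series, so this simple lookup breaks down quickly. Handling that messiness is where models learned from large collections of known spectra come in.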
Another area where AI/ML has been applied in MS data analysis is repository-wide data analysis.8 Public data repositories have continued to grow, now containing millions or even billions of MS spectra. Although existing data provide many opportunities to extract new biological information, the sheer volume of data presents significant challenges in terms of data processing and analysis.
“We have developed AI algorithms that can perform large-scale analyses in these repositories, identify patterns across experiments, and detect new peptides and proteins that were previously missed. This has led to discoveries that would have been impossible with manual or traditional computational methods.”
Recent developments in AI/ML
Although advances in AI have been fruitful, the unique nature of MS data makes translating these developments directly to MS non-trivial.
“One of the most significant recent advances in AI in MS data analysis is the development of more sophisticated deep learning models capable of handling high-dimensional data and extracting complex patterns,” said Bittremieux.
“For example, transformer neural networks, originally developed for natural language processing, are now used effectively to ‘translate’ sequences of peaks in tandem MS spectra into amino acid sequences during de novo peptide sequencing. These models can learn from large amounts of empirical MS data, identifying subtle features that traditional methods might overlook.
“Despite these advances, successfully applying AI to MS data still requires deep expertise in both AI and MS. This multidisciplinary skill set remains relatively rare, which has slowed the broader adoption of AI in this field. However, as more researchers receive training in both fields and AI tools become more accessible, we are starting to see a new generation of scientists capable of bridging this gap.”
Looking to the future
Although significant advances in AI and ML have contributed to the continued development of MS data analysis, there is still room for improvement.
“One of the key areas that I believe future developments should focus on is the generation and curation of large-scale, high-quality datasets. Although advances in AI model architectures have been impressive, the quality of these models depends on the data they are trained on,” explained Bittremieux.
Greater availability of diverse MS datasets would ultimately enable the development of AI tools suitable for use in multiple experimental conditions on different biological subjects.9
“These datasets should include comprehensive annotations, such as accurate peptide and metabolite identifications, quantification data, and metadata related to sample preparation and instrument settings. This diversity will allow AI models to learn more generalizable patterns, thereby improving their performance in different applications.”
Researchers sometimes make the mistake of testing their models on cherry-picked datasets. This contributes to the lack of standardized evaluations of how different models perform. Dr. Bittremieux clarified that “the development of benchmarking suites would enable a fair comparison of different algorithms, promoting transparency and leading to real progress in the field.”
“As AI tools become more accessible and interpretable, we will likely see an increase in innovative applications, from personalized medicine to environmental monitoring.”
References
1. Wilkinson DJ. Historical and contemporary approaches to stable isotope tracers for studying mammalian protein metabolism. Mass Spectrom Rev. 2018;37(1):57-80. doi:10.1002/mas.21507
2. Neagu AN, Jayathirtha M, Baxter E, Donnelly M, Petre BA, Darie CC. Applications of tandem mass spectrometry (MS/MS) in protein analysis for biomedical research. Molecules. 2022;27(8):2411. doi:10.3390/molecules27082411
3. Luque-Garcia JL, Neubert TA. Sample preparation for serum/plasma profiling and biomarker identification by mass spectrometry. J Chromatogr A. 2007;1153(1):259-276. doi:10.1016/j.chroma.2006.11.054
4. Liebal UW, Phan ANT, Sudhakar M, Raman K, Blank LM. Machine learning applications for mass spectrometry-based metabolomics. Metabolites. 2020;10(6):243. doi:10.3390/metabo10060243
5. Beck AG, Muhoberac M, Randolph CE, et al. Recent developments in machine learning for mass spectrometry. ACS Meas Sci Au. 2024;4(3):233-246. doi:10.1021/acsmeasuresciau.3c00060
6. Adams C, Gabriel W, Laukens K, et al. Fragment ion intensity prediction improves the identification rate of non-tryptic peptides in timsTOF. Nat Commun. 2024;15(1):3956. doi:10.1038/s41467-024-48322-0
7. Yilmaz M, Fondrie WE, Bittremieux W, et al. Sequence-to-sequence translation from mass spectra to peptides with a transformer model. Nat Commun. 2024;15(1):6427. doi:10.1038/s41467-024-49731-x
8. Bittremieux W, May DH, Bilmes J, Noble WS. A learned embedding for efficient joint analysis of millions of mass spectra. Nat Methods. 2022;19(6):675-678. doi:10.1038/s41592-022-01496-1
9. Dens C, Adams C, Laukens K, Bittremieux W. Machine learning strategies to address data challenges in mass spectrometry-based proteomics. J Am Soc Mass Spectrom. 2024;35(9):2143-2155. doi:10.1021/jasms.4c00180