Computational Approaches to Support Identification of Chemicals in the Environment

Co-authored by Andrew D. McEachran and Antony J. Williams

The number of chemicals detected in the environment continues to increase. These range from expected pollutants such as pesticides and pharmaceuticals (for example, opioids and cannabinoids) to metabolites and degradants. The rapid identification of small molecules in environmental monitoring studies generally relies on high-resolution mass spectrometry (HRMS) and non-targeted analysis (NTA) techniques. NTA generally combines the acquisition of HRMS spectral signatures for hundreds to thousands of chemicals with informatics approaches that search databases of “known” chemicals.


Freely available public online databases can contain tens of millions of chemicals (for example, the PubChem and ChemSpider databases contain 96 million and 74 million substances, respectively, as of August 2019). While these large databases are useful for broad chemical searching, more focused databases are better suited to identifying chemicals in the environment. At the US-EPA we have been building a more focused data collection, the DSSTox database, to support our computational toxicology research for almost 20 years, and it now contains over 875,000 substances (as of August 2019). The “CompTox Chemicals Dashboard” is a freely available web interface to the data contained in DSSTox and has specific functionality that can support our mass spectrometry analyses and the identification of “known unknowns”.


When attempting to identify an unknown chemical in an environmental sample, most search techniques use either a generated molecular formula or an observed molecular mass to find potential candidate chemicals for that unknown. In many cases tens to hundreds of chemicals in the database can match a molecular formula or mass. For example, the chemical formula for Bisphenol A (BPA, which many of us will know from the emphasis on “BPA-free” in commerce) corresponds to over 200 chemicals out of the collection of 875k substances. The challenge is to identify which of these chemicals is the most likely “candidate”. One approach that has proven to be of value to date is “metadata ranking”, which uses available data, such as the number of consumer products containing the chemical or the number of scientific articles in PubMed mentioning the chemical, to prioritize the candidates.
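
The mass search and metadata ranking described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Candidate` record, the 5 ppm mass tolerance, and the scoring rule (a plain sum of product and PubMed counts) are hypothetical and are not the Dashboard's actual data model or ranking scheme. The candidate names and counts in the usage lines are invented toy values; only the monoisotopic mass of the BPA formula C15H16O2 (about 228.1150 Da) is a real quantity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for one database substance."""
    name: str
    monoisotopic_mass: float  # in Da
    product_count: int        # consumer products containing the chemical
    pubmed_count: int         # PubMed articles mentioning the chemical


def within_ppm(observed: float, theoretical: float, tol_ppm: float) -> bool:
    """True if the observed mass is within tol_ppm of the theoretical mass."""
    return abs(observed - theoretical) / theoretical * 1e6 <= tol_ppm


def rank_candidates(observed_mass, candidates, tol_ppm=5.0):
    """Keep candidates matching the mass, then sort by a simple metadata score.

    The score (sum of the two counts) is an assumed, illustrative choice.
    """
    hits = [c for c in candidates
            if within_ppm(observed_mass, c.monoisotopic_mass, tol_ppm)]
    return sorted(hits,
                  key=lambda c: c.product_count + c.pubmed_count,
                  reverse=True)


# Toy data: three invented candidates, two of which share the BPA formula mass.
toy_db = [
    Candidate("candidate A", 228.11503, product_count=0, pubmed_count=3),
    Candidate("candidate B", 228.11503, product_count=10, pubmed_count=200),
    Candidate("candidate C", 228.12630, product_count=50, pubmed_count=500),
]
ranked = rank_candidates(228.1152, toy_db)
print([c.name for c in ranked])
```

In this toy run candidate C is excluded because its mass differs from the observed value by roughly 49 ppm, and candidate B is ranked ahead of candidate A because it has more associated metadata.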


To further increase the confidence in an identification beyond metadata, researchers use spectral “fragmentation patterns” (how a chemical structure breaks apart in a high-energy collision) to match what was observed on an analytical instrument to what has previously been observed for that same structure. These data, when available, can boost the confidence in identifying chemicals, and a growing number of spectral databases are freely available online (for example, MassBank). Overall, however, fragmentation data are scarce, which limits generalized high-throughput application in routine identifications. The goal of our reported work was to fill this crucial gap by predicting and storing the fragmentation patterns for the entirety of the EPA’s DSSTox database, giving easy access to both the rich metadata and the fragmentation patterns for broad, high-throughput use to boost confidence in chemical identifications. We hope that individuals, research groups, and analytical chemistry vendors will find the data informative and of value.
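
To make the idea of matching fragmentation patterns concrete, the sketch below scores an observed spectrum against a stored or predicted one using cosine similarity, a commonly used spectral comparison measure. It is an illustrative assumption, not the scoring method used in our reported work: the function name, the greedy peak matching, and the 0.01 Da m/z tolerance are all hypothetical choices, and the peak lists are invented toy values.

```python
import math


def cosine_score(observed, predicted, mz_tol=0.01):
    """Cosine similarity between two spectra given as lists of (mz, intensity).

    Each observed peak is greedily paired with the closest unused predicted
    peak within mz_tol (an assumed tolerance, in Da). Unpaired peaks add
    nothing to the dot product but still count in the norms, which lowers
    the score.
    """
    used = set()
    dot = 0.0
    for mz_o, int_o in observed:
        best = None  # (mz difference, index, intensity) of the best match
        for j, (mz_p, int_p) in enumerate(predicted):
            if j in used:
                continue
            diff = abs(mz_o - mz_p)
            if diff <= mz_tol and (best is None or diff < best[0]):
                best = (diff, j, int_p)
        if best is not None:
            used.add(best[1])
            dot += int_o * best[2]

    norm_o = math.sqrt(sum(i * i for _, i in observed))
    norm_p = math.sqrt(sum(i * i for _, i in predicted))
    if norm_o == 0.0 or norm_p == 0.0:
        return 0.0
    return dot / (norm_o * norm_p)


# Toy spectra (invented values): the second shares two of three peaks.
observed_peaks = [(91.05, 40.0), (133.06, 100.0), (213.09, 60.0)]
predicted_peaks = [(91.05, 35.0), (133.06, 100.0), (197.10, 20.0)]
print(round(cosine_score(observed_peaks, predicted_peaks), 3))
```

A score near 1 indicates that the two spectra share most of their intense peaks, while a score near 0 indicates little overlap. In practice such a score would be combined with the mass match and metadata ranking rather than used alone.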


Disclaimer: The views expressed in this paper are those of the authors and do not necessarily reflect the views or policies of the U.S. Environmental Protection Agency. Mention of trade names or commercial products does not constitute endorsement or recommendation for use.
