Creating an open source daily PM2.5 concentration data set for the western US (2008-2018)


Fine particulate matter air pollution, often referred to as PM2.5 (2.5 microns in aerodynamic diameter or smaller), is increasingly shown to be associated with numerous adverse health outcomes including, but not limited to, mortality, respiratory and cardiovascular morbidity, negative birth outcomes, and lung cancer. Although PM2.5 concentrations have been declining in many parts of the United States due to policies to limit emissions of air pollutants, PM2.5 levels have been increasing in parts of the western US. This increase has been shown to be associated with wildfire smoke. Further research into the health impacts of wildfires is urgently needed as wildfires become more common and more intense in many parts of the world. 

One of the factors restricting attempts to study the health impacts of air pollution from wildfires (and air pollution in general) is the sparseness of air pollution data. Especially in the western US, there are many places and times where there is no regulatory monitoring data for air pollution. Many monitors also don’t measure PM2.5 each day, leaving temporal gaps as well as spatial gaps in the data.

To address this problem, we sought to estimate the concentrations of PM2.5 at locations and on days where we don’t have monitoring data. The most common approach for this kind of estimation is land-use regression, in which a model relates various environmental input variables (such as the proximity to high-traffic roads or industrial land uses) to PM2.5 concentrations. Once the model does a good job of predicting PM2.5 values that are known, it is used to predict PM2.5 concentrations where there is no monitoring data, thereby providing a “best guess” at what the concentrations there would be.
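
To make the idea concrete, here is a minimal sketch of land-use regression using ordinary least squares in plain Python. It is an illustration only, not the model from our paper: the two predictors, the site values, and the function names are all hypothetical, and the PM2.5 values are constructed from a known linear relationship so the fit is easy to check.

```python
# Minimal land-use regression sketch: ordinary least squares on made-up data.
# Predictor names, site values, and helper names are hypothetical.

def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so row operations act on both sides at once.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def fit_ols(features, targets):
    """Return [intercept, coef_1, ...] minimizing squared error."""
    rows = [[1.0] + list(f) for f in features]
    k = len(rows[0])
    # Normal equations: (X^T X) beta = X^T y.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(coefs, feature_row):
    """Apply the fitted coefficients to one location's predictors."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], feature_row))


# Monitored sites: (km to nearest major road, fraction of industrial land use).
monitored_features = [(0.2, 0.30), (1.5, 0.05), (0.8, 0.20), (3.0, 0.00), (0.5, 0.10)]
# Invented PM2.5 values (ug/m3), generated from 10 - 2*road_km + 20*industrial.
monitored_pm25 = [15.6, 8.0, 12.4, 4.0, 11.0]

coefs = fit_ols(monitored_features, monitored_pm25)

# "Best guess" at a location that has predictors but no monitor.
estimate = predict(coefs, (1.0, 0.15))
```

Real monitoring data are noisy, so the fit would never be exact, and a real model would be evaluated on monitors held out from the fitting step before being trusted anywhere else.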

In recent years, many researchers (including our group) have improved upon land-use regression by using machine learning rather than linear regression to allow for complex mathematical relationships between the predictor variables and the PM2.5 concentrations. In this project, we wanted to create a publicly available, consistent dataset of daily PM2.5 estimates across many years (2008-2018) that would focus on the western US and target high concentrations resulting from wildfires. We hope that many researchers will use these data to better evaluate the health impacts of PM2.5, both from wildfires and in general.

In late fall of 2017, Colleen Reid, environmental health scientist and professor in the department of geography, began building her research team in Earth Lab. Based at the University of Colorado Boulder, Earth Lab is a research organization dedicated to advancing earth data science. In addition to generating scientific insights from the modern “data deluge”, Earth Lab emphasizes creation of open source tools and training of the next generation of earth data scientists. Through Earth Lab, Dr. Reid was able to recruit a postdoctoral atmospheric scientist, a master’s student in geospatial computing, and an undergraduate student in applied math to help on the project.

Obtaining all of the environmental data necessary to estimate daily PM2.5 over 11 years across 11 states took a lot of time and effort. The data sets (all publicly available) included PM2.5 monitoring data, meteorological data (e.g., temperature, relative humidity, and wind speed), remotely sensed aerosol optical depth (AOD), active fire locations, vegetation index, and land-use data. Looking to further improve the model, we obtained chemical transport model (CMAQ) output from the EPA in December 2019. Unfortunately, these data were only available through 2016, so we decided to develop two models, one with CMAQ (2008-2016) and one without (2008-2018).

Once the input data were processed, we began experimenting with machine learning algorithms. We started with a random forest model and investigated combining that with geostatistical techniques such as kriging. Ultimately, we decided to account for spatiotemporal variation by including many levels of spatial and temporal covariates in the machine learning model and settled on an ensemble of two machine learning algorithms (random forest and gradient boosting trees). 
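
As an illustration of what combining two learners can look like, the sketch below blends two sets of held-out predictions with a single weight chosen by least squares. This is an assumption for illustration, not a description of our pipeline: this post does not say how the random forest and gradient boosting predictions were combined, and the function names and numbers here are hypothetical.

```python
# Blend two models' predictions with one weight chosen on held-out data.
# The blending rule, names, and numbers are illustrative assumptions.

def blend_weight(pred_a, pred_b, observed):
    """Weight w minimizing sum((y - (w*a + (1-w)*b))**2), clipped to [0, 1]."""
    # Rewrite the residual as (y - b) - w*(a - b); w is then a 1-D regression slope.
    num = sum((y - b) * (a - b) for a, b, y in zip(pred_a, pred_b, observed))
    den = sum((a - b) ** 2 for a, b in zip(pred_a, pred_b))
    if den == 0:
        return 0.5  # the two models agree everywhere, so any weight is equivalent
    return min(1.0, max(0.0, num / den))


def blend(pred_a, pred_b, w):
    """Weighted average of the two prediction lists."""
    return [w * a + (1 - w) * b for a, b in zip(pred_a, pred_b)]


# Hypothetical held-out predictions (ug/m3) from the two learners,
# and the monitor observations for the same site-days.
rf_pred = [8.0, 12.0, 40.0, 6.0]
gb_pred = [10.0, 10.0, 50.0, 5.0]
observed = [9.0, 11.0, 45.0, 5.5]

w = blend_weight(rf_pred, gb_pred, observed)
ensemble_pred = blend(rf_pred, gb_pred, w)
```

Choosing the weight on data that neither learner saw during training keeps the blend from simply rewarding whichever model overfits more.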

We then had to create datasets that covered all the locations and days on which we aimed to predict: the centroids of all counties, ZIP codes, and census tracts in the western US. We chose to predict at these geographies to facilitate fine-resolution health analyses. Merging the prediction data set together was a challenge because of its size (~90 million observations of many variables). 
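
The shape of that prediction data set is every location paired with every day, with covariates attached to each pair. The sketch below shows one way to build such rows lazily so the full table never has to sit in memory at once. The location IDs, coordinates, covariate names, and function names are hypothetical, and this is not the code we used.

```python
from datetime import date, timedelta
from itertools import product


def daily_dates(start, end):
    """Yield every date from start to end, inclusive."""
    for offset in range((end - start).days + 1):
        yield start + timedelta(days=offset)


def build_prediction_rows(centroids, dates, covariates):
    """Yield one row per (location, day), attaching any covariates found."""
    for (loc_id, (lat, lon)), day in product(centroids.items(), dates):
        row = {"id": loc_id, "lat": lat, "lon": lon, "date": day}
        # Pairs with no covariates keep only the base columns in this sketch.
        row.update(covariates.get((loc_id, day), {}))
        yield row


# Hypothetical centroids: id -> (latitude, longitude).
centroids = {
    "county_A": (40.01, -105.27),
    "zip_B": (39.74, -104.99),
}

# Hypothetical daily covariates keyed by (location id, date).
covariates = {
    ("county_A", date(2008, 1, 1)): {"temp_c": -2.0, "rel_humidity": 55.0},
    ("zip_B", date(2008, 1, 2)): {"temp_c": 1.5, "rel_humidity": 48.0},
}

# Three days here; the full study period would be 2008-01-01 to 2018-12-31.
rows = list(
    build_prediction_rows(
        centroids, daily_dates(date(2008, 1, 1), date(2008, 1, 3)), covariates
    )
)
```

Because the rows come from a generator, they can be written out or scored in batches instead of being collected into one list as done here for the small example.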

In summer 2020, we finished up the prediction data sets and generated the plots and tables for our paper. Here is a map of our predicted PM2.5 concentrations by county, averaged over the fall (September, October, November) of 2017. The fall of 2017 had some very large fires in California and the Pacific Northwest, which is evident in the higher PM2.5 concentrations in those areas. 

Time series plots of our predicted concentrations compared to observed concentrations at a few select locations also demonstrate the high temporal variability in PM2.5 within and across locations.

While we embarked on this project with the intention of using these PM2.5 estimates in health studies, our dataset may also be used for other applications including understanding how PM2.5 concentrations have changed over time or have been impacted by wildfires.
