Software development and training offer opportunities to advance ‘Big Data’ discovery and scientific research: Lessons from the PhenoCam team

As ecological datasets continue to increase in size and availability, ‘big data’ concepts and applications are becoming increasingly relevant in the field of ecology. As part of this wave of open access to large-scale ecological data, the members of the PhenoCam (or ‘Phenological-Camera’) project have focused on developing open repositories, methods, and tools to process and analyze large-scale datasets, and on sharing this knowledge through training workshops and seminars.

Feb 13, 2019

The PhenoCam Project is a multi-institutional collaboration that currently collects and archives digital images, at least once a day, from more than 550 cameras across the globe. From these images, the greenness of the canopy can be extracted from the image pixels and modeled at seasonal time scales in a consistent and completely open fashion, providing scientists with valuable, near real-time information on vegetation phenology.
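As a rough illustration of what "extracting greenness from image pixels" can look like, here is a minimal Python sketch. The post does not name a specific index; this example assumes the green chromatic coordinate, GCC = G / (R + G + B), which is widely used in digital repeat photography. The function name and the pixel values are hypothetical, made up for illustration only.

```python
from typing import Iterable, Tuple

# One pixel as (red, green, blue) digital numbers, e.g. 0-255 for 8-bit images.
Pixel = Tuple[int, int, int]


def green_chromatic_coordinate(pixels: Iterable[Pixel]) -> float:
    """Return GCC = G / (R + G + B), using channel sums over a region of interest."""
    r_sum = g_sum = b_sum = 0
    for r, g, b in pixels:
        r_sum += r
        g_sum += g
        b_sum += b
    total = r_sum + g_sum + b_sum
    if total == 0:
        # A completely black region has no defined colour fractions.
        raise ValueError("region is entirely black; GCC is undefined")
    return g_sum / total


# Hypothetical pixel values for the same canopy region in two seasons.
summer_canopy = [(60, 120, 40), (70, 130, 50), (65, 125, 45)]
winter_canopy = [(90, 95, 85), (100, 100, 95), (95, 98, 90)]

summer_gcc = green_chromatic_coordinate(summer_canopy)
winter_gcc = green_chromatic_coordinate(winter_canopy)
print(f"summer GCC: {summer_gcc:.3f}, winter GCC: {winter_gcc:.3f}")
```

Because the index is a ratio of channels, it is less sensitive to overall scene brightness than the raw green value, which is one reason ratios like this are popular for comparing images taken under different lighting.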


PhenoCam was developed as a network, taking key lessons from AmeriFlux and FLUXNET, which have demonstrated that a distributed network approach is needed to study large-scale patterns and drivers of change; in other words, the network is more than the sum of its parts. We rely on PhenoCam PIs for support and collaboration, and in return we process, house, and distribute the data on a nightly basis for the benefit of their labs and the wider scientific community. We share the data openly because the sheer volume of data and its range of potential applications are more than any one lab group can tackle, and because others may find applications that we have not yet dreamed up.

Altogether, the archive holds more than 35 million images (≈ 16 TB), with another 500,000 images added each month. To put this in perspective, if every image in the PhenoCam archive were printed on glossy photo paper and stacked, the pile would rise to a height of 11,000 meters, taller than Mt. Everest, and weigh more than 2 adult blue whales! Much of the PhenoCam imagery and many of the processed data products were recently published in a curated and publicly available dataset.
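For readers who want to sanity-check these figures, the short Python sketch below derives what they imply per image. It uses only the numbers quoted above and assumes decimal units (1 TB = 10^12 bytes); the variable names are ours.

```python
# Figures quoted in the post.
IMAGE_COUNT = 35_000_000        # images in the archive
ARCHIVE_BYTES = 16 * 10**12     # ~16 TB, assuming decimal terabytes
STACK_HEIGHT_M = 11_000         # height of the hypothetical printed stack
MONTHLY_ADDITIONS = 500_000     # new images per month

# Paper thickness implied by the stack height, in millimetres per print.
sheet_thickness_mm = STACK_HEIGHT_M / IMAGE_COUNT * 1000

# Average file size implied by the archive size, in kilobytes per image.
mean_image_kb = ARCHIVE_BYTES / IMAGE_COUNT / 1000

# Approximate archive growth per month at that average size, in gigabytes.
monthly_growth_gb = MONTHLY_ADDITIONS * mean_image_kb / 10**6

print(f"implied sheet thickness: {sheet_thickness_mm:.2f} mm")
print(f"implied mean image size: {mean_image_kb:.0f} kB")
print(f"implied monthly growth:  {monthly_growth_gb:.0f} GB")
```

The implied thickness of roughly 0.31 mm per print is plausible for photo paper, and the implied average of roughly 460 kB per image suggests growth on the order of a couple of hundred gigabytes per month.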

The number of images in the PhenoCam archive now exceeds 35 million, with additions on a daily basis. If these images were printed and stacked on top of each other, the pile would be 30 times higher than the Empire State Building, and higher than Mt. Everest.

With such a wealth of data, ‘drinking from the fire hose’ can feel overwhelming to new users. Our team has therefore recently focused on developing tools, applications, and interfaces that let scientists of all stripes interact with, extract, and integrate PhenoCam data. These tools build on the foundation of open data while also providing software and interfaces general enough for scientists to process their own digital repeat photography collections. Specifically, the PhenoCam team has developed applications for identifying “hazy” images (hazer), extracting time-series data from stacks of digital images (xROI), and modelling color information from a stack of digital images as a time series (phenor). Using this collection of tools, massive imagery datasets can be efficiently processed and analyzed to extract key ecosystem patterns and changes. Our team is also building an interface (PhenoSynth) that pulls a number of open phenology datasets, including PhenoCam, MODIS, and USA National Phenology Network data, into the same application to allow for comparison across scales. PhenoSynth builds upon open tools and data from our lab, a key example of how open data and open science build upon each other.
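To give a feel for the step from a stack of images to a time series, here is a minimal Python sketch. It is not the code or API of hazer, xROI, or phenor. It assumes each image has already been reduced to a single greenness value, and it summarises those values with an upper percentile over short, non-overlapping windows so that a few dark or hazy images pull the series down less. The window length (3 days), the percentile (90th), the function names, and the data are all assumptions chosen for illustration.

```python
from collections import defaultdict
from datetime import date, timedelta
from typing import Dict, List, Sequence, Tuple


def percentile(values: Sequence[float], q: float) -> float:
    """Percentile with linear interpolation between ranks; q is in [0, 100]."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def windowed_series(
    observations: Sequence[Tuple[date, float]],
    window_days: int = 3,
    q: float = 90.0,
) -> List[Tuple[date, float]]:
    """Summarise per-image values into one value per non-overlapping window.

    Returns (window start date, q-th percentile of the values in that window).
    """
    if not observations:
        return []
    origin = min(day for day, _ in observations)
    bins: Dict[int, List[float]] = defaultdict(list)
    for day, value in observations:
        bins[(day - origin).days // window_days].append(value)
    return [
        (origin + timedelta(days=index * window_days), percentile(vals, q))
        for index, vals in sorted(bins.items())
    ]


# Hypothetical per-image greenness values; 0.30 stands in for a hazy image.
observations = [
    (date(2018, 5, 1), 0.34), (date(2018, 5, 1), 0.36),
    (date(2018, 5, 2), 0.35), (date(2018, 5, 3), 0.30),
    (date(2018, 5, 4), 0.40), (date(2018, 5, 4), 0.42),
    (date(2018, 5, 6), 0.41),
]

series = windowed_series(observations)
for start, value in series:
    print(start.isoformat(), round(value, 3))
```

A percentile is used here rather than a mean because poor viewing conditions tend to lower apparent greenness, so the upper end of the distribution within a window is a reasonable guess at the clear-sky value.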


Beyond ‘big data’ collections and processing tools, teaching opportunities such as training workshops at international conferences offer a bridge to engage with the broader scientific community and remove skill barriers to open data. At these conferences, we are able to share packages and interfaces under development and help others advance their own research. Recently, members of the PhenoCam team and the National Ecological Observatory Network (NEON) provided a training workshop at the 2018 AGU Fall Meeting in Washington, D.C. In our experience, the benefits of such workshops go both ways. We get to ‘spread the word’ about what is open and available to the entire community (and thus hopefully prevent the re-invention of the wheel), while helping those who want to use these tools overcome any initial roadblocks. In return, we build important collaborations and receive feedback on our applications in development, as well as suggestions for additions that may benefit the scientific community at large.

The PhenoCam-NEON training workshop at the AGU 2018 Fall Meeting (Washington, D.C.), with ~40 attendees participating. Image courtesy of Megan Jones (NEON).

The development and dissemination of software tools rely heavily upon a solid foundation of open data: both unrestricted access and high-quality curation with metadata. Overall, the proliferation and availability of larger volumes of open data offer increasing opportunities to advance scientific research in the community. This advancement, driven by the creation of new tools designed for big data, is further facilitated by offering training and teaching opportunities to the broader community. By demonstrating how to use open-source packages and interfaces, we hope to motivate the community to share their own tools in development and to contribute to ours through the power of open science.


Co-authors: Bijan Seyednasrollah (NAU) and Adam M. Young (NAU)

PhenoCam Team: Koen Hufkens, Megan Jones (NEON), Andrew D. Richardson (PI), Thomas Milliman (UNH)


Katharyn Duffy

Postdoctoral Scientist, Northern Arizona University

Open-source software engineer for the PhenoCam (Phenological Camera) project. I work on collating data and building interfaces for scientists to evaluate, process and download phenological data across multiple sources and scales.
