Jan 23

Open Data Management at the U.S. Environmental Protection Agency (EPA)

Agency Data, EPA, Open Data, Project Management, Standards

Last night’s Open Data Leaders meetup in Washington, D.C., titled “Open Data Discussion with the Environmental Protection Agency (EPA),” was an opportunity to peer inside the EPA’s impressive efforts to open up its data assets for public access and commercial exploitation. Andrew Yuen, Dave Smith, Ethan McMahon, and others from the EPA described ongoing efforts and the challenges associated with them.

The meeting provided an impressive overview of the technical and governance complexities involved when a sophisticated, data-intensive government program decides to “go open.” The EPA staff who spoke outlined efforts including cooperation with Data.gov, data and metadata standardization, work with other agencies on special data-oriented projects, engagement with developer communities, and the need to align data management and access efforts with EPA’s legislative priorities and programs.

A question asked by one audience member was, “How do you explain your progress and success?” The answer was straightforward:

Good people.
Leadership that recognized early on the need for good data management and data stewardship practices.
The common practice of developing first for internal audiences before “going public” with a data service or application.
Reliance on geospatial data as a foundation for data standardization and reporting.

That last point shouldn’t be overlooked. Geocoded data plays a key role in many open data efforts at all levels of government. People need to slice, dice, and visualize data in light of where they live. This is why, in my experience, people familiar with geocoding are often tapped by management to run open data efforts.

The EPA team discussed challenges associated with efforts to “crowdsource” data via mobile apps that enable citizen-collected data to be “mashed up” with “official” data. Where does one store such data if legislation does not authorize (or provide budgets for) such programs and the service and support they require? How does one navigate a drawn-out internal regulatory process that equates time devoted to citizen engagement with “burden”?

Much of what the EPA staff talked about involved processes and activities that are necessarily associated not only with “open data” but with any data-intensive business process. Data must be managed. Systems that share data need to be coordinated. Resources need to be allocated and shared. Such requirements are not unique to “open data” but are universally relevant.

That last point is one of the most important messages that needs to be communicated about open data programs where the goal is to make data assets available for public access and commercial exploitation.

For open data programs to be managed effectively, they can’t just rely on “bolted on” technologies that simply extract and publish data as-is from existing systems and data stores. For open data programs to be effective, the processes, technologies, and data need to be managed in a unified fashion with existing programs and services. This is done partly to reduce the need for potentially costly duplication of source data, and partly to make it easier to align data management practices with organizational goals and objectives.

Another takeaway from the presentations is that making a clear distinction between “internal” and “external” users in such a complex environment is a challenge. I became aware of this when consulting with the EPA several years ago on how it manages its research processes. EPA’s research and data-related operations involve a multitude of public, private, academic, international, and quasi-governmental institutions. Many different disciplines and professions are involved. Both automated and manual data collection techniques are employed. Analysis and data management resources range from spreadsheets to supercomputers.

Trying to impose anything approaching a “unified” data management lifecycle approach or a data management maturity model across such a diversified data landscape is clearly a challenge. Making a clear distinction for operational or fiscal purposes between public and private data ownership rights and responsibilities is also a challenge, as discussed in “Challenges of Public-Private Interfaces in Open Data and Big Data Partnerships.”

The best approach is to be clear about the goals and objectives we are serving with our open data programs and then to collaboratively build the systems and processes needed to serve those goals and objectives.

This must be done with the understanding that these goals or objectives may change, and that this in turn may require changes to the data management infrastructures built to serve them. (This is something open data management professionals in the U.S. government must be thinking about as we approach the 2016 presidential election.)