Digitizing Earth: Developing a cyberinfrastructure for the geosciences
A glance into a geoscientist’s office will usually reveal fossils, rocks, old maps and field notebooks, pieces of cores or other samples crowding every available surface. Likewise, a peek behind the scenes at the collections of museums like the Smithsonian National Museum of Natural History in Washington, D.C., or the Natural History Museum in London, reveals innumerable earth science-related specimens. And, in dozens of geologic repositories around the country lie thousands of kilometers of rock, sediment, seafloor and ice cores.
That geoscientists are notorious hoarders should come as no surprise. After all, geoscientists collect and study nothing less than Earth itself. And they’ve been doing so since before James Hutton sketched the famous unconformity at Siccar Point, Scotland, in 1788. Centuries of data have since been recorded and filed away in notebooks, binders, boxes and filing cabinets and, more recently, on computer disks and tape drives — many of which cannot be accessed through the Internet.
Over the last four decades, massive amounts of digital data have begun streaming in from a growing number of satellites and sensors unceasingly monitoring the earth, atmosphere and oceans. Geoscientists are awash in data and, at the same time, have access to ever-increasing computing power. Together, these advances have precipitated fundamental changes in the way earth science is done, leading to the proliferation of computer-based data visualization and modeling — especially 3-D and 4-D modeling.
Ideally, all of these data should be preserved, even if their current value is unclear, because they might someday prove relevant to other research questions or reveal fresh facts through new analytical techniques. Previously unknown species are often discovered in museum archives, and old cores regularly provide new insights in seismic or geothermal research. But saving everything is impossible. Or is it?
The answer may be in the “cloud,” where data, and indirectly even physical samples, can be digitized, stored, integrated and shared. Of course, the idea of digitizing such a wealth of geological information is simpler than the reality, especially because everyone has their own methods and means of doing so, not to mention their own ideas about exactly what data should be digitized and shared.
Despite these challenges, a plethora of new initiatives and collaborations has cropped up to develop standards for the integration and sharing of digital geologic data, and they are starting to show results.
The Rise of Geoinformatics
In May 1992, shortly before the World Wide Web took off and well before GIS and GPS became well-known acronyms, Congress created the National Cooperative Geologic Mapping Program. The goal of the program was to produce a surface and bedrock geology map of the entire United States for resource development, environmental protection and the identification and mitigation of natural hazards. However, the program also gave rise to the first national digital database of geologic mapping data (the National Geologic Map Database) and a committee (the Digital Geologic Mapping Committee) that set out to develop nationwide standards for the collection and storage of digital geologic data. It was one of a number of parallel efforts that collectively marked the birth of a new field: geoinformatics.
In the decades since, there has been an exponential rise in the collection, aggregation and integration of vast amounts of data of all kinds, not just geologic, from multiple sources. It’s a trend called Big Data, which many fields have had to come to grips with and many are actively embracing, especially in the business world. According to IBM, 2.5 quintillion bytes of new data are gathered each day, including everything from weather and climate data to social media posts to credit card transactions. The pace is such that 90 percent of the data in the world has been created in the last two years alone. The main use of such data in the business world is to target customers, but in science, it could open up new areas of research and capitalize on methodologies such as data mining.
“In this era of data-driven science and data-driven enterprises — indeed the data-driven society — vast amounts and diverse types of data are being collected that can now be analyzed and mined for detecting patterns, used for building predictive models, and repurposed and integrated to gain new insights and solve complex problems that we have never been able to address before,” said Chaitanya Baru, director of the advanced cyberinfrastructure development group at the San Diego Supercomputer Center (SDSC), in a statement earlier this year.
Data integration could also increase productivity for geoscientists who today, in addition to mastering the scientific knowledge of their particular field, must also act, in part, as data managers.
“Early in my career, I realized I spent three-quarters of my time discovering where the data were that I needed, collecting those data from a variety of sources and then converting them to a common format so I could make a map or draw a [cross] section,” says Lee Allison, state geologist and director of the Arizona Geological Survey. Allison is now heading the largest of four coordinated projects that strive to build a national data system of geothermal data. “Our goal is to flip the current work structure so that instead of 75 percent of our effort going to discovering, accessing and transforming data and only 25 percent doing science, we reverse that.”
Last year, the White House launched a Big Data initiative in which federal science agencies, including the National Science Foundation (NSF), the U.S. Geological Survey (USGS) and the Department of Energy (DOE), have featured prominently. In February 2013, the White House Office of Science and Technology Policy issued a directive that data resulting from most federally funded research must be made publicly and freely accessible in open source, interoperable formats. In May 2013, the president issued an executive order requiring all federal agencies to make their data, except those concerning privacy and national security issues, freely available.
The policy, dubbed “Project Open Data,” is based on the rationale that data are a valuable national resource and making them available, discoverable and usable “not only strengthens our democracy and promotes efficiency and effectiveness in government, but also has the potential to create economic opportunity and improve citizens’ quality of life,” according to the project website. The White House cites as an example the release of federal weather and GPS data to the public, which fueled industries based on geospatial data, like digital mapping, that now generate tens of billions of dollars a year.
A Cyberinfrastructure for the Geosciences
One of the biggest challenges facing the geosciences today is how to integrate all of these data with software applications and models into a usable and useful form, accessible via the Web — in other words, a cyberinfrastructure for the geosciences.
One of the earliest attempts to develop the cyberinfrastructure necessary to support data sharing and integration in the earth science community was GEON (GEOscience Network), a project funded by the NSF’s Information Technology Research program in 2002, which explored the use of centralized databases.
The GEON project, on which Baru was a principal investigator, revealed many drawbacks to centralization, including the need for massive computer infrastructure at a central location and the staff to maintain it. Although the project has now ended, it succeeded in establishing partnerships and relationships out of which other initiatives grew, including the U.S. Geoscience Information Network (USGIN).
USGIN is a Web-based framework in which users maintain their data on their own servers in a distributed system, a model that has prevailed over the centralized database model. USGIN was born out of a 2007 strategy session between USGS and the Association of American State Geologists.
The session was held in advance of a geoinformatics summit meeting called by NSF “to get the geosciences community to develop a more systemic approach to cyberinfrastructure rather than the plethora of ‘one-off’ projects, that while successful in meeting their own goals, were not contributing to a pervasive infrastructure that could be used by everyone,” Allison says.
The idea was that initiating data-sharing standards among the federal and state geological surveys might spur adoption of data integration and sharing more widely across the geosciences, including academia and industry.
“Geological surveys, both at the state and federal level, have mandates to not only collect data, but to archive and disseminate” those data, Allison says.
Other initiatives that grew out of GEON include the EarthScope Data Portal and OpenTopography. The EarthScope Data Portal — a collaboration of SDSC, the Incorporated Research Institutions for Seismology, UNAVCO and the International Continental Scientific Drilling Program — integrates all seismic data related to the USArray and makes them accessible via a Web portal. OpenTopography, operated by SDSC and the School of Earth and Space Exploration at Arizona State University, provides community access to high-resolution topographic data acquired with lidar, along with related tools and resources.
National Geothermal Data System
One of the most successful programs to date is the National Geothermal Data System (NGDS), an effort funded by the DOE Geothermal Technologies Office to recover historical data relevant to geothermal exploration from state geological surveys in all 50 states and make them available online to foster understanding, discovery and development of geothermal energy resources.
NGDS, which also integrates data from geothermal research centers, USGS, and other DOE Geothermal Technologies Office-funded projects, is built on the USGIN data-integration framework.
Started in 2009 with nearly $5 million in funding from DOE, NGDS is specifically intended to help geothermal energy companies speed exploration and reduce development costs by getting faster and better online access to key data and tools for analysis and visualization. In 2010, federal stimulus funds allowed DOE to expand the program to include state geological surveys and other research organizations in digitizing vast quantities of offline “legacy” data. Some of the records gathered so far were collected from exploration surveys done as far back as the 1930s.
“At last count, we had about 17,000 documents and datasets online, representing more than 5.2 million records including more than 1.2 million wells — petroleum, water and geothermal — coming from every state,” says Allison, adding that he expects to have at least double and likely triple the number of current data records online by the end of the year.
The benefits of digitizing and integrating these data are evident. Estimates of the cost to replicate some of the historical data today reach into the millions of dollars. And the effort is already revealing some new findings.
“State surveys are digitizing vast amounts of data and reports that were largely in paper format tucked away in file drawers,” Allison says. “The geothermal experts and co-workers in the geological surveys are eagerly exploring their own data in ways not possible previously, and starting to make new insights and see new opportunities.” For example, in Utah, geologists compiling survey data found evidence of thermal activity underlying a large area of the Black Rock Desert. NGDS provided funds to drill additional geothermal gradient holes to confirm the discovery, which could lead to development of a new type of sedimentary geothermal resource, Allison says.
“As more data and services become available and as more companies discover the resources available, we expect to see the same kind of revolution in geothermal energy as we are seeing in other uses of Big Data by industry worldwide,” Allison says.
Still under development, the online public NGDS Portal should launch nationwide in 2014.
The main goal of NGDS and other programs is to make data open and available in formats that are easy to retrieve, download, search, index and process. But determining which formats will ultimately prevail as the standard for each data type is challenging.
In science, researchers often work alone or in small groups on projects that are designed to answer a specific question, and their data formats are usually customized. Thus, scientific data and metadata — data about data and its collection — are often formatted in ways that do not consider how other researchers may want to use the data in the future.
This lack of standardization is one of the reasons that scientists spend significant amounts of time manually integrating data, Allison says.
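To make the integration burden concrete, here is a minimal Python sketch of the kind of conversion Allison describes: two sources report the same sort of well measurement with different field names and units, and each must be mapped onto one shared schema before the records can be combined. Everything in it is hypothetical and for illustration only, including the field names (`API_NO`, `TD_FT`, `BHT_F`, `wellId`), the two "state" sources and the `WellRecord` schema; none of it is taken from NGDS or any real survey database.

```python
from dataclasses import dataclass

FEET_TO_METERS = 0.3048


@dataclass(frozen=True)
class WellRecord:
    """Hypothetical common schema for one well observation."""
    well_id: str
    source: str
    depth_m: float
    bottom_temp_c: float


def from_source_a(row):
    # Hypothetical source A: depth in feet, temperature in Fahrenheit.
    return WellRecord(
        well_id=str(row["API_NO"]),
        source="A",
        depth_m=row["TD_FT"] * FEET_TO_METERS,
        bottom_temp_c=(row["BHT_F"] - 32.0) * 5.0 / 9.0,
    )


def from_source_b(row):
    # Hypothetical source B: already metric, but with different field names.
    return WellRecord(
        well_id=str(row["wellId"]),
        source="B",
        depth_m=float(row["depth_m"]),
        bottom_temp_c=float(row["temp_c"]),
    )


def integrate(rows_a, rows_b):
    """Convert both sources to the common schema and combine them."""
    return [from_source_a(r) for r in rows_a] + [from_source_b(r) for r in rows_b]


records = integrate(
    [{"API_NO": 1, "TD_FT": 1000.0, "BHT_F": 212.0}],
    [{"wellId": "b-7", "depth_m": 250, "temp_c": 40}],
)
```

Each new source needs its own converter, which is why an agreed format matters: once every provider publishes in the shared schema, the per-source functions disappear.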
But getting researchers to agree on how to share can be a challenge in itself.
“Everyone thinks their standard should be the standard,” says Kerstin Lehnert, director of Integrated Earth Data Applications, a community-based data facility at Columbia University’s Lamont-Doherty Earth Observatory that operates EarthChem and the Marine Geoscience Data System. “And when that happens, you have no standard at all.”
Internationally, one organization attempting to standardize formatting is the Open Geospatial Consortium, a group of more than 475 companies, government agencies and universities that adopts and approves standards that enable geoprocessing technologies to talk to each other. Its ultimate goal is to attach spatial data to “geo-enable” all Web, wireless and location-based services, as well as mainstream IT.
Agreeing on formats to deliver data from 50 states was one of the first hurdles NGDS planners had to overcome. First, they identified 30 major types of data that were available and useful for assessing geothermal resources. Then they adopted the USGIN framework, which uses open source international conventions for interoperability such as those developed by the Open Geospatial Consortium. One of these is the GeoScience Markup Language, an information interchange format developed and maintained by the International Union of Geological Sciences Commission for the Management and Application of Geoscience Information.
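As an illustration of what such a convention buys, the following Python sketch builds a request in the style of the Open Geospatial Consortium's Web Feature Service (WFS) 2.0, one of the consortium's published standards. The article does not say which consortium services NGDS or USGIN servers expose, so the use of WFS here is an assumption, and the endpoint `https://example.org/wfs` and the feature type `example:WellHeader` are made-up placeholders. The query parameter names (`service`, `version`, `request`, `typeNames`, `count`) are those of the WFS 2.0 key-value encoding.

```python
from urllib.parse import urlencode


def build_wfs_getfeature_url(endpoint, type_name, max_features=10):
    """Build a WFS 2.0 GetFeature URL using key-value-pair parameters."""
    params = {
        "service": "WFS",
        "version": "2.0.0",
        "request": "GetFeature",
        "typeNames": type_name,   # which feature type to return
        "count": max_features,    # cap on the number of features returned
    }
    return f"{endpoint}?{urlencode(params)}"


# Placeholder endpoint and feature type, for illustration only.
url = build_wfs_getfeature_url("https://example.org/wfs", "example:WellHeader", 5)
print(url)
```

Because the request shape is fixed by the standard, the same client code could in principle query any compliant server by changing only the endpoint and the feature type name.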
Building a Data-Sharing Community
Solving the technical problems involved in getting disparate data to cooperate may turn out to be easier than getting researchers to cooperate, Lehnert says. “What we learned from the earlier initiatives [such as GEON] was that data integration is much more a social process than a technical process,” she says.
It involves more than just changing how geoscientists label their samples or key in their data; at its core, data integration is about scientists sharing, at a time when tight funding in the sciences is fostering an increasingly competitive environment. For a variety of reasons, “researchers are often reluctant to share their data,” Lehnert says. “Motivating them to share and actually use certain [data-integration] standards is a huge challenge.”
Indeed, Allison says, “the challenges are not so much technical, as they are cultural, organizational and institutional.” In his experience with NGDS, he says, agreeing on technical standards has helped facilitate the social aspect of data sharing by, literally, getting everyone on the same page. Adopting USGIN helped lead to the “creation of a community of practice.”
Another new initiative, EarthCube, is attempting to address the issue through development of a virtual community that fosters engagement through meetings, workshops and online discussion forums on the website at earthcube.ning.com, as well as through social media services, such as Facebook, Twitter, YouTube and LinkedIn.
EarthCube, which is billed as “a community-driven data and knowledge environment for the geosciences,” started in 2011 as a unique collaboration between NSF’s Geosciences Directorate and NSF’s Advanced Cyberinfrastructure division. So far, there has been a score of projects to develop “roadmaps” for different aspects of geoscience cyberinfrastructure, and involvement of more than 1,700 participants in the virtual EarthCube community.
“EarthCube is emerging as a prototype for cyberinfrastructure development across all the sciences so it’s exciting to see the geosciences taking a leadership role in this transformative initiative,” Allison says.
The first annual EarthCube Summer Institute for Technology Exploration is scheduled to be held in August 2013 at SDSC. Some of the goals are to teach geoscientists about computer science and to teach computer scientists about how geoscientists work, as well as to instruct researchers in the best practices for recording metadata. EarthCube is facilitating dozens of other workshops on topics such as integration of real-time data; education and workforce training; and data management in fields such as geochemistry and paleogeoscience.
Virtual discussion groups are another feature of the community. Among other topics, researchers can talk about the creation of a digital environment for the curation of physical samples, including how best to record data about samples — for example, by bar-coding them and digitizing metadata.
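A digitized sample record of the kind under discussion might look something like the Python sketch below, which ties a barcode to a handful of metadata fields, checks them for obvious problems and serializes the result for sharing. The field names, the validation rules and the example values are all hypothetical; they do not represent an EarthCube format or any agreed community standard.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class SampleRecord:
    """Hypothetical metadata for one bar-coded physical sample."""
    barcode: str
    sample_type: str
    latitude: float
    longitude: float
    collected_on: str   # ISO 8601 date string, e.g. "2013-06-01"
    collector: str
    repository: str


def validate(record):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    if not record.barcode:
        problems.append("missing barcode")
    if not -90.0 <= record.latitude <= 90.0:
        problems.append("latitude out of range")
    if not -180.0 <= record.longitude <= 180.0:
        problems.append("longitude out of range")
    return problems


def to_json(record):
    """Serialize the record so it can be stored or exchanged."""
    return json.dumps(asdict(record), sort_keys=True)


# Invented example values, for illustration only.
sample = SampleRecord(
    barcode="EX-000123",
    sample_type="rock core",
    latitude=38.9,
    longitude=-112.6,
    collected_on="2013-06-01",
    collector="J. Doe",
    repository="Example Core Repository",
)
```

Calling `validate(sample)` before `to_json(sample)` is one simple way to catch bad coordinates or a missing barcode at the moment of entry rather than years later.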
One factor that will determine how quickly things progress will be funding. In late 2012, NSF announced $4.8 million in the first round of funding under a new geoinformatics program that seeks to create the underlying architecture, or “building blocks,” of EarthCube.
But while the recent successes are welcomed, and many initiatives are active, researchers note the proliferation of data-integration initiatives may ultimately require some integration itself. NSF has called for a two-year test governance process to engage the geoscience community to create an organization and management structure that will guide priorities and system design.
“The challenge is to avoid throwing more money at building many independent pieces that don’t fit into an overall strategy,” Allison says, “while not shutting down the amazing momentum underway in the scientific community.”
Allison says it is an exciting time for the field: Developing a cyberinfrastructure for the geosciences is not unlike the early days of other historic technological advancements.
“We view EarthCube as being in the formative stages of the process common to the creation of all infrastructures, from canals in the 1830s, to railways, highways, radio, electrical grids and the Internet,” Allison says. “They all followed similar paths, but not many had the same foresight as to where things may go as we do now with cyberinfrastructure.”