As data migrate to a cloud-based environment, issues of data ownership, how the data will be used for scientific discovery, and who has access to the data become uncoupled, making clear governance and oversight plans essential, said Anthony Philippakis, chief data officer at the Broad Institute of MIT and Harvard. Indeed, said Sean Horgan, lead project manager for Verily Life Sciences’ biomedical research platform, data access policies inherited through large existing datasets have failed to keep up with what scientists now see as the need for cross-dataset analysis. New policies need to be drafted for the next generation of data, he said, which will require coordination across new datasets such as those being generated by the various AMP initiatives, the All of Us research program,¹ Sage Bionetworks, and others.
Institutional policies around cloud and data governance are set primarily by chief information officers (CIOs), lawyers, privacy officers, and information security officers, with little engagement of scientists themselves, said Ruth Marinshaw, chief technology officer for research computing at Stanford University. Scientists need to advocate more strongly for a seat at the table where governance decisions are made, she said. Perhaps a new position needs to be defined that brings the researcher’s perspective to these deliberations, said Adam Ferguson, associate professor of neurosurgery at the University of California, San Francisco. From the researcher’s perspective, institutional restrictions on data sharing can be viewed as restrictions on academic freedom, and completely at odds with the NIH mandate, added Ferguson. “These are freight trains going at a head-on trajectory toward each other, and should be sorted out with transparency,” he said. Research participants should also be involved in this decision-making process, added Gregory Farber, director of the Office of Technology Development and Coordination at NIMH.
At NIH, dbGaP has provided the voice of the government and served as an honest broker in bringing groups together to decide who can access genomic data and for what research purposes, said Philippakis. As dbGaP data move to the cloud, NIH plans to continue playing that role, said Farber. Among the issues to be addressed are whether data use aligns with existing informed consent policies, or whether current policies reflect the world of 20 years ago and need to be updated.
CURRENT PROMISING PRACTICES FOR DATA GOVERNANCE IN THE CLOUD
The Office of Data Science Strategy has as one of its tenets sustainability around data, said Nick Weber. The office is currently piloting a program with Figshare² in which NIH provides funding up front for anyone with a dataset of a certain size that will be put into a general purpose repository for long-term sustainability, he said. He added that NIH is encouraging researchers to use STRIDES to manage very large datasets in the cloud in part so that NIH can gather reporting insights, information on costs, and information on funding to help make long-term sustainability decisions.
___________________
1 For more information, see https://allofus.nih.gov (accessed November 11, 2019).
The All of Us research program has been innovative on two fronts related to the research participant and the dynamic between the research participant and researcher, said Philippakis. First, all data collected on a research participant are returned to the participant, and second, when a researcher gains access to data, he or she is required to provide information about the research team and how they intend to use the data. “Researcher privacy isn’t really a thing or maybe it shouldn’t be,” said Philippakis. Rather, letting research participants be involved in policing oversight is innovative, he said. Horgan noted that when Verily wanted to create a data user agreement, they started by looking at the All of Us agreement.
Leveraging technology to remove some of the human-specific tasks involved in data use oversight could also make the process more efficient and consistent, said Philippakis. His team showed that a simple machine-readable ontology could be created for about 95 percent of use cases, and then ran an experiment comparing an automated versus traditional data use oversight approach. Not only did the automated approach reach the same decisions as the traditional approach in most cases, but when there were disagreements, the automated approach provided more consistent answers.
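To illustrate what a machine-readable approach to data use oversight might look like, the following minimal Python sketch encodes a dataset's permitted uses and a researcher's stated purpose as structured records and compares them automatically. It is not the ontology or system Philippakis described; the class names, fields, and disease labels are hypothetical and were chosen only for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class DatasetTerms:
    """Hypothetical machine-readable summary of a dataset's consent terms."""
    general_research_use: bool = False            # GRU: any research purpose allowed
    allowed_diseases: FrozenSet[str] = frozenset()  # used only when GRU is False
    commercial_use_allowed: bool = True


@dataclass(frozen=True)
class AccessRequest:
    """Hypothetical machine-readable summary of a researcher's stated purpose."""
    disease: Optional[str] = None   # None means the research is not disease specific
    commercial: bool = False


def is_permitted(terms: DatasetTerms, request: AccessRequest) -> bool:
    """Return True when the stated purpose falls within the dataset's terms."""
    # A commercial request is rejected outright if the consent forbids it.
    if request.commercial and not terms.commercial_use_allowed:
        return False
    # General research use covers every remaining research purpose.
    if terms.general_research_use:
        return True
    # Otherwise the request must name a disease the consent explicitly covers.
    return request.disease is not None and request.disease in terms.allowed_diseases
```

Because the same rules are applied to every request, two identical requests always receive the same answer, which is the kind of consistency the automated approach is meant to provide. Requests that cannot be expressed in the structured vocabulary would still need human review.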
ISSUES TO BE RESOLVED REGARDING DATA USE AND ACCESS, ANALYSIS, USER TRAINING, AND PLATFORM SUSTAINABILITY
Each institution sets its own rules, which hinders collaboration and efficiency, said Rosa Canet-Avilés, director of neuroscience research partnerships at FNIH. For example, one of the biggest obstacles to data sharing is that every institution requires researchers to obtain IRB and ethics approval even for data generated elsewhere, said Jane Roskams. Thus, even data that are openly accessible can take months or years to obtain. Marinshaw suggested that institutions might be able to avoid creating these regulations in a vacuum if information were available on the governance rules and data use agreements established by other institutions such as NIH, Harvard, and the Broad Institute. Creating standard templates for data user agreements may also be helpful, added Horgan. Canet-Avilés added that harmonizing such templates across different types of data and cohorts could also be valuable.
___________________
2 For more information, see https://figshare.com (accessed November 11, 2019).
Determining when restrictive access policies are needed presents another governance dilemma, said Farber. The world would be a simpler place and data would be much more useful if general research use (GRU) consents were widely adopted, he said. However, while GRU consent may be applicable to bigger datasets, Farber suggested that smaller and more specialized “edge” cases may need more restrictive policies. Philippakis added that while nearly everyone agrees that individual-level data should not be put into open access domains, aggregated data may be fine to put in the public domain. However, there is no cut point that defines when data are aggregated enough for sharing, he said.
Philippakis suggested that as new cohorts are generated, GRU provides many benefits. He noted, however, that existing cohorts are also extremely valuable even though the consents obtained in setting them up may not allow generalized use. Another challenge with integrating data from older studies is that those data may not exist in digital form, said Silvana Borges, associate director for regulatory science in the Office of Drug Evaluation II at FDA’s Center for Drug Evaluation and Research (CDER).
Canet-Avilés said it would be helpful if there were a single clearinghouse where investigators could access information about various aspects of governance, such as data use agreements for different types of data and different levels of access. Valerie Virta, American Association for the Advancement of Science Science & Technology Policy Fellow at NIH, concurred, noting that NIH is poised to provide guidance that could be helpful to the community and help propagate best practices. Bringing a larger group of investigators and organizations together to share learnings on governance problems and solutions could be valuable, said Alyssa Picchini Schaffer, senior scientist at the Simons Foundation. Marinshaw agreed about the need to engage a broader group of participant institutions, possibly by issuing requests for information on various issues related to governance practices.
A system that defines the qualifications researchers need to access controlled data, and that tracks researchers when they move from one cloud to another, is also needed, said Philippakis. The technology exists to build such a system, he said, but the organizational structure does not exist.
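As a rough illustration of the kind of system described above, the following Python sketch keeps researcher qualifications in a single shared registry that any cloud platform could consult, and records each access decision so that a researcher's activity can be followed across platforms. This is a sketch under stated assumptions, not an existing system: the registry, the qualification labels, and the platform names are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, FrozenSet, List, Tuple


@dataclass(frozen=True)
class Credential:
    """Hypothetical record of what a researcher is qualified to access."""
    researcher_id: str
    qualifications: FrozenSet[str]   # e.g., {"human-subjects-training"}
    expires: date


class CredentialRegistry:
    """Hypothetical shared registry consulted by every participating cloud."""

    def __init__(self) -> None:
        self._credentials: Dict[str, Credential] = {}

    def issue(self, credential: Credential) -> None:
        # Issuing again for the same researcher replaces the earlier record.
        self._credentials[credential.researcher_id] = credential

    def is_qualified(self, researcher_id: str, required: FrozenSet[str], today: date) -> bool:
        cred = self._credentials.get(researcher_id)
        if cred is None or cred.expires < today:
            return False
        # Every required qualification must be present on the credential.
        return required <= cred.qualifications


AuditLog = List[Tuple[str, str, bool]]  # (platform, researcher_id, decision)


def authorize(registry: CredentialRegistry, platform: str, researcher_id: str,
              required: FrozenSet[str], today: date, audit_log: AuditLog) -> bool:
    """Check the shared registry and record the decision for later tracking."""
    decision = registry.is_qualified(researcher_id, required, today)
    audit_log.append((platform, researcher_id, decision))
    return decision
```

The sketch shows only the technical half of the problem. Deciding who runs the registry, who may issue credentials, and which qualifications count is the organizational structure that Philippakis said does not yet exist.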
Governance committees may also address when cloud storage is appropriate, considering factors such as cost, safety, and the amount of data involved, said Farber. The cost of cloud storage is low at first glance, said Marinshaw, but the data management, movement, and curation can be expensive. Generally, when data are stored in the cloud there are more resources and technologies that can be employed in cost-effective ways, but researchers need to be educated on costs and benefits, said Horgan.
For example, Lisa Merck, associate professor of emergency medicine and vice chair of research at the University of Florida, said that for the BOOST3 clinical trial, which is looking at cerebral oxygenation-driven therapy after severe traumatic brain injury, continuous brain oxygenation and multiparametric data are being collected and stored in the cloud from 45 centers. She suggested that an alternative might be to publish datasets that have been curated and analyzed in a large national library that would be publicly accessible rather than relying on cloud-based services. Farber said there are some efforts to move in this direction, but added that this approach raises other governance issues such as how long to keep the data in storage.
Philippakis added that while data storage on the cloud versus on an institution’s own infrastructure may be somewhat cheaper, it can be painful simply because it is a change. But he suggested that cloud storage also incentivizes other good outcomes such as data sharing. Whether data are stored in the cloud or “on prem” (i.e., on the premises of a research organization), Philippakis said another important concern for investigators is getting locked into a technology that could disappear if its vendor goes out of business, or that could become obsolete as technology improves. He suggested that investing in open-source technologies that can be built and maintained in the community offers the best defense against that problem. Horgan added that open source is valuable not just for software, but for configurations of datasets and best practices associated with sharing code as well.
One of the main impediments to the goal of using the cloud to accelerate science is a lack of knowledge among researchers about how to work with different cloud-native data models and tools, said Marinshaw. Increasing training and providing researchers with information from a variety of demonstration cases could help address this problem, she said. Horgan added that there are also gaps and disparities with the tools that exist in the cloud and how these tools provide different user experiences in different cloud environments. Dedicated experts investing time with a user research team to understand the specific tasks a researcher wants to accomplish, rather than forcing the researcher to learn how to write their own queries to accomplish that task, could support making cloud use more efficient, he said. Roskams added that the user journey is further complicated by the fact that most platforms have failed to provide users with roadmaps that will guide them in how to manage, store, and wrangle their data. Developing training modules, possibly through INCF, or conducting hands-on training workshops could alleviate this problem, said Roskams.
Governance policies may also address training. Most institutions currently require animal ethics and/or human ethics certification for researchers working with animals or humans, noted Roskams. She suggested that it might also be helpful to require data ethics and data understanding certification. Huerta said his office is also looking at staff training, so that program officers who do not have portfolios dedicated to computational biology will better understand these concepts when they are evaluating budgets and proposals.
Finally, an important consideration related to governance is how to ensure the sustainability of cloud-based platforms. Magali Haas noted that many platforms are funded for a limited time period through grant mechanisms with no mechanism for renewal. Canet-Avilés noted, however, that for AMP, a public–private partnership between NIH and private organizations, the model they are developing is that data platforms eventually will be sustainable through government funding. Funding is not the only factor that affects sustainability, however. Sustaining the kind of cloud support engineer talent needed to support research projects has also proved challenging, according to Russell Poldrack and Weber. One approach taken by the Office of Data Science Strategy, according to Weber, is to develop programs that recruit people from outside government for one or two years to work on projects they might find very interesting, enabling them to internally train and raise the knowledge level of the rest of the research staff.