Dennis D. McDonald consults from Alexandria, Virginia. His services include writing & research, proposal development, and project management.

Data Governance, AI, and Data Driven Medicine: Challenges & Opportunities

By Dennis D. McDonald, Ph.D.


On behalf of Dimensional Concepts LLC (DCL) of Reston, Virginia, I attended a meeting sponsored by the DC-based think tank Center for Data Innovation titled U.S. Data Innovation Day 2018: The Future of Data-Driven Medicine.

Before the meeting I made a list of topics of special interest:

  • What data governance challenges are associated with applying AI techniques to new drug discovery and development?

  • What data governance challenges are specific to medical AI applications? Which data governance challenges are generic?

  • What are the implications of using AI techniques for the business processes and regulations associated with new drug development and medical treatment delivery?

These topics were touched upon over 4 hours of panel discussions and presentations by representatives of medical AI research and development, government regulation, pharmaceutical companies, and cloud vendors. 

What follows is my personal take on factors potentially relevant to contractors and consultants who focus on data governance, program management, and IT services. First, I summarize key points; then I discuss implications for data governance. (If these are topics that interest you please let me know by contacting me at

Key Points

Here are what I recorded as “key points” made during the meeting by the various speakers:

  • AI techniques have great potential at the front end of the “drug discovery” process. It is now possible to model individual molecules at a level of detail that supports prediction of potential therapeutic as well as side effects. This can help at the stage of development when drug designs must be selected and synthesized for testing.

  • The ability to efficiently analyze large amounts of data can streamline and potentially shorten the time required to design clinical trials.

  • The term "big data" seems to be falling out of favor, at least among those able to take advantage of current resources (especially cloud based) to support data capture, storage, and access.

  • Industry and researcher concerns over constantly increasing data volumes have been superseded by realization that (a) cloud-based services can handle the steadily increasing volume and (b) the more serious challenges are now with making data volumes useful, accessible, and interoperable.

  • Efforts are underway at the FDA to streamline the approval process associated with wearable and digital medical devices. Regulations associated with more traditional pharmaceutical products are taking longer to modify.

  • FDA is also pushing for more transparency and data sharing during drug related research. How this will impact traditional concerns about proprietary research and data ownership is hard to predict.

  • Powerful analytical and even AI based software tools are becoming more accessible for end users such as lab scientists. This accessibility and analytical power may bypass some roles traditionally played by IT or Computer Science staff.

  • Any discussion of applying AI techniques to assist in analysis or modeling of data extracted from multiple sources must address data governance concerns including data standards, data quality, and interoperability.

  • One benefit of data interoperability is the ability to use advanced analytical approaches to model similarities and differences across different diseases. When taken together these related models might improve our understanding of overall health related conditions such as aging effects.

  • Failure rates in drug development are high. A large proportion of tested drugs never make it to market. Reducing failure rates based on employment of AI techniques to prioritize drug designs for testing might have a significant impact on reducing the overall cost of clinical trials.

  • Smaller companies are trying to be disruptive (e.g., by incorporating better molecular data into the "drug discovery" process). How this will impact the still complicated and lengthy process of getting new drugs approved is hard to predict.

  • Even when advanced software is used in drug design and testing a significant amount of costly and hard-to-automate manual processes are still required to move the drug through the entire development and testing cycle.

  • The need to conduct certain types of tests (e.g., for cardiovascular impacts) is common across many different types of drugs. With improved interoperability, will it be possible to use test results generated for one drug to evaluate other drugs if the bases for commonality are well understood?

  • It’s always necessary to have a global perspective on testing given the worldwide market for drugs. Failure to test a drug in one country (e.g., Japan) might lead to that drug not being approved for use in that country.

  • Don’t forget that AI techniques might also be adapted for use in “lower level” types of applications such as the automating of report writing about clinical trial results. Such applications could have significant impacts on clinical trial costs.

  • Using electronic health record (EHR) data as input to large datasets that supplement structured databases requires more attention to how natural language processing tools can overcome language variability. FDA has a variety of data and metadata standardization efforts in the works.

Data Governance Implications

Pressures to share data

In biomedical research today, pressures to share data early in the research cycle are increasing, due to several factors including:

  • The ease with which data in ever increasing volumes can be shared.

  • The influence of younger researchers accustomed to collaborating and sharing data online.

  • The encouragement of data sharing by government research funders.

  • An increasing awareness of the value of data sharing.

Proprietary research

The picture looks different when we consider proprietary research funded by pharmaceutical companies to develop new and commercially successful drugs. These differences include:

  • The tendency of pharmaceutical companies to treat some research data as proprietary.

  • The risk of testing new drugs using live organisms (especially humans).

  • The reality that many participants in the proprietary research sponsored by pharmaceutical companies are also involved with open “nonproprietary” research.

How the two worlds interact has implications for how data are governed as well as for who does the governing. Despite the move to make biomedical research data more accessible and open, there are many and sometimes competing interests to consider.

Researchers and sharing

On the people side, the researchers involved must be willing to share data, especially early in research before results are analyzed and deemed “complete.” This is not always the case given how competitive some research areas are. “Being first” is still a strong incentive that may impede release not only of methodological details but also disclosure of models, algorithms, and raw data that can be accessed and re-analyzed by others in ways that may not have been originally intended. (This is an interesting consideration when discussing the possibilities of using existing clinical trials data to “prequalify” or influence separate research efforts.)

Barriers to interoperability

Even when data are made available for access by others, there may still exist barriers to interoperability, a point made repeatedly by speakers at the Data Innovation Day conference. “Big data,” artificial intelligence, and natural language processing tools hold great promise for extracting structure from multiple data sets, but interoperability challenges still exist, ranging from a lack of standards, to too many standards, to no standards at all.

The “no standards” case may be an especially serious challenge when using natural language processing tools to analyze the unstructured text of medical records pulled from the electronic health record systems managed by different institutions.
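To make the “language variability” problem concrete, consider a minimal sketch of term normalization, one small piece of what NLP pipelines over EHR text must do. The variant spellings and canonical terms below are hypothetical illustrations, not drawn from any actual FDA or EHR vocabulary standard:

```python
# Hypothetical example: clinicians at different institutions record the same
# concept in many ways; a normalization step maps free-text mentions onto one
# canonical term before data from multiple EHR systems can be combined.
CANONICAL_TERMS = {
    "myocardial infarction": {"mi", "heart attack", "myocardial infarction", "acute mi"},
    "hypertension": {"htn", "high blood pressure", "hypertension", "elevated bp"},
}

def normalize_term(raw):
    """Return the canonical term for a free-text mention, or None if unrecognized."""
    cleaned = raw.strip().lower()
    for canonical, variants in CANONICAL_TERMS.items():
        if cleaned in variants:
            return canonical
    return None
```

With no shared standard, each institution maintains its own version of such a mapping, and mentions falling outside it (here, returning None) are exactly where interoperability breaks down. For example, `normalize_term("HTN")` yields `"hypertension"`, while an unlisted variant yields `None`.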

No central control

When considering the implications of the above for data governance, one overriding factor stands out: no single governing body exists that can internationally control data and metadata interoperability from the beginning to the end of the biomedical research data lifecycle.

Collaboration and cooperation

Collaboration and voluntary cooperation are the key. These in turn can be strongly influenced by government policy and regulation. Several speakers at the conference emphasized the need to find “balance” between industry needs and regulatory interests.

Corporate laboratories are certainly in a position to impose more controls over how data are captured, managed and disclosed. Data governance factors for them to consider include:

  • The corporate laboratory may still need access to externally generated research data to analyze with existing as well as powerful new AI and analytic tools; both underscore the need for data interoperability.

  • The clinical trials and adverse events recording stages of research require operating in a highly regulated environment where, increasingly, complete data “secrecy” is being discouraged.


Data transparency and open research data access are strongly encouraged today but are still unevenly distributed across the entities involved in biomedical research. This is complicated by the length of time (often years) it takes for even published results to find their way into usage, and by the sometimes-conflicting objectives of those involved in the process (for example, commercial peer reviewed journals versus “open access” journals).

The languages (machine, spoken, written) used by the different entities involved in biomedical research are constantly evolving and can never be completely standardized and unified, given the different populations involved (for example, data scientists, clinicians, and lab researchers). There is no single international “super regulator” that can “knock heads together” when needed.

Governing data in a complex and distributed environment requires both a willingness to share data and the financial, human, and technical resources to support data standardization, data quality, and data stewardship efforts. In such a complex environment, data governance policies and practices must continue to evolve to improve data interoperability so that researchers can take advantage of increasingly powerful tools, such as AI, available to them.

Copyright © 2018 by Dennis D. McDonald. Thanks are due to Dimensional Concepts LLC (DCL) for supporting the writing of this report. More information about DCL is available here.
