protection (a key element of personal health records, in that the patient is empowered to exercise fine-grained control over the information they contain) also requires that patient-specified security and privacy policies apply to every data element that refers to the targets of those policies. This requirement presents yet another data integration task.

Data Management at Scale [20]

Even when large integrated corpora of data exist (the focus of Section 5.2.3 on data integration), managing those data remains a major challenge. Some of the important dimensions of medical information management include:

  • Annotation and metadata. Raw data almost never speak for themselves, and their interpretation inevitably relies on metadata, that is, annotations to the primary data that provide the necessary context. For example, the primary data for the human genome consist of a sequence of some 3 billion nucleotides. Metadata associated with the primary data help scientists identify significant patterns within those data; a given sequence might be annotated as a gene or a regulatory element. Metadata can also be used to trace the provenance or lineage of data. For example, the value of data in an electronic health record could be enhanced if the record included information about the conditions under which those data were obtained (e.g., physician observations of a patient’s description of symptoms might be accompanied by video and audio recordings of the session with the patient). For metadata, a primary problem is the design and development of tools that facilitate machine-readable annotation of large databases.

  • Information extraction from text. The volume of medically significant information rendered in text form (e.g., physician or nursing notes) is large, and in some instances that information may be as significant as, or more significant than, information rendered in other forms (e.g., lab instrument readings). Extracting useful medical information from textual notes is therefore an important problem, one that calls for computer science expertise in text processing, natural language processing, and statistical text-mining techniques, as well as medical expertise to understand the concepts and ideas to which the information refers. New techniques are needed for extracting information such as patient names, doctor names, medicine names, and disease names from textual notes, and for generating automatic linkages between


[20] An extended discussion of the data management challenges in biomedical data can be found in National Research Council, Catalyzing Inquiry at the Interface of Computing and Biology, The National Academies Press, Washington, D.C., 2005.
