Chapter 18: Developing Standards for Interdisciplinary and Intersectoral Data Applications | Data for Science and Society: The Second National Conference on Scientific and Technical Data | U.S. National Committee for CODATA | National Research Council

Promoting Data Applications for Science and Society: Technological Challenges and Opportunities
 


18

Developing Standards for Interdisciplinary and Intersectoral Data Applications

John Rumble, Jr.




     I would like to talk about standards in relation to data, the Internet, and scientific databases in general. I'm going to talk about why we have standards and what scientific data are. Then I'm going to talk a little bit about the approach to developing standards. Since I'm president of the Committee on Data for Science and Technology (CODATA), I also want to take the opportunity to mention a few ways CODATA can help in this whole process.

     I want first to apologize for not being an extensible markup language (XML) expert. I was talking to someone at the National Institute of Standards and Technology (NIST) who is working on a variation of XML for materials data. He said, "How can you talk about scientific database standards without mentioning XML, CORBA (Common Object Request Broker Architecture), or anything else like that?"

     I want to make it clear what I'm going to talk about. There are two aspects to sharing data, sending data over the Internet, and using data that are computerized. Part of it is the envelope. What is the format that the information is in, or in which the data are coded? Part of it has to do with the content. I will leave it to the computer gurus to talk about XML and things like that. My talk is going to be focused on content: standards for content in scientific databases; content of scientific data collections; content that is used and exploited in twenty-first century science and technology.



Figure 18.1

   


     Some of us here are scientists, and some of us aren't. Let me remind you what data are (see Figure 18.1). I have spent the last 25 years worrying about physical properties, observations, and structures that are represented by numbers, scientific text, and pictures that result from experiments, measurements, calculations, observations, more calculations, and modeling. It used to be that these data were reported in the scientific literature. To use them, you would have to find the relevant scientific literature and extract the data. Today, in the print publishing world, it has become very difficult to publish routinely the results of scientific measurements and experiments. By results, I mean the actual raw data or large compilations or tables of derived data. Oftentimes these data now are generated, measured, created, and put into computerized collections without really going through the published literature.

     Another major change that is happening focuses on databases, or database-building technology. Today anyone can buy Microsoft Office, which includes a database management system, Access, that is really quite powerful. This means that building scientific databases is now quite easy compared to 10 or 20 years ago.

     However, not only is the computer technology changing to allow us to build databases with an ease we never had before, but two other things are happening. First, modeling and simulation have taken advantage of the computer power that we all are aware of and they are generating data at unprecedented rates and in unprecedented amounts. We are building some really big instruments, for example, the Hubble Space Telescope, which are generating data at rates never achieved before. So it is not just the technology that captures the data that has changed, it's the technology that generates the data that has really changed, and this is going to continue to accelerate into the twenty-first century.

     Many of the systems that we are thinking about, doing calculations about, making models of, and looking at experimentally are more complex systems. What I mean by more complex is that they are real systems. We no longer are looking simply at the kinetics of a chemical reaction in the gas phase, for example, the interaction of one molecule of methane with one molecule of oxygen. Now we look at the combustion of methane in the atmosphere, where we have 10^22, 10^23, or 10^24 particles that we have to account for. Or we are looking at cells, or the human body, or all kinds of astrophysical or astronomical objects. All of these systems have a large number of components and many complex interactions. Because we are looking at more complex systems, we have to look at them in many different ways, since they are so difficult to understand.

     These systems are complex because they have many more independent variables that are relevant. That means that the volume of data that we generate, either by calculation or by looking at a complex system, is going to be inherently larger than when you deal with just simple systems. As this complexity increases, the number of elements or components that we are looking at grows in size, in terms of internal associativity and connectivity, and in numbers of interactions. The number of data fields that we have to worry about is also growing, and the number of databases that contain relevant information is proliferating rapidly.

     In addition, because we are looking at real things, scientists and engineers are realizing that the decomposition of science and technology into isolated disciplines no longer allows us to operate at the scale of reality and complexity that we really want to work. Consequently, as we look at these systems, we realize that we will have to involve and use data from many different subdisciplines that have been defined over the years.

     Not only are the scale and complexity growing, but also the disciplines that look at the various aspects of very complex systems have traditionally been rather distinct. If we are going to model these real systems, we are going to have to be able to share data across these different subdisciplines. This is going to be done primarily through some kinds of standards, whether formal or informal.

     The point I would like to make is that as we deal with these real, large-scale, complex systems, and as our ability to collect data and the tools to computerize the manipulation, delivery, and use of the data become easier, the data issues associated with science have to be addressed from the beginning. I would say that to do science intelligently, especially at the level of complexity we are talking about and with the level of federal support required, a lot of these complex science data issues must be addressed from the beginning. To make databases interoperable and to share data across disciplines, standards must be mandatory.

     It is not just at the beginning of some new scientific investigation that we are going to have to ask these data questions. It is as we progress through scientific research that we are going to have to continue to reexamine the data issues, because we learn more about what we are looking at. Also, because we know more about what we are looking at, we can gather more data. This is going to change our database requirements, our data standard requirements, and our data use. In essence, we have to anticipate the data future or at least be prepared to change our data view. Of course, this is difficult, but we have to get into that mind set.

     There are many reasons for developing scientific database standards (Figure 18.2). I'm going to concentrate on the needs for the scientific and technical (S&T) data and database standards in the context of their development.



Figure 18.2

   


     The first need is for improved data collection. If we know what we want to collect, it is obviously much easier to collect. As we collect data over a long period of time, we constantly revisit our knowledge and our understanding of what we should be collecting. Then we can change what we collect and do it more efficiently and, what is more important, more completely. Obviously, if you know what you want to collect, it will make for more efficient database building because you are able to build your data dictionaries more easily. Comprehensive, standardized database dictionaries will facilitate data sharing, data exchange, and data integration.
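
     A minimal sketch of how a shared data dictionary can support this kind of integration, written in Python. Everything in it is an assumption made for illustration: the two sources (`lab_a`, `lab_b`), their local field names, and the standard element `temperature_K` are hypothetical and are not taken from any real standard.

```python
# Hypothetical standard dictionary: element name -> definition and unit.
STANDARD_DICTIONARY = {
    "temperature_K": {"definition": "Sample temperature", "unit": "K"},
}

# Hypothetical mappings from each source's local field names to the standard
# element, with the conversion needed to reach the standard unit.
LOCAL_MAPPINGS = {
    "lab_a": {"T": ("temperature_K", lambda kelvin: kelvin)},
    "lab_b": {"temp_C": ("temperature_K", lambda celsius: celsius + 273.15)},
}


def to_standard(source, row):
    """Translate one locally named row into standard element names and units."""
    standardized = {}
    for local_name, value in row.items():
        standard_name, convert = LOCAL_MAPPINGS[source][local_name]
        if standard_name not in STANDARD_DICTIONARY:
            raise KeyError(f"{standard_name} is not in the standard dictionary")
        standardized[standard_name] = convert(value)
    return standardized


def integrate(collections):
    """Merge rows from several sources into one uniformly described list."""
    merged = []
    for source, rows in collections.items():
        merged.extend(to_standard(source, row) for row in rows)
    return merged


# Illustrative values only, not measured data.
combined = integrate({
    "lab_a": [{"T": 300.0}],
    "lab_b": [{"temp_C": 25.0}],
})
```

     Once both collections are expressed in the dictionary's terms, `combined` can be treated as a single data set; the per-source mappings are the only place where local conventions need to be known.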

     We have data standards that allow us to have more complete and uniform data collections. One of the holy grails of science is, of course, discovering new knowledge, and the standards will facilitate this greatly.

     I have been involved in building scientific data and database standards for almost 25 years now. In thinking about this subject over the past couple of years, I have tried to identify the major components of successful scientific data and database standards. I think there are three basic approaches that every successful standard in this area has adopted (see Figure 18.3). This is something that we need to think about proactively before proceeding.



Figure 18.3

   


     The first is neutrality of formats. The community that is developing these standards has to approach this with a clean slate. We have learned a lot about building various kinds of databases, but we must remain neutral about what is best, because quite frankly the community at large probably has many different viewpoints of the right way to do this. If you adopt an advocacy position for one viewpoint, you are never going to get anywhere. The second component is that data elements have to be defined very carefully. The third approach focuses on separating the semantics--that is, the meaning--and the content of the data elements from the way you deliver them. If you do this, I think you are a long way toward developing good standards.
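
     One way to picture the separation of meaning from delivery is the Python sketch below. It is an illustration under stated assumptions, not a description of any existing standard: the element names, definitions, and units in `ELEMENTS` are hypothetical, and JSON and XML simply stand in for two interchangeable "envelopes."

```python
import json
import xml.etree.ElementTree as ET

# Semantics: each data element is defined once, independent of any format.
# These element definitions are hypothetical examples.
ELEMENTS = {
    "temperature": {"definition": "Temperature of the sample", "unit": "K"},
    "pressure": {"definition": "Absolute pressure of the sample", "unit": "kPa"},
}


def validate(record):
    """Reject records that use elements the definitions do not cover."""
    unknown = set(record) - set(ELEMENTS)
    if unknown:
        raise KeyError(f"undefined data elements: {sorted(unknown)}")
    return record


# Delivery: two envelopes for the same content.
def to_json(record):
    return json.dumps(validate(record), sort_keys=True)


def from_json(text):
    return validate(json.loads(text))


def to_xml(record):
    root = ET.Element("record")
    for name, value in sorted(validate(record).items()):
        child = ET.SubElement(root, name, unit=ELEMENTS[name]["unit"])
        child.text = repr(value)
    return ET.tostring(root, encoding="unicode")


def from_xml(text):
    return validate({child.tag: float(child.text) for child in ET.fromstring(text)})


# Illustrative values only.
record = {"temperature": 298.15, "pressure": 101.325}
```

     Because the definitions live in one place, either envelope can be replaced or a third added without touching what the elements mean.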

     What kind of data and database standards are we talking about? There are really three components of these standards. We usually talk about properties, measurements, or observations of something. So the first component is to describe the "something," whether it is a chemical substance, a species, a human being, a person in a clinical trial, an object, a system, an astronomical object, whatever. The second component is to record or report in appropriate detail the property, measurement, calculation, or observation. The third component, which many people forget--and where all the hard work really comes in--is the context in which these properties or measurements are meaningful, all of the independent variables that have to be separated. One has to recognize that at the beginning of a data collection, you may not know everything that is applicable. What we know about the relevant independent variables, say, for the genomic collections, is different now than it was 5 years ago and is much different than it was 20 years ago.
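
     The three components can be sketched as a record structure. The Python below is a minimal illustration, assuming hypothetical class and field names (`Subject`, `Observation`, `DataRecord`, `missing_context`); the sample values are placeholders, not reference data.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Subject:
    """Component 1: the 'something' being described."""
    kind: str   # e.g. "chemical substance", "species", "astronomical object"
    name: str


@dataclass(frozen=True)
class Observation:
    """Component 2: the property, measurement, calculation, or observation."""
    quantity: str
    value: float
    unit: str
    method: str  # e.g. "experiment", "calculation", "model"


@dataclass
class DataRecord:
    subject: Subject
    observation: Observation
    # Component 3: the independent variables that give the value meaning.
    # Kept open-ended because not all of them are known at the outset.
    context: dict = field(default_factory=dict)

    def missing_context(self, required):
        """List the required independent variables this record lacks."""
        return sorted(set(required) - set(self.context))


# Placeholder values for illustration only.
record = DataRecord(
    subject=Subject(kind="chemical substance", name="methane"),
    observation=Observation("density", 0.66, "kg/m^3", "experiment"),
    context={"temperature_K": 298.15},
)
gaps = record.missing_context(["temperature_K", "pressure_kPa"])
```

     Leaving `context` as an open mapping reflects the point that the list of relevant independent variables grows as understanding improves; a check such as `missing_context` can then be rerun against a revised list of requirements.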

     When we talk about the description of systems, species, or substances, we are doing so in a computerized sense. What are we trying to accomplish when we describe the substance in a computerized manner? There are really two functionalities that a standard has to address. The first is to make sure to define the substance under consideration. I am talking about this chemical, not that chemical. We also have to describe it uniquely so that not only can you put the information in, but also you can find it to whatever level of detail you want. Second, you want to be able to support equivalency. If I made a measurement on one species, and someone else made a measurement on another species, and I am able to compare the records to determine that indeed the species are equivalent to whatever degree of detail I want, I can combine the data set into something bigger and exploit it.
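
     A rough Python sketch of the equivalency idea follows. The description fields and their coarse-to-fine ordering in `DETAIL_LEVELS` are assumptions chosen for illustration, as are the function names and the sample values.

```python
# Hypothetical description fields, ordered from coarse to fine detail.
DETAIL_LEVELS = ["formula", "isotopic_composition", "phase"]


def fields_up_to(level):
    """Return the description fields from the coarsest one through `level`."""
    return DETAIL_LEVELS[: DETAIL_LEVELS.index(level) + 1]


def equivalent(description_a, description_b, level):
    """True if both descriptions state, and agree on, every field up to `level`."""
    return all(
        name in description_a
        and name in description_b
        and description_a[name] == description_b[name]
        for name in fields_up_to(level)
    )


def combine_if_equivalent(set_a, set_b, level):
    """Merge two data sets only when their subjects match at `level`."""
    if not equivalent(set_a["description"], set_b["description"], level):
        return None
    return {
        "description": {n: set_a["description"][n] for n in fields_up_to(level)},
        "values": set_a["values"] + set_b["values"],
    }


# Placeholder values for illustration only.
mine = {
    "description": {"formula": "CH4", "isotopic_composition": "natural", "phase": "gas"},
    "values": [1.0, 2.0],
}
theirs = {
    "description": {"formula": "CH4", "isotopic_composition": "natural", "phase": "liquid"},
    "values": [3.0],
}

coarse = combine_if_equivalent(mine, theirs, "formula")
fine = combine_if_equivalent(mine, theirs, "phase")
```

     Here the two sets combine when compared only by formula, but not when phase is taken into account, which is the sense in which equivalence depends on the degree of detail requested.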

     These are the two functions of a computerized description of substances. I want to point out that there are millions of things to describe. You don't have to develop standards to describe everything at once. As a matter of fact, this is virtually impossible. So you have to set your goals very realistically. The other point I would like to make is that scientists almost always have very little understanding of information modeling, which is the best tool for developing an understanding of data. We have to become familiar with this kind of technology.

     We have the same kinds of considerations for reporting properties and for describing independent variables. However, I will not go over these in any kind of detail.

     How do we actually develop standards, and what are some of the messages I can provide people in terms of the development process? The first is that there are many different organizations that develop standards. Some of them are very formal and have legal bases, such as the International Organization for Standardization. They are legal entities incorporated somewhere. They have a well-defined standards development process. The administration of the committees and groups that develop standards is very carefully controlled.

     Sometimes standards are developed informally through professional and technical societies. The International Union of Crystallography is a good example. There are also a number of pre-standardization bodies, NIST being one of them. There are many other national laboratories and research institutes that in essence develop scientific data standards and other kinds of standards informally. They don't have the same legal standing, but the standards are appropriate for their situation.

     There are also a number of new players in the standards arena who have come in because the Internet is changing things. The traditional standards development organizations, because they are legal and bureaucratic, take time to develop a standard, and "time" to somebody working in the Internet environment, say the World Wide Web Consortium (W3C), is a week or a month. However, many of the issues that we have to deal with in scientific data and database standards are not going to be resolved in a week or a month. We have to recognize this.

     The different kinds of standards organizations have different roles to play in the scientific data and database standards process. We need to recognize what the differences are in the way these organizations operate, and how we can take advantage of it to our benefit.

     Given all this, does it mean that it has been impossible to develop a large suite of very robust scientific data and database standards? No, and many standards do exist today. I would just like to make a quick point here. The very first scientific data standard ever developed was for neutron collision. It was developed in the 1950s and ratified, I think, in the early 1960s. It was an international treaty signed by ambassadors as I understand it, and it was done under the auspices of the International Atomic Energy Agency. In all my years of developing materials, chemical, and biological database standards, I wish I had that authority in other situations because it has been almost impossible to get agreement in any other situation. So there is something to be learned from this.

     There are some key issues that face the scientific community as we try to develop standards in the future. Standards development is a process (see Figure 18.4). You know there is a need for a standard. After the standard is developed, it is used by people and groups who have a vested interest in doing so. I want to point out two important aspects.



Figure 18.4

   


     The first is that just because one person thinks there is a need for a standard does not mean the need for that standard exists. There almost always needs to be a group of users, whether it be colleagues, a discipline, or what have you. There also has to be a community-wide recognition that the need exists. The second point I would like to make is that if the standard is developed, and nobody uses it, it is a waste of time, an absolute waste of time.

     There are some economic and sociological lessons I have learned in developing standards. You have to be motivated to build a standard. You also have to be motivated to use a standard. If it is hard to build a standard, one should look at it and ask if the real motivation in building this is missing. Maybe we do not have enough motivation. Industry builds standards and uses them for business reasons, and that's a very important point.

     What are the motivations for scientists to build standards? The first obviously is efficiency, though scientists very often don't think that's a real motivation. However, if somebody else has solved the problem, I can stand on his or her shoulders and solve the next bigger problem, and I can do it efficiently because I'm taking advantage of the other person's knowledge.

     The second is money. If we spend a lot of money and we develop standards that help us preserve some of that money so we can do more research or more science, this is a very strong motivation. In areas as diverse as developing an astronomical catalogue and the Human Genome Project, people have realized that if they develop standards initially, they will reduce some of the data integration problems, thereby creating more resources for the basic research.

     Preservation is another key issue. There is now recognition that data are not good for just 3 years, or 5 years, or decades. They are good for centuries. How do we maintain data in a way that future populations of scientists in the next centuries can take advantage of our scientific observations and discovery?

     Finally, standards allow and facilitate data mining and knowledge discovery. The point was made today, which I wanted to reemphasize, that there is a major paradigm shift in science. In addition to experimentation, theory, and modeling, exploiting databases is going to become a new source of scientific discovery. We heard this morning in the talk by Usama Fayyad on data mining that some very exciting knowledge discovery from databases was occurring (see Note 1). I think this kind of knowledge discovery is possible in every area in which scientific databases are available.

     There are some problems with scientists and standards. Scientists are reluctant to use standards. Often this is because standards are not the state of the art. Also, scientists are not used to working toward a consensus. It is important to work together as a community to actually develop the standards.

     Now let me just give you a brief example of this. Over the last decade the Federal Geographic Data Committee (FGDC) has done a fantastic job of developing standards for geographic information. It has been motivated first by the fact that many businesses wanted them because there is a growing geographic information systems industry. At the same time, the federal government was sponsoring many data programs in which data keenly depended on geographic information. It made sense to do this, and the government did an excellent job. In the last year I have talked with six or seven groups that are building new databases that are dependent on geographic location. I asked if these groups were using the FGDC standards. In five of the seven cases, they had never even heard of them. Two of the cases said, "They're not right. They're not good enough. They don't have what I want." This is after literally millions and millions of dollars of investment. This is an attitude that scientists often adopt about data standards. We have to be careful when we do this, because essentially what we are doing is reducing the amount of resources available for the science and knowledge discovery.

     I want to make the point again that most scientific data standards are developed because of hidden business motivations. We have to recognize that there are many different ways of making business decisions. Just because the FGDC has developed these very important standards does not mean that they are going to map onto my business decision-making process. As we develop scientific data and database standards, scientists are going to have to think about some of the economic and sociological motivations to do this and use the result of that analysis to overcome some of these artificial barriers.

     I would like to conclude with a brief discussion of how CODATA, both internationally and through the national committees, can facilitate some of this process. The first is to point out that over the years, CODATA has done a lot of work in so-called data recording and data reporting formats. If you go through the old CODATA bulletins, many nomenclature and definition issues have been addressed very robustly in a lot of scientific areas, from the point of view not of building a database standard, but of reporting information from measurements, calculations, and modeling results in the primary literature. Much of the knowledge developed over the years is still applicable and should be used as a basis to help some of the modern standards development that is going on.

     Second, CODATA is a multinational, multidisciplinary activity and organization. I hope I have created an awareness that scientific data are not the problem of one discipline. Scientific data are of importance to the entire scientific community, and having a multidisciplinary and multinational view of this will give us the most robust data and database standards.

     CODATA is an organization that was created by scientists for scientists and is operated by scientists. Because of this, I think it provides an environment that is friendly for scientists to work in, perhaps much more friendly than many of the formal standards bodies.

     In conclusion, developing standards for interdisciplinary and intersectoral data applications is becoming critical for twenty-first century research and development. The issues discussed here must be addressed in order to develop such standards.



Notes

1 See Chapter 15, "Data Mining and Databases," in these Proceedings for additional information.



Copyright 2001 the National Academy of Sciences
