7

Government-Sector Data

DR. ALEXANDER: My name is Shelton Alexander, from the Pennsylvania State University, and I am moderating this session. I would like to reintroduce the members of the panel on government-sector data, all of whom you have heard from in other contexts today: Barbara Ryan from the U.S. Geological Survey (USGS), Jim Ostell from the National Center for Biotechnology Information (NCBI), Richard Kayser from the National Institute of Standards and Technology (NIST), and Kenneth Hadeen, formerly with the National Oceanic and Atmospheric Administration's (NOAA) National Climatic Data Center (NCDC). The rapporteur is Suzanne Scotchmer from the University of California at Berkeley.

We have a set of five points that we want to address in the next hour. The National Research Council (NRC) study committee developed a set of five questions to guide the discussion this afternoon (Box 7.1). The first is to identify and discuss the principal benefits or opportunities with respect to data production or dissemination activities in the government sector, occasioned by the current legal and policy regimes. We want to try to get some sense of the relative order of importance of the issues identified. I think it is clear from the discussion this morning that federal agencies certainly have to deal not only with the U.S. situation, but also with the situation in Europe and in other foreign countries. The government agencies also have dealings with the commercial sector and with not-for-profits. I think the context of your answers to this question should be broadened to include both of those areas. I would like to have each of you, in turn, give two-minute comments on the first question. We will start with Barbara Ryan.

BOX 7.1: Questions for the Discussion Sessions on the Existing Legal and Technical Situation
  1. Identify and discuss the principal benefits and opportunities to your database production and dissemination activities from the current legal and policy regimes. Try to rank them in order of importance.

  2. Identify and discuss the major problems and challenges to your database activities posed by the current legal and policy regimes. Try to rank them in order of importance.

  3. What specific conduct on the part of others most adversely impacts your organization's database activities? In answering this question, please specifically consider the impacts on your data activities caused by other database producers, data product disseminators, and data users in all three sectors (government, not-for-profit, and commercial). Try to rank them in order of importance.

  4. Identify and discuss the principal benefits and problems to data users posed by the current legal and policy regimes. Try to rank them in order of importance.

  5. Would any of your responses to the questions above change significantly if you project your activities five years hence?



The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.



PROCEEDINGS OF THE WORKSHOP ON PROMOTING ACCESS TO SCIENTIFIC AND TECHNICAL DATA FOR THE PUBLIC INTEREST: AN ASSESSMENT OF POLICY OPTIONS

MS. RYAN: With regard to the first question, for the USGS, the principal benefit of the current legal and policy regimes was actually laid out fairly well by Justin Hughes when he made reference to the Office of Management and Budget (OMB) Circular A-130. If the public citizenry of the United States has paid for data once, they should not pay for data again. This policy is very clear. As we enter into any cooperative agreements with either the private sector or our other public partners (state and local governments), we come to the table with that understanding right in the beginning, so there is no misunderstanding with any of our partners about what our responsibilities are as a federal agency. If others enter into an agreement with us regarding Earth and natural science information, then the expectation is that the data will, in fact, be available to all parties. Whether these parties are developers or conservationists, everybody on both sides of the fence gets equal access to these data right away. For the USGS, the greatest benefit is just the clarity of the position with federal information. There is one exception, however. As we enter into agreements with Indian tribes, there may be an exclusion from uniform data release because of self-governance and self-determination policies and our federal trust responsibility to those tribes and their policies. So if there is any debate about how easily we can turn the data over and release data once the data have undergone quality assurance and quality control, it tends to get a little foggier in terms of our negotiations with Indian tribes.

DR. OSTELL: I agree with Ms. Ryan regarding the clearly voiced intent on the part of the U.S. government that the data should not be paid for twice.
In fact, it is our job to make those data available in as many different ways as possible to as many different people as possible. I would also like to expand on the current status of data referred to in a published article in a traditional scientific journal. The status is that these are separate issues. That is, the article can be copyrighted, but the data behind the figure in the article are not under copyright or under database restrictions of the publisher who published it. This is quite important because it allows us to build databases and refer to them as published literature, and there is a clear ability to implement this notion that the data should be publicly available, while allowing the author or the publisher to retain copyright. The only reason I point out this differentiation between a database and a journal article is because of what publishers might do. For example, a scientific publisher, such as Elsevier, could ask for the underlying data as part of the article it is publishing and then would consider, under the European Union Database Directive (E.U. Directive), that it would therefore own the database associated with the publication, which could be a problem. Under the current U.S. law, it is not a problem. Finally, the notion that different types of published works are protected by copyright does give scientists needed flexibility when we encounter situations in which we don't get cooperation from the data providers, either because they are from another country or because they think they don't have to cooperate. We have the option of getting the data out of the table in a book, or something like that, and incorporating these data into some tool we have anyway. It is not the preferred method, but in a sense it provides an opportunity of last resort. Again, if that information now becomes protected as exclusive property instead of under copyright, this means that there is no escape in these cases and we are trapped.

DR. KAYSER: I certainly agree with Dr. Ostell's comments about the importance of being able to get information out of the literature. The way I look at this situation is that, under the current policy regimes, data compilations are not protected. We are relatively free to take factual information from the literature and incorporate it into our databases. That is what I see as the principal benefit of the current regime, as far as we are concerned. There are no questions about the ownership of the underlying technical information. The other aspect of the current policy regime, as far as NIST is concerned as a data provider, is that if we provide data collections that are not covered by copyright, then they wouldn't be covered under copyright under any of the new regimes under consideration, at least the misappropriation model. I don't see it having any effect on us in that respect.

DR. HADEEN: It is nice to be at the end of the table because everyone has already said everything. I do want to talk about OMB Circular A-130, which states that everyone will be treated equally. I think I see a bad trend developing within NOAA where the dual-pricing policy allows commercial customers to be charged more than the regular price of reproduction and dissemination of the data. This could lead to a situation in which the NOAA data centers start looking for those products that sell well instead of all the other activities associated with the data center. At some future time, this may have to be done in order to maintain the databases that are needed for the future. From the meteorological perspective, I think the current situation is quite flexible and allows NOAA to do the job it needs to do.

DR. ALEXANDER: With respect to the first question and the comments we just heard, I invite anyone from the audience to ask a question or raise a point.

AMBASSADOR SWEENEY: I would like to discuss a national security issue that I have not heard much about today. The question of international technology and data transfer was precluded from discussion, to some degree. First of all, let me introduce myself.
I am James Sweeney, and I have been involved in negotiations with the international community on issues pertaining to a treaty on science and technology, but also on the nature of the proliferation of technology that the United States is exporting to other countries. It is an issue that I think is very important and I haven't heard much discussion on it today. In any proposed legislation, I think there needs to be some very clear consideration of the transfer of basic science data that could have dual-use applications, that is, commercial as well as military applications. I certainly believe that the national security and foreign policy considerations are extremely important. We have seen recently in the news the issue pertaining to the export of space-related data to China and also to Europe. There are many issues such as this that need to be addressed. I would like your comments on that.

DR. OSTELL: I have a comment on that. I think you raise two issues. One is a national security issue, in which essentially enabling technology that puts the United States in jeopardy is exported this way. A second issue is, in a sense, the fruits of the U.S. taxpayer dollar going, in a nonreciprocal way, to other countries. On the first point, speaking from the point of view of biology, which is my particular field, it is very difficult to separate the military use from the health use, for example, when that basic information is used to engineer an application for, say, a viral weapon, as opposed to a drug against that virus. In the field of biology, I would say that we can't distinguish on the basis of the information. You would either have to send all or nothing in that regard. In terms of the nonreciprocal nature of the investment in science and the willingness to disseminate it again, it can be very difficult for us to get the information from other countries, even though they are perfectly happy to take information generated by the United States. I don't see a way out of this because it is a much deeper issue dealing with the national policy of countries for funding research. We could stop giving these countries information, but we would also then need to stop their citizens from visiting the United States. It would have to be a deep decision on the part of the government to wall off U.S. research. Science itself is very international. There are excellent brains in all parts of the world. I think individual scientists tend to want to cooperate. In fact, we get quite a bit of cooperation from European and Japanese scientists. It is only at these points where they encounter their governments that it becomes an obstacle. You don't want to cut off individual scientists from participating in the process because of their governments. On the other hand, it is irksome that the government is putting up a barrier.

DR. HADEEN: I can't talk about nuclear proliferation and weapons, but I can talk about international weather experiments, which have the cooperation of up to 100 different countries in some cases. After the International Geophysical Year in 1957, the International Council of Scientific Unions set up the World Data Center system, which was designed to exchange data from the various experiments. The NCDC is the World Data Center-A for meteorology. There are also about six or seven other parts to the World Data Center-A, which involve geophysics data and a whole series of other kinds of discipline data, such as oceanographic and solar and terrestrial, and various other aspects. Within the World Data Center system, there has been an exchange of data since the 1950s. The World Data Center-B is near Moscow. We have exchanged a lot of data with the Russians, even during the Cold War. Of course, today the Russian economy is such that they are having a difficult time providing the information back to the United States. There is a strong agreement that the Russians would like to.
In fact, they were concerned about the survival of their World Data Center, and they talked about transferring more and more of their data to the United States so that their data would be protected somehow. The World Data Center-B is operating.

AMBASSADOR SWEENEY: The statement that I heard was that any research done with government funding should be in a public database. I agree with that generally. However, most of the research I am referring to is done by the Department of Defense (DOD) and is classified, or by the Department of Energy (DOE) or other areas of the intelligence community that is classified as well. These data are collected by these agencies and then are available to the open market, and I think this is a concern. This is all I am trying to emphasize. We should include a statement that all government-funded data that might have an impact on national security policy should not be made available for international data transfer, unless they are in full compliance with export control policies and procedures.

DR. ALEXANDER: Obviously research that gets done at DOD and DOE is done under these other legal and policy regimes, and the results are not widely distributed or distributed at all in many cases.

MR. MOLHOLM: Under the Freedom of Information Act there is an exemption for the DOD. No other department has that. So not all data are necessarily available. In fact, not everything that is unclassified is necessarily available in the public domain. I think that is an important point.

MS. RYAN: There is one other point, which is that, at least under this administration, the trend has been the other way, to take classified information and start declassifying it for, in effect, the civilian community. There has been a push to do that. The Civilian Applications Committee spends a lot of time talking with government and university scientists to examine DOD-classified assets and look for civilian applications.

MR. MOLHOLM: That is true, but to declassify it doesn't necessarily put it into the public domain.

MS. RYAN: That is correct.

PARTICIPANT: I would like to make a general comment. I think maybe one benefit of the present policy that should not be overlooked in these discussions is that federal policy has created, or made, the United States a world leader in scientific and technical information. It may be serendipitous and it may be a result of malice aforethought, but the present policy has looked at that position. I think that any change to the policy has to recognize this leadership position and do cost-benefit analyses to see if any proposed legislation would, in some way, inhibit the creativity and innovation of not only our scientists and our computer specialists but also science informatics people in an industry that has grown up.

DR. ALEXANDER: That is a very good point.

DR. HEILMAN: I work for the State of Maryland and we looked at a distinction between individual use of data and commercial use. We have a variety of different ways of disseminating the data. An issue that came up involves a particular data set that we have that is copyrighted and sold for commercial value. There are entities that purchase data sets and even do additional value-added work. This policy has allowed us to provide these data to the public because they are in part publicly funded as well as commercially funded. The state doesn't have to worry about these particular data being captured by an individual agency. Is that a distinction being made by the federal government?

DR. ALEXANDER: I will let Justin Hughes of the Patent and Trademark Office address that question.

MR. HUGHES: First of all, you have to describe your database.
There is a very good chance that it is not protected by law any more, and there is a very good chance that I could just copy it. So it may be that the State of Maryland is pulling a fast one on all the people who are paying for it, under the current state law. I want you to understand that. If it is a thorough, complete database and if I wanted to take it, I could take it, not pay for it, and add value and resell it, or not add value at all and resell it.

DR. HEILMAN: It was actually interesting because commercial purchasers were concerned about the price and went to the state legislature. As a result, the price was reduced for the commercial users because the data set is created now, and maintenance costs aren't nearly as high as the development costs, which included geographic information system implications and mapping indications that were added work, other than straight data. The Maryland legislature decided that these were benefits to the individual citizens. Because of the reduced price, the commercial purchasers were comfortable with that compromise.

MR. HUGHES: In essence, what the State of Maryland did was to lower the price enough so that a commercial entity decided that it was easier to pay for the data set than to bear the risk of going into court and having to prove that it is unprotected.

DR. HEILMAN: The reduction in costs was not that significant. As I said, there are other entities that do value-added work and sell it to those same consumers. It was more a balancing act of whether or not these data were able to be copyrighted by someone as intellectual property.

DR. ALEXANDER: I would like to move on to the second question, sort of the counterpoint to the first one, which is to identify and discuss the major problems or challenges in the database activities posed by the current regime.

DR. OSTELL: I previously mentioned the problem with different policies internationally, so I won't cover that again. I also said that a benefit of the current government policy is that the intent is that data under U.S. government grants should be in the public domain. This also presents a problem because sometimes it is not clear what that means. So, for example, we have cases of grantees who tried to make the case that if the data were on their Web site, they were public, as opposed to being, say, in the database at NCBI, which is a different kind of public. Also at issue is at what point in the development of the data should they be made available publicly. There has been a lot of discussion of this issue for large funded centers, for example, that are producing millions of base pairs of sequence. Some of that sequence may exist in an unfinished condition for months, possibly, and yet it is finished enough that discoveries are made based on it. There are trade-offs in how these centers make that information available. So the practical solution of this problem is being worked out. The enforcement of making grantee data available is very difficult because almost all that you as the grantor can do is complain, or rescind the grant, which is a fairly extreme action to take. There is not too much middle ground other than to resort to public shame to get people to respond. I think another point of ambiguity that was raised by one of the other speakers concerns the scope of government activity. At what point are we, in fact, doing the job that the U.S. government is asking us to do, and at what point are we interfering with industry? This has been an ongoing balance for NCBI in a number of different directions. In general, I would say that this has been very positive. We have had a couple of encounters but, through some give and take, we managed to find a somewhat positive situation in both cases. However, this question of scope may continue to be a problem. It is hard to know a priori what that boundary is between government and industry because the world changes. Technologies change. Things that used to have to be very expensive now have become cheap. Things that used to be of interest to some small group of people, such as the World Wide Web, are suddenly something that millions of people want, so the economics and the priorities of these things change. I don't know how to correct this, other than to do it on a case-by-case basis, but clarity would be a help on this.

DR. KAYSER: That was such a good answer, I don't really have anything to add. At NIST we don't have any significant problems with the current policy regime in this country. However, we do have a problem with the E.U.'s Database Directive, which presents a different policy. Even in that case, I am not sure that we have seen any real manifestations of the E.U. Directive yet. I guess we will just have to wait and see what happens along that line.

DR. HADEEN: Again, the NOAA doesn't have any particular problems, other than what I mentioned earlier about the international aspect. There is the World Meteorological Organization (WMO) Resolution 40, which allows certain countries to withhold data or to say that data from certain stations cannot be reproduced or used for commercial purposes. There are also bilateral agreements with countries like Canada in which their data are used in research and so on. If we disseminate these data, they have to be referred back to the Canadian Environmental Service.
These are all things that could be worked out and are being considered on a regular basis. The situation today is not critical by any means. But if we don't take some action at some time, we may wish that we had. It is a situation in which you are just waiting for the other shoe to drop.

MS. RYAN: I think that as funding pressures mount for federal agencies, what we are basically looking for is some acknowledgment of the resources that have gone into the collection of the data in the first place. In the USGS-Microsoft cooperative research and development agreement (CRADA), we have embedded the USGS logo as a watermark at periodic places in that data set. Those people who look for it will see that the base data are, in fact, USGS data. For those who don't look for it, it will give the appearance that, in fact, Microsoft has collected that information. As we enter into future balanced-budget agreements, as we did with the 104th Congress when the USGS was slated for abolishment, the question is, What purpose do these agreements serve? Because so few people know about the breadth and depth of the USGS, other people using USGS-derived products are apt to get credit for that information. I don't know how closely this is aligned to the current policy. When you look at the British Geological Survey, for example, which is operating in more of a quasi-private-sector mode and all of a sudden has more funding flexibility for collection of certain data sets, then there are aspects of that agreement that look a little bit more appealing.

MR. HUGHES: Would it help the USGS if there was something that said that, when a commercial entity uses and processes public domain, government-generated data, there must be some acknowledgment of the source? Something that said that the original source of these data was the USGS?

MS. RYAN: It absolutely would.
I can imagine that it would be the same for the National Weather Service with their derived products, which has a whole base of information that is collected at public expense.

MR. HUGHES: The other virtue of something like this is that not only would it help Congress understand all the good work our federal agencies do, but it would also tell citizens where to go back and look for the original data, if they don't want to pay Microsoft.

DR. OSTELL: Can I also respond to that point? There is a cascade of credit that occurs in derived works like this. For example, the National Institutes of Health (NIH) funds a grant to get a sequence. An individual researcher who publishes that sequence in a paper should be cited. The researcher should also cite the grant from NIH. That sequence then goes into an NCBI database. That database gets redistributed in a commercial product. Everyone wants to get credit, so you end up with multiple layers of crediting on these sequences. This, in fact, is an issue for NCBI because each person in the agency wants to show that they contributed to the sequence. It rapidly becomes unwieldy at some level.

DR. KAYSER: If people incorporate NIST products in commercial products and leave them exactly the way they were when we provided them, then we want them to use our name. If they want to modify the products that we give them in any way, then we don't want them to use our name. That is the general NIST policy.

MS. CARROLL: Bonnie Carroll, Information International Associates. I work with nine federal science and technical information agencies in a group called CENDI. One of the things that we have observed over the years is that the interpretation and the implementation of these policies differ dramatically among the agencies. Earlier Ms. Ryan said that there are contracts and CRADAs and other mechanisms used by the federal agencies and that these have been very useful. We also have looked at the Federal Acquisition Regulations and these are all interpreted differently. These federal agency managers ask, “How can I sign my rights away to a publisher?” And they can't. One of the big obstacles is not only in the interpretation of these regulations, but also in the implementation. It is a frequently asked question, and every general counsel might answer differently. This interpretation issue and differences across government agencies is something that might be considered when looking at how to deal with issues in these database settings.

MR. KELLY: I am Chris Kelly from the Department of Justice. I am wondering whether, as database users or consumers, any of you have a sense that you would be getting more and better database products to use if we were in a regime where people had proprietary rights to data. In this case, you might be looking at better products to check your data against.

DR. OSTELL: I doubt it.

MS. RYAN: Actually, I think it might be quite the opposite. This issue might be addressed later in question number three, from a database producer standpoint. I think one of the biggest challenges across the board is the development of standards, or metadata standards, that address and facilitate the integration of data. I fear that if you were to go down that path of proprietary protection, with the funding pressures on top of that, it would exacerbate the problems, not improve them. We talked earlier today about facilitating the exchange or integration of information across all the disciplines. A perfect example is Jim Ostell's presentation when he talked about comparing gene sequences and colon cancer.
For example, can you imagine the power of the scientific inquiry if you were then to superimpose incidences of colon cancer with data that we have for soils or surficial geology, looking at incidences of colon cancer with water quality data from the same geographical area, chemical data from the agricultural industry, and the fate and transport of agricultural contaminants or other chemicals in the environment? That is the integration of these data sets that needs to occur. And in all honesty, I am not sure the Europeans are doing any better than we are in the United States, at this juncture.

PARTICIPANT: I would like to raise one problem that has come up since the E.U. Directive was enacted. At least, that is when I became aware of it. The WMO published an article that looked at meteorological data policy. They stated in the article that no data that were published in their journal could be put in a computer format without prior permission from them. I haven't researched this thoroughly, and I am not certain that this statement was made since the E.U. Directive was put into place. I have talked to the WMO and they just smile and say, “The lawyer says this is fine; we are enforcing it.” This is the first case that I am aware of in which a scientific publisher has put such a prohibition in its primary scientific publication. I am wondering if you are aware of this.

DR. ALEXANDER: Sounds as if this would make a great court case.

DR. OSTELL: I don't know about this particular example. The International Union of Crystallography has such journals. There is also the Cambridge Small Molecule database, which is proprietary. The Union has an agreement with Cambridge to do other depositions, putting that journal into the Small Molecule database. The Small Molecule database is not freely available, however. You have to pay quite a lot of money to use it.

PARTICIPANT: My point is, you can't take the primary data and make them available in the primary database without the permission of the publisher.

DR. OSTELL: It sounds like the Elsevier model, in that the data reported in the journal belong to the publisher.

DR. ALEXANDER: I am going to cut off the discussion on this point because we are under a severe time constraint. Question number three: What specific conduct on the part of others adversely impacts your organization's database activities?

DR. KAYSER: From an intellectual property point of view, I am not sure that other database producers or data product disseminators have any effect on NIST's data activities at all, other than that NIST cares about what other organizations are doing so that we can ensure that our efforts are complementary to theirs and not overlapping. At NIST, we want to produce data that people consider valuable. We try to enter into as many agreements as we can with other people who want to disseminate data that come from NIST.

DR. HADEEN: The NCDC is in the same situation as NIST regarding intellectual property. As I mentioned earlier, the major impacts on NCDC are the rapid changes in technology and the observation of networks and so on, because we depend on other groups to take the observations. I don't think that any of the issues we have talked about today impact that aspect, either the producers or the disseminators or the users. I want to say one thing about data product disseminators. NCDC works with many other groups and, in many cases, there are agencies that don't charge for any data. Consequently, if a database is developed mutually between two agencies, the one that gives it away free distributes it broadly. This has happened on several occasions. We also have several databases that were developed with many contributions from other countries. Again, these databases have been distributed without any strings attached, only for the cost of reproduction in some cases.

DR. ALEXANDER: Does this apply to bilateral data as well?
That is, are all the data that you gather from international sources treated like domestic data?

DR. HADEEN: The data we get internationally, except under the WMO's Resolution 40, with certain countries and under some bilateral agreements, are all treated like domestic data. Some other countries embellish some of the large databases. In negotiations with these countries, they have relaxed some of their restrictions in order to promote the common good of large global databases.

MS. RYAN: I want to reiterate what I alluded to a few minutes ago, and that is, when other government agencies, whether they are federal, state, or local, think that they are isolated and therefore build their databases without the recognition that there might be potential linkages of their data to other data sources, it limits the usefulness of everyone's data. This is probably the greatest detriment for this whole topic. Again, it goes back to developing standards and metadata. It is just immensely important but, at the same time, an immense challenge to go through the kind of coordination activity that is necessary to make sure databases are, in fact, interchangeable.

DR. OSTELL: In this context, I would say that NCBI has managed to consciously avoid some of these problems by having a very clearly voiced rule that any data or software that are on our site are publicly available. This has resulted in some cases of not including data that were encumbered in some way from some places. NCBI has had the advantage, as we have grown as a central resource, to attract previously uncooperative people who found it in their interest to become cooperative. I would say that the biggest problem has been the change of status of a public data resource like SWISS-PROT, where essentially we incorporated it in our strategy because it was a publicly distributed resource, and then it could be switched. It places us in limbo in how we deal with it, how we substitute for it, and what replaces it in the long-term strategy. NCBI's concern is whether there are other resources currently public that could undergo such a change and become restrictive, which would alter what NCBI does with them.

DR. BROOKS: Lisa Brooks from the Genome Institute. I fund databases. One of our biggest problems is getting sequence data and variation data into public databases. What has not been talked about too much at this meeting is patent issues and intellectual property related to that. Now that things like SNPs are of so much commercial interest, it is a problem getting this type of information into public databases. The other thing that doesn't help is that the European laws are different, which affects the Europeans' willingness to put information into databases.

DR. HEILMAN: I am Kelly Heilman and I work for the Maryland Department of Health and Mental Hygiene. We are actually looking at trying to enforce some of our rights in data, particularly in grants and research funding. There are two particular issues that we are dealing with. Maybe the panel can address these. One concern is that if a researcher holds the rights to data that were collected, and then goes back into that database and mines it for a secondary purpose, there is the potential for violating Institutional Review Board (IRB) protocols and informed consent policies, and also possibly exposing us to liability if that occurs. The second issue is that when we are considering disseminating data to the public, we believe that we have an obligation to de-identify these data.
The potential for taking research data and linking them to other data sets can really violate the privacy and confidentiality of individuals.

DR. OSTELL: Well, if you can get enough data, you can figure anything out; I think that is the bottom line of what you are saying. Particularly for health-related data, this issue comes up for many of the longitudinal studies like Framingham or some with the Mormon families, for example. They have been anonymized to a certain extent but also they have now been studied from enough different directions that you can figure out who these people are. There are not that many families with 12 children living in Framingham who had a father who died of a heart attack at age 60. I don't think there is an easy answer to your question. There are the usual techniques in which errors can be introduced into the data, parts of the information can be hidden, or some portions of the data can be made proprietary. The Framingham study approach is that the data are not redistributed. You have to collaborate with a Framingham researcher, and that is how they protect the data. I don't think there is a simple answer in terms of redistributing the data.

DR. HEILMAN: There is also the IRB and informed consent issue. Under state law, we have some protection so that we can de-identify the data or have a licensing agreement that might prevent the leakage and identification of individuals enforced with civil penalties, IRB sanctions, preventing user access to data, etc. We are trying to protect the privacy of individuals.

DR. ALEXANDER: We will now move on to the fourth question, which is to identify and discuss principal benefits and problems of data users posed by the present legal and policy regimes.

DR. HADEEN: It seems as if this is an issue that we have already covered.
Some of the benefits of the current regime for the meteorological users are that, with the open exchange of data, the data are readily accessible, and users, even commercial users, can add value to them. In addition, we can redistribute meteorological data at will, for the most part. If you look at problems of the current regime, it is mainly the cost to obtain some data, which can be rather exorbitant. In some cases there are huge databases that have to be reprocessed, or require a lot of work before they are ready and in a form that can be used. Of course, digital data are no problem usually, but if the data are still in a manuscript form and you want a digital data version, there are a lot of steps to take in going from the former to the latter.

DR. OSTELL: The same thing is true in the sense that the current regime does allow a wide range of uses of data produced within the government. That is, end users can use the data with resources on site, and third parties can produce commercial products with them to fill niches not covered by government activities, some of which can be quite large and profitable. I think from the scientific perspective, the ability to get at the whole data set is crucial for making new discoveries. For a number of types of scientific data sets, if you have a new approach to analyzing them, you have to be able to compute over the whole data set. That is not something you do on someone's Web page. You have to get the data set. By making the cost of getting the data set very low, especially given the low cost of PCs and the hardware in many cases, you can let many flowers bloom. A clever graduate student at the University of Oklahoma has as much access as some large commercial concern. Myra Williams mentioned a couple of databases that were developed academically that have become commercially important. I can say that there are dozens of others that were developed the same way, however, which turned out to be flops.
By this very open policy, the low-cost ability to get the data, you allow people to experiment without having to make lots of investments in licensing agreements up front for that 10 percent or 5 percent that actually turn out to be good implementations or new discoveries.

DR. ALEXANDER: One question or observation on the current regime is that, among the various agencies, there is a variance in policies as to what constitutes the cost recovery that we do make. Does that differential pricing that exists across the federal agencies pose problems for users? Are there costs being passed on to the users, over and above the simple cost of reproduction, that act as a barrier?

DR. KAYSER: The Standard Reference Data Act, which was passed in 1968, empowered NIST to recover the full cost of essentially all data activities: everything related to producing the databases, from compilation, evaluation, and packaging to distribution and administrative costs. In many cases it is not really possible to recover a significant fraction of the cost of producing the databases, but in some cases it is. I think that NIST may be unique in this regard because of the Standard Reference Data Act.

DR. OSTELL: I would say it also has something to do with timing. In the case of NCBI, we are relative latecomers, so we assumed a technology standard. We distributed data on CD-ROM for a while. That cost money because we had to produce media. So we did cost recovery just for the production of the media. Shifting to having the data distributed on the Internet means that essentially there is no cost from our perspective. The user needs to obtain access to the Internet, but other than that, the same machines that NCBI uses to produce the data sets are also the distribution medium, so it is free.
In discussions with the director of NIH and the heads of the institutes, there is a recognition that it is possible to do things now technologically that are so inexpensive relative to the cost of the researcher getting the data in the first place, that to charge for accessing the data is kind of silly. Even though in toto you may be talking about $1 million or millions of dollars, the cost of distribution is still a tiny fraction of what was actually spent.

DR. ETKINS: Bob Etkins, NOAA. On the other hand, it is very hard to place a value on information that the government provides free of charge, just because of the collection. Not being able to place a monetary value on the information can make it difficult sometimes to justify to Congress the work and the services that we provide. We are in a prosperous year right now. I can imagine, and I am sure you all can imagine, a situation in which it was not as prosperous. The government agencies are under pressure to reduce their services and reduce their costs.

DR. OSTELL: NCBI may be in a unique position, in the sense that molecular biology is in a stage of expansion. The justification NCBI uses is cases in which, for example, disease genes are discovered, like the colon cancer gene. We can cite a paper where an investigator or researcher used an NCBI resource and we found a human colon cancer gene and now we are going to make a new drug. That plays well with Congress. It may be different in these other cases, for example, the large data sets that NOAA collects, or at NIST where they have direct industry connections. You can get the refrigeration industry to say, yes, we need the data set and we will pay for it.

DR. KAYSER: In some cases, we set the prices on databases based on how many data there are, how much evaluation went into them, how much it cost to create the database, and also what people are willing to pay. In some cases, that is nothing. In other cases, we can essentially recover the cost of the entire program. We determine what to charge for the database by figuring out what kind of a program we need to have to meet the needs of the community. If we can estimate the number of sales, then that is how we try to set it.

MS. RYAN: At the USGS we are able to recover data reproduction and data dissemination costs for the Landsat imagery, which we now distribute largely over the Internet, but it is not stored on the Internet. So there is still a fair amount of salary dollars that are required to retrieve data from tapes and make them available for users. There are also the 55,000 topographic maps that cover the United States and are still distributed in hard copy. The sale of these maps accounts for somewhere between $10 million and $15 million a year for the USGS. So it is a substantial amount of money.

DR. KAYSER: I wanted to make one qualification of my comment, and that is that NIST subscribes completely to the OMB Circular A-130 principle that Justin Hughes outlined earlier. People should certainly not pay for data more than once. If we quit adding value to any data products, then you have to start giving them away for essentially the cost of dissemination.

PARTICIPANT: I have worked in the federal data business for five administrations. During that time we have had people come to us very seriously to say that the taxpayers paid for it once and they should get it free. We also have people saying that if the data are worth anything, charge what you can. I have heard from everyone in this room who is my age or older that this is a strain on federal data activities.

DR. ETKINS: I would like to comment on the issue that if the taxpayers pay for it once, they shouldn't have to pay for it again. Taking an opposing view, this is not always true. There are many cases in which a very small fraction of the taxpayer population actually benefits from the data that are collected. Would it be fair to recover that cost, or some of it, from those very few users? To reduce the cost to the remainder of the taxpaying public is not a bad thing. The government pays for building toll roads, for example, and we pay tolls because the rest of the state shouldn't have to pay for that toll road.
This is an accepted principle.

MR. HUGHES: This is very complicated. The toll road is a good example. What was the paradigm that emerged after World War II? It was not the toll road. It was the interstate, which anyone could travel free of charge. Yes, there are still toll roads in America, but you have to make decisions about that. The problem is that every benefit that the government creates is not shared equally by everyone. This does not apply only to information. It is true of the roads, of the railroads, of the airways. It benefits those of us who get on planes. It benefits me more than it benefits my mother. So we have to decide at some point if we, as a society, will invest broadly in public infrastructure or a service that doesn't always accrue to everyone equally. It is a tough decision, but it is there to be made.

DR. KAYSER: Another principle is that if only a few individuals or a single organization benefit from what we do, then it is hard to justify doing work just for the sake of a few. A good example is that if someone sends a thermometer or a pressure gauge to NIST to be calibrated, you have to recover the cost of that directly from the people for whom you are doing the calibration, not from the general public.

DR. ALEXANDER: The last question is to gaze into a crystal ball and ask if anything that we have talked about here today will change five years hence. The question also might be, What might you have said five years ago about these issues vis-à-vis what we have said today?

MS. RYAN: This is a tough question. The current policy of OMB Circular A-130 will probably still be in effect five years from now, which will largely be for the public good. It may have some minor modifications that we talked about earlier, for example, about credit to the original government data sources so that our agencies still exist five years from now, and that the public can, in fact, benefit from our information.
I would hope that in the next five years the public sector, largely the federal government, will get its act together on data standards and metadata, so that the integration of information across disciplines will be largely facilitated and we will in fact actually see that this information provides a gateway to the Earth.

DR. OSTELL: It is hard to project, because we are talking about many different things. Just looking at the thrust, at least, in biology, I would say that it is going to be more of what we have now. This notion of freeways is, I think, inexorable. It will be so essential to be able to go to the public resources, to make these computational connections, go back out to the research laboratory, go out to commercial providers, and have them point back into the public resources. I would say that people in five years probably are going to find it hard to imagine even the barriers that we have to put up with today. I just don't see that the current approach is going to stop. I suppose it could be legislated away.

DR. KAYSER: From an intellectual property point of view, and from the perspective of the kind of programs NIST has, I don't see any changes coming in the next five years. We work in an area that is relatively mature, compared to bioinformatics. The areas in which NIST works may change a little, but from an intellectual property point of view, no, I would say that there will be no change.

DR. HADEEN: I am on the fence on whether to say that there are going to be changes or not. This pendulum that swings toward commercialization in one administration and noncommercialization in the next one can make it difficult to predict. What I see is more and more data and easier access to them. So my final answer is that the policy situation should remain steady for the next five years.

DR. ALEXANDER: With regard to the enabling technology that exists today, we talk about the Internet, but we can anticipate further leaps like that, which will have profound impacts on the ability to access data and information. The changes are going to be as stunning as they have been over the past five years. Even though we are talking about legal and policy issues, the enabling technology has to be a factor in these discussions because it influences what we do and how we do the database generation, dissemination, and all the related aspects, and especially the cost.