Suggested Citation:"3 Technological Drivers." National Research Council. 2007. Engaging Privacy and Information Technology in a Digital Age. Washington, DC: The National Academies Press. doi: 10.17226/11896.

3
Technological Drivers

Privacy is an information concept, and fundamental properties of information define what privacy can—and cannot—be. For example, information has the property that it is inherently reproducible: If I share some information with you, we both have all of that information. This stands in sharp contrast to apples: If I share an apple with you, we each get half an apple, not a whole apple. If information were not reproducible in this manner, many privacy concerns would simply disappear.

3.1
THE IMPACT OF TECHNOLOGY ON PRIVACY

Advances in technology have often led to concerns about the impact of those advances on privacy. As noted in Chapter 1, the classic characterization of privacy as the right to be left alone was penned by Samuel Warren and Louis Brandeis in their 1890 article discussing the effects on privacy of the then-new technology of photography. The development of new information technologies, whether they have to do with photography, telephony, or computers, has almost always raised questions about how privacy can be maintained in the face of the new technology. Today’s advances in computing technology can be seen as no more than a recurrence of this trend, or they can be seen as different in kind: because the new technology is fundamentally concerned with the gathering and manipulation of information, it increases the potential for threats to privacy.

Several trends in the technology have led to concerns about privacy. One such trend has to do with hardware that increases the amount of information that can be gathered and stored and the speed with which that information can be analyzed, thus changing the economics of what it is possible to do with information technology. A second trend concerns the increasing connectedness of this hardware over networks, which magnifies the increases in the capabilities of the individual pieces of hardware that the network connects. A third trend has to do with advances in software that allow sophisticated mechanisms for the extraction of information from the data that are stored, either locally or on the network. A fourth trend, enabled by the other three, is the establishment of organizations and companies that offer as a resource information that they have gathered themselves or that has been aggregated from other sources but organized and analyzed by the company.

Improvements in the technologies have been dramatic, but the systems that have been built by combining those technologies have often yielded overall improvements that appear to be greater than the sum of the constituent parts. These improvements have in some cases changed what it is possible, or economically feasible, to do with the technologies; in other cases they have made what was once difficult so easy that anyone can perform the action at any time.

The end result is that there are now capabilities for gathering, aggregating, analyzing, and sharing information about and related to individuals (and groups of individuals) that were undreamed of 10 years ago. For example, global positioning system (GPS) locators attached to trucks can provide near-real-time information on their whereabouts and even their speed, giving truck shipping companies the opportunity to monitor the behavior of their drivers. Cell phones equipped to provide E-911 service can be used to map to a high degree of accuracy the location of the individuals carrying them, and a number of wireless service providers are marketing cell phones so equipped to parents who wish to keep track of where their children are.

These trends are manifest in the increasing number of ways people use information technology, both for the conduct of everyday life and in special situations. The personal computer, for example, has evolved from a replacement for a typewriter to an entry point to a network of global scope. As a network device, the personal computer has become a major agent for personal interaction (via e-mail, instant messaging, and the like), for financial transactions (bill paying, stock trading, and so on), for gathering information (e.g., Internet searches), and for entertainment (e.g., music and games). Along with these intended uses, however, the personal computer can also become a data-gathering device sensing all of these activities. The use of the PC on the network can potentially generate data that can be analyzed to find out more about users of PCs than they anticipated or intended, including their buying habits, their reading and listening preferences, who they communicate with, and their interests and hobbies.

Concerns about privacy will grow as the use of computers and networks expands into new areas. If we can’t keep data private with the current use of technology, how will we maintain our current understanding of privacy when the common computing and networking infrastructure includes our voting, medical, financial, travel, and entertainment records, our daily activities, and the bulk of our communications? As more aspects of our lives are recorded in systems for health care, finance, or electronic commerce, how are we to ensure that the information gathered is not used inappropriately to detect or deduce what we consider to be private information? How do we ensure the privacy of our thoughts and the freedom of our speech as the electronic world becomes a part of our government, central to our economy, and the mechanism by which we cast our ballots? As we become subject to surveillance in public and commercial spaces, how do we ensure that others do not track our every move? As citizens of a democracy and participants in our communities, how can we guarantee that the privacy of putatively secret ballots is assured when electronic voting systems are used?

The remainder of this chapter explores some relevant technology trends, describing current and projected technological capacity and relating it to privacy concerns. It also discusses computer, network, and system architectures and their potential impacts on privacy.

3.2
HARDWARE ADVANCES

Perhaps the most commonly known technology trend is the exponential growth in computing power: loosely speaking, the central processing unit in a computer will double in speed (or halve in price) every 18 months. What this trend has meant is that over the last 10 years, we have gone through about seven generations, which in turn means that the power of the central processing unit has increased by a factor of more than 100. The impact of this change on what is possible or reasonable to compute is hard to overestimate. Tasks that took an hour 10 years ago now take less than a minute. Tasks that now take an hour would have taken days to complete a decade ago. The end result of this increase in computing speed is that many tasks that were once too complex to be automated can now be easily tackled by commonly available machines.
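The arithmetic behind the "about seven generations" and "factor of more than 100" figures can be checked with a short sketch. The function name growth_factor is an illustrative choice, and the calculation assumes a perfectly steady 18-month doubling period, which is only a loose approximation of real hardware.

```python
def growth_factor(years: float, doubling_period_years: float) -> float:
    """Return the multiplicative improvement after `years` of steady doubling."""
    doublings = years / doubling_period_years
    return 2.0 ** doublings


# Assumption: one doubling every 18 months (1.5 years), as stated loosely in the text.
doublings_per_decade = 10 / 1.5          # a little under seven doublings
factor = growth_factor(10, 1.5)          # slightly more than 100

print(f"{doublings_per_decade:.1f} doublings in 10 years")
print(f"speedup of roughly {factor:.0f}x")

# A task that took one hour (3600 seconds) a decade ago:
print(f"now takes about {3600 / factor:.0f} seconds")
```

Under this assumption, an hour-long task shrinks to well under a minute, which is consistent with the claim in the paragraph above.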

While the increase in computing power that is implied by this exponential growth is well known and often cited, less appreciated are the economic implications of that trend, which entail a decrease in the cost of computation by a factor of more than 100 over the past 10 years. One outcome of this is that the desktop computer used in the home today is far more powerful than the most expensive supercomputer of 10 years ago. At the same time, the cell phones commonly used today are at least as powerful as the personal computers of a decade ago. This change in the economics of computing means that there are many more computers in simple numbers than there were a decade ago, which in turn means that the amount of total computation available at a reasonable price is no longer a limiting factor in any but the most complex of computing problems.

Nor is it merely central processing units (CPUs) that have shown dramatic improvements in performance and dramatic reductions in cost over the past 10 years. Dynamic random access memory (DRAM), which provides the working space for computers, has also followed a course similar to that for CPU chips.¹ Over the past decade memory size has in some cases increased by a factor of 100 or more, which allows not only for faster computation but also for the ability to work on vastly larger data sets than was possible before.

Less well known in the popular mind, but in some ways more dramatic than the trend in faster processors and larger memory chips, has been the expansion of capabilities for storing electronic information. The price of long-term storage has been decreasing rapidly over the last decade, and the ability to access large amounts of such storage has been increasing. Storage capacity has been increasing at a rate that has outpaced the rate of increase in computing power, with some studies showing that it has doubled on average every 12 months.² The result of this trend is that data can be stored for long periods of time in an economical fashion. In fact, the economics of data storage has become inverted. Traditionally, data was discarded as soon as possible to minimize the cost of storing that data, or at least moved from primary storage (disks) to secondary storage (tape) where it was more difficult to access. With the advances in the capacities of primary storage devices, it is now often more expensive to decide how to cull data or transfer it to secondary storage (and to spend the resources to do the culling or transferring) than it is to simply store it all on primary storage, adding new capacity when it is needed.

The change in the economics of data storage has altered more than just the need to occasionally cull data. It has also changed the kind of data that organizations are willing to store. When persistent storage was a scarce resource, considerable effort was expended in ensuring that the data that were gathered were compressed, filtered, or otherwise reduced before being committed to persistent storage. Often the purpose for which the data had been gathered was used to enhance this compression and filtering, resulting in the storing not of the raw data that had been gathered but instead of the computed results based on that data. Since the computed results were task-specific, it was difficult or impossible to reuse the stored information for other purposes; part of the compression and filtering caused a loss of the general information such that it could not be recovered.

¹	On the other hand, the speed with which the contents of RAM chips can be accessed has not increased commensurately with speed increases in CPU chips, and so RAM access has become relatively “slower.” This fact has not yet had many privacy implications, but may in the future.
²	E. Grochowski and R.D. Halem, “Technological Impact of Magnetic Hard Disk Drives on Storage Systems,” IBM Systems Journal 42(2):338-346, July 2003.

With the increase in the capacity of long-term storage, reduction of data as they are gathered is no longer needed. And although compression is still used in many kinds of data storage, that compression is often reversible, allowing the re-creation of the original data set. The ability to re-create the original data set is of great value, as it allows more sophisticated analysis of the data in the future. But it also allows the data to be analyzed for purposes other than those for which it was originally gathered, and allows the data to be aggregated with data gathered in other ways for additional analysis.
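The contrast between task-specific reduction and reversible compression can be illustrated with a minimal sketch. The sensor readings are invented for illustration, and zlib stands in for whatever lossless scheme a real system might use.

```python
import zlib
from statistics import mean

# Hypothetical raw readings gathered for one purpose (average temperature).
readings = [21.5, 21.7, 22.0, 35.2, 21.6]
raw = ",".join(str(r) for r in readings).encode("utf-8")

# Task-specific reduction: only the computed result is kept.
# The individual readings cannot be recovered from this single number.
summary = round(mean(readings), 2)

# Reversible (lossless) compression: the original bytes can be re-created exactly.
compressed = zlib.compress(raw)
restored = zlib.decompress(compressed)

# Because the raw data survive, a question nobody planned to ask can be
# answered later, for example "was there ever an unusual spike?"
restored_values = [float(x) for x in restored.decode("utf-8").split(",")]
spike = max(restored_values)

print("summary only:", summary)
print("restored exactly:", restored == raw)
print("largest reading found later:", spike)
```

The stored summary supports only the original task, whereas the restored data set supports new analyses, which is the source of both its value and the privacy concern.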

Additionally, forms of data that were previously considered too large to be stored for long periods of time can now easily be placed on next-generation storage devices. For example, high-quality video streams, which can take up megabytes of storage for each second of video, were once far too large to be stored for long periods; the most that was done was to store samples of the video streams on tape. Now it is possible to store large segments of real-time video footage on various forms of long-term storage, keeping recent video footage online on hard disks, and then archiving older footage on DVD storage.

Discarding or erasing stored information does not eliminate the possibility of compromising the privacy of the individuals whose information had been stored. A recent study has shown that a large number of disk drives available for sale on the secondary market contain easily obtainable information that was placed on the drive by the former owner. Included in the information found by the study was banking account information, information about prescription drug use, and college application information.³ Even when the previous owners of the disk drive had gone to some effort to erase the contents of the drive, it was in most cases fairly easy to repair the drive in such a way that the data that the drive had held were easily available. In fact, one of the conclusions of the study is that it is quite hard to really remove information from a modern disk drive; even when considerable effort has been put into removing the information, sophisticated “digital forensic” techniques can be used to re-create the data. From the privacy point of view, this means that once data have been gathered and committed to persistent storage, it is very difficult to ever be sure that the data have been removed or forgotten, a point very relevant to the archiving of materials in a digital age.

³	Simson L. Garfinkel and Abhi Shelat, “Remembrance of Data Past: A Study of Disk Sanitization Practices,” IEEE Security and Privacy 1(1):83-88, 2003.

With more data, including more kinds of data, being kept in its raw form, the concern arises that every electronic transaction a person ever enters into can be kept in readily available storage, and that audio and video footage of all of the public activities for that person could also be available. This information, originally gathered for purposes of commerce, public safety, health care, or for some other reason, could then be available for uses other than those originally intended. The fear is that the temptation to use all of this information, either by a governmental agency or by private corporations or even individuals, is so great that it will be nearly impossible to guarantee the privacy of anyone from some sort of prying eye, if not now then in the future.

The final hardware trend relevant to issues of personal privacy involves data-gathering devices. The evolution of these devices has moved them from the generating of analog data to the generation of data in digital form; from devices that were on specialized networks to those that are connected to larger networks; and from expensive, specialized devices that were deployed only in rare circumstances to cheap, ubiquitous devices either too small or too common to be generally noticed. Biometric devices, which sense physiological characteristics of individuals, also count as data-gathering devices. These sensors, from simple temperature and humidity sensors in buildings to the positioning systems in automobiles to video cameras used in public places to aid in security, continue to proliferate, showing the way to a world in which all of our physical environment is being watched and sensed by sets of eyes and other sensors. Box 3.1 provides a sampling of these sensing devices.

The ubiquitous connection of these sensors to the network is really a result of the transitive nature of connectivity. It is not in most cases the sensors themselves that are connected to the larger world. The standard sensor deployment has a group of sensors connected by a local (often specialized) network to a single computer. However, that computer is in turn connected to the larger network, either an intranet or the Internet itself. Because of this latter connection, the data generated by the sensors can be moved around the network like any other data once the computer to which the sensors are directly connected has received it.

BOX 3.1

A Sampling of Advanced Data-gathering Technologies

Pervasive sensors and new types of sensors (e.g., “smart dust”)
Infrared/thermal detectors
GPS/location information
Cell-phone-generated information
Radio-frequency identification tags
Chips embedded in people
Medical monitoring (e.g., implanted heart sensors)
Spycams and other remote cameras
Surveillance cameras in most public places
Automated homes with temperature, humidity, and power sensors
Traffic flow sensors
Camera/cell-phone combinations
Toys for children that incorporate surveillance technology (such as a stuffed animal that contains a nanny-cam)
Biometrics-based recognition systems (e.g., based on face recognition, fingerprints, voice prints, gait analysis, iris recognition, vein patterns, hand geometry)
Devices for remote reading of monitors and keyboards
Brain wave sensors
Smell sensors

However, it should also be noted that data-gathering technologies need not be advanced or electronic to be significant or important. Mail or telephone surveys, marketing studies, and health care information forms, sometimes coupled with optical scanning to convert manually created data into machine-readable form, also generate enormous amounts of personal and often sensitive information.

The final trend of note in sensing devices is their nearly ubiquitous proliferation. Video cameras are now a common feature of many public places; traffic sensors have become common; and temperature and humidity sensors (which can be used as sensors to detect humans) are in many modern office buildings. Cell phone networks gather position information for 911 calling, which could be used to track the locations of their users. Many automobiles contain GPS sensors, as part of either a navigation system or a driver aid system. As these devices become smaller and more pervasive, they become less noticeable, leading to the gathering of data in contexts where such gathering is neither expected nor noticed.

The proliferation of explicit sensors in our public environments has been a cause for alarm. There is also the growing realization that every computer used by a person is also a data-gathering device. Whenever a computer is used to access information or perform a transaction, information about the use or transaction can be (and often is) gathered and stored. This means that data can be gathered about far more people in far more circumstances than was possible 10 years ago. It also means that such information can be gathered about activities that intuitively appear to occur within the confines of the home, a place that has traditionally been a center of privacy-protected activities. As more and more interactions are mediated by computers, more and more data can be gathered about more and more activities.

The trend toward ubiquitous sensing devices has only begun, and it shows every sign of accelerating at an exponential rate similar to that seen in other parts of computing. New kinds of sensors, such as radio-frequency identification (RFID) tags or medical sensors allowing constant monitoring of human health, are being mandated by entities such as Walmart and the Department of Defense. Single-sensor surveillance may be replaced in the future with multiple-sensor surveillance. The economic and health benefits of some ubiquitous sensor deployments are significant. But the impact that those and other deployments will have in practice on individual privacy is hard to determine.

3.3
SOFTWARE ADVANCES

In addition to the dramatic and well-known advances in the hardware of computing, there have been significant advances in the software that runs on that hardware, especially in the area of data mining and information fusion/data integration techniques and algorithms. Owing partly to the new capabilities enabled by advances in the computing platform and partly to better understanding of the algorithms and techniques needed for analysis, the ability of software to analyze the information gathered and stored on computing machinery has made great strides in the past decade. In addition, new techniques in parallel and distributed computing have made it possible to couple large numbers of computers together to jointly solve problems that are beyond the scope of any single machine.

Although data mining is generally construed to encompass data searching, analysis, aggregation, and, for lack of a better term, archaeology, “data mining” in the strict sense of the term is the extraction of information implicit in data, usually in the form of previously unknown relationships among data elements. When the data sets involved are voluminous, automated processing is essential, and today computer-assisted data mining often uses machine learning, statistics, and visualization techniques to discover and present knowledge in a form that is easily comprehensible.

Information fusion is the process of merging/combining multiple sources of information in such a way that the resulting information is more accurate or reliable or robust as a basis for decision making than any single source of information would be. Information fusion often involves the use of statistical methods, such as Bayesian techniques and random effects modeling. Some information fusion approaches are implemented as artificial neural networks.

Both data mining and information fusion have important everyday applications. For example, by using data mining to analyze the patterns of an individual’s previous credit card transactions, a bank can determine whether a credit card transaction today is likely to be fraudulent. By combining results from different medical tests using information fusion techniques, physicians can infer the presence or absence of underlying disease with higher confidence than if the result of only one test were available.
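The medical-test example can be made concrete with a small Bayesian sketch. All numbers here (the prior and each test's sensitivity and specificity) are invented for illustration, and the calculation assumes the two tests are conditionally independent given the patient's true condition, which real tests may not be.

```python
def posterior(prior: float, sensitivity: float, specificity: float,
              positive: bool) -> float:
    """Bayes update of the probability of disease given one test result."""
    if positive:
        p_result_if_disease = sensitivity
        p_result_if_healthy = 1.0 - specificity
    else:
        p_result_if_disease = 1.0 - sensitivity
        p_result_if_healthy = specificity
    numerator = p_result_if_disease * prior
    return numerator / (numerator + p_result_if_healthy * (1.0 - prior))


# Hypothetical figures: 1% of patients have the disease.
prior = 0.01

# One positive result from test A (90% sensitive, 95% specific).
after_one = posterior(prior, 0.90, 0.95, positive=True)

# Fusing a second positive result from test B (85% sensitive, 90% specific).
after_two = posterior(after_one, 0.85, 0.90, positive=True)

print(f"after one test:  {after_one:.3f}")
print(f"after two tests: {after_two:.3f}")
```

With these assumed figures, a single positive result leaves the diagnosis quite uncertain, while the fused result is considerably more confident than either test alone.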

These techniques are also relevant to the work of government agencies. For example, the protection of public health is greatly facilitated by early warning of outbreaks of disease. Such warning may be available through data mining of the highly distributed records of first-line health care providers and pharmacies selling over-the-counter drugs. Unusually high buying patterns of such drugs (e.g., cold remedies) in a given locale might signal the previously undetected presence and even the approximate geographic location of an emerging epidemic threat (e.g., a flu outbreak). Responding to a public health crisis might be better facilitated with automated access to and screening analyses of patient information at clinics, hospitals, and pharmacies. Research on these systems is today in its infancy, and it remains to be seen whether such systems can provide reliable warning on the time scales needed by public health officials to respond effectively.
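One simple way to flag "unusually high buying patterns" is to compare today's sales in each locale with that locale's own recent history. The sketch below uses a basic z-score rule; the locale names, sales counts, and the threshold of 3 standard deviations are all assumptions for illustration, not a description of any deployed surveillance system.

```python
from statistics import mean, stdev


def unusual_locales(history, today, threshold=3.0):
    """Return locales whose sales today are far above their historical mean.

    history: dict mapping locale -> list of past daily sales counts
    today:   dict mapping locale -> today's sales count
    """
    flagged = []
    for locale, past in history.items():
        mu = mean(past)
        sigma = stdev(past)
        # Skip locales with no variation to avoid dividing by zero.
        if sigma > 0 and (today[locale] - mu) / sigma > threshold:
            flagged.append(locale)
    return flagged


# Hypothetical daily sales of cold remedies in two locales.
history = {
    "north": [40, 42, 38, 41, 39, 40, 43],
    "south": [55, 53, 57, 54, 56, 55, 52],
}
today = {"north": 41, "south": 90}

print(unusual_locales(history, today))
```

A rule this crude would produce false alarms in practice (holidays, promotions, seasonal effects), which is one reason the text cautions that it remains to be seen whether such systems can give reliable warning.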

Data-mining and information fusion technologies are also relevant to counterterrorism, crisis management, and law enforcement. Counterterrorism involves, among other things, the identification of terrorist operations before execution through analysis of signatures and database traces made during an operation’s planning stages. Intelligence agencies also need to pull together large amounts of information to identify the perpetrators of a terrorist attack. Responding to a natural disaster or terrorist attack requires the quick aggregation of large amounts of information in order to mobilize and organize first-responders and assess damage. Law enforcement must often identify perpetrators of crimes on the basis of highly fragmentary information—e.g., a suspect’s first name, a partial license number, and vehicle color.
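The way fragmentary clues narrow a search can be sketched as successive filters over a set of records. The record layout, the field names, and the sample data below are hypothetical; a real system would also have to handle misspellings and uncertain observations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VehicleRecord:
    owner_first_name: str
    plate: str
    color: str


def match(records, first_name=None, plate_fragment=None, color=None):
    """Return records consistent with whichever clues are supplied."""
    results = []
    for r in records:
        if first_name and r.owner_first_name.lower() != first_name.lower():
            continue
        if plate_fragment and plate_fragment.upper() not in r.plate.upper():
            continue
        if color and r.color.lower() != color.lower():
            continue
        results.append(r)
    return results


# Hypothetical registration records.
records = [
    VehicleRecord("Alex", "7KLM392", "blue"),
    VehicleRecord("Alex", "4ABC118", "red"),
    VehicleRecord("Sam", "2KLM771", "blue"),
    VehicleRecord("Jordan", "9XYZ555", "blue"),
]

by_color = match(records, color="blue")
by_color_and_plate = match(records, plate_fragment="KLM", color="blue")
all_three = match(records, first_name="alex", plate_fragment="KLM", color="blue")

print(len(by_color), len(by_color_and_plate), len(all_three))
```

Each clue is weak on its own, but in this toy data set the three together single out one record.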

In general, the ability to analyze large data sets can be used to discern statistical trends or to allow broad-based research in the social, economic, and biological sciences, which is a great boon to all of these fields. But the ability can also be used to facilitate target marketing, enable broad-based e-mail advertising campaigns, or (perhaps most troubling from a privacy perspective) discern the habits of targeted individuals.

The threats to privacy are more than just the enhanced ability to track an individual through a set of interactions and activities, although that by itself can be a cause for alarm. It is now possible to group people into smaller and smaller groups based on their preferences, habits, and activities. There is nothing that categorically rules out the possibility that in some cases, the size of the group can be made as small as one, thus identifying an individual based on some set of characteristics having to do with the activities of that individual.
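How a group can shrink to a single person as attributes are combined can be shown with a toy example. The attributes (zip, birth_year, hobby) and the five records are invented for illustration.

```python
def group_size(rows, target, attributes):
    """Count how many rows share the target's values on the given attributes."""
    return sum(
        all(row[a] == target[a] for a in attributes)
        for row in rows
    )


# Hypothetical records containing no names at all.
people = [
    {"zip": "02139", "birth_year": 1980, "hobby": "chess"},
    {"zip": "02139", "birth_year": 1980, "hobby": "sailing"},
    {"zip": "02139", "birth_year": 1975, "hobby": "chess"},
    {"zip": "02140", "birth_year": 1980, "hobby": "chess"},
    {"zip": "02139", "birth_year": 1980, "hobby": "chess"},
]
target = people[1]

for attrs in (["zip"], ["zip", "birth_year"], ["zip", "birth_year", "hobby"]):
    print(attrs, "->", group_size(people, target, attrs))
```

In this sketch, each added attribute shrinks the group containing the target until only one record remains, even though no attribute is identifying by itself.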

Furthermore, data used for this purpose may have been gathered for other, completely different reasons. For example, cell phone companies must track the locations of cell phones on their network in order to determine the tower responsible for servicing any individual cell phone. But these data can be used to trace the location of cell-phone owners over time.⁴ Temperature and humidity sensors used to monitor the environment of a building can generate data that indicate the presence of people in particular rooms. The information accumulated in a single database for one reason can easily be used for other purposes, and the information accumulated in a variety of databases can be aggregated to allow the discovery of information about an individual that would be impossible to find out given only the information in any single one of those databases.
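The aggregation point can be illustrated by linking two data sets that are individually innocuous. In this sketch, building-sensor occupancy data say nothing about who was present, and a room-booking calendar says nothing about whether anyone actually showed up; the field names and records are hypothetical.

```python
def link(occupancy, bookings):
    """Join sensor occupancy with room bookings on (room, hour)."""
    booked_by = {(b["room"], b["hour"]): b["booked_by"] for b in bookings}
    return [
        (booked_by[(o["room"], o["hour"])], o["room"], o["hour"])
        for o in occupancy
        if o["occupied"] and (o["room"], o["hour"]) in booked_by
    ]


# Hypothetical data gathered to control heating and cooling.
occupancy = [
    {"room": "210", "hour": 9, "occupied": True},
    {"room": "210", "hour": 10, "occupied": False},
    {"room": "305", "hour": 9, "occupied": True},
]

# Hypothetical data gathered to schedule meeting rooms.
bookings = [
    {"room": "210", "hour": 9, "booked_by": "employee_17"},
    {"room": "210", "hour": 10, "booked_by": "employee_17"},
    {"room": "305", "hour": 14, "booked_by": "employee_04"},
]

print(link(occupancy, bookings))
```

Neither source alone indicates where a given person was, but the join suggests that one employee was probably in a particular room at a particular hour and absent from a later booking.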

The end result of the improvements in both the speed of computational hardware and the efficiency of the software that is run on that hardware is that tasks that were unthinkable only a short time ago are now possible on low-cost, commodity hardware running commercially available software. Some of these new tasks involve the extraction of information about the individual from data gathered from a variety of sources. A concern from the privacy point of view is that—given the extent of the ability to aggregate, correlate, and extract new information from seemingly innocuous information—it is now difficult to know what activities will in fact compromise the privacy of an individual.

3.4
INCREASED CONNECTIVITY AND UBIQUITY

The trends toward increasingly capable hardware and software and increased capacities of individual computers to store and analyze information are additive; the ability to store more information pairs with the increased ability to analyze that information. When combined with these two, a third technology trend, the trend toward increased connectivity in the digital world, has a multiplicative effect.

⁴	Matt Richtel, “Tracking of Mobile Phones Prompts Court Fights on Privacy,” New York Times, December 10, 2005, p. A1.

The growth of network connectivity—obvious over the past decade in the World Wide Web’s expansion from a mechanism by which physicists could share information to a global phenomenon, used by millions to do everything from researching term papers to ordering books—can be traced back to the early days of local area networks and the origin of the Internet: Growth in the number of nodes on the Internet has been exponential over a period that began roughly in 1980 and continues to this day.⁵ Once stand-alone devices connected with each other through the use of floppy disks or dedicated telephone lines, computers are now networked devices that are (nearly) constantly connected to each other.

A computer that is connected to a network is not limited by its own processor, software, and storage capacity, and instead can potentially make use of the computational power of the other machines connected to that network and the data stored on those other computers. The additional power is characterized by Metcalfe’s law, which states that the power of a network of computers increases in proportion to the number of pair-wise connections that the network enables.⁶
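The number of pair-wise connections among n machines is n(n - 1)/2, so it grows roughly with the square of the network's size. A brief sketch, with the function name chosen here for illustration:

```python
def pairwise_connections(nodes: int) -> int:
    """Number of distinct pairs among `nodes` machines: n * (n - 1) / 2."""
    return nodes * (nodes - 1) // 2


for n in (10, 100, 1000):
    print(f"{n:>5} nodes -> {pairwise_connections(n):>7} possible pair-wise connections")
```

A tenfold increase in nodes yields roughly a hundredfold increase in possible connections, which is the growth that Metcalfe's law points to.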

A result of connectivity is the ability to access information stored or gathered at a particular place without having physical access to that place. It is no longer necessary to be able to actually touch a machine to use that machine to gather information or to gain access to any information stored on the machine. Controlling access to a physical resource is a familiar concept for which we have well-developed intuitions, institutions, and mechanisms that allow us to judge the propriety of access and to control that access. These intuitions, institutions, and mechanisms are much less well developed in the case of networked access.

The increased connectivity of computing devices has also resulted in a radical decrease in the transaction costs for accessing information. This has had a significant impact on the question of what should be considered a public record, and how those public records should be made available. Much of the information gathered by governments at various levels is considered public record. Traditionally, the costs (both in monetary terms and in terms of costs of time and human aggravation) to access such public records have been high. To look at the real estate transactions for a local area, for example, required physically going to the local authority that stored those records, filling out the forms needed for access, and then viewing the records at the courthouse, tax office, or other government office. When these records are made available through the World Wide Web, the transaction costs of accessing those records are effectively zero, making it far easier for the casual observer to view such records.

⁵	Raymond Kurzweil, The Singularity Is Near, Viking Press, 2005, pp. 78-81.
⁶	See B. Metcalfe, “Metcalfe’s Law: A Network Becomes More Valuable as It Reaches More Users,” Infoworld, Oct. 2, 1995. See also the May 6, 1996, column at http://www.infoworld.com/cgi-bin/displayNew.pl?/metcalfe/bm050696.html. The validity of Metcalfe’s law is based on the assumption that every connection in a network is equally valuable. However, in practice it is known that in many networks, certain nodes are much more valuable than others, a point suggesting that the value may increase less rapidly than in proportion to the number of possible pair-wise connections.

Connectivity is also relevant to privacy on a scale smaller than that of the entire Internet. Corporate and government intranets allow the connection and sharing of information between the computers of a particular organization. The purpose of such intranets is often to share information between computers (as opposed to between the users of those computers). Such sharing is a first step toward the aggregation of various data repositories, in which information collected for a variety of reasons is combined so that the larger (and richer) data set can be analyzed in an attempt to extract new forms of information.

Along with the increasing connectivity provided by networking, the networks themselves are becoming more capable as a mechanism for sharing data. Bandwidth, the measure of how much data can be transferred over the network in a given time, has been increasing dramatically. New network technologies allowing some filtering and analyzing of data as it flows through the network are being introduced. Projects such as SETI@home⁷ and technologies like grid computing⁸ are trying to find ways of utilizing the connectivity of computers to allow even greater computational levels.

From the privacy point of view, interconnectivity seems to promise a world in which any information can be accessed from anywhere at any time, along with the computational capabilities to analyze the data in any way imaginable. This interconnectivity seems to mean that it is no longer necessary to actually have data on an individual on a local computer; the data can be found somewhere on another computer that is connected to the local computer, and with the seemingly unlimited computing ability of the network of interconnected machines, finding and making use of that information is no longer a problem.

⁷	Available at http://setiathome.ssl.berkeley.edu/.
⁸	Available at http://www.gridforum.org/.

Ubiquitous connectivity has also given impetus to the development of digital rights management technologies (DRMTs), which are a response to the fact that when reduced to digital form, text, images, sounds, and other forms of content can be copied freely and perfectly. DRMTs harness the power of the computer and the network to enforce predefined limits on the possible distribution and use of a protected work. These predefined limits can be very fine-grained. They can include:

Limits on the number of times that protected information is viewed or on the extent to which protected information can be altered;
Selective permissions that allow protected information to be read but not copied, printed, or passed along electronically except to selected parties;
Selective access to certain portions of protected information and denial of access to others;
Tracking of the parties that receive protected information and what they did with it (e.g., what they read, when they read it, how many times they read it, how long they spent reading it, to whom they forwarded it); and
Enforcement of time windows during which certain access privileges are available.
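
To make these kinds of limits concrete, the following is a minimal sketch of how a view limit, selective permissions, a time window, and tracking might be combined in one policy check. It does not describe any actual DRM product; the class `UsagePolicy`, its fields, and the user name are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class UsagePolicy:
    """Hypothetical usage limits attached to one protected work."""
    max_views: int                 # limit on the number of viewings
    allowed_actions: frozenset     # e.g. {"view"} permits reading but not copying
    not_before: datetime           # start of the access time window
    not_after: datetime            # end of the access time window
    views_used: int = 0
    audit_log: list = field(default_factory=list)

    def request(self, user: str, action: str, when: datetime) -> bool:
        """Decide whether the action is allowed, and record the attempt either way."""
        in_window = self.not_before <= when <= self.not_after
        permitted = action in self.allowed_actions
        under_limit = action != "view" or self.views_used < self.max_views
        granted = in_window and permitted and under_limit
        if granted and action == "view":
            self.views_used += 1
        # Tracking: every request is logged, whether or not it was granted.
        self.audit_log.append((user, action, when, granted))
        return granted


policy = UsagePolicy(
    max_views=2,
    allowed_actions=frozenset({"view"}),
    not_before=datetime(2007, 1, 1),
    not_after=datetime(2007, 12, 31),
)
when = datetime(2007, 6, 1)
results = [
    policy.request("reader_1", "view", when),
    policy.request("reader_1", "view", when),
    policy.request("reader_1", "view", when),  # exceeds max_views
    policy.request("reader_1", "copy", when),  # action not permitted
]
print(results)
print(len(policy.audit_log))
```

In this sketch the audit log grows with every request, which is the same mechanism that lets a content owner accumulate detailed records of what a user read and when.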

DRMTs are a particularly interesting technological development with a potential impact on privacy. Originally developed with the intent of enhancing privacy, their primary application to date has been to protect intellectual property rights. But some applications of DRMTs can also detract from privacy. For example, DRMTs offer the potential for institutional content owners to collect highly detailed information on user behavior regarding the texts they read and the music they listen to, a point discussed further in Section 6.7. And in some instances, they have the potential to create security vulnerabilities in the systems on which they run, exploitation of which might lead to security breaches and the consequent compromise of personal information stored on those systems.⁹

On the other hand, DRMTs can, in principle, be used by private individuals to exert greater control over the content that they create. An individual could set permissions on his or her digital document so that only certain other individuals could read it, or could make a copy of it, and so on. Although this capability is not widely available today for individuals to use, some document management systems are beginning to incorporate such features.

⁹	An example is the recent Sony DRM episode, during which Sony BMG Music Entertainment surreptitiously distributed software on audio compact disks that was automatically installed on any computer that played the CD. This software was intended to block the copying of the CD, but it had the unintended side effect of opening security vulnerabilities that could be exploited by other malicious software such as worms or viruses.

3.5
TECHNOLOGIES COMBINED INTO A DATA-GATHERING SYSTEM

Each of the technology trends discussed above can be seen individually as having the potential to threaten the privacy of the individual. Combined into an overall system, however, such technologies seem to pose a far greater threat to privacy. The existence of ubiquitous sensors, generating digital data and networked to computers, raises the prospect of data generated for much of what individuals do in the physical world. Increased use of networked computers, which are themselves a form of activity sensor, allows the possibility of a similar tracking of activities in the electronic world. And increased and inexpensive data storage capabilities support retention of data by default.

Once stored, data are potentially available for analysis by any computer connected via a network to that storage. Networked computers can share any information that they have, and can aggregate information held by them separately. Thus it is possible not only to see all of the information gathered about an individual, but also to aggregate the information gathered in various places on the network into a larger view of the activities of that individual. Such correlations create yet more data on an individual that can be stored in the overall system, shared with others on the network, and correlated with the sensor data that are being received.
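
The aggregation step described above can be sketched in a few lines. This is an illustrative sketch only: the record layouts, the shared `person_id` key, and the sample values are invented, and real record linkage across separately held databases is usually harder because a common identifier is not always available.

```python
from collections import defaultdict


def aggregate(*sources):
    """Merge records from separately held sources into one profile per person."""
    profiles = defaultdict(dict)
    for source in sources:
        for record in source:
            person = record["person_id"]
            for key, value in record.items():
                if key != "person_id":
                    profiles[person].setdefault(key, []).append(value)
    return dict(profiles)


# Two hypothetical data sets, each collected for its own purpose.
purchases = [{"person_id": "p1", "purchase": "book"}]
door_sensor = [{"person_id": "p1", "entered_building": "09:02"}]

profile = aggregate(purchases, door_sensor)
print(profile["p1"])
```

Neither source alone links the purchase to the building entry; the combined profile does, and that new linkage is itself data that can be stored and shared.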

Finally, the seemingly unlimited computing power promised by networked computers would appear to allow any kind of analysis of the data concerning the individual to be done thoroughly and quickly. Patterns of behavior, correlations between actions taken in the electronic and the physical world, and correlations between data gathered about one individual and that gathered about another are capable, in principle, of being found, reported, and used to create further data about the individual being examined. Even if such analysis is impractical today, the data will continue to be stored, and advances in hardware and software technology may appear that allow the analysis to be done in the future.

At the very least, these technology trends (in computation, sensors, storage technology, and networking) change the rules that have governed surveillance. It is the integration of both hard and soft technologies of surveillance and analysis into networks and systems that underlies the evolution of what might be called traditional surveillance to the “new” surveillance.¹⁰ Compared to traditional surveillance, the new surveillance is less visible and more continuous in time and space, provides fewer opportunities for targets to object to or prevent the surveillance, is greater in analytical power, produces data that are more enduring, is disseminated faster and more widely, and is less expensive. (Table 3.1 presents some examples.) Essentially all of these changes represent additional surveillance capabilities for lower cost, and exploitation of these changes would bode ill for the protection of privacy.

¹⁰	The term originates with Gary Marx, “What’s New About the ‘New’ Surveillance?,” Surveillance & Society 1(1):9-29, 2002; and Gary Marx, “Soft Surveillance: The Growth of Mandatory Volunteerism in Collecting Personal Information,” in T. Monahan, ed., Surveillance and Security: Technological Politics and Power in Everyday Life, Routledge, 2006.

3.6
DATA SEARCH COMPANIES

All of the advances in information technology for data aggregation and analysis have led to the emergence of companies that take the raw technology discussed above and combine it into systems that allow them to offer directly to their customers a capability for access to vast amounts of information. Search engine services, such as those provided by Google, Yahoo!, and MSN, harness the capabilities of thousands of computers, joined together in a network, that when combined give huge amounts of storage and vast computational facilities. Such companies have linked these machines with a software infrastructure that allows the finding and indexing of material on the World Wide Web.

The end result is a service that is used by millions every day. Rather than requiring that a person know the location of information on the World Wide Web (via, for example, a uniform resource locator (URL), such as www.cstb.org), a search engine enables the user to find that information by describing it, typically by typing a few keywords that might be associated with that information. Using sophisticated algorithms that are the intellectual property of the company, the search engine returns links to locations where that information can be found. This functionality, undreamed of a decade ago, has revolutionized the way that the World Wide Web is used. Further, these searches can often be conducted for free, as many search companies make money by selling advertising that is displayed along with the search results to the users of the service.

While it is hard to imagine using the Web without search services, their availability has brought up privacy concerns. Using a search engine to assemble information about an individual has become common practice (so common that the term “to Google” has entered the language). When the Web newspaper, cnet.com, published personal information about the president of Google that had been obtained by using the Google service, Google charged Cnet with publishing private information and announced that it would not publicly speak to Cnet for a year in retribution.¹¹ This is an interesting case, because the information that was obtained was accessible through the Web to anyone, but would have been difficult to find without the services offered by Google. Whereas in this case privacy could perhaps have been maintained because of the difficulty of simply finding the available information, the Google service made it easy to find the information, and made it available for free.

¹¹	Carolyn Said, “Google Says Cnet Went Too Far in Googling,” San Francisco Gate, August 9, 2005, available at http://sfgate.com/cgi-bin/article.cgi?file=/c/a/2005/08/09/GOOGLE.TMP&type=business.

TABLE 3.1 The Evolution of Surveillance over Time

Dimension	Traditional Surveillance	The New Surveillance
Relation to senses	Unaided senses	Extends senses
Visibility of the actual data collection, who does it, where, and on whose behalf	Visible	Reduced visibility, or invisible
Consent	Lower proportion involuntary	Higher proportion involuntary
Cost	Expensive per unit of data	Inexpensive per unit of data
Location of data collectors and analyzer	On scene	Remote
Ethos	Harder (more coercive)	Softer (less coercive)
Integration	Data collection as separate activity	Data collection folded into routine activity
Data collector	Human, animal	Machine (wholly or partly automated)
Where data reside	With the collector, stays local	With third parties, often migrates
Timing of data collection	Single point or intermittent	Continuous (omnipresent)
Time period of collection	Present	Past, present, future
Availability of data	Frequent time lags	Real-time availability
Availability of data collection technology	Disproportionately available to elites	More democratized, some forms widely available
Object of data collection	Individual	Individual, categories of interest
Comprehensiveness	Single measure	Multiple measures
Context	Contextual	Acontextual
Depth	Less intensive	More intensive
Breadth	Less extensive	More extensive
Ratio of knowledge of observed to observer	Higher (what the surveillant knows the subject probably knows as well)	Lower (surveillant knows things that the subject doesn’t)
Identifiability of object of surveillance	Emphasis on known individuals	Emphasis also on anonymous individuals, individual masses
Realism	Direct representation	Direct and simulation
Form	Single medium (likely narrative or numerical)	Multiple media (including video and/or audio)
Who collects data?	Specialists	Specialists, role dispersal, self-monitoring
Analysis of data	More difficult to organize, store, retrieve, analyze	Easier to store, retrieve, analyze
Ease of merging data	Discrete non-combinable	Easy to combine
Communication of data	More difficult to send, receive	Easier to send, receive
SOURCE: G.T. Marx, “What’s New About the New Surveillance?,” Surveillance & Society 1(1):9-29, 2002, available at www.surveillance-and-society.org/articles1/whatsnew.pdf.

A second privacy concern arises regarding the information that search engine companies collect and store about specific searches performed by users. To service a user’s search request, the specific search terms need not be kept for longer than it takes to return the results of that search. But search engine companies keep that information anyway for a variety of purposes, including marketing and enhancement of search services provided to users.

The potential for privacy-invasive uses of such information was brought into full public view in a request in early 2006 by the Department of Justice (DOJ) for search data from four search engines, including search terms queried and Web site addresses, or URLs, stored in each search engine’s index but excluding any user identifying information that could link a search string back to an individual. The intended DOJ use of the data was not to investigate a particular crime but to study the prevalence of pornographic material on the Web and to evaluate the effectiveness of software filters to block those materials in a case testing the constitutionality of the Child Online Protection Act (COPA).¹²

The four search engines were those associated with America Online, Microsoft, Yahoo!, and Google. The first three of these companies each agreed to provide at least some of the requested search data. Google resisted the original subpoena demanding this information; subsequently, the information sought was narrowed significantly in volume and character, and Google was ultimately ordered by a U.S. District Court to provide a much more restricted set of data.¹³ Although the data requested did not include personally identifiable information of users, this case has raised a number of privacy concerns about possible disclosures in the future of the increasing volumes of user-generated search information.

¹²	Attorney for Alberto R. Gonzales, McElvain Declaration in Gonzales v. Google, Inc. (subpoena request), available at http://i.i.com.com/cnwk.1d/pdf/ne/2006/google-doj/notice.of.stark.declaration.pdf.
¹³	Antone Gonsalves, “Judge Hands Google Mixed Ruling on Search Privacy,” Internet Week, March 17, 2006, available at http://Internetweek.cmp.com/showArticle.jhtml?articleID=183700724. The findings based on the search data were to serve as part of the government’s case in defending the constitutionality of the Child Online Protection Act, a law aimed at protecting minors from adult material online.

Google objected to the original request for a variety of reasons. Google asserted a general interest in protecting its users’ privacy and anonymity.¹⁴ Additionally, Google believed the original request was overly broad, as it included all search queries entered into the search engine during a 2-month period and all URLs in Google’s index. In negotiations with the DOJ, the data request was reduced to a sampling of 50,000 URLs and 5,000 search terms.¹⁵

In considering only the DOJ’s modified request, the court decided to further limit the data released to URLs only, excluding search terms. Among the privacy implications considered in this ruling was the recognition that personally identifying information, although not requested, might be available in the text of searches performed (e.g., searches to see whether personal information such as Social Security numbers or credit card information is on the Internet, or searches to check what information is associated with one’s own name, so-called vanity searches). The court also acknowledged the possibility of the information being shared with other government authorities if text strings raised national security issues (e.g., “bomb placement white house”).¹⁶

Although this case was seen as a partial victory for Google and for the privacy of its users, the court as well as others acknowledged that the case could have broader implications. Though outside its ruling, the court could foresee the possibility of future requests to Google, particularly if the narrow collection of data used in the DOJ’s study was challenged in the COPA case.¹⁷ However, others have suggested that this case underscores the larger problem of how to protect Internet user privacy, particularly as more user-generated information is being collected and stored for unspecified periods of time, which makes it increasingly vulnerable to subpoena.¹⁸

¹⁴	Declan McCullagh, “Google to Feds: Back Off,” CNET News.com, February 17, 2006, available at http://news.com.com/Google+to+feds+Back+off/2100-1030_3-6041113.html?tag=nl.
¹⁵	Order Granting in Part and Denying in Part Motion to Compel Compliance with Subpoena Duces Tecum, United States District Court for the Northern District of California, San Jose Division, Court Ruling No. CV 06-8006MISC JW, p. 4, available at http://www.google.com/press/images/ruling_20060317.pdf.
¹⁶	United States District Court for the Northern District of California, San Jose Division, Court Ruling, pp. 18-19.
¹⁷	United States District Court for the Northern District of California, San Jose Division, Court Ruling, p. 15.
¹⁸	Thomas Claburn, “Google’s Privacy Win Could Be Pyrrhic Victory,” InformationWeek, March 22, 2006, available at http://www.informationweek.com/showArticle.jhtml;jsessionid=LMWTORMPFH2B4QSNDBCSKH0CJUMEKJVN?articleID=183701628.

Many of the concerns about compromising user privacy were illustrated graphically when, in August 2006, AOL published on the Web a list of 20 million Web search inquiries made by 658,000 users over a 3-month period.¹⁹ AOL sought to anonymize users by substituting a code number for their login names, but the list of inquiries sorted by code number shows the topics in which a person was interested over many different searches. A few days later, AOL took down the 439-megabyte file after many complaints were received that the file violated user privacy. AOL acknowledged that the publication of the data was a violation of its own internal policies and issued a strongly worded apology. Some users were subsequently identified by name.²⁰
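
The weakness of this style of anonymization can be seen in a minimal sketch. Replacing each login name with a stable code number hides the name but keeps every query by the same person linked together. The log entries and function names below are invented for illustration and are not drawn from the released data.

```python
from collections import defaultdict
from itertools import count


def pseudonymize(log):
    """Replace each login name with a stable code number (first seen gets 1, and so on)."""
    codes = {}
    next_code = count(1)
    released = []
    for login, query in log:
        if login not in codes:
            codes[login] = next(next_code)
        released.append((codes[login], query))
    return released


def group_by_code(released):
    """Sort the released queries by code number, as any recipient of the file could."""
    grouped = defaultdict(list)
    for code, query in released:
        grouped[code].append(query)
    return dict(grouped)


# Hypothetical search log: (login name, query text).
log = [
    ("user_a", "landscapers in my town"),
    ("user_b", "weather"),
    ("user_a", "my own full name"),
]
released = pseudonymize(log)
print(group_by_code(released))
```

No login name appears in the output, yet the queries grouped under one code number can jointly point to a single person, particularly when one of them is a vanity search.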

A related kind of IT-enabled company—the data aggregation company—is discussed further in Chapter 6.

3.7
BIOLOGICAL AND OTHER SENSING TECHNOLOGIES

The technology trends outlined thus far in this chapter are all well established, and technologies that follow these trends are deployed in actual systems. There is an additional trend, only now in its beginning stages, that promises to extend the sensing capabilities beyond those that are possible with the kinds of sensors available today. These are biological sensing technologies, including such things as biometric identification schemes and DNA analysis.

Biometric technologies use particular biological markers to identify individuals. Fingerprinting for identification is well known and well established, but interest in other forms of biometric identification is high. Technologies using identifying features as varied as retinal patterns, walking gait, and facial characteristics are all under development and show various levels of promise. Many of these biometric technologies differ from the more standard and currently used biometric identification schemes in two ways: first, these technologies promise to allow the near-real-time identification of an individual from a distance and in a way that is non-invasive and, perhaps, incapable of being detected by the subject being identified; second, some of these mechanisms facilitate automated identification that can be done solely by the computer without the aid of a human being. Such identification could be done more cheaply and far more rapidly than human-mediated forms of identification.

¹⁹	Saul Hansell, “AOL Removes Search Data on Group of Web Users,” New York Times, August 8, 2006.
²⁰	Michael Barbaro and Tom Zeller, Jr., “A Face Is Exposed for AOL Searcher No. 4417749,” New York Times, August 9, 2006.

Joined into a computing system like those discussed above, such identification mechanisms offer a potential for tracing all of the activities of an individual. Whereas video camera surveillance now requires human watchers, automated face-identification systems could allow the logging of a person’s location, which could then be cross-referenced with other information gathered about that individual, all without the knowledge of the person being tracked. Such capabilities raise the prospect of a society in which everyone can be automatically tracked at all times.

In addition to these forms of biometric identification is the technology associated with the mapping and identification of human DNA. The mapping of the human genome is one of the great scientific achievements of the past decade, and work is ongoing in understanding the phenotypic implications of variations at specific sites within the genome. A full understanding of these relationships would allow use of a DNA sample to obtain information not only about the individual from whom the DNA was taken, but also about individuals genetically related to that individual. Just what information will be revealed by our DNA is currently unknown, but there is speculation that it might indicate everything from genetic predisposition to certain kinds of disease to behavioral patterns that can be expected in an individual. Much of this information is now considered to be private, but if it becomes easily accessible from our own DNA or from the DNA of our relatives, issues will arise as to how that information should be treated or even who the subject of the information really is.

3.8
PRIVACY-ENHANCING TECHNOLOGIES

Although much of the discussion above concerns advances in technology that potentially threaten privacy, technology is not inherently destructive of privacy. Technology developments can help with limiting access to or control of information about people. These fall into two categories: those that can be used by the individual whose privacy is being enhanced, and those that can be used by an information collector who wishes to protect the privacy of the individuals about whom information is being collected.²¹

3.8.1
Privacy-enhancing Technologies for Use by Individuals

²¹	A useful reference, and one on which much in this section is based, is Lorrie Faith Cranor, The Role of Privacy Enhancing Technologies, AT&T Labs Research, available at http://www.cdt.org/privacy/ccp/roleoftechnology1.shtml.

One cluster of technologies allows individuals to make basic data unavailable through the use of cryptography. Data about a person is made private by encrypting that data in such a way that the data cannot be decrypted without the consent and active participation of the person who provides the decryption key; this is known as the confidentiality application of cryptography. Despite considerable work on decryption technologies, the cost to decrypt an encrypted data set without access to the decryption keys can currently be made almost arbitrarily high in comparison to the cost of encrypting the data set.

However, such technologies are not universally accepted as an appropriate approach to the problem of protecting privacy. In particular, they allow the hiding of data from everyone, including those who some feel should be able to see the data, such as law enforcement agencies or intelligence-gathering branches of the government. In addition, they make it impossible for the data to be accessed in cases when the owner of the data is unable to participate. It would be impossible, for example, for emergency medical personnel to gain access to protected medical information if the subject of the records (and holder of the decryption key) were unconscious.

Other privacy-enhancing technologies that are usable by individuals facilitate anonymization in certain contexts. For example, some anonymization technologies allow e-mail or Web surfing that is anonymous to Internet eavesdroppers for all practical purposes. These technologies can also exploit national boundaries to increase the difficulty of breaking the anonymity they offer—identifying information stored on a server located in Country A may be difficult or impossible for authorities in Country B to obtain because of differing legal standards or the lack of a political agreement between the two nations. This same technology, however, can also hide the identity of those who use the networks to threaten or libel other members of the network community. Although it can facilitate privacy, anonymity can also help to defeat accountability. Since law enforcement is based on the notion of individual accountability, law enforcement pressures to restrict the use of anonymizing technologies are not unexpected. Anti-spyware technologies stem the flow of personal information related to one’s computer and Internet usage practices to other parties, thereby enhancing privacy.

Another category of privacy-enhancing technologies includes those that assist users in avoiding spam e-mail, that prevent spyware programs from being installed, or that alert individuals that they might be the subject of “phishing” attacks.²² Anti-spam technologies promote the privacy of those who believe that being left alone is an element of privacy. Phish-alerting technologies enhance privacy by warning the individual that he or she may be about to divulge important personal information to a party that should not be getting it.

²²	“Phishing” is the act of fooling someone into providing confidential information or into doing something under false pretenses. A common phishing attack is for an attacker to send an e-mail to someone falsely claiming to be a legitimate enterprise in an attempt to trick the user into providing private information (e.g., a login name and password for his bank account) that can be used by the attacker to impersonate the victim.

None of the technologies above are focused on privacy per se, which relates to the protection of personal information. For example, encryption provides confidentiality of stored information or information sent over a network—any information, not just personal information. Anonymizing technologies protect only a subset of personal information—personal information that can be used to identify an individual.

3.8.2
Privacy-enhancing Technologies for Use by Information Collectors

Privacy-enhancing tools that can be used by information collectors include anonymization techniques that can help to protect privacy in certain applications of data mining.

3.8.2.1
Query Control

Teresa Lunt has undertaken some work in the development of a privacy appliance²³ that is based on a heuristic approach to query control and can be viewed as a firewall that is placed in between databases containing personal information and those querying those databases. Programmed by a party other than the querying party, the appliance is intended to prevent:

Direct disclosures (e.g., by withholding from query results data such as names, Social Security numbers, credit card numbers, addresses, and phone numbers);
Inferences of identity based on the combined results of multiple queries. This requires the maintenance of a log of earlier queries and a determination of whether any given query can yield an inference of identity; if so, the appliance is intended to prevent that query result from being returned.
Access to sensitive statistics. If a statistic will reveal information about an individual or if sensitive information can be inferred from a statistical summary, the appliance should block access to that statistic (if, for example, a statistical query is computed over too few records).
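
Two of these protections can be sketched briefly: withholding direct identifiers from query results, and blocking a statistic computed over too few records. This is a sketch under stated assumptions and not the design of the privacy appliance itself. The field names, the threshold `MIN_GROUP_SIZE`, and the sample rows (including the placeholder Social Security numbers) are hypothetical, and the query log needed to detect inferences across multiple queries is deliberately left out.

```python
# Hypothetical set of fields treated as direct identifiers.
DIRECT_IDENTIFIERS = {"name", "ssn", "credit_card", "address", "phone"}
# Hypothetical minimum number of records a statistic must cover.
MIN_GROUP_SIZE = 5


def select_rows(rows, predicate):
    """Return matching rows with direct identifiers withheld."""
    return [
        {key: value for key, value in row.items() if key not in DIRECT_IDENTIFIERS}
        for row in rows
        if predicate(row)
    ]


def average(rows, predicate, column):
    """Return the mean of a column, or None when too few records match."""
    matching = [row[column] for row in rows if predicate(row)]
    if len(matching) < MIN_GROUP_SIZE:
        return None  # blocked: the statistic would describe too few individuals
    return sum(matching) / len(matching)


# Invented sample data: six records in one ZIP code, one record in another.
rows = [
    {
        "name": f"person{i}",
        "ssn": f"000-00-{i:04d}",
        "zip": "20001" if i < 6 else "20002",
        "income": 40000 + 1000 * i,
    }
    for i in range(7)
]

print(average(rows, lambda r: r["zip"] == "20001", "income"))
print(average(rows, lambda r: r["zip"] == "20002", "income"))
print(select_rows(rows, lambda r: r["zip"] == "20002"))
```

The last query still returns a single row of non-identifying fields, which hints at why stripping identifiers alone is not enough and why inference control over combined queries is listed as a separate requirement.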

In those instances where identifying information must be obtained (e.g., in order to identify the would-be perpetrator of a terrorist event), individuals with proper authorization such as a court order could be granted privileges to override the blocking by the appliance.

²³	Privacy Appliance, Xerox PARC, information available at http://www.parc.com/research/projects/privacyappliance/.
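
The kinds of blocking described for the privacy appliance can be illustrated with a minimal sketch. It is not the PARC design, whose internals are not described here. The class name QueryGate, the list of identifier fields, and the threshold of five records are all assumptions made for illustration. The sketch withholds direct identifiers, refuses statistics computed over too few records, and keeps a query log. It does not attempt the much harder analysis of that log for inferences of identity.

```python
from dataclasses import dataclass, field

# Hypothetical field names and threshold, chosen only for illustration.
DIRECT_IDENTIFIERS = {"name", "ssn", "credit_card", "address", "phone"}
MIN_GROUP_SIZE = 5  # assumed cutoff for "too few records"


@dataclass
class QueryGate:
    """Sits between a set of records and the party querying them."""
    records: list
    log: list = field(default_factory=list)  # earlier queries, kept for later analysis

    def select(self, columns, predicate, override=False):
        """Return matching rows, withholding direct identifiers unless overridden."""
        self.log.append(("select", tuple(columns)))
        allowed = [c for c in columns if override or c not in DIRECT_IDENTIFIERS]
        return [{c: r[c] for c in allowed} for r in self.records if predicate(r)]

    def mean(self, column, predicate):
        """Return a mean, or None when it would be computed over too few records."""
        self.log.append(("mean", column))
        values = [r[column] for r in self.records if predicate(r)]
        if len(values) < MIN_GROUP_SIZE:
            return None  # blocked: the statistic could reveal an individual
        return sum(values) / len(values)
```

The override flag stands in for the authorized access (e.g., under a court order) mentioned above; a real system would need to authenticate and record who invoked it.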

The query control approach draws from the broader literature on approaches to privacy protection and disclosure limitation. Nonetheless, it also poses some unresolved issues for which further research is needed.

A lesson from the literature on the statistics of disclosure limitation is that privacy protection in the form of “safe releases” from separate databases does not guarantee privacy protection for information in a merged database.²⁴ It is not known how strongly this lesson applies to the query control approach, especially given the fact that the literature addresses aggregate data, whereas questions of privacy often involve identification of individuals.
The query control approach assumes that it is possible to analyze a log of previous queries to determine if any given query can yield an inference of identity. While this result is clearly possible when the previous queries are simple and relatively few, the feasibility of such analysis with a large number of complex queries is at present not known.
Still to be determined are the scope and the nature of analyses that can be undertaken with a privacy appliance in place. Because the k-anonymity concept on which the appliance is based relies on reporting sums or averages rather than individual data fields, there is some degradation of data. Whether and in what contexts that degradation matters operationally is not known.
Some attacks on k-anonymity are known to succeed under certain circumstances by taking advantage of background knowledge or database homogeneity.²⁵ Background knowledge is externally provided knowledge about the variables in question (e.g., the statistical likelihood of their occurrence). Database homogeneity refers to the situation in which there is insufficient diversity in the values of sensitive attributes. Techniques have been proposed that reduce such difficulties,²⁶ but their compatibility with the query control approach of the privacy appliance remains to be seen.
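
The homogeneity problem can be made concrete with a small check. The sketch below tests a table for k-anonymity and for the simplest ("distinct") form of l-diversity. The function names and the column names used with them are hypothetical, and the stronger variants of l-diversity in the literature are not covered.

```python
from collections import defaultdict


def group_by_quasi_identifiers(rows, quasi_ids):
    """Group rows that share the same values on the quasi-identifying columns."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[q] for q in quasi_ids)].append(row)
    return groups


def is_k_anonymous(rows, quasi_ids, k):
    """True if every group of indistinguishable rows has at least k members."""
    groups = group_by_quasi_identifiers(rows, quasi_ids)
    return all(len(g) >= k for g in groups.values())


def is_l_diverse(rows, quasi_ids, sensitive, ell):
    """True if every group contains at least ell distinct sensitive values."""
    groups = group_by_quasi_identifiers(rows, quasi_ids)
    return all(len({r[sensitive] for r in g}) >= ell for g in groups.values())
```

A table in which three people share the same generalized ZIP code and age band is 3-anonymous on those columns, but if all three have the same diagnosis, anyone known to be in that group has the diagnosis revealed. The second function is meant to flag exactly that case.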

²⁴	Stephen E. Fienberg, “Privacy and Confidentiality in an e-Commerce World: Data Mining, Data Warehousing, Matching and Disclosure Limitation,” Statistical Science 21(2):143-154, 2006.
²⁵	Ashwin Machanavajjhala, Johannes Gehrke, Daniel Kifer, and Muthuramakrishnan Venkitasubramaniam, “l-Diversity: Privacy Beyond k-Anonymity,” Proceedings of the 22nd IEEE International Conference on Data Engineering (ICDE 2006), Atlanta, Georgia, April 2006, available at http://www.cs.cornell.edu/johannes/papers/2006/2006-icde-publishing.pdf.
²⁶	Machanavajjhala et al., “l-Diversity,” 2006.


3.8.2.2
Statistical Disclosure Limitation Techniques²⁷

Other techniques can be used to reduce the likelihood that a specific individual can be identified in a data-mining application that seeks to uncover certain statistical patterns. Such techniques are useful to statistical agencies such as the Census Bureau, the Bureau of Labor Statistics, and the Centers for Disease Control and Prevention (to name only a few), which collect vast amounts of personally identifiable data and use it to produce useful data sets, summaries, and other products for the public or for research uses—most often in the form of statistical tables (i.e., tabular data). Some agencies (e.g., Census) also make available so-called microdata files—that is, files that can show (while omitting specific identifying information) the full range of responses made on individual questionnaires. Such files can show, for example, how one household or one household member answered questions on occupation, place of work, and so on.

Given the sensitive nature of much of this information and the types of analysis and comparison facilitated by modern technology, statistical agencies also can and do employ a wide range of techniques to prevent the disclosure of personally identifiable information related to specific individuals and to ensure that the data that are made available cannot be used to identify specific individuals or, in some cases, specific groups or organizations. Following are descriptions of many of those techniques.

Limiting details. Both with tabular data and microdata, formal identifiers and many geographic details are often simply omitted for all respondents.
Adding noise. Data can be perturbed by adding random noise (adding a random but small amount or multiplying by a random factor close to 1, most often before tabulation) to help disguise potentially identifying values. For example, imagine perturbing each individual’s or household’s income values by a small percentage.
Targeted suppression. This method suppresses or omits extreme values or values that might be unique enough to constitute a disclosure.
Top-coding and bottom-coding. These techniques are often used to limit disclosure of specific data at the high end or low end of a given range by grouping together values falling above or below a certain level. For instance, an income data table could be configured to list every income below $20,000 as simply “below $20,000.”
Recoding. Similar to top-coding and bottom-coding, recoding involves assigning individual values to groups or ranges rather than showing exact figures. For example, an income of $54,500 could simply be represented as being within the range of “$50,000-$60,000.” Such recoding could be adequate for a number of uses where detailed data are not required.
Rounding. This technique involves rounding values (e.g., incomes) up or down based on a set of earlier decisions. For example, one might decide to round all incomes down to the nearest $5,000 increment. Another model involves making random decisions on whether to round a given value up or down (as opposed to conforming data according to a predetermined rounding convention).
Swapping and/or shuffling. Swapping entails choosing a certain set of fields among a set of records in which values match, and then swapping all other values among the records. Records can also be compared and ranked according to a given value to allow swapping based on values that, while not identical, are close to each other (so-called rank-swapping). Data shuffling is a hybrid approach, blending perturbation and swapping techniques.
Sampling. This method involves including data from only a sample of a given population.
Blank and impute. In this process, values for particular fields in a selection of records are deleted and the fields are then filled either with values that have been statistically modeled or with values that are the same as those for other respondents.
Blurring. This method involves replacing a given value with an average. These average values can be determined in a number of different ways: for example, one might select the records to be averaged based on the values given in another field, or one might select them at random, or vary the number of values to be averaged.

²⁷	Additional discussion of some of these techniques can be found in National Research Council, Private Lives and Public Policies, National Academy Press, Washington, D.C., 1993.
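
Several of the techniques listed above are simple enough to sketch in a few lines. The functions below illustrate adding noise, bottom-coding, recoding, rounding, and blurring for whole-dollar income values. The function names, the 5 percent noise bound, and the group size of three are assumptions made for illustration, not parameters used by any statistical agency.

```python
import random


def add_noise(value, rng, max_fraction=0.05):
    """Multiply by a random factor close to 1 (here within +/-5 percent)."""
    return value * (1 + rng.uniform(-max_fraction, max_fraction))


def bottom_code(value, floor=20_000):
    """Report every value below the floor as a single category."""
    return f"below ${floor:,}" if value < floor else value


def recode(value, width=10_000):
    """Replace an exact whole-dollar value with the range that contains it."""
    low = (value // width) * width
    return f"${low:,}-${low + width:,}"


def round_down(value, increment=5_000):
    """Round down to the nearest increment (a predetermined convention)."""
    return (value // increment) * increment


def blur(values, group_size=3):
    """Replace each value with the average of its group of consecutive records."""
    blurred = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        blurred.extend([sum(group) / len(group)] * len(group))
    return blurred
```

For instance, recode(54_500) yields the "$50,000-$60,000" range used in the example above, and round_down(54_500) yields 50000. Each function trades detail for protection in a different way, which is why agencies typically choose among them according to the intended use of the released data.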

3.8.2.3
Cryptographic Techniques

The Portia project²⁸ is a cross-institutional research effort attempting to apply the results of cryptographic protocols to some of the problems of privacy. Such protocols theoretically allow the ability to do queries over multiple databases without revealing any information other than the answer to the particular query, thus ensuring that multi-database queries can be accomplished without the threat of privacy-threatening aggregation of the data in those databases. Although there are theoretical protocols that can be proved to give these results, implementing those protocols in a fashion that is efficient enough for common use is an open research problem. These investigations are in their early stages, so it is too soon to determine if the resulting techniques will be appropriate for wide use.

²⁸	See more information about the Portia project at http://crypto.stanford.edu/portia/.
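
To give a flavor of what such protocols do, the sketch below simulates one of the simplest: a sum over values held by several parties, computed with additive secret sharing so that no party's value is disclosed. It is not a protocol attributed to the Portia project. It assumes parties that follow the rules, and it runs in a single process only to show the arithmetic. The modulus and function names are illustrative choices.

```python
import secrets

MODULUS = 2**61 - 1  # assumed public modulus, larger than any possible total


def make_shares(value, n_parties):
    """Split a value into n random-looking shares that sum to it modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares


def secure_sum(private_values):
    """Simulate the exchange: only shares and partial sums are ever 'sent'."""
    n = len(private_values)
    # Party i splits its value into n shares and sends share j to party j.
    all_shares = [make_shares(v, n) for v in private_values]
    # Party j publishes only the sum of the shares it received.
    partial_sums = [sum(all_shares[i][j] for i in range(n)) % MODULUS for j in range(n)]
    # The published partial sums add up to the total and nothing more.
    return sum(partial_sums) % MODULUS
```

Each individual share is uniformly random, so a single share reveals nothing about the value it came from. The efficiency concern raised above shows up quickly once the query is more complicated than a sum.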

A similar project is attempting to develop a so-called Hippocratic database, which the researchers define as one whose owners “have responsibility for the data that they manage to prevent disclosure of private information.”²⁹ The thrust behind this work is to develop database technology to minimize the likelihood that data stored in the database are used for purposes other than those for which the data were gathered. While this project has produced some results in the published literature, it has not resulted in any widely deployed commercial products.

3.8.2.4
User Notification

Another set of technologies focuses on notification. For example, the Platform for Privacy Preferences (P3P) facilitates the development of machine-readable privacy policies.³⁰ Visitors to a P3P-enabled Web site can set their browsers to retrieve the Web site’s privacy policy and compare it to a number of visitor-specified privacy preferences. If the Web site’s policy is weaker than the visitor prefers, the visitor is notified of that fact. P3P thus seeks to automate what would otherwise be an onerous manual process for the visitor to read and comprehend the site’s written privacy policy. An example of a P3P browser add-on is Privacy Bird.³¹ Results of the comparison between a site’s policy and the user’s preferences are displayed graphically, showing a bird of different color (green and singing for a site whose policy does not violate the requirements set by the user, angry and red when the policy conflicts with the desires of the user) in the toolbar of the browser.
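
The comparison step can be sketched as follows. The vocabulary here is a deliberately simplified, hypothetical one (a mapping from practice names to true or false); real P3P policies are XML documents with a far richer set of terms, and this is not how any particular browser add-on is implemented.

```python
def policy_conflicts(site_policy, preferences):
    """Return the practices a site declares that the visitor has not accepted."""
    return sorted(
        practice
        for practice, declared in site_policy.items()
        if declared and not preferences.get(practice, False)
    )


def indicator(site_policy, preferences):
    """Reduce the comparison to a single signal, in the spirit of a colored icon."""
    return "red" if policy_conflicts(site_policy, preferences) else "green"


# Hypothetical practice names, for illustration only.
site = {"shares_with_third_parties": True, "retains_indefinitely": False, "uses_for_marketing": True}
visitor = {"uses_for_marketing": True, "shares_with_third_parties": False}
```

In this sketch a practice the visitor has said nothing about is treated as not accepted, which is one possible default; a tool could equally treat silence as acceptance.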

Systems such as Privacy Bird cannot guarantee the privacy of the individual who uses them—such guarantees can only be provided by enforcement of the stated policy. They do attempt to address the privacy issue directly, allowing the user to determine what information he or she is willing to allow to be revealed, along with what policies the recipient of the information intends to follow with regard to the use of that information or the transfer of that information to third parties. Also, the process of developing a P3P-compatible privacy policy is structured and systematic. Thus, a Web site operator may discover gaps in its existing privacy policy as it translates that policy into machine-readable form.

²⁹	Rakesh Agrawal, Jerry Kiernan, Ramakrishnan Srikant, and Yirong Xu, “Hippocratic Databases,” 28th International Conference on Very Large Databases (VLDB), Hong Kong, 2002.
³⁰	See http://www.w3.org/P3P/.
³¹	See http://www.privacybird.com/.


3.8.2.5
Information Flow Analysis

Privacy can also be protected by tools for automated privacy audits. Some companies, especially large ones, may find it difficult to know the extent to which their practices actually comply with their stated policies. The purpose of a privacy audit is to help a company determine the extent to which it is in compliance with its own policy. However, since the information flows within a large company are multiple and varied, automated tools are very helpful in identifying and monitoring such flows. When potential policy violations are identified, these tools bring the information flows in question to the attention of company officials for further attention.

Such tools often focus on information flows to and from externally visible Web sites, monitoring form submissions and cookie usage, and looking for Web pages that accidentally reveal personal information. Tools can also tag data as privacy sensitive, and when such tagged data are subsequently accessed, other software could check to ensure that the access is consistent with the company’s privacy policy.
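
The tagging idea can be sketched in a few lines. The names below (TaggedField, check_access, and the purposes) are hypothetical, and a real audit tool would hook into the systems where data are actually read rather than rely on callers to invoke a check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedField:
    """A data field carrying a privacy tag and the purposes its policy permits."""
    name: str
    sensitive: bool
    allowed_purposes: frozenset = frozenset()


def check_access(data_field, purpose, findings):
    """Record a finding when a sensitive field is used for an unlisted purpose."""
    consistent = (not data_field.sensitive) or (purpose in data_field.allowed_purposes)
    if not consistent:
        findings.append((data_field.name, purpose))  # surfaced for human review
    return consistent
```

The findings list plays the role of the report brought to company officials: the sketch does not block the access, it only flags it.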

Because of the many information flows in and out of a company, a comprehensive audit of a company’s privacy policy is generally quite difficult. But although it is virtually impossible to deploy automated tools everywhere within a company’s information infrastructure, automated auditing tools can help a great deal in improving a company’s compliance with its own stated policy.

3.8.2.6
Privacy-Sensitive System Design

Perhaps the best approach for protecting privacy is to design systems that do not require the collection or the retention of personal information in the first place.³² For example, systems designed to detect weapons hidden underneath clothing have been challenged on privacy grounds because they display the image recorded by the relevant sensors, and what appears on the operator’s display screen is an image of an unclothed body. However, the system can be designed to display instead an indicator signaling the possible presence of a weapon and its approximate location on the body. This approach protects the privacy of the subject to a much greater degree than the display of an image, although it requires a much more technically sophisticated approach (since the image detected must be analyzed to determine exactly what it indicates).

³²	From the standpoint of privacy advocacy, it is difficult to verify the non-retention of data since this would entail a full audit of a system as implemented. Data, once collected, often persist by default, and this may be an important reason that a privacy advocate might oppose even a system allegedly designed to discard data.


When a Web site operator needs to know only if a visitor’s age is above a certain threshold (e.g., 13), rather than the visitor’s age per se, collecting only an indicator of a threshold protects the visitor’s privacy. More generally, systems can be designed to enable an individual to prove that he or she possesses certain attributes (e.g., is authorized to enter a building, holds a diploma, is old enough to gamble or drink) without revealing anything more about the individual. Even online purchases could, in principle, be made anonymously using electronic cash.
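
The threshold case is easy to illustrate. In the sketch below, a hypothetical helper computes only whether a visitor meets an age threshold, so that the yes-or-no answer is the only thing a site would need to keep. Where the check runs, and whether the birth date is discarded afterward, are design decisions the sketch does not settle.

```python
from datetime import date


def is_at_least(birth_date, threshold_years, today):
    """Return only a yes/no answer; the age itself is never returned."""
    had_birthday = (today.month, today.day) >= (birth_date.month, birth_date.day)
    age = today.year - birth_date.year - (0 if had_birthday else 1)
    return age >= threshold_years
```

A site that stores the result of is_at_least(...) rather than the birth date holds one bit about the visitor instead of a value that can help identify him or her.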

However, the primary impediments to the adoption of such measures appear to be based in economics and policy rather than in technology. That is, even though measures such as those described above appear to be technically feasible, they are not in widespread use. The reason seems to be that most businesses benefit from the collection of detailed personal information about their customers and thus have little motivation to deploy privacy-protecting systems. Law enforcement agencies also have concerns about electronic cash systems that might facilitate anonymous money laundering.

3.8.2.7
Information Security Tools

Finally, the various tools supporting information security—encryption, access controls, and so on—have important privacy-protecting functions. Organizations charged with protecting sensitive personal information (e.g., individual medical records, financial records) can use encryption and access controls to reduce the likelihood that such information will be inappropriately compromised by third parties. A CD-ROM with personal information that is lost in transit is a potential treasure trove for identity thieves, but if the information is encrypted on the CD, the CD is useless to anyone without the decryption key. Medical records stored electronically and protected with good access controls that allow access only to authorized parties are arguably more private than paper records to which anyone has access. Electronic medical records might also be protected by audit trails that record all accesses and prevent forwarding to unauthorized parties or even their printing in hard copy.

With appropriate authentication technologies deployed, records of queries made by specific individuals can also be kept for future analysis.³³ Retention of such records can deter individuals from making privacy-invasive queries in the course of their work; in the event that personal information is compromised, a record might exist of queries that might have produced that personal information and the parties that may have made those queries.

³³	The committee is not insensitive to the irony that keeping query logs is arguably privacy-invasive with respect to the individual making the queries.
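
A minimal sketch of access control combined with an audit trail follows. The class and method names are hypothetical, authentication of the user is assumed to have happened elsewhere, and a production system would write the log to storage that the people being logged cannot alter.

```python
from datetime import datetime, timezone


class AuditedRecords:
    """Records that can be read only by authorized users, with every attempt logged."""

    def __init__(self, records, authorized_users):
        self._records = dict(records)
        self._authorized = set(authorized_users)
        self.audit_log = []

    def read(self, user, record_id):
        allowed = user in self._authorized
        # Log the attempt before deciding, so refused attempts are recorded too.
        self.audit_log.append({
            "when": datetime.now(timezone.utc),
            "user": user,
            "record": record_id,
            "allowed": allowed,
        })
        if not allowed:
            raise PermissionError(f"{user} may not read {record_id}")
        return self._records[record_id]

    def accesses_to(self, record_id):
        """List who tried to read a record, for analysis after a compromise."""
        return [(e["user"], e["allowed"]) for e in self.audit_log if e["record"] == record_id]
```

If a record later turns up where it should not be, accesses_to gives investigators the list of queries and parties described above.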

3.9
UNSOLVED PROBLEMS AS PRIVACY ENHANCERS

Although much of the discussion above involves trends in technology that can lead to privacy concerns, many technical challenges must be addressed successfully to enable truly ubiquitous surveillance. As long as those challenges remain unsolved, one can argue that many worries about technology and privacy are misplaced.

For example, the problem of data aggregation is far more than simply the problem of finding the data to be combined and using the network to bring those data to a shared location. One fundamental issue is that of interpreting data collected by different means so that their meaning is consistent. Digital data, by definition, comprise fields that are either on (representing 1) or off (representing 0). But how these 1s and 0s are grouped and interpreted to represent more complex forms of data (such as images, transaction records, sound, or temperature readings) varies from computer to computer and from program to program.

Even so simple a convention as the number of bits (values of 1 or 0) used to represent a value such as an alphanumeric character, an integer, or a floating point number varies from program to program, and the order in which the bits are to be interpreted can vary from machine to machine. The fact that data are stored on two machines that can talk to each other over the network does not mean that there is a program that can understand the data stored on the two machines, as the interpretation of that data is generally not something that is stored with the data itself.
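
The point about interpretation can be seen with four bytes. The short sketch below uses Python's standard struct module to write one integer and then read the same bytes back under different conventions; the variable names are illustrative.

```python
import struct

# The integer 1 written as four bytes, most significant byte first.
raw = struct.pack(">I", 1)

as_big_endian = struct.unpack(">I", raw)[0]     # read with the convention used to write
as_little_endian = struct.unpack("<I", raw)[0]  # same bytes, opposite byte order

# Four bytes that are a number to one program and text to another.
number = 0x41424344
as_text = struct.pack(">I", number).decode("ascii")
```

Read with the writer's convention, the bytes give back 1; read with the opposite byte order they give 16777216. The bytes of the second number decode as the text "ABCD". Nothing in the bytes themselves says which reading is intended.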

This problem is compounded when an attempt is made to combine the contents of different databases. A database is organized around groupings of information into records and indexes of those records. These groupings and indexes, known as the schema, define the information in the database. Different databases with different schema definitions cannot be combined in a straightforward way; the queries issued to one of those databases might not be understood in the other database (or, worse still, might be understood in a different way). Since the schema used in a database defines, in an important way, the meaning of the information stored in it, two databases with different schemas store information that is difficult to combine in any meaningful way.
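
Both failure modes can be reproduced with two small in-memory databases. In the sketch below, built on Python's standard sqlite3 module, the table layouts, names, and amounts are invented for illustration: a hypothetical bank stores balances held, in cents, while a hypothetical clinic stores balances owed, in dollars.

```python
import sqlite3


def make_db(create_sql, insert_sql, rows):
    db = sqlite3.connect(":memory:")
    db.execute(create_sql)
    db.executemany(insert_sql, rows)
    return db


# Hypothetical bank: balance is money held, stored in cents.
bank = make_db(
    "CREATE TABLE accounts (id INTEGER, name TEXT, balance INTEGER)",
    "INSERT INTO accounts VALUES (?, ?, ?)",
    [(1, "A. Smith", 100_000), (2, "B. Jones", 50_000)],
)

# Hypothetical clinic: balance is money owed, stored in dollars.
clinic = make_db(
    "CREATE TABLE accounts (patient_no TEXT, full_name TEXT, balance REAL)",
    "INSERT INTO accounts VALUES (?, ?, ?)",
    [("P-17", "Smith, Alice", 1_000.0), ("P-42", "Jones, Bob", 500.0)],
)


def total_balance(db):
    """Runs on both databases, but the two answers mean different things."""
    return db.execute("SELECT SUM(balance) FROM accounts").fetchone()[0]


def names(db):
    """Runs on the bank; fails on the clinic, which has no 'name' column."""
    return [row[0] for row in db.execute("SELECT name FROM accounts")]
```

The second function illustrates a query that is not understood at all (it raises sqlite3.OperationalError on the clinic database). The first illustrates the worse case: the query succeeds on both, yet adding the two results would combine cents held with dollars owed.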

Note that this issue is not resolved simply by searching in multiple databases of similar formats. For example, although search engines facilitate the searching of large volumes of text that can be spread among multiple databases, this is not to say that these data can be treated as belonging to a single database, for if that were the case both the format and the semantics of the words would be identical. The Semantic Web and similar research efforts seek to reduce the magnitude of the semantic problem, disambiguating syntactically identical words. But these efforts have little to do with aggregations of data in dissimilar formats, such as video clips and text or information in financial and medical databases.

This problem of interpretation is not new; it has plagued businesses trying to integrate their own data for nearly as long as there have been computers. Huge amounts of money are spent each year on attempts to merge separate databases within the same corporation, or in attempts by one company to integrate the information used by another company that it has acquired. Even when the data formats are known by the programmers attempting to do the integration, these problems are somewhere between difficult and impossible. The notion that data gathered by sensors about an individual by different sources can be easily aggregated by computers that are connected by a network presupposes, contrary to fact, that this problem of data integration and interpretation has been solved.

Similarly, the claim that increases in the capacity of storage devices will allow data to be stored forever and used to violate the privacy of the individual ignores another trend in computing, which is that the formats used to interpret the raw data contained in storage devices are program specific and tend to change rapidly. Data are now commonly lost not because they have been removed from some storage device, but because there is no program that can be run that understands the format of the data or no hardware that can even read the data.³⁴ In principle, maintaining documentation adequate to allow later interpretation of data stored in old formats is a straightforward task—but in practice, this rarely happens, and so data are often lost in this manner. And as new media standards emerge, it becomes more difficult to find and/or purchase systems that can read the media on which the old data are recorded.

A related issue in data degradation relates to the hardware. Many popular and readily available storage devices (CDs, DVDs, tapes, hard drives) have limited dependable lifetimes. The standards to which these devices were originally built also evolve to enable yet more data to be packed onto them, and so in several generations, any given storage device may well be an orphan, with spare parts and repair expertise difficult to find.

Data can thus be lost if—even though the data have not been destroyed—they become unreadable and thus unusable.

Finally, even with the advances in the computational power available on networks of modern computers, there are some tasks that will remain computationally infeasible without far greater breakthroughs in computing hardware than have been seen even in the past 10 years. Some tasks, such as those that require the comparison of all possible combinations of sets of events, have a computational cost that rises combinatorially (i.e., faster than exponentially) with the number of entities being compared. Such computations attempted over large numbers of people are far too computationally expensive to be done by any current or anticipated computing technology. Thus, such tasks will remain computationally infeasible not just now but for a long time to come.³⁵

³⁴	National Research Council, Building an Electronic Records Archive at the National Archives and Records Administration: Recommendations for a Long-Term Strategy, Robert Sproull and Jon Eisenberg, eds., The National Academies Press, Washington, D.C., 2005.
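
The scale involved can be checked with a few lines of arithmetic. Which count applies depends on the task, which the text does not specify, so the sketch below simply offers three hypothetical counting functions: groups of a fixed size, all possible groups, and all possible orderings.

```python
import math


def groups_of_size(n, k):
    """Number of distinct k-member groups among n entities."""
    return math.comb(n, k)


def all_groups(n):
    """Number of distinct subsets of n entities (exponential in n)."""
    return 2 ** n


def all_orderings(n):
    """Number of distinct orderings of n entities (faster than exponential)."""
    return math.factorial(n)
```

Among one million people there are already 499,999,500,000 distinct pairs, and the count of triples is larger by a factor of several hundred thousand. Enumerating all subsets or all orderings is out of reach long before n reaches the size of a small town.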

Similar arguments also apply to certain sensing technologies. For example, privacy advocates worry about the wide deployment of facial recognition technology. Today, this technology is reasonably accurate under controlled conditions where the subject is isolated, the face is exposed in a known position, and there are no other faces being scanned. Attempts to apply this technology “in the wild,” however, have largely failed. The problem of recognizing an individual from a video scan in uncontrolled lighting, where the face is turned or tilted, and where the face is part of a crowd, or when the subject is using countermeasures to defeat the recognition technology, is far beyond the capabilities of current technology. Facial recognition research is quite active today, but it remains an open question how far and how fast the field will be able to progress.

3.10
OBSERVATIONS

Current trends in information technology have greatly expanded the ability of its users to gather, store, share, and analyze data. Indeed, metrics for the increasing capabilities provided by information technology hardware (storage, bandwidth, and processing speed, among others) could be regarded as surrogates for the impact of technological change on privacy. The same is true, though in a less quantitative sense, for software: better algorithms, better database management systems and more powerful query languages, and so on. These data can be gathered both from those who use the technology itself and from the physical world. Given the trends in the technology, it appears that there are many ways in which the privacy of the individual could be compromised by governments, private corporations, and individual users of the technology.

³⁵	To deal with such problems, statisticians and computer scientists have developed pruning methods that systematically exclude large parts of the problem space that must be examined. Some methods are heuristic, based on domain-specific knowledge and characteristics of the data, such as knowing that men do not get cervical cancer or become pregnant. Others are built on theory and notions of model simplification. Still others are based on sampling approaches that are feasible when the subjects of interest are in some sense average rather than extreme. If a problem is such that it is necessary to identify with high probability only some subjects, rather than requiring an exhaustive search that identifies all subjects with certainty, these methods have considerable utility. But some problems, in particular searches for terrorists who are seeking to conceal their profile within a given population, are less amenable to such treatment.

Many of these concerns echo those that arose in the 1970s, when the first databases began to be widely used. At that time, concerns over the misuse of the information stored in those databases and the accuracy of the information itself led to the creation of the Fair Information Practice guidelines in 1973 (Section 1.5.4 and Box 1.3).

Current privacy worries are not as well defined as those that originally led to the Fair Information Practice guidelines. Whereas those guidelines were a reaction to fears that the contents of databases might be inaccurate, the current worries are more concerned with the misuse of data gathered for otherwise valid reasons, or the ability to extract additional information from the aggregation of databases by using the power of networked computation. Furthermore, in some instances, technologies developed without a conscious desire for affecting privacy may—upon closer examination—have deep ramifications for privacy. As one example, digital rights management technologies have the potential to collect highly detailed information on user behavior regarding the texts they read and the music they listen to. In some instances, they have a further potential to create security vulnerabilities in the systems on which they run, exploitation of which might lead to security breaches and the consequent compromise of personal information stored on those systems. The information-collection aspect of digital rights management technologies is discussed further in Section 6.7.

At the same time, some technologies can promote and defend privacy. Cryptographic mechanisms that can ensure the confidentiality of protected data, anonymization techniques to ensure that interactions can take place without participants in the interaction revealing their identity, and database techniques that allow extraction of some information without showing so much data that the privacy of those whose data has been collected will be compromised, are all active areas of research and development. However, each of these technologies imposes costs, both social and economic, for those who use them, a fact that tends to inhibit their use. If a technology has no purpose other than to protect privacy, it is likely to be deployed only when there is pressure to protect privacy, unlike privacy-invasive technologies, which generally invade privacy as a side effect of some other business or operational purpose.

An important issue is the impact of data quality on any system that involves surveillance and matching. As noted in Chapter 1, data quality has a significant impact on the occurrence of false positives and false negatives. By definition, false positives subject individuals to scrutiny that is inappropriate and unnecessary given their particular circumstances, and data quality issues that result in larger numbers of false positives lead to greater invasions of privacy. By definition, false negatives do not identify individuals who should receive further scrutiny, and data quality issues that result in larger numbers of false negatives compromise mission accomplishment.
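
A short calculation shows why even small error rates matter when the people being sought are rare. The function below is a sketch; its name and the figures used with it are hypothetical and are not drawn from any real system.

```python
def screening_outcomes(population, true_targets, false_positive_rate, false_negative_rate):
    """Expected counts when a matching system screens an entire population."""
    non_targets = population - true_targets
    false_positives = non_targets * false_positive_rate   # wrongly scrutinized
    false_negatives = true_targets * false_negative_rate  # wrongly cleared
    true_positives = true_targets - false_negatives
    flagged = true_positives + false_positives
    return {
        "false_positives": false_positives,
        "false_negatives": false_negatives,
        "share_of_flagged_who_are_targets": true_positives / flagged,
    }
```

With an assumed population of 1,000,000 that contains 100 true targets, a 1 percent false positive rate, and a 10 percent false negative rate, about 9,999 people are flagged wrongly and about 10 targets are missed, so fewer than 1 in 100 of those flagged are actual targets. Poorer data quality pushes either rate upward, with the consequences for privacy and for mission accomplishment described above.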

Technology also raises interesting philosophical questions regarding privacy. For example, Chapter 2 raised the distinction between the acquisition of personal information and the use of that information. The distinction is important because privacy is contextually defined—use X of certain personal information might be regarded as benign, while use Y of that same information might be regarded as a violation of privacy. But even if one assumes that the privacy violations might occur at the moment of acquisition, technology changes the meaning of “the moment.” Is “the moment” the point at which the sensors register the raw information? The point after which the computers have processed the bit streams from the sensors into a meaningful image or pattern? The point at which the computer identifies an image or pattern as being worthy of further human attention? The point at which an actual human being sees the image or pattern? The point at which the human being indicates that some further action must be taken? There are no universal answers to such questions—contextual factors and value judgments shape the answers.

A real danger is that fears about what technology might be able to do, either currently or in the near future, will spur policy decisions that will limit the technology in artificial ways. Decisions made by those who do not understand the limitations of current technology may prevent the advancement of the technology in the direction feared but also limit uses of the technology that would be desirable and that do not, in fact, create a problem for those who treasure personal privacy. Consider, for example, that data-mining technologies are seen by many to be tools of those who would invade the privacy of ordinary citizens.³⁶ Poorly formulated limitations on the use of data mining may reduce its impact on privacy, but may also inadvertently limit its use in other applications that pose no privacy issue whatever.

³⁶	A forthcoming study by the National Research Council will address this point in more detail.

Finally, it is worth noting the normative question of whether technology or policy ought to have priority as a foundation for protecting privacy. One perspective on privacy protection is that policy should come first: policy, and associated law and regulation, are the basis for the performance requirements of the technology, and technology should be developed and deployed that conforms to the demands of policy. On the other hand, policy that is highly protective of privacy on one day can be changed to one that is less protective the next day. Thus, a second view holds that technology should constitute the basis for privacy protection, because such a foundation is harder to change or circumvent than one based on procedural foundations.³⁷ Further, violations of technologically enforced privacy protections are generally much more difficult to accomplish than violations of policy-enforced protections. Whether such difficulties are seen as desirable stability (i.e., an advantage) or unnecessary rigidity (i.e., a disadvantage) depends on one’s position and perspective.

In practice, of course, privacy protections are founded on a mix of technology and policy, as well as self-protective actions and cultural factors such as ethics, manners, professional codes, and a sense of propriety. In large bureaucracies, significant policy changes cannot be implemented rapidly and successfully, even putting aside questions related to the technological infrastructure. Indeed, many have observed that implementing appropriate human and organizational procedures that are aligned with high-level policy goals is often harder than implementing and deploying technology.

³⁷	Lessig argues this point in Code, though his argument is much broader than one relating simply to privacy. See Lawrence Lessig, Code and Other Laws of Cyberspace, Basic Books, New York, 2000.