Record Linkage Techniques -- 1997: Proceedings of an International Workshop and Exposition (1999)

Chapter: Chapter 2 Invited Session on Recorded Linkage Applications for Epidemiological Research

Suggested Citation:"Chapter 2 Invited Session on Recorded Linkage Applications for Epidemiological Research." National Research Council. 1999. Record Linkage Techniques -- 1997: Proceedings of an International Workshop and Exposition. Washington, DC: The National Academies Press. doi: 10.17226/6491.

Chapter 2

Invited Session on Record Linkage Applications for Epidemiological Research

Chair: John Armstrong, Elections Canada

Authors:

Leicester E.Gill, University of Oxford

John R.H.Charlton, Office of National Statistics, UK and Judith D.Charlton, JDC Applications

John Van Voorhis, David Koepke, and David Yu, University of Chicago


OX-LINK: The Oxford Medical Record Linkage System

Leicester E.Gill, University of Oxford

Abstract

This paper describes the major features of the Oxford record linkage system (OX-LINK), with its use of the Oxford name compression algorithm (ONCA), the calculation of the names weights, the use of orthogonal matrices to determine the threshold acceptance weights, and the use of combinational and heuristic algebraic algorithms to select the potential links between pairs of records.

The system was developed using the collection of linkable abstracts that comprise the Oxford Record Linkage Study (ORLS), which covers 10 million records for 5 million people and spans 1963 to date. The linked dataset is used for the preparation of health services statistics, and for epidemiological and health services research. The policy of the Oxford unit is to comprehensively link all the records rather than prepare links on an ad-hoc basis.

The OX-LINK system has been further developed and refined for internally cross matching the whole of the National Health Service Central Register (NHSCR) against itself (57.9 million records), and to detect and remove duplicate pairs; as a first step towards the issue of a new NHS number to everyone in England and Wales. A recent development is the matching of general practice (primary care) records with hospital and vital records to prepare a file for analyzing referral, prescribing and outcome measures.

Other uses of the system include ad hoc linkages for specific cohorts, academic support for the development of test programs and data for efficiently and accurately tracing people within the NHSCR, and developing methodologies for preparing registers containing a high proportion of ethnic names.

Medical Record Linkage

The term record linkage, first used by H.L.Dunn (1946; Gill and Baldwin, 1987), expresses the concept of collating health-care records into a cumulative personal file, starting with birth and ending with death. Dunn also emphasised the use of linked files to establish the accuracy or otherwise of the recorded data. Newcombe (Newcombe et al., 1959; and Newcombe, 1967, 1987, and 1988) undertook the pioneering work on medical record linkage in Canada in the 1950s and thereafter, and Acheson (1967, 1968) established the first record linkage system in England in 1962.

When the requirement is to link records at different times and in different places, in principle it would be possible to link such records using a unique personal identification number. In practice, a unique number has not generally been available on records in the UK of interest in medicine and therefore other methods such as the use of surnames, forenames and dates of birth, have been necessary to identify different records relating to the same individual. In this paper, I will confine my discussion to the linkage of records for different events which relate to the same person.

Matching and Linking

The fundamental requirement for correct matching is that there should be a means of uniquely identifying the person on every document to be linked. Matching may be all-or-none, or it may be probabilistic, i.e., based on a calculated probability that the records relate to the same person, as described below. In probability matching, a threshold of likelihood is set (which can be varied in different circumstances) above which a pair of records is accepted as a match, relating to the same person, and below which the match is rejected.

The main requirement for all-or-none matching is a unique identifier for the person which is fixed, easily recorded, verifiable, and available on every relevant record. Few, if any, identifiers meet all these specifications. However, systems of numbers or other ciphers can be generated which meet these criteria within an individual health care setting (e.g., within a hospital or district) or, in principle, more widely (e.g., the National Health Service number). In the past, the National Health Service number in England and Wales had serious limitations as a matching variable, and it was not widely used on health-care records. With the allocation of the new ten digit number throughout the NHS all this is being changed (Secretaries of State, 1989; National Health Service and Department of Health, 1990), and it will be incorporated in all health-care records from 1997.

Numbering systems, though simple in concept, are prone to errors of recording, transcription and keying. It is therefore essential to consider methods for reducing errors in their use. One such method is to incorporate a checking device such as the use of check-digits (Wild, 1968; Hamming, 1986; Gallian, 1989; Baldwin and Gill, 1982; and Holmes, 1975). In circumstances where unique numbers or ciphers are not universally used, obvious candidates for use as matching variables are the person's names, date of birth, sex and perhaps other supplementary variables such as the address or postcode and place of birth. These, considered individually, are partial identifiers and matching depends on their use in combination.

Unique Personal Identifiers

Personal identification, administrative and clinical data are gradually accumulated during a patient's spell in a hospital and finalized into a single record. This type of linkage is conducted as normal practice in hospital information systems, especially in those hospitals having Patient Administration Systems (PAS) and District Information Systems (DIS) which use a centrally allocated check-digited District Number as the unique identifier (Goldacre, 1986).

Identifying numbers are often made up, in part, from stable features of a person's identification set, for example, sex, date of birth and place of birth, and so can be reconstructed in full or part, even if the number is lost or forgotten. In the United Kingdom (UK), the new 10-digit NHS number is an arbitrarily allocated integer, almost impossible to commit to memory, and cannot be reconstructed from the person's personal identifiers.

Difficulties arise, however, where the health event record does not include a unique identifier. In such cases, matching and linking depends on achieving the closest approach to unique identification by using several identifying variables each of which is only a partial identifier but which, in combination, provide a match which is sufficiently accurate for the intended uses of the linked data.

Personal Identifying Variables

The personal identifying variables that are normally used for person matching can be considered in five quite separate groups.

  • Group 1. –Represents the person's proper names, which, with the exception of the present surname when women adopt their husband's surname on marriage, rarely change during a person's lifetime: birth surname; present surname; first forename or first initial; second forename or second initial; and, other forenames.

  • Group 2. –Consists of the non-name personal characteristics that are fixed at birth and very rarely change during the person's lifetime: gender (sex at birth); date of birth; place of birth (address where parents were living when the person was born); NHS number (allocated at birth registration, both old and new formats); date of death; and ethnicity.

  • Group 3. –Consists of socio-demographic variables that can change many times during the course of the person's lifetime: street address; post code; general practitioner; marital status; social class; number(s) allocated by a health district or special health-care register; number(s) allocated by a hospital or trust; number(s) allocated by a general practitioner's computing system; and, any other special hospital allocated numbers.

  • Group 4. –Consists of other variables that could be used for the compilation of special registers: clinical specialty; diagnosis; cancer site; drug idiosyncrasy or therapy; occupation; date of death; and other dates (for example, LMP, etc.).

  • Group 5. –Consists of variables that could be used for family record linkage: other surnames; mother's birth surname; father's surname; marital status; number of births; birth order; birth weight; date of marriage; and number of marriages.

File Ordering and Blocking

Matching and linkage in established datasets usually involves comparing each new record with a master file containing existing records. Files are ordered or blocked in particular ways to increase the efficiency of searching. In similar fashion to looking up a name in a telephone directory the matching algorithm must be able to generate the “see also” equivalent to this surname for variations in spelling (e.g., Stuart and Stewart, Mc, Mk, and Mac). Searching can be continued, if necessary, under the alternative surname.

Algorithms that emulate the “see also” method are used for computer matching in record linkage. In this way, for example, Stuarts and Stewarts are collated into the same block. A match is determined by the amount of agreement and disagreement between the identifiers on the “incoming” record and those on the master file. The computer calculates the statistical probability that the person on the master file is the same as the person on the record with which it is compared.

File Blocking

The reliability and efficiency of matching is very dependent on the way in which the initial grouping or the “file-blocking” step is carried out. It is important to generate blocks of the right size. The balance between the number and size of blocks is particularly important when large files are being matched. The selection of variables to be used for file blocking is, therefore, critical and will be discussed before considering the comparison and decision-making stages of probability matching.

Any variable that is present on each and every record on the dataset to be matched could be used to divide or block the file, so enhancing the search and reducing the number of unproductive comparisons. Nevertheless, if there is a risk that the items chosen are wrongly recorded, which would result in the records being assigned to the wrong file block, then potential matches will be missed. Items that are likely to change their value from one record to another for the same person, such as home address, are not suitable for file blocking. The items used for file blocking must be universally available, reliably recorded and permanent. In practice, it is almost always necessary to use surnames, combined with one or two other ubiquitous items, such as sex and year of birth, to subdivide the file into blocks that are manageable in size and stable. Considerable attention has been given to the ways in which surnames are captured, and to algorithmic methods that reduce, or eliminate, the effects of variations in spelling and reporting and that “compress” names into fixed-length codes.
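The effect of blocking on the number of comparisons can be illustrated with a minimal sketch. The record layout, the field names (`name_code`, `sex`, `yob`) and the sample values are hypothetical and are not taken from the ORLS files; the name codes are simply assumed to have been computed already.

```python
from collections import defaultdict

def block_records(records, key_fields):
    """Group records into blocks keyed on the chosen blocking fields."""
    blocks = defaultdict(list)
    for rec in records:
        key = tuple(rec[field] for field in key_fields)
        blocks[key].append(rec)
    return blocks

def count_pairs(blocks):
    """Number of within-block comparisons: n(n-1)/2 for each block."""
    return sum(len(b) * (len(b) - 1) // 2 for b in blocks.values())

# Hypothetical records; name_code stands for a compressed surname code.
records = [
    {"id": 1, "name_code": "S530", "sex": "F", "yob": 1952},
    {"id": 2, "name_code": "S530", "sex": "F", "yob": 1952},
    {"id": 3, "name_code": "S530", "sex": "M", "yob": 1950},
    {"id": 4, "name_code": "H400", "sex": "F", "yob": 1952},
    {"id": 5, "name_code": "H400", "sex": "F", "yob": 1930},
    {"id": 6, "name_code": "S363", "sex": "M", "yob": 1948},
]

# Without blocking, every record is compared with every other record.
unblocked_pairs = len(records) * (len(records) - 1) // 2

# Blocking on name code and sex confines comparisons to each block.
blocks = block_records(records, ("name_code", "sex"))
blocked_pairs = count_pairs(blocks)
```

In this toy file the fifteen possible comparisons fall to two once the records are blocked on name code and sex. The saving comes at the price described above: a record whose blocking item is wrongly recorded lands in the wrong block and is never compared with its true partner.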

Phonemic Name Compression

In record linkage, name compression codes are used for grouping together variants of surnames for the purposes of blocking and searching, so that effective match comparisons can be made using both the full name and other identifying data, despite misspelled or misreported names.

The first major advance in name compression was achieved by applying the principles of phonetics to group together classes of similar-sounding groups of letters, and thus similar-sounding names. The best known of these codes was devised in the 1920's by Odell and Russell (Knuth, 1973) and is known as the Soundex code. Other name compression algorithms are described by Dolby (1970) and elsewhere.

Soundex Code and the Oxford Name Compression Algorithm (ONCA)

The Soundex code has been widely used in medical record systems despite its disadvantages. Although the algorithm copes well with Anglo-Saxon and European names, it fails to bring together some common variants of names, such as Thomson/Thompson, Horton/Hawton, Goff/Gough, etc., and it does not perform well where the names are short, as is the case for the very common names, have a high percentage of vowels, or are of Oriental origin.

It is used principally for the transformation of groups of consonants within names to specific combinations of both vowels and consonants (Dolby, 1970). Among several algorithms of this type, that devised by the New York State Information and Intelligence System (NYSIIS) has been particularly successful, and has been used in a modified form by Statistics Canada and in the USA for an extensive series of record linkage studies (Lynch and Arends, 1977). A recent development in the Unit of Health-Care Epidemiology (UHCE) (Gill and Baldwin, 1987; Gill et al., 1993), referred to as the Oxford Name Compression Algorithm (ONCA), uses an anglicised version of the NYSIIS method of compression as the initial or pre-processing stage, and the transformed and partially compressed name is then Soundexed in the usual way. This two-stage technique has been used successfully for blocking the files of the ORLS, and overcomes most of the unsatisfactory features of pure Soundexing while retaining a convenient four-character fixed-length format.

The blocks produced using ONCA alone vary in size, from quite small and manageable for the less common surnames, to very large and uneconomic for the more common surnames. Further subdivision of the ONCA blocks on the file can be effected using sex, forename initial and date of birth either singly or in combination.
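The second, Soundex stage of the compression can be sketched as follows. This is the standard Soundex coding only, not ONCA: the anglicised NYSIIS pre-processing rules are not given in this paper and are deliberately left out, so the function name `soundex` and its output should be read as an illustration of pure Soundexing.

```python
# Standard Soundex letter groups; vowels, H, W and Y carry no code.
SOUNDEX_CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        SOUNDEX_CODES[ch] = digit

def soundex(name):
    """Return the four-character Soundex code for a name."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = SOUNDEX_CODES.get(first, "")
    for ch in letters[1:]:
        code = SOUNDEX_CODES.get(ch, "")
        if code:
            # Adjacent letters with the same code are coded once.
            if code != prev:
                digits.append(code)
            prev = code
        elif ch not in "HW":
            # A vowel breaks a run; H and W do not.
            prev = ""
    # Pad with zeros and truncate to the fixed four-character format.
    return (first + "".join(digits) + "000")[:4]
```

Under this coding HALL and SMITH give H400 and S530, the codes used in the worked example later in this paper, and Stuart and Stewart share a code. Thomson and Thompson receive different codes, which is one of the weaknesses of pure Soundexing noted above.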

ORLS File Blocking Keys and Matching Variables

The file blocking keys used for the ORLS are generated in the following fashion:

  • The primary key is generated using the ONCA of the present surname.

  • The secondary key is generated from the initial letter of the first forename. Where this forename is a nickname or a known contraction of the “formal” forename, then the initial of the “formal” forename is used. For example, if the recorded forename was BILL, the “formal” forename would be William, and the initial used would be W. A further record is set up on the master file where a second forename or initial is present; the key is derived from this second initial.

  • Where the birth surname is not the same as the present surname, as in the case of married women, a further record is set up on the master file under the ONCA code of birth surname and again subdivided by the initial. (This process is termed exploding the file.)

  • Further keys based on the date of birth and other blocking variables are also generated.

In addition to the sorting header, four other variables are added to each record before sorting and matching is undertaken:

  • Accession Number. –A unique number allocated from a pool of such numbers and absolutely unique to this record. The number is never changed and is used for identification of this record for correction and amendment. The number is check digited to modulus 97.

  • Person or System Number. –A unique number allocated from a pool of such numbers. The number can be changed or replaced if this record matches with another record. The number is check digited to modulus 97.

  • Coding Editions. –Indicators that record the various editions of the coding frames used in this record, for example the version of the ICD (International Classification of Diseases) or the surgical procedure codes. These indicators ensure that the correct coding edition is always recorded on each and every record and reliance is not placed on a vague range of dates.

  • Input and Output Stream Number. –This variable is used for identifying a particular dataset during a matching run, and enables a number of matches to be undertaken independently at the same pass down the master file.
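The paper does not say how the modulus 97 check digits are formed. As a sketch only, one widely used scheme of this kind (ISO 7064 MOD 97-10) appends two digits so that the completed number leaves a remainder of 1 when divided by 97; the ORLS scheme may well differ, and the function names below are hypothetical.

```python
def add_check_digits(base_number):
    """Append two MOD 97-10 check digits to a non-negative integer.

    Assumed scheme (ISO 7064 MOD 97-10), not necessarily that of the ORLS.
    """
    check = 98 - (base_number * 100) % 97
    return f"{base_number}{check:02d}"

def is_valid(number):
    """A completed number is valid if it is all digits and is 1 modulo 97."""
    return number.isdigit() and int(number) % 97 == 1

accession = add_check_digits(123456)
```

Because 97 is prime, a check of this form detects any single mistyped digit and any transposition of two adjacent digits, which are the recording and keying errors discussed earlier.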

Generating Extra Records Where a Number of Name Variants Are Present

To ensure that the data record can match with the blocks containing all possible variants of the names information, multiple records are generated on the master file containing combinations of present and birth surnames, and forenames. To illustrate the generation of extra records where the identifying set for a person contains many variants of the names, consider the following example:

birth surname: SMITH

present surname (married surname): HALL

first forename: LIZ (contraction of Elizabeth)

second forename: PEGGY (nickname for Margaret)

year of birth: 1952 (old enough to be married).

Eight records would be generated on the master file and each record indexed under the various combinations of ONCA and initial, as follows:

Indexed under the present surname HALL: i.e., ONCA H400:

H400L for Liz

H400E for Elizabeth (formal version of Liz)

H400P for Peggy

H400M for Margaret (formal version of Peggy);

Indexed under the birth surname SMITH: i.e., ONCA S530:

S530L for Liz

S530E for Elizabeth (formal version of Liz)

S530P for Peggy

S530M for Margaret (formal version of Peggy).

Mrs. Hall would have her master file record included under each of the above eight ONCA/initial blocks. A data record containing any combination of the above names would generate an ONCA/initial code similar to any one of the eight above, and would have a high expectation of matching to any of the variants during the matching phase.
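The generation of the eight index keys for this example can be sketched as follows. The forename lexicon is reduced to three entries and the ONCA codes are simply looked up from the values quoted above; both tables, and the function name `blocking_keys`, are stand-ins for illustration rather than parts of the OX-LINK system.

```python
# Tiny stand-in for the formal forename lexicon (hypothetical contents).
FORMAL_FORENAMES = {"LIZ": "ELIZABETH", "PEGGY": "MARGARET", "BILL": "WILLIAM"}

# ONCA codes quoted in the text; a real system would compute them.
ONCA_CODES = {"HALL": "H400", "SMITH": "S530"}

def blocking_keys(present_surname, birth_surname, forenames):
    """Return one ONCA/initial key per surname and forename variant."""
    surnames = [present_surname]
    if birth_surname and birth_surname != present_surname:
        surnames.append(birth_surname)

    initials = []
    for name in forenames:
        name = name.upper()
        # Index under the recorded forename and under its formal version.
        for variant in (name, FORMAL_FORENAMES.get(name, name)):
            if variant[0] not in initials:
                initials.append(variant[0])

    return [ONCA_CODES[s.upper()] + i for s in surnames for i in initials]

keys = blocking_keys("HALL", "SMITH", ["LIZ", "PEGGY"])
```

For Mrs. Hall this yields the four H400 keys followed by the four S530 keys listed above, so that an incoming record carrying any one of her surname and forename variants is directed to a block that holds her master file record.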

To reduce the number of unproductive comparisons, a data record is matched with another record in the same block only if the years of birth on the two records are within 16 years of each other. This constraint has been applied, firstly, to reduce the number of unproductive matches, and secondly, to confine matching to persons born within the same generation, and in this way eliminate father/son and mother/daughter matches. Further constraints could be built into the matching software, for example, matching only within the same sex, logically checking that the dates on the two records are in a particular sequence or range, or that the diagnoses on the two records are in a specified range, as required in the preparation of a cancer registry file.

Matching Methods

There are two methods of matching data records with a master file.

  • The two file method is used to match a data record from a data file with a block on the master file, and in this way compare the data record with every record in the master file block.

  • The one file/single pass method is used to combine the data file block and the master file block into one block, and to match each record with every other in the block in a triangular fashion, i.e., first with the rest, followed by second with the rest etc. In this way every record can be matched with every other record. Use of a stream number on each record enables selective matching to be undertaken, for example data records can be matched with the master file and with each other, but the master file records are not matched with themselves.
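The one file/single pass method can be sketched as below, together with the 16-year constraint on year of birth described earlier. The stream numbers, field names and sample records are hypothetical; the sketch only lists the pairs that would go forward for weighting.

```python
from itertools import combinations

MASTER, DATA = 0, 1  # hypothetical stream numbers

def candidate_pairs(block, max_year_gap=16):
    """Yield pairs of record ids to compare within one combined block."""
    # combinations() gives the triangular pattern: first with the rest,
    # then second with the rest, and so on.
    for a, b in combinations(block, 2):
        # Master file records are not matched with themselves.
        if a["stream"] == MASTER and b["stream"] == MASTER:
            continue
        # Confine matching to persons born within the same generation.
        if abs(a["yob"] - b["yob"]) > max_year_gap:
            continue
        yield a["id"], b["id"]

block = [
    {"id": "M1", "stream": MASTER, "yob": 1952},
    {"id": "M2", "stream": MASTER, "yob": 1925},
    {"id": "D1", "stream": DATA, "yob": 1953},
    {"id": "D2", "stream": DATA, "yob": 1950},
]
pairs = list(candidate_pairs(block))
```

Of the six possible pairs in this block, three survive: the master/master pair is skipped on stream number, and the two pairs involving the 1925 record are skipped on year of birth.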

Match Weights

Considerable work has been undertaken to develop methods of calculating the probability that pairs of records, containing arrays of partial identifiers which may be subject to error or variation in recording, do, or do not, relate to the same person. Decisions can then be made about the level of probability to accept. The issues are those of reducing false negatives (Type I errors) and false positives (Type II errors) in matching (Winkler, 1995; Scheuren and Winkler, 1996; and Belin and Rubin, 1995). A false negative error, or “missed match,” occurs when records which relate to the same person are not drawn together (perhaps because of minor variations in spelling or a minor error in recorded dates of birth). Matches may also be missed if the two records fall into different blocks. This may happen if, for example, a surname is misspelled and the phonemic compression algorithm puts them into two different blocks.

Methods for probability matching depend on making comparisons between each of several items of identifying information. Computer-based calculations are then made which are based on the discriminating power of each item. For example, a comparison between two different records containing the same surname has greater discriminating power if the surnames are rare than if they are common. Higher scores are given for agreement between identifiers (such as particular surnames) which are uncommon than for those which are common. The extent to which an identifier is uncommon or common can be determined empirically from its distribution in the population studied. Numerical values can then be calculated routinely in the process of matching for the amount of agreement or disagreement between the various identifying items on the records. In this way a composite score or match weight can be calculated for each pair of records, indicating the probability that they relate to the same person. In essence, these weights simulate the subjective judgement of a clerk. A detailed discussion of match weights and probability matching can be found in publications by Newcombe (Newcombe et al., 1959; and Newcombe, 1967, 1987, and 1988), and by Gill and Baldwin (1987). (See also Acheson, 1968.)

Calculating the Weights for the Names Items

Name identifiers are weighted in a different fashion to the non-name identifiers, because there are many more variations for correctly spelled names. Analysis of the NHS central register for England and Wales shows that there are:

57,963,992 records

1,071,603 surnames

15,143,043 surname/forename pairs.

The low frequency names were mainly non Anglo-Saxon names, hyphenated names and misspelled names. In general the misspellings were due to embedded vowel changes or to mis-keying. A more detailed examination of the register showed that 954 different surnames covered about 50% of the population, with the following frequency distribution:

10% population: 24 different surnames

20% population: 84 different surnames

30% population: 213 different surnames

40% population: 460 different surnames

50% population: 954 different surnames

60% population: 1,908 different surnames

70% population: 3,912 different surnames

80% population: 10,214 different surnames

90% population: 100,000 different surnames

100% population: 1,071,603 different surnames.

Many spelling variations were detected for the common forenames. Using data from the NHSCR register, various forename directories and other sources of forenames, a formal forename lexicon was prepared that contained the well known contractions and nicknames. The problem in preparing the lexicon was whether to include forenames that had minor spelling errors, for example JOHN and JON. This lexicon is being used in the matching algorithm, to convert nicknames and contractions, for example LIZ, to the formal forename ELIZABETH, and both names are used as part of the search strategy.

Calculation of Weights for Surnames

The binit weight calculated from the frequency of the first letter in the surname (26 different values) was found to be too crude for matching files containing over 1 million records. The weights for Smith, Snaith, Sneath, Smoothey, Samuda, and Szabo would all have been set to some low value calculated from the frequency of Smith in the population, ignoring the frequency of the much rarer Szabo. Using the frequencies of all of the 1 million or more different surnames on the master match file is too cumbersome, time consuming to keep up-to-date, and operationally difficult to store during the match run. The list would also have contained all of the one-off surnames generated by bad transcription and bad spelling. A compromise solution was devised by calculating the weights based on the frequency of the ONCA block (8,000 values), with a cut-off value of 1 in 1,000 in order to prevent the very rare and one-off names from carrying very high weights. Although this approach does not get round the problem of the very different names that can be found in the same ONCA block (Block S530 contains Smith, Smithies, Smoothey, Snaith, Sneath, Samuda, Szabo, etc.), it does provide a higher level of discrimination and, in part, accommodates the erroneous names.
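The idea of a frequency-based weight with a cut-off can be sketched as follows. The paper does not give the formula; the sketch assumes that the weight is the base-2 logarithm of the reciprocal of the block's relative frequency, and the function name, argument names and counts are hypothetical. The figures it produces are not intended to reproduce the values of S in Table 1.

```python
import math

def block_weight(block_count, total_count, floor=1 / 1000):
    """Weight for agreement on an ONCA block, capped for very rare blocks.

    Assumption: weight = log2(1 / relative frequency). The floor of
    1 in 1,000 stops one-off names from carrying very high weights.
    """
    freq = max(block_count / total_count, floor)
    return math.log2(1 / freq)

# Hypothetical counts on a file of one million records.
common_weight = block_weight(20_000, 1_000_000)   # a large block
rare_weight = block_weight(1, 1_000_000)          # a one-off name
```

With the floor in place, a name that occurs once in a million records is weighted no more highly than one that occurs once in a thousand, so a one-off misspelling cannot by itself carry a pair over the acceptance threshold.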

The theoretical weight based on the frequency of the surname in the studied population is modified according to the algorithm devised by Knuth-Morris-Pratt (Stephen, 1994; Gonnet and Baeza-Yates, 1991; and Baeza-Yates, 1989), and takes into account the length of the shortest of the two names being compared, the difference in length of the two names, the number of letters agreeing and the number of letters disagreeing. Where the two names are absolutely identical, the weight is set to +2N, but falls down to a lower bound of -2N where the amount of disagreement is quite large.

If the birth surname and present surname are swapped with each other, exploding the file as described previously enables the system to find and access the block containing the records for the appropriate surnames. The weights for the present and birth surname pairs are calculated, then the present surname/birth surname and birth surname/present surname pairs are also calculated. The higher of the two values is used in the subsequent calculations for the derivation of the match weight.
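The handling of swapped surnames can be sketched as below. The paper's own comparison formula is not reproduced here; as a stand-in, the similarity ratio from Python's `difflib.SequenceMatcher` is scaled so that identical names score +2S and names with nothing in common score -2S. Summing the two surname weights for each pairing before taking the higher total is also an assumption, as are the function and field names.

```python
from difflib import SequenceMatcher

def surname_weight(name_a, name_b, s):
    """Scale a 0..1 similarity ratio onto the range -2S..+2S.

    SequenceMatcher is a stand-in for the paper's comparison method.
    """
    ratio = SequenceMatcher(None, name_a.upper(), name_b.upper()).ratio()
    return 2 * s * (2 * ratio - 1)

def best_surname_weight(rec_a, rec_b, s):
    """Weight the straight and the crossed pairings and keep the higher."""
    straight = (surname_weight(rec_a["present"], rec_b["present"], s)
                + surname_weight(rec_a["birth"], rec_b["birth"], s))
    crossed = (surname_weight(rec_a["present"], rec_b["birth"], s)
               + surname_weight(rec_a["birth"], rec_b["present"], s))
    return max(straight, crossed)

# The same woman recorded with her surnames the wrong way round.
rec_a = {"present": "HALL", "birth": "SMITH"}
rec_b = {"present": "SMITH", "birth": "HALL"}
swapped_weight = best_surname_weight(rec_a, rec_b, s=6)
```

Here the straight pairing compares HALL with SMITH twice and scores poorly, while the crossed pairing finds two exact agreements, so the pair is weighted as if the surnames had been recorded in the right order.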

In cases where the marital status of the person is single, i.e., never married, or the sex is male, or the age is less than 16 years, it is normal practice in the UK for the present surname to be the same as the birth surname, and for this reason only the weight for the present surname is calculated and used for the determination of a match.

Forenames

The weights derived for the forenames are based on the frequency of the initial letter of the forename in the population. Since the distributions of male and female forenames are different, there are two sets of weights, one for males and a second for females. Since the forenames can be recorded in any order, the weights for the two forenames are calculated and the higher value is used for the match. Where there are wide variations in the spelling of the forenames, the Daitch-Mokotoff version of Soundex (“Ask Glenda”) is being evaluated for weighting the forenames in a fashion similar to that used for the surnames.

Calculating the Weights for the Non-Names Items

The weights for date of birth, sex, place of birth and NHS number are calculated using the frequency of the item on the ORLS and on the NHSCR file. The weight for the year of birth comparison has been extended to allow for known errors; for example, only a small deduction is made where the two years of birth differ by 1 or 10 years, but the weight is substantially reduced where the years of birth differ by, say, 7 years.
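The treatment of known errors in the year of birth can be sketched as below. The three weight values are purely illustrative defaults chosen for the sketch and are not the values used for the ORLS; only the pattern (a small deduction for differences of 1 or 10 years, a large one otherwise) comes from the text.

```python
def year_of_birth_weight(year_a, year_b,
                         agree=5.0, small_deduction=1.0, disagree=-5.0):
    """Weight a year-of-birth comparison, tolerating common slips.

    All numeric defaults are hypothetical, for illustration only.
    """
    diff = abs(year_a - year_b)
    if diff == 0:
        return agree
    if diff in (1, 10):
        # Differences of 1 or 10 years are treated as known errors.
        return agree - small_deduction
    return disagree

exact = year_of_birth_weight(1952, 1952)
off_by_one = year_of_birth_weight(1952, 1953)
off_by_ten = year_of_birth_weight(1952, 1962)
off_by_seven = year_of_birth_weight(1952, 1959)
```

A difference of one year or of ten years is weighted almost as highly as exact agreement, whereas a difference of seven years is treated as a disagreement.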

The weight for the street address is based on the first 8 characters of the full street address, where these characters signify a house number (31, High Street), or house name (High Trees), or indeed a public house name (THE RED LION). Terms like “Flat” or “Apartment” are ignored and other parts of the address are then used for the comparison. The postcode is treated and weighted as a single field although the inward and outward parts of the code can be weighted and used separately.

The range of binit weights used for the ORLS is shown in Table 1.

When the matching item is present on both the records, a weight is calculated expressing the amount of agreement or disagreement between the item on the data record and the corresponding item on the master file record.

Table 1. —The Range of Binit Weights Used for Matching

Identifying Item

Score in Binits1

 

Exact Match

Partial Match

No Match

Surnames:

Birth

+2S

+2S to -2S

-2S

 

Present2

+2S

+2S to -2S

-2S

 

Mother's birth

+2S

+2S to -2S

-2S

 

(where: common surname S = 6, rare surname S = 17)

Forenames3

 

+2F

+2F to -2F

-2F

 

(where: common forename F = 3, rare forename F = 12)

NHS number

+7

NP4

0

Place of birth (code)

+4

+2

-4

Street address5

+7

NP

0

Post Code

+4

NP

0

GP (code)

+4

+2

0

Sex6

+1

NP

-10

Date of birth

+14

+13 -> -22

-23

Hospital and Hospital unit number

+7

NP

-9

1 Where an item has been recorded as not known, the field has been left blank, or filled with an error flag, the match weight will be set to 0, except for special values described in the following notes.

2 Where the surname is not known or has been entered as blank, the record cannot be matched in the usual way, but is added to the file to enable true counts of all the events to be made.

3 Forename entries, such as boy, girl, baby, infant, twin, or not known, are weighted as -10.

4 Where the weight is shown as NP (not permissible), this partially known value cannot be weighted in the normal fashion and is treated as a NO MATCH.

5 No fixed abode is scored 0.

6 Where sex is not known, blank, or in error, it is scored -10. (All records input to the match are checked against forename/sex indices and the sex is corrected where it is missing or in error.)

It is possible for the calculated weight to become negative where there is extreme disagreement between the item on the data record and the corresponding item on the master file. In matching street address, postcode and general practitioner, the score cannot go negative, although it can be zero. The individual may have changed their home address or their family doctor since they were last entered into the system; this is a change in family circumstances rather than an error in the data, so a negative weight is not justified.
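As a rough sketch of how the Table 1 scores might be looked up and summed for the non-names items, consider the following. The weights are copied from Table 1; the dictionary layout, the outcome labels and the function names are assumptions made for illustration, and only a subset of the items is shown.

```python
# Exact / partial / no-match weights taken from Table 1.
# None marks a partial match that Table 1 lists as NP (not permissible).
TABLE_1 = {
    "nhs_number":     {"exact": 7, "partial": None, "none": 0},
    "place_of_birth": {"exact": 4, "partial": 2,    "none": -4},
    "street_address": {"exact": 7, "partial": None, "none": 0},
    "postcode":       {"exact": 4, "partial": None, "none": 0},
    "gp_code":        {"exact": 4, "partial": 2,    "none": 0},
    "sex":            {"exact": 1, "partial": None, "none": -10},
}

# Table note 6: an unknown sex is scored -10 rather than 0.
MISSING_OVERRIDES = {"sex": -10}


def item_weight(item, outcome):
    """outcome is one of 'exact', 'partial', 'none' or 'missing'."""
    if outcome == "missing":
        # Table note 1: unknown or blank items score 0 unless a note says otherwise.
        return MISSING_OVERRIDES.get(item, 0)
    weight = TABLE_1[item][outcome]
    if weight is None:
        # Table note 4: NP is treated as a no match.
        weight = TABLE_1[item]["none"]
    return weight


def non_names_weight(outcomes):
    """Algebraic sum of the item weights for one record pair."""
    return sum(item_weight(item, outcome) for item, outcome in outcomes.items())


pair = {"nhs_number": "exact", "place_of_birth": "partial",
        "postcode": "none", "sex": "exact"}
print(non_names_weight(pair))
```

Note that street address, postcode and GP code have a no-match score of 0 in the table, so a change of address or doctor never pushes the total below what the other items give.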

Threshold Weighting

The procedure for deciding whether two records belong to the same person was first developed by Newcombe, Kennedy, Axford, and James (1959), and rigorously examined by Copas and Hilton (1990), Belin and Rubin (1995), and Winkler (1995). The decision is based on the total binit weight, derived by summing algebraically the individual binit weights calculated from the comparisons of each identifying item on the master file and data file. The algebraic sum represents a measure of the probability that two records match. By comparing the total weight against a set of values determined empirically, it is possible to determine whether the two records being compared refer to the same person.

Two types of error can occur in record matching. The first, false negative matching or Type I error, is the more common and is a failure to collate records which refer to the same person and should have the same system number; instead, the person is assigned two or more person/system numbers and their records are not collated together. The second, false positive matching or Type II error, is less common but potentially more serious: the same system number is allocated to two or more persons, and their records are wrongly collated together. The frequency of both types of error is a sound measure of the reliability of the record matching procedure.

In preparing earlier versions of the ORLS linked files, a range of binit weights was chosen and used to select records for clerical scrutiny. This range was delimited by the upper and lower pre-set thresholds (see Figure 1). The false positive and false negative rates are very sensitive to the threshold cut-off weight: too high gives a very low false positive rate and a high false negative rate; too low gives a high and unacceptable false positive rate with a low false negative rate. The values selected for the threshold cut-off are, of course, arbitrary, but must be chosen with care, having considered the following objectives:

  • The minimisation of false positives, at the risk of increased missed matches;

  • The minimisation of missed matches, at the risk of increased false positives; and

  • The minimisation of the sum of false positives and missed matches.
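The third objective can be illustrated with a small sketch that counts both kinds of error for a set of candidate cut-offs and picks the one with the smallest sum. The sample data, the function names and the rule "accept when weight is at or above the cut-off" are assumptions for illustration; in practice the labels would come from clerical scrutiny.

```python
def error_counts(pairs, cutoff):
    """pairs: list of (total_binit_weight, is_true_match).

    A pair is accepted as a match when its weight is >= cutoff.
    Returns (false_positives, missed_matches).
    """
    false_pos = sum(1 for w, m in pairs if w >= cutoff and not m)
    false_neg = sum(1 for w, m in pairs if w < cutoff and m)
    return false_pos, false_neg


def best_cutoff(pairs, candidates):
    """Candidate cut-off minimising false positives + missed matches."""
    return min(candidates, key=lambda c: sum(error_counts(pairs, c)))


# Hypothetical clerically checked sample.
sample = [(5, False), (12, False), (18, True), (22, False),
          (25, True), (31, True), (40, True)]

for c in (10, 15, 20, 30):
    print(c, error_counts(sample, c))
print("chosen:", best_cutoff(sample, [10, 15, 20, 30]))
```

The other two objectives would replace the `sum(...)` key with the false positive count or the missed match count alone.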

Figure 1. —Frequency Distribution of the Binit Weights for Pairs of Records

The simple approach for the determination of a match based on the algebraic sum of the binit weights ignores the fact that the weight calculated for names is based on the degree of commonness of the name, and is passed on from other members of the family, whereas the weights for the non-names items are based on distributions of those items in the population, all values of which are equally probable.

An unusual set of rare names information would generate high weights which would completely swamp any weights calculated for the non-names items in the algebraic total, and conversely, a common name would be swamped by a perfect and identical set of non-names identifiers. This would make it difficult for the computer algorithm to differentiate between similarly named members of the population without resort to clerical assistance.

In the determination of the match threshold, a number of approaches have been developed, the earliest being the two stage primary and secondary match used in building the early ORLS files, through a graphical approach developed in Canada for the date of birth, to the smoothed two dimensional grid approach developed by the UHCE and used for all its more recent matching and linking (Gill et al., 1993; Vitter and Wen-Chin, 1987).

Algebraic Summation of the Individual Match Weights

In recent years we have, therefore, developed an approach in which a two dimensional “orthogonal” matrix is prepared, analogous to a spreadsheet, with the names scores forming one axis and the non-names scores the other axis. In the development of the method, sample runs are undertaken; pairs of records in cells in the matrix are checked clerically to determine whether they do or do not match; and the probability of matching is derived for each cell in the sample. These probabilities are stored in the cells of an “orthogonal” matrix designated by the coordinates (names score, non-names score). The empirical probabilities entered into the matrix are further interpolated and smoothed across the axes using linear regression methods.

Match runs using similar data types would access the matrix and extract the probability score from the cell designated by the coordinates. The array of probabilities can be amended after experience with further runs, although minor tinkering is discouraged. Precise scores and probabilities may vary according to the population and record pairs studied. A number of matrices have therefore been prepared for the different types of event pairs being matched, for example, hospital to hospital records, hospital to death records, birth to hospital records, hospital and District Health Authority (DHA) records, cancer registry and hospital records, and so on.

Over 200,000 matches were clerically scrutinized and the results recorded in the two axes of an orthogonal matrix, with the algebraic sum of the weights for the names items being the X coordinate (“X” axis), and the algebraic sum of the fixed and variable statistics items plotted on the Y coordinate (“Y” axis). In each cell of the orthogonal matrix the results of the matches were recorded, with each cell holding the total number of matches, the number of good matches and the number of non-matches. A sample portion of the matrix is shown in Figure 2.
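A minimal sketch of such a matrix, assuming integer weights and a plain dictionary keyed by (names weight, non-names weight), is given below. The names `build_matrix` and `match_probability` are hypothetical, and the interpolation and smoothing step described earlier is not shown.

```python
from collections import defaultdict


def build_matrix(reviewed_pairs):
    """reviewed_pairs: iterable of (names_weight, non_names_weight, is_match),
    where is_match is the clerical decision for that record pair."""
    cells = defaultdict(lambda: {"total": 0, "good": 0, "non": 0})
    for names_w, non_names_w, is_match in reviewed_pairs:
        cell = cells[(names_w, non_names_w)]
        cell["total"] += 1
        cell["good" if is_match else "non"] += 1
    return cells


def match_probability(cells, names_w, non_names_w):
    """Empirical probability of a match for one cell, or None if the
    cell has no clerically checked pairs."""
    cell = cells.get((names_w, non_names_w))
    if not cell:
        return None
    return cell["good"] / cell["total"]


reviewed = [(30, 20, True), (30, 20, True), (30, 20, False), (10, 5, False)]
cells = build_matrix(reviewed)
print(match_probability(cells, 30, 20))
```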

A graphical representation of the matrix is shown in Figure 3, where each cell contains the empirical decision about the likelihood of a match between a record pair. The good matches are shown as “Y,” the non-matches as “N” and the doubtful matches that require clerical intervention as “Q.” This graph is the positive quadrant where both the names and non-names weights are greater than zero. In the microcomputer implementation of the software, this graph is held as a text file and can be edited using word-processing software.

Figure 2. —Sample Portion of the Threshold Acceptance Matrix Showing the Number of Matches and Nonmatches, by Binit Weight for Names and Non-names Identifiers

Record pairs with weights that fall in the upper right part of the matrix, shown in Figure 3 as “Y,” are considered to be “good” matches and only a 1% random sample is printed out for clerical scrutiny. Record pairs with weights that fall between the upper and lower thresholds, shown in the figure as “Q,” are considered to be “query” matches; all of these record pairs are printed out for clerical scrutiny and the results keyed back into the computing system. Record pairs with weights falling below the lower threshold, shown on the map as “N,” are considered to belong to two different people, and a 1% random sample is taken of record pairs that fall adjacent to the N-Q boundary.
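The lookup of a Y/Q/N decision from a text-held grid might be sketched as follows. The grid contents, the band width of 10 binits per cell, the row and column orientation, and the clamping of out-of-range weights are all assumptions made for illustration; the real file layout is not described here. The 1% sample of N pairs adjacent to the N-Q boundary is omitted for brevity.

```python
import random

# Hypothetical grid: top row = highest non-names band,
# leftmost column = lowest names band.
GRID_TEXT = """\
NQYY
NQQY
NNQQ
NNNN
"""


def load_grid(text):
    rows = [line for line in text.splitlines() if line.strip()]
    rows.reverse()                      # rows[0] becomes the lowest band
    return rows


def decide(grid, names_w, non_names_w, band=10):
    """Return 'Y', 'Q' or 'N' for a pair of weights."""
    col = min(max(int(names_w // band), 0), len(grid[0]) - 1)
    row = min(max(int(non_names_w // band), 0), len(grid) - 1)
    return grid[row][col]


def send_to_clerks(decision, rng=random):
    """All 'Q' pairs are scrutinised; 'Y' pairs are sampled at 1%."""
    if decision == "Q":
        return True
    if decision == "Y":
        return rng.random() < 0.01
    return False


grid = load_grid(GRID_TEXT)
print(decide(grid, 35, 35), decide(grid, 15, 25), decide(grid, 5, 35))
```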

At the end of each computer run, the results of the clerical scrutiny are pooled with all the existing matching results and new matrices are prepared. The requirement is to reduce the “Q” zone to the minimum consistent with the constraints of minimum false positives and false negatives. Clerical intervention is invariably the most costly and rate-determining stage.

Figure 3. —A Sample Portion of the Matrix Used for Matching Hospital Records with Hospital Records

Separate matrices have been modelled for the different types of record pairs entering the system, for example:

  • hospital discharge/hospital discharge

  • hospital discharge/death record

  • birth record/hospital discharge

  • hospital discharge/primary care/FHSA record

  • hospital discharge/Cancer registry.

Further matrices have also been prepared that record the number of match items used in matching a record pair, for example, number of surnames, forenames and numbers of other matching variables. Since the number of matrices can become quite large, intelligent systems and neural net techniques are being developed for the interpretation of the N dimensional matrices and the determination of the match threshold (Kasabov, 1996; Bishop, 1995).

Special procedures have been developed for the correct matching of similarly-named same sex twins. Where the match weights fall within the clerical scrutiny area, the clerks are able to identify the two records involved and take the appropriate action.

The marked records are printed out for clerical scrutiny and the match amended where necessary. This situation also arises where older people are recorded in the information system under a given set of forenames but, on a subsequent hospital admission or when they die, a different set of forenames are reported by the patient or by the next of kin.

Linking

The output from the matching run is a text file that contains details about each pair of records that were matched together. A sample portion of this file is shown in Figure 4, the layout of which is:

  • Details of data record: person/system number; accession number; record type.

  • Details of main file record: person/system number; accession number; record type.

  • Details about the match run: output stream (good match or query match); names weight; non-names weight; cross-reference to the clerical printout; matching probability/decision (either Y or N).

The number of records written to the output file for any one person can be very large, and is approximately the number of records on the data file multiplied by the number of records on the master file. Using combinational and heuristic algebraic methods these records are reduced to a small number for each potential match pair, ideally one (Hu, 1982; Cameron, 1994; Lothaire, 1997; and Pidd, 1996).

Figure 4. —A Sample of the Typical Output from the Match Run

Example of OX-LINK System Number Output

389447756  860895558  GS  229800034  352–68394  GN  2  50  26  (GH1/500001)  Y  O
379194856  858751858  GS  233513082  369890337  GN  2  29  24  (GH1/500002)  Y  O
379194856  858751858  GS  233513082  911759078  TU  2  29  15  (GH1/500003)  Y  O
379194856  858751858  GS  233513082  911759078  TU  2  29  15  (GH1/500004)  Y  O
437096752  781384114  GS  323947927  524582350  BL  2  31  19  (GH1/500005)  Y  O
357816810  726892961  GS  249173530  472792138  GN  2  31  23  (GH1/500006)  Y  O
357816810  726892961  GS  249173530  343537893  GN  2  31  21  (GH1/500007)  Y  O
357816810  726892961  GS  249173530  406349427  GN  2  31  23  (GH1/500008)  Y  O
540814037  883641514  GS  210500551  448983383  GM  2  50  19  (GH1/500009)  Y  O
110463907  559719951  GN  408578989  738005030  GS  2  50  30  (GH1/500010)  Y  O
110463907  262969219  GH  408578989  738005030  GS  2  50  30  (GH1/500011)  Y  O
110463907  63685552   GH  408578989  738005030  GS  2  50  26  (GH1/500012)  Y  O
133714360  188729480  GH  414567239  748873845  GS  2  50  25  (GH1/500013)  Y  O
133714360  205039688  GH  414567239  748873845  GS  2  50  23  (GH1/500014)  Y  O

The rules for undertaking this reduction are:

  • Ideally, all records for the same person will have the same person/system number.

  • The records for a person who has only one set of identification details will be of the following type, where each record only carries one person/system number (A):

A = A = A = A, etc. (= signifies matches with).

  • Where a single woman gets married within the span of the file, records will be recorded under maiden name, person/system number (A) and also under her married name (B). Links will be effected between (A) and (B) and all the records will be converted to person/system number (A). The person/system number (B) will be lost to the system. Future matches will link to either her single or married records, both of which will carry the person/system number (A):

A = A = B = B = A = B, etc.

A being links under her maiden name

B being links under her married name.

  • Where there are records for a woman recorded under her maiden name (A), records that contain details of both her maiden and married name (B), and records under just her married name (C), the chains will be made up of three types of links:

A = A = B = B = C = B = C, etc.

Successive matches will convert all the records to person/system number (A). If the linked file contains records type (A) and (C) only, linkage cannot be effected between (A) and (C) until records of type (B) are captured and linked into the system.

  • Where the person has had many changes of name and marital status, the number of different types of links will increase. Over the 30 year span of the file, links up to 5 deep have been found.

Each record entering the system is given a new, purely arbitrary person/system number from a pool of such numbers. Where the record on the data file matches a record on the master file, the person/system number stored on the master file record overwrites the person/system number on the data record; the original number on the data record is then lost from the system and cannot be re-issued.

Where two sets of records for the same person, but having two different person/system numbers, are brought together during a subsequent matching run, all the records are given the lowest person/system number and any other person/system numbers are destroyed.
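The "lowest number survives" rule can be sketched with a disjoint-set (union-find) structure, one standard way to collapse chains of links such as A = B = C. This is an illustration only, not a description of the combinational and heuristic methods actually used; the function name is hypothetical.

```python
def merge_system_numbers(links):
    """links: iterable of (number_a, number_b) pairs accepted as the
    same person. Returns a dict mapping every person/system number
    seen to the lowest number in its chain of links."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            low, high = sorted((root_a, root_b))
            parent[high] = low              # the lowest number survives

    return {x: find(x) for x in list(parent)}


# Maiden-name records (120) linked to mixed records (300), which are
# linked in turn to married-name records (450).
result = merge_system_numbers([(300, 120), (450, 300), (77, 88)])
print(result)
```

Because the smaller root is always kept when two chains are joined, the surviving number for each person is the lowest one ever issued to any of their records.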

Results

When the matching, linking and clerical stages are completed, the file of linked records will contain two types of error. Firstly, there are records that have matched together but do not belong to the same person; these are known as false positives. Secondly, there are records belonging to the same person that have not been brought together, i.e., that reside on the file under two or more different person identifiers; these are known as false negatives or missed matches.

The false positive rate was estimated using two different methods. Firstly, all the records for a random sample of 5,000 people having two or more records were extracted from the ORLS file and printed out for clerical scrutiny. Secondly, all the record pairs that matched together with high match weights but where the forenames differed, were printed out for clerical scrutiny.

The “false negative or missed match” rate was estimated by extracting a subset of people who had continuing treatment, such as repeated admissions for diabetics, nephritics, etc., and those patients who had died in hospital, where the linked file should contain both the hospital discharge record and the death record.

The latest results from the ORLS file and the Welsh and Oxfordshire Cancer registry files are very encouraging, with the false positive rate being below 0.25 percent of all people on the file, and the missed match rate varying between 1.0 percent and 3.0 percent according to the type of sample investigated. Recent work on matching 369,000 records from a health district with 71 million exploded records from the NHS Central Register has given a false positive rate of between 0.2 and 0.3 percent; the higher figure is produced from records which have very common Anglo-Saxon or Asian names.

The worst false negative rate was found where hospital discharges were matched with the corresponding death record. The identifying information on the hospital discharge was drawn from the hospital master index supplemented by information supplied by the patient or immediate family. The identifying information on the death record is usually provided by the next of kin from memory and old documents.

The completed ORLS file is a serial file that is indexed using the person/system number, and contains the partial identifiers, administrative and socio-demographic variables and clinical items. This file is used for a wide range of epidemiological and health services research studies. For ease of manipulation and other operational reasons, subsets of the file are prepared for specific studies, usually by selecting specified records or record types, or by selecting on geographical area or span of years or on clinical specialty.

Acknowledgments

The Unit of Health Care Epidemiology and the work on medical record linkage are funded by the Research and Development Directorate of the Anglia and Oxford Regional Health Authority. We thank the Office of Population Censuses and Surveys (now the Office for National Statistics) for permission to publish the frequencies of the surnames from the NHS central register.

References

Acheson, E.D. (1967). Medical Record Linkage, Oxford: Oxford University Press.

Acheson, E.D. (ed) (1968). Record Linkage in Medicine, Proceedings of the International Symposium, Oxford, July 1967, London: ES Livingstone Limited.

“Ask Glenda,” Soundex History and Methods, World Wide Web: http://roxy.sfo.com/~genealogysf/glenda.html.

Baeza-Yates, R.A. (1989). Improved String Searching, Software Practice and Experience, 19, 257–271.

Baldwin, J.A. and Gill, L.E. (1982). The District Number: A Comparative Test of Some Record Matching Methods, Community Medicine, 4, 265–275.

Belin, T.R. and Rubin, D.B. (1995). A Method for Calibrating False-Match Rates in Record Linkage, Journal of the American Statistical Association, 90, 694–707.

Bishop, C.M. (1995). Three Layer Networks, in: Neural Networks for Pattern Recognition, United Kingdom: Oxford University Press, 128–129.

Cameron, P.J. (1994). Graphs, Trees and Forests, in: Combinatorics, United Kingdom: Cambridge University Press, 159–186.

Copas, J.R. and Hilton, F.J. (1990). Record Linkage: Statistical Models for Matching Computer Records, Journal of the Royal Statistical Society, Series A, 153, 287–320.

Dolby, J.L. (1970). An Algorithm for Variable Length Proper-Name Compression, Journal of Library Automation, 3/4, 257.

Dunn, H.L. (1946). Record Linkage, American Journal of Public Health, 36, 1412–1416.

Gallian, J.A. (1989). Check Digit Methods, International Journal of Applied Engineering Education, 5, 503–505.

Gill, L.E. and Baldwin, J.A. (1987). Methods and Technology of Record Linkage: Some Practical Considerations, in: Textbook of Medical Record Linkage (Baldwin, J.A.; Acheson, E.D.; and Graham, W.J., eds), Oxford: Oxford University Press, 39–54.

Gill, L.E.; Goldacre, M.J.; Simmons, H.M.; Bettley, G.A.; and Griffith, M. (1993). Computerised Linkage of Medical Records: Methodological Guidelines, Journal of Epidemiology and Community Health, 47, 316–319.

Goldacre, M.J. (1986). The Oxford Record Linkage Study: Current Position and Future Prospects, Proceedings of the Workshop on Computerised Record Linkage in Health Research (Howe, G.R. and Spasoff, R.A., eds), Toronto: University of Toronto Press, 106–129.

Gonnet, G.H. and Baeza-Yates, R. (1991). Boyer-Moore Text Searching, Handbook of Algorithms and Data Structures, 2nd ed., United States: Addison-Wesley Publishing Co. Inc., 256–259.

Hamming, R.W. (1986). Coding and Information Theory, 2nd ed., Englewood Cliffs, NJ: Prentice Hall.

Holmes, W.N. (1975). Identification Number Design, The Computer Journal, 14, 102–107.

Hu, T.C. (1982). Heuristic Algorithms, in: Combinatorial Algorithms, United States: Addison-Wesley Publishing Co. Inc., 202–239.

Kasabov, N.K. (1996). Kohonen Self-Organising Topological Maps, in: Foundations of Neural Networks, Fuzzy Systems, and Knowledge Engineering, Cambridge, MA, USA: MIT Press, 293–298.

Knuth, D.E. (1973). Sorting and Searching, in: The Art of Computer Programming, 3, United States: Addison-Wesley Publishing Co. Inc., 391.

Lothaire, M. (1997). Words and Trees, in: Combinatorics on Words, United Kingdom: Cambridge University Press, 213–227.

Lynch, B.T. and Arends, W.L. (1977). Selection of a Surname Encoding Procedure for the Statistical Reporting Service Record Linkage System, Washington, DC: United States Department of Agriculture.

National Health Service and Department of Health (1990). Working for Patients: Framework for Implementing Systems: The Next Steps, London: HMSO.

Newcombe, H.B. (1967). The Design of Efficiency Systems for Linking Records into Individual and Family Histories, American Journal of Human Genetics, 19, 335–339.

Newcombe, H.B. (1987). Record Linking: The Design of Efficiency Systems for Linking Records into Individual and Family Histories, in: Textbook of Medical Record Linkage (Baldwin, J.A.; Acheson, E.D.; and Graham, W.J., eds), Oxford: Oxford University Press, 39–54.

Newcombe, H.B. (1988). Handbook of Record Linkage: Methods for Health and Statistical Studies, Administration, and Business, Oxford: Oxford University Press.

Newcombe, H.B.; Kennedy, J.M.; Axford, S.J.; and James, A.P. (1959). Automatic Linkage of Vital Records, Science, 130, 3381, 954–959.

Pidd, M. (1996). Heuristic Approaches, Tools for Thinking, Modelling in Management Science, England: John Wiley and Sons, 281–310.

Scheuren, F. and Winkler, W.E. (1996). Recursive Merging and Analysis of Administrative Lists and Data, Proceedings of the Section on Survey Research Methods, American Statistical Association.

Secretaries of State for Health, Wales, Northern Ireland and Scotland (1989). Working for Patients, London: HMSO, CM 555.

Stephen, G.A. (1994). Knuth-Morris-Pratt Algorithm, in: String Searching Algorithms, Singapore: World Scientific Publishing Co. Pte. Ltd, 6–25.

Vitter, J.S. and Wen-Chin, C. (1987). The Probability Model, Design and Analysis of Coalesced Hashing, United Kingdom: Oxford University Press, 22–31.

Wild, W.G. (1968). The Theory of Modulus N Check Digit Systems, The Computer Bulletin, 12, 308–311.

Winkler, W.E. (1995). Matching and Record Linkage, Business Survey Methods (Cox, Binder, Chinnappa, Christianson, Colledge, and Kott, eds.), New York: John Wiley and Sons, Inc., 355–384.

Complex Linkages Made Easy

John R.H.Charlton, Office for National Statistics, UK

Judith D.Charlton, JDC Applications

Abstract

Once valid key fields have been set up, relational database techniques enable complex linkages that facilitate a number of statistical analyses. Using one particular example, a classification of types of linkages is developed and illustrated. The naive user of such data would not necessarily know how to use a relational database to perform the linkages, but may only know the sort of questions they want to ask. To make data (anonymised to protect the confidentiality of patients and doctors) generally accessible, a user-friendly front-end has been written using the above concepts, which provides flat-file datasets (tabular or list) in response to answers to a series of questions. These datasets can be exported in a variety of standard formats. The software will be demonstrated, using a sample of the data.

Introduction

Most of the papers at this workshop are concerned with establishing whether or not different records in the database match. This paper starts from the point where this matching has already been established. It will thus be assumed that the data have already been cleaned, duplicates eliminated, and keys constructed through which linkages can be made. Procedures for matching records when linkage is not certain have been discussed for example by Newcombe et al. (1959, 1988), Fellegi and Sunter (1969), and Winkler (1994). We also assume database software that can:

  • select fields from a file of records;

  • extract from a file either:

    • all records, or

    • distinct records which satisfy specified criteria; and

  • link files using appropriate key and foreign fields.
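These three capabilities correspond to ordinary SQL operations. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the table names, column names and rows are hypothetical and chosen only to illustrate selecting fields, extracting all versus distinct records, and linking through a key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patient (patient_id INTEGER PRIMARY KEY, sex TEXT, age INTEGER);
    CREATE TABLE consultation (
        patient_id   INTEGER REFERENCES patient(patient_id),  -- foreign key
        consult_date TEXT,
        diagnosis    TEXT
    );
    INSERT INTO patient VALUES (1, 'F', 34), (2, 'M', 61), (3, 'F', 8);
    INSERT INTO consultation VALUES
        (1, '1992-01-10', 'asthma'),
        (1, '1992-03-02', 'asthma'),
        (2, '1992-02-14', 'diabetes');
""")

# 1. Select fields from a file of records.
selected = conn.execute(
    "SELECT patient_id, sex FROM patient ORDER BY patient_id").fetchall()

# 2a. Extract all records satisfying a criterion.
all_rows = conn.execute(
    "SELECT patient_id FROM consultation WHERE diagnosis = ?",
    ("asthma",)).fetchall()

# 2b. Extract only the distinct records satisfying the same criterion.
distinct_rows = conn.execute(
    "SELECT DISTINCT patient_id FROM consultation WHERE diagnosis = ?",
    ("asthma",)).fetchall()

# 3. Link two files through the key / foreign key pair.
linked = conn.execute("""
    SELECT c.consult_date, c.diagnosis, p.sex, p.age
    FROM consultation AS c
    JOIN patient AS p ON p.patient_id = c.patient_id
    ORDER BY c.consult_date
""").fetchall()

print(selected, all_rows, distinct_rows, linked, sep="\n")
```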

The purpose of this paper is firstly to illustrate the huge potential of using relational databases for data linkage for statistical analyses. In the process a classification of linkages will be developed, using a particular database to illustrate the points. Some results will be presented by way of example. We will show how the complex linkages required for statistical analyses can be decomposed into a sequence of simple database queries and linkages. Finally a user-friendly program that has been written for extracting a number of different types of dataset for analyses will be described. The advantages and disadvantages of such approaches will be discussed.

Relational databases are ideal for storing statistical data, since they retain the original linkages in the data, and hence the full data structure. They also facilitate linking in new data from other sources, and are economical in data storage requirements. However, most statistical analyses require simple rectangular files, and complex database queries may be required to obtain these.

The linkages required to obtain the flat files for statistical analysis vary from the relatively simple to the extremely complex (Figure 1). Subsets of data files may be found (A), possibly by linkage with another file (B). Derived files may be created by linking files (or their subsets) within the study data files (C), or to files outside the study data (D). The derived files may be further linked to files in or outside the dataset (E), subsets (F), or other derived files (G), to obtain further derived files, and this process may continue at length.

Figure 1. —Illustration of Complex Linkage

The Example Database

In a major survey in England and Wales (MSGP4) some 300 general medical practitioners (GPs) in 60 practices collected data from half a million patients, relating to every face to face contact with them over the course of a year (McCormick et al., 1995). In the UK nearly the entire population is registered with a GP, and patients visit only a doctor in the practice in which they are registered, except in an emergency, when they may attend an accident and emergency department of a hospital or another GP as a temporary patient. For all patients in MSGP4 there was information on age, sex, and postcode. In addition socio-economic data were successfully collected by interview for 83 per cent of the patients on these doctors' registers. There was a core of common questions, but there were also questions specific to children, adults, and married/cohabiting women. Information was also collected about the practices (but not individual GPs). Geographic information related to postcodes was also available. The structure of the data is illustrated in Figure 2 (simplified). MSGP4 was the fourth survey of morbidity in general practice. In previous MSGP surveys output consisted only of a series of tables produced by COBOL programs, and MSGP4 was the first survey for which relational databases were used to provide flexible outputs.

Figure 2. —Example Database (Simplified) —Patient Consultations of General Medical Practitioners

Some Definitions
  • Read Code. —A code used in England and Wales by general practice staff to identify uniquely a medical term. This coding was used in the MSGP project because it is familiar to general practice staff, but it is not internationally recognised and the codes have a structure that does not facilitate verification.

  • ICD Code. —International Classification of Disease. Groups of Read codes can be mapped onto ICD9 codes. For example, Read code F3810 = “acute serous otitis media” maps to ICD A381.0 = “acute nonsuppurative otitis media”. Such mappings form part of the consultation metadata (see below).

  • Consultation. —A “consultation” refers to a particular diagnosis by a particular member of staff on a particular date at a particular location, resulting from a face to face meeting between a patient and doctor/nurse. A “diagnosis” is identified by a single Read code.

  • “Patients Consulting”. —Some registered patients did not consult a doctor or other staff member during the study. “Patients consulting” is therefore a subset of the practice list of all registered patients. Consultations must be carefully distinguished from “patients consulting.” A combination of patient number, date and place of consultation, and diagnosis uniquely defines each record in the consultation file. Patient numbers are not unique because a patient may consult more than once, nor are combinations of patient number and diagnosis unique. On the other hand, a “patient consulting” file will contain at most one record for each patient consulting for a particular diagnosis (or group of diagnoses), no matter how many times that patient has consulted a member of the practice staff. “Consultations” are more relevant when workload is being studied, but if prevalence is the issue then “patients consulting,” i.e., how many patients consulted for the illness, is more useful.

  • Patient Years at Risk. —The population involved in the MSGP project did not remain constant throughout the study. Patients entered and left practices as a result of moving house or for other reasons, and births and deaths also contributed to a changing population. The “patient years at risk” derived variable was created to take account of this. The patient file contains a “days in” variable, which gives the number of days the patient was registered with the practice (range 1–366 days for the study). “Patient years at risk” is “days in” divided by 366, since 1992 was a leap year.
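The “patient years at risk” definition above can be sketched in a few lines. This is only an illustration: the record layout and the field names patient_no and days_in are assumptions made for the example, not the actual MSGP4 file format.

```python
from dataclasses import dataclass
from typing import Iterable

STUDY_DAYS = 366  # the study year contained 29 February 1992


@dataclass
class Patient:
    patient_no: int  # hypothetical field name
    days_in: int     # days registered with the practice during the study


def patient_years_at_risk(patient: Patient) -> float:
    """Return "days in" divided by 366, as described in the text."""
    if not 1 <= patient.days_in <= STUDY_DAYS:
        raise ValueError("days_in must lie in the range 1-366")
    return patient.days_in / STUDY_DAYS


def total_years_at_risk(patients: Iterable[Patient]) -> float:
    """Sum patient years at risk, e.g. to form a rate denominator."""
    return sum(patient_years_at_risk(p) for p in patients)
```

A patient registered for the whole study contributes one patient year, and a patient registered for 183 days contributes half a year.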

Database Structure

To facilitate future analyses some non-changing data were combined at the outset. For example some consultation metadata were added to the consultation dataset, such as International Classification of Disease (ICD) codes and indicators of disease seriousness. The resultant simplified data structure is thus:

Practice: Practice number; information about practice (confidential)

Primary Key: Practice number

A practice is a group of doctors, nurses, and other staff working together. Although patients register with a particular doctor, their records are kept by the practice and the patient may be regarded as belonging to a practice. Data on practice and practice staff are particularly confidential, and not considered in this paper. Individual practice staff consulted are identified in the consultation file by a code.

Patients: Patient number; age; sex; post code; socio-economic data

Primary key: Patient number

Foreign key: Postcode references geographic data

These data were stored as four separate files relating to all patients, adult patients, children, and married/cohabiting women, because different information was collected for each subgroup.

Consultation: Patient number; Practice number; ID of who consulted; date of contact; diagnosis; place of consultation; whether referred to hospital; other consultation information

Primary key: Patient number, doctor ID, date of contact, diagnosis

Foreign keys: Practice number references practice; Patient number references patients; Staff ID references staff (e.g., doctor/nurse).

Episode: For each consultation the doctor/nurse identified whether this was the “first ever,” a “new,” or “ongoing” consultation for that problem. An “episode” consists of a series of consultations for the same problem (e.g., Read code).

Geographically-referenced data: Post codes, ED, latitude/longitude, census ward, local authority, small area census data, locality classifications such as rural/urban, prosperous/inner city, etc.

These data were not collected by the survey, but come from other sources, linked by postcode or higher level geography.

Patient metadata: These describe the codes used in the socio-economic survey (e.g., ethnic group, occupation groups, social class, housing tenure, whether a smoker, etc.)

Consultation metadata: The ReadICD file links Read codes with the corresponding ICD codes. In addition a lookup table links 150 common diseases, immunisations and accidents to their ICD codes. Each diagnosis is classified as serious, intermediate or minor.

Derived files: The MSGP database contains information on individual patients and consultations. To make comparisons between groups of patients, and to standardise the data (e.g., for age differences), it is necessary to generate files of derived data, using database queries and linkages as described below. In some derived files duplicate records need to be eliminated. For example, we may wish to count patients consulting for a particular reason rather than consultations, and hence wish to produce at most one record per patient in a “patients consulting” derived file (see “Some Definitions” above).
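As a rough sketch of the structure described above, the following uses Python's built-in sqlite3 module to create cut-down patient and consultation tables and to derive a “patients consulting” file by eliminating duplicates. The table names, column names, staff identifiers and Read-code placeholders are assumptions made for illustration (the placeholders are not real Read codes), and the rows are invented. The original system was not necessarily built this way.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patient (
    patient_no INTEGER PRIMARY KEY,
    age        INTEGER,
    sex        TEXT,
    postcode   TEXT
);
CREATE TABLE consultation (
    patient_no   INTEGER REFERENCES patient(patient_no),
    staff_id     TEXT,
    contact_date TEXT,
    read_code    TEXT,   -- placeholder values below, not real Read codes
    icd_code     TEXT,   -- consultation metadata combined at the outset
    PRIMARY KEY (patient_no, staff_id, contact_date, read_code)
);
""")

conn.executemany(
    "INSERT INTO patient VALUES (?, ?, ?, ?)",
    [(1, 34, "F", "ZZ1 1AA"), (2, 61, "M", "ZZ1 1AB")],  # invented rows
)
conn.executemany(
    "INSERT INTO consultation VALUES (?, ?, ?, ?, ?)",
    [
        (1, "S01", "1992-01-15", "X-ASTHMA", "493"),
        (1, "S01", "1992-03-02", "X-ASTHMA", "493"),
        (1, "S02", "1992-03-02", "X-DIABETES", "250"),
        (2, "S01", "1992-02-20", "X-ASTHMA", "493"),
    ],
)

# Every consultation counts when workload is being studied.
n_consultations = conn.execute(
    "SELECT COUNT(*) FROM consultation"
).fetchone()[0]

# Derived "patients consulting" file: duplicates are eliminated so that
# each patient appears at most once per diagnosis.
patients_consulting = conn.execute(
    "SELECT DISTINCT patient_no, icd_code FROM consultation "
    "ORDER BY patient_no, icd_code"
).fetchall()
```

In this toy data, patient 1 consults twice for the same diagnosis, so the consultation count exceeds the number of “patients consulting” records.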

Types of Linkage (with Examples)

In this section we classify the possible linkages into three main types, illustrating each with examples based on the MSGP4 study.

Simple Linkage
  • Straightforward data extracts (lists) combining several sources. —

    Example: Making a list of patients with asthma including age, sex and social class for each patient.

  • Observed frequencies.—

    Example: Linking the “all patients” file, and the “consultations” file to count the number of consultations by the age, sex and social class of the patient, or cross-classifying home-visits and hospital referrals with socio-economic characteristics.

  • Conditional data, where the availability of data items depends on the value of another variable.—

    Example: In MSGP4 some data are available only for adults, or children, or married/cohabiting women. Smoking status was only obtained from adult patients, so tabulating “home visits” by “smoking status” by “age,” and “sex” involves linking the “all patients” (to find age and sex), “adult patients” (to find smoking status) and “consultations” (to find home visits) files. Linking the “adult” file to the “all patients” file excludes records for children.

  • Linking files with “foreign” files. — Useful information can often be obtained by linking data in two or more different datasets, where the data files share common codes. For example data referenced by postcode, census ED or ward, or local authority are available from many different sources as described above.

    Example: The MSGP4 study included the postcode of residence for each patient, facilitating studies of neighbourhood effects. The crow-fly distance from the patient's home to the practice was calculated by linking patient and practice postcodes to a grid co-ordinates file and using Pythagoras's theorem. The distance was stored permanently on the patient file for future use.

  • Linking to lookup tables (user-defined and pre-defined). —

    Examples: The information in the MSGP database is mostly held in coded form, with the keys to the codes held in a number of lookup tables linked to the main database. Most of these are quite small and simple (e.g., ethnic group, housing tenure, etc.) but some variables are linked to large tables of standard codes (e.g., occupational codes, country of birth). In some cases the coded information is quite detailed and it is desirable to group the data into broader categories, e.g., to group diagnostic codes into broad diagnostic groups such as ischaemic heart disease (ICD 410–414). For some diseases a group of not necessarily contiguous codes is needed to define a medical condition. A lookup table of these codes, which could be user-defined, can be created to extract the records of interest from the main data. Missing value codes could also be grouped, ages grouped into broad age groups, social classes combined, etc.
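The crow-fly distance example given above under “Linking files with ‘foreign’ files” can be sketched as follows. The postcodes and grid co-ordinates are invented, and the sketch assumes that the grid co-ordinates file supplies eastings and northings in metres on a flat grid, which is what makes Pythagoras's theorem applicable.

```python
import math

# Hypothetical grid co-ordinates file: postcode -> (easting, northing) in metres.
GRID = {
    "ZZ1 1AA": (451000, 206000),  # invented patient postcode
    "ZZ1 9ZZ": (454000, 210000),  # invented practice postcode
}


def crow_fly_km(postcode_a: str, postcode_b: str, grid=GRID) -> float:
    """Straight-line distance in kilometres between two postcodes."""
    e1, n1 = grid[postcode_a]
    e2, n2 = grid[postcode_b]
    # Pythagoras: square root of the summed squared differences.
    return math.hypot(e2 - e1, n2 - n1) / 1000.0


# Link each patient to the practice postcode, then store the distance
# on the patient record for future use, as the text describes.
patients = [{"patient_no": 1, "postcode": "ZZ1 1AA", "practice_postcode": "ZZ1 9ZZ"}]
for patient in patients:
    patient["distance_km"] = crow_fly_km(
        patient["postcode"], patient["practice_postcode"]
    )
```

A postcode missing from the co-ordinates file raises a KeyError here; a production version would need a policy for such unmatched records.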

Auto-Linkage Within a File (Associations Within a File)
  • Different records for the same “individual.” —Records for the same individual can be linked together to analyse patterns or sums of events, or associations between events of different kinds. In general a file is linked to a subset of itself to find records relating to individuals of interest.

    Example: Diabetes is a chronic disease with major complications. It is of interest to examine, for those patients who consulted for diabetes, what other diseases they consulted for. Consultations for diabetes can be found from their ICD code (250). Extracting just the patient identification numbers from this dataset, and eliminating duplicates, results in a list of patients who consulted for diabetes at least once during the year. This subset of the consultation file can be linked with the original consultation file to produce a derived file containing the consultation history of all diabetic patients in the study, which can be used for further analysis. Note that in this example only the consultation file (and derived subsets) has been used.

  • Different records for same households/other groups. —

    Example: Information on households was not collected as part of MSGP4. However “synthetic” households can be constructed, using postcode and socio-economic data, where the members of the same “household” must, by definition, share the same socio-economic characteristics and it would be rare for two distinct households to have exactly the same characteristics. These “households” can be used to discover how the behaviour of one “household” member may affect another. For example, we can examine the relationship between smoking by adults, and asthma in children. Clearly in this example some sort of check needs to be made on how accurately “households” can be assembled from the information available and the algorithm used.

  • Temporal relationships. —Files containing “event” data can be analysed by studying temporal patterns relating to the same individual.

    Example:

    • The relationship between exposure to pollution or infection and asthma can be studied in terms of both immediate and delayed effects. Consultations for an individual can be linked together and sorted by date, showing temporal relationships.

    • The duration of clinical events can sometimes be determined by the sequence of consultations. In MSGP4 each consultation for a particular medical condition was labelled “first ever,” “new,” or “ongoing” and the date of each consultation recorded. Survival analysis techniques cater for these types of data.
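The diabetes example of auto-linkage described above can be sketched with plain Python structures. The consultation rows are invented and the tuple layout (patient number, date, ICD code) is an assumption made for the example. Only the consultation file and subsets derived from it are used, as in the text.

```python
# Invented consultation rows: (patient number, date of contact, ICD code).
consultations = [
    (1, "1992-01-10", "250"),  # diabetes
    (1, "1992-03-02", "362"),
    (2, "1992-02-14", "493"),
    (3, "1992-05-20", "250"),
    (3, "1992-06-01", "250"),
    (3, "1992-04-01", "401"),
]

# Step 1: extract patient numbers for diabetes consultations (ICD 250);
# using a set eliminates duplicates.
diabetic_ids = {patient_no for patient_no, _, icd in consultations if icd == "250"}

# Step 2: link that subset back to the full consultation file to obtain
# the complete consultation history of the diabetic patients.
history = [row for row in consultations if row[0] in diabetic_ids]

# Step 3: sort by patient and then date so that temporal relationships
# between consultations for the same individual become visible.
history.sort(key=lambda row: (row[0], row[1]))
```

Dates are held as ISO-format strings so that sorting them as text also sorts them chronologically.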

Complex Linkages

Linkages that are combinations of the two types of linkage previously described could be termed “complex linkages.” These can always be broken down into a sequence of simpler linkages. A number of examples of complex linkages are given, in order of complexity.

  • Finding subsets through linkage. —

    Example: In the MSGP4 data this is particularly useful in the study of chronic conditions such as diabetes and heart disease. Linking the file of patients consulting for diabetes, described above under “Auto-Linkage Within a File,” with the patient dataset results in a subset of the patient file, containing only socio-economic details of diabetic patients.

  • Linking a derived file to a lookup table and other files. —

    Example: Diabetes is particularly associated with diseases of the eye (retinopathy), kidney, nervous system and cardiovascular system. It is of interest to analyse the relationship between diabetes and such diseases. In this slightly more complex situation it is necessary to create a lookup table containing the diseases of interest and their ICD codes and link this to the “consultations by diabetic patients” file to create a further subset of the consultation file containing consultations for diabetes and its complications. It is likely that this file, as well as the simpler one described above, would be linked to the patient file to include age, sex and other patient characteristics before analysis using conventional statistical packages.

  • Linking a derived file with another derived file. —

    Example:

    • Rates for groups of individuals. — Rates are found by linking a derived file of numerators with a derived file of denominators. The numerators are usually found by linking the patient and consultation files, for example, age, sex, social class or ethnic group linked to diagnosis, referral or home visits. Denominators can be derived from the patient file (patient years at risk) or the consultation file (consultations or patients consulting) for the various categories age, sex, etc.

    • Standardised ratios. — This is the ratio of the number of events (e.g., consultations or deaths) observed in a sub-group to the number that would be expected if the sub-group had the same age-sex-specific rates as a standard population (e.g., the whole sample), multiplied by 100. Examples of sub-groups are different ethnic groups or geographical areas. The calculation of standard population rates involves linking the whole population observed frequencies to whole population patient years at risk. Each of these is a derived file, and the result is a new derived file. Calculating expected numbers involves linking standard population rates to the sub-group's “years at risk” file. This produces two new derived files, “Observed” and “Expected.” Age-standardised patient consulting ratios are obtained by linking these two derived files together, using outer joins to ensure no loss of “expected” records where there are no observed events in some age-sex categories.

  • Establishing population rates for a series of nested definitions. —

    Example: Individuals at particular risk from influenza are offered vaccination. In order to estimate how changes in the recommendations might affect the numbers eligible for vaccination, population rates for those living in their own homes were estimated for each of several options. People aged 65 and over living in communal establishments are automatically eligible for vaccination, and hence were selected out and treated separately. The options tested were to include patients with:

    A. any chronic respiratory disease, chronic heart disease, endocrine disease, or immunosuppression;

    B. as A but also including hereditary degenerative diseases;

    C. as B but also including thyroid disease;

    D. as C but also including essential hypertension.

      The MSGP dataset was used to estimate the proportion of the population in need of vaccination against influenza according to each option. The problem was to find all those patients who had consulted for any of the diseases on the list, taking care not to count any patient more than once. This involved creating a lookup table defining the disease groups mentioned in options A-D, linking this to the consultation dataset, eliminating duplicates and linking this to the patient dataset (to obtain age-group and sex), and then doing a series of queries to obtain appropriate numerator data files. A denominator data file was separately obtained from the patient dataset to obtain patient years at risk, by age-group and sex. The numerator and denominator files were then joined to obtain rates. These rates were then applied to census tables to obtain the estimated numbers of patients eligible for vaccination under assumptions A-D.

  • Record matching for case-control studies. — These are special studies of association, extracting “cases” and “controls” from the same database.

    Example: What socio-economic factors are associated with increased risk of Crohn's disease? All patients who consulted for ICD 555 (regional non-infective enteritis) during the MSGP4 study were selected and referred back to their GP to confirm that they were genuine cases of Crohn's disease. Patients who were not confirmed as having Crohn's disease were then excluded. This resulted in 294 cases. Controls were selected from patients who did not have the disease: those who matched cases for practice, sex and month and year of birth. In each of two practices there were two cases who were of the same sex and the same month and year of birth. In each of these practices the controls were divided randomly between these cases as equally as possible. There were 23 cases for whom no controls could be found using these criteria. In 20 of these cases it was possible to find controls who matched on practice and sex and whose date of birth was within two months of the case's date of birth. The remaining three cases were excluded from the analysis. This procedure resulted in 291 cases and 1682 controls.
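The standardised ratio calculation described above under “Linking a derived file with another derived file” can be illustrated with a small worked sketch. All counts and the age-sex categories are invented, and the derived files are represented as dictionaries keyed by category. The outer join is emulated by iterating over the “Expected” file and treating a missing “Observed” record as zero.

```python
# Derived files for the standard (whole) population, keyed by age-sex category.
std_events = {("0-44", "F"): 200, ("45+", "F"): 300}        # observed frequencies
std_pyar = {("0-44", "F"): 10000.0, ("45+", "F"): 5000.0}   # patient years at risk

# Derived files for the sub-group of interest.
sub_observed = {("45+", "F"): 18}  # no events observed in the 0-44 category
sub_pyar = {("0-44", "F"): 500.0, ("45+", "F"): 200.0}

# Link observed frequencies to years at risk to get standard population rates.
std_rates = {cat: std_events[cat] / std_pyar[cat] for cat in std_pyar}

# Link the standard rates to the sub-group's years at risk to get "Expected".
expected = {cat: std_rates[cat] * sub_pyar[cat] for cat in sub_pyar}

# Outer join of "Observed" and "Expected": every expected record is kept.
joined = {cat: (sub_observed.get(cat, 0), exp) for cat, exp in expected.items()}
total_observed = sum(obs for obs, _ in joined.values())
total_expected = sum(exp for _, exp in joined.values())
standardised_ratio = 100 * total_observed / total_expected

# For contrast, an inner join silently drops the category with no observed events.
inner_expected = sum(expected[cat] for cat in sub_observed)
inner_ratio = 100 * total_observed / inner_expected
```

With these invented figures the expected count is 22, giving a ratio of about 82. An inner join would keep only 12 expected events and report a ratio of 150, which is why the text insists on outer joins.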

User-Friendly Linkage Software

The MSGP4 practice software was originally written so that participating practices could gain access to the data collected from their own practice. The software was designed to be used easily by people with no knowledge of database technology, and because it runs directly under DOS or Windows, no specialised database software is needed. The structure of the MSGP database is transparent to the user, who can refer to entities (e.g., diseases or occupation) by name rather than by code.

Later, a modified version of the software was developed to enable researchers to use the complete dataset (60 practices).

Although it may be possible for some of these linkages to be performed as a single query, it is generally best to do a series of simple linkages, for two reasons. Firstly, database software may create large temporary files of cross products, which is time consuming and may lead to memory problems. Secondly, queries involving complex linkages are often difficult to formulate and may easily turn out to be incorrect. The order in which the linkages are performed is also important for efficiency. In general, only the smallest possible files should be linked together. For example, rather than linking the patient and consultation files together and then finding the diseases and patient characteristics of interest, it is better to find the relevant subsets of the two files first and then link them together.
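The point about ordering can be illustrated with a toy sketch in which both orders give the same answer but the intermediate linked file differs in size. The data, field names and the link helper are invented for the example and do not reproduce the MSGP4 software.

```python
# Invented patient file: odd patient numbers are female, even are male.
patients = [
    {"patient_no": i, "sex": "F" if i % 2 else "M", "age": 20 + i}
    for i in range(1, 7)
]
# Invented consultation file: (patient number, ICD code).
consultations = [(1, "493"), (1, "250"), (2, "493"), (3, "401"), (5, "493"), (5, "493")]


def link(pats, cons):
    """Link consultations to patients on patient number (hypothetical helper)."""
    by_no = {p["patient_no"]: p for p in pats}
    return [(by_no[no], icd) for no, icd in cons if no in by_no]


# Order 1: link the complete files, then select women consulting for ICD 493.
linked_all = link(patients, consultations)
result_late = [
    (p["patient_no"], icd)
    for p, icd in linked_all
    if p["sex"] == "F" and icd == "493"
]

# Order 2: find the relevant subset of each file first, then link them.
women = [p for p in patients if p["sex"] == "F"]
asthma = [c for c in consultations if c[1] == "493"]
linked_small = link(women, asthma)
result_early = [(p["patient_no"], icd) for p, icd in linked_small]
```

Here the second order links three records rather than six. On files the size of the MSGP4 consultation file the saving would be correspondingly larger, although the exact gain depends on the database software used.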

The software performs the required linkages and then analyses the data in two stages. The first part of the program performs the sequence of linkages and queries needed to find subsets required for the second stage, and the second part performs the analyses and displays the output. The data flow through the program is shown in Figure 3.

It can be seen from the diagram that any of the three input files may be linked to themselves or to either of the others in any combination to form subsets of the data, or the entire dataset can be used.

Finding Subsets
  • The program enables the user to find any combination of characteristics required, simply by choosing the characteristic from menus. The program finds subsets of individual files, as well as linking files in the dataset to each other and to lookup tables, and finding subsets of one file according to data in another. For example the program can produce a list of young women with asthma who live in local authority accommodation, or of patients with a particular combination of diagnoses. It is also possible to examine the data for a particular group of people (for example, one ethnic group), or for a particular geographical area.

  • Dealing with missing values. —When the data for MSGP4 were collected it was not possible to collect socio-economic data for all patients. The user is given the option to exclude missing values, or to restrict the data to missing values only should they want to find out more about those patients for whom certain information is missing. For example, an analysis of the frequency of cigarette smoking in each age/sex group in the practice might include only those patients for whom smoking information is available.

The Output

The output from the program is of three types, any of which may be exported by the program in a variety of formats (e.g., WK1, DBF, TXT, DB) for further statistical analyses.

  • Lists output consists of one record for each patient, consultation or episode of interest, with files linked together as appropriate. Each record contains a patient number together with any other information that the user has requested. These flat files can be used for further analysis using spreadsheet or statistical software.

  • Frequency output consists of counts of the numbers of patients, consultations or episodes in each of the categories defined by the fields selected by the user.

  • Rate output enables a variety of rates with different types of numerators and denominators to be calculated. Any of the following rates may be chosen: diagnostic rates for a specified diagnostic group (patients consulting; consultations; episodes); referral rates; and home visit rates. Rates are generally calculated for standard age and sex groups but other appropriate patient and consultation characteristics may be included in the analysis. Denominators can be consultations, patients consulting or patient years at risk.

Figure 3. —Data-flow Diagram for MSGPX Data Extractor Program

Discussion and Conclusions

We have demonstrated through the use of one example database the potential that relational databases offer for storing statistical data. These are also the natural way to capture the data, since they reflect real data relationships, and are economical in storage requirements. They also facilitate linking in new data from other sources. However most statistical analyses require simple rectangular files, and complex database queries may be required to obtain these. We have shown that such complex linkages can be decomposed into a sequence of simple linkages, and user-friendly software can be developed to make such complex data readily available to users who may not understand the data structure or relational databases fully. The major advantage of such software is that the naïve user can be more confident in the results than if they were to extract the data themselves. They can also describe their problem in terms closer to natural language.

Although such programs enable the user with no knowledge of database technology to perform all the linkages shown above, they do have their limitations. Choosing options from several dialogue boxes is simple but certainly much slower than performing queries directly using SQL, Paradox or other database technology. Since the most efficient way to perform a complex query depends on the exact nature of the query, the program will not always perform queries in the most efficient order. The user is also restricted to the queries and tables defined by the program, and as more options are added the program must of necessity become more unwieldy and possibly less efficient.

User-friendly software remains, however, useful for the casual user who may not be familiar with the structures of a database, and essential for the user who does not have access to or knowledge of database technology.

References

Fellegi, I.P. and Sunter, A.B. (1969). A Theory for Record Linkage, Journal of the American Statistical Association, 64:1183–1210.

McCormick, A.; Fleming, D.; and Charlton, J. (1995). Morbidity Statistics from General Practice, Fourth National Study 1991–92, Series MB5, no 3, London: HMSO.

Newcombe, H.B. (1988). Handbook of Record Linkage: Methods for Health and Statistical Studies, Administration, and Business, Oxford: Oxford University Press.

Newcombe, H.B.; Kennedy, J.M.; Axford, A.P.; and James, A.P. (1959). Automatic Linkage of Vital Records, Science, 130:954–959.

Winkler, W.E. (1994). Advanced Methods of Record Linkage, American Statistical Association, Proceedings of the Section on Survey Research Methods, 467–472.

Tips and Techniques for Linking Multiple Data Systems: The Illinois Department of Human Services Consolidation Project

John Van Voorhis, David Koepke, and David Yu

University of Chicago

Abstract

This project involves the linkage of individuals across more than 20 state-run programs including TANF (AFDC), Medicaid, JOBS, Child Protection, Child Welfare Services, Alcohol and Substance Abuse programs, WIC, and mental health services. The count before linking is over 7.5 million records of individuals. Unduplicating the datasets leaves 5.9 million records, and the final linked dataset contains records for 4.1 million individuals. This study will provide the basic population counts for the State of Illinois's planning for the consolidation of these programs into a new Department of Human Services.

In the context of linking multiple systems, we have done a number of different things to make using AutoMatch easier. Some features of the process relate to standardized file and directory layouts, automatically generating match scripts, “data improvement” algorithms, and false match detection.

The first two issues, files and directories and scripts, are primarily technical, while the last two issues have more general substantive content in addition to the technical matter.

Properly laying out the tools for a matching project is a critical part of its success. Having a standard form for variable standardization, unduplication and matching provides a firm and stable foundation for linking many files together. Creating additional automation tools for working within such standards is also well worth the time it takes to make them.

With multiple sources of data it is possible to improve the data fields for individuals who are linked across multiple datasets. We will discuss both how we extract the information needed for such improvements and how we use it to improve the master list of individuals. One particular example of these improvements involves resolving the false linking of family members.

×
Page 29
Suggested Citation:"Chapter 2 Invited Session on Recorded Linkage Applications for Epidemiological Research." National Research Council. 1999. Record Linkage Techniques -- 1997: Proceedings of an International Workshop and Exposition. Washington, DC: The National Academies Press. doi: 10.17226/6491.
×
Page 30
Suggested Citation:"Chapter 2 Invited Session on Recorded Linkage Applications for Epidemiological Research." National Research Council. 1999. Record Linkage Techniques -- 1997: Proceedings of an International Workshop and Exposition. Washington, DC: The National Academies Press. doi: 10.17226/6491.
×
Page 31
Suggested Citation:"Chapter 2 Invited Session on Recorded Linkage Applications for Epidemiological Research." National Research Council. 1999. Record Linkage Techniques -- 1997: Proceedings of an International Workshop and Exposition. Washington, DC: The National Academies Press. doi: 10.17226/6491.
×
Page 32
Suggested Citation:"Chapter 2 Invited Session on Recorded Linkage Applications for Epidemiological Research." National Research Council. 1999. Record Linkage Techniques -- 1997: Proceedings of an International Workshop and Exposition. Washington, DC: The National Academies Press. doi: 10.17226/6491.
×
Page 33
Suggested Citation:"Chapter 2 Invited Session on Recorded Linkage Applications for Epidemiological Research." National Research Council. 1999. Record Linkage Techniques -- 1997: Proceedings of an International Workshop and Exposition. Washington, DC: The National Academies Press. doi: 10.17226/6491.
×
Page 34
Suggested Citation:"Chapter 2 Invited Session on Recorded Linkage Applications for Epidemiological Research." National Research Council. 1999. Record Linkage Techniques -- 1997: Proceedings of an International Workshop and Exposition. Washington, DC: The National Academies Press. doi: 10.17226/6491.
×
Page 35
Suggested Citation:"Chapter 2 Invited Session on Recorded Linkage Applications for Epidemiological Research." National Research Council. 1999. Record Linkage Techniques -- 1997: Proceedings of an International Workshop and Exposition. Washington, DC: The National Academies Press. doi: 10.17226/6491.
×
Page 36
Suggested Citation:"Chapter 2 Invited Session on Recorded Linkage Applications for Epidemiological Research." National Research Council. 1999. Record Linkage Techniques -- 1997: Proceedings of an International Workshop and Exposition. Washington, DC: The National Academies Press. doi: 10.17226/6491.
×
Page 37
Suggested Citation:"Chapter 2 Invited Session on Recorded Linkage Applications for Epidemiological Research." National Research Council. 1999. Record Linkage Techniques -- 1997: Proceedings of an International Workshop and Exposition. Washington, DC: The National Academies Press. doi: 10.17226/6491.
×
Page 38
Suggested Citation:"Chapter 2 Invited Session on Recorded Linkage Applications for Epidemiological Research." National Research Council. 1999. Record Linkage Techniques -- 1997: Proceedings of an International Workshop and Exposition. Washington, DC: The National Academies Press. doi: 10.17226/6491.
×
Page 39
Suggested Citation:"Chapter 2 Invited Session on Recorded Linkage Applications for Epidemiological Research." National Research Council. 1999. Record Linkage Techniques -- 1997: Proceedings of an International Workshop and Exposition. Washington, DC: The National Academies Press. doi: 10.17226/6491.
×
Page 40
Suggested Citation:"Chapter 2 Invited Session on Recorded Linkage Applications for Epidemiological Research." National Research Council. 1999. Record Linkage Techniques -- 1997: Proceedings of an International Workshop and Exposition. Washington, DC: The National Academies Press. doi: 10.17226/6491.
×
Page 41
Suggested Citation:"Chapter 2 Invited Session on Recorded Linkage Applications for Epidemiological Research." National Research Council. 1999. Record Linkage Techniques -- 1997: Proceedings of an International Workshop and Exposition. Washington, DC: The National Academies Press. doi: 10.17226/6491.
×
Page 42
Suggested Citation:"Chapter 2 Invited Session on Recorded Linkage Applications for Epidemiological Research." National Research Council. 1999. Record Linkage Techniques -- 1997: Proceedings of an International Workshop and Exposition. Washington, DC: The National Academies Press. doi: 10.17226/6491.
×
Page 43
Suggested Citation:"Chapter 2 Invited Session on Recorded Linkage Applications for Epidemiological Research." National Research Council. 1999. Record Linkage Techniques -- 1997: Proceedings of an International Workshop and Exposition. Washington, DC: The National Academies Press. doi: 10.17226/6491.
×
Page 44
Suggested Citation:"Chapter 2 Invited Session on Recorded Linkage Applications for Epidemiological Research." National Research Council. 1999. Record Linkage Techniques -- 1997: Proceedings of an International Workshop and Exposition. Washington, DC: The National Academies Press. doi: 10.17226/6491.
×
Page 45
Suggested Citation:"Chapter 2 Invited Session on Recorded Linkage Applications for Epidemiological Research." National Research Council. 1999. Record Linkage Techniques -- 1997: Proceedings of an International Workshop and Exposition. Washington, DC: The National Academies Press. doi: 10.17226/6491.
×
Page 46