Cover Image

Not for Sale



View/Hide Left Panel

The Comac: An Efficient Punched Card Collating System for the Storage and Retrieval of Information1

MORTIMER TAUBE

The concept of item codes and term codes has been fully developed in a paper by Mr. Alexander Kreithen which has also been submitted to the Conference (1); and from this paper we take the conclusion that there are only two basic patterns of grouping codes in a store: “Either term codes are collected under item codes or item codes are collected under term codes” (2). We also utilize the conclusion that, in a system of direct free field coding, a search consists of matching (or collating) a code (or codes) in the question successively against codes in the store.

Although these theoretical conclusions are generally applicable to any type of storage and retrieval device, in this paper their implication will be applied to the design of a specific system, namely, a new system of punched card collation which we have designated the Continuous Multiple Access Collator (Comac).

Since it is the purpose of this paper to demonstrate that the Comac is an efficient device for even the largest collections of information, e.g., patents, intelligence files, newspaper morgues, and picture files, we will base our calculations of search time and size of the store on the following figures:

Number of items in the store, 1,000,000.

Average number of term codes used to index an item, 20.

Number of terms in the vocabulary or different term codes used in the system, 10,000.

Within the general concept of matching we distinguish two methods employed by standard punched card devices. These two methods are usually shown as (1) searching and (2) collating.

MORTIMER TAUBE Documentation Incorporated, Washington, D.C.

1  

Prepared under Contract No. AF 49(638)–91, Air Force Office of Scientific Research, Directorate of Advanced Studies. October 10, 1957.



The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.
Terms of Use and Privacy Statement



Below are the first 10 and last 10 pages of uncorrected machine-read text (when available) of this chapter, followed by the top 30 algorithmically extracted key phrases from the chapter as a whole.
Intended to provide our own search engines and external engines with highly rich, chapter-representative searchable text on the opening pages of each chapter. Because it is UNCORRECTED material, please consider the following text as a useful but insufficient proxy for the authoritative book pages.

Do not use for reproduction, copying, pasting, or reading; exclusively for search engines.

OCR for page 1245
--> The Comac: An Efficient Punched Card Collating System for the Storage and Retrieval of Information1 MORTIMER TAUBE The concept of item codes and term codes has been fully developed in a paper by Mr. Alexander Kreithen which has also been submitted to the Conference (1); and from this paper we take the conclusion that there are only two basic patterns of grouping codes in a store: “Either term codes are collected under item codes or item codes are collected under term codes” (2). We also utilize the conclusion that, in a system of direct free field coding, a search consists of matching (or collating) a code (or codes) in the question successively against codes in the store. Although these theoretical conclusions are generally applicable to any type of storage and retrieval device, in this paper their implication will be applied to the design of a specific system, namely, a new system of punched card collation which we have designated the Continuous Multiple Access Collator (Comac). Since it is the purpose of this paper to demonstrate that the Comac is an efficient device for even the largest collections of information, e.g., patents, intelligence files, newspaper morgues, and picture files, we will base our calculations of search time and size of the store on the following figures: Number of items in the store, 1,000,000. Average number of term codes used to index an item, 20. Number of terms in the vocabulary or different term codes used in the system, 10,000. Within the general concept of matching we distinguish two methods employed by standard punched card devices. These two methods are usually shown as (1) searching and (2) collating. MORTIMER TAUBE Documentation Incorporated, Washington, D.C. 1   Prepared under Contract No. AF 49(638)–91, Air Force Office of Scientific Research, Directorate of Advanced Studies. October 10, 1957.

OCR for page 1245
--> Searching Searching is performed with a sorter by making several successive sorts until all items coded by a certain term or terms have been selected, that is, sorted out from the rest of the deck. Changing the column selector in the sorter and selecting the cards from the proper pocket constitute setting up a question in the reading head of a piece of apparatus, and this question is matched successively against codes in the store. It is assumed that each item is represented by a card or set of cards on which is grouped the term codes characterizing that item. With standard punched card equipment (unless superimposed punching or wiring is used) the term codes in the question must be matched against specified fields on the item cards. For example, a simple sorter “sorts” cards column by column as determined by setting the column selector in the sorter; and the “101” machine which can search many columns at once must be programmed to search in the proper columns for term codes in the question (the “101” can be wired to search for a single code in any of a number of fields). The IBM Corporation has constructed an experimental searching device which can search for multiple codes any place on a card. This device, which uses direct free field coding, is known as the Luhn scanner; it represents a significant advance in punched card searching. The operation of the Luhn scanner is illustrated2 in Fig. 1. Each item card in the store is coded with the terms characterizing the items and the unused columns of the card are “laced,” that is, all holes are punched. The question card is also laced except for the columns required for the term codes in the question. The coding of a term in the question is the complement of the code of a term in the store (Fig. 2). It is apparent that as a card in the store passes over the question card, the matching of complementary codes will cause a “blackout” which can signal a photocell to actuate a selection mechanism. The use of complementary coding and “blackout” cuts down the requirement for reading apparatus to one photocell per column per code area. If direct matching of codes were employed, each hole on the card would have to be read for a match or failure to match. (It will be seen that the use of complementary coding is only possible when a question card is prepared for a specific search and cannot be used in collation in which any card in the store may be used as a question card.) 2   This illustration is only theoretically accurate. Since the scanner, as constructed, reads only one-half of the card, lacing is required on only one-half of the card area. Also, by virtue of always using codes of 5 out of 12 holes, lacing can be accomplished by just one extra punch which will always let light through.

OCR for page 1245
--> FIGURE 1 FIGURE 2. (a) Term code in question card. FIGURE 2 (b) Term code in store.

OCR for page 1245
--> We will assume here that each card in the store to be searched by a Luhn scanner has room for twenty term codes and that no item is indexed by more than twenty terms. The Luhn scanner matches cards in the store against the question card at a rate of 1000 per minute. Since it is a necessary characteristic of searching systems (systems in which terms are collected under items) that the total file be scanned in any search, it would require 1000 minutes or approximately 16 1/2 hours to search a million items to answer one question. Hence a searching system even with so advanced a device as a Luhn scanner can only be used for relatively small collections or for collections which permit the division of items into mutually exclusive classes, each one of which is small enough to make searching the total class practical. The great advance of the Luhn scanner was its demonstration that free field coding could be used with punched cards and that one card could constitute the question which interrogated the store on other cards. Collation In spite of this development the inherent inefficiency of linear search has so far precluded the successful application of punched card searching to collections of any significant size which cannot be divided into mutually exclusive classes; but several relatively successful punched card installations have been organized for collating rather than searching. In setting up a system of punched cards for collating as contrasted with searching, grouping of items by terms is employed rather than grouping of terms by items. The following figure illustrates the two forms of grouping.

OCR for page 1245
--> When collation is used as a matching technique, item codes collected under one term are matched against item codes collected under another term. In effect one group of item codes becomes the question which is matched against the other group considered as the store. It should be apparent that collation does not require the search of the total store but only of those item codes grouped under the terms of the question. However, with standard collating equipment, a considerable price must be paid for this decrease in search time. A collection of 1,000,000 items indexed by an average of 20 terms would require a file of 20,000,000 cards. With 10,000 terms in the vocabulary the 20,000,000 cards would be arranged in 10,000 groups averaging 2000 cards in a group. Since a standard collator feeds 240 cards per minute from each feed, the collation of two terms (asking a two-termed question) would average between 10 and 20 minutes. This is an appreciable reduction from 16 1/2 hours, but there are some penalties which must be faced which reduce radically the efficiency of standard collators as information searching devices. In the first place the size of the store must be increased enormously to permit prefiling items (cards) under every term by which they are indexed, in this instance, from 1,000,000 to 20,000,000. Secondly, collators work only on arrays maintained in fixed numerical or alphabetical order. Hence, item cards must be filed (posted) to each term array and maintained in that array in a fixed order. Thirdly, cards matched by the collator and selected as answers must be refiled in proper order. If the selected cards are to be retained as an answer or are to be matched or are to be matched against existing groups, they may have to be duplicated so that the array from which they are selected initially can be restored to completeness for other searches. These difficulties, which arise from the use of standard collators as information searching devices, are not attributable to the grouping of items by terms, but to the use of a collator designed primarily for interfiling of cards rather than the matching of codes. It will be seen that most of the difficulties disappear when a device like the Comac is substituted for a standard collator. The standard collator carries out its interfiling function by noting the match, the failure to match, and the order of numbers on cards which it compares. On the basis of what the match discloses, the collator advances one deck or the other, or both (in the case in which a match occurs). The multiplication of cards from one to twenty million is not attributable to the need for additional coding space but to the fact that, since the collator reacts to a match by selecting cards or to the recognition of order by interfiling cards in proper order, it can operate on only one item code per card. But if we separate the collator’s ability to match codes from the requirement for physical selection and inter-

OCR for page 1245
--> filing of cards, it becomes possible to put more than one item code on a card and to signal a match by punching the matched code on another card. The Comac The essential function of the Comac which determines its design is simply the ability to match codes on one punched card against codes on another punched card and to punch the codes for the logical product or sum on a third card. Consider a set of item codes on card A and another set on card B. (See Fig. 3.) The card AB can be collated with card C, etc. The final answer if it involves the product [(A∪B)∪C)] can be printed rather than punched. It is immediately apparent that one of the features of the Comac is the fact that it does not require any refiling of selected cards into an A deck or a B deck. The degree of file reduction, however, may not be immediately apparent. An IBM card contains 80 columns. Since, in collation, we group item codes under term codes, let us assume that 2 columns are required for the term code of each card and 3 columns for the card number. With one column blank, the remaining 74 columns can be divided into 37 two-column groups with each group containing 24 bits. The 24 bits can be divided into 6 fields of 4 giving a possibility of coding any item number from 1 to 999,999. Hence 37 item numbers ranging from 1 to 999,999 can be punched in each card. In terms of our previous figures, namely, 1,000,000 items, 20 term codes per item, 10,000 terms in the vocabulary, the 20,000,000 item codes could be punched on 540,540 cards, providing a file reduction of 37 to 1 as compared with standard collating systems.3 In more picturesque terms the card file is reduced from building size to room size. The 540,540 cards would be organized into 10,000 groups averaging 54 cards to a group. And the collating of one group against the other would involve comparing the item codes of 54 cards with the item codes on 54 cards. The groups will ordinarily not be equal and many groups will contain many more than 54 cards; but unlike the Minicard System, we can add cards to any group without dedicating space for it in the group. Hence it is appropriate to discuss the groups of a Comac system in terms of averages. It is interesting to note that the 3,000,000 patents in the Patent Office could be handled on approximately 1,600,000 cards and given the same 10,000-word vocabulary, with an average of 162 cards in a group. 3   This form of binary coding in columns is called “Chinese Binary” by IBM data processing people. Although it gives maximum compression of codes, it does require decimal binary converters and modification of existing punches. With regular Hollerith coding, the 37 to 1 file reduction possible with Chinese binary, drops to 14 to 1 for collections requiring 6-digit codes (up to 999,999 items) and to 18 to 1 for collections requiring 5-digit codes (up to 99,999 items).

OCR for page 1245
--> FIGURE 3

OCR for page 1245
--> Once the basic design requirement of the Comac is established, the Comac can operate in either of two ways depending on the amount of comparators or registers that are provided. If only one code on each card can be read at a time, both groups being collated would have to be advanced intermittently as is the case with existing collators, the difference being that the cards would be advanced two columns at a time instead of a card at a time. The Comac which compares only one code at a time from each group can only operate if the item codes are punched in ascending order, that is, if “code 1< code 2< code 3, etc.” This is also the case with existing collators. This constraint can be avoided if enough circuitry is provided to store all the codes on a card while another card is passed over it. In such a case all codes would be matched against all codes regardless of the order of the codes on the cards. Whether or not it would be worthwhile to provide this additional circuitry would depend on the nature of the collection being indexed. If the items in the collection could not be assigned serial numbers and indexed in order, the more advanced type of Comac would be required; but if serial order could be maintained in indexing and in entering item codes on cards, the Comac which advanced and compared one code at a time from each group would be adequate. We have not attempted in this paper to describe the Comac apparatus. However, from our studies of existing punched card equipment, binary to decimal converters, comparators, etc., it appears that once the basic concept of the Comac is accepted, the construction of a device for single code comparison represents only a very modest development effort. Actually the character of the physical equipment necessary is practically deducible from the new concept of collation as a matching and print-out process of item codes rather than a card selection and interfiling process. The development of the more advanced Comac capable of comparing multiple codes with multiple codes might require a greater investment; but before such a development is undertaken, the value of removing the constraint of entering item codes in order should be thoroughly explored. Advantages of the Comac system We summarize at this point the characteristics of the Comac which make it a practical and efficient information storage and retrieval system. Some of these points have been mentioned above; others will be presented here for the first time. 1. DECREASE IN TIME OF SEARCH For a collection of 1,000,000 items, the grouping of item codes under term codes, as compared with term codes under item codes reduces the time of

OCR for page 1245
--> search for a two-termed, question from 16 1/2 hours (Luhn scanner) to approximately 5 minutes. We are assuming that cards can be advanced two columns at a time and compared in the Comac at about twice the rate they can be advanced in existing card reproducers that feed cards the long way, i.e., one card every 2 to 3 seconds. Since one card in the Comac contains 37 codes, 54 cards containing 2000 codes can be read in 100 to 150 seconds. Doubling this figure to allow for the intermittent advance of two groups, we get 200 to 300 seconds or 3 to 5 minutes. 2. MULTIPLE ACCESS TO THE STORE Searching involves scanning the whole file; but collating involves comparing prefiled groups. This means that many searchers can select groups for comparing at the same time without interfering with one another. Further, it is assumed that the Comac will be so reasonable in price that any large and busy installation would have several so that searchers desiring to collate term groups would not have to queue up at one machine. Assuming 5 minutes for the average search, five Comacs would provide an answer every one minute during a working day. 3. ELIMINATION OF REFILING The card reproducing and print-out features of the Comac would eliminate the necessity which exists with present collators of refiling cards. That is, there would be no “return-to-normal” problem which now exists with most punched card searching and selection devices. For example, the Patent Office R & D Report No. 6 describes this problem as a major difficulty in the utilization of punched cards for searching (3). 4. CARD REPRODUCTION AND PRINT OUT The Comac will be able to reproduce a punched card containing matched codes; but in addition it will contain a binary to decimal converter which will print out the final answer of a succession of collations, e.g., {[(A∩B)∩C] ∩D}. 5. EASE OF MAINTENANCE AND POSTING The cumulation of item codes for punching on term cards is a simple procedure which has been worked out by Documentation Incorporated and other organizations (NSA) in connection with posting on Uniterm Cards (4). 6. FREEDOM FROM CONSTRAINTS ON INDEXING The reproducing features of the Comac make it fairly simple to combine terms, set up hierarchical relations between terms and the items grouped on

OCR for page 1245
--> such terms, etc. For example, a decision to establish a grouping of item codes under the general term antibiotics, that is, to collect codes previously listed under aureomycin, streptomycin, penicillin, etc., involves only changing instructions in the reproducer. Whereas an ordinary search involves the reproduction of the logical product of the group of item codes, the collection of item codes under a general term is equivalent to reproducing the logical sum of the codes. Hence, although it is possible to look upon the Comac as a mechanized Uniterm Index, the mechanization extends not only to the comparison of codes but to the ability to update the actual indexing and to provide any degree of order or hierarchical relationship required by the search problems confronted by the system. REFERENCES 1. KREITHEN, A., “Mathematical Foundations for a Storage and Retrieval Theory,” AFOSR IN 57–400. ASTIA AD 132475, JUNE 1957. 2. Reference 1, p. 14. 3. DON ANDREWS. U.S. Patent Office, Research and Development Report No. 6, pp. 11–12. 4. SANFORD and THEREAULT. Problems in the Application of Uniterm Coordinate Indexing. College and Research Libraries, 17, No. 1, 19–23 (January 1956).