The Haystaq System: Past, Present, and Future

HERBERT R. KOLLER, ETHEL MARDEN, and HAROLD PFEFFER

I. Introduction

A. BACKGROUND MATERIAL ON PATENT SEARCHING

Literature searching has been distinguished from information retrieval by some authors (1, 5), and within the field of literature searching the peculiar characteristics of patent searching have also been set forth (2–4). The literature contains discussions of various philosophies of searching systems, each of which is derived from a different appraisal of, and theory about, users’ needs (1, 6, 7). To round out the discussion in the present paper, a few characteristics of patent searching and some of its practical implications are given.

Patent searching is that type of literature searching which is performed by patent examiners when (1) determining the novelty of a concept claimed in an application for a patent; or (2) (if novel) finding the nearest similar related concepts previously known and published. The Patent Office library includes within its files over 2.8 × 10^6 domestic patents, twice this number of foreign patents, and thousands of serial publications and books. This library literally deals with every field of technology, running the gamut from pins and needles to printing presses, and from antibiotics to submarines. About 2000 searches are made each day in the normal operation of the Office, and each search includes from one to upwards of twenty distinct questions. It is estimated that about 20 per cent of the searches made are in the chemical field, about the same fraction relate to the electrical and electronic arts, and the remaining 60 per cent deal with mechanical and miscellaneous fields. The disclosures of patents in the chemical arts vary in the amount of information they contain from a single, very specific “recipe” to a very generic disclosure of a series of related processes, each illustrated by a large number of examples. The compounds involved in a disclosure may be described specifically or generically, and it is common for a genus to be set forth in the so-called Markush type of structural formula. (See Section II.A, “Factors Considered.”)

HERBERT R. KOLLER and HAROLD PFEFFER, Office of Research and Development, U.S. Patent Office, Washington, D.C.

ETHEL MARDEN, National Bureau of Standards, Washington, D.C.

The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.

Haystaq is one of many possible types of mechanized searching systems which could have been designed for patent searching. Obviously some systems are better equipped than others to satisfy the needs of patent searchers. Some suggested criteria by which systems may be evaluated will be found in Appendix A.

Patent searching is not at all restricted to patents, as the name might suggest. The total field of search does indeed include patents of all countries, but it equally embraces periodicals, textbooks, catalogues, abstract services, and every other form of publication. Many persons other than patent examiners make the same type of literature search; for example, research workers embarking on new investigations, patent attorneys, and lawyers in general. This type of search is characterized by questions variable in scope from the most generic to the most specific viewpoints. In addition, general combinations as well as subcombinations, equivalence between concepts, negative concepts, and certain syntactical-logical artifices (e.g., Markush formats) are all of importance.

B. HAYSTAQ: SOME GENERAL CONSIDERATIONS

The characteristics of the Haystaq system have been described elsewhere (4, 8); therefore, this paper is principally concerned with new developments and additions to the system. In general, Haystaq includes four parts: (1) a data preparation routine for the library making up the disclosure file of information to be searched (see Appendix D); (2) a data preparation routine for the question, which is set up in the form of a model answer; (3) the search routine; and (4) the checkout routine, which evaluates apparent answers to questions and provides the output of the system.
The greatest part of the effort so far has been expended on the search routine since (a) the objectives of the two data preparation sections can be achieved manually so long as the system is not yet in a large-scale operational phase and (b) the checkout routine in extenso is not essential to the searching routine.

Haystaq simulates the manual type of search performed by patent examiners in that it effects a serial scanning of each document in the file. While at present its routines are being coded for the NBS SEAC, full development of the system will be predicated partly on features available only in a more advanced machine. Some of the desirable characteristics of such a machine will be discussed below.

In order to achieve a reasonable working model, the research has been focused on the field of chemistry, although it is quite apparent that minor modifications will permit the inclusion of other fields of knowledge in the system.

The file is organized in the most convenient arrangement for the human searcher. Thus, within each document the largest segment of disclosure treated as a unit is a complete process, including all the steps disclosed. The next largest segment of disclosure treated is a composition or admixture. Each composition is subdivided into groups of codes representing the individual ingredients (or “items”) which it contains. The individual codes are known as descriptors. Since disclosures may vary from simple to very complex statements, any or all of the above levels of organization may be involved.

The system must provide the utmost flexibility with regard to the manner of formulating questions. This need stems from several sources: (1) A very large file requires a high degree of discrimination to provide exactly (and only) those disclosures which are desired. (2) The same piece of disclosed information can answer a large number of questions, each reflecting a different interest. (3) The ingenuity constantly exercised in phrasing patent claims and the constantly shifting focus of interest in the industrial community result in the continuous generation of new ways of expressing essentially the same or related ideas (2, 3). Examples of types of searches which must be provided for are given in Appendix B.

II. Current work

A. FACTORS CONSIDERED

Among the aims of the Haystaq system, as stated in a previous report (8), is the construction of a prototype system. This would permit additional studies as to the requirements and design of a machine system, and provide a basis for constructing improved search systems. Any results obtained from the study of the prototype are a function of the sufficiency of the system, and the sufficiency of the system itself depends upon how well it fulfills the users’ needs.
Several problems, set aside for further study at the time of the initial effort, have now been tackled. These are (1) the Markush problem, (2) the creation of an effective classification scheme for the preparation of schedules of items, and (3) the necessity of giving the user greater flexibility in posing questions to the system.

1. “Markush group” is an art term used in the Patent Office for designating a synthetic genus whose scope is determined by a listing of its members. The first system devised was capable of handling this situation where the genus was defined by listing individual compounds, e.g., the class consisting of phenol, mono-chloro-phenol, mono-amino-phenol, and mono-ethyl-phenol. However, another method exists for defining such a group by means of a structural formula showing the portion of the molecule common to all the embodiments, to which is attached a variable member, which is then defined. The same example in the structural format appears in Fig. 1.

FIGURE 1.

Where the number of embodiments is small, it becomes a simple matter to decompose such generic formulas into their component embodiments and encode them individually. But it is not uncommon to find such formulas embracing hundreds, indeed thousands, of specific embodiments. One example which was discovered contained over 64,000,000,000 embodiments. It is completely unrealistic to think of decomposing such compact forms into specific embodiments for encoding, given the number of man-hours involved, the amount of storage required, and the time a computer would consume in scanning this mass of data. On the other hand, there are obvious and distinct advantages in being able to compress so much data into an extremely compact form. The problem is serious because the great majority of chemical patents as well as much of the non-patent literature have utilized this device for more than twenty years. The system was therefore designed to process questions in this form.

2. Compounds are sometimes defined by generic descriptors alone, sometimes by pure structure configurations alone, and very often by a hybrid of both. Consider the example of the latter situation which is illustrated in Fig. 2.

FIGURE 2.

The terms “halogen,” “alkenyl,” and “alkyl” in a topological network of elements become quite embarrassing in an attempt at tracing an element-by-element path from one portion of the molecule to another. If this example is considered as a question, it is evident that the system should permit recognition of chlorine, bromine, iodine, or fluorine as species of “halogen”; and “propyl,” “isopropyl,” “butyl,” or “isobutyl” as species of “alkyl (3–4 carbons).” In other words, the compound of Fig. 3 should be recognized as an answer.

FIGURE 3.

Previous attempts at creating a classification based upon a hierarchical arrangement of generic and specific terms have fallen by the wayside. The large number (practically limitless) of generic and subgeneric terms which can be generated presents such a confused picture of overlapping relationships as to preclude the construction of one comprehensive hierarchy. The usefulness of any comprehensive system will be measured to a large extent by its ability to give full effect to the genus-species relationship.

3. The patent examiners, who are among the potential users of the system, are not limited in their searches to mere determinations of the novelty of a compound. Because of legal peculiarities in the patent system, compounds which are similar within certain prescribed limitations are acceptable as answers. Thus, if a question involves the compound of Fig. 4, an acceptable answer may very well be found in any of the structures shown in Fig. 5.

FIGURE 4.

Thus, 5(a) and 5(b) are illustrative of positional isomers of Fig. 4 in which all the functional groups are alike, but their relative positions are different. Figure 5(c) is illustrative of a situation where all the requirements of the question are met, but the compound has something in addition. If the user required an exact match of the question structure, 5(c) would not be acceptable, nor would 5(a) or 5(b). However, if he specified the question as a fragment of any larger configuration, 5(c) would be a valid answer. Figure 5(d) shows an example of a compound which is a higher homolog of the question, the only difference being in the length of the chain of carbons attached to the nitrogen.

While none of these has the same effect as a direct anticipation, nevertheless they are all valid answers unless they can be rebutted by an applicant.

FIGURE 5.

In order that the user be given maximum flexibility in stating his question, the system must permit him the option of indicating, for each point of the molecule in question, whether he will or will not accept positional isomers, whether he will or will not accept homologs, and whether he will or will not accept as answers compounds of which the question represents a fragment.

B. THE SYSTEM

In the original search program of Haystaq, chemical compounds were described by a limited schedule of coordinate chemical descriptors. Sufficient definition was thus given to a search request for a specific compound to eliminate a large number of disclosures. This procedure was based upon the anticipated use of a topological element-by-element search, such as the routine written by L. C. Ray (9), whenever required, which would employ the data resulting from the previous operation to give a conclusive answer.

The present system represents an improvement in the search for chemical compounds in the light of the problems outlined above. It is based upon a judiciously selected, comparatively small basic vocabulary of functional groups. These, so far as possible, have been chosen to coincide with the conventional groups recognizable by chemists. In some instances terms have been synthesized, either for convenience in handling or to add flexibility to the system.

Inasmuch as recognition of topological relationships among functional groups is one of the desiderata of the system, a close study of two available systems developed by others was made: the Norton-Opler system (10), which deals with relatively large functional groups, and the Ray system (9), which deals with elements. The functional groups chosen for use in Haystaq were selected in an attempt to overcome the relative rigidity of the Norton-Opler system and the relative slowness of the Ray system.

It was not found practicable to use stored look-up tables referring to class terms because of the infinite variety of generic terms and the inability to predict or state all the species which may fall under any given genus. It was therefore decided to make all data, both disclosure and question, self-definitive. That is, each generic expression utilized in describing a compound is defined in terms of its specific meaning in that particular compound.

The coded chemical descriptors, all having the initial digit 3, are divided into four sections: 3A, 3B, 3C, and 3D. The 3A section is devoted to a class of terms which are combinations of two or more units of basic functional groups, having constant definitions and recognizable as such by chemists; e.g., anilino, naphthyl, or carboxyl. The 3B section is limited to generic expressions, i.e., those capable of being represented by more than one specific embodiment. These are defined, where necessary, by reference to the specific groups in 3C which are pertinent to the particular compound under consideration. Exemplary terms to be found in this group are acid, ester, and halogen. The 3C section contains a list of all the basic functional groups which make up the particular compound and shows the topological relationships among them (without, however, indicating relative positions of attachment). The 3D section (not yet written) is an element-by-element topological definition of each compound.

The example of Fig. 6, which is a complex structure having some 1600 distinct embodiments, illustrates a typical structural formula in the literature.

U = an alkyl group having one to two carbons;
V = −H, a halogen, an alkyl group having one to three carbons, or an alkoxy group in which the alkyl has one to three carbons;
W = −H, or halogen;
X = −H, or −Cl;
Y = −H, or an alkyl group having one to four carbons;
Z = −H, or an alkyl group having one to four carbons.

FIGURE 6.

Note that in each case, where H is shown as one of the variants in a group, it is in effect stated that one of the variations is the absence of a substituting group. H is therefore substituted by the expression “N.S.” (no substituent).
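The “N.S.” convention can be pictured with a minimal Python sketch. The variant lists are taken from the legend of Fig. 6; the names `FIG6_VARIANTS`, `substitute_ns`, and `optional_groups`, and the string spellings of the variants, are illustrative assumptions and not Haystaq codes.

```python
# Variable groups of Fig. 6 written as plain lists of variants.
# The string spellings are illustrative, not Haystaq descriptor codes.
FIG6_VARIANTS = {
    "U": ["alkyl (1-2 C)"],
    "V": ["H", "halogen", "alkyl (1-3 C)", "alkoxy (alkyl 1-3 C)"],
    "W": ["H", "halogen"],
    "X": ["H", "Cl"],
    "Y": ["H", "alkyl (1-4 C)"],
    "Z": ["H", "alkyl (1-4 C)"],
}

NO_SUBSTITUENT = "N.S."


def substitute_ns(variants):
    """Replace the variant H by the marker N.S. (no substituent)."""
    return [NO_SUBSTITUENT if v == "H" else v for v in variants]


def optional_groups(groups):
    """Return, in sorted order, the groups that may be absent altogether."""
    return sorted(
        name
        for name, variants in groups.items()
        if NO_SUBSTITUENT in substitute_ns(variants)
    )


if __name__ == "__main__":
    for name, variants in FIG6_VARIANTS.items():
        print(name, substitute_ns(variants))
    print("may be absent:", optional_groups(FIG6_VARIANTS))
```

In this sketch only U is always present; every other variable group admits the "no substituent" variant.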

The structure is rewritten in Fig. 7 in composite form, showing the separation of functional groups, and is arbitrarily numbered for identification of the groups for topological tracing. These numbers are called “designation numbers.” Fig. 7 also illustrates the coding scheme. This coding does not exhaust all possible terms for 3A and 3B; it is only intended to be illustrative. Research is continuing on terminology.

In the 3A terms, the number following the description indicates the number of occurrences. Note that all 3B terms (generic) are defined by reference to the groups found in 3C. For example, “halogen-71” in 3B is defined in 3C as a chloro group. The ether of 3B is defined by the 3C terms phenyl, oxy, and alkyl, and is therefore an aryl-alkyl ether.

The first group in 3C identifies piece number 10 as a phenyl group. The first parenthetical expression describes the number of other groups connected to this piece and is expressed as a range. In this case the number may vary, depending upon the particular combination of variable groups which may be present at any one time. Since there are attached one fixed group and three variable groups, each of the latter having as one of its variants a “no substituent” group, the range of connections may vary from one to four. The second parenthetical expression, which has been left blank, is used only for questions and indicates to the computer whether the exact number (1) or at least as many (0) connections must be matched. The remainder of the line indicates that piece 10 (phenyl) is connected to each of the pieces numbered 30, 50, 70, and 90 by a single bond (S). In piece number 30, M indicates the beginning of a variable group. When one of the variants is “no substituent” (N.S.), that information is stored in the M word.

C. SPECIAL CONSIDERATIONS

(a) Ordering. In all the data, in both question and disclosure, the descriptors in 3A and 3B are presented in an ascending series, according to the numerical values of the codes representing the substantive information. The program thus permits the computer to decide (in searching for a particular term) at the earliest possible moment that there is no available answer. For example, if the code for a question term were 127 and the first three descriptors in the disclosure list were 33, 105, and 148, comparison with the first would indicate that it was too small. This would result in calling for the next word, with the same consequence. However, on comparison with the third word, which is too large, the computer would conclude that there is no answer for the question.
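The early-termination scan just described can be sketched in a few lines of Python. This is only an illustration of the ordering idea, not the SEAC routine; the function name `find_in_ordered` and the trailing codes 200 and 310 in the usage example are assumptions added here.

```python
def find_in_ordered(question_code, disclosure_codes):
    """Scan an ascending list of descriptor codes for question_code.

    Returns (found, comparisons). The scan stops as soon as a disclosure
    code exceeds the question code, because every later code is larger still.
    """
    comparisons = 0
    for code in disclosure_codes:
        comparisons += 1
        if code == question_code:
            return True, comparisons
        if code > question_code:
            # Too large: no later word in an ascending list can match.
            return False, comparisons
    return False, comparisons


if __name__ == "__main__":
    # Question term 127 against the disclosure list of the text,
    # extended with two hypothetical later codes that are never read.
    print(find_in_ordered(127, [33, 105, 148, 200, 310]))  # (False, 3)
```

The third comparison settles the matter, so the remaining words of the disclosure list are never examined.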

FIGURE 7.

(b) The Question. In the search of Markush type disclosures with Markush type questions in 3C, the computer synthesizes the various combinations of specific embodiments in both question and disclosure. Since this may result in a large number of combinations, some means must be found to relieve the computer of the burden of searching all possibilities. One of the devices employed is the ordering of words in a variable group in the same manner as described above. However, this in itself is not sufficient. It is apparent that for any particular disclosure being searched, not all the variables in the question can be expected to be answered. If fruitless questions can be identified sufficiently early, time can be saved by instructing the computer to disregard them.

Consequently the 3A and 3B sections of the question have each been split into two groups, the one containing those terms pertaining to the fixed part of the molecule (3AN and 3BN), and the other containing those pertaining to the variable portion of the molecule (3AM and 3BM). When a 3AN or 3BN term is not matched, the search of that disclosure is immediately terminated. However, failure to find a 3AM or 3BM term results in marking the question 3C pieces referred to, so that they are omitted from consideration when the topological search is made among the 3C terms. The question is re-marked in this manner for each disclosure considered. No terms which are partly fixed and partly variable can be used as 3A or 3B questions. They must all be either one or the other. If the example of Fig. 7 were to represent a question, the arrangement would be as follows:

3AN   anilino-1; urea-1
3AM   – – – – –
3BN   amide-80, 90; amide-60, 80; ring (homocyclic, carbocyclic, aromatic, six members)-10
3BM   halogen-31; halogen-51; halogen-71

It is noted that the term “ether” does not appear, since in this example it is in part fixed and in part variable.

(c) Generic expressions in 3C. While generic expressions are ordinarily confined to the 3B section, situations arise, as illustrated in the example of Fig. 7, where a portion of the molecule of a structural formula is described in class terms rather than by some specific embodiment. In this case the 3B terms are repeated in the 3C section, treating them as though they were pieces of basic vocabulary. Such pieces are marked with an X both in 3B and 3C. In the topological search in 3C, on recognition of such a marked term, the computer is instructed to accept a similar term or any species embraced therein. This is accomplished by a look-up procedure in reverse from the 3C to the 3B section. Thus (referring to the example again as a question), the terms “31-halogen” and “51-halogen” are answered by the terms halogen, chloro, bromo, iodo, or fluoro.

(d) Rings and alkyl groups. The term “alkyl” is of such frequent occurrence that it loses its discriminatory power as a generic expression. This term frequently serves as an additional means of compressing information, as for example “an alkyl group having 1–3 carbon atoms.” The “alkyl” word is therefore given special treatment and has a small subroutine devoted to handling it. The information contained in this word is in fixed fields. One field carries the designation “alkyl.” Another field has information as to the number of carbons involved and is expressed as a range with an upper and lower limit. A third field carries the identification of specific groups. The fourth field is reserved for the question only, and is used to indicate whether the search is for any alkyl group within a specified range or for any homolog at least as large as a specified minimum. Unsaturated carbon chains are defined in terms of alkyl groups joined through double or triple bonds.

The words which describe rings represent a package of information. Each ring is described in terms of whether it is homocyclic or heterocyclic. If homocyclic, it may be carbocyclic, nitrocyclic, etc. If carbocyclic, it may be alicyclic or aromatic. Heterocyclic rings are described in terms of the kinds and frequency of occurrence of the hetero elements. All rings are further described in terms of total number of elements (i.e., ring size) and double bonds. A special subroutine permits asking for rings in terms of any one or more of the terms described above.

D. THE SEARCH

In general, each of the four sections described may be used as a primary basis of search. However, 3A, 3B, and 3C may each operate as a screen for any one of the subsequent sections.

One feature of the 3C section is the creation of an Equivalence Table as part of the output. This table identifies each group of a disclosure molecule which represents an answer for the equivalent group in the question. For example, if Fig. 8 represents a disclosure molecule and Fig. 9 represents a question molecule, the Equivalence Table produced would be as shown in Fig. 10.
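The screening role of the 3A and 3B sections, together with the fixed/variable split described under “(b) The Question,” might be sketched as follows. This is a deliberately simplified illustration under stated assumptions: terms are matched by name only (occurrence counts and disclosure-side designation numbers are ignored), and the function name `screen_disclosure` and the sample disclosures are hypothetical. The question terms are those listed for Fig. 7.

```python
def screen_disclosure(fixed_terms, variable_terms, disclosure_terms):
    """Screen one disclosure against the 3A/3B part of a question.

    fixed_terms:      names of the 3AN/3BN terms (fixed part of the molecule)
    variable_terms:   (name, question 3C piece number) pairs for 3AM/3BM terms
    disclosure_terms: set of 3A/3B term names present in the disclosure

    Returns None if the disclosure is rejected outright; otherwise the set of
    question 3C pieces to leave out of the later topological search. A fresh
    set is built per call, mirroring the re-marking for each disclosure.
    """
    for term in fixed_terms:
        if term not in disclosure_terms:
            return None  # unmatched fixed term: terminate this disclosure
    omitted = set()
    for term, piece in variable_terms:
        if term not in disclosure_terms:
            omitted.add(piece)  # unmatched variable term: mark its 3C piece
    return omitted


# Question of Fig. 7, reduced to term names (a simplification).
FIXED = ["anilino", "urea", "amide", "ring"]
VARIABLE = [("halogen", 31), ("halogen", 51), ("halogen", 71)]

if __name__ == "__main__":
    with_halogen = {"anilino", "urea", "amide", "ring", "halogen"}
    without_halogen = {"anilino", "urea", "amide", "ring", "ether"}
    without_urea = {"anilino", "amide", "ring", "halogen"}
    print(sorted(screen_disclosure(FIXED, VARIABLE, with_halogen)))     # []
    print(sorted(screen_disclosure(FIXED, VARIABLE, without_halogen)))  # [31, 51, 71]
    print(screen_disclosure(FIXED, VARIABLE, without_urea))             # None
```

Returning `None` for a failed fixed term and a set of pieces for failed variable terms keeps the two outcomes distinct: the first ends the search of that disclosure, the second only narrows the 3C search that follows.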

literature searching system employing “encoding-in-depth” inherently includes in the search file the kind of data useful in an information retrieval system. A modification only of the type of output would be required to employ the file in this dual role. Browsing is based upon a philosophy which is basically different from searching or information retrieval, but the nature of search files is such that a searching system can be designed to provide facilities for browsing.

How much “noise” and what per cent of “false drops” are tolerable? It is not true that a large-scale searching system must have some degree of each of these disturbing characteristics.

Is it necessary to store the disclosures so that the original document can be reconstructed, or is it adequate to provide clues to the subject matter in the documents being searched? In part, the answer to this question depends upon the form of output provided and the use made of the output.

To what extent should the system be a re-entrant one? This refers to the ability of the machine to retain, manipulate, and compare various pieces of information presented to it at different instants of time. The answer is largely determined by the complexity of questions permitted and is limited by the type of searching machine used. Questions (a) and (b) are related to this one.

Should the file be static or kinetic? Hand-operated notched-edge card systems exemplify the former type, while any of the systems employing the ESM-101 with IBM cards is an example of the latter. Factors such as the time to make a search, the complexity of questions permitted, and the possible size of the file must be analyzed.

Should the system employ a serial or a parallel approach? No sound theoretical reasons exist which substantiate the passive acceptance by many documentation specialists of the concept that it should take more time to search a large file than it does to search a small one. Only the machine technology for large-scale parallel searching is lacking at present. Some systems provide for several questions to be asked simultaneously when one serial pass through the file is made (24, 25). The parallel system contemplated here does not involve parallel questions; what is contemplated is a parallel or simultaneous approach to all documents in the file. (See Appendix E.) In such a system, no prolonged interval need exist between the instant when the question is put into the machine and the instant when the output is obtained. Questions could be asked one after another at the same rate of input as data is fed into present-day electronic computers.

What types of logical subject matter systems must be devised to permit the formulation of all desired questions?

What are the costs involved in setting up the system? Initial costs include document accession, system development and programming, data file preparation, machine acquisition or rental, and the employment and training of necessary personnel. Initial outlays should be amortized over a reasonable operating period. A large first expenditure may be less costly in the long run than a smaller first cost for a system requiring extensive maintenance.

What is the cost per search for operating the system? The cost of updating the files, which must be considered as a continuing operation, must be added to the expense of performing a search.

In devising and setting up a full-scale system, how much time will be required before productive operation is possible? A system conceived along superficial lines will lend itself to earlier exploitation, but will require more effort in the later incorporation of desirable features than if they had been included initially. Careful planning plus the provision for more detail and more logical ability than appear necessary in a first appraisal will generally be worth the investment in time required.

How long does it take to make a search and is that length of time satisfactory? Formulation of the question in machine language, programming the machine, actual running time, and obtaining the output in a usable physical form must all be taken into account.

How difficult is it for a user to frame a question? If the user must master a complex coding system before he can make use of the system, he is less likely to accept it and use it to best advantage. An educational program to acquaint users with the facilities provided by the system is required. A practical method of operation permits the user to state his question in common language, and a technically trained person, who is intimately familiar with the system, translates the question into the system language.

APPENDIX B

Some exemplary types of questions

1. A or B or C: where A, B, and C represent different specific compounds, any one of which will satisfy the question.
2. A+B+C: where A, B, and C represent different compounds which must be disclosed in admixture.
3. abc: where a, b, and c represent different aspects of a single compound.
4. (a) abc or def: similar to type 1, but each compound is described in terms of several aspects.
   (b) abc or abd: where the two compounds have some aspects in common.
5. (a) abd+def: similar to type 2, but each compound is described in terms of several aspects.
   (b) abc+abd: where the two compounds have some aspects in common.
6. ab (without c): similar to type 3, but one of the stated aspects must be not present.
7. ab (no c): similar to type 6, but the disclosure must positively exclude aspect c.
8. a (without b) + b (without a): similar to type 5, but the two compounds have no common characteristics.
9. (A or B)+(C or D): where A, B, C, and D represent different compounds. Note that neither (A+B) nor (C+D) is requested.
10. abc or ab: First choice (abc) is a more specific concept than the second choice (ab).
11. This is an example of a Markush structural formula. Neither is a valid answer.

Types 12–15 represent processes, where A, B, C, etc., represent the various materials present in the different steps of the process.

12–15. [Process diagrams not reproduced.]

APPENDIX C

Test examples of Haystaq run on Simulac

The logical sufficiency of the flow chart for the system described in Section II was checked by manually simulating the activity of the computer on several test problems, each involving Markush structures. During such tests several errors in housekeeping operations were discovered and corrected. Code checking time on the machine is expected to be substantially reduced by this method of pretesting. Simulac is the simulation of an automatic computer on a blackboard.

TEST 1

(Question 1 has 1600 unique embodiments, Disclosure 1 has 48 unique embodiments, and hence the search of this disclosure by this question is the equivalent of 76,800 separate searches.)

Question 1: FIGURE 12.

Disclosure 1: FIGURE 13.

Simulated print-out, test 1:
Question embodiment: FIGURE 14.
Satisfied by Disclosure embodiment: FIGURE 15.

TEST 2

(Question 2 has 1280 unique embodiments, Disclosure 2 has 48 unique embodiments, and this search is the equivalent of 61,440 separate searches.)

Question 2: FIGURE 16.

Disclosure 2: FIGURE 17.

Simulated print-out, test 2:
No embodiment found.

Note. This was a “worst case” situation, no answer being possible, and none was found.

TEST 3

(Question 3 has 10 unique embodiments, Disclosure 3 has 8 specific embodiments, and this search is the equivalent of 80 separate searches.)

Question 3: FIGURE 18.

Disclosure 3: FIGURE 19.

Simulated print-out, test 3:
Question embodiment: FIGURE 20.
Satisfied by Disclosure embodiment: FIGURE 21.
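The "equivalent searches" figures quoted for the three tests are the product of the two embodiment counts. The sketch below shows that arithmetic and a brute-force pairing of embodiments. The generic formulas in it are invented for illustration and are not the structures of Figures 12 through 21, and simple equality of embodiments stands in for whatever structure comparison the actual program performs.

```python
from itertools import product
from math import prod


def count_embodiments(groups):
    """Number of specific structures a generic (Markush) formula stands for."""
    return prod(len(alternatives) for alternatives in groups)


def embodiments(groups):
    """Enumerate every specific embodiment, one choice per variable position."""
    return product(*groups)


def first_match(question, disclosure):
    """Brute-force pairing; equality is a stand-in for the real comparison."""
    for q in embodiments(question):
        for d in embodiments(disclosure):
            if q == d:
                return q, d
    return None


# Hypothetical generic formulas with two variable positions each.
question = [("Cl", "Br", "I"), ("CH3", "C2H5")]
disclosure = [("F", "Br"), ("C2H5", "C3H7")]

pairs = count_embodiments(question) * count_embodiments(disclosure)
print(pairs, first_match(question, disclosure))

# Embodiment counts stated for Tests 1 to 3 in this appendix.
for q_count, d_count in [(1600, 48), (1280, 48), (10, 8)]:
    print(q_count, "x", d_count, "=", q_count * d_count)
```

The point of the comparison is that a single pass over one generic question and one generic disclosure replaces this many pairwise searches of specific structures.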

APPENDIX D

The use of machines as aids in the preparation of search files

Since the character recognition and mechanical translation arts are not yet sufficiently developed to permit machines to perform the entire job of preparing the search files from printed documents, we must make the best possible use of machines to help human workers perform this exacting and voluminous task. The following is one proposal for a contemplated method of data preparation to be used in the Haystaq system.

A technically trained person will read and analyze each document, note the portions to be encoded, and then prepare a formal abstract or summary of the subject matter. This abstract will indicate the organization of the various levels of disclosure, “index numbers” of the compounds (items) included in the several admixtures (compositions) disclosed, indications of the compositions involved in the various processes, the “accidental” descriptors, and indications of alternativeness, negation, conjunction and other relationships. (“Accidental” descriptors are those characteristics of a material which come to light only because of the environment disclosed for that item in the particular document being treated.) The subject matter of the abstract will then be encoded and transcribed onto punched cards.

Index numbers are unique numbers assigned to each specific compound contained within a collection. Each index number will be represented by an index card (one or more punched cards) which will contain (a) synonyms for the compound, (b) the index number, (c) the structural codes (the types used in 3C and 3D), (d) the empirical formula (computed by machine from the structural codes) and (e) the fixed descriptors (the types used in 3A and 3B). Such an index card will be prepared the first time that a compound is encountered in preparing the search file.
To compile the complete, encoded disclosure, the machine will record on magnetic tape all codes present on the abstract cards and intersperse in the proper places the codes from the appropriate index cards (except the element-by-element structure code). The tape so prepared will then be processed by the machine according to an “editing” program. This program will generate and insert into the code data for various “mechanical” screens as well as “housekeeping” information for use by the machine in performing the search. Following the editing routine, the encoded data will be checked for mechanical and logical errors by use of another computer program. Since Haystaq employs the element-by-element structure codes on a secondary tape, these codes will be similarly copied from the index cards in the preparation of that tape. Question data will be similarly prepared. In order to provide easy access to the desired index cards, an unambiguous filing system must be employed. It is planned to make use of one of the available indexing schemes which generate unique ciphers for each compound (26). Since these ciphers lend themselves to a systematic alphabetization, they can be employed in the manner of tabs on library catalogue card files to enable a machine to locate index cards which contain the detailed codes needed to supplement the abstract in preparing the complete search file.
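The compilation step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' editing program: the `IndexCard` fields mirror items (a) through (e), but the record layout, the cipher strings, the code values, and the convention that an abstract entry refers to a compound by its cipher are all hypothetical. The element-by-element code is held in its own field so that it can be routed to the secondary tape.

```python
from bisect import bisect_left
from dataclasses import dataclass


@dataclass
class IndexCard:
    cipher: str                # unique notation cipher, used as the filing key
    index_number: int
    synonyms: tuple
    structural_codes: tuple
    empirical_formula: str
    fixed_descriptors: tuple
    element_structure: tuple   # element-by-element code, kept off the primary tape


def find_card(cards, keys, cipher):
    """Locate a card in a file kept in cipher order (binary search)."""
    i = bisect_left(keys, cipher)
    if i == len(keys) or keys[i] != cipher:
        raise KeyError(cipher)
    return cards[i]


def compile_disclosure(abstract, cards):
    """Merge abstract codes with index-card codes; cards must be sorted by cipher."""
    keys = [card.cipher for card in cards]
    primary, secondary = [], []
    for kind, value in abstract:
        if kind == "code":
            primary.append(value)
        else:  # kind == "compound": value is the cipher of the compound
            card = find_card(cards, keys, value)
            primary.extend([card.index_number, *card.structural_codes,
                            card.empirical_formula, *card.fixed_descriptors])
            secondary.append((card.index_number, card.element_structure))
    return primary, secondary


# Hypothetical index file and abstract; every code value is a placeholder.
cards = sorted([
    IndexCard("CIPHER-B", 2, ("benzene",), ("S20",), "C6H6", ("D7",), ("E1", "E2")),
    IndexCard("CIPHER-A", 1, ("ethanol",), ("S10", "S11"), "C2H6O", ("D3",), ("E5",)),
], key=lambda card: card.cipher)
abstract = [("code", "COMPOSITION"), ("compound", "CIPHER-A"),
            ("code", "AND"), ("compound", "CIPHER-B")]

primary, secondary = compile_disclosure(abstract, cards)
print(primary)
print(secondary)
```

Keeping the card file in cipher order is what lets the lookup behave like the tabbed catalogue file the text describes: the machine jumps toward the wanted card instead of reading the whole file.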

APPENDIX E

Some comments on parallel searching and memory devices

The system now being devised for tests on SEAC (and all other systems so far proposed) operates in such a manner that a limited number of questions is held in relatively fixed storage and the entire, very large collection of permanent data to be searched is serially compared with the internally stored static questions. This is comparable to the FBI’s looking for a person whose characteristics they know by making the entire population of the country come to Washington and walk past the FBI building in single file.

A more desirable approach would be to keep the large volume of data to be searched in a static, fixed storage and to use the questions as input. Under such a system, the search would proceed directly from question to location of an answer (if any existed) in the static fixed storage. Questions would be asked serially in rapid succession and the output speed could be made to depend upon the speed with which questions could be propounded as input. This would be of the same order of speed as the input in the system first mentioned above.

By what means could the disclosure file be stored so as to provide truly parallel (i.e., simultaneous) access to all disclosures and permit instantaneous retrieval? Emphasis on memories providing large capacity, rapid access, and rapid retrieval has led to the development of a large volume of factual information concerning the characteristics of a variety of memory systems. More emphasis must be placed on determining the basic characteristics of memory systems in general. Research in this direction could well lead to the birth of systems radically different from any now contemplated. More consideration should be given to deriving systems which can provide unlimited polydimensional access to stored information. Presently available storage devices are relatively bulky in size and most of them are at best two-dimensional in character.
Some work has been done along the lines of polydimensional storage, but a more intensive effort in this direction is needed. With the accent on research in solid state physics today, perhaps some investigations into the use of intersecting planes in a crystal, as a memory device, might be in order. Or perhaps some bright young scientist might investigate the use of complex wave forms as storage media and their response to question wave forms by (1) analysis and (2) subset or component wave matching. The area of “self-tracing cross-path systems” should be investigated, since such systems offer a possible means for storing and retrieving complex types of internal relationships of disclosures. (See Section III, under “Machines with Learning Ability,” part (c), for one concept of a self-tracing cross-path system.)

This is not meant to de-emphasize either the importance of volume and cost in storage requirements or the speed of access and retrieval, since these are important economic factors. However, the economic aspects of this field should be considered secondary to the principal task of bringing to a higher state of maturity the relatively elementary electronic machines of today as tools for information retrieval and literature searching. The “automatic library” of Bush (22, 23) and other men of vision must be capable of being embodied in some usable form if the results of human investigations in the realm of science are to be made available to further workers in the field, and the concepts mentioned above must be pursued before this end can be attained.
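The contrast drawn in this appendix, between parading the whole file past a stored question and going from the question directly to the answer, can be illustrated in software. The sketch below uses an inverted file, a technique swapped in here for illustration only; the appendix itself speculates about memory hardware and does not propose this method. The function names, document labels, and single-letter descriptors are hypothetical.

```python
from collections import defaultdict


def serial_search(file, question):
    """Present scheme: every disclosure is compared with the stored question."""
    return {doc for doc, descriptors in file.items()
            if question and question <= descriptors}


def build_inverted_file(file):
    """Static storage keyed by descriptor instead of by document."""
    inverted = defaultdict(set)
    for doc, descriptors in file.items():
        for descriptor in descriptors:
            inverted[descriptor].add(doc)
    return inverted


def direct_search(inverted, question):
    """Go from the question to candidate documents without scanning the file."""
    postings = [inverted.get(descriptor, set()) for descriptor in question]
    return set.intersection(*postings) if postings else set()


# Hypothetical disclosure file: document label -> set of descriptors.
file = {"doc1": {"a", "b", "c"}, "doc2": {"a", "b"}, "doc3": {"b", "d"}}
inverted = build_inverted_file(file)

question = {"a", "b"}
print(serial_search(file, question) == direct_search(inverted, question))
```

The work done by `direct_search` grows with the number of descriptors in the question and the length of their document lists, not with the size of the whole collection, which is the property the appendix asks of a static store. It handles only simple conjunctions of descriptors and none of the structural relationships the Haystaq questions require.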

REFERENCES

1. METCALFE, J., Information Indexing and Subject Cataloguing, The Scarecrow Press, Inc., New York, N.Y., 1957.
2. NEWMAN, S.M., Problems in Mechanizing the Search in Examining Patent Applications, Patent Office Research and Development Report No. 3, Department of Commerce, Washington, D.C., 1956.
3. LANHAM, B.E., J. LEIBOWITZ, and H.R. KOLLER, Advances in Mechanization of Patent Searching—Chemical Field, Patent Office Research and Development Report No. 2, Department of Commerce, Washington, D.C., 1956.
4. LANHAM, B.E., J. LEIBOWITZ, H.R. KOLLER, and H. PFEFFER, Organization of Chemical Disclosures for Mechanized Retrieval, Patent Office Research and Development Report No. 5, Department of Commerce, Washington, D.C., 1957.
5. BAR-HILLEL, Y., A Logician’s Reaction to Recent Theorizing on Information Search Systems, American Documentation, 8, [2], 103–113 (1957).
6. LUHN, H.P., A Statistical Approach to Mechanized Literature Searching, International Business Machines Corporation, Research Center, Poughkeepsie, N.Y., Jan. 20, 1957.
7. PERRY, J.W., and ALLEN KENT, Documentation and Information Retrieval, Interscience Publishers, Inc., New York, 1957.
8. PFEFFER, H., H.R. KOLLER, and E. MARDEN, A First Approach to Patent Searching Procedures on Standards Electronic Automatic Computer (SEAC), Patent Office Research and Development Report No. 10, Department of Commerce, Washington, D.C., 1958.
9. RAY, L.C., and R.A. KIRSCH, Finding Chemical Records by Digital Computers, Science, 126, 814–819 (1957).
10. OPLER, A., and T.R. NORTON, New Speed to Structural Searches, Chemical and Engineering News, 34, 2812–2816 (1956).
11. Digital Computer Newsletter, Journal of the Association for Computing Machinery, 4, [4], 547–548 (1957).
12. HATTERY, L.H., Public Administration Review, 17, [3], 159–163 (1957).
13. LOCKE, W.N., and A. DONALD BOOTH, Machine Translation of Languages, The Technology Press of the Massachusetts Institute of Technology, Boston, and John Wiley and Sons, Inc., New York, 1955.
14. WILKES, M.V., Can Machines Think? Proceedings of the I.R.E., 41, [10], 1230–1234 (1953).
15. SHANNON, C.E., Computers and Automata, Proceedings of the I.R.E., 41, [10], 1234–1241 (1953).
16. FRIEDBERG, R.M., A Learning Machine: Part I, IBM Journal of Research and Development, 2, [1], 2–13 (1958).
17. FAIRTHORNE, R.A., The Patterns of Retrieval, American Documentation, 7, [2], 65–70 (1956).
18. NEWMAN, S.M., Linguistics and Information Retrieval: Toward a Solution of the Patent Office Problem, Monograph Series in Linguistics and Language Studies No. 10, Georgetown University Press, Washington, D.C., 1957.
19. NEWMAN, S.M., Linguistic Problems in Mechanization of Patent Searching, Patent Office Research and Development Report No. 9, Department of Commerce, Washington, D.C., 1957.
20. Electronic Computer Study of English Syntax Patterns, Technical News Bulletin, National Bureau of Standards, 41, [6], 84–86 (1957).
21. SHERA, J.H., Research and Developments in Documentation, Library Trends, 6, [2], 187–206 (especially p. 202) (1957).
22. BUSH, V., As We May Think, Atlantic Monthly, pp. 101–108, July, 1945.
23. BUSH, V., For Man to Know, Atlantic Monthly, 196, [2], 29–34 (Aug., 1955).
24. BAILEY, M.F., B.E. LANHAM, and J. LEIBOWITZ, Mechanized Searching in the U.S. Patent Office, J. of the Patent Office Society, 35, [7], 566–587 (1953).
25. OPLER, A., Dow Refines Structural Searching, Chemical and Engineering News, 35, [33], 92–96 (1957).
26. Which Notation? Chemical and Engineering News, 33, [27], 2838–2843 (1955).
27. Seeing-Eye Computer, Time, 71, [11], 61 (1958).
