Cover Image

Not for Sale



View/Hide Left Panel

The Basic Types of Information Tasks and Some Methods of Their Solution

V.P.CHERENIN

General information about retrieval tasks

THE POSING OF THE TASK

The defined set of information among which it is necessary to carry out a search is represented in the form of a universe E of individual elements of information ek (13). The answers to questions asked are particular subsets of E. The volume of E can grow by means of the addition of new ek, that is, E is considered as defined only for a definite interval of time. Data from the elements of information are linked more or less closely to one another, since the condition is usually assumed that they deal with one and the same “subject” (concept, experiment, process, substance, animal, person, etc.). On the basis of considerations which will be set forth below, the universe E is, as a rule, made up of homogeneous ek. Proceeding from the defined set of information, the scope of the questions asked is also determined more or less accurately. In general, this does not mean that all the possible questions are previously known, but that for each question it is usually known whether or not it can be asked. The general conditions for the selection of ek for the questions asked are also known.

The solution of the task “Does it answer the question asked?” for each ek, generally speaking, does not evolve clearly from the information contained in the ek and the question. It can require either a processing of that information according to definite rules or the referring only to information from other ek (if the set of the information is closed), or even the referring to a more general set of information than E, or, finally, it may go beyond the limits of searching for certain information and may require the carrying out of scientific research work. Thus the work of carrying out searches is in its immediate, original form one of the varieties of mental labor requiring not only great erudition and memory in order to cut down the number of references to the data which

V.P.CHERENIN All-Union Institute of Scientific and Technical Information, Academy of Sciences of the USSR, Moscow.



The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.
Terms of Use and Privacy Statement



Below are the first 10 and last 10 pages of uncorrected machine-read text (when available) of this chapter, followed by the top 30 algorithmically extracted key phrases from the chapter as a whole.
Intended to provide our own search engines and external engines with highly rich, chapter-representative searchable text on the opening pages of each chapter. Because it is UNCORRECTED material, please consider the following text as a useful but insufficient proxy for the authoritative book pages.

Do not use for reproduction, copying, pasting, or reading; exclusively for search engines.

OCR for page 823
--> The Basic Types of Information Tasks and Some Methods of Their Solution V.P.CHERENIN General information about retrieval tasks THE POSING OF THE TASK The defined set of information among which it is necessary to carry out a search is represented in the form of a universe E of individual elements of information ek (1−3). The answers to questions asked are particular subsets of E. The volume of E can grow by means of the addition of new ek, that is, E is considered as defined only for a definite interval of time. Data from the elements of information are linked more or less closely to one another, since the condition is usually assumed that they deal with one and the same “subject” (concept, experiment, process, substance, animal, person, etc.). On the basis of considerations which will be set forth below, the universe E is, as a rule, made up of homogeneous ek. Proceeding from the defined set of information, the scope of the questions asked is also determined more or less accurately. In general, this does not mean that all the possible questions are previously known, but that for each question it is usually known whether or not it can be asked. The general conditions for the selection of ek for the questions asked are also known. The solution of the task “Does it answer the question asked?” for each ek, generally speaking, does not evolve clearly from the information contained in the ek and the question. It can require either a processing of that information according to definite rules or the referring only to information from other ek (if the set of the information is closed), or even the referring to a more general set of information than E, or, finally, it may go beyond the limits of searching for certain information and may require the carrying out of scientific research work. Thus the work of carrying out searches is in its immediate, original form one of the varieties of mental labor requiring not only great erudition and memory in order to cut down the number of references to the data which V.P.CHERENIN All-Union Institute of Scientific and Technical Information, Academy of Sciences of the USSR, Moscow.

OCR for page 823
--> are not included in the information element being immediately considered, but also great art in processing the data from the ek being considered and from the question according to definite rules. In order not to resort repeatedly to such a complex and slow process during each search (it is assumed that searches will be made many times), it is obviously desirable either to record, by some method, the results of searches previously made, or to create a special information language and to use it to record the content of the information in all the ek, as well as the content of the questions asked, in such a form that will make it possible, to a greater or lesser extent, to automatize the process of solution, without at the same time leading to an inadmissibly large increase in the volume of initial data or time expended for the solutions. The first method of solution cannot be applied to all possible questions (of the defined scope), if there are very many of them, and therefore the entering of all old and all newly incoming ek is done only for a limited number of rather general questions or for more limited, more important questions. This work can be done periodically (as a sufficient amount of new ek is accumulated) in the form of bibliographic reviews, reference books, tables, etc., prepared by specialists on the individual questions. This work is done continuously as new questions come into the information section, by having the section personnel examine all the existing material, with attempts being made here to use in part the results of searches made by the users themselves, either on the basis of the content of the requests (library) for sets of ek which they found as answers to the questions in which they were interested, or on the basis of lists of sources which they cited (ek) in articles which they published (citation indexes (4)). Subject classifiers also carry on continuous work in recording each incoming ek on a more or less established list of questions according to subject headings, from which subject indexes are then prepared. A secondary information task, however, develops as a result of this work: since the answers are found not for all, but usually only for a small number of all possible questions, it is often necessary to find for an arbitrary question (once again, from the definite scope) all the similar questions “corresponding” to it from among the already processed number of questions, which thus form a new set of elements of information E. For example, a system of cross references in the subject index is planned for this purpose. The solution of this secondary task has, as a rule, already had to be done (with the large number of all possible questions) by the second method, which is the only one that we shall consider in the future. INDEXING With the second method of solution the content of the information in each ek and the content of any question asked is, in general, approximated more or

OCR for page 823
--> less on a one-to-one basis by the set of standard semantic units—characteristics—with an indication of the standard synthetic relations between the latter (1–3). Such a process is usually called indexing. By synthetic relations we understand the connections which can be established between the characteristics directly on the basis of the content of the ek being indexed or the question. A particular degree of accuracy and one-to-one correspondence (the most that can be permitted is a few approximations) of indexing depends upon our knowledge of the ek and the question (it is impossible to approximate something with greater accuracy than the accuracy of the assignment of that something), upon our possibilities with respect to the scope of the work necessary to carry out a particular approximation and the time, complexity of equipment, and volume of entries necessary for solution by means of comparison of approximations of the particular accuracy. It is also necessary for the degrees of accuracy of approximations of ek and the question to be approximately identical. Therefore, it is often necessary, even for tasks in which a high degree of approximation is applicable, to limit oneself to lower degrees of the latter. The demand for a one-to-one correspondence of approximation is, I dare say, more important than the accuracy, if by a shortcoming of accuracy we understand the searching for superfluous ek which do not answer the question asked, but not the loss of necessary ek, since when the one-to-one correspondence of approximation is violated, necessary ek will deliberately be lost. Therefore, it is sometimes better to approximate ek and the question less accurately, but on a more nearly one-to-one basis. In general it can be said that the content of each ek and question must be approximated on a one-to-one (or almost one-to-one) basis by the logical function of logical variables, after which the selection conditions to be satisfied must be checked by means of definite logical operations on the functions of ek and the question. In this process the conditions for selection can in general be approximately described as the requirement that the function of the question be the logical consequence of the function of the ek. We shall meet an illustration of this approach below. Thus, in the final analysis the special information language must be a variety of that “language of science” about the creation of which many scientists have dreamed. Without posing the task too broadly, but remaining on the ground of reality, we shall limit ourselves in the future only to the simplest degrees of approximation which are applied in practice and which require a not-too-complicated information language, which is so called, rather than the “language of science,” in order to emphasize its simplicity and its immediate practical applicability. It is on the basis of these degrees of approximation that the basic types of information tasks will be introduced.

OCR for page 823
--> Getting somewhat ahead of ourselves, it can be said that with an increase in the number of the type there is an increase in the complexity of approximation and the entire information task. Therefore, it is obvious that for any information task it is necessary to select the lowest type providing for a one-to-one correspondence and the minimally acceptable accuracy of approximation. The acceptable accuracy is determined on the basis of the practical requirements, and the possibility of carrying out the particular one-to-one correspondence and accuracy of approximation, on the basis of the content of the ek and the questions and on the basis of the results of experimental searching. The greater the accuracy of approximation, the more difficult it is to observe its one-to-one correspondence. This fact, which is well known to any classifier, also forces one to get by with the minimally acceptable accuracy. The simplest method of indexing is the direct processing of the content of the ek and the questions according to definite rules without referring to any other data (with the exception of those rules). With the least accuracy of approximation this comes down to extracting from the content itself for each ek some one particular characteristic (or several independent characteristics) used as the standard name for that ek (type I). Examples of such names could be: the last name of an author or the name of an article, the name of a substance in some system of designations, or the principal subject of an article recorded in a definite form. When this degree of accuracy proves to be inadequate or the names obtained cannot be considered as standard ones (and it is difficult to standardize them), then the content of the ek and the questions is semantically subdivided into smaller units, for each of which the name described above may prove possible, on the basis of the characteristics which are encountered in them and which are accepted as the standard. This possibility is facilitated by the lesser concreteness of each part and, as a rule, the lesser number of all possible parts as compared with the total number of all ek or questions, and this makes it possible to standardize these parts more easily. This subdivision must be done on a one-to-one basis and must be carried out either according to definite sections (type II), or with the aid of other sufficiently unambiguous rules expressed with the aid of a standard structural formula for content. Since it is assumed that ek is taken as the sets of information which are not able to be subdivided into completely unrelated parts, the parts within the formula are either related by simple conjunctive relationships (type III) or by synthetic relationships (type IV). Examples of these parts for type II could be: year of publication, name of magazine, country, language, author’s last name, or principal subject, which are extracted from an article; headings of questionnaire data for a person; properties of substances from various sections. For type III, individual subjects, processes, conditions, methods, etc., for an ordinary sub-

OCR for page 823
--> ject heading could be extracted. Type IV can be illustrated by structural formulae of organic compounds. With an insufficient degree of accuracy, such a subdivision can be further continued, provided that the first subdivision can be made on a one-to-one basis; in this process we shall come sooner or later to a small number of easily standardized characteristics. But if this subdivision, on the basis of meaning, on the basis of content of the ek or the question, proves to be an impossible one or cannot be carried out on a one-to-one basis (or an almost one-to-one basis, since several different subdivisions are acceptable), then it is necessary, when approximating the total ek or the question, or parts of them which were isolated as a result of the acceptable subdivisions, to resort to a more complicated method of indexing when referring to information not contained directly in the ek. This indexing is actually a classification, that is, a referring of the entire ek or the question, or parts of them, to one classification characteristic (heading)—or several independent ones—in a single system of standard characteristics of classification (type I), to one characteristic (or several independent ones) in each of several sections (Ranganathan facets) or classification systems (type II), or to several characteristics in one system jointly describing the content being indexed (type III). Thus, there is a certain similarity between indexing and classifying (5). Indexing can be defined as direct, inductive classifying, which is more accurate than ordinary classifying, since it is also possible to take synthetic relations into consideration. Classifying can be viewed as less accurate, but a more nearly one-to-one, deductive indexing, where the role of the information language is played by the system of classification headings. Below we shall consider information language and types of information tasks from a more general point of view. CONDITIONS FOR SELECTION The conditions for selection evolve from the ordinary posing of the task of searching for information, when it is necessary to find—to collect—information devoted to each of the “subjects” making up the volume of the concept characterizing the question asked. This, by the way, is usually emphasized in the very way the questions are asked. For example, a person requests literature on the subject “properties of antibiotics,” thus wishing to obtain information on the properties of penicillin, or streptomycin, etc., and does not formulate the question as “the properties of an antibiotic,” which would mean the desire to find out information dealing only with properties common to all antibiotics. In other instances the way the task is posed cannot be reflected in

OCR for page 823
--> ordinary language, when, for example, information is requested on the subject “the effect of radiant energy upon bacteria,” although even here it is understood that it would be more correct to say “various types of action of various types of radiation upon various types of bacteria.” With the aid of a stricter formulation of questions it is possible to reduce all the varieties of questions to such a generic-specific scheme. For example, when a person requests information on the “effect of acceleration upon an aircraft,” he usually wishes to obtain information about “types of action of accelerative forces upon various parts of an aircraft,” not upon various types of aircraft. Therefore, not “aircraft,” but “part of an aircraft” should be taken as one of the concepts characterizing the question. Thus, the content of the ek being sought as those dealing with the subjects which are part of the scope of the question-concept must be characterized by concepts whose content includes this question-concept. In this sense the logical function of the question, when satisfying the condition of selection, is the consequence of the logical function of the ek. The conditions for selection will be described in more detail when the individual types of information tasks are examined. INFORMATION LANGUAGE The characteristics pn, which form the universe P, and synthetic relations are analogous to the words and grammatical relations of ordinary language. Like the degree of approximation and the conditions for selection, they are determined by the nature of information contained in ek. However, the appearance of new ek may require the appearance of new characteristics to describe them, but the relations remain the same. In order to observe the one-to-one correspondence of approximation it is desirable to have the characteristics (as well as the relations) not only standard, but also independent, that is, to assure that between them there do not exist any analytic relations which make it possible to substitute for some one characteristic another semantically similar characteristic or several characteristics which are united by means of the synthetic relations used in language. By analytic relations we understand here the constant semantic relations established between the characteristics on the force of a broader and older set of information than E. Obviously, the fewer relations that the language has, the more concrete and more varied the words must be to preserve the same degree of approximation, and, conversely, with the existence of many relations the number of characteristics and their concreteness can be fewer, and this makes it possible more easily to standardize them and ascertain their independence. The demand for independence of characteristics leads in the first instance to a com-

OCR for page 823
--> plicate representation of the less concrete, dependent characteristics in the form of disclosing their scope with the aid of enumerating a whole series of initial characteristics. In the second instance it leads to a complicated representation of very concrete characteristics in the form of compound complexes of a large number of initial characteristics (the content is disclosed). Ordinary language is rich both in relations and in words, and both are not independent, that is, use is made, for example, not only of a word designating a certain concept, but also words designating more general concepts (for example, generic for initial, which results in the non-one-to-one correspondence of representation of the content (without even considering synonyms), since instead of the first word it is possible to use its definition, consisting of other words which are related to one another. However, the presence of words for more general concepts does not make it necessary to describe them by means of a listing of words for the concepts making up their scope. Conversely, the presence of words for less general concepts does not make it necessary each time to give their complete definition by means of disclosing their content through words for more general concepts. Thus, the demand for the independence of characteristics leads to great difficulties in information language. In order to surmount the difficulties mentioned, limitations are first imposed on the heterogeneity of content of the ek themselves. One information task does not include ek with contents which differ greatly with respect to degrees of concreteness. For example, searches for literature are not made simultaneously among books and articles. Analogous limitations are imposed on the questions asked, that is, it is required for the degree of concreteness of content of the questions to be within certain limits. Then choice is made of the most appropriate level of concreteness of the independent characteristics forming the universe P. With the aid of the generally used method of approximation, the characteristics from P must describe the content of ek and the questions with an accuracy to any permissible degree of concreteness. If during this process the above-indicated difficulties are not eliminated then one selects two or more sets of independent characteristics P′, P″, P″′, etc. without an intersection which correspond to various increasing levels of concreteness, and an approximation is made of the content of ek by means of the union of sets P=P′∪P″∪P″′· · ·, which thus consists of dependent characteristics. The content of the question is also approximated by means of a union of sets. In this process the above-mentioned difficulties are eliminated by using in the approximation of characteristics the most appropriate level of concreteness. That is, the particular content is described by characteristics which are as concrete as possible without using the listing of characteristics for disclosing the scope of the more general concept, if for some general concept this still proves impos-

OCR for page 823
--> sible even with the aid of characteristics at a lower level of concreteness, then it is considered to be beyond the limits of the accepted degree of concreteness of content and it is ignored, and, on the other hand, without the use of too complicated complexes of characteristics for revealing the content of a very concrete concept (if this proves impossible, the concept is replaced by a generic concept close to it). This method of approximation will be called the “natural” method. Having avoided these difficulties, we nevertheless again come up against the possibility of the non-one-to-one representation of the content. Actually, the condition that characteristics which are as concrete as possible will be used requires the knowledge of all possible methods of representing them through less concrete characteristics, which (methods) can occur as a result of the dependence of the characteristics from P. On the other hand, the knowledge of these relations between characteristics is also necessary when searching for ek on the basis of generic characteristics, when what is requested is not a listing of all concrete characteristics, but a description of a somewhat broader content with the aid of less concrete characteristics. Thus, the manifestation, in an information task, of a universe of characteristics P and analytic relations between them is closely related to the establishment of the method of approximating the content of ek and the questions asked. This means that the development of a special information language is a necessary and, at the same time, the basic and most complicated, phase of making the process of task solution automatically controlled (1–3). This phase is obviously linked with works on terminology and classification and requires the knowledge of a broader scope of information than E, if the latter is not closed. The analytic links revealed between characteristics from P must be recorded in a form that makes it possible to locate them quickly and automatically when searching for ek, and when approximating the contents of ek and questions. The summarizing of these relations is similar to the set of cross references and other rules for the use of subject index. Thus, the locating of characteristics which are analytically related to one another constitutes an auxiliary information task (1), which, as a rule, deals with a broader content than E and which can be solved: (a) separately from the basic task when approximating the question asked, if only this does not result in a too complicated representation of the question, for example, in the form of listing a large number of characteristics revealing the volume of less concrete characteristics; (b) in the process of solution, when, for the content of the ek being examined, more general representations of it than that which is recorded are found very quickly in a special memory device and all the representations are compared with the question, which is approximated, like the content of ek in instance (a), by the natural method; (c) one time before recording the contents

OCR for page 823
--> of the ek with the subsequent recording of the result of solution of the auxiliary task when recording the content of the ek itself, that is, by limiting ourselves to natural approximation for the question, we shall make for the ek several approximations at various levels of concreteness, or, putting it more exactly, with a drop of one degree in the level of concreteness for each characteristic (a rise in the level of concreteness would, once again, require a complicated representation of the volume of less concrete characteristics, and a drop by several degrees of concreteness would again lead to a very complicated representation of very concrete characteristics, although sometimes this can be used with successful coding). For complicated information tasks with a broad volume of information from E, the first two methods of making the solution of the auxiliary information task automatically controlled are not very applicable, (a) because of the large amount of time expended when searching for ek with a complicated and, as a rule, unnatural representation of the questions, and (b) because of the absence of a high-speed memory device of sufficient capacity for auxiliary searches. The third method has found applicability in cases when, with the aid of proper coding, the recording of approximations at various levels of concreteness can be made economic enough. Actually, the principal shortcoming of the third method of solution of the auxiliary information task is the necessity of recording each characteristic as many times as it is encountered, not only in its natural form, but also in the form of representation through characteristics of a lower level of concreteness, which reveal its content; that is, it is necessary to give its definition each time. Let us note that the first two methods of solving the auxiliary information task presuppose that the transition to approximation of the content of ek at a lower level of concreteness of characteristics can be made by means of representation of each individual characteristic through less concrete characteristics. In the third method of solution this cannot be presupposed; however, then this approximation will be made too labor-consuming and nonautomatic and will require a still greater increase in the number of entries. It will not be possible to reduce the number of entries by means of coding, since it will be necessary to give, for the entire content of the ek, several parallel, individual approximations. Therefore we shall assume in the future that this transition is allowed by the method of approximation itself. Almost everything that has been said with respect to characteristics is applicable to synthetic relations; however, taking into consideration the fact that the number of different relations in a language is usually not great and therefore the standardizing of them and the ascertaining of their dependence upon one another do not represent any great labor, we shall not dwell on them in greater detail.

OCR for page 823
--> Having thus ascertained that the consideration of analytic interdependencies among characteristics, as well as among synthetic relations, leads to an auxiliary information task that can have parallel solutions (separately or jointly), we can consider that the universe P consists of standard characteristics pn which, when the basic task is solved separately from the auxiliary task, can be viewed as independent characteristics, with the approximation by them of the content of the ek and the questions being carried out on a one-to-one basis—by the natural method. An analogous assumption can also be made with respect to synthetic relations. The types of information tasks revealed below by this assumption will obviously also occur for auxiliary information tasks, which, if they are viewed separately, do not in any way differ in principle from the basic tasks (some of the characteristics of the basic task begin to play the role of elements of information in the auxiliary task). The type of any information task will then be determined by the combination of types of the basic and auxiliary tasks into which the information task is broken down. Basic types of information tasks TYPE I. This is the simplest type of task, to which the other tasks can also be reduced, with certain tasks belonging exclusively to this type as a result of the insignificance of the information contained in the ek. Here the content of each ek is characterized simply by one or several standard names. These name-characteristics can be completely independent of one another. They can be given to elements of information to a considerable degree arbitrarily. Names which are similar (or even identical) in spelling can correspond to ek which are completely dissimilar to one another. In instances when the ek is characterized by several names, these latter are linked purely accidentally and other, different ek can correspond to each of them. The scope of the questions consists of the names encountered in all the ek and also of other names which can be constructed analogously to the former. Having in mind, here and elsewhere in the future, only elementary questions which cannot be represented in the form of several questions which can be solved by independent, separate searches, it is possible to say that each question consists of one name. The conditions for selection lie in the requirement of complete correspondence between the question and the name of the ek. All semantic, classification relations are absent here. Examples of this could be: searches for telephone numbers, names of books, addresses, and the like, by persons’ last names; translations or definitions of words by words; descriptions

OCR for page 823
--> of magazines according to their standard names; and in general, code values by codes. It is obvious that with complete independence of the name-characteristics, no auxiliary information searches are necessary and all that remains is the basic information task in its pure form. Standard name-characteristics can also be nonindependent and can reflect the content of the ek somewhat more thoroughly, that is, they can be given in a nonarbitrary manner. The solution of the basic task here remains the same as before. However, in this instance the purpose of the auxiliary information task will be to reveal the (analytic) relations between the characteristics, and, as has already been stated, this task can belong to any of the types introduced. These relations can also find their reflection in the very spelling of the characteristics, that is, characteristics which are similar in spelling can correspond to dependent characteristics and to ek which are similar in content, and this can be used in coding to solve the auxiliary information task by the third method. This similarity is revealed partially in words characterizing similar concepts (searches for translations or definitions of words according to the words themselves). In every scientifically elaborated terminology and nomenclature, this similarity is reflected still more completely. Another example could be the searching for some substances according to the values of some one of their properties or, in general, the values of a continuous function according to the values of one argument. TYPE II. In the second type of information tasks each ek is characterized not by one standard name, which cannot be the case, but more thoroughly by several sections—by one or several standard characteristics in each section, with several characteristics in one section being related purely accidentally, as formerly. Insofar as an elementary question, once again, contains here no more than one characteristic or its negation for each section, then, as in type I, any ek can be subdivided into several ek having no more than one characteristic in each section. The synthetic relations between characteristics are lacking here, if one does not include in them logical conjunctive and disjunctive relationships between predicate-characteristics and their negations. Actually, the content of any ek here (after the subdivision into ek containing no more than one characteristic in each section) is represented in the form of conjunction of the predicate-characteristic, and any nonelementary question in the form of a certain expression consisting of predicate-characteristics and their negations, united by conjunctive and disjunctive relations. The reduction of a nonelementary question to elementary ones is equivalent to the representation of that expression

OCR for page 823
--> searches, in the case of the simplest “devices,” are broken down, are made immediately automatic and are accelerated, and the frame of microfilm that is found as a result of the searches, with the content of ek∈En recorded in the usual form, appears on the screen of a microfilm viewer or on a television screen (for transmission to a distance). The use of microfilm to record the content of the ek makes it possible, when using these rather simple decoders, to obtain, in addition to the automation and acceleration of searches, a considerable decrease in the volume of the entries themselves. Such devices can therefore be considered the ones with best prospects for the rapid and automatic solution of type I tasks and, in particular, for finding translations of words in other languages, for the rapid obtaining of information concerning addresses, telephone numbers, and the like, and for other similar tasks. TYPE II TASKS An estimate of the number of all possible questions which can be made up for all ek∈E in the case of type II tasks shows that this number usually proves to be much greater than the number of all possible ek and pn. Therefore, the use of the simplest “devices” in which all these questions are located in lexicographical order becomes here, as a rule, practically impossible as a result of the excessive increase in their volume. When using parallel- and consecutive-action decoders, the characteristics can now be encoded not only by “order,” but also by the “tabular” method, and it is possible to obtain as answers intermediate names or numbers of the ek sought, according to which intermediate answers one can then, as in a type I task, make an additional search of the contents of the ek themselves. In other words, in order to facilitate the solution of type II tasks, it is considered justified to reduce the individual stages to the solution of tasks of a simpler type, type I. The process of solution with the aid of parallel-action decoders is broken up, as in the case of the simplest “devices,” into a not very large number of steps, during which a search is made for all the smaller subsets from a set of elements of information E. However, because of the absence of the completely systematic location of the ek (the impossibility of locating in lexicographical order all the questions pertaining to them), it is not possible here to limit oneself only to the consecutive isolation of subsets inserted in one another, but it is necessary also to search for the intersection of rather large subsets from E which correspond to the individual characteristics of the question. And this does not make it possible to use, in the case of type II tasks, certain of the simpler parallel-action decoders, as, for example, the Amfis system. Putting it more exactly, the process of searching can be described as the finding of the answer E★⊂E in the form: E∝=En1∩En2∩··· Ena,

OCR for page 823
--> where En1, En2,· · ·, Ena are subsets from E which correspond to the characteristics of the question , forming the set P★⊂P. The subsets En1, En2,· · ·, Ena can be easily found by the characteristics with the aid of the simplest “devices,” as was done in type 1 tasks, if all the ∈P are situated in lexicographical order. If this situation is made for each section separately, then it is also possible to carry out the simultaneous search for all assigned characteristics. These subsets must, however, contain not the complete descriptions of the ek included in them, but only the intermediate names or numbers of the latter, in order to avoid the duplication of those descriptions (each ek is usually approximated by characteristics from all sections) and to facilitate the finding of the intersection of the subsets obtained. The nonautomatic finding of the intersection is used in Taube’s coordinate indexing system (8) and is actually the sole possibility of using the simplest “devices” for solving type II tasks. Such a method of solution proves to be not only labor-consuming, but also requires a rather large amount of time, without even mentioning the necessity of solving several type I tasks (one task for each question characteristic, and one task for each finding of the complete descriptions of the ek according to their intermediate names or numbers), and therefore it cannot be considered satisfactory. Other parallel-action decoders used to solve type II tasks include punched cards with notches on the edges and superimposed punched cards (11) (aspect cards). These devices are to a certain degree dual one for another, as can be revealed from an examination of the principle of comparison of characteristics (1), which is common not only to them, but also to the majority of other decoders. Let us assume that two characteristics are compared: pn⊂Pk, where Pk⊂P is a set of characteristics approximating the corresponding ek, and ∈P★, and let us assume that the code designations having the identical number of bits, the designations of pn on the medium and of in the memory, contain an identical number of code symbols (represented in the form 1 on the medium and in the form in the memory), that is, the code is a “selector” code. Then it is not difficult to show that the characteristics coincide in the instance, and only in the instance, that where M1 and M0 are sets of bits of the code designation for pn which contain 1 and 0 respectively, and and are analogous sets for . In this process, the bits of the medium and the memory which mutually correspond to one another in a one-to-one relation are considered to be identified. Depending upon how one realizes the pairs of states 0 and 1, and and , and the signals θ and I characterizing respectively the absence or presence of intersection of sets M, one obtains different designs for decoders.

OCR for page 823
--> In the case of punched cards with notches on the edges, 1 is taken as the notched cell, is a raised or inserted rod, and θ is the absence of meeting between the rod and the unnotched cell causing the drop or falling out of punched cards corresponding to the sought ek. Characteristics from Pk and P★ are encoded here according to local (in particular, direct) selector code, and the comparison can be done immediately for all characteristics from P★. In the case of superimposed punched cards, 1 is taken as the punched-out cell, is the superimposed card, and θ is the passage of light through the cell, with the absence of an obstacle in that cell on the superimposed punched cards , which passage of light detects the ek corresponding to the open cells. Characteristics from Pk and P★ are coded here usually only according to direct code, although later (1, 9) it also proved possible to use a local selector code. Thus, the pair consisting of rod (light) and moving punched card in the first instance realizes the pair , and in the second instance the pair , in which the above-mentioned duality of the devices considered manifests itself. The devices described do not provide for complete automation of searches or for a considerable reduction in the time required for searching, as compared with the simplest “devices,” although they are more effective than these latter. Therefore, the question arose concerning the development of more nearly perfect parallel-action decoders, and the use of consecutive-action decoders become justified (as distinct from type I tasks). More details about these decoders will be given in the next section, but here the only thing remaining is to recall that with medium numbers of ek (on the order of tens of thousands) consecutive-action decoders do not have a sufficiently high speed of search. This speed, although it does exceed the speed of the devices just described, is still, even for extremely complicated decoders, on the same order as the speed of the simplest “devices” when solving type I tasks. For large numbers of ek this ratio between the speeds of searching becomes still more unfavorable for consecutive-action decoders; their speed of searching must be almost the same, for example, as that of the last of the devices described in this section. Thus, the most suitable device for solving type II tasks must apparently be some kind of improved parallel-action decoder, which is somewhat more complicated than the Amfis system, but is simpler and faster than high-speed consecutive-action decoders. This finding is also supported by the fact that the locating of subsets En1, En2,· · ·,Ena is reduced to the solving, by an already known method, of a type I tasks, and only the finding of the intersection of those sets can cause difficulties. With small numbers of ek it is possible to use as a more nearly perfect decoder various electric (1, 10), electronic, and other decoders similar to inter-

OCR for page 823
--> nal high-speed parallel-action memory devices in present-day numerical computers. As distinct from type I tasks, the problem solved here is the extraction of the sought ek not in the form of a complete description of the information contained in them, but in the form of intermediate number, which situation corresponds more to the possibilities (volume of cells in the memory) of those decoders, but, on the other hand, now it is necessary to have a separate output for each ek, and it is impossible to get by with general output “number lines.” Actually, whereas formerly one answer E★, consisting, as a rule, of a small number of elements of information ek, was prepared for each question, and this answer was produced on the general output lines, now, as a result of the extremely large number of all possible questions, each ek is recorded separately on the “address lines,” and since all the ek from E★ are searched simultaneously, it is no longer possible to lead them out onto the general output lines. In such decoders can be taken as the presence of signals on the input “address lines,” 1 as the presence of a connection of the output line corresponding to a separate ek with an input line by means of a diode, capacitance, resistance, or the like, and θ is the absence of signals on the output line , which absence detects the necessary ek (having taken as 1, and also for the opposite state, we shall come to a still more obvious analogy with punched cards which are notched on the edges). For medium numbers of ek, the described decoders produced have as yet, unfortunately, been too complicated, and this makes it possible in this instance to attempt to improve immediately the system of superimposed punched cards, where the finding of the intersection of sets En1, En2,· · ·, Ena is successfully made automatic, requiring, however, the recording of each set according to a direct code, that is, on each card (or set of cards) cells are set aside for all possible ek. The essence of the improvement proposed lies first of all in the fact that the characteristics pn are now recorded not according to the direct code, when each pn has its own card, but according to a local selector code, which makes it possible to reduce sharply the total number of cards (1, 9)—to reduce them to the number of bits in the code designation of the characteristics from all sections, and secondly, lies in the replacement of the superimposition of punched cards by a slight shift (1) of the always superimposed cards into an examination position similar to that used, for example, in the Card Translator. If each ek is described by no more than one characteristic from each section, as occurs or can be achieved in type II tasks, then after the shifting of cards corresponding to the set of bits , for each characteristic of the question those, and only those, cells which correspond to the necessary ek will allow light to pass through. Since the cards can be shifted automatically, for example, by means of electromagnets, the entire process of finding E★

OCR for page 823
--> proceeds extremely rapidly. With order encoding of characteristics from individual sections it is not necessary first to solve a type I tasks, since they are automatically solved on the decoder itself, and with the use of the position of each open cell on the surface of the card as a direct indication of the corresponding frame of microfilm it is possible automatically to receive on the screen a complete description of the ek found. If in the described device one limits himself only to one section, in which any characteristic is assigned completely, the latter can also be encoded by a binary, but not necessarily a selector, local code, which leads to a still greater decrease in the total number of cards, but which requires, however, two cells for the recording of each ek. Such a simplified device can serve, as was mentioned above, for solving type I tasks. Let us note in conclusion that a still greater decrease in the number of cards can be achieved if one refrains from using just two code symbols; however, in this process, in order to record each ek it is necessary to have a number of cells which is one greater than the number of code symbols, and, instead of the simple shifting of each card, it is necessary to shift it into one of several different positions, the number of which is equal to the number of code symbols. For large numbers of ek (on the order of millions or hundreds of thousands) the use of the device described proves to be a difficult one, since, in order to set aside on the card a cell for each ek, it is necessary to use cards which are too large (the size of the cell is limited from the bottom, in order to provide for the unhindered passage of light through all the punched cards). The replacement of each card by a series of cards leads to a manyfold duplication of the entire device. The way out of this position must apparently be sought (if one does not return again to complex nonmechanical decoders) in rejecting the direct code for recording sets En1, En2,· · ·, Ena, and in recording the answer En for each characteristic pn according to a nonlocal code on a group of punched cards or on a piece of punched tape, photographic film, or the like, in the form of a sequence of code designations of numbers of all ek∈En. Having found consistently the En corresponding to the individual characteristics of the question, by means of the simplest “devices” or a device of the type of the Amfis system we isolate the En which contains the smallest number of ek. Then the numbers ek from that En are introduced into the memory device of a consecutive-action decoder, and the content of each other found En is consecutively examined by that decoder. Those numbers in the memory of the decoder for which the coincidence in examination of each En will be detected will correspond to the sought ek forming the intersection of all the located En. Such a parallel-consecutive method of searches will obviously prove more profitable than the purely consecutive method if the number of ek from all the

OCR for page 823
--> examined En, multiplied by the quotient obtained when the number of ek from the En placed in the memory is divided by the number of numerical designations in that device, is much less than the total number of ek∈E. With a large number of characteristics pn in each section and a not very large number of the sections themselves, such a situation ordinarily occurs. Actually, considering that the characteristics in each section were chosen to be as homogeneous as possible in their concreteness and that approximately identical numbers of ek correspond to each of them, and remembering that each ek is described by no more than one characteristic in each section, we can approximately evaluate the capacity of the En as the quotient obtained when the number of ek∈E is divided by the number pn from the corresponding section. Obviously, the highest degree of automation will be achieved in such a device if the input of En into the memory and the examination of the remaining En are carried out directly from the screen of a system on the type of Amfis, for example, with the aid of an iconoscope. In conclusion let us note that with the presence of a negated characteristic in the question, the methods of solutions remain approximately the same, except that the ek found for the negated characteristic, if it is viewed as nonnegated, are not left in the intersection of the En corresponding to the nonnegated characteristics, but, on the contrary, are rejected. Sometimes such ek are revealed when this intersection is compared with a new intersection obtained when a negated characteristic viewed as a nonnegated one is added. TYPE III TASKS In order to solve these tasks one may use the parallel-action decoders described in the preceding section, only in the instance that the characteristics of pn from the set Pk which describes the corresponding element of information ek, are recorded by direct code, that is, each pn∈P is given its own place in the simplest “devices,” its separate card in a system of superimposed punched cards, its cell on punched cards with notches on the edges, and its input line in electrical decoders. The attempt to use a local code to record the characteristics would lead to a situation in which, not knowing in what field a particular question characteristic could be found, we would have to ask the question by very many methods, the number of which, at best (with a systematized sequence of characteristics), is equal to the number of all possible combinations from among the fields of the local code for the number of question characteristics. With, as a rule the large number of all possible pn the last two decoders just listed cannot be considered practically feasible. Thus, the only methods remaining are the first two, relatively nonautomatic methods, and of the improved decoders, only the last of those described in the preceding section.

OCR for page 823
--> In order to make use of the remaining parallel-action decoders it is necessary to record the characteristics from Pk according to a superpositional code. As was already mentioned earlier, it is possible during this process, however, for superfluous, unnecessary ek to appear during the searches. Without considering here that possibility, which has been worked out in detail by many researchers, let us change over to a brief description of consecutive-action decoders. Although, in conformity with what has previously been set forth, they can be considered to be satisfactory only for medium numbers of ek, as a result of the absence at the present time of the last of the improved parallel-action decoders described in the previous section, they are the only devices for the automatic solution of type III tasks (without the use of superpositional code). The basic shortcoming of a consecutive-action decoder is the necessity of a very rapid, consecutive examination of the content of each ek. Sectors of the medium which contain the code designations of the characteristics from Pk which are recorded by a nonlocal code are consecutively passed or switched up to the scanning device of the decoder. Then each pn from the Pk is compared with all the from the set of question characteristics P★. If the P★ is included in the Pk (P★⊂Pk), the corresponding ek is selected. If the medium is a discrete one (punched cards, cards made out of film, or the like), it is possible to make a consecutive comparison first with one ∈P★, then to compare the ek selected with another ∈P★, etc. Taking into consideration the fact, which was already mentioned above, that usually only a small number of all the ek∈E proves to be approximated by a certain characteristic, it may be considered that the number of repeatedly examined ek will decrease rapidly, that is, it is completely admissible, with a discrete medium, to use the simplest decoders with a memory only for one question characteristic in order to solve type III tasks (an example could be the Samain (12) selector—the Filmorex system). For a continuous medium it would be necessary by such a method to rerecord the content of the ek selected during each examination. Therefore, the comparison of each pn∈Pk is carried out immediately with all ∈P★. The decoder required for such a comparison is analogous to the electric, electronic, or other decoder described in the preceding section, in which decoder the role of the question characteristic is played by an individual characteristic pn∈ Pk, and the role of the code designations of the pn is played by the code designations of all ∈P★ which are recorded on the input lines with the aid of diodes, resistances, etc. Thus for 0 one can here take the presence of a signal on the input line, for the presence of connection of the output line of the corresponding individual to the input line with the aid of a diode, resistance, etc. (or directly in the case of a single in the question), and θ is the absence

OCR for page 823
--> of a signal on the output line , which absence detects the coincidence of the corresponding with the pn being examined. If, after the examination of a certain ek, the coincidence will be recorded for all ∈P★, then this ek is selected or its content is re-recorded. Since the number of in the question is usually not large, the decoder proves to be extremely simple, although it does require a special detecting element for each . The outputs of the detecting elements can be joined by means of switches in such a way as to detect not only the value of the conjunction of the nonnegated characteristics, but also the value of an arbitrary Boolean function of the question characteristics (1, 2, 6, 13, 14), that is, the ek can be selected during one examination even for a nonelementary question (in particular it is possible to search simultaneously for many questions if the decoder has a sufficiently large memory). Depending upon the type of medium, the signals 0 can be fed to the input lines of the decoder or directly from the brushes through the holes in the punched card or punched tape, or, after amplification, from the heads picking the magnetized spots of the magnetic tape or drum, or from photocells reacting to the transparent spots of photographic film or photodisc, etc. Other designs of decoders are also possible, provided they are of such a high speed that they do not limit the speed of examination of the ek. Since they contain a small number of , a certain complication of them (both conditions, and , are checked) is also possible, and this complication makes it possible to record characteristics pn and pn according to a binary, but not selector, code. The medium with the best prospects is apparently photographic film (frames of film, photodiscs), not only as a result of its high degree of resolution, but also because of the possibility of recording and extracting complete descriptions of ek, instead of just intermediate numbers, which would require an additional solution of a type I task. Thus, with medium numbers of ek the devices that can be considered the best adapted for the automatic solution of type III tasks are consecutive-action decoders or an improved decoder with superimposed punched cards and with the use of a superpositional code, but with large numbers of ek the best device is the improved parallel-consecutive-action device which was described at the end of the preceding section. TYPE IV TASKS Generally speaking, only consecutive-action decoders can be employed for the solution of these tasks, since the synthetic relations do not characterize directly the ek themselves, but are built up on characteristics from ek. Parallel-action decoders could be employed here only for the preliminary rough selection of ek according to which are assigned, as in a type III task, without any

OCR for page 823
--> indication of the relations. Although the number of ek found during this process proves to be less than the number of all ek∈E, nevertheless it is much greater than the number of the ek sought, or otherwise there would be no reason to show the relations at all. For medium numbers of ek the automatic solution of a type III task (with the aid of a parallel-action device) produces, as we have seen, only the numbers of these ek, and therefore the locating by numerical designations of a rather large number of complete descriptions of these ek, as in type I tasks, can occupy too much time for subsequent searches taking the relations into consideration. Consecutive-action decoders used to solve type IV tasks differ from those mentioned in the preceding section only by the presence in the memory of additional detecting elements, switches, and counters for comparing the relations. Taking into consideration the certain lack of definiteness and lack of study of this type of tasks, it is for the time being difficult to give a general description of the requirements made upon the composition of such memory elements. For example, with very complicated relations it may be necessary to have a whole arsenal of facilities existing in modern numerical computers (15). With the existence of only the simplest relations of the grouping type (the uniting of some characteristics from Pk into groups which are then viewed as independent characteristics), it is sufficient to combine the outputs of the detecting elements corresponding to the characteristics grouped in the question, through the switch “and” and then to feed the output from this switch to an additional element which detects the inclusion of that group into a grouping of characteristics from Pk after the examination of that latter grouping, which situation is signalized by means of special symbols, “brackets”—the code symbols are recorded on the medium in definite places at the beginning and end of the grouping. Devices possessing such possibilities are, for example, the Experimental Information Machine (1, 2, 6) of the Institute of Scientific Information, Academy of Sciences USSR (EIM), the WRU selector (13), and the planned Kodak Minicard system (14). In the event that the characteristics from P are too general, as can occur with the presence of a complicated system of relations, it sometimes proves possible to have very many methods of establishing the mutually one-to-one correspondence between ∈P★ and the pn∈Pk coinciding with them. Since only for certain of these methods can the relations between be generated by relations existing between the characteristics from Pk which correspond to those , it is necessary to analyze all the possible instances of establishment of this mutual one-to-one correspondence. For this purpose it may be necessary to have a repeated reproduction of the content of each ek. In order to be spared the repeated passage of a moving medium, in this instance it is necessary to rerecord the content of the ek on an extremely high-speed auxiliary operational

OCR for page 823
--> memory (16), and also to set aside a portion of this device for recording the intermediate results of the analysis and the program of its execution. Thus, the solution of type IV tasks in general requires the application of modern universal or even special digital computers. It must, however, be noted that at the present time there exist only a very few information tasks in which the approximation of ek and the questions could be carried out on a one-to-one basis with such great accuracy and complexity as to require the solution of those tasks by means of such highly improved devices. REFERENCES 1. V.P.CHERENIN, Nekotorye problemy dokumentatsii i mekhanizatsiya informatsionnykh poiskov [Certain Problems of Documentation and Mechanization of Information Searches], Moscow, 1955. 2. B.M.RAKOV, V.P.CHERENIN, Eksperimental’naya informatsionnaya mashina Instituta nauchnoy informatsii AN SSSR [Experimental Information Machine of the Institute of Scientific Information, Academy of Sciences USSR], Moscow, 1955 (there also is a publication in the English language). 3. J.W.PERRY, A.KENT, M.M.BERRY, Machine Literature Searching, Interscience, New York-London, 1956. 4. W.C.ADAIR, Citation Indexes for Scientific Literature? American Documentation, 6, [1], 1955. 5. B.C.VICKERY, Developments in Subject Indexing, The Journal of Documentation, 11, [1], 1955. 6. B.M.RAKOV, V.P.CHERENIN, Byulleten YUNESKO dlya bibliotek; Machines for Retrieving Information in the USSR, Unesco Library Bulletin, 11, [8–9], 1957. (Also published in English, French, and Spanish languages; the article was reprinted in the German language in the journal Nachrichten fur Dokumentation, 9 [1], 1958. 7. E.A.AVAKIAN, E.GARFIELD, AMFIS—The Automatic Microfilm Information System, Special Libraries, April, 1957. 8. M.TAUBE and ASSOCIATES, Studies in Coordinate Indexing, Vol. I, 1953. 9. Same, Vol. II, 1954. 10. Same, Vol. III, 1956. 11. R.S.CASEY, J.W.PERRY, Punched Cards—Their Applications to Science and Industry, Reinhold, New York, 1951. 12. J.SAMAIN, Filmorex—Une nouvelle technique de classement et de selection des documents et des informations, Paris, 1952. 13. J.W.PERRY and ALLEN KENT, Das lesende Selektiongerät der Western Reserve University, Nachrichten für Dokumentation, No. 2, 1957. 14. A.W.TYLER, W.L.MYERS, J.W.KUIPERS, The Application of the Kodak Minicard System to Problems of Documentation, American Documentation, 6, [1], 1955. 15. A.OPLER, T.R.NORTON, New Speed to Structural Searches, Chemical and Engineering News, June 4, 1956. 16. W.H.T.DAVISON and M.GORDON, Sorting for Chemical Groups Using Gordon-Kendall-Davison Ciphers, American Documentation, 8, [3], 1957.

OCR for page 823
--> This page intentionally left blank.