The Evaluation of Systems Used in Information Retrieval

CYRIL CLEVERDON
The College of Aeronautics, Cranfield, England

Recent years have seen a two-pronged attack on the problems caused by the immense growth in the amount of recorded information, the greater complexity of the subject matter, and the increasing interrelationship between subjects. On the one hand, there have been many attempts to devise new indexing systems which will be an improvement on the conventional methods; on the other, a great deal of work has been done in developing the mechanics which can be used, from the simpler kinds of hand-sorted punched cards to high-speed computing machines. Several theoretical evaluations have been made of the various systems, but the position has now been reached where it is necessary to make a practical assessment of the merits and demerits of information retrieval systems. A project which will attempt to do this has been started under the direction of the author, with the aid of a grant from the National Science Foundation to the Association of Special Libraries and Information Bureaux (Aslib).

Previous work undertaken by Thorne and the author (1) was useful chiefly in making apparent the main factors that had to be taken into account. It became obvious that the only practicable method of comparing various systems is on the basis of their economic efficiency. Any system can, if economic aspects are disregarded, reach a high level of retrieval efficiency, even if it involves looking at the majority of individual documents in the collection; the important matter is to find which system will give the required level of efficiency at the lowest cost. It is useless to attempt to compare any two established indexes unless one also has reliable data concerning their compilation costs.

There are three main items to be considered in the costs of information retrieval: the cost of indexing, the cost of equipment used, and the cost of retrieval. The indexing cost is influenced by the salary paid to the indexer and the average time spent in indexing. Included in the cost of equipment are all the charges involved from the time when the indexer makes his decision until the stage where the record has been entered and put into the form which
permits another person to make a search, whether this has been done by typing entries onto catalogue cards, punching holes in manually or machine-sorted cards, or any other method that may be used. The retrieval cost must cover not only the time costs involved in searching the index, but also the time cost involved in making physically available the required document or documents.

These three aspects are closely interrelated, and decisions taken regarding one point will affect the others. It must, however, be emphasised that they are not always entirely dependent on each other. For instance, much of the early criticism of the Uniterm system of coordinate indexing was not directed against the system itself, but was based on the difficulty of comparing long lists of numbers. Different and improved methods of recording the indexing decisions have invalidated all such criticisms, and the basic merits or demerits of the Uniterm system, as an indexing system, are the same whether the mechanics used involve a search time of 30 minutes or 30 seconds. This project is, therefore, concerned only with ascertaining the efficiency of the systems as such. Obviously it will be necessary to use catalogue cards or some other method, but it is intended to separate this aspect. Certainly there is no intention of carrying out any comparative tests on the mechanics of information retrieval. Much work of this nature has already been done, further work is planned by other organisations, and it should be simple to build the results of such work into the results from this project.

The indexing programme

In brief, the programme envisages the indexing of 18,000 aeronautical papers by four different systems; three indexers will be employed, and strict time controls will be maintained. In devising the programme for the project, certain arbitrary decisions had to be taken, the first of which obviously concerned the systems to be tested.
The four systems selected were:

(1) the Universal Decimal Classification;
(2) an alphabetical subject catalogue, based on the Special Libraries Association "Subject headings for aeronautical engineering libraries";
(3) a special faceted classification;
(4) a coordinate system based on Uniterm.

Other systems were considered, but it is felt that these are reasonably representative of present practice: two systems are long established, two are recent innovations; two are classification systems, while the other two are not; two are conventionally used for collections covering all subjects, while the other two are intended more for specialist collections. The results will be strictly valid only for the systems used, but it is thought that it should be possible at a later stage to test other systems without necessarily
having to duplicate all the work put into this project. It could also be argued that the results will be valid only for the subject field used, but here again the expectation is that it will be possible to test in other subject fields with less effort.

The field of aeronautics has been selected for this project, mainly because of the availability of documents in the author's library, but it has certain other attractions. Not only does it cover a broad field of knowledge, but it also has aspects of great specificity. Subjects in the broad field would include, for example, mathematics, mechanics, fluid dynamics and aerodynamics, heat, light, sound, electricity, theory of structures, chemistry and meteorology among the sciences, as well as electrical, military, mechanical and production engineering, metallurgy and fuel technology. Half the documents to be indexed will cover this broad field, while the other half will be restricted to papers dealing with high-speed aerodynamics (Mach number > 0.8). This intense concentration in one detailed subject area, combined with the general coverage of a wide range of subjects, should show the varying capabilities of special and universal systems.

The ability of the indexer is of paramount importance, yet very little attention has been paid to the problems of actual indexing. Whereas one very vocal school of thought insists that technical indexing can be done only by a person with technical qualifications, others argue that this is untrue, and that in any case it would be wasteful to employ technical persons on such work. However, no results are known of tests designed to compare the average indexing ability of different types of persons.
Potential indexers for this project might be described as falling within the following broad groups:

(A) technical knowledge of the subject plus indexing experience;
(B) technical knowledge of the subject but no indexing experience;
(C) indexing experience in the subject field;
(D) theoretical knowledge of indexing;
(E) neither theoretical nor practical knowledge of the subject or of indexing.

In deciding the types of persons to recruit, it was considered that there were very few persons with the qualifications outlined in (A), and that such individuals would not be likely to join the project. The significance, in certain circumstances, of using an individual such as outlined in (E) was appreciated, but it was decided not to use this category in the project. Three indexers, therefore, have been recruited to be representative of groups (B), (C), and (D). The first man, with several years' industrial experience, has taken a 2-year postgraduate course and received his diploma for a thesis in aerodynamics. The second man has been a Fellow of the Library Association since 1952 and for four years has been Librarian of a large aircraft firm. The third man has recently passed the examinations of the Library Association in classification and cataloguing, but has no practical experience of indexing technical literature. This variation of
experience will enable a number of useful comparisons to be made, and it will have its effects on the economic costs, since the salaries one might expect to pay (in the United Kingdom) would be in the ratio 5:4:3.

A main policy decision that always has to be made is the depth of indexing and, other things being equal, this will affect the time spent in indexing. Few reliable figures have been given for current practices, although a particularly high figure is the average of 1 1/2 hours quoted (2) for indexing reports for the catalogue of aerodynamic data prepared by the Nationaal Luchtvaartlaboratorium in Holland. It appears from personal discussions that an average of 20 minutes for a general collection of technical reports is the top limit, and this has been taken as the maximum indexing time to be used in the project. The staff will always index the documents as fully as possible within the set time limits, which will be, for various groups of documents, 20 minutes, 15 minutes, 10 minutes, 5 minutes, and 2 1/2 minutes. It is necessary both to index sufficient documents so that retrieval is not too simple, and also to continue the indexing long enough to give a comparison of the indexers' rates of learning with the different systems. For financial reasons, it is undesirable to continue an experiment beyond the stage where there is no significant change in the test results. It is believed that the indexing of 18,000 documents will meet these requirements.

As will be seen, in this one programme we are attempting to evaluate three variables, namely the system, the type of indexer, and the indexing time. With four systems, three indexers and five indexing times, the number of permutations is sixty. The documents to be indexed will be divided into groups of one hundred (referred to from here on as "document groups"), and therefore 6000 documents will be indexed before the same indexing conditions are repeated.
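The combinatorics of this design can be checked with a short sketch (the system and indexer labels are illustrative assumptions, not the project's own notation):

```python
from itertools import product

# The four systems, three indexers, and five time limits described above.
systems = ["UDC", "Alphabetical", "Faceted", "Uniterm"]
indexers = ["B: technical", "C: experienced librarian", "D: trained librarian"]
time_limits = [20, 15, 10, 5, 2.5]  # average minutes per document

# Every combination of indexer, time limit, and system is one indexing condition.
conditions = list(product(indexers, time_limits, systems))
print(len(conditions))              # 60 permutations, as stated

# One document group of 100 papers per condition gives 6000 documents before
# any condition repeats; three repetitions of the cycle give the full 18,000.
print(len(conditions) * 100)        # 6000
print(len(conditions) * 100 * 3)    # 18000
```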
By that time, however, all the indexers will presumably have become more adept in using the various systems. The procedure will be that the first indexer will index documents 1–100 by system A, allowing himself an average time of 20 minutes for each document. Immediately after indexing a document by system A, he will allocate the appropriate headings or classification numbers for systems B, C, and D, but this will be done without any time control. Documents 101–200 will then be indexed by system B with the 20-minute allowance for each document, followed by the postings for systems A, C, and D. Documents 201–300 and 301–400 will be similarly indexed by systems C and D. This procedure will be repeated for documents 401–500, 501–600, 601–700, and 701–800, except that for these document groups the average indexing time will be limited to 15 minutes. For documents 801–1200, the time will be limited to 10 minutes; for 1201–1600 it will be 5 minutes; and finally for documents 1601–2000 the indexing time will be limited to 2 1/2 minutes per document. Meanwhile the second indexer will carry out a similar procedure with documents 2001–4000, and the third indexer will do the same with documents 4001–6000. The indexing of documents 6001–12,000 will repeat the conditions used for documents 1–6000, and the whole stage will be repeated again for documents 12,001–18,000.

The types of documents within each group will be kept as nearly as possible similar to those in any other group. The first requirement is that one-half of the documents should deal with the specialised subject field of high-speed aerodynamics, while the remainder range over the broader subject fields. However, we think that the project may show up some interesting facts concerning the comparative "indexibility" of various documents. There are some people who believe, for instance, that American reports present more indexing problems than British reports, but that there is no difference between articles in American and British periodicals. A typical document group, therefore, is made up as follows:

Research reports
    National Advisory Committee for Aeronautics    25 papers
    U.S.A. industrial organisations                 5 papers
    Royal Aircraft Establishment                   20 papers
    British industrial organisations                5 papers

Periodicals
    Journal of Aeronautical Sciences               11 articles
    Royal Aeronautical Society Journal              7 articles
    Jet Propulsion                                  8 articles
    Aircraft Engineering                            6 articles
    Aircraft Production                             5 articles
    Communication and Electronics                   2 articles
    Interavia                                       2 articles
    Product Engineering                             2 articles
    Metal Progress                                  3 articles
    Quarterly of Applied Mathematics                4 articles

In order to make some assessment of the ability of the indexers under controlled conditions, it is intended that at various stages in the project a selected list will be made of documents that have been indexed. This list will be sent to a number of organisations and individuals who have expressed their willingness to cooperate, and they will be invited to index these documents by one of the four systems.
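The rotation of systems and time limits across one indexer's 2000 documents can be sketched as follows (a reading of the schedule above, with assumed single-letter system labels):

```python
# Sketch of one indexer's schedule: systems rotate every 100 documents,
# and the time limit drops after each block of four document groups.
systems = ["A", "B", "C", "D"]
time_limits = [20, 15, 10, 5, 2.5]   # average minutes per document

schedule = []                         # (first_doc, last_doc, system, minutes)
doc = 1
for minutes in time_limits:
    for system in systems:            # one 100-document group per system
        schedule.append((doc, doc + 99, system, minutes))
        doc += 100

print(schedule[0])    # (1, 100, 'A', 20)
print(schedule[4])    # (401, 500, 'A', 15)
print(schedule[-1])   # (1901, 2000, 'D', 2.5)
```

The second and third indexers repeat the same pattern over documents 2001–4000 and 4001–6000, and the whole cycle runs three times.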
The individuals or organisations concerned will either have technical knowledge of the subject field, or be specialists in the system in which they index, or combine both attributes. Since no time restriction
will be placed on this indexing work, it may reasonably be assumed that the indexing should be correct (if such a thing is possible). Entries for this indexing will be made in the various catalogues, with an identification mark, so that it will be possible to find in the testing whether these entries give better retrieval than the entries of the project's indexers.

The test programme

Although the main testing will not be done until the indexing work has been completed towards the end of 1959, it is hoped that some preliminary testing will be possible in time to present the results at the International Conference. The first point to be emphasized is that, when completed, the four catalogues will be permanent tools which can be used as often as required and for a number of different tests. If it is felt that one series of tests fails to produce the type of result required, then it will be perfectly possible to make any further tests that can be devised. Complete testing is, however, certain to be a long process, and it is desirable that as many factors as possible should be covered in a single series of tests. Five variables have been introduced into the indexing, namely the indexer, the system, the indexing time, the experience of the indexer, and the size of the collection. All these are considered to be of importance, and the tests must show the differences. In the testing we can introduce at least two new variables, the first of which concerns the types of questions put to the index.
These may range through all grades of specificity, from the (comparatively) broad question such as "safety considerations in aircraft design," through the narrower question such as "shear stresses in oblique plates," down to the most specific type of question for which there can be only one correct answer, such as "hinge moments of a horizontal tail, 45° swept-back plan form, of aspect ratio 2 and taper ratio 0.5."

The second variable concerns the types of persons who are physically attempting to retrieve information, and these can be summarised as follows:

(1) the originator of the enquiry, whose knowledge of the indexing system might be either reasonably good or non-existent;
(2) the technical indexer;
(3) the librarian indexer;
(4) a technical man who did not originate the enquiry;
(5) a librarian who had not been engaged on the indexing;
(6) any combination of groups 1–5.

The testing is based on the procedure of putting the same questions to the different indexes and comparing the resulting answers, but within this general statement there lies a number of possible variations. At this stage it is not possible to say what extent of testing will be necessary. Each document
group will be a unique collection and will therefore have to be tested as such. It will be necessary to put to each document group a sufficient number of questions to obtain an average result which is reasonably valid. The necessary figure is not known for certain, but it appears unlikely to be lower than ten, and on this basis it would be necessary to ask 1800 questions to test the whole collection of 18,000 documents. Preliminary tests will be made to ascertain this accurately before the major test programme starts, but for the purpose of further discussion in this paper it will be presumed that ten is the figure decided on.

In theory, the questions may be either genuine or faked; that is to say, they could be either questions which were raised in the normal course of the work of an aeronautical establishment, or questions devised for the purpose of testing the project indexes. The former method was used in the Astia-Uniterm tests (3), and this method carries with it the corollary that there may or may not be in the index an answer to any particular question. The Astia-Uniterm tests were inconclusive because agreement could not be reached as to which answers were correct and which were irrelevant. Since no one has suggested any way in which this problem can be overcome, the first series of tests will be done on the principle of putting questions which are based on documents known to be in the indexes. A sufficient number of people, not directly associated with the project, will be asked to compile questions based on documents in the collection. If, as is hoped, thirty people cooperate, each person would be asked to compile a total of 60 questions. They would have complete freedom of choice within selected document groups, but would be asked to vary their questions over the three grades of specificity mentioned earlier.
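The arithmetic behind these figures is simple enough to set out (a sketch; ten questions per group is the provisional figure assumed above):

```python
# Provisional question budget for the full test programme.
documents = 18_000
group_size = 100
questions_per_group = 10     # provisional; preliminary tests will confirm it

document_groups = documents // group_size       # 180 unique collections
questions_needed = document_groups * questions_per_group
print(questions_needed)                         # 1800 questions in all

volunteers = 30
print(questions_needed // volunteers)           # 60 questions per volunteer
```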
This procedure ensures that there is always at least one right answer to each question, and it may be presumed that the system which produces the most "right" answers in given circumstances will be shown to be the most efficient. This is not, however, certain, and deeper analysis will be necessary to discover the relative economic efficiency of the systems, since other factors have to be taken into account. As stated earlier, the retrieval cost must include not only the time costs of searching the index but also the time costs involved in making physically available the required document. This latter item is a figure which will vary with each individual organisation and will have to be built into the results. An example of possible answers will explain this point. Four systems A, B, C, and D, when asked the same very specific question, produce answers as follows:

System A produces document card 1 in 5 minutes
System B produces document cards 1, 2, 3, 4, 5 in 3 minutes
System C produces document card 2 in 5 minutes
System D produces document cards 2, 3, 4, 5, 6, 7, 8 in 4 minutes

Document card 1 refers to the document on which the question was based and is therefore the "right" answer. System A can thus be debited with a search time of (5 + x) minutes, x being the time taken to fetch the required document. System B produces references to five documents, but it is not possible to say which is the document giving the required information; the documents therefore have to be fetched and scanned. On average, it may be presumed that the correct document will be found halfway through such a search, and therefore this system produces the right answer, but it is debited with a search time of (3 + 3x + 3y) minutes, where y is the time taken to scan a document. System C fails in a time of (5 + x + y) minutes, and system D fails in a time of (4 + 7x + 7y) minutes.

The value of x will vary in different organisations. When the time taken to fetch a document is one minute, the difference between system A and system B will be small, but should x be as high as five minutes, then the total search time would be only 10 minutes for system A as against a minimum of 20 minutes for system B.

The result can be considered differently when the question is of a general nature. Although there will always be the "right" answer, there may also be other answers which would, in practice, be equally useful to the enquirer, and some extra allowance should be made for systems which produce these correct answers. The main difficulty in doing this is that it brings one back to the rock on which the Astia-Uniterm tests foundered, namely the personal interpretation of what is or is not a relevant answer. The only way to avoid this dilemma is for the main emphasis always to be placed on the "right" answer.
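The debiting scheme in this example can be expressed as a small function (a sketch of one reading of the worked example: a single returned card is accepted without scanning, a hit among several cards is assumed to be found halfway through them, and a miss costs a fetch and a scan for every card):

```python
import math

def debit_time(search_min, n_cards, hit, x, y):
    """Total time debited to a system for one question.
    search_min: time spent searching the index; n_cards: document cards produced;
    hit: whether the 'right' document is among them;
    x: minutes to fetch one document; y: minutes to scan one."""
    if hit:
        if n_cards == 1:
            return search_min + x                  # single card taken on trust
        found_at = math.ceil(n_cards / 2)          # found halfway through the pile
        return search_min + found_at * (x + y)
    return search_min + n_cards * (x + y)          # every candidate fetched and rejected

# The four systems above, with x = y = 1 minute:
x = y = 1
print(debit_time(5, 1, True, x, y))    # system A: 5 + x       -> 6
print(debit_time(3, 5, True, x, y))    # system B: 3 + 3x + 3y -> 9
print(debit_time(5, 1, False, x, y))   # system C: 5 + x + y   -> 7
print(debit_time(4, 7, False, x, y))   # system D: 4 + 7x + 7y -> 18
```

Raising x to five minutes in this sketch reproduces the point made in the text: system A's total rises only to 10 minutes, while system B's rises to 3 + 15 + 3y.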
However, as Fairthorne states (4), "search for objects known to exist, and for those that may not exist, cannot be done efficiently in the same way." This is a theory that it might be possible to test by putting to the indexes a sample of genuine questions.

Analysis of tests

With so many variables introduced into the indexing and the retrieving, the results will give many differing answers, and it is unlikely, to say the least, that one system will be preeminent irrespective of the conditions. The test results will be a set of objective statements showing what has been achieved under the differing conditions, and it will be for each organisation to
consider the results in the light of its own requirements and select those conditions which are the most favourable.

The first basic question that has to be decided is the minimum acceptable retrieval efficiency, and it is often too readily assumed that this must approach 100% as closely as possible. To strive for such a figure would appear reasonable only if the collection indexed also approaches 100% of the available pertinent information on the subject. With very few exceptions, such as patent offices, which by the nature of their work will have a complete file of a certain type of recorded material, the large majority of organisations concerned with science or technology do not have in their collections more than a small fraction of the potentially available information on all aspects of the organisation's work. Within strictly limited subject fields, it is possible that a librarian might manage to collect 50% of the potentially available documents, but it will require extensive and prolonged searching to raise the figure above 80%.

All special librarians are continually faced with the problem of marginal information; it arises in the acquisition of books, periodicals, or reports, and is equally difficult in the decision as to whether an incoming document should be indexed, ignored, or passed to the waste-paper basket. Such decisions must in the first place be based on a broad policy, even though there will be differing individual interpretations. One major organisation bases its policy on the externals of the documents it receives, discarding, for instance, all interim reports, preprints, or reprints, and anything written in a foreign language; as a result it eliminates all but 35,000 of the 250,000 documents received annually. More often the decision is based on subject content, and while in certain fields published abstract journals will be relied on, there will inevitably be exceptions.
An aeronautical organisation might decide to ignore, for indexing purposes, papers dealing with materials, relying instead on Metals Review, Crerar Metal Abstracts, the Titanium Abstract Bulletin, etc. Exceptions would have to be made for report series which, perhaps for security reasons, could not be included in published abstracts, and it might also be considered desirable to include in the organisational index all articles dealing with a subject of particular local importance, such as fatigue of certain metals. Over many subject fields of possible interest to an organisation, therefore, an economic decision has to be made as to whether indexing is justified. Too often the argument is put that failure to find information locally might result in a major cost to the organisation; the argument should really be between the cost of finding it with an internal index and the cost of finding it by other means.

Another local factor that can influence the interpretation of the test results is the frequency of use of the catalogue and its relationship to the number of
documents indexed. I know of no figures which purport to show what this relationship should be, or even what it actually is in any particular case, but in my own library it appears to be in the ratio of 1:5. This is roughly calculated by comparing the known number of items annually indexed (8000) with the estimated total of successful uses of the catalogue. Sample checks have shown that about 1300 successful uses of the catalogue each year result in the issue of approximately 2100 reports. A number of these issues duplicate previous successful uses of the catalogue, and this number is believed to be about 500. This leaves 1600 items used of the 8000 actually indexed. The clear implication is that each use of the catalogue must be debited with the indexing costs of five documents, which illustrates the importance of the relation between indexing time and retrieval time. If the ratio is 1 to 5, as in this example, it would pay to decrease the indexing time by five minutes a document so long as retrieval time was not increased by more than twenty-five minutes.

Yet another factor to be considered is the average potential value of information which is retrieved or, alternatively, the average potential cost of failure to find information. It will be quite practical to work out and publish hypothetical examples covering these various aspects, but in the end each organisation will have to make decisions based on its own circumstances.

The mechanics aspect

As emphasized earlier, the project is concerned with the system rather than the mechanics which might be used, and it was stated that the costs of the latter could readily be built into the results. This decision was taken partly because preparing different types of index would greatly increase the project cost, but also because of the work already done and other investigations now in progress, in particular at the Forschungsinstitut für Rationalisation in Aachen.
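The ratio calculation, and the trade-off it implies, can be laid out as follows (the figures are those quoted above for the author's library):

```python
# Use-to-indexing ratio for the author's library, from the figures above.
items_indexed = 8000          # items indexed annually
successful_uses = 1300        # annual successful uses of the catalogue (context)
reports_issued = 2100         # reports issued as a result of those uses
repeat_issues = 500           # issues duplicating earlier successful uses

distinct_items_used = reports_issued - repeat_issues
print(distinct_items_used)                    # 1600 of the 8000 indexed

ratio = items_indexed // distinct_items_used  # 5: each catalogue use carries
print(ratio)                                  # the indexing cost of five documents

# Cutting indexing by 5 minutes per document therefore pays so long as
# retrieval time rises by less than 5 * 5 = 25 minutes per catalogue use.
indexing_saving_per_doc = 5
print(indexing_saving_per_doc * ratio)        # 25 minutes of tolerable extra search
```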
We were, however, faced with the problem of deciding on the mechanics to be used for the various systems, and how they could be adapted to yield further information. For the U.D.C. catalogue, the alphabetical subject catalogue, and the faceted classification catalogue, we shall use conventional 5 by 3-inch cards. Most organisations issuing aeronautical reports include with each document catalogue cards showing the subject headings or classification numbers for the report. An analysis of these cards shows that on average about four placings are given to each report, while fewer than 2% are given more than eight placings. Basically, we wish to test the two conventional systems (U.D.C. and the alphabetical subject catalogue) in the way in which they are conventionally used, and therefore it would be reasonable to confine our entries to the general average given
above. Such a restriction, however, might well invalidate some of our investigations into the most economic indexing time, yet to give the indexers carte blanche to call for a very large number of entries might again upset the economic calculations. No practical investigations appear to have been done which support this conventional figure of about four entries for an aeronautical report; presumably the practice has arisen from a general feeling among indexers that such a number of entries is adequate and that further entries would not compensate for the extra costs involved. We may be able to find whether there is any justification for these beliefs. For this purpose, the indexers will be instructed to enter on the master cards all subject headings or class numbers which they consider in any way suitable, but they will place a mark against those which are considered essential and which would be allocated in normal practice. Such entries will be identified in the card catalogues, and in the testing it will be possible to find whether the additional entries are justified by improved retrieval, or whether they clutter up the catalogue and are positively harmful in increasing the number of irrelevant answers. The coordinate system should present no problems in this connection, nor should the faceted classification, although in the latter case there will be the necessity of compiling a chain index.

Conclusion

The project is an attempt to make a practical contribution to solving some of the problems of information retrieval. The major difficulty in evolving the programme has been that at present absolutely nothing can be taken for granted; there is no single fact which can be demonstrably shown to be true, and no theory put forward by one expert which is not refuted by another.
Experience in other fields of knowledge shows that most worthwhile scientific achievements result from long series of tests and rarely from a single experiment, and that repeated testing is necessary to determine reliably the nature and relative effects of all relevant variables. The experiment in which the researcher knows for certain that all the variables have been identified, and whether each will significantly affect the results, is the exception rather than the rule. It is most unlikely that this project is "the exception." We have, however, included in the programme at least seven controlled variables, so that the tests may fairly be described as a "series of tests," and we may reasonably hope that the results will materially advance our present knowledge of what is and is not important in information retrieval. We also believe that even when all possible
testing has been done, the catalogues and indexes will remain of value as a yardstick by comparison with which the testing of other systems, and in other subject fields, will be a comparatively simple process.

REFERENCES

1. C. W. Cleverdon and R. G. Thorne, A Brief Experiment with the Uniterm System of Co-ordinate Indexing for the Cataloguing of Structural Data (R.A.E. Library Memo 7), 1954.
2. K. Staples and M. Shaw, The N.L.L. Card Catalogue of Aerodynamic Measurements (R.A.E. Library Memo 17), 1954.
3. D. E. Gray and A.R.C. Associates, Report on Reference Tests of Conventional and Uniterm Systems. ASTIA Reference Center, Library of Congress, 1954.
4. R. A. Fairthorne, Information theory and clerical systems. Journal of Documentation, 9, 113 (1953).