Proceedings of the International Conference on Scientific Information -- Two Volumes (1959)

"Subject-Word Letter Frequencies with Applications to Superimposed Coding." Proceedings of the International Conference on Scientific Information -- Two Volumes. Washington, DC: The National Academies Press, 1959.


Subject-Word Letter Frequencies with Applications to Superimposed Coding

HERBERT OHLMAN

ABSTRACT. The frequencies of occurrence of English letters in the first five positions of subject words and proper names are determined. With these frequencies a superimposed code is designed. No code book is required. Coding space is utilized almost as economically as with a random code. An empirical check is made. A quantitative measure of word popularity is proposed using letter-frequency data.

Coding, or the transforming of information from one guise to another, is one of man’s commonest activities. Every picture may be said to be a coding of some real scene and every written word a coding of some utterance—the brain itself is said to work with coded impulses.

Since the beginning of mass communications, starting with the invention of printing and increasing with the widespread use of electronics, efficient use of existing space and time has become more and more important. Today, information theory provides a sound basis for determining the limits of transmission speed and accuracy. However, Shannon’s theory (1) does not tell us how to make a particular code more efficient. The design of codes is still an art; this paper deals with improving one particular type, superimposed coding.

In information searching, mechanical aids are being used wherever possible. For a machine to process information, the information must be coded, usually into some variant of that most basic code of all, the binary. However, the most efficient code for pure selection appears to be a superimposed random code. Each coding position is used in a random manner, and a group of coding positions contains superimposed entries.

Calvin Mooers (2, 4) calls such coding “Zatocoding” and has applied it in his patented marginal-punched card system called Zator. However, Zatocoding requires an intermediate step in both coding and searching: a code book containing a number of indexing terms with random-number equivalents. Carl Wise (3, 4) has produced a nonrandom superimposed code which he calls “word coding” for use with marginal-punched cards of the Keysort variety. This type of code does not require an intermediate code book.

HERBERT OHLMAN System Development Corp., Santa Monica, Calif.

FIGURE 1. [Figure not reproduced.]

The author has attempted to combine the best features of both systems in coding English words by essentially pre-randomizing the alphabet. This is possible because there is a certain invariance of the letter frequencies within each letter position of a word.

As this system was developed in response to a specific need, it may be well to talk about it in concrete terms, and later apply its principles to other information systems. A marginal-punched card produced on IBM equipment (5) was used as the unit record. The thirty-eight positions along the top edge could code 38 words or phrases by using a direct code, but by superimposition every position could be made to do multiple duty. However, neither of the two systems previously described seemed to meet the requirement of a directly interpretable, yet efficient code.

Subject-word and proper-name lists were studied to find what letter frequencies occurred in the first five letter positions. Some work along these lines had been done, notably by Geisler for the ASM-SLA (6) with proper names, and by Krieger (7) with subject words (however, Krieger only considered initial letters in designing his code).

Striking similarities for initial-letter frequencies among various subject-word lists were found, as shown in Table 1. The average of five such lists shows that 40% of the words begin with C, S, P, or A (in that order). Furthermore, 85% begin either with these four or with B, M, T, R, E, F, D, G, H, or I; that is, with only 54% of the alphabet. Even greater consistency was found with proper-name lists, as shown in Table 2, but with a different ranking of the letters. The average of three such lists gave S, B, M, H, and C for the beginning letters of 40% of the names, and these five and D, G, K, L, R, P, W, A, and F (again 54% of the alphabet) accounted for 83%.

The Library of Congress list was chosen as typical of the subject-word lists, and the 1955 Syracuse Telephone Directory as typical of names. A systematic sample was obtained from each list by recording the top-left, middle, and top-right terms from every two-page spread.¹ The frequencies of letters in each of the first five positions were then obtained for each list, as shown in Tables 3 and 4.

¹ The probabilities in this case are not independent, but every term is equidistant in the alphabetical sequence from the next term chosen, which is a sufficient approximation to true randomness for the purposes of this study.

TABLE 1 Subject-word initial letter frequencies [a]

Sources:
  I    Chambers Technical Dict’y, 1942, 912 pp.
  II   Merriam-Webster Unabridged Dict’y, 2987 pp.
  III  Industrial Arts Index (Vol. 41, No. 5), April, 1953, 787 pp.
  IV   Chem. Abstracts Decennial Subject Index 1907-16 & 1927-36 (after Krieger, ref. 7)
  V    Lib. of Congress T.I.D. List of Subject Headings, June 1952, 327 pp.

Letter         I             II            III            IV            V         Average   Rank
           Freq.   Dev.   Freq.   Dev.   Freq.   Dev.   Freq.   Dev.   Freq.   Dev.  freq., %
A            7.3   −1.0    6.6   −1.7    9.3   +1.0    8.6   +0.3    9.7   +1.4      8.3      4
B            6.2   +0.4    5.7   −0.1    4.5   −1.3    7.4   +1.6    4.9   −0.9      5.8      5
C           10.8   +0.4    9.8   −0.6   10.0   −0.4   12.6   +2.2    8.8   −1.6     10.4      1
D            5.7   +1.5    4.9   +0.7    3.3   −0.9    3.5   −0.7    3.6   −0.6      4.2     11
E            4.7   +0.3    3.3   −1.1    6.1   +1.7    3.9   −0.5    4.2   −0.2      4.4      9
F            4.6   +0.3    3.9   −0.4    3.7   −0.6    4.9   +0.6    4.5   +0.2      4.3     10
G            3.7   −0.2    3.2   −0.7    4.6   +0.7    3.7   −0.2    4.5   +0.6      3.9     12
H            4.4   +0.6    3.7   −0.1    3.3   −0.5    4.3   +0.5    3.6   −0.2      3.8   13.5
I            3.1   −0.7    3.0   −0.8    6.0   +2.2    3.7   −0.1    3.2   −0.6      3.8   13.5
J            0.6    0.0    0.9   +0.3    0.2   −0.4   ~0.4   −0.2    0.7   +0.1      0.6   21.5
K            1.0   +0.4    0.9   +0.3    0.4   −0.2   ~0.4   −0.2    0.3   −0.3      0.6   21.5
L            3.8   +0.7    3.2   +0.2    2.5   −0.6    3.3   +0.2    2.6   −0.5      3.1     15
M            5.6   −0.1    5.0   −0.7    6.6   +0.9    5.3   −0.4    6.2   +0.5      5.7      6
N            2.0   −0.5    1.8   −0.7    2.4   −0.1    3.1   +0.6    3.2   +0.7      2.5     16
O            2.2    0.0    2.6   +0.4    1.7   −0.5    2.7   +0.5    1.9   −0.3      2.2     18
P            9.1   −0.4    9.3   −0.2   10.3   +0.8   10.8   +1.3    8.1   −1.4      9.5      3
Q            0.6   +0.1    0.6   −0.1    0.2   −0.3   ~0.7   +0.2    0.3   −0.2      0.5     23
R            4.4   −0.4    4.8    0.0    4.7   −0.1    3.5   −1.3    6.8   +2.0      4.8      8
S           10.4   +0.3   12.4   +2.3    9.9   −0.2    8.4   −1.7   10.4   +0.3     10.1      2
T            4.8   −0.5    6.3   +1.0    5.1   −0.2    4.3   −1.0    6.2   +0.9      5.3      7
U            0.8   −0.4    1.9   +0.7    0.9   −0.3   ~0.7   −0.5    1.9   +0.7      1.2     20
V            1.6   +0.2    1.7   +0.3    1.1   −0.3    1.4    0.0    1.3   −0.1      1.4     19
W            1.8   −0.5    3.3   +1.0    2.4   +0.1    1.9   −0.4    1.9   −0.4      2.3     17
X            0.2    0.0    0.1   −0.1    0.2    0.0  ~0.03   −0.2    0.3   +0.1      0.2     26
Y            0.2   −0.1    0.4   +0.1    0.1   −0.2  ~0.03   −0.3    0.3    0.0      0.3   24.5
Z            0.4   +0.1    0.3    0.0    0.2   −0.1  ~0.03   −0.3    0.3    0.0      0.3   24.5
Check sum  100.0   +0.5   99.6    0.0   99.7   +0.2   99.6    0.0   99.7   +0.2     99.5

[a] Frequencies which deviate more than 1% from the average are shown in italics.
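The coverage figures quoted in the text (about 40% of subject words beginning with C, S, P, or A, and about 85% with 14 letters) can be checked against the Average column of Table 1. The following is a minimal sketch; the function name `coverage` and the variable names are illustrative, not part of the original paper. With the rounded averages printed in the table it gives roughly 38% and 84%, close to the rounded figures in the text.

```python
# Average initial-letter frequencies (%) copied from the Average column of Table 1.
AVG_SUBJECT = {
    "A": 8.3, "B": 5.8, "C": 10.4, "D": 4.2, "E": 4.4, "F": 4.3, "G": 3.9,
    "H": 3.8, "I": 3.8, "J": 0.6, "K": 0.6, "L": 3.1, "M": 5.7, "N": 2.5,
    "O": 2.2, "P": 9.5, "Q": 0.5, "R": 4.8, "S": 10.1, "T": 5.3, "U": 1.2,
    "V": 1.4, "W": 2.3, "X": 0.2, "Y": 0.3, "Z": 0.3,
}

def coverage(freqs, k):
    """Return the k most frequent letters and their combined frequency (%)."""
    ranked = sorted(freqs.items(), key=lambda item: item[1], reverse=True)
    top = ranked[:k]
    return [letter for letter, _ in top], sum(f for _, f in top)

letters4, total4 = coverage(AVG_SUBJECT, 4)
letters14, total14 = coverage(AVG_SUBJECT, 14)
print(letters4, round(total4, 1))      # ['C', 'S', 'P', 'A'] 38.3
print(len(letters14), round(total14, 1))  # 14 84.3
```

The same function applies unchanged to the proper-name averages of Table 2.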

TABLE 2 Proper-name initial letter frequencies [a]

Sources:
  I    Chemical Abstracts Fourth Decennial Author Index, 3531 pp.
  II   ASM-SLA Metal Literature Study, 4870 pp.
  III  Syracuse, N.Y., Telephone Directory, 1955, 307 pp.

Letter         I             II            III         Av.       Rank
           Freq.   Dev.   Freq.   Dev.   Freq.   Dev.  freq., %
A            3.8    0.0   2.45  −1.35    5.2   +1.4      3.8     13
B            9.2   −0.2   10.2   +0.8    8.8   −0.6      9.4      2
C            5.7   −0.9    6.2   −0.4    7.8   +1.2      6.6      5
D            5.0   −0.3    5.3    0.0    5.5   +0.2      5.3    6.5
E            2.3   +0.1   2.25  +0.05    2.0   −0.2      2.2     16
F            3.6    0.0    3.4   −0.2    3.9   +0.3      3.6     14
G            5.3    0.0    5.6   +0.3    4.9   −0.4      5.3    6.5
H            6.7   −0.1   7.35  +0.55    6.2   −0.6      6.8      4
I            1.6   +0.7   0.75  −0.15    0.3   −0.6      0.9   21.5
J            1.8   +0.1   1.75  +0.05    1.6   −0.1      1.7     19
K            6.0   +1.2    4.9   −0.3    4.6   −0.6      5.2      8
L            4.6   −0.4   5.65  +0.65    4.6   −0.4      5.0      9
M            7.7   −0.6   8.25  −0.05    8.8   +0.5      8.3      3
N            2.3   +0.3    1.8   −0.2    2.0    0.0      2.0     17
O            1.4   −0.2    1.4   −0.2    2.0   +0.4      1.6     20
P            4.6   −0.1    4.5   −0.2    4.9   +0.2      4.7   11.5
Q            0.1   −0.1    0.1   −0.1    0.3   +0.1      0.2     25
R            5.0   +0.1   4.65  −0.25    4.9    0.0      4.9     10
S           11.3   +0.1   11.0   −0.2   11.4   +0.2     11.2      1
T            3.4   +0.2   3.65  +0.45    2.6   −0.6      3.2     15
U            0.7   +0.2   0.45  −0.05    0.3   −0.2      0.5     23
V            1.8   −0.1   2.15  +0.25    1.6   −0.3      1.9     18
W            4.6   −0.1   4.65  −0.15    4.9   +0.2      4.7   11.5
X            0.0    0.0    0.0    0.0    0.0    0.0      0.0     26
Y            0.5   +0.1    0.5   +0.1    0.3   −0.1      0.4     24
Z            1.0   +0.1    1.1   +0.2    0.7   −0.2      0.9   21.5
Check sum  100.0   +0.1  99.55   −0.4  100.1   −0.2    100.3

[a] Frequencies which deviate more than 1% from the average are shown in italics.
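Footnote [a] of Tables 1 and 2 singles out frequencies that deviate more than 1% from the average. A minimal sketch of that check follows, using three rows of Table 2 as sample data. The function name `flag_deviations` and the short source labels are illustrative assumptions, and the mean is recomputed from the three listed frequencies rather than taken from the rounded average column.

```python
# Three rows of Table 2: (Chem. Abstracts, ASM-SLA, Syracuse) frequencies in %.
SOURCES = ("Chem. Abstracts", "ASM-SLA", "Syracuse")
NAME_INITIALS = {
    "A": (3.8, 2.45, 5.2),
    "C": (5.7, 6.2, 7.8),
    "S": (11.3, 11.0, 11.4),
}

def flag_deviations(table, threshold=1.0):
    """List (letter, source) pairs whose frequency differs from the
    letter's mean across sources by more than `threshold` percentage points."""
    flagged = []
    for letter, freqs in table.items():
        mean = sum(freqs) / len(freqs)
        for source, freq in zip(SOURCES, freqs):
            if abs(freq - mean) > threshold:
                flagged.append((letter, source))
    return flagged

flagged = flag_deviations(NAME_INITIALS)
print(flagged)
# [('A', 'ASM-SLA'), ('A', 'Syracuse'), ('C', 'Syracuse')]
```

For these rows the flagged entries match the deviations in Table 2 that exceed 1 in magnitude (−1.35 and +1.4 for A, +1.2 for C).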

TABLE 3 Subject-word letter frequencies (332 words) [a]

Letter       First letter   Second letter   Third letter   Fourth letter   Fifth letter
             Freq.,%  Rank  Freq.,%  Rank   Freq.,%  Rank  Freq.,%  Rank   Freq.,%  Rank
A               9.3      2    17.8      1      8.4      2     7.7      3      5.3    9.5
B               4.8      8     0.6     17      2.7     15     2.2     17      0.0   24.5
C               8.1      3     1.8     12      5.4      9     6.2      5      2.3   11.5
D               3.6   12.5     0.3   21.5      6.3    6.5     5.0      9      2.0     13
E               4.2     11    12.3      2      6.3    6.5    11.8      1     13.3      1
F               4.5    9.5     0.3   21.5      2.4     16     1.2     20      0.7     19
G               4.5    9.5     0.3   21.5      1.8     17     0.9     22      0.7     19
H               3.6   12.5     3.9    9.5      0.9     19     3.7   13.5      1.3     15
I               3.3   14.5    11.1      4      5.2     10    10.8      2      9.3      4
J               0.9     21     0.0   25.5      0.0   25.5     1.2     20      0.0   24.5
K               0.6   23.5     0.3   21.5      0.3   22.5     2.8   15.5      0.3   21.5
L               2.7     16     6.9      7      7.8      5     3.9   11.5      6.7      7
M               6.3      6     0.9   14.5      3.9   12.5     5.3      8      2.3   11.5
N               3.3   14.5     3.9    9.5      5.7      8     5.9      7      7.3      6
O               2.1   17.5    11.4      3      8.1    3.5     6.2      5     11.7      3
P               7.8      4     1.5     13      3.9   12.5     3.7   13.5      1.0   16.5
Q               0.6   23.5     0.3   21.5      0.3   22.5     0.3     24      0.0   24.5
R               6.6      5     7.5      6     12.0      1     3.9   11.5     12.0      2
S               9.9      1     0.6     17      4.5     11     4.3     10      5.7      8
T               6.0      7     2.4     11      8.1    3.5     6.2      5      8.7      5
U               1.8     19     7.8      5      3.3     14     2.8   15.5      5.3    9.5
V               1.5     20     0.3   21.5      0.6     20     1.2     20      0.3   21.5
W               2.1   17.5     0.0   25.5      0.3   22.5     0.3     24      0.7     19
X               0.6   23.5     0.9   14.5      0.3   22.5     0.0     26      1.0   16.5
Y               0.3     26     6.0      8      1.2     18     1.9     18      1.7     14
Z               0.6   23.5     0.6     17      0.0   25.5     0.3     24      0.0   24.5
No. of blanks     0              0               0              9              33
Check sum      99.6           99.7            99.7           99.5            99.6

[a] Blanks are not counted in computing percentages.
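Tables 3 and 4 tabulate, for each of the first five letter positions, the percentage of sampled terms carrying each letter, with terms too short to reach a position counted as blanks and left out of the percentage base. A minimal sketch of that tabulation is given below. The function name `positional_frequencies` is illustrative, and the five-word sample is a made-up stand-in (three words taken from the paper's later examples plus two hypothetical short ones), not the author's 332-word sample.

```python
from collections import Counter

def positional_frequencies(words, positions=5):
    """For each of the first `positions` letter positions, return a pair
    (percentage frequency of each letter, number of blanks).
    A blank is a word too short to have a letter in that position;
    blanks are excluded from the percentage base, as in Tables 3 and 4."""
    tables = []
    for pos in range(positions):
        letters = [w[pos].upper() for w in words if len(w) > pos]
        blanks = len(words) - len(letters)
        counts = Counter(letters)
        total = len(letters)
        freqs = {l: 100.0 * c / total for l, c in counts.items()} if total else {}
        tables.append((freqs, blanks))
    return tables

# Hypothetical miniature sample, for illustration only.
sample = ["AIRCRAFT", "DIVIDER", "ICHTHYOLOGY", "ACID", "ION"]
tables = positional_frequencies(sample)
print(tables[0][0]["A"])  # 40.0 (two of five words begin with A)
print(tables[3][1])       # 1 blank in the fourth position (ION)
print(tables[4][1])       # 2 blanks in the fifth position (ACID, ION)
```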

TABLE 4 Proper name letter frequencies (309 names) [a]

Letter       First letter   Second letter   Third letter   Fourth letter   Fifth letter
             Freq.,%  Rank  Freq.,%  Rank   Freq.,%  Rank  Freq.,%  Rank   Freq.,%  Rank
A               3.2     14    21.7      1      6.8      5     7.0    5.5      6.5    8.5
B               8.8      3     0.3     19      2.3     14     4.3     10      1.5     16
C               7.8      4     3.9      7      2.9     12     3.6   12.5      2.9   11.5
D               5.5      6     0.6   15.5      2.0   16.5     6.0      8      1.5     16
E               2.0     17    14.3      2      7.1      4    11.6      1     19.2      1
F               4.2     13     0.0   23.5      1.3     20     0.7     23      0.7     20
G               4.9      9     0.6   15.5      3.9     10     2.6   14.5      2.9   11.5
H               6.2      5     3.6      8      2.3     14     2.6   14.5      6.5    8.5
I               0.6     22    10.0      4      4.9      8     8.0      3      6.9      6
J               1.6     19     0.0   23.5      0.3     25     0.0   25.5      0.4     23
K               4.5   11.5     0.0   23.5      1.3     20     3.6   12.5      3.3     10
L               4.5   11.5     3.2    9.5     11.0      3     9.3      2      7.6      3
M               9.4      2     1.6     13      0.6   23.5     1.3   19.5      2.2     14
N               2.0     17     2.3     12     12.0      2     7.6      4      6.9      6
O               2.0     17    16.8      3      6.5      6     4.3     10     10.1      2
P               4.9      9     0.6   15.5      1.6     18     1.7     18      0.7     20
Q               0.3   24.5     0.0   23.5      0.0     26     0.0   25.5      0.0     25
R               4.9      9     6.5      6     12.6      1     6.3      7      7.2      4
S              11.7      1     0.3     19      4.9      8     4.3     10      6.9      6
T               2.9     15     3.2    9.5      4.9      8     7.0    5.5      2.5     13
U               0.6     22     6.8      5      3.6     11     2.3   16.5      0.7     20
V               1.3     20     0.3     19      1.0     22     2.3   16.5      0.0     25
W               5.2      7     0.6   15.5      2.3     14     1.0   21.5      1.5     16
X               0.0     26     0.0   23.5      0.6   23.5     0.3     24      0.0     25
Y               0.3   24.5     2.6     11      2.0   16.5     1.3   19.5      0.7     20
Z               0.6     22     0.0   23.5      1.3     20     1.0   21.5      0.7     20
No. of blanks     0              0               0              6              32
Check sum      99.9           99.8           100.0          100.0           100.0

[a] Blanks are not counted in computing percentages.

TABLE 5 Amount of information (H) in subject-word letter [a]

Each group of three columns gives the letter, its probability $p_n$, and its contribution $-p_n \log_2 p_n$. Letters of equal frequency share tied ranks and are listed on consecutive rows.

Rank  First letter    Second letter   Third letter    Fourth letter   Fifth letter    Av. English text (after Pratt (9))
   1  S  .099  .3303  A  .178  .4432  R  .120  .3671  E  .118  .3638  E  .133  .3871  E  .131  .3841
   2  A  .093  .3187  E  .123  .3719  A  .084  .3002  I  .108  .3468  R  .120  .3671  T  .105  .3414
   3  C  .081  .2937  O  .114  .3571  O  .081  .2937  A  .077  .2848  O  .117  .3622  A  .082  .2959
   4  P  .078  .2871  I  .111  .3520  T  .081  .2937  C  .062  .2487  I  .093  .3187  O  .080  .2915
   5  R  .066  .2588  U  .078  .2871  L  .078  .2871  O  .062  .2487  T  .087  .3065  N  .071  .2709
   6  M  .063  .2513  R  .075  .2803  D  .063  .2513  T  .062  .2487  N  .073  .2756  R  .068  .2637
   7  T  .060  .2435  L  .069  .2661  E  .063  .2513  N  .059  .2409  L  .067  .2613  I  .063  .2513
   8  B  .048  .2103  Y  .060  .2435  N  .057  .2356  M  .053  .2246  S  .057  .2356  S  .061  .2461
   9  F  .045  .2013  H  .039  .1825  C  .054  .2274  D  .050  .2161  A  .053  .2246  H  .053  .2246
  10  G  .045  .2013  N  .039  .1825  I  .052  .2218  S  .043  .1952  U  .053  .2246  D  .038  .1793
  11  E  .042  .1921  T  .024  .1291  S  .045  .2013  L  .039  .1825  C  .023  .1252  L  .034  .1659
  12  D  .036  .1727  C  .018  .1043  M  .039  .1825  R  .039  .1825  M  .023  .1252  F  .029  .1481
  13  H  .036  .1727  P  .015  .0909  P  .039  .1825  H  .037  .1760  D  .020  .1129  C  .028  .1444
  14  I  .033  .1624  M  .009  .0612  U  .033  .1624  P  .037  .1760  Y  .017  .0999  M  .025  .1330
  15  N  .033  .1624  X  .009  .0612  B  .027  .1407  K  .028  .1444  H  .013  .0815  U  .025  .1330
  16  L  .027  .1407  B  .006  .0443  F  .024  .1291  U  .028  .1444  P  .010  .0664  G  .020  .1129
  17  O  .021  .1170  S  .006  .0443  G  .018  .1043  B  .022  .1211  X  .010  .0664  Y  .020  .1129
  18  W  .021  .1170  Z  .006  .0443  Y  .012  .0766  Y  .019  .1086  F  .007  .0501  P  .020  .1129
  19  U  .018  .1043  D  .003  .0251  H  .009  .0612  F  .012  .0766  G  .007  .0501  W  .015  .0909
  20  V  .015  .0909  F  .003  .0251  V  .006  .0443  J  .012  .0766  W  .007  .0501  B  .014  .0862
  21  J  .009  .0612  G  .003  .0251  K  .003  .0251  V  .012  .0766  K  .003  .0251  V  .009  .0612
  22  K  .006  .0443  K  .003  .0251  Q  .003  .0251  G  .009  .0612  V  .003  .0251  K  .004  .0319
  23  Q  .006  .0443  Q  .003  .0251  W  .003  .0251  Q  .003  .0251  B     0         X  .002  .0179
  24  X  .006  .0443  V  .003  .0251  X  .003  .0251  W  .003  .0251  J     0         J  .001  .0100
  25  Z  .006  .0443  J     0         J     0         Z  .003  .0251  Q     0         Q  .001  .0100
  26  Y  .003  .0251  W     0         Z     0         X     0         Z     0         Z  .001  .0100

H, bits ($\log_2 26 = 4.7$)   4.2920    3.7964    4.1255    4.5201    3.8413    4.1300
$R = 1 - H/\log_2 26$         9%        20%       12%       4%        18%       12%

[a] Average of five letters, 20.5753/5 = 4.1151.
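The H and R rows of Table 5 can be reproduced directly from the tabulated probabilities. The sketch below does so for the first-letter column only; the names `entropy_bits` and `FIRST_LETTER_P` are illustrative. The probabilities are the rounded values printed in the table (they sum to 0.996, not 1), so the result should be read as an approximate check rather than an exact recomputation.

```python
import math

# First-letter probabilities p_n from Table 5 (subject words).
FIRST_LETTER_P = {
    "S": .099, "A": .093, "C": .081, "P": .078, "R": .066, "M": .063,
    "T": .060, "B": .048, "F": .045, "G": .045, "E": .042, "D": .036,
    "H": .036, "I": .033, "N": .033, "L": .027, "O": .021, "W": .021,
    "U": .018, "V": .015, "J": .009, "K": .006, "Q": .006, "X": .006,
    "Z": .006, "Y": .003,
}

def entropy_bits(probs):
    """H = -sum(p * log2 p); zero-probability entries contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

H = entropy_bits(FIRST_LETTER_P.values())
R = 1 - H / math.log2(26)   # redundancy relative to 26 equiprobable letters
print(round(H, 3), f"{R:.0%}")  # about 4.292 bits, 9%
```

Applying `entropy_bits` to the other columns gives the remaining entries of the H row in the same way.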

TABLE 6 Subject-word cumulative letter frequencies (in rank order) [a] [Table body not reproduced.]

[a] On an equiprobable basis, each letter would occur 3.846% of the time.
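Table 7 combines the subject-word frequencies of Table 3 with the proper-name frequencies of Table 4 in a 7-to-1 proportion. A minimal sketch of that weighting for a few first-letter entries follows; the function name `weighted` and the dictionary names are illustrative. The paper does not state whether the weighting was applied to rounded percentages or to raw counts, so the sketch assumes rounded percentages; on that assumption it agrees with the Table 7 entries to within about 0.15 percentage point.

```python
# First-letter frequencies (%) for selected letters.
SUBJECT_FIRST = {"A": 9.3, "B": 4.8, "C": 8.1, "M": 6.3, "P": 7.8, "S": 9.9}   # Table 3
NAME_FIRST    = {"A": 3.2, "B": 8.8, "C": 7.8, "M": 9.4, "P": 4.9, "S": 11.7}  # Table 4
TABLE7_FIRST  = {"A": 8.5, "B": 5.3, "C": 8.0, "M": 6.7, "P": 7.4, "S": 10.0}  # Table 7

def weighted(subject, name, subject_parts=7, name_parts=1):
    """Weighted mean of two percentage frequencies (7 parts subject, 1 part name)."""
    return (subject_parts * subject + name_parts * name) / (subject_parts + name_parts)

computed = {l: weighted(SUBJECT_FIRST[l], NAME_FIRST[l]) for l in SUBJECT_FIRST}
for letter, value in computed.items():
    print(letter, round(value, 2), TABLE7_FIRST[letter])
```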

TABLE 7 Weighted letter frequencies, % [a]

Letter      First letter   Third letter   Fourth letter
A                8.5            8.2             7.6
B                5.3            2.7             2.5
C                8.0            5.1             5.9
D                3.8            5.8             5.1
E                3.9            6.4            11.8
F                4.4            2.2             1.1
G                4.6            2.1             1.1
H                3.9            1.1             3.6
I                3.0            5.2            10.5
J                1.0            0.0             1.1
K                1.1            0.4             2.9
L                2.9            8.2             4.6
M                6.7            3.5             4.8
N                3.1            6.5             6.1
O                2.2            7.9             6.0
P                7.4            3.6             3.5
Q                0.6            0.3             0.3
R                6.4           12.1             4.2
S               10.0            5.0             4.3
T                5.6            7.7             6.3
U                1.7            3.3             2.7
V                1.5            0.7             1.3
W                2.5            0.5             0.4
X                0.5            0.3             0.0
Y                0.3            1.3             1.8
Z                0.6            0.2             0.4
Check sum       99.5%         100.3%           99.9%

[a] All seven parts subject plus one part name.

For the initial letters of subject terms, the rank order was S, A, C, P, R, M, T, ...; for second letters, A, E, O, I, U, R, L, ...; for third, R, A, O or T, L, D or E, ...; for fourth, E, I, A, T or O or C, N, ...; and for fifth, E, R, O, I, T, N, L, ..., as shown in Table 5. Cumulated frequencies are given in Table 6.

Table 5 also gives the information measure $-p_n \log_2 p_n$ for each letter in each position (8). For this purpose, percentage frequencies were assumed to represent actual probabilities, $p_n$. The sum for each letter position represents H, the average uncertainty per letter position or, as it is sometimes called, the average information represented by the letter position, in bits. The redundancy R is also shown at the bottom for each letter position. These calculations show that the least redundant (or the most informative) letter position is the fourth, next to that the first, and then the third. Similar results can be shown for proper names.

For the marginal-punched card application, first and third letter positions were selected for coding. Subject-word frequencies were weighted with proper names in a 7-to-1 proportion,² as shown in Table 7. The 52 letters of the first and third positions were then assigned to the 38 available positions as equally as possible, but under the restriction that alphabetical order along the side of the card be preserved. The result is shown in Fig. 1, and Table 8 shows the predicted frequency distribution for this code. Note that it was necessary sometimes to combine several letters in one position, and sometimes to split one letter between two positions. These splits were chosen according to the median frequencies of English trigrams (9). Splitting letters modifies the first-letter position somewhat by the second, and the third somewhat by the fourth. Such letter-pair frequencies take account of intersymbol influence, and therefore make possible a better code than single-letter frequencies. H. P. Luhn has designed a superimposed code using randomizing squares (10) which takes advantage of letter-pair frequencies.

² According to Wise (3), the ratio X/H, or that of the number of positions to be punched to the number of positions available for punching, should be about 0.46. Taking H to be 19, X = 8.75. The dropping fraction $f_d = (G/H)^Y$, or the ratio of the number of positions actually punched to the number available for punching, raised to a power, Y, representing the number of sorting elements used, works out to be $(7/19)^2 = 13.7\%$, if about 9 codes are actually superimposed. Note that a maximum of 8 coding words was chosen, based on these calculations.

TABLE 8 Comparison of actual and predicted letter frequencies

First letter [a]
Letter                                    Actual no. of cards dropped   Actual %   Predicted %
Aa-An (median between anx and any)                     29                  2.6        4.25
Ao-Az                                                  19                  1.7        4.25
B                                                      54                  4.9        5.3
Ca-Ci (median between cka and cke)                     65                  5.9        4.0
Ck-Cz                                                 103                  9.35       4.0
D                                                      46                  4.2        3.8
E                                                      37                  3.35       3.9
F                                                      36 [c]              3.3        4.4
G                                                                                     4.6
H                                                                                     3.9
I & J                                                                                 4.0
K & L                                                                                 4.0
M                                                                                     6.7
N & O                                                                                 5.3
P                                                                                     7.4
Q & R                                                                                 7.0
Sa-Si (median between siv and six)                                                    5.0
Sj-Sz                                                                                 5.0
T                                                                                     5.6
U-Z                                                                                   7.1

Third letter [b]
Letter                                    Actual no. of cards dropped   Actual %   Predicted %
aa-aq (median between ard and are)                     40                  3.8        5.45
ar-az & b                                              80                  7.6        5.45
c                                                      60                  5.65       5.1
d                                                      55                  5.2        5.8
e                                                     100                  9.45       6.4
f, g & h                                               80                  7.6        5.4
i, j & k                                               45                  4.25       5.6
la-lo (median between lov and low)                     45                  4.25       5.85
lp-lz & m                                              80                  7.6        5.85
n                                                      80                  7.6        6.5
oa-os (median between otf and oth)                     45                  4.25       5.9
ot-oz, p, q                                            35                  3.3        5.9
ra-rg (median between rge and rgo)                     40                  3.8        6.05
rh-rz                                                  65                  6.1        6.05
s                                                      65                  6.1        5.0
ta-th (median between tid and tie)                     40                  3.8        3.85
ti-tz                                                  35                  3.3        3.85
u-z                                                    70                  6.6        6.3
Total                                                1060
Avg.                                                   59

[a] Ideally each first letter position would comprise 5%.
[b] Ideally each third letter position would comprise 5.5%.
[c] Not carried to completion.
Note: About 400 cards were used in the study. The actual number was estimated by measuring cards dropped at 150 cards/inch. Predicted percentage based on Table 7. Dropping fraction, $F_d = (G/H)^Y = (2.7/18)^1 = 15\%$. For 400 cards, $F_d$ = 60. H is the number of coding positions; G is the number of punches/card = 1100/400; Y is the number of sorting positions = 1. (See Wise (3) for derivation.)

An empirical check of the letter code shown in Fig. 1 was made on a 400-card file maintained by the author. The results are shown in Table 8.³ The average dropping fraction for the third position alone compares well with the dropping fraction as calculated by formula, but the range (from 9 to 25%) is broader than hoped for. However, Table 8 shows that the agreement between actual and predicted frequencies in the third-letter position was very good, considering the alphabetic-order limitation imposed in assigning the positions.

By using data-processing equipment, much more elaborate studies on much larger samples would be possible. The author is working with such equipment and hopes to have some results available in the near future.

Equifrequency-letter codes have many other applications, including the preassignment of space in files and indexes, in cryptography, and in philology. For example, the data in Table 5 can provide a quantitative measure of subject-word popularity. Taking a few words from the Library of Congress list of subject headings, we add the percentage frequencies of each letter (up to 5) together and divide by the number of letters. (Multiply each $p_n$ by 100 to get the percentage frequency.)

AIRCRAFT has a value of (9.3 + 11.1 + 12.0 + 6.2 + 12.0)/5 = 10.12
DIVIDER has a value of (3.6 + 11.1 + 0.6 + 10.8 + 2.0)/5 = 5.66
ICHTHYOLOGY has a value of (3.3 + 1.8 + 0.9 + 6.2 + 1.3)/5 = 2.70

These three words give some idea of the range possible in a subject-heading list.
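The popularity measure just described (the mean percentage frequency of a word's first five letters, each taken in its own position) can be sketched as follows, using the subject-word frequencies of Table 3, which are the Table 5 probabilities multiplied by 100. The names `TABLE3` and `popularity` are illustrative. With these table values AIRCRAFT and ICHTHYOLOGY come out at 10.12 and 2.70 as printed; for DIVIDER the listed terms sum to 28.1, which gives 5.62 rather than the printed 5.66.

```python
# Subject-word letter frequencies (%) for positions 1-5, from Table 3.
TABLE3 = {
    "A": (9.3, 17.8, 8.4, 7.7, 5.3),   "B": (4.8, 0.6, 2.7, 2.2, 0.0),
    "C": (8.1, 1.8, 5.4, 6.2, 2.3),    "D": (3.6, 0.3, 6.3, 5.0, 2.0),
    "E": (4.2, 12.3, 6.3, 11.8, 13.3), "F": (4.5, 0.3, 2.4, 1.2, 0.7),
    "G": (4.5, 0.3, 1.8, 0.9, 0.7),    "H": (3.6, 3.9, 0.9, 3.7, 1.3),
    "I": (3.3, 11.1, 5.2, 10.8, 9.3),  "J": (0.9, 0.0, 0.0, 1.2, 0.0),
    "K": (0.6, 0.3, 0.3, 2.8, 0.3),    "L": (2.7, 6.9, 7.8, 3.9, 6.7),
    "M": (6.3, 0.9, 3.9, 5.3, 2.3),    "N": (3.3, 3.9, 5.7, 5.9, 7.3),
    "O": (2.1, 11.4, 8.1, 6.2, 11.7),  "P": (7.8, 1.5, 3.9, 3.7, 1.0),
    "Q": (0.6, 0.3, 0.3, 0.3, 0.0),    "R": (6.6, 7.5, 12.0, 3.9, 12.0),
    "S": (9.9, 0.6, 4.5, 4.3, 5.7),    "T": (6.0, 2.4, 8.1, 6.2, 8.7),
    "U": (1.8, 7.8, 3.3, 2.8, 5.3),    "V": (1.5, 0.3, 0.6, 1.2, 0.3),
    "W": (2.1, 0.0, 0.3, 0.3, 0.7),    "X": (0.6, 0.9, 0.3, 0.0, 1.0),
    "Y": (0.3, 6.0, 1.2, 1.9, 1.7),    "Z": (0.6, 0.6, 0.0, 0.3, 0.0),
}

def popularity(word, positions=5):
    """Mean positional frequency (%) of the first `positions` letters of a word."""
    letters = word.upper()[:positions]
    return sum(TABLE3[ch][i] for i, ch in enumerate(letters)) / len(letters)

for w in ("AIRCRAFT", "DIVIDER", "ICHTHYOLOGY"):
    print(w, round(popularity(w), 2))

# Highest-scoring five-letter combination: the most frequent letter in each position.
best = "".join(max(TABLE3, key=lambda l: TABLE3[l][pos]) for pos in range(5))
print(best, round(popularity(best), 2))
```

Words shorter than five letters are averaged over their own length, which is how a four-letter word such as SARI receives its value.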
In general dictionary words, the highest found was SARI, with a value of 12.63, and the lowest, ONYX, with a value of 1.8. It is interesting to compare these values with the highest possible letter combination (not necessarily an English word), which is SAREE (value 12.96), and the lowest (value 0.06). The highest is very nearly realized in actuality, while the lowest never comes close. Also note that the word SARI is certainly uncommon English; this phenomenon may occur because the intersymbol connections are broken by taking single-letter frequencies.

³ Since the first-letter positions showed quite wide deviations from the predicted frequencies, their analysis was never completed. It is now thought that third and fourth letter positions would have made a more invariant code, less subject to the fluctuations which occur in any particular file because of the selection of particular terms.
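The dropping-fraction estimates quoted from Wise (3) in footnote 2 and in the note to Table 8 follow from a single formula, the ratio of punches per card G to coding positions H, raised to the number of sorting positions Y. A minimal sketch is given below; the function name `dropping_fraction` is illustrative, and the derivation itself is Wise's and is not reproduced here. With one sorting position the Table 8 figures are recovered (15%, or 60 of 400 cards); with G = 7, H = 19, and Y = 2 the formula gives about 13.6%, close to the 13.7% quoted in footnote 2.

```python
def dropping_fraction(punches_per_card, coding_positions, sorting_positions=1):
    """Expected fraction of cards dropped in a sort: (G / H) ** Y."""
    return (punches_per_card / coding_positions) ** sorting_positions

# Note to Table 8: G = 2.7 punches/card, H = 18 positions, Y = 1.
fd_table8 = dropping_fraction(2.7, 18, 1)
cards_dropped = 400 * fd_table8

# Footnote 2: G = 7, H = 19, Y = 2.
fd_footnote2 = dropping_fraction(7, 19, 2)

print(f"{fd_table8:.1%}", round(cards_dropped), f"{fd_footnote2:.1%}")
```

Raising Y sharply reduces the fraction of unwanted cards, which is why the calculation in footnote 2 is made for two sorting elements.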

ACKNOWLEDGMENT

The work described in this paper was performed while the author was in the employ of Carrier Corporation, Syracuse, New York.

REFERENCES

1. C. E. SHANNON, A Mathematical Theory of Communication, Bell System Technical Journal, July 1948, and following.
2. C. N. MOOERS, Zatocoding and Development in Information Retrieval, ASLIB Proc., February 1956, p. 3. (Many other papers by this author may be obtained from his Zator Co., 79 Milk Street, Boston, Massachusetts.)
3. C. S. WISE, A Punched-Card File Based on Word Coding, pp. 93-114, in Perry and Casey’s Punched Cards, Reinhold Publishing Corporation, New York, 1951.
4. MOOERS and WISE had discussions in American Documentation, April 1950, October 1950, and October 1952.
5. H. OHLMAN, The Low-Cost Production of Marginal-Punched Cards on Accounting Machines, pp. 123-26, American Documentation, April 1957.
6. JOINT COMMITTEE OF ASM AND SLA, ASM-SLA Metallurgical Literature Classification, American Society for Metals, 1950. (Figure 5, which was based on an analysis of 4870 names by A. H. Geisler in ASM Review of Metallurgical Literature.)
7. K. A. KRIEGER, A Punched-Card System for Chemical Literature, J. of Chemical Education, March 1949, p. 163.
8. E. T. KLEMMER, Tables for Computing Informational Measures, p. 75 in Quastler’s Information Theory in Psychology, Free Press, Glencoe, Ill., 1955.
9. F. PRATT, Secret and Urgent, The Story of Codes and Ciphers, Blue Ribbon Books, Garden City, N.Y., 1942, pp. 264-78.
10. H. P. LUHN, Superimposed Coding With the Aid of Randomizing Squares for Use in Mechanical Information Searching Systems, IBM Product Development Lab., Poughkeepsie, New York, 1956.
