Subject-Word Letter Frequencies with Applications to Superimposed Coding

HERBERT OHLMAN

ABSTRACT. The frequencies of occurrence of English letters in the first five positions of subject words and proper names are determined. With these frequencies a superimposed code is designed. No code book is required. Coding space is utilized almost as economically as with a random code. An empirical check is made. A quantitative measure of word popularity is proposed using letter-frequency data.

Coding, or the transforming of information from one guise to another, is one of man’s commonest activities. Every picture may be said to be a coding of some real scene and every written word a coding of some utterance—the brain itself is said to work with coded impulses.

Since the beginning of mass communications, starting with the invention of printing, and increasing with the widespread use of electronics, efficient use of existing space and time has become more and more important. Today, information theory provides a sound basis for determining the limits of transmission speed and accuracy. However, Shannon’s theory (1) does not tell us how to make a particular code more efficient. The design of codes is still an art; this paper deals with improving one particular type, superimposed coding.

In information searching, mechanical aids are being used wherever possible. For a machine to process information, the information must be coded, usually into some variant of that most basic code of all, the binary. However, the most efficient code for pure selection appears to be a superimposed random code. Each coding position is used in a random manner, and a group of coding positions contains superimposed entries.

Calvin Mooers (2, 4) calls such coding “Zatocoding” and has applied it in his patented marginal-punched card system called Zator. However, Zatocoding requires an intermediate step in both coding and searching: a code book containing a number of indexing terms with random-number equivalents. Carl Wise (3, 4) has produced a nonrandom superimposed code which he calls “word coding” for use with marginal-punched cards of the Keysort variety. This type of code does not require an intermediate code book.

HERBERT OHLMAN System Development Corp., Santa Monica, Calif.

FIGURE 1.

The author has attempted to combine the best features of both systems in coding English words by essentially pre-randomizing the alphabet. This is possible because there is a certain invariance of the letter frequencies within each letter position of a word.

As this system was developed in response to a specific need, it may be well to talk about it in concrete terms, and later apply its principles to other information systems. A marginal-punched card produced on IBM equipment (5) was used as the unit record. The thirty-eight positions along the top edge could code 38 words or phrases by using a direct code, but by superimposition every position could be made to do multiple duty. However, neither of the two systems previously described seemed to meet the requirement of a directly interpretable, yet efficient code.

Subject-word and proper-name lists were studied to find what letter frequencies occurred in the first five letter positions. Some work along these lines had been done, notably by Geisler for the ASM-SLA (6) with proper names, and by Krieger (7) with subject words (however, Krieger only considered initial letters in designing his code).

Striking similarities for initial-letter frequencies among various subject-word lists were found, as shown in Table 1. The average of five such lists shows that 40% of the words begin with C, S, P, or A (in that order). Furthermore, 85% begin either with these four or with B, M, T, R, E, F, D, G, H, or I, that is, with only 54% of the alphabet. Even greater consistency was found with proper-name lists, as shown in Table 2, but with a different ranking of the letters. The average of three such lists gave S, B, M, H, and C for the beginning letters of 40% of the names, and these five and D, G, K, L, R, P, W, A, and F (again 54% of the alphabet) accounted for 83%.

The Library of Congress list was chosen as typical of the subject-word lists, and the 1955 Syracuse Telephone Directory as typical of names. A systematic sample was obtained from each list by recording the top-left, middle, and top-right terms from every two-page spread.¹ The frequencies of letters in each of the first five positions were then obtained for each list, as shown in Tables 3 and 4.

¹ The probabilities in this case are not independent, but every term is equidistant in the alphabetical sequence from the next term chosen, which is a sufficient approximation to true randomness for the purposes of this study.
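The tabulation behind Tables 3 and 4 can be sketched in a few lines. This is only an illustrative sketch, not the procedure actually used in 1957: the function names, the fixed-step sampling (a simplification of the top-left, middle, top-right rule), and the tiny word list are all assumptions. Words too short to reach a position, and non-alphabetic characters, are treated as blanks and left out of that position's percentage base, following the note to Tables 3 and 4.

```python
from collections import Counter


def systematic_sample(terms, step):
    """Take every `step`-th term from an alphabetized list (hypothetical helper)."""
    return terms[::step]


def positional_frequencies(words, positions=5):
    """Return one {letter: percent} dict per letter position.

    Blanks (word too short, or a non-letter character) are not counted
    in the base used for the percentages of that position.
    """
    tables = []
    for i in range(positions):
        letters = [w[i].upper() for w in words if len(w) > i and w[i].isalpha()]
        counts = Counter(letters)
        total = len(letters)
        if total == 0:
            tables.append({})
        else:
            tables.append({ch: 100.0 * n / total for ch, n in counts.items()})
    return tables


# Hypothetical five-term sample, for illustration only.
sample = ["AIRCRAFT", "ALLOYS", "CARBON", "STEEL", "ZINC"]
freqs = positional_frequencies(sample)
```

With this sample, `freqs[0]["A"]` is 40.0 (two of five terms), while the fifth position is based on four terms only, because ZINC has no fifth letter.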

TABLE 1 Subject-word initial letter frequencies (a)

Sources:
(1) Chambers Technical Dict’y, 1942, 912 pp.
(2) Merriam-Webster Unabridged Dict’y, 2987 pp.
(3) Industrial Arts Index (Vol. 41, No. 5), April, 1953, 787 pp.
(4) Chem. Abstracts Decennial Subject Index 1907–16 & 1927–36 (after Krieger, ref. 7)
(5) Lib. of Congress T.I.D. List of Subject Headings, June 1952, 327 pp.

Freq. = frequency, %; Dev. = deviation from the average frequency.

Letter          (1)           (2)           (3)           (4)           (5)        Average   Rank
            Freq.   Dev.  Freq.   Dev.  Freq.   Dev.  Freq.   Dev.  Freq.   Dev.  freq., %
A             7.3   −1.0    6.6   −1.7    9.3   +1.0    8.6   +0.3    9.7   +1.4     8.3      4
B             6.2   +0.4    5.7   −0.1    4.5   −1.3    7.4   +1.6    4.9   −0.9     5.8      5
C            10.8   +0.4    9.8   −0.6   10.0   −0.4   12.6   +2.2    8.8   −1.6    10.4      1
D             5.7   +1.5    4.9   +0.7    3.3   −0.9    3.5   −0.7    3.6   −0.6     4.2     11
E             4.7   +0.3    3.3   −1.1    6.1   +1.7    3.9   −0.5    4.2   −0.2     4.4      9
F             4.6   +0.3    3.9   −0.4    3.7   −0.6    4.9   +0.6    4.5   +0.2     4.3     10
G             3.7   −0.2    3.2   −0.7    4.6   +0.7    3.7   −0.2    4.5   +0.6     3.9     12
H             4.4   +0.6    3.7   −0.1    3.3   −0.5    4.3   +0.5    3.6   −0.2     3.8   13.5
I             3.1   −0.7    3.0   −0.8    6.0   +2.2    3.7   −0.1    3.2   −0.6     3.8   13.5
J             0.6    0.0    0.9   +0.3    0.2   −0.4   ~0.4   −0.2    0.7   +0.1     0.6   21.5
K             1.0   +0.4    0.9   +0.3    0.4   −0.2   ~0.4   −0.2    0.3   −0.3     0.6   21.5
L             3.8   +0.7    3.2   +0.2    2.5   −0.6    3.3   +0.2    2.6   −0.5     3.1     15
M             5.6   −0.1    5.0   −0.7    6.6   +0.9    5.3   −0.4    6.2   +0.5     5.7      6
N             2.0   −0.5    1.8   −0.7    2.4   −0.1    3.1   +0.6    3.2   +0.7     2.5     16
O             2.2    0.0    2.6   +0.4    1.7   −0.5    2.7   +0.5    1.9   −0.3     2.2     18
P             9.1   −0.4    9.3   −0.2   10.3   +0.8   10.8   +1.3    8.1   −1.4     9.5      3
Q             0.6   +0.1    0.6   −0.1    0.2   −0.3   ~0.7   +0.2    0.3   −0.2     0.5     23
R             4.4   −0.4    4.8    0.0    4.7   −0.1    3.5   −1.3    6.8   +2.0     4.8      8
S            10.4   +0.3   12.4   +2.3    9.9   −0.2    8.4   −1.7   10.4   +0.3    10.1      2
T             4.8   −0.5    6.3   +1.0    5.1   −0.2    4.3   −1.0    6.2   +0.9     5.3      7
U             0.8   −0.4    1.9   +0.7    0.9   −0.3   ~0.7   −0.5    1.9   +0.7     1.2     20
V             1.6   +0.2    1.7   +0.3    1.1   −0.3    1.4    0.0    1.3   −0.1     1.4     19
W             1.8   −0.5    3.3   +1.0    2.4   +0.1    1.9   −0.4    1.9   −0.4     2.3     17
X             0.2    0.0    0.1   −0.1    0.2    0.0  ~0.03   −0.2    0.3   +0.1     0.2     26
Y             0.2   −0.1    0.4   +0.1    0.1   −0.2  ~0.03   −0.3    0.3    0.0     0.3   24.5
Z             0.4   +0.1    0.3    0.0    0.2   −0.1  ~0.03   −0.3    0.3    0.0     0.3   24.5
Check sum   100.0   +0.5   99.6    0.0   99.7   +0.2   99.6    0.0   99.7   +0.2    99.5

(a) Frequencies which deviate more than 1% from the average are shown in italics.

TABLE 2 Proper-name initial letter frequencies (a)

Sources:
(1) Chemical Abstracts Fourth Decennial Author Index, 3531 pp.
(2) ASM-SLA Metal Literature Study, 4870 pp.
(3) Syracuse, N.Y., Telephone Directory, 1955, 307 pp.

Freq. = frequency, %; Dev. = deviation from the average frequency.

Letter           (1)             (2)             (3)         Av.     Rank
             Freq.    Dev.   Freq.    Dev.   Freq.    Dev.  freq., %
A              3.8     0.0    2.45   −1.35     5.2    +1.4     3.8     13
B              9.2    −0.2    10.2    +0.8     8.8    −0.6     9.4      2
C              5.7    −0.9     6.2    −0.4     7.8    +1.2     6.6      5
D              5.0    −0.3     5.3     0.0     5.5    +0.2     5.3    6.5
E              2.3    +0.1    2.25   +0.05     2.0    −0.2     2.2     16
F              3.6     0.0     3.4    −0.2     3.9    +0.3     3.6     14
G              5.3     0.0     5.6    +0.3     4.9    −0.4     5.3    6.5
H              6.7    −0.1    7.35   +0.55     6.2    −0.6     6.8      4
I              1.6    +0.7    0.75   −0.15     0.3    −0.6     0.9   21.5
J              1.8    +0.1    1.75   +0.05     1.6    −0.1     1.7     19
K              6.0    +1.2     4.9    −0.3     4.6    −0.6     5.2      8
L              4.6    −0.4    5.65   +0.65     4.6    −0.4     5.0      9
M              7.7    −0.6    8.25   −0.05     8.8    +0.5     8.3      3
N              2.3    +0.3     1.8    −0.2     2.0     0.0     2.0     17
O              1.4    −0.2     1.4    −0.2     2.0    +0.4     1.6     20
P              4.6    −0.1     4.5    −0.2     4.9    +0.2     4.7   11.5
Q              0.1    −0.1     0.1    −0.1     0.3    +0.1     0.2     25
R              5.0    +0.1    4.65   −0.25     4.9     0.0     4.9     10
S             11.3    +0.1    11.0    −0.2    11.4    +0.2    11.2      1
T              3.4    +0.2    3.65   +0.45     2.6    −0.6     3.2     15
U              0.7    +0.2    0.45   −0.05     0.3    −0.2     0.5     23
V              1.8    −0.1    2.15   +0.25     1.6    −0.3     1.9     18
W              4.6    −0.1    4.65   −0.15     4.9    +0.2     4.7   11.5
X              0.0     0.0     0.0     0.0     0.0     0.0     0.0     26
Y              0.5    +0.1     0.5    +0.1     0.3    −0.1     0.4     24
Z              1.0    +0.1     1.1    +0.2     0.7    −0.2     0.9   21.5
Check sum    100.0    +0.1   99.55    −0.4   100.1    −0.2   100.3

(a) Frequencies which deviate more than 1% from the average are shown in italics.

TABLE 3 Subject-word letter frequencies (332 words) (a)

                First letter   Second letter  Third letter   Fourth letter  Fifth letter
Letter          Freq.,% Rank   Freq.,% Rank   Freq.,% Rank   Freq.,% Rank   Freq.,% Rank
A                  9.3     2     17.8     1      8.4     2      7.7     3      5.3   9.5
B                  4.8     8      0.6    17      2.7    15      2.2    17      0.0  24.5
C                  8.1     3      1.8    12      5.4     9      6.2     5      2.3  11.5
D                  3.6  12.5      0.3  21.5      6.3   6.5      5.0     9      2.0    13
E                  4.2    11     12.3     2      6.3   6.5     11.8     1     13.3     1
F                  4.5   9.5      0.3  21.5      2.4    16      1.2    20      0.7    19
G                  4.5   9.5      0.3  21.5      1.8    17      0.9    22      0.7    19
H                  3.6  12.5      3.9   9.5      0.9    19      3.7  13.5      1.3    15
I                  3.3  14.5     11.1     4      5.2    10     10.8     2      9.3     4
J                  0.9    21      0.0  25.5      0.0  25.5      1.2    20      0.0  24.5
K                  0.6  23.5      0.3  21.5      0.3  22.5      2.8  15.5      0.3  21.5
L                  2.7    16      6.9     7      7.8     5      3.9  11.5      6.7     7
M                  6.3     6      0.9  14.5      3.9  12.5      5.3     8      2.3  11.5
N                  3.3  14.5      3.9   9.5      5.7     8      5.9     7      7.3     6
O                  2.1  17.5     11.4     3      8.1   3.5      6.2     5     11.7     3
P                  7.8     4      1.5    13      3.9  12.5      3.7  13.5      1.0  16.5
Q                  0.6  23.5      0.3  21.5      0.3  22.5      0.3    24      0.0  24.5
R                  6.6     5      7.5     6     12.0     1      3.9  11.5     12.0     2
S                  9.9     1      0.6    17      4.5    11      4.3    10      5.7     8
T                  6.0     7      2.4    11      8.1   3.5      6.2     5      8.7     5
U                  1.8    19      7.8     5      3.3    14      2.8  15.5      5.3   9.5
V                  1.5    20      0.3  21.5      0.6    20      1.2    20      0.3  21.5
W                  2.1  17.5      0.0  25.5      0.3  22.5      0.3    24      0.7    19
X                  0.6  23.5      0.9  14.5      0.3  22.5      0.0    26      1.0  16.5
Y                  0.3    26      6.0     8      1.2    18      1.9    18      1.7    14
Z                  0.6  23.5      0.6    17      0.0  25.5      0.3    24      0.0  24.5
No. of blanks        0              0              0              9             33
Check sum         99.6           99.7           99.7           99.5           99.6

(a) Blanks are not counted in computing percentages.

TABLE 4 Proper name letter frequencies (309 names) (a)

                First letter   Second letter  Third letter   Fourth letter  Fifth letter
Letter          Freq.,% Rank   Freq.,% Rank   Freq.,% Rank   Freq.,% Rank   Freq.,% Rank
A                  3.2    14     21.7     1      6.8     5      7.0   5.5      6.5   8.5
B                  8.8     3      0.3    19      2.3    14      4.3    10      1.5    16
C                  7.8     4      3.9     7      2.9    12      3.6  12.5      2.9  11.5
D                  5.5     6      0.6  15.5      2.0  16.5      6.0     8      1.5    16
E                  2.0    17     14.3     2      7.1     4     11.6     1     19.2     1
F                  4.2    13      0.0  23.5      1.3    20      0.7    23      0.7    20
G                  4.9     9      0.6  15.5      3.9    10      2.6  14.5      2.9  11.5
H                  6.2     5      3.6     8      2.3    14      2.6  14.5      6.5   8.5
I                  0.6    22     10.0     4      4.9     8      8.0     3      6.9     6
J                  1.6    19      0.0  23.5      0.3    25      0.0  25.5      0.4    23
K                  4.5  11.5      0.0  23.5      1.3    20      3.6  12.5      3.3    10
L                  4.5  11.5      3.2   9.5     11.0     3      9.3     2      7.6     3
M                  9.4     2      1.6    13      0.6  23.5      1.3  19.5      2.2    14
N                  2.0    17      2.3    12     12.0     2      7.6     4      6.9     6
O                  2.0    17     16.8     3      6.5     6      4.3    10     10.1     2
P                  4.9     9      0.6  15.5      1.6    18      1.7    18      0.7    20
Q                  0.3  24.5      0.0  23.5      0.0    26      0.0  25.5      0.0    25
R                  4.9     9      6.5     6     12.6     1      6.3     7      7.2     4
S                 11.7     1      0.3    19      4.9     8      4.3    10      6.9     6
T                  2.9    15      3.2   9.5      4.9     8      7.0   5.5      2.5    13
U                  0.6    22      6.8     5      3.6    11      2.3  16.5      0.7    20
V                  1.3    20      0.3    19      1.0    22      2.3  16.5      0.0    25
W                  5.2     7      0.6  15.5      2.3    14      1.0  21.5      1.5    16
X                  0.0    26      0.0  23.5      0.6  23.5      0.3    24      0.0    25
Y                  0.3  24.5      2.6    11      2.0  16.5      1.3  19.5      0.7    20
Z                  0.6    22      0.0  23.5      1.3    20      1.0  21.5      0.7    20
No. of blanks        0              0              0              6             32
Check sum         99.9           99.8          100.0          100.0          100.0

(a) Blanks are not counted in computing percentages.

TABLE 5 Amount of information (H) in subject-word letters (a)

Each cell gives the letter, its probability $p_n$, and the term $-p_n \log_2 p_n$. Letters tied at the same $p_n$ are listed on consecutive ranks. The last column is average English text (after Pratt (9)).

Rank  First letter    Second letter   Third letter    Fourth letter   Fifth letter    Av. English text
  1   S .099 .3303    A .178 .4432    R .120 .3671    E .118 .3638    E .133 .3871    E .131 .3841
  2   A .093 .3187    E .123 .3719    A .084 .3002    I .108 .3468    R .120 .3671    T .105 .3414
  3   C .081 .2937    O .114 .3571    O .081 .2937    A .077 .2848    O .117 .3622    A .082 .2959
  4   P .078 .2871    I .111 .3520    T .081 .2937    C .062 .2487    I .093 .3187    O .080 .2915
  5   R .066 .2588    U .078 .2871    L .078 .2871    O .062 .2487    T .087 .3065    N .071 .2709
  6   M .063 .2513    R .075 .2803    D .063 .2513    T .062 .2487    N .073 .2756    R .068 .2637
  7   T .060 .2435    L .069 .2661    E .063 .2513    N .059 .2409    L .067 .2613    I .063 .2513
  8   B .048 .2103    Y .060 .2435    N .057 .2356    M .053 .2246    S .057 .2356    S .061 .2461
  9   F .045 .2013    H .039 .1825    C .054 .2274    D .050 .2161    A .053 .2246    H .053 .2246
 10   G .045 .2013    N .039 .1825    I .052 .2218    S .043 .1952    U .053 .2246    D .038 .1793
 11   E .042 .1921    T .024 .1291    S .045 .2013    L .039 .1825    C .023 .1252    L .034 .1659
 12   D .036 .1727    C .018 .1043    M .039 .1825    R .039 .1825    M .023 .1252    F .029 .1481
 13   H .036 .1727    P .015 .0909    P .039 .1825    H .037 .1760    D .020 .1129    C .028 .1444
 14   I .033 .1624    M .009 .0612    U .033 .1624    P .037 .1760    Y .017 .0999    M .025 .1330
 15   N .033 .1624    X .009 .0612    B .027 .1407    K .028 .1444    H .013 .0815    U .025 .1330
 16   L .027 .1407    B .006 .0443    F .024 .1291    U .028 .1444    P .010 .0664    G .020 .1129
 17   O .021 .1170    S .006 .0443    G .018 .1043    B .022 .1211    X .010 .0664    Y .020 .1129
 18   W .021 .1170    Z .006 .0443    Y .012 .0766    Y .019 .1086    F .007 .0501    P .020 .1129
 19   U .018 .1043    D .003 .0251    H .009 .0612    F .012 .0766    G .007 .0501    W .015 .0909
 20   V .015 .0909    F .003 .0251    V .006 .0443    J .012 .0766    W .007 .0501    B .014 .0862
 21   J .009 .0612    G .003 .0251    K .003 .0251    V .012 .0766    K .003 .0251    V .009 .0612
 22   K .006 .0443    K .003 .0251    Q .003 .0251    G .009 .0612    V .003 .0251    K .004 .0319
 23   Q .006 .0443    Q .003 .0251    W .003 .0251    Q .003 .0251    B 0    -        X .002 .0179
 24   X .006 .0443    V .003 .0251    X .003 .0251    W .003 .0251    J 0    -        J .001 .0100
 25   Z .006 .0443    J 0    -        J 0    -        Z .003 .0251    Q 0    -        Q .001 .0100
 26   Y .003 .0251    W 0    -        Z 0    -        X 0    -        Z 0    -        Z .001 .0100

H (sum of terms; $\log_2 26 = 4.7$)
      4.2920          3.7964          4.1255          4.5201          3.8413          4.1300
$R = 1 - H/\log_2 26$
      9%              20%             12%             4%              18%             12%

(a) Average of five letters, 20.5753/5 = 4.1151.
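The H and R entries at the foot of Table 5 follow from two short formulas, sketched here for the first-letter column. The function names are illustrative assumptions. As in the paper, the percentage frequencies are used directly as probabilities, so they sum to 0.996 rather than exactly 1.

```python
import math

# First-letter probabilities p_n as listed in Table 5 (ties repeated).
FIRST_LETTER_P = {
    "S": .099, "A": .093, "C": .081, "P": .078, "R": .066, "M": .063,
    "T": .060, "B": .048, "F": .045, "G": .045, "E": .042, "D": .036,
    "H": .036, "I": .033, "N": .033, "L": .027, "O": .021, "W": .021,
    "U": .018, "V": .015, "J": .009, "K": .006, "Q": .006, "X": .006,
    "Z": .006, "Y": .003,
}


def entropy_bits(probs):
    """H = -sum(p * log2 p); zero-probability letters contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def redundancy(h, alphabet_size=26):
    """R = 1 - H / log2(alphabet size)."""
    return 1 - h / math.log2(alphabet_size)


h_first = entropy_bits(FIRST_LETTER_P.values())
r_first = redundancy(h_first)
```

Run on the first-letter column, this gives an H of about 4.29 bits and a redundancy of about 9%, in agreement with the table.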

TABLE 6 Subject-word cumulative letter frequencies (in rank order) (a)

(a) On an equiprobable basis, each letter would occur 3.846% of the time.
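Cumulating frequencies in rank order is a simple running sum. The sketch below does it for the first-letter column of Table 3 only; the choice of that column and the function name are assumptions made for illustration.

```python
from itertools import accumulate

# First-letter percentage frequencies, transcribed from Table 3.
FIRST_LETTER_PCT = {
    "A": 9.3, "B": 4.8, "C": 8.1, "D": 3.6, "E": 4.2, "F": 4.5, "G": 4.5,
    "H": 3.6, "I": 3.3, "J": 0.9, "K": 0.6, "L": 2.7, "M": 6.3, "N": 3.3,
    "O": 2.1, "P": 7.8, "Q": 0.6, "R": 6.6, "S": 9.9, "T": 6.0, "U": 1.8,
    "V": 1.5, "W": 2.1, "X": 0.6, "Y": 0.3, "Z": 0.6,
}


def cumulative_by_rank(freqs):
    """Return (letter, percent, cumulative percent) rows, most frequent first."""
    ranked = sorted(freqs.items(), key=lambda kv: kv[1], reverse=True)
    running = accumulate(pct for _, pct in ranked)
    return [(ch, pct, round(total, 1)) for (ch, pct), total in zip(ranked, running)]


rows = cumulative_by_rank(FIRST_LETTER_PCT)

# Letters that occur more often than the equiprobable share of 100/26 %.
above_equiprobable = sum(1 for pct in FIRST_LETTER_PCT.values() if pct > 100 / 26)
```

For this column the four most frequent letters (S, A, C, P) already account for 35.1% of the sample, and 11 of the 26 letters exceed the equiprobable 3.846%.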

TABLE 7 Weighted letter frequencies, % (a)

Letter       First letter   Third letter   Fourth letter
A                 8.5            8.2            7.6
B                 5.3            2.7            2.5
C                 8.0            5.1            5.9
D                 3.8            5.8            5.1
E                 3.9            6.4           11.8
F                 4.4            2.2            1.1
G                 4.6            2.1            1.1
H                 3.9            1.1            3.6
I                 3.0            5.2           10.5
J                 1.0            0.0            1.1
K                 1.1            0.4            2.9
L                 2.9            8.2            4.6
M                 6.7            3.5            4.8
N                 3.1            6.5            6.1
O                 2.2            7.9            6.0
P                 7.4            3.6            3.5
Q                 0.6            0.3            0.3
R                 6.4           12.1            4.2
S                10.0            5.0            4.3
T                 5.6            7.7            6.3
U                 1.7            3.3            2.7
V                 1.5            0.7            1.3
W                 2.5            0.5            0.4
X                 0.5            0.3            0.0
Y                 0.3            1.3            1.8
Z                 0.6            0.2            0.4
Check sum        99.5          100.3           99.9

(a) All values are weighted seven parts subject word to one part proper name.

For the initial letters of subject terms, the rank order was S, A, C, P, R, M, T, ...; for second letters, A, E, O, I, U, R, L, ...; for third, R, A, O or T, L, D or E, ...; for fourth, E, I, A, T or O or C, N, ...; and for fifth, E, R, O, I, T, N, L, ..., as shown in Table 5. Cumulated frequencies are given in Table 6.

Table 5 also gives the information measure $-p_n \log_2 p_n$ for each letter in each position (8). For this purpose, percentage frequencies were assumed to represent actual probabilities, $p_n$. The sum for each letter position represents H, the average uncertainty per letter position or, as it is sometimes called, the average information represented by the letter position, in bits. The redundancy R is also shown at the bottom for each letter position. These calculations show that the least redundant (or the most informative) letter position is the fourth, next to that the first, and then the third. Similar results can be shown for proper names.

For the marginal-punched card application, first and third letter positions were selected for coding. Subject-word frequencies were weighted with proper names in a 7-to-1 proportion,² as shown in Table 7. The 52 letters of first and third positions were then assigned to the 38 available positions as equally as possible, but under the restriction that alphabetical order along the side of the card be preserved. The result is shown in Fig. 1, and Table 8 shows the predicted frequency distribution for this code. Note that it was necessary sometimes to combine several letters in one position, and sometimes to split one letter between two positions. These splits were chosen according to the median frequencies of English trigrams (9). Splitting letters modifies the first-letter position somewhat by the second, and third somewhat by the fourth. Such letter-pair frequencies take account of intersymbol influence, and therefore make possible a better code than single-letter frequencies. H. P. Luhn has designed a superimposed code using randomizing squares (10) which takes advantage of letter-pair frequencies.

² According to Wise (3), the ratio X/H, or that of the number of positions to be punched to the number of positions available for punching, should be about 0.46. Taking H to be 19, X = 8.75. The dropping fraction $f_d = (G/H)^Y$, or the ratio of the number of positions actually punched to the number available for punching, raised to a power Y representing the number of sorting elements used, works out to be $(7/19)^2 = 13.7\%$, if about 9 codes are actually superimposed. A maximum of 8 coding words was chosen, based on these calculations.

TABLE 8 Comparison of actual and predicted letter frequencies

First letter (a)

Letter                                       Actual no. of    Actual %   Predicted %
                                             cards dropped
Aa-An (median between anx and any)                 29            2.6        4.25
Ao-Az                                              19            1.7        4.25
B                                                  54            4.9        5.3
Ca-Ci (median between cka and cke)                 65            5.9        4.0
Ck-Cz                                             103            9.35       4.0
D                                                  46            4.2        3.8
E                                                  37            3.35       3.9
F                                                  36 (c)        3.3        4.4
G                                                                           4.6
H                                                                           3.9
I & J                                                                       4.0
K & L                                                                       4.0
M                                                                           6.7
N & O                                                                       5.3
P                                                                           7.4
Q & R                                                                       7.0
Sa-Si (median between siv and six)                                          5.0
Sj-Sz                                                                       5.0
T                                                                           5.6
U-Z                                                                         7.1

Third letter (b)

Letter                                       Actual no. of    Actual %   Predicted %
                                             cards dropped
aa-aq (median between ard and are)                 40            3.8        5.45
ar-az & b                                          80            7.6        5.45
c                                                  60            5.65       5.1
d                                                  55            5.2        5.8
e                                                 100            9.45       6.4
f, g & h                                           80            7.6        5.4
i, j & k                                           45            4.25       5.6
la-lo (median between lov and low)                 45            4.25       5.85
lp-lz & m                                          80            7.6        5.85
n                                                  80            7.6        6.5
oa-os (median between otf and oth)                 45            4.25       5.9
ot-oz, p, q                                        35            3.3        5.9
ra-rg (median between rge and rgo)                 40            3.8        6.05
rh-rz                                              65            6.1        6.05
s                                                  65            6.1        5.0
ta-th (median between tid and tie)                 40            3.8        3.85
ti-tz                                              35            3.3        3.85
u-z                                                70            6.6        6.3
Total                                            1060
Avg.                                               59

(a) Ideally each first letter position would comprise 5%.
(b) Ideally each third letter position would comprise 5.5%.
(c) Not carried to completion.

Note: About 400 cards were used in study. Actual number was estimated by measuring cards dropped at 150 cards/inch. Predicted percentage based on Table 7. Dropping fraction, $F_d = (G/H)^Y = (2.7/18)^1 = 15\%$. For 400 cards, $F_d = 60$. H is the number of coding positions; G is the number of punches/card = 1100/400; Y is the number of sorting positions = 1. (See Wise (3) for derivation.)

An empirical check of the letter code shown in Fig. 1 was made on a 400-card file maintained by the author. The results are shown in Table 8.³ The average dropping fraction for the third position alone compares well with the dropping fraction as calculated by formula, but the range (from 9 to 25%) is broader than hoped for. However, Table 8 shows that the agreement between actual and predicted frequencies in the third-letter position was very good, considering the alphabetic-order limitation imposed in assigning the positions.

By using data-processing equipment, much more elaborate studies on much larger samples would be possible. The author is working with such equipment and hopes to have some results available in the near future.

Equifrequency-letter codes have many other applications, including the preassignment of space in files and indexes, in cryptography, and in philology. For example, the data in Table 5 can provide a quantitative measure of subject word popularity. Taking a few words from the Library of Congress list of subject headings, we add the percentage frequencies of each letter (up to 5) together and divide by the number of letters. (Multiply each $p_n$ by 100 to get the percentage frequency.)

AIRCRAFT has a value of (9.3+11.1+12.0+6.2+12.0)/5 = 10.12
DIVIDER has a value of (3.6+11.1+0.6+10.8+2.0)/5 = 5.62
ICHTHYOLOGY has a value of (3.3+1.8+0.9+6.2+1.3)/5 = 2.70

These three words give some idea of the range possible in a subject-heading list.
In general dictionary words, the highest found was SARI, with a value of 12.63, and the lowest, ONYX, with a value of 1.8. It is interesting to compare these values with the highest possible letter combination (not necessarily an English word), which is SAREE (value 12.96), and the lowest (value 0.06). The highest is very nearly realized in actuality, while the lowest never comes close. Also note that the word SARI is certainly uncommon English; this phenomenon may occur because the intersymbol connections are broken by taking single-letter frequencies.

³ Since the first-letter positions showed quite wide deviations from the predicted frequencies, their analysis was never completed. It is now thought that third and fourth letter positions would have made a more invariant code, less subject to the fluctuations which occur in any particular file because of the selection of particular terms.
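The word-popularity measure can be recomputed directly from the subject-word frequencies of Table 3. The sketch below is illustrative only: the names `TABLE_3`, `popularity`, and `highest_combination` are assumptions, and the data are simply the five frequency columns of Table 3 transcribed as tuples.

```python
# Percentage frequencies for letter positions 1 to 5, transcribed from Table 3.
TABLE_3 = {
    "A": (9.3, 17.8, 8.4, 7.7, 5.3),   "B": (4.8, 0.6, 2.7, 2.2, 0.0),
    "C": (8.1, 1.8, 5.4, 6.2, 2.3),    "D": (3.6, 0.3, 6.3, 5.0, 2.0),
    "E": (4.2, 12.3, 6.3, 11.8, 13.3), "F": (4.5, 0.3, 2.4, 1.2, 0.7),
    "G": (4.5, 0.3, 1.8, 0.9, 0.7),    "H": (3.6, 3.9, 0.9, 3.7, 1.3),
    "I": (3.3, 11.1, 5.2, 10.8, 9.3),  "J": (0.9, 0.0, 0.0, 1.2, 0.0),
    "K": (0.6, 0.3, 0.3, 2.8, 0.3),    "L": (2.7, 6.9, 7.8, 3.9, 6.7),
    "M": (6.3, 0.9, 3.9, 5.3, 2.3),    "N": (3.3, 3.9, 5.7, 5.9, 7.3),
    "O": (2.1, 11.4, 8.1, 6.2, 11.7),  "P": (7.8, 1.5, 3.9, 3.7, 1.0),
    "Q": (0.6, 0.3, 0.3, 0.3, 0.0),    "R": (6.6, 7.5, 12.0, 3.9, 12.0),
    "S": (9.9, 0.6, 4.5, 4.3, 5.7),    "T": (6.0, 2.4, 8.1, 6.2, 8.7),
    "U": (1.8, 7.8, 3.3, 2.8, 5.3),    "V": (1.5, 0.3, 0.6, 1.2, 0.3),
    "W": (2.1, 0.0, 0.3, 0.3, 0.7),    "X": (0.6, 0.9, 0.3, 0.0, 1.0),
    "Y": (0.3, 6.0, 1.2, 1.9, 1.7),    "Z": (0.6, 0.6, 0.0, 0.3, 0.0),
}


def popularity(word, table=TABLE_3, max_letters=5):
    """Mean positional frequency (%) over the first `max_letters` letters."""
    letters = word.upper()[:max_letters]
    if not letters:
        raise ValueError("empty word")
    # table[ch][pos] is the frequency of letter ch in position pos (0-based).
    total = sum(table[ch][pos] for pos, ch in enumerate(letters))
    return total / len(letters)


def highest_combination(table=TABLE_3, length=5):
    """Most frequent letter in each position, joined into one string."""
    return "".join(max(table, key=lambda ch: table[ch][pos]) for pos in range(length))
```

With these data, `popularity("AIRCRAFT")` gives 10.12, `popularity("ICHTHYOLOGY")` gives 2.70, and `highest_combination()` returns SAREE, whose value is 12.96.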

ACKNOWLEDGMENT

The work described in this paper was performed while the author was in the employ of Carrier Corporation, Syracuse, New York.

REFERENCES

1. C. E. SHANNON, A Mathematical Theory of Communication, Bell System Technical Journal, July 1948, and following.
2. C. N. MOOERS, Zatocoding and Development in Information Retrieval, ASLIB Proc., February 1956, p. 3. (Many other papers by this author may be obtained from his Zator Co., 79 Milk Street, Boston, Massachusetts.)
3. C. S. WISE, A Punched-Card File Based on Word Coding, pp. 93–114, in Perry and Casey’s Punched Cards, Reinhold Publishing Corporation, New York, 1951.
4. MOOERS and WISE had discussions in American Documentation, April 1950, October 1950, and October 1952.
5. H. OHLMAN, The Low-Cost Production of Marginal-Punched Cards on Accounting Machines, pp. 123–26, American Documentation, April 1957.
6. JOINT COMMITTEE OF ASM AND SLA, ASM-SLA Metallurgical Literature Classification, American Society for Metals, 1950. (Figure 5, which was based on an analysis of 4870 names by A. H. Geisler in ASM Review of Metallurgical Literature.)
7. K. A. KRIEGER, A Punched-Card System for Chemical Literature, J. of Chemical Education, March 1949, p. 163.
8. E. T. KLEMMER, Tables for Computing Informational Measures, p. 75 in Quastler’s Information Theory in Psychology, Free Press, Glencoe, Ill., 1955.
9. F. PRATT, Secret and Urgent, The Story of Codes and Ciphers, Blue Ribbon Books, Garden City, N.Y., 1942, pp. 264–78.
10. H. P. LUHN, Superimposed Coding With the Aid of Randomizing Squares for Use in Mechanical Information Searching Systems, IBM Product Development Lab., Poughkeepsie, New York, 1956.
