National Academies Press: OpenBook

Proceedings of the International Conference on Scientific Information: Two Volumes (1959)

Chapter: Subject-Word Letter Frequencies with Applications to Superimposed Coding

Suggested Citation:"Subject-Word Letter Frequencies with Applications to Superimposed Coding." National Research Council. 1959. Proceedings of the International Conference on Scientific Information: Two Volumes. Washington, DC: The National Academies Press. doi: 10.17226/10866.

Subject-Word Letter Frequencies with Applications to Superimposed Coding

HERBERT OHLMAN

ABSTRACT. The frequencies of occurrence of English letters in the first five positions of subject words and proper names are determined. With these frequencies a superimposed code is designed. No code book is required. Coding space is utilized almost as economically as with a random code. An empirical check is made. A quantitative measure of word popularity is proposed using letter-frequency data.

Coding, or the transforming of information from one guise to another, is one of man’s commonest activities. Every picture may be said to be a coding of some real scene and every written word a coding of some utterance—the brain itself is said to work with coded impulses.

Since the beginning of mass communications, starting with the invention of printing, and increasing with the widespread use of electronics, efficient use of existing space and time has become more and more important. Today, information theory provides a sound basis for determining the limits of transmission speed and accuracy. However, Shannon’s theory (1) does not tell us how to make a particular code more efficient. The design of codes is still an art; this paper deals with the improving of one particular type, superimposed coding.

In information searching, mechanical aids are being used wherever possible. For a machine to process information, the information must be coded, usually into some variant of that most basic code of all, the binary. However, the most efficient code for pure selection appears to be a superimposed random code. Each coding position is used in a random manner, and a group of coding positions contains superimposed entries.

Calvin Mooers (2, 4) calls such coding “Zatocoding” and has applied it in his patented marginal-punched card system called Zator. However, Zatocoding requires an intermediate step in both coding and searching: a code book containing a number of indexing terms with random-number equivalents.

HERBERT OHLMAN System Development Corp., Santa Monica, Calif.

FIGURE 1.

Carl Wise (3, 4) has produced a nonrandom superimposed code which he calls “word coding” for use with marginal-punched cards of the Keysort variety. This type of code does not require an intermediate code book.

The author has attempted to combine the best features of both systems in coding English words by essentially pre-randomizing the alphabet. This is possible because there is a certain invariance of the letter frequencies within each letter position of a word.

As this system was developed in response to a specific need, it may be well to talk about it in concrete terms, and later apply its principles to other information systems. A marginal-punched card produced on IBM equipment (5) was used as the unit record. The thirty-eight positions along the top edge could code 38 words or phrases by using a direct code, but by superimposition every position could be made to do multiple duty. However, neither of the two systems previously described seemed to meet the requirement of a directly interpretable, yet efficient code.

Subject-word and proper-name lists were studied to find what letter frequencies occurred in the first five letter positions. Some work along these lines had been done, notably by Geisler for the ASM-SLA (6) with proper names, and by Krieger (7) with subject words (however, Krieger only considered initial letters in designing his code).

Striking similarities for initial-letter frequencies among various subject-word lists were found, as shown in Table 1. The average of five such lists shows that 40% of the words begin with C, S, P, or A (in that order). Furthermore, 85% begin either with these four or with B, M, T, R, E, F, D, G, H, or I, that is, with only 54% of the alphabet.
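As a rough check on these coverage figures, the sketch below ranks the average initial-letter frequencies printed in Table 1 and accumulates them. The function name `cumulative_coverage` is an illustrative choice and not part of the paper. With the rounded averages as printed, the four leading letters come to about 38% and the fourteen leading letters to about 84%, close to the rounded 40% and 85% quoted in the text.

```python
# Average initial-letter frequencies (%) from the "Average Frequency" column of Table 1.
AVG_INITIAL = {
    "A": 8.3, "B": 5.8, "C": 10.4, "D": 4.2, "E": 4.4, "F": 4.3, "G": 3.9,
    "H": 3.8, "I": 3.8, "J": 0.6, "K": 0.6, "L": 3.1, "M": 5.7, "N": 2.5,
    "O": 2.2, "P": 9.5, "Q": 0.5, "R": 4.8, "S": 10.1, "T": 5.3, "U": 1.2,
    "V": 1.4, "W": 2.3, "X": 0.2, "Y": 0.3, "Z": 0.3,
}


def cumulative_coverage(freqs, k):
    """Return the k most frequent letters and the total percentage they cover."""
    ranked = sorted(freqs, key=freqs.get, reverse=True)  # ties keep alphabetical order
    top = ranked[:k]
    return top, sum(freqs[letter] for letter in top)


top4, share4 = cumulative_coverage(AVG_INITIAL, 4)
top14, share14 = cumulative_coverage(AVG_INITIAL, 14)
print(top4, round(share4, 1))
print("".join(top14), round(share14, 1), f"{14 / 26:.0%} of the alphabet")
```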

Even greater consistency was found with proper-name lists, as shown in Table 2, but with a different ranking of the letters. The average of three such lists gave S, B, M, H, and C for the beginning letters of 40% of the names, and these five and D, G, K, L, R, P, W, A, and F (again 54% of the alphabet) accounted for 83%.

The Library of Congress list was chosen as typical of the subject-word lists, and the 1955 Syracuse Telephone Directory as typical of names. A systematic sample was obtained from each list by recording the top-left, middle, and top-right terms from every two-page spread.[1] The frequencies of letters in each of the first five positions were then obtained for each list, as shown in Tables 3 and 4.

[1] The probabilities in this case are not independent, but every term is equidistant in the alphabetical sequence from the next term chosen, which is a sufficient approximation to true randomness for the purposes of this study.


TABLE 1 Subject-word initial letter frequencies [a]

Sources:
Chambers: Chambers Technical Dict’y, 1942, 912 pp.
Merriam-Webster: Merriam-Webster Unabridged Dict’y, 2987 pp.
Industrial Arts: Industrial Arts Index (Vol. 41, No. 5), April, 1953, 787 pp.
Chem. Abstracts: Chem. Abstracts Decennial Subject Index 1907–16 & 1927–36 (after Krieger, ref. 7)
Lib. of Congress: Lib. of Congress T.I.D. List of Subject Headings, June 1952, 327 pp.

| Letter | Chambers Freq., % | Deviation | Merriam-Webster Freq., % | Deviation | Industrial Arts Freq., % | Deviation | Chem. Abstracts Freq., % | Deviation | Lib. of Congress Freq., % | Deviation | Average Frequency, % | Rank |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A | 7.3 | −1.0 | 6.6 | −1.7 | 9.3 | +1.0 | 8.6 | +0.3 | 9.7 | +1.4 | 8.3 | 4 |
| B | 6.2 | +0.4 | 5.7 | −0.1 | 4.5 | −1.3 | 7.4 | +1.6 | 4.9 | −0.9 | 5.8 | 5 |
| C | 10.8 | +0.4 | 9.8 | −0.6 | 10.0 | −0.4 | 12.6 | +2.2 | 8.8 | −1.6 | 10.4 | 1 |
| D | 5.7 | +1.5 | 4.9 | +0.7 | 3.3 | −0.9 | 3.5 | −0.7 | 3.6 | −0.6 | 4.2 | 11 |
| E | 4.7 | +0.3 | 3.3 | −1.1 | 6.1 | +1.7 | 3.9 | −0.5 | 4.2 | −0.2 | 4.4 | 9 |
| F | 4.6 | +0.3 | 3.9 | −0.4 | 3.7 | −0.6 | 4.9 | +0.6 | 4.5 | +0.2 | 4.3 | 10 |
| G | 3.7 | −0.2 | 3.2 | −0.7 | 4.6 | +0.7 | 3.7 | −0.2 | 4.5 | +0.6 | 3.9 | 12 |
| H | 4.4 | +0.6 | 3.7 | −0.1 | 3.3 | −0.5 | 4.3 | +0.5 | 3.6 | −0.2 | 3.8 | 13.5 |
| I | 3.1 | −0.7 | 3.0 | −0.8 | 6.0 | +2.2 | 3.7 | −0.1 | 3.2 | −0.6 | 3.8 | 13.5 |
| J | 0.6 | 0.0 | 0.9 | +0.3 | 0.2 | −0.4 | ~0.4 | −0.2 | 1.7 | +0.1 | 0.6 | 21.5 |
| K | 1.0 | +0.4 | 0.9 | +0.3 | 0.4 | −0.2 | ~0.4 | −0.2 | 0.3 | −0.3 | 0.6 | 21.5 |
| L | 3.8 | +0.7 | 3.2 | +0.2 | 2.5 | −0.6 | 3.3 | +0.2 | 2.6 | −0.5 | 3.1 | 15 |
| M | 5.6 | −0.1 | 5.0 | −0.7 | 6.6 | +0.9 | 5.3 | −0.4 | 6.2 | +0.5 | 5.7 | 6 |
| N | 2.0 | −0.5 | 1.8 | −0.7 | 2.4 | −0.1 | 3.1 | +0.6 | 3.2 | +0.7 | 2.5 | 16 |
| O | 2.2 | 0.0 | 2.6 | +0.4 | 1.7 | −0.5 | 2.7 | +0.5 | 1.9 | −0.3 | 2.2 | 18 |
| P | 9.1 | −0.4 | 9.3 | −0.2 | 10.3 | +0.8 | 10.8 | +1.3 | 8.1 | −1.4 | 9.5 | 3 |
| Q | 0.6 | +0.1 | 0.6 | −0.1 | 0.2 | −0.3 | ~0.7 | +0.2 | 0.3 | −0.2 | 0.5 | 23 |
| R | 4.4 | −0.4 | 4.8 | 0.0 | 4.7 | −0.1 | 3.5 | −1.3 | 6.8 | +2.0 | 4.8 | 8 |
| S | 10.4 | +0.3 | 12.4 | +2.3 | 9.9 | −0.2 | 8.4 | −1.7 | 10.4 | +0.3 | 10.1 | 2 |
| T | 4.8 | −0.5 | 6.3 | +1.0 | 5.1 | −0.2 | 4.3 | −1.0 | 6.2 | +0.9 | 5.3 | 7 |
| U | 0.8 | −0.4 | 1.9 | +0.7 | 0.9 | −0.3 | ~0.7 | −0.5 | 1.9 | +0.7 | 1.2 | 20 |
| V | 1.6 | +0.2 | 1.7 | +0.3 | 1.1 | −0.3 | 1.4 | 0.0 | 1.3 | −0.1 | 1.4 | 19 |
| W | 1.8 | −0.5 | 3.3 | +1.0 | 2.4 | +0.1 | 1.9 | −0.4 | 1.9 | −0.4 | 2.3 | 17 |
| X | 0.2 | 0.0 | 0.1 | −0.1 | 0.2 | 0.0 | ~0.03 | −0.2 | 0.3 | +0.1 | 0.2 | 26 |
| Y | 0.2 | −0.1 | 0.4 | +0.1 | 0.1 | −0.2 | ~0.03 | −0.3 | 0.3 | 0.0 | 0.3 | 24.5 |
| Z | 0.4 | +0.1 | 0.3 | 0.0 | 0.2 | −0.1 | ~0.03 | −0.3 | 0.3 | 0.0 | 0.3 | 24.5 |
| Check sum | 100.0 | +0.5 | 99.6 | 0.0 | 99.7 | +0.2 | 99.6 | 0.0 | 99.7 | +0.2 | 99.5 | |

[a] Frequencies which deviate more than 1% from the average are shown in italics.


TABLE 2 Proper-name initial letter frequencies [a]

Sources:
Chem. Abstracts: Chemical Abstracts Fourth Decennial Author Index, 3531 pp.
ASM-SLA: ASM-SLA Metal Literature Study, 4870 pp.
Syracuse: Syracuse, N.Y., Telephone Directory, 1955, 307 pp.

| Letter | Chem. Abstracts Freq., % | Deviation | ASM-SLA Freq., % | Deviation | Syracuse Freq., % | Deviation | Av. Freq., % | Rank |
|---|---|---|---|---|---|---|---|---|
| A | 3.8 | 0.0 | 2.45 | −1.35 | 5.2 | +1.4 | 3.8 | 13 |
| B | 9.2 | −0.2 | 10.2 | +0.8 | 8.8 | −0.6 | 9.4 | 2 |
| C | 5.7 | −0.9 | 6.2 | −0.4 | 7.8 | +1.2 | 6.6 | 5 |
| D | 5.0 | −0.3 | 5.3 | 0.0 | 5.5 | +0.2 | 5.3 | 6.5 |
| E | 2.3 | +0.1 | 2.25 | +0.05 | 2.0 | −0.2 | 2.2 | 16 |
| F | 3.6 | 0.0 | 3.4 | −0.2 | 3.9 | +0.3 | 3.6 | 14 |
| G | 5.3 | 0.0 | 5.6 | +0.3 | 4.9 | −0.4 | 5.3 | 6.5 |
| H | 6.7 | −0.1 | 7.35 | +0.55 | 6.2 | −0.6 | 6.8 | 4 |
| I | 1.6 | +0.7 | 0.75 | −0.15 | 0.3 | −0.6 | 0.9 | 21.5 |
| J | 1.8 | +0.1 | 1.75 | +0.05 | 1.6 | −0.1 | 1.7 | 19 |
| K | 6.0 | +1.2 | 4.9 | −0.3 | 4.6 | −0.6 | 5.2 | 8 |
| L | 4.6 | −0.4 | 5.65 | +0.65 | 4.6 | −0.4 | 5.0 | 9 |
| M | 7.7 | −0.6 | 8.25 | −0.05 | 8.8 | +0.5 | 8.3 | 3 |
| N | 2.3 | +0.3 | 1.8 | −0.2 | 2.0 | 0.0 | 2.0 | 17 |
| O | 1.4 | −0.2 | 1.4 | −0.2 | 2.0 | +0.4 | 1.6 | 20 |
| P | 4.6 | −0.1 | 4.5 | −0.2 | 4.9 | +0.2 | 4.7 | 11.5 |
| Q | 0.1 | −0.1 | 0.1 | −0.1 | 0.3 | +0.1 | 0.2 | 25 |
| R | 5.0 | +0.1 | 4.65 | −0.25 | 4.9 | 0.0 | 4.9 | 10 |
| S | 11.3 | +0.1 | 11.0 | −0.2 | 11.4 | +0.2 | 11.2 | 1 |
| T | 3.4 | +0.2 | 3.65 | +0.45 | 2.6 | −0.6 | 3.2 | 15 |
| U | 0.7 | +0.2 | 0.45 | −0.05 | 0.3 | −0.2 | 0.5 | 23 |
| V | 1.8 | −0.1 | 2.15 | +0.25 | 1.6 | −0.3 | 1.9 | 18 |
| W | 4.6 | −0.1 | 4.65 | −0.15 | 4.9 | +0.2 | 4.7 | 11.5 |
| X | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 26 |
| Y | 0.5 | +0.1 | 0.5 | +0.1 | 0.3 | −0.1 | 0.4 | 24 |
| Z | 1.0 | +0.1 | 1.1 | +0.2 | 0.7 | −0.2 | 0.9 | 21.5 |
| Check sum | 100.0 | +0.1 | 99.55 | −0.4 | 100.1 | −0.2 | 100.3 | |

[a] Frequencies which deviate more than 1% from the average are shown in italics.


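TABLE 3 Subject-word letter frequencies (332 words) [a]

| Letter | First letter Freq., % | Rank | Second letter Freq., % | Rank | Third letter Freq., % | Rank | Fourth letter Freq., % | Rank | Fifth letter Freq., % | Rank |
|---|---|---|---|---|---|---|---|---|---|---|
| A | 9.3 | 2 | 17.8 | 1 | 8.4 | 2 | 7.7 | 3 | 5.3 | 9.5 |
| B | 4.8 | 8 | 0.6 | 17 | 2.7 | 15 | 2.2 | 17 | 0.0 | 24.5 |
| C | 8.1 | 3 | 1.8 | 12 | 5.4 | 9 | 6.2 | 5 | 2.3 | 11.5 |
| D | 3.6 | 12.5 | 0.3 | 21.5 | 6.3 | 6.5 | 5.0 | 9 | 2.0 | 13 |
| E | 4.2 | 11 | 12.3 | 2 | 6.3 | 6.5 | 11.8 | 1 | 13.3 | 1 |
| F | 4.5 | 9.5 | 0.3 | 21.5 | 2.4 | 16 | 1.2 | 20 | 0.7 | 19 |
| G | 4.5 | 9.5 | 0.3 | 21.5 | 1.8 | 17 | 0.9 | 22 | 0.7 | 19 |
| H | 3.6 | 12.5 | 3.9 | 9.5 | 0.9 | 19 | 3.7 | 13.5 | 1.3 | 15 |
| I | 3.3 | 14.5 | 11.1 | 4 | 5.2 | 10 | 10.8 | 2 | 9.3 | 4 |
| J | 0.9 | 21 | 0.0 | 25.5 | 0.0 | 25.5 | 1.2 | 20 | 0.0 | 24.5 |
| K | 0.6 | 23.5 | 0.3 | 21.5 | 0.3 | 22.5 | 2.8 | 15.5 | 0.3 | 21.5 |
| L | 2.7 | 16 | 6.9 | 7 | 7.8 | 5 | 3.9 | 11.5 | 6.7 | 7 |
| M | 6.3 | 6 | 0.9 | 14.5 | 3.9 | 12.5 | 5.3 | 8 | 2.3 | 11.5 |
| N | 3.3 | 14.5 | 3.9 | 9.5 | 5.7 | 8 | 5.9 | 7 | 7.3 | 6 |
| O | 2.1 | 17.5 | 11.4 | 3 | 8.1 | 3.5 | 6.2 | 5 | 11.7 | 3 |
| P | 7.8 | 4 | 1.5 | 13 | 3.9 | 12.5 | 3.7 | 13.5 | 1.0 | 16.5 |
| Q | 0.6 | 23.5 | 0.3 | 21.5 | 0.3 | 22.5 | 0.3 | 24 | 0.0 | 24.5 |
| R | 6.6 | 5 | 7.5 | 6 | 12.0 | 1 | 3.9 | 11.5 | 12.0 | 2 |
| S | 9.9 | 1 | 0.6 | 17 | 4.5 | 11 | 4.3 | 10 | 5.7 | 8 |
| T | 6.0 | 7 | 2.4 | 11 | 8.1 | 3.5 | 6.2 | 5 | 8.7 | 5 |
| U | 1.8 | 19 | 7.8 | 5 | 3.3 | 14 | 2.8 | 15.5 | 5.3 | 9.5 |
| V | 1.5 | 20 | 0.3 | 21.5 | 0.6 | 20 | 1.2 | 20 | 0.3 | 21.5 |
| W | 2.1 | 17.5 | 0.0 | 25.5 | 0.3 | 22.5 | 0.3 | 24 | 0.7 | 19 |
| X | 0.6 | 23.5 | 0.9 | 14.5 | 0.3 | 22.5 | 0.0 | 26 | 1.0 | 16.5 |
| Y | 0.3 | 26 | 6.0 | 8 | 1.2 | 18 | 1.9 | 18 | 1.7 | 14 |
| Z | 0.6 | 23.5 | 0.6 | 17 | 0.0 | 25.5 | 0.3 | 24 | 0.0 | 24.5 |
| No. of blanks | 0 | | 0 | | 0 | | 9 | | 33 | |
| Check sum | 99.6 | | 99.7 | | 99.7 | | 99.5 | | 99.6 | |

[a] Blanks are not counted in computing percentages.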


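TABLE 4 Proper name letter frequencies (309 names) [a]

| Letter | First letter Freq., % | Rank | Second letter Freq., % | Rank | Third letter Freq., % | Rank | Fourth letter Freq., % | Rank | Fifth letter Freq., % | Rank |
|---|---|---|---|---|---|---|---|---|---|---|
| A | 3.2 | 14 | 21.7 | 1 | 6.8 | 5 | 7.0 | 5.5 | 6.5 | 8.5 |
| B | 8.8 | 3 | 0.3 | 19 | 2.3 | 14 | 4.3 | 10 | 1.5 | 16 |
| C | 7.8 | 4 | 3.9 | 7 | 2.9 | 12 | 3.6 | 12.5 | 2.9 | 11.5 |
| D | 5.5 | 6 | 0.6 | 15.5 | 2.0 | 16.5 | 6.0 | 8 | 1.5 | 16 |
| E | 2.0 | 17 | 14.3 | 2 | 7.1 | 4 | 11.6 | 1 | 19.2 | 1 |
| F | 4.2 | 13 | 0.0 | 23.5 | 1.3 | 20 | 0.7 | 23 | 0.7 | 20 |
| G | 4.9 | 9 | 0.6 | 15.5 | 3.9 | 10 | 2.6 | 14.5 | 2.9 | 11.5 |
| H | 6.2 | 5 | 3.6 | 8 | 2.3 | 14 | 2.6 | 14.5 | 6.5 | 8.5 |
| I | 0.6 | 22 | 10.0 | 4 | 4.9 | 8 | 8.0 | 3 | 6.9 | 6 |
| J | 1.6 | 19 | 0.0 | 23.5 | 0.3 | 25 | 0.0 | 25.5 | 0.4 | 23 |
| K | 4.5 | 11.5 | 0.0 | 23.5 | 1.3 | 20 | 3.6 | 12.5 | 3.3 | 10 |
| L | 4.5 | 11.5 | 3.2 | 9.5 | 11.0 | 3 | 9.3 | 2 | 7.6 | 3 |
| M | 9.4 | 2 | 1.6 | 13 | 0.6 | 23.5 | 1.3 | 19.5 | 2.2 | 14 |
| N | 2.0 | 17 | 2.3 | 12 | 12.0 | 2 | 7.6 | 4 | 6.9 | 6 |
| O | 2.0 | 17 | 16.8 | 3 | 6.5 | 6 | 4.3 | 10 | 10.1 | 2 |
| P | 4.9 | 9 | 0.6 | 15.5 | 1.6 | 18 | 1.7 | 18 | 0.7 | 20 |
| Q | 0.3 | 24.5 | 0.0 | 23.5 | 0.0 | 26 | 0.0 | 25.5 | 0.0 | 25 |
| R | 4.9 | 9 | 6.5 | 6 | 12.6 | 1 | 6.3 | 7 | 7.2 | 4 |
| S | 11.7 | 1 | 0.3 | 19 | 4.9 | 8 | 4.3 | 10 | 6.9 | 6 |
| T | 2.9 | 15 | 3.2 | 9.5 | 4.9 | 8 | 7.0 | 5.5 | 2.5 | 13 |
| U | 0.6 | 22 | 6.8 | 5 | 3.6 | 11 | 2.3 | 16.5 | 0.7 | 20 |
| V | 1.3 | 20 | 0.3 | 19 | 1.0 | 22 | 2.3 | 16.5 | 0.0 | 25 |
| W | 5.2 | 7 | 0.6 | 15.5 | 2.3 | 14 | 1.0 | 21.5 | 1.5 | 16 |
| X | 0.0 | 26 | 0.0 | 23.5 | 0.6 | 23.5 | 0.3 | 24 | 0.0 | 25 |
| Y | 0.3 | 24.5 | 2.6 | 11 | 2.0 | 16.5 | 1.3 | 19.5 | 0.7 | 20 |
| Z | 0.6 | 22 | 0.0 | 23.5 | 1.3 | 20 | 1.0 | 21.5 | 0.7 | 20 |
| No. of blanks | 0 | | 0 | | 0 | | 6 | | 32 | |
| Check sum | 99.9 | | 99.8 | | 100.0 | | 100.0 | | 100.0 | |

[a] Blanks are not counted in computing percentages.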


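TABLE 5 Amount of information (H) in subject-word letter [a]

Tied letters share a rank range and are repeated on each rank they occupy.

| Rank, n | First letter | p_n | −p_n log2 p_n | Second letter | p_n | −p_n log2 p_n | Third letter | p_n | −p_n log2 p_n | Fourth letter | p_n | −p_n log2 p_n | Fifth letter | p_n | −p_n log2 p_n | Av. English text (after Pratt (9)) | p_n | −p_n log2 p_n |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | S | .099 | .3303 | A | .178 | .4432 | R | .120 | .3671 | E | .118 | .3638 | E | .133 | .3871 | E | .131 | .3841 |
| 2 | A | .093 | .3187 | E | .123 | .3719 | A | .084 | .3002 | I | .108 | .3468 | R | .120 | .3671 | T | .105 | .3414 |
| 3 | C | .081 | .2937 | O | .114 | .3571 | O,T | .081 | .2937 | A | .077 | .2848 | O | .117 | .3622 | A | .082 | .2959 |
| 4 | P | .078 | .2871 | I | .111 | .3520 | O,T | .081 | .2937 | C,O,T | .062 | .2487 | I | .093 | .3187 | O | .080 | .2915 |
| 5 | R | .066 | .2588 | U | .078 | .2871 | L | .078 | .2871 | C,O,T | .062 | .2487 | T | .087 | .3065 | N | .071 | .2709 |
| 6 | M | .063 | .2513 | R | .075 | .2803 | D,E | .063 | .2513 | C,O,T | .062 | .2487 | N | .073 | .2756 | R | .068 | .2637 |
| 7 | T | .060 | .2435 | L | .069 | .2661 | D,E | .063 | .2513 | N | .059 | .2409 | L | .067 | .2613 | I | .063 | .2513 |
| 8 | B | .048 | .2103 | Y | .060 | .2435 | N | .057 | .2356 | M | .053 | .2246 | S | .057 | .2356 | S | .061 | .2461 |
| 9 | F,G | .045 | .2013 | H,N | .039 | .1825 | C | .054 | .2274 | D | .050 | .2161 | A,U | .053 | .2246 | H | .053 | .2246 |
| 10 | F,G | .045 | .2013 | H,N | .039 | .1825 | I | .052 | .2218 | S | .043 | .1952 | A,U | .053 | .2246 | D | .038 | .1793 |
| 11 | E | .042 | .1921 | T | .024 | .1291 | S | .045 | .2013 | L,R | .039 | .1825 | C,M | .023 | .1252 | L | .034 | .1659 |
| 12 | D,H | .036 | .1727 | C | .018 | .1043 | M,P | .039 | .1825 | L,R | .039 | .1825 | C,M | .023 | .1252 | F | .029 | .1481 |
| 13 | D,H | .036 | .1727 | P | .015 | .0909 | M,P | .039 | .1825 | H,P | .037 | .1760 | D | .020 | .1129 | C | .028 | .1444 |
| 14 | I,N | .033 | .1624 | M,X | .009 | .0612 | U | .033 | .1624 | H,P | .037 | .1760 | Y | .017 | .0999 | M,U | .025 | .1330 |
| 15 | I,N | .033 | .1624 | M,X | .009 | .0612 | B | .027 | .1407 | K,U | .028 | .1444 | H | .013 | .0815 | M,U | .025 | .1330 |
| 16 | L | .027 | .1407 | B,S,Z | .006 | .0443 | F | .024 | .1291 | K,U | .028 | .1444 | P,X | .010 | .0664 | G,Y,P | .020 | .1129 |
| 17 | O,W | .021 | .1170 | B,S,Z | .006 | .0443 | G | .018 | .1043 | B | .022 | .1211 | P,X | .010 | .0664 | G,Y,P | .020 | .1129 |
| 18 | O,W | .021 | .1170 | B,S,Z | .006 | .0443 | Y | .012 | .0766 | Y | .019 | .1086 | F,G,W | .007 | .0501 | G,Y,P | .020 | .1129 |
| 19 | U | .018 | .1043 | D,F,G,K,Q,V | .003 | .0251 | H | .009 | .0612 | F,J,V | .012 | .0766 | F,G,W | .007 | .0501 | W | .015 | .0909 |
| 20 | V | .015 | .0909 | D,F,G,K,Q,V | .003 | .0251 | V | .006 | .0443 | F,J,V | .012 | .0766 | F,G,W | .007 | .0501 | B | .014 | .0862 |
| 21 | J | .009 | .0612 | D,F,G,K,Q,V | .003 | .0251 | K,Q,W,X | .003 | .0251 | F,J,V | .012 | .0766 | K,V | .003 | .0251 | V | .009 | .0612 |
| 22 | K,Q,X,Z | .006 | .0443 | D,F,G,K,Q,V | .003 | .0251 | K,Q,W,X | .003 | .0251 | G | .009 | .0612 | K,V | .003 | .0251 | K | .004 | .0319 |
| 23 | K,Q,X,Z | .006 | .0443 | D,F,G,K,Q,V | .003 | .0251 | K,Q,W,X | .003 | .0251 | Q,W,Z | .003 | .0251 | B,J,Q,Z | 0 | | X | .002 | .0179 |
| 24 | K,Q,X,Z | .006 | .0443 | D,F,G,K,Q,V | .003 | .0251 | K,Q,W,X | .003 | .0251 | Q,W,Z | .003 | .0251 | B,J,Q,Z | 0 | | J,Q,Z | .001 | .0100 |
| 25 | K,Q,X,Z | .006 | .0443 | J,W | 0 | | J,Z | 0 | | Q,W,Z | .003 | .0251 | B,J,Q,Z | 0 | | J,Q,Z | .001 | .0100 |
| 26 | Y | .003 | .0251 | J,W | 0 | | J,Z | 0 | | X | 0 | | B,J,Q,Z | 0 | | J,Q,Z | .001 | .0100 |
| H (log2 26 = 4.7) | | | 4.2920 | | | 3.7964 | | | 4.1255 | | | 4.5201 | | | 3.8413 | | | 4.1300 |
| R = 1 − H/(log2 26) | | | 9% | | | 20% | | | 12% | | | 4% | | | 18% | | | 12% |

[a] Average of five letters, 20.5753/5 = 4.1151.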


TABLE 6 Subject-word cumulative letter frequencies (in rank order) [a]

[a] On an equiprobable basis, each letter would occur 3.846% of the time.


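TABLE 7 Weighted letter frequencies, % [a]

| Letter | First letter | Third letter | Fourth letter |
|---|---|---|---|
| A | 8.5 | 8.2 | 7.6 |
| B | 5.3 | 2.7 | 2.5 |
| C | 8.0 | 5.1 | 5.9 |
| D | 3.8 | 5.8 | 5.1 |
| E | 3.9 | 6.4 | 11.8 |
| F | 4.4 | 2.2 | 1.1 |
| G | 4.6 | 2.1 | 1.1 |
| H | 3.9 | 1.1 | 3.6 |
| I | 3.0 | 5.2 | 10.5 |
| J | 1.0 | 0.0 | 1.1 |
| K | 1.1 | 0.4 | 2.9 |
| L | 2.9 | 8.2 | 4.6 |
| M | 6.7 | 3.5 | 4.8 |
| N | 3.1 | 6.5 | 6.1 |
| O | 2.2 | 7.9 | 6.0 |
| P | 7.4 | 3.6 | 3.5 |
| Q | 0.6 | 0.3 | 0.3 |
| R | 6.4 | 12.1 | 4.2 |
| S | 10.0 | 5.0 | 4.3 |
| T | 5.6 | 7.7 | 6.3 |
| U | 1.7 | 3.3 | 2.7 |
| V | 1.5 | 0.7 | 1.3 |
| W | 2.5 | 0.5 | 0.4 |
| X | 0.5 | 0.3 | 0.0 |
| Y | 0.3 | 1.3 | 1.8 |
| Z | 0.6 | 0.2 | 0.4 |
| Check sum | 99.5% | 100.3% | 99.9% |

[a] All seven parts subject plus one part name.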

For the initial letters of subject terms, the rank order was S, A, C, P, R, M, T, ...; for second letters, A, E, O, I, U, R, L, ...; for third, R, A, O or T, L, D or E, ...; for fourth, E, I, A, T or O or C, N, ...; and for fifth, E, R, O, I, T, N, L, ..., as shown in Table 5. Cumulated frequencies are given in Table 6.

Table 5 also gives the information measure $-p_n \log_2 p_n$ for each letter in each position (8). For this purpose, percentage frequencies were assumed to represent actual probabilities, $p_n$. The sum for each letter position,

$$H = -\sum_{n=1}^{26} p_n \log_2 p_n,$$

represents H, the average uncertainty per letter-position or, as it is sometimes called, the average information represented by the letter position, in bits. The redundancy, $R = 1 - H/\log_2 26$, is also shown at the bottom of the table for each letter position.

These calculations show that the least redundant (or the most informative) letter position is the fourth, next to that the first, and then the third. Similar results can be shown for proper names.
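The calculation of H and R can be illustrated for one column. The sketch below recomputes both for the first-letter position from the Table 3 percentages; the function names are illustrative, and, as in the paper, the percentages are used directly as probabilities without renormalizing the 99.6% check sum. Under those assumptions it should land very close to the 4.2920 bits and 9% listed for the first letter in Table 5.

```python
import math

# First-letter frequencies (%) of the 332 subject words in Table 3.
FIRST_LETTER = {
    "A": 9.3, "B": 4.8, "C": 8.1, "D": 3.6, "E": 4.2, "F": 4.5, "G": 4.5,
    "H": 3.6, "I": 3.3, "J": 0.9, "K": 0.6, "L": 2.7, "M": 6.3, "N": 3.3,
    "O": 2.1, "P": 7.8, "Q": 0.6, "R": 6.6, "S": 9.9, "T": 6.0, "U": 1.8,
    "V": 1.5, "W": 2.1, "X": 0.6, "Y": 0.3, "Z": 0.6,
}


def entropy_bits(percentages):
    """H = -sum(p * log2(p)), treating percentages as probabilities; zero entries add nothing."""
    return -sum((f / 100) * math.log2(f / 100) for f in percentages if f > 0)


def redundancy(h, alphabet_size=26):
    """R = 1 - H / log2(alphabet size)."""
    return 1 - h / math.log2(alphabet_size)


h_first = entropy_bits(FIRST_LETTER.values())
r_first = redundancy(h_first)
print(f"H = {h_first:.3f} bits, R = {r_first:.1%}")
```

The same two functions can be applied to any other column of Table 3 or Table 4 by swapping in that column's percentages.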

For the marginal-punched card application, first and third letter positions were selected for coding. Subject-word frequencies were weighted with proper names in a 7-to-1 proportion,[2] as shown in Table 7. The 52 letters of first and third positions were then assigned to the 38 available positions as equally as possible, but under the restriction that alphabetical order along the side of the card be preserved. The result is shown in Fig. 1, and Table 8 shows the predicted frequency distribution for this code. Note that it was necessary sometimes to combine several letters in one position, and sometimes to split one letter between two positions. These splits were chosen according to the median frequencies of English trigrams (9).

[2] According to Wise (3), the ratio X/H, or that of the number of positions to be punched to the number of positions available for punching, should be about 0.46. Taking H to be 19, X = 8.75. The dropping fraction f_d = (G/H)^Y, or the ratio of the number of positions actually punched to the number available for punching, raised to a power, Y, representing the number of sorting elements used, works out to be (7/19)^2 = 13.7%, if about 9 codes are actually superimposed. Note that

A maximum of 8 coding words was chosen, based on these calculations.

TABLE 8 Comparison of actual and predicted letter frequencies

First letter [a]

| Letter | Actual no. of cards dropped | Actual % | Predicted % |
|---|---|---|---|
| Aa-An (median between anx and any) | 29 | 2.6 | 4.25 |
| Ao-Az | 19 | 1.7 | 4.25 |
| B | 54 | 4.9 | 5.3 |
| Ca-Ci (median between cka and cke) | 65 | 5.9 | 4.0 |
| Ck-Cz | 103 | 9.35 | 4.0 |
| D | 46 | 4.2 | 3.8 |
| E | 37 | 3.35 | 3.9 |
| F | 36 [c] | 3.3 | 4.4 |
| G | | | 4.6 |
| H | | | 3.9 |
| I & J | | | 4.0 |
| K & L | | | 4.0 |
| M | | | 6.7 |
| N & O | | | 5.3 |
| P | | | 7.4 |
| Q & R | | | 7.0 |
| Sa-Si (median between siv and six) | | | 5.0 |
| Sj-Sz | | | 5.0 |
| T | | | 5.6 |
| U-Z | | | 7.1 |

Third letter [b]

| Letter | Actual no. of cards dropped | Actual % | Predicted % |
|---|---|---|---|
| aa-aq (median between ard and are) | 40 | 3.8 | 5.45 |
| ar–az & b | 80 | 7.6 | 5.45 |
| c | 60 | 5.65 | 5.1 |
| d | 55 | 5.2 | 5.8 |
| e | 100 | 9.45 | 6.4 |
| f, g & h | 80 | 7.6 | 5.4 |
| i, j & k | 45 | 4.25 | 5.6 |
| la−lo (median between lov and low) | 45 | 4.25 | 5.85 |
| lp−lz & m | 80 | 7.6 | 5.85 |
| n | 80 | 7.6 | 6.5 |
| oa–os (median between otf and oth) | 45 | 4.25 | 5.9 |
| ot–oz, p, q | 35 | 3.3 | 5.9 |
| ra-rg (median between rge and rgo) | 40 | 3.8 | 6.05 |
| rh–rz | 65 | 6.1 | 6.05 |
| s | 65 | 6.1 | 5.0 |
| ta–th (median between tid and tie) | 40 | 3.8 | 3.85 |
| ti–tz | 35 | 3.3 | 3.85 |
| u-z | 70 | 6.6 | 6.3 |
| Total | 1060 | | |
| Avg. | 59 | | |

[a] Ideally each first letter position would comprise 5%.

[b] Ideally each third letter position would comprise 5.5%.

[c] Not carried to completion.

Note: About 400 cards were used in study. Actual number was estimated by measuring cards dropped at 150 cards/inch. Predicted percentage based on Table 7.

Dropping fraction, F_d = (G/H)^Y = (2.7/18)^1 = 15%. For 400 cards, F_d = 60. H is the number of coding positions; G is the number of punches/card = 1100/400; Y is the number of sorting positions = 1. (See Wise (3) for derivation.)
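The predicted percentages in Table 8 follow from Table 7 once the grouping of letters into positions is fixed. The sketch below reproduces the first-letter column under two assumptions that are not spelled out in the paper: the grouping is read off the row labels of Table 8, and a split letter is taken to contribute half of its frequency to each of its two positions. The names `FIRST_POSITIONS` and `predicted_percentages` are illustrative.

```python
# Weighted first-letter frequencies (%) from Table 7.
WEIGHTED_FIRST = {
    "A": 8.5, "B": 5.3, "C": 8.0, "D": 3.8, "E": 3.9, "F": 4.4, "G": 4.6,
    "H": 3.9, "I": 3.0, "J": 1.0, "K": 1.1, "L": 2.9, "M": 6.7, "N": 3.1,
    "O": 2.2, "P": 7.4, "Q": 0.6, "R": 6.4, "S": 10.0, "T": 5.6, "U": 1.7,
    "V": 1.5, "W": 2.5, "X": 0.5, "Y": 0.3, "Z": 0.6,
}

# The 20 first-letter positions in alphabetical order, as labelled in Table 8.
# Each entry is (letters, share); share 0.5 marks one half of a split letter (assumed even split).
FIRST_POSITIONS = [
    ("A", 0.5), ("A", 0.5), ("B", 1.0), ("C", 0.5), ("C", 0.5),
    ("D", 1.0), ("E", 1.0), ("F", 1.0), ("G", 1.0), ("H", 1.0),
    ("IJ", 1.0), ("KL", 1.0), ("M", 1.0), ("NO", 1.0), ("P", 1.0),
    ("QR", 1.0), ("S", 0.5), ("S", 0.5), ("T", 1.0), ("UVWXYZ", 1.0),
]


def predicted_percentages(freqs, positions):
    """Predicted share of words falling on each card position, in percent."""
    return [round(share * sum(freqs[c] for c in letters), 2)
            for letters, share in positions]


predicted = predicted_percentages(WEIGHTED_FIRST, FIRST_POSITIONS)
print(predicted)
print("spread:", min(predicted), "to", max(predicted), "against an ideal of 5%")
```

The spread of the result around the ideal 5% shows the cost of keeping the positions in alphabetical order: a frequent single letter such as P cannot be balanced against rare letters elsewhere in the alphabet.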

Splitting letters modifies the first-letter position somewhat by the second, and third somewhat by the fourth. Such letter-pair frequencies take account of intersymbol influence, and therefore make possible a better code than single-letter frequencies. H.P.Luhn has designed a superimposed code using randomizing squares (10) which takes advantage of letter-pair frequencies.
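A minimal sketch of how such a letter code could be applied is given below. It is an illustration under stated assumptions rather than the procedure used on the actual cards: the position boundaries are read off the row labels of Table 8, a card is modeled as a 38-bit integer, words shorter than four letters are padded with "a" (an arbitrary choice), and all function names are hypothetical. A split letter is resolved by looking at the following letter, which is how the second letter modifies the first-letter position and the fourth modifies the third.

```python
from bisect import bisect_right

# Lower bounds of the 20 first-letter and 18 third-letter positions (from Table 8 labels).
FIRST_BOUNDS = ["aa", "ao", "b", "ca", "ck", "d", "e", "f", "g", "h",
                "i", "k", "m", "n", "p", "q", "sa", "sj", "t", "u"]
THIRD_BOUNDS = ["aa", "ar", "c", "d", "e", "f", "i", "la", "lp", "n",
                "oa", "ot", "ra", "rh", "s", "ta", "ti", "u"]


def _position(pair, bounds):
    """Index of the last lower bound that does not exceed the two-letter key."""
    return max(bisect_right(bounds, pair) - 1, 0)


def word_positions(word):
    """Return the two card positions (0-37) punched for one word."""
    w = "".join(ch for ch in word.lower() if "a" <= ch <= "z").ljust(4, "a")
    first = _position(w[0:2], FIRST_BOUNDS)
    third = len(FIRST_BOUNDS) + _position(w[2:4], THIRD_BOUNDS)
    return first, third


def card_code(words):
    """Superimpose the punches of all indexing words on one card."""
    mask = 0
    for word in words:
        for pos in word_positions(word):
            mask |= 1 << pos
    return mask


def drops(card_mask, query_words):
    """A card drops when every position punched for the query is punched on the card."""
    query = card_code(query_words)
    return (card_mask & query) == query


card = card_code(["aircraft", "copper"])
print(word_positions("aircraft"), word_positions("copper"))
print(drops(card, ["copper"]), drops(card, ["steel"]))
print(drops(card, ["coppice"]), drops(card, ["alps"]))  # false drops
```

The last line shows the two kinds of false drop inherent in this scheme: "coppice" shares both positions with "copper", while "alps" borrows its first-letter position from "aircraft" and its third-letter position from "copper".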

An empirical check of the letter code shown in Fig. 1 was made on a 400-card file maintained by the author. The results are shown in Table 8.[3] The average dropping fraction for the third position alone compares well with the dropping fraction as calculated by formula, but the range (from 9 to 25%) is broader than hoped for. However, Table 8 shows that the agreement between actual and predicted frequencies in the third-letter position was very good, considering the alphabetic-order limitation imposed in assigning the positions.
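The comparison can be retraced from the third-letter counts in Table 8. The sketch below divides each count by the 400 cards of the file to get an observed dropping fraction per position and sets the average beside Wise's formula (G/H)^Y with the values given in the note to Table 8. Variable and function names are illustrative; the counts themselves were estimated in the paper by measuring card stacks, so the result is only as precise as that estimate.

```python
# Cards dropped for each of the 18 third-letter positions (Table 8).
THIRD_LETTER_DROPS = [40, 80, 60, 55, 100, 80, 45, 45, 80,
                      80, 45, 35, 40, 65, 65, 40, 35, 70]
N_CARDS = 400  # approximate size of the file used in the study


def dropping_fraction(punches_per_card, positions, sort_positions=1):
    """Wise's formula F_d = (G / H) ** Y."""
    return (punches_per_card / positions) ** sort_positions


observed = [n / N_CARDS for n in THIRD_LETTER_DROPS]
mean_observed = sum(observed) / len(observed)
by_formula = dropping_fraction(1100 / N_CARDS, positions=18)

print(f"observed mean {mean_observed:.1%}, "
      f"range {min(observed):.1%} to {max(observed):.1%}")
print(f"formula {by_formula:.1%}")
```

The observed mean of roughly 15% agrees with the formula, while the spread from about 9% to 25% corresponds to the range mentioned in the text.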

By using data-processing equipment, much more elaborate studies on much larger samples would be possible. The author is working with such equipment and hopes to have some results available in the near future.

Equifrequency-letter codes have many other applications, including the preassignment of space in files and indexes, in cryptography, and in philology. For example, the data in Table 5 can provide a quantitative measure of subject word popularity. Taking a few words from the Library of Congress list of subject headings, we add the percentage frequencies of each letter (up to 5) together and divide by the number of letters. (Multiply each pn by 100 to get the percentage frequency.)

AIRCRAFT has a value of (9.3+11.1+12.0+6.2+12.0)/5 = 10.12

DIVIDER has a value of (3.6+11.1+0.6+10.8+2.0)/5 = 5.62

ICHTHYOLOGY has a value of (3.3+1.8+0.9+6.2+1.3)/5 = 2.70

These three words give some idea of the range possible in a subject-heading list. In general dictionary words, the highest found was SARI, with a value of 12.63, and the lowest, ONYX, with a value of 1.8. It is interesting to compare these values with the highest possible letter combination (not necessarily an English word), which is SAREE (value 12.96), and the lowest (value 0.06). The highest is very nearly realized in actuality, while the lowest never comes close. Also note that the word SARI is certainly uncommon English; this phenomenon may occur because the intersymbol connections are broken by taking single-letter frequencies.
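The popularity measure is simple enough to state as a short program. The sketch below uses the position-by-position percentages of Table 3, which are the same values that Table 5 lists as p_n; the function names `popularity` and `extreme_values` are illustrative. It averages the frequencies of up to five leading letters and also derives the largest and smallest values that any letter combination could reach.

```python
# Table 3: frequency (%) of each letter in positions 1-5 of subject words.
TABLE_3 = {
    "A": (9.3, 17.8, 8.4, 7.7, 5.3),   "B": (4.8, 0.6, 2.7, 2.2, 0.0),
    "C": (8.1, 1.8, 5.4, 6.2, 2.3),    "D": (3.6, 0.3, 6.3, 5.0, 2.0),
    "E": (4.2, 12.3, 6.3, 11.8, 13.3), "F": (4.5, 0.3, 2.4, 1.2, 0.7),
    "G": (4.5, 0.3, 1.8, 0.9, 0.7),    "H": (3.6, 3.9, 0.9, 3.7, 1.3),
    "I": (3.3, 11.1, 5.2, 10.8, 9.3),  "J": (0.9, 0.0, 0.0, 1.2, 0.0),
    "K": (0.6, 0.3, 0.3, 2.8, 0.3),    "L": (2.7, 6.9, 7.8, 3.9, 6.7),
    "M": (6.3, 0.9, 3.9, 5.3, 2.3),    "N": (3.3, 3.9, 5.7, 5.9, 7.3),
    "O": (2.1, 11.4, 8.1, 6.2, 11.7),  "P": (7.8, 1.5, 3.9, 3.7, 1.0),
    "Q": (0.6, 0.3, 0.3, 0.3, 0.0),    "R": (6.6, 7.5, 12.0, 3.9, 12.0),
    "S": (9.9, 0.6, 4.5, 4.3, 5.7),    "T": (6.0, 2.4, 8.1, 6.2, 8.7),
    "U": (1.8, 7.8, 3.3, 2.8, 5.3),    "V": (1.5, 0.3, 0.6, 1.2, 0.3),
    "W": (2.1, 0.0, 0.3, 0.3, 0.7),    "X": (0.6, 0.9, 0.3, 0.0, 1.0),
    "Y": (0.3, 6.0, 1.2, 1.9, 1.7),    "Z": (0.6, 0.6, 0.0, 0.3, 0.0),
}


def popularity(word):
    """Mean percentage frequency of the first (up to five) letters of a word."""
    letters = [ch for ch in word.upper() if ch in TABLE_3][:5]
    if not letters:
        raise ValueError("word contains no letters")
    return sum(TABLE_3[ch][i] for i, ch in enumerate(letters)) / len(letters)


def extreme_values():
    """Highest and lowest values reachable by any five-letter combination."""
    columns = list(zip(*TABLE_3.values()))
    return sum(max(c) for c in columns) / 5, sum(min(c) for c in columns) / 5


for word in ("AIRCRAFT", "ICHTHYOLOGY", "SARI", "ONYX", "SAREE"):
    print(word, round(popularity(word), 2))
print(tuple(round(v, 2) for v in extreme_values()))
```

Because each position is scored independently, a word such as SARI rates highly even though it is rare, which is the effect of ignoring intersymbol connections noted in the text.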

[3] Since the first-letter positions showed quite wide deviations from the predicted frequencies, their analysis was never completed. It is now thought that third and fourth letter positions would have made a more invariant code, less subject to the fluctuations which occur in any particular file because of the selection of particular terms.


ACKNOWLEDGMENT

The work described in this paper was performed while the author was in the employ of Carrier Corporation, Syracuse, New York.

REFERENCES

1. C. E. SHANNON, A Mathematical Theory of Communication, Bell System Technical Journal, July 1948, and following.

2. C. N. MOOERS, Zatocoding and Development in Information Retrieval, ASLIB Proc., February 1956, p. 3. (Many other papers by this author may be obtained from his Zator Co., 79 Milk Street, Boston, Massachusetts.)

3. C. S. WISE, A Punched-Card File Based on Word Coding, pp. 93–114, in Perry and Casey’s Punched Cards, Reinhold Publishing Corporation, New York, 1951.

4. MOOERS and WISE had discussions in American Documentation, April 1950, October 1950, and October 1952.

5. H. OHLMAN, The Low-Cost Production of Marginal-Punched Cards on Accounting Machines, pp. 123–26, American Documentation, April 1957.

6. JOINT COMMITTEE OF ASM AND SLA, ASM-SLA Metallurgical Literature Classification, American Society for Metals, 1950. (Figure 5, which was based on an analysis of 4870 names by A. H. Geisler in ASM Review of Metallurgical Literature.)

7. K. A. KRIEGER, A Punched-Card System for Chemical Literature, J. of Chemical Education, March 1949, p. 163.

8. E. T. KLEMMER, Tables for Computing Informational Measures, p. 75 in Quastler’s Information Theory in Psychology, Free Press, Glencoe, Ill., 1955.

9. F. PRATT, Secret and Urgent, The Story of Codes and Ciphers, Blue Ribbon Books, Garden City, N.Y., 1942, pp. 264–78.

10. H. P. LUHN, Superimposed Coding With the Aid of Randomizing Squares for Use in Mechanical Information Searching Systems, IBM Product Development Lab., Poughkeepsie, New York, 1956.


The launch of Sputnik caused a flurry of governmental activity in science information. The 1958 International Conference on Scientific Information (ICSI) was held in Washington from Nov. 16-21, 1958 and sponsored by NSF, NAS, and American Documentation Institute, the predecessor to the American Society for Information Science. In 1959, 20,000 copies of the two volume proceedings were published by NAS and included 75 papers (1600 pages) by dozens of pioneers from seven areas such as:

  • Literature and reference needs of scientists
  • Function and effectiveness of A & I services
  • Effectiveness of Monographs, Compendia, and Specialized Centers
  • Organization of information for storage and search: comparative characteristics of existing systems
  • Organization of information for storage and retrospective search: intellectual problems and equipment considerations
  • Organization of information for storage and retrospective search: possibility for a general theory
  • Responsibilities of Government, Societies, Universities, and industry for improved information services and research.

It is now an out of print classic in the field of science information studies.
