An Introduction to The National Language Research Institute: A Sketch of its Achievements
Third Edition (1988) / HTML Version (1997)


II.3.9 Studies on the Vocabulary of Modern Newspapers -- Using Computer I - IV

(I. Report 37, 1970. 342 pages; II. Report 38, 1971. 314 pages; III. Report 42, 1972. 159 pages; IV. Report 48, 1973. 530 pages)
This book reports on a vocabulary survey that took one year of publication (1966) of three newspapers, "Asahi," "Mainiti," and "Yomiuri," as its population. The main characteristics of the investigation are as follows: 1) Newspaper articles were selected by a sampling procedure to obtain a large corpus totalling three million running words. 2) To process this volume of data in a short period of time, a computer system and Chinese character input-output teletypewriters were used. 3) Because both a long unit (TYO~-TAN'I, roughly a word) and a short unit (TAN-TAN'I, roughly a morpheme) were used, word structure could be investigated during the processing. 4) To obtain and interpret the results from a multidimensional viewpoint, the occurrence and use of words were determined and analyzed across various types of articles, classified by topic, type of discourse, location of unit, and source of information.
This was the first project at this Institute in which a computer was used to process the data. The computer made it possible to carry out a variety of quantitative analyses on the large quantity of data in a short period of time.
Report 37 (I) contains a vocabulary table of long and short units in order of frequency of occurrence, and a vocabulary table in order of the Japanese 50-kana syllabary. Report 38 (II) contains a table of loan words in order of frequency of occurrence, a vocabulary table in order of frequency of use by part of speech, a vocabulary table of short units in order of the Japanese 50-kana syllabary, a vocabulary table of homophones, and a vocabulary table of homomorphemes. Report 42 (III) contains a table of 'NA'-adjectives, a table of the connections made by affixes, and a table of the connections made by particles and auxiliary verbs. Reports 37 (I), 38 (II), and 42 (III) are all interim reports.
Report 48 (IV) is the final report for this survey. For the long units (approximately 2,000,000 running words and approximately 190,000 different words), it contains a vocabulary table in order of word frequency and a vocabulary table in order of the Japanese 50-kana syllabary.
OISI Hatutaro~, HAYASI Oki, HAYASI Siro~, ISIWATA Tosio, SAITO Hidenori, KIMURA Sigeru, TANAKA Akio, MINAMI Huzio, EGAWA Kiyosi, NAKANO Hirosi, TUTIYA Sin'iti, NOMURA Masaaki, MURAKI Sin'ziro~, and TURUOKA Akio directed this survey.
This report advanced techniques in Japanese information processing considerably. Some of these results are reported in Studies in Computational Linguistics I-X (Reports 31-67). In addition, the "System for Production of Tables of Keyword Examples in Japanese Context Using Computer (KWIC)" developed into the "System for General Indexing of Vocabulary by Computer" and produced the Vocabulary-Context Concordance for SIGA Naoya's KINOSAKI NITE ('At Kinosaki') (1971) and the Vocabulary-Context Concordance for MORI Ogai's KANZAN ZITTOKU (1974). We also experimented with the production of general concordances for several works by MORI Ogai and NATUME So~seki.
Together with these surveys, the following research reports have been written:
MIYAZIMA Tatuo, Morphemes as Linguistic Units, 1965.
SUZUKI Sigeyuki, Words as Linguistic Units, 1965.
SINDO Sakiko, SUZUKI Sigeyuki, TANAKA Akio, HAYASI Oki, and MIYAZIMA Tatuo, Proposal for the Survey Unit, 1966.
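
As a purely illustrative aside, the two kinds of output described above, a vocabulary table in order of frequency and a keyword-in-context (KWIC) listing, can be sketched in a few lines of Python. This is not the Institute's system: the function names, the English sample text, and the context width are all assumptions made for illustration, and the sketch presupposes text that has already been segmented into units.

```python
from collections import Counter


def frequency_table(tokens):
    """Return (unit, count) pairs, most frequent first; ties sorted alphabetically."""
    counts = Counter(tokens)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def kwic(tokens, keyword, width=2):
    """Return one (left context, keyword, right context) triple per occurrence."""
    rows = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, token, right))
    return rows


# Hypothetical, already segmented sample; the survey itself worked on Japanese text.
sample = "the survey counted the words and the survey sorted the words".split()

table = frequency_table(sample)  # vocabulary table in order of frequency
lines = kwic(sample, "survey")   # every occurrence of "survey" with its context
```

Segmenting Japanese text into long or short units, which the survey had to settle before any counting, is outside the scope of this sketch.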
