An Introduction to The National Language Research Institute:
A Sketch of its Achievements
Third Edition (1988) /
HTML Version (1997)
II.3.9 Studies on the Vocabulary of Modern Newspapers
-- Using Computer I - IV
(I. Report 37, 1970. 342 pages; II. Report 38, 1971. 314 pages;
III. Report 42, 1972. 159 pages; IV. Report 48, 1973. 530 pages)
This book reports on a vocabulary survey that took as its
population one year of publication (1966) of three newspapers:
"Asahi," "Mainiti," and "Yomiuri."
The main characteristics of this investigation are as
follows:
1) Newspaper articles were selected by a sampling
procedure to obtain a large corpus totalling three million
running words.
2) In order to process such a large amount of data in a
short period of time, a computer system and Chinese
character input-output teletypewriters were used.
3) By using both a long unit (TYO~-TAN'I, roughly, a word)
and a short unit (TAN-TAN'I, roughly, a morpheme), it was
possible to investigate word structure during the processing.
4) In order to obtain and interpret the results from a
multidimensional viewpoint, the occurrence and use of words were
determined and analyzed for articles classified by topic,
type of discourse, location of unit, and source of
information.
This project was the first at this Institute in which a
computer was used to process the data. The computer made it
possible to carry out a variety of quantitative analyses on a
large quantity of data in a short period of time.
Report 37 (I) contains a vocabulary table in order of
frequency of occurrence of long and short units, and a
vocabulary table in order of the Japanese 50-kana
syllabary. Report 38 (II) contains a table of loan words
in order of frequency of occurrence, a vocabulary table in
order of frequency of use by parts of speech, a vocabulary
table of short units in order of the Japanese 50-kana syllabary,
a vocabulary table of homophones, and a vocabulary table of
homomorphemes. Report 42 (III) contains a table of
'NA'-adjectives, a table of the connections made by
affixes, and a table of the connections made by particles
and auxiliary verbs. Reports 37 (I), 38 (II) and 42
(III) are all interim reports. Report 48 (IV) is the final
report for this survey and contains a vocabulary table in
order of word frequency and a vocabulary table in order of the
Japanese 50-kana syllabary for the long units
(approximately 2,000,000 running words and approximately 190,000
different words).
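The two kinds of table described above (one ordered by frequency,
one ordered by the 50-kana syllabary) and the two totals (running
words and different words) can be illustrated with a minimal
sketch. This is not the Institute's program: the function name
vocabulary_tables and the sample data are hypothetical, the units
are assumed to be already segmented and given as hiragana readings,
and plain code-point sorting is used as a rough stand-in for true
50-kana order (it treats voiced and small kana differently from
dictionary conventions).

```python
from collections import Counter


def vocabulary_tables(units):
    """Build two vocabulary tables from a list of unit readings.

    Returns (by_frequency, by_kana, running, different).
    'units' is assumed to be pre-segmented hiragana readings.
    """
    counts = Counter(units)
    running = sum(counts.values())   # running words (tokens)
    different = len(counts)          # different words (types)
    # Table 1: descending frequency, ties broken by reading.
    by_frequency = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    # Table 2: code-point order, an approximation of 50-kana order.
    by_kana = sorted(counts.items(), key=lambda kv: kv[0])
    return by_frequency, by_kana, running, different


# Hypothetical sample data, not taken from the survey.
sample = ["しんぶん", "ことば", "しんぶん", "あさ", "ことば", "しんぶん"]
by_frequency, by_kana, running, different = vocabulary_tables(sample)
```

The same counts feed both tables; only the sort key changes, which
is why a single pass over the corpus is enough for both listings.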
OISI Hatutaro~, HAYASI Oki, HAYASI Siro~, ISIWATA
Tosio, SAITO Hidenori, KIMURA Sigeru, TANAKA Akio, MINAMI
Huzio, EGAWA Kiyosi, NAKANO Hirosi, TUTIYA Sin'iti, NOMURA
Masaaki, MURAKI Sin'ziro~, and TURUOKA Akio directed this
survey.
This survey considerably advanced techniques for Japanese
information processing. Some of these results are
reported in Studies in Computational Linguistics I-X
(Reports 31-67). In addition, the "System for Production of
Tables of Keyword Examples in Japanese Context Using
Computer (KWIC)" developed into the "System for General
Indexing of Vocabulary by Computer," which produced the
Vocabulary-Context Concordance for SIGA Naoya's KINOSAKI
NITE ('At Kinosaki') (1971) and the
Vocabulary-Context Concordance for MORI Ogai's KANZAN
ZITTOKU (1974). We also experimented with the
production of general concordances for several works by MORI Ogai
and NATUME So~seki. In connection with these surveys, the
following research reports were written.
MIYAZIMA Tatuo, Morphemes as Linguistic Units, 1965.
SUZUKI Sigeyuki, Words as Linguistic Units, 1965.
SINDO Sakiko, SUZUKI Sigeyuki, TANAKA Akio, HAYASI Oki, and
MIYAZIMA Tatuo, Proposal for the Survey Unit, 1966.
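As an illustration of the keyword-in-context (KWIC) idea behind the
concordances mentioned above, the following is a minimal sketch
only. The design of the Institute's actual system is not described
here, so the function name kwic, the width parameter, and the
English sample tokens are all assumptions made for the example; the
real system operated on Japanese text.

```python
def kwic(tokens, keyword, width=2):
    """Return one (left, keyword, right) row per occurrence of keyword.

    'width' is the number of context tokens kept on each side.
    """
    rows = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            rows.append((" ".join(left), tok, " ".join(right)))
    return rows


# Hypothetical sample text, already split into tokens.
tokens = "the cat sat on the mat near the door".split()
rows = kwic(tokens, "the")
```

Each row keeps the keyword in a fixed middle position with its
neighbours on either side, which is what lets a concordance be
scanned down a single column.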