Nagoya University Conversation Corpus (名大会話コーパス)

Outline

Creation Procedure, Characteristics and Problems of the Nagoya University Conversation Corpus

I. File creation procedure

Tape recorders or MD (MiniDisc) players and external microphones were provided to the people making the recordings, mainly graduate students, who were asked to record 30 minutes to one hour of uncontrolled, natural conversation.
They were instructed to record in a place of their choice that was as quiet as possible.
The general rule was to record a conversation between two people to make it easier to identify who was speaking. In reality, however, there were three or four speakers participating in some conversations.
Each conversation participant signed a consent form to allow their data to be used for Japanese language and education research. (See the separate sheet.)
Transcription of the data recorded on tapes and MDs was outsourced to an external transcriber. For details of the transcription process, see II.
The transcribed data was sent back to the graduate students for checking purposes.
Passages judged inappropriate for publication, for privacy or other reasons, were cut and replaced with "< SNIP >". Some problematic words were also withheld and marked with "***", the same marker used for inaudible words.
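As a rough illustration of how these two markers might be located in a transcript line, here is a minimal Python sketch. The function name count_markers is hypothetical, and the tolerance for spacing and for full-width symbols is an assumption, not something this outline specifies.

```python
import re
import unicodedata

# Assumption: spacing inside the angle brackets may vary, so it is matched loosely.
SNIP_RE = re.compile(r"<\s*SNIP\s*>", re.IGNORECASE)
INAUDIBLE_RE = re.compile(r"\*{3}")


def count_markers(line):
    """Count cut passages and "***" markers in one line of transcript text."""
    # NFKC folds full-width symbols such as "＊" and "＜" to their ASCII forms.
    text = unicodedata.normalize("NFKC", line)
    return {
        "snip": len(SNIP_RE.findall(text)),
        "inaudible_or_masked": len(INAUDIBLE_RE.findall(text)),
    }
```

Because masked words and inaudible words share the "***" marker, a count like this cannot tell the two apart.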

II. Transcription process

The instructions given to the external transcriber, taken from the memorandums exchanged for the transcription request and from other information concerning the transcription process, were as follows:

Transcribe the conversations as faithfully as possible to the original voices.
Start each header with "@" and mark the end of each conversation with "@End".
Transcribe all audible sounds, and mark inaudible sounds with "***".
Place back-channel sounds in parentheses "()".
Transcribe overlapping voices, if any, as separate statements rather than marking the overlap.
When a statement has a rising intonation, put a question mark after the statement.
When a speaker laughs, place "< laugh >". When the listener laughs, place "(< laugh >)".
Place the speaker's code at the beginning of each statement.
　　When the same person keeps on making statements, however, there is no need to put the code each time.
When silence lasts more than a certain period, place "< pause >".
When the reading of a kanji word is considered difficult, show it in hiragana in lenticular brackets "【】".
Add supplementary information, if any, after "％ｃｏｍ".
Always use full-width (two-byte) characters for letters of the alphabet, numbers, and symbols.
Describe the ages of conversation participants as follows:

Aged 15-19: Late teens
Aged 20-24: Early 20s
Aged 25-29: Late 20s
And so on.
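The conventions above can be sketched in Python as follows. This is a minimal illustration, not the procedure used to build the corpus. The names parse_transcript and age_label are hypothetical. The colon after a speaker code is an assumption, since this outline does not say what separates the code from the statement, and the label for ages 10-14 is an extrapolation of the "and so on" pattern.

```python
import re
import unicodedata

# Assumption: a speaker code (F or M plus digits) is followed by a colon.
SPEAKER_RE = re.compile(r"([FM]\d+)\s*:\s*(.*)")


def parse_transcript(text):
    """Split a transcript into headers, (speaker, text) pairs, and comments."""
    headers, utterances, comments = [], [], []
    current = None  # speaker of the most recent coded line
    for raw in text.splitlines():
        # NFKC folds full-width letters, digits, and symbols to ASCII.
        line = unicodedata.normalize("NFKC", raw).strip()
        if not line:
            continue
        if line.startswith("@"):
            body = line[1:].strip()
            # Tolerate a trailing period and any letter case on the end marker.
            if body.rstrip(".").lower() == "end":
                break
            headers.append(body)
        elif line.lower().startswith("%com"):
            comments.append(line[4:].lstrip(": ").strip())
        else:
            m = SPEAKER_RE.fullmatch(line)
            if m:
                current, line = m.group(1), m.group(2)
            # A line without a code continues the previous speaker's turn.
            utterances.append((current, line))
    return headers, utterances, comments


def age_label(age):
    """Map an age in years to a label such as "Late teens" or "Early 20s"."""
    if age < 10:
        raise ValueError("no label is defined for ages under 10")
    half = "Early" if age % 10 < 5 else "Late"
    decade = age // 10 * 10
    return f"{half} teens" if decade == 10 else f"{half} {decade}s"
```

Note that normalizing every line also converts full-width digits and letters inside the statements themselves. Anyone who needs the original characters should keep the raw line alongside the normalized one.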

III. Characteristics of this corpus

Some data was recorded in and around Tokyo, in Hokkaido, and in Niigata, while the majority was recorded in and around Nagoya.
Although the great majority of the conversations were spoken in Standard Japanese, some included dialect speakers.
The participants varied greatly in age, ranging from their teens to their nineties. Women outnumbered men.
As many Japanese language educators and researchers participated, their conversations often included metalinguistic usages of Japanese.
While many were conversations between close friends, some conversations were held between people who met each other for the first time and between research members. Some were also conversations between seniors and juniors.
In these uncontrolled, natural conversations, participants were allowed to speak about anything they liked. At the same time, however, they were also informed that they were being recorded.

IV. Problems of the data

One disadvantage of the Collection Inventory is that it is sometimes difficult to identify specific dialects.
The following problems arose in the transcription process:
1. Distinguishing and transcribing prolonged sounds, assimilated sounds, etc.
2. Notation of back-channel sounds
3. Difficulty with recognizing back-channel sounds
4. Although the transcriber was requested to transcribe the conversations as faithfully as possible to the original voices, the transcription could not be kept entirely free from the influence of the meaning of words.
  (Example) Keredo → Keredomo

  Although we did our utmost to correct this problem, some errors may remain.
5. The actual sounds did not always exactly reflect what the speakers intended.
6. Difficulty with understanding the names of places and people, and dialects, as well as unfamiliar teen slang
7. Some French and English words were used. While the English words were written in the alphabet, the French words were marked with < French >.
8. As the transcripts are written using kanji characters, it is not always possible to tell how some words were actually pronounced.
9. Since there is no clear standard for what constitutes a statement, punctuation is quite arbitrary.
10. Overlapping of voices is not reflected in the transcripts.

V. For publication

For publication of the entire data set as text files, participants are identified by the following codes to protect their privacy.

Female participant: F + Number
Male participant: M + Number

(Example) F001, F002, M001, M002
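A participant code of this form can be built and read back with a few lines of Python. This is a minimal sketch: the names make_code and parse_code are hypothetical, and the three-digit zero padding is inferred from the examples above rather than stated as a rule.

```python
import re

CODE_RE = re.compile(r"([FM])(\d+)")


def make_code(sex, number):
    """Build a code such as "F001" from "female" or "male" and a number."""
    if sex not in ("female", "male"):
        raise ValueError(f"unknown sex: {sex!r}")
    prefix = "F" if sex == "female" else "M"
    # Assumption: numbers are zero-padded to three digits, as in the examples.
    return f"{prefix}{number:03d}"


def parse_code(code):
    """Split a code such as "M002" into ("male", 2)."""
    m = CODE_RE.fullmatch(code)
    if m is None:
        raise ValueError(f"not a participant code: {code!r}")
    return ("female" if m.group(1) == "F" else "male", int(m.group(2)))
```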

Proper nouns used in the conversations are replaced with initials where deemed necessary, or with an initial plus a number where many proper nouns are used.

(Example) A2, B2, etc.