IDS, Pragmatics Department: Deutsches Spracharchiv (German Speech Archive)
"The Deutsches Spracharchiv (dsav@ids-mannheim.de) is the IDS's central collection and documentation centre for corpora of spoken German. Together with individual project groups, it currently manages 33 corpora from completed and ongoing documentation and research projects. The audio recordings and transcripts document varieties of German spoken inside and outside the German-speaking area (dialects, colloquial varieties, and spoken Standard German) as well as verbal interaction in various social contexts."
COSMAS II
"COSMAS II has been extended (...) to integrate spoken material in digitized form: it links digitized spoken utterances to their audio files so that hits can be played back. In searches it takes into account the particular features of transcribed spoken language (simultaneous passages, i.e. overlapping utterances; fragmented words; non-lexicalized utterances; pauses; etc.) and provides a speaker-related word distance operator."
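The description above does not say how the speaker-related word distance operator is defined. The following is only a minimal sketch of one plausible reading, in which distance is counted over the words of a single speaker, so that interleaved words from other speakers do not count. The function name, the (speaker, word) token layout, and the example data are assumptions for illustration and are not taken from COSMAS II.

```python
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (speaker label, word); hypothetical layout


def speaker_word_distance(tokens: List[Token], w1: str, w2: str,
                          max_dist: int) -> List[Tuple[str, int, int]]:
    """Find w1 followed by w2 within max_dist words of the SAME speaker.

    Returns (speaker, index of w1, index of w2), where the indices refer
    to positions in the full, interleaved token list.
    """
    w1, w2 = w1.lower(), w2.lower()

    # Group tokens by speaker, remembering their global positions.
    per_speaker: Dict[str, List[Tuple[int, str]]] = {}
    for idx, (speaker, word) in enumerate(tokens):
        per_speaker.setdefault(speaker, []).append((idx, word.lower()))

    hits = []
    for speaker, seq in per_speaker.items():
        for i, (pos1, word1) in enumerate(seq):
            if word1 != w1:
                continue
            # Look ahead at most max_dist words within this speaker's words.
            last = min(i + max_dist, len(seq) - 1)
            for j in range(i + 1, last + 1):
                pos2, word2 = seq[j]
                if word2 == w2:
                    hits.append((speaker, pos1, pos2))
    return hits


# Speaker B's words fall between A's words but are ignored for A's distance.
example = [("A", "ich"), ("B", "ja"), ("A", "glaube"),
           ("B", "genau"), ("A", "nicht")]
matches = speaker_word_distance(example, "ich", "nicht", 2)
```

In this reading, "ich" and "nicht" are two words apart for speaker A even though four positions separate them in the interleaved transcript.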
Bayerisches Archiv für Sprachsignale (BAS)
"The task of BAS is to make digital databases of spoken German available in structured form to both the research community and speech technology."
Speech Annotation and Corpus Tools
"A special issue of Speech Communication": The special issue, planned for mid-2000, is intended to give an interdisciplinary overview of current developments in the representation and management of annotated corpora of spoken language (speech signals with time-aligned transcripts).
Linguistic Annotation
Compiled by Steven Bird and Mark Liberman of the Linguistic Data Consortium (LDC), Philadelphia, this page contains a representative list of links to tools, formats, and procedural descriptions for annotated corpora of spoken language.
"This page describes tools and formats for creating and managing linguistic annotations. 'Linguistic annotation' covers any descriptive or analytic notations applied to raw language data. The basic data may be in the form of time functions -- audio, video and/or physiological recordings -- or it may be textual. The added notations may include transcriptions of all sorts (from phonetic features to discourse structures), part-of-speech and sense tagging, syntactic analysis, "named entity" identification, co-reference annotation, and so on. The focus is on tools which have been widely used for constructing annotated linguistic databases, and on the formats commonly adopted by such tools and databases."
The CHILDES System
"The CHILDES system provides tools for studying conversational interactions. These tools include a database of transcripts, programs for computer analysis of transcripts, methods for linguistic coding, and systems for linking transcripts to digitized audio and video."
COCOSDA
"The International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques for Speech Input/Output, COCOSDA, has been established to encourage and promote international interaction and cooperation in the foundation areas of Spoken Language Processing (...). COCOSDA provides a forum for international action and discussion and gives platforms for groups of workers to exchange information and to set up collaborations in the field of Spoken Language Engineering."
CSLU Toolkit
"These pages are home to the CSLU Toolkit, a comprehensive suite of tools to enable exploration, learning, and research into speech and human-computer interaction (...). The CSLU Toolkit was created to provide the basic framework and tools for people to build, investigate and use interactive language systems. These systems incorporate leading-edge speech recognition, natural language understanding, speech synthesis and facial animation technologies. The toolkit provides a comprehensive, powerful and flexible environment for building interactive language systems that use these technologies, and for conducting research to improve them."
Santa Barbara Corpus of Spoken American English (CSAE)
"The Corpus of Spoken American English is a project to gather a large number of recordings of people talking -- people from all over the United States, in all walks of life, talking about and doing all sorts of things. Selected portions of these recordings will be published as a tool for studying the structure and use of the English language as it is spoken in America. The Corpus will be published as a multi-volume book and as a computer database on CD-ROM disks which will contain written transcription and sound."
EAGLES SLWG
"Expert Advisory Group on Language Engineering Standards Spoken Language Working Group"
"EAGLES I: The goal of the EAGLES SLWG is to produce and maintain a 'Handbook of Standards and Resources for Spoken Language Systems' based on experience in European projects.
EAGLES II (from Jan 1997): The goal of the EAGLES WP 4, 5 and 6 is to produce a Supplement to the Handbook, covering further topics, and with updated and extended reference materials."
The essentials of EAGLES
"The Expert Advisory Group on Language Engineering Standards (EAGLES) is an initiative of the European Commission, within DG XIII Linguistic Research and Engineering programme, which aims to accelerate the provision of standards for:
- Very large-scale language resources (such as text corpora, computational lexicons and speech corpora);
- Means of manipulating such knowledge, via computational linguistic formalisms, mark up languages and various software tools;
- Means of assessing and evaluating resources, tools and products.
The work towards common specifications is carried out by five working groups:
- Text Corpora
- Computational Lexicons
- Grammar Formalisms
- Evaluation
- Spoken Language."
ESCA European Speech Communication Association
"ESCA is a non-profit organization (...).
The main goal of the Association is 'to promote Speech Communication Science and Technology in a European context, both in the industrial and Academic areas', covering all the aspects of Speech Communication (Acoustics, Phonetics, Phonology, Linguistics, Natural Language Processing, Artificial Intelligence, Cognitive Science, Signal Processing, Pattern Recognition, etc.)."
Max Planck Institute, Nijmegen - The Spoken Childes Tool
"Corpora with Speech Signal. For psychologists as well as linguists whose work is closely related to corpora and who need to have quick and seamless access to the speech signal a set of new tools is available.
Necessary preconditions of this feature are:
- that the speech information has been digitized beforehand
- that it is accessible on-line
- that all references between the text and the speech waveform have been correctly set.
Currently great efforts are being made at various places, including the MPI für Psycholinguistiek, to do exactly this for many corpora included in Childes and in ESF. In the widely accepted CHAT-format this means that a speech tier must be created for every utterance. This speech tier provides information about the corresponding speech file and the begin- and end-points of the speech segment for each utterance."
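To make the idea of a speech tier concrete, here is a minimal sketch of how such a tier could be read and paired with its utterance. It assumes a tier line of the form `%snd: "file" begin end` following a main speaker line that starts with `*`. The exact tier layout, the time units, the file name "boys85", and all function and class names are assumptions for illustration; the CHAT documentation is the authority on the real format.

```python
import re
from typing import List, NamedTuple, Optional, Tuple


class SpeechLink(NamedTuple):
    sound_file: str  # name of the digitized speech file
    begin: int       # begin point of the segment (units assumed, e.g. ms)
    end: int         # end point of the segment


# Assumed tier layout: %snd: "file" begin end
_SND_TIER = re.compile(r'^%snd:\s*"([^"]+)"\s+(\d+)\s+(\d+)\s*$')


def parse_speech_tier(line: str) -> Optional[SpeechLink]:
    """Return the file and segment bounds, or None if not a speech tier."""
    match = _SND_TIER.match(line)
    if match is None:
        return None
    begin, end = int(match.group(2)), int(match.group(3))
    if end < begin:
        raise ValueError(f"end point precedes begin point: {line!r}")
    return SpeechLink(match.group(1), begin, end)


def link_utterances(lines: List[str]) -> List[Tuple[str, SpeechLink]]:
    """Pair each utterance line (*SPK:) with the speech tier that follows it."""
    pairs = []
    current_utterance = None
    for line in lines:
        if line.startswith("*"):
            current_utterance = line
        elif line.startswith("%snd:") and current_utterance is not None:
            link = parse_speech_tier(line)
            if link is not None:
                pairs.append((current_utterance, link))
            current_utterance = None  # at most one speech tier per utterance
    return pairs


transcript = [
    '*CHI:\tmore cookie .',
    '%snd:\t"boys85" 7349 8338',
    '*MOT:\tno more .',
]
linked = link_utterances(transcript)
```

An utterance without a speech tier (the last line of the example) is simply left unlinked, which mirrors the third precondition above: playback is only possible where the reference between text and waveform has been set.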
Schegloff - Prosody
"Sound data for: 'Reflections on Studying Prosody in Talk-In-Interaction' by Emanuel A. Schegloff"
Praat: doing phonetics by computer
The program "Praat" is a tool for phoneticians for investigating, publishing, and manipulating speech signals.
"This comprehensive speech analysis, synthesis, and manipulation package includes general numerical and statistical stuff, is built on a general-purpose GUI shell for handling objects, and produces publication-quality graphics".
"Praat" was developed by Paul Boersma and David Weenink of the Institute of Phonetic Sciences at the University of Amsterdam (Netherlands).
Transcriber - a tool for segmenting, labeling and transcribing speech
"Transcriber is a tool for assisting the creation of speech corpora. It allows to manually segment, label and transcribe speech signals, for later use in automatic speech processing. It is more specifically geared towards the transcription of long duration broadcast news recordings, with labeling of speech turns and topic changes. It provides a user-friendly interface which is configurable."
"Ton und Text": utility programs for transcribers
"'Ton und Text' ('Sound and Text'; 'TuT' for short) is a group of utility programs that support the work of transcribers. The core of the TuT programs is a tool for playing back digitally recorded audio signals. During transcription, all TuT programs take over the playback functions of tape recorders."