What is Language documentation?
One response to language endangerment has been the creation of a new discipline within linguistics called Language documentation (or Linguistic documentation). It is often said to have been catalysed by Nikolaus P. Himmelmann, who wrote in 1998:
The aim of a language documentation is to provide a comprehensive record of the linguistic practices characteristic of a given speech community... This... differs fundamentally from... language description [which] aims at the record of a language... as a system of abstract elements, constructions, and rules
[p. 166, "Documentary and descriptive linguistics", Nikolaus P. Himmelmann (1998). Linguistics 36. pp. 161-195. Berlin: de Gruyter]
Another factor in the emergence of Language documentation is that information technologies have now matured enough to allow us to create sound and video recordings, and integrate them with text and other explanatory or analytical material.
It is important to carefully archive language documentations that have been made, because such materials have irreplaceable value for language communities and for linguists and other researchers. Digital archives allow possibilities never before imagined: catalogues are accessible and searchable from anywhere with internet access, materials are easily deliverable by network or on CDs and DVDs, and communities can express sensitivities or restrictions to control access to materials.
Language communities also have stronger relationships with their language heritage materials, because they can make use of multimedia materials in local language support activities.
In addition to making materials discoverable and accessible through suitable archiving and cataloguing, the next big challenge facing Language documentation is the discovery and widespread use of software interfaces that make documentation materials easily and flexibly usable by a wide range of users. This does not mean simplifying or trivialising the data or the way we work with it; rather, it means working harder on providing mature and friendly software such as users already expect in domains such as office applications and computer games.
Documenting a language is a complex process that involves finding speakers who can serve as language teachers (often called 'language consultants', and in former times 'informants') and then working together with them to study the language and its use. We begin by recording words and expressions, transcribing them (writing them down phonetically) and then analysing the materials to uncover the structure and functions of the language. The result of this kind of documentation is often a dictionary and a grammar of the language. In addition, we aim to collect 'texts', that is, stories, narratives, personal histories, explanations of how culturally important activities are carried out, speeches and other literary forms, including poetry and songs. Preferably, we record the performance of the texts - or other naturalistic language usage - using sound or video recorders. All of this will be transcribed, analysed and translated into a language of wider communication so that the materials can be used for a range of purposes by the language community, teachers, and researchers (provided that the community agrees that the materials can be used).
This kind of research typically involves fieldwork, going to a place (often a remote location) where speakers of a language live and working and living together with them there. To undertake fieldwork the researchers must be properly trained in the techniques of recording (sound and video), transcribing, analysing and translating languages that have never been studied before.
A language documentation project aims to collect and create audio, video, graphic and text documentation material covering the use of language in a variety of social and cultural contexts. The priorities for collecting, recording, analysing, and archiving are:
- to create a range of high quality materials to support description of a variety of language phenomena
- to enable the recovery of knowledge of the language even if all other sources are lost
- to generate resources in support of language maintenance and/or learning
Projects will typically create materials in several types of media:
- written (e.g. transcription, description/analysis)
- metadata (structured data about materials, typically in written form)
Together, these will form the language documentation and should contain a range of linguistic materials, such as:
- spoken language in a variety of styles and contexts, recorded (in video and/or audio), with transcriptions, translations and annotations
- written texts in a variety of styles, with transcriptions, translations and annotations
- relevant sociological and cultural information
- pedagogical materials
Researchers should collect, and appropriately record, metadata for all of the collected materials (see also below).
The methods and terminology should be aimed at making knowledge about the language accessible to a wide audience: not only academics, but also community members, as well as learners and teachers. Hence, although the materials researchers prepare may well be of interest for later theoretical analysis, within a documentation project they should avoid expressing the language data in terms of any particular linguistic theory, except where absolutely necessary.
Types of media and their properties
Each type of media – text, video, audio – has its own strengths for language documentation, and so a good documentation will consist of a combination of materials in different media.
Video captures the multimodal nature of situated language use and provides a rich documentary record. Video recordings ease the transcription process and allow for interdisciplinary analyses of, for example, the interplay of speech and gesture or conversational practices. Video is also often of particular interest to endangered language communities, and can be produced independently in communities without assistance from researchers. At the same time, video demands careful practical, technical and ethical consideration.
Although audio does not capture visual information, its relative simplicity and unobtrusiveness can result in a good linguistic record. Digital audio files are easy to work with, and there is a range of common and easy-to-use software for editing and presenting sound. Well recorded stereo sound, together with good metadata (including images) can provide a very good record of linguistic events.
Text – traditionally the main method of presenting linguistic material – is compact, stable, and easy to store, access, index, and reuse. Representing language as text always involves some kind of abstraction and analysis, which may provide new resources and generalisations, but at the same time loses information that was in the original event or recording. Therefore, text resources that retain their connections to an original audio or video recording provide much stronger forms of language documentation.
Metadata is “data about data” – structured information describing characteristics of events, participants, recordings, and details of other data files. Metadata provides the keys for understanding data. While metadata is central to effective archiving and resource discovery, good metadata content and structure are essential for any well-planned project and form a core part of your documentation activity, independently of your archiving plans.
Although metadata is usually in the form of text, it can be considered an independent type of resource because it is usually obtained, structured and used differently from other resource types, being more structured and possibly conforming to some formal specifications. Metadata could also be in the form of audio or images (photographs, diagrams, maps etc). Various functional categories of metadata include:
- cataloguing (title, speakers, collectors, time and place of recording, information about the situation and event, language name etc.)
- descriptive (about content)
- structural (data structures and relationships between units)
- technical (formats, quality, preservation information etc.)
- administrative (work log, responsibilities, access permissions, notes etc.)
Which of these metadata you create depends on the provenance and type of materials, the usages and audiences that the materials are likely to have, and any formal specifications you adopt. Access permissions and other sensitivities about recordings and data are an important type of metadata and must not be overlooked.
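The functional categories above can be sketched as a structured record. The following is a minimal illustration only: the field names and values are hypothetical placeholders, not a schema required by ELAR or by any metadata standard, and the check simply flags missing categories and unstated access permissions.

```python
import json

# Hypothetical category names taken from the list above; the field names
# inside each category are illustrative assumptions, not a formal schema.
REQUIRED_CATEGORIES = (
    "cataloguing", "descriptive", "structural", "technical", "administrative",
)

record = {
    "cataloguing": {
        "title": "Placeholder session title",
        "speakers": ["Speaker A"],
        "collectors": ["Researcher B"],
        "place": "Placeholder village",
        "language_name": "Placeholder language",
    },
    "descriptive": {"content": "Narrative about a culturally important activity"},
    "structural": {"related_files": ["session01.wav", "session01_transcript.xml"]},
    "technical": {"format": "WAV", "notes": "stereo, external microphone"},
    "administrative": {
        "work_log": ["recorded", "transcribed"],
        "access": "community members and depositor only",
    },
}

def missing_fields(rec):
    """List absent categories, plus a flag when access permissions are not stated."""
    problems = [c for c in REQUIRED_CATEGORIES if c not in rec]
    if "access" not in rec.get("administrative", {}):
        problems.append("administrative.access")
    return problems

print(missing_fields(record))
print(json.dumps(record, indent=2, ensure_ascii=False))
```

Treating access permissions as a required field reflects the point above that sensitivities must not be overlooked; which other fields are mandatory depends on the specification you adopt.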
There are several tools for creating and editing metadata.
Choosing the best formats for data can be complex, and advice about formats tends to change as technologies evolve.
It is important to have a basic awareness of the following:
- character encoding: how characters are represented, e.g. Unicode, Windows/ANSI, Big5, Latin 5 (ISO 8859-9)
- data encoding: how meaningful structures in the data are marked (using, for example, XML, Toolbox, MSWord tables, spreadsheet columns, labels etc)
- file encoding: how all the data is packaged into a file (e.g. plain-text, MSWord, PDF)
- carrier, or physical storage medium: the physical form used to store the file (e.g. hard disk, compact flash cards, CD, etc)
In many cases, there are already standard recommendations. For character encoding, wherever possible you should use Unicode, especially if your text contains non-roman or accented characters. If you use any character encoding other than Unicode or ASCII you should discuss this with ELAR and carefully document how all parts of documents are encoded.
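To illustrate why Unicode is recommended for text with non-roman or accented characters, here is a small sketch using Python's built-in codecs. The sample string is an invented placeholder, and `can_encode` is a hypothetical helper name.

```python
def can_encode(text, encoding):
    """Return True if every character in text can be represented in the encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

sample = "ŋà"  # placeholder transcription with a velar nasal and an accented vowel

for encoding in ("utf-8", "iso-8859-9", "ascii"):
    print(encoding, can_encode(sample, encoding))

# Bytes written as UTF-8 but later read with the wrong encoding come back garbled,
# which is why the encoding of every document should be recorded.
utf8_bytes = sample.encode("utf-8")
misread = utf8_bytes.decode("latin-1")
print(misread == sample)
```

The same bytes decode correctly only when the reader knows they are UTF-8, so recording the encoding in your metadata is as important as choosing it.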
For data encoding, you can structure data using databases, spreadsheets and software such as Toolbox. XML, a modern mark-up format, allows you to flexibly encode more complex structures, and is a more robust archiving format.
For file encoding, it is generally best to use open, non-proprietary formats. Proprietary formats can be changed or superseded by their publishers, or may be commercial secrets, so they make poor choices for archiving. However, making the best choices may not be easy, because each practitioner has different skills, priorities, and goals. Proprietary software tools can be more familiar or efficient for working with data, so they might be used to prepare data which is later exported to more standard or archivable formats; this needs careful planning, and should be discussed with ELAR.
For further information, see Bird and Simons (2003).
The Extensible Markup Language (XML) is the preferred data format for text materials, especially for archiving. XML is a document description language, used to describe the content of structured documents - each part of a structured document is described within a defined and logical structure (the structure can be documented in an XML schema or "DTD"). XML documents can be designed, created, processed and transformed manually or using editors, stylesheets (XSLT "extensible stylesheet language for transformations"), and document processing scripts.
Writing XML documents by hand can be difficult. Until recently, data was typically stored in databases and then exported into an XML file for data exchange or transformation. However, there are now a number of XML editors that can be used to create XML documents, to check markup tag syntax (well-formedness), to create DTDs, and to transform their data structures.
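As a rough illustration of a well-formedness check, the sketch below uses Python's standard `xml.etree.ElementTree` parser. The element names (`entry`, `headword`, `gloss`) are hypothetical and not taken from any particular lexicon schema. Note that this tests only tag syntax; it does not validate a document against a DTD or schema, which needs other tools.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the text parses as XML (tags properly nested and closed)."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Hypothetical lexicon entries for illustration only.
good = "<entry><headword>example</headword><gloss>sample gloss</gloss></entry>"
bad = "<entry><headword>example</gloss></entry>"  # mismatched closing tag

print(is_well_formed(good))
print(is_well_formed(bad))
```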
Some linguistic tools use XML to structure their underlying data, or can export data as XML, including:
- Transcriber for audio annotations
- ELAN for audio and video annotation (from MPI-Nijmegen)
- Toolbox for text and lexicon annotations (can export data as XML)
Sound and video formats
Real-time media (audio and video) is the area where there has been the most rapid technological change. For audio, there is now a wide range of compact and affordable solid state recorders, and just a small number of formats to choose from. Video, however, is still undergoing rapid change and presents the most uncertainty about selection of equipment and formats, and difficulty for long-term preservation.
For audio, primary recordings should be made using quality digital audio recorders to create WAV files (see below for recorders). WAV files normally consist of uncompressed audio data in two channels at a sampling rate of 44.1 kHz and a resolution of 16 bits (also known as the CD or “Red Book” encoding standard). While the latest audio archiving standards favour 48 kHz, 24-bit resolution, this can currently present problems for various computers and software. Please consult ELDP and ELAR before using voice recorders, minidisc (MD), cassettes, or recording in compressed formats such as MP3 or WMA.
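The parameters mentioned above can be inspected with Python's standard `wave` module. The sketch below is illustrative: `describe_wav` and `bytes_per_hour` are hypothetical helper names, and the one-second silent file is generated in memory purely so the example is self-contained. For uncompressed audio, the storage needed is the sampling rate multiplied by bytes per sample, channels, and duration, so one hour of 44.1 kHz, 16-bit stereo comes to 635,040,000 bytes.

```python
import io
import wave

def describe_wav(source):
    """Read the basic format parameters from a WAV file or file-like object."""
    with wave.open(source, "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_rate_hz": wav.getframerate(),
            "bit_depth": wav.getsampwidth() * 8,
            "compression": wav.getcomptype(),  # "NONE" means uncompressed PCM
        }

def bytes_per_hour(sample_rate_hz, bit_depth, channels):
    """Storage needed for one hour of uncompressed audio, excluding the file header."""
    return sample_rate_hz * (bit_depth // 8) * channels * 3600

# Build one second of stereo silence in memory so the example needs no real file.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as out:
    out.setnchannels(2)
    out.setsampwidth(2)       # 2 bytes per sample = 16 bits
    out.setframerate(44100)
    out.writeframes(b"\x00" * 44100 * 2 * 2)
buffer.seek(0)

info = describe_wav(buffer)
print(info)
print(bytes_per_hour(44100, 16, 2))
print(bytes_per_hour(48000, 24, 2))
```

A check like this can be run over a folder of field recordings before deposit to confirm that nothing was accidentally recorded in mono or at a lower resolution.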
For video, the situation is in flux. Most documenters are currently using high-end consumer (also called “Prosumer”) video cameras which shoot at high resolutions and carry their data on hard drives, built-in flash memory or removable flash memory cards (DV and mini DV tapes are now being phased out). These cameras typically record in manufacturer-specific versions of an emerging standard that is called in various contexts MPEG-4, H.264, or AVCHD. While these files are already compressed (meaning that no first-generation uncompressed version is ever available for archiving), they represent the best current compromise between size, quality and interoperability. In addition, the recorded files may be subsequently converted to other formats for download to a computer, or for viewing in player or annotation software. All conversions can result in loss of quality, so care should be taken and, wherever possible, first generation files should be preserved locally for both archiving and editing for future video products.
Recording equipment and storage media
Each kind of recording equipment has its strengths and weaknesses in usability, convenience, accuracy, expense, power requirements, and recording media and format. To choose equipment, draw on your experience and training, and consult colleagues and reviews.
You should distinguish between a recording device, the carrier it uses to store data, and the format and resolution of that data. As an example for audio, a Marantz PMD 661 makes solid state recordings in archive-ready format (WAV), but its carrier is a removable SDHC memory card.
The current range of available audio recorders provides excellent quality in compact sizes and at moderate prices. The most popular recorders amongst language documenters are currently the Zoom H4n, Zoom H2, and the Edirol R-09HR, and there are also good recorders available from Marantz, Fostex, Sony and Olympus. For reviews of some recommended recorders see http://www.hrelp.org/archive/resources.
The market for video recorders is subject to rapid change, so we can provide only basic recommendations. Video cameras must have a connector for an external microphone: never use a camera’s built-in microphone for language documentation work. At the time of writing (Spring 2011), a good choice is an HD camera recording in MPEG-4/H.264/AVCHD to its built-in flash memory, removable memory card, or hard drive. You will also need additional software for converting and editing (many cameras come with software with very restricted functionality).
The data storage medium for many audio and video recorders is now converging on memory (“flash”) cards, such as CF and SDHC. Standard cards (rather than the latest, fastest and most expensive ones) will work quite well in most recorders. These cards have fallen in price to a level (around US$2 per GB) where it is no longer advisable to erase them in order to make new recordings. It is wiser to budget for as many cards as you will need for your field recordings, and keep the cards with those recordings in a labelled “card library”, which then serves as a valuable additional backup or a means of sending the files to another site.
Finally, one of the most important factors in recording audio or video is the choice and use of microphones. For documentation work, the current trend is to make stereo recordings in order to capture spatial information about the location of the speakers, and to help in separation of voices (and even background noise) when listening back to transcribe or translate. For most projects, more than one microphone is needed in order to cater for a variety of recording situations.
Archiving is for the benefit of depositors, the language community, and other researchers or interested people in the future. It provides you with security for your materials. The process of archiving involves preparing materials so that they are as informative and explicit as possible, encoding them in the best ways to ensure long-term preservation and accessibility, and then delivering them safely. For a summary of key points in preparing your materials for archiving, see the Endangered Languages Archive (ELAR) website.
ELAR is a digital archive; all materials are stored electronically. This enables us to hold all forms of media and, in addition, to provide integration and navigation amongst them. The extent to which materials can be searched and navigated depends largely on how you prepare your data and metadata. As described above, preparing materials involves much more than handing over data files. We encourage you to produce rich, structured documentations that exploit the capabilities of the digital medium. Layers of linguistic and other information can be added in order to label and give structure to data, and to make links between items. We recommend linking the data as fully as possible: for example, between transcriptions and audio/video (e.g. as time-aligned annotation, showing the relationship between the text and the time offset in the corresponding audio/video); between analysed text and lexical/grammatical resources; or between text and images.
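Time-aligned annotation can be pictured as a list of text segments, each tied to a start and end offset in the recording. The sketch below is a simplified illustration with hypothetical names and placeholder text; annotation tools such as ELAN use their own, richer formats.

```python
from dataclasses import dataclass

@dataclass
class Annotation:
    start_s: float       # offset from the start of the recording, in seconds
    end_s: float
    transcription: str
    translation: str

def annotation_at(annotations, time_s):
    """Return the annotation whose time span contains time_s, or None."""
    for item in annotations:
        if item.start_s <= time_s < item.end_s:
            return item
    return None

# Placeholder segments; real data would come from a transcription of the recording.
annotations = [
    Annotation(0.0, 2.5, "(transcription of first utterance)", "(translation 1)"),
    Annotation(2.5, 6.0, "(transcription of second utterance)", "(translation 2)"),
]

hit = annotation_at(annotations, 3.2)
print(hit.translation if hit else "no annotation at this offset")
```

Because each segment carries its time offsets, a user can jump from a line of text to the matching stretch of audio or video, which is the kind of linkage recommended above.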
All deposits must be accompanied by metadata describing the sources, other characteristics of recordings, and data files (see discussion of metadata above).
In addition to archiving with ELAR, you should identify an institution such as a library, archive, educational institution, or community centre that is accessible to members of the language community, and make arrangements for materials to be deposited with that institution.
We use the term ‘protocol’ to refer to sensitivities, access restrictions, and intellectual property issues associated with documentation materials. Your documentation metadata should fully describe all such issues, especially any sensitivities or access restrictions that apply to materials. ELAR will observe these.
While you as the documenter may reserve access to some materials for research purposes for a certain period of time during and after your project, materials should always be accessible to those who provided them, and to other language community members, except under extraordinary circumstances.
Archiving and dissemination
Dissemination of digital materials, typically via the World Wide Web, is an entirely different process from archiving. Publishing materials on the World Wide Web is not a form of archiving:
- archived materials are typically more comprehensive than would normally be published on the World Wide Web
- typically, web-based materials have no guarantee of preservation
- archives contain some materials that are not currently publishable due to sensitivities but may be important for future revitalisation of the language, or research of various kinds
Documentation should also include properly described records of the status (or restrictions, sensitivities etc) of materials. Typically, an archive such as ELAR will provide World Wide Web access to a catalogue of materials and, where appropriate, access to materials themselves. Restrictions and sensitivities expressed by the language community should be respected.
An archive catalogue informs the public about the existence of materials, allowing them to be 'discovered' through internet searching. Catalogues may or may not provide direct access to the actual content of the materials. Archive policies differ but some materials will be made available to various users, subject to the conditions/restrictions attached to materials or parts of materials, and depending on the type of user.
Intellectual property, protocol and access to materials
Documentation metadata should fully describe any sensitivities or restrictions that apply to materials. Descriptions of IP rights, sensitivities and other conditions should be collected by the researcher as part of the research and archived together with the materials. ELAR will observe these.
While researchers may reserve access to some materials for research purposes for a certain period of time during and after their research, materials should remain accessible to those who provided the data and other language community members except under extraordinary circumstances. Intellectual property rights and sensitivities are not acceptable reasons for not archiving materials that are collected in a documentation project, since appropriate access restrictions can generally be applied.
This page updated: 16 February 2011