The Hans Rausing Endangered Languages Project

What is Language documentation?

This page updated: 3 September 2007

Overview

One response to language endangerment has been the creation of a new discipline within linguistics called Language documentation (or Linguistic documentation). It is often said to have been catalysed by Nikolaus P. Himmelmann, who wrote in 1998:
The aim of a language documentation is to provide a comprehensive record of the linguistic practices characteristic of a given speech community... This... differs fundamentally from... language description [which] aims at the record of a language... as a system of abstract elements, constructions, and rules

[p. 166, "Documentary and descriptive linguistics", Nikolaus P. Himmelmann (1998). Linguistics 36. pp. 161-195. Berlin: de Gruyter]

Another factor in the emergence of Language documentation is that information technologies have now matured enough to allow us to create sound and video recordings, and integrate them with text and other explanatory or analytical material.

It is important to carefully archive language documentations that have been made, because such materials have irreplaceable value for language communities and for linguists and other researchers. Digital archives allow possibilities never before imagined: catalogues are accessible and searchable from anywhere with internet access, materials are easily deliverable by network or on CDs and DVDs, and communities can express sensitivities or restrictions to control access to materials.

Language communities also have stronger relationships with their language heritage materials, because they can use multimedia materials in local language support activities.

In addition to making materials discoverable and accessible through suitable archiving and cataloguing, the next big challenge facing Language documentation is the discovery and widespread use of software interfaces that make documentation materials easily and flexibly usable by a wide range of users. This does not mean simplifying or trivialising the data or the way we work with it; rather, it means working harder on providing mature and friendly software such as users already expect in domains such as office applications and computer games.

Doing documentation

Documenting a language is a complex process that involves finding speakers who can serve as language teachers (often called 'language consultants', and in former times 'informants') and then working together with them to study the language and its use. We begin by recording words and expressions, transcribing them (writing them down phonetically) and then analysing the materials to uncover the structure and functions of the language. The result of this kind of documentation is often a dictionary and a grammar of the language. In addition, we aim to collect 'texts', that is, stories, narratives, personal histories, explanations of how culturally important activities are carried out, speeches and other literary forms, including poetry and songs. Preferably, we record the performance of the texts - or other naturalistic language usage - using sound or video recorders. All of this will be transcribed, analysed and translated into a language of wider communication so that the materials can be used for a range of purposes by the language community, teachers, and researchers (provided that the community agrees that the materials can be used).

This kind of research typically involves fieldwork, going to a place (often a remote location) where speakers of a language live and working and living together with them there. To undertake fieldwork the researchers must be properly trained in the techniques of recording (sound and video), transcribing, analysing and translating languages that have never been studied before.

Methods

A language documentation project should aim to collect/create audio, video, graphic and text documentation material covering use of language in a variety of social and cultural contexts. The priorities for collecting, recording, analysing, and archiving are:
  • to create a range of high quality materials to support description of a variety of language phenomena
  • to enable the recovery of knowledge of the language even if all other sources are lost
  • to generate resources in support of language maintenance and/or learning
Projects will typically create materials in several types of media:
  • audio
  • video
  • images
  • written (e.g. transcription, description/analysis)
  • metadata (structured data about materials, typically in written form)
Together, these will form the language documentation and should contain a range of linguistic materials, such as:
  • spoken language in a variety of styles and contexts, recorded (in audio and/or video), with transcriptions, translations and annotations
  • written texts in a variety of styles, with transcriptions, translations and annotations
  • relevant sociological and cultural information
  • dictionary
  • thesaurus
  • pedagogical materials
  • grammar

Researchers should collect, and appropriately record, metadata for all of the collected materials (see also below).

At least 10% of collected data should be transcribed, translated, and annotated in detail.

The methods and terminology should be aimed at making knowledge about the language accessible to a wide audience: not only academics, but also community members, as well as learners and teachers. Hence, although the materials researchers prepare may well be of interest for later theoretical analysis, within a documentation project they should avoid expressing the language data in terms of any particular linguistic theory, except where absolutely necessary.

Types of media and their properties

Each type of media - audio, video, text, and metadata - has its own strengths and weaknesses for language documentation, and so a good documentation will consist of a combination of materials in different media.

Audio is the primary component of language documentation, and in general you can think of all other resources as complementing and expanding on the primary audio record. Compared to video, audio materials may contain less information, but the relative simplicity and familiarity of audio recording can result in a better linguistic record. Digital audio files are easy to work with, and there is a range of common and easy-to-use software for editing and presenting sound.

Video material can be authentic, engaging, and multi-dimensional in content. Video is often of particular interest to endangered language communities, and can be produced independently within communities without assistance from researchers. On the negative side, video is more difficult to create, and may cause problems for researchers or people appearing in the video. Video is also harder to process, transfer, store and preserve. It is difficult to locate and access video unless it is accompanied by time-aligned annotation.

Text, traditionally the main method of presenting linguistic material, is compact, stable, and easy to store, access, index, and reuse. Representing language use as text always involves some kind of abstraction and analysis, which may provide new resources and generalisations, while at the same time losing information that was in the original event or recording. Therefore, text resources that retain their connections to an original recording (preferably a connection that can be followed via a link or other explicit reference such as a time offset) provide much stronger forms of language documentation.
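As a rough illustration of text that keeps its connection to a recording through time offsets, the sketch below stores each annotation with a start and end offset in milliseconds. The class name, field names, and placeholder strings are assumptions made for this example; they do not describe any particular annotation tool's format.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    """One stretch of a recording with its text (hypothetical structure)."""
    start_ms: int        # offset from the start of the recording
    end_ms: int
    transcription: str
    translation: str


def annotations_at(annotations, offset_ms):
    """Return the annotations whose time span covers offset_ms."""
    return [a for a in annotations if a.start_ms <= offset_ms < a.end_ms]


def format_offset(ms):
    """Render a millisecond offset as MM:SS.mmm for display."""
    minutes, rest = divmod(ms, 60_000)
    seconds, millis = divmod(rest, 1000)
    return f"{minutes:02d}:{seconds:02d}.{millis:03d}"


# Placeholder content only; no real language data is implied.
tier = [
    Annotation(0, 2500, "(transcription 1)", "(translation 1)"),
    Annotation(2500, 6000, "(transcription 2)", "(translation 2)"),
]

hit = annotations_at(tier, 3000)[0]
print(format_offset(hit.start_ms), hit.transcription)
```

With offsets stored this way, a reader of the text can jump to the matching point in the audio or video, and a listener can find the text for any moment of the recording.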

Metadata

Metadata is "data about data" - structured information describing characteristics of events, participants, recordings and other data files. Metadata is important for effective archiving and discovery of materials. Although usually in the form of text, metadata can be considered an independent type of media because it is obtained and used entirely differently from all other types of media. Typically, metadata is collected and stored according to some formal specifications. Several types of metadata can be distinguished:

  • cataloguing (title, speakers, collectors, time and place of recording, language name etc)
  • descriptive (about content, relationship to other resources etc)
  • structural (what structural devices and patterns exist in the document)
  • technical (performance and preservation information, description of formats etc)
  • administrative (work log, responsibilities, access statements etc)

Which of these types of metadata you collect and store depends on the provenance and type of materials described, the usage and audience that the materials are likely to have, and the formal specification you adopt.
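The five types of metadata listed above can be pictured as sections of a single record. The following is only a sketch: the field names and example values are hypothetical and do not follow any formal metadata specification, which would normally dictate its own structure.

```python
import json

# The five metadata types named in the text.
METADATA_TYPES = (
    "cataloguing", "descriptive", "structural", "technical", "administrative",
)


def new_record(**sections):
    """Build a record with one dictionary per metadata type."""
    unknown = set(sections) - set(METADATA_TYPES)
    if unknown:
        raise ValueError(f"unknown metadata types: {sorted(unknown)}")
    return {t: dict(sections.get(t, {})) for t in METADATA_TYPES}


def missing_sections(record):
    """List the metadata types for which nothing has been recorded yet."""
    return [t for t in METADATA_TYPES if not record.get(t)]


# Hypothetical values, for illustration only.
record = new_record(
    cataloguing={"title": "Narrative recording 01", "speakers": ["Speaker A"]},
    technical={"format": "WAV", "sample_rate_hz": 44100},
    administrative={"access": "community and depositor only"},
)

print(json.dumps(record, indent=2))
print("Still to fill in:", missing_sections(record))
```

Keeping the sections separate makes it easy to see which kinds of metadata have not yet been collected for a given item.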

There are several tools for creating, editing, depositing and searching metadata. For web resources, see ELAR's Online Resources For Endangered Languages at http://www.hrelp.org/languages/resources/orel/.

Data formats

You will need to choose formats for your data. Choosing the best data formats can be complex, and formats can change as technologies and conventions evolve. It is important to distinguish at least the following:

  • character encoding: how characters are represented, e.g. Windows/ANSI, Unicode, Big5, Latin 5 (ISO 8859-9)
  • data encoding: how meaningful structures in the data are marked (using, for example, XML, Shoebox, MS Word tables, spreadsheet columns and labels etc)
  • file encoding: how all the data is packaged into a file (e.g. plain-text, MSWord, PDF)
  • physical storage medium: the physical form used to store the file (e.g. CD, minidisc, hard disk etc)

In some cases, there will be standard or conventional choices. For character encoding, for example, many texts using "Simplified Chinese" characters have GB encoding, although Unicode (ISO 10646) is a preferable and increasingly-used option.
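To make the difference between character encodings concrete, here is a minimal sketch that re-encodes GB-encoded text as Unicode (UTF-8) using Python's built-in codecs. It assumes the source encoding is known in advance; `gb2312` is used here as one member of the GB family, and the sample string is only a stand-in for real data.

```python
def reencode(data: bytes, source_encoding: str, target_encoding: str = "utf-8") -> bytes:
    """Decode bytes from a known source encoding and re-encode them."""
    text = data.decode(source_encoding)   # raises UnicodeDecodeError if the guess is wrong
    return text.encode(target_encoding)


sample = "中文"                        # two Simplified Chinese characters
gb_bytes = sample.encode("gb2312")     # legacy GB encoding: 2 bytes per character here
utf8_bytes = reencode(gb_bytes, "gb2312")

print(len(gb_bytes), "bytes in GB,", len(utf8_bytes), "bytes in UTF-8")
print(utf8_bytes.decode("utf-8") == sample)
```

The characters are the same in both cases; only the byte representation changes, which is why the encoding must be recorded alongside the file.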

Some linguists use databases, or SIL's Shoebox which marks data structures using "field markers" at the beginning of lines. XML, however, offers the ability to encode more complex and explicit structures, and is a more robust archiving format. See below for more information about XML.
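The following sketch illustrates the contrast between field markers and XML by converting a few field-marked lines into XML elements. It makes several assumptions: records are separated by blank lines, each marker is a backslash followed by a short code, the marker codes are valid XML element names, and the element names `database` and `entry` are invented for this example. The sample entries are placeholders, not real lexical data.

```python
import xml.etree.ElementTree as ET

SAMPLE = r"""
\lx word1
\ps n
\ge first gloss

\lx word2
\ps v
\ge second gloss
"""


def parse_records(text):
    """Split field-marked text into records of (marker, value) pairs."""
    records, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:                       # blank line ends a record
            if current:
                records.append(current)
                current = []
        elif line.startswith("\\"):        # a new field: \marker value
            marker, _, value = line[1:].partition(" ")
            current.append((marker, value.strip()))
        elif current:                      # continuation of the previous field
            marker, value = current[-1]
            current[-1] = (marker, f"{value} {line}".strip())
    if current:
        records.append(current)
    return records


def to_xml(records):
    """Wrap each record in an <entry> and each field in an element named after its marker."""
    root = ET.Element("database")
    for fields in records:
        entry = ET.SubElement(root, "entry")
        for marker, value in fields:
            ET.SubElement(entry, marker).text = value
    return ET.tostring(root, encoding="unicode")


print(to_xml(parse_records(SAMPLE)))
```

In the XML version the boundaries of each entry and field are explicit, rather than implied by line position.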

For file encoding, it is generally best to use open, non-proprietary, formats. Proprietary formats, such as those produced by MS-Word or FileMaker Pro, can be changed or superseded by their publishers, or may be commercial secrets, so they make poor choices for archiving.

However, making the best choices may not always be easy. Proprietary software tools can be familiar or efficient tools for working with data, so they might be used to prepare data which is then exported to more standard or archivable formats such as XML; this needs careful planning. Some formats, such as PDF ("Portable Document Format", created by Adobe Systems), are proprietary but open, and can be created and read by many software products, thus making them an acceptable format in certain circumstances. For further information, see Bird and Simons (2003).

XML

The Extensible Markup Language (XML) is the preferred data format for text materials, especially for archiving, but its use is not obligatory. XML is a document description language, used to describe the content of structured documents - each part of a structured document is described within a defined and logical structure (the structure can be documented in an XML schema or "DTD"). XML documents can be designed, created, processed and transformed manually or using editors, stylesheets (XSLT "extensible stylesheet language for transformations"), and document processing scripts.

Writing XML documents by hand can be difficult. Until recently, data was typically stored in databases and then exported into an XML file for data exchange or transformation. However, there are now a number of XML editors that can be used to create XML documents, to check markup tag syntax (well-formedness), to create DTDs, and to transform their data structures.
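A well-formedness check of the kind mentioned above can also be scripted. The sketch below uses the parser in Python's standard library; note that it only checks tag syntax and does not validate a document against a DTD or schema. The element names in the examples are invented.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return (True, "") if the text parses as XML, else (False, error message)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        return False, str(err)
    return True, ""


good = "<text><s>(sentence)</s></text>"
bad = "<text><s>(sentence)</text>"      # <s> is never closed

print(is_well_formed(good))
print(is_well_formed(bad))
```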

Some linguistic tools use XML to structure their underlying data, or can export data as XML, including:

  • Transcriber for audio annotations
  • Elan for audio and video annotation (from MPI-Nijmegen)
  • Toolbox for text and lexicon annotations (can export data as XML)

Sound and video formats

Real-time media (audio and video) is the area where there is the most rapid technological change, the most difficulty in making the best choices, and the most uncertainty about long-term preservation.

For sound, use uncompressed data at CD or better quality and encoded as WAV or CD-Audio (CD quality is 44.1 kHz, 16-bit, stereo; emerging audio archiving standards favour 48 kHz, 24-bit). Most documenters are now using solid state audio recorders (such as the Marantz PMD 660) which record directly into such formats. While minidisc (MD) recording is convenient, and can provide adequate sound quality for language documentation if properly managed, minidisc machines may use a proprietary compressed format ("ATRAC") which must be converted to an open format.
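One way to check whether a recording meets these figures is to read the header of the WAV file. The sketch below uses Python's standard `wave` module, which handles uncompressed PCM files; the function names and the threshold check are assumptions made for this example. The demonstration writes one second of silence to an in-memory buffer rather than relying on a real file.

```python
import io
import wave


def describe_wav(source):
    """Read basic format properties from a WAV file (path or file object)."""
    with wave.open(source, "rb") as w:
        return {
            "channels": w.getnchannels(),
            "sample_rate_hz": w.getframerate(),
            "bit_depth": w.getsampwidth() * 8,
            "duration_s": w.getnframes() / w.getframerate(),
        }


def meets_cd_quality(info):
    """True if sample rate and bit depth are at least 44.1 kHz and 16-bit."""
    return info["sample_rate_hz"] >= 44100 and info["bit_depth"] >= 16


def pcm_bytes_per_second(sample_rate_hz, bit_depth, channels):
    """Data rate of uncompressed PCM audio."""
    return sample_rate_hz * (bit_depth // 8) * channels


# Demonstration: one second of 44.1 kHz, 16-bit stereo silence in memory.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as out:
    out.setnchannels(2)
    out.setsampwidth(2)            # 2 bytes = 16 bits per sample
    out.setframerate(44100)
    out.writeframes(b"\x00" * 44100 * 2 * 2)
buffer.seek(0)

info = describe_wav(buffer)
print(info, meets_cd_quality(info))
print(pcm_bytes_per_second(44100, 16, 2), "bytes per second at CD quality")
print(pcm_bytes_per_second(48000, 24, 2), "bytes per second at 48 kHz, 24-bit")
```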

For video, currently MPEG2 format is recommended, although this might be superseded by MPEG4 in the future. These formats involve "lossy" compression: using them causes some of the information to be permanently lost. Although formats with lossy compression are not ideal for archiving, it is currently impossible to archive uncompressed digital video due to its large file sizes.
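A back-of-the-envelope calculation shows why uncompressed video is so large. The figures below are assumptions chosen for illustration (a 720 x 576 frame, 25 frames per second, 3 bytes per pixel, no audio); they are not taken from any archive's specification.

```python
def uncompressed_video_bytes(width, height, fps, bytes_per_pixel, seconds):
    """Size of raw video frames, ignoring audio and container overhead."""
    return width * height * bytes_per_pixel * fps * seconds


# Assumed standard-definition parameters, for illustration only.
per_second = uncompressed_video_bytes(720, 576, 25, 3, 1)
per_hour = uncompressed_video_bytes(720, 576, 25, 3, 3600)

print(f"{per_second / 1e6:.1f} MB per second")
print(f"{per_hour / 1e9:.1f} GB per hour")
```

Under these assumptions, a single hour of footage would need on the order of a hundred gigabytes before any compression.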

Recording equipment and storage media

Each kind of recording equipment has its strengths and weaknesses of usability, convenience, accuracy, expense, power requirements, and recording media and format. To choose equipment, draw on your experience and training, and consult colleagues and reviews.

You should distinguish between recording equipment, the carrier it uses to store data, and the format properties of that data. For example, although a Marantz PMD 660 can be used to make solid state recordings in archive-ready format (WAV), its Compact Flash memory cards may not be suitable for long-term storage, so the data should be copied to hard disk and other backup as soon as possible. For minidisc, not only are the physical media not suitable for long-term preservation, but also the audio data may need to be converted to an open format.
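Copying recordings from a memory card to hard disk and backup can be scripted so that each copy is verified. The sketch below compares SHA-256 checksums after copying; the function names and the directory layout are hypothetical, and this is one possible approach rather than a prescribed procedure.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_verified(source, destinations):
    """Copy source into each destination directory and confirm the copies match."""
    source = Path(source)
    expected = sha256_of(source)
    copies = []
    for dest_dir in destinations:
        dest_dir = Path(dest_dir)
        dest_dir.mkdir(parents=True, exist_ok=True)
        target = dest_dir / source.name
        shutil.copy2(source, target)          # copy2 also preserves timestamps
        if sha256_of(target) != expected:
            raise IOError(f"checksum mismatch for {target}")
        copies.append(target)
    return expected, copies
```

A call such as `copy_verified("card/rec01.wav", ["disk/recordings", "backup/recordings"])` (paths invented for this example) would return the checksum and the paths of the verified copies; the checksum can be kept with the technical metadata.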

At the time of writing (August 2007), recording equipment is changing rapidly: DAT recorders, formerly the standard device for high quality field recording, are now completely defunct. In their place, new solid state recorders with high quality microphones (such as the Sony PCM-D1), and compact mid-quality recorders with inbuilt hard disks are appearing. The Marantz PMD 660 is currently the most popular audio recorder amongst language documenters. While Hi-MD minidisc recorders can now record uncompressed WAV sound data, the decreasing prices and greater convenience of solid state recorders have made the latter more attractive.

Archiving

As a condition of your ELDP funding, you must create language documentation materials suitable for archiving and deposit the usable materials from your project with ELAR. You do not need to archive everything you produce during the project; for example, raw notes, unedited video or audio, or large numbers of similar photographs need not be deposited.

Archiving is for the benefit of the language community and other researchers or interested people in the future. It involves preparing materials so that they are as informative and explicit as possible, encoding them in the best ways to ensure long-term accessibility, and then storing them safely.

In addition to archiving with ELAR, you should identify an institution such as a library, archive, educational institution, or community centre that is accessible to members of the language community, and make arrangements for materials to be deposited with that institution.

ELAR is a digital archive; all its materials are stored electronically. This enables us to hold all forms of media and, in addition, to provide integration and navigation amongst them. The extent to which the materials can be searched and navigated depends largely on how you prepare the data and metadata. Preparing materials involves much more than handing over data files. We encourage you to produce rich, structured documentations that match the capabilities of the digital medium. Important layers of linguistic representation can be added in order to structure and label data, and to make links between various items. We recommend making as many links as possible across the data: for example, between transcriptions and audio/video (e.g. as time-aligned annotation, showing the relationship between the text and the time offset in the corresponding audio/video); between analysed text and lexical/grammatical resources; or between text material and images.

All archive deposits must be accompanied by metadata describing the sources and other characteristics of recordings and data files.

Archiving and dissemination

Dissemination of digital materials, typically via the World Wide Web, is an entirely different process from archiving. Publishing materials on the World Wide Web is not a form of archiving:

  • archived materials are typically more comprehensive than would normally be published on the World Wide Web
  • typically, web-based materials have no guarantee of preservation
  • archives contain some materials that are not currently publishable due to sensitivities but may be important for future revitalisation of the language, or research of various kinds
Documentation should also include properly described records of the status (or restrictions, sensitivities etc) of materials. Typically, an archive such as ELAR will provide World Wide Web access to a catalogue of materials and, where appropriate, access to materials themselves. Restrictions and sensitivities expressed by the language community should be respected.

An archive catalogue informs the public about the existence of materials, allowing them to be 'discovered' through internet searching. Catalogues may or may not provide direct access to the actual content of the materials. Archive policies differ but some materials will be made available to various users, subject to the conditions/restrictions attached to materials or parts of materials, and depending on the type of user.
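The distinction between discovering an item in a catalogue and being able to access it can be sketched as follows. The user types, titles, and field names are hypothetical and do not describe ELAR's actual access model; the point is only that every item appears in the catalogue while access depends on the conditions attached to it.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """A catalogued item with the user types allowed to open it (hypothetical)."""
    title: str
    allowed_user_types: frozenset
    restriction_note: str = ""


def catalogue(items):
    """Every item is discoverable, whatever its access conditions."""
    return [item.title for item in items]


def accessible(items, user_type):
    """Only items whose conditions admit this type of user can be opened."""
    return [item.title for item in items if user_type in item.allowed_user_types]


# Invented holdings, for illustration only.
holdings = [
    Item("Word list", frozenset({"community", "researcher", "public"})),
    Item("Ceremonial narrative", frozenset({"community"}),
         restriction_note="restricted at the community's request"),
]

print(catalogue(holdings))
print(accessible(holdings, "public"))
print(accessible(holdings, "community"))
```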

Intellectual property, protocol and access to materials

Documentation metadata should fully describe any sensitivities or restrictions that apply to materials. Descriptions of IP rights, sensitivities and other conditions should be collected by the researcher as part of the research and archived together with the materials. ELAR will observe these.

While researchers may reserve access to some materials for research purposes for a certain period of time during and after their research, materials should remain accessible to those who provided the data and other language community members except under extraordinary circumstances. Intellectual property rights and sensitivities are not acceptable reasons for not archiving materials that are collected in a documentation project, since appropriate access restrictions can generally be applied.