The Hans Rausing Endangered Languages Project

Key points for archiving with ELAR

This document summarises methods you should use to manage your data throughout your project and to prepare for archiving. You should also register with ELAR (if you are not already registered) to understand its systems and survey some of our deposits.

This document version: 1.3 (11 May 2011)

1. Filenames

Keep filenames simple and consistent. Set up file naming conventions at the beginning of your project, document them, and use them rigorously.

Do not stuff filenames with metadata content (such as people's names or place names); instead, think of filenames as keys or indexes which also appear listed in a metadata table together with the appropriate information about their contents.

We strongly recommend that you use only these characters in filenames:

lower case letters (a-z), upper case letters (A-Z), numbers (0-9), hyphen (-), underscore (_)

There must be a single full stop (period) followed by the correct extension.

Do not use spaces in filenames.
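As an illustration only, the character rules above can be checked automatically. The following Python sketch is not an ELAR tool: the name is_valid_filename is hypothetical, and the rule that an extension contains only letters and numbers is an assumption made for this example, since the guidance above says only that the extension must be correct.

```python
import re

# Recommended characters: a-z, A-Z, 0-9, hyphen and underscore, followed by
# exactly one full stop and an extension. The extension pattern (letters and
# numbers only) is an assumption made for this example.
FILENAME_PATTERN = re.compile(r"[A-Za-z0-9_-]+\.[A-Za-z0-9]+")


def is_valid_filename(name: str) -> bool:
    """Return True if the whole name matches the recommended pattern."""
    return FILENAME_PATTERN.fullmatch(name) is not None
```

For example, is_valid_filename("fg27.wav") is accepted, while "fg 27.wav" (contains a space) and "fg27.final.wav" (two full stops) are rejected.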

2. Folders/directories

Use folders as part of a logical data management strategy. Just as for filenames, keep folder names simple and consistent (see also 3: Relationships between files: file groups, and bundles).

The entire path name for any file - which means its name together with the names of all folders and disk volumes it is nested within - must be fewer than 200 characters. This will present no problem if you follow the advice here on file and folder names.
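One possible way to find paths that reach the 200 character limit is sketched below in Python. The function name overlong_paths is hypothetical. The sketch measures the absolute path on the computer where it is run, so results will depend on where your data is stored.

```python
from pathlib import Path

MAX_PATH_LENGTH = 200  # limit stated in the guidance above


def overlong_paths(root: str) -> list[str]:
    """Return absolute paths of files under root that are 200 characters or longer."""
    too_long = []
    for path in Path(root).rglob("*"):
        full = str(path.resolve())
        if path.is_file() and len(full) >= MAX_PATH_LENGTH:
            too_long.append(full)
    return sorted(too_long)
```

An empty list means that every file found under the chosen folder is within the limit.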

3. Relationships between files: file groups, and bundles

Various files in your data set will be associated together, such as an audio file and its transcription. Such associations must be explicitly represented. There are three ways to do this:

  1. place the whole group of associated files in one single folder
  2. give the associated filenames the same name root, e.g. fg27.wav, fg27.trs, fg27.txt
  3. represent the group in a spreadsheet, database or table, for example
     group_id   audio_file   transcription   comments
     fg27       fg27.wav     fg27.trs        fg27.txt

As the table shows, you can combine (1), (2) and (3), although (3) is the most flexible, because filenames don't have to match, and you can include any particular file in many different groups (e.g. an image of a speaker whose voice is in many recordings).
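If you follow approach (2), the groups can be recovered from the filenames themselves. The Python sketch below shows one way this might be done; the function name group_by_root is hypothetical, and it assumes that the name root is everything before the final full stop.

```python
from collections import defaultdict
from pathlib import PurePath


def group_by_root(filenames):
    """Group filenames that share a name root (the part before the extension)."""
    groups = defaultdict(list)
    for name in filenames:
        groups[PurePath(name).stem].append(name)
    # Sort each group so that the result is stable and easy to compare.
    return {root: sorted(names) for root, names in groups.items()}
```

Given fg27.wav, fg27.trs, fg27.txt and fg28.wav, this produces one group named fg27 with three files and one group named fg28 with a single file. It cannot express a file that belongs to several groups, which is why approach (3) is more flexible.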

At ELAR, we call these groups "bundles". We aim to present your data to users in the form of bundles. Therefore, the better you can conceptualise and represent the associations and grouping of files, the more usable your ELAR deposit will be. See also 9. Metadata - format for further details of bundling.

4. Audio and video recordings, formats and processing

All audio should be recorded as WAV (uncompressed) digital files using a digital audio recorder such as Edirol R-09, Zoom H2, or better. Use default "full resolution" settings of 44.1 kHz, 16 bit, stereo, unless you have a special reason to do otherwise. Record in stereo wherever possible, using the best quality condenser microphones you can obtain.
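To confirm the settings of recordings you have already made, you could inspect each file's header. The Python sketch below uses the standard wave module, which reads uncompressed PCM WAV files and may not open every WAV variant. The function names are hypothetical. A file that differs from the defaults is not necessarily wrong, since you may have had a special reason for other settings.

```python
import wave

DEFAULT_SETTINGS = (44100, 16, 2)  # 44.1 kHz, 16 bit, stereo


def wav_settings(path):
    """Return (sample_rate_hz, bit_depth, channels) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        # getsampwidth() reports bytes per sample, so multiply by 8 for bits.
        return wav.getframerate(), wav.getsampwidth() * 8, wav.getnchannels()


def matches_default_settings(path):
    """Return True if the file uses exactly the default settings."""
    return wav_settings(path) == DEFAULT_SETTINGS
```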

You can retain the filenames assigned by your recorder, or rename them, but in either case make sure to prepare suitable metadata entries that describe and associate the recording (see 3: Relationships between files: file groups, and bundles).

ELAR encourages you to edit (trim and/or segment) audio and video recordings if it makes your documentation more concise, or in preparation for transcription or annotation, but if you do edit them, make sure to create the edited versions as full resolution versions so that they can be archived. Do not apply filters, effects, or noise reduction to any audio.

For video, formats and settings are dependent on the particular camera and the rapidly changing technology for video. You should maintain and send metadata that accurately describes the formats and any conversions applied to your video files.

See our video document for further information.

Your archive deposit should contain a selection of recordings and other materials. You should decide the criteria for selection, which could be based on quality, content, uniqueness, or correspondence to the plans in your funding application. Where appropriate, document your selection criteria. See 11: Selection.

5. Annotation

Many documenters annotate recordings with detailed transcriptions, glossing, and translation, using software such as ELAN, Transcriber, Praat, and Toolbox.

Few documenters are able to annotate all of their recordings in this way. However, it is a serious problem if a large number of recordings have little indication of what's in them. Whatever proportion of recordings you annotate in detail, ELAR now requires you to provide some kind of annotation, description, indexing, signposting, or summary for every recording in your deposit. This is necessary to enable us to grasp the content of your recordings, and for users to discover relevant materials. You can provide such a coarse or overview annotation at whatever level of detail is feasible for you. If you have limited time, then do an overview annotation (which you can complete in approximately the time it takes to play the recording just once at normal speed).

An overview annotation can be considered as a kind of "roadmap" or index of a recording. It could consist of approximately time-aligned information about what is in the recording, who is participating, and other interesting phenomena. For example, you could write:

"from 1-3 mins Auntie Freda is singing the song called Fat frog; from 3-7 mins Harry Smith is telling a story about joining the army; from 7-10 mins there is some interesting use of applicative morphology; from 15-18 mins there is rude content that should not be used for teaching children"

This could be written as prose (as above) or, better, structured into a table.

If you are familiar with software such as Transcriber or ELAN, you can do an overview annotation by marking breaks in topics/speakers etc, and typing descriptive text into the segments between breaks. Another strategy is to simply type a number into the time-aligned segment and then create a table which links the numbers with the overview information categories. Assuming you have made a Transcriber or ELAN file with the time-aligned segment numbers, your table would look like this:

segment_number   speaker        topic/content                               comments
1                Auntie Freda   song: Fat frog
2                Harry Smith    story about joining the army
3                               interesting use of applicative morphology
4                                                                           rude content, should not be used for teaching children

You can accomplish this overview annotation in "real time" - no longer than the duration of the recording - because, assuming you know and remember the circumstances of the recording, you can just type the information as you listen. Making overview annotations means that if (unfortunately) you never get around to transcribing or translating, at least there is some information so that the content can be identified, and the archive can index it. This will also be of assistance to you in the future, if you decide to transcribe/translate/re-use materials.

If you have even less time available, write a 50-100 word summary describing the recording's content.

A "best practice" technique is to perform real-time annotation of all recordings before undertaking any transcription. This will ensure that at least there will always be some kind of access to your materials. It will also enable you to more easily select and track your recordings for ongoing transcription or other annotation.

6. Avoid Microsoft Word documents

Do not send Microsoft Word documents for archiving. These documents cause problems for archiving because they are a proprietary format and must be converted at some point in the future to an open format. That means more work for the archive, and, more importantly, can result in loss of information (and you may not be around to check). Furthermore, presentational formats such as MS Word encourage their users to think about how the information looks on a printed page, rather than how to create explicit, self-documenting, multi-purpose electronic data. In the worst case, these problems compound: a future conversion of an MS Word document loses essential information, and the remaining data is of little value.

What are the alternatives? It is best to create plain text files. These contain only characters, and no formatting. Logical organisation of the data can be indicated by various kinds of labels or mark-up. Two common examples are FOSF (as used by Shoebox/Toolbox) and XML. You can even invent your own system as long as you document it and use it consistently.

Another alternative is to use HTML (or, better, XHTML) - also plain text files but allowing you some formatting options. Important: never use MS Word to generate HTML files from an existing Word document.

About PDF: PDF/A is an archive-friendly variant of PDF. When absolutely necessary, we accept or create PDF/A to avoid future problems with MS Word files. But PDF/A is not a universal solution, because the data can remain dependent on layout and typography to represent organisational logic, and is almost impossible to adapt or re-use for other purposes, so its long term value is diminished.

Open Office documents (.odt): Like PDF/A, these use open standards and are archive-friendly. But they can have the same potential drawbacks as PDF/A - dependence on layout and typography to indicate organisational logic. An advantage of Open Office is that it can be used to generate PDF/A.

Software used for linguistic annotation, such as ELAN, Transcriber, Toolbox, and Praat, generally creates files which are archive-friendly and need no special attention.

7. Images

Images tend to be under-used in documentation, or are not always sent for archiving. Future users of materials will find images of speakers, places, objects etc useful. Images will add valuable explanation to recordings and transcriptions. Sometimes, a series of photos or illustrations can be more effective than video, and requires less equipment, effort and storage space. Also consider images of field notes, maps, diagrams and local manuscripts (scanned or photographed).

If you have many images, do not simply send a "dump" of them - make a selection. Do not send images which are blurry or otherwise poor quality.

All images must be accompanied by some caption metadata. Images can be bundled opportunistically (see 3. Relationships between files: file groups, and bundles); for example, an image of a speaker can be associated with every recording in which he/she participates.

See also ELAR's advice document on images.

8. Metadata - content

Metadata is simply data about data, and its purpose is to carry the understanding of data to other users, at other times, for a variety of purposes. Metadata throws light on the events, participants, situations, objects, and processes associated with the creation of data. Metadata, or "meta-documentation", also includes description of the background and motivation of your project, its methodologies, human relationships and responses, and careful description of the conventions and assumptions used in representing the data.

It also plays an important role in the management of data (e.g. by archives), and enables the discovery of data by users. Given these various and demanding roles, and the diversity of data to be described, ELAR does not prescribe any particular set or formulation of metadata; rather, we emphasise that depositors should provide metadata that is as rich and descriptive as possible, and pays full attention to the nature of the data deposited (and the events and methods by which the data was created).

ELAR recognises two broad classes of metadata:

  1. deposit-wide metadata, such as collected in the ELAR Deposit Form (e.g. depositor's name, field location, name of language)
  2. "file-level" metadata for each file and bundle. Metadata must be provided for every file and bundle you wish to deposit.

Typically, "file-level" metadata will consist of a set of categories (also known as fields, columns, properties, or attributes), for which each file or bundle will be given a value. For example, for the category "Speakers" (or "Participant"), the value will consist of the name of the speaker(s) in each file or bundle. You should consult the OLAC set as a guide to the basic metadata categories you should use. You should also extend and complement these with categories relevant to your field situation, consultant's wishes, or special properties of the events or materials.

Your deposit will probably contain different kinds of materials for which different sets of metadata are appropriate. For example, images of consultants or of old manuscripts have a different set of attributes from those of a voice recording. In such cases, you could use any of these three approaches:

  1. provide a different metadata document for each data type. Each document should consist of a shared core of basic metadata, but with different extended metadata categories according to the properties of that data type
  2. provide a single metadata document with the full range of metadata categories, but with category values supplied only where appropriate
  3. provide one metadata document which holds the shared core metadata for all the files/bundles, together with additional metadata documents for each data type that provide only the extended metadata for that type.

Note: ELAR will share your basic, OLAC-style metadata with an international archive portal such as OLAC; your full set will be retained by ELAR.

Don't overlook the possibility of expressing metadata simply as unstructured prose, especially if you have difficulty creating categories and values as described above. A 50-100 word summary or "abstract" describing a recording can be just as informative as a set of category-values.

9. Metadata - format

Metadata is usually provided in either tabular or prose form. Other formats are possible, such as XML or Toolbox files. Given that metadata's function is to explain your data in order to make it understandable and usable, it is crucial that metadata is provided in a robust and consistent form. For tabular metadata, MS Excel spreadsheets (or Open Office Calc) are fine; prose metadata should be submitted as plain text.

Your metadata can be most efficiently used by ELAR for bundling (see 3. Relationships between files: file groups and bundles) if you send spreadsheets or tab delimited files which have the following properties:

  1. each line or row corresponds to a single bundle (i.e. group of associated files)
  2. the bundle has a unique title
  3. columns identify the file(s) included in the bundle; for example, an audio file column has the value "audio/fg27.wav" and a transcription file column has the value "transcripts/fg27.trs". Notice that in this example the files are located in different folders, and we show full paths, file names, and extensions so that all the files can be explicitly located in your dataset
  4. add other columns for speakers and other informative metadata (see 8: Metadata - content).
  5. when adding metadata that has multiple values per bundle (for example a recording with three speakers), put all values in a single column, and separate each value with a delimiter. ELAR uses "||" (the double pipe).
  6. see the following table as an example. In the example, the bundle contains two conversations, each involving the same three people. Normally there will be additional columns to contain basic and extended metadata categories:
bundle_id   audio_file       transcription_file     speaker(s)                           topic/content                                           date_recorded
fg27        audio/fg27.wav   transcripts/fg27.trs   Mary Smith||Barry Jones||Fred Ford   Conversation about storks||Conversation about weather   2008-07-21
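To show how the "||" delimiter works in practice, the Python sketch below reads tab delimited bundle metadata and splits every cell into a list of values. It illustrates the format described above and is not ELAR's own software; the function name read_bundles is hypothetical, and it assumes the first row holds the column names.

```python
import csv
import io

MULTI_VALUE_DELIMITER = "||"  # the double pipe described above


def read_bundles(tsv_text):
    """Parse tab delimited metadata into one dictionary per bundle.

    Every cell is split on "||", so each value is a list, even when
    the cell holds a single value.
    """
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    if not rows:
        return []
    header, records = rows[0], rows[1:]
    return [
        {column: cell.split(MULTI_VALUE_DELIMITER) for column, cell in zip(header, record)}
        for record in records
    ]
```

Applied to the example row, the speaker(s) column yields three names and the topic/content column yields two topics, while bundle_id yields a list containing only fg27.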

10. Protocol

ELAR uses the term "protocol" to refer to the sensitivities and restrictions that apply to parts of your data. Many materials relating to endangered languages can be sensitive and it is a central duty of the documenter to find out from consultants any such sensitivities and how these should be implemented as access restrictions. ELAR provides a carefully worked-out set of options for access control - see the ELAR Deposit Form and ELAR site, especially ELAR's access protocol page. We believe that by providing secure restriction on access where required, ELAR enables you to provide your consultants with the confidence to make as much data as possible openly accessible. Note that as Depositor, you can change the access restrictions on any item in your deposit at any time.

In your Deposit Form, note the protocol value that best applies to the overall deposit. For example, if most of the materials can be freely accessed but a few are restricted, then give the value for open access (P1).

Unless all the files in your deposit have the same protocol value, list their individual protocol values in your metadata. It is important that your file-level protocol is clear and accessible to ELAR staff so that we can implement it.

11. Selection

Materials should be selected for archiving. Do not send a "dump" of your hard disk or project files. Carefully prepare a collection, dataset or corpus for archiving by formulating and applying your selection criteria, which could be based on quality, content, uniqueness, or correspondence to the plans in your funding application. Where appropriate, document your selection criteria.

Remove temporary and incomplete files, and those that have not been inventorised or provided with metadata.
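One way to find files that have not yet been provided with metadata is to compare the contents of your deposit folder with the file paths listed in your metadata table. The Python sketch below assumes that the metadata lists paths relative to the deposit folder, written with forward slashes (as in "audio/fg27.wav"); the function name files_without_metadata is hypothetical.

```python
from pathlib import Path


def files_without_metadata(root, listed_paths):
    """Return relative paths of files under root that the metadata does not list."""
    root_path = Path(root)
    listed = set(listed_paths)
    found = (
        path.relative_to(root_path).as_posix()
        for path in root_path.rglob("*")
        if path.is_file()
    )
    return sorted(path for path in found if path not in listed)
```

Each path returned is a candidate either for removal or for a new metadata entry.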

12. Contact and delivery

Contact ELAR at archive@hrelp.org or djn@soas.ac.uk when you are preparing your deposit for archiving. In some cases, we will ask you to send samples first.

See our page on delivery of materials with detailed information on preparing and posting.

Unless you have a very small amount of data (e.g. less than 1GB), you should send it to ELAR on a portable hard disk or a flash memory card. CDs and especially DVDs have high failure rates and create needless work and confusion. ELAR can return your hard disk or card to you. If you do not have a suitable card or hard disk, ELAR can send one to you for you to copy your data onto and return to us.

Small amounts of data can simply be emailed to us at archive@hrelp.org, or sent via facilities such as Dropbox or YouSendIt. ELAR does not presently have an upload facility although in the future we may provide one.

You do not need to send all your data at once. You can add or update transcriptions and other files later (these can usually be emailed); in fact you are encouraged to deposit your materials as early as possible, even when transcriptions are minimal (but we strongly recommend you provide at least simple annotations - see 5. Annotation).

Don't forget that an ELAR deposit consists of some files, their metadata, and a filled-in ELAR Deposit Form.

Once your deposit has been received, ELAR will upload it to our secure, backed-up server. We will then curate it (check for quality, consistency, and archivability), at which time we may contact you about any problems. We will then accession your deposit, which means that it will appear in the ELAR catalogue at elar.soas.ac.uk, with its own home page, and with data accessible according to the protocol values you have provided. We will inform you when this happens, and you can then edit and maintain your deposit online.