The document discusses the design of a corpus of spoken Irish. It outlines the linguistic background of Irish and issues with existing spoken language resources. It then describes the pilot corpus, including data collection from podcasts and conversations. Transcription guidelines were adapted from CHAT and LDC conventions to balance accuracy with transcription speed. The goal is to create a large, balanced corpus to support research and language preservation.
1. Issues in Designing a Corpus
of Spoken Irish
Elaine Uí Dhonnchadha, Alessio Frenda, Brian Vaughan
Centre for Language and Communication Studies
Trinity College Dublin
Ireland.
2. Overview
Linguistic Background
Corpus Design
Pilot Corpus
Data Collection and Recording
Transcription
Corpus Processing
Future work
LREC-2012: SALTMIL-AfLaT Workshop on “Language technology for normalisation of less-resourced langua
2
3. Irish
Indo-European - Celtic language
Verb initial language (VSO)
Irish is the first official language of Ireland -
English is the second official language.
Irish is spoken as a first language (L1) in only
a small number of areas known as Gaeltachtaí.
Irish is learned at school as a second language
by the majority of the population.
LREC-2012: SALTMIL-AfLaT Workshop on “Language technology for normalisation of less-resourced langua
3
4. Irish Speaking Regions
Na Gaeltachtaí 1
2. Donegal
3. Mayo 2
4. Galway 7
3
5. Kerry
6. Cork
7. Waterford
4
8. Meath 6
5
LREC-2012: SALTMIL-AfLaT Workshop on “Language technology for normalisation of less-resourced langua
4
5. Irish
1.6 million of the 3.9 million
population report proficiency in the
spoken language.
The number of native speakers is 64
thousand
These sociolinguistic conditions mean
that a comprehensive spoken corpus
can play a vital role in promoting and
preserving the spoken language.
LREC-2012: SALTMIL-AfLaT Workshop on “Language technology for normalisation of less-resourced langua
5
6. Motivation
Linguistic research
language change, language contact …
phonology, syntax,
semantics, pragmatics, discourse etc.
Lexicography (new Irish-English
dictionary project due to start in 2013)
Teaching materials
Speech Recognition
LREC-2012: SALTMIL-AfLaT Workshop on “Language technology for normalisation of less-resourced langua
6
7. Existing Resources
Spoken Language Collections
Caint Chonamara (1964) 1.2 mill. wds
Iorras Aithneach Irish (pub. 2007)
Doegen Records Web Project
(1928-1931) (various dialects)
Other dialectal studies (without audio
files)
LREC-2012: SALTMIL-AfLaT Workshop on “Language technology for normalisation of less-resourced langua
7
8. Motivation
Various difficulties …
one dialect, or one year
Different dialects but mainly songs,
stories, monologues
Very little dialogue
Book and CD format (pdf)
Some phonetic transcriptions but not
other linguistic annotation
Limited searchability
LREC-2012: SALTMIL-AfLaT Workshop on “Language technology for normalisation of less-resourced langua
8
9. Motivation
Need a spoken corpus which is:
Dialectally balanced
Diachronically balanced
Gender/age balanced
L1 and L2 speakers
Text aligned with audio/video file
Linguistically annotated
LREC-2012: SALTMIL-AfLaT Workshop on “Language technology for normalisation of less-resourced langua
9
10. Corpus Design
We examined the design of a number of corpora:
London-Lund Corpus of Spoken English
Lancaster/IBM Spoken English Corpus (SEC)
Corpus of Spoken New Zealand English
British National Corpus (BNC)
COREC (Corpus oral de referencia del Español
Contemporáneo)
CLIPS (Corpora e Lessici dell’Italiano Parlato e Scritto)
ICE (The International Corpus of English)
CGN (Corpus Gesproken Nederlands)
LREC-2012: SALTMIL-AfLaT Workshop on “Language technology for normalisation of less-resourced langua
10
11. Corpus Design
One common feature shared by the more
recent corpora surveyed here is the extent
of naturalistic conversational material they
include.
Our design is heavily influenced by ICE and
CGN
LREC-2012: SALTMIL-AfLaT Workshop on “Language technology for normalisation of less-resourced langua
11
13. Corpus Design
Our design considers the following
variables:
Time frame
Dialectal variation
Sociolinguistic variation
Gender and age
Context and subject matter
LREC-2012: SALTMIL-AfLaT Workshop on “Language technology for normalisation of less-resourced langua
13
14. Time Frame
We have decided upon the three
time periods
P1: 1930-1971
P2: 1972-1995
P3: 1996-present
LREC-2012: SALTMIL-AfLaT Workshop on “Language technology for normalisation of less-resourced langua
14
15. Dialectal Variation
We aim to cover the main dialects
of Irish in equal measure
i.e. not proportionally to the number of
speakers of each dialect (which may
have varied over the years)
Ulster (north)
Connaught (west)
Munster (south)
LREC-2012: SALTMIL-AfLaT Workshop on “Language technology for normalisation of less-resourced langua
15
16. Sociolinguistic Variation
We aim to include Irish speakers from
all linguistic backgrounds
‘Traditional’ native speakers (L1)
Non-native speakers (L2)
‘Non-traditional’ native speakers (L1),
i.e. those who were raised through Irish
by L1 or L2 parents, typically in a non-
Gaeltacht setting
LREC-2012: SALTMIL-AfLaT Workshop on “Language technology for normalisation of less-resourced langua
16
17. Gender and Age Variation
We aim to represent both males and
females proportionally
We aim to represent different
generations i.e. young adults,
middle aged and elderly speakers
LREC-2012: SALTMIL-AfLaT Workshop on “Language technology for normalisation of less-resourced langua
17
18. Content Variation
We aim to record conversations in a
variety of contexts (informal, work,
leisure, education etc.) and cover a
variety of topics.
Overall we aim for a spoken corpus
of 2 million words approx.
LREC-2012: SALTMIL-AfLaT Workshop on “Language technology for normalisation of less-resourced langua
18
20. Pilot Corpus
Funded by Foras na Gaeilge
P3: 1996-present (contemporary)
Dialogues
Mainly public broadcast dialogues (mp3
podcasts of radio interviews and
discussions).
We also carried out a small amount of
video recording of private dialogue
conversations.
LREC-2012: SALTMIL-AfLaT Workshop on “Language technology for normalisation of less-resourced langua
20
21. Data Collection
Four pairs of volunteers agreed to be
video recorded in informal conversation in
the Speech Communications Laboratory,
TCD
Video recorded using a Sony HDR-XR500v
High Definition Handycam.
The audio was recorded in two ways:
using the onboard camera microphone and
using two Sennheiser MKH-60 shotgun
microphones and an Edirol 4-channel HD Audio
recorder.
LREC-2012: SALTMIL-AfLaT Workshop on “Language technology for normalisation of less-resourced langua
21
22. Data Collection
Audio was recorded at a sampling rate of
96KhZ with a bit rate of 24 bits.
Bounced down a sampling rate of
44.1KhZ with a bit rate of 16bits (the
Redbook audio standard), with the higher
96KhZ files being used for archiving.
LREC-2012: SALTMIL-AfLaT Workshop on “Language technology for normalisation of less-resourced langua
22
23. Podcast Extracts
70 x 8 min. audio extracts were transcribed
giving 102,000 words of transcribed speech
(8.5 hours approx.).
We also aligned and formatted some
existing transcripts, Frenda (2011) material
transcribed for PhD research TCD (20K);
Wigger (2000) Caint Chonamara (10K);
Dillon, G. material transcribed for PhD research TCD
(5K).
overall total 140,000 words (approx.)
106 transcripts, 151 speakers
LREC-2012: SALTMIL-AfLaT Workshop on “Language technology for normalisation of less-resourced langua
23
24. Transcription
Spoken and written language differ in a
number of important respects.
The syntactic structure of spontaneous
spoken utterances is usually simpler
Spontaneous speech: repetitions, false
starts, hesitations or non-verbal
communication such as a gesture or the
tone of voice.
Dialectal pronunciations deviate
substantially from standard
orthographical representations
LREC-2012: SALTMIL-AfLaT Workshop on “Language technology for normalisation of less-resourced langua
24
25. Transcription Guidelines
Phonetic or Orthographic transcription
We examined a number of transcription
conventions already in use including
CHAT: The CHAT (Codes for the Human Analysis of
Transcripts) System is a comprehensive standard
for transcribing and encoding the characteristics of
spoken language (MacWhinney, 2000).
LINDSEI: Louvain International Database of Spoken
English Interlanguage Transcription guidelines
http://www.uclouvain.be/en-307849.html
LDC: Linguistic Data Consortium
http://www.ldc.upenn.edu /Creating/creating_annotated.
LREC-2012: SALTMIL-AfLaT Workshop on “Language technology for normalisation of less-resourced langua
25
26. CHAT Guidelines
The CHAT (Codes for the Human Analysis
of Transcripts) (MacWhinney, 2000).
These guidelines were developed for the
transcription of spoken interactions
between children and their carers in order
to study child language acquisition.
Inaudible segments, phonetic fragments,
repetitions, overlaps, interruptions,
trailing off, foreign words, proper nouns
and numbers etc.
LREC-2012: SALTMIL-AfLaT Workshop on “Language technology for normalisation of less-resourced langua
26
27. CHAT Guidelines
the guidelines are very comprehensive
but there are a few drawbacks to
implementing the guidelines in full
it can slow down the transcription process
considerably
some are quite subjective (short, medium
and long pauses)
while others are difficult to implement
(retracings and reformulations)
LREC-2012: SALTMIL-AfLaT Workshop on “Language technology for normalisation of less-resourced langua
27
28. LDC Transcription Guidelines
LDC guidelines advocate simplicity
Keep the rules to a minimum in order to
make transcription as easy as possible for
the transcriber, which increases
transcription speed, accuracy and
consistency
In addition automatic procedures are used
when possible
LREC-2012: SALTMIL-AfLaT Workshop on “Language technology for normalisation of less-resourced langua
28
29. Transcription
On average 30 minutes to
orthographically transcribe 1 minute of
audio material.
Transcription process must be as
straightforward and intuitive as possible.
Minimum number of codes and keystrokes
[repeated material],
xxx, < … > [?],
[% comment], @laugh etc., @eng,
filled pauses {yeah, ehm, uh..}
LREC-2012: SALTMIL-AfLaT Workshop on “Language technology for normalisation of less-resourced langua
29
30. Transcription
Dialectal variation
maith ‘good’ /mah/ or /maɪ/
an-mhaith ‘very good’ /ənə'wa/
or /ənə'waɪ/ or /ənə'va/
Initial mutations
ag déanamh ‘doing’ /ə d´ianəv/
or /ə ʤanu/ (not a’ déanamh)
standard orthography
LREC-2012: SALTMIL-AfLaT Workshop on “Language technology for normalisation of less-resourced langua
30
31. Transcription Guidelines
Advantages to using standard
orthography:
It makes the job of transcription easier and
quicker for transcribers
It helps mimimise spelling inconsistencies among
transcribers as only standard spelling is used,
apart from predefined lists permitted exceptions
Attempting to represent actual pronunciation in
orthography is difficult and prone to
inconsistency. It can be more accurately
captured in a separate phonetic transcription
layer (which may be partially generated from the
orthography).
LREC-2012: SALTMIL-AfLaT Workshop on “Language technology for normalisation of less-resourced langua
31
32. Transcription Guidelines
Standard orthography facilitates corpus
querying and lexical searches
Standard orthography facilitates automatic
text processing, such as part-of-speech
tagging and parsing
Transcription codes for some linguistic
features (e.g. co-articulation effects, elision
etc.) would require specialist training for
transcribers, in order to ensure accuracy and
consistency, and are better undertaken as a
separate task.
LREC-2012: SALTMIL-AfLaT Workshop on “Language technology for normalisation of less-resourced langua
32
33. Transcription Software
We tested several pieces of freely-
available transcription and annotation
software (e.g. Praat, ELAN, Anvil, CLAN,
Xtrans, Transcriber)
We chose Transcriber
http://trans.sourceforge.net
It has a straightforward user interface
It facilitates alignment of the audio and text
transcription in XML format
Audio duration and word count information at
a glance
Transcripts can be conveniently exported as
text
LREC-2012: SALTMIL-AfLaT Workshop on “Language technology for normalisation of less-resourced langua
33
34. Transcription Software
It handles a variety of audio file types,
including .wav, .mp3 (podcasts) and .ogg
The later version of the software,
TranscriberAG, can handle video as well as
audio
It facilitates the annotation of various features
of spontaneous speech (overlap, interruptions,
coughs, laughs, etc.) as well as linguistics
categories (e.g. proper nouns, human/animate
etc. etc.) if desired
It can be used with foot pedals for increased
speed if necessary
LREC-2012: SALTMIL-AfLaT Workshop on “Language technology for normalisation of less-resourced langua
34
35. Transcribers
Audio segments of 8 min. in duration
broadcast discussions and interviews
Raidio na Gaeltachta podcasts.
Panel of 22 transcribers recruited
Workpackages were sent via e-mail to
members of the panel who worked from
home. (filenames, speaker ids)
They returned a time-aligned
transcription and timesheet for each
workpackage completed.
LREC-2012: SALTMIL-AfLaT Workshop on “Language technology for normalisation of less-resourced langua
35
36. Transcription Checking
Each transcript was checked for accuracy
against the audio file by a member of the
project team.
In the case of new video-recordings, the
transcripts were also anonymised, i.e.
names and places which could identify the
participants were replaced by fictitious
names to ensure anonynity.
LREC-2012: SALTMIL-AfLaT Workshop on “Language technology for normalisation of less-resourced langua
36
37. Corpus Processing
Corpus Metadata
XCES Corpus Encoding Standard
Part-of-Speech Tagging
SketchEngine Corpus Query Tool
LREC-2012: SALTMIL-AfLaT Workshop on “Language technology for normalisation of less-resourced langua
37
38. Corpus Metadata
All relevant details related to speakers,
transcripts and transcribers are recorded
in a database.
Each speaker is given a speaker code
which is used in the transcript in place of
the speaker’s name, in order to make
speakers less recognisable.
Speaker attributes such as
dialect,
language acquisition type, (L1-G L1-NG L2)
gender and age, etc
are recorded where known.
LREC-2012: SALTMIL-AfLaT Workshop on “Language technology for normalisation of less-resourced langua
38
39. Corpus Metadata
Corpus database is used to generate XML
corpus headers, and to facilitate onging
monitoring of word counts of the various
corpus design categories.
LREC-2012: SALTMIL-AfLaT Workshop on “Language technology for normalisation of less-resourced langua
39
40. XCES – XML Corpus Encoding Std.
For each transcript, the output of the
Transcriber software was transformed into
TEI compliant XCES (XML Corpus
Encoding Standard) format using a Perl
script and data from the corpus database.
LREC-2012: SALTMIL-AfLaT Workshop on “Language technology for normalisation of less-resourced langua
40
41. Speech Turns
All of the transcripts to date involve
conversations between at least two
participants (dialogues).
It is quite common, particularly in radio
interviews, for spoken interactions to take
place between speakers with different
dialects or between native and non-native
speakers.
LREC-2012: SALTMIL-AfLaT Workshop on “Language technology for normalisation of less-resourced langua
41
42. Speech Turns
In order to create sub-corpora on the
basis of dialect, native/non-native status,
speaker, age, gender etc. then these
features must be recorded at the level of
speaker-turn rather than for the
transcript as a whole.
LREC-2012: SALTMIL-AfLaT Workshop on “Language technology for normalisation of less-resourced langua
42
43. XML - XCES
<doc id = "irbs0012" title = "Barrscéalta 08 October
2010" period = "1996-pres" medium = "broadcast-
radio"spokentype = "interview" text_source = "GALA-
TCD" av_source = "RnaG podcast">
<speaker_turn id = "200" code = "RNG_ANC" dialect =
"Ulaidh" gender = "Bain" actype = "L1 Gaeltacht" year
= "2010">
caidé méid airgid a chosnódh sé na bádaí seo a thabhairt
suas chun dáta agus cloígh lena rialacha úra atá tagtha
isteach?
</speaker_turn>
<speaker_turn id = "559" code = "RNG_LCI" dialect =
"Mumhan" gender = "Fir" actype = "L1 Gaeltacht?"
year = "2010" >
Bhuel ehm braitheann sé sin ar chaighdeán an bháid, abair,
agus níl aon dabht faoi ach go bhfuil sé costasach, abair, [tá
tá] tá tuairiscí …
LREC-2012: SALTMIL-AfLaT Workshop on “Language technology for normalisation of less-resourced langua
43
44. Part-of-Speech Tagging
All transcripts are lemmatised and POS
tagged
Using finite-state tools (xfst/foma) and
Constraint Grammar (VISL cg3)
LREC-2012: SALTMIL-AfLaT Workshop on “Language technology for normalisation of less-resourced langua
44
45. SketchEngine
All POS tagged transcripts are converted
to vertical format and loaded in to the
SketchEngine Corpus Query system
LREC-2012: SALTMIL-AfLaT Workshop on “Language technology for normalisation of less-resourced langua
45
46. Future Work
Extensive Data Collection is required
Archives need to be examined for suitable
material (diachronic corpus)
Quality control procedures for
transcription standards need to be
formalised
Testing and enhancement of POS tagging
tools for spoken language
LREC-2012: SALTMIL-AfLaT Workshop on “Language technology for normalisation of less-resourced langua
46
47. Websites
GaLa TCD Website
https://www.scss.tcd.ie/SLP/gala/index.utf8.html
GaLa in the SketchEngine
http://the.sketchengine.co.uk/
LREC-2012: SALTMIL-AfLaT Workshop on “Language technology for normalisation of less-resourced langua
47