Issues in Designing a Corpus of Spoken Irish

Issues in Designing a Corpus
of Spoken Irish

Elaine Uí Dhonnchadha, Alessio Frenda, Brian Vaughan

Centre for Language and Communication Studies
Trinity College Dublin
Ireland.

Overview

 Linguistic Background
 Corpus Design
 Pilot Corpus
 Data Collection and Recording
 Transcription
 Corpus Processing
 Future work

LREC-2012: SALTMIL-AfLaT Workshop on “Language technology for normalisation of less-resourced langua
2

Irish

 Indo-European - Celtic language
 Verb initial language (VSO)
 Irish is the first official language of Ireland -
English is the second official language.
 Irish is spoken as a first language (L1) in only
a small number of areas known as Gaeltachtaí.
 Irish is learned at school as a second language
by the majority of the population.

3

Irish Speaking Regions

Na Gaeltachtaí 1
2. Donegal
3. Mayo 2
4. Galway 7
3
5. Kerry
6. Cork
7. Waterford
4
8. Meath 6

5

4

Irish

 1.6 million of the 3.9 million
population report proficiency in the
spoken language.
 The number of native speakers is 64
thousand
 These sociolinguistic conditions mean
that a comprehensive spoken corpus
can play a vital role in promoting and
preserving the spoken language.

5

Motivation

 Linguistic research
 language change, language contact …
 phonology, syntax,
 semantics, pragmatics, discourse etc.
 Lexicography (new Irish-English
dictionary project due to start in 2013)
 Teaching materials
 Speech Recognition

6

Existing Resources

 Spoken Language Collections
 Caint Chonamara (1964) 1.2 mill. wds
 Iorras Aithneach Irish (pub. 2007)
 Doegen Records Web Project
(1928-1931) (various dialects)
 Other dialectal studies (without audio
files)

7

Motivation

 Various difficulties …
 one dialect, or one year
 Different dialects but mainly songs,
stories, monologues
 Very little dialogue
 Book and CD format (pdf)
 Some phonetic transcriptions but not
other linguistic annotation
 Limited searchability

8

Motivation

 Need a spoken corpus which is:
 Dialectally balanced
 Diachronically balanced
 Gender/age balanced
 L1 and L2 speakers
 Text aligned with audio/video file
 Linguistically annotated

9

Corpus Design

 We examined the design of a number of corpora:
 London-Lund Corpus of Spoken English
 Lancaster/IBM Spoken English Corpus (SEC)
 Corpus of Spoken New Zealand English
 British National Corpus (BNC)
 COREC (Corpus oral de referencia del Español
Contemporáneo)
 CLIPS (Corpora e Lessici dell’Italiano Parlato e Scritto)
 ICE (The International Corpus of English)
 CGN (Corpus Gesproken Nederlands)

10

Corpus Design

 One common feature shared by the more
recent corpora surveyed here is the extent
of naturalistic conversational material they
include.
 Our design is heavily influenced by ICE and
CGN

11

Corpus Design
Dialogues Private (250, 42%) [r] Face-to-face conversations (120, 20%)
(420, 70%) [r] Phone calls (50, 8.5%)
[r] Video calls (50, 8.5%)
[r] Interviews with teachers of Irish (30, 5%)
Public (170, 28%) [r] Classroom Lessons (40, 7%)
Broadcast Discussions (40, 7%)
Broadcast Interviews (40, 7%)
Parliamentary Debates (20, 3%)
[r] Legal cross-examinations (10, 1.5%)
[r] Business Transactions (20, 3%)
Monologues Unscripted (90, 15%) Spontaneous Commentaries (40, 7%)
(180, 30%) Unscripted Speeches (20, 3%)
Demonstrations (20, 3%)
[r] Legal Presentations (10, 1.5%)
Scripted (90, 15%) Broadcast News (40, 7%)
Broadcast Talks (40, 7%)
Non-broadcast Talks (10, 1%)

12

Corpus Design

 Our design considers the following
variables:
 Time frame
 Dialectal variation
 Sociolinguistic variation
 Gender and age
 Context and subject matter

13

Time Frame

 We have decided upon the three
time periods
 P1: 1930-1971
 P2: 1972-1995
 P3: 1996-present

14

Dialectal Variation

 We aim to cover the main dialects
of Irish in equal measure
 i.e. not proportionally to the number of
speakers of each dialect (which may
have varied over the years)
 Ulster (north)
 Connaught (west)
 Munster (south)

15

Sociolinguistic Variation

 We aim to include Irish speakers from
all linguistic backgrounds
 ‘Traditional’ native speakers (L1)
 Non-native speakers (L2)
 ‘Non-traditional’ native speakers (L1),
 i.e. those who were raised through Irish
by L1 or L2 parents, typically in a non-
Gaeltacht setting

16

Gender and Age Variation

 We aim to represent both males and
females proportionally
 We aim to represent different
generations i.e. young adults,
middle aged and elderly speakers

17

Content Variation

 We aim to record conversations in a
variety of contexts (informal, work,
leisure, education etc.) and cover a
variety of topics.

 Overall we aim for a spoken corpus
of 2 million words approx.

18

Pilot Corpus
 Funded by Foras na Gaeilge
 P3: 1996-present (contemporary)
 Dialogues
 Mainly public broadcast dialogues (mp3
podcasts of radio interviews and
discussions).
 We also carried out a small amount of
video recording of private dialogue
conversations.

20

Data Collection
 Four pairs of volunteers agreed to be
video recorded in informal conversation in
the Speech Communications Laboratory,
TCD
 Video recorded using a Sony HDR-XR500v
High Definition Handycam.
 The audio was recorded in two ways:
 using the onboard camera microphone and
 using two Sennheiser MKH-60 shotgun
microphones and an Edirol 4-channel HD Audio
recorder.

21

Data Collection

 Audio was recorded at a sampling rate of
96KhZ with a bit rate of 24 bits.
 Bounced down a sampling rate of
44.1KhZ with a bit rate of 16bits (the
Redbook audio standard), with the higher
96KhZ files being used for archiving.

22

Podcast Extracts

 70 x 8 min. audio extracts were transcribed
giving 102,000 words of transcribed speech
(8.5 hours approx.).
 We also aligned and formatted some
existing transcripts, Frenda (2011) material
transcribed for PhD research TCD (20K);
 Wigger (2000) Caint Chonamara (10K);
 Dillon, G. material transcribed for PhD research TCD
(5K).
 overall total 140,000 words (approx.)
106 transcripts, 151 speakers

23

Transcription
 Spoken and written language differ in a
number of important respects.
 The syntactic structure of spontaneous
spoken utterances is usually simpler
 Spontaneous speech: repetitions, false
starts, hesitations or non-verbal
communication such as a gesture or the
tone of voice.
 Dialectal pronunciations deviate
substantially from standard
orthographical representations

24

Transcription Guidelines
 Phonetic or Orthographic transcription
 We examined a number of transcription
conventions already in use including
 CHAT: The CHAT (Codes for the Human Analysis of
Transcripts) System is a comprehensive standard
for transcribing and encoding the characteristics of
spoken language (MacWhinney, 2000).
 LINDSEI: Louvain International Database of Spoken
English Interlanguage Transcription guidelines
http://www.uclouvain.be/en-307849.html
 LDC: Linguistic Data Consortium
http://www.ldc.upenn.edu /Creating/creating_annotated.

25

CHAT Guidelines

 The CHAT (Codes for the Human Analysis
of Transcripts) (MacWhinney, 2000).
 These guidelines were developed for the
transcription of spoken interactions
between children and their carers in order
to study child language acquisition.
 Inaudible segments, phonetic fragments,
repetitions, overlaps, interruptions,
trailing off, foreign words, proper nouns
and numbers etc.

26

CHAT Guidelines

 the guidelines are very comprehensive
but there are a few drawbacks to
implementing the guidelines in full
 it can slow down the transcription process
considerably
 some are quite subjective (short, medium
and long pauses)
 while others are difficult to implement
(retracings and reformulations)

27

LDC Transcription Guidelines

 LDC guidelines advocate simplicity
 Keep the rules to a minimum in order to
make transcription as easy as possible for
the transcriber, which increases
transcription speed, accuracy and
consistency
 In addition automatic procedures are used
when possible

28

Transcription

 On average 30 minutes to
orthographically transcribe 1 minute of
audio material.
 Transcription process must be as
straightforward and intuitive as possible.
 Minimum number of codes and keystrokes
 [repeated material],
 xxx, < … > [?],
 [% comment], @laugh etc., @eng,
 filled pauses {yeah, ehm, uh..}

29

Transcription

 Dialectal variation
 maith ‘good’ /mah/ or /maɪ/
 an-mhaith ‘very good’ /ənə'wa/
or /ənə'waɪ/ or /ənə'va/
 Initial mutations
 ag déanamh ‘doing’ /ə d´ianəv/
or /ə ʤanu/ (not a’ déanamh)
 standard orthography

30

 Advantages to using standard
orthography:
 It makes the job of transcription easier and
quicker for transcribers
 It helps mimimise spelling inconsistencies among
transcribers as only standard spelling is used,
apart from predefined lists permitted exceptions
 Attempting to represent actual pronunciation in
orthography is difficult and prone to
inconsistency. It can be more accurately
captured in a separate phonetic transcription
layer (which may be partially generated from the
orthography).

31

 Standard orthography facilitates corpus
querying and lexical searches
 Standard orthography facilitates automatic
text processing, such as part-of-speech
tagging and parsing
 Transcription codes for some linguistic
features (e.g. co-articulation effects, elision
etc.) would require specialist training for
transcribers, in order to ensure accuracy and
consistency, and are better undertaken as a
separate task.

32

Transcription Software
 We tested several pieces of freely-
available transcription and annotation
software (e.g. Praat, ELAN, Anvil, CLAN,
Xtrans, Transcriber)
 We chose Transcriber
http://trans.sourceforge.net
 It has a straightforward user interface
 It facilitates alignment of the audio and text
transcription in XML format
 Audio duration and word count information at
a glance
 Transcripts can be conveniently exported as
text

33

Transcription Software
 It handles a variety of audio file types,
including .wav, .mp3 (podcasts) and .ogg
 The later version of the software,
TranscriberAG, can handle video as well as
audio
 It facilitates the annotation of various features
of spontaneous speech (overlap, interruptions,
coughs, laughs, etc.) as well as linguistics
categories (e.g. proper nouns, human/animate
etc. etc.) if desired
 It can be used with foot pedals for increased
speed if necessary

34

Transcribers
 Audio segments of 8 min. in duration
broadcast discussions and interviews
Raidio na Gaeltachta podcasts.
 Panel of 22 transcribers recruited
 Workpackages were sent via e-mail to
members of the panel who worked from
home. (filenames, speaker ids)
 They returned a time-aligned
transcription and timesheet for each
workpackage completed.

35

Transcription Checking
 Each transcript was checked for accuracy
against the audio file by a member of the
project team.

 In the case of new video-recordings, the
transcripts were also anonymised, i.e.
names and places which could identify the
participants were replaced by fictitious
names to ensure anonynity.

36

Corpus Processing

 Corpus Metadata
 XCES Corpus Encoding Standard
 Part-of-Speech Tagging
 SketchEngine Corpus Query Tool

37

Corpus Metadata
 All relevant details related to speakers,
transcripts and transcribers are recorded
in a database.
 Each speaker is given a speaker code
which is used in the transcript in place of
the speaker’s name, in order to make
speakers less recognisable.
 Speaker attributes such as
 dialect,
 language acquisition type, (L1-G L1-NG L2)
 gender and age, etc
 are recorded where known.

38

Corpus Metadata

 Corpus database is used to generate XML
corpus headers, and to facilitate onging
monitoring of word counts of the various
corpus design categories.

39

XCES – XML Corpus Encoding Std.

 For each transcript, the output of the
Transcriber software was transformed into
TEI compliant XCES (XML Corpus
Encoding Standard) format using a Perl
script and data from the corpus database.

40

Speech Turns

 All of the transcripts to date involve
conversations between at least two
participants (dialogues).

 It is quite common, particularly in radio
interviews, for spoken interactions to take
place between speakers with different
dialects or between native and non-native
speakers.

41

Speech Turns

 In order to create sub-corpora on the
basis of dialect, native/non-native status,
speaker, age, gender etc. then these
features must be recorded at the level of
speaker-turn rather than for the
transcript as a whole.

42

XML - XCES
 <doc id = "irbs0012" title = "Barrscéalta 08 October
2010" period = "1996-pres" medium = "broadcast-
radio"spokentype = "interview" text_source = "GALA-
TCD" av_source = "RnaG podcast">
 <speaker_turn id = "200" code = "RNG_ANC" dialect =
"Ulaidh" gender = "Bain" actype = "L1 Gaeltacht" year
= "2010">
 caidé méid airgid a chosnódh sé na bádaí seo a thabhairt
suas chun dáta agus cloígh lena rialacha úra atá tagtha
isteach?
 </speaker_turn>
 <speaker_turn id = "559" code = "RNG_LCI" dialect =
"Mumhan" gender = "Fir" actype = "L1 Gaeltacht?"
year = "2010" >
 Bhuel ehm braitheann sé sin ar chaighdeán an bháid, abair,
agus níl aon dabht faoi ach go bhfuil sé costasach, abair, [tá
tá] tá tuairiscí …

43

Part-of-Speech Tagging

 All transcripts are lemmatised and POS
tagged
 Using finite-state tools (xfst/foma) and
Constraint Grammar (VISL cg3)

44

SketchEngine

 All POS tagged transcripts are converted
to vertical format and loaded in to the
SketchEngine Corpus Query system

45

Future Work

 Extensive Data Collection is required
 Archives need to be examined for suitable
material (diachronic corpus)
 Quality control procedures for
transcription standards need to be
formalised
 Testing and enhancement of POS tagging
tools for spoken language

46

Websites

 GaLa TCD Website
https://www.scss.tcd.ie/SLP/gala/index.utf8.html
 GaLa in the SketchEngine
http://the.sketchengine.co.uk/

47

Go raibh maith agat!

Thank you!

Issues in Designing a Corpus of Spoken Irish

Recomendados

Recomendados

Más contenido relacionado

Similar a Issues in Designing a Corpus of Spoken Irish

Similar a Issues in Designing a Corpus of Spoken Irish (20)

Más de Guy De Pauw

Más de Guy De Pauw (20)

Último

Último (20)

Issues in Designing a Corpus of Spoken Irish

Notas del editor