The Use of Corpus Linguistics in
         Lexicography
       An Integrative Review



           Lexicography
            ENGL 6203




                 Submitted by:
         Ihsan Ibadurrahman (G1025429)
       Syareen Izzaty Bt Majelan (G1029580)
            Rudiana Razali (G1115202)
The Use of Corpus Linguistics in Lexicography

                              An integrative literature review


I. Introduction

The practice of English dictionary-making began in the early 1600s, when Robert Cawdrey compiled
a dictionary of words that were deemed difficult because they had been borrowed from other
languages (Siemens, 1994). The words were taken from Latin-English dictionaries and from other
available texts of the time, and each was given a concise definition, synonyms and a fixed form
(Siemens, 1994). It was Samuel Johnson who, in the 1700s, explicitly set out the methods he
followed in creating his dictionary. Some of these methods were later adopted by the committee
entrusted in the 1800s with creating “A New Dictionary”, now known as the Oxford English
Dictionary.
       A corpus is a collection of samples of authentic spoken and written text which is used
for the analysis of words, meanings, grammar and usage (David, 1992). In Saussurean terminology,
a text is akin to parole, while the corpus provides evidence of langue (Tognini-Bonelli, 2001).
The term corpus linguistics is used when a corpus is used specifically to study a language.
Lindquist (2009: 1) distinguishes it from other branches of linguistics such as
sociolinguistics (the study of language and society) or psycholinguistics (the study of
language and the mind) in that corpus linguistics is a specific method used in language study:
the “how to” rather than the “what”. In other words, corpus linguistics is an approach rather
than a specific field of language study (Gries, 2009).

       This paper aims to highlight major findings in the literature on corpus linguistics, with
an added emphasis on its use in dictionary-making. In developing this integrative literature
review, 18 sources were obtained: 13 books, 2 journal articles, and 3 online articles. After the
literature was reviewed, recurring ideas were compared, listed, and discussed. For ease of
reading, the literature has been grouped under separate subheadings, namely the pre-corpus era,
the initial corpus, and the present corpus.


II. Literature Review

a. Pre-corpus linguistics
Robert Cawdrey's Table Alphabeticall (1604) is considered to be the first monolingual English
dictionary, even though glosses of words had been made before it (Jackson, 2002). Cawdrey's
dictionary consisted of 2543 'hard' words: loanwords that were considered difficult for the
'uneducated' reader to learn. The words were gathered from Latin-English dictionaries and from
glosses of religious, legal and scientific texts (Siemens, 1994). Cawdrey provided a concise
definition of each word, a synonym or explanatory phrase, and a fixed form for many of the
difficult words (Siemens, 1994; Jackson, 2002). After Cawdrey's dictionary appeared, much effort
was made to improve the quality of dictionaries, and subsequent dictionaries followed Cawdrey's
method of extracting 'hard' words from different texts and including them in the dictionary.

         It was in 1755 that Samuel Johnson published a two-volume dictionary that he had worked
on for 9 years (Jackson, 2002). It remained the standard English dictionary for 150 years, until
the Oxford English Dictionary appeared in England, and it was the first dictionary to use
quotations to show how each word was used (Baugh & Cable, 2002). In a letter to his patron,
Johnson wrote that he had faced difficulties in adding a word to the dictionary, in the
following order:

   1) Selecting words. Johnson had to decide which words to include in the dictionary and
       to classify each word as foreign or English, since a great deal of borrowing had taken
       place from other languages. He also had to decide whether words from specific
       professions should be included.
   2) Orthography. Johnson proposed that no change should be made to the spelling of words
       without sufficient reason, because change would only inconvenience others and is a
       mark of weakness or inconsistency.
   3) Pronunciation. Johnson held that, along with orthography, pronunciation should also be
       constant, because stability is important to the lifespan of a language and any changes
       would create an almost new speech that would corrupt the spoken English of that time.
   4) Etymology and derivation. It is important to know the etymology of a word because,
       given the amount of borrowing from different languages, it is hard to discern which
       words are native to English.
   5) Analogy. The rules that govern how the words are used are included.
   6) Syntax. The construction of each word is shown, because the construction of English is
       too inconsistent to be reduced to rules alone.
   7) Phraseology. The phrases in which the word is used are included to illustrate the
       different ways the word can be used.
   8) Interpretation. Johnson considered the interpretation of a word to be the most
       difficult part of creating the dictionary, because he had to look at the different
       usages of each word and come up with the best explanation of it.
   9) Distribution. After all the above steps had been taken, Johnson slotted each word into
       its proper class.
       After more than 150 years as the main source of reference, with several revisions,
Johnson's dictionary was found to be inadequate for the standards of modern scholarship
(Jackson, 2002). In 1857 a committee was appointed to collect words not in the dictionary, to
be added as a supplement, but the committee found that a supplement would not be enough, and in
1858 it was decided that a new dictionary should be created (Baugh & Cable, 2002; Jackson, 2002).
The main aims of the new project were to record every word found in English from about the
year 1000 and to exhibit the history of each through a selection of quotations from the whole
range of English writings (Baugh & Cable, 2002). The editors gathered a total of six million
slips containing quotations from volunteers, not only in England but all over the world. After
24 years of hard work, they published the first instalment of the dictionary, covering part of
the letter A, in 1884. Another 16 years passed before four and a half volumes, reaching as far
as the letter H, had been published. Finally, in 1928, the last section of the dictionary was
issued, completing "A New Dictionary", now known as the Oxford English Dictionary (OED), some
70 years after the project began (Baugh & Cable, 2002). The committee drew up rules to be
observed by the editors of the OED before a word could be included in the dictionary, in the
following order (Considine, 1996):




   1) The Word to be explained.
   2) The Pronunciation and Accent.
   3) The Various Forms assumed by the word, and its principal grammatical inflexions.
   4) The Etymon of the word.
   5) The Cognate Forms in kindred languages.
   6) The Meanings which are logically deduced from the Etymology, and arranged to show
       the common thread or threads which unite them together.
       Even though over a century had passed since Johnson created his dictionary, some of the
steps he took were still used in creating the OED. This shows that Johnson's methods remained
relevant to lexicographers and were the main steps taken in making a dictionary before corpus
linguistics was introduced into dictionary-making.




b. The initial stage of corpus linguistics

In the 1950s, there was growing dissatisfaction that language theory (e.g. Noam Chomsky's
Syntactic Structures) could not account for the many 'ungrammatical' patterns found in English
(such as the distinction between transitive and intransitive verbs). There was a strong call for
empirical, real language data (Teubert, 2004). It was then that the modern corpus emerged. The
first corpora grew out of projects at two universities: the Survey of English Usage at the
University of London and the corpus project at Brown University in Providence. In the 1960s,
Brown University compiled a one-million-word corpus of written text from 500 text samples,
which was named the Brown Corpus. This American corpus was a landmark in corpus linguistics
since it was the first corpus to employ a computer in its making. In 1982, the British
counterpart, named the LOB corpus, was compiled by Hofland and Johansson. LOB stands for
Lancaster-Oslo/Bergen, and as its name suggests it was a collaboration between three
institutions: the University of Lancaster, the University of Oslo, and the Norwegian Computing
Centre for the Humanities in Bergen.

       However, both the Brown corpus and the LOB corpus were deemed inadequate as samples of
English vocabulary. This gave rise to John Sinclair's English Lexical Studies, which
specifically aimed to investigate vocabulary using electronic texts of spoken and written
language. The study gave prominence to collocation: words that naturally co-occur. Aiming to
represent varieties of English where it is used as a first or second language, Sidney Greenbaum
began compiling the one-million-word corpora of the International Corpus of English in 1988.
The unique feature of this corpus is that it samples more spoken language (60%) than written
language (40%).

       In the early 1990s, major universities and companies together compiled the British
National Corpus (BNC), containing 100 million words from 1980 up to 1993. The compilers were
Oxford University Press, Longman, Chambers, the British Library, Oxford University and Lancaster
University. The aim was to provide a balanced corpus that represents British English. The
corpus includes 10% spoken language and 90% written language, and the written part comprises
25% fiction and 75% non-fiction. One big distinction between the BNC and Brown is that the
former took much longer samples of text, between 40,000 and 50,000 words. This makes the BNC
more representative, since a text uses words differently at the beginning, in the middle, and
at the end (Lindquist, 2009). Because of its sheer size, its representativeness, and the care
taken in compiling it, most British publishers prefer to use this corpus as their source of
lexicographic information.

       Typically, any corpus goes through a three-step process in its making. Before these
three steps, however, the compiler needs to determine the basic outline of the corpus, such as
its size, its genres, and whether it will cover written language, spoken language, or both.
Sinclair (1996) points out that a corpus should be as large as possible and should include
samples from a broad range of material, so as to achieve the degree of representativeness that
the technology of the time allows. The corpus should also be classified into different genres
of even size. Once this basic outline is determined, the three-step process may begin. It
starts with collecting the data, spoken and/or written. This entails gathering a large mass of
speech and written texts, obtaining permission, and keeping careful and organized records. The
next step is computerization, which entails converting raw spoken or written text into a
digital format. Recorded speech may be painstaking to process since it needs to be transcribed
manually. Another concern with spoken text is the naturalness of the speech; it needs to be
recorded in a natural, casual way that resembles how people speak in everyday life, not in a
stilted way. Though written records seem less painstaking, they have their own problem, mainly
copyright. Moreover, some texts from books, magazines, and other written sources need to be
retyped, since scanning with OCR (optical character recognition) software, which detects and
converts words automatically, usually introduces errors, so many that it may be best to avoid
it altogether. The last step is annotation, which involves adding information such as part of
speech or etymology to each item of data. It should be noted that the three steps need not be
seen as separate processes; they are closely connected. For example, after gathering a
recording of speech, it may be best to transcribe it there and then.
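
The three steps described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the sample sentence, the toy part-of-speech lexicon, and the function names are hypothetical and are not taken from any of the corpus projects discussed here. Real corpora are annotated with far more sophisticated taggers.

```python
import re
from dataclasses import dataclass

# Hypothetical toy lexicon; real annotation relies on trained taggers.
TOY_LEXICON = {
    "the": "DET", "every": "DET",
    "dictionary": "NOUN", "word": "NOUN",
    "records": "VERB",
}

@dataclass
class Token:
    text: str
    pos: str  # part-of-speech tag; "UNK" when the toy lexicon has no entry

def collect():
    """Step 1: gather raw text (here a single hard-coded sample)."""
    return ["The dictionary records every word."]

def computerize(raw_texts):
    """Step 2: convert raw text into lowercase word tokens."""
    return [re.findall(r"[a-z']+", text.lower()) for text in raw_texts]

def annotate(tokenized):
    """Step 3: attach a part-of-speech tag to each token."""
    return [[Token(w, TOY_LEXICON.get(w, "UNK")) for w in sentence]
            for sentence in tokenized]

corpus = annotate(computerize(collect()))
for token in corpus[0]:
    print(token.text, token.pos)
```

Chaining the three functions reflects the point made above that the steps are closely connected rather than separate processes.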
       Corpora may have contributed a great deal to language study, but their impact on
lexicography did not begin until 1989. Together with advances in computer software, they have
since contributed significantly to the development of lexicography. Since much of the work is
automated and recorded in a digital format, lexicographers can now save much of the time and
the tremendous amount of work needed to compile a dictionary. Typically, a dictionary gives
information on the part of speech, usage, meaning, pronunciation, and etymology of a word.
Before the advent of corpora, all this information had to be gathered manually; lexicographers
had to do the hard labor of collecting slips of paper containing the text that they intended
to include in the dictionary. For this reason, it took roughly 50 years to complete the Oxford
English Dictionary, which was originally known as the New English Dictionary (Meyer, 2002).
With corpora, dictionary makers can now use a large sample of authentic spoken and written text
as a source to illustrate how each word in their list is used in real life. The citations used
in a dictionary come from real-life discourse. Real contexts also support accurate,
well-defined lexical meanings in dictionary definitions, which is a huge improvement over
earlier practice, where words were defined in an unscientific manner. Another huge improvement
is the rich information available for words that have many variant meanings, such as take, go,
and time, which tended to be overlooked in earlier dictionary practice (Lindquist, 2009).

       Another huge advantage of using corpora in lexicography is that information on word
frequency can also be obtained. This way, lexicographers can indicate whether a word is among
the 500 most common words, the next 500, and so on. Meyer (2002) notes that the most frequent
words are function words such as the, an, a, and, and of, which carry little lexical meaning,
and that the least frequent words are content words such as proper nouns. Gries (2009)
mentions two kinds of frequency information that lexicographers can obtain from a corpus:
frequencies of occurrence of linguistic elements, in the so-called frequency list, and
frequencies of co-occurrence of these linguistic elements, in concordances. Lindquist (2009: 5)
defines a concordance as “a list of all the contexts in which a word occurs in a particular
text”. Using a Key Word in Context (KWIC) concordance, words can be retrieved within their
surrounding text and presented vertically on the screen. Since the information is presented in
context, lexicographers can easily identify the collocations of each word in their dictionary.
Below is an excerpt from concordance software in which the word “corpus” is highlighted.




       Figure 1: A concordance produced with the software AntConc 3.2.2w (Gries, 2009).
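
The two kinds of frequency information discussed above, the frequency list and the KWIC concordance, can be sketched in a few lines of Python. This is only a rough illustration and not how AntConc or any other concordancer is implemented: the sample text, the function names, and the context width are all assumptions, and the tokenizer is deliberately simplistic.

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase word tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def frequency_bands(tokens, band_size=500):
    """Map each word to a frequency band: 1 = the `band_size` most common words,
    2 = the next `band_size`, and so on."""
    ranked = [word for word, _ in Counter(tokens).most_common()]
    return {word: rank // band_size + 1 for rank, word in enumerate(ranked)}

def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for each occurrence of `keyword`."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines

# Hypothetical sample text standing in for a real corpus file.
sample = ("A corpus is a collection of texts. "
          "Lexicographers search the corpus for evidence of how a word is used.")
tokens = tokenize(sample)

# Print the keyword in a vertical column with its context on either side.
for left, key, right in kwic(tokens, "corpus"):
    print(f"{left:>30}  {key}  {right}")
```

Aligning the keyword in one column is what lets recurring neighbours, and therefore candidate collocations, stand out to the lexicographer.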

       The figure above illustrates the concordance software AntConc in use. It should be
noted that the software does not come with a ready-made corpus; users need to supply their own
text file to generate KWIC output. The latest version of the software is 3.2.4w and can be
downloaded online at http://www.antlab.sci.waseda.ac.jp/software.html. Similar software that
lexicographers may use to find how words are used in context is WordSmith Tools, devised by
Mike Scott in 1993. Since then the software has gone through many changes and now includes a
concordancer, a word-lister, a web text downloader and many other features (Wikipedia, 2011).
Previous versions of the software were owned and sold by Oxford University Press. The current
version is owned by Lexical Analysis Software Ltd. The current WordSmith version is 5.0, and
can be downloaded online at: http://www.lexically.net/wordsmith/version5/index.html. However,
unlike AntConc, WordSmith is shareware. To unlock the demo version from the website, users need
to buy a single-user license for £50, or around $70-80, from one of two online retailers
(Lexical Analysis Software and Oxford University Press).

       Since a corpus is discourse-based, a word appears in a haphazard, arbitrary collection
of occurrences, as illustrated in the figure above. Dictionary makers need to check for
contradictions with the 'real' meaning. It is thus dangerous to depend solely on a corpus
(Teubert, 2004). One way to check a word in context is to expand the text by retrieving its
original source. This feature is lacking in both programs mentioned previously, AntConc and
WordSmith Tools. Fortunately, it is available for free from the Brigham Young University
website, which provides a concordancer for the BNC, COCA (Corpus of Contemporary American
English), and some other corpora, and can be accessed at:
http://corpus.byu.edu/

       The huge amount of data in a corpus also allows lexicographers to look for new words
that occur for the first time in spoken or written text. However, the corpus has to be large
enough to yield information on vocabulary items (Meyer, 2002). A small corpus such as the LOB
corpus, which holds roughly one million words, cannot give lexicographers enough information
on the range of vocabulary items. A monitor corpus is also needed, in which large amounts of
language data are added from time to time, rather than fixed in one particular time period.
This way, the corpus is frequently updated with the new words and meanings of today's growing
language.
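
As a rough sketch of how a monitor corpus might surface candidate new words, the Python snippet below compares a fresh batch of text against an existing headword list. The headword list, the sample batch, the function name, and the minimum-count threshold are all hypothetical choices made for illustration; they are not taken from any dictionary project described in this review.

```python
from collections import Counter

def new_word_candidates(known_headwords, new_tokens, min_count=2):
    """Return words from the new batch that are missing from the headword list.

    Only words seen at least `min_count` times are kept, on the assumption
    that one-off forms are more likely to be typos than genuine new words.
    """
    counts = Counter(token.lower() for token in new_tokens)
    return {word: n for word, n in counts.items()
            if word not in known_headwords and n >= min_count}

# Hypothetical headword list and newly collected text.
known = {"the", "word", "is", "new", "a"}
batch = "the word selfie is new a selfie is a word".split()

candidates = new_word_candidates(known, batch)
print(candidates)
```

The candidates would still need to be checked by a lexicographer in context, since frequency alone does not show that a form is an established word.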

       The first dictionary to be founded wholly on a corpus is the Collins COBUILD English
Language Dictionary, compiled in 1987 under the direction of John Sinclair. The dictionary
takes its citations from real-life discourse, and each word is defined from these authentic
texts instead of relying on previous dictionaries. This entails using a very large corpus, so
that all lemmas and their word senses can be included. However, this presents a problem in
that rare words such as apothegm tend to be excluded (Teubert, 2004). Besides being the first
corpus-based dictionary, COBUILD is innovative in that its definitions read like a classroom
teacher explaining the words. For example, in describing the word junk, it says: “You can use
junk to refer to old and second-hand goods that people buy and collect” (Jackson, 2002).

       In the practice of dictionary-making, one crucial distinction has to be made between a
corpus-based dictionary and a corpus-driven dictionary. Dictionaries such as the Collins
COBUILD series of English Language dictionaries are said to be corpus-based if the corpus is
used to validate information presented in the dictionary. However, if the corpus is used to
extract the information used in the dictionary, the dictionary is called corpus-driven.
Teubert (2004: 112) suggests that dictionaries should follow the corpus-driven approach so
that they may complement standard linguistics and not just extend it.




c. Modern corpus linguistics

       During the 1970s, computational research on English did not develop much in Birmingham,
because much effort went into devising software packages, instituting undergraduate courses
and influencing opinion on the campus (Sinclair, 1991). At that time, when computing was
beset by a series of crises, the importance of data processing was nevertheless highlighted.
It has taken approximately fifty years to make real progress in corpus-based linguistics,
driven by systems that work and by methodologies that can produce reasonable coverage of
linguistic phenomena (Lawler & Dry, 1998). Over the years, computational resources such as
fast machines and sufficient storage have become accessible for processing large volumes of
data. In addition, there is a growing availability of corpora with linguistic annotations
(for example, part of speech, prosodic intonation, and proper names) and of bilingual parallel
corpora. Furthermore, the maturing of computational linguistics technology has improved the
commercial market for natural language products, and corpus linguistics is now equipped with
efficient parsing and statistical techniques.
       From 1980 to 1986, computational work on language was put to good effect and was
transformed into a completely new set of techniques for language observation, analysis, and
recording. This also led to the editing of substantial dictionaries using these techniques
and a huge database of annotated examples.


       One of the most prominent uses of a corpus in recent years is as a resource for
lexicography. Corpus-based work in lexicography had been carried out for only a small number
of languages, and only recently has the need for very large corpora come to the fore.
Collaboration between lexicography and Natural Language Processing (NLP) has encouraged the
use of corpora in dictionary projects that have access to very large corpora (Hua, 2001).
       The computer has a clerical role in lexicography: it reduces the labor of sorting and
filing, and it can examine very large amounts of English in a short time (Sinclair, 1991). In
the late 1970s, the prospects of computerized typesetting were growing more realistic. By the
early 1980s, a multi-million-word corpus had become available for study, though access was
still limited. From simple beginnings, corpus tools have made substantial progress and have
yielded crucial, profound and basic linguistic generalizations (Lawler & Dry, 1998). These
tools have opened up many topics for inquiry that had not been well explored by traditional
linguistic methods.
       In the modern era, the word corpus has been reserved for collections of texts that are
stored and accessed electronically. Electronic corpora are usually larger than the small,
paper-based collections previously used to study aspects of language (Hunston, 2002). This is
because computers can store and process far larger amounts of information than was possible
before.
       One example of work in corpus linguistics is that of Johansson and colleagues, who
produced a parallel corpus of British English. Such work has made it possible for researchers
to scrutinize texts of far greater length than before. The main structural features of these
corpora are:
       -   A classification of printed texts into genres (15).
       -   A large number (500) of fairly short extracts (2000 words), giving a total of around
           one million words.
       -   A close to random selection of extracts within genres.


       As a result, a great amount of useful information can be extracted easily from the
corpora. In addition, many sites hold samples of text which provide hundreds of billions of
words. Available collections include the Association for Computational Linguistics' Data
Collection Initiative (ACL/DCI), the European Corpus Initiative (ECI), ICAME, the British
National Corpus (BNC), the Linguistic Data Consortium (LDC), the Consortium for Lexical
Research (CLR), and Electronic Dictionary Research (EDR), alongside standardization efforts
such as the Text Encoding Initiative (TEI) (Armstrong, 1994).
       The application of corpora in applied linguistics extends beyond lexicography to
language teaching, and it has benefited a wide variety of fields. Other relevant applications
of corpora include the production of dictionaries and grammars, critical linguistics,
translation, literary studies and stylistics, forensic linguistics, and the design of writer
support packages (Hunston, 2002).
       It is in dictionary-making that the contribution of corpora has been most far-reaching
and influential. The use of corpora has changed dictionaries by placing emphasis on frequency,
collocation and phraseology, variation, lexis in grammar, and authenticity (Hunston, 2002).
Recent innovations in dictionaries include the on-line Longman Web Dictionary and the Collins
COBUILD English Collocations on CD-ROM.


d. The use of corpora in language teaching


The use of corpora is not uncommon in many disciplines (McEnery & Wilson, 1996: 4). Apart from
lexicography, other possible areas include language teaching, discourse and pragmatics,
semantics, sociolinguistics, historical linguistics and stylistics. Within language teaching
there is a further branch known as CALL (Computer-Assisted Language Learning), which provides
another application of corpora. A study conducted at Lancaster University examined the role of
corpus-based computer software in teaching undergraduates the basic concepts of grammatical
analysis (Hua, 2001). The software, called Cytor, reads an annotated corpus, either
part-of-speech tagged or parsed, one sentence at a time. It then hides the annotation and asks
the students to annotate the sentences on their own. In addition, students can call up help in
the form of a list of tag mnemonics, examples from a frequency lexicon, or concordances.

          How effective is Cytor at teaching part-of-speech analysis? McEnery, Baker and Wilson
(1995, cited in Hua, 2001) compared two groups of students who received different treatments:
one group was taught with Cytor and the other through traditional lecturer-based methods. The
results suggest that the computer-taught students performed better than the human-taught
students throughout the term.

          Another use of corpora in language teaching and learning is the adoption of classroom
concordancing (data-driven learning) by classroom practitioners, for whom the corpus has
become a source of empirical teaching data (Hua, 2001: 5). One example of a resource on
Data-Driven Learning is Tim Johns's Home Page at http://web.bham.ac.uk/johnstf/. It provides
an outstanding online bibliographic database of books and articles related to corpora and
language teaching. It also includes online worksheets that involve corpora, for use in
classroom teaching. Another interesting resource is the “Grammar Safari” site developed at
Champaign-Urbana, which can be found online at
http://deil.lang.uiuc.edu/web.pages/grammarsafari.html and which provides a careful and
thoughtful selection of corpus-based activities. Furthermore, the Longman Grammar of Spoken
and Written English by Douglas Biber et al. can be used to answer students' questions about
grammar, drawing on a corpus categorized into fiction, conversation, news, and other
registers.




III. Discussions and Conclusions:

From the reviewed literature, it can be seen that dictionaries have been around for centuries.
The first dictionary was made in the 1600s and was based on what were considered difficult
words at that time. During this initial stage, lexicographers faced challenges in adding words
to their dictionaries: selecting words, orthography, pronunciation, etymology and derivation,
analogy, syntax, phraseology, interpretation, and distribution. All this information had to be
gathered manually; lexicographers had to do the hard labor of collecting slips of paper
containing the text that they intended to include in the dictionary. For this reason, it took
roughly 50 years to complete the Oxford English Dictionary, which was originally known as the
New English Dictionary. However, with the advent of corpus linguistics, things began to change
dramatically.
       From 1989, together with technological advances in computing, corpora made a
significant contribution to the development of dictionary-making. Corpus linguistics has had
a huge impact on dictionary-making in the following ways:

           a. It significantly reduces the time and the heavy work needed to compile a
               dictionary, since much of the work is automated and computerized.
           b. Each dictionary now reflects how language is used in the real world. Meaning is
               assigned from these samples, rather than from the writer's point of view.
           c. The frequency of each word in the list can be identified.
           d. Much more information can be given for words with many variant meanings, such
               as go and take.
           e. It makes it easy to include collocations, because words appear in their
               surrounding text.
           f. It can quickly take 'new' everyday words into the system.

       However, because a corpus is discourse-based, a word appears in a haphazard, arbitrary
collection of occurrences. Dictionary makers need to check for contradictions with the 'real'
meaning. It is thus dangerous to depend solely on a corpus. Another disadvantage of
corpus-based dictionaries is that they tend to exclude rare words (those not appearing in
real-world language) such as apothegm. The first dictionary to be based on a corpus is the
Collins COBUILD series of English dictionaries.


       A corpus is compiled to serve a linguistic purpose rather than to preserve the texts for
their intrinsic value (Hunston, 2002). It can also be used as groundwork for research. The way a
corpus is stored allows users to study it non-linearly, and both quantitatively and
qualitatively. A corpus does not contain new information about language; rather, it offers a new
perspective on familiar information. It shows us a way in which language can be examined. Most
available software packages process corpus data in three ways: showing frequency, phraseology,
and collocation (Hunston, 2002).
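       The phraseology view is typically presented as concordance lines, in which each
occurrence of a node word is shown with its immediate context. The following is a minimal sketch
under stated assumptions: the sample text and the function name kwic are made up for
illustration, and it does not reproduce the behaviour of any specific software package.

```python
import re

# Hypothetical sample text standing in for a real corpus.
SAMPLE = "We go home at noon. They go to work early. I go home late."

def kwic(text, node, width=2):
    """Return (left context, node, right context) for each occurrence of `node`.

    Context is measured in tokens and, for simplicity, ignores sentence boundaries.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    lines = []
    for i, token in enumerate(tokens):
        if token == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines

concordance = kwic(SAMPLE, "go")
for left, node, right in concordance:
    # Right-align the left context so the node words line up in a column.
    print(f"{left:>15}  {node}  {right}")
```

       Reading down the aligned column lets a lexicographer see recurring patterns around the
node word, such as a repeated right-hand neighbour.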
       Corpora have made life both simpler and more complex. They make life simpler when, for
example, a translator can quickly compare words that are more or less equivalent, or a teacher
can refer to the corpus to show why a particular usage is incorrect or inexact. On the other
hand, corpora can also make life more complex, in the sense that language is patterned in a much
finer way than we might have expected, so that a simple, general rule turns out to apply only in
certain contexts (Hunston, 2002).
       The term corpus is now reserved for collections of texts that are stored and accessed
electronically. Electronic corpora are usually larger than the small, paper-based collections
previously used to study aspects of language. Electronic corpora gave rise to recent innovations
in dictionaries, which include the online Longman Web Dictionary and the Collins COBUILD English
Collocations on CD-ROM.




References:

Armstrong, S. (1994). Using Large Corpora. Cambridge: MIT Press.

Baugh, A. C. & Cable, T. (2002). A History of the English Language. Oxon: Routledge.

Considine, J. (1996). The Meanings, deduced logically from etymology. In Gellerstam, M.;
  Jeker Jäborg; Sven-Göran Malmgren; Kerstin Norén; Lena Rogström & Catarina Röjder Pammehl
  (eds.), Euralex '96 Proceedings. Papers submitted to the Seventh EURALEX International
  Congress on Lexicography in Göteborg, Sweden. Göteborg: Göteborg University, Department of
  Swedish, 1996, 365-371.

David, C. (1992). An Encyclopedic Dictionary of Language and Languages. Oxford: Oxford
  University Press. Retrieved from:
  http://www.tuchemintz.de/phil/english/chairs/linguist/independent/kursmaterialien/language_computers/whatis.htm

Gries, S.T. (2009). 'What is Corpus Linguistics?', Language and Linguistics Compass, Vol. 3,
  pp. 1-14.

Hua, T.K. (2001). Corpora: Characteristics and Related Studies. Kuala Lumpur: Maziza Sdn
  Bhd.

Hunston, S. (2002). Corpora in Applied Linguistics. UK: Cambridge University Press.

Jackson, H. (2002). Lexicography: An Introduction. Oxon: Routledge.

Johnson, S. (1747). The Plan of a Dictionary of the English Language.

Lawler, J.M. & Dry, H.A. (1998). Using Computers in Linguistics: A Practical Guide. London:
  Routledge.

Lindquist, H. (2009). Corpus Linguistics and the Description of English. Edinburgh: Edinburgh
  University Press.

Mason, O. (2000). Programming for Corpus Linguistics: How to Do Text Analysis with Java.
  Edinburgh: Edinburgh University Press.

Meyer, C.F. (2002). English Corpus Linguistics. Cambridge: Cambridge University Press.

McEnery, T. & Wilson, A. (1996). Corpus Linguistics. Edinburgh: Edinburgh University Press.

Siemens, R. G. (1994). Robert Cawdrey: A Table Alphabetical of Hard Usual English Words
  (1604). Retrieved from http://www.library.utoronto.ca/utel/ret/cawdrey/cawdrey0.html

Sinclair, J. (1991). Corpus, Concordance, Collocation. Oxford: Oxford University Press.


Teubert, W. (2004). 'Language and corpus linguistics'. Lexicology and Corpus Linguistics.
  London: Continuum.

Tognini-Bonelli, E. (2001). Corpus Linguistics at Work. Amsterdam: John Benjamins
  Publishing Co.

WordSmith. (2011, October 15). In Wikipedia, The Free Encyclopedia. Retrieved April 22,
  2012, from http://en.wikipedia.org/w/index.php?title=WordSmith&oldid=455732307





The Use of Corpus Linguistics in Lexicography

  • 1. The Use of Corpus Linguistics in Lexicography An Integrative Review Lexicography ENGL 6203 Submitted by: IhsanIbadurrahman (G1025429) SyareenIzzatyBtMajelan (G1029580) RudianaRazali (G1115202)
  • 2. The Use of Corpus Linguistics in Lexicography An integrative literature review I. Introduction The practice of dictionary-making began as early as 1600 when Robert Cawdreyincluded words that were deemed difficult as they were borrowed from another language into his version of the dictionary (Siemens, 1994). The words from the dictionary were taken from Latin-English dictionaries and also available texts of the time and were given concise definitions, synonym and a fixed form (Siemens, 1994). It was Samuel Johnson who explicitly introduced the methods or steps that weretaken to create his dictionary in the 1700s and some of the methods were then followed by the committee entrusted to create “A New Dictionary” or currently known as the Oxford English Dictionary in the 1800s. A corpus is a collection of samples of authentic spoken and written text which are used for analysis of words, meanings, grammar and usage (David, 1992). In Saussurian terminology, the text is akin to that of parole, while the corpus provides the evidence of langue (Tognini&Bonelli, 2001). The term corpus linguistics is used when a corpus is specifically used to study a language. Lindquist (2009: 1) distinguishes the term with other branches of linguistics such as sociolinguistics (the study of language and society), or psycholinguistics (the study of language and the mind) in that corpus linguistics is a specific method used in language study, the “how to” rather than the “what”. In other words, corpus linguistics is an approach rather than a specific field of language study (Gries, 2009). This paper aims to highlight major findings in the literature on corpus linguistics withan added emphasis on its use in dictionary-making. In developing this integrative literature review, 18 sources were obtained:13 books, 2 journal articles, and 3 online articles. After all the literature is reviewed, recurring ideas found in the literature are compared, listed, and discussed. 
For ease of reading, the literature has been categorized into separate subheadings, namely, pre- corpus era, the initial corpus, and the present corpus. 1
  • 3. II. Literature Review a. Pre-corpus linguistics Robert Cawdrey'sTable Alphabeticall(1604) is considered to be the first monolingual English dictionary ever made even though glosses of words have been made prior to Cawdrey's dictionary (Jackson, 2002). Cawdrey's dictionary consisted of 2543 'hard' words which comprised of loanwords that were considered difficult to be learned by the 'uneducated' reader where the words were gathered from Latin-English dictionaries, glosses of religious, legal and scientific texts (Siemens, 1994).Cawdrey provided a concise definition of each word, a synonym or explanatory phrase and fixed form of many of the difficult words (Siemens, 1994; Jackson 2002). After the conception of Cawdrey's dictionary, a lot of effort have been made to better the quality of the dictionary and the subsequent dictionaries were made according to the methods employed by Cawdrey which was extracting 'hard' words from different texts and including them into the dictionary. It was in 1755 that Samuel Johnson published a two volume dictionary that he worked on for 9 years (Jackson, 2002). It became the standard for English dictionary for 150 years before the conception of the Oxford English Dictionary in England and was the first dictionary that used quotes to indicate how each word was used (Baugh & Cable, 2002). Johnson in his letter to his patron wrote that he had faced difficulties in adding a word into the dictionary in the following order: 1) Selecting words. Johnson had to decide on which words that he wanted to include in the dictionary and classify each word whether they are foreign or belong to English since a lot of borrowing has been made from other languages. He also had to decide if words from specific professions should be included in the dictionary. 2) Orthography. 
Johnson proposed that no change should be made to the spelling of words without a sufficient reason because change would only cause inconvenience to others and is a mark of weakness or inconsistency. 3) Pronunciation. Johnson says that along with orthography, pronunciation should also be constant because stability in a language is important to the lifespan of a language and any changes would create almost new speech which would corrupt spoken English of that time. 2
  • 4. 4) Etymology and derivation. It is important to know the etymology of the word because it is hard to discern which words are native to English with the amount of borrowings from different languages. 5) Analogy. The rules that governed how the words are used are included. 6) Syntax. The construction of each word is shown because the construction of English is too inconsistent that it would be difficult to be reduced to only rules. 7) Phraseology. The phrases in which the word is used are included to illustrate the different ways the words can be used. 8) Interpretation. Compared to the previous steps, Johnson considers interpretation of a word to be the most difficult part of creating the dictionary because he had to look at the different usages of each word and come up with thebest explanation of the word. 9) Distribution. After all the above mentioned steps have been taken, Johnson then slotted each word into their proper classes. After more than 150 years being the main source of reference with several revisions, Johnson‟s dictionary was found to be inadequate for the standards of modern scholarship (Jackson, 2002). So in 1857 a committee was appointed to collect words that are not in the dictionary to be added as a supplement but the committee found that it was not enough and in 1858 it was decided a new dictionary should be created (Baugh & Cable, 2002; Jackson, 2002). The main aims of the new project were to record every word that can be found in English from about the year 1000 and to exhibit the history of each from a selection of quotations from the whole range of English writings (Baugh & Cable, 2002). They gathered a total of six million slips containing quotations from volunteers not only from England but from all over the world as well. After 24 years of hard work, they managed to publish the first instalment of the dictionary that covers part of the letter A in 1884. 
Another 16 years passed when four and a half volume of dictionary was published until the letter H. Finally in 1928, the final section of the dictionary was issued making the effort to create "A New Dictionary" successful after 70 years and now known as the Oxford English Dictionary (OED) (Baugh & Cable, 2002). The committee came up with rules that have to be observed by the editors of OED before a word can be included in the dictionary in the following order (Considine, 1996): 3
  • 5. 1) The Word to be explained. 2) The Pronunciation and Accent. 3) The Various Forms assumed by the word, and its principal grammatical inflexions. 4) The Etymon of the word. 5) The Cognate Forms in kindred languages. 6) The Meanings which are logically deduced from the Etymology, and arranged to show the common thread or threads which unite them together. Even though over a century has passed since Johnson created his dictionary, some of the steps taken by Johnson were still used while creating the OED. This shows that the methods employed by Johnson were still relevant to lexicographers and were the main steps to be taken in making a dictionary before corpus linguistics was introduced in dictionary making. b. The initial stage of corpus linguistics In 1950s, there was a growing dissatisfaction of how language theory (e.g. Noam Chomsky‟s syntactic structure) could not reason out the many „ungrammatical‟ patterns found in English (i.e. distinction between transitive and intransitive verbs). There was a strong call for empirical, real language data (Teubert, 2004). It was then that corpus was invented. The first corpus was made out of a survey of English usage conducted by two universities, University of London and the Brown University Corpus in Providence. In the 1960s,both compiled its million word corpus of written text from 500 reading passages, which was named Brown Corpus. This American corpus was a landmark in corpus linguistics since it was the first corpus to employ a computer in its making. In 1982, the British version of the corpus, named the LOB corpus was compiledby Hofland and Johansson. LOB is an abbreviation from The Lancaster-Oslo-and Bergen, and as its name suggests it is a collaborative attempt between the three universities: the University of Lancster, the University of Oslo, and the University of Norwegian Computing Centre of the Humanities. 
However, both the Brown corpus and the LOB corpus were deemed inadequate for sampling English vocabulary. This gave rise to John Sinclair's English Lexical Studies, which specifically aimed to investigate vocabulary using electronic texts of spoken and written language. The study gave prominence to collocation: words that naturally co-occur. Aiming to represent varieties of English where it is used as a first or second language, Sidney Greenbaum initiated a set of one-million-word corpora called the International Corpus of English in 1988. The unique feature of this corpus is that it samples more spoken language (60%) than written language (40%).

In the early 1990s, major universities and companies together compiled the British National Corpus (BNC), containing 100 million words from 1980 up to 1993. The compilers were Oxford University Press, Longman, Chambers, the British Library, Oxford University and Lancaster University. The aim was to provide a balanced corpus that represents British English. The corpus comprises 10% spoken language and 90% written language; the written part consists of 25% fiction and 75% non-fiction. One big distinction between the BNC and Brown is that the former took its samples from longer pieces of text, of between 40,000 and 50,000 words. This gives the BNC the added advantage of being more representative, since a text uses words differently at the beginning, in the middle, and at the end (Lindquist, 2009). Owing to its sheer size, representativeness, and careful compilation, most British publishers prefer this corpus as their source of lexicographic information.

Typically, any corpus needs to go through a three-step process in its making. Before going through these three steps, however, the compiler needs to determine the basic outline of the corpus, such as its size, its genre, and whether it will cover written language, spoken language, or both. Sinclair (1996) points out that a corpus should be as large as possible and should include samples from a broad range of material, so as to achieve the degree of representativeness that the technology of the time allows.
The corpus should also be classified into different genres, with samples of even size. Once this basic outline is determined, the three-step process may begin. It starts with collecting the data, spoken and/or written. This entails gathering a large mass of speech and written texts, obtaining permission, and keeping careful and organized records. The next step is computerization, which entails converting raw spoken or written text into a digital format on a computer. Working with recorded speech can be painstaking since it needs to be transcribed manually. Another concern with spoken text is the naturalness of the speech; it needs to be recorded in a natural, casual way that resembles how people speak every day in real life, not in a stilted way. Though written records seem to be less painstaking, they also have their problems, mainly copyright. Moreover, some texts that come from books, magazines, and other written sources need to be retyped, since the output of scanning tools such as OCR (optical character recognition) software, which detects and scans words automatically, usually contains errors, so many that it is best to avoid using them altogether. The last step is annotation, which involves assigning information such as part of speech and etymology to each item of data. It should be noted that the three aforementioned steps need not be seen as separate processes; they are all closely connected. For example, after gathering recordings of speech, it may be best to transcribe them there and then.

The corpus may have contributed a great deal to language study, but its impact on lexicography did not start until 1989. Together with advances in computer software, it has since contributed significantly to the development of lexicography. Since everything is automated and recorded in a digital format, lexicographers can now save time and much of the tremendous amount of work needed to compile a dictionary. Typically, a dictionary has information on the part of speech, usage, meaning, pronunciation, and etymology of a word. Before the advent of corpora, all this information had to be gathered manually; lexicographers needed to do the hard labor of collecting slips of paper containing the text that they intended to include in the dictionary. For this reason, it took roughly 50 years to complete the Oxford English Dictionary, which was originally known as the New English Dictionary (Meyer, 2002). With corpora, dictionary makers can now use a large sample of authentic spoken and written text as a source to illustrate how each word in their list is used in real life. The citations used in the dictionary come from real-life discourse.
Real contexts also provide accurate, well-defined lexical meanings for the definition of a word in a dictionary, which is a huge improvement over the previous dictionary practice, in which words were defined in an unscientific manner. One huge improvement in dictionary making is the rich information available for words that have many variant meanings such as take, go, and time, which tended to be overlooked in the previous dictionary practice (Lindquist, 2009).

Another huge advantage of using corpora in lexicography is that information on word frequency can also be obtained. This way, lexicographers can indicate whether a word is among the first 500 most common words, the next 500, and so on. Meyer (2002) notes that the most frequent words are function words such as the, an, a, and, and of, which carry little lexical meaning, and the least frequent words are content words such as proper nouns. Gries (2009) mentions two kinds of frequency information that lexicographers can obtain from a corpus: frequencies of occurrence of linguistic elements, in the so-called frequency list, and frequencies of co-occurrence of these linguistic elements, in concordances. Lindquist (2009: 5) defines a concordance as "a list of all the contexts in which a word occurs in a particular text". Using a Key Word in Context (KWIC) concordance, words can be retrieved within their surrounding text and presented vertically on the screen. Since the information is presented in context, lexicographers can easily identify the collocations of each word in their dictionary. Below is an excerpt from concordance software in which the word "corpus" is highlighted.

Figure 1: Concordance from a software called AntConc 3.2.2w (Gries, 2009).

The above figure illustrates concordance software called AntConc in use. It should be noted that the software does not come with a ready-made corpus. Hence, users need to supply a text file of their own to generate KWIC output. The latest version of the software is 3.2.4w and can be downloaded online at http://www.antlab.sci.waseda.ac.jp/software.html.

Similar software that lexicographers may use to find how words are used in context is WordSmith Tools, devised by Mike Scott in 1993. Since then the software has gone through many changes and now includes a concordancer, word listing, a web text downloader and many other features (Wikipedia, 2011). Previous versions of the software were sold and owned by Oxford University Press. The current version is owned by Lexical Analysis Software Ltd. The current WordSmith version is 5.0, and can be downloaded online at: http://www.lexically.net/wordsmith/version5/index.html. However, unlike AntConc, WordSmith is shareware. To unlock the demo version from the website, users need to pay for a single-user license of £50, or around $70-80, from one of two online retailers (Lexical Analysis Software and Oxford University Press).

Since a corpus is discourse-based, words appear in a haphazard, arbitrary collection of occurrences, as illustrated in the figure above. Dictionary makers need to check for contradictions with the "real" meaning. It is thus dangerous to depend solely on a corpus (Teubert, 2004). One way to check a word in context is to expand the text by retrieving its original source. This feature is lacking in both software packages mentioned previously, AntConc and WordSmith Tools. Fortunately, the feature is available for free from the Brigham Young University website, which provides a concordancer covering the BNC, COCA (Corpus of Contemporary American English), and some other corpora, and can be accessed at: http://corpus.byu.edu/

The huge amount of data in a corpus also allows lexicographers to look for new words that occur for the first time in spoken or written text. However, the corpus has to be large enough to yield information on vocabulary items (Meyer, 2002). A small corpus such as the LOB corpus, which stores roughly one million words, cannot give lexicographers enough information on the range of vocabulary items. A monitor corpus is also needed, in which large amounts of language data are pooled from time to time, rather than fixed in one particular time period. This way, the corpus is frequently updated with the new words and meanings of today's growing language.

The first dictionary to be founded wholly on a corpus is the Collins COBUILD English Language Dictionary, compiled in 1987 under the guidance of John Sinclair.
The dictionary takes its citations from real-life discourse, and each word is defined from these authentic texts instead of relying on previous dictionaries. This entails using a very large corpus so that it can include all lemmas together with their word senses. However, this presents a problem in that rare words such as apothegm tend to be excluded (Teubert, 2004). Besides being the first corpus-based dictionary, COBUILD is innovative in that its definitions are akin to a classroom teacher explaining the words. For example, in describing the word junk, it says: "You can use junk to refer to old and second-hand goods that people buy and collect" (Jackson, 2002).

In the practice of dictionary-making, one crucial distinction has to be made between a corpus-based dictionary and a corpus-driven dictionary. Dictionaries such as the Collins COBUILD series of English language dictionaries are said to be corpus-based if the corpus itself is used to validate information presented in the dictionary. If, however, the corpus is used to extract the information presented in the dictionary, they are called corpus-driven. Teubert (2004: 112) suggests that dictionaries should follow the corpus-driven approach so that they may complement standard linguistics and not just extend it.

c. Modern corpus linguistics

During the 1970s, computational research on English had not developed much in Birmingham because heavy preparation went into devising software packages, instituting undergraduate courses and influencing opinion on the campus (Sinclair, 1991). At that time, when computing was beset by a number of crises, the importance of data processing was nonetheless highlighted.

It has taken approximately fifty years to make real progress in the area of corpus-based linguistics, progress that has been driven by systems that work and methodologies that can produce reasonable coverage of linguistic phenomena (Lawler & Dry, 1998). Over the years, computational resources such as fast machines and sufficient storage have become increasingly accessible, making it possible to process large volumes of data. Besides that, in the modern era there is a growing availability of corpora with linguistic annotations, for example part of speech, prosodic intonation, and proper names, as well as bilingual parallel corpora.
Furthermore, the maturing of computational linguistics technology has improved the commercial market for natural language products, and corpus linguistics is nowadays equipped with efficient parsing and statistical techniques. From 1980 to 1986, computational methods were put to good effect and developed into a completely new set of techniques for language observation, analysis, and recording. This in turn led to the editing of substantial dictionaries using these techniques and a huge database of annotated examples.
One of the most prominent uses of a corpus in recent years is as a resource for lexicography. Corpus-based work had been used in lexicography for a small number of languages, but only recently has the need for very large corpora come to the fore. Collaboration between lexicography and natural language processing (NLP) has encouraged the use of corpora in dictionary projects that have access to very large corpora (Hua, 2001). The computer plays a clerical role in lexicography, reducing the labor of sorting and filing and making it possible to examine very large amounts of English in a short time (Sinclair, 1991). In the late 1970s, the prospects of computerized typesetting were growing more realistic. A few years later, in the early 1980s, a multi-million-word corpus became available for study, but access was still limited. From simple tools, the field has made substantial progress, yielding crucial, profound and basic linguistic generalizations (Lawler & Dry, 1998). These more developed tools have revealed many topics for inquiry which had not been well explored by traditional linguistic methods.

In the modern era, the word corpus has been reserved for collections of texts that are stored and accessed electronically. Electronic corpora are usually larger than the small, paper-based collections previously used to study aspects of language (Hunston, 2002). This is due to the capacity of computers to store and process far larger amounts of information than before. One contribution in the area of corpus linguistics is the work of Johansson and colleagues in producing a parallel corpus of British English, which has made it possible for researchers to scrutinize texts at greater length than before.
The main structural features of these corpora are:

- A classification into genres (15) of printed texts.
- A large number (500) of fairly short extracts (2000 words), giving a total of around one million words.
- A close to random selection of extracts within genres.

Due to this, a great amount of useful information can be extracted easily from the corpora. Besides that, many locations have samples of text which provide hundreds of billions of words. Many collections are available, such as the Association for Computational Linguistics' Data Collection Initiative (ACL/DCI), the European Corpus Initiative (ECI), ICAME, the British National Corpus (BNC), the Linguistic Data Consortium (LDC), the Consortium for Lexical Research (CLR), and Electronic Dictionary Research (EDR), alongside standardization efforts such as the Text Encoding Initiative (TEI) (Armstrong, 1994).

Apart from lexicography, the application of corpora in applied linguistics also extends to language teaching, and corpora have benefited a wide variety of fields. Other relevant applications of corpora include the production of dictionaries and grammars, critical linguistics, translation, literary studies and stylistics, forensic linguistics, and the design of writer support packages (Hunston, 2002). Of these, the contribution of corpora to dictionary making is the most far-reaching and influential. The use of corpora has changed dictionaries in that they now emphasize frequency, collocation and phraseology, variation, lexis in grammar, and authenticity (Hunston, 2002). Recent innovations in dictionaries include the on-line Longman Web Dictionary and the Collins COBUILD English Collocations on CD-ROM.

d. The use of corpora in language teaching

The use of corpora is not uncommon across many disciplines (McEnery & Wilson, 1996: 4). Apart from lexicography, other possible areas include language teaching, discourse and pragmatics, semantics, sociolinguistics, historical linguistics and stylistics. Within the area of language teaching, there is also a branch known as CALL (Computer-Assisted Language Learning), which provides a further application of corpora.
A study conducted at Lancaster University examined the role of corpus-based computer software in teaching undergraduates the basic concepts of grammatical analysis (Hua, 2001). The software, called Cytor, reads an annotated corpus, either part-of-speech tagged or parsed, one sentence at a time. Besides reading the sentence, it hides the annotation and asks the students to annotate the sentence on their own. In addition, students can call up help in the form of a list of tag mnemonics, examples from a frequency lexicon, or concordances.

How effective is Cytor at teaching parts of speech? Research on this question was carried out by McEnery, Baker and Wilson (1995, cited in Hua, 2001), who compared two groups of students given different treatments: one taught with Cytor and another taught via traditional lecture-based methods. The results suggest that the computer-taught students performed better than the human-taught students throughout the term.

Another use of corpora in language teaching and learning is the adoption of classroom concordancing (data-driven learning) by classroom practitioners, where the corpus becomes a source of empirical teaching data (Hua, 2001: 5). One example of a resource on data-driven learning is Tim Johns's home page at http://web.bham.ac.uk/johnstf/. It provides an outstanding online bibliographic database of books and articles related to corpora and language teaching. Moreover, it includes online worksheets that involve corpora for classroom teaching. Another interesting resource is the "Grammar Safari" site developed at Champaign-Urbana, which can be found online at http://deil.lang.uiuc.edu/web.pages/grammarsafari.html and provides a careful and thoughtful selection of corpus-based activities. Furthermore, the Longman Grammar of Spoken and Written English by Douglas Biber et al., which draws on a corpus categorized into fiction, conversation, news, etc., is useful for answering students' questions about grammar.
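The classroom concordance described above can be illustrated with a minimal sketch in Python. This is not the implementation used by AntConc, WordSmith Tools or any data-driven learning site; the function names `tokenize` and `kwic`, the crude lowercase tokenizer, and the sample sentences are all assumptions made purely for illustration.

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z']+", text.lower())


def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for every hit."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits


# Hypothetical mini-corpus; a real lesson would load authentic texts.
sample = ("A corpus is a collection of texts. "
          "Teachers can search the corpus for examples, "
          "and learners read each corpus line in context.")

lines = kwic(tokenize(sample), "corpus", width=3)

# Print the hits vertically with the keyword aligned, KWIC style.
for left, key, right in lines:
    print(f"{left:>25}  {key}  {right}")
```

Aligning the keyword in one column is what lets learners scan down the page and notice which words recur to its left and right. The `width` parameter controls how much context is shown on each side.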
III. Discussions and Conclusions

From the reviewed literature, it can be seen that dictionaries have been around for centuries. The first dictionary was made in the 1600s and was based on what were considered difficult words at that time. During this initial stage, lexicographers faced a number of challenges in adding words to their dictionaries: selecting words, orthography, pronunciation, etymology and derivation, analogy, syntax, phraseology, interpretation, and distribution. All this information had to be gathered manually; lexicographers needed to do the hard labor of collecting slips of paper containing the text that they intended to include in the dictionary. For this reason, it took roughly 50 years to complete the Oxford English Dictionary, which was originally known as the New English Dictionary.

However, with the advent of corpus linguistics, things began to change dramatically. In 1989, together with technological advances in computing, the corpus made a significant contribution to the development of dictionary making. Corpus linguistics has had a huge impact on dictionary-making:

a. It significantly reduces the time and heavy work needed to compile a dictionary, since everything is automated and computerized.
b. Each dictionary now reflects how language is used in the real world. Meaning is assigned from these samples, rather than from the writer's point of view.
c. The frequency of each word in the list can be identified.
d. Much more information can be given for words with many variant meanings such as go and take.
e. It makes it easy to include collocations, because words appear in their surrounding text.
f. It can quickly take "new" everyday words into the system.

However, because a corpus is discourse-based, words appear in a haphazard, arbitrary collection of occurrences. Dictionary makers need to check for contradictions with the "real" meaning. It is thus dangerous to depend solely on a corpus. Another disadvantage of corpus-based dictionaries is that they tend to exclude rare words (those not appearing in real-world language) such as apothegm. The first dictionary to be corpus-based is the Collins COBUILD series of English dictionaries.
Corpora serve linguistic purposes and also preserve texts for their intrinsic value (Hunston, 2002). They can also be used as groundwork for research. Storing a corpus electronically allows users to study it non-linearly, and both quantitatively and qualitatively. A corpus does not in itself contain new information about language; rather, it offers us a new viewpoint on the information we already have. It shows us a way in which language can be examined. Most available software packages process data from a corpus in three ways: showing frequency, phraseology, and collocation (Hunston, 2002).

Corpora have made life simpler as well as more complex. Corpora make users' lives simpler when, for example, a translator can quickly compare words that are more or less equivalent, or when a teacher can refer to the corpus to show why a particular usage is incorrect or inexact. On the other hand, corpora can also make life more complex, in the sense that language turns out to be patterned in a much finer way than we might have expected, so that a simple and general rule turns out to apply only in certain contexts (Hunston, 2002).

The term modern corpus is reserved for collections of texts that are stored and accessed electronically. Electronic corpora are usually larger than the small, paper-based collections previously used to study aspects of language. Electronic corpora gave birth to recent innovations in dictionaries, which include the on-line Longman Web Dictionary and the Collins COBUILD English Collocations on CD-ROM.
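Two of the outputs summarized above, the frequency list and collocation counts, together with the banding of words into the first 500, the next 500 and so on, can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function names, the default window of two words on either side of the node, and the invented sample text are not taken from any of the software packages or dictionaries discussed.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z']+", text.lower())


def frequency_list(tokens):
    """Word types with their counts, most frequent first."""
    return Counter(tokens).most_common()


def frequency_band(rank, band_size=500):
    """Map a 1-based rank to a band: 1 for ranks 1-500, 2 for 501-1000, ..."""
    return (rank - 1) // band_size + 1


def collocates(tokens, node, window=2):
    """Count the words found within `window` tokens of each `node` hit."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            lo, hi = max(0, i - window), i + window + 1
            counts.update(
                t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i
            )
    return counts


# Hypothetical sample text standing in for a real corpus.
sample = ("She will take a break. They take a walk and take a break. "
          "We go home and take a break.")
tokens = tokenize(sample)

freq = frequency_list(tokens)
near_take = collocates(tokens, "take", window=2)

print(freq[:3])
print(near_take.most_common(3))
```

On a real corpus the same counts would be computed over millions of tokens, and raw co-occurrence counts would normally be supplemented by an association measure; plain counts are used here only to keep the sketch short.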
References:

Armstrong, S. (1994). Using Large Corpora. Cambridge: MIT Press.
Baugh, A. C. & Cable, T. (2002). A History of the English Language. Oxon: Routledge.
Considine, J. (1996). The Meanings, deduced logically from etymology. In Gellerstam, M., Jerker Järborg, Sven-Göran Malmgren, Kerstin Norén, Lena Rogström and Catarina Röjder Papmehl (eds.), Euralex '96 Proceedings. Papers submitted to the Seventh EURALEX International Congress on Lexicography in Göteborg, Sweden. Göteborg: Göteborg University, Department of Swedish, 1996, 365-371.
David, C. (1992). An Encyclopedic Dictionary of Language and Languages. Oxford: Oxford University Press. Retrieved from: http://www.tuchemintz.de/phil/english/chairs/linguist/independent/kursmaterialien/language_computers/whatis.htm
Gries, S. T. (2009). "What is Corpus Linguistics?", Language and Linguistics Compass, Vol. 3, pp. 1-14.
Hua, T. K. (2001). Corpora: Characteristics and Related Studies. Kuala Lumpur: Maziza Sdn Bhd.
Hunston, S. (2002). Corpora in Applied Linguistics. UK: Cambridge University Press.
Jackson, H. (2002). Lexicography, an Introduction. Oxon: Routledge.
Johnson, S. (1747). The Plan of a Dictionary of the English Language.
Lawler, J. M. & Dry, H. A. (1998). Using Computers in Linguistics: A Practical Guide. London: Routledge.
Lindquist, H. (2009). Corpus Linguistics and the Description of English. Edinburgh: Edinburgh University Press.
Mason, O. (2000). Programming for Corpus Linguistics: How to Do Text Analysis with Java. Edinburgh: Edinburgh University Press.
Meyer, C. F. (2002). English Corpus Linguistics. Cambridge: Cambridge University Press.
McEnery, T. & Wilson, A. (1996). Corpus Linguistics. Edinburgh: Edinburgh University Press.
Siemens, R. G. (1994). Robert Cawdrey: A Table Alphabetical of Hard Usual English Words (1604). Retrieved from http://www.library.utoronto.ca/utel/ret/cawdrey/cawdrey0.html
Sinclair, J. (1991). Corpus, Concordance, Collocation. Oxford: Oxford University Press.
Teubert, W. (2004). "Language and corpus linguistics". Lexicology and Corpus Linguistics. London: Continuum.
Tognini-Bonelli, E. (2001). Corpus Linguistics at Work. Amsterdam: John Benjamins Publishing Co.
WordSmith. (2011, October 15). In Wikipedia, The Free Encyclopedia. Retrieved April 22, 2012, from http://en.wikipedia.org/w/index.php?title=WordSmith&oldid=455732307