Linguistic Analysis and Statistics Using Corpora
What Is Collostructional Analysis?

TwiFULL Kansai
@ Graduate School of Humanities, Kobe University
2013-03-27
Presenter: Yuzo Morishita (@pathos95606)
Outline
1. Introduction
2. Linguistic analysis and quantitative research
3. Fisher's exact test
4. Collostructional Analysis
5. Bybee's (2010) critique
6. Gries's (2012) reply to Bybee (2010)
7. A case study using a large corpus
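About the presenter
I study English grammar with corpus-based methods, both quantitatively and qualitatively.

I also do a little statistics.

I work on constructions such as the following: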
(1)   a. James came running up the stairs. (BNC-FRS)
      b. We went shopping in Brighton. (BNC-FB9)
      c. He sent the clerk hurrying into the back room to get a
         dark grey suit. (BNC-CDN)
      d. She ran after him, calling his name. (BNC-A0N)
      e. Dan walked from the room, his head reeling. (BNC-FAB)
 
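Introduction
In cognitive linguistics...

 - Studies based on experiments and corpus frequencies are increasing
 - So are studies that require statistical analysis

   Both trends reflect the influence of the usage-based model (e.g., Langacker 1990, 2000)

 - A statistical method that has recently become prominent in Japan as well:
   Collostructional Analysis (Stefanowitsch and Gries 2003)

   The validity of the method has gone almost entirely unquestioned (cf. Bybee 2010)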
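Linguistic analysis and quantitative research
The development of corpus research

 - Diversification of corpus research
   Sinclair's (1991) "corpus-driven" research
   Kennedy's (1998) "corpus-based" (hypothesis-testing) research

 - The era of WaC (Web as Corpus)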
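A history of linguistic analysis and quantitative research

 - Beginnings of full-fledged quantitative linguistic research
   (e.g., Chao 1950, a review of Zipf (1935); Hockett 1953, a review of Shannon and Weaver (1949))

 - Typological research (e.g., Haspelmath 2008)

 - Cognitive and functional research (e.g., Bybee 1985, Baayen 1993, Bybee and Hopper 2001)

 - Work pointing out problems (e.g., Johnson 1999, Kilgarriff 2005)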
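The 2000s: linguistic analysis meets corpus research

 - Construction Grammar and quantitative research

   Frames in the resultative construction and quantitative research (Boas 2001)

   Statistical analysis of words and constructions in the ditransitive construction (Stefanowitsch and Gries 2003)

   Research on Spanish constructions (Bybee and Eddington 2006)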
Construction Grammar (e.g., Fillmore 1988, Fillmore et al. 1988, Goldberg 1995)

 - Does not posit a sharp difference in kind between words and constructions
   (cf. Grimshaw 1990: Argument Structure; Levin and Rappaport Hovav 1995: Lexical Conceptual Structure)
Fisher's exact test
The chi-square test
Table 1: Chi-square test (expected frequencies in parentheses)
              Incorrect        Correct           Total
Group A       21 (≈ 28.4)      433 (≈ 425.6)       454
Group B       44 (≈ 36.6)      540 (≈ 547.4)       584
Total         65               973               1,038
                            χ² = 3.68; df = 1; p = 0.055

Computing the expected frequencies (row total × column total / grand total):
  454 ×  65 / 1,038 ≈  28.4
  454 × 973 / 1,038 ≈ 425.6
  584 ×  65 / 1,038 ≈  36.6
  584 × 973 / 1,038 ≈ 547.4

Computing the χ² value:
  χ² = Σ (observed - expected)² / expected
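As a cross-check on the arithmetic above, the following minimal sketch recomputes the expected frequencies and the χ² value from the four observed cell counts of Table 1. The function name `chi_square` is introduced here for illustration only, and the p-value line assumes df = 1, where the chi-squared tail probability can be written with the complementary error function.

```python
from math import erfc, sqrt

# Observed counts from Table 1 (rows: Group A, Group B; columns: incorrect, correct).
observed = [[21, 433], [44, 540]]

def chi_square(table):
    """Return (expected frequencies, Pearson chi-square statistic) for a 2x2 table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    # Expected count of a cell = row total * column total / grand total.
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    statistic = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    return expected, statistic

expected, statistic = chi_square(observed)
# For df = 1 only: P(chi-squared > x) = erfc(sqrt(x / 2)).
p_value = erfc(sqrt(statistic / 2))

print([[round(e, 1) for e in row] for row in expected])
print(round(statistic, 2), round(p_value, 3))
```

The statistic should come out near 3.68 and the p-value near 0.055, the values reported with Table 1.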
[Figure: density of the chi-squared(1) distribution, against which the χ² value of Table 1 is evaluated; x-axis: chi-sq (0 to 10), y-axis: density (0.0 to 2.5)]
Basics of statistics
Fisher's exact test
Sir Ronald Aylmer Fisher (1890-1962)

Motivation: the chi-square test presupposes a particular distribution (the chi-squared distribution), so its p-value is only approximate.
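Fisher's exact test: the idea
Consider every possible arrangement of the cell counts (with the row and column totals held fixed).

Among all these arrangements, compute the probability of the ones that match the observed result.

Because this means computing combinations, the amount of computation becomes enormous...
e.g. 6C2 = 6! / (2!(6-2)!) = 15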
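The idea above can be sketched in a few lines: with the margins fixed, the probability of any single 2x2 table is a ratio of combinations (the hypergeometric probability). The helper name `table_probability` is hypothetical and not part of any library; Python's arbitrary-precision integers keep the large combinations exact.

```python
from math import comb

def table_probability(a, b, c, d):
    """Probability of the 2x2 table [[a, b], [c, d]] given its row and column totals."""
    return comb(a + b, a) * comb(c + d, c) / comb(a + b + c + d, a + c)

# The combination used as an example on the slide: 6C2.
print(comb(6, 2))

# Probability of exactly the observed cell counts of Table 1.
print(table_probability(21, 433, 44, 540))

# Summed over every table with the same margins, the probabilities add up to 1.
total = sum(table_probability(k, 454 - k, 65 - k, 584 - (65 - k)) for k in range(0, 66))
print(total)
```

Fisher's exact test then adds up the probabilities of the observed table and of the tables that are at least as extreme.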
Fisher's exact test in R

The R Project for Statistical Computing
http://www.r-project.org/

> fisher.test(matrix(c(21, 433, 44, 540), nrow = 2))

	Fisher's Exact Test for Count Data

data:  matrix(c(21, 433, 44, 540), nrow = 2)
p-value = 0.06992
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 0.330999 1.041104
sample estimates:
odds ratio
 0.5955003

> fisher.test(matrix(c(21, 433, 44, 540), nrow = 2))$p.value
[1] 0.06992
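For readers without R at hand, here is a minimal Python sketch of the same two-sided test, using only the standard library. It assumes the usual definition of the two-sided p-value (sum the probabilities of all tables with the same margins that are no more likely than the observed one). The function name and the small tolerance for ties are choices made for this illustration, and the result has not been checked against R here.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(k):
        # Probability that the top-left cell equals k, with all margins fixed.
        return comb(row1, k) * comb(row2, col1 - k) / total_tables

    p_observed = prob(a)
    k_min, k_max = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance so that ties with the observed table are counted.
    return sum(
        p
        for p in (prob(k) for k in range(k_min, k_max + 1))
        if p <= p_observed * (1 + 1e-7)
    )

p_value = fisher_exact_two_sided(21, 433, 44, 540)
print(round(p_value, 5))
```

If the sketch is right, the printed value should land close to the 0.06992 reported by `fisher.test` above.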
Collostructional Approach
A statistical method for measuring the strength of association between a construction and a word (Stefanowitsch and Gries 2003)

    collostruction < collocation + construction

Table 2: Collostructional Analysis
                 Construction c    Other constructions    Row totals
Verb v                 a                    b                 a+b
Other verbs            c                    d                 c+d
Column totals         a+c                  b+d              a+b+c+d

Table 3: Observed frequencies of give and the ditransitive in the ICE-GB, with expected frequencies in parentheses (Gries 2012: 480)
                 ditransitive construction    ¬ditransitive construction    Row totals
give                          461 (9)                    699 (1,151)             1,160
¬give                     574 (1,026)              136,930 (136,478)           137,504
Column totals                   1,035                        137,629           138,664
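The parenthesized figures in Table 3 can be recovered from the observed counts alone. The sketch below uses the a, b, c, d cell layout of Table 2; the function name `expected_frequencies` is introduced only for this illustration.

```python
def expected_frequencies(a, b, c, d):
    """Expected counts for the 2x2 table [[a, b], [c, d]] under independence."""
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    return [[r * col / n for col in col_totals] for r in row_totals]

# Observed counts for give and the ditransitive in the ICE-GB (Table 3).
expected = expected_frequencies(461, 699, 574, 136930)
rounded = [[round(x) for x in row] for row in expected]
print(rounded)  # [[9, 1151], [1026, 136478]]
```

give occurs in the ditransitive 461 times where about 9 occurrences would be expected by chance, and this gap is what the association measure quantifies.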
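Results for the ditransitive construction and their significance
(Collostruction strength is given here as a Fisher exact p-value, so smaller values mean stronger attraction.)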
Table 4: Collexemes most strongly attracted to the ditransitive construction
                                                 (Stefanowitsch and Gries 2003: 229)
Collexeme (n)   Collostruction strength        Collexeme (n) Collostruction strength
give (461)                           0         allocate (4)                 2.91E-06
tell (128)                    1.6E-127         wish (9)                     3.11E-06
send (64)                     7.26E-68         accord (3)                   8.15E-06
offer (43)                    3.31E-49         pay (13)                     2.34E-05
show (49)                     2.23E-33         hand (5)                     3.01E-05
cost (20)                     1.12E-22         guarantee (4)                4.72E-05
teach (15)                    4.32E-16         buy (9)                      6.35E-05
award (18)                    1.36E-11         assign (3)                   2.61E-04
allow (18)                    1.12E-10         charge (4)                   3.02E-04
lend (7)                      2.85E-09         cause (8)                    5.56E-04
deny (8)                       4.5E-09         ask (12)                     6.28E-04
owe (6)                       2.67E-08         afford (4)                   1.08E-03
promise (7)                   3.23E-08         cook (3)                     3.34E-03
earn (7)                      2.13E-07         spare (2)                     3.5E-03
grant (5)                     1.33E-06         drop (3)                     2.16E-02
Senses of the ditransitive construction: Goldberg (1995: 38)
 A: Central Sense: Agent successfully causes recipient to receive patient
      e.g. "give", "pass", "throw", "toss", "bring", "take"...
  (2) John gave Mary a book.
 B: Conditions of Satisfaction imply that agent causes recipient to receive patient
       e.g. "guarantee", "promise", "owe"...
   (3) Chris promised Pat a car.
 C: Agent causes recipient not to receive patient
      e.g. "refuse", "deny"...
  (4) Mary denied her sister a cake.
 D: ...
 E: ...
 F: ...




Bybee's (2010) critique
Points of criticism

 - Aren't raw frequencies sufficient?

 - Is the figure in the bottom-right cell of the contingency table necessary? How is it to be quantified?
 Table 3: Observed frequencies of give and the ditransitive in the ICE-GB
       ditransitive construction ¬ditransitive construction Row totals
 give                                461 (9)                 699 (1,151)      1,160
 ¬give                           574 (1,026)           136,930 (136,478)    137,504
 Column totals                         1,035                     137,629    138,664




Collostruction strength is computed on the basis of Bybee and Eddington (2006)

Introspective acceptability judgments are compared with the results of the Collostructional Approach
Table 5: Adjectives with quedarse 'become' (Bybee 2010: 100)
                              High            Collostructional   Frequency in   Corpus
                              Acceptability   Strength           Construction   Frequency
High Frequency
dormido 'asleep'                    42              79.34             28          161
sorprendido 'surprised'             42              17.57              7           92
quieto 'still/calm'                 39              85.76             29          129
Low Frequency Related
perplejo 'perplexed'                40               2.62              1           20
paralizado 'paralyzed'              35               2.49              1            1
pasmado 'amazed'                    30               2.72              1           16
Low Frequency Unrelated
desnutrido 'undernourished'         17               3.23              1            5
orgullosísimo 'proud'                6               3.92              1            1
Gries's (2012) reply to Bybee (2010)

[Figure 1: The comparison of a frequency-based vs. an AM-based (association measure) approach (Gries 2012: 502)]
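Testing the method on a large corpus
The studies by Gries (2012) and Bybee (2010) rely on small corpora
 - Stefanowitsch and Gries (2003): ICE-GB (1 million words)
 - Bybee (2010): a spoken corpus (1.1 million words) and a written corpus (1 million words)

Prospects for using large corpora (Gries and Stefanowitsch 2004: 235)

Problems with research that uses large corpora
 - Research may be restricted to phenomena that can be collected automatically
 - Effects on the statistical analysis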
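Using the British National Corpus (BNC)
 - About 100 million words: roughly 100 times the size of ICE-GB and roughly 50 times the size of Bybee's (2010) corpora
   Note: BYU-BNC (http://corpus.byu.edu/bnc) looks similar but is a different corpus

The construction under study is illustrated below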
(5)   a. He sent the clerk hurrying into the back room to get
         a dark grey suit.                                    (BNC-CDN)

      b. A series of small explosions one morning brought Alec
        running out of the top of the steps.               (BNC-B1X)

      c. A day's yacht charter took us threading through the islands
                                                              (BNC-BPJ)

      d. My boyfriend Fisher Stevens will have to drag me
         kicking and screaming out of the house.          (BNC-CH2)
Data and method
Collostruction strength between the finite verb and the non-finite (-ing) verb

Verbs of Sending and Carrying (Levin 1993: 132-137)
 e.g. "airmail", "FedEx", "pass", "send", "roll", "bring", "take", "drive"

All instances of the form NP V NP V-ing P(P) were collected

Result: 303 tokens

Verb frequencies: retrieved through the lemma (headword) attribute, e.g. hw="send" pos="VERB" (sent, sends, sending, etc.)

Total number of constructions: counted through sentence tags, e.g. <s n="777">
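The two counts described above (tokens of a lemma, and the number of sentence units) can be sketched as follows. The two-sentence fragment is hypothetical and only imitates BNC-style markup with the hw, pos, and n attributes mentioned on the slide; a real study would iterate over the corpus files instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment written in the style of BNC XML markup (not real corpus data).
fragment = """
<div>
  <s n="776"><w c5="PNP" hw="he" pos="PRON">He </w><w c5="VVD" hw="send" pos="VERB">sent </w><w c5="AT0" hw="the" pos="ART">the </w><w c5="NN1" hw="clerk" pos="SUBST">clerk </w><w c5="VVG" hw="hurry" pos="VERB">hurrying</w><c c5="PUN">.</c></s>
  <s n="777"><w c5="PNP" hw="she" pos="PRON">She </w><w c5="VVZ" hw="send" pos="VERB">sends </w><w c5="NN2" hw="letter" pos="SUBST">letters</w><c c5="PUN">.</c></s>
</div>
"""

root = ET.fromstring(fragment)

# Verb frequency: every word token whose lemma is "send" and whose class is VERB.
send_tokens = [w for w in root.iter("w") if w.get("hw") == "send" and w.get("pos") == "VERB"]

# Total number of constructions, approximated by the number of sentence units.
sentence_count = sum(1 for _ in root.iter("s"))

print(len(send_tokens), sentence_count)  # 2 2
```

Matching on the lemma attribute rather than on the word form is what lets sent, sends, and sending all count toward the same verb.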
Table 6: Collostructional Approach to the construction
                 The Construction    Other constructions    Row totals
crashing                       38                  2,139         2,177
Other verbs                   265              6,023,842     6,024,107
Column totals                 303              6,025,981     6,026,284
                                   p = 3.469382e-83
                                   log10(3.469382e-83) = -82.460
                                   collostruction strength = -log10(p) = 82.460
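The step from Table 6 to a collostruction strength can be sketched in log space, which also avoids the floating-point underflow that presumably lies behind the "0" reported for give in Table 4. This is a sketch under stated assumptions, not the presenter's script: it uses the one-tailed (attraction) Fisher exact p-value, which for strongly attracted items such as crashing coincides with the two-sided value, and the function names are hypothetical.

```python
from math import exp, lgamma, log

LN10 = log(10)

def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def collostruction_strength(a, b, c, d):
    """-log10 of the one-tailed Fisher exact p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    log_total = log_comb(row1 + row2, col1)
    # Log-probabilities of all tables whose top-left cell is at least a.
    log_probs = [
        log_comb(row1, k) + log_comb(row2, col1 - k) - log_total
        for k in range(a, min(row1, col1) + 1)
    ]
    # log-sum-exp: the sum stays finite even if every term would underflow as a float.
    peak = max(log_probs)
    log_p = peak + log(sum(exp(lp - peak) for lp in log_probs))
    return -log_p / LN10

crashing = collostruction_strength(38, 2139, 265, 6023842)  # Table 6
give = collostruction_strength(461, 699, 574, 136930)       # Table 3
print(round(crashing, 2))
print(give > 308)  # True: the p-value is smaller than the smallest normal double
```

For crashing the result should agree with the 82.460 shown in Table 6 to about two decimal places.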
Table 7: Collexemes strongly and weakly attracted to the construction
Collexeme (n)                Collostruction strength    Collexeme (n)     Collostruction strength
crashing (38)                        82.460             banging (1)               1.195
scurrying (19)                       52.117             raining (1)               1.186
sprawling (13)                       33.354             smashing (1)              1.153
flying (21)                          27.672             scattering (1)            1.131
tumbling (10)                        20.423             marching (1)              1.075
rushing (12)                         18.552             floating (1)              1.046
kicking and screaming (5)            15.748             sailing (1)               0.908
arcing (3)                           15.182             sliding (1)               0.878
reeling (6)                          13.043             sinking (1)               0.850
hurrying (8)                         12.184             swinging (1)              0.828
screaming (8)                        11.684             pouring (1)               0.797
hurtling (5)                         11.593             backing (1)               0.695
scuttling (5)                        11.576             escaping (1)              0.634
spinning (5)                          7.429             shooting (1)              0.506
billowing (5)                         6.959             falling (1)               0.000
[Figure 2: Correlation between Collostruction Strength and Raw Frequency; x-axis: Raw_Frequency, y-axis: Collostruction_Strength (0 to 80)]
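The relationship plotted in Figure 2 can be approximated from the 30 collexemes listed in Table 7. This is only a rough sketch: Table 7 shows just the strongest and weakest items, so the coefficient is not the one underlying the figure, and the choice of Pearson's r is an assumption made here.

```python
from math import sqrt

# (raw frequency, collostruction strength) pairs copied from Table 7.
table7 = [
    (38, 82.460), (19, 52.117), (13, 33.354), (21, 27.672), (10, 20.423),
    (12, 18.552), (5, 15.748), (3, 15.182), (6, 13.043), (8, 12.184),
    (8, 11.684), (5, 11.593), (5, 11.576), (5, 7.429), (5, 6.959),
    (1, 1.195), (1, 1.186), (1, 1.153), (1, 1.131), (1, 1.075),
    (1, 1.046), (1, 0.908), (1, 0.878), (1, 0.850), (1, 0.828),
    (1, 0.797), (1, 0.695), (1, 0.634), (1, 0.506), (1, 0.000),
]

def pearson_r(pairs):
    """Pearson correlation coefficient for a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / sqrt(var_x * var_y)

r = pearson_r(table7)
print(round(r, 3))
```

A coefficient close to 1 on this subset would be in line with Bybee's (2010) question of whether raw frequencies already carry most of the information.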
REFERENCES
Baayen, Harald. 1993. On frequency, transparency, and productivity. In G. E. Booij
   and J. van Marle (eds.) Yearbook of morphology. 181-208. Dordrecht: Kluwer
   Academic.
Boas, Hans C. 2001. A constructional approach to resultatives. Stanford, Calif.: CSLI
    Publications.
Bybee, Joan L. 1985. Morphology: A study of the relation between meaning and
    form. Amsterdam ; Philadelphia: John Benjamins.
---. 2010. Language, usage and cognition. Cambridge; New York: Cambridge
    University Press.
--- and David Eddington. 2006. A usage-based approach to Spanish verbs of
   'becoming'. Language 82:2. 323-355.
--- and Paul Hopper. (eds.) 2001. Frequency and the emergence of linguistic structure.
    Amsterdam: John Benjamins Publishing Company.
Chao, Yuen Ren. 1950. Review of Human behavior and the principle of
    least effort: An introduction to human ecology by George Kingsley Zipf. Language
    26:3. 394-401.
Fillmore, Charles J. 1988. The mechanisms of "Construction Grammar". Proceedings
    of the Fourteenth Annual Meeting of the Berkeley Linguistics Society. 35-55.
---, Paul Kay, Mary Catherine O'Connor. 1988. Regularity and idiomaticity in
    Grammatical Constructions: The case of let alone. Language 64:3. 501-538.
Goldberg, Adele E. 1995. Constructions: A construction grammar approach to
    argument structure. Chicago: University of Chicago Press.
Gries, Stefan Th. 2012. Frequencies, probabilities, and association measures in
    usage-/exemplar-based linguistics: Some necessary clarifications. Studies in
    Language 36:3. 477-510.
Gries, Stefan Th., Beate Hampe, and Doris Schönefeld. 2005. Converging evidence:
    Bringing together experimental and corpus data on the association of verbs and
    constructions. Cognitive Linguistics 16:4. 635-676.
---. 2010. Converging evidence II: More on the association of verbs and
    constructions. In Sally Rice and John Newman (eds.) Empirical and Experimental
    Methods in Cognitive/Functional Research. Stanford, Calif.: CSLI Publications,
    Center for the Study of Language and Information.
Grimshaw, Jane. 1990. Argument structure. Cambridge, Mass.: MIT Press.
Haspelmath, Martin. 2008. Frequency vs. iconicity in explaining grammatical
    asymmetries. Cognitive Linguistics 19:1. 1-33.
Hockett, Charles F. 1953. Review of The mathematical theory of communication by
    Claude L. Shannon and Warren Weaver. Language 29:1. 69-93.
Johnson, Douglas H. 1999. The insignificance of statistical significance testing. The
    Journal of Wildlife Management 63:3. 763-772.
Kennedy, Graeme. 1998. An introduction to corpus linguistics. London; New York:
    Longman.
Kilgarriff, Adam. 2005. Language is never, ever, ever, random. Corpus Linguistics and
    Linguistic Theory 1:2. 263-276.
Langacker, Ronald. 1990. A usage-based model. In Ronald Langacker. Concept,
    image and symbol: The cognitive basis of grammar. Berlin: Mouton de Gruyter.
    261-288.
---. 2000. A dynamic usage-based model. In Michael Barlow and Suzanne Kemmer
    (eds.) Usage-based models of language. Stanford, Calif.: CSLI Publications, Center
     for the Study of Language and Information. 1-63.
Levin, Beth and Malka Rappaport Hovav. 1995. Unaccusativity: At the syntax-lexical
   semantics interface. Cambridge, Mass.: MIT Press.
Shannon, Claude L. and Warren Weaver. 1949. The mathematical theory of
   communication. Urbana: University of Illinois Press.
Sinclair, John. 1991. Corpus, concordance, collocation. Oxford; Tokyo: Oxford
   University Press.
Stefanowitsch, Anatol and Stefan Th. Gries. 2003. Collostructions: Investigating the
   interaction of words and constructions. International Journal of Corpus Linguistics
   8:2. 209-243.
Zipf, George Kingsley. 1935[1949]. Human behavior and the principle of least effort:
   An introduction to human ecology. Cambridge, Mass.: Addison-Wesley Press.

Más contenido relacionado

Destacado

2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by HubspotMarius Sescu
 
Everything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTEverything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTExpeed Software
 
Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsPixeldarts
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthThinkNow
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfmarketingartwork
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024Neil Kimberley
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)contently
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024Albert Qian
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsKurio // The Social Media Age(ncy)
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Search Engine Journal
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summarySpeakerHub
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next Tessa Mero
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentLily Ray
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best PracticesVit Horky
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project managementMindGenius
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...RachelPearson36
 

Destacado (20)

2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot
 
Everything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTEverything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPT
 
Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage Engineerings
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental Health
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
 
Skeleton Culture Code
Skeleton Culture CodeSkeleton Culture Code
Skeleton Culture Code
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 

コーパスを用いた言語分析の統計的手法

  • 1. コーパスを用いた言語分析と統計 Collostructional Analysis とは何か? TwiFULL 関西 @神戸大学人文学研究科 20130327 発表者:Yuzo Morishita (@pathos95606)
  • 2. 発表の構成 1. はじめに 2. 言語分析と量的研究 3. フィッシャーの正確確率検定 4. Collostructional Analysis 5. Bybee (2010) による批判 6. Gries (2012) による Bybee (2010) への反論 7. 大規模コーパスを用いた具体的分析 1
  • 3. 自己紹介 英語の文法をコーパス基盤で量的・質的に研究 統計も少しだけ 以下のような構文について研究してます (1) a. James came running up the stairs. (BNC-FRS) b. We went shopping in Brighton. (BNC-FB9) c. He sent the clerk hurrying into the back room to get a dark grey suit. (BNC-CDN) d. She ran after him, calling his name. (BNC-A0N) e. Dan walked from the room, his head reeling. (BNC-FAB)   2
  • 5. はじめに 認知言語学では...  実験やコーパス頻度に基づく研究が増加  統計解析が必要な研究も増加  用法基盤モデル (e.g., Langacker 1990, 2000) の影響 近年、日本でも目立つようになってきた統計手法 Collostructional Analysis (Stefanowitsch and Gries 2003) 手法の妥当性に関してほぼ無批判 (cf. Bybee 2010) 3
  • 8. 言語分析と量的研究 言語分析と量的研究の歴史  本格的な量的言語研究の萌芽  (e.g., Chao 1950: (Zipf 1935) の Review Hockett 1953: (Shannon and Warren 1949) の Review)  類型論的研究 (e.g., Haspelmath 2008)  認知・機能系の研究 (e.g., Bybee 1985, Baayen 1993, Bybee and Hopper 2001)   問題点指摘タイプ (e.g., Johnson 1999, Kilgaliff 2005) 5
  • 9. 言語分析と量的研究 2000年代:言語分析のとコーパス研究の融合  構文文法 (Construction Grammar)と量的研究   結果構文におけるフレームと量的研究 (Boas 2001)   二重目的語構文での語と構文の統計解析 (Stefanowitsch and Gries 2003)   スペイン語の構文研究 (Bybee and Eddington 2006) 6
  • 10. 言語分析と量的研究 構文文法 (Construction Grammar) (e.g., Fillmore 1988, Fillmore et al. 1988, Goldberg 1995)  語と構文の間に明確な性質の違いを求めない  (cf. Grimshaw 1990: Argument Structure, Levin and Rappaport 1995: Lexical Conceptual Structure)   7
  • 12. フィッシャーの正確確率検定 カイ2乗検定 Table 1: Chi-square test 不正解 正解    !  Group A 21 (≒ 28.4) 433 (≒ 425.6) 454 Group B 44 (≒ 36.6) 540 (≒ 549.4) 584 !    65 975 1,038 #2= 3.68; df=1; p = 0.055 期待値の計算 #2値の計算 65 2 454 × ≒ 28.4 (実測値 ­ 期待値) 1,038 #2 = ! 975 期待値 454 × ≒ 425.6 1,038 584 × 65 ≒ 36.6 1,038 975 ≒ 584 × 549.4 1,038 8
  • 13. フィッシャーの正確確率検定 カイ2乗検定 Table 1: Chi-square test 不正解 正解    !  Group A 21 (≒ 28.4) 433 (≒ 425.6) 454 Group B 44 (≒ 36.6) 540 (≒ 549.4) 584 !    65 975 1,038 #2= 3.68; df=1; p = 0.055 chi-squared(1) distribution 0.0 0.5 1.0 1.5 2.0 2.5 density 0 2 4 6 8 10 8 chi-sq
  • 14. 統計の基礎 フィッシャーの正確確率検定 (Fisher's exact test) chi-squared(1) distribution 0.0 0.5 1.0 1.5 2.0 2.5 density 0 2 4 6 8 10 chi-sq Sir Ronald Aylmer Fisher (1980-1962) 理由:特定の分布を前提とする不正確性 10
  • 15. 統計の基礎 フィッシャーの正確確率検定 (Fisher's exact test) ありとあらゆる組み合せの可能性を考える 全ての組み合せのうち 現在の結果に当てはまる組み合せが生じる確率を計算 組み合せ(Combination)を計算するので計算量が膨大に... 6! e.g. 6 C 2 = 2!(6−2)! Sir Ronald Aylmer Fisher (1980-1962) 11
  • 16. 統計の基礎 フィッシャーの正確確率検定 (Fisher's exact test) The R Project for Statistical Computing http://www.r-project.org/ > fisher.test(matrix(c(21, 433, 44, 540), nrow = 2)) Fisher's Exact Test for Count Data data: matrix(c(21, 433, 44, 540), nrow = 2) p-value < 0.06992 alternative hypothesis: true odds ratio is not equal to 1 95 percent confidence interval: 0.330999 1.041104 sample estimates: odds ratio 0.5955003 > fisher.test(matrix(c(21, 433, 44, 540), nrow = 2))$p.value [1] 0.06992 12 cf.
  • 18. Collostructional Approach 構文と語の結合度を調べる統計的手法 (Stefanowitsch and Gries 2003) collostruction < collocation + construction Table 2: Collostructional Analysis        Construction c Other constructions Row totals Verb v w x w+x Other verbs y z y+z Column totals a+c b+d a+b+c+d Table3: Observed frequencies of give and the ditransitive in the ICE-GB   (Gries 2012: 480)        ditransitive construction ¬ditransitive construction Row totals give 461 (9) 699 (1,151) 1,160 ¬give 574 (1,026) 136,930 (136,478) 137,504 Column totals 1,035 137,629 138,664 13
  • 19. Collostructional Approach
    Results for the ditransitive construction and their significance
    Table 4: Collexemes most strongly attracted to the ditransitive construction (Stefanowitsch and Gries 2003: 229)
      Collexeme (n)  | Collostruction strength | Collexeme (n)  | Collostruction strength
      give (461)     | 0                       | allocate (4)   | 2.91E-06
      tell (128)     | 1.6E-127                | wish (9)       | 3.11E-06
      send (64)      | 7.26E-68                | accord (3)     | 8.15E-06
      offer (43)     | 3.31E-49                | pay (13)       | 2.34E-05
      show (49)      | 2.23E-33                | hand (5)       | 3.01E-05
      cost (20)      | 1.12E-22                | guarantee (4)  | 4.72E-05
      teach (15)     | 4.32E-16                | buy (9)        | 6.35E-05
      award (18)     | 1.36E-11                | assign (3)     | 2.61E-04
      allow (18)     | 1.12E-10                | charge (4)     | 3.02E-04
      lend (7)       | 2.85E-09                | cause (8)      | 5.56E-04
      deny (8)       | 4.5E-09                 | ask (12)       | 6.28E-04
      owe (6)        | 2.67E-08                | afford (4)     | 1.08E-03
      promise (7)    | 3.23E-08                | cook (3)       | 3.34E-03
      earn (7)       | 2.13E-07                | spare (2)      | 3.5E-03
      grant (5)      | 1.33E-06                | drop (3)       | 2.16E-02
  • 20. Collostructional Approach
    Goldberg (1995: 38)
    A: Central Sense: Agent successfully causes recipient to receive patient
       e.g. "give", "pass", "throw", "toss", "bring", "take"...
       (2) John gave Mary a book.
    B: Conditions of Satisfaction imply that agent causes recipient to receive patient
       e.g. "guarantee", "promise", "owe"...
       (3) Chris promised Pat a car.
    C: Agent causes recipient not to receive patient
       e.g. "refuse", "deny"...
       (4) Mary denied her sister a cake.
    D: ...
    E: ...
    F: ...
  • 21. Collostructional Approach
    Results for the ditransitive construction and their significance
    Table 4 (Stefanowitsch and Gries 2003: 229) is shown again; see slide 19.
  • 23. Bybee's (2010) criticism
    Points of criticism
     - Aren't raw frequencies sufficient?
     - Is the figure in the bottom-right cell of the contingency table necessary? How is it to be quantified?
    Table 3: Observed frequencies of give and the ditransitive in the ICE-GB, expected frequencies in parentheses
                      | ditransitive construction | ¬ditransitive construction | Row totals
      give            | 461 (9)                   | 699 (1,151)                | 1,160
      ¬give           | 574 (1,026)               | 136,930 (136,478)          | 137,504
      Column totals   | 1,035                     | 137,629                    | 138,664
  • 24. Bybee's (2010) criticism
    Collostruction strength is computed from the data in Bybee and Eddington (2006).
    Introspective acceptability judgments are compared with the results of the Collostructional Approach.
    Table 5: Adjectives with quedarse 'become' (Bybee 2010: 100)
      Adjective                      | High acceptability | Collostructional strength | Frequency in construction | Corpus frequency
      High Frequency
      dormido 'asleep'               | 42                 | 79.34                     | 28                        | 161
      sorprendido 'surprised'        | 42                 | 17.57                     | 7                         | 92
      quieto 'still/calm'            | 39                 | 85.76                     | 29                        | 129
      Low Frequency Related
      perplejo 'perplexed'           | 40                 | 2.62                      | 1                         | 20
      paralizado 'paralyzed'         | 35                 | 2.49                      | 1                         | 1
      pasmado 'amazed'               | 30                 | 2.72                      | 1                         | 16
      Low Frequency Unrelated
      desnutrido 'undernourished'    | 17                 | 3.23                      | 1                         | 5
      orgullosísimo 'proud'          | 6                  | 3.92                      | 1                         | 1
  • 25. Gries's (2012) rebuttal of Bybee (2010)
  • 26. Gries's (2012) rebuttal of Bybee (2010)
    [Figure 1: The comparison of a frequency- vs. an AM-based approach (Gries 2012: 502)]
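  • 28. Verification through analysis of a large corpus
    The studies by Gries (2012) and Bybee (2010) rely on small corpora
     - Stefanowitsch and Gries (2003): ICE-GB (1 million words)
     - Bybee (2010): a spoken corpus (1.1 million words) and a written corpus (1 million words)
    Prospects for using large corpora (Gries and Stefanowitsch 2004: 235)
    Problems with research based on large corpora
     - Research may be restricted to phenomena that can be retrieved automatically
     - Effects on the statistical analysis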
  • 29. Verification through analysis of a large corpus
    The British National Corpus (BNC) is used
     - About 100 million words: roughly 100 times the size of ICE-GB and roughly 50 times the size of the corpora in Bybee (2010)
     - Note: BYU-BNC (http://corpus.byu.edu/bnc) looks similar but is a different corpus
    The construction under study:
    (5) a. He sent the clerk hurrying into the back room to get a dark grey suit. (BNC-CDN)
        b. A series of small explosions one morning brought Alec running out of the top of the steps. (BNC-B1X)
        c. A day's yacht charter took us threading through the islands (BNC-BPJ)
        d. My boyfriend Fisher Stevens will have to drag me kicking and screaming out of the house. (BNC-CH2)
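  • 30. Verification through analysis of a large corpus
    Object and method of analysis
     - Collostruction strength between the finite verb and the non-finite (-ing form) verb
     - Verbs of Sending and Carrying (Levin 1993: 132-137)
       e.g. "airmail", "FedEx", "pass", "send", "roll", "bring", "take", "drive"
     - Every instance of the form NP V NP V-ing P(P)
     - Result: 303 examples
     - Verb frequency search: headword (lemma) tags, e.g. hw="send" pos="VERB" (sent, sends, sending, etc.)
     - Total number of constructions: sentence tags, e.g. <s n="777">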
  • 31. Verification through analysis of a large corpus
    Table 6: Collostructional Approach to the construction
                      | The Construction | Other constructions | Row totals
      crashing        | 38               | 2,139               | 2,177
      Other verbs     | 265              | 6,023,842           | 6,024,107
      Column totals   | 303              | 6,025,981           | 6,026,284
    p = 3.469382e-83
    $\log_{10}(3.469382 \times 10^{-83}) = -82.460$
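With tables of this size the p-value is far below what a plain product of binomial coefficients can represent as a float, so the strength is easier to compute in log space. This is a hedged Python sketch, not the script used for the talk: the function names are hypothetical, and it assumes a one-sided (attraction) Fisher exact test with the strength defined as the negative base-10 logarithm of p.

```python
import math


def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k), via log-gamma."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def collostruction_strength(a, b, c, d):
    """-log10 of the one-sided Fisher exact p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    log_total = log_comb(row1 + row2, col1)
    upper = min(row1, col1)
    # Log-probabilities of every table whose top-left cell is at least `a`.
    log_probs = [
        log_comb(row1, x) + log_comb(row2, col1 - x) - log_total
        for x in range(a, upper + 1)
    ]
    # Log-sum-exp keeps the tail sum finite even when each term underflows a float.
    peak = max(log_probs)
    log_p = peak + math.log(sum(math.exp(lp - peak) for lp in log_probs))
    return -log_p / math.log(10)


# Table 6: crashing in the construction vs. all other verbs and sentences.
strength = collostruction_strength(38, 2_139, 265, 6_023_842)
print(round(strength, 2))
```

Working with logarithms also avoids the situation where a very strongly attracted collexeme is reported with a p-value of exactly 0.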
  • 32. Collostructional Approach
    Table 7: Collexemes strongly and weakly attracted to the construction
      Collexeme (n)               | Collostruction strength | Collexeme (n)    | Collostruction strength
      crashing (38)               | 82.460                  | banging (1)      | 1.195
      scurrying (19)              | 52.117                  | raining (1)      | 1.186
      sprawling (13)              | 33.354                  | smashing (1)     | 1.153
      flying (21)                 | 27.672                  | scattering (1)   | 1.131
      tumbling (10)               | 20.423                  | marching (1)     | 1.075
      rushing (12)                | 18.552                  | floating (1)     | 1.046
      kicking and screaming (5)   | 15.748                  | sailing (1)      | 0.908
      arcing (3)                  | 15.182                  | sliding (1)      | 0.878
      reeling (6)                 | 13.043                  | sinking (1)      | 0.850
      hurrying (8)                | 12.184                  | swinging (1)     | 0.828
      screaming (8)               | 11.684                  | pouring (1)      | 0.797
      hurtling (5)                | 11.593                  | backing (1)      | 0.695
      scuttling (5)               | 11.576                  | escaping (1)     | 0.634
      spinning (5)                | 7.429                   | shooting (1)     | 0.506
      billowing (5)               | 6.959                   | falling (1)      | 0.000
  • 33. Verification through analysis of a large corpus
    [Figure 2: Correlation between Collostruction Strength and Raw Frequency; x-axis: Raw_Frequency (0-30), y-axis: Collostruction_Strength (0-80)]
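The relationship plotted in Figure 2 can be summarised with a correlation coefficient. This is an illustrative Python sketch restricted to the 30 collexemes listed in Table 7. It assumes Pearson's r is an acceptable summary, and it does not claim to reproduce the figure, which may be based on a fuller list of collexemes.

```python
import math

# (raw frequency, collostruction strength) pairs copied from Table 7.
table7 = [
    (38, 82.460), (19, 52.117), (13, 33.354), (21, 27.672), (10, 20.423),
    (12, 18.552), (5, 15.748), (3, 15.182), (6, 13.043), (8, 12.184),
    (8, 11.684), (5, 11.593), (5, 11.576), (5, 7.429), (5, 6.959),
    (1, 1.195), (1, 1.186), (1, 1.153), (1, 1.131), (1, 1.075),
    (1, 1.046), (1, 0.908), (1, 0.878), (1, 0.850), (1, 0.828),
    (1, 0.797), (1, 0.695), (1, 0.634), (1, 0.506), (1, 0.000),
]


def pearson_r(pairs):
    """Pearson product-moment correlation for a list of (x, y) pairs."""
    xs, ys = zip(*pairs)
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


r = pearson_r(table7)
print(f"Pearson r over the {len(table7)} listed collexemes: {r:.3f}")
```

A high coefficient would mean that raw frequency and collostruction strength rank these collexemes similarly, which is the point at issue between Bybee (2010) and Gries (2012). Rows such as flying (21) versus sprawling (13) show where the two measures diverge.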
  • 34. REFERENCES
    Baayen, Harald. 1993. On frequency, transparency, and productivity. In G. E. Booij and J. van Marle (eds.) Yearbook of morphology. 181-208. Dordrecht: Kluwer Academic.
    Boas, Hans C. 2001. A constructional approach to resultatives. Stanford, Calif.: CSLI Publications.
    Bybee, Joan L. 1985. Morphology: A study of the relation between meaning and form. Amsterdam; Philadelphia: John Benjamins.
    ---. 2010. Language, usage and cognition. Cambridge; New York: Cambridge University Press.
    --- and David Eddington. 2006. A usage-based approach to Spanish verbs of 'becoming'. Language 82:2. 323-355.
    --- and Paul Hopper. (eds.) 2001. Frequency and the emergence of linguistic structure. Amsterdam: John Benjamins Publishing Company.
    Chao, Yuen Ren. 1950. Review of Human behavior and the principle of least effort: An introduction to human ecology by George Kingsley Zipf. Language 26:3. 394-401.
  • 35. REFERENCES
    Fillmore, Charles J. 1988. The mechanisms of "Construction Grammar". Proceedings of the Fourteenth Annual Meeting of the Berkeley Linguistics Society. 35-55.
    ---, Paul Kay, and Mary Catherine O'Connor. 1988. Regularity and idiomaticity in grammatical constructions: The case of let alone. Language 64:3. 501-538.
    Goldberg, Adele E. 1995. Constructions: A construction grammar approach to argument structure. Chicago: University of Chicago Press.
    Gries, Stefan Th. 2012. Frequencies, probabilities, and association measures in usage-/exemplar-based linguistics: Some necessary clarifications. Studies in Language 36:3. 477-510.
    Gries, Stefan Th., Beate Hampe, and Doris Schönefeld. 2005. Converging evidence: Bringing together experimental and corpus data on the association of verbs and constructions. Cognitive Linguistics 16:4. 635-676.
    ---. 2010. Converging evidence II: More on the association of verbs and constructions. In Sally Rice and John Newman (eds.) Empirical and Experimental Methods in Cognitive/Functional Research. Stanford, Calif.: CSLI Publications, Center for the Study of Language and Information.
  • 36. REFERENCES
    Grimshaw, Jane. 1990. Argument structure. Cambridge, Mass.: MIT Press.
    Haspelmath, Martin. 2008. Frequency vs. iconicity in explaining grammatical asymmetries. Cognitive Linguistics 19:1. 1-33.
    Hockett, Charles F. 1953. Review of The mathematical theory of communication by Claude E. Shannon and Warren Weaver. Language 29:1. 69-93.
    Johnson, Douglas H. 1999. The insignificance of statistical significance testing. The Journal of Wildlife Management 63:3. 763-772.
    Kennedy, Graeme. 1998. An introduction to corpus linguistics. London; New York: Longman.
    Kilgarriff, Adam. 2005. Language is never, ever, ever, random. Corpus Linguistics and Linguistic Theory 1:2. 263-276.
    Langacker, Ronald. 1990. A usage-based model. In Ronald Langacker. Concept, image and symbol: The cognitive basis of grammar. Berlin: Mouton de Gruyter. 261-288.
    ---. 2000. A dynamic usage-based model. In Michael Barlow and Suzanne Kemmer (eds.) Usage-based models of language. Stanford, Calif.: CSLI Publications, Center for the Study of Language and Information. 1-63.
  • 37. REFERENCES
    Levin, Beth and Malka Rappaport Hovav. 1995. Unaccusativity: At the syntax-lexical semantics interface. Cambridge, Mass.: MIT Press.
    Shannon, Claude E. and Warren Weaver. 1949. The mathematical theory of communication. Urbana: University of Illinois Press.
    Sinclair, John. 1991. Corpus, concordance, collocation. Oxford; Tokyo: Oxford University Press.
    Stefanowitsch, Anatol and Stefan Th. Gries. 2003. Collostructions: Investigating the interaction of words and constructions. International Journal of Corpus Linguistics 8:2. 209-243.
    Zipf, George Kingsley. 1935[1949]. Human behavior and the principle of least effort: An introduction to human ecology. Cambridge, Mass.: Addison-Wesley Press.