Corpora, Blogs and Linguistic Variation (Paderborn)


1) The document discusses using blogs and other structured web data to develop linguistic corpora for research. It argues that structured web data provides large amounts of naturally occurring language data in various genres and languages.
2) Examples are given of how blog data in particular is well structured, with metadata such as authorship, dates, and semantics. This structured data can be extracted and analyzed to study linguistic patterns and variation across different authors, registers, and languages.
3) One research example analyzed the distribution of future tense expressions ("will" vs. "be going to") in three English-language blogs and found patterns relating to subject type that confirm theoretical assumptions.


Corpora, Blogs and Linguistic Variation: Arguments for Using Structured Web Data in Corpus Development. Cornelius Puschmann, University of Düsseldorf. University of Paderborn, 8 November 2007

Contents of this presentation

Four central questions a researcher must answer

Different schools of thought: CogSci, SocioLing, Functionalism ...all have different questions and assumptions!

What role does corpus data play?

A totalizing view of language production investigation, but... whose system? whose use? [Diagram labels: system, social function, cognitive mechanism, genetic disposition, use, cultural transmission]

Margin vs. center: what is shared vs. what varies. If we're interested in variation, corpus analysis is the way to go. [Diagram labels: shared, recurring & patterned language; individual & varying language]

A (slightly) different way of looking at language

The ultimate corpus

Web as Corpus (WaC)

Broader issues with WaC

Web for Corpus

Constructing a corpus using web data
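The summary above notes that blog data comes with metadata such as authorship and dates. As a minimal sketch of what extracting that structure might look like, the following parses an Atom feed using Python's standard library. The sample feed, its authors and the `parse_feed` helper are hypothetical illustrations, not material from the presentation, and real blogs may expose RSS or a different Atom layout.

```python
import xml.etree.ElementTree as ET

# Namespace prefix used by Atom feeds (RFC 4287).
ATOM = "{http://www.w3.org/2005/Atom}"

# Hypothetical sample feed, invented for illustration only.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example Blog</title>
  <entry>
    <title>First post</title>
    <author><name>Alice</name></author>
    <published>2007-11-01T10:00:00Z</published>
    <category term="travel"/>
    <content type="text">I am going to visit Paderborn next week.</content>
  </entry>
  <entry>
    <title>Second post</title>
    <author><name>Bob</name></author>
    <published>2007-11-05T18:30:00Z</published>
    <category term="work"/>
    <category term="conference"/>
    <content type="text">The talk will start at nine.</content>
  </entry>
</feed>"""


def parse_feed(xml_text):
    """Return one record per entry: the post text plus its metadata."""
    root = ET.fromstring(xml_text)
    records = []
    for entry in root.findall(f"{ATOM}entry"):
        records.append({
            "title": entry.findtext(f"{ATOM}title", default=""),
            "author": entry.findtext(f"{ATOM}author/{ATOM}name", default=""),
            "published": entry.findtext(f"{ATOM}published", default=""),
            "categories": [c.get("term") for c in entry.findall(f"{ATOM}category")],
            "text": entry.findtext(f"{ATOM}content", default=""),
        })
    return records


if __name__ == "__main__":
    for record in parse_feed(SAMPLE_FEED):
        print(record["author"], record["published"], record["categories"])
```

Keeping each post as a record with its author, date and categories means the text can later be grouped by any of these fields when comparing authors or registers.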

Things to consider ...not a whole lot of language data!

Things to consider ... better, but we need to take register into account

A blog is...

A few facts

Expression of futurity in English: will vs. be going to

Distribution of will vs. be going to in three blogs [Chart legend: light blue = will; dark blue = going to]
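A distribution like this one could, in principle, be approximated by pattern matching over each blog's text. The sketch below is an assumption about how such a count might be done, not the method used in the presentation. The regular expressions and the `count_future` helper are hypothetical, and the sample texts are invented. The patterns are deliberately crude: they also match the noun "will" and locative "going to the store", so real work would need part-of-speech information.

```python
import re

# Crude surface patterns (assumptions, not the author's queries).
WILL = re.compile(r"\bwill\b|\bwon't\b|'ll\b", re.IGNORECASE)
GOING_TO = re.compile(
    r"(?:\b(?:am|is|are)|'m|'s|'re)\s+(?:not\s+)?going\s+to\b|\bgonna\b",
    re.IGNORECASE,
)


def count_future(text):
    """Count surface matches of each future construction in a text."""
    return {
        "will": len(WILL.findall(text)),
        "going_to": len(GOING_TO.findall(text)),
    }


if __name__ == "__main__":
    # Invented sample texts standing in for three blogs.
    blogs = {
        "blog_a": "I will post more tomorrow. We'll see. It is going to rain.",
        "blog_b": "I'm going to write about this. They are going to love it.",
        "blog_c": "The release will ship soon. It won't take long.",
    }
    for name, text in blogs.items():
        counts = count_future(text)
        total = sum(counts.values())
        share = counts["will"] / total if total else 0.0
        print(f"{name}: {counts} (share of will: {share:.2f})")
```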

Personal pronoun frequency in the same blogs

Inanimate subjects with be going to

Observations
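Pronoun frequencies are usually normalized so that blogs of different sizes can be compared. The following is a minimal sketch of a per-1,000-token count. The pronoun list, the tokenizer and the `pronoun_frequencies` helper are assumptions made for illustration and do not reproduce the figures from the presentation. Note that the simple tokenizer splits contractions ("I'm" becomes "i" and "m") and that "her" is counted even when it is possessive.

```python
import re
from collections import Counter

# Assumed pronoun inventory (subject and object forms).
PRONOUNS = {
    "i", "you", "he", "she", "it", "we", "they",
    "me", "him", "her", "us", "them",
}


def pronoun_frequencies(text, per=1000):
    """Return pronoun counts normalized per `per` tokens."""
    # Very simple tokenizer: runs of ASCII letters, lowercased.
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return {}
    counts = Counter(t for t in tokens if t in PRONOUNS)
    return {p: n * per / len(tokens) for p, n in counts.items()}


if __name__ == "__main__":
    sample = "I think we will go, and they will join us later."
    for pronoun, freq in sorted(pronoun_frequencies(sample).items()):
        print(f"{pronoun}: {freq:.1f} per 1,000 tokens")
```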

3. What counts as evidence in linguistics?

7. System, use and the individual

11. Using the Web for corpus investigations

22. An example

23. A research example

29. Thanks for listening!
