5th Author Profiling task at PAN
Gender and Language Variety
Identification in Twitter
PAN-AP-2017 CLEF 2017
Dublin, 11-14 September
Francisco Rangel
Autoritas Consulting &
PRHLT Research Center -
Universitat Politècnica de València
Paolo Rosso
PRHLT Research Center
Universitat Politècnica de València
Martin Potthast & Benno Stein
Bauhaus-Universität Weimar
Introduction
Author profiling aims at identifying
personal traits such as age, gender,
personality traits, native language,
language variety… from writings.
This is crucial for:
- Marketing
- Security
- Forensics
PAN’16AuthorProfiling
Task goal
To investigate the identification of
author’s gender and language
variety together.
Four languages: English, Spanish, Portuguese, Arabic
Corpus collection
● Step 1: Languages and varieties selection.
● Step 2: Tweets per region retrieval.
● Step 3: Unique authors identification.
● Step 4: Authors selection:
○ Tweets are not retweets.
○ Tweets are written in the corresponding language.
● Step 5: Language variety annotation:
○ 80% of tweet meta-data coincide with:
■ Geotagging.
■ Toponyms of the region.
● Step 6: Gender annotation:
○ Automatically: dictionary of proper nouns.
○ Manually: visual review.
Corpus
● Step 7: Corpus construction:
○ 500 authors per variety and gender.
■ 300 for training, 200 for test.
○ 100 tweets per author.
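The sizes implied by Step 7 can be checked with a little arithmetic. The sketch below is illustrative only: the function name `corpus_size` is hypothetical, and the example of six English varieties is an assumption taken from the six countries named in the coarse vs. fine-grained English grouping later in the slides.

```python
def corpus_size(n_varieties, genders=2, authors_per_cell=500,
                train_per_cell=300, test_per_cell=200, tweets_per_author=100):
    """Return author and tweet counts for one language.

    A "cell" is one (variety, gender) pair, as described in Step 7.
    """
    # The train/test split must add up to the authors collected per cell.
    assert train_per_cell + test_per_cell == authors_per_cell
    cells = n_varieties * genders
    return {
        "authors": cells * authors_per_cell,
        "train_authors": cells * train_per_cell,
        "test_authors": cells * test_per_cell,
        "tweets": cells * authors_per_cell * tweets_per_author,
    }


# Assumed example: six English varieties.
english = corpus_size(n_varieties=6)
print(english)
```

Under that assumption, English would have 6,000 authors (3,600 for training, 2,400 for test) and 600,000 tweets.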
Evaluation measures
The accuracy is calculated per task and language.
Then, the averages per task are calculated.
Finally, the ranking is the global average.
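The formulas for the evaluation measures did not survive in this transcript, so the following is only a sketch of the averaging described in words. It assumes three tasks (gender, variety, and their joint identification) and uses made-up accuracy values; the names `task_averages` and `global_ranking` are hypothetical.

```python
from statistics import mean


def task_averages(acc):
    """acc[task][language] -> accuracy. Returns the mean accuracy per task."""
    return {task: mean(per_lang.values()) for task, per_lang in acc.items()}


def global_ranking(acc):
    """Global score: the mean of the per-task averages."""
    return mean(task_averages(acc).values())


# Illustrative numbers only, not results from the task.
acc = {
    "gender":  {"ar": 0.70, "en": 0.80, "es": 0.75, "pt": 0.75},
    "variety": {"ar": 0.90, "en": 0.80, "es": 0.90, "pt": 1.00},
    "joint":   {"ar": 0.60, "en": 0.60, "es": 0.70, "pt": 0.70},
}

print(task_averages(acc))
print(round(global_ranking(acc), 4))
```

Averaging per task first gives every task the same weight in the final ranking, whatever the number of languages a system covers.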
Baselines
● BASELINE-stat: A statistical baseline that emulates random
choice.
● BASELINE-bow:
○ Documents represented as bag-of-words.
○ The 1,000 most common words in the training set.
○ Weighted by absolute frequency.
○ Preprocessing: lowercasing, removal of punctuation signs and
numbers, removal of stopwords.
● BASELINE-LDR:
○ Documents represented by the probability distribution of
occurrence of their words in the different classes.
○ Each word is weighted depending on its probability of
belonging to each class.
○ The distribution of weights for a given document should be
closer to the weights of its corresponding class.
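The two representation-based baselines can be sketched roughly as follows. This is not the organisers' code: the stopword list is a toy placeholder, all function names are hypothetical, and `class_weights` / `class_profile` are a heavily simplified stand-in for the class-probability idea behind BASELINE-LDR, using raw counts and a plain average.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}  # toy list (assumption)


def tokenize(text):
    """Lowercase, keep letter-only tokens (drops punctuation and numbers), remove stopwords."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    return [w for w in words if w not in STOPWORDS]


def build_vocabulary(train_docs, size=1000):
    """The `size` most common words in the training set."""
    counts = Counter(w for doc in train_docs for w in tokenize(doc))
    return [w for w, _ in counts.most_common(size)]


def bow_vector(doc, vocabulary):
    """Bag-of-words vector weighted by absolute frequency."""
    counts = Counter(tokenize(doc))
    return [counts[w] for w in vocabulary]


def class_weights(train_docs, labels):
    """weights[word][cls] = share of the word's training occurrences seen in cls."""
    per_class, total = {}, Counter()
    for doc, cls in zip(train_docs, labels):
        tokens = tokenize(doc)
        per_class.setdefault(cls, Counter()).update(tokens)
        total.update(tokens)
    return {w: {c: per_class[c][w] / total[w] for c in per_class} for w in total}


def class_profile(doc, weights, classes):
    """Average per-class weight over the document's words seen in training."""
    tokens = [w for w in tokenize(doc) if w in weights]
    if not tokens:
        return {c: 0.0 for c in classes}
    return {c: sum(weights[w][c] for w in tokens) / len(tokens) for c in classes}


# Toy data, invented for illustration.
train_docs = ["The colour of the lorry", "My favourite colour",
              "The color of the truck", "My favorite color"]
labels = ["gb", "gb", "us", "us"]

vocabulary = build_vocabulary(train_docs, size=3)
vector = bow_vector("My colour, my COLOUR!!", vocabulary)
weights = class_weights(train_docs, labels)
profile = class_profile("What colour is my lorry?", weights, ["gb", "us"])
print(dict(zip(vocabulary, vector)), profile)
```

In the toy example the profile leans towards `gb`, which matches the intuition that a document's weights should sit closer to those of its own class.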
22 participants
20 working notes
19 countries
Countries labelled on the slide include Qatar, Netherlands, Cuba and Slovenia.
Approaches
Approaches - Preprocessing
HTML cleaning to obtain plain text: Khan; Martinc et al.; Ribeiro-Oliveira & Ferreira
Punctuation signs: Ribeiro-Oliveira & Ferreira; Martinc et al.; Schaetti
Stop words: Kheng et al.; Martinc et al.
Lowercase: Franco-Salvador et al.; Kheng et al.; Kodiyan et al.; Miura et al.
Remove short tweets: Kheng et al.
Twitter-specific components (hashtags, URLs, mentions and RTs): Franco-Salvador et al.; Adame et al.; Kheng et al.; Kodiyan et al.; Markov et al.; Miura et al.; Ribeiro-Oliveira & Ferreira; Schaetti
Out-of-alphabet words: Schaetti
Expand contractions: Adame et al.
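A minimal sketch of how several of these preprocessing steps might be combined for a single tweet. It does not reproduce any participant's pipeline: the placeholder tokens (`URL`, `USER`, `HASHTAG`), the `min_tokens` threshold and the function name `preprocess_tweet` are assumptions, and the slide does not say whether participants removed or replaced the Twitter-specific components.

```python
import re

RT_PREFIX = re.compile(r"^rt\s+", re.IGNORECASE)  # leading retweet marker
URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\w+")
PUNCTUATION = re.compile(r"[^\w\s]")


def preprocess_tweet(text, min_tokens=3):
    """Return a list of cleaned tokens, or None if the tweet is too short."""
    text = RT_PREFIX.sub("", text.strip())
    text = text.lower()
    # Replace Twitter-specific components with upper-case placeholders
    # so they stay distinguishable from the lowercased words.
    text = URL.sub(" URL ", text)
    text = MENTION.sub(" USER ", text)
    text = HASHTAG.sub(" HASHTAG ", text)
    text = PUNCTUATION.sub(" ", text)
    tokens = text.split()
    return tokens if len(tokens) >= min_tokens else None


print(preprocess_tweet("RT @bob: Check THIS out!! https://t.co/abc #wow"))
print(preprocess_tweet("ok!"))
```

Replacing rather than deleting mentions and links keeps their frequency available as a stylistic signal; deleting them outright is an equally plausible choice.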
Approaches - Features
Stylistic features:
- Ratios of links
- Hashtags or user mentions
- Character flooding
- Emoticons / laughter expressions
- Domain names
Alrifai et al.; Ribeiro-Oliveira & Ferreira; Martinc et al.; Adame et al.; Markov et al.
Emotional features:
● Emotions
● Appraisal
● Admiration
● Pos/neg emoticons
● Sentiment words
● ...
Adame et al.; Martinc et al.
Specific lists of words, most discriminant words, ...
Martinc et al.; Kocher & Savoy; Khan
N-gram models: Martinc et al.; Alrifai et al.; Kheng et al.; Markov et al.; Ribeiro-Oliveira & Ferreira; Ogaltsov & Romanov; Schaetti; Ciobanu et al.
Bag-of-words: Adame et al.; Tellez et al.
Tf-idf n-grams: Poulston et al.; Schaetti; Basile et al.
LSA: Kheng et al.
Second order representation: Pastor et al.
Word embeddings: Ignatov et al.; Kodiyan et al.; Sierra et al.; Poulston et al.; Miura et al.
Character embeddings: Franco-Salvador et al.; Miura et al.
Approaches - Methods
Logistic regression: Ignatov et al.; Martinc et al.; Poulston et al.; Ogaltsov & Romanov
SVM: Alrifai et al.; Kheng et al.; Pastor et al.; Markov et al.; Tellez et al.; Basile et al.; Ribeiro-Oliveira & Ferreira; Ciobanu et al.
Naive Bayes: Kheng et al.
Distance-based approaches: Adame et al.; Kocher & Savoy; Khan
Recurrent Neural Networks: Kodiyan et al.; Miura et al.
Convolutional Neural Networks: Schaetti; Sierra et al.; Miura et al.
Deep Averaging Networks: Franco-Salvador et al.
Gender results
Variety results
Confusion among varieties (AR)
Confusion among varieties (PT)
Confusion among varieties (ES)
Confusion among varieties (EN)
Coarse vs. fine-grained English
● American: United States + Canada.
● European: Great Britain + Ireland.
● Oceanic: New Zealand + Australia.
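The coarse-grained evaluation can be sketched by mapping each fine-grained variety to its group before scoring. The grouping follows the three bullets above; the label strings, the function names and the toy predictions are assumptions for illustration, not task data.

```python
# Fine-grained variety -> coarse-grained group.
COARSE = {
    "united states": "american", "canada": "american",
    "great britain": "european", "ireland": "european",
    "new zealand": "oceanic", "australia": "oceanic",
}


def accuracy(gold, pred):
    """Fraction of positions where the prediction equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def coarse_accuracy(gold, pred):
    """Accuracy after collapsing both label lists to the coarse groups."""
    return accuracy([COARSE[g] for g in gold], [COARSE[p] for p in pred])


# Toy predictions, invented for illustration.
gold = ["canada", "ireland", "australia", "united states"]
pred = ["united states", "ireland", "new zealand", "great britain"]

fine = accuracy(gold, pred)
coarse = coarse_accuracy(gold, pred)
print(fine, coarse)
```

A confusion between two varieties of the same group (for example Canada predicted as United States) counts as an error in the fine-grained score but as a hit in the coarse-grained one, so the coarse score can never be lower.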
The impact of gender on variety identification
● Computed over all participants’ predictions together.
● Except in Spanish, the variety is easier to predict when the author is female.
The difficulty of gender identification depending on variety
● Computed over all participants’ predictions together.
● For most Arabic and Portuguese varieties, females are easier to identify.
● In Spanish and English, both genders are similarly difficult to identify.
Joint evaluation
Final ranking
PAN-AP 2017 best results
Conclusions
● A wide combination of features was used (content-based, stylometric, n-grams, …) and, for the first time, deep learning approaches were widely used.
○ Deep learning approaches did not obtain the best results.
● Per language:
○ The best results were obtained in Portuguese.
○ On average, the worst results in gender identification were obtained in Arabic.
○ On average, the worst results in language variety identification were obtained in English.
● Per variety:
○ In Arabic, Gulf is the most difficult variety and Levantine the easiest.
○ In English, the highest confusion occurs among varieties that share a region.
○ In Spanish, most confusions are towards Colombia. The highest confusion is from Peru.
○ Portuguese is asymmetric: the highest confusion is from Portugal to Brazil.
● Coarse vs. fine-grained evaluation in English:
○ The differences are significant, although not very high (3.75%) in the case of the best approaches.
● The impact of gender on language variety identification:
○ In Arabic and Portuguese the differences between genders are significant.
● The difficulty of gender identification depending on the language variety:
○ For most Arabic and Portuguese varieties, females are easier to identify.
○ In Spanish and English, both genders are similarly difficult to identify.
Task impact
EDITION       PARTICIPANTS   COUNTRIES   CITATIONS
PAN-AP 2013   21             16          67 (+28)
PAN-AP 2014   10             8           41 (+25)
PAN-AP 2015   22             13          42 (+25)
PAN-AP 2016   22             15          5
PAN-AP 2017   22             19
Next year?
Industry at PAN (Author Profiling)
Organisation, sponsors and participants
On behalf of the author profiling task organisers:
Thank you very much for participating
and hope to see you next year!!

Detecting Credit Card Fraud: A Machine Learning Approach
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
 
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 

Gender and Language Variety Identification in Twitter. Overview of the 5th. Author Profiling task at PAN@CLEF 2017.

  • 1. 5th Author Profiling task at PAN: Gender and Language Variety Identification in Twitter
    PAN-AP-2017, CLEF 2017, Dublin, 11-14 September
    Francisco Rangel (Autoritas Consulting & PRHLT Research Center, Universitat Politècnica de València)
    Paolo Rosso (PRHLT Research Center, Universitat Politècnica de València)
    Martin Potthast & Benno Stein (Bauhaus-Universität Weimar)
  • 2. Introduction
    Author profiling aims at identifying personal traits such as age, gender, personality traits, native language, language variety… from writings. This is crucial for:
    - Marketing
    - Security
    - Forensics
  • 3. Task goal
    To investigate the identification of an author’s gender and language variety together.
    Four languages: English, Spanish, Portuguese, Arabic.
  • 4. Corpus collection
    ● Step 1: Languages and varieties selection.
    ● Step 2: Tweets per region retrieval.
  • 5. Corpus collection
    ● Step 3: Unique authors identification.
    ● Step 4: Authors selection:
      ○ Tweets are not retweets.
      ○ Tweets are written in the corresponding language.
    ● Step 5: Language variety annotation:
      ○ 80% of tweet meta-data coincide with:
        ■ Geotagging.
        ■ Toponyms of the region.
    ● Step 6: Gender annotation:
      ○ Automatically: dictionary of proper nouns.
      ○ Manually: visual review.
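One possible reading of the Step 5 rule can be sketched in Python: an author keeps a variety label only when at least 80% of their tweets carry metadata (a geotag or a toponym) pointing to the region. The slides do not say how the matching was implemented, so the `Tweet` fields, the `matches_region` helper and the toponym set below are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Tweet:
    text: str
    geotag_region: Optional[str]  # hypothetical: region derived from geotagging, if any
    user_location: str            # hypothetical: free-text location attached to the tweet


def matches_region(tweet, region, toponyms):
    """True if the geotag or a known toponym points to the region."""
    if tweet.geotag_region == region:
        return True
    location = tweet.user_location.lower()
    return any(name in location for name in toponyms)


def label_variety(tweets, region, toponyms, threshold=0.8):
    """Return the region as variety label if enough tweets agree, else None."""
    if not tweets:
        return None
    hits = sum(matches_region(t, region, toponyms) for t in tweets)
    return region if hits / len(tweets) >= threshold else None
```

The threshold is applied per author here; if the organisers applied it differently, only `label_variety` would need to change.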
  • 6. Corpus
    ● Step 7: Corpus construction:
      ○ 500 authors per variety and gender.
        ■ 300 for training, 200 for test.
      ○ 100 tweets per author.
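The split in Step 7 could look roughly like the sketch below. The 300/200 sizes come from the slide; the random selection, the seed and the function names are assumptions, since the slides do not describe how authors were assigned to each partition.

```python
import random

TRAIN_PER_GROUP = 300  # from the slide
TEST_PER_GROUP = 200   # from the slide


def split_group(author_ids, seed=0):
    """Pick 500 authors from one (variety, gender) group and split them 300/200."""
    needed = TRAIN_PER_GROUP + TEST_PER_GROUP
    if len(author_ids) < needed:
        raise ValueError(f"need {needed} authors, got {len(author_ids)}")
    rng = random.Random(seed)
    chosen = rng.sample(list(author_ids), needed)  # sampling without replacement
    return chosen[:TRAIN_PER_GROUP], chosen[TRAIN_PER_GROUP:]


def build_corpus(groups, seed=0):
    """groups maps (variety, gender) -> list of candidate author ids."""
    train, test = {}, {}
    for key in sorted(groups):
        train[key], test[key] = split_group(groups[key], seed)
    return train, test
```

With 100 tweets per author, each (variety, gender) group contributes 500 × 100 = 50,000 tweets.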
  • 7. Evaluation measures
    ● The accuracy is calculated per task and language.
    ● Then, the averages per task are calculated.
    ● Finally, the ranking is the global average.
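The averaging described on this slide can be sketched as follows. The formulas themselves are not in the transcript, so this is only an illustration under stated assumptions: which tasks enter the ranking is not spelled out here, and the task names and numbers in `example` are made up, not results from the task.

```python
from statistics import mean


def task_averages(accuracy):
    """accuracy maps task -> {language -> accuracy}; returns task -> mean over languages."""
    return {task: mean(per_lang.values()) for task, per_lang in accuracy.items()}


def global_ranking_score(accuracy):
    """Mean of the per-task averages, used to rank a participant."""
    return mean(task_averages(accuracy).values())


# Illustrative numbers only.
example = {
    "gender":  {"ar": 0.70, "en": 0.80, "es": 0.75, "pt": 0.85},
    "variety": {"ar": 0.80, "en": 0.85, "es": 0.90, "pt": 0.95},
}
```

Any further task, such as a joint accuracy, would simply be another key in the outer dictionary.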
  • 8. Baselines
    ● BASELINE-stat: A statistical baseline that emulates random choice.
    ● BASELINE-bow:
      ○ Documents represented as bag-of-words.
      ○ The 1,000 most common words in the training set.
      ○ Weighted by absolute frequency.
      ○ Preprocess: lowercase, removal of punctuation signs and numbers, removal of stopwords.
    ● BASELINE-LDR:
      ○ Documents represented by the probability distribution of occurrence of their words in the different classes.
      ○ Each word is weighted depending on its probability of belonging to each class.
      ○ The distribution of weights for a given document should be closer to the weights of its corresponding class.
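The BASELINE-bow representation can be approximated with the standard library alone. This is a minimal sketch, not the organisers' code: the stopword list is a tiny placeholder, the tokenisation is a plain whitespace split, and the classifier trained on these vectors is not given on the slide, so it is left out.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; the list actually used is not given in the slides.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}


def preprocess(text):
    """Lowercase, drop punctuation signs and numbers, remove stopwords."""
    text = text.lower()
    text = re.sub(r"[^\w\s]|\d|_", " ", text)
    return [tok for tok in text.split() if tok not in STOPWORDS]


def build_vocabulary(train_docs, size=1000):
    """The `size` most common words in the training set."""
    counts = Counter(tok for doc in train_docs for tok in preprocess(doc))
    return [word for word, _ in counts.most_common(size)]


def vectorize(doc, vocabulary):
    """Absolute frequency of each vocabulary word in the document."""
    counts = Counter(preprocess(doc))
    return [counts[word] for word in vocabulary]
```

A document here would be the concatenation of one author's tweets; each author then becomes a vector of at most 1,000 counts.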
  • 9. 22 participants, 20 working notes, 19 countries
    Qatar, Netherlands, Cuba, Slovenia
  • 11. Approaches - Preprocessing
    ● HTML cleaning to obtain plain text: Khan; Martinc et al.; Ribeiro-Oliveira & Ferreira
    ● Punctuation signs: Ribeiro-Oliveira & Ferreira; Martinc et al.; Schaetti
    ● Stop words: Kheng et al.; Martinc et al.
    ● Lowercase: Franco-Salvador et al.; Kheng et al.; Kodiyan et al.; Miura et al.
    ● Remove short tweets: Kheng et al.
    ● Twitter specific components (hashtags, urls, mentions and RTs): Franco-Salvador et al.; Adame et al.; Kheng et al.; Kodiyan et al.; Markov et al.; Miura et al.; Ribeiro-Oliveira & Ferreira; Schaetti
    ● Out-of-alphabet words: Schaetti
    ● Expand contractions: Adame et al.
  • 12. Approaches - Features
    ● Stylistic features: Alrifai et al.; Ribeiro-Oliveira & Ferreira; Martinc et al.; Adame et al.; Markov et al.
      - Ratios of links
      - Hashtag or user mentions
      - Character flooding
      - Emoticons / laughter expressions
      - Domain names
    ● Emotional features: Adame et al.; Martinc et al.
      - Emotions
      - Appraisal
      - Admiration
      - Pos/neg emoticons
      - Sentiment words
      - ...
    ● Specific lists of words, most discriminant words, ...: Martinc et al.; Kocher & Savoy; Khan
  • 13. Approaches - Features
    ● N-gram models: Martinc et al.; Alrifai et al.; Kheng et al.; Markov et al.; Ribeiro-Oliveira & Ferreira; Ogaltsov & Romanov; Schaetti; Ciobanu et al.
    ● Bag-of-words: Adame et al.; Tellez et al.
    ● Tf-idf n-grams: Poulston et al.; Schaetti; Basile et al.
    ● LSA: Kheng et al.
    ● Second order representation: Pastor et al.
    ● Word embeddings: Ignatov et al.; Kodiyan et al.; Sierra et al.; Poulston et al.; Miura et al.
    ● Character embeddings: Franco-Salvador et al.; Miura et al.
  • 14. Approaches - Methods
    ● Logistic regression: Ignatov et al.; Martinc et al.; Poulston et al.; Ogaltsov & Romanov
    ● SVM: Alrifai et al.; Kheng et al.; Pastor et al.; Markov et al.; Tellez et al.; Basile et al.; Ribeiro-Oliveira & Ferreira; Ciobanu et al.
    ● Naive Bayes: Kheng et al.
    ● Distance-based approaches: Adame et al.; Kocher & Savoy; Khan
    ● Recurrent Neural Networks: Kodiyan et al.; Miura et al.
    ● Convolutional Neural Networks: Schaetti; Sierra et al.; Miura et al.
    ● Deep Averaging Networks: Franco-Salvador et al.
  • 17. Confusion among varieties (AR)
  • 18. Confusion among varieties (PT)
  • 19. Confusion among varieties (ES)
  • 20. Confusion among varieties (EN)
  • 21. Coarse vs. fine-grained English
    ● American: United States + Canada.
    ● European: Great Britain + Ireland.
    ● Oceanic: New Zealand + Australia.
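The coarse evaluation can be illustrated by mapping each fine-grained English variety to its group before scoring. The grouping is taken from the slide; the label strings and function names are hypothetical.

```python
# Coarse groups as listed on the slide; the label strings themselves are hypothetical.
COARSE = {
    "united states": "american",
    "canada": "american",
    "great britain": "european",
    "ireland": "european",
    "new zealand": "oceanic",
    "australia": "oceanic",
}


def accuracy(gold, predicted):
    """Fraction of positions where gold and predicted labels agree."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and of equal length")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def coarse_accuracy(gold, predicted):
    """Accuracy after mapping fine-grained varieties to their coarse groups."""
    return accuracy([COARSE[g] for g in gold], [COARSE[p] for p in predicted])
```

A prediction of "united states" for a Canadian author counts as an error in the fine-grained score but as correct in the coarse one, which is why the coarse accuracy can never be lower.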
  • 22. The impact of gender on variety identification
    ● All participants’ predictions together.
    ● Except in Spanish, it is less difficult to predict the variety when the author is female.
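A breakdown like the one on this slide could be computed by pooling every participant's predictions and scoring variety accuracy separately for each gold gender. The record layout below is an assumption; the slides do not describe the analysis code or the significance test used.

```python
from collections import defaultdict


def variety_accuracy_by_gender(records):
    """records: iterable of (gold_gender, gold_variety, predicted_variety) tuples,
    pooled over all participants. Returns gender -> variety accuracy."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for gender, gold_variety, predicted_variety in records:
        total[gender] += 1
        correct[gender] += int(gold_variety == predicted_variety)
    return {gender: correct[gender] / total[gender] for gender in total}
```

Running this once per language gives the per-gender figures that the slide compares.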
  • 23. The difficulty of gender identification depending on variety
    ● All participants’ predictions together.
    ● For most Arabic and Portuguese varieties, females are less difficult to identify.
    ● In Spanish and English, both genders are similarly difficult to identify.
  • 27. Conclusions
    ● High combination of features: content-based, stylometric, n-grams, … and, for the first time, deep learning approaches have been widely used.
      ○ Deep learning approaches did not obtain the best results.
    ● Per language:
      ○ The best results have been obtained in Portuguese.
      ○ On average, the worst results in gender identification have been obtained in Arabic.
      ○ On average, the worst results in language variety identification have been obtained in English.
    ● Per variety:
      ○ In Arabic, the most difficult variety is Gulf and the easiest is Levantine.
      ○ In English, the highest confusion occurs among varieties which share regional locations.
      ○ In Spanish, most confusions occur through Colombia. The highest confusion is from Peru.
      ○ Portuguese is asymmetric: the highest confusions are from Portugal to Brazil.
    ● Coarse vs. fine-grained evaluation in English:
      ○ Significant differences, although not very high (3.75%) in the case of the best approaches.
    ● The impact of gender on language variety identification:
      ○ In Arabic and Portuguese the differences among genders are significant.
    ● The difficulty of gender identification depending on the language variety:
      ○ For most Arabic and Portuguese varieties, females are less difficult to identify.
      ○ In Spanish and English, both genders are similarly difficult to identify.
  • 28. Task impact
                   PARTICIPANTS  COUNTRIES  CITATIONS
    PAN-AP 2013    21            16         67 (+28)
    PAN-AP 2014    10            8          41 (+25)
    PAN-AP 2015    22            13         42 (+25)
    PAN-AP 2016    22            15         5
    PAN-AP 2017    22            19
  • 30. Industry at PAN (Author Profiling)
    Organisation, Sponsors, Participants
  • 31. On behalf of the author profiling task organisers: Thank you very much for participating, and we hope to see you next year!