These slides present the overview of the fifth Author Profiling task at PAN-CLEF 2017, held in Dublin.
This year's task addressed gender and language variety identification in Spanish, English and, as a novelty, Arabic and Portuguese.
Gender and Language Variety Identification in Twitter. Overview of the 5th Author Profiling task at PAN@CLEF 2017.
1. 5th Author Profiling task at PAN
Gender and Language Variety
Identification in Twitter
PAN-AP-2017 CLEF 2017
Dublin, 11-14 September
Francisco Rangel
Autoritas Consulting &
PRHLT Research Center -
Universitat Politècnica de València
Paolo Rosso
PRHLT Research Center
Universitat Politècnica de València
Martin Potthast & Benno Stein
Bauhaus-Universität Weimar
2. Introduction
Author profiling aims at identifying
personal traits such as age, gender,
personality traits, native language,
language variety… from writings.
This is crucial for:
- Marketing
- Security
- Forensics
PAN’16AuthorProfiling
3. Task goal
To investigate the identification of
author’s gender and language
variety together.
Four languages:
English, Spanish, Portuguese, Arabic
7. Evaluation measures
The accuracy is calculated per task and language.
Then, the averages per task are calculated:
Finally, the ranking is the global average:
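The averaging described on this slide can be sketched as follows. This is a minimal illustration, assuming that each task average is the plain mean of the per-language accuracies and that the global ranking score is the plain mean of the task averages. The function names and the accuracy values are hypothetical and are not results from the task.

```python
from statistics import mean


def task_averages(acc):
    """Map each task to the mean of its per-language accuracies.

    `acc` has the shape {task: {language: accuracy}}.
    """
    return {task: mean(per_lang.values()) for task, per_lang in acc.items()}


def global_score(acc):
    """Mean of the per-task averages (assumed to be the ranking score)."""
    return mean(task_averages(acc).values())


# Hypothetical accuracies, for illustration only.
example = {
    "gender": {"ar": 0.70, "en": 0.80, "es": 0.80, "pt": 0.90},
    "variety": {"ar": 0.80, "en": 0.90, "es": 0.90, "pt": 1.00},
}

if __name__ == "__main__":
    print(task_averages(example))
    print(global_score(example))
```

Any further task score that enters the ranking would simply be added as another key of the dictionary.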
8. Baselines
● BASELINE-stat: A statistical baseline that emulates random
choice.
● BASELINE-bow:
○ Documents represented as bag-of-words.
○ The 1,000 most common words in the training set.
○ Weighted by absolute frequency.
○ Preprocessing: lowercasing, removal of punctuation marks and
numbers, removal of stopwords.
● BASELINE-LDR:
○ Documents represented by the probability distribution of
occurrence of their words in the different classes.
○ Each word is weighted depending on its probability of
belonging to each class.
○ The distribution of weights for a given document should be
closer to the weights of its corresponding class.
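The BASELINE-bow representation listed above can be sketched with the standard library alone. This is an illustrative sketch, not the organisers' implementation: the stopword list is a tiny hypothetical placeholder (the slides do not say which list was used), and the function names are invented for the example.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stopword list, not the one used in the task.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}


def preprocess(text):
    """Lowercase, replace punctuation, digits and underscores by spaces,
    then drop stopwords."""
    text = text.lower()
    text = re.sub(r"[^\w\s]|\d|_", " ", text)
    return [tok for tok in text.split() if tok not in STOPWORDS]


def build_vocabulary(train_docs, size=1000):
    """Return the `size` most common words of the training documents."""
    counts = Counter()
    for doc in train_docs:
        counts.update(preprocess(doc))
    return [word for word, _ in counts.most_common(size)]


def vectorize(doc, vocabulary):
    """Represent a document by the absolute frequency of each vocabulary word."""
    counts = Counter(preprocess(doc))
    return [counts[word] for word in vocabulary]


if __name__ == "__main__":
    train = ["The cat sat, the cat ran 2 times!", "A dog sat."]
    vocab = build_vocabulary(train, size=1000)
    print(vocab)
    print(vectorize("Cat cat SAT!", vocab))
```

The resulting vectors could be passed to any classifier; the slides do not state which one the baseline used.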
11. Approaches - Preprocessing
HTML cleaning to obtain plain text: Khan; Martinc et al.; Ribeiro-Oliveira & Ferreira
Punctuation signs: Ribeiro-Oliveira & Ferreira; Martinc et al.; Schaetti
Stop words: Kheng et al.; Martinc et al.
Lowercase: Franco-Salvador et al.; Kheng et al.; Kodiyan et al.; Miura et al.
Remove short tweets: Kheng et al.
Twitter-specific components (hashtags, URLs, mentions and RTs): Franco-Salvador et al.; Adame et al.; Kheng et al.; Kodiyan et al.; Markov et al.; Miura et al.; Ribeiro-Oliveira & Ferreira; Schaetti
Out-of-alphabet words: Schaetti
Expand contractions: Adame et al.
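As an illustration of the Twitter-specific preprocessing listed above, the sketch below normalises retweet markers, URLs, mentions and hashtags and lowercases the text. It is a hypothetical example and not any participant's pipeline: replacing each component with a placeholder token (rather than deleting it) is an assumption, and the placeholder names are invented.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
RT_RE = re.compile(r"^rt\s+", re.IGNORECASE)  # leading retweet marker


def normalize_tweet(text):
    """Strip a leading 'RT', replace URLs, mentions and hashtags by
    placeholder tokens, lowercase, and collapse whitespace."""
    text = RT_RE.sub("", text.strip())
    # URLs are handled first so that '#' fragments inside them are not
    # mistaken for hashtags.
    text = URL_RE.sub("<url>", text)
    text = MENTION_RE.sub("<user>", text)
    text = HASHTAG_RE.sub("<hashtag>", text)
    return " ".join(text.lower().split())


if __name__ == "__main__":
    print(normalize_tweet("RT @Bob: Check https://t.co/abc #PAN17 now"))
```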
12. Approaches - Features
Stylistic features:
- Ratios of links
- Hashtag or user mentions
- Character flooding
- Emoticons / laughter expressions
- Domain names
Alrifai et al.; Ribeiro-Oliveira & Ferreira; Martinc et al.; Adame
et al.; Markov et al.
Emotional features:
● Emotions
● Appraisal
● Admiration
● Pos/neg emoticons
● Sentiment words
● ...
Adame et al.; Martinc et al.
Specific lists of words, most discriminant words, ...
Martinc et al.; Kocher & Savoy; Khan
13. Approaches - Features
N-gram models: Martinc et al.; Alrifai et al.; Kheng et al.; Markov et al.; Ribeiro-Oliveira & Ferreira; Ogaltsov & Romanov; Schaetti; Ciobanu et al.
Bag-of-words: Adame et al.; Tellez et al.
Tf-idf n-grams: Poulston et al.; Schaetti; Basile et al.
LSA: Kheng et al.
Second order representation: Pastor et al.
Word embeddings: Ignatov et al.; Kodiyan et al.; Sierra et al.; Poulston et al.; Miura et al.
Character embeddings: Franco-Salvador et al.; Miura et al.
14. Approaches - Methods
Logistic regression: Ignatov et al.; Martinc et al.; Poulston et al.; Ogaltsov & Romanov
SVM: Alrifai et al.; Kheng et al.; Pastor et al.; Markov et al.; Tellez et al.; Basile et al.; Ribeiro-Oliveira & Ferreira; Ciobanu et al.
Naive Bayes: Kheng et al.
Distance-based approaches: Adame et al.; Kocher & Savoy; Khan
Recurrent Neural Networks: Kodiyan et al.; Miura et al.
Convolutional Neural Networks: Schaetti; Sierra et al.; Miura et al.
Deep Averaging Networks: Franco-Salvador et al.
21. Coarse vs. fine-grained English
● American: United States + Canada.
● European: Great Britain + Ireland.
● Oceanic: New Zealand + Australia.
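The grouping above suggests how a coarse-grained score can be derived from fine-grained predictions: map both the gold and the predicted variety to its group and count a hit whenever the groups match. The sketch below is illustrative only. The label strings and function names are hypothetical, since the slides do not give the dataset's actual label format.

```python
# Grouping taken from the slide; the label strings themselves are assumed.
COARSE = {
    "United States": "American",
    "Canada": "American",
    "Great Britain": "European",
    "Ireland": "European",
    "New Zealand": "Oceanic",
    "Australia": "Oceanic",
}


def accuracy(gold, predicted):
    """Fraction of positions where the prediction equals the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def coarse_accuracy(gold, predicted):
    """Accuracy after mapping fine-grained varieties to their coarse group."""
    return accuracy([COARSE[g] for g in gold], [COARSE[p] for p in predicted])


if __name__ == "__main__":
    gold = ["Canada", "Ireland", "Australia", "United States"]
    pred = ["United States", "Great Britain", "Canada", "United States"]
    print(accuracy(gold, pred), coarse_accuracy(gold, pred))
```

By construction the coarse score can never be lower than the fine-grained one, because every exact match is also a group match.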
22. The impact of gender on variety identification
● All participants’ predictions together.
● Except in Spanish, the variety is easier to predict when the
author is female.
23. The difficulty of gender identification depending on variety
● All participants’ predictions together.
● For most Arabic and Portuguese varieties, females are easier to identify.
● For Spanish and English, both genders are similarly difficult to identify.
27. Conclusions
● A wide range of features was combined: content-based, stylometric, n-grams, ... For the first time, deep
learning approaches have been widely used.
○ Deep learning approaches did not obtain the best results.
● Per language:
○ The best results have been obtained in Portuguese.
○ On average, the worst results in gender identification have been obtained in Arabic.
○ On average, the worst results in language variety identification have been obtained in English.
● Per variety:
○ In Arabic, the most difficult variety is Gulf and the easiest is Levantine.
○ In English, the highest confusion occurs among varieties which share regional locations.
○ In Spanish, most confusions are towards Colombia. The highest confusion is from Peru.
○ Portuguese is asymmetric: the highest confusion is from Portugal to Brazil.
● Coarse vs. fine-grained evaluation in English:
○ The differences are significant, although not very large (3.75%) for the best approaches.
● The impact of gender on language variety identification:
○ In Arabic and Portuguese, the differences between genders are significant.
● The difficulty of gender identification depending on the language variety:
○ For most Arabic and Portuguese varieties, females are easier to identify.
○ For Spanish and English, both genders are similarly difficult to identify.