SlideShare una empresa de Scribd logo
1 de 38
Descargar para leer sin conexión
Author Profiling
PAN-AP-2013 - CLEF 2013
Valencia, 24th September 2013
Francisco Rangel
Autoritas / Universitat
Politècnica de València
Paolo Rosso
Universitat Politècnica
de València
Moshe Koppel
Bar-Illan University
Efstathios Stamatatos
University of the Aegean
Giacomo Inches
University of Lugano
2
Gender?
Age?
Native
language?
Emotions?
Personality
traits?
Author Profile... Who is who?
What’s Author Profiling?
3
Why Author Profiling?
Forensics Security Marketing
Language as
evidence
Profile
possible
delinquents
Segmenting
users
4
Task Goals
‣Given a collection of documents retrieved from
Social Media in English and Spanish...
MAIN GOAL
Identify age and
gender
SECONDARY GOALS
Test the robustness of the
approaches for identifying age and
gender of predators
Measure the computational time
needed to perform the task
5
Related Work
AUTHOR COLLECTION FEATURES RESULTS
OTHER
CHARACTERISTICS
Argamon et al., 2002 British National Corpus Part-of-speech Gender: 80% accuracy
Holmes & Meyerhoff,
2003
Formal texts - Age and gender
Burger & Henderson,
2006
Blogs
Posts length, capital letters,
punctuations. HTML features.
They only reported:
“Low percentage errors”
Two age classes: [0,18[,[18,-]
Koppel et al., 2003 Blogs
Simple lexical and syntactic
functions
Gender: 80% accuracy Self-labeling
Schler et al., 2006 Blogs
Stylistic features + content words
with the highest information gain
Gender: 80% accuracy
Age: 75% accuracy
Goswami et al., 2009 Blogs Slang + sentence length
Gender: 89.18 accuracy
Age: 80.32 accuracy
Zhang & Zhang, 2010 Segments of blog
Words, punctuation, average
words/sentence length, POS, word
factor analysis
Gender: 72,10 accuracy
Nguyen et al., 2011 y
2013
Blogs & Twitter Unigrams, POS, LIWC
Correlation: 0.74
Mean absolute error: 4.1
- 6.8 years
Manual labeling
Age as continuous variable
Peersman et al., 2011 Netlog Unigrams, bigrams, trigrams and
tetagrams
Gender+Age: 88.8
accuracy
Self-labeling, min 16 plus
16,18,25
6
Data Collection - Social Media
‣Big Data?
‣High variety of themes
‣Sexual conversations vs. sexual predators
‣Difficulty to obtain good label data
‣Real people vs. Robots (chatbots)
‣Multilingual: English + Spanish
7
Data Collection - English Distribution
MIN MAX AVG STD
0 22,736 335 208
Numberofdocuments
Number of words
8
Data Collection - English Distribution (zoomed)
‣ If we zoom the distribution, we can observe a gaussian like distribution, with its maximum on the value 415.
335 495
415
Numberofdocuments
Number of words
9
Data Collection - English Distribution (log-log)
‣ The log-log representation shows how the distribution has a long tail component, specifically in two cases,
one before the point of maximum frequency and another one after this.
‣ We could use this property to select minimum and maximum number of words that the posts must have.
Numberofdocuments(log)
Number of words (log)
10
Data Collection - Spanish Distribution
Numberofdocuments
Number of words
MIN MAX AVG STD
0 12,246 176 832
11
Data Collection - Spanish Distribution (zoomed)
500
15
Numberofdocuments
Number of words
12
Data Collection - Spanish Distribution (log-log)
Numberofdocuments(log)
Number of words (log)
13
Data Collection - Selection Criteria
‣ Grouping posts by
author
‣ Keeping authors with
few post
‣ Chunking authors with
more than 1,000 words
‣ Balanced by gender
‣ A g e g r o u p s ( n o n -
balanced):
‣ 10s (13-17)
‣ 20s (23-27)
‣ 30s (33-47)
‣ Random split in three
datasets
‣ Training
‣ Early Bird (10%)
‣ Testing (+20%)
‣ Introduction of few special cases
‣ Predators (0.0012%)
‣ Adult-adult sexual conversations
14
Data Collection - Statistics
LANG AGE GENDERLANG AGE GENDERLANG AGE GENDER
NUMBER OF AUTHORSNUMBER OF AUTHORSNUMBER OF AUTHORS
LANG AGE GENDERLANG AGE GENDERLANG AGE GENDER
TRAINING EARLY BIRDS TEST
EN
10s
MALE 8,600 740 888
EN
10s
FEMALE 8,600 740 888
EN 20s
MALE (72) 42,828 3,840 (32) 4,576
EN 20s
FEMALE (25) 42,875 3,840 (10) 4,576
EN
30s
MALE (92) 66,708 6,020 (40) 7,184
EN
30s
FEMALE 66,800 6,020 7,224
236,600 21,200 25,440
ES
10s
MALE 1,250 120 144
ES
10s
FEMALE 1,250 120 144
ES 20s
MALE 21,300 1,920 2,304
ES 20s
FEMALE 21,300 1,920 2,304
ES
30s
MALE 15,400 1,360 1,632
ES
30s
FEMALE 15,400 1,360 1,632
75,900 6,800 8,160
Predators
Adult-adult sexual conversations
15
Performance measures for identification
Accuracy for
Gender
Accuracy for
Age
Accuracy for
Gender
Accuracy for
Age
ENGLISH SPANISH
Joint Accuracy Joint Accuracy
Average Accuracy
WINNER OF THE TASK
16
Other performance measures
Number of correctly identified gender and age for
predators
Total time needed to process the test data
Number of correctly identified gender and age for
sexual conversations between adults
17
Participants
‣ 66 registered teams
‣ 21 participants (32%)
‣ 16 countries
‣ 18 papers (86%)
‣ 8 long papers
‣ 10 short papers
18
Approaches
Preprocessing Features Methods
... did the teams perform?
‣What kind of ...
19
Approaches
HTML Cleaning to obtain plain text
5 teams: [gopal-patra][moreau][meina]
[weren][pavan]
Deletion of documents with at least 0.1%
of spam words
1 team: [flekova]
Principal Component Analysis to reduce
dimensionality
1 team: [yong-lim]
Subset selection during training to reduce
dimensionality
5 teams: [caurcel-diaz][flekova][moreau]
[hernandez-farias][sapkota]
Discrimination between human-like posts
and spam-like posts (chatbots)
1 team: [meina]
Preprocessing
20
Approaches
Stylistic features: frequencies of
punctuation marks, capital letters,
quotations...
9 teams: [yong-lim][cruz][pavan][gopal-
patra][de-arteaga][meina][flekova]
[aleman][santosh]
+ POS tags
5 teams: [yong lim][meina][aleman][cruz]
[santosh]
HTML-based features like image urls or
links
3 teams: [santosh][sapkota][meina]
Readability
7 teams: [gopal-patra][yong-lim][meina]
[flekova][aleman][weren][gillam]
Emoticons
2 teams: [aleman][hernandez-farias]
*[sapkota] explicitly discarded them
Features
21
Approaches
Content features: LSA, BoW,TF-IDF,
dictionary-based words, topic-based
words, entropy-based words...
11 teams: [sapkota][gopal-patra][yong-
lim][seifeddine][caurcel-diaz][flekova]
[meina][cruz][santosh][pavan]
[hernandez-farias]
Named entities 1 team: [flekova]
Sentiment words 1 team: [gopal-patra]
Emotions words 1 team: [meina]
Slang, contractions and words with
character flooding
4 teams: [flekova][caurcel-diaz][aleman]
[hernandez-farias]
Features
22
Approaches
Text to be identified is used as a query
for a search engine
1 team: [weren]
Unsupervised features based on statistics 1 team: [de-arteaga]
Language models (n-grams)
4 teams: [meina][jankowska][moreau]
[sapkota]
Collocations 1 team: [meina]
Second order representation based on
relationships between documents and
profiles
1 team: [pastor]
Features
23
Approaches
Decision Trees
5 teams: [santosh][gopal-patra]
[seifeddine][gillam][weren]
SupportVector Machines 3 teams: [yong-lim][cruz][sapkota]
Logistic Regression 2 teams: [de-arteaga][flekova]
Naïve Bayes 1 team: [meina]
Maximum Entropy 1 team: [pavan]
Stochastic Gradient Descent 1 team: [caurcel-diaz]
Random Forest 1 team: [aleman]
Information Retrieval 1 team: [weren]
Methods
24
Early birds results
‣5 teams participated, 1 team had technical problems
‣Figures for gender are very close to baseline
‣Main goal -> All participants improved in the final
evaluation, mainly Aleman
25
Final results
‣ 21 teams for English,
20 teams for Spanish
‣ Values similar to
Early Birds
‣ Gender close to
baseline
‣ Joint identification
more difficult
26
Results for English
Meina
Pastor L.
Seifeddine
Santosh
Yong Lim
Ladra
Aleman
Gillam
Kern
Cruz
Pavan
Caurcel Diaz
H. Farias
Jankowska
Flekova
Weren
Sapkota
De-Arteaga
Moreau
BASELINE
Gopal Patra
Cagnina
0 0,1 0,2 0,3 0,4 0,5 0,6 0,7 0,8 0,9 1
All results below 40%
BASELINE: 16%
2 teams below baseline
Joint Identification
27
Results for English
Meina
Pastor L.
Seifeddine
Santosh
Yong Lim
Ladra
Aleman
Gillam
Kern
Cruz
Pavan
Caurcel Diaz
H. Farias
Jankowska
Flekova
Weren
Sapkota
De-Arteaga
Moreau
BASELINE
Gopal Patra
Cagnina
0 0,1 0,2 0,3 0,4 0,5 0,6 0,7 0,8 0,9 1
0.4781
0.5921
All results below 60%
4 teams below 1% better
than baseline
BASELINE: 50%
3 teams below baseline
Gender Identification
28
Results for English
Meina
Pastor L.
Seifeddine
Santosh
Yong Lim
Ladra
Aleman
Gillam
Kern
Cruz
Pavan
Caurcel Diaz
H. Farias
Jankowska
Flekova
Weren
Sapkota
De-Arteaga
Moreau
BASELINE
Gopal Patra
Cagnina
0 0,1 0,2 0,3 0,4 0,5 0,6 0,7 0,8 0,9 1
0.1234
0.6572
All results below 70%
7 teams over 60%
9 teams between 50-60%
3 teams below 50%
BASELINE: 33%
2 teams below baseline
Age Identification
29
Results for English
Meina
Pastor L.
Seifeddine
Santosh
Yong Lim
Ladra
Aleman
Gillam
Kern
Cruz
Pavan
Caurcel Diaz
H. Farias
Jankowska
Flekova
Weren
Sapkota
De-Arteaga
Moreau
BASELINE
Gopal Patra
Cagnina
0 0,1 0,2 0,3 0,4 0,5 0,6 0,7 0,8 0,9 1
0.0741 0.1234
0.4781
0.6572
0.5921
0.3894
Joint identification Gender Age
The best teams
performed better in
both identifications
30
Results for Spanish
Santosh
Pastor L.
Cruz
Flekova
Ladra
De-Arteaga
Kern
Yong Lim
Sapkota
Pavan
Jankowska
Meina
Gillam
Moreau
Weren
Cagnina
Caurcel Diaz
H. Farias
BASELINE
Aleman
Seifeddine
Gopal Patra
0 0,1 0,2 0,3 0,4 0,5 0,6 0,7 0,8 0,9 1
All results below 45%
BASELINE: 16%
2 teams below baseline
Joint Identification
31
Results for Spanish
Santosh
Pastor L.
Cruz
Flekova
Ladra
De-Arteaga
Kern
Yong Lim
Sapkota
Pavan
Jankowska
Meina
Gillam
Moreau
Weren
Cagnina
Caurcel Diaz
H. Farias
BASELINE
Aleman
Seifeddine
Gopal Patra
0 0,1 0,2 0,3 0,4 0,5 0,6 0,7 0,8 0,9 1
0.4784
0.6473
All results below 65%
2 teams equal to baseline
BASELINE: 50%
3 teams below baseline
Gender Identification
32
Results for Spanish
Santosh
Pastor L.
Cruz
Flekova
Ladra
De-Arteaga
Kern
Yong Lim
Sapkota
Pavan
Jankowska
Meina
Gillam
Moreau
Weren
Cagnina
Caurcel Diaz
H. Farias
BASELINE
Aleman
Seifeddine
Gopal Patra
0 0,1 0,2 0,3 0,4 0,5 0,6 0,7 0,8 0,9 1
0.0512
0.6558
All results below 70%
3 teams over 60%
9 teams between 50-60%
6 teams below 50%
BASELINE: 33%
2 teams below baseline
Age Identification
33
Results for Spanish
Santosh
Pastor L.
Cruz
Flekova
Ladra
De-Arteaga
Kern
Yong Lim
Sapkota
Pavan
Jankowska
Meina
Gillam
Moreau
Weren
Cagnina
Caurcel Diaz
H. Farias
BASELINE
Aleman
Seifeddine
Gopal Patra
0 0,1 0,2 0,3 0,4 0,5 0,6 0,7 0,8 0,9 10.0287 0.0512
0.4784
0.6558
0.64730.4208
Total Gender Age
The best teams
performed better in
both identifications
34
Results per language
Pastor
Santosh
Cruz
Ladra
Yong Lim
Flekova
Meina
Kern
Gillam
Pavan
De-Arteaga
Jankowska
Sapkota
Weren
Moreau
Aleman
Caurcel Diaz
H. Farias
Seifeddine
Cagnina
Gopal Patra
0 0,1 0,2 0,3 0,4 0,5 0,6 0,7 0,8 0,9 1
English Spanish
For 10 team English is better
For 10 team Spanish is better
1 team only participated in English
35
Sexual conversations
0
0,1
0,2
0,3
0,4
0,5
0,6
0,7
0,8
0,9
1
0 0,25 0,5 0,75 1
Adult-Adult Predators
Age Gender
‣ Age identification similar to previous values
‣ Gender identification performed better in case of predators
36
Runtime
Gillam
Ladra
Pastor L.
Caurcel Diaz
Pavan
De Arteaga
Cruz
Weren
Jankowska
Santosh
Kern
Flekova
Gopal Patra
Aleman
H. Farias
Sapkota
Meina
Moureau
Yong Lim
Cagnina
Seifeddine
0 275000000 550000000 825000000 1100000000
10.26 minutes
28.83 minutes
38.31 minutes
54.03 minutes
1.04 hours
1.09 hours
2.66 hours
3.25 hours
4.66 hours
4.86 hours
5.08 hours
5.13 hours
6.37 hours
6.56 hours
6.82 hours
17.88 hours
4.44 days
5.19 days
6.68 days
9.90 days
11.78 days
‣Big Data problem?
1...........
2.............
3..................
21...
20..........
19......
9..............
4................
17..
10..............
11.....
14............
12......
8..................
6.............
16...........
18.........
13..........
7................
15......
5...........
37
Conclusions
Also...
‣ We received many different and enriching approaches
‣ We were one of the task with the higher number of participants at CLEF
(21)
‣ Interest from many teams (66 registered) but the task was new and many
(5) did not make it (potentially more participation next year!)
‣Very difficult task, mainly for
gender identification
‣ Difficult to identify together
age and gender
For predators...
‣ Robust identifying age
‣ Better identifying gender
‣ Expensive in Time consuming ->
Big Data problem?
Francisco Rangel
Autoritas Consulting /
Universitat Politècnica deValència
Paolo Rosso
Universitat Politècnica deValència
Moshe Koppel
Bar-Ilan University
Efstathios Stamatatos
University of the Aegean
Giacomo Inches
University of Lugano
On behalf of the AP task organisers:
Thank you very much for participating!
We hope to see you again next year!

Más contenido relacionado

Similar a Author Profiling. PAN@CLEF-2013 Task

Localizationwebinar.1
Localizationwebinar.1Localizationwebinar.1
Localizationwebinar.1QuestionPro
 
Overview of the 2nd. Author Profiling task at PAN-CLEF 2014
Overview of the 2nd. Author Profiling task at PAN-CLEF 2014Overview of the 2nd. Author Profiling task at PAN-CLEF 2014
Overview of the 2nd. Author Profiling task at PAN-CLEF 2014Francisco Manuel Rangel Pardo
 
Gender and Language Variety Identification in Twitter. Overview of the 5th. A...
Gender and Language Variety Identification in Twitter. Overview of the 5th. A...Gender and Language Variety Identification in Twitter. Overview of the 5th. A...
Gender and Language Variety Identification in Twitter. Overview of the 5th. A...Francisco Manuel Rangel Pardo
 
Overview of the 9th Author Profiling task at PAN: Profiling Hate Speech Sprea...
Overview of the 9th Author Profiling task at PAN: Profiling Hate Speech Sprea...Overview of the 9th Author Profiling task at PAN: Profiling Hate Speech Sprea...
Overview of the 9th Author Profiling task at PAN: Profiling Hate Speech Sprea...Francisco Manuel Rangel Pardo
 
Overview of the 4th. Author Profiling task at PAN-CLEF 2016
Overview of the 4th. Author Profiling task at PAN-CLEF 2016Overview of the 4th. Author Profiling task at PAN-CLEF 2016
Overview of the 4th. Author Profiling task at PAN-CLEF 2016Francisco Manuel Rangel Pardo
 
201511-TIA_Presentation
201511-TIA_Presentation201511-TIA_Presentation
201511-TIA_Presentationhpcosta
 
Academic Writing: Think before you write
Academic Writing: Think before you writeAcademic Writing: Think before you write
Academic Writing: Think before you writeRon Martinez
 
AL4Trust - Artificial Intelligence for Building Trust
AL4Trust - Artificial Intelligence for Building TrustAL4Trust - Artificial Intelligence for Building Trust
AL4Trust - Artificial Intelligence for Building TrustFrancisco Manuel Rangel Pardo
 
Academic Writing: Think before you write - Week 3 2019
Academic Writing: Think before you write - Week 3 2019Academic Writing: Think before you write - Week 3 2019
Academic Writing: Think before you write - Week 3 2019Ron Martinez
 
Classification of noisy free-text prostate cancer pathology reports using nat...
Classification of noisy free-text prostate cancer pathology reports using nat...Classification of noisy free-text prostate cancer pathology reports using nat...
Classification of noisy free-text prostate cancer pathology reports using nat...Institute of Information Systems (HES-SO)
 
Babak Rasolzadeh: The importance of entities
Babak Rasolzadeh: The importance of entitiesBabak Rasolzadeh: The importance of entities
Babak Rasolzadeh: The importance of entitiesZoltan Varju
 
Classification of prostate cancer pathology reports using natural language pr...
Classification of prostate cancer pathology reports using natural language pr...Classification of prostate cancer pathology reports using natural language pr...
Classification of prostate cancer pathology reports using natural language pr...Anjani Dhrangadhariya
 
LREC 2014 - Out in the open: Finding and categorising errors in the lexical s...
LREC 2014 - Out in the open: Finding and categorising errors in the lexical s...LREC 2014 - Out in the open: Finding and categorising errors in the lexical s...
LREC 2014 - Out in the open: Finding and categorising errors in the lexical s...Matt Shardlow
 
Big Data Spain 2017 - Deriving Actionable Insights from High Volume Media St...
Big Data Spain 2017  - Deriving Actionable Insights from High Volume Media St...Big Data Spain 2017  - Deriving Actionable Insights from High Volume Media St...
Big Data Spain 2017 - Deriving Actionable Insights from High Volume Media St...Apache OpenNLP
 
User review sites as a resource for large scale sociolinguistic studies
User review sites as a resource for large scale sociolinguistic studiesUser review sites as a resource for large scale sociolinguistic studies
User review sites as a resource for large scale sociolinguistic studiesHacer Tilbeç Turgut
 
How to survive international mail surveys
How to survive international mail surveysHow to survive international mail surveys
How to survive international mail surveysAnne-Wil Harzing
 

Similar a Author Profiling. PAN@CLEF-2013 Task (18)

Language Access for Legal Aid Websites
Language Access for Legal Aid WebsitesLanguage Access for Legal Aid Websites
Language Access for Legal Aid Websites
 
Localizationwebinar.1
Localizationwebinar.1Localizationwebinar.1
Localizationwebinar.1
 
Overview of the 2nd. Author Profiling task at PAN-CLEF 2014
Overview of the 2nd. Author Profiling task at PAN-CLEF 2014Overview of the 2nd. Author Profiling task at PAN-CLEF 2014
Overview of the 2nd. Author Profiling task at PAN-CLEF 2014
 
Gender and Language Variety Identification in Twitter. Overview of the 5th. A...
Gender and Language Variety Identification in Twitter. Overview of the 5th. A...Gender and Language Variety Identification in Twitter. Overview of the 5th. A...
Gender and Language Variety Identification in Twitter. Overview of the 5th. A...
 
Overview of the 9th Author Profiling task at PAN: Profiling Hate Speech Sprea...
Overview of the 9th Author Profiling task at PAN: Profiling Hate Speech Sprea...Overview of the 9th Author Profiling task at PAN: Profiling Hate Speech Sprea...
Overview of the 9th Author Profiling task at PAN: Profiling Hate Speech Sprea...
 
Overview of the 4th. Author Profiling task at PAN-CLEF 2016
Overview of the 4th. Author Profiling task at PAN-CLEF 2016Overview of the 4th. Author Profiling task at PAN-CLEF 2016
Overview of the 4th. Author Profiling task at PAN-CLEF 2016
 
201511-TIA_Presentation
201511-TIA_Presentation201511-TIA_Presentation
201511-TIA_Presentation
 
Academic Writing: Think before you write
Academic Writing: Think before you writeAcademic Writing: Think before you write
Academic Writing: Think before you write
 
AL4Trust - Artificial Intelligence for Building Trust
AL4Trust - Artificial Intelligence for Building TrustAL4Trust - Artificial Intelligence for Building Trust
AL4Trust - Artificial Intelligence for Building Trust
 
Academic Writing: Think before you write - Week 3 2019
Academic Writing: Think before you write - Week 3 2019Academic Writing: Think before you write - Week 3 2019
Academic Writing: Think before you write - Week 3 2019
 
Classification of noisy free-text prostate cancer pathology reports using nat...
Classification of noisy free-text prostate cancer pathology reports using nat...Classification of noisy free-text prostate cancer pathology reports using nat...
Classification of noisy free-text prostate cancer pathology reports using nat...
 
Babak Rasolzadeh: The importance of entities
Babak Rasolzadeh: The importance of entitiesBabak Rasolzadeh: The importance of entities
Babak Rasolzadeh: The importance of entities
 
Classification of prostate cancer pathology reports using natural language pr...
Classification of prostate cancer pathology reports using natural language pr...Classification of prostate cancer pathology reports using natural language pr...
Classification of prostate cancer pathology reports using natural language pr...
 
LREC 2014 - Out in the open: Finding and categorising errors in the lexical s...
LREC 2014 - Out in the open: Finding and categorising errors in the lexical s...LREC 2014 - Out in the open: Finding and categorising errors in the lexical s...
LREC 2014 - Out in the open: Finding and categorising errors in the lexical s...
 
Big Data Spain 2017 - Deriving Actionable Insights from High Volume Media St...
Big Data Spain 2017  - Deriving Actionable Insights from High Volume Media St...Big Data Spain 2017  - Deriving Actionable Insights from High Volume Media St...
Big Data Spain 2017 - Deriving Actionable Insights from High Volume Media St...
 
User review sites as a resource for large scale sociolinguistic studies
User review sites as a resource for large scale sociolinguistic studiesUser review sites as a resource for large scale sociolinguistic studies
User review sites as a resource for large scale sociolinguistic studies
 
Miguel Rios - 2015 - Obtaining SMT dictionaries for related languages
Miguel Rios - 2015 - Obtaining SMT dictionaries for related languagesMiguel Rios - 2015 - Obtaining SMT dictionaries for related languages
Miguel Rios - 2015 - Obtaining SMT dictionaries for related languages
 
How to survive international mail surveys
How to survive international mail surveysHow to survive international mail surveys
How to survive international mail surveys
 

Más de Francisco Manuel Rangel Pardo

Profiling Irony and Stereotype Spreaders on Twitter (IROSTEREO)
Profiling Irony and Stereotype Spreaders on Twitter (IROSTEREO)Profiling Irony and Stereotype Spreaders on Twitter (IROSTEREO)
Profiling Irony and Stereotype Spreaders on Twitter (IROSTEREO)Francisco Manuel Rangel Pardo
 
Overview of the 7th Author Profiling task at PAN: Bots and Gender Profiling ...
Overview of the 7th Author Profiling task at PAN: Bots and Gender Profiling  ...Overview of the 7th Author Profiling task at PAN: Bots and Gender Profiling  ...
Overview of the 7th Author Profiling task at PAN: Bots and Gender Profiling ...Francisco Manuel Rangel Pardo
 
AL4Trust - Artificial Intelligence for Building Trust 2019
AL4Trust - Artificial Intelligence for Building Trust 2019AL4Trust - Artificial Intelligence for Building Trust 2019
AL4Trust - Artificial Intelligence for Building Trust 2019Francisco Manuel Rangel Pardo
 
Author Profiling en Social Media. En la Academia... y en la Industria.
Author Profiling en Social Media. En la Academia... y en la Industria.Author Profiling en Social Media. En la Academia... y en la Industria.
Author Profiling en Social Media. En la Academia... y en la Industria.Francisco Manuel Rangel Pardo
 
Multimodal Stance Detection in Tweets on Catalan #1Oct Referendum @Ibereval 2...
Multimodal Stance Detection in Tweets on Catalan #1Oct Referendum @Ibereval 2...Multimodal Stance Detection in Tweets on Catalan #1Oct Referendum @Ibereval 2...
Multimodal Stance Detection in Tweets on Catalan #1Oct Referendum @Ibereval 2...Francisco Manuel Rangel Pardo
 
Overview of the 6th Author Profiling task at PAN: Multimodal Gender Identific...
Overview of the 6th Author Profiling task at PAN: Multimodal Gender Identific...Overview of the 6th Author Profiling task at PAN: Multimodal Gender Identific...
Overview of the 6th Author Profiling task at PAN: Multimodal Gender Identific...Francisco Manuel Rangel Pardo
 
RusProfiling Gender Identification in Russian Texts PAN@FIRE
RusProfiling Gender Identification in Russian Texts PAN@FIRERusProfiling Gender Identification in Russian Texts PAN@FIRE
RusProfiling Gender Identification in Russian Texts PAN@FIREFrancisco Manuel Rangel Pardo
 
Stance and Gender Detection in Tweets on Catalan Independence. Ibereval@SEPLN...
Stance and Gender Detection in Tweets on Catalan Independence. Ibereval@SEPLN...Stance and Gender Detection in Tweets on Catalan Independence. Ibereval@SEPLN...
Stance and Gender Detection in Tweets on Catalan Independence. Ibereval@SEPLN...Francisco Manuel Rangel Pardo
 
El Futuro de las Comunicaciones Personales a Través de los Dispositivos Móvil...
El Futuro de las Comunicaciones Personales a Través de los Dispositivos Móvil...El Futuro de las Comunicaciones Personales a Través de los Dispositivos Móvil...
El Futuro de las Comunicaciones Personales a Través de los Dispositivos Móvil...Francisco Manuel Rangel Pardo
 
A Low Dimensionality Representation for Language Variety Identification (CICL...
A Low Dimensionality Representation for Language Variety Identification (CICL...A Low Dimensionality Representation for Language Variety Identification (CICL...
A Low Dimensionality Representation for Language Variety Identification (CICL...Francisco Manuel Rangel Pardo
 
Language Variety Identification using Distributed Representations of Words an...
Language Variety Identification using Distributed Representations of Words an...Language Variety Identification using Distributed Representations of Words an...
Language Variety Identification using Distributed Representations of Words an...Francisco Manuel Rangel Pardo
 
Native Language Identification - Brief review to the state of the art
Native Language Identification - Brief review to the state of the artNative Language Identification - Brief review to the state of the art
Native Language Identification - Brief review to the state of the artFrancisco Manuel Rangel Pardo
 
Social Business Intelligence - Inteligencia Social de Negocio
Social Business Intelligence - Inteligencia Social de NegocioSocial Business Intelligence - Inteligencia Social de Negocio
Social Business Intelligence - Inteligencia Social de NegocioFrancisco Manuel Rangel Pardo
 
Dualidad onda-partícula del científico de datos en la empresa
Dualidad onda-partícula del científico de datos en la empresaDualidad onda-partícula del científico de datos en la empresa
Dualidad onda-partícula del científico de datos en la empresaFrancisco Manuel Rangel Pardo
 

Más de Francisco Manuel Rangel Pardo (20)

Profiling Irony and Stereotype Spreaders on Twitter (IROSTEREO)
Profiling Irony and Stereotype Spreaders on Twitter (IROSTEREO)Profiling Irony and Stereotype Spreaders on Twitter (IROSTEREO)
Profiling Irony and Stereotype Spreaders on Twitter (IROSTEREO)
 
Overview of the 7th Author Profiling task at PAN: Bots and Gender Profiling ...
Overview of the 7th Author Profiling task at PAN: Bots and Gender Profiling  ...Overview of the 7th Author Profiling task at PAN: Bots and Gender Profiling  ...
Overview of the 7th Author Profiling task at PAN: Bots and Gender Profiling ...
 
AL4Trust - Artificial Intelligence for Building Trust 2019
AL4Trust - Artificial Intelligence for Building Trust 2019AL4Trust - Artificial Intelligence for Building Trust 2019
AL4Trust - Artificial Intelligence for Building Trust 2019
 
Author Profiling en Social Media. En la Academia... y en la Industria.
Author Profiling en Social Media. En la Academia... y en la Industria.Author Profiling en Social Media. En la Academia... y en la Industria.
Author Profiling en Social Media. En la Academia... y en la Industria.
 
Multimodal Stance Detection in Tweets on Catalan #1Oct Referendum @Ibereval 2...
Multimodal Stance Detection in Tweets on Catalan #1Oct Referendum @Ibereval 2...Multimodal Stance Detection in Tweets on Catalan #1Oct Referendum @Ibereval 2...
Multimodal Stance Detection in Tweets on Catalan #1Oct Referendum @Ibereval 2...
 
Overview of the 6th Author Profiling task at PAN: Multimodal Gender Identific...
Overview of the 6th Author Profiling task at PAN: Multimodal Gender Identific...Overview of the 6th Author Profiling task at PAN: Multimodal Gender Identific...
Overview of the 6th Author Profiling task at PAN: Multimodal Gender Identific...
 
RusProfiling Gender Identification in Russian Texts PAN@FIRE
RusProfiling Gender Identification in Russian Texts PAN@FIRERusProfiling Gender Identification in Russian Texts PAN@FIRE
RusProfiling Gender Identification in Russian Texts PAN@FIRE
 
Stance and Gender Detection in Tweets on Catalan Independence. Ibereval@SEPLN...
Stance and Gender Detection in Tweets on Catalan Independence. Ibereval@SEPLN...Stance and Gender Detection in Tweets on Catalan Independence. Ibereval@SEPLN...
Stance and Gender Detection in Tweets on Catalan Independence. Ibereval@SEPLN...
 
Redes sociales y preadolescentes
Redes sociales y preadolescentesRedes sociales y preadolescentes
Redes sociales y preadolescentes
 
El Futuro de las Comunicaciones Personales a Través de los Dispositivos Móvil...
El Futuro de las Comunicaciones Personales a Través de los Dispositivos Móvil...El Futuro de las Comunicaciones Personales a Través de los Dispositivos Móvil...
El Futuro de las Comunicaciones Personales a Través de los Dispositivos Móvil...
 
Smart Listening - MUIinf
Smart Listening - MUIinfSmart Listening - MUIinf
Smart Listening - MUIinf
 
IA + Big Data = problema + oportunidad
IA + Big Data = problema + oportunidadIA + Big Data = problema + oportunidad
IA + Big Data = problema + oportunidad
 
A Low Dimensionality Representation for Language Variety Identification (CICL...
A Low Dimensionality Representation for Language Variety Identification (CICL...A Low Dimensionality Representation for Language Variety Identification (CICL...
A Low Dimensionality Representation for Language Variety Identification (CICL...
 
Language Variety Identification using Distributed Representations of Words an...
Language Variety Identification using Distributed Representations of Words an...Language Variety Identification using Distributed Representations of Words an...
Language Variety Identification using Distributed Representations of Words an...
 
EmoGraph for Age and Gender Identification
EmoGraph for Age and Gender IdentificationEmoGraph for Age and Gender Identification
EmoGraph for Age and Gender Identification
 
My Phd Student T-Shirt
My Phd Student T-ShirtMy Phd Student T-Shirt
My Phd Student T-Shirt
 
Kico's Stairway to Phd
Kico's Stairway to PhdKico's Stairway to Phd
Kico's Stairway to Phd
 
Native Language Identification - Brief review to the state of the art
Native Language Identification - Brief review to the state of the artNative Language Identification - Brief review to the state of the art
Native Language Identification - Brief review to the state of the art
 
Social Business Intelligence - Inteligencia Social de Negocio
Social Business Intelligence - Inteligencia Social de NegocioSocial Business Intelligence - Inteligencia Social de Negocio
Social Business Intelligence - Inteligencia Social de Negocio
 
Dualidad onda-partícula del científico de datos en la empresa
Dualidad onda-partícula del científico de datos en la empresaDualidad onda-partícula del científico de datos en la empresa
Dualidad onda-partícula del científico de datos en la empresa
 

Último

Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAndikSusilo4
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 

Último (20)

Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 

Author Profiling. PAN@CLEF-2013 Task

  • 1. Author Profiling PAN-AP-2013 - CLEF 2013 Valencia, 24th September 2013 Francisco Rangel Autoritas / Universitat Politècnica de València Paolo Rosso Universitat Politècnica de València Moshe Koppel Bar-Illan University Efstathios Stamatatos University of the Aegean Giacomo Inches University of Lugano
  • 3. 3 Why Author Profiling? Forensics Security Marketing Language as evidence Profile possible delinquents Segmenting users
  • 4. 4 Task Goals ‣Given a collection of documents retrieved from Social Media in English and Spanish... MAIN GOAL Identify age and gender SECONDARY GOALS Test the robustness of the approaches for identifying age and gender of predators Measure the computational time needed to perform the task
  • 5. 5 Related Work AUTHOR COLLECTION FEATURES RESULTS OTHER CHARACTERISTICS Argamon et al., 2002 British National Corpus Part-of-speech Gender: 80% accuracy Holmes & Meyerhoff, 2003 Formal texts - Age and gender Burger & Henderson, 2006 Blogs Posts length, capital letters, punctuations. HTML features. They only reported: “Low percentage errors” Two age classes: [0,18[,[18,-] Koppel et al., 2003 Blogs Simple lexical and syntactic functions Gender: 80% accuracy Self-labeling Schler et al., 2006 Blogs Stylistic features + content words with the highest information gain Gender: 80% accuracy Age: 75% accuracy Goswami et al., 2009 Blogs Slang + sentence length Gender: 89.18 accuracy Age: 80.32 accuracy Zhang & Zhang, 2010 Segments of blog Words, punctuation, average words/sentence length, POS, word factor analysis Gender: 72,10 accuracy Nguyen et al., 2011 y 2013 Blogs & Twitter Unigrams, POS, LIWC Correlation: 0.74 Mean absolute error: 4.1 - 6.8 years Manual labeling Age as continuous variable Peersman et al., 2011 Netlog Unigrams, bigrams, trigrams and tetagrams Gender+Age: 88.8 accuracy Self-labeling, min 16 plus 16,18,25
  • 6. 6 Data Collection - Social Media ‣Big Data? ‣High variety of themes ‣Sexual conversations vs. sexual predators ‣Difficulty to obtain good label data ‣Real people vs. Robots (chatbots) ‣Multilingual: English + Spanish
  • 7. 7 Data Collection - English Distribution MIN MAX AVG STD 0 22,736 335 208 Numberofdocuments Number of words
  • 8. 8 Data Collection - English Distribution (zoomed) ‣ If we zoom the distribution, we can observe a gaussian like distribution, with its maximum on the value 415. 335 495 415 Numberofdocuments Number of words
  • 9. 9 Data Collection - English Distribution (log-log) ‣ The log-log representation shows how the distribution has a long tail component, specifically in two cases, one before the point of maximum frequency and another one after this. ‣ We could use this property to select minimum and maximum number of words that the posts must have. Numberofdocuments(log) Number of words (log)
  • 10. 10 Data Collection - Spanish Distribution Numberofdocuments Number of words MIN MAX AVG STD 0 12,246 176 832
  • 11. 11 Data Collection - Spanish Distribution (zoomed) 500 15 Numberofdocuments Number of words
  • 12. 12 Data Collection - Spanish Distribution (log-log) Numberofdocuments(log) Number of words (log)
  • 13. 13 Data Collection - Selection Criteria ‣ Grouping posts by author ‣ Keeping authors with few post ‣ Chunking authors with more than 1,000 words ‣ Balanced by gender ‣ A g e g r o u p s ( n o n - balanced): ‣ 10s (13-17) ‣ 20s (23-27) ‣ 30s (33-47) ‣ Random split in three datasets ‣ Training ‣ Early Bird (10%) ‣ Testing (+20%) ‣ Introduction of few special cases ‣ Predators (0.0012%) ‣ Adult-adult sexual conversations
  • 14. 14 Data Collection - Statistics LANG AGE GENDERLANG AGE GENDERLANG AGE GENDER NUMBER OF AUTHORSNUMBER OF AUTHORSNUMBER OF AUTHORS LANG AGE GENDERLANG AGE GENDERLANG AGE GENDER TRAINING EARLY BIRDS TEST EN 10s MALE 8,600 740 888 EN 10s FEMALE 8,600 740 888 EN 20s MALE (72) 42,828 3,840 (32) 4,576 EN 20s FEMALE (25) 42,875 3,840 (10) 4,576 EN 30s MALE (92) 66,708 6,020 (40) 7,184 EN 30s FEMALE 66,800 6,020 7,224 236,600 21,200 25,440 ES 10s MALE 1,250 120 144 ES 10s FEMALE 1,250 120 144 ES 20s MALE 21,300 1,920 2,304 ES 20s FEMALE 21,300 1,920 2,304 ES 30s MALE 15,400 1,360 1,632 ES 30s FEMALE 15,400 1,360 1,632 75,900 6,800 8,160 Predators Adult-adult sexual conversations
  • 15. 15 Performance measures for identification Accuracy for Gender Accuracy for Age Accuracy for Gender Accuracy for Age ENGLISH SPANISH Joint Accuracy Joint Accuracy Average Accuracy WINNER OF THE TASK
  • 16. 16 Other performance measures Number of correctly identified gender and age for predators Total time needed to process the test data Number of correctly identified gender and age for sexual conversations between adults
  • 17. 17 Participants ‣ 66 registered teams ‣ 21 participants (32%) ‣ 16 countries ‣ 18 papers (86%) ‣ 8 long papers ‣ 10 short papers
  • 18. 18 Approaches Preprocessing Features Methods ... did the teams perform? ‣What kind of ...
  • 19. 19 Approaches HTML Cleaning to obtain plain text 5 teams: [gopal-patra][moreau][meina] [weren][pavan] Deletion of documents with at least 0.1% of spam words 1 team: [flekova] Principal Component Analysis to reduce dimensionality 1 team: [yong-lim] Subset selection during training to reduce dimensionality 5 teams: [caurcel-diaz][flekova][moreau] [hernandez-farias][sapkota] Discrimination between human-like posts and spam-like posts (chatbots) 1 team: [meina] Preprocessing
  • 20. 20 Approaches Stylistic features: frequencies of punctuation marks, capital letters, quotations... 9 teams: [yong-lim][cruz][pavan][gopal- patra][de-arteaga][meina][flekova] [aleman][santosh] + POS tags 5 teams: [yong lim][meina][aleman][cruz] [santosh] HTML-based features like image urls or links 3 teams: [santosh][sapkota][meina] Readability 7 teams: [gopal-patra][yong-lim][meina] [flekova][aleman][weren][gillam] Emoticons 2 teams: [aleman][hernandez-farias] *[sapkota] explicitly discarded them Features
  • 21. 21 Approaches Content features: LSA, BoW,TF-IDF, dictionary-based words, topic-based words, entropy-based words... 11 teams: [sapkota][gopal-patra][yong- lim][seifeddine][caurcel-diaz][flekova] [meina][cruz][santosh][pavan] [hernandez-farias] Named entities 1 team: [flekova] Sentiment words 1 team: [gopal-patra] Emotions words 1 team: [meina] Slang, contractions and words with character flooding 4 teams: [flekova][caurcel-diaz][aleman] [hernandez-farias] Features
  • 22. 22 Approaches Text to be identified is used as a query for a search engine 1 team: [weren] Unsupervised features based on statistics 1 team: [de-arteaga] Language models (n-grams) 4 teams: [meina][jankowska][moreau] [sapkota] Collocations 1 team: [meina] Second order representation based on relationships between documents and profiles 1 team: [pastor] Features
  • 23. 23 Approaches Decision Trees 5 teams: [santosh][gopal-patra] [seifeddine][gillam][weren] SupportVector Machines 3 teams: [yong-lim][cruz][sapkota] Logistic Regression 2 teams: [de-arteaga][flekova] Naïve Bayes 1 team: [meina] Maximum Entropy 1 team: [pavan] Stochastic Gradient Descent 1 team: [caurcel-diaz] Random Forest 1 team: [aleman] Information Retrieval 1 team: [weren] Methods
  • 24. 24 Early birds results ‣5 teams participated, 1 team had technical problems ‣Figures for gender are very close to baseline ‣Main goal -> All participants improved in the final evaluation, mainly Aleman
  • 25. 25 Final results ‣ 21 teams for English, 20 teams for Spanish ‣ Values similar to Early Birds ‣ Gender close to baseline ‣ Joint identification more difficult
  • 26. 26 Results for English Meina Pastor L. Seifeddine Santosh Yong Lim Ladra Aleman Gillam Kern Cruz Pavan Caurcel Diaz H. Farias Jankowska Flekova Weren Sapkota De-Arteaga Moreau BASELINE Gopal Patra Cagnina 0 0,1 0,2 0,3 0,4 0,5 0,6 0,7 0,8 0,9 1 All results below 40% BASELINE: 16% 2 teams below baseline Joint Identification
  • 27. 27 Results for English Meina Pastor L. Seifeddine Santosh Yong Lim Ladra Aleman Gillam Kern Cruz Pavan Caurcel Diaz H. Farias Jankowska Flekova Weren Sapkota De-Arteaga Moreau BASELINE Gopal Patra Cagnina 0 0,1 0,2 0,3 0,4 0,5 0,6 0,7 0,8 0,9 1 0.4781 0.5921 All results below 60% 4 teams below 1% better than baseline BASELINE: 50% 3 teams below baseline Gender Identification
  • 28. 28 Results for English Meina Pastor L. Seifeddine Santosh Yong Lim Ladra Aleman Gillam Kern Cruz Pavan Caurcel Diaz H. Farias Jankowska Flekova Weren Sapkota De-Arteaga Moreau BASELINE Gopal Patra Cagnina 0 0,1 0,2 0,3 0,4 0,5 0,6 0,7 0,8 0,9 1 0.1234 0.6572 All results below 70% 7 teams over 60% 9 teams between 50-60% 3 teams below 50% BASELINE: 33% 2 teams below baseline Age Identification
  • 29. 29 Results for English Meina Pastor L. Seifeddine Santosh Yong Lim Ladra Aleman Gillam Kern Cruz Pavan Caurcel Diaz H. Farias Jankowska Flekova Weren Sapkota De-Arteaga Moreau BASELINE Gopal Patra Cagnina 0 0,1 0,2 0,3 0,4 0,5 0,6 0,7 0,8 0,9 1 0.0741 0.1234 0.4781 0.6572 0.5921 0.3894 Joint identification Gender Age The best teams performed better in both identifications
  • 30. 30 Results for Spanish Santosh Pastor L. Cruz Flekova Ladra De-Arteaga Kern Yong Lim Sapkota Pavan Jankowska Meina Gillam Moreau Weren Cagnina Caurcel Diaz H. Farias BASELINE Aleman Seifeddine Gopal Patra 0 0,1 0,2 0,3 0,4 0,5 0,6 0,7 0,8 0,9 1 All results below 45% BASELINE: 16% 2 teams below baseline Joint Identification
  • 31. 31 Results for Spanish Santosh Pastor L. Cruz Flekova Ladra De-Arteaga Kern Yong Lim Sapkota Pavan Jankowska Meina Gillam Moreau Weren Cagnina Caurcel Diaz H. Farias BASELINE Aleman Seifeddine Gopal Patra 0 0,1 0,2 0,3 0,4 0,5 0,6 0,7 0,8 0,9 1 0.4784 0.6473 All results below 65% 2 teams equal to baseline BASELINE: 50% 3 teams below baseline Gender Identification
  • 32. 32 Results for Spanish Santosh Pastor L. Cruz Flekova Ladra De-Arteaga Kern Yong Lim Sapkota Pavan Jankowska Meina Gillam Moreau Weren Cagnina Caurcel Diaz H. Farias BASELINE Aleman Seifeddine Gopal Patra 0 0,1 0,2 0,3 0,4 0,5 0,6 0,7 0,8 0,9 1 0.0512 0.6558 All results below 70% 3 teams over 60% 9 teams between 50-60% 6 teams below 50% BASELINE: 33% 2 teams below baseline Age Identification
  • 33. 33 Results for Spanish Santosh Pastor L. Cruz Flekova Ladra De-Arteaga Kern Yong Lim Sapkota Pavan Jankowska Meina Gillam Moreau Weren Cagnina Caurcel Diaz H. Farias BASELINE Aleman Seifeddine Gopal Patra 0 0,1 0,2 0,3 0,4 0,5 0,6 0,7 0,8 0,9 10.0287 0.0512 0.4784 0.6558 0.64730.4208 Total Gender Age The best teams performed better in both identifications
  • 34. 34 Results per language Pastor Santosh Cruz Ladra Yong Lim Flekova Meina Kern Gillam Pavan De-Arteaga Jankowska Sapkota Weren Moreau Aleman Caurcel Diaz H. Farias Seifeddine Cagnina Gopal Patra 0 0,1 0,2 0,3 0,4 0,5 0,6 0,7 0,8 0,9 1 English Spanish For 10 team English is better For 10 team Spanish is better 1 team only participated in English
  • 35. 35 Sexual conversations 0 0,1 0,2 0,3 0,4 0,5 0,6 0,7 0,8 0,9 1 0 0,25 0,5 0,75 1 Adult-Adult Predators Age Gender ‣ Age identification similar to previous values ‣ Gender identification performed better in case of predators
  • 36. 36 Runtime Gillam Ladra Pastor L. Caurcel Diaz Pavan De Arteaga Cruz Weren Jankowska Santosh Kern Flekova Gopal Patra Aleman H. Farias Sapkota Meina Moureau Yong Lim Cagnina Seifeddine 0 275000000 550000000 825000000 1100000000 10.26 minutes 28.83 minutes 38.31 minutes 54.03 minutes 1.04 hours 1.09 hours 2.66 hours 3.25 hours 4.66 hours 4.86 hours 5.08 hours 5.13 hours 6.37 hours 6.56 hours 6.82 hours 17.88 hours 4.44 days 5.19 days 6.68 days 9.90 days 11.78 days ‣Big Data problem? 1........... 2............. 3.................. 21... 20.......... 19...... 9.............. 4................ 17.. 10.............. 11..... 14............ 12...... 8.................. 6............. 16........... 18......... 13.......... 7................ 15...... 5...........
  • 37. 37 Conclusions Also... ‣ We received many different and enriching approaches ‣ We were one of the task with the higher number of participants at CLEF (21) ‣ Interest from many teams (66 registered) but the task was new and many (5) did not make it (potentially more participation next year!) ‣Very difficult task, mainly for gender identification ‣ Difficult to identify together age and gender For predators... ‣ Robust identifying age ‣ Better identifying gender ‣ Expensive in Time consuming -> Big Data problem?
  • 38. Francisco Rangel Autoritas Consulting / Universitat Politècnica deValència Paolo Rosso Universitat Politècnica deValència Moshe Koppel Bar-Ilan University Efstathios Stamatatos University of the Aegean Giacomo Inches University of Lugano On behalf of the AP task organisers: Thank you very much for participating! We hope to see you again next year!