SlideShare a Scribd company logo
1 of 11
Download to read offline
Overview of PAN’16
New challenges for Authorship Analysis:
Cross-genre profiling, Clustering, Diarization,
and Obfuscation
PAN-AP-2016 CLEF 2016
Évora, 5-8 September
Paolo Rosso: Universitat Politècnica de Valencia
Francisco Rangel: Autoritas Consulting
Martin Potthast: Bauhaus - Universität Weimar
Efstathios Stamatatos: University of the Aegean
Michael Tschuggnall: University of Innsbruck
Benno Stein: Bauhaus-Universität Weimar
Introduction
Uncovering Plagiarism, Authorship, and Social
Software Misuse (PAN) is a forum for the digital text
forensics, where researchers and practitioners study
technologies that analyze texts with regard to
originality, authorship, and trustworthiness.
PAN focuses on the evaluation of selected tasks
from digital text forensics in order to develop
large-scale, standardized benchmarks, and to
assess the state-of-the-art techniques.
2
PAN’16
Evolution
3
PAN’16
158
PAN’16 focus
4
PAN’16
We have focused on focused on authorship tasks from the fields of (i) author
identification, (ii) author profiling, and (iii) author obfuscation evaluation (total 35
teams):
i. Author clustering / diarization: Author clustering is the task where given a
document collection the participant is asked to group documents written by the
same author so that each cluster corresponds to a different author. Author
diarization extends the previous tasks on intrinsic plagiarism detection.
ii. Age / gender identification: Since 2013, the main focus is in age and gender
identification. The goal of this year is the cross-genre evaluation.
iii. Author masking / obfuscation evaluation: Author masking and author
obfuscation evaluation aim respectively at perturbing an author’s style in a
given text to render it dissimilar to other texts from the same author, and at
adjusting a given text’s style so as to render it similar to that of a given author.
Author identification (clustering)
5
PAN’16
Two scenarios:
- Complete author clustering: Detailed analysis on:
- the number of different authors (k) found in the collection should be
identified.
- each document should be assigned to exactly one of the k authors.
- Authorship-link ranking: Viewed as a retrieval task, whose objective is to
establish authorship links between documents and provides a list of
document pairs ranked according to a confidence score (the score shows
how likely it is the document pair to be by the same author).
Corpora:
- Languages: English, Dutch and Greek.
- Genres: Articles and reviews.
Author identification (diarization)
6
PAN’16
Three subtasks:
- Traditional intrinsic plagiarism detection: Assuming a major author (70%
of a document) to find the remaining text portions written by other/s.
- Diarization with a given number of authors: Given a document composed
by a known number of authors, to group individual text fragments by
authors.
- Unrestricted diarization: The number of collaborating authors is not
given, so also the correct number of clusters, i.e., writers, has to be found.
Corpora:
- Webis-TRC-12 dataset, with 150 topics from TREC Web Tracks from
2009-2011
- Each subtask has variations of the dataset: number and proportions of
authors in a document, the decision, uniformly distributed...
Subtasks:
- Age and gender identification.
- Joint identification of age and gender for the same author
- The aim is at the cross-genre evaluation.
Corpora:
- Languages: English, Spanish, Dutch
- Genres: Twitter for training. Reviews, social media and blogs for evaluating.
Author profiling (age and gender identification)
7
PAN’16
Subtasks:
- Authorship verification: Given two documents, decide whether they have
been written by the same author.
- Author masking: Given two documents by the same author, paraphrase
the designated one so that the author cannot be verified anymore.
Corpora:
- Joint training and joint test datasets from the author verification tasks of
PAN 2013 to 2015.
Author obfuscation
8
PAN’16
Conclusions
9
PAN’16
- The author obfuscation shared task allowed to shed light on the
robustness of state-of-the-art author identification and author profiling
techniques against author obfuscation technology.
- New corpora have been developed in multiple languages: English,
Spanish, Dutch.
- PAN/FIRE:
- A shared task on plagiarism detection on texts written in Farsi.
- A shared task on author profiling on personality recognition in
source code.
See you on Tuesday and Wednesday
10
PAN’16
Rui Sousa-Silva
Universidade do Porto
Tuesday 6th Sept. 13:30 - 15:30
Wednesday 7th Sept. 13:30 - 15:30
16:15 - 18:15
Sponsors
11
PAN’16
On behalf of the PAN lab organisers:
Thank you very much for participating
and hope to see you next year!!

More Related Content

Similar to Overview of PAN'16 - New challenges for Authorship Analysis: Cross-genre profiling, Clustering, Diarization, and Obfuscation

LCC CTS 2 Option.docx
LCC CTS 2 Option.docxLCC CTS 2 Option.docx
LCC CTS 2 Option.docxwrite4
 
Between  information  retrieval  services  and bibliometrics  research. New  ...
Between  information  retrieval  services  and bibliometrics  research. New  ...Between  information  retrieval  services  and bibliometrics  research. New  ...
Between  information  retrieval  services  and bibliometrics  research. New  ...Andrea Scharnhorst
 
Ethical and Unethical Methods of Plagiarism Prevention in Academic Writing
Ethical and Unethical Methods of Plagiarism Prevention in Academic WritingEthical and Unethical Methods of Plagiarism Prevention in Academic Writing
Ethical and Unethical Methods of Plagiarism Prevention in Academic WritingNader Ale Ebrahim
 
Comparing Three Plagiarism Tools (Ferret, Sherlock, and Turnitin)
Comparing Three Plagiarism Tools (Ferret, Sherlock, and Turnitin)Comparing Three Plagiarism Tools (Ferret, Sherlock, and Turnitin)
Comparing Three Plagiarism Tools (Ferret, Sherlock, and Turnitin)Waqas Tariq
 
MacroMicroZoom.pdf
MacroMicroZoom.pdfMacroMicroZoom.pdf
MacroMicroZoom.pdfMartin Wynne
 
Da Biblissima a Biblissima+ : per un osservatorio delle culture scritte
Da Biblissima a Biblissima+ : per un osservatorio delle culture scritteDa Biblissima a Biblissima+ : per un osservatorio delle culture scritte
Da Biblissima a Biblissima+ : per un osservatorio delle culture scritteEquipex Biblissima
 
Corpus linguistics
Corpus linguisticsCorpus linguistics
Corpus linguisticsIrum Malik
 
Application Of Linguistic Cues In The Analysis Of Language Of Hate Groups
Application Of Linguistic Cues In The Analysis Of Language Of Hate GroupsApplication Of Linguistic Cues In The Analysis Of Language Of Hate Groups
Application Of Linguistic Cues In The Analysis Of Language Of Hate GroupsLeonard Goudy
 
Hum2220 fa2015 research project packet
Hum2220 fa2015 research project packetHum2220 fa2015 research project packet
Hum2220 fa2015 research project packetProfWillAdams
 
Analysing literature through the lens of information theory and network science
Analysing literature through the lens of information theory and network scienceAnalysing literature through the lens of information theory and network science
Analysing literature through the lens of information theory and network scienceMarkus Luczak-Rösch
 
Judicial Review Chapters 4 and 5 in the text discuss.docx
Judicial Review Chapters 4 and 5 in the text discuss.docxJudicial Review Chapters 4 and 5 in the text discuss.docx
Judicial Review Chapters 4 and 5 in the text discuss.docxwrite22
 
Annotating Musical Theatre Plots On Narrative Structure And Emotional Content
Annotating Musical Theatre Plots On Narrative Structure And Emotional ContentAnnotating Musical Theatre Plots On Narrative Structure And Emotional Content
Annotating Musical Theatre Plots On Narrative Structure And Emotional ContentHeather Strinden
 
Forensic linguistics with Apache Spark
Forensic linguistics with Apache SparkForensic linguistics with Apache Spark
Forensic linguistics with Apache SparkSheamus McGovern
 
The difference queer fanfiction makes: Lessons for the publishing industry - ...
The difference queer fanfiction makes: Lessons for the publishing industry - ...The difference queer fanfiction makes: Lessons for the publishing industry - ...
The difference queer fanfiction makes: Lessons for the publishing industry - ...BookNet Canada
 
The points of rhetorical analysis that you mention early in your pap.docx
The points of rhetorical analysis that you mention early in your pap.docxThe points of rhetorical analysis that you mention early in your pap.docx
The points of rhetorical analysis that you mention early in your pap.docxgabrielaj9
 
Distributing Text Mining tasks with librAIry
Distributing Text Mining tasks with librAIryDistributing Text Mining tasks with librAIry
Distributing Text Mining tasks with librAIryCarlos Badenes-Olmedo
 
Guzik ARTE 692
Guzik ARTE 692Guzik ARTE 692
Guzik ARTE 692Kyle Guzik
 
Lexicography and Lexicology from a Pan-European Perspective: COST ENeL Workin...
Lexicography and Lexicology from a Pan-European Perspective: COST ENeL Workin...Lexicography and Lexicology from a Pan-European Perspective: COST ENeL Workin...
Lexicography and Lexicology from a Pan-European Perspective: COST ENeL Workin...eveline wandl-vogt
 

Similar to Overview of PAN'16 - New challenges for Authorship Analysis: Cross-genre profiling, Clustering, Diarization, and Obfuscation (20)

LCC CTS 2 Option.docx
LCC CTS 2 Option.docxLCC CTS 2 Option.docx
LCC CTS 2 Option.docx
 
Between  information  retrieval  services  and bibliometrics  research. New  ...
Between  information  retrieval  services  and bibliometrics  research. New  ...Between  information  retrieval  services  and bibliometrics  research. New  ...
Between  information  retrieval  services  and bibliometrics  research. New  ...
 
Ethical and Unethical Methods of Plagiarism Prevention in Academic Writing
Ethical and Unethical Methods of Plagiarism Prevention in Academic WritingEthical and Unethical Methods of Plagiarism Prevention in Academic Writing
Ethical and Unethical Methods of Plagiarism Prevention in Academic Writing
 
Comparing Three Plagiarism Tools (Ferret, Sherlock, and Turnitin)
Comparing Three Plagiarism Tools (Ferret, Sherlock, and Turnitin)Comparing Three Plagiarism Tools (Ferret, Sherlock, and Turnitin)
Comparing Three Plagiarism Tools (Ferret, Sherlock, and Turnitin)
 
Introduction to Nvivo
Introduction to NvivoIntroduction to Nvivo
Introduction to Nvivo
 
MacroMicroZoom.pdf
MacroMicroZoom.pdfMacroMicroZoom.pdf
MacroMicroZoom.pdf
 
Da Biblissima a Biblissima+ : per un osservatorio delle culture scritte
Da Biblissima a Biblissima+ : per un osservatorio delle culture scritteDa Biblissima a Biblissima+ : per un osservatorio delle culture scritte
Da Biblissima a Biblissima+ : per un osservatorio delle culture scritte
 
Corpus linguistics
Corpus linguisticsCorpus linguistics
Corpus linguistics
 
Application Of Linguistic Cues In The Analysis Of Language Of Hate Groups
Application Of Linguistic Cues In The Analysis Of Language Of Hate GroupsApplication Of Linguistic Cues In The Analysis Of Language Of Hate Groups
Application Of Linguistic Cues In The Analysis Of Language Of Hate Groups
 
Hum2220 fa2015 research project packet
Hum2220 fa2015 research project packetHum2220 fa2015 research project packet
Hum2220 fa2015 research project packet
 
Analysing literature through the lens of information theory and network science
Analysing literature through the lens of information theory and network scienceAnalysing literature through the lens of information theory and network science
Analysing literature through the lens of information theory and network science
 
Judicial Review Chapters 4 and 5 in the text discuss.docx
Judicial Review Chapters 4 and 5 in the text discuss.docxJudicial Review Chapters 4 and 5 in the text discuss.docx
Judicial Review Chapters 4 and 5 in the text discuss.docx
 
Annotating Musical Theatre Plots On Narrative Structure And Emotional Content
Annotating Musical Theatre Plots On Narrative Structure And Emotional ContentAnnotating Musical Theatre Plots On Narrative Structure And Emotional Content
Annotating Musical Theatre Plots On Narrative Structure And Emotional Content
 
Analyzing Nontextual Content Features to Detect Academic Plagiarism
Analyzing Nontextual Content Features to Detect Academic PlagiarismAnalyzing Nontextual Content Features to Detect Academic Plagiarism
Analyzing Nontextual Content Features to Detect Academic Plagiarism
 
Forensic linguistics with Apache Spark
Forensic linguistics with Apache SparkForensic linguistics with Apache Spark
Forensic linguistics with Apache Spark
 
The difference queer fanfiction makes: Lessons for the publishing industry - ...
The difference queer fanfiction makes: Lessons for the publishing industry - ...The difference queer fanfiction makes: Lessons for the publishing industry - ...
The difference queer fanfiction makes: Lessons for the publishing industry - ...
 
The points of rhetorical analysis that you mention early in your pap.docx
The points of rhetorical analysis that you mention early in your pap.docxThe points of rhetorical analysis that you mention early in your pap.docx
The points of rhetorical analysis that you mention early in your pap.docx
 
Distributing Text Mining tasks with librAIry
Distributing Text Mining tasks with librAIryDistributing Text Mining tasks with librAIry
Distributing Text Mining tasks with librAIry
 
Guzik ARTE 692
Guzik ARTE 692Guzik ARTE 692
Guzik ARTE 692
 
Lexicography and Lexicology from a Pan-European Perspective: COST ENeL Workin...
Lexicography and Lexicology from a Pan-European Perspective: COST ENeL Workin...Lexicography and Lexicology from a Pan-European Perspective: COST ENeL Workin...
Lexicography and Lexicology from a Pan-European Perspective: COST ENeL Workin...
 

More from Francisco Manuel Rangel Pardo

Profiling Irony and Stereotype Spreaders on Twitter (IROSTEREO)
Profiling Irony and Stereotype Spreaders on Twitter (IROSTEREO)Profiling Irony and Stereotype Spreaders on Twitter (IROSTEREO)
Profiling Irony and Stereotype Spreaders on Twitter (IROSTEREO)Francisco Manuel Rangel Pardo
 
Overview of the 9th Author Profiling task at PAN: Profiling Hate Speech Sprea...
Overview of the 9th Author Profiling task at PAN: Profiling Hate Speech Sprea...Overview of the 9th Author Profiling task at PAN: Profiling Hate Speech Sprea...
Overview of the 9th Author Profiling task at PAN: Profiling Hate Speech Sprea...Francisco Manuel Rangel Pardo
 
Overview of the 8th Author Profiling task at PAN: Profiling Fake News Spreade...
Overview of the 8th Author Profiling task at PAN: Profiling Fake News Spreade...Overview of the 8th Author Profiling task at PAN: Profiling Fake News Spreade...
Overview of the 8th Author Profiling task at PAN: Profiling Fake News Spreade...Francisco Manuel Rangel Pardo
 
Overview of the 7th Author Profiling task at PAN: Bots and Gender Profiling ...
Overview of the 7th Author Profiling task at PAN: Bots and Gender Profiling  ...Overview of the 7th Author Profiling task at PAN: Bots and Gender Profiling  ...
Overview of the 7th Author Profiling task at PAN: Bots and Gender Profiling ...Francisco Manuel Rangel Pardo
 
AL4Trust - Artificial Intelligence for Building Trust 2019
AL4Trust - Artificial Intelligence for Building Trust 2019AL4Trust - Artificial Intelligence for Building Trust 2019
AL4Trust - Artificial Intelligence for Building Trust 2019Francisco Manuel Rangel Pardo
 
Author Profiling en Social Media. En la Academia... y en la Industria.
Author Profiling en Social Media. En la Academia... y en la Industria.Author Profiling en Social Media. En la Academia... y en la Industria.
Author Profiling en Social Media. En la Academia... y en la Industria.Francisco Manuel Rangel Pardo
 
Multimodal Stance Detection in Tweets on Catalan #1Oct Referendum @Ibereval 2...
Multimodal Stance Detection in Tweets on Catalan #1Oct Referendum @Ibereval 2...Multimodal Stance Detection in Tweets on Catalan #1Oct Referendum @Ibereval 2...
Multimodal Stance Detection in Tweets on Catalan #1Oct Referendum @Ibereval 2...Francisco Manuel Rangel Pardo
 
Overview of the 6th Author Profiling task at PAN: Multimodal Gender Identific...
Overview of the 6th Author Profiling task at PAN: Multimodal Gender Identific...Overview of the 6th Author Profiling task at PAN: Multimodal Gender Identific...
Overview of the 6th Author Profiling task at PAN: Multimodal Gender Identific...Francisco Manuel Rangel Pardo
 
RusProfiling Gender Identification in Russian Texts PAN@FIRE
RusProfiling Gender Identification in Russian Texts PAN@FIRERusProfiling Gender Identification in Russian Texts PAN@FIRE
RusProfiling Gender Identification in Russian Texts PAN@FIREFrancisco Manuel Rangel Pardo
 
Stance and Gender Detection in Tweets on Catalan Independence. Ibereval@SEPLN...
Stance and Gender Detection in Tweets on Catalan Independence. Ibereval@SEPLN...Stance and Gender Detection in Tweets on Catalan Independence. Ibereval@SEPLN...
Stance and Gender Detection in Tweets on Catalan Independence. Ibereval@SEPLN...Francisco Manuel Rangel Pardo
 
Gender and Language Variety Identification in Twitter. Overview of the 5th. A...
Gender and Language Variety Identification in Twitter. Overview of the 5th. A...Gender and Language Variety Identification in Twitter. Overview of the 5th. A...
Gender and Language Variety Identification in Twitter. Overview of the 5th. A...Francisco Manuel Rangel Pardo
 
Overview of the 4th. Author Profiling task at PAN-CLEF 2016
Overview of the 4th. Author Profiling task at PAN-CLEF 2016Overview of the 4th. Author Profiling task at PAN-CLEF 2016
Overview of the 4th. Author Profiling task at PAN-CLEF 2016Francisco Manuel Rangel Pardo
 
AL4Trust - Artificial Intelligence for Building Trust
AL4Trust - Artificial Intelligence for Building TrustAL4Trust - Artificial Intelligence for Building Trust
AL4Trust - Artificial Intelligence for Building TrustFrancisco Manuel Rangel Pardo
 
El Futuro de las Comunicaciones Personales a Través de los Dispositivos Móvil...
El Futuro de las Comunicaciones Personales a Través de los Dispositivos Móvil...El Futuro de las Comunicaciones Personales a Través de los Dispositivos Móvil...
El Futuro de las Comunicaciones Personales a Través de los Dispositivos Móvil...Francisco Manuel Rangel Pardo
 
A Low Dimensionality Representation for Language Variety Identification (CICL...
A Low Dimensionality Representation for Language Variety Identification (CICL...A Low Dimensionality Representation for Language Variety Identification (CICL...
A Low Dimensionality Representation for Language Variety Identification (CICL...Francisco Manuel Rangel Pardo
 
Language Variety Identification using Distributed Representations of Words an...
Language Variety Identification using Distributed Representations of Words an...Language Variety Identification using Distributed Representations of Words an...
Language Variety Identification using Distributed Representations of Words an...Francisco Manuel Rangel Pardo
 

More from Francisco Manuel Rangel Pardo (20)

Profiling Irony and Stereotype Spreaders on Twitter (IROSTEREO)
Profiling Irony and Stereotype Spreaders on Twitter (IROSTEREO)Profiling Irony and Stereotype Spreaders on Twitter (IROSTEREO)
Profiling Irony and Stereotype Spreaders on Twitter (IROSTEREO)
 
Overview of the 9th Author Profiling task at PAN: Profiling Hate Speech Sprea...
Overview of the 9th Author Profiling task at PAN: Profiling Hate Speech Sprea...Overview of the 9th Author Profiling task at PAN: Profiling Hate Speech Sprea...
Overview of the 9th Author Profiling task at PAN: Profiling Hate Speech Sprea...
 
Overview of the 8th Author Profiling task at PAN: Profiling Fake News Spreade...
Overview of the 8th Author Profiling task at PAN: Profiling Fake News Spreade...Overview of the 8th Author Profiling task at PAN: Profiling Fake News Spreade...
Overview of the 8th Author Profiling task at PAN: Profiling Fake News Spreade...
 
Overview of the 7th Author Profiling task at PAN: Bots and Gender Profiling ...
Overview of the 7th Author Profiling task at PAN: Bots and Gender Profiling  ...Overview of the 7th Author Profiling task at PAN: Bots and Gender Profiling  ...
Overview of the 7th Author Profiling task at PAN: Bots and Gender Profiling ...
 
AL4Trust - Artificial Intelligence for Building Trust 2019
AL4Trust - Artificial Intelligence for Building Trust 2019AL4Trust - Artificial Intelligence for Building Trust 2019
AL4Trust - Artificial Intelligence for Building Trust 2019
 
Author Profiling en Social Media. En la Academia... y en la Industria.
Author Profiling en Social Media. En la Academia... y en la Industria.Author Profiling en Social Media. En la Academia... y en la Industria.
Author Profiling en Social Media. En la Academia... y en la Industria.
 
Multimodal Stance Detection in Tweets on Catalan #1Oct Referendum @Ibereval 2...
Multimodal Stance Detection in Tweets on Catalan #1Oct Referendum @Ibereval 2...Multimodal Stance Detection in Tweets on Catalan #1Oct Referendum @Ibereval 2...
Multimodal Stance Detection in Tweets on Catalan #1Oct Referendum @Ibereval 2...
 
Overview of the 6th Author Profiling task at PAN: Multimodal Gender Identific...
Overview of the 6th Author Profiling task at PAN: Multimodal Gender Identific...Overview of the 6th Author Profiling task at PAN: Multimodal Gender Identific...
Overview of the 6th Author Profiling task at PAN: Multimodal Gender Identific...
 
RusProfiling Gender Identification in Russian Texts PAN@FIRE
RusProfiling Gender Identification in Russian Texts PAN@FIRERusProfiling Gender Identification in Russian Texts PAN@FIRE
RusProfiling Gender Identification in Russian Texts PAN@FIRE
 
Stance and Gender Detection in Tweets on Catalan Independence. Ibereval@SEPLN...
Stance and Gender Detection in Tweets on Catalan Independence. Ibereval@SEPLN...Stance and Gender Detection in Tweets on Catalan Independence. Ibereval@SEPLN...
Stance and Gender Detection in Tweets on Catalan Independence. Ibereval@SEPLN...
 
Gender and Language Variety Identification in Twitter. Overview of the 5th. A...
Gender and Language Variety Identification in Twitter. Overview of the 5th. A...Gender and Language Variety Identification in Twitter. Overview of the 5th. A...
Gender and Language Variety Identification in Twitter. Overview of the 5th. A...
 
Overview of the 4th. Author Profiling task at PAN-CLEF 2016
Overview of the 4th. Author Profiling task at PAN-CLEF 2016Overview of the 4th. Author Profiling task at PAN-CLEF 2016
Overview of the 4th. Author Profiling task at PAN-CLEF 2016
 
Redes sociales y preadolescentes
Redes sociales y preadolescentesRedes sociales y preadolescentes
Redes sociales y preadolescentes
 
AL4Trust - Artificial Intelligence for Building Trust
AL4Trust - Artificial Intelligence for Building TrustAL4Trust - Artificial Intelligence for Building Trust
AL4Trust - Artificial Intelligence for Building Trust
 
El Futuro de las Comunicaciones Personales a Través de los Dispositivos Móvil...
El Futuro de las Comunicaciones Personales a Través de los Dispositivos Móvil...El Futuro de las Comunicaciones Personales a Través de los Dispositivos Móvil...
El Futuro de las Comunicaciones Personales a Través de los Dispositivos Móvil...
 
Smart Listening - MUIinf
Smart Listening - MUIinfSmart Listening - MUIinf
Smart Listening - MUIinf
 
IA + Big Data = problema + oportunidad
IA + Big Data = problema + oportunidadIA + Big Data = problema + oportunidad
IA + Big Data = problema + oportunidad
 
A Low Dimensionality Representation for Language Variety Identification (CICL...
A Low Dimensionality Representation for Language Variety Identification (CICL...A Low Dimensionality Representation for Language Variety Identification (CICL...
A Low Dimensionality Representation for Language Variety Identification (CICL...
 
Language Variety Identification using Distributed Representations of Words an...
Language Variety Identification using Distributed Representations of Words an...Language Variety Identification using Distributed Representations of Words an...
Language Variety Identification using Distributed Representations of Words an...
 
Author Profiling task at PAN Lab at CLEF 2015
Author Profiling task at PAN Lab at CLEF 2015Author Profiling task at PAN Lab at CLEF 2015
Author Profiling task at PAN Lab at CLEF 2015
 

Recently uploaded

CI, CD -Tools to integrate without manual intervention
CI, CD -Tools to integrate without manual interventionCI, CD -Tools to integrate without manual intervention
CI, CD -Tools to integrate without manual interventionajayrajaganeshkayala
 
Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...PrithaVashisht1
 
Strategic CX: A Deep Dive into Voice of the Customer Insights for Clarity
Strategic CX: A Deep Dive into Voice of the Customer Insights for ClarityStrategic CX: A Deep Dive into Voice of the Customer Insights for Clarity
Strategic CX: A Deep Dive into Voice of the Customer Insights for ClarityAggregage
 
ChistaDATA Real-Time DATA Analytics Infrastructure
ChistaDATA Real-Time DATA Analytics InfrastructureChistaDATA Real-Time DATA Analytics Infrastructure
ChistaDATA Real-Time DATA Analytics Infrastructuresonikadigital1
 
SFBA Splunk Usergroup meeting March 13, 2024
SFBA Splunk Usergroup meeting March 13, 2024SFBA Splunk Usergroup meeting March 13, 2024
SFBA Splunk Usergroup meeting March 13, 2024Becky Burwell
 
Master's Thesis - Data Science - Presentation
Master's Thesis - Data Science - PresentationMaster's Thesis - Data Science - Presentation
Master's Thesis - Data Science - PresentationGiorgio Carbone
 
How is Real-Time Analytics Different from Traditional OLAP?
How is Real-Time Analytics Different from Traditional OLAP?How is Real-Time Analytics Different from Traditional OLAP?
How is Real-Time Analytics Different from Traditional OLAP?sonikadigital1
 
Virtuosoft SmartSync Product Introduction
Virtuosoft SmartSync Product IntroductionVirtuosoft SmartSync Product Introduction
Virtuosoft SmartSync Product Introductionsanjaymuralee1
 
YourView Panel Book.pptx YourView Panel Book.
YourView Panel Book.pptx YourView Panel Book.YourView Panel Book.pptx YourView Panel Book.
YourView Panel Book.pptx YourView Panel Book.JasonViviers2
 
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptx
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptxTINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptx
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptxDwiAyuSitiHartinah
 
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024Guido X Jansen
 
Mapping the pubmed data under different suptopics using NLP.pptx
Mapping the pubmed data under different suptopics using NLP.pptxMapping the pubmed data under different suptopics using NLP.pptx
Mapping the pubmed data under different suptopics using NLP.pptxVenkatasubramani13
 
MEASURES OF DISPERSION I BSc Botany .ppt
MEASURES OF DISPERSION I BSc Botany .pptMEASURES OF DISPERSION I BSc Botany .ppt
MEASURES OF DISPERSION I BSc Botany .pptaigil2
 
Cash Is Still King: ATM market research '2023
Cash Is Still King: ATM market research '2023Cash Is Still King: ATM market research '2023
Cash Is Still King: ATM market research '2023Vladislav Solodkiy
 
AI for Sustainable Development Goals (SDGs)
AI for Sustainable Development Goals (SDGs)AI for Sustainable Development Goals (SDGs)
AI for Sustainable Development Goals (SDGs)Data & Analytics Magazin
 
The Universal GTM - how we design GTM and dataLayer
The Universal GTM - how we design GTM and dataLayerThe Universal GTM - how we design GTM and dataLayer
The Universal GTM - how we design GTM and dataLayerPavel Šabatka
 
5 Ds to Define Data Archiving Best Practices
5 Ds to Define Data Archiving Best Practices5 Ds to Define Data Archiving Best Practices
5 Ds to Define Data Archiving Best PracticesDataArchiva
 

Recently uploaded (17)

CI, CD -Tools to integrate without manual intervention
CI, CD -Tools to integrate without manual interventionCI, CD -Tools to integrate without manual intervention
CI, CD -Tools to integrate without manual intervention
 
Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...
 
Strategic CX: A Deep Dive into Voice of the Customer Insights for Clarity
Strategic CX: A Deep Dive into Voice of the Customer Insights for ClarityStrategic CX: A Deep Dive into Voice of the Customer Insights for Clarity
Strategic CX: A Deep Dive into Voice of the Customer Insights for Clarity
 
ChistaDATA Real-Time DATA Analytics Infrastructure
ChistaDATA Real-Time DATA Analytics InfrastructureChistaDATA Real-Time DATA Analytics Infrastructure
ChistaDATA Real-Time DATA Analytics Infrastructure
 
SFBA Splunk Usergroup meeting March 13, 2024
SFBA Splunk Usergroup meeting March 13, 2024SFBA Splunk Usergroup meeting March 13, 2024
SFBA Splunk Usergroup meeting March 13, 2024
 
Master's Thesis - Data Science - Presentation
Master's Thesis - Data Science - PresentationMaster's Thesis - Data Science - Presentation
Master's Thesis - Data Science - Presentation
 
How is Real-Time Analytics Different from Traditional OLAP?
How is Real-Time Analytics Different from Traditional OLAP?How is Real-Time Analytics Different from Traditional OLAP?
How is Real-Time Analytics Different from Traditional OLAP?
 
Virtuosoft SmartSync Product Introduction
Virtuosoft SmartSync Product IntroductionVirtuosoft SmartSync Product Introduction
Virtuosoft SmartSync Product Introduction
 
YourView Panel Book.pptx YourView Panel Book.
YourView Panel Book.pptx YourView Panel Book.YourView Panel Book.pptx YourView Panel Book.
YourView Panel Book.pptx YourView Panel Book.
 
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptx
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptxTINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptx
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptx
 
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
 
Mapping the pubmed data under different suptopics using NLP.pptx
Mapping the pubmed data under different suptopics using NLP.pptxMapping the pubmed data under different suptopics using NLP.pptx
Mapping the pubmed data under different suptopics using NLP.pptx
 
MEASURES OF DISPERSION I BSc Botany .ppt
MEASURES OF DISPERSION I BSc Botany .pptMEASURES OF DISPERSION I BSc Botany .ppt
MEASURES OF DISPERSION I BSc Botany .ppt
 
Cash Is Still King: ATM market research '2023
Cash Is Still King: ATM market research '2023Cash Is Still King: ATM market research '2023
Cash Is Still King: ATM market research '2023
 
AI for Sustainable Development Goals (SDGs)
AI for Sustainable Development Goals (SDGs)AI for Sustainable Development Goals (SDGs)
AI for Sustainable Development Goals (SDGs)
 
The Universal GTM - how we design GTM and dataLayer
The Universal GTM - how we design GTM and dataLayerThe Universal GTM - how we design GTM and dataLayer
The Universal GTM - how we design GTM and dataLayer
 
5 Ds to Define Data Archiving Best Practices
5 Ds to Define Data Archiving Best Practices5 Ds to Define Data Archiving Best Practices
5 Ds to Define Data Archiving Best Practices
 

Overview of PAN'16 - New challenges for Authorship Analysis: Cross-genre profiling, Clustering, Diarization, and Obfuscation

  • 1. Overview of PAN’16 New challenges for Authorship Analysis: Cross-genre profiling, Clustering, Diarization, and Obfuscation PAN-AP-2016 CLEF 2016 Évora, 5-8 September Paolo Rosso: Universitat Politècnica de Valencia Francisco Rangel: Autoritas Consulting Martin Potthast: Bauhaus - Universität Weimar Efstathios Stamatatos: University of the Aegean Michael Tschuggnall: University of Innsbruck Benno Stein: Bauhaus-Universität Weimar
  • 2. Introduction Uncovering Plagiarism, Authorship, and Social Software Misuse (PAN) is a forum for the digital text forensics, where researchers and practitioners study technologies that analyze texts with regard to originality, authorship, and trustworthiness. PAN focuses on the evaluation of selected tasks from digital text forensics in order to develop large-scale, standardized benchmarks, and to assess the state-of-the-art techniques. 2 PAN’16
  • 4. PAN’16 focus 4 PAN’16 We have focused on focused on authorship tasks from the fields of (i) author identification, (ii) author profiling, and (iii) author obfuscation evaluation (total 35 teams): i. Author clustering / diarization: Author clustering is the task where given a document collection the participant is asked to group documents written by the same author so that each cluster corresponds to a different author. Author diarization extends the previous tasks on intrinsic plagiarism detection. ii. Age / gender identification: Since 2013, the main focus is in age and gender identification. The goal of this year is the cross-genre evaluation. iii. Author masking / obfuscation evaluation: Author masking and author obfuscation evaluation aim respectively at perturbing an author’s style in a given text to render it dissimilar to other texts from the same author, and at adjusting a given text’s style so as to render it similar to that of a given author.
  • 5. Author identification (clustering) 5 PAN’16 Two scenarios: - Complete author clustering: Detailed analysis on: - the number of different authors (k) found in the collection should be identified. - each document should be assigned to exactly one of the k authors. - Authorship-link ranking: Viewed as a retrieval task, whose objective is to establish authorship links between documents and provides a list of document pairs ranked according to a confidence score (the score shows how likely it is the document pair to be by the same author). Corpora: - Languages: English, Dutch and Greek. - Genres: Articles and reviews.
  • 6. Author identification (diarization) 6 PAN’16 Three subtasks: - Traditional intrinsic plagiarism detection: Assuming a major author (70% of a document) to find the remaining text portions written by other/s. - Diarization with a given number of authors: Given a document composed by a known number of authors, to group individual text fragments by authors. - Unrestricted diarization: The number of collaborating authors is not given, so also the correct number of clusters, i.e., writers, has to be found. Corpora: - Webis-TRC-12 dataset, with 150 topics from TREC Web Tracks from 2009-2011 - Each subtask has variations of the dataset: number and proportions of authors in a document, the decision, uniformly distributed...
  • 7. Subtasks: - Age and gender identification. - Joint identification of age and gender for the same author - The aim is at the cross-genre evaluation. Corpora: - Languages: English, Spanish, Dutch - Genres: Twitter for training. Reviews, social media and blogs for evaluating. Author profiling (age and gender identification) 7 PAN’16
  • 8. Subtasks: - Authorship verification: Given two documents, decide whether they have been written by the same author. - Author masking: Given two documents by the same author, paraphrase the designated one so that the author cannot be verified anymore. Corpora: - Joint training and joint test datasets from the author verification tasks of PAN 2013 to 2015. Author obfuscation 8 PAN’16
  • 9. Conclusions 9 PAN’16 - The author obfuscation shared task allowed to shed light on the robustness of state-of-the-art author identification and author profiling techniques against author obfuscation technology. - New corpora have been developed in multiple languages: English, Spanish, Dutch. - PAN/FIRE: - A shared task on plagiarism detection on texts written in Farsi. - A shared task on author profiling on personality recognition in source code.
  • 10. See you on Tuesday and Wednesday 10 PAN’16 Rui Sousa-Silva Universidade do Porto Tuesday 6th Sept. 13:30 - 15:30 Wednesday 7th Sept. 13:30 - 15:30 16:15 - 18:15 Sponsors
  • 11. 11 PAN’16 On behalf of the PAN lab organisers: Thank you very much for participating and hope to see you next year!!