A Focused Crawler for Romanian Words Discovery
Authors: Ionuț-Gabriel Radu, Traian Rebedea (traian.rebedea@cs.pub.ro)
University Politehnica of Bucharest
Overview
• Introduction
• Objective
• RWScraper
• Related Work
• RWScraper: Implementation
• Results
• Conclusions
19.08.15 RoEduNet Conference 2014 – Chișinău, R. Moldova 2
Introduction
• All natural languages are subject to change
over time
• As the Web becomes more prevalent, it also
constitutes a major source for identifying
language evolution
• Due to large amounts of Romanian web
content, the rate of change has increased
significantly
Objective
• To provide a mechanism to identify new
words (e.g. neologisms) that have entered the
Romanian language
• Develop a specialized (focused) web crawler
for analyzing Romanian web pages and
identifying new words
Focused Web Crawling
• Crawling the web with a specific purpose:
– “Focus” the spiders to specific content (e.g.
people search, scientific publications, products,
etc.)
– Ignore other web pages
and domains
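A minimal sketch of the "focus" step, written as a link filter that a crawler could apply before queueing a URL. The slides do not say how RWScraper restricts its crawl, so the `.ro` host suffix, the skipped file extensions, and the name `should_follow` are all assumptions made for illustration.

```python
from urllib.parse import urlparse

# Hypothetical focus rule: only hosts under the .ro top-level domain.
ALLOWED_SUFFIXES = (".ro",)
# Hypothetical list of resources that carry no running text.
SKIPPED_EXTENSIONS = (".jpg", ".png", ".gif", ".pdf", ".zip")


def should_follow(url: str) -> bool:
    """Return True if the crawler should fetch this URL."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False  # ignore mailto:, javascript:, ftp:, ...
    host = (parsed.hostname or "").lower()
    if not host.endswith(ALLOWED_SUFFIXES):
        return False  # outside the area of interest
    return not parsed.path.lower().endswith(SKIPPED_EXTENSIONS)


links = [
    "https://www.dexonline.ro/definitie/cuvant",
    "https://example.com/page",
    "http://site.ro/photo.JPG",
]
to_crawl = [link for link in links if should_follow(link)]
```

A top-level-domain rule is only a first cut: Romanian pages hosted elsewhere are missed, which is one reason a content-based language check is still needed afterwards.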
Solution: RWScraper
• RWScraper (Romanian Word Scraper) is able
to solve the following problems:
– Identify Romanian texts;
– Distinguish between proper names and common
nouns;
– Create a database with new words along with
context information and metadata. In order to
identify new
– Discover the most frequent spelling errors in
Romanian online texts.
RWScraper – Text Processing
• Each word discovered in a Romanian text is looked up in
the database provided by www.dexonline.ro, which
contains definitions from several Romanian
dictionaries (DEX, DOOM, etc.)
• Text Processing Pipeline
– Text Normalization
– Language Validation
– Sentence Segmentation
– Sentence-Level Language
Identification
– Word Tokenization
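The pipeline above can be sketched in a few functions. This is an illustration under stated assumptions, not RWScraper's code: the two language-checking stages are left out, the sentence splitter and tokenizer are deliberately naive regular expressions, and the `lexicon` argument is a hypothetical in-memory stand-in for the dexonline.ro database.

```python
import re
import unicodedata
from typing import List, Set

# Legacy cedilla letters (s/t with cedilla) mapped to the comma-below
# letters used in standard Romanian orthography.
CEDILLA_TO_COMMA = str.maketrans(
    {"\u015f": "\u0219", "\u015e": "\u0218", "\u0163": "\u021b", "\u0162": "\u021a"}
)
# A word is a run of letters, optionally joined by hyphens.
WORD = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")


def normalize(text: str) -> str:
    """Compose Unicode, unify diacritic variants, collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = text.translate(CEDILLA_TO_COMMA)
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text: str) -> List[str]:
    """Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def tokenize(sentence: str) -> List[str]:
    return WORD.findall(sentence.lower())


def unknown_words(text: str, lexicon: Set[str]) -> List[str]:
    """Return tokens missing from the lexicon, in order of first appearance."""
    found: List[str] = []
    for sentence in split_sentences(normalize(text)):
        for token in tokenize(sentence):
            if token not in lexicon and token not in found:
                found.append(token)
    return found


# Hypothetical mini-lexicon; the real lookup goes against dictionary data.
lexicon = {"am", "dat", "un", "casa", "este", "mare"}
candidates = unknown_words("Am dat un selfie. Casa este mare!", lexicon)
```

Normalizing before the lookup matters because the same Romanian word may appear online with cedilla or comma-below letters, and only one form would match a dictionary entry.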
Related Work:
Neologisms Identification
• A study for Japanese:
– Scanning existing Japanese corpora for possible “new” words,
typically by processing the texts through segmentation software
and dealing with the “out-of-lexicon” problem
– Simulating Japanese morphological processes to create possible
new words and then testing for their presence in large
corpora
• Identification of lexical discriminants (e.g. termed, called,
known as) and punctuation discriminants (e.g. single and
double quotes) for introducing new words
– This method is able to identify a significantly smaller number of
potential new words due to the limited number of lexical
discriminant patterns.
• Using data about the frequency of word usage over time
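The discriminant-based approach can be illustrated with two regular expressions, one for the lexical cues named above (termed, called, known as) and one for words set in quotes. The pattern details and the function name `candidate_words` are assumptions for this sketch; a real system would need a longer cue list and better handling of apostrophes.

```python
import re
from typing import List

WORD = r"[^\W\d_]+(?:-[^\W\d_]+)*"
OPEN_QUOTE = "[\"'\u201c\u2018]"
CLOSE_QUOTE = "[\"'\u201d\u2019]"

# Lexical discriminants: "termed X", "called X", "known as X".
LEXICAL_PATTERN = re.compile(
    r"\b(?:termed|called|known as)\s+" + OPEN_QUOTE + "?(" + WORD + ")",
    re.IGNORECASE,
)
# Punctuation discriminants: a single word wrapped in quotes.
QUOTED_PATTERN = re.compile(OPEN_QUOTE + "(" + WORD + ")" + CLOSE_QUOTE)


def candidate_words(text: str) -> List[str]:
    """Collect words introduced by a lexical cue or set in quotes."""
    found: List[str] = []
    for pattern in (LEXICAL_PATTERN, QUOTED_PATTERN):
        for match in pattern.finditer(text):
            word = match.group(1).lower()
            if word not in found:
                found.append(word)
    return found


sample = 'This habit, known as "doomscrolling", needs a device called phablet.'
words = candidate_words(sample)
```

Because only words that happen to follow one of these patterns are caught, the yield is small, which matches the limitation noted above.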
Related Work:
Language Identification
• Common Words Methods
– Store and use a list with the most frequent words for each language
• Unique Letter Combinations
– Database with the most frequent sequences of letters in a language,
not necessarily valid words
– The main disadvantage: the poor performance on short texts
– The main advantage: it does not require word tokenization
• Language Identification Using N-Grams
– Every language has several specific, frequently used character n-grams
– For a particular language L, the ordered n-gram dictionary is called the
n-gram language profile
– For a new text, we compute the distance to all computed language
profiles
• Markov Models for Language Identification
– A word can be represented as a Markov chain in which letters are
states
– Compute a Markov model for each language
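The n-gram profile method can be sketched as follows. The slides do not name the distance function, so this example assumes the rank-based "out-of-place" measure that is commonly used with n-gram profiles; the profile size, the n-gram lengths, and the two one-sentence training texts are illustrative choices only.

```python
from collections import Counter
from typing import Dict, Tuple


def ngram_profile(text: str, sizes: Tuple[int, ...] = (2, 3), top_k: int = 300) -> Dict[str, int]:
    """Map the top_k most frequent character n-grams to their rank (0 = most frequent)."""
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter(
        padded[i:i + n] for n in sizes for i in range(len(padded) - n + 1)
    )
    # Sort by descending frequency, then alphabetically so ranks are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
    return {gram: rank for rank, (gram, _) in enumerate(ranked)}


def out_of_place_distance(doc: Dict[str, int], lang: Dict[str, int]) -> int:
    """Sum of rank differences; n-grams missing from the language profile get a fixed penalty."""
    penalty = len(lang)
    return sum(
        abs(rank - lang[gram]) if gram in lang else penalty
        for gram, rank in doc.items()
    )


def identify(text: str, profiles: Dict[str, Dict[str, int]]) -> str:
    """Return the language whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda name: out_of_place_distance(doc, profiles[name]))


# Toy training data; real profiles are built from large corpora.
ro_sample = "toate limbile naturale se schimb\u0103 \u00een timp"
en_sample = "all natural languages are subject to change over time"
profiles = {"ro": ngram_profile(ro_sample), "en": ngram_profile(en_sample)}
```

With profiles this small the method only reliably recognises text close to its training data; the point of the sketch is the mechanics of building a ranked profile and comparing ranks.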
RWScraper: Implementation
• RWScraper is a focused crawler for Romanian
web pages
• Developed using Scrapy: open-source scraping
framework in Python
• It uses three main concepts:
– Spiders: responsible for defining rules to restrict the
crawled content to our area of interest
– Items: data we want to scrape from the web pages
– Pipelines: text processing tasks that act on the
crawled web resources
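To make the item and pipeline concepts concrete without depending on an installed framework, the sketch below mirrors them in plain Python. It is not Scrapy's API: Scrapy pipelines implement `process_item(self, item, spider)` and drop items by raising `scrapy.exceptions.DropItem`, whereas this simplified version takes only the item and returns `None` to drop it. The class names and the lexicon are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class PageItem:
    """Data scraped from one page (the 'item')."""
    url: str
    text: str
    new_words: List[str] = field(default_factory=list)


class MinLengthPipeline:
    """Drop pages with too little text to analyse."""

    def __init__(self, min_chars: int) -> None:
        self.min_chars = min_chars

    def process_item(self, item: PageItem) -> Optional[PageItem]:
        return item if len(item.text) >= self.min_chars else None


class NewWordPipeline:
    """Attach the words that are missing from a lexicon."""

    def __init__(self, lexicon: Set[str]) -> None:
        self.lexicon = lexicon

    def process_item(self, item: PageItem) -> Optional[PageItem]:
        item.new_words = [
            w for w in item.text.lower().split()
            if w.isalpha() and w not in self.lexicon
        ]
        return item


def run_pipelines(item: PageItem, pipelines: list) -> Optional[PageItem]:
    """Pass the item through each stage in order; stop if a stage drops it."""
    current: Optional[PageItem] = item
    for stage in pipelines:
        current = stage.process_item(current)
        if current is None:
            return None
    return current


stages = [MinLengthPipeline(5), NewWordPipeline({"un", "nou"})]
result = run_pipelines(PageItem("http://example.ro/a", "un selfie nou"), stages)
```

Keeping each text-processing task in its own stage means a stage can be reordered, replaced, or tested in isolation.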
RWScraper Language Validation
• Divide the texts into two categories:
– Diacritics-free texts – DIAFREE
– Genuine Romanian texts – GEN
• 6.40% of the characters in the Romanian texts part of
the ro_eu_parliament corpus are diacritics
• One problem with this approach is that 4.14% of the
texts contained ș, â, and î, and other languages
also use these diacritics
• Romanian is the only language that uses ț and ă
• Our assumption: if a text has over 600 characters and
no ț/ă are found
– Then it is DIAFREE
– Otherwise it is GEN
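The rule above translates directly into a short function. The 600-character threshold and the two labels come from the slide; treating uppercase letters and the legacy cedilla form of ț as equivalent markers is an assumption added here, as is the function name.

```python
# Marker letters: t-comma (lower, upper), a-breve (lower, upper), plus the
# legacy t-cedilla forms (assumption: counted as the same letter).
MARKERS = set("\u021b\u021a\u0103\u0102\u0163\u0162")
MIN_CHARS = 600  # threshold stated on the slide


def classify_diacritics(text: str) -> str:
    """Label a text DIAFREE (long, no marker letters) or GEN (everything else)."""
    has_marker = any(ch in MARKERS for ch in text)
    if len(text) > MIN_CHARS and not has_marker:
        return "DIAFREE"
    return "GEN"


long_plain = "a" * 601
long_with_marker = "\u0103" + "a" * 700
```

Note that, as the rule is stated, any text of 600 characters or fewer is labelled GEN regardless of its content.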
RWScraper Language Validation
• Build language profiles, consisting of:
– Character bigrams and trigrams frequency
– Common words frequency
– Diacritics frequency
– Rare characters frequency
– Double consonant frequency
– Single quotes frequency
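Four of the profile features listed above are simple character statistics and can be sketched as below. The slides do not define the features precisely, so the choices here are assumptions: which letters count as diacritics, treating k, q, w, and y as the "rare" characters, normalising by the number of letters, and the function name `language_features`. The n-gram and common-word frequencies are omitted from this sketch.

```python
import re
from typing import Dict

# a-breve, a-circumflex, i-circumflex, s-comma, t-comma, plus legacy cedilla forms.
DIACRITICS = set("\u0103\u00e2\u00ee\u0219\u021b\u015f\u0163")
RARE = set("kqwy")  # assumption: letters that are uncommon in native Romanian words
DOUBLE_CONSONANT = re.compile(r"([b-df-hj-np-tv-z])\1")


def language_features(text: str) -> Dict[str, float]:
    """Compute per-letter frequencies of a few language-revealing traits."""
    lowered = text.lower()
    letters = [ch for ch in lowered if ch.isalpha()]
    total = max(len(letters), 1)  # avoid division by zero on empty input
    return {
        "diacritics": sum(ch in DIACRITICS for ch in letters) / total,
        "rare_chars": sum(ch in RARE for ch in letters) / total,
        "double_consonants": len(DOUBLE_CONSONANT.findall(lowered)) / total,
        "single_quotes": lowered.count("'") / max(len(lowered), 1),
    }


ro_features = language_features("\u021bar\u0103")
en_features = language_features("passing")
```

Features such as doubled consonants and apostrophes are cheap signals that a text is probably Italian or English rather than Romanian, which is useful when the text carries no diacritics at all.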
Results: Language Validation
• The 105 texts comprise: 20 Romanian with diacritics (RO1-RO20),
20 Romanian without diacritics (RO21-RO40), 20 Italian, 15 English,
10 Spanish, 5 Latin, 5 French, 5 Turkish, 3 Catalan, and 2 Aromanian
• The size of the texts varied from 9 KB to 2.5 MB, the average
size being 253.4 KB
• Average scores for the discriminator function
– Lower score means higher probability for the text to be written in
Romanian
– Used to set the discriminant score threshold to 0.77 to separate
Romanian from non-Romanian texts
Results
• Processed 264,328 online documents
– Only 12,555 documents contained new words
• From this set of texts, we extracted 698,341 phrases
– Only 47,363 phrases contained new words
• Discovered 53,724 new words
– 21,343 are proper names
• The remaining tokens are common words and they are
divided into the following main categories:
– Misspelled words (approximately 35%)
– Technical words (approximately 15%)
– Argotic words (approximately 10%)
– Clitics, regionalisms, archaisms, and alternative forms of
existing words account for the rest (approximately 40%)
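The figures above imply a few derived quantities, worked out here as a short calculation. All inputs are taken from the slide; the 698,341 figure is assumed to count phrases, and the per-category counts are rough estimates because the percentages are themselves approximate.

```python
documents = 264_328
documents_with_new_words = 12_555
phrases = 698_341            # assumed to be the number of extracted phrases
phrases_with_new_words = 47_363
new_words = 53_724
proper_names = 21_343

# Words left after removing proper names.
common_words = new_words - proper_names  # 32,381

# Share of documents and phrases that yielded at least one new word.
document_hit_rate = documents_with_new_words / documents
phrase_hit_rate = phrases_with_new_words / phrases

# Approximate category shares reported for the common words.
shares = {"misspelled": 0.35, "technical": 0.15, "argotic": 0.10, "other": 0.40}
estimated_counts = {name: round(common_words * share) for name, share in shares.items()}

print(f"common words: {common_words}")
print(f"documents with new words: {document_hit_rate:.2%}")
print(f"phrases with new words: {phrase_hit_rate:.2%}")
print(estimated_counts)
```

The calculation makes the scale visible: fewer than one document in twenty contributed a new word, and misspellings alone account for roughly a third of the common-word candidates.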
Results
• Most frequent new words
Conclusions
• RWScraper is a simple system for discovering new
Romanian words
• The project has also managed to create a large
database of Romanian words extracted from the
WWW
– Statistics about common proper names, frequent spelling
mistakes and newly-invented words
• There are several elements that could be further
improved
– The accuracy of the NLP components used by the system
– A more pertinent analysis of the words identified by the
system
Thank you!
Questions?
Discussion
This work has been funded by the
Sectorial Operational Programme
Human Resources Development
2007-2013 of the Romanian Ministry
of European Funds through the
Financial Agreement
POSDRU/159/1.5/S/132397

Más contenido relacionado

La actualidad más candente

Admixture of Poisson MRFs: A New Topic Model with Word Dependencies
Admixture of Poisson MRFs: A New Topic Model with Word DependenciesAdmixture of Poisson MRFs: A New Topic Model with Word Dependencies
Admixture of Poisson MRFs: A New Topic Model with Word Dependencies
David Inouye
 
Knowledge Patterns for the Web: extraction, transformation, and reuse
Knowledge Patterns for the Web: extraction, transformation, and reuseKnowledge Patterns for the Web: extraction, transformation, and reuse
Knowledge Patterns for the Web: extraction, transformation, and reuse
Andrea Nuzzolese
 
Terminology turbocharges your translation: From my archive before TaaS ;-)
Terminology turbocharges your translation: From my archive before TaaS ;-)Terminology turbocharges your translation: From my archive before TaaS ;-)
Terminology turbocharges your translation: From my archive before TaaS ;-)
Tatjana Gornostaja
 

La actualidad más candente (20)

Admixture of Poisson MRFs: A New Topic Model with Word Dependencies
Admixture of Poisson MRFs: A New Topic Model with Word DependenciesAdmixture of Poisson MRFs: A New Topic Model with Word Dependencies
Admixture of Poisson MRFs: A New Topic Model with Word Dependencies
 
Chinese Character Decomposition for Neural MT with Multi-Word Expressions
Chinese Character Decomposition for  Neural MT with Multi-Word ExpressionsChinese Character Decomposition for  Neural MT with Multi-Word Expressions
Chinese Character Decomposition for Neural MT with Multi-Word Expressions
 
LDA Beginner's Tutorial
LDA Beginner's TutorialLDA Beginner's Tutorial
LDA Beginner's Tutorial
 
ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing.
ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing.ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing.
ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing.
 
Topic Modeling for Learning Analytics Researchers LAK15 Tutorial
Topic Modeling for Learning Analytics Researchers LAK15 TutorialTopic Modeling for Learning Analytics Researchers LAK15 Tutorial
Topic Modeling for Learning Analytics Researchers LAK15 Tutorial
 
Frontiers of Natural Language Processing
Frontiers of Natural Language ProcessingFrontiers of Natural Language Processing
Frontiers of Natural Language Processing
 
RusLTC at TSD-2014 (Brno)
RusLTC at TSD-2014 (Brno)RusLTC at TSD-2014 (Brno)
RusLTC at TSD-2014 (Brno)
 
Lecture: Semantic Word Clouds
Lecture: Semantic Word CloudsLecture: Semantic Word Clouds
Lecture: Semantic Word Clouds
 
Knowledge Patterns for the Web: extraction, transformation, and reuse
Knowledge Patterns for the Web: extraction, transformation, and reuseKnowledge Patterns for the Web: extraction, transformation, and reuse
Knowledge Patterns for the Web: extraction, transformation, and reuse
 
Oke
OkeOke
Oke
 
Information Extraction
Information ExtractionInformation Extraction
Information Extraction
 
Question answering
Question answeringQuestion answering
Question answering
 
Question Answering - Application and Challenges
Question Answering - Application and ChallengesQuestion Answering - Application and Challenges
Question Answering - Application and Challenges
 
Lecture 1: Semantic Analysis in Language Technology
Lecture 1: Semantic Analysis in Language TechnologyLecture 1: Semantic Analysis in Language Technology
Lecture 1: Semantic Analysis in Language Technology
 
MAHOUT classifier tour
MAHOUT classifier tourMAHOUT classifier tour
MAHOUT classifier tour
 
Semantic Technologies in ST&DL
Semantic Technologies in ST&DLSemantic Technologies in ST&DL
Semantic Technologies in ST&DL
 
Terminology turbocharges your translation: From my archive before TaaS ;-)
Terminology turbocharges your translation: From my archive before TaaS ;-)Terminology turbocharges your translation: From my archive before TaaS ;-)
Terminology turbocharges your translation: From my archive before TaaS ;-)
 
Dh2014 e mopcobre-complete
Dh2014 e mopcobre-completeDh2014 e mopcobre-complete
Dh2014 e mopcobre-complete
 
Lecture 2: Computational Semantics
Lecture 2: Computational SemanticsLecture 2: Computational Semantics
Lecture 2: Computational Semantics
 
AINL 2016: Malykh
AINL 2016: MalykhAINL 2016: Malykh
AINL 2016: Malykh
 

Destacado

Importanța algoritmilor pentru problemele de la interviuri
Importanța algoritmilor pentru problemele de la interviuriImportanța algoritmilor pentru problemele de la interviuri
Importanța algoritmilor pentru problemele de la interviuri
Traian Rebedea
 
Automatic plagiarism detection system for specialized corpora
Automatic plagiarism detection system for specialized corporaAutomatic plagiarism detection system for specialized corpora
Automatic plagiarism detection system for specialized corpora
Traian Rebedea
 
Fact, Figures and Statistics
Fact, Figures and StatisticsFact, Figures and Statistics
Fact, Figures and Statistics
meducationdotnet
 
Algorithm Design and Complexity - Course 7
Algorithm Design and Complexity - Course 7Algorithm Design and Complexity - Course 7
Algorithm Design and Complexity - Course 7
Traian Rebedea
 

Destacado (20)

Practical machine learning - Part 1
Practical machine learning - Part 1Practical machine learning - Part 1
Practical machine learning - Part 1
 
Importanța algoritmilor pentru problemele de la interviuri
Importanța algoritmilor pentru problemele de la interviuriImportanța algoritmilor pentru problemele de la interviuri
Importanța algoritmilor pentru problemele de la interviuri
 
Automatic plagiarism detection system for specialized corpora
Automatic plagiarism detection system for specialized corporaAutomatic plagiarism detection system for specialized corpora
Automatic plagiarism detection system for specialized corpora
 
Portaretrato tunantero
Portaretrato tunanteroPortaretrato tunantero
Portaretrato tunantero
 
Final draft ideg observation-report_voter-reg-ex 2008 1
Final draft  ideg observation-report_voter-reg-ex 2008 1Final draft  ideg observation-report_voter-reg-ex 2008 1
Final draft ideg observation-report_voter-reg-ex 2008 1
 
Intro: Ancient Greece
Intro: Ancient GreeceIntro: Ancient Greece
Intro: Ancient Greece
 
coverstory
coverstorycoverstory
coverstory
 
Introduction to Excel
Introduction to ExcelIntroduction to Excel
Introduction to Excel
 
Recintos y clasificación arancelaria
Recintos y clasificación arancelariaRecintos y clasificación arancelaria
Recintos y clasificación arancelaria
 
Fact, Figures and Statistics
Fact, Figures and StatisticsFact, Figures and Statistics
Fact, Figures and Statistics
 
What have you learnt about technologies from the process of constructing this...
What have you learnt about technologies from the process of constructing this...What have you learnt about technologies from the process of constructing this...
What have you learnt about technologies from the process of constructing this...
 
Algorithm Design and Complexity - Course 7
Algorithm Design and Complexity - Course 7Algorithm Design and Complexity - Course 7
Algorithm Design and Complexity - Course 7
 
портфолио1
портфолио1портфолио1
портфолио1
 
Sistema aduanero mexicano, modernización para un marco mundial.
Sistema aduanero mexicano, modernización para un marco mundial.Sistema aduanero mexicano, modernización para un marco mundial.
Sistema aduanero mexicano, modernización para un marco mundial.
 
Number system
Number systemNumber system
Number system
 
Algorithm Design and Complexity - Course 1&2
Algorithm Design and Complexity - Course 1&2Algorithm Design and Complexity - Course 1&2
Algorithm Design and Complexity - Course 1&2
 
Bitwise operators
Bitwise operatorsBitwise operators
Bitwise operators
 
Surgical Anatomy Of The Nose
Surgical Anatomy Of The NoseSurgical Anatomy Of The Nose
Surgical Anatomy Of The Nose
 
หาผลบวกและผลลบของเอกนาม
หาผลบวกและผลลบของเอกนามหาผลบวกและผลลบของเอกนาม
หาผลบวกและผลลบของเอกนาม
 
Surgical anatomy of nose
Surgical anatomy of noseSurgical anatomy of nose
Surgical anatomy of nose
 

Similar a A focused crawler for romanian words discovery

The Canadian Linked Data Initiative: Charting a Path to a Linked Data Future
The Canadian Linked Data Initiative: Charting a Path to a Linked Data FutureThe Canadian Linked Data Initiative: Charting a Path to a Linked Data Future
The Canadian Linked Data Initiative: Charting a Path to a Linked Data Future
NASIG
 
Automated interpretability of linked data ontologies: an evaluation within th...
Automated interpretability of linked data ontologies: an evaluation within th...Automated interpretability of linked data ontologies: an evaluation within th...
Automated interpretability of linked data ontologies: an evaluation within th...
Nuno Freire
 
Arabic Text mining Classification
Arabic Text mining Classification Arabic Text mining Classification
Arabic Text mining Classification
Zakaria Zubi
 

Similar a A focused crawler for romanian words discovery (20)

OWN-PT: Taking Stock
OWN-PT: Taking Stock OWN-PT: Taking Stock
OWN-PT: Taking Stock
 
An Introduction to Semantic Web Technology
An Introduction to Semantic Web TechnologyAn Introduction to Semantic Web Technology
An Introduction to Semantic Web Technology
 
How community software supports language documentation and data analysis
How community software supports language documentation and data analysisHow community software supports language documentation and data analysis
How community software supports language documentation and data analysis
 
Oc wg-nif-20130711
Oc wg-nif-20130711Oc wg-nif-20130711
Oc wg-nif-20130711
 
Framing quality indicators for multilingual repositories of Open Educational ...
Framing quality indicators for multilingual repositories of Open Educational ...Framing quality indicators for multilingual repositories of Open Educational ...
Framing quality indicators for multilingual repositories of Open Educational ...
 
Framing quality indicators for multilingual repositories of Open Educational ...
Framing quality indicators for multilingual repositories of Open Educational ...Framing quality indicators for multilingual repositories of Open Educational ...
Framing quality indicators for multilingual repositories of Open Educational ...
 
Framing quality indicators for multilingual repositories of Open Educational ...
Framing quality indicators for multilingual repositories of Open Educational ...Framing quality indicators for multilingual repositories of Open Educational ...
Framing quality indicators for multilingual repositories of Open Educational ...
 
The Danish National Bibliography as LOD
The Danish National Bibliography as LODThe Danish National Bibliography as LOD
The Danish National Bibliography as LOD
 
Europeana Creative. EDM Endpoint. Custom Views
Europeana Creative. EDM Endpoint. Custom ViewsEuropeana Creative. EDM Endpoint. Custom Views
Europeana Creative. EDM Endpoint. Custom Views
 
EPrints Update, Les Carr, University of Southampton
EPrints  Update, Les Carr, University of SouthamptonEPrints  Update, Les Carr, University of Southampton
EPrints Update, Les Carr, University of Southampton
 
ICANN 51: IDN Root Zone LGR (workshop)
ICANN 51: IDN Root Zone LGR (workshop)ICANN 51: IDN Root Zone LGR (workshop)
ICANN 51: IDN Root Zone LGR (workshop)
 
The Canadian Linked Data Initiative: Charting a Path to a Linked Data Future
The Canadian Linked Data Initiative: Charting a Path to a Linked Data FutureThe Canadian Linked Data Initiative: Charting a Path to a Linked Data Future
The Canadian Linked Data Initiative: Charting a Path to a Linked Data Future
 
Summit2013 sw in russian universities
Summit2013   sw in russian universitiesSummit2013   sw in russian universities
Summit2013 sw in russian universities
 
AINL 2016: Kuznetsova
AINL 2016: KuznetsovaAINL 2016: Kuznetsova
AINL 2016: Kuznetsova
 
Rda and new research potentials, agata kawalec
Rda and new research potentials, agata kawalecRda and new research potentials, agata kawalec
Rda and new research potentials, agata kawalec
 
Automated interpretability of linked data ontologies: an evaluation within th...
Automated interpretability of linked data ontologies: an evaluation within th...Automated interpretability of linked data ontologies: an evaluation within th...
Automated interpretability of linked data ontologies: an evaluation within th...
 
Detecting Good Practices and Pitfalls when Publishing Vocabularies on the Web
Detecting Good Practices and Pitfalls when Publishing Vocabularies on the Web Detecting Good Practices and Pitfalls when Publishing Vocabularies on the Web
Detecting Good Practices and Pitfalls when Publishing Vocabularies on the Web
 
Arabic Text mining Classification
Arabic Text mining Classification Arabic Text mining Classification
Arabic Text mining Classification
 
Presentation to 2014 University of Guelph Accessibility Conference Perspectiv...
Presentation to 2014 University of Guelph Accessibility Conference Perspectiv...Presentation to 2014 University of Guelph Accessibility Conference Perspectiv...
Presentation to 2014 University of Guelph Accessibility Conference Perspectiv...
 
Sinmin Literature Review Presentation
Sinmin Literature Review PresentationSinmin Literature Review Presentation
Sinmin Literature Review Presentation
 

Más de Traian Rebedea

Propunere de dezvoltare a carierei universitare
Propunere de dezvoltare a carierei universitarePropunere de dezvoltare a carierei universitare
Propunere de dezvoltare a carierei universitare
Traian Rebedea
 
Relevance based ranking of video comments on YouTube
Relevance based ranking of video comments on YouTubeRelevance based ranking of video comments on YouTube
Relevance based ranking of video comments on YouTube
Traian Rebedea
 
Automatic assessment of collaborative chat conversations with PolyCAFe - EC-T...
Automatic assessment of collaborative chat conversations with PolyCAFe - EC-T...Automatic assessment of collaborative chat conversations with PolyCAFe - EC-T...
Automatic assessment of collaborative chat conversations with PolyCAFe - EC-T...
Traian Rebedea
 
Conclusions and Recommendations of the Romanian ICT RTD Survey
Conclusions and Recommendations of the Romanian ICT RTD SurveyConclusions and Recommendations of the Romanian ICT RTD Survey
Conclusions and Recommendations of the Romanian ICT RTD Survey
Traian Rebedea
 
Istoria Web-ului - part 2 - tentativ How to Web 2009
Istoria Web-ului - part 2 - tentativ How to Web 2009Istoria Web-ului - part 2 - tentativ How to Web 2009
Istoria Web-ului - part 2 - tentativ How to Web 2009
Traian Rebedea
 
Istoria Web-ului - part 1 (2) - tentativ How to Web 2009
Istoria Web-ului - part 1 (2) - tentativ How to Web 2009Istoria Web-ului - part 1 (2) - tentativ How to Web 2009
Istoria Web-ului - part 1 (2) - tentativ How to Web 2009
Traian Rebedea
 
Istoria Web-ului - part 1 - tentativ How to Web 2009
Istoria Web-ului - part 1 - tentativ How to Web 2009Istoria Web-ului - part 1 - tentativ How to Web 2009
Istoria Web-ului - part 1 - tentativ How to Web 2009
Traian Rebedea
 
Algorithm Design and Complexity - Course 12
Algorithm Design and Complexity - Course 12Algorithm Design and Complexity - Course 12
Algorithm Design and Complexity - Course 12
Traian Rebedea
 
Algorithm Design and Complexity - Course 11
Algorithm Design and Complexity - Course 11Algorithm Design and Complexity - Course 11
Algorithm Design and Complexity - Course 11
Traian Rebedea
 
Algorithm Design and Complexity - Course 10
Algorithm Design and Complexity - Course 10Algorithm Design and Complexity - Course 10
Algorithm Design and Complexity - Course 10
Traian Rebedea
 
Algorithm Design and Complexity - Course 9
Algorithm Design and Complexity - Course 9Algorithm Design and Complexity - Course 9
Algorithm Design and Complexity - Course 9
Traian Rebedea
 
Algorithm Design and Complexity - Course 8
Algorithm Design and Complexity - Course 8Algorithm Design and Complexity - Course 8
Algorithm Design and Complexity - Course 8
Traian Rebedea
 
Algorithm Design and Complexity - Course 6
Algorithm Design and Complexity - Course 6Algorithm Design and Complexity - Course 6
Algorithm Design and Complexity - Course 6
Traian Rebedea
 
Algorithm Design and Complexity - Course 5
Algorithm Design and Complexity - Course 5Algorithm Design and Complexity - Course 5
Algorithm Design and Complexity - Course 5
Traian Rebedea
 
Algorithm Design and Complexity - Course 3
Algorithm Design and Complexity - Course 3Algorithm Design and Complexity - Course 3
Algorithm Design and Complexity - Course 3
Traian Rebedea
 

Más de Traian Rebedea (20)

AI @ Wholi - Bucharest.AI Meetup #5
AI @ Wholi - Bucharest.AI Meetup #5AI @ Wholi - Bucharest.AI Meetup #5
AI @ Wholi - Bucharest.AI Meetup #5
 
Deep neural networks for matching online social networking profiles
Deep neural networks for matching online social networking profilesDeep neural networks for matching online social networking profiles
Deep neural networks for matching online social networking profiles
 
Intro to Deep Learning for Question Answering
Intro to Deep Learning for Question AnsweringIntro to Deep Learning for Question Answering
Intro to Deep Learning for Question Answering
 
What is word2vec?
What is word2vec?What is word2vec?
What is word2vec?
 
Propunere de dezvoltare a carierei universitare
Propunere de dezvoltare a carierei universitarePropunere de dezvoltare a carierei universitare
Propunere de dezvoltare a carierei universitare
 
Relevance based ranking of video comments on YouTube
Relevance based ranking of video comments on YouTubeRelevance based ranking of video comments on YouTube
Relevance based ranking of video comments on YouTube
 
Automatic assessment of collaborative chat conversations with PolyCAFe - EC-T...
Automatic assessment of collaborative chat conversations with PolyCAFe - EC-T...Automatic assessment of collaborative chat conversations with PolyCAFe - EC-T...
Automatic assessment of collaborative chat conversations with PolyCAFe - EC-T...
 
Conclusions and Recommendations of the Romanian ICT RTD Survey
Conclusions and Recommendations of the Romanian ICT RTD SurveyConclusions and Recommendations of the Romanian ICT RTD Survey
Conclusions and Recommendations of the Romanian ICT RTD Survey
 
Istoria Web-ului - part 2 - tentativ How to Web 2009
Istoria Web-ului - part 2 - tentativ How to Web 2009Istoria Web-ului - part 2 - tentativ How to Web 2009
Istoria Web-ului - part 2 - tentativ How to Web 2009
 
Istoria Web-ului - part 1 (2) - tentativ How to Web 2009
Istoria Web-ului - part 1 (2) - tentativ How to Web 2009Istoria Web-ului - part 1 (2) - tentativ How to Web 2009
Istoria Web-ului - part 1 (2) - tentativ How to Web 2009
 
Istoria Web-ului - part 1 - tentativ How to Web 2009
Istoria Web-ului - part 1 - tentativ How to Web 2009Istoria Web-ului - part 1 - tentativ How to Web 2009
Istoria Web-ului - part 1 - tentativ How to Web 2009
 
Algorithm Design and Complexity - Course 12
Algorithm Design and Complexity - Course 12Algorithm Design and Complexity - Course 12
Algorithm Design and Complexity - Course 12
 
Algorithm Design and Complexity - Course 11
Algorithm Design and Complexity - Course 11Algorithm Design and Complexity - Course 11
Algorithm Design and Complexity - Course 11
 
Algorithm Design and Complexity - Course 10
Algorithm Design and Complexity - Course 10Algorithm Design and Complexity - Course 10
Algorithm Design and Complexity - Course 10
 
Algorithm Design and Complexity - Course 9
Algorithm Design and Complexity - Course 9Algorithm Design and Complexity - Course 9
Algorithm Design and Complexity - Course 9
 
Algorithm Design and Complexity - Course 8
Algorithm Design and Complexity - Course 8Algorithm Design and Complexity - Course 8
Algorithm Design and Complexity - Course 8
 
Algorithm Design and Complexity - Course 6
Algorithm Design and Complexity - Course 6Algorithm Design and Complexity - Course 6
Algorithm Design and Complexity - Course 6
 
Algorithm Design and Complexity - Course 5
Algorithm Design and Complexity - Course 5Algorithm Design and Complexity - Course 5
Algorithm Design and Complexity - Course 5
 
Algorithm Design and Complexity - Course 4 - Heaps and Dynamic Progamming
Algorithm Design and Complexity - Course 4 - Heaps and Dynamic ProgammingAlgorithm Design and Complexity - Course 4 - Heaps and Dynamic Progamming
Algorithm Design and Complexity - Course 4 - Heaps and Dynamic Progamming
 
Algorithm Design and Complexity - Course 3
Algorithm Design and Complexity - Course 3Algorithm Design and Complexity - Course 3
Algorithm Design and Complexity - Course 3
 

Último

Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
fonyou31
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
SoniaTolstoy
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
PECB
 

Último (20)

Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdf
 
Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpin
 
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph

A Focused Crawler for Romanian Words Discovery

  • 1. A Focused Crawler for Romanian Words Discovery
      Authors: Ionuț-Gabriel Radu, Traian Rebedea (traian.rebedea@cs.pub.ro)
      University Politehnica of Bucharest
  • 2. Overview
      • Introduction
      • Objective
      • RWScraper
      • Related Work
      • RWScraper: Implementation
      • Results
      • Conclusions
      19.08.15 RoEduNet Conference 2014 – Chișinău, R. Moldova
  • 3. Introduction
      • All natural languages are subject to change over time
      • As the Web becomes more prevalent, it also constitutes a major source for identifying language evolution
      • Due to large amounts of Romanian web content, the rate of change has increased significantly
  • 4. Objective
      • To provide a mechanism to identify new words (e.g. neologisms) that have entered the Romanian language
      • Develop a specialized (focused) web crawler for analyzing Romanian web pages and identifying new words
  • 5. Focused Web Crawling
      • Crawling the web with a specific purpose:
          – “Focus” the spiders on specific content (e.g. people search, scientific publications, products, etc.)
          – Ignore other web pages and domains
  • 6. Solution: RWScraper
      • RWScraper (Romanian Word Scraper) is able to solve the following problems:
          – Identify Romanian texts;
          – Distinguish between proper names and common nouns;
          – Create a database with new words along with context information and metadata. In order to identify new
          – Discover the most frequent spelling errors in Romanian online texts.
  • 7. RWScraper – Text Processing
      • Each word discovered in a Romanian text is looked up in the database provided by www.dexonline.ro, which contains definitions from several Romanian dictionaries (DEX, DOOM, etc.)
      • Text Processing Pipeline
          – Text Normalization
          – Language Validation
          – Sentence Segmentation
          – Sentence-Level Language Identification
          – Word Tokenization
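The lookup step can be illustrated with a minimal sketch. It assumes a small in-memory word set in place of the dexonline.ro database and simple regular expressions in place of the actual segmentation and tokenization components; the two language-validation stages are omitted. All function names are hypothetical.

```python
import re
import unicodedata

# Hypothetical stand-in for the dexonline.ro word database.
KNOWN_WORDS = {"acesta", "este", "un", "text", "nou"}

# Map legacy cedilla letters (U+015F, U+0163, U+015E, U+0162) to the
# comma-below Romanian letters (U+0219, U+021B, U+0218, U+021A).
CEDILLA_TO_COMMA = str.maketrans("\u015f\u0163\u015e\u0162",
                                 "\u0219\u021b\u0218\u021a")


def normalize(text):
    """Text normalization: compose characters and unify diacritic variants."""
    return unicodedata.normalize("NFC", text).translate(CEDILLA_TO_COMMA)


def split_sentences(text):
    """Naive sentence segmentation on ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Word tokenization: lowercase runs of letters, hyphenated forms kept whole."""
    return re.findall(r"[^\W\d_]+(?:-[^\W\d_]+)*", sentence.lower())


def unknown_words(text):
    """Return (word, sentence) pairs for words missing from KNOWN_WORDS."""
    found = []
    for sentence in split_sentences(normalize(text)):
        for word in tokenize(sentence):
            if word not in KNOWN_WORDS:
                found.append((word, sentence))
    return found


candidates = unknown_words("Acesta este un text nou. Este un selfie!")
```

Keeping the sentence next to each unknown word mirrors the "context information" the slides mention for the new-words database.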
  • 8. Related Work: Neologisms Identification
      • A study for Japanese:
          – Scanning existing Japanese corpora for possible “new” words, typically by processing the texts through segmentation software and dealing with the “out-of-lexicon” problem
          – Simulating the Japanese morphological processes to create new possible words and then testing for their presence in large corpora
      • Identification of lexical discriminants (e.g. termed, called, known as) and punctuation discriminants (e.g. single and double quotes) for introducing new words
          – This method is able to identify a significantly smaller number of potential new words due to the limited number of lexical discriminant patterns.
      • Using data about the frequency of word usage over time
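The discriminant-based approach can be sketched with two regular expressions. This is only an illustration for English text: the pattern list is limited to the three lexical discriminants named on the slide, the function name is hypothetical, and a real system would need to handle articles and apostrophes, which this sketch does not.

```python
import re

# Lexical discriminants named on the slide, followed by the introduced word.
LEXICAL = re.compile(r"\b(?:termed|called|known as)\s+([^\W\d_][\w-]*)")

# Punctuation discriminants: a single word inside single or double quotes
# (straight or curly). Apostrophes may cause false positives.
QUOTED = re.compile(r"""["'“‘]([^\W\d_][\w-]*)["'”’]""")


def candidate_new_words(text):
    """Return candidate words in order of first appearance, without duplicates."""
    found = LEXICAL.findall(text) + QUOTED.findall(text)
    return list(dict.fromkeys(word.lower() for word in found))


sample = 'The practice, known as doomscrolling, spread fast; so did "phubbing".'
words = candidate_new_words(sample)
```

Because only text matching one of these few patterns is considered, the candidate set stays small, which is the limitation the slide points out.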
  • 9. Related Work: Language Identification
      • Common Words Methods
          – Store and use a list with the most frequent words for each language
      • Unique Letter Combinations
          – Database with the most frequent sequences of letters in a language, not necessarily valid words
          – The main disadvantage: poor performance on short texts
          – The main advantage: it does not require word tokenization
      • Language Identification Using N-Grams
          – Every language has several specific frequently used character n-grams
          – For a particular language L, the ordered n-gram dictionary is called the n-gram language profile
          – For a new text, we compute the distance to all computed language profiles
      • Markov Models for Language Identification
          – A word can be represented as a Markov chain where letters are states
          – Compute a Markov model for each language
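A minimal sketch of the n-gram profile method follows. The slide does not say which distance is used, so the rank-based "out-of-place" measure of Cavnar and Trenkle is assumed here; the function names, the profile size and the tiny training samples are illustrative only.

```python
from collections import Counter


def ngram_profile(text, n_values=(2, 3), top_k=300):
    """Return the top_k character n-grams of text, most frequent first."""
    text = " ".join(text.lower().split())
    counts = Counter()
    for n in n_values:
        counts.update(text[i:i + n] for i in range(len(text) - n + 1))
    return [gram for gram, _ in counts.most_common(top_k)]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from lang_profile get a fixed penalty."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    penalty = len(lang_profile)
    return sum(abs(i - rank[gram]) if gram in rank else penalty
               for i, gram in enumerate(doc_profile))


def identify(text, profiles):
    """Pick the language whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))


# Toy training samples; real profiles would be built from large corpora.
ro_sample = "aceasta este o propoziție scrisă în limba română"
en_sample = "this is a sentence written in the english language"
profiles = {"ro": ngram_profile(ro_sample), "en": ngram_profile(en_sample)}
```

With profiles this small the method is only reliable on text close to the training samples, which illustrates why short texts are hard for frequency-based identification.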
  • 10. RWScraper: Implementation
      • RWScraper is a focused crawler for Romanian web pages
      • Developed using Scrapy, an open-source scraping framework in Python
      • It uses three main concepts:
          – Spiders: responsible for defining rules to restrict the crawled content to our area of interest
          – Items: data we want to scrape from the web pages
          – Pipelines: text processing tasks that act on the crawled web resources
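To make the pipeline concept concrete, here is a hedged sketch of one text-processing stage. Scrapy item pipelines are ordinary Python classes exposing a `process_item(item, spider)` hook, so the class below runs without importing Scrapy; the class name and the item fields (`url`, `text`) are hypothetical and not taken from RWScraper.

```python
class TextNormalizationPipeline:
    """Hypothetical pipeline stage: collapses whitespace in the scraped text."""

    def process_item(self, item, spider):
        # Items are treated as plain dicts here; the field names are assumptions.
        item["text"] = " ".join(item.get("text", "").split())
        return item  # returning the item hands it to the next pipeline stage


# Stand-alone usage; inside a crawl the framework would call process_item itself.
scraped = {"url": "http://example.ro/", "text": "  Un   text\n scurt "}
processed = TextNormalizationPipeline().process_item(scraped, spider=None)
```

In a Scrapy project, stages like this are enabled and ordered through the `ITEM_PIPELINES` setting, which is how a sequence such as normalization, validation and tokenization could be chained.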
  • 11. RWScraper Language Validation
      • Divide the texts into two categories:
          – Diacritics-free texts: DIAFREE
          – Genuine Romanian texts: GEN
      • 6.40% of the characters in the Romanian texts of the ro_eu_parliament corpus are diacritics
      • One of the problems with this approach is that 4.14% of texts contained ș, â, and î. Unfortunately, other languages also use these diacritics
      • Romanian is the only language that uses ț and ă
      • Our assumption: if a text has over 600 characters and contains no ț or ă
          – Then it is DIAFREE
          – Otherwise it is GEN
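The DIAFREE/GEN rule can be written down directly. The sketch below follows the 600-character threshold from the slide; accepting uppercase letters and the legacy cedilla form of ț as markers is an added assumption, and the function name is hypothetical.

```python
# ț and ă in lower and upper case, plus the legacy cedilla ţ/Ţ (an assumption).
ROMANIAN_MARKERS = set("\u021b\u021a\u0103\u0102\u0163\u0162")


def classify_diacritics(text, min_length=600):
    """Label a text DIAFREE if it is long enough and has no ț/ă, else GEN."""
    has_marker = any(ch in ROMANIAN_MARKERS for ch in text)
    if len(text) > min_length and not has_marker:
        return "DIAFREE"
    return "GEN"


long_plain = "a" * 601
label = classify_diacritics(long_plain)
```

Note that, read literally, the rule also labels any text of 600 characters or fewer as GEN, since a short text without ț or ă gives too little evidence that diacritics were stripped.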
  • 12. RWScraper Language Validation
      • Build language profiles, consisting of:
          – Character bigram and trigram frequency
          – Common words frequency
          – Diacritics frequency
          – Rare characters frequency
          – Double consonant frequency
          – Single quotes frequency
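A few of these profile features can be computed as simple relative frequencies. The sketch below covers only diacritics, double consonants and single quotes; normalizing by text length and the exact character sets are assumptions, since the slides do not define the features precisely.

```python
import re

# Romanian diacritics ă â î ș ț in lower and upper case.
DIACRITICS = set("\u0103\u00e2\u00ee\u0219\u021b\u0102\u00c2\u00ce\u0218\u021a")

# Two identical consecutive consonants, e.g. "tt" or "ll".
DOUBLE_CONSONANT = re.compile(r"([b-df-hj-np-tv-z])\1", re.IGNORECASE)


def profile_features(text):
    """Return per-character frequencies for three of the profile features."""
    length = len(text) or 1  # avoid division by zero on empty input
    return {
        "diacritics": sum(ch in DIACRITICS for ch in text) / length,
        "double_consonants": len(DOUBLE_CONSONANT.findall(text)) / length,
        "single_quotes": text.count("'") / length,
    }


features = profile_features("\u021b\u0103'tt")
```

Features such as double consonants and single quotes are rare in Romanian but common in Italian or English, which is presumably why they help separate the languages.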
  • 13. Results: Language Validation
      • 105 texts are divided into: 20 Romanian with diacritics (RO1 - RO20), 20 Romanian without diacritics (RO21 - RO40), 20 Italian, 15 English, 10 Spanish, 5 Latin, 5 French, 5 Turkish, 3 Catalan, and 2 Aromanian texts
      • The size of the texts varied from 9 KB to 2.5 MB, the average size being 253.4 KB
      • Average scores for the discriminator function
          – Lower score means higher probability for the text to be written in Romanian
          – Used to set the discriminant score to 0.77 to separate Romanian from non-Romanian texts
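Applying the reported threshold is a one-line decision. In the sketch below the 0.77 value comes from the slide, while the strict comparison at the boundary and the example scores are assumptions made for illustration.

```python
ROMANIAN_THRESHOLD = 0.77  # discriminant score reported on the slide


def is_romanian(score, threshold=ROMANIAN_THRESHOLD):
    """Lower scores mean 'more Romanian'; strict '<' at the boundary is an assumption."""
    return score < threshold


# Hypothetical scores, not taken from the evaluation.
example_scores = {"text_a": 0.31, "text_b": 0.77, "text_c": 1.42}
decisions = {name: is_romanian(score) for name, score in example_scores.items()}
```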
  • 14. Results
      • Processed 264,328 online documents
          – Only 12,555 documents contained new words
      • From this set of texts, we extracted 698,341 phrases
          – Only 47,363 phrases contained new words
      • Discovered 53,724 new words
          – 21,343 are proper names
      • The remaining tokens are common words and they are divided into the following main categories:
          – Misspelled words (approximately 35%)
          – Technical words (approximately 15%)
          – Argotic words (approximately 10%)
          – Clitics, regionalisms, archaisms, and alternative forms for existing words account for the rest (approximately 40%)
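The proportions implied by these counts can be checked with a few lines of arithmetic. The sketch reads the 698,341 figure as a phrase count and treats the category percentages as applying to the non-proper-name tokens; the per-category totals are therefore rough estimates, not reported results.

```python
documents_total, documents_with_new = 264_328, 12_555
phrases_total, phrases_with_new = 698_341, 47_363
new_words, proper_names = 53_724, 21_343

# Tokens left after removing proper names.
common_words = new_words - proper_names

# Share of documents and phrases that contained at least one new word.
doc_share = 100 * documents_with_new / documents_total
phrase_share = 100 * phrases_with_new / phrases_total

# Approximate category shares from the slide, in percent.
category_shares = {"misspelled": 35, "technical": 15, "argotic": 10, "other": 40}
estimated_counts = {name: round(common_words * pct / 100)
                    for name, pct in category_shares.items()}

print(f"{common_words} common-word tokens")
print(f"{doc_share:.2f}% of documents, {phrase_share:.2f}% of phrases")
```

Under these assumptions roughly 1 in 21 documents and 1 in 15 phrases contained a new word, and about 32 thousand discovered tokens are not proper names.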
  • 15. Results
      • Most frequent new words
  • 16. Conclusions
      • RWScraper is a simple system for discovering new Romanian words
      • The project has also managed to create a large database of Romanian words extracted from the WWW
          – Statistics about common proper names, frequent spelling mistakes and newly-invented words
      • There are several elements that could be further improved
          – The accuracy of the NLP components used by the system
          – A more pertinent analysis of the words identified by the system
  • 17. Thank you! Questions? Discussion
      This work has been funded by the Sectorial Operational Programme Human Resources Development 2007-2013 of the Romanian Ministry of European Funds through the Financial Agreement POSDRU/159/1.5/S/132397