Data Mining News Articles from Multiple Sources
Amir Othman
About myself.
* Software engineer @ Instance
* Education from Bauhaus Universität
Weimar and Hochschule Ulm
* Love my wife, building cool pieces
of software and making music
* http://www.instance.com.sg
* http://www.amirmeludah.com
About this project.
* Initially intended to be a part of
a thesis project
* Grew into a fun side project
* Fulfilling a weird obsession about
web scraping
What is data mining of news
articles?
"Data mining is the computing process of
discovering patterns in large data sets
involving methods at the intersection of
machine learning, statistics, and database
systems."
source:
https://en.wikipedia.org/wiki/Data_mining
What is data mining of news
articles?
"Data mining, the science of extracting
useful knowledge from such huge data
repositories, has emerged as a young and
interdisciplinary field in computer science."
source:
http://www.kdd.org/curriculum/index.html
What is data mining of news
articles?
"Collecting as much relevant data as possible
in the hope of gaining insights."
- me
Collecting what?
* News articles:
* German news articles
- Regional and national
* Malaysian news articles
- ALL OF THEM!
Why collect this data?
* Building a corpus as raw material
to test out NLP findings
* Piece of digital history
* News organizations go missing -
Wayback Machine not practical
* Cross-validating news sources
How to collect links to
news articles?
* As starting point before expanding
* News aggregators :)
* Search engine
* Curated news from news portals
* Result: Links pointing to news
websites
How to get even more links?
* Related articles – news aggregators
* Tweets from journalists and news
organizations
What about upcoming news?
* We collected a bunch of static
links
* News needs to be fresh and young
What about upcoming news?
* Information retrieval
* age
* freshness
* Effective Web Crawling: PhD Thesis
by Carlos Castillo
Simple is better than complex
What about upcoming news?
* News will (almost) always have RSS
feeds
* Slowly being replaced by Twitter
feeds.
* Advantages
- Subscription instead of crawl frontiers
- Convenient way to get recent news
articles
- Structured
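A minimal sketch of why the "Structured" point matters: every RSS item already carries a title, a link and a publication date, so no page-specific scraping is needed to find fresh links. The feed content, the URLs and the function name `extract_items` are made up for illustration; a real collector would download the feed from each news site first.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed content, standing in for a downloaded RSS document.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item>
      <title>First headline</title>
      <link>https://news.example.com/a1</link>
      <pubDate>Mon, 02 Oct 2017 08:00:00 +0000</pubDate>
    </item>
    <item>
      <title>Second headline</title>
      <link>https://news.example.com/a2</link>
      <pubDate>Mon, 02 Oct 2017 09:30:00 +0000</pubDate>
    </item>
  </channel>
</rss>"""


def extract_items(rss_text):
    """Return one dict per <item> with its title, link and publication date."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "date": item.findtext("pubDate"),
        })
    return items


items = extract_items(SAMPLE_RSS)
for entry in items:
    print(entry["date"], entry["link"])
```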
What about old news?
* Identify the next/other/more links
* Machine learning approach
* Text classification task:
- Is this link with this text a
next/more/other link?
- Train with labeled data - 400
sites from different news
websites
What about old news?
* Text classification task (continued):
- Classifier: FastText
- One iteration grew the collection
from 5443 articles to 349111
articles
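A rough sketch of how labeled link texts could be prepared for this classification task. fastText's supervised mode reads one example per line, with the label prefixed by `__label__`. The label names, the example link texts and the helper `to_fasttext_line` are assumptions for illustration, not the labels or data used in the project.

```python
def to_fasttext_line(anchor_text, is_pagination_link):
    """Format one labeled link text in fastText's supervised input format."""
    label = "__label__pagination" if is_pagination_link else "__label__regular"
    # Lowercase and collapse whitespace so "Next  Page" and "next page" match.
    text = " ".join(anchor_text.lower().split())
    return f"{label} {text}"


# Hypothetical hand-labeled examples: (link text, is it a next/more/other link?)
labeled_links = [
    ("Next page", True),
    ("More  articles", True),
    ("Older posts »", True),
    ("Contact us", False),
    ("Subscribe to newsletter", False),
]

training_lines = [to_fasttext_line(text, flag) for text, flag in labeled_links]
print("\n".join(training_lines))
```

Written to a file, lines like these can be passed to `fasttext.train_supervised(input=...)` from the fastText Python package.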
How To Verify?
* Similarity – above similarity
threshold
* Put through information extraction
pipeline.
* Second layer of sanity check:
- randomly pick link and inspect.
What do we have so far?
* Links pointing to old articles
* RSS and Twitter feeds for links
pointing to new articles
How to retrieve and store
the data?
* Politeness when hitting servers -
schedule delay when on the same
domain
* Queueing with Redis
- One process to push it in a queue
each time we find a new link
- A different process pops the
queue to get the content
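The producer/consumer split and the per-domain delay can be sketched as follows. This is an in-memory stand-in, not the Redis setup itself: with the redis-py client, `push` and `pop` would roughly map to list commands such as `lpush` and `brpop`. The class name `PoliteQueue`, the 2-second delay and the URLs are assumptions.

```python
import time
from collections import deque
from urllib.parse import urlparse


class PoliteQueue:
    """In-memory stand-in for the link queue, with a per-domain delay."""

    def __init__(self, delay=2.0):
        self.delay = delay          # assumed politeness delay in seconds
        self.pending = deque()
        self.next_allowed = {}      # domain -> earliest time of next request

    def push(self, url):
        """Producer side: called each time a new link is found."""
        self.pending.append(url)

    def pop(self, now=None):
        """Consumer side: return a URL whose domain may be hit now, else None."""
        now = time.monotonic() if now is None else now
        for _ in range(len(self.pending)):
            url = self.pending.popleft()
            domain = urlparse(url).netloc
            if self.next_allowed.get(domain, float("-inf")) <= now:
                self.next_allowed[domain] = now + self.delay
                return url
            self.pending.append(url)  # too early for this domain, rotate back
        return None


queue = PoliteQueue(delay=2.0)
queue.push("https://a.example.com/1")
queue.push("https://a.example.com/2")
queue.push("https://b.example.com/1")

first = queue.pop(now=0.0)   # first link on domain a
second = queue.pop(now=0.5)  # domain a is still cooling down, so domain b
third = queue.pop(now=1.0)   # nothing may be fetched yet
fourth = queue.pop(now=2.0)  # domain a is allowed again
print(first, second, third, fourth)
```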
How to retrieve and store
the data?
* Scaling with Redis
- Redis Cluster
- multiple servers to get the
content
How to retrieve and store
the data?
* Store with MongoDB
- Document database for documents
- Require the flexibility of
document database
- Save all the extracted
information inside MongoDB
- Sharded Cluster
How to clean the data?
* HTML ==> structured information
{
"title": <news title>,
"content": <content of news>,
"date": <published date>
}
How to clean the data?
* Alternative 1:
- BeautifulSoup
- disadvantage: manual
- advantage: precise
* Alternative 2:
- readability-lxml
- date and title extraction for
free!
- disadvantage: error prone
- advantage: fully automated
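To make the HTML-to-structured-information step concrete, here is a small sketch that uses only Python's standard-library `html.parser` as a stand-in for BeautifulSoup or readability-lxml. It keeps the page title and paragraph text and drops navigation markup; date extraction is left out. The class `ArticleParser`, the function `extract` and the sample page are hypothetical.

```python
from html.parser import HTMLParser


class ArticleParser(HTMLParser):
    """Collect the <title> text and the text of every <p> element."""

    def __init__(self):
        super().__init__()
        self.current = None     # tag whose text is currently being collected
        self.title = ""
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.current = "title"
        elif tag == "p":
            self.current = "p"
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == self.current:
            self.current = None

    def handle_data(self, data):
        if self.current == "title":
            self.title += data
        elif self.current == "p":
            self.paragraphs[-1] += data


def extract(html):
    """Turn raw HTML into a dict with the article title and content."""
    parser = ArticleParser()
    parser.feed(html)
    content = "\n".join(p.strip() for p in parser.paragraphs if p.strip())
    return {"title": parser.title.strip(), "content": content}


SAMPLE_HTML = """<html><head><title>Flood hits town</title></head>
<body><div class="nav"><a href="/">Home</a></div>
<p>Heavy rain fell on <a href="/tag/monday">Monday</a>.</p>
<p>Roads were closed.</p></body></html>"""

article = extract(SAMPLE_HTML)
print(article)
```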
How to clean the data?
* For data coming from RSS and
Twitter feeds:
- Cross-validate with metadata
What can be extracted from
the data?
* Language detection:
- pycld2
* Named Entity Recognition:
- spaCy
- Polyglot
* Topic Modelling:
- Gensim
Computing what is trending.
* Extract named entities and rank
them by their tf-idf score.
* Named entity recognition:
- extract names, places, etc.
* tf-idf
- A fancier way of counting the
frequency of words
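One possible reading of the ranking step, sketched in plain Python: entities that appear often today but rarely in earlier documents score highest. The exact tf-idf variant used in the project is not stated, so the smoothing here, the function name `rank_by_tfidf` and the toy data are assumptions.

```python
import math
from collections import Counter


def rank_by_tfidf(today_entities, background_docs):
    """Rank today's entities by term frequency times inverse document frequency."""
    tf = Counter(today_entities)
    n_docs = len(background_docs) + 1  # background documents plus today's
    scores = {}
    for entity, count in tf.items():
        # Document frequency: today's document plus every earlier one that
        # mentions the entity.
        df = 1 + sum(1 for doc in background_docs if entity in doc)
        scores[entity] = count * math.log(n_docs / df)
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


# Toy data: entities extracted from today's articles and from three
# earlier documents (each represented as a set of entities).
today = ["Kuala Lumpur", "Bundestag", "Bundestag", "Berlin",
         "Kuala Lumpur", "Bundestag"]
background = [
    {"Kuala Lumpur", "Berlin"},
    {"Kuala Lumpur"},
    {"Kuala Lumpur", "Berlin"},
]

ranking = rank_by_tfidf(today, background)
for entity, score in ranking:
    print(f"{entity}: {score:.3f}")
```

An entity that shows up in every document, like "Kuala Lumpur" in this toy data, gets a score of zero however often it is mentioned today.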
Querying and similarity
* Querying:
- Elasticsearch for full-text
search
* Similarity lookup:
- Run word2vec on entire corpus
- Filter dictionary to only contain
named entities
- Get nearest neighbours
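The last two steps of the similarity lookup can be sketched without any library: restrict the vocabulary to named entities, then return the closest vectors by cosine similarity. The 3-dimensional vectors below are made up for illustration; real word2vec vectors (for example from Gensim) have far more dimensions, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbours(query, vectors, entities, k=2):
    """Return the k named entities whose vectors are closest to the query's."""
    candidates = [
        (name, cosine(vectors[query], vec))
        for name, vec in vectors.items()
        if name in entities and name != query
    ]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in candidates[:k]]


# Toy word vectors; "said" is an ordinary word, not a named entity.
vectors = {
    "berlin": [0.9, 0.1, 0.0],
    "munich": [0.8, 0.2, 0.1],
    "penang": [0.1, 0.9, 0.1],
    "said": [0.5, 0.5, 0.5],
    "kuala_lumpur": [0.2, 0.8, 0.0],
}
entities = {"berlin", "munich", "penang", "kuala_lumpur"}

result = nearest_neighbours("berlin", vectors, entities, k=2)
print(result)
```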
Use case: Automated
timeline creation
* Web application that consumes the
data through a REST API
* www.kronologimalaysia.com
* www.diezeitachse.de
Questions
othman.amir@gmail.com
