Data Mining News Articles from Multiple Sources

Data Mining News Articles
Amir Othman

About myself.
* Software engineer @ Instance
* Education from Bauhaus Universität
Weimar and Hochschule Ulm
* Love my wife, building cool pieces
of software and making music
* http://www.instance.com.sg
* http://www.amirmeludah.com

About this project.
* Initially intended to be a part of
a thesis project
* Grew into a fun side project
* Fulfilling a weird obsession about
web scraping

What is data mining of news
articles?
"Data mining is the computing process of
discovering patterns in large data sets
involving methods at the intersection of
machine learning, statistics, and database
systems."
source:
https://en.wikipedia.org/wiki/Data_mining

"Data mining, the science of extracting
useful knowledge from such huge data
repositories, has emerged as a young and
interdisciplinary field in computer science."
source:
http://www.kdd.org/curriculum/index.html

"Collecting as much relevant data as possible
in the hope of gaining insights."
- me

Collecting what?
* News articles:
* German news articles
- Regional and national
* Malaysian news articles
- ALL OF THEM!

Why collect this data?
* Building a corpus as raw material
to test out NLP findings
* Piece of digital history
* News organizations go missing -
Wayback Machine not practical
* Cross-validating news sources

How to collect links to
news articles?
* As starting point before expanding
* News aggregators :)
* Search engine
* Curated news from news portals
* Result: Links pointing to news
websites

How to get even more links?
* Related articles – news aggregators
* Tweets from journalists and news
organizations

What about upcoming news?
* We collected a bunch of static
links
* News needs to be fresh and young

* Information retrieval
* age
* freshness
* Effective Web Crawling: PhD Thesis
by Carlos Castillo

* News will (almost) always have RSS
feeds
* Slowly being replaced by Twitter
feeds.
* Advantage
- Subscription instead of frontiers
- Convenient way to get recent news
articles
- Structured

What about old news?
* Identify the next/other/more links
* Machine learning approach
* Text classification task:
- Is this link with this text a
next/more/other link?
- Train with labeled data - 400
sites from different news
websites
- FastText
- On one iteration:
from 5443 articles to 349111
articles
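
As a rough sketch of how the labeled links could be prepared for FastText: its supervised mode reads one example per line, with the label prefixed by `__label__`. The label names `next` and `other`, the tokenization, and the function name below are assumptions for illustration, not the features actually used in the project.

```python
import re


def to_fasttext_line(anchor_text, href, is_pagination):
    """Turn one labeled link into a line in fastText's supervised format."""
    # Hypothetical label names; the slides do not say which were used.
    label = "__label__next" if is_pagination else "__label__other"
    # Lowercase and split both the anchor text and the URL into word tokens.
    tokens = re.findall(r"\w+", f"{anchor_text} {href}".lower())
    return f"{label} {' '.join(tokens)}"


# Hypothetical labeled samples: (anchor text, href, is it a next/more/other link?)
samples = [
    ("Older posts »", "/news/page/2", True),
    ("Weitere Artikel", "/archiv?seite=3", True),
    ("Contact us", "/contact", False),
]
lines = [to_fasttext_line(*sample) for sample in samples]
```

Written to a text file, such lines could then be passed to `fasttext.train_supervised(input=...)` from the fastText Python bindings.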

How To Verify?
* Similarity – above similarity
threshold
* Put through information extraction
pipeline.
* Second layer of sanity check:
- randomly pick link and inspect.

What do we have so far?
* Links pointing to old articles
* RSS and Twitter feeds for links
pointing to new articles

How to retrieve and store
the data?
* Politeness when hitting servers -
schedule a delay between requests
to the same domain
* Queueing with Redis
- One process to push it in a queue
each time we find a new link
- A different process pops the
queue to get the content
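
The producer/consumer split and the per-domain delay can be sketched together as below. This is a simplified illustration, not the project's code: the key name `links:pending`, the 5 second delay and both function names are assumptions. `client` is expected to offer `lpush` and `rpop` the way a redis-py client does; the usage note shows a tiny in-memory stand-in so the sketch runs without a Redis server.

```python
import time
from urllib.parse import urlparse

QUEUE_KEY = "links:pending"   # hypothetical key name
DELAY_SECONDS = 5.0           # hypothetical politeness delay per domain


def push_link(client, url):
    """Producer: enqueue a newly discovered link."""
    client.lpush(QUEUE_KEY, url)


def pop_link(client, last_hit, now=None):
    """Consumer: dequeue one link, or return None if nothing can be fetched yet.

    `last_hit` maps a domain to the time of the last request sent to it.
    """
    url = client.rpop(QUEUE_KEY)
    if url is None:
        return None
    if isinstance(url, bytes):          # redis-py returns bytes by default
        url = url.decode("utf-8")
    now = time.monotonic() if now is None else now
    domain = urlparse(url).netloc
    if now - last_hit.get(domain, float("-inf")) < DELAY_SECONDS:
        client.lpush(QUEUE_KEY, url)    # too soon: send it to the back of the line
        return None
    last_hit[domain] = now
    return url


class InMemoryClient:
    """Stand-in for a Redis client, for trying the sketch without a server."""

    def __init__(self):
        self.items = []

    def lpush(self, key, value):
        self.items.insert(0, value)

    def rpop(self, key):
        return self.items.pop() if self.items else None
```

With `lpush` on one end and `rpop` on the other, the list behaves as a first-in, first-out queue. A link whose domain was contacted too recently simply goes back into the queue and is retried later.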

* Scaling with Redis
- Redis Cluster
- multiple servers to get the
content

* Store with MongoDB
- Document database for documents
- Require the flexibility of
document database
- Save all the extracted
information inside MongoDB
- Sharded Cluster

How to clean the data?
* HTML ==> structured information
{
"title": <news title>,
"content": <content of news>,
"date": <published date>
}

* Alternative 1:
- BeautifulSoup
- disadvantage: manual
- advantage: precise
* Alternative 2:
- readability-lxml
- date and title extraction for
free!
- disadvantage: error prone
- advantage: fully automated
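
To make the HTML to structured information step concrete, here is a minimal sketch of the manual approach. It uses the standard library's `html.parser` as a stand-in for BeautifulSoup so that it runs without extra packages. It assumes the page exposes its date in an `article:published_time` meta tag and keeps its text in `<p>` elements, which will not hold for every site; the class and function names are made up.

```python
from html.parser import HTMLParser


class ArticleParser(HTMLParser):
    """Collects the <title>, all <p> text and the published-time meta tag."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.date = None
        self.paragraphs = []
        self._current = None   # "title" or "p" while inside that element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._current = "title"
        elif tag == "p":
            self._current = "p"
            self.paragraphs.append("")
        elif tag == "meta" and attrs.get("property") == "article:published_time":
            self.date = attrs.get("content")

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == "title":
            self.title += data
        elif self._current == "p":
            self.paragraphs[-1] += data


def extract(html):
    """Return the {"title", "content", "date"} record for one page."""
    parser = ArticleParser()
    parser.feed(html)
    parser.close()
    content = "\n".join(p.strip() for p in parser.paragraphs if p.strip())
    return {"title": parser.title.strip(), "content": content, "date": parser.date}


# Hypothetical page used only to exercise the sketch.
SAMPLE_HTML = """<html><head><title>Example headline</title>
<meta property="article:published_time" content="2018-08-06T10:00:00+00:00">
</head><body><p>First <a href="#">paragraph</a>.</p>
<p>Second paragraph.</p></body></html>"""

record = extract(SAMPLE_HTML)
```

This shows the trade-off named above: the rules are precise for pages that match them, but they have to be written and maintained by hand for each site layout.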

* For data coming from RSS and
Twitter feeds:
- Cross-validate with metadata

What can be extracted from
the data?
* Language detection:
- pycld2
* Named Entity Recognition:
- spaCy
- Polyglot
* Topic Modelling:
- Gensim

Computing what is trending.
* Extract named entities and rank
them by their tf-idf score.
* Named entity recognition:
- extract names, places, etc.
* tf-idf
- A fancier way of counting the
frequency of words
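
The ranking idea can be sketched with plain Python, assuming named entities have already been extracted into one list per article. The toy corpus and the function name are invented for illustration, and the formula is one common tf-idf variant (relative term frequency times the log of inverse document frequency); the project may weight things differently.

```python
import math
from collections import Counter


def rank_entities(doc, corpus):
    """Rank the entities of one document by tf-idf against the whole corpus.

    `doc` is a list of entity strings and is assumed to be part of `corpus`.
    """
    counts = Counter(doc)
    n_docs = len(corpus)
    scores = {}
    for entity, count in counts.items():
        tf = count / len(doc)                               # how often it occurs here
        df = max(1, sum(1 for other in corpus if entity in other))
        idf = math.log(n_docs / df)                         # rarer across corpus => higher
        scores[entity] = tf * idf
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical entity lists, one per article.
corpus = [
    ["Merkel", "Berlin", "Merkel"],
    ["Berlin", "Najib"],
    ["Berlin", "Kuala Lumpur"],
]
ranked = rank_entities(corpus[0], corpus)
```

An entity that appears in every article, such as "Berlin" in this toy corpus, gets a score of zero, while an entity concentrated in a few articles rises to the top. That is what makes tf-idf more useful than raw counts for spotting trends.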

Querying and similarity
* Querying:
- Elasticsearch for full text
search
* Similarity lookup:
- Run word2vec on entire corpus
- Filter dictionary to only contain
named entities
- Get nearest neighbours
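
The last two steps of the similarity lookup can be illustrated as follows, assuming word vectors have already been trained (for example with word2vec). The two-dimensional vectors here are toy values, not real embeddings, and the function names are made up; cosine similarity is used as the distance measure, which is the usual choice for word vectors.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_entities(query, vectors, entities, k=5):
    """Return up to k named entities whose vectors are closest to the query's."""
    # Filter the vocabulary so that only named entities remain as candidates.
    candidates = {w: v for w, v in vectors.items() if w in entities and w != query}
    ranked = sorted(
        candidates,
        key=lambda w: cosine(vectors[query], candidates[w]),
        reverse=True,
    )
    return ranked[:k]


# Toy vectors standing in for trained word embeddings.
vectors = {
    "Merkel": [0.9, 0.1],
    "Berlin": [0.8, 0.3],
    "Najib": [0.1, 0.9],
    "the": [0.5, 0.5],
}
entities = {"Merkel", "Berlin", "Najib"}
neighbours = nearest_entities("Merkel", vectors, entities, k=2)
```

Filtering before ranking keeps frequent non-entity words such as "the" out of the results, even when their vectors happen to lie close to the query.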

Use case: Automated
timeline creation
* Web application that consumes the
data through a REST API
* www.kronologimalaysia.com
* www.diezeitachse.de

Questions
othman.amir@gmail.com
