2. About myself.
* Software engineer @ Instance
* Studied at Bauhaus-Universität
Weimar and Hochschule Ulm
* Love my wife, building cool pieces
of software and making music
* http://www.instance.com.sg
* http://www.amirmeludah.com
3. About this project.
* Initially intended to be a part of
a thesis project
* Grew into a fun side project
* Fulfilling a weird obsession with
web scraping
4. What is data mining news
articles?
"Data mining is the computing process of
discovering patterns in large data sets
involving methods at the intersection of
machine learning, statistics, and database
systems."
source:
https://en.wikipedia.org/wiki/Data_mining
5. What is data mining news
articles?
"Data mining, the science of extracting
useful knowledge from such huge data
repositories, has emerged as a young and
interdisciplinary field in computer science."
source:
http://www.kdd.org/curriculum/index.html
6. What is data mining news
articles?
"Collecting as much relevant data as possible
in the hope of gaining insights."
- me
7. Collecting what?
* News articles:
* German news articles
- Regional and national
* Malaysian news articles
- ALL OF THEM!
8. Why collect this data?
* Building a corpus as raw material
for testing NLP findings
* Piece of digital history
* News organizations go missing -
Wayback Machine not practical
* Cross-validating news sources
9. How to collect links to
news articles?
* As starting point before expanding
* News aggregators :)
* Search engine
* Curated news from news portals
* Result: Links pointing to news
websites
10. How to get even more links?
* Related articles – news aggregators
* Tweets from journalists and news
organizations
11. What about upcoming news?
* We collected a bunch of static
links
* News needs to be fresh and young
12. What about upcoming news?
* Information retrieval
* age
* freshness
* Effective Web Crawling: PhD Thesis
by Carlos Castillo
14. What about upcoming news?
* News will (almost) always have RSS
feeds
* Slowly being replaced by Twitter
feeds.
* Advantage
- Subscription instead of frontiers
- Convenient way to get recent news
articles
- Structured
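The "structured" advantage of RSS can be illustrated with a minimal sketch that reads the title, link and publication date of each item using only the Python standard library. The sample feed, its URLs and the function name `parse_rss_items` are made up for illustration; a real crawler would download the feed from the news site and would need to handle Atom feeds and malformed XML as well.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed (assumption); a real crawler would download this.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item>
      <title>First headline</title>
      <link>https://news.example.com/a1</link>
      <pubDate>Mon, 02 Oct 2017 08:00:00 +0000</pubDate>
    </item>
    <item>
      <title>Second headline</title>
      <link>https://news.example.com/a2</link>
      <pubDate>Mon, 02 Oct 2017 09:30:00 +0000</pubDate>
    </item>
  </channel>
</rss>"""


def parse_rss_items(xml_text):
    """Return a list of {title, link, published} dicts from an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default="").strip(),
            "link": item.findtext("link", default="").strip(),
            "published": item.findtext("pubDate", default="").strip(),
        })
    return items


if __name__ == "__main__":
    for entry in parse_rss_items(SAMPLE_RSS):
        print(entry["published"], entry["link"])
```

Each returned link can then be handed to the same fetching step used for links found by other means.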
15. What about old news?
* Identify the next/other/more links
* Machine learning approach
* Text classification task:
- Is this link with this text a
next/more/other link?
- Train with labeled data - 400
sites from different news
websites
16. What about old news?
* Text classification task:
- Is this link with this text a
next/more/other link?
- Train with labeled data - 400
sites from different news
websites
- FastText
- On one iteration:
from 5443 articles to 349111
articles
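The classification task above (is this link text a next/more/other link?) can be sketched with a tiny classifier. Note that this is not FastText: to keep the example dependency-free it swaps in a bag-of-words naive Bayes model with add-one smoothing. The label names, the training examples and the class name are invented for illustration and are far smaller than the 400 labeled sites mentioned on the slide.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical labeled anchor texts (assumption, not the project's data).
TRAIN = [
    ("next page", "pagination"),
    ("more articles", "pagination"),
    ("older posts", "pagination"),
    ("weitere artikel", "pagination"),
    ("next", "pagination"),
    ("contact us", "other"),
    ("privacy policy", "other"),
    ("subscribe to newsletter", "other"),
    ("impressum", "other"),
    ("sports", "other"),
]


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-zäöüß0-9]+", text.lower())


class NaiveBayesLinkClassifier:
    """Multinomial naive Bayes over anchor-text tokens (stand-in for FastText)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)
        self.label_counts = Counter()
        self.vocab = set()

    def train(self, examples):
        for text, label in examples:
            self.label_counts[label] += 1
            for tok in tokenize(text):
                self.word_counts[label][tok] += 1
                self.vocab.add(tok)

    def predict(self, text):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.label_counts.items():
            # Log prior plus add-one smoothed log likelihood of each token.
            score = math.log(n / total)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for tok in tokenize(text):
                score += math.log((self.word_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


if __name__ == "__main__":
    clf = NaiveBayesLinkClassifier()
    clf.train(TRAIN)
    print(clf.predict("more news"))
```

Links predicted as pagination links would be followed to discover older articles; everything else is ignored.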
17. How To Verify?
* Similarity – above similarity
threshold
* Put through information extraction
pipeline.
* Second layer of sanity check:
- randomly pick link and inspect.
18. What do we have so far?
* Links pointing to old articles
* RSS and Twitter feeds for links
pointing to new articles
19. How to retrieve and store
the data?
* Politeness when hitting servers -
schedule delay when on the same
domain
* Queueing with Redis
- One process to push it in a queue
each time we find a new link
- A different process pops the
queue to get the content
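The politeness rule and the push/pop queue can be sketched as follows. This is an assumption-laden illustration, not the project's code: an in-memory `deque` stands in for the Redis queue, the function name `schedule_fetches` and the 5-second delay are made up, and the function only computes when each URL may be fetched instead of actually downloading anything.

```python
from collections import deque
from urllib.parse import urlparse


def schedule_fetches(urls, delay_seconds=5.0, start_time=0.0):
    """Assign each URL a fetch time so that requests to the same domain
    are at least delay_seconds apart. Returns (fetch_time, url) pairs
    sorted by fetch time."""
    next_allowed = {}        # domain -> earliest time of the next request
    queue = deque(urls)      # stand-in for the Redis queue (assumption)
    schedule = []
    while queue:
        url = queue.popleft()
        domain = urlparse(url).netloc
        fetch_at = max(start_time, next_allowed.get(domain, start_time))
        next_allowed[domain] = fetch_at + delay_seconds
        schedule.append((fetch_at, url))
    schedule.sort(key=lambda pair: pair[0])
    return schedule


if __name__ == "__main__":
    found_links = [
        "https://a.example.com/1",
        "https://a.example.com/2",
        "https://b.example.com/1",
    ]
    for fetch_at, url in schedule_fetches(found_links):
        print(fetch_at, url)
```

Requests to different domains are not delayed against each other, so the fetcher stays busy while still spacing out hits on any single server.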
20. How to retrieve and store
the data?
* Scaling with Redis
- Redis Cluster
- multiple servers to get the
content
21. How to retrieve and store
the data?
* Store with MongoDB
- Document database for documents
- Require the flexibility of
document database
- Save all the extracted
information inside MongoDB
- Sharded Cluster
22. How to clean the data?
* HTML ==> structured information
{
"title": <news title>,
"content": <content of news>,
"date": <published date>
}
23. How to clean the data?
* Alternative 1:
- BeautifulSoup
- disadvantage: manual
- advantage: precise
* Alternative 2:
- readability-lxml
- date and title extraction for
free!
- disadvantage: error prone
- advantage: fully automated
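To make the HTML-to-structured-information step concrete, here is a minimal sketch that uses only the standard library's `html.parser` as a stand-in for BeautifulSoup or readability-lxml. It assumes, for illustration only, that the title sits in `<title>` and the article body in `<p>` elements; the sample page and the names `ArticleExtractor` and `extract_article` are invented. The date field is left out because where it appears differs from site to site.

```python
from html.parser import HTMLParser

# Hypothetical sample page (assumption).
SAMPLE_HTML = """<html><head><title>Flood hits town</title></head>
<body><nav><a href="/">Home</a></nav>
<p>Heavy rain caused flooding on Monday.</p>
<p>Roads were <b>closed</b> for hours.</p>
</body></html>"""


class ArticleExtractor(HTMLParser):
    """Collect the <title> text and the text of all <p> elements."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._in_paragraph = False
        self.title_parts = []
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "p":
            self._in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "p":
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)
        elif self._in_paragraph:
            # Text inside nested inline tags such as <b> is kept as well.
            self.paragraphs[-1] += data


def extract_article(html):
    """Return {"title": ..., "content": ...} extracted from an HTML string."""
    parser = ArticleExtractor()
    parser.feed(html)
    parser.close()
    paragraphs = [p.strip() for p in parser.paragraphs if p.strip()]
    return {
        "title": "".join(parser.title_parts).strip(),
        "content": "\n".join(paragraphs),
    }


if __name__ == "__main__":
    print(extract_article(SAMPLE_HTML))
```

A rule this generic sits closer to the automated alternative: it needs no per-site work, but it will pick up stray paragraphs on pages that use `<p>` outside the article body.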
25. How to clean the data?
* For data coming from RSS and
Twitter feeds:
- Cross-validate with metadata
26. What can be extracted from
the data?
* Language detection:
- pycld2
* Named Entity Recognition:
- spaCy
- Polyglot
* Topic Modelling:
- Gensim
27. Computing what is trending.
* Extract named entities and rank
them by their tf-idf score.
* Named entity recognition:
- extract names, places, etc.
* tf-idf
- A fancier way of counting the
frequency of words
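The ranking idea can be sketched in a few lines. The slides do not say which tf-idf variant is used, so this assumes one common form: term frequency normalised by document length, multiplied by log(N / document frequency). The entity lists are invented and stand for the output of a named entity recogniser run on three articles.

```python
import math
from collections import Counter

# Hypothetical named entities per article (assumption).
DOCS = [
    ["Kuala Lumpur", "Penang", "Penang", "Petronas"],
    ["Kuala Lumpur", "Berlin"],
    ["Kuala Lumpur", "Weimar", "Berlin"],
]


def tfidf_ranking(documents, target_index):
    """Rank the terms of documents[target_index] by tf-idf, highest first."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))   # count each term once per document
    target = documents[target_index]
    term_freq = Counter(target)
    scores = {}
    for term, count in term_freq.items():
        tf = count / len(target)
        idf = math.log(n_docs / doc_freq[term])
        scores[term] = tf * idf
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    for entity, score in tfidf_ranking(DOCS, 0):
        print(f"{entity}: {score:.3f}")
```

An entity that appears in every article gets a score of zero, which is what makes tf-idf more useful for spotting trends than a plain frequency count.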
28. Querying and similarity
* Querying:
- Elasticsearch for full-text
search
* Similarity lookup:
- Run word2vec on entire corpus
- Filter dictionary to only contain
named entities
- Get nearest neighbours
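The similarity lookup can be sketched as a cosine-similarity search over word vectors restricted to named entities. The three-dimensional vectors below are made-up toy values, not real word2vec output, and the function names are hypothetical; in practice the vectors would come from a model trained on the whole corpus (for example with Gensim).

```python
import math

# Toy vectors (assumption); real word2vec vectors have hundreds of dimensions.
VECTORS = {
    "Berlin": [0.9, 0.1, 0.0],
    "Weimar": [0.8, 0.2, 0.1],
    "Ulm": [0.7, 0.1, 0.3],
    "Penang": [0.1, 0.9, 0.2],
    "the": [0.3, 0.3, 0.3],
}
# Only these dictionary entries are named entities (assumption).
ENTITIES = {"Berlin", "Weimar", "Ulm", "Penang"}


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_neighbours(query, vectors, entities, k=2):
    """Return the k named entities most similar to query, best first."""
    candidates = [
        (word, cosine_similarity(vectors[query], vec))
        for word, vec in vectors.items()
        if word in entities and word != query
    ]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates[:k]


if __name__ == "__main__":
    for word, score in nearest_neighbours("Berlin", VECTORS, ENTITIES):
        print(f"{word}: {score:.3f}")
```

Filtering to named entities before the search keeps common words such as "the" out of the results, even when their vectors happen to lie close to the query.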
29. Use case: Automated
timeline creation
* Web application that consumes the
data through a REST API
* www.kronologimalaysia.com
* www.diezeitachse.de