Welcome, my name is Nils Kübler and I work for MeMo News in Kreuzlingen. This is my first English presentation, so it may not be entirely fluent; please excuse that. Today I am going to introduce our first Hadoop project, the Media Monitoring Platform at MeMo News. In this presentation I will first show you what media monitoring is about and how online and social media monitoring work. Second, I will cover some of the core features of the MeMo News platform, because they are important for understanding some of our design decisions. In the third part I will come to the basics of our architecture, introduce our technology stack, and explain how our software uses it. Finally, I will give you an overview of our little cluster and wrap up by showing how our software currently scales.
Media monitoring means monitoring the output of print, online, and broadcast media. It helps customers validate the results of marketing campaigns or react early to changes in public opinion. I will explain the classic and online types of media monitoring: the classical approach is to systematically record radio or television broadcasts, and to collect clippings from print media and scan them for keywords.
In the online world, there are two common forms. One is online media monitoring, where online sources such as news portals or web blogs are monitored.
The other form is social media monitoring, which primarily monitors social networks such as Twitter, Facebook, or even forums. MeMo News is both an online media monitoring and a social media monitoring tool. Q: Does anybody have an idea what type of software could be used for such a thing? A: By downloading the internet :) Q: And how is that done? A: With a web crawler.
So what's a web crawler? A web crawler downloads content from the web, extracts data from the downloaded content, and persists it somewhere, most likely in a search index to make it searchable. A web crawler has to balance the quality of the downloaded content against its freshness. We also distinguish between different aspects of quality. For example, an archive crawler tries to copy objects from the web as faithfully as possible; this means it has good representational quality. A research crawler, on the other hand, gives weight to data that is relevant to the user; this means it has good intrinsic quality. The crawler type that fits the online monitoring task best is the newsagent. A newsagent primarily focuses on having very fresh information; this is also sometimes called "near-realtime" search. Before we go deeper into the details of our newsagent, I will explain the most important use cases of the MeMo News platform, because they are important for understanding our architecture.
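To make the freshness idea concrete, here is a minimal sketch (not our production code, purely illustrative) of how a newsagent-style crawler could decide which source to visit next: each source gets a revisit interval, and the crawler always fetches the source that is due first, so fast-changing sources are visited more often.

```python
import heapq

def schedule(sources, rounds=2):
    """Return source names in the order a freshness-driven crawler would visit them.

    `sources` maps a source name to its revisit interval in seconds.
    Each heap entry is (next_due_time, name, interval); the heap always
    yields the source whose next visit is due earliest.
    """
    heap = [(0, name, interval) for name, interval in sources.items()]
    heapq.heapify(heap)
    order = []
    for _ in range(len(sources) * rounds):
        due, name, interval = heapq.heappop(heap)
        order.append(name)
        # Re-schedule the source for its next visit.
        heapq.heappush(heap, (due + interval, name, interval))
    return order

# A news portal checked every minute gets visited far more often
# than a blog checked every five minutes.
visits = schedule({"news-portal": 60, "blog": 300})
```

The real scheduler of course computes the intervals from observed update rates rather than using fixed values; this sketch only shows the priority-by-due-time principle.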
Monitoring with MeMo News works by creating so-called search agents. The user creates the agents in the platform; each agent consists of a name and a query. The queries are then used to generate mail reports, or realtime alerts when matching articles are found. Additionally, the user can log in to the web platform, where he can navigate through his agents, look at the most recent results, and also move results into different archives, which never get deleted. Because of this, we can never delete anything from our search index, which means it grows and grows.
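Conceptually, an agent boils down to very little. The following sketch is a deliberate simplification (our real queries are full search-engine queries, not single keywords), but it shows the essence of the name-plus-query structure:

```python
from dataclasses import dataclass

@dataclass
class Agent:
    """A search agent: just a name and a query.

    Simplified for illustration: the query here is a single keyword,
    whereas real agents use full search queries.
    """
    name: str
    query: str

def matches(agent, article_text):
    # Naive substring match stands in for real query evaluation.
    return agent.query.lower() in article_text.lower()

hug = Agent(name="Swiss Hadoop User Group", query="hadoop")
```

Every new article that makes `matches` return true for an agent would trigger an alert or show up in that agent's mail report.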
Here you can see the mask for editing an example agent that monitors articles mentioning the Swiss Hadoop User Group. Let's take a deeper look into the architecture of a search engine.
A search engine consists of two parts: the online part and the offline part. The online part is responsible for the user's interaction with our system. The offline part is everything that happens under the hood: retrieving content from the web, extracting the important parts from it, and storing them in our index. In the following we focus primarily on the offline part, which runs on the Hadoop platform at MeMo News.
Our technology stack consists of five main parts: data storage is provided by HDFS, and coordination is done via ZooKeeper. HBase extends HDFS by providing random data access, and MapReduce allows us to process both HDFS and HBase data asynchronously. The most important component, though, is Solr, which isn't part of the core Hadoop platform. We won't go deeper into these technologies here, but I would like to ask whether there is interest in any of these topics for future presentations.
Our newsagent directly uses all of these parts: - We make heavy use of ZooKeeper to coordinate our downloaders. - We persist everything we download in HBase. - This data also gets indexed into Solr. - We also make heavy use of MapReduce for job scheduling and for asynchronous analysis tasks, such as priority calculation for downloads. OK, then let's take a look into our newsagent.
About every 3 minutes, we run a full scan of all our known sources via MapReduce and check whether they need to be downloaded. If so, a job is created on ZooKeeper; this is what we call scheduling. We have a distributed application called the http-loader, which is responsible for downloading sources. Every http-loader knows exactly which jobs it is responsible for and executes them as soon as possible. When a source has been downloaded, the new articles are stored in HBase. On each update in HBase the Lily RowLog is triggered, which starts an update of the search index and also performs a so-called prospective search to send realtime alerts to users. During the prospective search we check every new article for matches against all of the existing agents. We used the Lily RowLog as the trigger mechanism because the previous version of HBase did not provide coprocessors. Coprocessors are like triggers in an RDBMS; they make it possible to execute custom code as soon as data in HBase changes. In a future version we will replace the RowLog; we already have an experimental implementation for that. Is there any interest in prospective search, coprocessors, or the Lily RowLog for a future presentation? OK, then let's take a short look at our current cluster setup.
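The prospective-search step inverts the usual direction of search: instead of running a user's query against an index of documents, each new article is checked against the full set of stored agent queries. A minimal sketch, again with keyword matching standing in for real query evaluation and with invented names:

```python
def prospective_search(article_text, agents):
    """Return the names of all agents whose query matches a new article.

    `agents` maps an agent name to its (simplified, keyword-style) query.
    In the real system this runs for every row written to HBase, and each
    returned agent name results in a realtime alert to its owner.
    """
    text = article_text.lower()
    return [name for name, query in agents.items() if query.lower() in text]

agents = {"Hadoop watch": "hadoop", "HBase watch": "hbase"}
alerts = prospective_search("New HBase release announced", agents)
```

With many agents, a naive loop like this becomes too slow, which is one reason the topic is worth its own presentation.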
I have separated the setup into two slides; these here are our masters. We have two virtual master hosts, where each master service runs in its own virtual machine. There is always only one VM active for each service; the other one only gets activated when the active VM goes down. So even when one virtual master host fails, the other one is able to take over all master services. This gives us pretty good reliability. The Solr server needs to run on a standalone machine because it requires so many resources. Here we follow the same pattern: when the active server fails, the other one takes over the work. ... pause ... Currently one machine is enough to hold our Solr shards, but we will soon need to distribute them across multiple machines, because our index grows and grows.
Our "unit of scale" is the worker. Each worker has the typical Hadoop services installed: Datanode, Tasktracker, and Regionserver. Currently three of the workers also provide our ZooKeeper quorum, which we may migrate to another place at some point. We also run two processes of our own on each worker: the http-loader and the index-updater, which I already introduced on a previous slide. ... pause ... So, we have already reached the last slide, where we can see how our newsagent currently scales.
As you can see, with 1 worker we download around 15 articles per second. With 2 workers, this number roughly doubles. The scaling seems to flatten with 3 or 4 workers, but this is only a single measurement. We are positive that we can handle that, even though this graphic may suggest that we will hit a scalability problem soon. And we will definitely need to scale our system further, as we want to crawl more sources. Another reason why we will need many more workers is that our analysis load will grow and grow as we introduce more and more analytic tasks, such as sentiment analysis. ... pause ... So that's it. Thank you for your attention ...
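One way to read such a scaling plot is parallel efficiency: measured throughput divided by the ideal linear throughput. The single-worker rate (~15 articles/s) and the "around doubled" 2-worker rate are from the measurements above; anything well below 1.0 on larger clusters would confirm the flattening the graphic hints at.

```python
def parallel_efficiency(throughput, workers, base_throughput):
    """Measured throughput as a fraction of ideal linear scaling.

    1.0 means perfectly linear; values below 1.0 indicate overhead
    (coordination, contention, ...) eating into the added workers.
    """
    return throughput / (workers * base_throughput)

# 2 workers at ~30 articles/s vs. 1 worker at ~15 articles/s:
e2 = parallel_efficiency(30.0, workers=2, base_throughput=15.0)
```

This is only the arithmetic behind the graph, not a prediction; whether efficiency holds up at 10+ workers is exactly what we still have to measure.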
... any questions?