Monitoring social media and news media in general is kind of building a search engine - a specific search engine called news agent. Where classical search engines are about to have all kind of content, newsagent focus on news and social media only. This is where content freshness matters - it has to be near real time.
In order to process several million news with hundreds of Megabytes per day one has to choose a system architecture that is reliable on one hand and massive scalable at the other hand. Those requirements leads to a distributed architecture, a NoSQL approach.
This talk will give you a brief overview about the Media Monitoring use case, the system architecture based on Hadoop and HBase, challenges and lessons learned.
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Near Real Time Processing of Social Media Data with HBase
1. CC 2.0 by William Brawley | http://flic.kr/p/7PdUP3
2. August
31,
2012
• Why Hadoop and HBase? 2
• Social Media Monitoring
• Prospective Search and Coprocessors
• Challenges & Lessons Learned
• Resources to get started
Agenda
3. August
31,
2012
• Spin-off of MeMo News AG, the 3
leading provider for Social Media
Monitoring & Analytics in Switzerland
• Big Data expert, focused on Hadoop,
HBase and Solr
• Objective: Transforming data into
insights
About Sentric
5. August
31,
2012
5
Information Information Analysis & Insight
Gathering Processing Interpretation Presentation
Why Hadoop and HBase?
Social Media Monitoring Process
6. August
31,
2012
6
Cost
effective
High
Freshness
scalable
SMM
Reliable RT Alerting
Analytical
capabilities
Why Hadoop and HBase?
Requirements
7. August
31,
2012
• HDFS + MapReduce 7
• Based on Google Papers
• Distributed Storage and Computation
Framework
• Affordable Hardware, Free Software
• Significant Adoption
Why Hadoop and HBase?
Hadoop
8. August
31,
2012
• Non-Relational, Distributed Database 8
• Column-Oriented
• Multi-Dimensional
• High Availability
• High Performance
• Build on top of HDFS as storage layer
Why Hadoop and HBase?
HBase
10. CC 2.0 by nolifebeforecoffee | http://flic.kr/p/c1UTf
11. August
31,
2012
11
Downloaded Articles
match?
Search Agents
Output
Web-UI Reports RT Alerts
Icons by http://dryicons.com
Social Media Monitoring
Overview
12. August
31,
2012
12
n News Agents
REST
HBase
Coprocessor
Web-UI
MySQL Solr RT Alerts
Icons by http://dryicons.com
Social Media Monitoring
Solution Architecture
13. August
31,
2012
13
Processing
Put operations
Prospective
Search
HRegion RT Alerts
HRegionServer
Icons by http://dryicons.com
Social Media Monitoring
Prospective Search with Coprocessors
14. August
31,
2012
• Monthly growth 14
• Index: 200GB
• 50 Mio. docs/month
• HBase: 600 GB
• Raw data, meta data and extracted
data
• A few 1000 map-reduce jobs/
month
Social Media Monitoring
Key Figures
16. Augus
t 31,
2012
1 Benchmarks - workloads
2 Supervision
16
3 Keys and shards – Schema design /LG
4 Timestamps, the 4th dimension
5 Short ColumnFamily names->
6 File handles. OS
7 JVM Tuning, GC !!!
8 Scaling region servers, data locality!
9 Automatic vs manual splits, compaction
10 Do not use HBase as rock solid in prod
11 Forget feuerwehr aktionen, it takes some time
12 Use Hbase for a apropriate use case
13 Tune and tweak – it‘s not a project – it‘s a process
14 You need devops in production
15 Huge know-how curve, you need to know the hole ecosystem
16 Use a distribution, ist packed, tested and supports migration, enterprise grade
17 Virtualisierung, Hardware
18 Dont struggle to much, there is a good community
19 Share your knowledge
20 It‘s early state, many tools around, a few still missing
Challenges & Lessons Learned
17. August
31,
2012
• Everyone is still learning 17
• Some issues only appear at scale
• At scale, nothing works as advertised
• Production cluster configuration
• Hardware issues
• Tuning cluster configuration to our work
loads
• HBase stability
• Monitoring health of HBase
Challenges & Lessons Learned
Challenges
18. August
31,
2012
• Do not rely on HBase as frontend 18
storage layer. It’s not going to be rock
solid
• Don’t struggle to much, there is a
good community
• Share your knowledge
• It‘s early stage, many tools around, a
few still missing
Challenges & Lessons Learned
Lessons - General
19. August
31,
2012
• Use HBase for an appropriate use case 19
• Use a distribution, its packed, tested and
supports migration, enterprise grade
• Benchmarks – know your workloads &
query patterns
• YCSB
• Schema & Key Design
• What’s queried together should be stored
together
• Scaling region servers, data locality!
• Virtualization vs. Real Hardware
Challenges & Lessons Learned
Lessons - Planning
20. August
31,
2012
• Number of CF < 10 20
• Compaction + Flushing I/O intensive
• Short ColumnFamily names
• HFile index size occupying aloc RAM (storefileindexSize)
• OS file handles
• ulimit –n 32768
• JVM Tuning, GC !!!
• HMaster 1024 MB
• RegionServer 8192 MB
• -XX:+UseConcMarkSweepGC
• -XX:+CMSIncrementalMode
• Automatic vs. manual splits
• Be careful with expensive operations in coprocessors
• Play with all the configurations and benchmark for tuning
Challenges & Lessons Learned
Lessons - Performance Tuning
21. August
31,
2012
• Monitoring/Operational tooling is most 21
important
• Forget “emergency actions”, it takes
some time
• Tune and tweak – it‘s not a project – it‘s
a process
• You need DevOps in production
• Huge know-how curve, you need to
know the whole ecosystem
• Hadoop, HDFS, MapRed
Challenges & Lessons Learned
Lessons - Operation
22. August
31,
2012
• http://hbase.apache.org/book.html 22
• http://www.sentric.ch/blog/best-
practice-why-monitoring-hbase-is-
important
• http://www.sentric.ch/blog/hadoop-
overview-of-top-3-distributions
• http://www.sentric.ch/blog/hadoop-
best-practice-cluster-checklist
• http://outerthought.org/blog/465-
ot.html
Resources to get started
23. August
31,
2012
23
Questions?
Christian Gügi, christian.guegi@sentric.ch
Jean-Pierre König, jean-pierre.koenig@sentric.ch
NoSQL Roadshow Basel
Thank you!