Hadoop World 2011: Storing and Indexing Social Media Content in the Hadoop Ecosystem - Lance Riedel & Brent Halsey - Jive Software

Storing and Indexing Social Media Content in the Hadoop Ecosystem. Lance Riedel, Brent Halsey. Jive Software.

Jive: Social Networking for the Enterprise. Engage Employees; Engage Customers; Engage the Social Web; What Matters; Apps.

Jive Social Media Monitoring Overview: Jive Social Media Engagement stores social media content for monitoring (e.g. brand sentiment), searching, and analysis.

Searching and Following Social Media content

Analyzing Social Media content

And its limitations

Next Generation

Why Flume?

Flume Overview: The Canonical Use Case. [Diagram: twelve servers, each with a Flume agent (agent tier), send to three collectors (collector tier), which write to HDFS.]

Flume Overview: Data ingestion pipeline pattern. [Diagram: Flume agents on servers send events to a Collector Fanout with three outputs: HBase (key lookup, range query), an incremental search index (search query, faceted query), and HDFS (Hive query, Pig query).]
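The fan-out pattern on this slide can be illustrated with a small, self-contained sketch. This is not Flume's API: `FanoutCollector` and the three sink classes are hypothetical stand-ins for the collector, HBase, the incremental search index, and HDFS, and the event shape (`id`, `body`) is an assumption made for the example.

```python
class FanoutCollector:
    """Deliver every event to all registered sinks (hypothetical, not Flume's API)."""

    def __init__(self, sinks):
        self.sinks = list(sinks)

    def append(self, event):
        # Fan-out: each sink receives its own copy of the event reference.
        for sink in self.sinks:
            sink(event)


class KeyValueSink:
    """Stands in for HBase: supports lookup by key."""

    def __init__(self):
        self.rows = {}

    def __call__(self, event):
        self.rows[event["id"]] = event


class SearchIndexSink:
    """Stands in for the incremental search index: term -> set of event ids."""

    def __init__(self):
        self.postings = {}

    def __call__(self, event):
        for term in event["body"].lower().split():
            self.postings.setdefault(term, set()).add(event["id"])


class ArchiveSink:
    """Stands in for HDFS: an append-only log for later batch queries."""

    def __init__(self):
        self.log = []

    def __call__(self, event):
        self.log.append(event)


store, index, archive = KeyValueSink(), SearchIndexSink(), ArchiveSink()
collector = FanoutCollector([store, index, archive])
collector.append({"id": "t1", "body": "Loving the new Jive release"})
```

After the single `append` call, the same event is available for key lookup in `store.rows`, for term search in `index.postings`, and for batch processing in `archive.log`.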

Katta – distributed Lucene. [Diagram: a Katta Master and three Katta Nodes. Hadoop HDFS holds Raw.seq and the indexes; the nodes serve copies of Index 1 and Index 2.]

Jive Social Media Search Architecture

Systems Overview. [Diagram: Events enter the Collector Fanout, which writes to HBase and to Raw.seq in HDFS. The Hadoop Job Controller runs the Distributed Indexer Job, which produces Index 1 and Index 2. Katta serves the indexes, and the Search Broker returns Search Results.]

Systems Overview, built up over three slides: (1) Events, Collector Fanout, HBase, and Raw.seq in HDFS; (2) adds the Hadoop Job Controller, the Distributed Indexer Job, and Index 1; (3) adds Index 2 and Katta.

Distributed Lucene Indexer Job. [Diagram, built up over three slides: Raw Events are read from Input HDFS Blocks by four Map tasks, which produce Index 1 to Index 4. In the Shuffle/Sort, the key is the shard number and the value is the path to an index. Two Reduce tasks produce Shard 1 and Shard 2.]
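The map, shuffle, and reduce stages of the indexer job can be sketched in plain Python. This is an in-memory illustration only, not the presenters' implementation and not Lucene or Hadoop code: an index is modelled as a dict of term to document ids, the paths are made-up strings, and the round-robin shard assignment is an assumption (the slides state only that the key is the shard number and the value is the path to an index).

```python
from collections import defaultdict

NUM_SHARDS = 2  # the slide shows two shards


def map_task(task_id, events, index_store):
    """Build a small index from one input split and emit (shard number, index path)."""
    path = f"/tmp/index-{task_id}"  # hypothetical path, stands in for an index directory
    inverted = defaultdict(set)
    for doc_id, text in events:
        for term in text.lower().split():
            inverted[term].add(doc_id)
    index_store[path] = inverted
    # Assumed shard assignment: round-robin on the map task id.
    return task_id % NUM_SHARDS, path


def reduce_task(paths, index_store):
    """Merge every index whose path was routed to this shard."""
    merged = defaultdict(set)
    for path in paths:
        for term, doc_ids in index_store[path].items():
            merged[term] |= doc_ids
    return dict(merged)


def run_job(splits):
    index_store = {}
    by_shard = defaultdict(list)  # shuffle/sort: group index paths by shard number
    for task_id, events in enumerate(splits):
        shard, path = map_task(task_id, events, index_store)
        by_shard[shard].append(path)
    return {shard: reduce_task(paths, index_store)
            for shard, paths in sorted(by_shard.items())}


splits = [
    [(1, "hadoop flume")],
    [(2, "katta lucene")],
    [(3, "hadoop katta")],
    [(4, "zoie")],
]
shards = run_job(splits)
```

Only index paths travel through the shuffle in this sketch, which mirrors the key and value named on the slide; the documents themselves are indexed once, in the map stage.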

Index Deployment. [Diagram, built up over three slides: 5 Minute: Incremental Indexer Job over Raw.seq; Hour: Hourly Merge Indexer Job; Day: Daily Merge Indexer Job.]
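One way to picture the tiers is as a grouping of smaller indexes into the hour or day that contains them. The sketch below is an assumption about how such a merge plan could be computed, not a description of the presenters' jobs; `merge_plan` and the use of epoch seconds as index start times are hypothetical.

```python
from collections import defaultdict

FIVE_MINUTES = 5 * 60
HOUR = 60 * 60
DAY = 24 * HOUR


def merge_plan(index_times, bucket_seconds):
    """Group index start times (epoch seconds) by the enclosing bucket."""
    plan = defaultdict(list)
    for t in sorted(index_times):
        bucket_start = t - t % bucket_seconds
        plan[bucket_start].append(t)
    return dict(plan)


# Five incremental indexes: three in the first hour, two in the second.
incremental = [0, 300, 600, 3600, 3900]
hourly = merge_plan(incremental, HOUR)

# Hourly indexes are grouped the same way for a daily merge.
daily = merge_plan([0, 3600, 86400], DAY)
```

Each key in the resulting dict is the start of an hour (or day), and its value lists the indexes a merge job for that bucket would consume.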

Incremental Indexing. [Diagram, built up over four slides. HDFS contents: /raw holds raw.time-1.4.seq, raw.time-1.5.seq, raw.time-1.6.seq, and raw.time-1.7.seq.tmp; /indexes holds Index.HOUR.time-1, Index.INCREMENTAL.time-1.1, Index.INCREMENTAL.time-1.2, and Index.INCREMENTAL.time-1.3.]

Job Controller steps:
1. Scan HDFS
2. Determine raw input files (raw.time-1.4.seq, raw.time-1.5.seq, raw.time-1.6.seq)
3. Run INCREMENTAL index job (Distributed Indexer Job)
4. Deploy index to Katta (Index.INCREMENTAL.time-1.6)
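The scan and input-selection steps can be sketched as a pure function over file names. The selection rule here is inferred from the slide's example rather than stated by the presenters: completed `.seq` files newer than the latest incremental index are taken as input, `.tmp` files are skipped, and the new index is named after the last input file. Treating the two numbers in each name as a sortable (period, sequence) pair is an assumption, and `plan_incremental_job` is a hypothetical name.

```python
import re

# Completed raw files only; names ending in .seq.tmp do not match.
RAW_RE = re.compile(r"^raw\.time-(\d+)\.(\d+)\.seq$")
INC_RE = re.compile(r"^Index\.INCREMENTAL\.time-(\d+)\.(\d+)$")


def plan_incremental_job(raw_files, index_dirs):
    """Return (input file names, name of the index the job would produce)."""
    # Highest (period, sequence) already covered by an incremental index.
    covered = max(
        (tuple(map(int, m.groups())) for m in map(INC_RE.match, index_dirs) if m),
        default=(0, 0),
    )
    inputs = []
    for name in raw_files:
        m = RAW_RE.match(name)
        if m:
            key = tuple(map(int, m.groups()))
            if key > covered:
                inputs.append((key, name))
    inputs.sort()
    if not inputs:
        return [], None
    (period, seq), _ = inputs[-1]
    return [name for _, name in inputs], f"Index.INCREMENTAL.time-{period}.{seq}"


raw = ["raw.time-1.4.seq", "raw.time-1.5.seq",
       "raw.time-1.6.seq", "raw.time-1.7.seq.tmp"]
indexes = ["Index.HOUR.time-1", "Index.INCREMENTAL.time-1.1",
           "Index.INCREMENTAL.time-1.2", "Index.INCREMENTAL.time-1.3"]
inputs, new_index = plan_incremental_job(raw, indexes)
```

With the directory listing from the slide, the function selects the three completed raw files and names the result `Index.INCREMENTAL.time-1.6`, matching the index shown being deployed to Katta.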

Custom sources / sinks / decorators

Real-time Search and Indexing. [Diagram, built up over two slides: the Systems Overview pipeline (Events, Collector Fanout, HBase, HDFS with Raw.seq, Job Controller, Distributed Indexer Job, Katta, Search Broker, Search Results) is labelled 5 Minute; a Zoie Flume Sink is then added, labelled 10 Second.]

Real-time Search and Indexing. [Diagram, built up over four slides: the Incremental Indexer Job over Raw.seq is labelled 5 Minute; the Collector Fanout feeding the Zoie Flume Sink is labelled 10 Seconds.]

Zoie Flume Sink. [Diagram, built up over four slides: the Zoie Sink runs in a Jetty Server and holds numbered indexes (1 to 4) across the time windows 0-5 min, 5-10 min, 10-15 min, and > 15 min. The Search Broker sits in front of both Katta and the Zoie Sink.]
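The rolling time windows can be sketched as a small in-memory index that keeps recent documents in 5-minute windows and drops windows once they are more than 15 minutes old, on the assumption that the batch indexes served by Katta cover that content by then. This is not Zoie's API; `RollingIndex`, `broker_search`, and the eviction rule are hypothetical and only illustrate the windowing shown on the slides.

```python
WINDOW_SECONDS = 5 * 60    # window width shown on the slide
MAX_AGE_SECONDS = 15 * 60  # assumed cut-off: older windows are dropped


class RollingIndex:
    """Recent documents bucketed into fixed time windows (hypothetical, not Zoie)."""

    def __init__(self):
        self.windows = {}  # window start time -> {doc_id: text}

    def add(self, doc_id, text, now):
        start = now - (now % WINDOW_SECONDS)
        self.windows.setdefault(start, {})[doc_id] = text
        self.expire(now)

    def expire(self, now):
        # Drop whole windows that ended more than MAX_AGE_SECONDS ago.
        for start in list(self.windows):
            if start + WINDOW_SECONDS <= now - MAX_AGE_SECONDS:
                del self.windows[start]

    def search(self, term):
        term = term.lower()
        return {doc_id
                for docs in self.windows.values()
                for doc_id, text in docs.items()
                if term in text.lower().split()}


def broker_search(term, batch_hits, realtime):
    """Union of hits from the batch tier and the real-time tier, de-duplicated by id."""
    return set(batch_hits) | realtime.search(term)


recent = RollingIndex()
recent.add("a", "hadoop flume", now=0)
recent.add("b", "hadoop katta", now=600)
```

A broker in this sketch simply unions document ids from both tiers, so a document that appears in both the batch index and a recent window is returned once.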

Hadoop Ecosystem @Jive

Questions
