SlideShare una empresa de Scribd logo
1 de 50
Storing and Indexing Social Media Content in the Hadoop Ecosystem Lance Riedel Brent Halsey Jive Software
Jive: Social Networking for the Enterprise Engage Employees Engage Customers Engage the Social Web What Matters Apps
Jive Social Media Engagement stores social media for monitoring (e.g. brand sentiment), searching, and analysis Jive Social Media Monitoring Overview
Searching and Following Social Media content
Analyzing Social Media content
[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
[object Object],[object Object],[object Object],[object Object]
And its limitations ,[object Object],[object Object],[object Object],[object Object]
Next Generation ,[object Object],[object Object],[object Object]
Next Generation ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
 
 
JSME Flume Topology
JSME Flume Topology
Why Flume? ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Flume Overview: The Canonical Use Case  Flume Agent tier Collector tier Agent Agent Agent Agent Agent Agent Agent Agent Agent Agent Agent Agent Collector Collector Collector server server server server server server server server server server server server HDFS
Flume Overview: Data ingestion pipeline pattern Flume Agent Agent Agent Agent svr index hbase hdfs Collector Fanout HBase Key lookup Range query Incremental Search Idx Search query Faceted query HDFS Hive query Pig query
Katta – distributed Lucene Katta Master Index 2 Katta Node Index 1 Index 2 Katta Node Index 1 Index 2 Hadoop HDFS Raw.seq Index 1 Katta Node Index 1
Jive Social Media Search Architecture
Index 2 Hadoop Job Controller Raw.seq Distributed Indexer Job Katta Search Broker Systems Overview Index 1 Events Search  Results HDFS   HBase Collector Fanout Index 1
Raw.seq Systems Overview Events HDFS   HBase Collector Fanout
Hadoop Job Controller Raw.seq Distributed Indexer Job Systems Overview Events HDFS   HBase Collector Fanout Index 1
Index 2 Hadoop Job Controller Raw.seq Distributed Indexer Job Katta Systems Overview Index 1 Events HDFS   HBase Collector Fanout Index 1
Index 2 Hadoop Job Controller Raw.seq Distributed Indexer Job Katta Search Broker Systems Overview Index 1 Events Search  Results HDFS   HBase Collector Fanout Index 1
Distributed Lucene Indexer Job Input HDFS  Blocks Shard 1 Shard 2
Distributed Lucene Indexer Job Map Map Map Map Raw  Events Input HDFS  Blocks  Index 1 Index 2 Index 3 Index 4
Distributed Lucene Indexer Job Map Map Map Map Reduce Reduce Raw  Events Input HDFS  Blocks  Shuffle/ Sort Key -> shard number Value -> path to index Shard 1 Shard 2 Index 1 Index 2 Index 3 Index 4
5 Minute Index Deployment Incremental Indexer Job Raw.seq
5 Minute Hour Index Deployment Incremental Indexer Job Raw.seq Hourly Merge Indexer Job
5 Minute Hour Day Index Deployment Incremental Indexer Job Raw.seq Hourly Merge Indexer Job Daily Merge Indexer Job
Incremental Indexing Job Controller HDFS 1. Scan HDFS raw.time-1.4.seq raw.time-1.5.seq raw.time-1.6.seq raw.time-1.7.seq.tmp /raw Index.HOUR.time-1 Index.INCREMENTAL.time-1.1 Index.INCREMENTAL.time-1.2 Index.INCREMENTAL.time-1.3 /indexes
Incremental Indexing Job Controller HDFS raw.time-1.4.seq raw.time-1.5.seq raw.time-1.6.seq 1. Scan HDFS 2. Determine raw input files raw.time-1.4.seq raw.time-1.5.seq raw.time-1.6.seq raw.time-1.7.seq.tmp /raw Index.HOUR.time-1 Index.INCREMENTAL.time-1.1 Index.INCREMENTAL.time-1.2 Index.INCREMENTAL.time-1.3 /indexes
Incremental Indexing Job Controller HDFS raw.time-1.4.seq raw.time-1.5.seq raw.time-1.6.seq 1. Scan HDFS 2. Determine raw input files 3. Run INCREMENTAL index job Distributed Indexer Job raw.time-1.4.seq raw.time-1.5.seq raw.time-1.6.seq raw.time-1.7.seq.tmp /raw Index.HOUR.time-1 Index.INCREMENTAL.time-1.1 Index.INCREMENTAL.time-1.2 Index.INCREMENTAL.time-1.3 /indexes
Job Controller HDFS raw.time-1.4.seq raw.time-1.5.seq raw.time-1.6.seq 1. Scan HDFS 2. Determine raw input files 3. Run INCREMENTAL index job Distributed Indexer Job 4. Deploy index Katta Index.INCREMENTAL.time-1.6 Incremental Indexing raw.time-1.4.seq raw.time-1.5.seq raw.time-1.6.seq raw.time-1.7.seq.tmp /raw Index.HOUR.time-1 Index.INCREMENTAL.time-1.1 Index.INCREMENTAL.time-1.2 Index.INCREMENTAL.time-1.3 /indexes
Custom sources / sinks / decorators ,[object Object],[object Object],[object Object]
Real-time Search and Indexing 5 Minute Index 2 Hadoop HDFS Job Controller   HBase Collector Fanout Raw.seq Index 1 Distributed Indexer Job Katta Search Broker Index 1 Events Search  Results
Real-time Search and Indexing Zoie Flume Sink 5 Minute 10 Second Index 2 Hadoop HDFS Job Controller   HBase Collector Fanout Raw.seq Index 1 Distributed Indexer Job Katta Search Broker Index 1 Events Search  Results
Real-time Search and Indexing 5 Minute Incremental Indexer Job Raw.seq
Real-time Search and Indexing 5 Minute Incremental Indexer Job Raw.seq 10 Seconds
Real-time Search and Indexing 5 Minute Incremental Indexer Job Raw.seq 10 Seconds Collector Fanout
Real-time Search and Indexing Zoie Flume Sink 5 Minute Incremental Indexer Job Raw.seq 10 Seconds Collector Fanout
Zoie Flume Sink Jetty Server 0-5 min 1 Search Broker Katta Zoie Sink
Zoie Flume Sink Jetty Server 0-5 min 5-10 min 2 1 Search Broker Katta Zoie Sink
Zoie Flume Sink Jetty Server 0-5 min 5-10 min 10-15 min 3 2 1 Search Broker Katta Zoie Sink
Zoie Flume Sink Jetty Server 0-5 min 5-10 min 10-15 min 4 3 2 1 > 15 min Search Broker Katta Zoie Sink
Real-time Search and Indexing Zoie Flume Sink 5 Minute 10 Second Index 2 Hadoop HDFS Job Controller   HBase Collector Fanout Raw.seq Index 1 Distributed Indexer Job Katta Search Broker Index 1 Events Search  Results
[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],Hadoop Ecosystem @Jive
[object Object],[object Object],[object Object],[object Object],Hadoop Ecosystem @Jive
[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],Hadoop Ecosystem @Jive
Questions ,[object Object],[object Object],[object Object]

Más contenido relacionado

La actualidad más candente

Enabling Exploratory Analytics of Data in Shared-service Hadoop Clusters
Enabling Exploratory Analytics of Data in Shared-service Hadoop ClustersEnabling Exploratory Analytics of Data in Shared-service Hadoop Clusters
Enabling Exploratory Analytics of Data in Shared-service Hadoop ClustersDataWorks Summit
 
Scalable Monitoring Using Apache Spark and Friends with Utkarsh Bhatnagar
Scalable Monitoring Using Apache Spark and Friends with Utkarsh BhatnagarScalable Monitoring Using Apache Spark and Friends with Utkarsh Bhatnagar
Scalable Monitoring Using Apache Spark and Friends with Utkarsh BhatnagarDatabricks
 
SplunkLive! Hunk Technical Deep Dive
SplunkLive! Hunk Technical Deep DiveSplunkLive! Hunk Technical Deep Dive
SplunkLive! Hunk Technical Deep DiveSplunk
 
Yahoo Enabling Exploratory Analytics of Data in Shared-service Hadoop Clusters
Yahoo Enabling Exploratory Analytics of Data in Shared-service Hadoop ClustersYahoo Enabling Exploratory Analytics of Data in Shared-service Hadoop Clusters
Yahoo Enabling Exploratory Analytics of Data in Shared-service Hadoop ClustersBrett Sheppard
 
SplunkLive! Hunk Technical Overview
SplunkLive! Hunk Technical OverviewSplunkLive! Hunk Technical Overview
SplunkLive! Hunk Technical OverviewSplunk
 
BlueData Hunk Integration: Splunk Analytics for Hadoop
BlueData Hunk Integration: Splunk Analytics for HadoopBlueData Hunk Integration: Splunk Analytics for Hadoop
BlueData Hunk Integration: Splunk Analytics for HadoopBlueData, Inc.
 
Open Source Lambda Architecture with Hadoop, Kafka, Samza and Druid
Open Source Lambda Architecture with Hadoop, Kafka, Samza and DruidOpen Source Lambda Architecture with Hadoop, Kafka, Samza and Druid
Open Source Lambda Architecture with Hadoop, Kafka, Samza and DruidDataWorks Summit
 
Nationwide Splunk Ninjas!
Nationwide Splunk Ninjas!Nationwide Splunk Ninjas!
Nationwide Splunk Ninjas!Splunk
 
Hadoop @ Yahoo! - Internet Scale Data Processing
Hadoop @ Yahoo! - Internet Scale Data ProcessingHadoop @ Yahoo! - Internet Scale Data Processing
Hadoop @ Yahoo! - Internet Scale Data ProcessingYahoo Developer Network
 
August 2016 HUG: Recent development in Apache Oozie
August 2016 HUG: Recent development in Apache OozieAugust 2016 HUG: Recent development in Apache Oozie
August 2016 HUG: Recent development in Apache OozieYahoo Developer Network
 
Introduction to Stream Processing
Introduction to Stream ProcessingIntroduction to Stream Processing
Introduction to Stream ProcessingGuido Schmutz
 
How Netflix Uses Druid in Real-time to Ensure a High Quality Streaming Experi...
How Netflix Uses Druid in Real-time to Ensure a High Quality Streaming Experi...How Netflix Uses Druid in Real-time to Ensure a High Quality Streaming Experi...
How Netflix Uses Druid in Real-time to Ensure a High Quality Streaming Experi...Imply
 
Cloud Native Data Pipelines (QCon Shanghai & Tokyo 2016)
Cloud Native Data Pipelines (QCon Shanghai & Tokyo 2016)Cloud Native Data Pipelines (QCon Shanghai & Tokyo 2016)
Cloud Native Data Pipelines (QCon Shanghai & Tokyo 2016)Sid Anand
 
Managing your black friday logs Voxxed Luxembourg
Managing your black friday logs Voxxed LuxembourgManaging your black friday logs Voxxed Luxembourg
Managing your black friday logs Voxxed LuxembourgDavid Pilato
 
Visual Mapping of Clickstream Data
Visual Mapping of Clickstream DataVisual Mapping of Clickstream Data
Visual Mapping of Clickstream DataDataWorks Summit
 
Interactive Realtime Dashboards on Data Streams using Kafka, Druid and Superset
Interactive Realtime Dashboards on Data Streams using Kafka, Druid and SupersetInteractive Realtime Dashboards on Data Streams using Kafka, Druid and Superset
Interactive Realtime Dashboards on Data Streams using Kafka, Druid and SupersetHortonworks
 
Managing your Black Friday Logs NDC Oslo
Managing your  Black Friday Logs NDC OsloManaging your  Black Friday Logs NDC Oslo
Managing your Black Friday Logs NDC OsloDavid Pilato
 

La actualidad más candente (20)

Enabling Exploratory Analytics of Data in Shared-service Hadoop Clusters
Enabling Exploratory Analytics of Data in Shared-service Hadoop ClustersEnabling Exploratory Analytics of Data in Shared-service Hadoop Clusters
Enabling Exploratory Analytics of Data in Shared-service Hadoop Clusters
 
Scalable Monitoring Using Apache Spark and Friends with Utkarsh Bhatnagar
Scalable Monitoring Using Apache Spark and Friends with Utkarsh BhatnagarScalable Monitoring Using Apache Spark and Friends with Utkarsh Bhatnagar
Scalable Monitoring Using Apache Spark and Friends with Utkarsh Bhatnagar
 
SplunkLive! Hunk Technical Deep Dive
SplunkLive! Hunk Technical Deep DiveSplunkLive! Hunk Technical Deep Dive
SplunkLive! Hunk Technical Deep Dive
 
Splunk and node
Splunk and nodeSplunk and node
Splunk and node
 
Yahoo Enabling Exploratory Analytics of Data in Shared-service Hadoop Clusters
Yahoo Enabling Exploratory Analytics of Data in Shared-service Hadoop ClustersYahoo Enabling Exploratory Analytics of Data in Shared-service Hadoop Clusters
Yahoo Enabling Exploratory Analytics of Data in Shared-service Hadoop Clusters
 
SplunkLive! Hunk Technical Overview
SplunkLive! Hunk Technical OverviewSplunkLive! Hunk Technical Overview
SplunkLive! Hunk Technical Overview
 
BlueData Hunk Integration: Splunk Analytics for Hadoop
BlueData Hunk Integration: Splunk Analytics for HadoopBlueData Hunk Integration: Splunk Analytics for Hadoop
BlueData Hunk Integration: Splunk Analytics for Hadoop
 
History of Apache Pinot
History of Apache Pinot History of Apache Pinot
History of Apache Pinot
 
Open Source Lambda Architecture with Hadoop, Kafka, Samza and Druid
Open Source Lambda Architecture with Hadoop, Kafka, Samza and DruidOpen Source Lambda Architecture with Hadoop, Kafka, Samza and Druid
Open Source Lambda Architecture with Hadoop, Kafka, Samza and Druid
 
Nationwide Splunk Ninjas!
Nationwide Splunk Ninjas!Nationwide Splunk Ninjas!
Nationwide Splunk Ninjas!
 
Hadoop @ Yahoo! - Internet Scale Data Processing
Hadoop @ Yahoo! - Internet Scale Data ProcessingHadoop @ Yahoo! - Internet Scale Data Processing
Hadoop @ Yahoo! - Internet Scale Data Processing
 
August 2016 HUG: Recent development in Apache Oozie
August 2016 HUG: Recent development in Apache OozieAugust 2016 HUG: Recent development in Apache Oozie
August 2016 HUG: Recent development in Apache Oozie
 
Introduction to Stream Processing
Introduction to Stream ProcessingIntroduction to Stream Processing
Introduction to Stream Processing
 
How Netflix Uses Druid in Real-time to Ensure a High Quality Streaming Experi...
How Netflix Uses Druid in Real-time to Ensure a High Quality Streaming Experi...How Netflix Uses Druid in Real-time to Ensure a High Quality Streaming Experi...
How Netflix Uses Druid in Real-time to Ensure a High Quality Streaming Experi...
 
Cloud Native Data Pipelines (QCon Shanghai & Tokyo 2016)
Cloud Native Data Pipelines (QCon Shanghai & Tokyo 2016)Cloud Native Data Pipelines (QCon Shanghai & Tokyo 2016)
Cloud Native Data Pipelines (QCon Shanghai & Tokyo 2016)
 
Managing your black friday logs Voxxed Luxembourg
Managing your black friday logs Voxxed LuxembourgManaging your black friday logs Voxxed Luxembourg
Managing your black friday logs Voxxed Luxembourg
 
Visual Mapping of Clickstream Data
Visual Mapping of Clickstream DataVisual Mapping of Clickstream Data
Visual Mapping of Clickstream Data
 
Interactive Realtime Dashboards on Data Streams using Kafka, Druid and Superset
Interactive Realtime Dashboards on Data Streams using Kafka, Druid and SupersetInteractive Realtime Dashboards on Data Streams using Kafka, Druid and Superset
Interactive Realtime Dashboards on Data Streams using Kafka, Druid and Superset
 
Managing your Black Friday Logs NDC Oslo
Managing your  Black Friday Logs NDC OsloManaging your  Black Friday Logs NDC Oslo
Managing your Black Friday Logs NDC Oslo
 
Druid @ branch
Druid @ branch Druid @ branch
Druid @ branch
 

Destacado

Spark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream ProcessingSpark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream ProcessingJack Gudenkauf
 
Architectural Patterns for Streaming Applications
Architectural Patterns for Streaming ApplicationsArchitectural Patterns for Streaming Applications
Architectural Patterns for Streaming Applicationshadooparchbook
 
Chicago Data Summit: Flume: An Introduction
Chicago Data Summit: Flume: An IntroductionChicago Data Summit: Flume: An Introduction
Chicago Data Summit: Flume: An IntroductionCloudera, Inc.
 
Hadoop - Integration Patterns and Practices__HadoopSummit2010
Hadoop - Integration Patterns and Practices__HadoopSummit2010Hadoop - Integration Patterns and Practices__HadoopSummit2010
Hadoop - Integration Patterns and Practices__HadoopSummit2010Yahoo Developer Network
 
Hadoop World 2011: Advanced HBase Schema Design
Hadoop World 2011: Advanced HBase Schema DesignHadoop World 2011: Advanced HBase Schema Design
Hadoop World 2011: Advanced HBase Schema DesignCloudera, Inc.
 
Spring for Apache Hadoop
Spring for Apache HadoopSpring for Apache Hadoop
Spring for Apache Hadoopzenyk
 
Designing a reactive data platform: Challenges, patterns, and anti-patterns
Designing a reactive data platform: Challenges, patterns, and anti-patterns Designing a reactive data platform: Challenges, patterns, and anti-patterns
Designing a reactive data platform: Challenges, patterns, and anti-patterns Alex Silva
 
How to develop Big Data Pipelines for Hadoop, by Costin Leau
How to develop Big Data Pipelines for Hadoop, by Costin LeauHow to develop Big Data Pipelines for Hadoop, by Costin Leau
How to develop Big Data Pipelines for Hadoop, by Costin LeauCodemotion
 
Hadoop Application Architectures - Fraud Detection
Hadoop Application Architectures - Fraud  DetectionHadoop Application Architectures - Fraud  Detection
Hadoop Application Architectures - Fraud Detectionhadooparchbook
 
Strata NY 2014 - Architectural considerations for Hadoop applications tutorial
Strata NY 2014 - Architectural considerations for Hadoop applications tutorialStrata NY 2014 - Architectural considerations for Hadoop applications tutorial
Strata NY 2014 - Architectural considerations for Hadoop applications tutorialhadooparchbook
 
Part 2 - Hadoop Data Loading using Hadoop Tools and ODI12c
Part 2 - Hadoop Data Loading using Hadoop Tools and ODI12cPart 2 - Hadoop Data Loading using Hadoop Tools and ODI12c
Part 2 - Hadoop Data Loading using Hadoop Tools and ODI12cMark Rittman
 
Building Continuously Curated Ingestion Pipelines
Building Continuously Curated Ingestion PipelinesBuilding Continuously Curated Ingestion Pipelines
Building Continuously Curated Ingestion PipelinesArvind Prabhakar
 
Open Source Big Data Ingestion - Without the Heartburn!
Open Source Big Data Ingestion - Without the Heartburn!Open Source Big Data Ingestion - Without the Heartburn!
Open Source Big Data Ingestion - Without the Heartburn!Pat Patterson
 
Data Ingestion, Extraction & Parsing on Hadoop
Data Ingestion, Extraction & Parsing on HadoopData Ingestion, Extraction & Parsing on Hadoop
Data Ingestion, Extraction & Parsing on Hadoopskaluska
 
Streaming architecture patterns
Streaming architecture patternsStreaming architecture patterns
Streaming architecture patternshadooparchbook
 
Intro to HBase Internals & Schema Design (for HBase users)
Intro to HBase Internals & Schema Design (for HBase users)Intro to HBase Internals & Schema Design (for HBase users)
Intro to HBase Internals & Schema Design (for HBase users)alexbaranau
 
Apache Flume - DataDayTexas
Apache Flume - DataDayTexasApache Flume - DataDayTexas
Apache Flume - DataDayTexasArvind Prabhakar
 
How to Build Continuous Ingestion for the Internet of Things
How to Build Continuous Ingestion for the Internet of ThingsHow to Build Continuous Ingestion for the Internet of Things
How to Build Continuous Ingestion for the Internet of ThingsCloudera, Inc.
 

Destacado (20)

Spark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream ProcessingSpark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream Processing
 
Architectural Patterns for Streaming Applications
Architectural Patterns for Streaming ApplicationsArchitectural Patterns for Streaming Applications
Architectural Patterns for Streaming Applications
 
Chicago Data Summit: Flume: An Introduction
Chicago Data Summit: Flume: An IntroductionChicago Data Summit: Flume: An Introduction
Chicago Data Summit: Flume: An Introduction
 
Hadoop - Integration Patterns and Practices__HadoopSummit2010
Hadoop - Integration Patterns and Practices__HadoopSummit2010Hadoop - Integration Patterns and Practices__HadoopSummit2010
Hadoop - Integration Patterns and Practices__HadoopSummit2010
 
Hadoop World 2011: Advanced HBase Schema Design
Hadoop World 2011: Advanced HBase Schema DesignHadoop World 2011: Advanced HBase Schema Design
Hadoop World 2011: Advanced HBase Schema Design
 
Spring for Apache Hadoop
Spring for Apache HadoopSpring for Apache Hadoop
Spring for Apache Hadoop
 
Designing a reactive data platform: Challenges, patterns, and anti-patterns
Designing a reactive data platform: Challenges, patterns, and anti-patterns Designing a reactive data platform: Challenges, patterns, and anti-patterns
Designing a reactive data platform: Challenges, patterns, and anti-patterns
 
How to develop Big Data Pipelines for Hadoop, by Costin Leau
How to develop Big Data Pipelines for Hadoop, by Costin LeauHow to develop Big Data Pipelines for Hadoop, by Costin Leau
How to develop Big Data Pipelines for Hadoop, by Costin Leau
 
Hadoop Application Architectures - Fraud Detection
Hadoop Application Architectures - Fraud  DetectionHadoop Application Architectures - Fraud  Detection
Hadoop Application Architectures - Fraud Detection
 
Strata NY 2014 - Architectural considerations for Hadoop applications tutorial
Strata NY 2014 - Architectural considerations for Hadoop applications tutorialStrata NY 2014 - Architectural considerations for Hadoop applications tutorial
Strata NY 2014 - Architectural considerations for Hadoop applications tutorial
 
Part 2 - Hadoop Data Loading using Hadoop Tools and ODI12c
Part 2 - Hadoop Data Loading using Hadoop Tools and ODI12cPart 2 - Hadoop Data Loading using Hadoop Tools and ODI12c
Part 2 - Hadoop Data Loading using Hadoop Tools and ODI12c
 
Building Continuously Curated Ingestion Pipelines
Building Continuously Curated Ingestion PipelinesBuilding Continuously Curated Ingestion Pipelines
Building Continuously Curated Ingestion Pipelines
 
Open Source Big Data Ingestion - Without the Heartburn!
Open Source Big Data Ingestion - Without the Heartburn!Open Source Big Data Ingestion - Without the Heartburn!
Open Source Big Data Ingestion - Without the Heartburn!
 
Data Ingestion, Extraction & Parsing on Hadoop
Data Ingestion, Extraction & Parsing on HadoopData Ingestion, Extraction & Parsing on Hadoop
Data Ingestion, Extraction & Parsing on Hadoop
 
Streaming architecture patterns
Streaming architecture patternsStreaming architecture patterns
Streaming architecture patterns
 
Intro to HBase Internals & Schema Design (for HBase users)
Intro to HBase Internals & Schema Design (for HBase users)Intro to HBase Internals & Schema Design (for HBase users)
Intro to HBase Internals & Schema Design (for HBase users)
 
Apache Flume - DataDayTexas
Apache Flume - DataDayTexasApache Flume - DataDayTexas
Apache Flume - DataDayTexas
 
Apache Flume
Apache FlumeApache Flume
Apache Flume
 
Integrating Apache Spark and NiFi for Data Lakes
Integrating Apache Spark and NiFi for Data LakesIntegrating Apache Spark and NiFi for Data Lakes
Integrating Apache Spark and NiFi for Data Lakes
 
How to Build Continuous Ingestion for the Internet of Things
How to Build Continuous Ingestion for the Internet of ThingsHow to Build Continuous Ingestion for the Internet of Things
How to Build Continuous Ingestion for the Internet of Things
 

Similar a Hadoop World 2011: Storing and Indexing Social Media Content in the Hadoop Ecosystem - Lance Riedel & Brent Halsey - Jive Software

Hadoop and Pig at Twitter__HadoopSummit2010
Hadoop and Pig at Twitter__HadoopSummit2010Hadoop and Pig at Twitter__HadoopSummit2010
Hadoop and Pig at Twitter__HadoopSummit2010Yahoo Developer Network
 
Big data-at-detik
Big data-at-detikBig data-at-detik
Big data-at-detikk4ndar
 
Big Data Applications Made Easy: Fact Or Fiction?
Big Data Applications Made Easy: Fact Or Fiction?Big Data Applications Made Easy: Fact Or Fiction?
Big Data Applications Made Easy: Fact Or Fiction?Glenn Renfro
 
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Cloudera, Inc.
 
2011 06-30-hadoop-summit v5
2011 06-30-hadoop-summit v52011 06-30-hadoop-summit v5
2011 06-30-hadoop-summit v5Samuel Rash
 
Data Infrastructure for a World of Music
Data Infrastructure for a World of MusicData Infrastructure for a World of Music
Data Infrastructure for a World of MusicLars Albertsson
 
Testing Big Data: Automated ETL Testing of Hadoop
Testing Big Data: Automated ETL Testing of HadoopTesting Big Data: Automated ETL Testing of Hadoop
Testing Big Data: Automated ETL Testing of HadoopRTTS
 
Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010nzhang
 
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...Cloudera, Inc.
 
Big Data & Analytics (CSE6005) L6.pptx
Big Data & Analytics (CSE6005) L6.pptxBig Data & Analytics (CSE6005) L6.pptx
Big Data & Analytics (CSE6005) L6.pptxAnonymous9etQKwW
 
Data infrastructure at Facebook
Data infrastructure at Facebook Data infrastructure at Facebook
Data infrastructure at Facebook AhmedDoukh
 
Presto - Hadoop Conference Japan 2014
Presto - Hadoop Conference Japan 2014Presto - Hadoop Conference Japan 2014
Presto - Hadoop Conference Japan 2014Sadayuki Furuhashi
 
GDPR compliance application architecture and implementation using Hadoop and ...
GDPR compliance application architecture and implementation using Hadoop and ...GDPR compliance application architecture and implementation using Hadoop and ...
GDPR compliance application architecture and implementation using Hadoop and ...DataWorks Summit
 
Bringing OLTP woth OLAP: Lumos on Hadoop
Bringing OLTP woth OLAP: Lumos on HadoopBringing OLTP woth OLAP: Lumos on Hadoop
Bringing OLTP woth OLAP: Lumos on HadoopDataWorks Summit
 
Inroduction to Big Data
Inroduction to Big DataInroduction to Big Data
Inroduction to Big DataOmnia Safaan
 

Similar a Hadoop World 2011: Storing and Indexing Social Media Content in the Hadoop Ecosystem - Lance Riedel & Brent Halsey - Jive Software (20)

Hadoop and Pig at Twitter__HadoopSummit2010
Hadoop and Pig at Twitter__HadoopSummit2010Hadoop and Pig at Twitter__HadoopSummit2010
Hadoop and Pig at Twitter__HadoopSummit2010
 
Big data-at-detik
Big data-at-detikBig data-at-detik
Big data-at-detik
 
Big Data Applications Made Easy: Fact Or Fiction?
Big Data Applications Made Easy: Fact Or Fiction?Big Data Applications Made Easy: Fact Or Fiction?
Big Data Applications Made Easy: Fact Or Fiction?
 
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
 
2011 06-30-hadoop-summit v5
2011 06-30-hadoop-summit v52011 06-30-hadoop-summit v5
2011 06-30-hadoop-summit v5
 
Data Infrastructure for a World of Music
Data Infrastructure for a World of MusicData Infrastructure for a World of Music
Data Infrastructure for a World of Music
 
Testing Big Data: Automated ETL Testing of Hadoop
Testing Big Data: Automated ETL Testing of HadoopTesting Big Data: Automated ETL Testing of Hadoop
Testing Big Data: Automated ETL Testing of Hadoop
 
Hadoop at Ebay
Hadoop at EbayHadoop at Ebay
Hadoop at Ebay
 
Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010
 
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
 
Big Data & Analytics (CSE6005) L6.pptx
Big Data & Analytics (CSE6005) L6.pptxBig Data & Analytics (CSE6005) L6.pptx
Big Data & Analytics (CSE6005) L6.pptx
 
Data infrastructure at Facebook
Data infrastructure at Facebook Data infrastructure at Facebook
Data infrastructure at Facebook
 
Presto - Hadoop Conference Japan 2014
Presto - Hadoop Conference Japan 2014Presto - Hadoop Conference Japan 2014
Presto - Hadoop Conference Japan 2014
 
dumb
dumbdumb
dumb
 
dumb
dumbdumb
dumb
 
5. pivotal hd 2013
5. pivotal hd 20135. pivotal hd 2013
5. pivotal hd 2013
 
GDPR compliance application architecture and implementation using Hadoop and ...
GDPR compliance application architecture and implementation using Hadoop and ...GDPR compliance application architecture and implementation using Hadoop and ...
GDPR compliance application architecture and implementation using Hadoop and ...
 
Bringing OLTP woth OLAP: Lumos on Hadoop
Bringing OLTP woth OLAP: Lumos on HadoopBringing OLTP woth OLAP: Lumos on Hadoop
Bringing OLTP woth OLAP: Lumos on Hadoop
 
hadoop resume
hadoop resumehadoop resume
hadoop resume
 
Inroduction to Big Data
Inroduction to Big DataInroduction to Big Data
Inroduction to Big Data
 

Más de Cloudera, Inc.

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxCloudera, Inc.
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera, Inc.
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards FinalistsCloudera, Inc.
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Cloudera, Inc.
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Cloudera, Inc.
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Cloudera, Inc.
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Cloudera, Inc.
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Cloudera, Inc.
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Cloudera, Inc.
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Cloudera, Inc.
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Cloudera, Inc.
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Cloudera, Inc.
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformCloudera, Inc.
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Cloudera, Inc.
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Cloudera, Inc.
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Cloudera, Inc.
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Cloudera, Inc.
 

Más de Cloudera, Inc. (20)

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptx
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the Platform
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
 

Último

DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 

Último (20)

DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 

Hadoop World 2011: Storing and Indexing Social Media Content in the Hadoop Ecosystem - Lance Riedel & Brent Halsey - Jive Software

  • 1. Storing and Indexing Social Media Content in the Hadoop Ecosystem Lance Riedel Brent Halsey Jive Software
  • 2. Jive: Social Networking for the Enterprise Engage Employees Engage Customers Engage the Social Web What Matters Apps
  • 3. Jive Social Media Engagement stores social media for monitoring (e.g. brand sentiment), searching, and analysis Jive Social Media Monitoring Overview
  • 4. Searching and Following Social Media content
  • 6.
  • 7.
  • 8.
  • 9.
  • 10.
  • 11.  
  • 12.  
  • 15.
  • 16. Flume Overview: The Canonical Use Case Flume Agent tier Collector tier Agent Agent Agent Agent Agent Agent Agent Agent Agent Agent Agent Agent Collector Collector Collector server server server server server server server server server server server server HDFS
  • 17. Flume Overview: Data ingestion pipeline pattern Flume Agent Agent Agent Agent svr index hbase hdfs Collector Fanout HBase Key lookup Range query Incremental Search Idx Search query Faceted query HDFS Hive query Pig query
  • 18. Katta – distributed Lucene Katta Master Index 2 Katta Node Index 1 Index 2 Katta Node Index 1 Index 2 Hadoop HDFS Raw.seq Index 1 Katta Node Index 1
  • 19. Jive Social Media Search Architecture
  • 20. Index 2 Hadoop Job Controller Raw.seq Distributed Indexer Job Katta Search Broker Systems Overview Index 1 Events Search Results HDFS HBase Collector Fanout Index 1
  • 21. Raw.seq Systems Overview Events HDFS HBase Collector Fanout
  • 22. Hadoop Job Controller Raw.seq Distributed Indexer Job Systems Overview Events HDFS HBase Collector Fanout Index 1
  • 23. Index 2 Hadoop Job Controller Raw.seq Distributed Indexer Job Katta Systems Overview Index 1 Events HDFS HBase Collector Fanout Index 1
  • 24. Index 2 Hadoop Job Controller Raw.seq Distributed Indexer Job Katta Search Broker Systems Overview Index 1 Events Search Results HDFS HBase Collector Fanout Index 1
  • 25. Distributed Lucene Indexer Job Input HDFS Blocks Shard 1 Shard 2
  • 26. Distributed Lucene Indexer Job Map Map Map Map Raw Events Input HDFS Blocks Index 1 Index 2 Index 3 Index 4
  • 27. Distributed Lucene Indexer Job Map Map Map Map Reduce Reduce Raw Events Input HDFS Blocks Shuffle/ Sort Key -> shard number Value -> path to index Shard 1 Shard 2 Index 1 Index 2 Index 3 Index 4
  • 28. 5 Minute Index Deployment Incremental Indexer Job Raw.seq
  • 29. 5 Minute Hour Index Deployment Incremental Indexer Job Raw.seq Hourly Merge Indexer Job
  • 30. 5 Minute Hour Day Index Deployment Incremental Indexer Job Raw.seq Hourly Merge Indexer Job Daily Merge Indexer Job
  • 31. Incremental Indexing Job Controller HDFS 1. Scan HDFS raw.time-1.4.seq raw.time-1.5.seq raw.time-1.6.seq raw.time-1.7.seq.tmp /raw Index.HOUR.time-1 Index.INCREMENTAL.time-1.1 Index.INCREMENTAL.time-1.2 Index.INCREMENTAL.time-1.3 /indexes
  • 32. Incremental Indexing Job Controller HDFS raw.time-1.4.seq raw.time-1.5.seq raw.time-1.6.seq 1. Scan HDFS 2. Determine raw input files raw.time-1.4.seq raw.time-1.5.seq raw.time-1.6.seq raw.time-1.7.seq.tmp /raw Index.HOUR.time-1 Index.INCREMENTAL.time-1.1 Index.INCREMENTAL.time-1.2 Index.INCREMENTAL.time-1.3 /indexes
  • 33. Incremental Indexing Job Controller HDFS raw.time-1.4.seq raw.time-1.5.seq raw.time-1.6.seq 1. Scan HDFS 2. Determine raw input files 3. Run INCREMENTAL index job Distributed Indexer Job raw.time-1.4.seq raw.time-1.5.seq raw.time-1.6.seq raw.time-1.7.seq.tmp /raw Index.HOUR.time-1 Index.INCREMENTAL.time-1.1 Index.INCREMENTAL.time-1.2 Index.INCREMENTAL.time-1.3 /indexes
  • 34. Job Controller HDFS raw.time-1.4.seq raw.time-1.5.seq raw.time-1.6.seq 1. Scan HDFS 2. Determine raw input files 3. Run INCREMENTAL index job Distributed Indexer Job 4. Deploy index Katta Index.INCREMENTAL.time-1.6 Incremental Indexing raw.time-1.4.seq raw.time-1.5.seq raw.time-1.6.seq raw.time-1.7.seq.tmp /raw Index.HOUR.time-1 Index.INCREMENTAL.time-1.1 Index.INCREMENTAL.time-1.2 Index.INCREMENTAL.time-1.3 /indexes
  • 35.
  • 36. Real-time Search and Indexing 5 Minute Index 2 Hadoop HDFS Job Controller HBase Collector Fanout Raw.seq Index 1 Distributed Indexer Job Katta Search Broker Index 1 Events Search Results
  • 37. Real-time Search and Indexing Zoie Flume Sink 5 Minute 10 Second Index 2 Hadoop HDFS Job Controller HBase Collector Fanout Raw.seq Index 1 Distributed Indexer Job Katta Search Broker Index 1 Events Search Results
  • 38. Real-time Search and Indexing 5 Minute Incremental Indexer Job Raw.seq
  • 39. Real-time Search and Indexing 5 Minute Incremental Indexer Job Raw.seq 10 Seconds
  • 40. Real-time Search and Indexing 5 Minute Incremental Indexer Job Raw.seq 10 Seconds Collector Fanout
  • 41. Real-time Search and Indexing Zoie Flume Sink 5 Minute Incremental Indexer Job Raw.seq 10 Seconds Collector Fanout
  • 42. Zoie Flume Sink Jetty Server 0-5 min 1 Search Broker Katta Zoie Sink
  • 43. Zoie Flume Sink Jetty Server 0-5 min 5-10 min 2 1 Search Broker Katta Zoie Sink
  • 44. Zoie Flume Sink Jetty Server 0-5 min 5-10 min 10-15 min 3 2 1 Search Broker Katta Zoie Sink
  • 45. Zoie Flume Sink Jetty Server 0-5 min 5-10 min 10-15 min 4 3 2 1 > 15 min Search Broker Katta Zoie Sink
  • 46. Real-time Search and Indexing Zoie Flume Sink 5 Minute 10 Second Index 2 Hadoop HDFS Job Controller HBase Collector Fanout Raw.seq Index 1 Distributed Indexer Job Katta Search Broker Index 1 Events Search Results
  • 47.
  • 48.
  • 49.
  • 50.

Notas del editor

  1. Collecting content from twitter, facebook, blogs, and news outlets. Allow our users to search on this content, monitor it, and analyze it.
  2. Screen shot of the app shows a user's list of monitors and content matching those monitors. Users can filter by sentiment and by the content source. They can engage in social conversations through twitter and facebook. And they can create discussions within Jive SBS.
  3. Users can analyze social media trends over time with graph views for sentiment and content sources.
  4. Old system takes data from content sources and throws it on a queue. Queue acts as a buffer to processors that process the content and insert it into a MySQL DB. Some fault tolerance with multiple servers connecting to multiple queues. But required a fair bit of monitoring and manual intervention when problems arise.
  5. Limited because we throw away most of our content. Pushing the limits of MySQL can be painful.
  6. Wanted to store all content (limited window), search it, and analyze it.
  7. Chose HBase for random lookup. HDFS for chronological streaming. Katta for distributing Lucene shards. Hadoop for running map reduce.
  8. Built out prototype of new system using Amazon's EC2 and needed a way to stream data into these servers. Internal / External IP addresses of EC2 made it difficult to connect directly to HDFS and HBase. Flume provided this connectivity along with desirable delivery guarantees.
  9. Additionally, can fan out the data to bring data into EC2 along with our production system.
  10. Additionally, can fan out the data to bring data into EC2 along with our production system.
  11. KATTA For those not familiar with Katta, it is a distributed search engine that has two major responsibilities The first is distributing indexes from HDFS to any number of katta nodes. Katta nodes can run across as many machines as you want, easy to add more, and katta will redistribute indexes if nodes fail Katta has a highly customizable distribution policy – you can round robin, have hot/cold topologies where newer indexes are placed on faster machines As part of the distribution there is also replication of indexes for increased load performance and failover All of this is managed through zookeeper, so it is quite resilient, and does a very good job at keeping indexes where zookeeper says it should The second responsibility of katta is to take a single search request and send the request to every katta node and gather the results
  12. OVERVIEW OF SEARCH – 30 days of twitter, facebook, major news and blogs Next few slides are going to show how we tackled searching a moving window of 30 days of twitter (full firehose), public facebook feed, and Spinn3r (which includes all major news and blog sites) SEARCH IS USED – INVESTIGATE MONITOR CREATION, ADHOC ANALYTICS -search is used to investigate what monitor to create, so searching historical data is of course key -also allows to do ad-hoc analytics over recent history. Show me sentimate, or raw counts for an ad-hoc query over the last 30 days
  13. TRANSITION – OTHER REQUIREMENTS NEED FLEXIBILITY Other requirements of course pop up, so it was good that we chose Flume so that we could add easily add on new functionality One of the key customization areas of Flume are the custom sources sinks and decorators you can supply SOURCES OVERVIEW Sources allow you to create custom hooks into data providers. There is a huge list of sources provided out of the box from tailing files to avro http end points where you can send raw events to flume over http with a flume event avro schema SINK OVERVIEW Sinks allow you to create custom places to put the events.. Again there are a slew of out of the box sinks such as hbase and hdfs DECORATOR OVERVIEW And then there are decorators that you can place pretty much add anywhere in the topology where you are allowed to inspect each event and add meta data, change the contents, or throw them on the floor SOME OF OUR OWN Want to highlight a few customizations we did: (rest on slide)