SlideShare una empresa de Scribd logo
1 de 33
Descargar para leer sin conexión
HBase Coprocessor to Index Columns into
ElasticSearch Cluster
Dibyendu Bhattacharya
Architect – Big Data Analytics
HappiestMinds
About HappiestMinds
• Next Gen IT Consultancy Company launched Aug 2011 . Head office
in Bangalore, India, have offices in USA, UK, Canada, Australia and
Singapore. Core focus on disruptive technologies like Big
Data/Analytics, Cloud, Mobile and Social.
• Raised USD 45M Series A Funding from prominent VCs , Intel
Capital, Canaan Partners and founders.
• 45 + Client Globally, 800 + Employees.
About Myself :
Dibyendu is Big Data Architect at HappiestMinds where he is involved
in architecting and developing solutions on a Hadoop-based analytics
and search platform. In the past few years, he has worked on complex
data analytics related projects that utilize Hadoop, HBase, and real
time analytics. Before HappiestMinds, he worked at EMC, FairIsaac,
Cisco, IBM etc.
This Presentation….
…….will explores the design and challenges HappiestMinds faced
while implementing a storage and search infrastructure for a
library procurement system where books/documents/artifacts
related records are stored in Apache HBase. Upon bulk insert of
book records into HBase, the Elasticsearch index is built offline
using MapReduce but there are certain use cases where the
records need to be re-indexed in Elasticsearch using Region
Observer Coprocessors.
Storing and Indexing Book records from
Publishers and Libraries
Publisher/
Library Data
HDFS
HBase
Cluster
Data Pre Processing
• Data ingestion to Hadoop
Data Loading : Map Reduce
• Bulk Data upload to HBase table1
2
1
2
3 Elastic
Search
Cluster
3
Data Indexing : Map Reduce
• Incremental Data Indexing to
ElasticSearch
• Part of the document is indexed.
User
Search
4
4 User Search:
• User Search Data.
• Search engine display results.
• Full data access request fetch
from HBase.
User Update data5a
5b
5 User Update:
• User update HBase record.
• Update will propagate to Search
Cluster.
HBase Write Path
HBase Write Path
HBase Storage Layout
Region Server
………………….…….
HBase Put Request
Here comes the Coprocessors
The idea of HBase Coprocessors was inspired by Google’s Big
Table coprocessors.
• HBase coprocessors are an addition to data-manipulation
toolset that were introduced as a feature in HBase in the
0.92.0 release.
• With the introduction of coprocessors, we can push arbitrary
computation out to the HBase nodes hosting data.
• Coprocessors can be loaded globally on all tables and regions
hosted by the region server, or the administrator can specify
which coprocessors should be loaded on all regions for a table
on a per-table basis.
Coprocessors Class and Interfaces
The Coprocessor Interface
• All User code must inherit from this class
The CoprocessorEnvironement Interface
• Retain state across invocation
The CoprocessorHost interfaces
• Tied state and the user code
Observer Coprocessors
Two types of Coprocessor
• observer, which are like triggers in conventional databases.
• endpoint, dynamic RPC endpoints that resemble stored procedures.
Observer Coprocessor : Callback functions/hooks for every explicit API
method
• MasterObserver
• Hooks into HMaster API
• RegionObserver
• Hooks into Region related operations
• WALObserver
• Hooks into write-ahead log operations
RegionObserver Coprocessor … Put ( )
RegionObserver: Provides hooks for data manipulation events, Get, Put,
Delete, Scan, and so on. There is an instance of a RegionObserver
coprocessor for every table region and the scope of the observations
they can make is constrained to that region.
RegionObserver Coprocessor ... Get ( )
Let us see what is ElasticSearch
Distributed Search Engine : ElasticSearch
• Distributed
• Highly-available
• REST based search engine (on top of Lucene)
• Designed to speak JSON (JSON in, JSON out)
• Built on top of Lucene.
For each index you can specify:
• Number of shards
Each index has fixed number of shards
• Number of replicas
Each shard can have 0-many replicas, can be changed
dynamically
ElasticSearch : Automatic Discovery
Discovery Module responsible for discovering nodes within the
cluster , as well as electing master node.
The responsibility of master node is to maintain global cluster
state, and act if nodes join or leave cluster by reassigning shards.
ElasticSearch : Talking to Cluster
ElasticSearch : Nodes are Different
The idea is to perform Indexing into
ElasticSearch from HBase Coprocessors…..
We need a Java Client…
Use ElasticSearch Transport Client : The Transport Client connects
remotely to an ElasticSearch cluster. It does not join the cluster, but
simply gets one or more initial transport addresses and communicates
with them in round robin fashion on each action (though most actions
will probably be “two hop” operations).
And Index with Transport Client…
But this approach has a problem..
• Client does not have the knowledge of the ElasticSearch
cluster.
• Two Hop indexing.
• No fault tolerant mechanism if transport address is down.
• HBase Region Servers can have hundreds regions and
hence hundreds of transport client.
Solution
• Use ElasticSearch Node Client. Client Node does not hold
index but have knowledge of complete Cluster.
• Use HBASE-6505 to share Node Client across Regions in a
RegionServer.
HBase 6505
RegionCoprocessorEnvironment provides a getSharedData()
method, which returns a ConcurrentMap, which is held by
the RegionCoprocessorHost as a weak reference (in a special
map with strongly referenced keys and weakly referenced
values), and held strongly by the RegionEnvironment.
That way if the coprocessor is blacklisted the coprocessors
environment is removed, and any shared data is immediately
available for garbage collection. This shared data is per
RegionServer. As long as there is at least one region observer
or endpoint active this shared data is not garbage collected
and can be accessed to share state between the remaining
coprocessors of the same class.
Shared Node Client across Regions
Shared Node Client across Regions
The Final Problem….
Concurrency Control …
HBase Solve it using MVCC (Multi Version Concurrency Control):
Implement updates not by deleting an old piece of data and
overwriting it with a new one, but instead by making the old data as
obsolete and adding newer version
And ElasticSearch using OCC (Optimistic Concurrency Control) :
Multiple transactions can complete without affecting each other, and
that therefore transactions can proceed without locking the data
resources that they affect. Before committing, each transaction verifies
that no other transaction has modified its data. If the check reveals
conflicting modifications, the committing transaction rolls back.
Let See a Conflict.. Search and Update
HBase ES
C1
C2
V1
V1
V1(M/R)
HBase ES
C1
C2
V1
V1
V2 (Update success)
Conflict
V2(CP)
V1(M/R)
One More Conflict.. Search and Update
HBase ES
C1
C2
V1
V1
V1(M/R)V1(M/R)
HBase ES
C1
C2
V1
V1
Conflict
V2(M/R)
Conflict
The bottom line is.
Search and Update should only be successful when the
Version of ElasticSearch and Version of HBase is same
during the update.
Solution..
1. Data Load from Source to HBase will insert a document with Put call.
2. postPut coprocessor will perform incrementColumnValue for a version
column.
………………………
………………………
Solution..
3. Same Version number will be propagated to ElasticSearch during
Map Reduce based bulk indexing. ElasticSearch support version
number supplied externally.
4. Step 1-3 will repeat for any new data upload.
5. During search and update , the client will perform checkAndPut ()
call.
5i. Client perform search and get the Version number from ElasticSearch
5ii. Client construct a Put with new Version No = Old Version + 1
5iii. Client perform checkAndPut, and check for old Version number before
doing Put.
5iv. postCheckAndPut Coprocessor invoked to propagate the successful Put to
Search Cluster.
5v. After this step the Version Number of HBase column and ElasticSearch
version will be equal.
Solution..
……………………………….
Thanks
Dibyendu.B@happiestminds.com

Más contenido relacionado

La actualidad más candente

HBaseCon 2013: Project Valta - A Resource Management Layer over Apache HBase
HBaseCon 2013: Project Valta - A Resource Management Layer over Apache HBaseHBaseCon 2013: Project Valta - A Resource Management Layer over Apache HBase
HBaseCon 2013: Project Valta - A Resource Management Layer over Apache HBaseCloudera, Inc.
 
Digital Library Collection Management using HBase
Digital Library Collection Management using HBaseDigital Library Collection Management using HBase
Digital Library Collection Management using HBaseHBaseCon
 
HBaseCon 2013: Streaming Data into Apache HBase using Apache Flume: Experienc...
HBaseCon 2013: Streaming Data into Apache HBase using Apache Flume: Experienc...HBaseCon 2013: Streaming Data into Apache HBase using Apache Flume: Experienc...
HBaseCon 2013: Streaming Data into Apache HBase using Apache Flume: Experienc...Cloudera, Inc.
 
HBaseCon 2015: State of HBase Docs and How to Contribute
HBaseCon 2015: State of HBase Docs and How to ContributeHBaseCon 2015: State of HBase Docs and How to Contribute
HBaseCon 2015: State of HBase Docs and How to ContributeHBaseCon
 
HBaseCon 2013:High-Throughput, Transactional Stream Processing on Apache HBase
HBaseCon 2013:High-Throughput, Transactional Stream Processing on Apache HBase HBaseCon 2013:High-Throughput, Transactional Stream Processing on Apache HBase
HBaseCon 2013:High-Throughput, Transactional Stream Processing on Apache HBase Cloudera, Inc.
 
Keynote: The Future of Apache HBase
Keynote: The Future of Apache HBaseKeynote: The Future of Apache HBase
Keynote: The Future of Apache HBaseHBaseCon
 
HBaseCon 2012 | You’ve got HBase! How AOL Mail Handles Big Data
HBaseCon 2012 | You’ve got HBase! How AOL Mail Handles Big DataHBaseCon 2012 | You’ve got HBase! How AOL Mail Handles Big Data
HBaseCon 2012 | You’ve got HBase! How AOL Mail Handles Big DataCloudera, Inc.
 
Optimizing Apache HBase for Cloud Storage in Microsoft Azure HDInsight
Optimizing Apache HBase for Cloud Storage in Microsoft Azure HDInsightOptimizing Apache HBase for Cloud Storage in Microsoft Azure HDInsight
Optimizing Apache HBase for Cloud Storage in Microsoft Azure HDInsightHBaseCon
 
HBaseCon 2015 General Session: Zen - A Graph Data Model on HBase
HBaseCon 2015 General Session: Zen - A Graph Data Model on HBaseHBaseCon 2015 General Session: Zen - A Graph Data Model on HBase
HBaseCon 2015 General Session: Zen - A Graph Data Model on HBaseHBaseCon
 
HBase: Extreme Makeover
HBase: Extreme MakeoverHBase: Extreme Makeover
HBase: Extreme MakeoverHBaseCon
 
HBaseCon 2012 | Overcoming Data Deluge with HBase to Help Save the Environmen...
HBaseCon 2012 | Overcoming Data Deluge with HBase to Help Save the Environmen...HBaseCon 2012 | Overcoming Data Deluge with HBase to Help Save the Environmen...
HBaseCon 2012 | Overcoming Data Deluge with HBase to Help Save the Environmen...Cloudera, Inc.
 
HBaseCon 2015: HBase and Spark
HBaseCon 2015: HBase and SparkHBaseCon 2015: HBase and Spark
HBaseCon 2015: HBase and SparkHBaseCon
 
HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...
HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...
HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...Cloudera, Inc.
 
HBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, Cloudera
HBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, ClouderaHBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, Cloudera
HBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, ClouderaCloudera, Inc.
 
HBaseCon 2013: HBase SEP - Reliable Maintenance of Auxiliary Index Structures
HBaseCon 2013: HBase SEP - Reliable Maintenance of Auxiliary Index StructuresHBaseCon 2013: HBase SEP - Reliable Maintenance of Auxiliary Index Structures
HBaseCon 2013: HBase SEP - Reliable Maintenance of Auxiliary Index StructuresCloudera, Inc.
 
Real-Time Video Analytics Using Hadoop and HBase (HBaseCon 2013)
Real-Time Video Analytics Using Hadoop and HBase (HBaseCon 2013)Real-Time Video Analytics Using Hadoop and HBase (HBaseCon 2013)
Real-Time Video Analytics Using Hadoop and HBase (HBaseCon 2013)Suman Srinivasan
 
Apache HBase in the Enterprise Data Hub at Cerner
Apache HBase in the Enterprise Data Hub at CernerApache HBase in the Enterprise Data Hub at Cerner
Apache HBase in the Enterprise Data Hub at CernerHBaseCon
 
HBase: Where Online Meets Low Latency
HBase: Where Online Meets Low LatencyHBase: Where Online Meets Low Latency
HBase: Where Online Meets Low LatencyHBaseCon
 
HBaseCon 2015: Graph Processing of Stock Market Order Flow in HBase on AWS
HBaseCon 2015: Graph Processing of Stock Market Order Flow in HBase on AWSHBaseCon 2015: Graph Processing of Stock Market Order Flow in HBase on AWS
HBaseCon 2015: Graph Processing of Stock Market Order Flow in HBase on AWSHBaseCon
 

La actualidad más candente (20)

HBaseCon 2013: Project Valta - A Resource Management Layer over Apache HBase
HBaseCon 2013: Project Valta - A Resource Management Layer over Apache HBaseHBaseCon 2013: Project Valta - A Resource Management Layer over Apache HBase
HBaseCon 2013: Project Valta - A Resource Management Layer over Apache HBase
 
Digital Library Collection Management using HBase
Digital Library Collection Management using HBaseDigital Library Collection Management using HBase
Digital Library Collection Management using HBase
 
HBaseCon 2013: Streaming Data into Apache HBase using Apache Flume: Experienc...
HBaseCon 2013: Streaming Data into Apache HBase using Apache Flume: Experienc...HBaseCon 2013: Streaming Data into Apache HBase using Apache Flume: Experienc...
HBaseCon 2013: Streaming Data into Apache HBase using Apache Flume: Experienc...
 
HBaseCon 2015: State of HBase Docs and How to Contribute
HBaseCon 2015: State of HBase Docs and How to ContributeHBaseCon 2015: State of HBase Docs and How to Contribute
HBaseCon 2015: State of HBase Docs and How to Contribute
 
HBaseCon 2013:High-Throughput, Transactional Stream Processing on Apache HBase
HBaseCon 2013:High-Throughput, Transactional Stream Processing on Apache HBase HBaseCon 2013:High-Throughput, Transactional Stream Processing on Apache HBase
HBaseCon 2013:High-Throughput, Transactional Stream Processing on Apache HBase
 
Keynote: The Future of Apache HBase
Keynote: The Future of Apache HBaseKeynote: The Future of Apache HBase
Keynote: The Future of Apache HBase
 
HBaseCon 2012 | You’ve got HBase! How AOL Mail Handles Big Data
HBaseCon 2012 | You’ve got HBase! How AOL Mail Handles Big DataHBaseCon 2012 | You’ve got HBase! How AOL Mail Handles Big Data
HBaseCon 2012 | You’ve got HBase! How AOL Mail Handles Big Data
 
Optimizing Apache HBase for Cloud Storage in Microsoft Azure HDInsight
Optimizing Apache HBase for Cloud Storage in Microsoft Azure HDInsightOptimizing Apache HBase for Cloud Storage in Microsoft Azure HDInsight
Optimizing Apache HBase for Cloud Storage in Microsoft Azure HDInsight
 
HBaseCon 2015 General Session: Zen - A Graph Data Model on HBase
HBaseCon 2015 General Session: Zen - A Graph Data Model on HBaseHBaseCon 2015 General Session: Zen - A Graph Data Model on HBase
HBaseCon 2015 General Session: Zen - A Graph Data Model on HBase
 
HBase: Extreme Makeover
HBase: Extreme MakeoverHBase: Extreme Makeover
HBase: Extreme Makeover
 
HBaseCon 2012 | Overcoming Data Deluge with HBase to Help Save the Environmen...
HBaseCon 2012 | Overcoming Data Deluge with HBase to Help Save the Environmen...HBaseCon 2012 | Overcoming Data Deluge with HBase to Help Save the Environmen...
HBaseCon 2012 | Overcoming Data Deluge with HBase to Help Save the Environmen...
 
HBaseCon 2015: HBase and Spark
HBaseCon 2015: HBase and SparkHBaseCon 2015: HBase and Spark
HBaseCon 2015: HBase and Spark
 
NoSQL: Cassadra vs. HBase
NoSQL: Cassadra vs. HBaseNoSQL: Cassadra vs. HBase
NoSQL: Cassadra vs. HBase
 
HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...
HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...
HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...
 
HBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, Cloudera
HBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, ClouderaHBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, Cloudera
HBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, Cloudera
 
HBaseCon 2013: HBase SEP - Reliable Maintenance of Auxiliary Index Structures
HBaseCon 2013: HBase SEP - Reliable Maintenance of Auxiliary Index StructuresHBaseCon 2013: HBase SEP - Reliable Maintenance of Auxiliary Index Structures
HBaseCon 2013: HBase SEP - Reliable Maintenance of Auxiliary Index Structures
 
Real-Time Video Analytics Using Hadoop and HBase (HBaseCon 2013)
Real-Time Video Analytics Using Hadoop and HBase (HBaseCon 2013)Real-Time Video Analytics Using Hadoop and HBase (HBaseCon 2013)
Real-Time Video Analytics Using Hadoop and HBase (HBaseCon 2013)
 
Apache HBase in the Enterprise Data Hub at Cerner
Apache HBase in the Enterprise Data Hub at CernerApache HBase in the Enterprise Data Hub at Cerner
Apache HBase in the Enterprise Data Hub at Cerner
 
HBase: Where Online Meets Low Latency
HBase: Where Online Meets Low LatencyHBase: Where Online Meets Low Latency
HBase: Where Online Meets Low Latency
 
HBaseCon 2015: Graph Processing of Stock Market Order Flow in HBase on AWS
HBaseCon 2015: Graph Processing of Stock Market Order Flow in HBase on AWSHBaseCon 2015: Graph Processing of Stock Market Order Flow in HBase on AWS
HBaseCon 2015: Graph Processing of Stock Market Order Flow in HBase on AWS
 

Similar a HBaseCon 2013: Using Coprocessors to Index Columns in an Elasticsearch Cluster

Performance Analysis of HBASE and MONGODB
Performance Analysis of HBASE and MONGODBPerformance Analysis of HBASE and MONGODB
Performance Analysis of HBASE and MONGODBKaushik Rajan
 
Big data and Blockchain in HealthIT
Big data and Blockchain in HealthITBig data and Blockchain in HealthIT
Big data and Blockchain in HealthITDave Callaghan
 
Distributed Caching - Cache Unleashed
Distributed Caching - Cache UnleashedDistributed Caching - Cache Unleashed
Distributed Caching - Cache UnleashedAvishek Patra
 
Splice Machine Overview
Splice Machine OverviewSplice Machine Overview
Splice Machine OverviewKunal Gupta
 
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?DataWorks Summit
 
What's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoWhat's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoDataWorks Summit
 
Reference architectures shows a microservices deployed to Kubernetes
Reference architectures shows a microservices deployed to KubernetesReference architectures shows a microservices deployed to Kubernetes
Reference architectures shows a microservices deployed to KubernetesRakesh Gujjarlapudi
 
Schema-based multi-tenant architecture using Quarkus & Hibernate-ORM.pdf
Schema-based multi-tenant architecture using Quarkus & Hibernate-ORM.pdfSchema-based multi-tenant architecture using Quarkus & Hibernate-ORM.pdf
Schema-based multi-tenant architecture using Quarkus & Hibernate-ORM.pdfseo18
 
Data for all: Empowering teams with scalable Shiny applications @ useR 2019
Data for all: Empowering teams with scalable Shiny applications @ useR 2019Data for all: Empowering teams with scalable Shiny applications @ useR 2019
Data for all: Empowering teams with scalable Shiny applications @ useR 2019Ruan Pearce-Authers
 
Microservice message routing on Kubernetes
Microservice message routing on KubernetesMicroservice message routing on Kubernetes
Microservice message routing on KubernetesFrans van Buul
 
QuerySurge Slide Deck for Big Data Testing Webinar
QuerySurge Slide Deck for Big Data Testing WebinarQuerySurge Slide Deck for Big Data Testing Webinar
QuerySurge Slide Deck for Big Data Testing WebinarRTTS
 
Robust ha solutions with proxysql
Robust ha solutions with proxysqlRobust ha solutions with proxysql
Robust ha solutions with proxysqlMarco Tusa
 
OLAP Battle - SolrCloud vs. HBase: Presented by Dragan Milosevic, Zanox AG
OLAP Battle - SolrCloud vs. HBase: Presented by Dragan Milosevic, Zanox AGOLAP Battle - SolrCloud vs. HBase: Presented by Dragan Milosevic, Zanox AG
OLAP Battle - SolrCloud vs. HBase: Presented by Dragan Milosevic, Zanox AGLucidworks
 
HBase In Action - Chapter 10 - Operations
HBase In Action - Chapter 10 - OperationsHBase In Action - Chapter 10 - Operations
HBase In Action - Chapter 10 - Operationsphanleson
 
Soa interview questions (autosaved)
Soa interview questions (autosaved)Soa interview questions (autosaved)
Soa interview questions (autosaved)xavier john
 
Soa interview questions
Soa interview questionsSoa interview questions
Soa interview questionsxavier john
 
Cloudera Impala: A Modern SQL Engine for Hadoop
Cloudera Impala: A Modern SQL Engine for HadoopCloudera Impala: A Modern SQL Engine for Hadoop
Cloudera Impala: A Modern SQL Engine for HadoopCloudera, Inc.
 

Similar a HBaseCon 2013: Using Coprocessors to Index Columns in an Elasticsearch Cluster (20)

Performance Analysis of HBASE and MONGODB
Performance Analysis of HBASE and MONGODBPerformance Analysis of HBASE and MONGODB
Performance Analysis of HBASE and MONGODB
 
Hive
HiveHive
Hive
 
Hive.pptx
Hive.pptxHive.pptx
Hive.pptx
 
Big data and Blockchain in HealthIT
Big data and Blockchain in HealthITBig data and Blockchain in HealthIT
Big data and Blockchain in HealthIT
 
Distributed Caching - Cache Unleashed
Distributed Caching - Cache UnleashedDistributed Caching - Cache Unleashed
Distributed Caching - Cache Unleashed
 
Splice Machine Overview
Splice Machine OverviewSplice Machine Overview
Splice Machine Overview
 
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?
 
What's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoWhat's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - Tokyo
 
Reference architectures shows a microservices deployed to Kubernetes
Reference architectures shows a microservices deployed to KubernetesReference architectures shows a microservices deployed to Kubernetes
Reference architectures shows a microservices deployed to Kubernetes
 
Hive_Pig.pptx
Hive_Pig.pptxHive_Pig.pptx
Hive_Pig.pptx
 
Schema-based multi-tenant architecture using Quarkus & Hibernate-ORM.pdf
Schema-based multi-tenant architecture using Quarkus & Hibernate-ORM.pdfSchema-based multi-tenant architecture using Quarkus & Hibernate-ORM.pdf
Schema-based multi-tenant architecture using Quarkus & Hibernate-ORM.pdf
 
Data for all: Empowering teams with scalable Shiny applications @ useR 2019
Data for all: Empowering teams with scalable Shiny applications @ useR 2019Data for all: Empowering teams with scalable Shiny applications @ useR 2019
Data for all: Empowering teams with scalable Shiny applications @ useR 2019
 
Microservice message routing on Kubernetes
Microservice message routing on KubernetesMicroservice message routing on Kubernetes
Microservice message routing on Kubernetes
 
QuerySurge Slide Deck for Big Data Testing Webinar
QuerySurge Slide Deck for Big Data Testing WebinarQuerySurge Slide Deck for Big Data Testing Webinar
QuerySurge Slide Deck for Big Data Testing Webinar
 
Robust ha solutions with proxysql
Robust ha solutions with proxysqlRobust ha solutions with proxysql
Robust ha solutions with proxysql
 
OLAP Battle - SolrCloud vs. HBase: Presented by Dragan Milosevic, Zanox AG
OLAP Battle - SolrCloud vs. HBase: Presented by Dragan Milosevic, Zanox AGOLAP Battle - SolrCloud vs. HBase: Presented by Dragan Milosevic, Zanox AG
OLAP Battle - SolrCloud vs. HBase: Presented by Dragan Milosevic, Zanox AG
 
HBase In Action - Chapter 10 - Operations
HBase In Action - Chapter 10 - OperationsHBase In Action - Chapter 10 - Operations
HBase In Action - Chapter 10 - Operations
 
Soa interview questions (autosaved)
Soa interview questions (autosaved)Soa interview questions (autosaved)
Soa interview questions (autosaved)
 
Soa interview questions
Soa interview questionsSoa interview questions
Soa interview questions
 
Cloudera Impala: A Modern SQL Engine for Hadoop
Cloudera Impala: A Modern SQL Engine for HadoopCloudera Impala: A Modern SQL Engine for Hadoop
Cloudera Impala: A Modern SQL Engine for Hadoop
 

Más de Cloudera, Inc.

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxCloudera, Inc.
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera, Inc.
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards FinalistsCloudera, Inc.
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Cloudera, Inc.
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Cloudera, Inc.
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Cloudera, Inc.
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Cloudera, Inc.
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Cloudera, Inc.
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Cloudera, Inc.
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Cloudera, Inc.
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Cloudera, Inc.
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Cloudera, Inc.
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformCloudera, Inc.
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Cloudera, Inc.
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Cloudera, Inc.
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Cloudera, Inc.
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Cloudera, Inc.
 

Más de Cloudera, Inc. (20)

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptx
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the Platform
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
 

Último

Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxfnnc6jmgwh
 
Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#Karmanjay Verma
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024TopCSSGallery
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Nikki Chapple
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
Landscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfLandscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfAarwolf Industries LLC
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical InfrastructureVarsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructureitnewsafrica
 

Último (20)

Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
 
Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
Landscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfLandscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdf
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical InfrastructureVarsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
 

HBaseCon 2013: Using Coprocessors to Index Columns in an Elasticsearch Cluster

  • 1. HBase Coprocessor to Index Columns into ElasticSearch Cluster Dibyendu Bhattacharya Architect – Big Data Analytics HappiestMinds
  • 2. About HappiestMinds • Next Gen IT Consultancy Company launched Aug 2011 . Head office in Bangalore, India, have offices in USA, UK, Canada, Australia and Singapore. Core focus on disruptive technologies like Big Data/Analytics, Cloud, Mobile and Social. • Raised USD 45M Series A Funding from prominent VCs , Intel Capital, Canaan Partners and founders. • 45 + Client Globally, 800 + Employees. About Myself : Dibyendu is Big Data Architect at HappiestMinds where he is involved in architecting and developing solutions on a Hadoop-based analytics and search platform. In the past few years, he has worked on complex data analytics related projects that utilize Hadoop, HBase, and real time analytics. Before HappiestMinds, he worked at EMC, FairIsaac, Cisco, IBM etc.
  • 3. This Presentation…. …….will explores the design and challenges HappiestMinds faced while implementing a storage and search infrastructure for a library procurement system where books/documents/artifacts related records are stored in Apache HBase. Upon bulk insert of book records into HBase, the Elasticsearch index is built offline using MapReduce but there are certain use cases where the records need to be re-indexed in Elasticsearch using Region Observer Coprocessors.
  • 4. Storing and Indexing Book records from Publishers and Libraries Publisher/ Library Data HDFS HBase Cluster Data Pre Processing • Data ingestion to Hadoop Data Loading : Map Reduce • Bulk Data upload to HBase table1 2 1 2 3 Elastic Search Cluster 3 Data Indexing : Map Reduce • Incremental Data Indexing to ElasticSearch • Part of the document is indexed. User Search 4 4 User Search: • User Search Data. • Search engine display results. • Full data access request fetch from HBase. User Update data5a 5b 5 User Update: • User update HBase record. • Update will propagate to Search Cluster.
  • 7. HBase Storage Layout Region Server ………………….…….
  • 9. Here comes the Coprocessors The idea of HBase Coprocessors was inspired by Google’s Big Table coprocessors. • HBase coprocessors are an addition to data-manipulation toolset that were introduced as a feature in HBase in the 0.92.0 release. • With the introduction of coprocessors, we can push arbitrary computation out to the HBase nodes hosting data. • Coprocessors can be loaded globally on all tables and regions hosted by the region server, or the administrator can specify which coprocessors should be loaded on all regions for a table on a per-table basis.
  • 10. Coprocessors Class and Interfaces The Coprocessor Interface • All User code must inherit from this class The CoprocessorEnvironement Interface • Retain state across invocation The CoprocessorHost interfaces • Tied state and the user code
  • 11. Observer Coprocessors Two types of Coprocessor • observer, which are like triggers in conventional databases. • endpoint, dynamic RPC endpoints that resemble stored procedures. Observer Coprocessor : Callback functions/hooks for every explicit API method • MasterObserver • Hooks into HMaster API • RegionObserver • Hooks into Region related operations • WALObserver • Hooks into write-ahead log operations
  • 12. RegionObserver Coprocessor … Put ( ) RegionObserver: Provides hooks for data manipulation events, Get, Put, Delete, Scan, and so on. There is an instance of a RegionObserver coprocessor for every table region and the scope of the observations they can make is constrained to that region.
  • 14. Let us see what is ElasticSearch
  • 15. Distributed Search Engine : ElasticSearch • Distributed • Highly-available • REST based search engine (on top of Lucene) • Designed to speak JSON (JSON in, JSON out) • Built on top of Lucene. For each index you can specify: • Number of shards Each index has fixed number of shards • Number of replicas Each shard can have 0-many replicas, can be changed dynamically
  • 16. ElasticSearch : Automatic Discovery Discovery Module responsible for discovering nodes within the cluster , as well as electing master node. The responsibility of master node is to maintain global cluster state, and act if nodes join or leave cluster by reassigning shards.
  • 18. ElasticSearch : Nodes are Different
  • 19. The idea is to perform Indexing into ElasticSearch from HBase Coprocessors…..
  • 20. We need a Java Client… Use ElasticSearch Transport Client : The Transport Client connects remotely to an ElasticSearch cluster. It does not join the cluster, but simply gets one or more initial transport addresses and communicates with them in round robin fashion on each action (though most actions will probably be “two hop” operations).
  • 21. And Index with Transport Client…
  • 22. But this approach has a problem.. • Client does not have the knowledge of the ElasticSearch cluster. • Two Hop indexing. • No fault tolerant mechanism if transport address is down. • HBase Region Servers can have hundreds regions and hence hundreds of transport client. Solution • Use ElasticSearch Node Client. Client Node does not hold index but have knowledge of complete Cluster. • Use HBASE-6505 to share Node Client across Regions in a RegionServer.
  • 23. HBase 6505 RegionCoprocessorEnvironment provides a getSharedData() method, which returns a ConcurrentMap, which is held by the RegionCoprocessorHost as a weak reference (in a special map with strongly referenced keys and weakly referenced values), and held strongly by the RegionEnvironment. That way if the coprocessor is blacklisted the coprocessors environment is removed, and any shared data is immediately available for garbage collection. This shared data is per RegionServer. As long as there is at least one region observer or endpoint active this shared data is not garbage collected and can be accessed to share state between the remaining coprocessors of the same class.
  • 24. Shared Node Client across Regions
  • 25. Shared Node Client across Regions
  • 26. The Final Problem…. Concurrency Control … HBase Solve it using MVCC (Multi Version Concurrency Control): Implement updates not by deleting an old piece of data and overwriting it with a new one, but instead by making the old data as obsolete and adding newer version And ElasticSearch using OCC (Optimistic Concurrency Control) : Multiple transactions can complete without affecting each other, and that therefore transactions can proceed without locking the data resources that they affect. Before committing, each transaction verifies that no other transaction has modified its data. If the check reveals conflicting modifications, the committing transaction rolls back.
  • 27. Let See a Conflict.. Search and Update HBase ES C1 C2 V1 V1 V1(M/R) HBase ES C1 C2 V1 V1 V2 (Update success) Conflict V2(CP) V1(M/R)
  • 28. One More Conflict.. Search and Update HBase ES C1 C2 V1 V1 V1(M/R)V1(M/R) HBase ES C1 C2 V1 V1 Conflict V2(M/R) Conflict
  • 29. The bottom line is. Search and Update should only be successful when the Version of ElasticSearch and Version of HBase is same during the update.
  • 30. Solution.. 1. Data Load from Source to HBase will insert a document with Put call. 2. postPut coprocessor will perform incrementColumnValue for a version column. ……………………… ………………………
  • 31. Solution.. 3. Same Version number will be propagated to ElasticSearch during Map Reduce based bulk indexing. ElasticSearch support version number supplied externally. 4. Step 1-3 will repeat for any new data upload. 5. During search and update , the client will perform checkAndPut () call. 5i. Client perform search and get the Version number from ElasticSearch 5ii. Client construct a Put with new Version No = Old Version + 1 5iii. Client perform checkAndPut, and check for old Version number before doing Put. 5iv. postCheckAndPut Coprocessor invoked to propagate the successful Put to Search Cluster. 5v. After this step the Version Number of HBase column and ElasticSearch version will be equal.