SlideShare una empresa de Scribd logo
1 de 32
SEARCHING BILLIONS
OF PRODUCT LOGS IN
REALTIME
RyanTabora -Think Big Analytics
NoSQL Search Roadshow - June 6, 2013
WHO AM I?
RyanTabora
Think Big Analytics - Senior Data Engineer
Lover of dachshunds, bass, and zombies
OVERVIEW
Primers
What are product logs?
How do they apply to big data?
Real use case
Real issues and designs
Conclusion
PRODUCT LOGS?
• Device data
• IT, Energy, Healthcare, Manufacturing,Telecom ...
• These devices are pushing data back home (pull works too!)
• As more devices are sold/installed, more and more data
comes back to ‘home base’
• RealtimeVisualization
• Realtime Response
• Ad Hoc Analysis
• Full Historical Capture
• Blended Data Sets
POWER OF DEVICE DATA
DEVICE
DATA
TRADITIONAL APPROACHES
• SQL: PostGres, MySQL, Oracle, Microsoft
• SQL provides many of the search features required for typical
search applications
• Joins, regex, group by, sorting, etc
• But the these technologies can only scale so far...
• Hadoop
• HBase/Cassandra/Accumulo
• Search features are very limited
• HBase row scans, primary key index
• Cassandra limited secondary indexing
NEWTECHNIQUES
STORING DATA
• What is an index?
• Lucene
• Paralleling Index Creation
• MapReduce/Flume/Storm
• RealTime Search
• Searching before it hits disk
NEWTECHNIQUES
INDEXING DATA
• Solr/ElasticSearch
• Both build on top of Lucene
• Search servers
• RESTful HTTP APIs
• Easy to administer
• Add powerful text/numerical search capabilities
NEWTECHNIQUES
SEARCHING DATA
BASIC SEARCH FEATURES
• Boolean logic (AND, OR + -)
• Sorting and Group By
• Range queries
• Phrase/Prefix/Fuzzy queries
ADVANCED SEARCH
FEATURES
• Custom ranking/scoring
• More like this
• Auto suggest
• Faceting/Highlighting
• Geo-spacial search
SCALING SEARCH
• ElasticSearch and SolrCloud both have distributed features
built in
• Auto-sharding
• Replication
• Query routing
• Transaction log
USE CASE
Problem
Sample Solution
Core Design Issues
Other Solutions
THE PROBLEM
Home
Base
NetApp
Filer
NetApp
FilerDevice
Client A
NetApp
Filer
NetApp
FilerDevice
Client B
NetApp
Filer
NetApp
FilerDevice
Client C
Log
Log
Log
Logs
REST
API
Full SQL
Access
Flat File
Access
Latest
All
Applications
Engineers
& Analysts
SEARCH APPLICATION
FEATURES
• Find last three days of raw logs from an entire cluster
• Group capacity available grouped by machine serial number
and show the largest capacities first
• Search all device header lines for “FAILURE”
• View all hard disk objects that have product number 2341AB
• Find all motherboards with an associated customer ticket
SAMPLE SOLUTION
Logs
Ingestion
Parsing/Loading
CustomRESTfulSearchAPI
QueriesIndexing
HDFS
MapReduce
INGESTION
PARSING, LOADING,AND
INDEXING
Load HBase with parsed objects
Store HBase ROW_ID
Store pointer to raw file in HDFS
Index a number of desired fields
/ingestion/sequencefile1
/ingestion/sequencefile2
/ingestion/sequencefile3
1534 4562 5323 7232
4601 5105
0
0
0 1492 2987 4767 5987
INSIDE OF HBASE
...... ......
...... ...
...... .........
object5
…...
object4
...
object3object2
...
object1rowkey
THE SOLR DOCUMENT
2343sfOffset
/ingest/file2sequenceFile
1333-2241-3411cluster_id
42ADFF-BZMM
...
configs.log
...
2013/05/12
WARNING: DISK DEAD
...
header
contents
file_name
date_sent
system_id
rowkey
Solr Document
SEARCH APPLICATION
Search Application
Query
Data locations
Stored Object
User
Query Results
2
1
3
4
5
8
6
7
Raw Data
Location
Raw Data
Row_id
CORE DESIGN ISSUES
• Changing the Solr schema (manual reindex)
• Elastic shard scaling (manual reindex)
• No distributed joining (denormalizing the data)
• Replication*
• Manually managing Solr partitioning/sharding*
• Write durability*
SOLRCLOUD
• Automatic shard creation, routing
• Replication
• Limited to a fixed number of shards defined on initial creation
• ZooKeeper for coordination
• Large community
ELASTICSEARCH
• Similar feature set to Solr
• Purpose built for easily managing a distributed index
• Rapidly growing community
• Custom built coordination mechanism
• JSON based API
• Integrates Cassandra and Solr
• Automatic indexing in Solr/storing in Cassandra
• Automatic partitioning
• Automatic reindexing
• Not limited to fixed number of shards
• Proprietary and costs money
DATASTAX ENTERPRISE
+
• Collecting and analyzing device data/product logs can be a
very difficult challenge
• You can use NoSQL and search technologies like Solr or
ElasticSearch in unison...
• ...but it is not always easy to integrate search with NoSQL
CONCLUSION
QUESTIONS?
• Feel free to reach out if you have any questions or need help
with big data/search!
• http://ryantabora.com
• http://thinkbiganalytics.com
• @ryantabora
• ryan.tabora@thinkbiganalytics.com
BONUS SLIDES
HBASE AND SOLR
• Automatic partitioning/reindexing
• Automatic index updates on HBase inserts/deletes
• Mapping HBase cells to a Solr schema
• No perfect commercial/open source solution yet
• Many many many more...
HBASE + SOLR
AUTOMATIC INDEXING
• HBase coprocessors are like storedprocs/triggers
• New, powerful, and dangerous
• Triggers on HBase puts/deletes
• Mapping data to a schema?
HBASE + SOLR
WRITE DURABILITY
Solr
Shard 1
Solr
Shard 3
Solr
Shard 2
HBase Table - SOLR_QUEUE
MapReduce Indexing Application
Solr Queue
Reader
Create SolrDocument from raw ASUP1
Get oldest SolrDocument from HBase
Queue Table
3
Use custom hash algorithm to determine
which shard to add SolrDocument to
4
Query Solr, if SolrDocument
was added, then remove it
from the SOLR_QUEUE
5
SolrDocument
SolrDocument
Add SolrDocument to HBase for durability2
HBASE + SOLR
ELASTIC SHARDING
• HBase’s distributing mechanism uses the concept of regions to
split data across many nodes
• Region splitting can be automatic or manual (performance
degradation as regions split)
• Piggybacking Solr sharding on HBase Region splitting

Más contenido relacionado

La actualidad más candente

Machine Learning Data Lineage with MLflow and Delta Lake
Machine Learning Data Lineage with MLflow and Delta LakeMachine Learning Data Lineage with MLflow and Delta Lake
Machine Learning Data Lineage with MLflow and Delta LakeDatabricks
 
A lap around microsofts business intelligence platform
A lap around microsofts business intelligence platformA lap around microsofts business intelligence platform
A lap around microsofts business intelligence platformIke Ellis
 
Uber's data science workbench
Uber's data science workbenchUber's data science workbench
Uber's data science workbenchRan Wei
 
Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...DataWorks Summit/Hadoop Summit
 
Building a Big Data Pipeline
Building a Big Data PipelineBuilding a Big Data Pipeline
Building a Big Data PipelineJesus Rodriguez
 
Self-Service BI for big data applications using Apache Drill (Big Data Amster...
Self-Service BI for big data applications using Apache Drill (Big Data Amster...Self-Service BI for big data applications using Apache Drill (Big Data Amster...
Self-Service BI for big data applications using Apache Drill (Big Data Amster...Mats Uddenfeldt
 
Self-Service BI for big data applications using Apache Drill (Big Data Amster...
Self-Service BI for big data applications using Apache Drill (Big Data Amster...Self-Service BI for big data applications using Apache Drill (Big Data Amster...
Self-Service BI for big data applications using Apache Drill (Big Data Amster...Dataconomy Media
 
Bigdata antipatterns
Bigdata antipatternsBigdata antipatterns
Bigdata antipatternsAnurag S
 
Using Hadoop to build a Data Quality Service for both real-time and batch data
Using Hadoop to build a Data Quality Service for both real-time and batch dataUsing Hadoop to build a Data Quality Service for both real-time and batch data
Using Hadoop to build a Data Quality Service for both real-time and batch dataDataWorks Summit/Hadoop Summit
 
Spark, Tachyon and Mesos internals
Spark, Tachyon and Mesos internalsSpark, Tachyon and Mesos internals
Spark, Tachyon and Mesos internalsClaudiu Barbura
 
Scala and Spark are Ideal for Big Data - Data Science Pop-up Seattle
Scala and Spark are Ideal for Big Data - Data Science Pop-up SeattleScala and Spark are Ideal for Big Data - Data Science Pop-up Seattle
Scala and Spark are Ideal for Big Data - Data Science Pop-up SeattleDomino Data Lab
 
ALLUXIO (formerly Tachyon): Unify Data at Memory Speed - Effective using Spar...
ALLUXIO (formerly Tachyon): Unify Data at Memory Speed - Effective using Spar...ALLUXIO (formerly Tachyon): Unify Data at Memory Speed - Effective using Spar...
ALLUXIO (formerly Tachyon): Unify Data at Memory Speed - Effective using Spar...Alluxio, Inc.
 
The Last Mile: Challenges and Opportunities in Data Tools (Strata 2014)
The Last Mile: Challenges and Opportunities in Data Tools (Strata 2014)The Last Mile: Challenges and Opportunities in Data Tools (Strata 2014)
The Last Mile: Challenges and Opportunities in Data Tools (Strata 2014)DataPad Inc.
 
Open Source DataViz with Apache Superset
Open Source DataViz with Apache SupersetOpen Source DataViz with Apache Superset
Open Source DataViz with Apache SupersetCarl W. Handlin
 
Bi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in LondonBi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in LondonDremio Corporation
 
Data Science at Scale by Sarah Guido
Data Science at Scale by Sarah GuidoData Science at Scale by Sarah Guido
Data Science at Scale by Sarah GuidoSpark Summit
 
Future of pandas
Future of pandasFuture of pandas
Future of pandasJeff Reback
 
Advanced Analytics and Big Data (August 2014)
Advanced Analytics and Big Data (August 2014)Advanced Analytics and Big Data (August 2014)
Advanced Analytics and Big Data (August 2014)Thomas W. Dinsmore
 
Adf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriar
Adf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriarAdf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriar
Adf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriarNilesh Shah
 

La actualidad más candente (20)

Apache Spark in Industry
Apache Spark in IndustryApache Spark in Industry
Apache Spark in Industry
 
Machine Learning Data Lineage with MLflow and Delta Lake
Machine Learning Data Lineage with MLflow and Delta LakeMachine Learning Data Lineage with MLflow and Delta Lake
Machine Learning Data Lineage with MLflow and Delta Lake
 
A lap around microsofts business intelligence platform
A lap around microsofts business intelligence platformA lap around microsofts business intelligence platform
A lap around microsofts business intelligence platform
 
Uber's data science workbench
Uber's data science workbenchUber's data science workbench
Uber's data science workbench
 
Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...
 
Building a Big Data Pipeline
Building a Big Data PipelineBuilding a Big Data Pipeline
Building a Big Data Pipeline
 
Self-Service BI for big data applications using Apache Drill (Big Data Amster...
Self-Service BI for big data applications using Apache Drill (Big Data Amster...Self-Service BI for big data applications using Apache Drill (Big Data Amster...
Self-Service BI for big data applications using Apache Drill (Big Data Amster...
 
Self-Service BI for big data applications using Apache Drill (Big Data Amster...
Self-Service BI for big data applications using Apache Drill (Big Data Amster...Self-Service BI for big data applications using Apache Drill (Big Data Amster...
Self-Service BI for big data applications using Apache Drill (Big Data Amster...
 
Bigdata antipatterns
Bigdata antipatternsBigdata antipatterns
Bigdata antipatterns
 
Using Hadoop to build a Data Quality Service for both real-time and batch data
Using Hadoop to build a Data Quality Service for both real-time and batch dataUsing Hadoop to build a Data Quality Service for both real-time and batch data
Using Hadoop to build a Data Quality Service for both real-time and batch data
 
Spark, Tachyon and Mesos internals
Spark, Tachyon and Mesos internalsSpark, Tachyon and Mesos internals
Spark, Tachyon and Mesos internals
 
Scala and Spark are Ideal for Big Data - Data Science Pop-up Seattle
Scala and Spark are Ideal for Big Data - Data Science Pop-up SeattleScala and Spark are Ideal for Big Data - Data Science Pop-up Seattle
Scala and Spark are Ideal for Big Data - Data Science Pop-up Seattle
 
ALLUXIO (formerly Tachyon): Unify Data at Memory Speed - Effective using Spar...
ALLUXIO (formerly Tachyon): Unify Data at Memory Speed - Effective using Spar...ALLUXIO (formerly Tachyon): Unify Data at Memory Speed - Effective using Spar...
ALLUXIO (formerly Tachyon): Unify Data at Memory Speed - Effective using Spar...
 
The Last Mile: Challenges and Opportunities in Data Tools (Strata 2014)
The Last Mile: Challenges and Opportunities in Data Tools (Strata 2014)The Last Mile: Challenges and Opportunities in Data Tools (Strata 2014)
The Last Mile: Challenges and Opportunities in Data Tools (Strata 2014)
 
Open Source DataViz with Apache Superset
Open Source DataViz with Apache SupersetOpen Source DataViz with Apache Superset
Open Source DataViz with Apache Superset
 
Bi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in LondonBi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in London
 
Data Science at Scale by Sarah Guido
Data Science at Scale by Sarah GuidoData Science at Scale by Sarah Guido
Data Science at Scale by Sarah Guido
 
Future of pandas
Future of pandasFuture of pandas
Future of pandas
 
Advanced Analytics and Big Data (August 2014)
Advanced Analytics and Big Data (August 2014)Advanced Analytics and Big Data (August 2014)
Advanced Analytics and Big Data (August 2014)
 
Adf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriar
Adf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriarAdf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriar
Adf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriar
 

Similar a Searching Billions of Product Logs in Real Time (Use Case)

20160331 sa introduction to big data pipelining berlin meetup 0.3
20160331 sa introduction to big data pipelining berlin meetup   0.320160331 sa introduction to big data pipelining berlin meetup   0.3
20160331 sa introduction to big data pipelining berlin meetup 0.3Simon Ambridge
 
Demystifying data engineering
Demystifying data engineeringDemystifying data engineering
Demystifying data engineeringThang Bui (Bob)
 
How to use Big Data and Data Lake concept in business using Hadoop and Spark...
 How to use Big Data and Data Lake concept in business using Hadoop and Spark... How to use Big Data and Data Lake concept in business using Hadoop and Spark...
How to use Big Data and Data Lake concept in business using Hadoop and Spark...Institute of Contemporary Sciences
 
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in ProductionTugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in ProductionCodemotion
 
Unifying your data management with Hadoop
Unifying your data management with HadoopUnifying your data management with Hadoop
Unifying your data management with HadoopJayant Shekhar
 
No sql and sql - open analytics summit
No sql and sql - open analytics summitNo sql and sql - open analytics summit
No sql and sql - open analytics summitOpen Analytics
 
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...Fwdays
 
Sebastian Cohnen – Building a Startup with NoSQL - NoSQL matters Barcelona 2014
Sebastian Cohnen – Building a Startup with NoSQL - NoSQL matters Barcelona 2014Sebastian Cohnen – Building a Startup with NoSQL - NoSQL matters Barcelona 2014
Sebastian Cohnen – Building a Startup with NoSQL - NoSQL matters Barcelona 2014NoSQLmatters
 
Move your on prem data to a lake in a Lake in Cloud
Move your on prem data to a lake in a Lake in CloudMove your on prem data to a lake in a Lake in Cloud
Move your on prem data to a lake in a Lake in CloudCAMMS
 
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...Michael Rys
 
Basic Introduction to Crate @ ViennaDB Meetup
Basic Introduction to Crate @ ViennaDB MeetupBasic Introduction to Crate @ ViennaDB Meetup
Basic Introduction to Crate @ ViennaDB MeetupJohannes Moser
 
Introducing NoSQL and MongoDB to complement Relational Databases (AMIS SIG 14...
Introducing NoSQL and MongoDB to complement Relational Databases (AMIS SIG 14...Introducing NoSQL and MongoDB to complement Relational Databases (AMIS SIG 14...
Introducing NoSQL and MongoDB to complement Relational Databases (AMIS SIG 14...Lucas Jellema
 
Elasticsearch and the Database Market
Elasticsearch and the Database MarketElasticsearch and the Database Market
Elasticsearch and the Database MarketObjectRocket
 
SQL, NoSQL, BigData in Data Architecture
SQL, NoSQL, BigData in Data ArchitectureSQL, NoSQL, BigData in Data Architecture
SQL, NoSQL, BigData in Data ArchitectureVenu Anuganti
 
Elastic Stack Introduction
Elastic Stack IntroductionElastic Stack Introduction
Elastic Stack IntroductionVikram Shinde
 
Architectures styles and deployment on the hadoop
Architectures styles and deployment on the hadoopArchitectures styles and deployment on the hadoop
Architectures styles and deployment on the hadoopAnu Ravindranath
 

Similar a Searching Billions of Product Logs in Real Time (Use Case) (20)

20160331 sa introduction to big data pipelining berlin meetup 0.3
20160331 sa introduction to big data pipelining berlin meetup   0.320160331 sa introduction to big data pipelining berlin meetup   0.3
20160331 sa introduction to big data pipelining berlin meetup 0.3
 
Demystifying data engineering
Demystifying data engineeringDemystifying data engineering
Demystifying data engineering
 
How to use Big Data and Data Lake concept in business using Hadoop and Spark...
 How to use Big Data and Data Lake concept in business using Hadoop and Spark... How to use Big Data and Data Lake concept in business using Hadoop and Spark...
How to use Big Data and Data Lake concept in business using Hadoop and Spark...
 
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in ProductionTugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
 
Unifying your data management with Hadoop
Unifying your data management with HadoopUnifying your data management with Hadoop
Unifying your data management with Hadoop
 
No sql and sql - open analytics summit
No sql and sql - open analytics summitNo sql and sql - open analytics summit
No sql and sql - open analytics summit
 
Apache drill
Apache drillApache drill
Apache drill
 
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
 
Elasticsearch Introduction at BigData meetup
Elasticsearch Introduction at BigData meetupElasticsearch Introduction at BigData meetup
Elasticsearch Introduction at BigData meetup
 
Sebastian Cohnen – Building a Startup with NoSQL - NoSQL matters Barcelona 2014
Sebastian Cohnen – Building a Startup with NoSQL - NoSQL matters Barcelona 2014Sebastian Cohnen – Building a Startup with NoSQL - NoSQL matters Barcelona 2014
Sebastian Cohnen – Building a Startup with NoSQL - NoSQL matters Barcelona 2014
 
Move your on prem data to a lake in a Lake in Cloud
Move your on prem data to a lake in a Lake in CloudMove your on prem data to a lake in a Lake in Cloud
Move your on prem data to a lake in a Lake in Cloud
 
Intro to Big Data
Intro to Big DataIntro to Big Data
Intro to Big Data
 
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...
 
Basic Introduction to Crate @ ViennaDB Meetup
Basic Introduction to Crate @ ViennaDB MeetupBasic Introduction to Crate @ ViennaDB Meetup
Basic Introduction to Crate @ ViennaDB Meetup
 
Introducing NoSQL and MongoDB to complement Relational Databases (AMIS SIG 14...
Introducing NoSQL and MongoDB to complement Relational Databases (AMIS SIG 14...Introducing NoSQL and MongoDB to complement Relational Databases (AMIS SIG 14...
Introducing NoSQL and MongoDB to complement Relational Databases (AMIS SIG 14...
 
Elasticsearch and the Database Market
Elasticsearch and the Database MarketElasticsearch and the Database Market
Elasticsearch and the Database Market
 
SQL, NoSQL, BigData in Data Architecture
SQL, NoSQL, BigData in Data ArchitectureSQL, NoSQL, BigData in Data Architecture
SQL, NoSQL, BigData in Data Architecture
 
Elastic Stack Introduction
Elastic Stack IntroductionElastic Stack Introduction
Elastic Stack Introduction
 
Architecting Your First Big Data Implementation
Architecting Your First Big Data ImplementationArchitecting Your First Big Data Implementation
Architecting Your First Big Data Implementation
 
Architectures styles and deployment on the hadoop
Architectures styles and deployment on the hadoopArchitectures styles and deployment on the hadoop
Architectures styles and deployment on the hadoop
 

Último

Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Brian Pichman
 
Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfinfogdgmi
 
Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.YounusS2
 
Introduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxIntroduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxMatsuo Lab
 
Comparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioComparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioChristian Posta
 
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAshyamraj55
 
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online CollaborationCOMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online Collaborationbruanjhuli
 
VoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBXVoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBXTarek Kalaji
 
Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Commit University
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Websitedgelyza
 
9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding TeamAdam Moalla
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsSeth Reyes
 
20230202 - Introduction to tis-py
20230202 - Introduction to tis-py20230202 - Introduction to tis-py
20230202 - Introduction to tis-pyJamie (Taka) Wang
 
AI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarAI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarPrecisely
 
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesAI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesMd Hossain Ali
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintMahmoud Rabie
 
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...Aggregage
 
Building AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptxBuilding AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptxUdaiappa Ramachandran
 
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfDaniel Santiago Silva Capera
 

Último (20)

Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )
 
Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdf
 
201610817 - edge part1
201610817 - edge part1201610817 - edge part1
201610817 - edge part1
 
Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.
 
Introduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxIntroduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptx
 
Comparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioComparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and Istio
 
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
 
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online CollaborationCOMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
 
VoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBXVoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBX
 
Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Website
 
9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and Hazards
 
20230202 - Introduction to tis-py
20230202 - Introduction to tis-py20230202 - Introduction to tis-py
20230202 - Introduction to tis-py
 
AI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarAI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity Webinar
 
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesAI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership Blueprint
 
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
 
Building AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptxBuilding AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptx
 
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
 

Searching Billions of Product Logs in Real Time (Use Case)

  • 1. SEARCHING BILLIONS OF PRODUCT LOGS IN REALTIME RyanTabora -Think Big Analytics NoSQL Search Roadshow - June 6, 2013
  • 2. WHO AM I? RyanTabora Think Big Analytics - Senior Data Engineer Lover of dachshunds, bass, and zombies
  • 3. OVERVIEW Primers What are product logs? How do they apply to big data? Real use case Real issues and designs Conclusion
  • 4. PRODUCT LOGS? • Device data • IT, Energy, Healthcare, Manufacturing,Telecom ... • These devices are pushing data back home (pull works too!) • As more devices are sold/installed, more and more data comes back to ‘home base’
  • 5. • RealtimeVisualization • Realtime Response • Ad Hoc Analysis • Full Historical Capture • Blended Data Sets POWER OF DEVICE DATA DEVICE DATA
  • 6. TRADITIONAL APPROACHES • SQL: PostGres, MySQL, Oracle, Microsoft • SQL provides many of the search features required for typical search applications • Joins, regex, group by, sorting, etc • But the these technologies can only scale so far...
  • 7. • Hadoop • HBase/Cassandra/Accumulo • Search features are very limited • HBase row scans, primary key index • Cassandra limited secondary indexing NEWTECHNIQUES STORING DATA
  • 8. • What is an index? • Lucene • Paralleling Index Creation • MapReduce/Flume/Storm • RealTime Search • Searching before it hits disk NEWTECHNIQUES INDEXING DATA
  • 9. • Solr/ElasticSearch • Both build on top of Lucene • Search servers • RESTful HTTP APIs • Easy to administer • Add powerful text/numerical search capabilities NEWTECHNIQUES SEARCHING DATA
  • 10. BASIC SEARCH FEATURES • Boolean logic (AND, OR + -) • Sorting and Group By • Range queries • Phrase/Prefix/Fuzzy queries
  • 11. ADVANCED SEARCH FEATURES • Custom ranking/scoring • More like this • Auto suggest • Faceting/Highlighting • Geo-spacial search
  • 12. SCALING SEARCH • ElasticSearch and SolrCloud both have distributed features built in • Auto-sharding • Replication • Query routing • Transaction log
  • 13. USE CASE Problem Sample Solution Core Design Issues Other Solutions
  • 14. THE PROBLEM Home Base NetApp Filer NetApp FilerDevice Client A NetApp Filer NetApp FilerDevice Client B NetApp Filer NetApp FilerDevice Client C Log Log Log Logs REST API Full SQL Access Flat File Access Latest All Applications Engineers & Analysts
  • 15. SEARCH APPLICATION FEATURES • Find last three days of raw logs from an entire cluster • Group capacity available grouped by machine serial number and show the largest capacities first • Search all device header lines for “FAILURE” • View all hard disk objects that have product number 2341AB • Find all motherboards with an associated customer ticket
  • 18. PARSING, LOADING,AND INDEXING Load HBase with parsed objects Store HBase ROW_ID Store pointer to raw file in HDFS Index a number of desired fields /ingestion/sequencefile1 /ingestion/sequencefile2 /ingestion/sequencefile3 1534 4562 5323 7232 4601 5105 0 0 0 1492 2987 4767 5987
  • 19. INSIDE OF HBASE ...... ...... ...... ... ...... ......... object5 …... object4 ... object3object2 ... object1rowkey
  • 21. SEARCH APPLICATION Search Application Query Data locations Stored Object User Query Results 2 1 3 4 5 8 6 7 Raw Data Location Raw Data Row_id
  • 22. CORE DESIGN ISSUES • Changing the Solr schema (manual reindex) • Elastic shard scaling (manual reindex) • No distributed joining (denormalizing the data) • Replication* • Manually managing Solr partitioning/sharding* • Write durability*
  • 23. SOLRCLOUD • Automatic shard creation, routing • Replication • Limited to a fixed number of shards defined on initial creation • ZooKeeper for coordination • Large community
  • 24. ELASTICSEARCH • Similar feature set to Solr • Purpose built for easily managing a distributed index • Rapidly growing community • Custom built coordination mechanism • JSON based API
  • 25. • Integrates Cassandra and Solr • Automatic indexing in Solr/storing in Cassandra • Automatic partitioning • Automatic reindexing • Not limited to fixed number of shards • Proprietary and costs money DATASTAX ENTERPRISE +
  • 26. • Collecting and analyzing device data/product logs can be a very difficult challenge • You can use NoSQL and search technologies like Solr or ElasticSearch in unison... • ...but it is not always easy to integrate search with NoSQL CONCLUSION
  • 27. QUESTIONS? • Feel free to reach out if you have any questions or need help with big data/search! • http://ryantabora.com • http://thinkbiganalytics.com • @ryantabora • ryan.tabora@thinkbiganalytics.com
  • 29. HBASE AND SOLR • Automatic partitioning/reindexing • Automatic index updates on HBase inserts/deletes • Mapping HBase cells to a Solr schema • No perfect commercial/open source solution yet • Many many many more...
  • 30. HBASE + SOLR AUTOMATIC INDEXING • HBase coprocessors are like storedprocs/triggers • New, powerful, and dangerous • Triggers on HBase puts/deletes • Mapping data to a schema?
  • 31. HBASE + SOLR WRITE DURABILITY Solr Shard 1 Solr Shard 3 Solr Shard 2 HBase Table - SOLR_QUEUE MapReduce Indexing Application Solr Queue Reader Create SolrDocument from raw ASUP1 Get oldest SolrDocument from HBase Queue Table 3 Use custom hash algorithm to determine which shard to add SolrDocument to 4 Query Solr, if SolrDocument was added, then remove it from the SOLR_QUEUE 5 SolrDocument SolrDocument Add SolrDocument to HBase for durability2
  • 32. HBASE + SOLR ELASTIC SHARDING • HBase’s distributing mechanism uses the concept of regions to split data across many nodes • Region splitting can be automatic or manual (performance degradation as regions split) • Piggybacking Solr sharding on HBase Region splitting