SlideShare una empresa de Scribd logo
1 de 36
SOLUTION TRACK
Finding the Needle in a Big Data Haystack
@EvaAndreasson, Innovator & Problem Solver
Cloudera
InfoQ.com: News & Community Site
• 750,000 unique visitors/month
• Published in 4 languages (English, Chinese, Japanese and Brazilian
Portuguese)
• Post content from our QCon conferences
• News 15-20 / week
• Articles 3-4 / week
• Presentations (videos) 12-15 / week
• Interviews 2-3 / week
• Books 1 / month
Watch the video with slide
synchronization on InfoQ.com!
http://www.infoq.com/presentations
/search-hadoop-data-hub
Presented at QCon London
www.qconlondon.com
Purpose of QCon
- to empower software development by facilitating the spread of
knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of
change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
Agenda
• Problem (Solving)
• Apache Solr + Apache Hadoop et al
• Real-world examples
• Q&A
Problem Solving
Does it
Work?
Will You
Get in
Trouble
Information Driven Problem Solving
• Ask a Question
• Find All Relevant Data to Serve the Question
• Process the Data to Answer the Question
Information Driven Businesses
Thousands
of employees &
lots of data
Difficult to access
Heterogeneous
legacy IT
Infrastructure
Difficult to manage
Hard to scale
Silos of multi-
structured data
Difficult to Integrate
Holds copies of data
Problem: Finding the Data (Needle) Across (Hay) Silos
ERP, CRM, RDBMS, Machines Files, Images, Video, Logs, Clickstreams External Data Sources
Data
Archives
EDWs Marts SearchServers Document Stores Storage
©2014 Cloudera, Inc. All Rights Reserved. Reproduction
or redistribution without written permission is prohibited.
Independent of
audience, topic, or
tool, data is
accessible
Unified data storage,
processing,
management and
security
Ingest all data
any type or any scale
Eliminate copies,
simplify aggregation
and correlation
EDWs Marts Storage Search
Solution: The Enterprise Data Hub (EDH)
Servers Documents
ERP, CRM, RDBMS, Machines Files, Images, Video, Logs, Clickstreams External Data Sources
EDH
Archives
©2014 Cloudera, Inc. All Rights Reserved. Reproduction
or redistribution without written permission is prohibited.
EDH
Apache Hadoop et al at the Core
• Open source
• white box, best innovation at all times
• Flexible
• Scalable
• ingest, storage, processing
• Cost efficient
But an EDH also Needs…
• Security & audit
• Manageability, Visibility and Resource Control
• Open architecture
• Multi-workload support and optimization
Problem solved!
We can go home…?
Problem: Finding the Needle in a Big Data Haystack?
New Audiences, New Challenges
• Non-technical staff needs access to data
• Same data used in bigger processes
• Speed up manual introspection
• Technical staff needs to view / explore data
• View interim results
• Drill down into mission critical data
• Explore data to design models
• Cross-workload needs
• Combine structured and unstructured data
Solution: Everyone Knows Search!
Explore
Navigate
Correlate
15
Cloudera Search: How We
Integrated Solr with Hadoop et al
Cloudera Search
Interactive search for Hadoop
• Full-text and faceted navigation
• Batch, near real-time, and on-demand indexing
16
Apache Solr integrated with CDH
• Established, mature search with vibrant community
• Separate runtime like MapReduce, Impala
• Incorporated as part of the Hadoop ecosystem
Open Source
• 100% Apache, 100% Solr
• Standard Solr APIs
Scalable and Robust Index Storage
HDFS
Lucene
Extraction Mapping
Solr
Zookeeper
SolrCloud
Querying API Indexing API
17
Solr and HDFS
• Scalable, cost-efficient
index storage
• Higher availability
• Search and process data
in one platform
Near Real Time Indexing at Ingest
Log File
Solr and Flume
• Data ingest at scale
• Flexible extraction and
mapping
• Indexing at data ingest
• Document-level ACL
HDFS
Flume
Agent
Indexer
Other
Log File
Flume
Agent
Indexer
18
Scalable Batch Indexing
Files
Files
SolrCloud
Cluster
19
HDFS
Solr and MapReduce
• Flexible, scalable batch
indexing
• GOLIVE: Start serving new
indices with no downtime
• On-demand indexing, cost-
efficient re-indexing
Index
shard
Index
shard
MR
Indexer
MR
Indexer
Scalable Batch Indexing
20
Mapper:
Parse input into
indexable document
Mapper:
Parse input into
indexable document
Mapper:
Parse input into
indexable document
Index
shard 1
Index
shard 2
Arbitrary reducing steps of indexing and merging
End-Reducer (shard 1):
Index document
End-Reducer (shard 2):
Index document
HBase
Secondary Indexes, in Real-Time
interactiveload
Replication
Listener(s)
(Lily)
Triggers on
updates
Solr server
Solr server
Solr server
Solr server
Solr server
HDFS
HBase
Solr and HBase
• Secondary Indexes made
easy and flexible
HBase
Secondary Indexes, or in Batch
interactiveload
Replication
Listener(s)
(Lily)
Triggers on
updates
Solr server
Solr server
Solr server
Solr server
Solr server
MR
Indexer
HDFS
MR
Indexer
HBase
Index
shard
MR
Indexer
Solr and HBase
• Real Time or Batch
Streamlined Extraction and Mapping
Morphlines
• Simple and flexible data
transformation
• Reusable across multiple
index (and other) workloads
syslog Flume
Agent
Solr sink
Command: readLine
Command: grok
Command: loadSolr
Solr
Event
Record
Record
Record
Document
Security
Cloudera Search + Sentry
• Cluster level access control
• Index level access control
Simple, Customizable Search UI
Hue
• Simple UI
• Navigated, faceted drill
down
• Customizable display
• Full text search,
standard Solr API and
query language
Simplified Management
Cloudera Manager
• Install, configure, deploy
SolrCloud on the cluster
• Centralized management and
monitoring – cross workloads
• Unified resource management
and control
Data
End User Client App
(e.g. Hue)
Flume
HDFS
Raw, filtered, or
annotaded data
SolrCloud Cluster(s)
Data to be
indexed
Indexed data
MapReduce Batch Indexing
GoLive updates
HBase
Cluster
Replication
Events to be
indexed
Data
ClouderaManager
Search queries
Architecture Overview
Sentry Authorization
Some Real-World Examples
• Image processing and data correlation
• Quick result checks on new algorithms
• Correlate image and device logs or external data
• Image exploration as a service
• Expedite data modeling processes
• Free-text form matching
• Claims report correlation, to gather data likely to be similar
• Serve 360-client or patient records in a speedier way
• Fraud / pattern extraction
• Log management
• Real-time drill down
• Long term trending and capacity planning
• Anomaly detection over larger sets of data
The EDH - Information-Driven Problem Solving Made Easy!
Integration is Key
• Eliminate data moves or copies, break silos
• Create a truly active archive
• Serve non-technical audiences on the same platform
as where advanced analytics workloads run
• Analyze and combine structured and non-structured
data
• Expedite exploration of various data types
• Find data and do something with it – where it is
stored
• Future proof your data management system
Learn More
• Cloudera.com
• Read our blog
• Take our online training (or get Cloudera certified)
• Download whitepapers
• View webinars
• Talk to our customers
• Follow/contact me
• @EvaAndreasson
Watch the video with slide synchronization on
InfoQ.com!
http://www.infoq.com/presentations/search-
hadoop-data-hub

Más contenido relacionado

Más de C4Media

High Performing Teams Act Like Owners
High Performing Teams Act Like OwnersHigh Performing Teams Act Like Owners
High Performing Teams Act Like OwnersC4Media
 
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to JavaDoes Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to JavaC4Media
 
Service Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideService Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideC4Media
 
Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDC4Media
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine LearningC4Media
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at SpeedC4Media
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsC4Media
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsC4Media
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerC4Media
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleC4Media
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeC4Media
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereC4Media
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing ForC4Media
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data EngineeringC4Media
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreC4Media
 
Navigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery TeamsNavigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery TeamsC4Media
 
High Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in AdtechHigh Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in AdtechC4Media
 
Rust's Journey to Async/await
Rust's Journey to Async/awaitRust's Journey to Async/await
Rust's Journey to Async/awaitC4Media
 
Opportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven UtopiaOpportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven UtopiaC4Media
 
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayDatadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayC4Media
 

Más de C4Media (20)

High Performing Teams Act Like Owners
High Performing Teams Act Like OwnersHigh Performing Teams Act Like Owners
High Performing Teams Act Like Owners
 
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to JavaDoes Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
 
Service Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideService Meshes- The Ultimate Guide
Service Meshes- The Ultimate Guide
 
Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CD
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine Learning
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at Speed
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep Systems
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.js
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly Compiler
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix Scale
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's Edge
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home Everywhere
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing For
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data Engineering
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
 
Navigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery TeamsNavigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery Teams
 
High Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in AdtechHigh Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in Adtech
 
Rust's Journey to Async/await
Rust's Journey to Async/awaitRust's Journey to Async/await
Rust's Journey to Async/await
 
Opportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven UtopiaOpportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven Utopia
 
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayDatadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
 

Último

IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 

Último (20)

IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 

Finding the Needle in a Big Data Haystack

  • 1. SOLUTION TRACK Finding the Needle in a Big Data Haystack @EvaAndreasson, Innovator & Problem Solver Cloudera
  • 2. InfoQ.com: News & Community Site • 750,000 unique visitors/month • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • News 15-20 / week • Articles 3-4 / week • Presentations (videos) 12-15 / week • Interviews 2-3 / week • Books 1 / month Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations /search-hadoop-data-hub
  • 3. Presented at QCon London www.qconlondon.com Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide
  • 4. Agenda • Problem (Solving) • Apache Solr + Apache Hadoop et al • Real-world examples • Q&A
  • 6. Information Driven Problem Solving • Ask a Question • Find All Relevant Data to Serve the Question • Process the Data to Answer the Question
  • 8. Thousands of employees & lots of data Difficult to access Heterogeneous legacy IT Infrastructure Difficult to manage Hard to scale Silos of multi- structured data Difficult to Integrate Holds copies of data Problem: Finding the Data (Needle) Across (Hay) Silos ERP, CRM, RDBMS, Machines Files, Images, Video, Logs, Clickstreams External Data Sources Data Archives EDWs Marts SearchServers Document Stores Storage ©2014 Cloudera, Inc. All Rights Reserved. Reproduction or redistribution without written permission is prohibited.
  • 9. Independent of audience, topic, or tool, data is accessible Unified data storage, processing, management and security Ingest all data any type or any scale Eliminate copies, simplify aggregation and correlation EDWs Marts Storage Search Solution: The Enterprise Data Hub (EDH) Servers Documents ERP, CRM, RDBMS, Machines Files, Images, Video, Logs, Clickstreams External Data Sources EDH Archives ©2014 Cloudera, Inc. All Rights Reserved. Reproduction or redistribution without written permission is prohibited. EDH
  • 10. Apache Hadoop et al at the Core • Open source • white box, best innovation at all times • Flexible • Scalable • ingest, storage, processing • Cost efficient
  • 11. But an EDH also Needs… • Security & audit • Manageability, Visibility and Resource Control • Open architecture • Multi-workload support and optimization
  • 12.
  • 13. Problem solved! We can go home…?
  • 14. Problem: Finding the Needle in a Big Data Haystack?
  • 15. New Audiences, New Challenges • Non-technical staff needs access to data • Same data used in bigger processes • Speed up manual introspection • Technical staff needs to view / explore data • View interim results • Drill down into mission critical data • Explore data to design models • Cross-workload needs • Combine structured and unstructured data
  • 16. Solution: Everyone Knows Search! Explore Navigate Correlate
  • 17. 15 Cloudera Search: How We Integrated Solr with Hadoop et al
  • 18. Cloudera Search Interactive search for Hadoop • Full-text and faceted navigation • Batch, near real-time, and on-demand indexing 16 Apache Solr integrated with CDH • Established, mature search with vibrant community • Separate runtime like MapReduce, Impala • Incorporated as part of the Hadoop ecosystem Open Source • 100% Apache, 100% Solr • Standard Solr APIs
  • 19. Scalable and Robust Index Storage HDFS Lucene Extraction Mapping Solr Zookeeper SolrCloud Querying API Indexing API 17 Solr and HDFS • Scalable, cost-efficient index storage • Higher availability • Search and process data in one platform
  • 20. Near Real Time Indexing at Ingest Log File Solr and Flume • Data ingest at scale • Flexible extraction and mapping • Indexing at data ingest • Document-level ACL HDFS Flume Agent Indexer Other Log File Flume Agent Indexer 18
  • 21. Scalable Batch Indexing Files Files SolrCloud Cluster 19 HDFS Solr and MapReduce • Flexible, scalable batch indexing • GOLIVE: Start serving new indices with no downtime • On-demand indexing, cost- efficient re-indexing Index shard Index shard MR Indexer MR Indexer
  • 22. Scalable Batch Indexing 20 Mapper: Parse input into indexable document Mapper: Parse input into indexable document Mapper: Parse input into indexable document Index shard 1 Index shard 2 Arbitrary reducing steps of indexing and merging End-Reducer (shard 1): Index document End-Reducer (shard 2): Index document
  • 23. HBase Secondary Indexes, in Real-Time interactiveload Replication Listener(s) (Lily) Triggers on updates Solr server Solr server Solr server Solr server Solr server HDFS HBase Solr and HBase • Secondary Indexes made easy and flexible
  • 24. HBase Secondary Indexes, or in Batch interactiveload Replication Listener(s) (Lily) Triggers on updates Solr server Solr server Solr server Solr server Solr server MR Indexer HDFS MR Indexer HBase Index shard MR Indexer Solr and HBase • Real Time or Batch
  • 25. Streamlined Extraction and Mapping Morphlines • Simple and flexible data transformation • Reusable across multiple index (and other) workloads syslog Flume Agent Solr sink Command: readLine Command: grok Command: loadSolr Solr Event Record Record Record Document
  • 26. Security Cloudera Search + Sentry • Cluster level access control • Index level access control
  • 27. Simple, Customizable Search UI Hue • Simple UI • Navigated, faceted drill down • Customizable display • Full text search, standard Solr API and query language
  • 28. Simplified Management Cloudera Manager • Install, configure, deploy SolrCloud on the cluster • Centralized management and monitoring – cross workloads • Unified resource management and control
  • 29. Data End User Client App (e.g. Hue) Flume HDFS Raw, filtered, or annotaded data SolrCloud Cluster(s) Data to be indexed Indexed data MapReduce Batch Indexing GoLive updates HBase Cluster Replication Events to be indexed Data ClouderaManager Search queries Architecture Overview Sentry Authorization
  • 30. Some Real-World Examples • Image processing and data correlation • Quick result checks on new algorithms • Correlate image and device logs or external data • Image exploration as a service • Expedite data modeling processes • Free-text form matching • Claims report correlation, to gather data likely to be similar • Serve 360-client or patient records in a speedier way • Fraud / pattern extraction • Log management • Real-time drill down • Long term trending and capacity planning • Anomaly detection over larger sets of data
  • 31. The EDH - Information-Driven Problem Solving Made Easy!
  • 32. Integration is Key • Eliminate data moves or copies, break silos • Create a truly active archive • Serve non-technical audiences on the same platform as where advanced analytics workloads run • Analyze and combine structured and non-structured data • Expedite exploration of various data types • Find data and do something with it – where it is stored • Future proof your data management system
  • 33. Learn More • Cloudera.com • Read our blog • Take our online training (or get Cloudera certified) • Download whitepapers • View webinars • Talk to our customers • Follow/contact me • @EvaAndreasson
  • 34.
  • 35.
  • 36. Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations/search- hadoop-data-hub