Real-Time Big
Data Applications
A Reference Architecture
for Search, Discovery, and
Analytics

Justin Makeig
Director, Product Management MarkLogic
June 13, 2012
Hello, my name is _________

§    Director, Product Management
§    Focus on APIs, integrations, and tools
§    With MarkLogic since 2007
§    Former web dev, quant
Agenda

§    Characterizing Big Data applications
§    Examples today
§    Combining analytical and operational
§    What’s next?
Who is MarkLogic?

§  300 customers, $85 million+ in revenue
§  300 employees in San Francisco, New York,
    London, Tokyo, Austin, Frankfurt, Stockholm
§  Founded in 2003
§  Funded by Sequoia and Tenaya
§  Focus on Media, Government, Financial Services
Big Data Workloads

  Analytic           Operational

  §  Batch          §    Real-time, interactive
  §  Aggregate      §    Highly selective
  §  Repeatable     §    Available
                     §    Secure
Operational Databases

RDBMS                       “NoSQL”
§  Indexes                 §  Flexible data model
§  Transactions            §  Commodity scale out
§  Security                §  Distributed, fault-tolerant
§  Enterprise operations   §  Hadoop sink/source


 What if you could get all of these in one system?
MarkLogic Server

§    Enterprise NoSQL database
§    Flexible data model
§    Scales on commodity hardware (1–1,000 nodes)
§    Rich built-in indexes, including full-text, scalar, geo
§    ACID transactions
§    Enterprise-grade operations
Operational
Big Data
LexisNexis

§  $4.2 billion in revenue,
    $2.6 billion LOB
§  5 billion+ documents,
    millions of updates/day
§  Real-time search,
    discovery, analytics
§  From 9–12 months to
    2 weeks for new products
§  Enterprise HA/DR
Top 5 Global Investment Bank

§  Real-time transparency
    across all derivatives
§  Predictable scalability
§  Simplified architecture,
    operations
§  Mission-critical uptime and
    performance
                                  http://www.flickr.com/photos/tenaciousme/1797368175/
US Government Intel Agency

§  Crawl of substantial
    part of the Web
§  Evolving enrichment
§  Real-time analysis
§  Granular security
§  Centralized governance
§  ½ DBA
                             http://www.flickr.com/photos/usarak/4969182481
Big Data
Applications
Unified Data

§  Flexible data model reduces need for ETL
§  Multiple simultaneous applications
§  Single governance model
Enterprise Operations

§    Predictable scalability
§    Replication and failover
§    Backup and recovery
§    Instrumentation and monitoring
Continuous Adaptation

§  Load data as-is, evolve with requirements
§  Add new sources in days, not months
§  Transactional updates for accuracy
Iterative Query

§  Real-time access
§  Multi-faceted queries
   –  Full text
   –  Structure, semantics, and relationships
   –  Scalar values and ranges
      (date/time, numbers, strings)
   –  Geospatial
§  Alerting
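
A multi-faceted query of the kind listed above can be sketched as a conjunction of a full-text test, a scalar range test, and a geospatial test. This is only an illustration in plain Python: the `Doc` fields, the `matches` helper, and the sample data are hypothetical, and a linear scan stands in for the index lookups a real database would use.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Doc:
    text: str
    published: date
    lat: float
    lon: float


def matches(doc, term, start, end, box):
    """Return True when the document satisfies all three facets."""
    south, west, north, east = box
    has_term = term.lower() in doc.text.lower().split()           # full text
    in_range = start <= doc.published <= end                      # scalar range
    in_box = south <= doc.lat <= north and west <= doc.lon <= east  # geospatial
    return has_term and in_range and in_box


# Hypothetical sample documents.
docs = [
    Doc("derivatives trade report", date(2012, 6, 1), 40.7, -74.0),
    Doc("derivatives trade report", date(2011, 1, 5), 40.7, -74.0),
    Doc("weather summary", date(2012, 6, 2), 51.5, -0.1),
]

hits = [
    d for d in docs
    if matches(d, "derivatives", date(2012, 1, 1), date(2012, 12, 31),
               (40.0, -75.0, 41.0, -73.0))
]
```

Only the first document passes all three tests; the second falls outside the date range and the third fails the text and location tests.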
Big Data Application Platform

[Diagram: platform components]
§  APIs and tools
§  Visualization, Data Mining, Event Processing, Metadata, Search
§  Operational Environment: Analytic DB and EDW, Operational DB,
    Unstructured Content
§  Hadoop: Acquisition, Batch Analytics, and Enrichment; Archive
In practice…

[Diagram: components of a typical deployment]
§  BI Tools, Applications
§  Stats (SPSS, SAS, R, …)
§  Stream and Event Processing
§  Search, Search Index
§  Metadata
§  Analytic DB / EDW, Operational DB, Unstructured Content Store
§  Batch Analytics (Hadoop MR), Archive (HDFS)
Simplified Architecture

[Diagram: remaining components]
§  BI Tools, Applications
§  Stats (SPSS, SAS, R, …)
§  Metadata
§  Analytic DB / EDW
§  Batch Analytics (Hadoop MR), Archive (HDFS)
Combining
Analytic and
Operational
Use Cases

[Diagram: data flows among Raw Data, Hadoop, Operational Applications,
and MarkLogic + Connector for Hadoop]
1  Intermediate Intelligence
2  Progressive Enhancement
3  Archive
Intermediate Intelligence
Sophisticated pre-processing for real-time analytics
§  Aggregate, transform, enrich, join, restructure
§  Keep everything: Long-tail, cost-effective warm
    storage in HDFS
§  Leverage MapReduce ecosystem for analysis,
    ETL, and refinement
Progressive Enhancement
Enhance data incrementally to answer new questions
§  Enrich data for search, analytics, and delivery
§  Leverage MarkLogic indexes for performance,
    accuracy
§  Leverage the growing Hadoop/Java ecosystem
    for processing
§  Centralized governance, security in MarkLogic
Archive
Age out data to another storage tier
§  Align storage and processing resources with the
    value of data
§  Maintain a complete picture of all data
§  Simplified lifecycle management for compliance
Reading Data from MarkLogic
Query for input, read in parallel directly from partitions
§  Specify input with a query or expression
§  Automatically divide up input for parallel Map
§  Each split covers one partition


[Diagram: documents 01–10, 11–18, 19–30, and 31–37 held in four partitions across Host 1 and Host 2]
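
The split-per-partition idea on this slide can be sketched in a few lines of Python. This is not the connector's API: `plan_splits`, the partition names, the assignment of partitions to hosts, and the even-numbered-document query are all assumptions made for illustration. The document ranges follow the diagram.

```python
def plan_splits(partitions, query):
    """Build one input split per partition, keeping only matching documents."""
    splits = []
    for host, name, doc_ids in partitions:
        selected = [d for d in doc_ids if query(d)]
        if selected:  # skip partitions with nothing to read
            splits.append({"host": host, "partition": name, "docs": selected})
    return splits


# Hypothetical topology: four partitions on two hosts.
partitions = [
    ("host1", "p1", list(range(1, 11))),   # docs 01-10
    ("host1", "p2", list(range(11, 19))),  # docs 11-18
    ("host2", "p3", list(range(19, 31))),  # docs 19-30
    ("host2", "p4", list(range(31, 38))),  # docs 31-37
]

# The "query" here simply selects even-numbered documents.
splits = plan_splits(partitions, lambda doc_id: doc_id % 2 == 0)
```

Each resulting split can be handed to a separate Map task, which reads only from the host that holds its partition.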
Writing Data to MarkLogic
Write in parallel directly to partitions
§    Auto-discovery of partition topology at job start
§    Client-side hashing to distribute writes
§    Writes directly to partitions
§    Batch update transactions for efficiency
[Diagram: Task 1, Task 2, and Task 3 writing in parallel to partitions on Host 1 and Host 2]
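
Client-side hashing and batched writes might look like the following sketch. The hash function (SHA-256 of the document URI), the partition names, and the `plan_batches` helper are assumptions chosen for illustration; the slide does not say how the connector actually hashes or batches.

```python
import hashlib
from collections import defaultdict


def partition_for(uri, partitions):
    """Pick a partition deterministically from a hash of the document URI."""
    digest = hashlib.sha256(uri.encode("utf-8")).digest()
    return partitions[int.from_bytes(digest[:8], "big") % len(partitions)]


def plan_batches(uris, partitions, batch_size):
    """Group writes by target partition, then cut each group into batches."""
    by_partition = defaultdict(list)
    for uri in uris:
        by_partition[partition_for(uri, partitions)].append(uri)
    batches = []
    for partition, items in by_partition.items():
        for i in range(0, len(items), batch_size):
            # Each batch would be sent to its partition as one transaction.
            batches.append((partition, items[i:i + batch_size]))
    return batches


# Hypothetical topology discovered at job start.
partitions = ["p1", "p2", "p3", "p4"]
uris = [f"/docs/{i}.xml" for i in range(100)]
batches = plan_batches(uris, partitions, batch_size=10)
```

Because the partition is derived from the URI alone, any task can route a write without coordinating with the others.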
Hortonworks Partnership

§  Simplified architecture: Certified MarkLogic
    distribution of Hadoop using Hortonworks Data
    Platform (HDP)
§  Operational: One-stop production support
§  Enterprise-Ready: Best practices and
    reference architecture
MarkLogic Hadoop Roadmap


Today
§  MarkLogic Connector for Hadoop
§  Certification against 0.20.2

Next
§  Unified distribution and support using Hortonworks Data Platform
§  Reference architectures and best practices

Future
§  Tools and ecosystem
§  HDFS as storage
§  Compute platform
§  Unified Data
§  Enterprise Operations
§  Continual Adaptation
§  Iterative Query
Alerting for Real-Time Models
Alerting allows for real-time match-making
§  Generate statistical model of user behavior in
    Hadoop
§  Mark up documents (or sub-documents) with
    match criteria
§  Combine full-text, geo, and scalar queries for
    real-time decision-making in MarkLogic
§  Scale to billions of documents, trillions of
    matches
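
Alerting inverts the usual search: the match criteria are stored, and each incoming document is tested against them. A minimal sketch, assuming hypothetical `Rule` fields and a plain loop in place of the indexed matching a database would do:

```python
from dataclasses import dataclass


@dataclass
class Rule:
    name: str
    term: str          # full-text criterion
    min_price: float   # scalar range criterion
    max_price: float
    box: tuple         # geo criterion: (south, west, north, east)


def matching_rules(doc, rules):
    """Return the names of all stored rules that the new document satisfies."""
    words = doc["text"].lower().split()
    hits = []
    for rule in rules:
        south, west, north, east = rule.box
        if (rule.term.lower() in words
                and rule.min_price <= doc["price"] <= rule.max_price
                and south <= doc["lat"] <= north
                and west <= doc["lon"] <= east):
            hits.append(rule.name)
    return hits


# Hypothetical rules, e.g. derived from a behavior model built offline.
rules = [
    Rule("nyc-bargains", "apartment", 0, 3000, (40.0, -75.0, 41.0, -73.0)),
    Rule("london-lux", "flat", 5000, 1e9, (51.0, -1.0, 52.0, 1.0)),
]
doc = {"text": "Sunny apartment near park", "price": 2500, "lat": 40.7, "lon": -74.0}
alerts = matching_rules(doc, rules)
```

Here the document triggers only the first rule; the second fails on its text criterion.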

Examples
What about HBase?

MarkLogic
§  Documents
§  Load as-is, ad hoc queries
§  Integrated full-text search
§  Built-in scalar, structure, geo-spatial indexes
§  Multi-document ACID transactions
§  MapReduce source and sink
§  Scale to 100s of nodes on commodity hardware

HBase
§  Sparse maps
§  Model for expected access
§  Typically Lucene/Solr bolt-on
§  Secondary indexes exclusively in middleware
§  Row-level atomicity, strong consistency
§  MapReduce source and sink
§  Scale to 100s of nodes on commodity hardware
In practice…

[Diagram: Metadata, Batch Analytics (Hadoop MR), Archive (HDFS)]
Why Hortonworks?
[Chart: Contributions to Hadoop Core, 2011]
§  Leaders within Hadoop
    Community
§  Delivered every major Hadoop
    release since 0.1
§  Experience managing world’s
    largest deployment
§  Ongoing access to Y!’s 1,000+
    users and 40k+ nodes for
    testing, QA, etc.
§  Unify and Enable Hadoop
    Ecosystem
§  100% open-source
§  Training and support
§  Solutions and reference
    architectures
Intermediate Intelligence Examples

§  ETL for data cleansing, de-duplication, joining
    with reference data
§  Aggregate analysis on user behavior to affect
    applications
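
The cleansing, de-duplication, join, and aggregation steps named above can be sketched on a small in-memory data set. In practice this work would run as a batch job over far larger inputs; the record fields, the reference table, and the helper names here are hypothetical.

```python
from collections import Counter


def cleanse_and_join(events, reference):
    """Normalize user names, drop duplicate events, and attach reference data."""
    seen = set()
    records = []
    for event in events:
        key = (event["user"].strip().lower(), event["item"])  # cleansing
        if key in seen:                                       # de-duplication
            continue
        seen.add(key)
        records.append({
            "user": key[0],
            "item": event["item"],
            # Join with reference data; unmatched items are marked "unknown".
            "category": reference.get(event["item"], "unknown"),
        })
    return records


def views_per_category(records):
    """Aggregate the cleaned records by category."""
    return Counter(r["category"] for r in records)


# Hypothetical raw events and reference table.
events = [
    {"user": " Alice ", "item": "a1"},
    {"user": "alice", "item": "a1"},   # duplicate after cleansing
    {"user": "bob", "item": "b7"},
    {"user": "bob", "item": "zz"},     # no reference entry
]
reference = {"a1": "news", "b7": "sports"}

records = cleanse_and_join(events, reference)
counts = views_per_category(records)
```

The aggregated counts are the kind of intermediate result that could then be loaded into the operational database to drive application behavior.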
Progressive Enhancement Examples

§  Metadata extraction
§  Entity enrichment
§  Binary processing: facial recognition, audio-to-text
§  Summarization: sentiment analysis, classification
§  Data cleansing, restructuring, translation
Bulk Loading
Parallelize ingestion in MarkLogic for performance
§  Stage in HDFS, load in parallel into MarkLogic
§  Optionally process using MapReduce
[Chart: 9M doc ingestion elapsed time (s), 0–2500, by cluster size (1–4); series: MarkLogic single client, MarkLogic + Hadoop]
Más contenido relacionado

La actualidad más candente

Introduction to ML (Machine Learning)
Introduction to ML (Machine Learning)Introduction to ML (Machine Learning)
Introduction to ML (Machine Learning)SwatiTripathi44
 
Machine learning with scikitlearn
Machine learning with scikitlearnMachine learning with scikitlearn
Machine learning with scikitlearnPratap Dangeti
 
3.3 hierarchical methods
3.3 hierarchical methods3.3 hierarchical methods
3.3 hierarchical methodsKrish_ver2
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with HadoopPhilippe Julio
 
Machine Learning and its Applications
Machine Learning and its ApplicationsMachine Learning and its Applications
Machine Learning and its ApplicationsDr Ganesh Iyer
 
Feature Engineering in Machine Learning
Feature Engineering in Machine LearningFeature Engineering in Machine Learning
Feature Engineering in Machine LearningKnoldus Inc.
 
Big data lecture notes
Big data lecture notesBig data lecture notes
Big data lecture notesMohit Saini
 
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & KamberChapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kambererror007
 
Deep Learning Explained
Deep Learning ExplainedDeep Learning Explained
Deep Learning ExplainedMelanie Swan
 
An introduction to Machine Learning
An introduction to Machine LearningAn introduction to Machine Learning
An introduction to Machine Learningbutest
 
Classification in data mining
Classification in data mining Classification in data mining
Classification in data mining Sulman Ahmed
 
Chapter - 7 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 7 Data Mining Concepts and Techniques 2nd Ed slides Han & KamberChapter - 7 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 7 Data Mining Concepts and Techniques 2nd Ed slides Han & Kambererror007
 
Machine Learning - Dataset Preparation
Machine Learning - Dataset PreparationMachine Learning - Dataset Preparation
Machine Learning - Dataset PreparationAndrew Ferlitsch
 
Data cube computation
Data cube computationData cube computation
Data cube computationRashmi Sheikh
 
Overfitting & Underfitting
Overfitting & UnderfittingOverfitting & Underfitting
Overfitting & UnderfittingSOUMIT KAR
 
Lecture 18: Gaussian Mixture Models and Expectation Maximization
Lecture 18: Gaussian Mixture Models and Expectation MaximizationLecture 18: Gaussian Mixture Models and Expectation Maximization
Lecture 18: Gaussian Mixture Models and Expectation Maximizationbutest
 

La actualidad más candente (20)

Data mining tasks
Data mining tasksData mining tasks
Data mining tasks
 
Introduction to ML (Machine Learning)
Introduction to ML (Machine Learning)Introduction to ML (Machine Learning)
Introduction to ML (Machine Learning)
 
Machine learning with scikitlearn
Machine learning with scikitlearnMachine learning with scikitlearn
Machine learning with scikitlearn
 
3.3 hierarchical methods
3.3 hierarchical methods3.3 hierarchical methods
3.3 hierarchical methods
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data mining notes
Data mining notesData mining notes
Data mining notes
 
Machine Learning and its Applications
Machine Learning and its ApplicationsMachine Learning and its Applications
Machine Learning and its Applications
 
Feature Engineering in Machine Learning
Feature Engineering in Machine LearningFeature Engineering in Machine Learning
Feature Engineering in Machine Learning
 
Big data lecture notes
Big data lecture notesBig data lecture notes
Big data lecture notes
 
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & KamberChapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
 
Deep Learning Explained
Deep Learning ExplainedDeep Learning Explained
Deep Learning Explained
 
An introduction to Machine Learning
An introduction to Machine LearningAn introduction to Machine Learning
An introduction to Machine Learning
 
Classification in data mining
Classification in data mining Classification in data mining
Classification in data mining
 
Chapter - 7 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 7 Data Mining Concepts and Techniques 2nd Ed slides Han & KamberChapter - 7 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 7 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
 
Machine Learning - Dataset Preparation
Machine Learning - Dataset PreparationMachine Learning - Dataset Preparation
Machine Learning - Dataset Preparation
 
Data cleansing
Data cleansingData cleansing
Data cleansing
 
Data cube computation
Data cube computationData cube computation
Data cube computation
 
Overfitting & Underfitting
Overfitting & UnderfittingOverfitting & Underfitting
Overfitting & Underfitting
 
Lecture 18: Gaussian Mixture Models and Expectation Maximization
Lecture 18: Gaussian Mixture Models and Expectation MaximizationLecture 18: Gaussian Mixture Models and Expectation Maximization
Lecture 18: Gaussian Mixture Models and Expectation Maximization
 

Destacado

eServices-Tp5: api management
eServices-Tp5: api managementeServices-Tp5: api management
eServices-Tp5: api managementLilia Sfaxi
 
eServices-Tp4: esb++
eServices-Tp4: esb++eServices-Tp4: esb++
eServices-Tp4: esb++Lilia Sfaxi
 
eServices-Chp2: SOA
eServices-Chp2: SOAeServices-Chp2: SOA
eServices-Chp2: SOALilia Sfaxi
 
eServices-Chp3: Composition de Services
eServices-Chp3: Composition de ServiceseServices-Chp3: Composition de Services
eServices-Chp3: Composition de ServicesLilia Sfaxi
 
Real Time Analytics: Algorithms and Systems
Real Time Analytics: Algorithms and SystemsReal Time Analytics: Algorithms and Systems
Real Time Analytics: Algorithms and SystemsArun Kejariwal
 
eServices-Chp5: Microservices et API Management
eServices-Chp5: Microservices et API ManagementeServices-Chp5: Microservices et API Management
eServices-Chp5: Microservices et API ManagementLilia Sfaxi
 
Getting Started with Real-time Analytics
Getting Started with Real-time AnalyticsGetting Started with Real-time Analytics
Getting Started with Real-time AnalyticsAmazon Web Services
 
eServices-Tp1: Web Services
eServices-Tp1: Web ServiceseServices-Tp1: Web Services
eServices-Tp1: Web ServicesLilia Sfaxi
 
eServices-Chp6: WOA
eServices-Chp6: WOAeServices-Chp6: WOA
eServices-Chp6: WOALilia Sfaxi
 
eServices-Chp1: Introduction
eServices-Chp1: IntroductioneServices-Chp1: Introduction
eServices-Chp1: IntroductionLilia Sfaxi
 
eServices-Chp4: ESB
eServices-Chp4: ESBeServices-Chp4: ESB
eServices-Chp4: ESBLilia Sfaxi
 
eServices-Tp2: bpel
eServices-Tp2: bpeleServices-Tp2: bpel
eServices-Tp2: bpelLilia Sfaxi
 
eServices-Tp3: esb
eServices-Tp3: esbeServices-Tp3: esb
eServices-Tp3: esbLilia Sfaxi
 

Destacado (13)

eServices-Tp5: api management
eServices-Tp5: api managementeServices-Tp5: api management
eServices-Tp5: api management
 
eServices-Tp4: esb++
eServices-Tp4: esb++eServices-Tp4: esb++
eServices-Tp4: esb++
 
eServices-Chp2: SOA
eServices-Chp2: SOAeServices-Chp2: SOA
eServices-Chp2: SOA
 
eServices-Chp3: Composition de Services
eServices-Chp3: Composition de ServiceseServices-Chp3: Composition de Services
eServices-Chp3: Composition de Services
 
Real Time Analytics: Algorithms and Systems
Real Time Analytics: Algorithms and SystemsReal Time Analytics: Algorithms and Systems
Real Time Analytics: Algorithms and Systems
 
eServices-Chp5: Microservices et API Management
eServices-Chp5: Microservices et API ManagementeServices-Chp5: Microservices et API Management
eServices-Chp5: Microservices et API Management
 
Getting Started with Real-time Analytics
Getting Started with Real-time AnalyticsGetting Started with Real-time Analytics
Getting Started with Real-time Analytics
 
eServices-Tp1: Web Services
eServices-Tp1: Web ServiceseServices-Tp1: Web Services
eServices-Tp1: Web Services
 
eServices-Chp6: WOA
eServices-Chp6: WOAeServices-Chp6: WOA
eServices-Chp6: WOA
 
eServices-Chp1: Introduction
eServices-Chp1: IntroductioneServices-Chp1: Introduction
eServices-Chp1: Introduction
 
eServices-Chp4: ESB
eServices-Chp4: ESBeServices-Chp4: ESB
eServices-Chp4: ESB
 
eServices-Tp2: bpel
eServices-Tp2: bpeleServices-Tp2: bpel
eServices-Tp2: bpel
 
eServices-Tp3: esb
eServices-Tp3: esbeServices-Tp3: esb
eServices-Tp3: esb
 

Similar a Big Data Real Time Applications

Anexinet Big Data Solutions
Anexinet Big Data SolutionsAnexinet Big Data Solutions
Anexinet Big Data SolutionsMark Kromer
 
Big dataappliance hadoopworld_final
Big dataappliance hadoopworld_finalBig dataappliance hadoopworld_final
Big dataappliance hadoopworld_finaljdijcks
 
Hadoop World 2011: Unlocking the Value of Big Data with Oracle - Jean-Pierre ...
Hadoop World 2011: Unlocking the Value of Big Data with Oracle - Jean-Pierre ...Hadoop World 2011: Unlocking the Value of Big Data with Oracle - Jean-Pierre ...
Hadoop World 2011: Unlocking the Value of Big Data with Oracle - Jean-Pierre ...Cloudera, Inc.
 
Big data hadoop ecosystem and nosql
Big data hadoop ecosystem and nosqlBig data hadoop ecosystem and nosql
Big data hadoop ecosystem and nosqlKhanderao Kand
 
Introducing the Big Data Ecosystem with Caserta Concepts & Talend
Introducing the Big Data Ecosystem with Caserta Concepts & TalendIntroducing the Big Data Ecosystem with Caserta Concepts & Talend
Introducing the Big Data Ecosystem with Caserta Concepts & TalendCaserta
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerMark Kromer
 
Modern data warehouse
Modern data warehouseModern data warehouse
Modern data warehouseStephen Alex
 
Modern data warehouse
Modern data warehouseModern data warehouse
Modern data warehouseStephen Alex
 
Apache hadoop bigdata-in-banking
Apache hadoop bigdata-in-bankingApache hadoop bigdata-in-banking
Apache hadoop bigdata-in-bankingm_hepburn
 
Hadoop - A big data initiative
Hadoop - A big data initiativeHadoop - A big data initiative
Hadoop - A big data initiativeMansi Mehra
 
Hadoop - Now, Next and Beyond
Hadoop - Now, Next and BeyondHadoop - Now, Next and Beyond
Hadoop - Now, Next and BeyondTeradata Aster
 
Hadoop - A big data initiative
Hadoop - A big data initiativeHadoop - A big data initiative
Hadoop - A big data initiativeMansi Mehra
 
Davis mark advanced search analytics in 20 minutes
Davis mark   advanced search analytics in 20 minutesDavis mark   advanced search analytics in 20 minutes
Davis mark advanced search analytics in 20 minutesLucidworks (Archived)
 
the Data World Distilled
the Data World Distilledthe Data World Distilled
the Data World DistilledRTTS
 
Engineering practices in big data storage and processing
Engineering practices in big data storage and processingEngineering practices in big data storage and processing
Engineering practices in big data storage and processingSchubert Zhang
 

Similar a Big Data Real Time Applications (20)

Anexinet Big Data Solutions
Anexinet Big Data SolutionsAnexinet Big Data Solutions
Anexinet Big Data Solutions
 
Big dataappliance hadoopworld_final
Big dataappliance hadoopworld_finalBig dataappliance hadoopworld_final
Big dataappliance hadoopworld_final
 
Hadoop World 2011: Unlocking the Value of Big Data with Oracle - Jean-Pierre ...
Hadoop World 2011: Unlocking the Value of Big Data with Oracle - Jean-Pierre ...Hadoop World 2011: Unlocking the Value of Big Data with Oracle - Jean-Pierre ...
Hadoop World 2011: Unlocking the Value of Big Data with Oracle - Jean-Pierre ...
 
Big data hadoop ecosystem and nosql
Big data hadoop ecosystem and nosqlBig data hadoop ecosystem and nosql
Big data hadoop ecosystem and nosql
 
Big Data , Big Problem?
Big Data , Big Problem?Big Data , Big Problem?
Big Data , Big Problem?
 
Introducing the Big Data Ecosystem with Caserta Concepts & Talend
Introducing the Big Data Ecosystem with Caserta Concepts & TalendIntroducing the Big Data Ecosystem with Caserta Concepts & Talend
Introducing the Big Data Ecosystem with Caserta Concepts & Talend
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL Server
 
Modern data warehouse
Modern data warehouseModern data warehouse
Modern data warehouse
 
Modern data warehouse
Modern data warehouseModern data warehouse
Modern data warehouse
 
Apache hadoop bigdata-in-banking
Apache hadoop bigdata-in-bankingApache hadoop bigdata-in-banking
Apache hadoop bigdata-in-banking
 
Hadoop - A big data initiative
Hadoop - A big data initiativeHadoop - A big data initiative
Hadoop - A big data initiative
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
Hadoop - Now, Next and Beyond
Hadoop - Now, Next and BeyondHadoop - Now, Next and Beyond
Hadoop - Now, Next and Beyond
 
Big data with java
Big data with javaBig data with java
Big data with java
 
Hadoop - A big data initiative
Hadoop - A big data initiativeHadoop - A big data initiative
Hadoop - A big data initiative
 
The future of Big Data tooling
The future of Big Data toolingThe future of Big Data tooling
The future of Big Data tooling
 
Zh tw cloud computing era
Zh tw cloud computing eraZh tw cloud computing era
Zh tw cloud computing era
 
Davis mark advanced search analytics in 20 minutes
Davis mark   advanced search analytics in 20 minutesDavis mark   advanced search analytics in 20 minutes
Davis mark advanced search analytics in 20 minutes
 
the Data World Distilled
the Data World Distilledthe Data World Distilled
the Data World Distilled
 
Engineering practices in big data storage and processing
Engineering practices in big data storage and processingEngineering practices in big data storage and processing
Engineering practices in big data storage and processing
 

Más de DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

Big Data Real Time Applications

  • 1. Real-Time Big Data Applications A Reference Architecture for Search, Discovery, and Analytics Justin Makeig Director, Product Management MarkLogic June 13, 2012
  • 2. Hello, my name is _________ §  Director, Product Management §  Focus on APIs, integrations, and tools §  With MarkLogic since 2007 §  Former web dev, quant
  • 3. Agenda §  Characterizing Big Data applications §  Examples today §  Combining analytical and operational §  What’s next?
  • 4. Who is MarkLogic? §  300 customers, $85 million+ in revenue §  300 employees in San Francisco, New York, London, Tokyo, Austin, Frankfurt, Stockholm §  Founded in 2003 §  Funded by Sequoia and Tenaya §  Focus on Media, Government, Financial Services
  • 5. Big Data Workloads Analytic: §  Batch §  Aggregate §  Repeatable Operational: §  Real-time, interactive §  Highly selective §  Available §  Secure
  • 6. Operational Databases RDBMS: §  Indexes §  Transactions §  Security §  Enterprise operations “NoSQL”: §  Flexible data model §  Commodity scale out §  Distributed, fault-tolerant §  Hadoop sink/source What if you could get all of these in one system?
  • 7. MarkLogic Server §  Enterprise NoSQL database §  Flexible data model §  Scales on commodity hardware (1–1,000 nodes) §  Rich built-in indexes, including full-text, scalar, geo §  ACID transactions §  Enterprise-grade operations
  • 9. LexisNexis §  $4.2 billion in revenue, $2.6 billion LOB §  5 billion+ documents, millions updates/day §  Real-time search, discovery, analytics §  From 9–12 months to 2 weeks for new products §  Enterprise HA/DR
  • 10. Top 5 Global Investment Bank §  Real-time transparency across all derivatives §  Predictable scalability §  Simplified architecture, operations §  Mission-critical uptime and performance http://www.flickr.com/photos/tenaciousme/1797368175/
  • 11. US Government Intel Agency §  Crawl of substantial part of the Web §  Evolving enrichment §  Real-time analysis §  Granular security §  Centralized governance §  ½ DBA http://www.flickr.com/photos/usarak/4969182481
  • 13. Unified Data §  Flexible data model reduces need for ETL §  Multiple simultaneous applications §  Single governance model
  • 14. Enterprise Operations §  Predictable scalability §  Replication and failover §  Backup and recovery §  Instrumentation and monitoring
  • 15. Continuous Adaptation §  Load data as-is, evolve with requirements §  Add new sources in days, not months §  Transactional updates for accuracy
  • 16. Iterative Query §  Real-time access §  Multi-faceted queries –  Full text –  Structure, semantics, and relationships –  Scalar values and ranges (date/time, numbers, strings) –  Geospatial §  Alerting
  • 17. Big Data Application Platform [Diagram labels: APIs and tools; Visualization; Data Mining; Event Processing; Metadata; Search; Operational Environment; Analytic DB and EDW; Operational DB; Unstructured Content; Acquisition, Batch Analytics, and Enrichment; Hadoop Archive]
  • 18. In practice… [Diagram labels: BI Tools; Applications; Stream and Event Processing; Search; Search Index; Stats (SPSS, SAS, R, …); Metadata; Analytic DB / EDW; Operational DB; Unstructured Content Store; Batch Analytics (Hadoop MR); Archive (HDFS)]
  • 19. Simplified Architecture [Diagram labels: BI Tools; Applications; Stream and Event Processing; Search; Search Index; Stats (SPSS, SAS, R, …); Metadata; Analytic DB / EDW; Operational DB; Unstructured Content Store; Batch Analytics (Hadoop MR); Archive (HDFS)]
  • 20. Simplified Architecture [Diagram labels: BI Tools; Applications; Stats (SPSS, SAS, R, …); Metadata; Analytic DB / EDW; Batch Analytics (Hadoop MR); Archive (HDFS)]
  • 21. Simplified Architecture [Diagram labels: BI Tools; Applications; Stats (SPSS, SAS, R, …); Metadata; Analytic DB / EDW; Batch Analytics (Hadoop MR); Archive (HDFS)]
  • 23. Use Cases [Diagram labels: Raw Data; Operational Applications; MarkLogic + Connector for Hadoop; Hadoop] §  1: Intermediate Intelligence §  2: Progressive Enhancement §  3: Archive
  • 24. Intermediate Intelligence Sophisticated pre-processing for real-time analytics §  Aggregate, transform, enrich, join, restructure §  Keep everything: Long-tail, cost-effective warm storage in HDFS §  Leverage MapReduce ecosystem for analysis and ETL and refinement
  • 25. Progressive Enhancement Enhance data incrementally to answer new questions §  Enrich data for search, analytics, and delivery §  Leverage MarkLogic indexes for performance, accuracy §  Leverage the growing Hadoop/Java ecosystem for processing §  Centralized governance, security in MarkLogic
  • 26. Archive Age out data to another storage tier §  Align storage and processing resources with the value of data §  Maintain a complete picture of all data §  Simplified lifecycle management for compliance
  • 27. Reading Data from MarkLogic Query for input, read in parallel directly from partitions §  Specify input with a query or expression §  Automatically divide up input for parallel Map §  Each split covers one partition [Diagram: Docs 01–10, 11–18, 19–30, and 31–37 as splits across Host 1 and Host 2]
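The read path on slide 27 (one input split per partition, processed by parallel map tasks) can be sketched roughly as follows. This is a minimal illustration in plain Python, not the connector's API: `Partition`, `make_splits`, `map_split`, and `run_job` are hypothetical names, threads stand in for Hadoop map tasks, and the assignment of document ranges to hosts is an assumption, since the slide does not state it.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Partition:
    """Hypothetical stand-in for one database partition on a host."""
    host: str
    doc_ids: tuple  # documents on this partition that match the input query


def make_splits(partitions):
    """Create one input split per partition, skipping partitions with no matches."""
    return [p for p in partitions if p.doc_ids]


def map_split(split):
    """Hypothetical map task: here it only counts the documents in its split."""
    return split.host, len(split.doc_ids)


def run_job(partitions, workers=4):
    """Run one map task per split in parallel and collect results in split order."""
    splits = make_splits(partitions)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(map_split, splits))


# Document ranges taken from the slide; the host assignment is assumed.
partitions = [
    Partition("host1", tuple(range(1, 11))),   # docs 01-10
    Partition("host1", tuple(range(11, 19))),  # docs 11-18
    Partition("host2", tuple(range(19, 31))),  # docs 19-30
    Partition("host2", tuple(range(31, 38))),  # docs 31-37
]
results = run_job(partitions)
```

Because each split maps to exactly one partition, a map task never has to read across hosts, which is what allows the input to be read in parallel.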
  • 28. Writing Data to MarkLogic Write in parallel directly to partitions §  Auto-discovery of partition topology at job start §  Client-side hashing to distribute writes §  Writes directly to partitions §  Batch update transactions for efficiency [Diagram: Task 1, Task 2, and Task 3 writing to partitions on Host 1 and Host 2]
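The write path on slide 28 combines client-side hashing with batched updates. The sketch below shows the general idea only; the hash function, the partition names, and the helpers `partition_for` and `plan_writes` are assumptions made for illustration, and the deck does not describe how the connector actually hashes or batches.

```python
import hashlib
from collections import defaultdict


def partition_for(uri, partitions):
    """Pick a partition deterministically from a stable hash of the document URI."""
    digest = hashlib.md5(uri.encode("utf-8")).digest()
    return partitions[int.from_bytes(digest[:8], "big") % len(partitions)]


def plan_writes(uris, partitions, batch_size=100):
    """Group documents by target partition, then cut each group into batches.

    Each batch would be sent to its partition as one update transaction.
    """
    by_partition = defaultdict(list)
    for uri in uris:
        by_partition[partition_for(uri, partitions)].append(uri)
    return {
        part: [docs[i:i + batch_size] for i in range(0, len(docs), batch_size)]
        for part, docs in by_partition.items()
    }


# Hypothetical topology, as if discovered at job start.
topology = ["host1:p1", "host1:p2", "host2:p3"]
uris = [f"/docs/{i}.xml" for i in range(1000)]
plan = plan_writes(uris, topology, batch_size=100)
```

Hashing on the client means every task can route a document without consulting a coordinator, and batching amortizes the per-transaction overhead across many documents.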
  • 29. Hortonworks Partnership §  Simplified architecture: Certified MarkLogic distribution of Hadoop using Hortonworks Data Platform (HDP) §  Operational: One-stop production support §  Enterprise-Ready: Best practices and reference architecture
  • 30. MarkLogic Hadoop Roadmap Today: §  MarkLogic Connector for Hadoop §  Certification against 0.20.2 Next: §  Unified distribution and support using Hortonworks Data Platform §  Reference architectures and best practices Future: §  Tools and ecosystem §  HDFS as storage §  Compute platform
  • 31. Unified Data; Enterprise Operations; Continual Adaptation; Iterative Query
  • 32.
  • 33. Alerting for Real-Time Models Alerting allows for real-time match-making §  Generate statistical model of user behavior in Hadoop §  Mark-up documents (or sub-documents) with match criteria §  Combine full-text, geo, and scalar queries for real-time decision-making in MarkLogic §  Scale to billions of documents, trillions of matches Examples
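The alerting idea on slide 33 inverts ordinary search: match criteria are stored ahead of time, and each incoming document is tested against them. The sketch below illustrates that with a rule that combines a full-text condition, a geospatial bounding box, and a scalar threshold. All names (`Rule`, `matches`, `alert`) and the document fields are hypothetical, and the linear scan over rules is for illustration only; it says nothing about how MarkLogic indexes stored queries to reach the scale the slide mentions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical stored match criteria."""
    name: str
    terms: frozenset  # every term must appear in the document text
    bbox: tuple       # (south, west, north, east) in degrees
    min_score: float  # scalar threshold, e.g. from a model built offline


def matches(rule, doc):
    """Return True when the document satisfies the text, geo, and scalar conditions."""
    words = set(doc["text"].lower().split())
    south, west, north, east = rule.bbox
    return (
        rule.terms <= words
        and south <= doc["lat"] <= north
        and west <= doc["lon"] <= east
        and doc["score"] >= rule.min_score
    )


def alert(rules, doc):
    """Return the names of all stored rules that the incoming document triggers."""
    return [rule.name for rule in rules if matches(rule, doc)]


rules = [
    Rule("sf-hadoop", frozenset({"hadoop"}), (37.0, -123.0, 38.0, -122.0), 0.5),
    Rule("nyc-search", frozenset({"search"}), (40.0, -75.0, 41.0, -73.0), 0.0),
]
doc = {"text": "Real-time Hadoop analytics", "lat": 37.77, "lon": -122.42, "score": 0.9}
triggered = alert(rules, doc)
```

In the workflow the slide describes, the thresholds and terms in such rules would come from a statistical model generated in Hadoop, while the matching itself happens at document arrival time.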
  • 34. What about HBase? MarkLogic: §  Documents §  Load as-is, ad hoc queries §  Integrated full-text search §  Built-in scalar, structure, geo-spatial indexes §  Multi-document ACID transactions §  MapReduce source and sink §  Scale to 100s of nodes on commodity hardware HBase: §  Sparse maps §  Model for expected access §  Typically Lucene/Solr bolt-on §  Secondary indexes exclusively in middleware §  Row-level atomicity, strong consistency §  MapReduce source and sink §  Scale to 100s of nodes on commodity hardware
  • 35. In practice… [Diagram labels: Metadata; Batch Analytics (Hadoop MR); Archive (HDFS)]
  • 36. Why Hortonworks? §  Leaders within Hadoop Community §  Delivered every major Hadoop release since 0.1 §  Experience managing world’s largest deployment §  Ongoing access to Y!’s 1,000+ users and 40k+ nodes for testing, QA, etc. §  Unify and Enable Hadoop Ecosystem §  100% open-source §  Training and support §  Solutions and reference architectures [Chart: Contributions to Hadoop Core, 2011]
  • 37. Intermediate Intelligence Examples §  ETL for data cleansing, de-duplication, joining with reference data §  Aggregate analysis on user behavior to affect applications
  • 38. Progressive Enhancements Examples §  Metadata extraction §  Entity enrichment §  Binary processing: facial recognition, audio-to-text §  Summarization: sentiment analysis, classification §  Data cleansing, restructuring, translation
  • 39. Bulk Loading Parallelize ingestion in MarkLogic for performance §  Stage in HDFS, load in parallel into MarkLogic §  Optionally process using MapReduce [Chart: ingestion elapsed time (s) for 9M docs against cluster size (1 to 4), comparing MarkLogic with a single client and MarkLogic + Hadoop]