1. Real-Time Big Data Applications
A Reference Architecture for Search, Discovery, and Analytics
Justin Makeig
Director, Product Management, MarkLogic
June 13, 2012
2. Hello, my name is _________
§ Director, Product Management
§ Focus on APIs, integrations, and tools
§ With MarkLogic since 2007
§ Former web dev, quant
3. Agenda
§ Characterizing Big Data applications
§ Examples today
§ Combining analytical and operational
§ What’s next?
4. Who is MarkLogic?
§ 300 customers, $85 million+ in revenue
§ 300 employees in San Francisco, New York,
London, Tokyo, Austin, Frankfurt, Stockholm
§ Founded in 2003
§ Funded by Sequoia and Tenaya
§ Focus on Media, Government, Financial Services
5. Big Data Workloads
§ Analytic
– Batch
– Aggregate
– Repeatable
§ Operational
– Real-time, interactive
– Highly selective
– Available
– Secure
6. Operational Databases
§ RDBMS
– Indexes
– Transactions
– Security
– Enterprise operations
§ “NoSQL”
– Flexible data model
– Commodity scale out
– Distributed, fault-tolerant
– Hadoop sink/source
What if you could get all of these in one system?
7. MarkLogic Server
§ Enterprise NoSQL database
§ Flexible data model
§ Scales on commodity hardware (1–1,000 nodes)
§ Rich built-in indexes, including full-text, scalar, geo
§ ACID transactions
§ Enterprise-grade operations
9. LexisNexis
§ $4.2 billion in revenue,
$2.6 billion LOB
§ 5 billion+ documents,
millions of updates/day
§ Real-time search,
discovery, analytics
§ From 9–12 months to
2 weeks for new products
§ Enterprise HA/DR
10. Top 5 Global Investment Bank
§ Real-time transparency
across all derivatives
§ Predictable scalability
§ Simplified architecture,
operations
§ Mission-critical uptime and
performance
http://www.flickr.com/photos/tenaciousme/1797368175/
11. US Government Intel Agency
§ Crawl of substantial
part of the Web
§ Evolving enrichment
§ Real-time analysis
§ Granular security
§ Centralized governance
§ ½ DBA
http://www.flickr.com/photos/usarak/4969182481
13. Unified Data
§ Flexible data model reduces need for ETL
§ Multiple simultaneous applications
§ Single governance model
14. Enterprise Operations
§ Predictable scalability
§ Replication and failover
§ Backup and recovery
§ Instrumentation and monitoring
15. Continuous Adaptation
§ Load data as-is, evolve with requirements
§ Add new sources in days, not months
§ Transactional updates for accuracy
16. Iterative Query
§ Real-time access
§ Multi-faceted queries
– Full text
– Structure, semantics, and relationships
– Scalar values and ranges
(date/time, numbers, strings)
– Geospatial
§ Alerting
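A multi-faceted query of the kind listed above can be sketched as a set of composable predicates. This is a toy illustration only: the `Doc` fields and the `word_query`, `date_range_query`, `geo_box_query`, and `and_query` helpers are hypothetical names, not MarkLogic APIs, and the linear scan stands in for what a real system would resolve from indexes.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Doc:
    uri: str
    text: str
    published: date
    lat: float
    lon: float


def word_query(word):
    """Full-text facet: the word appears in the document text."""
    w = word.lower()
    return lambda d: w in d.text.lower().split()


def date_range_query(start, end):
    """Scalar range facet: published date falls within [start, end]."""
    return lambda d: start <= d.published <= end


def geo_box_query(south, west, north, east):
    """Geospatial facet: the point lies inside a bounding box."""
    return lambda d: south <= d.lat <= north and west <= d.lon <= east


def and_query(*queries):
    """Combine facets so that a document must satisfy all of them."""
    return lambda d: all(q(d) for q in queries)


def search(docs, query):
    # Linear scan for illustration; a database would use its indexes.
    return [d.uri for d in docs if query(d)]


docs = [
    Doc("/a.xml", "Derivatives trade report", date(2012, 5, 1), 40.7, -74.0),
    Doc("/b.xml", "Derivatives market summary", date(2011, 1, 15), 51.5, -0.1),
    Doc("/c.xml", "Weather report", date(2012, 6, 1), 40.7, -74.0),
]
query = and_query(
    word_query("derivatives"),
    date_range_query(date(2012, 1, 1), date(2012, 12, 31)),
    geo_box_query(40.0, -75.0, 41.0, -73.0),
)
result = search(docs, query)
```

Because each facet is just a predicate, a user can add or drop one and re-run the search, which is the iterative pattern the slide describes.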
17. Big Data Application Platform
[Diagram: platform stack. Labels: APIs and tools; Visualization; Data Mining; Event Processing; Metadata; Search; Operational Environment; Analytic DB and EDW; Operational DB; Unstructured Content; Acquisition, Batch Analytics, and Enrichment; Hadoop; Archive]
18. In practice…
[Diagram: component stack. Labels: BI Tools; Applications; Stream and Event Processing; Search; Search Index; Stats (SPSS, SAS, R, …); Metadata; Analytic DB / EDW; Operational DB; Unstructured Content Store; Batch Analytics (Hadoop MR); Archive (HDFS)]
19. Simplified Architecture
[Diagram: component stack. Labels: BI Tools; Applications; Stream and Event Processing; Search; Search Index; Stats (SPSS, SAS, R, …); Metadata; Analytic DB / EDW; Operational DB; Unstructured Content Store; Batch Analytics (Hadoop MR); Archive (HDFS)]
23. Use Cases
[Diagram: Raw Data, Hadoop, MarkLogic + Connector for Hadoop, and Operational Applications, with three numbered use cases: (1) Intermediate Intelligence, (2) Progressive Enhancement, (3) Archive]
24. Intermediate Intelligence
Sophisticated pre-processing for real-time analytics
§ Aggregate, transform, enrich, join, restructure
§ Keep everything: Long-tail, cost-effective warm
storage in HDFS
§ Leverage MapReduce ecosystem for analysis,
ETL, and refinement
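The aggregation step can be pictured as a small map/shuffle/reduce job. The sketch below runs in a single process and only mimics the MapReduce programming model; the event fields and the `run_job`, `map_event`, and `reduce_counts` names are assumptions made for illustration, not Hadoop APIs.

```python
from collections import defaultdict


def map_event(event):
    """Map: emit (user, 1) for every raw event."""
    yield event["user"], 1


def reduce_counts(key, values):
    """Reduce: total the counts collected for one key."""
    return key, sum(values)


def run_job(records, mapper, reducer):
    # Shuffle: group every mapped value under its key.
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # Reduce each group; the summary could then be loaded for real-time use.
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


events = [
    {"user": "ann", "action": "view"},
    {"user": "bob", "action": "view"},
    {"user": "ann", "action": "buy"},
]
activity = run_job(events, map_event, reduce_counts)
```

The raw events stay in cheap storage, and only the compact summary needs to be served interactively.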
25. Progressive Enhancement
Enhance data incrementally to answer new questions
§ Enrich data for search, analytics, and delivery
§ Leverage MarkLogic indexes for performance,
accuracy
§ Leverage the growing Hadoop/Java ecosystem
for processing
§ Centralized governance, security in MarkLogic
26. Archive
Age out data to another storage tier
§ Align storage and processing resources with the
value of data
§ Maintain a complete picture of all data
§ Simplified lifecycle management for compliance
27. Reading Data from MarkLogic
Query for input, read in parallel directly from partitions
§ Specify input with a query or expression
§ Automatically divide up input for parallel Map
§ Each split covers one partition
[Diagram: documents 01–10, 11–18, 19–30, and 31–37 held in four partitions across Host 1 and Host 2]
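The split planning described on this slide can be sketched as follows. This is a simplified model, not the connector's code: `Partition`, `Split`, `plan_splits`, and `run_map` are hypothetical names, the assignment of partitions to hosts is assumed, and a thread pool stands in for parallel Map tasks.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Partition:
    host: str
    name: str
    doc_ids: range


@dataclass(frozen=True)
class Split:
    host: str
    partition: str
    doc_ids: tuple


def plan_splits(partitions, selects):
    """Create one split per partition, holding the documents the input query selects."""
    return [
        Split(p.host, p.name, tuple(d for d in p.doc_ids if selects(d)))
        for p in partitions
    ]


def run_map(split):
    """Stand-in for a Map task: count the documents in its split."""
    return split.partition, len(split.doc_ids)


# Document ranges follow the slide; the host assignment is an assumption.
partitions = [
    Partition("host1", "p1", range(1, 11)),
    Partition("host1", "p2", range(11, 19)),
    Partition("host2", "p3", range(19, 31)),
    Partition("host2", "p4", range(31, 38)),
]
splits = plan_splits(partitions, lambda doc_id: doc_id % 2 == 0)
with ThreadPoolExecutor() as pool:
    counts = dict(pool.map(run_map, splits))
```

Because each split names its host, a scheduler could direct every task to read straight from the machine that holds the partition.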
28. Writing Data to MarkLogic
Write in parallel directly to partitions
§ Auto-discovery of partition topology at job start
§ Client-side hashing to distribute writes
§ Writes directly to partitions
§ Batch update transactions for efficiency
[Diagram: Task 1, Task 2, and Task 3 writing in parallel to partitions on Host 1 and Host 2]
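Client-side hashing with batched writes might look like the sketch below. The hash function, the partition names, and the `pick_partition` and `plan_batches` helpers are assumptions for illustration; the slide does not say how the connector actually hashes or batches.

```python
import hashlib
from collections import defaultdict


def pick_partition(uri, partitions):
    """Choose a partition from a stable hash of the document URI."""
    digest = hashlib.md5(uri.encode("utf-8")).digest()
    return partitions[int.from_bytes(digest[:8], "big") % len(partitions)]


def plan_batches(uris, partitions, batch_size):
    """Group URIs by target partition, then cut each group into batches."""
    by_partition = defaultdict(list)
    for uri in uris:
        by_partition[pick_partition(uri, partitions)].append(uri)
    batches = []
    for partition in partitions:
        docs = by_partition[partition]
        for start in range(0, len(docs), batch_size):
            # Each batch would be sent as one update transaction.
            batches.append((partition, docs[start:start + batch_size]))
    return batches


# Partition topology would be discovered at job start; hard-coded here.
partitions = ["host1/p1", "host1/p2", "host2/p3", "host2/p4"]
uris = [f"/docs/{n}.xml" for n in range(1, 38)]
batches = plan_batches(uris, partitions, batch_size=5)
```

A stable hash means every task routes a given URI to the same partition without coordinating with the other tasks.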
29. Hortonworks Partnership
§ Simplified architecture: Certified MarkLogic
distribution of Hadoop using Hortonworks Data
Platform (HDP)
§ Operational: One-stop production support
§ Enterprise-Ready: Best practices and
reference architecture
30. MarkLogic Hadoop Roadmap
§ Today
– MarkLogic Connector for Hadoop
– Certification against 0.20.2
§ Next
– Unified distribution and support using Hortonworks Data Platform
– Reference architectures and best practices
§ Future
– Tools and ecosystem
– HDFS as storage
– Compute platform
31. [Diagram: Unified Data; Enterprise Operations; Continual Adaptation; Iterative Query]
33. Alerting for Real-Time Models
Alerting allows for real-time match-making
§ Generate statistical model of user behavior in
Hadoop
§ Mark-up documents (or sub-documents) with
match criteria
§ Combine full-text, geo, and scalar queries for
real-time decision-making in MarkLogic
§ Scale to billions of documents, trillions of
matches
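Alerting inverts the usual search: the queries are stored, and each arriving document is checked against them. The sketch below shows that idea with hypothetical `Rule` and `alerts_for` names and made-up document fields; it scans every rule, whereas a production system would index the stored queries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str
    words: frozenset   # full-text criterion: all words must appear
    max_price: float   # scalar criterion
    box: tuple         # geo criterion: (south, west, north, east)

    def matches(self, doc):
        south, west, north, east = self.box
        return (
            self.words <= set(doc["text"].lower().split())
            and doc["price"] <= self.max_price
            and south <= doc["lat"] <= north
            and west <= doc["lon"] <= east
        )


def alerts_for(doc, rules):
    """Return the names of all stored rules that the new document satisfies."""
    return [rule.name for rule in rules if rule.matches(doc)]


# Rules like these could be derived from a behavior model built offline.
rules = [
    Rule("nyc-bikes", frozenset({"bike"}), 500.0, (40.0, -75.0, 41.0, -73.0)),
    Rule("any-bikes", frozenset({"bike"}), 1000.0, (-90.0, -180.0, 90.0, 180.0)),
    Rule("nyc-cars", frozenset({"car"}), 9000.0, (40.0, -75.0, 41.0, -73.0)),
]
listing = {"text": "Road bike for sale", "price": 450.0, "lat": 40.7, "lon": -74.0}
matched = alerts_for(listing, rules)
```

Each rule combines a full-text, a scalar, and a geospatial condition, mirroring the combination the slide describes.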
Examples
34. What about HBase?
§ MarkLogic
– Documents
– Load as-is, ad hoc queries
– Integrated full-text search
– Built-in scalar, structure, geo-spatial indexes
– Multi-document ACID transactions
– MapReduce source and sink
– Scale to 100s of nodes on commodity hardware
§ HBase
– Sparse maps
– Model for expected access
– Typically Lucene/Solr bolt-on
– Secondary indexes exclusively in middleware
– Row-level atomicity, strong consistency
– MapReduce source and sink
– Scale to 100s of nodes on commodity hardware
35. In practice…
[Diagram: Metadata; Batch Analytics (Hadoop MR); Archive (HDFS)]
36. Why Hortonworks?
§ Leaders within Hadoop Community
§ Delivered every major Hadoop release since 0.1
§ Experience managing world’s largest deployment
§ Ongoing access to Y!’s 1,000+ users and 40k+ nodes for testing, QA, etc.
§ Unify and Enable Hadoop Ecosystem
§ 100% open-source
§ Training and support
§ Solutions and reference architectures
[Chart: Contributions to Hadoop Core, 2011]
37. Intermediate Intelligence Examples
§ ETL for data cleansing, de-duplication, joining
with reference data
§ Aggregate analysis on user behavior to affect
applications
39. Bulk Loading
Parallelize ingestion in MarkLogic for performance
§ Stage in HDFS, load in parallel into MarkLogic
§ Optionally process using MapReduce
[Chart: elapsed time in seconds (0–2,500) to ingest 9M documents versus cluster size (1–4), comparing a single MarkLogic client with MarkLogic + Hadoop]