What's Next for Google's BigTable
2. TODAY’S TALK
• History of the World: Part 3
• Bigtable/Accumulo Technology Overview
• Accumulo Demonstration
• Database Technology Survey
© 2014 Sqrrl Data, Inc. | All Rights Reserved 2
3. TIMELINE OF RELEVANT EVENTS
• 2006: Google’s BigTable Paper
• 2008: NSA Builds Accumulo
• 2011: NSA Open Sources Accumulo
• 2012: Sqrrl Founded
• 2013: 1st Sqrrl Release and Customers
4. Accumulo is a:
• Apache Software Foundation (ASF) Open-Source Software Project
• Clone of Google’s Bigtable
• Secure, Sorted Key-Value Store
• Row-level ACID (locally) Distributed NoSQL Database
5. Sqrrl is:
• A commercial software company located in Cambridge, MA
• A search and exploration platform built with Apache Accumulo
• An exciting startup with a long roadmap of challenging problems to solve
• Hiring!
7. BIGTABLE & ACCUMULO TECH OVERVIEW
1. Data Model & API
2. Underlying Architecture
3. Distinguishing Features
8. An Accumulo key is a 5-tuple, consisting of:
• Row: Controls Atomicity
• Column Family: Controls Locality
• Column Qualifier: Controls Uniqueness
• Visibility Label: Controls Access
• Timestamp: Controls Versioning
ACCUMULO DATA FORMAT: Accumulo Key/Value Example
Row       Col. Fam.     Col. Qual.     Visibility   Timestamp  Value
John Doe  Notes         PCP            PCP_JD       20120912   Patient suffers from an acute …
John Doe  Test Results  Cholesterol    JD|PCP_JD    20120912   183
John Doe  Test Results  Mental Health  JD|PSYCH_JD  20120801   Pass
John Doe  Test Results  X-Ray          JD|PHYS_JD   20120513   1010110110100…
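The 5-tuple key and its visibility labels can be sketched in a few lines of Python. This is an illustration only, not the Accumulo API: the names `Key`, `sort_key`, `visible`, and `scan` are hypothetical, and the label check handles only OR-ed tokens such as `JD|PCP_JD`, while real Accumulo visibility expressions also allow `&` and parentheses.

```python
from typing import NamedTuple


class Key(NamedTuple):
    row: str         # controls atomicity
    family: str      # controls locality
    qualifier: str   # controls uniqueness
    visibility: str  # controls access
    timestamp: int   # controls versioning


def sort_key(k: Key):
    # Fields sort ascending, except the timestamp, which sorts descending
    # so that the newest version of a cell is encountered first.
    return (k.row, k.family, k.qualifier, k.visibility, -k.timestamp)


def visible(label: str, auths: set) -> bool:
    # Simplified assumption: the label is a plain OR of tokens.
    return any(token in auths for token in label.split("|"))


entries = [
    (Key("John Doe", "Test Results", "Cholesterol", "JD|PCP_JD", 20120912), "183"),
    (Key("John Doe", "Notes", "PCP", "PCP_JD", 20120912), "Patient suffers from an acute …"),
    (Key("John Doe", "Test Results", "Mental Health", "JD|PSYCH_JD", 20120801), "Pass"),
]
entries.sort(key=lambda kv: sort_key(kv[0]))


def scan(auths: set):
    # Return only the cells the caller is authorized to see, in key order.
    return [(k.qualifier, v) for k, v in entries if visible(k.visibility, auths)]
```

A caller holding only the `JD` authorization sees the two test results but not the physician's notes, which are labeled `PCP_JD`.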
10. ACCUMULO TABLETS
• Collections of KV pairs form Tables
• Tables are partitioned into Tablets
• Metadata tablets hold info about other tablets, forming a 3-level hierarchy
• A Tablet is a unit of work for a Tablet Server
[Diagram: tablet lookup hierarchy]
• Well-Known Location (zookeeper) → Root Tablet (-∞ to ∞)
• Root Tablet → Metadata Tablet 1 (-∞ to “Encyclopedia:Ocelot”) and Metadata Tablet 2 (“Encyclopedia:Ocelot” to ∞)
• Table: Adam’s Table → Data Tablets (-∞ : thing), (thing : ∞)
• Table: Encyclopedia → Data Tablets (-∞ : Ocelot), (Ocelot : Yak), (Yak : ∞)
• Table: Foo → Data Tablet (-∞ to ∞)
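Finding the tablet that holds a row amounts to a binary search over a table's sorted split points. The sketch below uses the example tables from this slide. `SPLITS` and `tablet_for` are hypothetical names, and treating each tablet's end row as inclusive is an assumption of the sketch.

```python
import bisect

# Split points per table, taken from the slide's example.
# A table with N split points is partitioned into N + 1 tablets.
SPLITS = {
    "Adam's Table": ["thing"],
    "Encyclopedia": ["Ocelot", "Yak"],
    "Foo": [],
}


def tablet_for(table: str, row: str):
    """Return the (start, end) range of the tablet that holds `row`.

    Assumption: the start is exclusive and the end is inclusive.
    """
    splits = SPLITS[table]
    # bisect_left keeps a row equal to a split point in the lower tablet.
    i = bisect.bisect_left(splits, row)
    start = splits[i - 1] if i > 0 else "-inf"
    end = splits[i] if i < len(splits) else "+inf"
    return (start, end)
```

The same search, applied first to the root tablet and then to a metadata tablet keyed by table name and row, is one way to picture the 3-level hierarchy.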
11. ACCUMULO PROCESSES
[Diagram: Accumulo processes]
• Components: Applications, Tablet Servers (each hosting Tablets), Master, Zookeeper, HDFS
• Edge labels: Read/Write, Store/Replicate, Assign/Balance, Delegate Authority
12. TABLET DATA FLOW
[Diagram: data flow within a Tablet]
• Writes go to the In-Memory Map and the Write Ahead Log (For Recovery)
• Minor Compaction (Iterator Tree) writes the In-Memory Map out as a Sorted, Indexed File
• Merging / Major Compaction (Iterator Tree) combines Sorted, Indexed Files
• Reads are served by a Scan (Iterator Tree)
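The write and read paths on this slide follow a log-structured merge pattern, which can be sketched as below. This is a toy model rather than Accumulo code: the `Tablet` class and its method names are hypothetical, the write-ahead log is just an in-process list, and files are plain sorted lists.

```python
import heapq


class Tablet:
    def __init__(self):
        self.wal = []       # write-ahead log, kept only for recovery
        self.memtable = {}  # in-memory map
        self.files = []     # each "file" is a sorted list of (key, value)

    def write(self, key, value):
        # Log first, then apply to the in-memory map.
        self.wal.append((key, value))
        self.memtable[key] = value

    def minor_compact(self):
        # Flush the in-memory map to a new sorted file.
        if self.memtable:
            self.files.append(sorted(self.memtable.items()))
            self.memtable = {}
            self.wal = []

    def major_compact(self):
        # Merge all files into one; the newest value per key wins.
        if self.files:
            self.files = [list(self._merged(self.files))]

    def scan(self):
        # Merge-read across the files and the in-memory map.
        sources = self.files + [sorted(self.memtable.items())]
        return list(self._merged(sources))

    @staticmethod
    def _merged(sources):
        # Sources are ordered oldest to newest. Tagging entries with the
        # negated source index makes newer entries sort first per key.
        tagged = [[(k, -age, v) for k, v in src] for age, src in enumerate(sources)]
        last = object()
        for k, _, v in heapq.merge(*tagged):
            if k != last:
                yield (k, v)
                last = k
```

Writes never touch existing files, which keeps ingest sequential; the cost is that reads must merge several sorted sources until a major compaction collapses them.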
13. Iterator Operations:
• File Reads
• Block Caching
• Merging
• Deletion
• Isolation
• Locality Groups
• Range Selection
• Column Selection
• Cell-level Security
• Versioning
• Filtering
• Aggregation
• Partitioned Joins
ITERATOR FRAMEWORK
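Several of these operations (versioning, column selection, aggregation) can be pictured as a stack of iterators, each wrapping the sorted stream produced by the one below it. The sketch uses Python generators as a stand-in; the function names and the `(row, column, timestamp, value)` entry layout are assumptions made for illustration, not Accumulo's iterator interface.

```python
from itertools import groupby


def versioning(source, max_versions=1):
    # Input is sorted by (row, column) with the newest timestamp first,
    # so keeping the first N entries per cell keeps the newest N versions.
    for _, versions in groupby(source, key=lambda e: (e[0], e[1])):
        for i, entry in enumerate(versions):
            if i < max_versions:
                yield entry


def column_filter(source, column):
    # Column selection: pass through only entries for one column.
    return (e for e in source if e[1] == column)


def summing(source):
    # Aggregation: sum the values that remain for each row.
    for row, group in groupby(source, key=lambda e: e[0]):
        yield (row, sum(e[3] for e in group))


entries = [  # (row, column, timestamp, value), already in sorted order
    ("alice", "clicks", 3, 7),
    ("alice", "clicks", 1, 5),
    ("alice", "views", 2, 40),
    ("bob", "clicks", 2, 4),
]

stack = summing(column_filter(versioning(iter(entries)), "clicks"))
result = list(stack)
```

Because every stage consumes and produces a sorted stream, stages compose freely and nothing is materialized until the stream is consumed.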
15. ACCUMULO LATENCIES
[Diagram: ingest-to-query pipeline]
• Ingesters: Input → Batch Writer
• Tablet Servers (×3): In-Memory Map, RFile(s), Compaction Iterators, Scan Iterators
• Queriers: Scanner/Batch Scanner → Output
• Latency labels shown: ~ms, ~ms, ~ms, and ms-min
16. ACCUMULO THROUGHPUT
[Diagram: same ingest-to-query pipeline and latency labels as the ACCUMULO LATENCIES slide]
• Scan: ~1M entries/s per node
• Ingest: ~200K entries/s per node
• Read-Modify-Write Latency: ~ms, so >1K entries/s is challenging with R-M-W
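The read-modify-write ceiling follows from simple arithmetic: if each update must wait for a full round trip before the next can start, the rate of one serial stream is the inverse of the latency. The sketch below assumes a latency of exactly 1 ms as a stand-in for the slide's "~ms"; the function names are hypothetical.

```python
import math


def serial_rmw_rate(latency_ms: float) -> float:
    # One dependent chain completes one update per round trip.
    return 1000.0 / latency_ms


def streams_needed(target_rate: float, latency_ms: float) -> int:
    # Independent R-M-W streams needed to reach a target rate,
    # assuming the streams never contend with each other.
    return math.ceil(target_rate / serial_rmw_rate(latency_ms))


rate = serial_rmw_rate(1.0)              # entries/s for a single stream
streams = streams_needed(200_000, 1.0)   # to match the slide's ingest figure
```

Under these assumptions a single stream tops out near 1K entries/s, and matching the ~200K entries/s blind-write ingest rate would take on the order of 200 non-conflicting streams.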
19. SURVEY OF DATABASE TECHNOLOGY
• Exercises in Center-Seeking
• SQL vs. NoSQL
• Ingest-time vs. Query-time Analytics
• ACID vs. BASE
• Normalized vs. Denormalized Data Models
• Primary Use Cases for Sqrrl+Accumulo
20. SQL VS. NOSQL
NoSQL
• Optimized for get/put operations
• Specialized for client languages
• High concurrency
• More client-side control
Hybrid
• Extend and evolve SQL
• Standardize and incorporate NoSQL paradigms
SQL
• Optimized for joins
• Strong mathematical roots in set theory
• Automatic query optimization
21. INGEST-TIME VS. QUERY-TIME ANALYTICS
Ingest-Time
• Optimized for online statistics
• Can reduce storage footprint
• Can be indexed for low latency
• Leverages a variety of indexes
• Requires extensive data organization at ingest
Hybrid
• Create partial summary at ingest (Question-focused datasets, knowledge bases, etc.)
• Support ad-hoc queries over summaries
• Leverage all known indexing strategies
Query-Time
• Can compute holistic statistics, like ranking, topN, etc.
• Ad-hoc analytics: don’t know the query ahead of time
• High latency and low concurrency at scale
• Leverages block indexes, columnar layout
• Ingest can be “stream to disk”
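The trade-off can be made concrete with a toy event stream. In this sketch the ingest-time path updates a running summary as each event arrives, the query-time path scans raw data on demand, and the hybrid path answers an ad-hoc topN question from the summary. All names and the sample events are invented for illustration.

```python
from collections import Counter

events = ["GET", "POST", "GET", "PUT", "GET", "POST"]

summary = Counter()  # ingest-time: maintained incrementally
raw_log = []         # query-time: raw data, "streamed to disk"


def ingest(event: str) -> None:
    summary[event] += 1    # extra work at ingest buys cheap lookups later
    raw_log.append(event)  # appending raw data is cheap by itself


def count_from_summary(event: str) -> int:
    # Ingest-time analytics: a single lookup, independent of data volume.
    return summary[event]


def count_from_scan(event: str) -> int:
    # Query-time analytics: scans everything, but needs no prior planning.
    return sum(1 for e in raw_log if e == event)


def top_n_from_summary(n: int):
    # Hybrid: an ad-hoc query answered over the partial summary.
    return summary.most_common(n)


for e in events:
    ingest(e)
```

Both counting paths return the same answer; they differ in where the cost is paid and in whether the question had to be known before the data arrived.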
22. ACID VS. BASE
ACID
• Atomicity: all or nothing for a group of operations
• Consistency and Isolation: support simple reasoning for distributed, multithreaded clients
• Durability: simple reasoning for whether data might be lost
Hybrid
• Must make some relaxations for performance at scale (under failure modes)
• Many options for “Lightweight” transaction support
• Accumulo limits atomicity, consistency, and isolation to row-level operations
BASE
• Basically Available: ensure that core operations always complete in an advertised time
• Soft-State: relaxation of referential integrity, etc.
• Eventual Consistency: relaxation of
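Row-level atomicity and isolation can be illustrated with a store that applies all column updates for one row under a per-row lock, so a reader never observes a half-applied mutation. This is a single-process sketch under that assumption; `RowStore` and its methods are hypothetical and say nothing about how Accumulo implements the guarantee.

```python
import threading
from collections import defaultdict


class RowStore:
    def __init__(self):
        self._rows = defaultdict(dict)
        self._locks = defaultdict(threading.Lock)
        self._guard = threading.Lock()  # protects the lock table itself

    def _lock_for(self, row):
        with self._guard:
            return self._locks[row]

    def mutate(self, row, updates):
        # All column updates for one row become visible together.
        with self._lock_for(row):
            self._rows[row].update(updates)

    def read_row(self, row):
        # Readers take the same lock, so they see a row either before or
        # after a mutation, never in between.
        with self._lock_for(row):
            return dict(self._rows[row])
```

Mutations that span two rows take two separate locks here, so nothing ties them together, which mirrors the slide's point that the guarantees stop at the row boundary.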
23. NORMALIZED VS. DENORMALIZED DATA MODELS
Normalized
• “Normal Form Relational Database”
• Minimizes data footprint
• Minimizes cost of data maintenance
• Can lead to expensive joins at query time
Hybrid
• Start with document store
• Introduce links/edges for quick joins
• Dynamically adapt to flexible or sparse schemas
• Similar to property graphs
Denormalized
• “Document Store”
• Flexible schema lets applications adapt quickly to changing environments
• Pre-joined to eliminate joins at query-time
• Optimized for “append-only” data
• Can inflate data sizes and slow data ingest
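The difference shows up in how a query is answered. In the sketch below the normalized form stores patients and results separately and joins them at query time, while the denormalized form stores one pre-joined document per patient. The data reuses the John Doe example from the data format slide; the structure and function names are assumptions made for illustration.

```python
# Normalized: each fact is stored once; queries join on patient_id.
patients = {1: {"name": "John Doe"}}
results = [
    {"patient_id": 1, "test": "Cholesterol", "value": "183"},
    {"patient_id": 1, "test": "Mental Health", "value": "Pass"},
]


def results_normalized(patient_id: int):
    name = patients[patient_id]["name"]
    # The join: scan the results table for matching foreign keys.
    return [(name, r["test"], r["value"])
            for r in results if r["patient_id"] == patient_id]


# Denormalized: one pre-joined document per patient.
documents = {
    1: {
        "name": "John Doe",
        "results": [
            {"test": "Cholesterol", "value": "183"},
            {"test": "Mental Health", "value": "Pass"},
        ],
    }
}


def results_denormalized(patient_id: int):
    doc = documents[patient_id]
    # No join: everything needed is already inside the document.
    return [(doc["name"], r["test"], r["value"]) for r in doc["results"]]
```

Both functions return the same rows. The document form avoids the join but would repeat shared data if, for example, the same result had to appear in more than one document.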
24. KNOWLEDGE-BASE USE CASE
Data sources: HR, Netflow, Proxy Logs, Email, Social Media
Example log record:
2014-04-14 06:36:09 429 73.105.179.202 username@msn.com 500 POST application/json HTTPS "wikipedia.org:443/grouchinesses/?215=felled&297=wading&768=shimmies..." "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.43 Safari/537.31" 208.80.152.201
25. STREAM PROCESSING USE CASE
[Diagram components: DATA, SPE, Dashboards, Actions, Interactive Analysis Tools (Discovery + Forensics)]
1. SPE queries Sqrrl to enrich streaming data
2. SPE persists results in Sqrrl for future query
3. SPE takes action automatically
4. SPE issues data-driven alerts
5. Sqrrl provides context for dashboards
6. Analysis tools use Sqrrl to search and manipulate historical data
26. SQRRL OPERATIONALIZES ACCUMULO WITH...
• Data-Centric Security
• Petabyte Scale and Operational Speeds
• Document and Graph Data Models
• SqrrlQL, including Aggregates, Secure Full-Text Search, and Secure Graph Search
• Analytics, including Real-Time Statistics and Hadoop Integrations
28. UPCOMING EVENTS
Accumulo Summit 2014
• June 12 in College Park, MD
• http://accumulosummit.com
• Multiple tracks of talks from the leaders of the Accumulo community
IEEE HPEC Conference 2014
• September 9-11 in Waltham, MA
• http://www.ieee-hpec.org/
• Accumulo Users Group Meeting as a Special Event
• Accumulo tutorial
Watch for more meetup opportunities coming soon!