One Billion Objects in 2GB:
Big Data Analytics on Small
Clusters with Doradus OLAP
Randy Guck
Principal Engineer
Dell Software Group
What is Doradus?
• Storage and query service
• Leverages Cassandra NoSQL DB
• Pure Java
- Stateless
- Embeddable or standalone
• Open source: Apache 2.0 License
30 Doradus: The Tarantula Nebula
Source: Hubble Space Telescope
Why Use Doradus?
• Easy to use: no client driver (see the REST sketch below)
• Spider storage manager
- Good for unstructured data
• OLAP storage manager
- Near real time data warehousing
• Compared to Cassandra alone:
- Data model, searching, analytics
• Compared to Hadoop:
- Fast data loads and queries
- Dense storage: less hardware
[Architecture diagram: Applications → REST API → Doradus (OLAP and Spider storage managers) → Cassandra data]
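Because all access is over plain HTTP, no Doradus-specific client driver is needed; any HTTP library works. Below is a minimal sketch in Java, assuming a local Doradus instance on port 1123 and an application-listing path of /_applications (both are assumptions for illustration; check doradus.yaml and the REST reference for your build).

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class DoradusRestExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical endpoint: list the applications defined in this Doradus instance.
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:1123/_applications"))
            .header("Accept", "application/json")   // Doradus REST messages are JSON or XML
            .GET()
            .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}

The same pattern, build a URI, set Accept to JSON or XML, read the body, applies to every Doradus REST call.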
A Multi-Node Cluster
[Diagram: Applications use the REST API to reach Doradus; Nodes 1, 2, and 3 each run Cassandra with its data; secondary Doradus instances are optional]
Why Did We Build Doradus OLAP?
• Some tough customer requirements:
- Statistical queries most important
- Need to scan millions of objects/second
- User-customizable “insights” = millions of possible queries
• Couldn’t use indexes, pre-computed queries, etc.
• Disk physics (rough arithmetic below)
- ~hundreds of random reads/second
- ~thousands of serial reads/second
• Needed a radically new approach!
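A rough, illustrative calculation using the ballpark rates above (not a measurement) shows why index-driven random I/O could not meet the "millions of objects/second" requirement:

  1,000,000 objects / ~200 random reads per second ≈ 5,000 seconds (about 83 minutes)
  1,000,000 objects / ~2,000 serial reads per second ≈ 500 seconds (about 8 minutes)

Either way the gap is orders of magnitude, which is why Doradus OLAP scans compressed value arrays in memory instead (see the query execution slide later).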
Doradus OLAP
• Combines ideas from:
- Online Analytical Processing: data arranged in static cubes
- Columnar databases: column-oriented storage and compression
- NoSQL databases: sharding
• Features:
- Fast loading: up to 500K objects/second/node
- Dense storage: 1 billion objects in 2 GB!
- Fast cube merging: typically seconds
- No indexes!
Example: Message Tracking Schema
[Schema diagram: five tables (Message, Participant, Address, Person, Attachment) connected by bidirectional links: Message.Participants ↔ Participant.Message, Participant.Address ↔ Address.Participants, Address.Person ↔ Person.Address, Message.Attachments ↔ Attachment.Message, plus the reflexive Person.Manager ↔ Person.Employees]
DQL Object Queries
• Builds on Lucene syntax
- Full text queries
• Adds link paths
- Directed graph searches
- Quantifiers and filters
- Transitive searches
• Other features
- Stateless paging
- Sorting
• Examples:
- LastName = Smith AND NOT (FirstName : Jo*) AND BirthDate = [1986 TO 1992]
- ALL(Participants).ANY(Address.WHERE(Email='*.gmail.com')).Person.Department : support
- Employees^(4).Office='San Jose'
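For illustration, here is how the first example above might be issued through the REST API from plain Java. The application name (Email), table (Person), port, and the /{application}/{table}/_query?q=... path are assumptions for this sketch; consult the DQL/REST documentation for the exact URI form.

import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class DqlObjectQueryExample {
    public static void main(String[] args) throws Exception {
        // First example query from the slide, URL-encoded into the q parameter.
        String dql = "LastName = Smith AND NOT (FirstName : Jo*) AND BirthDate = [1986 TO 1992]";
        String uri = "http://localhost:1123/Email/Person/_query?q="   // hypothetical app/table/path
                   + URLEncoder.encode(dql, StandardCharsets.UTF_8);
        HttpResponse<String> response = HttpClient.newHttpClient().send(
            HttpRequest.newBuilder().uri(URI.create(uri)).header("Accept", "application/json").GET().build(),
            HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());   // matching objects as JSON
    }
}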
DQL Aggregate Queries
• Metric functions
- COUNT, AVERAGE, MIN, MAX, DISTINCT, ...
• Multi-level grouping
• Grouping functions
- BATCH, BOTTOM, FIRST, LAST, LOWER, SETS, TERMS, TOP, TRUNCATE, UPPER, WHERE, ...
• Examples:
- metric=COUNT(*), AVERAGE(Size), MIN(Participants.Address.Person.Birthdate)
- metric=DISTINCT(Attachments.Extension); groups=Tags, Participants.Address.Person.Department; query=Attachments.Size > 100000
- metric=AVERAGE(Size); groups=TOP(10,Participants.Address.Email)
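As a sketch of how the aggregate parameters might be assembled on the client side, the snippet below builds the URI for the third example. The _aggregate path and port are assumptions; the metric/groups/query parameter names follow the examples above.

public class DqlAggregateQueryExample {
    public static void main(String[] args) {
        // Third example from the slide: average Size for the top 10 participant email addresses.
        String metric = "AVERAGE(Size)";
        String groups = "TOP(10,Participants.Address.Email)";
        // Hypothetical URI shape; issue it with any HTTP client, as in the earlier example.
        String uri = "http://localhost:1123/Email/Message/_aggregate"
                   + "?metric=" + java.net.URLEncoder.encode(metric, java.nio.charset.StandardCharsets.UTF_8)
                   + "&groups=" + java.net.URLEncoder.encode(groups, java.nio.charset.StandardCharsets.UTF_8);
        System.out.println(uri);
    }
}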
OLAP Data Loading
[Diagram, built up over four slides: source streams (Events, People, Computers, Domains) feed batches (Batch 1, Batch 2, Batch 3, Batch 4, ...); batches are merged into date-named shards (2014-03-01, 2014-02-28, 2014-02-27, ...); the merged shards make up the OLAP store]
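A minimal sketch of the batching pattern on the application side, assuming one loader per daily shard. The loadBatch method is a hypothetical stand-in for the actual Doradus load call (REST or embedded), not a Doradus API.

import java.util.ArrayList;
import java.util.List;

public class BatchLoader {
    private static final int BATCH_SIZE = 10_000;
    private final String shard;                       // e.g. "2014-03-01": one loader per daily shard
    private final List<String> pending = new ArrayList<>();

    public BatchLoader(String shard) { this.shard = shard; }

    // Add one source record; cut a batch whenever BATCH_SIZE records have accumulated.
    public void add(String record) {
        pending.add(record);
        if (pending.size() >= BATCH_SIZE) flush();
    }

    public void flush() {
        if (pending.isEmpty()) return;
        loadBatch(shard, new ArrayList<>(pending));   // batches for the same shard are merged later
        pending.clear();
    }

    // Placeholder: wire this to the Doradus OLAP load call for your deployment.
    private void loadBatch(String shard, List<String> batch) {
        System.out.printf("loading %d objects into shard %s%n", batch.size(), shard);
    }
}

Because merging a shard typically takes seconds, newly loaded batches become queryable shortly after they are merged, which is what makes the near real time data warehousing claim work.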
Storing Batches
Field value arrays (one array per field, in object-ID order):

Field        Values
ID           5amhvv7J2otBu48Z6PE5cA   7CgvDf5mOU78jNVc58eu   cZpz2q4Jf8Rc2HK9Cg08   ...
Size         48120                    5435                   24220                  ...
SendDate     1280246462000            1279354872112          1279357261413          ...
Priority     0                        0                      1                      ...
Subject.txt  ballades encash nautch colloquy geared nettlier outdoors culvert hypothec winder stolons ungot guiding rupiahs outgone ...
Subject      1                        2                      0                      ...
...

Data is sorted by object ID and stored as columnar, compressed blobs:

Key                                              Columns
Email/Message/2014-03-01/{Batch GUID}/ID         [compressed data]
Email/Message/2014-03-01/{Batch GUID}/Size       [compressed data]
Email/Message/2014-03-01/{Batch GUID}/SendDate   [compressed data]
...                                              ...

(OLAP table → field value arrays → compressed rows)
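To make the columnar-blob idea concrete, here is a minimal, non-Doradus sketch: it sorts a small batch by object ID, builds one value array per field, and compresses each array independently with java.util.zip. The field names and sample values come from the table above; the real store uses binary encodings rather than newline-joined strings.

import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Comparator;
import java.util.zip.DeflaterOutputStream;

public class ColumnarBatchSketch {
    record Message(String id, long size, long sendDate) {}

    public static void main(String[] args) throws Exception {
        Message[] batch = {
            new Message("cZpz2q4Jf8Rc2HK9Cg08", 24220, 1279357261413L),
            new Message("5amhvv7J2otBu48Z6PE5cA", 48120, 1280246462000L),
            new Message("7CgvDf5mOU78jNVc58eu", 5435, 1279354872112L),
        };
        // 1. Sort the batch by object ID so every column is stored in the same order.
        Arrays.sort(batch, Comparator.comparing(Message::id));

        // 2. Build one value array per field (columnar layout).
        StringBuilder idColumn = new StringBuilder();
        StringBuilder sizeColumn = new StringBuilder();
        StringBuilder dateColumn = new StringBuilder();
        for (Message m : batch) {
            idColumn.append(m.id()).append('\n');
            sizeColumn.append(m.size()).append('\n');
            dateColumn.append(m.sendDate()).append('\n');
        }

        // 3. Compress each column independently, one blob per {shard}/{field} row.
        System.out.println("ID column blob:       " + compress(idColumn.toString()).length + " bytes");
        System.out.println("Size column blob:     " + compress(sizeColumn.toString()).length + " bytes");
        System.out.println("SendDate column blob: " + compress(dateColumn.toString()).length + " bytes");
    }

    static byte[] compress(String column) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DeflaterOutputStream out = new DeflaterOutputStream(bytes)) {
            out.write(column.getBytes(StandardCharsets.UTF_8));
        }
        return bytes.toByteArray();
    }
}

Sorted, same-typed columns compress far better than interleaved row data, which is where much of the density comes from.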
Merging Batches
Batches for shard 2014-03-01 (Batch #1, Batch #2, ...), each holding Message (ID, Size, SendDate, ...), Person (ID, FirstName, LastName, ...), and Address (ID, Person, Messages, ...) table data, are merged into one set of rows per shard in the OLAP store:

Key                                    Columns
Email/Message/2014-03-01/ID            [compressed data]     (Message table data, shard 2014-03-01)
Email/Message/2014-03-01/Size          [compressed data]
Email/Message/2014-03-01/SendDate      [compressed data]
...                                    ...
Email/Person/2014-03-01/ID             [compressed data]     (Person table data, shard 2014-03-01)
Email/Person/2014-03-01/FirstName      [compressed data]
Email/Person/2014-03-01/LastName       [compressed data]
...                                    ...
Email/Address/2014-03-01/ID            [compressed data]     (Address table data, shard 2014-03-01)
Email/Address/2014-03-01/Person        [compressed data]
Email/Address/2014-03-01/Message       [compressed data]
...                                    ...
Email/Message/2014-02-28/ID            [compressed data]     (Data for other shards)
Email/Message/2014-02-28/Size          [compressed data]
...
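Conceptually, merging is a merge of already-sorted columns, which is why it is fast and needs no indexes. A simplified sketch, assuming unique object IDs and batches that are already sorted by ID (real merging also handles updates, deletes, links, and the remaining fields):

import java.util.ArrayList;
import java.util.List;

public class MergeBatchesSketch {
    /** Merge two ID columns, each already sorted, into one sorted shard-level column. */
    static List<String> mergeIdColumns(List<String> batch1, List<String> batch2) {
        List<String> merged = new ArrayList<>(batch1.size() + batch2.size());
        int i = 0, j = 0;
        while (i < batch1.size() && j < batch2.size()) {
            // Take the smaller ID next; a tie would mean the same object was loaded twice,
            // in which case the later batch's values win (not shown here).
            if (batch1.get(i).compareTo(batch2.get(j)) <= 0) merged.add(batch1.get(i++));
            else merged.add(batch2.get(j++));
        }
        while (i < batch1.size()) merged.add(batch1.get(i++));
        while (j < batch2.size()) merged.add(batch2.get(j++));
        return merged;
    }

    public static void main(String[] args) {
        List<String> b1 = List.of("5amhvv7J2otBu48Z6PE5cA", "cZpz2q4Jf8Rc2HK9Cg08");
        List<String> b2 = List.of("7CgvDf5mOU78jNVc58eu");
        System.out.println(mergeIdColumns(b1, b2));   // one merged, sorted ID column per shard
    }
}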
Does Merging Take Long?
OLAP Query Execution
• Example query:
- Count messages with Size between 1000-10000 and HasBeenSent=false in shards 2014-03-01 to 2014-03-31
• How many rows are read?
- 2 fields x 31 shards = 62 rows
- Typically represents millions of values
• Value arrays are scanned in memory
• Physical rows are read on “cold” start only
- Multiple caching levels for “warm” and “hot” data
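A simplified model of that scan, assuming the shard's Size and HasBeenSent arrays have already been loaded (and decompressed) into parallel in-memory arrays; actual Doradus value arrays are compressed and typed, but the per-shard work has the same shape.

public class ShardScanSketch {
    /**
     * Count messages with 1000 <= size <= 10000 and hasBeenSent == false in one shard.
     * The two arrays are parallel: element i describes the i-th message in the shard.
     */
    static long countMatches(int[] sizes, boolean[] hasBeenSent) {
        long count = 0;
        for (int i = 0; i < sizes.length; i++) {
            if (sizes[i] >= 1000 && sizes[i] <= 10000 && !hasBeenSent[i]) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // One call per shard: 31 shards (2014-03-01 .. 2014-03-31) => 62 value arrays scanned.
        int[] sizes = {48120, 5435, 24220};
        boolean[] sent = {true, false, true};
        System.out.println(countMatches(sizes, sent));   // prints 1
    }
}

Run once per shard in the range, this touches exactly the 62 physical rows counted above.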
1 Billion Objects in 2GB?
Example Security Event (CSV format):

Fixed Fields                                      Variable Fields
Computer Name   MAILSERVER18                      1   MAILSERVER18$
Log Name        Security                          2
Time Stamp      Sun, 22 Jan 2013 08:09:50 UTC     3   Workstation
Type            Success Audit                     4   (0x0,0x142999A)
Source          Security                          5   3
Category        Logon/Logoff                      6   Kerberos
Event ID        540                               7   Kerberos
User Domain     NT AUTHORITY
User Name       SYSTEM
User SID        S-1-5-18

MAILSERVER18,Security,"Sun, 22 Jan 2013 08:09:50 UTC","Success Audit",Security,
"Logon/Logoff", 540,"NT AUTHORITY",SYSTEM,S-1-5-18,7,MAILSERVER18$,,Workstation,
"(0x0,0x142999A)",3,Kerberos,Kerberos
Events Schema

Events table (count: 115 million)
Fields:
• ComputerName (text)
• LogName (text)
• Timestamp (timestamp)
• Type (text)
• Source (text)
• Category (text)
• EventID (integer)
• UserDomain (text)
• UserSID (text)
• Params (link)

Insertion Strings table (count: 880 million)
Fields:
• Index (integer)
• Value (text)
• Event (link, inverse of Params)
Event Schema Load
• Load stats:
Total shards: 860
Total events: 114,572,247
Total ins strings: 879,529,753
Total objects: 994,102,000
Total load time: 2 hours, 2 minutes, 36 seconds (MacBook Air)
• Space usage:
nodetool -h localhost status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Owns Host ID Token Rack
UN 127.0.0.1 1.96 GB 100.0% 860887ef-2027-431a-a425-c67a9445d0e6 -9176223118562734495 rack1
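For perspective, the implied density (simple arithmetic from the figures above): 1.96 GB is roughly 2.1 billion bytes spread over 994,102,000 objects, or about 2 bytes per object on disk after columnar compression and merging.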
Demo
• 1) Count all Events in all shards
- 860 shards => 115M events
• 2) Find the top 5 hours-of-the-day when certain privileged events fail:
- Event IDs are any of 577, 681, 529
- Event type is ‘Failure Audit’
- Insertion string 8 is (0x0,0x3E7)
- Event occurred in first half of 2005 (181 shards)
Doradus OLAP Summary
• Advantages:
- Simple REST API
- All fields are searchable without indexes
- Ad-hoc statistical searches
- Support for graph-based queries
- Near real time data warehousing
- Dense storage = less hardware
- Horizontally scalable when needed
Doradus OLAP Summary
• Good for applications where data:
- Is continuous/streaming
- Is structured to semi-structured
- Can be loaded in batches
- Is partitionable, especially by time
- Is typically queried in a subset of shards
- Emphasizes statistical queries
Thank You!
• Where to find Doradus
- Source: github.com/dell-oss/Doradus
- Downloads: search.maven.org
• Contact me
- Randy.Guck@software.dell.com
- @randyguck

Editor's Notes

1. There are many good software modules available today that provide big data analytics using distributed clusters. Some applications need fast aggregate queries on large data sets but cannot justify the cost or complexity of a large database cluster. If an application’s requirements meet certain constraints, such as the ability to partition its data into time-based shards, it can benefit from Doradus OLAP and probably get all of its data on a single node.
Doradus is an open source storage and query engine that runs on top of Cassandra. It offers a high-level data model, full text searching, and aggregate queries such as AVERAGE() with multi-level grouping. Doradus offers two storage services that use specialized storage and query execution techniques for different application domains. The Doradus OLAP service, described in this session, borrows techniques from Online Analytical Processing, columnar storage, compression, and sharding to yield extremely compact databases. Though many applications may be able to use a single node, the underlying Cassandra NoSQL database allows data to be distributed and replicated over a cluster when needed.
Some of the features that will be described are:
- Doradus OLAP uses as little as 1% of the disk space used by other systems.
- OLAP cubes, called shards, can be updated and rebuilt very quickly, typically in a few seconds.
- The REST API supports JSON and XML for maximum accessibility.
- A simple but flexible data model supports multi-valued scalar fields and bi-directional links, which provide inter-object relationships with full referential integrity.
- All data fields are searchable using a Lucene-compatible query language that supports terms, phrases, wildcards, ranges, inequalities, etc.
- The query language uses link paths, which are much simpler than joins. They offer rich query features such as quantifiers, path filters, transitive searches, and more.
- Aggregate queries perform fast, complex metric computations with multi-level grouping, without indexes!
- Though many databases will fit on a single node, the Cassandra persistence layer allows the database to be distributed on a multi-node cluster for scalability, replication, and failover.
A case study using time-stamped event data is used to demonstrate features and show how one billion objects require only 2GB of disk space.
2. From a high-level view, Doradus is a storage and query service that runs on top of the Cassandra NoSQL database. Like Cassandra, Doradus is a pure Java application. Doradus can be embedded within another application or run as a standalone process. Doradus is an open source project released under the Apache License Version 2.0. Source code and other links are provided later in this presentation.
Where did the name Doradus come from? “30 Doradus” is another name for the Tarantula Nebula, which is a relatively new star cluster (as star clusters go). It is the birthplace of many new stars.
  3. Why might you want to use Doradus? First, it’s easy to use because it supports a REST API that can be accessed from any platform and language without a specialized client library. Also, Doradus offers two storage managers that are ideal for different application types. Each application schema chooses the storage manager ideal for its data. The data model and query language used by each storage manager are nearly identical. The Doradus Spider storage manager is good for unstructured data and applications that emphasize immediate indexing and full text queries. It is analogous to Lucene-based search engines such as Solr or ElasticSearch, though the Doradus data model and query language have extra benefits such as graph queries. Details of Spider are beyond the scope of this presentation. The Doradus OLAP storage manager is relatively new, and it is the focus of this presentation. OLAP provides unique features that allow it to act like a data warehouse, except that data can be loaded more quickly and uses far less disk space than other databases. The time between loading data and being able to query it is sufficiently short that we like to say that OLAP allows “near real time data warehousing”. Compared to using Cassandra by itself, Doradus offers a higher-level data model, rich search features, and analytics through statistical queries. Compared to Hadoop, Doradus allows fast loads and queries while using much less storage, thereby requiring less hardware.
  4. Doradus can be embedded in the same JVM as another application, which then calls Doradus service methods directly. This is ideal for bulk load applications that need maximum performance because the overhead of the REST API is avoided. When Doradus executes as a standalone process, applications communicate using the REST API passing JSON or XML messages. A minimum deployment is a single Doradus process accessing a one-node Cassandra database. When needed, Cassandra can be deployed on a multi-node cluster to provide replication, failover, and horizontal scaling. Cassandra also supports multi-data center deployment, which allows data to be replicated to multiple geographic regions. Technically, only a single Doradus instance is needed, operating anywhere on the network. Doradus will automatically rotate requests through Cassandra instances and retry/failover requests if a Cassandra node fails. However, Doradus is stateless, so multiple instances can execute against the same Cassandra cluster. One deployment model is to execute a Doradus instance on each Cassandra node. This makes all nodes equal and provides applications with failover options should any Doradus/Cassandra node fail. Doradus uses the same masterless/peer model as Cassandra, although Doradus instances do not directly communicate with each other.
  5. Given all the great NoSQL and big data software available today, much of it open source, why did we build Doradus OLAP? Originally, we built Doradus Spider believing that full text document searching was the primary use case. However, our primary customer team decided that statistical queries were more important. They wanted complex computational queries with multi-level grouping and lots of filtering features. Because these queries are generated from user-customizable insights, there are infinite ways in which data can be dissected. We therefore couldn’t use indexes or tricks such as pre-computing queries in the background. Given our customer’s response time requirements, we needed to scan millions of objects per second. However, we ran into disk physics that we couldn’t do anything about. Most disks will perform ~hundreds of random reads and ~thousands of serial reads per second, but no more. Furthermore, the entire database wouldn’t fit in memory, so a memory database wasn’t feasible either. We needed a radically new approach.
6. We decided to build a new storage manager, which we call Doradus OLAP. Its design borrows ideas from several disciplines:
- Online Analytical Processing: OLAP is widely used in data warehouses, where data is transformed and stored in n-dimensional cubes.
- Columnar databases: These databases store column values instead of row values in the same physical record.
- NoSQL databases: These databases commonly use sharding to co-locate data, especially time-series data.
Doradus OLAP combines ideas from these areas to provide a new storage service that is suitable for certain kinds of applications. Compared to the Spider storage manager, OLAP imposes some restrictions. But for applications that can use it, OLAP offers many advantages including:
- Fast data loading: We’ve observed loads up to 500,000 objects/second on a single node in bulk-load scenarios.
- Dense storage: Data is stored in a way that is highly compressible.
- Fast merging: One of the typical drawbacks of data warehouses is the time required to update the database with new data. With Doradus OLAP, data is stored in cubes that can be updated quickly, often in a few seconds.
To accomplish this, Doradus OLAP uses no indexes. It is also not a memory database, though it uses in-memory processing for its queries.
  7. Here is an example Doradus message tracking application schema, consisting of five tables. Depicted are the link fields that form relationships. Every link field has an inverse that defines the same relationship from the opposite direction. For example, the Message table defines a link called Participants, which points to the Participant table. The inverse link is called Message, which points back to the Message table. Doradus maintains inverse links automatically: if Message.Participants is updated, the inverse Participant.Message link values are automatically added or deleted as needed. Note that the Person table defines a link called Manager, which points to the same table, Person; Employees is the inverse link. The Manager/Employees relationship forms an org chart graph that can be searched “up” or “down”. A link whose extent—the table it points to—is the same as its owner is called a reflexive link. Reflexive links can be searched recursively via transitive functions. Doradus also allows self-reflexive links that are their own inverse. An example self-reflexive relationship is Friends (though friendship is not always reciprocal!) We’ll see how links are used in the query language next.
8. The Doradus Query Language (DQL) can be used in two contexts. An object query selects and returns objects from a specific table, called the perspective. DQL is based on and extends the Lucene full text language. This means that DQL supports many clause types including equalities and inequalities, AND/OR/NOT and parentheses, terms, phrases, wildcards, ranges, and more. To these full text concepts DQL adds the notion of link paths to traverse relationship graphs. Link paths can use functions such as quantifiers and filters to narrow the selection of objects. Reflexive relationships can use a transitive function to search a graph recursively. The first example is a simple full text query. It consists of three clauses that select objects (1) whose LastName is Smith, (2) whose FirstName does not contain a term that begins with “Jo”, and (3) whose BirthDate falls between the years 1986 and 1992 (inclusive). This example uses several features including an equality clause, a term clause, a wildcard term, a range clause, and clause negation. The second example shows how link fields are used. Link paths are one of the features that make DQL easy to use. Though deceptively simple, each “.” in an expression such as X.Y.Z essentially performs a join. This example shows how link paths can be explicitly quantified. When a link path is enclosed in the quantifier ANY, ALL, or NONE, a specific number of the objects in the quantified link path must match the clause's condition. In this example, ALL(Participants) means that every Participants value must meet the remainder of the clause. The Address link is quantified with ANY, and it is also filtered with a WHERE clause. This means that only Address objects whose Email ends with “gmail.com” are considered, and at least one of them (ANY) must meet the remainder of the clause. Without the quantifiers and filter, the link path is Participants.Address.Person.Department, and it is used in a contains clause searching for the term support. This means the Department field must contain the term “support” (case-insensitive). Textually, this query selects objects where all Participants have at least one Address whose Email ends with “gmail.com” and whose linked Person's Department contains the term support. The last example demonstrates transitive searches. Because Employees is a reflexive link, we can search it recursively through the graph, in this case “down” the org chart. The caret (^) is the transitive function and causes the preceding link, Employees, to be searched recursively. The optional value in parentheses after the caret limits the recursion to a maximum number of levels. In this case, ^(4) says “recurse the link to no more than 4 levels”. Without the limit parameter, the transitive function searches the graph until a cycle or a “leaf” object is found.
9. The other context in which DQL can be used is aggregate queries, which select objects from a perspective table and perform computations across those objects. This slide shows several example aggregate queries. These examples demonstrate the metric expression, grouping expression, and query expression components of aggregate queries. The first example performs three computations across the selected objects: (1) a COUNT of all objects, (2) the AVERAGE value of the Size field, and (3) the smallest (MIN) Birthdate found among objects linked via Participants.Address.Person. All three statistical computations are made in a single pass through the data. Since this example contains no grouping expression, a single value is returned for each of these computations. The second example demonstrates multi-level grouping, which divides objects into groups and performs the metric computations across each subgroup. In this example, a single metric function is computed: the unique (DISTINCT) Attachments.Extension values within each group. The grouping expression groups objects first by their Tags field and secondarily by the values found in the link path Participants.Address.Person.Department. Because this example has a query expression, only those objects matching the selection expression are used in the aggregate computation. The third example computes AVERAGE Size for all objects, grouped by a single-level grouping expression, Participants.Address.Email. The grouping expression is enclosed in a TOP(10) function, which means that only the groups with the top 10 metric values (average Size) are returned. Doradus supports a wide range of grouping functions to create aggregate query groups. Some of the more popular functions are listed above and summarized below:
BATCH: creates groups from explicit value ranges
FIRST/LAST: returns a limited number of groups based on their highest/lowest alphabetical group names
INCLUDE/EXCLUDE: includes or excludes specific group values within a grouping level
SETS: creates groups from arbitrary query expressions
TERMS: creates one group per term within a text field
TOP/BOTTOM: returns a limited number of groups based on their highest/lowest metric values
TRUNCATE: creates groups from a timestamp value rounded down to a specific date/time granularity
UPPER/LOWER: creates groups with case-insensitive text values
See the Doradus documentation for a full description of all aggregate query grouping functions.
10. Let's look at how data loading works with Doradus OLAP. The data loading sequence also highlights some of the criteria that applications must meet to load data effectively. As with most database applications, there may be multiple data sources, each with its own “velocity”: some data may be generated quickly, perhaps continuously, whereas other data may change infrequently. Events, for example, may be collected in a continuous stream, whereas information about People may be collected from a directory server via a daily snapshot. It is up to the application to decide how often data is collected and loaded.
  11. Doradus OLAP requires data to be loaded in batches. A single batch can mix new, modified, and deleted objects, and a batch can contain partial updates such as adding or modifying a single field. Also, a single batch can contain updates to multiple tables. The ideal batch size depends on many factors including how many fields are updated, but tests show that good batch sizes are at least a few hundred objects and sometimes many thousands of objects.
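As an illustration of the batching requirement, the sketch below shows one way a loader might accumulate source records and flush them a few thousand at a time, in line with the batch-size guidance above. The sendBatch method is a hypothetical placeholder for whatever mechanism actually delivers the batch to Doradus (the REST API or the embedded service).

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative client-side batching: accumulate objects and flush every few thousand.
    public class BatchLoader {
        private static final int BATCH_SIZE = 5_000;    // tune per application and field count
        private final List<String> pending = new ArrayList<>();

        public void addObject(String objectJson) {
            pending.add(objectJson);
            if (pending.size() >= BATCH_SIZE) {
                flush();
            }
        }

        public void flush() {
            if (pending.isEmpty()) return;
            sendBatch(new ArrayList<>(pending));
            pending.clear();
        }

        // Hypothetical placeholder: a real loader would POST the batch to Doradus
        // (or call the embedded service directly), targeting the appropriate shard.
        private void sendBatch(List<String> batch) {
            System.out.println("Sending batch of " + batch.size() + " objects");
        }
    }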
12. As each batch is loaded, it identifies the shard to which it belongs. A shard is a partition of the database and can be visualized as a cube. A shard is typically a snapshot of the database for a specific time period. The most common shard granularity is “one day”, though finer and coarser granularities will be ideal for some applications. Each shard has a name: the name is not meaningful to Doradus, but shard names are considered ordered. So, if you want to query a specific range of day shards, say March 1st through March 31st, a good idea is to name shards using the format YYYY-MM-DD. Then you can query for the shard range 2015-03-01 to 2015-03-31 and the appropriate shards are selected. When a batch is loaded, it is not immediately visible in the shard, which means its data is not returned by queries to that shard. Periodically, the shard must be merged, which causes its pending batches to be applied. You can think of a shard's queryable data as the “live cube”: merging applies the updates from pending batches, creating a new live cube, and the batches are then deleted. The frequency at which you merge shards determines the latency of the data. If you merge updated shards once per hour, queries will reflect data that is up to 1 hour old. Merging more often yields fresher data but consumes more resources.
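Because shard names are compared as strings, a date format whose lexicographic order matches chronological order is what makes day-shard range queries work. A small sketch, assuming one-day shards named YYYY-MM-DD:

    import java.time.Instant;
    import java.time.ZoneOffset;
    import java.time.format.DateTimeFormatter;

    // One-day shards: derive the shard name from an event timestamp so that string
    // order matches chronological order (e.g., 2015-03-01 .. 2015-03-31).
    public class ShardNames {
        private static final DateTimeFormatter DAY =
                DateTimeFormatter.ofPattern("yyyy-MM-dd").withZone(ZoneOffset.UTC);

        static String shardFor(Instant eventTime) {
            return DAY.format(eventTime);
        }

        public static void main(String[] args) {
            System.out.println(shardFor(Instant.parse("2015-03-17T08:30:00Z")));  // 2015-03-17
        }
    }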
  13. As each shard is updated and merged, a “cube” is added to the OLAP store, which is a Cassandra ColumnFamily. Inside, a shard consists of arrays that are designed for fast merging and fast loading during queries. This is a critical component of Doradus OLAP, so let’s take a closer look.
14. When batches are loaded, data is extracted and stored in minimally-sized arrays. For example, if an integer field is found to have a maximum value of 50,000, then a 2-byte array is used. Boolean fields create a bit array. Text values are de-duped and stored case-insensitively. Each value array is sorted in object ID order and then stored as a compressed row in Cassandra. Because each array contains homogeneous data (e.g., all integers), it compresses extremely well, often 90% or better. Rows that are too small to warrant compression are stored uncompressed. Multiple small rows are combined into a single larger row so that all of their values can be loaded with a single read. The idea is to use as little physical disk space as possible and to allow a large number of values to be loaded at once.
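As an illustration of what “minimally-sized arrays” means, the sketch below picks the narrowest fixed width that can hold every observed value of a numeric field. The width thresholds are illustrative; they are not the exact encodings Doradus uses internally.

    // Illustrative only: choose the narrowest fixed-width encoding that can hold
    // every observed value of a numeric field.
    public class ColumnWidth {
        static int bytesPerValue(long minValue, long maxValue) {
            if (minValue >= 0) {
                if (maxValue <= 0xFFL) return 1;          // fits unsigned 8-bit
                if (maxValue <= 0xFFFFL) return 2;        // fits unsigned 16-bit (e.g., max value 50,000)
                if (maxValue <= 0xFFFFFFFFL) return 4;    // fits unsigned 32-bit
            } else {
                if (minValue >= Byte.MIN_VALUE && maxValue <= Byte.MAX_VALUE) return 1;
                if (minValue >= Short.MIN_VALUE && maxValue <= Short.MAX_VALUE) return 2;
                if (minValue >= Integer.MIN_VALUE && maxValue <= Integer.MAX_VALUE) return 4;
            }
            return 8;
        }

        public static void main(String[] args) {
            System.out.println(bytesPerValue(0, 50_000));   // 2, matching the example above
        }
    }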
15. When a shard containing batch updates is merged, all updates are combined with the shard's current live data to create a new live data set. Since batch loading generates sorted arrays, the merge process consists of heap merging all arrays of the same table and field into a new array. Heap merging is very fast, so merging a shard completes quickly even when many batches are pending.
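Heap merging (a k-way merge driven by a priority queue) is a standard technique for combining already-sorted sequences in a single pass. The sketch below shows the general idea on plain int arrays; it illustrates why merging many pre-sorted batches is cheap and is not Doradus's actual merge code.

    import java.util.PriorityQueue;

    // General k-way heap merge of already-sorted arrays (illustrative, not Doradus internals).
    public class HeapMerge {
        static int[] merge(int[][] sortedArrays) {
            int total = 0;
            for (int[] a : sortedArrays) total += a.length;

            // Each heap entry is {value, arrayIndex, positionInArray}, ordered by value.
            PriorityQueue<int[]> heap =
                    new PriorityQueue<int[]>((a, b) -> Integer.compare(a[0], b[0]));
            for (int i = 0; i < sortedArrays.length; i++) {
                if (sortedArrays[i].length > 0) {
                    heap.add(new int[] { sortedArrays[i][0], i, 0 });
                }
            }

            int[] result = new int[total];
            int out = 0;
            while (!heap.isEmpty()) {
                int[] top = heap.poll();
                result[out++] = top[0];
                int next = top[2] + 1;
                if (next < sortedArrays[top[1]].length) {
                    heap.add(new int[] { sortedArrays[top[1]][next], top[1], next });
                }
            }
            return result;
        }

        public static void main(String[] args) {
            int[][] batches = { { 1, 4, 9 }, { 2, 3, 10 }, { 5, 6, 7, 8 } };
            System.out.println(java.util.Arrays.toString(merge(batches)));
            // Prints [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
        }
    }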
16. In traditional data warehouse technology, the extract-transform-load (ETL) process is typically very lengthy, sometimes many hours. Since a Doradus OLAP shard is intended to hold millions of objects, you might therefore assume that the merge process takes a long time. However, the merge process is designed to be fast, often taking only a few seconds. This graph shows the merge times measured for an event tracking application (described in detail later in this presentation). In this load, 860 shards of varying size were loaded. For each shard containing 100,000 objects or more, the merge time is plotted against the number of objects in the shard. As the graph shows, the merge time is directly proportional to the number of objects in the shard. For this application, the longest time to merge a shard was just over 90 seconds: that shard contained over 37 million objects. (Keep in mind that merge time is also affected by the number of tables and fields being merged, so your mileage may vary.) Quick merge times mean that Doradus OLAP allows data to be added and merged fairly often, allowing queries to access data that is close to real time.
17. This slide shows how OLAP arrays are accessed during queries. The query searches the Message table and counts objects whose Size field falls between 1,000 and 10,000 and whose HasBeenSent flag is false. Furthermore, the query searches shards whose name falls between 2014-03-01 and 2014-03-31. Assuming we used 1-day shards, this requires searching 31 shards. Since the query accesses two fields, OLAP must read 62 rows: the value array for each field in each shard. This might sound like a lot of reads for a single query, but remember that shards typically contain a large set of objects. If our shards averaged 1 million objects each, this query scans 31 million objects with only 62 reads! As each array is read, it is decompressed and scanned in memory. Value arrays are designed for fast scanning: modern processors can typically scan tens of millions of values per second. When a value array is scanned, an “object bit array” is generated to reflect the selected objects. Each additional value array scan turns bits on or off; the final bit array represents the result of the query. On a “cold” system, where nothing is cached, each row read requires a physical read. In practice, however, data is cached at multiple levels. The operating system caches Cassandra data files (SSTables); since the data is compressed, a large amount of data is typically cached at this level. Cassandra caches certain information such as recently-read rows, key indices, and bloom filters to speed up access to recently-accessed data. Doradus OLAP caches value arrays on a most-recently-used (MRU) basis, so recently-accessed values are not re-read from Cassandra. Finally, since query results consist of compact bit arrays, these are also cached at the clause level on an MRU basis, so repeated query clauses are reused and act as cached queries. Together, these levels produce a natural “hot”, “warm”, and “cold” caching pattern that speeds up access to the most frequently requested data.
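The object bit array idea described above can be sketched in a few lines: scan each field's value array once, set a bit for every object that satisfies that clause, and then AND the per-clause bit sets together. The code below is only a conceptual illustration of that evaluation model, using java.util.BitSet on a toy shard.

    import java.util.BitSet;

    // Conceptual sketch of per-clause object bit arrays: one bit per object in the shard.
    public class BitArrayQuery {
        public static void main(String[] args) {
            // Column-style value arrays for a toy shard of 6 objects.
            int[] size        = { 500, 2_000, 9_999, 15_000, 1_200, 800 };
            boolean[] hasSent = { false, false, true, false, false, true };

            // Clause 1: Size between 1,000 and 10,000.
            BitSet sizeMatch = new BitSet(size.length);
            for (int i = 0; i < size.length; i++) {
                if (size[i] >= 1_000 && size[i] <= 10_000) sizeMatch.set(i);
            }

            // Clause 2: HasBeenSent = false.
            BitSet notSent = new BitSet(hasSent.length);
            for (int i = 0; i < hasSent.length; i++) {
                if (!hasSent[i]) notSent.set(i);
            }

            // ANDing the clause bit sets yields the final selection for this shard.
            sizeMatch.and(notSent);
            System.out.println("Matching objects: " + sizeMatch.cardinality());  // 2
        }
    }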
18. The topic of this presentation is “1 billion objects in 2GB”. Given this audacious claim, you might think that the objects must be fairly trivial, like numbers or something, right? Actually, these objects are taken from a real-world application: a Windows event tracking application. Shown is an example source event in CSV format. Each event consists of 10 fixed fields and 0 or more variable fields, called “insertion strings”. The number of insertion strings depends on the event's ID; in this case, a 540 event has 7 insertion strings, one of which is null. The position of each insertion string is significant, so we must store each insertion string's index as well as its value.
19. For this event tracking application, we use two tables: Events stores the 10 fixed fields of each event, and InsertionStrings stores the index and value of each variable field. The Events.Params link field and its inverse InsertionStrings.Event connect each event to its insertion string objects. This allows us to find events and navigate to their insertion strings, or vice versa. For this test, we loaded just under 115 million events. Since events average around 7.7 insertion strings each, this requires the creation of 880 million insertion string objects. Each event object is connected to its insertion string objects by populating the Params/Event links.
20. The data set contains events that span 860 calendar days, and we loaded the data with one-day shards, hence we created 860 shards. As shown, the load created a total of over 994 million objects. (OK, so it's not quite 1 billion, but it's close.) On a MacBook Air, this data can be loaded in about 2 hours; a server-class system typically requires 30 minutes or less. After the load finished, Cassandra was directed to flush and compact the data. The output of the “nodetool status” command is shown, revealing that the entire database takes less than 2GB. In another test, we compared the space used by Doradus to the space used by a relational database for an auditing application. The SQL Server database required 8.5K per object whereas Doradus required only 87 bytes: a savings of almost 99%! Doradus OLAP's space savings result from columnar storage, compression, de-duping, and the lack of indexes.
21. Here are some example DQL queries. Because Doradus provides a REST API, we can just use a browser to query the database. The links below assume Doradus is running on the local machine.

Query 1: http://localhost:1123/OLAPEvents/Events/_aggregate?m=COUNT(*)&range=0

This query counts all Events objects in all 860 shards. It takes less than 10 seconds on a “cold” system and less than 1 second on a warmed-up system.

Query 2: http://localhost:1123/OLAPEvents/Events/_aggregate?m=COUNT(*)&range=2005-01-01,2005-06-30&q=Type='Failure Audit' EventID IN (577,681,529) AND Params.Index=8 AND Params.Value='(0x0,0x3E7)'&f=TOP(5,Timestamp.HOUR) AS HourOfDay

This is a typical analytical query. Suppose we're trying to see if there's a pattern in when certain privileged events fail, that is, whether they mostly fail at the same time of day. The query looks for events where (1) the EventID belongs to a specific set of privileged event numbers, (2) the event Type field denotes a failure, and (3) the result code from the failed Windows operation has a specific value. Furthermore, we are querying a six-month range, which means we will scan 181 shards. For those reading the slides, here's a typical result of this query in XML. It shows that, within the selected shards, a total of 591 events met the search criteria. Of those events, the majority (119) occurred in hour 8 (08:00). However, the next most common hours in which these failure events occur (18, 19, 9, and 5) have slightly down-trending counts (87, 72, 69, and 66), which suggests that the events do not occur at a statistically significant hour of the day.

<results>
  <aggregate metric="COUNT(*)" query="Type='Failure Audit' EventID IN (577,681,529) AND Params.Index=8 AND Params.Value='(0x0,0x3E7)'" group="TOP(5,Timestamp.HOUR) AS HourOfDay"/>
  <totalobjects>591</totalobjects>
  <summary>591</summary>
  <totalgroups>17</totalgroups>
  <groups>
    <group>
      <metric>119</metric>
      <field name="HourOfDay">8</field>
    </group>
    <group>
      <metric>87</metric>
      <field name="HourOfDay">18</field>
    </group>
    <group>
      <metric>72</metric>
      <field name="HourOfDay">19</field>
    </group>
    <group>
      <metric>69</metric>
      <field name="HourOfDay">9</field>
    </group>
    <group>
      <metric>66</metric>
      <field name="HourOfDay">5</field>
    </group>
  </groups>
</results>
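One practical note: a browser percent-encodes the spaces, quotes, and parentheses in query 2 automatically, but a program building the URL must do so itself. A small sketch of assembling that same request string in Java, with the parameter values copied from query 2 above:

    import java.net.URLEncoder;
    import java.nio.charset.StandardCharsets;

    // Builds the URL for query 2 above, percent-encoding each parameter value.
    public class BuildAggregateUrl {
        static String enc(String s) {
            return URLEncoder.encode(s, StandardCharsets.UTF_8);
        }

        public static void main(String[] args) {
            String url = "http://localhost:1123/OLAPEvents/Events/_aggregate"
                + "?m=" + enc("COUNT(*)")
                + "&range=" + enc("2005-01-01,2005-06-30")
                + "&q=" + enc("Type='Failure Audit' EventID IN (577,681,529) AND "
                            + "Params.Index=8 AND Params.Value='(0x0,0x3E7)'")
                + "&f=" + enc("TOP(5,Timestamp.HOUR) AS HourOfDay");
            System.out.println(url);
        }
    }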
  22. To summarize, the primary advantages of Doradus OLAP are listed below: The REST API is easy to use in all languages and platforms without requiring a specialized client library. All fields are searchable without using specialized indexes. The lack of indexes means Doradus OLAP is ideal for ad-hoc statistical queries. DQL extends Lucene’s full text query language with graph-based query features that are much simpler than joins. Fast data loading and shard merging allows an OLAP database to provide near real time data warehousing. Because data is stored very compactly, less disk space is required, saving up to 99% compared to other databases. Combined with fast in-memory query processing, this means less hardware is required compared to other big data approaches. A single node will be sufficient for many analytics applications. But when necessary, the database can be scaled horizontally using Cassandra’s replication and sharding features.
23. Is Doradus OLAP a good fit for your application? This slide summarizes the criteria that indicate a good fit:
Data is received as a continuous stream: events, log records, transactions, etc.
Data is structured (all fields are predefined) or semi-structured (variability can be accommodated, as in the Events application in this presentation).
Data can be accumulated and loaded in batches.
Data can be partitioned into shards. Time-series data works best, but strictly speaking other partitioning criteria can work.
Data is typically queried within a subset of shards. For time-sharded data, this means queries select specific time ranges.
The application primarily uses statistical queries. Full text queries are supported, but statistical queries are Doradus OLAP's forte.
24. The Dell Software Group uses many open source software projects, and we're pleased that we can contribute back to the open source community. Doradus is open source under the Apache License 2.0, so everyone is free to download, modify, and use the code. Full source code and documentation are available from GitHub. Binary, source code, and Javadoc bundles are available from Maven Central. Comments and suggestions are more than welcome, so feel free to contact me. Thank you! Randy Guck