One Billion Objects in 2GB:
Big Data Analytics on Small
Clusters with Doradus OLAP
Randy Guck
Principal Engineer
Dell Software Group
What is Doradus?
• Storage and query service
• Leverages Cassandra NoSQL DB
• Pure Java
- Stateless
- Embeddable or standalone
• Open source: Apache 2.0 License
30 Doradus: The Tarantula Nebula
Source: Hubble Space Telescope
Why Use Doradus?
• Easy to use: no client driver (see the REST sketch below)
• Spider storage manager
- Good for unstructured data
• OLAP storage manager
- Near real time data warehousing
• Compared to Cassandra alone:
- Data model, searching, analytics
• Compared to Hadoop:
- Fast data loads and queries
- Dense storage: less hardware
[Architecture diagram: Applications → REST API → Doradus (OLAP and Spider storage managers) → Cassandra data]
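Because all access is over plain HTTP, no Doradus-specific client driver is needed; any HTTP library works. Below is a minimal sketch in Java, assuming a local Doradus instance on port 1123 and an application-listing path of /_applications (both are assumptions for illustration; check doradus.yaml and the REST reference for your build).

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class DoradusRestExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical endpoint: list the applications defined in this Doradus instance.
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:1123/_applications"))
            .header("Accept", "application/json")   // Doradus REST messages are JSON or XML
            .GET()
            .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}

The same pattern, build a URI, set Accept to JSON or XML, read the body, applies to every Doradus REST call.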
A Multi-Node Cluster
[Diagram: Applications use the REST API to reach Doradus; Nodes 1, 2, and 3 each run Cassandra with its data; secondary Doradus instances are optional]
Why Did We Build Doradus OLAP?
• Some tough customer requirements:
- Statistical queries most important
- Need to scan millions of objects/second
- User-customizable “insights” = millions of possible queries
• Couldn’t use indexes, pre-computed queries, etc.
• Disk physics (rough arithmetic below)
- ~hundreds of random reads/second
- ~thousands of serial reads/second
• Needed a radically new approach!
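A rough, illustrative calculation using the ballpark rates above (not a measurement) shows why index-driven random I/O could not meet the "millions of objects/second" requirement:

  1,000,000 objects / ~200 random reads per second ≈ 5,000 seconds (about 83 minutes)
  1,000,000 objects / ~2,000 serial reads per second ≈ 500 seconds (about 8 minutes)

Either way the gap is orders of magnitude, which is why Doradus OLAP scans compressed value arrays in memory instead (see the query execution slide later).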
Doradus OLAP
• Combines ideas from:
- Online Analytical Processing: data arranged in static cubes
- Columnar databases: column-oriented storage and compression
- NoSQL databases: sharding
• Features:
- Fast loading: up to 500K objects/second/node
- Dense storage: 1 billion objects in 2 GB!
- Fast cube merging: typically seconds
- No indexes!
Example: Message Tracking Schema
[Schema diagram: five tables (Message, Participant, Address, Person, Attachment) connected by bidirectional links: Message.Participants ↔ Participant.Message, Participant.Address ↔ Address.Participants, Address.Person ↔ Person.Address, Message.Attachments ↔ Attachment.Message, plus the reflexive Person.Manager ↔ Person.Employees]
DQL Object Queries
• Builds on Lucene syntax
- Full text queries
• Adds link paths
- Directed graph searches
- Quantifiers and filters
- Transitive searches
• Other features
- Stateless paging
- Sorting
• Examples:
- LastName = Smith AND NOT (FirstName : Jo*) AND BirthDate = [1986 TO 1992]
- ALL(Participants).ANY(Address.WHERE(Email='*.gmail.com')).Person.Department : support
- Employees^(4).Office='San Jose'
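For illustration, here is how the first example above might be issued through the REST API from plain Java. The application name (Email), table (Person), port, and the /{application}/{table}/_query?q=... path are assumptions for this sketch; consult the DQL/REST documentation for the exact URI form.

import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class DqlObjectQueryExample {
    public static void main(String[] args) throws Exception {
        // First example query from the slide, URL-encoded into the q parameter.
        String dql = "LastName = Smith AND NOT (FirstName : Jo*) AND BirthDate = [1986 TO 1992]";
        String uri = "http://localhost:1123/Email/Person/_query?q="   // hypothetical app/table/path
                   + URLEncoder.encode(dql, StandardCharsets.UTF_8);
        HttpResponse<String> response = HttpClient.newHttpClient().send(
            HttpRequest.newBuilder().uri(URI.create(uri)).header("Accept", "application/json").GET().build(),
            HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());   // matching objects as JSON
    }
}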
DQL Aggregate Queries
• Metric functions
- COUNT, AVERAGE, MIN, MAX, DISTINCT, ...
• Multi-level grouping
• Grouping functions
- BATCH, BOTTOM, FIRST, LAST, LOWER, SETS, TERMS, TOP, TRUNCATE, UPPER, WHERE, ...
• Examples:
- metric=COUNT(*), AVERAGE(Size), MIN(Participants.Address.Person.Birthdate)
- metric=DISTINCT(Attachments.Extension); groups=Tags, Participants.Address.Person.Department; query=Attachments.Size > 100000
- metric=AVERAGE(Size); groups=TOP(10,Participants.Address.Email)
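As a sketch of how the aggregate parameters might be assembled on the client side, the snippet below builds the URI for the third example. The _aggregate path and port are assumptions; the metric/groups/query parameter names follow the examples above.

public class DqlAggregateQueryExample {
    public static void main(String[] args) {
        // Third example from the slide: average Size for the top 10 participant email addresses.
        String metric = "AVERAGE(Size)";
        String groups = "TOP(10,Participants.Address.Email)";
        // Hypothetical URI shape; issue it with any HTTP client, as in the earlier example.
        String uri = "http://localhost:1123/Email/Message/_aggregate"
                   + "?metric=" + java.net.URLEncoder.encode(metric, java.nio.charset.StandardCharsets.UTF_8)
                   + "&groups=" + java.net.URLEncoder.encode(groups, java.nio.charset.StandardCharsets.UTF_8);
        System.out.println(uri);
    }
}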
OLAP Data Loading
[Diagram, built up over four slides: source streams (Events, People, Computers, Domains) feed batches (Batch 1, Batch 2, Batch 3, Batch 4, ...); batches are merged into date-named shards (2014-03-01, 2014-02-28, 2014-02-27, ...); the merged shards make up the OLAP store]
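A minimal sketch of the batching pattern on the application side, assuming one loader per daily shard. The loadBatch method is a hypothetical stand-in for the actual Doradus load call (REST or embedded), not a Doradus API.

import java.util.ArrayList;
import java.util.List;

public class BatchLoader {
    private static final int BATCH_SIZE = 10_000;
    private final String shard;                       // e.g. "2014-03-01": one loader per daily shard
    private final List<String> pending = new ArrayList<>();

    public BatchLoader(String shard) { this.shard = shard; }

    // Add one source record; cut a batch whenever BATCH_SIZE records have accumulated.
    public void add(String record) {
        pending.add(record);
        if (pending.size() >= BATCH_SIZE) flush();
    }

    public void flush() {
        if (pending.isEmpty()) return;
        loadBatch(shard, new ArrayList<>(pending));   // batches for the same shard are merged later
        pending.clear();
    }

    // Placeholder: wire this to the Doradus OLAP load call for your deployment.
    private void loadBatch(String shard, List<String> batch) {
        System.out.printf("loading %d objects into shard %s%n", batch.size(), shard);
    }
}

Because merging a shard typically takes seconds, newly loaded batches become queryable shortly after they are merged, which is what makes the near real time data warehousing claim work.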
Storing Batches
Field value arrays (one array per field, in object-ID order):

Field        Values
ID           5amhvv7J2otBu48Z6PE5cA   7CgvDf5mOU78jNVc58eu   cZpz2q4Jf8Rc2HK9Cg08   ...
Size         48120                    5435                   24220                  ...
SendDate     1280246462000            1279354872112          1279357261413          ...
Priority     0                        0                      1                      ...
Subject.txt  ballades encash nautch colloquy geared nettlier outdoors culvert hypothec winder stolons ungot guiding rupiahs outgone ...
Subject      1                        2                      0                      ...
...

Data is sorted by object ID and stored as columnar, compressed blobs:

Key                                              Columns
Email/Message/2014-03-01/{Batch GUID}/ID         [compressed data]
Email/Message/2014-03-01/{Batch GUID}/Size       [compressed data]
Email/Message/2014-03-01/{Batch GUID}/SendDate   [compressed data]
...                                              ...

(OLAP table → field value arrays → compressed rows)
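To make the columnar-blob idea concrete, here is a minimal, non-Doradus sketch: it sorts a small batch by object ID, builds one value array per field, and compresses each array independently with java.util.zip. The field names and sample values come from the table above; the real store uses binary encodings rather than newline-joined strings.

import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Comparator;
import java.util.zip.DeflaterOutputStream;

public class ColumnarBatchSketch {
    record Message(String id, long size, long sendDate) {}

    public static void main(String[] args) throws Exception {
        Message[] batch = {
            new Message("cZpz2q4Jf8Rc2HK9Cg08", 24220, 1279357261413L),
            new Message("5amhvv7J2otBu48Z6PE5cA", 48120, 1280246462000L),
            new Message("7CgvDf5mOU78jNVc58eu", 5435, 1279354872112L),
        };
        // 1. Sort the batch by object ID so every column is stored in the same order.
        Arrays.sort(batch, Comparator.comparing(Message::id));

        // 2. Build one value array per field (columnar layout).
        StringBuilder idColumn = new StringBuilder();
        StringBuilder sizeColumn = new StringBuilder();
        StringBuilder dateColumn = new StringBuilder();
        for (Message m : batch) {
            idColumn.append(m.id()).append('\n');
            sizeColumn.append(m.size()).append('\n');
            dateColumn.append(m.sendDate()).append('\n');
        }

        // 3. Compress each column independently, one blob per {shard}/{field} row.
        System.out.println("ID column blob:       " + compress(idColumn.toString()).length + " bytes");
        System.out.println("Size column blob:     " + compress(sizeColumn.toString()).length + " bytes");
        System.out.println("SendDate column blob: " + compress(dateColumn.toString()).length + " bytes");
    }

    static byte[] compress(String column) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DeflaterOutputStream out = new DeflaterOutputStream(bytes)) {
            out.write(column.getBytes(StandardCharsets.UTF_8));
        }
        return bytes.toByteArray();
    }
}

Sorted, same-typed columns compress far better than interleaved row data, which is where much of the density comes from.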
Merging Batches
Batches for shard 2014-03-01 (Batch #1, Batch #2, ...), each holding Message (ID, Size, SendDate, ...), Person (ID, FirstName, LastName, ...), and Address (ID, Person, Messages, ...) table data, are merged into one set of rows per shard in the OLAP store:

Key                                    Columns
Email/Message/2014-03-01/ID            [compressed data]     (Message table data, shard 2014-03-01)
Email/Message/2014-03-01/Size          [compressed data]
Email/Message/2014-03-01/SendDate      [compressed data]
...                                    ...
Email/Person/2014-03-01/ID             [compressed data]     (Person table data, shard 2014-03-01)
Email/Person/2014-03-01/FirstName      [compressed data]
Email/Person/2014-03-01/LastName       [compressed data]
...                                    ...
Email/Address/2014-03-01/ID            [compressed data]     (Address table data, shard 2014-03-01)
Email/Address/2014-03-01/Person        [compressed data]
Email/Address/2014-03-01/Message       [compressed data]
...                                    ...
Email/Message/2014-02-28/ID            [compressed data]     (Data for other shards)
Email/Message/2014-02-28/Size          [compressed data]
...
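Conceptually, merging is a merge of already-sorted columns, which is why it is fast and needs no indexes. A simplified sketch, assuming unique object IDs and batches that are already sorted by ID (real merging also handles updates, deletes, links, and the remaining fields):

import java.util.ArrayList;
import java.util.List;

public class MergeBatchesSketch {
    /** Merge two ID columns, each already sorted, into one sorted shard-level column. */
    static List<String> mergeIdColumns(List<String> batch1, List<String> batch2) {
        List<String> merged = new ArrayList<>(batch1.size() + batch2.size());
        int i = 0, j = 0;
        while (i < batch1.size() && j < batch2.size()) {
            // Take the smaller ID next; a tie would mean the same object was loaded twice,
            // in which case the later batch's values win (not shown here).
            if (batch1.get(i).compareTo(batch2.get(j)) <= 0) merged.add(batch1.get(i++));
            else merged.add(batch2.get(j++));
        }
        while (i < batch1.size()) merged.add(batch1.get(i++));
        while (j < batch2.size()) merged.add(batch2.get(j++));
        return merged;
    }

    public static void main(String[] args) {
        List<String> b1 = List.of("5amhvv7J2otBu48Z6PE5cA", "cZpz2q4Jf8Rc2HK9Cg08");
        List<String> b2 = List.of("7CgvDf5mOU78jNVc58eu");
        System.out.println(mergeIdColumns(b1, b2));   // one merged, sorted ID column per shard
    }
}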
Does Merging Take Long?
OLAP Query Execution
• Example query:
- Count messages with Size between 1000-10000 and HasBeenSent=false in shards 2014-03-01 to 2014-03-31
• How many rows are read?
- 2 fields x 31 shards = 62 rows
- Typically represents millions of values
• Value arrays are scanned in memory
• Physical rows are read on “cold” start only
- Multiple caching levels for “warm” and “hot” data
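A simplified model of that scan, assuming the shard's Size and HasBeenSent arrays have already been loaded (and decompressed) into parallel in-memory arrays; actual Doradus value arrays are compressed and typed, but the per-shard work has the same shape.

public class ShardScanSketch {
    /**
     * Count messages with 1000 <= size <= 10000 and hasBeenSent == false in one shard.
     * The two arrays are parallel: element i describes the i-th message in the shard.
     */
    static long countMatches(int[] sizes, boolean[] hasBeenSent) {
        long count = 0;
        for (int i = 0; i < sizes.length; i++) {
            if (sizes[i] >= 1000 && sizes[i] <= 10000 && !hasBeenSent[i]) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // One call per shard: 31 shards (2014-03-01 .. 2014-03-31) => 62 value arrays scanned.
        int[] sizes = {48120, 5435, 24220};
        boolean[] sent = {true, false, true};
        System.out.println(countMatches(sizes, sent));   // prints 1
    }
}

Run once per shard in the range, this touches exactly the 62 physical rows counted above.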
1 Billion Objects in 2GB?
Example Security Event (CSV format):

Fixed Fields                                      Variable Fields
Computer Name   MAILSERVER18                      1   MAILSERVER18$
Log Name        Security                          2
Time Stamp      Sun, 22 Jan 2013 08:09:50 UTC     3   Workstation
Type            Success Audit                     4   (0x0,0x142999A)
Source          Security                          5   3
Category        Logon/Logoff                      6   Kerberos
Event ID        540                               7   Kerberos
User Domain     NT AUTHORITY
User Name       SYSTEM
User SID        S-1-5-18

MAILSERVER18,Security,"Sun, 22 Jan 2013 08:09:50 UTC","Success Audit",Security,
"Logon/Logoff", 540,"NT AUTHORITY",SYSTEM,S-1-5-18,7,MAILSERVER18$,,Workstation,
"(0x0,0x142999A)",3,Kerberos,Kerberos
Events Schema

Events table (count: 115 million)
Fields:
• ComputerName (text)
• LogName (text)
• Timestamp (timestamp)
• Type (text)
• Source (text)
• Category (text)
• EventID (integer)
• UserDomain (text)
• UserSID (text)
• Params (link)

Insertion Strings table (count: 880 million)
Fields:
• Index (integer)
• Value (text)
• Event (link, inverse of Params)
Event Schema Load
• Load stats:
Total shards: 860
Total events: 114,572,247
Total ins strings: 879,529,753
Total objects: 994,102,000
Total load time: 2 hours, 2 minutes, 36 seconds (MacBook Air)
• Space usage:
nodetool -h localhost status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Owns Host ID Token Rack
UN 127.0.0.1 1.96 GB 100.0% 860887ef-2027-431a-a425-c67a9445d0e6 -9176223118562734495 rack1
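For perspective, the implied density (simple arithmetic from the figures above): 1.96 GB is roughly 2.1 billion bytes spread over 994,102,000 objects, or about 2 bytes per object on disk after columnar compression and merging.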
Demo
• 1) Count all Events in all shards
- 860 shards => 115M events
• 2) Find the top 5 hours-of-the-day when certain privileged events fail:
- Event IDs are any of 577, 681, 529
- Event type is ‘Failure Audit’
- Insertion string 8 is (0x0,0x3E7)
- Event occurred in first half of 2005 (181 shards)
Doradus OLAP Summary
• Advantages:
- Simple REST API
- All fields are searchable without indexes
- Ad-hoc statistical searches
- Support for graph-based queries
- Near real time data warehousing
- Dense storage = less hardware
- Horizontally scalable when needed
Doradus OLAP Summary
• Good for applications where data:
- Is continuous/streaming
- Is structured to semi-structured
- Can be loaded in batches
- Is partitionable, especially by time
- Is typically queried in a subset of shards
- Emphasizes statistical queries
Thank You!
• Where to find Doradus
- Source: github.com/dell-oss/Doradus
- Downloads: search.maven.org
• Contact me
- Randy.Guck@software.dell.com
- @randyguck

Editor's Notes

1. There are many good software modules available today that provide big data analytics using distributed clusters. Some applications need fast aggregate queries on large data sets but cannot justify the cost or complexity of a large database cluster. If an application’s requirements meet certain constraints, such as the ability to partition its data into time-based shards, it can benefit from Doradus OLAP and probably get all of its data on a single node.
Doradus is an open source storage and query engine that runs on top of Cassandra. It offers a high-level data model, full text searching, and aggregate queries such as AVERAGE() with multi-level grouping. Doradus offers two storage services that use specialized storage and query execution techniques for different application domains. The Doradus OLAP service, described in this session, borrows techniques from Online Analytical Processing, columnar storage, compression, and sharding to yield extremely compact databases. Though many applications may be able to use a single node, the underlying Cassandra NoSQL database allows data to be distributed and replicated over a cluster when needed.
Some of the features that will be described are:
- Doradus OLAP uses as little as 1% of the disk space used by other systems.
- OLAP cubes, called shards, can be updated and rebuilt very quickly, typically in a few seconds.
- The REST API supports JSON and XML for maximum accessibility.
- A simple but flexible data model supports multi-valued scalar fields and bi-directional links, which provide inter-object relationships with full referential integrity.
- All data fields are searchable using a Lucene-compatible query language that supports terms, phrases, wildcards, ranges, inequalities, etc.
- The query language uses link paths, which are much simpler than joins. They offer rich query features such as quantifiers, path filters, transitive searches, and more.
- Aggregate queries perform fast, complex metric computations with multi-level grouping, without indexes!
- Though many databases will fit on a single node, the Cassandra persistence layer allows the database to be distributed on a multi-node cluster for scalability, replication, and failover.
A case study using time-stamped event data is used to demonstrate features and show how one billion objects require only 2GB of disk space.
2. From a high-level view, Doradus is a storage and query service that runs on top of the Cassandra NoSQL database. Like Cassandra, Doradus is a pure Java application. Doradus can be embedded within another application or run as a standalone process. Doradus is an open source project released under the Apache License Version 2.0. Source code and other links are provided later in this presentation.
Where did the name Doradus come from? “30 Doradus” is another name for the Tarantula Nebula, which is a relatively new star cluster (as star clusters go). It is the birthplace of many new stars.
  3. Why might you want to use Doradus? First, it’s easy to use because it supports a REST API that can be accessed from any platform and language without a specialized client library. Also, Doradus offers two storage managers that are ideal for different application types. Each application schema chooses the storage manager ideal for its data. The data model and query language used by each storage manager are nearly identical. The Doradus Spider storage manager is good for unstructured data and applications that emphasize immediate indexing and full text queries. It is analogous to Lucene-based search engines such as Solr or ElasticSearch, though the Doradus data model and query language have extra benefits such as graph queries. Details of Spider are beyond the scope of this presentation. The Doradus OLAP storage manager is relatively new, and it is the focus of this presentation. OLAP provides unique features that allow it to act like a data warehouse, except that data can be loaded more quickly and uses far less disk space than other databases. The time between loading data and being able to query it is sufficiently short that we like to say that OLAP allows “near real time data warehousing”. Compared to using Cassandra by itself, Doradus offers a higher-level data model, rich search features, and analytics through statistical queries. Compared to Hadoop, Doradus allows fast loads and queries while using much less storage, thereby requiring less hardware.
  4. Doradus can be embedded in the same JVM as another application, which then calls Doradus service methods directly. This is ideal for bulk load applications that need maximum performance because the overhead of the REST API is avoided. When Doradus executes as a standalone process, applications communicate using the REST API passing JSON or XML messages. A minimum deployment is a single Doradus process accessing a one-node Cassandra database. When needed, Cassandra can be deployed on a multi-node cluster to provide replication, failover, and horizontal scaling. Cassandra also supports multi-data center deployment, which allows data to be replicated to multiple geographic regions. Technically, only a single Doradus instance is needed, operating anywhere on the network. Doradus will automatically rotate requests through Cassandra instances and retry/failover requests if a Cassandra node fails. However, Doradus is stateless, so multiple instances can execute against the same Cassandra cluster. One deployment model is to execute a Doradus instance on each Cassandra node. This makes all nodes equal and provides applications with failover options should any Doradus/Cassandra node fail. Doradus uses the same masterless/peer model as Cassandra, although Doradus instances do not directly communicate with each other.
  5. Given all the great NoSQL and big data software available today, much of it open source, why did we build Doradus OLAP? Originally, we built Doradus Spider believing that full text document searching was the primary use case. However, our primary customer team decided that statistical queries were more important. They wanted complex computational queries with multi-level grouping and lots of filtering features. Because these queries are generated from user-customizable insights, there are infinite ways in which data can be dissected. We therefore couldn’t use indexes or tricks such as pre-computing queries in the background. Given our customer’s response time requirements, we needed to scan millions of objects per second. However, we ran into disk physics that we couldn’t do anything about. Most disks will perform ~hundreds of random reads and ~thousands of serial reads per second, but no more. Furthermore, the entire database wouldn’t fit in memory, so a memory database wasn’t feasible either. We needed a radically new approach.
6. We decided to build a new storage manager, which we call Doradus OLAP. Its design borrows ideas from several disciplines:
- Online Analytical Processing: OLAP is widely used in data warehouses, where data is transformed and stored in n-dimensional cubes.
- Columnar databases: These databases store column values instead of row values in the same physical record.
- NoSQL databases: These databases commonly use sharding to co-locate data, especially time-series data.
Doradus OLAP combines ideas from these areas to provide a new storage service that is suitable for certain kinds of applications. Compared to the Spider storage manager, OLAP imposes some restrictions. But for applications that can use it, OLAP offers many advantages including:
- Fast data loading: We’ve observed loads up to 500,000 objects/second on a single node in bulk-load scenarios.
- Dense storage: Data is stored in a way that is highly compressible.
- Fast merging: One of the typical drawbacks of data warehouses is the time required to update the database with new data. With Doradus OLAP, data is stored in cubes that can be updated quickly, often in a few seconds.
To accomplish this, Doradus OLAP uses no indexes. It is also not a memory database, though it uses in-memory processing for its queries.
  7. Here is an example Doradus message tracking application schema, consisting of five tables. Depicted are the link fields that form relationships. Every link field has an inverse that defines the same relationship from the opposite direction. For example, the Message table defines a link called Participants, which points to the Participant table. The inverse link is called Message, which points back to the Message table. Doradus maintains inverse links automatically: if Message.Participants is updated, the inverse Participant.Message link values are automatically added or deleted as needed. Note that the Person table defines a link called Manager, which points to the same table, Person; Employees is the inverse link. The Manager/Employees relationship forms an org chart graph that can be searched “up” or “down”. A link whose extent—the table it points to—is the same as its owner is called a reflexive link. Reflexive links can be searched recursively via transitive functions. Doradus also allows self-reflexive links that are their own inverse. An example self-reflexive relationship is Friends (though friendship is not always reciprocal!) We’ll see how links are used in the query language next.
8. The Doradus Query Language (DQL) can be used in two contexts. An object query selects and returns objects from a specific table, called the perspective. DQL is based on and extends the Lucene full text language. This means that DQL supports many clause types including equalities and inequalities, AND/OR/NOT and parentheses, terms, phrases, wildcards, ranges, and more. To these full text concepts DQL adds the notion of link paths to traverse relationship graphs. Link paths can use functions such as quantifiers and filters to narrow the selection of objects. Reflexive relationships can use a transitive function to search a graph recursively. The first example is a simple full text query. It consists of three clauses that select objects (1) whose LastName is Smith, (2) whose FirstName does not contain a term that begins with “Jo”, and (3) whose BirthDate falls between the years 1986 and 1992 (inclusive). This example uses several features including an equality clause, a term clause, a wildcard term, a range clause, and clause negation. The second example shows how link fields are used. Link paths are one of the features that make DQL easy to use. Though deceptively simple, each “.” in an expression such as X.Y.Z essentially performs a join. This example shows how link paths can be explicitly quantified. When a link path is enclosed in the quantifier ANY, ALL, or NONE, a specific number of the objects in the quantified link path must match the clause's condition. In this example, ALL(Participants) means that every Participants value must meet the remainder of the clause. The Address link is quantified with ANY, and it is also filtered with a WHERE clause. This means that only Address objects whose Email ends with “gmail.com” are considered, and at least one of them (ANY) must meet the remainder of the clause. Without the quantifiers and filter, the link path is Participants.Address.Person.Department, and it is used in a contains clause searching for the term support. This means the Department field must contain the term “support” (case-insensitive). Textually, this query selects objects where all Participants have at least one Address whose Email ends with “gmail.com” and whose linked Person's Department contains the term support. The last example demonstrates transitive searches. Because Employees is a reflexive link, we can search it recursively through the graph, in this case “down” the org chart. The caret (^) is the transitive function and causes the preceding link, Employees, to be searched recursively. The optional value in parentheses after the caret limits the recursion to a maximum number of levels. In this case, ^(4) says “recurse the link to no more than 4 levels”. Without the limit parameter, the transitive function searches the graph until a cycle or a “leaf” object is found.
9. The other context in which DQL can be used is aggregate queries, which select objects from a perspective table and perform computations across those objects. This slide shows several example aggregate queries. These examples demonstrate the metric expression, grouping expression, and query expression components of aggregate queries. The first example performs three computations across the selected objects: (1) a COUNT of all objects, (2) the AVERAGE value of the Size field, and (3) the smallest (MIN) Birthdate found among objects linked via Participants.Address.Person. All three statistical computations are made in a single pass through the data. Since this example contains no grouping expression, a single value is returned for each of these computations. The second example demonstrates multi-level grouping, which divides objects into groups and performs the metric computations across each subgroup. In this example, a single metric function is computed: the unique (DISTINCT) Attachments.Extension values within each group. The grouping expression groups objects first by their Tags field and secondarily by the values found in the link path Participants.Address.Person.Department. Because this example has a query expression, only those objects matching the selection expression are used in the aggregate computation. The third example computes AVERAGE Size for all objects, grouped by a single-level grouping expression, Participants.Address.Email. The grouping expression is enclosed in a TOP(10) function, which means that only the groups with the top 10 metric values (average Size) are returned. Doradus supports a wide range of grouping functions to create aggregate query groups. Some of the more popular functions are listed above and summarized below:
BATCH: creates groups from explicit value ranges
FIRST/LAST: returns a limited number of groups based on their highest/lowest alphabetical group names
INCLUDE/EXCLUDE: includes or excludes specific group values within a grouping level
SETS: creates groups from arbitrary query expressions
TERMS: creates one group per term within a text field
TOP/BOTTOM: returns a limited number of groups based on their highest/lowest metric values
TRUNCATE: creates groups from a timestamp value rounded down to a specific date/time granularity
UPPER/LOWER: creates groups with case-insensitive text values
See the Doradus documentation for a full description of all aggregate query grouping functions.
10. Let's look at how data loading works with Doradus OLAP. The data loading sequence also highlights some of the criteria that applications must meet to load data effectively. As with most database applications, there may be multiple data sources, each with its own “velocity”: some data may be generated quickly, perhaps continuously, whereas other data may change infrequently. Events, for example, may be collected in a continuous stream, whereas information about People may be collected from a directory server via a daily snapshot. It is up to the application to decide how often data is collected and loaded.
  11. Doradus OLAP requires data to be loaded in batches. A single batch can mix new, modified, and deleted objects, and a batch can contain partial updates such as adding or modifying a single field. Also, a single batch can contain updates to multiple tables. The ideal batch size depends on many factors including how many fields are updated, but tests show that good batch sizes are at least a few hundred objects and sometimes many thousands of objects.
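As an illustration of the batching requirement, the sketch below shows one way a loader might accumulate source records and flush them a few thousand at a time, in line with the batch-size guidance above. The sendBatch method is a hypothetical placeholder for whatever mechanism actually delivers the batch to Doradus (the REST API or the embedded service).

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative client-side batching: accumulate objects and flush every few thousand.
    public class BatchLoader {
        private static final int BATCH_SIZE = 5_000;    // tune per application and field count
        private final List<String> pending = new ArrayList<>();

        public void addObject(String objectJson) {
            pending.add(objectJson);
            if (pending.size() >= BATCH_SIZE) {
                flush();
            }
        }

        public void flush() {
            if (pending.isEmpty()) return;
            sendBatch(new ArrayList<>(pending));
            pending.clear();
        }

        // Hypothetical placeholder: a real loader would POST the batch to Doradus
        // (or call the embedded service directly), targeting the appropriate shard.
        private void sendBatch(List<String> batch) {
            System.out.println("Sending batch of " + batch.size() + " objects");
        }
    }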
12. As each batch is loaded, it identifies the shard to which it belongs. A shard is a partition of the database and can be visualized as a cube. A shard is typically a snapshot of the database for a specific time period. The most common shard granularity is “one day”, though finer and coarser granularities will be ideal for some applications. Each shard has a name: the name is not meaningful to Doradus, but shard names are considered ordered. So, if you want to query a specific range of day shards, say March 1st through March 31st, a good idea is to name shards using the format YYYY-MM-DD. Then you can query for the shard range 2015-03-01 to 2015-03-31 and the appropriate shards are selected. When a batch is loaded, it is not immediately visible in the shard, which means its data is not returned by queries to that shard. Periodically, the shard must be merged, which causes its pending batches to be applied. You can think of a shard's queryable data as the “live cube”: merging applies the updates from pending batches, creating a new live cube, and the batches are then deleted. The frequency at which you merge shards determines the latency of the data. If you merge updated shards once per hour, queries will reflect data that is up to 1 hour old. Merging more often yields fresher data but consumes more resources.
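Because shard names are compared as strings, a date format whose lexicographic order matches chronological order is what makes day-shard range queries work. A small sketch, assuming one-day shards named YYYY-MM-DD:

    import java.time.Instant;
    import java.time.ZoneOffset;
    import java.time.format.DateTimeFormatter;

    // One-day shards: derive the shard name from an event timestamp so that string
    // order matches chronological order (e.g., 2015-03-01 .. 2015-03-31).
    public class ShardNames {
        private static final DateTimeFormatter DAY =
                DateTimeFormatter.ofPattern("yyyy-MM-dd").withZone(ZoneOffset.UTC);

        static String shardFor(Instant eventTime) {
            return DAY.format(eventTime);
        }

        public static void main(String[] args) {
            System.out.println(shardFor(Instant.parse("2015-03-17T08:30:00Z")));  // 2015-03-17
        }
    }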
  13. As each shard is updated and merged, a “cube” is added to the OLAP store, which is a Cassandra ColumnFamily. Inside, a shard consists of arrays that are designed for fast merging and fast loading during queries. This is a critical component of Doradus OLAP, so let’s take a closer look.
14. When batches are loaded, data is extracted and stored in minimally-sized arrays. For example, if an integer field is found to have a maximum value of 50,000, then a 2-byte array is used. Boolean fields create a bit array. Text values are de-duped and stored case-insensitively. Each value array is sorted in object ID order and then stored as a compressed row in Cassandra. Because each array contains homogeneous data (e.g., all integers), it compresses extremely well, often 90% or better. Rows that are too small to warrant compression are stored uncompressed. Multiple small rows are combined into a single larger row so that all of their values can be loaded with a single read. The idea is to use as little physical disk space as possible and to allow a large number of values to be loaded at once.
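As an illustration of what “minimally-sized arrays” means, the sketch below picks the narrowest fixed width that can hold every observed value of a numeric field. The width thresholds are illustrative; they are not the exact encodings Doradus uses internally.

    // Illustrative only: choose the narrowest fixed-width encoding that can hold
    // every observed value of a numeric field.
    public class ColumnWidth {
        static int bytesPerValue(long minValue, long maxValue) {
            if (minValue >= 0) {
                if (maxValue <= 0xFFL) return 1;          // fits unsigned 8-bit
                if (maxValue <= 0xFFFFL) return 2;        // fits unsigned 16-bit (e.g., max value 50,000)
                if (maxValue <= 0xFFFFFFFFL) return 4;    // fits unsigned 32-bit
            } else {
                if (minValue >= Byte.MIN_VALUE && maxValue <= Byte.MAX_VALUE) return 1;
                if (minValue >= Short.MIN_VALUE && maxValue <= Short.MAX_VALUE) return 2;
                if (minValue >= Integer.MIN_VALUE && maxValue <= Integer.MAX_VALUE) return 4;
            }
            return 8;
        }

        public static void main(String[] args) {
            System.out.println(bytesPerValue(0, 50_000));   // 2, matching the example above
        }
    }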
15. When a shard containing batch updates is merged, all updates are combined with the shard's current live data to create a new live data set. Since batch loading generates sorted arrays, the merge process consists of heap merging all arrays of the same table and field into a new array. Heap merging is very fast, so merging a shard completes quickly even when many batches are pending.
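Heap merging (a k-way merge driven by a priority queue) is a standard technique for combining already-sorted sequences in a single pass. The sketch below shows the general idea on plain int arrays; it illustrates why merging many pre-sorted batches is cheap and is not Doradus's actual merge code.

    import java.util.PriorityQueue;

    // General k-way heap merge of already-sorted arrays (illustrative, not Doradus internals).
    public class HeapMerge {
        static int[] merge(int[][] sortedArrays) {
            int total = 0;
            for (int[] a : sortedArrays) total += a.length;

            // Each heap entry is {value, arrayIndex, positionInArray}, ordered by value.
            PriorityQueue<int[]> heap =
                    new PriorityQueue<int[]>((a, b) -> Integer.compare(a[0], b[0]));
            for (int i = 0; i < sortedArrays.length; i++) {
                if (sortedArrays[i].length > 0) {
                    heap.add(new int[] { sortedArrays[i][0], i, 0 });
                }
            }

            int[] result = new int[total];
            int out = 0;
            while (!heap.isEmpty()) {
                int[] top = heap.poll();
                result[out++] = top[0];
                int next = top[2] + 1;
                if (next < sortedArrays[top[1]].length) {
                    heap.add(new int[] { sortedArrays[top[1]][next], top[1], next });
                }
            }
            return result;
        }

        public static void main(String[] args) {
            int[][] batches = { { 1, 4, 9 }, { 2, 3, 10 }, { 5, 6, 7, 8 } };
            System.out.println(java.util.Arrays.toString(merge(batches)));
            // Prints [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
        }
    }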
16. In traditional data warehouse technology, the extract-transform-load (ETL) process is typically very lengthy, sometimes many hours. Since a Doradus OLAP shard is intended to hold millions of objects, you might therefore assume that the merge process takes a long time. However, the merge process is designed to be fast, often taking only a few seconds. This graph shows the merge times measured for an event tracking application (described in detail later in this presentation). In this load, 860 shards of varying size were loaded. For each shard containing 100,000 objects or more, the merge time is plotted against the number of objects in the shard. As the graph shows, the merge time is directly proportional to the number of objects in the shard. For this application, the longest time to merge a shard was just over 90 seconds: that shard contained over 37 million objects. (Keep in mind that merge time is also affected by the number of tables and fields being merged, so your mileage may vary.) Quick merge times mean that Doradus OLAP allows data to be added and merged fairly often, allowing queries to access data that is close to real time.
17. This slide shows how OLAP arrays are accessed during queries. The query searches the Message table and counts objects whose Size field falls between 1,000 and 10,000 and whose HasBeenSent flag is false. Furthermore, the query searches shards whose name falls between 2014-03-01 and 2014-03-31. Assuming we used 1-day shards, this requires searching 31 shards. Since the query accesses two fields, OLAP must read 62 rows: the value array for each field in each shard. This might sound like a lot of reads for a single query, but remember that shards typically contain a large set of objects. If our shards averaged 1 million objects each, this query scans 31 million objects with only 62 reads! As each array is read, it is decompressed and scanned in memory. Value arrays are designed for fast scanning: modern processors can typically scan tens of millions of values per second. When a value array is scanned, an “object bit array” is generated to reflect the selected objects. Each additional value array scan turns bits on or off; the final bit array represents the result of the query. On a “cold” system, where nothing is cached, each row read requires a physical read. In practice, however, data is cached at multiple levels. The operating system caches Cassandra data files (SSTables); since the data is compressed, a large amount of data is typically cached at this level. Cassandra caches certain information such as recently-read rows, key indices, and bloom filters to speed up access to recently-accessed data. Doradus OLAP caches value arrays on a most-recently-used (MRU) basis, so recently-accessed values are not re-read from Cassandra. Finally, since query results consist of compact bit arrays, these are also cached at the clause level on an MRU basis, so repeated query clauses are reused and act as cached queries. Together, these levels produce a natural “hot”, “warm”, and “cold” caching pattern that speeds up access to the most frequently requested data.
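The object bit array idea described above can be sketched in a few lines: scan each field's value array once, set a bit for every object that satisfies that clause, and then AND the per-clause bit sets together. The code below is only a conceptual illustration of that evaluation model, using java.util.BitSet on a toy shard.

    import java.util.BitSet;

    // Conceptual sketch of per-clause object bit arrays: one bit per object in the shard.
    public class BitArrayQuery {
        public static void main(String[] args) {
            // Column-style value arrays for a toy shard of 6 objects.
            int[] size        = { 500, 2_000, 9_999, 15_000, 1_200, 800 };
            boolean[] hasSent = { false, false, true, false, false, true };

            // Clause 1: Size between 1,000 and 10,000.
            BitSet sizeMatch = new BitSet(size.length);
            for (int i = 0; i < size.length; i++) {
                if (size[i] >= 1_000 && size[i] <= 10_000) sizeMatch.set(i);
            }

            // Clause 2: HasBeenSent = false.
            BitSet notSent = new BitSet(hasSent.length);
            for (int i = 0; i < hasSent.length; i++) {
                if (!hasSent[i]) notSent.set(i);
            }

            // ANDing the clause bit sets yields the final selection for this shard.
            sizeMatch.and(notSent);
            System.out.println("Matching objects: " + sizeMatch.cardinality());  // 2
        }
    }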
18. The topic of this presentation is “1 billion objects in 2GB”. Given this audacious claim, you might think that the objects must be fairly trivial, like numbers or something, right? Actually, these objects are taken from a real-world application: a Windows event tracking application. Shown is an example source event in CSV format. Each event consists of 10 fixed fields and 0 or more variable fields, called “insertion strings”. The number of insertion strings depends on the event's ID; in this case, a 540 event has 7 insertion strings, one of which is null. The position of each insertion string is significant, so we must store each insertion string's index as well as its value.
19. For this event tracking application, we use two tables: Events stores the 10 fixed fields of each event, and InsertionStrings stores the index and value of each variable field. The Events.Params link field and its inverse InsertionStrings.Event connect each event to its insertion string objects. This allows us to find events and navigate to their insertion strings, or vice versa. For this test, we loaded just under 115 million events. Since events average around 7.7 insertion strings each, this requires the creation of 880 million insertion string objects. Each event object is connected to its insertion string objects by populating the Params/Event links.
20. The data set contains events that span 860 calendar days, and we loaded the data with one-day shards, hence we created 860 shards. As shown, the load created a total of over 994 million objects. (OK, so it's not quite 1 billion, but it's close.) On a MacBook Air, this data can be loaded in about 2 hours; a server-class system typically requires 30 minutes or less. After the load finished, Cassandra was directed to flush and compact the data. The output of the “nodetool status” command is shown, revealing that the entire database takes less than 2GB. In another test, we compared the space used by Doradus to the space used by a relational database for an auditing application. The SQL Server database required 8.5K per object whereas Doradus required only 87 bytes: a savings of almost 99%! Doradus OLAP's space savings result from columnar storage, compression, de-duping, and the lack of indexes.
21. Here are some example DQL queries. Because Doradus provides a REST API, we can just use a browser to query the database. The links below assume Doradus is running on the local machine.

Query 1: http://localhost:1123/OLAPEvents/Events/_aggregate?m=COUNT(*)&range=0

This query counts all Events objects in all 860 shards. It takes less than 10 seconds on a “cold” system and less than 1 second on a warmed-up system.

Query 2: http://localhost:1123/OLAPEvents/Events/_aggregate?m=COUNT(*)&range=2005-01-01,2005-06-30&q=Type='Failure Audit' EventID IN (577,681,529) AND Params.Index=8 AND Params.Value='(0x0,0x3E7)'&f=TOP(5,Timestamp.HOUR) AS HourOfDay

This is a typical analytical query. Suppose we're trying to see if there's a pattern in when certain privileged events fail, that is, whether they mostly fail at the same time of day. The query looks for events where (1) the EventID belongs to a specific set of privileged event numbers, (2) the event Type field denotes a failure, and (3) the result code from the failed Windows operation has a specific value. Furthermore, we are querying a six-month range, which means we will scan 181 shards. For those reading the slides, here's a typical result of this query in XML. It shows that, within the selected shards, a total of 591 events met the search criteria. Of those events, the majority (119) occurred in hour 8 (08:00). However, the next most common hours in which these failure events occur (18, 19, 9, and 5) have slightly down-trending counts (87, 72, 69, and 66), which suggests that the events do not occur at a statistically significant hour of the day.

<results>
  <aggregate metric="COUNT(*)" query="Type='Failure Audit' EventID IN (577,681,529) AND Params.Index=8 AND Params.Value='(0x0,0x3E7)'" group="TOP(5,Timestamp.HOUR) AS HourOfDay"/>
  <totalobjects>591</totalobjects>
  <summary>591</summary>
  <totalgroups>17</totalgroups>
  <groups>
    <group>
      <metric>119</metric>
      <field name="HourOfDay">8</field>
    </group>
    <group>
      <metric>87</metric>
      <field name="HourOfDay">18</field>
    </group>
    <group>
      <metric>72</metric>
      <field name="HourOfDay">19</field>
    </group>
    <group>
      <metric>69</metric>
      <field name="HourOfDay">9</field>
    </group>
    <group>
      <metric>66</metric>
      <field name="HourOfDay">5</field>
    </group>
  </groups>
</results>
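One practical note: a browser percent-encodes the spaces, quotes, and parentheses in query 2 automatically, but a program building the URL must do so itself. A small sketch of assembling that same request string in Java, with the parameter values copied from query 2 above:

    import java.net.URLEncoder;
    import java.nio.charset.StandardCharsets;

    // Builds the URL for query 2 above, percent-encoding each parameter value.
    public class BuildAggregateUrl {
        static String enc(String s) {
            return URLEncoder.encode(s, StandardCharsets.UTF_8);
        }

        public static void main(String[] args) {
            String url = "http://localhost:1123/OLAPEvents/Events/_aggregate"
                + "?m=" + enc("COUNT(*)")
                + "&range=" + enc("2005-01-01,2005-06-30")
                + "&q=" + enc("Type='Failure Audit' EventID IN (577,681,529) AND "
                            + "Params.Index=8 AND Params.Value='(0x0,0x3E7)'")
                + "&f=" + enc("TOP(5,Timestamp.HOUR) AS HourOfDay");
            System.out.println(url);
        }
    }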
  22. To summarize, the primary advantages of Doradus OLAP are listed below: The REST API is easy to use in all languages and platforms without requiring a specialized client library. All fields are searchable without using specialized indexes. The lack of indexes means Doradus OLAP is ideal for ad-hoc statistical queries. DQL extends Lucene’s full text query language with graph-based query features that are much simpler than joins. Fast data loading and shard merging allows an OLAP database to provide near real time data warehousing. Because data is stored very compactly, less disk space is required, saving up to 99% compared to other databases. Combined with fast in-memory query processing, this means less hardware is required compared to other big data approaches. A single node will be sufficient for many analytics applications. But when necessary, the database can be scaled horizontally using Cassandra’s replication and sharding features.
23. Is Doradus OLAP a good fit for your application? This slide summarizes the criteria that indicate a good fit:
Data is received as a continuous stream: events, log records, transactions, etc.
Data is structured (all fields are predefined) or semi-structured (variability can be accommodated, as in the Events application in this presentation).
Data can be accumulated and loaded in batches.
Data can be partitioned into shards. Time-series data works best, but strictly speaking other partitioning criteria can work.
Data is typically queried within a subset of shards. For time-sharded data, this means queries select specific time ranges.
The application primarily uses statistical queries. Full text queries are supported, but statistical queries are Doradus OLAP's forte.
24. The Dell Software Group uses many open source software projects, and we're pleased that we can contribute back to the open source community. Doradus is open source under the Apache License 2.0, so everyone is free to download, modify, and use the code. Full source code and documentation are available from GitHub. Binary, source code, and Javadoc bundles are available from Maven Central. Comments and suggestions are more than welcome, so feel free to contact me. Thank you! Randy Guck