©2012 DataStax 1
DataStax Enterprise 3.x
Realtime Analytics with Solr
Jason Rutherglen
©2012 DataStax 2
•  Big Data Engineer at DataStax
•  Co-author of ‘Programming Hive’
and ‘Introduction to Solr’ from
O’Reilly
About the Presenter
©2012 DataStax 3
•  The company behind Cassandra
•  Sells DataStax Enterprise
About DataStax
©2012 DataStax 4
DataStax Enterprise 3.x
©2012 DataStax 5
•  Single stack
•  Cassandra
•  Solr
•  Hadoop
•  Consulting
•  Support
DataStax Enterprise
©2012 DataStax 6
•  ZDNet article: “The biggest cloud
app of all: Netflix”
•  http://zd.net/ZHtrmW
•  Built on Cassandra
•  “According to Cockcroft, if something
goes wrong, Netflix can continue to
run the entire service on two out of
three zones”
Cassandra at Netflix
©2012 DataStax 7
•  Petabytes of growing data
•  Hadoop is for batch work
•  What are the solutions for realtime?
What is Big Data?
©2012 DataStax 8
•  Near realtime
•  1000 millisecond latency
What is realtime?
©2012 DataStax 9
•  Cost of scaling to petabytes
•  Physical limitations
Why not relational databases?
©2012 DataStax 10
•  Hadoop for batch
•  Solr and Cassandra for realtime
•  Gives most of relational capability at
1/10 the cost, scales linearly
Relational to Big Data
©2012 DataStax 11
•  Distributed database heavy lifting
•  Simple dynamo model
•  Executes replication tasks extremely
well
Why Cassandra?
©2012 DataStax 12
•  Cassandra is easier, code is readable
•  Fewer moving parts
•  Multi-datacenter replication
•  Enables low level IO tuning
Cassandra vs. HBase
©2012 DataStax 13
•  HBase runs on HDFS
•  HDFS is not designed for random
access IO
•  Multiple hacks / products to perform
random access (MapR, HDFS Jiras)
Cassandra vs. HBase
©2012 DataStax 14
•  Cassandra is peer to peer, there is no
single point of failure (SPOF)
•  The HDFS name node is a single
point of failure
Cassandra vs. HBase
©2012 DataStax 15
•  Most of Facebook runs on MySQL
•  Memcache front ends the reads
HBase at Facebook
©2012 DataStax 16
•  Hive with Hadoop
•  A vague dialect of SQL
•  Requires Java for UDFs
•  Relational Joins
Batch Analytics
©2012 DataStax 17
•  Solr
•  SQL features except relational joins
•  Use Hive for relational joins
•  CEP (Complex Event Processing)
•  Storm
Realtime Analytics
©2012 DataStax 18
•  Storm, computes results on
streaming data
CEP (Complex Event Processing)
©2012 DataStax 19
•  Java inverted indexing library
•  Text analytics is raw computation
over linear sets of data
•  High speed computation engine
Lucene
©2012 DataStax 20
•  Terms dictionary points to lists of
document ids (integers)
•  Tokenizes text
•  Complete variety of computation on
vectors of data
Inverted Indexes
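A minimal sketch of the structure described above, assuming a simple lowercase alphanumeric tokenizer. The names `tokenize` and `build_inverted_index` are hypothetical and are not part of Lucene; Lucene's real terms dictionary and postings format are far more compact.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each term to a sorted list of the integer ids of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


docs = ["When in Rome", "Rome was not built in a day", "A day in the life"]
index = build_inverted_index(docs)
print(index["rome"])  # [0, 1]
print(index["day"])   # [1, 2]
```

A term lookup is then a dictionary access followed by a walk over a list of integers, which is why the deck calls this computation over vectors of data.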
©2012 DataStax 21
•  Search server built around Lucene
•  Adds faceting, distributed search
•  Missed the cloud environment
features of NoSQL systems for many
years
Solr
©2012 DataStax 22
•  Solr Cloud is a Zookeeper based
system
•  New and probably not production
ready
•  Playing catch up
Solr Cloud
©2012 DataStax 23
•  High overlap with Solr
•  More mature than Solr Cloud
•  Fewer distributed features than
Cassandra
Elastic Search
©2012 DataStax 24
•  Columns, column families, keyspaces
•  Peer to peer
•  Eventual consistency
•  Implements basic Google BigTable
model
Cassandra Concepts
©2012 DataStax 25
•  Both implement a log structured
merge tree file architecture
Lucene and Cassandra
©2012 DataStax 26
•  Data is stored in Cassandra
•  Data placement controlled by
Cassandra
•  Solr is a secondary index (only)
DataStax Enterprise with Solr
©2012 DataStax 27
•  Separation of church and state, e.g.,
data and index
DataStax Enterprise with Solr
©2012 DataStax 28
•  Indexing is a CPU intensive task
•  Not IO bound because of
multithreading
•  When a thread is flushing, other
threads are indexing, CPU is
saturated at all times
Indexing
©2012 DataStax 29
•  IO bound, index needs to fit in RAM,
then CPU bound
•  Lucene enables multithreaded
queries
•  Solr does not multithread queries
Queries
©2012 DataStax 30
•  Eventual consistency, each node has
its own Lucene index
•  Lucene segment files are not
replicated (like Solr Cloud and
ElasticSearch)
DataStax Enterprise with Solr
©2012 DataStax 31
•  Query requests are round robin’d
across nodes automatically
Distributed Search Architecture
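The round-robin policy named above can be sketched as follows. This only illustrates the idea: DSE does the distribution automatically, and the class `RoundRobinRouter` and the node names are hypothetical.

```python
from itertools import cycle


class RoundRobinRouter:
    """Hand out nodes in a fixed rotation so each node receives an equal share of requests."""

    def __init__(self, nodes):
        nodes = list(nodes)
        if not nodes:
            raise ValueError("at least one node is required")
        self._rotation = cycle(nodes)

    def next_node(self):
        """Return the node that should receive the next query request."""
        return next(self._rotation)


router = RoundRobinRouter(["node1", "node2", "node3"])
picked = [router.next_node() for _ in range(5)]
print(picked)  # ['node1', 'node2', 'node3', 'node1', 'node2']
```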
©2012 DataStax 32
•  3.0.1 is the current release of
DataStax Enterprise
DSE 3.0.1
©2012 DataStax 33
•  Ease of re-indexing
•  Re-index the entire cluster or
per-node
•  Re-indexing occurs when the Solr
schema changes
New Features in DSE 3.0.1
©2012 DataStax 34
•  Solr Cloud requires re-indexing from
an external data source such as a
relational database
New Features in DSE 3.0.1
©2012 DataStax 35
•  DSE re-indexes directly from
Cassandra
•  No custom code is required for
re-indexing
New Features in DSE 3.0.1
©2012 DataStax 36
•  View the heap memory usage of the
field caches
•  Perform capacity planning
New Features in DSE 3.0.1
©2012 DataStax 37
•  Multithreaded re-indexing and repair
•  Adding a new Solr node is fast
New Features in DSE 3.0.1
©2012 DataStax 38
•  Kerberos and SSL security
•  Security audit logging
New Features in DSE 3.0.1
©2012 DataStax 39
•  Near realtime: per-segment filters,
facets, multivalue facets
•  Solr 4.3
DataStax Enterprise 3.1
©2012 DataStax 40
•  vNodes
•  Composite keys
DataStax Enterprise 3.1
©2012 DataStax 41
•  Multi datacenter live Solr schema
updates and re-indexing
•  CQL -> Solr queries, makes porting
SQL applications easy for SQL
developers
Future
©2012 DataStax 42
Demo of Wikipedia
©2012 DataStax 43
•  Details about every trade
•  Tick data is generated in real time and
is quantitatively queryable
•  Too big to query on in real time? Not
anymore!
Real World Example: Tick Data
©2012 DataStax 44
•  Computing the moving stock price
average in real time
•  Comparing multiple moving averages
for different stock_symbols
•  Requires statistical analysis, group
by companies, and faceting features
Tick Data - Moving Average
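As a rough illustration of the computation itself, here is a simple moving average over a fixed window of prices, applied per symbol. It runs locally on in-memory lists and is not how Solr or DSE computes it; the function name, symbols, and prices are made up for the example.

```python
from collections import deque


def moving_average(prices, window):
    """Yield the simple moving average of the last `window` prices once the window is full."""
    if window <= 0:
        raise ValueError("window must be positive")
    recent = deque(maxlen=window)
    total = 0.0
    for price in prices:
        if len(recent) == window:
            total -= recent[0]  # this price is about to fall out of the window
        recent.append(price)
        total += price
        if len(recent) == window:
            yield total / window


# Illustrative ticks grouped by stock symbol (hypothetical data).
ticks = {"AAA": [10.0, 11.0, 12.0, 13.0], "BBB": [5.0, 7.0, 9.0]}
averages = {symbol: list(moving_average(prices, 2)) for symbol, prices in ticks.items()}
print(averages["AAA"])  # [10.5, 11.5, 12.5]
print(averages["BBB"])  # [6.0, 8.0]
```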
©2012 DataStax 45
•  Read latest ticks for a given company
•  Query ticks for companies in specific
verticals during large events such as
press releases
•  Compute deviation of stock data over
5 years for groups of companies
Tick Data Analytics - Ad Hoc
Searches
©2012 DataStax 46
Real Time Stocks Demo
©2012 DataStax 47
General
©2012 DataStax 48
•  Like an SQL CREATE TABLE
statement
•  Defines field types
•  Defines fields
Schema
©2012 DataStax 49
•  XML based configuration options for
Solr
Solr Config
©2012 DataStax 50
•  Commits new index segment to RAM
•  Avoids ‘hard’ commit fsync
Soft Commit
©2012 DataStax 51
Auto Soft Commit
<!-- The default high-performance update handler -->
<updateHandler class="solr.DirectUpdateHandler2">
  <autoSoftCommit>
    <maxTime>1000</maxTime> <!-- Near realtime of 1 second -->
  </autoSoftCommit>
</updateHandler>
©2012 DataStax 52
•  Loaded for sort and facet queries
•  Uses heap space
Field Cache
©2012 DataStax 53
•  Java based API for interacting with a
Solr server
•  DSE supports SolrJ/HTTP with no
changes
SolrJ / HTTP
©2012 DataStax 54
•  Auto data type mapping
•  Copy fields
•  Dynamic fields
Insert data with CQL
©2012 DataStax 55
•  Exists, but is mainly useful for
debugging
•  Limited functionality, queries a single
node
CQL with Solr Query
©2012 DataStax 56
CQL Insert Example
INSERT INTO wikipedia (key, text)
VALUES ('1', 'when in rome');
©2012 DataStax 57
How to convert applications
©2012 DataStax 58
•  Common to convert existing SQL
applications to Big Data
•  Focus on the application functionality
SQL to Solr
©2012 DataStax 59
•  Cassandra makes all distributed
operations easy
SQL to Solr
©2012 DataStax 60
•  SELECT * FROM wikipedia WHERE
type = 'pdf'
•  q=type:pdf
SELECT WHERE
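A small sketch of how the `q=type:pdf` parameter might be turned into a request URL. The host, port, and core path are assumptions for illustration, and `solr_select_url` is a hypothetical helper; `wt=json` asks Solr for a JSON response.

```python
from urllib.parse import urlencode


def solr_select_url(base_url, **params):
    """Build a /select URL, URL-encoding each query parameter."""
    return f"{base_url}/select?{urlencode(params)}"


# Assumed base URL; substitute the address and core name of your own deployment.
url = solr_select_url("http://localhost:8983/solr/wikipedia", q="type:pdf", wt="json")
print(url)  # http://localhost:8983/solr/wikipedia/select?q=type%3Apdf&wt=json
```

The same helper covers the other mappings in this section by passing `fl`, `sort`, or `stats` parameters as keyword arguments.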
©2012 DataStax 61
•  SELECT title,text FROM wikipedia
•  q=*:*
•  fl=title,text
SELECT columns
©2012 DataStax 62
•  SELECT COUNT(*) FROM wikipedia
WHERE type = 'pdf'
•  q=type:pdf
•  Get the num found
SELECT COUNT
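The count is reported as `numFound` in the `response` section of a Solr JSON result. The sketch below reads it from a hand-written sample body, not real server output; adding `rows=0` to the request should avoid returning any documents when only the count is needed.

```python
import json

# Hand-written sample shaped like a Solr JSON response; the values are illustrative.
sample_body = """
{"responseHeader": {"status": 0, "QTime": 1},
 "response": {"numFound": 42, "start": 0, "docs": []}}
"""


def count_from_response(body):
    """Return the number of matching documents reported by Solr."""
    return json.loads(body)["response"]["numFound"]


total = count_from_response(sample_body)
print(total)  # 42
```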
©2012 DataStax 63
•  SELECT * FROM stocks ORDER BY
price ASC
•  q=*:*
•  sort=price asc
SELECT ORDER BY
©2012 DataStax 64
•  SELECT AVG(price) FROM stocks
•  q=*:*
•  stats=true
•  stats.field=price
•  The average is called ‘mean’ in the
Solr results
SELECT AVG
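With `stats=true`, the statistics come back under `stats` and then `stats_fields`, keyed by field name. The sketch below pulls out `mean` from a hand-written sample body; the numbers are illustrative and `average_from_stats` is a hypothetical helper.

```python
import json

# Hand-written sample shaped like a stats component response; the numbers are illustrative.
sample_body = """
{"response": {"numFound": 3, "start": 0, "docs": []},
 "stats": {"stats_fields": {"price": {"min": 10.0, "max": 30.0,
                                      "count": 3, "sum": 60.0, "mean": 20.0}}}}
"""


def average_from_stats(body, field):
    """Return the 'mean' statistic for `field`, the equivalent of SQL AVG(field)."""
    stats = json.loads(body)["stats"]["stats_fields"][field]
    return stats["mean"]


avg_price = average_from_stats(sample_body, "price")
print(avg_price)  # 20.0
```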
©2012 DataStax 65
•  SELECT AVG(price) FROM stocks
GROUP BY symbol
•  q=*:*
•  stats=true
•  stats.field=price
•  stats.facet=symbol
SELECT AVG GROUP BY
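To make the intent of the grouped query concrete, this is a local, in-memory equivalent of `AVG(price) ... GROUP BY symbol`. It is only a sketch of what the result means, not of how Solr computes it; the symbols and prices are made up.

```python
from collections import defaultdict


def mean_price_by_symbol(ticks):
    """Group (symbol, price) pairs by symbol and return the mean price of each group."""
    totals = defaultdict(lambda: [0.0, 0])  # symbol -> [running sum, count]
    for symbol, price in ticks:
        acc = totals[symbol]
        acc[0] += price
        acc[1] += 1
    return {symbol: total / count for symbol, (total, count) in totals.items()}


ticks = [("AAA", 10.0), ("BBB", 5.0), ("AAA", 20.0), ("BBB", 7.0)]
means = mean_price_by_symbol(ticks)
print(means)  # {'AAA': 15.0, 'BBB': 6.0}
```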
©2012 DataStax 66
•  SELECT * FROM wikipedia WHERE
text LIKE 'rom%'
•  q=text:rom*
SELECT WHERE LIKE
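One caveat when porting: assuming `text` is a tokenized field, a prefix query such as `text:rom*` matches any single term that starts with `rom`, whereas SQL `LIKE 'rom%'` requires the whole column value to start with it. The sketch below contrasts the two using whitespace tokenization as a stand-in for a real analyzer; both function names are hypothetical.

```python
def sql_like_prefix(value, prefix):
    """SQL LIKE 'prefix%': the whole column value must start with the prefix."""
    return value.startswith(prefix)


def term_prefix_query(value, prefix):
    """Prefix query on a tokenized field: any single term may start with the prefix."""
    return any(term.startswith(prefix) for term in value.lower().split())


text = "when in rome"
print(sql_like_prefix(text, "rom"))    # False
print(term_prefix_query(text, "rom"))  # True
```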
©2012 DataStax 67

Más contenido relacionado

La actualidad más candente

Hindsight is 20/20: MySQL to Cassandra
Hindsight is 20/20: MySQL to CassandraHindsight is 20/20: MySQL to Cassandra
Hindsight is 20/20: MySQL to CassandraMichael Kjellman
 
Cassandra Community Webinar: MySQL to Cassandra - What I Wish I'd Known
Cassandra Community Webinar: MySQL to Cassandra - What I Wish I'd KnownCassandra Community Webinar: MySQL to Cassandra - What I Wish I'd Known
Cassandra Community Webinar: MySQL to Cassandra - What I Wish I'd KnownDataStax
 
Cassandra Community Webinar: From Mongo to Cassandra, Architectural Lessons
Cassandra Community Webinar: From Mongo to Cassandra, Architectural LessonsCassandra Community Webinar: From Mongo to Cassandra, Architectural Lessons
Cassandra Community Webinar: From Mongo to Cassandra, Architectural LessonsDataStax
 
Introducing DataStax Enterprise 4.7
Introducing DataStax Enterprise 4.7Introducing DataStax Enterprise 4.7
Introducing DataStax Enterprise 4.7DataStax
 
DataStax - Analytics on Apache Cassandra - Paris Tech Talks meetup
DataStax - Analytics on Apache Cassandra - Paris Tech Talks meetupDataStax - Analytics on Apache Cassandra - Paris Tech Talks meetup
DataStax - Analytics on Apache Cassandra - Paris Tech Talks meetupVictor Coustenoble
 
Managing (Schema) Migrations in Cassandra
Managing (Schema) Migrations in CassandraManaging (Schema) Migrations in Cassandra
Managing (Schema) Migrations in CassandraDataStax Academy
 
mParticle's Journey to Scylla from Cassandra
mParticle's Journey to Scylla from CassandramParticle's Journey to Scylla from Cassandra
mParticle's Journey to Scylla from CassandraScyllaDB
 
Cisco: Cassandra adoption on Cisco UCS & OpenStack
Cisco: Cassandra adoption on Cisco UCS & OpenStackCisco: Cassandra adoption on Cisco UCS & OpenStack
Cisco: Cassandra adoption on Cisco UCS & OpenStackDataStax Academy
 
DataStax C*ollege Credit: What and Why NoSQL?
DataStax C*ollege Credit: What and Why NoSQL?DataStax C*ollege Credit: What and Why NoSQL?
DataStax C*ollege Credit: What and Why NoSQL?DataStax
 
Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...
Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...
Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...DataStax
 
Cassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart LabsCassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart LabsDataStax Academy
 
How to size up an Apache Cassandra cluster (Training)
How to size up an Apache Cassandra cluster (Training)How to size up an Apache Cassandra cluster (Training)
How to size up an Apache Cassandra cluster (Training)DataStax Academy
 
Migration Best Practices: From RDBMS to Cassandra without a Hitch
Migration Best Practices: From RDBMS to Cassandra without a HitchMigration Best Practices: From RDBMS to Cassandra without a Hitch
Migration Best Practices: From RDBMS to Cassandra without a HitchDataStax Academy
 
Webinar | Introducing DataStax Enterprise 4.6
Webinar | Introducing DataStax Enterprise 4.6Webinar | Introducing DataStax Enterprise 4.6
Webinar | Introducing DataStax Enterprise 4.6DataStax
 
Azure + DataStax Enterprise (DSE) Powers Office365 Per User Store
Azure + DataStax Enterprise (DSE) Powers Office365 Per User StoreAzure + DataStax Enterprise (DSE) Powers Office365 Per User Store
Azure + DataStax Enterprise (DSE) Powers Office365 Per User StoreDataStax Academy
 
Nyc summit intro_to_cassandra
Nyc summit intro_to_cassandraNyc summit intro_to_cassandra
Nyc summit intro_to_cassandrazznate
 
Azure + DataStax Enterprise Powers Office 365 Per User Store
Azure + DataStax Enterprise Powers Office 365 Per User StoreAzure + DataStax Enterprise Powers Office 365 Per User Store
Azure + DataStax Enterprise Powers Office 365 Per User StoreDataStax Academy
 
Workshop - How to benchmark your database
Workshop - How to benchmark your databaseWorkshop - How to benchmark your database
Workshop - How to benchmark your databaseScyllaDB
 

La actualidad más candente (20)

Hindsight is 20/20: MySQL to Cassandra
Hindsight is 20/20: MySQL to CassandraHindsight is 20/20: MySQL to Cassandra
Hindsight is 20/20: MySQL to Cassandra
 
Cassandra Community Webinar: MySQL to Cassandra - What I Wish I'd Known
Cassandra Community Webinar: MySQL to Cassandra - What I Wish I'd KnownCassandra Community Webinar: MySQL to Cassandra - What I Wish I'd Known
Cassandra Community Webinar: MySQL to Cassandra - What I Wish I'd Known
 
Cassandra Community Webinar: From Mongo to Cassandra, Architectural Lessons
Cassandra Community Webinar: From Mongo to Cassandra, Architectural LessonsCassandra Community Webinar: From Mongo to Cassandra, Architectural Lessons
Cassandra Community Webinar: From Mongo to Cassandra, Architectural Lessons
 
Introducing DataStax Enterprise 4.7
Introducing DataStax Enterprise 4.7Introducing DataStax Enterprise 4.7
Introducing DataStax Enterprise 4.7
 
Cassandra NoSQL Tutorial
Cassandra NoSQL TutorialCassandra NoSQL Tutorial
Cassandra NoSQL Tutorial
 
DataStax - Analytics on Apache Cassandra - Paris Tech Talks meetup
DataStax - Analytics on Apache Cassandra - Paris Tech Talks meetupDataStax - Analytics on Apache Cassandra - Paris Tech Talks meetup
DataStax - Analytics on Apache Cassandra - Paris Tech Talks meetup
 
Managing (Schema) Migrations in Cassandra
Managing (Schema) Migrations in CassandraManaging (Schema) Migrations in Cassandra
Managing (Schema) Migrations in Cassandra
 
mParticle's Journey to Scylla from Cassandra
mParticle's Journey to Scylla from CassandramParticle's Journey to Scylla from Cassandra
mParticle's Journey to Scylla from Cassandra
 
Cisco: Cassandra adoption on Cisco UCS & OpenStack
Cisco: Cassandra adoption on Cisco UCS & OpenStackCisco: Cassandra adoption on Cisco UCS & OpenStack
Cisco: Cassandra adoption on Cisco UCS & OpenStack
 
DataStax C*ollege Credit: What and Why NoSQL?
DataStax C*ollege Credit: What and Why NoSQL?DataStax C*ollege Credit: What and Why NoSQL?
DataStax C*ollege Credit: What and Why NoSQL?
 
Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...
Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...
Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...
 
Cassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart LabsCassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart Labs
 
How to size up an Apache Cassandra cluster (Training)
How to size up an Apache Cassandra cluster (Training)How to size up an Apache Cassandra cluster (Training)
How to size up an Apache Cassandra cluster (Training)
 
Migration Best Practices: From RDBMS to Cassandra without a Hitch
Migration Best Practices: From RDBMS to Cassandra without a HitchMigration Best Practices: From RDBMS to Cassandra without a Hitch
Migration Best Practices: From RDBMS to Cassandra without a Hitch
 
Webinar | Introducing DataStax Enterprise 4.6
Webinar | Introducing DataStax Enterprise 4.6Webinar | Introducing DataStax Enterprise 4.6
Webinar | Introducing DataStax Enterprise 4.6
 
Cassandra in e-commerce
Cassandra in e-commerceCassandra in e-commerce
Cassandra in e-commerce
 
Azure + DataStax Enterprise (DSE) Powers Office365 Per User Store
Azure + DataStax Enterprise (DSE) Powers Office365 Per User StoreAzure + DataStax Enterprise (DSE) Powers Office365 Per User Store
Azure + DataStax Enterprise (DSE) Powers Office365 Per User Store
 
Nyc summit intro_to_cassandra
Nyc summit intro_to_cassandraNyc summit intro_to_cassandra
Nyc summit intro_to_cassandra
 
Azure + DataStax Enterprise Powers Office 365 Per User Store
Azure + DataStax Enterprise Powers Office 365 Per User StoreAzure + DataStax Enterprise Powers Office 365 Per User Store
Azure + DataStax Enterprise Powers Office 365 Per User Store
 
Workshop - How to benchmark your database
Workshop - How to benchmark your databaseWorkshop - How to benchmark your database
Workshop - How to benchmark your database
 

Similar a C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen

State of Cassandra 2012
State of Cassandra 2012State of Cassandra 2012
State of Cassandra 2012jbellis
 
Demystifying Data Warehouse as a Service (DWaaS)
Demystifying Data Warehouse as a Service (DWaaS)Demystifying Data Warehouse as a Service (DWaaS)
Demystifying Data Warehouse as a Service (DWaaS)Kent Graziano
 
Whats new in Oracle Database 12c release 12.1.0.2
Whats new in Oracle Database 12c release 12.1.0.2Whats new in Oracle Database 12c release 12.1.0.2
Whats new in Oracle Database 12c release 12.1.0.2Connor McDonald
 
Sa introduction to big data pipelining with cassandra &amp; spark west mins...
Sa introduction to big data pipelining with cassandra &amp; spark   west mins...Sa introduction to big data pipelining with cassandra &amp; spark   west mins...
Sa introduction to big data pipelining with cassandra &amp; spark west mins...Simon Ambridge
 
Oracle big data appliance and solutions
Oracle big data appliance and solutionsOracle big data appliance and solutions
Oracle big data appliance and solutionssolarisyougood
 
Presentation big dataappliance-overview_oow_v3
Presentation   big dataappliance-overview_oow_v3Presentation   big dataappliance-overview_oow_v3
Presentation big dataappliance-overview_oow_v3xKinAnx
 
Toronto jaspersoft meetup
Toronto jaspersoft meetupToronto jaspersoft meetup
Toronto jaspersoft meetupPatrick McFadin
 
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?DataWorks Summit
 
What's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoWhat's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoDataWorks Summit
 
Apache Cassandra and The Multi-Cloud by Amanda Moran
Apache Cassandra and The Multi-Cloud by Amanda MoranApache Cassandra and The Multi-Cloud by Amanda Moran
Apache Cassandra and The Multi-Cloud by Amanda MoranData Con LA
 
Changing the game with cloud dw
Changing the game with cloud dwChanging the game with cloud dw
Changing the game with cloud dwelephantscale
 
The Perils and Triumphs of using Cassandra at a .NET/Microsoft Shop
The Perils and Triumphs of using Cassandra at a .NET/Microsoft ShopThe Perils and Triumphs of using Cassandra at a .NET/Microsoft Shop
The Perils and Triumphs of using Cassandra at a .NET/Microsoft ShopJeff Smoley
 
Sql pass summit
Sql pass summitSql pass summit
Sql pass summitDon Severs
 
On Cassandra Development: Past, Present and Future
On Cassandra Development: Past, Present and FutureOn Cassandra Development: Past, Present and Future
On Cassandra Development: Past, Present and Futurepcmanus
 
Cassandra EU 2012 - Overview of Case Studies and State of the Market by 451 R...
Cassandra EU 2012 - Overview of Case Studies and State of the Market by 451 R...Cassandra EU 2012 - Overview of Case Studies and State of the Market by 451 R...
Cassandra EU 2012 - Overview of Case Studies and State of the Market by 451 R...Acunu
 
Ibm_IoT_Architecture_and_Capabilities
Ibm_IoT_Architecture_and_CapabilitiesIbm_IoT_Architecture_and_Capabilities
Ibm_IoT_Architecture_and_CapabilitiesIBM_Info_Management
 
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...Fwdays
 
Dynamic DDL: Adding structure to streaming IoT data on the fly
Dynamic DDL: Adding structure to streaming IoT data on the flyDynamic DDL: Adding structure to streaming IoT data on the fly
Dynamic DDL: Adding structure to streaming IoT data on the flyDataWorks Summit
 

Similar a C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen (20)

State of Cassandra 2012
State of Cassandra 2012State of Cassandra 2012
State of Cassandra 2012
 
Demystifying Data Warehouse as a Service (DWaaS)
Demystifying Data Warehouse as a Service (DWaaS)Demystifying Data Warehouse as a Service (DWaaS)
Demystifying Data Warehouse as a Service (DWaaS)
 
Whats new in Oracle Database 12c release 12.1.0.2
Whats new in Oracle Database 12c release 12.1.0.2Whats new in Oracle Database 12c release 12.1.0.2
Whats new in Oracle Database 12c release 12.1.0.2
 
Sa introduction to big data pipelining with cassandra &amp; spark west mins...
Sa introduction to big data pipelining with cassandra &amp; spark   west mins...Sa introduction to big data pipelining with cassandra &amp; spark   west mins...
Sa introduction to big data pipelining with cassandra &amp; spark west mins...
 
Oracle big data appliance and solutions
Oracle big data appliance and solutionsOracle big data appliance and solutions
Oracle big data appliance and solutions
 
Presentation big dataappliance-overview_oow_v3
Presentation   big dataappliance-overview_oow_v3Presentation   big dataappliance-overview_oow_v3
Presentation big dataappliance-overview_oow_v3
 
Toronto jaspersoft meetup
Toronto jaspersoft meetupToronto jaspersoft meetup
Toronto jaspersoft meetup
 
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?
 
What's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoWhat's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - Tokyo
 
What's new in SQL Server Integration Services 2012?
What's new in SQL Server Integration Services 2012?What's new in SQL Server Integration Services 2012?
What's new in SQL Server Integration Services 2012?
 
Apache Cassandra and The Multi-Cloud by Amanda Moran
Apache Cassandra and The Multi-Cloud by Amanda MoranApache Cassandra and The Multi-Cloud by Amanda Moran
Apache Cassandra and The Multi-Cloud by Amanda Moran
 
Twitter with hadoop for oow
Twitter with hadoop for oowTwitter with hadoop for oow
Twitter with hadoop for oow
 
Changing the game with cloud dw
Changing the game with cloud dwChanging the game with cloud dw
Changing the game with cloud dw
 
The Perils and Triumphs of using Cassandra at a .NET/Microsoft Shop
The Perils and Triumphs of using Cassandra at a .NET/Microsoft ShopThe Perils and Triumphs of using Cassandra at a .NET/Microsoft Shop
The Perils and Triumphs of using Cassandra at a .NET/Microsoft Shop
 
Sql pass summit
Sql pass summitSql pass summit
Sql pass summit
 
On Cassandra Development: Past, Present and Future
On Cassandra Development: Past, Present and FutureOn Cassandra Development: Past, Present and Future
On Cassandra Development: Past, Present and Future
 
Cassandra EU 2012 - Overview of Case Studies and State of the Market by 451 R...
Cassandra EU 2012 - Overview of Case Studies and State of the Market by 451 R...Cassandra EU 2012 - Overview of Case Studies and State of the Market by 451 R...
Cassandra EU 2012 - Overview of Case Studies and State of the Market by 451 R...
 
Ibm_IoT_Architecture_and_Capabilities
Ibm_IoT_Architecture_and_CapabilitiesIbm_IoT_Architecture_and_Capabilities
Ibm_IoT_Architecture_and_Capabilities
 
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
 
Dynamic DDL: Adding structure to streaming IoT data on the fly
Dynamic DDL: Adding structure to streaming IoT data on the flyDynamic DDL: Adding structure to streaming IoT data on the fly
Dynamic DDL: Adding structure to streaming IoT data on the fly
 

Más de DataStax Academy

Forrester CXNYC 2017 - Delivering great real-time cx is a true craft
Forrester CXNYC 2017 - Delivering great real-time cx is a true craftForrester CXNYC 2017 - Delivering great real-time cx is a true craft
Forrester CXNYC 2017 - Delivering great real-time cx is a true craftDataStax Academy
 
Introduction to DataStax Enterprise Graph Database
Introduction to DataStax Enterprise Graph DatabaseIntroduction to DataStax Enterprise Graph Database
Introduction to DataStax Enterprise Graph DatabaseDataStax Academy
 
Introduction to DataStax Enterprise Advanced Replication with Apache Cassandra
Introduction to DataStax Enterprise Advanced Replication with Apache CassandraIntroduction to DataStax Enterprise Advanced Replication with Apache Cassandra
Introduction to DataStax Enterprise Advanced Replication with Apache CassandraDataStax Academy
 
Cassandra 3.0 Data Modeling
Cassandra 3.0 Data ModelingCassandra 3.0 Data Modeling
Cassandra 3.0 Data ModelingDataStax Academy
 
Cassandra Adoption on Cisco UCS & Open stack
Cassandra Adoption on Cisco UCS & Open stackCassandra Adoption on Cisco UCS & Open stack
Cassandra Adoption on Cisco UCS & Open stackDataStax Academy
 
Data Modeling for Apache Cassandra
Data Modeling for Apache CassandraData Modeling for Apache Cassandra
Data Modeling for Apache CassandraDataStax Academy
 
Production Ready Cassandra
Production Ready CassandraProduction Ready Cassandra
Production Ready CassandraDataStax Academy
 
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & PythonCassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & PythonDataStax Academy
 
Cassandra @ Sony: The good, the bad, and the ugly part 1
Cassandra @ Sony: The good, the bad, and the ugly part 1Cassandra @ Sony: The good, the bad, and the ugly part 1
Cassandra @ Sony: The good, the bad, and the ugly part 1DataStax Academy
 
Cassandra @ Sony: The good, the bad, and the ugly part 2
Cassandra @ Sony: The good, the bad, and the ugly part 2Cassandra @ Sony: The good, the bad, and the ugly part 2
Cassandra @ Sony: The good, the bad, and the ugly part 2DataStax Academy
 
Standing Up Your First Cluster
Standing Up Your First ClusterStanding Up Your First Cluster
Standing Up Your First ClusterDataStax Academy
 
Introduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache CassandraIntroduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache CassandraDataStax Academy
 
Enabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax EnterpriseEnabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax EnterpriseDataStax Academy
 
Advanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache CassandraAdvanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache CassandraDataStax Academy
 
Apache Cassandra and Drivers
Apache Cassandra and DriversApache Cassandra and Drivers
Apache Cassandra and DriversDataStax Academy
 
Getting Started with Graph Databases
Getting Started with Graph DatabasesGetting Started with Graph Databases
Getting Started with Graph DatabasesDataStax Academy
 
Cassandra Data Maintenance with Spark
Cassandra Data Maintenance with SparkCassandra Data Maintenance with Spark
Cassandra Data Maintenance with SparkDataStax Academy
 

Más de DataStax Academy (20)

Forrester CXNYC 2017 - Delivering great real-time cx is a true craft
Forrester CXNYC 2017 - Delivering great real-time cx is a true craftForrester CXNYC 2017 - Delivering great real-time cx is a true craft
Forrester CXNYC 2017 - Delivering great real-time cx is a true craft
 
Introduction to DataStax Enterprise Graph Database
Introduction to DataStax Enterprise Graph DatabaseIntroduction to DataStax Enterprise Graph Database
Introduction to DataStax Enterprise Graph Database
 
Introduction to DataStax Enterprise Advanced Replication with Apache Cassandra
Introduction to DataStax Enterprise Advanced Replication with Apache CassandraIntroduction to DataStax Enterprise Advanced Replication with Apache Cassandra
Introduction to DataStax Enterprise Advanced Replication with Apache Cassandra
 
Cassandra 3.0 Data Modeling
Cassandra 3.0 Data ModelingCassandra 3.0 Data Modeling
Cassandra 3.0 Data Modeling
 
Cassandra Adoption on Cisco UCS & Open stack
Cassandra Adoption on Cisco UCS & Open stackCassandra Adoption on Cisco UCS & Open stack
Cassandra Adoption on Cisco UCS & Open stack
 
Data Modeling for Apache Cassandra
Data Modeling for Apache CassandraData Modeling for Apache Cassandra
Data Modeling for Apache Cassandra
 
Coursera Cassandra Driver
Coursera Cassandra DriverCoursera Cassandra Driver
Coursera Cassandra Driver
 
Production Ready Cassandra
Production Ready CassandraProduction Ready Cassandra
Production Ready Cassandra
 
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & PythonCassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
 
Cassandra @ Sony: The good, the bad, and the ugly part 1
Cassandra @ Sony: The good, the bad, and the ugly part 1Cassandra @ Sony: The good, the bad, and the ugly part 1
Cassandra @ Sony: The good, the bad, and the ugly part 1
 
Cassandra @ Sony: The good, the bad, and the ugly part 2
Cassandra @ Sony: The good, the bad, and the ugly part 2Cassandra @ Sony: The good, the bad, and the ugly part 2
Cassandra @ Sony: The good, the bad, and the ugly part 2
 
Standing Up Your First Cluster
Standing Up Your First ClusterStanding Up Your First Cluster
Standing Up Your First Cluster
 
Introduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache CassandraIntroduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache Cassandra
 
Enabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax EnterpriseEnabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax Enterprise
 
Bad Habits Die Hard
Bad Habits Die Hard Bad Habits Die Hard
Bad Habits Die Hard
 
Advanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache CassandraAdvanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache Cassandra
 
Advanced Cassandra
Advanced CassandraAdvanced Cassandra
Advanced Cassandra
 
Apache Cassandra and Drivers
Apache Cassandra and DriversApache Cassandra and Drivers
Apache Cassandra and Drivers
 
Getting Started with Graph Databases
Getting Started with Graph DatabasesGetting Started with Graph Databases
Getting Started with Graph Databases
 
Cassandra Data Maintenance with Spark
Cassandra Data Maintenance with SparkCassandra Data Maintenance with Spark
Cassandra Data Maintenance with Spark
 

C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen

  • 1. DataStax Enterprise 3.x: Realtime Analytics with Solr. Jason Rutherglen. ©2012 DataStax
  • 2. About the Presenter
      •  Big Data Engineer at DataStax
      •  Co-author of ‘Programming Hive’ and ‘Introduction to Solr’ from O’Reilly
  • 3. About DataStax
      •  The company behind Cassandra
      •  Sells DataStax Enterprise
  • 4. DataStax Enterprise 3.x
  • 5. DataStax Enterprise
      •  Single stack
      •  Cassandra
      •  Solr
      •  Hadoop
      •  Consulting
      •  Support
  • 6. Cassandra at Netflix
      •  ZDNet article: “The biggest cloud app of all: Netflix”
      •  http://zd.net/ZHtrmW
      •  Built on Cassandra
      •  “According to Cockcroft, if something goes wrong, Netflix can continue to run the entire service on two out of three zones”
  • 7. What is Big Data?
      •  Petabytes of growing data
      •  Hadoop is for batch work
      •  What are the solutions for realtime?
  • 8. What is realtime?
      •  Near realtime
      •  1000 millisecond latency
  • 9. Why not relational databases?
      •  Cost of scaling to petabytes
      •  Physical limitations
  • 10. Relational to Big Data
      •  Hadoop for batch
      •  Solr and Cassandra for realtime
      •  Gives most of relational capability at 1/10 the cost, scales linearly
  • 11. Why Cassandra?
      •  Distributed database heavy lifting
      •  Simple Dynamo model
      •  Executes replication tasks extremely well
  • 12. Cassandra vs. HBase
      •  Cassandra is easier, code is readable
      •  Fewer moving parts
      •  Multi-datacenter replication
      •  Enables low-level IO tuning
  • 13. Cassandra vs. HBase
      •  HBase runs on HDFS
      •  HDFS is not designed for random access IO
      •  Multiple hacks / products to perform random access (MapR, HDFS Jiras)
  • 14. Cassandra vs. HBase
      •  Cassandra is peer to peer, there is no single point of failure (SPOF)
      •  The HDFS name node is a single point of failure
  • 15. HBase at Facebook
      •  Most of Facebook runs on MySQL
      •  Memcache front ends the reads
  • 16. Batch Analytics
      •  Hive with Hadoop
      •  A vague dialect of SQL
      •  Requires Java for UDFs
      •  Relational Joins
  • 17. Realtime Analytics
      •  Solr
      •  SQL features except relational joins
      •  Use Hive for relational joins
      •  CEP (Complex Event Processing)
      •  Storm
  • 18. CEP (Complex Event Processing)
      •  Storm, computes results on streaming data
  • 19. Lucene
      •  Java inverted indexing library
      •  Text analytics is raw computation over linear sets of data
      •  High speed computation engine
  • 20. Inverted Indexes
      •  Terms dictionary points to lists of document IDs (integers)
      •  Tokenizes text
      •  Supports a wide variety of computation on vectors of data
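The inverted-index idea on this slide can be illustrated with a small sketch. This is not Lucene's implementation; the function names (`tokenize`, `build_inverted_index`, `search_and`) and the sample documents are made up for illustration, and the tokenizer is a deliberately naive stand-in for a real analyzer.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (naive analyzer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each term to a sorted list of integer ids of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search_and(index, query):
    """Return the ids of documents that contain every term in the query."""
    term_sets = [set(index.get(term, ())) for term in tokenize(query)]
    if not term_sets:
        return []
    return sorted(set.intersection(*term_sets))


# Hypothetical sample documents; ids are their positions in the list.
docs = ["When in Rome", "Rome was not built in a day", "A day in the life"]
index = build_inverted_index(docs)
print(index["rome"])                # [0, 1]
print(search_and(index, "in day"))  # [1, 2]
```

A query becomes set arithmetic over integer lists, which is why the slide describes search as computation on vectors of data.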
  • 21. Solr
      •  Search server built around Lucene
      •  Adds faceting, distributed search
      •  Missed the cloud environment features of NoSQL systems for many years
  • 22. Solr Cloud
      •  Solr Cloud is a Zookeeper based system
      •  New and probably not production ready
      •  Playing catch up
  • 23. Elastic Search
      •  High overlap with Solr
      •  More mature than Solr Cloud
      •  Fewer distributed features than Cassandra
  • 24. Cassandra Concepts
      •  Columns, column families, keyspaces
      •  Peer to peer
      •  Eventual consistency
      •  Implements basic Google BigTable model
  • 25. Lucene and Cassandra
      •  Both implement a log structured merge tree file architecture
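As a rough illustration of the log structured merge pattern named on this slide: writes land in an in-memory buffer, the buffer is flushed to immutable sorted segments, and segments are merged later. The class `TinyLSM` and its `flush_threshold` parameter are hypothetical and greatly simplified; neither Lucene nor Cassandra is implemented this way in detail.

```python
class TinyLSM:
    """Toy key-value store following the log structured merge pattern."""

    def __init__(self, flush_threshold=2):
        self.memtable = {}      # mutable in-memory write buffer
        self.segments = []      # immutable sorted segments, oldest first
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        """Buffer the write; flush once the buffer reaches the threshold."""
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Freeze the buffer into a new immutable, sorted segment."""
        if self.memtable:
            self.segments.append(tuple(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key):
        """Check the buffer first, then segments from newest to oldest."""
        if key in self.memtable:
            return self.memtable[key]
        for segment in reversed(self.segments):
            for k, v in segment:
                if k == key:
                    return v
        return None

    def compact(self):
        """Merge all segments into one; newer values overwrite older ones."""
        merged = {}
        for segment in self.segments:  # oldest first
            merged.update(segment)
        self.segments = [tuple(sorted(merged.items()))] if merged else []


store = TinyLSM(flush_threshold=2)
for key, value in [("a", 1), ("b", 2), ("a", 3), ("c", 4)]:
    store.put(key, value)
print(len(store.segments), store.get("a"))  # 2 3
store.compact()
print(len(store.segments), store.get("a"))  # 1 3
```

Existing segments are never modified in place, which keeps writes sequential and pushes the cost into background merging.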
  • 26. DataStax Enterprise with Solr
      •  Data is stored in Cassandra
      •  Data placement controlled by Cassandra
      •  Solr is a secondary index (only)
  • 27. DataStax Enterprise with Solr
      •  Separation of church and state, e.g., data and index
  • 28. Indexing
      •  Indexing is a CPU intensive task
      •  Not IO bound because of multithreading
      •  When a thread is flushing, other threads are indexing, CPU is saturated at all times
  • 29. Queries
      •  IO bound unless the index fits in RAM; once it does, CPU bound
      •  Lucene enables multithreading queries
      •  Solr does not multithread queries
  • 30. DataStax Enterprise with Solr
      •  Eventual consistency, each node has its own Lucene index
      •  Lucene segment files are not replicated (like Solr Cloud and ElasticSearch)
  • 31. Distributed Search Architecture
      •  Query requests are automatically distributed round-robin across nodes
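Round-robin distribution simply hands each new request to the next node in a fixed rotation. The slide says DSE does this automatically; the sketch below only illustrates the idea. The `RoundRobinRouter` class and the node names are hypothetical and not part of any DSE API.

```python
from itertools import cycle


class RoundRobinRouter:
    """Hand out nodes in a fixed rotation, one per request."""

    def __init__(self, nodes):
        nodes = list(nodes)
        if not nodes:
            raise ValueError("at least one node is required")
        self._rotation = cycle(nodes)

    def next_node(self):
        """Return the node that should receive the next query."""
        return next(self._rotation)


# Hypothetical node names.
router = RoundRobinRouter(["node1", "node2", "node3"])
print([router.next_node() for _ in range(4)])
# ['node1', 'node2', 'node3', 'node1']
```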
  • 32. DSE 3.0.1
      •  3.0.1 is the current release of DataStax Enterprise
  • 33. New Features in DSE 3.0.1
      •  Ease of re-indexing
      •  Re-index the entire cluster or per-node
      •  Re-indexing occurs when the Solr schema changes
  • 34. New Features in DSE 3.0.1
      •  Solr Cloud requires re-indexing from an external data source such as a relational database
  • 35. New Features in DSE 3.0.1
      •  DSE re-indexes directly from Cassandra
      •  No custom code is required for re-indexing
  • 36. New Features in DSE 3.0.1
      •  View the heap memory usage of the field caches
      •  Perform capacity planning
  • 37. New Features in DSE 3.0.1
      •  Multithreaded re-indexing and repair
      •  Adding a new Solr node is fast
  • 38. New Features in DSE 3.0.1
      •  Kerberos and SSL security
      •  Security audit logging
  • 39. DataStax Enterprise 3.1
      •  Near realtime: per-segment filters, facets, multivalue facets
      •  Solr 4.3
  • 40. DataStax Enterprise 3.1
      •  vNodes
      •  Composite keys
  • 41. Future
      •  Multi-datacenter live Solr schema updates and re-indexing
      •  CQL -> Solr queries, which make porting SQL applications easy for SQL developers
  • 42. Demo of Wikipedia
  • 43. Real World Example: Tick Data
      •  Details about every trade
      •  Tick data is generated in real time and is quantitatively queryable
      •  Too big to query on in real time? Not anymore!
  • 44. Tick Data - Moving Average
      •  Computing the moving stock price average in real time
      •  Comparing multiple moving averages for different stock_symbols
      •  Requires statistical analysis, group by companies, and faceting features
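To make the moving-average computation concrete, here is a minimal in-memory sketch of a simple moving average kept per stock symbol. It is independent of Solr and DSE; the `MovingAverage` class, the window size, and the symbols and prices are all invented for illustration.

```python
from collections import defaultdict, deque


class MovingAverage:
    """Simple moving average over the last `window` prices of each symbol."""

    def __init__(self, window):
        if window <= 0:
            raise ValueError("window must be positive")
        # One bounded queue per symbol; old prices fall off automatically.
        self._prices = defaultdict(lambda: deque(maxlen=window))

    def add_tick(self, symbol, price):
        """Record a new price and return the symbol's current moving average."""
        prices = self._prices[symbol]
        prices.append(price)
        return sum(prices) / len(prices)


# Hypothetical ticks for two made-up symbols.
averages = MovingAverage(window=3)
for price in (10.0, 20.0, 30.0, 40.0):
    latest = averages.add_tick("AAA", price)
print(latest)                          # 30.0 (mean of 20, 30, 40)
print(averages.add_tick("BBB", 5.0))   # 5.0
```

Keeping one window per symbol is what lets moving averages for different stock symbols be compared side by side.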
  • 45. Tick Data Analytics - Ad Hoc Searches
      •  Read latest ticks for a given company
      •  Query ticks for companies in specific verticals during large events such as press releases
      •  Compute deviation of stock data over 5 years for groups of companies
  • 46. Real Time Stocks Demo
  • 48. Schema
      •  Like an SQL CREATE TABLE statement
      •  Defines field types
      •  Defines fields
  • 49. Solr Config
      •  XML based configuration options for Solr
  • 50. Soft Commit
      •  Commits new index segment to RAM
      •  Avoids ‘hard’ commit fsync
  • 51. Auto Soft Commit
        <!-- The default high-performance update handler -->
        <updateHandler class="solr.DirectUpdateHandler2">
          <autoSoftCommit>
            <maxTime>1000</maxTime> <!-- Near realtime of 1 second -->
          </autoSoftCommit>
        </updateHandler>
  • 52. Field Cache
      •  Loaded for sort and facet queries
      •  Uses heap space
  • 53. SolrJ / HTTP
      •  Java based API for interacting with a Solr server
      •  DSE supports SolrJ/HTTP with no changes
  • 54. Insert data with CQL
      •  Auto data type mapping
      •  Copy fields
      •  Dynamic fields
  • 55. CQL with Solr Query
      •  Exists, but is mainly useful for debugging
      •  Limited functionality; queries a single node
  • 56. CQL Insert Example
        INSERT INTO wikipedia (key, text)
        VALUES ('1', 'when in rome');
  • 57. How to convert applications
  • 58. SQL to Solr
      •  Common to convert existing SQL applications to Big Data
      •  Focus on the application functionality
  • 59. SQL to Solr
      •  Cassandra makes all distributed operations easy
  • 60. SELECT WHERE
      •  SELECT * FROM wikipedia WHERE type = 'pdf'
      •  q=type:pdf
  • 61. SELECT columns
      •  SELECT title,text FROM wikipedia
      •  q=*:*
      •  fl=title,text
  • 62. SELECT COUNT
      •  SELECT COUNT(*) FROM wikipedia WHERE type = 'pdf'
      •  q=type:pdf
      •  Read the count from numFound in the Solr response
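A sketch of the COUNT mapping, assuming the query is sent with `wt=json` so the response is JSON. Adding `rows=0` (a standard Solr parameter, not shown on the slide) skips fetching documents when only the count is needed. The helper names and the sample response body are hand-written for illustration, not captured from a server.

```python
import json


def count_params(where):
    """Solr parameters for SELECT COUNT(*) ... WHERE <where>."""
    return {"q": where, "rows": 0, "wt": "json"}


def num_found(response_body):
    """Extract the match count from a Solr JSON response body."""
    return json.loads(response_body)["response"]["numFound"]


# Hand-written sample in the shape of a Solr JSON select response.
sample = '{"responseHeader": {"status": 0}, "response": {"numFound": 42, "start": 0, "docs": []}}'
print(count_params("type:pdf"))
print(num_found(sample))  # 42
```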
  • 63. SELECT ORDER BY
      •  SELECT * FROM stocks ORDER BY price ASC
      •  q=*:*
      •  sort=price asc
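The WHERE, column-selection, and ORDER BY mappings from the preceding slides can be combined into one small helper that builds the Solr request parameters. The function names `select_params` and `to_query_string` are hypothetical; only the parameter names `q`, `fl`, and `sort` come from the slides.

```python
from urllib.parse import urlencode


def select_params(where="*:*", columns=None, order_by=None):
    """Build Solr parameters for a simple SELECT.

    where    -- Solr query string, e.g. "type:pdf" (defaults to match-all)
    columns  -- list of field names to return, or None for all fields
    order_by -- (field, direction) tuple, direction being "asc" or "desc"
    """
    params = {"q": where}
    if columns:
        params["fl"] = ",".join(columns)
    if order_by:
        field, direction = order_by
        params["sort"] = f"{field} {direction}"
    return params


def to_query_string(params):
    """URL-encode the parameters for use in an HTTP select request."""
    return urlencode(params)


print(select_params(where="type:pdf"))              # SELECT * ... WHERE type = 'pdf'
print(select_params(columns=["title", "text"]))     # SELECT title,text ...
print(select_params(order_by=("price", "asc")))     # SELECT * ... ORDER BY price ASC
print(to_query_string(select_params(where="type:pdf")))  # q=type%3Apdf
```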
  • 64. SELECT AVG
      •  SELECT AVG(price) FROM stocks
      •  q=*:*
      •  stats=true
      •  stats.field=price
      •  The average is called ‘mean’ in the Solr results
  • 65. SELECT AVG GROUP BY
      •  SELECT AVG(price) FROM stocks GROUP BY symbol
      •  q=*:*
      •  stats=true
      •  stats.field=price
      •  stats.facet=symbol
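A sketch of reading the AVG and GROUP BY results back out of a stats response. The response layout used here (`stats` → `stats_fields` → field → `mean`, with per-group values under `facets`) is an assumption based on the Solr 4.x stats component with `wt=json`, and should be checked against a real server. The helper names, symbols, and numbers are invented.

```python
def avg_params(field, group_by=None):
    """Solr parameters for SELECT AVG(field) [GROUP BY group_by]."""
    params = {"q": "*:*", "rows": 0, "wt": "json",
              "stats": "true", "stats.field": field}
    if group_by:
        params["stats.facet"] = group_by
    return params


def mean_of(response, field):
    """Overall mean of `field` from a parsed stats response."""
    return response["stats"]["stats_fields"][field]["mean"]


def means_by_group(response, field, group_by):
    """Mean of `field` for each value of the `group_by` facet field."""
    groups = response["stats"]["stats_fields"][field]["facets"][group_by]
    return {value: stats["mean"] for value, stats in groups.items()}


# Hand-written sample in the assumed response layout; not real server output.
sample = {
    "stats": {"stats_fields": {"price": {
        "mean": 15.0,
        "facets": {"symbol": {"AAA": {"mean": 10.0}, "BBB": {"mean": 20.0}}},
    }}}
}
print(avg_params("price", group_by="symbol"))
print(mean_of(sample, "price"))                   # 15.0
print(means_by_group(sample, "price", "symbol"))  # {'AAA': 10.0, 'BBB': 20.0}
```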
  • 66. SELECT WHERE LIKE
      •  SELECT * FROM wikipedia WHERE text LIKE 'rom%'
      •  q=text:rom*
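The LIKE mapping on this slide covers the prefix case, where the pattern ends in a single `%`. A minimal sketch of that conversion follows. The function name is hypothetical, and it deliberately rejects other LIKE patterns and does not escape Solr special characters in the prefix.

```python
def like_prefix_to_solr(field, pattern):
    """Convert a SQL LIKE prefix pattern such as 'rom%' to a Solr prefix query.

    Only patterns with one trailing '%' and no '_' wildcard are supported.
    """
    if not pattern.endswith("%") or "%" in pattern[:-1] or "_" in pattern:
        raise ValueError("only prefix patterns like 'abc%' are supported")
    prefix = pattern[:-1]
    # An empty prefix matches any value of the field.
    return f"{field}:{prefix}*"


print(like_prefix_to_solr("text", "rom%"))  # text:rom*
```

Whether the prefix matches also depends on how the field is analyzed at index time (for example, lowercasing), so the converted query may need the same normalization.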