Advanced search and 
Top-K queries in Cassandra 
Daniel Higuero 
dhiguero@stratio.com 
@dhiguero 
Andrés de la Peña 
andres@stratio.com 
@a_de_la_pena
Who are we? 
• Stratio is a Big Data Company 
• Founded in 2013 
• Commercially launched in 2014 
• 70+ employees in Madrid 
• Office in San Francisco 
• Certified Spark distribution 
#CassandraSummit 2014
CONTENTS
1. Cassandra query methods
2. Stratio Lucene based 2i implementation
3. Integrating Lucene 2i with Apache Spark
Cassandra query methods
[Figure: primary key, secondary indexes and token ranges compared on two axes, Throughput and Expressiveness]
Primary key queries 
• O(1) node lookup for partition key 
• Range slices for clustering key 
• Usually requires denormalization 
[Figure: the CLIENT sends a partition key and a clustering key range to one of three nodes (Node 1, Node 2, Node 3). Sample partitions aagea, dhiguero and apena hold rows keyed by date, e.g. 2014-04-06:body, 2014-04-10:body, 2014-04-11:body.]
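The two access paths on this slide can be sketched as a toy model in Python. This is not Cassandra's implementation: the three-node ring, the MD5-based `token` function and the sample rows are assumptions made for illustration only (Cassandra's default partitioner is Murmur3).

```python
import bisect
import hashlib

# Hypothetical ring: each node owns every token up to and including its position.
RING = [(2**32 // 3, "node1"), (2 * 2**32 // 3, "node2"), (2**32 - 1, "node3")]


def token(partition_key: str) -> int:
    """Toy token function (assumption): first 4 bytes of MD5 as an unsigned int."""
    return int.from_bytes(hashlib.md5(partition_key.encode()).digest()[:4], "big")


def node_for(partition_key: str) -> str:
    """Locate the owning node from the key alone, without asking the other nodes."""
    positions = [pos for pos, _ in RING]
    return RING[bisect.bisect_left(positions, token(partition_key))][1]


def range_slice(partition, lower, upper):
    """Rows in a partition are sorted by clustering key, so a range is one contiguous slice."""
    keys = [key for key, _ in partition]
    return partition[bisect.bisect_left(keys, lower):bisect.bisect_right(keys, upper)]


# Illustrative partition, sorted by clustering key (a date string).
apena = [
    ("2014-04-06", "The cautious..."),
    ("2014-04-10", "When you.."),
    ("2014-04-11", "When you do..."),
]
owner = node_for("apena")
recent = range_slice(apena, "2014-04-10", "2014-04-11")
```

Because both steps depend only on the primary key, any query that filters on another column needs a different table layout, which is why this method usually requires denormalization.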
Secondary indexes queries
• Inverted index
• Mitigates denormalization
• Queries may involve all C* nodes
• Queries limited to a single column
[Figure: the CLIENT queries three C* nodes, each with its own 2i local column family]
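A minimal Python sketch of the idea, assuming a toy `Node` class and a `query_all` coordinator function that are hypothetical and not part of Cassandra: each node keeps an inverted index over its own rows only, so a lookup by indexed value has to be sent to every node.

```python
from collections import defaultdict


class Node:
    """Toy C* node: its own rows plus a local inverted index on a single column."""

    def __init__(self, indexed_column):
        self.indexed_column = indexed_column
        self.rows = {}                   # primary key -> row
        self.index = defaultdict(set)    # column value -> primary keys on this node

    def insert(self, key, row):
        old = self.rows.get(key)
        if old is not None:
            # Drop the stale index entry when a row is overwritten.
            self.index[old[self.indexed_column]].discard(key)
        self.rows[key] = row
        self.index[row[self.indexed_column]].add(key)

    def search(self, value):
        return [self.rows[key] for key in sorted(self.index.get(value, ()))]


def query_all(nodes, value):
    """The coordinator cannot tell which nodes hold matches, so it asks each one."""
    results = []
    for node in nodes:
        results.extend(node.search(value))
    return results


nodes = [Node("username") for _ in range(3)]
nodes[0].insert(1, {"username": "apena", "message": "a"})
nodes[2].insert(2, {"username": "apena", "message": "b"})
nodes[1].insert(3, {"username": "dhiguero", "message": "c"})
matches = query_all(nodes, "apena")
```

The index maps one column value to row keys, which is why a single query can only use one indexed column.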
Token range queries
• Used by MapReduce frameworks such as Hadoop or Spark
• All kinds of queries are possible
• Low throughput
• Ad-hoc queries
• Batch processing
• Materialized views
[Figure: the CLIENT submits a job to the Spark master, which reads from three C* nodes; query = function (all data)]
Combining 2i with MapReduce
• Expressiveness avoiding full scans
• Still limited by one indexed column per query
[Figure: the CLIENT submits a job to the Spark master, which reads from three C* nodes through the secondary index of each node]
What do we miss from 2i indexes? 
MORE EXPRESSIVENESS
• Range queries 
• Multivariable search 
• Full text search 
• Sorting by fields 
• Top-k queries 
What do we like from the existing 2i? 
ITS ARCHITECTURE
• Each node indexes its own data 
• The index implementations do not need to be distributed 
• Natural extension point 
• Can be created after design and ingestion 
Thinking of a custom secondary index implementation…
WHY NOT USE LUCENE?
Why we like Lucene 
• Proven stable and fast indexing solution 
• Expressive queries 
- Multivariable, ranges, full text, sorting, top-k, etc. 
• Mature distributed search solutions built on top of it 
- Solr, ElasticSearch 
• Can be fully embedded in application code 
• Published under the Apache License 
HOW IT WORKS
Create index
• Can be built in the background at any moment
• Real time updates
• Mapping eases ETL
• Language aware
CREATE TABLE tweets (
    id bigint,
    createdAt timestamp,
    message text,
    userid bigint,
    username text,
    PRIMARY KEY (userid, createdAt, id) );
ALTER TABLE tweets ADD lucene TEXT;
CREATE CUSTOM INDEX tweets_idx ON twitter.tweets (lucene)
USING 'com.stratio.index.RowIndex'
WITH OPTIONS = {
    'refresh_seconds' : '60',
    'schema' : '{ default_analyzer : "EnglishAnalyzer",
        fields : {
            createdat : {type : "date", pattern : "yyyy-MM-dd"},
            message : {type : "text", analyzer : "EnglishAnalyzer"},
            userid : {type : "string"},
            username : {type : "string"}
        }} '};
Filtering query
SELECT * FROM tweets WHERE lucene = '{
    filter : {type : "match",
              field : "text",
              value : "cassandra"}
}' LIMIT 10;
[Figure: the CLIENT searches for 10 rows across three C* nodes, each with its own Lucene index. One node finds 6 rows and the next finds 4, at which point the search is done.]
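As I read this slide, a filtering query has no relevance order, so the search can stop as soon as LIMIT rows have been collected. The Python sketch below illustrates that reading; `filtered_search` and the per-node match lists are hypothetical stand-ins for asking each node's Lucene index, and the real coordinator logic in Cassandra may differ.

```python
def filtered_search(node_matches, limit):
    """Visit nodes in order and stop once `limit` rows have been gathered.

    `node_matches` holds the matching rows of each node (illustrative data).
    Returns the collected rows and the number of nodes that were contacted.
    """
    rows, contacted = [], 0
    for matches in node_matches:
        if len(rows) >= limit:
            break  # enough rows already: remaining nodes are never asked
        contacted += 1
        rows.extend(matches[: limit - len(rows)])
    return rows, contacted


# Counts mirror the slide: 6 matches on the first node, 4 on the second.
node_matches = [
    [f"n1-{i}" for i in range(6)],
    [f"n2-{i}" for i in range(4)],
    [f"n3-{i}" for i in range(7)],
]
rows, contacted = filtered_search(node_matches, limit=10)
```

With these counts the third node is never contacted, because the first two already supply the 10 requested rows.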
Top-k query
SELECT * FROM tweets WHERE lucene = '{
    query : {type : "match",
             field : "text",
             value : "cassandra"}
}' LIMIT 5;
[Figure: the CLIENT sends "Search top-5" to all three C* nodes, each with its own Lucene index. The nodes find 5, 4 and 5 rows, and the 14 candidates are merged into the best 5.]
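The merge step can be sketched in Python as follows. This is an illustration under assumptions, not the code in Stratio's fork: the `top_k` helper, the `(score, row id)` pairs and the scores themselves are made up, and only the candidate counts (5, 4 and 5) come from the slide.

```python
import heapq


def top_k(per_node_hits, k):
    """Merge each node's local top-k candidates into the global top-k by score."""
    candidates = [hit for hits in per_node_hits for hit in hits]
    return heapq.nlargest(k, candidates, key=lambda hit: hit[0])


# (score, row id) pairs; each list is one node's local top-5.
node1 = [(0.9, "a"), (0.7, "b"), (0.5, "c"), (0.3, "d"), (0.1, "e")]
node2 = [(0.8, "f"), (0.6, "g"), (0.4, "h"), (0.2, "i")]
node3 = [(0.95, "j"), (0.85, "k"), (0.75, "l"), (0.65, "m"), (0.55, "n")]

best = top_k([node1, node2, node3], k=5)
```

Every node has to be asked for its own top k, because in the worst case all k best rows live on a single node. Unlike the filtering case, the search cannot stop early.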
Modifying Cassandra for generic top-k queries 
Two new methods in SecondaryIndexSearcher: 
boolean requiresFullScan(List<IndexExpression> clause);
List<Row> sort(List<IndexExpression> clause, List<Row> rows);
Two new methods in AbstractRangeCommand:
boolean requiresFullScan();
List<Row> combine(List<Row> rows);
And some changes in StorageProxy#getRangeSlice…
Queries can be as complex as you want
SELECT * FROM tweets WHERE lucene = '{
    filter :
    {
        type : "boolean", must :
        [
            {type : "range", field : "time", lower : "2014/04/25"},
            {type : "boolean", should :
                [
                    {type : "prefix", field : "user", value : "a"},
                    {type : "wildcard", field : "user", value : "*b*"},
                    {type : "match", field : "user", value : "fast"}
                ]
            }
        ]
    },
    sort :
    {
        fields: [ {field : "time", reverse : true},
                  {field : "user", reverse : false} ]
    }
}' LIMIT 10000;
Some implementation details 
• A Lucene document per CQL row, and a Lucene field per indexed column 
• SortingMergePolicy keeps index sorted in the same way that C* does 
• Index commits synchronized with column family flushes 
• Segments merge synchronized with column family compactions 
NO MAINTENANCE REQUIRED 
QUERY BUILDER
FLUENT QUERY BUILDER 
Query builder 
• Facilitates writing index-related clauses 
• Compatible with the existing C* Query Builder 
Query builder example 
SELECT * FROM tweets WHERE lucene = '{
    query : {type : "range",
             field : "time",
             lower : "2014/04/25",
             upper : "2014/04/30"},
    filter : {type : "match",
              field : "text",
              value : "cassandra"},
    sort :
    {
        fields: [ {field : "time",
                   reverse : true} ]
    }
}' LIMIT 10;

String filter = SearchBuilders
    .filter(range("time")
        .lower("2014/04/25")
        .upper("2014/04/30"))
    .query(match("text", "cassandra"))
    .sort(sorting("time", true))
    .toJson();

QueryBuilder.select()
    .from(KEYSPACE, TABLE)
    .where(eq("lucene", filter))
    .limit(10);
LUCENE AND SPARK
Integrating Lucene & Spark 
Split friendly: it supports searches within a token range
SELECT * FROM tweets WHERE lucene = '{
    filter : {type : "match", field : "text", value : "cassandra"}
}'
AND TOKEN(userid) > 253653456456
AND TOKEN(userid) <= 3456467456756
LIMIT 10000;
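A hedged Python sketch of how a Spark-style job could use this: cut the token ring into contiguous ranges and issue one index query per range. The helper names `token_splits` and `split_query` are hypothetical, the bounds are those of Cassandra's Murmur3Partitioner, and the `TOKEN(...)` argument is written as the partition key `userid`, an assumption based on the table's PRIMARY KEY.

```python
MIN_TOKEN = -2**63       # Murmur3Partitioner token range
MAX_TOKEN = 2**63 - 1


def token_splits(num_splits):
    """Cut the token ring into `num_splits` contiguous (start, end] ranges."""
    width = (MAX_TOKEN - MIN_TOKEN) // num_splits
    bounds = [MIN_TOKEN + i * width for i in range(num_splits)] + [MAX_TOKEN]
    return list(zip(bounds[:-1], bounds[1:]))


def split_query(lucene_json, start, end, limit=10000):
    """Build the CQL text for one split; each split can run as a separate task."""
    return (
        f"SELECT * FROM tweets WHERE lucene = '{lucene_json}' "
        f"AND TOKEN(userid) > {start} AND TOKEN(userid) <= {end} LIMIT {limit};"
    )


search = '{filter : {type : "match", field : "text", value : "cassandra"}}'
splits = token_splits(4)
queries = [split_query(search, start, end) for start, end in splits]
```

Because the ranges are contiguous and do not overlap, the union of the per-split results covers the same rows as one query over the whole ring, while each task only touches its own slice.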
Integrating Lucene & Spark 
Paging friendly: it supports starting queries at a certain point
SELECT * FROM tweets WHERE lucene = '{
    filter : {type : "match", field : "text", value : "cassandra"}
}'
AND userid = 3543534
AND createdAt > '2011-02-03 04:05+0000'
LIMIT 5000;
Integrating Lucene & Spark
• Compute large amounts of data
• Avoid systematic full scan
• Reduces the amount of data to be processed
• Filtering push-down
[Figure: the CLIENT submits a job to the Spark master, which reads from three C* nodes through the Lucene index of each node]
WHEN TO USE INDEXES AND WHEN TO USE FULL SCAN
Index performance in Spark
[Figure: Time versus Records returned, comparing Full scan with Lucene 2i]
Lucene indexes in C* DEMO
OTHER TOOLS
Stratio Deep 
INTEGRATING SPARK WITH DIFFERENT DATASTORES 
• Common Cell abstraction in the RDD 
• Maintain compatibility with Spark operations 
• Compatible with multiple datastore technologies 
• DeepSparkContext 
• DeepJobConfig 
• Compatible with Lucene indexes 
Stratio Crossdata 
UNIFYING BATCH AND STREAMING QUERIES 
• Single SQL-like language 
• Compatible with multiple datastore technologies 
• Connector-based architecture 
• Ability to combine data from different datastores
• Complements non-native operations with Spark
• E.g., JOIN in Cassandra
• Custom support for Lucene-based secondary indexes
Stratio Crossdata
CREATING INDEXES
CREATE FULLTEXT INDEX ON app.users(name, email);
QUERYING THE INDEXES
SELECT * FROM app.users
WHERE email MATCH '*@stratio.com';
Conclusions 
• Added new query methods 
- Multivariable queries (AND, OR, NOT) 
- Range queries (>, >=, <, <=) and regular expressions 
- Full text queries (match, phrase, fuzzy...) 
• Top-k query support 
- Lucene scoring formula 
- Sort by field values 
• Compatible with MapReduce frameworks 
• Preserves Cassandra’s functionality 
It's open source
github.com/stratio/stratio-cassandra
• Published as a fork of Apache Cassandra
• Apache License Version 2.0
stratio.github.io/crossdata
• Apache License Version 2.0

 

Último (20)

SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 

Advanced search and Top-k queries in Cassandra - Cassandra Summit Europe 2014

  • 1. Advanced search and Top-K queries in Cassandra • Daniel Higuero, dhiguero@stratio.com, @dhiguero • Andrés de la Peña, andres@stratio.com, @a_de_la_pena
  • 2. Who are we? • Stratio is a Big Data Company • Founded in 2013 • Commercially launched in 2014 • 70+ employees in Madrid • Office in San Francisco • Certified Spark distribution #CassandraSummit 2014
  • 3. CONTENTS • 1. Cassandra query methods • 2. Stratio Lucene-based 2i implementation • 3. Integrating Lucene 2i with Apache Spark
  • 4. Cassandra query methods [Chart: primary key, secondary indexes, and token ranges positioned by throughput and expressiveness]
  • 5. Primary key queries • O(1) node lookup for partition key • Range slices for clustering key • Usually requires denormalization [Diagram: a client looks up a partition key (aagea, apena, dhiguero) on one of three nodes and reads a clustering key range of dated columns, from 2014-04-06:body to 2014-04-11:body]
  • 6. Cassandra query methods [Chart: primary key, secondary indexes, and token ranges positioned by throughput and expressiveness]
  • 7. Secondary index queries • Inverted index • Mitigates denormalization • Queries may involve all C* nodes • Queries limited to a single column [Diagram: a client queries three C* nodes, each with its own local 2i column family]
  • 8. Cassandra query methods [Chart: primary key, secondary indexes, and token ranges positioned by throughput and expressiveness]
  • 9. Token range queries • Used by MapReduce frameworks such as Hadoop or Spark • All kinds of queries are possible • Low throughput • Ad-hoc queries • Batch processing • Materialized views • query = function(all data) [Diagram: a client, a Spark master, and three C* nodes]
  • 10. Combining 2i with MapReduce • Expressiveness without full scans • Still limited to one indexed column per query [Diagram: a client, a Spark master, and three C* nodes, each with a secondary index]
  • 11. What do we miss from 2i indexes? MORE EXPRESSIVENESS • Range queries • Multivariable search • Full text search • Sorting by fields • Top-k queries
  • 12. What do we like about the existing 2i? ITS ARCHITECTURE • Each node indexes its own data • The index implementations do not need to be distributed • Natural extension point • Can be created after design and ingestion
  • 13. Thinking about a custom secondary index implementation… WHY NOT USE [Lucene logo]?
  • 14. Why we like Lucene • Proven stable and fast indexing solution • Expressive queries: multivariable, ranges, full text, sorting, top-k, etc. • Mature distributed search solutions built on top of it: Solr, ElasticSearch • Can be fully embedded in application code • Published under the Apache License
  • 16. Create index • Built in the background at any moment • Real-time updates • Mapping eases ETL • Language aware
      CREATE TABLE tweets (
          id bigint,
          createdAt timestamp,
          message text,
          userid bigint,
          username text,
          PRIMARY KEY (userid, createdAt, id)
      );
      ALTER TABLE tweets ADD lucene TEXT;
      CREATE CUSTOM INDEX tweets_idx ON twitter.tweets (lucene)
      USING 'com.stratio.index.RowIndex'
      WITH OPTIONS = {
          'refresh_seconds' : '60',
          'schema' : '{
              default_analyzer : "EnglishAnalyzer",
              fields : {
                  createdat : {type : "date", pattern : "yyyy-MM-dd"},
                  message : {type : "text", analyzer : "EnglishAnalyzer"},
                  userid : {type : "string"},
                  username : {type : "string"}
              }
          }'
      };
  • 17. Filtering query
      SELECT * FROM tweets WHERE lucene = '{
          filter : {type : "match", field : "text", value : "cassandra"}
      }' LIMIT 10;
      [Diagram: the client searches for 10 rows across C* nodes, each with its own Lucene index; one node finds 6, another finds 4, and the query is done.]
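The "search 10, found 6, found 4, done" flow on this slide can be sketched in Java as follows. This is only an illustration under stated assumptions: `Node`, `nodeWith`, and `collect` are hypothetical names, rows are plain strings, and the sketch assumes each node is asked only for the rows still missing, which the slide does not spell out.

```java
import java.util.ArrayList;
import java.util.List;

final class FilteringQuerySketch {
    /** Hypothetical stand-in for a C* node with a local Lucene index. */
    interface Node {
        /** Returns at most {@code limit} rows matching the filter. */
        List<String> search(int limit);
    }

    /** Builds a fake node that holds a fixed list of matching rows. */
    static Node nodeWith(List<String> matches) {
        return limit -> new ArrayList<>(
                matches.subList(0, Math.min(Math.max(limit, 0), matches.size())));
    }

    /** Visits nodes in order and stops as soon as {@code limit} rows are collected. */
    static List<String> collect(List<Node> nodes, int limit) {
        List<String> rows = new ArrayList<>();
        for (Node node : nodes) {
            int missing = limit - rows.size();
            if (missing <= 0) {
                break; // enough rows: remaining nodes are never contacted
            }
            List<String> found = node.search(missing);
            // Defensive cap in case a node returns more than requested.
            rows.addAll(found.subList(0, Math.min(found.size(), missing)));
        }
        return rows;
    }
}
```

Because a filter has no ranking, any 10 matching rows are a valid answer, which is why the coordinator can stop early instead of contacting every node.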
  • 18. Top-k query
      SELECT * FROM tweets WHERE lucene = '{
          query : {type : "match", field : "text", value : "cassandra"}
      }' LIMIT 5;
      [Diagram: the client sends "search top-5" to three C* nodes, each with its own Lucene index; they find 5, 4, and 5 rows, and the 14 candidates are merged to the best 5.]
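The "merge 14 to best 5" step can be illustrated with a small Java sketch. It is not the fork's actual code: `ScoredRow` and `merge` are hypothetical names, and the sketch assumes every node returns a relevance score that is directly comparable across nodes.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class TopKMergeSketch {
    /** Hypothetical row carrying the relevance score assigned by a node's index. */
    static final class ScoredRow {
        final String id;
        final double score;

        ScoredRow(String id, double score) {
            this.id = id;
            this.score = score;
        }
    }

    /** Concatenates the per-node top-k lists and keeps the k rows with the highest score. */
    static List<ScoredRow> merge(List<List<ScoredRow>> perNode, int k) {
        List<ScoredRow> all = new ArrayList<>();
        for (List<ScoredRow> rows : perNode) {
            all.addAll(rows);
        }
        // Highest score first; List.sort is stable, so ties keep node order.
        all.sort(Comparator.comparingDouble((ScoredRow r) -> r.score).reversed());
        return new ArrayList<>(all.subList(0, Math.min(k, all.size())));
    }
}
```

Unlike a filtering query, a top-k query has to hear from every node before answering, because the best rows overall may all live on the last node consulted.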
  • 19. Modifying Cassandra for generic top-k queries
      Two new methods in SecondaryIndexSearcher:
          boolean requiresFullScan(List<IndexExpression> clause);
          List<Row> sort(List<IndexExpression> clause, List<Row> rows);
      Two new methods in AbstractRangeCommand:
          boolean requiresFullScan();
          List<Row> combine(List<Row> rows);
      And some changes in StorageProxy#getRangeSlice…
  • 20. Queries can be as complex as you want
      SELECT * FROM tweets WHERE lucene = '{
          filter : {
              type : "boolean",
              must : [
                  {type : "range", field : "time", lower : "2014/04/25"},
                  {type : "boolean", should : [
                      {type : "prefix", field : "user", value : "a"},
                      {type : "wildcard", field : "user", value : "*b*"},
                      {type : "match", field : "user", value : "fast"}
                  ]}
              ]
          },
          sort : {
              fields : [
                  {field : "time", reverse : true},
                  {field : "user", reverse : false}
              ]
          }
      }' LIMIT 10000;
  • 21. Some implementation details • A Lucene document per CQL row, and a Lucene field per indexed column • SortingMergePolicy keeps index sorted in the same way that C* does • Index commits synchronized with column family flushes • Segment merges synchronized with column family compactions • NO MAINTENANCE REQUIRED
  • 23. Query builder: FLUENT QUERY BUILDER • Facilitates writing index-related clauses • Compatible with the existing C* Query Builder
  • 24. Query builder example
      SELECT * FROM tweets WHERE lucene = '{
          query : {type : "range", field : "time", lower : "2014/04/25", upper : "2014/04/30"},
          filter : {type : "match", field : "text", value : "cassandra"},
          sort : { fields : [ {field : "time", reverse : true} ] }
      }' LIMIT 10;

      String filter = SearchBuilders
          .filter(range("time")
                      .lower("2014/04/25")
                      .upper("2014/04/30"))
          .query(match("text", "cassandra"))
          .sort(sorting("time", true))
          .toJson();

      QueryBuilder.select()
          .from(KEYSPACE, TABLE)
          .where(eq("lucene", filter))
          .limit(10);
  • 26. Integrating Lucene & Spark • Split friendly: it supports searches within a token range
      SELECT * FROM tweets WHERE lucene = '{
          filter : {type : "match", field : "text", value : "cassandra"}
      }'
      AND TOKEN(userid) > 253653456456
      AND TOKEN(userid) <= 3456467456756
      LIMIT 10000;
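To illustrate how a Spark-style job might carve a table into such token-range queries, here is a minimal Java sketch. It makes several assumptions that the slides do not state: the cluster uses Murmur3Partitioner (tokens are signed 64-bit values), the partition key is `userid` alone as in the CREATE TABLE on the index-creation slide, and splits are equal-width rather than taken from the cluster's actual ring. `TokenSplitSketch`, `splits`, and `cql` are hypothetical names.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

final class TokenSplitSketch {
    // Assumption: Murmur3Partitioner, whose tokens span the signed 64-bit range.
    static final BigInteger MIN = BigInteger.valueOf(Long.MIN_VALUE);
    static final BigInteger MAX = BigInteger.valueOf(Long.MAX_VALUE);

    /** Returns n contiguous {start, end} pairs, read as (start, end], from MIN to MAX. */
    static List<long[]> splits(int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        // BigInteger avoids the long overflow of MAX - MIN.
        BigInteger width = MAX.subtract(MIN);
        List<long[]> result = new ArrayList<>();
        BigInteger start = MIN;
        for (int i = 1; i <= n; i++) {
            BigInteger end = (i == n)
                    ? MAX
                    : MIN.add(width.multiply(BigInteger.valueOf(i)).divide(BigInteger.valueOf(n)));
            result.add(new long[] {start.longValueExact(), end.longValueExact()});
            start = end;
        }
        return result;
    }

    /** Builds the CQL for one split; no escaping is done, so this is for illustration only. */
    static String cql(String luceneJson, long start, long end) {
        return "SELECT * FROM tweets WHERE lucene = '" + luceneJson + "'"
                + " AND token(userid) > " + start
                + " AND token(userid) <= " + end + ";";
    }
}
```

Each split can then be handed to a separate worker, so the index filter runs on every node while the token bounds keep the splits from overlapping.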
  • 27. Integrating Lucene & Spark • Paging friendly: it supports starting queries at a certain point
      SELECT * FROM tweets WHERE lucene = '{
          filter : {type : "match", field : "text", value : "cassandra"}
      }'
      AND userid = 3543534
      AND createdAt > '2011-02-03 04:05+0000'
      LIMIT 5000;
  • 28. Integrating Lucene & Spark • Computes large amounts of data • Avoids systematic full scans • Reduces the amount of data to be processed • Filtering push-down [Diagram: a client, a Spark master, and three C* nodes, each with a Lucene index]
  • 29. WHEN TO USE INDEXES AND WHEN TO USE FULL SCAN
  • 30. Index performance in Spark [Chart: time versus records returned, comparing full scan with Lucene 2i]
  • 31. Lucene indexes in C* DEMO
  • 33. Stratio Deep: INTEGRATING SPARK WITH DIFFERENT DATASTORES • Common Cell abstraction in the RDD • Maintains compatibility with Spark operations • Compatible with multiple datastore technologies • DeepSparkContext • DeepJobConfig • Compatible with Lucene indexes
  • 34. Stratio Crossdata: UNIFYING BATCH AND STREAMING QUERIES • Single SQL-like language • Compatible with multiple datastore technologies • Connector-based architecture • Ability to combine data from different datastores • Complements non-native operations with Spark, e.g., JOIN in Cassandra • Custom support for Lucene-based secondary indexes
  • 35. Stratio Crossdata
      CREATING INDEXES
          CREATE FULLTEXT INDEX ON app.users(name, email);
      QUERYING THE INDEXES
          SELECT * FROM app.users WHERE email MATCH '*@stratio.com';
  • 36. Conclusions • Added new query methods: multivariable queries (AND, OR, NOT); range queries (>, >=, <, <=) and regular expressions; full text queries (match, phrase, fuzzy...) • Top-k query support: Lucene scoring formula; sort by field values • Compatible with MapReduce frameworks • Preserves Cassandra's functionality
  • 37. It's open source • github.com/stratio/stratio-cassandra: published as a fork of Apache Cassandra, Apache License Version 2.0 • stratio.github.io/crossdata: Apache License Version 2.0
  • 38. Advanced search and Top-K queries in Cassandra • Daniel Higuero, dhiguero@stratio.com, @dhiguero • Andrés de la Peña, andres@stratio.com, @a_de_la_pena