About me
• Kevin Risden
• Apache Lucene/Solr Committer
• YCSB Contributor
• Consultant with Avalon Consulting, LLC
• ~4 years working with Hadoop and Search
• Contributed patches to Ambari, HBase, Knox, Solr, Storm
• Installation, security, performance tuning, development, administration
Background - What is JDBC?
"The JDBC API is a Java API that can access any kind of tabular data, especially data stored in a Relational Database."
Source: https://docs.oracle.com/javase/tutorial/jdbc/overview/
JDBC drivers convert SQL into a backend query.
Background - Why should you care about Solr JDBC?
• SQL skills are widespread.
• JDBC drivers exist for most relational databases.
• Existing reporting tools work with JDBC/ODBC drivers.
Solr 6 works with SQL and existing JDBC tools!
Use Case – Analytics – Utility Rates
Data set: 2011 Utility Rates
Questions:
• How many utility companies serve the state of Maryland?
• Which Maryland utility has the cheapest residential rates?
• What are the minimum and maximum residential power rates excluding missing data elements?
• What is the state and zip code with the highest residential rate?
How could you answer those questions with Solr?
• Facets
• Filter Queries
• Filters
• Grouping
• Sorting
• Stats
• Chaining queries together
Inspired By: http://blog.cloudera.com/blog/2015/10/how-to-use-apache-solr-to-query-indexed-data-for-analytics/
Use Case – Analytics – Utility Rates
Method: Lucene syntax
Questions:
• How many utility companies serve the state of Maryland?
http://solr:8983/solr/rates/select?q=state%3A%22MD%22&wt=json&indent=true&group=true&group.field=utility_name&rows=10&group.limit=1
• Which Maryland utility has the cheapest residential rates?
http://solr:8983/solr/rates/select?q=state%3A%22MD%22&wt=json&indent=true&group=true&group.field=utility_name&rows=1&group.limit=1&sort=res_rate+asc
• What are the minimum and maximum residential power rates excluding missing data elements?
http://solr:8983/solr/rates/select?q=*:*&fq=%7b!frange+l%3D0.0+incl%3Dfalse%7dres_rate&wt=json&indent=true&rows=0&stats=true&stats.field=res_rate
• What is the state and zip code with the highest residential rate?
http://solr:8983/solr/rates/select?q=res_rate:0.849872773537&wt=json&indent=true&rows=1
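The URLs above carry hand-encoded query parameters. As a rough sketch of how such a URL could be assembled from plain values, the Java snippet below percent-encodes each parameter. The class name SolrSelectUrl and the helper build are hypothetical; the host solr:8983 and the rates collection come from the slide. Note that URLEncoder emits uppercase hex digits and also encodes characters such as "!", so its output can differ cosmetically from the URLs shown while meaning the same thing.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, not part of Solr: builds a /select URL from plain parameters.
class SolrSelectUrl {

    // Percent-encodes every key and value and joins them with '&'.
    static String build(String baseUrl, Map<String, String> params) {
        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> e : params.entrySet()) {
            query.add(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                    + "="
                    + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return baseUrl + "?" + query;
    }

    public static void main(String[] args) {
        // Parameters of the first question: utilities serving Maryland, grouped by name.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", "state:\"MD\"");
        params.put("wt", "json");
        params.put("group", "true");
        params.put("group.field", "utility_name");
        params.put("rows", "10");
        params.put("group.limit", "1");
        System.out.println(build("http://solr:8983/solr/rates/select", params));
    }
}
```

Even with a helper like this, each question still needs its own combination of grouping, filter, and stats parameters.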
Is there a better way?
Solr JDBC
Highlights
• JDBC Driver for Solr
• Powered by Streaming Expressions and Parallel SQL
  • Thursday - Parallel SQL and Analytics with Solr – Yonik Seeley
  • Thursday - Creating New Streaming Expressions – Dennis Gove
• Integrates with any* JDBC client
* Tested with the JDBC clients in this presentation
Usage
jdbc:solr://SOLR_ZK_CONNECTION_STRING?collection=COLLECTION_NAME
Apache Solr Reference Guide - Parallel SQL Interface
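A minimal sketch of the connection string in use from plain Java, using only the standard java.sql API. It assumes the Solr JDBC driver (shipped in the solr-solrj jar) is on the classpath and a SolrCloud cluster is reachable; the ZooKeeper address localhost:9983 and the class and method names are hypothetical, while the rates collection and the query are taken from the utility rates use case.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical example class; only the URL format comes from the slide.
class SolrJdbcExample {

    // Builds jdbc:solr://SOLR_ZK_CONNECTION_STRING?collection=COLLECTION_NAME
    static String jdbcUrl(String zkConnection, String collection) {
        return "jdbc:solr://" + zkConnection + "?collection=" + collection;
    }

    // Prints each distinct Maryland utility and returns how many were found.
    // Needs the Solr JDBC driver on the classpath and a running SolrCloud.
    static int printMarylandUtilities(String url) throws SQLException {
        String sql = "select distinct utility_name from rates where state='MD'";
        int count = 0;
        try (Connection con = DriverManager.getConnection(url);
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) {
                System.out.println(rs.getString("utility_name"));
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws SQLException {
        // Assumed ZooKeeper address; replace with your own connection string.
        String url = jdbcUrl("localhost:9983", "rates");
        System.out.println(url);
        // Only attempt a connection when explicitly requested.
        if (args.length > 0 && args[0].equals("--run")) {
            System.out.println(printMarylandUtilities(url) + " utilities");
        }
    }
}
```

Because this is ordinary JDBC, the same URL can be pasted into any JDBC client that lets you register a custom driver.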
Use Case – Analytics – Utility Rates
Method: SQL
Questions:
• How many utility companies serve the state of Maryland?
select distinct utility_name from rates where state='MD';
• Which Maryland utility has the cheapest residential rates?
select utility_name,min(res_rate) from rates where state='MD' group by utility_name order by min(res_rate) asc limit 1;
• What are the minimum and maximum residential power rates excluding missing data elements?
select min(res_rate),max(res_rate) from rates where not res_rate = 0;
• What is the state and zip code with the highest residential rate?
select state,zip,max(res_rate) from rates group by state,zip order by max(res_rate) desc limit 1;
How should you answer those questions with Solr? – Using SQL!
Future Development/Improvements
• Replace Presto with Apache Calcite - SOLR-8593
• Improve SQL compatibility
• Ability to specify optimization rules (push downs, joins, etc)
• Potentially use Avatica JDBC/ODBC drivers
• Streaming Expressions/Parallel SQL improvements - SOLR-8125
• JDBC driver improvements - SOLR-8659
Info on how to get involved
Future Development/Improvements
SQL Join
SELECT
  movie_title, character_name, line
FROM
  movie_dialogs_movie_titles_metadata a
JOIN
  movie_dialogs_movie_lines b
ON
  a.movieID = b.movieID;
Streaming Expression Join
select(
  innerJoin(
    search(movie_dialogs_movie_titles_metadata,
      q=*:*,
      fl="movieID,movie_title",
      sort="movieID asc"),
    search(movie_dialogs_movie_lines,
      q=*:*,
      fl="movieID,character_name,line",
      sort="movieID asc"),
    on="movieID"
  ),
  movie_title,character_name,line
)
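Both search streams in the expression are sorted by movieID, which is what lets an inner join walk the two streams in step. The Java sketch below illustrates that idea on in-memory lists. It is not Solr's implementation: the class, record, and method names are hypothetical, and it assumes each movieID appears at most once on the titles side.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative merge join over two lists already sorted by movieID.
class SortedMergeJoin {
    record Title(String movieID, String movieTitle) {}
    record Line(String movieID, String characterName, String line) {}
    record Joined(String movieTitle, String characterName, String line) {}

    // Assumes both inputs are sorted ascending by movieID and that
    // movieID is unique within titles.
    static List<Joined> innerJoin(List<Title> titles, List<Line> lines) {
        List<Joined> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < titles.size() && j < lines.size()) {
            Title t = titles.get(i);
            Line l = lines.get(j);
            int cmp = t.movieID().compareTo(l.movieID());
            if (cmp < 0) {
                i++;            // title has no (more) lines, advance titles
            } else if (cmp > 0) {
                j++;            // line has no matching title, advance lines
            } else {
                out.add(new Joined(t.movieTitle(), l.characterName(), l.line()));
                j++;            // stay on this title, more lines may match
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample rows, for illustration only.
        List<Title> titles = List.of(new Title("m0", "Movie A"), new Title("m1", "Movie B"));
        List<Line> lines = List.of(
                new Line("m0", "ALICE", "Hello."),
                new Line("m0", "BOB", "Hi."),
                new Line("m1", "CAROL", "Goodbye."));
        innerJoin(titles, lines).forEach(System.out::println);
    }
}
```

Each input is read once from start to finish, so neither side has to be held fully in memory when the data arrives as sorted streams.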