http://tripadvisor.com/careers 1
Stephen R. Scaffidi
sscaffidi@tripadvisor.com
Hadoop Summit
San Jose 2013
Simplifying the use of Hive
with the Hive Query Tool
Talk Outline
• Introduction
• What is the Hive Query Tool (HQT)?
• Why did we build it?
• How it’s being used today
• Design & system requirements
• HQT Query Templates
• Getting the source, building, and running
• Future plans & possibilities
Introduction
About Me
• Sr. Software Engineer at TripAdvisor
• Data Warehouse Engineering Group
• Mildly obsessed with making things “Just Work”
• OK, more than mildly...
• Varied background, from PC Tech to Email
Admin to Telco NMS to Lisp Hacker, etc...
• Thrives on making computers do the work
• No Hadoop experience before joining
About my team
• Data Warehouse Engineering
• Small, focused, tenacious group
• Varied skills and backgrounds
• We keep the elephants fed and healthy
• We help others in the company make use of
the facilities provided on the clusters
• DevOps in every sense of the term
About TripAdvisor
• Awesome place to work
• It really feels like we’re one big team
• Always new challenges and things to learn
• Smart, driven and genuinely *nice* people
• Offices around the world
• Great benefits
• We’re hiring!
What is the Hive Query Tool?
A simple web interface for running reports on Hive
• Our Specific Goals/Needs:
• Easy to use for non-technical people
• More flexible query customization than simple
variable interpolation
• Relatively easy installation and administration
• Allow jobs to run with different scheduler queues
and users
• Performance equal to or better than plain Hive
Easy for non-technical end-users
• Intended for use by non-technical people:
• Sales, Marketing, Customer-Relations, etc.
• People who don’t know anything about Hadoop
or Hive (and shouldn’t need to)
• People who don’t live in a *nix shell
• No need to even know anything about SQL!
Flexible Query Customization
• Other solutions we looked at were too limited
• We needed to give the users something more
powerful than simple variable substitution.
• HQT’s template system can generate and
insert arbitrary HQL clauses into a query based
on a user’s input to a simple web interface.
Easy Install and Administration
• If we were going to build our own, we didn’t
want maintenance to be *another* full-time job
• Internal adoption by other engineers was
important
• Java hackers don’t want to deal with a 23.5-step
install and configure process
• Especially if it’s not written in Java
• Check out the source, run the setup script, edit
a single config file, and run the startup scripts.
Run jobs with different Users and Queues
• Face it, the Hive Thrift Server is horrible
• Most other user-friendly Hive front-ends use it
• So they have all its limitations
• And its bugs
• The HQT simply spawns a Hive CLI for each
job, using sudo to change users when
necessary.
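The slides do not show the exact command line the HQT uses, so the following is only a minimal sketch of the idea, written in Python rather than the tool's own Perl. It assumes the job's HQL has been written to a file, that the scheduler queue is chosen through the `mapred.job.queue.name` property, and that sudo is configured to allow the switch without a password prompt. The function name and the example user and queue names are hypothetical.

```python
from typing import List, Optional


def build_hive_command(
    hql_path: str,
    run_as: Optional[str] = None,
    queue: Optional[str] = None,
) -> List[str]:
    """Build the argument list for one Hive CLI job (illustrative only)."""
    cmd = ["hive"]
    if queue:
        # Assumed property name for selecting the scheduler queue.
        cmd += ["--hiveconf", f"mapred.job.queue.name={queue}"]
    # -f tells the Hive CLI to run the statements in a file.
    cmd += ["-f", hql_path]
    if run_as:
        # -n: fail instead of prompting for a password; "--" ends sudo's options.
        cmd = ["sudo", "-n", "-u", run_as, "--"] + cmd
    return cmd


if __name__ == "__main__":
    # Hypothetical user and queue names.
    print(build_hive_command("/tmp/report.hql", run_as="reports", queue="adhoc"))
```

Passing the command as a list, rather than a shell string, keeps user-supplied values from being interpreted by a shell.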
Performance?
• Some options we looked at before building the
HQT did a whole lot more
• Some claimed to be faster than Hive.
• Some of these options had so much overhead
that they were slower than using Hive directly!
• The HQT simply runs HQL code through the
standard Hive CLI. No added overhead, no difference
in performance from plain-vanilla Hive.
Why did we need this?
Making the data accessible
• The data we pump into our Hadoop clusters is
full of information valuable to our business
• And more is fed into our Hive tables every day
• And more people need access to that data
every day
• But not all of those people are 733t h4(k3r
engineers 😉
Making the data accessible
• The target users may not know Linux and Java
and SQL...
• But they do know how they want the data
filtered and correlated and aggregated.
• We needed a way to let them run queries
where they could choose these parameters
with a high degree of flexibility...
• But without having to teach them all HQL
Looked at what was available...
• Nothing else we looked at seemed to satisfy all
our requirements.
• Some that looked interesting unfortunately
had terrible performance because they did not use
Hive directly.
• Not that everything we looked at was terrible –
some solutions were really quite impressive.
• But it came down to a classic question in
tech-oriented businesses...
The bottom line
• We knew what we wanted
• We knew what we wanted wasn’t particularly
complex
• We asked ourselves if we could just build
something that gives us exactly what we need
• And would that effort cost less than trying to
make something else work the way we
wanted?
• A “Eureka!” moment and a rough prototype
answered the question 😉
HQT Use at TripAdvisor
A surprise hit
• Some interested people tried the prototype
• Liked how it worked, requested more features
• Other groups became interested
• Even committed engineering resources to help
get it to “beta”
• It’s now being used across the company
• New report templates constantly being added
• (sorry, those aren’t available publicly)
Company-wide adoption
• End users find it easy to use and relatively
convenient.
• Template authors have found it easy to create
and modify report templates.
• Users include people in Sales, Marketing,
Commerce, and even Legal!
• Weekly peak usage at over 40 simultaneous
Hive jobs – on a single server.
(we’ve actually had to add throttling to keep HQT
jobs from using too many mapred slots)
HQT Design
Architecture: Front-End
• Web interface
• Handles user authentication
• Processes HQT Templates to determine...
• What options/input elements to present the user
• How to process and validate input from the user
• What HQL to send to the back-end
• Gets job progress and status info from the
back-end
• Doesn’t do much else
Architecture: Back-End
• Presents a “json/rest-like” interface over HTTP
to receive requests from the front-end
• Uses an event-loop instead of threads
• Spawns Hive CLI instances to run submitted
HQL
• Tracks and parses output from each instance
• Watches CLI instances for progress and errors
• Processes results for retrieval by users
• Sends email notifications
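The real back-end is built on Perl's AnyEvent and its code is not shown in the slides. As a rough Python analogue of the same shape, the sketch below uses `asyncio` to spawn a process without threads, read its output line by line, and pick out progress updates, with a semaphore standing in for the throttling mentioned earlier. The progress line format (`Stage-1 map = 50%,  reduce = 0%`) is an assumption about what the Hive CLI prints for MapReduce jobs, and all function names are hypothetical.

```python
import asyncio
import re
import sys
from typing import List, Optional, Tuple

# Assumed shape of a Hive CLI progress line.
PROGRESS_RE = re.compile(r"(Stage-\d+)\s+map\s*=\s*(\d+)%,\s*reduce\s*=\s*(\d+)%")

Progress = Tuple[str, int, int]


def parse_progress(line: str) -> Optional[Progress]:
    """Return (stage, map_pct, reduce_pct), or None for non-progress lines."""
    match = PROGRESS_RE.search(line)
    if match is None:
        return None
    return match.group(1), int(match.group(2)), int(match.group(3))


async def run_job(cmd: List[str], slots: asyncio.Semaphore) -> Tuple[int, List[Progress]]:
    """Run one job once a slot is free and collect its progress updates."""
    updates: List[Progress] = []
    async with slots:  # throttle: wait here while too many jobs are running
        proc = await asyncio.create_subprocess_exec(
            *cmd,
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.STDOUT,  # read both streams as one
        )
        assert proc.stdout is not None
        async for raw in proc.stdout:
            update = parse_progress(raw.decode(errors="replace"))
            if update is not None:
                updates.append(update)
        returncode = await proc.wait()
    return returncode, updates


async def demo() -> None:
    slots = asyncio.Semaphore(2)  # at most two jobs at a time
    # Stand-in for a Hive CLI process: a child Python that prints one line.
    fake = [sys.executable, "-c", "print('Stage-1 map = 50%,  reduce = 0%')"]
    for result in await asyncio.gather(*(run_job(fake, slots) for _ in range(3))):
        print(result)


if __name__ == "__main__":
    asyncio.run(demo())
```

A single event loop can watch many child processes this way, which is one plausible reason for preferring it over a thread per job.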
Template System
• The “special sauce” of the HQT
• The template “language” is designed so that
“directives” concisely express a whole lot:
• What input to gather (and optionally what kind)
• How to validate that input
• What output to generate and how to format it
• It’s a little tricky to explain
• But extremely flexible
• More details shortly...
Language & Frameworks
• Written in Perl
• Uses lots of components from the CPAN
• Front-end web framework is Mojolicious
• Template System uses Text::Template
• Back-end uses AnyEvent
• Most classes built using Moo
• Decent example of “Modern Perl”, but is still a
work-in-progress.
System Requirements
• Requires Perl 5.10.1 or newer
• Hadoop & Hive clients & libs should already be
installed and configured
• Does *not* require root access
• LDAP & sudo should be configured if you want
to run jobs as different users.
• Web-server is built-in, but can run under just
about any setup you want
Current State
• The front-end code is rather nice
• MVC-style web app code
• Uses Mojolicious .epl templates for web content,
which are very similar to .erb
• Back-end code is kind of hairy
• AnyEvent is fairly low-level
• REST/JSON stuff is too mixed in with the code that
wraps the Hive CLI processes.
• It shouldn’t be responsible for sending email!
Current State, contd.
• Template-system code:
• Fairly simple code, but allows for a lot of
interesting functionality.
• Other engineers seem to think it’s fine...
• But I think it needs refactoring
∙ Too much “action at a distance”
∙ Template evaluation is a big security risk
∙ Should use OO instead of ad-hoc data structures
∙ Etc...
The HQL Template System
• Template code blocks are embedded into
otherwise normal HQL:
    {{ begin_main_select }}
    SELECT foo, bar FROM baz
    WHERE ds={{
      insert_var
        date => { type => 'date', default => days_ago_ymd(3) }
    }}
    {{
      append_where {
        columns => { wibble => 'string', wobble => 'int' }
      }
    }}
• Template functions/“directives” simultaneously
define...
• What input options to present the user
• Input validation
• What to insert into the HQL based on the input
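To make the "one directive, three jobs" idea concrete, here is a minimal Python sketch (the HQT itself does this in Perl, and its internal data structures are not shown in the slides). A single hypothetical `Directive` object carries what the form needs, how to validate the input, and how to turn it into HQL. The class, the function names, and the `YYYY-MM-DD` date format are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Dict


@dataclass(frozen=True)
class Directive:
    name: str                       # form field name
    input_type: str                 # which kind of input control to present
    validate: Callable[[str], str]  # returns cleaned input or raises ValueError
    to_hql: Callable[[str], str]    # turns cleaned input into an HQL fragment


def _clean_date(raw: str) -> str:
    # strptime raises ValueError on anything that is not a real date;
    # re-serializing means only a normalized value ever reaches the query.
    return datetime.strptime(raw.strip(), "%Y-%m-%d").date().isoformat()


def date_var(name: str) -> Directive:
    """A date-valued variable: one definition drives form, validation, output."""
    return Directive(name, "date", _clean_date, lambda value: f"'{value}'")


def form_field(directive: Directive) -> Dict[str, str]:
    """What the front-end would need in order to draw the input element."""
    return {"name": directive.name, "type": directive.input_type}


def render_fragment(directive: Directive, raw_input: str) -> str:
    """Validate the user's input, then produce the HQL to insert."""
    return directive.to_hql(directive.validate(raw_input))


if __name__ == "__main__":
    ds = date_var("date")
    print(form_field(ds))
    print(render_fragment(ds, "2013-06-26"))
```

Keeping the three concerns in one place means a template author declares a parameter once, and the form and the query cannot drift apart.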
So, the template above renders a web form which, when filled out by the user, generates the final HQL.
[Screenshots: the rendered form, the form filled out, and the generated HQL]
Template Engine
• Didn’t build anything new, just used the
existing Text::Template module in a clever
way
• Template blocks are just Perl code, evaluated
in a specified package/namespace.
• Used some trickery to make it look a little less
like Perl, but nothing fancy.
• The things that look like “directives” are just
functions.
• Lots of functions defined in that namespace...
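The approach described above (blocks are code, directives are plain functions, evaluation happens in a chosen namespace) can be sketched in a few lines. The version below is a Python analogue, not the HQT's Perl or the Text::Template API: it finds `{{ ... }}` blocks with a regular expression and evaluates each one as an expression against a dictionary of functions. The directive signatures, the quoting helper, and the generated `AND` clauses are assumptions for illustration; the slides do not show the real output.

```python
import re
from typing import Callable, Dict

BLOCK_RE = re.compile(r"\{\{(.*?)\}\}", re.DOTALL)


def hql_string(value: str) -> str:
    """Quote a value as an HQL string literal (backslash-style escaping)."""
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"


def make_namespace(user_input: Dict[str, str]) -> Dict[str, Callable[..., str]]:
    """The 'directives' are ordinary functions closed over the user's input."""

    def insert_var(name: str, default: str = "") -> str:
        return hql_string(user_input.get(name, default))

    def append_where(**columns: str) -> str:
        clauses = []
        for column, kind in columns.items():
            if column not in user_input:
                continue  # the user left this filter blank
            raw = user_input[column]
            # int() raises ValueError on non-numeric input.
            value = str(int(raw)) if kind == "int" else hql_string(raw)
            clauses.append(f"\n  AND {column} = {value}")
        return "".join(clauses)

    return {"insert_var": insert_var, "append_where": append_where}


def render(template: str, user_input: Dict[str, str]) -> str:
    namespace = make_namespace(user_input)

    def evaluate(match: "re.Match[str]") -> str:
        # WARNING: eval of template text is exactly the security risk the
        # slides mention. Emptying __builtins__ is NOT a real sandbox.
        return str(eval(match.group(1).strip(), {"__builtins__": {}}, namespace))

    return BLOCK_RE.sub(evaluate, template)


TEMPLATE = (
    "SELECT foo, bar FROM baz\n"
    "WHERE ds={{ insert_var('date', default='2013-06-23') }}"
    "{{ append_where(wibble='string', wobble='int') }}"
)

if __name__ == "__main__":
    print(render(TEMPLATE, {"date": "2013-06-26", "wobble": "42"}))
```

Because each block is real code, template authors get the full language for free; the price is that templates must be treated as trusted input until proper sandboxing is in place.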
Template Functions
• Functions available for:
• Simple value insertion/substitution
• Adding & extending WHERE clauses
• Adding & extending GROUP BY clauses
• Setting defaults
• Manipulating and comparing dates
• Parameter validation
• Plus a lot of misc utils and support functions that
probably should be in a different module.
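As an illustration of the date helpers, here is a guess at what a function like `days_ago_ymd` (used in the earlier template example) might do, written in Python rather than the HQT's Perl. The `YYYY-MM-DD` output format and the extra `today` parameter are assumptions; the second function, `within_last_days`, is a hypothetical example of a date comparison used for parameter validation.

```python
from datetime import date, timedelta
from typing import Optional


def days_ago_ymd(days: int, today: Optional[date] = None) -> str:
    """Return the date `days` days before `today` as YYYY-MM-DD (assumed format)."""
    base = today if today is not None else date.today()
    return (base - timedelta(days=days)).isoformat()


def within_last_days(ymd: str, max_days: int, today: Optional[date] = None) -> bool:
    """True if `ymd` is no later than today and at most `max_days` days back."""
    base = today if today is not None else date.today()
    value = date.fromisoformat(ymd)  # raises ValueError on malformed input
    return timedelta(0) <= base - value <= timedelta(days=max_days)


if __name__ == "__main__":
    print(days_ago_ymd(3))
    print(within_last_days(days_ago_ymd(3), max_days=30))
```

Accepting `today` as a parameter keeps the helpers deterministic under test while defaulting to the current date in normal use.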
Template Files
• Simple format – a YAML header followed by
templatized HQL code like you saw earlier:
    id: pageviews_uniques
    name: Daily Pageviews and Unique Visitors
    description: >
      Any description which will appear on the page.
      <i>May include HTML</i>
    author: optional
    ...
    {{ begin_main_select() }}
    SELECT foo, bar FROM baz
    WHERE ds={{ insert_var date => {type => 'date'} }}
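Reading such a file comes down to splitting it into its two parts. The Python sketch below assumes the header ends at a line containing only `...` (the YAML document-end marker); the slide may instead have meant that line as an ellipsis, so treat the separator as a guess. The function name is hypothetical, and the HQT's actual loader is in Perl and not shown here.

```python
from typing import Tuple


def split_template(text: str) -> Tuple[str, str]:
    """Split a template file into (yaml_header, templatized_hql)."""
    lines = text.splitlines()
    for index, line in enumerate(lines):
        if line.strip() == "...":  # assumed header/body separator
            header = "\n".join(lines[:index])
            body = "\n".join(lines[index + 1:])
            return header, body
    raise ValueError("no '...' line found between the header and the HQL")


SAMPLE = """id: pageviews_uniques
name: Daily Pageviews and Unique Visitors
...
{{ begin_main_select() }}
SELECT foo, bar FROM baz
"""

if __name__ == "__main__":
    header, body = split_template(SAMPLE)
    print(header)
    print("----")
    print(body)
```

The header text can then be handed to any YAML parser (for example PyYAML's `yaml.safe_load`), and the body to the template engine.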
Issues
• Code in the web-app depends on the structure
of data internal to the template module.
• Would take a lot of work to fix, but worth it.
• Template evaluation is a potential security
nightmare.
• Perl does have a sandbox module for this sort of
thing, though. I just need to RTFM and use it.
• The APIs of the various functions aren’t entirely
consistent, but not too bad
• Will definitely fix for next release.
Try it for yourself
Availability
• Source code available on GitHub now:
• https://github.com/tripadvisor/hive-query-tool
• Apache 2.0 Licensed
• Modest system prerequisites
• Automated download and installation of all
dependencies
• Works on a variety of platforms
• Bug Reports, Feature Requests and Pull
Requests all *very welcome*
The future?
Future Plans
• Complete rewrite of the back-end for cleaner
and more flexible code.
• Implement sandboxing for template security
• More user-oriented features, like
• Ability to save pre-filled query reports
• Better management of past and running jobs
• Better status info from the backend
• Column-headers in report output
• An administrator dashboard/console
• Bug-fixes, feature enhancements, lots more
Future Possibilities
• Workflow & Scheduling functionality
• Separate template system for stand-alone use
• Make the back-end good enough to be a viable
replacement for the Hive Thrift Server.
• Add template functions for joins, sub-selects,
and lots of other HQL constructs that aren’t yet
customizable.
• Add ability for a single template to define
multiple queries delivering multiple result sets.
• Rewrite in Perl 6 😉
Any Questions?

Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Último

AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 

Último (20)

AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 

Simplifying Use of Hive with the Hive Query Tool

  • 1. http://tripadvisor.com/careers 1 Stephen R. Scaffidi sscaffidi@tripadvisor.com Hadoop Summit San Jose 2013 Simplifying the use of Hive with the Hive Query Tool
  • 2. Talk Outline
      • Introduction
      • What is the Hive Query Tool (HQT)?
      • Why did we build it?
      • How it’s being used today
      • Design & system requirements
      • HQT Query Templates
      • Getting the source, building, and running
      • Future plans & possibilities
  • 4. Introduction / About Me
      • Sr. Software Engineer at TripAdvisor
      • Data Warehouse Engineering Group
      • Mildly obsessed with making things “Just Work”
          • OK, more than mildly...
      • Varied background, from PC Tech to Email Admin to Telco NMS to Lisp Hacker, etc...
      • Thrives on making computers do the work
      • No Hadoop experience before joining
  • 5. Introduction / About my team
      • Data Warehouse Engineering
      • Small, focused, tenacious group
      • Varied skills and backgrounds
      • We keep the elephants fed and healthy
      • We help others in the company make use of the facilities provided on the clusters
      • DevOps in every sense of the term
  • 6. Introduction / About TripAdvisor
      • Awesome place to work
      • It really feels like we’re one big team
      • Always new challenges and things to learn
      • Smart, driven and genuinely *nice* people
      • Offices around the world
      • Great benefits
      • We’re hiring!
  • 7. What is the Hive Query Tool?
  • 8. What is the Hive Query Tool? / A simple web interface for running reports on Hive
      • Our specific goals/needs:
          • Easy to use for non-technical people
          • More flexible query customization than simple variable interpolation
          • Relatively easy installation and administration
          • Allow jobs to run with different scheduler queues and users
          • Performance equal to or better than plain Hive
  • 9. What is the Hive Query Tool? / Easy for non-technical end-users
      • Intended for use by non-technical people:
          • Sales, Marketing, Customer-Relations, etc.
          • People who don’t know anything about Hadoop or Hive (or need to)
          • People who don’t live in a *nix shell
      • No need to even know anything about SQL!
  • 10. What is the Hive Query Tool? / Flexible Query Customization
      • Other solutions we looked at were too limited
      • We needed to give the users something more powerful than simple variable substitution.
      • HQT’s template system can generate and insert arbitrary HQL clauses into a query based on a user’s input to a simple web interface.
  • 11. What is the Hive Query Tool? / Easy Install and Administration
      • If we were going to build our own, we didn’t want maintenance to be *another* full-time job
      • Internal adoption by other engineers was important
      • Java hackers don’t want to deal with a 23.5-step install and configure process
          • Especially if it’s not written in Java
      • Check out the source, run the setup script, edit a single config file and run the startup scripts.
  • 12. What is the Hive Query Tool? / Run jobs with different Users and Queues
      • Face it, the Hive Thrift Server is horrible
      • Most other user-friendly Hive front-ends use it
          • So they have all its limitations
          • And its bugs
      • The HQT simply spawns a Hive CLI for each job, using sudo to change users when necessary.
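The per-job spawning described on slide 12 can be sketched roughly as follows. This is a Python illustration only (the HQT itself is written in Perl), and the function name `build_hive_command`, the `current_user` default, and the choice of `mapred.job.queue.name` as the queue property are assumptions, not details taken from the HQT source. The sketch only builds the argument list; it does not run anything.

```python
import shlex


def build_hive_command(hql, run_as=None, queue=None, current_user="hqt"):
    """Return the argv list for one Hive CLI job (hypothetical helper)."""
    argv = ["hive"]
    if queue is not None:
        # Assumption: the scheduler queue is selected via this Hadoop
        # MapReduce property, passed through the CLI's --hiveconf option.
        argv += ["--hiveconf", f"mapred.job.queue.name={queue}"]
    # -e hands the HQL text directly to the Hive CLI.
    argv += ["-e", hql]
    if run_as is not None and run_as != current_user:
        # Wrap in sudo only when a different user is requested.
        argv = ["sudo", "-u", run_as] + argv
    return argv


cmd = build_hive_command(
    "SELECT foo FROM baz LIMIT 10", run_as="marketing", queue="reports"
)
# shlex.join shows the command as it would be typed in a shell.
print(shlex.join(cmd))
```

Passing the command as a list rather than a single shell string keeps user-supplied HQL out of shell parsing, which matters when the text comes from a web form.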
  • 13. What is the Hive Query Tool? / Performance?
      • Some options we looked at before building the HQT did a whole lot more
      • Some claimed to be faster than Hive.
      • Some of these options had so much overhead that they were slower than using Hive directly!
      • The HQT simply runs HQL code through the standard Hive CLI. No overhead, no difference in performance over plain-vanilla Hive.
  • 14. Why did we need this?
  • 15. Why did we need the HQT? / Making the data accessible
      • The data we pump into our Hadoop clusters is full of information valuable to our business
      • And more is fed into our Hive tables every day
      • And more people need access to that data every day
      • But not all of those people are 733t h4(k3r engineers 😉
  • 16. Why did we need the HQT? / Making the data accessible
      • The target users may not know Linux and Java and SQL...
      • But they do know how they want the data filtered and correlated and aggregated.
      • We needed a way to let them run queries where they could choose these parameters with a high degree of flexibility...
      • But without having to teach them all HQL
  • 17. Why did we need the HQT? / Looked at what was available...
      • Nothing else we looked at seemed to satisfy all our requirements.
      • Some that looked interesting unfortunately had terrible performance, as they did not use Hive directly.
      • Not that everything we looked at was terrible; some solutions were really quite impressive.
      • But it came down to a classic question in tech-oriented businesses...
  • 18. Why did we need the HQT? / The bottom line
      • We knew what we wanted
      • We knew what we wanted wasn’t particularly complex
      • We asked ourselves if we could just build something that gives us exactly what we need
      • And would that effort cost less than trying to make something else work the way we wanted?
      • A “Eureka!” moment and a rough prototype answered the question 😉
  • 19. HQT Use at TripAdvisor
  • 20. HQT Use at TripAdvisor / A surprise hit
      • Some interested people tried the prototype
      • Liked how it worked, requested more features
      • Other groups became interested
      • Even committed engineering resources to help get it to “beta”
      • It’s now being used across the company
      • New report templates constantly being added
          • (sorry, those aren’t available publicly)
  • 21. HQT Use at TripAdvisor / Company-wide adoption
      • End users find it easy to use and relatively convenient.
      • Template authors have found it easy to create and modify report templates.
      • Users include people in Sales, Marketing, Commerce, and even Legal!
      • Weekly peak usage at over 40 simultaneous Hive jobs, on a single server. (We’ve actually had to add throttling to keep HQT jobs from using too many mapred slots.)
  • 23. HQT Design / Architecture: Front-End
      • Web interface
      • Handles user authentication
      • Processes HQT Templates to determine...
          • What options/input elements to present the user
          • How to process and validate input from the user
          • What HQL to send to the back-end
      • Gets job progress and status info from the back-end
      • Doesn’t do much else
  • 24. HQT Design / Architecture: Back-End
      • Presents a “json/rest-like” interface over HTTP to receive requests from the front-end
      • Uses an event-loop instead of threads
      • Spawns Hive CLI instances to run submitted HQL
      • Tracks and parses output from each instance
      • Watches CLI instances for progress and errors
      • Processes results for retrieval by users
      • Sends email notifications
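To make the back-end's "event loop plus spawned CLI" idea concrete, here is a minimal Python sketch using `asyncio` in place of the Perl AnyEvent loop the HQT actually uses. The progress-line pattern (`Stage-N map = X%,  reduce = Y%`) is an assumption about what the Hive CLI prints, and `parse_progress`, `run_job`, and `FAKE_HIVE` are hypothetical names. A small Python child process stands in for a real Hive CLI so the example runs without Hadoop.

```python
import asyncio
import re
import sys

# Assumed shape of a Hive CLI progress line, for example:
#   "2013-06-26 10:00:00,000 Stage-1 map = 50%,  reduce = 0%"
PROGRESS_RE = re.compile(r"(Stage-\d+) map = (\d+)%,\s+reduce = (\d+)%")


def parse_progress(line):
    """Return (stage, map_pct, reduce_pct), or None for other output."""
    match = PROGRESS_RE.search(line)
    if match is None:
        return None
    return match.group(1), int(match.group(2)), int(match.group(3))


async def run_job(argv):
    """Spawn one child process and collect its progress updates."""
    proc = await asyncio.create_subprocess_exec(
        *argv,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.STDOUT,  # read both streams as one
    )
    updates = []
    # Reading line by line yields to the event loop between lines,
    # so many jobs can be watched at once without threads.
    async for raw in proc.stdout:
        progress = parse_progress(raw.decode(errors="replace"))
        if progress is not None:
            updates.append(progress)
    returncode = await proc.wait()
    return returncode, updates


# Stand-in for a real Hive CLI: a child process printing two progress lines.
FAKE_HIVE = [
    sys.executable,
    "-c",
    "print('Stage-1 map = 0%,  reduce = 0%'); "
    "print('Stage-1 map = 100%,  reduce = 100%')",
]

if __name__ == "__main__":
    print(asyncio.run(run_job(FAKE_HIVE)))
```

Several jobs could be watched concurrently with `asyncio.gather(run_job(a), run_job(b))`, which is the property the slide attributes to using an event loop instead of threads.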
  • 25. HQT Design / Template System
      • The “special sauce” of the HQT
      • The template “language” is designed so that “directives” concisely express a whole lot:
          • What input to gather (and optionally what kind)
          • How to validate that input
          • What output to generate and how to format it
      • It’s a little tricky to explain
      • But extremely flexible
      • More details shortly...
  • 26. HQT Design / Language & Frameworks
      • Written in Perl
      • Uses lots of components from the CPAN
      • Front-end web framework is Mojolicious
      • Template System uses Text::Template
      • Back-end uses AnyEvent
      • Most classes built using Moo
      • Decent example of “Modern Perl”, but is still a work-in-progress.
  • 27. HQT Design / System Requirements
      • Requires Perl 5.10.1 or newer
      • Hadoop & Hive clients & libs should already be installed and configured
      • Does *not* require root or root access
      • LDAP & sudo should be configured if you want to run jobs as different users.
      • Web-server is built-in, but can run under just about any setup you want
  • 28. HQT Design / Current State
      • The front-end code is rather nice
          • MVC-style web app code
          • Uses Mojolicious .epl templates for web content, which are very similar to .erb
      • Back-end code is kind of hairy
          • AnyEvent is fairly low-level
          • REST/json stuff is too mixed with the code that wraps the Hive CLI processes.
          • It shouldn’t be responsible for sending email!
  • 29. HQT Design / Current State, contd.
      • Template-system code:
          • Fairly simple code, but allows for a lot of interesting functionality.
          • Other engineers seem to think it’s fine...
          • But I think it needs refactoring
              • Too much “action at a distance”
              • Template evaluation is a big security risk
              • Should use OO instead of ad-hoc data structures
              • Etc...
  • 30. The HQL Template System
  • 31. Template System
      • Template code blocks are embedded into otherwise normal HQL:
          {{ begin_main_select }}
          SELECT foo, bar
          FROM baz
          WHERE ds={{ insert_var date => { type => 'date', default => days_ago_ymd(3) } }}
          {{ append_where { columns => { wibble => 'string', wobble => 'int' } } }}
  • 32. Template System
      • Template functions/“directives” simultaneously define...
          • What input options to present the user
          • Input validation
          • What to insert into the HQL based on the input
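The "one directive, three jobs" idea from slide 32 can be illustrated with a small Python stand-in. The real directives are Perl functions; this version of `insert_var`, its argument order, the `form_fields` list, and the `YYYY-MM-DD` date format are all assumptions made for the sketch.

```python
from datetime import date, datetime, timedelta


def days_ago_ymd(n, today=None):
    """Date n days before today as YYYY-MM-DD (format is an assumption)."""
    today = today or date.today()
    return (today - timedelta(days=n)).strftime("%Y-%m-%d")


def insert_var(name, spec, user_input, form_fields):
    """Hypothetical Python analogue of an insert_var-style directive.

    In one call it (1) records the input field the web form should show,
    (2) validates what the user submitted, and (3) returns the HQL literal
    to splice into the query.
    """
    form_fields.append(
        {"name": name, "type": spec["type"], "default": spec.get("default")}
    )
    value = user_input.get(name, spec.get("default"))
    if value is None:
        raise ValueError(f"missing value for {name!r}")
    if spec["type"] == "date":
        # strptime raises ValueError for anything that is not a plain date,
        # including input with trailing text after the date.
        parsed = datetime.strptime(value, "%Y-%m-%d").date()
        return f"'{parsed:%Y-%m-%d}'"
    if spec["type"] == "int":
        return str(int(value))
    raise ValueError(f"unsupported type {spec['type']!r}")


fields = []
hql = "SELECT foo, bar FROM baz WHERE ds=" + insert_var(
    "date",
    {"type": "date", "default": days_ago_ymd(3)},
    {"date": "2013-06-26"},
    fields,
)
print(hql)
print(fields)
```

Because the same call both declares the field and consumes its value, the form definition and the query text cannot drift apart, which appears to be the point the slide is making.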
  • 33. Template System
      • So, this...
          {{ begin_main_select }}
          SELECT foo, bar
          FROM baz
          WHERE ds={{ insert_var date => { type => 'date', default => days_ago_ymd(3) } }}
          {{ append_where { columns => { wibble => 'string', wobble => 'int' } } }}
  • 37. Template System / Template Engine
      • Didn’t build anything new, just used the existing Text::Template module in a clever way
      • Template blocks are just Perl code, evaluated in a specified package/namespace.
      • Used some trickery to make it look a little less like Perl, but nothing fancy.
      • The things that look like “directives” are just functions.
      • Lots of functions defined in that namespace...
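The "blocks are just code evaluated in a namespace" approach from slide 37 can be mimicked in a few lines of Python. This is an analogy, not the HQT engine: the HQT uses Perl's Text::Template, while the sketch below uses a regular expression plus `eval`. The names `render`, `begin_main_select`, and `limit_clause` are hypothetical, and the regex is deliberately naive (it would mis-handle a block whose own code contains `}}`).

```python
import re

# Matches one {{ ... }} block; non-greedy so adjacent blocks stay separate.
BLOCK_RE = re.compile(r"\{\{(.*?)\}\}", re.DOTALL)


def render(template, namespace):
    """Replace each {{ ... }} block with the result of evaluating it."""

    def evaluate(match):
        # eval() runs arbitrary code from the template. Emptying
        # __builtins__ is NOT a real sandbox; this is the same class of
        # risk the slides call out for template evaluation.
        result = eval(match.group(1).strip(), {"__builtins__": {}}, namespace)
        return "" if result is None else str(result)

    return BLOCK_RE.sub(evaluate, template)


# The "directives" are ordinary functions living in the namespace.
namespace = {
    "begin_main_select": lambda: "-- main select",
    "limit_clause": lambda n: f"LIMIT {int(n)}",
}

template = (
    "{{ begin_main_select() }}\n"
    "SELECT foo, bar FROM baz\n"
    "{{ limit_clause(10) }}"
)
hql = render(template, namespace)
print(hql)
```

Adding a new directive is then just a matter of adding another function to the namespace, which matches the slide's remark that lots of functions are defined there.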
  • 38. Template System / Template Functions
      • Functions available for:
          • Simple value insertion/substitution
          • Adding & extending WHERE clauses
          • Adding & extending GROUP BY clauses
          • Setting defaults
          • Manipulating and comparing dates
          • Parameter validation
      • Plus a lot of misc utils and support functions that probably should be in a different module.
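As a rough illustration of a function that extends a WHERE clause from user input, here is a Python sketch loosely modeled on the `append_where` directive shown earlier. How the real function behaves is not described in the slides, so the equality-only filters, the skipping of blank fields, and the strict validation of string values are all assumptions.

```python
def append_where(columns, user_input):
    """Build extra 'AND col = value' clauses (hypothetical behavior).

    columns maps column name -> declared type ('string' or 'int');
    user_input maps column name -> raw text from the web form.
    Columns the user left out produce no clause at all.
    """
    clauses = []
    for column, col_type in columns.items():
        if column not in user_input:
            continue  # filter not requested by the user
        value = user_input[column]
        if col_type == "int":
            literal = str(int(value))  # ValueError if not an integer
        elif col_type == "string":
            # Conservative validation: letters, digits and underscores
            # only, so no quote character can end up inside the literal.
            if not value.replace("_", "").isalnum():
                raise ValueError(f"invalid value for {column!r}")
            literal = f"'{value}'"
        else:
            raise ValueError(f"unsupported type {col_type!r}")
        clauses.append(f"AND {column} = {literal}")
    return "\n".join(clauses)


columns = {"wibble": "string", "wobble": "int"}
print(append_where(columns, {"wobble": "42"}))
print(append_where(columns, {"wibble": "abc", "wobble": "7"}))
```

Returning an empty string when no filters are supplied lets the template author place the directive after an existing WHERE condition without worrying about dangling keywords.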
  • 39. Template System / Template Files
      • Simple format: a YAML header followed by templatized HQL code like you saw earlier:
          id: pageviews_uniques
          name: Daily Pageviews and Unique Visitors
          description: >
            Any description which will appear on the page.
            <i>May include HTML</i>
          author: optional
          ...
          {{ begin_main_select() }}
          SELECT foo, bar
          FROM baz
          WHERE ds={{ insert_var date => {type => 'date'} }}
  • 40. Template System / Issues
      • Code in the web-app depends on the structure of data internal to the template module.
          • Would take a lot of work to fix, but worth it.
      • Template evaluation is a potential security nightmare.
          • Perl does have a sandbox module for this sort of thing, though. I just need to RTFM and use it.
      • The APIs of the various functions aren’t entirely consistent, but not too bad
          • Will definitely fix for next release.
  • 41. Try it for yourself
  • 42. Try for Yourself / Availability
      • Source code available on GitHub now:
          • https://github.com/tripadvisor/hive-query-tool
      • Apache 2.0 Licensed
      • Modest system prerequisites
      • Automated download and installation of all dependencies
      • Works on a variety of platforms
      • Bug Reports, Feature Requests and Pull Requests all *very welcome*
  • 44. Future Plans
      • Complete rewrite of the back-end for cleaner and more flexible code.
      • Implement sandboxing for template security
      • More user-oriented features, like
          • Ability to save pre-filled query reports
          • Better management of past and running jobs
          • Better status info from the backend
          • Column-headers in report output
      • An administrator dashboard/console
      • Bug-fixes, feature enhancements, lots more
  • 45. Future Possibilities
      • Workflow & Scheduling functionality
      • Separate template system for stand-alone use
      • Make the back-end good enough to be a viable replacement for the Hive Thrift Server.
      • Add template functions for joins, sub-selects, and lots of other HQL constructs that aren’t yet customizable.
      • Add ability for a single template to define multiple queries delivering multiple result sets.
      • Rewrite in Perl 6 😉