These are the slides from my talk about a tool we developed and open-sourced at TripAdvisor. The Hive Query Tool makes it easy for non-technical end users to run highly customizable reports on Hive.
2. http://tripadvisor.com/careers 2
Talk Outline
• Introduction
• What is the Hive Query Tool (HQT)?
• Why did we build it?
• How it’s being used today
• Design & system requirements
• HQT Query Templates
• Getting the source, building, and running
• Future plans & possibilities
4. Introduction
About Me
• Sr. Software Engineer at TripAdvisor
• Data Warehouse Engineering Group
• Mildly obsessed with making things “Just Work”
• OK, more than mildly...
• Varied background, from PC Tech to Email
Admin to Telco NMS to Lisp Hacker, etc...
• Thrives on making computers do the work
• No Hadoop experience before joining
5. Introduction
About my team
• Data Warehouse Engineering
• Small, focused, tenacious group
• Varied skills and backgrounds
• We keep the elephants fed and healthy
• We help others in the company make use of
the facilities provided on the clusters
• DevOps in every sense of the term
7. What is the Hive Query Tool?
Simplifying the use of Hive with The Hive Query Tool
8. What is the Hive Query Tool?
A simple web interface for running reports on Hive
• Our Specific Goals/Needs:
• Easy to use for non-technical people
• More flexible query customization than simple
variable interpolation
• Relatively easy installation and administration
• Allow jobs to run with different scheduler queues
and users
• Performance equal to or better than plain Hive
9. What is the Hive Query Tool?
Easy for non-technical end-users
• Intended for use by non-technical people:
• Sales, Marketing, Customer-Relations, etc.
• People who don’t know anything about Hadoop
or Hive (and shouldn’t need to)
• People who don’t live in a *nix shell
• No need to even know anything about SQL!
10. What is the Hive Query Tool?
Flexible Query Customization
• Other solutions we looked at were too limited
• We needed to give the users something more
powerful than simple variable substitution.
• HQT’s template system can generate and
insert arbitrary HQL clauses into a query based
on a user’s input to a simple web interface.
11. What is the Hive Query Tool?
Easy Install and Administration
• If we were going to build our own, we didn’t
want maintenance to be *another* full-time job
• Internal adoption by other engineers was
important
• Java hackers don’t want to deal with a 23.5-step
install and configure process
• Especially if it’s not written in Java
• Check out the source, run the setup script, edit
a single config file, and run the startup scripts.
12. What is the Hive Query Tool?
Run jobs with different Users and Queues
• Face it, the Hive Thrift Server is horrible
• Most other user-friendly Hive front-ends use it
• So they have all its limitations
• And its bugs
• The HQT simply spawns a Hive CLI for each
job, using sudo to change users when
necessary.
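The spawn-a-CLI-per-job approach can be sketched roughly as follows. This is an illustrative Python sketch, not the HQT's actual Perl code: the helper names `build_hive_command` and `spawn_job` and their parameters are hypothetical. It relies only on standard `sudo` options (`-n`, `-u`) and Hive CLI options (`-e`, `--hiveconf`), and it assumes the scheduler queue is chosen through the `mapred.job.queue.name` property.

```python
import subprocess


def build_hive_command(hql, run_as=None, queue=None):
    """Build the argv list for one Hive CLI job (hypothetical helper)."""
    cmd = ["hive"]
    if queue:
        # Assumption: the cluster's scheduler reads the queue from this property.
        cmd += ["--hiveconf", f"mapred.job.queue.name={queue}"]
    cmd += ["-e", hql]
    if run_as:
        # -n: never prompt for a password; -u: run as the target user.
        cmd = ["sudo", "-n", "-u", run_as] + cmd
    return cmd


def spawn_job(hql, run_as=None, queue=None):
    """Start the job as a child process; the caller reads stdout and stderr."""
    return subprocess.Popen(
        build_hive_command(hql, run_as, queue),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
    )
```

Passing the command as a list (rather than a shell string) keeps the HQL from being reinterpreted by a shell, which matters when parts of the query come from user input.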
13. What is the Hive Query Tool?
Performance?
• Some options we looked at before building the
HQT did a whole lot more
• Some claimed to be faster than Hive.
• Some of these options had so much overhead
that they were slower than using Hive directly!
• The HQT simply runs HQL code through the
standard Hive CLI. No overhead, no difference
in performance over plain-vanilla Hive.
14. Why did we need this?
15. Why did we need the HQT?
Making the data accessible
• The data we pump into our Hadoop clusters is
full of valuable information to our business
• And more is fed into our Hive tables every day
• And more people need access to that data
every day
• But not all of those people are 733t h4(k3r
engineers 😉
16. Why did we need the HQT?
Making the data accessible
• The target users may not know Linux and Java
and SQL...
• But they do know how they want the data
filtered and correlated and aggregated.
• We needed a way to let them run queries
where they could choose these parameters
with a high degree of flexibility...
• But without having to teach them all HQL
17. Why did we need the HQT?
Looked at what was available...
• Nothing else we looked at seemed to satisfy all
our requirements.
• Some that looked interesting unfortunately
had terrible performance because they did not use
Hive directly.
• Not that everything we looked at was terrible –
some solutions were really quite impressive.
• But it came down to a classic question in
tech-oriented businesses...
18. Why did we need the HQT?
The bottom line
• We knew what we wanted
• We knew what we wanted wasn’t particularly
complex
• We asked ourselves if we could just build
something that gives us exactly what we need
• And would that effort cost less than trying to
make something else work the way we
wanted?
• A “Eureka!” moment and a rough prototype
answered the question 😉
19. HQT Use at TripAdvisor
20. HQT Use at TripAdvisor
A surprise hit
• Some interested people tried the prototype
• Liked how it worked, requested more features
• Other groups became interested
• Even committed engineering resources to help
get it to “beta”
• It’s now being used across the company
• New report templates constantly being added
• (sorry, those aren’t available publicly)
21. HQT Use at TripAdvisor
Company-wide adoption
• End users find it easy to use and relatively
convenient.
• Template authors have found it easy to create
and modify report templates.
• Users include people in Sales, Marketing,
Commerce, and even Legal!
• Weekly peak usage at over 40 simultaneous
Hive jobs – on a single server.
(we’ve actually had to add throttling to keep HQT
jobs from using too many mapred slots)
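The slides do not say how the throttling is implemented. One simple way to cap concurrent jobs is a fixed limit plus a FIFO queue of waiting jobs, sketched here in Python under that assumption. The class name `JobThrottle` and the limit passed to it are hypothetical.

```python
from collections import deque


class JobThrottle:
    """Cap the number of concurrently running jobs; queue the rest (FIFO)."""

    def __init__(self, max_running):
        self.max_running = max_running
        self.running = set()
        self.pending = deque()

    def submit(self, job_id):
        """Return True if the job may start now, False if it was queued."""
        if len(self.running) < self.max_running:
            self.running.add(job_id)
            return True
        self.pending.append(job_id)
        return False

    def finished(self, job_id):
        """Mark a job done and return the next queued job to start, if any."""
        self.running.discard(job_id)
        if self.pending and len(self.running) < self.max_running:
            next_job = self.pending.popleft()
            self.running.add(next_job)
            return next_job
        return None
```

With a single-threaded event loop, as the back-end uses, a structure like this needs no locking, because `submit` and `finished` never run at the same time.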
23. HQT Design
Architecture: Front-End
• Web interface
• Handles user authentication
• Processes HQT Templates to determine...
• What options/input elements to present the user
• How to process and validate input from the user
• What HQL to send to the back-end
• Gets job progress and status info from the
back-end
• Doesn’t do much else
24. HQT Design
Architecture: Back-End
• Presents a “JSON/REST-like” interface over HTTP
to receive requests from the front-end
• Uses an event loop instead of threads
• Spawns Hive CLI instances to run submitted
HQL
• Tracks and parses output from each instance
• Watches CLI instances for progress and errors
• Processes results for retrieval by users
• Sends email notifications
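The event-loop pattern described above (spawn a CLI process, read its output without blocking, extract progress) might look something like the following in Python's `asyncio`. This is a sketch, not the HQT's AnyEvent code. The names `parse_progress` and `watch_job` are hypothetical, and it assumes that Hive writes progress lines such as `Stage-1 map = 50%,  reduce = 0%` to stderr and query results to stdout; the exact log format can differ between Hive versions.

```python
import asyncio
import re

# Assumed progress-line format; adjust for the Hive version in use.
PROGRESS_RE = re.compile(r"(Stage-\d+) map = (\d+)%,\s+reduce = (\d+)%")


def parse_progress(line):
    """Return (stage, map_pct, reduce_pct), or None for non-progress lines."""
    match = PROGRESS_RE.search(line)
    if not match:
        return None
    return match.group(1), int(match.group(2)), int(match.group(3))


async def _read_lines(stream, handle):
    """Feed each decoded line from the stream to handle() until EOF."""
    while True:
        raw = await stream.readline()
        if not raw:
            break
        handle(raw.decode(errors="replace").rstrip("\n"))


async def watch_job(argv, on_progress):
    """Run one CLI process; report progress and collect result lines."""
    proc = await asyncio.create_subprocess_exec(
        *argv,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    results = []

    def handle_log(line):
        progress = parse_progress(line)
        if progress:
            on_progress(*progress)

    # Read both pipes concurrently so neither fills up and stalls the child.
    await asyncio.gather(
        _read_lines(proc.stdout, results.append),
        _read_lines(proc.stderr, handle_log),
    )
    return await proc.wait(), results
```

Many `watch_job` coroutines can run on one loop at once, which is the appeal of this design over one thread per job.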
25. HQT Design
Template System
• The “special sauce” of the HQT
• The template “language” is designed so that
“directives” concisely express a whole lot:
• What input to gather (and optionally what kind)
• How to validate that input
• What output to generate and how to format it
• It’s a little tricky to explain
• But extremely flexible
• More details shortly...
26. HQT Design
Language & Frameworks
• Written in Perl
• Uses lots of components from the CPAN
• Front-end web framework is Mojolicious
• Template System uses Text::Template
• Back-end uses AnyEvent
• Most classes built using Moo
• Decent example of “Modern Perl”, but is still a
work-in-progress.
27. HQT Design
System Requirements
• Requires Perl 5.10.1 or newer
• Hadoop & Hive clients & libs should already be
installed and configured
• Does *not* require root access
• LDAP & sudo should be configured if you want
to run jobs as different users.
• Web-server is built-in, but can run under just
about any setup you want
28. HQT Design
Current State
• The front-end code is rather nice
• MVC-style web app code
• Uses Mojolicious .epl templates for web content,
which are very similar to .erb
• Back-end code is kind of hairy
• AnyEvent is fairly low-level
• REST/JSON handling is too entangled with the code
that wraps the Hive CLI processes.
• It shouldn’t be responsible for sending email!
29. HQT Design
Current State, contd.
• Template-system code:
• Fairly simple code, but allows for a lot of
interesting functionality.
• Other engineers seem to think it’s fine...
• But I think it needs refactoring
∙ Too much “action at a distance”
∙ Template evaluation is a big security risk
∙ Should use OO instead of ad-hoc data structures
∙ Etc...
37. Template System
Template Engine
• Didn’t build anything new, just used the
existing Text::Template module in a clever
way
• Template blocks are just Perl code, evaluated
in a specified package/namespace.
• Used some trickery to make it look a little less
like Perl, but nothing fancy.
• The things that look like “directives” are just
functions.
• Lots of functions defined in that namespace...
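The idea of "template blocks are code, directives are just functions in a namespace" can be illustrated with a small Python analogue. This is not the HQT's implementation (which uses Perl's Text::Template): `render`, `make_namespace`, and this Python-style `insert_var` signature are hypothetical stand-ins.

```python
import re

# Match {{ ... }} blocks, non-greedily, across line breaks.
BLOCK_RE = re.compile(r"\{\{(.*?)\}\}", re.DOTALL)


def render(template, namespace):
    """Replace each {{ ... }} block with the result of evaluating it."""

    def evaluate(match):
        # WARNING: eval runs whatever the template author wrote. Emptying
        # __builtins__ is NOT a real sandbox; it only keeps the sketch small.
        result = eval(match.group(1).strip(), {"__builtins__": {}}, namespace)
        return "" if result is None else str(result)

    return BLOCK_RE.sub(evaluate, template)


def make_namespace(user_input):
    """Build the directive functions, closed over the user's form input."""

    def insert_var(name, default=None):
        return user_input.get(name, default)

    return {"insert_var": insert_var}
```

For example, `render("SELECT foo FROM baz WHERE ds='{{ insert_var('date') }}'", make_namespace({"date": "2013-05-01"}))` returns `SELECT foo FROM baz WHERE ds='2013-05-01'`. The `eval` call is exactly where the security concern raised in these slides comes from.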
38. Template System
Template Functions
• Functions available for:
• Simple value insertion/substitution
• Adding & extending WHERE clauses
• Adding & extending GROUP BY clauses
• Setting defaults
• Manipulating and comparing dates
• Parameter validation
• Plus a lot of misc utils and support functions that
probably should be in a different module.
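To make the kinds of functions listed above more concrete, here is a hedged Python sketch of three of them: parameter validation, date manipulation, and extending a WHERE clause. The names `validate_date`, `days_before`, and `WhereClause` are invented for illustration and do not correspond to the HQT's actual function names.

```python
import re
from datetime import date, timedelta

DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")


def validate_date(value):
    """Accept only YYYY-MM-DD strings that are real calendar dates."""
    if not DATE_RE.fullmatch(value):
        raise ValueError(f"not a YYYY-MM-DD date: {value!r}")
    date.fromisoformat(value)  # raises ValueError for e.g. 2013-02-30
    return value


def days_before(value, days):
    """Return the date `days` days before `value`, as YYYY-MM-DD."""
    start = date.fromisoformat(validate_date(value))
    return (start - timedelta(days=days)).isoformat()


class WhereClause:
    """Collect optional conditions; emit nothing if the user chose none."""

    def __init__(self):
        self.conditions = []

    def add(self, condition):
        self.conditions.append(condition)
        return self  # allow chaining

    def render(self):
        if not self.conditions:
            return ""
        return "WHERE " + " AND ".join(self.conditions)
```

The point of a clause builder over plain variable substitution is that whole clauses can appear or disappear depending on which options the user filled in.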
39. Template System
Template Files
• Simple format – a YAML header followed by
templatized HQL code like you saw earlier:
id: pageviews_uniques
name: Daily Pageviews and Unique Visitors
description: >
  Any description which will appear on the page.
  <i>May include HTML</i>
author: optional
...
{{ begin_main_select() }}
SELECT foo, bar FROM baz
WHERE ds={{ insert_var date => {type => 'date'} }}
40. Template System
Issues
• Code in the web-app depends on the structure
of data internal to the template module.
• Would take a lot of work to fix, but worth it.
• Template evaluation is a potential security
nightmare.
• Perl does have a sandbox module for this sort of
thing, though. I just need to RTFM and use it.
• The APIs of the various functions aren’t entirely
consistent, but not too bad
• Will definitely fix for next release.
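As a complement to sandboxing, one way to reduce the risk of template evaluation is to check each block before evaluating it and reject anything that is not a call to a known directive with literal arguments. The Python sketch below shows that whitelist technique using the standard `ast` module. It is a different approach from the Perl sandbox module mentioned above, and the names `check_block` and `ALLOWED_FUNCTIONS` are hypothetical.

```python
import ast

# Hypothetical whitelist of directive names a template may call.
ALLOWED_FUNCTIONS = frozenset({"insert_var", "begin_main_select"})


def check_block(source, allowed=ALLOWED_FUNCTIONS):
    """Return the directive name, or raise ValueError if the block is unsafe.

    A block is accepted only if it is a single call to a whitelisted
    function whose arguments are all plain literals.
    """
    tree = ast.parse(source.strip(), mode="eval")
    call = tree.body
    if not (isinstance(call, ast.Call) and isinstance(call.func, ast.Name)):
        raise ValueError("block must be a single directive call")
    if call.func.id not in allowed:
        raise ValueError(f"directive not allowed: {call.func.id}")
    for arg in list(call.args) + [kw.value for kw in call.keywords]:
        try:
            ast.literal_eval(arg)  # accepts only literal expressions
        except ValueError:
            raise ValueError("directive arguments must be literals") from None
    return call.func.id
```

A check like this is stricter than the current template language, since it rules out arbitrary code in blocks, so it trades some flexibility for safety.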
41. Try it for yourself
42. Try it for Yourself
Availability
• Source code available on GitHub now:
• https://github.com/tripadvisor/hive-query-tool
• Apache 2.0 Licensed
• Modest system prerequisites
• Automated download and installation of all
dependencies
• Works on a variety of platforms
• Bug Reports, Feature Requests and Pull
Requests all *very welcome*
44. Future Plans
• Complete rewrite of the back-end for cleaner
and more flexible code.
• Implement sandboxing for template security
• More user-oriented features, like
• Ability to save pre-filled query reports
• Better management of past and running jobs
• Better status info from the backend
• Column-headers in report output
• An administrator dashboard/console
• Bug-fixes, feature enhancements, lots more
45. Future Possibilities
• Workflow & Scheduling functionality
• Separate template system for stand-alone use
• Make the back-end good enough to be a viable
replacement for the Hive Thrift Server.
• Add template functions for joins, sub-selects,
and lots of other HQL constructs that aren’t yet
customizable.
• Add ability for a single template to define
multiple queries that deliver multiple result sets.
• Rewrite in Perl 6 😉