2. Who is talking?
• Tomáš Červenka
• Quick Bio:
• Slovakia
• Cambridge CompSci + Management
• Google – AdSense for TV
• VisualDNA – Software Engineer -> … -> CTO
• @tomascervenka
3. What is this talk about?
• What is Hive?
• What is it useful for?
• What is it not useful for?
• Where to start?
• Amazon EMR + S3
• Simple example
• How do we use Hive at VisualDNA?
• What is VisualDNA, anyway?
• Use cases: reporting, analytics, ML
• Tips and tricks
• Q&A
4. What is Hive?
• Data warehousing solution built on top of Hadoop
• Input format agnostic: can read CSV, JSON, Thrift, SequenceFile…
• Initially developed at Facebook, now an Apache project
• In simple terms, it gives you an SQL-like interface to query Big Data.
• HiveQL together with custom mappers and reducers gives you
enough flexibility to write most data-processing back-ends.
• Hive compiles your HiveQL queries to a set of MapReduce jobs and
uses Hadoop to deliver the results of the query.
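As a sketch of what "compiles to MapReduce" means, a simple aggregation like the one below (table and column names are hypothetical) becomes a single MapReduce job: the map phase emits one key-value pair per row, the shuffle groups by key, and the reduce phase sums the counts.

```sql
-- Hypothetical example: count users per country.
-- Map: emit (country, 1) per row; Reduce: sum the 1s for each country.
SELECT country, count(1) AS num_users
FROM users
GROUP BY country;
```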
5. Why is HiveQL important?
http://howfuckedismydatabase.com/nosql/
6. What is Hive useful for?
• Big Data analytics
• Running queries over large semi-structured datasets
• Makes filtering, aggregation, joins etc. very easy
• Hides the complexity of MR => usable by non-developers
• Big Data processing
• Efficient and effective way to write data pipelines
• Easy way to parallelise computationally complex queries
• Scales nicely with amount of data and cluster size
7. What is Hive not useful for?
• Real time analytics or processing
• Even small queries can take tens of seconds or minutes
• Can’t build Hive (or Hadoop, for that matter) into a real-time flow
• Algorithms which are difficult to parallelise
• Almost everything can be expressed as a series of MR steps
• But the resulting MR is almost always sub-optimal
• If your data is small, R or scripting is often better and faster
• Another downside: Hive (on EMR) tends to be a pain to debug
8. How to start with Hive?
• Build your own Hadoop cluster + install Hive
• The “right” way to do it; a multi-node setup may take some time
• Spinning up an EMR cluster
• The quick and cheap way to do it.
• You need an Amazon AWS account and some data on S3.
• You need the EMR command-line client (a Ruby tool) installed and configured locally.
• You need to spin up an EMR cluster in interactive mode. Voila.
$ emr --create --alive --name "MY JOB" --hive-interactive --num-instances 8 --instance-type cc1.4xlarge --hive-versions 0.8.1 --bootstrap-action "s3://my-bucket/emr-bootstrap"
9. Getting Started with Hive
• SSH into your cluster (the master node)
• $ emr --ssh j-AHF0QE733K8F
• Run screen (you’ll quickly find out why)
• $ screen
• Run Hive
• $ hive
• Welcome to Hive interface!
• Monitor Hive
• $ elinks http://localhost:9100
• Terminate Hive
• $ emr --terminate j-AHF0QE733K8F
10. Example – CTR by ad by day
add jar s3://elasticmapreduce/samples/hive-ads/libs/jsonserde.jar;
CREATE EXTERNAL TABLE
events (
time string,
action string,
id string)
PARTITIONED BY
(d string)
ROW FORMAT SERDE
'com.amazon.elasticmapreduce.JsonSerde'
WITH SERDEPROPERTIES ('paths'='time,action,id')
LOCATION
's3://my-bucket/events/';
ALTER TABLE events ADD PARTITION (d='2012-07-09');
ALTER TABLE events ADD PARTITION (d='2012-07-08');
ALTER TABLE events ADD PARTITION (d='2012-07-07');
11. Example – CTR by ad by day
CREATE EXTERNAL TABLE
ad_stats (d string, id string, impressions bigint, clicks bigint, ctr float)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
LOCATION 's3://my-bucket/ad-stats/';
INSERT OVERWRITE TABLE ad_stats
SELECT COALESCE(i.d, c.d), COALESCE(i.id, c.id), impressions,
       COALESCE(clicks, 0), COALESCE(clicks, 0)/impressions
FROM (
SELECT d, id, count(1) as clicks
FROM events
WHERE action = 'CLICK'
GROUP BY d, id
) c FULL OUTER JOIN (
SELECT d, id, count(1) as impressions
FROM events
WHERE action = 'IMPRESSION'
GROUP BY d, id
) i ON (c.d = i.d AND c.id = i.id);
12. What is VisualDNA, anyway?
• Leading audience profiling technology and audience data provider
• Reach of ≈ 100 million monthly uniques globally
• Data for ad targeting, risk, personalisation, recommendations…
• Use Cassandra, Hadoop, Hive, Redis, Scala, PHP, Java in production
• Running on a mix of AWS and physical HW in London
We’re hiring (like mad)
Back-End, Front-End and Research Engineers
www.visualdna.com/careers
13. How do we use Hive?
• Main data source: events
• Events are (mainly) associated with user actions
• Conversions, pageviews, impressions, clicks, syncs…
• Contain user ID, timestamp, browser info, geo, event info…
• Roughly 50M of them a day = 50 GB of text
• JSON format, one JSON object per line, validated input
• Coming from 8 event trackers, rotated every 5 mins
• Partitioned by date (d=2012-07-09)
• Storing all of them on S3. Never deleting anything.
14. Use case #1: Analytics
• Analytics queries on our events table
• How many people started each quiz in the last 3 months?
• Give me the IDs of people visiting the football section on the Mirror today.
• Give me a histogram of visit frequency per user
• …
• Best thing about Hive: non-developers use it (after we wrote a wiki)
• Can simplify further by using Karmasphere on AWS
• Downsides:
• Takes time to spin up the cluster on AWS.
• Takes time to execute simple queries. Very big queries often fail.
• Replacing a lot of the “how many” queries with Redis.
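As a sketch, the “histogram of visits per user” question above might translate into a nested aggregation like this, assuming the events table from the earlier example (the 'PAGEVIEW' action value is an assumption):

```sql
-- Count visits per user first, then count users per visit count.
SELECT visits, count(1) AS num_users
FROM (
  SELECT id, count(1) AS visits
  FROM events
  WHERE action = 'PAGEVIEW'   -- assumed action name
  GROUP BY id
) per_user
GROUP BY visits;
```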
15. Use case #2: Reporting pipeline
• Interactive mode in Hive is only part of the picture.
• Hive can also run scripted queries for you:
• $ emr --create --hive-script --name "Test" --num-instances 2 --slave-instance-type cc1.4xlarge --master-instance-type c1.medium --arg hive_script.q --args "-d,PARTITION=2012-07-09,-d,RUNDATE=2012-07-10"
• Note: arguments are accessible in the hive query: ${PARTITION}
• Rule of thumb: always run queries by hand first; only script them once you’re sure they work
• Reporting is repeated analytics => similar queries, but run regularly
• Hive drives a lot of our reporting tools and provides data for Redis
• We use cron + bash scripts to schedule, run and monitor Hive jobs
• Poll emr for status of the job until finished (success or fail)
• Suggestions for better tools?
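A minimal sketch of what a hive_script.q using those arguments might contain (the query itself is illustrative, not our actual pipeline):

```sql
-- ${PARTITION} and ${RUNDATE} are substituted from the --args flag.
ALTER TABLE events ADD IF NOT EXISTS PARTITION (d='${PARTITION}');
SELECT count(1) FROM events WHERE d = '${PARTITION}';
```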
16. Use case #3: Inference Engine (ML)
• Inference Engine helps us scale audience data to 100M+ profiles
• In principle, extrapolates quiz profile data over user behaviour online
• At its heart, it’s a few hundred lines of Hive queries
• Every day, fetches users from Cassandra and sifts through events:
• Update profiles for pages visited by profiled users yesterday
• Update profiles for users based on their behaviour yesterday
• Input is about 2M users, 50M events; output is 5-10M user profiles
• Runs in < 3 hours with 10 large instances -> parallelises nicely
• Could have used Apache Mahout, but it was single-threaded back then
• Biggest issues? Global sorts, running out of memory/disk on joins.
17. Tips and Tricks
• Performance related
• On AWS, S3 is often the bottleneck. Use cc1.* or cc2.* instances.
• Copy from S3 to internal table if you query it multiple times.
• Use compression for output. Plenty of CPU cycles for this.
• Use the SequenceFile format and internal tables where applicable.
• Use MapJoin wherever possible (SELECT /*+ MAPJOIN(table)*/).
• Avoid SerDe-s and TRANSFORM mappers if possible.
• Don’t sort (ORDER BY) unless you really have to => forces a single reducer.
• Partition your data (input and/or output) if you can.
• Might make your life easier
• If queries start stalling, add more instances. Debugging is painful.
• Use arguments to pass in commands / partitions (if you need to).
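A few of the tips above in HiveQL form (table names are hypothetical; the Gzip codec is one common choice):

```sql
-- Compress job output: cheap on CPU, saves S3 bandwidth.
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

-- MapJoin hint: load the small table into memory on every mapper,
-- avoiding a full shuffle-based join.
SELECT /*+ MAPJOIN(small) */ big.id, small.label
FROM big JOIN small ON (big.id = small.id);
```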
18. Q&A
• Thank you for your time!
• Hope this was a bit useful – let me know your feedback.
• Any questions?