Who is talking?
• Tomáš Červenka


• Quick Bio:
   • Slovakia
   • Cambridge CompSci + Management
   • Google – AdSense for TV
   • VisualDNA – Software Engineer -> … -> CTO


• @tomascervenka
What is this talk about?
• What is Hive?
   • What is it useful for?
   • What is it not useful for?
• Where to start?
   • Amazon EMR + S3
   • Simple example
• How do we use Hive at VisualDNA?
   • What is VisualDNA, anyway?
   • Use cases: reporting, analytics, ML
   • Tips and tricks
• Q&A
What is Hive?
• Data warehousing solution built on top of Hadoop
• Input format agnostic: can read CSV, JSON, Thrift, SequenceFile…
• Initially developed at Facebook, then became an Apache project (first as a Hadoop subproject)


• In simple terms, it gives you a SQL-like interface to query Big Data.
   • HiveQL together with custom mappers and reducers gives you
     enough flexibility to write most data processing back-ends.


• Hive compiles your HiveQL queries to a set of MapReduce jobs and
 uses Hadoop to deliver the results of the query.
Why is HiveQL important?




                           http://howfuckedismydatabase.com/nosql/
What is Hive useful for?
• Big Data analytics
   • Running queries over large semi-structured datasets
   • Makes filtering, aggregation, joins etc. very easy
   • Hides the complexity of MR => used by non-developers


• Big Data processing
   • Efficient and effective way to write data pipelines
   • Easy way to parallelise computationally complex queries
   • Scales nicely with amount of data and cluster size
What is Hive not useful for?
• Real time analytics or processing
   • Even small queries can take tens of seconds or minutes
   • Can’t build Hive (or Hadoop, for that matter) into a real-time flow


• Algorithms which are difficult to parallelise
   • Almost everything can be expressed in a number of MR steps
   • But for such algorithms MR is almost always sub-optimal
   • If your data is small, R or scripting is often better and faster


• Another downside: Hive (on EMR) tends to be a pain to debug
How to start with Hive?
• Build your own Hadoop cluster + install Hive
   • The “right” way to do it, might take some time for multi-node setup


• Spinning up an EMR cluster
   • The quick and cheap way to do it.
   • You need an Amazon AWS account and some data on S3.
   • You need an EMR ruby library installed and configured locally.
   • You need to spin up an EMR cluster in interactive mode. Voila.
   $ emr --create --alive --name "MY JOB" --hive-interactive --num-instances 8 \
       --instance-type cc1.4xlarge --hive-versions 0.8.1 \
       --bootstrap-action "s3://my-bucket/emr-bootstrap"
Getting Started with Hive
• SSH into your cluster (your namenode)
   • $ emr --ssh j-AHF0QE733K8F

• Run screen (you’ll quickly find out why)
   • $ screen

• Run Hive
   • $ hive

• Welcome to Hive interface!
• Monitor Hive
   • $ elinks http://localhost:9100

• Terminate the cluster
   • $ emr --terminate j-AHF0QE733K8F
Example – CTR by ad by day
add jar s3://elasticmapreduce/samples/hive-ads/libs/jsonserde.jar;

CREATE EXTERNAL TABLE
    events (
         time string,
         action string,
         id string)
PARTITIONED BY
    (d string)
ROW FORMAT SERDE
    'com.amazon.elasticmapreduce.JsonSerde'
WITH SERDEPROPERTIES ('paths'='time,action,id')
LOCATION
    's3://my-bucket/events/';

ALTER TABLE events ADD PARTITION (d='2012-07-09');
ALTER TABLE events ADD PARTITION (d='2012-07-08');
ALTER TABLE events ADD PARTITION (d='2012-07-07');
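
Registering one partition per day by hand gets tedious quickly. A minimal sketch of a shell helper that prints the same statements for any list of dates; the function name add_partitions_hql is hypothetical, and the dates are assumed to be passed in explicitly rather than computed.

```shell
#!/bin/sh
# Hypothetical helper: print one ALTER TABLE statement per date argument.
# Usage: add_partitions_hql TABLE DATE [DATE...]
add_partitions_hql() {
    table=$1
    shift
    for d in "$@"; do
        # Same statement shape as on the slide, one line per partition.
        printf "ALTER TABLE %s ADD PARTITION (d='%s');\n" "$table" "$d"
    done
}
```

For example, add_partitions_hql events 2012-07-09 2012-07-08 2012-07-07 > partitions.q writes the three statements above to a file, which could then be run with hive -f partitions.q.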
Example – CTR by ad by day
CREATE EXTERNAL TABLE
    ad_stats (d string, id string, impressions bigint, clicks bigint, ctr float)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
LOCATION 's3://my-bucket/ad-stats/';

INSERT OVERWRITE TABLE ad_stats
SELECT COALESCE(c.d, i.d), COALESCE(c.id, i.id), impressions, clicks, clicks/impressions
FROM (
    SELECT d, id, count(1) as clicks
    FROM events
    WHERE action = 'CLICK'
    GROUP BY d, id
) c FULL OUTER JOIN (
    SELECT d, id, count(1) as impressions
    FROM events
    WHERE action = 'IMPRESSION'
    GROUP BY d, id
) i ON (c.d = i.d AND c.id = i.id);
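
For a small sample it can be quicker to sanity-check the same aggregate locally before running it on the cluster. A rough sketch with awk, assuming the events have already been flattened to tab-separated lines of d, id and action (the real input is JSON); the function name ctr_by_ad_by_day is hypothetical. Unlike the Hive query, it reports a missing count as 0 rather than NULL.

```shell
#!/bin/sh
# Hypothetical local check of the CTR aggregate on a small sample.
# Input (stdin): tab-separated lines "d<TAB>id<TAB>action".
# Output: "d,id,impressions,clicks,ctr" per (d, id), sorted.
# ctr is left empty when there are no impressions.
ctr_by_ad_by_day() {
    awk -F '\t' '
        $3 == "IMPRESSION" { imp[$1 "," $2]++; seen[$1 "," $2] = 1 }
        $3 == "CLICK"      { clk[$1 "," $2]++; seen[$1 "," $2] = 1 }
        END {
            for (k in seen) {
                i = imp[k] + 0
                c = clk[k] + 0
                ctr = (i > 0) ? sprintf("%.4f", c / i) : ""
                printf "%s,%d,%d,%s\n", k, i, c, ctr
            }
        }
    ' | sort
}
```

Usage would be along the lines of ctr_by_ad_by_day < sample.tsv, where sample.tsv is a small hand-made extract.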
What is VisualDNA, anyway?
• Leading audience profiling technology and audience data provider
• Global reach of ≈ 100 million monthly unique users
• Data for ad targeting, risk, personalisation, recommendations…


• Use Cassandra, Hadoop, Hive, Redis, Scala, PHP, Java in production
• Running on a mix of AWS and physical HW in London

              We’re hiring (like mad)
           Back-End, Front-End and Research Engineers
                     www.visualdna.com/careers
How do we use Hive?
• Main data source is: events
• Events are associated with user actions (mainly)
   • Conversions, pageviews, impressions, clicks, syncs…
   • Contain user ID, timestamp, browser info, geo, event info…
• Roughly 50M of them a day = 50 GB of text
   • JSON format, one JSON object per line, validated input
   • Coming from 8 event trackers, logs rotated every 5 minutes
   • Partitioned by date (d=2012-07-09)
   • Storing all of them on S3. Never deleting anything.
Use case #1: Analytics
• Analytics queries on our events table
   • How many people started each quiz in the last 3 months?
   • Give me the IDs of people visiting the football section on the Mirror today.
   • Give me a histogram of the frequency of visits per user.
   • …
• Best thing about Hive: non-developers use it (after we wrote a wiki)
   • Can simplify further by using Karmasphere on AWS
• Downsides:
   • Takes time to spin up the cluster on AWS.
   • Takes time to execute simple queries. Very big queries often fail.
   • We are replacing a lot of the “how many” queries with Redis.
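
As an illustration of the kind of query a non-developer might run, here is a sketch of a shell helper that prints a HiveQL histogram of visits per user for one date partition. The user_id column is an assumption (the example events table earlier in the deck does not show it), every event is counted as a visit, and the function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print a HiveQL histogram query for one date partition.
# Assumes the events table has a user_id column (not shown in the slides).
visits_histogram_hql() {
    day=$1
    cat <<EOF
SELECT visits, count(1) AS users
FROM (
    SELECT user_id, count(1) AS visits
    FROM events
    WHERE d = '${day}'
    GROUP BY user_id
) per_user
GROUP BY visits;
EOF
}
```

On the cluster this could be run as hive -e "$(visits_histogram_hql 2012-07-09)". Restricting the inner query to a single partition keeps the scan to one day of data.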
Use case #2: Reporting pipeline
• Interactive mode in Hive is only part of the picture.
• Hive can also run scripted queries for you:
   •   $ emr --create --hive-script --name "Test" --num-instances 2 \
           --slave-instance-type cc1.4xlarge --master-instance-type c1.medium \
           --arg hive_script.q --args "-d,PARTITION=2012-07-09,-d,RUNDATE=2012-07-10"
   •   Note: arguments are accessible in the Hive query as ${PARTITION}
   •   Rule of thumb: always run queries by hand first, script them once you’re sure they work
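
The comma-separated --args string is easy to get wrong when typed by hand. A small sketch of a helper that builds it from NAME=VALUE pairs; the function name hive_script_args is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: turn NAME=VALUE pairs into the comma-separated
# "-d,NAME=VALUE,..." string passed via --args in the command above.
hive_script_args() {
    args=""
    for pair in "$@"; do
        # Separate consecutive "-d,NAME=VALUE" groups with a comma.
        if [ -n "$args" ]; then
            args="$args,"
        fi
        args="${args}-d,${pair}"
    done
    printf '%s\n' "$args"
}
```

Calling hive_script_args PARTITION=2012-07-09 RUNDATE=2012-07-10 prints -d,PARTITION=2012-07-09,-d,RUNDATE=2012-07-10, which matches the string used on the slide.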

• Reporting is repeated analytics => similar queries, but run regularly
• Hive drives a lot of our reporting tools and provides data for Redis
• We use cron + bash scripts to schedule, run and monitor Hive jobs
   • Poll emr for status of the job until finished (success or fail)
   • Suggestions for better tools?
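
The polling step might look roughly like the loop below. This is only a sketch, not the scripts used at VisualDNA: the status command is passed in as a parameter, and the state names checked here are assumptions for illustration.

```shell
#!/bin/sh
# Hypothetical polling loop. STATUS_CMD is any command that prints the job
# state on stdout; the state names below are assumptions for illustration.
# Usage: wait_for_job STATUS_CMD INTERVAL_SECONDS MAX_POLLS
# Returns 0 on success, 1 on failure, 2 if the job has not finished after
# MAX_POLLS checks.
wait_for_job() {
    status_cmd=$1
    interval=$2
    max_polls=$3
    polls=0
    while [ "$polls" -lt "$max_polls" ]; do
        # Left unquoted on purpose so a command with arguments can be passed.
        state=$($status_cmd)
        case "$state" in
            COMPLETED) return 0 ;;
            FAILED|TERMINATED) return 1 ;;
        esac
        polls=$((polls + 1))
        sleep "$interval"
    done
    return 2
}
```

A cron-driven wrapper could call this after launching the job and send an alert on any non-zero return code. Capping the number of polls stops a stalled job from blocking the next scheduled run.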
Use case #3: Inference Engine (ML)
• Inference Engine helps us scale audience data to 100M+ profiles
• In principle, extrapolates quiz profile data over user behaviour online
• At its heart, it’s a few hundred lines of Hive queries
• Every day, fetches users from Cassandra and sifts through events:
   • Update profiles for pages visited by profiled users yesterday
   • Update profiles for users based on their behaviour yesterday
• Input is about 2M users, 50M events; output is 5-10M user profiles
• Runs in < 3 hours with 10 large instances -> parallelises nicely
   • Could have used Apache Mahout, but it was single-threaded back then
• Biggest issues? Global sorts, running out of memory/disk on joins.
Tips and Tricks
• Performance related
   • On AWS, S3 is often the bottleneck. Use cc1.* or cc2.* instances.
       • Copy from S3 to internal table if you query it multiple times.
       • Use compression for output. Plenty of CPU cycles for this.
   • Use SequenceFile format and internal tables where applicable.
   • Use MapJoin wherever possible (SELECT /*+ MAPJOIN(table) */ ...).

   • Avoid SerDes and TRANSFORM mappers if possible.
   • Don’t sort (ORDER BY) unless you really have to: it forces a single reducer.
   • Partition your data (input and/or output) if you can.
• Might make your life easier
   • If queries start stalling, add more instances. Debugging is painful.
   • Use arguments to pass in commands / partitions (if you need to).
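
Two of the performance tips combine naturally: copy a day of data from S3 into an internal table and compress the output. A sketch of a shell helper that prints the HiveQL for this; the target table name events_local and the function name are assumptions for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print HiveQL that copies one day of the S3-backed
# events table into a compressed, internal SequenceFile table.
# The target table name (events_local) is an assumption for illustration.
stage_day_hql() {
    day=$1
    cat <<EOF
SET hive.exec.compress.output=true;
SET mapred.output.compression.type=BLOCK;

CREATE TABLE events_local
STORED AS SEQUENCEFILE
AS SELECT *
FROM events
WHERE d = '${day}';
EOF
}
```

Later queries would then read events_local from the cluster's own storage instead of going back to S3 each time. Note that SELECT * carries the partition column d over as an ordinary column.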
Q&A



• Thank you for your time!

• Hope this was a bit useful – let me know your feedback.

• Any questions?





First Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNA
