My life as a beekeeper
@89clouds
Who am I?
Pedro Figueiredo (pfig@89clouds.com)

Hadoop et al

Social/Facebook games, media (TV,
publishing)

Elastic MapReduce, Cloudera

NoSQL, as in “Not a SQL guy”
The problem with Hive

It looks like SQL
No, seriously
SELECT
  CONCAT(vishi,vislo),
  SUM(
    CASE WHEN searchengine = 'google'
       THEN 1
       ELSE 0
    END
  ) AS google_searches
FROM omniture
WHERE
  year(hittime) = 2011 AND
  month(hittime) = 8 AND
  is_search = 'Y'
GROUP BY CONCAT(vishi,vislo);
“It’s just like Oracle!”
Analysts will be very happy

At least until they join with that
30-billion-record table

Pro tip: explain MapReduce and then
MAPJOIN

 set
hive.mapjoin.smalltable.filesize=xxx;
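
The note quoted in the editor's notes says the automatic map-join conversion falls back to a common join once the combined size of the small tables exceeds a threshold, 25MB by default. A minimal Python sketch of that rule of thumb follows. It assumes you already know the table sizes in bytes; the helper name, table names and sizes are made up for illustration, and this is not Hive's own logic.

```python
# Hypothetical helper: a rule-of-thumb check, not Hive's actual implementation.
DEFAULT_SMALLTABLE_BYTES = 25_000_000  # the "25MB" default quoted in the notes


def mapjoin_setting(small_table_sizes, threshold=DEFAULT_SMALLTABLE_BYTES):
    """Return None if the small tables already fit under the threshold,
    otherwise the `set` statement that raises it just enough."""
    total = sum(small_table_sizes.values())
    if total <= threshold:
        return None
    return f"set hive.mapjoin.smalltable.filesize={total};"


# Made-up dimension tables and their sizes in bytes.
sizes = {"games": 4_000_000, "countries": 12_000, "events": 26_000_000}
print(mapjoin_setting(sizes))
```

Raising the threshold only helps if the small tables still fit in the memory of each task, so treat the generated value as a starting point.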
Your first interview question

 “Explain the difference
 between CREATE TABLE and
 CREATE EXTERNAL TABLE”
Dynamic partitions

Partitions are the poor person’s
indexes

Unstructured data is full of surprises
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.dynamic.partition=true;
set hive.exec.max.dynamic.partitions=100000;
set hive.exec.max.dynamic.partitions.pernode=100000;

Plan your partitions ahead
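
Planning ahead is mostly arithmetic. The sketch below reuses the example from the speaker notes (20 games, 40 events, 5 attributes on average, partitioned per day) and compares it with the limit of 100000 set on this slide. `days_per_insert` is a derived figure of my own, assuming the limit applies to a single INSERT statement.

```python
from math import prod

# Cardinalities from the speaker's example: date=/game=/event=/attr=
cardinalities = {"game": 20, "event": 40, "attr": 5}

partitions_per_day = prod(cardinalities.values())
partitions_per_year = partitions_per_day * 365

# Value used on the slide for hive.exec.max.dynamic.partitions.
max_dynamic_partitions = 100_000

# Assumption: the limit is counted per INSERT statement, so this is how many
# days of data one dynamic-partition INSERT could create before hitting it.
days_per_insert = max_dynamic_partitions // partitions_per_day

print(partitions_per_day, partitions_per_year, days_per_insert)
```

With these numbers a month-long backfill would have to be split into at least two statements, which is worth knowing before the job fails halfway through.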
Multi-vitamins

You can minimise input scans by using
multi-table INSERTs:

FROM input
INSERT INTO TABLE output1 SELECT foo
INSERT INTO TABLE output2 SELECT bar;
Persistence, do you speak it?
 External Hive metastore

 Avoid the pain of cluster set up

 Use an RDS metastore if on AWS, another
 RDBMS otherwise.

 10GB will get you a long way; this
 thing is tiny
Now you have 2 problems
Regular expressions are great, if
you’re using a real programming
language.

WHERE foo RLIKE '(a|b|c)' will hurt

WHERE foo='a' OR foo='b' OR foo='c'

Generate these statements if need be;
it will pay off.
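
Generating the equality chain takes a few lines in a real programming language. This is a small Python sketch, not part of the original talk. The function name is hypothetical, and it refuses values containing quotes or backslashes rather than guessing at escaping rules.

```python
def equality_predicate(column, values):
    """Build `(col = 'a' OR col = 'b' ...)` from a list of literal strings."""
    if not values:
        raise ValueError("need at least one value")
    for value in values:
        # Keep the sketch safe: do not try to escape awkward literals.
        if "'" in value or "\\" in value:
            raise ValueError(f"refusing to quote {value!r}")
    return "(" + " OR ".join(f"{column} = '{v}'" for v in values) + ")"


print("WHERE " + equality_predicate("foo", ["a", "b", "c"]))
```

The parentheses keep the generated predicate safe to combine with other conditions using AND.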
Avro

Serialisation framework (think
Thrift/Protocol Buffers).

Avro container files are
SequenceFile-like, splittable.

Support for snappy built-in.

If using the LinkedIn SerDe, the
table creation syntax changes.
Avro
CREATE EXTERNAL TABLE IF NOT EXISTS mytable
  PARTITIONED BY (ds STRING)
  ROW FORMAT SERDE
    'com.linkedin.haivvreo.AvroSerDe'
  WITH SERDEPROPERTIES ('schema.url'='hdfs:///user/hadoop/avro/myschema.avsc')
  STORED AS
    INPUTFORMAT
'com.linkedin.haivvreo.AvroContainerInputFormat'
    OUTPUTFORMAT
'com.linkedin.haivvreo.AvroContainerOutputFormat'
  LOCATION '/data/mytable'
;
MAKE! MONEY! FAST!


Use spot instances in EMR

Usually stick around until America
wakes up

Brilliant for worker nodes
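
The speaker notes quote a regular price of $0.38 against a spot price of $0.03 and advise bidding above the historical prices. A quick Python sketch of that arithmetic follows. The price history, the 20% margin and the function names are assumptions for illustration only.

```python
ON_DEMAND = 0.38  # regular hourly price from the talk notes (USD)
SPOT = 0.03       # spot hourly price from the talk notes (USD)


def saving_fraction(on_demand, spot):
    """Fraction of the on-demand price saved by paying the spot price."""
    return 1 - spot / on_demand


def suggested_bid(price_history, on_demand, margin=0.2):
    """Bid a margin above the historical peak, but never above on-demand."""
    return min(max(price_history) * (1 + margin), on_demand)


history = [0.03, 0.031, 0.045, 0.03]  # made-up hourly spot prices
print(f"saving: {saving_fraction(ON_DEMAND, SPOT):.0%}")
print(f"bid: ${suggested_bid(history, ON_DEMAND):.3f}")
```

Capping the bid at the on-demand price reflects the point at which a regular instance would be the cheaper and safer choice.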
Bag of tricks
set hive.optimize.s3.query=true;
set hive.cli.print.header=true;
set hive.exec.max.created.files=xxx;
set mapred.reduce.tasks=xxx;
set hive.exec.compress.intermediate=true;
set hive.exec.parallel=true;
To be or not to be
“Consider a traditional RDBMS”

At what size should we do this?

Hive is not an end, it’s the means

Data on HDFS/S3 is simply available,
not “available to Hive”

Hive isn’t suitable for near real
time
Hive != MapReduce

Don’t use Hive instead of Native/Streaming

“I know, I’ll just stream this bit
through a shell script!”

IMO, Hive excels at analysis and
aggregation, so use it for that
Thank you



Fred Easey (@poppa_f)

Peter Hanlon
Questions?

 pfig@89clouds.com
 @pfig / @89clouds


http://89clouds.com/

Editor's notes

  5. https://www.facebook.com/note.php?note_id=470667928919
     “Currently, if the total size of small tables is larger than 25MB, then the conditional task will choose the original common join to run. 25MB is a very conservative number and you can change this number with set hive.smalltable.filesize=30000000”
     SELECT /*+ MAPJOIN(f,b,g) */
     set hive.auto.convert.join = true;
     The property is hive.smalltable.filesize or hive.mapjoin.smalltable.filesize, depending on version
     set hive.mapjoin.localtask.max.memory.usage = 0.999;
  7. Also, there’s no UPDATE; you can only overwrite a whole table, so use partitions.
     E.g., 20 games with 40 events with 5 attrs on average, per day (date=/game=/event=/attr=): 1.46M partitions per year (4000/day).
     SET hive.exec.max.dynamic.partitions=100000;
     SET hive.exec.max.dynamic.partitions.pernode=100000;
     Avoid RECOVER PARTITIONS: generate a partition list and add the partitions statically, or use a persistent metastore.
  8. Or INSERT OVERWRITE. Append (INSERT INTO) is only available from 0.8 onwards.
     Obviously works with partitions, static (with the value in the INSERT statement) or dynamic, but:
     the dynamic partition columns must be specified last among the columns in the SELECT statement and in the same order in which they appear in the PARTITION() clause.
  11. Untagged data: Since the schema is present when data is read, considerably less type information need be encoded with data, resulting in smaller serialization size.
      No manually-assigned field IDs: When a schema changes, both the old and new schema are always present when processing data, so differences may be resolved symbolically, using field names.
      The schema (defined in JSON) is included in the data files.
      Hive >= 0.9.1
  12. The new SerDe uses TBLPROPERTIES and avro.schema.url / avro.schema.literal. Its class is org.apache.hadoop.hive.serde2.avro.AvroSerDe.
      Also, the statement order is important!
      One more thing: 1.6.x won’t read files created with 1.7.x. CDH3 up to u3 comes with 1.6.0, so be conservative.
  13. Look at the historical prices and bid above them.
      Regular price: $0.38, spot: $0.03
  14. These give you the number of slots per node; adjust the above accordingly:
      mapred.tasktracker.map.tasks.maximum
      mapred.tasktracker.reduce.tasks.maximum
      Watch the memory you give the JVM if you change these.
      mapred.output.compress.*
      hive.exec.parallel.thread.number
      https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties
  20. When using an RDBMS, it’s much harder to get at your data from other tools
  21. Convoluted, long-winded code
      Reporting is hard
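
Note 7 suggests generating a partition list and adding the partitions statically instead of relying on RECOVER PARTITIONS. One way to do that is sketched below in Python. It assumes a table partitioned by a single `ds` date string, as in the Avro example, with one directory per day under the table location. The function name and directory layout are assumptions, not something shown in the talk.

```python
from datetime import date, timedelta


def add_partition_statements(table, location, start, days):
    """Yield one ALTER TABLE ... ADD PARTITION statement per day."""
    for offset in range(days):
        ds = (start + timedelta(days=offset)).isoformat()
        yield (
            f"ALTER TABLE {table} ADD IF NOT EXISTS "
            f"PARTITION (ds='{ds}') LOCATION '{location}/ds={ds}';"
        )


# Table name and location borrowed from the Avro slide; dates are arbitrary.
for stmt in add_partition_statements("mytable", "/data/mytable", date(2011, 8, 1), 3):
    print(stmt)
```

The output can be saved to a file and run with the Hive CLI, so a freshly started cluster knows its partitions without scanning the storage layer.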