My life as a beekeeper


Your Hive honeymoon can be cut short if you don't take the necessary precautions. In this talk I'll share my experience with Hive in the last 3 years (in Elastic MapReduce and Cloudera CDH3), describing what I got wrong the first time around, and what eventually saved the day. I've used Hive in environments with a number of events ranging from a few million to a few billion a day, so hopefully there'll be something for everyone.


Who am I?
Pedro Figueiredo (pfig@89clouds.com)

Hadoop et al

Social/Facebook games, media (TV, publishing)

Elastic MapReduce, Cloudera

NoSQL, as in “Not a SQL guy”

The problem with Hive

It looks like SQL

No, seriously
SELECT
  CONCAT(vishi, vislo),
  SUM(
    CASE WHEN searchengine = 'google'
      THEN 1
      ELSE 0
    END
  ) AS google_searches
FROM omniture
WHERE
  year(hittime) = 2011 AND
  month(hittime) = 8 AND
  is_search = 'Y'
GROUP BY CONCAT(vishi, vislo);

“It’s just like Oracle!”
Analysts will be very happy

At least until they join with that 30 billion-record table

Pro tip: explain MapReduce and then MAPJOIN

set hive.mapjoin.smalltable.filesize=xxx;
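
As a rough sketch of what that pro tip looks like in practice: a map join loads the small table into memory on each mapper, so the big table never has to be shuffled to reducers. The table and column names below (events, games, game_id, name) are hypothetical. The hint syntax and hive.auto.convert.join are standard Hive, but check which small-table size property your version expects.

```sql
-- Hypothetical tables: events (billions of rows), games (a few hundred rows).

-- Option 1: request the map join explicitly with a hint on the small table.
SELECT /*+ MAPJOIN(g) */ g.name, COUNT(*) AS event_count
FROM events e
JOIN games g ON (e.game_id = g.id)
GROUP BY g.name;

-- Option 2: let Hive convert the join itself when the small table
-- is under the configured size threshold.
set hive.auto.convert.join=true;
SELECT g.name, COUNT(*) AS event_count
FROM events e
JOIN games g ON (e.game_id = g.id)
GROUP BY g.name;
```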

Your first interview question

“Explain the difference between CREATE TABLE and CREATE EXTERNAL TABLE”
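
For reference, a sketch of the answer, assuming made-up table names and a made-up path: Hive owns the data of a managed table, so DROP TABLE deletes the files. For an external table Hive only owns the metadata, so DROP TABLE leaves the files at LOCATION untouched.

```sql
-- Managed table: the data lives under the Hive warehouse directory.
CREATE TABLE clicks_managed (user_id STRING, url STRING);

-- External table: Hive only records where the data is.
CREATE EXTERNAL TABLE clicks_external (user_id STRING, url STRING)
LOCATION '/data/clicks';     -- hypothetical path

DROP TABLE clicks_managed;   -- removes the metadata AND the data files
DROP TABLE clicks_external;  -- removes the metadata only; /data/clicks stays in place
```

This is why external tables are the usual choice for data that other tools also read.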

Dynamic partitions

Partitions are the poor person’s indexes

Unstructured data is full of surprises
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.dynamic.partition=true;
set hive.exec.max.dynamic.partitions=100000;
set hive.exec.max.dynamic.partitions.pernode=100000;

Plan your partitions ahead
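
A minimal sketch of a dynamic-partition insert that relies on the settings above. The tables raw_events and events_by_day and their columns are invented for illustration. Note that the partition columns come last in the SELECT, in the same order as in the PARTITION clause.

```sql
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- Hypothetical target table, partitioned by day and game.
CREATE TABLE IF NOT EXISTS events_by_day (user_id STRING, event STRING)
PARTITIONED BY (dt STRING, game STRING);

-- Hive creates one partition per distinct (dt, game) pair found in the input.
INSERT OVERWRITE TABLE events_by_day PARTITION (dt, game)
SELECT user_id, event, to_date(hittime) AS dt, game
FROM raw_events;
```

Every extra partition column multiplies the number of partitions, so it is worth estimating the count per day before settling on a layout.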

Multi-vitamins

You can minimise input scans by using multi-table INSERTs:

FROM input
INSERT INTO TABLE output1 SELECT foo
INSERT INTO TABLE output2 SELECT bar;

Persistence, do you speak it?
External Hive metastore

Avoid the pain of cluster set up

Use an RDS metastore if on AWS, another RDBMS otherwise.

10GB will get you a long way; this thing is tiny

Now you have 2 problems
Regular expressions are great, if you’re using a real programming language.

WHERE foo RLIKE '(a|b|c)' will hurt

Use WHERE foo='a' OR foo='b' OR foo='c' instead

Generate these statements if need be; it will pay off.
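
One caveat when doing this rewrite, sketched below with a hypothetical table t: RLIKE is true if any substring matches, so the unanchored pattern above also accepts values such as 'abc'. The equality chain is only equivalent to the anchored form.

```sql
-- Matches any foo that merely contains a, b or c.
SELECT COUNT(*) FROM t WHERE foo RLIKE '(a|b|c)';

-- Equivalent pair: the anchored regex (the form the slide warns about)
-- and plain comparisons (the form the slide recommends).
SELECT COUNT(*) FROM t WHERE foo RLIKE '^(a|b|c)$';
SELECT COUNT(*) FROM t WHERE foo = 'a' OR foo = 'b' OR foo = 'c';
```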

Avro

Serialisation framework (think Thrift/Protocol Buffers).

Avro container files are SequenceFile-like, splittable.

Snappy support is built in.

If using the LinkedIn SerDe, the table creation syntax changes.

Avro
CREATE EXTERNAL TABLE IF NOT EXISTS mytable
PARTITIONED BY (ds STRING)
ROW FORMAT SERDE
  'com.linkedin.haivvreo.AvroSerDe'
WITH SERDEPROPERTIES ('schema.url'='hdfs:///user/hadoop/avro/myschema.avsc')
STORED AS
  INPUTFORMAT 'com.linkedin.haivvreo.AvroContainerInputFormat'
  OUTPUTFORMAT 'com.linkedin.haivvreo.AvroContainerOutputFormat'
LOCATION '/data/mytable';
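
For comparison, a sketch of the same table using the Avro SerDe that ships with later Hive releases, which takes the schema from TBLPROPERTIES rather than SERDEPROPERTIES. The table name, schema path and location are carried over from the example above; verify the class names against your Hive version before relying on them.

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS mytable
PARTITIONED BY (ds STRING)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/data/mytable'
-- The schema location moves here instead of WITH SERDEPROPERTIES.
TBLPROPERTIES ('avro.schema.url'='hdfs:///user/hadoop/avro/myschema.avsc');
```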

MAKE! MONEY! FAST!

Use spot instances in EMR

They usually stick around until America wakes up

Brilliant for worker nodes

Bag of tricks
set hive.optimize.s3.query=true;
set hive.cli.print.header=true;
set hive.exec.max.created.files=xxx;
set mapred.reduce.tasks=xxx;
set hive.exec.compress.intermediate=true;
set hive.exec.parallel=true;

To be or not to be
“Consider a traditional RDBMS”

At what size should we do this?

Hive is not an end, it’s the means

Data on HDFS/S3 is simply available, not “available to Hive”

Hive isn’t suitable for near real time

Hive != MapReduce

Don’t use Hive instead of Native/Streaming

“I know, I’ll just stream this bit through a shell script!”

IMO, Hive excels at analysis and aggregation, so use it for that

Thank you

Fred Easey (@poppa_f)

Peter Hanlon

Questions?

pfig@89clouds.com
@pfig / @89clouds

http://89clouds.com/

Editor's notes

“It’s just like Oracle!”
https://www.facebook.com/note.php?note_id=470667928919
“Currently, if the total size of small tables is larger than 25MB, then the conditional task will choose the original common join to run. 25MB is a very conservative number and you can change this number with set hive.smalltable.filesize=30000000”
SELECT /*+ MAPJOIN(f,b,g) */
set hive.auto.convert.join = true;
The property is hive.smalltable.filesize or hive.mapjoin.smalltable.filesize, depending on version
set hive.mapjoin.localtask.max.memory.usage = 0.999;

Dynamic partitions
Also, there’s no UPDATE; you can only overwrite a whole table, so use partitions.
E.g., 20 games with 40 events with 5 attrs on average, per day (date=/game=/event=/attr=): 4,000 partitions per day, 1.46M partitions per year.
SET hive.exec.max.dynamic.partitions=100000;
SET hive.exec.max.dynamic.partitions.pernode=100000;
Avoid RECOVER PARTITIONS: generate a partition list and add the partitions statically, or use a persistent metastore.

Multi-vitamins
Or INSERT OVERWRITE. Append (INSERT INTO) is only available from 0.8 onwards.
Obviously works with partitions, static (with the value in the INSERT statement) or dynamic, but the dynamic partition columns must be specified last among the columns in the SELECT statement and in the same order in which they appear in the PARTITION() clause.

Avro
Untagged data: Since the schema is present when data is read, considerably less type information need be encoded with data, resulting in smaller serialization size.
No manually-assigned field IDs: When a schema changes, both the old and new schema are always present when processing data, so differences may be resolved symbolically, using field names.
The schema (defined in JSON) is included in the data files.
Hive >= 0.9.1

Avro (table creation)
The new SerDe (org.apache.hadoop.hive.serde2.avro.AvroSerDe) uses TBLPROPERTIES with avro.schema.url or avro.schema.literal.
Also, the statement order is important!
One more thing: Avro 1.6.x won’t read files created with 1.7.x. CDH3 up to u3 comes with 1.6.0, so be conservative.

MAKE! MONEY! FAST!
Look at the historical prices and bid above them.
Regular price: $0.38, spot: $0.03

Bag of tricks
These give you the number of slots per node; adjust the above accordingly:
mapred.tasktracker.map.tasks.maximum
mapred.tasktracker.reduce.tasks.maximum
Watch the memory you give the JVM if you change these.
mapred.output.compress.*
hive.exec.parallel.thread.number
https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties

To be or not to be
When using an RDBMS, it’s much harder to get at your data from other tools.

Hive != MapReduce
Convoluted, long-winded code
Reporting is hard
