Future of HCatalog - Hadoop Summit 2012

•Download as PPTX, PDF•

19 likes•6,627 views

Hortonworks

Technology

Future of HCatalog
Alan F. Gates
@alanfgates

Page 1

Who Am I?
• HCatalog committer and mentor
• Co-founder of Hortonworks
• Lead for Pig, Hive, and HCatalog at Hortonworks
• Pig committer and PMC Member
• Member of Apache Software Foundation and Incubator
PMC
• Author of Programming Pig from O’Reilly

© Hortonworks Inc. 2012
Page 2

Hadoop Ecosystem

MapReduce Hive Pig

SerDe
InputFormat/ InputFormat/ Load/
Metastore Client
OuputFormat OuputFormat Store

HDFS
Metastore

© Hortonworks 2012
Page 3

Opening up Metadata to MR & Pig

MapReduce Hive Pig

HCaInputFormat/ HCatLoader/
HCatOuputFormat HCatStorer

SerDe
InputFormat/
Metastore Client
OuputFormat

HDFS
Metastore

© Hortonworks 2012
Page 4

Templeton - REST API
• REST endpoints: databases, tables, partitions, columns, table properties
• PUT to create/update, GET to list or describe, DELETE to drop

Get a list of all tables in the default database:

GET
http://…/v1/ddl/database/default/table
Hadoop/
HCatalog
{
"tables": ["counted","processed",],
"database": "default"
}

© Hortonworks 2012
Page 5

Reading and Writing Data in Parallel
• Use Case: Users want
– to read and write records in parallel between Hadoop and their parallel system
– driven by their system
– in a language independent way
– without needing to understand Hadoop’s file formats
• Example: an MPP data store wants to read data out of Hadoop as
HCatRecords for its parallel jobs
• What exists today
– webhdfs
– Language independent
– Can move data in parallel
– Driven from the user side
– Moves only bytes, no understanding of file format
– Sqoop
– Can move data in parallel
– Understands data format
– Driven from Hadoop side
– Requires connector or JDBC

© 2012 Hortonworks
Page 8

HCatReader and HCatWriter

getHCatReader
Master HCatalog
HCatReader

read
Input Slave
Splits Iterator<HCatRecord>

read
Slave HDFS
Iterator<HCatRecord>

read
Slave
Iterator<HCatRecord>

Right now all in Java, needs to be REST
© 2012 Hortonworks
Page 9

Storing Semi-/Unstructured Data

Table Users File Users
Name Zip {"name":"alice","zip":"93201"}
Alice 93201 {"name":"bob”,"zip":"76331"}
Bob 76331 {"name":"cindy"}
{"zip":"87890"}

select name, zip A = load ‘Users’ as
from users; (name:chararray, zip:chararray);
B = foreach A generate name, zip;

© Hortonworks Inc. 2012
Page 10

Hive ODBC/JDBC Today

Issue: Have to have Hive
JDBC code on the client
Client

Hive Hadoop
Server

Issues:
• Not concurrent
ODBC • Not secure
Client • Not scalable

Issue: Open source version
not easy to use

© 2012 Hortonworks
Page 13

ODBC/JDBC Proposal

JDBC
Client

Provide robust open source REST Hadoop
implementations Server

• Spawns job inside cluster
ODBC • Runs job as submitting user
Client • Works with security
• Scaling web services well understood

© 2012 Hortonworks
Page 14

What's hot

HCatalogGetInData

Introduction to Hive and HCatalogmarkgrover

Hadoop hive presentationArvind Kumar

YARN - Strata 2014Hortonworks

Apache DrillBig Data User Group Karlsruhe/Stuttgart

Apache HiveAjit Koti

Learning Apache HIVE - Data Warehouse and Query Language for HadoopSomeshwar Kale

Introduction to Apache Hive(Big Data, Final Seminar)Takrim Ul Islam Laskar

Data Discovery on Hadoop - Realizing the Full Potential of your DataDataWorks Summit

Session 01 - Into to HadoopAnandMHadoop

Hive Anatomynzhang

מיכאלsqlserver.co.il

Accessing external hadoop data sources using pivotal e xtension framework (px...Sameer Tiwari

HADOOP TECHNOLOGY pptsravya raju

Hadoop pigSean Murphy

Mar 2012 HUG: Hive with HBaseYahoo Developer Network

Basics of big data analytics hadoopAmbuj Kumar

Hive HadoopFarafekr Technology Ltd.

Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...Simplilearn

HBaseCon 2013: Honeycomb - MySQL Backed by Apache HBase Cloudera, Inc.

What's hot (20)

HCatalog

Introduction to Hive and HCatalog

Hadoop hive presentation

YARN - Strata 2014

Apache Drill

Apache Hive

Learning Apache HIVE - Data Warehouse and Query Language for Hadoop

Introduction to Apache Hive(Big Data, Final Seminar)

Data Discovery on Hadoop - Realizing the Full Potential of your Data

Session 01 - Into to Hadoop

Hive Anatomy

מיכאל

Accessing external hadoop data sources using pivotal e xtension framework (px...

HADOOP TECHNOLOGY ppt

Hadoop pig

Mar 2012 HUG: Hive with HBase

Basics of big data analytics hadoop

Hive Hadoop

Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...

HBaseCon 2013: Honeycomb - MySQL Backed by Apache HBase

Viewers also liked

20120830 DBリファクタリング読書会第三回都元ダイスケ Miyamoto

【17-E-3】オンライン機械学習で実現する大規模データ処理Developers Summit

Database smellsMikiya Okuno

Cloudera Manager4.0とNameNode-HAセミナー資料Cloudera Japan

Writing Yarn Applications Hadoop Summit 2012Hortonworks

Data analytics with hadoop hive on multiple data centersHirotaka Niisato

Lars George HBase Seminar with O'REILLY Oct.12 2012Cloudera Japan

並列データベースシステムの概念と原理Makoto Yui

PostgreSQLの実行計画を読み解こう(OSC2015 Spring/Tokyo)Satoshi Yamada

あなたの知らないPostgreSQL監視の世界Yoshinori Nakanishi

【SQLインジェクション対策】徳丸先生に怒られない、動的SQLの安全な組み立て方kwatch

Hadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at TwitterBill Graham

SQLチューニング入門入門編Miki Shimogai

Datalogからsqlへのトランスレータを書いた話Yuki Takeichi

ならば（その弐）Tomoaki Hiramoto

PostgreSQLクエリ実行の基礎知識～Explainを読み解こう～Miki Shimogai

Viewers also liked (16)

20120830 DBリファクタリング読書会第三回

【17-E-3】オンライン機械学習で実現する大規模データ処理

Database smells

Cloudera Manager4.0とNameNode-HAセミナー資料

Writing Yarn Applications Hadoop Summit 2012

Data analytics with hadoop hive on multiple data centers

Lars George HBase Seminar with O'REILLY Oct.12 2012

並列データベースシステムの概念と原理

PostgreSQLの実行計画を読み解こう(OSC2015 Spring/Tokyo)

あなたの知らないPostgreSQL監視の世界

【SQLインジェクション対策】徳丸先生に怒られない、動的SQLの安全な組み立て方

Hadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at Twitter

SQLチューニング入門入門編

Datalogからsqlへのトランスレータを書いた話

ならば（その弐）

PostgreSQLクエリ実行の基礎知識～Explainを読み解こう～

Similar to Future of HCatalog - Hadoop Summit 2012

H cat berlinbuzzwords2012Hortonworks

Sql saturday pig session (wes floyd) v2Wes Floyd

HUG Meetup 2013: HCatalog / Hive Data Out Sumeet Singh

TriHUG November HCatalog Talk by Alan Gatestrihug

2013 feb 20_thug_h_catalogAdam Muise

A Reference Architecture for ETL 2.0 DataWorks Summit

Yahoo! Hack Europe WorkshopHortonworks

The other Apache technologies your big data solution needs!gagravarr

Hive 3 a new horizonArtem Ervits

Apache HBase + Spark: Leveraging your Non-Relational Datastore in Batch and S...DataWorks Summit/Hadoop Summit

App cap2956v2-121001194956-phpapp01 (1)outstanding59

Inside the Hadoop Machine @ VMworldRichard McDougall

App Cap2956v2 121001194956 Phpapp01 (1)outstanding59

Hadoop Summit San Jose 2014: Data Discovery on Hadoop Sumeet Singh

Data discoveryonhadoop@yahoo! hadoopsummit2014thiruvel

Hadoop past, present and futureCodemotion

Big Data Hoopla Simplified - TDWI Memphis 2014Rajan Kanitkar

Hoya for Code ReviewSteve Loughran

Big data, just an introduction to Hadoop and Scripting LanguagesCorley S.r.l.

06 pig-01-introAasim Naveed

Similar to Future of HCatalog - Hadoop Summit 2012 (20)

H cat berlinbuzzwords2012

Sql saturday pig session (wes floyd) v2

HUG Meetup 2013: HCatalog / Hive Data Out

TriHUG November HCatalog Talk by Alan Gates

2013 feb 20_thug_h_catalog

A Reference Architecture for ETL 2.0

Yahoo! Hack Europe Workshop

The other Apache technologies your big data solution needs!

Hive 3 a new horizon

Apache HBase + Spark: Leveraging your Non-Relational Datastore in Batch and S...

App cap2956v2-121001194956-phpapp01 (1)

Inside the Hadoop Machine @ VMworld

App Cap2956v2 121001194956 Phpapp01 (1)

Hadoop Summit San Jose 2014: Data Discovery on Hadoop

Data discoveryonhadoop@yahoo! hadoopsummit2014

Hadoop past, present and future

Big Data Hoopla Simplified - TDWI Memphis 2014

Hoya for Code Review

Big data, just an introduction to Hadoop and Scripting Languages

06 pig-01-intro

Recently uploaded

"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays

Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge

Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren

Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software

Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi

What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett

Anypoint Exchange: It’s Not Just a Repo!Manik S Magar

New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada

Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar

Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson

Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed

Powerpoint exploring the locations used in television show Time Clashcharlottematthew16

Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation

Gen AI in Business - Global Trends Report 2024.pdfAddepto

Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity

DevEX - reference for building teams, processes, and platformsSergiu Bodiu

Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited

Commit 2024 - Secret Management made easyAlfredo García Lavilla

The Future of Software Development - Devin AI Innovative Approach.pdfSeasiaInfotech2

Search Engine Optimization SEO PDF for 2024.pdfRankYa

Recently uploaded (20)

"Debugging python applications inside k8s environment", Andrii Soldatenko

Designing IA for AI - Information Architecture Conference 2024

Advanced Test Driven-Development @ php[tek] 2024

Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation

Vertex AI Gemini Prompt Engineering Tips

What's New in Teams Calling, Meetings and Devices March 2024

Anypoint Exchange: It’s Not Just a Repo!

New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024

Unleash Your Potential - Namagunga Girls Coding Club

Are Multi-Cloud and Serverless Good or Bad?

Scanning the Internet for External Cloud Exposures via SSL Certs

Powerpoint exploring the locations used in television show Time Clash

Connect Wave/ connectwave Pitch Deck Presentation

Gen AI in Business - Global Trends Report 2024.pdf

Dev Dives: Streamline document processing with UiPath Studio Web

DevEX - reference for building teams, processes, and platforms

Ensuring Technical Readiness For Copilot in Microsoft 365

Commit 2024 - Secret Management made easy

The Future of Software Development - Devin AI Innovative Approach.pdf

Search Engine Optimization SEO PDF for 2024.pdf

Future of HCatalog - Hadoop Summit 2012

1. Future of HCatalog Alan F. Gates @alanfgates Page 1

2. Who Am I? • HCatalog committer and mentor • Co-founder of Hortonworks • Lead for Pig, Hive, and HCatalog at Hortonworks • Pig committer and PMC Member • Member of Apache Software Foundation and Incubator PMC • Author of Programming Pig from O’Reilly © Hortonworks Inc. 2012 Page 2

3. Hadoop Ecosystem MapReduce Hive Pig SerDe InputFormat/ InputFormat/ Load/ Metastore Client OuputFormat OuputFormat Store HDFS Metastore © Hortonworks 2012 Page 3

4. Opening up Metadata to MR & Pig MapReduce Hive Pig HCaInputFormat/ HCatLoader/ HCatOuputFormat HCatStorer SerDe InputFormat/ Metastore Client OuputFormat HDFS Metastore © Hortonworks 2012 Page 4

5. Templeton - REST API • REST endpoints: databases, tables, partitions, columns, table properties • PUT to create/update, GET to list or describe, DELETE to drop Get a list of all tables in the default database: GET http://…/v1/ddl/database/default/table Hadoop/ HCatalog { "tables": ["counted","processed",], "database": "default" } © Hortonworks 2012 Page 5

6. Templeton - REST API • REST endpoints: databases, tables, partitions, columns, table properties • PUT to create/update, GET to list or describe, DELETE to drop Create new table “rawevents” PUT {"columns": [{ "name": "url", "type": "string" }, { "name": "user", "type": "string"}], "partitionedBy": [{ "name": "ds", "type": "string" }]} http://…/v1/ddl/database/default/table/rawevents Hadoop/ HCatalog { "table": "rawevents", "database": "default” } © Hortonworks 2012 Page 6

7. Templeton - REST API • REST endpoints: databases, tables, partitions, columns, table properties • PUT to create/update, GET to list or describe, DELETE to drop Describe table “rawevents” GET http://…/v1/ddl/database/default/table/rawevents Hadoop/ HCatalog { "columns": [{"name": "url","type": "string"}, {"name": "user","type": "string"}], "database": "default", "table": "rawevents" } • Included in HDP • Not yet checked in, but you can find the code on Apache’s JIRA HCATALOG-182 © Hortonworks 2012 Page 7

8. Reading and Writing Data in Parallel • Use Case: Users want – to read and write records in parallel between Hadoop and their parallel system – driven by their system – in a language independent way – without needing to understand Hadoop’s file formats • Example: an MPP data store wants to read data out of Hadoop as HCatRecords for its parallel jobs • What exists today – webhdfs – Language independent – Can move data in parallel – Driven from the user side – Moves only bytes, no understanding of file format – Sqoop – Can move data in parallel – Understands data format – Driven from Hadoop side – Requires connector or JDBC © 2012 Hortonworks Page 8

9. HCatReader and HCatWriter getHCatReader Master HCatalog HCatReader read Input Slave Splits Iterator<HCatRecord> read Slave HDFS Iterator<HCatRecord> read Slave Iterator<HCatRecord> Right now all in Java, needs to be REST © 2012 Hortonworks Page 9

10. Storing Semi-/Unstructured Data Table Users File Users Name Zip {"name":"alice","zip":"93201"} Alice 93201 {"name":"bob”,"zip":"76331"} Bob 76331 {"name":"cindy"} {"zip":"87890"} select name, zip A = load ‘Users’ as from users; (name:chararray, zip:chararray); B = foreach A generate name, zip; © Hortonworks Inc. 2012 Page 10

11. Storing Semi-/Unstructured Data Table Users File Users Name Zip {"name":"alice","zip":"93201"} Alice 93201 {"name":"bob”,"zip":"76331"} Bob 76331 {"name":"cindy"} {"zip":"87890"} A = load ‘Users’ as (name:chararray, zip:chararray); B = foreach A generate name, zip; select name, zip from users; A = load ‘Users’ B = foreach A generate name, zip; © Hortonworks Inc. 2012 Page 11

12. Storing Semi-/Unstructured Data Table Users File Users Name Zip {"name":"alice","zip":"93201"} Alice 93201 {"name":"bob”,"zip":"76331"} Bob 76331 {"name":"cindy"} {"zip":"87890"} A = load ‘Users’ as (name:chararray, zip:chararray); B = foreach A generate name, zip; select name, zip A = load ‘Users’ from users; B = foreach A generate name, zip; © Hortonworks Inc. 2012 Page 12

13. Hive ODBC/JDBC Today Issue: Have to have Hive JDBC code on the client Client Hive Hadoop Server Issues: • Not concurrent ODBC • Not secure Client • Not scalable Issue: Open source version not easy to use © 2012 Hortonworks Page 13

14. ODBC/JDBC Proposal JDBC Client Provide robust open source REST Hadoop implementations Server • Spawns job inside cluster ODBC • Runs job as submitting user Client • Works with security • Scaling web services well understood © 2012 Hortonworks Page 14

Editor's Notes

SQL and traditional relational tables are focused on data warehousing, with consistently structured data (ie every tuple is the same)Much of the strength of Pig and Hadoop is the ability to process the vast amounts of semi/unstructured dataWith HCat we have made it easier for Pig and MR users to interact with data in the data warehouseNeed to make it go the other way as wellGood news is most of the pieces are in place, just need to tie a few things togetherObservation: much of the semi/unstructured data records its own structure (PB, Thrift, Avro, JSON, etc.)
SQL and traditional relational tables are focused on data warehousing, with consistently structured data (ie every tuple is the same)Much of the strength of Pig and Hadoop is the ability to process the vast amounts of semi/unstructured dataWith HCat we have made it easier for Pig and MR users to interact with data in the data warehouseNeed to make it go the other way as wellGood news is most of the pieces are in place, just need to tie a few things togetherObservation: much of the semi/unstructured data records its own structure (PB, Thrift, Avro, JSON, etc.)
SQL and traditional relational tables are focused on data warehousing, with consistently structured data (ie every tuple is the same)Much of the strength of Pig and Hadoop is the ability to process the vast amounts of semi/unstructured dataWith HCat we have made it easier for Pig and MR users to interact with data in the data warehouseNeed to make it go the other way as wellGood news is most of the pieces are in place, just need to tie a few things togetherObservation: much of the semi/unstructured data records its own structure (PB, Thrift, Avro, JSON, etc.)
Not concurrent: runs one query at a timeNot secure: runs as Hive user

Future of HCatalog - Hadoop Summit 2012

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (16)

Similar to Future of HCatalog - Hadoop Summit 2012

Similar to Future of HCatalog - Hadoop Summit 2012 (20)

More from Hortonworks

More from Hortonworks (20)

Recently uploaded

Recently uploaded (20)

Future of HCatalog - Hadoop Summit 2012

Editor's Notes