Future of HCatalog

2. Who Am I?
• HCatalog committer and mentor
• Co-founder of Hortonworks
• Lead for Pig, Hive, and HCatalog at Hortonworks
• Pig committer and PMC Member
• Member of Apache Software Foundation and Incubator PMC
• Author of Programming Pig from O’Reilly
© Hortonworks Inc. 2012
3. Hadoop Ecosystem
[Diagram] MapReduce, Hive, and Pig all sit on top of HDFS. MapReduce reads and writes HDFS through InputFormat/OutputFormat. Hive reads and writes through its SerDe plus InputFormat/OutputFormat, and talks to the Metastore through a Metastore client. Pig reads and writes through Load/Store functions. Only Hive can reach the Metastore.
4. Opening up Metadata to MR & Pig
[Diagram] HCatalog gives MapReduce HCatInputFormat/HCatOutputFormat and gives Pig HCatLoader/HCatStorer. Both are built on Hive's SerDe, InputFormat/OutputFormat, and Metastore client, so MapReduce, Pig, and Hive all share the Metastore and HDFS.
5. Templeton - REST API
• REST endpoints: databases, tables, partitions, columns, table properties
• PUT to create/update, GET to list or describe, DELETE to drop
Get a list of all tables in the default database:

GET http://…/v1/ddl/database/default/table

Response:

{
  "tables": ["counted", "processed"],
  "database": "default"
}
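As a sketch of the client side, the response above is plain JSON, so any HTTP client can consume it. The host is elided in the slide; `<templeton-server>` below is a placeholder, and the response body is the one shown above:

```python
import json

# Response body from: GET http://<templeton-server>/v1/ddl/database/default/table
# (<templeton-server> stands in for the host elided in the slide.)
body = '{"tables": ["counted", "processed"], "database": "default"}'

resp = json.loads(body)
tables = resp["tables"]  # table names in the "default" database
```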
6. Templeton - REST API
• REST endpoints: databases, tables, partitions, columns, table properties
• PUT to create/update, GET to list or describe, DELETE to drop
Create new table “rawevents”:

PUT http://…/v1/ddl/database/default/table/rawevents

{"columns": [{ "name": "url", "type": "string" },
             { "name": "user", "type": "string" }],
 "partitionedBy": [{ "name": "ds", "type": "string" }]}

Response:

{
  "table": "rawevents",
  "database": "default"
}
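A sketch of how a client might build that PUT body: construct the table description as a plain dict and serialize it to JSON. The host is elided in the slide, so `<templeton-server>` in the comment is a placeholder:

```python
import json

# Sketch: request body for
#   PUT http://<templeton-server>/v1/ddl/database/default/table/rawevents
# (<templeton-server> stands in for the host elided in the slide.)
payload = {
    "columns": [
        {"name": "url", "type": "string"},
        {"name": "user", "type": "string"},
    ],
    "partitionedBy": [{"name": "ds", "type": "string"}],
}
body = json.dumps(payload)  # send as the PUT request body
```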
7. Templeton - REST API
• REST endpoints: databases, tables, partitions, columns, table properties
• PUT to create/update, GET to list or describe, DELETE to drop
Describe table “rawevents”:

GET http://…/v1/ddl/database/default/table/rawevents

Response:

{
  "columns": [{"name": "url", "type": "string"},
              {"name": "user", "type": "string"}],
  "database": "default",
  "table": "rawevents"
}
• Included in HDP
• Not yet checked in, but you can find the code on Apache’s JIRA HCATALOG-182
8. Reading and Writing Data in Parallel
• Use Case: Users want
– to read and write records in parallel between Hadoop and their parallel system
– driven by their system
– in a language independent way
– without needing to understand Hadoop’s file formats
• Example: an MPP data store wants to read data out of Hadoop as HCatRecords for its parallel jobs
• What exists today
  – webhdfs
    - Language independent
    - Can move data in parallel
    - Driven from the user side
    - Moves only bytes, no understanding of file format
  – Sqoop
    - Can move data in parallel
    - Understands data format
    - Driven from Hadoop side
    - Requires connector or JDBC
9. HCatReader and HCatWriter
[Diagram] The master calls getHCatReader against HCatalog to obtain an HCatReader, which supplies input splits. Each slave takes one split, calls read, and gets back an Iterator&lt;HCatRecord&gt; over the records stored in HDFS.
Right now this is all in Java; it needs to be REST.
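A rough sketch of the master/slave flow described above, in Python. Every name here is a stand-in for illustration only; the real HCatReader/HCatWriter API is Java and looks different:

```python
# Illustrative sketch of the split-based parallel read flow.
# All names are hypothetical; the real API is Java (HCatReader, HCatRecord).

class Split:
    """One chunk of a table that a single slave can read independently."""
    def __init__(self, records):
        self.records = records

class HCatReaderSketch:
    """Stands in for the HCatReader the master obtains from HCatalog."""
    def __init__(self, table):
        self.table = table

    def get_splits(self, n):
        # Divide the table into n splits, one per slave.
        return [Split(self.table[i::n]) for i in range(n)]

def slave_read(split):
    # Each slave iterates its split's records
    # (the Iterator<HCatRecord> in the diagram).
    return iter(split.records)

# Master side: get a reader, hand one split to each of 3 slaves.
table = [{"url": "a"}, {"url": "b"}, {"url": "c"}, {"url": "d"}]
reader = HCatReaderSketch(table)
splits = reader.get_splits(3)

# Each slave reads its own split; together they cover the whole table.
records = [rec for s in splits for rec in slave_read(s)]
```

The design point is that the splits, not the data, flow through the master: each slave pulls its own records directly, so the transfer runs in parallel.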
10. Storing Semi-/Unstructured Data
Table Users:
  Name   Zip
  Alice  93201
  Bob    76331

File Users:
  {"name":"alice","zip":"93201"}
  {"name":"bob","zip":"76331"}
  {"name":"cindy"}
  {"zip":"87890"}

SQL:
  select name, zip from users;

Pig:
  A = load 'Users' as (name:chararray, zip:chararray);
  B = foreach A generate name, zip;
11. Storing Semi-/Unstructured Data
Table Users:
  Name   Zip
  Alice  93201
  Bob    76331

File Users:
  {"name":"alice","zip":"93201"}
  {"name":"bob","zip":"76331"}
  {"name":"cindy"}
  {"zip":"87890"}

SQL:
  select name, zip from users;

Pig, with a schema:
  A = load 'Users' as (name:chararray, zip:chararray);
  B = foreach A generate name, zip;

Pig, without a schema:
  A = load 'Users';
  B = foreach A generate name, zip;
12. Storing Semi-/Unstructured Data
Table Users:
  Name   Zip
  Alice  93201
  Bob    76331

File Users:
  {"name":"alice","zip":"93201"}
  {"name":"bob","zip":"76331"}
  {"name":"cindy"}
  {"zip":"87890"}

SQL:
  select name, zip from users;

Pig, with a schema:
  A = load 'Users' as (name:chararray, zip:chararray);
  B = foreach A generate name, zip;

Pig, without a schema:
  A = load 'Users';
  B = foreach A generate name, zip;
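The point of these slides is that the file holds semi-structured JSON: some records lack "name" or "zip". A small Python stand-in for "load without a schema, then project", where missing fields come back as null (None here), under the assumption that each line is one JSON record:

```python
import json

# The "Users" file from the slides: some records lack "name" or "zip".
lines = [
    '{"name":"alice","zip":"93201"}',
    '{"name":"bob","zip":"76331"}',
    '{"name":"cindy"}',
    '{"zip":"87890"}',
]

# Analogous to: B = foreach A generate name, zip;
# dict.get returns None for a missing field, mirroring Pig's null.
projected = [(r.get("name"), r.get("zip"))
             for r in (json.loads(line) for line in lines)]
```

A fixed-schema SQL table cannot represent the last two records without first deciding what to do with the missing values; the schema-less load just passes nulls through.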
13. Hive ODBC/JDBC Today
[Diagram] JDBC and ODBC clients connect to a Hive Server that fronts Hadoop.
• JDBC client issue: you have to have Hive JDBC code on the client
• ODBC client issue: the open source version is not easy to use
• Hive Server issues: not concurrent, not secure, not scalable
14. ODBC/JDBC Proposal
[Diagram] JDBC and ODBC clients connect to a REST Server that fronts Hadoop.
• Provide robust open source implementations
• The REST Server spawns the job inside the cluster
• Runs the job as the submitting user
• Works with security
• Scaling web services is well understood