Web Services in
Hadoop
Nicholas Sze and Alan F. Gates
@szetszwo, @alanfgates




                                 Page 1
REST-ful API Front-door for Hadoop
• Opens the door to languages other than Java
• Thin clients via web services vs. fat-clients in gateway
• Insulation from interface changes release to release


                  HCatalog web interfaces
              MapReduce  |  Pig  |  Hive
                       HCatalog
            HDFS  |  HBase  |  External Store
      © 2012 Hortonworks                                     Page 2
Not Covered in this Talk
•  HttpFS (fka Hoop) – same API as WebHDFS but proxied
•  Stargate – REST API for HBase




HDFS Clients
• DFSClient: the native client
  – High performance (using RPC)
  – Java binding only


• libhdfs: a C client interface
  – Uses JNI => large overhead
  – Also Java-bound (requires a Hadoop installation)




     Architecting the Future of Big Data     Page 4
HFTP
• Designed for cross-version copying (DistCp)
  – High performance (using HTTP)
  – Read-only
  – The HTTP API is proprietary
  – Clients must use HftpFileSystem (hftp://)


• WebHDFS is a rewrite of HFTP



Design Goals

• Support a public HTTP API

• Support Read and Write

• High Performance

• Cross-version

• Security



WebHDFS features
• HTTP REST API
  – Defines a public API
  – Permits non-Java client implementations
  – Supports common tools like curl/wget


• Wire Compatibility
  – The REST API will be maintained for wire compatibility
  – WebHDFS clients can talk to different Hadoop versions.




WebHDFS features                     (2)

• A Complete HDFS Interface
  – Supports all user operations
     – reading files
     – writing to files
     – mkdir, chmod, chown, mv, rm, …


• High Performance
  – Uses HTTP redirection to provide data locality
  – File reads and writes are redirected to the corresponding
    datanodes



WebHDFS features                     (3)

• Secure Authentication
  – Same as Hadoop authentication: Kerberos (SPNEGO)
    and Hadoop delegation tokens
  – Supports proxy users


• An HDFS Built-in Component
  – WebHDFS is a first-class built-in component of HDFS.
  – Runs inside Namenodes and Datanodes

• Apache Open Source
  – Available in Apache Hadoop 1.0 and above.

WebHDFS URI & URL
• FileSystem scheme:
          webhdfs://

• FileSystem URI:
          webhdfs://<HOST>:<HTTP_PORT>/<PATH>

• HTTP URL:
  http://<HOST>:<HTTP_PORT>/webhdfs/v1/<PATH>?op=..

  – Path prefix:    /webhdfs/v1
  – Query:          ?op=..



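The URI-to-URL mapping above can be sketched as a small helper. The /webhdfs/v1 prefix and op= query parameter are the fixed parts of the API; the function and parameter names are illustrative only.

```python
def webhdfs_url(host, http_port, path, op, **params):
    """Build a WebHDFS HTTP URL: http://HOST:PORT/webhdfs/v1/PATH?op=OP&..."""
    query = "&".join(["op=" + op] +
                     ["%s=%s" % (k, v) for k, v in sorted(params.items())])
    return "http://%s:%d/webhdfs/v1%s?%s" % (host, http_port, path, query)

url = webhdfs_url("namenode", 50070, "/user/szetszwo/w.txt", "OPEN")
# → http://namenode:50070/webhdfs/v1/user/szetszwo/w.txt?op=OPEN
```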
URI/URL Examples
•  Suppose we have the following file
     hdfs://namenode:8020/user/szetszwo/w.txt

•  WebHDFS FileSystem URI
     webhdfs://namenode:50070/user/szetszwo/w.txt

•  WebHDFS HTTP URL
     http://namenode:50070/webhdfs/v1/user/szetszwo/w.txt?op=..

•  WebHDFS HTTP URL to open the file
     http://namenode:50070/webhdfs/v1/user/szetszwo/w.txt?op=OPEN

Example: curl
•  Use curl to open a file

$ curl -i -L "http://namenode:50070/webhdfs/v1/user/szetszwo/w.txt?op=OPEN"

HTTP/1.1 307 TEMPORARY_REDIRECT
Content-Type: application/octet-stream
Location: http://192.168.5.2:50075/webhdfs/v1/user/szetszwo/w.txt?op=OPEN&offset=0
Content-Length: 0
Server: Jetty(6.1.26)




Example: curl (2)

HTTP/1.1 200 OK
Content-Type: application/octet-stream
Content-Length: 21
Server: Jetty(6.1.26)

Hello, WebHDFS user!




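The curl trace above shows WebHDFS's two-step read: the namenode answers with 307 TEMPORARY_REDIRECT and a Location header pointing at a datanode, and the client repeats the request there. The client side can be sketched as follows (the helper name is illustrative, not part of any API):

```python
def redirect_target(status, headers):
    """Return the datanode URL to retry against, or None if no redirect."""
    if status == 307:
        return headers.get("Location")
    return None

# Headers as returned for the OPEN request in the curl example:
headers = {
    "Content-Type": "application/octet-stream",
    "Location": "http://192.168.5.2:50075/webhdfs/v1/user/szetszwo/w.txt?op=OPEN&offset=0",
    "Content-Length": "0",
}
target = redirect_target(307, headers)  # the client now GETs this URL
```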
Example: wget
•  Use wget to open the same file

$ wget "http://namenode:50070/webhdfs/v1/user/szetszwo/w.txt?op=OPEN" -O w.txt

Resolving ...
Connecting to ... connected.
HTTP request sent, awaiting response...
307 TEMPORARY_REDIRECT
Location: http://192.168.5.2:50075/webhdfs/v1/user/szetszwo/w.txt?op=OPEN&offset=0 [following]




Example: wget (2)

--2012-06-13 01:42:10-- http://192.168.5.2:50075/webhdfs/v1/user/szetszwo/w.txt?op=OPEN&offset=0
Connecting to 192.168.5.2:50075... connected.
HTTP request sent, awaiting response... 200 OK
Length: 21 [application/octet-stream]
Saving to: `w.txt'

100%[=================>] 21                --.-K/s     in 0s

2012-06-13 01:42:10 (3.34 MB/s) - `w.txt' saved [21/21]




Example: Firefox
(Screenshot: the file opened via its WebHDFS URL in Firefox.)
HCatalog REST API
•  REST endpoints: databases, tables, partitions, columns, table properties
•  PUT to create/update, GET to list or describe, DELETE to drop
•  Uses JSON to describe metadata objects
•  Versioned, because we assume we will have to update it:
   http://hadoop.acme.com/templeton/v1/…
•  Runs in a Jetty server
•  Supports security
     –  Authentication done via Kerberos using SPNEGO
•  Included in HDP, runs on Thrift metastore server machine
•  Not yet checked in, but you can find the code on Apache’s JIRA
   HCATALOG-182




HCatalog REST API
Get a list of all tables in the default database:

GET http://…/v1/ddl/database/default/table

Response:
{
    "tables": ["counted", "processed"],
    "database": "default"
}

Indicate the user with a URL parameter:
http://…/v1/ddl/database/default/table?user.name=gates
Actions are authorized as the indicated user.

HCatalog REST API
Create a new table “rawevents”:

PUT http://…/v1/ddl/database/default/table/rawevents

{"columns": [{ "name": "url", "type": "string" },
             { "name": "user", "type": "string" }],
 "partitionedBy": [{ "name": "ds", "type": "string" }]}

Response:
{
    "table": "rawevents",
    "database": "default"
}
HCatalog REST API
Describe table “rawevents”:

GET http://…/v1/ddl/database/default/table/rawevents

Response:
{
    "columns": [{"name": "url", "type": "string"},
                {"name": "user", "type": "string"}],
    "database": "default",
    "table": "rawevents"
}
Job Management
•  Includes APIs to submit and monitor jobs
•  Any files needed for the job are first uploaded to HDFS via WebHDFS
   –  Pig and Hive scripts
   –  Jars, Python scripts, or Ruby scripts for UDFs
   –  Pig macros
•  Results from the job are stored in HDFS and can be retrieved via WebHDFS
•  User responsible for cleaning up output in HDFS
•  Job state information stored in ZooKeeper or HDFS




Job Submission
•  Can submit MapReduce, Pig, and Hive jobs
•  POST parameters include
   –  script to run or HDFS file containing script/jar to run
   –  username to execute the job as
   –  optionally an HDFS directory to write results to (defaults to user’s home directory)
   –  optionally a URL to invoke GET on when job is done


POST http://hadoop.acme.com/templeton/v1/pig

Response:
{"id": "job_201111111311_0012",…}




Find all Your Jobs
•  GET on queue returns all jobs belonging to the submitting user
•  Pig, Hive, and MapReduce jobs will be returned




GET http://…/templeton/v1/queue?user.name=gates

Response:
{"job_201111111311_0008",
 "job_201111111311_0012"}




Get Status of a Job
•  Doing a GET on jobid gets you information about a particular job
•  Can be used to poll to see if job is finished
•  Used after job is finished to get job information
•  Doing a DELETE on jobid kills the job




GET http://…/templeton/v1/queue/job_201111111311_0012

Response:
{…, "percentComplete": "100% complete",
    "exitValue": 0,…
    "completed": "done"
}




Future
•  Job management
   –  Job management APIs don’t belong in HCatalog
   –  Only there by historical accident
   –  Need to move them out to MapReduce framework
•  Authentication needs more options than Kerberos
•  Integration with Oozie
•  Need a directory service
   –  Users should not need to connect to different servers for HDFS, HBase, HCatalog,
      Oozie, and job submission





Más contenido relacionado

La actualidad más candente

Moving from C#/.NET to Hadoop/MongoDB
Moving from C#/.NET to Hadoop/MongoDBMoving from C#/.NET to Hadoop/MongoDB
Moving from C#/.NET to Hadoop/MongoDB
MongoDB
 
Dancing with the elephant h base1_final
Dancing with the elephant   h base1_finalDancing with the elephant   h base1_final
Dancing with the elephant h base1_final
asterix_smartplatf
 
Efficient in situ processing of various storage types on apache tajo
Efficient in situ processing of various storage types on apache tajoEfficient in situ processing of various storage types on apache tajo
Efficient in situ processing of various storage types on apache tajo
Hyunsik Choi
 

La actualidad más candente (20)

HCatalog Hadoop Summit 2011
HCatalog Hadoop Summit 2011HCatalog Hadoop Summit 2011
HCatalog Hadoop Summit 2011
 
Yahoo! Hack Europe Workshop
Yahoo! Hack Europe WorkshopYahoo! Hack Europe Workshop
Yahoo! Hack Europe Workshop
 
Introduction to Hive and HCatalog
Introduction to Hive and HCatalogIntroduction to Hive and HCatalog
Introduction to Hive and HCatalog
 
HBaseCon 2015: Analyzing HBase Data with Apache Hive
HBaseCon 2015: Analyzing HBase Data with Apache  HiveHBaseCon 2015: Analyzing HBase Data with Apache  Hive
HBaseCon 2015: Analyzing HBase Data with Apache Hive
 
HBaseCon 2013: Integration of Apache Hive and HBase
HBaseCon 2013: Integration of Apache Hive and HBaseHBaseCon 2013: Integration of Apache Hive and HBase
HBaseCon 2013: Integration of Apache Hive and HBase
 
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosHadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
 
A Survey of HBase Application Archetypes
A Survey of HBase Application ArchetypesA Survey of HBase Application Archetypes
A Survey of HBase Application Archetypes
 
2013 feb 20_thug_h_catalog
2013 feb 20_thug_h_catalog2013 feb 20_thug_h_catalog
2013 feb 20_thug_h_catalog
 
Introduction to Apache Drill
Introduction to Apache DrillIntroduction to Apache Drill
Introduction to Apache Drill
 
Simplified Data Management And Process Scheduling in Hadoop
Simplified Data Management And Process Scheduling in HadoopSimplified Data Management And Process Scheduling in Hadoop
Simplified Data Management And Process Scheduling in Hadoop
 
Moving from C#/.NET to Hadoop/MongoDB
Moving from C#/.NET to Hadoop/MongoDBMoving from C#/.NET to Hadoop/MongoDB
Moving from C#/.NET to Hadoop/MongoDB
 
Hadoop in Practice (SDN Conference, Dec 2014)
Hadoop in Practice (SDN Conference, Dec 2014)Hadoop in Practice (SDN Conference, Dec 2014)
Hadoop in Practice (SDN Conference, Dec 2014)
 
August 2014 HUG : Hive 13 Security
August 2014 HUG : Hive 13 SecurityAugust 2014 HUG : Hive 13 Security
August 2014 HUG : Hive 13 Security
 
Dancing with the elephant h base1_final
Dancing with the elephant   h base1_finalDancing with the elephant   h base1_final
Dancing with the elephant h base1_final
 
Efficient in situ processing of various storage types on apache tajo
Efficient in situ processing of various storage types on apache tajoEfficient in situ processing of various storage types on apache tajo
Efficient in situ processing of various storage types on apache tajo
 
Mar 2012 HUG: Hive with HBase
Mar 2012 HUG: Hive with HBaseMar 2012 HUG: Hive with HBase
Mar 2012 HUG: Hive with HBase
 
Sql on everything with drill
Sql on everything with drillSql on everything with drill
Sql on everything with drill
 
Evolving HDFS to a Generalized Distributed Storage Subsystem
Evolving HDFS to a Generalized Distributed Storage SubsystemEvolving HDFS to a Generalized Distributed Storage Subsystem
Evolving HDFS to a Generalized Distributed Storage Subsystem
 
Apache hive
Apache hiveApache hive
Apache hive
 
Large-scale Web Apps @ Pinterest
Large-scale Web Apps @ PinterestLarge-scale Web Apps @ Pinterest
Large-scale Web Apps @ Pinterest
 

Destacado

Introduction To Apache Pig at WHUG
Introduction To Apache Pig at WHUGIntroduction To Apache Pig at WHUG
Introduction To Apache Pig at WHUG
Adam Kawa
 

Destacado (19)

Big data and its impact on SOA
Big data and its impact on SOABig data and its impact on SOA
Big data and its impact on SOA
 
Flume intro-100715
Flume intro-100715Flume intro-100715
Flume intro-100715
 
Inside Flume
Inside FlumeInside Flume
Inside Flume
 
Apache Hadoop In Theory And Practice
Apache Hadoop In Theory And PracticeApache Hadoop In Theory And Practice
Apache Hadoop In Theory And Practice
 
10 Amazing Things To Do With a Hadoop-Based Data Lake
10 Amazing Things To Do With a Hadoop-Based Data Lake10 Amazing Things To Do With a Hadoop-Based Data Lake
10 Amazing Things To Do With a Hadoop-Based Data Lake
 
Spark streaming: Best Practices
Spark streaming: Best PracticesSpark streaming: Best Practices
Spark streaming: Best Practices
 
Apache Flume
Apache FlumeApache Flume
Apache Flume
 
Introduction To Apache Pig at WHUG
Introduction To Apache Pig at WHUGIntroduction To Apache Pig at WHUG
Introduction To Apache Pig at WHUG
 
Hadoop and Spark for the SAS Developer
Hadoop and Spark for the SAS DeveloperHadoop and Spark for the SAS Developer
Hadoop and Spark for the SAS Developer
 
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
 
HBaseCon 2012 | Learning HBase Internals - Lars Hofhansl, Salesforce
HBaseCon 2012 | Learning HBase Internals - Lars Hofhansl, SalesforceHBaseCon 2012 | Learning HBase Internals - Lars Hofhansl, Salesforce
HBaseCon 2012 | Learning HBase Internals - Lars Hofhansl, Salesforce
 
Why your Spark job is failing
Why your Spark job is failingWhy your Spark job is failing
Why your Spark job is failing
 
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
 
Feb 2013 HUG: Large Scale Data Ingest Using Apache Flume
Feb 2013 HUG: Large Scale Data Ingest Using Apache FlumeFeb 2013 HUG: Large Scale Data Ingest Using Apache Flume
Feb 2013 HUG: Large Scale Data Ingest Using Apache Flume
 
A Multi Colored YARN
A Multi Colored YARNA Multi Colored YARN
A Multi Colored YARN
 
Spark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting GuideSpark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting Guide
 
Why your Spark Job is Failing
Why your Spark Job is FailingWhy your Spark Job is Failing
Why your Spark Job is Failing
 
HBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
HBaseCon 2012 | HBase Schema Design - Ian Varley, SalesforceHBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
HBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 

Similar a Web Services Hadoop Summit 2012

Sql saturday pig session (wes floyd) v2
Sql saturday   pig session (wes floyd) v2Sql saturday   pig session (wes floyd) v2
Sql saturday pig session (wes floyd) v2
Wes Floyd
 
Cosmos, Big Data GE Implementation
Cosmos, Big Data GE ImplementationCosmos, Big Data GE Implementation
Cosmos, Big Data GE Implementation
FIWARE
 
Implementing Hadoop on a single cluster
Implementing Hadoop on a single clusterImplementing Hadoop on a single cluster
Implementing Hadoop on a single cluster
Salil Navgire
 

Similar a Web Services Hadoop Summit 2012 (20)

Future of HCatalog
Future of HCatalogFuture of HCatalog
Future of HCatalog
 
Sql saturday pig session (wes floyd) v2
Sql saturday   pig session (wes floyd) v2Sql saturday   pig session (wes floyd) v2
Sql saturday pig session (wes floyd) v2
 
HUG Meetup 2013: HCatalog / Hive Data Out
HUG Meetup 2013: HCatalog / Hive Data Out HUG Meetup 2013: HCatalog / Hive Data Out
HUG Meetup 2013: HCatalog / Hive Data Out
 
May 2013 HUG: HCatalog/Hive Data Out
May 2013 HUG: HCatalog/Hive Data OutMay 2013 HUG: HCatalog/Hive Data Out
May 2013 HUG: HCatalog/Hive Data Out
 
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
 
מיכאל
מיכאלמיכאל
מיכאל
 
Apache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data ProcessingApache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data Processing
 
Cosmos, Big Data GE implementation in FIWARE
Cosmos, Big Data GE implementation in FIWARECosmos, Big Data GE implementation in FIWARE
Cosmos, Big Data GE implementation in FIWARE
 
Cosmos, Big Data GE Implementation
Cosmos, Big Data GE ImplementationCosmos, Big Data GE Implementation
Cosmos, Big Data GE Implementation
 
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
 
Hive Quick Start Tutorial
Hive Quick Start TutorialHive Quick Start Tutorial
Hive Quick Start Tutorial
 
Big data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesBig data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting Languages
 
SQOOP PPT
SQOOP PPTSQOOP PPT
SQOOP PPT
 
Windows Azure HDInsight Service
Windows Azure HDInsight ServiceWindows Azure HDInsight Service
Windows Azure HDInsight Service
 
How to develop Big Data Pipelines for Hadoop, by Costin Leau
How to develop Big Data Pipelines for Hadoop, by Costin LeauHow to develop Big Data Pipelines for Hadoop, by Costin Leau
How to develop Big Data Pipelines for Hadoop, by Costin Leau
 
Hive 3 a new horizon
Hive 3  a new horizonHive 3  a new horizon
Hive 3 a new horizon
 
Implementing Hadoop on a single cluster
Implementing Hadoop on a single clusterImplementing Hadoop on a single cluster
Implementing Hadoop on a single cluster
 
A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0 A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0
 
Hadoop_arunam_ppt
Hadoop_arunam_pptHadoop_arunam_ppt
Hadoop_arunam_ppt
 
Big Data Journey
Big Data JourneyBig Data Journey
Big Data Journey
 

Más de Hortonworks

Más de Hortonworks (20)

Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next LevelHortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
 
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT Strategy
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT StrategyIoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT Strategy
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT Strategy
 
Getting the Most Out of Your Data in the Cloud with Cloudbreak
Getting the Most Out of Your Data in the Cloud with CloudbreakGetting the Most Out of Your Data in the Cloud with Cloudbreak
Getting the Most Out of Your Data in the Cloud with Cloudbreak
 
Johns Hopkins - Using Hadoop to Secure Access Log Events
Johns Hopkins - Using Hadoop to Secure Access Log EventsJohns Hopkins - Using Hadoop to Secure Access Log Events
Johns Hopkins - Using Hadoop to Secure Access Log Events
 
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad Guys
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad GuysCatch a Hacker in Real-Time: Live Visuals of Bots and Bad Guys
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad Guys
 
HDF 3.2 - What's New
HDF 3.2 - What's NewHDF 3.2 - What's New
HDF 3.2 - What's New
 
Curing Kafka Blindness with Hortonworks Streams Messaging Manager
Curing Kafka Blindness with Hortonworks Streams Messaging ManagerCuring Kafka Blindness with Hortonworks Streams Messaging Manager
Curing Kafka Blindness with Hortonworks Streams Messaging Manager
 
Interpretation Tool for Genomic Sequencing Data in Clinical Environments
Interpretation Tool for Genomic Sequencing Data in Clinical EnvironmentsInterpretation Tool for Genomic Sequencing Data in Clinical Environments
Interpretation Tool for Genomic Sequencing Data in Clinical Environments
 
IBM+Hortonworks = Transformation of the Big Data Landscape
IBM+Hortonworks = Transformation of the Big Data LandscapeIBM+Hortonworks = Transformation of the Big Data Landscape
IBM+Hortonworks = Transformation of the Big Data Landscape
 
Premier Inside-Out: Apache Druid
Premier Inside-Out: Apache DruidPremier Inside-Out: Apache Druid
Premier Inside-Out: Apache Druid
 
Accelerating Data Science and Real Time Analytics at Scale
Accelerating Data Science and Real Time Analytics at ScaleAccelerating Data Science and Real Time Analytics at Scale
Accelerating Data Science and Real Time Analytics at Scale
 
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATATIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA
 
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
 
Delivering Real-Time Streaming Data for Healthcare Customers: Clearsense
Delivering Real-Time Streaming Data for Healthcare Customers: ClearsenseDelivering Real-Time Streaming Data for Healthcare Customers: Clearsense
Delivering Real-Time Streaming Data for Healthcare Customers: Clearsense
 
Making Enterprise Big Data Small with Ease
Making Enterprise Big Data Small with EaseMaking Enterprise Big Data Small with Ease
Making Enterprise Big Data Small with Ease
 
Webinewbie to Webinerd in 30 Days - Webinar World Presentation
Webinewbie to Webinerd in 30 Days - Webinar World PresentationWebinewbie to Webinerd in 30 Days - Webinar World Presentation
Webinewbie to Webinerd in 30 Days - Webinar World Presentation
 
Driving Digital Transformation Through Global Data Management
Driving Digital Transformation Through Global Data ManagementDriving Digital Transformation Through Global Data Management
Driving Digital Transformation Through Global Data Management
 
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming FeaturesHDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
 
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
 
Unlock Value from Big Data with Apache NiFi and Streaming CDC
Unlock Value from Big Data with Apache NiFi and Streaming CDCUnlock Value from Big Data with Apache NiFi and Streaming CDC
Unlock Value from Big Data with Apache NiFi and Streaming CDC
 

Último

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Último (20)

Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 

Web Services Hadoop Summit 2012

WebHDFS features (2)
• A Complete HDFS Interface
  – Supports all user operations
    – reading files
    – writing to files
    – mkdir, chmod, chown, mv, rm, …

• High Performance
  – Uses HTTP redirection to provide data locality
  – File reads/writes are redirected to the corresponding datanodes

Architecting the Future of Big Data     Page 8
WebHDFS features (3)
• Secure Authentication
  – Same as Hadoop authentication: Kerberos (SPNEGO) and Hadoop delegation tokens
  – Supports proxy users

• An HDFS Built-in Component
  – WebHDFS is a first-class, built-in component of HDFS
  – Runs inside Namenodes and Datanodes

• Apache Open Source
  – Available in Apache Hadoop 1.0 and above

Architecting the Future of Big Data     Page 9
WebHDFS URI & URL
• FileSystem scheme: webhdfs://
• FileSystem URI: webhdfs://<HOST>:<HTTP_PORT>/<PATH>
• HTTP URL: http://<HOST>:<HTTP_PORT>/webhdfs/v1/<PATH>?op=..
  – Path prefix: /webhdfs/v1
  – Query: ?op=..

Architecting the Future of Big Data     Page 10
URI/URL Examples
• Suppose we have the following file
  hdfs://namenode:8020/user/szetszwo/w.txt
• WebHDFS FileSystem URI
  webhdfs://namenode:50070/user/szetszwo/w.txt
• WebHDFS HTTP URL
  http://namenode:50070/webhdfs/v1/user/szetszwo/w.txt?op=..
• WebHDFS HTTP URL to open the file
  http://namenode:50070/webhdfs/v1/user/szetszwo/w.txt?op=OPEN

Architecting the Future of Big Data     Page 11
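The mapping from a webhdfs:// FileSystem URI to the underlying HTTP URL is mechanical, so even a thin client can construct it directly. A minimal Python sketch, using the hostname and port from the slide's example (not a real cluster):

```python
from urllib.parse import urlencode

def to_http_url(host, port, path, op, **params):
    """Turn an HDFS path into the WebHDFS HTTP URL for a given operation.

    webhdfs://<HOST>:<HTTP_PORT>/<PATH> maps to
    http://<HOST>:<HTTP_PORT>/webhdfs/v1/<PATH>?op=...
    """
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1{path}?{query}"

# The OPEN URL from the slide:
url = to_http_url("namenode", 50070, "/user/szetszwo/w.txt", "OPEN")
```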
Example: curl
• Use curl to open a file

$ curl -i -L "http://namenode:50070/webhdfs/v1/user/szetszwo/w.txt?op=OPEN"
HTTP/1.1 307 TEMPORARY_REDIRECT
Content-Type: application/octet-stream
Location: http://192.168.5.2:50075/webhdfs/v1/user/szetszwo/w.txt?op=OPEN&offset=0
Content-Length: 0
Server: Jetty(6.1.26)

Architecting the Future of Big Data     Page 12
Example: curl (2)
• curl follows the redirect to the datanode and reads the file

HTTP/1.1 200 OK
Content-Type: application/octet-stream
Content-Length: 21
Server: Jetty(6.1.26)

Hello, WebHDFS user!

Architecting the Future of Big Data     Page 13
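The two-step exchange above is what gives WebHDFS read locality: the namenode's 307 points at the datanode that actually holds the data. A client that wants that information explicitly (e.g. with redirect following disabled) can parse the Location header itself. A small sketch using only the values shown in the curl example:

```python
from urllib.parse import urlsplit, parse_qs

# The Location header returned by the namenode in the example above:
location = ("http://192.168.5.2:50075/webhdfs/v1/user/szetszwo/"
            "w.txt?op=OPEN&offset=0")

def parse_redirect(location):
    """Extract the datanode address and read offset from the 307
    Location header, i.e. where the file bytes actually live."""
    parts = urlsplit(location)
    query = parse_qs(parts.query)
    return parts.netloc, int(query["offset"][0])

datanode, offset = parse_redirect(location)
```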
Example: wget
• Use wget to open the same file

$ wget "http://namenode:50070/webhdfs/v1/user/szetszwo/w.txt?op=OPEN" -O w.txt
Resolving ... Connecting to ... connected.
HTTP request sent, awaiting response... 307 TEMPORARY_REDIRECT
Location: http://192.168.5.2:50075/webhdfs/v1/user/szetszwo/w.txt?op=OPEN&offset=0 [following]

Architecting the Future of Big Data     Page 14
Example: wget (2)

--2012-06-13 01:42:10-- http://192.168.5.2:50075/webhdfs/v1/user/szetszwo/w.txt?op=OPEN&offset=0
Connecting to 192.168.5.2:50075... connected.
HTTP request sent, awaiting response... 200 OK
Length: 21 [application/octet-stream]
Saving to: `w.txt'

100%[=================>] 21 --.-K/s in 0s

2012-06-13 01:42:10 (3.34 MB/s) - `w.txt' saved [21/21]

Architecting the Future of Big Data     Page 15
Example: Firefox
(screenshot: opening the same WebHDFS URL in a browser)

Architecting the Future of Big Data     Page 16
HCatalog REST API
• REST endpoints: databases, tables, partitions, columns, table properties
• PUT to create/update, GET to list or describe, DELETE to drop
• Uses JSON to describe metadata objects
• Versioned, because we assume we will have to update it:
  http://hadoop.acme.com/templeton/v1/…
• Runs in a Jetty server
• Supports security
  – Authentication done via Kerberos using SPNEGO
• Included in HDP, runs on the Thrift metastore server machine
• Not yet checked in, but you can find the code on Apache's JIRA HCATALOG-182

© 2012 Hortonworks     Page 17
HCatalog REST API
Get a list of all tables in the default database:

GET http://…/v1/ddl/database/default/table

Response from Hadoop/HCatalog:
{ "tables": ["counted", "processed"], "database": "default" }

Indicate the user with a URL parameter:
http://…/v1/ddl/database/default/table?user.name=gates
Actions are authorized as the indicated user.

© Hortonworks 2012     Page 18
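Since the response body is plain JSON, consuming it from any language is one parse away. A sketch working from the response shown above (the endpoint host is elided on the slide, so the example sticks to parsing rather than issuing the request):

```python
import json

# The JSON body returned for the table-listing GET shown above:
body = '{"tables": ["counted", "processed"], "database": "default"}'

resp = json.loads(body)
tables = resp["tables"]      # list of table names in the database
database = resp["database"]  # which database was listed
```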
HCatalog REST API
Create new table "rawevents":

PUT http://…/v1/ddl/database/default/table/rawevents
{"columns": [{ "name": "url", "type": "string" },
             { "name": "user", "type": "string" }],
 "partitionedBy": [{ "name": "ds", "type": "string" }]}

Response from Hadoop/HCatalog:
{ "table": "rawevents", "database": "default" }

© Hortonworks 2012     Page 19
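The PUT body is an ordinary JSON document describing the schema, so it can be built programmatically rather than hand-written. A sketch that constructs the payload from the slide (the request itself is omitted since the server host is elided):

```python
import json

# Schema for the "rawevents" table, matching the slide's PUT body:
schema = {
    "columns": [
        {"name": "url", "type": "string"},
        {"name": "user", "type": "string"},
    ],
    "partitionedBy": [{"name": "ds", "type": "string"}],
}

payload = json.dumps(schema)
# This string would be sent as the body of
# PUT http://…/v1/ddl/database/default/table/rawevents
```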
HCatalog REST API
Describe table "rawevents":

GET http://…/v1/ddl/database/default/table/rawevents

Response from Hadoop/HCatalog:
{ "columns": [{"name": "url", "type": "string"},
              {"name": "user", "type": "string"}],
  "database": "default",
  "table": "rawevents" }

© Hortonworks 2012     Page 20
Job Management
• Includes APIs to submit and monitor jobs
• Any files needed for the job are first uploaded to HDFS via WebHDFS
  – Pig and Hive scripts
  – Jars, Python scripts, or Ruby scripts for UDFs
  – Pig macros
• Results from the job are stored to HDFS and can be retrieved via WebHDFS
• The user is responsible for cleaning up output in HDFS
• Job state information is stored in ZooKeeper or HDFS

© 2012 Hortonworks     Page 21
Job Submission
• Can submit MapReduce, Pig, and Hive jobs
• POST parameters include
  – script to run, or HDFS file containing the script/jar to run
  – username to execute the job as
  – optionally an HDFS directory to write results to (defaults to the user's home directory)
  – optionally a URL to invoke GET on when the job is done

POST http://hadoop.acme.com/templeton/v1/pig

Response from Hadoop/HCatalog:
{"id": "job_201111111311_0012", …}

© 2012 Hortonworks     Page 22
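The submission POST is a plain form-encoded request, so building it needs nothing beyond URL encoding. In the sketch below, only "user.name" appears verbatim in this deck; the "execute", "statusdir", and "callback" field names are assumptions chosen to illustrate the four parameters the slide lists:

```python
from urllib.parse import urlencode

# Hypothetical field names ("execute", "statusdir", "callback") illustrating
# the parameters described on the slide; only "user.name" is shown verbatim.
params = {
    "user.name": "gates",                        # user to run the job as
    "execute": "A = load 'rawevents'; dump A;",  # inline Pig script (assumption)
    "statusdir": "results",                      # HDFS dir for results (assumption)
    "callback": "http://example.com/done",       # GET on completion (assumption)
}

body = urlencode(params)
# `body` would be POSTed to http://hadoop.acme.com/templeton/v1/pig
```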
Find all Your Jobs
• GET on queue returns all jobs belonging to the submitting user
• Pig, Hive, and MapReduce jobs will be returned

GET http://…/templeton/v1/queue?user.name=gates

Response from Hadoop/HCatalog:
{"job_201111111311_0008", "job_201111111311_0012"}

© 2012 Hortonworks     Page 23
Get Status of a Job
• Doing a GET on the jobid gets you information about a particular job
• Can be used to poll to see if the job is finished
• Used after the job is finished to get job information
• Doing a DELETE on the jobid kills the job

GET http://…/templeton/v1/queue/job_201111111311_0012

Response from Hadoop/HCatalog:
{…, "percentComplete": "100% complete", "exitValue": 0, … "completed": "done"}

© 2012 Hortonworks     Page 24
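Polling then reduces to re-issuing that GET until the "completed" field reads "done". A sketch of the completion check, fed with the response fields shown above (the HTTP fetch itself is left out since there is no live server to call):

```python
import json

def is_finished(status_body):
    """A job is done when the queue resource reports 'completed': 'done'."""
    return json.loads(status_body).get("completed") == "done"

# Response fields from the slide's example:
sample = ('{"percentComplete": "100% complete", '
          '"exitValue": 0, "completed": "done"}')
```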
Future
• Job management
  – Job management APIs don't belong in HCatalog
  – Only there by historical accident
  – Need to move them out to the MapReduce framework
• Authentication needs more options than Kerberos
• Integration with Oozie
• Need a directory service
  – Users should not need to connect to different servers for HDFS, HBase, HCatalog, Oozie, and job submission

© 2012 Hortonworks     Page 25