Back in 2014, our team set out to change the way the world exchanges and collaborates with data. Our vision was to build a single, secure environment in which multiple organisations could share and consume data. And we did just that, leveraging multiple Hadoop technologies to help our infrastructure scale quickly and securely.
Today Data Republic's technology delivers a trusted platform for hundreds of enterprise-level companies to securely exchange, commercialise and collaborate with large datasets.
Join Head of Engineering Juan Delard de Rigoulières and Senior Solutions Architect Amin Abbaspour as they share key lessons from their team's journey with Hadoop:
* How a startup leveraged a clever combination of Hadoop technologies to build a secure data exchange platform
* How Hadoop technologies helped us deliver key solutions around governance, security and controls of data and metadata
* An evaluation of the maturity and usefulness of the Hadoop technologies in our environment: Hive, HDFS, Spark, Ranger, Atlas, Knox and Kylin; we've used them all extensively
* Our bold approach of exposing APIs directly to end users, as well as the challenges, learnings and code we created in the process
* Learnings from the front-line: How our team coped with code changes, performance tuning, issues and solutions while building our data exchange
Whether you're an enterprise-level business or a start-up looking to scale, this case study discussion offers behind-the-scenes lessons and key tips for using Hadoop technologies to manage data governance and collaboration in the cloud.
Speakers:
Juan Delard de Rigoulières, Head of Engineering, Data Republic Pty Ltd
Amin Abbaspour, Senior Solutions Architect, Data Republic
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-First Infrastructure for Data Exchange
1. Startup case study: Leveraging the broad Hadoop ecosystem to develop world-first infrastructure for data exchange
Amin Abbaspour, Senior Solutions Architect, @aminize
Juan Delard de Rigoulières, Head of Engineering, @datarepublicans
2. Data is to this century what oil was to the last one: a driver of growth and change.
13. Why Hadoop?
- Turnkey solution for our data ecosystem
- Lots of tooling and room for expansion in the future
- Ready-to-use API and governance layer
- Scale, when and if needed
- No more reinventing the wheel, even though our requirements are very specific
- A (rather large) component that gives us the ability to quickly test new hypotheses and use cases
- We want to consume it directly (via API, not just as a backend)
- Cloud agnostic: deploy globally in other regions
24. Summary of Issues with WebHCat DDL
- The whole thing is slow: expect 20-30 second response times
- HCatalog does not honor Ranger policies for Hive, so callers get full access to all metadata
- You can use "like=prefix*" in REST calls, but that is client-side security only
- Bottom line: unusable for interactive web applications
25. Welcome to WebHCat JDBC Delegator
- We wrote a small wrapper (~50 lines) around HCatDelegator to run DDL commands over JDBC (github.com/apache/hive/pull/133); the idea is sketched below
- Sub-second response times
- Fully compliant with Ranger/Hive access control
- HA JDBC connectivity via a ZooKeeper URL
- Auto-refreshes Kerberos tickets
- Now good enough to build an interactive UI on top of it
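As a rough illustration of the delegation idea only, here is a minimal sketch of running DDL through HiveServer2 over JDBC with a ZooKeeper discovery URL. The class name, hosts and namespace are placeholders, and this is not the actual wrapper from the pull request; it assumes the hive-jdbc driver is on the classpath and that Kerberos login is handled separately.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class JdbcDdlDelegator {
    // HA connection string: the ZooKeeper quorum resolves a live HiveServer2
    // instance, and Ranger policies apply because the DDL runs through Hive.
    private static final String URL =
        "jdbc:hive2://zk1:2181,zk2:2181,zk3:2181/;"
        + "serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2";

    public void runDdl(String ddl) throws Exception {
        try (Connection conn = DriverManager.getConnection(URL);
             Statement stmt = conn.createStatement()) {
            stmt.execute(ddl); // e.g. "CREATE DATABASE o_89xxdm4x3_test"
        }
    }
}

Because the statement executes inside HiveServer2 rather than through HCatalog's REST path, Ranger authorization is enforced for free, which is what makes the sub-second, policy-compliant responses possible.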
27. Database API - Hive and Schemas
Hive/HCatalog has no concept of schemas or namespaces: the database name is the first and last level of granularity.
Issue: no two organizations can have the same database name!
Fix: prefix the database name with the customer's LDAP groupId (a sketch follows).
test => o_89xxdm4x3_test
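A minimal sketch of that mapping, assuming the "o_" prefix and groupId format shown in the slide's example; the class and method names are hypothetical:

public class DbNamespacer {
    // Map a logical database name to its per-customer physical name by
    // prefixing the customer's LDAP groupId (names here are illustrative).
    static String physicalName(String ldapGroupId, String logicalName) {
        return "o_" + ldapGroupId + "_" + logicalName;
    }
    // physicalName("89xxdm4x3", "test") -> "o_89xxdm4x3_test"
}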
33. WebHDFS: Beyond Ranger + Security
Layered security:
- Reverse proxy: make sure the user is signed in
- Ranger/Knox: make sure the user has the right role
- Ranger/HDFS: apply POSIX-style access rights
That's all good, but what if we want to limit web API access to only a certain folder in HDFS, say /user/UID? One option is sketched below.
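One way to add that extra layer, shown purely as an assumption-laden sketch rather than our actual implementation, is a filter at the reverse proxy that rejects WebHDFS calls outside the signed-in user's home directory (the same check could instead live in a Knox authorization rule):

import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class UserFolderFilter implements Filter {
    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest http = (HttpServletRequest) req;
        String uid = http.getRemoteUser();   // populated by the auth proxy
        String path = http.getRequestURI();  // e.g. /webhdfs/v1/user/jdoe/data.csv
        // Reject anything outside /user/<uid>, including ".." traversal tricks.
        if (uid == null || path.contains("..")
                || !path.startsWith("/webhdfs/v1/user/" + uid + "/")) {
            ((HttpServletResponse) res).sendError(HttpServletResponse.SC_FORBIDDEN);
            return;
        }
        chain.doFilter(req, res);
    }

    @Override public void init(FilterConfig config) {}
    @Override public void destroy() {}
}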
38. Pain Points
• Slow upload times
• Reduced replication factor
• Hard limit of 1 MB on NFS read/write size, enforced by the kernel
• SFTP sends 32 KB chunks to NFS
• Only reached 1.8-2 MB/s upload speed
• No overwrite support
• Small chunks put pressure on the NameNode
41. FUSE Buffered NFS to HDFS
• Faster upload times: up from 1.8 MB/s to 20 MB/s (a ~12x speed-up)
• Replication factor reset back to the default of 3
• Supports overwrite
• No more pressure on the NameNode
• Open source: https://github.com/datarepublic/gwfs (buffering idea sketched below)
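To illustrate the buffering idea only (this is not gwfs itself, whose internals we are not reproducing here), a wrapper like the following coalesces many small 32 KB writes into large ones before they reach HDFS:

import java.io.BufferedOutputStream;
import java.io.OutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BufferedHdfsWriter {
    // Open an HDFS output stream behind a large in-memory buffer so that
    // small NFS/SFTP-sized chunks coalesce into few big writes, reducing
    // per-write overhead and NameNode pressure. Buffer size is illustrative.
    public static OutputStream open(String hdfsPath) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        return new BufferedOutputStream(
            fs.create(new Path(hdfsPath), /* overwrite = */ true),
            8 * 1024 * 1024);
    }
}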
43. For best results with Kylin
- Fact tables partitioned by date
  • single depth, partitioned by day
- Fact tables clustered into buckets
  • 16-256 buckets, depending on table size
- EOD process to build/merge cubes
- Aggregated REST API exposed over cube SQL
A hypothetical table definition following these guidelines is sketched below.
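For concreteness, here is what such a fact-table DDL might look like; the table and column names are hypothetical, and the bucket count should be picked in the 16-256 range based on table size:

public class KylinFactTableDdl {
    // Illustrative Hive DDL: single-depth date partition plus bucketing on a
    // high-cardinality key, run over the same JDBC path as other DDL.
    static final String DDL =
        "CREATE TABLE fact_sales ("
        + "  customer_id STRING, merchant STRING, amount DECIMAL(18,2)) "
        + "PARTITIONED BY (ds DATE) "                    // partitioned by day
        + "CLUSTERED BY (customer_id) INTO 64 BUCKETS "  // 16-256 by table size
        + "STORED AS ORC";
}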
44. Cloud Data Access (CDA) Disaster Recovery
- JBOD: snapshot EBS volumes with Lambda
- AWS Users (split access/secret key), not Roles. Why?
- Cron job to distcp to S3, KMS-encrypted
- S3 VPC endpoint for the VPC
S3A settings we use (a programmatic sketch follows):

Key                                    Value
fs.s3a.experimental.input.fadvise      random
fs.s3a.fast.upload                     true
fs.s3a.fast.upload.buffer              bytebuffer
numListstatusThreads                   4
fs.s3a.access.key / fs.s3a.secret.key  AWS access and secret key, set in core-site.xml
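As a sketch only, the same tuning applied in code before a copy job; note that numListstatusThreads is a distcp flag rather than a configuration key (e.g. hadoop distcp -numListstatusThreads 4 ...):

import org.apache.hadoop.conf.Configuration;

public class S3aTuning {
    // Build a Configuration with the S3A settings from the table above.
    // Credentials are deliberately left to core-site.xml, as on the slide.
    public static Configuration tuned() {
        Configuration conf = new Configuration();
        conf.set("fs.s3a.experimental.input.fadvise", "random"); // random-access reads
        conf.setBoolean("fs.s3a.fast.upload", true);             // incremental multipart upload
        conf.set("fs.s3a.fast.upload.buffer", "bytebuffer");     // off-heap upload buffers
        return conf;
    }
}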
46. Data Marketplace: Analyze data from major retail, finance and loyalty brands.
Governance platform: Manage data exchanges from one secure dashboard.
Privacy management: Protect customer privacy with de-identification technology.
Secure cloud analytics: Run analytics projects in secure, encrypted cloud environments.
47. Rapid growth in the number and volume of datasets and PI: identifiable data for ~16m Australian adults. The datasets currently loaded onto the Senate platform contain 40m digital identities, covering 75% of the adult Australian population, with another 40m coming in early 2017.
Headline metrics from the slide: 17, 50, 180, +1500, 14B (published datasets, rows of data, data contributors, data scenario listings, certified partners).
48. Summary for this talk
• Data exchange is complex; it involves trust and governance, and data and metadata are part of it
• Hadoop gives us an amazing capability to build upon; we fulfil complicated new scenarios every day
• We've quickly solved all our small issues and delivered great business value