Back in 2014, our team set out to change the way the world exchanges and collaborates with data. Our vision was to build a single, secure environment in which multiple organisations could share and consume data. And we did just that, leveraging multiple Hadoop technologies to help our infrastructure scale quickly and securely.
Today Data Republic's technology delivers a trusted platform for hundreds of enterprise-level companies to securely exchange, commercialise and collaborate with large datasets.
Join Head of Engineering Juan Delard de Rigoulières and Senior Solutions Architect Amin Abbaspour as they share key lessons from their team's journey with Hadoop:
* How a startup leveraged a clever combination of Hadoop technologies to build a secure data exchange platform
* How Hadoop technologies helped us deliver key solutions around governance, security and controls of data and metadata
* An evaluation of the maturity and usefulness of the Hadoop technologies in our environment: Hive, HDFS, Spark, Ranger, Atlas, Knox and Kylin; we've used them all extensively
* Our bold approach of exposing APIs directly to end users, as well as the challenges, learnings and code we created in the process
* Learnings from the front-line: How our team coped with code changes, performance tuning, issues and solutions while building our data exchange
Whether you're an enterprise-level business or a start-up looking to scale, this case study discussion offers behind-the-scenes lessons and key tips for using Hadoop technologies to manage data governance and collaboration in the cloud.
Speakers:
Juan Delard de Rigoulières, Head of Engineering, Data Republic Pty Ltd
Amin Abbaspour, Senior Solutions Architect, Data Republic
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-First Infrastructure for Data Exchange
1. Startup case study: Leveraging the broad Hadoop ecosystem to develop world-first infrastructure for data exchange
Amin Abbaspour, Senior Solutions Architect, @aminize
Juan Delard de Rigoulières, Head of Engineering, @datarepublicans
2. Data is to this century what oil was to the last one: a driver of growth and change.
13. Why Hadoop?
- Turnkey solution for our data ecosystem
- Lots of tooling and room for expansion in the future
- Ready-to-use API and governance layer
- Scale, when and if needed
- No more reinventing the wheel, even though our requirements are very specific
- A (rather large) component that gives us the ability to quickly test new hypotheses and use cases
- We want to consume it directly (via API, not just as a backend)
- Cloud agnostic: deploy globally in other regions
24. Summary of Issues with WebHCat DDL
- The whole thing is slow: expect 20-30 second response times
- HCatalog does not honor Ranger policies for Hive, so callers get full access to all metadata
- You can use "like=prefix*" in REST calls, but that is client-side security only
- Bottom line: unusable for interactive web applications
25. Welcome to WebHCat JDBC Delegator
- We wrote a small wrapper (~50 lines) around HCatDelegator to run DDL commands over JDBC (github.com/apache/hive/pull/133); the idea is sketched below
- Sub-second response times
- Fully compliant with Ranger/Hive access control
- HA JDBC connectivity via a ZooKeeper URL
- Auto-refreshes Kerberos tickets
- Now good enough to build an interactive UI on top of it
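As a rough illustration of the delegation idea only, here is a minimal sketch of running DDL through HiveServer2 over JDBC with a ZooKeeper discovery URL. The class name, hosts and namespace are placeholders, and this is not the actual wrapper from the pull request; it assumes the hive-jdbc driver is on the classpath and that Kerberos login is handled separately.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class JdbcDdlDelegator {
    // HA connection string: the ZooKeeper quorum resolves a live HiveServer2
    // instance, and Ranger policies apply because the DDL runs through Hive.
    private static final String URL =
        "jdbc:hive2://zk1:2181,zk2:2181,zk3:2181/;"
        + "serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2";

    public void runDdl(String ddl) throws Exception {
        try (Connection conn = DriverManager.getConnection(URL);
             Statement stmt = conn.createStatement()) {
            stmt.execute(ddl); // e.g. "CREATE DATABASE o_89xxdm4x3_test"
        }
    }
}

Because the statement executes inside HiveServer2 rather than through HCatalog's REST path, Ranger authorization is enforced for free, which is what makes the sub-second, policy-compliant responses possible.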
27. Database API - Hive and Schemas
Hive/HCatalog has no concept of schemas or namespaces: the database name is the first and last level of granularity.
Issue: no two organizations can have the same database name!
Fix: prefix the database name with the customer's LDAP groupId (a sketch follows).
test => o_89xxdm4x3_test
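A minimal sketch of that mapping, assuming the "o_" prefix and groupId format shown in the slide's example; the class and method names are hypothetical:

public class DbNamespacer {
    // Map a logical database name to its per-customer physical name by
    // prefixing the customer's LDAP groupId (names here are illustrative).
    static String physicalName(String ldapGroupId, String logicalName) {
        return "o_" + ldapGroupId + "_" + logicalName;
    }
    // physicalName("89xxdm4x3", "test") -> "o_89xxdm4x3_test"
}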
33. WebHDFS: Beyond Ranger + Security
Layered security:
- Reverse proxy: make sure the user is signed in
- Ranger/Knox: make sure the user has the right role
- Ranger/HDFS: apply POSIX-style access rights
That's all good, but what if we want to limit web API access to only a certain folder in HDFS, say /user/UID? One option is sketched below.
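One way to add that extra layer, shown purely as an assumption-laden sketch rather than our actual implementation, is a filter at the reverse proxy that rejects WebHDFS calls outside the signed-in user's home directory (the same check could instead live in a Knox authorization rule):

import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class UserFolderFilter implements Filter {
    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest http = (HttpServletRequest) req;
        String uid = http.getRemoteUser();   // populated by the auth proxy
        String path = http.getRequestURI();  // e.g. /webhdfs/v1/user/jdoe/data.csv
        // Reject anything outside /user/<uid>, including ".." traversal tricks.
        if (uid == null || path.contains("..")
                || !path.startsWith("/webhdfs/v1/user/" + uid + "/")) {
            ((HttpServletResponse) res).sendError(HttpServletResponse.SC_FORBIDDEN);
            return;
        }
        chain.doFilter(req, res);
    }

    @Override public void init(FilterConfig config) {}
    @Override public void destroy() {}
}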
38. Pain Points
• Slow upload times
• Reduced replication factor
• Hard limit of 1 MB on NFS read/write size, enforced by the kernel
• SFTP sends 32 KB chunks to NFS
• Only reached 1.8-2 MB/s upload speed
• No overwrite support
• Small chunks put pressure on the NameNode
41. FUSE Buffered NFS to HDFS
• Faster upload times: up from 1.8 MB/s to 20 MB/s (a ~12x speed-up)
• Replication factor reset back to the default of 3
• Supports overwrite
• No more pressure on the NameNode
• Open source: https://github.com/datarepublic/gwfs (buffering idea sketched below)
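To illustrate the buffering idea only (this is not gwfs itself, whose internals we are not reproducing here), a wrapper like the following coalesces many small 32 KB writes into large ones before they reach HDFS:

import java.io.BufferedOutputStream;
import java.io.OutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BufferedHdfsWriter {
    // Open an HDFS output stream behind a large in-memory buffer so that
    // small NFS/SFTP-sized chunks coalesce into few big writes, reducing
    // per-write overhead and NameNode pressure. Buffer size is illustrative.
    public static OutputStream open(String hdfsPath) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        return new BufferedOutputStream(
            fs.create(new Path(hdfsPath), /* overwrite = */ true),
            8 * 1024 * 1024);
    }
}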
43. For best results with Kylin
- Fact tables partitioned by date
  • single depth, partitioned by day
- Fact tables clustered into buckets
  • 16-256 buckets, depending on table size
- EOD process to build/merge cubes
- Aggregated REST API exposed over cube SQL
A hypothetical table definition following these guidelines is sketched below.
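For concreteness, here is what such a fact-table DDL might look like; the table and column names are hypothetical, and the bucket count should be picked in the 16-256 range based on table size:

public class KylinFactTableDdl {
    // Illustrative Hive DDL: single-depth date partition plus bucketing on a
    // high-cardinality key, run over the same JDBC path as other DDL.
    static final String DDL =
        "CREATE TABLE fact_sales ("
        + "  customer_id STRING, merchant STRING, amount DECIMAL(18,2)) "
        + "PARTITIONED BY (ds DATE) "                    // partitioned by day
        + "CLUSTERED BY (customer_id) INTO 64 BUCKETS "  // 16-256 by table size
        + "STORED AS ORC";
}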
44. Cloud Data Access (CDA) Disaster Recovery
- JBOD: snapshot EBS volumes with Lambda
- AWS Users (split access/secret key), not Roles. Why?
- Cron job to distcp to S3, KMS-encrypted
- S3 VPC endpoint for the VPC
S3A settings we use (a programmatic sketch follows):

Key                                    Value
fs.s3a.experimental.input.fadvise      random
fs.s3a.fast.upload                     true
fs.s3a.fast.upload.buffer              bytebuffer
numListstatusThreads                   4
fs.s3a.access.key / fs.s3a.secret.key  AWS access and secret key, set in core-site.xml
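As a sketch only, the same tuning applied in code before a copy job; note that numListstatusThreads is a distcp flag rather than a configuration key (e.g. hadoop distcp -numListstatusThreads 4 ...):

import org.apache.hadoop.conf.Configuration;

public class S3aTuning {
    // Build a Configuration with the S3A settings from the table above.
    // Credentials are deliberately left to core-site.xml, as on the slide.
    public static Configuration tuned() {
        Configuration conf = new Configuration();
        conf.set("fs.s3a.experimental.input.fadvise", "random"); // random-access reads
        conf.setBoolean("fs.s3a.fast.upload", true);             // incremental multipart upload
        conf.set("fs.s3a.fast.upload.buffer", "bytebuffer");     // off-heap upload buffers
        return conf;
    }
}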
46. Data Marketplace: Analyze data from major retail, finance and loyalty brands.
Governance platform: Manage data exchanges from one secure dashboard.
Privacy management: Protect customer privacy with de-identification technology.
Secure cloud analytics: Run analytics projects in secure, encrypted cloud environments.
47. Rapid growth in the number and volume of datasets and PI: identifiable data for ~16m Australian adults. The datasets currently loaded onto the Senate platform contain 40m digital identities, covering 75% of the adult Australian population, with another 40m coming in early 2017.
Headline metrics from the slide: 17, 50, 180, +1500, 14B (published datasets, rows of data, data contributors, data scenario listings, certified partners).
48. Summary for this talk
• Data exchange is complex; it involves trust and governance, and data and metadata are part of it
• Hadoop gives us an amazing capability to build upon; we fulfil complicated new scenarios every day
• We've quickly solved all our small issues and delivered great business value