An Apache Hive Based Data Warehouse
7. © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache Hive: Journey to SQL:2011 Analytics

Legend: New (Hive 2) / Future work

Data Types
– Numeric: FLOAT, DOUBLE; DECIMAL; INT, TINYINT, SMALLINT, BIGINT; BOOLEAN
– String: CHAR, VARCHAR; BLOB (BINARY), CLOB (String)
– Date, Time: DATE, TIMESTAMP, Interval Types
– Complex Types: ARRAY / MAP / STRUCT / UNION
– Nested Data: Nested Data Traversal, Lateral Views
– Procedural Extensions: HPL/SQL

SQL Features
– Core SQL Features: Date, Time and Arithmetic Functions; INNER, OUTER, CROSS and SEMI Joins; Derived Table Subqueries; Correlated + Uncorrelated Subqueries; UNION ALL; UDFs, UDAFs, UDTFs; Common Table Expressions; UNION DISTINCT
– Advanced Analytics: OLAP and Windowing Functions (PARTITION BY, ORDER BY); UDAF Analytics; CUBE and Grouping Sets; XPath Analytics
– ACID Transactions: INSERT / UPDATE / DELETE
– Constraints: Primary / Foreign Key (Non Validated)

File Formats
– Columnar: ORCFile, Parquet
– Text: CSV, Logfile
– Nested / Complex: Avro, JSON, XML
– Custom Formats

Hive 2 and Other Features
– ACID MERGE; Multi Subquery; Scalar Subqueries; Non-Equijoins; INTERSECT / EXCEPT; Recursive CTEs; NOT NULL Constraints; Default Values; Multi-statement Transactions

Track Hive SQL:2011 Complete: HIVE-13554
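The ACID MERGE listed under Hive 2 can be sketched as a single-statement upsert. A minimal sketch, assuming hypothetical `customers` (ACID target) and `daily_updates` (source) tables:

```sql
-- Hypothetical example: upsert daily updates into a transactional table.
-- The target must be an ACID table, e.g.
--   STORED AS ORC TBLPROPERTIES ('transactional'='true')
MERGE INTO customers AS t
USING daily_updates AS s
  ON t.user_id = s.user_id
WHEN MATCHED THEN
  UPDATE SET total_spend = s.total_spend
WHEN NOT MATCHED THEN
  INSERT VALUES (s.user_id, s.region, s.total_spend);
```

This is the pattern behind use cases like slowly changing dimensions: one statement replaces a separate UPDATE-then-INSERT workflow.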
14. Storage
⬢ Of course HDFS, the default in the Hadoop world
⬢ More and more cloud
⬢ In S3 a move is a copy, but the current implementation assumes a move is atomic and nearly free
– being addressed by modifying Hadoop (HADOOP-11694) and Hive (HIVE-14535)
⬢ ACID in the cloud
– The compactor moves a lot of files around and needs to be optimized
– Need to figure out how streaming ingest works in the cloud
⬢ LLAP: caching is much more valuable in the cloud
– Looking at flushing the cache to SSD so misses are less costly
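As a sketch of the cloud-storage direction above, a Hive table can already point at S3 through the Hadoop s3a connector; the bucket, path, and column names here are hypothetical:

```sql
-- Hypothetical bucket and path; s3a:// is the Hadoop S3 connector scheme.
CREATE EXTERNAL TABLE customers_s3 (
  user_id     INT,
  region      STRING,
  total_spend DECIMAL(10,2)
)
STORED AS ORC
LOCATION 's3a://example-bucket/warehouse/customers/';
```

Note that queries over such a table inherit the S3 semantics discussed above: renames are copies, so ETL patterns that rely on cheap atomic moves need the Hadoop and Hive changes tracked in HADOOP-11694 and HIVE-14535.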
16. Security today in Hadoop

Authentication: Who am I / can I prove it?
• Kerberos
• API security with Apache Knox

Authorization: What can I do?
• Fine-grained access control with Apache Ranger

Audit: What did I do?
• Centralized audit reporting with Apache Ranger

Data Protection: Can data be encrypted at rest and over the wire?
• Wire encryption
• HDFS encryption + Ranger KMS

Centralized Security Administration with Ranger & Knox
17. Authentication: API Security with Knox

Apache Knox extends the reach of the Hadoop REST API without Kerberos complexities.

Single, simple point of access for a cluster
• Kerberos encapsulation
• Single Hadoop access point
• REST API hierarchy
• Consolidated API calls
• Multi-cluster support

Centralized and consistent secure API across one or more clusters
• Eliminates the SSH "edge node"
• Central API management
• Central audit control
• Service-level authorization

Integrated with existing IdM systems
• SSO: SAMLv2, SiteMinder and OAM
• LDAP and AD integration
• SSO for Hadoop UIs (Ranger, Ambari, ...)
18. Apache Ranger: Per-User Row Filtering by Region in Hive

CUSTOMERS table (read via LLAP data access):

User ID | Region | Total Spend
      1 | East   |       5,131
      2 | East   |      27,828
      3 | West   |      55,493
      4 | West   |       7,193
      5 | East   |      18,193

Original query:
SELECT * FROM CUSTOMERS WHERE total_spend > 10000

Queries are rewritten based on dynamic Ranger policies:

User 1 (West Region), dynamic rewrite:
SELECT * FROM CUSTOMERS WHERE total_spend > 10000 AND region = 'west'

User 2 (East Region), dynamic rewrite:
SELECT * FROM CUSTOMERS WHERE total_spend > 10000 AND region = 'east'
19. Apache Ranger: Dynamic Data Masking of Hive Columns

Protect sensitive data in real time with dynamic data masking/obfuscation.

Goal: mask or anonymize sensitive columns of data (e.g. PII, PCI, PHI) in Hive query output.

⬢ Benefits
– Sensitive information never leaves the database
– No changes are required at the application or Hive layer
– No need to produce additional protected duplicate versions of datasets
– Masking policies are simple and easy to set up
⬢ Core Technologies: Ranger, Hive
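Ranger applies masking transparently at query time, but the effect can be approximated directly with Hive's built-in masking UDFs. A minimal sketch, assuming a hypothetical `ssn` column on the CUSTOMERS table:

```sql
-- Hypothetical column; mask_show_last_n is a Hive built-in UDF that
-- reveals only the last n characters and masks the rest with defaults
-- (uppercase -> X, lowercase -> x, digits -> n).
SELECT user_id,
       mask_show_last_n(ssn, 4) AS ssn_masked
FROM customers;
```

With a Ranger masking policy in place, the plain `SELECT user_id, ssn FROM customers` is rewritten to an equivalent masked expression for non-privileged users, so no application change is needed.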
20. Dynamic Tag-Based Access Policies with Apache Atlas

• Basic tag policy (e.g. PII): access and entitlements must be tag-based (ABAC) and scalable in implementation.
• Geo-based policy: policy based on IP address; proxy IP substitution may be required. Rule enforcement must be geo-aware.
• Time-based policy: a timer for data access, decoupled from deletion of the data.
• Prohibitions: prevent combinations of Hive tables that may pose a risk together.

Key benefits:
• New scalable metadata-based security paradigm
• Dynamic, real-time policy
• Active protection: fast updates to changes
• Centralized and simple-to-manage policy
23. Apache Atlas Enables a Business Catalog for Ease of Use

⬢ Organize data assets along business terms
– Authoritative: hierarchical taxonomy creation
– Agile modeling: model conceptual, logical and physical assets
– Definition and assignment of tags like PII (Personally Identifiable Information)
⬢ Comprehensive features for compliance
– Multiple user profiles, including Data Steward and Business Analyst
– Object auditing to track who did it
– Metadata versioning to track what they did
⬢ Faster insight
– Data Quality tab for profiling and sampling
– User comments
28. Spark Column Security with LLAP

⬢ Fine-grained column-level access control for SparkSQL.
⬢ Fully dynamic per-user policies; no views required.
⬢ Standard Ranger policies and tools control access and masking.

Flow:
1. SparkSQL gets data locations, known as "splits", from HiveServer2 and plans the query.
2. HiveServer2 authorizes access using Ranger; per-user policies such as row filtering are applied.
3. Spark gets a modified query plan based on the dynamic security policy.
4. Spark reads data from LLAP; filtering and masking are guaranteed by the LLAP server.

(Architecture diagram: Spark Client; HiveServer2 (Authorization); Ranger Server (Dynamic Policies); Hive Metastore (Data Locations, View Definitions); LLAP (Data Read, Filter Pushdown).)
31. Scalable Data Warehousing on Hadoop

Capabilities and applications (legend: Existing / Development / Emerging):
• Batch SQL: ETL, Reporting, Data Mining, Deep Analytics
• OLAP / Cube: Multidimensional Analytics, MDX Tools, Excel, Reporting
• Interactive SQL: BI Tools (Tableau, Microstrategy, Cognos)
• Sub-Second SQL: Ad-Hoc, Drill-Down, BI Tools (Tableau, Excel)
• ACID / MERGE: Continuous Ingestion from an Operational DBMS, Slowly Changing Dimensions

Core Platform:
• Scale-Out Storage: Petabyte-Scale Processing
• Core SQL Engine: Apache Tez (Scalable Distributed Processing), Advanced Cost-Based Optimizer
• Advanced Security
• Connectivity: JDBC / ODBC, Comprehensive SQL:2011 Coverage, MDX
32. For More Details

⬢ Today
– Running Zeppelin in Enterprise – 3:10
– Dancing Elephants: Efficiently Working with Object Stores from Apache Spark and Hive – 4:20
– Open Metadata and Governance with Apache Atlas – 5:10
– LLAP: Building Cloud-First BI – 5:50
⬢ Tomorrow
– Interactive Analytics at Scale in Apache Hive Using Druid – 9:00
– Disaster Recovery and Cloud Migration for Your Apache Hive Warehouse – 11:00
– LLAP: Building Cloud-First BI – 11:50
– Treat Your Enterprise Data Lake Indigestion: Enterprise-Ready Security and Governance – 3:10
– Birds of a Feather Session for Hive and HBase – 6:00