Deploying enterprise-grade security for Hadoop, or six security problems with Apache Hive. In this talk we will discuss the security problems with Hive and then secure Hive with Apache Sentry. Additional topics include Hadoop security and Role-Based Access Control (RBAC).
1. 6 ways to exploit Hive
– and what to do about it
Brock Noland | Software Engineer, Cloudera
January 23, 2013
2. Outline
Introduction
• Hadoop security primer
  • Authentication
  • Authorization
Security options
• Default
• Kerberos with Impersonation
• Kerberos with Sentry
• Demo
3. Introduction
Tonight's focus is SQL-on-Hadoop
• Vast majority of Hadoop users use Hive or Cloudera Impala
• Data warehouse offload is the most common use case
• Data warehouse offload is a two-step process
  1. Automatic transformations moved to Hadoop
  2. Data analysts given query access
7. Default Authentication – trusted network
Default security mechanism
• Hadoop client uses local username
• Used in
  • POCs
  • Startups
  • Demos
  • Pre-prod environments
8. Default Authentication – trusted network
Client Host
User: brock
File: a.txt
Contents: some data
$ whoami
brock
$ cat a.txt
some data
$ hadoop fs -put file .
Hadoop
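The trust model on this slide can be illustrated with a toy sketch. The `TrustingFileSystem` class below is hypothetical and is not Hadoop code; it only mimics a server that records whatever username the client reports, which is why this mode is suitable only on a trusted network.

```python
class TrustingFileSystem:
    """Toy stand-in for a server that trusts the client-reported username."""

    def __init__(self):
        self.owners = {}  # path -> owner name as claimed by the client

    def put(self, claimed_user, path, data):
        # No credential check: the claimed name is accepted as-is.
        self.owners[path] = claimed_user
        return len(data)


fs = TrustingFileSystem()

# An honest client reports its local login name (what `whoami` prints).
fs.put("brock", "/user/brock/a.txt", b"some data")

# Nothing stops a client from reporting a different name instead.
# The path and the name "hdfs" are illustrative assumptions.
fs.put("hdfs", "/tmp/forged.txt", b"some data")

print(fs.owners)
```

In the sketch both writes succeed, and the second file is recorded as owned by a user the client merely claimed to be.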
9. Strong Authentication – Kerberos
• Hadoop is secured with Kerberos
  • Provides mutual authentication
  • Protects against eavesdropping and replay attacks
• Every user and service has a Kerberos “principal”
  • Service: impala/hostname@MYCOMPANY.COM
  • User: brock@MYCOMPANY.COM
• Credentials
  • Service: keytabs
  • User: password
10. Strong Authentication – Kerberos
Client Host
User: brock
<kerberos ticket>
<encrypted data> *
$ whoami
brock
$ kinit
Password: *******
$ cat a.txt
some data
$ hadoop fs -put file .
Hadoop
* RPC Encryption must be enabled
11. Strong Authentication – Kerberos
• Keytab
  • Encrypted key for servers (similar to a “password”)
  • Generated by server such as MIT Kerberos or Active Directory
12. Hive Server 2 and Oozie
Beeline (Hive CLI), Tableau, JDBC → Hive Server 2 (HS2) → Hadoop
Oozie CLI, Control-M → Oozie → Hadoop
13. Strong Authentication – Kerberos
• Impersonation
  • Services such as Hive Server 2 impersonate users
  • Data loaded by “joe” via HS2 is owned by “joe”
  • Oozie jobs submitted by “brock” are run as “brock”
14. Authorization
• HDFS permissions
  • Unix style
  • Read/Write/Execute for Owner/Group/Other
  • Coarse grained
• Other Hadoop components have authorization
  • MapReduce: who can use which job queues
  • HBase: table ACLs
15. HDFS Permissions
$ hadoop fs -ls file
-rw-r-----   1 analyst1 analysts 2244 2014-01-19 12:15 file
• Permissions
  • Unix style permissions
  • Read/Write/Execute
  • Owner/Group/Other
• Owner
  • One and only one owner
• Group
  • One and only one group
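To make the owner/group/other rule concrete, here is a minimal sketch of a Unix-style check against the mode string from the listing above. The function `can_access` and the users `analyst2` and `jranalyst1` with their groups are illustrative assumptions, not Hadoop APIs, and the sketch ignores the HDFS superuser.

```python
def can_access(mode, owner, group, user, user_groups, want):
    """Return True if `user` may perform `want` ('r', 'w' or 'x').

    `mode` is the 10-character string from the listing, e.g. '-rw-r-----'.
    Exactly one class applies: owner first, then group, then other.
    """
    perms = mode[1:10]  # drop the leading file-type character
    if user == owner:
        triplet = perms[0:3]
    elif group in user_groups:
        triplet = perms[3:6]
    else:
        triplet = perms[6:9]
    flag = triplet["rwx".index(want)]
    # Lowercase 's'/'t' in the execute slot also imply execute permission.
    return flag == want or (want == "x" and flag in "st")


# The file from the listing: mode, owner, group.
LISTING = ("-rw-r-----", "analyst1", "analysts")

print(can_access(*LISTING, "analyst1", {"analysts"}, "w"))      # owner may write
print(can_access(*LISTING, "analyst2", {"analysts"}, "r"))      # group may read
print(can_access(*LISTING, "analyst2", {"analysts"}, "w"))      # group may not write
print(can_access(*LISTING, "jranalyst1", {"jranalysts"}, "r"))  # others get nothing
```

Because a file carries exactly one owner and one group, sharing with a second set of users means changing the group or opening the "other" class.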
16. Back to our use case
• Scenario facts
  • ETL offload is a success
  • Data warehouse is expensive and at capacity
  • Same data is in Hadoop
• Next step
  • End users start using Hadoop to augment the DW
  • Security becomes primary concern
17. End users need to share data
Unlike automated ETL jobs, end users want to share data with peers
• Must manage HDFS permissions manually
• Each file has a single group
• End result is users set permissions to world readable/writeable
19. Hive: Security holes
-- Load and run an arbitrary Java class as a UDF
CREATE TEMPORARY FUNCTION custom_udf
AS 'com.mycompany.MaliciousClass';
-- Stream rows through an arbitrary script
SELECT TRANSFORM(stuff)
USING 'malicious-script.pl'
AS thing1, thing;
-- Point a table at any HDFS path
CREATE EXTERNAL TABLE external_table(column1 string)
LOCATION '/path/to/any/table';
20. Hive: Security holes
-- A SerDe is an arbitrary Java class run when the table is read or written
CREATE TABLE test (c1 string)
ROW FORMAT SERDE 'com.mycompany.MaliciousClass';
-- MAP and REDUCE stream rows through arbitrary scripts
FROM (
  FROM t1
  MAP t1.c1
  USING 'malicious-script1.pl'
  CLUSTER BY key) map_output
INSERT OVERWRITE TABLE t2
  REDUCE t2.c1
  USING 'malicious-script2.pl'
  AS c2;
21. Default: Authorization
• Hive ships with an “advisory” authorization system
  • All users see all databases/tables/columns
  • Does not fix any security holes
  • Users grant themselves permissions
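The last point can be pictured with a toy model. The `AdvisoryAuthorizer` class below is a hypothetical sketch, not Hive's implementation; it only shows why a grant operation that is not itself access-checked offers no protection against a determined user.

```python
class AdvisoryAuthorizer:
    """Toy model: privileges are checked, but granting them is not."""

    def __init__(self):
        self.grants = set()  # (user, privilege, table) triples

    def grant(self, requested_by, user, privilege, table):
        # Advisory model: nobody verifies that `requested_by` may grant.
        self.grants.add((user, privilege, table))

    def check(self, user, privilege, table):
        return (user, privilege, table) in self.grants


auth = AdvisoryAuthorizer()
denied_first = auth.check("jranalyst1", "SELECT", "manager1_table")

# The junior analyst simply grants the privilege to themselves.
auth.grant("jranalyst1", "jranalyst1", "SELECT", "manager1_table")
allowed_after = auth.check("jranalyst1", "SELECT", "manager1_table")

print(denied_first, allowed_after)
```

Such a system guards against accidents, not against users who want access.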
23. Kerberos with impersonation: Sharing data
The user “manager1” wants to share the table “manager1_table” with senior analysts but not junior analysts.
# hadoop fs -ls -R /user/hive/warehouse
drwxr-x--T   - analyst1   analyst1   0 analyst1_table
drwxr-x--T   - jranalyst1 jranalyst1 0 jranalyst1_table
drwxr-x--T   - manager1   manager1   0 manager1_table
24. Kerberos with impersonation: Sharing data
IT must create a group
# groupadd senioranalysts
Then add the appropriate members to group
# usermod -G analyst,senioranalysts analyst1
# usermod -G management,analyst,senioranalysts manager1
25. Kerberos with impersonation: Sharing data
Then “manager1” can manually change the file permissions
$ hadoop fs -chgrp -R senioranalysts …/warehouse/manager1_table
$ hadoop fs -ls /user/hive/warehouse/
Found 3 items
drwxr-x--T   - analyst1   analyst1       0 analyst1_table
drwxr-x--T   - jranalyst1 jranalyst1     0 jranalyst1_table
drwxr-x--T   - manager1   senioranalysts 0 manager1_table
26. Kerberos with impersonation: Sharing data
Now any senior-level analyst can query the data
$ whoami
analyst1
$ beeline ...
Connected to: Hive (version 0.10.0)
0: jdbc:hive2://localhost:10000/default> select count(*) from manager1_table;
+------------+
| count(*)   |
+------------+
| 47         |
+------------+
27. Kerberos with impersonation: Sharing data
Junior analysts cannot query the data:
$ whoami
jranalyst1
$ beeline ...
Connected to: Hive (version 0.10.0)
0: jdbc:hive2://localhost:10000/default> select * from manager1_table;
Error: java.io.IOException: org.apache.hadoop.security.AccessControlException: Permission denied: user=jranalyst1, access=READ_EXECUTE, inode="/user/hive/warehouse/manager1_table":manager1:senioranalysts:drwxr-x--T
29. Kerberos with impersonation: Sharing data
Table “manager1_table” is owned by user/group “manager1”
$ hadoop fs -ls /user/hive/warehouse/
Found 3 items
drwxr-x--T   - analyst1   analyst1   0 analyst1_table
drwxr-x--T   - jranalyst1 jranalyst1 0 jranalyst1_table
drwxr-x--T   - manager1   manager1   0 manager1_table
30. Kerberos with impersonation: Sharing data
User “manager1” makes “manager1_table” world readable/writable
$ hadoop fs -chmod -R 777 /user/hive/warehouse/manager1_table
$ hadoop fs -ls /user/hive/warehouse/
Found 3 items
drwxr-x--T   - analyst1   analyst1   0 analyst1_table
drwxr-x--T   - jranalyst1 jranalyst1 0 jranalyst1_table
drwxrwxrwt   - manager1   manager1   0 manager1_table
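The mode strings in these listings can be decoded with Python's standard `stat` module, assuming the listing follows the usual Unix rendering. The helper `listing_mode` is a hypothetical name used only for this sketch. An uppercase `T` marks the sticky bit on a directory that "other" cannot enter; after `chmod 777` the same position shows a lowercase `t` because execute for "other" is now set as well.

```python
import stat


def listing_mode(permission_bits, is_dir=True):
    """Render permission bits the way an ls-style listing prints them."""
    file_type = stat.S_IFDIR if is_dir else stat.S_IFREG
    return stat.filemode(file_type | permission_bits)


# Sticky bit (0o1000) plus rwx for owner, r-x for group, nothing for other.
before = listing_mode(0o1750)

# Sticky bit plus rwx for owner, group and other, as after `chmod 777`.
after = listing_mode(0o1777)

print(before)  # drwxr-x--T
print(after)   # drwxrwxrwt
```

The second value is the "end state" the scenario warns about: every user on the cluster can read and write the table's files.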
31. Kerberos with impersonation: Summary
• Securing Hive with Kerberos makes Hive unusable for DW offload
  • Manual file permission management
  • End state is world writable/readable
  • No ability to restrict access to columns or rows
  • All users see all databases/tables/columns
33. Fine Grained Security: Apache Sentry
Authorization module for Hive, Search, & Impala
• Unlocks Key RBAC Requirements
  • Secure, fine-grained, role-based authorization
  • Multi-tenant administration
• Open Source
  • Apache Incubator project
• Ecosystem Support
  • Apache Solr, HiveServer2, & Impala 1.1+
34. Key Benefits of Sentry
Store Sensitive Data in Hadoop
Extend Hadoop to More Users
Comply with Regulations
35. Key Capabilities of Sentry
• Fine-Grained Authorization
  • Specify security for SERVERS, DATABASES, TABLES & VIEWS
• Role-Based Authorization
  • SELECT privilege on views & tables
  • INSERT privilege on tables
  • ALL privilege on the server, databases, tables & views
  • ALL privilege is needed to create/modify schema
• Multi-Tenant Administration
  • Separate policies for each database/schema
  • Can be maintained by separate admins
Many, many ways to execute arbitrary code: Hive was originally created by web companies that simply don’t care about security. In fact, we often run into pushback from the community when integrating security. In my presentation at the TC HUG I will explain in detail all the ways in which Hive is insecure. The point is that, by default, any user can execute any code they wish.
Users grant themselves permissions: Users can query any data they please by granting themselves permissions.
Zero metadata security: It is not possible to stop users from modifying or viewing any metadata.
Manual file permission management: When users want to share tables and data with other users, they must modify file permissions. Can anyone guess what happens next?
End state is world writable/readable: Users end up making data world writable and readable.
No ability to restrict access to columns or rows: Users cannot be restricted to a subset of the data, so tables are copied simply to restrict access, which results in thousands of out-of-date tables with full read and write permissions.
Role-Based Access Control (RBAC)
For finer-grained access to data accessible via schema (that is, data structures described by the Apache Hive Metastore and used by computing engines like Hive and Impala, as well as collections and indices within Cloudera Search), Cloudera developed Apache Sentry, which offers a highly modular, role-based privilege model for this data and its given schema. (Cloudera donated Apache Sentry to the Apache Software Foundation in 2013.)
Sentry governs access to each schema object in the Metastore via a set of privileges like SELECT and INSERT. The schema objects are common entities in data management, such as SERVER, DATABASE, TABLE, COLUMN, and URI, i.e. file location within HDFS. Cloudera Search has its own set of privileges, e.g. QUERY, and objects, e.g. COLLECTION.
As with other RBAC systems that IT teams are already familiar with, Sentry provides for:
• Hierarchies of objects, with permissions automatically inherited by objects that exist within a larger umbrella object
• Rules containing a set of multiple object/permission pairs
• Groups that can be granted one or more roles
• Users that can be assigned to one or more groups
Sentry is normally configured to deny access to services and data by default, so that users have limited rights until they are assigned to a group that has explicit access roles.
Column-level Security, Row-level Security and Masked Access
Using the combination of Sentry-based permissions, SQL views, and User Defined Functions (UDFs), developers can gain a high degree of access control granularity for SQL computing engines through HiveServer2 and Impala, including:
• Column-level security: To limit access to only particular columns of entire tables, users can access the data through a view, which either contains a subset of the columns in the table or has certain columns masked. For example, a view can filter a column to only the last four digits of a US Social Security number.
• Row-level security: To limit access by particular values, views can employ CASE statements to control the rows to which a group of users has access. For example, a broker at a financial services firm may only be able to see data within her managed accounts.
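The view-based approach can be tried out in miniature. The sketch below uses SQLite through Python's `sqlite3` module purely as a stand-in for Hive or Impala; the table, view and column names and all of the rows are invented for illustration. It also uses a fixed WHERE clause for the row filter rather than a CASE expression, and it has no privilege system, so in a Sentry deployment the step of granting SELECT on the views while withholding it on the base table is left out here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (broker TEXT, customer TEXT, ssn TEXT, balance INTEGER);
    INSERT INTO accounts VALUES
        ('alice', 'c1', '123-45-6789', 100),
        ('alice', 'c2', '987-65-4321', 250),
        ('bob',   'c3', '555-12-3456', 75);

    -- Column-level: expose only the last four digits of the SSN.
    CREATE VIEW accounts_masked AS
        SELECT broker, customer,
               'XXX-XX-' || substr(ssn, -4) AS ssn_last4,
               balance
        FROM accounts;

    -- Row-level: expose only the rows managed by one broker.
    CREATE VIEW accounts_alice AS
        SELECT customer, balance FROM accounts WHERE broker = 'alice';
""")

masked = conn.execute(
    "SELECT ssn_last4 FROM accounts_masked ORDER BY customer"
).fetchall()
alice_rows = conn.execute(
    "SELECT customer FROM accounts_alice ORDER BY customer"
).fetchall()

print(masked)
print(alice_rows)
```

Users who can query only the views never see a full SSN or another broker's accounts, and no copy of the table is needed.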
Impala metadata queries, e.g. “SHOW TABLES”, query the Hive Metastore directly and then query Sentry to filter the results before returning them.
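The fetch-then-filter flow can be sketched as follows. This is a hypothetical illustration and not Impala's implementation; `show_tables`, `tables_for` and the `grants` mapping are assumed names, with a plain dictionary standing in for the authorization service.

```python
def show_tables(metastore_tables, is_authorized):
    """Return only the tables the authorization callback approves."""
    return [name for name in metastore_tables if is_authorized(name)]


# Everything the metastore knows about.
all_tables = ["analyst1_table", "jranalyst1_table", "manager1_table"]

# Stand-in for the authorization service: user -> tables with some privilege.
grants = {
    "analyst1": {"analyst1_table", "manager1_table"},
    "jranalyst1": {"jranalyst1_table"},
}


def tables_for(user):
    allowed = grants.get(user, set())
    return show_tables(all_tables, lambda name: name in allowed)


print(tables_for("analyst1"))
print(tables_for("jranalyst1"))
print(tables_for("unknown_user"))
```

A user with no grants gets an empty list, in contrast to the default behaviour where all users see all databases, tables and columns.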