(Presented by Intel) This is the best of times and the worst of times for cloud services developers. At no other time in history has open access to data, open interfaces to data analytics, and open licensing of source code come together with scalable, cost-effective, cloud infrastructures. This is the good news.
The bad news is that enterprises are being left behind. Stymied by concerns about data protection and data governance, enterprises need proof that services and solutions built on a cloud infrastructure comply with the policies and practices they’ve come to learn (not necessarily love). At the heart of this is the root-of-trust issue: how far down can I trust the cloud service, its infrastructure software, and the data that it analyzes? And how do I know my keys are safe? Join this session to learn how Intel has been enabling trusted analytics with cloud services secured top to bottom – from Apache Hadoop to Java, Xen, and Linux – without compromising security.
2. Data-driven discoveries depend on analytics
• Operational efficiency
• Consumer behavior
• Security & risk management
• Traffic optimization
• Location-aware ad placement
• Personalized preventive care
• Smart energy grid
• Buyer protection program
• Claim fraud reduction
3. Machine-generated data requires end-to-end analytics
• Traditional Analytics (1990s): descriptive analysis, business intelligence, and reporting; internally sourced, relatively small, structured data; analysts and quants huddled in back rooms
• Big Data Analytics (2000s): interactive analysis, complex queries, and data-intensive models; fast and large amounts of poly-structured data from multiple sources; data scientists at the fore
• End-to-End Analytics (2010s): real-time analysis of streaming data from IoT; predictive and prescriptive analysis integrated into organizational processes; widespread access to tools
4. End-to-end analytics for the Internet of Things era
• Verticals: help build lighthouse solutions for targeted verticals
• Analytics platform: enable a horizontal platform for end-to-end analytics
• Data platform: accelerate the evolution of Apache Hadoop
• Servers, storage, and network: catalyze architectural transitions to drive growth
5. End-to-end analytics needs software-defined infrastructure
[Diagram: software-defined infrastructure stack]
• API: processing, orchestration, compliance, service assurance
• Datacenter operating systems: file system, security, scheduler
• Intelligent workload placement
• Composable resource pools: compute, storage, network
• Datacenter facilities: thermals, power, location
6. Apache Hadoop as a Datacenter Operating System
[Diagram: Hadoop mapped to operating-system functions]
• API: Hadoop, Storm, GraphLab, Spark, Shark, MPI; Expressway
• Scheduler: YARN + SLURM | Moab
• Memory management: future NVM
• Process management and I/O: future fabric controller
• Security and data governance: TXT, AES-NI, Rhino
• File systems: HDFS, LustreFS, GlusterFS, Ceph + Kafka
7. Intel leadership in foundational technologies of big data
• HPC: enabling technical computing on massive data sets
• Cloud: helping organizations build open, interoperable clouds
• Open source: contributing code and fostering the ecosystem
Intel employs over 10,000 software developers
* Other names and brands may be claimed as the property of others.
8. Hadoop in a virtualized infrastructure
• Good
– Agility: lets you bring up and tear down resources quickly, on demand
– Fault tolerance: protects against single points of failure in Hadoop/HDFS (NameNode, JobTracker, ZooKeeper) and reduces downtime for planned updates
– Resource efficiency: run multiple Hadoop clusters or other applications
– Security: isolate clusters or nodes
– Simpler management of the datacenter
• Bad
– The performance hit of virtualization is indeterminate and hard to optimize
– Storage configuration with SAN and NAS is very different from the direct-attached storage of typical Hadoop
– Nested virtualization, with a JVM in a VM, is philosophically uncomfortable
9. Hadoop in the cloud
• Good
– If your data is stored in a cloud provider's storage infrastructure, moving compute to the data is logical.
– If your analytics jobs are infrequent, you can rent the cluster only when you need it.
– Isolation offers security.
– Easy to use, easy to expand.
– Pay as you go.
• Bad
– Cost of storage rises with the rate of ingest and retention.
– Cost of compute rises with cluster time; there is no "spare cluster time" for low-priority work.
– Hadoop makes assumptions about running on a fixed physical infrastructure.
10. Deploying IDH on AWS
• Use a hop machine to connect into the VPC (private network) for IDH. This is the only machine that allows inbound SSH connections from clients on the internet. You must SSH into the hop machine to gain access to machines in the VPC.
• The hop machine hosts the aws_system scripts.
• Although data may be retained on AWS, do not expect data to always be saved. Assume machines and data may be removed at any time. Save any needed data or results to another location.
11. Deploying IDH on AWS
createIDHCluster.sh
• Picks a management node. This should be the first IP address in the list of IPs that you specify in the nodeips argument.
• After the nodes are running, verifies that it can SSH in as the root user on the management node, and as either the root user or some other non-root user on the other nodes.
• Checks that IDH is NOT installed on any of the nodes. If it cannot SSH in, or if IDH is already installed, the script exits with a failure.
• Copies the IDH tarball and idhscripts.tar to the management node.
• On the management node, sets up the yum repository and installs Intel Manager, then installs and configures IDH on all the nodes.
14. Intel® Distribution for Apache Hadoop* software
• Hardware-enhanced performance & security
• Enables partner innovation in analytics
• Strengthens the Apache Hadoop* ecosystem
Intel employs over 300 people developing and supporting big data software
15. Hadoop Security and Compliance Challenges
[Diagram: the Hadoop ecosystem — HDFS 2.0 (Hadoop Distributed File System); YARN/MRv2 (distributed processing framework); HBase with coprocessors (real-time distributed BigTable); Hive with HiveQL (interactive query); Pig (data flow: compiler, planner, driver); Mahout (data mining); Giraph (graph analysis framework); HCatalog (metadata); ZooKeeper (coordination); R connectors (statistics); Flume (log data collector); Sqoop (RDB data collector); Oozie]
Hadoop is an ecosystem of loosely coupled components
16. Hadoop Security and Compliance Challenges
[Same ecosystem diagram, highlighting components sharing an authentication framework]
17. Hadoop Security and Compliance Challenges
[Same ecosystem diagram, highlighting components capable of access control]
18. Hadoop Security and Compliance Challenges
[Same ecosystem diagram, highlighting components capable of admission control]
19. Hadoop Security and Compliance Challenges
[Same ecosystem diagram, highlighting components capable of (transparent) encryption]
20. Hadoop Security and Compliance Challenges
[Same ecosystem diagram, highlighting components sharing a common policy engine]
21. Hadoop Security and Compliance Challenges
[Same ecosystem diagram, highlighting components sharing a common audit log format]
22. Project Rhino
• Strategic objectives
– Framework support for encryption and key management
– Token-based authentication and SSO for internal cluster services
– Role-based access control for simpler administration of authorizations
– A common authorization framework, optional but easy to adopt
– Consistent audit logging, enhanced for compliance support
• Current projects
– Develop a crypto framework in Hadoop Common
– Enable transparent encryption in HBase
– Extend HBase support for ACLs to the cell level
23. Intel Distribution: Security
[Diagram: Intel Distribution stack]
• Connectors: Netezza, Oracle, SAP, SQL Server, Teradata, DB2
• Vertical accelerators: behavior model, recommendation engine
• Analytics workbench: heat map, HBase explorer
• Components: Oozie (workflow), ZooKeeper (coordination), Lucene/Solr (search), Tribeca (graph mining), Gryphon (low-latency SQL-92), Pig (scripting), Mahout (machine learning), R (stats), Hive (query), HCatalog (metadata), HBase, Sqoop (data transfer), Flume (log collector), Kafka (event bus)
• Processing: YARN (+MapReduce) distributed processing framework; SLURM scheduler; job profiler; resource monitor
• Storage: HDFS | Lustre | GlusterFS (Hadoop-compatible file systems)
• Management: security controls, upgrade, alerts, unified logging, tuning, high availability and disaster recovery, configuration, deployment
• Rhino (security): encryption, authentication, authorization, auditing
All external names and brands are claimed as the property of others.
24. Enterprise data requires defense in depth
[Diagram: layered defenses — firewall, gateway, isolation, authentication, authorization, encryption, audit & alerts]
25. Intel Expressway protects Hadoop APIs
[Diagram: Expressway gateway in front of the REST APIs — HCatalog, Stargate, WebHDFS]
• Enforces consistent security policies across all Hadoop services
• Serves as a trusted proxy to Hadoop, HBase, and WebHDFS APIs
• Complies with Common Criteria EAL4+, HSM, and FIPS 140-2 certifications
• Deploys as software, a virtual appliance, or a hardware appliance
26. Kerberos authenticates Hadoop services
[Diagram: (1) the client requests a ticket from the KDC, (2) the KDC sends a service ticket, (3) the client requests the service with the ticket, (4) the service validates the ticket, (5) the service sends the response]
• A wizard enables setup of a secure cluster with encrypted key exchange
• Intel Manager generates a principal and keytab for Hadoop services
• Intel Manager enables batch upload of keytab files
27. Intel Manager simplifies role-based access control
• File-, table-, and service-level controls
• Intel Manager pushes ACLs to each node
28. Intel Distribution provides HDFS encryption
• Extends the compression codec into a crypto codec
• Provides an abstract API for general use
[Diagram: MapReduce pipeline — RecordReader decrypts HDFS input with a derivative key; Map, Combiner, and Partitioner run on plaintext; intermediate data is encrypted before merge & sort on local disk and decrypted for Reduce; RecordWriter encrypts final output with a derivative key]
29. Crypto Codec Framework
• Extends the compression codec and establishes a common abstraction at the API level that can be shared by all crypto codec implementations, as well as by users of the API

CryptoCodec cryptoCodec = (CryptoCodec) ReflectionUtils.newInstance(codecClass, conf);
CryptoContext cryptoContext = new CryptoContext();
...
cryptoCodec.setCryptoContext(cryptoContext);
CompressionInputStream input = cryptoCodec.createInputStream(inputStream);
...

• Provides a foundation for other components in Hadoop* such as MapReduce or HBase* to support encryption features
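The stream-wrapping design described above mirrors the JDK's own cipher streams: a plain stream is wrapped so that encryption happens transparently on write and decryption on read. A minimal self-contained sketch of that pattern using only standard javax.crypto — not the IDH CryptoCodec API; the class name, key, and IV here are illustrative demo values:

```java
import javax.crypto.Cipher;
import javax.crypto.CipherInputStream;
import javax.crypto.CipherOutputStream;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class StreamWrapDemo {
    static final byte[] KEY = "0123456789abcdef".getBytes(StandardCharsets.UTF_8); // 128-bit demo key
    static final byte[] IV  = new byte[16]; // fixed IV: fine for a demo, never for production

    static Cipher cipher(int mode) throws Exception {
        Cipher c = Cipher.getInstance("AES/CBC/PKCS5Padding");
        c.init(mode, new SecretKeySpec(KEY, "AES"), new IvParameterSpec(IV));
        return c;
    }

    // Encrypt on write: the caller sees an ordinary OutputStream,
    // just as a crypto codec hands back a wrapped CompressionOutputStream.
    static byte[] encrypt(byte[] plaintext) throws Exception {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (CipherOutputStream out = new CipherOutputStream(sink, cipher(Cipher.ENCRYPT_MODE))) {
            out.write(plaintext);
        }
        return sink.toByteArray();
    }

    // Decrypt on read: the caller sees an ordinary InputStream.
    static byte[] decrypt(byte[] ciphertext) throws Exception {
        ByteArrayOutputStream plain = new ByteArrayOutputStream();
        try (CipherInputStream in =
                 new CipherInputStream(new ByteArrayInputStream(ciphertext), cipher(Cipher.DECRYPT_MODE))) {
            int b;
            while ((b = in.read()) != -1) plain.write(b);
        }
        return plain.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        byte[] roundTrip = decrypt(encrypt("hello codec".getBytes(StandardCharsets.UTF_8)));
        System.out.println(new String(roundTrip, StandardCharsets.UTF_8)); // prints "hello codec"
    }
}
```

Because callers only ever touch the wrapped streams, swapping a compression codec for a crypto codec (or chaining both) requires no change to reading or writing code — the property the framework relies on.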
31. Crypto Codec: API Example
Usage is aligned with the compression codec API, but with a crypto context added:
Configuration conf = new Configuration();
CryptoCodec cryptoCodec =
(CryptoCodec) ReflectionUtils.newInstance(AESCodec.class, conf);
CryptoContext cryptoContext = new CryptoContext();
cryptoContext.setKey(Key.derive(password));
cryptoCodec.setCryptoContext(cryptoContext);
DataInputStream input = inputFile.getFileSystem(conf).open(inputFile);
DataOutputStream outputStream = outputFile.getFileSystem(conf).create(outputFile);
CompressionOutputStream output = cryptoCodec.createOutputStream(outputStream);
// encrypt the stream
writeStream(input, output);
input.close();
output.close();
32. Crypto Codec: A Simple MapReduce Example
Usage is aligned with compression codec usage in a MapReduce job, but with crypto context resolution:
Job job = Job.getInstance(conf, "example");
JobConf jobConf = (JobConf)job.getConfiguration();
FileMatches fileMatches = new FileMatches(
KeyContext.refer("KEY00", Key.KeyType.SYMMETRIC_KEY, "AES", 128));
fileMatches.addMatch("^.*/input1.intelaes$",
KeyContext.refer("KEY01", Key.KeyType.SYMMETRIC_KEY, "AES", 128));
String keyStoreFile = "file:///" + secureDir + "/my.keystore";
String keyStorePasswordFile = "file:///" + secureDir + "/my.keystore.passwords";
KeyProviderConfig keyProviderConfig =
KeyProviderCryptoContextProvider.getKeyStoreKeyProviderConfig(
keyStoreFile, "JCEKS", null, keyStorePasswordFile, true);
KeyProviderCryptoContextProvider.setInputCryptoContextProvider(
jobConf, fileMatches, true, keyProviderConfig);
33. Key Distribution and Protection for MapReduce
• Targets
– A framework on the MapReduce side for enabling the crypto codec in MapReduce jobs: key context resolution, distribution, and protection
– Enabling different key storage or management systems to plug in to provide keys
– Satisfying the common requirement that stages and files of a single job may use different keys
• A complete key management system is not part of Intel® Distribution for Apache Hadoop* software
– An API to integrate with an external key management system is included
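The `Key.derive(password)` call in the API example suggests password-based key derivation. The IDH internals are not shown here, but the standard JDK equivalent is PBKDF2 via SecretKeyFactory; a hedged sketch — the salt, iteration count, and class name are illustrative, not IDH's values:

```java
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.PBEKeySpec;
import java.nio.charset.StandardCharsets;

public class DeriveKeyDemo {
    // Derive a 128-bit AES key from a password with PBKDF2.
    // A production system would use a per-key random salt; this demo fixes it
    // so the derivation is reproducible.
    static byte[] deriveKey(char[] password, byte[] salt) throws Exception {
        SecretKeyFactory f = SecretKeyFactory.getInstance("PBKDF2WithHmacSHA256");
        PBEKeySpec spec = new PBEKeySpec(password, salt, 65536, 128);
        return f.generateSecret(spec).getEncoded();
    }

    public static void main(String[] args) throws Exception {
        byte[] salt = "demo-salt".getBytes(StandardCharsets.UTF_8);
        byte[] key = deriveKey("password".toCharArray(), salt);
        System.out.println("derived key length: " + key.length + " bytes");
    }
}
```

Deterministic derivation (same password and salt yield the same key) is what lets independently running tasks reconstruct the same data encryption key without shipping the raw key itself.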
34. Secrets Distribution
[Diagram: job credentials and the data encryption key are delivered to the IM Agent on each node — via shared storage or distributed to each node — and the agent supplies them to the tasks running on that node]
IM Agent: Intel® Manager for Apache Hadoop* is a service resident in each cluster node.
35. Pig* & Hive* Encryption: Overview
[Diagram: a client submits Pig* and Hive* jobs to the cluster; the master key is uploaded over HTTPS and is itself encrypted; the Secrets Protection Service in Intel® Manager for Apache Hadoop* software decrypts secrets; job input/output data in HDFS*, intermediate MapReduce data, and secrets on local disk are all encrypted]
36. Pig* & Hive* Encryption
• Pig* encryption capabilities
– Support for the text file and Avro* file formats
– Intermediate job output file protection
– Pluggable key retrieval and key resolution
– Protection of key distribution in the cluster
• Hive* encryption capabilities
– Support for the RC file and Avro file formats
– Intermediate and final output data encryption
– Encryption is transparent to the end user, without changing existing SQL
39. Intel® Data Protection Technology
Advanced Encryption Standard New Instructions (AES-NI)
• Processor assistance for performing AES encryption
• Makes enabled encryption software faster and stronger
– Data in motion: secure transactions used pervasively in e-commerce, banking, etc.
– Data at rest: full-disk encryption software protects data while saving to disk
– Data in process: most enterprise and cloud applications offer encryption options to secure information and protect confidentiality
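AES-NI is transparent to software built on standard crypto APIs: the HotSpot JVM, for example, has AES intrinsics that compile `javax.crypto` AES ciphers down to the hardware instructions when the CPU supports them, with no source change. A minimal sketch of the standard API an AES-NI-enabled stack accelerates (the key and plaintext are demo values; ECB with no padding is used only to show a single raw block):

```java
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;

public class AesNiDemo {
    // Ordinary javax.crypto AES: on an AES-NI CPU the JIT substitutes the
    // hardware instructions; on other CPUs the same code runs in software.
    static byte[] aesEncryptBlock(byte[] key, byte[] block) throws Exception {
        Cipher c = Cipher.getInstance("AES/ECB/NoPadding"); // one raw block, demo only
        c.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"));
        return c.doFinal(block);
    }

    public static void main(String[] args) throws Exception {
        byte[] key   = "0123456789abcdef".getBytes(StandardCharsets.UTF_8); // 128-bit key
        byte[] block = "16-byte plaintxt".getBytes(StandardCharsets.UTF_8); // exactly one block
        System.out.println("ciphertext bytes: " + aesEncryptBlock(key, block).length);
    }
}
```

This transparency is why the deck can claim hardware-enhanced performance without requiring application changes: the encryption layers in HDFS, HBase, and the crypto codec all sit on top of this kind of standard AES primitive.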
43. Legal Disclaimer
Intel® AES-NI requires a computer system with an AES-NI enabled processor, as well as non-Intel software to execute the instructions in
the correct sequence. AES-NI is available on select Intel® processors. For availability, consult your reseller or system manufacturer. For
more information, see Intel® Advanced Encryption Standard Instructions (AES-NI).
• Software Source Code Disclaimer: Any software source code reprinted in this document is furnished under a software license and may
only be used or copied in accordance with the terms of that license.
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute,
sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following
conditions:
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT
LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN
NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
44. Risk Factors
The above statements and any others in this document that refer to plans and expectations for the third quarter, the year and the future are forward-looking
statements that involve a number of risks and uncertainties. Words such as “anticipates,” “expects,” “intends,” “plans,” “believes,” “seeks,” “estimates,” “may,” “will,”
“should” and their variations identify forward-looking statements. Statements that refer to or are based on projections, uncertain events or assumptions also identify
forward-looking statements. Many factors could affect Intel’s actual results, and variances from Intel’s current expectations regarding such factors could cause actual
results to differ materially from those expressed in these forward-looking statements. Intel presently considers the following to be the important factors that could
cause actual results to differ materially from the company’s expectations. Demand could be different from Intel's expectations due to factors including changes in
business and economic conditions; customer acceptance of Intel’s and competitors’ products; supply constraints and other disruptions affecting customers; changes
in customer order patterns including order cancellations; and changes in the level of inventory at customers. Uncertainty in global economic and financial conditions
poses a risk that consumers and businesses may defer purchases in response to negative financial events, which could negatively affect product demand and other
related matters. Intel operates in intensely competitive industries that are characterized by a high percentage of costs that are fixed or difficult to reduce in the short
term and product demand that is highly variable and difficult to forecast. Revenue and the gross margin percentage are affected by the timing of Intel product
introductions and the demand for and market acceptance of Intel's products; actions taken by Intel's competitors, including product offerings and introductions,
marketing programs and pricing pressures and Intel’s response to such actions; and Intel’s ability to respond quickly to technological developments and to incorporate
new features into its products. The gross margin percentage could vary significantly from expectations based on capacity utilization; variations in inventory valuation,
including variations related to the timing of qualifying products for sale; changes in revenue levels; segment product mix; the timing and execution of the
manufacturing ramp and associated costs; start-up costs; excess or obsolete inventory; changes in unit costs; defects or disruptions in the supply of materials or
resources; product manufacturing quality/yields; and impairments of long-lived assets, including manufacturing, assembly/test and intangible assets. Intel's results
could be affected by adverse economic, social, political and physical/infrastructure conditions in countries where Intel, its customers or its suppliers operate, including
military conflict and other security risks, natural disasters, infrastructure disruptions, health concerns and fluctuations in currency exchange rates. Expenses,
particularly certain marketing and compensation expenses, as well as restructuring and asset impairment charges, vary depending on the level of demand for Intel's
products and the level of revenue and profits. Intel’s results could be affected by the timing of closing of acquisitions and divestitures. Intel's results could be affected
by adverse effects associated with product defects and errata (deviations from published specifications), and by litigation or regulatory matters involving intellectual
property, stockholder, consumer, antitrust, disclosure and other issues, such as the litigation and regulatory matters described in Intel's SEC reports. An unfavorable
ruling could include monetary damages or an injunction prohibiting Intel from manufacturing or selling one or more products, precluding particular business practices,
impacting Intel’s ability to design its products, or requiring other remedies such as compulsory licensing of intellectual property. A detailed discussion of these and
other factors that could affect Intel’s results is included in Intel’s SEC filings, including the company’s most recent reports on Form 10-Q, Form 10-K and earnings
release.
Rev. 7/17/13
45. We are sincerely eager to hear your feedback on this presentation and on re:Invent. Please fill out an evaluation form when you have a chance.