Big Data Infrastructure
- 1. Big Data Infrastructure.
Appliance, Cloud, or Do-it-Yourself.
Daniel Steiger
Discipline Manager Infrastructure Engineering
BASEL BERN BRUGG GENF LAUSANNE ZÜRICH DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. HAMBURG MÜNCHEN STUTTGART WIEN
2014 © Trivadis
Big Data Infrastructure
DOAG Jahreskonferenz 2014
- 2. Our Company
Trivadis is a leader in IT consulting, system integration, solution engineering, and the provision of IT services, focused on […] and […] technologies in the D-A-CH region. Our strategic business areas...
- 3. With more than 600 IT and business experts on site at your location
12 Trivadis branch offices with more than 600 employees
200 service level agreements
More than 4'000 training participants
Research and development budget: CHF 5.0 million / EUR 4.0 million
Financially independent and sustainably profitable
Experience from more than 1'900 projects per year with more than 800 customers
(As of 12/2013)
Locations: Hamburg, Düsseldorf, Frankfurt, Freiburg, München, Stuttgart, Wien, Basel, Bern, Brugg, Zürich, Lausanne
- 4. Agenda
1. Big Data Infrastructure Challenges
2. Hadoop on an Appliance
3. Hadoop in the Cloud
4. Hadoop Do-it-Yourself
5. Conclusion
- 5. Big Data Infrastructure Challenges
- 6. Trailwise – a "quantified self" use case
11'000 data points rendered in 165 ms
47'295 data points rendered in 643 ms
- 7. Trailwise – Infrastructure for a Proof of Concept
§ Hadoop HDFS as data store
§ HBase for real-time data access
§ Hadoop Map/Reduce
- 8. Trailwise – Infrastructure Lessons Learned
For a proof of concept, Hadoop in the cloud (e.g. on Amazon EC2) is perfect...
+ Fast and easy deployment
+ Optimized Hadoop/HBase setup
+ HBase real-time performance
+ Map/Reduce scalability
+ Affordable, approx. EUR 15 per day
Concerns…
§ Scalability
§ Costs for "always up"
§ Setup and administration of a large cluster on AWS
§ Break-even cloud vs on-premise
- 9. Big Data Infrastructure Challenges
§ Big Data means big data volume
  § Petabytes and exabytes
§ Scalability
  § 10, 20, 50, 100, ... cluster nodes
  § Costs should scale as well...
§ High demands on machine-to-machine networks
  § In Big Data, for every single client interaction there may be hundreds or thousands of server and data node interactions
  § This generates far more east-west (server-to-server or server-to-storage) network traffic than north-south (server-to-client or server-to-outside) network traffic
§ And many others, like integration, data protection, operation, etc.
- 10. Infrastructure Requirements
§ Infrastructure must be engineered to scale
§ The network has to provide high bandwidth and low latency, and should scale seamlessly with Hadoop clusters to provide predictable performance
§ And many more, like
  § Integration with operational data systems
  § Authentication, authorization, encryption
  § Centralized management
Will my infrastructure meet my needs now and in the future without putting my business at risk?
Figure 1.2: Picture of a row of servers in a Google WSC
- 11. Where to Deploy your Hadoop Cluster?
When enterprises adopt Hadoop, one of the decisions they must make is the deployment model. There are four options, as illustrated in Figure 1:
§ On-premise full custom. With this option, businesses purchase commodity hardware, then they install software and …
§ Hadoop appliance
§ Hadoop hosting
§ Hadoop-as-a-Service
Figure 1. The spectrum of Hadoop deployment options, ordered from bare-metal to cloud
There have existed two divergent views related to the price-performance ratio for Hadoop deployments. One view is that a virtualized Hadoop cluster is slower because Hadoop's workload has intensive I/O operations, which tend to run slowly on virtualized environments. …
A related and fourth area is data enrichment, which involves leveraging multiple datasets to uncover new insights. For example, combining a consumer's purchase history and social-networking activities can yield a deeper understanding of the consumer's lifestyle and key personal …
Reference: Hadoop Deployment Comparison Study, Price-Performance Comparison, Accenture Technology Labs, 2013
- 12. Hadoop on an Appliance
Oracle Big Data Appliance
- 13. Overview: Oracle's Big Data Solution
§ A complete and optimized solution for big data
§ Tight integration with Exadata, Exalogic, Exalytics and SPARC SuperCluster using the InfiniBand network
§ Single-vendor support for both hardware and software
- 14. Oracle Big Data Appliance X4-2 HW
Full Rack Configuration (up to 18 racks)
§ 18 x compute/storage nodes
Per Node:
§ 2 x Eight-Core Intel® Xeon® E5-2650 V2 Processors
§ 64 GB Memory (up to 512 GB)
§ 48 TB Raw Storage Capacity
§ 40 Gb/sec InfiniBand Network
§ 10 Gb/sec Data Center Connectivity
Source: Oracle®
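The per-node figures above can be rolled up to a full rack. The following is a minimal sketch, not part of the original slides: the names `Node`, `rack_totals` and `usable_hdfs_tb` are hypothetical, and the usable-capacity estimate assumes the HDFS default replication factor of 3 while ignoring operating system, temporary and non-HDFS space.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Per-node resources as listed on the slide."""
    cores: int
    memory_gb: int
    raw_storage_tb: int


def rack_totals(node: Node, nodes_per_rack: int = 18) -> dict:
    """Multiply the per-node resources by the number of nodes in a full rack."""
    return {
        "cores": node.cores * nodes_per_rack,
        "memory_gb": node.memory_gb * nodes_per_rack,
        "raw_storage_tb": node.raw_storage_tb * nodes_per_rack,
    }


def usable_hdfs_tb(raw_tb: float, replication: int = 3) -> float:
    """Rough usable capacity; assumes every block is stored `replication` times."""
    return raw_tb / replication


# 2 x eight-core CPUs, 64 GB memory, 48 TB raw storage per node
x4_2 = Node(cores=2 * 8, memory_gb=64, raw_storage_tb=48)
totals = rack_totals(x4_2)
print(totals)
print(usable_hdfs_tb(totals["raw_storage_tb"]), "TB usable at replication 3")
```

Under these assumptions a full rack offers 288 cores, 1'152 GB of memory and 864 TB of raw storage, of which roughly a third is usable HDFS capacity.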
- 15. Oracle Big Data Appliance Internal Network Connectivity
Source: Oracle Big Data Appliance: Datacenter Network Integration, Oracle White Paper, 2012
- 16. Big Data Appliance Software Stack
§ Oracle Linux 6.4 with UEK
§ Oracle Java JDK 7
§ Cloudera Enterprise Data Hub Edition
  § Apache Hadoop HDFS
  § HBase
  § Cloudera Impala
  § Cloudera Search
  § Cloudera Manager
  § Apache Spark
§ Oracle R Distribution
§ Oracle NoSQL DB Community Ed.
§ BDA Enterprise Manager Plug-In
§ Optional Software*
  § Oracle Big Data SQL
  § Oracle Big Data Connectors
  § Oracle Audit Vault and Database Firewall for Hadoop Auditing
  § Oracle Data Integrator
  § Oracle NoSQL Database EE
*Connectors are licensed separately from Oracle Big Data Appliance
- 17. BDA Specific Software Features
§ Oracle R Support for Big Data
  § R is an open-source language and environment for statistical analysis and graphing
  § The standard R distribution is installed on all nodes of Oracle Big Data Appliance
  § Oracle R Connector for Hadoop provides R users with high-performance, native access to HDFS and the MapReduce programming framework
  § Oracle R Enterprise is a separate package that provides real-time access to Oracle Database
§ Oracle NoSQL Database
  § Oracle NoSQL Database is a distributed key-value database built on the storage technology of Berkeley DB Java Edition
  § An intelligent driver on top of Berkeley DB keeps track of the underlying storage topology, shards the data and knows where data can be placed with the lowest latency
- 18. Oracle Big Data Connectors
§ Oracle SQL Connector for HDFS
§ Oracle Loader for Hadoop
§ Oracle R Connector for Hadoop
§ Oracle Data Integrator Application Adapter for Hadoop
§ Data in HDFS (and NoSQL) is accessible through the relational database external table mechanism (HDFS as cluster file system)
Reference: Oracle Big Data Connectors Data Sheet
Source: Oracle®
- 19. Oracle Big Data SQL: one tool for all data sources
Reference: https://www.oracle.com/webfolder/s/delivery_production/docs/FY15h1/doc6/1-T2-BigData.pdf
- 20. Oracle Big Data Appliance Resources
§ Oracle Big Data Lite VM
  § http://www.oracle.com/technetwork/database/bigdata-appliance/oracle-bigdatalite-2104726.html
§ MOS Notes
  § Information Center: Oracle Big Data Appliance (Doc ID 1445762.2)
  § Big Data Connectors (ID 1487399.2)
  § Sqoop Frequently Asked Questions (FAQ) (Doc ID 1510470.1)
- 21. Hadoop in the Cloud
- 22. Hadoop in the Cloud
- 23. Deployment Considerations
There are five key areas to consider when choosing the right deployment model*:
§ Price-performance ratio
§ Data privacy
§ Data gravity
§ Data enrichment
§ Productivity of developers and data scientists
*Public cloud, private cloud, community cloud, or hybrid cloud
The second area of consideration is data privacy, which is a common concern when storing data outside of corporate-owned infrastructure. Cloud-based deployment requires a comprehensive cloud-data privacy strategy that encompasses areas such as proper implementation of legal requirements, well-orchestrated …
… and therefore enable companies to introduce new services and products of interest. The primary challenge is that the storage of these multiple datasets increases the volume of data, resulting in slow connectivity. Therefore, many organizations choose to co-locate these datasets. Given volume and portability …
For the experiment, we first built the total cost of ownership (TCO) model to control two environments at the matched cost level. Then, using Accenture Data Platform Benchmark as real-world workloads, we compared the performance of both a bare-metal Hadoop cluster and Amazon …
Reference: Where to Deploy your Hadoop Cluster?, Executive Summary, Accenture Technology Labs, 2013
- 24. Amazon EMR with the MapR Distribution for Hadoop
EC2 Instance for Hadoop/MapReduce
Storage optimized – current generation
§ Instance hs1.8xlarge
  § 16 vCPUs (Intel Xeon)
  § 117 GB RAM
  § 24 x 2000 GB = 48 TB
  § 10 Gigabit network
§ MapR as option
  § M3, M5 or M7 edition
Reference: http://aws.amazon.com/elasticmapreduce/mapr/
- 25. Amazon EMR with the MapR Distribution for Hadoop
Costs for hs1.8xlarge Instance
§ Medium Utilization Reserved Instances
  § 1-Year term: upfront $9'200, $1.809 per hour
  § 3-Year term: upfront $14'109, $1.581 per hour
§ Data Transfer IN to Amazon EC2 from internet: $0.00 per GB
§ Data Transfer OUT from Amazon EC2 to internet: $0.12 per GB up to 10 TB/month ($120 per TB)
§ MapR M7: $1.49 per hour
§ Total: $2'600/month, $31'200/year (24/365 utilization)
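The price list above can be turned into an annual figure per instance. This is a rough sketch under stated assumptions, not the author's calculation: the upfront fee is spread evenly over the reservation term, the instance runs every hour of the year, and the helper names `annual_cost` and `transfer_out_cost` are hypothetical. The slide does not say which term its total is based on.

```python
HOURS_PER_YEAR = 24 * 365  # continuous ("24/365") utilization


def annual_cost(upfront: float, term_years: int,
                instance_per_hour: float, mapr_per_hour: float = 1.49) -> float:
    """Yearly cost of one instance: amortized upfront fee plus hourly charges."""
    hourly_total = (instance_per_hour + mapr_per_hour) * HOURS_PER_YEAR
    return upfront / term_years + hourly_total


def transfer_out_cost(gigabytes: float, per_gb: float = 0.12) -> float:
    """Outbound transfer cost in the first pricing tier (up to 10 TB/month)."""
    return gigabytes * per_gb


one_year = annual_cost(upfront=9_200, term_years=1, instance_per_hour=1.809)
three_year = annual_cost(upfront=14_109, term_years=3, instance_per_hour=1.581)

print(f"1-year term: ${one_year:,.2f} per year")
print(f"3-year term: ${three_year:,.2f} per year, ${three_year / 12:,.2f} per month")
print(f"1 TB outbound: ${transfer_out_cost(1000):,.2f}")
```

With these assumptions the 3-year term comes to roughly $31'600 per year, which is in the same range as the slide's total of about $31'200, while the 1-year term is noticeably higher at roughly $38'100.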
- 26. Hadoop on Do-It-Yourself Infrastructure
- 27. Do-it-Yourself (experimental setup)
Source: http://blog.ittoby.com/
- 28. Do-it-Yourself (enterprise class setup)
HP ProLiant DL380p Gen8
§ 2 x Eight-Core Intel® Xeon® E5-2650 V2
§ 64 GB Memory (up to 512 GB)
§ 48 TB Raw Storage Capacity
§ 40 Gb/sec InfiniBand Network
§ 10 Gb/sec Data Center Connectivity
§ About $20'000 + Rack + Network + Work
§ Cloudera Enterprise Data Hub Edition 5.x
§ approx. $2'500/node + support
HP ProLiant DL380e Gen8
The HP ProLiant DL380e Gen8 (2U) is an excellent choice as the server platform for the worker nodes.
Figure 6. HP ProLiant DL380e Gen8 Server
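The per-node prices above can be scaled to a whole cluster. A minimal sketch, assuming an 18-node cluster (chosen here to match the full-rack appliance, not stated on this slide) and leaving out rack, network, work and support costs; the function name `diy_cluster_cost` is hypothetical.

```python
def diy_cluster_cost(nodes: int,
                     hardware_per_node: int = 20_000,
                     software_per_node: int = 2_500) -> int:
    """Purchase cost of a DIY cluster: server plus Hadoop distribution per node."""
    return nodes * (hardware_per_node + software_per_node)


# Assumption: 18 nodes, the same node count as a full appliance rack
print(diy_cluster_cost(18))
```

For 18 nodes this gives 405'000, the same figure listed as the DIY price in the cost comparison at the end of the deck.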
- 29. Conclusion
- 30. Appliance, Cloud or DIY?
Oracle BDA
+ High-performance, scalable network architecture
+ Highly integrated into the Oracle ecosystem
+ Complete Oracle and Hadoop software stack
+ Single point of support
+ Competitive price/performance ratio for enterprise-class demands
Amazon EC2 Instances
+ Fast and easy deployment
+ Scales from very small to very large cluster setups
+ Capacity on demand on an hourly basis
+ Optional enterprise-class Hadoop distribution
+ Interesting price model for volatile utilisation and capacity on demand
Do it Yourself
+ Low entry point
+ Free choice of hardware
+ Free choice of software stack
Servers running the node processes should have sufficient memory for either HBase or for the amount of Map/Reduce configured on the server. A server with a larger RAM configuration will deliver optimum performance for both HBase and Map/Reduce. To ensure optimal memory performance and bandwidth, we recommend using 8GB or 16GB DIMMs to populate each of the 6 memory channels as needed.
Network configuration
The DL380e includes four 1GbE NICs onboard. MapR automatically identifies the available NICs on the server and bonds them via the MapR software to increase throughput.
MapR Benefit
Each of the reference architecture configurations below specifies an additional Top of Rack Switch for redundancy. To make use of this, we recommend cabling the ProLiant DL380e Worker Nodes so that NIC 1 is cabled to Switch 1 and NIC 2 is cabled to Switch 2, repeating the same process for NICs 3 and 4. Each NIC in the server should have its own IP subnet instead of sharing the same subnet with other NICs.
HP ProLiant DL380e Gen8
The HP ProLiant DL380e Gen8 (2U) is an excellent choice as the server platform for the worker nodes.
Figure 6. HP ProLiant DL380e Gen8 Server
- 31. Conclusion
§ Building an enterprise-class Hadoop infrastructure is a challenge
§ Analysing and prioritizing your requirements (business and IT) is crucial
§ Start "small, fast" with a proof of concept
§ Consider various deployment models (on-premise, appliance, IaaS, PaaS, HaaS, ...)
§ The Oracle Big Data Appliance is a very competitive offering, especially as an extension to your existing Oracle operational data systems
- 32. Thank you.
Daniel Steiger
Discipline Manager Infrastructure Engineering
Tel: +41 58 459 50 88
daniel.steiger@trivadis.com
- 33. Trivadis at DOAG
Level 3, right next to the escalator
We look forward to your visit.
Because with Trivadis you always win.
- 34. Cost comparison

| Attribute          | Oracle BDA          | Amazon EMR  | DIY                 |
|--------------------|---------------------|-------------|---------------------|
| Type               | X4-2                | hs1.8xlarge | DL-380              |
| CPU                | 2x8-Core            | 16 vCPU     | 2x8-Core            |
| RAM                | 64 GB               | 117 GB      | 64 GB               |
| Storage            | 48 TB               | 48 TB       | 8 TB                |
| Network            | 10 Gb/s / 40 Gb/s   | 10 Gb/s     | 10 Gb/s / 40 Gb/s   |
| Hadoop Distr.      | Cloudera            | MapR        | Cloudera            |
| Price / year       | 525'000             | 562'256     | 405'000             |
| Maintenance / year | 63'000              | -           | 40'000              |
| Total 1st year     | 588'000             | 562'256     | 445'000             |
| Total 3 years      | 714'000             | 1'686'768   | 525'000             |
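The totals in the cost comparison can be reproduced with a short calculation. This is a sketch under one interpretation, not a statement of the author's method: the price for Oracle BDA and DIY is treated as a one-time purchase with yearly maintenance, while the Amazon EMR price recurs every year. The names `offers`, `total_cost` and `break_even_years` are hypothetical, and the break-even estimate assumes costs accrue linearly over time.

```python
# Figures from the cost comparison; the currency is not stated on the slide.
offers = {
    "Oracle BDA": {"one_time": 525_000, "per_year": 63_000},
    "Amazon EMR": {"one_time": 0, "per_year": 562_256},
    "DIY": {"one_time": 405_000, "per_year": 40_000},
}


def total_cost(offer: dict, years: int) -> int:
    """Cumulative cost after a number of years: purchase plus recurring charges."""
    return offer["one_time"] + years * offer["per_year"]


def break_even_years(cloud: dict, owned: dict) -> float:
    """Time after which the owned option becomes cheaper than the cloud option."""
    return (owned["one_time"] - cloud["one_time"]) / (cloud["per_year"] - owned["per_year"])


for name, offer in offers.items():
    print(name, total_cost(offer, 1), total_cost(offer, 3))

for name in ("Oracle BDA", "DIY"):
    years = break_even_years(offers["Amazon EMR"], offers[name])
    print(f"{name} vs Amazon EMR: break-even after about {years:.2f} years")
```

Under this reading the computed one-year and three-year totals agree with the last two rows of the comparison, and continuous cloud usage is overtaken by both owned options within roughly a year.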