Deutsche Telekom and T-Systems are large European telecommunications companies. Deutsche Telekom has revenue of $75 billion and over 230,000 employees, while T-Systems has revenue of $13 billion and over 52,000 employees providing data center, networking, and systems integration services. Hadoop is an open source platform that provides more cost effective storage, processing, and analysis of large amounts of structured and unstructured data compared to traditional data warehouse solutions. Hadoop can help companies gain value from all their data by allowing them to ask bigger questions.
Deutsche Telekom Perspective on HADOOP and Big Data Technologies
1. Deutsche Telekom Perspective on HADOOP and Big Data Technologies
Gregory Smith
VP Solution Design and Emerging Technologies and Architectures
T-Systems North America
Gregory.Smith@t-systems.com
2. Deutsche Telekom and T-Systems Key Stats
Deutsche Telekom is Europe’s largest telecom service provider
– Revenue: $75 billion
– Employees: 232,342
T-Systems is the enterprise division of Deutsche Telekom
– Revenue: $13 billion
– Employees: 52,742
– Services: data center, end user computing, networking, systems integration, cloud and big data
3. Overwhelmed by new data types?
Big Data = Transactions, Interactions, Observations
– Sentiment data
– Call detail records (CDRs)
– Sensor- / machine-based data
– Clickstream data
4. 80% of new data in 2015 will land on Hadoop!
Hadoop is like a data warehouse, but it can store more data and more kinds of data, and perform more flexible analyses.
Hadoop is open source and runs on industry-standard hardware, so it is 1-2 orders of magnitude more economical than conventional data warehouse solutions.
Hadoop provides more cost-effective storage, processing, and analysis; some existing workloads run faster, cheaper, and better.
Hadoop can deliver a foundation for profitable growth: gain value from all your data by asking bigger questions.
5. Reference architecture view of Hadoop
– Presentation: data visualization and reporting; clients
– Application: analytics apps and transactional apps; analytics middleware
– Data Processing: batch processing; real-time/stream processing; search and indexing
– Data Management (Hadoop Core): distributed processing (MapReduce); distributed storage (HDFS); non-relational DB; structured in-memory
– Data Integration: real-time ingestion; batch ingestion; data connectors; metadata services
– Infrastructure: virtualization; compute / storage / network
– Security (cross-cutting): data isolation; access management; data encryption
– Operations (cross-cutting): workflow and scheduling; management and monitoring
The layers span Hadoop Core, Hadoop Projects, and adjacent categories.
6. Example application landscape
– ETL (Informatica, Talend, Spring Integration)
– Real-time streams (social, sensors)
– Structured and unstructured data (HDFS, MapR)
– Real-time database (Shark, GemFire, HBase, Cassandra)
– Interactive analytics (Impala, Greenplum, Aster Data, Netezza, …)
– Batch processing (MapReduce) with HIVE
– Real-time processing (S4, Storm, Spark)
– Machine learning (Mahout, etc.)
– Data visualization (Excel, Tableau)
– Cloud infrastructure: compute, storage, networking
Source: VMware
7. Disruptive innovations in Big Data
Hadoop, NoSQL databases, and MPP analytics disrupt the traditional database / data warehouse:
– Schema: traditional is pre-defined and fixed, required on write; disruptive is required on read ("store first, ask questions later")
– Processing: traditional has no or limited data processing; disruptive couples processing with data and scales out via parallel processing
– Data types: traditional handles structured data; disruptive handles any, including unstructured
– Physical infrastructure: traditional requires enterprise-grade, mission-critical gear; disruptive can run on commodity hardware, with much cheaper storage
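The "required on write" vs. "required on read" distinction above can be sketched in a few lines of Python; the record formats and field names here are invented purely for illustration:

```python
import json

# Landing zone: accept raw records as-is; no schema is enforced on write.
raw_store = []

def land(record_json):
    """Write path: store anything ("store first")."""
    raw_store.append(record_json)

def read_as(schema):
    """Read path: project a schema onto the raw data at query time
    ("ask questions later"). Missing fields come back as None."""
    for line in raw_store:
        rec = json.loads(line)
        yield {field: rec.get(field) for field in schema}

# Ingest heterogeneous records: a call detail record and a clickstream event.
land('{"caller": "491701234567", "duration_s": 42}')
land('{"url": "/home", "user": "alice"}')

# Ask a question later: view everything through a call-record schema.
calls = list(read_as(["caller", "duration_s"]))
```

A schema-on-write store would have rejected the clickstream event at ingest time; here it is retained and simply surfaces empty fields under the call-record view.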
8. Legacy BI vs. High Performance BI vs. the "Hadoop" ecosystem
Legacy BI
– Business problem: backward-looking analysis using data out of business applications
– Technology solution / selected vendors: SAP BusinessObjects, IBM Cognos, MicroStrategy
– Data type / scalability: structured; limited (2-3 TB in RAM)
High Performance BI
– Business problem: quasi-real-time analysis using data out of business applications
– Technology solution / selected vendors: Oracle Exadata, SAP HANA
– Data type / scalability: structured; limited (2-8 TB in RAM)
"Hadoop" ecosystem ("true" big data; the first two columns reflect the legacy vendor definition of big data)
– Business problem: forward-looking predictive analysis; questions defined in the moment, using data from many sources
– Technology solution: Hadoop distributions; no ACID transactions; limited SQL set (joins)
– Data type / scalability: structured or unstructured; unlimited (20-30 PB)
Innovations: Hadoop is 100x cheaper per TB than in-memory appliances like HANA, and handles unstructured data as well.
9. Innovations: store first, ask questions later
Illustrative acquisition cost per GB (much cheaper storage, but not just storage…):
– SAN storage: 3-5 €/GB; based on HDS SAN storage
– NAS filers: 1-3 €/GB; based on NetApp FAS series
– White-box DAS 1): 0.50-1.00 €/GB; hardware can be self-assembled
– Data cloud 1): 0.10-0.30 €/GB; based on large-scale object storage interfaces
– Enterprise-class Hadoop storage: ??? €/GB; based on NetApp E-Series (NOSH)
1) Hadoop offers storage + compute (incl. search). Data Cloud offers Amazon S3 and native storage functions.
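Using the illustrative per-GB figures quoted above (taking the midpoint of each range), a back-of-the-envelope comparison can be sketched; the tier names and the 100 TB workload are assumptions for the example:

```python
# Midpoints of the illustrative acquisition-cost ranges quoted on the slide.
cost_eur_per_gb = {
    "SAN storage": 4.00,     # 3-5 EUR/GB
    "NAS filers": 2.00,      # 1-3 EUR/GB
    "white-box DAS": 0.75,   # 0.50-1.00 EUR/GB
    "data cloud": 0.20,      # 0.10-0.30 EUR/GB
}

def acquisition_cost(tb, tier):
    """Acquisition cost in EUR for `tb` terabytes on the given storage tier."""
    return tb * 1000 * cost_eur_per_gb[tier]

# Storing 100 TB: SAN vs. the commodity DAS typically used under Hadoop.
san_cost = acquisition_cost(100, "SAN storage")    # 400,000 EUR
das_cost = acquisition_cost(100, "white-box DAS")  # 75,000 EUR
savings_factor = san_cost / das_cost
```

These are acquisition costs only, as the slide notes; Hadoop's DAS tier also bundles compute alongside the storage.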
10. Target use cases
Use cases range from shorter to longer time to value and from lower to higher potential value, moving from cost-effective storage, processing, and analysis toward a foundation for profitable growth:
– Lower-cost storage
– Enterprise data warehouse offload
– Enterprise data warehouse archive
– ETL offload
– Capacity planning & utilization
– Enterprise data lake
– Customer profiling & revenue analytics
– Targeted advertising analytics
– Service renewal implementation
– CDR-based data analytics
– Fraud management
– New business models
Stakeholders, roughly in that order: IT Infrastructure & Operations; Business Intelligence & Data Warehousing; Line of Business & Business Analysts; CXO.
11. Enterprise data warehouse offload use case
The Challenge
– Many EDWs are at capacity
– Running out of budget before running out of relevant data
– Older data archived "in the dark", not available for exploration
The Solution
– Hadoop for data storage and processing: parse, cleanse, apply structure and transform
– Free the EDW for valuable queries
– Retain all data for analysis!
Before: the data warehouse spends its capacity on operational workloads (44%), ETL processing (42%), and analytics (11%).
After: Hadoop takes over storage & processing at 1/10th the cost, and the data warehouse splits between operational (50%) and analytics (50%).
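The "parse, cleanse, apply structure" step can be sketched as a tiny batch job; the pipe-delimited CDR layout and field names below are hypothetical stand-ins for a real landing-zone format:

```python
# Hypothetical raw landing-zone file: timestamp|caller|callee|duration_seconds
RAW_CDRS = """\
2013-05-01 10:00:01|491701234567|491809876543|42
2013-05-01 10:00:05|BADLINE
2013-05-01 10:01:12|491702223344|491808887766|305
"""

def parse_cdrs(raw):
    """Parse pipe-delimited CDRs: drop malformed lines (cleanse) and
    apply structure (timestamp, caller, callee, duration in seconds)."""
    rows = []
    for line in raw.splitlines():
        parts = line.split("|")
        if len(parts) != 4 or not parts[3].isdigit():
            continue  # cleanse: skip records that do not match the layout
        ts, caller, callee, duration = parts
        rows.append({"ts": ts, "caller": caller,
                     "callee": callee, "duration_s": int(duration)})
    return rows

cdrs = parse_cdrs(RAW_CDRS)
total_call_seconds = sum(r["duration_s"] for r in cdrs)
```

In production this transformation would run as a distributed job over the full raw data set, so that only the structured result needs to reach the EDW.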
12. From data puddles and ponds to lakes and oceans
GOAL: a platform that natively supports mixed workloads as a shared service.
AVOID: systems separated by workload type due to contention (per-business-unit big data silos for BU1, BU2, BU3).
Big Data (transactions, interactions, observations) supports refine, explore, and enrich workloads across batch, interactive, and online access.
13. Questions to ask in designing a solution for a particular business use case
– Which distribution is right for your needs today vs. tomorrow?
– Which distribution will ensure you stay on the main path of open source innovation, vs. trap you in proprietary forks?
Note: distributions include more than just the Data Management layer, but are discussed at this point in the presentation. Not shown: Intel, Fujitsu and other distributions.
– Widely adopted, mature distribution; GTM partners include Oracle, HP, Dell, IBM
– Fully open source distribution (incl. management tools); reputation for cost-effective licensing; strong developer ecosystem momentum; GTM partners include Microsoft, Teradata, Informatica, Talend
– More proprietary distribution with features that appeal to some business-critical use cases; GTM partner AWS (M3 and M5 versions only)
– Just announced by EMC, very early stage; differentiator is HAWQ, which claims manifold query speed improvement and a full SQL instruction set
14. Common objections to Hadoop
– We don’t have big data problems
– We don’t have petabytes of data
– We can’t justify the budget for a new project
– We don’t have the skills
– We’re not sure Hadoop is mature / secure / enterprise-ready
– We already have a scale-out strategy for our EDW/ETL
15. MYTH: Big Data means “Petabytes”
– Not just volume: remember variety and velocity
– Plenty of issues at smaller scales: data processing, unstructured data
– Often warehouse volumes are small because the technology is expensive, not because there is no relevant data
– Scalability is about growing with the business, affordably and predictably
Every organization has data problems! Hadoop can help…
MYTH: Big Data means Data Science
– Hadoop solves existing problems faster, better, cheaper than conventional technology, e.g.:
– Landing zone: capturing and refining multi-structured data types with unknown future value
– Cost-effective platform for retaining lots of data for long periods of time
– Walk before you run
Big Data Is a State of Mind
16. Waves of adoption: crossing the chasm
Wave 1: Batch Orientation
– Adoption today*: mainstream, 70% of organizations
– Example use cases: refine (archival and transformation)
– Response time: hour(s)
– Data characteristic: volume
– Architectural characteristic: EDW / RDBMS talk to Hadoop
– Example technologies: MapReduce, Pig, Hive
Wave 2: Interactive Orientation
– Adoption today*: early adopters, 20% of organizations
– Example use cases: explore (query and visualization)
– Response time: minutes
– Architectural characteristic: analytic apps talk directly to Hadoop
– Example technologies: ODBC/JDBC, Hive
Wave 3: Real-Time Orientation
– Adoption today*: bleeding edge, 10% of organizations
– Example use cases: enrich (real-time decisions)
– Response time: seconds
– Data characteristic: velocity
– Architectural characteristic: derived data also stored in Hadoop
– Example technologies: HBase, NoSQL, SQL
* Among organizations using Hadoop
17. Hadoop in a nutshell
The Hadoop open source ecosystem delivers powerful innovation
in storage, databases and business intelligence, promising
unprecedented price / performance compared to existing
technologies.
Hadoop is becoming an enterprise-wide landing zone for large
amounts of data. Increasingly it is also used to transform data.
Large enterprises have realized substantial cost reductions by
offloading some enterprise data warehouse, ETL and archiving
workloads to a Hadoop cluster.
18. Challenges in the Enterprise
– Use-case identification and cost justification
– Cooperation and coordination from independent business units
– As Hadoop increases its footprint in business-critical areas, the business will demand mature enterprise capabilities, e.g. DR, snapshots, etc.
– Hadoop’s disruptive approach is challenging entrenched legacy EDW people, processes and technologies
– Data harmonization is often a significant challenge
– Fear of forking (think UNIX)
– Proprietary absorption (getting “Borged”)
– Audience: Hadoop addresses business problems, not IT problems
– Fear of data complexity (“I hated statistics class!”)
Big Data = Transactions + Interactions + Observations
Transactions are pretty simple to understand. This is our ERP data: the data that we maintain and track in our OLTP systems. It can be any record of a system-to-system or human-to-system interaction, or even a human-to-human interaction as long as it is captured electronically. We use a lot of this data in our analytics today.
Interactions are the points in time at which we relate with a system. It could be a tweet or a Facebook post, an electronic or paper customer satisfaction survey, web logs, or A/B tests. We have a lot of this data but typically no efficient way to understand or extract value from it.
Observations are interesting because they represent a world of net-new data sources that we once never thought of analyzing. This is data that was once considered low-to-medium value, or even exhaust data that was too bulky and just too expensive to store: machine-generated data from sensors, web logs and clickstreams, audio/video, or largely unstructured content.
Layers: Presentation, Application, Data Processing, Infrastructure, Data Ingestion, Security, Management & Monitoring.
– Ambari: Apache Ambari is a monitoring, administration and lifecycle management project for Apache Hadoop clusters. Hadoop clusters require many inter-related components that must be installed, configured, and managed across the entire cluster.
– ZooKeeper: ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. ZooKeeper is used heavily by many distributed applications such as HBase.
– HBase: HBase is the distributed Hadoop database, scalable and able to collect and store big data volumes on HDFS. This class of database is often categorized as NoSQL (Not only SQL).
– Pig: Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs.
– Hive: Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. Hive provides a mechanism to project structure onto this data and query it using a SQL-like language called HiveQL, while also allowing traditional map/reduce programmers to plug in custom mappers and reducers when it is inconvenient or inefficient to express the logic in HiveQL.
– HCatalog: Apache HCatalog is a table and storage management service for data created using Apache Hadoop; it provides deep integration with enterprise data warehouses (e.g. Teradata) and with data integration tools such as Talend.
– MapReduce: Hadoop MapReduce is a programming model and software framework for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes.
– HDFS: the Hadoop Distributed File System is the primary storage system used by Hadoop applications. HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid parallel computations.
– Talend Open Studio for Big Data: a 100% open source, graphical code generator for extract-transform-load (ETL) and extract-load-transform (ELT) data movement and cleansing in and out of Hadoop.
– Data Integration Services: HDP integrates Talend Open Studio for Big Data, the leading open source data integration platform for Apache Hadoop. Included is a visual development environment and hundreds of pre-built connectors to leading applications that let you connect to any data source without writing code.
– Centralized Metadata Services: HDP includes HCatalog, a metadata and table management system that simplifies data sharing both between Hadoop applications running on the platform and between Hadoop and other enterprise data systems. HDP’s open metadata infrastructure also enables deep integration with third-party tools.
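The MapReduce programming model (map emits key/value pairs, the framework shuffles them by key, reduce aggregates each group) can be mimicked in-process. This is a single-machine Python sketch of the classic word-count job, not how Hadoop actually executes work across a cluster:

```python
from collections import defaultdict

def map_fn(line):
    """Map: emit a (word, 1) pair for every word in an input line."""
    for word in line.lower().split():
        yield word, 1

def reduce_fn(word, counts):
    """Reduce: aggregate all counts emitted for one word."""
    return word, sum(counts)

def map_reduce(lines):
    """Drive map -> shuffle (group by key) -> reduce, as the framework
    would, but sequentially on one machine instead of across a cluster."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in groups.items())

counts = map_reduce(["big data big questions", "big data big value"])
# counts["big"] == 4, counts["data"] == 2
```

In Hadoop proper, the map and reduce functions run on many compute nodes near the HDFS blocks they process; only the shuffle moves data between nodes.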
– Line of Business: demand a 360-degree view of customer, employee, market, etc., but cannot be certain about what matters for analysis.
– Business Analysts: need to incorporate more data into analysis while LOBs are not sure what matters; want to reuse existing skill sets.
– Data Warehouse Owners: must efficiently store, process, organize, and deliver massive and growing data volume and variety while meeting SLAs.
– IT Management: drive innovation, reduce costs, meet growing analytic demands of LOBs, mitigate risk of adopting new technology.
– System Administrators: ensure stability and reliability of systems.
Buyers: VP Analytics; VP/Director Business Intelligence; VP/Director Data Warehousing/Management; VP/Director Infrastructure; VP/Director Operations/IT Systems.
Value drivers: faster customer acquisition, better product development, better quality, lower churn.