Evolution of Big Data at Intel - Crawl, Walk and Run Approach
Gomathy Bala | Director
Chandhu Yalla | Manager & Architect
Key Contributors: Sonja Sandeen, Seshu Edala, Nghia Ngo and Darin Watson
IT BI Big Data Team
Copyright © 2014, Intel Corporation. All rights reserved.
Legal Notices
This presentation is for informational purposes only. INTEL MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY.
The content in this presentation is being shared under NDA.
Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.
* Other names and brands may be claimed as the property of others.
Agenda
• Intel IT Big Data Journey
• Enterprise DW architecture
• BI Big Data 3 yr Roadmap
• Big Data Ecosystem Architecture
• Platform Strategies & BKMs
• Summary
Intel IT Big Data Journey
Timeline: 2011 | 2012 | 2013 | 2014 | 2015
Milestones (labels recovered from the timeline graphic):
• Big Data & Analytics Strategy
• Production Online
• Telmap: 1st Use Case
• Preproduction Online
• Hadoop Evaluation
• IDH to CDH
• Hadoop 2.0
• $176M BV
• Production: Security BI, Attribute Reduction System, ATM Ellipses Engine, IAH-Retail Analytics
• 6 Environments
• CDH 5.3
• 4 Use Cases in Preproduction
• 12 POC Use Cases
• 6 Use Cases in Production
• $290K investment
• $948/TB
• 3 Use Cases in Production: Smart-What, Marketing-IAH, Incident Predictability
• $6M BV
• CDH 5.1
• IAH – Cloud CRM In Production
• Enterprise Standards, Guidance, Processes for Platform & Capabilities
15 Active Use Cases | $290K + 10.5 HC Investment | Delivered $182M BV
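As a quick sanity check on the summary line above, the two business-value figures on the timeline ($6M and $176M) add up to the $182M total. The sketch below reproduces that sum in Python. The implied-capacity figure rests on an assumption the slide does not state: that $948/TB is the $290K investment divided by the capacity purchased. All variable names are illustrative.

```python
# Figures copied from the journey slide; dictionary keys are illustrative labels only.
business_value_musd = {"first BV figure": 6, "second BV figure": 176}
total_bv_musd = sum(business_value_musd.values())

investment_usd = 290_000
cost_per_tb_usd = 948

# ASSUMPTION (not stated on the slide): cost per TB = investment / capacity purchased.
implied_capacity_tb = investment_usd / cost_per_tb_usd

print(f"Total business value: ${total_bv_musd}M")
print(f"Implied capacity under the stated assumption: {implied_capacity_tb:.0f} TB")
```

If the assumption holds, the investment would correspond to roughly 306 TB; the slide itself only gives the two inputs.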
Big Data & Analytics Really Delivers!
From the 2014-2015 Intel IT Business Review, Annual Edition
Kim's Video
2014-2017 Vision: Real-Time Enterprise
[Architecture diagram. Recoverable labels:]
• Any Data Source
• Systems and applications: ERP, CRM, SCM, SRM, ECC, BW, ECCW, MDG, NW, Other Apps, Custom, Intel, …
• In Memory Real-Time Data Platform
• Real-Time & Self Service Analytics Platform
• Teradata; Cloudera Hadoop Data Lake
• Enterprise Data Warehouse
• Data Tiering: Hot-Cold data
• Reporting Tools
• Other labels: NRT, Predictive Analytics, BPC, BCS, Cloud BI, SaaS, New Apps.
• Downstream Applications
Enterprise Data Architecture with Hadoop and Other MPP DWH
Current & Future Strategy
[Architecture diagram. Recoverable labels: FE Tools, CLS/Proxy, High speed data loader, SQL on Hadoop, EDW, Mfg Data, IMT, Future, Present, "A percentage of traditional BI use cases"]
BigData
• Machine Learning
• Log Processing
• Unstructured data
Use Cases
• High volume counter Analytics
• Text Parsing/Mining
• Strategic/Operational reporting
• Interactive Reporting
Use Cases
• High Concurrent user analytics - Supply/Order
• Mission critical analytics - Finance/HR
BI Big Data | 3-Year Roadmap
• 2015: Big Data + AA
• 2016: Big Data + SSAA + Traditional BI
• 2017: Big Data + SSAA + Traditional BI
Scalable and well designed Hadoop Platform

• Evolve IMT + Hadoop
• Data Lineage & Data Catalog
• Streaming Capabilities
• Advanced SQL on Hadoop
• ACID semantics

• Evolve Big Data + SSAA per ecosystem roadmaps
• BC/DR
• End to end enterprise features
• Enterprise ready: OLAP and Traditional DW

Hadoop is an open source framework designed for big data analytics. Hadoop is evolving rapidly, but it will still take a couple of years for it to mature and support “traditional BI” use cases.
Legend
Orange Text: Traditional BI Capabilities
Green Text: Big Data/AA Capabilities

• Security (RBAC, ITS/IRS)
• Data Governance
• Data Discovery
• Self Service AA Framework
• IMT + Hadoop
• AVP + Hadoop
• In-memory + Near real time capabilities
• SQL on Hadoop
Big Data Platform – Ecosystem Architecture & Maturity
[Layered architecture diagram. Recoverable labels:]
• User layer: End User, Data Steward, Business Analyst, Data Scientist, Developer, Auditor
• Presentation Layer: Data Virtualization, Data Discovery, Adv. Analytics, Adv. Visualization
• Analytical layer: Machine Learning, Statistical, Numerical, Time series, Textual/Log, Spatial, Graph
• Processing Layer: Batch Processing, NRT/Stream Processing, In-Memory Processing
• Storage Model: Textual/Log DB, Hierarchy DB, Relational DB, Graph DB, Columnar DB
• Data Integration: Data Ingestion, Data Egression, Data Consumption
• Other components: Data Management, Middleware, Source/Target APIs, 3rd Party Drivers, Ent. Scheduler Srvs, Metadata Mgmt, Workload Mgmt, Continuous Integration, Dev Framework, Security
• Infrastructure: Platform Virtualization, Platform Management, Network Management, Systems Management
• Processes: Prescriptive Guidance, Change, Release, Governance, Engagement, Service Management, Training, Support
Legend
Other Vendors offered capabilities
Majority CDH offered capabilities
BI Big Data Platform
Hadoop Project Sandbox – CDH 5.3
Multiple Instances
Deployed on Intel Cloud & MyCloud environments. TTM to business: 2-3 Days
Hadoop Pre-Production – CDH 5.3
10 data nodes | 399TB | 320 vcores
Use cases in Dev/POC: 14
Hadoop Production – CDH 5.3
22 data nodes | 658TB | 704 vcores
Use cases Live in prod: 7
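The node counts, storage and vcore figures above imply a per-node profile for each cluster. The following is a minimal sketch that derives it from the numbers on this slide; the `Cluster` class and its field names are hypothetical and not part of any Intel or Cloudera tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    """Cluster sizing as quoted on the slide (hypothetical helper type)."""
    name: str
    data_nodes: int
    storage_tb: float
    vcores: int

    @property
    def tb_per_node(self) -> float:
        # Average storage per data node.
        return self.storage_tb / self.data_nodes

    @property
    def vcores_per_node(self) -> float:
        # Average vcores per data node.
        return self.vcores / self.data_nodes


clusters = [
    Cluster("pre-production", data_nodes=10, storage_tb=399, vcores=320),
    Cluster("production", data_nodes=22, storage_tb=658, vcores=704),
]

for c in clusters:
    print(f"{c.name}: {c.tb_per_node:.1f} TB/node, {c.vcores_per_node:.0f} vcores/node")
```

Both clusters work out to 32 vcores per data node, while average storage per node is about 39.9 TB in pre-production and about 29.9 TB in production.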
• Hadoop 2.0 architecture provides reliability, scalability & performance
• High availability and scalability design
• Well positioned to meet 2015 business use case requirements
• Repeatable architecture for faster builds
• Capacity additions: Add data node (white boxes, waterfall equipment or HP servers)
• TTM: Varies depending on HW (3 wks-2 months)
[Cluster diagram. Recoverable labels:]
• Gateway nodes: Login (ssh) with AD authentication & authorization, access cluster, run HDFS commands, submit jobs, etc.
• Name Node / Resource Mgr (two instances, NN hi-av)
• Data Nodes: heartbeat, balancing, replication
• YARN
• Job/Workflow Management
• Management Node
• Scale to meet business needs
• Source Data: DB Data, Unstructured, Semi-structured
• Data Movement/ETL
• EDW or Datamart (DB data)
• Visualization Tools
BKMs | Summary
• Plan for skills and resources, and allow time to ramp up.
• Starting small is OK. Focus on design and scalability for the platform.
• Technical product evaluation
  • Stick with a distribution that is built on the core open source Hadoop stack rather than proprietary software.
• Security is a big deal to Intel; implementing Big Data security capabilities is a key focus.
• To understand the data, use an iterative discovery method involving technical, business and modeling teams.
• Intel IT's Big Data journey benefited heavily from the Cloudera partnership.
• Open source will play a big role in advancing Big Data capabilities and analytics.
BI Big Data IT@Intel Resource Info
BI Big Data IT@Intel Resource Links:
1. Hadoop Migration Success Story: How Intel IT Moved to Cloudera
2. Mining Big Data in the Enterprise for Better Business Intelligence
3. Enabling Big Data Platforms and Solutions with Centralized Data Management
4. Integrating Apache Hadoop* into Intel’s Big Data Environment
5. Using a Multiple Data Warehouse Strategy to Improve BI Analytics
To learn more: www.intel.com/bigdata
Q & A
Intel Confidential — Do Not Forward
Backup
Big Data Capability Catalog
[Capability diagram. Recoverable labels:]
• Cloudera* Distribution of Hadoop (CDH)
• Hadoop ecosystem components: Hive, HDFS, MapReduce, Zookeeper, Pig, Mahout, HDFS Compress, Whirr, HBase, Storm, HCatalog, Accumulo, YARN, Spark, Hue, Solr, Impala, Parquet, DataFu, Oozie, Kafka, Sqoop, Flume, Sentry
• Integration (3rd Party SW/Connectors): Sqoop JDBC, Impala JDBC, Hive ODBC, Impala ODBC, TDCH, Other DW, Autosys, SecureGIT
• Data Integration: DI Gateway, SFTP, SMBClient, Camel, Google Analytics, SFDC
• Cloudera Manager* (System Management): Deployment, Monitoring, Reporting, Diagnostics, Alerting, Service Management, Rolling Upgrades, Config Rollbacks
• Cloudera Navigator* (Data Management): Audit, Access Control, Discovery, Explore, Lineage, Lifecycle
• Infrastructure: Network, Servers, Storage, Security, OS, Hi-Av, EAM / AD Integration
• Process: Governance, Change, Release, Engagement, Service mgmt., Prescriptive Guidance, Training
Legend: Enabled (Avail. Now) | WIP (1-3 Months) | Planned (3-6+ Months)
List includes only the capabilities planned for next 6 months.
Migration to Cloudera – 6 BKMs
i. Find Differences with a Comparative Evaluation in a Sandbox Environment
ii. Define Your Strategy for the Cloudera Implementation
iii. Split the Hardware Environment
iv. Upgrade the Hadoop Version
v. Create a Preproduction-to-Production Pipeline
vi. Rebalance the Data
Building Block Strategy to Enterprise Security of Hadoop
Timeline: Now | Q2’15 | 2H’15
• Q1’15: Perimeter access with LDAP + finer grain controls with Sentry. The second building block towards enterprise grade security design.
• Q2’15: Add Kerberos to enable more Hadoop components and further secure the platform.
• 2H’15: Exploration starting; awaiting product, with a target to adopt in Production in 2H’15.
Hadoop Maturity & Evolution
Timeline: 2005 - 2012 | 2013 - 2014 | 2015 - 2017

Hadoop 1.0
• MapReduce (batch data processing, cluster resource management)
• HDFS 1.0 (redundant, reliable data storage)
+ Scalable data storage and processing platform
+ Positioned for Batch processing workloads for Map and Reduce only
+ Apache Hive offers SQL like query language
- Lacks reliability and stability
- No support for low latency queries

Hadoop 2.0
• HDFS (redundant, reliable data storage)
• YARN (cluster resource management)
• Batch (MapReduce)
• Others (data processing)

Applications Run Natively In Hadoop
• YARN (cluster resource management)
• HDFS 2.0 (redundant, reliable data storage)
• Interactive (Impala)
• In-Memory (Spark)
• Batch (MapReduce)
• Online (HBase)
• Others (Search, Storm etc.)
• Graph

• Apache YARN allows you to run multiple applications in Hadoop and provides reliability, scalability and performance
• Advanced Resource Management
• Apache Hive offers a 50x improvement in performance for queries
• Cloudera Impala to support low latency query requirements with SQL-92 and SQL-2000 support
• Data at Rest Encryption and Row Level/Cell Level Security planned
• Data Streaming and Search Capability
• GraphDB
• Expanded Data Governance
• IMT + Hadoop Integration
• Improved Front End tool integration/support
• Deeper Diagnostics for multiple components
2014 Intel IT Vital Statistics
>6,300 IT employees
59 global IT sites
>98,000 Intel employees [1]
168 Intel sites in 65 Countries
64 Data Centers (91 Data Centers in 2010)
80% of servers virtualized (42% virtualized in 2010, goal of 75%)
>147,000 Devices
100% of laptops encrypted
100% of laptops with SSDs
>43,200 handheld devices
57 mobile applications developed
Source: Information provided by Intel IT as of Jan 2014
[1] Total employee count does not include wholly owned subsidiaries that Intel IT does not directly support
Big Data in the Industry
Recommendation Engine
Fraud Detection
Sentiment Analytics
Behavioral Targeting
Customer Experience Analytics
Marketing campaign Analytics
Learn more about Intel IT’s Initiatives at
www.intel.com/IT
Sharing Intel IT Best Practices
With the World

 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
AI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by AnitarajAI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by Anitaraj
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 

Evolution of Big Data at Intel - Crawl, Walk and Run Approach

  • 1. Evolution of Big Data at Intel - crawl, walk and run approach Gomathy Bala | Director Chandhu Yalla | Manager & Architect Key Contributors: Sonja Sandeen, Seshu Edala, Nghia Ngo and Darin Watson IT BI Big Data Team
  • 2. Legal Notices: This presentation is for informational purposes only. INTEL MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY. The content in this presentation is being shared Under NDA. Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. * Other names and brands may be claimed as the property of others. Copyright © 2014, Intel Corporation. All rights reserved.
  • 3. Agenda • Intel IT Big Data Journey • Enterprise DW architecture • BI Big Data 3 yr Roadmap • Big Data Ecosystem Architecture • Platform Strategies & BKMs • Summary
  • 4. Intel IT Big Data Journey (timeline: 2011 2012 2013 2014 2015) Big Data & Analytics Strategy Production Online Telmap: 1st Use Case Preproduction Online Hadoop Evaluation IDH to CDH Hadoop 2.0 $176M BV Production: Security BI, Attribute Reduction System, ATM Ellipses Engine, IAH-Retail Analytics 6 Environments CDH 5.3 4 Use Cases in Preproduction 12 POC Use Cases 6 Use Cases in Production $290K investment $948/TB 3 Use Cases in Production Smart-What, Marketing-IAH, Incident Predictability $6M BV CDH 5.1 IAH – Cloud CRM In Production Enterprise Standards, Guidance, Processes for Platform & Capabilities 15 Active Use Cases | $290K + 10.5 HC Investment | Delivered $182M BV
  • 5. Big Data & Analytics Really Delivers! From 2014 – 2015 Intel IT Business Review – Annual Edition Kim's Video
  • 6. 2014-2017 Vision: Real-Time Enterprise. Any Data Source ERP In Memory Real-Time Data Platform CRM SCM SRM ECC BW ECCW Real-Time & Self Service Analytics Platform MDG NW Teradata Cloudera Hadoop Data Lake Reporting Tools Data Tiering Hot-Cold data Enterprise Data Warehouse Other Apps Custom Intel … NRT Predictive Analytics BPC BCS Cloud BI SaaS New Apps. Downstream Applications
  • 7. Enterprise Data Architecture with Hadoop and Other MPP DWH: Current & Future Strategy. FE Tools CLS/Proxy High speed data loader BigData • Machine Learning • Log Processing • Unstructured data Use Cases • High volume counter Analytics • Text Parsing/Mining • Strategic/Operational reporting • Interactive Reporting Use Cases • High Concurrent user analytics - Supply/Order • Mission critical analytics – Finance/HR SQL on Hadoop Future Present EDWMfg Data A %ge of Traditional BI use cases IMT
  • 8. BI Big Data | 3-Year Roadmap. Big Data + AA Big Data + SSAA + Traditional BI Big Data + SSAA + Traditional BI 2015 2016 2017 Scalable and well designed Hadoop Platform • Evolve IMT + Hadoop • Data Lineage & Data Catalog • Streaming Capabilities • Advanced SQL on Hadoop • ACID semantics • Evolve Big Data + SSAA per ecosystem roadmaps • BC/DR • End to end enterprise features • Enterprise ready: OLAP and Traditional DW Hadoop is an open source framework designed for big data analytics. Hadoop is evolving rapidly, but it will still take a couple of years for it to mature and support “traditional BI” use cases. Legend Orange Text: Traditional BI Capabilities Green Text: Big Data/AA Capabilities • Security (RBAC, ITS/IRS) • Data Governance • Data Discovery • Self Service AA Framework • IMT + Hadoop • AVP + Hadoop • In-memory + Near real time capabilities • SQL on Hadoop
  • 9. Big Data Platform – Ecosystem Architecture & Maturity. Data Integration NRT/Stream Processing In-Memory Processing Processing Layer Batch Processing Data Virtualization Data Discovery Adv. Analytics Adv. Visualization Data Management Presentation Layer End User Data Steward Business Analyst Data Scientist Developer User layer Auditor Machine Learning Analytical layer Statistical Numerical Time series Textual/Log Spatial Graph Textual/Log DB Hierarchy DB Relational DB Graph DB Storage Model Platform Virtualization Infrastructure Platform Management Network Management Systems Management Data Ingestion Continuous Integration Dev Framework Security Source/Target APIs 3rd Party Drivers Ent. Scheduler Srvs Metadata Mgmt Workload Mgmt Middleware *Other names and brands may be claimed as the property of others. Columnar DB Data Egression Other Vendors offered capabilities Majority CDH offered capabilities Data Consumption Prescriptive Guidance Change Release Governance Engagement Service Management Training Support Processes
  • 10. BI Big Data Platform. Hadoop Project Sandbox – CDH 5.3 Multiple Instances Deployed on Intel Cloud & MyCloud environments. TTM to business: 2-3 Days Hadoop Pre-Production – CDH 5.3 10 data nodes | 399TB | 320 vcores Use cases in Dev/POC: 14 Hadoop Production – CDH 5.3 22 data nodes | 658TB | 704 vcores Use cases Live in prod: 7 • Hadoop 2.0 architecture provides reliability, scalability & performance • High availability and scalability design • Well positioned to meet 2015 business use case requirements • Repeatable architecture for faster builds. • Capacity additions: Add data node. White boxes, Waterfall equipment or HP servers • TTM: Varies depending on HW (3 wks-2 months) Job/Workflow Management Data Node Data Node Data Node Data Node Data Node Name Node Resource Mgr Name Node Resource Mgr heartbeat, balancing, replication YARN Scale to meet business needs Gateway Nodes (NN hi-av) Gateway nodes Login (ssh) : AD authentication & authorization, access cluster, run HDFS commands, submit jobs, etc. Management Node Source Data DB Data Visualization Tools Data Movement/ETL EDW or Datamart DB data Unstructured Semi-structured
  • 11. BKMs | Summary • Skills and resources need time to ramp up. • Starting small is OK. Focus on design and scalability for the platform. • Technical product evaluation: stick with a distribution that is a core Hadoop open source stack rather than proprietary software. • Security is a big deal to Intel; implementing Big Data security capabilities is a key focus. • The methodology for understanding the data is an iterative discovery method with technical, business and modeling teams. • The Intel IT Big Data Journey benefited heavily from the Cloudera partnership. • Open source will play a big role in advancing Big Data capabilities and analytics.
  • 12. BI Big Data IT@Intel Resource Info. Resource Links: 1. Hadoop Migration Success Story: How Intel IT Moved to Cloudera 2. Mining Big Data in the Enterprise for Better Business Intelligence 3. Enabling Big Data Platforms and Solutions with Centralized Data Management 4. Integrating Apache Hadoop* into Intel’s Big Data Environment 5. Using a Multiple Data Warehouse Strategy to Improve BI Analytics To learn more: www.intel.com/bigdata
  • 13. Q & A
  • 14. Intel Confidential — Do Not Forward
  • 15. Backup
  • 16. Big Data Capability Catalog. Hive HDFS MapReduce Zookeeper Pig Mahout Network Servers Storage Security OS Hi-Av EAM / AD Integration HDFS Compress WHIRR Hbase Governance Change Release Engagement Service mgmt. Prescriptive Guidance Training SQOOP JDBC Other DW Infrastructure Process Cloudera* Distribution of Hadoop (CDH) *Other names and brands may be claimed as the property of others. Storm Hcatalog ACCUMULO YARN SPARK Autosys SecureGIT Impala JDBC Hive ODBC 3rd Party SW/Connectors Integration HUE SOLR IMPALA PARQUET DataFu Impala ODBC TDCH Oozie Kafka Sqoop DI Gateway Flume SFTP SMBClient Data Integration Camel Enabled Planned WIP Avail. Now 1-3 Months 3-6+ Months Cloudera Manager* System Management Cloudera Navigator* Data Management Audit Access Control Discovery Explore Lineage Lifecycle Deployment Monitoring Reporting Diagnostics Alerting Service Management Rolling Upgrades Config Rollbacks List includes only the capabilities planned for next 6 months. Google Analytics SFDC Sentry
  • 17. Migration to Cloudera – 6 BKMs: i. Find Differences with a Comparative Evaluation in a Sandbox Environment ii. Define Your Strategy for the Cloudera Implementation iii. Split the Hardware Environment iv. Upgrade the Hadoop Version v. Create a Preproduction-to-Production Pipeline vi. Rebalance the Data
  • 18. Building Block Strategy to Enterprise Security of Hadoop (timeline: Now, Q2’15, 2H’15). Q1’15: Perimeter access with LDAP + finer grain controls with Sentry. The second building block towards enterprise grade security design. Q2’15: Add Kerberos to enable more Hadoop components and further secure the platform. 2H’15: Exploration starting, awaiting product and target to adopt in 2H’15 in Production.
  • 19. Hadoop Maturity & Evolution. MapReduce (batch data processing, cluster resource management) HDFS 1.0 (redundant, reliable data storage) Hadoop 1.0 YARN (cluster resource management) HDFS 2.0 (redundant, reliable data storage) Interactive (Impala) In-Memory (Spark) Batch (Map Reduce) Online (Hbase) Others (Search, Storm etc.) Graph Applications Run Natively In Hadoop + Scalable data storage and processing platform + Positioned for Batch processing workloads for Map and Reduce only + Apache Hive offers SQL like query language - Lacks reliability and stability - No support for low latency queries • Apache YARN allows you to run multiple applications in Hadoop and provides reliability, scalability and performance • Advanced Resource Management • Apache Hive offers a 50x improvement in performance for queries • Cloudera Impala to support low latency query requirements with SQL-92 and SQL-2000 support • Data at Rest Encryption and Row Level/Cell Level Security planned • Data Streaming and Search Capability • GraphDB • Expanded Data Governance • IMT + Hadoop Integration • Improved Front End tool integration/support • Deeper Diagnostics for multiple components 2005 - 2012 2013 - 2014 Hadoop 2.0 HDFS (redundant, reliable data storage) YARN (cluster resource management) Batch (Map Reduce) Others (data processing) 2015 - 2017
  • 20. 2014 Intel IT Vital Statistics. >6,300 IT employees 59 global IT sites >98,000 Intel employees (1) 168 Intel sites in 65 Countries 64 Data Centers (91 Data Centers in 2010) 80% of servers virtualized (42% virtualized in 2010, goal of 75%) >147,000 Devices 100% of laptops encrypted 100% of laptops with SSDs >43,200 handheld devices 57 mobile applications developed Source: Information provided by Intel IT as of Jan 2014 (1) Total employee count does not include wholly owned subsidiaries that Intel IT does not directly support
  • 21. Big Data in the Industry. Recommendation Engine Fraud Detection Sentiment Analytics Behavioral Targeting Customer Experience Analytics Marketing campaign Analytics
  • 22. Learn more about Intel IT’s Initiatives at www.intel.com/IT Sharing Intel IT Best Practices With the World

Editor's notes

  1. 2
  2. Stream Processing or Complex Event Processing -- small chunks of data arrive at rapid intervals [smaller quantum, requiring transformation]. E.g. sensor data from manufacturing floors.
     Batch Processing -- aggregated chunks of data, perhaps collected over a long span, waiting to be analyzed in one run. OLAP processing. E.g. gold path analysis on intel.com.
     In-memory Processing -- running interactive analytics over large batches of summary/factual data by leveraging memory as the pre-emptive transient store. E.g. SQL aggregates/operational metrics from an OLAP process.
     Machine Learning -- class of unsupervised and supervised learning techniques destined for a decision support or an expert system.
       • Unsupervised Learning (no "response" variable, just observations) -- tools: Mahout.
         Clustering -- E.g. customer segmentation; clustering users by age, ethnicity, gender, income standards, geo, profession, and buying propensity to new form factors.
         Frequent pattern mining -- E.g. co-branding strategies. People buying RealSense cameras also downloading Intel XDK kits within 7 days of purchase.
       • Supervised Learning [predicting a "response" variable when encountering a new "condition"; the response patterns are learned from prior training sets] -- tools: H2O.
         Regression -- E.g. YoY growth for DCG Xeon co-processor shipment at 16% between 2011 and 2014. This year, we will ship 36 million units; current inventory levels at 23 million.
         Classification -- E.g. Customer (Widgets Inc) responses to email automation and phone calls favorable in the last 3 months. Last upgrade was 2 years ago. The likelihood of an enterprise upgrade is "high".
     Textual -- class of algorithms that "derive" meaning from what is otherwise flat left-to-right, top-to-bottom "text". Shred sentence structure into nouns, verbs, adjectives and adverbs; count entities and turn "text" into "terms" [features]. Encode the features into a term-document or a "graph" representation so that traditional analytics (machine learning, supervised and unsupervised techniques) may be applied. Lucene and SOLR are useful for indexing/tokenizing text; NLTK or the Stanford parsers are useful to "tag" terms with classes of linguistic tokens such as nouns and verbs. E.g. identify service management tickets that entail Windows 8.1 issues.
     Log -- logs are textual in syntax but do not possess linguistic rigor. Such content is useful simply indexed as is and searched. The machines do not "decode" meaning. Humans synthesize and add logical rules when the content is surfaced back via a search interface. E.g. Logstash used to monitor errors in log4j logs of Hive jobs.
     Spatial -- class of problems that deal with the spatial layout of entities. E.g. every die is sacred: rationing and allocating sub-systems on a die via simulation techniques to minimize wastage loss and maximize "premium" quality, or optimizing lithographic etches that minimize orthogonal cuts by employing space-filling heuristics.
     Statistical -- class of problems that infer patterns from data that exhibits stochastic characteristics. E.g. identifying aggregations like stddev, min, max, avg yields of a graphics die, and performing outlier analysis.
     Numerical -- class of problems that deal with data that exhibits deterministic characteristics. E.g. Taguchi methods or iterative Monte Carlo methods that search and seek global minima/maxima. Genetic algorithms, deep learning methods/neural networks etc.
     Time-series -- class of problems that deal with data that exhibits stochasticity, but also exhibits temporal/seasonal resonance patterns. E.g. noise-cancellation filters that employ feedback loops, or predicting stock-price movement etc.
     Graph -- class of problems that compute statistics about entities connected to other entities. E.g. computing pagerank/link-popularity of a web page, congestion patterns of a traffic flow, sewage system planning etc.
     Storage Models
       • Textual/Binary -- no DDL. All data is stored row-first, column-next, where there is only one BLOB column per row. E.g. Zip files, mainframes.
       • Relational -- well specified DDL, but data is stored row-first [co-located fields of a row]; locking semantics at row level. Yields faster entity retrievals but poorer compression ratios when heterogeneous fields co-exist in data. The index is built for row-offsets. E.g. Oracle, MySQL.
       • Columnar -- well specified DDL, but data is stored column-first [all first names are co-stored in one file, last names co-stored in another etc.]; locking semantics at cell level. Yields faster aggregates [min, max on a single field] and better compression ratios [because all fields of a columnar file are a homogeneous type]. But it lacks atomic consistency because a record change translates into mutations in multiple "columnar/co-location" files. E.g. HBase, Cassandra.
       • Hierarchy -- well specified structural definition. Mostly follows a denormalized parent-child taxonomy. All fields relevant to a record are stored as a "hierarchic document" à la an XML or JSON document. Yields a great consistency model because the grain of the data is a "document". Any mutation will always mean a complete denormalized update of the full document (JSON or XML). E.g. MongoDB, CouchDB.
       • GraphDB -- native adjacency property graph that stores entities as "vertices" of a graph, relations as "edges", and attributes as "properties". Since indices are combinatorially developed on all of them (entities, relations, and attributes), adjacency mining, filtering and mutations are performant and atomic. E.g. Neo4J, TitanDB.
  3. SLIDE PURPOSE: Who are we? We are the IT organization at Intel (IT@Intel). Core background information on Intel IT and our mission/goals/capabilities.
     Key Messages: We are the IT organization inside Intel’s business. Our organization is a large, diverse multi-national enterprise with a wide variety of operational requirements and needs. Our Vision is to accelerate Intel’s quest to connect and enrich the lives of every person on Earth by the end of the decade. Our Mission is to grow Intel’s business through Information Technology by facilitating IT Consumerization, delivering IT efficiency and continuity through Cloud Computing, increasing employee productivity through seamless connectivity and Security, providing significant business value through Business Intelligence initiatives and driving increased collaboration through Social Computing.
     Review some of the Information/Key Stats shown here.
     Size and Location: 6,334 IT employees supporting over 98,000 employees. Note: Intel IT only reflects the number of employees we support directly (we exclude Intel employees who support wholly owned subsidiaries). Remote support is vital.
     Data Centers and Facilities: 59 Data Centers worldwide (down from 142 in 2007). Need to confirm this data [~55,000 servers (down from 100,000 in 2007) consuming a large electrical and power/cooling load (roughly 55MW total power). Our Data Centers also support 300M email messages (per month), >2,183 Terabytes WAN traffic (per month)] and store 45 petabytes of raw storage capacity.
     Employee / Client Technology: Support over 147K devices (note the >1 per employee ratio; this ratio is growing with support of BYO and custom technology delivery to meet business needs). We have had 80%+ mobile PCs (laptops) as our core employee technology standard since 1997. We have been actively evaluating, enabling and supporting many companion devices for improved productivity and flexibility. Need to add what we are doing with tablets - Janet. >43,200 handhelds (variety of form factors (phones/tablets), vendors, software and solutions); the majority of these devices are now EMPLOYEE OWNED. Intel IT continues to embrace consumerization of IT, and mobile applications are a major component of our strategy. We have delivered 57 mobile apps and counting to support new form factors. Our goal is to deliver a seamless, secure experience for our employees across a wide spectrum of devices by putting user experience first.
     Enabled Leadership Business Capabilities: Enable a top 25 supply chain (recognized by Gartner, previously AMR Research): #25 in 2009, #18 in 2010, #16 in 2011, #7 in 2012 and #5 in 2013. Key focus for IT innovation; delivered solid business results and competitive differentiation for Intel.
     Additional fun facts: 100% of Intel laptops support SSD and 100% are deployed with disk encryption.
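
As a rough illustration of the Relational (row-first) and Columnar (column-first) storage models described in note 2, the sketch below keeps the same three records in both layouts using plain Python containers. The records, field names and helper functions are hypothetical and are not taken from the deck; real engines such as the ones the note names add on-disk files, indexes, compression and locking that this toy version ignores.

```python
from statistics import mean

# Hypothetical sample records; field names are illustrative only.
records = [
    {"first_name": "Ada", "last_name": "Lovelace", "yield_pct": 91.5},
    {"first_name": "Alan", "last_name": "Turing", "yield_pct": 88.0},
    {"first_name": "Grace", "last_name": "Hopper", "yield_pct": 94.5},
]

# Row-first layout: all fields of one record are co-located.
row_store = [tuple(r.values()) for r in records]

# Column-first layout: all values of one field are co-located.
column_store = {name: [r[name] for r in records] for name in records[0]}


def fetch_entity(row_index):
    """Row store: a single lookup returns the whole record."""
    return row_store[row_index]


def column_aggregates(field):
    """Column store: an aggregate scans one homogeneous list."""
    values = column_store[field]
    return min(values), max(values), mean(values)


def update_record(row_index, new_record):
    """Column store: one logical update mutates every column list.

    Returns the number of separate column lists that were touched.
    """
    touched = 0
    for name, value in new_record.items():
        column_store[name][row_index] = value
        touched += 1
    return touched
```

Calling `fetch_entity(1)` reads one co-located tuple, while `column_aggregates("yield_pct")` only touches the single `yield_pct` list. `update_record` reports how many column lists one logical change had to mutate, which is the reason the note gives for weaker atomic consistency in columnar stores.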