© Copyright 2010 EMC Corporation. All rights reserved.
RDBMS and Hadoop
A Powerful Combination
Jacque Istok
You Know Hadoop, But What Is Greenplum?
EMC/Greenplum is an MPP (massively parallel processing) data warehouse system based on PostgreSQL, with the full capabilities of a traditional RDBMS. Alongside SQL-99 compliance for structured analysis, Greenplum also offers a MapReduce implementation for unstructured analysis. In short:
Greenplum ~ Hadoop/Hive
Data in a Typical Enterprise
• Data is everywhere: the corporate EDW, hundreds of data marts, ‘shadow’ databases, spreadsheets, logs, etc.
• The goal of centralizing all data in a single EDW has proven untenable
[Diagram: the EDW holds ~10% of data; data marts and ‘personal databases’ hold ~90% of data]
Today’s Big Data Challenges
• The number of data sources and the amount of data to analyze are growing exponentially
• Data goes stale because DW solutions cannot ingest the vast amounts of data fast enough
• Advanced analytics and complex queries lack performance
• The number of users and their concurrency are increasing rapidly
• Security and privacy around the data are both preferred and often mandated
Architecture of HDFS/Hadoop/Hive
Hive Server accepts SQL and dynamically generates and executes MapReduce code
Flexible framework for processing large datasets
Materialize data subsets to reduce impact of node failure
DataNode servers process analytics close to the data in parallel
Process large datasets with support for both SQL and MapReduce
[Diagram labels: SQL (subset); Hive; MapReduce; NameNode; multiple DataNodes]
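To make the first point on this slide concrete, the following is a minimal, single-process sketch of how a grouped count (roughly `SELECT status, COUNT(*) ... GROUP BY status`) can be expressed as map, shuffle, and reduce steps. It is not Hive's actual code generation; the sample rows and the function names `map_phase`, `shuffle`, `reduce_phase`, and `count_group_by` are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical input rows, standing in for records stored in HDFS.
ROWS = [
    {"user": "alice", "status": 200},
    {"user": "bob", "status": 404},
    {"user": "alice", "status": 200},
    {"user": "carol", "status": 404},
]


def map_phase(rows, key_column):
    """Map: emit one (key, 1) pair per input row."""
    for row in rows:
        yield row[key_column], 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def count_group_by(rows, key_column):
    """Rough equivalent of SELECT key_column, COUNT(*) ... GROUP BY key_column."""
    return reduce_phase(shuffle(map_phase(rows, key_column)))


print(count_group_by(ROWS, "status"))
```

In a real cluster the map and reduce steps would run as separate tasks on many nodes, with the shuffle moving data between them over the network; here all three run in one process to keep the shape of the computation visible.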
Architecture of Greenplum
Master servers optimize queries for the most efficient query execution
MPP Scatter/Gather streaming for fast loading of data
Flexible framework for processing large datasets
Interconnect for continuous pipelining of data processing
Segment servers process queries close to the data in parallel
Process large datasets with support for both SQL and MapReduce
[Diagram labels: SQL; MapReduce; Master; multiple Segment servers]
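The master/segment split described on this slide can be sketched roughly as follows: rows are spread across segments by hashing a distribution key, each segment computes a partial aggregate over its own rows, and the master combines the partial results. This is an illustration under stated assumptions, not Greenplum's implementation; the CRC32 hash, the segment count, the thread pool standing in for separate servers, and the names `segment_for`, `scatter`, `segment_sum`, and `gather_sum` are all hypothetical.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

NUM_SEGMENTS = 4  # hypothetical number of segment servers


def segment_for(key, num_segments=NUM_SEGMENTS):
    """Pick a segment by hashing the distribution key (illustrative hash only)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_segments


def scatter(rows, key_column, num_segments=NUM_SEGMENTS):
    """Scatter: route each row to exactly one segment."""
    segments = [[] for _ in range(num_segments)]
    for row in rows:
        segments[segment_for(row[key_column], num_segments)].append(row)
    return segments


def segment_sum(rows, value_column):
    """Local work on one segment: a partial aggregate over its own rows."""
    return sum(row[value_column] for row in rows)


def gather_sum(segments, value_column):
    """Gather: compute partial sums concurrently, then combine them on the master."""
    with ThreadPoolExecutor(max_workers=len(segments)) as pool:
        partials = list(pool.map(lambda seg: segment_sum(seg, value_column), segments))
    return sum(partials)


# Hypothetical sample data: 100 orders with amounts 10, 20, ..., 1000.
SALES = [{"order_id": i, "amount": 10 * i} for i in range(1, 101)]
segments = scatter(SALES, "order_id")
total = gather_sum(segments, "amount")
print(total)
```

Because every row lands on exactly one segment, the sum of the partial results equals the sum over the whole table, which is what lets the per-segment work proceed independently.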
RDBMS Advantages
Common Real World Implementation
Lots ‘O Data
A Cyber-Analytics Data Mart Use Case
• Commercial SIEM products struggle with the volumes of data generated in a large enterprise. Non-parallel event processing systems can’t keep up with ingest, user load, etc.
• Greenplum provides the ability to cost-effectively ingest and store large volumes of sensor data.
• Greenplum provides the parallel analytics that support data mining, event correlation, etc., over datasets from terabytes to petabytes in size.
[Diagram labels: Access and Events; GPLoad; Greenplum Analytics Data Mart; SQL; MapReduce (Perl, Python Math Lib, R); SoR; ETL; ODS; BI]
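As one example of the kind of event correlation this use case refers to, the sketch below flags sources that log several failed access attempts shortly before a successful one. The event layout, the thresholds, and the function name `suspicious_sources` are assumptions chosen for illustration; the deck does not describe a specific rule, and a production version would run in parallel inside the database rather than in a single Python process.

```python
from collections import defaultdict

# Hypothetical access events: (timestamp_seconds, source_ip, outcome).
EVENTS = [
    (100, "10.0.0.5", "fail"),
    (110, "10.0.0.5", "fail"),
    (115, "10.0.0.9", "fail"),
    (120, "10.0.0.5", "fail"),
    (130, "10.0.0.5", "success"),
    (900, "10.0.0.9", "success"),
]


def suspicious_sources(events, min_failures=3, window_seconds=60):
    """Flag sources with at least min_failures failures in the window before a success."""
    failures = defaultdict(list)
    flagged = set()
    # Process events in timestamp order so failures are seen before the success.
    for timestamp, source, outcome in sorted(events):
        if outcome == "fail":
            failures[source].append(timestamp)
        elif outcome == "success":
            recent = [t for t in failures[source] if timestamp - t <= window_seconds]
            if len(recent) >= min_failures:
                flagged.add(source)
    return sorted(flagged)


print(suspicious_sources(EVENTS))
```

With the sample data, only `10.0.0.5` is flagged: it has three failures within 60 seconds of its successful access, while `10.0.0.9` has a single failure long before its success.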
Coexistence Approach – Use Case
[Diagram labels: Compute; Storage; Analytics; Network; General Purpose x86 Cluster of Systems]
• Provides true, complete SQL-compliant analytics
• Data can be read from and written to Hadoop via Greenplum
• Store your data structured or unstructured, column- or row-oriented, and compressed, leveraging index support where appropriate
• SQL can be executed, through Greenplum, on data residing within Greenplum as well as data residing within HDFS
• MapReduce can be executed through Greenplum in Java, C, Perl, or Python, or through Java in Hadoop
• Designed for rapid analysis of data volumes from less than a terabyte scaling into the petabytes
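The idea of one SQL statement covering both database-resident data and data held in files can be sketched as below. This is only an analogy: SQLite stands in for the database, an in-memory string stands in for a delimited file in HDFS, and the table names `sales_internal` and `sales_external` are hypothetical. The sketch copies the file's rows into a second table, which is a simplification; the slide describes querying HDFS data through Greenplum, not this mechanism.

```python
import csv
import io
import sqlite3

# Stand-in for a delimited file stored in HDFS (hypothetical contents).
HDFS_FILE = "2010-10-01,web,120\n2010-10-02,web,80\n"

# SQLite stands in for the database in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_internal (day TEXT, channel TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales_internal VALUES (?, ?, ?)",
    [("2010-10-01", "store", 200), ("2010-10-02", "store", 150)],
)

# Expose the file's rows as a second table so one SQL statement can see both.
conn.execute("CREATE TABLE sales_external (day TEXT, channel TEXT, amount INTEGER)")
external_rows = [
    (day, channel, int(amount))
    for day, channel, amount in csv.reader(io.StringIO(HDFS_FILE))
]
conn.executemany("INSERT INTO sales_external VALUES (?, ?, ?)", external_rows)

# A single query aggregates over both sources.
QUERY = """
    SELECT day, SUM(amount) AS total
    FROM (SELECT day, amount FROM sales_internal
          UNION ALL
          SELECT day, amount FROM sales_external)
    GROUP BY day
    ORDER BY day
"""
totals = conn.execute(QUERY).fetchall()
print(totals)
```

The point of the sketch is the query shape: the analyst writes ordinary SQL, and where each table's rows physically live is hidden behind the table name.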
Big Data is Complementary to EDW
[Diagram labels: Commodity Hardware; Virtual Machines; Public Cloud; Greenplum]
Enterprise Data Warehouse
• Single Source of Truth
• 1 Logical Model
• Heavy data governance and quality
• Operational Reporting
• Financial Consolidation
MapReduce Analytics Cloud
• Source of all raw data (often 10x the size of the EDW)
• Self-service infrastructure to support multiple marts and sandboxes
• Rapid analytic iteration and business-owned solutions

Greenplum - Jacque Istok - Hadoop World 2010
