Greenplum & Hadoop
Why do such a thing?

Donald Miner
Solutions Architect
Advanced Technologies Group
Donald.Miner@emc.com




© Copyright 2012 EMC Corporation. All rights reserved.
QUICK INTRODUCTION TO GREENPLUM DATABASE




GREENPLUM DATABASE

Greenplum Database Basics
Massively Parallel Processing (MPP) Database

Uses commodity hardware

Data is distributed by a user-defined “distribution key”

Master node delegates queries to segments

1:1 segment and master mirroring for redundancy

[Diagram: two master nodes above four segment nodes]
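
The distribution-key idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the segment count of four mirrors the diagram, `zlib.crc32` stands in for whatever hash function the database actually uses, and the `householdID` rows are made up.

```python
import zlib
from collections import defaultdict

NUM_SEGMENTS = 4  # hypothetical cluster size, matching the four segments on the slide


def segment_for(key: str, num_segments: int = NUM_SEGMENTS) -> int:
    """Map a distribution-key value to a segment index with a stable hash."""
    # zlib.crc32 is a stand-in; it is not the database's internal hash function.
    return zlib.crc32(key.encode("utf-8")) % num_segments


def distribute(rows, key_column, num_segments=NUM_SEGMENTS):
    """Group rows by the segment that would own them."""
    segments = defaultdict(list)
    for row in rows:
        segments[segment_for(str(row[key_column]), num_segments)].append(row)
    return segments


# Made-up rows keyed on householdID.
rows = [{"householdID": i, "income": 1000 * i} for i in range(1000)]
placement = distribute(rows, "householdID")
for seg in sorted(placement):
    print(f"segment {seg}: {len(placement[seg])} rows")
```

Because the same key always hashes to the same segment, rows that share a distribution-key value end up on the same segment, so a join on that key can be done locally on each segment.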




Greenplum Database Features
Full SQL support based on PostgreSQL 8.2

Columnar or row-oriented storage with compression

Multi-level table partitioning with query time partition pruning

B-tree and bitmap indexes

JDBC, ODBC, OLEDB, etc. interfaces

High speed, parallel bulk ingest

Parallel query optimizer

External tables
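
Query-time partition pruning, listed above, means that partitions whose range cannot match the query predicate are skipped. A minimal Python sketch of that idea follows; the monthly partition names and date ranges are hypothetical, and the real optimizer works on the table's partition metadata rather than a dictionary.

```python
from datetime import date

# Hypothetical monthly partitions: name -> (inclusive start, exclusive end).
PARTITIONS = {
    "sales_2012_01": (date(2012, 1, 1), date(2012, 2, 1)),
    "sales_2012_02": (date(2012, 2, 1), date(2012, 3, 1)),
    "sales_2012_03": (date(2012, 3, 1), date(2012, 4, 1)),
    "sales_2012_04": (date(2012, 4, 1), date(2012, 5, 1)),
}


def prune(partitions, query_start, query_end):
    """Return the partitions whose range overlaps [query_start, query_end)."""
    return [
        name
        for name, (start, end) in partitions.items()
        if start < query_end and query_start < end
    ]


# Corresponds to a predicate such as:
#   sale_date >= '2012-02-10' AND sale_date < '2012-03-05'
to_scan = prune(PARTITIONS, date(2012, 2, 10), date(2012, 3, 5))
print(to_scan)  # ['sales_2012_02', 'sales_2012_03']
```

Only two of the four partitions are scanned; the other two are never read.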




MADlib Analytics with Greenplum

Scalable and in-database

Mathematical, statistical, machine learning

Active open source project

> SELECT householdID, variables
   FROM households
   ORDER BY RANDOM()
   LIMIT 100000;
> SELECT run_univariate_analysis (
      'households_training',
      'variables');
   WHERE pvalue<.01 AND r2>.01;
> SELECT run_regression(
      'univariate_results',
      'households_training');
> SELECT householdID,
      madlib.array_dot(
      coef::REAL[],
      xmatrix::REAL[])
   FROM coefficients, households;




MADlib In-Database Analytical Functions
    Descriptive Statistics
        Quantile
        Profile
        CountMin (Cormode-Muthukrishnan) Sketch-based Estimator
        FM (Flajolet-Martin) Sketch-based Estimator
        MFV (Most Frequent Values) Sketch-based Estimator
        Frequency
        Histogram
        Bar Chart
        Box Plot Chart
        Latent Dirichlet Allocation Topic Modeling

    Modeling
        Correlation Matrix
        Association Rule Mining
        K-Means Clustering
        Naïve Bayes Classification
        Linear Regression
        Logistic Regression
        Support Vector Machines
        SVD Matrix Factorisation
        Decision Trees/CART




PostGIS Support in Greenplum DB
  PostGIS adds support for geographic objects in PostgreSQL
  http://postgis.refractions.net/

  Example: find all records within 25 miles of hurricane path

 select customer_id, ST_AsText(lat_lon), phone_num
 from clients
 where ST_DWithin(lat_lon, ST_GeometryFromText('LINESTRING(
 -79.3 17, -79.3 17.1, -79.3 17.3, -79.7 17.6, -79.6 17.4, -79.6 16.8, -79.9 15.8, -80.2 15.8,
 -80 15.7, -80 15.7, -80.2 15.9, -80.6 16.5, -81.1 16.7, -81.8 16.7, -82.1 16.8, -82.5 17.2,
 -83.9 17.9, -85.2 18.3, -85.5 18.4)', 4326), 25.0/3959.0 * 180.0/PI());


 customer_id | st_astext                   | phone_num
 ------------+-----------------------------+------------
 493140      | POINT(-80.040397 26.570613) | 1231231234
 192401      | POINT(-81.820933 26.242611) | 2342342345
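
The last argument of ST_DWithin in the query above, 25.0/3959.0 * 180.0/PI(), converts 25 miles into degrees of arc, because geometries in SRID 4326 are measured in degrees. The arithmetic can be checked with a short Python sketch; the function name is made up for illustration, and 3959 is the Earth radius in miles used on the slide.

```python
import math

EARTH_RADIUS_MILES = 3959.0  # Earth radius in miles, as used in the query


def miles_to_degrees(miles: float) -> float:
    """Convert a distance along a great circle from miles to degrees of arc."""
    radians = miles / EARTH_RADIUS_MILES  # arc length divided by radius
    return radians * 180.0 / math.pi      # radians to degrees


threshold = miles_to_degrees(25.0)
print(f"{threshold:.4f}")  # 0.3618
```

This is an approximation: one degree of longitude covers less ground away from the equator, so a fixed threshold in degrees corresponds to a somewhat different distance east-west than north-south.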



 Solr integration with GPDB
 Solr is an open source enterprise search engine

 Enable in-database text indexing and search
select
   t.id,
   q.score,
   t.message_text
from
   message t,
   gptext.search(
    'twitter.public.message',
    '(iphone and (hate or love))',
    'author_lang:en',
       100
   )q
where
   t.id=q.id
order by score desc;

 id        | score            | message_text
-----------+------------------+-------------------------------------------------------------
  71552856 | 5.43078422546387 | Hates BB's Love IPhones!
  91373993 | 4.06371879577637 | Its a love hate relationship with iPhone spellcheck
  25444233 | 4.05911064147949 | #iPhone autocorrect is a love/hate relationship...
 120166038 | 3.39410924911499 | Love the new iPhone 4s, hate @ATT service #Verizonhereicome
 117498183 | 3.39181470870972 | I got a love-hate relationship for my iPhone!!!
  86416378 | 3.39180779457092 | Absolutely love the new iPhone, but Siri seems to hate me..




GREENPLUM HADOOP




Greenplum “HD”
• Bundled open source

• HDFS, MapReduce, Hive, Pig, HBase, ZooKeeper, Mahout




Greenplum “MR”
• Bundled MapR, a commercial version of Hadoop
• API compatible with traditional Hadoop
• MapR improvements over Hadoop:
        – Improved control system
        – Major portions of HDFS re-implemented in C++
        – HDFS is NFS mountable
        – Improved shuffle and sort
        – Distributed NameNode
        – Supports large number of files
        – Mirroring, snapshot capability



Why do such a thing?
 Greenplum DB
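[Diagram: Greenplum DB capabilities placed along a spectrum from STRUCTURED through SEMISTRUCTURED to UNSTRUCTURED data. Labels: MADlib, SQL, RDBMS, Partitioning, Indexing, Tables and Schemas, PostGIS, GP Solr/Lucene, GPMapReduce, Text objects.]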




Why do such a thing?
Hadoop


[Diagram: Hadoop capabilities placed along a spectrum from STRUCTURED through SEMISTRUCTURED to UNSTRUCTURED data. Labels: Hive, Pig, XML, JSON, …, Schema on load, MapReduce, Flat files.]




Why do such a thing?
HBase


[Diagram: HBase capabilities placed along a spectrum from STRUCTURED through SEMISTRUCTURED to UNSTRUCTURED data. Labels: Hive, Pig, Row keys, Flexible schema, HBase Tables, MapReduce.]




Why do such a thing?
 Hybrid architecture with all three (or two…)
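[Diagram: the Greenplum DB, Hadoop, and HBase capabilities from the previous three slides overlaid on one spectrum from STRUCTURED through SEMISTRUCTURED to UNSTRUCTURED data. Labels: MADlib, SQL, RDBMS, Partitioning, Indexing, Tables and Schemas, Hive, Pig, Row keys, PostGIS, Flexible schema, HBase Tables, XML, JSON, …, GP Solr/Lucene, GPMapReduce, Text objects, Schema on load, MapReduce, Flat files.]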




Greenplum Unified Analytics Platform




Hadoop External Tables in GPDB
  External tables bring external data into the database.

  Native support for HDFS with parallelized loading.

  Can write to HDFS or read from HDFS.

 > CREATE EXTERNAL TABLE hdfs_document_feature (
   docid integer,
   term text,
   freq integer)
  LOCATION ('gphdfs://namenode:9000/user/don/docs/part-*')
  FORMAT 'text' (delimiter '|');

 > SELECT COUNT(*) FROM hdfs_document_feature h, gpdb_words g
   WHERE h.term = g.word;

 > INSERT INTO hdfs_export SELECT * FROM gpdb_source;
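
The external table definition above declares three columns and a '|' delimiter, so each line in the part-* files is expected to look like docid|term|freq. The Python sketch below shows that record layout; the sample lines and the DocumentFeature name are made up for illustration and are not taken from the deck.

```python
from typing import NamedTuple


class DocumentFeature(NamedTuple):
    """One row of the external table: docid integer, term text, freq integer."""
    docid: int
    term: str
    freq: int


def parse_line(line: str, delimiter: str = "|") -> DocumentFeature:
    """Parse one delimited text record into the three declared columns."""
    docid, term, freq = line.rstrip("\n").split(delimiter)
    return DocumentFeature(int(docid), term, int(freq))


# Made-up sample records standing in for the contents of a part-* file.
sample = ["17|hadoop|4\n", "17|greenplum|2\n", "42|hadoop|1\n"]
records = [parse_line(line) for line in sample]
print(records[0])
```

A line with a different number of fields raises a ValueError here, which is the same kind of formatting problem the database would reject when reading the file.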




Why do such a thing?
Many of the same use cases as an HBase/Hadoop environment

Use Hadoop as a data groomer

Do rollups in Hadoop and store results in GPDB

Use the best tool for the job (structured vs. unstructured)

Use GPDB to host data sets in a more real-time layer for ad-hoc analytics




EMC Isilon
    Hardware appliance for scale-out network-attached storage (NAS)
    Stripes data across all nodes
    Uses InfiniBand for intra-cluster communication
    Up to 15.5PB total storage
    3 different hardware configurations to handle different workloads
    Uses “OneFS”, Isilon’s operating system and file system
    Interfaces with iSCSI, NFS, CIFS, HTTP, HDFS, and a few more



Isilon HDFS interface
    Isilon is able to “pretend” to be an HDFS cluster: it mimics the NameNode and DataNode protocols to host data.
    The underlying system is OneFS, which does not follow the traditional HDFS scheme.
    Point HDFS clients (MapReduce, command line, etc.) to any IP in the Isilon cluster.




Pros & Cons
    Isilon is more dense
    Isilon can be mounted via a number of protocols
        – Easier ingest / egress
        – Raw data accessible by applications
    Isilon is easy to manage
    Free of certain HDFS limitations
    Isilon loses data locality (~250MB/sec throughput per node over network)

Why do such a thing?
    Hadoop backup or archive
     – More dense than HDFS, more accessible than tape, no need for compute
    Complete HDFS replacement
     – More dense, more accessible, utilize existing Isilon, slower per terabyte of storage
    Hot/warm storage
     – Use HDFS as primary, but Isilon as secondary
    Storage for original content
     – Use MapReduce to extract metadata from original content, and leave original content in place

HBase External Tables in GPDB
  Project in development

  Load data in parallel from HBase by specifying table name and column qualifiers


 > CREATE EXTERNAL TABLE hbase_document_feature (
   "HBASEROWKEY" text,
   "term" text,
   "freq" integer)
  LOCATION ('gphbase://docfeatures')
  FORMAT 'CUSTOM' (formatter='gpdbwriteable_import');

 > SELECT COUNT(*) FROM hbase_document_feature h, gpdb_words g
   WHERE h.term = g.word;
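
Conceptually, the table definition above projects each HBase row onto a fixed set of column qualifiers plus the row key. The Python sketch below illustrates only that mapping. It is not how the in-development connector works; the sample rows are made up, and treating a missing qualifier as None (like a SQL NULL) is an assumption.

```python
def flatten(row_key, cells, qualifiers):
    """Project one HBase-style row onto a fixed list of column qualifiers.

    Qualifiers missing from the row become None (an assumed NULL behavior).
    """
    return (row_key,) + tuple(cells.get(q) for q in qualifiers)


# Made-up rows: row key -> {column qualifier: value}.
hbase_rows = {
    "doc17:hadoop": {"term": "hadoop", "freq": 4, "lang": "en"},
    "doc42:pig": {"term": "pig"},  # sparse row with no "freq" cell
}

table = [
    flatten(key, cells, ["term", "freq"])
    for key, cells in sorted(hbase_rows.items())
]
for record in table:
    print(record)
```

The extra "lang" cell is simply ignored, which is how a flexible-schema store can sit behind a fixed relational column list.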




HBase External Tables in GPDB
Possible TODO list:

Specify range of rowkeys

Support writes into HBase

Specify filter criteria on the external table

select * from hbase_external where ROWKEY='abc'

Accumulo?




Why do such a thing?
Have HBase store semi-structured data

Exploit the strengths of each

Use HBase for really really wide tables

Use HBase as a scalable archive of raw records

Leverage existing HBase applications




Greenplum On HDFS

  Get Greenplum Database to run natively off of HDFS

  Underlying Greenplum Database data is stored in HDFS

  Unifies the two platforms further; no need for external tables

  Fully supports Greenplum’s append-only tables


  Early project in R&D

  Talk will be given by Chang Lei at Yahoo Summit




Greenplum On HDFS
[Diagram: a master host and five segment hosts, each holding primary and mirror segments, connected by an interconnect. Below them, an HDFS layer labeled "Tables in HDFS filespace" with a Namenode (Meta Ops) and Datanodes (Read/Write, replication) spread across Rack1 and Rack2.]




Why do such a thing?
Covers many of the same use cases as Hive

Run Hadoop MapReduce over data managed by Greenplum DB

Initial results show it is faster than Hive

You only have to store your data in one system




Scalable Parallel Computing on Clouds
 
Turbo charge-your-analytics-with-ibm-netezza-and-revolution-r-enterprise-pres...
Turbo charge-your-analytics-with-ibm-netezza-and-revolution-r-enterprise-pres...Turbo charge-your-analytics-with-ibm-netezza-and-revolution-r-enterprise-pres...
Turbo charge-your-analytics-with-ibm-netezza-and-revolution-r-enterprise-pres...
 
Greenplum feature
Greenplum featureGreenplum feature
Greenplum feature
 
Azure and cloud design patterns
Azure and cloud design patternsAzure and cloud design patterns
Azure and cloud design patterns
 
Php Site Optimization
Php Site OptimizationPhp Site Optimization
Php Site Optimization
 
Big Data Analytics with Storm, Spark and GraphLab
Big Data Analytics with Storm, Spark and GraphLabBig Data Analytics with Storm, Spark and GraphLab
Big Data Analytics with Storm, Spark and GraphLab
 
AWS Summit 2013 | Auckland - Big Data Analytics
AWS Summit 2013 | Auckland - Big Data AnalyticsAWS Summit 2013 | Auckland - Big Data Analytics
AWS Summit 2013 | Auckland - Big Data Analytics
 
Self adjusting slot configurations for homogeneous and heterogeneous hadoop c...
Self adjusting slot configurations for homogeneous and heterogeneous hadoop c...Self adjusting slot configurations for homogeneous and heterogeneous hadoop c...
Self adjusting slot configurations for homogeneous and heterogeneous hadoop c...
 
Sawmill - Integrating R and Large Data Clouds
Sawmill - Integrating R and Large Data CloudsSawmill - Integrating R and Large Data Clouds
Sawmill - Integrating R and Large Data Clouds
 
Predictive Analytics - Big Data Warehousing Meetup, Zementis
Predictive Analytics - Big Data Warehousing Meetup, ZementisPredictive Analytics - Big Data Warehousing Meetup, Zementis
Predictive Analytics - Big Data Warehousing Meetup, Zementis
 
Apache Drill: Building Highly Flexible, High Performance Query Engines by M.C...
Apache Drill: Building Highly Flexible, High Performance Query Engines by M.C...Apache Drill: Building Highly Flexible, High Performance Query Engines by M.C...
Apache Drill: Building Highly Flexible, High Performance Query Engines by M.C...
 
EMC Isilon Database Converged deck
EMC Isilon Database Converged deckEMC Isilon Database Converged deck
EMC Isilon Database Converged deck
 
An approach to implement model classes in zend
An approach to implement model classes in zendAn approach to implement model classes in zend
An approach to implement model classes in zend
 
London data science
London data scienceLondon data science
London data science
 
Information Retrieval, Applied Statistics and Mathematics onBigData - German ...
Information Retrieval, Applied Statistics and Mathematics onBigData - German ...Information Retrieval, Applied Statistics and Mathematics onBigData - German ...
Information Retrieval, Applied Statistics and Mathematics onBigData - German ...
 

Último

Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 

Último (20)

Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 

Hadoop & Greenplum: Why Do Such a Thing?

  • 1. Greenplum & Hadoop: Why do such a thing? Donald Miner, Solutions Architect, Advanced Technologies Group, Donald.Miner@emc.com. © Copyright 2012 EMC Corporation. All rights reserved.
  • 2. QUICK INTRODUCTION TO GREENPLUM DATABASE
  • 3. GREENPLUM DATABASE: Greenplum Database Basics
      – Massively Parallel Processing (MPP) database
      – Uses commodity hardware
      – Data is distributed by a user-defined “distribution key”
      – Master node delegates queries to segments
      – 1:1 segment and master mirroring for redundancy
      [Diagram: two masters above four segments]
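The “distribution key” idea on slide 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Greenplum's actual algorithm: the hash function (`zlib.crc32`), the helper names `segment_for` and `distribute`, and the sample `household_id` rows are all invented for the example.

```python
import zlib


def segment_for(key: str, num_segments: int) -> int:
    """Map a distribution-key value to a segment index.

    Assumption: crc32 stands in for whatever hash the database really uses.
    """
    return zlib.crc32(key.encode("utf-8")) % num_segments


def distribute(rows, key_column, num_segments):
    """Group rows by the segment that their distribution key hashes to."""
    segments = {i: [] for i in range(num_segments)}
    for row in rows:
        target = segment_for(str(row[key_column]), num_segments)
        segments[target].append(row)
    return segments


# Hypothetical table: 1000 households spread over 4 segments.
rows = [{"household_id": i, "income": 1000 * i} for i in range(1000)]
segments = distribute(rows, "household_id", 4)

for index, segment_rows in segments.items():
    print(f"segment {index}: {len(segment_rows)} rows")
```

A key with many distinct values tends to spread rows evenly; a key with only a handful of values would pile rows onto a few segments and leave the others idle.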
  • 4. GREENPLUM DATABASE: Greenplum Database Features
      – Full SQL support based on PostgreSQL 8.2
      – Columnar or row-oriented storage with compression
      – Multi-level table partitioning with query-time partition pruning
      – B-tree and bitmap indexes
      – JDBC, ODBC, OLEDB, etc. interfaces
      – High-speed, parallel bulk ingest
      – Parallel query optimizer
      – External tables
  • 5. GREENPLUM DATABASE: MADlib Analytics with Greenplum
      – Scalable and in-database
      – Mathematical, statistical, machine learning
      – Active open source project
      Example session:
        > SELECT householdID, variables
            FROM households
            ORDER BY RANDOM()
            LIMIT 100000;
        > SELECT run_univariate_analysis(
            'households_training',
            'variables');
            WHERE pvalue<.01 AND r2>.01;
        > SELECT run_regression(
            'univariate_results',
            'households_training');
        > SELECT householdID,
            madlib.array_dot(coef::REAL[], xmatrix::REAL[])
            FROM coefficients, households;
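The last query on slide 5 scores each household by taking the dot product of the fitted coefficients with that household's feature vector. A minimal Python sketch of that arithmetic follows; the local `array_dot` helper only mirrors the name used on the slide, and the coefficient and feature values are made up for illustration.

```python
def array_dot(a, b):
    """Dot product of two equal-length numeric sequences."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))


# Hypothetical regression coefficients (intercept first) and feature rows.
coef = [0.5, 2.0, -1.0]
households = {
    101: [1.0, 3.0, 2.0],
    102: [1.0, 0.0, 4.0],
}

# One predicted value per household, as in the SELECT on the slide.
scores = {hid: array_dot(coef, x) for hid, x in households.items()}
print(scores)
```

Running the scoring step inside the database means the feature rows never have to leave the segments that store them.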
  • 6. GREENPLUM DATABASE: MADlib In-Database Analytical Functions
      Descriptive Statistics
      – Quantile
      – Profile
      – CountMin (Cormode-Muthukrishnan) Sketch-based Estimator
      – FM (Flajolet-Martin) Sketch-based Estimator
      – MFV (Most Frequent Values) Sketch-based Estimator
      – Frequency
      – Histogram
      – Bar Chart
      – Box Plot Chart
      Modeling
      – Correlation Matrix
      – Association Rule Mining
      – K-Means Clustering
      – Naïve Bayes Classification
      – Linear Regression
      – Logistic Regression
      – Support Vector Machines
      – SVD Matrix Factorisation
      – Decision Trees/CART
      – Latent Dirichlet Allocation Topic Modeling
  • 7. GREENPLUM DATABASE: PostGIS Support in Greenplum DB
      – PostGIS adds support for geographic objects in PostgreSQL (http://postgis.refractions.net/)
      – Example: find all records within 25 miles of hurricane path
        select customer_id, ST_AsText(lat_lon), phone_num
        from clients
        where ST_DWithin(lat_lon,
          ST_GeometryFromText('LINESTRING(
            -79.3 17, -79.3 17.1, -79.3 17.3, -79.7 17.6, -79.6 17.4,
            -79.6 16.8, -79.9 15.8, -80.2 15.8, -80 15.7, -80 15.7,
            -80.2 15.9, -80.6 16.5, -81.1 16.7, -81.8 16.7, -82.1 16.8,
            -82.5 17.2, -83.9 17.9, -85.2 18.3, -85.5 18.4)', 4326),
          25.0/3959.0 * 180.0/PI())

         customer_id | st_astext                   | phone_num
        -------------+-----------------------------+------------
              493140 | POINT(-80.040397 26.570613) | 1231231234
              192401 | POINT(-81.820933 26.242611) | 2342342345
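The distance argument on slide 7, `25.0/3959.0 * 180.0/PI()`, converts 25 miles into degrees, because geometries in SRID 4326 are measured in degrees. The sketch below reproduces that arithmetic in Python. The constant and function names are assumptions for this example; 3959 is the Earth's mean radius in miles, as used on the slide.

```python
import math

EARTH_RADIUS_MILES = 3959.0  # mean Earth radius, the value used on the slide


def miles_to_degrees(miles: float) -> float:
    """Convert a surface distance to an angle in degrees.

    miles / radius gives the angle in radians; * 180 / pi gives degrees.
    """
    radians = miles / EARTH_RADIUS_MILES
    return radians * 180.0 / math.pi


tolerance = miles_to_degrees(25.0)
print(f"{tolerance:.4f} degrees")
```

This treats a degree as the same length in every direction, which is only approximately true for longitude away from the equator, so the 25-mile radius is a rough one.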
  • 8. GREENPLUM DATABASE: Solr integration with GPDB
      – Solr is an open source enterprise search engine
      – Enable in-database text indexing and search
        select t.id, q.score, t.message_text
        from message t,
          gptext.search(
            'twitter.public.message',
            '(iphone and (hate or love))',
            'author_lang:en', 100
          ) q
        where t.id=q.id
        order by score desc;

               id  |      score       | message_text
        -----------+------------------+-------------------------------------------
          71552856 | 5.43078422546387 | Hates BB's Love IPhones!
          91373993 | 4.06371879577637 | Its a love hate relationship with iPhone spellcheck
          25444233 | 4.05911064147949 | #iPhone autocorrect is a love/hate relationship...
         120166038 | 3.39410924911499 | Love the new iPhone 4s, hate @ATT service #Verizonhereicome
         117498183 | 3.39181470870972 | I got a love-hate relationship for my iPhone!!!
          86416378 | 3.39180779457092 | Absolutely love the new iPhone, but Siri seems to hate me..
  • 9. GREENPLUM HADOOP
  • 10. GREENPLUM HADOOP: Greenplum “HD”
      – Bundled open source
      – HDFS, MapReduce, Hive, Pig, HBase, ZooKeeper, Mahout
  • 11. GREENPLUM HADOOP: Greenplum “MR”
      – Bundled MapR, a commercial version of Hadoop
      – API compatible with traditional Hadoop
      – MapR improvements over Hadoop:
          – Improved control system
          – Major portions of HDFS re-implemented in C++
          – HDFS is NFS mountable
          – Improved shuffle and sort
          – Distributed NameNode
          – Supports large number of files
          – Mirroring, snapshot capability
  • 12. Why do such a thing? Greenplum DB
      [Diagram: Greenplum DB placed on a STRUCTURED / SEMISTRUCTURED / UNSTRUCTURED spectrum. Labels: MADlib, Partitioning, GP Solr/Lucene, SQL, Indexing, Text objects, RDBMS, PostGIS, GPMapReduce, Tables and Schemas]
  • 13. Why do such a thing? Hadoop
      [Diagram: Hadoop placed on a STRUCTURED / SEMISTRUCTURED / UNSTRUCTURED spectrum. Labels: Schema on load, MapReduce, Hive, Pig, XML, JSON, …, Flat files]
  • 14. Why do such a thing? HBase
      [Diagram: HBase placed on a STRUCTURED / SEMISTRUCTURED / UNSTRUCTURED spectrum. Labels: Row keys, Flexible schema, HBase Tables, Hive, MapReduce, Pig]
  • 15. Why do such a thing? Hybrid architecture with all three (or two…)
      [Diagram: Greenplum DB, Hadoop, and HBase combined on a STRUCTURED / SEMISTRUCTURED / UNSTRUCTURED spectrum. Labels: MADlib, Partitioning, GP Solr/Lucene, SQL, Indexing, Text objects, RDBMS, PostGIS, GPMapReduce, Tables and Schemas, Row keys, Flexible schema, HBase Tables, Schema on load, MapReduce, Hive, Pig, XML, JSON, …, Flat files]
  • 16. Greenplum Unified Analytics Platform
  • 17. Hadoop External Tables in GPDB
      – External tables bring external data into the database.
      – Native support for HDFS with parallelized loading.
      – Can write to HDFS or read from HDFS.
        > CREATE EXTERNAL TABLE hdfs_document_feature (
              docid integer, term text, freq integer)
            LOCATION ('gphdfs://namenode:9000/user/don/docs/part-*')
            FORMAT 'text' (delimiter '|');
        > SELECT COUNT(*)
            FROM hdfs_document_feature h, gpdb_words g
            WHERE h.term = g.word;
        -- hdfs_export must be a WRITABLE EXTERNAL TABLE
        > INSERT INTO hdfs_export SELECT * FROM gpdb_source;
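To make the external-table example on slide 17 concrete, the sketch below imitates in plain Python what the database does with a pipe-delimited `part-*` file: parse each line into `(docid, term, freq)` and count the rows whose term matches a word in a local table. The sample file contents, the `gpdb_words` set, and the helper name are hypothetical, and nothing here touches HDFS or Greenplum.

```python
import csv
import io

# Hypothetical contents of one part-* file, in the '|'-delimited text format.
hdfs_part_file = "1|hadoop|3\n1|greenplum|5\n2|hadoop|1\n2|isilon|2\n"

# Hypothetical stand-in for the gpdb_words table (assumes each word is unique).
gpdb_words = {"hadoop", "greenplum"}


def read_document_features(text):
    """Parse pipe-delimited lines into (docid, term, freq) tuples."""
    reader = csv.reader(io.StringIO(text), delimiter="|")
    return [(int(docid), term, int(freq)) for docid, term, freq in reader]


rows = read_document_features(hdfs_part_file)

# Equivalent of: SELECT COUNT(*) ... WHERE h.term = g.word
match_count = sum(1 for _, term, _ in rows if term in gpdb_words)
print(match_count)
```

In the database the parsing is spread across segments, each reading part of the data, which is where the "parallelized loading" on the slide comes from.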
  • 18. Why do such a thing?
      – Many of the same use cases as an HBase/Hadoop environment
      – Use Hadoop as a data groomer
      – Do rollups in Hadoop and store results in GPDB
      – Use the best tool for the job (structured vs. unstructured)
      – Use GPDB to host data sets in a more real-time layer for ad-hoc analytics
  • 19. EMC Isilon
      – Hardware appliance for scale-out network-attached storage (NAS)
      – Stripes data across all nodes
      – Uses InfiniBand for intra-cluster communication
      – Up to 15.5PB total storage
      – 3 different hardware configurations to handle different workloads
      – Uses “OneFS”, Isilon’s operating system and file system
      – Interfaces with iSCSI, NFS, CIFS, HTTP, HDFS, and a few more
  • 20. Isilon HDFS interface
      – Isilon is able to “pretend” to be an HDFS cluster: it mimics the NameNode and DataNode protocols to host data.
      – The underlying system is OneFS and does not follow the traditional HDFS scheme.
      – Point HDFS clients (MapReduce, command line, etc.) to any IP in the Isilon cluster.
  • 21. Pros & Cons
      – Isilon is more dense
      – Isilon can be mounted via a number of protocols
          – Easier ingest / egress
          – Raw data accessible by applications
      – Isilon is easy to manage
      – Free of certain HDFS limitations
      – Isilon loses data locality (~250MB/sec throughput per node over network)
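The ~250MB/sec per-node figure on slide 21 lends itself to a quick back-of-the-envelope estimate of how long a full scan over the network would take. The sketch below is arithmetic only. It assumes throughput scales linearly with node count and that nothing else is the bottleneck, neither of which the slides claim, and the function name and the 10 TB / 10 node scenario are invented for illustration.

```python
MB_PER_TB = 1_000_000  # decimal units: 1 TB = 1,000,000 MB


def scan_seconds(data_tb: float, nodes: int, mb_per_sec_per_node: float = 250.0) -> float:
    """Estimate seconds to read data_tb terabytes at a fixed per-node rate."""
    aggregate_mb_per_sec = nodes * mb_per_sec_per_node
    return data_tb * MB_PER_TB / aggregate_mb_per_sec


# Hypothetical scenario: 10 TB read through 10 nodes.
seconds = scan_seconds(10, 10)
print(f"{seconds:.0f} s (~{seconds / 3600:.1f} h)")
```

An estimate like this helps decide whether a workload can tolerate the loss of data locality or should keep hot data on local HDFS disks.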
  • 22. Why do such a thing?
      – Hadoop backup or archive
          – More dense than HDFS, more accessible than tape, no need for compute
      – Complete HDFS replacement
          – More dense, more accessible, utilize existing Isilon, slower per terabyte of storage
      – Hot/warm storage
          – Use HDFS as primary, but Isilon as secondary
      – Storage for original content
          – Use MapReduce to extract metadata from original content, and leave original content in place
  • 23. HBase External Tables in GPDB
      – Project in development
      – Load data in parallel from HBase by specifying table name and column qualifiers
        > CREATE EXTERNAL TABLE hbase_document_feature (
              "HBASEROWKEY" text, "term" text, "freq" integer)
            LOCATION ('gphbase://docfeatures')
            FORMAT 'CUSTOM' (formatter='gpdbwriteable_import');
        > SELECT COUNT(*)
            FROM hbase_document_feature h, gpdb_words g
            WHERE h.term = g.word;
  • 24. HBase External Tables in GPDB
      Possible TODO list:
      – Specify range of rowkeys
      – Support writes into HBase
      – Specify filter criteria on the external table
          select * from hbase_external where ROWKEY='abc'
      – Accumulo?
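The "range of rowkeys" item on slide 24 relies on the fact that HBase keeps rows sorted by row key, so a scan can be limited to a start key (inclusive) and a stop key (exclusive). The Python sketch below models that behaviour over an in-memory sorted list. It does not use any HBase client API, and the table contents and the `scan_range` helper are hypothetical.

```python
from bisect import bisect_left


def scan_range(sorted_rows, start_key, stop_key):
    """Return rows with start_key <= rowkey < stop_key.

    sorted_rows must be a list of (rowkey, value) pairs sorted by rowkey.
    """
    keys = [key for key, _ in sorted_rows]
    lo = bisect_left(keys, start_key)  # first key >= start_key
    hi = bisect_left(keys, stop_key)   # first key >= stop_key (excluded)
    return sorted_rows[lo:hi]


# Hypothetical table contents, sorted by row key as HBase would store them.
table = sorted([
    ("abc", {"term": "hadoop", "freq": 3}),
    ("abd", {"term": "greenplum", "freq": 5}),
    ("abz", {"term": "isilon", "freq": 2}),
    ("b01", {"term": "hbase", "freq": 7}),
])

selected = scan_range(table, "abc", "abe")
print([key for key, _ in selected])
```

Pushing such a range down to HBase would let the external table read only the matching slice instead of scanning the whole table and filtering in the database.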
  • 25. Why do such a thing?
      – Have HBase store semi-structured data
      – Exploit the strengths of each
      – Use HBase for really, really wide tables
      – Use HBase as a scalable archive of raw records
      – Leverage existing HBase applications
  • 26. Greenplum On HDFS
      – Get Greenplum Database to run natively off of HDFS
      – Underlying Greenplum Database data is stored in HDFS
      – Unifies the two platforms further: no need for external tables
      – Fully supports Greenplum’s append-only tables
      – Early project in R&D
      – Talk will be given by Chang Lei at Yahoo Summit
  • 27. Greenplum On HDFS
      [Architecture diagram: a master host connected through the interconnect to five segment hosts, each running primary and mirror segments. Meta ops and table reads/writes go to an HDFS filespace made up of a NameNode and DataNodes, with block replication across Rack1 and Rack2]
  • 28. Why do such a thing?
      – Covers many of the same use cases as Hive
      – Run Hadoop MapReduce over data managed by Greenplum DB
      – Initial results show it is faster than Hive
      – You only have to store your data in one system

Editor's notes

  1. Greenplum HD Hadoop Software