SlideShare una empresa de Scribd logo
1 de 14
Descargar para leer sin conexión
Building Blocks for Large
Analytic Systems
Andrew Lamb (alamb@vertica.com)
Vertica Systems, an HP Company
Oct 19, 2011



          VERTICA CONFIDENTIAL DO NOT DISTRIBUTE
Outline
          • Emergence of new
            hardware, software and
            data volume is driving the
            wave of new analytic
            systems in spite of mature
            existing systems.
          • Common architectural
            choices when building these
            new systems
          • Choices we made when
            building the Vertica Analytic
            Database
          • Broadly applicable (and
            widely used)                2
Architectural Changes

 • Driven by changing
    – Hardware: (really) cheap
      x86_64 servers, high
      capacity disk, high speed




                                     Better-ness
      networking
    – Software: Linux, Open
      Source
    – Requirements: Data Deluge
 • Hard for Legacy Software
    – Compatibility is brutal
    – Linux x86_64 today
                                                   Time
    – Solaris, HPUX, IRIX, etc. 10
      years ago
• Storage Organization and Processing Principles
                                                          3
[Storage + Processing] MPP (no shared disk)
                                                      SMP Server

                                                 CP     CP   CP    CP
                                                 U      U    U     U
• Any modern system should run on a              CP     CP   CP    CP
                                                 U      U    U     U
  cluster of nodes, scale up/down
                                                  System Bus
• Mid range servers are really cheap
                                               Main Memory/Cache
  (to rent or own)
• Aggregate available resources are
  enormous and scalable:                               Shared
                                                        Disk
      – I/O Bandwidth (Disk and Network)
      – Cores, Memory, etc.
                                     Network


MPP Servers




  Local Disk
                                                                        4
[Storage] Keep Your Data Sorted


• For large data sets, extra indexes
  are expensive to maintain
• Much better to index the data itself
  by sorting it
• Easy to find what you are looking
  for, reduces seeks




                                         5
[Storage] Distribute Data by Value (not chunks)


• “Sharding” -- Distribute data so you can easily find it
  again (not round robin)
• Segmenting in analytics layer simplifies app layer
• Round Robin computations won't scale: need to swizzle
  data around the cluster to do most useful thing
• Replication for high availability: not logs

 B     A   C         B   A   C           B   A   C
 2     2   2         1   1   1           3   3   3

 A     B   C         A   B   C           A   B   C
 3     3   3         2   2   2           1   1   1



                                                       6
[Storage] Store the Data in Columns


• Rarely do all fields of a
  data set appear in
  analytic queries
• Really don't want to
  waste I/O for data you
  don't need
• Columns let you be
  clever about applying
  predicates individually,
  further reducing I/O
• Not appropriate for row
  lookup or transaction
  processing systems                    7
[Storage] Write it Once, Don't Modify

• Physical use of disk
  objects should be write
  once (append only)

• System should present the
  illusion of mutability

• Immutable storage
  drastically simplifies
  coherence in a distributed
  system

                                               8
[Processing] Use Large Sequential IOs


       • Spinning disks are very good at large sequential IOs
       • You really don't want to whipsaw the read head
         (another reason why secondary indexes are bad)
       • Especially useful with sorted data
       40             Random vs Sequential Reads
       35

       30

       25
MB/s




       20

       15

       10
                                                   Random
                                                   Sequential
       5

       0
                                                                9
[Processing] Trade CPU for I/O Bandwidth


• Use very aggressive compression, even if it seems/is
  wasteful (keep getting more cores)
• Example: data type specific encoding, then LZO before
  actually writing to disk
• Hide additional
  latency with
  execution
  pipeline
  parallelism



                                                     10
[Processing] Mess with the Data where it Lives

• Bring your processing to the data, not data to the
  processing
• System should push computation down close to data
  (even if calculation turns out to be redundant)
• Example: multi-phase
  aggregation, each phase
  tries to aggregate
  intermediates before passing
  up the memory / node
  hierarchy


                                                       11
[Processing] Declarative & Extendable

• Give users a declarative query language for most tasks
• Writing procedural code for simple queries is wasteful
• Provide procedural extensions for complex analysis




                                                       12
Conclusion

• Emergence of new hardware, software and data
  volumes implies certain architectural choices in
  modern big data analytic systems.
• Commonly observed in new systems
• Vertica (unsurprisingly) features all of them




                                                     13
Introducing Community Edition

• Free version of Vertica
   – help steward better analytics and the democratization of data-
     closed source, but open access!
• All of the features and advanced analytics of Vertica
  Enterprise Edition
• Seamless upgrade to Enterprise Edition
• Limited to 1 TB raw data on 3-node hardware cluster
• Revamping Community area of Vertica’s website for
  knowledge sharing, third party tools, and code sharing
• Launching academic and non-profit research use
  program as well
• Sign up at: www.vertica.com/community
                                                                  14

Más contenido relacionado

La actualidad más candente

S ss0886 pendulum-swings-edge2015-v3
S ss0886 pendulum-swings-edge2015-v3S ss0886 pendulum-swings-edge2015-v3
S ss0886 pendulum-swings-edge2015-v3Tony Pearson
 
Ibm spectrum scale_backup_n_archive_v03_ash
Ibm spectrum scale_backup_n_archive_v03_ashIbm spectrum scale_backup_n_archive_v03_ash
Ibm spectrum scale_backup_n_archive_v03_ashAshutosh Mate
 
S ss0885 spectrum-scale-elastic-edge2015-v5
S ss0885 spectrum-scale-elastic-edge2015-v5S ss0885 spectrum-scale-elastic-edge2015-v5
S ss0885 spectrum-scale-elastic-edge2015-v5Tony Pearson
 
Vancouver bug enterprise storage and zfs
Vancouver bug   enterprise storage and zfsVancouver bug   enterprise storage and zfs
Vancouver bug enterprise storage and zfsRami Jebara
 
IBM Platform Computing Elastic Storage
IBM Platform Computing  Elastic StorageIBM Platform Computing  Elastic Storage
IBM Platform Computing Elastic StoragePatrick Bouillaud
 
Ibm spectrum scale fundamentals workshop for americas part 5 ess gnr-usecases...
Ibm spectrum scale fundamentals workshop for americas part 5 ess gnr-usecases...Ibm spectrum scale fundamentals workshop for americas part 5 ess gnr-usecases...
Ibm spectrum scale fundamentals workshop for americas part 5 ess gnr-usecases...xKinAnx
 
Introducing IBM Spectrum Scale 4.2 and Elastic Storage Server 3.5
Introducing IBM Spectrum Scale 4.2 and Elastic Storage Server 3.5Introducing IBM Spectrum Scale 4.2 and Elastic Storage Server 3.5
Introducing IBM Spectrum Scale 4.2 and Elastic Storage Server 3.5Doug O'Flaherty
 
Ibm spectrum scale fundamentals workshop for americas part 3 Information Life...
Ibm spectrum scale fundamentals workshop for americas part 3 Information Life...Ibm spectrum scale fundamentals workshop for americas part 3 Information Life...
Ibm spectrum scale fundamentals workshop for americas part 3 Information Life...xKinAnx
 
Technical Report NetApp Clustered Data ONTAP 8.2: An Introduction
Technical Report NetApp Clustered Data ONTAP 8.2: An IntroductionTechnical Report NetApp Clustered Data ONTAP 8.2: An Introduction
Technical Report NetApp Clustered Data ONTAP 8.2: An IntroductionNetApp
 
IBM #Softlayer infographic 2016
IBM #Softlayer infographic 2016IBM #Softlayer infographic 2016
IBM #Softlayer infographic 2016Patrick Bouillaud
 
DB2 10 & 11 for z/OS System Performance Monitoring and Optimisation
DB2 10 & 11 for z/OS System Performance Monitoring and OptimisationDB2 10 & 11 for z/OS System Performance Monitoring and Optimisation
DB2 10 & 11 for z/OS System Performance Monitoring and OptimisationJohn Campbell
 
IBM Spectrum Scale for File and Object Storage
IBM Spectrum Scale for File and Object StorageIBM Spectrum Scale for File and Object Storage
IBM Spectrum Scale for File and Object StorageTony Pearson
 
S cv3179 spectrum-integration-openstack-edge2015-v5
S cv3179 spectrum-integration-openstack-edge2015-v5S cv3179 spectrum-integration-openstack-edge2015-v5
S cv3179 spectrum-integration-openstack-edge2015-v5Tony Pearson
 
Spectrum Scale Best Practices by Olaf Weiser
Spectrum Scale Best Practices by Olaf WeiserSpectrum Scale Best Practices by Olaf Weiser
Spectrum Scale Best Practices by Olaf WeiserSandeep Patil
 
Presentation db2 best practices for optimal performance
Presentation   db2 best practices for optimal performancePresentation   db2 best practices for optimal performance
Presentation db2 best practices for optimal performancesolarisyougood
 
Engage for success ibm spectrum accelerate 2
Engage for success   ibm spectrum accelerate 2Engage for success   ibm spectrum accelerate 2
Engage for success ibm spectrum accelerate 2xKinAnx
 
Ibm spectrum scale fundamentals workshop for americas part 7 spectrumscale el...
Ibm spectrum scale fundamentals workshop for americas part 7 spectrumscale el...Ibm spectrum scale fundamentals workshop for americas part 7 spectrumscale el...
Ibm spectrum scale fundamentals workshop for americas part 7 spectrumscale el...xKinAnx
 
NetApp C-mode for 7 mode engineers
NetApp C-mode for 7 mode engineersNetApp C-mode for 7 mode engineers
NetApp C-mode for 7 mode engineerssubtitle
 
Spectrum Scale Unified File and Object with WAN Caching
Spectrum Scale Unified File and Object with WAN CachingSpectrum Scale Unified File and Object with WAN Caching
Spectrum Scale Unified File and Object with WAN CachingSandeep Patil
 
S016827 pendulum-swings-nola-v1710d
S016827 pendulum-swings-nola-v1710dS016827 pendulum-swings-nola-v1710d
S016827 pendulum-swings-nola-v1710dTony Pearson
 

La actualidad más candente (20)

S ss0886 pendulum-swings-edge2015-v3
S ss0886 pendulum-swings-edge2015-v3S ss0886 pendulum-swings-edge2015-v3
S ss0886 pendulum-swings-edge2015-v3
 
Ibm spectrum scale_backup_n_archive_v03_ash
Ibm spectrum scale_backup_n_archive_v03_ashIbm spectrum scale_backup_n_archive_v03_ash
Ibm spectrum scale_backup_n_archive_v03_ash
 
S ss0885 spectrum-scale-elastic-edge2015-v5
S ss0885 spectrum-scale-elastic-edge2015-v5S ss0885 spectrum-scale-elastic-edge2015-v5
S ss0885 spectrum-scale-elastic-edge2015-v5
 
Vancouver bug enterprise storage and zfs
Vancouver bug   enterprise storage and zfsVancouver bug   enterprise storage and zfs
Vancouver bug enterprise storage and zfs
 
IBM Platform Computing Elastic Storage
IBM Platform Computing  Elastic StorageIBM Platform Computing  Elastic Storage
IBM Platform Computing Elastic Storage
 
Ibm spectrum scale fundamentals workshop for americas part 5 ess gnr-usecases...
Ibm spectrum scale fundamentals workshop for americas part 5 ess gnr-usecases...Ibm spectrum scale fundamentals workshop for americas part 5 ess gnr-usecases...
Ibm spectrum scale fundamentals workshop for americas part 5 ess gnr-usecases...
 
Introducing IBM Spectrum Scale 4.2 and Elastic Storage Server 3.5
Introducing IBM Spectrum Scale 4.2 and Elastic Storage Server 3.5Introducing IBM Spectrum Scale 4.2 and Elastic Storage Server 3.5
Introducing IBM Spectrum Scale 4.2 and Elastic Storage Server 3.5
 
Ibm spectrum scale fundamentals workshop for americas part 3 Information Life...
Ibm spectrum scale fundamentals workshop for americas part 3 Information Life...Ibm spectrum scale fundamentals workshop for americas part 3 Information Life...
Ibm spectrum scale fundamentals workshop for americas part 3 Information Life...
 
Technical Report NetApp Clustered Data ONTAP 8.2: An Introduction
Technical Report NetApp Clustered Data ONTAP 8.2: An IntroductionTechnical Report NetApp Clustered Data ONTAP 8.2: An Introduction
Technical Report NetApp Clustered Data ONTAP 8.2: An Introduction
 
IBM #Softlayer infographic 2016
IBM #Softlayer infographic 2016IBM #Softlayer infographic 2016
IBM #Softlayer infographic 2016
 
DB2 10 & 11 for z/OS System Performance Monitoring and Optimisation
DB2 10 & 11 for z/OS System Performance Monitoring and OptimisationDB2 10 & 11 for z/OS System Performance Monitoring and Optimisation
DB2 10 & 11 for z/OS System Performance Monitoring and Optimisation
 
IBM Spectrum Scale for File and Object Storage
IBM Spectrum Scale for File and Object StorageIBM Spectrum Scale for File and Object Storage
IBM Spectrum Scale for File and Object Storage
 
S cv3179 spectrum-integration-openstack-edge2015-v5
S cv3179 spectrum-integration-openstack-edge2015-v5S cv3179 spectrum-integration-openstack-edge2015-v5
S cv3179 spectrum-integration-openstack-edge2015-v5
 
Spectrum Scale Best Practices by Olaf Weiser
Spectrum Scale Best Practices by Olaf WeiserSpectrum Scale Best Practices by Olaf Weiser
Spectrum Scale Best Practices by Olaf Weiser
 
Presentation db2 best practices for optimal performance
Presentation   db2 best practices for optimal performancePresentation   db2 best practices for optimal performance
Presentation db2 best practices for optimal performance
 
Engage for success ibm spectrum accelerate 2
Engage for success   ibm spectrum accelerate 2Engage for success   ibm spectrum accelerate 2
Engage for success ibm spectrum accelerate 2
 
Ibm spectrum scale fundamentals workshop for americas part 7 spectrumscale el...
Ibm spectrum scale fundamentals workshop for americas part 7 spectrumscale el...Ibm spectrum scale fundamentals workshop for americas part 7 spectrumscale el...
Ibm spectrum scale fundamentals workshop for americas part 7 spectrumscale el...
 
NetApp C-mode for 7 mode engineers
NetApp C-mode for 7 mode engineersNetApp C-mode for 7 mode engineers
NetApp C-mode for 7 mode engineers
 
Spectrum Scale Unified File and Object with WAN Caching
Spectrum Scale Unified File and Object with WAN CachingSpectrum Scale Unified File and Object with WAN Caching
Spectrum Scale Unified File and Object with WAN Caching
 
S016827 pendulum-swings-nola-v1710d
S016827 pendulum-swings-nola-v1710dS016827 pendulum-swings-nola-v1710d
S016827 pendulum-swings-nola-v1710d
 

Destacado

1.4亿在线背后的故事(1)
1.4亿在线背后的故事(1)1.4亿在线背后的故事(1)
1.4亿在线背后的故事(1)liqiang xu
 
浅谈灰度发布在贴吧的应用 支付宝 20130909
浅谈灰度发布在贴吧的应用 支付宝 20130909浅谈灰度发布在贴吧的应用 支付宝 20130909
浅谈灰度发布在贴吧的应用 支付宝 20130909liqiang xu
 
the NML project
the NML projectthe NML project
the NML projectLei Yang
 
Nginx internals
Nginx internalsNginx internals
Nginx internalsliqiang xu
 
Nginx维护手册
Nginx维护手册Nginx维护手册
Nginx维护手册Lei Yang
 
高性能Web服务器Nginx及相关新技术的应用实践
高性能Web服务器Nginx及相关新技术的应用实践高性能Web服务器Nginx及相关新技术的应用实践
高性能Web服务器Nginx及相关新技术的应用实践rewinx
 

Destacado (7)

1.4亿在线背后的故事(1)
1.4亿在线背后的故事(1)1.4亿在线背后的故事(1)
1.4亿在线背后的故事(1)
 
Hdfs comics
Hdfs comicsHdfs comics
Hdfs comics
 
浅谈灰度发布在贴吧的应用 支付宝 20130909
浅谈灰度发布在贴吧的应用 支付宝 20130909浅谈灰度发布在贴吧的应用 支付宝 20130909
浅谈灰度发布在贴吧的应用 支付宝 20130909
 
the NML project
the NML projectthe NML project
the NML project
 
Nginx internals
Nginx internalsNginx internals
Nginx internals
 
Nginx维护手册
Nginx维护手册Nginx维护手册
Nginx维护手册
 
高性能Web服务器Nginx及相关新技术的应用实践
高性能Web服务器Nginx及相关新技术的应用实践高性能Web服务器Nginx及相关新技术的应用实践
高性能Web服务器Nginx及相关新技术的应用实践
 

Similar a Xldb2011 wed 1415_andrew_lamb-buildingblocks

Ceph Day London 2014 - Best Practices for Ceph-powered Implementations of Sto...
Ceph Day London 2014 - Best Practices for Ceph-powered Implementations of Sto...Ceph Day London 2014 - Best Practices for Ceph-powered Implementations of Sto...
Ceph Day London 2014 - Best Practices for Ceph-powered Implementations of Sto...Ceph Community
 
How to Build a Compute Cluster
How to Build a Compute ClusterHow to Build a Compute Cluster
How to Build a Compute ClusterRamsay Key
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoopMohit Tare
 
Jay Kreps on Project Voldemort Scaling Simple Storage At LinkedIn
Jay Kreps on Project Voldemort Scaling Simple Storage At LinkedInJay Kreps on Project Voldemort Scaling Simple Storage At LinkedIn
Jay Kreps on Project Voldemort Scaling Simple Storage At LinkedInLinkedIn
 
7. Key-Value Databases: In Depth
7. Key-Value Databases: In Depth7. Key-Value Databases: In Depth
7. Key-Value Databases: In DepthFabio Fumarola
 
Private cloud virtual reality to reality a partner story daniel mar_technicom
Private cloud virtual reality to reality a partner story daniel mar_technicomPrivate cloud virtual reality to reality a partner story daniel mar_technicom
Private cloud virtual reality to reality a partner story daniel mar_technicomMicrosoft Singapore
 
Ceph Day Amsterdam 2015: Measuring and predicting performance of Ceph clusters
Ceph Day Amsterdam 2015: Measuring and predicting performance of Ceph clusters Ceph Day Amsterdam 2015: Measuring and predicting performance of Ceph clusters
Ceph Day Amsterdam 2015: Measuring and predicting performance of Ceph clusters Ceph Community
 
Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)
Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)
Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)Lars Marowsky-Brée
 
Building a High Performance Analytics Platform
Building a High Performance Analytics PlatformBuilding a High Performance Analytics Platform
Building a High Performance Analytics PlatformSantanu Dey
 
Revolutionary Storage for Modern Databases, Applications and Infrastrcture
Revolutionary Storage for Modern Databases, Applications and InfrastrctureRevolutionary Storage for Modern Databases, Applications and Infrastrcture
Revolutionary Storage for Modern Databases, Applications and Infrastrcturesabnees
 
Lessons from lhc
Lessons from lhcLessons from lhc
Lessons from lhcdrsm79
 
Severalnines Training: MySQL® Cluster - Part IX
Severalnines Training: MySQL® Cluster - Part IXSeveralnines Training: MySQL® Cluster - Part IX
Severalnines Training: MySQL® Cluster - Part IXSeveralnines
 
NoSQL overview implementation free
NoSQL overview implementation freeNoSQL overview implementation free
NoSQL overview implementation freeBenoit Perroud
 
Casual mass parallel data processing in Java
Casual mass parallel data processing in JavaCasual mass parallel data processing in Java
Casual mass parallel data processing in JavaAltoros
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarKognitio
 
Database & Technology 2 _ Marting Lambert _ Mixed Workloads Why and How.pdf
Database & Technology 2 _ Marting Lambert _ Mixed Workloads Why and How.pdfDatabase & Technology 2 _ Marting Lambert _ Mixed Workloads Why and How.pdf
Database & Technology 2 _ Marting Lambert _ Mixed Workloads Why and How.pdfInSync2011
 
End of RAID as we know it with Ceph Replication
End of RAID as we know it with Ceph ReplicationEnd of RAID as we know it with Ceph Replication
End of RAID as we know it with Ceph ReplicationCeph Community
 

Similar a Xldb2011 wed 1415_andrew_lamb-buildingblocks (20)

Ceph Day London 2014 - Best Practices for Ceph-powered Implementations of Sto...
Ceph Day London 2014 - Best Practices for Ceph-powered Implementations of Sto...Ceph Day London 2014 - Best Practices for Ceph-powered Implementations of Sto...
Ceph Day London 2014 - Best Practices for Ceph-powered Implementations of Sto...
 
How to Build a Compute Cluster
How to Build a Compute ClusterHow to Build a Compute Cluster
How to Build a Compute Cluster
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
 
Jay Kreps on Project Voldemort Scaling Simple Storage At LinkedIn
Jay Kreps on Project Voldemort Scaling Simple Storage At LinkedInJay Kreps on Project Voldemort Scaling Simple Storage At LinkedIn
Jay Kreps on Project Voldemort Scaling Simple Storage At LinkedIn
 
7. Key-Value Databases: In Depth
7. Key-Value Databases: In Depth7. Key-Value Databases: In Depth
7. Key-Value Databases: In Depth
 
Private cloud virtual reality to reality a partner story daniel mar_technicom
Private cloud virtual reality to reality a partner story daniel mar_technicomPrivate cloud virtual reality to reality a partner story daniel mar_technicom
Private cloud virtual reality to reality a partner story daniel mar_technicom
 
Ceph Day Amsterdam 2015: Measuring and predicting performance of Ceph clusters
Ceph Day Amsterdam 2015: Measuring and predicting performance of Ceph clusters Ceph Day Amsterdam 2015: Measuring and predicting performance of Ceph clusters
Ceph Day Amsterdam 2015: Measuring and predicting performance of Ceph clusters
 
Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)
Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)
Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)
 
Bigtable and Dynamo
Bigtable and DynamoBigtable and Dynamo
Bigtable and Dynamo
 
Building a High Performance Analytics Platform
Building a High Performance Analytics PlatformBuilding a High Performance Analytics Platform
Building a High Performance Analytics Platform
 
25 snowflake
25 snowflake25 snowflake
25 snowflake
 
Revolutionary Storage for Modern Databases, Applications and Infrastrcture
Revolutionary Storage for Modern Databases, Applications and InfrastrctureRevolutionary Storage for Modern Databases, Applications and Infrastrcture
Revolutionary Storage for Modern Databases, Applications and Infrastrcture
 
Lessons from lhc
Lessons from lhcLessons from lhc
Lessons from lhc
 
Severalnines Training: MySQL® Cluster - Part IX
Severalnines Training: MySQL® Cluster - Part IXSeveralnines Training: MySQL® Cluster - Part IX
Severalnines Training: MySQL® Cluster - Part IX
 
NoSQL overview implementation free
NoSQL overview implementation freeNoSQL overview implementation free
NoSQL overview implementation free
 
Casual mass parallel data processing in Java
Casual mass parallel data processing in JavaCasual mass parallel data processing in Java
Casual mass parallel data processing in Java
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinar
 
Database & Technology 2 _ Marting Lambert _ Mixed Workloads Why and How.pdf
Database & Technology 2 _ Marting Lambert _ Mixed Workloads Why and How.pdfDatabase & Technology 2 _ Marting Lambert _ Mixed Workloads Why and How.pdf
Database & Technology 2 _ Marting Lambert _ Mixed Workloads Why and How.pdf
 
End of RAID as we know it with Ceph Replication
End of RAID as we know it with Ceph ReplicationEnd of RAID as we know it with Ceph Replication
End of RAID as we know it with Ceph Replication
 
Wolfgang Lehner Technische Universitat Dresden
Wolfgang Lehner Technische Universitat DresdenWolfgang Lehner Technische Universitat Dresden
Wolfgang Lehner Technische Universitat Dresden
 

Más de liqiang xu

Csrf攻击原理及防御措施
Csrf攻击原理及防御措施Csrf攻击原理及防御措施
Csrf攻击原理及防御措施liqiang xu
 
Xldb2011 tue 1055_tom_fastner
Xldb2011 tue 1055_tom_fastnerXldb2011 tue 1055_tom_fastner
Xldb2011 tue 1055_tom_fastnerliqiang xu
 
Xldb2011 tue 1005_linked_in
Xldb2011 tue 1005_linked_inXldb2011 tue 1005_linked_in
Xldb2011 tue 1005_linked_inliqiang xu
 
Xldb2011 tue 0940_facebook_realtimeanalytics
Xldb2011 tue 0940_facebook_realtimeanalyticsXldb2011 tue 0940_facebook_realtimeanalytics
Xldb2011 tue 0940_facebook_realtimeanalyticsliqiang xu
 
Xldb2011 tue 1120_youtube_datawarehouse
Xldb2011 tue 1120_youtube_datawarehouseXldb2011 tue 1120_youtube_datawarehouse
Xldb2011 tue 1120_youtube_datawarehouseliqiang xu
 
Selenium私房菜(新手入门教程)
Selenium私房菜(新手入门教程)Selenium私房菜(新手入门教程)
Selenium私房菜(新手入门教程)liqiang xu
 
大话Php之性能
大话Php之性能大话Php之性能
大话Php之性能liqiang xu
 
1.4亿在线背后的故事(2)
1.4亿在线背后的故事(2)1.4亿在线背后的故事(2)
1.4亿在线背后的故事(2)liqiang xu
 

Más de liqiang xu (8)

Csrf攻击原理及防御措施
Csrf攻击原理及防御措施Csrf攻击原理及防御措施
Csrf攻击原理及防御措施
 
Xldb2011 tue 1055_tom_fastner
Xldb2011 tue 1055_tom_fastnerXldb2011 tue 1055_tom_fastner
Xldb2011 tue 1055_tom_fastner
 
Xldb2011 tue 1005_linked_in
Xldb2011 tue 1005_linked_inXldb2011 tue 1005_linked_in
Xldb2011 tue 1005_linked_in
 
Xldb2011 tue 0940_facebook_realtimeanalytics
Xldb2011 tue 0940_facebook_realtimeanalyticsXldb2011 tue 0940_facebook_realtimeanalytics
Xldb2011 tue 0940_facebook_realtimeanalytics
 
Xldb2011 tue 1120_youtube_datawarehouse
Xldb2011 tue 1120_youtube_datawarehouseXldb2011 tue 1120_youtube_datawarehouse
Xldb2011 tue 1120_youtube_datawarehouse
 
Selenium私房菜(新手入门教程)
Selenium私房菜(新手入门教程)Selenium私房菜(新手入门教程)
Selenium私房菜(新手入门教程)
 
大话Php之性能
大话Php之性能大话Php之性能
大话Php之性能
 
1.4亿在线背后的故事(2)
1.4亿在线背后的故事(2)1.4亿在线背后的故事(2)
1.4亿在线背后的故事(2)
 

Último

08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 

Último (20)

08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 

Xldb2011 wed 1415_andrew_lamb-buildingblocks

  • 1. Building Blocks for Large Analytic Systems Andrew Lamb (alamb@vertica.com) Vertica Systems, an HP Company Oct 19, 2011 VERTICA CONFIDENTIAL DO NOT DISTRIBUTE
  • 2. Outline • Emergence of new hardware, software and data volume is driving the wave of new analytic systems in spite of mature existing systems. • Common architectural choices when building these new systems • Choices we made when building the Vertica Analytic Database • Broadly applicable (and widely used) 2
  • 3. Architectural Changes • Driven by changing – Hardware: (really) cheap x86_64 servers, high capacity disk, high speed Better-ness networking – Software: Linux, Open Source – Requirements: Data Deluge • Hard for Legacy Software – Compatibility is brutal – Linux x86_64 today Time – Solaris, HPUX, IRIX, etc. 10 years ago • Storage Organization and Processing Principles 3
  • 4. [Storage + Processing] MPP (no shared disk) SMP Server CP CP CP CP U U U U • Any modern system should run on a CP CP CP CP U U U U cluster of nodes, scale up/down System Bus • Mid range servers are really cheap Main Memory/Cache (to rent or own) • Aggregate available resources are enormous and scalable: Shared Disk – I/O Bandwidth (Disk and Network) – Cores, Memory, etc. Network MPP Servers Local Disk 4
  • 5. [Storage] Keep Your Data Sorted • For large data sets, extra indexes are expensive to maintain • Much better to index the data itself by sorting it • Easy to find what you are looking for, reduces seeks 5
  • 6. [Storage] Distribute Data by Value (not chunks) • “Sharding” -- Distribute data so you can easily find it again (not round robin) • Segmenting in analytics layer simplifies app layer • Round Robin computations won't scale: need to swizzle data around the cluster to do most useful thing • Replication for high availability: not logs B A C B A C B A C 2 2 2 1 1 1 3 3 3 A B C A B C A B C 3 3 3 2 2 2 1 1 1 6
  • 7. [Storage] Store the Data in Columns • Rarely do all fields of a data set appear in analytic queries • Really don't want to waste I/O for data you don't need • Columns let you be clever about applying predicates individually, further reducing I/O • Not appropriate for row lookup or transaction processing systems 7
  • 8. [Storage] Write it Once, Don't Modify • Physical use of disk objects should be write once (append only) • System should present the illusion of mutability • Immutable storage drastically simplifies coherence in a distributed system 8
  • 9. [Processing] Use Large Sequential IOs • Spinning disks are very good at large sequential IOs • You really don't want to whipsaw the read head (another reason why secondary indexes are bad) • Especially useful with sorted data 40 Random vs Sequential Reads 35 30 25 MB/s 20 15 10 Random Sequential 5 0 9
  • 10. [Processing] Trade CPU for I/O Bandwidth • Use very aggressive compression, even if it seems/is wasteful (keep getting more cores) • Example: data type specific encoding, then LZO before actually writing to disk • Hide additional latency with execution pipeline parallelism 10
  • 11. [Processing] Mess with the Data where it Lives • Bring your processing to the data, not data to the processing • System should push computation down close to data (even if calculation turns out to be redundant) • Example: multi-phase aggregation, each phase tries to aggregate intermediates before passing up the memory / node hierarchy 11
  • 12. [Processing] Declarative & Extendable • Give users a declarative query language for most tasks • Writing procedural code for simple queries is wasteful • Provide procedural extensions for complex analysis 12
  • 13. Conclusion • Emergence of new hardware, software and data volumes implies certain architectural choices in modern big data analytic systems. • Commonly observed in new systems • Vertica (unsurprisingly) features all of them 13
  • 14. Introducing Community Edition • Free version of Vertica – help steward better analytics and the democratization of data- closed source, but open access! • All of the features and advanced analytics of Vertica Enterprise Edition • Seamless upgrade to Enterprise Edition • Limited to 1 TB raw data on 3-node hardware cluster • Revamping Community area of Vertica’s website for knowledge sharing, third party tools, and code sharing • Launching academic and non-profit research use program as well • Sign up at: www.vertica.com/community 14