XLDB 2011, Wednesday 14:15 session
1. Building Blocks for Large Analytic Systems
Andrew Lamb (alamb@vertica.com)
Vertica Systems, an HP Company
Oct 19, 2011
VERTICA CONFIDENTIAL DO NOT DISTRIBUTE
2. Outline
• The emergence of new hardware, software, and data volumes is driving a wave of new analytic systems, in spite of mature existing systems
• Common architectural choices when building these new systems
• Choices we made when building the Vertica Analytic Database
• Broadly applicable (and widely used)
3. Architectural Changes
• Driven by changing:
  – Hardware: (really) cheap x86_64 servers, high-capacity disk, high-speed networking
  – Software: Linux, open source
  – Requirements: the data deluge
• Hard for legacy software
  – Compatibility is brutal
  – Linux on x86_64 today; Solaris, HP-UX, IRIX, etc. 10 years ago
• Storage organization and processing principles
[Chart: platform "better-ness" over time, showing successive waves of systems]
4. [Storage + Processing] MPP (no shared disk)
• Any modern system should run on a cluster of nodes and scale up/down
• Mid-range servers are really cheap (to rent or own)
• Aggregate available resources are enormous and scalable:
  – I/O bandwidth (disk and network)
  – Cores, memory, etc.
[Diagram: an SMP server (CPUs on a system bus with main memory/cache and shared disk) contrasted with MPP servers connected by a network, each with local disk]
5. [Storage] Keep Your Data Sorted
• For large data sets, extra indexes are expensive to maintain
• Much better to index the data itself by sorting it
• Easy to find what you are looking for; reduces seeks
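The idea above can be sketched in a few lines: once a column is kept sorted, the data is its own index, and a range lookup becomes a binary search plus one contiguous read. This is a minimal illustration (the event data and `range_scan` helper are hypothetical), not Vertica's actual storage code.

```python
import bisect

# Hypothetical event log kept sorted by timestamp; sorting the data
# itself acts as the index, so a range query is a binary search plus a
# contiguous slice (few seeks) instead of many random index lookups.
events = [(ts, f"event-{ts}") for ts in [3, 7, 12, 19, 25, 31, 40]]
timestamps = [ts for ts, _ in events]

def range_scan(lo, hi):
    """Return all events with lo <= timestamp < hi."""
    start = bisect.bisect_left(timestamps, lo)
    end = bisect.bisect_left(timestamps, hi)
    return events[start:end]  # one contiguous run of storage

print(range_scan(7, 20))  # [(7, 'event-7'), (12, 'event-12'), (19, 'event-19')]
```

Note that no secondary structure had to be maintained on insert; keeping the data sorted did all the work.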
6. [Storage] Distribute Data by Value (not chunks)
• "Sharding": distribute data so you can easily find it again (not round robin)
• Segmenting in the analytics layer simplifies the application layer
• Round-robin computations won't scale: you need to swizzle data around the cluster to do the most useful thing
• Replication for high availability: not logs
[Diagram: segments A, B, C and their numbered replicas spread across three nodes]
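Value-based distribution usually means hashing a segmentation key, so the node holding a row is computable from the row's value. A minimal sketch (the node names and `segment` function are hypothetical, and MD5 here just stands in for whatever hash the system uses):

```python
import hashlib

NODES = ["node0", "node1", "node2"]  # hypothetical 3-node cluster

def segment(key: str) -> str:
    """Hash the segmentation key: the row's home node is a pure
    function of its value, so finding the row again never requires
    asking every node (unlike round-robin placement)."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return NODES[h % len(NODES)]

# The same key always lands on the same node, so joins and
# aggregations on the key can run co-located, with no data shuffle.
print(segment("customer-42"))
```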
7. [Storage] Store the Data in Columns
• Rarely do all fields of a data set appear in analytic queries
• You really don't want to waste I/O on data you don't need
• Columns let you be clever about applying predicates individually, further reducing I/O
• Not appropriate for row-lookup or transaction-processing systems
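A toy illustration of the point: with each field stored as its own array (in a real column store, its own file), a query touching two of four columns never reads the others. The table and query below are made up for the example.

```python
# Column store sketch: each field is a separate array, so a query
# reading 2 of 4 columns simply never touches the other 2 (here, the
# wide "payload" column is never read at all).
columns = {
    "user_id": [1, 2, 3, 4],
    "country": ["US", "DE", "US", "FR"],
    "clicks":  [10, 3, 7, 2],
    "payload": ["<big blob>", "<big blob>", "<big blob>", "<big blob>"],
}

# SELECT SUM(clicks) WHERE country = 'US'
# The predicate runs over one tightly packed array, and only the
# matching positions of a second array are fetched.
matching = [i for i, c in enumerate(columns["country"]) if c == "US"]
total = sum(columns["clicks"][i] for i in matching)
print(total)  # 17
```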
8. [Storage] Write It Once, Don't Modify
• Physical on-disk objects should be write-once (append only)
• The system should present the illusion of mutability
• Immutable storage drastically simplifies coherence in a distributed system
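One common way to present mutability over write-once storage is to make every update a new record and every delete a tombstone, with reads resolving the newest version. This sketch is a generic illustration of that pattern (the class and its API are invented for the example), not Vertica's actual mechanism.

```python
class AppendOnlyStore:
    """Sketch of write-once storage: the log is only ever appended.
    Updates append a new version and deletes append a tombstone, so
    written data never changes, yet readers see a mutable key/value map."""

    def __init__(self):
        self.log = []  # append-only sequence of (key, value) records

    def put(self, key, value):
        self.log.append((key, value))   # an "update" is just another append

    def delete(self, key):
        self.log.append((key, None))    # tombstone record, also appended

    def get(self, key):
        for k, v in reversed(self.log):  # newest record for the key wins
            if k == key:
                return v                 # None if newest is a tombstone
        return None

store = AppendOnlyStore()
store.put("a", 1)
store.put("a", 2)      # overwrite without modifying anything in place
print(store.get("a"))  # 2
```

Because no record is ever rewritten, replicas only have to agree on what was appended, which is what makes distributed coherence so much simpler.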
9. [Processing] Use Large Sequential I/Os
• Spinning disks are very good at large sequential I/Os
• You really don't want to whipsaw the read head (another reason why secondary indexes are bad)
• Especially useful with sorted data
[Chart: random vs. sequential read throughput (MB/s); sequential reads are dramatically faster]
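In code, "large sequential I/Os" mostly means scanning files in big contiguous chunks rather than issuing many small seeks. A minimal sketch (the 8 MiB buffer size is an illustrative choice, not a recommendation from the talk):

```python
# Stream a file with large sequential reads instead of many small
# random ones; each read() call turns into one big contiguous I/O.
CHUNK = 8 * 1024 * 1024  # 8 MiB, an illustrative buffer size

def scan(path):
    total = 0
    with open(path, "rb") as f:
        while True:
            buf = f.read(CHUNK)  # one large sequential read per call
            if not buf:
                break
            total += len(buf)    # process the chunk here
    return total
```

Combined with sorted data, a range query becomes one seek followed by a single sequential run, which is exactly the access pattern spinning disks reward.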
10. [Processing] Trade CPU for I/O Bandwidth
• Use very aggressive compression, even if it seems (or is) wasteful; we keep getting more cores
• Example: data-type-specific encoding, then LZO before actually writing to disk
• Hide the additional latency with execution-pipeline parallelism
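The two-stage example on this slide can be sketched directly: a type-specific encoding (here, delta-encoding a sorted integer column) followed by a general-purpose compressor. zlib stands in for LZO below since it ships with Python; the principle of spending CPU to shrink I/O is the same.

```python
import struct
import zlib

def compress_sorted(values):
    """Stage 1: delta-encode a sorted int column (type-specific);
    stage 2: general-purpose compression over the tiny deltas."""
    deltas = [values[0]] + [b - a for a, b in zip(values, values[1:])]
    raw = struct.pack(f"{len(deltas)}q", *deltas)  # 64-bit ints
    return zlib.compress(raw, 6)

def decompress_sorted(blob, n):
    deltas = struct.unpack(f"{n}q", zlib.decompress(blob))
    out, acc = [], 0
    for d in deltas:
        acc += d
        out.append(acc)  # running sum undoes the delta encoding
    return out

col = list(range(1000, 2000))  # e.g. sorted timestamps
blob = compress_sorted(col)
print(len(blob), "bytes compressed vs", 8 * len(col), "bytes raw")
```

Because the deltas of sorted data are mostly small repeated values, the general-purpose pass compresses them far better than it would the raw integers.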
11. [Processing] Mess with the Data Where It Lives
• Bring your processing to the data, not the data to the processing
• The system should push computation down close to the data (even if the calculation turns out to be redundant)
• Example: multi-phase aggregation, where each phase tries to aggregate intermediates before passing them up the memory/node hierarchy
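The multi-phase aggregation example can be sketched as two phases: each node aggregates its local rows first, then only the small partial results move up the hierarchy to be merged. The per-node row data below is invented for the illustration.

```python
from collections import Counter

# Hypothetical rows living on each node of a 2-node cluster.
node_rows = [
    [("US", 1), ("DE", 1), ("US", 1)],  # node 0's local data
    [("US", 1), ("FR", 1)],             # node 1's local data
]

def local_phase(rows):
    """Phase 1: aggregate where the data lives, so only compact
    intermediates (not raw rows) cross the memory/node hierarchy."""
    partial = Counter()
    for key, count in rows:
        partial[key] += count
    return partial

def merge_phase(partials):
    """Phase 2: combine the much smaller per-node partial results."""
    total = Counter()
    for p in partials:
        total.update(p)
    return total

result = merge_phase(local_phase(rows) for rows in node_rows)
print(dict(result))  # {'US': 3, 'DE': 1, 'FR': 1}
```

Even if a local phase turns out to be redundant (every key distinct, say), it costs little, while in the common case it shrinks the data before it ever moves.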
12. [Processing] Declarative & Extensible
• Give users a declarative query language for most tasks
• Writing procedural code for simple queries is wasteful
• Provide procedural extensions for complex analysis
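The combination reads well in miniature using SQLite, which (like the systems described here) keeps ordinary queries declarative while letting procedural logic plug in as a user-defined function. The table, data, and `squash` function are invented for the example; this illustrates the general pattern, not Vertica's UDF API.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clicks (country TEXT, n INTEGER)")
conn.executemany("INSERT INTO clicks VALUES (?, ?)",
                 [("US", 10), ("DE", 3), ("US", 7)])

# Procedural extension: custom logic registered as a scalar function.
def squash(n):
    return min(n, 8)  # hypothetical capping rule, awkward in pure SQL

conn.create_function("squash", 1, squash)

# The query itself stays declarative; the engine decides how to run it,
# calling into the procedural extension only where needed.
rows = conn.execute(
    "SELECT country, SUM(squash(n)) FROM clicks"
    " GROUP BY country ORDER BY country"
).fetchall()
print(rows)  # [('DE', 3), ('US', 15)]
```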
13. Conclusion
• The emergence of new hardware, software, and data volumes implies certain architectural choices in modern big-data analytic systems
• These choices are commonly observed in new systems
• Vertica (unsurprisingly) features all of them
14. Introducing Community Edition
• Free version of Vertica
  – Helps steward better analytics and the democratization of data: closed source, but open access!
• All of the features and advanced analytics of Vertica Enterprise Edition
• Seamless upgrade to Enterprise Edition
• Limited to 1 TB of raw data on a 3-node hardware cluster
• Revamping the Community area of Vertica's website for knowledge sharing, third-party tools, and code sharing
• Launching an academic and non-profit research use program as well
• Sign up at: www.vertica.com/community