1. Big Data vs. Data Warehousing
Synergy or Conflict?
Thomas Kejser
thomas@kejser.org
http://blog.kejser.org
@thomaskejser
2. Who is this Guy?
Thomas Kejser
http://blog.kejser.org
@thomaskejser
• Formerly: Lead SQLCAT EMEA
• Now: CTO Fusion-io EMEA
• 15 years of database experience
• Performance Tuner
3. Human Consciousness Doesn’t Scale
[Chart: world population in billions of humans (y-axis, 5 to 10) by year (x-axis, 2000 to 2250)]
Source: United Nations Projections
4. Text Messages in a Table
CREATE TABLE AllTexts (
  Sender BIGINT             -- 8 B
, Receiver BIGINT           -- 8 B
, SenderLocation BIGINT     -- 8 B
, ReceiverLocation BIGINT   -- 8 B
, Time DATETIME             -- 8 B
, SMS VARCHAR(140)          -- 140 B
)
= 180 bytes per row
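The 180-byte figure is simply the sum of the per-column widths annotated on the slide. A minimal sketch of that sum follows; it assumes every message uses all 140 characters at one byte each and ignores any per-row or per-column storage overhead a real engine would add.

```python
# Column widths in bytes, copied from the slide's annotations.
COLUMN_BYTES = {
    "Sender": 8,
    "Receiver": 8,
    "SenderLocation": 8,
    "ReceiverLocation": 8,
    "Time": 8,
    "SMS": 140,  # assumption: worst case, all 140 characters used, 1 byte each
}

# Row width is the plain sum; engine overhead is deliberately left out.
row_bytes = sum(COLUMN_BYTES.values())
print(f"{row_bytes} bytes per row")
```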
5. How much do we text?
• World Average
• 6.1 Trillion Text Messages / year
• About 80% cell phone coverage
• 7 billion people
• 3 messages/day/person
• But:
• Teenagers: 50 messages/day
Source: Pew Internet Research 2010 & ITU
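As a rough check, the world average can be recomputed from the figures on this slide. The sketch below assumes the 3 messages/day figure is per person with cell phone coverage (80% of 7 billion); the slide does not say which denominator it uses.

```python
texts_per_year = 6.1e12   # from the slide
population = 7e9          # from the slide
coverage = 0.80           # from the slide

texts_per_day = texts_per_year / 365

# Average over everyone on the planet.
per_person = texts_per_day / population

# Average over people with coverage only (assumed to be the slide's basis).
per_covered_person = texts_per_day / (population * coverage)

print(f"per person: {per_person:.2f}, per covered person: {per_covered_person:.2f}")
```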
6. How much will we EVER text?
• 9B people acting like teenagers (in 2050)
• 50 texts/day
• That’s 450 billion texts/day
• 164 Trillion texts/year (about 27x today)
• 180 bytes each
• Assume x3 compression
• Approximation: 10 Petabytes/year in 2050
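The estimate above can be reproduced step by step. This is a back-of-envelope sketch using only the slide's inputs; it assumes decimal units (1 PB = 10^15 bytes).

```python
people = 9e9                   # everyone texting like a teenager in 2050
texts_per_person_per_day = 50
row_bytes = 180                # row width from the AllTexts table
compression_ratio = 3

texts_per_day = people * texts_per_person_per_day
texts_per_year = texts_per_day * 365
raw_bytes_per_year = texts_per_year * row_bytes

# Assumption: decimal petabytes (10**15 bytes).
compressed_pb_per_year = raw_bytes_per_year / compression_ratio / 1e15

print(f"{texts_per_day:.3g} texts/day")
print(f"{texts_per_year:.3g} texts/year")
print(f"{compressed_pb_per_year:.2f} PB/year after compression")
```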
8. How Large is this/year?
Hard Disk (4TB): 2.5”
Wine Bottle (75cl): 4.0”
About 1500 Wine Bottles
9. In the Data Center
• Calculating:
• 2U storage = 24 disks (includes compute)
• 4 TB per disk
• 100 TB in 2U (a bit less)
• 10PB = 200U storage
• About six racks
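The rack count can be sketched without the slide's rounding. The 42U rack height below is an assumption (a common full-height rack) and is not stated on the slide; the exact arithmetic gives five fully packed racks, so "about six" presumably leaves room for networking and spare capacity.

```python
import math

target_tb = 10_000     # 10 PB in decimal TB
disk_tb = 4            # from the slide
disks_per_2u = 24      # from the slide
rack_units = 42        # assumption: common full-height rack, not from the slide

tb_per_2u = disk_tb * disks_per_2u                    # the "a bit less than 100 TB"
disks_needed = math.ceil(target_tb / disk_tb)
enclosures = math.ceil(disks_needed / disks_per_2u)   # 2U enclosures
units = enclosures * 2
racks = math.ceil(units / rack_units)

print(f"{tb_per_2u} TB per 2U, {disks_needed} disks, {units}U, {racks} racks")
```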
11. … And it is Becoming a Commodity
• Good Management Interfaces
• Standard SQL
• with a few extensions
• Appliances
• Support system
• Homogeneous HW
• In chunks
13. PDW vs. Hive – Scan/seek
Query 1:
SELECT count(*)
FROM lineitem

Query 2:
SELECT max(l_quantity)
FROM lineitem
WHERE l_orderkey > 1000
  AND l_orderkey < 100000
GROUP BY l_linestatus

[Bar chart: run time in seconds (y-axis, 0 to 1500) for Query 1 and Query 2, Hive vs. PDW]
14. PDW vs. Hive - Joins
SELECT max(l_orderkey)
FROM orders
JOIN lineitem
ON l_orderkey = o_orderkey

PDW-U:
• orders partitioned on c_custkey
• lineitem partitioned on l_partkey
PDW-P:
• orders partitioned on o_orderkey
• lineitem partitioned on l_orderkey

[Bar chart: run time in seconds (y-axis, 0 to 4000) for Hive, PDW-U and PDW-P]
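Why the partitioning choice matters for the join can be illustrated with a toy model. The sketch below is not how PDW or Hive distribute data; `node_for`, `rows_to_shuffle`, the modulo hash, the four-node cluster and the miniature `lineitem` rows are all hypothetical. It only counts how many rows live on a different node than the one their join key maps to, which is the data that would have to move before the join can run locally.

```python
def node_for(key: int, nodes: int) -> int:
    """Map a key to a node with a simple modulo hash (toy stand-in)."""
    return key % nodes


def rows_to_shuffle(rows, dist_col: str, join_col: str, nodes: int) -> int:
    """Count rows stored on a different node than their join key maps to."""
    return sum(
        1
        for row in rows
        if node_for(row[dist_col], nodes) != node_for(row[join_col], nodes)
    )


NODES = 4  # hypothetical cluster size

# Hypothetical miniature lineitem table; l_partkey is made up for the demo.
lineitem = [{"l_orderkey": o, "l_partkey": o * 3} for o in range(1, 101)]

# PDW-P style: distributed on the join key, so every row is already in place.
moved_partitioned = rows_to_shuffle(lineitem, "l_orderkey", "l_orderkey", NODES)

# PDW-U style: distributed on l_partkey, so many rows sit on the wrong node.
moved_unpartitioned = rows_to_shuffle(lineitem, "l_partkey", "l_orderkey", NODES)

print(moved_partitioned, moved_unpartitioned)
```

When both tables are distributed on the join key, matching rows meet on the same node and no data movement is needed, which is the co-location advantage the PDW-P configuration exploits.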
15. What does Big Data need to Catch up?
• Thread startup times
• Co-location awareness
• Files vs. optimized DB memory structures
• Column stores and other DB tech
Generic is good… but when there is structure, make use of it!
17. How many Pictures of Cats?
• Flickr Today:
• 300MB/month
• 2GB/year
• 51M users (too small?)
• Estimate: 102 PB / year
• 10 x text messages
Source: Wikipedia
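The 102 PB estimate appears to come from multiplying the user count by the yearly volume. The sketch below assumes the 2 GB/year figure is per user and uses decimal units; both are readings of the slide, not something it states explicitly.

```python
users = 51_000_000            # from the slide
gb_per_user_per_year = 2      # assumption: the slide's 2 GB/year is per user
text_pb_per_year = 10         # text message estimate from the earlier slide

# Assumption: decimal units (1 GB = 10**9 bytes, 1 PB = 10**15 bytes).
flickr_bytes_per_year = users * gb_per_user_per_year * 10**9
flickr_pb_per_year = flickr_bytes_per_year / 10**15

ratio_to_text = flickr_pb_per_year / text_pb_per_year

print(f"{flickr_pb_per_year:.0f} PB/year, {ratio_to_text:.1f}x text messages")
```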
25. Saturday, 1:39am - at The Pub
Your Semi-structured Data, For Free
26. Big Value
Extraction of meaning and insight
from semi-structured data
27. Extracting Meaning from Humans
Method | Examples
Turn semi-structure to structure | Image recognition, network proximity and super nodes, social media
Needle in a haystack | Extract outliers, fraud
Herd behaviors | Clustering, pattern recognition, “Customers who bought this also bought”
Text classification and search | Text indexes, syntactic counting, PageRank
Text to structure | Semantic analysis, loose structure into structure
28. Find New Customers
“Michael, who is respected among his peers, often talks about his new, cool gadgets”
[Diagram: social network with nodes Tommy, Thomas and Michael]
34. Things to Learn for the Future
• Get good at
• Statistics (again)
• Distributed Algorithms
• Tuning
• Understand Physical Constraints
• Acquire deep domain knowledge
40. Summary
Data Warehouse | Big Data
There is a model | Don’t bother modeling!
Seek Co-location | Optional Co-Location
Respond in seconds | Respond in minutes
Calculate first, query after | Calculate while querying
Expensive HW | Cheap HW
Optimise for target HW | Good enough on all HW
Homogeneous HW | Heterogeneous HW
Pay vendor, expect optimised | Free license, optimise yourself
We are at the end of the growth curve: 9B is our total population. This is an important observation because many data estimates are based on human activity and have so far assumed exponential growth. This is NOT the case anymore!
This shows the development of hard drive capacity over time.
The calculation is not meant to be read; it just lets people know we did the calc and what it PHYSICALLY means (see the animation). There is a real cost to storing a lot of data, and this is one of the reasons cloud makes a lot of sense.