This slide deck explains the economic reason for choosing big data technology such as Hadoop. I encourage IT consultants to read it when they need to convince customers to adopt big data. It also clarifies the relationship between big data and data mining; the two are not interchangeable. I hope readers come away understanding why we should use big data technology.
2. BIG DATA VS DATA MINING
• Please don’t confuse the two! They are not interchangeable
• I’ll explain why one by one
• Do you want to follow me?
3. BIG DATA
• It is misleading to think that the goal of “Big Data” is simply to handle
large-scale data.
• The goal of Big Data is to achieve a “scale-out” structure
– REDUCING COST
4. SCALE-UP VS SCALE-OUT
[Diagram: 10-core units stacked in one machine (scale-up) vs. 10-core machines placed side by side (scale-out)]
• Scale-up: increase computing power in one machine (EXPENSIVE)
• Scale-out: increase computing power by increasing the number of machines (CHEAP)
5. SCALE-UP VS SCALE-OUT
• Think about it this way
• Which one is cheaper?
– Quad-core (4 Core) PC x 2
– Octa-core (8 Core) PC x 1
• Generally, two quad-core PCs are cheaper than one octa-core PC.
– This is because only a limited number of motherboard makers produce boards
that support 8 cores
6. WHY DO WE CHOOSE SCALE-OUT
OVER SCALE-UP STRUCTURE
7. THE DIFFICULTY OF SCALE-OUT
STRUCTURE
• How do we balance the CPU usage across the machines?
• If one machine fails, how do we manage it?
• How do we distribute the tasks to each machine?
• What if we add one more machine?
• Conclusion: DIFFICULT
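To make these questions concrete, here is a minimal Python sketch of a toy scheduler. It is not how Hadoop works internally; the `Cluster` class and its method names are hypothetical, invented only for illustration. It sends each task to the least-loaded machine, moves tasks off a failed machine, and lets a new machine join.

```python
class Cluster:
    """Toy scale-out scheduler (hypothetical, for illustration only)."""

    def __init__(self, machine_names):
        # Map each machine name to the list of (task, cost) pairs it runs.
        self.assignments = {name: [] for name in machine_names}

    def load(self, name):
        # Total cost of the tasks currently placed on one machine.
        return sum(cost for _, cost in self.assignments[name])

    def submit(self, task, cost):
        # Balance CPU usage: pick the machine with the smallest load.
        target = min(self.assignments, key=self.load)
        self.assignments[target].append((task, cost))
        return target

    def add_machine(self, name):
        # Scaling out: a new machine starts empty, so it receives new tasks first.
        self.assignments[name] = []

    def fail(self, name):
        # Failure handling: drop the machine and resubmit its tasks elsewhere.
        orphaned = self.assignments.pop(name)
        for task, cost in orphaned:
            self.submit(task, cost)


cluster = Cluster(["node1", "node2", "node3"])
for i in range(6):
    cluster.submit(f"task{i}", cost=1)
cluster.fail("node2")            # node2's tasks move to the surviving machines
cluster.add_machine("node4")     # node4 joins empty
cluster.submit("task6", cost=1)  # lands on node4, the least-loaded machine
```

Even this toy version ignores data locality, network failures, and tasks that are half-finished when a machine dies, which is why the real problem is hard.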
8. CASE 01 – BUSINESS TRANSACTION
IN RDBMS
• Let’s assume that we need to handle a 1 TB database
• 100 million transactions a day
• You want to handle this without any failure
• You are a H/W architect. What would you do?
9. H/W ARCHITECTURE FOR THAT
[Diagram: a firewall / L2 switch in front of two clustered Unix servers (40 cores each), each running a commercial DB, connected through a SAN switch to two mirrored 1 TB storage units]
10. ESTIMATED COST
[S/W]
DB License: $5,000 / core x 80 cores = $400,000
Clustering = $50,000
[H/W]
40-core Unix server x 2 = $1,000,000
Storage = $100,000
Switches = $30,000
Total: roughly $2,000,000
Disclaimer: These are not actual prices. Actual prices depend on your sales history. I wrote these figures based on my experience.
12. CASE 02 – BUSINESS TRANSACTION
IN HADOOP
[Diagram: eight HP DL380 x86 servers (10 cores each) behind a firewall (F/W) and a switch]
Suppose each server has a 500 GB SCSI HDD: 500 GB x 8 = 4 TB of raw capacity.
With full mirroring that leaves 2 TB usable, enough for the 1 TB database.
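The capacity arithmetic can be checked with a short Python sketch. The function name `usable_capacity_gb` is hypothetical, and the calculation ignores operating system and filesystem overhead. The three-copy case reflects HDFS's default replication factor of 3, which is stricter than plain mirroring.

```python
def usable_capacity_gb(servers, disk_gb, copies):
    """Usable capacity when every piece of data is stored `copies` times."""
    raw_gb = servers * disk_gb
    return raw_gb / copies


raw = 8 * 500                                         # raw disk across 8 servers, in GB
mirrored = usable_capacity_gb(8, 500, copies=2)       # full mirroring: 2 copies
hdfs_default = usable_capacity_gb(8, 500, copies=3)   # HDFS default: 3 copies

print(raw, mirrored, round(hdfs_default))
```

Either way, the usable capacity stays above the 1 TB (1,000 GB) that the database needs.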
13. ESTIMATED COST
[S/W]
Hadoop is open source. It’s free!
[H/W]
10-core x86 server x 8 = $80,000
Switches = $30,000
Total: roughly $110,000
vs. roughly $2,000,000 for Unix + commercial DB
Disclaimer: These are not actual prices. Actual prices depend on your sales history. I wrote these figures based on my experience.
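For readers who want to check the numbers, the following Python sketch adds up the line items from the two ESTIMATED COST slides. The prices are the illustrative ones from the slides, not real quotes. Note that the itemized Unix figures sum to $1,580,000, which the slide rounds up to roughly $2,000,000.

```python
# Line items copied from the two ESTIMATED COST slides (illustrative prices).
unix_rdbms = {
    "DB license ($5,000 x 80 cores)": 5_000 * 80,
    "Clustering": 50_000,
    "40-core Unix server x 2": 1_000_000,
    "Storage": 100_000,
    "Switches": 30_000,
}
hadoop = {
    "Hadoop (open source)": 0,
    "10-core x86 server x 8": 80_000,
    "Switches": 30_000,
}

unix_total = sum(unix_rdbms.values())
hadoop_total = sum(hadoop.values())
ratio = unix_total / hadoop_total

print(f"Unix + commercial DB: ${unix_total:,}")
print(f"Linux + Hadoop:       ${hadoop_total:,}")
print(f"Ratio: about {ratio:.1f}x")
```

Even with the itemized figure instead of the rounded one, the Unix setup costs more than 14 times as much.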
14. SCALABILITY
• Let’s assume that we have more customers. We need more computing
power.
[Unix + commercial DB]
I need to buy one more server, one more storage unit, and a 40-core commercial DB license
=> Prohibitively expensive
[Linux + Hadoop]
Just add one more x86 server. It’s not a big deal.
=> Cheap
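A rough Python sketch of the cost of one scaling step, using per-unit prices derived from the ESTIMATED COST slides. The even split of the storage price is my assumption, since the slides give only a single storage figure. The function names are hypothetical.

```python
# Per-unit prices derived from the ESTIMATED COST slides (illustrative only).
UNIX_SERVER = 1_000_000 // 2   # one 40-core Unix server
DB_LICENSE = 5_000 * 40        # commercial DB license for 40 more cores
STORAGE_UNIT = 100_000 // 2    # ASSUMPTION: the storage price splits evenly
X86_SERVER = 80_000 // 8       # one 10-core x86 server


def unix_step_cost():
    """Cost of one scaling step on the Unix + commercial DB side."""
    return UNIX_SERVER + DB_LICENSE + STORAGE_UNIT


def hadoop_step_cost(servers=1):
    """Cost of adding x86 servers to the Hadoop cluster."""
    return servers * X86_SERVER


print(unix_step_cost())      # one more Unix server, license, and storage unit
print(hadoop_step_cost())    # one more x86 server
print(hadoop_step_cost(4))   # four x86 servers: the same 40 cores as one Unix server
```

The Hadoop side can also grow in steps of 10 cores, while the Unix side has to buy 40 cores at a time.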
15. IS HADOOP ALMIGHTY?
• No
– You have to write Java code in lieu of SQL
– You have to code MapReduce jobs to retrieve the data or to manipulate it into
the form that you want
– It doesn’t have sophisticated data management technology for optimized
performance
– It is open source, so don’t expect vendor-style technical support
• It complements a commercial RDBMS rather than replacing it
– RDBMS: real-time transactions
– Big Data: Business Intelligence
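To show what “coding MapReduce” means compared with SQL, here is a minimal sketch of the programming model in plain Python. It is not Hadoop’s Java API, and the sample records are made up. In SQL the same result would be one statement, something like `SELECT customer, SUM(amount) FROM transactions GROUP BY customer` (the table name is hypothetical).

```python
from collections import defaultdict

# Made-up sample records: (customer, amount).
transactions = [("alice", 30), ("bob", 20), ("alice", 50), ("carol", 10), ("bob", 5)]


def map_phase(record):
    # Map: emit a (key, value) pair for each input record.
    customer, amount = record
    yield customer, amount


def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    # Reduce: collapse the values for one key into a single result.
    return key, sum(values)


mapped = [pair for record in transactions for pair in map_phase(record)]
totals = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(totals)
```

The extra code buys parallelism: map and reduce calls are independent of each other, so a framework can spread them across many machines.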
16. DATA MINING
• Please don’t confuse it with Big Data!
• Big Data: where do we store the data?
• Data Mining: how do we use the data?
17. DATA MINING
Suppose that you are in charge of issuing credit cards.
You want to know who is likely to default…
You already have records of past transactions.
Gender  Zipcode  Age  Education  Income   Default
Male    46637    33   Master     $90,000  No
Female  10001    21   GED        $50,000  Yes
…       …        …    …          …        …
20. DATA MINING
• From existing data, identify the relationship between the Y value and the X values.
– y = f(x1, x2, x3, …)
– It could be y = ax, y = log(x), or y = exp(x). We don’t know in advance, but a machine
can try each candidate and find the best-fitting model to account for the Y value.
• AlphaGo, Google’s AI Go player, adopted this idea and took it to
a far more advanced level
– Y value: the probability of winning the game
– X values: the positions of the white and black stones
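The idea of trying several candidate models and keeping the best fit can be sketched in a few lines of Python. This is a toy illustration with one X variable and synthetic data, not the method AlphaGo uses. The helper name `fit_scale` is hypothetical.

```python
import math


def fit_scale(xs, ys, g):
    """Least-squares fit of y = a * g(x); returns (a, sum of squared errors)."""
    gx = [g(x) for x in xs]
    a = sum(u * y for u, y in zip(gx, ys)) / sum(u * u for u in gx)
    sse = sum((y - a * u) ** 2 for u, y in zip(gx, ys))
    return a, sse


candidates = {
    "y = a*x": lambda x: x,
    "y = a*log(x)": math.log,
    "y = a*exp(x)": math.exp,
}

# Synthetic data generated from y = 2*log(x), so the right answer is known.
xs = [1.5, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2 * math.log(x) for x in xs]

# Fit every candidate and keep the one with the smallest error.
results = {name: fit_scale(xs, ys, g) for name, g in candidates.items()}
best = min(results, key=lambda name: results[name][1])
print(best, round(results[best][0], 3))
```

Real data mining tools search a much larger family of models and check the fit on data held out from training, but the principle is the same.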
21. WHAT CAN WE DO WITH DATA
MINING?
• Combine it with Big Data technology
• Identify marketing opportunities
– Who has purchased our products?
• Detect financial fraud
– Which transaction looks fraudulent?
• Artificial Intelligence
– Go, Chess, other games
• Etc.
22. Q&A
• If you have any questions, feel free to ask me.
www.mbaprogrammer.com