Optimal Chain Matrix Multiplication Big Data Perspective
1. Optimal Chain Matrix Multiplication Big Data Perspective
Presented By
Pollab Kumar Roy
pollabroy.242@gmail.com
STUDY AND REPORT
2. Presentation Outline
Introduction
Big Data Overview
• Definition
• The three Vs
• Application
Introduction to Hadoop
• Architecture
• How it works
• Advantage
MapReduce
• What is MapReduce?
• The Algorithm
• Example Scenario
HDFS
Matrix Multiplication
Multi Way Join
Proposed Work
Conclusions
Dept. of ICT, MBSTU
3. Introduction
Matrix multiplication is widely used in many graph algorithms, such
as those that calculate the transitive closure. MapReduce is well suited
to implementing multi-way join operations on very large graphs and matrices.
This presentation gives an overview of Big Data, shows how matrix
multiplication is represented in a database, and covers the parallel
multi-way matrix join in a database with its benefits and limitations.
It closes with a proposal for making chain multiplication more efficient
with a raw join key.
4. Big Data Overview
Big data is a term that refers to data sets whose size, complexity, and
rate of growth make them difficult to capture, manage, and process
with conventional technologies.
Big Data Sources :
• Stock exchange data
• Social media data
• Black box data
5. The three dimensions of Big Data
Volume
• About 5 billion GB of data was generated up to 2003.
• The same amount was generated every two days in 2011.
• The same amount was generated every ten minutes in 2013.
Variety
• Structured: relational data.
• Semi-structured: XML data.
• Unstructured: Word, PDF, text, media logs.
Velocity
• The pace at which data flows in from sources and human interaction.
6. Big Data Application Segments
Analytics
• Predictive Modeling
• Decision Processing
• Behavior Analysis
• Demographics
Data Warehouse
• Hosting
• Digitization/archive
• Backup
• Web 2.0
Engineering Collaboration
• Design Optimization
• Process Flow
• Fluid Dynamics
• 3D Modeling
7. Introduction to Hadoop
Hadoop: an Apache open-source framework written in Java that allows
distributed processing of large datasets across clusters of computers using
simple programming models.
The name comes from a toy elephant that belonged to Doug Cutting's son.
Hadoop Architecture :
Two major layers.
• Processing layer : MapReduce (distributed computation)
• Storage layer : Hadoop Distributed File System, HDFS (distributed storage)
Supporting modules : YARN framework and Common utilities.
8. Introduction to Hadoop (cont.)
How Hadoop works : Core tasks across a cluster of computers
• Data is divided into directories and files; files are split into blocks of 128 MB or 64 MB.
• The files are then distributed across various cluster nodes.
• HDFS supervises the processing.
• Blocks are replicated.
• A sort is performed between the map and reduce stages.
• The sorted data is sent to a certain computer.
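The block splitting and replication steps listed above can be pictured with a small sketch. It is only an illustration in plain Python, not Hadoop code: the function names split_into_blocks and place_replicas are hypothetical, the replication factor of 3 is the usual HDFS default, and the round-robin placement is an assumption made for simplicity (real HDFS uses its own placement policy).

```python
def split_into_blocks(file_size_mb, block_size_mb=128):
    """Split a file of the given size into (block_id, size_mb) pairs."""
    blocks = []
    offset = 0
    while offset < file_size_mb:
        # The last block may be smaller than the block size.
        size = min(block_size_mb, file_size_mb - offset)
        blocks.append((len(blocks), size))
        offset += size
    return blocks


def place_replicas(blocks, nodes, replication=3):
    """Assign each block to `replication` nodes.

    Assumption: simple round-robin placement, which yields distinct
    nodes only when replication <= len(nodes).
    """
    placement = {}
    for block_id, _size in blocks:
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement


blocks = split_into_blocks(300)
placement = place_replicas(blocks, ["n1", "n2", "n3", "n4"])
```

With a 300 MB file and 128 MB blocks, the sketch produces two full blocks and one 44 MB block, each stored on three of the four nodes.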
Advantage :
• Low-cost alternative to building bigger servers.
• Fault-tolerance and high availability.
• Dynamic clustering.
• Automatic data distribution and open source
9. MapReduce
What is MapReduce : A processing technique and a programming
model for distributed computing, based on Java.
• Mapper
• Shuffle
• Reducer
• Java based
• Key Value
11. MapReduce (cont.)
Word Count Example :
Input files (each line goes to an individual mapper):
Apple Orange Mango
Orange Grapes Plum
Apple Plum Mango
Apple Apple Plum
Map (key-value splitting):
Apple,1 Orange,1 Mango,1
Orange,1 Grapes,1 Plum,1
Apple,1 Plum,1 Mango,1
Apple,1 Apple,1 Plum,1
Sort, shuffle:
Apple,1 Apple,1 Apple,1 Apple,1
Grapes,1
Mango,1 Mango,1
Orange,1 Orange,1
Plum,1 Plum,1 Plum,1
Reduce (produce key-value pairs), final output:
Apple,4
Grapes,1
Mango,2
Orange,2
Plum,3
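The map, shuffle, and reduce stages of the word count example can be imitated in a few lines of plain Python. This is a minimal single-process sketch, not Hadoop code: the function names map_phase, shuffle_phase, reduce_phase, and word_count are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle and sort: group all values by key, with keys in sorted order."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_phase(grouped):
    """Reducer: sum the list of ones collected for each word."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines):
    """Run map on every line, then shuffle, then reduce."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return reduce_phase(shuffle_phase(pairs))


lines = [
    "Apple Orange Mango",
    "Orange Grapes Plum",
    "Apple Plum Mango",
    "Apple Apple Plum",
]
counts = word_count(lines)
```

In a real cluster each mapper and reducer would run on a different node; here the three stages simply run one after another on the four input lines of the example.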
12. Hadoop Distributed File System (HDFS)
HDFS is a distributed, scalable, and portable file system written
in Java for the Hadoop framework.
Features :
• Distributed storage and processing
• Name Node
• Data Node
• Interface in Hadoop
• Streaming access
• Cluster status check
13. Hadoop Distributed File System (cont.)
Architecture : Data Node, Name Node, Block
Fig : HDFS architecture. Labels: Name Node holding metadata (Name, replica…; /home/foo/data, 3…), Client, Read, Blocks, Replication, Datanodes in Rack 1 and Rack 2.
14. Matrix Multiplication (Via multi-way join)
Usage : Widely used in many graph algorithms
• Transitive closure
• N-hop neighbors
Fig : Social Network (User_1 to User_7)
Join Operation :
• Matrices A [p×q] and B [q×r]
• C [p×r] = A × B
• Each (i,k)-th element of C is $C_{ik} = \sum_{j=1}^{q} A_{ij} \times B_{jk}$
• A and B are represented by relations R1 and R2 in the database, with attributes {row, col, val}
• A × B in terms of SQL:
SELECT R1.row, R2.col, SUM(R1.val * R2.val)
FROM R1, R2
WHERE R1.col = R2.row
GROUP BY R1.row, R2.col
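The join-based formulation can be tried out without a database. The following is a minimal sketch in plain Python, assuming each matrix is stored as a list of (row, col, val) tuples with zero entries omitted; the helper names to_relation and join_multiply are hypothetical. The inner loop plays the role of the join condition R1.col = R2.row, and the dictionary of sums plays the role of GROUP BY with SUM.

```python
from collections import defaultdict


def to_relation(matrix):
    """Convert a dense matrix (list of lists) to (row, col, val) tuples, skipping zeros."""
    return [
        (i, j, v)
        for i, row in enumerate(matrix)
        for j, v in enumerate(row)
        if v != 0
    ]


def join_multiply(r1, r2):
    """Multiply two matrices given as relations by joining on R1.col = R2.row."""
    # Index R2 by its row attribute so matching tuples are found quickly.
    by_row = defaultdict(list)
    for row, col, val in r2:
        by_row[row].append((col, val))

    # GROUP BY (R1.row, R2.col) with SUM(R1.val * R2.val).
    sums = defaultdict(int)
    for i, j, a in r1:
        for k, b in by_row.get(j, []):
            sums[(i, k)] += a * b
    return sorted((i, k, v) for (i, k), v in sums.items())


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = join_multiply(to_relation(A), to_relation(B))
```

Because only non-zero entries are stored, the same code works for sparse adjacency matrices of large graphs, which is the case the relational representation is meant for.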
16. Matrix Multiplication (cont.)
Chain way join :
• Eq.(1): typical method, serial two-way join (S2). Each two-way join is a
separate MR job; intra-operation parallelism.
• Eq.(2): parallel two-way join (P2). Inter-operation parallelism:
A * B and C * D are computed simultaneously.
• Eq.(3): parallel m-way join (PM).
A * B * C * D = (((A * B) * C) * D)   (1)
A * B * C * D = ((A * B) * (C * D))   (2)
A * B * C * D = (A * B * C * D)       (3)
17. Matrix Multiplication (cont.)
Parallel m-way join :
Number of MR jobs for a chain of n matrices (here n = 5, m = 3):
• S2: $n - 1 = 4$
• P2: $\lceil \log_2 n \rceil = 3$
• PM: $\lceil \log_m n \rceil = 2$
Input : Relations M1, M2, …, Mn representing matrices
LIST_Mnext <= M1, M2, …, Mn
while |LIST_Mnext| > 1 do
    for i = 1 to |LIST_Mnext| do
        if (i mod m) == 1 then
            add Mi to LIST_Mleft
            Mleft = Mi
        else
            add Mi to LIST_Mright(Mleft)
        end if
    end for
    LIST_Mnext = doMR-PM(LIST_Mleft, LIST_Mright)
end while
Fig : Algorithm for PM join
Fig : Example of parallel 3 way join: M1 M2 M3 M4 M5 → <1st MR job> → M1 M4 → <2nd MR job> → M1 <result>
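The job counts for S2, P2, and PM can be checked with a short simulation of the grouping step. This is a sketch in plain Python under stated assumptions: it only tracks which matrices are grouped in each round and does not perform any join, and the names serial_jobs, count_rounds, and pm_rounds are hypothetical.

```python
def serial_jobs(n):
    """S2: one MR job per two-way join, so n - 1 jobs for n matrices."""
    return n - 1


def count_rounds(n, m):
    """Rounds needed when each round joins groups of up to m adjacent matrices (m >= 2)."""
    rounds = 0
    while n > 1:
        n = -(-n // m)  # ceiling of n / m: number of groups left after this round
        rounds += 1
    return rounds


def pm_rounds(names, m):
    """Group the chain into runs of m adjacent matrices per round, as in the PM loop."""
    rounds = []
    current = list(names)
    while len(current) > 1:
        # Every m-th matrix starts a new group (the 'left' matrix of that group).
        groups = [current[i:i + m] for i in range(0, len(current), m)]
        rounds.append(groups)
        current = [
            "(" + "*".join(g) + ")" if len(g) > 1 else g[0] for g in groups
        ]
    return rounds, current[0]


chain = ["M1", "M2", "M3", "M4", "M5"]
rounds, result = pm_rounds(chain, 3)
```

For the five-matrix chain, count_rounds(5, 2) and count_rounds(5, 3) give the P2 and PM job counts, and pm_rounds shows that the first round forms the groups (M1, M2, M3) and (M4, M5) before the second round combines them.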
18. Matrix Multiplication (cont.)
Efficiency of m-way join :
• Fewer MR job iterations
• Shorter execution time
Limitation :
• Join key number
• Greater network and sorting overhead
Fig : PM Join key
19. Future study and Proposed Work
Future study :
• Amazon EC2
• Apache Whirr tools
• Larger graph datasets converted to matrices
• Hadoop, more papers
Proposed work :
• PM with the raw key.
• This improvement should reduce the number of duplications and
increase the diversity of the join key.
• MapReduce framework that does not perform sort operations in
mappers.
20. Conclusion
In this presentation, I explained how the multiplication of matrices
can be expressed as a multi-way join operation, and described three
algorithms: S2, P2, and PM.
The parallel m-way join operation can improve the performance of
the matrix chain multiplication process.
However, using the composite key introduces a number of
disadvantages, such as greater network and sorting overhead.
Finally, I propose a parallel m-way join operation with a raw key to
make chain multiplication more efficient.
21. References
Apache Hadoop. Website. http://hadoop.apache.org
http://www.sas.com/en_us/insights/big-data/hadoop.html
Zikopoulos, P. C., Eaton, C., DeRoos, D., Deutsch, T., & Lapis, G.
(2012). Understanding big data. New York: McGraw-Hill.
Myung, J., & Lee, S. G. (2012, February). Matrix chain
multiplication via multi-way join algorithms in MapReduce. In
Proceedings of the 6th International Conference on Ubiquitous
Information Management and Communication (p. 53). ACM.
Dean, J., & Ghemawat, S. MapReduce: Simplified data processing
on large clusters.