Proc. of Int. Conf. on Advances in Computer Science, AETACS

Distributed Algorithm for Frequent Pattern Mining
using Hadoop MapReduce Framework
Suhasini A. Itkar1, Uday V. Kulkarni2
1 PES Modern College of Engineering, Pune, India
Email: Suhasini.a.itkar@gmail.com
2 SGGS Institute of Engineering and Technology, Nanded, India
Email: kulkarniuv@yahoo.com

Abstract—With the rapid growth of information technology and of many business
applications, mining frequent patterns and finding associations among them requires
handling large and distributed databases. As the FP-tree is considered the most compact data
structure for holding data patterns in memory, there have been efforts to make it parallel and
distributed to handle large databases. However, this incurs a large communication overhead
during mining. In this paper a parallel and distributed frequent pattern mining algorithm
using the Hadoop MapReduce framework is proposed, which shows strong performance results
for large databases. The proposed algorithm partitions the database in such a way that it works
independently at each local node and locally generates the frequent patterns by sharing the
global frequent pattern header table. These local frequent patterns are merged at the final stage.
This reduces the communication overhead both during structure construction and
during pattern mining. The itemset count is also taken into consideration, reducing
processor idle time. The Hadoop MapReduce framework is used effectively in all steps of the
algorithm. Experiments are carried out on a PC cluster with 5 computing nodes and
show execution time efficiency compared to other algorithms. The experimental results
show that the proposed algorithm efficiently handles scalability for very large databases.
Index Terms—Distributed computing, Parallel processing, Hadoop MapReduce framework,
Frequent pattern mining, Association rules, Data mining

I. INTRODUCTION
In data mining, research in the field of distributed and parallel mining is gaining popularity as there is
a tremendous increase in the size of databases stored in data warehouses. Frequent pattern mining is one of the
important areas of data mining, used to find associations in databases.
The Apriori [1] algorithm was the first efficient algorithm for finding frequent patterns. It is based on a
candidate-set generate-and-test approach and, in the worst case, requires one database scan per itemset
length. Many parallel and distributed algorithms have been developed using the Apriori principle to improve its
efficiency. Agrawal and Shafer [2] proposed in 1996 algorithms such as count distribution, data distribution and
candidate distribution based on parallel and distributed processing; their paper discusses communication,
computation efficiency, memory usage and synchronization. Zaki [3] in 1999 discussed
various aspects of parallel and distributed association rule mining algorithms. Other parallel and distributed
algorithms have also tried to improve performance based on the Apriori principle, such as the Trie-tree
[Bodon, 2003] based parallel and distributed algorithm [4], the DPA (Distributed Parallel Apriori) algorithm [5],
a MapReduce programming model based Apriori algorithm using the Hadoop framework [6], parallel data mining
on shared memory systems [14] and the tree projection algorithm [15]. All these algorithms still have a
performance bottleneck of either multiple database scans or a huge search space, or both.
DOI: 02.AETACS.2013.4.123
© Association of Computer Electronics and Electrical Engineers, 2013
In 2000 Han et al. proposed the FP-Growth [7] algorithm together with a more compact tree structure
called the FP-tree. It overcomes the drawbacks of the Apriori algorithm by reducing the search space and
requiring only two database scans; the FP-growth procedure then mines frequent patterns from the tree.
FP-growth suffers performance issues with huge databases. To improve the performance of FP-tree based
algorithms, a new area of research was introduced with distributed and parallel processing, as discussed in a
survey paper on frequent pattern mining [16]. Algorithms proposed in this area tried to reduce IPC cost and
memory storage, and to use the computing power of processors efficiently with a load balancing approach.
Some FP-tree based distributed algorithms are the PP tree [8], parallel TID-based FPM [9], the distributed
DH-TRIE frequent pattern mining algorithm [10], and efficient frequent pattern mining algorithms in
many-task computing environments [17].
As there is huge demand in data mining to process terabytes or petabytes of information spread across the
internet, a more powerful framework for distributed computing was needed. With the emergence of the
cloud computing environment, the MapReduce framework patented by Google gained popularity because
of its ease of parallel processing and its capability for distributing data and load balancing. Hadoop is an
open source implementation of the MapReduce framework.
This paper proposes a distributed and parallel frequent pattern mining algorithm (DPFPM) using the
Hadoop MapReduce framework. The proposed algorithm tries to reduce communication cost among
processors and to speed up the mining process. It efficiently handles scalability for very large databases
and shows strong performance results using the MapReduce framework on a Hadoop cluster.
The remainder of this paper is organized as follows: Section II reviews FP-tree based parallel and distributed
algorithms. Section III describes the Hadoop MapReduce framework. Section IV introduces a novel distributed
and parallel frequent pattern mining algorithm (DPFPM) using the Hadoop MapReduce framework. Section V
shows experimental results and a comparison of the proposed algorithm with existing approaches. Finally,
Section VI draws conclusions from this study and discusses future scope.
II. RELATED WORK
The frequent pattern mining problem is to find all frequent k-itemsets, i.e., all itemsets whose support is at
least the minimum support threshold ξ:

support(X) >= ξ, for a given threshold ξ (1 ≤ ξ ≤ |DB|)
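The support condition above can be checked directly. A minimal Python sketch (the function names `support` and `is_frequent` are illustrative, not from the paper):

```python
# Count the support of an itemset in a transaction database and test it
# against a minimum support threshold (the absolute count form used above).
def support(itemset, db):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in db if set(itemset) <= set(t))

def is_frequent(itemset, db, min_sup):
    return support(itemset, db) >= min_sup

db = [["f", "a", "c", "d"], ["a", "b", "c", "f"], ["b", "f", "h"], ["b", "c", "k", "p"]]
print(support(["c", "f"], db))        # {c, f} occurs in the first two transactions -> 2
print(is_frequent(["c", "f"], db, 3)) # 2 < 3 -> False
```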
Apriori-based algorithms tried to solve the FPM problem but faced a performance bottleneck of either
multiple database scans or a huge search space, or both. Prefix-tree based algorithms like FP-growth tried to
optimize performance but suffer with huge databases and small support values, where mining takes more
time or fails to execute. In this section previous work in the field of distributed and parallel frequent pattern
mining is covered; distributed algorithms based on prefix-tree structures and the MapReduce framework are
discussed.
The PP tree based parallel and distributed algorithm [8] uses a parallel pattern (PP) tree, which requires only
one database scan to construct and thereby reduces I/O cost. It assumes a distributed memory architecture
where each node holds all resources locally. The algorithm is divided into three phases. Phase I (PP tree
construction): the original database is partitioned horizontally in any canonical order and distributed to all
nodes; each node then performs three steps. Step 1: construct a local PP tree in one insertion pass, generating
a local header table at each node. Step 2: collect all local header tables into a global header table array (HTA)
and reorganize this global HTA in frequency-descending order. Step 3: restructure the tree at each node with
the help of the global HTA, yielding the final local PP tree. Phase II (local mining): each local PP tree is mined
individually in parallel, using the FP-Growth mining algorithm, to generate a set of potential global frequent
patterns (FPpg). For each frequent pattern x, the local PP tree first constructs an itemset-based prefix pattern
tree PT(x) by collecting all prefix sub-paths of x from the original local PP tree; from this tree a new
conditional tree CT(x) is generated that contains only potentially global items, pruning all infrequent items.
Each local node then generates its set of frequent patterns (FPpg) in parallel without any IPC cost. Phase III
(generating global frequent patterns): the master node collects the local frequent patterns (FPpg) and
sequentially accumulates the supports of all local patterns. The resulting FPpg list may contain false positives
but no false negatives; the master node removes all infrequent patterns from the list, using a hash-based
technique to speed up the search, and produces the final global frequent patterns.
Although this algorithm reduces inter-process communication cost with an efficient mining algorithm and
reduces I/O cost by performing a single database scan, it still requires considerable time for the insertion and
restructuring phases while constructing the local PP trees. The original database is partitioned but not pruned
of infrequent 1-itemsets, and it is used throughout the algorithm, which may consume more memory for huge
databases.
The TPFP-tree and BTP-tree based parallel and distributed algorithms [9] are two mining algorithms based on
the Tidset concept: the Tidset-based parallel FP tree (TPFP tree) algorithm is proposed for cluster
environments, and the balanced Tidset-based parallel FP tree (BTP tree) algorithm for grid environments. In
both algorithms a Tidset (transaction identifier set) is used to select transactions directly instead of scanning
the whole database; the goal is to reduce IPC cost and tree insertion cost, thus decreasing execution time. The
TPFP tree algorithm targets homogeneous clusters, distributing load evenly to all nodes without considering
processor capacity; its objective is to reduce the execution time of mining information exchange and to
shorten the index cost of transaction extraction. The BTP tree algorithm targets heterogeneous grid
environments, where it tries to balance the load: it follows the same steps as TPFP with one additional step
that evaluates the performance index of the computing nodes and distributes the mining sets accordingly.
This algorithm performs better than the PFP tree and TPFP tree in grid environments.
In the paper on MapReduce as a programming model for association rules on Hadoop [6], the Apriori
algorithm is improved using the MapReduce model to handle huge datasets in a Hadoop distributed
computing environment. The algorithm first finds the frequent 1-itemsets in a first database scan. It then uses
the MapReduce model, where each mapper generates a subset of the candidate itemsets; these candidates are
passed to reducers, which prune them based on the minimum support threshold and generate a subset of the
frequent itemsets. Multiple MapReduce iterations are necessary to produce all frequent itemsets, and at the
final stage the output of all reducers is combined. Although this work implements a parallel Apriori
algorithm, it suffers from the drawbacks of multiple database scans and huge candidate set generation.
A distributed DH-TRIE frequent pattern mining algorithm [10] is proposed using the Hadoop MapReduce
programming model, based on the Dynamic Hash (DH) Trie algorithm. Three problems are in focus
(globalization, random-write and duration), which the authors try to solve using the Java Persistence API
(JPA) and by parallelizing the algorithm with the MapReduce programming model. In the first phase, two
database scans are assigned to map tasks, which build the DH-TRIE. In the second phase, mining is carried
out with a breadth-first search model, using a queue instead of recursive processing, by multiple map tasks
where the number of mappers equals the number of elements in the queue. Although this algorithm addresses
globalization, random-write and duration, it still has drawbacks: in the first phase the two database scans are
handled by map tasks only, with no reducer used to construct the FP-tree, and in the second phase mining is
also carried out by map tasks handling sub-FP-trees, which slows down the mining process.
Therefore, this paper proposes the DPFPM algorithm to speed up frequent pattern mining for huge datasets
by distributing the data efficiently and executing tasks in parallel. The proposed approach tries to overcome
the drawbacks of previous work and to improve performance effectively.
III. HADOOP MAPREDUCE FRAMEWORK

Hadoop is an open source programming framework for running applications on large clusters built of
commodity hardware [11]. Hadoop is derived from Google's MapReduce model and the Google File System.
A Hadoop cluster partitions and distributes data and computation across multiple nodes and executes
computations in parallel, providing applications with both reliability and data motion. It provides the Hadoop
distributed file system (HDFS) [12], which stores data on the compute nodes, providing very high aggregate
bandwidth across the cluster.
Hadoop implements a computational paradigm named MapReduce [13], in which the application is divided
into many small fragments of work, each of which may be executed or re-executed on any node in the cluster.
Its goals are processing large data sets, high throughput, load balancing and failure handling. The MapReduce
framework executes user-defined MapReduce jobs across the nodes of the cluster.
A MapReduce computation has two phases, a map phase and a reduce phase. The input to the computation is
a data set of key/value pairs. The framework partitions the input key/value pairs into chunks and runs map()
tasks in parallel. After all map() tasks are complete, it consolidates all the values emitted for each unique key:

map(k1, v1) → list(k2, v2)

It then partitions the space of output map keys and runs reduce() tasks in parallel. The reducers generate the
output and store it in HDFS:

reduce(k2, list(v2)) → list(v2)

Tasks in each phase are executed in a fault-tolerant manner, and having many map and reduce tasks enables
good load balancing. The framework provides an interface that is independent of the backend technology.
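The map/shuffle/reduce flow described above can be illustrated with a small in-process Python simulation (a toy word count, not Hadoop itself; the function `run_mapreduce` and all names in it are illustrative):

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    # Map phase: each input record emits zero or more (key, value) pairs.
    intermediate = defaultdict(list)
    for rec in records:
        for k, v in mapper(rec):
            intermediate[k].append(v)   # shuffle: group values by key
    # Reduce phase: one reducer call per unique key.
    return {k: reducer(k, vs) for k, vs in intermediate.items()}

# Word count as the canonical MapReduce example.
lines = ["map reduce map", "reduce reduce"]
counts = run_mapreduce(
    lines,
    mapper=lambda line: [(w, 1) for w in line.split()],
    reducer=lambda k, vs: sum(vs),
)
print(counts)  # {'map': 2, 'reduce': 3}
```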
IV. DISTRIBUTED AND PARALLEL FREQUENT PATTERN MINING ALGORITHM (DPFPM)

The proposed DPFPM algorithm uses the Hadoop MapReduce programming framework for frequent pattern
mining. In frequent pattern mining, the tasks of generating conditional frequent patterns and finding global
frequent patterns can be done in parallel by distributing the data, so the MapReduce framework is well suited
to it. In the proposed algorithm the MapReduce task is used twice, once for support counting and then for
generating frequent patterns, taking full benefit of the MapReduce framework. It achieves distributed system
goals such as high availability, high throughput, load balancing and fault tolerance by using the map and
reduce functionality efficiently and effectively.
The proposed DPFPM algorithm has five steps. In the first step it divides the transaction database DB. In the
second step it constructs the Global Frequent Header Table (GFHTable) using MapReduce tasks for the first
time. The third step generates conditional frequent patterns using map tasks a second time. In the fourth step
each reducer combines all conditional frequent patterns with the same key to generate frequent sets. In the
fifth step the outputs of all reducers are aggregated to generate the final frequent sets.
Details of all steps are as follows.
A. Divide transaction database DB
Divide the complete transaction database DB among the available processors so that each processor works on
its allocated partition DBi in parallel: DB = {DB1, DB2, . . . , DBn}, where n is the number of processors
available to distribute the problem.
B. Construction of GFHTable
Using the MapReduce programming model, the support count of each item appearing in the locally allotted
DBi is calculated. The locally constructed header tables are then merged to populate the GFHTable on the
master (job) node.
Algorithmic steps for construction of the GFHTable are described below.
Step 1: The Hadoop stack starts as many mappers as possible to utilize the available infrastructure fully.
Step 2: The complete DB is divided among the mappers, and its transactions are read by the Mapper
function of the Hadoop stack.
Step 3: Function GFHTMapper performs the following steps, taking a transaction from DB as input.
Step 3.1: foreach item in transaction Ti
Step 3.1.1: output the item and a frequency count of 1
Step 4: Hadoop passes the mapper output to the reducers it creates; the number of reducers depends
on the number of groups (distinct items/elements).
Step 5: Function GFHTReducer performs the following steps, taking as input the counts grouped by
element name.
Step 5.1: long sum = 0;
Step 5.2: foreach count record for the element
Step 5.2.1: sum = sum + record's frequency count
Step 5.3: each reducer outputs the element name and its global frequency count
18
Step6:
Function
aggregator
prune
infrequent
items
from
GFHTable
based
on
ℎ ℎ
Inputto aggregator function is
ℎ ℎ
and output
generated by GHTReducer Step 6.1: foreach element from GHTReducer
Step: 6.1.1: if (element.frequencyCount>= )
Step 6.1.2: putInGFHTable(element);
The proposed methodology for constructing the GFHTable reduces the communication overhead of
calculating the support of all items. It first calculates frequency counts locally on each processor's partition in
parallel, as described in Steps 1 to 5, which reduces message exchange among processors. Only then are all
local frequency counts accumulated at the master node to construct the global header. As shown in Step 6, all
infrequent items are pruned from the table, and the global frequent pattern header table (GFHTable) is
produced as the final output. This step effectively uses the map and reduce functionality to speed up the
mining operation.
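Steps 1 to 6 above can be simulated in-process with a short Python sketch (illustrative only, not the paper's Hadoop/Java implementation; `build_gfhtable`, `partitions` and `min_sup` are names chosen here):

```python
from collections import Counter

def build_gfhtable(partitions, min_sup):
    """Simulate GFHTMapper, GFHTReducer and the aggregator on local partitions."""
    # Map: each partition's mapper emits (item, 1) for every item occurrence.
    emitted = []
    for db_i in partitions:
        for transaction in db_i:
            emitted.extend((item, 1) for item in transaction)
    # Reduce: the shuffle groups identical items; each reducer sums the counts.
    counts = Counter()
    for item, one in emitted:
        counts[item] += one
    # Aggregate: prune infrequent items and order by descending support.
    return {i: c
            for i, c in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
            if c >= min_sup}

# The five transactions of Table I, split across two simulated processors.
partitions = [
    [["f","a","c","d","g","i","m","p"], ["a","b","c","f","l","m","o"]],
    [["b","f","h","j","o"], ["b","c","k","s","p"], ["a","f","c","e","l","p","m","n"]],
]
print(build_gfhtable(partitions, min_sup=3))
# {'c': 4, 'f': 4, 'a': 3, 'b': 3, 'm': 3, 'p': 3}  -- matches Table II
```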
C. Construction of Conditional Frequent Patterns

As the GFHTable is lightweight in terms of memory, it is shared by all nodes of the cluster. Each cluster node
is responsible for rearranging its DB transactions in line with the GFHTable and pruning the elements
accordingly. Unlike standalone algorithms such as FP-growth, the proposed algorithm does not create a
global tree: the task of generating frequent patterns is distributed among processors, and the generation of
conditional patterns is done in parallel.
Here the Mapper and Reducer perform different tasks than the traditional scatter and gather functionalities.
The Mapper performs the following steps:
i. It prunes the DBi transactions so that the transaction items conform to the available GFHTable.
ii. According to the available GFHTable, it processes DBi's transactions and outputs key-value pairs,
where the key is the conditional element for which the conditional frequent pattern set is constructed,
as described in Figure 1.
The algorithmic steps for FPM_Mapper include the functions described below. FPM_Mapper is the main
function; it calls the generateConditionalPattern sub-function. FPM_Mapper runs on all processors, with
multiple mappers running on each processor. Conditional patterns are generated directly for each item on
each mapper as the output of this function.
Function – FPM_Mapper to prune transactions and generate conditional frequent patterns

Function FPM_Mapper
input: GFHTable, transactions Ti from DBi
output: conditional frequent patterns
# GFHTable is held as a local object so the current Mapper can access it faster
Steps:
foreach transaction Ti from DBi
    patternElements = new list;
    foreach element of transaction Ti
        if (checkIfElementExistsInGFHTable(element))
            patternElements.add(element);
        endif
    endfor
    # order patternElements by support frequency
    Prune-order(patternElements);
    # generate conditional patterns
    generateConditionalPattern(patternElements);
endfor

Function – generate conditional frequent patterns
function generateConditionalPattern
input: patternElements
output: conditional prefix-tree for each element
Steps:
foreach element from patternElements
    construct the conditional prefix-tree for the element
    HadoopContext.write(element, conditionalPrefixTree);
endfor
FPM_Mapper first prunes infrequent items from the transactions of the local database by checking against
the GFHTable. It then calls the generateConditionalPattern sub-function, which generates the local
conditional frequent patterns.
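The pruning, reordering and conditional-pattern generation performed by FPM_Mapper can be sketched in Python (an illustrative simulation: `fpm_mapper` and `gfh_order` are names chosen here, and each item's conditional pattern is emitted as a simple prefix list rather than a prefix tree):

```python
def fpm_mapper(transaction, gfh_order):
    """Prune items absent from the GFHTable, reorder the rest by global
    frequency, and emit (item, conditional-pattern) pairs per item."""
    rank = {item: i for i, item in enumerate(gfh_order)}
    kept = sorted((x for x in transaction if x in rank), key=rank.get)
    # Each kept item's conditional pattern is the prefix that precedes it.
    return [(item, kept[:i]) for i, item in enumerate(kept)]

gfh_order = ["c", "f", "a", "b", "m", "p"]  # GFHTable order of Table II
print(fpm_mapper(["f","a","c","d","g","i","m","p"], gfh_order))
# matches Table III row 1: c: None, f: c, a: c f, m: c f a, p: c f a m
```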
Figure 1 shows the architecture diagram of frequent pattern generation; it does not include the steps of
GFHTable construction. The file system shown in the figure is the Hadoop file system (HDFS), which is used
for storing the intermediate and final results of each mapper and reducer. HDFS provides reliability for
storing huge datasets and streams intermediate results with high bandwidth to all nodes. The proposed
architecture helps the algorithm mine frequent patterns efficiently using the MapReduce framework.

Figure 1. Architecture diagram of the DPFPM algorithm using the MapReduce framework

D. Generation of Frequent Pattern Sets
When all mapper instances have finished constructing conditional frequent patterns, the MapReduce
framework groups all corresponding conditional frequent patterns by the key element for which they were
constructed.
The Reducer performs the following step: each reducer instance combines the different conditional frequent
patterns for the same element key and generates the frequent pattern sets as output.
Function FPM_Reducer
input: conditional frequent patterns grouped by element name
output: frequent pattern set
Steps:
cfpHead = new ConditionalFP();
foreach conditionalPattern from groupOfConditionalPatterns
    if (first conditionalPattern)
        cfpHead = conditionalPattern;
    else
        # merge the already available pattern elements with the incoming
        # matched element patterns, increment the frequency counts of
        # similar elements, and update the conditional FP head node
        cfpHead.intersect(conditionalPattern);
    endif
endfor
HadoopContext.write(groupName /*element name*/, cfpHead /*data*/);

Each reducer combines conditional patterns based on the element key, as shown in the above algorithm.
After combining all the patterns for each key, the respective reducers prune infrequent itemsets based on the
minimum support threshold ξ. After pruning, each reducer generates the frequent sets for its keys locally at
each processor. This step reduces the cost of inter-process communication, since the frequent sets for each
key are generated locally.
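The merge-and-prune behavior of FPM_Reducer can be simulated with item counts instead of tree intersection (illustrative Python; `fpm_reducer` is a name chosen here, and the minimum support threshold 3 is the one used in the paper's example):

```python
from collections import Counter

def fpm_reducer(key, conditional_patterns, min_sup):
    """Merge all conditional patterns grouped under one element key and
    prune items whose accumulated count falls below min_sup."""
    counts = Counter()
    for pattern in conditional_patterns:
        counts.update(pattern)
    return {key: sorted(i for i, c in counts.items() if c >= min_sup)}

# Conditional patterns grouped for keys 'm' and 'p' (from Table IV).
print(fpm_reducer("m", [["c","f","a"], ["c","f","a","b"], ["c","f","a"]], 3))
# {'m': ['a', 'c', 'f']}  -- 'b' appears only once and is pruned
print(fpm_reducer("p", [["c","f","a","m"], ["c","b"], ["c","f","a","m"]], 3))
# {'p': ['c']}  -- only 'c' reaches the threshold
```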
The DPFPM algorithm is explained with the following example.
Example: the input transaction database DB contains 5 transactions, as shown in Table I. Each transaction
contains m elements of the I items. A minimum support threshold ξ = 3 is used for the example.
TABLE I. TRANSACTION DATABASE DB

Transaction ID   Transactions
1                f, a, c, d, g, i, m, p
2                a, b, c, f, l, m, o
3                b, f, h, j, o
4                b, c, k, s, p
5                a, f, c, e, l, p, m, n
The GFHTable generated using the GFHTMapper and GFHTReducer functions is shown in Table II.
TABLE II. GFHTABLE

Item Identifier   Support count
c                 4
f                 4
a                 3
b                 3
m                 3
p                 3
After construction of the GFHTable using the MapReduce framework for the first time, the next step is
carried out using the FPM_Mapper and FPM_Reducer functions, where the MapReduce framework is used a
second time.
Example – FPM_Mapper functions in the following way: transactions are sent to mappers, where sorting and
pruning of transactions based on the GFHTable take place locally on each processor. Each mapper outputs
(key: conditional pattern) pairs.
TABLE III. FPM-MAPPER OUTPUTS

Mapper input (DB transaction)   Sorted and pruned transaction   Mapper output (key element: conditional pattern)
f, a, c, d, g, i, m, p          c, f, a, m, p                   p: c f a m | m: c f a | a: c f | f: c | c: None
a, b, c, f, l, m, o             c, f, a, b, m                   m: c f a b | b: c f a | a: c f | f: c | c: None
b, f, h, j, o                   f, b                            b: f | f: None
b, c, k, s, p                   c, b, p                         p: c b | b: c | c: None
a, f, c, e, l, p, m, n          c, f, a, m, p                   p: c f a m | m: c f a | a: c f | f: c | c: None
As shown in Table III, the conditional patterns of each item are created as the output of the mapper function
locally on each processor.
Example – FPM_Reducer functions in the following way: each reducer combines conditional patterns based
on the element key. After combining all the patterns for each key, the respective reducers prune infrequent
itemsets based on the minimum support threshold ξ = 3. After pruning, each reducer generates the frequent
sets for its keys locally at each processor. This step reduces the cost of inter-process communication, since
the frequent sets for each key are generated locally.
As shown in Table IV, the output of FPM_Mapper is passed to the reducers; for example, the first reducer
collects all patterns for element 'a', that is, a: {c, f} | {c, f} | {c, f}. It then prunes infrequent items from the
set based on ξ = 3; for this reducer a: {c, f} is entirely frequent, so no pruning is needed. Finally, the frequent
sets of each item are generated as reducer output locally on each processor.
TABLE IV. FPM-REDUCER OUTPUTS

Reducer combines conditional patterns by key   Pruned (ξ = 3)   Final frequent sets
a: {c, f} | {c, f} | {c, f}                    a: {c, f}        {a}, {a, f}, {a, f, c}, {a, c}
m: {c, f, a} | {c, f, a, b} | {c, f, a}        m: {a, f, c}     {m}, {m, a}, {m, a, f}, {m, a, f, c}, {m, f}, {m, a, c}, {m, f, c}, {m, c}
p: {c, f, a, m} | {c, b} | {c, f, a, m}        p: {c}           {p}, {p, c}
b: {f} | {c} | {c, f, a}                       b: {}            {b}
c: {}                                          c: {}            {c}
f: {c} | {c} | {c}                             f: {c}           {f}, {f, c}
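Consistent with the example in Table IV, the final frequent sets for each key are the key combined with every subset of its pruned conditional base. A Python sketch of this enumeration (illustrative; `frequent_sets` is a name chosen here, and the subset enumeration is valid for the single-path conditional bases of this example):

```python
from itertools import combinations

def frequent_sets(key, pruned_base):
    """Enumerate the frequent sets for one key: the key together with
    every subset of its pruned conditional base."""
    sets = []
    for r in range(len(pruned_base) + 1):
        for combo in combinations(pruned_base, r):
            sets.append({key, *combo})
    return sets

print(frequent_sets("p", ["c"]))           # the key alone, then the key with 'c'
print(len(frequent_sets("m", ["a","f","c"])))  # 8 sets, as in Table IV's 'm' row
```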

E. Aggregation
The results of the reducer step above are aggregated as the final outcome of the algorithm, producing the
frequent patterns of the complete DB passed as input to DPFPM. The aggregator function takes the output of
all reducers and generates the final frequent patterns. This step is carried out on the master node.
function aggregator
input: the output of all reducers
output: aggregated frequent pattern sets
# scan through all reducer outputs of the Hadoop stack
# read all the part-r-success-0000* files
cfpHeadList = null;
foreach file from foundFiles
    if (cfpHeadList == null)
        # read objects from the file
        cfpHeadList = fetchFrequentPatternSets(file);
    else
        # concatenate objects from the file
        cfpHeadList.addFrequentPatternSets(file);
    endif
endfor
foreach cfpNode from cfpHeadList
    # print the element and its frequent sets
    System.out.println(elementName, frequentSets);
endfor
The proposed DPFPM algorithm uses the MapReduce framework effectively to achieve its performance
goals in terms of execution efficiency and scalability.
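The aggregator's concatenation of reducer outputs can be sketched as follows (illustrative Python; in the real algorithm each part would be deserialized from a part-r-* file on HDFS, whereas here they are plain dictionaries):

```python
def aggregate(reducer_outputs):
    """Concatenate the frequent-set outputs of all reducers into the final
    result, as the aggregator step does on the master node."""
    final = {}
    for part in reducer_outputs:  # one dictionary per reducer output file
        final.update(part)        # element keys are disjoint across reducers
    return final

# Two simulated reducer outputs, keyed by element name.
parts = [{"a": [["a"], ["a", "f"]]}, {"m": [["m"], ["m", "c"]]}]
print(aggregate(parts))
```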
V. EXPERIMENTAL RESULTS
Experiments with the proposed DPFPM algorithm are performed on a Hadoop cluster of 5 nodes (1 master, 4
slaves). Each node has a Core2Duo processor (running at 2.60 GHz) and 2 GB of memory. Several real and
synthetic datasets are used to test the performance of the algorithm.
Experiments are performed with relative minimum support thresholds ranging from 2 to 10. To test
scalability, experiments are performed on clusters of 1 to 5 nodes. Performance results are compared with the
DH-Trie algorithm [10].
Comparison with other algorithms
TABLE V. COMPARISON OF ALGORITHMS

Algorithm   Number of Transactions   Frequent Pattern generation time
DH-Trie     1,00,00,000              1062.26 seconds
DPFPM       3,16,80,064              718.24 seconds
Comparison of the proposed DPFPM algorithm with the DH-Trie algorithm, as shown in Table V, indicates
that DPFPM is considerably faster, processing roughly three times as many transactions in less time. The
proposed algorithm also shows very good scalability when tested on clusters of 1 to 5 nodes. Figure 2 shows
experimental results for the dataset Kosarak1G.dat, which contains 3,16,80,064 transactions; experiments are
performed with a minimum support threshold of 4%. The graph in Figure 2 shows the number of
nodes/processors on the X-axis and the GFHTable and frequent pattern generation times in seconds on the
Y-axis.

Figure 2. Graph – Number of Nodes vs. Time
TABLE VI. READINGS FOR VARIOUS MINIMUM SUPPORTS

Minimum Support   Minimum count   GFHTable construction time   Frequent Pattern generation time
2                 633601          193.02 seconds               525.22 seconds
4                 1267202         198.03 seconds               444.06 seconds
5                 1584003         192.1 seconds                536.63 seconds
6                 1900803         190.97 seconds               534.27 seconds
8                 2534404         190.99 seconds               486.16 seconds
10                3168006         189.32 seconds               420.32 seconds

Table VI shows the readings for Kosarak1G.dat; graphs of these readings are plotted in Figure 3.

Figure 3. Graph – Minimum Support vs. Time: frequent pattern generation time for various support
thresholds (dataset Kosarak1G.dat, 3,16,80,064 transactions)
The graph in Figure 3 shows the minimum support count on the X-axis and the GFHTable and frequent
pattern generation times in seconds on the Y-axis.
The experimental results demonstrate the efficiency of the proposed DPFPM algorithm and its good
scalability with respect to various minimum supports.
VI. CONCLUSION
The proposed DPFPM algorithm shows strong performance results for large databases using the Hadoop
MapReduce framework. It implements parallel processing for both header table generation and frequent
pattern mining, speeding up execution. The algorithm partitions and distributes the database in such a way
that each local node works independently and in parallel, reducing communication overhead in both stages. It
does not construct a global tree; instead, local conditional patterns are given to reducers, which perform the
frequent pattern generation, reducing communication cost and speeding up the mining process. The
experimental results show that the parallel and distributed algorithm efficiently handles scalability for very
large databases.
REFERENCES
[1] R. Agrawal, T. Imielinski and A. N. Swami, "Mining association rules between sets of items in large databases", in
Proceedings of the ACM SIGMOD Conference on Management of Data, 1993, pp. 207–216.
[2] R. Agrawal and J. C. Shafer, "Parallel Mining of Association Rules: Design, Implementation and Experience,"
Technical Report TJ10004, IBM Research Division, Almaden Research Center, 1996.
[3] M. J. Zaki, "Parallel and Distributed Association Mining: A Survey," IEEE Concurrency, vol. 7(4), 1999, pp. 14–25.
[4] Y. Ye, C. Chiang, "A Parallel Apriori Algorithm for Frequent Itemsets Mining," Fourth International Conference on
Software Engineering Research, Management and Applications, 2006, pp. 87–94.
[5] K. M. Yu, J. Zhou, T. P. Hong and J. L. Zhou, "A load-balanced distributed parallel mining algorithm", Expert
Systems with Applications 37 (2010), pp. 2459–2464.
[6] XinYue Yang, Zhen Liu and Yan Fu, "MapReduce as a Programming Model for Association Rules Algorithm on
Hadoop", in: Information Sciences and Interaction Sciences (ICIS), 2010.
[7] J. Han, J. Pei and Y. Yin, "Mining frequent patterns without candidate generation", in Proceedings of the ACM
SIGMOD Conference on Management of Data, 2000, pp. 1–12.
[8] Syed Khairuzzaman Tanbeer, Chowdhury Farhan Ahmed and Byeong-Soo Jeong, "Parallel and Distributed Frequent
Pattern Mining in Large Databases", 11th IEEE International Conference on HPCC, 2009.
[9] Kun-Ming Yu, Jiayi Zhou, "Parallel TID-based frequent pattern mining algorithm on a PC cluster and grid
computing system", Expert Systems with Applications 37, pp. 2486–2494, 2010.
[10] Lai Yang, Zhongzhi Shi, Li Xu, Fan Liang and Ilan Kirsh, "DH-TRIE Frequent Pattern Mining on Hadoop using
JPA", IEEE International Conference on Granular Computing, 2011.
[11] Apache Hadoop. http://hadoop.apache.org/
[12] K. Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler, "The Hadoop Distributed File System", in
Proceedings of the IEEE 26th Symposium on Mass Storage Systems and Technologies, 2010.
[13] J. Dean, S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," in Proceedings of the 6th
Symposium on Operating Systems Design and Implementation, San Francisco, CA, Dec. 2004.
[14] S. Parthasarathy, M. J. Zaki, M. Ogihara and W. Li, "Parallel Data Mining for Association Rules on Shared-Memory
Systems", Knowledge and Information Systems, 3(1):1–29, February 2001.
[15] R. C. Agrawal, C. C. Agrawal, and V. V. V. Prasad, "A tree projection algorithm for generation of frequent
itemsets," Journal of Parallel and Distributed Computing, vol. 61, no. 3, pp. 350–371, March 2001.
[16] J. Han, Hong Cheng, Dong Xin and Xifeng Yan, "Frequent pattern mining: current status and future directions,"
Data Mining and Knowledge Discovery, vol. 15, no. 1, pp. 55–86, August 2007.
[17] Kawuu W. Lin and Yu-Chin Lo, "Efficient algorithms for frequent pattern mining in many-task computing
environments", Knowledge-Based Systems, 2013.

INTRODUCTION

In data mining, research in the field of distributed and parallel mining is gaining popularity as the sizes of databases stored in data warehouses are increasing tremendously. Frequent pattern mining is one of the important areas of data mining, used to find associations in databases. The Apriori algorithm [1] was the first efficient algorithm to find frequent patterns. It is based on a candidate generate-and-test approach and, in the worst case, requires multiple database scans, one per candidate itemset length. Many parallel and distributed algorithms were developed using the Apriori principle to improve the efficiency of the Apriori algorithm. In 1996, Agrawal and Shafer [2] proposed algorithms such as count distribution, data distribution and candidate distribution based on parallel and distributed processing; that work discusses communication, computation efficiency, memory usage and synchronization. In 1999, Zaki [3] surveyed various aspects of parallel and distributed association rule mining algorithms. Other parallel and distributed algorithms also tried to improve performance based on the Apriori principle, such as the Trie-tree [Bodon, 2003] based parallel and distributed algorithm [4], the DPA (Distributed Parallel Apriori) algorithm [5], the MapReduce
DOI: 02.AETACS.2013.4.123 © Association of Computer Electronics and Electrical Engineers, 2013
programming model based Apriori algorithm using the Hadoop framework [6], parallel data mining on shared-memory systems [14] and the tree projection algorithm [15]. All these algorithms still have a performance bottleneck of multiple database scans, a huge search space, or both. In 2000, Han et al. proposed the FP-growth algorithm [7], which introduced a more compact tree structure called the FP-tree. It overcomes the drawbacks of Apriori by reducing the search space and requiring only two database scans, and mines frequent patterns with the FP-growth procedure. However, FP-growth still suffers performance issues with huge databases. To improve the performance of FP-tree based algorithms, a new area of research based on distributed and parallel processing was introduced, as discussed in a survey of frequent pattern mining [16]. Algorithms proposed in this area tried to reduce inter-process communication (IPC) cost and memory usage, and to use the computing power of processors efficiently through load balancing. Some FP-tree based distributed algorithms are the PP tree [8], parallel TID-based FPM [9], the distributed DH-TRIE frequent pattern mining algorithm [10] and efficient frequent pattern mining algorithms in many-task computing environments [17]. As there is huge demand in data mining to process terabytes or petabytes of information spread across the internet, a more powerful framework for distributed computing was needed. With the emergence of cloud computing, the MapReduce framework patented by Google gained popularity because of its ease of parallel processing and its capabilities for distributing data and balancing load. Hadoop is an open source implementation of the MapReduce framework. This paper proposes a distributed and parallel frequent pattern mining algorithm (DPFPM) using the Hadoop MapReduce framework. The proposed algorithm tries to reduce the communication cost among processors and to speed up the mining process, and it efficiently handles scalability for very large databases.
It shows strong performance results for large databases using the MapReduce framework on a Hadoop cluster. The remainder of this paper is organized as follows: Section II reviews FP-tree based parallel and distributed algorithms. Section III describes the Hadoop MapReduce framework. Section IV introduces a novel distributed and parallel frequent pattern mining algorithm (DPFPM) using the Hadoop MapReduce framework. Section V shows experimental results and compares the proposed algorithm with existing approaches. Finally, Section VI draws conclusions from this study and discusses future scope.

II. RELATED WORK

The frequent pattern mining problem is to find all frequent itemsets, that is, all itemsets whose support is at least a minimum support threshold ξ: an itemset X is frequent if support(X) ≥ ξ, for a given threshold ξ (1 ≤ ξ ≤ |DB|). Apriori-based algorithms tried to solve the FPM problem but faced a performance bottleneck of multiple database scans, a huge search space, or both. Prefix-tree based algorithms like FP-growth tried to optimize performance but suffer with huge databases and small support values, where mining takes more time or fails to execute. This section covers previous work in the field of distributed and parallel frequent pattern mining, discussing distributed algorithms based on prefix-tree structures and the MapReduce framework.

The PP-tree based parallel and distributed algorithm [8] uses a PP tree (parallel pattern tree) that requires only one database scan to construct, which reduces I/O cost. It assumes a distributed-memory architecture where each node holds all resources locally. The algorithm is divided into three phases. Phase I: accept the database contents in horizontal partitions in any canonical order, construct the tree in one scan, and then restructure it in global frequency-descending order of items.
Phase II: Local mining – individually mine the local PP trees in parallel to discover potential global frequent patterns (FPpg). Phase III: a final sequential step that collects frequent patterns from all local PP trees and generates the global frequent patterns.

Phase I (PP-tree construction) first partitions the original database and distributes it to all nodes, then performs three steps at each local node. Step I: construct the local PP tree in an insertion stage, generating a local header table at each node. Step II: collect all local header tables into a global header table array (HTA), which is reorganized in frequency-descending order. Step III: carry out the restructuring phase at each node with the help of the global HTA, producing the final local PP tree.

Phase II (mining of local PP trees) generates the set of potential frequent patterns (FPpg) using the FP-growth mining algorithm. For this, the local PP trees first construct an itemset-based prefix pattern tree PT(x) for each frequent pattern x, built by collecting all prefix sub-paths of x from the original local PP tree. From each such tree a new conditional tree CT(x) is generated, which
contains only potential global items, pruning all infrequent items. The set of frequent patterns (FPpg) is then generated at each local node in parallel without any IPC cost.

Phase III (generating global frequent patterns): the master node takes the local frequent patterns (FPpg) and sequentially accumulates the supports of all local FPpg. The final FPpg list may contain false positives but no false negatives. The master node then removes all infrequent patterns from the FPpg list and generates the final global frequent patterns, using a hash-based technique to speed up the search in the FPpg list. Although this algorithm reduces inter-process communication cost with an efficient mining algorithm and reduces I/O cost by performing a single database scan, it still requires considerable time for the insertion and restructuring phases while constructing the local PP trees. Moreover, the original database is partitioned but not pruned of infrequent 1-itemsets, and only the original database is used throughout the algorithm, which may consume more memory for huge databases.

In the TPFP-tree and BTP-tree based parallel and distributed algorithms [9], two mining algorithms are proposed based on the Tidset concept: the Tidset-based parallel FP-tree (TPFP-tree) algorithm for cluster environments and the balanced Tidset-based parallel FP-tree (BTP-tree) algorithm for grid environments. In both algorithms a Tidset (transaction identifier set) is used to select transactions directly instead of scanning the whole database; the goal is to reduce IPC cost and tree insertion cost, thus decreasing execution time. The TPFP-tree algorithm targets homogeneous clusters, where the load is distributed evenly across all nodes without considering processor capacity.
Its objective is to reduce the execution time of exchanging mining information and to shorten the indexing cost of transaction extraction. The BTP-tree algorithm targets heterogeneous grid environments, where it tries to balance the load. BTP follows the same steps as TPFP, with one additional step that evaluates a performance index for each computing node; depending on this index it distributes the mining sets. This algorithm performs better than the PFP tree and TPFP tree in a grid environment.

In the paper "MapReduce as a Programming Model for Association Rules Algorithm on Hadoop" [6], the Apriori algorithm is adapted to the MapReduce model to handle huge datasets in a Hadoop distributed computing environment. It first finds frequent 1-itemsets in the first database scan. Then the MapReduce model is applied: each mapper generates a subset of candidate itemsets, which are passed to reducers that prune candidates based on the minimum support threshold and generate a subset of frequent itemsets. Multiple MapReduce iterations are necessary to produce the frequent itemsets, and at the final stage the output of all reducers is combined to generate the final frequent itemsets. Although this parallelizes Apriori, it still suffers from the drawbacks of multiple database scans and huge candidate set generation.

The distributed DH-TRIE frequent pattern mining algorithm [10] uses the Hadoop MapReduce programming model and is based on the Dynamic Hash (DH) Trie algorithm. It focuses on three problems, globalization, random-write and duration, which it tackles using the Java Persistence API (JPA) and by parallelizing the algorithm with the MapReduce programming model. In the first phase, two database scans are assigned to map tasks, which build the DH-TRIE. In the second phase, mining is carried out in a breadth-first search manner, using a queue instead of recursive processing.
This is carried out by multiple mapper tasks, where the number of mappers equals the number of elements in the queue. Although this algorithm addresses the three problems of globalization, random-write and duration, it still has drawbacks: in the first phase the two database scans are handled only by map tasks, with no reducer used to construct the FP-tree, and in the second phase mining is also carried out by map tasks handling sub FP-trees, which slows down the mining process. Therefore, this paper proposes the DPFPM algorithm to speed up frequent pattern mining for huge datasets by distributing the data efficiently and executing tasks in parallel. The proposed approach tries to overcome the drawbacks of previous work and improve performance effectively.

III. HADOOP MAP-REDUCE FRAMEWORK

Hadoop is an open source programming framework for running applications on large clusters built of commodity hardware [11]. Hadoop is derived from Google's MapReduce model and the Google File System. A Hadoop cluster partitions and distributes data and computation across multiple nodes and executes computations in parallel, providing applications with both reliability and data motion. It provides the Hadoop Distributed File System (HDFS) [12], which stores data on the compute nodes, providing very high aggregate bandwidth across the cluster.
Hadoop implements a computational paradigm named MapReduce [13], in which an application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. Its goals are processing large data sets, high throughput, load balancing and failure handling. The MapReduce framework executes user-defined MapReduce jobs across the nodes in the cluster. A MapReduce computation has two phases, a map phase and a reduce phase, and its input is a data set of key/value pairs. The framework partitions the input key/value pairs into chunks and runs map() tasks in parallel:

map(k1, v1) → list(k2, v2)

After all map() tasks are complete, it consolidates all emitted values for each unique emitted key, partitions the space of output map keys, and runs reduce() tasks in parallel:

reduce(k2, list(v2)) → list(v2)

Each reducer generates output and stores it in HDFS. Tasks in each phase are executed in a fault-tolerant manner, and having many map and reduce tasks enables good load balancing. The framework provides an interface which is independent of the backend technology.

IV. DISTRIBUTED AND PARALLEL FREQUENT PATTERN MINING ALGORITHM (DPFPM)

The proposed DPFPM algorithm uses the Hadoop MapReduce programming framework for frequent pattern mining. In frequent pattern mining, the tasks of generating conditional frequent patterns and finding global frequent patterns can be done in parallel by distributing the data, so the MapReduce framework is well suited to it. In the proposed algorithm the MapReduce model is used twice, once for support counting and then for generating frequent patterns, taking further benefit of the framework. It achieves distributed-system goals such as high availability, high throughput, load balancing and fault tolerance by using map and reduce functionality efficiently and effectively. The proposed DPFPM algorithm has five steps. In the first step it divides the transaction database DB. In the second step it constructs the Global Frequent Header Table (GFHTable) using MapReduce tasks for the first time.
In the third step, conditional frequent patterns are generated using map tasks for the second time. In the fourth step, each reducer combines all conditional frequent patterns with the same key to generate frequent sets. In the fifth step, the output of all reducers is aggregated to generate the final frequent sets. Details of all steps are as follows.

A. Divide transaction database DB
Divide the complete transaction database DB among the available processors so that each individual processor works on its allocated DBi in parallel:

DB = {DB1, DB2, ..., DBn}, where n is the number of processors available to distribute the problem.

B. Construction of GFHTable
Using the MapReduce programming model, the support count of each item appearing in the locally allotted DBi is calculated. Then all locally constructed header tables are merged to populate the GFHTable on the master (job) node. Algorithmic steps for construction of the GFHTable are described below.

Step 1: The Hadoop stack starts as many mappers as possible to utilize the available infrastructure to the maximum.
Step 2: The complete DB is divided among the mappers, and its transactions are read by the Mapper function of the Hadoop stack.
Step 3: Function GFHTMapper performs the following steps, taking a transaction Ti from DB as input.
  Step 3.1: foreach item in transaction Ti
    Step 3.1.1: output the element and a frequency count of 1
Step 4: Hadoop passes the mapper output to the reducers it creates; the number of reducers depends on the number of groups (distinct items/elements).
Step 5: Function GFHTReducer performs the following steps, taking as input elements grouped by element name.
  Step 5.1: long sum = 0;
  Step 5.2: foreach element record
    Step 5.2.1: sum = sum + element's frequency count
  Step 5.3: each reducer started by the Hadoop stack outputs the element name and its global frequency count
Step 6: Function aggregator prunes infrequent items from the GFHTable based on the minimum support threshold ξ. Input to the aggregator function is ξ and the output generated by GFHTReducer.
  Step 6.1: foreach element from the GFHTReducer output
    Step 6.1.1: if (element.frequencyCount >= ξ)
    Step 6.1.2: putInGFHTable(element);

The proposed methodology for constructing the GFHTable reduces the communication overhead of calculating the support of all items. It first calculates frequency counts of the local databases on each processor in parallel, as described in steps 1 to 5, which reduces message exchange among processors. Only then are all local frequency counts accumulated at the master node to construct the global header. As shown in step 6, all infrequent items are pruned from the table and the global frequent pattern header table (GFHTable) is produced as the final output. This step effectively uses map and reduce functionality to speed up the mining operation.

C. Construction of Conditional Frequent Patterns
As the GFHTable is lightweight in terms of memory, it is shared by all the nodes of the cluster. Each cluster node is responsible for rearranging its DBi transactions in line with the GFHTable and pruning elements accordingly. Unlike standalone algorithms such as FP-growth, the proposed algorithm does not create a global tree; the task of generating conditional patterns is also distributed among processors and performed in parallel. Here the Mapper and Reducer perform different tasks from the traditional scatter and gather functionalities. The Mapper performs the following steps:
i. It prunes each DBi transaction to align the transaction items with the available GFHTable.
ii. According to the available GFHTable, it processes the DBi transaction and outputs a key-value pair, where the key is the conditional element for which the conditional frequent pattern set is constructed, as described in Figure 1.
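As a concrete illustration, steps 1 to 6 above can be simulated in a single Python process. This is a sketch of the logic only, not the actual Hadoop job: a dictionary stands in for Hadoop's shuffle-and-sort, and the function names merely mirror GFHTMapper, GFHTReducer and aggregator.

```python
from collections import defaultdict

def gfht_mapper(transaction):
    """Step 3: emit (item, 1) for every item in the transaction."""
    return [(item, 1) for item in transaction]

def gfht_reducer(item, counts):
    """Step 5: sum the frequency counts emitted for one item."""
    return item, sum(counts)

def aggregator(reduced, min_sup):
    """Step 6: prune items whose global count is below the threshold."""
    return {item: count for item, count in reduced.items() if count >= min_sup}

def build_gfhtable(db, min_sup):
    grouped = defaultdict(list)          # stands in for Hadoop's shuffle/sort
    for tx in db:
        for item, one in gfht_mapper(tx):
            grouped[item].append(one)
    reduced = dict(gfht_reducer(item, counts) for item, counts in grouped.items())
    return aggregator(reduced, min_sup)

# The five transactions of Table I:
db = [list("facdgimp"), list("abcflmo"), list("bfhjo"),
      list("bcksp"), list("afcelpmn")]
```

For this database with ξ = 3, `build_gfhtable(db, 3)` reproduces exactly the six entries of Table II (c:4, f:4, a:3, b:3, m:3, p:3).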
The algorithmic steps of FPM_Mapper for finding frequent patterns include the functions described below. FPM_Mapper is the main function; it calls the generateConditionalPattern sub-function. FPM_Mapper runs on all processors, with multiple mappers running on each individual processor, and the conditional patterns for each item are generated directly by each mapper as the output of this function.

Function FPM_Mapper – prune transactions and generate conditional frequent patterns
input: GFHTable, transaction Ti from DBi
output: conditional frequent patterns
# GFHTable is a local object so the current Mapper can access it faster
Steps:
foreach transaction Ti from DBi
  collect patternElements;
  foreach element of transaction Ti
    if (checkIfElementExistsInGFHTable(element))
      patternElements.add(element);
    endif
  endfor
  # order patternElements using support frequency
  Prune-order(patternElements);
  # generate conditional patterns
  generateConditionalPattern(patternElements);
endfor

Function generateConditionalPattern
input: patternElements
output: conditional prefix-tree for each element
Steps:
foreach element from patternElements
  construct conditional prefix-tree
  HadoopContext.write(element, conditionalPrefix-tree);
endfor
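The mapper logic above can be sketched as a runnable Python function, assuming Python in place of the Hadoop Java API and representing each conditional pattern as a plain prefix list rather than a prefix tree. The `gfhtable` argument is assumed to map each item to its global support, as in Table II.

```python
# Sketch of FPM_Mapper (Section IV-C): prune a transaction against the
# GFHTable, order it by descending global support, and emit one
# (key element, conditional pattern) pair per surviving item.

def fpm_mapper(transaction, gfhtable):
    # Step i: drop items that are absent from the GFHTable.
    pattern_elements = [e for e in transaction if e in gfhtable]
    # Prune-order: descending support, ties broken alphabetically,
    # matching the ordering used in Table III.
    pattern_elements.sort(key=lambda e: (-gfhtable[e], e))
    # Step ii: the conditional pattern of an element is its prefix path.
    return [(e, pattern_elements[:i]) for i, e in enumerate(pattern_elements)]

gfhtable = {'c': 4, 'f': 4, 'a': 3, 'b': 3, 'm': 3, 'p': 3}
```

Applied to the transaction f, a, c, d, g, i, m, p, this yields the pairs of the first row of Table III: p: c f a m; m: c f a; a: c f; f: c; c: (empty).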
FPM_Mapper first prunes infrequent items from the transactions of the local database by checking them against the GFHTable. It then calls the generateConditionalPattern sub-function, which ultimately generates the local conditional frequent patterns.

Figure 1 shows the architecture diagram of frequent pattern generation; it does not include the GFHTable construction steps. The file system shown in the figure is the Hadoop file system (HDFS), which is used for storing the intermediate and final results of each mapper and reducer. HDFS provides reliability for storing huge datasets and streams intermediate results with high bandwidth to all nodes. This architecture makes efficient use of the MapReduce framework and helps the proposed algorithm mine frequent patterns efficiently.

Figure 1. Architecture diagram of DPFPM algorithm using MapReduce framework

D. Generation of Frequent Pattern Sets
When all mapper instances have finished constructing conditional frequent patterns, the MapReduce framework groups all corresponding conditional frequent patterns by the key element for which they were constructed. The Reducer performs the following step: at each reducer instance, it combines the different conditional frequent patterns for the same element key and generates the frequent pattern sets as output.

Function FPM_Reducer
input: conditional frequent patterns grouped by element name
output: frequent pattern set
Steps:
foreach conditionalPattern from groupOfConditionalPatterns
  if (first conditionalPattern)
    cfpHead = new ConditionalFP();
    cfpHead contains conditionalPattern;
  else
    # merge the already available pattern elements with the incoming
    # matched element patterns, increment the frequency counts of the
    # similar elements and update the conditional FP head node
    cfpHead.intersect(conditionalPattern);
  endif
  HadoopContext.write(groupName /*element name*/, cfpHead /*data*/);
endfor

Each reducer combines conditional patterns based on the element key, as shown in the algorithm above. After combining all the patterns for each key, the respective reducer prunes infrequent itemsets based on the minimum support threshold ξ. After pruning, each reducer generates the frequent sets for its keys locally on its processor. This step reduces the cost of inter-process communication, as the frequent sets for each key are generated locally.

The DPFPM algorithm is explained with the following example.
Example: the input transaction database DB contains 5 transactions, as shown in Table I. Each transaction contains m elements drawn from I items. A minimum support threshold ξ = 3 is considered for the example.

TABLE I. TRANSACTION DATABASE DB
Transaction ID | Transactions
1 | f, a, c, d, g, i, m, p
2 | a, b, c, f, l, m, o
3 | b, f, h, j, o
4 | b, c, k, s, p
5 | a, f, c, e, l, p, m, n

The GFHTable generated using the GFHTMapper and GFHTReducer functions is shown in Table II.

TABLE II. GFHTABLE
Item Identifier | Support count
c | 4
f | 4
a | 3
b | 3
m | 3
p | 3

After the GFHTable is constructed using the MapReduce framework for the first time, the second step is carried out using the FPM_Mapper and FPM_Reducer functions, where the MapReduce framework is used for the second time.

Example – FPM_Mapper functions in the following way: transactions are fed to the mappers, where sorting and pruning of transactions based on the GFHTable take place locally on each processor. Each mapper outputs (key: conditional pattern) pairs.

TABLE III.
FPM-MAPPER OUTPUTS

Mapper input (DB transaction) | Sorted and pruned transaction | Mapper output – conditional frequent patterns (key element: conditional pattern)
f, a, c, d, g, i, m, p | c, f, a, m, p | p: c f a m; m: c f a; a: c f; f: c; c: None
a, b, c, f, l, m, o | c, f, a, b, m | m: c f a b; b: c f a; a: c f; f: c; c: None
b, f, h, j, o | f, b | b: f; f: None
b, c, k, s, p | c, b, p | p: c b; b: c; c: None
a, f, c, e, l, p, m, n | c, f, a, m, p | p: c f a m; m: c f a; a: c f; f: c; c: None

As shown in Table III, the conditional patterns of each item are created locally on each processor as the output of the mapper function.

Example – FPM_Reducer functions in the following way:
Each reducer combines conditional patterns based on the element key. After combining all the patterns for each key, the respective reducer prunes infrequent itemsets based on the minimum support threshold ξ = 3. After pruning, each reducer generates the frequent sets for its keys locally on its processor; this step reduces the cost of inter-process communication, as the frequent sets for each key are generated locally. As shown in Table IV, the output of FPM_Mapper is passed to the reducers. For example, the first reducer collects all patterns for element 'a', that is, a: {c, f} | {c, f} | {c, f}. It then prunes infrequent items from the set based on ξ = 3; for this reducer, {c, f} is frequent, so no pruning is needed. Finally, the frequent sets of each item are produced as reducer output locally on each processor.

TABLE IV. FPM-REDUCER OUTPUTS
Reducer combines conditional patterns by element key → pruned (infrequent items removed, ξ = 3) → final frequent sets
a: {c, f} | {c, f} | {c, f} → a: {c, f} → {a}, {a, f}, {a, f, c}, {a, c}
m: {c, f, a} | {c, f, a, b} | {c, f, a} → m: {a, f, c} → {m}, {m, a}, {m, a, f}, {m, a, f, c}, {m, f}, {m, a, c}, {m, f, c}, {m, c}
p: {c, f, a, m} | {c, b} | {c, f, a, m} → p: {c} → {p}, {p, c}
b: {f} | {c} | {c, f, a} → b: {} → {b}
c: {} → c: {} → {c}
f: {c} | {c} | {c} → f: {c} → {f}, {f, c}

E. Aggregation
The results of the reducer step above are aggregated as the final outcome of the algorithm, producing the frequent patterns present in the complete DB passed as input to the proposed DPFPM algorithm. The aggregator function takes the output of all reducers and generates the final frequent patterns. This step is carried out on the master node.
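The per-key merge and prune illustrated in Table IV can be sketched in Python as follows. This is a sketch of the reducer logic only: Hadoop's grouped values are assumed to arrive as a plain list of conditional patterns, and the enumeration of the subset frequent sets in Table IV's last column is omitted.

```python
from collections import Counter

# Sketch of FPM_Reducer's pruning (Section IV-D): count how often each
# item occurs across the conditional patterns grouped under one key, and
# keep only items whose count reaches the minimum support threshold.

def fpm_reducer(key, conditional_patterns, min_sup):
    counts = Counter(item for pattern in conditional_patterns for item in pattern)
    pruned = sorted(item for item, c in counts.items() if c >= min_sup)
    return key, pruned

# The patterns grouped under 'm' in Table IV: {c,f,a} | {c,f,a,b} | {c,f,a}
# fpm_reducer('m', [['c','f','a'], ['c','f','a','b'], ['c','f','a']], 3)
# → ('m', ['a', 'c', 'f'])   # b occurs only once and is pruned
```

This matches Table IV's row m: {a, f, c}; likewise the three patterns grouped under 'p' reduce to {c}, since f, a and m each occur only twice.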
function aggregator
input: the output of all reducers
output: aggregated output as frequent pattern sets
# scan through all reducer outputs of the Hadoop stack
# read all the part-r-success-0000* files
Steps:
cfpHeadList = null;
foreach file from foundFiles
  if (cfpHeadList == null)
    # read objects from file
    cfpHeadList = fetchFrequentPatternSets(file);
  else
    # concatenate objects from file
    cfpHeadList.addFrequentPatternSets(file);
  endif
endfor
foreach cfpNode from cfpHeadList
  # print element and its frequent sets
  System.out.println(elementName, frequentSets);
endfor

The proposed DPFPM algorithm uses the MapReduce framework effectively to achieve its performance goals in terms of execution efficiency and scalability.

V. EXPERIMENTAL RESULTS

Experiments on the proposed DPFPM algorithm are performed on a Hadoop cluster of 5 nodes (1 master, 4 slaves). Each node has a Core 2 Duo processor running at 2.60 GHz and 2 GB of memory. Several real and synthetic datasets are used to test the performance of the algorithm. Experiments are performed with relative minimum support thresholds ranging from 2 to 10. To test scalability, experiments are performed on 1 to 5 nodes. To assess performance, results are compared with the DH-TRIE algorithm [10], as shown in Table V.
TABLE V. COMPARISON OF ALGORITHMS
Algorithm | Number of transactions | Frequent pattern generation time
DH-TRIE | 1,00,00,000 | 1062.26 seconds
DPFPM | 3,16,80,064 | 718.24 seconds

The comparison of the proposed DPFPM algorithm with the DH-TRIE algorithm in Table V shows that DPFPM generates frequent patterns in considerably less time while processing more than three times as many transactions. The proposed algorithm also shows very good scalability when tested on clusters of 1 to 5 nodes. Figure 2 shows experimental results for the dataset Kosarak1G.dat, which contains 3,16,80,064 transactions; experiments are performed with a minimum support threshold of 4%. The graph in Figure 2 shows the number of nodes/processors on the X-axis and the GFHTable and frequent pattern generation times in seconds on the Y-axis.

Figure 2. Graph – Number of Nodes vs. Time

TABLE VI. READINGS FOR VARIOUS MINIMUM SUPPORTS
Minimum support (%) | Minimum count | GFHTable construction time | Frequent pattern generation time
2 | 633601 | 193.02 seconds | 525.22 seconds
4 | 1267202 | 198.03 seconds | 444.06 seconds
5 | 1584003 | 192.1 seconds | 536.63 seconds
6 | 1900803 | 190.97 seconds | 534.27 seconds
8 | 2534404 | 190.99 seconds | 486.16 seconds
10 | 3168006 | 189.32 seconds | 420.32 seconds

Table VI shows readings for Kosarak1G.dat; these readings are plotted in Figure 3.

Figure 3. Graph – Frequent pattern generation time for various support thresholds (dataset Kosarak1G.dat, 3,16,80,064 transactions; minimum support vs. time)
The graph in Figure 3 shows the minimum support count on the X-axis and the GFHTable and frequent pattern generation times in seconds on the Y-axis. The experimental results demonstrate the efficiency of the proposed DPFPM algorithm, as well as its good scalability with respect to various minimum supports.

VI. CONCLUSION

The proposed DPFPM algorithm shows strong performance results for large databases using the Hadoop MapReduce framework. It parallelizes both header table generation and frequent pattern mining, speeding up execution. It partitions and distributes the database in such a way that each local node works independently and in parallel, reducing communication overhead in both stages. It does not construct a global tree; instead, local conditional patterns are passed to the reducers, which perform frequent pattern generation, reducing communication cost and speeding up the mining process. Experimental results show that the parallel and distributed algorithm efficiently handles scalability for very large databases.

REFERENCES
[1] R. Agrawal, T. Imielinski and A. N. Swami, "Mining association rules between sets of items in large databases", in Proceedings of the ACM SIGMOD Conference on Management of Data, 1993, pp. 207–216.
[2] R. Agrawal and J. C. Shafer, "Parallel Mining of Association Rules: Design, Implementation and Experience," Technical Report TJ10004, IBM Research Division, Almaden Research Center, 1996.
[3] M. J. Zaki, "Parallel and Distributed Association Mining: A Survey," IEEE Concurrency, vol. 7(4), 1999, pp. 14–25.
[4] Y. Ye and C. Chiang, "A Parallel Apriori Algorithm for Frequent Itemsets Mining," Fourth International Conference on Software Engineering Research, Management and Applications, 2006, pp. 87–94.
[5] K. M. Yu, J. Zhou, T. P. Hong and J. L. Zhou, "A load-balanced distributed parallel mining algorithm", Elsevier Expert Systems with Applications 37 (2010), pp. 2459–2464.
[6] XinYue Yang, Zhen Liu and Yan Fu, "MapReduce as a Programming Model for Association Rules Algorithm on Hadoop", in Information Sciences and Interaction Sciences (ICIS), 2010.
[7] J. Han, J. Pei and Y. Yin, "Mining frequent patterns without candidate generation", in Proceedings of the ACM SIGMOD Conference on Management of Data, 2000, pp. 1–12.
[8] Syed Khairuzzaman Tanbeer, Chowdhury Farhan Ahmed and Byeong-Soo Jeong, "Parallel and Distributed Frequent Pattern Mining in Large Databases", 11th IEEE International Conference on HPCC, 2009.
[9] Kun-Ming Yu and Jiayi Zhou, "Parallel TID-based frequent pattern mining algorithm on a PC cluster and grid computing system", Expert Systems with Applications 37, pp. 2486–2494, 2010.
[10] Lai Yang, Zhongzhi Shi, Li Xu, Fan Liang and Ilan Kirsh, "DH-TRIE Frequent Pattern Mining on Hadoop using JPA", IEEE International Conference on Granular Computing, 2011.
[11] Apache Hadoop. http://hadoop.apache.org/
[12] K. Shvachko, Hairong Kuang, Sanjay Radia and Robert Chansler, "The Hadoop Distributed File System", in Proceedings of the IEEE 26th Symposium on Mass Storage Systems and Technologies, 2010.
[13] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," in Proceedings of the 6th Symposium on Operating Systems Design and Implementation, San Francisco, CA, Dec. 2004.
[14] S. Parthasarathy, M. J. Zaki, M. Ogihara and W. Li, "Parallel Data Mining for Association Rules on Shared-Memory Systems", Knowledge and Information Systems, 3(1):1–29, February 2001.
[15] R. C. Agrawal, C. C. Agrawal and V. V. V. Prasad, "A tree projection algorithm for generation of frequent itemsets," Journal of Parallel and Distributed Computing, vol. 61, no. 3, pp. 350–371, March 2001.
[16] J. Han, Hong Cheng, Dong Xin and Xifeng Yan, "Frequent pattern mining: current status and future directions," Data Mining and Knowledge Discovery, vol. 15, no. 1, pp. 55–86, August 2007.
[17] Kawuu W.
Lin and Yu-Chin Lo, "Efficient algorithms for frequent pattern mining in many-task computing environments", Knowledge-Based Systems, 2013.