2010 10th IEEE International Conference on Computer and Information Technology (CIT 2010)




An Efficient Data Mining Framework on Hadoop using Java Persistence API

Yang Lai
The Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China.
Graduate University of Chinese Academy of Sciences, Beijing 100039, China.
e-mail: yanglai@ics.ict.ac.cn

Shi ZhongZhi
The Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China.
e-mail: shizz@ics.ict.ac.cn

Abstract—Data indexing is common in data mining when working with high-dimensional, large-scale data sets. Hadoop, a cloud computing project using the MapReduce framework in Java, has become of significant interest in distributed data mining. A feasible distributed data indexing algorithm is proposed for Hadoop data mining, based on ZSCORE binning, inverted indexing, and the Hadoop SequenceFile format. A data mining framework on Hadoop using the Java Persistence API (JPA) and MySQL Cluster is proposed. The framework is elaborated in the implementation of a decision tree algorithm on Hadoop. We compare the data indexing algorithm with Hadoop MapFile indexing, which performs a binary search, in a modest cloud environment. The results show the algorithm is more efficient than naïve MapFile indexing. We also compare the JDBC and JPA implementations of the data mining framework. The performance shows the framework is efficient for data mining on Hadoop.

Keywords—Data Mining, Distributed applications, JPA, ORM, Distributed file systems, Cloud computing

I. INTRODUCTION

Many approaches have been proposed for handling high-dimensional and large-scale data, in which query processing is the bottleneck. "Algorithms for knowledge discovery tasks are often based on range searches or nearest neighbor search in multidimensional feature spaces" [1].

Business intelligence and data warehouses can hold a Terabyte or more of data. Cloud computing has emerged to meet the correspondingly increasing demands of data mining. MapReduce is a programming framework and an associated implementation designed for large data sets. The details of partitioning, scheduling, failure handling and communication are hidden by MapReduce. Users simply define map functions to create intermediate <key, value> tuples, and then reduce functions to merge the tuples for special processing [2].

A concise indexing Hadoop implementation of MapReduce is presented in McCreadie's work [3]. Lämmel proposes a basic program skeleton to underlie MapReduce computations [4]. Moretti presents an abstraction for scalable data mining, in which data and computation are distributed in a computing cloud with minimal user effort [5]. Gillick uses Hadoop to implement query-based learning [6].

Most data mining algorithms are based on object-oriented programming, which runs in memory. Han elaborates many of these methods [7]. A link-based structure, like the X-tree [8], may be suitable for indexing high-dimensional data sets.

However, the following features of the MapReduce framework are unsuitable for data mining. First, there is no globality: map tasks are independent of each other, as are reduce tasks. Data mining requires that all of the training data be converted into a global model, such as a decision tree or a clustering tree, whereas each task in the MapReduce framework handles only its own partition of the entire data set and outputs its results into the Hadoop distributed file system (HDFS). Second, random-write operations are disallowed by HDFS, which rules out link-based data models in Hadoop, such as linked lists, trees, and graphs. Finally, both map and reduce tasks are scan-based and end as soon as their partition of the training dataset is finished, while data mining requires a persistent model for the subsequent testing process.

A database is an ideal persistent repository for objects generated by data mining using Hadoop tasks. To mine high-dimensional and large-scale data on Hadoop, we employ Object-Relational Mapping (ORM), which stores objects whose size may surpass memory limits in a relational database. The Java Persistence API (JPA) provides a persistence model for ORM [9]. It can store all the objects generated from data mining models in a relational database. A distributed database is a suitable solution to ensure robustness in distributed handling. MySQL Cluster is designed to withstand any single point of failure [10], which is consistent with Hadoop.

We performed the same work that McCreadie performed [3] and now propose a novel Hadoop indexing implementation for continuous values. Using JPA and MySQL Cluster on Hadoop, we propose an efficient data mining framework, which is elaborated by a decision tree implementation. The distributed computing in Hadoop and the centralized data collection using JPA are combined organically. We also employ a naïve JDBC implementation for comparison with our JPA implementation on Hadoop.

The rest of the paper is organized as follows. In Section 2 we explain the index structures, flowchart and algorithms, and propose the data mining framework. Section 3 describes our experimental setting and results. In Section 4, we offer conclusions and suggest possible directions for future work.

II. METHOD

A. Naïve Data Indexing

Naïve data indexing is necessary for querying data in classification or clustering algorithms. Data indexing can dramatically decrease the complexity of querying. As in McCreadie's work [3], we create an inverted index for continuous values.

1) Definition

An (n × m) dataset is an (n × m) matrix containing n rows of data, each with m columns of float numbers. Let min(i), max(i), sum(i), avg(i), std(i), cnt(i) (i ∈ [1..m]) be the minimum, maximum, sum, average, standard deviation and count of the i-th column, respectively, i.e., the essential statistical measures of the i-th column of the dataset. Let sum2(i) be the dot product of the i-th column with itself, which is used for std(i). From the formula

ZSCORE(x) = (x − avg()) / std()   [7],

a new measure is defined:

YSCORE(x) = round((x − avg()) × iStep / std()),

where iStep is an experimental integer parameter.
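To make the binning concrete, the following is a minimal Java sketch of YSCORE discretization (our illustration with hypothetical names, not the authors' code; it assumes the per-column avg and std have already been computed, e.g., by the statistics phase described below):

/**
 * Minimal sketch of YSCORE binning. Assumes avg[] and std[] hold the
 * per-column average and standard deviation computed beforehand.
 */
public final class YScore {
    private final double[] avg;
    private final double[] std;
    private final int iStep;

    public YScore(double[] avg, double[] std, int iStep) {
        this.avg = avg;
        this.std = std;
        this.iStep = iStep;
    }

    /** Maps the value x of column i to its bin id. */
    public int bin(int i, double x) {
        return (int) Math.round((x - avg[i]) * iStep / std[i]);
    }

    /** Discretizes a whole record of m float values into m bin ids. */
    public int[] bin(double[] record) {
        int[] bins = new int[record.length];
        for (int i = 0; i < record.length; i++) {
            bins[i] = bin(i, record[i]);
        }
        return bins;
    }
}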
2) Simple Design of the Inverted Index

In general, high-dimensional datasets are stored as tables in a sophisticated DBMS for most data mining tasks. Table 1 shows the format of a dataset in a DBMS. For data mining, CSV is the common data interchange file format.

Table 1 A normal data table.
RecordsNO   Field01   Field02   …
Rec01       Value02   Value01   …
Rec02       Value05   Value01   …
Rec03       Value02   Value04   …
…           …         …         …

Figure 1 shows the flowchart that illustrates the whole procedure for indexing and similarity querying.

Discretization calculates the essential statistical measures (Figure 2), then performs YSCORE binning on the FloatData.TXT file (CSV format) and outputs the IntData.TXT file, in which each line is labeled with a consecutive unique number (LineNo).

The encoder transforms the text file IntData.TXT into the Hadoop SequenceFile format IntData.SEQ and generates IntData.DAT. To locate a record for a query rapidly, all records must have the same length. IntData.DAT contains an (n × m) matrix, i.e., n rows of records containing m columns of bin IDs, as does IntData.SEQ. The structure of IntData.DAT is shown in Table 2.

With the indexer, the index files Val2NO.DAT and Col2Val.DAT are output with MapReduce from the IntData.SEQ file. To obtain the list of LineNo's for a specific bin, the list is stored in a structural file (Val2NO.DAT), shown in Table 4; the address and the length of the list are stored in another structural file (Col2Val.DAT), shown in Table 3. The two structural files are easy to access by offset.

The searcher, given a query, which should be a vector with m float values, performs YSCORE binning with the same statistical measures and yields a vector with m integers for searching; it then outputs a list of the numbers of matching records. The search algorithm is shown in Figure 5.

For intersection, the positions in Val2NO.DAT of each integer of a particular query are extracted from Col2Val.DAT, and the lists of LineNo's for each integer can be found. The result is computed by the intersection of the lists. The algorithm is omitted.

Table 2 Structure of IntData.DAT in BNF.
IntData.DAT   ::= <Record>
<Record>      ::= <Bins>
<Bins>        ::= Integer (32-bit binary)

Table 3 Structure of Col2Val.DAT in BNF.
Col2Val.DAT      ::= <Field>
<Field>          ::= <Bins, Address, LengthLineNo>
<Bins>           ::= Integer (32-bit binary)
<Address>        ::= Integer (64-bit binary)
<LengthLineNo>   ::= Integer (64-bit binary)

Table 4 Structure of Val2NO.DAT in BNF.
Val2NO.DAT   ::= <LineNo>
<LineNo>     ::= Integer (64-bit binary)

[Figure 1 Flowchart for searching the Hadoop index: FloatData.TXT → Discretization → IntData.TXT → Encoder → IntData.SEQ and IntData.DAT → Indexer → Col2Val.DAT and Val2No.DAT; a Query is discretized, the Searcher looks up the LineNo list for each bin, and the Intersection of the lists yields the Result.]
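Because every entry in the two structural files has a fixed width (Tables 2-4), a lookup can seek directly by byte offset. A minimal sketch of such a lookup with java.io.RandomAccessFile follows (our illustration under the assumed layouts above; LengthLineNo is assumed to count entries rather than bytes, and a real deployment would read through HDFS rather than the local file system):

import java.io.IOException;
import java.io.RandomAccessFile;

public final class OffsetLookup {
    /**
     * Sketch: fetch the LineNo list of one bin. Col2Val.DAT stores
     * <bin(4B), address(8B), length(8B)> triples; Val2NO.DAT stores
     * 8-byte LineNo's. Finding entryIndex for a bin (e.g., by binary
     * search over a field's sorted bins) is assumed to happen elsewhere.
     */
    public static long[] readLineNos(RandomAccessFile col2val,
                                     RandomAccessFile val2no,
                                     long entryIndex) throws IOException {
        final int ENTRY = 4 + 8 + 8;       // bytes per Col2Val entry
        col2val.seek(entryIndex * ENTRY);
        int bin = col2val.readInt();       // bin id (not needed here)
        long address = col2val.readLong(); // offset into Val2NO.DAT
        long length = col2val.readLong();  // number of LineNo entries (assumed)
        long[] lineNos = new long[(int) length];
        val2no.seek(address);
        for (int i = 0; i < lineNos.length; i++) {
            lineNos[i] = val2no.readLong();
        }
        return lineNos;
    }
}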



3) Statistics phase

In Hadoop, the data set is scanned once to calculate the four distributive measures min(i), max(i), sum(i) and sum2(i), as well as cnt(i), for the i-th column (Figure 2); the algebraic measures average and standard deviation are then computed from them. A (5 × m) array is therefore used for the keys in the MapReduce process: the first row contains the keys for the minimum of each column; the second row contains the keys for the maximum; the third contains the keys for the sum; the fourth contains the keys for the sum of squares; and the fifth contains the keys for the row count.

FUNCTION map (key, values, output)
1. f[1..m] ← parse values into a float array;
2. iW ← f.length;
3. FOR i = 1 THRU m
   a. IF min > f[i] THEN
      1) min ← f[i];
      2) output.collect(iW*0 + i, min);
   b. IF max < f[i] THEN
      1) max ← f[i];
      2) output.collect(iW*1 + i, max);
   c. output.collect(iW*2 + i, f[i]);
   d. output.collect(iW*3 + i, f[i] × f[i]);
4. output.collect(iW*4, 1);

FUNCTION reduce (key, values, output)
5. iCategory ← (int) key.get() / LenFields;
6. SWITCH (iCategory)
7. CASE 0: /** min() */
   a. WHILE (values.hasNext())
      1) fTmp ← values.next();
      2) IF (min > fTmp) THEN min ← fTmp;
   b. output.collect(key, min);
   c. BREAK;
8. CASE 1: /** max() */
   a. WHILE (values.hasNext())
      1) fTmp ← values.next();
      2) IF (max < fTmp) THEN max ← fTmp;
   b. output.collect(key, max);
   c. BREAK;
9. CASE 2: /** sum() */
   a. WHILE (values.hasNext())
      1) sum ← sum + values.next();
   b. output.collect(key, sum);
   c. BREAK;
10. CASE 3: /** sum2() */
   a. WHILE (values.hasNext())
      1) sum2 ← sum2 + values.next();
   b. output.collect(key, sum2);
   c. BREAK;
11. CASE 4: /** cnt() */
   a. WHILE (values.hasNext())
      1) cnt ← cnt + values.next();
   b. output.collect(key, cnt);
   c. BREAK;

FUNCTION statistics (input)
12. faStat[][] ← 0; key ← 0; value ← 0; cnt ← 0;
13. WHILE input.hasNext()
   a. <key, value> ← parse the line of input.next();
   b. iCategory ← key / LenFields;
   c. iField ← key MOD LenFields;
   d. IF (iCategory == 4) THEN /* the row-count key, key = LenFields × 4 */
      1) cnt ← value;
   e. ELSE
      1) faStat[iCategory][iField] ← value;
14. avg[] ← 0; std[] ← 0;
15. FOR i = 1 THRU LenFields
   a. avg(i) ← faStat[2][i] / cnt;
   b. std(i) ← sqrt((faStat[3][i] − faStat[2][i] × faStat[2][i] / cnt) / (cnt − 1));
16. RETURN min(i), max(i), sum(i), avg(i), std(i), cnt;
END OF ALGORITHM STATISTICS

Figure 2 Algorithm Statistics for continuous values in Hadoop
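For concreteness, a simplified mapper corresponding to the map function of Figure 2, written against the old org.apache.hadoop.mapred API used by Hadoop 0.19, might look as follows (a sketch, not the authors' code; unlike Figure 2, which tracks running extrema before emitting, this version emits every value for min and max and leaves all merging to the reducer):

import java.io.IOException;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class StatisticsMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, LongWritable, FloatWritable> {

    public void map(LongWritable key, Text value,
                    OutputCollector<LongWritable, FloatWritable> output,
                    Reporter reporter) throws IOException {
        String[] cols = value.toString().split(",");   // one CSV record
        int iW = cols.length;
        for (int i = 1; i <= iW; i++) {
            float f = Float.parseFloat(cols[i - 1].trim());
            output.collect(new LongWritable(iW * 0L + i), new FloatWritable(f));     // min
            output.collect(new LongWritable(iW * 1L + i), new FloatWritable(f));     // max
            output.collect(new LongWritable(iW * 2L + i), new FloatWritable(f));     // sum
            output.collect(new LongWritable(iW * 3L + i), new FloatWritable(f * f)); // sum2
        }
        output.collect(new LongWritable(iW * 4L), new FloatWritable(1f));            // cnt
    }
}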
The algorithm performs O(N/mapper), where N is the number of records in the data set and mapper is the number of map tasks. After step 4, every partition has been condensed into the five distributive measures. The reduce function merges them into global measures. Keys for the map function are the positions of the next line, and values are the next line of text.

After step 11 in Figure 2, Hadoop creates result files named "part-xxxxx", which contain the five distributive measures for each field; the number of these result files depends on the number of reducers. From step 15, the algebraic measures avg() and std() can be calculated from the five distributive measures.

4) YSCORE phase

Figure 3 shows YSCORE binning of 1,000,000 values drawn from a normal distribution with mean 0 and standard deviation 1. If an iStep of 1 is used, the upper left panel shows exactly the three-sigma rule: there are only 9 bins and the maximum count in a bin is near 400,000. When the iStep is changed to 8, the upper right panel shows that the number of bins increases to 36 and the histogram fits the normal distribution curve better. Importantly, the maximum bin count drops to about 50,000 at an iStep of 8. When the iStep is set to 16 or 32, the maximum drops to 25,000 or 10,000, respectively. The lowered bin counts benefit inverted indexing.

[Figure 3 YSCORE binning of 1,000,000 values from a normal distribution with mean 0 and standard deviation 1; the four panels show YSCORE binning with an iStep of 1, 8, 16, and 32, respectively.]
According to [11], "the central limit theorem (CLT) states conditions under which the mean of a sufficiently large number of independent random variables, each with finite mean and variance, will be approximately normally distributed." Even if the distribution of a field does not follow a normal distribution, in large-scale data sets the values of the field can be put into bins using its mean and deviation when the size approaches a Terabyte.

These interval values are put into bins using YSCORE binning [7]. The inverted index then has the format shown in Table 5, which follows from Table 1. The first column holds all fields in the original dataset; the second holds the <bin, RecNoArray> tuples.

Table 5 The structure of the inverted index for interval values.
Fields    Inverted Index
Field01   <Bin02, RecNoArray>, <Bin05, RecNoArray>, …
Field02   <Bin01, RecNoArray>, <Bin04, RecNoArray>, …
…         …

The YSCORE binning phase does not need a reduce function, which eliminates the cost of sorting and transferring; the result files are precisely the partitions on which the map functions work. Keys in the map function are the positions of the next line, and values are the next line of text. The input file is FloatData.TXT and the output file is IntData.TXT, as shown in Figure 1.

The cumulative distribution function of the standard normal distribution is 50% at 0, 60% at 0.253, 70% at 0.524, 80% at 0.842, and 90% at 1.282. If a field's distribution is approximately standard normal, YSCORE therefore determines a balanced binning from its ZSCORE value.

After the statistics phase and the YSCORE binning phase, Hadoop can handle discrete data as in the commonly used MapReduce word-count example.
B. Improved Data Indexing

Naïve data indexing has a flaw in that it uses only one reducer to keep the global id for the keys. This causes a performance problem that is not scalable. A tractable <LongWritable, LongWritable> pair is employed to transform the three variables, the global id, the attribute, and the id of the bin, into the intermediate key-value pairs of the JobIndex (Figure 4). In general, neither the number of attributes nor the number of bins in an attribute will exceed 2^32. A LongWritable variable is 64 bits, so the attribute can be put in the upper 32 bits of a LongWritable and the bin id in the lower 32 bits. These attribute-bin LongWritables are used as keys for the sorting phase in Hadoop.

The number of ReducerIndex tasks is set equal to the number of attributes. The method setPartitionerClass() in Hadoop decides to which ReducerIndex task an intermediate key-value pair should be sent, using the upper 32 bits of the keys. In the ReducerIndex task, the attribute-bin LongWritables are grouped by the bin id in the lower 32 bits of the keys, and then the inverted index <Bin, RecNoArray> (Table 5) can be output easily. Each reduce task in Hadoop generates its own result; therefore, each attribute has its own index file, as sketched below.

[Figure 4 The Data Flow Diagram for improved indexing of continuous values in Hadoop: SourceDataSet ::= <LongWritable, FloatArrayWritable> → JobDiscret (mapperDiscrete 1..m) → DiscretedData ::= <LongWritable, IntArrayWritable> → JobIndex (mapperIndex 1..m → IntermediateData ::= <LongWritable, LongWritable> → ReducerIndex 1..r) → IndexData ::= <IntWritable, LongArrayWritable>]
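The bit packing described above is a one-liner in Java. The following is a minimal sketch (hypothetical helper, not the authors' code), including the partitioning rule that routes a key by its upper 32 bits:

public final class AttrBinKey {
    /** Attribute id in the upper 32 bits, bin id in the lower 32 bits. */
    public static long pack(int attribute, int bin) {
        return ((long) attribute << 32) | (bin & 0xFFFFFFFFL);
    }

    public static int attribute(long key) { return (int) (key >>> 32); }

    /** The cast recovers the sign, so negative YSCORE bins survive. */
    public static int bin(long key) { return (int) key; }

    /** Partitioning rule: route by attribute so each ReducerIndex
     *  task builds the index file of exactly one attribute. */
    public static int partition(long key, int numReducers) {
        return attribute(key) % numReducers;
    }
}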
C. Hadoop MapFile Indexing

The MapFile format is a directory consisting of two files, a data file and an index file; both are of the SequenceFile type. The data file is a normal SequenceFile; the index file consists of a few keys at intervals and their corresponding addresses in the data file. To query a specific key for its value, a binary search is performed in the index file and a sequential search is performed in the data file within the range defined by the corresponding addresses [12]. The algorithm for searching by MapFile is shown in Figure 5.

FUNCTION SearchMapFile (float[] faQuery)
17. laResult ← null; laTmp ← null; isFirst ← TRUE;
18. FOR iField = 1 THRU faQuery.length
   a. iValue ← YSCORE(faQuery[iField]);
   b. Indexer ← new MapFile.Reader(index file of iField);
   c. Indexer.get(iValue, laTmp);
   d. IF (isFirst) THEN
      1) laResult ← laTmp; isFirst ← FALSE;
   e. ELSE
      1) laResult ← getIntersection(laResult, laTmp);
19. RETURN laResult
END OF ALGORITHM SearchMapFile

Figure 5 Algorithm of searching by MapFile.
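In code, the per-field lookup of step 18 maps roughly onto the org.apache.hadoop.io.MapFile.Reader API of that era, as in the following sketch (our illustration; the index directory path is a placeholder, and the value type stands in for the paper's LongArrayWritable):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Writable;

public final class MapFileLookup {
    /** Fetch the LineNo list for one bin from one field's MapFile index. */
    public static Writable lookupBin(Configuration conf, String indexDir,
                                     int bin, Writable reuse) throws IOException {
        FileSystem fs = FileSystem.get(conf);
        MapFile.Reader reader = new MapFile.Reader(fs, indexDir, conf);
        try {
            // Binary search in the index file, then a sequential scan in
            // the data file, as described above; returns null on a miss.
            return reader.get(new IntWritable(bin), reuse);
        } finally {
            reader.close();
        }
    }
}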
HBase, a Bigtable-like storage for the Hadoop distributed computing environment, is based on MapFile [13]. It does not perform well because of the complexity of MapFile.

D. Data mining framework on Hadoop

Although this improved data indexing performs better than MapFile indexing does, the overhead of range querying by intersection algorithms is still too high. To apply common object-oriented data mining algorithms in Hadoop, the three problems posed in the introduction, globality, random writes, and task duration, should be handled. For this, we propose a data mining framework on Hadoop using JPA.

1) MySQL Cluster

After many failures, we found that, for data mining with Hadoop, a database is an ideal persistent repository for handling the three problems mentioned. To ensure computing robustness, a distributed database is the suitable solution for distributed high-dimensional and large-scale data handling. MySQL Cluster is designed to eliminate any single point of failure [10] and is based on open source code; both characteristics make it consistent with data mining with Hadoop.

2) ORM

JPA and naïve JDBC are used to implement the ORM for the objects of data mining in a database (Figure 6). To improve the data indexing method, a KD-tree or X-tree may be suitable; however, the decision tree is considered a classical algorithm and better elaborates the framework.
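As an illustration of the kind of mapping the framework relies on, a decision-tree node could be declared as a JPA entity along the following lines (a hypothetical sketch; the class and field names are ours, not the authors'):

import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;
import javax.persistence.OneToOne;

@Entity
public class TreeNode {
    @Id
    @GeneratedValue
    private Long id;

    private int attribute;   // which field of a test record is checked
    private float value;     // split threshold for that field
    private boolean leaf;
    private int category;    // class label, meaningful when leaf == true

    @OneToOne private TreeNode childLeft;
    @OneToOne private TreeNode childRight;

    // getters and setters omitted for brevity
}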
[Figure 6 The Data Flow Diagram of the Data Mining Framework on Hadoop: TrainingData and TestingData are processed by Hadoop data mining tasks, whose objects are persisted through JPA or JDBC into the ndbcluster storage engine.]

In the data mining framework, a large-scale dataset is scanned by Hadoop. The necessary information is extracted using classical data mining modules, which are embedded in the Hadoop map or reduce tasks. The extracted objects, such as lists, trees, and graphs, are sent automatically into a database, in this case MySQL Cluster. As a test case, the test dataset is scanned and handled by each data mining task.

3) Stored procedure

After the objects obtained from data mining are stored in a database, all the details can be finished using stored procedures, such as finding a path between two nodes, returning all the children of a node, and returning the category of a test case. A data mining task only needs to send a demand and receive the result, which decreases network overhead and latency.

The following code returns all the children of a node in a binary tree (Figure 7).

CREATE PROCEDURE getChild (IN rootid INT)
BEGIN
  DROP TABLE IF EXISTS chd;
  CREATE TEMPORARY TABLE chd (id INT NOT NULL DEFAULT -1) ENGINE = MEMORY;
  DROP TABLE IF EXISTS par;
  CREATE TEMPORARY TABLE par (id INT NOT NULL DEFAULT -1) ENGINE = MEMORY;
  DROP TABLE IF EXISTS result;
  CREATE TABLE result (id INT NOT NULL DEFAULT -1);
  INSERT chd SELECT A.ID FROM btree A WHERE rootid = A.PID;
  WHILE ROW_COUNT() > 0 DO
    TRUNCATE par;
    INSERT par SELECT * FROM chd;
    INSERT result SELECT * FROM chd;
    TRUNCATE chd;
    INSERT chd SELECT A.ID FROM btree A, par B WHERE B.ID = A.PID;
  END WHILE;
END;

Figure 7 Stored procedure for all children of a node in MySQL
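From the Java side, a data mining task would invoke the procedure and read the materialized result table over JDBC, roughly as follows (a sketch; the connection URL, credentials and the example node id are placeholders):

import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class GetChildExample {
    public static void main(String[] args) throws Exception {
        Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost/dm", "user", "password");
        CallableStatement call = con.prepareCall("{CALL getChild(?)}");
        call.setInt(1, 42);                      // example root node id
        call.execute();
        call.close();
        // The procedure leaves its answer in the result table.
        Statement st = con.createStatement();
        ResultSet rs = st.executeQuery("SELECT id FROM result");
        while (rs.next()) {
            System.out.println(rs.getInt("id")); // one descendant per row
        }
        rs.close();
        st.close();
        con.close();
    }
}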
4) Decision tree implementation

As shown in the textbook [7], a decision tree for continuous values can be built as a binary tree. Let D be a dataset. The entropy of D is given by

Info(D) = − Σ_{i=1..m} p_i × log2(p_i)      (1)

where m is the number of categories and p_i is the probability of category i. Let A be an attribute in D. The entropy of D based on the binary partitioning by A is given by

Info_A(D) = (|D1| / |D|) × Info(D1) + (|D2| / |D|) × Info(D2)      (2)

where the entropy Info() is from formula (1), D1 is the part of D that belongs to the left child node, and D2 is the part of D that belongs to the right child node.
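Formulas (1) and (2) translate directly into a few lines of Java, assuming per-category record counts are available (a sketch, not the authors' code):

public final class Entropy {
    /** Info(D), formula (1); counts[i] is the number of records of category i. */
    public static double info(long[] counts) {
        long total = 0;
        for (long c : counts) total += c;
        double h = 0.0;
        for (long c : counts) {
            if (c == 0) continue;               // 0 * log(0) is taken as 0
            double p = (double) c / total;
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }

    /** Info_A(D) for a binary split into D1 and D2, formula (2). */
    public static double infoA(long[] d1Counts, long[] d2Counts) {
        long n1 = sum(d1Counts), n2 = sum(d2Counts);
        double n = n1 + n2;
        return n1 / n * info(d1Counts) + n2 / n * info(d2Counts);
    }

    private static long sum(long[] a) {
        long s = 0;
        for (long x : a) s += x;
        return s;
    }
}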
Each node in the tree has to contain two members, an ATTRIBUTE and a VALUE. Given a test record (an array of continuous values), the former tells which field should be checked, and the latter tells which child node should be selected. If the test record's value in the member ATTRIBUTE is less than or equal to the member VALUE, the left child node is selected; otherwise the right child node is selected (see the traversal sketch below).

In Hadoop, the JobSelect has to select the member ATTRIBUTE and the member VALUE using the entropy criterion (Figure 8). A training data record consists of multiple values and a category. In the mapperSelect tasks, a record is decomposed into AVC_Writable tuples <attribute, value, category>, where the attribute indicates the position of the value in the record and the category indicates the class of the record. For example, a record <3, 5, 7, 9, -1> will output the tuples <1, 3, -1>, <2, 5, -1>, <3, 7, -1>, <4, 9, -1>. These AVC_Writable tuples will be the keys in the intermediate key-value pairs. The values will be VC_Writable tuples, i.e., the <value, category> part of the AVC_Writable tuples.
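The node test described above amounts to a simple traversal; in Java (a sketch, reusing the hypothetical TreeNode entity shown earlier with its accessors):

public final class TreeClassifier {
    /** Classifies one record by walking the persisted binary tree. */
    public static int classify(TreeNode root, float[] record) {
        TreeNode node = root;
        while (!node.isLeaf()) {
            // ATTRIBUTE selects the field, VALUE is the split threshold:
            // <= goes left, > goes right, as described above.
            node = (record[node.getAttribute()] <= node.getValue())
                    ? node.getChildLeft()
                    : node.getChildRight();
        }
        return node.getCategory();
    }
}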
Hadoop's method setPartitionerClass() decides to which reducer a key-value pair should be sent, using the attribute in the AVC_Writable tuple. Hadoop's method setOutputValueGroupingComparator() groups the values (VC_Writable tuples) by the same attribute in the ReducerSelect tasks. By default, the values are sorted based on the keys, so determining the best split value for an attribute requires one pass through the VC_Writable tuples using formula (2).

[Figure 8 The Data Flow Diagram of the selection step of the decision tree in Hadoop: SourceDataSet ::= <LongWritable, FloatArrayWritable> → JobSelect (mapperSelect 1..m → IntermediateData ::= <AVC_Writable, VC_Writable> → ReducerSelect 1..r) → SelectData ::= <EntropyWritable, AV_Writable>]
                                                                               JPA.begin() and JPA.commit() mean these operations are
    After the best split, values for each attribute are calcu-                 monitored by JPA. Any node in the cluster will share objects
lated by the ReduceSelect tasks, and the key-value pair <En-                   persisted by JPA.
tropyWritable, AV_Writable> is outputted. The minimum
                                                                                                III.   EXPERIMENTAL SETUP
expected entropy is selected. Let A0 and V0 be the best at-
tribute-value pair. D1 is the subset of D satisfying A0 <=V0,                      We evaluated several synthetic datasets and an open da-
and D2 is the subset of D satisfying A0 > V0.                                  taset.
    Before the split, a node in JPA will be created corre-                     A. Computing Environment
sponding to D, with the best attribute A0 and best splitting
value V0.                                                                         Experiments were conducted using Linux CentOS 5.0,
    JobSplit contains only mapperSplit tasks, which do not                     Hadoop 0.19.1, JDK 1.6, a 1Gbps switch network, and 4
                                                                               ASUS RS100-X5 servers (Table 10).
                                SourceDataSet::=                                             Table 6 Devices for Hadoop computing.
                         <LongWritable, FloatArrayWritable>                        Environment                       Parameter
                                                                                   Model name                        ASUS RS100-X5
                        mapperSplit             mapperSplit                        CPU MHz                           Dual Core 2GHz
                            1           ...         m                              Cache size                        1024 KB
           JobSplit




                                                                                   Memory                            3.4GB
                                                                                   NIC                               1Gbps
                        ChildDataLeft         ChildDataRight                       Hard Disk                         1TB
  Figure 9 The Data Flow Diagram of Split of Decision Tree in Hadoop
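The JPA.begin()/JPA.commit()/JPA.persist() shorthand of Figure 10 roughly expands to plain EntityManager calls, as in the following sketch (our illustration; the persistence-unit name is a placeholder and TreeNode is the hypothetical entity shown earlier):

import javax.persistence.EntityManager;
import javax.persistence.EntityManagerFactory;
import javax.persistence.Persistence;

public class NodePersistence {
    private final EntityManagerFactory emf =
            Persistence.createEntityManagerFactory("dmUnit");

    /** Persists the node created before the split, as in step 5. */
    public TreeNode persistSplit(int a0, float v0) {
        EntityManager em = emf.createEntityManager();
        em.getTransaction().begin();       // JPA.begin()
        TreeNode node = new TreeNode();
        node.setAttribute(a0);             // best attribute from JobSelect
        node.setValue(v0);                 // best split value from JobSelect
        em.persist(node);                  // JPA.persist(nodeD)
        em.getTransaction().commit();      // JPA.commit()
        em.close();
        return node;
    }
}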




III. EXPERIMENTAL SETUP

We evaluated several synthetic datasets and an open dataset.

A. Computing Environment

Experiments were conducted using Linux CentOS 5.0, Hadoop 0.19.1, JDK 1.6, a 1 Gbps switched network, and 4 ASUS RS100-X5 servers (Table 6).

Table 6 Devices for Hadoop computing.
Environment   Parameter
Model name    ASUS RS100-X5
CPU           Dual Core 2 GHz
Cache size    1024 KB
Memory        3.4 GB
NIC           1 Gbps
Hard Disk     1 TB
B. Inverted Indexing vs. MapFile Indexing

Two different-sized experiments are shown in Table 7. A (1,000,000 × 200) synthetic data set was generated at random for Test1. A (102,294 × 117) data set was taken from KDD Cup 2008 [14] for Test2. The procedure followed the design in Section II.A.2. In "Index Method," M means MapFile and I means inverted index. In both tests, the latter takes fewer seconds for searching than the former. The SearcherTime is the average for a single query.

Table 7 Comparison of MapFile and inverted indexing
Item                      Test1              Test2
Index Method              M        I         M       I
FloatData (MB)            1,614              204
DiscretizationTime (s)    180                59
EncoderTime (s)           482                42
IndexerTime (s)           5,835    8,473     307     397
SearcherTime (s)          8.25     1.07      5.07    0.68

For a single query, the searching time of the inverted indexing shows a clear advantage.
                                                                              Single-Pass Indexing with MapReduce. New York: Assoc
C. Data mining framework                                                      Computing Machinery, 2009.
    The comparison of JDBC and JPA as decision trees uses                [4] R. Lammel, "Google's MapReduce programming model -
the same synthetic dataset (Table 8). The times listed are the                Revisited," Science of Computer Programming, vol. 70, pp.
cumulative of all procedures. To limit the size of the tree, the              1-30, Jan 2008.
depth of any tree is no more than 20.                                    [5] C. Moretti, K. Steinhaeuser, D. Thain, and N. V. Chawla,
       Table 8 Comparison of JDBC and JPA on MySQL cluster                    "Scaling Up Classifiers to Cloud Computers," in IEEE
      Item                    JDBC                 JPA                        International Conference on Data Mining Pisa, Italy:
Size of synthetic 5,897,367,552 bytes                                         http://icdm08.isti.cnr.it/Paper-Submissions/32/accepted-paper
data               5,000,000 records, with 117 fields.                        s, 2008.
                                                                         [6] D. Gillick, A. Faria, and J. DeNero, "MapReduce: Distributed
Number of          501,723 nodes                                              Computing for Machine Learning," 2006, p.
nodes                                                                         http://www.icsi.berkeley.edu/~arlo/publications/gillick_cs262
Accumulated        2,695.307 seconds 2,981.279 second                         a_proj.pdf.
inserting time                                                           [7] J. Han and M. Kamber, Data Mining: Concepts and
Time for get-      257.93 seconds        324.51 seconds                       Techniques, Second Edition, second ed. San Francisco:
 Child                                                                        Morgan Kaufmann, 2006.
    The time in Table 8 can be decreased dramatically re-                [8] S. Berchtold, D. A. Keim, and H. P. Kriegel, "The X-tree: An
gardless of robustness; i.e., if the MyISAM engine is used                    Index Structure for High-Dimensional Data," Readings in
                                                                              Multimedia Computing and Networking, 2001.
on a single node instead of the NDBcluster engine on 4
                                                                         [9] R. Biswas and E. Ort, "Java Persistence API - A Simpler
nodes, performance in database will be multiply approxi-                      Programming Model for Entity Persistence,"
mately 25 times.                                                              http://java.sun.com/developer/technicalArticles/J2EE/jpa/,
                                                                              2009.
IV. CONCLUSION

A feasible data indexing algorithm is proposed for high-dimensional, large-scale data sets. It consists of YSCORE binning, Hadoop processing, and improved intersectioning. Experimental results show that the algorithm is technically feasible.

An efficient data mining framework is proposed for Hadoop; it employs JPA and MySQL Cluster to resolve the problems of globality, random writes, and task duration in Hadoop. The distributed computing in Hadoop and the centralized data collection using JPA are combined organically.

In future work we will consider replacing MapFile with our inverted indexing in HBase, in an effort to enhance the performance of HBase in larger-scale data mining. A convenient tool for the data mining framework will also be a focus of future efforts.

ACKNOWLEDGEMENTS

This work is supported by the National Basic Research Priorities Programme (No. 2007CB311004), the National Science Foundation of China (No. 60775035, 60903141, 60933004, 60970088), and the National Science and Technology Support Plan (No. 2006BAC08B06).

REFERENCES
[1] C. Bohm, S. Berchtold, H. P. Kriegel, and U. Michel, "Multidimensional index structures in relational databases," in 1st International Conference on Data Warehousing and Knowledge Discovery (DaWaK 99), Florence, Italy, 1999, pp. 51-70.
[2] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," in 6th Symposium on Operating Systems Design and Implementation (OSDI 04), San Francisco, CA, 2004, pp. 137-149.
[3] R. M. C. McCreadie, C. Macdonald, and I. Ounis, On Single-Pass Indexing with MapReduce. New York: Association for Computing Machinery, 2009.
[4] R. Lämmel, "Google's MapReduce programming model - Revisited," Science of Computer Programming, vol. 70, pp. 1-30, Jan. 2008.
[5] C. Moretti, K. Steinhaeuser, D. Thain, and N. V. Chawla, "Scaling Up Classifiers to Cloud Computers," in IEEE International Conference on Data Mining, Pisa, Italy, 2008.
[6] D. Gillick, A. Faria, and J. DeNero, "MapReduce: Distributed Computing for Machine Learning," 2006, http://www.icsi.berkeley.edu/~arlo/publications/gillick_cs262a_proj.pdf.
[7] J. Han and M. Kamber, Data Mining: Concepts and Techniques, 2nd ed. San Francisco: Morgan Kaufmann, 2006.
[8] S. Berchtold, D. A. Keim, and H. P. Kriegel, "The X-tree: An Index Structure for High-Dimensional Data," Readings in Multimedia Computing and Networking, 2001.
[9] R. Biswas and E. Ort, "Java Persistence API - A Simpler Programming Model for Entity Persistence," 2009, http://java.sun.com/developer/technicalArticles/J2EE/jpa/.
[10] S. Hinz, P. DuBois, J. Stephens, M. M. Brown, and A. Bedford, "MySQL Cluster," 2009, http://dev.mysql.com/doc/refman/5.0/en/mysql-cluster-overview.html.
[11] Wikipedia, "Central limit theorem," 2009, http://en.wikipedia.org/wiki/Central_limit_theorem.
[12] Apache Software Foundation, "MapFile (Hadoop 0.19.0 API)," 2008, http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/io/MapFile.html.
[13] J. Gray, "HBase: Bigtable-like structured storage for Hadoop HDFS," 2008, http://wiki.apache.org/hadoop/Hbase.
[14] S. M. S. USA, "KDD Cup data," 2008, http://www.kddcup2008.com/KDDsite/Data.htm.

Asia Pacific Link News - January 2013
 
Dublinked Innovation Network Transport Event - Peter Cranny, NTA
Dublinked Innovation Network Transport Event - Peter Cranny, NTA Dublinked Innovation Network Transport Event - Peter Cranny, NTA
Dublinked Innovation Network Transport Event - Peter Cranny, NTA
 
Presentacion 1. Taller diseño y evaluación de negocios
Presentacion 1. Taller diseño y evaluación de negociosPresentacion 1. Taller diseño y evaluación de negocios
Presentacion 1. Taller diseño y evaluación de negocios
 
Genetic selection as a cause for an ethical dilemma
Genetic selection as a cause for an ethical dilemmaGenetic selection as a cause for an ethical dilemma
Genetic selection as a cause for an ethical dilemma
 
Microblogging und die Verhandlung des Sozialen im Web 2.0
Microblogging und die Verhandlung des Sozialen im Web 2.0Microblogging und die Verhandlung des Sozialen im Web 2.0
Microblogging und die Verhandlung des Sozialen im Web 2.0
 
Artigo Jornalistico O Futuro Do Varejo d
Artigo Jornalistico   O Futuro Do Varejo dArtigo Jornalistico   O Futuro Do Varejo d
Artigo Jornalistico O Futuro Do Varejo d
 
PATIC INCLUSION EDUCATIVA
PATIC  INCLUSION EDUCATIVAPATIC  INCLUSION EDUCATIVA
PATIC INCLUSION EDUCATIVA
 
Pre-Deployment Briefing - HMCS WINNIPEG 2015
Pre-Deployment Briefing - HMCS WINNIPEG 2015Pre-Deployment Briefing - HMCS WINNIPEG 2015
Pre-Deployment Briefing - HMCS WINNIPEG 2015
 
Tic 04 - Componentes del ordenador - Hardware
Tic 04 - Componentes del ordenador - HardwareTic 04 - Componentes del ordenador - Hardware
Tic 04 - Componentes del ordenador - Hardware
 
The Safari Method, Thijs van Exel (Kennisland)
The Safari Method, Thijs van Exel (Kennisland)The Safari Method, Thijs van Exel (Kennisland)
The Safari Method, Thijs van Exel (Kennisland)
 
Folleto-Bulletin Board
Folleto-Bulletin Board Folleto-Bulletin Board
Folleto-Bulletin Board
 
It budget
It budgetIt budget
It budget
 
Innovative Lighting Solutions
Innovative Lighting SolutionsInnovative Lighting Solutions
Innovative Lighting Solutions
 
Three County Fairgrounds Stormwater Permit Plans 11-03-2010
Three County Fairgrounds Stormwater Permit Plans 11-03-2010Three County Fairgrounds Stormwater Permit Plans 11-03-2010
Three County Fairgrounds Stormwater Permit Plans 11-03-2010
 
Relaciones animales
Relaciones animales Relaciones animales
Relaciones animales
 

Similar a An efficient data mining framework on hadoop using java persistence api

A Survey of Machine Learning Techniques for Self-tuning Hadoop Performance
A Survey of Machine Learning Techniques for Self-tuning Hadoop Performance A Survey of Machine Learning Techniques for Self-tuning Hadoop Performance
A Survey of Machine Learning Techniques for Self-tuning Hadoop Performance IJECEIAES
 
Hot-Spot analysis Using Apache Spark framework
Hot-Spot analysis Using Apache Spark frameworkHot-Spot analysis Using Apache Spark framework
Hot-Spot analysis Using Apache Spark frameworkSupriya .
 
Web Oriented FIM for large scale dataset using Hadoop
Web Oriented FIM for large scale dataset using HadoopWeb Oriented FIM for large scale dataset using Hadoop
Web Oriented FIM for large scale dataset using Hadoopdbpublications
 
LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT
LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENTLARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT
LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENTijwscjournal
 
EVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMING
EVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMINGEVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMING
EVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMINGijiert bestjournal
 
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...acijjournal
 
Leveraging Map Reduce With Hadoop for Weather Data Analytics
Leveraging Map Reduce With Hadoop for Weather Data Analytics Leveraging Map Reduce With Hadoop for Weather Data Analytics
Leveraging Map Reduce With Hadoop for Weather Data Analytics iosrjce
 
IRJET - Evaluating and Comparing the Two Variation with Current Scheduling Al...
IRJET - Evaluating and Comparing the Two Variation with Current Scheduling Al...IRJET - Evaluating and Comparing the Two Variation with Current Scheduling Al...
IRJET - Evaluating and Comparing the Two Variation with Current Scheduling Al...IRJET Journal
 
Survey Paper on Big Data and Hadoop
Survey Paper on Big Data and HadoopSurvey Paper on Big Data and Hadoop
Survey Paper on Big Data and HadoopIRJET Journal
 
Unstructured Datasets Analysis: Thesaurus Model
Unstructured Datasets Analysis: Thesaurus ModelUnstructured Datasets Analysis: Thesaurus Model
Unstructured Datasets Analysis: Thesaurus ModelEditor IJCATR
 
Generating Frequent Itemsets by RElim on Hadoop Clusters
Generating Frequent Itemsets by RElim on Hadoop ClustersGenerating Frequent Itemsets by RElim on Hadoop Clusters
Generating Frequent Itemsets by RElim on Hadoop ClustersBRNSSPublicationHubI
 
Characterization of hadoop jobs using unsupervised learning
Characterization of hadoop jobs using unsupervised learningCharacterization of hadoop jobs using unsupervised learning
Characterization of hadoop jobs using unsupervised learningJoão Gabriel Lima
 
BDA Mod2@AzDOCUMENTS.in.pdf
BDA Mod2@AzDOCUMENTS.in.pdfBDA Mod2@AzDOCUMENTS.in.pdf
BDA Mod2@AzDOCUMENTS.in.pdfKUMARRISHAV37
 
A survey on data mining and analysis in hadoop and mongo db
A survey on data mining and analysis in hadoop and mongo dbA survey on data mining and analysis in hadoop and mongo db
A survey on data mining and analysis in hadoop and mongo dbAlexander Decker
 
A survey on data mining and analysis in hadoop and mongo db
A survey on data mining and analysis in hadoop and mongo dbA survey on data mining and analysis in hadoop and mongo db
A survey on data mining and analysis in hadoop and mongo dbAlexander Decker
 
Paper id 25201498
Paper id 25201498Paper id 25201498
Paper id 25201498IJRAT
 
Design Issues and Challenges of Peer-to-Peer Video on Demand System
Design Issues and Challenges of Peer-to-Peer Video on Demand System Design Issues and Challenges of Peer-to-Peer Video on Demand System
Design Issues and Challenges of Peer-to-Peer Video on Demand System cscpconf
 

Similar a An efficient data mining framework on hadoop using java persistence api (20)

A Survey of Machine Learning Techniques for Self-tuning Hadoop Performance
A Survey of Machine Learning Techniques for Self-tuning Hadoop Performance A Survey of Machine Learning Techniques for Self-tuning Hadoop Performance
A Survey of Machine Learning Techniques for Self-tuning Hadoop Performance
 
Hot-Spot analysis Using Apache Spark framework
Hot-Spot analysis Using Apache Spark frameworkHot-Spot analysis Using Apache Spark framework
Hot-Spot analysis Using Apache Spark framework
 
Web Oriented FIM for large scale dataset using Hadoop
Web Oriented FIM for large scale dataset using HadoopWeb Oriented FIM for large scale dataset using Hadoop
Web Oriented FIM for large scale dataset using Hadoop
 
LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT
LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENTLARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT
LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT
 
EVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMING
EVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMINGEVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMING
EVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMING
 
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...
 
G017143640
G017143640G017143640
G017143640
 
B017320612
B017320612B017320612
B017320612
 
Leveraging Map Reduce With Hadoop for Weather Data Analytics
Leveraging Map Reduce With Hadoop for Weather Data Analytics Leveraging Map Reduce With Hadoop for Weather Data Analytics
Leveraging Map Reduce With Hadoop for Weather Data Analytics
 
IRJET - Evaluating and Comparing the Two Variation with Current Scheduling Al...
IRJET - Evaluating and Comparing the Two Variation with Current Scheduling Al...IRJET - Evaluating and Comparing the Two Variation with Current Scheduling Al...
IRJET - Evaluating and Comparing the Two Variation with Current Scheduling Al...
 
Survey Paper on Big Data and Hadoop
Survey Paper on Big Data and HadoopSurvey Paper on Big Data and Hadoop
Survey Paper on Big Data and Hadoop
 
Unstructured Datasets Analysis: Thesaurus Model
Unstructured Datasets Analysis: Thesaurus ModelUnstructured Datasets Analysis: Thesaurus Model
Unstructured Datasets Analysis: Thesaurus Model
 
Generating Frequent Itemsets by RElim on Hadoop Clusters
Generating Frequent Itemsets by RElim on Hadoop ClustersGenerating Frequent Itemsets by RElim on Hadoop Clusters
Generating Frequent Itemsets by RElim on Hadoop Clusters
 
Characterization of hadoop jobs using unsupervised learning
Characterization of hadoop jobs using unsupervised learningCharacterization of hadoop jobs using unsupervised learning
Characterization of hadoop jobs using unsupervised learning
 
BDA Mod2@AzDOCUMENTS.in.pdf
BDA Mod2@AzDOCUMENTS.in.pdfBDA Mod2@AzDOCUMENTS.in.pdf
BDA Mod2@AzDOCUMENTS.in.pdf
 
IJET-V2I6P25
IJET-V2I6P25IJET-V2I6P25
IJET-V2I6P25
 
A survey on data mining and analysis in hadoop and mongo db
A survey on data mining and analysis in hadoop and mongo dbA survey on data mining and analysis in hadoop and mongo db
A survey on data mining and analysis in hadoop and mongo db
 
A survey on data mining and analysis in hadoop and mongo db
A survey on data mining and analysis in hadoop and mongo dbA survey on data mining and analysis in hadoop and mongo db
A survey on data mining and analysis in hadoop and mongo db
 
Paper id 25201498
Paper id 25201498Paper id 25201498
Paper id 25201498
 
Design Issues and Challenges of Peer-to-Peer Video on Demand System
Design Issues and Challenges of Peer-to-Peer Video on Demand System Design Issues and Challenges of Peer-to-Peer Video on Demand System
Design Issues and Challenges of Peer-to-Peer Video on Demand System
 

Más de João Gabriel Lima

Deep marketing - Indoor Customer Segmentation
Deep marketing - Indoor Customer SegmentationDeep marketing - Indoor Customer Segmentation
Deep marketing - Indoor Customer SegmentationJoão Gabriel Lima
 
Aplicações de Alto Desempenho com JHipster Full Stack
Aplicações de Alto Desempenho com JHipster Full StackAplicações de Alto Desempenho com JHipster Full Stack
Aplicações de Alto Desempenho com JHipster Full StackJoão Gabriel Lima
 
Realidade aumentada com react native e ARKit
Realidade aumentada com react native e ARKitRealidade aumentada com react native e ARKit
Realidade aumentada com react native e ARKitJoão Gabriel Lima
 
Big data e Inteligência Artificial
Big data e Inteligência ArtificialBig data e Inteligência Artificial
Big data e Inteligência ArtificialJoão Gabriel Lima
 
Mineração de Dados no Weka - Regressão Linear
Mineração de Dados no Weka -  Regressão LinearMineração de Dados no Weka -  Regressão Linear
Mineração de Dados no Weka - Regressão LinearJoão Gabriel Lima
 
Segurança na Internet - Estudos de caso
Segurança na Internet - Estudos de casoSegurança na Internet - Estudos de caso
Segurança na Internet - Estudos de casoJoão Gabriel Lima
 
Segurança na Internet - Google Hacking
Segurança na Internet - Google  HackingSegurança na Internet - Google  Hacking
Segurança na Internet - Google HackingJoão Gabriel Lima
 
Segurança na Internet - Conceitos fundamentais
Segurança na Internet - Conceitos fundamentaisSegurança na Internet - Conceitos fundamentais
Segurança na Internet - Conceitos fundamentaisJoão Gabriel Lima
 
Mineração de Dados com RapidMiner - Um Estudo de caso sobre o Churn Rate em...
Mineração de Dados com RapidMiner - Um Estudo de caso sobre o Churn Rate em...Mineração de Dados com RapidMiner - Um Estudo de caso sobre o Churn Rate em...
Mineração de Dados com RapidMiner - Um Estudo de caso sobre o Churn Rate em...João Gabriel Lima
 
Mineração de dados com RapidMiner + WEKA - Clusterização
Mineração de dados com RapidMiner + WEKA - ClusterizaçãoMineração de dados com RapidMiner + WEKA - Clusterização
Mineração de dados com RapidMiner + WEKA - ClusterizaçãoJoão Gabriel Lima
 
Mineração de dados na prática com RapidMiner e Weka
Mineração de dados na prática com RapidMiner e WekaMineração de dados na prática com RapidMiner e Weka
Mineração de dados na prática com RapidMiner e WekaJoão Gabriel Lima
 
Visualizacao de dados - Come to the dark side
Visualizacao de dados - Come to the dark sideVisualizacao de dados - Come to the dark side
Visualizacao de dados - Come to the dark sideJoão Gabriel Lima
 
REST x SOAP : Qual abordagem escolher?
REST x SOAP : Qual abordagem escolher?REST x SOAP : Qual abordagem escolher?
REST x SOAP : Qual abordagem escolher?João Gabriel Lima
 
Game of data - Predição e Análise da série Game Of Thrones a partir do uso de...
Game of data - Predição e Análise da série Game Of Thrones a partir do uso de...Game of data - Predição e Análise da série Game Of Thrones a partir do uso de...
Game of data - Predição e Análise da série Game Of Thrones a partir do uso de...João Gabriel Lima
 
E-trânsito cidadão - IPVA em suas mãos
E-trânsito cidadão - IPVA em suas mãosE-trânsito cidadão - IPVA em suas mãos
E-trânsito cidadão - IPVA em suas mãosJoão Gabriel Lima
 
[Estácio - IESAM] Automatizando Tarefas com Gulp.js
[Estácio - IESAM] Automatizando Tarefas com Gulp.js[Estácio - IESAM] Automatizando Tarefas com Gulp.js
[Estácio - IESAM] Automatizando Tarefas com Gulp.jsJoão Gabriel Lima
 
Hackeando a Internet das Coisas com Javascript
Hackeando a Internet das Coisas com JavascriptHackeando a Internet das Coisas com Javascript
Hackeando a Internet das Coisas com JavascriptJoão Gabriel Lima
 

Más de João Gabriel Lima (20)

Cooking with data
Cooking with dataCooking with data
Cooking with data
 
Deep marketing - Indoor Customer Segmentation
Deep marketing - Indoor Customer SegmentationDeep marketing - Indoor Customer Segmentation
Deep marketing - Indoor Customer Segmentation
 
Aplicações de Alto Desempenho com JHipster Full Stack
Aplicações de Alto Desempenho com JHipster Full StackAplicações de Alto Desempenho com JHipster Full Stack
Aplicações de Alto Desempenho com JHipster Full Stack
 
Realidade aumentada com react native e ARKit
Realidade aumentada com react native e ARKitRealidade aumentada com react native e ARKit
Realidade aumentada com react native e ARKit
 
JS - IA
JS - IAJS - IA
JS - IA
 
Big data e Inteligência Artificial
Big data e Inteligência ArtificialBig data e Inteligência Artificial
Big data e Inteligência Artificial
 
Mineração de Dados no Weka - Regressão Linear
Mineração de Dados no Weka -  Regressão LinearMineração de Dados no Weka -  Regressão Linear
Mineração de Dados no Weka - Regressão Linear
 
Segurança na Internet - Estudos de caso
Segurança na Internet - Estudos de casoSegurança na Internet - Estudos de caso
Segurança na Internet - Estudos de caso
 
Segurança na Internet - Google Hacking
Segurança na Internet - Google  HackingSegurança na Internet - Google  Hacking
Segurança na Internet - Google Hacking
 
Segurança na Internet - Conceitos fundamentais
Segurança na Internet - Conceitos fundamentaisSegurança na Internet - Conceitos fundamentais
Segurança na Internet - Conceitos fundamentais
 
Web Machine Learning
Web Machine LearningWeb Machine Learning
Web Machine Learning
 
Mineração de Dados com RapidMiner - Um Estudo de caso sobre o Churn Rate em...
Mineração de Dados com RapidMiner - Um Estudo de caso sobre o Churn Rate em...Mineração de Dados com RapidMiner - Um Estudo de caso sobre o Churn Rate em...
Mineração de Dados com RapidMiner - Um Estudo de caso sobre o Churn Rate em...
 
Mineração de dados com RapidMiner + WEKA - Clusterização
Mineração de dados com RapidMiner + WEKA - ClusterizaçãoMineração de dados com RapidMiner + WEKA - Clusterização
Mineração de dados com RapidMiner + WEKA - Clusterização
 
Mineração de dados na prática com RapidMiner e Weka
Mineração de dados na prática com RapidMiner e WekaMineração de dados na prática com RapidMiner e Weka
Mineração de dados na prática com RapidMiner e Weka
 
Visualizacao de dados - Come to the dark side
Visualizacao de dados - Come to the dark sideVisualizacao de dados - Come to the dark side
Visualizacao de dados - Come to the dark side
 
REST x SOAP : Qual abordagem escolher?
REST x SOAP : Qual abordagem escolher?REST x SOAP : Qual abordagem escolher?
REST x SOAP : Qual abordagem escolher?
 
Game of data - Predição e Análise da série Game Of Thrones a partir do uso de...
Game of data - Predição e Análise da série Game Of Thrones a partir do uso de...Game of data - Predição e Análise da série Game Of Thrones a partir do uso de...
Game of data - Predição e Análise da série Game Of Thrones a partir do uso de...
 
E-trânsito cidadão - IPVA em suas mãos
E-trânsito cidadão - IPVA em suas mãosE-trânsito cidadão - IPVA em suas mãos
E-trânsito cidadão - IPVA em suas mãos
 
[Estácio - IESAM] Automatizando Tarefas com Gulp.js
[Estácio - IESAM] Automatizando Tarefas com Gulp.js[Estácio - IESAM] Automatizando Tarefas com Gulp.js
[Estácio - IESAM] Automatizando Tarefas com Gulp.js
 
Hackeando a Internet das Coisas com Javascript
Hackeando a Internet das Coisas com JavascriptHackeando a Internet das Coisas com Javascript
Hackeando a Internet das Coisas com Javascript
 

Último

UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesThousandEyes
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 

Último (20)

UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 

An efficient data mining framework on hadoop using java persistence api

I. INTRODUCTION

Many approaches have been proposed for handling high-dimensional and large-scale data, in which query processing is the bottleneck. "Algorithms for knowledge discovery tasks are often based on range searches or nearest neighbor search in multidimensional feature spaces" [1]. Business intelligence and data warehouses can hold a terabyte or more of data, and cloud computing has emerged to meet the growing demands of data mining. MapReduce is a programming framework and an associated implementation designed for large data sets: the details of partitioning, scheduling, failure handling and communication are hidden by MapReduce, and users simply define map functions that create intermediate <key, value> tuples and reduce functions that merge those tuples for special processing [2].

A concise indexing Hadoop implementation of MapReduce is presented in McCreadie's work [3]. Ralf proposes a basic program skeleton to underlie MapReduce computations [4]. Moretti presents an abstraction for scalable data mining, in which data and computation are distributed in a computing cloud with minimal user effort [5]. Gillick uses Hadoop to implement query-based learning [6]. Most data mining algorithms are based on object-oriented programming, which runs in memory; Han elaborates many of these methods [7]. A link-based structure, like the X-tree [8], may be suitable for indexing high-dimensional data sets.
To mine high-dimensional and large-scale data on Hadoop, we employ Object-Relation Mapping (ORM), which stores objects whose size may surpass memory limits in a relational database. The Java Persistence API (JPA) provides a persistence model for ORM [9] and can store all of the objects generated by data mining models in a relational database. A distributed database is a suitable solution for ensuring robustness in distributed handling; MySQL Cluster is designed to withstand any single point of failure [10], which is consistent with Hadoop.

We performed the same work that McCreadie performed [3] and now propose a novel indexing Hadoop implementation for continuous values. Using JPA and MySQL Cluster on Hadoop, we propose an efficient data mining framework, which is elaborated through a decision tree implementation. The distributed computing in Hadoop and the centralized data collection using JPA are combined organically. We also compare a naïve JDBC implementation with our JPA implementation on Hadoop.

The rest of the paper is organized as follows. In Section 2 we explain the index structures, flowchart and algorithms, and propose the data mining framework. Section 3 describes our experimental setting and results. In Section 4, we offer conclusions and suggest possible directions for future work.
II. METHOD

A. Naïve Data Indexing

Naïve data indexing is necessary for querying data in classification or clustering algorithms, and indexing can dramatically decrease the complexity of querying. As in McCreadie's work [3], we create an inverted index for continuous values.

1) Definition: An (n × m) dataset is an (n × m) matrix containing n rows of data with m columns of float numbers. Let min(i), max(i), sum(i), avg(i), std(i) and cnt(i) (i ∈ [1..m]) denote the minimum, maximum, sum, average, standard deviation and count of the i-th column, respectively, i.e., the essential statistical measures of that column. Let sum2(i) be the dot product of the i-th column with itself, which is used for std(i). From the standard formula

  ZSCORE(x) = (x - avg()) / std()   [7],

a new measure is defined:

  YSCORE(x) = round((x - avg()) × iStep / std()),

where iStep is an experimental integral parameter.
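Restated as code, the two measures are straightforward. The following is a minimal Java sketch, assuming avg and std have already been computed for the column in question; the class and method names are illustrative, not taken from the paper's source.

import java.lang.Math;

/** Illustrative sketch of the ZSCORE and YSCORE measures defined above. */
public final class Score {
    /** Standard z-score of x for a column with mean avg and deviation std. */
    static double zscore(double x, double avg, double std) {
        return (x - avg) / std;
    }

    /** YSCORE: the z-score scaled by the integral parameter iStep, rounded to a bin id. */
    static int yscore(double x, double avg, double std, int iStep) {
        return (int) Math.round((x - avg) * iStep / std);
    }
}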
2) Simple Design of the Inverted Index: High-dimensional datasets are usually stored as tables in a sophisticated DBMS for most data mining tasks; Table 1 shows the format of such a dataset. For data mining, the CSV format is the common data interchange file format.

Table 1. A normal data table.
  RecordsNO  Field01   Field02   ...
  Rec01      Value02   Value01   ...
  Rec02      Value05   Value01   ...
  Rec03      Value02   Value04   ...

Figure 1 illustrates the whole procedure for indexing and similarity querying. Discretization calculates the essential statistical measures (Figure 2), then performs YSCORE binning on the FloatData.TXT file (CSV format) and outputs the IntData.TXT file, in which each line is labeled consecutively with a unique number (LineNo).

(Figure 1. Flowchart for searching the Hadoop index: FloatData.TXT is discretized into IntData.TXT; the encoder produces IntData.SEQ and IntData.DAT; the indexer produces Col2Val.DAT and Val2No.DAT; a query is discretized, searched, and intersected to yield the result list of record numbers.)

The encoder transforms the text file IntData.TXT into the Hadoop SequenceFile format IntData.SEQ and generates IntData.DAT. To locate a record for a query rapidly, all records must have the same length. IntData.DAT contains an (n × m) matrix, that is, n rows of records containing m columns of bin ids, as does IntData.SEQ. The structure of IntData.DAT is shown in Table 2.

Table 2. Structure of IntData.DAT in BNF.
  IntData.DAT ::= <Record>
  <Record>    ::= <Bins>
  <Bins>      ::= Integer (32-bit binary)

With the indexer, the index files Val2NO.DAT and Col2Val.DAT are output by MapReduce from the IntData.SEQ file. To obtain the list of LineNo's for a specific bin, the list itself is stored in one structural file (Val2NO.DAT, Table 4), and the address and length of each list are stored in another structural file (Col2Val.DAT, Table 3). Both structural files are easy to access by offset.

Table 3. Structure of Col2Val.DAT in BNF.
  Col2Val.DAT    ::= <Field>
  <Field>        ::= <Bins, Address, LengthLineNo>
  <Bins>         ::= Integer (32-bit binary)
  <Address>      ::= Integer (64-bit binary)
  <LengthLineNo> ::= Integer (64-bit binary)

Table 4. Structure of Val2NO.DAT in BNF.
  Val2NO.DAT ::= <LineNo>
  <LineNo>   ::= Integer (64-bit binary)

The searcher, given a query, which should be a vector of m float values, performs YSCORE binning with the same statistical measures and yields a vector of m integers for searching; it then outputs a list of record numbers. The search algorithm is shown in Figure 5. In the intersection step, the position of each integer of the query is extracted from Col2Val.DAT, the corresponding lists of LineNo's are read from Val2NO.DAT, and the result is computed as the intersection of those lists. The intersection algorithm itself is omitted here.
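Because the records of Col2Val.DAT and the lists in Val2NO.DAT are fixed-width binary integers, a posting list can be fetched with two seeks. The sketch below illustrates the idea with java.io.RandomAccessFile on local copies of the files; on HDFS the equivalent reads would go through FSDataInputStream. The record numbering is an assumption for illustration, not the paper's code.

import java.io.IOException;
import java.io.RandomAccessFile;

/** Illustrative reader for the fixed-width index files of Tables 3 and 4.
 *  A Col2Val.DAT record is: bin (4 bytes), address (8 bytes), length (8 bytes). */
public class IndexReader {
    private static final int COL2VAL_RECORD = 4 + 8 + 8;

    /** Reads the LineNo list described by the recNo-th record of Col2Val.DAT. */
    static long[] lineNosFor(RandomAccessFile col2val, RandomAccessFile val2no,
                             long recNo) throws IOException {
        col2val.seek(recNo * COL2VAL_RECORD);
        int bin = col2val.readInt();       // bin id (Table 3)
        long address = col2val.readLong(); // byte offset of the list in Val2NO.DAT
        long length = col2val.readLong();  // number of LineNo entries in the list
        long[] lineNos = new long[(int) length];
        val2no.seek(address);              // jump straight to the posting list
        for (int i = 0; i < lineNos.length; i++) {
            lineNos[i] = val2no.readLong(); // each LineNo is 64 bits (Table 4)
        }
        return lineNos;
    }
}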
3) Statistics phase: In Hadoop, the data set is scanned once to calculate the four distributive measures min(i), max(i), sum(i) and sum2(i), as well as cnt(i), for each column (Figure 2); the algebraic measures of average and deviation are then computed from them. A (5 × m) key space is therefore used in the MapReduce process: the first row of keys is for the minimum of a column; the second row for the maximum; the third for the sum; the fourth for the sum of squares; and the fifth for the row count.

FUNCTION map (key, values, output)
1. f[1..m] ← parse values into a float array;
2. iW ← f.length;
3. FOR i = 1 THRU m
   a. IF min > f[i] THEN
      1) min ← f[i];
      2) output.collect(iW*0 + i, min);
   b. IF max < f[i] THEN
      1) max ← f[i];
      2) output.collect(iW*1 + i, max);
   c. output.collect(iW*2 + i, f[i]);        /* contribution to sum(i) */
   d. output.collect(iW*3 + i, f[i] × f[i]); /* contribution to sum2(i) */
4. output.collect(iW*4, 1);                  /* contribution to cnt */

FUNCTION reduce (key, values, output)
5. iCategory ← (int) key.get() / LenFields;
6. SWITCH (iCategory)
7. CASE 0: /* min() */
   a. WHILE (values.hasNext())
      1) fTmp ← values.next();
      2) IF (min > fTmp) THEN min ← fTmp;
   b. output.collect(key, min); BREAK;
8. CASE 1: /* max() */
   a. WHILE (values.hasNext())
      1) fTmp ← values.next();
      2) IF (max < fTmp) THEN max ← fTmp;
   b. output.collect(key, max); BREAK;
9. CASE 2: /* sum() */
   a. WHILE (values.hasNext())
      1) sum ← sum + values.next();
   b. output.collect(key, sum); BREAK;
10. CASE 3: /* sum2() */
   a. WHILE (values.hasNext())
      1) sum2 ← sum2 + values.next();
   b. output.collect(key, sum2); BREAK;
11. CASE 4: /* cnt() */
   a. WHILE (values.hasNext())
      1) cnt ← cnt + values.next();
   b. output.collect(key, cnt); BREAK;
FUNCTION statistics (input)
12. faStat[][] ← 0; key ← 0; value ← 0; cnt ← 0;
13. WHILE input.hasNext()
   a. <key, value> ← parse the line of input.next();
   b. iCategory ← key / LenFields;
   c. iField ← key MOD LenFields;
   d. IF (key < LenFields × 4) THEN
      1) faStat[iCategory][iField] ← value;
   e. ELSE
      1) cnt ← value;
14. avg[] ← 0; std[] ← 0;
15. FOR i = 1 THRU LenFields
   a. avg(i) ← faStat[2][i] / cnt;
   b. std(i) ← sqrt((faStat[3][i] - faStat[2][i] × faStat[2][i] / cnt) / (cnt - 1));
16. RETURN min(i), max(i), sum(i), avg(i), std(i), cnt;
END OF ALGORITHM STATISTICS

Figure 2. Algorithm Statistics for continuous values in Hadoop.

The algorithm runs in O(N / mapper), where N is the number of records in the data set and mapper is the number of map tasks. After step 4, every partition has been condensed into the five distributive measures, and the reduce function merges them into global measures. Keys for the map function are the positions of the next line, and values are the next line of text.

After step 11 in Figure 2, Hadoop creates result files named "part-xxxxx" that contain the five distributive measures for each field; the number of these files depends on the number of reducers. From step 15, the algebraic measures avg() and std() are calculated from the five distributive measures.

4) YSCORE phase: Figure 3 shows YSCORE binning of 1,000,000 values drawn from a normal distribution with mean 0 and standard deviation 1. With an iStep of 1, the upper-left panel reproduces exactly the three-sigma rule: there are only 9 bins, and the largest bin holds nearly 400,000 values. With an iStep of 8, the upper-right panel shows the number of bins rising to 36 and the histogram fitting the normal curve better; more importantly, the largest bin shrinks to about 50,000 values. With an iStep of 16 or 32, the maximum falls further, to about 25,000 or 10,000. These lower bin counts benefit the inverted indexing.

(Figure 3. YSCORE binning of 1,000,000 values from a normal distribution with mean 0 and standard deviation 1; the four panels show binning with an iStep of 1, 8, 16 and 32, respectively.)
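The binning itself is a map-only transformation: each input line is converted independently. A sketch of such a mapper against the 0.19-era org.apache.hadoop.mapred API follows; loading the per-column avg and std arrays from the statistics job's output (normally done in configure()) is omitted, and using the input byte offset as the record label is a simplification, not the paper's exact scheme.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

/** Map-only YSCORE binning: each CSV line of FloatData.TXT becomes a line of bin ids. */
public class DiscretizeMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, LongWritable, Text> {

    private double[] avg, std; // per-column statistics, loaded in configure() (omitted)
    private int iStep = 8;     // experimental binning parameter

    public void map(LongWritable key, Text value,
                    OutputCollector<LongWritable, Text> output,
                    Reporter reporter) throws IOException {
        String[] fields = value.toString().split(",");
        StringBuilder bins = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            double x = Double.parseDouble(fields[i]);
            long bin = Math.round((x - avg[i]) * iStep / std[i]); // YSCORE(x)
            if (i > 0) bins.append(',');
            bins.append(bin);
        }
        // With JobConf.setNumReduceTasks(0), map output goes straight to HDFS.
        output.collect(key, new Text(bins.toString()));
    }
}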
According to [11], "the central limit theorem (CLT) states conditions under which the mean of a sufficiently large number of independent random variables, each with finite mean and variance, will be approximately normally distributed." Even if the distribution of a field does not follow a normal distribution, in large-scale data sets the values of the field can still be binned using its mean and deviation once the size approaches a terabyte.

These interval values are put into bins using YSCORE binning [7]. The inverted index then takes the format of Table 5, following Table 1: the first column holds each field of the original dataset, and the second holds its <bin, RecNoArray> tuples.

Table 5. The structure of the inverted index for interval values.
  Fields   | Inverted Index
  Field01  | <Bin02, RecNoArray>, <Bin05, RecNoArray>, ...
  Field02  | <Bin01, RecNoArray>, <Bin04, RecNoArray>, ...
  ...      | ...

The YSCORE binning phase needs no reduce function, which eliminates the cost of sorting and transferring; the result files are precisely the partitions on which the map functions work. Keys in the map function are the positions of the next line, and values are the next line of text. The input file is FloatData.TXT and the output file is IntData.TXT, as shown in Figure 1.

The cumulative distribution function of the standard normal distribution reaches 50% at 0, 60% at 0.253, 70% at 0.524, 80% at 0.842, and 90% at 1.282. If a field's distribution is approximately standard normal, YSCORE therefore determines a balanced binning from its ZSCORE value. After the statistics phase and the YSCORE binning phase, Hadoop can handle the now-discrete data much as in the commonly used MapReduce word-count example.

B. Improved Data Indexing

Naïve data indexing has a flaw: it uses only one reducer to keep the global id for the keys, which causes a scalability bottleneck. Instead, a tractable <LongWritable, LongWritable> pair is employed to encode the three variables, the global id, the attribute, and the bin id, in the intermediate key-value pairs of the JobIndex (Figure 4). In general, neither the number of attributes nor the number of bins per attribute exceeds 2^32; since a LongWritable is 64 bits, the attribute can be packed into the upper 32 bits and the bin id into the lower 32 bits. These attribute-bin LongWritables are used as keys for Hadoop's sorting phase.

The number of ReducerIndex tasks is set equal to the number of attributes. The method setPartitionerClass() in Hadoop decides which ReducerIndex task an intermediate key-value pair is sent to, using the upper 32 bits of the key. In the ReducerIndex task, the attribute-bin LongWritables are grouped by the bin id in the lower 32 bits, and the inverted index <Bin, RecNoArray> (Table 5) can then be output easily. Each reduce task in Hadoop generates its own result, so each attribute gets its own index file.

(Figure 4. Data flow diagram for improved indexing of continuous values in Hadoop: SourceDataSet ::= <LongWritable, FloatArrayWritable> → JobDiscret (mapperDiscrete 1..m) → DiscretedData ::= <LongWritable, IntArrayWritable> → JobIndex (mapperIndex 1..m) → IntermediateData ::= <LongWritable, LongWritable> → ReducerIndex 1..r → IndexData ::= <IntWritable, LongArrayWritable>.)
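The bit packing and the attribute-based routing can be written compactly. The sketch below is illustrative rather than the paper's code: the class names and the modulo routing rule are assumptions, built on the 0.19-era Partitioner interface.

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

/** Packs <attribute, bin> into one 64-bit key: attribute in the upper 32 bits,
 *  bin id in the lower 32 bits, as described for the JobIndex phase. */
public class AttrBinKey {
    static long pack(int attribute, int bin) {
        return ((long) attribute << 32) | (bin & 0xFFFFFFFFL);
    }
    static int attribute(long key) { return (int) (key >>> 32); }
    static int bin(long key)       { return (int) key; }
}

/** Routes each key to the reducer that owns its attribute (upper 32 bits). */
class AttrPartitioner implements Partitioner<LongWritable, LongWritable> {
    public void configure(JobConf job) { }

    public int getPartition(LongWritable key, LongWritable value, int numReduceTasks) {
        // With one reducer per attribute, this sends every bin of an
        // attribute to the same ReducerIndex task.
        return AttrBinKey.attribute(key.get()) % numReduceTasks;
    }
}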
C. Hadoop MapFile Indexing

The MapFile format is a directory consisting of two files, a data file and an index file, both of the SequenceFile type. The data file is a normal SequenceFile; the index file holds a few keys at intervals together with their corresponding addresses in the data file. To query a given key for its value, a binary search is performed in the index file and then a sequential search in the data file, within the range defined by the corresponding addresses [12]. The algorithm for searching by MapFile is shown in Figure 5.

FUNCTION SearchMapFile (float[] faQuery)
17. laResult ← null; laTmp ← null; isFirst ← TRUE;
18. FOR iField = 1 THRU faQuery.length
   a. iValue ← YSCORE(faQuery[iField]);
   b. Indexer ← new MapFile.Reader(index file of iField);
   c. Indexer.get(iValue, laTmp);
   d. IF (isFirst) THEN
      1) laResult ← laTmp; isFirst ← FALSE;
   e. ELSE
      1) laResult ← getIntersection(laResult, laTmp);
19. RETURN laResult;
END OF ALGORITHM SearchMapFile

Figure 5. Algorithm for searching by MapFile.

HBase, a Bigtable-like storage for the Hadoop distributed computing environment, is based on MapFile [13]. It does not perform well, owing to the complexity of MapFile.
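Both search paths end with getIntersection(), whose implementation the paper omits. Since each posting list can be kept sorted by record number, one standard linear-merge realization is sketched below; this is an illustration, not the authors' code.

import java.util.ArrayList;
import java.util.List;

/** Intersects two ascending lists of LineNo's in O(|a| + |b|). */
public class Intersect {
    static List<Long> getIntersection(List<Long> a, List<Long> b) {
        List<Long> out = new ArrayList<Long>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            long x = a.get(i), y = b.get(j);
            if (x == y) { out.add(x); i++; j++; } // record matches both bins
            else if (x < y) i++;                  // advance the smaller side
            else j++;
        }
        return out;
    }
}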
D. Data mining framework on Hadoop

Although the improved data indexing performs better than MapFile indexing, the overhead of range querying by intersection is still too high. To apply common object-oriented data mining algorithms in Hadoop, the three problems posed in the introduction, globality, random writes and duration, must be handled. For this, we propose a data mining framework on Hadoop using JPA.

1) MySQL Cluster: After many failures, we found that, for data mining with Hadoop, a database is an ideal persistent repository for handling the three problems mentioned. To ensure computing robustness, a distributed database is the suitable solution for distributed handling of high-dimensional, large-scale data. MySQL Cluster is designed to eliminate any single point of failure [10] and is based on open source code; both characteristics make it consistent with data mining on Hadoop.

2) ORM: JPA and naïve JDBC are both used to implement the ORM for data mining objects in a database (Figure 6). To improve the data indexing method, a KD-tree or X-tree may be suitable; however, the decision tree is a classical algorithm and better elaborates the framework.

(Figure 6. Data flow diagram of the data mining framework on Hadoop: TrainingData and TestingData are processed by Hadoop; the data mining modules persist their objects through JPA or JDBC into the ndbcluster storage engine.)

In the data mining framework, a large-scale dataset is scanned by Hadoop, and the necessary information is extracted by classical data mining modules embedded in the Hadoop map or reduce tasks. The extracted objects, such as lists, trees, and graphs, are sent automatically into a database, in this case MySQL Cluster. As a test case, the test dataset is then scanned and handled by each data mining task.

3) Stored procedure: Once the objects obtained from data mining are stored in a database, all the details can be finished with stored procedures, such as finding a path between two nodes, returning all the children of a node, or returning the category of a test case. A data mining task need only send demands and receive results, which decreases network overhead and latency. The following code returns all the children of a node in a binary tree (Figure 7).

CREATE PROCEDURE getChild (IN rootid INT)
BEGIN
  DROP TABLE IF EXISTS chd;
  CREATE TEMPORARY TABLE chd (id INT NOT NULL DEFAULT -1) ENGINE = MEMORY;
  DROP TABLE IF EXISTS par;
  CREATE TEMPORARY TABLE par (id INT NOT NULL DEFAULT -1) ENGINE = MEMORY;
  DROP TABLE IF EXISTS result;
  CREATE TABLE result (id INT NOT NULL DEFAULT -1);
  INSERT chd SELECT A.ID FROM btree A WHERE rootid = A.PID;
  WHILE ROW_COUNT() > 0 DO
    TRUNCATE par;
    INSERT par SELECT * FROM chd;
    INSERT result SELECT * FROM chd;
    TRUNCATE chd;
    INSERT chd SELECT A.ID FROM btree A, par B WHERE B.ID = A.PID;
  END WHILE;
END;

Figure 7. Stored procedure for all children of a node in MySQL.
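From the Java side, a mining task only needs to issue the call and read back the result table. A minimal JDBC sketch follows; the connection URL, credentials, database name, and the root id passed to the procedure are placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

/** Illustrative JDBC client for the stored procedure of Figure 7. */
public class GetChildClient {
    public static void main(String[] args) throws Exception {
        Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost/dm", "user", "password");
        try {
            Statement st = con.createStatement();
            st.execute("CALL getChild(1)");           // fill the result table for node 1
            ResultSet rs = st.executeQuery("SELECT id FROM result");
            while (rs.next()) {
                System.out.println(rs.getLong("id")); // every descendant of the node
            }
        } finally {
            con.close();
        }
    }
}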
4) Decision tree implementation: As shown in the textbook [7], a decision tree over continuous values can be built as a binary tree. Let D be a dataset. The entropy of D is given by

  Info(D) = - Σ_{i=1..m} p_i × log2(p_i),   (1)

where m is the number of categories and p_i is the probability of category i. Let A be an attribute in D. The expected entropy of D after partitioning on A is given by

  Info_A(D) = (|D1| / |D|) × Info(D1) + (|D2| / |D|) × Info(D2),   (2)

where Info() is from formula (1), D1 is the part of D that belongs to the left child node, and D2 is the part that belongs to the right child node.

Each node of the tree contains two members, an ATTRIBUTE and a VALUE. Given a test record (an array of continuous values), the former tells which field should be checked, and the latter tells which child node should be selected: if the test record's value in the member ATTRIBUTE is less than or equal to the member VALUE, the left child node is selected; otherwise the right child node is selected.
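Formulas (1) and (2) translate directly into code. A small sketch follows, assuming per-category counts have already been accumulated for D and for the two candidate partitions; the names are illustrative.

/** Entropy of formulas (1) and (2): counts[i] is the number of records of category i. */
public class Entropy {
    static double info(long[] counts) {
        long total = 0;
        for (long c : counts) total += c;
        double h = 0.0;
        for (long c : counts) {
            if (c == 0) continue;                 // a 0 * log 0 term contributes nothing
            double p = (double) c / total;
            h -= p * (Math.log(p) / Math.log(2)); // log base 2
        }
        return h;
    }

    /** Expected entropy after splitting D into D1 (left) and D2 (right), formula (2). */
    static double infoA(long[] left, long[] right) {
        long n1 = 0, n2 = 0;
        for (long c : left) n1 += c;
        for (long c : right) n2 += c;
        double n = n1 + n2;
        return (n1 / n) * info(left) + (n2 / n) * info(right);
    }
}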
  • 6. tribute, value, category>, where the attribute indicates the use the method output.collect (Figure 9); instead, it writes position of the value in the record and the category indicates directly to two sequence files in the HDFS. This decreases the class of the record. For example, a record <3, 5, 7, 9, -1> unwanted I/O overheads. The SourceDataSet is D; the will output the tuples <1, 3, -1>, <2, 5, -1>, <3, 7, -1>, <4, 9, ChildDataLeft is D1; the ChildDataRight is D2; and all of -1>. These AVC_Writable tuples will be the keys in the in- these are in fact directories in HDFS, which contain the termediate key-value pairs. The values will be VC_Writable outputs from each mapperSplit task. tuples, or the part <value, category> in the AVC_Writable The above processing is just for one node. To build a tuples. whole decision tree, the processing needs to be repeated Hadoop’s method setPartitionerClass() decides to which recursively. A complete decision tree Hadoop algorithm is reducer a key-value pair should be sent, using the attribute in shown in Figure 10. the AVC_Writable tuple. Hadoop’s method setOutputVal- ueGroupingComparator() will group the values 1. nodeD.Path ← D, JPA.persist(nodeD); (VC_Writable tuples) by the same attribute in ReducerSelect 2. BuildTree(nodeD); tasks. By default, the values are sorted based on the keys, so FUNCTION BuildTree (Path D) determining the best split value for the attribute requires one 3. IF (all categories in D are the same C) THEN pass through the VC_Writable tuples by the formula (2). JPA.begin(); set nodeD as a leaf; set nodeD.category as C; JPA.commit(); RETURN nodeD; 4. IF (all attribute checked) THEN SourceDataSet::= <LongWritable, FloatArrayWritable> JPA.begin();nodeD.Leaf ← true; nodeD.category ← max(C){C∈D}; JPA.commit(); RETURN nodeD; 5. JPA.begin(); nodeD.attribute, nodeD.value ← JobSe- mapperSelect mapperSelect 1 ... m lect(D); JPA.commit(); 6. JPA.begin(); D1, D2 ← JobSplit(D); JPA.commit(), 7. nodeLeft ← BuildTree(D1); IntermediateData::= JobSelect <AVC_Writable, VC_Writable> 8. nodeRight ← BuildTree(D2); 9. JPA.persist(nodeLeft); JPA.persist(nodeRight); ReducerSelect ReducerSelect 10. JPA.begin(); nodeD.ChildLeft ← nodeLeft; 1 ... r nodeD. ChildRight ← nodeRight; JPA.commit(); 11. RETURN nodeD SelectData::= Figure 10 Algorithm for building decision tree on Hadoop <EntropyWritable, AV_Writable> The JPA.persist() function (Figure 10) means a new node Figure 8 The Data Flow Diagram of Selection of Decision Tree in Hadoop is persisted by JPA. The assignments enclosed by JPA.begin() and JPA.commit() mean these operations are After the best split, values for each attribute are calcu- monitored by JPA. Any node in the cluster will share objects lated by the ReduceSelect tasks, and the key-value pair <En- persisted by JPA. tropyWritable, AV_Writable> is outputted. The minimum III. EXPERIMENTAL SETUP expected entropy is selected. Let A0 and V0 be the best at- tribute-value pair. D1 is the subset of D satisfying A0 <=V0, We evaluated several synthetic datasets and an open da- and D2 is the subset of D satisfying A0 > V0. taset. Before the split, a node in JPA will be created corre- A. Computing Environment sponding to D, with the best attribute A0 and best splitting value V0. Experiments were conducted using Linux CentOS 5.0, JobSplit contains only mapperSplit tasks, which do not Hadoop 0.19.1, JDK 1.6, a 1Gbps switch network, and 4 ASUS RS100-X5 servers (Table 10). SourceDataSet::= Table 6 Devices for Hadoop computing. 
The above processing builds just one node. To build the whole decision tree, the processing has to be repeated recursively. The complete algorithm for building a decision tree on Hadoop is shown in Figure 10.

1.  nodeD.Path ← D; JPA.persist(nodeD);
2.  BuildTree(nodeD);
FUNCTION BuildTree(Path D)
3.  IF (all categories in D are the same C) THEN
      JPA.begin(); set nodeD as a leaf; nodeD.category ← C; JPA.commit(); RETURN nodeD;
4.  IF (all attributes checked) THEN
      JPA.begin(); nodeD.Leaf ← true; nodeD.category ← max(C){C ∈ D}; JPA.commit(); RETURN nodeD;
5.  JPA.begin(); nodeD.attribute, nodeD.value ← JobSelect(D); JPA.commit();
6.  JPA.begin(); D1, D2 ← JobSplit(D); JPA.commit();
7.  nodeLeft ← BuildTree(D1);
8.  nodeRight ← BuildTree(D2);
9.  JPA.persist(nodeLeft); JPA.persist(nodeRight);
10. JPA.begin(); nodeD.ChildLeft ← nodeLeft; nodeD.ChildRight ← nodeRight; JPA.commit();
11. RETURN nodeD

Figure 10 Algorithm for building the decision tree on Hadoop

The JPA.persist() call in Figure 10 persists a new node through JPA. The assignments enclosed by JPA.begin() and JPA.commit() are monitored by JPA as a transaction. Any node in the cluster can share the objects persisted by JPA.
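A minimal JPA sketch of the node entity and the transaction pattern of Figure 10 is given below, assuming a RESOURCE_LOCAL persistence unit (here named "miningPU") configured for the MySQL Cluster data source; the entity mapping and method names are illustrative, not the paper's actual schema.

import javax.persistence.Entity;
import javax.persistence.EntityManager;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;
import javax.persistence.OneToOne;
import javax.persistence.Persistence;

// Persistent node entity behind Figure 10; field names follow the pseudocode.
@Entity
class TreeNode {
  @Id @GeneratedValue Long id;
  String path;       // HDFS directory holding this node's data set D
  int attribute;     // best split attribute chosen by JobSelect
  float value;       // best split value chosen by JobSelect
  boolean leaf;
  int category;
  @OneToOne TreeNode childLeft;
  @OneToOne TreeNode childRight;
}

class TreePersistence {
  // "miningPU" is an assumed persistence-unit name for the NDB data source.
  static final EntityManager EM =
      Persistence.createEntityManagerFactory("miningPU").createEntityManager();

  // Step 5 of Figure 10: assignments wrapped by JPA.begin()/JPA.commit();
  // nodeD is assumed to be a managed entity.
  static void recordSplit(TreeNode nodeD, int bestAttribute, float bestValue) {
    EM.getTransaction().begin();       // JPA.begin()
    nodeD.attribute = bestAttribute;   // result of JobSelect(D)
    nodeD.value = bestValue;
    EM.getTransaction().commit();      // JPA.commit(): flushed to the database
  }

  // Steps 9-10 of Figure 10: persist the children and link them to nodeD.
  static void linkChildren(TreeNode nodeD, TreeNode nodeLeft, TreeNode nodeRight) {
    EM.getTransaction().begin();
    EM.persist(nodeLeft);              // JPA.persist(nodeLeft)
    EM.persist(nodeRight);             // JPA.persist(nodeRight)
    nodeD.childLeft = nodeLeft;
    nodeD.childRight = nodeRight;
    EM.getTransaction().commit();
  }
}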
III. EXPERIMENTAL SETUP

We evaluated several synthetic data sets and one open data set.

A. Computing Environment

Experiments were conducted using Linux CentOS 5.0, Hadoop 0.19.1, JDK 1.6, a 1 Gbps switched network, and 4 ASUS RS100-X5 servers (Table 6).

Table 6 Devices for Hadoop computing
  Environment    Parameter
  Model name     ASUS RS100-X5
  CPU            Dual Core 2 GHz
  Cache size     1024 KB
  Memory         3.4 GB
  NIC            1 Gbps
  Hard Disk      1 TB

B. Inverted Indexing vs. MapFile Indexing

Two different-sized experiments are shown in Table 7. A (1,000,000 × 200) synthetic data set was generated at random for test1, and a (102,294 × 117) data set was taken from KDD Cup 2008 [14] for test2. The procedure followed the design described in Section II.A.2. In "Index Method," M means MapFile and I means Inverted Index. In both tests, the latter takes fewer seconds for searching than the former. The SearcherTime is the average for a single query.

Table 7 Comparison of MapFile and inverted indexing
  Item                     Test1           Test2
  Index Method             M       I       M      I
  FloatData (MB)           1,614           204
  DiscretizationTime (s)   180             59
  EncoderTime (s)          482             42
  IndexerTime (s)          5835    8473    307    397
  SearcherTime (s)         8.25    1.07    5.07   0.68

For a single query, the searching time of the inverted indexing shows a great advantage.

C. Data Mining Framework

The comparison of JDBC and JPA for building decision trees uses the same synthetic data set (Table 8). The times listed are cumulative over all procedures. To limit the size of the tree, the depth of any tree is restricted to at most 20.

Table 8 Comparison of JDBC and JPA on MySQL Cluster
  Item                         JDBC                 JPA
  Size of synthetic data       5,897,367,552 bytes (5,000,000 records with 117 fields)
  Number of nodes              501,723
  Accumulated inserting time   2,695.307 seconds    2,981.279 seconds
  Time for getChild            257.93 seconds       324.51 seconds

The times in Table 8 can be decreased dramatically if robustness is sacrificed; i.e., if the MyISAM engine is used on a single node instead of the NDBCluster engine on 4 nodes, database performance improves by approximately a factor of 25.
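For reference, a sketch contrasting the two persistence paths compared in Table 8; the connection URL, table name, and column names are illustrative assumptions.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import javax.persistence.EntityManager;

public class NodeStore {
  // JDBC path: a hand-written SQL INSERT for each tree node.
  // (Class.forName("com.mysql.jdbc.Driver") may be needed with
  // pre-JDBC-4 drivers on JDK 1.6.)
  static void insertViaJdbc(int attribute, float splitValue) throws SQLException {
    Connection con = DriverManager.getConnection(
        "jdbc:mysql://dbnode:3306/mining", "user", "pass");
    try {
      PreparedStatement ps = con.prepareStatement(
          "INSERT INTO tree_node (attribute, split_value) VALUES (?, ?)");
      ps.setInt(1, attribute);
      ps.setFloat(2, splitValue);
      ps.executeUpdate();
      ps.close();
    } finally {
      con.close();
    }
  }

  // JPA path: the same insert expressed as object persistence; the provider
  // maps the TreeNode entity (see the earlier sketch) to an SQL INSERT.
  static void insertViaJpa(EntityManager em, TreeNode node) {
    em.getTransaction().begin();
    em.persist(node);
    em.getTransaction().commit();
  }
}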
IV. CONCLUSION

A feasible data indexing algorithm is proposed for high-dimensional, large-scale data sets. It consists of ZSCORE binning, Hadoop processing, and improved intersectioning. Experimental results show that the algorithm is technically feasible.

An efficient data mining framework is proposed for Hadoop; it employs the JPA and MySQL Cluster to resolve the problems of globality, random writes, and duration in Hadoop. The distributed computing in Hadoop and the centralized data collection using JPA are combined organically.

We will consider replacing MapFile with our inverted indexing for HBase in future work, in an effort to enhance the performance of HBase in larger-scale data mining. A convenient tool for the data mining framework will also be a focus of future efforts.

ACKNOWLEDGEMENTS

This work is supported by the National Basic Research Priorities Programme (No. 2007CB311004), the National Science Foundation of China (Nos. 60775035, 60903141, 60933004, 60970088), and the National Science and Technology Support Plan (No. 2006BAC08B06).

REFERENCES

[1] C. Bohm, S. Berchtold, H. P. Kriegel, and U. Michel, "Multidimensional index structures in relational databases," in 1st International Conference on Data Warehousing and Knowledge Discovery (DaWaK 99), Florence, Italy, 1999, pp. 51-70.
[2] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," in 6th Symposium on Operating Systems Design and Implementation (OSDI 04), San Francisco, CA, 2004, pp. 137-149.
[3] R. M. C. McCreadie, C. Macdonald, and I. Ounis, On Single-Pass Indexing with MapReduce. New York: Association for Computing Machinery, 2009.
[4] R. Lammel, "Google's MapReduce programming model - Revisited," Science of Computer Programming, vol. 70, pp. 1-30, Jan. 2008.
[5] C. Moretti, K. Steinhaeuser, D. Thain, and N. V. Chawla, "Scaling up classifiers to cloud computers," in IEEE International Conference on Data Mining (ICDM 08), Pisa, Italy, 2008.
[6] D. Gillick, A. Faria, and J. DeNero, "MapReduce: Distributed computing for machine learning," 2006, http://www.icsi.berkeley.edu/~arlo/publications/gillick_cs262a_proj.pdf.
[7] J. Han and M. Kamber, Data Mining: Concepts and Techniques, 2nd ed. San Francisco: Morgan Kaufmann, 2006.
[8] S. Berchtold, D. A. Keim, and H. P. Kriegel, "The X-tree: An index structure for high-dimensional data," in Readings in Multimedia Computing and Networking, 2001.
[9] R. Biswas and E. Ort, "Java Persistence API - A simpler programming model for entity persistence," 2009, http://java.sun.com/developer/technicalArticles/J2EE/jpa/.
[10] S. Hinz, P. DuBois, J. Stephens, M. M. Brown, and A. Bedford, "MySQL Cluster," 2009, http://dev.mysql.com/doc/refman/5.0/en/mysql-cluster-overview.html.
[11] Wikipedia, "Central limit theorem," 2009, http://en.wikipedia.org/wiki/Central_limit_theorem.
[12] Apache Software Foundation, "MapFile (Hadoop 0.19.0 API)," 2008, http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/io/MapFile.html.
[13] J. Gray, "HBase: Bigtable-like structured storage for Hadoop HDFS," 2008, http://wiki.apache.org/hadoop/Hbase.
[14] Siemens Medical Solutions USA, "KDD Cup data," 2008, http://www.kddcup2008.com/KDDsite/Data.htm.