Machine Learning on Cell Processor




Submitted by: Robin Srivastava (Uni ID: U4700252)
Supervisor: Dr. Eric McCreath
Course: COMP8740
Abstract
The delayed stochastic gradient technique presented in the paper "Slow Learners are
Fast" shows theoretically how the online learning process can be parallelized. However,
with the real experimental setup given in the paper, parallelization does not improve
performance. In this project we implement and evaluate this algorithm on the Cell
processor and on an Intel dual-core processor, with the goal of obtaining a speedup under
the paper's real experimental setup. We also discuss the limitations of the Cell processor
for this algorithm, along with a suggestion on the CPU architectures for which it is better suited.




1. INTRODUCTION

2. BACKGROUND

2.1 MACHINE LEARNING
2.2 ALGORITHM (REFERENCED FROM [LANGFORD, SMOLA AND ZINKEVICH, 2009])
2.3 POSSIBLE TEMPLATES FOR IMPLEMENTATION
A) ASYNCHRONOUS OPTIMIZATION
B) PIPELINED OPTIMIZATION
C) RANDOMIZATION
2.4 CELL PROCESSOR
2.5 EXPERIMENTAL SETUP

3. DESIGN AND IMPLEMENTATION

3.1 PRE-PROCESSING TREC DATASET
3.1.1 INTEL DUAL CORE
3.1.2 CELL PROCESSOR
3.1.3 REPRESENTATION OF EMAILS AND LABELS
3.2 IMPLEMENTATION OF LOGISTIC REGRESSION
3.3 IMPLEMENTATION OF LOGISTIC REGRESSION WITH DELAYED UPDATE
3.3.1 IMPLEMENTATION ON A DUAL CORE INTEL PENTIUM PROCESSOR
3.3.2 IMPLEMENTATION ON CELL BROADBAND ENGINE

4. RESULTS

5. CONCLUSION AND FUTURE WORK

APPENDIX I: BAG OF WORDS REPRESENTATION

APPENDIX II: HASHING

REFERENCES




1. Introduction
The inherent properties of online learning algorithms suggest that they are an excellent
way of making machines learn. This type of learning uses the observations either one at a
time or in small batches, and discards them before the next set of observations is
considered. Online algorithms are a suitable candidate for real-time learning, where data
arrives as a stream and predictions must be made before the whole dataset has been
seen. They are also useful for large datasets because they do not require the whole
dataset to be loaded into memory at once.

On the flip side, this very sequentiality turns out to be a curse for performance. The
algorithm is inherently serial, and with the advent of multi-core processors it leads to
severe under-utilization of the resources offered by these high-end machines.

In Langford et al. [1], the authors gave a parallel version of the online learning algorithm
along with its performance data when run on a machine with eight cores and 32 GB of
memory. The algorithm was implemented in Java. The simulation results were promising,
and they obtained a speedup as the number of threads increased, as shown in Figure 1.
However, their attempt to obtain a speedup on the real experiments failed because the
serial implementation was already very fast, handling over 150,000 examples/second.
Since the mathematical calculations involved in this algorithm can be accelerated by
SIMD operations, and Java has no programming support for SIMD, we have implemented
and evaluated the algorithm on the Cell processor, exploiting the SIMD capabilities of its
specialized co-processors with the aim of obtaining a speedup on the real experimental
setup. An implementation was also done for a machine with an Intel dual-core processor
and 1.86 GB of RAM.




                                   Figure 1 From Langford et al. [1]
The Cell processor is the first implementation of the Cell Broadband Engine Architecture
(CBEA), comprising a primary processor based on the 64-bit IBM PowerPC architecture
and eight specialized co-processors with SIMD support. Communication amongst these
processors, their dedicated local stores and main memory takes place over a very high-speed
interconnect with a theoretical peak transfer rate of 96 B/cycle. Data communication plays a
crucial role in the implementation of this algorithm on the Cell, the primary reason being the
large gap between the amount of data to be processed (approx. 76 MB) and the memory
available to each co-processor (256 KB). An efficient approach to bridging this gap is
discussed in the design and implementation section, which also describes how the data
was pre-processed for the Intel dual-core and Cell implementations. The background
section discusses the gradient descent and delayed stochastic gradient descent
algorithms, the possible templates for the latter's implementation, an overview of the Cell
processor, and the real experimental setup suggested by the designers of this algorithm.
The results section gives a comparative study of the algorithm on both machines, and we
conclude in the final section on conclusion and future work. That section also suggests a
CPU architecture for which this algorithm would be better suited and might be expected to
deliver a better speedup with reduced coding complexity.




2. Background
2.1 Machine Learning
Machine learning is a technique by which a machine modifies its own behaviour on the
basis of past experience and performance. The collection of data about past experience
and performance is called the training set. One method of making a machine learn is to
pass it the entire training set in one go. This method is known as batch learning. The
generic steps for batch learning are as follows:



               Step 1: Initialize the weights.
               Step 2: For each batch of training data
                       Step 2a: Process all the training data
                       Step 2b: Update the weights


A popularly known batch learning algorithm is gradient descent, in which after every step
the weight vector moves in the direction of greatest decrease of the error function.
Mathematically this rests on the observation that if a real-valued function F(x) is defined
and differentiable in a neighbourhood of a point a, then F(x) decreases fastest in the
direction of the negative gradient −∇F(a) of F at a. Therefore, if b = a − η∇F(a) for a small
η > 0, then F(a) ≥ F(b). To perform the actual steps, the algorithm goes as follows:


     Step 1: Initialize the weight vector w⁰ with some arbitrary values.
     Step 2: Update the weight vector as follows:
                  w^(τ+1) = w^(τ) − η ∇E(w^(τ))
     where ∇E is the gradient of the error function and η is the learning
     rate.
     Step 3: Repeat step 2 for all the batches of data.




This algorithm, however, does not prove to be very efficient (as discussed in Bishop and
Nabney, 2008). Two major weaknesses of gradient descent are:

    1. The algorithm can take many iterations to converge towards a local minimum, if the
       curvature in different directions is very different.
    2. Finding the optimal η per step can be time-consuming. Conversely, using a
       fixed η can yield poor results.


Some other, more robust and faster batch learning algorithms are conjugate gradients
and quasi-Newton methods. Gradient-based methods must be run multiple times to obtain
an optimal solution, which proves computationally very costly for large datasets. There
exists yet another method to make machines learn: passing records from the training set
one at a time (online learning). To overcome the aforementioned weaknesses of gradient-based
methods, an online gradient descent algorithm has proved useful in practice for
training neural networks on large datasets (Le Cun et al. 1989). Also called sequential or
stochastic gradient descent, it updates the weight vector of the function based on one
record at a time, taking the records either in consecutive order or randomly. The steps of
stochastic gradient descent are similar to those outlined above for batch gradient descent,
with the difference of considering one data point per iteration.
The algorithm given in (2.2) is a parallel version of stochastic gradient descent based on
the concept of delayed update.

2.2 Algorithm (Referenced from [Langford, Smola and Zinkevich, 2009])


   Input: feasible space W ⊆ Rⁿ, annealing schedule ηₜ and delay τ ∈ ℕ
   Initialization: set w₁ … w_τ = 0 and compute the corresponding gₜ = ∇fₜ(wₜ)
                  For t = τ + 1 to T + τ do
                         Obtain fₜ and incur loss fₜ(wₜ)
                         Compute gₜ = ∇fₜ(wₜ)
                         Update wₜ₊₁ = argmin_{w ∈ W} ‖w − (wₜ − ηₜ gₜ₋τ)‖
                  End for
   where each fᵢ : χ → ℝ is a convex function and χ is a Banach space.


The goal here is to find some parameter vector w such that the sum over the functions fᵢ
takes the smallest possible value. If τ = 0, the algorithm becomes standard stochastic
gradient descent. Here, instead of updating the parameter vector wₜ with the current
gradient gₜ, it is updated with a delayed gradient gₜ₋τ.
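To make the delay concrete, below is a minimal single-threaded sketch in C of this update
rule. It assumes an unconstrained feasible space (the projection onto W is omitted) and a
hypothetical gradient callback; with TAU set to 0 it reduces to standard stochastic gradient
descent.

   #define DIM 100   /* dimensionality of w (illustrative) */
   #define TAU 3     /* delay between gradient computation and update */

   /* Hypothetical callback: fills grad with the gradient of f_t at w. */
   typedef void (*grad_fn)(int t, const float *w, float *grad);

   /* Delayed stochastic gradient descent: at step t the weights are
      updated with the gradient computed TAU steps earlier. */
   void delayed_sgd(float *w, int T, float eta, grad_fn gradient)
   {
       float g[TAU + 1][DIM] = {{0}};  /* ring buffer of recent gradients */

       for (int t = 0; t < T; t++) {
           gradient(t, w, g[t % (TAU + 1)]);         /* compute g_t */
           if (t >= TAU) {                           /* apply g_{t-TAU} */
               const float *g_old = g[(t - TAU) % (TAU + 1)];
               for (int j = 0; j < DIM; j++)
                   w[j] -= eta * g_old[j];
           }
       }
   }

The ring buffer has TAU + 1 slots, so the gradient from step t − τ is consumed just before
its slot is overwritten at the next step.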

2.3 Possible templates for implementation
There are three suggested implementation models for delayed stochastic gradient
descent. Following any of these three models leads to an effective implementation of the
algorithm. Each model makes some assumptions about the dataset being used, and a
model can be chosen by matching the constraints at hand against the assumptions
highlighted in that model.

   a) Asynchronous Optimization
       Assume a machine with n cores. We further assume that the time taken to compute
       the gradient of f t is at least n times higher than the time taken to update the weight
       vector. We run stochastic gradient descent on all n cores of the machine on
       different instances of f t while sharing a common instance of the weight vector.
       Each core is allowed to update the shared copy of the weight vector in round-robin
       fashion. This results in a delay of τ = n − 1 between when a core sees f t and when
       it gets to update the shared weight vector. This template is primarily suitable when
       the computation of f t takes a long time. The implementation requires explicit
       synchronization, since the update of the weight vector is an atomic operation;
       depending on the CPU architecture, a significant amount of bandwidth may be
       consumed exclusively by this synchronization.

    b) Pipelined Optimization
       In this form of optimization we parallelize the computation of f t instead of running
       the same instance on different cores. Here the delay occurs in the second stage of
       processing: while the second stage is still busy processing the results of the first,
       the first stage has already moved on to processing f t+1. Even in this case the
       weight vector is updated with a delay of τ.

    c) Randomization
       This form of optimization is used when instances within a range of τ of each other
       are highly correlated, so the data cannot be treated as i.i.d. The observations are
       de-correlated by randomly permuting the instances. The delay in this case occurs
       during the update of the model parameters, because the range of de-correlation
       needs to exceed τ considerably.


2.4 Cell Processor

   The Cell processor is the first implementation of the Cell Broadband Engine Architecture
(CBEA) (Figure 2), which emerged from a joint venture of IBM, Sony and Toshiba. It is a
fully compatible extension of the 64-bit PowerPC architecture. The design of the CBEA
was based on an analysis of workloads in a wide variety of areas such as cryptography,
graphics transform and lighting, physics, fast Fourier transforms (FFT), matrix operations,
and scientific workloads.

       The Cell processor is a multicore, heterogeneous chip carrying one 64-bit power
processor element (PPE), eight specialized single-instruction multiple-data (SIMD)
co-processors called synergistic processing elements (SPEs), and a high-bandwidth bus
interface (the Element Interconnect Bus), all integrated on-chip.

The PPE consists of a power processing unit (PPU) connected to 512 KB of L2 cache. It
is the main processor of the Cell and is responsible for running the OS as well as
distributing the workload amongst the SPEs. The PPU is a dual-issue, in-order processor
with dual-thread support; it can fetch four instructions at a time and issue two. To improve
the performance of in-order issue, the PPE utilizes delayed-execution pipelines and allows
limited out-of-order execution.

An SPE (Figure 4) consists of a synergistic processing unit (SPU) and a synergistic
memory flow controller (SMF). SPEs target the data-intensive workloads readily found in
cryptography, media and high-performance scientific applications, and each SPE runs an
independent application thread. The SPE design is optimized for computation-intensive
applications: it has SIMD support, as mentioned above, and 256 KB of local store. The
memory flow controller consists of a DMA controller along with a memory management
unit (MMU) and an atomic unit that facilitates synchronization with other SPEs and with
the PPE. Like the PPU, the SPU is a dual-issue, in-order processor.

                           Figure 2, Cell Broadband Engine Architecture

The SPU works on data that exists in its dedicated local store, and in turn depends on the
channel interface for accessing main memory and the local stores of other SPEs. The
channel interface runs independently of the SPU and resides in the MFC. In parallel, an
SPU can perform operations on sixteen 8-bit integers, eight 16-bit integers, four 32-bit
integers, or four single-precision floating-point numbers per cycle. At 3.2 GHz, each SPU
is capable of performing up to 51.2 billion 8-bit integer operations or 25.6 GFLOPs in
single precision.

The PPE and SPEs communicate through an internal high-speed element interconnect
bus (EIB) [2] (Figure 3). Apart from these processors, the EIB also connects the off-chip
memory and external IO.

The EIB is implemented as a circular ring consisting of four 16-byte-wide unidirectional
channels, two rotating clockwise and two anti-clockwise. Each channel can carry up to
three concurrent transactions. The EIB runs at half the system clock rate and thus has an
effective channel rate of 16 bytes every two system clocks. At maximum concurrency,
with three active transactions on each of the four rings, the peak instantaneous EIB
bandwidth is 96 B per clock (12 concurrent transactions × 16 bytes wide / 2 system clocks
per transfer). The theoretical peak of the EIB at 3.2 GHz is 204.8 GB/s.

                           Figure 3 Element Interconnect Bus, from [3]




                                 Figure 4 SPE, from [4]
The memory interface controller (MIC) in the Cell BE chip is connected to external
RAMBUS XDR memory through two XIO channels operating at a maximum effective
frequency of 3.2 GHz. The MIC has separate read and write request queues for each XIO
channel, operating independently. For each channel, the MIC arbiter alternates dispatch
between the read and write queues after at most eight dispatches from each queue, or
sooner if a queue becomes empty. High-priority reads are given precedence over normal
reads and writes. With both XIO channels operating at 3.2 GHz, the peak raw memory
bandwidth is 25.6 GB/s; however, normal memory operations such as refresh, scrubbing,
and so on typically reduce the bandwidth by about 1 GB/s.

2.5 Experimental Setup
The experiment uses asynchronous optimization (section 2.3); Figure 5 describes it
schematically. Each core computes its own error gradient and updates a weight vector
shared amongst all the cores, in round-robin fashion. The delay between the computation
of a gradient and the corresponding update of the weight vector is τ = n − 1. Explicit
synchronization is required for the atomic update of the weight vector.
The experiment is run on the complete dataset involving all the available cores.



              Figure 5 Asynchronous Optimization: several data streams feed per-core
              error gradient computations in parallel, all updating a shared weight vector




3. Design and Implementation
There were three stages in the implementation of the project:
   1. Pre-processing of the TREC dataset
   2. Implementation of the logistic regression algorithm
   3. Implementation of logistic regression in accordance with the methodology
      suggested by the delayed stochastic gradient technique


3.1 Pre-processing TREC Dataset
3.1.1 Intel Dual Core
The dataset contains 75,419 emails. These emails were tokenized on a list of symbols:
white space ( ); comma (,); backslash (\); period (.); semi-colon (;); colon (:); single (') and
double (") quotation marks; open and close parentheses ( ), braces { } and brackets [ ];
greater-than (>) and less-than (<) signs; hyphen (-); at sign (@); equals (=); newline (\n);
carriage return (\r); and tab (\t). Tokenization with the aforementioned symbol list
yielded 2,218,878 distinct tokens. A dictionary of tokens, containing each token name along
with a unique index for that token, was created and stored in a file (the dictionary).
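As an illustration, a minimal sketch in C of this tokenization pass, assuming strtok-based
splitting; add_to_dictionary is a hypothetical stand-in for the project's dictionary update:

   #include <string.h>

   /* Delimiter set used for tokenization (the symbol list above). */
   static const char *DELIMS = " ,\\.;:'\"(){}[]<>-@=\n\r\t";

   /* Hypothetical dictionary insert: assigns the token a unique index
      if it has not been seen before. */
   void add_to_dictionary(const char *token);

   /* Tokenize one email body in place and feed each token to the
      dictionary. */
   void tokenize_email(char *body)
   {
       char *tok;
       for (tok = strtok(body, DELIMS); tok != NULL;
            tok = strtok(NULL, DELIMS))
           add_to_dictionary(tok);
   }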


                              Figure 6 Pre-processing the TREC dataset: the raw dataset is
                              converted to mail vectors using the complete dictionary
                              (producing file set F1) and the condensed dictionary
                              (producing file set F2), with one file saved to disk per mail vector

3.1.2 Cell Processor
Due to memory limitations on the Cell processor, a condensed form of the dictionary was
used, containing the first hundred features of the complete dictionary. On one hand the
reduced size hurt the accuracy of the algorithm; on the other it made the data much more
suitable for implementation on the Cell. With the condensed form we transferred 32 mail
vectors per MFC operation (the vector representation of mails is discussed in the next
subsection), as opposed to MFC operations in the order of tens to transfer a single mail
vector when the complete dictionary is used.

3.1.3 Representation of Emails and Labels
The emails were represented as linear vectors using a simple bag-of-words
representation (Appendix I). Each entry of an email's vector was held in a struct data type
with an unsigned int for the index value and a short for the count at that index. Since the
dimensionality of the complete dataset is very high, hashing (Appendix II) was used with
2^18 bins. While constructing the dictionary it initially took approximately 3 hours to
process ~6,000 emails; this was drastically reduced by the use of hashing, and it finally
took approximately half an hour to process all the emails in the dataset. Once the
dictionary was in place, along with a working framework for hashing, a second pass over
the entire dataset was carried out. In this pass each email was converted to its bag-of-words
representation and stored in a separate file, formatted in the following pattern:




                                Figure 7 Email files after pre-processing


The labels were provided separately in an array of short type. A label '1' signified that the
email is 'ham' and a label '-1' that it is 'spam'.
Since each mail was stored in vector form in its own file, parsing an email and loading it
into memory for logistic regression took on average only 0.03 ms (on the 2 GHz Intel dual
core).
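A sketch of this sparse representation in C; the field and type names are illustrative rather
than taken from the project source, and the bin count assumes the 2^18 hash bins
mentioned above:

   #define NUM_BINS (1 << 18)   /* hash bins bounding the dimensionality */

   /* One non-zero entry of a mail vector. */
   struct feature {
       unsigned int index;   /* hashed token index, 0 .. NUM_BINS-1 */
       short        count;   /* occurrences of that token in the email */
   };

   /* An email as a sparse vector; labels live in a separate short
      array: 1 = ham, -1 = spam. */
   struct mail_vector {
       int             n;         /* number of non-zero features */
       struct feature *features;
   };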


3.2 Implementation of logistic regression
For a two-class problem (C1 and C2), the posterior probability of class C1 given the input
data x and a set of fixed basis functions φ = φ(x) is defined by the softmax transformation

                 p(C1 | φ) = y(φ) = exp(a1) / (exp(a1) + exp(a2))                     3.1

where the activations a_k are given by

                 a_k = w_kᵀ φ                                                         3.2

with p(C2 | φ) = 1 − p(C1 | φ), w being the weight vector.

The likelihood function for input data x and target data T (coded in the 1-of-K coding
scheme) is then

                 p(T | w1, w2) = ∏_{n=1..N} p(C1 | φ_n)^{t_n1} · p(C2 | φ_n)^{t_n2}
                               = ∏_{n=1..N} y_{n1}^{t_n1} · y_{n2}^{t_n2}             3.3

where y_nk = y_k(φ(x_n)), and T is the N × 2 matrix of target variables with elements t_nk.
The error function is obtained by taking the negative logarithm of the likelihood, and its
gradient can be written as

                 ∇_{w_j} E(w1, w2) = Σ_{n=1..N} (y_nj − t_nj) φ_n                     3.4

The weight vector w_k for a given class C_k is updated as follows:

                 w_k^{τ+1} = w_k^{τ} − η ∇_{w_k} E(w1, w2)                            3.5

where η is the learning rate.

In this project we define the first class as an email being 'ham' and the second as it being
'spam'. The feature map φ is the identity function, φ(x) = x. The weight vectors are
initialized to zero.
For the purpose of comparison, two versions of logistic regression were implemented: a
sequential version and a parallel one. As claimed by the authors of the delayed stochastic
gradient technique, the parallel version gave better performance than the sequential
version without affecting the correctness of the result. The performance comparison is
given in Section 4.
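A dense, single-example sketch of this update in C, following equations (3.1)–(3.5) with
φ(x) = x. Numerical safeguards are omitted, and the sparse loop that avoids touching all
dimensions is sketched in Section 3.3:

   #include <math.h>

   #define DIM (1 << 18)   /* feature dimensionality after hashing */

   /* One stochastic update of the two-class softmax model. */
   void logreg_update(float *w1, float *w2, const float *x,
                      int is_ham /* 1 = ham, 0 = spam */, float eta)
   {
       float a1 = 0.0f, a2 = 0.0f;

       /* activations a_k = w_k . x           (3.2) */
       for (int j = 0; j < DIM; j++) {
           a1 += w1[j] * x[j];
           a2 += w2[j] * x[j];
       }

       /* softmax posterior                   (3.1) */
       float y1 = expf(a1) / (expf(a1) + expf(a2));
       float y2 = 1.0f - y1;

       /* targets in 1-of-2 coding */
       float t1 = is_ham ? 1.0f : 0.0f;
       float t2 = 1.0f - t1;

       /* gradient step                       (3.4), (3.5) */
       for (int j = 0; j < DIM; j++) {
           w1[j] -= eta * (y1 - t1) * x[j];
           w2[j] -= eta * (y2 - t2) * x[j];
       }
   }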


3.3 Implementation of Logistic Regression with delayed update

To incorporate the concept of delayed update, equation (3.5) above was changed
according to the algorithm described in Section 2.2. This required computing the error
gradient separately on divided subsets of the input. The division of input was carried out
differently for the Intel dual core and the Cell processor: for the former the division was
direct, with little programming complexity, while for the latter it had to be carried out
explicitly and involved significant programming complexity. The division of data is
explained in detail in the following discussion.
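Concretely, in the notation of equation (3.5), the delayed update takes the form (a
reconstruction following Section 2.2, with delay τ):

                 w_k^{t+1} = w_k^{t} − η g_{t−τ},   where g_{t−τ} is ∇_{w_k} E computed at step t − τ

so each core applies a gradient that is τ steps old.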

The representation chosen for the mails helps improve the time performance of the
algorithm. Since we store the indices of the non-zero entries, when updating a weight
vector with the contributions of a specific mail vector we do not need to iterate through the
complete dimension of the weight vector and error gradient: the contributions of a
particular mail vector only affect the indices present in it. Figure 8 below shows this
concept pictorially.




                 Figure 8 A mail vector stored as (count, index) pairs touches only the
                 matching entries of the error gradient and weight vector, each of dimension D
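A sketch of this sparse update in C, using the mail_vector struct sketched in Section 3.1.3;
coeff stands for the per-example factor η(y_k − t_k) from equations (3.4)–(3.5):

   /* Update only the weight entries whose indices appear in the mail
      vector, instead of iterating over all D dimensions. */
   void sparse_update(float *w, const struct mail_vector *m, float coeff)
   {
       for (int i = 0; i < m->n; i++)
           w[m->features[i].index] -= coeff * (float)m->features[i].count;
   }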

3.3.1 Implementation on a Dual Core Intel Pentium processor
For the implementation on the Intel dual-core machine (2 GHz with 1.86 GB of main
memory), the emails processed with the complete dictionary were used. The mail vectors
were created as and when they were required. The first core processed all the odd-numbered
emails and the second all the even-numbered ones. Each core computed the
error gradient separately and updated its own private copy of the weight vectors; the
shared copy of the weight vectors was updated atomically by both cores.
This implementation used OpenMP constructs to parallelize the algorithm. OpenMP
helped in the division of emails: the thread number was combined with a counter to
determine the mail number, ensuring that no two threads would access the same data. A
sketch of this scheme is given after Figure 9 below.

                 Figure 9 Implementation on Intel Dual Core: core 1 processes the odd-numbered
                 mails of set F1 and core 2 the even-numbered ones (N odd in the diagram);
                 each computes an error gradient and atomically updates the shared weight vector
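A minimal sketch of this scheme with OpenMP; load_mail, accumulate_gradient and
apply_update are hypothetical stand-ins for the project's routines:

   #include <omp.h>

   struct mail_vector;                                      /* Section 3.1.3 */
   struct mail_vector *load_mail(int i);                    /* hypothetical */
   void accumulate_gradient(const struct mail_vector *m,
                            const float *w, float *g);      /* hypothetical */
   void apply_update(float *w, const float *g, float eta);  /* hypothetical */

   /* Two threads; thread t handles mails t, t + 2, t + 4, ... so no two
      threads ever touch the same mail. */
   void train_parallel(int num_mails, float *shared_w, float *g0,
                       float *g1, float eta)
   {
       #pragma omp parallel num_threads(2)
       {
           int t = omp_get_thread_num();
           float *g = (t == 0) ? g0 : g1;   /* private error gradient */

           for (int i = t; i < num_mails; i += 2) {
               accumulate_gradient(load_mail(i), shared_w, g);

               /* atomic update of the shared weight vector */
               #pragma omp critical
               apply_update(shared_w, g, eta);
           }
       }
   }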

3.3.2 Implementation on Cell Broadband Engine

The implementation of the algorithm on the Cell processor used the processed mails
generated from the condensed dictionary. The data was divided sequentially into chunks,
one per SPE. The PPE was responsible for constructing the labels and the array of mail
vectors, and MFC operations made the data available to the SPEs. Each MFC operation
transferred data for 32 mails, a value chosen because of the limited capacity (256 KB) of
the SPEs' local store.

The implementation on the Cell could not benefit from SIMD under the model shown in
Figure 10. For a full-scale SIMD implementation, the data must be converted into the
__vector form specialized for the SPEs; since we store the indices separately, this
conversion would require rearranging the data according to the indices. The
rearrangement requires a large number of load operations and would cancel the overall
benefit of the SIMD operations. The time complexity of converting the data to __vector
form would be O(N²), where N is the dimension of the mail vector.



                 Figure 10 Implementation on Cell: the PPE loads the mails of set F2 into
                 main memory and distributes chunks to the SPEs (SPE-1 to SPE-6);
                 each SPE computes an error gradient and updates the weight vector



For the parallel version of the algorithm, each SPE required a maximum of four weight
vectors in its local store: two owned privately by the SPE and two shared among all the
SPEs. Along with the weight vectors, each SPE also had to store two error gradients.
Each of these quantities is of type float. With the dictionary containing 2,218,878 features,
the memory requirement comes to the order of MBs. The following two data structures
were considered for storing these quantities:
    a) Storing the complete data as an array of the required dimension. This data
       structure is straightforward and easy to implement, but it potentially wastes
       memory. For the original dimension of 2,218,878 it would require approx. 50 MB of
       memory for each SPE instance, which is clearly not feasible as the local store of an
       SPE is only 256 KB.
    b) A struct with an index and a count value for each entry. Since most of the values in
       the weight vector and error gradient are not needed (refer to the discussion of
       Figure 8), this data structure significantly reduces the required memory,
       theoretically to the order of a few MBs (approx. 3). This is still not feasible given the
       limited size of the SPE local store.

With the data generated from the condensed dictionary and the latter data structure, the
requirement dropped to 2,400 bytes. The rest of the local store was used for storing the
mail vectors and the target labels.

To hide the latency of transferring data from main memory to an SPE's local store, the
technique of double buffering can be used: while the SPU performs computation on one
buffer of data, the MFC brings more data from main memory into the other. The wait for
data transfers is thereby reduced and the transfer latency is hidden, either partly or
completely. The processing algorithm with double buffering is as follows:

   1. The SPU queues a DMA GET to pull a portion of the problem data set from main
       memory into buffer #1.
   2. The SPU queues a DMA GET to pull a portion of the problem data set from main
       memory into buffer #2.
   3. The SPU waits for buffer #1 to finish filling.
   4. The SPU processes buffer #1.
   5. The SPU (a) queues a DMA PUT to transmit the contents of buffer #1 and then (b)
       queues a DMA GETB to execute after the PUT, refilling the buffer with the next
       portion of data from main memory.
   6. The SPU waits for buffer #2 to finish filling.
   7. The SPU processes buffer #2.
   8. The SPU (a) queues a DMA PUT to transmit the contents of buffer #2 and then (b)
       queues a DMA GETB to execute after the PUT, refilling the buffer with the next
       portion of data from main memory.
   9. Repeat from step 3 until all data has been processed.
   10. Wait for all buffers to finish.
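A sketch of this loop using the Cell SDK's MFC intrinsics from spu_mfcio.h, simplified to a
GET-only pipeline: the SPE here only consumes mail vectors, so the PUT/GETB steps of
the full algorithm are omitted, and CHUNK and process_buffer are illustrative:

   #include <spu_mfcio.h>

   #define CHUNK 4096   /* bytes per DMA; 32 mail vectors in the project */

   void process_buffer(char *buf);   /* hypothetical consumer */

   /* Double buffering: while the SPU processes one buffer, the MFC
      fills the other from main memory. Tags 0 and 1 track the two
      in-flight transfers. */
   void consume_all(unsigned long long ea, int num_chunks)
   {
       static char buf[2][CHUNK] __attribute__((aligned(128)));
       int cur = 0;

       mfc_get(buf[0], ea, CHUNK, 0, 0, 0);      /* prime buffer #1 */

       for (int i = 0; i < num_chunks; i++) {
           int nxt = cur ^ 1;
           if (i + 1 < num_chunks)               /* prefetch next chunk */
               mfc_get(buf[nxt], ea + (unsigned long long)(i + 1) * CHUNK,
                       CHUNK, nxt, 0, 0);

           mfc_write_tag_mask(1 << cur);         /* wait for current fill */
           mfc_read_tag_status_all();

           process_buffer(buf[cur]);
           cur = nxt;
       }
   }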



4. Results
The experiments on the Intel dual-core machine were run using the mails processed with
the complete dictionary. The time taken on this machine is significantly higher than on the
Cell. The serial implementation of logistic regression on the Intel dual core takes 36.93 sec
and 36.45 sec over two consecutive runs.

The time taken by the parallel implementation using the delayed stochastic gradient
method is given in Table 1.

                         Number of       Time in seconds      Time in seconds
                         Threads         (run 1)              (run 2)
                         1               113.09               47.09
                         2               20.85                20.92

                                            Table 1

For a single thread in the first run, the time taken is very large compared to every other
measurement, because most of the memory load operations result in cache misses. Since
all the runs were performed consecutively, the later times drop drastically due to the
reduced cache miss rate. It is also observed that with a single thread the algorithm
performs worse than the serial implementation. These times should theoretically be the
same; however, the delayed stochastic version spends extra time dividing the data, which,
with a single thread, ends up not being used anywhere.

Table 2 below shows the performance of the algorithm on multiple SPEs. The time taken
improves as the number of SPEs increases; these values are plotted in the accompanying
graph. Although the SPE runs show better times, accuracy suffers to a great extent
(results not shown here).

        Number of SPEs      Time in microseconds
        1                   47398
        2                   44419
        3                   42407
        4                   42384
        5                   42144
        6                   41966

                                            Table 2

(Performance graph: time in microseconds against the number of SPEs, plotting the
values in Table 2.)

The use of the condensed dictionary comes with a severe penalty in accuracy.

The issue of accuracy could be solved by using the complete dictionary; however, the
memory limitations of the Cell processor prevented its use.




5. Conclusion and Future Work
The delayed update approach shows better time performance as parallelization increases.
The improvement was shown for the Intel dual core as well as for the Cell processor. The
former machine, being SMP-capable, had less data-division overhead than the latter. The
Cell processor posed several limitations on the implementation of this algorithm, the
primary one being its memory limitation, which caused extra communication overhead. A
dataset with fewer features might be expected to achieve a better speedup on this
machine. For a dataset with large feature vectors, the algorithm might perform better on a
symmetric multiprocessing (SMP) machine. A follow-up study could run this algorithm on a
more powerful SMP-capable machine with a large amount of main memory, since the
amount of memory required to store the data doubles with each unit increase in the level
of parallelization.




Appendix I
Bag of Words Representation

A bag-of-words representation is a model that represents a sentence as a vector. It is
frequently used in natural language processing and information retrieval. The model treats
a sentence as an unordered collection of words, without any regard for grammar.

To form the vector for a sentence, firstly, all the distinct words in it are identified. Each
distinct word is given a unique identifier called an index. Each index serves as a dimension
in a D-dimensional vector space, where D is the total number of unique words. The
magnitude of the vector in a particular dimension is given by the count of the words having
that index. This process requires two passes through the entire dataset. In the first pass a
dictionary containing the unique words along with their unique indices is created. In the
second pass the vectors are formed by reference to the dictionary.

For example:

Consider the following sentence

What do you think you are doing?

                                  Word                Index
                                  what                  0
                                   do                   1
                                   you                  2
                                  think                 3
                                   are                  4
                                  doing                 5

The resulting vector for the above sentence would be as follows:

1(0) + 1(1) + 2(2) + 1(3) + 1(4) + 1(5)

The vector dimension is given in parentheses and the respective magnitude alongside.
The magnitude of dimension 2 is 2 because the word you appears twice in the sentence;
the others are 1 for the same reason.




Appendix II
Hashing

Hashing is the transformation of a string of characters into a usually shorter, fixed-length
value or key that represents the original string. Hashing is used to index and retrieve items
in a database because it is faster to find an item using the shorter hashed key than using
the original value. It is also used in many encryption algorithms.

The hashing function used in this project is the same as the one used by Oracle's JVM.
The code snippet performing the hashing is given below:

 /* SIZE is the number of hash bins; 2^18 bins were used in this
    project (see Section 3.1.3). */
 #define SIZE (1 << 18)

 unsigned int hashCode(char *word, int n) {
   unsigned int h = 0;
   int i;

   /* Horner form of sum(word[i] * 31^(n-i-1)), as in the JVM's
      String.hashCode(); this avoids the precision loss of a
      floating-point pow() for long words. */
   for (i = 0; i < n; i++)
     h = 31 * h + word[i];

   return h % SIZE;
 }
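A hypothetical use, mapping a token to its bin:

   unsigned int bin = hashCode("spam", 4);   /* index into the 2^18 bins */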




References
[1]   John Langford, Alexander J. Smola and Martin Zinkevich. Slow Learners are Fast.
      Journal of Machine Learning Research 1 (2009).
[2]   Michael Kistler, Michael Perrone and Fabrizio Petrini. Cell Multiprocessor
      Communication Network: Built for Speed.
[3]   Thomas Chen, Ram Raghavan, Jason Dale and Eiji Iwata. Cell Broadband Engine
      Architecture and its First Implementation.
[4]   Jonathan Bartlett. Programming High-Performance Applications on the Cell/B.E.
      Processor, Part 6: Smart Buffer Management with DMA Transfers.
[5]   Introduction to Statistical Machine Learning, 2010, Course Assignment 1.
[6]   Christopher Bishop. Pattern Recognition and Machine Learning.




DESIGN AND ANALYSIS OF A 32-BIT PIPELINED MIPS RISC PROCESSOR
 
DESIGN AND ANALYSIS OF A 32-BIT PIPELINED MIPS RISC PROCESSOR
DESIGN AND ANALYSIS OF A 32-BIT PIPELINED MIPS RISC PROCESSORDESIGN AND ANALYSIS OF A 32-BIT PIPELINED MIPS RISC PROCESSOR
DESIGN AND ANALYSIS OF A 32-BIT PIPELINED MIPS RISC PROCESSOR
 
Design and Analysis of A 32-bit Pipelined MIPS Risc Processor
Design and Analysis of A 32-bit Pipelined MIPS Risc ProcessorDesign and Analysis of A 32-bit Pipelined MIPS Risc Processor
Design and Analysis of A 32-bit Pipelined MIPS Risc Processor
 
Cell Charge Approximation for Accelerating Molecular Simulation on CUDA-Enabl...
Cell Charge Approximation for Accelerating Molecular Simulation on CUDA-Enabl...Cell Charge Approximation for Accelerating Molecular Simulation on CUDA-Enabl...
Cell Charge Approximation for Accelerating Molecular Simulation on CUDA-Enabl...
 
Heterogeneous computing with graphical processing unit: improvised back-prop...
Heterogeneous computing with graphical processing unit:  improvised back-prop...Heterogeneous computing with graphical processing unit:  improvised back-prop...
Heterogeneous computing with graphical processing unit: improvised back-prop...
 
PROBABILISTIC DIFFUSION IN RANDOM NETWORK G...
                                  PROBABILISTIC DIFFUSION IN RANDOM NETWORK G...                                  PROBABILISTIC DIFFUSION IN RANDOM NETWORK G...
PROBABILISTIC DIFFUSION IN RANDOM NETWORK G...
 
A PROGRESSIVE MESH METHOD FOR PHYSICAL SIMULATIONS USING LATTICE BOLTZMANN ME...
A PROGRESSIVE MESH METHOD FOR PHYSICAL SIMULATIONS USING LATTICE BOLTZMANN ME...A PROGRESSIVE MESH METHOD FOR PHYSICAL SIMULATIONS USING LATTICE BOLTZMANN ME...
A PROGRESSIVE MESH METHOD FOR PHYSICAL SIMULATIONS USING LATTICE BOLTZMANN ME...
 
imagefiltervhdl.pptx
imagefiltervhdl.pptximagefiltervhdl.pptx
imagefiltervhdl.pptx
 
Implementing True Zero Cycle Branching in Scalar and Superscalar Pipelined Pr...
Implementing True Zero Cycle Branching in Scalar and Superscalar Pipelined Pr...Implementing True Zero Cycle Branching in Scalar and Superscalar Pipelined Pr...
Implementing True Zero Cycle Branching in Scalar and Superscalar Pipelined Pr...
 

Último

Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integrationmarketing932765
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxfnnc6jmgwh
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observabilityitnewsafrica
 

Último (20)

Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
 

Paper on experimental setup for verifying - &quot;Slow Learners are Fast&quot;

at once.

On the flip side, this very property of sequentiality is a curse for performance: the algorithm is inherently sequential, and with the advent of multi-core processors it leads to severe under-utilization of the resources these machines offer. In Langford et al. [1], the authors gave a parallel version of the online learning algorithm along with performance data from a Java implementation run on a machine with eight cores and 32 GB of memory. The simulation results were promising, showing speedup as the number of threads increased (Figure 1). However, their effort to parallelize the exact experiments failed because the serial implementation was already fast, capable of handling over 150,000 examples/second. Given that the mathematical calculations involved in this algorithm can be accelerated by SIMD operations, and that Java has no programming support for SIMD, we have implemented and evaluated the algorithm on the Cell processor to exploit the SIMD capabilities of its specialized co-processors, with the aim of obtaining speedup for the real experimental setup. An implementation was also produced for a machine with an Intel dual core processor and 1.86 GB of RAM.

Figure 1: Speedup versus number of threads, from Langford et al. [1]

The Cell processor is the first implementation of the Cell Broadband Engine Architecture (CBEA), with a primary processor of 64-bit IBM PowerPC architecture and eight specialized SIMD-capable co-processors. Communication among these processors, their dedicated local stores and main memory takes place over a very high speed channel with a theoretical peak transfer rate of 96 B/cycle. Data communication plays a crucial role in implementing this algorithm on Cell, primarily because of the large gap between the amount of data to be processed (approx. 76 MB) and the memory available to each Cell co-processor (256 KB). An efficient approach to bridging this gap is discussed in the design and implementation section, which also describes how the data was pre-processed
for implementation on the Intel dual core and Cell processors. The background section discusses the gradient descent and delayed stochastic gradient descent algorithms, the possible templates for implementing the latter, an overview of the Cell processor, and the real experimental setup suggested by the designers of the algorithm. The results section presents a comparative study of the algorithm on both machines, and we conclude in the final section on conclusion and future work, which also suggests a CPU architecture for which this algorithm is better suited and where we might expect better speedup and reduced coding complexity.
2. Background

2.1 Machine Learning

Machine learning is a technique by which a machine modifies its own behaviour on the basis of past experience and performance. The collection of such past data is called the training set. One method of making a machine learn is to pass the entire training set in one go; this is known as batch learning. The generic steps for batch learning are as follows:

Step 1: Initialize the weights.
Step 2: For each batch of training data:
    Step 2a: Process all the training data.
    Step 2b: Update the weights.

A popular batch learning algorithm is gradient descent, in which after every step the weight vector moves in the direction of greatest decrease of the error function. Mathematically this rests on the observation that if a real-valued function $F(x)$ is defined and differentiable in a neighbourhood of a point $a$, then $F(x)$ decreases fastest in the direction of the negative gradient of $F$ at $a$, i.e. $-\nabla F(a)$. Therefore, if $b = a - \eta \nabla F(a)$ for some small $\eta > 0$, then $F(a) \geq F(b)$. The algorithm proceeds as follows:

Step 1: Initialize the weight vector $w^{(0)}$ with some arbitrary values.
Step 2: Update the weight vector as
$$w^{(\tau+1)} = w^{(\tau)} - \eta \nabla E\left(w^{(\tau)}\right)$$
where $\nabla E$ is the gradient of the error function and $\eta$ is the learning rate.
Step 3: Repeat step 2 for all the batches of data.

(A one-dimensional worked example of this update follows the list of weaknesses below.)

This algorithm, however, is not a very efficient one (discussed in Bishop and Nabney, 2008). Two major weaknesses of gradient descent are:

1. The algorithm can take many iterations to converge towards a local minimum if the curvature in different directions is very different.
2. Finding the optimal $\eta$ per step can be time-consuming; conversely, using a fixed $\eta$ can yield poor results.
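To see the update rule in action, here is a one-dimensional worked example (ours, not from the original experiments). Take

$$F(x) = x^2, \qquad \nabla F(a) = 2a.$$

Starting at $a = 1$ with $\eta = 0.1$,

$$b = a - \eta \nabla F(a) = 1 - 0.1 \times 2 = 0.8, \qquad F(b) = 0.64 \leq F(a) = 1,$$

consistent with the guarantee $F(a) \geq F(b)$ above.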
Other, more robust and faster batch learning algorithms include conjugate gradients and quasi-Newton methods. Gradient-based methods must be run many times to obtain an optimal solution, which proves computationally very costly for large datasets.

There exists yet another way of making machines learn: passing records from the training set one at a time (online learning). To overcome the aforementioned weaknesses of gradient-based methods, an online gradient descent algorithm has proved useful in practice for training neural networks on large data sets (Le Cun et al. 1989). Also called sequential or stochastic gradient descent, it updates the weight vector based on one record at a time, taking records either in consecutive order or randomly. Its steps are similar to those outlined above for batch gradient descent, with the difference that one data point is considered per iteration. The algorithm given in Section 2.2 is a parallel version of stochastic gradient descent based on the concept of delayed update.

2.2 Algorithm (referenced from [Langford, Smola and Zinkevich, 2009])

Input: feasible space $W \subseteq \mathbb{R}^n$, annealing schedule $\eta_t$ and delay $\tau \in \mathbb{N}$
Initialization: set $w_1, \ldots, w_\tau = 0$ and compute the corresponding $g_t = \nabla f_t(w_t)$
For $t = \tau + 1$ to $T + \tau$ do
    Obtain $f_t$ and incur loss $f_t(w_t)$
    Compute $g_t = \nabla f_t(w_t)$
    Update $w_{t+1} = \arg\min_{w \in W} \lVert w - (w_t - \eta_t \, g_{t-\tau}) \rVert$
End for

where $f_i : \mathcal{X} \to \mathbb{R}$ is a convex function and $\mathcal{X}$ is a Banach space. The goal is to find a parameter vector $w$ such that the sum over the functions $f_i$ takes the smallest possible value. With $\tau = 0$ the algorithm becomes standard stochastic gradient descent; here, instead of updating the parameter vector $w_t$ by the current gradient $g_t$, it is updated by the delayed gradient $g_{t-\tau}$.
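To make the delayed update concrete, the following minimal C sketch runs the loop above for a dense, unconstrained weight vector ($W = \mathbb{R}^n$, so the projection in the update is a no-op). The gradient routine, the simple $\eta_0/t$ annealing schedule, the ring buffer of the last $\tau$ gradients, and the zero initialization of those gradients are our assumptions for illustration, not code from the project.

```c
#include <stdlib.h>
#include <string.h>

#define DIM 100  /* dimensionality of w; illustrative only */

/* User-supplied: writes the gradient of f_t at w into g_out. */
extern void grad_ft(int t, const float *w, float *g_out);

/* Delayed stochastic gradient descent: w_{t+1} = w_t - eta_t * g_{t-tau}.
 * The last tau gradients live in a ring buffer so that the update at
 * time t uses the gradient computed tau steps earlier. Requires tau >= 1. */
void delayed_sgd(float *w, int T, int tau, float eta0)
{
    float (*ring)[DIM] = malloc(sizeof(float[tau][DIM]));
    memset(ring, 0, sizeof(float[tau][DIM]));  /* delayed gradients start at 0 */

    for (int t = tau + 1; t <= T + tau; t++) {
        float g_now[DIM];
        grad_ft(t, w, g_now);                  /* compute g_t */

        const float *g_old = ring[t % tau];    /* this slot still holds g_{t-tau} */
        float eta = eta0 / (float)t;           /* a simple annealing schedule */
        for (int i = 0; i < DIM; i++)
            w[i] -= eta * g_old[i];            /* delayed update */

        memcpy(ring[t % tau], g_now, sizeof g_now);  /* store g_t for step t+tau */
    }
    free(ring);
}
```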
2.3 Possible Templates for Implementation

There are three suggested implementation models for delayed stochastic gradient descent. Following any of the three leads to an effective implementation of the algorithm. Each model makes some assumptions about the dataset being used, so a model can be chosen by matching the constraints at hand against the assumptions highlighted in a specific model.

a) Asynchronous Optimization

Assume a machine with n cores, and further assume that computing the gradient of $f_t$ takes at least n times longer than updating the weight vector. We run stochastic gradient descent on all n cores of the machine on different instances of $f_t$ while sharing a common instance of the weight vector. Each core is allowed to update the shared copy of the weight vector in a round-robin fashion. This results in a delay of $\tau = n - 1$ between when a core sees $f_t$ and when it gets to update the shared copy of the weight vector. This template is primarily suitable when the computation of $f_t$ takes a long time. The implementation requires explicit synchronization for the update of the weight vector, as it is an atomic operation; depending on the CPU architecture, a significant amount of bandwidth may be consumed exclusively by this synchronization.

b) Pipelined Optimization

In this form of optimization we parallelize the computation of $f_t$ instead of running the same instance on different cores. Here the delay occurs in the second stage of processing: while the second stage is still busy processing the result of the first, the first has already moved on to processing $f_{t+1}$. Even in this case the weight vector is computed with a delay of $\tau$.

c) Randomization

This form of optimization is used when successive instances of $f_t$ are highly correlated, so the data cannot be treated as i.i.d. The observations are de-correlated by randomly permuting the instances. The delay in this case occurs during the update of the model parameters, because the range of de-correlation needs to exceed $\tau$ considerably.

2.4 Cell Processor

The Cell processor is the first implementation of the Cell Broadband Engine Architecture (CBEA) (Figure 2), which emerged from a joint venture of IBM, Sony and Toshiba. It is a fully compatible extension of the 64-bit PowerPC architecture. The design of the CBEA was based on the analysis of workloads in a wide variety of areas such as cryptography, graphics transform and lighting, physics, fast Fourier transforms (FFT), matrix operations, and scientific workloads. The Cell processor is a multicore, heterogeneous chip carrying one 64-bit power processor element (PPE), eight specialized single-instruction multiple-data (SIMD) co-processors called synergistic processing elements (SPEs), and a high-bandwidth bus interface (the Element Interconnect Bus), all integrated on-chip.

The PPE consists of a power processing unit (PPU) connected to 512 KB of L2 cache. It is the main processor of the Cell and is responsible for running the OS as well as distributing the workload among the SPEs. The PPU is a dual-issue, in-order processor with dual-thread support; it can fetch four instructions at a time and issue two. To improve the performance of in-order issue, the PPE uses delayed-execution pipelines and allows limited out-of-order execution.

An SPE (Figure 4) consists of a synergistic processing unit (SPU) and a synergistic memory flow controller (SMF). It is intended for the data-intensive workloads found in cryptography, media and high-performance scientific applications. Each SPE runs an independent application thread, and its design is optimized for computation-intensive work: it has SIMD support, as mentioned above, and 256 KB of local store. The memory flow controller consists of a DMA controller, a memory management unit (MMU) and an atomic unit that facilitates synchronization with the other SPEs and with the PPE.
Figure 2: Cell Broadband Engine Architecture

The SPU is also a dual-issue, in-order processor like the PPU. It works on data in its dedicated local store, which in turn depends on the channel interface for access to main memory and to the local stores of other SPEs. The channel interface runs independently of the SPU and resides in the MFC. In parallel, an SPU can operate on sixteen 8-bit integers, eight 16-bit integers, four 32-bit integers, or four single-precision floating-point numbers per cycle. At 3.2 GHz, each SPU is capable of up to 51.2 billion 8-bit integer operations per second, or 25.6 GFLOPS in single precision.

The PPE and SPEs communicate through an internal high-speed element interconnect bus (EIB) [2] (Figure 3). Besides these processors, the EIB also connects off-chip memory and external IO. The EIB is implemented as a circular ring of four 16-byte-wide unidirectional channels, two rotating clockwise and two anti-clockwise, and each channel can carry three concurrent transactions. The EIB runs at half the system clock rate and thus has an effective channel rate of 16 bytes every two system clocks. At maximum concurrency, with three active transactions on each of the four rings, the peak instantaneous EIB bandwidth is 96 B per clock (12 concurrent transactions × 16 bytes wide / 2 system clocks per transfer).
Figure 3: Element Interconnect Bus, from [3]

The theoretical peak of the EIB at 3.2 GHz is 204.8 GB/s.

Figure 4: SPE, from [4]

The memory interface controller (MIC) in the Cell BE chip is connected to external RAMBUS XDR memory through two XIO channels operating at a maximum effective frequency of 3.2 GHz. The MIC has separate read and write request queues for each XIO channel, operating independently. For each channel, the MIC arbiter alternates dispatch between the read and write queues after a minimum of every eight dispatches from each queue, or when the queue becomes empty, whichever comes first. High-priority read requests are given priority over normal reads and writes. With both XIO channels operating at 3.2 GHz, the peak raw memory bandwidth is 25.6 GB/s; however, normal memory operations such as refresh, scrubbing, and so on typically reduce this by about 1 GB/s.

2.5 Experimental Setup

The experiment uses the asynchronous optimization of Section 2.3, described schematically in Figure 5. Each core computes its own error gradient and updates a copy of the weight vector shared among all the cores. The update is carried out in round-robin fashion, so there is a delay of $\tau = n - 1$ between the computation of a gradient and the corresponding update of the weight vector. Explicit synchronization is required for the atomic update of the weight vector.
The experiment is run on the complete dataset using all the available cores.

Figure 5: Asynchronous optimization — each core computes an error gradient from its share of the data and updates the shared weight vector in parallel
3. Design and Implementation

There were three stages in the implementation of the project:

1. Pre-processing of the TREC dataset
2. Implementation of the logistic regression algorithm
3. Implementation of logistic regression following the methodology of the delayed stochastic gradient technique

3.1 Pre-processing the TREC Dataset

3.1.1 Intel Dual Core

The dataset contains 75,419 emails. These emails were tokenized on a list of symbols: white space ( ), comma (,), backslash (\), period (.), semi-colon (;), colon (:), single (') and double (") quotes, open and close parentheses (( )), braces ({ }) and brackets ([ ]), greater-than (>) and less-than (<) signs, hyphen (-), at sign (@), equals (=), newline (\n), carriage return (\r), and tab (\t). Tokenization with this symbol list yielded 2,218,878 different tokens. A dictionary of tokens, containing each token name along with a unique index, was created and stored in a file. (A sketch of this tokenization using the C standard library appears below, after Section 3.1.2.)

Figure 6: Pre-processing the TREC dataset — the raw dataset is converted to mail vectors against both the complete and the condensed dictionary, producing file sets F1 and F2 saved to disk

3.1.2 Cell Processor

Due to the memory limitations of the Cell processor, a condensed form of the dictionary was used, containing the first hundred features of the complete dictionary. The reduced size hurt the accuracy of the algorithm, but made it far more suitable for implementation on Cell: with the condensed form we could transfer 32 mail vectors per MFC operation (the vector representation of mails is discussed in Section 3.1.3), whereas with the complete dictionary transferring a single mail vector takes on the order of tens of MFC operations.
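As promised above, here is a minimal sketch of the tokenization step using strtok from the C standard library. The sample mail text is invented, and the project's actual pre-processing code may have differed.

```c
#include <stdio.h>
#include <string.h>

/* Delimiter list from Section 3.1.1: whitespace, punctuation, brackets,
 * quotes, @, =, and the characters \n, \r, \t. */
static const char DELIMS[] = " ,\\.;:'\"(){}[]<>-@=\n\r\t";

int main(void)
{
    char mail[] = "Subject: hello\nHow are you?";   /* invented sample text */
    /* strtok splits the buffer in place at any delimiter character. */
    for (char *tok = strtok(mail, DELIMS); tok != NULL; tok = strtok(NULL, DELIMS))
        printf("token: %s\n", tok);
    return 0;
}
```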
3.1.3 Representation of Emails and Labels

The emails were represented as linear vectors using a simple bag-of-words representation (Appendix I). Each entry of a mail vector was a struct holding an unsigned int for the index value and a short for the weight of that index. Since the dimensionality of the complete dataset is very high, hashing (Appendix II) was used, with $2^{18}$ bins. Without hashing, constructing the dictionary took approximately 3 hours to process ~6,000 emails; with hashing, processing all the emails in the dataset took approximately half an hour. Once the dictionary and the hashing framework were in place, a second pass over the entire dataset converted each email to its bag-of-words representation and stored it in a separate file, in the format shown in Figure 7.

Figure 7: Format of the email files after pre-processing

The labels were provided separately in an array of short type: a label 1 signifies that the email is ham, and a label -1 signifies that it is spam. Since each mail was stored in vector form in a file, parsing an email and loading it into memory for logistic regression took only 0.03 ms on average (on the 2 GHz Intel dual core).

3.2 Implementation of Logistic Regression

For a two-class problem ($C_1$ and $C_2$), the posterior probability of class $C_1$ given the input data $x$ and a set of fixed basis functions $\phi = \phi(x)$ is defined by the softmax transformation

$$p(C_1 \mid \phi) = y(\phi) = \frac{\exp(a_1)}{\exp(a_1) + \exp(a_2)} \tag{3.1}$$

where the activations $a_k$ are given by

$$a_k = w_k^{T} \phi \tag{3.2}$$

with $p(C_2 \mid \phi) = 1 - p(C_1 \mid \phi)$ and $w$ being the weight vector. The likelihood function for input data $x$ and target data $T$ (coded in the 1-of-K coding scheme) is then

$$p(T \mid w_1, w_2) = \prod_{n=1}^{N} p(C_1 \mid \phi_n)^{t_{n1}} \, p(C_2 \mid \phi_n)^{t_{n2}} = \prod_{n=1}^{N} y_{n1}^{t_{n1}} \, y_{n2}^{t_{n2}} \tag{3.3}$$

where $y_{nk} = y_k(\phi(x_n))$ and $T$ is the $N \times 2$ matrix of target variables with elements $t_{nk}$. The error function is obtained by taking the negative logarithm of the likelihood, and its gradient can be written as
$$\nabla_{w_j} E(w_1, w_2) = \sum_{n=1}^{N} (y_{nj} - t_{nj}) \, \phi_n \tag{3.4}$$

The weight vector $w_k$ for a given class $C_k$ is updated as

$$w_k^{\tau+1} = w_k^{\tau} - \eta \nabla_{w_k} E(w_1, w_2) \tag{3.5}$$

where $\eta$ is the learning rate. In this project the first class corresponds to an email being ham and the second class to it being spam. The feature map $\phi$ is the identity function, $\phi(x) = x$, and the weight vectors are initialized to zero.

For the purpose of comparison, two implementations of logistic regression were provided: a sequential version and a parallel version. As claimed by the authors of the delayed stochastic gradient technique, the parallel version gave better performance than the sequential version without affecting the correctness of the result. The performance comparison is given in Section 4.

3.3 Implementation of Logistic Regression with Delayed Update

To incorporate the concept of delayed update, equation (3.5) above was changed according to the algorithm described in Section 2.2. This required computing the error gradient separately on divided sets of the input. The division of the input was carried out differently for the Intel dual core and the Cell processor: for the former it was fairly direct, with little programming complexity, while for the latter it had to be carried out explicitly and involved significant programming complexity. The division of data is explained in detail in the following discussion.

The representation chosen for the mails helps improve the time performance of the algorithm. Since we store the indices of the vectors, updating a weight vector with the contributions of a specific mail vector does not require iterating through the complete dimension of the weight vector and error gradient: a particular mail vector only affects the indices present in it. Figure 8 shows this concept pictorially.

Figure 8: Sparse update — a mail vector's index/count entries select which of the D dimensions of the error gradient and weight vector are touched
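As a concrete illustration, here is a minimal C sketch of one such sparse update for a single mail, combining equations (3.1)-(3.5). The struct layout follows Section 3.1.3; the function names and the folding of the per-mail gradient directly into the weight update are our own choices for the sketch, not the project's code.

```c
#include <math.h>
#include <stddef.h>

/* One entry of a sparse mail vector: feature index and its count
 * (the struct layout of Section 3.1.3). */
typedef struct {
    unsigned int index;
    short count;
} feature_t;

/* Activation a_k = w_k^T phi (eq. 3.2): since phi is the identity map,
 * only the indices present in the mail vector contribute. */
static float activation(const float *wk, const feature_t *mail, size_t n)
{
    float a = 0.0f;
    for (size_t i = 0; i < n; i++)
        a += wk[mail[i].index] * mail[i].count;
    return a;
}

/* One stochastic step for a single mail: compute the softmax posterior
 * (eq. 3.1), then apply the gradient update (eqs. 3.4 and 3.5) touching
 * only the dimensions the mail actually uses. t1, t2 are the 1-of-K
 * targets and eta the learning rate. */
void update_for_mail(float *w1, float *w2, const feature_t *mail,
                     size_t n, float t1, float t2, float eta)
{
    float a1 = activation(w1, mail, n);
    float a2 = activation(w2, mail, n);
    float y1 = expf(a1) / (expf(a1) + expf(a2));
    float y2 = 1.0f - y1;

    for (size_t i = 0; i < n; i++) {
        unsigned int d = mail[i].index;
        float phi = (float)mail[i].count;
        w1[d] -= eta * (y1 - t1) * phi;  /* component of (y_n1 - t_n1) phi_n */
        w2[d] -= eta * (y2 - t2) * phi;
    }
}
```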
3.3.1 Implementation on a Dual Core Intel Pentium Processor

The implementation on the Intel dual core machine (2 GHz, 1.86 GB of main memory) used the emails processed with the complete dictionary. Mail vectors were created as and when they were required. The first core processed all the odd-numbered emails and the second all the even-numbered ones. Each core computed the error gradient separately and maintained a private copy of the weight vectors; the shared copy of the weight vectors was updated atomically by both cores. The implementation used OpenMP constructs to parallelize the algorithm, which made the division of emails straightforward: the thread number was combined with a counter to determine the mail number, ensuring that no two threads would access the same data.

Figure 9: Implementation on the Intel dual core — core 1 takes the odd mails and core 2 the even mails from set F1; each computes an error gradient and atomically updates the shared weight vector
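A minimal OpenMP sketch of this division is given below, reusing update_for_mail and feature_t from the sketch in Section 3.3. The loader, the label array and the use of a critical section for the atomic update are assumptions for illustration; the project's bookkeeping of private gradient copies is omitted.

```c
#include <omp.h>
#include <stddef.h>

typedef struct { unsigned int index; short count; } feature_t;

extern int    num_mails;
extern short  labels[];                               /* +1 = ham, -1 = spam */
extern size_t load_mail(int m, feature_t **out);      /* hypothetical loader */
extern void   update_for_mail(float *w1, float *w2, const feature_t *mail,
                              size_t n, float t1, float t2, float eta);

void train_dual_core(float *w1, float *w2, float eta)
{
    #pragma omp parallel num_threads(2)
    {
        int tid = omp_get_thread_num();
        /* Thread 0 takes mails 0,2,4,... and thread 1 takes 1,3,5,...,
         * so no two threads ever read the same mail. */
        for (int m = tid; m < num_mails; m += 2) {
            feature_t *mail;
            size_t n = load_mail(m, &mail);
            float t1 = (labels[m] == 1) ? 1.0f : 0.0f;  /* 1-of-K target */

            /* Serialize updates of the shared weight vectors. */
            #pragma omp critical
            update_for_mail(w1, w2, mail, n, t1, 1.0f - t1, eta);
        }
    }
}
```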
3.3.2 Implementation on the Cell Broadband Engine

The implementation on the Cell processor used the mails processed with the condensed dictionary. The data was divided sequentially into chunks, one per SPE. The PPE was responsible for constructing the labels and the array of mail vectors, and the data was made available to the SPEs through MFC operations, each of which transferred the data for 32 mails. This value was chosen because of the limited capacity (256 KB) of the SPE local store.

The implementation on Cell could not benefit from SIMD in the model shown in Figure 10. A full-scale SIMD implementation requires converting the data into the __vector form specialized for the SPEs, and since we store the indices separately, this conversion would require rearranging the data according to those indices. The rearrangement would need a large number of load operations and would cancel the overall benefit of the SIMD operations; the time complexity of converting the data to __vector form would be O(N²), where N is the dimension of the mail vector.

Figure 10: Implementation on Cell — the PPE holds the mail set F2 in main memory and streams chunks to the SPEs (SPE-1 through SPE-6), each of which computes an error gradient and updates the weight vector
For the parallel version of the algorithm, each SPE needed to store at most four weight vectors in its local store: two owned privately by the SPE and two shared among all the SPEs. Along with the weight vectors, each SPE also needed to store two error gradients. All of these quantities are of type float. With the full dictionary of 2,218,878 features, the memory requirement is on the order of megabytes. Two data structures were considered for storing these quantities:

a) Storing the complete data as an array of the required dimension. This data structure is straightforward and easy to implement, but potentially wastes memory. For the original dimension of 2,218,878 it would require approx. 50 MB per SPE instance, which is clearly infeasible given that the SPE local store is only 256 KB.

b) Using a struct with an index and a count value for each entry. Since most of the values in the weight vector and error gradient are not needed (see the discussion around Figure 8), this data structure significantly reduces the required memory, theoretically to a few MB (approx. 3). This is still infeasible within the SPE local store.

With the data generated from the condensed dictionary and the latter data structure, the requirement dropped to 2,400 bytes. The rest of the local store was used for the mail vectors and the target labels.

To hide the latency of transferring data from main memory to the local store of an SPE, the technique of double buffering can be used: while the SPU performs computation on one buffer of data, the MFC brings more data from main memory into another, so the wait for data transfer is reduced and the transfer latency is hidden, either partly or completely. The processing loop with double buffering is as follows (a sketch using the SPU MFC intrinsics follows the list):

1. The SPU queues a DMA GET to pull a portion of the problem data set from main memory into buffer #1.
2. The SPU queues a DMA GET to pull a portion of the problem data set from main memory into buffer #2.
3. The SPU waits for buffer #1 to finish filling.
4. The SPU processes buffer #1.
5. The SPU (a) queues a DMA PUT to transmit the contents of buffer #1 and then (b) queues a DMA GETB to execute after the PUT, refilling the buffer with the next portion of data from main memory.
6. The SPU waits for buffer #2 to finish filling.
7. The SPU processes buffer #2.
8. The SPU (a) queues a DMA PUT to transmit the contents of buffer #2 and then (b) queues a DMA GETB to execute after the PUT, refilling the buffer with the next portion of data from main memory.
9. Repeat from step 3 until all data has been processed.
10. Wait for all buffers to finish.
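The sketch below shows the core of this loop on the SPU using the mfc_get and tag-status intrinsics from spu_mfcio.h. The chunk size, buffer layout and process() routine are assumptions; total is assumed to be a multiple of CHUNK and at least 2×CHUNK, and the PUT/GETB write-back of results (steps 5a and 8a) is omitted for brevity.

```c
#include <spu_mfcio.h>

#define CHUNK 4096   /* bytes per DMA transfer; illustrative choice */

static char buf[2][CHUNK] __attribute__((aligned(128)));

extern void process(char *data, int n);   /* placeholder for the gradient work */

/* Double-buffered streaming of 'total' bytes from effective address 'ea':
 * while the SPU processes one buffer, the MFC fills the other. */
void stream_data(unsigned long long ea, int total)
{
    unsigned int tag[2] = {0, 1};
    int cur = 0;

    /* Prime both buffers (steps 1-2 of the list above). */
    mfc_get(buf[0], ea, CHUNK, tag[0], 0, 0);
    mfc_get(buf[1], ea + CHUNK, CHUNK, tag[1], 0, 0);

    for (int off = 0; off < total; off += CHUNK) {
        /* Wait for the current buffer to finish filling (steps 3/6). */
        mfc_write_tag_mask(1 << tag[cur]);
        mfc_read_tag_status_all();

        process(buf[cur], CHUNK);              /* steps 4/7 */

        /* Refill this buffer with the chunk after the one already in
         * flight in the other buffer (the GET half of steps 5/8). */
        unsigned long long next = ea + off + 2ULL * CHUNK;
        if (next < ea + total)
            mfc_get(buf[cur], next, CHUNK, tag[cur], 0, 0);

        cur ^= 1;                              /* step 9: alternate buffers */
    }
}
```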
4. Results

The experiments on the Intel dual core machine were run using the mails processed with the complete dictionary; the time taken on this machine is significantly higher than on Cell. The serial implementation of logistic regression on the Intel dual core took 36.93 s and 36.45 s over two successive runs. The time taken by the parallel implementation using the delayed stochastic gradient method is given in Table 1.

Table 1: Parallel implementation on the Intel dual core

Number of threads | Time in seconds (run 1) | Time in seconds (run 2)
1                 | 113.09                  | 47.09
2                 | 20.85                   | 20.92

For a single thread in the first run, the time taken is very large compared to any other measurement, because most of the memory load operations result in cache misses; since all these runs were performed consecutively, the later times drop drastically as the cache miss rate falls. It is also observed that the algorithm performs worse with a single thread than the serial implementation does. These times should in theory be the same; however, the delayed stochastic version spends extra time dividing the data, which with one thread ends up not being used anywhere.

Table 2 shows the performance of the algorithm on multiple SPEs. The time improves as the number of SPEs increases; the plot of these values (time in microseconds versus number of SPEs) falls from about 47,000 µs with one SPE to about 42,000 µs with six. Although the SPE runs show better times, the accuracy suffers to a great extent (results not shown here).

Table 2: Performance on multiple SPEs

Number of SPEs | Time in microseconds
1              | 47,398
2              | 44,419
3              | 42,407
4              | 42,384
5              | 42,144
6              | 41,966

The use of the condensed dictionary comes with a severe penalty in accuracy.
The accuracy issue could be addressed by using the complete dictionary; however, the memory limitations of the Cell processor ruled that out.
5. Conclusion and Future Work

The delayed update approach shows better time performance as parallelization increases, and the improvement was demonstrated on both the Intel dual core and the Cell processor. The former machine, being SMP-capable, had less data-division overhead than the latter. The Cell processor imposed several limitations on the implementation of this algorithm, the primary one being memory: the limited local store caused extra communication overhead. A dataset with fewer features could be expected to achieve better speedup on this machine, while for a dataset with large feature vectors the algorithm is likely to perform better on a symmetric multiprocessing (SMP) machine. A follow-up study could run this algorithm on a more powerful SMP-capable machine with a large amount of main memory, since the memory required to store the data doubles with each unit increase in the level of parallelization.
Appendix I

Bag of Words Representation

A bag-of-words representation is a model for representing a sentence in the form of a vector. It is frequently used in natural language processing and information retrieval. The model represents a sentence as an unordered collection of words, without regard for grammar. To form the vector for a sentence, all the distinct words in it are first identified, and each distinct word is given a unique identifier called its index. Each index serves as a dimension in a D-dimensional vector space, where D is the total number of unique words, and the magnitude of the vector in a particular dimension is the count of the words having that index. The process requires two passes through the entire dataset: in the first pass a dictionary of the unique words and their indices is created, and in the second pass the vectors are formed by reference to the dictionary.

For example, consider the following sentence:

What do you think you are doing?

Word  | Index
what  | 0
do    | 1
you   | 2
think | 3
are   | 4
doing | 5

The resulting vector for the sentence is:

1(0) + 1(1) + 2(2) + 1(3) + 1(4) + 1(5)

The vector dimension is given in parentheses, with the respective magnitude alongside. The magnitude of dimension 2 is 2 because the word "you" appears twice in the sentence; the others are one for the same reason.
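A tiny C sketch of this counting step, hard-coded to the worked example above (the dictionary lookup by hashing of Appendix II is left out):

```c
#include <stdio.h>
#include <string.h>

/* Count occurrences of each dictionary word in a tokenized sentence,
 * reproducing the bag-of-words vector of the worked example. */
int main(void)
{
    const char *dict[]   = {"what", "do", "you", "think", "are", "doing"};
    const char *tokens[] = {"what", "do", "you", "think", "you", "are", "doing"};
    int counts[6] = {0};

    for (size_t t = 0; t < sizeof tokens / sizeof *tokens; t++)
        for (size_t d = 0; d < 6; d++)
            if (strcmp(tokens[t], dict[d]) == 0)
                counts[d]++;                 /* magnitude along dimension d */

    for (int d = 0; d < 6; d++)
        printf("%d(%d) ", counts[d], d);     /* prints: 1(0) 1(1) 2(2) 1(3) 1(4) 1(5) */
    printf("\n");
    return 0;
}
```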
Appendix II

Hashing

Hashing is the transformation of a string of characters into a usually shorter fixed-length value or key that represents the original string. Hashing is used to index and retrieve items in a database because finding an item by its shorter hashed key is faster than finding it by the original value; it is also used in many encryption algorithms. The hashing function used in this project is the same as the one used by Oracle's JVM. A cleaned-up version of the code snippet is below: the original accumulated word[i] * pow(31, n-i-1) with the floating-point pow(), which Horner's rule replaces with pure integer arithmetic, and SIZE is assumed here to be the 2^18 bins of Section 3.1.3.

```c
#define SIZE (1 << 18)  /* number of hash bins (2^18, per Section 3.1.3) */

/* Polynomial hash h = word[0]*31^(n-1) + ... + word[n-1], the same
 * scheme as Java's String.hashCode(), evaluated with Horner's rule. */
unsigned int hashCode(const char *word, int n)
{
    unsigned int h = 0;
    for (int i = 0; i < n; i++)
        h = 31u * h + (unsigned char)word[i];
    return h % SIZE;
}
```
References

[1] John Langford, Alexander J. Smola and Martin Zinkevich. Slow Learners are Fast. Journal of Machine Learning Research 1 (2009).
[2] Michael Kistler, Michael Perrone and Fabrizio Petrini. Cell Multiprocessor Communication Network: Built for Speed.
[3] Thomas Chen, Ram Raghavan, Jason Dale and Eiji Iwata. Cell Broadband Engine Architecture and its First Implementation.
[4] Jonathan Bartlett. Programming High-Performance Applications on the Cell/B.E. Processor, Part 6: Smart Buffer Management with DMA Transfers.
[5] Introduction to Statistical Machine Learning, 2010 course, Assignment 1.
[6] Christopher Bishop. Pattern Recognition and Machine Learning.