Data Analytics
             Project Documentation
                    Vipul Divyanshu
                       IIL/2012/14
                    Summer Internship
                      Mentor: Saish Kamat
                    India Innovation Labs
Tasks at hand:
*Data analytics on a medium-size database

*Building a recommender engine for products

Tools and topics Explored:

      Mahout
      Root
      Hadoop
      Data Rush
      Rush Analyser (with KNIME)
      Google Analytics engine

Analysis of the tools and what was explored:

Mahout: Mahout is an open source machine learning library from Apache. The
algorithms it implements fall under the broad umbrella of machine learning or collective
intelligence.



Mahout currently has:
 Collaborative Filtering
 User and Item based recommenders
 K-Means, Fuzzy K-Means clustering
 Mean Shift clustering
 Dirichlet process clustering
 Latent Dirichlet Allocation
 Singular value decomposition
 Parallel frequent pattern mining
 Complementary Naive Bayes classifier
 Random forest decision tree based classifier
 High-performance Java collections (previously Colt collections)


Because Mahout has this many features, sub-tools, and libraries to work with, it is
the best-suited tool for self-designed data analytics programs. Its core libraries are
also highly optimized to give good performance even for non-distributed algorithms.

NOTE: For a good understanding of Mahout, the book 'Mahout in Action' is suggested.

ROOT: ROOT is an object-oriented framework aimed at solving the
data analysis challenges of high-energy physics.
Below, you can find a quick overview of the ROOT framework:
       Save data. You can save your data (and any C++ object) in a compressed
       binary form in a ROOT file. The object format is also saved in the same file.
       ROOT provides a data structure that is extremely powerful for fast access of
       huge amounts of data - orders of magnitude faster than any database.
       Access data. Data saved into one or several ROOT files can be accessed
       from your PC, from the web and from large-scale file delivery systems used
       e.g. in the GRID. ROOT trees spread over several files can be chained and
       accessed as a unique object, allowing for loops over huge amounts of data.
       Process data. Powerful mathematical and statistical tools are provided to
       operate on your data. The full power of a C++ application and of parallel
       processing is available for any kind of data manipulation. Data can also
       be generated following any statistical distribution, making it possible to
       simulate complex systems.
       Show results. Results are best shown with histograms, scatter plots,
       fitting functions, etc. ROOT graphics may be adjusted in real time with a few
       mouse clicks. High-quality plots can be saved in PDF or other formats.
       Interactive or built application. You can use the CINT C++ interpreter or
       Python for your interactive sessions and to write macros, or compile your
       program to run at full speed. In both cases, you can also create a GUI.
       Link to know more about ROOT: http://root.cern.ch/drupal/
       Link for the ROOT user's guide: http://root.cern.ch/download/doc/ROOTUsersGuide.pdf
Constraints of ROOT:
What was found is that ROOT concentrates more on displaying and graphically
presenting the collected data, and on representing computed (processed) results in the
form of canvases, histograms, and TGraphs. This can be used at a later point of time
to present the processed data in a well-defined and interactive manner.
Screenshot:




HADOOP:
Hadoop is an open source framework for writing and running distributed
applications that process large amounts of data across networks of machines.
Key distinctions of Hadoop are:
Accessible—Hadoop runs on large clusters of commodity machines or on cloud
computing services such as Amazon's Elastic Compute Cloud (EC2).
Robust—because it is intended to run on commodity hardware, Hadoop is architected
with the assumption of frequent hardware malfunctions. It can gracefully handle most
such failures.
Scalable—Hadoop scales linearly to handle larger data by adding more nodes to the
cluster.
Simple—Hadoop allows users to quickly write efficient parallel code.

Link to explore more in Hadoop: http://hadoop.apache.org/
NOTE: For a good understanding of Hadoop, the book 'Hadoop in Action' is suggested.
Setting Up Mahout development environment in Eclipse:
NOTE: The following explanation is for the Ubuntu (Linux) OS. It can also be carried
out on any other OS, such as Windows.
PREREQUISITES:
1. Java SDK 6u23 x64
2. Maven 3.0.2
3. Any updated Mahout library
4. IDE (I used Eclipse)
5. Cygwin (in case of a Windows OS)
Running your first sample code:
Once all the above requirements are met we are ready to execute our
first sample code.
Step 1:
        At first, start Eclipse and create a workspace. We take it as "Users/Vipul/workspace" for
        the present.
        Extract the source of Mahout below the workspace. It is
        "Users/Vipul/workspace/mahout-distribution-0.4" for the present.
        Convert the Maven project of Mahout into an Eclipse project with the below command.
        cd Users/Vipul/workspace/mahout-distribution-0.4
        mvn eclipse:eclipse
        Now set the classpath variable M2_REPO of Eclipse to the Maven2 local repository.
        mvn -Declipse.workspace=<path-to-workspace> eclipse:add-maven-repo
        But "Maven – Guide to using Eclipse with Maven 2.x" says "Issue: The
        command does not work". So set it in Eclipse directly.
        Open Window > Preferences > Java > Build Path > Classpath Variables
        from Eclipse's menu.
        Press "New" and add the Name as "M2_REPO" and the Path as the Maven 2
        repository path (its default is .m2/repository in your user directory).
        Finally, import the converted Eclipse project of Mahout.
        Open File > Import > General > Existing Projects into Workspace from
the Eclipse menu.
Select the project directory Users/Vipul/workspace/mahout-distribution-0.6 and all projects.
NOTE: Now you need to have your first code to be implemented
ready. If so proceed to Step 2.
Step 2:
       At first, generate a Maven project for the sample codes in the Eclipse
   workspace directory.
   $ cd Users/Vipul/workspace
   $ mvn archetype:create -DgroupId=mia.recommender -DartifactId=recommender
       Then do the following.
   Delete the generated skeleton code src/main/App.java and copy the
   sample code into src/main/java/mia/recommender of the 'recommender'
   project.
   Convert the Maven project into an Eclipse project.
   $ cd Users/Vipul/workspace/recommender
   $ mvn eclipse:eclipse
       Import the project into Eclipse.
   Open File > Import > General > Existing Projects into Workspace
   from the Eclipse menu and select the 'recommender' project.
   The 'recommender' project is then available in the Eclipse workspace,
   but all classes have errors because there is no Mahout library reference yet.
Right-click the 'recommender' project, select Properties > Java Build Path >
Projects from the pop-up menu, click 'Add', and select the below Mahout projects.

   mahout-core
   mahout-examples
   mahout-taste-webapp
   mahout-math
   mahout-utils




Then only 4 errors remain.
Since these are conflicts with updated APIs, correcting them requires modifying the code.
For example, open mia.recommender.ch03.IREvaluatorBooleanPrefIntro2 and press Ctrl+1
at the error line in it.




   This error says that the code neither catches nor declares the
   TasteException that NearestNUserNeighborhood's constructor throws. You can
   choose whichever solution you like from the pop-up menu; the other errors can be handled the same way.
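
   As an illustration (not necessarily the exact fix that Eclipse's quick-fix generates), the
   two standard ways of resolving such an error are to declare the checked exception or to
   catch it; a minimal sketch, with hypothetical helper method names:

   import org.apache.mahout.cf.taste.common.TasteException;
   import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
   import org.apache.mahout.cf.taste.model.DataModel;
   import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
   import org.apache.mahout.cf.taste.similarity.UserSimilarity;

   class TasteExceptionFixSketch {
     // Fix 1: declare the checked exception and let the caller handle it
     static UserNeighborhood build(UserSimilarity similarity, DataModel model) throws TasteException {
       return new NearestNUserNeighborhood(2, similarity, model);
     }

     // Fix 2: catch the exception where the neighborhood is constructed
     static UserNeighborhood buildOrFail(UserSimilarity similarity, DataModel model) {
       try {
         return new NearestNUserNeighborhood(2, similarity, model);
       } catch (TasteException e) {
         throw new IllegalStateException("Could not build user neighborhood", e);
       }
     }
   }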


   The classes that have a main() function can be executed in Eclipse.
   For example, select mia.recommender.ch02.RecommenderIntro and click Run >
   Run in Eclipse's menu (or press Ctrl+F11 instead). It then throws an
   exception: 'Exception in thread "main" java.io.FileNotFoundException:
   intro.csv'.
   To make it read the sample data file 'intro.csv' in src/mia/recommender/ch02, click
   Run > Run Configurations in Eclipse's menu and select the configuration of
RecommenderIntro which was created by the above execution. Then set
mia/recommender/ch02 as the Working directory in the Arguments tab (see the below
figure). Click the "Workspace..." button and select the directory.




Then it outputs a result like "RecommendedItem[item:104, value:4.257081]".
If you want to make another project, repeat from the Maven project creation step.
RECOMMENDATION ENGINE:
Recommendation is all about predicting patterns of taste, and using them to discover
new and desirable things you didn't already know about. We have many types of
recommenders, such as:
        GenericUserBasedRecommender
        GenericItemBasedRecommender
        SlopeOneRecommender
        SVDRecommender
        KnnItemBasedRecommender
I implemented the code for the first three; with more time in hand, the
other two and some more can be implemented.
NOTE: To feed data to a recommender we need a file, normally of type .csv, and
don't forget to place it in the same folder as the pom file of the current
project being built.
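For reference, a minimal sketch of such a data file, assuming the comma-separated
format that Mahout's FileDataModel expects (user ID, item ID, preference value); the
values below are only illustrative:

   1,101,5.0
   1,102,3.0
   2,101,2.0
   2,103,4.5
   3,102,4.0

Each line is one preference; FileDataModel parses the file into the DataModel that the
recommenders below consume.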
THE USER BASED RECOMMENDATION ENGINE
All the required details of the user-based recommender engine are given in the
book mentioned before ('Mahout in Action'). The output of my recommender is shown below:




The output of the above code can be observed in Eclipse.
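For reference, a minimal sketch of a user-based recommender along the lines of the
book's RecommenderIntro example (a sketch only; the data file name, neighborhood size,
and similarity measure are illustrative choices):

   import java.io.File;
   import java.util.List;
   import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
   import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
   import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
   import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
   import org.apache.mahout.cf.taste.model.DataModel;
   import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
   import org.apache.mahout.cf.taste.recommender.RecommendedItem;
   import org.apache.mahout.cf.taste.recommender.Recommender;
   import org.apache.mahout.cf.taste.similarity.UserSimilarity;

   class UserBasedRecommenderSketch {
     public static void main(String[] args) throws Exception {
       // Load the user,item,preference triples from the .csv file
       DataModel model = new FileDataModel(new File("intro.csv"));
       // Pearson correlation between users' rating vectors
       UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
       // Neighborhood of the 2 most similar users
       UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, model);
       Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
       // Top recommendation for user 1
       List<RecommendedItem> recommendations = recommender.recommend(1, 1);
       for (RecommendedItem item : recommendations) {
         System.out.println(item);
       }
     }
   }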
THE ITEM BASED RECOMMENDATION ENGINE:
It is similar to the user-based recommendation engine; the only difference is that
it computes the similarity between items instead of between users.
Note: For the above reason it is better suited to the case of a fast-growing list
of users and a slower-growing list of products or items. A sketch is given below.
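A minimal sketch of an item-based recommender (a sketch only; the data file and
similarity measure are illustrative choices):

   import java.io.File;
   import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
   import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
   import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
   import org.apache.mahout.cf.taste.model.DataModel;
   import org.apache.mahout.cf.taste.recommender.Recommender;
   import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

   class ItemBasedRecommenderSketch {
     public static void main(String[] args) throws Exception {
       DataModel model = new FileDataModel(new File("intro.csv"));
       // Similarity is computed between items instead of between users
       ItemSimilarity similarity = new PearsonCorrelationSimilarity(model);
       // No user neighborhood is needed for the item-based recommender
       Recommender recommender = new GenericItemBasedRecommender(model, similarity);
       System.out.println(recommender.recommend(1, 1));
     }
   }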
The output of the Item based recommender code is:




THE SLOPE-ONE RECOMMENDATION ENGINE:
It is similar to the item-based recommendation engine but has a pre-processing
stage, and its output is based on the relations between the different items.
A sketch is given below.
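A minimal sketch of a slope-one recommender (a sketch only; SlopeOneRecommender
belongs to the Mahout 0.x Taste API used here and was removed in later Mahout
releases):

   import java.io.File;
   import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
   import org.apache.mahout.cf.taste.impl.recommender.slopeone.SlopeOneRecommender;
   import org.apache.mahout.cf.taste.model.DataModel;
   import org.apache.mahout.cf.taste.recommender.Recommender;

   class SlopeOneRecommenderSketch {
     public static void main(String[] args) throws Exception {
       DataModel model = new FileDataModel(new File("intro.csv"));
       // Pre-computes average rating differences between pairs of items,
       // then estimates preferences from them; no similarity metric is supplied.
       Recommender recommender = new SlopeOneRecommender(model);
       System.out.println(recommender.recommend(1, 1));
     }
   }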
The output of my code is:
THE EVALUATOR FOR THE RECOMMENDATION ENGINE:
There are many possible ways to evaluate the performance of a recommender
engine; I have explored the following:
        RecommenderIRStatsEvaluator
        AverageAbsoluteDifferenceRecommenderEvaluator
        RMSRecommenderEvaluator.
I implemented the first two of them.
AverageAbsoluteDifferenceRecommenderEvaluator:
It takes a part of the data as test data and the rest as training data, recommends items
for the test data, and the estimates are then compared with the real values of the test
data. The output of my code is:
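For reference, a minimal sketch of how this evaluator can be driven (a sketch only;
the data file, the 70% training split, and the user-based recommender inside the
builder are illustrative choices):

   import java.io.File;
   import org.apache.mahout.cf.taste.common.TasteException;
   import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
   import org.apache.mahout.cf.taste.eval.RecommenderEvaluator;
   import org.apache.mahout.cf.taste.impl.eval.AverageAbsoluteDifferenceRecommenderEvaluator;
   import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
   import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
   import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
   import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
   import org.apache.mahout.cf.taste.model.DataModel;
   import org.apache.mahout.cf.taste.recommender.Recommender;

   class AverageDifferenceEvaluatorSketch {
     public static void main(String[] args) throws Exception {
       DataModel model = new FileDataModel(new File("intro.csv"));
       RecommenderEvaluator evaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();
       // The evaluator needs a builder so it can re-train a fresh recommender on each training split
       RecommenderBuilder builder = new RecommenderBuilder() {
         public Recommender buildRecommender(DataModel trainingModel) throws TasteException {
           PearsonCorrelationSimilarity similarity = new PearsonCorrelationSimilarity(trainingModel);
           return new GenericUserBasedRecommender(trainingModel,
               new NearestNUserNeighborhood(2, similarity, trainingModel), similarity);
         }
       };
       // Train on 70% of each user's preferences and evaluate on the remaining 30%
       double score = evaluator.evaluate(builder, null, model, 0.7, 1.0);
       System.out.println("Average absolute difference: " + score);
     }
   }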
RecommenderIRStatsEvaluator:
This evaluator computes the recall and precision of the recommender and gives their
values as the output. The output of the evaluator code is:
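A minimal sketch of its use (a sketch only; the "at 2" cutoff and the user-based
recommender inside the builder are illustrative choices):

   import java.io.File;
   import org.apache.mahout.cf.taste.common.TasteException;
   import org.apache.mahout.cf.taste.eval.IRStatistics;
   import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
   import org.apache.mahout.cf.taste.eval.RecommenderIRStatsEvaluator;
   import org.apache.mahout.cf.taste.impl.eval.GenericRecommenderIRStatsEvaluator;
   import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
   import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
   import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
   import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
   import org.apache.mahout.cf.taste.model.DataModel;
   import org.apache.mahout.cf.taste.recommender.Recommender;

   class IRStatsEvaluatorSketch {
     public static void main(String[] args) throws Exception {
       DataModel model = new FileDataModel(new File("intro.csv"));
       RecommenderIRStatsEvaluator evaluator = new GenericRecommenderIRStatsEvaluator();
       RecommenderBuilder builder = new RecommenderBuilder() {
         public Recommender buildRecommender(DataModel trainingModel) throws TasteException {
           PearsonCorrelationSimilarity similarity = new PearsonCorrelationSimilarity(trainingModel);
           return new GenericUserBasedRecommender(trainingModel,
               new NearestNUserNeighborhood(2, similarity, trainingModel), similarity);
         }
       };
       // Precision and recall "at 2": how many of the top-2 recommendations are relevant,
       // and how many relevant items appear in the top 2
       IRStatistics stats = evaluator.evaluate(builder, null, model, null, 2,
           GenericRecommenderIRStatsEvaluator.CHOOSE_THRESHOLD, 1.0);
       System.out.println("Precision: " + stats.getPrecision());
       System.out.println("Recall: " + stats.getRecall());
     }
   }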
Note: To test the above code on a larger scale, input files can be downloaded
from: http://www.grouplens.org/node/12
Mahout is still in a development stage, and many more of its fields can be explored,
such as clustering, network pattern learning, and classification.
Hadoop could be used together with Mahout to implement a cluster and MapReduce
for processing the data.

Rush Analyser (with Knime):
This tool is also written in Java and requires Eclipse. It was downloaded from:
http://bigdata.pervasive.com/Products/Download-Center.aspx
It is the graphical version of DataRush and is very handy for data analytics and
visualisation.
Below is a snapshot of my work, where I loaded the 10K movie-rating data
downloaded from the test-data link given above.
In the image, different nodes can be seen performing different operations on the
data set.




This is the parallel plot of the data set.
This is the scatter plot generated for the same 10K data values scattered on a 2-D plane.




The data was analysed using the clustering blocks in Rush Analyser.
A few of the blocks I explored are:
       Regression
       Classifiers
       Recommender
       Clustering
       Filters
Data from different databases can be imported directly using the Database Reader
block.
These are a few of the topics explored in Rush Analyser (an interactive DataRush tool),
and they are only the tip of the iceberg, as Rush Analyser has a lot more in store to be
explored. For more information on DataRush, refer to:
http://bigdata.pervasive.com/Products/Analytic-Engine-Pervasive-DataRush.aspx
The potential of DataRush is still to be explored for the project.



Thank You IIL:
Vipul Divyanshu
IIL/2012/14
