II-SDV 2014 Search and Data Mining Open Source Platforms (Patrick Beaucamp - Bpm-Conseil, France)
1. Patrick Beaucamp
Founder of the Vanilla Project
Mail : Patrick.beaucamp@bpm-conseil.com
Search and Data Mining Open Source Platforms
II-SDV, Nice 15th April 2014
2. Presentation Agenda
Open Source Search Engine & Search Platform
Some interesting Platforms
Features expected for Search Engines
Features expected for Search Platforms (Interface)
Open Source Data Mining & Machine Learning Platform
Datamining subject & Usage type
Some interesting Platforms
Hadoop & Data Mining : the future of Learning Machines
If you don’t find it, it doesn’t exist
… so maybe an engine will find it for you?
3. Search & Data Mining Together ?
Search Interface
Enhanced Search Interface
4. Search & Data Mining Together ?
Data Mining : search for you !
• Example : correlation provides you with information about how data
are linked together
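As an illustration of the correlation example, here is a minimal sketch of the Pearson correlation coefficient in plain Python. The variable names and figures (`ad_spend`, `sales`, `returns`) are hypothetical and are not taken from the presentation.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variances, left unnormalised because the factors cancel.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical data: two series that move together, one that moves the opposite way.
ad_spend = [1, 2, 3, 4, 5]
sales = [2, 4, 6, 8, 10]
returns = [5, 4, 3, 2, 1]

print(pearson(ad_spend, sales))
print(pearson(ad_spend, returns))
```

A value close to 1 suggests the two series rise together, a value close to -1 suggests they move in opposite directions, and a value near 0 suggests no linear link.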
5. Confusion between Search Engines & Platforms
-Search platform per domain :
- News (Yahoo)
- Jobs (JobServe)
- Real Estate
- Human Relations
- Craigslist
- Patent industry
- …
A business case : Facebook
- Basic Search interface
- Proposal for connection
- Advanced graphical search
6. Search Engines
Search Engine Providers in Open Source - Lucene
- Lucene is a Java-based indexing and search API
- Solr is the leading search server built on Lucene. Two companies,
LucidWorks and ElasticSearch, provide packaging and extensions on top of
Lucene and Solr
-Search Landscape
- Lucene : http://lucene.apache.org
- Solr/Lucene : http://lucene.apache.org/solr/
- Platform OpenSearch : http://www.open-search-server.com
- Platform Katta : http://katta.sourceforge.net
- Platform LucidWorks : http://www.lucidworks.com
- Platform ElasticSearch : http://www.elasticsearch.com
- Sphinx : http://sphinxsearch.com/
7. Before Search Engines
Before indexing your document base, you need to access it !
-Apache Nutch is a highly extensible and scalable open source web crawler
software project.
-Reference : http://nutch.apache.org/
8. Search Engines : focus
Sphinx : runs inside a database such as MariaDB (MySQL)
Lucene : information retrieval software library
Use an existing search infrastructure such as Solr/Lucene (Vanilla certified)
http://www.lucidworks.com/ or http://www.elasticsearch.org/
9. Search Engines Basics (1/2)
-Synonyms
- It is possible to extend the search to synonyms if they are listed in a
glossary. For example, a search for the word “TV” then also finds articles
containing synonyms of “TV”.
-Metadata
- Dictionary for list of searchable keywords
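The synonym mechanism can be sketched in a few lines of Python. This is only an illustration of the idea, not how Solr or Lucene implement it; the glossary contents and the function names `expand_query` and `search` are assumptions made for the example.

```python
# Hypothetical glossary: each term maps to the synonyms the engine should also match.
SYNONYMS = {
    "tv": ["television", "telly"],
}

def expand_query(query, synonyms=SYNONYMS):
    """Return the set of terms to search for: the query words plus their synonyms."""
    terms = set()
    for word in query.lower().split():
        terms.add(word)
        terms.update(synonyms.get(word, []))
    return terms

def search(query, documents):
    """Return the documents that contain at least one expanded query term."""
    terms = expand_query(query)
    return [doc for doc in documents if terms & set(doc.lower().split())]

docs = ["New television models announced", "Radio schedule for Monday", "TV guide"]
print(search("TV", docs))
```

Expanding the query at search time, as here, keeps the index unchanged; a real engine may instead expand synonyms at indexing time.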
10. Search Engines Basics (2/2)
-Reserved Words, Protected Words
- Indexing usually uses stemming, which reduces words to their root, for
example “develop”, so that a search for “develop” also finds items
containing the word “development”. However, stemming sometimes goes
wrong and indexes two unrelated words under the same root. It is possible
to prevent the stemming of specific words by listing them in a file
protwords.txt.
-StopWords
- Stopwords are words that carry little meaning. A word considered
insignificant is ignored. Note that some words are insignificant only in
some contexts, and others have meaningful homonyms. For example, the
French word “été” can refer to the summer season (rather meaningful) or
to the past participle of the verb “to be” (relatively insignificant).
Stopwords are listed in the file stopwords.txt.
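The interplay of stemming, protected words and stopwords can be sketched as follows. This is a toy suffix stripper written for illustration, not the stemming algorithm used by Solr/Lucene; the sets `PROTECTED` and `STOPWORDS` stand in for the contents of protwords.txt and stopwords.txt and are assumptions.

```python
SUFFIXES = ("ment", "ing", "ed", "s")   # toy suffix list, not a real stemming algorithm
PROTECTED = {"news"}                    # plays the role of protwords.txt
STOPWORDS = {"the", "a", "of", "is"}    # plays the role of stopwords.txt

def stem(word):
    """Strip one known suffix unless the word is protected."""
    if word in PROTECTED:
        return word
    for suffix in SUFFIXES:
        # Keep at least three characters so very short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def analyze(text):
    """Lowercase, drop stopwords, then stem the remaining tokens."""
    tokens = text.lower().split()
    return [stem(token) for token in tokens if token not in STOPWORDS]

print(analyze("The development of news"))
```

Without the entry in `PROTECTED`, this toy stemmer would reduce "news" to "new", which is the kind of unwanted merge that a protected-words file is meant to prevent.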
11. Search Engines current Limits
-Multi Language support (this is where commercial search engines still have more
to offer customers), even though there is now support for Asian languages (Hindi,
Thai, Chinese, …)
-Elision :
- Elisions are a feature of French: words such as “le” or “la” are
contracted when they are followed by a vowel. Example: “le” + “avion”
(aircraft) gives “l’avion” (the aircraft). It is possible to remove these
elisions using a lexicon.
-Full text search interface (language with search engine)
-SubQuery support : now it’s better with Solr 4.7
-Scalability (this is where Solr has a technical advantage)
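Elision removal with a lexicon, as mentioned in the list above, can be sketched with a regular expression. The lexicon `ELISIONS` below is a short hypothetical list chosen for the example, and `strip_elisions` is an illustrative name, not a function of any search engine.

```python
import re

# Hypothetical short lexicon of French words that are elided before a vowel.
ELISIONS = ("l", "d", "j", "m", "n", "s", "t", "c", "qu")

# Match an elided word at a word boundary, followed by an apostrophe and a letter.
_ELISION_RE = re.compile(r"\b(?:%s)['’](?=\w)" % "|".join(ELISIONS), re.IGNORECASE)

def strip_elisions(text):
    """Remove elided articles and pronouns so that l'avion is indexed as avion."""
    return _ELISION_RE.sub("", text)

print(strip_elisions("l'avion"))
print(strip_elisions("aujourd'hui"))
```

The word-boundary check is what keeps a word such as "aujourd'hui" intact: the "d" there is in the middle of a word, so it is not treated as an elided article.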
12. Hadoop Search Platforms for Big Data
-Cloudera with SolrCloud (Solr/Lucene)
-MapR with ElasticSearch (Lucene code)
-HortonWorks with LucidWorks (Solr/Lucene)
13. Search Interface : expectation (1/3)
-Advanced indexing and querying tools.
-Distributed search capabilities, so that no single server becomes a
bottleneck.
-Generation of document excerpts (snippets) that summarize the search
results.
-Relevance ranking, with extracts from the documents displayed according to the query.
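Snippet generation, one of the expectations listed above, can be sketched as cutting a window of text around the first query term found. This is a deliberately simple illustration; the function name `make_snippet` and the `width` parameter are assumptions, and real engines use more elaborate highlighting.

```python
def make_snippet(text, query, width=30):
    """Return an excerpt of `text` around the first query term that occurs in it."""
    lowered = text.lower()
    for term in query.lower().split():
        pos = lowered.find(term)
        if pos != -1:
            start = max(0, pos - width)
            end = min(len(text), pos + len(term) + width)
            # Mark truncation on either side with an ellipsis.
            prefix = "..." if start > 0 else ""
            suffix = "..." if end < len(text) else ""
            return prefix + text[start:end] + suffix
    return ""  # no query term occurs in the text

document = "Apache Nutch is a web crawler"
print(make_snippet(document, "nutch", width=3))
```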
14. Search Interface : expectation (2/3)
-Duplicate document detection, including fuzzy near duplicates
-Rich Document Parsing and Indexing without using Database Indexing.
-Ranking control, to carry out a targeted ranking of individual documents.
-Search Grouping by Type / Tag / Categories (General page, documents, images)
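Fuzzy near-duplicate detection, listed above, is often approached by comparing sets of word shingles with the Jaccard index. The sketch below shows that general technique under assumed parameters (`k=3` words per shingle, a `threshold` of 0.5); it is not the method of any particular search platform.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) of a text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard index: size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def is_near_duplicate(doc_a, doc_b, threshold=0.5):
    """Treat two documents as near duplicates when their shingle sets overlap enough."""
    return jaccard(shingles(doc_a), shingles(doc_b)) >= threshold

print(is_near_duplicate(
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the sleepy dog",
))
```

The threshold is a tuning choice: a lower value flags more pairs as duplicates, a higher value flags only documents that are almost identical.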
15. Search Interface : expectation (3/3)
-Multi Criteria support
-Ranking
-Natural language support
-Apps Support (Android, iPad)
16. Migration to Open Source
-Only a few projects so far, but the number is growing
SAME FUNCTIONALITY, MUCH LOWER COSTS
-http://www.searchtechnologies.com/fast-to-solr-migration.html
-http://fr.slideshare.net/janhoy/migrating-fast-to-solr
17. Datamining
-Work on your data with data mining components such as
R : http://www.revolutionanalytics.com/
Weka : http://www.cs.waikato.ac.nz/ml/weka/
RapidMiner : http://rapidminer.com/
KNIME : http://www.knime.org/
Apache Mahout : http://mahout.apache.org/
18. Many Definitions
-Non-trivial extraction of implicit, previously unknown and potentially useful
information from data
-Exploration & analysis, by automatic or semi-automatic means, of
large quantities of data in order to discover meaningful patterns
19. What is (not) Data Mining?
What is Data Mining?
– Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in Boston area)
– Group together similar documents returned by search engine according to their context (e.g. Amazon rainforest, Amazon.com)
What is not Data Mining?
– Look up phone number in phone directory
– Query a Web search engine for information about “Amazon”
20. Origins of Data Mining
Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
Traditional techniques may be unsuitable due to:
-Enormity of data
-High dimensionality of data
-Heterogeneous, distributed nature of data
[Figure: Data Mining at the intersection of Machine Learning/Pattern Recognition, Statistics/AI, and Database systems]
21. Data Mining Tasks
Description Methods
Find human-interpretable patterns that describe the
data.
Prediction Methods
Use some variables to predict unknown or future
values of other variables.
23. Classification: Definition
Given a collection of records (training set):
Each record contains a set of attributes; one of the attributes is the class.
Find a model for the class attribute as a function of the values of the other attributes.
Goal: previously unseen records should be assigned a class as accurately as possible.
A test set is used to determine the accuracy of the model. Usually, the given
data set is divided into training and test sets, with the training set used to build
the model and the test set used to validate it.
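The training/test workflow can be sketched with a very simple classifier. The example below uses a 1-nearest-neighbour rule purely for illustration (the slides do not name a specific algorithm), and the records, attribute meanings and function names are hypothetical.

```python
import random

def nearest_neighbor_predict(train, record):
    """Predict the class of `record` as the class of its closest training record."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    _, label = min(train, key=lambda item: sq_dist(item[0], record))
    return label

def split(records, test_fraction=0.3, seed=0):
    """Shuffle the records and divide them into a training set and a test set."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def accuracy(train, test):
    """Fraction of test records whose predicted class matches the true class."""
    correct = sum(nearest_neighbor_predict(train, attrs) == label for attrs, label in test)
    return correct / len(test)

# Hypothetical records: (age, income) attributes with a {buy, dont_buy} class attribute.
records = [((20 + i, 20 + i), "dont_buy") for i in range(10)]
records += [((45 + i, 80 + i), "buy") for i in range(10)]

train, test = split(records)
print(accuracy(train, test))
```

The accuracy is measured only on the test set, which the model never saw while it was being built; that is what makes it an estimate of how well unseen records would be classified.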
25. Classification: Application 1
Direct Marketing
Goal: Reduce cost of mailing by targeting a set of consumers likely to buy a new
cell-phone product.
Approach:
Use the data for a similar product introduced before.
We know which customers decided to buy and which decided otherwise. This {buy, don’t buy}
decision forms the class attribute.
Collect various demographic, lifestyle, and company-interaction related information about all
such customers.
Type of business, where they stay, how much they earn, etc.
Use this information as input attributes to learn a classifier model.
26. Classification: Application 2
Fraud Detection
Goal: Predict fraudulent cases in credit card transactions.
Approach:
Use credit card transactions and the information on the account holder as attributes.
When the customer buys, what they buy, how often they pay on time, etc.
Label past transactions as fraud or fair transactions. This forms the class attribute.
Learn a model for the class of the transactions.
Use this model to detect fraud by observing credit card transactions on an account.
27. Classification: Application 3
Customer Attrition/Churn:
Goal: To predict whether a customer is likely to be lost to a
competitor.
Approach:
Use detailed record of transactions with each of the past and present
customers, to find attributes.
How often the customer calls, where he calls, what time-of-the day he calls most,
his financial status, marital status, etc.
Label the customers as loyal or disloyal.
Find a model for loyalty.
28. Clustering Definition
Given a set of data points, each having a set of attributes, and a
similarity measure among them, find clusters such that
Data points in one cluster are more similar to one another.
Data points in separate clusters are less similar to one another.
Similarity Measures:
Euclidean Distance if attributes are continuous.
Other Problem-specific Measures.
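One common way to find such clusters with Euclidean distance is k-means. The sketch below is a minimal version for illustration only: the slides do not prescribe an algorithm, the initialisation is a simplification chosen for reproducibility, and the sample points are hypothetical. It assumes `k` is no larger than the number of points.

```python
from math import dist  # Euclidean distance, available from Python 3.8

def kmeans(points, k, iterations=10):
    """Group points into k clusters by repeatedly assigning and re-centring."""
    # Naive initialisation: evenly spaced picks from the input (an assumption
    # made for reproducibility; real implementations seed more carefully).
    step = max(1, len(points) // k)
    centroids = [points[i * step] for i in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return clusters, centroids

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
clusters, centroids = kmeans(points, k=2)
print(clusters)
```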
30. Illustrating Clustering
Euclidean Distance Based Clustering in 3-D space.
Intracluster distances are minimized
Intercluster distances are maximized
31. Clustering: Application 1
Market Segmentation:
Goal: subdivide a market into distinct subsets of customers where any subset may
conceivably be selected as a market target to be reached with a distinct
marketing mix.
Approach:
Collect different attributes of customers based on their geographical and lifestyle related
information.
Find clusters of similar customers.
Measure the clustering quality by observing buying patterns of customers in same cluster vs.
those from different clusters.
32. Clustering: Application 2
Document Clustering:
Goal: To find groups of documents that are similar to each other
based on the important terms appearing in them.
Approach: To identify frequently occurring terms in each
document. Form a similarity measure based on the
frequencies of different terms. Use it to cluster.
Gain: Information Retrieval can utilize the clusters to relate a
new document or search term to clustered documents.
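A similarity measure based on term frequencies can be sketched with cosine similarity, which is one common choice and is used here as an assumption. The sample documents and the helper names are hypothetical.

```python
from collections import Counter
from math import sqrt

def term_frequencies(document):
    """Count how often each term occurs in a document."""
    return Counter(document.lower().split())

def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors (0 = unrelated)."""
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = sqrt(sum(v * v for v in tf_a.values()))
    norm_b = sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def most_similar(new_document, documents):
    """Relate a new document or search term to the closest known document."""
    new_tf = term_frequencies(new_document)
    return max(documents, key=lambda d: cosine_similarity(new_tf, term_frequencies(d)))

docs = ["amazon rainforest trees river", "amazon online shopping books"]
print(most_similar("rainforest river wildlife", docs))
```

A clustering algorithm can then group documents using this similarity in place of a geometric distance.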
33. Association Rule Discovery: Definition
Marketing and Sales Promotion:
Let the rule discovered be
{Bagels, … } --> {Potato Chips}
Potato Chips as consequent => Can be used to determine what should be done to
boost its sales.
Bagels in the antecedent => Can be used to see which products would be affected
if the store discontinues selling bagels.
Bagels in antecedent and Potato chips in consequent => Can be used to see what
products should be sold with Bagels to promote sale of Potato chips!
35. Association Rule Discovery: Application 1
Supermarket shelf management.
Goal: To identify items that are bought together by sufficiently many customers.
Approach: Process the point-of-sale data collected with barcode scanners to find
dependencies among items.
A classic rule --
If a customer buys diapers and milk, then he is very likely to buy
beer.
So, don’t be surprised if you find six-packs stacked next to diapers!
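How strongly a rule such as {diapers, milk} --> {beer} holds is usually measured by its support and confidence. The sketch below computes both on a handful of hypothetical baskets; the data is invented for the example and does not come from the presentation.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Among transactions containing the antecedent, the fraction that also
    contain the consequent."""
    both = support(transactions, set(antecedent) | set(consequent))
    base = support(transactions, antecedent)
    return both / base if base else 0.0

# Hypothetical point-of-sale baskets.
baskets = [
    {"diaper", "milk", "beer"},
    {"diaper", "milk", "beer", "bread"},
    {"diaper", "milk"},
    {"milk", "bread"},
    {"beer", "chips"},
]

print(support(baskets, {"diaper", "milk"}))
print(confidence(baskets, {"diaper", "milk"}, {"beer"}))
```

A rule is typically kept only when both values exceed chosen minimum thresholds, which is what "bought together by sufficiently many customers" amounts to.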
36. Association Rule Discovery: Application 2
Inventory Management:
Goal: A consumer appliance repair company wants to anticipate the nature of
repairs on its consumer products and keep the service vehicles equipped
with the right parts to reduce the number of visits to consumer households.
Approach: Process the data on tools and parts required in previous repairs at
different consumer locations and discover the co-occurrence patterns.
37. Challenges of Data Mining
Scalability
Dimensionality
Complex and Heterogeneous Data
Data Quality
Data Ownership and Distribution
Privacy Preservation
Streaming Data
38. Learning Machines
- Weka
- RapidMiner (formerly Yale : Yet Another Learning Environment)
- Hadoop Mahout
Background infrastructure & Framework
Set of APIs and Libraries
Provides Algorithms for DataMining
Your Data
Your Application