SlideShare una empresa de Scribd logo
1 de 43
Descargar para leer sin conexión
Patrick Beaucamp
Founder of the Vanilla Project
Mail : Patrick.beaucamp@bpm-conseil.com
Search and Data Mining Open Source Platforms
II-SDV, Nice 15th April 2014
1II-SDV, Nice
Presentation Agenda
Open Source Search Engine & Search Platform
Some interesting Platforms
Features expected for Search Engines
Features expected for Search Platforms (Interface)
2II-SDV, Nice
Open Source Data Mining & Machine Learning Platform
Datamining subject & Usage type
Some interesting Platforms
Hadoop & Data Mining : the future of Learning Machin
If you don’t find it, it doesn’t exist
… so, may be an engine will find it for you ?
Search & Data Mining Together ?
Search Interface
3II-SDV, Nice
Enhanced Search Interface
Search & Data Mining Together ?
4II-SDV, Nice
Data Mining : search for you !
• Example : correlation provides you with information about how data
are linked together
5II-SDV, Nice
Confusion Search Engine & Platforms
-Search platform per domain :
- News (yahoo)
- Job (jobserve)
- Real Estate
- Human Relation
- Craig List
- Patent industry
- …
A business case : Facebook
- Basic Search interface
- Proposal for connection
- Advanced graphical search
6II-SDV, Nice
Search Engines
Search Engine Providers in Open Source - Lucene
- Lucene is a java based indexing and search API
- Solr/Lucene is the leading server extension of Lucene. 2 companies,
LucidWorks and ElasticSearch, provides packaging and extension of top of
Lucene and Solr
-Search Landscape
-• Lucene : http://lucene.apache.org
-• Solr/Lucene : http://lucene.apache.org/solr/
-• Plateform OpenSearch : http://www.open-search-server.com
-• Plateform Katta : http://katta.sourceforge.net
-• Plateform LucidWorks : http://www.lucidworks.com
-• Plateform ElasticSearch : http://www.elasticsearch.com
-• Sphinx : http://sphinxsearch.com/
7II-SDV, Nice
Before Search Engines
Before indexing your document base, you need to access it !
-Apache Nutch is a highly extensible and scalable open source web crawler
software project.
-Reference : http://nutch.apache.org/
8II-SDV, Nice
Search Engines : focus
Sphinx : inside Database like MariaDb (Mysql)
Lucene : Retrieval Software library
Use existing Search Infrastructure like Solr/Lucene (Vanilla certified)
http://www.lucidworks.com/ or http://www.elasticsearch.org/
9II-SDV, Nice
Search Engines Basics (1/2)
-Synonyms
- It is possible to extend the search to synonyms if they are listed in a
glossary. For example, to find articles containing synonyms to “TV” when
you search with the word TV.
-Metadata
- Dictionary for list of searchable keywords
10II-SDV, Nice
Search Engines Basics (2/2)
-Reserved Words, Protected Words
- Indexing usually uses stemming, which is to reduce words to their root, for
example "Developp" to find items also contain the word when trying to
develop the word development. However, sometimes there are adverse
lemmatizations, indexing under one lemma two words that have no
relation. It is possible to prevent the stemming of words by listing them in
a file protwords.txt.
-StopWords
- The stopwords are meaningless words. A word considered insignificant
will be ignored. Note that some words are insignificant in some contexts,
others have homonyms signifiers. For example, can refer to a summer
season (rather mean) or past participle of the verb to be (relatively
insignificant). Stopwords.txt the file looks like this
11II-SDV, Nice
Search Engines current Limits
-Multi Language support (this is where commercial search engine have still more
to bring to customer), even there is now Asian type language support (Hindi,
Thai, Chineese, …)
-Elision :
- Elisions are a feature of the French, which consist of a contraction of the
words like or when they are followed by a vowel. Example: + aircraft gives
the aircraft. It is possible to remove these elisions using a lexicon.
-Full text search interface (language with search engine)
-SubQuery support : now its better with Solr 4.7
-Scalability (this is where Solr is taking technical advantage)
12II-SDV, Nice
Hadoop SearchPlatform for BigData
-Cloudera with Solr/Cloud (Solr/Lucene)
-Mapr with ElasticSearch (Lucene code)
-HortonWorks with LucidWorks (Solr/Lucene)
13II-SDV, Nice
Search Interface : expectation (1/3)
-Advance indexing and querying tools.
-Provides distributed searching capabilities to prevent bottleneck for a particular
server.
-Provides document excerpts (snippets) generation that provides summary of the
search
-Relevance ranking display extracts from the documents based on the query.
14II-SDV, Nice
Search Interface : expectation (2/3)
-Duplicate document detection, including fuzzy near duplicates
-Rich Document Parsing and Indexing without using Database Indexing.
-Ranking control carry out a targeted ranking of individual documents.
-Search Grouping by Type / Tag / Categories (General page, documents, images)
15II-SDV, Nice
-Multi Criteria support
-Ranking
-Natural language support
-Apps Support (Android, Ipad)
Search Interface : expectation (3/3)
16II-SDV, Nice
Migration to Open Source
-Only few project, but it’s growing
SAME FUNCTIONALITY, MUCH LOWER COSTS
-http://www.searchtechnologies.com/fast-to-solr-migration.html
-http://fr.slideshare.net/janhoy/migrating-fast-to-solr
17II-SDV, Nice
Datamining
-Work on your data with Dataming component such as
R : http://www.revolutionanalytics.com/
Weka : http://www.cs.waikato.ac.nz/ml/weka/
RapidMiner : http://rapidminer.com/
KNIME : http://www.knime.org/
Apache Mahout : http://mahout.apache.org/
Many Definitions
Non-trivial extraction of implicit, previously unknown and potentially useful
information from data
Exploration & analysis, by automatic or semi-automatic means, of
large quantities of data in order to discover meaningful patterns
Datamining
II-SDV, Nice
What is (not) Data Mining?
 What is Data Mining?
– Certain names are more
prevalent in certain US
locations (O’Brien, O’Rurke,
O’Reilly… in Boston area)
– Group together similar
documents returned by
search engine according to
their context (e.g. Amazon
rainforest, Amazon.com,)
 What is not Data Mining?
– Look up phone
number in phone
directory
– Query a Web search
engine for information
about “Amazon”
II-SDV, Nice
Datamining
Origins of Data Mining
Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
Traditional Techniques may be unsuitable due to
Enormity of data
High dimensionality of data
Heterogeneous, distributed nature of data
Machine Learning/
Pattern
Recognition
Statistics/
AI
Data Mining
Database
systems
II-SDV, Nice
Datamining
Data Mining Tasks
Description Methods
Find human-interpretable patterns that describe the
data.
Prediction Methods
Use some variables to predict unknown or future
values of other variables.
II-SDV, Nice
Datamining
Data Mining Tasks...
Classification [Predictive]
Clustering [Descriptive]
Association Rule Discovery [Descriptive]
Sequential Pattern Discovery [Descriptive]
Regression [Predictive]
Deviation Detection [Predictive]
II-SDV, Nice
Datamining
Classification: Definition
Given a collection of records (training set )
Each record contains a set of attributes, one of the attributes is the class.
Find a model for class attribute as a function of the values of other attributes.
Goal: previously unseen records should be assigned a class as accurately as possible.
A test set is used to determine the accuracy of the model. Usually, the given
data set is divided into training and test sets, with training set used to build
the model and test set used to validate it.
II-SDV, Nice
Datamining
Classification with Weka – J48
II-SDV, Nice
Datamining
Classification: Application 1
Direct Marketing
Goal: Reduce cost of mailing by targeting a set of consumers likely to buy a new
cell-phone product.
Approach:
Use the data for a similar product introduced before.
We know which customers decided to buy and which decided otherwise. This {buy, don’t buy}
decision forms the class attribute.
Collect various demographic, lifestyle, and company-interaction related information about all
such customers.
Type of business, where they stay, how much they earn, etc.
Use this information as input attributes to learn a classifier model.
II-SDV, Nice
II-SDV, Nice
Datamining
Classification: Application 2
Fraud Detection
Goal: Predict fraudulent cases in credit card transactions.
Approach:
Use credit card transactions and the information on its account-holder as attributes.
When does a customer buy, what does he buy, how often he pays on time, etc
Label past transactions as fraud or fair transactions. This forms the class attribute.
Learn a model for the class of the transactions.
Use this model to detect fraud by observing credit card transactions on an account.
II-SDV, Nice
Datamining
Classification: Application 3
Customer Attrition/Churn:
Goal: To predict whether a customer is likely to be lost to a
competitor.
Approach:
Use detailed record of transactions with each of the past and present
customers, to find attributes.
How often the customer calls, where he calls, what time-of-the day he calls most,
his financial status, marital status, etc.
Label the customers as loyal or disloyal.
Find a model for loyalty.
Datamining
II-SDV, Nice
Clustering Definition
Given a set of data points, each having a set of attributes, and a
similarity measure among them, find clusters such that
Data points in one cluster are more similar to one another.
Data points in separate clusters are less similar to one another.
Similarity Measures:
Euclidean Distance if attributes are continuous.
Other Problem-specific Measures.
II-SDV, Nice
Datamining
Clustering with Weka – K-Means
II-SDV, Nice
Datamining
Illustrating Clustering
Euclidean Distance Based Clustering in 3-D space.
Intracluster distances
are minimized
Intercluster distances
are maximized
II-SDV, Nice
Datamining
Clustering: Application 1
Market Segmentation:
Goal: subdivide a market into distinct subsets of customers where any subset may
conceivably be selected as a market target to be reached with a distinct
marketing mix.
Approach:
Collect different attributes of customers based on their geographical and lifestyle related
information.
Find clusters of similar customers.
Measure the clustering quality by observing buying patterns of customers in same cluster vs.
those from different clusters.
II-SDV, Nice
Datamining
Clustering: Application 2
Document Clustering:
Goal: To find groups of documents that are similar to each other
based on the important terms appearing in them.
Approach: To identify frequently occurring terms in each
document. Form a similarity measure based on the
frequencies of different terms. Use it to cluster.
Gain: Information Retrieval can utilize the clusters to relate a
new document or search term to clustered documents.
II-SDV, Nice
Datamining
Association Rule Discovery: Definition
Marketing and Sales Promotion:
Let the rule discovered be
{Bagels, … } --> {Potato Chips}
Potato Chips as consequent => Can be used to determine what should be done to
boost its sales.
Bagels in the antecedent => Can be used to see which products would be affected
if the store discontinues selling bagels.
Bagels in antecedent and Potato chips in consequent => Can be used to see what
products should be sold with Bagels to promote sale of Potato chips!
II-SDV, Nice
Datamining
Association Rules with Weka – APriori
II-SDV, Nice
Datamining
Association Rule Discovery: Application 1
Supermarket shelf management.
Goal: To identify items that are bought together by sufficiently many customers.
Approach: Process the point-of-sale data collected with barcode scanners to find
dependencies among items.
A classic rule --
If a customer buys diaper and milk, then he is very likely to buy
beer.
So, don’t be surprised if you find six-packs stacked next to diapers!
II-SDV, Nice
Datamining
Association Rule Discovery: Application 2
Inventory Management:
Goal: A consumer appliance repair company wants to anticipate the nature of
repairs on its consumer products and keep the service vehicles equipped
with right parts to reduce on number of visits to consumer households.
Approach: Process the data on tools and parts required in previous repairs at
different consumer locations and discover the co-occurrence patterns.
II-SDV, Nice
Datamining
Challenges of Data Mining
Scalability
Dimensionality
Complex and Heterogeneous Data
Data Quality
Data Ownership and Distribution
Privacy Preservation
Streaming Data
II-SDV, Nice
Datamining
38II-SDV, Nice
Learning Machines
- Weka
- RapidMiner (formerly Yale : Yet Another Learning Environment)
- Hadoop Mahout
Background infrastructure & Framework
Set of Api and Libraries
Provides Algorithms for DataMining
Your Data
Your Application
39II-SDV, Nice
Learning Machines Implementation
With Apache Mahout
Classification
- Naives Bayes
- Markov Models
- Logistic Regression
- Random Forest
-Clustering
- K-Means and Fuzzy K-Means
- Canopy
- Spectral Clustering
-Association
- Latent Derichlet Allocation
Working with documents :
Mahout – Lucene integration
40II-SDV, Nice
Learning Machines Implementation
41II-SDV, Nice
References (Search Domain)
-Lucene : http://lucene.apache.org
-Plateform Solr/Lucene : http://www.lucidworks.com
-Plateform OpenSearch : http://www.open-search-server.com
-Plateform Katta : http://katta.sourceforge.net
-Plateform ElasticSearch : http://www.elasticsearch.com
-Plateform http://www.lucidworks.com/
-Sphinx : http://sphinxsearch.com/
-http://nutch.apache.org/
-http://en.wikipedia.org/wiki/List_of_search_engines
-http://www.thesearchenginelist.com/
-www.cloudera.com
-www.mapr.com
-www.hortonworks.com
42II-SDV, Nice
References (DataMining Domain)
R : http://www.revolutionanalytics.com/
Weka : http://www.cs.waikato.ac.nz/ml/weka/
RapidMiner : http://rapidminer.com/
KNIME : http://www.knime.org/
Apache Mahout : http://mahout.apache.org/
43II-SDV, Nice

Más contenido relacionado

La actualidad más candente

Ai2020 ai and or final
Ai2020 ai and or finalAi2020 ai and or final
Ai2020 ai and or finalRichard Vidgen
 
AI-SDV 2021: Heiko Wongel - Machine learning tools in patent searching - are ...
AI-SDV 2021: Heiko Wongel - Machine learning tools in patent searching - are ...AI-SDV 2021: Heiko Wongel - Machine learning tools in patent searching - are ...
AI-SDV 2021: Heiko Wongel - Machine learning tools in patent searching - are ...Dr. Haxel Consult
 
Key Principles Of Data Mining
Key Principles Of Data MiningKey Principles Of Data Mining
Key Principles Of Data Miningtobiemuir
 
GeeCon Prague 2018 - A Practical-ish Introduction to Data Science
GeeCon Prague 2018 - A Practical-ish Introduction to Data ScienceGeeCon Prague 2018 - A Practical-ish Introduction to Data Science
GeeCon Prague 2018 - A Practical-ish Introduction to Data ScienceMark West
 
AI-SDV 2021: Angela Bauch - AILANI for clinical competitive landscaping
AI-SDV 2021: Angela Bauch - AILANI for clinical competitive landscapingAI-SDV 2021: Angela Bauch - AILANI for clinical competitive landscaping
AI-SDV 2021: Angela Bauch - AILANI for clinical competitive landscapingDr. Haxel Consult
 
A Practical-ish Introduction to Data Science
A Practical-ish Introduction to Data ScienceA Practical-ish Introduction to Data Science
A Practical-ish Introduction to Data ScienceMark West
 
Machine learning for factor investing
Machine learning for factor investingMachine learning for factor investing
Machine learning for factor investingQuantUniversity
 
Intro to Data Science for Non-Data Scientists
Intro to Data Science for Non-Data ScientistsIntro to Data Science for Non-Data Scientists
Intro to Data Science for Non-Data ScientistsSri Ambati
 
AI-SDV 2021 - Tony Trippe - The Current State of Machine Learning for Patent ...
AI-SDV 2021 - Tony Trippe - The Current State of Machine Learning for Patent ...AI-SDV 2021 - Tony Trippe - The Current State of Machine Learning for Patent ...
AI-SDV 2021 - Tony Trippe - The Current State of Machine Learning for Patent ...Dr. Haxel Consult
 
Self Study Business Approach to DS_01022022.docx
Self Study Business Approach to DS_01022022.docxSelf Study Business Approach to DS_01022022.docx
Self Study Business Approach to DS_01022022.docxShanmugasundaram M
 
Data Mining With Excel 2007 And SQL Server 2008
Data Mining With Excel 2007 And SQL Server 2008Data Mining With Excel 2007 And SQL Server 2008
Data Mining With Excel 2007 And SQL Server 2008Mark Tabladillo
 
AI-SDV 2020: Delivering AIM™ Patent Landscapes for Competitive Intelligence –...
AI-SDV 2020: Delivering AIM™ Patent Landscapes for Competitive Intelligence –...AI-SDV 2020: Delivering AIM™ Patent Landscapes for Competitive Intelligence –...
AI-SDV 2020: Delivering AIM™ Patent Landscapes for Competitive Intelligence –...Dr. Haxel Consult
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data ScienceANOOP V S
 
What Is Data Science? Data Science Course - Data Science Tutorial For Beginne...
What Is Data Science? Data Science Course - Data Science Tutorial For Beginne...What Is Data Science? Data Science Course - Data Science Tutorial For Beginne...
What Is Data Science? Data Science Course - Data Science Tutorial For Beginne...Edureka!
 
IRJET- Comparative Analysis of Various Tools for Data Mining and Big Data...
IRJET-  	  Comparative Analysis of Various Tools for Data Mining and Big Data...IRJET-  	  Comparative Analysis of Various Tools for Data Mining and Big Data...
IRJET- Comparative Analysis of Various Tools for Data Mining and Big Data...IRJET Journal
 
Data science syllabus
Data science syllabusData science syllabus
Data science syllabusanoop bk
 

La actualidad más candente (20)

Ai2020 ai and or final
Ai2020 ai and or finalAi2020 ai and or final
Ai2020 ai and or final
 
AI-SDV 2021: Heiko Wongel - Machine learning tools in patent searching - are ...
AI-SDV 2021: Heiko Wongel - Machine learning tools in patent searching - are ...AI-SDV 2021: Heiko Wongel - Machine learning tools in patent searching - are ...
AI-SDV 2021: Heiko Wongel - Machine learning tools in patent searching - are ...
 
Key Principles Of Data Mining
Key Principles Of Data MiningKey Principles Of Data Mining
Key Principles Of Data Mining
 
GeeCon Prague 2018 - A Practical-ish Introduction to Data Science
GeeCon Prague 2018 - A Practical-ish Introduction to Data ScienceGeeCon Prague 2018 - A Practical-ish Introduction to Data Science
GeeCon Prague 2018 - A Practical-ish Introduction to Data Science
 
Data Science in Action
Data Science in ActionData Science in Action
Data Science in Action
 
AI-SDV 2021: Angela Bauch - AILANI for clinical competitive landscaping
AI-SDV 2021: Angela Bauch - AILANI for clinical competitive landscapingAI-SDV 2021: Angela Bauch - AILANI for clinical competitive landscaping
AI-SDV 2021: Angela Bauch - AILANI for clinical competitive landscaping
 
A Practical-ish Introduction to Data Science
A Practical-ish Introduction to Data ScienceA Practical-ish Introduction to Data Science
A Practical-ish Introduction to Data Science
 
Machine learning for factor investing
Machine learning for factor investingMachine learning for factor investing
Machine learning for factor investing
 
Intro to Data Science for Non-Data Scientists
Intro to Data Science for Non-Data ScientistsIntro to Data Science for Non-Data Scientists
Intro to Data Science for Non-Data Scientists
 
Data Science
Data ScienceData Science
Data Science
 
AI-SDV 2021 - Tony Trippe - The Current State of Machine Learning for Patent ...
AI-SDV 2021 - Tony Trippe - The Current State of Machine Learning for Patent ...AI-SDV 2021 - Tony Trippe - The Current State of Machine Learning for Patent ...
AI-SDV 2021 - Tony Trippe - The Current State of Machine Learning for Patent ...
 
Self Study Business Approach to DS_01022022.docx
Self Study Business Approach to DS_01022022.docxSelf Study Business Approach to DS_01022022.docx
Self Study Business Approach to DS_01022022.docx
 
Data mining and its applications!
Data mining and its applications!Data mining and its applications!
Data mining and its applications!
 
Data Mining With Excel 2007 And SQL Server 2008
Data Mining With Excel 2007 And SQL Server 2008Data Mining With Excel 2007 And SQL Server 2008
Data Mining With Excel 2007 And SQL Server 2008
 
AI-SDV 2020: Delivering AIM™ Patent Landscapes for Competitive Intelligence –...
AI-SDV 2020: Delivering AIM™ Patent Landscapes for Competitive Intelligence –...AI-SDV 2020: Delivering AIM™ Patent Landscapes for Competitive Intelligence –...
AI-SDV 2020: Delivering AIM™ Patent Landscapes for Competitive Intelligence –...
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
What Is Data Science? Data Science Course - Data Science Tutorial For Beginne...
What Is Data Science? Data Science Course - Data Science Tutorial For Beginne...What Is Data Science? Data Science Course - Data Science Tutorial For Beginne...
What Is Data Science? Data Science Course - Data Science Tutorial For Beginne...
 
IRJET- Comparative Analysis of Various Tools for Data Mining and Big Data...
IRJET-  	  Comparative Analysis of Various Tools for Data Mining and Big Data...IRJET-  	  Comparative Analysis of Various Tools for Data Mining and Big Data...
IRJET- Comparative Analysis of Various Tools for Data Mining and Big Data...
 
Data science syllabus
Data science syllabusData science syllabus
Data science syllabus
 
AI-SDV 2020: IPscreener
AI-SDV 2020: IPscreenerAI-SDV 2020: IPscreener
AI-SDV 2020: IPscreener
 

Similar a II-SDV 2014 Search and Data Mining Open Source Platforms (Patrick Beaucamp - Bpm-Conseil, France)

II-SDV 2017: Custom Open Source Search Engine with Drupal 8 and Solr at Frenc...
II-SDV 2017: Custom Open Source Search Engine with Drupal 8 and Solr at Frenc...II-SDV 2017: Custom Open Source Search Engine with Drupal 8 and Solr at Frenc...
II-SDV 2017: Custom Open Source Search Engine with Drupal 8 and Solr at Frenc...Dr. Haxel Consult
 
(R17A0528) BIG DATA ANALYTICS.pdf
(R17A0528) BIG DATA ANALYTICS.pdf(R17A0528) BIG DATA ANALYTICS.pdf
(R17A0528) BIG DATA ANALYTICS.pdfPoornimaShetty27
 
(R17A0528) BIG DATA ANALYTICS.pdf
(R17A0528) BIG DATA ANALYTICS.pdf(R17A0528) BIG DATA ANALYTICS.pdf
(R17A0528) BIG DATA ANALYTICS.pdfSreenivasa Harish
 
Sem tech 2011 v8
Sem tech 2011 v8Sem tech 2011 v8
Sem tech 2011 v8dallemang
 
Health Plan Survey Paper
Health Plan Survey PaperHealth Plan Survey Paper
Health Plan Survey PaperLisa Olive
 
Secrets of Enterprise Data Mining: SQL Saturday Oregon 201411
Secrets of Enterprise Data Mining: SQL Saturday Oregon 201411Secrets of Enterprise Data Mining: SQL Saturday Oregon 201411
Secrets of Enterprise Data Mining: SQL Saturday Oregon 201411Mark Tabladillo
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data scienceMahir Haque
 
Strata sf - Amundsen presentation
Strata sf - Amundsen presentationStrata sf - Amundsen presentation
Strata sf - Amundsen presentationTao Feng
 
Cis 555 Week 4 Assignment 2 Automated Teller Machine (Atm)...
Cis 555 Week 4 Assignment 2 Automated Teller Machine (Atm)...Cis 555 Week 4 Assignment 2 Automated Teller Machine (Atm)...
Cis 555 Week 4 Assignment 2 Automated Teller Machine (Atm)...Karen Thompson
 
Data Science: Expediting Use of Data by Business Users with Self-service Disc...
Data Science: Expediting Use of Data by Business Users with Self-service Disc...Data Science: Expediting Use of Data by Business Users with Self-service Disc...
Data Science: Expediting Use of Data by Business Users with Self-service Disc...Denodo
 
Empowering your Enterprise with a Self-Service Data Marketplace (EMEA)
Empowering your Enterprise with a Self-Service Data Marketplace (EMEA)Empowering your Enterprise with a Self-Service Data Marketplace (EMEA)
Empowering your Enterprise with a Self-Service Data Marketplace (EMEA)Denodo
 
Business Intelligence and Analytics Unit-2 part-A .pptx
Business Intelligence and Analytics Unit-2 part-A .pptxBusiness Intelligence and Analytics Unit-2 part-A .pptx
Business Intelligence and Analytics Unit-2 part-A .pptxRupaRani28
 
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your DataCloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your DataCloudera, Inc.
 
E-commerce Search Engine with Apache Lucene/Solr
E-commerce Search Engine with Apache Lucene/SolrE-commerce Search Engine with Apache Lucene/Solr
E-commerce Search Engine with Apache Lucene/SolrVincenzo D'Amore
 
How Graph Databases used in Police Department?
How Graph Databases used in Police Department?How Graph Databases used in Police Department?
How Graph Databases used in Police Department?Samet KILICTAS
 
Internet of Things Chicago - Meetup
Internet of Things Chicago - MeetupInternet of Things Chicago - Meetup
Internet of Things Chicago - MeetupJason Lobel
 
Data Infrastructure for a World of Music
Data Infrastructure for a World of MusicData Infrastructure for a World of Music
Data Infrastructure for a World of MusicLars Albertsson
 

Similar a II-SDV 2014 Search and Data Mining Open Source Platforms (Patrick Beaucamp - Bpm-Conseil, France) (20)

II-SDV 2017: Custom Open Source Search Engine with Drupal 8 and Solr at Frenc...
II-SDV 2017: Custom Open Source Search Engine with Drupal 8 and Solr at Frenc...II-SDV 2017: Custom Open Source Search Engine with Drupal 8 and Solr at Frenc...
II-SDV 2017: Custom Open Source Search Engine with Drupal 8 and Solr at Frenc...
 
(R17A0528) BIG DATA ANALYTICS.pdf
(R17A0528) BIG DATA ANALYTICS.pdf(R17A0528) BIG DATA ANALYTICS.pdf
(R17A0528) BIG DATA ANALYTICS.pdf
 
(R17A0528) BIG DATA ANALYTICS.pdf
(R17A0528) BIG DATA ANALYTICS.pdf(R17A0528) BIG DATA ANALYTICS.pdf
(R17A0528) BIG DATA ANALYTICS.pdf
 
Sanmitra Ijeri Resume
Sanmitra Ijeri ResumeSanmitra Ijeri Resume
Sanmitra Ijeri Resume
 
Sem tech 2011 v8
Sem tech 2011 v8Sem tech 2011 v8
Sem tech 2011 v8
 
Health Plan Survey Paper
Health Plan Survey PaperHealth Plan Survey Paper
Health Plan Survey Paper
 
Secrets of Enterprise Data Mining: SQL Saturday Oregon 201411
Secrets of Enterprise Data Mining: SQL Saturday Oregon 201411Secrets of Enterprise Data Mining: SQL Saturday Oregon 201411
Secrets of Enterprise Data Mining: SQL Saturday Oregon 201411
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data science
 
Strata sf - Amundsen presentation
Strata sf - Amundsen presentationStrata sf - Amundsen presentation
Strata sf - Amundsen presentation
 
Cis 555 Week 4 Assignment 2 Automated Teller Machine (Atm)...
Cis 555 Week 4 Assignment 2 Automated Teller Machine (Atm)...Cis 555 Week 4 Assignment 2 Automated Teller Machine (Atm)...
Cis 555 Week 4 Assignment 2 Automated Teller Machine (Atm)...
 
Data Science: Expediting Use of Data by Business Users with Self-service Disc...
Data Science: Expediting Use of Data by Business Users with Self-service Disc...Data Science: Expediting Use of Data by Business Users with Self-service Disc...
Data Science: Expediting Use of Data by Business Users with Self-service Disc...
 
Meetup SF - Amundsen
Meetup SF  -  AmundsenMeetup SF  -  Amundsen
Meetup SF - Amundsen
 
Empowering your Enterprise with a Self-Service Data Marketplace (EMEA)
Empowering your Enterprise with a Self-Service Data Marketplace (EMEA)Empowering your Enterprise with a Self-Service Data Marketplace (EMEA)
Empowering your Enterprise with a Self-Service Data Marketplace (EMEA)
 
Dm and dw
Dm and dwDm and dw
Dm and dw
 
Business Intelligence and Analytics Unit-2 part-A .pptx
Business Intelligence and Analytics Unit-2 part-A .pptxBusiness Intelligence and Analytics Unit-2 part-A .pptx
Business Intelligence and Analytics Unit-2 part-A .pptx
 
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your DataCloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
 
E-commerce Search Engine with Apache Lucene/Solr
E-commerce Search Engine with Apache Lucene/SolrE-commerce Search Engine with Apache Lucene/Solr
E-commerce Search Engine with Apache Lucene/Solr
 
How Graph Databases used in Police Department?
How Graph Databases used in Police Department?How Graph Databases used in Police Department?
How Graph Databases used in Police Department?
 
Internet of Things Chicago - Meetup
Internet of Things Chicago - MeetupInternet of Things Chicago - Meetup
Internet of Things Chicago - Meetup
 
Data Infrastructure for a World of Music
Data Infrastructure for a World of MusicData Infrastructure for a World of Music
Data Infrastructure for a World of Music
 

Más de Dr. Haxel Consult

AI-SDV 2022: Henry Chang Patent Intelligence and Engineering Management
AI-SDV 2022: Henry Chang Patent Intelligence and Engineering ManagementAI-SDV 2022: Henry Chang Patent Intelligence and Engineering Management
AI-SDV 2022: Henry Chang Patent Intelligence and Engineering ManagementDr. Haxel Consult
 
AI-SDV 2022: Creation and updating of large Knowledge Graphs through NLP Anal...
AI-SDV 2022: Creation and updating of large Knowledge Graphs through NLP Anal...AI-SDV 2022: Creation and updating of large Knowledge Graphs through NLP Anal...
AI-SDV 2022: Creation and updating of large Knowledge Graphs through NLP Anal...Dr. Haxel Consult
 
AI-SDV 2022: The race to net zero: Tracking the green industrial revolution t...
AI-SDV 2022: The race to net zero: Tracking the green industrial revolution t...AI-SDV 2022: The race to net zero: Tracking the green industrial revolution t...
AI-SDV 2022: The race to net zero: Tracking the green industrial revolution t...Dr. Haxel Consult
 
AI-SDV 2022: Accommodating the Deep Learning Revolution by a Development Proc...
AI-SDV 2022: Accommodating the Deep Learning Revolution by a Development Proc...AI-SDV 2022: Accommodating the Deep Learning Revolution by a Development Proc...
AI-SDV 2022: Accommodating the Deep Learning Revolution by a Development Proc...Dr. Haxel Consult
 
AI-SDV 2022: Domain Knowledge makes Artificial Intelligence Smart Linda Ander...
AI-SDV 2022: Domain Knowledge makes Artificial Intelligence Smart Linda Ander...AI-SDV 2022: Domain Knowledge makes Artificial Intelligence Smart Linda Ander...
AI-SDV 2022: Domain Knowledge makes Artificial Intelligence Smart Linda Ander...Dr. Haxel Consult
 
AI-SDV 2022: Embedding-based Search Vs. Relevancy Search: comparing the new w...
AI-SDV 2022: Embedding-based Search Vs. Relevancy Search: comparing the new w...AI-SDV 2022: Embedding-based Search Vs. Relevancy Search: comparing the new w...
AI-SDV 2022: Embedding-based Search Vs. Relevancy Search: comparing the new w...Dr. Haxel Consult
 
AI-SDV 2022: Rolling out web crawling at Boehringer Ingelheim - 10 years of e...
AI-SDV 2022: Rolling out web crawling at Boehringer Ingelheim - 10 years of e...AI-SDV 2022: Rolling out web crawling at Boehringer Ingelheim - 10 years of e...
AI-SDV 2022: Rolling out web crawling at Boehringer Ingelheim - 10 years of e...Dr. Haxel Consult
 
AI-SDV 2022: Machine learning based patent categorization: A success story in...
AI-SDV 2022: Machine learning based patent categorization: A success story in...AI-SDV 2022: Machine learning based patent categorization: A success story in...
AI-SDV 2022: Machine learning based patent categorization: A success story in...Dr. Haxel Consult
 
AI-SDV 2022: Machine learning based patent categorization: A success story in...
AI-SDV 2022: Machine learning based patent categorization: A success story in...AI-SDV 2022: Machine learning based patent categorization: A success story in...
AI-SDV 2022: Machine learning based patent categorization: A success story in...Dr. Haxel Consult
 
AI-SDV 2022: Finding the WHAT – Will AI help? Nils Newman (Search Technology,...
AI-SDV 2022: Finding the WHAT – Will AI help? Nils Newman (Search Technology,...AI-SDV 2022: Finding the WHAT – Will AI help? Nils Newman (Search Technology,...
AI-SDV 2022: Finding the WHAT – Will AI help? Nils Newman (Search Technology,...Dr. Haxel Consult
 
AI-SDV 2022: New Insights from Trademarks with Natural Language Processing Al...
AI-SDV 2022: New Insights from Trademarks with Natural Language Processing Al...AI-SDV 2022: New Insights from Trademarks with Natural Language Processing Al...
AI-SDV 2022: New Insights from Trademarks with Natural Language Processing Al...Dr. Haxel Consult
 
AI-SDV 2022: Extracting information from tables in documents Holger Keibel (K...
AI-SDV 2022: Extracting information from tables in documents Holger Keibel (K...AI-SDV 2022: Extracting information from tables in documents Holger Keibel (K...
AI-SDV 2022: Extracting information from tables in documents Holger Keibel (K...Dr. Haxel Consult
 
AI-SDV 2022: Scientific publishing in the age of data mining and artificial i...
AI-SDV 2022: Scientific publishing in the age of data mining and artificial i...AI-SDV 2022: Scientific publishing in the age of data mining and artificial i...
AI-SDV 2022: Scientific publishing in the age of data mining and artificial i...Dr. Haxel Consult
 
AI-SDV 2022: AI developments and usability Linus Wretblad (IPscreener / Uppdr...
AI-SDV 2022: AI developments and usability Linus Wretblad (IPscreener / Uppdr...AI-SDV 2022: AI developments and usability Linus Wretblad (IPscreener / Uppdr...
AI-SDV 2022: AI developments and usability Linus Wretblad (IPscreener / Uppdr...Dr. Haxel Consult
 
AI-SDV 2022: Where’s the one about…? Looney Tunes® Revisited Jay Ven Eman (CE...
AI-SDV 2022: Where’s the one about…? Looney Tunes® Revisited Jay Ven Eman (CE...AI-SDV 2022: Where’s the one about…? Looney Tunes® Revisited Jay Ven Eman (CE...
AI-SDV 2022: Where’s the one about…? Looney Tunes® Revisited Jay Ven Eman (CE...Dr. Haxel Consult
 
AI-SDV 2022: Copyright Clearance Center
AI-SDV 2022: Copyright Clearance CenterAI-SDV 2022: Copyright Clearance Center
AI-SDV 2022: Copyright Clearance CenterDr. Haxel Consult
 
AI-SDV 2022: New Product Introductions: CENTREDOC
AI-SDV 2022: New Product Introductions: CENTREDOCAI-SDV 2022: New Product Introductions: CENTREDOC
AI-SDV 2022: New Product Introductions: CENTREDOCDr. Haxel Consult
 
AI-SDV 2022: Possibilities and limitations of AI-boosted multi-categorization...
AI-SDV 2022: Possibilities and limitations of AI-boosted multi-categorization...AI-SDV 2022: Possibilities and limitations of AI-boosted multi-categorization...
AI-SDV 2022: Possibilities and limitations of AI-boosted multi-categorization...Dr. Haxel Consult
 
AI-SDV 2022: Big data analytics platform at Bayer – Turning bits into insight...
AI-SDV 2022: Big data analytics platform at Bayer – Turning bits into insight...AI-SDV 2022: Big data analytics platform at Bayer – Turning bits into insight...
AI-SDV 2022: Big data analytics platform at Bayer – Turning bits into insight...Dr. Haxel Consult
 

Más de Dr. Haxel Consult (20)

AI-SDV 2022: Henry Chang Patent Intelligence and Engineering Management
AI-SDV 2022: Henry Chang Patent Intelligence and Engineering ManagementAI-SDV 2022: Henry Chang Patent Intelligence and Engineering Management
AI-SDV 2022: Henry Chang Patent Intelligence and Engineering Management
 
AI-SDV 2022: Creation and updating of large Knowledge Graphs through NLP Anal...
AI-SDV 2022: Creation and updating of large Knowledge Graphs through NLP Anal...AI-SDV 2022: Creation and updating of large Knowledge Graphs through NLP Anal...
AI-SDV 2022: Creation and updating of large Knowledge Graphs through NLP Anal...
 
AI-SDV 2022: The race to net zero: Tracking the green industrial revolution t...
AI-SDV 2022: The race to net zero: Tracking the green industrial revolution t...AI-SDV 2022: The race to net zero: Tracking the green industrial revolution t...
AI-SDV 2022: The race to net zero: Tracking the green industrial revolution t...
 
AI-SDV 2022: Accommodating the Deep Learning Revolution by a Development Proc...
AI-SDV 2022: Accommodating the Deep Learning Revolution by a Development Proc...AI-SDV 2022: Accommodating the Deep Learning Revolution by a Development Proc...
AI-SDV 2022: Accommodating the Deep Learning Revolution by a Development Proc...
 
AI-SDV 2022: Domain Knowledge makes Artificial Intelligence Smart Linda Ander...
AI-SDV 2022: Domain Knowledge makes Artificial Intelligence Smart Linda Ander...AI-SDV 2022: Domain Knowledge makes Artificial Intelligence Smart Linda Ander...
AI-SDV 2022: Domain Knowledge makes Artificial Intelligence Smart Linda Ander...
 
AI-SDV 2022: Embedding-based Search Vs. Relevancy Search: comparing the new w...
AI-SDV 2022: Embedding-based Search Vs. Relevancy Search: comparing the new w...AI-SDV 2022: Embedding-based Search Vs. Relevancy Search: comparing the new w...
AI-SDV 2022: Embedding-based Search Vs. Relevancy Search: comparing the new w...
 
AI-SDV 2022: Rolling out web crawling at Boehringer Ingelheim - 10 years of e...
AI-SDV 2022: Rolling out web crawling at Boehringer Ingelheim - 10 years of e...AI-SDV 2022: Rolling out web crawling at Boehringer Ingelheim - 10 years of e...
AI-SDV 2022: Rolling out web crawling at Boehringer Ingelheim - 10 years of e...
 
AI-SDV 2022: Machine learning based patent categorization: A success story in...
AI-SDV 2022: Machine learning based patent categorization: A success story in...AI-SDV 2022: Machine learning based patent categorization: A success story in...
AI-SDV 2022: Machine learning based patent categorization: A success story in...
 
AI-SDV 2022: Machine learning based patent categorization: A success story in...
AI-SDV 2022: Machine learning based patent categorization: A success story in...AI-SDV 2022: Machine learning based patent categorization: A success story in...
AI-SDV 2022: Machine learning based patent categorization: A success story in...
 
AI-SDV 2022: Finding the WHAT – Will AI help? Nils Newman (Search Technology,...
AI-SDV 2022: Finding the WHAT – Will AI help? Nils Newman (Search Technology,...AI-SDV 2022: Finding the WHAT – Will AI help? Nils Newman (Search Technology,...
AI-SDV 2022: Finding the WHAT – Will AI help? Nils Newman (Search Technology,...
 
AI-SDV 2022: New Insights from Trademarks with Natural Language Processing Al...
AI-SDV 2022: New Insights from Trademarks with Natural Language Processing Al...AI-SDV 2022: New Insights from Trademarks with Natural Language Processing Al...
AI-SDV 2022: New Insights from Trademarks with Natural Language Processing Al...
 
AI-SDV 2022: Extracting information from tables in documents Holger Keibel (K...
AI-SDV 2022: Extracting information from tables in documents Holger Keibel (K...AI-SDV 2022: Extracting information from tables in documents Holger Keibel (K...
AI-SDV 2022: Extracting information from tables in documents Holger Keibel (K...
 
AI-SDV 2022: Scientific publishing in the age of data mining and artificial i...
AI-SDV 2022: Scientific publishing in the age of data mining and artificial i...AI-SDV 2022: Scientific publishing in the age of data mining and artificial i...
AI-SDV 2022: Scientific publishing in the age of data mining and artificial i...
 
AI-SDV 2022: AI developments and usability Linus Wretblad (IPscreener / Uppdr...
AI-SDV 2022: AI developments and usability Linus Wretblad (IPscreener / Uppdr...AI-SDV 2022: AI developments and usability Linus Wretblad (IPscreener / Uppdr...
AI-SDV 2022: AI developments and usability Linus Wretblad (IPscreener / Uppdr...
 
AI-SDV 2022: Where’s the one about…? Looney Tunes® Revisited Jay Ven Eman (CE...
AI-SDV 2022: Where’s the one about…? Looney Tunes® Revisited Jay Ven Eman (CE...AI-SDV 2022: Where’s the one about…? Looney Tunes® Revisited Jay Ven Eman (CE...
AI-SDV 2022: Where’s the one about…? Looney Tunes® Revisited Jay Ven Eman (CE...
 
AI-SDV 2022: Copyright Clearance Center
AI-SDV 2022: Copyright Clearance CenterAI-SDV 2022: Copyright Clearance Center
AI-SDV 2022: Copyright Clearance Center
 
AI-SDV 2022: Lighthouse IP
AI-SDV 2022: Lighthouse IPAI-SDV 2022: Lighthouse IP
AI-SDV 2022: Lighthouse IP
 
AI-SDV 2022: New Product Introductions: CENTREDOC
AI-SDV 2022: New Product Introductions: CENTREDOCAI-SDV 2022: New Product Introductions: CENTREDOC
AI-SDV 2022: New Product Introductions: CENTREDOC
 
AI-SDV 2022: Possibilities and limitations of AI-boosted multi-categorization...
AI-SDV 2022: Possibilities and limitations of AI-boosted multi-categorization...AI-SDV 2022: Possibilities and limitations of AI-boosted multi-categorization...
AI-SDV 2022: Possibilities and limitations of AI-boosted multi-categorization...
 
AI-SDV 2022: Big data analytics platform at Bayer – Turning bits into insight...
AI-SDV 2022: Big data analytics platform at Bayer – Turning bits into insight...AI-SDV 2022: Big data analytics platform at Bayer – Turning bits into insight...
AI-SDV 2022: Big data analytics platform at Bayer – Turning bits into insight...
 

Último

MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based projectAnoyGreter
 
Precise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalPrecise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalLionel Briand
 
Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Hr365.us smith
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesŁukasz Chruściel
 
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsSafe Software
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationBradBedford3
 
Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Mater
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...OnePlan Solutions
 
CRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceCRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceBrainSell Technologies
 
Software Coding for software engineering
Software Coding for software engineeringSoftware Coding for software engineering
Software Coding for software engineeringssuserb3a23b
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作qr0udbr0
 
PREDICTING RIVER WATER QUALITY ppt presentation
PREDICTING  RIVER  WATER QUALITY  ppt presentationPREDICTING  RIVER  WATER QUALITY  ppt presentation
PREDICTING RIVER WATER QUALITY ppt presentationvaddepallysandeep122
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfDrew Moseley
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...OnePlan Solutions
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsAhmed Mohamed
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...Technogeeks
 
cpct NetworkING BASICS AND NETWORK TOOL.ppt
cpct NetworkING BASICS AND NETWORK TOOL.pptcpct NetworkING BASICS AND NETWORK TOOL.ppt
cpct NetworkING BASICS AND NETWORK TOOL.pptrcbcrtm
 

Último (20)

MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based project
 
2.pdf Ejercicios de programación competitiva
2.pdf Ejercicios de programación competitiva2.pdf Ejercicios de programación competitiva
2.pdf Ejercicios de programación competitiva
 
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort ServiceHot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
 
Precise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalPrecise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive Goal
 
Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New Features
 
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data Streams
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion Application
 
Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
 
CRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceCRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. Salesforce
 
Software Coding for software engineering
Software Coding for software engineeringSoftware Coding for software engineering
Software Coding for software engineering
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作
 
PREDICTING RIVER WATER QUALITY ppt presentation
PREDICTING  RIVER  WATER QUALITY  ppt presentationPREDICTING  RIVER  WATER QUALITY  ppt presentation
PREDICTING RIVER WATER QUALITY ppt presentation
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdf
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML Diagrams
 
Odoo Development Company in India | Devintelle Consulting Service
Odoo Development Company in India | Devintelle Consulting ServiceOdoo Development Company in India | Devintelle Consulting Service
Odoo Development Company in India | Devintelle Consulting Service
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...
 
cpct NetworkING BASICS AND NETWORK TOOL.ppt
cpct NetworkING BASICS AND NETWORK TOOL.pptcpct NetworkING BASICS AND NETWORK TOOL.ppt
cpct NetworkING BASICS AND NETWORK TOOL.ppt
 

II-SDV 2014 Search and Data Mining Open Source Platforms (Patrick Beaucamp - Bpm-Conseil, France)

  • 1. Patrick Beaucamp Founder of the Vanilla Project Mail : Patrick.beaucamp@bpm-conseil.com Search and Data Mining Open Source Platforms II-SDV, Nice 15th April 2014 1II-SDV, Nice
  • 2. Presentation Agenda Open Source Search Engine & Search Platform Some interesting Platforms Features expected for Search Engines Features expected for Search Platforms (Interface) 2II-SDV, Nice Open Source Data Mining & Machine Learning Platform Datamining subject & Usage type Some interesting Platforms Hadoop & Data Mining : the future of Learning Machin If you don’t find it, it doesn’t exist … so, may be an engine will find it for you ?
  • 3. Search & Data Mining Together ? Search Interface 3II-SDV, Nice Enhanced Search Interface
  • 4. Search & Data Mining Together ? 4II-SDV, Nice Data Mining : search for you ! • Example : correlation provides you with information about how data are linked together
  • 5. 5II-SDV, Nice Confusion Search Engine & Platforms -Search platform per domain : - News (yahoo) - Job (jobserve) - Real Estate - Human Relation - Craig List - Patent industry - … A business case : Facebook - Basic Search interface - Proposal for connection - Advanced graphical search
  • 6. 6II-SDV, Nice Search Engines Search Engine Providers in Open Source - Lucene - Lucene is a java based indexing and search API - Solr/Lucene is the leading server extension of Lucene. 2 companies, LucidWorks and ElasticSearch, provides packaging and extension of top of Lucene and Solr -Search Landscape -• Lucene : http://lucene.apache.org -• Solr/Lucene : http://lucene.apache.org/solr/ -• Plateform OpenSearch : http://www.open-search-server.com -• Plateform Katta : http://katta.sourceforge.net -• Plateform LucidWorks : http://www.lucidworks.com -• Plateform ElasticSearch : http://www.elasticsearch.com -• Sphinx : http://sphinxsearch.com/
  • 7. 7II-SDV, Nice Before Search Engines Before indexing your document base, you need to access it ! -Apache Nutch is a highly extensible and scalable open source web crawler software project. -Reference : http://nutch.apache.org/
  • 8. 8II-SDV, Nice Search Engines : focus Sphinx : inside Database like MariaDb (Mysql) Lucene : Retrieval Software library Use existing Search Infrastructure like Solr/Lucene (Vanilla certified) http://www.lucidworks.com/ or http://www.elasticsearch.org/
  • 9. 9II-SDV, Nice Search Engines Basics (1/2) -Synonyms - It is possible to extend the search to synonyms if they are listed in a glossary. For example, to find articles containing synonyms to “TV” when you search with the word TV. -Metadata - Dictionary for list of searchable keywords
  • 10. 10II-SDV, Nice Search Engines Basics (2/2) -Reserved Words, Protected Words - Indexing usually uses stemming, which is to reduce words to their root, for example "Developp" to find items also contain the word when trying to develop the word development. However, sometimes there are adverse lemmatizations, indexing under one lemma two words that have no relation. It is possible to prevent the stemming of words by listing them in a file protwords.txt. -StopWords - The stopwords are meaningless words. A word considered insignificant will be ignored. Note that some words are insignificant in some contexts, others have homonyms signifiers. For example, can refer to a summer season (rather mean) or past participle of the verb to be (relatively insignificant). Stopwords.txt the file looks like this
  • 11. 11II-SDV, Nice Search Engines current Limits -Multi Language support (this is where commercial search engine have still more to bring to customer), even there is now Asian type language support (Hindi, Thai, Chineese, …) -Elision : - Elisions are a feature of the French, which consist of a contraction of the words like or when they are followed by a vowel. Example: + aircraft gives the aircraft. It is possible to remove these elisions using a lexicon. -Full text search interface (language with search engine) -SubQuery support : now its better with Solr 4.7 -Scalability (this is where Solr is taking technical advantage)
  • 12. 12II-SDV, Nice Hadoop SearchPlatform for BigData -Cloudera with Solr/Cloud (Solr/Lucene) -Mapr with ElasticSearch (Lucene code) -HortonWorks with LucidWorks (Solr/Lucene)
  • 13. 13II-SDV, Nice Search Interface : expectation (1/3) -Advance indexing and querying tools. -Provides distributed searching capabilities to prevent bottleneck for a particular server. -Provides document excerpts (snippets) generation that provides summary of the search -Relevance ranking display extracts from the documents based on the query.
  • 14. 14II-SDV, Nice Search Interface : expectation (2/3) -Duplicate document detection, including fuzzy near duplicates -Rich Document Parsing and Indexing without using Database Indexing. -Ranking control carry out a targeted ranking of individual documents. -Search Grouping by Type / Tag / Categories (General page, documents, images)
  • 15. 15II-SDV, Nice -Multi Criteria support -Ranking -Natural language support -Apps Support (Android, Ipad) Search Interface : expectation (3/3)
  • 16. 16II-SDV, Nice Migration to Open Source -Only few project, but it’s growing SAME FUNCTIONALITY, MUCH LOWER COSTS -http://www.searchtechnologies.com/fast-to-solr-migration.html -http://fr.slideshare.net/janhoy/migrating-fast-to-solr
  • 17. 17II-SDV, Nice Datamining -Work on your data with Dataming component such as R : http://www.revolutionanalytics.com/ Weka : http://www.cs.waikato.ac.nz/ml/weka/ RapidMiner : http://rapidminer.com/ KNIME : http://www.knime.org/ Apache Mahout : http://mahout.apache.org/
  • 18. Many Definitions Non-trivial extraction of implicit, previously unknown and potentially useful information from data Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns Datamining II-SDV, Nice
  • 19. What is (not) Data Mining?  What is Data Mining? – Certain names are more prevalent in certain US locations (O’Brien, O’Rurke, O’Reilly… in Boston area) – Group together similar documents returned by search engine according to their context (e.g. Amazon rainforest, Amazon.com,)  What is not Data Mining? – Look up phone number in phone directory – Query a Web search engine for information about “Amazon” II-SDV, Nice Datamining
  • 20. Origins of Data Mining Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems Traditional Techniques may be unsuitable due to Enormity of data High dimensionality of data Heterogeneous, distributed nature of data Machine Learning/ Pattern Recognition Statistics/ AI Data Mining Database systems II-SDV, Nice Datamining
  • 21. Data Mining Tasks Description Methods Find human-interpretable patterns that describe the data. Prediction Methods Use some variables to predict unknown or future values of other variables. II-SDV, Nice Datamining
  • 22. Data Mining Tasks... Classification [Predictive] Clustering [Descriptive] Association Rule Discovery [Descriptive] Sequential Pattern Discovery [Descriptive] Regression [Predictive] Deviation Detection [Predictive] II-SDV, Nice Datamining
  • 23. Classification: Definition Given a collection of records (training set ) Each record contains a set of attributes, one of the attributes is the class. Find a model for class attribute as a function of the values of other attributes. Goal: previously unseen records should be assigned a class as accurately as possible. A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with training set used to build the model and test set used to validate it. II-SDV, Nice Datamining
  • 24. Classification with Weka – J48 II-SDV, Nice Datamining
  • 25. Classification: Application 1 Direct Marketing Goal: Reduce cost of mailing by targeting a set of consumers likely to buy a new cell-phone product. Approach: Use the data for a similar product introduced before. We know which customers decided to buy and which decided otherwise. This {buy, don’t buy} decision forms the class attribute. Collect various demographic, lifestyle, and company-interaction related information about all such customers. Type of business, where they stay, how much they earn, etc. Use this information as input attributes to learn a classifier model. II-SDV, Nice II-SDV, Nice Datamining
  • 26. Classification: Application 2 Fraud Detection Goal: Predict fraudulent cases in credit card transactions. Approach: Use credit card transactions and the information on its account-holder as attributes. When does a customer buy, what does he buy, how often he pays on time, etc Label past transactions as fraud or fair transactions. This forms the class attribute. Learn a model for the class of the transactions. Use this model to detect fraud by observing credit card transactions on an account. II-SDV, Nice Datamining
  • 27. Classification: Application 3 Customer Attrition/Churn: Goal: To predict whether a customer is likely to be lost to a competitor. Approach: Use detailed record of transactions with each of the past and present customers, to find attributes. How often the customer calls, where he calls, what time-of-the day he calls most, his financial status, marital status, etc. Label the customers as loyal or disloyal. Find a model for loyalty. Datamining II-SDV, Nice
  • 28. Clustering Definition Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that Data points in one cluster are more similar to one another. Data points in separate clusters are less similar to one another. Similarity Measures: Euclidean Distance if attributes are continuous. Other Problem-specific Measures. II-SDV, Nice Datamining
  • 29. Clustering with Weka – K-Means II-SDV, Nice Datamining
  • 30. Illustrating Clustering Euclidean Distance Based Clustering in 3-D space. Intracluster distances are minimized Intercluster distances are maximized II-SDV, Nice Datamining
  • 31. Clustering: Application 1 Market Segmentation: Goal: subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix. Approach: Collect different attributes of customers based on their geographical and lifestyle related information. Find clusters of similar customers. Measure the clustering quality by observing buying patterns of customers in same cluster vs. those from different clusters. II-SDV, Nice Datamining
  • 32. Clustering: Application 2 Document Clustering: Goal: To find groups of documents that are similar to each other based on the important terms appearing in them. Approach: To identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster. Gain: Information Retrieval can utilize the clusters to relate a new document or search term to clustered documents. II-SDV, Nice Datamining
  • 33. Association Rule Discovery: Definition Marketing and Sales Promotion: Let the rule discovered be {Bagels, … } --> {Potato Chips} Potato Chips as consequent => Can be used to determine what should be done to boost its sales. Bagels in the antecedent => Can be used to see which products would be affected if the store discontinues selling bagels. Bagels in antecedent and Potato chips in consequent => Can be used to see what products should be sold with Bagels to promote sale of Potato chips! II-SDV, Nice Datamining
  • 34. Association Rules with Weka – APriori II-SDV, Nice Datamining
  • 35. Association Rule Discovery: Application 1 Supermarket shelf management. Goal: To identify items that are bought together by sufficiently many customers. Approach: Process the point-of-sale data collected with barcode scanners to find dependencies among items. A classic rule -- If a customer buys diaper and milk, then he is very likely to buy beer. So, don’t be surprised if you find six-packs stacked next to diapers! II-SDV, Nice Datamining
  • 36. Association Rule Discovery: Application 2 Inventory Management: Goal: A consumer appliance repair company wants to anticipate the nature of repairs on its consumer products and keep the service vehicles equipped with right parts to reduce on number of visits to consumer households. Approach: Process the data on tools and parts required in previous repairs at different consumer locations and discover the co-occurrence patterns. II-SDV, Nice Datamining
  • 37. Challenges of Data Mining Scalability Dimensionality Complex and Heterogeneous Data Data Quality Data Ownership and Distribution Privacy Preservation Streaming Data II-SDV, Nice Datamining
  • 38. 38II-SDV, Nice Learning Machines - Weka - RapidMiner (formerly Yale : Yet Another Learning Environment) - Hadoop Mahout Background infrastructure & Framework Set of Api and Libraries Provides Algorithms for DataMining Your Data Your Application
  • 39. 39II-SDV, Nice Learning Machines Implementation With Apache Mahout Classification - Naives Bayes - Markov Models - Logistic Regression - Random Forest -Clustering - K-Means and Fuzzy K-Means - Canopy - Spectral Clustering -Association - Latent Derichlet Allocation
  • 40. Working with documents : Mahout – Lucene integration 40II-SDV, Nice Learning Machines Implementation
  • 41. 41II-SDV, Nice References (Search Domain) -Lucene : http://lucene.apache.org -Plateform Solr/Lucene : http://www.lucidworks.com -Plateform OpenSearch : http://www.open-search-server.com -Plateform Katta : http://katta.sourceforge.net -Plateform ElasticSearch : http://www.elasticsearch.com -Plateform http://www.lucidworks.com/ -Sphinx : http://sphinxsearch.com/ -http://nutch.apache.org/ -http://en.wikipedia.org/wiki/List_of_search_engines -http://www.thesearchenginelist.com/ -www.cloudera.com -www.mapr.com -www.hortonworks.com
  • 42. 42II-SDV, Nice References (DataMining Domain) R : http://www.revolutionanalytics.com/ Weka : http://www.cs.waikato.ac.nz/ml/weka/ RapidMiner : http://rapidminer.com/ KNIME : http://www.knime.org/ Apache Mahout : http://mahout.apache.org/