How to build your own google ...
artur.grzadziel@gmail.com
Data Wizards
Dec 2015
Artur Grządziel
A few words about me
email: artur.grzadziel@gmail.com
Currently: BigData and Machine Learning Leader
From Jan 2016: BigData Solution Architect at General Electric
PhD in progress at PAN (Polish Academy of Sciences) Systems Research Institute
Graduated from Warsaw University of Technology and Warsaw School of Economics
BigData & Machine Learning enthusiast focused on leveraging Big Data and Machine Learning
in real business cases
Privately, husband and father
pl.linkedin.com/in/ArturGrzadziel
Introduction
Data Wizards
Artur represents "Data Wizards", an informal group of
Big Data, Machine Learning and Data Science professionals based in
Poland who are interested in sharing knowledge and in addressing
business challenges with modern Big Data and Machine Learning
methods.
Agenda
1. Cloudera search
2. How it works?
MySearch
very high level architecture
[Diagram labels: Data Source, Index]
Cloudera search
Apache Solr and Tika
[Diagram label: 1. Other Sources]
Cloudera Search
Cloudera Search is one of Cloudera's near-real-time access products.
Cloudera Search enables non-technical users to search and explore data stored
in or ingested into Hadoop and HBase. Users do not need SQL or programming
skills to use Cloudera Search because it provides a simple, full-text interface for
searching.
Cloudera Search incorporates Apache Solr, which includes Apache Lucene,
SolrCloud, Apache Tika, and Solr Cell. Cloudera Search is tightly integrated
with Cloudera's Distribution, including Apache Hadoop (CDH). Cloudera Search
provides these key capabilities:
- Near-real-time indexing
- Batch indexing
- Simple, full-text data exploration and navigated drill down
http://www.cloudera.com/content/www/en-us/documentation/archive/search/1-3-0/Cloudera-Search-User-Guide/csug_introducing.html
Cloudera search
Tika
https://tika.apache.org/download.html
Cloudera search
Tika – image
Cloudera search
Tika – PDF file
Cloudera search
Tika – gazeta.pl
Cloudera search
Tika – formats
Supported Document Formats
• HyperText Markup Language
• XML and derived formats
• Microsoft Office document formats
• OpenDocument Format
• Portable Document Format
• Electronic Publication Format
• Rich Text Format
• Compression and packaging formats
• Text formats
• Audio formats
• Image formats
• Video formats
• Java class files and archives
• The mbox format
https://tika.apache.org/1.4/formats.html
Cloudera search
Solr – how to start it …
.\bin\solr start -e cloud -noprompt
http://lucene.apache.org/solr/
Cloudera Search
Administration
Cloudera Search
Data
id | cat | name | price | inStock | author | series_t | sequence_i | genre_s
553573403 | book | A Game of Thrones | 7.99 | TRUE | George R.R. Martin | A Song of Ice and Fire | 1 | fantasy
553579908 | book | A Clash of Kings | 7.99 | TRUE | George R.R. Martin | A Song of Ice and Fire | 2 | fantasy
055357342X | book | A Storm of Swords | 7.99 | TRUE | George R.R. Martin | A Song of Ice and Fire | 3 | fantasy
553293354 | book | Foundation | 7.99 | TRUE | Isaac Asimov | Foundation Novels | 1 | scifi
812521390 | book | The Black Company | 6.99 | FALSE | Glen Cook | The Chronicles of The Black Company | 1 | fantasy
812550706 | book | Ender's Game | 6.99 | TRUE | Orson Scott Card | Ender | 1 | scifi
441385532 | book | Jhereg | 7.95 | FALSE | Steven Brust | Vlad Taltos | 1 | fantasy
380014300 | book | Nine Princes In Amber | 6.99 | TRUE | Roger Zelazny | the Chronicles of Amber | 1 | fantasy
805080481 | book | The Book of Three | 5.99 | TRUE | Lloyd Alexander | The Chronicles of Prydain | 1 | fantasy
080508049X | book | The Black Cauldron | 5.99 | TRUE | Lloyd Alexander | The Chronicles of Prydain | 2 | fantasy
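As a rough illustration of how rows like these could be prepared for indexing, the sketch below parses a small CSV excerpt into typed documents and serializes them as a JSON array. The two sample rows are taken from the table; the function name `rows_to_documents` and the type conversions are assumptions made for this example, and the step of actually sending the payload to a Solr update endpoint is left out. The field suffixes `_t`, `_i` and `_s` follow the dynamic-field naming used in the Solr example schemas (text, integer, string).

```python
import csv
import io
import json

# Two rows from the book table, written as CSV for this example.
sample_csv = """id,cat,name,price,inStock,author,series_t,sequence_i,genre_s
553573403,book,A Game of Thrones,7.99,TRUE,George R.R. Martin,A Song of Ice and Fire,1,fantasy
553293354,book,Foundation,7.99,TRUE,Isaac Asimov,Foundation Novels,1,scifi
"""


def rows_to_documents(csv_text):
    """Parse CSV text into a list of dicts with numeric and boolean fields converted."""
    documents = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        row["price"] = float(row["price"])
        row["inStock"] = row["inStock"].upper() == "TRUE"
        row["sequence_i"] = int(row["sequence_i"])
        documents.append(row)
    return documents


docs = rows_to_documents(sample_csv)
payload = json.dumps(docs, indent=2)  # JSON array of documents
print(payload)
```

The `id` column is kept as a string on purpose, since some identifiers in the table (for example 055357342X) are not numbers.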
Cloudera Search
Output format
Cloudera Search
Simple query
Cloudera Search
Simple query
Cloudera Search
More advanced query
Cloudera Search
Query with facets
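The query slides were screenshots, so the exact requests are not preserved here. The sketch below shows what a simple query, a fielded query and a facet query might look like as URLs for Solr's `select` handler. The host, port and collection name (`gettingstarted`) are assumptions based on a default local quickstart setup, and the search terms are invented for illustration; `q`, `fl`, `rows`, `wt`, `facet` and `facet.field` are standard Solr request parameters.

```python
from urllib.parse import urlencode

# Assumed location of a local Solr collection; adjust to your own setup.
SOLR_SELECT = "http://localhost:8983/solr/gettingstarted/select"


def select_url(params):
    """Build a select URL from a dict of Solr request parameters."""
    return f"{SOLR_SELECT}?{urlencode(params)}"


# Simple query: search the default field for one word.
simple_query = select_url({"q": "foundation", "wt": "json"})

# More advanced query: restrict by field and return only some fields.
field_query = select_url({
    "q": "name:black AND genre_s:fantasy",
    "fl": "id,name,price",
    "wt": "json",
})

# Query with facets: count documents per genre, without returning documents.
facet_query = select_url({
    "q": "*:*",
    "rows": 0,
    "facet": "true",
    "facet.field": "genre_s",
    "wt": "json",
})

for url in (simple_query, field_query, facet_query):
    print(url)
```

Each URL can be opened in a browser or fetched with any HTTP client once a Solr instance with the book data is running.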
Cloudera search
Solr – other features
The MoreLikeThis search component lets users query for documents
similar to a document in their result list. It works by using terms from the
original document to find similar documents in the index.
The SpellCheck component provides inline query suggestions
based on other, similar terms.
Highlighting in Solr allows fragments of documents that match the user's query
to be included with the query response.
Synonyms, stop words
Cloudera search
Solr – other features – geospatial search
Solr has sophisticated geospatial support, including searching within a
specified distance range of a given location (or within a bounding box),
sorting by distance, or even boosting results by the distance.
http://lucene.apache.org/solr/quickstart.html
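A distance-limited search sorted by distance could be expressed with Solr's `geofilt` filter and `geodist()` function, as in the sketch below. The field name `store`, the coordinates and the 5 km radius are assumptions chosen for illustration (they are not from the slides), and the helper `geo_params` is hypothetical; the index is assumed to contain a spatial field with that name.

```python
from urllib.parse import urlencode


def geo_params(field, lat, lon, radius_km):
    """Request parameters for documents within radius_km of a point, nearest first."""
    return {
        "q": "*:*",
        "fq": "{!geofilt}",       # keep only documents inside the circle
        "sfield": field,          # spatial field to filter and sort on
        "pt": f"{lat},{lon}",     # centre point as "lat,lon"
        "d": radius_km,           # radius in kilometres
        "sort": "geodist() asc",  # nearest results first
    }


params = geo_params("store", 45.15, -93.85, 5)
query_string = urlencode(params)
print(query_string)
```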
Cloudera Search
Common Use Cases
Cloudera Search lets your entire business explore and analyze data quickly and
easily for a variety of critical use cases all within a single platform, including:
- Threat detection
- Customer 360-degree visibility
- Improved user experience
- Interactive market segmentation
- Accessible global knowledge base
https://www.cloudera.com/content/www/en-us/products/apache-hadoop/apache-solr.html
Cloudera Search
Other Use Cases
Instagram: Instagram (a Facebook company) is a well-known site that
uses Solr to power its geosearch API.
WhiteHouse.gov: The Obama administration's website is built on Drupal and
Solr.
Netflix: Solr powers basic movie searching on this extremely busy site.
StubHub.com: This ticket reseller uses Solr to help visitors search for concerts
and sporting events.
https://www.safaribooksonline.com/library/view/scaling-apache-solr/9781783981748/ch01s05.html
How it works ... ?
How it works … ?
Data Source – documents …
Document Content
1 John has a cat
2 John has a dog
3 Eva has a cat
4 George has a dog
How it works … ?
Data Source – documents … space of unique terms
Document Content -> word numbers
1 John has a cat -> 1 2 3 4
2 John has a dog -> 1 2 3 5
3 Eva has a cat -> 6 2 3 4
4 George has a dog -> 7 2 3 5
List of unique words:
1. John
2. has
3. a
4. cat
5. dog
6. Eva
7. George
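The numbering of unique words can be reproduced in a few lines. The sketch below is only an illustration of the idea, not how Solr or Lucene store terms internally: it numbers words in order of first appearance and rewrites each document as a list of word numbers. The function names are made up for this example, and words are split on whitespace without any lowercasing or stemming.

```python
documents = {
    1: "John has a cat",
    2: "John has a dog",
    3: "Eva has a cat",
    4: "George has a dog",
}


def build_vocabulary(docs):
    """Number each distinct word from 1, in order of first appearance."""
    vocabulary = {}
    for text in docs.values():
        for word in text.split():
            if word not in vocabulary:
                vocabulary[word] = len(vocabulary) + 1
    return vocabulary


def encode(docs, vocabulary):
    """Replace every word of every document by its number."""
    return {
        doc_id: [vocabulary[word] for word in text.split()]
        for doc_id, text in docs.items()
    }


vocabulary = build_vocabulary(documents)
encoded = encode(documents, vocabulary)

for doc_id, numbers in encoded.items():
    print(doc_id, numbers)
```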
How it works … ?
Data Source – Documents … boolean search with inverted index
Dictionary (Term, Tot. freq.) -> Documents (Doc #)
John, 2 -> 1, 2
has, 4 -> 1, 2, 3, 4
a, 4 -> 1, 2, 3, 4
cat, 2 -> 1, 3
dog, 2 -> 2, 4
Eva, 1 -> 3
George, 1 -> 4
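A minimal sketch of the same idea in code: build a dictionary from each term to the documents that contain it, then answer a boolean AND query by intersecting the document lists. This is a toy version for the four example sentences, assuming case-sensitive whitespace tokens; the function names are invented here and real engines such as Lucene use far more compact structures.

```python
from collections import defaultdict

documents = {
    1: "John has a cat",
    2: "John has a dog",
    3: "Eva has a cat",
    4: "George has a dog",
}


def build_inverted_index(docs):
    """Return {term: sorted list of ids of the documents containing the term}."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def boolean_and(index, *terms):
    """Ids of the documents that contain every one of the given terms."""
    result = None
    for term in terms:
        ids = set(index.get(term, []))
        result = ids if result is None else result & ids
    return sorted(result or [])


index = build_inverted_index(documents)
print(index["cat"])                       # [1, 3]
print(boolean_and(index, "John", "dog"))  # [2]
```

The lookup touches only the lists for the query terms, which is why the inverted index scales better than scanning every document.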
How it works … ?
Data Source – documents as vectors
Documents
document 1 John has a cat
document 2 John has a dog
document 3 Eva has a cat
document 4 George has a dog
Space of unique terms -> John has a cat dog Eva George
vector representing doc1 -> 1 1 1 1 0 0 0
vector representing doc2 -> 1 1 1 0 1 0 0
vector representing doc3 -> 0 1 1 1 0 1 0
vector representing doc4 -> 0 1 1 0 1 0 1
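The vectors above can be generated mechanically, and once documents are vectors they can be compared numerically. The sketch below builds the binary vectors and, as an assumption, compares them with cosine similarity; the slides do not say which similarity measure is meant, and production engines typically weight terms rather than using plain 0/1 values.

```python
import math

documents = {
    1: "John has a cat",
    2: "John has a dog",
    3: "Eva has a cat",
    4: "George has a dog",
}

# Space of unique terms, in the same order as on the slide.
terms = ["John", "has", "a", "cat", "dog", "Eva", "George"]


def to_vector(text, terms):
    """Binary vector: 1 if the term occurs in the text, otherwise 0."""
    words = set(text.split())
    return [1 if term in words else 0 for term in terms]


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


vectors = {doc_id: to_vector(text, terms) for doc_id, text in documents.items()}

print(vectors[1])
print(cosine(vectors[1], vectors[2]))  # documents 1 and 2 share three of four words
print(cosine(vectors[1], vectors[4]))  # documents 1 and 4 share only "has" and "a"
```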
How it works … ?
Data Source – Documents … vectors
Summary
[Diagram label: 1. Other Sources]
Thank you
Data Wizards
E-mail: artur.grzadziel@gmail.com
Links:
• Cloudera Search:
http://www.cloudera.com/content/www/en-us/documentation/archive/search/1-3-0/Cloudera-Search-User-Guide/csug_introducing.html
• Tika
https://tika.apache.org/
• Apache Solr
http://lucene.apache.org/solr/
https://www.cloudera.com/content/www/en-us/products/apache-hadoop/apache-solr.html
• Vectors, Inverted Index, Frequency Matrix, etc. ...
http://courses.ischool.berkeley.edu/i202/f05/LectureNotes/202-20051108.htm

How to build your own google

  • 1. How to build your own google ... artur.grzadziel@gmail.com Data Wizards Dec 2015
  • 2. Artur Grządziel: a few words about me. Email: artur.grzadziel@gmail.com. Currently: BigData and Machine Learning Leader. From Jan 2016: BigData Solution Architect at General Electric. PhD in progress at PAN (Polish Academy of Sciences) Systems Research Institute. Graduated from Warsaw University of Technology and Warsaw School of Economics. BigData & Machine Learning enthusiast focused on leveraging Big Data and Machine Learning in real business cases. Privately, husband and father. pl.linkedin.com/in/ArturGrzadziel
  • 3. Introduction: Data Wizards. Artur represents the "Data Wizards" group, an informal group of BigData/Machine Learning/Data Science professionals located in Poland and interested in knowledge sharing and in addressing business challenges with modern BigData and Machine Learning methods.
  • 5. MySearch very high level architecture Data Source Index
  • 6. Cloudera search Apache Solr and Tika 1. Other Sources
  • 7. Cloudera Search Cloudera Search is one of Cloudera's near-real-time access products. Cloudera Search enables non-technical users to search and explore data stored in or ingested into Hadoop and HBase. Users do not need SQL or programming skills to use Cloudera Search because it provides a simple, full-text interface for searching. Cloudera Search incorporates Apache Solr, which includes Apache Lucene, SolrCloud, Apache Tika, and Solr Cell. Cloudera Search is tightly integrated with Cloudera's Distribution, including Apache Hadoop (CDH). Cloudera Search provides these key capabilities: - Near-real-time indexing - Batch indexing - Simple, full-text data exploration and navigated drill down http://www.cloudera.com/content/www/en-us/documentation/archive/search/1-3-0/Cloudera-Search-User-Guide/csug_introducing.html
  • 12. Cloudera search Tika – formats Supported Document Formats • HyperText Markup Language • XML and derived formats • Microsoft Office document formats • OpenDocument Format • Portable Document Format • Electronic Publication Format • Rich Text Format • Compression and packaging formats • Text formats • Audio formats • Image formats • Video formats • Java class files and archives • The mbox format https://tika.apache.org/1.4/formats.html
  • 13. Cloudera search Solr – how to start it … .\bin\solr start -e cloud -noprompt http://lucene.apache.org/solr/
  • 15. Cloudera Search Data
      id | cat | name | price | inStock | author | series_t | sequence_i | genre_s
      553573403 | book | A Game of Thrones | 7.99 | TRUE | George R.R. Martin | A Song of Ice and Fire | 1 | fantasy
      553579908 | book | A Clash of Kings | 7.99 | TRUE | George R.R. Martin | A Song of Ice and Fire | 2 | fantasy
      055357342X | book | A Storm of Swords | 7.99 | TRUE | George R.R. Martin | A Song of Ice and Fire | 3 | fantasy
      553293354 | book | Foundation | 7.99 | TRUE | Isaac Asimov | Foundation Novels | 1 | scifi
      812521390 | book | The Black Company | 6.99 | FALSE | Glen Cook | The Chronicles of The Black Company | 1 | fantasy
      812550706 | book | Ender's Game | 6.99 | TRUE | Orson Scott Card | Ender | 1 | scifi
      441385532 | book | Jhereg | 7.95 | FALSE | Steven Brust | Vlad Taltos | 1 | fantasy
      380014300 | book | Nine Princes In Amber | 6.99 | TRUE | Roger Zelazny | the Chronicles of Amber | 1 | fantasy
      805080481 | book | The Book of Three | 5.99 | TRUE | Lloyd Alexander | The Chronicles of Prydain | 1 | fantasy
      080508049X | book | The Black Cauldron | 5.99 | TRUE | Lloyd Alexander | The Chronicles of Prydain | 2 | fantasy
  • 21. Cloudera search Solr – other features. The MoreLikeThis search component lets users query for documents similar to a document in their result list; it uses terms from the original document to find similar documents in the index. The SpellCheck component provides inline query suggestions based on other, similar terms. Highlighting allows fragments of documents that match the user's query to be included in the query response. Solr also supports synonyms and stop words.
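As a rough illustration of how these features are switched on per request, the sketch below only builds a query URL; it does not contact a server. The host, port 8983, and the collection name "gettingstarted" are assumptions about a local test install, and the helper `build_query_url` is hypothetical. The parameter names (`hl`, `hl.fl`, `spellcheck`, `mlt`, `mlt.fl`) are standard Solr request parameters, but whether the spellcheck and MoreLikeThis components respond depends on how the request handler is configured.

```python
from urllib.parse import urlencode

# Assumption: a local Solr on the default port with a collection named
# "gettingstarted". Adjust to match your own installation.
SOLR_SELECT_URL = "http://localhost:8983/solr/gettingstarted/select"


def build_query_url(query, highlight_field=None, spellcheck=False,
                    more_like_this_fields=None):
    """Hypothetical helper: build a Solr select URL with optional features."""
    params = {"q": query, "wt": "json"}
    if highlight_field:
        # Ask Solr to return matching fragments from this field.
        params["hl"] = "true"
        params["hl.fl"] = highlight_field
    if spellcheck:
        # Ask for query suggestions (needs a configured spellcheck component).
        params["spellcheck"] = "true"
    if more_like_this_fields:
        # Ask for similar documents based on terms in these fields.
        params["mlt"] = "true"
        params["mlt.fl"] = ",".join(more_like_this_fields)
    return SOLR_SELECT_URL + "?" + urlencode(params)


url = build_query_url(
    "name:thrones",
    highlight_field="name",
    spellcheck=True,
    more_like_this_fields=["author", "series_t"],
)
print(url)
```

The field names used here (`name`, `author`, `series_t`) come from the sample book data on slide 15.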
  • 22. Cloudera search Solr – other features – geospatial search. Solr has sophisticated geospatial support, including searching within a specified distance range of a given location (or within a bounding box), sorting by distance, or even boosting results by distance. http://lucene.apache.org/solr/quickstart.html
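To make "searching within a distance and sorting by distance" concrete, here is a plain-Python sketch of the idea. It is not how Solr implements geospatial search; it simply applies the haversine great-circle formula to a handful of illustrative points. The city coordinates and the helper names are assumptions chosen for the example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def search_within(points, center, max_km):
    """Return (distance_km, name) pairs within max_km, nearest first."""
    hits = [(haversine_km(center, loc), name) for name, loc in points.items()]
    return sorted((d, n) for d, n in hits if d <= max_km)


# Illustrative data only: approximate coordinates of three Polish cities.
points = {
    "Warsaw": (52.2297, 21.0122),
    "Krakow": (50.0647, 19.9450),
    "Gdansk": (54.3520, 18.6466),
}

results = search_within(points, center=points["Warsaw"], max_km=270)
for distance, name in results:
    print(f"{name}: {distance:.0f} km")
```

A real index would avoid computing the distance to every document, for example by first narrowing candidates with a bounding box.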
  • 23. Cloudera Search Common Use Cases Cloudera Search lets your entire business explore and analyze data quickly and easily for a variety of critical use cases all within a single platform, including: - Threat detection - Customer 360-degree visibility - Improved user experience - Interactive market segmentation - Accessible global knowledge base https://www.cloudera.com/content/www/en-us/products/apache-hadoop/apache-solr.html
  • 24. Cloudera Search Other Use Cases. Instagram: Instagram (a Facebook company) uses Solr to power its geosearch API. WhiteHouse.gov: The Obama administration's website is built on Drupal and Solr. Netflix: Solr powers basic movie searching on this extremely busy site. StubHub.com: This ticket reseller uses Solr to help visitors search for concerts and sporting events. https://www.safaribooksonline.com/library/view/scaling-apache-solr/9781783981748/ch01s05.html
  • 25. How it works ... ?
  • 26. How it works … ? Data Source – documents …
      Document | Content
      1 | John has a cat
      2 | John has a dog
      3 | Eva has a cat
      4 | George has a dog
  • 27. How it works … ? Data Source – documents … space of unique terms
      Document | Content | Term ids
      1 | John has a cat | 1 2 3 4
      2 | John has a dog | 1 2 3 5
      3 | Eva has a cat | 6 2 3 4
      4 | George has a dog | 7 2 3 5
      List of unique words: 1. John 2. has 3. a 4. cat 5. dog 6. Eva 7. George
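The mapping from documents to term ids can be sketched in a few lines of Python. This is a minimal illustration, assuming whitespace tokenization and ids handed out in order of first appearance, which is what the slide's numbering suggests; the function names are hypothetical.

```python
# The four example documents from the slide, keyed by document number.
documents = {
    1: "John has a cat",
    2: "John has a dog",
    3: "Eva has a cat",
    4: "George has a dog",
}


def build_vocabulary(docs):
    """Give each unique term an id (starting at 1) in order of first appearance."""
    vocabulary = {}
    for doc_id in sorted(docs):
        for term in docs[doc_id].split():
            if term not in vocabulary:
                vocabulary[term] = len(vocabulary) + 1
    return vocabulary


def encode(docs, vocabulary):
    """Replace every term in every document with its id."""
    return {doc_id: [vocabulary[term] for term in text.split()]
            for doc_id, text in docs.items()}


vocabulary = build_vocabulary(documents)
encoded = encode(documents, vocabulary)

print(vocabulary)
for doc_id in sorted(encoded):
    print(doc_id, encoded[doc_id])
```

A production analyzer would also lowercase terms, strip punctuation, and possibly drop stop words such as "a"; none of that is done here.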
  • 28. How it works … ? Data Source – Documents … boolean search with inverted index
      Dictionary: Term | Tot. freq. | Documents: Doc #
      John | 2 | 1 2
      has | 4 | 1 2 3 4
      a | 4 | 1 2 3 4
      cat | 2 | 1 3
      dog | 2 | 2 4
      Eva | 1 | 3
      George | 1 | 4
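A small Python sketch of the same structure: a dictionary from each term to the sorted list of documents containing it, plus a boolean AND query that intersects those lists. It assumes exact, case-sensitive term matching and whitespace tokenization; the function names are hypothetical. In this tiny corpus each term occurs at most once per document, so the length of a posting list equals the total frequency shown on the slide.

```python
from collections import defaultdict

documents = {
    1: "John has a cat",
    2: "John has a dog",
    3: "Eva has a cat",
    4: "George has a dog",
}


def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        # set() ensures a document is listed once even if a term repeats.
        for term in set(docs[doc_id].split()):
            index[term].append(doc_id)
    return dict(index)


def boolean_and(index, *terms):
    """Return ids of documents that contain every one of the given terms."""
    postings = [set(index.get(term, [])) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []


index = build_inverted_index(documents)

print(index["cat"])                        # documents containing "cat"
print(boolean_and(index, "John", "dog"))   # documents with both terms
```

The lookup touches only the posting lists of the query terms, which is why an inverted index scales better than scanning every document.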
  • 29. How it works … ? Data Source – documents as vectors
      Documents:
      document 1 | John has a cat
      document 2 | John has a dog
      document 3 | Eva has a cat
      document 4 | George has a dog
      Space of unique terms -> John | has | a | cat | dog | Eva | George
      vector representing doc1 -> 1 1 1 1 0 0 0
      vector representing doc2 -> 1 1 1 0 1 0 0
      vector representing doc3 -> 0 1 1 1 0 1 0
      vector representing doc4 -> 0 1 1 0 1 0 1
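The vectors above can be reproduced with a short Python sketch. It assumes the term order given on the slide and binary presence/absence values; the dot product at the end is an added illustration (not from the slides) showing that, for binary vectors, it counts the terms two documents share.

```python
documents = {
    1: "John has a cat",
    2: "John has a dog",
    3: "Eva has a cat",
    4: "George has a dog",
}

# Term order as listed on the slide; each position is one dimension.
terms = ["John", "has", "a", "cat", "dog", "Eva", "George"]


def to_vector(text, terms):
    """Binary vector: 1 if the term occurs in the text, else 0."""
    present = set(text.split())
    return [1 if term in present else 0 for term in terms]


def shared_terms(u, v):
    """Dot product of two binary vectors, i.e. the number of shared terms."""
    return sum(a * b for a, b in zip(u, v))


vectors = {doc_id: to_vector(text, terms) for doc_id, text in documents.items()}

for doc_id in sorted(vectors):
    print(f"doc{doc_id}", vectors[doc_id])
print(shared_terms(vectors[1], vectors[2]))
```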
  • 30. How it works … ? Data Source – Documents … vectors
  • 32. Thank you. Data Wizards. E-mail: artur.grzadziel@gmail.com
      Links:
      • Cloudera Search: http://www.cloudera.com/content/www/en-us/documentation/archive/search/1-3-0/Cloudera-Search-User-Guide/csug_introducing.html
      • Tika: https://tika.apache.org/
      • Apache Solr: http://lucene.apache.org/solr/ and https://www.cloudera.com/content/www/en-us/products/apache-hadoop/apache-solr.html
      • Vectors, Inverted Index, Frequency Matrix, etc.: http://courses.ischool.berkeley.edu/i202/f05/LectureNotes/202-20051108.htm