SEARCH APPROACH - ES, GRAPHDB
Search, Mining
Sunita Shrivastava
sunitas@microsoft.com
Abstract
This is a comparison of search technologies for Code Lens, CEG, and Tech Debt. The initial focus
is on understanding the relationship between inverted index search technologies like ES
and graph DB approaches like Titan.
Table of Contents
1. Making the ES deployment available to other VSO Services
2. Beyond ALM Search
2.1 Scenarios
2.1.1 Test Results Search and Reporting
2.1.2 Code Lens
2.1.3 Code Connect
2.1.4 Semantic Search
2.1.5 Code Map
2.1.6 Tech Debt
2.2 Overview of Indexing Technologies for Search
2.2.1 Elastic Search
2.2.2 Graph Databases
2.2.3 Titan and Elastic Search
3. Appendix
ES as a Search Platform
Extensibility:
Leverages Lucene Features
HA and Scale-out
Aggregation
WHAT ES IS NOT DESIGNED FOR
Search Platform Vs Reporting Platform Vs Analytics Platform
ALM Search Platform
1. Making the ES deployment available to other VSO Services
A well-designed search platform will go a long way towards supporting the growing needs around data
crunching, quick data searching, and finally reporting.
Apart from ALM Search, here are the candidates I have come up with so far that can leverage ES for
indexing and analysis.
1) Test Results
2) Perhaps query of test data related to certain releases in Release Management
3) Internal Telemetry
4) Log Analysis [TBD]
5) Code Lens
6) Potentially Load Testing
The following spec describes, at a high level, the various plausible options for an overall
architecture that encompasses both ALM Search and Test Result Indexing. It was deemed that a shared
Elastic Search cluster is a must at a minimum.
https://microsoft.sharepoint.com/teams/DPT/_layouts/15/WopiFrame.aspx?sourcedoc={B0D71992-E9C8-428F-BA72-C3840C45D656}&file=Indexing%20for%20VSO%20Services.pptx&action=default
Here is a summary of the proposal to enable the same.
With multiple indexing pipelines, in different services, feeding into the same ES cluster, issues around
throttling might arise. More experiments are required to firm up this proposal.
As part of this analysis, we looked at whether the query and ingestion paths into an ES
cluster can be separated. We also looked at how to secure an ES cluster that multiple VSO
services can access.
TBD: Attach the mail/notes with Pradeep and Sean.
TBD: Apart from basic VSO artefact search, what are the other needs for Search in DevDiv? A better
understanding of those will help us flesh out the requirements around the Search Platform for DevDiv.
2. Beyond ALM Search
This is an attempt to better understand the relationship between technologies like ES, which is essentially an
inverted index, and technologies like Titan, which is a graph database.
2.1 Scenarios
We will use the Semantic Search, Code Lens, and Code Connect scenarios to explore, analyze, and understand this
space better.
2.1.1 Test Results Search and Reporting
2.1.2 Code Lens
The Code Lens Service crawls for changelists (commits) and work items. Eventually, for each file, it builds a
document that provides file-level information and function-level information. When a file is open in a VS
client or code browser, the client fetches the file-level information and decorates the file with the relevant
metadata. The indexing technology used by Code Lens is essentially SQL.
Interestingly, the same data could have been stored in an Elastic Search Index or be served by a Graph
Database, akin to what the CEG team has proposed.
The key questions are:
1) What would be the benefit of storing Code Lens data in ES? Candidates: flexible document format,
schema-less storage, and support for change.
2) What are the benefits of storing it in a graph database akin to what the CEG team is proposing to do
for semantic search?
Hopefully, this attempt will yield a better understanding of graph databases.
2.1.3 Code Connect
Another use case that we will explore here is Code Connect. Code Connect is a hackathon project with a
social focus.
2.1.4 Semantic Search
Currently, the Code Entity Graph project is implementing Semantic Search support. Semantic search at a
function level answers queries around references to a function (who are the callers of a given function).
The Code Entity Graph Team’s proposal is to use Titan to build relationship graphs between different entities,
such as code (functions), people, and work items, over a period of time.
Titan is a graph database.
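As a rough illustration of the kind of query described above, here is a minimal in-memory sketch in Python. It is not Titan and not the CEG team's design; the class `EntityGraph`, the edge labels (`calls`, `authored`, `touches`), and all node names are hypothetical and chosen only to show how a "who calls this function" question maps onto labeled edges.

```python
from collections import defaultdict


class EntityGraph:
    """Tiny in-memory graph with labeled, directed edges (illustration only)."""

    def __init__(self):
        # (label, target) -> set of sources, so "who points at X?" is one lookup.
        self._incoming = defaultdict(set)
        # (label, source) -> set of targets.
        self._outgoing = defaultdict(set)

    def add_edge(self, source, label, target):
        self._outgoing[(label, source)].add(target)
        self._incoming[(label, target)].add(source)

    def sources(self, label, target):
        """Nodes that have a `label` edge pointing at `target`."""
        return sorted(self._incoming.get((label, target), set()))

    def targets(self, label, source):
        """Nodes that `source` points at through a `label` edge."""
        return sorted(self._outgoing.get((label, source), set()))


# Hypothetical entities: functions, a person, and a work item.
graph = EntityGraph()
graph.add_edge("func:ParseQuery", "calls", "func:Tokenize")
graph.add_edge("func:BuildIndex", "calls", "func:Tokenize")
graph.add_edge("person:alice", "authored", "func:Tokenize")
graph.add_edge("workitem:101", "touches", "func:Tokenize")

# "Who are the callers of Tokenize?"
print(graph.sources("calls", "func:Tokenize"))
```

The same structure answers people and work item questions by changing the edge label, which is the appeal of keeping all entity types in one graph.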
2.1.5 Code Map
2.1.6 Tech Debt
2.2 Overview of Indexing Technologies for Search
The following are the axes along which we would like a better understanding of technologies like SQL/Elastic
Search and Graph Databases.
1. What it does best
2. What it is not designed for
3. Scale Characteristics
2.2.1 Elastic Search
An Elastic Search index is an inverted index designed for search. It allows building inverted
indices for multiple fields of a document, so that each field can be searched efficiently. This is
where it differs from SQL, where building an index for each column tends to be fairly expensive.
Elastic Search is often described as schema-less: it can infer field mappings dynamically, although explicit
mappings can also be defined.
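The per-field inverted index idea can be sketched in a few lines of plain Python. This is a toy model, not how ES or Lucene store data; the field names (`path`, `author`), the sample documents, and the whitespace tokenizer are assumptions made for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map field -> term -> set of ids of documents containing that term."""
    index = defaultdict(lambda: defaultdict(set))
    for doc_id, doc in docs.items():
        for field, text in doc.items():
            # Naive analysis: lowercase and split on whitespace.
            for term in text.lower().split():
                index[field][term].add(doc_id)
    return index


def search(index, field, term):
    """Return the sorted ids of documents whose `field` contains `term`."""
    return sorted(index.get(field, {}).get(term.lower(), set()))


# Hypothetical documents, one per source file.
docs = {
    1: {"path": "src/search/indexer.cs", "author": "alice"},
    2: {"path": "src/search/query.cs", "author": "bob"},
    3: {"path": "src/graph/titan.cs", "author": "alice"},
}
index = build_inverted_index(docs)
print(search(index, "author", "alice"))
```

Every field gets its own term dictionary, so a lookup on any field is a dictionary access rather than a scan over all documents.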
2.2.2 Graph Databases
Graph databases are typically characterized by index-free adjacency: the set of nodes adjacent to a given
node can be discovered in time that does not depend on the total size of the graph.
http://en.wikipedia.org/wiki/Graph_database has an excellent overview of graph databases, different
providers and their comparison.
Titan is essentially a graph database. Titan works with storage technologies like HBase and Cassandra. Graph
databases usually start with the assumption that the exact kinds of relationships are not fully known upfront.
http://vschart.com/compare/titan-database/vs/elasticsearch
2.2.3 Titan and Elastic Search
Titan recommends using an external index for numeric range, full-text, or geo-spatial indexing. An external
index like ES can also speed up order-by queries. For exact-match retrieval, the standard index suffices.
https://github.com/thinkaurelius/titan/wiki/Indexing-Backend-Overview
https://github.com/thinkaurelius/titan/wiki/Using-Elastic-Search
http://stackoverflow.com/questions/18191737/how-to-use-elasticsearch-index-in-titan-gremlin-query
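To illustrate why exact-match and range or ordered lookups tend to be served by different index structures, here is a small Python sketch. It does not reflect Titan's or ES's actual implementation; the classes `ExactIndex` and `RangeIndex`, the `line_count` property, and the vertex ids are hypothetical.

```python
import bisect


class ExactIndex:
    """Hash-style index: a good fit for equality lookups."""

    def __init__(self):
        self._by_value = {}

    def add(self, value, vertex_id):
        self._by_value.setdefault(value, []).append(vertex_id)

    def lookup(self, value):
        return list(self._by_value.get(value, []))


class RangeIndex:
    """Sorted index: supports range scans and ordered results."""

    def __init__(self):
        self._values = []  # kept sorted
        self._ids = []     # vertex ids, parallel to _values

    def add(self, value, vertex_id):
        pos = bisect.bisect_right(self._values, value)
        self._values.insert(pos, value)
        self._ids.insert(pos, vertex_id)

    def between(self, low, high):
        """Vertex ids with low <= value <= high, ordered by value."""
        start = bisect.bisect_left(self._values, low)
        end = bisect.bisect_right(self._values, high)
        return self._ids[start:end]


# Hypothetical function vertices keyed by name and by line count.
names = ExactIndex()
line_counts = RangeIndex()
for vertex_id, name, line_count in [
    (1, "Tokenize", 120),
    (2, "ParseQuery", 45),
    (3, "BuildIndex", 300),
    (4, "Flush", 45),
]:
    names.add(name, vertex_id)
    line_counts.add(line_count, vertex_id)

print(names.lookup("Tokenize"))
print(line_counts.between(40, 150))
```

A hash lookup cannot answer "between 40 and 150" without visiting every key, while the sorted structure answers it with two binary searches and returns results already ordered.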
3. Appendix
ES as a Search Platform
From the perspective of providing the underpinnings of a search platform, the key features of ES that stand
out are the following: extensibility, use of Lucene features, HA and scale-out, and aggregation.
Extensibility:
These are the mechanisms ES provides to content owners to build the kind of index they want, with knobs
to control the features they want to use.
Some of the extensibility points that we have used are:
1) Analyzer Plugin
2) Query Highlighter
3) Query Filter
TBD: Add details.
Leverages Lucene Features
Lucene, together with the transaction log that ES layers on top of it, provides an indexing mechanism that
attempts to reduce loss of data in the face of failures. It is built on a segment and thread pool model, where an
index actually comprises multiple smaller index parts so that each may be searched in parallel. It is highly
optimized to do I/O in bulk by using buffering techniques. If desired, it also allows two-phase commit of index
documents. This means that you can commit a change to an index in the context of a transaction against a SQL DB
by writing a basic coordinator.
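A basic coordinator of the kind mentioned above could look roughly like the following Python sketch. Lucene's `IndexWriter` exposes `prepareCommit`, `commit`, and `rollback` for this purpose; the sketch below does not call Lucene or a SQL driver. `InMemoryParticipant` is a hypothetical stand-in for both resources, and the staged items are made up.

```python
class InMemoryParticipant:
    """Stand-in for a resource such as an index writer or a SQL transaction."""

    def __init__(self, name, fail_on_prepare=False):
        self.name = name
        self.fail_on_prepare = fail_on_prepare
        self.pending = []
        self.committed = []

    def stage(self, item):
        self.pending.append(item)

    def prepare(self):
        # Phase 1: do the work that can fail, without making it visible.
        if self.fail_on_prepare:
            raise RuntimeError(f"{self.name}: prepare failed")

    def commit(self):
        # Phase 2: make the staged changes visible.
        self.committed.extend(self.pending)
        self.pending.clear()

    def rollback(self):
        self.pending.clear()


def two_phase_commit(participants):
    """Commit every participant, or roll all of them back. True on commit."""
    try:
        for participant in participants:
            participant.prepare()
    except Exception:
        for participant in participants:
            participant.rollback()
        return False
    for participant in participants:
        participant.commit()
    return True


index_writer = InMemoryParticipant("index")
sql_txn = InMemoryParticipant("sql")
index_writer.stage({"doc_id": 1})
sql_txn.stage("row for doc 1")
print(two_phase_commit([index_writer, sql_txn]))
```

A production coordinator would also need a durable record of its decision so that a crash between the two phases can be recovered; that part is omitted here.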
HA and Scale-out
ES essentially provides a highly available index with support for scale-out. HA is achieved through replica
shards. Each shard is a Lucene index. ES monitors for failures and elects a new primary when necessary.
Replica shards also help with read scale-out. Significant among the scale-out features are support for bulk
indexing, the ability to move a shard to a differently sized node, and the ability to use an alias to access
multiple indexes. When the tenant data grows, another index may be added.
So, say your search query for an account is against the index alias I.
Currently, I is mapped to I1. Once I1 reaches its limit, based on the number of shards it was allocated, you can
index new repositories into I2 and change the alias mapping of I to I1, I2.
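The alias change described above can be sketched as follows. The request body follows the `actions`/`add` shape of the Elasticsearch `_aliases` API, but nothing here talks to a cluster: the helper names are hypothetical, `alias_map` is a local stand-in for cluster state, and the names `i`, `i1`, and `i2` are lowercase versions of the names in the text, since ES index names must be lowercase.

```python
import json


def add_index_to_alias(alias, new_index):
    """Build an `_aliases` request body that adds `new_index` under `alias`."""
    return {"actions": [{"add": {"index": new_index, "alias": alias}}]}


def apply_actions(alias_map, body):
    """Apply `add` actions to a local alias map (stand-in for the cluster)."""
    for action in body["actions"]:
        add = action["add"]
        targets = alias_map.setdefault(add["alias"], [])
        if add["index"] not in targets:
            targets.append(add["index"])


def resolve(alias_map, name):
    """Concrete indexes that a search against `name` would fan out to."""
    return alias_map.get(name, [name])


# Before: alias "i" -> ["i1"]. Once i1 is full, new repositories go to i2.
alias_map = {"i": ["i1"]}
body = add_index_to_alias("i", "i2")
apply_actions(alias_map, body)

print(json.dumps(body))
print(resolve(alias_map, "i"))
```

Queries keep targeting the alias and never need to know how many indexes sit behind it, while ingestion of new repositories is pointed at the newest index directly.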
The ability to separate out ingestion and query nodes and the capability to do load balancing across those
also allows a single ES cluster to handle multiple indexes fairly easily.
These are described in detail in the following spec.
https://microsoft.sharepoint.com/teams/DPT/_layouts/15/WopiFrame.aspx?sourcedoc={7F951956-9215-4221-9E08-BCA0B8A7503B}&file=Index%20Provision%20Schemes%20for%20multi%20tenant.docx&action=default
TBD: Above Spec needs cleanup
Aggregation
It is equally important to understand what ES is not designed for. This matters for understanding how other
technologies, such as COSMOS, Hadoop, Gremlin, or even SQL, can work alongside ES in an environment to
provide solutions for those cases. [TBD: more thought on this is required]
WHAT ES IS NOT DESIGNED FOR
Search Platform Vs Reporting Platform Vs Analytics Platform
A search engine like ES, which has facet support, needs fairly powerful aggregation capabilities. This makes it a
natural candidate for data analytics. Many data analytics workloads are concerned with aggregations over
a given axis, be it time, location, or category. Kibana is a data visualization tool built on top of Elastic Search.
TBD: Evaluation of Kibana against analytics frameworks used in DevDiv, if any.
My read is that we are well behind in service telemetry evaluation with respect to VSO services. It would
be interesting to compare the solutions in place today to an ES-based solution built on Logstash and Kibana.
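The "aggregation over a given axis" workload mentioned in this section can be illustrated with a small Python sketch. It mimics, in memory, what a facet or a date-histogram bucket computes; it is not the ES aggregation API, and the record fields (`timestamp`, `outcome`) and sample data are hypothetical.

```python
from collections import Counter
from datetime import datetime


def terms_buckets(records, field):
    """Count records per distinct value of `field` (a facet-style bucket)."""
    return Counter(record[field] for record in records)


def daily_buckets(records, field="timestamp"):
    """Count records per calendar day (a date-histogram-style bucket)."""
    return Counter(
        datetime.fromisoformat(record[field]).date().isoformat()
        for record in records
    )


# Hypothetical test-result records.
records = [
    {"timestamp": "2015-03-01T09:15:00", "outcome": "passed"},
    {"timestamp": "2015-03-01T17:40:00", "outcome": "failed"},
    {"timestamp": "2015-03-02T08:05:00", "outcome": "passed"},
]

print(terms_buckets(records, "outcome"))
print(daily_buckets(records))
```

Grouping by category and grouping by time are the same operation with a different key function, which is why a search engine with facet support is a short step from an analytics engine.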

Más contenido relacionado

La actualidad más candente

Data Partitioning in Mongo DB with Cloud
Data Partitioning in Mongo DB with CloudData Partitioning in Mongo DB with Cloud
Data Partitioning in Mongo DB with CloudIJAAS Team
 
Combining efficiency, fidelity, and flexibility in resource information services
Combining efficiency, fidelity, and flexibility in resource information servicesCombining efficiency, fidelity, and flexibility in resource information services
Combining efficiency, fidelity, and flexibility in resource information servicesCloudTechnologies
 
MONGODB VS MYSQL: A COMPARATIVE STUDY OF PERFORMANCE IN SUPER MARKET MANAGEME...
MONGODB VS MYSQL: A COMPARATIVE STUDY OF PERFORMANCE IN SUPER MARKET MANAGEME...MONGODB VS MYSQL: A COMPARATIVE STUDY OF PERFORMANCE IN SUPER MARKET MANAGEME...
MONGODB VS MYSQL: A COMPARATIVE STUDY OF PERFORMANCE IN SUPER MARKET MANAGEME...ijcsity
 
PAPIs LATAM 2019 - Training and deploying ML models with Kubeflow and TensorF...
PAPIs LATAM 2019 - Training and deploying ML models with Kubeflow and TensorF...PAPIs LATAM 2019 - Training and deploying ML models with Kubeflow and TensorF...
PAPIs LATAM 2019 - Training and deploying ML models with Kubeflow and TensorF...Gabriel Moreira
 
Secrets of Enterprise Data Mining 201310
Secrets of Enterprise Data Mining 201310Secrets of Enterprise Data Mining 201310
Secrets of Enterprise Data Mining 201310Mark Tabladillo
 
IRJET- Data Analysis and Solution Prediction using Elasticsearch in Healt...
IRJET-  	  Data Analysis and Solution Prediction using Elasticsearch in Healt...IRJET-  	  Data Analysis and Solution Prediction using Elasticsearch in Healt...
IRJET- Data Analysis and Solution Prediction using Elasticsearch in Healt...IRJET Journal
 
Combining efficiency, fidelity, and flexibility in resource information services
Combining efficiency, fidelity, and flexibility in resource information servicesCombining efficiency, fidelity, and flexibility in resource information services
Combining efficiency, fidelity, and flexibility in resource information servicesPvrtechnologies Nellore
 
E FFICIENT D ATA R ETRIEVAL F ROM C LOUD S TORAGE U SING D ATA M ININ...
E FFICIENT  D ATA  R ETRIEVAL  F ROM  C LOUD  S TORAGE  U SING  D ATA  M ININ...E FFICIENT  D ATA  R ETRIEVAL  F ROM  C LOUD  S TORAGE  U SING  D ATA  M ININ...
E FFICIENT D ATA R ETRIEVAL F ROM C LOUD S TORAGE U SING D ATA M ININ...IJCI JOURNAL
 
Survey on An effective database tampering system with veriable computation a...
Survey on An effective  database tampering system with veriable computation a...Survey on An effective  database tampering system with veriable computation a...
Survey on An effective database tampering system with veriable computation a...IRJET Journal
 
Review Paper On Multi-Keyword Ranked Search in Encrypted Cloud Storage
Review Paper On Multi-Keyword Ranked Search in Encrypted Cloud StorageReview Paper On Multi-Keyword Ranked Search in Encrypted Cloud Storage
Review Paper On Multi-Keyword Ranked Search in Encrypted Cloud StorageIRJET Journal
 
Performance evaluation of Map-reduce jar pig hive and spark with machine lear...
Performance evaluation of Map-reduce jar pig hive and spark with machine lear...Performance evaluation of Map-reduce jar pig hive and spark with machine lear...
Performance evaluation of Map-reduce jar pig hive and spark with machine lear...IJECEIAES
 
IEEE 2015 - 2016 | Combining Efficiency, Fidelity, and Flexibility in Resource...
IEEE 2015 - 2016 | Combining Efficiency, Fidelity, and Flexibility in Resource...IEEE 2015 - 2016 | Combining Efficiency, Fidelity, and Flexibility in Resource...
IEEE 2015 - 2016 | Combining Efficiency, Fidelity, and Flexibility in Resource...1crore projects
 
SECURE AUDITING AND DEDUPLICATING DATA IN CLOUD
SECURE AUDITING AND DEDUPLICATING DATA IN CLOUDSECURE AUDITING AND DEDUPLICATING DATA IN CLOUD
SECURE AUDITING AND DEDUPLICATING DATA IN CLOUDNexgen Technology
 

La actualidad más candente (18)

Sap business objects interview questions
Sap business objects interview questionsSap business objects interview questions
Sap business objects interview questions
 
Data Partitioning in Mongo DB with Cloud
Data Partitioning in Mongo DB with CloudData Partitioning in Mongo DB with Cloud
Data Partitioning in Mongo DB with Cloud
 
Combining efficiency, fidelity, and flexibility in resource information services
Combining efficiency, fidelity, and flexibility in resource information servicesCombining efficiency, fidelity, and flexibility in resource information services
Combining efficiency, fidelity, and flexibility in resource information services
 
MONGODB VS MYSQL: A COMPARATIVE STUDY OF PERFORMANCE IN SUPER MARKET MANAGEME...
MONGODB VS MYSQL: A COMPARATIVE STUDY OF PERFORMANCE IN SUPER MARKET MANAGEME...MONGODB VS MYSQL: A COMPARATIVE STUDY OF PERFORMANCE IN SUPER MARKET MANAGEME...
MONGODB VS MYSQL: A COMPARATIVE STUDY OF PERFORMANCE IN SUPER MARKET MANAGEME...
 
PAPIs LATAM 2019 - Training and deploying ML models with Kubeflow and TensorF...
PAPIs LATAM 2019 - Training and deploying ML models with Kubeflow and TensorF...PAPIs LATAM 2019 - Training and deploying ML models with Kubeflow and TensorF...
PAPIs LATAM 2019 - Training and deploying ML models with Kubeflow and TensorF...
 
Secrets of Enterprise Data Mining 201310
Secrets of Enterprise Data Mining 201310Secrets of Enterprise Data Mining 201310
Secrets of Enterprise Data Mining 201310
 
IRJET- Data Analysis and Solution Prediction using Elasticsearch in Healt...
IRJET-  	  Data Analysis and Solution Prediction using Elasticsearch in Healt...IRJET-  	  Data Analysis and Solution Prediction using Elasticsearch in Healt...
IRJET- Data Analysis and Solution Prediction using Elasticsearch in Healt...
 
Share point metadata
Share point metadataShare point metadata
Share point metadata
 
Aditya_2015
Aditya_2015Aditya_2015
Aditya_2015
 
Combining efficiency, fidelity, and flexibility in resource information services
Combining efficiency, fidelity, and flexibility in resource information servicesCombining efficiency, fidelity, and flexibility in resource information services
Combining efficiency, fidelity, and flexibility in resource information services
 
E FFICIENT D ATA R ETRIEVAL F ROM C LOUD S TORAGE U SING D ATA M ININ...
E FFICIENT  D ATA  R ETRIEVAL  F ROM  C LOUD  S TORAGE  U SING  D ATA  M ININ...E FFICIENT  D ATA  R ETRIEVAL  F ROM  C LOUD  S TORAGE  U SING  D ATA  M ININ...
E FFICIENT D ATA R ETRIEVAL F ROM C LOUD S TORAGE U SING D ATA M ININ...
 
Survey on An effective database tampering system with veriable computation a...
Survey on An effective  database tampering system with veriable computation a...Survey on An effective  database tampering system with veriable computation a...
Survey on An effective database tampering system with veriable computation a...
 
Review Paper On Multi-Keyword Ranked Search in Encrypted Cloud Storage
Review Paper On Multi-Keyword Ranked Search in Encrypted Cloud StorageReview Paper On Multi-Keyword Ranked Search in Encrypted Cloud Storage
Review Paper On Multi-Keyword Ranked Search in Encrypted Cloud Storage
 
Performance evaluation of Map-reduce jar pig hive and spark with machine lear...
Performance evaluation of Map-reduce jar pig hive and spark with machine lear...Performance evaluation of Map-reduce jar pig hive and spark with machine lear...
Performance evaluation of Map-reduce jar pig hive and spark with machine lear...
 
IEEE 2015 - 2016 | Combining Efficiency, Fidelity, and Flexibility in Resource...
IEEE 2015 - 2016 | Combining Efficiency, Fidelity, and Flexibility in Resource...IEEE 2015 - 2016 | Combining Efficiency, Fidelity, and Flexibility in Resource...
IEEE 2015 - 2016 | Combining Efficiency, Fidelity, and Flexibility in Resource...
 
SECURE AUDITING AND DEDUPLICATING DATA IN CLOUD
SECURE AUDITING AND DEDUPLICATING DATA IN CLOUDSECURE AUDITING AND DEDUPLICATING DATA IN CLOUD
SECURE AUDITING AND DEDUPLICATING DATA IN CLOUD
 
Anatomy of google
Anatomy of googleAnatomy of google
Anatomy of google
 
Etl
EtlEtl
Etl
 

Destacado

090728 小姪女
090728 小姪女090728 小姪女
090728 小姪女chicoff
 
T R A B A J O E L E N A[1]
T R A B A J O  E L E N A[1]T R A B A J O  E L E N A[1]
T R A B A J O E L E N A[1]guestdc6d4d
 
Tammy LaRue Resume
Tammy LaRue ResumeTammy LaRue Resume
Tammy LaRue ResumeTammy LaRue
 
Is social media free advertising?
Is social media free advertising?Is social media free advertising?
Is social media free advertising?Brainstorm Digital
 
Cutelaria (Pratica Simulada)
Cutelaria (Pratica Simulada)Cutelaria (Pratica Simulada)
Cutelaria (Pratica Simulada)Alexandra Marques
 
Natria - природный уход за кожей и волосами
Natria - природный уход за кожей и волосамиNatria - природный уход за кожей и волосами
Natria - природный уход за кожей и волосамиNSP Ukraine
 
How Radio Can Drive Digital Revenue & Generate Qualified Leads with Online Pr...
How Radio Can Drive Digital Revenue & Generate Qualified Leads with Online Pr...How Radio Can Drive Digital Revenue & Generate Qualified Leads with Online Pr...
How Radio Can Drive Digital Revenue & Generate Qualified Leads with Online Pr...Upland Second Street
 
Sistem informasi dan Teknologi Informasi dalam Kegiatan Organisasi
Sistem informasi dan Teknologi Informasi dalam Kegiatan OrganisasiSistem informasi dan Teknologi Informasi dalam Kegiatan Organisasi
Sistem informasi dan Teknologi Informasi dalam Kegiatan OrganisasiLaila Tusyek
 
Emotional intelligence
Emotional intelligenceEmotional intelligence
Emotional intelligencevicky dawang
 
PANKAJ CV MUD ENGINEER
PANKAJ CV MUD ENGINEERPANKAJ CV MUD ENGINEER
PANKAJ CV MUD ENGINEERpankaj gupta
 
Yelp - creating a better review
Yelp - creating a better reviewYelp - creating a better review
Yelp - creating a better reviewBrandon Hill
 

Destacado (16)

090728 小姪女
090728 小姪女090728 小姪女
090728 小姪女
 
Harrier Success Stories
Harrier Success StoriesHarrier Success Stories
Harrier Success Stories
 
Dev Analytics Overview
Dev Analytics OverviewDev Analytics Overview
Dev Analytics Overview
 
T R A B A J O E L E N A[1]
T R A B A J O  E L E N A[1]T R A B A J O  E L E N A[1]
T R A B A J O E L E N A[1]
 
Harrier_Success-Stories
Harrier_Success-StoriesHarrier_Success-Stories
Harrier_Success-Stories
 
Tammy LaRue Resume
Tammy LaRue ResumeTammy LaRue Resume
Tammy LaRue Resume
 
Is social media free advertising?
Is social media free advertising?Is social media free advertising?
Is social media free advertising?
 
Unidad didáctica tpack. carlos calderón
Unidad didáctica tpack. carlos calderónUnidad didáctica tpack. carlos calderón
Unidad didáctica tpack. carlos calderón
 
Cutelaria (Pratica Simulada)
Cutelaria (Pratica Simulada)Cutelaria (Pratica Simulada)
Cutelaria (Pratica Simulada)
 
Natria - природный уход за кожей и волосами
Natria - природный уход за кожей и волосамиNatria - природный уход за кожей и волосами
Natria - природный уход за кожей и волосами
 
Cerâmica e porcelana
Cerâmica e porcelanaCerâmica e porcelana
Cerâmica e porcelana
 
How Radio Can Drive Digital Revenue & Generate Qualified Leads with Online Pr...
How Radio Can Drive Digital Revenue & Generate Qualified Leads with Online Pr...How Radio Can Drive Digital Revenue & Generate Qualified Leads with Online Pr...
How Radio Can Drive Digital Revenue & Generate Qualified Leads with Online Pr...
 
Sistem informasi dan Teknologi Informasi dalam Kegiatan Organisasi
Sistem informasi dan Teknologi Informasi dalam Kegiatan OrganisasiSistem informasi dan Teknologi Informasi dalam Kegiatan Organisasi
Sistem informasi dan Teknologi Informasi dalam Kegiatan Organisasi
 
Emotional intelligence
Emotional intelligenceEmotional intelligence
Emotional intelligence
 
PANKAJ CV MUD ENGINEER
PANKAJ CV MUD ENGINEERPANKAJ CV MUD ENGINEER
PANKAJ CV MUD ENGINEER
 
Yelp - creating a better review
Yelp - creating a better reviewYelp - creating a better review
Yelp - creating a better review
 

Similar a Search Approach - ES, GraphDB

ALM Search Presentation for the VSS Arch Council
ALM Search Presentation for the VSS Arch CouncilALM Search Presentation for the VSS Arch Council
ALM Search Presentation for the VSS Arch CouncilSunita Shrivastava
 
B036407011
B036407011B036407011
B036407011theijes
 
JethroData technical white paper
JethroData technical white paperJethroData technical white paper
JethroData technical white paperJethroData
 
ENHANCING KEYWORD SEARCH OVER RELATIONAL DATABASES USING ONTOLOGIES
ENHANCING KEYWORD SEARCH OVER RELATIONAL DATABASES USING ONTOLOGIES ENHANCING KEYWORD SEARCH OVER RELATIONAL DATABASES USING ONTOLOGIES
ENHANCING KEYWORD SEARCH OVER RELATIONAL DATABASES USING ONTOLOGIES cscpconf
 
ENHANCING KEYWORD SEARCH OVER RELATIONAL DATABASES USING ONTOLOGIES
ENHANCING KEYWORD SEARCH OVER RELATIONAL DATABASES USING ONTOLOGIESENHANCING KEYWORD SEARCH OVER RELATIONAL DATABASES USING ONTOLOGIES
ENHANCING KEYWORD SEARCH OVER RELATIONAL DATABASES USING ONTOLOGIEScsandit
 
Enhancing keyword search over relational databases using ontologies
Enhancing keyword search over relational databases using ontologiesEnhancing keyword search over relational databases using ontologies
Enhancing keyword search over relational databases using ontologiescsandit
 
Elasticsearch, a distributed search engine with real-time analytics
Elasticsearch, a distributed search engine with real-time analyticsElasticsearch, a distributed search engine with real-time analytics
Elasticsearch, a distributed search engine with real-time analyticsTiziano Fagni
 
Secrets of Enterprise Data Mining: SQL Saturday Oregon 201411
Secrets of Enterprise Data Mining: SQL Saturday Oregon 201411Secrets of Enterprise Data Mining: SQL Saturday Oregon 201411
Secrets of Enterprise Data Mining: SQL Saturday Oregon 201411Mark Tabladillo
 
2014 IEEE DOTNET DATA MINING PROJECT A novel model for mining association rul...
2014 IEEE DOTNET DATA MINING PROJECT A novel model for mining association rul...2014 IEEE DOTNET DATA MINING PROJECT A novel model for mining association rul...
2014 IEEE DOTNET DATA MINING PROJECT A novel model for mining association rul...IEEEMEMTECHSTUDENTSPROJECTS
 
IEEE 2014 DOTNET DATA MINING PROJECTS A novel model for mining association ru...
IEEE 2014 DOTNET DATA MINING PROJECTS A novel model for mining association ru...IEEE 2014 DOTNET DATA MINING PROJECTS A novel model for mining association ru...
IEEE 2014 DOTNET DATA MINING PROJECTS A novel model for mining association ru...IEEEMEMTECHSTUDENTPROJECTS
 
EVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMING
EVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMINGEVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMING
EVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMINGijiert bestjournal
 
Sql interview question part 5
Sql interview question part 5Sql interview question part 5
Sql interview question part 5kaashiv1
 
Evaluation criteria for nosql databases
Evaluation criteria for nosql databasesEvaluation criteria for nosql databases
Evaluation criteria for nosql databasesEbenezer Daniel
 
Column store databases approaches and optimization techniques
Column store databases  approaches and optimization techniquesColumn store databases  approaches and optimization techniques
Column store databases approaches and optimization techniquesIJDKP
 
Searching and Analyzing Qualitative Data on Personal Computer
Searching and Analyzing Qualitative Data on Personal ComputerSearching and Analyzing Qualitative Data on Personal Computer
Searching and Analyzing Qualitative Data on Personal ComputerIOSR Journals
 

Similar a Search Approach - ES, GraphDB (20)

ALM Search Presentation for the VSS Arch Council
ALM Search Presentation for the VSS Arch CouncilALM Search Presentation for the VSS Arch Council
ALM Search Presentation for the VSS Arch Council
 
B036407011
B036407011B036407011
B036407011
 
Elasticsearch
ElasticsearchElasticsearch
Elasticsearch
 
JethroData technical white paper
JethroData technical white paperJethroData technical white paper
JethroData technical white paper
 
Couch db
Couch dbCouch db
Couch db
 
ENHANCING KEYWORD SEARCH OVER RELATIONAL DATABASES USING ONTOLOGIES
ENHANCING KEYWORD SEARCH OVER RELATIONAL DATABASES USING ONTOLOGIES ENHANCING KEYWORD SEARCH OVER RELATIONAL DATABASES USING ONTOLOGIES
ENHANCING KEYWORD SEARCH OVER RELATIONAL DATABASES USING ONTOLOGIES
 
ENHANCING KEYWORD SEARCH OVER RELATIONAL DATABASES USING ONTOLOGIES
ENHANCING KEYWORD SEARCH OVER RELATIONAL DATABASES USING ONTOLOGIESENHANCING KEYWORD SEARCH OVER RELATIONAL DATABASES USING ONTOLOGIES
ENHANCING KEYWORD SEARCH OVER RELATIONAL DATABASES USING ONTOLOGIES
 
Enhancing keyword search over relational databases using ontologies
Enhancing keyword search over relational databases using ontologiesEnhancing keyword search over relational databases using ontologies
Enhancing keyword search over relational databases using ontologies
 
Elasticsearch, a distributed search engine with real-time analytics
Elasticsearch, a distributed search engine with real-time analyticsElasticsearch, a distributed search engine with real-time analytics
Elasticsearch, a distributed search engine with real-time analytics
 
Secrets of Enterprise Data Mining: SQL Saturday Oregon 201411
Secrets of Enterprise Data Mining: SQL Saturday Oregon 201411Secrets of Enterprise Data Mining: SQL Saturday Oregon 201411
Secrets of Enterprise Data Mining: SQL Saturday Oregon 201411
 
2014 IEEE DOTNET DATA MINING PROJECT A novel model for mining association rul...
2014 IEEE DOTNET DATA MINING PROJECT A novel model for mining association rul...2014 IEEE DOTNET DATA MINING PROJECT A novel model for mining association rul...
2014 IEEE DOTNET DATA MINING PROJECT A novel model for mining association rul...
 
IEEE 2014 DOTNET DATA MINING PROJECTS A novel model for mining association ru...
IEEE 2014 DOTNET DATA MINING PROJECTS A novel model for mining association ru...IEEE 2014 DOTNET DATA MINING PROJECTS A novel model for mining association ru...
IEEE 2014 DOTNET DATA MINING PROJECTS A novel model for mining association ru...
 
EVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMING
EVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMINGEVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMING
EVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMING
 
Ebook5
Ebook5Ebook5
Ebook5
 
Sql interview question part 5
Sql interview question part 5Sql interview question part 5
Sql interview question part 5
 
Cassandra data modelling best practices
Cassandra data modelling best practicesCassandra data modelling best practices
Cassandra data modelling best practices
 
CloWSer
CloWSerCloWSer
CloWSer
 
Evaluation criteria for nosql databases
Evaluation criteria for nosql databasesEvaluation criteria for nosql databases
Evaluation criteria for nosql databases
 
Column store databases approaches and optimization techniques
Column store databases  approaches and optimization techniquesColumn store databases  approaches and optimization techniques
Column store databases approaches and optimization techniques
 
Searching and Analyzing Qualitative Data on Personal Computer
Searching and Analyzing Qualitative Data on Personal ComputerSearching and Analyzing Qualitative Data on Personal Computer
Searching and Analyzing Qualitative Data on Personal Computer
 

Search Approach - ES, GraphDB

  • 1. SEARCH APPROACH-ES, GRAPHDB Search, Mining Sunita Shrivastava sunitas@microsoft.com Abstract This is a comparison of search technologies for Code Lens, CEG, Tech Debt. The initial focus is on understanding the relationships between inverted index search technologies like ES with Graph DB approaches like Titan.
  • 2. Table of Contents
    1. Making the ES deployment available to other VSO Services
    2. Beyond ALM Search
    2.1 Scenarios
    2.1.1 Test Results Search and Reporting
    2.1.2 Code Lens
    2.1.3 Code Connect
    2.1.4 Semantic Search
    2.1.5 Code Map
    2.1.6 Tech Debt
    2.2 Overview of Indexing Technologies for Search
    2.2.1 Elastic Search
    2.2.2 Graph Databases
    2.2.3 Titan and Elastic Search
    3. Appendix
    ES as a Search Platform
    Extensibility
    Leverages Lucene Features
    HA and Scale-out
    Aggregation
    What ES Is Not Designed For
    Search Platform Vs Reporting Platform Vs Analytics Platform
    ALM Search Platform
    1. Making the ES deployment available to other VSO Services
    A well-designed search platform will go a long way towards supporting the growing needs around data crunching, quick data searching and, finally, reporting.
  • 3. Apart from ALM Search, here are the candidates I have come up with so far that can leverage ES for indexing and analysis:
    1) Test Results
    2) Perhaps query of test data related to certain releases in Release Management
    3) Internal Telemetry
    4) Log Analysis [TBD]
    5) Code Lens
    6) Potentially Load Testing
    The following spec describes, at a high level, the options that would be plausible for an overall architecture that encompasses both ALM Search and Test Result Indexing. It was deemed that a shared Elastic Search cluster is a must at a minimum.
    https://microsoft.sharepoint.com/teams/DPT/_layouts/15/WopiFrame.aspx?sourcedoc={B0D71992-E9C8-428F-BA72-C3840C45D656}&file=Indexing%20for%20VSO%20Services.pptx&action=default
    Here is a link to a summary of the proposal to enable the same. With multiple indexing pipelines in different services feeding into the same ES cluster, issues around throttling might arise. More experiments are required to firm up this proposal. In the context of this analysis, we worked on whether we can separate the query and ingestion paths into an ES cluster. We also worked on how to secure an ES cluster so that multiple VSO services can access it safely.
    TBD: Attach the mail/notes with Pradeep and Sean.
    TBD: Apart from basic VSO artefact search, what are the other needs for Search in DevDiv? A better understanding of those will help us flesh out the requirements around the Search Platform for DevDiv.
    2. Beyond ALM Search
    This is an attempt to better understand the relationship between technologies like ES, which is essentially an inverted index, and, say, Titan, which is a graph database.
    2.1 Scenarios
    We will use the Semantic Search, Code Lens and Code Connect scenarios to explore, analyze and understand this space better.
    2.1.1 Test Results Search and Reporting
  • 4. 2.1.2 Code Lens
    The Code Lens Service crawls for changelists (commits) and work items. Eventually, for each file, it builds a document that provides file-level and function-level information. When a file is opened in a VS client or code browser, the client fetches the file-level information and decorates the file with the relevant metadata. The indexing technology used by Code Lens is essentially SQL. Interestingly, the same data could have been stored in an Elastic Search index or served by a graph database, akin to what the CEG team has proposed. The key questions are:
    1) What would be the benefit of storing Code Lens data in ES? Possible answers: flexible format, schema-less storage, support for change.
    2) What are the benefits of storing it in a graph database, akin to what the CEG team is proposing to do for semantic search?
    Hopefully, this attempt will yield a better understanding of graph databases.
    2.1.3 Code Connect
    Another use case that we will explore here is Code Connect. Code Connect is a hackathon project that is social in nature.
    2.1.4 Semantic Search
    Currently, the Code Entity Graph project is implementing Semantic Search support. Semantic search at a function level answers queries around references to a function (who are the callers of a given function). The Code Entity Graph team's proposal is to use Titan, a graph database, to build relationship graphs between different entities like code (functions), people, work items etc. over a period of time.
    2.1.5 Code Map
    2.1.6 Tech Debt
    2.2 Overview of Indexing Technologies for Search
    The following are the axes along which we would like a better understanding of technologies like SQL, Elastic Search and graph databases:
    1. What it does best
    2. What it is not designed for
    3. Scale characteristics
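The semantic search query from 2.1.4 (who are the callers of a given function) can be pictured as a lookup of incoming edges in a relationship graph. The following is only a minimal in-memory sketch, not Titan or the CEG team's design: the `CodeGraph` class, the edge labels `calls` and `modified`, and the function and person names are all hypothetical.

```python
from collections import defaultdict


class CodeGraph:
    """Toy directed graph with labeled edges between entity ids (hypothetical)."""

    def __init__(self):
        # node -> edge label -> set of neighbor nodes, kept in both directions
        self._out = defaultdict(lambda: defaultdict(set))
        self._in = defaultdict(lambda: defaultdict(set))

    def add_edge(self, src, label, dst):
        self._out[src][label].add(dst)
        self._in[dst][label].add(src)

    def incoming(self, node, label):
        # Hash lookup of the neighbor set; sorting is only for stable output.
        return sorted(self._in[node][label])

    def outgoing(self, node, label):
        return sorted(self._out[node][label])


graph = CodeGraph()
graph.add_edge("Parser.parse", "calls", "Lexer.next_token")
graph.add_edge("Repl.run", "calls", "Lexer.next_token")
graph.add_edge("Repl.run", "calls", "Parser.parse")
graph.add_edge("alice", "modified", "Lexer.next_token")  # a person entity

callers = graph.incoming("Lexer.next_token", "calls")
print(callers)  # ['Parser.parse', 'Repl.run']
```

Because code, people and work items are all just nodes, a new relationship type is only a new edge label, which is one reason a graph model suits relationships that are not fully known upfront.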
  • 5. 2.2.1 Elastic Search
    An Elastic Search index is an inverted index designed for search. It allows building inverted indices for multiple fields of a document, so that each field can be searched efficiently. This is where it differs from SQL, where building an index on every column tends to be fairly expensive. Elastic Search is schema-less.
    2.2.2 Graph Databases
    By definition, a graph database has constant search time for discovering the adjacent set of nodes for a given node in a graph. http://en.wikipedia.org/wiki/Graph_database has an excellent overview of graph databases, the different providers and their comparison. Titan is essentially a graph database. Titan works with storage technologies like HBase and Cassandra. Graph databases usually start with the assumption that the exact kinds of relationships are not fully known upfront.
    http://vschart.com/compare/titan-database/vs/elasticsearch
    2.2.3 Titan and Elastic Search
    Titan recommends using an external index for numeric range, full-text or geo-spatial indexing. An external index like ES can also speed up Order By queries. For exact-match retrieval, the standard index suffices.
    https://github.com/thinkaurelius/titan/wiki/Indexing-Backend-Overview
    https://github.com/thinkaurelius/titan/wiki/Using-Elastic-Search
    http://stackoverflow.com/questions/18191737/how-to-use-elasticsearch-index-in-titan-gremlin-query
    3. Appendix
    ES as a Search Platform
    From the perspective of providing the underpinnings of a search platform, the key features of ES that stand out are the following.
    Extensibility:
    These are the mechanisms ES provides for content owners to build the kind of index they want, with knobs to control the features they want to use. Some of the extension points that we have used are:
    1) Analyzer Plugin
    2) Query Highlighter
    3) Query Filter
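To make the description in 2.2.1 concrete, here is a small sketch of an inverted index built over every field of a document. It is a toy model under stated assumptions, not how ES or Lucene is implemented: the whitespace tokenizer, the function names and the sample documents are hypothetical. Note that the two documents do not share the same set of fields, which mirrors the schema-less point.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map (field, term) -> set of document ids, for every field of every doc."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for field, value in doc.items():
            # Naive analyzer: lowercase and split on whitespace.
            for term in str(value).lower().split():
                index[(field, term)].add(doc_id)
    return index


def search(index, field, term):
    """Return the ids of documents whose `field` contains `term`."""
    return sorted(index.get((field, term.lower()), set()))


# Hypothetical file-level documents; fields differ from one doc to the next.
docs = {
    "f1": {"path": "src/lexer.cs", "author": "alice",
           "summary": "tokenizer for the parser"},
    "f2": {"path": "src/parser.cs", "author": "bob",
           "summary": "recursive descent parser", "work_item": "1234"},
}

index = build_inverted_index(docs)
print(search(index, "summary", "parser"))  # ['f1', 'f2']
print(search(index, "author", "bob"))      # ['f2']
```

Every field becomes searchable from a single pass over the document, which is the contrast with maintaining a separate SQL index per column.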
  • 6. TBD: Add details.
    Leverages Lucene Features
    ES builds on Lucene's segment model, in which an index is actually composed of multiple smaller index parts (segments) so that each may be searched in parallel, and layers a transaction log on top of it to reduce loss of data in the face of failures. Indexing is highly optimized to do I/O in bulk by using buffering techniques. If desired, Lucene also allows two-phase commit of index documents. This means that you can commit a change to an index in the context of a transaction to a SQL DB by writing a basic coordinator.
    HA and Scale-out
    ES essentially provides a highly available index with support for scale-out. HA is provided through replica shards. Each shard is a Lucene index. ES monitors for failures and elects a new primary when necessary. Replica shards also help with read scale-out. Significant amongst the scale-out features are support for bulk indexing, the ability to move a shard to a differently sized node, and the ability to use an alias to access multiple indexes. When the tenant data grows, another index may be added. So, say your search query for an account is against the index alias I. Currently I is mapped to I1. Once I1 reaches its limit based on the number of shards it was allocated, you can index new repositories into I2 and change the alias mapping of I to I1, I2. The ability to separate out ingestion and query nodes, and the capability to do load balancing across those, also allows a single ES cluster to handle multiple indexes fairly easily. These are described in detail in the following spec.
    https://microsoft.sharepoint.com/teams/DPT/_layouts/15/WopiFrame.aspx?sourcedoc={7F951956-9215-4221-9E08-BCA0B8A7503B}&file=Index%20Provision%20Schemes%20for%20multi%20tenant.docx&action=default
    TBD: Above spec needs cleanup.
    Aggregation
    It is equally important to understand what ES is not designed for. This helps in understanding how other technologies, such as COSMOS, HADOOP, GREMLIN or even SQL, can play alongside ES in an environment to provide solutions for those cases. [TBD: more thought on this is required]
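The alias scheme described under HA and Scale-out (alias I mapped first to I1, then to I1 and I2) can be sketched as follows. The request body has the shape accepted by the ES `_aliases` endpoint (`actions` with `add`/`remove` entries); everything else is an assumption made for illustration: `AliasTable` is a toy stand-in for cluster metadata rather than an ES client, and the names `i`, `i1`, `i2` are lowercase because ES requires lowercase index names.

```python
def alias_add_body(alias, index):
    """Request body for POST /_aliases that puts `index` behind `alias`."""
    return {"actions": [{"add": {"index": index, "alias": alias}}]}


class AliasTable:
    """Toy model of alias -> indexes resolution (hypothetical, not an ES client)."""

    def __init__(self):
        self._targets = {}

    def apply(self, body):
        for action in body["actions"]:
            for op, args in action.items():
                indexes = self._targets.setdefault(args["alias"], [])
                if op == "add" and args["index"] not in indexes:
                    indexes.append(args["index"])
                elif op == "remove" and args["index"] in indexes:
                    indexes.remove(args["index"])

    def resolve(self, alias):
        """Indexes a query against `alias` would fan out to."""
        return list(self._targets.get(alias, []))


aliases = AliasTable()
aliases.apply(alias_add_body("i", "i1"))   # initially: i -> i1
aliases.apply(alias_add_body("i", "i2"))   # i1 is full: i -> i1, i2
print(aliases.resolve("i"))  # ['i1', 'i2']
```

Queries keep targeting the alias throughout, so adding an index for a growing tenant does not require any change on the query side.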
  • 7. WHAT ES IS NOT DESIGNED FOR
    Search Platform Vs Reporting Platform Vs Analytics Platform
    A search engine like ES, which has facet support, needs fairly powerful aggregation capabilities. This makes it a natural candidate for data analytics. A lot of data analytics workloads are concerned with aggregations over a given axis, be it time, location or category. Kibana is a data visualization tool built on top of Elastic Search.
    TBD: Evaluation of Kibana against the analytics frameworks used in DevDiv, if any. My read is that we are fairly far behind in service telemetry evaluation with respect to VSO services. It would be interesting to compare the solutions in place today to an ES-based solution built on Logstash and Kibana.
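The aggregations over a given axis mentioned above amount to bucketing documents by a field value or by a time interval and counting each bucket. The sketch below does this locally in plain Python to show the idea; it does not call ES, and the function names, field names and sample telemetry events are hypothetical.

```python
from collections import Counter


def terms_aggregation(events, field):
    """Count events per distinct value of `field`, largest bucket first."""
    counts = Counter(e[field] for e in events if field in e)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def daily_histogram(events, ts_field="timestamp"):
    """Count events per calendar day, assuming ISO 8601 timestamp strings."""
    counts = Counter(e[ts_field][:10] for e in events if ts_field in e)
    return sorted(counts.items())


# Hypothetical service telemetry events.
events = [
    {"service": "build", "level": "error", "timestamp": "2015-03-01T10:00:00Z"},
    {"service": "build", "level": "info", "timestamp": "2015-03-01T11:30:00Z"},
    {"service": "test", "level": "error", "timestamp": "2015-03-02T09:15:00Z"},
]

print(terms_aggregation(events, "service"))  # [('build', 2), ('test', 1)]
print(daily_histogram(events))  # [('2015-03-01', 2), ('2015-03-02', 1)]
```

A facet in a search UI is the first kind of bucket (counts per category), while a time-series chart in a tool like Kibana is the second kind.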