Searching and Analyzing Qualitative Data on Personal Computer
Search Approach - ES, GraphDB
Search, Mining
Sunita Shrivastava
sunitas@microsoft.com
Abstract
This is a comparison of search technologies for Code Lens, CEG, Tech Debt. The initial focus
is on understanding the relationships between inverted index search technologies like ES
with Graph DB approaches like Titan.
Table of Contents
1. Making the ES deployment available to other VSO Services
2. Beyond ALM Search
2.1 Scenarios
2.1.1 Test Results Search and Reporting
2.1.2 Code Lens
2.1.3 Code Connect
2.1.4 Semantic Search
2.1.5 Code Map
2.1.6 Tech Debt
2.2 Overview of Indexing Technologies for Search
2.2.1 Elastic Search
2.2.2 Graph Databases
2.2.3 Titan and Elastic Search
3. Appendix
ES as a Search Platform
Extensibility:
Leverages Lucene Features
HA and Scale-out
Aggregation
WHAT ES IS NOT DESIGNED FOR
Search Platform Vs Reporting Platform Vs Analytics Platform
ALM Search Platform
1. Making the ES deployment available to other VSO Services
A well-designed search platform will go a long way towards supporting the growing needs around data
crunching, quick data searching, and reporting.
Apart from ALM Search, the candidates identified so far that could leverage ES for indexing and analysis
are:
1) Test Results
2) Possibly querying test data related to specific releases in Release Management
3) Internal Telemetry
4) Log Analysis [TBD]
5) Code Lens
6) Potentially Load Testing
The following spec describes, at a high level, the plausible options for an overall architecture that
encompasses both ALM Search and Test Result indexing. A shared Elastic Search cluster was deemed a must,
at a minimum.
https://microsoft.sharepoint.com/teams/DPT/_layouts/15/WopiFrame.aspx?sourcedoc={B0D71992-E9C8-428F-BA72-C3840C45D656}&file=Indexing%20for%20VSO%20Services.pptx&action=default
Here is a summary of the proposal to enable the same.
With multiple indexing pipelines, in different services, feeding into the same ES cluster, issues around
throttling might arise. More experiments are required to firm up this proposal.
As part of this analysis, we investigated whether the query and ingestion paths into an ES cluster can be
separated. We also looked at how to secure an ES cluster so that multiple VSO services can access it
safely.
TBD: Attach the mail/notes with Pradeep and Sean.
TBD: Apart from basic VSO artefact search, what are the other needs for search in DevDiv? A better
understanding of those will help us flesh out the requirements for the DevDiv search platform.
2. Beyond ALM Search
This is an attempt to better understand the relationships between technologies like ES, which is essentially
an inverted index, and Titan, which is a graph database.
2.1 Scenarios
We will use Semantic Search, CodeLens and Code Connect scenarios to explore, analyze and understand this
space better.
2.1.1 Test Results Search and Reporting
2.1.2 Code Lens
The Code Lens Service crawls for changelists (commits) and work items. For each file, it eventually builds a
document that provides file-level and function-level information. When a file is open in a VS client or
code browser, the client fetches the file-level information and decorates the file with the relevant
metadata. The indexing technology used by Code Lens is essentially SQL.
Interestingly, the same data could have been stored in an Elastic Search Index or be served by a Graph
Database, akin to what the CEG team has proposed.
The key questions are:
1) What would be the benefit of storing Code Lens data in ES? Candidate answers: flexible format,
schema-less documents, and support for change.
2) What are the benefits of storing it in a graph database, akin to what the CEG team is proposing to do
for semantic search?
Hopefully, this attempt will yield a better understanding of graph databases.
2.1.3 Code Connect
Another use case that we will explore here is Code Connect. Code Connect is a hackathon project with a
social focus.
2.1.4 Semantic Search
Currently, the Code Entity Graph project is implementing Semantic Search support. Semantic search at a
function level answers queries around references to a function (who are the callers of a given function).
The Code Entity Graph team's proposal is to use Titan to build relationship graphs between different
entities, such as code (functions), people, and work items, over a period of time.
Titan is a graph database.
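As a rough illustration of the "who calls this function" query, here is a minimal in-memory sketch of an
entity graph with labeled edges. It is not Titan's API; the class `EntityGraph`, the edge labels, and the
entity names such as `func:ReadFile` are all hypothetical and chosen only for the example.

```python
from collections import defaultdict


class EntityGraph:
    """Toy in-memory graph of entities joined by labeled, directed edges."""

    def __init__(self):
        # label -> source -> targets, plus the reverse map for incoming edges
        self._out = defaultdict(lambda: defaultdict(set))
        self._in = defaultdict(lambda: defaultdict(set))

    def add_edge(self, source, label, target):
        self._out[label][source].add(target)
        self._in[label][target].add(source)

    def outgoing(self, source, label):
        """Entities that `source` points at through `label` edges."""
        return sorted(self._out[label][source])

    def incoming(self, target, label):
        """Entities that point at `target` through `label` edges."""
        return sorted(self._in[label][target])


graph = EntityGraph()
# Hypothetical entities: functions, a person, and a work item.
graph.add_edge("func:ParseConfig", "calls", "func:ReadFile")
graph.add_edge("func:LoadSettings", "calls", "func:ReadFile")
graph.add_edge("person:alice", "authored", "func:ReadFile")
graph.add_edge("workitem:1234", "touches", "func:ReadFile")

# "Who are the callers of ReadFile?" is a lookup of incoming "calls" edges.
callers = graph.incoming("func:ReadFile", "calls")
print(callers)
```

Keeping the reverse map alongside the forward one is what makes the callers query a direct lookup rather
than a scan over every edge.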
2.1.5 Code Map
2.1.6 Tech Debt
2.2 Overview of Indexing Technologies for Search
The following are the axes along which we would like a better understanding of technologies like SQL/Elastic
Search and Graph Databases.
1. What it does best
2. What it is not designed for
3. Scale Characteristics
2.2.1 Elastic Search
An Elastic Search index is an inverted index designed for search. It allows building inverted indices for
multiple fields of a document, so that each field can be searched efficiently. Here is where it differs
from SQL, where building an index for each column tends to be fairly expensive.
Elastic Search is schema-less.
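To make the per-field inverted index idea concrete, here is a minimal sketch in plain Python. It is a toy
model, not how ES or Lucene is implemented: the helper names `build_field_indexes` and `search`, the
whitespace tokenizer, and the sample documents are assumptions made for illustration.

```python
from collections import defaultdict


def build_field_indexes(documents):
    """Build one inverted index per field: field -> term -> set of doc ids."""
    indexes = defaultdict(lambda: defaultdict(set))
    for doc_id, doc in documents.items():
        for field, text in doc.items():
            # Naive analysis step: lowercase and split on whitespace.
            for term in str(text).lower().split():
                indexes[field][term].add(doc_id)
    return indexes


def search(indexes, field, term):
    """Return the ids of documents whose `field` contains `term`."""
    return sorted(indexes.get(field, {}).get(term.lower(), set()))


# Documents need not share the same fields; doc 2 has an extra "area" field.
docs = {
    1: {"title": "Fix login bug", "author": "alice"},
    2: {"title": "Add search page", "author": "bob", "area": "search"},
    3: {"title": "Search ranking bug", "author": "alice"},
}
indexes = build_field_indexes(docs)

print(search(indexes, "title", "bug"))
print(search(indexes, "author", "alice"))
print(search(indexes, "area", "search"))
```

Every field gets its own term-to-documents map, so a lookup on any field is a dictionary access rather
than a scan over all documents.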
2.2.2 Graph Databases
A defining property of a graph database is index-free adjacency: discovering the set of nodes adjacent to a
given node takes time that does not depend on the total size of the graph.
http://en.wikipedia.org/wiki/Graph_database has an excellent overview of graph databases, different
providers and their comparison.
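The adjacency property can be sketched as follows, contrasting a per-node neighbor set with a scan over a
flat edge table. This is a toy model under stated assumptions: `AdjacencyGraph` and `neighbors_by_scan`
are hypothetical names, and a real graph database stores adjacency on disk rather than in a Python dict.

```python
class AdjacencyGraph:
    """Undirected graph where each node keeps a direct set of its neighbors."""

    def __init__(self):
        self._neighbors = {}

    def add_edge(self, a, b):
        self._neighbors.setdefault(a, set()).add(b)
        self._neighbors.setdefault(b, set()).add(a)

    def neighbors(self, node):
        # One dictionary access, regardless of how many edges the graph has.
        return self._neighbors.get(node, set())


def neighbors_by_scan(edges, node):
    """Baseline: scan a flat edge table; cost grows with the number of edges."""
    found = set()
    for a, b in edges:
        if a == node:
            found.add(b)
        elif b == node:
            found.add(a)
    return found


edges = [("A", "B"), ("B", "C"), ("B", "D"), ("D", "E")]
graph = AdjacencyGraph()
for a, b in edges:
    graph.add_edge(a, b)

# Both approaches give the same answer; only the lookup cost differs.
print(sorted(graph.neighbors("B")))
print(sorted(neighbors_by_scan(edges, "B")))
```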
Titan is essentially a graph database. Titan works with storage technologies like HBase and Cassandra.
Graph databases usually start with the assumption that the exact kinds of relationships are not fully
known upfront.
http://vschart.com/compare/titan-database/vs/elasticsearch
2.2.3 Titan and Elastic Search
Titan recommends using an external index for numeric range, full-text, or geo-spatial indexing. An external
index like ES can also speed up Order By queries. For exact-match retrieval, the standard index suffices.
https://github.com/thinkaurelius/titan/wiki/Indexing-Backend-Overview
https://github.com/thinkaurelius/titan/wiki/Using-Elastic-Search
http://stackoverflow.com/questions/18191737/how-to-use-elasticsearch-index-in-titan-gremlin-query
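The split between exact-match retrieval and range or ordered retrieval can be illustrated with a small
analogy in plain Python. This is not Titan's or ES's API; `ExactMatchIndex`, `RangeIndex`, and the sample
property values are hypothetical, and the point is only why a hash-style index is enough for equality
while range and Order By queries want a sorted structure.

```python
import bisect


class ExactMatchIndex:
    """Hash-based index: value -> item ids. Suited to equality lookups."""

    def __init__(self):
        self._by_value = {}

    def add(self, value, item_id):
        self._by_value.setdefault(value, []).append(item_id)

    def lookup(self, value):
        return list(self._by_value.get(value, []))


class RangeIndex:
    """Sorted index: supports range queries and ordered traversal."""

    def __init__(self):
        self._values = []  # kept sorted
        self._ids = []     # parallel to _values

    def add(self, value, item_id):
        pos = bisect.bisect_right(self._values, value)
        self._values.insert(pos, value)
        self._ids.insert(pos, item_id)

    def between(self, low, high):
        """Ids whose value lies in the inclusive range [low, high]."""
        start = bisect.bisect_left(self._values, low)
        end = bisect.bisect_right(self._values, high)
        return self._ids[start:end]

    def ordered(self):
        """All ids in ascending order of value (an Order By)."""
        return list(self._ids)


# Hypothetical vertex properties: an owner name and a line count.
owners = ExactMatchIndex()
sizes = RangeIndex()
for item_id, owner, lines in [("f1", "alice", 120), ("f2", "bob", 15),
                              ("f3", "alice", 60), ("f4", "carol", 300)]:
    owners.add(owner, item_id)
    sizes.add(lines, item_id)

print(owners.lookup("alice"))
print(sizes.between(50, 150))
print(sizes.ordered())
```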
3. Appendix
ES as a Search Platform
From the perspective of providing the underpinnings of a search platform, the key features of ES that stand
out are the following: extensibility, use of Lucene features, HA and scale-out, and aggregation.
Extensibility:
These are the mechanisms ES provides to content owners to build the kind of index they want, with knobs
to control the features they want to use.
Some of the extension points that we have used are:
1) Analyzer Plugin
2) Query Highlighter
3) Query Filter
TBD: Add details.
Leverages Lucene Features
Lucene provides a transaction-log-based indexing mechanism, which attempts to reduce loss of data in the
face of failures, and is built on a segment and thread pool model in which an index is composed of
multiple smaller index parts so that each may be searched in parallel. It is highly optimized to do I/O in
bulk by using buffering techniques. If desired, it also allows two-phase commit of index documents. This
means that you can commit a change to an index in the context of a transaction against a SQL DB by writing
a basic coordinator.
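A basic coordinator of the kind described above could look roughly like the following sketch. It uses
Python's `sqlite3` as a stand-in for the SQL DB and a toy `ToyIndex` class as a stand-in for the index
writer; the prepare/commit/rollback method names are modeled loosely on a two-phase writer and are
assumptions for illustration, not the Lucene or ES API.

```python
import sqlite3


class ToyIndex:
    """Stand-in for an index writer with prepare, commit, and rollback phases."""

    def __init__(self):
        self.committed = {}
        self._pending = {}
        self._prepared = False

    def add_document(self, doc_id, text):
        self._pending[doc_id] = text

    def prepare_commit(self):
        # Phase 1: validate pending changes; nothing is visible yet.
        if any(not text for text in self._pending.values()):
            raise ValueError("cannot index an empty document")
        self._prepared = True

    def commit(self):
        # Phase 2: make the prepared changes visible.
        if not self._prepared:
            raise RuntimeError("commit called without prepare_commit")
        self.committed.update(self._pending)
        self._pending.clear()
        self._prepared = False

    def rollback(self):
        self._pending.clear()
        self._prepared = False


def index_with_sql(conn, index, doc_id, text):
    """Coordinator: the SQL row and the index document land together or not at all.

    A production coordinator would also need recovery for a crash between the
    SQL commit and the index commit; this sketch does not cover that.
    """
    try:
        conn.execute("INSERT INTO docs (id, body) VALUES (?, ?)", (doc_id, text))
        index.add_document(doc_id, text)
        index.prepare_commit()  # phase 1 on the index side
        conn.commit()           # commit the SQL transaction
        index.commit()          # phase 2 on the index side
        return True
    except Exception:
        conn.rollback()
        index.rollback()
        return False


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")
conn.commit()
index = ToyIndex()

first = index_with_sql(conn, index, 1, "hello world")
second = index_with_sql(conn, index, 2, "")  # fails in prepare, so both sides roll back
row_count = conn.execute("SELECT COUNT(*) FROM docs").fetchone()[0]
print(first, second, row_count, index.committed)
```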
HA and Scale-out
ES essentially provides a highly available index with support for scale-out. HA is achieved through replica
shards. Each shard is a Lucene index. ES monitors for failures and elects a new primary when necessary.
Replica shards also help with read scale-out. Significant among the scale-out features are support for bulk
indexing, the ability to move a shard to a differently sized node, and the ability to use an alias to
access multiple indexes. When the tenant data grows, another index may be added.
For example, say your search query for an account runs against the index alias I.
Currently, I is mapped to I1. Once I1 reaches its limit, based on the number of shards it was allocated,
you can index new repositories into I2 and change the alias mapping of I to I1, I2.
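The alias change described above can be sketched as a request body for the ES `_aliases` endpoint, plus a
small local model of what the change does. The helper names `add_index_to_alias` and
`apply_alias_actions` are hypothetical, no cluster is contacted, and the names I, I1, and I2 are the
placeholders from the text (real ES index names must be lowercase).

```python
import json


def add_index_to_alias(alias, new_index):
    """Build the body of a POST to /_aliases that adds `new_index` to `alias`."""
    return {"actions": [{"add": {"index": new_index, "alias": alias}}]}


def apply_alias_actions(alias_table, body):
    """Toy local model: apply add/remove actions to an alias -> [indexes] table."""
    for action in body["actions"]:
        for verb, spec in action.items():
            targets = alias_table.setdefault(spec["alias"], [])
            if verb == "add" and spec["index"] not in targets:
                targets.append(spec["index"])
            elif verb == "remove" and spec["index"] in targets:
                targets.remove(spec["index"])
    return alias_table


# Before the change, alias I resolves to I1 only.
alias_table = {"I": ["I1"]}

body = add_index_to_alias("I", "I2")
print(json.dumps(body))

# After the change, queries against I fan out to both I1 and I2.
apply_alias_actions(alias_table, body)
print(alias_table)
```

Because callers only ever query the alias, the set of backing indexes can grow without any change on the
query side.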
The ability to separate out ingestion and query nodes and the capability to do load balancing across those
also allows a single ES cluster to handle multiple indexes fairly easily.
These are described in detail in the following spec.
https://microsoft.sharepoint.com/teams/DPT/_layouts/15/WopiFrame.aspx?sourcedoc={7F951956-9215-4221-9E08-BCA0B8A7503B}&file=Index%20Provision%20Schemes%20for%20multi%20tenant.docx&action=default
TBD: Above Spec needs cleanup
Aggregation
It is equally important to understand what ES is not designed for. This helps clarify how other
technologies, such as COSMOS, HADOOP, GREMLIN, or even SQL, can play alongside ES in an environment to
provide solutions for those cases. [TBD: more thought on this is required]
WHAT ES IS NOT DESIGNED FOR
Search Platform Vs Reporting Platform Vs Analytics Platform
A search engine like ES, which has facet support, needs fairly powerful aggregation capabilities. This makes
it a natural candidate for data analytics. Many data analytics workloads are concerned with aggregations
over a given axis, be it time, location, or category. Kibana is a data visualization tool built on top of
Elastic Search.
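As an illustration of aggregating over an axis, the sketch below builds the body of an ES terms
aggregation request and computes the equivalent bucket counts locally. The helper names
`terms_aggregation_body` and `count_by`, the field name `outcome`, and the sample test-result records are
assumptions for the example; no cluster is contacted.

```python
from collections import Counter


def terms_aggregation_body(field):
    """Search request body for a terms aggregation on `field`, returning no hits."""
    return {"size": 0, "aggs": {"by_" + field: {"terms": {"field": field}}}}


def count_by(records, field):
    """Local equivalent: bucket records by `field` and count each bucket."""
    return Counter(record[field] for record in records if field in record)


# Hypothetical test-result records.
results = [
    {"test": "LoginTest", "outcome": "passed", "day": "2015-03-01"},
    {"test": "SearchTest", "outcome": "failed", "day": "2015-03-01"},
    {"test": "LoginTest", "outcome": "passed", "day": "2015-03-02"},
]

body = terms_aggregation_body("outcome")
by_outcome = count_by(results, "outcome")
by_day = count_by(results, "day")
print(body)
print(dict(by_outcome))
print(dict(by_day))
```

The same bucketing applies to any axis, so swapping `outcome` for `day` gives a per-day count without
changing the code.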
TBD: Evaluation of Kibana against the analytics frameworks used in DevDiv, if any.
My read is that we are well behind in service telemetry evaluation for VSO services. It would be
interesting to compare the solutions in place today with an ES-based solution built on Logstash and Kibana.