DesignSafe is a web portal that helps the Natural Hazards Engineering community conduct research. Natural hazards research spans multiple disciplines and multiple physical locations where experiments take place, so sharing and searching data across sites is an essential capability. We meet these needs by using a distributed search engine (Elasticsearch) to index important features extracted from the data.
PEARC17: DesignSafe: Using Elasticsearch to Share and Search Data on a Science Web Portal
1. Designsafe: Using Elasticsearch to
Share and Search Data on a Science
Web Portal
Josue Balandrano Coronel
Stephen Mock
Texas Advanced Computing Center
15. - What is DesignSafe?
- Natural Hazards Engineering Research Infrastructure
- Shared-use research infrastructure
- Users within a project
- Users and Experimental Facilities
- Infrastructure
- Data Depot
- Workspace
- Reconnaissance
Context
21. - What is Agave?
- Provides a holistic view of core computing concepts
- Abstraction layer on top of HPC systems (execution and storage)
- File permissions and access
- Simpler ACL interface
Context
37. - Discoverable and searchable data
- Main queries:
- Give me every file/folder I have access to that is not in my home dir
- Search within context of the UI
Problem
48. - Consists of 3 blocks:
- Character filters
e.g. removing HTML tags
- Tokenizers
e.g. a hierarchical tokenizer:
“username/path/to/file.txt”
[“username”,
“username/path”,
“username/path/to”,
“username/path/to/file.txt”]
- Token filters
e.g. lowercasing tokens or removing stop words
- Out of the box or custom
- Standard: Divides terms on word boundaries and lowercases tokens
- Keyword: Noop analyzer
- Custom Hierarchical: Breaks on a specific character
- Language: Removes stop words, excludes keywords, stemming
Elasticsearch - Analyzers
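A sketch of how the custom hierarchical analyzer above can be defined, using Elasticsearch's built-in path_hierarchy tokenizer (the index/analyzer names are examples, not DesignSafe's real ones), plus a pure-Python imitation of the tokens it emits:

```python
# Example index settings for a custom hierarchical analyzer. The
# "path_hierarchy" tokenizer is built into Elasticsearch and splits a
# path on "/" into one token per level; names here are illustrative.
settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "file_path_hierarchy": {
                    "type": "custom",
                    "tokenizer": "file_path_tokenizer",
                }
            },
            "tokenizer": {
                "file_path_tokenizer": {
                    "type": "path_hierarchy",
                    "delimiter": "/",
                }
            },
        }
    }
}

def hierarchy_tokens(path, delimiter="/"):
    """Pure-Python sketch of what the tokenizer emits for a path."""
    parts = path.split(delimiter)
    return [delimiter.join(parts[: i + 1]) for i in range(len(parts))]

print(hierarchy_tokens("username/path/to/file.txt"))
# ['username', 'username/path', 'username/path/to', 'username/path/to/file.txt']
```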
81. Elasticsearch - Simple Query String
- Simple language:
  + signifies AND operation
  | signifies OR operation
  - negates a single token
  " wraps a number of tokens to signify a phrase for searching
  * at the end of a term signifies a prefix query
  ( and ) signify precedence
  ~N after a word signifies edit distance (fuzziness)
  ~N after a phrase signifies slop amount
- Never returns an error; invalid parts of the query are discarded.
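A request body using this mini-language might look like the sketch below; the searched fields ("name", "keywords") are assumptions for illustration:

```python
# simple_query_string body: "report*" is a prefix query, the quoted
# phrase must match, and "-draft" excludes documents with that token.
query = {
    "query": {
        "simple_query_string": {
            "query": 'report* + "shake table" -draft',
            "fields": ["name", "keywords"],
            "default_operator": "and",
        }
    }
}
```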
92. Thank You
Special thanks to:
- DesignSafe Team
- TACC
- Stephen Mock
- PEARC
- My wife: Gigimaria Flores
Email: jcoronel@tacc.utexas.edu
Twitter: @eusoj_xirdneh
IRC: josuebc @ freenode
Editor's Notes
Before diving into what Elasticsearch is and how we use it, let’s explain a little bit of context.
What is DesignSafe?
DesignSafe is a Science Gateway for the Natural Hazards Engineering community.
At its core, DesignSafe is a shared-use research infrastructure,
allowing users to share data and applications and to collaborate with other users within a project
and with remote experimental facilities.
Now, let’s take a quick look at the architecture so we can have a better idea of how we manage data.
Starting from what the user sees, we have a middleware implemented with Django and Python. This is the actual web portal.
Behind it we have multiple distributed services: Elasticsearch, message queues, custom APIs and Agave (more on Agave in a minute).
Behind that we have all of our HPC systems: execution systems like Stampede and Maverick, and storage systems like Corral.
The main components of DesignSafe's infrastructure are:
the Data Depot, where a user can manage, discover and share data;
the Workspace, where a user has access to different applications that run on different HPC systems;
and the Reconnaissance portal, where users can upload and visualize geospatial data.
I mentioned Agave. So, what is Agave?
As we can see in this graphic, we use Agave as our main point of interaction with our HPC systems.
It is basically an abstraction layer on top of all the HPC resources we use.
This is an important concept because Agave allows us to easily manage file permissions and access,
as well as providing a simple ACL interface. All of this happens through friendly REST endpoints.
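As a rough illustration of that REST interface, the sketch below builds the URL for an Agave files listing on a storage system; the base URL, system id and endpoint shape are assumptions and should be checked against the Agave documentation.

```python
# Hypothetical sketch of addressing a file listing through Agave's REST
# API. Base URL and system id below are placeholders, not real values.
def files_listing_url(base_url, system_id, path):
    """Build the URL for a files listing on an Agave storage system."""
    return f"{base_url}/files/v2/listings/system/{system_id}/{path}"

url = files_listing_url("https://agave.example.org", "storage.example", "username/projects")
```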
Now, let’s focus on the Data Depot. As we can see we have different sections in the data depot.
My Data is all your private data; this is your home directory.
Here you can share data with any user
and give it read or read/write permission through this interface
Everything that has been shared with you will appear here. All of this data is also searchable.
We also offer a collaboration section called My Projects. Here, a set of users are members of a project, and every user automatically has full access to everything within that project. This section also allows users to curate data and eventually create a publication, but that is not the focus of this presentation.
There’s the Published section, where we list all the publications we have. All of these publications have DOIs, and the metadata is properly rendered.
I won’t go much into the details of the different types of publications we have, but keep in mind that all of the published metadata is also stored in Elasticsearch. We also have some legacy publications, which look like this
While newer publications look like this. As we can see these are two different data models.
As a counterpart we have Community Data, which is data that is public but is not a formal publication. Mainly we store tutorials and examples there.
Finally, we also allow users to connect external services like Box or Dropbox so they can move data from and to these external resources.
Now that we have an idea of all the different types of data we manage in the Data Depot we can have a better grasp of what the issue is
All of this data has to be searchable and discoverable.
So, after a lot of thinking about this we realized that we are mainly implementing two queries.
One is: give me everything I have access to that is not in my home directory. With this query we get everything that has been shared with a specific user, and we can work within that context.
The other query is to get everything pertaining to the Data Depot section the user currently is in.
In order to create these queries we decided to use Elasticsearch,
which is a search engine based on Lucene.
Elasticsearch gives you a nice RESTful API
and allows us to store schema-free JSON documents,
as well as being distributed. These last two characteristics are really important to us, because the only things we were sure about were that we did not know the structure of the data we were going to manage and we did not know how fast it would grow.
Elasticsearch is also near-realtime, which means that a document is available for search almost immediately after being written; usually within a minute at most.
So, let’s take a look at how we are indexing files with Elasticsearch so we can query that information. This is called a document in Elasticsearch.
As we can see most of this information is what we get from the “stat” command. Name, length, last modified, etc…
We are going to focus on three specific fields. Name, systemId and path. Most of our queries are going to target these fields. There are some other metrics that can be aggregated from other fields shown here. But the thing is that indexing files in Elasticsearch requires planning.
We need to figure out how we are going to use the fields that get indexed. Since we already have an idea of the queries we are going to execute, we know how we are going to use these fields. We know that we are going to filter documents depending on one or more parent folders, and as we can see we are storing this information in the field “path”. We also need to filter files depending on a specific Data Depot section. What we are doing here is creating a storage system for each one of the Data Depot sections previously described. This helps us track where every file lives and makes it easier to manage with Agave and Corral; this is why each document also has a systemId, which is the identifier for that specific storage system. Finally, we need to pay extra attention to the name, because we want the user to be able to query filenames as well as extensions, and even extra metadata that we are not showing here to keep things simple. By extra metadata I mean information like user-defined keywords, descriptions and other community-specific data.
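A minimal sketch of the kind of file document just described; the field names follow the talk (name, path, systemId), while the concrete values and the systemId string are made up for illustration.

```python
# Illustrative file document, roughly what gets indexed per file.
# The systemId value below is a made-up example, not DesignSafe's real one.
doc = {
    "name": "results.csv",                  # queried for filenames/extensions
    "path": "username/experiments/run-01",  # analyzed hierarchically
    "systemId": "storage.example",          # one storage system per Data Depot section
    "length": 52430,                        # bytes, from stat
    "lastModified": "2017-06-01T12:00:00-06:00",
    "type": "file",
}
```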
Then we have to see if we need to manipulate any of these fields in order to make our queries faster. It is always better to store the data transformed instead of transforming it on the fly. Elasticsearch introduces the concept of analyzers. Analyzers transform data as it is being stored that way it is easier and faster to apply different queries to the same data.
Analyzers consist of 3 blocks:
character filters, which receive the data as a stream of characters and can be used to add, remove or change characters,
e.g. removing HTML tags.
Tokenizers, which receive a stream of characters, break it up into individual tokens and output those tokens.
e.g. we can use a tokenizer to store a better representation of a file path. This is called a hierarchical tokenizer.
It receives the path as a string and outputs an array with every hierarchy level of that path. This is what allows us to quickly filter all the files under a specific folder, regardless of how many children or subfolders that folder has.
Then we have token filters, which receive token streams and may add, remove or change tokens.
They can be used to lowercase tokens or remove stop words.
There are plenty of analyzers Elasticsearch offers out of the box and one can create a custom analyzer.
The main analyzers we use are: Standard, which divides terms on word boundaries and lowercases the stream.
Keyword, which is basically a noop analyzer, meaning that the string will not be touched when being stored.
A custom one which only has a hierarchical tokenizer
and an English-specific analyzer, which helps remove common stop words, exclude custom keywords and stem words.
As an example, let’s take a look at a simple file document and how analyzers transform some of these fields. We are using the standard analyzer on the file name; this makes searches case-insensitive by lowercasing everything and breaking the name into words, which allows the user to search on extensions or partial names. When we store this field we store two values: one is the transformed value and the other uses the keyword analyzer, which keeps the string untouched.
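As a rough simulation of what the standard analyzer does to a file name (the real analyzer uses Unicode text segmentation; this regex version is only an approximation):

```python
import re

def standard_analyze(text):
    """Approximate the standard analyzer: split on word boundaries, lowercase."""
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]

print(standard_analyze("Project-Report.PDF"))
# ['project', 'report', 'pdf'] -- a search for "pdf" or "report" now matches
```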
For the system id we use the hierarchical analyzer, because we use internal namespaces for different storage systems. Most of the time we query against the un-analyzed value, meaning the keyword-analyzer output. This is the field that allows us to filter files depending on the context of the UI.
Each of these sections represents a different system id.
And we also use the hierarchical and keyword analyzers for the path field.
Now, we also need to index and filter files based on permissions. The way we manage these values is a bit simpler because we really only need a set of flags, as in “read”, “write”, “execute” and a username.
This is what the permissions for a file look like: an array of objects with the username and the actual permissions stored as boolean flags. With this data we can easily list all the files a user has access to and show them in a nice interface like this.
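A sketch of the permissions array just described, with a helper mirroring the query's intent (the exact field names are illustrative):

```python
# Permissions as an array of {username, boolean flags} objects.
permissions = [
    {"username": "jdoe",   "permission": {"read": True, "write": True,  "execute": False}},
    {"username": "asmith", "permission": {"read": True, "write": False, "execute": False}},
]

def can_read(perms, username):
    """Does this user have read access according to the permissions array?"""
    return any(p["username"] == username and p["permission"]["read"] for p in perms)
```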
Setting up these analyzers on specific fields is done through mappings, and Elasticsearch has an API to set up different mappings. I’ve mentioned that we usually use multiple analyzers on one field, like a hierarchical analyzer and a keyword analyzer. The way we do this is to create what is called a multifield, so that we can specify which transformed data we want to query.
In this example we use the HTTP PUT verb to set the mapping of a specific field. We have to specify the index and document type in the URL as well as the properties we are updating.
Here we are creating a multifield with two fields, one which will reference the hierarchical value (underscore path) and another one which will reference the string unmodified (underscore exact).
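The mapping described here could look roughly like the sketch below; the analyzer name is an example, and the string/not_analyzed syntax follows the Elasticsearch 2.x-era API current at the time, so it should be adapted for newer versions.

```python
# Hypothetical multifield mapping for "path": the top-level field plus a
# hierarchical sub-field ("_path") and an untouched one ("_exact").
mapping = {
    "properties": {
        "path": {
            "type": "string",
            "fields": {
                "_path": {"type": "string", "analyzer": "file_path_hierarchy"},
                "_exact": {"type": "string", "index": "not_analyzed"},
            },
        }
    }
}
```

This body would be sent with HTTP PUT to the mapping endpoint for the index and document type, as mentioned above.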
Now, let’s take a look at some of the actual queries we are executing. First we have the query that allows us to create the Shared With Me listing. We want to list all the files/folders a user has access to in a specific system that are not children of the user's home directory. I’ll show two possible ways to do this query.
First, we create what is called a bool query. This type of query allows us to combine different sub queries and filters.
Here we can see the filter we are using. This filter will return every document that has these specific values for the username in the permissions object array and for the system id. Notice how we specify the underscore-exact field from the multifield we configured before. We want to use filters as much as possible, because filters are cached.
After we filter the necessary documents, we retrieve all documents whose path does not start with the username value. This returns all the documents we are looking for.
Another way to do it is to take advantage of the hierarchical analyzer we setup and match all documents which do not have the value username in the hierarchical path array.
We can also leverage the hierarchical analyzer to retrieve all the files/folders a user has access to under a specific folder like this.
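The two variants just described might be sketched like this; the field names follow the multifield convention discussed earlier, and the username and system id values are examples:

```python
# Variant 1: filter on user + system, then exclude anything under the
# user's home directory with a prefix query inside must_not.
shared_with_me = {
    "query": {
        "bool": {
            "filter": [
                {"term": {"permissions.username": "jdoe"}},
                {"term": {"systemId._exact": "storage.example"}},
            ],
            "must_not": [{"prefix": {"path._exact": "jdoe"}}],
        }
    }
}

# Variant 2: use the hierarchical field instead, excluding any document
# whose path hierarchy tokens contain the username.
shared_with_me_hierarchical = {
    "query": {
        "bool": {
            "filter": [
                {"term": {"permissions.username": "jdoe"}},
                {"term": {"systemId._exact": "storage.example"}},
            ],
            "must_not": [{"term": {"path._path": "jdoe"}}],
        }
    }
}
```

Swapping the must_not for a filter term on `path._path` with a folder's full path would instead list everything the user can access under that folder.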
Another query we use a lot takes a query string from the user and returns all documents matching it.
For this we use Elasticsearch’s simple query string.
It is really easy to use; we only need to specify the query string and the set of fields to search on.
This type of query has its own small language and it will never return an error. If any part of the query string is not valid, it is discarded.
Here is an example of how it looks in DesignSafe when we search for PDF files.
There are a few caveats when using Elasticsearch.
Especially when indexing documents representing files in a file system, one has to be extra careful with duplicate and stale documents. This has to be managed externally, since Elasticsearch does not do it automatically.
Elasticsearch should not be treated the same way as a persistent DB, because it is really easy to delete an entire index or a bunch of documents.
There should always be a strategy to quickly rebuild any index and of course recurrent backups.
It is always difficult to synchronize a search index with the actual data, especially when building a search index for data in a file system. The way we tackle this is to have different scripts that recurrently index newly created data as well as permissions.
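A heavily simplified sketch of such a sync pass, with index_doc/delete_doc standing in for real Elasticsearch calls and the path sets standing in for a file-system walk:

```python
# Sync a search index with a file system: index paths that exist on disk
# but not in the index, delete documents whose paths no longer exist.
def sync_index(fs_paths, indexed_paths, index_doc, delete_doc):
    """fs_paths and indexed_paths are sets of path strings."""
    for path in sorted(fs_paths - indexed_paths):   # new files -> index
        index_doc(path)
    for path in sorted(indexed_paths - fs_paths):   # stale docs -> delete
        delete_doc(path)
```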
Finally, special attention should be paid to access management for Elasticsearch. There are different ways to protect your cluster: firewalls, basic HTTP authentication, or one of the multiple tools you can add to Elasticsearch for authorization.
We also use Elasticsearch in other parts of DesignSafe.