Designsafe: Using Elasticsearch to
Share and Search Data on a Science
Web Portal
Josue Balandrano Coronel
Stephen Mock
Texas Advanced Computing Center
Context
- What is DesignSafe?
- Natural Hazards Engineering Research Infrastructure
- Shared-use research infrastructure
- Users within a project
- Users and Experimental Facilities
- Infrastructure
Context: DesignSafe Architecture
- Science Gateway: Django middleware
- Distributed Services: Agave, Elasticsearch, RabbitMQ, Custom APIs
- HPC: Stampede, Maverick, Corral
Context
- Infrastructure
  - Data Depot
  - Workspace
  - Reconnaissance
Context
- What is Agave?
- Provides a holistic view of core computing concepts
- Abstraction layer on top of HPC systems (execution and storage)
- File permissions and access
- Simpler ACL interface
Data Depot
Problem
- Discoverable and searchable data
- Main queries:
  - Give me every file/folder I have access to that is not in my home dir
  - Search within context of the UI
Elasticsearch
- Search engine based on Lucene
- RESTful API
- Schema-free JSON documents
- Distributed
- Near Realtime
Elasticsearch - Analyzers
- Consists of 3 blocks:
  - Character filters, e.g. removing HTML tags
  - Tokenizers, e.g. hierarchical:
    “username/path/to/file.txt” =>
    [“username”,
    “username/path”,
    “username/path/to”,
    “username/path/to/file.txt”]
  - Token filters, e.g. lowercasing (case insensitivity) or removing stop words
- Out of the box or custom
  - Standard: divides terms on word boundaries and applies a lowercase token filter
  - Keyword: noop analyzer
  - Custom hierarchical: breaks on a specific character
  - Language: removes stop words, excludes keywords, applies stemming
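The hierarchical tokenizer output shown on the slide can be reproduced with a few lines of plain Python. This is only a sketch of the behavior, not the Elasticsearch implementation: the function name `path_hierarchy_tokens` is hypothetical, and it ignores edge cases such as leading delimiters, which the real tokenizer handles differently.

```python
def path_hierarchy_tokens(value, delimiter="/"):
    """Emit one token per level of the hierarchy, shortest first.

    Simplified emulation for illustration: empty segments (for example
    from a leading delimiter) are dropped.
    """
    parts = [part for part in value.split(delimiter) if part]
    return [delimiter.join(parts[: i + 1]) for i in range(len(parts))]


# File path split on "/", as in the slide example.
print(path_hierarchy_tokens("username/path/to/file.txt"))

# The same idea applied to a dotted namespace such as a system id.
print(path_hierarchy_tokens("designsafe.storage.default", delimiter="."))
```

Because every ancestor folder becomes its own token, "all files under folder X" turns into an exact token lookup instead of a string-prefix scan.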
Elasticsearch
“name”: “file.txt” => “file.txt”
[“file”, “txt”]
“systemId”: “designsafe.storage.default” => “designsafe.storage.default”
[“designsafe”,
“designsafe.storage”,
“designsafe.storage.default”]
“path”: “username/path/to” => “username/path/to”
[“username”,
“username/path”,
“username/path/to”]
Elasticsearch - Mappings
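The mapping slides themselves did not survive as text, so the following is a hedged reconstruction of the idea rather than the actual DesignSafe mapping. It builds an index body in Python with a custom analyzer based on Elasticsearch's `path_hierarchy` tokenizer and a multifield on `path` with `_path` and `_exact` sub-fields, as named in the speaker notes. The names `path_tokenizer`, `path_analyzer` and `build_index_body` are assumptions, and the syntax follows recent typeless mappings (`text`/`keyword`), whereas the talk used a version that still required a document type in the URL.

```python
import json


def build_index_body():
    """Settings and mappings for a file index (illustrative names only)."""
    return {
        "settings": {
            "analysis": {
                "tokenizer": {
                    # Built-in tokenizer type; splits on "/" and emits every level.
                    "path_tokenizer": {"type": "path_hierarchy", "delimiter": "/"},
                },
                "analyzer": {
                    "path_analyzer": {"type": "custom", "tokenizer": "path_tokenizer"},
                },
            }
        },
        "mappings": {
            "properties": {
                "path": {
                    "type": "text",
                    "fields": {
                        # Hierarchical tokens, used for "under this folder" filters.
                        "_path": {"type": "text", "analyzer": "path_analyzer"},
                        # Untouched string, used for exact matches.
                        "_exact": {"type": "keyword"},
                    },
                },
            }
        },
    }


print(json.dumps(build_index_body(), indent=2))
```

A body like this would typically be sent with an HTTP PUT to the index URL when the index is created; queries then pick a sub-field such as `path._path` or `path._exact` depending on which representation they need.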
Elasticsearch
- List all the files/folders I have access to in a specific system which are not in my home directory
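The query on this slide was an image, so here is a minimal sketch of what such a request body could look like, following the two variants described in the speaker notes. The field names `permissions.username`, `systemId._exact`, `path._path` and `path._exact` are assumptions derived from those notes, and the function name is hypothetical.

```python
def shared_with_me_query(username, system_id, use_hierarchy=True):
    """Build a bool query: files the user can access, outside their home dir."""
    if use_hierarchy:
        # Variant 2: exclude documents whose hierarchical path tokens
        # contain the username as a whole top-level segment.
        exclude_home = {"term": {"path._path": username}}
    else:
        # Variant 1: exclude documents whose untouched path starts
        # with the username.
        exclude_home = {"prefix": {"path._exact": username}}
    return {
        "query": {
            "bool": {
                # Filters do not affect scoring and can be cached.
                "filter": [
                    {"term": {"permissions.username": username}},
                    {"term": {"systemId._exact": system_id}},
                ],
                "must_not": [exclude_home],
            }
        }
    }


query = shared_with_me_query("username", "designsafe.storage.default")
print(query)
```

The hierarchical variant matches whole path segments only, whereas a plain prefix would also exclude paths of other users whose names begin with the same characters.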
Elasticsearch
- List all the files/folders I have access to in a specific system under a specific folder
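As a sketch of the folder listing, assuming the same hypothetical field names as in the notes (`permissions.username`, `systemId._exact`, `path._path`), the hierarchical tokens make "everything under this folder" a single exact-token filter:

```python
def folder_listing_query(username, system_id, folder):
    """Build a filter-only query for everything at or below `folder`."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"permissions.username": username}},
                    {"term": {"systemId._exact": system_id}},
                    # Matches any document whose path has `folder` as an ancestor,
                    # because each ancestor was indexed as its own token.
                    {"term": {"path._path": folder}},
                ]
            }
        }
    }


print(folder_listing_query("username", "designsafe.storage.default", "username/path"))
```

The cost of this filter does not depend on how deep the folder tree is, which is the benefit the hierarchical tokenizer is meant to provide.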
Elasticsearch
- List all the files/folders which match a specific query string
- List all the files/folders in my home directory which match a specific query string
Elasticsearch - Simple Query String
- Simple language:
  + signifies AND operation
  | signifies OR operation
  - negates a single token
  " wraps a number of tokens to signify a phrase for searching
  * at the end of a term signifies a prefix query
  ( and ) signify precedence
  ~N after a word signifies edit distance (fuzziness)
  ~N after a phrase signifies slop amount
- Will never return an error, discards invalid parts of the query.
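A minimal sketch of how a user-supplied string could be wrapped in a `simple_query_string` query, optionally restricted to a home directory. The function name, the default field list and the `path._path` field are assumptions for illustration; only `simple_query_string` with its `query` and `fields` parameters comes from Elasticsearch itself.

```python
def search_query(query_string, fields=("name", "name._exact"), home_dir=None):
    """Wrap a user query string; optionally limit results to one folder tree."""
    text_query = {
        "simple_query_string": {
            "query": query_string,
            "fields": list(fields),
        }
    }
    if home_dir is None:
        return {"query": text_query}
    return {
        "query": {
            "bool": {
                "must": [text_query],
                # Assumed hierarchical sub-field; keeps only files under home_dir.
                "filter": [{"term": {"path._path": home_dir}}],
            }
        }
    }


# Prefix match on "report" AND the token "pdf", anywhere in the index.
print(search_query("report* +pdf"))

# Same search, limited to one user's home directory.
print(search_query("report* +pdf", home_dir="username"))
```

Since invalid parts of the query string are discarded rather than rejected, the raw user input can be passed through without pre-validation.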
Elasticsearch
Elasticsearch - Caveats
- Manage dedup
- Not a persistent DB. How to recreate index quickly
- Synchronizing data
- Access management
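One common way to handle the deduplication caveat, shown here as an assumption and not necessarily what DesignSafe does, is to derive each document's `_id` from the storage system and file path. Indexing a document with an explicit ID overwrites any existing document with that ID, so re-indexing the same file cannot create a duplicate. The function name `document_id` is hypothetical.

```python
import hashlib


def document_id(system_id, file_path):
    """Derive a stable document ID from the storage system and file path."""
    # Normalize so "/username/a.txt" and "username/a.txt" map to the same ID.
    normalized = file_path.strip("/")
    key = f"{system_id}:{normalized}".encode("utf-8")
    return hashlib.sha256(key).hexdigest()


print(document_id("designsafe.storage.default", "username/path/to/file.txt"))
```

This only addresses duplicates; stale documents for files that were moved or deleted still need a separate cleanup or re-sync step.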
Elasticsearch - Other Uses
- Site-wide search
- Publications metadata
- Quick metrics calculations
Thank You
Special thanks to:
- DesignSafe Team
- TACC
- Stephen Mock
- PEARC
- My wife: Gigimaria Flores
Email: jcoronel@tacc.utexas.edu
Twitter: @eusoj_xirdneh
IRC: josuebc @ freenode

PEARC17: Designsafe: Using Elasticsearch to Share and Search Data on a Science Web Portal

Editor's Notes

  1. Before diving into what Elasticsearch is and how we use it, let’s explain a little bit of context.
  2. What is DesignSafe?
  3. DesignSafe is a Science Gateway for the Natural Hazards Engineering community.
  4. At its core DesignSafe is a Shared-use research infrastructure,
  5. allowing users to share data and applications and to collaborate with other users within a project
  6. and with remote experimental facilities
  7. Now, let’s take a quick look at the architecture so we can have a better idea of how we manage data.
  8. Starting from what the user sees, we have a middleware implemented using Django and Python. This is the actual web portal.
  9. Behind it we have multiple distributed services: Elasticsearch, message queues, custom APIs and Agave (I’ll talk about Agave in a minute).
  10. Behind that we have all of our HPC systems: execution systems like Stampede and Maverick, and storage like Corral.
  11. The main components of DesignSafe’s infrastructure are:
  12. the Data Depot, which is where a user can manage, discover and share data.
  13. The Workspace, where a user has access to different applications which run on different HPC systems,
  14. and the Reconnaissance portal, where users can upload and visualize geospatial data.
  15. I mentioned Agave. So, what is Agave?
  16. As we can see in this graphic, we use Agave as our main point of interaction with our HPC systems.
  17. It basically is an abstraction layer on top of everything HPC we use.
  18. This is an important concept because Agave allows us to easily manage file permissions and access,
  19. and it provides a simple ACL interface. All of this is exposed through friendly REST endpoints.
  20. Now, let’s focus on the Data Depot. As we can see, the Data Depot has different sections.
  21. My Data is all your private data; this is your home directory.
  22. Here you can share data with any user
  23. and give it read or read/write permission through this interface
  24. Everything that has been shared with you will appear here. All of this data is also searchable.
  25. We also offer a collaboration section called My Projects. Here, a set of users are members of a project. Every user automatically has full access to everything within that project. This section also allows users to curate data and eventually create a publication, but that is not the focus of this presentation.
  26. There’s the Published section, where we list all the publications we have. All of these publications have DOIs and the metadata is properly rendered.
  27. I won’t go much into the details of the different types of publications that we have, but I want you to keep in mind that all of the published metadata is also stored in Elasticsearch. We have some legacy publications, which look like this,
  28. While newer publications look like this. As we can see these are two different data models.
  29. As a counterpart we have Community Data, which is data that is public but it is not a proper publication. Mainly we store tutorials and examples.
  30. Finally, we also allow users to connect external services like Box or Dropbox so they can move data to and from these external resources.
  31. Now that we have an idea of all the different types of data we manage in the Data Depot, we can better grasp what the issue is.
  32. All of this data has to be searchable and discoverable.
  33. So, after a lot of thinking about this we realized that we are mainly implementing two queries.
  34. One is “give me everything I have access to that is not in my home directory”. With this query we get everything that has been shared with a specific user and we can work within that context.
  35. The other query is to get everything pertaining to the Data Depot section the user currently is in.
  36. In order to create these queries we decided to use Elasticsearch,
  37. which is a search engine based on Lucene.
  38. Elasticsearch gives you a nice RESTful API
  39. and allows us to store schema-free JSON documents
  40. as well as being distributed. These last two characteristics are really important to us because the only thing we were sure about is that we did not know the structure of the data we were going to manage and we did not know how fast it was going to grow.
  41. Elasticsearch is also near real-time, which means that a document becomes searchable shortly after being written. With the default refresh interval this is about one second; in practice it is usually a minute at the most.
  42. So, let’s take a look at how we are indexing files with Elasticsearch so we can query that information. This is called a document in Elasticsearch. As we can see most of this information is what we get from the “stat” command. Name, length, last modified, etc…
  43. We are going to focus on three specific fields: name, systemId and path. Most of our queries are going to target these fields. There are some other metrics that can be aggregated from other fields shown here, but indexing files in Elasticsearch requires planning: we need to figure out how we are going to use the fields that get indexed. Since we already have an idea of the queries we are going to execute, we know how we are going to use these fields. We know that we are going to filter documents depending on one or more parent folders, and as we can see we are storing this information in the field “path”. We also need to filter files depending on a specific Data Depot section. What we are doing here is creating a storage system for each one of the Data Depot sections previously described. This helps us differentiate where every file is, and it is easier to manage with Agave and Corral. So we can also see a systemId, which is the identifier for that specific storage system. Finally, we need to pay extra attention to the name because we want the user to be able to query filenames as well as extensions, and even extra metadata that we are not showing here to keep this simple. By extra metadata I mean information like user-defined keywords, descriptions and other community-specific data.
  44. Then we have to see if we need to manipulate any of these fields in order to make our queries faster. It is always better to store the data transformed instead of transforming it on the fly. Elasticsearch introduces the concept of analyzers. Analyzers transform data as it is being stored; that way it is easier and faster to apply different queries to the same data.
  45. Analyzers consist of 3 blocks:
  46. character filters, which receive the data as a stream of characters and can be used to add, remove or change characters,
  47. e.g. removing HTML tags.
  48. Tokenizers, which receive a stream of characters, break it up into individual tokens and output these tokens,
  49. e.g. we can use a tokenizer to store a better representation of a file path. This is called a hierarchical tokenizer. It will receive the path as a string and will output an array of every hierarchy level on that path. This is what allows us to quickly filter all the files under a specific folder, regardless of how many children or subfolders that folder has.
  50. Then we have token filters, which receive token streams and may add, remove or change tokens.
  51. They can be used to lowercase tokens or remove stop words.
  52. There are plenty of analyzers Elasticsearch offers out of the box and one can create a custom analyzer.
  53. The main analyzers we use are: Standard, which divides terms on word boundaries and lowercases the tokens.
  54. Keyword, which is basically a noop analyzer, meaning that the string will not be touched when being stored.
  55. A custom one, which only has a hierarchical tokenizer,
  56. and an English-specific analyzer, which removes common stop words, excludes any custom keywords and stems words.
  57. As an example, let’s take a look at a simple file document and how analyzers transform some of these fields. We are using the standard analyzer on the file name. This transforms the data by making it case insensitive, lowercasing everything, and breaking the name into words. This allows the user to search on extensions or partial names. When we store this field we store two values: one is the transformed value and the other one uses the keyword analyzer, which is the same string untouched.
  58. For the system id we use the hierarchical analyzer, because we use internal namespaces for different storage systems. Most of the time we query against the un-analyzed value, meaning the keyword analyzer output value. This is the field which allows us to filter files depending on the context of the UI.
  59. Every one of these sections represents a different system id.
  60. And we are also using the hierarchical and keyword analyzer for the path field.
  61. Now, we also need to index and filter files based on permissions. The way we manage these values is a bit simpler because we really only need a set of flags, as in “read”, “write”, “execute” and a username.
  62. This is what the permissions for a file look like. It is an array of objects with the username and the actual permissions stored in boolean flags. With this data we can easily list all the files a user has access to and show them in a nice interface like this.
  63. Assigning these analyzers to specific fields is done through mappings. Elasticsearch has an API to set up different mappings. I’ve mentioned that we usually use multiple analyzers in one field, like a hierarchical analyzer and a keyword analyzer. The way we do this is to create what is called a multifield; that way we can specify which transformed data we want to query.
  64. In this example we use the HTTP PUT verb to set the mapping of a specific field. We have to specify the index and document type in the URL as well as the properties we are updating.
  65. Here we are creating a multifield with two fields, one which will reference the hierarchical value (underscore path) and another one which will reference the string unmodified (underscore exact).
  66. Now, let’s take a look at some of the actual queries we are executing. First we have the query that allows us to create the Shared With Me listing. We want to list all the files/folders a user has access to in a specific system that are not children of the user’s home directory. I’ll show two possible ways to do this query.
  67. First, we create what is called a bool query. This type of query allows us to combine different sub queries and filters.
  68. Here we can see the filter we are using. This filter will return every document which has these specific values of username in the permissions object array and the system id. We can see how we are specifying the underscore exact field from the multifield we configured before. We want to use filters as much as possible because filters are cached.
  69. After we filter the necessary documents, we retrieve all documents whose path does not start with the username value. This returns all the documents we are looking for.
  70. Another way to do it is to take advantage of the hierarchical analyzer we set up and match all documents which do not have the value username in the hierarchical path array.
  71. We can also leverage the hierarchical analyzer to retrieve all the files/folders a user has access to under a specific folder like this.
  72. Another query we use a lot is to grab a query string from the user and get all documents matching that query string.
  73. For this we use Elasticsearch’s simple query string.
  74. It is really easy to use: we need to specify the query string and the set of fields to search on.
  75. This type of query has its own small language
  76. This type of query has its own small language and it will never return an error. If there’s any part of the query string that is not valid it will discard it.
  77. Here is an example of what it looks like in DesignSafe when we search for any pdf files.
  78. There are a few caveats when using Elasticsearch.
  79. Especially when indexing documents representing files in a file system, one has to be extra careful with duplicate and stale documents. This has to be managed externally since Elasticsearch does not do it automatically.
  80. Elasticsearch should not be treated the same way as a persistent DB, because it is really easy to delete an entire index or a bunch of documents. There should always be a strategy to quickly rebuild any index and, of course, regular backups.
  81. It is always difficult to synchronize a search index with the actual data, especially when building a search index for data in a file system. The way we tackle this is to have different scripts that periodically index newly created data as well as permissions.
  82. Finally, special attention should be paid to access management for Elasticsearch. There are different ways to protect your cluster: firewalls, basic HTTP authentication or one of the multiple tools you can add to Elasticsearch for authorization.
  83. We also use Elasticsearch in other parts of DesignSafe,
  84. like site-wide search,
  85. search and rendering of publications metadata
  86. and quick metrics calculations.