|
HRS |
Increase the productivity of data users:
● data scientists
● data analysts
● BI engineers
Why do we need it
Title of presentation 2
Step 1: Search and find the data
Step 2: Understand the data
Step 3: Perform analysis and visualization
Step 4: Make a decision and/or share insights
Data-Driven Decision Making Process
Data Discovery
1. Ask coworkers
2. Ask in wider Zoom channel
3. Search over Confluence
4. Search over Repositories
5. Explore using SELECT * SQL queries
Challenge: Search and find the data
● Multiple results: which one is correct or up to date?
● What do different columns mean?
Challenge: Understand the data
1. Discover new data sources
2. Identify end users to notify them of changes
3. Understand the popularity and trustworthiness of data
4. Investigate/monitor the magnitude of protected data exposure
5. Know what your boss or colleagues are using
6. Talk to upstream producers
7. +30% productivity for data users
Metadata is the key to the next big data wave
What types of questions do we want to answer?
● First person to explore both the North and South Poles
● Norwegian explorer, Roald Amundsen
Amundsen: Person
• Amundsen is a data discovery and metadata engine for improving the productivity of data users
• It does that today by indexing data resources (tables, dashboards, streams, etc.) and powering a PageRank-style search based on usage patterns (e.g. highly queried tables show up earlier than less queried tables)
• Think of it as Google search for data
Amundsen: The tool
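To make the usage-based ranking idea concrete, here is a minimal sketch in Python. It is not Amundsen's actual ranking code: the `TableDoc` record, the `query_count` signal, and the scoring formula are all assumptions chosen for illustration. It only shows how a text match can be boosted by how often a table is queried.

```python
import math
from dataclasses import dataclass


@dataclass
class TableDoc:
    name: str
    description: str
    query_count: int  # hypothetical usage signal: how often the table was queried


def text_match(doc: TableDoc, term: str) -> int:
    """Count case-insensitive occurrences of the term in name and description."""
    term = term.lower()
    return doc.name.lower().count(term) + doc.description.lower().count(term)


def search(docs, term):
    """Return matching docs, highest score first, boosting highly queried tables."""
    scored = []
    for doc in docs:
        matches = text_match(doc, term)
        if matches == 0:
            continue  # no textual match, so the table is not a result at all
        # log1p dampens the usage boost so popularity does not swamp text relevance.
        score = matches * (1.0 + math.log1p(doc.query_count))
        scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored]


# Illustrative catalog; table names and counts are made up.
catalog = [
    TableDoc("rides_raw", "raw rides events", 3),
    TableDoc("rides_daily", "daily rides aggregate", 5000),
    TableDoc("drivers", "driver dimension", 900),
]
results = [doc.name for doc in search(catalog, "rides")]
```

With this catalog, both `rides_*` tables match equally well on text, so the heavily queried `rides_daily` is listed before `rides_raw`, and `drivers` is not returned.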
Architecture: Key components
[Architecture diagram]
Metadata Sources: Athena, MSSql, Exasol, ..., Glue, CI/CD, Source File
Databuilder Crawler
Stores: Neo4j, Elasticsearch
Services: Metadata Service, Search Service, Frontend Service
Other Microservices: ML Feature Service, Security Service
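The data flow in the diagram can be sketched as a small extract-and-load pipeline. This is a simplified illustration in plain Python, not the Databuilder API: `SOURCE_ROWS`, `extract`, and `load` are hypothetical names, and a dict and a list stand in for Neo4j and Elasticsearch.

```python
from collections import defaultdict

# Hypothetical rows, as a crawler might read them from a source catalog.
SOURCE_ROWS = [
    {"schema": "sales", "table": "orders", "column": "order_id"},
    {"schema": "sales", "table": "orders", "column": "warehouse_cost"},
    {"schema": "web", "table": "event_click", "column": "user_id"},
]


def extract(rows):
    """Group raw column rows into one metadata record per table."""
    tables = defaultdict(list)
    for row in rows:
        key = f'{row["schema"]}.{row["table"]}'
        tables[key].append(row["column"])
    return [{"key": key, "columns": cols} for key, cols in tables.items()]


def load(records, graph_store, search_index):
    """Write each record to both stores, mirroring the two backends in the diagram."""
    for record in records:
        # Stand-in for the graph store (Neo4j) read by the Metadata Service.
        graph_store[record["key"]] = record["columns"]
        # Stand-in for the search index (Elasticsearch) read by the Search Service.
        search_index.append(" ".join([record["key"], *record["columns"]]))


graph_store, search_index = {}, []
load(extract(SOURCE_ROWS), graph_store, search_index)
```

The point of the split is that the same crawled metadata feeds two stores with different jobs: one answers "what is this table?" and the other answers "which tables match this search term?".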
Elasticsearch for search and relevance
● Normal search: match records based on relevancy
● Category search: match records first based on data type, then relevancy
○ column: warehouse_cost
● Wildcard search:
○ event_*
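The three search modes above could map onto Elasticsearch query bodies roughly as follows. This is a sketch, not Amundsen's actual query builder: `build_query` and the field names `name`, `description`, and `column` are assumptions, while `multi_match`, `match`, and `wildcard` are standard Elasticsearch Query DSL query types.

```python
def build_query(term: str) -> dict:
    """Translate a search box term into an Elasticsearch query body (as a dict)."""
    term = term.strip()
    if ":" in term:
        # Category search, e.g. "column: warehouse_cost":
        # restrict the match to the named field, then rank by relevancy.
        category, _, value = term.partition(":")
        return {"query": {"match": {category.strip(): value.strip()}}}
    if "*" in term:
        # Wildcard search, e.g. "event_*" (the "name" field is an assumption).
        return {"query": {"wildcard": {"name": {"value": term}}}}
    # Normal search: relevancy across several fields (field list is an assumption).
    return {
        "query": {
            "multi_match": {
                "query": term,
                "fields": ["name", "description", "column"],
            }
        }
    }


normal = build_query("rides")
category = build_query("column: warehouse_cost")
wildcard = build_query("event_*")
```

Each returned dict is only the request body; sending it to a cluster and the exact index mapping are left out here.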
Amundsen uses Apache Airflow to orchestrate Databuilder jobs