Watch full webinar here: https://bit.ly/3mJJ4w9
Advanced data science techniques, like machine learning, have proven extremely useful for deriving valuable insights from existing data. Platforms like Spark, and complex libraries for R, Python and Scala, put advanced techniques at the fingertips of data scientists. However, these data scientists spend most of their time looking for the right data and massaging it into a usable format. Data virtualization offers a new alternative to address these issues in a more efficient and agile way.
Attend this session to learn how companies can use data virtualization to:
- Create a logical architecture to make all enterprise data available for advanced analytics exercises
- Accelerate data acquisition and massaging, providing the data scientist with a powerful tool to complement their practice
- Integrate popular tools from the data science ecosystem: Spark, Python, Zeppelin, Jupyter, etc.
2. How Data Virtualization Puts Enterprise Machine Learning Programs into Production
Chris Day
Director, APAC Sales Engineering, Denodo
Sushant Kumar
Product Marketing Manager, Denodo
3. Agenda
1. What are Advanced Analytics?
2. The Data Challenge
3. The Rise of Logical Data Architectures
4. Tackling the Data Pipeline Problem
5. Customer Stories
6. Key Takeaways
7. Q&A
8. Next Steps
5.
Advanced Analytics & Machine Learning Exercises Need Data
Improving Patient Outcomes
Data includes patient demographics, family history, patient vitals, lab test results, claims data, etc.
Predictive Maintenance
Data includes maintenance logs and data coming in from sensors, including temperature, running time, power level duration, etc.
Predicting Late Payment
Data includes company or individual demographics, payment history, customer support logs, etc.
Preventing Fraud
Data includes the location where the claim originated, time of day, claimant history and any recent adverse events.
Reducing Customer Churn
Data includes customer demographics, products purchased, products used, past transactions, company size, history, revenue, etc.
7.
Gartner, Adopt the Logical Data Warehouse Architecture to Meet Your Modern
Analytical Needs, May 2018
“When designed properly, Data Virtualization can speed data
integration, lower data latency, offer flexibility and reuse, and reduce
data sprawl across dispersed data sources. Due to its many benefits,
Data Virtualization is often the first step for organizations evolving a
traditional, repository-style data warehouse into a Logical Architecture”
9.
Why A Logical Architecture Is Needed
• The analytical technology landscape has shifted over time.
• You need a flexible architecture that lets you embrace those shifts rather than tying you down to a monolithic approach.
• Only a logical architecture, not a physical one, can easily accommodate such changes.
• IT should be able to adopt newer technologies without impacting business users.
11.
Typical Data Science Workflow
A typical workflow for a data scientist is:
1. Gather the requirements for the business problem
2. Identify useful data
   • Ingest data
3. Cleanse data into a useful format
4. Analyze data
5. Prepare input for your algorithms
6. Execute data science algorithms (ML, AI, etc.)
   • Iterate steps 2 to 6 until valuable insights are produced
7. Visualize and share
Source: http://sudeep.co/data-science/Understanding-the-Data-Science-Lifecycle/
12.
Where Does Your Time Go?
• 80% of time – Finding and preparing the data
• 10% of time – Analysis
• 10% of time – Visualizing data
Source: http://sudeep.co/data-science/Understanding-the-Data-Science-Lifecycle/
13.
Where Does Your Time Go?
A large amount of time and effort goes into tasks not intrinsically related to data science:
• Finding where the right data may be
• Getting access to the data
   - Bureaucracy
   - Understanding access methods and technologies (NoSQL, REST APIs, etc.)
• Transforming data into a format that is easy to work with
• Combining data originally available in different sources and formats
• Profiling and cleansing data to eliminate incomplete or inconsistent data points
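To make the combining and cleansing tasks above concrete, here is a minimal Python sketch. The two inline sources (a CSV export and a JSON payload), the field names, and the `combine_and_cleanse` helper are all hypothetical and chosen only for illustration; they do not come from the webinar.

```python
import csv
import io
import json

# Hypothetical source 1: a CSV export of customer demographics.
CSV_SOURCE = """customer_id,region,company_size
1,APAC,250
2,EMEA,
3,APAC,40
"""

# Hypothetical source 2: a JSON payload of payment history.
JSON_SOURCE = """[
  {"customer_id": 1, "late_payments": 0},
  {"customer_id": 2, "late_payments": 3},
  {"customer_id": 3, "late_payments": 1}
]"""


def combine_and_cleanse(csv_text, json_text):
    """Join both sources on customer_id and drop incomplete records."""
    payments = {row["customer_id"]: row for row in json.loads(json_text)}
    combined = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Cleanse: skip rows with any missing (empty) field.
        if any(value == "" for value in row.values()):
            continue
        customer_id = int(row["customer_id"])
        payment = payments.get(customer_id)
        if payment is None:
            continue  # no matching record in the second source
        combined.append({
            "customer_id": customer_id,
            "region": row["region"],
            "company_size": int(row["company_size"]),
            "late_payments": payment["late_payments"],
        })
    return combined


records = combine_and_cleanse(CSV_SOURCE, JSON_SOURCE)
```

Even in this toy case, most of the code deals with formats, type conversion and missing values rather than analysis, which is the point the slide makes.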
14.
Data Scientist Workflow
Identify useful data → Modify data into a useful format → Analyze data → Prepare for ML algorithm → Execute data science algorithms (ML, AI, etc.)
15.
Identify Useful Data
If the company has a virtual layer with good coverage of data sources, this task is greatly simplified.
• A data virtualization tool like Denodo can offer unified access to all data available in the company.
• It abstracts the underlying technologies, offering a standard SQL interface to query and manipulate data.
To further simplify the challenge, Denodo offers a Data Catalog to search, find and explore your data assets.
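The idea of one SQL interface over several underlying sources can be sketched in Python as follows. This is only an analogy, not Denodo's API: SQLite's `ATTACH DATABASE` stands in for the virtual layer, and the `crm` and `billing` databases, their tables and their rows are hypothetical.

```python
import sqlite3

# One connection plays the role of the virtual layer; each attached
# in-memory database stands in for a separate underlying source.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS crm")
conn.execute("ATTACH DATABASE ':memory:' AS billing")

# Hypothetical source tables and sample rows.
conn.execute("CREATE TABLE crm.customers (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE billing.invoices (customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO crm.customers VALUES (?, ?)",
    [(1, "Acme"), (2, "Globex")],
)
conn.executemany(
    "INSERT INTO billing.invoices VALUES (?, ?)",
    [(1, 100.0), (1, 50.0), (2, 75.0)],
)

# A single standard SQL query joins data that lives in two separate sources.
rows = conn.execute("""
    SELECT c.name, SUM(i.amount) AS total
    FROM crm.customers AS c
    JOIN billing.invoices AS i ON i.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
```

The consumer writes one query and does not need to know where each table physically lives; a real virtual layer extends the same idea to heterogeneous systems.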
17.
Ingestion And Data Manipulation Tasks
Data virtualization offers the unique opportunity of using standard SQL (joins, aggregations, transformations, etc.) to access, manipulate and analyze any data.
Cleansing and transformation steps can be easily accomplished in SQL.
Its modeling capabilities enable the definition of views that embed this logic to foster reusability.
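As a rough illustration of a view that embeds cleansing and transformation logic, the Python sketch below uses an in-memory SQLite database as a stand-in for the virtual layer. The `sensor_raw` table, the `sensor_clean` view and the sample readings are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical raw sensor data with inconsistent names and missing values.
conn.execute(
    "CREATE TABLE sensor_raw (machine TEXT, temperature TEXT, running_hours REAL)"
)
conn.executemany(
    "INSERT INTO sensor_raw VALUES (?, ?, ?)",
    [
        (" pump-1 ", "71.5", 120.0),
        ("PUMP-2", "", 80.0),       # missing temperature
        ("pump-3", "68.0", None),   # missing running time
        ("Pump-4", "90.25", 300.0),
    ],
)

# The view embeds the cleansing rules once, so every consumer reuses them:
# normalize machine names, cast temperature to a number, drop incomplete rows.
conn.execute("""
    CREATE VIEW sensor_clean AS
    SELECT LOWER(TRIM(machine)) AS machine,
           CAST(temperature AS REAL) AS temperature,
           running_hours
    FROM sensor_raw
    WHERE temperature <> '' AND running_hours IS NOT NULL
""")

clean_rows = conn.execute(
    "SELECT * FROM sensor_clean ORDER BY machine"
).fetchall()
```

Because the rules live in the view definition, a change to the cleansing logic is made in one place and picked up by every query that reads from the view.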
19.
Prologis Launches Data Analytics Program for Cost Optimization
Background
• Create a single governed data access layer to build reusable and consistent analytical assets that the rest of the business teams can use to run their own analytics.
• Save data scientists time in finding, transforming and analyzing data sets without having to learn new skills, and create data models that can be refreshed on demand.
• Efficiently maintain the new data architecture with minimum downtime and configuration management.
Prologis is the largest industrial real estate company in the world, serving 5000 customers in over 20 countries, with USD 87 billion in assets under management.
21.
Data Virtualization Benefits Experienced by Prologis
• The analytics team was able to create business-focused subject areas with consistent data sets, improving speed to analytics by 30%.
• Denodo made it possible for Prologis to quick-start advanced analytics projects.
• Deploying the Denodo Platform was as easy as the click of a button, with centralized configuration management. This simplified Prologis’s data architecture and helped bring down the overall maintenance cost.
22.
Luke Slotwinski, VP, IT Data and Analytics at Prologis
The speed of business is faster than before. It is now critical
to be able to make decisions on a dime to pivot the business
in its needed direction. This is why Prologis went with the
Denodo Platform.
23.
Data Virtualization Benefits for AI and Machine Learning Projects
• Denodo can play a key role in the data science ecosystem, reducing data exploration and analysis timeframes.
• It extends and integrates with the capabilities of notebooks, Python, R, etc. to improve the data scientist’s toolset.
• It provides a modern “SQL-on-Anything” engine.
• It can leverage Big Data technologies like Spark (as a data source, an ingestion tool and for external processing) to work efficiently with large data volumes.
• It offers new and expanded tools for data scientists and citizen analysts, such as the “Apache Zeppelin for Denodo” notebook.
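One way a Python notebook might consume a SQL view and prepare input for an ML algorithm is sketched below. It assumes only that the layer is reachable through some Python DB-API 2.0 driver; `sqlite3` stands in for that driver here, and the `load_features` helper, the `churn_view` table and its columns are hypothetical.

```python
import sqlite3


def load_features(conn, query, label_column):
    """Run `query` on a DB-API connection and split rows into features and labels."""
    cursor = conn.cursor()
    cursor.execute(query)
    # cursor.description holds one entry per result column; item 0 is its name.
    columns = [description[0] for description in cursor.description]
    label_index = columns.index(label_column)
    features, labels = [], []
    for row in cursor.fetchall():
        labels.append(row[label_index])
        features.append([v for i, v in enumerate(row) if i != label_index])
    feature_names = [c for c in columns if c != label_column]
    return feature_names, features, labels


# Stand-in for a connection to the virtual layer, with hypothetical data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE churn_view (tenure_months INTEGER, products_used INTEGER, churned INTEGER)"
)
conn.executemany(
    "INSERT INTO churn_view VALUES (?, ?, ?)",
    [(12, 3, 0), (2, 1, 1), (30, 5, 0)],
)

names, X, y = load_features(
    conn,
    "SELECT tenure_months, products_used, churned FROM churn_view ORDER BY tenure_months",
    "churned",
)
```

The resulting `X` and `y` lists can be handed to whichever ML library the data scientist prefers; swapping the stand-in connection for a real one would leave `load_features` unchanged.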
25.
Key Takeaways
• Information architectures are getting more complex, more diverse, and more distributed.
• Traditional technologies and data replication don’t cut it anymore.
• Data virtualization makes it quick and easy to expose data from multiple sources to your users.
• Data virtualization provides the governance and management infrastructure required for successful data management.