Data Virtualization
“Business demand for self-service
access to real-time data from multiple
data sources and in varied formats
complicates data management.”
- Gartner “Leveraging Data Virtualization in Modern Data Architectures”, April 5, 2019
Data Virtualization
Distributed Data
Management
The technology is based on the execution of distributed data management processing.
Flexibility
Consumed by applications,
query/reporting tools, message-oriented
middleware or other data management
infrastructure components.
Abstraction Layer
Layer of abstraction above the physical
implementation of data, to simplify
querying logic.
Multiple Data Sources
Used primarily for queries against
multiple heterogeneous data sources,
and federation of query results into
virtual views.
Virtual Integrated
Views
Data virtualization can be used to create virtualized, integrated views of data (in memory, rather than by executing data movement).
Gartner: Market Guide for Data Virtualization (16 Nov 2018);
Leveraging Data Virtualization in Modern Data Architectures (5 Apr 2019)
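The "virtual integrated views" idea is easiest to see in a query. Below is a minimal sketch, assuming a Dremio-style virtualization layer reachable over ODBC; the DSN, source names, and columns are all hypothetical.

```python
# Minimal sketch of a federated query through a data virtualization layer.
# Assumes a Dremio-style engine exposed over ODBC; the DSN ("Dremio"),
# source names ("oracle_es", "mysql_library"), and columns are hypothetical.
import pyodbc

conn = pyodbc.connect("DSN=Dremio;UID=svc_user;PWD=secret", autocommit=True)
cursor = conn.cursor()

# One SQL statement joins an Oracle table and a MySQL table; the
# virtualization layer federates the query and merges the results in memory,
# so no ETL job ever moves data between the two systems.
cursor.execute("""
    SELECT s.student_id, s.college, p.checkout_count
    FROM   oracle_es.students    AS s
    JOIN   mysql_library.patrons AS p
      ON   s.student_id = p.student_id
""")

for row in cursor.fetchmany(10):
    print(row.student_id, row.college, row.checkout_count)

conn.close()
```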
Poll: How prevalent are data silos at your school?
1 = We have nearly all data centralized
3 = We have some central data
5 = Most of our data is in data silos
Academic Freedom Culture?
Data Silos?
Resulting Problems:
• Lack of Centralized Data
• Disparate Systems
• Data Replication
• Broken Data Pipelines
• Overuse of ETL – just to move data
• Data Security (Authentication / Authorization)
Benefits from Data Virtualization
OIT Managed Data
Enrollment Services
Department of Continuing Education
Center for Teaching and Learning
Library
Marriott School of Management
Single Point of Entry / Authorization
Tableau
Business Objects
Excel
PowerBI
Python
R
SQL
Other
01 Reduction in ETL Development
02 Reduction in Data Replication
03 Flexible Data Pipeline for Data Science and Ad Hoc Analysis
04 Quicker DSA Approval and Delivery
05 Reduction in Large Tableau Data Refreshes
06 Breakdown of Data Silos / Departments Still Have Their Data
07 Row / Column / Masking Data Security
08 Addition of CSV, JSON, and Some XLSX Files
09 Combining Data Sources (Oracle, MS SQL, MySQL, AWS, Mongo)
10 Curated Data Sets (General and Surgical)
11 Acceleration of Queries (Caching of Data)
12 Pre-Aggregation Queries (Cube-Type OLAP; see the reflection sketch below)
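Items 11 and 12 refer to Dremio reflections: cached and pre-aggregated copies that the engine silently substitutes into matching queries. A minimal sketch, assuming Dremio's reflection DDL; the dataset and field names are hypothetical.

```python
# Sketch of a pre-aggregation "reflection" (items 11-12), assuming Dremio's
# reflection DDL; the dataset path and field names are hypothetical.
import pyodbc

conn = pyodbc.connect("DSN=Dremio;UID=svc_admin;PWD=secret", autocommit=True)
cursor = conn.cursor()

# Dremio maintains this cube-like aggregate in the background and substitutes
# it for matching GROUP BY queries, accelerating dashboards and large refreshes.
cursor.execute("""
    ALTER DATASET curated.enrollment_facts
    CREATE AGGREGATE REFLECTION enrollment_cube
    USING DIMENSIONS (college, term)
          MEASURES   (credit_hours (SUM), student_id (COUNT))
""")
conn.close()
```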
Need
Library and Enrollment Services (ES) each need data the other group has. The Library stores library and patron usage data in MySQL, MongoDB, and Oracle. ES has student demographic data in Oracle (currently centralized and managed by IT). Both will need Data Sharing Agreements (DSAs), and the data will need to be updated frequently.
Old
Extract the MongoDB data to a flat file. Build ETL to combine data from MySQL, Oracle (Library), and Oracle (ES). Data is joined on common business keys. Estimated time to delivery: 3-4+ weeks (not including the DSA).
New
Leave the data in its place. Use data virtualization to create virtual data sources, queried with SQL across all sources to combine and join the data. Change authorization for the new virtual data sets. Estimated time to delivery: 2 days to 1 week (not including the DSA).
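A sketch of what the new path can look like, assuming Dremio's VDS and GRANT syntax; the space, source, and key names are hypothetical.

```python
# Sketch of the "new" path: a virtual data set joining Library and ES data
# in place. Assumes Dremio's CREATE VDS and GRANT syntax; the space
# ("shared"), source names, and business keys are hypothetical.
import pyodbc

conn = pyodbc.connect("DSN=Dremio;UID=svc_admin;PWD=secret", autocommit=True)
cursor = conn.cursor()

# The virtual data set is just a saved query; nothing is extracted or copied.
# The MongoDB collection is queried with the same SQL as the relational sources.
cursor.execute("""
    CREATE VDS shared.library_es_usage AS
    SELECT es.student_id, es.major, lib.item_type, ev.last_checkout
    FROM   oracle_es.students         AS es
    JOIN   mysql_library.patrons      AS lib ON es.student_id = lib.student_id
    JOIN   mongo_library.usage_events AS ev  ON es.student_id = ev.student_id
""")

# Authorization is then granted on the virtual data set itself.
cursor.execute("GRANT SELECT ON VDS shared.library_es_usage TO ROLE es_analysts")
conn.close()
```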
Need
General Studies needs an analysis of the order of courses taken to meet the Language of Learning requirement. Course data is available, but sequencing and analysis will be done in SAS. Output will be a CSV file, which will need to be enriched with demographic data for the students who took specific classes.
Old Way
Extract course data into CSV. Analyze the data in SAS. Export the SAS result file to CSV. Load the CSV into Oracle. Enrich the SAS result data with other Oracle data. Use Tableau to deliver a dashboard of the results.
New Way
Leave the data in its place. Use Dremio to feed data into SAS via ODBC. Output results are stored on a NAS drive as CSV. Demographic data is added through a virtual data set. Tableau points to the Dremio data set. Dremio becomes a data science sandbox.
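SAS attaches to Dremio over ODBC; the sketch below shows the same pull pattern in Python (the real analysis stays in SAS). The DSN, dataset, and NAS path are hypothetical.

```python
# Sketch of the ODBC pull pattern used in the "new way". SAS does the real
# analysis over the same ODBC channel; this Python version only illustrates
# the flow, with hypothetical DSN, dataset, and path names.
import csv
import pyodbc

conn = pyodbc.connect("DSN=Dremio;UID=svc_user;PWD=secret")
cursor = conn.cursor()

# Pull the course-sequence rows straight from the virtual data set;
# no extract/load step ever materializes them in Oracle.
cursor.execute("""
    SELECT student_id, course_code, term_taken
    FROM   curated.language_of_learning_courses
    ORDER  BY student_id, term_taken
""")

# Results land on the NAS share as CSV, ready for enrichment and Tableau.
with open("/mnt/nas/lol_sequence_results.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow([d[0] for d in cursor.description])
    writer.writerows(cursor.fetchall())

conn.close()
```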
Need
A large campus department wanted to do a turnover analysis of their administrative and student employees, requiring 5+ years of data. No standard analysis process exists. The data is in PS Oracle and will be combined with the department's internal job descriptions and classifications. The HR department is concerned about additional data in the tables "coming along for the ride."
Old
Use ETL to create a custom table, or a custom extract into department databases. The department will perform its analysis in MATLAB. But how would it be updated?
New
Leave the data in its place. Use SQL in Dremio to query all sources and combine and join the data. MATLAB uses ODBC to query the virtual data set for analysis. BONUS: the DSA was based on the virtual data set rather than on multiple underlying Oracle tables, so no data came along for the ride. Time to delivery: 2 weeks (including the DSA).
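The "no data coming along for the ride" bonus is just column projection: the DSA covers exactly what the virtual data set exposes. A sketch under the same assumptions, with hypothetical names:

```python
# Sketch of a DSA-scoped virtual data set: only the approved columns are
# projected, so nothing extra "comes along for the ride". Dataset, source,
# and role names are hypothetical.
import pyodbc

conn = pyodbc.connect("DSN=Dremio;UID=svc_admin;PWD=secret", autocommit=True)
cursor = conn.cursor()

# The DSA is written against this view, not the underlying Oracle tables.
cursor.execute("""
    CREATE VDS dsa.dept_turnover AS
    SELECT emp.employee_id, emp.hire_date, emp.term_date,
           job.internal_classification
    FROM   ps_oracle.job_history       AS emp
    JOIN   dept_db.job_classifications AS job
      ON   emp.job_code = job.job_code
    WHERE  emp.hire_date >= DATE '2015-01-01'   -- 5+ years of history
""")

# MATLAB then reads dsa.dept_turnover over the same ODBC channel.
cursor.execute("GRANT SELECT ON VDS dsa.dept_turnover TO ROLE dept_analysts")
conn.close()
```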
Need
Our Security Operations Center recently expanded its coverage to include additional Church-related academic institutions. One service it offers is Threat and Federated Intelligence. Providing that same service to all campuses, across multiple heterogeneous systems, will be a challenge.
Old Way
Use Rundeck, Python ETL, and other tools to build the automation: automation out of each system, and from those systems into S3. Somehow combine the enriching data in Oracle with S3 (another ETL?).
New Way
Develop an event-driven microservices architecture. Microservices pull data from the systems and save it as JSON files in AWS S3. Dremio makes each JSON file appear as a table, which is joined with other tables in Dremio to enrich the data. The data can then feed reporting or other ad hoc analysis.
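A sketch of one such microservice, assuming boto3 for the S3 write; the bucket, key, and dataset names are hypothetical.

```python
# Sketch of one microservice in the event-driven "new way": it pulls events
# from a campus system and lands them in S3 as JSON, where Dremio exposes
# the files as tables. Bucket, key, and dataset names are hypothetical.
import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

def publish_events(campus: str, events: list[dict]) -> None:
    """Write one JSON-lines object per batch; Dremio reads the prefix as a table."""
    body = "\n".join(json.dumps(e) for e in events)
    s3.put_object(
        Bucket="soc-threat-intel",
        Key=f"events/{campus}/{datetime.now(timezone.utc):%Y%m%dT%H%M%S}.json",
        Body=body.encode("utf-8"),
    )

# Downstream, Dremio joins the S3 JSON with Oracle enrichment data, e.g.:
#   SELECT e.indicator, o.owner_dept
#   FROM   s3_intel."events" AS e
#   JOIN   oracle_soc.assets AS o ON e.asset_id = o.asset_id
```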
1. Single point of entry: authentication / authorization of data (row, column, masking)
2. Flexible tool (ODBC and direct connections)
3. Leave data at the source (less data replication)
4. Easier access to data across campus (via DSA, with no data coming along for the ride)
5. Breakdown of existing silos
6. Use flat files as data sources (CSV, Excel, JSON)
7. Virtual Data Warehouse (enterprise view; see the sketch after this list)
   Dimensions: Student, Faculty, Admin, Date, OU Structure, HR Structure, Colleges, Courses
   Measures: GPA, Enrollments, Counts, Averages, Hours
8. Curated data: data sets that make sense (some individualized). THIS is where we really help a lot of the Have Nots.
9. Work with Data Stewards to make "pre-approved" data sets
10. Ability to search data (auto cataloging and tagging)
11. Reflections: data and aggregation query acceleration
12. Rapid prototyping: data proof of concept (avoid extensive ETL)
13. Queries across multiple database platforms, on-prem and cloud
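Item 7's Virtual Data Warehouse is the same virtual-data-set idea at enterprise scale: dimension and fact views defined over the live sources and queried like a star schema. A minimal sketch with hypothetical names:

```python
# Sketch of the "virtual data warehouse" (item 7): measures computed over
# dimension views that exist only as virtual data sets. Names are hypothetical.
import pyodbc

conn = pyodbc.connect("DSN=Dremio;UID=svc_user;PWD=secret")
cursor = conn.cursor()

# A typical enterprise-view query: measures (average GPA, enrollment counts)
# grouped by dimensions (college, term), with no physical warehouse behind it.
cursor.execute("""
    SELECT d.college, d.term,
           AVG(f.gpa)          AS avg_gpa,
           COUNT(f.student_id) AS enrollments
    FROM   warehouse.enrollment_facts AS f
    JOIN   warehouse.dim_courses      AS d ON f.course_id = d.course_id
    GROUP  BY d.college, d.term
""")

for row in cursor.fetchall():
    print(row.college, row.term, row.avg_gpa, row.enrollments)
conn.close()
```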