Data Virtualization
“Business demand for self-service
access to real-time data from multiple
data sources and in varied formats
complicates data management.”
- Gartner “Leveraging Data Virtualization in Modern Data Architectures”, April 5, 2019
Data Virtualization
Distributed Data
Management
The technology is based on the execution of distributed data management processing.
Flexibility
Consumed by applications,
query/reporting tools, message-oriented
middleware or other data management
infrastructure components.
Abstraction Layer
Layer of abstraction above the physical
implementation of data, to simplify
querying logic.
Multiple Data Sources
Used primarily for queries against
multiple heterogeneous data sources,
and federation of query results into
virtual views.
Virtual Integrated
Views
Data virtualization can be used to create virtualized, integrated views of data (in memory, rather than by executing data movement).
Gartner: Market Guide for Data Virtualization (16 Nov 2018);
Leveraging Data Virtualization in Modern Data Architectures (5 Apr 2019)
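The "virtual integrated views" idea is easiest to see in a query. Below is a minimal sketch, assuming a Dremio-style virtualization layer reachable over ODBC; the DSN, source names, and columns are all hypothetical.

```python
# Minimal sketch of a federated query through a data virtualization layer.
# Assumes a Dremio-style engine exposed over ODBC; the DSN ("Dremio"),
# source names ("oracle_es", "mysql_library"), and columns are hypothetical.
import pyodbc

conn = pyodbc.connect("DSN=Dremio;UID=svc_user;PWD=secret", autocommit=True)
cursor = conn.cursor()

# One SQL statement joins an Oracle table and a MySQL table; the
# virtualization layer federates the query and merges the results in memory,
# so no ETL job ever moves data between the two systems.
cursor.execute("""
    SELECT s.student_id, s.college, p.checkout_count
    FROM   oracle_es.students    AS s
    JOIN   mysql_library.patrons AS p
      ON   s.student_id = p.student_id
""")

for row in cursor.fetchmany(10):
    print(row.student_id, row.college, row.checkout_count)

conn.close()
```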
Poll: How prevalent are data silos at your school?
1 = We have nearly all data centralized
3 = We have some central data
5 = Most of our data is in data silos
Academic Freedom Culture?
Data Silos?
Resulting Problems:
• Lack of Centralized Data
• Disparate Systems
• Data Replication
• Broken Data Pipelines
• Overuse of ETL – just to move data
• Data Security (Authentication / Authorization)
Benefits from Data Virtualization
OIT Managed Data
Enrollment Services
Department of Continuing Education
Center for Teaching and Learning
Library
Marriott School of Management
Single Point of Entry / Authorization
Tableau
Business Objects
Excel
PowerBI
Python
R
SQL
Other
01 Reduction in ETL Development
02 Reduction in Data Replication
03 Flexible Data Pipeline for Data Science and Ad Hoc Analysis
04 Quicker DSA Approval and Delivery
05 Reduction in Large Tableau Data Refreshes
06 Breakdown of Data Silos / Departments Still Have Their Data
07 Row / Column / Masking Data Security
08 Addition of CSV, JSON, and Some XLSX Files
09 Combining Data Sources (Oracle, MS SQL, MySQL, AWS, Mongo)
10 Curated Data Sets (General and Surgical)
11 Acceleration of Queries (Caching of Data)
12 Pre-Aggregation Queries (Cube-Type OLAP; see the reflection sketch below)
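Items 11 and 12 refer to Dremio reflections: cached and pre-aggregated copies that the engine silently substitutes into matching queries. A minimal sketch, assuming Dremio's reflection DDL; the dataset and field names are hypothetical.

```python
# Sketch of a pre-aggregation "reflection" (items 11-12), assuming Dremio's
# reflection DDL; the dataset path and field names are hypothetical.
import pyodbc

conn = pyodbc.connect("DSN=Dremio;UID=svc_admin;PWD=secret", autocommit=True)
cursor = conn.cursor()

# Dremio maintains this cube-like aggregate in the background and substitutes
# it for matching GROUP BY queries, accelerating dashboards and large refreshes.
cursor.execute("""
    ALTER DATASET curated.enrollment_facts
    CREATE AGGREGATE REFLECTION enrollment_cube
    USING DIMENSIONS (college, term)
          MEASURES   (credit_hours (SUM), student_id (COUNT))
""")
conn.close()
```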
Need
Library and Enrollment Services (ES) each need data the other group has. The Library stores library and patron usage data in MySQL, MongoDB, and Oracle. ES has student demographic data in Oracle (currently centralized and managed by IT). Both will need Data Sharing Agreements (DSAs), and the data will need to be updated frequently.
Old
Extract the MongoDB data to a flat file. Build ETL to combine data from MySQL, Oracle (Library), and Oracle (ES). Data is joined on common business keys. Estimated time to delivery: 3-4+ weeks (not including the DSA).
New
Leave the data in its place. Use data virtualization to create virtual data sources, queried with SQL across all sources to combine and join the data. Change authorization for the new virtual data sets. Estimated time to delivery: 2 days to 1 week (not including the DSA).
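A sketch of what the new path can look like, assuming Dremio's VDS and GRANT syntax; the space, source, and key names are hypothetical.

```python
# Sketch of the "new" path: a virtual data set joining Library and ES data
# in place. Assumes Dremio's CREATE VDS and GRANT syntax; the space
# ("shared"), source names, and business keys are hypothetical.
import pyodbc

conn = pyodbc.connect("DSN=Dremio;UID=svc_admin;PWD=secret", autocommit=True)
cursor = conn.cursor()

# The virtual data set is just a saved query; nothing is extracted or copied.
# The MongoDB collection is queried with the same SQL as the relational sources.
cursor.execute("""
    CREATE VDS shared.library_es_usage AS
    SELECT es.student_id, es.major, lib.item_type, ev.last_checkout
    FROM   oracle_es.students         AS es
    JOIN   mysql_library.patrons      AS lib ON es.student_id = lib.student_id
    JOIN   mongo_library.usage_events AS ev  ON es.student_id = ev.student_id
""")

# Authorization is then granted on the virtual data set itself.
cursor.execute("GRANT SELECT ON VDS shared.library_es_usage TO ROLE es_analysts")
conn.close()
```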
Need
General Studies needs an analysis of the order of courses taken to meet the Language of Learning requirement. Course data is available, but sequencing and analysis will be done in SAS. Output will be a CSV file, which will need to be enriched with demographic data for the students who took specific classes.
Old Way
Extract course data into CSV. Analyze the data in SAS. Export the SAS result file to CSV. Load the CSV into Oracle. Enrich the SAS result data with other Oracle data. Use Tableau to deliver a dashboard of the results.
New Way
Leave the data in its place. Use Dremio to feed data into SAS via ODBC. Output results are stored on a NAS drive as CSV. Demographic data is added through a virtual data set. Tableau points to the Dremio data set. Dremio becomes a data science sandbox.
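SAS attaches to Dremio over ODBC; the sketch below shows the same pull pattern in Python (the real analysis stays in SAS). The DSN, dataset, and NAS path are hypothetical.

```python
# Sketch of the ODBC pull pattern used in the "new way". SAS does the real
# analysis over the same ODBC channel; this Python version only illustrates
# the flow, with hypothetical DSN, dataset, and path names.
import csv
import pyodbc

conn = pyodbc.connect("DSN=Dremio;UID=svc_user;PWD=secret")
cursor = conn.cursor()

# Pull the course-sequence rows straight from the virtual data set;
# no extract/load step ever materializes them in Oracle.
cursor.execute("""
    SELECT student_id, course_code, term_taken
    FROM   curated.language_of_learning_courses
    ORDER  BY student_id, term_taken
""")

# Results land on the NAS share as CSV, ready for enrichment and Tableau.
with open("/mnt/nas/lol_sequence_results.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow([d[0] for d in cursor.description])
    writer.writerows(cursor.fetchall())

conn.close()
```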
Need
A large campus department wanted to do a turnover analysis of their administrative and student employees, requiring 5+ years of data. No standard analysis process exists. The data is in PS Oracle and will be combined with the department's internal job descriptions and classifications. The HR department is concerned about additional data in the tables "coming along for the ride."
Old
Use ETL to create a custom table, or a custom extract into department databases. The department will perform its analysis in MATLAB. But how would it be updated?
New
Leave the data in its place. Use SQL in Dremio to query all sources and combine and join the data. MATLAB uses ODBC to query the virtual data set for analysis. BONUS: the DSA was based on the virtual data set rather than on multiple underlying Oracle tables, so no data came along for the ride. Time to delivery: 2 weeks (including the DSA).
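The "no data coming along for the ride" bonus is just column projection: the DSA covers exactly what the virtual data set exposes. A sketch under the same assumptions, with hypothetical names:

```python
# Sketch of a DSA-scoped virtual data set: only the approved columns are
# projected, so nothing extra "comes along for the ride". Dataset, source,
# and role names are hypothetical.
import pyodbc

conn = pyodbc.connect("DSN=Dremio;UID=svc_admin;PWD=secret", autocommit=True)
cursor = conn.cursor()

# The DSA is written against this view, not the underlying Oracle tables.
cursor.execute("""
    CREATE VDS dsa.dept_turnover AS
    SELECT emp.employee_id, emp.hire_date, emp.term_date,
           job.internal_classification
    FROM   ps_oracle.job_history       AS emp
    JOIN   dept_db.job_classifications AS job
      ON   emp.job_code = job.job_code
    WHERE  emp.hire_date >= DATE '2015-01-01'   -- 5+ years of history
""")

# MATLAB then reads dsa.dept_turnover over the same ODBC channel.
cursor.execute("GRANT SELECT ON VDS dsa.dept_turnover TO ROLE dept_analysts")
conn.close()
```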
Need
Our Security Operations Center recently expanded its coverage to include additional Church-related academic institutions. One service it offers is Threat and Federated Intelligence. Providing that same service to all campuses, across multiple heterogeneous systems, will be a challenge.
Old Way
Use Rundeck, Python ETL, and other tools to build the automation: automation out of each system, and from those systems into S3. Somehow combine the enriching data in Oracle with S3 (another ETL?).
New Way
Develop an event-driven microservices architecture. Microservices pull data from the systems and save it as JSON files in AWS S3. Dremio makes each JSON file appear as a table, which is joined with other tables in Dremio to enrich the data. The data can then feed reporting or other ad hoc analysis.
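A sketch of one such microservice, assuming boto3 for the S3 write; the bucket, key, and dataset names are hypothetical.

```python
# Sketch of one microservice in the event-driven "new way": it pulls events
# from a campus system and lands them in S3 as JSON, where Dremio exposes
# the files as tables. Bucket, key, and dataset names are hypothetical.
import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

def publish_events(campus: str, events: list[dict]) -> None:
    """Write one JSON-lines object per batch; Dremio reads the prefix as a table."""
    body = "\n".join(json.dumps(e) for e in events)
    s3.put_object(
        Bucket="soc-threat-intel",
        Key=f"events/{campus}/{datetime.now(timezone.utc):%Y%m%dT%H%M%S}.json",
        Body=body.encode("utf-8"),
    )

# Downstream, Dremio joins the S3 JSON with Oracle enrichment data, e.g.:
#   SELECT e.indicator, o.owner_dept
#   FROM   s3_intel."events" AS e
#   JOIN   oracle_soc.assets AS o ON e.asset_id = o.asset_id
```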
1. Single point of entry: authentication / authorization of data (row, column, masking)
2. Flexible tool (ODBC and direct connections)
3. Leave data at the source (less data replication)
4. Easier access to data across campus (via DSA, with no data coming along for the ride)
5. Breakdown of existing silos
6. Use flat files as data sources (CSV, Excel, JSON)
7. Virtual Data Warehouse (enterprise view; see the sketch after this list)
   Dimensions: Student, Faculty, Admin, Date, OU Structure, HR Structure, Colleges, Courses
   Measures: GPA, Enrollments, Counts, Averages, Hours
8. Curated data: data sets that make sense (some individualized). THIS is where we really help a lot of the Have Nots.
9. Work with Data Stewards to make "pre-approved" data sets
10. Ability to search data (auto cataloging and tagging)
11. Reflections: data and aggregation query acceleration
12. Rapid prototyping: data proof of concept (avoid extensive ETL)
13. Queries across multiple database platforms, on-prem and cloud
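Item 7's Virtual Data Warehouse is the same virtual-data-set idea at enterprise scale: dimension and fact views defined over the live sources and queried like a star schema. A minimal sketch with hypothetical names:

```python
# Sketch of the "virtual data warehouse" (item 7): measures computed over
# dimension views that exist only as virtual data sets. Names are hypothetical.
import pyodbc

conn = pyodbc.connect("DSN=Dremio;UID=svc_user;PWD=secret")
cursor = conn.cursor()

# A typical enterprise-view query: measures (average GPA, enrollment counts)
# grouped by dimensions (college, term), with no physical warehouse behind it.
cursor.execute("""
    SELECT d.college, d.term,
           AVG(f.gpa)          AS avg_gpa,
           COUNT(f.student_id) AS enrollments
    FROM   warehouse.enrollment_facts AS f
    JOIN   warehouse.dim_courses      AS d ON f.course_id = d.course_id
    GROUP  BY d.college, d.term
""")

for row in cursor.fetchall():
    print(row.college, row.term, row.avg_gpa, row.enrollments)
conn.close()
```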