Common BI/Big Data Challenges and Solutions, presented by seasoned SoftServe experts Andriy Zabavskyy (BI Architect) and Serhiy Haziyev (Director of Software Architecture).
This was a complimentary workshop where attendees had the opportunity to learn, network and share knowledge during the lunch and education session.
2. SoftServe BI/Big Data Lunch and Learn Workshop in Utah
January 30, 2013
About SoftServe Inc.
SoftServe, founded in 1993, is a leading global outsourced product and application
development company dedicated to empowering businesses worldwide by providing end-to-end capabilities from product concept to completion. Utilizing Product Development Services
2.0 (PDS 2.0), we deliver proactive solutions in the areas of SaaS/Cloud, Mobility, BI/Analytics
and UI/UX for industries including Healthcare, Retail, Manufacturing, Logistics, and
Infrastructure & Storage. SoftServe is a rapidly growing global company with 3,000
professionals and offices in North America, Western Europe, Russia and Ukraine.
4. Typical BI Solution
• Data Sources: OLTP (CRM, ERP, Finance), Flat files, Big Data, Legacy System
• Data Integration: ETL/ELT
• Data Warehouse: Data Warehouse, OLAP cubes
• Data Mining: Predictive, Prescriptive Analytics
• Data Visualization and Analysis: Reports, Dashboards, Spreadsheets, BI Tools
• Users: Analysts
6. Dashboard & Scorecard
Problem:
▪ Single view from multiple sources
▪ Track performance against company targets
Solution:
▪ Dashboard
▪ KPI and Scorecards
Diagram: Client, Internet, Server Tier
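Tracking performance against company targets can be sketched as a small scorecard calculation. This is a minimal illustration, not any vendor's implementation: the `Kpi` class, the 90% warning threshold, the traffic-light statuses and the sample values are all assumptions made for the example, and targets and actuals are assumed to be non-zero.

```python
from dataclasses import dataclass


@dataclass
class Kpi:
    """One KPI with its measured value and its company target (hypothetical model)."""
    name: str
    actual: float
    target: float
    higher_is_better: bool = True


def attainment(kpi):
    """Return performance as a fraction of target (1.0 means exactly on target)."""
    if kpi.higher_is_better:
        return kpi.actual / kpi.target
    # For "lower is better" KPIs the ratio is inverted.
    return kpi.target / kpi.actual


def status(kpi, warn_at=0.9):
    """Map attainment to a traffic-light status; warn_at is an assumed threshold."""
    ratio = attainment(kpi)
    if ratio >= 1.0:
        return "green"
    if ratio >= warn_at:
        return "yellow"
    return "red"


def scorecard(kpis):
    """Build a single view (name -> status) over KPIs that may come from different sources."""
    return {kpi.name: status(kpi) for kpi in kpis}


# Hypothetical values, as if merged from several source systems.
sample_kpis = [
    Kpi("revenue", actual=950.0, target=1000.0),
    Kpi("new_customers", actual=120, target=100),
    Kpi("avg_ticket_hours", actual=30.0, target=20.0, higher_is_better=False),
]

if __name__ == "__main__":
    for name, state in scorecard(sample_kpis).items():
        print(name, state)
```

A dashboard layer would render these statuses; the calculation itself stays independent of the visualization tool.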
7. Dashboard & Scorecard: Implementation
Software Vendors Offering:
• Boxed solutions from big players (e.g. SAS, SAP, IBI)
• Dashboard Frameworks (e.g. Tableau, QlikView, JasperSoft)
• Dashboard libs (JIDE libs)
Development Efforts:
• Customization
• Custom defined KPI
• Integration Efforts
• Custom defined KPI & Custom built dashboard framework
8. Dashboard & Scorecard: Highlights
• Adopting or customizing a ready-made solution for business lines can be a painful, long and costly process
• Not all dashboard solutions support multitenancy out of the box
9. Self Service BI
Problem:
▪ Give BI users the ability to explore and analyze data in a highly customizable manner
Solution:
▪ Expose a data model to users
▪ Give a toolset with data exploration and analysis capabilities
Diagram: BI Users, Toolset, Data Model (OLAP, In-Memory, RDBMS/NoSQL)
10. Self Service BI: Implementation
• OLAP engines with proper OLAP viewers
• BI tools with in-memory engines and semantic/domain layers
• Report Authoring Tools:
– Microsoft Report Builder
– JasperServer Report Designer
11. Self Service BI: Traditional vs Agile BI Trade-off
Features compared for Traditional and Agile BI:
• Time to Value
• Self Service
• Collaboration
• Interactivity and UX
• Customization
• Data Quality
• Pixel-perfect
• Low cost solutions
12. Self Service BI: Highlights
• Data consumers need to be educated to use SSBI tools properly
• Desktop tools from many SSBI vendors are often more mature than their Web counterparts
• In-memory capabilities are limited by RAM size
15. ELT
Problem:
• Efficiently processing very large volumes of data within ever-shortening processing windows
Solution:
• Perform transformation steps on the target platform
• Set-based processing
Diagram: Source, Source -> Extract -> Load -> Data Warehouse (Staging Layer -> Transform -> Semantic Layer)
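The ELT idea of loading raw data first and transforming it on the target platform can be sketched with Python's built-in `sqlite3` module standing in for the data warehouse. The table names (`stg_orders`, `dw_customer_sales`), the sample rows and the cleanup rules are assumptions made for illustration only.

```python
import sqlite3

# Hypothetical raw extract: everything arrives as text, with inconsistent customer names.
RAW_ROWS = [
    ("1", " alice ", "10.50"),
    ("2", "BOB", "20.00"),
    ("3", "alice", "5.25"),
]


def run_elt(raw_rows):
    """Load raw rows into a staging table, then transform them inside the database."""
    conn = sqlite3.connect(":memory:")

    # Load: rows go into the staging layer unchanged.
    conn.execute("CREATE TABLE stg_orders (order_id TEXT, customer TEXT, amount TEXT)")
    conn.executemany("INSERT INTO stg_orders VALUES (?, ?, ?)", raw_rows)

    # Transform: a single set-based statement cleans, casts and aggregates all rows.
    conn.execute(
        "CREATE TABLE dw_customer_sales (customer TEXT PRIMARY KEY, total_amount REAL)"
    )
    conn.execute(
        """
        INSERT INTO dw_customer_sales (customer, total_amount)
        SELECT lower(trim(customer)), SUM(CAST(amount AS REAL))
        FROM stg_orders
        GROUP BY lower(trim(customer))
        """
    )
    return conn.execute(
        "SELECT customer, total_amount FROM dw_customer_sales ORDER BY customer"
    ).fetchall()


if __name__ == "__main__":
    print(run_elt(RAW_ROWS))
```

The transformation never leaves the database engine, which is what lets a real target platform apply its own optimizer and parallelism to the whole data set.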
16. ELT: Highlights
• Some data integration platforms have clearly separated
ETL and ELT components
• Consider using custom scripts native to the target platform vs. a built-in DI component
17. ETL vs. ELT
ETL
Flow:
• Data pipelines are used
• Transformations are applied to the data one record at a time
• Intermediate data results are stored in memory
Advantages:
• Complex transformations
• Keeping intermediate results in memory is faster than persisting them to disk
Disadvantages:
• Large data sets could overwhelm the memory
ELT
Flow:
• Data is loaded into the destination server
• Set-based processing
• Transformations and lookups are done within SQL
Advantages:
• The power of the relational database system can be utilized for very large data sets
• Updates are more efficient using set-based processing
Disadvantages:
• Load on RDBMS
• More disk activity
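The difference between the two flows can be sketched side by side: a lookup resolved one record at a time in pipeline memory, and the same lookup expressed as a SQL join on the destination. This is a toy comparison using the built-in `sqlite3` module; the table names, the country-to-region lookup and the sample data are hypothetical.

```python
import sqlite3

# Hypothetical source data: (order_id, country, amount) and a country -> region lookup.
ORDERS = [(1, "US", 10.0), (2, "DE", 20.0), (3, "US", 5.0)]
REGIONS = [("US", "Americas"), ("DE", "EMEA")]

FACT_DDL = "CREATE TABLE fact_orders (order_id INTEGER, region TEXT, amount REAL)"
FACT_QUERY = "SELECT order_id, region, amount FROM fact_orders ORDER BY order_id"


def etl_row_by_row(orders, regions):
    """ETL style: the lookup lives in pipeline memory and each record is transformed in turn."""
    lookup = dict(regions)
    conn = sqlite3.connect(":memory:")
    conn.execute(FACT_DDL)
    for order_id, country, amount in orders:
        conn.execute(
            "INSERT INTO fact_orders VALUES (?, ?, ?)",
            (order_id, lookup[country], amount),
        )
    return conn.execute(FACT_QUERY).fetchall()


def elt_set_based(orders, regions):
    """ELT style: raw data is loaded first, then one SQL statement performs the lookup."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE stg_orders (order_id INTEGER, country TEXT, amount REAL)")
    conn.execute("CREATE TABLE stg_regions (country TEXT, region TEXT)")
    conn.executemany("INSERT INTO stg_orders VALUES (?, ?, ?)", orders)
    conn.executemany("INSERT INTO stg_regions VALUES (?, ?)", regions)
    conn.execute(FACT_DDL)
    conn.execute(
        """
        INSERT INTO fact_orders (order_id, region, amount)
        SELECT o.order_id, r.region, o.amount
        FROM stg_orders AS o
        JOIN stg_regions AS r ON r.country = o.country
        """
    )
    return conn.execute(FACT_QUERY).fetchall()


if __name__ == "__main__":
    print(etl_row_by_row(ORDERS, REGIONS) == elt_set_based(ORDERS, REGIONS))
```

Both functions produce the same fact rows; what changes is where the work happens, in pipeline memory for ETL or in the database engine for ELT.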
19. Kimball’s Multidimensional EDW
Problem:
• Integrate and consolidate data from heterogeneous sources
• Keep data history
Solution:
• Use a multidimensional model to store data
• Iterate by business lines
• Integrate by conformed dimensions
Diagram: Data Sources -> Data Warehouse
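Integration by conformed dimensions can be sketched as two fact tables, one per business line, that share a single product dimension, so that results from both can be lined up in one report. The schema below is a deliberately tiny, hypothetical example running on the built-in `sqlite3` module; the table names and sample rows are assumptions.

```python
import sqlite3

SCHEMA_AND_DATA = """
-- Conformed dimension shared by both business lines.
CREATE TABLE dim_product (
    product_key INTEGER PRIMARY KEY,
    product_name TEXT
);
-- One fact table per business line, each keyed to the same dimension.
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    quantity INTEGER
);
CREATE TABLE fact_returns (
    product_key INTEGER REFERENCES dim_product(product_key),
    quantity INTEGER
);
INSERT INTO dim_product VALUES (1, 'Widget'), (2, 'Gadget');
INSERT INTO fact_sales VALUES (1, 10), (1, 5), (2, 7);
INSERT INTO fact_returns VALUES (1, 2);
"""

DRILL_ACROSS = """
SELECT p.product_name,
       (SELECT COALESCE(SUM(s.quantity), 0)
          FROM fact_sales AS s WHERE s.product_key = p.product_key) AS sold,
       (SELECT COALESCE(SUM(r.quantity), 0)
          FROM fact_returns AS r WHERE r.product_key = p.product_key) AS returned
FROM dim_product AS p
ORDER BY p.product_name
"""


def sales_vs_returns():
    """Aggregate each fact table separately and align the results on the shared dimension."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA_AND_DATA)
    return conn.execute(DRILL_ACROSS).fetchall()


if __name__ == "__main__":
    for row in sales_vs_returns():
        print(row)
```

Because both fact tables use the same `product_key`, a second business line can be added in a later iteration without reworking the first one.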
23. DWH: Highlights
Implications of column-based storage:
– Additional columns vs. junk dimensions
– Update scenarios should be avoided where possible
– The partitioning scheme should be carefully designed to support maintenance activities
26. Big Data: Hybrid Approach
Problem:
• Under big data circumstances:
– Flexible online analytics
– Access to most detailed raw data
Solution:
• Analytical RDBMS for online analytics
• NoSQL DB as source for RDBMS and for most detailed raw data
Diagram: Source -> NoSQL -> RDBMS/DW -> Operational and Historical Analytics
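The hybrid flow can be sketched as follows: raw, schema-flexible events stay in a document store, and only an aggregate is pushed into a relational table for online analytics. In this toy version a list of JSON strings stands in for the NoSQL store and the built-in `sqlite3` module stands in for the analytical RDBMS; the event fields and table name are assumptions.

```python
import json
import sqlite3
from collections import Counter

# Hypothetical raw events; documents need not share the same fields.
RAW_EVENTS = [
    '{"user": "u1", "page": "/home", "ts": "2013-01-30T10:00:00"}',
    '{"user": "u2", "page": "/home", "ts": "2013-01-30T10:01:00", "ref": "ad"}',
    '{"user": "u1", "page": "/pricing", "ts": "2013-01-30T10:02:00"}',
]


def load_page_views(raw_events):
    """Aggregate raw documents and load the summary into a relational table."""
    counts = Counter(json.loads(event)["page"] for event in raw_events)
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE page_views (page TEXT PRIMARY KEY, views INTEGER)")
    conn.executemany("INSERT INTO page_views VALUES (?, ?)", counts.items())
    return conn


def top_pages(conn):
    """Online analytics query against the relational summary."""
    return conn.execute(
        "SELECT page, views FROM page_views ORDER BY views DESC, page"
    ).fetchall()


if __name__ == "__main__":
    print(top_pages(load_page_views(RAW_EVENTS)))
```

The raw documents remain available for detailed investigation, while interactive queries run against the much smaller relational summary.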
28. Big Data isn’t only Hadoop
Storage categories: Operation Storage, Retention Storage
Requirements | Tape Library | HDFS | Disk Array | Amazon S3 | Amazon Glacier
Throughput (600 GB load time) | 140-500 MB/s (0.3-1.2 h) | 10-30 MB/s (5.5-16 h) | 50-700 MB/s (0.25-4 h) | 2-40 MB/s (83 h)
Max capacity | 30-900 PB | 21+ PB | 16 PB | ~Unlimited
Max file size | ~Unlimited | ~Unlimited | 4-16 TB (OS-limited) | 5 TB | 40 TB
Accessibility | SAN | Java API, HTTP, NFS (MapR) | NFS, CIFS, SAN | REST, SOAP
Scalability | Adding cartridges | Adding nodes | Adding disks | Pay-as-you-go
Reliability | Redundancy | Redundancy (MapR) | Redundancy | 99.99%
Encryption | Yes | Yes* | Yes* | Yes
HIPAA Compliancy | By datacenter | By datacenter | By datacenter | By Amazon | ?
Random access | No | Yes | Yes | Yes | No
Parallel processing | No | Yes | Yes** | No
100 TB Cost | $40-60K | $100-200K | $80-400K | $132-216K/year | $12-96K/year
1 PB Cost | $90-140K | $1-2M | $0.5-4M | $1.1-1.6M/year | $120-360K/year
15 PB Cost | $0.7-1.2M | $15-30M | ~$18M | $9.9-15M/year | $1.8-3.5M/year
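The load times quoted next to each throughput range follow from dividing the data volume by the sustained transfer rate. The helper below is a small sketch of that arithmetic, assuming decimal units (1 GB = 1000 MB) and a constant rate; the function names are illustrative, and the slide's figures are rounded, so not every quoted range is reproduced exactly.

```python
def load_time_hours(size_gb, throughput_mb_s):
    """Hours needed to move size_gb at a constant throughput, assuming 1 GB = 1000 MB."""
    seconds = size_gb * 1000 / throughput_mb_s
    return seconds / 3600


def load_time_range(size_gb, slow_mb_s, fast_mb_s):
    """Return (best_case_hours, worst_case_hours) for a throughput range."""
    return load_time_hours(size_gb, fast_mb_s), load_time_hours(size_gb, slow_mb_s)


if __name__ == "__main__":
    # 600 GB at 140-500 MB/s, the first throughput range in the comparison.
    best, worst = load_time_range(600, slow_mb_s=140, fast_mb_s=500)
    print(f"{best:.1f}-{worst:.1f} h")
    # 600 GB at 2 MB/s, the low end of the slowest range.
    print(f"{load_time_hours(600, 2):.0f} h")
```

For 600 GB this gives roughly 0.3 to 1.2 hours at 140-500 MB/s and about 83 hours at 2 MB/s, in line with the figures in the comparison.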
29. Big Data: Highlights
• Clickstream analysis is a classic use case
• Scheduled reports are a good fit for Hadoop-based reporting
• The majority of Self Service BI tools need a relational representation of data
33. DM Models: Implementation
• Custom algorithm implementation
• Statistical packages like R
• Ready data mining model implementations
34. DM Models: Highlights
• The approach should be:
Problem -> Data Strategy -> Data analysis
… and not vice versa
• DM algorithms should be carefully selected
• DM algorithms are highly dependent on the business domain you create them for
35. SoftServe BI Maturity Model
Wisdom: Improving the business
• decision making (executives)
• data mining, forecasting
Knowledge: Gaining business insight
• analytical reports (analysts)
• dashboards, KPIs, scorecards, slice & dice, data warehouse, OLAP
Information: Measuring and monitoring
• consolidated reports (managers)
• charts, parametrized reports, dedicated reporting database
Data: Running the business
• personal operational reports (workers, customers)
• simple reports, OLTP or files
37. More Info about SoftServe BI Offerings
http://www.softserveinc.com/en-us/services/software-architecture/
http://www.softserveinc.com/en-us/services/bi-analytics/