7. SO WHY SHOULD I CARE ABOUT THIS?
Data is the new Electricity (Satya Nadella, Spring 2016)
https://www.microsoft.com/en-us/sql-server/data-driven
Companies generate, distribute, meter, and use data
Where is data stored?
Current: SQL Server, Oracle, Teradata, DB2, Netezza, Open Source Databases
Cassandra, MySQL, MongoDB
Unstructured: Hadoop, Spark, Data Lakes
What type of data is stored?
Traditional: Rows and Columns
Big Data Explosion: Images, streaming data, internet-connected devices (IoT),
Machine data
8. BIG DATA IS DRIVING TRANSFORMATIVE CHANGES
Data characteristics
Traditional: Relational data with highly modeled schema
Big Data: All data with schema agility
Costs
Traditional: Specialized HW
Big Data: Commodity HW
Culture
Traditional: Operational reporting; focus on rear-view analysis
Big Data: Experimentation leading to intelligent action, with machine learning, graph, A/B testing
9. BIG DATA 101
• Sources
• Cell Phones
• Social Media
• Credit Cards
• GPSs
• Bread Crumbs
10. BIG DATA 101
• 5 Vs of Big Data
• Volume
• Variety
• Velocity
• Veracity
• Value
11. BIG DATA 101
• Desired Properties:
• Robustness and Fault Tolerance
• Low Latency
• Scalability
• Generalization
• Extensibility
• Ad hoc Queries
• Minimal Maintenance
• Debuggability
12. BIG DATA 101
• Flow
• Collection -> Pre-processing -> Hygiene -> Analysis -> Visualization -> Intervention
OVER 90% OF TODAY’S DATA WAS CREATED IN THE PAST 2 YEARS
13. BIG DATA 101
• 5 Rs of Data Quality
• Relevancy
• Recency
• Range
• Robustness
• Reliability
• Ephemeral vs. Durable
• Refresh of Data
14. BIG DATA 101
• Privacy of Data
• If I collect the data, is it mine?
• Ownership vs. Rights
• Share Answers not Data
• OpAl (http://www.trust.mit.edu/projects/)
• Enigma
• Let them know
• Why you are collecting
• What you are collecting
• FIPPs: Fair Information Practice Principles
• Individual Control
• Transparency
• Respect for Context
• Security
• Access and Accuracy
• Focused Collection
• FERPA: Family Educational Rights and Privacy Act
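The "Share Answers not Data" bullet above can be illustrated with a minimal Python sketch. The record list, the field names, and the minimum group size of 3 are all hypothetical, and this is not how OPAL or Enigma work internally; it only shows the idea of answering a question with an aggregate while the raw rows stay with the data holder.

```python
from statistics import mean

# Hypothetical raw records; these never leave the data holder.
RECORDS = [
    {"zip": "02139", "spend": 40.0},
    {"zip": "02139", "spend": 55.0},
    {"zip": "02139", "spend": 25.0},
    {"zip": "10001", "spend": 90.0},
]

# Assumed threshold: an answer over a very small group could identify a person.
MIN_GROUP_SIZE = 3


def average_spend(zip_code):
    """Return only an aggregate answer, or None if the group is too small."""
    values = [r["spend"] for r in RECORDS if r["zip"] == zip_code]
    if len(values) < MIN_GROUP_SIZE:
        return None
    return mean(values)
```

A caller gets an average for a ZIP code with enough records and gets None for a ZIP code with a single record, so the single customer's spend is not revealed.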
15. BIG DATA 101
WHAT IS A DATA LAKE? (COURTESY: JAMES SERRA)
A storage repository, usually Hadoop, that holds a vast amount of raw data in its native
format until it is needed.
• A place to store unlimited amounts of data in any format inexpensively, especially for archive
purposes
• Allows collection of data that you may or may not use later: “just in case”
• A way to describe any large data pool in which the schema and data requirements are not defined
until the data is queried: “just in time” or “schema on read”
• Complements EDW and can be seen as a data source for the EDW – capturing all data but only
passing relevant data to the EDW
• Frees up expensive EDW resources (storage and processing), especially for data refinement
• Allows for data exploration to be performed without waiting for the EDW team to model and load
the data (quick user access)
• Some processing is better done with Hadoop tools than ETL tools like SSIS
• Easily scalable
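"Schema on read" from the list above can be sketched in a few lines of Python. The sample events and field names are made up for illustration, and an in-memory buffer stands in for files in the lake. The point is that the raw lines are stored as-is and a schema is applied only when a query reads them.

```python
import io
import json

# Raw events kept in their native format (JSON lines). No schema was
# declared when they were written. The sample data is made up.
RAW = io.StringIO(
    '{"device": "a1", "temp_c": 21.5, "ts": "2016-05-01T10:00:00"}\n'
    '{"device": "b2", "humidity": 40}\n'
    '{"device": "a1", "temp_c": "22.0"}\n'
)


def read_temperatures(raw):
    """Apply a schema at query time: keep (device, temp_c as float)."""
    rows = []
    for line in raw:
        record = json.loads(line)
        if "temp_c" not in record:
            continue  # this record does not fit the schema of this query
        rows.append((record["device"], float(record["temp_c"])))
    return rows


temperatures = read_temperatures(RAW)
```

A different query could read the same raw lines with a different schema (for example, humidity only) without reloading or remodeling the data.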
16. BIG DATA 101
THE “DATA LAKE” USES A BOTTOM-UP APPROACH
• Ingest all data regardless of requirements
• Store all data in native format without schema definition
• Do analysis using analytic engines like Hadoop
Diagram labels: Devices, Interactive queries, Batch queries, Machine Learning, Data warehouse, Real-time analytics
Courtesy: James Serra
21. Near Realtime Data Analytics Pipeline using Azure Stream Analytics
Big Data Analytics Pipeline using Azure Data Lake
Interactive Analytics and Predictive Pipeline using Azure Data Factory
Base Architecture : Big Data Advanced Analytics Pipeline
[Architecture diagram]
Stages: Data Sources -> Ingest -> Prepare (normalize, clean, etc.) -> Analyze (stat analysis, ML, etc.) -> Publish (for programmatic consumption, BI/visualization) -> Consume (Alerts, Operational Stats, Insights)
Lanes: Data in Motion, Data at Rest
Components: Telemetry; Customer MIS; Event Hub; Stream Analytics (real-time analytics); Machine Learning (Anomaly Detection); Machine Learning; HDI Custom ETL (Aggregate / Partition); Azure Storage Blob; Azure Data Lake Storage; Azure Data Lake Analytics (Big Data Processing); Azure SQL; Azure SQL (Predictions); scheduled hourly transfer using Azure Data Factory
Outputs: PowerBI dashboard; live / real-time data stats, anomalies and aggregates; dashboard of operational stats; dashboard of predictions / alerts
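As a rough illustration of the real-time aggregates mentioned in this pipeline, here is a plain Python sketch of a tumbling-window average. It does not use any Azure API; the 60-second window, the tuple layout, and the function name are assumptions chosen for the example.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # assumed window size


def tumbling_window_averages(events):
    """Average readings per (window start, device) over fixed 60 s windows.

    `events` is an iterable of (epoch_seconds, device_id, value) tuples.
    """
    sums = defaultdict(lambda: [0.0, 0])  # key -> [running total, count]
    for ts, device, value in events:
        window_start = ts - (ts % WINDOW_SECONDS)
        bucket = sums[(window_start, device)]
        bucket[0] += value
        bucket[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}
```

Each event falls into exactly one window, so the windows do not overlap. A real streaming engine would also emit results as each window closes rather than after all events have been read.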
22. VISION FOR BIG DATA AND DATA WAREHOUSING
Themes: Comprehensive, Connected, Choice
Diagram labels: On-Premises, Cloud; Data Warehouse, “Big Data”; Microsoft SQL Server, Microsoft Azure; Azure Data Factory + Federated Query
25. BIG DATA 101
• Summary:
• Sources
• Privacy concerns
• Storing: Hadoop
• Processing: MapReduce
• Presentation
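For readers new to the MapReduce processing model named in the summary, a minimal single-process word count in Python shows the three phases. This is an illustration of the model only; it does not use the Hadoop API, and the function names are chosen for this example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

In a cluster, the map and reduce calls run in parallel on different machines and the shuffle moves data between them; here everything runs in one process to keep the phases visible.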
26. BIG DATA 101 - CONCLUSION
SQL Server is the best Relational Database
The world is much bigger than any one relational
database
What is your company’s data strategy?
What is your company’s cloud strategy?
Learn adjacent technologies that will make you
valuable.
Power BI?
Hadoop?
NoSQL?
27. BIG DATA 101
• BIBLIOGRAPHY –
• http://www.datasciencecentral.com/
• https://www.youtube.com/playlist?list=PLt-0mOCwxJ6B_OxTlpevxJNAa7GfCLd3l
• https://www.dezyre.com/article/hadoop-components-and-architecture-big-data-and-hadoop-training/114
• MIT Big Data Analytics Course
• Data Lake presentation by James Serra
• Future of Data…..(or something like that) by George Walters
28. BIBLIOGRAPHY- BIG DATA 101
Ignite (IT Pros) - https://myignite.microsoft.com/videos
Channel9 (Developers) - https://channel9.msdn.com/
Microsoft Virtual Academy (Both) – http://mva.microsoft.com
Technet Virtual Labs (Hands-on!) - https://technet.microsoft.com/en-us/virtuallabs/default
Free Azure for 1 month - https://azure.microsoft.com/en-us/free/
Free HDInsight (Hadoop as a service) for a week - https://azure.microsoft.com/en-us/services/hdinsight/information-request/
Have MSDN? Link it to Azure for monthly Azure credit.
Editor's notes
Also called bit bucket, staging area, landing zone or enterprise data hub (Cloudera)
http://www.jamesserra.com/archive/2014/05/hadoop-and-data-warehouses/
http://www.jamesserra.com/archive/2014/12/the-modern-data-warehouse/
http://adtmag.com/articles/2014/07/28/gartner-warns-on-data-lakes.aspx
http://intellyx.com/2015/01/30/make-sure-your-data-lake-is-both-just-in-case-and-just-in-time/
http://www.blue-granite.com/blog/bid/402596/Top-Five-Differences-between-Data-Lakes-and-Data-Warehouses
http://www.martinsights.com/?p=1088
http://data-informed.com/hadoop-vs-data-warehouse-comparing-apples-oranges/
http://www.martinsights.com/?p=1082
http://www.martinsights.com/?p=1094
http://www.martinsights.com/?p=1102