3. Data Science Definition
“Data science is an interdisciplinary field about
processes and systems to extract knowledge or
insights from data in various forms, either structured
or unstructured, which is a continuation of some of
the data analysis fields such as statistics, machine
learning, data mining, and predictive analytics”
https://en.wikipedia.org/wiki/Data_science
5. The Cloud
Why does the Cloud matter for Data Science?
High capacity and cost effective data storage
Flexible, elastic compute capacity
Ready to use technologies
Choice of Infrastructure or Platform
Enables Agile & DevOps
Operational reliability and security
Pay as you go
6. Microsoft Azure Cloud Platform
Wide range of services covering Compute, Web & Mobile, Data &
Storage, Analytics, Internet of Things & Intelligence plus many more,
see http://azureplatform.azurewebsites.net/en-us/
Easy to get started: free to try for 30 days with a limited spend, and
MSDN licences also include free credits, see
https://azure.microsoft.com/en-gb/free/
Comprehensive documentation and examples
Global presence, with many recognisable brands committed to the platform
Huge investment and growing rapidly
9. NYC taxis
2013 NYC taxi trips and fares – open but non-trivial dataset
24 CSV files: 12 trip and 12 fare, one of each per month
~20GB compressed, ~50GB uncompressed, 170+ million records
medallion – vehicle identifier
hack license – driver identifier
passenger count
pickup & dropoff – datetime, longitude, latitude
trip – time and distance
fare - payment type, fare amount, surcharge, mta tax, tip amount, tolls
amount, total amount
http://www.andresmh.com/nyctaxitrips/
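Because trips and fares arrive in separate files, they need to be joined before modelling. The sketch below shows one way to do that on a tiny made-up sample. The join key (medallion, hack license, pickup datetime) and the underscore column names are assumptions, not something the slides state. For the full dataset of 170+ million records a join like this would be run in Hive rather than in memory.

```python
import csv
import io

# Tiny made-up samples; the column names are assumed for illustration.
TRIP_CSV = """medallion,hack_license,pickup_datetime,passenger_count,trip_distance
M1,H1,2013-01-01 00:00:00,1,2.5
M2,H2,2013-01-01 00:05:00,2,1.1
"""

FARE_CSV = """medallion,hack_license,pickup_datetime,fare_amount,tip_amount
M1,H1,2013-01-01 00:00:00,10.0,2.0
M2,H2,2013-01-01 00:05:00,6.5,0.0
"""

# Assumed join key: vehicle, driver and pickup time together identify a trip.
KEY = ("medallion", "hack_license", "pickup_datetime")


def join_trip_fare(trip_file, fare_file):
    """Inner-join trip rows to fare rows on the assumed key."""
    fares = {
        tuple(row[k] for k in KEY): row
        for row in csv.DictReader(fare_file)
    }
    joined = []
    for trip in csv.DictReader(trip_file):
        fare = fares.get(tuple(trip[k] for k in KEY))
        if fare is not None:
            # Merge the two records into one row with all columns.
            joined.append({**trip, **fare})
    return joined


rows = join_trip_fare(io.StringIO(TRIP_CSV), io.StringIO(FARE_CSV))
```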
10. Predictions
Predict whether a specific journey will result in a tip – binary
classification
Predict what class of tip will be for a specific journey – multiclass
classification
Predict how much a tip will be for a specific journey – regression
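All three prediction targets can be derived from the tip amount alone. The following is a minimal sketch; the tip ranges are hypothetical, since the slides do not state the class boundaries, and the function and field names are invented for illustration.

```python
from typing import NamedTuple

# Hypothetical upper bounds (in dollars) for tip classes 0..3;
# anything above the last bound falls into class 4.
TIP_CLASS_UPPER_BOUNDS = [0.0, 5.0, 10.0, 20.0]


class TipTargets(NamedTuple):
    tipped: int        # binary classification label
    tip_class: int     # multiclass classification label
    tip_amount: float  # regression target


def tip_targets(tip_amount: float) -> TipTargets:
    """Derive the three modelling targets from one tip amount."""
    tipped = 1 if tip_amount > 0 else 0
    tip_class = len(TIP_CLASS_UPPER_BOUNDS)  # default: top class
    for index, upper in enumerate(TIP_CLASS_UPPER_BOUNDS):
        if tip_amount <= upper:
            tip_class = index
            break
    return TipTargets(tipped, tip_class, tip_amount)
```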
12. Data Science Virtual Machine
Create Linux and Windows virtual machines in minutes
Wide range of configurations - CPU cores, memory, disks, network
speeds
Scale to what you need
Pay only for what you use
Enhance security and compliance
Preloaded with a full set of tools and utilities from Azure Marketplace
e.g. SQL Server 2016 Developer edition, Azure SDK, Python, R,
Jupyter, etc.
13. Storage Accounts
Massively scalable cloud storage for your applications
Security-enhanced, durable, and highly available across the globe
Industry-leading performance with exabytes of capacity
Pay only for what you use
Open, multi-platform support
14. HDInsight
A managed Apache Hadoop, Spark, R, HBase, and Storm cloud service
made easy
Scale to petabytes on demand
Crunch all data—structured, semi-structured, unstructured
Skip buying and maintaining hardware
Spin up Apache Hadoop, Spark, and R clusters in the cloud
Use Excel or your favourite BI tool to visualize Hadoop data
Connect on-premises Hadoop clusters with the cloud
15. Azure Machine Learning
A fully managed cloud service that enables you to easily build, deploy,
and share predictive analytics solutions.
Powerful cloud based analytics, now part of Cortana Intelligence
Suite
Azure Machine Learning Studio includes hundreds of built-in
packages and support for custom code
Share your solution with the world in the Gallery or on the Azure
Marketplace
17. Preparation & Exploration
Copy data using AzCopy and decompress
Inspect files and load into RStudio
Create external Hive tables and load
Query over full dataset for further exploration
Remove erroneous data, e.g. implausible passenger counts or lat/long values
Engineer features using Hive
Distance from start to finish using Haversine calculation
Binary indicator for tips
Tip level based on ranges for multiclass classification
Downsample dataset and save as internal table for Machine Learning
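The talk performs the cleaning and feature engineering in Hive. As an illustration of the same logic, here is a minimal Python sketch of the Haversine distance and a plausibility filter. The passenger and coordinate limits are hypothetical; the slides do not give the thresholds used.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def is_plausible(passenger_count, lat, lon):
    """Hypothetical limits: 1-6 passengers and a rough box around NYC."""
    return (1 <= passenger_count <= 6
            and 40.0 <= lat <= 42.0
            and -75.0 <= lon <= -73.0)
```

A record with a latitude and longitude of zero, a common symptom of a missing GPS fix, fails the filter, so it is dropped before the distance is computed.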
18. Machine Learning & Deployment
Import Data using Hive Query
Build Training Experiments
Evaluate model performance
Create Predictive Experiments
Publish Web Service
Test Web Service
Call from Excel
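For the "Evaluate model performance" step, Azure Machine Learning Studio reports the metrics itself. The sketch below is not the Studio API; it only illustrates, in plain Python, what the usual metrics for the binary tip classifier mean. The function name is invented for illustration.

```python
def binary_metrics(actual, predicted):
    """Accuracy, precision and recall for 0/1 labels (1 = tipped)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # true negatives
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives
    total = len(pairs)
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


metrics = binary_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```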
19. Next Steps
To build a fully fledged enterprise solution with regular data ingestion
and model execution consider the following:
Data Catalog
Data Factory
Event Hubs & Stream Analytics
Power BI
Cognitive Services
21. Summary
Microsoft Azure provides a wide range of technologies for Data
Science activities
Platform services reduce the management overhead
No capacity limitations and flexible provisioning – pay as you go
Choice of Open Source and Microsoft – use the best tool for the task
The tools are well integrated
Azure Machine Learning makes it trivial to deploy your models
It’s quick and easy to get started
22. Getting Started
Sign up for free
https://azure.microsoft.com/en-gb/free/
Create a Data Science VM
https://azure.microsoft.com/en-us/marketplace/partners/microsoft-ads/standard-data-science-vm/
Visit Cortana Intelligence Gallery
https://gallery.cortanaintelligence.com/
24. Thank You
Martin Thornalley
Data Solution Architect, Microsoft
@mthornal
martin.thornalley@microsoft.com
https://www.linkedin.com/in/martinthornalley