Create a Data Science Lab with
Microsoft and Open Source Tools
Marcel Franke, pmOne AG, Germany
About me – Marcel Franke
Practice Lead Advanced Analytics & Data Science
pmOne AG – Germany, Austria, Switzerland
>10 years of experience with large-scale
Data Warehouses based on SQL Server
Blog: dwjunkie.wordpress.com
What is data science?
The Definition
Data science incorporates varying
elements and builds on techniques and
theories from many fields, including
mathematics, statistics, data engineering,
pattern recognition and learning, advanced
computing, visualization, uncertainty
modeling, data warehousing, and high
performance computing with the goal of
extracting meaning from data and
creating data products.

Source: http://en.wikipedia.org/wiki/Data_science
A brief look into history
GAMBLING –
THAT’S WHERE
EVERYTHING
STARTED
The beginnings of gambling
Gambling has existed since 3000 BC
First games based on dice

Origins in China and Mesopotamia
* Source: Tiemeyer, E.; Zsifkovitis, H.: Information als Führungsmittel, München: Computerwoche Verlag 1995
Scientific foundations
17th century: the paradox of the
Chevalier de Méré
Pascal and Fermat discussed
the paradox in several letters
The beginning of the theory of
probability
* Source: http://de.wikipedia.org/wiki/De-M%C3%A9r%C3%A9-Paradoxon
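The paradox can be checked with a few lines of arithmetic. The sketch below assumes the usual statement of the problem, which the slide does not spell out: betting on at least one six in 4 throws of one die versus at least one double six in 24 throws of two dice. The function name is illustrative only.

```python
from fractions import Fraction


def p_at_least_one(p_single: Fraction, throws: int) -> Fraction:
    """Probability that an event with per-throw chance p_single occurs at least once."""
    return 1 - (1 - p_single) ** throws


# Bet 1: at least one six in 4 throws of a single die.
p_one_six = p_at_least_one(Fraction(1, 6), 4)
# Bet 2: at least one double six in 24 throws of two dice.
p_double_six = p_at_least_one(Fraction(1, 36), 24)

print(f"one six in 4 throws:     {float(p_one_six):.4f}")
print(f"double six in 24 throws: {float(p_double_six):.4f}")
```

The first bet is slightly better than even and the second slightly worse, although 4/6 and 24/36 look like the same ratio. That mismatch is what made the problem look paradoxical.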
The science in Data Science
Calculate probabilities
Pattern recognition
Analysis of variance
Machine Learning
Simulations
Predictions
BI, Data Mining & Prediction
WEATHER
FORECAST
What do companies do today?
Walmart – The pioneer of data analytics

Source: Data Unser – Dr. Bloching; images: walmart.com, yourdealz.de, squidoo.com, fuzzybrew.com
Visa

80% correct prediction of divorces
within the next 5 years
Reason: Divorce is the highest risk
for private insolvency
Source: visa.de
Customers need to find the right case

What do consumers
really do?
Blonde looks
somehow different :-)

The new washing powder is really great…
Data can be accessed easily…
… but, it‘s hard to analyze it.
Other areas of application
SOCIAL
MEDIA

PRODUCT RECOMMENDATION
RETARGETING

PREDICTIVE
MAINTENANCE

PREDICT RISKS

areas of
application
SALES PREDICTIONS

CUSTOMER ANALYSIS

DYNAMIC PRICING

DISPOSITION
How does this fit with Big Data?
Our starting point…
Structured data

Unstructured data

Harmonize and
generate Information
(Role of "Data Scientist")

"BIG Data"
Volume, Variety, Velocity
Typical Big Data Architecture
Big Data Analytics

Excel

Big Data Advanced Analytics

PowerPivot
Big Data Preparation (SQL, Map Reduce)

Unstructured data

Structured data
Massive Parallel Processing

Big Data Storage Platform
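The preparation layer above names Map Reduce. As a rough, single-process sketch of that pattern (not a Hadoop job), the following counts requests per HTTP status code. The log lines and all function names are invented for illustration and are not from the talk.

```python
from collections import defaultdict
from itertools import chain

# Invented sample log lines (illustrative only, not data from the talk).
LOG_LINES = [
    "GET /index.html 200",
    "GET /missing 404",
    "POST /login 200",
    "GET /index.html 200",
]


def map_phase(line):
    """Emit (key, 1) pairs; here the key is the HTTP status code."""
    status = line.split()[-1]
    yield (status, 1)


def shuffle(pairs):
    """Group emitted values by key, as a framework would do between the phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values for one key into a single result."""
    return key, sum(values)


mapped = chain.from_iterable(map_phase(line) for line in LOG_LINES)
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

In a real cluster the map and reduce functions run in parallel on many nodes; only the shape of the computation is shown here.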
“[Facebook] started in the Hadoop world. We are now bringing in
relational to enhance that. We're kind of going [in] the other
direction.”
“We've been there, and [we] realized that using the wrong
technology for certain kinds of problems can be difficult. We
started at the end and we're working our way backwards, bringing
in both.”
Ken Rudin, Director of Analytics for Facebook
Source: http://tdwi.org/articles/2013/05/06/facebooks-relationalplatform.aspx?j=192038&e=marcel.franke@pmone.com&l=50_HTML&u=3967541&mid=1060748&jb=84&m=1
A few words about "R"
• R is a language and environment for statistical
computing and graphics
• R is open source under the GNU General Public License
• One of the most widely used statistical software environments
• Everything happens in memory
• Comes with a package manager (~5000 packages)
• Also provides graphics functionality
Samples of R
How to approach projects?
Starting Point
Problems we already know from the BI world are further exacerbated by
big data.

• Complexity of systems grows constantly
• Amount of data grows exponentially (= Big Data)
• Need for change is more frequent and increasingly reaches deeper into
business rules
• Solutions can no longer be planned in advance
Solution Option 1 – Classic Deterministic

Everything can be planned and
designed at the drawing board…
How do the products and components of a system, and the
relationships between them, behave together?

Source: Cesar Hidalgo
Solution Option 2 – Learn from "Mother Nature"
• How does nature deal with complex non-linear systems?
• Evolution – Variation and selection – „Trial and Error“

"It is not the strongest of the species that
survives, nor the most intelligent, but the one
most responsive to change." (often attributed to Charles Darwin)
A candlestick?
45 Iterations

Technology helps to speed up iterations.
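The variation-and-selection idea can be sketched as a small loop. This is a toy illustration only: the fitness function, step size and number of variants are hypothetical, and only the figure of 45 iterations comes from the slide.

```python
import random


def fitness(x: float) -> float:
    """Hypothetical quality measure to maximise; its peak is at x = 3."""
    return -(x - 3.0) ** 2


def evolve(start: float, iterations: int = 45, variants: int = 10,
           step: float = 0.5, seed: int = 0) -> list:
    """Return the best candidate after each round of variation and selection."""
    rng = random.Random(seed)
    best = start
    history = [best]
    for _ in range(iterations):
        # Variation: create slightly altered copies of the current best design.
        candidates = [best + rng.gauss(0.0, step) for _ in range(variants)]
        # Selection: keep the fittest of parent and offspring.
        best = max(candidates + [best], key=fitness)
        history.append(best)
    return history


history = evolve(start=0.0)
print(f"start fitness: {fitness(history[0]):.3f}")
print(f"final fitness: {fitness(history[-1]):.3f}")
```

Because the parent competes with its offspring, the design never gets worse from one iteration to the next. Faster technology simply allows more such iterations in the same time.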
Laboratory & Factory
The laboratory

Trial & Error
Pattern Recognition
Analytical Apps
An efficient laboratory to experiment
Power Pivot
In-Memory

Microsoft Excel

Power View

Unstructured
Data

Power Query

Source Systems

Power Map

SQL Server

Structured
Data
OLE DB
OData

WebServer-Logs
Sensor-Data

Data Marketplace

SAP

Databases
The factory

Easy to consume
Integrated in the business process

Analyze on mass data

Host it and run it

At Enterprise Scale
For Realtime Enterprise
Stable Big Data Architecture
Prediction &
Data Science

Front-Ends &
Mobile
Windows
Azure

On-Premises

Source Systems

Unstructured
Data

WebServer-Logs
Sensor-Data

HDInsight

SQL Server PDW

Data Marketplace

Structured
Data

SAP

Databases
How do we scale?
The battle
How do we scale?
Relational data & compute

SQL Server 2012
Parallel Data
Warehouse
Half Rack

Infiniband

Analytical data &
compute

HP DL 385
40 Cores
2 TB RAM
Fusion-IO Card
What is Revolution Analytics?
• Founded in 2007
• Aim: evolve R for high performance
• Offers R packages for faster performance and
greater stability
• Enterprise & Community products
• Stand-alone, Scale-out (HPC), on Hadoop
How do we handle our data?
R-ODBC: 10 MB/s

Flat file export: 80 MB/s

Data preparation

Data transfer

predictive scripts
Results
• Generate predictions for 30,000 customers
– 50,000 rows per customer, 54 columns
– Customer goal: 5 minutes
– Our solution: 7,500 customers in 5 minutes
– Benchmark: 1 minute
• Revolution Analytics ODBC driver does not work with PDW
• Standard R ODBC driver reads data at 10 MB/s
• Workaround via flat file export
• RDS format faster than CSV
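A back-of-the-envelope check of these figures, as a sketch. It reads "7,500 customers in 5 minutes" as the measured throughput and assumes runtime grows linearly with the number of customers; neither is confirmed by the slides. The 1,000 MB extract size is purely hypothetical.

```python
# Figures taken from the slides.
TOTAL_CUSTOMERS = 30_000
CUSTOMERS_PER_RUN = 7_500
RUN_MINUTES = 5
ODBC_MB_PER_S = 10
FLAT_FILE_MB_PER_S = 80

# Speed-up of the flat file export over the standard R ODBC driver.
transfer_speedup = FLAT_FILE_MB_PER_S / ODBC_MB_PER_S

# Assumption: runtime grows linearly with the number of customers.
customers_per_minute = CUSTOMERS_PER_RUN / RUN_MINUTES
minutes_for_all = TOTAL_CUSTOMERS / customers_per_minute


def transfer_seconds(size_mb: float, mb_per_s: float) -> float:
    """Time to move size_mb megabytes at a constant rate."""
    return size_mb / mb_per_s


EXAMPLE_MB = 1_000  # hypothetical extract size, not a figure from the talk

print(f"flat file vs ODBC: {transfer_speedup:.0f}x")
print(f"all customers at the measured rate: {minutes_for_all:.0f} min")
print(f"{EXAMPLE_MB} MB via ODBC: {transfer_seconds(EXAMPLE_MB, ODBC_MB_PER_S):.0f} s")
print(f"{EXAMPLE_MB} MB via flat file: {transfer_seconds(EXAMPLE_MB, FLAT_FILE_MB_PER_S):.1f} s")
```

Under these assumptions the flat file route is 8 times faster for transfer, and the full customer base would need about four runs of the measured size to be scored.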
Other solutions?
• R in database
• R on Hadoop
– RHadoop
– Revolution Analytics RHadoop
Other solutions?
• Services & Cloud
THANK YOU!
• For attending this session and
PASS SQLRally Nordic 2013, Stockholm


Editor's Notes

  1. A lot of topics and skills are combined. Data Warehouse is also a part of it. More statistics and mathematics skills are needed.
  2. Where does Data Science come from?
  3. When you do some research on that topic you will automatically stumble upon gambling or games of chance.
  4. Dice cup
  5. Two scientists started thinking about gambling in a more scientific way. Writing very long letters back and forth. Different probability to win if you play with 1 die or 2.
  6. 1.) How big is the probability to win or lose, or to reach a certain goal? 2.) Is there any correlation between the customer income and the sales amount? 5.) What happens if we change certain parameters like price? 6.) What is the sales amount of a certain product in the next quarter or year?
  7. How does this topic fit with BI?
  8. What can I do with it?
  9. So what do companies do with it? I consciously didn't use the word Big Data, but you all know that this new area is very hot in marketing and news. So what are the good examples & use cases?
  10. Kasse – cash desk, Belohnung – reward, Windel – nappy
  11. Emphasize the importance of R -> almost all vendors build on R. It is used a lot in the open source field.
  12. Injector for washing pellets. Waste, poor quality.
  13. Idea of a process model called Lab & Factory. Experimental approach. Iterative. Fast. Find new patterns.
  14. It is for the data scientist to experiment.
  15. If we found something interesting, we can deploy it to the factory. It's the place where we run our analytical code at Enterprise scale.
  16. Most of the analytical tools have been out there for years, like databases, R, SAS, SPSS. We often hear of limitations in scalability & performance. DB -> MPP. R, SAS -> In-Memory.
  17. POC on different analytic use cases with the big vendors. Complex SQL queries. Simulations. Predictions with R.
  18. SQL -> we know how to scale. R -> scaling is difficult, hence Revolution.
  19. No stable market, many options.