How and why you need to build a Big Data Lab
Why GCP is a pretty cool place to do it
Chris Kernaghan
Principal Consultant
Data Lab vs Data Factory
Big data Lab – the world’s biggest
• WLCG – Worldwide LHC Computing Grid
• 170 computing facilities
• 200,000 cores
• 300GB/s data stream ingestion
• 300MB/s data stream filtered
• 27TB raw data per day
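As a rough sanity check on the figures above, the arithmetic can be sketched in a few lines. This assumes decimal units (1 GB = 1000 MB, 1 TB = 1,000,000 MB) and a constant stream rate, neither of which is stated on the slide. The function names are illustrative only.

```python
# Figures quoted on the slide; decimal units and a constant rate are assumptions.
INGEST_MB_PER_S = 300 * 1000   # 300 GB/s ingested
FILTERED_MB_PER_S = 300        # 300 MB/s kept after filtering
SECONDS_PER_DAY = 24 * 60 * 60


def reduction_factor(ingest_mb_s: float, filtered_mb_s: float) -> float:
    """How many times smaller the filtered stream is than the ingested one."""
    return ingest_mb_s / filtered_mb_s


def daily_volume_tb(rate_mb_s: float) -> float:
    """Data volume in TB produced by a constant stream over one day."""
    return rate_mb_s * SECONDS_PER_DAY / 1_000_000


if __name__ == "__main__":
    factor = reduction_factor(INGEST_MB_PER_S, FILTERED_MB_PER_S)
    print(f"Filtering keeps roughly 1 part in {factor:.0f}")
    print(f"Filtered stream: {daily_volume_tb(FILTERED_MB_PER_S):.2f} TB/day")
```

Under these assumptions the filter keeps about 1 part in 1000, and the filtered stream comes to 25.92 TB per day, which is in the same range as the 27TB per day quoted.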
Big data Lab – Traditional Home brew
• Based on VMware, VirtualBox or Raspberry Pi
• Mix of hardware
• Limited resources – 6 cores, 128GB space
• Low performance – 1 GHz Processor
• Lots of babysitting
• Equal measures of heartbreak and joy
Big data Lab – Using Cloud
• IaaS and PaaS services
• Mix of applications
• Effectively unlimited resources, available on demand
• High performance
• Access to quality data sets
• Utility billing
• Sharable outcomes
Big data platforms in the Cloud - AWS
Big data platforms in the Cloud - GCP
Big data platforms in the Cloud - Azure
Big data platforms in the Cloud - SAP
Big data platforms in the Cloud - IBM
Common characteristics of Cloud based platforms
Streaming Engine
Data Storage
Hadoop
In Memory Engine
Machine Learning
Analytics
Why have a lab
• Data is a complex beast with several attributes
  • Quality – different tasks require different data quality
    • Machine Learning & Predictive
    • Reporting
  • Context – data context is vital for analytics
    • Story of the data
  • Volume – how much data is there
    • Testing requirements for data latency
  • Format – data format is not universal
    • Different applications have different data types
  • Analysis
    • What and how to analyse
A lab is essential for testing these items before large-scale factory work is done
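A first lab exercise along these lines might be a quick profile of a sample file, covering volume (row count), quality (missing values) and format (which columns parse as numbers). The sketch below uses only the Python standard library. The sample data, column names and the `profile_csv` helper are made up for illustration.

```python
import csv
import io
from collections import Counter

# Made-up sample standing in for a real extract.
SAMPLE = """order_id,weight_kg,depot
1,12.5,Leeds
2,,Bristol
3,n/a,
"""


def profile_csv(text: str) -> dict:
    """Summarise volume, quality and format of a small CSV sample."""
    rows = list(csv.DictReader(io.StringIO(text)))
    columns = list(rows[0].keys()) if rows else []
    missing = Counter()   # empty cells per column (quality)
    numeric = Counter()   # cells that parse as a number per column (format)
    for row in rows:
        for col in columns:
            value = (row[col] or "").strip()
            if not value:
                missing[col] += 1
                continue
            try:
                float(value)
                numeric[col] += 1
            except ValueError:
                pass  # non-numeric text such as "n/a" or a depot name
    return {
        "rows": len(rows),
        "missing": {c: missing[c] for c in columns},
        "numeric": {c: numeric[c] for c in columns},
    }


if __name__ == "__main__":
    print(profile_csv(SAMPLE))
```

A profile like this will not say whether the data is good enough for machine learning or only for reporting, but it gives the lab a concrete starting point for that discussion.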
Define your goals
• Achieving the best use of resources is critical
  • Cloud-based Big Data labs have a direct charge model
  • Homebrew Big Data labs have limited resources
• Define what the outcome of the lab work is
  • This is no different to a proper science experiment
• Design your lab and define your tools
  • You have to use the right tool for the job, not just those you are familiar with
• Define your data set
  • Work out what data you need
  • Gain permission to use what you need if required
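Because a cloud lab is charged directly, it can help to put a rough price on each planned experiment before starting. The sketch below is one way to do that. The hourly rate, the experiment names and the budget are hypothetical placeholders, not real GCP or AWS prices.

```python
from dataclasses import dataclass


@dataclass
class Experiment:
    name: str
    nodes: int
    hours: float
    rate_per_node_hour: float  # hypothetical price, not a real cloud rate

    def cost(self) -> float:
        """Cost of running every node for the whole experiment."""
        return self.nodes * self.hours * self.rate_per_node_hour


def within_budget(experiments, budget):
    """Return the total planned cost and whether it fits the budget."""
    total = sum(e.cost() for e in experiments)
    return total, total <= budget


if __name__ == "__main__":
    plan = [
        Experiment("ingest test", nodes=4, hours=3, rate_per_node_hour=0.50),
        Experiment("model training", nodes=8, hours=2.5, rate_per_node_hour=0.50),
    ]
    total, ok = within_budget(plan, budget=20.0)
    print(f"Planned cost: {total:.2f}, within budget: {ok}")
```

The same plan works for a homebrew lab if cost is swapped for the scarce resource, for example core-hours on the available hardware.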
Mind the gap and acquire knowledge
Part of the fun of big data labs is working out what you don’t know
• A particular framework
• An algorithm
• A data set
• A visualisation
The next fun part is working out where to fill that knowledge gap
• Online sources
  • Kaggle
  • MOOCs – Andrew Ng’s Stanford course
  • Forums – Stack Overflow
It is also implicit that you share what you have learnt once you have learnt it
SAP and Big Data platforms
• In-Memory Store
• Simplified processing of large volumes of archived data
• HANA SDA / Spark Adapter – HANA-Spark Adapter for real-time understanding of current data with historical context
• Unified administration using HANA cockpit simplifies system management
[Diagram: SAP HANA Platform and Hadoop Cluster. Labels: SAP HANA (Application Services, Database Services, Processing Services, Integration Services); HANA Smart Data Access; Dynamic Tiering; Structured Storage; Spark API enhancement; Vora and Spark (three instances); YARN; HDFS files]
SAP HANA Express Edition
• Fast application development and deployment with essential features
• Free up to 32GB of memory – upgradeable for a fee
• Flexible access from a laptop, desktop, server, Cloud platform
• Pre-packaged with sample code and data
• Downloadable from the SAP Developer Network
Big data datasets
Companies are really bad at using external data sets
• There are many public data sets which can be used to complement existing internal data.
• Weather data for logistics companies
• AWS Public Datasets
• Google Public Datasets
• GitHub Public Datasets
• Kaggle Public Datasets
• Data.gov.uk Public Datasets
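To make the weather and logistics example concrete, the sketch below joins an internal shipment table to an external weather table and compares delivery delays on wet and dry days. All values, depot names and the 5 mm threshold are made up for illustration. A real lab would load the weather data from one of the public sources listed above.

```python
from statistics import mean

# Made-up internal data: (date, depot, delivery delay in minutes).
SHIPMENTS = [
    ("2024-01-08", "Leeds", 12),
    ("2024-01-08", "Bristol", 5),
    ("2024-01-09", "Leeds", 45),
    ("2024-01-09", "Bristol", 8),
]

# Made-up external data: (date, depot) -> rainfall in mm.
WEATHER = {
    ("2024-01-08", "Leeds"): 0.0,
    ("2024-01-08", "Bristol"): 1.2,
    ("2024-01-09", "Leeds"): 18.4,
    ("2024-01-09", "Bristol"): 0.3,
}


def enrich(shipments, weather):
    """Attach rainfall to each shipment; None when no weather record exists."""
    return [
        {"date": d, "depot": p, "delay_min": delay, "rain_mm": weather.get((d, p))}
        for d, p, delay in shipments
    ]


def mean_delay_by_rain(rows, wet_threshold_mm=5.0):
    """Average delay on wet versus dry days, skipping rows without weather."""
    known = [r for r in rows if r["rain_mm"] is not None]
    wet = [r["delay_min"] for r in known if r["rain_mm"] >= wet_threshold_mm]
    dry = [r["delay_min"] for r in known if r["rain_mm"] < wet_threshold_mm]
    return {"wet": mean(wet) if wet else None, "dry": mean(dry) if dry else None}


if __name__ == "__main__":
    print(mean_delay_by_rain(enrich(SHIPMENTS, WEATHER)))
```

Joining on date and location is the simplest possible key. Real public data sets often need their dates, units and place names normalised first, which is exactly the kind of work a lab is for.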
AWS Big data datasets
Google Big data datasets
GitHub Big data datasets
Kaggle Big data datasets
Data.gov.uk Big data datasets
SAP HANA Express Edition Deploying in GCP
DEMO

Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 

Build a Big Data Lab in the Cloud

  • 1. How and why you need to build a Big Data Lab: why GCP is a pretty cool place to do it. Chris Kernaghan, Principal Consultant
  • 3. Big data Lab – the world’s biggest
      • WLCG – Worldwide LHC Computing Grid
      • 170 computing facilities
      • 200,000 cores
      • 300 GB/s data stream ingestion
      • 300 MB/s data stream filtered
      • 27 TB raw data per day
  • 4. Big data Lab – Traditional Home brew
      • Based on VMware, VirtualBox or Raspberry Pi
      • Mix of hardware
      • Limited resources – 6 cores, 128 GB space
      • Low performance – 1 GHz processor
      • Lots of babysitting
      • Equal measures of heartbreak and joy
  • 5. Big data Lab – Using Cloud
      • IaaS and PaaS services
      • Mix of applications
      • Infinite resources
      • High performance
      • Access to quality data sets
      • Utility billing
      • Sharable outcomes
  • 6. Big data platforms in the Cloud - AWS
  • 7. Big data platforms in the Cloud - GCP
  • 8. Big data platforms in the Cloud - Azure
  • 9. Big data platforms in the Cloud - SAP
  • 10. Big data platforms in the Cloud - IBM
  • 11. Common characteristics of Cloud-based platforms
      • Streaming Engine
      • Data Storage
      • Hadoop
      • In-Memory Engine
      • Machine Learning
      • Analytics
  • 12. Why have a lab
      • Data is a complex beast with several attributes
      • Quality – different tasks require different data quality
      • Machine Learning & Predictive
      • Reporting
      • Context – data context is vital for analytics
      • Story of the data
      • Volume – how much data is there
      • Testing requirements for data latency
      • Format – data format is not universal
      • Different applications have different data types
      • Analysis
      • What and how to analyse
    A lab is essential for testing these items before large-scale factory work is done.
  • 15. Define your goals
      • Achieving the best use of resources is critical
      • Cloud-based Big Data labs have a direct charge model
      • Homebrew Big Data labs have limited resources
      • Define what the outcome of the lab work is
      • This is no different to a proper science experiment
      • Design your lab and define your tools
      • You have to use the right tool for the job, not just the ones you are familiar with
      • Define your data set
      • Work out what data you need
      • Gain permission to use what you need, if required
  • 17. Mind the gap and acquire knowledge
    Part of the fun of big data labs is working out what you don’t know:
      • A particular framework
      • An algorithm
      • A data set
      • A visualisation
    The next fun part is working out where to fill that knowledge gap:
      • Online sources – Kaggle
      • MOOCs – Andrew Ng’s Stanford course
      • Forums – Stack Overflow
    It is also implicit that you share what you have learnt once you have
  • 19. SAP and Big Data platforms
      • In-Memory Store – simplified processing of large volumes of archived data
      • HANA SDA / Spark Adapter – HANA-Spark Adapter for real-time understanding of current data with historical context
      • Unified administration – using HANA cockpit administration simplifies system management
    [Architecture diagram. Labels: SAP HANA Platform (SAP HANA; Application Services, Database Services, Processing Services, Integration Services; HANA Smart Data Access; Structured Storage; Dynamic Tiering; Spark API enhancement) and Hadoop Cluster (YARN, HDFS, Files, Vora, Spark)]
  • 20. SAP HANA Express Edition
      • Fast application development and deployment with essential features
      • Free up to 32 GB of memory – upgradeable for a fee
      • Flexible access from a laptop, desktop, server or Cloud platform
      • Pre-packaged with sample code and data
      • Downloadable from the SAP Developer Network
  • 21. Big data datasets
    Companies are really bad at using external data sets.
      • There are many public data sets which can be used to complement existing internal data
      • Weather data for logistics companies
      • AWS Public Datasets
      • Google Public Datasets
      • GitHub Public Datasets
      • Kaggle Public Datasets
      • Data.gov.uk Public Datasets
  • 22. AWS Big data datasets
  • 23. Google Big data datasets
  • 24. GitHub Big data datasets
  • 25. Kaggle Big data datasets
  • 27. SAP HANA Express Edition – Deploying in GCP (demo)
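
As a rough illustration of the "Volume – how much data is there" question, the WLCG figures from slide 3 can be sanity-checked with a few lines of arithmetic. This is only a sketch: it assumes decimal units (1 TB = 1,000,000 MB) and a constant, sustained rate, neither of which the slides state, and the function and variable names are made up for the example.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def daily_volume_tb(rate_mb_per_s: float) -> float:
    """Convert a sustained rate in MB/s to TB/day (decimal units, an assumption)."""
    return rate_mb_per_s * SECONDS_PER_DAY / 1_000_000


# Figures quoted on slide 3
ingest_mb_per_s = 300 * 1000   # 300 GB/s ingested, expressed in MB/s
filtered_mb_per_s = 300        # 300 MB/s kept after filtering

# How much the filter reduces the stream
reduction_factor = ingest_mb_per_s / filtered_mb_per_s

# Daily volume implied by the filtered rate, if it were sustained all day
filtered_tb_per_day = daily_volume_tb(filtered_mb_per_s)

print(f"Filter keeps 1 part in {reduction_factor:.0f}")
print(f"Filtered stream: about {filtered_tb_per_day:.2f} TB/day")
```

Under these assumptions the filtered stream works out to roughly 26 TB per day, the same order as the 27 TB raw data per day quoted on the slide. The deck does not say how its figure was derived. The same helper can be pointed at your own source rates to estimate how much storage a lab run would need before you start paying for it.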