This document covers building a big data lab using cloud services such as Google Cloud Platform (GCP). It notes that traditional homebrew labs have limited resources, while cloud-based labs provide effectively unlimited resources and utility billing. It emphasises defining goals for the lab work, acquiring the necessary skills and knowledge, and using public datasets to complement internal data. Choosing the right tools and cloud platform – GCP, AWS, or Azure – matters for high-performance analytics across large data volumes and varied formats.
3. Big data Lab – the world’s biggest
• WLCG – Worldwide LHC
Computing Grid
• 170 Computing facilities
• 200,000 Cores
• 300GB/s data stream
ingestion
• 300MB/s data stream
filtered
• 27TB RAW data per day
4. Big data Lab – Traditional homebrew
• Based on VMware, VirtualBox, or Raspberry Pi
• Mix of hardware
• Limited resources – e.g. 6 cores, 128 GB of storage
• Low performance – 1 GHz processors
• Lots of babysitting
• Equal measures of heartbreak and joy
5. Big data Lab – Using Cloud
• IaaS and PaaS services
• Mix of applications
• Effectively unlimited resources
• High performance
• Access to quality data sets
• Utility billing
• Sharable outcomes
11. Common characteristics of Cloud-based platforms
• Streaming engine
• Data storage
• Hadoop
• In-memory engine
• Machine learning
• Analytics
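Most of these components meet in a single job: an in-memory engine reading from data storage and producing analytics. A minimal PySpark sketch under assumed inputs – the bucket path and the pickup_date and fare columns are hypothetical:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lab-analytics").getOrCreate()

# Read from cloud object storage (hypothetical bucket and schema).
trips = spark.read.csv("gs://my-lab-bucket/trips.csv", header=True, inferSchema=True)
trips.cache()  # keep the working set in memory for repeated queries

# A simple analytics aggregate: trip count and average fare per day.
daily = (
    trips.groupBy("pickup_date")
    .agg(F.count("*").alias("trips"), F.avg("fare").alias("avg_fare"))
    .orderBy("pickup_date")
)
daily.show(10)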
12. Why have a lab
• Data is a complex beast; it has several attributes
  • Quality – different tasks require different data quality
    • Machine learning and predictive work
    • Reporting
  • Context – data context is vital for analytics
    • The story of the data
  • Volume – how much data is there?
    • Testing requirements for data latency
  • Format – data format is not universal
    • Different applications have different data types
  • Analysis
    • What and how to analyse
A lab is essential for testing these items before large-scale factory work is done.
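To make the quality attribute concrete, a small profiling pass is usually the first lab task. A pandas sketch, assuming a hypothetical sample.csv extract of the candidate data set:

import pandas as pd

df = pd.read_csv("sample.csv")  # hypothetical extract of the lab data set

# Per-column profile: type, missing values, and cardinality. Reporting may
# tolerate gaps that would sink a machine learning model, so measure first.
profile = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "nulls": df.isna().sum(),
    "null_pct": (df.isna().mean() * 100).round(2),
    "unique": df.nunique(),
})
print(profile)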
15. Define your goals
• Achieving the best use of resources is critical
  • Cloud-based Big Data labs have a direct charge model
  • Homebrew Big Data labs have limited resources
• Define what the outcome of the lab work is
  • This is no different to a proper science experiment
• Design your lab and define your tools
  • You have to use the right tool for the job, not just those you are familiar with
• Define your data set
  • Work out what data you need
  • Gain permission to use it where required
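Because cloud labs charge directly, a back-of-envelope cost estimate belongs in the goal definition. A sketch with purely hypothetical rates – substitute your provider's current price list:

# All rates below are made-up placeholders, not real cloud prices.
HOURLY_RATE_PER_NODE = 0.20        # assumed $/hour per worker VM
STORAGE_RATE_PER_GB_MONTH = 0.02   # assumed $/GB-month of object storage

def session_cost(nodes: int, hours: float, data_gb: float) -> float:
    """Rough cost of one lab session: compute time plus a month of storage."""
    compute = nodes * hours * HOURLY_RATE_PER_NODE
    storage = data_gb * STORAGE_RATE_PER_GB_MONTH
    return compute + storage

# e.g. a 3-node cluster for 4 hours over 500 GB of data
print(f"${session_cost(nodes=3, hours=4, data_gb=500):.2f}")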
17. Mind the gap and acquire knowledge
Part of the fun of big data labs is working out what you don’t know:
• A particular framework
• An algorithm
• A data set
• A visualisation
The next fun part is working out where to fill that knowledge gap:
• Online sources
  • Kaggle
  • MOOCs – Andrew Ng’s Stanford course
  • Forums – Stack Overflow
It is also implicit that you share what you have learnt once you have filled the gap.
19. SAP and Big Data platforms
[Architecture diagram: the SAP HANA Platform – application, database, processing, and integration services, with structured storage and dynamic tiering – connects via HANA Smart Data Access and a Spark API enhancement to a Hadoop cluster running Spark and SAP Vora on YARN over HDFS.]
• In-memory store – simplified processing of large volumes of archived data
• HANA SDA / Spark Adapter – real-time understanding of current data with historical context
• Unified administration – the HANA cockpit simplifies system management
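The SDA/Vora wiring shown above is product-specific, but the underlying idea – Spark reading HANA tables alongside Hadoop data – can be approximated with a plain JDBC read. This is a simplified stand-in, not the Smart Data Access adapter itself; the host, port, credentials, and table name are placeholders, and it assumes SAP's JDBC driver (ngdbc.jar, class com.sap.db.jdbc.Driver) is on the Spark classpath:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hana-read").getOrCreate()

# Plain JDBC read from HANA – a stand-in for the SDA/Vora integration.
hana_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sap://hana-host:39015")   # placeholder host/port
    .option("driver", "com.sap.db.jdbc.Driver")    # SAP JDBC driver class
    .option("dbtable", "LAB.SENSOR_READINGS")      # hypothetical table
    .option("user", "LAB_USER")
    .option("password", "***")
    .load()
)

# Current HANA data can now be joined with historical data held in HDFS.
hana_df.groupBy("DEVICE_ID").count().show()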
20. SAP HANA Express Edition
• Fast application development and deployment with essential features
• Free up to 32 GB of memory – upgradeable for a fee
• Flexible access from a laptop, desktop, server, or Cloud platform
• Pre-packaged with sample code and data
• Downloadable from the SAP Developer Network
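Once an Express instance is running, a quick smoke test is to connect with SAP's hdbcli Python driver. A minimal sketch; the host and credentials are placeholders, and 39015 is the usual SQL port of the HXE tenant database (instance 90) – verify against your own installation:

from hdbcli import dbapi  # SAP's Python driver, available on PyPI

conn = dbapi.connect(
    address="hxehost",   # placeholder hostname
    port=39015,          # typical HXE tenant SQL port; verify locally
    user="SYSTEM",
    password="***",
)

cursor = conn.cursor()
cursor.execute("SELECT * FROM M_DATABASE")  # system view sanity check
print(cursor.fetchall())
conn.close()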
21. Big data datasets
Companies are remarkably poor at using external data sets.
• There are many public data sets which can be used to complement existing internal data
  • Weather data for logistics companies, for example
• AWS Public Datasets
• Google Public Datasets
• GitHub Public Datasets
• Kaggle Public Datasets
• Data.gov.uk Public Datasets
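The weather-for-logistics example can be tried immediately against Google's public datasets. A sketch using the google-cloud-bigquery client over the public NOAA GSOD tables; it assumes application default credentials are configured, and the table and column names should be verified in the BigQuery console:

from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()  # assumes application default credentials

# Warmest stations in one year of the public NOAA surface weather data.
query = """
    SELECT stn, AVG(temp) AS avg_temp_f
    FROM `bigquery-public-data.noaa_gsod.gsod2019`
    GROUP BY stn
    ORDER BY avg_temp_f DESC
    LIMIT 10
"""
for row in client.query(query).result():
    print(row.stn, round(row.avg_temp_f, 1))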