1. © 2020, Amazon Web Services, Inc. or its Affiliates.
Big Data for Startups
how to build Big Data applications in Serverless mode
Fausto Palma
AWS Solution Architect
2.
Use cases for Big Data tools
Standard tools are enough when: data fits in standard DBs, data is structured, no excessive load spikes
Volume: large amount of data not fitting resources (disk space, RAM or CPU)
Velocity: streaming, real-time analysis
Variety: different data formats (tabular, nested, images, video)
3.
Use cases for Big Data tools
Data lake: open formats, central catalog, data collected when available, even in raw format
Recommendation systems
Text mining
Supply chain flow optimization
Social network analysis
Anomaly detection
Sentiment analysis
Customer churn prevention
…
4.
Analytics overall architecture (Data Lake)
Pipeline stages: Data movement, Storage, Analytics, Data value
Catalog
Management | Security
5.
AWS services
Management | Security: AWS Lake Formation, AWS Key Management Service, AWS Identity & Access Management, Amazon Macie, …
Data movement: Managed Streaming for Apache Kafka, Amazon Kinesis Video Streams, Kinesis Data Streams, Kinesis Data Firehose
Storage: S3, Glacier
Catalog: AWS Glue data catalog
Analytics: Redshift, EMR (Spark & Hadoop), Athena, Elasticsearch Service, Kinesis Data Analytics, AWS Glue (Spark & Python)
Data value: QuickSight, SageMaker, Comprehend, Rekognition, Translate, Pinpoint, …
6.
General pattern to scale
Inputs: data, messages, streams, …
Mapping / sharding
Parallel processing
Shuffling
Reduction / aggregation
Outputs
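The pattern on this slide can be sketched in a few lines of plain Python. This is a minimal, single-machine illustration under stated assumptions: the word-count task, the function names, and the use of a thread pool to stand in for parallel workers are all hypothetical and not part of the original deck.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def shard(items, n_shards):
    """Sharding: split the input into n_shards roughly equal slices."""
    return [items[i::n_shards] for i in range(n_shards)]


def map_shard(lines):
    """Mapping: emit (word, 1) pairs for one shard."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def shuffle(mapped_shards):
    """Shuffling: bring all values that share a key together."""
    groups = defaultdict(list)
    for pairs in mapped_shards:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_groups(groups):
    """Reduction / aggregation: collapse each key's values into one output."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines, n_shards=3):
    shards = shard(lines, n_shards)
    # Parallel processing: each shard is mapped independently of the others.
    with ThreadPoolExecutor(max_workers=n_shards) as pool:
        mapped = list(pool.map(map_shard, shards))
    return reduce_groups(shuffle(mapped))


messages = ["sensor OK", "sensor FAIL", "sensor OK"]
counts = word_count(messages)
```

Only the shuffle step needs to see data from every shard; the map and reduce steps work on one shard or one key at a time, which is what lets them scale out.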
7.
Most secure infrastructure: certifications
Global
CSA: Cloud Security Alliance Controls
ISO 9001: Global Quality Standard
ISO 27001: Security Management Controls
ISO 27017: Cloud Specific Controls
ISO 27018: Personal Data Protection
PCI DSS Level 1: Payment Card Standards
SOC 1: Audit Controls Report
SOC 2: Security, Availability, & Confidentiality Report
SOC 3: General Controls Report
United States
CJIS: Criminal Justice Information Services
DoD SRG: DoD Data Processing
FedRAMP: Government Data Standards
FERPA: Educational Privacy Act
FIPS: Government Security Standards
FISMA: Federal Information Security Management
GxP: Quality Guidelines and Regulations
FFIEC: Financial Institutions Regulation
HIPAA: Protected Health Information
ITAR: International Arms Regulations
MPAA: Protected Media Content
NIST: National Institute of Standards and Technology
SEC Rule 17a-4(f): Financial Data Standards
VPAT/Section 508: Accountability Standards
Asia Pacific
FISC [Japan]: Financial Industry Information Systems
IRAP [Australia]: Australian Security Standards
K-ISMS [Korea]: Korean Information Security
MTCS Tier 3 [Singapore]: Multi-Tier Cloud Security Standard
My Number Act [Japan]: Personal Information Protection
Europe
C5 [Germany]: Operational Security Attestation
Cyber Essentials Plus [UK]: Cyber Threat Protection
G-Cloud [UK]: UK Government Standards
IT-Grundschutz [Germany]: Baseline Protection Methodology
https://aws.amazon.com/compliance/programs/
9.
Amazon Simple Storage Service “S3”
§ Built to store any amount of data
§ Runs on the world’s largest global cloud infrastructure
§ Designed to deliver 99.999999999% durability
§ Geographic redundancy & automatic replication
§ Tiered storage to optimize price/performance
Diagram: S3 across three AZs and two Transit nodes
10.
Process Data in Place
Data stays in Amazon S3 and is processed by: Amazon Athena, Amazon Redshift Spectrum, Amazon SageMaker, AWS Glue
11.
Amazon S3 Select (SQL)
Input format: delimited text (CSV, TSV), JSON, Parquet …
Input compression: GZIP, BZIP2 …
Output format: delimited text (CSV, TSV), JSON …
Clauses: Select, From, Where
Data types: String; Integer, Float, Decimal; Timestamp; Boolean
Operators: Conditional, Math, Logical, String (Like, ||)
Functions: String, Cast, Math, Aggregate
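As a rough sketch of how an S3 Select request might look from Python, the helper below builds the arguments for the `select_object_content` call of a boto3 S3 client and reads the streamed result. The bucket, key, and column names are hypothetical, the client is passed in by the caller, and the gzipped CSV input with JSON output is only one of the combinations listed on the slide.

```python
import json


def build_select_request(bucket, key, sql):
    """Arguments for SelectObjectContent: gzipped CSV with a header in, JSON lines out."""
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": sql,
        "InputSerialization": {
            "CSV": {"FileHeaderInfo": "USE"},
            "CompressionType": "GZIP",
        },
        "OutputSerialization": {"JSON": {}},
    }


def run_select(s3_client, bucket, key, sql):
    """Run the query and return the matching records as Python dicts."""
    response = s3_client.select_object_content(**build_select_request(bucket, key, sql))
    chunks = []
    for event in response["Payload"]:
        # The result arrives as a stream of events; only "Records" events carry data.
        if "Records" in event:
            chunks.append(event["Records"]["Payload"])
    # Join before decoding: a record may be split across two chunks.
    text = b"".join(chunks).decode("utf-8")
    return [json.loads(line) for line in text.splitlines() if line]
```

A call might look like `run_select(boto3.client("s3"), "my-bucket", "readings.csv.gz", "SELECT * FROM S3Object s WHERE s.status = 'FAIL'")`, where every name is a placeholder.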
12.
S3 – how to access
AWS S3 console: https://s3.console.aws.amazon.com/s3/
AWS S3 API documentation: https://docs.aws.amazon.com/AmazonS3/latest/API/API_Operations.html
AWS S3 CLI: https://docs.aws.amazon.com/cli/latest/reference/s3/#available-commands
14.
Kinesis Data Firehose: How it Works
Ingest: AWS IoT, Amazon Kinesis Agent, Amazon Kinesis Streams, Amazon CloudWatch Logs, Amazon CloudWatch Events, Managed Streaming for Kafka
Transform: Lambda function
Deliver: Amazon S3, Amazon Redshift, Amazon Elasticsearch Service
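The Transform step can be illustrated with a small Lambda handler. This is a sketch, not the handler used in the talk: it assumes JSON records with a `status` field (as in the sensor example later in the deck), drops the `OK` readings, and appends a newline to the records it keeps. The event and response shapes follow the Firehose data-transformation contract (base64 `data`, `recordId`, and a `result` of `Ok`, `Dropped`, or `ProcessingFailed`).

```python
import base64
import json


def lambda_handler(event, context):
    """Firehose transformation: keep non-OK readings, one JSON object per line."""
    output = []
    for record in event["records"]:
        record_id = record["recordId"]
        try:
            payload = json.loads(base64.b64decode(record["data"]))
        except ValueError:
            # Not valid base64/JSON: hand the record back marked as failed.
            output.append({"recordId": record_id, "result": "ProcessingFailed",
                           "data": record["data"]})
            continue
        if payload.get("status") == "OK":
            # Assumption for this sketch: healthy readings are not delivered.
            output.append({"recordId": record_id, "result": "Dropped",
                           "data": record["data"]})
            continue
        line = (json.dumps(payload) + "\n").encode("utf-8")
        output.append({"recordId": record_id, "result": "Ok",
                       "data": base64.b64encode(line).decode("utf-8")})
    return {"records": output}
```

Every incoming `recordId` appears exactly once in the response, which is what lets Firehose match transformed records back to the originals.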
15.
Kinesis Data Firehose – how to access
AWS Kinesis Data Firehose console: https://eu-west-1.console.aws.amazon.com/kinesis/
AWS Kinesis Data Firehose API documentation: https://docs.aws.amazon.com/firehose/latest/APIReference/API_Operations.html
AWS Kinesis Data Firehose CLI: https://docs.aws.amazon.com/cli/latest/reference/firehose/index.html#available-commands
16.
Simple demo
Flow: App to Amazon Kinesis Data Firehose (data movement) to Amazon Simple Storage Service (S3) (storage)
17.
Amazon Kinesis Data Generator (KDG)
https://awslabs.github.io/amazon-kinesis-data-generator/web/help.html
{
"sensorId": {{random.number(50)}},
"currentTemperature": {{random.number(
{
"min":15,
"max":38
}
)}},
"status":
"{{random.weightedArrayElement(
{
"weights": [0.9,0.03,0.07],
"data": ["OK","FAIL","WARN"]
}
)}}"
}
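For readers who prefer to generate the same test data from code instead of the KDG web page, the template above could be approximated in Python as follows. This is a sketch under assumptions: the function names and the delivery stream name are hypothetical, the value ranges are treated as inclusive, and the client is expected to be a boto3 Firehose client (only its `put_record` call is used).

```python
import json
import random


def make_reading(rng=random):
    """One record shaped like the KDG template above."""
    return {
        "sensorId": rng.randint(0, 50),
        "currentTemperature": rng.randint(15, 38),
        # Same weights as the template: mostly OK, occasionally FAIL or WARN.
        "status": rng.choices(["OK", "FAIL", "WARN"], weights=[0.9, 0.03, 0.07])[0],
    }


def send_reading(firehose_client, stream_name, reading):
    """Send one reading to a delivery stream as a newline-terminated JSON line."""
    data = (json.dumps(reading) + "\n").encode("utf-8")
    return firehose_client.put_record(
        DeliveryStreamName=stream_name,
        Record={"Data": data},
    )
```

A call might look like `send_reading(boto3.client("firehose"), "demo-stream", make_reading())`; the trailing newline keeps records separable once Firehose concatenates them into S3 objects.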
19.
Glue Catalog and Crawlers
Data Lake (S3): scanned by the AWS Glue Crawler
AWS Glue Crawler: populates the AWS Glue Data Catalog
AWS Glue Data Catalog: acts as a Hive metastore service
Catalog consumers: EMR, Athena, AWS Glue Jobs
20.
Glue Catalog console
21.
Glue Crawlers console
23.
Athena console
https://eu-west-1.console.aws.amazon.com/athena/
Select catalog
Select database
Write Query
S3
24.
Presto architecture
Client: sends the SQL query to the coordinator
Coordinator: parsing, planning, scheduling
Metastore: data locations
Workers: run the scheduled work and read data through connectors
Example query:
SELECT
sport,
count(distinct location) as locations,
count(distinct event_id) as events,
count(*) as tickets,
avg(ticket_price) as avg_ticket_price
FROM sporting_event_ticket_info
GROUP BY 1
ORDER BY 1;
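The same kind of query can also be submitted outside the console. The sketch below assumes a boto3 Athena client passed in by the caller; the database name and the S3 output location are placeholders, and only the first page of results is read.

```python
import time


def run_athena_query(athena_client, sql, database, output_location, poll_seconds=1.0):
    """Submit a query, wait for it to finish, and return the first page of rows."""
    start = athena_client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = start["QueryExecutionId"]

    # Athena runs queries asynchronously, so poll until a final state is reached.
    while True:
        status = athena_client.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)
    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query ended in state {state}")

    result = athena_client.get_query_results(QueryExecutionId=query_id)
    # For SELECT statements the first row holds the column names.
    return [[col.get("VarCharValue") for col in row["Data"]]
            for row in result["ResultSet"]["Rows"]]
```

A call might look like `run_athena_query(boto3.client("athena"), "SELECT ...", "my_database", "s3://my-bucket/athena-results/")`, with all three names hypothetical.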
25.
Row vs Columnar file orientation
Tabular data
File in storage or streaming
27.
Row vs Columnar file orientation
File in storage or streaming
Nested data
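The difference between the two orientations can be made concrete with a toy example. The sketch below lays out the same three records first row by row and then column by column; the sample records and function names are hypothetical, and real formats add encoding, compression, and metadata on top of this basic idea.

```python
def to_row_oriented(records, fields):
    """Row orientation: all fields of one record are stored next to each other."""
    return [record[field] for record in records for field in fields]


def to_column_oriented(records, fields):
    """Columnar orientation: all values of one field are stored next to each other."""
    return {field: [record[field] for record in records] for field in fields}


FIELDS = ["sensorId", "currentTemperature", "status"]
readings = [
    {"sensorId": 1, "currentTemperature": 21, "status": "OK"},
    {"sensorId": 2, "currentTemperature": 35, "status": "FAIL"},
    {"sensorId": 3, "currentTemperature": 18, "status": "OK"},
]

row_file = to_row_oriented(readings, FIELDS)
column_file = to_column_oriented(readings, FIELDS)

# Reading one column from the row layout means stepping over every record...
temps_from_rows = row_file[FIELDS.index("currentTemperature")::len(FIELDS)]
# ...while the columnar layout keeps those values in one contiguous run.
temps_from_columns = column_file["currentTemperature"]
```

Analytical queries that touch a few columns of many rows tend to favor the columnar layout, while record-at-a-time streaming tends to favor the row layout.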
28.
Different file formats
Avro | ORC (Optimized Row Columnar) | Parquet
Compression ★ ★ ★ ★ ★ ★ ★ ★ ★
Schema evolution ★ ★ ★ ★ ★ ★ ★
Row vs column row column column
Splittability ★ ★ ★ ★ ★ ★ ★ ★ ★ ★
Nested fields support ★ ★ ★ ★ ★ ★ ★ ★ ★ ★
Best for Schema evolution Compression Nested fields
30.
Glue jobs console
https://aws.amazon.com/blogs/big-data/making-etl-easier-with-aws-glue-studio/
31.
RDD data structure
RDD
§ Resilient: resiliency by re-executing the DAG (directed acyclic graph) in case of failures
§ Distributed: distributed on multiple nodes to take advantage of parallel processing
§ Datasets: collection of objects that may be organized in key-object pairs
32.
Narrow transformation – no shuffling among partitions
§ map()
§ flatMap()
§ mapPartitions()
§ filter()
§ sample()
§ union()
33.
Wide transformation – shuffling among partitions
§ intersection()
§ distinct()
§ reduceByKey()
§ groupByKey()
§ join()
§ cartesian()
§ repartition()
§ coalesce() (shuffles only when called with shuffle=True)
34.
Spark Operations
Narrow transformations: map(), flatMap(), mapPartitions(), filter(), sample(), union()
Wide transformations: intersection(), distinct(), reduceByKey(), groupByKey(), join(), cartesian(), repartition(), coalesce()
Actions: count(), collect(), take(), top(), countByValue(), reduce(), fold(), aggregate(), foreach()
35.
Spark basic job execution process
spark-submit mycode.py
Driver
spark = SparkSession...
spark.sparkContext
rdd_1 = spark.read...
rdd_2 = spark.read...
rdd_3 = rdd_1.filter(...)
rdd_4 = rdd_2.filter(...)
rdd_5 = rdd_3.join(rdd_4)
rdd_6 = rdd_5.filter(...)
output = rdd_6.count(...)
DAG Scheduler: builds the DAG for the job, splits it into stages (stage_1, stage_2, stage_3) and tasks, and signals the Task Scheduler
Cluster Manager: allocates worker nodes and starts executors
Task Scheduler: places tasks on executors
Worker nodes: host the executors, which run the tasks on the RDD partitions
...
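One point worth making explicit is that transformations only describe work, and nothing runs until an action such as `count()` is called. The toy class below imitates that behavior in plain Python. It is not Spark and does not use the Spark API: `LazyDataset` and the sample values are hypothetical, and there are no partitions, stages, or executors in it.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them on an action."""

    def __init__(self, source, lineage=()):
        self._source = source
        self._lineage = tuple(lineage)

    # Transformations: return a new dataset with a longer lineage, do no work.
    def map(self, fn):
        return LazyDataset(self._source, self._lineage + (("map", fn),))

    def filter(self, predicate):
        return LazyDataset(self._source, self._lineage + (("filter", predicate),))

    def _compute(self):
        items = iter(self._source)
        for kind, fn in self._lineage:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # Actions: replay the whole lineage from the source and return a result.
    def count(self):
        return sum(1 for _ in self._compute())

    def collect(self):
        return list(self._compute())


calls = []


def is_hot(temperature):
    calls.append(temperature)  # record every evaluation of the predicate
    return temperature > 30


readings = LazyDataset([21, 35, 18, 38])
hot_excess = readings.filter(is_hot).map(lambda t: t - 30)

evaluated_before_action = len(calls)  # still zero: nothing has run yet
hot_count = hot_excess.count()        # the action triggers the computation
```

Because each action replays the recorded lineage from the source, a lost result can always be rebuilt, which is the idea behind the resiliency described on the RDD slide.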
36.
Additional features in Glue jobs (focus on PySpark)
AWS Glue PySpark Transforms: GlueTransform, ApplyMapping, DropFields, DropNullFields, ErrorsAsDynamicFrame, Filter, FlatMap, Join, Map, MapToCollection, Relationalize, RenameField, ResolveChoice, SelectFields, SelectFromCollection, Spigot, SplitFields, SplitRows, Unbox, UnnestFrame
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-transforms.html
AWS Glue PySpark Extensions: getResolvedOptions, Types, DynamicFrame, DynamicFrameCollection, DynamicFrameWriter, DynamicFrameReader, GlueContext
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-extensions.html
Data structures: RDD, DataFrame, DataSet (Spark); DynamicFrame (Glue)
37.
Demo custom script
38.
movie_pairs = movies.join(movies, on=user)
movies:
movie | user | rating
A | 1 | ★★★★
A | 2 | ★
A | 3 | ★★★
B | 1 | ★
B | 2 | ★★★★
B | 3 | ★
C | 1 | ★★★
C | 2 | ★
C | 3 | ★★★★
D | 1 | ★★
D | 2 | ★★★★
D | 3 | ★★
movie_pairs:
movieX | userX | ratingX | movieY | userY | ratingY
A | 1 | ★★★★ | B | 1 | ★
A | 2 | ★ | B | 2 | ★★★★
A | 3 | ★★★ | B | 3 | ★
A | 1 | ★★★★ | C | 1 | ★★★
A | 2 | ★ | C | 2 | ★
A | 3 | ★★★ | C | 3 | ★★★★
A | 1 | ★★★★ | D | 1 | ★★
A | 2 | ★ | D | 2 | ★★★★
A | 3 | ★★★ | D | 3 | ★★
B | 1 | ★ | C | 1 | ★★★
B | 2 | ★★★★ | C | 2 | ★
B | 3 | ★ | C | 3 | ★★★★
B | 1 | ★ | D | 1 | ★★
B | 2 | ★★★★ | D | 2 | ★★★★
B | 3 | ★ | D | 3 | ★★
C | 1 | ★★★ | D | 1 | ★★
C | 2 | ★ | D | 2 | ★★★★
C | 3 | ★★★★ | D | 3 | ★★
39.
movie_pairs = movie_pairs.groupBy((movieX,movieY))
movieX | movieY | ratingX | ratingY
A | B | ★★★★ | ★
A | B | ★ | ★★★★
A | B | ★★★ | ★
A | C | ★★★★ | ★★★
A | C | ★ | ★
A | C | ★★★ | ★★★★
A | D | ★★★★ | ★★
A | D | ★ | ★★★★
A | D | ★★★ | ★★
B | C | ★ | ★★★
B | C | ★★★★ | ★
B | C | ★ | ★★★★
B | D | ★ | ★★
B | D | ★★★★ | ★★★★
B | D | ★ | ★★
C | D | ★★★ | ★★
C | D | ★ | ★★★★
C | D | ★★★★ | ★★
Group keys (movieX, movieY): (A, B), (A, C), (A, D), (B, C), (B, D), (C, D)
40.
movie_pairs = movie_pairs.groupBy((movieX,movieY))
similarity = movie_pairs.mapValues(cosine_similarity)
movieX | movieY | similarity
A | B | ≠
A | C | =
A | D | ≠
B | C | ≠
B | D | =
C | D | ≠
(= marks pairs whose rating vectors are similar, ≠ pairs whose rating vectors are not)
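The three steps on these slides (self-join on user, group by movie pair, cosine similarity per pair) can be reproduced in plain Python on the same twelve ratings. This is a sketch of the logic only, not the demo's PySpark script: the function names are hypothetical, and star ratings are written as integers.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt

# (movie, user, rating) rows from the slide, stars written as integers.
ratings = [
    ("A", 1, 4), ("A", 2, 1), ("A", 3, 3),
    ("B", 1, 1), ("B", 2, 4), ("B", 3, 1),
    ("C", 1, 3), ("C", 2, 1), ("C", 3, 4),
    ("D", 1, 2), ("D", 2, 4), ("D", 3, 2),
]


def pair_by_user(rows):
    """Self-join on user: one row per pair of movies rated by the same user."""
    by_user = defaultdict(list)
    for movie, user, rating in rows:
        by_user[user].append((movie, rating))
    for rated in by_user.values():
        for (movie_x, rating_x), (movie_y, rating_y) in combinations(sorted(rated), 2):
            yield (movie_x, movie_y), (rating_x, rating_y)


def group_pairs(pairs):
    """Group the (ratingX, ratingY) values by their (movieX, movieY) key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def cosine_similarity(rating_pairs):
    """Cosine of the angle between the ratingX vector and the ratingY vector."""
    dot = sum(x * y for x, y in rating_pairs)
    norm_x = sqrt(sum(x * x for x, _ in rating_pairs))
    norm_y = sqrt(sum(y * y for _, y in rating_pairs))
    return dot / (norm_x * norm_y)


similarity = {
    pair: cosine_similarity(values)
    for pair, values in group_pairs(pair_by_user(ratings)).items()
}
```

On this data the pairs (A, C) and (B, D) score highest, at about 0.96, which agrees with the two pairs the slide marks with "=".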
42.
QuickSight console
43.
AWS Training & Certification
https://www.aws.training: Free on-demand courses to help you build new cloud skills
E-Learning: Data Analytics Fundamentals
https://www.aws.training/Details/eLearning?id=35364
E-Learning: AWS Hadoop Fundamentals
https://www.aws.training/Details/eLearning?id=40337
Learning Path: Internet of Things Foundation Series
https://www.aws.training/Details/Curriculum?id=27289
Video: Serverless Analytics
https://www.aws.training/Details/Video?id=26848
Available AWS Certifications