4. Terminology
Crawlers scan data sets and populate the Glue Data Catalog.
Glue Data Catalog - Metadata about the schema and location of the data. Created by the crawler and read by Glue ETL jobs.
A Glue job consists of the business logic that performs the ETL work. Glue is a managed service that runs on Apache Spark; the boilerplate job code is generated in Python or Scala.
Parquet is an open-source file format available to any project in the Hadoop ecosystem. Apache Parquet is a flat columnar storage format designed for efficient, performant analytics compared to row-based formats such as CSV or TSV.
Athena is a serverless interactive query service based on the Presto processing engine. It can run analytic queries on large data sets stored in S3 buckets in CSV, Parquet, or JSON formats.
5.
6. Setup for S3 Source buckets
1. Source raw data S3 bucket.
2. Partitioned source raw data S3 bucket.
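The partitioned source bucket typically uses Hive-style key=value prefixes, which Glue crawlers recognize as partition columns. A minimal sketch of how such object keys might be built (the bucket prefix and partition columns are illustrative assumptions, not from the original notes):

```python
def partitioned_key(client: str, year: int, month: int, filename: str) -> str:
    """Build a Hive-style partitioned S3 object key.

    Glue crawlers infer partition columns from key=value path
    segments (here: client, year, month).
    """
    return f"raw/client={client}/year={year}/month={month:02d}/{filename}"

key = partitioned_key("acme", 2020, 6, "orders.csv")
```

Laying the raw data out this way up front is what lets the crawler register partitions in the Data Catalog instead of treating the bucket as one flat table.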
8. Set up the Glue ETL Job
Parquet-format data lands in the target S3 bucket after the Glue job runs.
9. Demo
Recorded video
Import new client - partitioned use case
Run crawler - Data Catalog update, schema update
Run Glue job - update Spark code to include the new partition
Check the new partition in the Parquet S3 bucket.
Query with Athena after registering the new partition.
Visualization Options.
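Once the new partition lands in S3, Athena has to be told about it before queries can see the data, either with MSCK REPAIR TABLE or an explicit ALTER TABLE ... ADD PARTITION. A sketch that builds the latter DDL string (the table, partition column, and S3 location are hypothetical names for illustration):

```python
def add_partition_sql(table: str, client: str, location: str) -> str:
    """Build the Athena DDL that registers one new partition."""
    return (
        f"ALTER TABLE {table} "
        f"ADD IF NOT EXISTS PARTITION (client = '{client}') "
        f"LOCATION '{location}'"
    )

ddl = add_partition_sql(
    "sales_parquet",
    "acme",
    "s3://target-bucket/client=acme/",
)
```

The resulting statement would be submitted through the Athena console or API; `MSCK REPAIR TABLE sales_parquet` is the simpler alternative when the bucket uses Hive-style key=value prefixes, at the cost of rescanning all prefixes.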