Ankara Cloud Meetup – July 2016
Serkan ÖZAL
BIG DATA ON AWS
AGENDA
• What is Big Data?
• Big Data Concepts
• Big Data on AWS
• Data Storage on AWS
• Data Analytics & Querying on AWS
• Data Processing on AWS
• Data Flow on AWS
• Demo
2
3
WHAT IS BIG DATA???
4
5
4V OF BIG DATA
6
BIG DATA CONCEPTS
7
BIG DATA CONCEPTS
• Scalability
• Cloud Computing
• Data Storage
• Data Analytics & Querying
• Data Processing
• Data Flow
8
BIG DATA ON AWS
9
AWS BIG DATA PORTFOLIO
10
SCALABILITY ON AWS
• Types of scalability:
• Vertical Scalability: More/less CPU, RAM, ...
• Horizontal Scalability: More/fewer machines
• Auto scalable by
• CPU
• Memory
• Network
• Disk
• …
11
CLOUD COMPUTING ON AWS
• Rich service catalog
• «Pay as you go» payment model
• Cloud Computing models:
 IaaS (Infrastructure as a Service)
o EC2, S3
 PaaS (Platform as a Service)
o Elastic Beanstalk
 SaaS (Software as a Service)
o DynamoDB, ElasticSearch
12
DATA STORAGE ON AWS
13
DATA STORAGE ON AWS
• There are many different data storage services
for different purposes
• Key features for storage:
• Data storage services on AWS:
 S3
 Glacier
14
 Durability
 Security
 Cost
 Scalability
 Access Performance
 Availability
 Reliability
S3
• Secure, durable and highly-available storage service
• Bucket: Containers for objects
 Globally unique regardless of the AWS region
• Object: Stored entries
 Uniquely identified within a bucket by a name and a version ID
• Supports versioning
• WORM (Write once, read many)
• Consistency models
 «read-after-write» consistency for puts of new objects
 «eventual consistency» for puts/deletes of existing objects
 Updates to a single object are atomic.
15
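A minimal sketch of the bucket/object model above, using the AWS SDK for Java; the bucket name, key and content are hypothetical:

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.S3Object;

public class S3Example {

    public static void main(String[] args) {
        // Credentials and region come from the default provider chain
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        // First put of a new object: "read-after-write" consistency applies
        s3.putObject("my-globally-unique-bucket", "tweets/2016-07-01.json",
                     "{\"text\":\"hello\"}");

        // Read it back; updates to this key would be atomic but eventually consistent
        S3Object object = s3.getObject("my-globally-unique-bucket", "tweets/2016-07-01.json");
        System.out.println("Size: " + object.getObjectMetadata().getContentLength());
    }
}
```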
GLACIER
• Storage service optimized for infrequently used data
• Archives are stored in containers named «vaults»
• Uploading is a synchronous operation with replicas
• Downloading is an asynchronous operation through a retrieval job
• Immutable
• Encrypted by default
• «S3» data can be archived into «Glacier» by life-cycle rules
• 4x cheaper than «S3»
16
DATA ANALYTICS & QUERYING ON AWS
17
DATA ANALYTICS & QUERYING ON AWS
• There are different data analytics & querying services for different purposes
 NoSQL
 SQL
 Text search
 Analytics / Visualising
• Data analytics & querying services on AWS:
18
 DynamoDB
 Redshift
 RDS
 Elasticsearch (+ Kibana)
 CloudSearch
 QuickSight
DYNAMODB
• NoSQL database based on key-value or document storage
• Consistency model
• Eventual consistency by default
• Strong consistency as optional
• Supports
 «DynamoDB Streams» for tracking changes in near real time
 Secondary indexes
 Batch operations
 Conditional writes
 Atomic counters
• Integrated with
 EMR
 Elasticsearch
 Redshift
 Lambda
19
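A minimal sketch of two of the supported features above (conditional writes and atomic counters) with the AWS SDK for Java document API; the table, key and attribute names are hypothetical:

```java
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.document.DynamoDB;
import com.amazonaws.services.dynamodbv2.document.Item;
import com.amazonaws.services.dynamodbv2.document.Table;
import com.amazonaws.services.dynamodbv2.document.spec.PutItemSpec;
import com.amazonaws.services.dynamodbv2.document.spec.UpdateItemSpec;
import com.amazonaws.services.dynamodbv2.document.utils.ValueMap;

public class DynamoDBExample {

    public static void main(String[] args) {
        AmazonDynamoDB client = AmazonDynamoDBClientBuilder.defaultClient();
        Table table = new DynamoDB(client).getTable("tweets"); // hypothetical table, hash key "id"

        // Conditional write: insert only if no item with this key exists yet
        table.putItem(new PutItemSpec()
                .withItem(new Item().withPrimaryKey("id", "12345").withString("text", "hello"))
                .withConditionExpression("attribute_not_exists(id)"));

        // Atomic counter: increment server-side, avoiding a read-modify-write race
        table.updateItem(new UpdateItemSpec()
                .withPrimaryKey("id", "12345")
                .withUpdateExpression("SET retweetCount = if_not_exists(retweetCount, :zero) + :inc")
                .withValueMap(new ValueMap().withInt(":zero", 0).withInt(":inc", 1)));
    }
}
```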
REDSHIFT
• SQL data warehouse solution
• Fault tolerant by
 Replicating to other nodes in the cluster
 Backing up to «S3» continuously and automatically
• Scalable
 Vertically by allowing switching to a new cluster
 Horizontally by adding new nodes to the cluster
• Integrated with
20
 S3
 EMR
 DynamoDB
 Kinesis
 Data Pipeline
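Since Redshift is PostgreSQL-compatible on the wire, the S3 integration above is typically exercised with a SQL «COPY» over JDBC. A minimal sketch, assuming the PostgreSQL JDBC driver is on the classpath; the cluster endpoint, credentials, table and IAM role are hypothetical:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class RedshiftCopyExample {

    public static void main(String[] args) throws Exception {
        // Hypothetical cluster endpoint; Redshift listens on 5439 by default
        String url = "jdbc:postgresql://my-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/mydb";
        try (Connection conn = DriverManager.getConnection(url, "master", "Secret123");
             Statement stmt = conn.createStatement()) {
            // COPY loads S3 files in parallel across the cluster's nodes
            stmt.execute("COPY tweets FROM 's3://my-bucket/tweets/' "
                       + "IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole' "
                       + "FORMAT AS JSON 'auto'");
        }
    }
}
```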
RDS
• Makes it easier to set up, operate and scale a relational
database in the cloud
• Manages backups, software patching, automatic failure
detection and recovery
• Supports
 MySQL
 PostgreSQL
 Oracle
 Microsoft SQL Server
 Aurora (Amazon’s MySQL-compatible RDBMS with up to 5x performance)
21
ELASTICSEARCH
• Makes it easy to deploy, operate and scale «Elasticsearch» in the
AWS cloud
• Supports
 Taking snapshots for backup
 Restoring from backups
 Replicating domains across Availability Zones
• Integrated with
 «Kibana» for data visualization
 «Logstash» to process logs and load them into «ES»
 «S3», «Kinesis» and «DynamoDB» for loading data into «ES»
22
CLOUDSEARCH
• AWS’s managed and high-performance search solution
• Reliable by replicating the search domain across multiple AZs
• Automatic monitoring and recovery
• Auto scalable
• Supports popular search features (in 34 languages) such as
23
 Free-text search
 Faceted search
 Highlighting
 Autocomplete suggestion
 Geospatial search
 …
QUICKSIGHT
• AWS’s new very fast, cloud-powered business intelligence (BI)
service
• Uses a new, Super-fast, Parallel, In-memory Calculation Engine
(«SPICE») to perform advanced calculations and render
visualizations rapidly
• Supports creating analysis stories and sharing them
• Integrated/Supported data sources:
 EMR
 Kinesis
 S3
 DynamoDB
 Redshift
 RDS
 3rd party (e.g. Salesforce)
 Local file upload
 …
• Claimed to be 1/10th the cost of traditional BI solutions
• Native access on major mobile platforms
24
DATA PROCESSING ON AWS
25
DATA PROCESSING ON AWS
• Allows processing of huge amounts of data in a highly scalable, distributed way
• Supports both
 Batch
 Stream
data processing concepts
• Data processing services on AWS:
 EMR (Elastic Map-Reduce)
 Kinesis
 Lambda
 Machine Learning
26
EMR (ELASTIC MAPREDUCE)
• Big data service/infrastructure to process large amounts of data efficiently in a highly scalable manner
• Supported Hadoop distributions:
 Amazon
 MapR
• Low cost by «on-demand», «spot» or «reserved» instances
• Reliable by retrying failed tasks and automatically replacing
poorly performing or failed instances
• Elastic by resizing cluster on the fly
• Integrated with
27
 S3
 DynamoDB
 Data Pipeline
 Redshift
 RDS
 Glacier
 VPC
 …
EMR – CLUSTER COMPONENTS
• Master Node
 Tracks, monitors and manages the cluster
 Runs the «YARN Resource Manager» and
the «HDFS Name Node Service»
• Core Node [Slave Node]
 Runs tasks and stores data in the HDFS
 Runs the «YARN Node Manager Service»
and «HDFS Data Node Daemon» service
• Task Node [Slave Node]
• Only runs the tasks
• Runs the «YARN Node Manager Service»
28
EMR – BEST PRACTICES FOR INSTANCE GROUPS
29
Project                 | Master Instance Group | Core Instance Group | Task Instance Group
Long-running clusters   | On-Demand             | On-Demand           | Spot
Cost-driven workloads   | Spot                  | Spot                | Spot
Data-critical workloads | On-Demand             | On-Demand           | Spot
Application testing     | Spot                  | Spot                | Spot
EMR – CLUSTER LIFECYCLE
30
EMR - STORAGE
• The storage layer includes the different file systems for
different purposes as follows:
 HDFS (Hadoop Distributed File System)
 EMRFS (Elastic MapReduce File System)
 Local Storage
31
EMR - HDFS
• Distributed, scalable file system for Hadoop
• Stores multiple copies to prevent data loss on failure
• It is ephemeral storage because it is reclaimed when the cluster terminates
• Used by the master and core nodes
• Useful for intermediate results during Map-Reduce processing
or for workloads which have significant random I/O
• Prefix: «hdfs://»
32
EMR - EMRFS
• An implementation of HDFS which allows clusters to store data
on «AWS S3»
• Eventually consistent by default on top of «AWS S3»
• Consistent view can be enabled with the following supports:
 Read-after-write consistency
 Delete consistency
 List consistency
• Client/Server side encryption
• Most often, «EMRFS» (S3) is used to store input and output
data and intermediate results are stored in «HDFS»
• Prefix: «s3://»
33
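A minimal sketch of how the prefixes choose the file system in a MapReduce job driver: durable input/output on «EMRFS» (S3), ephemeral intermediate data on «HDFS». The bucket and paths are hypothetical:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class EmrPathsExample {

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        job.setJobName("tweet-analysis");
        // Mapper/reducer classes omitted; this only illustrates path selection

        // EMRFS: durable input and output that outlive the cluster
        FileInputFormat.addInputPath(job, new Path("s3://my-bucket/tweets/input/"));
        FileOutputFormat.setOutputPath(job, new Path("s3://my-bucket/tweets/output/"));

        // HDFS would be addressed the same way for intermediate data,
        // e.g. new Path("hdfs:///tmp/intermediate/")

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```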
EMR – LOCAL STORAGE
• Refers to an ephemeral volume directly attached to an EC2
instance
• Ideal for storing temporary data that is continually changing,
such as
 Buffers
 Caches
 Scratch data
 And other temporary content
• Prefix: «file://»
34
EMR – DATA PROCESSING FRAMEWORKS
35
• Hadoop
• Spark
• Hive
• Pig
• HBase
• HCatalog
• Zookeeper
• Presto
• Impala
• Sqoop
• Oozie
• Mahout
• Phoenix
• Tez
• Ganglia
• Hue
• Zeppelin
• …
KINESIS
• Collects and processes large streams of data records in real time
• Reliable by synchronously replicating streaming data across
facilities in an AWS Region
• Elastic by increasing shards based on the volume of input data
• Has
 S3
 DynamoDB
 ElasticSearch
connectors
• Also integrated with «Storm» and «Spark»
36
KINESIS – HIGH LEVEL ARCHITECTURE
37
KINESIS – KEY CONCEPTS[1]
• Streams
 Ordered sequence of data records
 All data is stored for 24 hours (can be increased to 7 days)
• Shards
 It is a uniquely identified group of data records in a stream
 Kinesis streams are scaled by adding or removing shards
 Provides 1MB/sec data input and 2MB/sec data output capacity
 Supports up to 1000 TPS (put records per second)
• Producers
 Put records into streams
• Consumers
 Get records from streams and process them
38
KINESIS – KEY CONCEPTS[2]
• Partition Key
 Used to group data by shard within a stream
 Used to determine which shard a given data record belongs to
• Sequence Number
 Each record has a unique sequence number within its shard
 Increases over time
 Specific to a shard within a stream, not across all shards
• Record
 Unit of data stored
 Composed of <partition-key, sequence-number, data-blob>
 Data size can be up to 1 MB
 Immutable
39
KINESIS – GETTING DATA IN
• Producers use «PUT» call to store data in
a stream
• A partition key is used to distribute the
«PUT»s across shards
• A unique sequence # is returned to the
producer for each event
• Data can be ingested at 1MB/second or
1000 Transactions/second per shard
40
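A minimal producer sketch with the AWS SDK for Java; the stream name, partition key and payload are hypothetical:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

import com.amazonaws.services.kinesis.AmazonKinesis;
import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder;
import com.amazonaws.services.kinesis.model.PutRecordRequest;
import com.amazonaws.services.kinesis.model.PutRecordResult;

public class KinesisProducerExample {

    public static void main(String[] args) {
        AmazonKinesis kinesis = AmazonKinesisClientBuilder.defaultClient();

        PutRecordRequest request = new PutRecordRequest()
                .withStreamName("tweet-stream")  // hypothetical stream
                .withPartitionKey("user-42")     // routes the record to a shard
                .withData(ByteBuffer.wrap(
                        "{\"text\":\"hello\"}".getBytes(StandardCharsets.UTF_8)));

        // The result carries the shard id and the record's unique sequence number
        PutRecordResult result = kinesis.putRecord(request);
        System.out.println(result.getShardId() + " / " + result.getSequenceNumber());
    }
}
```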
KINESIS – GETTING DATA OUT
• Kinesis Client Library (KCL) simplifies
reading from the stream by abstracting
you from individual shards
• Automatically starts a worker thread for
each shard
• Increases and decreases thread count as
number of shards changes
• Uses checkpoints to keep track of a
thread’s location in the stream
• Restarts threads & workers if they fail
41
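A minimal record-processor sketch against the KCL 1.x interface (assuming the «amazon-kinesis-client» 1.x library; package names follow its layout). The KCL runs one such processor per shard and calls back with batches of records:

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

import com.amazonaws.services.kinesis.clientlibrary.interfaces.IRecordProcessor;
import com.amazonaws.services.kinesis.clientlibrary.interfaces.IRecordProcessorCheckpointer;
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShutdownReason;
import com.amazonaws.services.kinesis.model.Record;

public class TweetRecordProcessor implements IRecordProcessor {

    @Override
    public void initialize(String shardId) {
        // Called once when the KCL assigns this worker a shard
        System.out.println("Processing shard: " + shardId);
    }

    @Override
    public void processRecords(List<Record> records, IRecordProcessorCheckpointer checkpointer) {
        for (Record record : records) {
            String data = StandardCharsets.UTF_8.decode(record.getData()).toString();
            System.out.println(record.getSequenceNumber() + ": " + data);
        }
        try {
            // Checkpoint so a restarted worker resumes here instead of re-reading
            checkpointer.checkpoint();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    @Override
    public void shutdown(IRecordProcessorCheckpointer checkpointer, ShutdownReason reason) {
        // Called when the shard is closed or the lease moves to another worker
    }
}
```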
LAMBDA
• Compute service that runs your uploaded code on your behalf using AWS infrastructure
• Stateless
• Auto-scalable
• Pay per use
• Use cases
 As an event-driven compute service triggered by actions
on «S3», «Kinesis», «DynamoDB», «SNS», «CloudWatch»
 As a compute service to run your code in response to HTTP
requests using «AWS API Gateway»
• Supports «Java», «Python» and «Node.js» languages
42
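A minimal sketch of an S3-triggered handler in Java (assuming the «aws-lambda-java-core» and «aws-lambda-java-events» libraries); it only logs the bucket/key of each new object:

```java
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.S3Event;
import com.amazonaws.services.s3.event.S3EventNotification.S3EventNotificationRecord;

public class S3TriggeredHandler implements RequestHandler<S3Event, String> {

    @Override
    public String handleRequest(S3Event event, Context context) {
        // One invocation may carry several S3 notification records
        for (S3EventNotificationRecord record : event.getRecords()) {
            String bucket = record.getS3().getBucket().getName();
            String key = record.getS3().getObject().getKey();
            context.getLogger().log("New object: s3://" + bucket + "/" + key);
        }
        return "OK";
    }
}
```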
MACHINE LEARNING
• Machine learning service for building ML models and
generating predictions
• Provides implementations of common ML data transformations
• Supports industry-standard ML problem types such as
 binary classification
 multi-class classification
 regression
• Supports batch and real-time predictions
• Integrated with
• S3
• RDS
• Redshift
43
DATA FLOW ON AWS
44
DATA FLOW ON AWS
• Allows transferring data between data
storage/processing points
• Data flow services on AWS:
 Firehose
 Data Pipeline
 DMS (Database Migration Service)
 Snowball
45
FIREHOSE
• Makes it easy to capture and load massive volumes of streaming data into AWS
• Supports endpoints
 S3
 Redshift
 Elasticsearch
• Supports buffering
 By size: Ranges from 1 MB to 128 MB. Default 5 MB
 By interval: Ranges from 60 seconds to 900 seconds.
Default 5 minutes
• Supports encryption and compression
46
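A minimal producer sketch with the AWS SDK for Java; the delivery stream name and payload are hypothetical. Note that Firehose buffers by the size/interval settings above before delivering to the endpoint:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

import com.amazonaws.services.kinesisfirehose.AmazonKinesisFirehose;
import com.amazonaws.services.kinesisfirehose.AmazonKinesisFirehoseClientBuilder;
import com.amazonaws.services.kinesisfirehose.model.PutRecordRequest;
import com.amazonaws.services.kinesisfirehose.model.Record;

public class FirehoseProducerExample {

    public static void main(String[] args) {
        AmazonKinesisFirehose firehose = AmazonKinesisFirehoseClientBuilder.defaultClient();

        // Trailing newline keeps records line-delimited in the S3 output files
        Record record = new Record().withData(ByteBuffer.wrap(
                "{\"text\":\"hello\"}\n".getBytes(StandardCharsets.UTF_8)));

        firehose.putRecord(new PutRecordRequest()
                .withDeliveryStreamName("tweet-delivery-stream") // hypothetical name
                .withRecord(record));
    }
}
```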
DATA PIPELINE
• Automates the movement and transformation of data
• Composed of workflows of tasks
• The flow can be scheduled
• Has visual designer with drag-and-drop support
• The definition can be exported/imported
• Supports custom activities and preconditions
• Retries the flow in case of failure and sends notifications
• Integrated with
47
 EMR
 S3
 DynamoDB
 Redshift
 RDS
DATABASE MIGRATION SERVICE
• Helps to migrate databases into AWS easily and securely
• Supports
 Homogeneous migrations (e.g. Oracle => Oracle)
 Heterogeneous migrations (e.g. Oracle => PostgreSQL)
• Also used for:
• Database consolidation
• Continuous data replication
• Keep your applications running during migration
 Start a replication instance
 Connect source and target databases
 Let «DMS» load data and keep them in sync
 Switch applications over to the target at your convenience
48
SNOWBALL
• Import/Export service that accelerates transferring large
amounts of data into and out of AWS using physical storage
appliances, bypassing the Internet
• Data is transported securely in encrypted form
• Data is transferred via a 10G connection
• Has 50TB or 80TB storage options
• Steps:
49
 Job created
 Processing
 In transit to you
 Delivered to you
 In transit to AWS
 At AWS
 Importing
 Completed
DEMO TIME
50
Source code and instructions are available at github.com/serkan-ozal/ankaracloudmeetup-bigdata-demo
51
STEPS OF DEMO
• Storing
• Searching
• Batch Processing
• Stream Processing
• Analyzing
52
STORING
53
[STORING]
CREATING DELIVERY STREAM
• Create an «AWS S3» bucket to store tweets
• Create an «AWS Firehose» stream into which tweets are pushed to be stored
• Attach created bucket as endpoint to the stream
Then;
• Captured tweet data can be pushed into this stream to be stored
• Instructions:
github.com/serkan-ozal/ankaracloudmeetup-bigdata-demo/tree/master/bigdata-demo-storing-firehose
54
[STORING]
CRAWLING TWEETS
• Configure stream name to be used by crawler application
• Build «bigdata-demo-storing-crawler» application
• Deploy into «AWS Elasticbeanstalk»
Then;
• Application listens to tweets via the «Twitter Streaming API»
• Application pushes tweet data into the «AWS Firehose» stream
• «AWS Firehose» stream stores tweets on «AWS S3» bucket
• Source code and instructions:
github.com/serkan-ozal/ankaracloudmeetup-bigdata-demo/tree/master/bigdata-demo-storing-crawler
55
SEARCHING
56
[SEARCHING]
CREATING SEARCH DOMAIN
• Create an «AWS Elasticsearch» domain
• Wait until it becomes active and ready to use
Then
• Tweet data can also be indexed to be searched
• Instructions:
github.com/serkan-ozal/ankaracloudmeetup-bigdata-demo/tree/master/bigdata-demo-searching-elasticsearch
57
[SEARCHING]
INDEXING INTO SEARCH DOMAIN
• Create an «AWS Lambda» function with the given code
• Configure created «AWS S3» bucket as trigger
• Wait until it becomes active and ready to use
Then;
• Stored tweet data will also be indexed to be searched
• Source code and instructions:
github.com/serkan-ozal/ankaracloudmeetup-bigdata-demo/tree/master/bigdata-demo-searching-lambda
58
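A minimal sketch of what the indexing step boils down to: an HTTP «PUT» of the tweet JSON against the «Elasticsearch» domain endpoint. The endpoint, index and type names are hypothetical, and the domain is assumed to allow unsigned requests from the caller (a production setup would sign them):

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class ElasticsearchIndexer {

    // Hypothetical AWS Elasticsearch domain endpoint
    private static final String ENDPOINT =
            "https://search-tweets-abc123.us-east-1.es.amazonaws.com";

    public static void index(String id, String json) throws Exception {
        // PUT /<index>/<type>/<id> indexes (or replaces) a single document
        URL url = new URL(ENDPOINT + "/tweets/tweet/" + id);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("PUT");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(json.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("Index response: " + conn.getResponseCode());
    }
}
```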
BATCH PROCESSING
59
[BATCH PROCESSING]
CREATING EMR CLUSTER
• Create an «AWS EMR» cluster
• Wait until it becomes active and ready to use
Then;
• Stored tweet data can be processed as batch
• Instructions:
github.com/serkan-ozal/ankaracloudmeetup-bigdata-demo/tree/master/bigdata-demo-batchprocessing-emr
60
[BATCH PROCESSING]
CREATING TABLE TO STORE BATCH RESULT
• Create an «AWS DynamoDB» table to save batch results
• Wait until it becomes active and ready to use
Then;
• Analysis result data can be saved into this table
• Instructions:
github.com/serkan-ozal/ankaracloudmeetup-bigdata-demo/tree/master/bigdata-demo-batchprocessing-dynamodb
61
[BATCH PROCESSING]
PROCESSING TWEETS VIA HADOOP
• Build & upload «bigdata-demo-batchprocessing-hadoop» to «AWS S3»
• Run uploaded jar as «Custom JAR» step on «AWS EMR» cluster
• Wait until step has completed
Then;
• Processed tweet data is in the «AWS S3» bucket
• Analysis results are in the «AWS DynamoDB» table
• Instructions:
github.com/serkan-ozal/ankaracloudmeetup-bigdata-demo/tree/master/bigdata-demo-batchprocessing-hadoop
62
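A minimal sketch of submitting such a step programmatically with the AWS SDK for Java (the console works just as well); the cluster id, JAR path and arguments are hypothetical:

```java
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
import com.amazonaws.services.elasticmapreduce.model.AddJobFlowStepsRequest;
import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
import com.amazonaws.services.elasticmapreduce.model.StepConfig;

public class EmrCustomJarStep {

    public static void main(String[] args) {
        AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.defaultClient();

        // The job JAR has been built and uploaded to S3 beforehand
        HadoopJarStepConfig jarStep = new HadoopJarStepConfig()
                .withJar("s3://my-bucket/jobs/batchprocessing-hadoop.jar")
                .withArgs("s3://my-bucket/tweets/input/", "s3://my-bucket/tweets/output/");

        emr.addJobFlowSteps(new AddJobFlowStepsRequest()
                .withJobFlowId("j-1ABCDEFGHIJKL") // hypothetical cluster id
                .withSteps(new StepConfig()
                        .withName("Process tweets")
                        .withActionOnFailure("CONTINUE")
                        .withHadoopJarStep(jarStep)));
    }
}
```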
[BATCH PROCESSING]
QUERYING TWEETS VIA HIVE
• Upload Hive query file to «AWS S3»
• Run uploaded query as «Hive» step on «AWS EMR» cluster
• Wait until step has completed
Then;
• Query results are dumped to «AWS S3»
• Instructions:
github.com/serkan-ozal/ankaracloudmeetup-bigdata-demo/tree/master/bigdata-demo-batchprocessing-hive
63
[BATCH PROCESSING]
SCHEDULING WITH DATA PIPELINE
• Import «AWS Data Pipeline» definition
• Activate pipeline
Then;
• Every day, all batch processing jobs (Hadoop + Hive) will be executed automatically
• Instructions:
github.com/serkan-ozal/ankaracloudmeetup-bigdata-demo/tree/master/bigdata-demo-batchprocessing-datapipeline
64
STREAM PROCESSING
65
[STREAM PROCESSING]
CREATING REALTIME STREAM
• Create an «AWS Kinesis» stream to push tweets to be analyzed
• Wait until it becomes active and ready to use
Then;
• Captured tweet data can be pushed into this stream to be analyzed
• Instructions:
github.com/serkan-ozal/ankaracloudmeetup-bigdata-demo/tree/master/bigdata-demo-streamprocessing-kinesis
66
[STREAM PROCESSING]
PRODUCING STREAM DATA
• Configure stream name to be used by producer application
• Build «bigdata-demo-streamprocessing-kinesis-producer»
• Deploy into «AWS Elasticbeanstalk»
Then;
• Application listens to tweets via the «Twitter Streaming API»
• Captured tweets are pushed into the «AWS Kinesis» stream
• Source code and instructions:
github.com/serkan-ozal/ankaracloudmeetup-bigdata-demo/tree/master/bigdata-demo-streamprocessing-kinesis-producer
67
[STREAM PROCESSING]
CREATING TABLE TO SAVE STREAM RESULT
• Create an «AWS DynamoDB» table to save realtime results
• Wait until it becomes active and ready to use
Then;
• Analysis results can be saved into this table
• Instructions:
github.com/serkan-ozal/ankaracloudmeetup-bigdata-demo/tree/master/bigdata-demo-streamprocessing-dynamodb
68
[STREAM PROCESSING]
CONSUMING STREAM DATA
• Configure
• «AWS Kinesis» stream name for consuming tweets from
• «AWS DynamoDB» table name for saving results into
• Build «bigdata-demo-streamprocessing-consumer» application
• Deploy into «AWS Elasticbeanstalk»
Then;
• Application consumes tweets from «AWS Kinesis» stream
• Sentiment analysis is applied to tweets via «Stanford NLP»
• Analysis results are saved into the «AWS DynamoDB» table in realtime
• Source code and instructions:
github.com/serkan-ozal/ankaracloudmeetup-bigdata-demo/tree/master/bigdata-demo-streamprocessing-kinesis-consumer
69
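A minimal sketch of how sentiment classification with «Stanford CoreNLP» typically looks (not necessarily the demo's exact code; class names follow CoreNLP 3.5+):

```java
import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.neural.rnn.RNNCoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.sentiment.SentimentCoreAnnotations;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.util.CoreMap;

public class SentimentExample {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, parse, sentiment");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation annotation = pipeline.process("I love this meetup!");
        for (CoreMap sentence : annotation.get(CoreAnnotations.SentencesAnnotation.class)) {
            Tree tree = sentence.get(SentimentCoreAnnotations.SentimentAnnotatedTree.class);
            // Predicted class: 0 = very negative ... 4 = very positive
            int sentiment = RNNCoreAnnotations.getPredictedClass(tree);
            System.out.println(sentiment + ": " + sentence);
        }
    }
}
```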
ANALYZING
70
[ANALYZING]
ANALYTICS VIA QUICKSIGHT
• Create a data source from «AWS S3»
• Create an analysis based on the data source from «AWS S3»
• Wait until data import and analysis have completed
Then;
• You can run your
 Queries
 Filters
 Visualizations
 Stories (flows of visualisations, like a presentation)
on your dataset and share them
71
THE BIG PICTURE
72
73
END OF DEMO
74
THANKS!!!
75