Log analytics is a common big data use case that allows you to analyze log data from websites, mobile devices, servers, sensors, and more for a wide variety of applications including digital marketing, application monitoring, fraud detection, ad tech, gaming, and IoT. In this tech talk, we will walk you step-by-step through the process of building an end-to-end analytics solution that ingests, transforms, and loads streaming data using Amazon Kinesis Firehose, Amazon Kinesis Analytics and AWS Lambda. The processed data will be saved to an Amazon Elasticsearch Service cluster, and we will use Kibana to visualize the data in near real-time.
Learning Objectives:
1. Reference architecture for building a complete log analytics solution
2. Overview of the services used and how they fit together
3. Best practices for log analytics implementation
4. Log analytics architecture
Data source -> Amazon Kinesis Firehose -> Amazon Elasticsearch Service -> Kibana
5. Amazon Elasticsearch Service is a cost-effective managed service that makes it easy to deploy, manage, and scale open source Elasticsearch for log analytics, full-text search, and more.
6. Amazon Elasticsearch Service benefits
Easy to use
Open-source compatible
Secure
Highly available
AWS integrated
Scalable
7. Adobe Developer Platform (Adobe I/O)
PROBLEM
• Cost-effective monitoring for an XL volume of log data
• Over 200,000 API calls per second at peak: destinations, response times, bandwidth
• Integrate seamlessly with other components of the AWS ecosystem
SOLUTION
• Log data is routed with Amazon Kinesis to Amazon Elasticsearch Service, then displayed using Amazon ES Kibana
• The Adobe team can easily see traffic patterns and error rates, quickly identifying anomalies and potential challenges
BENEFITS
• Management and operational simplicity
• Flexibility to try out different cluster configurations during dev and test
[Diagram: data sources, Amazon Kinesis Streams, Spark Streaming, Amazon Elasticsearch Service]
8. McGraw Hill Education
PROBLEM
• Supporting a wide catalog across multiple services in multiple jurisdictions
• Over 100 million learning events each month
• Tests, quizzes, learning modules begun / completed / abandoned
SOLUTION
• Search and analyze test results, student/teacher interaction, teacher effectiveness, student progress
• Analytics of applications and infrastructure are now integrated to understand operations in real time
BENEFITS
• Confidence to scale throughout the school year: from 0 to 32TB in 9 months
• Focus on their business, not their infrastructure
10. Amazon ES overview
[Diagram: Amazon Route 53, Elastic Load Balancing, IAM, CloudWatch, Elasticsearch API, CloudTrail]
12. Data pattern
One index per day in the Amazon ES cluster: logs_01.21.2017, logs_01.22.2017, logs_01.23.2017, logs_01.24.2017, logs_01.25.2017, logs_01.26.2017, logs_01.27.2017
Each index has multiple shards (Shard 1, Shard 2, Shard 3)
Each shard contains a set of documents
Each document contains a set of fields and values (host, ident, auth, timestamp, etc.)
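The one-index-per-day pattern can be sketched as a small naming helper. This is only an illustration: the function names and the `logs` prefix parameter are assumptions, and the date format is inferred from the index names shown on the slide.

```python
from datetime import date, timedelta
from typing import List


def daily_index_name(day: date, prefix: str = "logs") -> str:
    """Build an index name such as logs_01.21.2017 (month.day.year)."""
    return f"{prefix}_{day:%m.%d.%Y}"


def index_names(start: date, days: int, prefix: str = "logs") -> List[str]:
    """Return one index name per day, starting at `start`."""
    return [daily_index_name(start + timedelta(days=i), prefix) for i in range(days)]


# The seven daily indices listed on the slide.
week = index_names(date(2017, 1, 21), 7)
```

Writing each day's logs to its own index keeps old data easy to drop: removing a day means deleting one index rather than deleting documents.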
13. Deployment of indices to a cluster
• Index 1
– Shard 1
– Shard 2
– Shard 3
• Index 2
– Shard 1
– Shard 2
– Shard 3
[Diagram: Amazon ES cluster with the primary and replica copies of shards 1, 2, and 3 distributed across Instance 1 (master), Instance 2, and Instance 3]
15. How many instances?
The index size will be about the same as the corpus of source documents
• Double this if you are deploying an index replica
Size based on storage requirements
• Either local storage or up to 1.5TB of EBS per instance
• Example: 2TB corpus will need 4 instances
– Assuming a replica and using EBS
– Or with i2.2xlarge nodes (1.6TB ephemeral storage)
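The sizing rule above can be written as a rough calculation. This is a minimal sketch, assuming the index is the same size as the corpus and that each replica doubles it again; the function name and parameters are hypothetical. Note that bare division gives 3 nodes for the 2TB example, while the slide quotes 4, possibly to leave headroom that this sketch does not model.

```python
import math


def data_nodes_needed(corpus_tb: float,
                      replicas: int = 1,
                      storage_per_node_tb: float = 1.5) -> int:
    """Estimate data nodes as storage needed / storage per node, rounded up.

    Assumes index size is roughly equal to corpus size, multiplied by
    (1 + number of replicas). No allowance is made for free-space headroom.
    """
    index_tb = corpus_tb * (1 + replicas)
    return math.ceil(index_tb / storage_per_node_tb)


# 2TB corpus with one replica on 1.5TB EBS volumes.
ebs_nodes = data_nodes_needed(2.0)
# Same corpus on i2.2xlarge nodes with 1.6TB of ephemeral storage.
i2_nodes = data_nodes_needed(2.0, storage_per_node_tb=1.6)
```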
17. Instance type recommendations
Instance    Workload
T2          Entry point. Dev and test.
M3, M4      Equal read and write volumes.
R3, R4      Read-heavy or workloads with high memory demands (e.g., aggregations).
C4          High concurrency/indexing workloads.
I2          Up to 1.6 TB of SSD instance storage.
19. Cluster with no dedicated masters
[Diagram: Amazon ES cluster with shards distributed across Instance 1 (which also acts as master), Instance 2, and Instance 3]
20. Cluster with dedicated masters
[Diagram: Amazon ES cluster with shards distributed across Instance 1, Instance 2, and Instance 3]
Dedicated master nodes
Data nodes: queries and updates
23. Cluster with zone awareness
[Diagram: Amazon ES cluster with shards distributed across Instances 1 to 4, split between Availability Zone 1 and Availability Zone 2]
24. Small use cases
• Logstash co-located on the application instance
• SigV4 signing via provided output plugin
• Up to 200GB of data
• m3.medium + 100GB EBS data nodes
• 3x m3.medium master nodes
[Diagram: application instance]
25. Large use cases
[Diagram: Amazon DynamoDB, AWS Lambda, Amazon S3 bucket, Amazon CloudWatch]
• Data flows from instances and applications via Lambda; CloudWatch Logs (CWL) is implicit
• SigV4 signing via Lambda/roles
• Up to 5TB of data
• r3.2xlarge + 512GB EBS data nodes
• 3x m3.medium master nodes
26. XL use cases
[Diagram: Amazon Kinesis, Amazon EMR]
• Ingest supported through high-volume technologies like Spark or Kinesis
• Up to 60TB of data
• r3.8xlarge + 640GB data nodes
• 3x m3.xlarge master nodes
27. Best practices
Data nodes = Storage needed/Storage per node
Use GP2 EBS volumes
Use 3 dedicated master nodes for production deployments
Enable Zone Awareness
Set indices.fielddata.cache.size = 40 (percent of JVM heap)
29. Amazon Kinesis: Streaming Data Made Easy
Services make it easy to capture, deliver, and process streams on AWS
• Amazon Kinesis Streams
• Amazon Kinesis Analytics
• Amazon Kinesis Firehose
30. Amazon Kinesis Streams
• Easy administration
• Build real-time applications with the framework of your choice
• Low cost
31. Amazon Kinesis Firehose
• Zero administration
• Direct-to-data store integration
• Seamless elasticity
32. Amazon Kinesis Analytics
• Interact with streaming data in real time using SQL
• Build fully managed and elastic stream processing applications that process data for real-time visualizations and alarms
33. Amazon Kinesis - Firehose vs. Streams
Amazon Kinesis Streams is for use cases that require custom processing, per incoming record, with sub-1 second processing latency, and a choice of stream processing frameworks.
Amazon Kinesis Firehose is for use cases that require zero administration, ability to use existing analytics tools based on Amazon S3, Amazon Redshift, and Amazon Elasticsearch Service, and a data latency of 60 seconds or higher.
34. Kinesis Firehose overview
Delivery Stream: Underlying AWS resource
Destination: Amazon ES, Amazon Redshift, or Amazon S3
Record: Put records in streams to deliver to destinations
35. Kinesis Firehose Data Transformation
• Firehose buffers up to 3MB of ingested data
• When the buffer is full, Firehose automatically invokes the Lambda function, passing an array of records to be processed
• The Lambda function processes and returns an array of transformed records, with the status of each record
• Transformed records are saved to the configured destination
Records passed to the Lambda function:
[
  {
    "recordId": "1234",
    "data": "encoded-data"
  },
  {
    "recordId": "1235",
    "data": "encoded-data"
  }
]
Records returned by the Lambda function:
[
  {
    "recordId": "1234",
    "result": "Ok",
    "data": "encoded-data"
  },
  {
    "recordId": "1235",
    "result": "Dropped",
    "data": "encoded-data"
  }
]
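The transformation contract above can be sketched as a Lambda handler in Python. This is a minimal sketch, not the function used in the talk: it assumes the incoming records are base64-encoded Apache common log lines, and the regular expression and the output field names (`request`, `response`, `bytes`) are illustrative assumptions. Each returned record carries the original `recordId`, a `result` of `Ok`, `Dropped`, or `ProcessingFailed`, and base64-encoded `data`.

```python
import base64
import json
import re

# Assumed input format: Apache common log lines, e.g.
# host ident auth [timestamp] "request" status bytes
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<auth>\S+) \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<response>\d{3}) (?P<bytes>\S+)$'
)


def lambda_handler(event, context):
    """Turn each log line into a JSON document for the destination."""
    output = []
    for record in event["records"]:
        line = base64.b64decode(record["data"]).decode("utf-8").strip()
        if not line:
            # Blank lines carry no information: drop them on purpose.
            result, data = "Dropped", record["data"]
        else:
            match = LOG_PATTERN.match(line)
            if match is None:
                # Unparseable line: report failure, return the data unchanged.
                result, data = "ProcessingFailed", record["data"]
            else:
                doc = json.dumps(match.groupdict())
                result = "Ok"
                data = base64.b64encode(doc.encode("utf-8")).decode("utf-8")
        output.append(
            {"recordId": record["recordId"], "result": result, "data": data}
        )
    return {"records": output}
```

Every incoming `recordId` must appear exactly once in the response, which is why dropped and failed records are still returned rather than omitted.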
36. Kinesis Firehose delivery architecture with transformations
[Diagram: the data source sends source records to the Firehose delivery stream, which invokes the data transformation function and delivers transformed records to Amazon Elasticsearch Service; source records, transformation failures, and delivery failures go to an S3 bucket]
38. Best practices
Use smaller buffer sizes to increase throughput, but be careful of concurrency
Use index rotation based on sizing
Default stream limits: 2,000 transactions/second, 5,000 records/second, and 5 MB/second
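As a rough planning aid, the default stream limits quoted above can be checked against an expected peak load. This is only a sketch: the class and function names are hypothetical, and the numbers are the defaults stated on this slide, not values read from any AWS API.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class StreamLimits:
    """Default per-stream limits as quoted on the slide."""
    transactions_per_sec: float = 2000
    records_per_sec: float = 5000
    mb_per_sec: float = 5.0


def exceeded_limits(transactions_per_sec: float,
                    records_per_sec: float,
                    mb_per_sec: float,
                    limits: StreamLimits = StreamLimits()) -> List[str]:
    """Return the names of the limits that the expected load exceeds."""
    exceeded = []
    if transactions_per_sec > limits.transactions_per_sec:
        exceeded.append("transactions/second")
    if records_per_sec > limits.records_per_sec:
        exceeded.append("records/second")
    if mb_per_sec > limits.mb_per_sec:
        exceeded.append("MB/second")
    return exceeded


# Example load: 1,000 calls/second carrying 6,000 records and 4 MB in total.
over = exceeded_limits(1000, 6000, 4.0)
```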
40. Amazon ES aggregations
Buckets – a collection of documents meeting some criterion
Metrics – calculations on the content of buckets
Bucket: time
Metric: count
41. host:199.72.81.55 with <histogram of verb>
Look up 199.72.81.55: matching documents 1, 4, 8, 12, 30, 42, 58, 100, ...
Field data: GET, GET, POST, GET, PUT, GET, GET, POST
Buckets and counts: GET 5, POST 2, PUT 1
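The bucketing step on this slide can be imitated in plain Python to show what the aggregation computes. This is a sketch that uses the verb values listed on the slide; the request body at the end shows how a comparable query might look in the Elasticsearch query DSL, with the field names `host` and `verb` assumed from the slide title rather than taken from a real mapping.

```python
from collections import Counter

# Field data for the documents that match host:199.72.81.55 (from the slide).
verbs = ["GET", "GET", "POST", "GET", "PUT", "GET", "GET", "POST"]

# Bucket by verb, then apply the count metric to each bucket.
buckets = Counter(verbs)
ranked = buckets.most_common()

# A comparable search request body: filter on the host, bucket on the verb.
# Field names are assumptions for illustration.
query_body = {
    "size": 0,
    "query": {"term": {"host": "199.72.81.55"}},
    "aggs": {"verbs": {"terms": {"field": "verb"}}},
}
```

Here `ranked` holds the same bucket counts as the slide: five GET, two POST, and one PUT.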
42. A more complicated aggregation
Bucket: ARN
Bucket: Region
Bucket: eventName
Metric: Count
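Nested buckets like these can be expressed as sub-aggregations in a search request body. The helper below is a hypothetical sketch that nests one `terms` aggregation per field; the field names `arn`, `region`, and `eventName` are placeholders taken from the slide labels and may differ from the names in a real index. The count metric needs no extra clause because each `terms` bucket already reports a document count.

```python
from typing import Dict, List


def nested_terms_aggs(fields: List[str], size: int = 10) -> Dict:
    """Nest one terms aggregation per field, outermost field first."""
    aggs: Dict = {}
    # Build from the innermost bucket outward.
    for field in reversed(fields):
        node = {"terms": {"field": field, "size": size}}
        if aggs:
            node["aggs"] = aggs
        aggs = {f"by_{field}": node}
    return aggs


# Bucket by ARN, then region, then event name.
request_body = {
    "size": 0,  # return only aggregation results, not the matching documents
    "aggs": nested_terms_aggs(["arn", "region", "eventName"]),
}
```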
43. Best practices
Make sure that your fields are not_analyzed
Visualizations are based on buckets/metrics
Use a histogram on the x-axis first, then sub-aggregate
44. Run Elasticsearch in the AWS cloud with Amazon Elasticsearch Service
Use Kinesis Firehose to ingest data simply
Kibana for monitoring, Elasticsearch queries for deeper analysis
45. What to do next
Qwiklab:
https://qwiklabs.com/searches/lab?keywords=introduction%20to%20amazon%20elasticsearch%20service
Centralized logging solution:
https://aws.amazon.com/answers/logging/centralized-logging/
Our overview page on AWS:
https://aws.amazon.com/elasticsearch-service/