Businesses are generating more data than ever before.
Doing real-time data analytics requires IT infrastructure that often needs to be scaled up quickly, and running an on-premises environment in this setting has its limitations.
Organisations often require a massive amount of IT resources to analyse their data and the upfront capital cost can deter them from embarking on these projects.
What’s needed is scalable, agile and secure cloud-based infrastructure at the lowest possible cost so they can spin up servers that support their data analysis projects exactly when they are required. This infrastructure must enable them to create proof-of-concepts quickly and cheaply – to fail fast and move on.
2. What is Big Data?
Volume: quantum of data (TB to PB of data)
Velocity: speed of data (millisecond latency)
Variety: types of data (hundreds of data sources)
Veracity: quality of data (varies greatly; affects accuracy of analysis)
Value: business relevance (how does it help the business?)
25+ TB of data being generated per second globally
90+% of the world's data created in the last 2 years
90+% of data generated is unstructured
3. Evolution of Big Data Processing
The chart plots speed of analysis (batch to real-time) against the type of analytics it enables:
Descriptive (batch): what & why it happened. Dashboards; traditional query & reporting
Descriptive (real-time): it is happening! Alerts, analysis & detection; what is going wrong, fraudulent use
Predictive: probability of 'x' happening. Prediction engines; inventory forecasting, cross-sell analysis
Prescriptive: what to do if 'x' happens. Recommendation engines; routes, content recommendations
4. Big Data was built for the Cloud
Big Data: potentially massive datasets. AWS Cloud: massive, virtually unlimited capacity.
Big Data: iterative, experimental style of data manipulation & analysis. AWS Cloud: on-demand infrastructure allows iterative, experimental deployment/usage.
Big Data: frequently not a steady-state workload; peaks & valleys. AWS Cloud: most efficient with highly variable workloads.
Big Data: variety & velocity of data make tool management complex. AWS Cloud: fully managed tools & services for structured & unstructured, batch & stream data.
5. Broad, Tightly Integrated Capabilities
AWS provides the broadest platform for big data analytics today. A typical pipeline runs Data → Ingest/Collect → Store → Process & Analyze → Consume/Visualize → Answers & Insights. Start here with a business case, and measure the pipeline on time to answer (latency), throughput and cost.
Ingest/Collect: Amazon Kinesis Firehose (real-time); AWS Import/Export Snowball (data import); AWS Direct Connect (data connect); AWS Storage Gateway (storage gateway); AWS Database Migration Service (database migration)
Store: Amazon S3 (object storage); Amazon Kinesis Streams (real-time); Amazon RDS (relational databases); Amazon DynamoDB (NoSQL databases)
Process & Analyze: Amazon EMR (distributed: Hadoop, Spark, etc.); AWS Lambda & Amazon Kinesis Analytics (real-time); Amazon Redshift (data warehousing); Amazon Machine Learning; Amazon Elasticsearch Service
Consume/Visualize: Amazon QuickSight (BI & data visualization)
6. Amazon Redshift
Fast, fully managed, petabyte-scale data warehouse
• 10X better performance than traditional DBs
• Less than one tenth the cost of traditional solutions
• Simple and fully managed
• Flexible & Scalable: Easily change number or type of nodes
• ANSI SQL Compatible: Use familiar SQL clients/BI tools
• Secure: Encryption, network isolation, audit & compliance
• Ideal usage patterns: sales, historical, gaming, finance, marketing, ad, social data
[Architecture diagram: SQL clients/BI tools connect over JDBC/ODBC to a leader node, which coordinates compute nodes (each 128 GB RAM, 16 TB disk, 16 cores) over 10 GigE (HPC) networking; ingestion, backup and restore run against Amazon S3.]
7. Amazon EMR
Quickly and cost-effectively process vast amounts of data
• Largest cloud operator of Hadoop infrastructure
• Open source & MapR distributions
• Most current Hadoop distribution
• Flexible: decoupled compute & storage, select apps, resize clusters
• Simple : Launch a cluster in minutes, fully managed
• Scalable : Provision as much capacity as needed
• Multiple pricing options: On-Demand, Reserved Instances, Spot
• Typical use cases – Clickstream analysis, log processing, genomics
8. Amazon Kinesis
Easily work with real-time streaming data
Amazon Kinesis Streams
• Build custom apps to process or analyze streaming data
• Typical use cases – Log & event data collection, real-time analytics
Amazon Kinesis Firehose
• Easily load massive volumes of streaming data into S3, Redshift, AWS ES
• Typical use cases – Digital marketing, IoT, mobile data capture
Amazon Kinesis Analytics
• Easily analyze data streams using standard SQL queries
9. Amazon Elasticsearch
Fully managed service that makes it easy to set up, operate & scale Elasticsearch clusters in the cloud
• Easy set-up & configuration. Fully managed
• Flexible storage options
• Set-up for high availability
• Seamlessly scale
• Direct access to Elasticsearch APIs
• Support for the ELK stack. Built-in Kibana
• Integration with AWS IAM for controlling access to your domain
• Integration with Amazon CloudTrail for auditing
[Diagram: an Amazon Elasticsearch Service domain exposing the Elasticsearch API, fronted by Amazon Route 53 and Elastic Load Balancing and integrated with IAM, CloudWatch and AWS CloudTrail.]
10. Select Big Data & Analytics Customers
The vast majority of Big Data use cases deployed in the cloud today run on AWS
14. Businesses are literally drowning in data
2.5 quintillion bytes of data are created every day1
90% of the data in the world today was created in the last two years2
1.7 megabytes of new information will be created every second for every human on the planet by 20203
<0.5% of all data is currently being analysed and used
1-3. Source: https://www-01.ibm.com/software/data/bigdata/what-is-big-data.html
4. Source: https://www.technologyreview.com/business-report/big-data-gets-personal/download/?state=join#/join/
15. So what's the problem?
Internal systems can't cope: on-premises environments can't scale quickly enough for big data analytics projects to work well.
Cost is prohibitive: the high capital cost of upgrading server infrastructure deters organisations from embarking on projects.
Tools are outdated: data management architectures are complex and traditional data analytics tools are no longer suitable.
16. Why cloud for real-time analytics?
Drives scale: offers agile and secure cloud infrastructure, provided by AWS, at a low cost.
Provides clarity: makes it easy to forecast how much computing power is needed and ensures infrastructure is not under-utilised.
Empowers business: servers can be 'spun up' to support proofs of concept as required, enabling organisations to go to market faster and supporting a 'fail fast' culture.
18. Cloud advisory
Consulting approach to identify the suitability of a move to the cloud – examining current
apps, infrastructure tools, methods and readiness
Migration and deployment
Move web-based and ERP apps – including Oracle and SAP solutions – to the cloud
Cloud consulting services
19. DevOps
Continuous integration, deployment and release
management processes with Puppet Labs, Jenkins,
Capistrano, and ELK Stack
Managed services
Proactive monitoring of AWS infrastructure, SLA-based
resolution, 24x7 support, and account management
Cloud consulting services
20. Big data on cloud
Process data in real time using Amazon Kinesis, Apache Kafka, AWS Lambda and Hadoop
Data warehouse on cloud
Data warehouse design, management and reporting with Amazon Redshift, AWS Quicksight
and Tableau.
Cloud native app and product development
Provide microservices and event-driven architecture with tools like SQS and SNS, as sketched below
Analytics and product development services
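As a minimal sketch of what event-driven messaging with SNS and SQS can look like in practice (the topic and queue names below are illustrative, not from this deck, and a real setup also needs a queue access policy that allows SNS to deliver):

```python
# Hypothetical SNS fan-out to an SQS queue consumed by a worker service.
import json

import boto3

sns = boto3.client("sns", region_name="us-east-1")
sqs = boto3.client("sqs", region_name="us-east-1")

topic_arn = sns.create_topic(Name="order-events")["TopicArn"]
queue_url = sqs.create_queue(QueueName="order-worker")["QueueUrl"]
queue_arn = sqs.get_queue_attributes(
    QueueUrl=queue_url, AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]

# Wire the queue to the topic (queue access policy omitted for brevity).
sns.subscribe(TopicArn=topic_arn, Protocol="sqs", Endpoint=queue_arn)

# A producer publishes once; every subscribed queue gets its own copy.
sns.publish(TopicArn=topic_arn, Message=json.dumps({"order_id": 42, "status": "created"}))
messages = sqs.receive_message(QueueUrl=queue_url, WaitTimeSeconds=5)
print(messages.get("Messages", []))
```

The design point is decoupling: the publisher never knows who consumes, so new services can subscribe without touching the producer.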
21. Cloud stream
Digital content, asset management, publishing workflow
and video on demand
Cloudlytics
Provides log and billing analytics, cloud automation and
monitoring
CloudScale
Load testing and resilience, and testing automation in the
cloud
BlazeNAS
A highly available and fault-tolerant storage solution
Our products and frameworks
22. Our in-house developed big data framework Cloudlytics 2.0 is an analytics engine that addresses
applications from different domains like infrastructure, application monitoring and IoT.
It gives organizations an edge over their competition by providing real-time insights which help reduce the
time to market for products and services.
Big data analytics engine
24. 5Abox is a software company building embedded solutions for the IoT world. It is focused on energy and
domotics gateways, and ‘VPN on request’ solutions.
Case study: 5Abox
Analyzing IoT data in real time
25. Case study: 5Abox
The problem/challenge:
Streaming real-time data
Complex transformation
Visualization
26. Case study: 5Abox
The solution:
[Diagram: weather and voltage fluctuation data flows from the IoT device, over the MQTT protocol, into Cloudlytics 2.0 for real-time transformation and visualization of the data.]
The BlazeClan solution gave the customer real-time insights into weather and voltage fluctuation data.
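As a minimal sketch of the device side of such a pipeline, using the paho-mqtt 1.x client; the broker endpoint, topic and payload fields are illustrative assumptions, since the case study does not specify them:

```python
# Hypothetical IoT device publishing voltage readings over MQTT.
import json
import time

import paho.mqtt.client as mqtt

client = mqtt.Client(client_id="voltage-sensor-01")  # paho-mqtt 1.x constructor
client.connect("ingest.example.com", 1883)           # placeholder ingestion endpoint
client.loop_start()

for _ in range(3):
    reading = {"device": "sensor-01", "voltage": 229.7, "ts": time.time()}
    client.publish("telemetry/voltage", json.dumps(reading), qos=1)
    time.sleep(5)

client.loop_stop()
client.disconnect()
```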
Data in the 21st century is like oil in the 18th century: an immensely valuable yet largely untapped asset. As with oil, there will be huge rewards for those who see data's fundamental value and learn to extract and use it
Big data is typically described in terms of the 3 Vs (the ever-increasing volume, variety & velocity of data), and recently two more have been added (value & veracity):
Value: Refers to the business relevance of the captured data i.e. how does it help the business ?
Veracity: Refers to the quality of captured data as it varies greatly. This is important as it affects the accuracy of the analysis
Variety: Refers to the nature of the captured data. You have a plethora of data sources today and hence a broad variety of data, be it log/streaming/IoT data or transactional data. Then you have, for example, file data with a fixed schema (CSV, Parquet, Avro) and file data which is schema-free (JSON, key-value). Then you have small files and large files, and I could go on
Velocity: Refers to the speed at which the data is generated and processed. Today, for real-time use cases, we are talking about millisecond latency. One million reads and writes per second is becoming the norm, for example, for customers in the digital advertising business
Volume: Refers to the quantity of data being generated and stored. The size of the data determines the value and potential insight, and whether it can actually be considered big data or not. Customers generating 100-150 TB a day is not very uncommon now
25+TB of data being generated per second globally
90+% of world’s data created in last 2 years
90+% of data generated is unstructured and hence needs some work before it can be meaningfully used
Now let's look at how big data processing is evolving
On the x-axis you have the speed of analysis, while on the y-axis you have the type of analytics you can derive from it
With batch analysis it's typically descriptive analytics. Descriptive analytics answers the questions what happened and why did it happen. It looks at past performance and understands that performance by mining historical data for the reasons behind past success or failure. Most management reporting, such as sales, marketing, operations, and finance, uses this type of post-mortem analysis. Good for dashboards, reports in response to queries, looking at trends, looking at outcomes, e.g. (i) a daily customer-preferences report from your web site's click stream: helps you decide how to optimize deals and what ad to try next time, (ii) daily fraud reports: was there fraud yesterday?
Then comes dealing with data in real time, which moves the question from what happened to what is happening. Great for real-time alerts (what is happening now, what is going wrong now), real-time analysis (what to offer the current customer now), and real-time spending caps (a transaction gets denied because it exceeds your balance, for example)
The next phase is predictive analytics. Predictive analytics answers the question what might happen. This is when historical performance data is combined with a variety of statistical, modeling, data mining, and machine learning techniques, and occasionally external data to determine the probable future outcome of an event or the likelihood of a situation occurring
The final phase is prescriptive analytics, which goes beyond predicting future outcomes by also suggesting actions to benefit from the predictions and showing the implications of each decision option. e.g. Think of a traffic navigation app. Pick an origin and a destination — a multitude of factors get mashed together, and it advises you on different route choices, each with a predicted ETA. This is everyday prescriptive analytics at work. Prescriptive analytics can continually take in new data to re-predict and re-prescribe, thus automatically improving prediction accuracy and prescribing better decision options. So prescriptive analytics provide intelligent recommendations for the optimal next steps for almost any application or business process to drive desired outcomes. So while predictive analytics forecasts what might happen in the future, prescriptive analytics can help alter the future
Example of a retailer that offers free expedited shipping to loyal customers.
Descriptive analysis would provide the trends on which this program was structured
Based on past customer behavior, a predictive model would assume that customers will keep the majority of what they purchase with this promotion. However, one customer purchases eight items of clothing but decides to keep only one.
The retailer paid for expedited shipping with the assumption that there's this great consumer out there who bought eight items, so they're willing to invest and lose a little margin on shipping. The algorithm didn't take return behavior into account.
For this retailer, reducing its losses on "outlier" customers who don't follow what predictive analytics forecasted means having policies in place to cover itself. Using prescriptive analytics, the retailer might come up with the options of giving an in-store-only coupon to customers who make returns (to encourage another purchase in which shipping isn't a factor) or notifying customers that they must pay for return shipping
Big Data was built for the cloud: if you aren't using the cloud for big data then either you aren't dealing with big data, or you are struggling or going to run into issues very soon. Let's understand why that's the case
With big data you are typically dealing with very large, or large and fast-growing, data sets, and with on-prem infrastructure you will run into capacity issues sooner rather than later. You have no such capacity issues with the cloud
With big data there are typically peaks and valleys and it's rarely a persistent volume, which creates challenges for on-prem infrastructure as you have to provision for peak load, which is highly inefficient. The cloud, in contrast, is most efficient with highly variable workloads
Given the variety and velocity of big data you will need a set of services & tools to manage it; managing them yourself is complex, while in the AWS cloud the same tools & services are fully managed
If you look at a typical big data pipeline, data comes in one side and answers/insights come out the other, with multiple stages in between: ingest, store, process & analyze, consume/visualize, with store and process repeating multiple times to shape the data into a format the end consuming application can consume at whatever rate or characteristic it demands. What goes on in between determines time to answer (pipeline latency), pipeline throughput = f(volume, request rate), and cost
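To make the "process & analyze" stage concrete, here is a minimal, illustrative AWS Lambda handler consuming records from a Kinesis stream; the payload fields and the 500 ms alerting threshold are assumptions for the sketch, not details from the deck:

```python
# Hypothetical Lambda handler triggered by a Kinesis stream.
import base64
import json

def handler(event, context):
    alerts = 0
    for record in event["Records"]:
        # Kinesis delivers each payload base64-encoded inside the event.
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        if payload.get("latency_ms", 0) > 500:   # assumed alerting rule
            alerts += 1
            print("ALERT slow request:", payload)
    return {"records": len(event["Records"]), "alerts": alerts}
```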
Before we get to the components that enable this, it's important to emphasize that it's imperative to start with understanding the use case, in other words the answers and insights that are required, why they are required and how they will help the business, before embarking on building out the solution and piecing together the elements to enable it. What's important is leveraging the data, not the technology stack. The technology exists today to make it all happen quickly, securely & cost efficiently!
Amazon Machine Learning is a service that makes it easy for developers of all skill levels to use machine learning technology. Amazon Machine Learning provides visualization tools and wizards that guide you through the process of creating machine learning (ML) models without having to learn complex ML algorithms and technology. Once your models are ready, Amazon Machine Learning makes it easy to obtain predictions for your application using simple APIs, without having to implement custom prediction generation code, or manage any infrastructure. Amazon Machine Learning is based on the same proven, highly scalable, ML technology used for years by Amazon’s internal data scientist community
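As a hedged sketch of the real-time prediction API mentioned above, assuming a model has already been trained (the model ID and feature names below are placeholders):

```python
# Hypothetical real-time prediction request against Amazon Machine Learning.
import boto3

ml = boto3.client("machinelearning", region_name="us-east-1")
response = ml.predict(
    MLModelId="ml-EXAMPLEMODEL",              # placeholder model ID
    Record={"age": "34", "plan": "premium"},  # feature values are passed as strings
    PredictEndpoint="https://realtime.machinelearning.us-east-1.amazonaws.com",
)
print(response["Prediction"])
```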
And then we have a new addition to the analytics portfolio by way of Amazon QuickSight, our very fast, easy-to-use, cloud-powered business intelligence service at 1/10th the cost of traditional BI solutions, at $9/user/month. Amazon QuickSight is currently in preview
Fast, fully managed, petabyte-scale data warehouse
Fast - Optimized for data warehousing. Redshift has a massively parallel processing (MPP) architecture with columnar storage, data compression and 10GigE networking between nodes for up to 10x better performance than traditional relational, row-based databases
Cheap - No upfront costs, pay only for the resources you provision. Start small at $0.25 per hour and scale to over a PB at $935 per TB per year, less than a tenth of most other data warehousing solutions
Simple – Get started in minutes with a few clicks or a simple API call. Fully managed and fault tolerant. Easy to set up, operate and scale. We take care of provisioning, installation, monitoring, backup, restore and patching
Scalable – With a few clicks via the Console or a simple API call, you can change the type or number of nodes as your performance or capacity needs change. While resizing, your cluster still runs in read-only mode
ANSI SQL Compliant – Uses standard JDBC and ODBC drivers, allowing you to use a wide range of familiar SQL clients/BI tools
Secure – You can encrypt data at rest and in transit using hardware-accelerated AES-256 and SSL, isolate your clusters using Amazon VPC and even manage your keys using hardware security modules (HSMs). Compliant with SOC1, SOC2 & SOC3, FedRAMP, HIPAA and PCI DSS Level 1
Durability and availability: replication, backup, automated recovery from failed drives & nodes
Interfaces: JDBC/ODBC interface with BI/ETL tools; load data from Amazon S3 or DynamoDB (see the sketch below)
Cost model: no upfront costs or long-term commitments; free backup storage equivalent to 100% of provisioned storage
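Because Redshift speaks the PostgreSQL wire protocol, a load-and-query round trip can be sketched with any Postgres driver; the cluster endpoint, credentials, table and bucket names below are all placeholders:

```python
# Hypothetical COPY-from-S3 bulk load followed by a plain SQL aggregate.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder endpoint
    port=5439, dbname="analytics", user="admin", password="***",
)
with conn, conn.cursor() as cur:
    # Bulk loading is done with COPY straight from S3, not row-by-row INSERTs.
    cur.execute("""
        COPY sales FROM 's3://my-bucket/sales/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
        FORMAT AS CSV;
    """)
    cur.execute("SELECT region, SUM(amount) FROM sales GROUP BY region;")
    for region, total in cur.fetchall():
        print(region, total)
```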
Amazon Elastic MapReduce (EMR) simplifies big data processing by providing a managed Hadoop framework that makes it easy, fast, and cost-effective for you to distribute and process vast amounts of your data across dynamically scalable Amazon EC2 instances
You can also run other popular distributed frameworks such as Apache Spark and Presto or any other application in the Apache Hadoop stack in Amazon EMR, and interact with data in other AWS data stores such as Amazon S3 and Amazon DynamoDB
A little-trumpeted fact: EMR is the largest cloud operator of Hadoop infrastructure, having spun up tens of millions of clusters for customers since 2009
EMR supports the open source & MapR distributions and has the most current Hadoop distribution in the market today, with the current versions of the most popular Hadoop apps
Fully managed and hence simple, allowing you to launch a cluster in minutes while EMR takes care of provisioning, set-up, configuration, tuning and monitoring
Extremely flexible as we have decoupled compute & storage (which also provides a very significant cost benefit), you can select the apps you need as also easily resize a running cluster
Elastic as you can provision one, hundreds or thousands of instances to process data at any scale
Typical use cases – Clickstream analysis, log processing, genomics
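As a minimal sketch of launching a Spark-enabled cluster through boto3; the cluster name, release label, instance types and counts are illustrative assumptions:

```python
# Hypothetical EMR cluster launch with Spark installed.
import boto3

emr = boto3.client("emr", region_name="us-east-1")
cluster = emr.run_job_flow(
    Name="clickstream-analysis",
    ReleaseLabel="emr-5.0.0",                 # assumed release
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m4.large", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m4.large", "InstanceCount": 4},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,  # keep cluster up for interactive work
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(cluster["JobFlowId"])
```

Resizing a running cluster later is a single API call against the instance groups, which is what "easily resize" amounts to in practice.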
Amazon Kinesis services make it easy to work with real-time streaming data. Let's look at the components and their functionalities
Amazon Kinesis Streams enables you to build custom applications that process or analyze streaming data for specialized needs. Amazon Kinesis Streams can continuously capture and store terabytes of data per hour from hundreds of thousands of sources such as website clickstreams, financial transactions, social media feeds, IT logs, and location-tracking events. With Amazon Kinesis Client Library (KCL), you can build Amazon Kinesis Applications and use streaming data to power real-time dashboards, generate alerts, implement dynamic pricing and advertising, and more
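As a minimal sketch of the producer side (the stream name and record shape are assumptions, not from the deck):

```python
# Hypothetical Kinesis Streams producer pushing one clickstream event.
import json
import time

import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

event = {"user_id": "u-123", "page": "/checkout", "ts": time.time()}
kinesis.put_record(
    StreamName="clickstream",        # assumed stream, created beforehand
    Data=json.dumps(event),
    PartitionKey=event["user_id"],   # same key -> same shard, preserving per-user order
)
```

A consumer built with the KCL would then read these records shard by shard to drive dashboards or alerts.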
Next is Amazon Kinesis Firehose which is the easiest way to load streaming data into AWS. It can capture and automatically load streaming data into Amazon S3, Amazon Redshift, and Amazon Elasticsearch Service, enabling near real-time analytics with existing business intelligence tools and dashboards you’re already using today
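A similarly hedged sketch of the Firehose side, assuming a delivery stream named clicks-to-s3 has already been configured to point at an S3 bucket:

```python
# Hypothetical Firehose producer; Firehose buffers and delivers to S3/Redshift/ES.
import json

import boto3

firehose = boto3.client("firehose", region_name="us-east-1")
firehose.put_record(
    DeliveryStreamName="clicks-to-s3",                      # assumed delivery stream
    Record={"Data": json.dumps({"page": "/home"}) + "\n"},  # newline-delimited JSON
)
```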
And then you have Amazon Kinesis Analytics which allows you to easily analyze data streams using standard SQL queries
Easy set-up & configuration: create domains via console, SDK or CLI; specify instance types, number of instances & storage options; modify or delete existing domains at any time
Fully managed: addresses time-consuming management tasks; ensures high availability, patch management and backups; monitors the cluster and replaces nodes as required
Flexible storage options: choose between local on-instance storage or Amazon EBS volumes to store your Elasticsearch indices; specify the size and type of the Amazon EBS volume; modify the storage options after domain creation as needed
Set up for high availability: Zone Awareness distributes the instances supporting the domain across two different AZs; with replicas enabled, instances are automatically distributed to deliver cross-zone replication (see the sketch below)
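A hedged sketch of domain creation with boto3, showing the zone-awareness and EBS options described above; the domain name, instance type and sizes are illustrative:

```python
# Hypothetical Amazon Elasticsearch Service domain with zone awareness.
import boto3

es = boto3.client("es", region_name="us-east-1")
domain = es.create_elasticsearch_domain(
    DomainName="logs",
    ElasticsearchClusterConfig={
        "InstanceType": "m4.large.elasticsearch",
        "InstanceCount": 4,              # spread across two AZs when zone aware
        "ZoneAwarenessEnabled": True,
    },
    EBSOptions={"EBSEnabled": True, "VolumeType": "gp2", "VolumeSize": 100},
)
print(domain["DomainStatus"]["ARN"])
```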
Here is a select set of referenceable customers using our analytics services
The vast majority of Big Data use cases deployed in the cloud today run on AWS
We now have a large and growing user base in India too