HiFX designed and implemented a unified data analytics platform called Vision Lens for Malayala Manorama to generate meaningful insights from large amounts of data across their multiple digital properties. The solution involved building a data lake, data pipeline, processing framework, and dashboards to provide real-time and historical analytics. This helped Manorama improve user experiences, drive smarter marketing, and make better business decisions.
ACDKOCHI19 - Next Generation Data Analytics Platform on AWS
1. About HiFX
Established in 2001, HiFX is an Amazon Web Services
Consulting Partner.
We have been designing and migrating workloads in the AWS cloud
since 2010 and helping organizations become truly data-driven
by building big data solutions since 2015.
2. Case Study with Malayala Manorama
Malayala Manorama is one of the largest media conglomerates in India. They
run manoramaonline.com, the largest news portal for Malayalees around the
world, along with several other digital media properties.
In 2016, Manorama embarked on a project to develop an in-house
analytics pipeline that could unify enormous amounts of raw data from
multiple web domains and convert it into meaningful insights. The
company currently has 10 domains such as its matrimonial and real
estate sites, with plans to further expand its digital footprint.
HiFX has been Malayala Manorama’s technology partner for more than
18 years and was approached to design this new data analytics pipeline.
3. Manorama Online
Manorama News
The Week
Vanitha
Watchtime India
E-paper/E-magazine
Chuttuvattom
OnManorama
M4Marry
HelloAddress
Quickerala
Qkdoc
Entedeal
Manorama Horizon
Android
iOS
Manorama MAX
4. The Challenges
Lack of agility and accessibility for data analysis that would help the product team
make smart business decisions and improve strategies
Increasing volume and velocity of data. With new digital properties being added, the
collection and storage layers needed to be designed to scale well
Dozens of independently managed collections of data, leading to data silos. Having no
single source of truth made it difficult to identify what data was available, get
access to it, and integrate it
Poorly recorded data. Often, the meaning and granularity of the data was lost in
processing
6. Vision Lens is a unified data platform with a
consolidated solution stack to
generate meaningful real-time
insights and drive revenue
Better product decisions based on behavioral insights
Add value to our businesses
Increase CLV
Deeply understand every user's journey
Immediate actions, smart targeting and marketing automation
Positively impact KPIs
7. Components
01 UNIFIED DATA PIPELINE
Connecting dozens of data streams and repositories to a unified data
pipeline, enabling near real-time access to any data source
02 WELL-GOVERNED DATA LAKE
A well-governed data lake architected to store raw and enriched data,
thereby eliminating storage silos
03 DATA PROCESSING FRAMEWORK
A data processing framework to support stream and batch workloads
for analytics and machine learning, along with smart workflow management
04 BIG DATA STORES FOR OLAP
Well-designed big data stores for reporting and exploratory analysis
05 RECOMMENDATIONS ENGINE
A recommendations and personalization engine powered by machine learning
06 SMART DASHBOARDS
Dynamic dashboards and smart visualizations that make data tell stories
and drive insights
8. Solution Stack
01 STREAMING ANALYTICS
Watch attention shift in real time. Updates every few seconds to quickly
capitalize on the attention to every post, campaign and section
02 BATCH ANALYTICS
Historical view of unique attention metrics to understand what happened
in the past and use it to plan for the future
03 FB IA AND GOOGLE AMP INTEGRATIONS
Integrations with Google Accelerated Mobile Pages and Facebook Instant Articles
04 VIDEO ANALYTICS
Track key metrics: visits, plays, dropouts and minutes watched
05 CONTENT PERSONALIZATION
Recommendations and personalization engine powered by machine learning
06 ADVANCED REPORTING
Dynamic dashboards and smart visualizations that make data tell stories
and drive insights
07 RAW DATA ACCESS
Clean, structured data that the team can analyze directly
11. Trackers
Android SDK, iOS SDK, JS SDK, PHP SDK, Java SDK
Data / Event Trackers
Trackers allow us to collect data from any
type of digital application, service or
device. All trackers adhere to the LENS
Tracker Protocol.
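The LENS Tracker Protocol itself is internal, so as an illustration only, a tracker event envelope might look roughly like this (every field name here is an assumption, not the actual protocol):

```python
import json
import time
import uuid

def build_event(event_type, domain, anonymous_id, properties=None):
    """Build a minimal tracker event envelope.

    The field names are illustrative; the real LENS Tracker Protocol
    schema is internal to the platform.
    """
    return {
        "event_id": str(uuid.uuid4()),         # for downstream de-duplication
        "type": event_type,                    # e.g. "page_view", "video_play"
        "domain": domain,                      # which digital property fired it
        "anonymous_id": anonymous_id,          # device-level identity
        "client_ts": int(time.time() * 1000),  # client clock, corrected server-side
        "properties": properties or {},
    }

event = build_event("page_view", "manoramaonline.com", "anon-123",
                    {"url": "/news/kerala"})
payload = json.dumps(event)
```

Carrying a client-generated `event_id` and a client timestamp is what later lets the processing engine de-duplicate events and fix client clock issues.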
12. Collectors-Scribe
Data Collectors
Written in Go/Java
Designed for Low Latency
Engineered for High Concurrency
Horizontally Scalable
Scribe collects data from the trackers
and writes it to Kinesis Data Firehose.
This allows near-real-time processing of
data as well as storage in the data lake
for further batch analysis.
It uses ECS Fargate for containerization.
Scribe API endpoints
• Event tracker
• Pixel tracker
• Click tracker
• AMP tracker
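Scribe itself is written in Go/Java, but the batching logic it needs when writing to Kinesis Data Firehose can be sketched in a few lines. Firehose's `PutRecordBatch` accepts at most 500 records and 4 MiB per call, so a collector must chunk events accordingly:

```python
import json

# Kinesis Data Firehose PutRecordBatch limits: at most 500 records
# and 4 MiB total per call. This helper chunks events accordingly;
# the actual Scribe collector is a Go/Java service, not this sketch.
MAX_RECORDS = 500
MAX_BATCH_BYTES = 4 * 1024 * 1024

def batch_for_firehose(events):
    batches, current, current_bytes = [], [], 0
    for event in events:
        # Newline-delimit records so downstream consumers can split them
        data = (json.dumps(event) + "\n").encode("utf-8")
        if current and (len(current) >= MAX_RECORDS
                        or current_bytes + len(data) > MAX_BATCH_BYTES):
            batches.append(current)
            current, current_bytes = [], 0
        current.append({"Data": data})
        current_bytes += len(data)
    if current:
        batches.append(current)
    return batches

# Each batch could then be sent with
# boto3.client("firehose").put_record_batch(DeliveryStreamName=..., Records=batch)
```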
13. Accumulo / Data Lake
ACCUMULO
The data consumer component responsible for:
• Reading data from the event firehose (Kinesis Streams)
• Performing rudimentary data quality checks
• Converting data to Avro format with Snappy compression
• Loading it into the Data Lake
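A minimal sketch of two of these consumer steps, the rudimentary quality check and the data-lake placement. The required fields and the Hive-style partition layout are assumptions for illustration; the real lake stores Avro with Snappy compression rather than the raw JSON implied here:

```python
import datetime

# Fields assumed mandatory for an event to enter the lake (illustrative)
REQUIRED_FIELDS = ("event_id", "type", "domain", "client_ts")

def passes_quality_check(event):
    """Rudimentary check: all required fields present and non-empty."""
    return all(event.get(f) for f in REQUIRED_FIELDS)

def lake_key(event, prefix="raw"):
    """Hive-style partitioned S3 key for the data lake (layout assumed)."""
    ts = datetime.datetime.fromtimestamp(event["client_ts"] / 1000,
                                         tz=datetime.timezone.utc)
    return (f"{prefix}/domain={event['domain']}/"
            f"year={ts.year}/month={ts.month:02d}/day={ts.day:02d}/"
            f"{event['event_id']}.avro")
```

Partitioning by domain and date is what later lets engines like Redshift Spectrum or Presto prune partitions instead of scanning the whole lake.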
DATA LAKE
The Data Lake supports the following capabilities:
• Capture and store raw data securely at scale at a low cost
• Store many types of data in the same repository
• Define the structure of the data at the time it is used
It is designed to:
• Retain all data
• Support all data types
• Adapt easily to changes
14. Prism - Processing Engine
Prism uses Apache Spark as its processing engine and is written in Scala.
It can run on EMR 5.27 or as a Databricks job running
on AWS spot/on-demand instances.
Unified Processing Engine
Prism
Analytics Engine
15. Prism - Processing Engine
Data Cleanser
Performs data cleansing
including:
• Normalization
• De-duplication
• Bot-exclusion
• Fixes for client clock issues.
Data Enricher
Performs enrichment activities
including:
• User Agent Parsing to
understand OS / Platform
• Referrer Parsing to understand
channels
• IP to location transformation
• Lat+Long to location
transformation
• Widen event data with user
profile information
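As a toy illustration of the user-agent enrichment step, here is a deliberately naive OS/platform detector. Production enrichment would use a maintained UA-parsing library; these string checks only show the shape of the transformation:

```python
def parse_user_agent(ua):
    """Very naive OS/platform detection for event enrichment.

    Illustration only: real UA parsing needs a maintained library and
    rule set, not these hand-written substring checks.
    """
    ua_lower = ua.lower()
    if "android" in ua_lower:
        os_name, platform = "Android", "mobile"
    elif "iphone" in ua_lower or "ipad" in ua_lower:
        os_name, platform = "iOS", "mobile"
    elif "windows" in ua_lower:
        os_name, platform = "Windows", "desktop"
    elif "mac os x" in ua_lower:
        os_name, platform = "macOS", "desktop"
    else:
        os_name, platform = "unknown", "unknown"
    return {"os": os_name, "platform": platform}
```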
Data Quality Checks
Performs the data quality checks
needed to detect, report and omit
instrumentation errors
Data Reconciler
Reconciles sacrosanct data, such as
transactions, against the feeds
generated by the master DB
Sessionization/User Merging
Sessionizes events and merges users
based on domain/anonymous ID
Data Refresher
Loads the data to respective tables
in the data warehouse and other
reporting data stores
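The sessionization step above can be sketched for a single user's event stream. The 30-minute inactivity cutoff is an assumed value (a common industry default), not one stated in the deck:

```python
SESSION_GAP_MS = 30 * 60 * 1000  # 30-minute inactivity cutoff (assumed)

def sessionize(timestamps_ms, gap_ms=SESSION_GAP_MS):
    """Group one user's event timestamps (ms) into sessions.

    A new session starts whenever the gap since the previous event
    exceeds the inactivity cutoff.
    """
    sessions, current = [], []
    for ts in sorted(timestamps_ms):
        if current and ts - current[-1] > gap_ms:
            sessions.append(current)
            current = []
        current.append(ts)
    if current:
        sessions.append(current)
    return sessions
```

In Spark this same grouping would typically be expressed per user with window functions rather than a Python loop.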
16. Prism - Real-time Analytics
• Uses Structured Streaming to stream live events
into Elasticsearch.
• The stack can run on both EMR and Databricks.
• Runs on 50 r4.xlarge instances, scaled
to 100 instances during election time.
• Configurations:-
spark.executor.cores=4
spark.executor.memory=25g
spark.executor.instances=50
Spark Streaming
17. Prism - Batch Analytics
Spark on EMR/Databricks
• Scheduled job which kicks off every
day to process all the events for the
day and writes the cleansed
raw/aggregated data to Redshift
(the primary data store).
• It also writes the data in Parquet
format so Presto/Databricks Delta Lake
can run on top if needed.
• Runs on 20 r4.2xlarge instances
• Configurations:-
spark.executor.cores=3
spark.executor.memory=20g
spark.executor.instances=39
18. Data Stores
DATA WAREHOUSE
AMAZON REDSHIFT
Primary Data Store
• Supports batch workloads.
• Supports up to 50
concurrent queries
• pgpool deployed as a cache layer
• WLM and concurrency
scaling enabled
• Elastic Resize
• Redshift spectrum to query
archived data in S3
REALTIME REPORTING STORE
Elasticsearch
Content Analytics Real Time
Dashboard.
• Fluidic Dashboard with
granular filters
• Data Visualization using
Kibana
RECOMMENDATION RESULTS
DYNAMODB
Features like horizontal
scalability, low
operational overhead and
predictable performance
make DynamoDB a good
choice for storing
recommendation results
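As an illustration of why DynamoDB fits this use case, a recommendation-results item might be shaped like this. The attribute names and the 7-day TTL are assumptions, not the platform's actual schema:

```python
import time

def recommendation_item(user_id, content_ids, model_version, ttl_days=7):
    """Shape of a recommendation-results item for DynamoDB (schema assumed).

    A partition key on user_id gives the predictable, low-latency point
    lookups described above; a TTL attribute lets DynamoDB expire stale
    recommendations automatically instead of requiring cleanup jobs.
    """
    now = int(time.time())
    return {
        "user_id": user_id,             # partition key
        "recommended": content_ids,     # ordered list of content ids
        "model_version": model_version,
        "generated_at": now,
        "ttl": now + ttl_days * 86400,  # epoch seconds for DynamoDB TTL
    }
```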
19. Orchestration
Apache Airflow
Workflow Management
Used to programmatically author, schedule
and monitor workflows.
Rich UI
A rich UI makes it easy to visualize
pipelines running in production, monitor
progress, and troubleshoot issues when
needed.
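Airflow models a workflow as a DAG of tasks and runs each task only after its upstream dependencies succeed. The core idea, dependency-respecting ordering, can be sketched with the standard library; the task names below are hypothetical stand-ins for this pipeline's stages, not the actual DAG:

```python
from graphlib import TopologicalSorter

# Hypothetical task names; an Airflow DAG would declare these same
# dependencies with operators and the `>>` syntax.
deps = {
    "cleanse": {"extract"},
    "enrich": {"cleanse"},
    "quality_checks": {"enrich"},
    "load_redshift": {"quality_checks"},
    "refresh_dashboards": {"load_redshift"},
}

# Predecessors-first execution order, as a scheduler would need it
order = list(TopologicalSorter(deps).static_order())
```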
20. Data Retention Strategy
Find a balance between what's optimal for the clients' business needs vs. operational cost-effectiveness
Ensure the data retention policies align with regulatory restrictions (GDPR)
Define proper lifecycle policies at different stages
An S3-IA/Glacier lifecycle policy is defined for the data at rest in the data lake, and a scheduled purging policy is defined
for the primary data store (Redshift)
We keep a quarter's worth of data in the primary data store (Redshift); older data is archived to S3, and
Redshift Spectrum is used for detailed analysis of that older data.
For YoY and QoQ comparisons we pre-calculate aggregates as part of the quarterly process and store the results
in the data store.
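The tiering rule above can be written as a tiny routing function. The one-quarter Redshift window comes from the slide; the one-year S3-IA boundary is an assumed threshold for illustration:

```python
def storage_tier(age_days):
    """Pick a storage tier by record age.

    The 90-day primary-store window matches the stated one-quarter
    retention; the 365-day S3-IA boundary is an assumption.
    """
    if age_days <= 90:    # one quarter in the primary store (Redshift)
        return "redshift"
    if age_days <= 365:   # warm archive, still queryable via Spectrum
        return "s3-ia"
    return "glacier"      # cold archive
```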
21. Dashboard - KPIs / Different Angles
Domain-Specific KPIs
Key metrics in the Content Dashboard:
Page Views
New and Returning Visitors
Engaged Time
Social Shares and Referrals
Bounce Rate
Video Play Rate
Different Angles
Explore the content data from these angles:
Titles
Authors
Sections
Tags
Referrers
Campaigns
Google AMP
Facebook IA
22. Scalability / Performance
Collect, storage and process layers designed to autoscale.
Batch analytics takes an average of 30-40 mins to process and refresh data for the entire day
across all reporting dashboards.
Turnaround latency at the data collector: 75th percentile - 27 ms; 95th percentile - 156 ms.
Currently handles about 150 GB of data per day, with an average of 300 million events processed
per day.
Horizontally scalable data collectors, data consumers, data processors and data reporting
stores.
The real-time streaming stack currently processes 500K events in less than 10 seconds.
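A quick back-of-envelope check on the figures above, using only the numbers stated on this slide:

```python
# Figures from the slide
events_per_day = 300_000_000
bytes_per_day = 150 * 1024**3        # ~150 GB/day

# Derived averages
avg_events_per_sec = events_per_day / 86_400     # sustained ingest rate
avg_event_size = bytes_per_day / events_per_day  # bytes per event
streaming_floor = 500_000 / 10                   # implied streaming throughput
```

That works out to roughly 3,500 events/s sustained at about 540 bytes per event, while the streaming stack's 500K-events-in-under-10s figure implies bursts of at least 50,000 events/s, which is why the collect and process layers need to autoscale well beyond the daily average.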
23. Best Practices in Spark
Use Dataset, DataFrames, Spark SQL instead of RDD to get the benefits of catalyst optimizer
Choose the best data format and compression.
Apache Parquet gives the fastest read performance in Spark with its vectorized Parquet reader. Run
Presto/Databricks Delta Lake on top if needed.
Avro offers rich schema support and more efficient writes than Parquet.
Choose either Snappy or LZO compression, as they strike a balance between splittability and block compression.
Use the Spark Web UI to explore your jobs, storage, and SQL query plans to optimize your Spark execution
Look at the Spark event timeline to see the amount of time spent in each stage/task
Check the shuffles between stages and the amount of data shuffled (use the spark.sql.shuffle.partitions option
if needed)
Check the join algorithms being used.
A broadcast join should be used when one table is small.
A sort-merge join should be used for large tables. You can use bucketing to pre-sort and group tables; this will
avoid shuffling in the sort-merge join
Enable Dynamic Partition Pruning/ flattenScalarSubqueriesWithAggregates/ Bloom Filter Join/ Optimized Join
Reorder
Use the s3 protocol instead of s3a/s3n to reference the data so that it goes through the optimized path
Use EMRFS consistent view only if it's required
Find an optimal configuration for the number of executors, the memory setting for each executor, and the number
of cores for the Spark job.
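A few of the tunables named above, collected as a configuration map. The values are workload-dependent assumptions, not the presenters' actual settings, and some slide items (e.g. the Databricks-specific Bloom-filter join flag) are platform-specific and omitted:

```python
# Illustrative Spark SQL settings for the tips above; values are assumed,
# not taken from the deck. They could be passed via
# SparkSession.builder.config(...) or a spark-defaults.conf file.
spark_conf = {
    # Shuffle parallelism: tune to cluster size and data volume
    "spark.sql.shuffle.partitions": "400",
    # Broadcast tables under ~100 MiB instead of shuffling both join sides
    "spark.sql.autoBroadcastJoinThreshold": str(100 * 1024 * 1024),
    # Snappy balances splittability and compression, as noted above
    "spark.sql.parquet.compression.codec": "snappy",
    # Dynamic partition pruning (available from Spark 3.0)
    "spark.sql.optimizer.dynamicPartitionPruning.enabled": "true",
}
```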
24. Outcomes
Ability to run targeted mobile push and email campaigns
Consistent KPI measurement. The client has a consistent framework across properties to
measure KPIs
Data silos eliminated. A single source of truth makes it easier to identify what data is
available, get access to it, and integrate it
Better user experience. Recommendations running off the data in the Data Lake add value to the
digital properties we manage
Better business agility and product decisions based on behavioural insights. The journey from
data to decisions is made swifter