Self-Service Analytics on Hadoop: Lessons Learned
1. Self-Service Analytics on Hadoop: Lessons Learned
June 29, 2016
Drew Leamon
Director – Advanced Technology Solutions
2. Comcast: Shaping the Future of Media and Technology
High Speed Internet
Video
IP Telephony
Home Security / Automation
Universal Parks
Media Properties
7. Self Service: Native Habitat
Limitations of the Spreadsheet Native Habitat
Self Service
• 1 Million Row Max
• Not Even Medium Data
• Not Collaborative
• No Automation
• Not Repeatable
IT Analyst
8. Self Service: How We Started
Analyst went to IT, made a request, and waited weeks to get results
SSRS
• 10 TB Storage
• 1 Compute Node
Not Self Service
• 10 TB (Medium Data)
• Limited Compute
• IT Hand-off
• Consultative service
• Not self service.
IT Analysts
9. Bigger database still meant building dashboards for team
IT Analysts
Still Not Self Service
• 100s TBs (Large Data)
• Data silos
• IT Hand-off
• Consultative service
• Analysts not SQL experts
Graduated to Specialized Databases
• Clustered Storage
• Columnar Compression
• Clustered Compute
10. Datameer, native on Hadoop, enables self-service for big data
Analysts
True Self Service
• PB == Big Data
• Data Lake
• Excel-like UI
• No more waiting for IT
Self Service: The New Way
• Clustered Storage
• Columnar Compression
• Clustered Compute
• Liberated Data
18. 30% of this traffic was coming from three accounts.
Analysis Shows Traffic Concentration in a Few Accounts
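The kind of aggregation behind this finding can be sketched in a few lines. This is only an illustration, not the workbook the analysts built in Datameer: the record layout (anonymized account id, call minutes), the function name `top_account_share`, and the sample values are all assumptions made up for the example.

```python
from collections import Counter

def top_account_share(records, n=3):
    """Return (top_accounts, share): the n highest-volume accounts and the
    fraction of total minutes they account for."""
    totals = Counter()
    for account_id, minutes in records:
        totals[account_id] += minutes
    grand_total = sum(totals.values())
    if grand_total == 0:
        return [], 0.0
    top = totals.most_common(n)  # sorted by total minutes, descending
    share = sum(minutes for _, minutes in top) / grand_total
    return [account for account, _ in top], share

# Hypothetical, made-up records: (anonymized account id, call minutes).
sample_records = [
    ("acct-01", 100), ("acct-01", 50),
    ("acct-02", 90),
    ("acct-03", 60),
] + [(f"acct-{i:02d}", 50) for i in range(4, 18)]

top_accounts, share = top_account_share(sample_records, n=3)
print(top_accounts, f"{share:.0%}")
```

On real call detail records the same group-by, sort, and ratio would run on the cluster rather than in memory; the point is that the question "how concentrated is this traffic?" reduces to one aggregation.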
19. Ongoing Monitoring of Future Abuse
Analyst Scheduled a Tableau Data Extract and built a Tableau dashboard
- Now the business can keep an eye out for further abuse.
20. Result: Future Abuse Prevented and More
Abuse detected
Analysts empowered
Resources saved
No IT hand-off
Value to organization
Automated and repeatable
24. Improved Customer Experience through Data Analytics
Findings / Analysis
Best Practices
Improved Customer Experience
Data driven scheduling
Dataflow Automation
25. Solution:
- Build views quickly & aggregate large datasets.
- Early visibility of data in Hadoop
- Create repeatable processes through automated workflow
• Aggregations of large datasets from disparate data sources.
- RDBMS, HDFS, APIs
• Data Joins / Data Quality Checks / Pipeline between clusters
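A minimal sketch of what a data quality check followed by a join looks like, assuming two small in-memory datasets. The field names (`device_id`, `network_qos`, `in_home_qos`), the helper names, and the values are hypothetical and chosen only for illustration; the slides do not describe the actual schemas or the Datameer workflow steps.

```python
def quality_check(rows, required):
    """Split rows into (valid, rejected) based on required non-empty fields."""
    valid, rejected = [], []
    for row in rows:
        if all(row.get(field) not in (None, "") for field in required):
            valid.append(row)
        else:
            rejected.append(row)
    return valid, rejected

def inner_join(left, right, key):
    """Join two lists of dicts on a shared key, keeping only matching rows."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]

# Hypothetical records from two disparate sources.
network = [
    {"device_id": "d1", "network_qos": 0.9},
    {"device_id": "d2", "network_qos": None},  # fails the quality check
    {"device_id": "d3", "network_qos": 0.7},
]
in_home = [
    {"device_id": "d1", "in_home_qos": 0.8},
    {"device_id": "d3", "in_home_qos": 0.6},
    {"device_id": "d4", "in_home_qos": 0.5},  # no matching network row
]

valid_network, rejected = quality_check(network, ["device_id", "network_qos"])
joined = inner_join(valid_network, in_home, "device_id")
print(len(rejected), joined)
```

Keeping the rejected rows, instead of silently dropping them, is what makes the check useful in a repeatable pipeline: the reject count can be monitored from run to run.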
26. Result: Data-driven Customer Viewing Experience Enhancements
Customer Experience Improved
Analysts empowered
Capital Spend Directed Intelligently
No IT hand-off
Value to organization
Automated and repeatable
Editor's notes
Welcome
Self Introduction
Journey to Self-Service Big Data
Based on Lessons Learned from the work that we have done at Comcast.
Comcast Introduction
Cable Organization
High Speed Internet
Emmy winning Video Platform
Home Security & Automation
IP Telephony
NBC Universal
Media Properties
Universal Theme Parks
Scale
10s of Millions of Customers / 100s of Millions of Devices
Intro to my team
Initial Charter
Start with Massive amounts of Data
Deliver Budget Guidance
Deliver Forecasts
Engineering Design Guidance
My specific goal is to empower all of these activities and more with Technology
Hadoop Summit – Data Lake
Safari in Africa
Musth – testosterone spikes 60x
- You will never experience that in a zoo
- nor a theme park
Native Habitat is critical
We started with Self Service Analytics
Excel on a Laptop
Single Resource / No handoffs
Contained
Scaled to 1M rows
Migrated to SQL Server / SSRS – Not Self Service
IT Infrastructure / Handoff
Limit at 250 GB
8 years ago before Big Data was cool we had big data problems – Enter Vertica
Columnar Data store
100s of TBs
Still have silos
Enter Datameer on Hadoop to bring us back to Self-Service Analytics
Technical Limitations
1 M Row Max (Not even medium data)
Not Collaborative
No Automation
Not Repeatable
Consultative model
- Limit for SQL Server at ~250GBs
- IT Handoffs
Model is Consultative
Actually moved away from self service. In excel, analysts had access to data.
IT Service Analytics
- Now we can store TBs of data in clusters of servers
- If you really have big data, you are still going to end up with silos
- IT Handoff and still consultative
- Analysts don’t know SQL, or at least don’t know it well enough to avoid causing problems.
- OpenSource
- Dataset blending
Have true Self-Service
No IT Handoffs
Datameer
5000 row sample
Multiple Configurations/Distributions
Mixture of Bare Metal and Virtualized
Multiple Distributions
When we use “big data” like this we do so in compliance with all applicable privacy and security requirements and laws.
Diagram Details the Maturity of Different Use Cases
Many are being targeted and are in varying levels of maturity
At this point I’m going to focus in on a specific use case.
Lots of Consultative work
Invested to make them Self-Sufficient with Datameer
Comcast Digital Voice
We are one of the largest telephone carriers in the country
This is an important line of business for us
There are many parts of the business including wholesale and peering relationships that need to be managed
All of these rely on data to make decisions on how to manage the network and the relationships
IP Telephony is complex.
deep engineering field
Intricacies
session border controllers
media gateways.
SMEs and Analysts
deep engineering knowledge
My team did not have knowledge
Consultative approach was very challenging
Handoff errors
Built the wrong thing
Extremely iterative and costly
Solution:
Get the SMEs and Analysts into the data with Datameer
Data
Anonymized CDRs
TBs of data per day
Datameer UI – 5000 row sample of data
Real-time feedback
Create your data pipeline via an Excel-like interface
Instantaneous Feedback
What Happened – second hand
Data Discovery
Profiling – Understand the Data
Let the data tell its story
Noticed something strange in the data
Spike to High cost areas (international?)
Question: What does it mean?
Hypothesis: Network abuse
Not legitimate use
Violation of the terms of service
Not going to give a course in how to abuse our services
The SMEs/Analysts
Hypothesis
Dug deeper / created aggregations
Large percentage of traffic was coming from a handful of accounts
Datameer has Visualization capability
Infographics
Tableau is fairly well adopted
using Datameer's integration with Tableau
SMEs created an automation in Datameer to push a TDE to Tableau Server
Abuse detected and addressed
Analyst directed and empowered
No IT Handoff
- Value delivered to the organization
- Automated and repeatable
Sausage Funnel
Inputs
3rd Party QoE
Network QoS
In-Home QoS
Outputs
Improved IP Video QoE
Improved NPS
Blend
Analyze
Share
Rapid Prototyping – Disparate Data Sets
Changing how we prioritize capital spend
Optimizing for CX – Right KPI