Jeff Bertman
CHIEF DATA ENGINEER,
WARNER BROS.
LOS ANGELES, CA ~ JUNE 12 - 13, 2019 | DIGIMARCONWEST.COM
#DigiMarConWest
Maximizing Big Data ROI via
Best of Breed Technology
Patterns and Practices
KEYNOTE
Maximizing Big Data ROI via Best of Breed
Patterns & Practices
CTO and Lead Data Scientist/Engineer
Dfuse Technologies (formerly with Warner Bros Digital Networks)
Jeff Bertman
Click1: DigiMarCon Main Site
#NOTES to Audience:
(1) Caution: This deck contains some
Hollywood “GLITZ” that could be harmful
to your Boringzola.
Please be prepared to Smile :)
(2) Much of the focus is on Data Engineering
but from a Business Value perspective.
We will quickly zip through certain slides
-- spending just enough time for context.
(3) Thanks for the honor to Serve the
DigiMarCon West Community!
Click2: DigiMarCon Speaker Page
(Scroll to “Bertman”) Also:
- www.LinkedIn.com/in/JeffBertman
- Jeff.Bertman@DfuseTech.com
- Mobile +1 818-321-3111
- More contact info at end of deck
Click3: Dfuse Technologies Main Site
www.DigiMarConWest.com
3
Speaker Highlights
Jeffrey Bertman: “Uptight Easterner”
Chief Data Engineer, CTO, Lead Data Scientist/Engineer, Bla Bla Bla
Data Geek :)
4
This deck contains more GLITZ than usual (for me) :) And there are layers on several slides.
For Best Viewing, DOWNLOAD the PowerPoint Show File (vs viewing online). Thanks!
SHOW BUSINESS WARNING
5
Contents Summary
# Topic Slides Remarks
1) Speaker Highlights 2
2) Overview (incl 1 Brief Slide about Machinima / Warner Bros) 4 This TOC followed by general context…
and an Intro to our tour guide and related characters :)
3) Business & Technical Landscapes
--- ROI Defined ---
10
4) ROI Conducive Technologies and Architectures for Big Data 17 Highlights:
• introduction to FiTL (Fitness Technology Landscape)
and “Price Shifting” which affects TCO >>> ROI, etc
• The TP3 Principle
• Polyglot Jazz (DI Graphical Tools)
5) ROI Best Practices for Big Data, Etc 41 Abbreviated List due to time limit.
Much larger list avail upon request.
6) Fairy Tale Wrap-up and Closing Thoughts 46
7) Q & A / Contact Info 52 Feel free to reach out for discussion or future
presentation versions (whitepaper in progress)
Friendly Warning: Presentations from Entertainment Companies may Contain some Jazz !
Introducing Our Tour Guides: DP aka Data Pigeon, LOP aka Lack of Planning Mule
DP
DP
Methodology
DP
Helper 2
Helper 1
From Here to There
LOP
* No Orwellian Connotation
Slide 6
DP
DP
LOP
But even the
best plans...
…with lack of follow-through...
Slide 7
Introducing Our Tour Guides: DP aka Data Pigeon, LOP aka Lack of Planning Mule
…can lead to...
Slide 8
Slide 9
Opening the door to success requires both
planning and follow-through…
…to tame LOP the mule into LOP2 !
DP
LOP2
SGA
Successful
Geeks
Assoc.
Introducing Our Tour Guides: DP aka Data Pigeon, KIO1 aka LOTS of Planning Pegasus…or at least “ENUF” Agile Planning :)
10
Everything we do is in the…
✓ Thousands of content creators (aka talent partners)
✓ Millions of videos on numerous platforms
✓ Billions of aggregate views / month
Expanding Footprint Within & Beyond WB requires even Greater Scale
✓ BI/Data supports Mach OTT, Other WB Divisions, External Companies
✓ Distribution supports other WB Initiatives
Native Digital Entertainment Business that Leverages Technology to
Meet and Exceed KPIs & Operational Goals -- and to Scale Cost-Effectively
Recent Years
Thousands… Millions… BILLIONS+
✓ Cornerstone Technologies (Big Data focus)
#DISCLAIMER: This is the only Machinima/WB slide (and it is non-proprietary).
11
LEVERAGE TECHNOLOGY
Architecture, Engineering, Methods, Libraries,
CM, QA, Security, SysOps, DevOps
DATA >> INFO >> KNOWLEDGE >> ACTION
Improve BIZ (Revenue, Profit, Market Share, Etc)
Always Grow BIZ Value –– Data Intelligence
BEST PRACTICES & ~SLAs
Continual Improvement, Serviceability,
Reliability, Performance, Governance
SERVICE ORIENTED Mindset Driven By Clear Mission, Values, Goals & Priorities:
Cost-Benefit + “Everyone is a Customer” Approach
Be the Solution, Be the Boss, Value Each Other, A-Team, Executional Excellence, …
Sample
Data Management
Touch Points &
Basic Approach
11
12
BIZ VALUE.
Increase Revenue,
Profit, Market Share,
Etc
Low Level
Processes
Get Stuff, Do Stuff, Put Stuff, Etc
Raw Data
Google, Youtube, Facebook, Twitter, Twitch,
Amazon, Salesforce,
Mach Console, ETC
Information Technology
Data
Engineering
Data ►► Information
Software
Engineering
Tech ►► Biz Tools
Product
Management
Actualization
User Applications, Visualization Tools
(BI, Reporting, Analytics, Discovery), Etc
Value “Scape”
12
ROI is
Impacted by
All Levels
13
Downstream Feeds, Exports / Reports
Dashboards, Rpts
Convert Data into INFORMATION to Help Drive Cost-Benefit / KPIs for Business Units and The Enterprise
• Google
• YouTube
• Facebook, Instagram
• Twitter
• Twitch
• Amazon
• Pluto TV
• Clickstream
• Finance/Accounting
• Salesforce
• Sensors
• Logs
• ETC
RAW DATA from
(Platforms, Apps, Etc)
• BI + Data Managers
(incl Self-Service)
• Marketing
• Sales / BD
• Finance
• Accounting
• Operations
• Security & Compliance
• ETC
(See Pillars slide)
DISTRIBUTE to
Customers, Et Al:
Data Engineering (DE)
Is “Under the Hood”
✓ Data ►► INFORMATION:
Consolidate & Transform Raw
Data into Data Warehouse (DW)
✓ Operational ENGINES:
Payment Processing for Talent,
Directors, Recruiters, etc
Data Engineering Landscape --- Acquisition, Curation & Dissemination
14
Data Engineering & Business Intelligence Lifecycle
GCR
Business
Goals
Concept
Rqmts
Definition
Vision & Goals Management
Tech Architecture
(incl Data Services / Ops)
Business / Logical
Products Select & Install
Tech / Physical
Integration
Testing
(Func+Stress)
Production
Deployment
Business Intelligence (BI)
Design / Definition
Prep
Deployment
(incl Support Spec)
User Training
(incl End User
Info)
Business
Intelligence (BI)
Development
(Focus on BIZ
Meta Layer)
Data Integration
(ETL, MDM,
Metadata, Etc)
Explore & Design
Unit Testing
Integration
Deliver Business Value Increments via “Frequent Little BITIs” (Agile Coordination + Waterfall As-Needed)
Integration
Test Planning
Data Profiles & Maps
PostReflect, Maint, Improve
PreReflect, Strategy, Scope, Impact
Implement & Support
Texture + Smooth: BITI Cycle (Build, Integrate, Test, Improve)
Inception & Definition
>>> Today’s Objective <<<
15
ROI in Technical Environments
The (preferably measurable) successful Business outcomes generated by leveraging Technology
to increase financial KPIs, e.g. revenue, profit, etc.
• “Leverage” implies “$Spend”:
o For Product, Licensing, Labor, Infrastructure, Transition, Etc
o “Break Even” is a common focus: when measured benefits equal initial investment.
o Keep ongoing costs in mind. Ongoing benefits must always exceed TCO to declare “success”.
• Even Open Source / Freemium products have Costs, for example:
o Compute Nodes they run on – vs managed service which is inclusive but still $spend
(e.g. Python, Node.js, Scala, Etc on EC2/VM vs AWS Lambda or Azure Functions, Google App Engine/Cloud Functions)
o Labor might be more than paid product or managed service
(e.g. Kafka vs AWS Kinesis – various pros and cons largely focused on labor and performance)
• ROI Obvious Factors:
Availability, Maintainability, Reliability/Accuracy, Functionality, Performance, Security
ROI Defined (Return on Investment) – The Basics
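The break-even and "benefits must exceed TCO" points above can be sketched in a few lines of Python. This is only an illustration: the function names and the monthly granularity are assumptions, and folding ongoing monthly cost into the break-even calculation is one interpretation of the slide rather than a formula stated in the deck.

```python
def break_even_month(initial_investment, monthly_benefit, monthly_cost,
                     horizon_months=120):
    """Return the first month in which cumulative net benefit covers the
    initial investment, or None if that never happens within the horizon."""
    net_per_month = monthly_benefit - monthly_cost
    if net_per_month <= 0:
        return None  # ongoing costs consume the benefit: no break-even
    cumulative = 0.0
    for month in range(1, horizon_months + 1):
        cumulative += net_per_month
        if cumulative >= initial_investment:
            return month
    return None


def roi(total_benefit, total_cost):
    """Simple ROI ratio: net gain divided by total cost of ownership."""
    return (total_benefit - total_cost) / total_cost
```

A "free" open source product still shows up here through `monthly_cost` (compute nodes, labor), which is why the second function divides by total cost rather than license price.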
16
ROI Defined (Return on Investment) – Special Challenges
ROI ALSO includes “Lateral Spend”
• CAUTION – Watch for Hidden or Shifted Costs:
… sometimes feels like “Collateral Damage”
o Example 1 – Big Data on Serverless Architecture:
AWS Lambda / Google Cloud Function 9-15 Minute Time Limit per Execution
If used for certain Data Integrations, you can invest a lot of $$, time, and effort working around the time limit
via special chunking or ~recursive calling mechanisms which are more complex than they need to be.
o Example 2 – Streaming TV & Movie Platforms for Cord Cutters :) :
Free and Low Cost Options e.g. Pluto TV, Hulu Live, DirecTV NOW, YouTube TV, Sling TV, Apple TV, VUE, Watch TV, …
If run simultaneously on multiple devices in same home, along with gaming, you’ll need to add $$ to Internet Service.
• Consider Full Functional Scope before settling on a Single Technology
• Best to support Multiple Patterns, Sometimes via Multiple Technologies
But keep “official” list “minimal” for each product or platform type.
• Heads-up on New but Old Term: “Poly#!#” – more on this later
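To make Example 1 above concrete, here is a hedged sketch of the kind of chunking workaround a per-execution time limit forces on a data integration job. Everything in it is hypothetical (function names, the checkpoint-by-index scheme, the injectable clock); a real serverless job would also need to persist the checkpoint and trigger the next invocation, which is exactly the extra effort the slide warns about.

```python
import time


def process_chunk(items, start, handle, budget_seconds, clock=time.monotonic):
    """Process items[start:] until the time budget is spent.

    Returns the index of the first unprocessed item (the checkpoint), or
    None when everything is done. At least one item is handled per call so
    that every invocation makes progress.
    """
    deadline = clock() + budget_seconds
    index = start
    while index < len(items):
        if index > start and clock() >= deadline:
            return index  # stop early and hand back a checkpoint
        handle(items[index])
        index += 1
    return None


def run_to_completion(items, handle, budget_seconds, clock=time.monotonic):
    """Local stand-in for the re-invocation loop; returns invocation count."""
    invocations, checkpoint = 0, 0
    while checkpoint is not None:
        checkpoint = process_chunk(items, checkpoint, handle,
                                   budget_seconds, clock)
        invocations += 1
    return invocations
```

Even this toy version needs a checkpoint contract, a progress guarantee, and an outer driver, none of which exist when the same job runs on a long-lived node.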
17
Presentation Abstract – Brief Discussion
Not long ago the question was whether your organization had big data. Did you have
the volume, the velocity, the technology. Now those basics are largely given for most of
the people attending this event. The path to success is still fuzzy, however, with so many
technologies to choose from – and so many ways to use them.
This presentation triangulates in a holistic manner on the modern business dilemma:
how can we leverage technology to improve revenue, profit, market share, and numerous
other success criteria. That said, this is not about the analytics or KPIs -- although it is
about measurable improvement. It’s about lining up the right technologies and using them
in effective, proven ways to maximize Return on Investment (ROI). Since the slant here
is holistic, we’ll show how to blend infrastructure, tools, methods, and talent to avoid and
constantly trim technical debt… and to produce success stories that are consistently
repeatable, not a byproduct of individual heroics.
Not long ago
This presentation
Brief Discussion
18
Up Next
ROI Conducive
Technologies & Architectures
19
ACTIONABLE INSIGHTS >>
~RAW Uber Actionable
CUSTOM
TURBO
CUSTOM
TURBO
CUSTOM
TURBO
20
ACTIONABLE INSIGHTS >>
~RAW Uber Actionable
21
The Birth of FiTL: Fitness Technology Landscape (via Paper Providence)
22
What Other People Do While Geeks Write Papers / Presentations :)
23
FiTL Price Shifting Example: DW / Data Hub Main Platform
$ Scale: 1 = Low to 5 = High ($ symbol intentionally on right of Raw$ figure)

● PostgreSQL CE (cloud)
   Product: 0 (Open Src)
   Compute: $$ (EC2)
   Storage: $$$ (EBS)
   In/Out (e.g. Hybrid Cloud): ~n/a, wash
   Access Patterns (Explain): $ (As Needed)
   RAW$ COUNT: 5$
   Special Info / Discussion Points:
   o #PRO: Multi-Model for Trxs (+ ~Analytics)
   o #CON: Minimal Scale-Out except for e.g. Citus DB or $$Product GC PG, Azure PG
   o #CON: Slow evolution of Analytics infra

● Redshift (cloud)
   Product: $$
   Compute: $$$ (~EC2)
   Storage: $$$ (EBS)
   In/Out (e.g. Hybrid Cloud): ~n/a, wash
   Access Patterns (Explain): $ (As Needed)
   RAW$ COUNT: 7$
   Special Info / Discussion Points:
   o #PRO: Spectrum to connect Data Lake
   o #CON: Main focus is DW Not Transactional
   o #CAUTION: Node distribution / access limitations. Discuss Mitigation Patterns for
     diverse data access patterns on same table. Tie to Aggregate Awareness, etc.

● Snowflake (cloud)
   Product: $$$
   Compute: $$ (Selectable per Session)
   Storage: $ (S3)
   In/Out (e.g. Hybrid Cloud): ~n/a, wash
   Access Patterns (Explain): $ BUT See #CAUTION in Special Info col cuz Separate Compute layer
   RAW$ COUNT: 6$ BUT Exp = 7$
   Special Info / Discussion Points:
   o #PRO: Unique Data Sharing
   o #CON: Main focus is DW Not Transactional
   o #CAUTION: Separate Compute Layer is generally better BUT can cost more for some
     patterns. Discuss Mitigation Patterns.
   o #EVAL Data Cache options, e.g. Tableau TDEs or Data Virtualization w/ Denodo, etc.

● Oracle Exadata (on-prem)
   Product: $$$$$
   Compute: $$$ (Data Ctr + Extra Staff)
   Storage: $$$ (Appliance)
   In/Out (e.g. Hybrid Cloud): #TBD
   Access Patterns (Explain): $ (As Needed)
   RAW$ COUNT: 11$
   Special Info / Discussion Points:
   o #PRO: Multi-Model for Trxs + Analytics
   o #CON: #ECOSYSTEM is Shrinking(?)
   o #SCALABILITY Options: e.g. negotiate for dormant CPUs etc
   o CapEx is vanilla for On-Prem, but ~wash

Guidelines + Future Improvements
• TCO Factoring:
Incorporated in each category,
e.g. see Compute column for Oracle
• Conventional Evaluation Factors can be Added:
Performance, Scalability, Maintainability, Functionality,
Security, Vendor Viability / Ecosystem, etc
24
Snowflake’s Sweet Spot is Data Warehousing / Analytics, Not Transactional / Operational
Activity (although transactional performance is better than expected!)
✓ Many Modern Best in Class Products have Same Issue
✓ Homogenous vs Heterogeneous Technical Environments
o Homogenous was a Dream :)
(Not Sustainable due to Tech Evolution, M&As, Self-Service / Decentralization, etc)
o Interoperability is the Reality … Usually
o People like Simple, but Modern Times contain simplicity
in each Class – “Best in Class”
✓ What does Heterogeneous Mean in Today’s Modern IT Arena? … … …
Reality Check
Examine Wide Use Cases -- This Example Happens to be Data Platform
DP
LOP2
25
Old / Rejuvenated “Modern” Term: Polyg#!# What?
(Source: Google Dictionary)
Polyglot Persistence:
Using multiple data platform technologies
[to address diverse use cases in a best-of-breed manner].
Polyglot Programming:
Using multiple programming languages
[to address diverse use cases in a best-of-breed manner].
Domain Specific Languages (DSLs) are now standard practice
for enterprise app development.
. . . 2012+ Time to Revive! . . .
Polyglot Engineering / Architecture:
Using multiple technologies [in the same functional domain]
[to address diverse use cases in a best-of-breed manner].
+ glotta ‘tongue.’
26
Potential New Principle / Postulate (#DRAFT Idea in Progress)
TP3 Data – Technical Polyglot Propensity Principle for Data Platforms:
Modern enterprises with Big Data tend to utilize Polyglot Engineering with the intention of maximizing ROI.
One Technical Data Platform cannot profitably maintain a Top 3 industry popularity rank for modern
big data enterprises for more than 3 years without sacrificing at least one of the following Top 3 ranks:
• Multi-Model support for more than 3 types:
e.g. Relational, Graph, Document/Text, Multi-Media, Geospatial, Key-Value (Structured, Semi/UnStructured)
• Multi-Use-Category support for more than 3 types:
e.g. Analytical, Transactional, Search, Stream
• “Reasonably Low Pricing” given abundance of Modern, Competitive, Low Cost / Community products
TP3 DE – Technical Polyglot Propensity Principle for Data Engineering Platforms:
Similar to TP3 Data, but for Data Engineering / Integration tools and platforms.
DP
Let’s see some EXAMPLES . . .
27
Polyglot Data Integration: Example Architecture Pattern 1
Example Use Case: Social Media / Video Platform – Core Data Feeds
• Purpose & Constraints: Biz Performance & Revenue Metrics / KPIs.
Minimal or No Backfill Available. Supports Data Lake Direct Access Use Cases for raw data (purple lines from earlier slide).
• Solution Profile – Decoupled Polyglot thru Data Lake:
Python for Extraction, Pentaho Data Integration (PDI) for Load/Ingest/Transform. High Resilience.
• Fitness Highlights:
o Python: $0 Open Source, Thin (low resource), Highly Available, e.g. can be Serverless or EC2.
o Pentaho PDI: $0 Community Edition (or EE see below), Graphic Workflows with Standard Transformations.
Low Maintenance Work Share (extremely easy for cross-training even with complex pipelines).
Option to Expand to Enterprise Edition ($) for better HA infra, monitoring, repo, scheduling, support, etc.
Extract via Python
(e.g. REST API)
Ideal for: Facebook (core), Instagram,
Twitch Live Streams, Etc
Optional for: YouTube, Amazon, Etc
Load/Ingest/Transform
via Pentaho PDI DW
Async
e.g. CSV, JSON, Parquet
Near-Real-Time, Hourly, Daily, Etc
Cache
Lake
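The decoupling in Pattern 1 comes from the extractor only ever writing files to the lake, while the loader picks them up on its own schedule. A minimal Python sketch of the extract side follows, under stated assumptions: the folder layout, the JSON Lines format, and both function names are hypothetical, and the REST call is left out so that `records` stands for whatever the API returned.

```python
import json
from pathlib import Path


def land_raw(records, lake_root, feed, extracted_at):
    """Write one extract as a JSON Lines file under a date-partitioned path."""
    folder = Path(lake_root) / feed / extracted_at.strftime("%Y/%m/%d")
    folder.mkdir(parents=True, exist_ok=True)
    target = folder / (extracted_at.strftime("%H%M%S") + ".jsonl")
    tmp = target.with_suffix(".tmp")
    with tmp.open("w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record) + "\n")
    tmp.rename(target)  # publish only once the file is complete
    return target


def pending_files(lake_root, feed, already_loaded):
    """List landed files the loader has not ingested yet, oldest first."""
    landed = sorted((Path(lake_root) / feed).rglob("*.jsonl"))
    return [p for p in landed if str(p) not in already_loaded]
```

Because the raw files stay in the lake, a DW outage only delays loading; nothing has to be re-extracted, which matters when the source offers minimal or no backfill. The load side (Pentaho PDI in the slide) would consume the `pending_files` list or an equivalent folder scan.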
28
Polyglot Data Integration: Example Architecture Pattern 2
Example Use Case: Social Media / Video Platform – Extension Data Feeds
• Purpose & Constraints: Biz Performance & Revenue Metrics / KPIs.
Backfill is Available. Extract Runtime is Short / Low Impact on Runbook Dependencies. Ok to Not have Data Lake raw data.
Extensions Data Density is High, e.g. would create thousands-millions of files per day (discuss).
• Solution Profile – Homogenous ~Stream Direct to DW / Data Hub:
Python if Lite Transformations (subject to DI Library selection, see next slide), Pentaho PDI if Heavy.
• Fitness Highlights:
o Python: $0 Open Source, Thin (low resource), Highly Available, e.g. can be Serverless or EC2.
o Pentaho PDI: $0 Community Edition (or EE see below), Graphic Workflows with Standard Transformations,
Low Maintenance Work Share (extremely easy for cross-training even with complex pipelines).
Option to Expand to Enterprise Edition ($) for better HA infra, monitoring, repo, scheduling, support, etc.
#Caution: Must Backfill after Planned/Unplanned DW Outage. No Data Lake Raw Files for Audit or Direct Use Cases.
Ideal for: FB Graph API Extensions,
YouTube Bulk API, Etc
Optional for: Salesforce, SAP, Etc
Extract +
Load/Ingest/Transform
via #TBD
(See Solution Profile above)
DW
Merge
Hourly, Daily,
Monthly, Etc
Cache
Lake
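The #Caution in Pattern 2 implies that someone has to work out which windows to re-extract after a DW outage, since no raw files were kept. A small sketch of that calculation, assuming hourly loads and half-open windows (the function name and both assumptions are illustrative, not from the deck):

```python
from datetime import datetime, timedelta


def missing_hours(last_loaded_hour, outage_end):
    """Return the hourly windows that must be re-extracted after an outage.

    Windows are half-open [start, start + 1h) and begin right after the
    last hour that was loaded successfully.
    """
    step = timedelta(hours=1)
    start = last_loaded_hour + step
    windows = []
    while start + step <= outage_end:
        windows.append((start, start + step))
        start += step
    return windows


# Last good load covered 02:00-03:00; the DW came back at 06:00.
windows = missing_hours(datetime(2019, 6, 12, 2), datetime(2019, 6, 12, 6))
```

Each returned window would then be passed to the source API's backfill call, which is only possible because this pattern requires that backfill is available.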
29
Some Python Data Integration / ETL Libraries
# Library Remarks (as of 2018-11-25) Sample Doc/Code
(Double-Click to Open)
1) PETL Reasonably Popular
(Last Commit Sept 2018)
2) PygramETL No Commits since Oct 2017
3) Bonobo Reasonably Popular
(Last Commit Nov 2018)
4) #TBD: Wide Open to Feedback Roadmap Task for 2019
Evaluate and Select
NOTES:
• Data Analysis Libs like Pandas are not shown above. Full Data Integration / ETL / ELT is
not their objective.
• Custom Development is also quite popular – withOUT starting with a canned 3rd party library.
But you should develop a lib (or at least a collection of templates) within your company for
standardization, productivity, etc.
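As a rough idea of what an in-house template might look like when no third-party ETL library is chosen, the sketch below wires extract, transform, and load steps as plain generators using only the standard library. All names are hypothetical, and the list used as a sink stands in for a real DW writer.

```python
import csv
import io


def extract_csv(text):
    """Yield each CSV row as a dict keyed by the header."""
    yield from csv.DictReader(io.StringIO(text))


def transform(rows, converters):
    """Apply per-column converter functions; unlisted columns pass through."""
    for row in rows:
        yield {key: converters.get(key, str)(value)
               for key, value in row.items()}


def load(rows, sink):
    """Append rows to a sink (a list here; a DW writer in practice)."""
    count = 0
    for row in rows:
        sink.append(row)
        count += 1
    return count


def run_pipeline(text, converters, sink):
    """Standard extract >> transform >> load wiring shared by every job."""
    return load(transform(extract_csv(text), converters), sink)
```

The value of such a template is less the code itself than the shared shape: every job has the same three seams, so cross-training and review get easier.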
30
Friendly
Refresher
DP
31
Polyglot DI: Why Graphic Workflows? –– Example 1
Hard Error End
Hard Error End
Easy to See Green Flow = GOOD, Red Flow = BAD
32
Polyglot DI: Why Graphic Workflows? –– Example 2
Soft Error End
(Do Nothing)
Easy to See Negative Logic ANTI-PATTERN
33
Polyglot DI: Why Graphic Workflows? –– Example 3
Error End
Easy to Understand
Semi-Complex Flow
More Complex is
Also Welcome
34
Polyglot DI: Why Graphic Workflows? –– Example 4
Error End
Soft Error End
(Do Nothing)
Easy to Add
Exception
Handler
35
CONCURRENT File Processing
Chains are Easy to Create
Logging Window Easily
Explains Broken Step Above
(Can also do in IDE for Python, etc)
Polyglot DI: Why Graphic Workflows? –– Example 5
36
Polyglot DI: Why Graphic Workflows? –– Example 6
This Time it Worked.
See all the Green
Checkmarks :)
Automatically Gathers
METRICS Etc for Each Job
Renamed Steps to be
Meaningful
37
Polyglot DI: Why Graphic Workflows? –– Example 7
Error End
Standard
Job Types / Steps
(Menu Options)
38
Polyglot DI: Why Graphic Workflows? –– Example 8
Error End
Big Data
Built-in Transformations
(Source: Pentaho PDI Manual)
39
Polyglot DI: Why Visual Workflows? –– Example 9
Error End
Standard
Input Methods
(Menu Options)
40
Polyglot DI: Why Graphic Workflows? –– Example 10
Standard
Output Methods
(Menu Options)
41
Polyglot DI: Why Graphic Workflows? –– Example 11
Standard Transform Methods
(Menu Options)
Extend to Custom
Python, JavaScript, Bash, Etc
42
Up Next
ROI Best Practices
FUTURE IMPROVEMENTS for this Presentation:
• Reduce Slide Density – Especially the next few slides which also could use some diagrams.
• Many More Best Practices already documented – Adding to presentation after settling on a better format.
• Contact Info at end of this deck – Feel free to reach out for discussion or future versions.
✓ Data Engineering
✓ Analytics (incl AI/ML)
✓ Design & Development
✓ DevOps, SysOps
✓ Data Governance
✓ Collaboration
✓ Management
✓ Etc
43
Best Practices for High ROI Impact – Part 1 (DRAFT: This Slide Currently Requires Walk-Thru)
Types Topic : Components For Benefits >> Do This Cost-Ben
(1-5 Hi : 1-5 Hi)
Depends Remarks
Tech
Data Eng
Data Integration / ETL Arch Patterns :
Decoupled vs Homogenous
Optimize Performance, Scalability, Reliability, Etc
>> Have Canned Patterns Ready for Various Scenarios
2 : 5 • Experience + Research. BEST PRACTICE TIPS:
• See Architecture Patterns earlier in this
presentation (e.g. decoupled vs
homogenous).
Tech
Data Eng
External Resilience Patterns :
Data Extraction Programs
Improve Data Availability & Reliability
>> Use Intelligent Logic (parsing etc) vs Arbitrary or
Hardcoded Logic to Dynamically Accommodate
Changes in Source Data Patterns. Examples:
• Skip over CSV Header Rows – use parse vs row count.
• Ignore extra JSON pages via content scan vs page count.
2 : 4 ROI falls when distracted by “fires”,
especially when preventable.
BEST PRACTICE TIPS:
• “Results are Better Than Excuses” culture.
• Problem Patterns are often ~just as
important as Solution Patterns. Line them
up a la Tech Arsenal.
Tech
General
Tech Debt : Keyword Tagging in
Code/Docs
Minimize Tech Debt and Facilitate Follow-up
>> Tag your Code and Design Docs. Examples:
• Code: #CHANGED, #PERF, #SCALE, #HARDCODED,
#MODULARIZE, #WORKAROUND, #TODO, #FUTURE
• Docs: #KBANK, #OUTPLAN, #SPEC, #QA, #RISK, #TBD,
#TODO, #FUTURE, #MGT, #TECH
• README File for every main module – incl “Future
Enhancements” section in addition to any tickets/cards.
1 : 5 • Context dependent. If Mgt
doc, separate #TECH
sidenotes. Or vice versa if
Tech doc, and put Mgt
Nutshell / Next Steps at top
since Mgt does not focus
on Tech.
• Tech Debt is frequently caused by
overlooking syndromes: “we’ll do it later”,
“slipping between the cracks”.
Tech
Monitor
Simple Automation, Reliability, and
Maintainability : Keyword Tagging in
Logs
Improve Reliability & Maintainability/Operations,
Avoid False Positives
>> Tag Log File Msgs for Action or Review. Examples:
• #ERROR: bla bla (+ Optional Subtags #RETRY or #FATAL)
• #WARNING: bla bla
• #INFO: bla bla bla
1 : 4 • Couple with Monitoring
Tools/Scripts.
• Avoid False Positives, e.g.
“error table”,
“solution for whatever error”, etc.
Tech
SysOps
Deployment : Hot Patches Improve Reliability & Maintainability/Operations
>> Always Create “cm_retro” hot patches folder with:
• Changed Files.
• README for changes to other tiers (e.g. data tier).
1 : 4 • Couple with Git if available
– sometimes changes occur
in infra/commercial
product config files, etc
which might not be CMed.
• Useful even when using Git, etc.
• Consider using Git for all configurables.
Opens deeper issues, e.g.:
o Separating config files from other bin.
o Maintaining security of credential files
etc.
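The External Resilience row in the table above (parse instead of counting rows or pages) can be illustrated as follows. This is a sketch under assumptions: the function names, the `items` key, and the page-fetching callback are all hypothetical, and "content scan" is interpreted here as "stop when a page comes back empty".

```python
import csv
import io


def rows_after_header(text, expected_columns):
    """Find the header by content, then yield data rows as dicts.

    Preamble lines (titles, export notes) above the header are skipped no
    matter how many there are.
    """
    reader = csv.reader(io.StringIO(text))
    wanted = [c.lower() for c in expected_columns]
    for row in reader:
        if [cell.strip().lower() for cell in row] == wanted:
            break
    else:
        raise ValueError("header row not found")
    for row in reader:
        if row:  # ignore blank lines
            yield dict(zip(expected_columns, row))


def collect_pages(fetch_page, max_pages=1000):
    """Read pages until one comes back empty, ignoring any declared count."""
    items = []
    for page_number in range(1, max_pages + 1):
        page = fetch_page(page_number)
        if not page.get("items"):
            break
        items.extend(page["items"])
    return items
```

If the source later adds a second preamble line or misreports its page count, neither function needs a code change, which is the "fire" this practice is meant to prevent.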
44
Best Practices for High ROI Impact – Part 2 (DRAFT: This Slide Currently Requires Walk-Thru)
Types Topic : Components For Benefits >> Do This Cost-Ben
(1-5 Hi : 1-5 Hi)
Depends Remarks
Tech
SysOps
Hyper-Automation, Reliability, and
Maintainability : File Tagging
Simplify Many Critical Actions & Monitoring
>> Tag Filenames or Accounts for Monitoring, Tracking,
etc. Examples (add to filenames):
• Sensitive Files use “_stv” suffix: No-Brainer 1-line expression
(e.g. filename like “*_stv*.*”) to track all files containing
secure credentials etc. Facilitates auto security hardening and
monitoring. No need to depend on high maintenance list
which will grow out of sync.
• Service Accts use “svc_” prefix: Similar to “_stv” but for email
accounts, etc.
1 : 4 • Great for Security Audits:
Auditors like when justified exceptions
can be simplified. Service Accounts in
some high security environments are
allowed only by exception, even
though they are obviously needed and
much better than using name of a real
person who will eventually move on.
• SAFETY TIP: Don’t make it too obvious
for Hackers. For example, “_stv” is
good enough to understand without
saying “look here for ‘sensitive’ files!”.
Tech
Data Mgt
Hidden Tech Debt : Multi-Tenancy Prevent Hidden Tech Debt from Plunging Productivity
>> Always have Checklist to Consider Multi-Tenant
Support, e.g. Biz Unit (BU). Examples – Add Biz Unit to:
• Data Sets/Tables
• Data Integration Infra (folder trees).
3 : 2 to 5
(It Depends)
CONSIDER FACTORS:
• M&A (of course).
• Reorgs / Dept moves, splits, etc.
BEST PRACTICE TIP (tangent topic):
Usually Name things after BEHAVIOR, not
Volatile Infra such as Biz/Dept Name.
Mgt
Collab
Naming Conventions :
Biz & Tech Vocabularies
Increase Productivity, Reliability, & Morale; Preclude
Communication Mayhem
>> Always Have a Name (a la Jim Croce, song artist).
Examples:
• Data Schemas: stage, base, presents.
• Data Sets: Pilot should have name such as “main” or “core”
since something else ~always follows. Otherwise you’ll
forever be saying “the set without a name”.
• Disk Folders:
• ETC
2 : 4 ROI falls when distracted by “fires”,
especially when preventable.
BEST PRACTICE TIPS:
• Devise a simple term whenever you
have to repeat a phrase or word bunch
repeatedly.
• Socialize everywhere across Biz and
Tech, e.g. Roadmap, Specs, etc.
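The File Tagging row above mentions a one-line expression to track every "_stv" file. A hedged Python sketch of how a monitoring script might use that pattern is shown below; the function names are hypothetical, and the permission check assumes POSIX-style file modes.

```python
import fnmatch
import os
import stat


def sensitive_files(root, pattern="*_stv*.*"):
    """Return every file under root whose name carries the sensitive tag."""
    hits = []
    for folder, _dirs, files in os.walk(root):
        for name in fnmatch.filter(files, pattern):
            hits.append(os.path.join(folder, name))
    return sorted(hits)


def too_open(path):
    """True when group or other users have any permission on the file."""
    mode = os.stat(path).st_mode
    return bool(mode & (stat.S_IRWXG | stat.S_IRWXO))
```

A hardening job could then report `[p for p in sensitive_files(root) if too_open(p)]` without maintaining a separate list of credential files that drifts out of sync.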
45
Best Practices for High ROI Impact – Part 3 (DRAFT: This Slide Currently Requires Walk-Thru)
Types Topic : Components For Benefits >> Do This Cost-Ben
(1-5 Hi : 1-5 Hi)
Depends Remarks
Tech
DevOps,
SysOps
Process Orchestration :
DevOps/SysOps Coding
Control Maintainability vs High Availability
>> For each Job’s Master Script (launched by
scheduler), Choose Common/Reusable or Separate
Scripts:
• For Maintainability: Use Common/Reusable.
• For HA: Separate scripts to avoid risk of breaking
something unrelated to current task.
1 : 3 • Evaluate Reusability vs Risk (see
Best Practice Tip in Remarks
column).
• BEST PRACTICE TIP: Lean toward
Maintainability if have separate
Test/QA Team (which is another,
fundamental best practice).
Tech
Data
Mgmt
Data Assets Mgmt : Data Catalog Maximize Results and Minimize Redundant Data in
data stores + BI platforms
>> Create & Maintain a Data Catalog (#DCAT):
• Option 1 $$$: Combine with Data Virtualization, e.g.
Denodo, Dremio, Stardog, etc.
• Option 2 $$$: Separate product, e.g. Alation, etc.
• Option 3 Free But Not Great (Yet):
o Spreadsheet/Googlesheet
Proven Templates available upon request.
o #TODO: Researching open source products.
3 : 5 for $$$
or
2 : 3 for Free
• Be prepared to spend at least
$150K for good commercial
product.
• HOT TIP: Inferred relationships across
DB platforms are only practical when
virtualizing across multiple data
platforms. Lean toward Option 1 +
“encourage” Denodo to speed up their
Roadmap (like Alation feature but
cross-platform).
• #RISK of NOT Having DCAT:
High Tech Debt propensity across
multiple tiers, e.g. in Data Lake,
DW/Hub, and BI platforms.
Tech
Dev/Test
Testing : Sample Data Generation Reduce Train-Validate-Test Lifecycle
>> Auto Generate Meaningful vs Random Data:
• Mockaroo.com: $Range from free for 1K rows and slow
speed to $500/yr for 10M rows and 8x speed.
• Alternatives (generally < $600/yr/seat): RedGate SQL
Data Generator, Dummi, MS Visual Studio , etc.
• Or use Real Data if Possible (limited by InfoSec policy and
data volume considerations – See “Depends” column).
2 : 3 • Production Data – in some
environments -- can be
downloaded/refreshed into
Dev/Test. But still need subset
data while maintaining integrity
(where applicable).
IMPORTANT TIPS:
• For Metrics:
Add distribution info (e.g. uniform,
normal, normal inverse, exponential,
exponential inverse, etc). Incl edge
cases.
• For Text:
Create meaningful patterns, incl edge
cases, e.g. via regex, etc.
• Additional Metadata (as applicable):
Unique, Step, Min/Max values/length,
locale, character set, image ht/wd, etc.
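For the Sample Data Generation row, the tips about distributions and edge cases can be sketched with the standard library alone. The function names, the default 0 to 100 range, and the distribution parameters are assumptions chosen for illustration; tools such as Mockaroo expose similar choices through their own interfaces.

```python
import random


def sample_metric(distribution, n, rng, low=0.0, high=100.0):
    """Draw n values from a named distribution, clamped to [low, high]."""
    mid = (low + high) / 2
    draw = {
        "uniform": lambda: rng.uniform(low, high),
        "normal": lambda: rng.gauss(mid, (high - low) / 6),
        "exponential": lambda: low + rng.expovariate(4 / (high - low)),
    }[distribution]
    return [min(high, max(low, draw())) for _ in range(n)]


def with_edge_cases(values, low=0.0, high=100.0):
    """Append boundary values so tests always exercise the extremes."""
    return values + [low, high]


rng = random.Random(42)  # fixed seed keeps generated test data reproducible
views = with_edge_cases(sample_metric("exponential", 1000, rng))
```

Seeding the generator is a deliberate choice here: meaningful test data is most useful when a failing test can be rerun against exactly the same rows.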
46
Types Topic : Components For Benefits >> Do This Cost-Ben
(1-5 Hi : 1-5 Hi)
Depends Remarks
Tech
AI/ML
ML Outcomes : Source Data Gaps &
Anomalies
Improve Outcome Accuracy & Minimize ML Iterations
>> Track Imputed or Questionable Data. For example,
add Source_TBD column to indicate:
• Imputed Gaps: Fill nulls, e.g. via interpolation, etc.
• Low Credibility: Identify weakness at source vs algorithm.
2 : 3 • BEST PRACTICE TIP: Add Lifecycle
Checklist item to cross-check the
Source_TBD column with Validation
and/or Test data sets.
Tech
AI/ML
ML Outcomes : Validation & Test Data Improve Outcome Accuracy
>> Incl Separate Validation and Test Data Sets:
• For Validation Data: Hold out from Train data.
• For Test Data: Completely separate – No “Peeking”.
2-3 : 4 Test Set should be separate and
large if available.
Two popular alternatives:
• Bootstrapping (click1, click2).
• Cross-Validation (e.g. K-Fold
resampling aka shell game).
BEST PRACTICE TIPS:
• Validation Data – use to tune params.
• Test Data – use to assess performance
(outcomes).
Mgt
Resources
Unplanned Work or Expenses :
Budget, Roadmap, Sprints, Project
Plans
Minimize Disruption to Budget & Schedule
>> Track “Outplan” for items not in Orig Budget,
Roadmap, Epic, Sprint, or Project Plan:
• Create Positive Agile Culture for Unplanned/Popup Tasks:
Tag on Roadmap as #Outplan (or use symbol). Also show any
impacted items, e.g. can indicate an item sliding to next
month.
• Maintain Schedule/Priority Control and Team Morale:
Easy to see when we’re in a Tunnel… and where the Light is.
Easy to justify schedule adjustments to incorporate high
value wins. But see #CAUTION in Remarks column.
2 : 5 • Introduce to corporate
vocabulary.
• Socialize with stakeholders
(and your boss).
• BEST PRACTICE TIP:
Track Outplan to justify:
o Hire New Staff
o Adjust Business Processes
• #CAUTION: Always try to plan pop-up
tasks for future, not outplan. Outplan is
always an exception (part of MBE plan).
• Only for tasks > 1 Day (not for
ad-hoc “support” tasks).
Mgt
Staffing
Work Force : Flex Plan Stay on Budget & Schedule
>> Create FLEX Plan, Not Just a Contingency Fund:
• Enlist As-Needed “Hot Standby” SMEs who need minimal
(sometimes 0) weekly work guarantee, but establish and
maintain knowledge about your biz and tech environment.
• Cross-Train / Semi-Matrix Dept Resources in Large
Enterprises: Achieve economy of scale -- sustainably.
2 : 5 • Need strong schmoozing and
negotiation skills.
Best Practices for High ROI Impact – Part 4 (DRAFT: This Slide Currently Requires Walk-Thru)
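The Source_TBD idea from the first ML Outcomes row can be sketched as below: gaps are filled by linear interpolation and a parallel flag records which values were imputed. The function name is hypothetical, and linear interpolation is just one of the fill methods the slide alludes to; leading and trailing gaps are left unfilled but still flagged.

```python
def impute_with_flag(values):
    """Fill interior None gaps by linear interpolation and flag them.

    Returns (filled, source_tbd) where source_tbd[i] is True for every
    value that was missing at the source rather than observed.
    """
    filled = list(values)
    source_tbd = [v is None for v in values]
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        span = right - left
        for i in range(left + 1, right):
            weight = (i - left) / span
            filled[i] = values[left] + weight * (values[right] - values[left])
    return filled, source_tbd
```

Keeping the flag next to the data lets the later checklist step compare model errors on flagged versus observed rows, so a weakness at the source is not mistaken for a weakness in the algorithm.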
47
Up Next
DP
Fairy Tale Wrap-Up . . .
48
Everyone is Happy . . . Life is Perfect Now . . .
49
. . . Or is it ? . . .
50
. . . Hmmm . . .
51
. . . Success is Not Usually Perfection . . .
Excellence,
Excellence +
Planning =
DP
52
Thanks to Dora the Explorer ®
53
Wrap-Up
THANKS!
✓ YOU !!!
✓ Our Sponsors
✓ Data – for being such a Lovable Thing
(But Information, Knowledge, and Results are even Better!)
54
Wrap-Up
Make it a Great Day :)
55
DP
You May PASS GO and Collect $2000
Jeff Bertman
• www.DigiMarconWest.com/speakers
• www.LinkedIn.com/in/JeffBertman
• Jeff.Bertman@DfuseTech.com
• Jeff.etalk123@gmail.com
• Skype: JeffB.epoch
(Slack, WhatsApp, etc upon request)
• Mobile/Text: 818-321-3111
NEXT UP
CONTACT INFO
Maximizing Big Data ROI via Best of Breed Technology Patterns and Practices - Jeff Bertman, Warner Bros.


  • 1. Jeff Bertman CHIEF DATA ENGINEER, WARNER BROS. LOS ANGELES, CA ~ JUNE 12 - 13, 2019 | DIGIMARCONWEST.COM #DigiMarConWest Maximizing Big Data ROI via Best of Breed Technology Patterns and Practices KEYNOTE
  • 2. Maximizing Big Data ROI via Best of Breed Patterns & Practices CTO and Lead Data Scientist/Engineer Dfuse Technologies (formerly with Warner Bros Digital Networks) Jeff Bertman Click1: DigiMarCon Main Site #NOTES to Audience: (1) Caution: This deck contains some Hollywood “GLITZ” that could be harmful to your Boringzola. Please be prepared to Smile ☺ (2) Much of the focus is on Data Engineering but from a Business Value perspective. We will quickly zip through certain slides -- spending just enough time for context. (3) Thanks for the honor to Serve the DigiMarCon West Community! Click2: DigiMarCon Speaker Page (Scroll to “Bertman”) Also: - www.LinkedIn.com/in/JeffBertman - Jeff.Bertman@DfuseTech.com - Mobile +1 818-321-3111 - More contact info at end of deck Click3: Dfuse Technologies Main Site www.DigiMarConWest.com
  • 3. Speaker Highlights Jeffrey Bertman: “Uptight Easterner” ☺ ===== Chief Data Engineer, CTO, Lead Data Scientist/Engineer, Bla Bla Bla ☺ Data Geek ☺
  • 4. SHOW BUSINESS WARNING: This deck contains more GLITZ than usual (for me) ☺ And there are layers on several slides. For Best Viewing, DOWNLOAD the PowerPoint Show File (vs viewing online). Thanks!
  • 5. Contents Summary (# | Topic | Slide | Remarks)
    1) Speaker Highlights | 2 |
    2) Overview (incl 1 Brief Slide about Machinima / Warner Bros) | 4 | This TOC followed by general context… and an Intro to our tour guide and related characters ☺
    3) Business & Technical Landscapes --- ROI Defined --- | 10 |
    4) ROI Conducive Technologies and Architectures for Big Data | 17 | Highlights: • introduction to FiTL (Fitness Technology Landscape) and “Price Shifting” which affects TCO >>> ROI, etc • The TP3 Principle • Polyglot Jazz (DI Graphical Tools)
    5) ROI Best Practices for Big Data, Etc | 41 | Abbreviated List due to time limit. Much larger list avail upon request.
    6) Fairy Tale Wrap-up and Closing Thoughts | 46 |
    7) Q & A / Contact Info | 52 | Feel free to reach out for discussion or future presentation versions (whitepaper in progress)
    Friendly Warning: Presentations from Entertainment Companies may Contain some Jazz !
  • 6. Introducing Our Tour Guides: DP aka Data Pigeon, LOP aka Lack of Planning Mule DP DP Methodology DP Helper 2 Helper 1 From Here to There LOP * No Orwellian Connotation Slide 6
  • 7. Introducing Our Tour Guides: DP aka Data Pigeon, LOP aka Lack of Planning Mule. But even the best plans... …with lack of follow-through... (Figure labels: DP, LOP)
  • 9. Introducing Our Tour Guides: DP aka Data Pigeon, KIO1 aka LOTS of Planning Pegasus… or at least “ENUF” Agile Planning ☺ Opening the door to success requires both planning and follow-through… …to tame LOP the mule into LOP2 ! (Figure labels: DP, LOP2, SGA = Successful Geeks Assoc.)
  • 10. Everything we do is in the… ✓ Thousands of content creators (aka talent partners) ✓ Millions of videos on numerous platforms ✓ Billions of aggregate views / month Expanding Footprint Within & Beyond WB requires even Greater Scale ✓ BI/Data supports Mach OTT, Other WB Divisions, External Companies ✓ Distribution supports other WB Initiatives Native Digital Entertainment Business that Leverages Technology to Meet and Exceed KPIs & Operational Goals -- and to Scale Cost-Effectively Recent Years Millions… BILLIONS+ Thousands… ✓ Cornerstone Technologies (Big Data focus) #DISCLAIMER: This is the only Machinima/WB slide (and it is non-proprietary).
  • 11. LEVERAGE TECHNOLOGY Architecture, Engineering, Methods, Libraries, CM, QA, Security, SysOps, DevOps DATA >> INFO >> KNOWLEDGE >> ACTION Improve BIZ (Revenue, Profit, Market Share, Etc) Always Grow BIZ Value –– Data Intelligence BEST PRACTICES & ~SLAs Continual Improvement, Serviceability, Reliability, Performance, Governance SERVICE ORIENTED Mindset Driven By Clear Mission, Values, Goals & Priorities: Cost-Benefit + “Everyone is a Customer” Approach Be the Solution, Be the Boss, Value Each Other, A-Team, Executional Excellence, … Sample Data Management Touch Points & Basic Approach
  • 12. BIZ VALUE. Increase Revenue, Profit, Market Share, Etc Low Level Processes Get Stuff, Do Stuff, Put Stuff, Etc Raw Data Google, Youtube, Facebook, Twitter, Twitch, Amazon, Salesforce, Mach Console, ETC Information Technology Data Engineering Data ►► Information Software Engineering Tech ►► Biz Tools Product Management Actualization User Applications, Visualization Tools (BI, Reporting, Analytics, Discovery), Etc Value “Scape”: ROI is Impacted by All Levels
  • 13. Data Engineering Landscape --- Acquisition, Curation & Dissemination. Convert Data into INFORMATION to Help Drive Cost-Benefit / KPIs for Business Units and The Enterprise. RAW DATA from (Platforms, Apps, Etc): • Google • YouTube • Facebook, Instagram • Twitter • Twitch • Amazon • Pluto TV • Clickstream • Finance/Accounting • Salesforce • Sensors • Logs • ETC. DISTRIBUTE to Customers, Et Al: • BI + Data Managers (incl Self-Service) • Marketing • Sales / BD • Finance • Accounting • Operations • Security & Compliance • ETC (See Pillars slide). Data Engineering (DE) Is “Under the Hood”: ✓ Data ►► INFORMATION: Consolidate & Transform Raw Data into Data Warehouse (DW) ✓ Operational ENGINES: Payment Processing for Talent, Directors, Recruiters, etc. Downstream Feeds, Exports / Reports; Dashboards, Rpts
  • 14. Data Engineering & Business Intelligence Lifecycle GCR Business Goals Concept Rqmts Definition Vision & Goals Management Tech Architecture (incl Data Services / Ops) Business / Logical Products Select & Install Tech / Physical Integration Testing (Func+Stress) Production Deployment Business Intelligence (BI) Design / Definition Prep Deployment (incl Support Spec) User Training (incl End User Info) Business Intelligence (BI) Development (Focus on BIZ Meta Layer) Data Integration (ETL, MDM, Metadata, Etc) Explore & Design Unit Testing Integration Deliver Business Value Increments via “Frequent Little BITIs” (Agile Coordination + Waterfall As-Needed) Integration Test Planning Data Profiles & Maps Post Reflect, Maint, Improve Pre Reflect, Strategy, Scope, Impact Implement & Support Texture + Smooth: BITI Cycle (Build, Integrate, Test, Improve) Inception & Definition >>> Today’s Objective <<<
  • 15. ROI Defined (Return on Investment) – The Basics
    ROI in Technical Environments: The (preferably measurable) successful Business outcomes generated by leveraging Technology to increase financial KPIs, e.g. revenue, profit, etc.
    • “Leverage” implies “$Spend”:
      o For Product, Licensing, Labor, Infrastructure, Transition, Etc
      o “Break Even” is a common focus: when measured benefits equal initial investment.
      o Keep ongoing costs in mind. Ongoing benefits must always exceed TCO to declare “success”.
    • Even Open Source / Freemium products have Costs, for example:
      o Compute Nodes they run on – vs managed service which is inclusive but still $spend (e.g. Python, Node.js, Scala, Etc on EC2/VM vs AWS Lambda or Azure Functions, Google App Engine/Cloud Functions)
      o Labor might be more than paid product or managed service (e.g. Kafka vs AWS Kinesis – various pros and cons largely focused on labor and performance)
    • ROI Obvious Factors: Availability, Maintainability, Reliability/Accuracy, Functionality, Performance, Security
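The break-even and ongoing-cost points on slide 15 can be put into numbers. The Python sketch below is an illustration only: the `Investment` fields, the monthly granularity, and the sample figures are assumptions made for the example, not values from the deck. It treats break-even as the first month in which cumulative benefit covers the initial spend plus the accumulated ongoing cost, which is slightly stricter than the slide's "benefits equal initial investment" wording.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class Investment:
    # All three fields are hypothetical inputs chosen for this illustration.
    initial_cost: float     # one-time spend: product, licensing, transition
    monthly_cost: float     # ongoing TCO: infrastructure, labor, support
    monthly_benefit: float  # measured business benefit per month


def roi(inv: Investment, months: int) -> float:
    """ROI over the period, computed as (total benefit - total cost) / total cost."""
    total_cost = inv.initial_cost + inv.monthly_cost * months
    if total_cost <= 0:
        raise ValueError("total cost must be positive to compute ROI")
    total_benefit = inv.monthly_benefit * months
    return (total_benefit - total_cost) / total_cost


def break_even_month(inv: Investment) -> Optional[int]:
    """First whole month where cumulative benefit covers cumulative cost.

    Returns None when the monthly benefit never exceeds the monthly cost,
    i.e. the investment cannot pay back no matter how long it runs.
    """
    net_per_month = inv.monthly_benefit - inv.monthly_cost
    if net_per_month <= 0:
        return None
    return math.ceil(inv.initial_cost / net_per_month)


if __name__ == "__main__":
    example = Investment(initial_cost=120_000, monthly_cost=20_000, monthly_benefit=30_000)
    print(break_even_month(example))   # 12
    print(f"{roi(example, 24):.0%}")   # 20%
```

The `None` branch mirrors the slide's warning about ongoing costs: if monthly benefit does not exceed monthly TCO, no amount of time produces a return.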
  • 16. ROI Defined (Return on Investment) – Special Challenges
    ROI ALSO includes “Lateral Spend”
    • CAUTION – Watch for Hidden or Shifted Costs: … sometimes feels like “Collateral Damage”
      o Example 1 – Big Data on Serverless Architecture: AWS Lambda / Google Cloud Functions 9-15 Minute Time Limit per Execution. If used for certain Data Integrations, you can invest a lot of $$ time and effort working around the time limit via special chunking or ~recursive calling mechanisms which are more complex than they need to be.
      o Example 2 – Streaming TV & Movie Platforms for Cord Cutters ☺: Free and Low Cost Options e.g. Pluto TV, Hulu Live, DirecTV NOW, YouTube TV, Sling TV, Apple TV, VUE, Watch TV, … If run simultaneously on multiple devices in same home, along with gaming, you’ll need to add $$ to Internet Service.
    • Consider Full Functional Scope before settling on a Single Technology
    • Best to support Multiple Patterns, Sometimes via Multiple Technologies. But keep “official” list “minimal” for each product or platform type.
    • Heads-up on New but Old Term: “Poly#!#” – more on this later
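To make the "special chunking" workaround from Example 1 on slide 16 concrete, here is a minimal Python sketch of a time-budgeted, resumable loop. It is a generic illustration and does not call any cloud provider API: `process_with_budget`, `run_to_completion`, the cursor convention, and the injectable `clock` are all hypothetical names introduced here. In a real serverless setup the returned cursor would have to be persisted somewhere and a follow-up invocation triggered, which is exactly the extra complexity the slide warns about.

```python
import time
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def process_with_budget(
    items: Sequence[T],
    start: int,
    handle: Callable[[T], None],
    budget_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[int]:
    """Process items[start:] until the time budget is used up.

    Returns the index to resume from, or None once every item is done.
    At least one item is handled per call, so repeated calls always progress.
    """
    deadline = clock() + budget_seconds
    i = start
    while i < len(items):
        handle(items[i])
        i += 1
        # Stop early only if work remains and the budget is exhausted.
        if i < len(items) and clock() >= deadline:
            return i
    return None


def run_to_completion(
    items: Sequence[T],
    handle: Callable[[T], None],
    budget_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Simulate the chain of invocations locally and return how many were needed."""
    invocations = 0
    cursor: Optional[int] = 0
    while cursor is not None:
        invocations += 1
        cursor = process_with_budget(items, cursor, handle, budget_seconds, clock)
    return invocations
```

Even this stripped-down version needs a cursor, a deadline check, and an outer driver. That overhead is the "lateral spend" to weigh against a runtime without a hard execution limit.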
  • 17. Presentation Abstract – Brief Discussion Not long ago the question was whether your organization had big data. Did you have the volume, the velocity, the technology? Now those basics are largely given for most of the people attending this event. The path to success is still fuzzy, however, with so many technologies to choose from – and so many ways to use them. This presentation triangulates in a holistic manner on the modern business dilemma: how can we leverage technology to improve revenue, profit, market share, and numerous other success criteria? That said, this is not about the analytics or KPIs -- although it is about measurable improvement. It’s about lining up the right technologies and using them in effective, proven ways to maximize Return on Investment (ROI). Since the slant here is holistic, we’ll show how to blend infrastructure, tools, methods, and talent to avoid and constantly trim technical debt… and to produce success stories that are consistently repeatable, not a byproduct of individual heroics.
  • 19. ACTIONABLE INSIGHTS >> ~RAW Uber Actionable CUSTOM TURBO
  • 21. The Birth of FiTL: Fitness Technology Landscape (via Paper Providence)
  • 22. What Other People Do While Geeks Write Papers / Presentations ☺
  • 23. FiTL Price Shifting Example: DW / Data Hub
    $ Scale: 1 = Low to 5 = High ($ symbol intentionally on right of Raw$ figure)
    Main Platform Component | Product | Compute | Storage | In/Out (e.g. Hybrid Cloud) | Access Patterns (Explain) | RAW$ COUNT
    PostgreSQL CE (cloud) | 0 Open Src | $$ EC2 | $$$ EBS | ~n/a, wash | $ (As Needed) | 5$
    Redshift (cloud) | $$ | $$$ ~EC2 | $$$ EBS | ~n/a, wash | $ (As Needed) | 7$
    Snowflake (cloud) | $$$ | $$ Selectable per Session | $ S3 | ~n/a, wash | $ BUT See #CAUTION in Special Info col cuz Separate Compute layer | 6$ BUT Exp = 7$
    Oracle Exadata (on-prem) | $$$$$ | $$$ Data Ctr + Extra Staff | $$$ Appliance | #TBD | $ (As Needed) | 11$
    Special Info / Discussion Points:
    ● PostgreSQL CE (cloud): #PRO: Multi-Model for Trxs (+ ~Analytics) ● #CON: Minimal Scale-Out except for e.g. Citus DB or $$Product GC PG, Azure PG ● #CON: Slow evolution of Analytics infra
    ● Redshift (cloud): #PRO: Spectrum to connect Data Lake ● #CON: Main focus is DW Not Transactional ● #CAUTION: Node distribution / access limitations. Discuss Mitigation Patterns for diverse data access patterns on same table. Tie to Aggregate Awareness, etc.
    ● Snowflake (cloud): #PRO: Unique Data Sharing ● #CON: Main focus is DW Not Transactional ● #CAUTION: Separate Compute Layer is generally better BUT can cost more for some patterns. Discuss Mitigation Patterns. ● #EVAL Data Cache options, e.g. Tableau TDEs or Data Virtualization w/ Denodo, etc.
    ● Oracle Exadata (on-prem): #PRO: Multi-Model for Trxs + Analytics ● #CON: #ECOSYSTEM is Shrinking(?) ● #SCALABILITY Options: e.g. negotiate for dormant CPUs etc ● CapEx is vanilla for On-Prem, but ~wash
    Guidelines + Future Improvements:
    • TCO Factoring: Incorporated in each category, e.g. see Compute column for Oracle
    • Conventional Evaluation Factors can be Added: Performance, Scalability, Maintainability, Functionality, Security, Vendor Viability / Ecosystem, etc
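A FiTL-style comparison like the one on slide 23 can be tallied mechanically once each cost category has a score. The Python sketch below is a hedged illustration: the category keys, the plain unweighted sum, and the three `platform_*` entries are assumptions made for this example. The slide does not spell out exactly how its RAW$ COUNT figures are derived, so the numbers here are hypothetical and are not meant to reproduce the slide's totals.

```python
from typing import Dict, List, Tuple

# Cost categories loosely following the slide's column headings (names are assumed).
CATEGORIES = ("product", "compute", "storage", "in_out", "access_patterns")


def raw_count(scores: Dict[str, int]) -> int:
    """Sum the per-category scores (0 = free/wash, 1 = low ... 5 = high)."""
    for category in CATEGORIES:
        if category not in scores:
            raise ValueError(f"missing score for category: {category}")
        if not 0 <= scores[category] <= 5:
            raise ValueError(f"score out of range for category: {category}")
    return sum(scores[category] for category in CATEGORIES)


def rank_by_raw_count(platforms: Dict[str, Dict[str, int]]) -> List[Tuple[str, int]]:
    """Return (platform, raw count) pairs ordered from cheapest to most expensive."""
    totals = [(name, raw_count(scores)) for name, scores in platforms.items()]
    return sorted(totals, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Hypothetical platforms and scores, for illustration only.
    candidates = {
        "platform_a": {"product": 1, "compute": 2, "storage": 2, "in_out": 0, "access_patterns": 1},
        "platform_b": {"product": 4, "compute": 3, "storage": 3, "in_out": 1, "access_patterns": 1},
        "platform_c": {"product": 2, "compute": 1, "storage": 1, "in_out": 0, "access_patterns": 0},
    }
    for name, total in rank_by_raw_count(candidates):
        print(f"{name}: {total}$")
```

Keeping the categories separate until the final sum is what exposes "price shifting": a product that is cheap in the Product column can still rank poorly once Compute, Storage, or access-pattern costs are counted.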
  • 24. 24 Reality Check: Examine Wide Use Cases -- This Example Happens to be Data Platform
    • Snowflake’s Sweet Spot is Data Warehousing / Analytics, Not Transactional / Operational Activity (although transactional performance is better than expected!)
    • Many Modern Best in Class Products have Same Issue
    • Homogenous vs Heterogeneous Technical Environments
      o Homogenous was a Dream :) (Not Sustainable due to Tech Evolution, M&As, Self-Service / Decentralization, etc)
      o Interoperability is the Reality … Usually
      o People like Simple, but Modern Times contain simplicity in each Class – “Best in Class”
    • What does Heterogeneous Mean in Today’s Modern IT Arena? … … …
    DP LOP2
  • 25. 25 Old / Rejuvenated “Modern” Term: Polyg#!# What? (Source: Google Dictionary) + glotta ‘tongue.’
    Polyglot Persistence: Using multiple data platform technologies [to address diverse use cases in a best-of-breed manner].
    Polyglot Programming: Using multiple programming languages [to address diverse use cases in a best-of-breed manner]. Domain Specific Languages (DSLs) are now standard practice for enterprise app development.
    . . . 2012+ Time to Revive! . . .
    Polyglot Engineering / Architecture: Using multiple technologies [in the same functional domain] [to address diverse use cases in a best-of-breed manner].
  • 26. 26 Potential New Principle / Postulate (#DRAFT Idea in Progress)
    TP3 Data – Technical Polyglot Propensity Principle for Data Platforms: Modern enterprises with Big Data tend to utilize Polyglot Engineering with the intention of maximizing ROI. One Technical Data Platform cannot profitably maintain a Top 3 industry popularity rank for modern big data enterprises for more than 3 years without sacrificing at least one of the following Top 3 ranks:
    • Multi-Model support for more than 3 types: e.g. Relational, Graph, Document/Text, Multi-Media, Geospatial, Key-Value (Structured, Semi/UnStructured)
    • Multi-Use-Category support for more than 3 types: e.g. Analytical, Transactional, Search, Stream
    • “Reasonably Low Pricing” given abundance of Modern, Competitive, Low Cost / Community products
    TP3 DE – Technical Polyglot Propensity Principle for Data Engineering Platforms: Similar to TP3 Data, but for Data Engineering / Integration tools and platforms.
    DP Let’s see some EXAMPLES . . .
  • 27. 27 Polyglot Data Integration: Example Architecture Pattern 1
    Example Use Case: Social Media / Video Platform – Core Data Feeds
    • Purpose & Constraints: Biz Performance & Revenue Metrics / KPIs. Minimal or No Backfill Available. Supports Data Lake Direct Access Use Cases for raw data (purple lines from earlier slide).
    • Solution Profile – Decoupled Polyglot thru Data Lake: Python for Extraction, Pentaho Data Integration (PDI) for Load/Ingest/Transform. High Resilience.
    • Fitness Highlights:
      o Python: $0 Open Source, Thin (low resource), Highly Available, e.g. can be Serverless or EC2.
      o Pentaho PDI: $0 Community Edition (or EE see below), Graphic Workflows with Standard Transformations. Low Maintenance Work Share (extremely easy for cross-training even with complex pipelines). Option to Expand to Enterprise Edition ($) for better HA infra, monitoring, repo, scheduling, support, etc.
    Diagram labels: Extract via Python (e.g. REST API) | Ideal for: Facebook (core), Instagram, Twitch Live Streams, Etc | Optional for: YouTube, Amazon, Etc | Load/Ingest/Transform via Pentaho PDI | DW | Async | e.g. CSV, JSON, Parquet | Near-Real-Time, Hourly, Daily, Etc | Cache | Lake
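The decoupled extract step of Pattern 1 could look roughly like the sketch below. It is only an illustration under stated assumptions: `fetch_page` is a hypothetical stand-in for a real REST client, a local temp folder stands in for the Data Lake, and the folder layout and file naming are invented for the example rather than taken from the deck.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def extract_to_lake(fetch_page, lake_root, feed_name, run_ts=None):
    """Pull every page from a source and land each one as a raw JSON file.

    fetch_page(page_number) is a hypothetical callable that returns a list
    of records, or an empty list when there is nothing left to read.
    """
    run_ts = run_ts or datetime.now(timezone.utc)
    # Partition raw files by feed and load date so a separate loader
    # (e.g. PDI) can pick them up asynchronously and raw data stays auditable.
    target_dir = Path(lake_root) / feed_name / run_ts.strftime("%Y-%m-%d")
    target_dir.mkdir(parents=True, exist_ok=True)

    written = []
    page = 1
    while True:
        records = fetch_page(page)
        if not records:  # stop on content, not on a fixed page count
            break
        out_file = target_dir / f"{feed_name}_{run_ts:%H%M%S}_p{page:04d}.json"
        out_file.write_text(json.dumps(records), encoding="utf-8")
        written.append(out_file)
        page += 1
    return written


# Stand-in for a REST API: two pages of data, then nothing.
FAKE_PAGES = {
    1: [{"video_id": "a", "views": 10}],
    2: [{"video_id": "b", "views": 7}],
}
lake = tempfile.mkdtemp()
files = extract_to_lake(lambda p: FAKE_PAGES.get(p, []), lake, "core_feed")
print(len(files), "raw files landed in", lake)
```

Because the extractor only writes files, the DW can be down without losing data; the loader simply catches up from the lake later.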
  • 28. 28 Polyglot Data Integration: Example Architecture Pattern 2
    Example Use Case: Social Media / Video Platform – Extension Data Feeds
    • Purpose & Constraints: Biz Performance & Revenue Metrics / KPIs. Backfill is Available. Extract Runtime is Short / Low Impact on Runbook Dependencies. Ok to Not have Data Lake raw data. Extensions Data Density is High, e.g. would create thousands-millions of files per day (discuss).
    • Solution Profile – Homogenous ~Stream Direct to DW / Data Hub: Python if Lite Transformations (subject to DI Library selection, see next slide), Pentaho PDI if Heavy.
    • Fitness Highlights:
      o Python: $0 Open Source, Thin (low resource), Highly Available, e.g. can be Serverless or EC2.
      o Pentaho PDI: $0 Community Edition (or EE see below), Graphic Workflows with Standard Transformations, Low Maintenance Work Share (extremely easy for cross-training even with complex pipelines). Option to Expand to Enterprise Edition ($) for better HA infra, monitoring, repo, scheduling, support, etc.
    #Caution: Must Backfill after Planned/Unplanned DW Outage. No Data Lake Raw Files for Audit or Direct Use Cases.
    Diagram labels: Extract + Load/Ingest/Transform via #TBD (See Solution Profile above) | Ideal for: FB Graph API Extensions, YouTube Bulk API, Etc | Optional for: Salesforce, SAP, Etc | DW | Merge | Hourly, Daily, Monthly, Etc | Cache | Lake
  • 29. 29 Some Python Data Integration / ETL Libraries
    # | Library | Remarks (as of 2018-11-25)
    1) PETL | Reasonably Popular (Last Commit Sept 2018)
    2) PygramETL | No Commits since Oct 2017
    3) Bonobo | Reasonably Popular (Last Commit Nov 2018)
    4) #TBD: Wide Open to Feedback | Roadmap Task for 2019: Evaluate and Select
    NOTES:
    • Data Analysis Libs like Pandas are not shown above. Full Data Integration / ETL / ELT is not their objective.
    • Custom Development is also quite popular – withOUT starting with a canned 3rd party library. But you should develop a lib (or at least a collection of templates) within your company for standardization, productivity, etc.
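For the custom-development route in the notes above, an in-house template might start as small as the sketch below. It uses only the standard library; `run_pipeline`, the step functions, and the sample CSV are all hypothetical names and data invented for illustration, not part of any library listed on the slide.

```python
import csv
import io


def run_pipeline(extract, transforms, load):
    """Run extract -> transforms -> load and return simple run metrics."""
    rows_in = 0
    rows_out = 0

    def counted_source():
        nonlocal rows_in
        for row in extract():
            rows_in += 1
            yield row

    # Each transform takes an iterable of rows and returns an iterable,
    # so steps chain lazily without holding the whole feed in memory.
    stream = counted_source()
    for transform in transforms:
        stream = transform(stream)

    for row in stream:
        load(row)
        rows_out += 1
    return {"rows_in": rows_in, "rows_out": rows_out}


RAW_CSV = "channel,views\nalpha,120\nbeta,\ngamma,45\n"


def extract_csv():
    yield from csv.DictReader(io.StringIO(RAW_CSV))


def drop_missing_views(rows):
    return (row for row in rows if row["views"])


def views_to_int(rows):
    return ({**row, "views": int(row["views"])} for row in rows)


loaded = []
metrics = run_pipeline(extract_csv, [drop_missing_views, views_to_int], loaded.append)
print(metrics)
print(loaded)
```

Keeping every step behind the same iterable-in, iterable-out shape is one way to get the standardization the note asks for: new feeds reuse the runner and only supply their own steps.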
  • 31. 31 Polyglot DI: Why Graphic Workflows? –– Example 1 Hard Error End Hard Error End Easy to See Green Flow = GOOD, Red Flow = BAD
  • 32. 32 Polyglot DI: Why Graphic Workflows? –– Example 2 Soft Error End (Do Nothing) Easy to See Negative Logic ANTI-PATTERN
  • 33. 33 Polyglot DI: Why Graphic Workflows? –– Example 3 Error End Easy to Understand Semi-Complex Flow More Complex is Also Welcome
  • 34. 34 Polyglot DI: Why Graphic Workflows? –– Example 4 Error End Soft Error End (Do Nothing) Easy to Add Exception Handler
  • 35. 35 CONCURRENT File Processing Chains are Easy to Create Logging Window Easily Explains Broken Step Above (Can also do in IDE for Python, etc) Polyglot DI: Why Graphic Workflows? –– Example 5
  • 36. 36 Polyglot DI: Why Graphic Workflows? –– Example 6 This Time it Worked. See all the Green Checkmarks :) Automatically Gathers METRICS Etc for Each Job Renamed Steps to be Meaningful
  • 37. 37 Polyglot DI: Why Graphic Workflows? –– Example 7 Error End Standard Job Types / Steps (Menu Options)
  • 38. 38 Polyglot DI: Why Graphic Workflows? –– Example 8 Error End Big Data Built-in Transformations (Source: Pentaho PDI Manual)
  • 39. 39 Polyglot DI: Why Visual Workflows? –– Example 9 Error End Standard Input Methods (Menu Options)
  • 40. 40 Polyglot DI: Why Graphic Workflows? –– Example 10 Standard Output Methods (Menu Options)
  • 41. 41 Polyglot DI: Why Graphic Workflows? –– Example 11 Standard Transform Methods (Menu Options) Extend to Custom Python, JavaScript, Bash, Etc
  • 42. 42 Up Next: ROI Best Practices
    ✓ Data Engineering ✓ Analytics (incl AI/ML) ✓ Design & Development ✓ DevOps, SysOps ✓ Data Governance ✓ Collaboration ✓ Management ✓ Etc
    FUTURE IMPROVEMENTS for this Presentation:
    • Reduce Slide Density – Especially the next few slides, which could also use some diagrams.
    • Many More Best Practices already documented – Adding to presentation after settling on a better format.
    • Contact Info at end of this deck – Feel free to reach out for discussion or future versions.
  • 43. 43 Best Practices for High ROI Impact – Part 1 (DRAFT: This Slide Currently Requires Walk-Thru)
    Columns: Types | Topic : Components | For Benefits >> Do This | Cost-Ben (1-5 Hi : 1-5 Hi) | Depends | Remarks
    ● Tech, Data Eng | Data Integration / ETL Arch Patterns : Decoupled vs Homogenous
      For Benefits: Optimize Performance, Scalability, Reliability, Etc
      Do This: Have Canned Patterns Ready for Various Scenarios
      Cost-Ben: 2 : 5
      Depends / Remarks:
      • Experience + Research.
      BEST PRACTICE TIPS:
      • See Architecture Patterns earlier in this presentation (e.g. decoupled vs homogenous).
    ● Tech, Data Eng | External Resilience Patterns : Data Extraction Programs
      For Benefits: Improve Data Availability & Reliability
      Do This: Use Intelligent Logic (parsing etc) vs Arbitrary or Hardcoded Logic to Dynamically Accommodate Changes in Source Data Patterns. Examples:
      • Skip over CSV Header Rows – use parse vs row count.
      • Ignore extra JSON pages via content scan vs page count.
      Cost-Ben: 2 : 4
      Depends / Remarks: ROI falls when distracted by “fires”, especially when preventable.
      BEST PRACTICE TIPS:
      • “Results are Better Than Excuses” culture.
      • Problem Patterns are often ~just as important as Solution Patterns. Line them up a la Tech Arsenal.
    ● Tech, General | Tech Debt : Keyword Tagging in Code/Docs
      For Benefits: Minimize Tech Debt and Facilitate Follow-up
      Do This: Tag your Code and Design Docs. Examples:
      • Code: #CHANGED, #PERF, #SCALE, #HARDCODED, #MODULARIZE, #WORKAROUND, #TODO, #FUTURE
      • Docs: #KBANK, #OUTPLAN, #SPEC, #QA, #RISK, #TBD, #TODO, #FUTURE, #MGT, #TECH
      • README File for every main module – incl “Future Enhancements” section in addition to any tickets/cards.
      Cost-Ben: 1 : 5
      Depends / Remarks:
      • Context dependent. If Mgt doc, separate #TECH sidenotes. Or vice versa if Tech doc, and put Mgt Nutshell / Next Steps at top since Mgt does not focus on Tech.
      • Tech Debt is frequently caused by overlooking syndromes: “we’ll do it later”, “slipping between the cracks”.
    ● Tech, Monitor | Simple Automation, Reliability, and Maintainability : Keyword Tagging in Logs
      For Benefits: Improve Reliability & Maintainability/Operations, Avoid False Positives
      Do This: Tag Log File Msgs for Action or Review. Examples:
      • #ERROR: bla bla (+ Optional Subtags #RETRY or #FATAL)
      • #WARNING: bla bla
      • #INFO: bla bla bla
      Cost-Ben: 1 : 4
      Depends / Remarks:
      • Couple with Monitoring Tools/Scripts.
      • Avoid False Positives, e.g. “error table”, “solution for whatever error”, etc.
    ● Tech, SysOps | Deployment : Hot Patches
      For Benefits: Improve Reliability & Maintainability/Operations
      Do This: Always Create “cm_retro” hot patches folder with:
      • Changed Files.
      • README for changes to other tiers (e.g. data tier).
      Cost-Ben: 1 : 4
      Depends / Remarks:
      • Couple with Git if available – sometimes changes occur in infra/commercial product config files, etc which might not be CMed.
      • Useful even when using Git, etc.
      • Consider using Git for all configurables. Opens deeper issues, e.g.:
        o Separating config files from other bin.
        o Maintaining security of credential files etc.
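The log-tagging practice on this slide can be sketched as a small scanner. The tag names (#ERROR, #WARNING, #INFO, #RETRY, #FATAL) come from the slide; the function names, the sample log lines, and the assumption that the tag sits at the very start of each message are illustrative choices, not the author's implementation.

```python
import re

# Tags must start the message, so phrases such as "error table" or
# "solution for whatever error" are never counted as errors.
TAG_PATTERN = re.compile(r"^#(ERROR|WARNING|INFO):\s*(.*)$")
SUBTAG_PATTERN = re.compile(r"#(RETRY|FATAL)\b")


def classify_log_line(line):
    """Return (tag, subtag, message) for a tagged line, else None."""
    match = TAG_PATTERN.match(line.strip())
    if match is None:
        return None
    tag, message = match.groups()
    subtag_match = SUBTAG_PATTERN.search(message) if tag == "ERROR" else None
    subtag = subtag_match.group(1) if subtag_match else None
    return tag, subtag, message


def lines_needing_action(lines):
    """Keep only #ERROR lines, with #FATAL entries sorted to the front."""
    errors = [c for c in map(classify_log_line, lines) if c and c[0] == "ERROR"]
    return sorted(errors, key=lambda c: c[1] != "FATAL")


sample_log = [
    "#INFO: loaded 120 rows into error table",
    "#ERROR: #RETRY source API timed out",
    "Applied solution for whatever error came up",
    "#WARNING: file arrived late",
    "#ERROR: #FATAL credentials rejected",
]
actionable = lines_needing_action(sample_log)
for tag, subtag, message in actionable:
    print(tag, subtag, message)
```

A plain substring search for "error" would flag three of the five sample lines; anchoring on the tag flags only the two real errors.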
  • 44. 44 Best Practices for High ROI Impact – Part 2 (DRAFT: This Slide Currently Requires Walk-Thru)
    Columns: Types | Topic : Components | For Benefits >> Do This | Cost-Ben (1-5 Hi : 1-5 Hi) | Depends | Remarks
    ● Tech, SysOps | Hyper-Automation, Reliability, and Maintainability : File Tagging
      For Benefits: Simplify Many Critical Actions & Monitoring
      Do This: Tag Filenames or Accounts for Monitoring, Tracking, etc. Examples (add to filenames):
      • Sensitive Files use “_stv” suffix: No-Brainer 1-line expression (e.g. filename like “*_stv*.*”) to track all files containing secure credentials etc. Facilitates auto security hardening and monitoring. No need to depend on a high-maintenance list which will grow out of sync.
      • Service Accts use “svc_” prefix: Similar to “_stv” but for email accounts, etc.
      Cost-Ben: 1 : 4
      Depends / Remarks:
      • Great for Security Audits: Auditors like when justified exceptions can be simplified. Service Accounts in some high security environments are allowed only by exception, even though they are obviously needed and much better than using the name of a real person who will eventually move on.
      • SAFETY TIP: Don’t make it too obvious for Hackers. For example, “_stv” is good enough to understand without saying “look here for ‘sensitive’ files!”.
    ● Tech, Data Mgt | Hidden Tech Debt : Multi-Tenancy
      For Benefits: Prevent Hidden Tech Debt from Plunging Productivity
      Do This: Always have Checklist to Consider Multi-Tenant Support, e.g. Biz Unit (BU). Examples – Add Biz Unit to:
      • Data Sets/Tables
      • Data Integration Infra (folder trees).
      Cost-Ben: 3 : 2 to 5 (It Depends)
      Depends / Remarks:
      CONSIDER FACTORS:
      • M&A (of course).
      • Reorgs / Dept moves, splits, etc.
      BEST PRACTICE TIP (tangent topic): Usually Name things after BEHAVIOR, not Volatile Infra such as Biz/Dept Name.
    ● Mgt, Collab | Naming Conventions : Biz & Tech Vocabularies
      For Benefits: Increase Productivity, Reliability, & Morale; Preclude Communication Mayhem
      Do This: Always Have a Name (a la Jim Croce, song artist). Examples:
      • Data Schemas: stage, base, presents.
      • Data Sets: Pilot should have name such as “main” or “core” since something else ~always follows. Otherwise you’ll forever be saying “the set without a name”.
      • Disk Folders:
      • ETC
      Cost-Ben: 2 : 4
      Depends / Remarks: ROI falls when distracted by “fires”, especially when preventable.
      BEST PRACTICE TIPS:
      • Devise a simple term whenever you find yourself repeating the same phrase or word bunch.
      • Socialize everywhere across Biz and Tech, e.g. Roadmap, Specs, etc.
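The file-tagging idea from this slide ("_stv" suffix, "svc_" prefix, one-line match) can be sketched with the standard library's `fnmatch`. The pattern and prefix are quoted from the slide; the helper names and the throwaway folder tree are assumptions made for the example.

```python
import fnmatch
import tempfile
from pathlib import Path

SENSITIVE_PATTERN = "*_stv*.*"  # pattern quoted on the slide
SERVICE_PREFIX = "svc_"         # prefix quoted on the slide


def find_sensitive_files(root):
    """Return every file under root whose name carries the _stv tag."""
    return [
        path for path in Path(root).rglob("*")
        if path.is_file() and fnmatch.fnmatch(path.name, SENSITIVE_PATTERN)
    ]


def is_service_account(account_name):
    """True when an account name follows the svc_ prefix convention."""
    return account_name.lower().startswith(SERVICE_PREFIX)


# Build a throwaway folder tree to show the one-line match in action.
root = Path(tempfile.mkdtemp())
(root / "conf").mkdir()
for name in ["db_stv.ini", "conf/api_keys_stv.json", "conf/settings.ini", "readme.txt"]:
    (root / name).write_text("placeholder", encoding="utf-8")

sensitive_names = sorted(p.name for p in find_sensitive_files(root))
print(sensitive_names)
print(is_service_account("svc_etl_loader"), is_service_account("jsmith"))
```

The result of `find_sensitive_files` can feed a permissions check or an audit report directly, with no separately maintained list of sensitive paths.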
  • 45. 45 Best Practices for High ROI Impact – Part 3 (DRAFT: This Slide Currently Requires Walk-Thru)
    Columns: Types | Topic : Components | For Benefits >> Do This | Cost-Ben (1-5 Hi : 1-5 Hi) | Depends | Remarks
    ● Tech, DevOps, SysOps | Process Orchestration : DevOps/SysOps Coding
      For Benefits: Control Maintainability vs High Availability
      Do This: For each Job’s Master Script (launched by scheduler), Choose Common/Reusable or Separate Scripts:
      • For Maintainability: Use Common/Reusable.
      • For HA: Separate scripts to avoid risk of breaking something unrelated to current task.
      Cost-Ben: 1 : 3
      Depends / Remarks:
      • Evaluate Reusability vs Risk (see Best Practice Tip in Remarks column).
      • BEST PRACTICE TIP: Lean toward Maintainability if have separate Test/QA Team (which is another, fundamental best practice).
    ● Tech, Data Mgmt | Data Assets Mgmt : Data Catalog
      For Benefits: Maximize Results and Minimize Redundant Data in data stores + BI platforms
      Do This: Create & Maintain a Data Catalog (#DCAT):
      • Option 1 $$$: Combine with Data Virtualization, e.g. Denodo, Dremio, Stardog, etc.
      • Option 2 $$$: Separate product, e.g. Alation, etc.
      • Option 3 Free But Not Great (Yet):
        o Spreadsheet/Googlesheet Proven Templates available upon request.
        o #TODO: Researching open source products.
      Cost-Ben: 3 : 5 for $$$ or 2 : 3 for Free
      Depends / Remarks:
      • Be prepared to spend at least $150K for good commercial product.
      • HOT TIP: Inferred relationships across DB platforms are only practical when virtualizing across multiple data platforms. Lean toward Option 1 + “encourage” Denodo to speed up their Roadmap (like Alation feature but cross-platform).
      • #RISK of NOT Having DCAT: High Tech Debt propensity across multiple tiers, e.g. in Data Lake, DW/Hub, and BI platforms.
    ● Tech, Dev/Test | Testing : Sample Data Generation
      For Benefits: Reduce Train-Validate-Test Lifecycle
      Do This: Auto Generate Meaningful vs Random Data:
      • Mockaroo.com: $Range from free for 1K rows and slow speed to $500/yr for 10M rows and 8x speed.
      • Alternatives (generally < $600/yr/seat): RedGate SQL Data Generator, Dummi, MS Visual Studio, etc.
      • Or use Real Data if Possible (limited by InfoSec policy and data volume considerations – See “Depends” column).
      Cost-Ben: 2 : 3
      Depends / Remarks:
      • Production Data – in some environments -- can be downloaded/refreshed into Dev/Test. But still need subset data while maintaining integrity (where applicable).
      IMPORTANT TIPS:
      • For Metrics: Add distribution info (e.g. uniform, normal, normal inverse, exponential, exponential inverse, etc). Incl edge cases.
      • For Text: Create meaningful patterns, incl edge cases, e.g. via regex, etc.
      • Additional Metadata (as applicable): Unique, Step, Min/Max values/length, locale, character set, image ht/wd, etc.
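The "meaningful vs random" tips on this slide (distribution info for metrics, patterned text, explicit edge cases) can be sketched with Python's `random` module. The column names, distribution parameters, and the `vid_` ID pattern below are made up for illustration; a commercial generator would cover far more metadata.

```python
import random
import string


def generate_metric(rng, distribution, **params):
    """Draw one metric value from a named distribution."""
    if distribution == "uniform":
        return rng.uniform(params["low"], params["high"])
    if distribution == "normal":
        return rng.gauss(params["mean"], params["stddev"])
    if distribution == "exponential":
        return rng.expovariate(1.0 / params["mean"])
    raise ValueError(f"unsupported distribution: {distribution}")


def generate_video_id(rng, length=8):
    """Patterned text: 'vid_' followed by lowercase letters and digits."""
    alphabet = string.ascii_lowercase + string.digits
    return "vid_" + "".join(rng.choice(alphabet) for _ in range(length))


def generate_rows(count, seed=42):
    rng = random.Random(seed)  # fixed seed keeps test data reproducible
    rows = []
    for _ in range(count):
        rating = generate_metric(rng, "normal", mean=3.8, stddev=0.9)
        rows.append({
            "video_id": generate_video_id(rng),
            "watch_minutes": round(generate_metric(rng, "exponential", mean=12.0), 2),
            "rating": round(min(5.0, max(1.0, rating)), 1),  # clamp to 1..5
        })
    # Append explicit edge cases rather than hoping random draws hit them.
    rows.append({"video_id": "vid_" + "0" * 8, "watch_minutes": 0.0, "rating": 1.0})
    rows.append({"video_id": "vid_" + "z" * 8, "watch_minutes": 1440.0, "rating": 5.0})
    return rows


rows = generate_rows(100)
print(len(rows))
print(rows[0])
```

Seeding the generator means a failing test can be replayed against exactly the same data set.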
  • 46. 46 Best Practices for High ROI Impact – Part 4 (DRAFT: This Slide Currently Requires Walk-Thru)
    Columns: Types | Topic : Components | For Benefits >> Do This | Cost-Ben (1-5 Hi : 1-5 Hi) | Depends | Remarks
    ● Tech, AI/ML | ML Outcomes : Source Data Gaps & Anomalies
      For Benefits: Improve Outcome Accuracy & Minimize ML Iterations
      Do This: Track Imputed or Questionable Data. For example, add Source_TBD column to indicate:
      • Imputed Gaps: Fill nulls, e.g. via interpolation, etc.
      • Low Credibility: Identify weakness at source vs algorithm.
      Cost-Ben: 2 : 3
      Depends / Remarks:
      • BEST PRACTICE TIP: Add Lifecycle Checklist item to cross-check the Source_TBD column with Validation and/or Test data sets.
    ● Tech, AI/ML | ML Outcomes : Validation & Test Data
      For Benefits: Improve Outcome Accuracy
      Do This: Incl Separate Validation and Test Data Sets:
      • For Validation Data: Hold out from Train data.
      • For Test Data: Completely separate – No “Peeking”.
      Cost-Ben: 2-3 : 4
      Depends / Remarks: Test Set should be separate and large if available. Two popular alternatives:
      • Bootstrapping
      • Cross-Validation (e.g. K-Fold resampling aka shell game).
      BEST PRACTICE TIPS:
      • Validation Data – use to tune params.
      • Test Data – use to assess performance (outcomes).
    ● Mgt, Resources | Unplanned Work or Expenses : Budget, Roadmap, Sprints, Project Plans
      For Benefits: Minimize Disruption to Budget & Schedule
      Do This: Track “Outplan” for items not in Orig Budget, Roadmap, Epic, Sprint, or Project Plan:
      • Create Positive Agile Culture for Unplanned/Popup Tasks: Tag on Roadmap as #Outplan (or use symbol). Also show any impacted items, e.g. can indicate an item sliding to next month.
      • Maintain Schedule/Priority Control and Team Morale: Easy to see when we’re in a Tunnel… and where the Light is. Easy to justify schedule adjustments to incorporate high value wins. But see #CAUTION in Remarks column.
      Cost-Ben: 2 : 5
      Depends / Remarks:
      • Introduce to corporate vocabulary.
      • Socialize with stakeholders (and your boss).
      • BEST PRACTICE TIP: Track Outplan to justify:
        o Hiring New Staff
        o Adjusting Business Processes
      • #CAUTION: Always try to plan pop-up tasks for future, not outplan. Outplan is always an exception (part of MBE plan).
      • Only for tasks > 1 Day (not for ad-hoc “support” tasks).
    ● Mgt, Staffing | Work Force : Flex Plan
      For Benefits: Stay on Budget & Schedule
      Do This: Create FLEX Plan, Not Just a Contingency Fund:
      • Enlist As-Needed “Hot Standby” SMEs who need minimal (sometimes 0) weekly work guarantee, but establish and maintain knowledge about your biz and tech environment.
      • Cross-Train / Semi-Matrix Dept Resources in Large Enterprises: Achieve economy of scale -- sustainably.
      Cost-Ben: 2 : 5
      Depends / Remarks:
      • Need strong schmoozing and negotiation skills.
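The Source_TBD idea from the first row of this slide can be sketched as follows: fill gaps by interpolation, but record which values were filled so they can be cross-checked against validation and test sets later. The column name Source_TBD is from the slide; the flag values "imputed" and "unresolved", the function name, and the sample series are assumptions made for this example.

```python
def impute_with_tracking(rows, value_key="value", flag_key="Source_TBD"):
    """Fill interior gaps by linear interpolation and flag every filled row.

    rows is a list of dicts ordered by time; a missing value is None.
    Leading or trailing gaps have no neighbor on one side, so they stay
    None and are flagged "unresolved" instead of being guessed.
    """
    # Copy each row so the caller's data is not modified; keep any
    # flag that was already set upstream (e.g. a low-credibility marker).
    result = [dict(row, **{flag_key: row.get(flag_key, "")}) for row in rows]
    known = [i for i, row in enumerate(result) if row[value_key] is not None]

    for i, row in enumerate(result):
        if row[value_key] is not None:
            continue
        before = max((k for k in known if k < i), default=None)
        after = min((k for k in known if k > i), default=None)
        if before is None or after is None:
            row[flag_key] = "unresolved"
            continue
        left = result[before][value_key]
        right = result[after][value_key]
        # Weighted average of the two nearest known neighbors.
        row[value_key] = (left * (after - i) + right * (i - before)) / (after - before)
        row[flag_key] = "imputed"
    return result


series = [
    {"day": 1, "value": 10},
    {"day": 2, "value": None},
    {"day": 3, "value": None},
    {"day": 4, "value": 40},
    {"day": 5, "value": None},
]
tracked = impute_with_tracking(series)
for row in tracked:
    print(row)
```

With the flag in place, a model evaluation step can report accuracy separately for original and imputed rows, which helps show whether a weakness sits in the source data or in the algorithm.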
  • 47. 47 Up Next DP Fairy Tale Wrap-Up . . .
  • 48. 48 Everyone is Happy . . . Life is Perfect Now . . .
  • 49. 49 . . . Or is it ? . . .
  • 50. 50 . . . Hmmm . . .
  • 51. 51 . . . Success is Not Usually Perfection . . . Excellence, Excellence + Planning = DP
  • 52. 52 Thanks to Dora the Explorer ®
  • 53. 53 Wrap-Up THANKS! ✓ YOU !!! ✓ Our Sponsors ✓ Data – for being such a Lovable Thing (But Information, Knowledge, and Results are even Better!)
  • 54. 54 Wrap-Up Make it a Great Day :)
  • 55. 55 DP You May PASS GO and Collect $2000 Jeff Bertman • www.DigiMarconWest.com/speakers • www.LinkedIn.com/in/JeffBertman • Jeff.Bertman@DfuseTech.com • Jeff.etalk123@gmail.com • Skype: JeffB.epoch (Slack, WhatsApp, etc upon request) • Mobile/Text: 818-321-3111 NEXT UP CONTACT INFO