Not long ago the question was whether your organization had big data. Did you have the volume, the velocity, the technology? Now those basics are largely given for most of the people attending this event. The path to success is still fuzzy, however, with so many technologies to choose from – and so many ways to use them.
This presentation triangulates in a holistic manner on the modern business dilemma: how can we leverage technology to improve revenue, profit, market share, and numerous other success criteria? That said, this is not about the analytics or KPIs -- although it is about measurable improvement. It’s about lining up the right technologies and using them in effective, proven ways to maximize Return on Investment (ROI). Since the slant here is holistic, we’ll show how to blend infrastructure, tools, methods, and talent to avoid and constantly trim technical debt… and to produce success stories that are consistently repeatable, not a byproduct of individual heroics.
Maximizing Big Data ROI via Best of Breed Technology Patterns and Practices - Jeff Bertman, Warner Bros.
1. Jeff Bertman
CHIEF DATA ENGINEER,
WARNER BROS.
LOS ANGELES, CA ~ JUNE 12 - 13, 2019 | DIGIMARCONWEST.COM
#DigiMarConWest
Maximizing Big Data ROI via
Best of Breed Technology
Patterns and Practices
KEYNOTE
2. Maximizing Big Data ROI via Best of Breed
Patterns & Practices
CTO and Lead Data Scientist/Engineer
Dfuse Technologies (formerly with Warner Bros Digital Networks)
Jeff Bertman
Click1: DigiMarCon Main Site
#NOTES to Audience:
(1) Caution: This deck contains some
Hollywood “GLITZ” that could be harmful
to your Boringzola.
Please be prepared to Smile
(2) Much of the focus is on Data Engineering
but from a Business Value perspective.
We will quickly zip through certain slides
-- spending just enough time for context.
(3) Thanks for the honor to Serve the
DigiMarCon West Community!
Click2: DigiMarCon Speaker Page
(Scroll to “Bertman”) Also:
- www.LinkedIn.com/in/JeffBertman
- Jeff.Bertman@DfuseTech.com
- Mobile +1 818-321-3111
- More contact info at end of deck
Click3: Dfuse Technologies Main Site
www.DigiMarConWest.com
3. 3
Speaker Highlights
Jeffrey Bertman: “Uptight Easterner”
Chief Data Engineer, CTO, Lead Data Scientist/Engineer, Bla Bla Bla
Data Geek
4. 4
This deck contains more GLITZ than usual (for me), and there are layers on several slides.
For Best Viewing, DOWNLOAD the PowerPoint Show File (vs viewing online). Thanks!
SHOW BUSINESS WARNING
5. 5
Contents Summary
Columns: # Topic; Slides; Remarks
1) Speaker Highlights. Slides: 2
2) Overview (incl 1 Brief Slide about Machinima / Warner Bros). Slides: 4
Remarks: This TOC followed by general context… and an Intro to our tour guide and related characters
3) Business & Technical Landscapes --- ROI Defined ---. Slides: 10
4) ROI Conducive Technologies and Architectures for Big Data. Slides: 17
Remarks: Highlights:
• Introduction to FiTL (Fitness Technology Landscape) and “Price Shifting” which affects TCO >>> ROI, etc
• The TP3 Principle
• Polyglot Jazz (DI Graphical Tools)
5) ROI Best Practices for Big Data, Etc. Slides: 41
Remarks: Abbreviated List due to time limit. Much larger list avail upon request.
6) Fairy Tale Wrap-up and Closing Thoughts. Slides: 46
7) Q & A / Contact Info. Slides: 52
Remarks: Feel free to reach out for discussion or future presentation versions (whitepaper in progress)
Friendly Warning: Presentations from Entertainment Companies may Contain some Jazz !
6. Introducing Our Tour Guides: DP aka Data Pigeon, LOP aka Lack of Planning Mule
DP
DP
Methodology
DP
Helper 2
Helper 1
From Here to There
LOP
* No Orwellian Connotation
7. DP
DP
LOP
But even the
best plans...
…with lack of follow-through...
Introducing Our Tour Guides: DP aka Data Pigeon, LOP aka Lack of Planning Mule
9. Slide 9
Opening the door to success requires both
planning and follow-through…
…to tame LOP the mule into LOP2 !
DP
LOP2
SGA
Successful
Geeks
Assoc.
Introducing Our Tour Guides: DP aka Data Pigeon, KIO1 aka LOTS of Planning Pegasus…or at least “ENUF” Agile Planning
10. 10
Everything we do is in the…
Thousands of content creators (aka talent partners)
Millions of videos on numerous platforms
Billions of aggregate views / month
Expanding Footprint Within & Beyond WB requires even Greater Scale
BI/Data supports Mach OTT, Other WB Divisions, External Companies
Distribution supports other WB Initiatives
Native Digital Entertainment Business that Leverages Technology to
Meet and Exceed KPIs & Operational Goals -- and to Scale Cost-Effectively
Recent Years: Thousands… Millions… BILLIONS+
Cornerstone Technologies (Big Data focus)
#DISCLAIMER: This is the only Machinima/WB slide (and it is non-proprietary).
11. 11
LEVERAGE TECHNOLOGY
Architecture, Engineering, Methods, Libraries,
CM, QA, Security, SysOps, DevOps
DATA >> INFO >> KNOWLEDGE >> ACTION
Improve BIZ (Revenue, Profit, Market Share, Etc)
Always Grow BIZ Value –– Data Intelligence
BEST PRACTICES & ~SLAs
Continual Improvement, Serviceability,
Reliability, Performance, Governance
SERVICE ORIENTED Mindset Driven By Clear Mission, Values, Goals & Priorities:
Cost-Benefit + “Everyone is a Customer” Approach
Be the Solution, Be the Boss, Value Each Other, A-Team, Executional Excellence, …
Sample
Data Management
Touch Points &
Basic Approach
12. 12
BIZ VALUE.
Increase Revenue,
Profit, Market Share,
Etc
Low Level
Processes
Get Stuff, Do Stuff, Put Stuff, Etc
Raw Data
Google, Youtube, Facebook, Twitter, Twitch,
Amazon, Salesforce,
Mach Console, ETC
Information Technology
Data
Engineering
Data ►► Information
Software
Engineering
Tech ►► Biz Tools
Product
Management
Actualization
User Applications, Visualization Tools
(BI, Reporting, Analytics, Discovery), Etc
Value “Scape”
ROI is
Impacted by
All Levels
13. 13
Downstream Feeds, Exports / Reports
Dashboards, Rpts
Convert Data into INFORMATION to Help Drive Cost-Benefit / KPIs for Business Units and The Enterprise
• Google
• YouTube
• Facebook, Instagram
• Twitter
• Twitch
• Amazon
• Pluto TV
• Clickstream
• Finance/Accounting
• Salesforce
• Sensors
• Logs
• ETC
RAW DATA from
(Platforms, Apps, Etc)
• BI + Data Managers
(incl Self-Service)
• Marketing
• Sales / BD
• Finance
• Accounting
• Operations
• Security & Compliance
• ETC
(See Pillars slide)
DISTRIBUTE to Customers, Et Al:
Data Engineering (DE) Is “Under the Hood”
Data ►► INFORMATION:
Consolidate & Transform Raw
Data into Data Warehouse (DW)
Operational ENGINES:
Payment Processing for Talent,
Directors, Recruiters, etc
Data Engineering Landscape --- Acquisition, Curation & Dissemination
14. 14
Data Engineering & Business Intelligence Lifecycle
GCR
Business
Goals
Concept
Rqmts
Definition
Vision & Goals Management
Tech Architecture
(incl Data Services / Ops)
Business / Logical
Products Select & Install
Tech / Physical
Integration
Testing
(Func+Stress)
Production
Deployment
Business Intelligence (BI)
Design / Definition
Prep
Deployment
(incl Support Spec)
User Training
(incl End User
Info)
Business
Intelligence (BI)
Development
(Focus on BIZ
Meta Layer)
Data Integration
(ETL, MDM,
Metadata, Etc)
Explore & Design
Unit Testing
Integration
Deliver Business Value Increments via “Frequent Little BITIs” (Agile Coordination + Waterfall As-Needed)
Integration
Test Planning
Data Profiles & Maps
Post: Reflect, Maint, Improve
Pre: Reflect, Strategy, Scope, Impact
Implement & Support
Texture + Smooth: BITI Cycle (Build, Integrate, Test, Improve)
Inception & Definition
>>> Today’s Objective <<<
15. 15
ROI in Technical Environments
The (preferably measurable) successful Business outcomes generated by leveraging Technology
to increase financial KPIs, e.g. revenue, profit, etc.
• “Leverage” implies “$Spend”:
o For Product, Licensing, Labor, Infrastructure, Transition, Etc
o “Break Even” is a common focus: when measured benefits equal initial investment.
o Keep ongoing costs in mind. Ongoing benefits must always exceed TCO to declare “success”.
• Even Open Source / Freemium products have Costs, for example:
o Compute Nodes they run on – vs a managed service, which is inclusive but still $spend
(e.g. Python, Node.js, Scala, Etc on EC2/VM vs AWS Lambda, Azure Functions, or Google App Engine/Cloud Functions)
o Labor might cost more than a paid product or managed service
(e.g. Kafka vs AWS Kinesis – various pros and cons largely focused on labor and performance)
• ROI Obvious Factors:
Availability, Maintainability, Reliability/Accuracy, Functionality, Performance, Security
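The break-even and TCO points above can be made concrete with a small calculation. This is a minimal sketch, not a method from the deck: the `Investment` fields, the monthly granularity, and the example figures are all hypothetical, and break-even here is taken to mean the month in which cumulative benefit first covers initial plus ongoing cost.

```python
from dataclasses import dataclass


@dataclass
class Investment:
    """Hypothetical cost/benefit profile for one technology choice."""
    initial_cost: float      # product, licensing, transition, etc.
    monthly_cost: float      # ongoing labor, infrastructure, etc.
    monthly_benefit: float   # measured benefit attributed to the technology


def break_even_month(inv, horizon_months=120):
    """Return the first month where cumulative benefit covers total cost, or None."""
    net_per_month = inv.monthly_benefit - inv.monthly_cost
    if net_per_month <= 0:
        return None  # ongoing benefits never exceed ongoing costs
    for month in range(1, horizon_months + 1):
        if net_per_month * month >= inv.initial_cost:
            return month
    return None


def roi(inv, months):
    """ROI over a period: (total benefit - TCO) / TCO."""
    tco = inv.initial_cost + inv.monthly_cost * months
    return (inv.monthly_benefit * months - tco) / tco


# A "free" open source product can still fail to pay back if labor is high.
paid = Investment(initial_cost=100_000, monthly_cost=5_000, monthly_benefit=15_000)
free = Investment(initial_cost=0, monthly_cost=8_000, monthly_benefit=7_000)
print(break_even_month(paid), break_even_month(free))
```

The second example never breaks even because its ongoing cost exceeds its ongoing benefit, which is the slide's point about open source not being cost-free.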
ROI Defined (Return on Investment) – The Basics
16. 16
ROI Defined (Return on Investment) – Special Challenges
ROI ALSO includes “Lateral Spend”
• CAUTION – Watch for Hidden or Shifted Costs:
… sometimes feels like “Collateral Damage”
o Example 1 – Big Data on Serverless Architecture:
AWS Lambda / Google Cloud Functions: 9-15 Minute Time Limit per Execution
If used for certain Data Integrations, you can invest a lot of $$, time, and effort working around the time limit
via special chunking or ~recursive calling mechanisms, which are more complex than they need to be.
o Example 2 – Streaming TV & Movie Platforms for Cord Cutters:
Free and Low Cost Options e.g. Pluto TV, Hulu Live, DirecTV NOW, YouTube TV, Sling TV, Apple TV, VUE, Watch TV, …
If run simultaneously on multiple devices in the same home, along with gaming, you’ll need to add $$ to Internet Service.
• Consider Full Functional Scope before settling on a Single Technology
• Best to support Multiple Patterns, Sometimes via Multiple Technologies
But keep “official” list “minimal” for each product or platform type.
• Heads-up on New but Old Term: “Poly#!#” – more on this later
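To show why the serverless time limit in Example 1 becomes a hidden cost, here is a rough sketch of the kind of chunking workaround it forces. Nothing here is a cloud provider API: `process_in_chunks`, its cursor convention, and the injectable `clock` are hypothetical names chosen for illustration.

```python
import time


def process_in_chunks(items, start, budget_seconds, handle, clock=time.monotonic):
    """Process items[start:] until the time budget is spent.

    Returns the index to resume from, or None when everything is done.
    The caller (for example a scheduler or a re-invocation) passes the
    returned cursor back in as `start` on the next run.
    """
    deadline = clock() + budget_seconds
    index = start
    while index < len(items):
        if clock() >= deadline:
            return index  # out of time: hand the cursor back to the caller
        handle(items[index])
        index += 1
    return None
```

Even this stripped-down version needs a cursor, a deadline check, and a caller that re-invokes until the cursor is `None`. A real integration would also have to persist the cursor between runs, which is where the extra time and effort go.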
17. 17
Presentation Abstract – Brief Discussion
Not long ago the question was whether your organization had big data. Did you have
the volume, the velocity, the technology? Now those basics are largely given for most of
the people attending this event. The path to success is still fuzzy, however, with so many
technologies to choose from – and so many ways to use them.
This presentation triangulates in a holistic manner on the modern business dilemma:
how can we leverage technology to improve revenue, profit, market share, and numerous
other success criteria? That said, this is not about the analytics or KPIs -- although it is
about measurable improvement. It’s about lining up the right technologies and using them
in effective, proven ways to maximize Return on Investment (ROI). Since the slant here
is holistic, we’ll show how to blend infrastructure, tools, methods, and talent to avoid and
constantly trim technical debt… and to produce success stories that are consistently
repeatable, not a byproduct of individual heroics.
23. 23
FiTL Price Shifting Example: DW / Data Hub Main Platform
$ Scale: 1 = Low to 5 = High ($ symbol intentionally on right of Raw$ figure)
Columns: Component; Product; Compute; Storage; In/Out (e.g. Hybrid Cloud); Access Patterns (Explain); RAW$ COUNT; Special Info / Discussion Points

● PostgreSQL CE (cloud)
Product: 0 (Open Src)
Compute: $$ (EC2)
Storage: $$$ (EBS)
In/Out: ~n/a, wash
Access Patterns: $ (As Needed)
RAW$ COUNT: 5$
Special Info / Discussion Points:
● #PRO: Multi-Model for Trxs (+ ~Analytics)
● #CON: Minimal Scale-Out except for e.g. Citus DB or $$Product GC PG, Azure PG
● #CON: Slow evolution of Analytics infra

● Redshift (cloud)
Product: $$
Compute: $$$ (~EC2)
Storage: $$$ (EBS)
In/Out: ~n/a, wash
Access Patterns: $ (As Needed)
RAW$ COUNT: 7$
Special Info / Discussion Points:
● #PRO: Spectrum to connect Data Lake
● #CON: Main focus is DW Not Transactional
● #CAUTION: Node distribution / access limitations. Discuss Mitigation Patterns for diverse data access patterns on same table. Tie to Aggregate Awareness, etc.

● Snowflake (cloud)
Product: $$$
Compute: $$ (Selectable per Session)
Storage: $ (S3)
In/Out: ~n/a, wash
Access Patterns: $ BUT See #CAUTION in Special Info col cuz Separate Compute layer
RAW$ COUNT: 6$ BUT Exp = 7$
Special Info / Discussion Points:
● #PRO: Unique Data Sharing
● #CON: Main focus is DW Not Transactional
● #CAUTION: Separate Compute Layer is generally better BUT can cost more for some patterns. Discuss Mitigation Patterns.
● #EVAL Data Cache options, e.g. Tableau TDEs or Data Virtualization w/ Denodo, etc.

● Oracle Exadata (on-prem)
Product: $$$$$
Compute: $$$ (Data Ctr + Extra Staff)
Storage: $$$ (Appliance)
In/Out: #TBD
Access Patterns: $ (As Needed)
RAW$ COUNT: 11$
Special Info / Discussion Points:
● #PRO: Multi-Model for Trxs + Analytics
● #CON: #ECOSYSTEM is Shrinking(?)
● #SCALABILITY Options: e.g. negotiate for dormant CPUs etc
● CapEx is vanilla for On-Prem, but ~wash

Guidelines + Future Improvements
• TCO Factoring:
Incorporated in each category,
e.g. see Compute column for Oracle
• Conventional Evaluation Factors can be Added:
Performance, Scalability, Maintainability, Functionality,
Security, Vendor Viability / Ecosystem, etc
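As a rough illustration of how a Raw$ count could be tallied, the sketch below counts the `$` symbols in the Product, Compute, and Storage ratings. That summing rule is an assumption, not something the slide states. It reproduces the figures shown for the PostgreSQL, Snowflake, and Oracle rows, which are the only rows used here. The function name and the dictionary layout are hypothetical.

```python
def raw_cost_count(product, compute, storage):
    """Count the $ symbols across the three base cost ratings (assumed rule)."""
    return sum(rating.count("$") for rating in (product, compute, storage))


# Ratings copied from the table rows above; "0" means no product cost.
ratings = {
    "PostgreSQL CE": ("0", "$$", "$$$"),
    "Snowflake": ("$$$", "$$", "$"),
    "Oracle Exadata": ("$$$$$", "$$$", "$$$"),
}

for name, (product, compute, storage) in ratings.items():
    print(name, raw_cost_count(product, compute, storage))
```

Price shifting shows up in how the same total can come from very different columns, for example low product cost but high storage, which is why the per-category ratings matter as much as the total.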
24. 24
Snowflake’s Sweet Spot is Data Warehousing / Analytics, Not Transactional / Operational
Activity (although transactional performance is better than expected!)
Many Modern Best in Class Products have Same Issue
Homogeneous vs Heterogeneous Technical Environments
o Homogeneous was a Dream
(Not Sustainable due to Tech Evolution, M&As, Self-Service / Decentralization, etc)
o Interoperability is the Reality … Usually
o People like Simple, but in Modern Times simplicity lives
within each Class – “Best in Class”
What does Heterogeneous Mean in Today’s Modern IT Arena? … … …
Reality Check
Examine Wide Use Cases -- This Example Happens to be Data Platform
DP
LOP2
25. 25
Old / Rejuvenated “Modern” Term: Polyg#!# What?
(Source: Google Dictionary)
Polyglot Persistence:
Using multiple data platform technologies
[to address diverse use cases in a best-of-breed manner].
Polyglot Programming:
Using multiple programming languages
[to address diverse use cases in a best-of-breed manner].
Domain Specific Languages (DSLs) are now standard practice
for enterprise app development.
. . . 2012+ Time to Revive! . . .
Polyglot Engineering / Architecture:
Using multiple technologies [in the same functional domain]
[to address diverse use cases in a best-of-breed manner].
Etymology: Greek poly- ‘many’ + glotta ‘tongue.’
26. 26
Potential New Principle / Postulate (#DRAFT Idea in Progress)
TP3 Data – Technical Polyglot Propensity Principle for Data Platforms:
Modern enterprises with Big Data tend to utilize Polyglot Engineering with the intention of maximizing ROI.
One Technical Data Platform cannot profitably maintain a Top 3 industry popularity rank for modern
big data enterprises for more than 3 years without sacrificing at least one of the following Top 3 ranks:
• Multi-Model support for more than 3 types:
e.g. Relational, Graph, Document/Text, Multi-Media, Geospatial, Key-Value (Structured, Semi/UnStructured)
• Multi-Use-Category support for more than 3 types:
e.g. Analytical, Transactional, Search, Stream
• “Reasonably Low Pricing” given abundance of Modern, Competitive, Low Cost / Community products
TP3 DE – Technical Polyglot Propensity Principle for Data Engineering Platforms:
Similar to TP3 Data, but for Data Engineering / Integration tools and platforms.
DP
Let’s see some EXAMPLES . . .
27. 27
Polyglot Data Integration: Example Architecture Pattern 1
Example Use Case: Social Media / Video Platform – Core Data Feeds
• Purpose & Constraints: Biz Performance & Revenue Metrics / KPIs.
Minimal or No Backfill Available. Supports Data Lake Direct Access Use Cases for raw data (purple lines from earlier slide).
• Solution Profile – Decoupled Polyglot thru Data Lake:
Python for Extraction, Pentaho Data Integration (PDI) for Load/Ingest/Transform. High Resilience.
• Fitness Highlights:
o Python: $0 Open Source, Thin (low resource), Highly Available, e.g. can be Serverless or EC2.
o Pentaho PDI: $0 Community Edition (or EE see below), Graphic Workflows with Standard Transformations.
Low Maintenance Work Share (extremely easy for cross-training even with complex pipelines).
Option to Expand to Enterprise Edition ($) for better HA infra, monitoring, repo, scheduling, support, etc.
Flow: Extract via Python (e.g. REST API) >> Lake (Async files, e.g. CSV, JSON, Parquet) >> Load/Ingest/Transform via Pentaho PDI >> DW
Cadence: Near-Real-Time, Hourly, Daily, Etc
Other diagram labels: Cache
Ideal for: Facebook (core), Instagram, Twitch Live Streams, Etc
Optional for: YouTube, Amazon, Etc
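The extraction side of this decoupled pattern might look roughly like the sketch below. It assumes records have already been fetched (the API call is left out) and only shows the landing step. The folder layout, the `.jsonl` file format, and the name `land_in_lake` are illustrative choices, not details from the deck.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def land_in_lake(records, lake_root, feed, extracted_at=None):
    """Write one extract batch as newline-delimited JSON under lake_root/feed/YYYY-MM-DD/."""
    extracted_at = extracted_at or datetime.now(timezone.utc)
    folder = Path(lake_root) / feed / extracted_at.strftime("%Y-%m-%d")
    folder.mkdir(parents=True, exist_ok=True)
    target = folder / f"{feed}_{extracted_at.strftime('%H%M%S')}.jsonl"
    tmp = target.with_suffix(".tmp")
    with tmp.open("w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, sort_keys=True) + "\n")
    tmp.replace(target)  # rename last so a loader never sees a half-written file
    return target
```

Because the extractor only writes files, the downstream load step can run on its own schedule, and the raw files remain available for audit or direct data lake use.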
28. 28
Polyglot Data Integration: Example Architecture Pattern 2
Example Use Case: Social Media / Video Platform – Extension Data Feeds
• Purpose & Constraints: Biz Performance & Revenue Metrics / KPIs.
Backfill is Available. Extract Runtime is Short / Low Impact on Runbook Dependencies. Ok to Not have Data Lake raw data.
Extensions Data Density is High, e.g. would create thousands-millions of files per day (discuss).
• Solution Profile – Homogeneous ~Stream Direct to DW / Data Hub:
Python if Lite Transformations (subject to DI Library selection, see next slide), Pentaho PDI if Heavy.
• Fitness Highlights:
o Python: $0 Open Source, Thin (low resource), Highly Available, e.g. can be Serverless or EC2.
o Pentaho PDI: $0 Community Edition (or EE see below), Graphic Workflows with Standard Transformations,
Low Maintenance Work Share (extremely easy for cross-training even with complex pipelines).
Option to Expand to Enterprise Edition ($) for better HA infra, monitoring, repo, scheduling, support, etc.
#Caution: Must Backfill after Planned/Unplanned DW Outage. No Data Lake Raw Files for Audit or Direct Use Cases.
Flow: Extract + Load/Ingest/Transform via #TBD (See Solution Profile above) >> Merge >> DW
Cadence: Hourly, Daily, Monthly, Etc
Other diagram labels: Cache, Lake
Ideal for: FB Graph API Extensions, YouTube Bulk API, Etc
Optional for: Salesforce, SAP, Etc
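Since this pattern depends on backfilling after an outage, the merge step needs to be safe to re-run. The sketch below shows one way to do that, with SQLite standing in for the DW purely for illustration. The table name `daily_views`, its columns, and the function `merge_rows` are hypothetical.

```python
import sqlite3


def merge_rows(conn, rows):
    """Upsert (video_id, day, views) rows so re-running a backfill does not duplicate data."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS daily_views ("
        "video_id TEXT, day TEXT, views INTEGER, PRIMARY KEY (video_id, day))"
    )
    # INSERT OR REPLACE overwrites any existing row with the same primary key.
    conn.executemany(
        "INSERT OR REPLACE INTO daily_views (video_id, day, views) VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
merge_rows(conn, [("v1", "2019-06-01", 10), ("v2", "2019-06-01", 5)])
merge_rows(conn, [("v1", "2019-06-01", 12)])  # backfill with a corrected value
```

Keying the merge on the natural key (here video and day) is what makes a repeated backfill harmless. A production DW would use its own merge or upsert statement in place of the SQLite syntax.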
29. 29
Some Python Data Integration / ETL Libraries
Columns: # Library; Remarks (as of 2018-11-25)
1) petl: Reasonably Popular (Last Commit Sept 2018)
2) pygrametl: No Commits since Oct 2017
3) Bonobo: Reasonably Popular (Last Commit Nov 2018)
4) #TBD: Wide Open to Feedback. Roadmap Task for 2019: Evaluate and Select
NOTES:
• Data Analysis Libs like Pandas are not shown above. Full Data Integration / ETL / ELT is
not their objective.
• Custom Development is also quite popular – withOUT starting with a canned 3rd party library.
But you should develop a lib (or at least a collection of templates) within your company for
standardization, productivity, etc.
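For the custom development route, an in-house template can start very small. The sketch below is one possible shape, using only the Python standard library. The step names `csv_source` and `cast_int` and the `run_pipeline` signature are hypothetical, not taken from any of the libraries listed above.

```python
import csv
import io


def run_pipeline(extract, transforms, load):
    """Stream rows from extract() through each transform, then hand them to load()."""
    rows = extract()
    for transform in transforms:
        rows = map(transform, rows)
    return load(rows)


def csv_source(text):
    """Build an extract step that yields dict rows from CSV text."""
    return lambda: csv.DictReader(io.StringIO(text))


def cast_int(field):
    """Build a transform step that casts one field to int."""
    def step(row):
        return {**row, field: int(row[field])}
    return step


result = run_pipeline(csv_source("id,views\na,1\nb,2\n"), [cast_int("views")], list)
print(result)
```

Standardizing on a few step builders like these gives teams a shared vocabulary for pipelines, which is the standardization and productivity benefit the note refers to.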
31. 31
Polyglot DI: Why Graphic Workflows? –– Example 1
Hard Error End
Hard Error End
Easy to See Green Flow = GOOD, Red Flow = BAD
32. 32
Polyglot DI: Why Graphic Workflows? –– Example 2
Soft Error End
(Do Nothing)
Easy to See Negative Logic ANTI-PATTERN
33. 33
Polyglot DI: Why Graphic Workflows? –– Example 3
Error End
Easy to Understand
Semi-Complex Flow
More Complex is
Also Welcome
34. 34
Polyglot DI: Why Graphic Workflows? –– Example 4
Error End
Soft Error End
(Do Nothing)
Easy to Add
Exception
Handler
35. 35
CONCURRENT File Processing
Chains are Easy to Create
Logging Window Easily
Explains Broken Step Above
(Can also do in IDE for Python, etc)
Polyglot DI: Why Graphic Workflows? –– Example 5
36. 36
Polyglot DI: Why Graphic Workflows? –– Example 6
This Time it Worked.
See all the Green
Checkmarks
Automatically Gathers
METRICS Etc for Each Job
Renamed Steps to be
Meaningful
37. 37
Polyglot DI: Why Graphic Workflows? –– Example 7
Error End
Standard
Job Types / Steps
(Menu Options)
38. 38
Polyglot DI: Why Graphic Workflows? –– Example 8
Error End
Big Data
Built-in Transformations
(Source: Pentaho PDI Manual)
39. 39
Polyglot DI: Why Visual Workflows? –– Example 9
Error End
Standard
Input Methods
(Menu Options)
40. 40
Polyglot DI: Why Graphic Workflows? –– Example 10
Standard
Output Methods
(Menu Options)
41. 41
Polyglot DI: Why Graphic Workflows? –– Example 11
Standard Transform Methods
(Menu Options)
Extend to Custom
Python, JavaScript, Bash, Etc
42. 42
Up Next
ROI Best Practices
FUTURE IMPROVEMENTS for this Presentation:
• Reduce Slide Density – Especially the next few slides which also could use some diagrams.
• Many More Best Practices already documented – Adding to presentation after settle on better format.
• Contact Info at end of this deck – Feel free to reach out for discussion or future versions.
Data Engineering
Analytics (incl AI/ML)
Design & Development
DevOps, SysOps
Data Governance
Collaboration
Management
Etc
43. 43
Best Practices for High ROI Impact – Part 1 (DRAFT: This Slide Currently Requires Walk-Thru)
Columns: Types; Topic : Components; For Benefits >> Do This; Cost-Ben (1-5 Hi : 1-5 Hi); Depends; Remarks

Types: Tech / Data Eng
Topic: Data Integration / ETL Arch Patterns : Decoupled vs Homogeneous
For Benefits: Optimize Performance, Scalability, Reliability, Etc
>> Do This: Have Canned Patterns Ready for Various Scenarios
Cost-Ben: 2 : 5
Depends / Remarks:
• Experience + Research.
BEST PRACTICE TIPS:
• See Architecture Patterns earlier in this presentation (e.g. decoupled vs homogeneous).

Types: Tech / Data Eng
Topic: External Resilience Patterns : Data Extraction Programs
For Benefits: Improve Data Availability & Reliability
>> Do This: Use Intelligent Logic (parsing etc) vs Arbitrary or Hardcoded Logic to Dynamically Accommodate Changes in Source Data Patterns. Examples:
• Skip over CSV Header Rows – use parse vs row count.
• Ignore extra JSON pages via content scan vs page count.
Cost-Ben: 2 : 4
Depends / Remarks:
ROI falls when distracted by “fires”, especially when preventable.
BEST PRACTICE TIPS:
• “Results are Better Than Excuses” culture.
• Problem Patterns are often ~just as important as Solution Patterns. Line them up a la Tech Arsenal.

Types: Tech / General
Topic: Tech Debt : Keyword Tagging in Code/Docs
For Benefits: Minimize Tech Debt and Facilitate Follow-up
>> Do This: Tag your Code and Design Docs. Examples:
• Code: #CHANGED, #PERF, #SCALE, #HARDCODED, #MODULARIZE, #WORKAROUND, #TODO, #FUTURE
• Docs: #KBANK, #OUTPLAN, #SPEC, #QA, #RISK, #TBD, #TODO, #FUTURE, #MGT, #TECH
• README File for every main module – incl “Future Enhancements” section in addition to any tickets/cards.
Cost-Ben: 1 : 5
Depends / Remarks:
• Context dependent. If Mgt doc, separate #TECH sidenotes. Or vice versa if Tech doc, and put Mgt Nutshell / Next Steps at top since Mgt does not focus on Tech.
• Tech Debt is frequently caused by overlooking syndromes: “we’ll do it later”, “slipping between the cracks”.

Types: Tech / Monitor
Topic: Simple Automation, Reliability, and Maintainability : Keyword Tagging in Logs
For Benefits: Improve Reliability & Maintainability/Operations, Avoid False Positives
>> Do This: Tag Log File Msgs for Action or Review. Examples:
• #ERROR: bla bla (+ Optional Subtags #RETRY or #FATAL)
• #WARNING: bla bla
• #INFO: bla bla bla
Cost-Ben: 1 : 4
Depends / Remarks:
• Couple with Monitoring Tools/Scripts.
• Avoid False Positives, e.g. “error table”, “solution for whatever error”, etc.

Types: Tech / SysOps
Topic: Deployment : Hot Patches
For Benefits: Improve Reliability & Maintainability/Operations
>> Do This: Always Create “cm_retro” hot patches folder with:
• Changed Files.
• README for changes to other tiers (e.g. data tier).
Cost-Ben: 1 : 4
Depends / Remarks:
• Couple with Git if available – sometimes changes occur in infra/commercial product config files, etc which might not be CMed.
• Useful even when using Git, etc.
• Consider using Git for all configurables. Opens deeper issues, e.g.:
o Separating config files from other bin.
o Maintaining security of credential files etc.
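The CSV header tip in the External Resilience row (parse instead of counting rows) can be sketched as follows. This is an illustrative example only: the function name, the case-insensitive match, and the sample report text are assumptions.

```python
import csv
import io


def rows_after_header(text, expected_columns):
    """Yield data rows as dicts, locating the header by content instead of a fixed row count."""
    reader = csv.reader(io.StringIO(text))
    for row in reader:
        if [cell.strip().lower() for cell in row] == expected_columns:
            break  # found the header, wherever the source put it
    else:
        raise ValueError("header row not found")
    for row in reader:
        if row:  # skip blank lines
            yield dict(zip(expected_columns, row))


# The source added a preamble line and a blank line before the header.
report = "Report generated 2019-06-12\n\nVideo,Views\nv1,10\nv2,20\n"
print(list(rows_after_header(report, ["video", "views"])))
```

A hardcoded "skip 1 row" would have treated the preamble as the header here. Matching on content keeps working when the source adds or removes leading lines.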
44. 44
Best Practices for High ROI Impact – Part 2 (DRAFT: This Slide Currently Requires Walk-Thru)
Columns: Types; Topic : Components; For Benefits >> Do This; Cost-Ben (1-5 Hi : 1-5 Hi); Depends; Remarks

Types: Tech / SysOps
Topic: Hyper-Automation, Reliability, and Maintainability : File Tagging
For Benefits: Simplify Many Critical Actions & Monitoring
>> Do This: Tag Filenames or Accounts for Monitoring, Tracking, etc. Examples (add to filenames):
• Sensitive Files use “_stv” suffix: No-Brainer 1-line expression (e.g. filename like “*_stv*.*”) to track all files containing secure credentials etc. Facilitates auto security hardening and monitoring. No need to depend on a high maintenance list which will grow out of sync.
• Service Accts use “svc_” prefix: Similar to “_stv” but for email accounts, etc.
Cost-Ben: 1 : 4
Depends / Remarks:
• Great for Security Audits: Auditors like when justified exceptions can be simplified. Service Accounts in some high security environments are allowed only by exception, even though they are obviously needed and much better than using the name of a real person who will eventually move on.
• SAFETY TIP: Don’t make it too obvious for Hackers. For example, “_stv” is good enough to understand without saying “look here for ‘sensitive’ files!”.

Types: Tech / Data Mgt
Topic: Hidden Tech Debt : Multi-Tenancy
For Benefits: Prevent Hidden Tech Debt from Plunging Productivity
>> Do This: Always have Checklist to Consider Multi-Tenant Support, e.g. Biz Unit (BU). Examples – Add Biz Unit to:
• Data Sets/Tables
• Data Integration Infra (folder trees).
Cost-Ben: 3 : 2 to 5 (It Depends)
Depends / Remarks:
CONSIDER FACTORS:
• M&A (of course).
• Reorgs / Dept moves, splits, etc.
BEST PRACTICE TIP (tangent topic):
Usually Name things after BEHAVIOR, not Volatile Infra such as Biz/Dept Name.

Types: Mgt / Collab
Topic: Naming Conventions : Biz & Tech Vocabularies
For Benefits: Increase Productivity, Reliability, & Morale; Preclude Communication Mayhem
>> Do This: Always Have a Name (a la Jim Croce, song artist). Examples:
• Data Schemas: stage, base, presents.
• Data Sets: Pilot should have name such as “main” or “core” since something else ~always follows. Otherwise you’ll forever be saying “the set without a name”.
• Disk Folders:
• ETC
Cost-Ben: 2 : 4
Depends / Remarks:
ROI falls when distracted by “fires”, especially when preventable.
BEST PRACTICE TIPS:
• Devise a simple term whenever you find yourself repeating a phrase or group of words.
• Socialize everywhere across Biz and Tech, e.g. Roadmap, Specs, etc.
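The “_stv” file tagging idea from the File Tagging row lends itself to a one-line search. The sketch below is a minimal illustration using the Python standard library. The helper names are hypothetical, and the permission check assumes POSIX-style file modes.

```python
import stat
from pathlib import Path


def find_sensitive_files(root, tag="_stv"):
    """Return every file under root whose name carries the sensitive-file tag."""
    return sorted(p for p in Path(root).rglob(f"*{tag}*") if p.is_file())


def world_readable(path):
    """True if 'other' users can read the file (POSIX permission bit)."""
    return bool(Path(path).stat().st_mode & stat.S_IROTH)


def audit(root):
    """List tagged files that are readable by everyone, as candidates for hardening."""
    return [p for p in find_sensitive_files(root) if world_readable(p)]
```

Because the tag lives in the filename, the audit needs no separately maintained inventory of credential files, which is the maintenance saving the row describes.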
45. 45
Best Practices for High ROI Impact – Part 3 (DRAFT: This Slide Currently Requires Walk-Thru)
Columns: Types; Topic : Components; For Benefits >> Do This; Cost-Ben (1-5 Hi : 1-5 Hi); Depends; Remarks

Types: Tech / DevOps, SysOps
Topic: Process Orchestration : DevOps/SysOps Coding
For Benefits: Control Maintainability vs High Availability
>> Do This: For each Job’s Master Script (launched by scheduler), Choose Common/Reusable or Separate Scripts:
• For Maintainability: Use Common/Reusable.
• For HA: Separate scripts to avoid risk of breaking something unrelated to current task.
Cost-Ben: 1 : 3
Depends / Remarks:
• Evaluate Reusability vs Risk (see Best Practice Tip in Remarks column).
• BEST PRACTICE TIP: Lean toward Maintainability if you have a separate Test/QA Team (which is another, fundamental best practice).

Types: Tech / Data Mgmt
Topic: Data Assets Mgmt : Data Catalog
For Benefits: Maximize Results and Minimize Redundant Data in data stores + BI platforms
>> Do This: Create & Maintain a Data Catalog (#DCAT):
• Option 1 $$$: Combine with Data Virtualization, e.g. Denodo, Dremio, Stardog, etc.
• Option 2 $$$: Separate product, e.g. Alation, etc.
• Option 3 Free But Not Great (Yet):
o Spreadsheet/Googlesheet. Proven Templates available upon request.
o #TODO: Researching open source products.
Cost-Ben: 3 : 5 for $$$ or 2 : 3 for Free
Depends / Remarks:
• Be prepared to spend at least $150K for good commercial product.
• HOT TIP: Inferred relationships across DB platforms are only practical when virtualizing across multiple data platforms. Lean toward Option 1 + “encourage” Denodo to speed up their Roadmap (like Alation feature but cross-platform).
• #RISK of NOT Having DCAT: High Tech Debt propensity across multiple tiers, e.g. in Data Lake, DW/Hub, and BI platforms.

Types: Tech / Dev/Test
Topic: Testing : Sample Data Generation
For Benefits: Reduce Train-Validate-Test Lifecycle
>> Do This: Auto Generate Meaningful vs Random Data:
• Mockaroo.com: $Range from free for 1K rows and slow speed to $500/yr for 10M rows and 8x speed.
• Alternatives (generally < $600/yr/seat): RedGate SQL Data Generator, Dummi, MS Visual Studio, etc.
• Or use Real Data if Possible (limited by InfoSec policy and data volume considerations – See “Depends” column).
Cost-Ben: 2 : 3
Depends / Remarks:
• Production Data – in some environments -- can be downloaded/refreshed into Dev/Test. But still need subset data while maintaining integrity (where applicable).
IMPORTANT TIPS:
• For Metrics: Add distribution info (e.g. uniform, normal, normal inverse, exponential, exponential inverse, etc). Incl edge cases.
• For Text: Create meaningful patterns, incl edge cases, e.g. via regex, etc.
• Additional Metadata (as applicable): Unique, Step, Min/Max values/length, locale, character set, image ht/wd, etc.
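The metrics tip in the Sample Data Generation row (pick a distribution and include edge cases) could be prototyped with the standard library before paying for a tool. This is a rough sketch under assumed conventions: the function name, the clamping to a min/max range, and the choice of spread are all illustrative.

```python
import random


def generate_metric(n, distribution="normal", seed=0, low=0.0, high=100.0):
    """Generate n values from a named distribution, clamped to [low, high].

    The two boundary values are always prepended as edge cases, so the
    result has n + 2 entries.
    """
    rng = random.Random(seed)  # seeded so test data is reproducible
    mid, spread = (low + high) / 2, (high - low) / 6
    draw = {
        "uniform": lambda: rng.uniform(low, high),
        "normal": lambda: rng.gauss(mid, spread),
        "exponential": lambda: low + rng.expovariate(1 / spread),
    }[distribution]
    values = [min(high, max(low, draw())) for _ in range(n)]
    return [low, high] + values
```

Seeding the generator makes a failing test repeatable, and always including the minimum and maximum guarantees the boundary cases are exercised regardless of what the random draw produces.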
46. 46
Best Practices for High ROI Impact – Part 4 (DRAFT: This Slide Currently Requires Walk-Thru)
Columns: Types; Topic : Components; For Benefits >> Do This; Cost-Ben (1-5 Hi : 1-5 Hi); Depends; Remarks

Types: Tech / AI/ML
Topic: ML Outcomes : Source Data Gaps & Anomalies
For Benefits: Improve Outcome Accuracy & Minimize ML Iterations
>> Do This: Track Imputed or Questionable Data. For example, add Source_TBD column to indicate:
• Imputed Gaps: Fill nulls, e.g. via interpolation, etc.
• Low Credibility: Identify weakness at source vs algorithm.
Cost-Ben: 2 : 3
Depends / Remarks:
• BEST PRACTICE TIP: Add Lifecycle Checklist item to cross-check the Source_TBD column with Validation and/or Test data sets.

Types: Tech / AI/ML
Topic: ML Outcomes : Validation & Test Data
For Benefits: Improve Outcome Accuracy
>> Do This: Incl Separate Validation and Test Data Sets:
• For Validation Data: Hold out from Train data.
• For Test Data: Completely separate – No “Peeking”.
Cost-Ben: 2-3 : 4
Depends / Remarks:
Test Set should be separate and large if available.
Two popular alternatives:
• Cross-Validation (e.g. K-Fold resampling aka shell game).
• Bootstrapping.
BEST PRACTICE TIPS:
• Validation Data – use to tune params.
• Test Data – use to assess performance (outcomes).

Types: Mgt / Resources
Topic: Unplanned Work or Expenses : Budget, Roadmap, Sprints, Project Plans
For Benefits: Minimize Disruption to Budget & Schedule
>> Do This: Track “Outplan” for items not in Orig Budget, Roadmap, Epic, Sprint, or Project Plan:
• Create Positive Agile Culture for Unplanned/Popup Tasks: Tag on Roadmap as #Outplan (or use symbol). Also show any impacted items, e.g. indicate that an item slides to next month.
• Maintain Schedule/Priority Control and Team Morale: Easy to see when we’re in a Tunnel… and where the Light is. Easy to justify schedule adjustments to incorporate high value wins. But see #CAUTION in Remarks column.
Cost-Ben: 2 : 5
Depends / Remarks:
• Introduce to corporate vocabulary.
• Socialize with stakeholders (and your boss).
• Only for tasks > 1 Day (not for ad-hoc “support” tasks).
• BEST PRACTICE TIP: Track Outplan to justify:
o Hire New Staff
o Adjust Business Processes
• #CAUTION: Always try to plan pop-up tasks for future, not outplan. Outplan is always an exception (part of MBE plan).

Types: Mgt / Staffing
Topic: Work Force : Flex Plan
For Benefits: Stay on Budget & Schedule
>> Do This: Create FLEX Plan, Not Just a Contingency Fund:
• Enlist As-Needed “Hot Standby” SMEs who need minimal (sometimes 0) weekly work guarantee, but establish and maintain knowledge about your biz and tech environment.
• Cross-Train / Semi-Matrix Dept Resources in Large Enterprises: Achieve economy of scale -- sustainably.
Cost-Ben: 2 : 5
Depends / Remarks:
• Need strong schmoozing and negotiation skills.
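The Source_TBD idea from the Source Data Gaps & Anomalies row can be illustrated with a small imputation helper that records which values were filled in. This is a sketch only: linear interpolation is one possible fill method, and the function name and return shape are hypothetical.

```python
def impute_with_flag(values):
    """Fill None gaps by linear interpolation and flag every originally missing position.

    Returns (filled_values, source_tbd_flags). Gaps at the start or end have
    no neighbor on one side, so they stay None but are still flagged.
    """
    filled = list(values)
    flags = [v is None for v in values]
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled, flags


views, source_tbd = impute_with_flag([10, None, None, 40, None])
print(views)
print(source_tbd)
```

Keeping the flags next to the data lets a later checklist step confirm how many imputed rows ended up in the validation or test sets, so that a weak result can be traced to the source data before the algorithm is blamed.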
55. 55
DP
You May PASS GO and Collect $2000
Jeff Bertman
• www.DigiMarconWest.com/speakers
• www.LinkedIn.com/in/JeffBertman
• Jeff.Bertman@DfuseTech.com
• Jeff.etalk123@gmail.com
• Skype: JeffB.epoch
(Slack, WhatsApp, etc upon request)
• Mobile/Text: 818-321-3111
NEXT UP
CONTACT INFO