Business leaders everywhere are looking to data to inform their decision making. Accompanying this demand are misunderstandings about what it takes to transform data into something that can inform a decision. What data infrastructure is required? In this talk, I'll dispel some of these misunderstandings and discuss what it takes to build good data infrastructure. I'll cover its components as well as the best practices and available tools for gathering data, processing it, storing it, analyzing it, and communicating the results. The goal is for these components to form a data infrastructure that can evolve from simple reporting to sophisticated insights for decision making.
Presented at OpenWest 2018
4. Uber, the world’s largest taxi company,
owns no vehicles. Airbnb, the world’s
largest accommodation provider, owns no
real estate. Something interesting is
happening.
Tom Goodwin
5. Data has led to amazing success stories
Over the past few decades, companies that leverage data successfully have had huge
successes
● Walmart pioneered the use of data to manage inventory and became the first
company to pass $1 billion in sales in under 17 years.
● Twitter implemented tracking for its onboarding and retention and was able
to increase conversion by 30% and long-term user retention by 20%.
● Internet consumer companies like Google, Amazon, and Netflix continue to show
success after success by collecting and utilizing huge quantities of data.
6. The success stories have people scrambling for “data”
dilbert.com/strip/2012-07-29
8. Warning: Data is not Easy
The success stories make data seem magical. Every data tool
out there would like you to believe they will solve the hard
things for you. But the truth is that building good data
products is a hard and rewarding endeavor.
9. We can succeed!
As an industry, data science is figuring out how to develop
and deliver data products. The processes and tools are
maturing to the point where everyone will be utilizing data.
This talk aims to help you understand the fundamentals of this
endeavour and set you up for success!
11. Data Basics
The data science hierarchy of needs
describes the stages of data
complexity and insights
At the base are real world
phenomena, which we capture as
data and subsequently transform into
meaning
12. Data Basics: Data Collection + Raw Data Storage
Before you can do anything with data
you must capture it
Sources can include computer
generated log files, system
information, user generated content,
sensors, and other data stores
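As a rough illustration of capturing data before doing anything else with it, here is a minimal sketch that appends events to a raw store as JSON lines. The `capture_event` helper, the field names, and the sample sources are assumptions made up for this example, not part of the talk; a real pipeline would write to files or an object store rather than an in-memory buffer.

```python
import io
import json
from datetime import datetime, timezone


def capture_event(sink, source, payload):
    """Append one raw event to `sink` as a single JSON line (hypothetical helper)."""
    record = {
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "source": source,    # e.g. a log file, a sensor, another data store
        "payload": payload,  # stored as received; cleaning happens in a later stage
    }
    sink.write(json.dumps(record) + "\n")
    return record


# An in-memory buffer stands in for raw storage to keep the sketch self-contained.
raw_store = io.StringIO()
capture_event(raw_store, "web_log", {"path": "/login", "status": 200})
capture_event(raw_store, "sensor", {"room": "A1", "occupied": True})

# Reading the raw store back yields one record per captured event.
events = [json.loads(line) for line in raw_store.getvalue().splitlines()]
```

Storing the payload untouched keeps the raw layer faithful to the source, so later cleaning rules can change without losing information.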
13. Data Basics: Cleaning + Structured Data Storage
As the adage goes, 80% of the work
is preparing the data
Where you find gaps, errors, or
inconsistencies, you cycle back to
data collection
Along the way you develop intuitions
about the data and the questions it
can answer
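The cleaning step might look something like the sketch below, which normalizes raw records into structured rows and sets aside the ones with gaps or errors so they can be traced back to data collection. The record fields (`user`, `minutes`) and the validation rules are hypothetical, chosen only to show the shape of the work.

```python
def clean_records(raw_records):
    """Split raw records into structured rows and rejects (illustrative rules only)."""
    cleaned, rejected = [], []
    for rec in raw_records:
        # Normalize the user name; a missing or blank name counts as a gap.
        user = (rec.get("user") or "").strip().lower()
        # Coerce minutes to an integer; unparseable values count as errors.
        try:
            minutes = int(rec.get("minutes"))
        except (TypeError, ValueError):
            minutes = None
        if not user or minutes is None or minutes < 0:
            rejected.append(rec)  # keep the original for follow-up at the source
            continue
        cleaned.append({"user": user, "minutes": minutes})
    return cleaned, rejected


# Made-up sample input showing typical problems.
raw = [
    {"user": " Alice ", "minutes": "30"},
    {"user": "bob", "minutes": "n/a"},  # unparseable value
    {"user": "", "minutes": "15"},      # missing user
    {"user": "Carol", "minutes": 45},
]
cleaned, rejected = clean_records(raw)
```

Keeping the rejected records, rather than silently dropping them, is what makes it possible to cycle back to data collection and fix the source.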
14. Data Basics: Descriptive Analytics
Descriptive Analytics are your first
stage where you actually get to
answer questions
Here you’re creating ad-hoc reports,
tracking performance indicators, and
building standard reports
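Descriptive analytics at this stage is mostly counting. The sketch below computes two simple indicators from cleaned events using only the standard library; the event fields, the sample rows, and the indicator names are invented for illustration.

```python
from collections import Counter

# Made-up cleaned events; in practice these come from structured storage.
events = [
    {"day": "2018-06-04", "user": "alice", "action": "booking"},
    {"day": "2018-06-04", "user": "bob", "action": "booking"},
    {"day": "2018-06-04", "user": "alice", "action": "cancel"},
    {"day": "2018-06-05", "user": "carol", "action": "booking"},
]

# Indicator 1: how many events and how many bookings happened each day.
events_per_day = Counter(e["day"] for e in events)
bookings_per_day = Counter(e["day"] for e in events if e["action"] == "booking")

# Indicator 2: how many distinct users were active each day.
users_by_day = {}
for e in events:
    users_by_day.setdefault(e["day"], set()).add(e["user"])
daily_active_users = {day: len(users) for day, users in users_by_day.items()}
```

Counts like these are easy to explain to stakeholders and easy to verify by hand, which is part of why they make a good first deliverable.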
15. Data Basics: Descriptive Analytics
All other descriptions, modeling, and
machine learning follow from the
ability to show basic counting
Your early projects should not try to
extend beyond this stage
17. Your early data projects should be simple
Focus first on counting
These will be...
● Easier to explain to your
stakeholders
● Faster to build and for
stakeholders to realize value
● Easier to focus on good
infrastructure and process
18. Your early data projects should be simple
Descriptive Analytics are your first
stage where you can actually answer
questions. First point of value for
business users
Businesses spend 1-3 months to get
this into production the first time
They spend 1-3 years to really
implement this well
19. Your early data projects should be simple
1-2 years to do this well
1-2 years to integrate prediction
1+ years to mature modeling into optimization
21. Example Data Pipeline
Modern data infrastructure is evolving. The pipeline above is similar to what I help build at
Teem to deliver the first level of customer-facing and internal insights.
22. However….
23. While most people focus on the technology, the best
organizations recognize that people are at the center
of data science complexity. In any organization, the
answers to questions such as who controls the data,
who they report to, and how they choose what to
work on are always more important than whether to
use a database like PostgreSQL or Amazon Redshift
or HDFS.
Dr. DJ Patil & Hilary Mason
24. Data is more about
process and people
than technology
26. Traditional product development lifecycle
Developing a data product is the same as developing any other product.
Having this process in place will mean more success in your
data endeavours.
Concept: Idea Generation, Research
Assess: Opportunity Analysis, Business Assessment
Develop: Create
Launch: Delivery
27. Understanding the problem to solve
Know the problem to solve and what action will be taken
● Identify the stakeholders
● Document the possible questions your stakeholders have
● Dive deep to find the root problem the stakeholders need to solve
● Identify the action they’re going to take once they have the information
29. Assess what other opportunities there are
Determine the scope of what is needed to answer the question and
figure out who needs to be involved
● What are the short-term and long-term goals for data?
● Who are the supporters and who are the opponents?
● Assuming we do this perfectly, what will we build first?
● What is the most evil thing which can be done?
33. Plan to review the value of your insights
Lifecycle stage: Launch + Maintain (Delivery)
Data reports can become irrelevant and errors can arise, so it
is important to review the data on an ongoing basis
● Review dashboards: is data still relevant and actionable?
● Metrics meetings: does everyone still understand the data and are there new
definitions which need to be evaluated?
● Domain-specific reviews: meet with stakeholders and see what data is
valuable to them and what actions they take.
34. You win by continuing the product development
You win by continuing the product development lifecycle,
starting with data basics, and progressing data complexity
over time.
35. Rinse and Repeat
36. To succeed with data
● Evolve your data complexity over time and start simple
● Data is more about process and people than technology
● Practice good product development and process
● Know the problem you’re solving and the action that will be taken
37. References and Resources
● DJ Patil & Hilary Mason (2015) Data Driven. Sebastopol, CA: O’Reilly
● DJ Patil (2011) Building Data Science Teams. Sebastopol, CA: O’Reilly
● Monica Rogati (2017) The AI Hierarchy of Needs
● Nick Crocker (2014) Thirty Things I’ve Learned
● Tom Goodwin (2018) The Battle Is For The Customer Interface
● Tavish Srivastava (2015) 13 Tips to make you awesome in Data Science / Analytics Jobs
● Daniel Tunkelang (2017) 10 Things Everyone Should Know About Machine Learning
● Timo Elliott (2018) Predictive Is The Next Step In Analytics Maturity? It’s More Complicated Than That!
● DJ Patil, Everything We Wish We'd Known About Building Data Products
38. Good Luck!