Bertil Hatt, Presented at the Elite BCS Conference in Manchester on 2018/05/17: Why you need good analytics, great engineers and scalable data practices before you hire data scientists and build machine learning models.
1. Artificial Intelligence
in the real world:
Where to start?
Bertil Hatt
Head of Data science, RentalCars.com
BCS ELITE Conference Thursday, 17th May 2018
‘Unlocking the Potential of Artificial Intelligence’
Manchester DoubleTree by Hilton Hotel
1 This presentation was prepared for the BCS, The Chartered Institute for IT. It
was meant for an audience of CTOs curious about Artificial Intelligence. The
programme opened with a talk about the exciting possibilities of Artificial
Intelligence; this presentation followed it. This is a talk on what to do before you
can unlock the potential of Artificial Intelligence.
It relies on a handful of premises about the audience (detailed as questions to the
audience during the opening section): the audience is technically informed, but not
familiar with implementing machine learning models.
The point is to talk about what to do *before* you have a machine learning model in
place, and what the classic traps are along the way.
Why am I talking to you?
2 I was able to write down the following warnings not because I did a lot
of amazing things, but because I worked with many institutions; the main ones
are listed here. I was able to compare, from the trenches, those who successfully
implemented machine learning with those who struggled. Actually, most of the
companies that I have helped are not on this list: friends who grew frustrated reached
out for advice, or conference speakers explained how they got things to work
well.
None of the content is meant to be controversial, but if there is anything you object
to, it is entirely my fault: none of the institutions mentioned have sanctioned what
I’m saying here.
This presentation, by design, is going to sound quite negative: it’s a list of
warnings. It is also a list of solutions.
Hatt Bertil - Elite BCS 2018/05/17 - Get started with AI - 31 May 2018
Why am I talking to you?
3 I’m also here because I gave this talk to many data scientists. Their reaction was
not “this is cool”; they were incredibly relieved, even ecstatic: “Thank you for finally
saying this out loud.” “Someone had to tell them!” I’m not sure who “them” was; I
hope it’s you.
Artificial intelligence, Machine learning, Data science: all of those are amazing things
to do. They happen to be victims of their (deserved) success. I’m telling you what I
learned from trying many times, and failing almost as often.
Finally, as you might be confused: my employer is known as RentalCars. The
company belongs to the same holding and was acquired by Booking.com on
January 1st. This is why you’ll see me as belonging to either RentalCars,
Booking.com or even BookingGo, which is an internal brand. Rideways is a new
service that we are starting, so that’s four possible labels. All of those are valid, and
essentially the same company.
Raise your hand
4 This is the interactive part of the presentation.
I was not familiar enough with the audience, so this is me checking that my
assumptions were right.
Keep your hand up if
•You manage technical people
(engineers, dev.-ops, analysts)
•Your company is organised with
product teams, product owners
•You know “Joel Spolsky’s test”
5 The assumption is that you want to make decisions about what the technical people in
your organisation work on, and who to hire (seniority, speciality).
I also assume that you have a network of teams, coordinated by people known as
product managers (PMs) or product owners (POs), who define short-term
objectives and priorities and allocate technical resources. The main question here will
be: what to do, in what order? What should you care about?
Another key aspect here is the happiness of technical people. If you are familiar
with Joel Spolsky’s test, it’s a good reference to have in general. I have taken
inspiration from him and suggest an adapted version for (hiring) data scientists. If
you don’t know what Joel’s test is, check it out; it’s a good reference to have.
What I could talk about
•Should you think about AI?
•What should you do first?
• What project to start with?
• Who should you hire first?
•How to retain & grow data scientists?
• How to make your team effective?
•How to prepare for a project?
•How ready are you for AI?
6 There are many things I could talk about, and I’m happy to answer questions
around how to do data science well, effectively, at scale, and what the right
skillsets are; but I think the two points that are under-discussed are what to do to
prepare for data scientists, and how to surround them.
Far too many companies want to do AI, but they have no idea if they are ready for it.
When I hear that, it reminds me of a joke a friend had: asked if he knew
how to speak Arabic, he’d quip: “I don’t know, I never tried.”
Being ready for data science is a little bit the same: if you are not sure, you very
likely need a lot of effort before you can be ready.
Raise your hand
7 That’s a stupid trick I used to get people to participate: I get people to keep their
hands up first, then I ask questions. It changes the default of participation, and you
get more honest answers. Plus it’s a little bit of physical exercise.
Keep your hand up if
•You are planning to invest in AI
in your organisation in 1-2 years
•You do not have a model running
& self-updating in production today
•You want to hire a Data scientist
8 The assumption here is that you care about AI as something you are actively
planning to do.
The other assumption is that you are not ready, or at least haven’t tried. If you have,
there are a lot more presentations on how to improve your AI; don’t waste time
with my rambles and go watch those.
The last assumption, and that’s where it gets tricky, is that you want to hire a person
who you think is a specialist. That introduces a connection between artificial
intelligence, machine learning and data scientists. I’m not going to detail how those
are related.
If you have your hand up, this presentation is for you
Don’t
9 First surprise. Gets people’s attention.
If you hire someone fresh out of the Data science institute, or any statistical
Master’s, they’ll be great at training a model from a cleaned up dataset. However,
in a vast majority of cases, they won’t know what to do to get that dataset in the
first place; they’ll have little idea how to serve that model to your display service in
a fast, reliable, scalable manner. Most of them will get incredibly frustrated. Many
will leave within six months.
A friend of mine, Peadar Coyle, is trying to brand that the “trophy data scientist”. It’s
a reference to a sexist trope, but I think it’s a good angle on the problem.
This presentation is here to help you avoid that, and figure out what type of person
you should hire.
Hire a Head of
Data & Reporting
first
10 At every company I’ve worked for (except, to a minor extent, Facebook), one key
overlooked problem was getting a clear understanding of what the service was
actually doing. Even at Facebook, many key aspects of what the company was
doing in 2015 were still unclear, as recent changes have demonstrated. Don’t be
ashamed because you are naked in Eden: admit that you need help, and hire that
help.
William Gibson said that “The future is already here — it's just not very evenly
distributed.” That is absolutely true of Data science. Many employers had great
data science in some aspects, in one corner of the company — but large
uncertainty elsewhere. You can try starting data science before you have a good
overall vision. I wouldn’t recommend it, though.
Then raise
their budget
11 The real issue with processing data is that it’s expensive. There are roughly two
roles:
- engineers who build pipelines and large-scale technology to copy data; those
tend to be very expensive specialists, so you want a budget for those;
- people in charge of applying business logic to that data, typically called
analysts, sometimes data scientists. You might need many of those, to cover
your company’s different products. They typically build pipelines that others use,
rather than provide a lot of insights themselves.
Don’t confuse them: engineers generally hate handling the frequent changes in
business logic; analysts typically don’t know code well enough, but they can
learn a lot from engineers.
Then raise
their budget
12 If you want to leave the room or close this presentation now, that’s fine with me.
You can spend a year hiring a head of data, grow their team and have them fix all
the inconsistencies, that will make me very happy and you will probably not delay
by a minute getting usable data science in production. We can be back here next
year. I’ll explain next steps to you then. I’m serious, if you stand up now, I’ll
understand. And I will thank you: you are taking this seriously.
Getting good data is very likely going to be tedious and expensive, and it absolutely
helps your machine learning effort. Actually, that’s the only part that I’m not
convinced can be automated.
Plan
• Who is a data scientist
• 12 questions to tell if you are ready
• Five ways to audit denormalised datasets
• Nine steps to a model in production
• Common questions
• What project to start with
• The future for data science
13 This is going to be a very long presentation, and I might not reach the end in an
hour. Very sorry about that, but you just got the key part of it, so I’ve fulfilled my
promises.
First, I’ll try to clarify what the expectations are, or lack thereof, for the kind of
specialist you expect to hire; that should help you understand the rest of the
presentation. Then I’ll help you decide if you want to hire one. At that point you’ll be
tired, so I’ll make some nerdy jokes at the expense of my previous colleagues. If you
are ready for more, I’ll then talk about how to get to a working machine learning
model in production.
Plan
• Who is a data scientist
• 12 questions to tell if you are ready
• Five ways to audit denormalised datasets
• Nine steps to a model in production
• Common questions
• What project to start with
• The future for data science
14 Please ask questions throughout, because if you don’t, I’ll answer my own
questions, or rather the questions that I most commonly get from audiences.
Feel free to raise your arms and ask questions now. I’m happy to address your
concerns now rather than when my long monologue has completely sapped all
your energy.
If I talk too fast, raise your arms. If I talk too slowly, or not loud enough, or too loud,
or I don’t articulate properly, raise your arms; all of those have happened before, so
feel free to interrupt me if anything is wrong.
Who is a
data scientist?
15 This is a very loaded question, because there is no clear answer — and that’s
actually by design.
“Data scientist” is a very
ambiguous word (by design)
• DJ Patil & Jeff Hammerbacher describe their job in 2008
https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century
• Most narrow: Familiar with self-learning statistical models
• May include: Engineering, Dev Ops, UX or ML Research
16 The original use was a shorthand to describe the role of two incredible people.
DJ Patil single-handedly built most of the core features of LinkedIn and later
became the Chief Data Scientist of the United States Government. Jeff
Hammerbacher was pretty much the same thing at Facebook and left to tackle
cancer, also pretty much single-handedly. They were able to do so much on their
own: develop a product, gather data from that product, store large-scale
information, model, serve those models. Hardly anyone can compare; hardly
anyone should.
Nowadays, most people who use the title on LinkedIn are people familiar with one
or several self-learning statistical models, what is known as Machine learning. They
may, but generally don’t, know how to write great front-end or back-end code;
how to handle alerts on a server; or how to conduct research on users.
That’s the point: data scientists are people gathered around a basic set of skills, so
you always want to specify what you are looking for.
What do we mean
by ‘data scientist’
Analyst: uses data to answer
ad hoc questions
uses data streams to build alerts,
automated reports & dashboards
Statistician: build statistical models
17 A good way to detail that is to think about this gradient, between ad hoc
applications and higher-level abstractions.
The most generous definition of “data scientist” actually covers people who can
answer individual business questions based on a static dataset. Those are
classically called analysts, and I don’t think of that as pejorative. They tend to find far
more nuanced insights, but tend not to think about how to automate their work.
The next step is people who still use fairly ad hoc models, but for alerting the
business about issues: they typically work on anomaly detection, impact
assessment, A/B testing. They are familiar with some statistical tests and confidence
intervals, but still expect that their findings will be reviewed by a human, and
tolerate some clear errors.
What do we mean
by ‘data scientist’
Analyst: uses data to answer
ad hoc questions
uses data streams to build alerts,
automated reports & dashboards
Statistician: build statistical models
18 Those can often transition to full statisticians: people who can train a model, make
an inference and understand non-trivial questions around representativity, bias and
unbalanced datasets. Typically, people here have a Master’s in statistics or
equivalent.
Where things get interesting, and that’s where most people put a limit, is whether
those models are served to the business with the results reviewed case by case, or
whether the results are used to show a suggestion or a decision to users. Those models
need to be not just accurate, but fast, and monitored well: if they develop a bias,
you want to detect and correct that fast. Most people on the market can apply
models, but you rarely find the same person able to question the data that they
find, to have “business sense”.
What do we mean
by ‘data scientist’
Analyst: uses data to answer
ad hoc questions
uses data streams to build alerts,
automated reports & dashboards
Statistician: build statistical models
build self-updating models
used to automate decisions
ML researcher: imagines new types of models
19 That’s why you want to clearly define whether you want someone in that role able to
explain the impact and implementation of their model. Responsibilities like that can
justify very wide differences in value for the company and in compensation, which is
why having that scale in mind really helps to set expectations.
At the tail end, you have a smaller group of people who actually rarely look at
business applications, but think about improving and developing new statistical
models. Those are typically academic careers, with the very recent evolution that
they work for large internet companies who need better, more scalable models, and
they can get paid insanely well. The crazy salaries are typically for those people,
because they can single-handedly make, say, all the image processing at Google,
Facebook, Apple or another, orders of magnitude more efficient, or more accurate.
That’s very rarely the first profile you want to hire.
Difficult questions
• Can a data scientist set-up, manage and maintain a
server to host the model served to production?
• Can a data scientist understand the production code to
fix a faulty data log?
• Can a data scientist write a method used in production to
gather the data consistently?
20 For those three questions and many like them, the real answer is: maybe, but
those people are typically rather rare. An easier answer is: your (specialised) engineers
can do it better and faster, so rely on those first. You can and should teach your data
scientists, by asking them to work closely with those engineers and learn from
them; it will not only make them autonomous but also teach them to think like
engineers (be lazy, automate, delegate, define, document, test), which is really the
mindset that you want to cultivate.
At the same time, by explaining their models to engineers, data scientists will help
them grasp statistics, a very helpful skillset.
Don’t look for a unicorn: just have complementary profiles learn from each
other.
What questions to
tell if you are ready?
21 This is (finally) the crux of the presentation: how can you tell if you should get
started with machine learning?
The Joel test
Joel Spolsky’s test for
code-based projects:
• 12 Yes/No
• Easy to tell
• Easy to start
https://www.joelonsoftware.com/2000/08/09/the-joel-test-12-steps-to-better-code/
22 Joel Spolsky is a thought leader in how to make great software. There are many
complicated ways of estimating the quality of an environment; he wanted to simplify
them into a short checklist. Each item can be set up by an individual decision; some
are relatively expensive, but none is a head-scratcher. They are not really dependent
on each other, which is something I haven’t been able to achieve.
A not irrelevant question is:
- How much do you score on Joel’s test?
Happy engineers and good code generally mean good data, and that is key for
machine learning too.
- Were you surprised by the score?
12-point test for readiness
to implement data science
• Three questions on your strategy, effort & product team.
Do they make sense? Are they aligned?
• Six questions on your data: Is it easy to grasp, up to date?
Are your analysts building the right datasets?
• Three questions on where a model would find its place.
23 If you noticed a heavy weight around data quality: congratulations! You have won
six more slides of me being very, very insistent about it.
But it’s not just data: clear projects, and an engineering structure ready for data
science, are also very helpful. The test reflects those.
Are your product &
goals clearly defined?
Is the company goal broken down into team metrics in a
mutually exclusive, completely exhaustive (MECE) way?
Can a new team member describe the current effort
& its impact on metrics at the end of their first week?
Are the metrics known by everyone, i.e. reported, audited
& challenged? Do developers know other teams’ key metrics?
…
24 Why care so much about unrelated employees knowing those metrics? Because of
unintended consequences: a team focused on growth might push for low-quality
profiles, and the team in charge of converting those profiles might object to this
objective, and prefer something more balanced, with reasonably relevant incoming
traffic. The question really is: do you give space for employees to challenge the
metrics?
As you can notice, none of those questions are about machine learning: this is by
design. The question here is: are you ready, before starting data science?
I will define targets for data science in the next section, for when you have data
scientists on board.
Suggestions to
improve product clarity
• “What we do” document for all to read on their first day
• Introduce all the concepts, the strategy, the teams
• Includes orders of magnitude to explain priorities
• Quiz employees often on key numbers, other teams
• Show explicit interest in their overall understanding
• Detect issues, tension, miscommunications
25 No comments here.
Those are just good ideas.
Is your data easy to get?
Do you have an up-to-date metrics dictionary with a
detailed explanation of each number, incl. edge case?
Are the reference analytical tables audited daily?
Totals add up with production, trends make sense, etc.
Are simple data requests answered within 10 minutes?
Can a question that fits in two lines be answered with a single query?
…
26 For all of those, ask the most junior analysts. Managers are often removed from the
grind of getting data: they ask, go away, and wonder why it takes so long. You want
analytics managers to understand the struggle, so they can invest in fixing it and
making analysts productive.
“10 minutes” sounds ridiculous to many analysts: they typically have to work for
days for any insight. It is not a pipe dream: it’s the result of significant effort in
documenting, pre-processing and training analysts to make each other more
efficient. How valuable would it be for your organisation to get there? How nimble
and informed could you be? This is why I argued for significant investment in data
and reporting, far more than what the directly measurable impact would seem to allow.
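To make the “10 minutes” and “single query” targets concrete, here is a minimal sketch using an in-memory SQLite database; the table and column names are invented for illustration. A two-line business question maps onto one query against a wide, pre-joined denormalised table:

```python
import sqlite3

# Hypothetical wide, denormalised bookings table: one row per booking,
# already joined with customer, locale and calendar attributes.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bookings_denormalised (
        booking_id INTEGER, country TEXT, day_of_week TEXT,
        support_calls INTEGER, satisfaction_score REAL
    );
    INSERT INTO bookings_denormalised VALUES
        (1, 'UK', 'Mon', 0, 4.5),
        (2, 'UK', 'Sat', 2, 3.0),
        (3, 'FR', 'Sat', 1, 4.0);
""")

# The two-line question "Does satisfaction dip at weekends, per country?"
# becomes one query, with no joins left for the analyst to debug.
rows = conn.execute("""
    SELECT country, day_of_week, AVG(satisfaction_score)
    FROM bookings_denormalised
    GROUP BY country, day_of_week
""").fetchall()
print(rows)
```

The point is not the query itself but that all the joining and cleaning happened upstream, once, for everyone.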
Suggestions to
improve data quality
• Ask an analyst: Why is that number hard to compute?
• Share the pain & build empathy with engineers
• Analysts value consistency, precision, reusability
• Ask about 1st, 2nd, 3rd normal form & denormalisation
• Analysts often have insufficient training in data structures
• Understand log structure (no edits) vs. status table
27 It might be counter-intuitive, but most analyst work actually involves spending far
more time handling non-essential corrections than producing insight. That imbalance
is extremely common and can be compensated for by good tooling; but understanding
what analysts care about, and what they spend their time on, will open a window into
unexpected issues.
Spoiler alert: you will hear very confusing things about:
- time zones, and how they relate to a subset of cities, not countries;
- the distinction between countries, territories and locales;
- the spelling of Munich; the status of Taiwan, Hong-Kong;
- which week number January 1st falls in (52 or 53, both wrong);
- why Christmas, Easter and Ramadan are so inconvenient;
- finally, if you are lucky, how to represent a discrete log-scale.
This large imbalance between impact and time spent is not ideal, so if you want to
compensate for it, invest in tools that handle those confusing details gracefully.
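The week-number confusion is easy to reproduce with Python’s standard library: under ISO 8601, January 1st often belongs to week 52 or 53 of the *previous* ISO year, which is exactly the kind of detail analysts burn time on:

```python
from datetime import date

# ISO 8601 assigns 1 January to the last week of the previous ISO year
# whenever the week containing it has fewer than four days in the new year.
y, week, weekday = date(2016, 1, 1).isocalendar()
print(y, week)    # 2016-01-01 belongs to ISO year 2015, week 53

y2, week2, _ = date(2022, 1, 1).isocalendar()
print(y2, week2)  # 2022-01-01 belongs to ISO year 2021, week 52
```

So “which week is January 1st?” genuinely has both 52 and 53 as answers, depending on the year, and neither is week 1.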
Extract Transform Load
Production database → Read replica → Analytics database
Normalised tables: references, events
28 The classic way to get data is to have a production database, copied into a read
replica. You use that replica to make a copy in your analytics database, considered
off-line, as is: if it fails, your server is still up.
Those tables, if your engineers do their job well, should:
- be heavily normalised, including reference tables and fact tables, all with strict
unique IDs;
- only ever be appended to, never updated; or you might have complementary
state tables.
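As a minimal sketch of that append-only discipline (table and event names are invented for illustration), an event log is only ever inserted into, and any “current status” is derived from the log rather than edited in place:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Append-only event log: rows are only ever inserted, never updated,
# so history is preserved and copying to the replica stays trivial.
conn.execute("""CREATE TABLE booking_events
                (booking_id INTEGER, event TEXT, at TEXT)""")
events = [
    (42, "created",   "2018-05-01T10:00:00"),
    (42, "paid",      "2018-05-01T10:05:00"),
    (42, "cancelled", "2018-05-02T09:00:00"),
]
conn.executemany("INSERT INTO booking_events VALUES (?, ?, ?)", events)

# Complementary state view: the latest event per booking, derived
# from the log instead of maintained by destructive updates.
row = conn.execute("""
    SELECT booking_id, event FROM booking_events
    WHERE at = (SELECT MAX(at) FROM booking_events AS e2
                WHERE e2.booking_id = booking_events.booking_id)
""").fetchone()
print(row)
```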
Extract Transform Load
Production database → Read replica → Analytics database
Normalised tables: references, events
Denormalised individual tables:
- Per item with all information
- UTC, local time, DoW, ∂
- One table per concept
Automated audits & alerts
29 The first role of analysts should be to build and maintain ETL processes that
denormalise those tables into large reference tables containing all the relevant
information in one place. Any reasonably simple question should be answerable with
one simple query on one of those reference tables.
For that, those tables need to be heavily augmented with things like local time, day of
week, the difference between one operation and the next, joins with relevant
aggregates, etc. This is necessarily a constant effort: keeping up to date with
product changes, and adding key concepts introduced by influential ad hoc analytics.
Those tables can and also should be augmented with machine learning inferences,
checks and audits.
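A toy sketch of that augmentation step, with invented event data; the local-time offset is hard-coded for illustration rather than looked up per city, and ∂ is the gap between consecutive events:

```python
from datetime import datetime, timezone, timedelta

# Hypothetical raw event rows: (event_id, UTC timestamp).
raw = [
    ("a1", "2018-05-17T08:30:00"),
    ("a2", "2018-05-17T11:45:00"),
]

local_offset = timedelta(hours=1)  # assumed fixed offset, e.g. UK summer time

augmented, previous = [], None
for event_id, ts in raw:
    utc = datetime.fromisoformat(ts).replace(tzinfo=timezone.utc)
    local = utc + local_offset
    augmented.append({
        "event_id": event_id,
        "utc": utc.isoformat(),
        "local": local.isoformat(),
        "day_of_week": utc.strftime("%A"),
        # ∂: seconds since the previous event, precomputed once for everyone
        "delta_s": (utc - previous).total_seconds() if previous else None,
    })
    previous = utc
print(augmented)
```

Precomputing these columns once, in the ETL, is what spares every analyst from re-deriving them in every query.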
Extract Transform Load
Production database → Read replica → Analytics database
Normalised tables: references, events
Denormalised individual tables:
- Per item with all information
- UTC, local time, DoW, ∂
- One table per concept
Aggregated tables:
- Pageviews to sessions
- Customer first & last
- Daily metrics (all)
Automated audits & alerts
30 You can use a distinct service rather than relational databases to do that, but the
structure remains the same: a single query should allow you to correlate, say,
customer satisfaction with number of calls to the service, day of week, language,
etc., all in one single query on what has to be a very wide table, with a lot of
columns (or an equivalent data structure). Checks on those tables should be pretty
much constant and extensive, with the tests and alerts publicly known, so that if any
table isn’t matching expectations (unicity, exhaustivity, etc.) analysts know and don’t
assume.
All the common aggregates and reports should also be based on a chain of pre-
computed aggregations, to save time. Both the denormalisation and the
aggregation should be done by a system user with strong priority or a wide swim
lane: all uses depend on those tables being available.
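The kind of constant, publicly-known check described above can be as simple as the following sketch (all names and figures are invented), comparing a denormalised table against totals reported by production:

```python
# Minimal audit sketch for a denormalised table: unicity, exhaustivity,
# and totals adding up with production. Names are illustrative only.
denormalised = [
    {"booking_id": 1, "total": 120.0},
    {"booking_id": 2, "total": 80.0},
    {"booking_id": 3, "total": 55.0},
]
production_total = 255.0   # figure reported by the production system
production_count = 3

ids = [row["booking_id"] for row in denormalised]
failures = []
if len(ids) != len(set(ids)):
    failures.append("unicity: duplicate booking_id")
if len(ids) != production_count:
    failures.append("exhaustivity: row count differs from production")
if abs(sum(r["total"] for r in denormalised) - production_total) > 0.01:
    failures.append("totals do not add up with production")

# Publish the outcome so analysts know, rather than assume.
print("AUDIT OK" if not failures else "AUDIT FAILED: " + "; ".join(failures))
```

In practice each table would have its own battery of these, run on every refresh, with the results visible to every analyst.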
Extract Transform Load
Production database → Read replica → Analytics database
Normalised tables: references, events
Denormalised individual tables:
- Per item with all information
- UTC, local time, DoW, ∂
- One table per concept
Aggregated tables:
- Pageviews to sessions
- Customer first & last
- Daily metrics (all)
Machine learning training datasets
Reports
Automated audits & alerts
31 Each of those levels of abstraction has its own use:
- aggregates should be used exclusively for reports; any report based on a detailed
reference table should presumably be automated;
- machine learning training typically needs individual data, so the occasional
extensive load there is to be expected, but still controlled by privacy access;
- straight-from-production normalised tables should be accessible, but with clear
expectations that they are only used to audit an inconsistency; they should be
monitored so that any significant query on them is converted into the
denormalisation and data-augmentation process rather than relying on ad hoc
integration. All system and non-system users should actually be monitored for
efficiency, and the most expensive queries discussed regularly.
Extract Transform Load
Production database → Read replica → Analytics database
More services
Normalised tables: references, events
Denormalised individual tables:
- Per item with all information
- UTC, local time, DoW, ∂
- One table per concept
Aggregated tables:
- Pageviews to sessions
- Customer first & last
- Daily metrics (all)
Machine learning training datasets
Reports
Automated audits & alerts
32 Finally, that process should be able to handle new, disparate services without
much problem, as ingestion and most processing should be done in parallel.
That overall structure can obviously be challenged, but the key idea remains: make
sure at all times that as much pre-computation as you can anticipate is done, so
that analysts have it as easy as possible to get data. This will make even very junior
analysts, or project leaders, far more autonomous and effective. It will also make
many Business Intelligence tools far easier to integrate.
Is your data up-to-date?
Do you use version control on the ETL?
Do you monitor your analysts’ queries for bad patterns?
Do you talk about improvements at a weekly review?
Is there more than three days between a product release
& subsequent ETL update (tested, reviewed, pushed)?
Is the team handling the ETL aware of the product
schedule, including re-factorisation & service split?
…
33 Well, some of those questions have been shamelessly taken straight from Joel
Spolsky.
As explained earlier, there should be monitoring of non-system users (expensive
joins, new table creation, use of the normalised ‘production’ tables) to track
inefficiencies. Those monitors are less there to police good usage than to provide
insight into analyst and data scientist querying habits, drive efficiencies and, more
importantly, identify good opportunities for learning. Monitoring and identifying
expensive queries is a simple, objective pretext to teach good, legible, effective SQL
to all analysts.
Suggestion to improve
data processing
• Senior analyst & Data engineer monitor biggest queries
• Exclude production, system user
• A teaching opportunity for all: inefficiencies, indexes, etc.
• Product documentation is shared responsibility between
the product owner and the product analyst:
• Stand-Up, Task board, Releases, A/B test, Dashboard
• Team objectives, product description, prioritisation
34 One key disruptor for ETL pipelines is product changes. With analysts embedded
in product teams, you can prevent bad surprises. You can have a prototype ETL
gather data from the testing environment to develop both product and ETL in
parallel; you can have tolerance for a faulty ETL and a service agreement; but
overall, you want to monitor that closely.
That is only possible with a tight working relationship between prioritisation, release
cycles, the continuous integration tool and the testing framework (typically the
product owner’s responsibilities) and dashboards and data definitions (typically the
analyst’s role). I don’t really have a trick to handle that, other than to have both play
with each other’s toys.
Are you logging your
(placeholder) model?
Do you have estimates (from past A/B tests) of
how model quality impacts your metrics?
Do you have a placeholder machine learning environment
with a training server fed from the analytics pipeline?
Do you log the user input, suggestion, user action &
model version? Are those logs processed & audited?
35 Finally, we can talk about models. Actually, about where they should sit, because,
remember: we do not have a data scientist or models yet!
The best way to be welcoming to a model is to build a placeholder: a service that
stands in for what would be the model. That service can be very naive (returning
fixed recommendations, or decisions without inference), but it is software separate
enough to have its own trace.
That service can be studied through an A/B test.
That service can be augmented with the ability to get data from either the analytics
server, or a server dedicated to model training.
That service can be equipped with a lot of monitoring: SLAs, delays, input and
output distributions, missing data, etc.
That service can be a third-party service, like Google Cloud Platform’s Cloud ML, or
Amazon Web Services’ SageMaker.
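A minimal sketch, with invented field names, of the kind of structured log line such a placeholder service could emit on every call, capturing the user input, the suggestion, the user action and the model version the slide asks about:

```python
import json
from datetime import datetime, timezone

MODEL_VERSION = "placeholder-0.1"  # illustrative version label

def log_decision(user_input, suggestion, user_action):
    """One structured log line per served suggestion, so the placeholder
    (and any later real model) can be audited and A/B tested offline."""
    record = {
        "at": datetime.now(timezone.utc).isoformat(),
        "model_version": MODEL_VERSION,
        "user_input": user_input,
        "suggestion": suggestion,
        "user_action": user_action,
    }
    return json.dumps(record)

line = log_decision({"query": "compact car"}, ["p1", "p2"], "clicked:p1")
print(line)
```

Because the schema includes the model version from day one, swapping the naive placeholder for a trained model later changes nothing in the logging or the downstream audits.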
Are you logging your
(placeholder) model?
Do you have estimates (from past A/B tests) of
how model quality impacts your metrics?
Do you have a placeholder machine learning environment
with a training server fed from the analytics pipeline?
Do you log the user input, suggestion, user action &
model version? Are those logs processed & audited?
36 All those addenda are presumably not very insightful when running a naive,
placeholder model, but they are typically something that most data scientists
would take a while to build and integrate, while engineers familiar with the front-
end codebase would be at ease setting them up and the dev-ops team able to
maintain them.
These last steps are typically less necessary before a data scientist joins, but as
soon as one has, and starts having a relevant model to test, you want to get
started on building those.
ML in production
Client → Production server (JS) → Production db
37 How does that work in practice?
Well, you still have your display service, and a production database that sends
information to the analytics server.
That information is processed by the pipeline common to analysts and data
scientists.
ML in production
Production
server (JS)
Client
ML server
(Python)
Trained
model
Train, test
& copy
Production db
38 You want to use that information to train a model and host it on a separate service.
20. ML in production
Production
server (JS)
Client
ML server (Python)
Features
Prediction
Trained
model
Train, test
& copy
Production db
39 Now you have a server that, given a set of features from the production display
service, can send back predictions, and you can use those predictions to show
recommendations or decisions to your user.
Can most data scientists handle hosting that server? These are fairly involved
tasks, but standard ones for engineers:
- Containers & Kubernetes
- Release cycle
- Fall-backs & cut-offs
So you can look for people who know both modelling and these skills, or
you hire a statistician and an engineer, and have them work together.
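As a sketch of the fall-back & cut-off point above: the product call site can wrap the model behind a guard, so a failing ML server degrades to a safe default instead of breaking the page. This is an illustrative pattern, not a prescribed library:

```python
def with_fallback(predict, fallback):
    # Wrap a model call so the product keeps working when the ML server
    # fails or returns nothing: serve a safe default instead.
    def safe_predict(features):
        try:
            result = predict(features)
            return result if result else fallback
        except Exception:
            return fallback
    return safe_predict
```

In production you would also log which branch was taken, so the cut-off rate itself becomes a monitored metric.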
ML in production
Production
server (JS)
Client
ML server (Python)
Features
Prediction
Production db
recommendation_placeholder.py
import os
import pymssql
def get_smart_recommendation(input):
    analytics_server = os.environ['ANALYTICS_URL']
    conn = pymssql.connect(analytics_server)
    cursor = conn.cursor()
    cursor.execute('SELECT TOP 5 product_id FROM sales;')
    five_suggested_product_list = []
    for row in cursor:
        five_suggested_product_list.append(row[0])
    return five_suggested_product_list
Callout labels: “Connection”, “Placeholder naive model”
40 How can an engineer start building all of that service without an existing model?
With a placeholder model. Replace the parts in teal with something naive.
Let’s look at how we could implement that in Python:
- import libraries to access the data server & get local keys;
- have the service run a query, say:
- connect to the data server with local tokens;
- compute and load the five most popular items:
not a very personalised recommendation, but one that you can live with;
- fill those into a method that you can use to send your recommendation.
Have all that wrapped into any platform that your engineers like and are willing to
support: Python Flask, H2O, etc.
21. ML in production
Production
server (JS)
Client
ML server (Python)
Features
Prediction
Production db
recommendation_placeholder.py
import os
import pymssql
def get_smart_recommendation(input):
    analytics_server = os.environ['ANALYTICS_URL']
    conn = pymssql.connect(analytics_server)
    cursor = conn.cursor()
    cursor.execute('SELECT TOP 5 product_id FROM sales;')
    five_suggested_product_list = []
    for row in cursor:
        five_suggested_product_list.append(row[0])
    return five_suggested_product_list
Callout labels: “Connection”, “Placeholder naive model”
41 That’s boilerplate code that any developer can easily produce. It doesn’t have a
way to import features, but that’s easy to add. A statistician with no knowledge
of production-ready code might find that particular snippet a little quaint. It is.
They should be at least eager to improve on the naive idea. The great thing is: even
if they are familiar with Python and nothing else, pointing them at this file and this
file alone will let them know how to get data, process it effectively, import all
their favourite libraries, create objects to implement a recommendation engine, and
surface it using the same or an equivalent method. The overhead from, say, a
notebook, should be minimal.
That structure (and good directions) allows them to be effective within hours of
joining the company. An effective data scientist is a happy data scientist; their
usefulness will make them want to stay at the company, where they should improve
on their model gradually, and rapidly bring in a lot of value by automating a lot of
key decisions.
Questions?
Overall goal split MECE in team KPI?
Can newbies describe current impact?
Are the metrics known by everyone?
Up-to-date metrics dictionary?
Are analytical tables audited daily?
Are queries answered in 10 minutes?
Version control on ETL?
Monitor analysts’ queries & feedback?
Less than 3 d. from launch to ETL update?
Estimate impact of model quality?
Do you log input, & user action?
Placeholder ML environment?
42 So, here are all twelve questions.
Go through those again and try to count, honestly, how many points you have.
If you don’t know about one, assume it’s like my friend who maybe speaks Arabic:
count those as a No.
How many points does your organisation have?
22. What should you fix?
• ≤ 3 pt.:
hire engineering managers—your problem is upstream
• ≤ 6 pt.:
hire an experienced analytics manager—to get data good
• 6 - 9 pt.:
hire an experienced data scientist to plan improvements
• > 9 pt.:
get some Master’s students from your local university
43 Low is not bad, but low means you want to focus on different aspects.
Less than three points calls for dramatic measures: forget data engineering, and
start upstream: are your engineers happy? Could you have a more mature
engineering culture overall? Actually, if you have fewer than six points, you
probably should run the Spolsky test too — or something equivalent. Read up on
technical debt, agility, etc. Talk to CTOs of companies that you admire and see
what they care about.
If you are low, don’t hire an experienced data scientist to “do models”. Models are
not going to help you. You have bigger problems than that. Hire someone who,
reading this list, will smile, and who sounds like they understand those problems
and care about fixing them; even better: hire someone who, for most points that I
make, snipes: “Pff, that’s stupid, you should…” because they are probably right.
Give them my details as I can certainly learn from them. Hire someone who says
they want to make other data scientists work more effectively.
What should you fix?
• ≤ 3 pt.:
hire engineering managers—your problem is upstream
• ≤ 6 pt.:
hire an experienced analytics manager—to get data good
• 6 - 9 pt.:
hire an experienced data scientist to plan improvements
• > 9 pt.:
get some Master’s students from your local university
44 If you are low, dedicate engineering resources to the data team. What needs to be
fixed varies, but engineers will rapidly identify issues likely invisible to management.
Keep talking to analysts; see if they get excited by changes like “dynamic table
index reallocation based on the production schema” or “much faster type inference
on the fly”. Notice how they waste less time.
Finally, if you have a good score, start hiring. Go to the local grad school, with a
description of your company’s business and problem & a small sample of your
data; ask the students how they would address it. Hire those who suggest several
options and care to tell early which one would work best. Not all will have a keen
business sense, but if some of them do, the rest can focus on models. You’ll get a
great team in no time.
23. Plan
• Who is a data scientist
• Questions to measure if you are ready
• Five ways to audit denormalised dataset
• Nine steps to a model in production
• Common questions
• What project to start with
• The near future for data science
45 Ok. Here we stand at the half-way point:
We’ve looked at what data scientists are and what you should ask yourself before
hiring them.
We still have another long section on how to get a model into production, and
common questions.
Before we go there, let me admit: this was a very long and detailed first part. The next
part is a tongue-in-cheek look at the kind of actual errors that I saw in the BI of big,
respectable corporations with competent analysts. Errors happen, some are funny;
but to catch them, fix them and be able to laugh at them, you want to check. This
is how.
Audit
Comedy interlude
46 I mentioned that you have these big, wide denormalised reference tables with all the
relevant details.
Building them is essential, but building them is… prone to mistakes. Easy to catch,
obvious, so stupid you could say they are just plain hurtful when you find them.
Those errors generally come from analysts missing something obvious, or
sometimes a real problem in the service.
What kinds of things do actual companies’ analytics get wrong?
24. All of those
examples
are real
And they happened at big,
respectable companies
full of very smart people.
Industries have been
swapped to protect
the guilty.
47 I wanted to use the music from Benny Hill but I don’t know how to make that work
with Keynote, so you get Blackadder. I’ve tried to mask which company was
responsible for which discrepancy, but everywhere I have looked, I have stumbled
on confusing results.
Most of the time, it was because I, or the analyst responsible for building the data
pipeline, overlooked an edge case or misunderstood the logging. Sometimes, it was
a real problem, with actual or attempted fraud using race conditions, unexpected
patterns, etc. The point here is less to audit the website than to audit both the
analyst’s understanding of the data process and the process itself.
THIS ONE IS EASY, YOU’D THINK
ADD UP TOTALS
SELECT
SUM(value) AS sum_value
, COUNT(1) count_total
FROM denormalised_table;
SELECT
SUM(value) AS sum_value
, COUNT(1) count_total
FROM original_table;
-- How much revenue & orders was generated?
SELECT
SUM(subtotal) AS sum_value
, COUNT(1) count_total
FROM denormalised_orders;
SELECT
SUM(subtotal) AS sum_value
, COUNT(1) count_total
FROM orders;
-- How much … per day?
SELECT
SUM(subtotal) AS sum_value
, COUNT(1) count_total
, order_date
FROM denormalised_orders
GROUP BY order_date;
SELECT
SUM(subtotal) AS sum_value
, COUNT(1) count_total
, order_date
FROM orders
GROUP BY order_date;
48 You would be surprised how often even basic numbers like raw revenues or
order counts do not add up.
This is because processing involved handling duplicates, cancellations, etc.
Often, a fact table was only started later, so the discrepancy comes from missing
values before that date, because of a strict join.
With an international audience, the date of an activity requires a convention: should
we associate an event with its UTC date, or should we associate behaviour with
perceived local time (typically, before or after 4AM local)?
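The local-time convention can be pinned down in a few lines. This sketch assumes events are stored in UTC and each user's offset is known, and treats anything before 4AM local as belonging to the previous day:

```python
from datetime import timedelta

def activity_date(event_utc, utc_offset_hours, day_start_hour=4):
    # Shift to perceived local time, then roll anything before 4AM local
    # back onto the previous calendar day (the convention discussed above).
    local = event_utc + timedelta(hours=utc_offset_hours)
    return (local - timedelta(hours=day_start_hour)).date()
```

Whichever convention you pick, write it down in the metrics dictionary: two teams using different cut-offs will never reconcile their daily totals.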
25. CLASSIC
UNICITY
SELECT date
, contact_name
, contact_id
, COUNT(1)
, COUNT(DISTINCT zone_id) AS zones
FROM {schema}.table_name
GROUP BY 1, 2, 3
HAVING COUNT(1) > 1
ORDER BY 4 DESC, 1 DESC, 3;
-- Unique orders & deliveries
SELECT order_id
, delivery_id
, COUNT(1)
FROM denormalised_deliveries
GROUP BY 1, 2
HAVING COUNT(1) > 1;
-- Unique credits
SELECT credit_id
, order_id
, COUNT(1)
FROM denormalised_credit
GROUP BY 1, 2
HAVING COUNT(1) > 1;
49 The most commonly overlooked aspect of database management by analysts is
unicity.
Source fact tables, even logs, typically have extensive unique indexes, because
proper engineering recommends them. When copied and joined along 1::n and n::n
relations, those unicity conditions are commonly broken.
I strongly recommend against a `count` vs. `unique_count` method, because
those are expensive, often silently default to approximations when dealing with
large datasets, and only tell you that there is a discrepancy. Using `HAVING
COUNT(1) > 1` gives you the exact values that are problematic and, more
often than not, lets you diagnose the problem on the spot.
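The same `HAVING COUNT(1) > 1` idea translates directly to code once an extract is in memory; a standard-library sketch, with the key function as an assumption:

```python
from collections import Counter

def duplicate_keys(rows, key):
    # Return every key value that appears more than once, with its count,
    # mirroring the SQL `GROUP BY ... HAVING COUNT(1) > 1` pattern.
    counts = Counter(key(r) for r in rows)
    return {k: n for k, n in counts.items() if n > 1}
```

Because it returns the offending key values rather than a single discrepancy count, you can usually diagnose the join that broke unicity on the spot.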
YOU’D BE SURPRISED
MIN, MAX, AVG, NULL
SELECT
MAX(computed_value) AS max_computed_value
, MIN(computed_value) AS min_computed_value
, AVG(computed_value) AS avg_computed_value
, COUNT(CASE
WHEN computed_value IS NULL THEN 1
END) AS computed_value_is_null
-- , … More values
, COUNT(1) count_total
FROM outcome_being_verified;
-- Customer agent hours worked
SELECT
MAX(sum_hrs_out_of_service) AS max_sum_hrs_out_of_service
, MIN(hrs_worked_paid) AS min_hrs_worked_paid
, AVG(log_entry_count) AS avg_log_entry_count
, COUNT(CASE
WHEN hrs_worked_paid IS NULL THEN 1
END) AS computed_value_is_null
, COUNT(1) count_total
FROM denormalised_cs_agent_hours_worked
15.7425 -22.75305 1155 0 3247936
50 There are less obvious failures to find when you look at representative
statistics, typically extremes and null counts.
I have seen tables where the sum of hours worked was negative, or larger than 24
hours a day — because of edge cases around the clock.
Same thing for non-matching joins:
- you generally want to keep at least one side (left or right join);
- but that means you might get nulls;
- one or a handful is fine, but thousands of mismatching instances can
flag a bigger problem.
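The same representative statistics can be computed on an extracted column; a pure-Python sketch, assuming the column arrives as a list with `None` for nulls:

```python
def basic_profile(values):
    # Min, max, average and null count: the same sanity checks as the SQL
    # on the slide, for one column pulled into a Python list.
    present = [v for v in values if v is not None]
    return {
        "min": min(present),
        "max": max(present),
        "avg": sum(present) / len(present),
        "nulls": len(values) - len(present),
        "count": len(values),
    }
```

A negative minimum on a column of hours worked, as on the slide, jumps out immediately from such a profile.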
26. YOU’D BE SURPRISED
DISTRIBUTION
SELECT
ROUND(computed_value) AS computed_value
, computed_boolean
, COUNT(CASE
WHEN hrs_worked_raw IS NULL THEN 1
END) AS computed_value_is_null
, COUNT(1) count_total
FROM outcome_being_verified
▸
-- Redeliveries
SELECT is_fraudulent
, COUNT(1)
FROM denormalised_actions
GROUP BY 1;
—- is_fraudulent count
—- false 82 206
—- null 1 418 487
—- true 21 477 408
51 Harder to check, without a clear boolean outcome, are data distributions: you want
certain values to be very low, but you might not have a clear hard limit on how
much is too much.
A good example of an inferred feature is a boolean flagging problems. You really
want to have one of those for every possible edge case, type of abuse, etc.
`is_fraudulent` was one of those. Can you see why this one was problematic?
- First surprise: the number of nulls. It was meant to have some (the estimate
was run once a day, so any transaction from the same day would have a null
estimate), but that number was way too high.
- Second problem, even more obvious: is it likely that fraudulent transactions
outnumber legitimate ones by more than two orders of magnitude? Not in this
case: the analyst had flipped the branches of the `if` by mistake.
TO THE TIME MACHINE, MORTY!
SUCCESSIVE TIMESTAMPS
-- Timestamp succession
SELECT
(datetime_1 < datetime_2)
, (datetime_2 < datetime_3)
, COUNT(1)
FROM {schema}.table_name
GROUP BY 1, 2;
▸
-- Checking the timeframe of ZenDesk tickets
SELECT
ticket_created_at < ticket_updated_at
, ticket_received_at < org_created
, COUNT(1)
FROM denormalised_zendesk_tickets
GROUP BY 1, 2;
-- Checking the pick-up timing makes sense
SELECT
(order_created_at < delivered_at)
, (order_received_at < driver_confirmed_at)
, COUNT(1)
FROM denormalised_orders
GROUP BY 1, 2;
52 One common way to detect issues with data processes in production is checking
successions. Certain actions, like creating an account, have to happen before
others, like an action with that account. Not all timestamps have those constraints,
but there must be some. I have yet to work for a company without a very odd issue
along those lines. Typically, those are race conditions, over-written values or
misunderstood event names.
I have seen those both in physical logistics, where a parcel was delivered before it
was sent, and in paperwork, with ZenDesk tickets resolved before they were
created, or even before the type of the ticket, or the organisation handling it,
was created. Having a detailed process for all of these tends to reveal edge cases.
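A succession check is a one-liner once rows are extracted. This sketch returns the offending rows rather than a count, for the same diagnose-on-the-spot reason as the unicity check; the column names are illustrative:

```python
def succession_violations(rows, earlier, later):
    # Rows where the event that must happen first did not: the classic
    # "parcel delivered before it was sent" pattern. Nulls are skipped.
    return [r for r in rows
            if r[earlier] is not None and r[later] is not None
            and not (r[earlier] < r[later])]
```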
27. YOU CAN TAKE NOTES
WHAT TO CHECK
▸ Everything you can think of:
▸ Totals need to add-up
▸ Some things have to be unique
▸ Min, Max, Avg, Null aka basic distribution
▸ Distributions make sense
▸ A coherent timeline
▸ Keep a record of what you checked & expect
53 This is not an exhaustive list, more a list of suggestions to start checking. Expect to
write a dozen tests per key table, and to have most of them fail initially. Other ways
of auditing your data include:
- dashboards that break down key metrics; anomaly detection on key averages and
ratios;
- having one or several small datasets with the same schema as production,
filled with edge cases; process those through the ETL & compare to an
expected output.
Not sure if you find that part as funny as I do, but I can promise you: it’s great to
ask that over-confident analyst about any of those. The 20-something
who swears there is no way they ever did anything wrong makes this
game really fun.
Nine steps to get to a
model in production
54 Ok, back to serious long lists.
After the overly detailed 12 points to check before you get a data scientist, let’s
assume you have hired one; actually, you have hired a head of data,
several analysts who checked their data several times over, and a couple of
engineers who know a lot about machine learning. How do you get all those people
to work together?
28. Five roles
involved
• Product owner
& PO Managers
• Product analyst
• Data scientist
• Product engineer
• ML service engineer
55 First off: I refer to roughly five types of roles.
Those are not exclusive: many companies have more detailed roles, or do not
distinguish between an analyst (someone who knows how to build metrics for a
product team) and a data scientist (in this case, a statistician with an
understanding of engineering). I distinguish the product engineer, who knows the
code that sends data and would integrate things into the front-end, from a support
engineer for machine learning, but either could also be a software architect.
Sequential
People have tried to
jump to higher steps
It hurts harder
56 Unlike the 12 points, these steps are numbered because they have a strict
dependency from one to the next. You can try to skip a step, but so far, everyone
whom I’ve seen try has failed and had to backtrack. Many people want to get to
Step 7: build a model. On its own, that won’t help.
If I had to express how that makes me feel, let me tell the story of a painter, someone
who paints buildings:
29. Sequential
People have tried to
jump to higher steps
It hurts harder
57 The painter sends a quote to re-paint the outside of a tall building, five floors:
- £40k for building the scaffolding, and that will take two weeks;
- £15k for painting, and that will take three days.
The client asks: “Can’t we skip the scaffolding and just paint the five floors in
three days for £15k?”
You’d think most people would know that that doesn’t make sense.
Skipping straight to building a model at Step 7 makes as much sense.
What do you want?
1. Team goals, objectives and key
metrics are defined & documented
2. OKR clear & broken down by key
dimensions in custom reports
Product
manager
Product
analyst
58 The most important lesson from this framework, if you remember nothing else,
is this:
Product owners and product analysts own all of the early work; actually, most of the
first two-thirds, and certainly the most difficult and tedious parts.
Those points can seem obvious, but for every company that I’ve worked with,
defining those objectives is actually quite hard, technically, and can be rather
controversial. Every company, including Facebook, has “fake users”, “inactive
users”, duplicates, etc. Every company has a key activity, but which activity actually
counts is hard to define: someone who connects to a video game or to Twitter, but
doesn’t play a game or tweet anything, are they active? If someone cancels
their only order on an e-commerce platform, are they an active customer?
30. What do you want?
1. Team goals, objectives and key
metrics are defined & documented
2. OKR clear & broken down by key
dimensions in custom reports
3. Solution is clear & the prediction to
be made is defined, measured
…
Product
manager
Product
analyst
Product
owner
59 On the model: it is very rare that product owners come right away with a solution.
Imagine you run Amazon Mechanical Turk, and you want to retain Turkers, the
workers on the platform. You are asked to build a model for the prevention of
de-registration. Two options:
- you can try to list Turkers likely to leave, and send them a gift, or a good
opportunity, to retain them; or
- when a Turker wants to de-register, you compute the likelihood that you can
convince them otherwise; if they probably won’t be convinced, let them leave with
a click; otherwise, have them call a retention specialist.
What do you want?
1. Team goals, objectives and key
metrics are defined & documented
2. OKR clear & broken down by key
dimensions in custom reports
3. Solution is clear & the prediction to
be made is defined, measured
…
Product
manager
Product
analyst
Product
owner
60 The first model involves building a daily score; it’s a slow (not production-speed)
model that someone can audit before acting on it; it probably gets a ton of false
positives, because few people overall leave every day, etc. The second has a lot
less bias, because you only look at Turkers actively trying to leave and you learn
from previous retention efforts, but making it snappy enough might be tricky.
Those are completely different models, with different data, development strategies,
etc., for completely different services. Understanding that is key to working
effectively with machine learning.
Naturally, that example was never really about Amazon Mechanical Turk.
31. Can we do it?
4. Likely improvements and impact
from data properly estimated
5. Model, methodology and dataset
needed to train defined
Product
analyst
(Sr.) Data
scientist
61 Notice how no data scientist was involved yet. Once the product owner has spelled
out what to predict and when, and how an inference would help, you can ask: by
how much?
That’s where a detailed understanding of the product’s ratios becomes key. Say you
find a better way to rank suggestions to a user; they should convert more—but by
how much? For that, pick a metric for the ranking (MRR, AUC) and use previous
improvements to the ranking to estimate the impact. That leads to a proxy ratio:
+0.01 in MRR raises conversions by 0.45%. Those are always approximations, but
they allow you to estimate the impact of a better model. You want your busy data
scientists to work where they can have the most impact.
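As a sketch of how such a proxy ratio gets used (the 0.45%-per-0.01-MRR figure is the illustrative number from the slide, not a universal constant):

```python
def expected_conversion_lift(mrr_gain, pct_per_point=0.45 / 0.01):
    # Linear proxy estimated from past ranking improvements: each
    # +0.01 in MRR has historically raised conversions by ~0.45%.
    return mrr_gain * pct_per_point
```

A project estimated to add +0.02 MRR is then worth roughly +0.9% conversions, which is enough to rank it against other candidate projects.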
Can we do it?
4. Likely improvements and impact
from data properly estimated
5. Model, methodology and dataset
needed to train defined
6. Dataset is available, audited;
monitored daily updates
…
Product
analyst
(Sr.) Data
scientist
ETL Full stack:
Data architect,
Product engineer,
Product analyst
62 Finally, you get the data scientist involved, with two key, rapid tasks:
- pick a model that could fit the problem, based on what we know at that point (no
zipcode for new customers, not until they confirm their first order);
- experienced data scientists can predict the likely accuracy; multiplied by the
impact sensitivity, that lets us prioritise projects accurately.
That leads to a key document: the data needed to train the model. Sometimes,
it’s all readily available, modulo some minor feature engineering.
More often than not, Step 6 takes months. Getting the right data the first time can
be a gruelling, frustrating effort: so many things can go wrong. A reliable dataset
supported by analysts is key, but not always enough. Before committing, run a
basic audit: it can reveal very confusing contradictions, so be mindful.
32. Getting it done
7. Offline model: Objectives defined,
quality & impact estimated from training
8. Model is running in production with unit
test, skew & balance monitored
Data
scientist
ML Engineer
Product eng.
63 Once you have at least partial data (processing months of data, once you have
found a way to get it consistent, can be slow), you can have the data scientist
start the ‘real’ work: training a model, testing, identifying meta-parameters,
optimising, etc.
As much as that step is the most exciting and the coolest, the first cycles are often
very quick, because you generally want to try simple models first.
Then you want that model served in production, with its own service: a lot of
things are needed there, but this is mainly an engineering effort, and that type of
service is not yet standard enough to be free of real engineering risk.
Getting it done
7. Offline model: Objectives defined,
quality & impact estimated from training
8. Model is running in production with unit
test, skew & balance monitored
9. Model is self-updating in production;
new model schedule
Data
scientist
ML Engineer
Product eng.
Data scientist
ML Engineer
64 Finally, once you are confident that your data pipeline is reliable, and that your
training can automatically generate a good model (because you audit and monitor
those too), you can confidently let an automated system update the model. That’s
real machine learning. From there, you might even have it run on its own and free
your data scientists to work on something else.
Your actual model can become increasingly complex, self-optimising, etc., and
there certainly will be major improvements in the techniques available within six
months (thanks to the machine-learning researchers), so you want to go back to it:
schedule some time to investigate those possible improvements. To do that, keep
monitoring the impact of a better model on your key metric, have data scientists
make an informed estimate of how much they can improve models, and prioritise
accordingly.
33. “We just want
a model”
65 One of the key reasons people want “a model” is to serve as a proxy for insights:
How much does this influence conversions? Is fixing this source of churn more
impactful than this blocker for up-sell? Models can help, but those questions are
often better answered by an analyst with a keen product understanding. Even a
model would rely on detailed data, categorised appropriately, driven by product
insights; a good analyst with an intelligent breakdown and clear odds ratios
provides far more useful and legible insights than most models.
Typically, models other than linear regressions are difficult to interpret, so you
probably want to use regressions first for insight-driving models. Regressions are
the simplest models, so you can trust junior analysts to handle them—just check
for collinearity, a common issue that’s easy to fix.
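A quick collinearity screen needs nothing more than pairwise correlations; a standard-library sketch, where the 0.9 threshold is a common rule of thumb rather than a law:

```python
from statistics import mean

def pearson(xs, ys):
    # Plain Pearson correlation between two equal-length numeric columns.
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def collinear_pairs(features, threshold=0.9):
    # Flag feature pairs too correlated to keep both in one regression.
    names = list(features)
    return [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
            if abs(pearson(features[a], features[b])) > threshold]
```

Drop (or combine) one feature from each flagged pair before trusting the regression coefficients.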
Questions?
1. Team goals, objectives and key metrics are defined & documented
2. OKR commented & breakdowns by key dimensions have custom reports
3. Problem for data science is properly defined, measured
4. Likely improvements and impact from data properly estimated
5. Model, methodology and dataset needed to train defined
6. Dataset is available, audited; regular updates scheduled
7. Offline model: Objectives defined, quality & impact estimated
8. Model is running in production with unit test, balance audited
9. Model is self-updating in production; new model scheduled
66 This was yet again a long section.
Do you have any questions on:
- any of the individual steps, the person responsible, the success criteria; but also
on
- the order, and how dependent each step is on the one before?
If you are not tired at this point, I can address individual questions, or I can answer
my own.
34. Plan
• Who is a data scientist
• Questions to measure if you are ready
• Five ways to audit denormalised dataset
• Nine steps to a model in production
• Common questions
• What project to start with
• The near future for data science
67 This is where we stand.
There are two very common questions that I get, so I’ll cover those, but I’m happy to
answer more afterwards.
What project
to start with?
68 The classic question is: what should we start with?
There are often so many ideas (or not a lot of inspiration) that it’s difficult to pick
one. There is also usually quite a bit of pressure for that first project to be a
success, which certainly doesn’t help, as the first project will probably reveal a lot
more issues than you’d like: in the organisation, in the clarity or alignment of
incentives, in the quality or consistency of the data, in the technical debt, in all the
dependencies of a data science project, etc. I might have some controversial
responses there too. Once again: your mileage should absolutely vary.
In increasing order of urgency and interest:
35. Boring, repetitive,
expensive & scaling
• More events, better model
• Simple interactions, decisions
• Low quality: priority for human
69 Who thinks that machine learning is exciting: flying cars, intelligent robots, etc.?
Well, you would be wrong. I mean: those are real, but the reality is that driving or
flying is really boring.
Machines are not really smart yet, but they don’t get bored; they also need a lot of
cases to learn from, for now — so naturally, they are really good at anything that is
repetitive and boring. Find the least valued role, the team with the highest churn.
That’s your best candidate project.
I don’t know your industry, but I’m willing to bet that that is either the Sales team or
the Customer service team. There are tons of great template projects around
generating leads, or rather filtering through a long list of potential leads and ranking
them.
Boring, repetitive,
expensive & scaling
• More events, better model
• Simple interactions, decisions
• Low quality: priority for human
70 Same thing for customer service: natural language processing can help you
prioritise, classify and even suggest default answers to customer contacts; you can
infer tone, evaluate agents, estimate an optimal compensation rate for unhappy
customers before expensive case processing, etc. All acquisition efforts, social
media optimisation and marketing campaigns are also often great candidates; even
small changes typically scale quite effectively.
There are also great projects around already-automated processes: ad auctions,
recommendations. Those are typically already fully digital; they use numbers
exclusively, rather than confusing, ambiguous classifications like customer service
or lead evaluation, so they generally have an easier data-gathering curve.
But the best rule of thumb is: if I did that for 30 minutes, would I be bored?
36. Anomaly detection
Lifetime value, churn
• Understanding what is wrong
• Not losing a customer is far
easier than making a new one
• Understand contradictory
effects between promotions,
efforts to retain
• Both are mainly analyst work
rather than modelling
71 Truth is, the best first project for your data scientists might not be a model to serve
in production, especially if they are a little junior and you have a lot of contradictory
direction you can go, or if your service is a little wonky. There are three types of
projects with fairly standard deployment that really help organisation structure
themselves:
1. Anomaly detection is about monitoring your activity closely, breaking it down in
detail and identifying unusual patterns; that allows you to trigger alerts to analyst;
you don’t need to have client-facing process attached to it, but it really help
identifying issues. If your activity goes up and down, and your analysts spend a lot
of time figuring out why, that should help relieve them a lot. You should rapidly
move from initial naive model to fairly sophisticated time series forecasting, and
that’s always a fun learning exercise for analysts.
I'm happy to present a framework to build that with growing complexity.
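A minimal sketch of such a naive first anomaly monitor, using only the Python standard library; the window size, threshold and booking figures are illustrative assumptions, not from the talk:

```python
# Naive anomaly detection: flag points that deviate from a rolling
# baseline by more than `threshold` standard deviations.
from statistics import mean, stdev

def detect_anomalies(series, window=7, threshold=3.0):
    """Return indices of points that look anomalous compared with
    the preceding `window` observations."""
    anomalies = []
    for i in range(window, len(series)):
        baseline = series[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma > 0 and abs(series[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies

# Steady daily bookings with one sudden spike on day 10:
daily_bookings = [100, 102, 99, 101, 98, 103, 100, 99, 101, 100, 180, 101]
print(detect_anomalies(daily_bookings))  # the spike at index 10 is flagged
```

From a starting point like this, the natural next steps are exactly the sophistication mentioned above: seasonality-aware baselines and proper time series forecasting.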
Anomaly detection
Lifetime value, churn
• Understanding what is wrong
• Not losing a customer is far
easier than acquiring a new one
• Understand contradictory
effects between promotions and
efforts to retain
• Both are mainly analyst work
rather than modelling
72 2. Another similar supporting effort that most analysts can do is first measuring, then
estimating the sum of actualised profits that a customer brings over an
extended period of time, also known as "lifetime value" or LTV. The first effort
should be purely analytical and compute past value, but you will rapidly need to
make assumptions about customers who recently joined. That model allows you to
estimate the long-term impact of certain decisions and compare investments that would
otherwise seem hard to compare, like improving the customer service vs.
spending more on advertising.
Like for anomaly detection, there are standard frameworks to build and model that,
which I'm happy to share.
The key decision associated with LTV is that you want it to be more than your
acquisition cost: how much you spend on advertising to get a new customer,
also known as cost per acquisition or CPA.
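As a back-of-the-envelope illustration of "actualised profits", here is a minimal LTV calculation in Python; the discount rate, monthly profit and CPA figure are invented for the example:

```python
# Back-of-the-envelope LTV: a customer's monthly profits,
# discounted (actualised) back to today and summed.
def lifetime_value(monthly_profits, annual_discount_rate=0.10):
    """Sum of discounted profits over the customer's observed life."""
    monthly_rate = (1 + annual_discount_rate) ** (1 / 12) - 1
    return sum(profit / (1 + monthly_rate) ** month
               for month, profit in enumerate(monthly_profits))

# A customer bringing £10/month for two years, vs. a hypothetical CPA:
ltv = lifetime_value([10.0] * 24)
cpa = 150.0
print(f"LTV £{ltv:.2f} vs CPA £{cpa:.2f}: worth acquiring? {ltv > cpa}")
```

The comparison in the last line is the key decision above: acquisition only makes sense while CPA stays below LTV.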
37. Anomaly detection
Lifetime value, churn
• Understanding what is wrong
• Not losing a customer is far
easier than acquiring a new one
• Understand contradictory
effects between promotions and
efforts to retain
• Both are mainly analyst work
rather than modelling
73 The growth of venture-backed companies is more often than not associated with an
initial effort to increase LTV and find channels with low CPA; once you have both,
you spend as much money as you can on acquisition, hoping your CPA stays low and
your LTV stays higher.
3. Finally, getting customers in is good, but it's often far cheaper not to let them out.
The best way to do that is to understand who leaves, why, and how to address it.
The best approach for that is a framework known as growth accounting: rather
than looking at users in aggregate, you account for newcomers and departures, identify
the last actions and circumstances of leavers, and test retention efforts. Combined with
LTV, that effort is generally very compelling.
I guess you won’t be surprised if I’m offering a standard approach to that too.
A/B testing
• Identifying no-log users
• Splitting by user type
• Including all costs
• Balance & Contamination
• Peeking & Cancelling early
• Supporting clusters
• Documenting hypotheses
74 But the truth is, if you have a statistician joining your software company, the most
impact they can have is by building one simple, key tool: an A/B testing framework.
Initially, you should be able to get all your users in a room. As soon as there are
too many of them to meet regularly, you want to make sure that your changes
add value by confronting the old solution with the new. Those are known as A/B tests
and they are a key feature of any good software company. They are actually
surprisingly hard to do well due to a flurry of issues, most of which are engineering
work, but some of which are classic statistics. Any statistician is familiar with inferential
testing and some are even able to suggest Bayesian models. All will want their
models and efforts to be tested through that prism, so before you start them
on the real work, get them to build a reliable, robust, well-adapted A/B testing
framework.
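As one illustration of the "classic statistics" side of that framework, here is a minimal two-proportion z-test in standard-library Python; the traffic and conversion numbers are invented:

```python
# Minimal frequentist check for an A/B test on conversion rates:
# a two-proportion z-test, standard library only.
from math import sqrt, erf

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for H0: both rates are equal."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Normal CDF via erf; two-sided tail probability.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Variant B converts at 11% vs 10% for A, 10,000 users in each arm:
z, p = two_proportion_z(1000, 10000, 1100, 10000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

The engineering issues listed on the slide (contamination, peeking, clustered users) are exactly what this naive test does not handle, which is why the framework is the hard part.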
38. Where is the
job going to?
75 One question that I get occasionally — although not often — is: can we future-
proof our hires?
It's very difficult to anticipate the future of a job that didn't exist ten years ago, that
was completely transformed several times since, and whose defining feature is that
everyone good at it is actively trying to code themselves out of a job.
In spite of all that, the answer is rather obvious, and similar to any other job in the
next ten years: specialisation and automation.
“Data scientists will go the
way of the webmaster…”
— Hilary Mason
• Automated easy tasks:
• Feature engineering
Model selection
• DataRobot, H2O, Peak.AI
• Less & more productive data
scientists, ML engineer
• Hosted models, auto-testing
standard monitoring & alerts
• More analysts, more PO
• Can’t automate prioritisation
• Data structure make sense
76 Hilary Mason started and used to lead bit.ly's data science team. She is one of the
most talented and eloquent data scientists there are. The best summary of where we
are going is a quote from her, from an interview in January. For her, the label
"data scientist" will probably disappear in ten years; it will go, I quote, "the way of
the webmaster." I guess that word resonates with those of you who have been
"making" websites for a decade or more. Fifteen years ago, every website had its
webmaster: a generalist trying to handle copy, uploading, fighting with the
unruly CSS, content strategy, hosting, etc., all by themselves. Since that confusing
time, hosting platforms, frameworks and specialisation have turned that one job into
different, mostly well-defined specialities: editor, copywriter, designer, Wordpress
integrator, SEO expert, front-end and back-end developer, UX researcher and
also, in a way, data scientist.
39. “Data scientists will go the
way of the webmaster…”
— Hilary Mason
• Automated easy tasks:
• Feature engineering
Model selection
• DataRobot, H2O, Peak.AI
• Less & more productive data
scientists, ML engineer
• Hosted models, auto-testing
standard monitoring & alerts
• More analysts, more PO
• Can’t automate prioritisation
• Data structure make sense
77 There are no more webmasters, just people doing things far better and more
effectively because they have specialised expertise, context and tools.
We should expect data science to do the same. We will all benefit from an
additional layer and several generations of tools; tools to do all the automatable
parts of the job in seconds: feature generation and selection, model selection,
meta-search, optimisation. That means there should be a small increase in
demand for highly paid machine-learning specialists building those tools, but also
less need per company for data scientists working on picking the relevant
features and the perfect model. Some could still do that, but many would rely on
having a machine run through a more extensive checklist while they get coffee. At
the same time, far more companies will be using machine learning, so I suspect
more people will be data scientists, just not the same ones as today.
“Data scientists will go the
way of the webmaster…”
— Hilary Mason
• Automated easy tasks:
• Feature engineering
Model selection
• DataRobot, H2O, Peak.AI
• Less & more productive data
scientists, ML engineer
• Hosted models, auto-testing
standard monitoring & alerts
• More analysts, more PO
• Can’t automate prioritisation
• Data structure make sense
78 Hosting large datasets, monitoring: all that should follow standard frameworks
quite rapidly. And those frameworks should have, like Node and React today, and still to
a certain extent Wordpress, very strong network effects. Current tools have a
bright future.
The parts that seem harder to automate for now are around feature design,
prioritisation, and defining a data structure and schema that accommodate a
given service or feature. Data scientists probably want to spend more quality
time understanding data architects and product owners, because their future job might
focus on the commonalities with those jobs.
More ambitious data scientists might brush up on more abstract mathematical results
like generalisation, ensembling and convergence to build more universally automated
systems. Those aspects, core to data science but often overlooked, could become
essential when automating data science itself.