Artificial Intelligence

in the real world:
Where to start?
Bertil Hatt

Head of Data science, RentalCars.com
BCS ELITE Conference Thursday, 17th May 2018

‘Unlocking the Potential of Artificial Intelligence’

Manchester DoubleTree by Hilton Hotel
1 This presentation was prepared for the BCS, The Chartered Institute for IT.

It was meant for an audience of CTOs curious about Artificial intelligence. The
programme opened with a talk about the exciting possibilities of Artificial
intelligence; this presentation followed it. It is a talk on what to do before you can
unlock the potential of Artificial intelligence.

It relies on a handful of premises about the audience (detailed as questions to the
audience during the opening section): the audience is technically informed, but not
familiar with implementing machine learning models.

The point is to talk about what to do *before* you have a machine learning model in
place; what are the classic traps to get there.
Why am I talking to you?
2 I was able to write down the following warnings not because I did a lot
of amazing things, but because I worked with many institutions (the main ones
are listed here). I was able to compare, from the trenches, those who successfully
implemented machine learning and those who struggled. Actually, most of the
companies that I have helped are not on this list: friends who grew frustrated
reached out for advice, or conference speakers who explained how to get things to
work well.

None of the content is meant to be controversial, but if there is anything you object
to, it is entirely my fault: none of the institutions mentioned have sanctioned what
I’m saying here.

This presentation, by design, is going to sound quite negative: it’s a list of
warnings. It is also a list of solutions.
Hatt Bertil - Elite BCS 2018/05/17 - Get started with AI - 31 May 2018
Why am I talking to you?
3 I’m also here because I gave this talk to many data scientists. Their reaction was
not “this is cool”; it was incredible relief, even ecstasy: “Thank you for finally
saying this out loud.” “Someone had to tell them!” I’m not sure who “them” was; I
hope it’s you.

Artificial intelligence, Machine learning, Data science: all of those are amazing things
to do. They happen to be victims of their (deserved) success. I’m telling you what I
learned from trying many times, and failing almost as often.

Finally, as you might be confused: my employer is known as RentalCars. The
company belongs to the same holding and was acquired by Booking.com on
January 1st. This is why you’ll see me as belonging to either RentalCars,
Booking.com or even BookingGo, which is an internal brand. Rideways is a new
service that we are starting, so that’s four possible labels. All those are valid, and
essentially the same company.
Raise your hand
4 This is the interactive part of the presentation.

I was not familiar enough with the audience, so this is me checking that my
assumptions were right.
Keep your hand up if
•You manage technical people

(engineers, dev.-ops, analysts)

•Your company is organised with

product teams, product owners

•You know “Joel Spolsky’s test”
5 The assumption is that you want to make decisions about what technical people in
your organisation work on, and who to hire (seniority, speciality).

I also assume that you have a network of teams, coordinated by people known as
product managers (PMs) or product owners (POs), who define short-term
objectives and priorities and allocate technical resources. The main question here
will be: what to do, in what order? What should you care about?

Another key aspect here is the happiness of technical people. If you are familiar
with Joel Spolsky’s test, it’s a good reference to have in general. I have taken
inspiration from him and suggest an adapted version for (hiring) data scientists. If
you don’t know what Joel’s test is, check it out; it’s a good reference to have.
What I could talk about
•Should you think about AI?

•What should you do first?

• What project to start with?

• Who should you hire first?

•How to retain & grow data scientists?

• How to make your team effective?
•How to prepare for a project?
•How ready are you for AI?
6 There are many things I could talk about, and I’m happy to answer questions
around them (how to do data science well, effectively, at scale; what the right
skillsets are), but I think the two under-discussed points are what to do to
prepare for data scientists, and how to surround them.

Far too many companies want to do AI, but they have no idea if they are ready for it.

When I hear that, it reminds me of a joke a friend had: someone asked if he knew
how to speak Arabic; he’d quip: “I don’t know, I never tried.”

Being ready for data science is a little bit the same: if you are not sure, you very
likely need a lot of effort before you can be ready.
Raise your hand
7 That’s a stupid trick I use to get people to participate: I get people to keep their
hands up first, then I ask questions. It changes the default of participation, and you
get more honest answers. Plus it’s a little bit of physical exercise.
Keep your hand up if
•You are planning to invest in AI

in your organisation in 1-2 years

•You do not have a model running

& self-updating in production today

•You want to hire a Data scientist
8 The first assumption here is that you care about AI as something you are actively
planning to do.

The other assumption is that you are not ready, or at least haven’t tried. If you have,
there are a lot more presentations on how to improve your AI; don’t waste time
with my rambles and go watch those.

The last assumption, and that’s where it gets tricky, is that you want to hire a person
who you think is a specialist. That introduces a connection between artificial
intelligence, machine learning and data scientists. I’m not going to detail how those
are related.

If you have your hand up, this presentation is for you.
Don’t
9 First surprise. Gets people’s attention.
If you hire someone fresh out of the Data science institute, or any statistical
Master’s, they’ll be great at training a model from a cleaned up dataset. However,
in a vast majority of cases, they won’t know what to do to get that dataset in the
first place; they’ll have little idea how to serve that model to your display service in
a fast, reliable, scalable manner. Most of them will get incredibly frustrated. Many
will leave within six months.

A friend of mine, Peadar Coyle, is trying to brand that the “Trophy data scientist”. It’s
a reference to a sexist trope, but I think it’s a good angle on the problem.

This presentation is here to help you avoid that, and figure out what type of person
you should hire.
Hire a Head of

Data & Reporting

first
10 For every company I’ve worked for (except, to a minor extent, Facebook), one key
overlooked problem was getting a clear understanding of what the service was
actually doing. Even at Facebook, many key aspects of what the company was
doing in 2015 were still unclear, as recent changes demonstrated. Don’t be
ashamed because you are naked in Eden: admit that you need help, and hire that
help.

William Gibson said that “The future is already here — it's just not very evenly
distributed.” That is absolutely true of Data science. Many employers had great
data science in some aspects, in one corner of the company — but large
uncertainty elsewhere. You can try starting data science before you have a good
overall vision. I wouldn’t recommend it, though.
Then raise

their budget
11 The real issue with processing data is that it’s expensive. There are roughly two
roles:

- engineers who build pipelines and large-scale technology to copy data; those
tend to be very expensive specialists, so you want a budget for those;

- people in charge of applying business logic to that data, typically called
analysts, sometimes data scientists. You might need many of those, to cover
your company’s different products. They typically build pipelines that others use,
rather than provide a lot of insights themselves.

Don’t confuse them: engineers generally hate handling the frequent changes in
business logic; analysts typically don’t know code well enough, but they can
learn a lot from engineers.
Then raise

their budget
12 If you want to leave the room or close this presentation now, that’s fine with me.
You can spend a year hiring a head of data, grow their team and have them fix all
the inconsistencies, that will make me very happy and you will probably not delay
by a minute getting usable data science in production. We can be back here next
year. I’ll explain next steps to you then. I’m serious, if you stand up now, I’ll
understand. And I will thank you: you are taking this seriously. 

Getting good data is very likely going to be tedious and expensive, and it absolutely
helps your machine learning effort. Actually, that’s the only part that I’m not
convinced can be automated.
Plan
• Who is a data scientist

• 12 questions to tell if you are ready

• Five ways to audit denormalised dataset

• Nine steps to a model in production

• Common questions

• What project to start with

• The future for data science
13 This is going to be a very long presentation, and I might not reach the end in an
hour. Very sorry about that, but you just got the key part of it, so I’ve fulfilled my
promises.

First, I’ll try to clarify the expectations, or lack thereof, for the kind of
specialist that you expect; that should help you understand the rest of the
presentation. Then I’ll help you decide if you want to hire one. At that point you’ll be
tired, so I’ll make some nerdy jokes at the expense of my previous colleagues. If you
are ready for more, I’ll then talk about how to get to a working machine learning
model in production.

Plan
• Who is a data scientist

• 12 questions to tell if you are ready

• Five ways to audit denormalised dataset

• Nine steps to a model in production

• Common questions

• What project to start with

• The future for data science
14 Please have questions throughout, because if you don’t, I’ll answer my own
questions, or rather the questions that I most commonly get from the audience.
Feel free to raise your arms and ask questions now. I’m happy to address your
concerns now rather than when my long monologue has completely expunged all
your energy.
If I talk too fast, raise your arms. If I talk too slowly, or not loud enough, or too loud,
or I don’t articulate properly, raise your arms: all those have happened before, so
feel free to interrupt me if anything is wrong.
Who is a
data scientist?
15 This is a very loaded question, because there is no clear answer — and that’s
actually by design.
“Data scientist” is a very
ambiguous word (by design)
• DJ Patil & Jeff Hammerbacher describe their job in 2008

https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century 

• Most narrow: Familiar with self-learning statistical models

• May include: Engineering, Dev Ops, UX or ML Research
16 The original use was a shorthand to describe the role of two incredible people.
DJ Patil single-handedly built most of the core features of LinkedIn and later
became the Chief Data Scientist of the United States Government. Jeff
Hammerbacher was pretty much the same thing at Facebook and left to tackle
cancer, also pretty much single-handedly. They were able to do so much on their
own: develop a product, gather data from that product, store large-scale
information, model, serve those models. Hardly anyone can compare; hardly
anyone should.

Nowadays, most people who use the title on LinkedIn are people familiar with one
or several self-learning statistical models, what is known as Machine learning. They
may, but generally don’t, know how to write great front-end or back-end code,
handle alerts on a server, or conduct research on users.
That’s the point: data scientists are defined by a basic set of skills, so you
always want to specify what you are looking for.
What do we mean
by ‘data scientist’
Analyst: uses data to answer

ad hoc questions
uses data streams to build alerts,

automated reports & dashboards
Statistician: build statistical models
17 A good way to detail that is to think about this gradient, between ad hoc
applications and higher-level abstractions.

The most generous definition of “data scientist” covers people who can
answer individual business questions based on a static dataset. Those are
classically called analysts, and I don’t think of that as pejorative. They tend to find
far more nuanced insights, but tend not to think about how to automate their work.

The next step is people who still use fairly ad hoc models, but for alerting the
business about issues: they typically work on anomaly detection, impact
assessment, A/B testing. They are familiar with some statistical tests and
confidence intervals, but still expect that their findings will be reviewed by a human,
and tolerate some clear errors.
What do we mean
by ‘data scientist’
Analyst: uses data to answer

ad hoc questions
uses data streams to build alerts,

automated reports & dashboards
Statistician: build statistical models
18 Those can often transition to full statisticians: people who can train a model, make
an inference and understand non-trivial questions around representativity, bias,
and unbalanced datasets. Typically, people here have a Master’s in statistics or
equivalent.

Where things get interesting, and that’s where most people put a limit, is whether
those models are served to the business with the results reviewed case by case, or
whether the results are used to show a suggestion or a decision to users. Those
models need to be not just accurate, but fast, and monitored well: if they develop a
bias, you want to detect and correct that fast. Most people on the market can apply
models, but you rarely find the same person able to question the data that they
find, to have “business sense”.
What do we mean
by ‘data scientist’
Analyst: uses data to answer

ad hoc questions
uses data streams to build alerts,

automated reports & dashboards
Statistician: build statistical models
build self-updating models

used to automate decisions
ML researcher: imagines new types of models
19 That’s why you want to clearly define whether you want someone in that role able to
explain the impact and implementation of their model. Responsibilities like that can
justify very wide differences in value for the company and in compensation, which is
why having that scale in mind really helps to set expectations.

At the tail end, you have a smaller group of people who actually rarely look at
business applications, but think about improving and developing new statistical
models. Those are typically academic careers, with the very recent evolution that
they work for large internet companies who need better, more scalable models, and
they can get paid insanely well. The crazy salaries are typically for those people,
because they can single-handedly make, say, all the image processing at Google,
Facebook, Apple or another, orders of magnitude more efficient, or more accurate.
That’s very rarely the first profile you want to hire.
Difficult questions
• Can a data scientist set-up, manage and maintain a
server to host the model served to production?

• Can a data scientist understand the production code to
fix a faulty data log?

• Can a data scientist write a method used in production to
gather the data consistently?
20 For those three questions and many like them, the real answer is: maybe, but
such people are typically rather rare. An easier answer is: your (specialised)
engineers can do it better and faster, so rely on those first. You can and should
teach your data scientists, by asking them to work closely with those engineers and
learn from them; it will not only make them autonomous but also teach them to think
like engineers (be lazy, automate, delegate, define, document, test), which is really
the mindset that you want to cultivate.

At the same time, by explaining their models to engineers, data scientists will help
them grasp statistics, a very helpful skillset.

Don’t look for a white elephant: just have complementary profiles learn from each
other.

What questions to

tell if you are ready?
21 This is (finally) the crux of the presentation: how can you tell if you should get
started with machine learning?
The Joel test
Joel Spolsky’s test for
code-based projects:

• 12 Yes/No

• Easy to tell

• Easy to start
https://www.joelonsoftware.com/2000/08/09/the-joel-test-12-steps-to-better-code/
22 Joel Spolsky is a thought leader in how to make great software. There are many
complicated ways of estimating the quality of an environment. He wanted to
simplify it to a short checklist. Each item can be set up by individual decisions; some
are relatively expensive, but none is a head-scratcher. They are not really dependent
on each other, which is something I haven’t been able to achieve.

A not irrelevant question is:

- How much do you score on Joel’s test too?

Happy engineers and good code generally means good data, and that is key for
machine learning too.

- Were you surprised by the score?
12-point test for readiness
to implement data science
• Three questions on your strategy, effort & product team.

Do they make sense? Are they aligned?

• Six questions on your data: Is easy to grasp, up to date?

Are your analysts building the right datasets?

• Three questions on where a model would find its place.

23 If you noticed the heavy weight around data quality: congratulations! You won six
more slides of me being very, very insistent about that.

But it’s not just data: clear projects and an engineering structure ready for data
science are also very helpful. The test reflects those.
Are your product &

goals clearly defined?
Is the company goal broken down into team metrics in a
mutually exclusive, completely exhaustive (MECE) way?

Can a new team member describe the current effort

& its impact on metrics at the end of their first week?

Are the metrics known by everyone, i.e. reported, audited
& challenged? Do developers know other teams’ key metrics?

…
24 Why care so much about unrelated employees knowing those metrics? Because of
unintended consequences: a team focused on growth might push for low-quality
profiles, and the team in charge of converting those profiles might object to that
objective, and prefer more balanced, reasonably relevant incoming
traffic. The question really is: do you give employees space to challenge the
metrics?

As you can notice, none of those are about machine learning: this is by design. The
question here is: are you ready to start data science?

I will define targets for data science in the next section, for when you have data
scientists on board.
Suggestions to

improve product clarity
• “What we do” document for all to read on their first day

• Introduce all the concepts, the strategy, the teams

• Includes orders of magnitude to explain priorities

• Quiz employees often on key numbers, other teams

• Show explicit interest in their overall understanding

• Detect issues, tension, miscommunications
25 No comments here.
Those are just good ideas.
Is your data easy to get?
Do you have an up-to-date metrics dictionary with a
detailed explanation of each number, incl. edge case?

Are the reference analytical tables audited daily?

Totals add up with production, trends make sense, etc.

Are simple data requests answered within 10 minutes?
Can a question that fits in two lines be answered in a single query?

…
26 For all of those, ask the most junior analysts. Managers are often removed from the
grind of getting data: they ask, go away and wonder why it takes so long. You want
analytics managers to understand the struggle, so they can invest in fixing it and
making analysts productive.

“10 minutes” sounds ridiculous to many analysts: they typically have to work for
days for any insight. It is not a pipe-dream: it’s the result of significant effort in
documenting, pre-processing & training analysts to make each other more
efficient. How valuable would it be for your organisation to get there? How nimble
and informed could you be? This is why I argued for significant investment in data
and reporting, far more than directly measurable impact would seem to allow.
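To make the “10 minutes” claim concrete, here is a minimal sketch of the idea: once a wide, denormalised table exists, a two-line business question becomes one simple query. The `bookings_denorm` table, its columns and its rows are all hypothetical, built in-memory for illustration.

```python
import sqlite3

# Hypothetical wide, denormalised bookings table: one row per booking,
# already joined with calendar and geography attributes so that no
# further joins are needed to answer common questions.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE bookings_denorm (
        booking_id INTEGER PRIMARY KEY,
        booked_at_utc TEXT,
        booked_at_local TEXT,
        day_of_week TEXT,
        country TEXT,
        revenue REAL
    )
""")
conn.executemany(
    "INSERT INTO bookings_denorm VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "2018-05-14T09:00", "2018-05-14T10:00", "Mon", "UK", 120.0),
        (2, "2018-05-14T10:30", "2018-05-14T11:30", "Mon", "FR", 80.0),
        (3, "2018-05-19T12:00", "2018-05-19T13:00", "Sat", "UK", 200.0),
    ],
)

# "What is our weekend revenue per country?" -- one query, no joins.
rows = conn.execute("""
    SELECT country, SUM(revenue)
    FROM bookings_denorm
    WHERE day_of_week IN ('Sat', 'Sun')
    GROUP BY country
""").fetchall()
```

The point is not the SQL itself: it is that the expensive work (joins, time-zone conversion, day-of-week derivation) was done once, upstream, by the ETL.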
Suggestions to

improve data quality
• Ask an analyst: Why is that number hard to compute?

• Share the pain & build empathy with engineers

• Analysts value consistency, precision, reusability

• Ask about 1st, 2nd, 3rd normal form & denormalisation

• Analysts often have insufficient training in data structure

• Understand log structure (no edits) vs. status table
27 It might be counter-intuitive, but most analysts’ work actually involves spending far
more time handling inessential corrections than essential analysis. That imbalance
is extremely common and can be compensated for by good tooling, but
understanding what analysts care about, and what they spend their time on, will
open a window into unexpected issues.

Spoiler alert: you will hear very confusing things about:

- time zones, and how they relate to a subset of cities, not countries;

- the distinction between countries, territories and locales;

- the spelling of Munich; the status of Taiwan and Hong Kong;

- which week number January 1st falls in (52 or 53, both wrong);

- why Christmas, Easter and Ramadan are so inconvenient;

- finally, if you are lucky, how to represent a discrete log scale.

This large imbalance between impact and time spent is not ideal, so if you want to
compensate for it, invest in tools to handle those confusing details gracefully.
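The time-zone point above can be shown in a few lines: IANA time zones are keyed by city, not country, so storing only a country code against a timestamp loses information. A sketch using Python’s standard `zoneinfo` module (the chosen date and cities are just illustrative):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# One instant in UTC...
ts_utc = datetime(2018, 5, 17, 12, 0, tzinfo=ZoneInfo("UTC"))

# ...lands at different local hours within the same country, because
# time zones attach to cities ("America/New_York"), not to countries.
new_york = ts_utc.astimezone(ZoneInfo("America/New_York"))
los_angeles = ts_utc.astimezone(ZoneInfo("America/Los_Angeles"))
```

This is exactly the kind of detail a denormalised table should pre-compute once, rather than leave to every analyst.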

Extract Transform Load
Production

database
Read

replica
Analytics

database
Normalised tables:

references, events
28 The classic way to get data is to have a production database, copied into a read-
replica. You use that replica to have a copy in your analytics database, considered
off-line, as is: if it fails, your server is still up.

Those tables, if your engineers do their job well, should

- be heavily normalised, including reference tables and fact tables; all with strict
unique IDs.

- they should only be appended, never updated; or you might have
complementary state tables.

Extract Transform Load
Production

database
Read

replica
Analytics

database
Denormalised individual tables:
- Per item with all information
- UTC, local time, DoW, ∂
- One table per concept
Normalised tables:

references, events
Automated audits & alerts
29 The first role of analysts should be to build and maintain ETL processes to
denormalise those tables, and to have large reference tables that contain all the
relevant information in one place. Any reasonably simple question should be
answerable with one simple query on one of those reference tables.

For that, those tables need to be greatly augmented with things like local time, day
of week, difference between one operation and the next, joins with relevant
aggregates, etc. This is necessarily a constant effort: keeping up to date with
product changes, adding key concepts introduced by influential ad hoc analytics.

Those tables can and should also be augmented with machine learning inferences,
checks and audits.
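The augmentation step described above can be sketched in a few lines: take normalised event rows and emit one wide row per event, with the derived columns (local time, day of week, delta since the previous operation) that analysts would otherwise recompute by hand. All field names and the fixed UTC offset are hypothetical simplifications.

```python
from datetime import datetime, timedelta

# Hypothetical normalised event rows, as they might arrive from the replica.
events = [
    {"user_id": 1, "ts_utc": datetime(2018, 5, 14, 9, 0)},
    {"user_id": 1, "ts_utc": datetime(2018, 5, 14, 9, 45)},
]

def denormalise(rows, utc_offset=timedelta(hours=1)):
    """Emit one wide row per event with derived columns pre-computed."""
    out, prev = [], {}
    for row in sorted(rows, key=lambda r: r["ts_utc"]):
        local = row["ts_utc"] + utc_offset
        delta = row["ts_utc"] - prev.get(row["user_id"], row["ts_utc"])
        out.append({
            **row,
            "ts_local": local,
            "day_of_week": local.strftime("%a"),
            "seconds_since_previous": delta.total_seconds(),
        })
        prev[row["user_id"]] = row["ts_utc"]
    return out

wide = denormalise(events)
```

In a real pipeline this runs as a scheduled job owned by a system user, and the derived columns grow with every influential ad hoc analysis.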
Extract Transform Load
Production

database
Read

replica
Analytics

database
Denormalised individual tables:
- Per item with all information
- UTC, local time, DoW, ∂
- One table per concept
Normalised tables:

references, events
Automated audits & alerts
Aggregated tables:
- Pageviews to sessions
- Customer first & last
- Daily metrics (all)
30 You can use a distinct service rather than relational databases to do that, but the
structure remains the same: a single query should allow you to correlate, say,
customer satisfaction with number of calls to the service, day of week, language,
etc., all in one single query on what has to be a very wide table with a lot of
columns (or an equivalent data structure). Checks on those tables should be pretty
much constant and extensive, and the tests and alerts publicly known, so that if
any table isn’t matching expectations (unicity, exhaustivity, etc.), analysts know and
don’t assume.

All the common aggregates and reports should also be based on a chain of pre-
computed aggregations to save time. Both the denormalisation and the
aggregation should be done by a system user with strong priority or a wide swim
lane: all uses depend on those tables being available.
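The “checks and alerts” idea above amounts to cheap invariant checks run after each load, so analysts know, rather than assume, that a table is usable. A minimal sketch; the key name, the two invariants checked (unicity, no missing values) and the sample rows are illustrative:

```python
def audit_table(rows, key="booking_id"):
    """Return a list of invariant violations for a loaded table."""
    problems = []
    keys = [r[key] for r in rows]
    if len(keys) != len(set(keys)):
        problems.append("duplicate keys")          # unicity check
    if any(v is None for r in rows for v in r.values()):
        problems.append("missing values")          # exhaustivity check
    return problems

clean = [{"booking_id": 1, "revenue": 10.0}, {"booking_id": 2, "revenue": 5.0}]
dirty = [{"booking_id": 1, "revenue": 10.0}, {"booking_id": 1, "revenue": None}]
```

Real audits would add totals reconciled against production and trend checks, and the results should be published where every analyst can see them.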
Extract Transform Load
Production

database
Read

replica
Analytics

database
Denormalised individual tables:
- Per item with all information
- UTC, local time, DoW, ∂
- One table per concept
Normalised tables:

references, events
Machine learning

training datasets
Reports
Automated audits & alerts
Aggregated tables:
- Pageviews to sessions
- Customer first & last
- Daily metrics (all)
31 Each of those levels of abstraction has its own use:

- aggregates should be used exclusively for reports; any report based on detailed
reference tables should presumably be automated;

- machine learning training typically needs individual data, so the occasional
extensive load there is to be expected, but still controlled by privacy access;

- straight-from-production normalised tables should be accessible, but with the
clear expectation that they are only used to audit an inconsistency; they should be
monitored so that any significant query on them is converted into the
denormalisation and data-augmentation process rather than relying on ad hoc
integration. All system and non-system users should actually be monitored for
efficiency, and the most expensive queries discussed regularly.
Extract Transform Load
Production

database
Read

replica
Analytics

database
More 

services
Denormalised individual tables:
- Per item with all information
- UTC, local time, DoW, ∂
- One table per concept
Normalised tables:

references, events
Machine learning

training datasets
Reports
Automated audits & alerts
Aggregated tables:
- Pageviews to sessions
- Customer first & last
- Daily metrics (all)
32 Finally, that process should be able to handle new, disparate services without
much trouble, as ingestion and most processing should be done in parallel.

That overall structure can obviously be challenged, but the key idea remains: make
sure at all times that as much pre-computation as you can anticipate is done, so
that analysts have it as easy as possible to get data. This will make even very junior
analysts, or project leaders, far more autonomous and effective. It will also make
many Business Intelligence tools far easier to integrate.
Is your data up-to-date?
Do you use version control on the ETL?

Do you monitor your analysts’ queries for bad patterns?

Do you talk about improvements at a weekly review?

Is there more than three days between a product release
& subsequent ETL update (tested, reviewed, pushed)?

Is the team handling the ETL aware of the product
schedule, including re-factorisation & service split?

…
33 Well, some of those questions have been shamelessly taken straight from Joel S.

As explained earlier: there should be monitoring of non-system users for expensive
joins, new table creation, and queries on normalised (‘production’) tables, to track
inefficiencies. Those are less to police good usage than to provide insight into
analysts’ and data scientists’ querying habits, drive efficiencies and, more
importantly, identify good opportunities for learning. Monitoring and identifying
expensive queries is a simple, objective pretext to teach good, legible, effective
SQL to all analysts.
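That monitoring loop can be sketched very simply: given a query log, exclude system users and surface the most expensive human queries as teaching material for the weekly review. The log format, user names and timings are all hypothetical:

```python
# Hypothetical query log: one entry per executed query, with elapsed time.
query_log = [
    {"user": "etl_system", "sql": "INSERT INTO bookings_denorm ...", "ms": 90000},
    {"user": "alice", "sql": "SELECT ... twelve-way join on raw tables", "ms": 48000},
    {"user": "bob", "sql": "SELECT ... one query on the wide table", "ms": 300},
]

def worst_queries(log, system_users=("etl_system",), top=5):
    """Most expensive non-system queries, for the weekly teaching review."""
    human = [q for q in log if q["user"] not in system_users]
    return sorted(human, key=lambda q: q["ms"], reverse=True)[:top]

review = worst_queries(query_log)
```

A recurring expensive join on raw tables is a signal to push that join into the denormalisation step, not to blame the analyst who wrote it.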
Suggestion to improve

data processing
• Senior analyst & Data engineer monitor biggest queries

• Exclude production, system user 

• Teaching opportunity for all: ineffective, index, etc.

• Product documentation is shared responsibility between

the product owner and the product analyst:

• Stand-Up, Task board, Releases, A/B test, Dashboard

• Team objectives, product description, prioritisation
34 One key disruptor for ETL pipelines is product changes. With analysts embedded
in product teams, you can prevent bad surprises. You can have a prototype ETL
gather data from the testing environment to develop both product and ETL in
parallel; you can have tolerance for faulty ETL, and service agreements, but
overall, you want to monitor that closely.

That is only possible with a tight working relationship between prioritisation, release
cycles, the continuous integration tool and the testing framework (typically the
product owner’s responsibilities) and dashboards and data definitions (typically the
analyst’s role). I don’t really have a trick to handle that, other than have both play
with each other’s toys.
Are you logging your
(placeholder) model?
Do you have estimates (from past A/B tests) of

how model quality impacts your metrics? 

Do you have a placeholder machine learning environment
with a training server fed from the analytics pipeline? 

Do you log the user input, suggestion, user action &
model version? Are those logs processed & audited?
35 Finally, we can talk about models. Actually, about where they should sit, because,
remember: we do not have a data scientist or models yet!

The best way to be welcoming to a model is to build a placeholder: a service that
stands where the model will be. That service can be very naive (say, fixed
recommendations, or decisions without inference), but its software should be
separate enough to have its own trace.

That service can be studied through an A/B test.

That service can be augmented with the ability to get data from either the analytics
server, or a server dedicated to model training.

That service can be equipped with a lot of monitoring: SLAs, delays, input and
output distributions, missing data, etc.

That service can even be a third-party service, like Google Cloud Platform’s Cloud
ML, or Amazon Web Services’ SageMaker.
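The logging contract around that placeholder can be sketched directly: every call records the input, the suggestion served and the model version, so that later A/B tests and audits have what they need, even while the “model” is still naive. Function names, the in-memory log and the version string are all hypothetical:

```python
import json
from datetime import datetime, timezone

# In-memory stand-in for a real log sink (file, Kafka topic, etc.).
LOG = []

def serve_recommendation(user_input, model_version="placeholder-0"):
    """Serve a naive suggestion, logging input, output and model version."""
    suggestion = ["top-seller-1", "top-seller-2"]   # naive placeholder logic
    LOG.append(json.dumps({
        "ts": datetime.now(timezone.utc).isoformat(),
        "input": user_input,
        "suggestion": suggestion,
        "model_version": model_version,
    }))
    return suggestion

served = serve_recommendation({"country": "UK"})
```

Logging the user’s subsequent action against the same entry closes the loop: that pairing is exactly the training and evaluation data the future model will need.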
Are you logging your
(placeholder) model?
Do you have estimates (from past A/B tests) of

how model quality impacts your metrics? 

Do you have a placeholder machine learning environment
with a training server fed from the analytics pipeline? 

Do you log the user input, suggestion, user action &
model version? Are those logs processed & audited?
36 All those addenda are presumably not very insightful when running a naive,
placeholder model, but they are typically things that most data scientists would
take a while to build and integrate, while engineers familiar with the front-end
codebase would be at ease setting them up, and dev-ops able to maintain them.

These last steps are typically less necessary before a data scientist joins, but as
soon as one has, and starts having a relevant model to test, you want to get
started on building those.
ML in production
Production

server (JS)
Client
Production db
37 How does that work in practice?

Well, you still have your display service, and a production database sending
information to the analytics server.

That information is processed by the pipeline shared by analysts and data
scientists.

ML in production
Production

server (JS)
Client
ML server

(Python)
Trained

model
Train, test

& copy
Production db
38 You want to use that information to train a model and host it on a separate service.
ML in production
Production

server (JS)
Client
ML server

(Python)Features
Prediction
Trained

model
Train, test

& copy
Production db
39 Now you have a server that, given a set of features from the production display
service, can send back predictions, and you can use those predictions to show
recommendations or decisions to your user.

Can most data scientists handle hosting that server? These are fairly involved
tasks, but standard ones for engineers:

- Containers & Kubernetes

- Release cycle

- Fall-backs & cut-offs

So you can either look for people who know modelling and have these skills, or
you hire a statistician and an engineer, and have them work together.
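A minimal sketch of that serving wrapper, with the fall-back and cut-off behaviour mentioned above; the `ModelServer` name and the timeout value are illustrative assumptions, not a standard API:

```python
import time

class ModelServer:
    """Minimal serving wrapper: try the trained model, fall back to a
    naive default when the model errors out or answers too slowly."""

    def __init__(self, model, fallback, timeout_seconds=0.1):
        self.model = model            # any object with a .predict(features) method
        self.fallback = fallback      # cheap default, e.g. the top-5 best sellers
        self.timeout_seconds = timeout_seconds

    def predict(self, features):
        start = time.monotonic()
        try:
            prediction = self.model.predict(features)
        except Exception:
            return self.fallback      # fall-back: never let the model break the page
        if time.monotonic() - start > self.timeout_seconds:
            return self.fallback      # cut-off: too slow to be worth showing
        return prediction
```

The point of the wrapper is that the product keeps working, at a known baseline quality, whatever the model does.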
ML in production
Production

server (JS)
Client
ML server

(Python)Features
Prediction
Production db
recommendation_placeholder.py
import os
import pymssql

def get_smart_recommendation(input):
    # Connect to the analytics server using locally stored credentials
    analytics_server = os.environ['ANALYTICS_URL']
    conn = pymssql.connect(analytics_server)
    cursor = conn.cursor()
    # Naive placeholder: recommend the five best-selling products
    cursor.execute('SELECT TOP 5 product_id FROM sales;')
    five_suggested_product_list = []
    for row in cursor:
        five_suggested_product_list.append(row[0])
    return five_suggested_product_list
Placeholder

naive model
Connection

naive model
placeholder
40 How can an engineer start building all that service without an existing model?

With a placeholder model. Replace the parts in teal with something naive.

Let’s look at how we could implement that in Python:

- import libraries to access the data server & get local keys;

- have the service run a query, say:

- connect to the data server with local tokens

- compute and load the five most popular items:

not a very personalised recommendation, but one that you can live with;

- fill those into a method that you can use to send your recommendation.

Have all that wrapped in any platform that your engineers like and are willing to
support: Python Flask, H2O, etc.
ML in production
Production

server (JS)
Client
ML server

(Python)Features
Prediction
Production db
recommendation_placeholder.py
import os
import pymssql

def get_smart_recommendation(input):
    # Connect to the analytics server using locally stored credentials
    analytics_server = os.environ['ANALYTICS_URL']
    conn = pymssql.connect(analytics_server)
    cursor = conn.cursor()
    # Naive placeholder: recommend the five best-selling products
    cursor.execute('SELECT TOP 5 product_id FROM sales;')
    five_suggested_product_list = []
    for row in cursor:
        five_suggested_product_list.append(row[0])
    return five_suggested_product_list
Placeholder

naive model
Connection

naive model
placeholder
41 That's boilerplate code that any developer can easily produce. It doesn't have a
way to import features, but that's easy to add. Any statistician with no knowledge
of production-ready code might find that particular snippet a little quaint. It is.

They should at least be eager to improve on the naive idea. The great thing is: even
if they are familiar with Python and nothing else, pointing them at this file and this
file alone will allow them to learn how to get data, process it effectively, import all
their favourite libraries, create objects to implement a recommendation engine, and
surface it using the same or an equivalent method. The overhead from, say, a
notebook, should be minimal.

That structure (and good directions) allows them to be effective within hours of
joining the company. An effective data scientist is a happy data scientist; their
usefulness will make them want to stay at the company, where they can improve
their model gradually, and rapidly bring in a lot of value by automating a lot of
key decisions.
Questions?
Overall goal split MECE in team KPI?

Can newbies describe current impact?

Are the metrics known by everyone?

Up-to-date metrics dictionary?

Are analytical tables audited daily?

Are queries answered in 10 minutes?

Version control on ETL?

Monitor analysts queries & feedback?

Less than 3 d. from launch to ETL update?

Estimate impact of model quality? 

Do you log input, & user action? 

Placeholder ML environment?
42 So, here are the twelve questions.

Go through them again and try to count, honestly, how many points you have.

If you don't know about one, assume it's like my friend who maybe speaks Arabic:
count those as a No.

How many points does your organisation have?
What should you fix?
• ≤ 3 pt.:

hire engineering managers—your problem is upstream

• ≤ 6 pt.:

hire an experienced analytics manager—to get data good

• 6 - 9 pt.:

hire an experienced data scientist to plan improvements

• > 9 pt.:

get some Master’s students from your local university
43 Low is not bad, but low means you want to focus on different aspects.

Less than three points calls for dramatic measures: forget data engineering, and start
upstream. Are your engineers happy? Could you have a more mature engineering
culture overall? Actually, if you have less than six points, you probably should run
the Spolsky test too, or something equivalent. Read up on technical debt, agility,
etc. Talk to the CTOs of companies that you admire and see what they care about.

If you are low, don't hire an experienced data scientist to "do models". Models are
not going to help you; you have bigger problems than that. Hire someone who,
reading this list, will smile, who sounds like they understand those problems
and care about fixing them. Even better: hire someone who, for most points that I
make, snipes: "Pff, that's stupid, you should…" because they are probably right.
Give them my details, as I can certainly learn from them. Hire someone who says
they want to make other data scientists work more effectively.
What should you fix?
• ≤ 3 pt.:

hire engineering managers—your problem is upstream

• ≤ 6 pt.:

hire an experienced analytics manager—to get data good

• 6 - 9 pt.:

hire an experienced data scientist to plan improvements

• > 9 pt.:

get some Master’s students from your local university
44 If you are low, dedicate engineering resources to the data team. What needs to be
fixed varies, but engineers will rapidly identify issues likely invisible to management.
Keep talking to analysts, see if they get excited by changes like “dynamic table
index reallocation based on the production schema” or “much faster type inference
on the fly”. Notice how they waste less time.

Finally, if you have a good score, start hiring. Go to the local grad school, with a
description of your company’s business and problem & a small sample of your
data; ask the students how they would address it. Hire those who suggest several
options and care to tell early which one would work best. Not all will have a keen
business sense, but if some of them do, the rest can focus on models. You’ll get a
great team in no time.
Plan
• Who is a data scientist

• Questions to measure if you are ready

• Five ways to audit denormalised dataset

• Nine steps to a model in production

• Common questions

• What project to start with

• The near future for data science
45 Ok. Here we stand at the half-way point:

We've looked at what data scientists are and what you should ask yourself before
hiring them.

We still have another long section on how to get a model into production, and
common questions.

Before we go there, let me admit: this was a very long and detailed first part. The next
part is a tongue-in-cheek look at the kind of actual errors that I saw in the BI of big,
respectable corporations with competent analysts. Errors happen, and some are funny;
but to catch them, fix them and be able to laugh at them, you need to check. This
is how.
Audit
Comedy interlude
46 I mentioned that you have these big, wide denormalised reference tables with all the
relevant details.

Building them is essential, but building them is… prone to mistakes. Mistakes that are
easy to catch, obvious, so stupid you could say they are just plain hurtful when you
find them.

Those errors generally come from analysts missing something obvious, or
sometimes from a real problem in the service. 

What are the kinds of things that actual companies' analytics get wrong?
All of those
examples
are real
And they happened at big,
respectable companies
full of very smart people.

Industries have been
swapped to protect

the guilty.
47 I wanted to use the music from Benny Hill, but I don't know how to make that work
with Keynote, so you get Blackadder. I've tried to mask which company was
responsible for which discrepancy, but everywhere I have looked, I have stumbled
on confusing results.

Most of the time, it was because I, or the analyst responsible for building the data
pipeline, overlooked an edge-case or misunderstood the logging. Sometimes, it was
a real problem with actual or attempted fraud, using race conditions, unexpected
patterns, etc. The point here is less to audit the website than to audit both the
analysts' understanding of the data process and the process itself.
THIS ONE IS EASY, YOU’D THINK
ADD UP TOTALS

SELECT SUM(value) AS sum_value
     , COUNT(1) AS count_total
FROM denormalised_table;

SELECT SUM(value) AS sum_value
     , COUNT(1) AS count_total
FROM original_table;

-- How much revenue & how many orders were generated?
SELECT SUM(subtotal) AS sum_value
     , COUNT(1) AS count_total
FROM denormalised_orders;

SELECT SUM(subtotal) AS sum_value
     , COUNT(1) AS count_total
FROM orders;

-- How much … per day?
SELECT SUM(subtotal) AS sum_value
     , COUNT(1) AS count_total
     , order_date
FROM denormalised_orders
GROUP BY order_date;

SELECT SUM(subtotal) AS sum_value
     , COUNT(1) AS count_total
     , order_date
FROM orders
GROUP BY order_date;
48 You would be surprised how often even basic numbers like raw revenue or
order counts do not add up.

This is because processing involved handling duplicates, cancellations, etc.

Often, a fact table was only started later, so the discrepancy comes from missing
values before that date, because of a strict join.

With an international audience, the date of an activity requires a convention: should
we associate an event with its UTC date, or should we associate behaviour with
perceived local time (typically, a day runs from 4AM to 4AM local)?
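The perceived-local-time convention can be sketched in Python; the `activity_date` helper and the 4AM cut-off are illustrative assumptions, not a standard:

```python
from datetime import datetime, timedelta

def activity_date(event_utc, utc_offset_hours, day_starts_at=4):
    """Assign an event to a 'perceived' local day: anything before 4AM
    local time counts as the previous day's activity."""
    local = event_utc + timedelta(hours=utc_offset_hours)
    return (local - timedelta(hours=day_starts_at)).date()

# A late-night order at 01:00 UTC in a UTC+1 market (02:00 local)
# on 19 May is attributed to 18 May, the day the customer perceived.
late_night_order = datetime(2018, 5, 19, 1, 0)
```

Whichever convention you pick, write it down in the metrics dictionary so every table uses the same one.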
CLASSIC
UNICITY

SELECT date
     , contact_name
     , contact_id
     , COUNT(1)
     , COUNT(DISTINCT zone_id) AS zones
FROM {schema}.table_name
GROUP BY 1, 2, 3
HAVING COUNT(1) > 1
ORDER BY 4 DESC, 1 DESC, 3;

-- Unique orders & deliveries
SELECT order_id
     , delivery_id
     , COUNT(1)
FROM denormalised_deliveries
GROUP BY 1, 2
HAVING COUNT(1) > 1;

-- Unique credits
SELECT credit_id
     , order_id
     , COUNT(1)
FROM denormalised_credit
GROUP BY 1, 2
HAVING COUNT(1) > 1;
49 The aspect of database management most commonly overlooked by analysts is
unicity.

Source fact tables, and even logs, typically have extensive unique indexes because
proper engineering recommends them. When they are copied and joined along 1::n
and n::n relations, those unicity conditions are commonly violated.

I strongly recommend not using a `count` vs. `unique_count` method, because
those are expensive, often silently default to approximations when dealing with
large datasets, and only tell you that there is a discrepancy. Using `HAVING
COUNT(1) > 1` gives you the exact values that are problematic and, more
often than not, lets you diagnose the problem on the spot.
YOU’D BE SURPRISED
MIN, MAX, AVG, NULL

SELECT MAX(computed_value) AS max_computed_value
     , MIN(computed_value) AS min_computed_value
     , AVG(computed_value) AS avg_computed_value
     , COUNT(CASE WHEN computed_value IS NULL THEN 1 END) AS computed_value_is_null
     -- , … More values
     , COUNT(1) AS count_total
FROM outcome_being_verified;

-- Customer agent hours worked
SELECT MAX(sum_hrs_out_of_service) AS max_sum_hrs_out_of_service
     , MIN(hrs_worked_paid) AS min_hrs_worked_paid
     , AVG(log_entry_count) AS avg_log_entry_count
     , COUNT(CASE WHEN hrs_worked_paid IS NULL THEN 1 END) AS computed_value_is_null
     , COUNT(1) AS count_total
FROM denormalised_cs_agent_hours_worked;

-- 15.7425    -22.75305    1155    0    3247936
50 There are less obviously failing examples when you look at representative
statistics, typically extremes and null counts.

I have seen tables where the sum of hours worked was negative, or larger than 24
hours per day, because of edge cases around the clock.

The same goes for non-matching joins:

- you generally want to keep at least one side (left or right join);

- but that means you might get nulls;

- one null, or a handful, is fine, but thousands of mismatching instances can
flag a bigger problem.
YOU’D BE SURPRISED
DISTRIBUTION

SELECT ROUND(computed_value) AS computed_value
     , computed_boolean
     , COUNT(CASE WHEN hrs_worked_raw IS NULL THEN 1 END) AS computed_value_is_null
     , COUNT(1) AS count_total
FROM outcome_being_verified

-- Redeliveries
SELECT is_fraudulent
     , COUNT(1)
FROM denormalised_actions
GROUP BY 1;

-- is_fraudulent    count
-- false                82 206
-- null            1 418 487
-- true           21 477 408
51 Also useful, without a clear boolean outcome, are data distributions: you want
certain values to be very low, but you might not have a clear, hard limit on how
much is too much.

An example of a good inferred feature is a boolean that flags problems. You really
want to have one of those for every possible edge-case, type of abuse, etc.
`is_fraudulent` was one of those. Can you see why this one was problematic?

- First surprise: the number of nulls. It was meant to have some nulls (the estimate
was run once a day, so any transaction from the same day would have a null
estimate), but that number was way too high.

- Second problem, even more obvious: is it likely that fraudulent transactions
outnumber legitimate ones by three orders of magnitude? Not in this case: the
analyst flipped the branches of the `if` by mistake.
TO THE TIME MACHINE, MORTY!
SUCCESSIVE TIMESTAMPS

-- Timestamp succession
SELECT (datetime_1 < datetime_2)
     , (datetime_2 < datetime_3)
     , COUNT(1)
FROM {schema}.table_name
GROUP BY 1, 2;

-- Checking the timeframe of ZenDesk tickets
SELECT ticket_created_at < ticket_updated_at
     , ticket_received_at < org_created
     , COUNT(1)
FROM denormalised_zendesk_tickets
GROUP BY 1, 2;

-- Checking the pick-up timing makes sense
SELECT (order_created_at < delivered_at)
     , (order_received_at < driver_confirmed_at)
     , COUNT(1)
FROM denormalised_orders
GROUP BY 1, 2;
52 One common way to detect issues with data processes in production is checking
successions. Certain actions, like creating an account, have to happen before
others, like an action with that account. Not all timestamps have those constraints,
but there must be some. I have yet to work for a company without a very odd issue
along those lines. Typically, those are race conditions, over-written values or
misunderstood event names.

I have seen these both in physical logistics, where a parcel was delivered before it
was sent, and in paperwork, with ZenDesk tickets resolved before they were
created, or even before the type of the ticket, or the organisation handling it,
was created. Having a detailed process for all of these tends to reveal edge cases.
YOU CAN TAKE NOTES
WHAT TO CHECK
▸ Everything you can think of:
▸ Totals need to add-up
▸ Some things have to be unique
▸ Min, Max, Avg, Null aka basic distribution
▸ Distributions make sense
▸ A coherent timeline
▸ Keep a record of what you checked & expect
53 This is not an exhaustive list, more a list of suggestions to start checks with. Expect
to write a dozen tests per key table and have most of them fail initially. Other ways of
auditing your data include:

- dashboards that break down key metrics; anomaly detection on key averages and
ratios; 

- having one or several small datasets with the same schema as production,
filled with edge-cases; process those through the ETL & compare to an
expected output.

I am not sure you found that part as funny as I did, but I can promise you, it's great
to ask that over-confident analyst about any of those, because the 20-something
who swears there is no way they ever did anything wrong makes this
game really fun.
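The edge-case fixture idea could look like this in Python; the `denormalise_orders` stand-in and the sample records are invented for illustration, the real version would run your actual ETL step:

```python
# A tiny edge-case fixture: a duplicate, a cancellation, a missing subtotal.
RAW_ORDERS = [
    {"order_id": 1, "subtotal": 10.0, "status": "completed"},
    {"order_id": 1, "subtotal": 10.0, "status": "completed"},  # duplicate
    {"order_id": 2, "subtotal": 25.0, "status": "cancelled"},
    {"order_id": 3, "subtotal": None, "status": "completed"},  # missing value
]

def denormalise_orders(raw_orders):
    """Stand-in for the real ETL step: deduplicate on order_id,
    drop cancellations, default missing subtotals to zero."""
    seen, out = set(), []
    for order in raw_orders:
        if order["order_id"] in seen or order["status"] == "cancelled":
            continue
        seen.add(order["order_id"])
        out.append({**order, "subtotal": order["subtotal"] or 0.0})
    return out

def test_totals_add_up():
    rows = denormalise_orders(RAW_ORDERS)
    assert len(rows) == 2                        # duplicate & cancellation dropped
    assert sum(r["subtotal"] for r in rows) == 10.0
```

Each time an audit catches a new edge-case in production, add a matching record to the fixture so the mistake cannot come back.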
Nine steps to get to a
model in production
54 Ok, back to serious long lists.

After the overly detailed 12 points to check before you get a data scientist, let's
assume you have hired one; actually, you have hired a head of data, several
analysts who checked their data several times over, and a couple of engineers
who know a lot about machine learning. How do you get all those people to
work together?
Five roles
involved
• Product owner

& PO Managers

• Product analyst

• Data scientist

• Product engineer

• ML service engineer
55 First off: I refer to roughly five types of roles.

Those are not exclusive: many companies have more detailed roles, or do not
distinguish between an analyst (someone who knows how to build metrics for a
product team) and a data scientist (in this case, a statistician with an
understanding of engineering). I distinguish the product engineer, who knows the
code that sends data and would integrate things into the front-end, from a support
engineer for machine learning, but either could also be a software architect.
Sequential
People have tried to

jump to higher steps

It hurts harder
56 Unlike the 12 points, these steps are numbered because they have a strict
dependency from one to the next. You can try to skip a step, but so far, everyone
whom I've seen try has failed and had to back-track. Many people want to jump to
Step 7: build a model. On its own, that won't help.

If I had to express how that makes me feel, let me tell the story of a painter, someone
who paints buildings:
Sequential
People have tried to

jump to higher steps

It hurts harder
57 The painter sends a quote to re-paint the outside of a tall building, five floors:

- £40k for building the scaffolding, and that will take two weeks;

- £15k for painting, and that will take three days.

The client asks: “Can’t we skip the scaffolding and just paint the five floors in
three days for £15k?”

You’d think most people would know that that doesn’t make sense.

Skipping straight to building a model at Step 7 makes as much sense.
What do you want?
1. Team goals, objectives and key
metrics are defined & documented

2. OKR clear & broken downs by key
dimensions in custom reports









Product

manager
Product

analyst
58 The most important lesson from this framework, if you have to remember this but
not the details, is:

Product owners and product analysts own all of the early work; actually, most of the
first two thirds, and certainly the most difficult and tedious parts.

Those points can seem obvious, but for every company that I've worked with,
defining those objectives is actually quite hard, technically, and can be rather
controversial. Every company, including Facebook, has “fake users”, “inactive
users”, duplicates, etc. Every company has a key activity, but which activity actually
counts is hard to define: someone who logs into a video game or onto Twitter, but
doesn't play any game or tweet anything, are they active? If someone cancels
their only order on an e-commerce platform, are they an active customer?
What do you want?
1. Team goals, objectives and key
metrics are defined & documented

2. OKR clear & broken downs by key
dimensions in custom reports

3. Solution is clear & the prediction to
be made is defined, measured



…
Product

manager
Product

analyst
Product

owner
59 On the modelling side, it is very rare that product owners come right away with a
solution.

Imagine you run Amazon Mechanical Turk, and you want to retain Turkers, the
workers on the platform. You are asked to build a model to prevent
de-registration. Two options:

- you can try to list Turkers likely to leave, and send them a gift, or a good
opportunity, to retain them; or

- when a Turker wants to de-register, you compute the likelihood that you can
convince them otherwise; if they probably won't be convinced, let them leave with
a click, otherwise, have them call a retention specialist.
What do you want?
1. Team goals, objectives and key
metrics are defined & documented

2. OKR clear & broken downs by key
dimensions in custom reports

3. Solution is clear & the prediction to
be made is defined, measured



…
Product

manager
Product

analyst
Product

owner
60 The first model involves building a daily score; it's a slow (not production-speed)
model that someone can audit before acting on it; it probably gets a ton of false
positives, because few people overall leave every day, etc. The second has a lot
less bias, because you only look at Turkers actively trying to leave, and you learn
from previous retention efforts, but making it snappy enough might be tricky. Those
are completely different models, with different data, development strategies, etc., for
completely different services. Understanding that is key to working effectively with
machine learning.

Naturally, that example was never really about Amazon Mechanical Turk.
Can we do it?
4. Likely improvements and impact
from data properly estimated

5. Model, methodology and dataset
needed to train defined
Product

analyst
(Sr.) Data

scientist
61 Notice how no data scientist was involved yet. Once the product owner has spelled
out what to predict and when, and how an inference would help, you can ask: by how
much?

That's where a detailed understanding of the product's ratios becomes key. Say you
find a better way to rank suggestions to a user; they should convert more, but by
how much? For that, pick a metric for the ranking (MRR, AUC) and use previous
improvements to the ranking to estimate the impact. That leads to a proxy ratio:
+0.01 in MRR raises conversions by 0.45%. Those are always approximations, but
they allow you to estimate the impact of a better model. You want your busy data
scientists to work where they can have the most impact.
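As a sketch, turning that proxy ratio into a prioritisation number is a one-liner; the 0.45% figure is the illustrative ratio from the text, not a universal constant:

```python
def estimated_conversion_lift(mrr_gain, lift_per_point=0.45):
    """Translate an offline ranking improvement into a business estimate,
    using a proxy ratio fitted on past A/B tests: every +0.01 of MRR
    raised conversions by roughly 0.45%."""
    return (mrr_gain / 0.01) * lift_per_point

# A candidate model promising +0.03 MRR would be worth about +1.35%
# in conversions, which is what you weigh against the cost of building it.
```

Comparing that number across candidate projects is what lets you send your data scientists where they matter most.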

Can we do it?
4. Likely improvements and impact
from data properly estimated

5. Model, methodology and dataset
needed to train defined

6. Dataset is available, audited;

monitored daily updates



…
Product

analyst
(Sr.) Data

scientist
ETL Full stack:
Data architect,

Product engineer,

Product analyst
62 Finally, you get the data scientist involved with two key, rapid tasks:

- pick a model that could fit the problem, based on what we know at that point (no
zipcode for new customers, not until they confirm their first order);

- experienced data scientists can predict the likely accuracy; multiplied by the
impact sensitivity, that lets us prioritise projects accurately.

That leads to a key document: the data needed to train the model. Sometimes,
it's all readily available, modulo some minor feature engineering.

More often than not, Step 6 takes months. Getting the right data the first time can
be a gruelling, frustrating effort: so many things can go wrong. A reliable dataset
supported by analysts is key, but not always enough. Before committing, running a
basic audit can reveal very confusing contradictions, so be mindful.
Getting it done
7. Offline model: Objectives defined,
quality & impact estimated from training

8. Model is running in production with unit
test, skew & balance monitored





Data

scientist
ML Engineer

Product eng.
63 Once you have at least partial data (processing months of data once you have
found a way to get it consistent can be a slow process), you can start having the
data scientist do the 'real' work: training on the dataset, testing, identifying
meta-parameters, optimising, etc. 

As much as that step is the most exciting and the coolest, the first cycles are often
very quick, because you generally want to try simple models.

Then you want that model to be served in production, with its own service: a lot of
things are needed there, but this is mainly an engineering effort, and that type of
service is not yet standard enough to be free of real engineering risk.
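A first cycle can be as simple as a majority-class baseline; this pure-Python sketch (the churn labels are toy data) shows the kind of model any real candidate must beat:

```python
from collections import Counter

class MajorityClassBaseline:
    """The simplest possible first model: always predict the most
    common label seen in training. Any real model must beat this."""

    def fit(self, features, labels):
        self.majority = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, features):
        return [self.majority] * len(features)

def accuracy(model, features, labels):
    predictions = model.predict(features)
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Toy training data: 70% of customers did not churn.
train_labels = ["stay"] * 7 + ["churn"] * 3
baseline = MajorityClassBaseline().fit([None] * 10, train_labels)
```

Comparing every later model against this floor keeps the quality-vs-effort conversation honest.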
Getting it done
7. Offline model: Objectives defined,
quality & impact estimated from training

8. Model is running in production with unit
test, skew & balance monitored

9. Model is self-updating in production;
new model schedule
Data

scientist
ML Engineer

Product eng.
Data scientist

ML Engineer
64 Finally, once you are confident that your data pipeline is reliable, and that your
training can automatically generate a good model (because you audit and monitor
those too), you can confidently let an automated system update the model. That's
real machine learning. From there, you might even have it run on its own and free
your data scientists to work on something else.

Your actual modelling principle can become increasingly complex, self-optimising,
etc., and there will certainly be major improvements in the concepts that you use
within six months (thanks to the machine learning researchers), so you want to go
back to it: schedule some time to investigate those possible improvements. To do
that, keep monitoring the impact of a better model on your key metric, have data
scientists make an informed estimate of how much they can improve the models,
and prioritise accordingly.
“We just want

a model”
65 One of the key reasons that people want “a model” is to serve as a proxy for
insights: How much does this influence conversions? Is fixing this source of churn
more impactful than this blocker to up-selling? Models can help, but those questions
are often better answered by an analyst with a keen product understanding. Even a
model would rely on detailed data, categorised appropriately, driven by product
insights; a good analyst with an intelligent breakdown and clear odds ratios provides
far more useful and legible insights than most models.

Typically, models other than linear regressions are difficult to interpret, so you
probably want to use regressions first for insight-driving models. Regressions are
the simplest models, so you can trust junior analysts to handle them; just check for
collinearity, a common issue that's easy to fix. 
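A quick collinearity check can be done with pairwise correlations before reaching for full VIF diagnostics; this pure-Python sketch uses invented feature columns, and the 0.9 threshold is an illustrative cut-off:

```python
from statistics import mean, stdev

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length columns."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / ((len(xs) - 1) * stdev(xs) * stdev(ys))

def collinear_pairs(columns, threshold=0.9):
    """Flag feature pairs too correlated to interpret separately
    in a linear regression."""
    names = list(columns)
    flagged = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(pearson(columns[a], columns[b])) > threshold:
                flagged.append((a, b))
    return flagged

features = {
    "sessions": [1, 2, 3, 4, 5],
    "page_views": [2, 4, 6, 8, 10],   # exactly 2x sessions: collinear
    "tickets": [5, 1, 4, 2, 3],
}
```

When a pair is flagged, drop one of the two features (or combine them) before reading anything into the coefficients.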

Questions?
1. Team goals, objectives and key metrics are defined & documented

2. OKR commented & breakdowns by key dimensions have custom reports

3. Problem for data science is properly defined, measured

4. Likely improvements and impact from data properly estimated

5. Model, methodology and dataset needed to train defined

6. Dataset is available, audited; regular updates scheduled

7. Offline model: Objectives defined, quality & impact estimated

8. Model is running in production with unit test, balance audited

9. Model is self-updating in production; new model scheduled
66 This was yet again a long section.

Do you have any questions, both on:

- any of the individual steps, the person responsible, the success criteria; but also
on

- the order, and how dependent each step is on the next?

If you are not tired at this point, I can address individual questions, or I can answer
my own.
Plan
• Who is a data scientist

• Questions to measure if you are ready

• Five ways to audit denormalised dataset

• Nine steps to a model in production

• Common questions

• What project to start with

• The near future for data science
67 This is where we stand.

There are two very common questions that I get, so I'll cover those, but I'm happy to
answer more afterwards.
What project

to start with?
68 The classic question is: what to start with?

There are often so many ideas (or not a lot of inspiration) that it's difficult to pick
one. There is also usually quite a bit of pressure to make that first project a
success, which certainly doesn't help, as the first project will probably reveal a lot
more issues than you'd like: in the organisation, in the clarity or alignment of the
incentives, in the quality or consistency of the data, in the technical debt, in all the
dependencies of a data science project, etc. I might have some controversial
responses there too. Once again: your mileage should absolutely vary.

In increasing order of urgency and interest:
Boring, repetitive,
expensive & scaling
• More events, better model

• Simple interactions, decisions

• Low quality: priority for human
69 Who thinks that machine learning is exciting: flying cars, intelligent robots, etc.?

Well, you would be wrong. I mean: those are real, but the reality is that driving or
flying is really boring.

Machines are not really smart yet, but they don't get bored; they also need a lot of
cases to learn from, for now. So, naturally, they are really good at anything that is
repetitive and boring. Find the least valued role, the team with the highest churn.
That's your best candidate project.

I don't know your industry, but I'm willing to bet that that is either the Sales team or
the Customer service team. There are tons of great template projects around
generating leads, or rather filtering through a long list of potential leads and ranking
them.
Boring, repetitive,
expensive & scaling
• More events, better model

• Simple interactions, decisions

• Low quality: priority for human
70 The same goes for customer services: natural language processing can help you
prioritise, classify and even suggest default answers to customer contacts; you can
infer tone, evaluate agents, estimate an optimal compensation rate for unhappy
customers before expensive case processing, etc. All acquisition efforts, social
media optimisation and marketing campaigns are also often great candidates; even
small changes typically scale quite effectively.

There are also great projects to be found around already automated processes: ad
auctions, recommendations. Those are typically already fully digital; they use
exclusively numbers rather than confusing, ambiguous classifications like customer
service or lead evaluation, so they generally have an easier data gathering curve.

But the best rule of thumb is: if I did that for 30 minutes, would I be bored?
Anomaly detection

Lifetime value, churn
• Understanding what is wrong

• Not loosing customer is far
easier than making a new one

• Understand contradictory
effects between promotions,
efforts to retain

• Both are mainly analysts work
than modelling
71 Truth is, the best first project for your data scientists might not be a model to serve
in production, especially if they are a little junior and you have many contradictory
directions you could go in, or if your service is a little wonky. There are three types of
projects with fairly standard deployments that really help organisations structure
themselves:

1. Anomaly detection is about monitoring your activity closely, breaking it down in
detail and identifying unusual patterns; that allows you to trigger alerts to analysts.
You don't need a client-facing process attached to it, but it really helps
identify issues. If your activity goes up and down, and your analysts spend a lot
of time figuring out why, that should relieve them a lot. You should rapidly
move from an initial naive model to fairly sophisticated time-series forecasting, and
that's always a fun learning exercise for analysts.

Happy to present a framework to build that with a growing complexity.
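As a flavour of what the first naive version can look like, here is a minimal sketch in Python, with made-up daily figures: it flags days that deviate from a trailing window by more than a few standard deviations. The real framework grows from something like this into proper time-series forecasting.

```python
# Naive anomaly-detection sketch (illustrative data): flag observations
# that sit more than `threshold` standard deviations away from the mean
# of the preceding `window` observations.
from statistics import mean, stdev

def detect_anomalies(series, window=7, threshold=3.0):
    """Return the indices of values that deviate from the trailing window."""
    anomalies = []
    for i in range(window, len(series)):
        baseline = series[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma > 0 and abs(series[i] - mu) > threshold * sigma:
            anomalies.append(i)
    return anomalies

# Ten quiet days, then a spike on the last day:
daily_bookings = [100, 102, 98, 101, 99, 103, 100, 101, 99, 100, 180]
print(detect_anomalies(daily_bookings))  # → [10]: the spike is flagged
```

A production version would replace the rolling mean with a forecast that accounts for weekly seasonality and trend, but the alerting structure stays the same.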
Anomaly detection

Lifetime value, churn
• Understanding what is wrong

• Not losing a customer is far easier than winning a new one

• Understanding contradictory effects between promotions and efforts to retain

• Both are mainly analyst work rather than modelling
72 2. Another similar supporting effort that most analysts can do is first measuring, and
then estimating, the sum of discounted profits that a customer brings over an
extended period of time, also known as "lifetime value" or LTV. The first effort
should be purely analytical and compute past value, but you will rapidly need to
make assumptions about customers who recently joined. That model allows you to
estimate the long-term impact of certain decisions and compare investments that would
otherwise seem hard to compare, like improving the customer service vs.
spending more on advertising.

Like for anomaly detection, there are standard frameworks to build and model that,
which I'm happy to share.

The key decision associated with LTV is comparing it to how much you can spend in
advertising to get new customers, also known as cost per acquisition or CPA:
you want your CPA to stay below your LTV.
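To make the LTV-vs-CPA comparison concrete, here is a hedged back-of-the-envelope sketch in Python; the monthly profit, discount rate and CPA figures are invented for illustration, and a real model would forecast the profit stream rather than assume it.

```python
# Hypothetical LTV sketch: discount a customer's expected monthly
# profits back to the acquisition date, then compare against CPA.
def lifetime_value(monthly_profits, annual_discount_rate=0.10):
    """Sum of per-month profits, discounted at an annual rate."""
    monthly_rate = (1 + annual_discount_rate) ** (1 / 12) - 1
    return sum(p / (1 + monthly_rate) ** t
               for t, p in enumerate(monthly_profits))

profits = [12.0] * 24   # assume £12/month of profit for two years
ltv = lifetime_value(profits)
cpa = 150.0             # assumed cost to acquire one customer
verdict = "worth acquiring" if ltv > cpa else "acquisition too expensive"
print(f"LTV £{ltv:.2f} vs CPA £{cpa:.2f}: {verdict}")
```

Even this toy version exposes the two assumptions any LTV model rests on: how long customers stay and how much each period is worth.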
Anomaly detection

Lifetime value, churn
• Understanding what is wrong

• Not losing a customer is far easier than winning a new one

• Understanding contradictory effects between promotions and efforts to retain

• Both are mainly analyst work rather than modelling
73 The growth of venture-backed companies is more often than not associated with an
initial effort to increase LTV and find channels with low CPA and, once you have,
spending as much money as you can on acquisition, hoping your CPA stays low and
your LTV stays higher.

3. Finally, getting customers in is good, but it's often far cheaper not to let them out.
The best way to do that is to understand who leaves, why, and how to address it.
The best approach is a framework known as growth accounting: rather
than looking at users in aggregate, you account for newcomers and departures, identify the last
actions and circumstances of leavers, and test retention efforts. Combined with
LTV, that effort is generally very compelling.

I guess you won't be surprised if I'm offering a standard approach to that too.
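The core of growth accounting fits in a few lines of Python set arithmetic, sketched below with hypothetical users: each period's active base is decomposed into new, retained, resurrected and churned.

```python
# Growth-accounting sketch (hypothetical data): decompose this period's
# active users against the previous period and the all-time user base.
def growth_accounting(prev_active, curr_active, ever_seen):
    retained = curr_active & prev_active          # active both periods
    new = curr_active - ever_seen                 # never seen before
    resurrected = (curr_active - prev_active) - new  # returned after a gap
    churned = prev_active - curr_active           # dropped out this period
    return {"new": new, "retained": retained,
            "resurrected": resurrected, "churned": churned}

last_month = {"ana", "bob", "cat"}
this_month = {"ana", "cat", "dan", "eve"}
seen_before = {"ana", "bob", "cat", "eve"}  # eve was active long ago

for bucket, users in growth_accounting(last_month, this_month, seen_before).items():
    print(bucket, sorted(users))
```

The interesting analysis starts where this sketch stops: joining the churned set against their last actions to find out what drove them away.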
A/B testing
• Identifying no-log users

• Splitting by user type

• Including all costs

• Balance & Contamination 

• Peeking & Cancelling early

• Supporting clusters

• Documenting hypotheses
74 But the truth is, if you have a statistician joining your software company, the most
impact they can have is by building a simple, key tool: an A/B testing framework.

Initially, you should be able to get all your users in a room. As soon as there are
too many of them to meet regularly, you want to make sure that your changes
add value by confronting the old solution with the new. Those are known as A/B tests
and they are a key feature of any good software company. They are actually
surprisingly hard to do well due to a flurry of issues, most of which are engineering
work, but some of which are classic statistics. Any statistician is familiar with inferential
testing and some are even able to suggest Bayesian models. All will want their
models and efforts to be tested through that prism, so before you get them started
on the real work, get them to build a reliable, robust, well-adapted A/B testing
framework.
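The statistical core of such a framework can be as small as a two-proportion z-test, sketched below in Python with illustrative conversion counts; the hard part remains the surrounding engineering: assignment, logging, balance checks, and guarding against peeking.

```python
# Two-proportion z-test sketch for an A/B experiment (illustrative counts).
from math import sqrt, erf

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """z statistic and two-sided p-value for a difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)       # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF, via erf.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

z, p = two_proportion_z(conv_a=200, n_a=4000, conv_b=260, n_b=4000)
print(f"z={z:.2f}, p={p:.4f}")   # a 5% vs 6.5% lift: significant at 1%
```

Running this test once, at a sample size fixed in advance, is the easy part; repeatedly checking the p-value and stopping early is exactly the peeking problem the slide warns about.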
Where is the

job going to?
75 One question that I get occasionally — although not often — is: can we future-
proof our hires?

It's very difficult to anticipate the future of a job that didn't exist ten years ago, has
been completely transformed several times since, and whose defining feature is that
everyone good at it is actively trying to code themselves out of a job.

In spite of all that, the solution is rather obvious, and similar to any other job in the
next ten years: specialisation and automation.
“Data scientists will go the

way of the webmaster…”
— Hilary Mason
• Automated easy tasks:

• Feature engineering

Model selection

• DataRobot, H2O, Peak.AI

• Less & more productive data
scientists, ML engineer

• Hosted models, auto-testing

standard monitoring & alerts

• More analysts, more PO

• Can’t automate prioritisation

• Data structure make sense
76 Hilary Mason started and used to lead bit.ly's data science team. She is one of the
most talented and eloquent data scientists there is. The best summary of where we
are going is a quote from an interview she gave in January. For her, the label
"data scientist" will probably disappear in ten years; it will go, I quote, "the way of
the webmaster." I guess that word resonates with most of you who have been
"making" websites for a decade or more. 15 years ago, every website had its
webmaster: a generalist trying to handle copy, uploading, fighting with the
unruly CSS, content strategy, hosting, etc. all by themselves. Since that confusing
time, hosting platforms, frameworks and specialisation have turned that one job into
different, mostly well-defined specialities: editor, copy-writer, designer, WordPress
integrator, SEO expert, front-end and back-end developer, UX researcher and
also, in a way, data scientist.
“Data scientists will go the

way of the webmaster…”
— Hilary Mason
• Automated easy tasks:

• Feature engineering

Model selection

• DataRobot, H2O, Peak.AI

• Less & more productive data
scientists, ML engineer

• Hosted models, auto-testing

standard monitoring & alerts

• More analysts, more PO

• Can’t automate prioritisation

• Data structure make sense
77 There are no more webmasters, just people doing things far better and more
effectively because they have specialised expertise, context and tools.

We should expect data science to go the same way. We will all benefit from an
additional layer and several generations of tools; tools to do all the automatable
parts of the job in seconds: feature generation and selection, model selection,
meta-search, optimisation. That means there should be a small increase in
demand for highly paid machine-learning specialists building those tools, but also
less need per company for data scientists working on picking the relevant
features and the perfect model. Some could still do that, but many would rely on
having a machine run through a more extensive checklist while they get coffee. At
the same time, far more companies will be using machine learning, so I suspect
more people will be data scientists, just not the same kind as today.
“Data scientists will go the

way of the webmaster…”
— Hilary Mason
• Automated easy tasks:

• Feature engineering

Model selection

• DataRobot, H2O, Peak.AI

• Less & more productive data
scientists, ML engineer

• Hosted models, auto-testing

standard monitoring & alerts

• More analysts, more PO

• Can’t automate prioritisation

• Data structure make sense
78 Hosting large datasets, monitoring: all that should follow standard frameworks
quite rapidly. And those frameworks should have, like Node and React today, and still to
a certain extent WordPress, very strong network effects. Current tools have a
bright future.

The parts that seem harder to automate for now are around feature design,
prioritisation, and defining a data structure and schema that accommodate a
given service or feature. Data scientists probably want to spend more quality
time understanding data architects and product owners, because their future job might
focus on the commonalities with those roles.

More ambitious data scientists might brush up on more abstract mathematical results
like generativity, ensembling and convergence to build more universally automated
systems. Those aspects, core to data science but often overlooked, could become
essential when automating data science itself.
Hatt Bertil - Elite BCS 2018/05/17 - Get started with AI - 31 May 2018

Más contenido relacionado

La actualidad más candente

BioIT Webinar on AI and data methods for drug discovery
BioIT Webinar on AI and data methods for drug discoveryBioIT Webinar on AI and data methods for drug discovery
BioIT Webinar on AI and data methods for drug discoveryFernanda Foertter
 
How to Interview a Data Scientist
How to Interview a Data ScientistHow to Interview a Data Scientist
How to Interview a Data ScientistDaniel Tunkelang
 
DataAnalyticsLC_20180410_public
DataAnalyticsLC_20180410_publicDataAnalyticsLC_20180410_public
DataAnalyticsLC_20180410_publicplka13
 
What your employees need to learn to work with data in the 21 st century
What your employees need to learn to work with data in the 21 st century What your employees need to learn to work with data in the 21 st century
What your employees need to learn to work with data in the 21 st century Human Capital Media
 
Best Practices for Scaling Data Science Across the Organization
Best Practices for Scaling Data Science Across the OrganizationBest Practices for Scaling Data Science Across the Organization
Best Practices for Scaling Data Science Across the OrganizationChasity Gibson
 
Tools For Creativity
Tools For CreativityTools For Creativity
Tools For CreativityRowan Norrie
 
How to succeed at data without even trying!
How to succeed at data without even trying!How to succeed at data without even trying!
How to succeed at data without even trying!Dylan
 
Machine Learning: Opening the Pandora's Box - Dhiana Deva @ QCon São Paulo 2019
Machine Learning: Opening the Pandora's Box - Dhiana Deva @ QCon São Paulo 2019Machine Learning: Opening the Pandora's Box - Dhiana Deva @ QCon São Paulo 2019
Machine Learning: Opening the Pandora's Box - Dhiana Deva @ QCon São Paulo 2019Dhiana Deva
 
Wtf is data science?
Wtf is data science?Wtf is data science?
Wtf is data science?Dylan
 
Managing bias in data
Managing bias in dataManaging bias in data
Managing bias in dataSAP SE
 
How Do I Get a Job in Data Science? | People Ask Google
How Do I Get a Job in Data Science? | People Ask GoogleHow Do I Get a Job in Data Science? | People Ask Google
How Do I Get a Job in Data Science? | People Ask Googleprateek kumar
 
Who is a data scientist
Who is a data scientist  Who is a data scientist
Who is a data scientist prateek kumar
 
New Developments in Machine Learning - Prof. Dr. Max Welling
New Developments in Machine Learning - Prof. Dr. Max WellingNew Developments in Machine Learning - Prof. Dr. Max Welling
New Developments in Machine Learning - Prof. Dr. Max WellingTextkernel
 
Lect2-3 Theory-Practice_UsabiilityStudies-Oct2014(1)
Lect2-3 Theory-Practice_UsabiilityStudies-Oct2014(1)Lect2-3 Theory-Practice_UsabiilityStudies-Oct2014(1)
Lect2-3 Theory-Practice_UsabiilityStudies-Oct2014(1)John Chin
 
Analytics that deliver Value
Analytics that deliver ValueAnalytics that deliver Value
Analytics that deliver ValueSandro Catanzaro
 
NYC Open Data Meetup-- Thoughtworks chief data scientist talk
NYC Open Data Meetup-- Thoughtworks chief data scientist talkNYC Open Data Meetup-- Thoughtworks chief data scientist talk
NYC Open Data Meetup-- Thoughtworks chief data scientist talkVivian S. Zhang
 
Predictive analytics: hot and getting hotter
Predictive analytics: hot and getting hotterPredictive analytics: hot and getting hotter
Predictive analytics: hot and getting hotterThe Marketing Distillery
 
Designing more ethical and unbiased experiences
Designing more ethical and unbiased experiencesDesigning more ethical and unbiased experiences
Designing more ethical and unbiased experiencesKaren Bachmann
 
Wake up and smell the data
Wake up and smell the dataWake up and smell the data
Wake up and smell the datamark madsen
 

La actualidad más candente (20)

BioIT Webinar on AI and data methods for drug discovery
BioIT Webinar on AI and data methods for drug discoveryBioIT Webinar on AI and data methods for drug discovery
BioIT Webinar on AI and data methods for drug discovery
 
How to Interview a Data Scientist
How to Interview a Data ScientistHow to Interview a Data Scientist
How to Interview a Data Scientist
 
DataAnalyticsLC_20180410_public
DataAnalyticsLC_20180410_publicDataAnalyticsLC_20180410_public
DataAnalyticsLC_20180410_public
 
What your employees need to learn to work with data in the 21 st century
What your employees need to learn to work with data in the 21 st century What your employees need to learn to work with data in the 21 st century
What your employees need to learn to work with data in the 21 st century
 
Best Practices for Scaling Data Science Across the Organization
Best Practices for Scaling Data Science Across the OrganizationBest Practices for Scaling Data Science Across the Organization
Best Practices for Scaling Data Science Across the Organization
 
Tools For Creativity
Tools For CreativityTools For Creativity
Tools For Creativity
 
How to succeed at data without even trying!
How to succeed at data without even trying!How to succeed at data without even trying!
How to succeed at data without even trying!
 
Machine Learning: Opening the Pandora's Box - Dhiana Deva @ QCon São Paulo 2019
Machine Learning: Opening the Pandora's Box - Dhiana Deva @ QCon São Paulo 2019Machine Learning: Opening the Pandora's Box - Dhiana Deva @ QCon São Paulo 2019
Machine Learning: Opening the Pandora's Box - Dhiana Deva @ QCon São Paulo 2019
 
Wtf is data science?
Wtf is data science?Wtf is data science?
Wtf is data science?
 
Managing bias in data
Managing bias in dataManaging bias in data
Managing bias in data
 
How Do I Get a Job in Data Science? | People Ask Google
How Do I Get a Job in Data Science? | People Ask GoogleHow Do I Get a Job in Data Science? | People Ask Google
How Do I Get a Job in Data Science? | People Ask Google
 
Who is a data scientist
Who is a data scientist  Who is a data scientist
Who is a data scientist
 
New Developments in Machine Learning - Prof. Dr. Max Welling
New Developments in Machine Learning - Prof. Dr. Max WellingNew Developments in Machine Learning - Prof. Dr. Max Welling
New Developments in Machine Learning - Prof. Dr. Max Welling
 
Evaluation of big data analysis
Evaluation of big data analysisEvaluation of big data analysis
Evaluation of big data analysis
 
Lect2-3 Theory-Practice_UsabiilityStudies-Oct2014(1)
Lect2-3 Theory-Practice_UsabiilityStudies-Oct2014(1)Lect2-3 Theory-Practice_UsabiilityStudies-Oct2014(1)
Lect2-3 Theory-Practice_UsabiilityStudies-Oct2014(1)
 
Analytics that deliver Value
Analytics that deliver ValueAnalytics that deliver Value
Analytics that deliver Value
 
NYC Open Data Meetup-- Thoughtworks chief data scientist talk
NYC Open Data Meetup-- Thoughtworks chief data scientist talkNYC Open Data Meetup-- Thoughtworks chief data scientist talk
NYC Open Data Meetup-- Thoughtworks chief data scientist talk
 
Predictive analytics: hot and getting hotter
Predictive analytics: hot and getting hotterPredictive analytics: hot and getting hotter
Predictive analytics: hot and getting hotter
 
Designing more ethical and unbiased experiences
Designing more ethical and unbiased experiencesDesigning more ethical and unbiased experiences
Designing more ethical and unbiased experiences
 
Wake up and smell the data
Wake up and smell the dataWake up and smell the data
Wake up and smell the data
 

Similar a What to do to get started with AI

Business Analytics Lesson Of The Day August 2012
Business Analytics Lesson Of The Day August 2012Business Analytics Lesson Of The Day August 2012
Business Analytics Lesson Of The Day August 2012Pozzolini
 
Pay no attention to the man behind the curtain - the unseen work behind data ...
Pay no attention to the man behind the curtain - the unseen work behind data ...Pay no attention to the man behind the curtain - the unseen work behind data ...
Pay no attention to the man behind the curtain - the unseen work behind data ...mark madsen
 
The road to connected architecture
The road to connected architectureThe road to connected architecture
The road to connected architectureMartijntenNapel
 
Sahilpreet_Sahilpreet_991692200data.pptx
Sahilpreet_Sahilpreet_991692200data.pptxSahilpreet_Sahilpreet_991692200data.pptx
Sahilpreet_Sahilpreet_991692200data.pptxLatestTrending1
 
Data for Hong Kong startups
Data for Hong Kong startupsData for Hong Kong startups
Data for Hong Kong startupsGuy Freeman
 
Architecting a Data Platform For Enterprise Use (Strata NY 2018)
Architecting a Data Platform For Enterprise Use (Strata NY 2018)Architecting a Data Platform For Enterprise Use (Strata NY 2018)
Architecting a Data Platform For Enterprise Use (Strata NY 2018)mark madsen
 
Turning AI Dreams into Reality
Turning AI Dreams into RealityTurning AI Dreams into Reality
Turning AI Dreams into RealityJosh Sephton
 
Landing your first Data Science Job: The Technical Interview
Landing your first Data Science Job: The Technical InterviewLanding your first Data Science Job: The Technical Interview
Landing your first Data Science Job: The Technical InterviewAnidata
 
Five Hot Trends for 2018
Five Hot Trends for 2018Five Hot Trends for 2018
Five Hot Trends for 2018ibi
 
10 Tips From A Young Data Scientist
10 Tips From A Young Data Scientist10 Tips From A Young Data Scientist
10 Tips From A Young Data ScientistNuno Carneiro
 
Idiots guide to setting up a data science team
Idiots guide to setting up a data science teamIdiots guide to setting up a data science team
Idiots guide to setting up a data science teamAshish Bansal
 
Top 7 Reasons why Maintenance Work Orders are Closed Out Accurately
Top 7 Reasons why Maintenance Work Orders are Closed Out AccuratelyTop 7 Reasons why Maintenance Work Orders are Closed Out Accurately
Top 7 Reasons why Maintenance Work Orders are Closed Out AccuratelyRicky Smith CMRP, CMRT
 
LIMITATIONS OF AI
LIMITATIONS OF AILIMITATIONS OF AI
LIMITATIONS OF AIAdityaK52
 
Dashboards are Dumb Data - Why Smart Analytics Will Kill Your KPIs
Dashboards are Dumb Data - Why Smart Analytics Will Kill Your KPIsDashboards are Dumb Data - Why Smart Analytics Will Kill Your KPIs
Dashboards are Dumb Data - Why Smart Analytics Will Kill Your KPIsLuciano Pesci, PhD
 
The Future of AI (September 2019)
The Future of AI (September 2019)The Future of AI (September 2019)
The Future of AI (September 2019)Julien SIMON
 
Does market information, marketing and consumer research have a role in busin...
Does market information, marketing and consumer research have a role in busin...Does market information, marketing and consumer research have a role in busin...
Does market information, marketing and consumer research have a role in busin...Drthomasbrand Limited
 
[DSC Europe 22] Avoid mistakes building AI products - Karol Przystalski
[DSC Europe 22] Avoid mistakes building AI products - Karol Przystalski[DSC Europe 22] Avoid mistakes building AI products - Karol Przystalski
[DSC Europe 22] Avoid mistakes building AI products - Karol PrzystalskiDataScienceConferenc1
 
The Analytics Stack Guidebook (Holistics)
The Analytics Stack Guidebook (Holistics)The Analytics Stack Guidebook (Holistics)
The Analytics Stack Guidebook (Holistics)Truong Bomi
 

Similar a What to do to get started with AI (20)

Business Analytics Lesson Of The Day August 2012
Business Analytics Lesson Of The Day August 2012Business Analytics Lesson Of The Day August 2012
Business Analytics Lesson Of The Day August 2012
 
Pay no attention to the man behind the curtain - the unseen work behind data ...
Pay no attention to the man behind the curtain - the unseen work behind data ...Pay no attention to the man behind the curtain - the unseen work behind data ...
Pay no attention to the man behind the curtain - the unseen work behind data ...
 
The road to connected architecture
The road to connected architectureThe road to connected architecture
The road to connected architecture
 
Sahilpreet_Sahilpreet_991692200data.pptx
Sahilpreet_Sahilpreet_991692200data.pptxSahilpreet_Sahilpreet_991692200data.pptx
Sahilpreet_Sahilpreet_991692200data.pptx
 
Data for Hong Kong startups
Data for Hong Kong startupsData for Hong Kong startups
Data for Hong Kong startups
 
Architecting a Data Platform For Enterprise Use (Strata NY 2018)
Architecting a Data Platform For Enterprise Use (Strata NY 2018)Architecting a Data Platform For Enterprise Use (Strata NY 2018)
Architecting a Data Platform For Enterprise Use (Strata NY 2018)
 
Turning AI Dreams into Reality
Turning AI Dreams into RealityTurning AI Dreams into Reality
Turning AI Dreams into Reality
 
Landing your first Data Science Job: The Technical Interview
Landing your first Data Science Job: The Technical InterviewLanding your first Data Science Job: The Technical Interview
Landing your first Data Science Job: The Technical Interview
 
Five Hot Trends for 2018
Five Hot Trends for 2018Five Hot Trends for 2018
Five Hot Trends for 2018
 
10 Tips From A Young Data Scientist
10 Tips From A Young Data Scientist10 Tips From A Young Data Scientist
10 Tips From A Young Data Scientist
 
Idiots guide to setting up a data science team
Idiots guide to setting up a data science teamIdiots guide to setting up a data science team
Idiots guide to setting up a data science team
 
Top 7 Reasons why Maintenance Work Orders are Closed Out Accurately
Top 7 Reasons why Maintenance Work Orders are Closed Out AccuratelyTop 7 Reasons why Maintenance Work Orders are Closed Out Accurately
Top 7 Reasons why Maintenance Work Orders are Closed Out Accurately
 
LIMITATIONS OF AI
LIMITATIONS OF AILIMITATIONS OF AI
LIMITATIONS OF AI
 
Dashboards are Dumb Data - Why Smart Analytics Will Kill Your KPIs
Dashboards are Dumb Data - Why Smart Analytics Will Kill Your KPIsDashboards are Dumb Data - Why Smart Analytics Will Kill Your KPIs
Dashboards are Dumb Data - Why Smart Analytics Will Kill Your KPIs
 
The Future of AI (September 2019)
The Future of AI (September 2019)The Future of AI (September 2019)
The Future of AI (September 2019)
 
Does market information, marketing and consumer research have a role in busin...
Does market information, marketing and consumer research have a role in busin...Does market information, marketing and consumer research have a role in busin...
Does market information, marketing and consumer research have a role in busin...
 
[DSC Europe 22] Avoid mistakes building AI products - Karol Przystalski
[DSC Europe 22] Avoid mistakes building AI products - Karol Przystalski[DSC Europe 22] Avoid mistakes building AI products - Karol Przystalski
[DSC Europe 22] Avoid mistakes building AI products - Karol Przystalski
 
The Analytics Stack Guidebook (Holistics)
The Analytics Stack Guidebook (Holistics)The Analytics Stack Guidebook (Holistics)
The Analytics Stack Guidebook (Holistics)
 
10 ways to advance your it career tech news techgig
10 ways to advance your it career   tech news   techgig10 ways to advance your it career   tech news   techgig
10 ways to advance your it career tech news techgig
 
Big Data : a 360° Overview
Big Data : a 360° Overview Big Data : a 360° Overview
Big Data : a 360° Overview
 

Más de Bertil Hatt

Five finger audit
Five finger auditFive finger audit
Five finger auditBertil Hatt
 
Are you ready for Data science? A 12 point test
Are you ready for Data science? A 12 point testAre you ready for Data science? A 12 point test
Are you ready for Data science? A 12 point testBertil Hatt
 
Prediction machines
Prediction machinesPrediction machines
Prediction machinesBertil Hatt
 
Garbage in, garbage out
Garbage in, garbage outGarbage in, garbage out
Garbage in, garbage outBertil Hatt
 
MancML Growth accounting
MancML Growth accountingMancML Growth accounting
MancML Growth accountingBertil Hatt
 

Más de Bertil Hatt (6)

Five finger audit
Five finger auditFive finger audit
Five finger audit
 
AlexNet
AlexNetAlexNet
AlexNet
 
Are you ready for Data science? A 12 point test
Are you ready for Data science? A 12 point testAre you ready for Data science? A 12 point test
Are you ready for Data science? A 12 point test
 
Prediction machines
Prediction machinesPrediction machines
Prediction machines
 
Garbage in, garbage out
Garbage in, garbage outGarbage in, garbage out
Garbage in, garbage out
 
MancML Growth accounting
MancML Growth accountingMancML Growth accounting
MancML Growth accounting
 

Último

毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...ttt fff
 
Business Analytics using Microsoft Excel
Business Analytics using Microsoft ExcelBusiness Analytics using Microsoft Excel
Business Analytics using Microsoft Excelysmaelreyes
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一
办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一
办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一F sss
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024Susanna-Assunta Sansone
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样vhwb25kk
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxdolaknnilon
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max PrincetonTimothy Spann
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 

Último (20)

毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...
 
Business Analytics using Microsoft Excel
Business Analytics using Microsoft ExcelBusiness Analytics using Microsoft Excel
Business Analytics using Microsoft Excel
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 

What to do to get started with AI

  • 1. Artificial Intelligence in the real world: Where to start?

Bertil Hatt, Head of Data Science, RentalCars.com
BCS ELITE Conference, Thursday 17th May 2018
‘Unlocking the Potential of Artificial Intelligence’
Manchester DoubleTree by Hilton Hotel

This presentation was prepared for the BCS, The Chartered Institute for IT, for an audience of CTOs curious about Artificial intelligence. The programme opened with a talk about the exciting possibilities of Artificial intelligence; this presentation followed it. It is a talk on what to do before you can unlock the potential of Artificial intelligence.

It relies on a handful of premises about the audience (detailed as questions to the audience during the opening section): the audience is technically informed, but not familiar with implementing machine learning models. The point is to talk about what to do *before* you have a machine learning model in place, and what the classic traps are on the way there.

Why am I talking to you?

I was able to write down the following warnings not because I was able to do a lot of amazing things, but because I worked with many institutions — the main ones are listed here. I was able to compare, from the trenches, those who successfully implemented machine learning and those who struggled. Actually, most of the companies that I have helped are not on this list: friends who grew frustrated reached out for advice, or conference speakers explained how to get things to work well.

None of the content is meant to be controversial, but if there is anything you object to, it is entirely my fault: none of the institutions mentioned have sanctioned what I’m saying here. This presentation, by design, is going to sound quite negative: it’s a list of warnings. It is also a list of solutions.
  • 2. Why am I talking to you?

I’m also here because I gave this talk to many data scientists. Their reaction was not “this is cool”; they were incredibly relieved, even ecstatic: “Thank you for finally saying this out loud.” “Someone had to tell them!” I’m not sure who “them” was; I hope it’s you.

Artificial intelligence, Machine learning, Data science: all of those are amazing things to do. They happen to be victims of their — deserved — success. I’m telling you what I learned from trying many times, and failing almost as often.

Finally, as you might be confused: my employer is known as RentalCars. The company belongs to the same holding and was acquired by Booking.com on January 1st. This is why you’ll see me described as belonging to either RentalCars, Booking.com or even BookingGo, which is an internal brand. Rideways is a new service that we are starting, so that’s four possible labels. All of those are valid, and essentially the same company.

Raise your hand

This is the interactive part of the presentation. I was not familiar enough with the audience, so this is me checking that my assumptions were right.
  • 3. Keep your hand up if
  • You manage technical people (engineers, dev-ops, analysts)
  • Your company is organised with product teams, product owners
  • You know “Joel Spolsky’s test”

The assumption is that you want to make decisions about what technical people in your organisation work on, and who to hire (seniority, speciality). I also assume that you have a network of teams, coordinated by people known as product managers (PMs) or product owners (POs), who define short-term objectives, set priorities and allocate technical resources. The main question here will be: what to do, in what order? What should you care about?

Another key aspect here is the happiness of technical people. If you are familiar with Joel Spolsky’s test, it’s a good reference to have in general. I have taken inspiration from him and suggest an adapted version for (hiring) data scientists. If you don’t know what Joel’s test is, check it out; it’s a good reference to have.

What I could talk about
  • Should you think about AI?
  • What should you do first?
  • What project to start with?
  • Who should you hire first?
  • How to retain & grow data scientists?
  • How to make your team effective?
  • How to prepare for a project?
  • How ready are you for AI?

There are many things I could talk about, and I’m happy to answer questions — how to do data science well, effectively, at scale; what the right skillsets are — but I think the two under-discussed points are what to do to prepare for data scientists, and how to surround them. Far too many companies want to do AI, but have no idea whether they are ready for it. When I hear that, it reminds me of a joke a friend had: asked if he knew how to speak Arabic, he’d quip: “I don’t know, I never tried.” Being ready for data science is a little bit the same: if you are not sure, you very likely need a lot of effort before you can be ready.
  • 4. Raise your hand

That’s a stupid trick I used to get people to participate: I get people to keep their hands up first, then I ask questions. It changes the default of participation, and you get more honest answers. Plus it’s a little bit of physical exercise.

Keep your hand up if
  • You are planning to invest in AI in your organisation in 1-2 years
  • You do not have a model running & self-updating in production today
  • You want to hire a Data scientist

The assumptions here are that you see AI as something you are actively planning to do, and that you are not ready, or at least haven’t tried. If you have, there are a lot more presentations on how to improve your AI; don’t waste time with my rambles, go watch those. The last assumption, and that’s where it gets tricky, is that you want to hire a person who you think is a specialist. That introduces a connection between artificial intelligence, machine learning and data scientists. I’m not going to detail how those are related. If you have your hand up, this presentation is for you.
  • 5. Don’t

First surprise. Gets people’s attention.

If you hire someone fresh out of a data science institute, or any statistical Master’s, they’ll be great at training a model from a cleaned-up dataset. However, in a vast majority of cases, they won’t know what to do to get that dataset in the first place, and they’ll have little idea how to serve that model to your display service in a fast, reliable, scalable manner. Most of them will get incredibly frustrated. Many will leave within six months. A friend of mine, Peadar Coyle, is trying to brand that the “Trophy data scientist”: it’s a reference to a sexist trope, but I think it’s a good angle on the problem. This presentation is here to help you avoid that, and figure out what type of person you should hire.

Hire a Head of Data & Reporting first

For every company I’ve worked for (except, to a minor extent, Facebook), one key overlooked problem was getting a clear understanding of what the service was actually doing. Even at Facebook, many key aspects of what the company was doing in 2015 were still unclear, as recent changes demonstrated. Don’t be ashamed because you are naked in Eden: admit that you need help, and hire that help. William Gibson said that “The future is already here — it's just not very evenly distributed.” That is absolutely true of Data science. Many employers had great data science in some aspects, in one corner of the company — but large uncertainty elsewhere. You can try starting data science before you have a good overall vision. I wouldn’t recommend it, though.
  • 6. Then raise their budget

The real issue with processing data is that it’s expensive. There are roughly two roles:
- engineers who build pipelines and large-scale technology to copy data; those tend to be very expensive specialists, so you want a budget for them;
- people in charge of applying business logic to that data, typically called analysts, sometimes data scientists. You might need many of those, to cover your company’s different products. They typically build pipelines that others use, rather than provide a lot of insights themselves.
Don’t confuse them: engineers generally hate handling the frequent changes in business logic; analysts typically don’t know code well enough — but they can learn a lot from engineers.

Then raise their budget

If you want to leave the room or close this presentation now, that’s fine with me. You can spend a year hiring a head of data, growing their team and having them fix all the inconsistencies; that will make me very happy, and you will probably not delay getting usable data science in production by a minute. We can be back here next year. I’ll explain the next steps to you then. I’m serious: if you stand up now, I’ll understand. And I will thank you: you are taking this seriously. Getting good data is very likely going to be tedious and expensive, and it absolutely helps your machine learning effort. Actually, that’s the only part that I’m not convinced can be automated.
  • 7. Plan
  • Who is a data scientist
  • 12 questions to tell if you are ready
  • Five ways to audit a denormalised dataset
  • Nine steps to a model in production
  • Common questions
  • What project to start with
  • The future for data science

This is going to be a very long presentation, and I might not reach the end in an hour. Very sorry about that, but you just got the key part of it, so I’ve fulfilled my promises. First, I’ll try to clarify what to expect, or not, from the kind of specialist you are after; that should help understand the rest of the presentation. Then I’ll help you decide if you want to hire one. At that point you’ll be tired, so I’ll make some nerdy jokes at the expense of my previous colleagues. If you are ready for more, I’ll then talk about how to get to a working machine learning model in production.

Please have questions throughout, because if you don’t, I’ll answer my own questions, or rather the questions that I most commonly get from the audience. Feel free to raise your arms and ask questions now. I’m happy to address your concerns now rather than when my long monologue has completely expunged all your energy. If I talk too fast, raise your arms. If I talk too slowly, or not loud enough, or too loud, or I don’t articulate properly, raise your arms — all of those have happened before, so feel free to interrupt me if anything is wrong.
  • 8. Who is a data scientist?

This is a very loaded question, because there is no clear answer — and that’s actually by design.

“Data scientist” is a very ambiguous word (by design)
  • DJ Patil & Jeff Hammerbacher describe their job in 2008
    https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century
  • Most narrow: Familiar with self-learning statistical models
  • May include: Engineering, Dev Ops, UX or ML Research

The original use was a shorthand to describe the role of two incredible people. DJ Patil single-handedly built most of the core features of LinkedIn and later became the Chief Data Scientist of the United States Government. Jeff Hammerbacher was pretty much the same thing at Facebook, and left to tackle cancer, also pretty much single-handedly. They were able to do so much on their own: develop a product, gather data from that product, store large-scale information, model, serve those models. Hardly anyone can compare; hardly anyone should.

Nowadays, most people who use the title on LinkedIn are people familiar with one or several self-learning statistical models, what is known as Machine learning. They may, but generally don’t, know how to write great front-end or back-end code, handle alerts on a server, or conduct research on users. That’s the point: data scientists are people gathered around a basic set of skills — so you always want to specify what you are looking for.
  • 9. What do we mean by ‘data scientist’
  • Analyst: uses data to answer ad hoc questions
  • Uses data streams to build alerts, automated reports & dashboards
  • Statistician: builds statistical models

A good way to detail that is to think about this gradient, between ad hoc applications and higher-level abstractions. The most generous definition of “data scientist” covers people who can answer individual business questions based on a static dataset. Those are classically called analysts, and I don’t think of that as pejorative. They tend to find far more nuanced insights, but tend not to think about how to automate their work.

The next step is people who still use fairly ad hoc models, but for alerting the business about issues: they typically work on anomaly detection, impact assessment, A/B testing. They are familiar with some statistical tests and confidence intervals, but still expect that their findings will be reviewed by a human, and tolerate some clear errors.

Those can often transition to full statisticians: people who can train a model, make an inference and understand non-trivial questions around representativity, bias and unbalanced datasets. Typically, people here have a Master’s in statistics or equivalent. Where things get interesting, and that’s where most people put a limit, is whether those models are served to the business and the results reviewed case by case, or whether the results are used to show a suggestion or a decision to users. Those models need to be not just accurate, but fast, and monitored well: if they develop a bias, you want to detect and correct that fast. Most people on the market can apply models, but you rarely find the same person able to question the data that they find — to have “business sense”.
  • 10. What do we mean by ‘data scientist’
  • Analyst: uses data to answer ad hoc questions
  • Uses data streams to build alerts, automated reports & dashboards
  • Statistician: builds statistical models
  • Builds self-updating models used to automate decisions
  • ML researcher: imagines new types of models

That’s why you want to clearly define whether you want someone in that role able to explain the impact and implementation of their model. Responsibilities like that can justify very wide differences in value for the company, and in compensation, which is why having that scale in mind really helps to set expectations.

At the tail end, you have a smaller group of people who actually rarely look at business applications, but think about improving and developing new statistical models. Those are typically academic careers, with the very recent evolution that they work for large internet companies who need better, more scalable models, and they can get paid insanely well. The crazy salaries are typically for those people, because they can single-handedly make, say, all the image processing at Google, Facebook, Apple or another, orders of magnitude more efficient, or more accurate. That’s very rarely the first profile you want to hire.

Difficult questions
  • Can a data scientist set up, manage and maintain a server to host the model served to production?
  • Can a data scientist understand the production code to fix a faulty data log?
  • Can a data scientist write a method used in production to gather the data consistently?

For those three questions and many like them, the real answer is: maybe, but such people are typically rather rare. An easier answer is: your (specialised) engineers can do it better and faster, so rely on them first. You can and should teach your data scientists, by asking them to work closely with those engineers and learn from them; it will not only make them autonomous but also teach them to think like engineers (be lazy, automate, delegate, define, document, test), which is really the mindset that you want to cultivate. At the same time, by explaining their models to engineers, data scientists will help them grasp statistics, a very helpful skillset. Don’t look for a white elephant: just have complementary profiles learn from each other.
  • 11. What questions to tell if you are ready?

This is (finally) the crux of the presentation: how can you tell if you should get started with machine learning?

The Joel test — Joel Spolsky’s test for code-based projects:
  • 12 Yes/No questions
  • Easy to tell
  • Easy to start
https://www.joelonsoftware.com/2000/08/09/the-joel-test-12-steps-to-better-code/

Joel Spolsky is a thought leader in how to make great software. There are many complicated ways of estimating the quality of an environment; he wanted to simplify it to a short checklist. Each item can be set up by individual decisions; some are relatively expensive, but none are a head-scratcher. They are not really dependent on each other — which is something I haven’t been able to do.

A not irrelevant question is:
- How much do you score on Joel’s test too? Happy engineers and good code generally mean good data, and that is key for machine learning too.
- Were you surprised by the score?
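Like Joel’s test, the adapted readiness test is simply a count of yes/no answers. A minimal sketch of the scoring — the questions paraphrase the 12-point test spread across the rest of this deck, and the wording here is my shorthand, not the canonical list:

```python
# Minimal sketch of a Joel-style readiness score: count the "yes" answers.
# The questions paraphrase the 12-point test in this deck; the example
# answers below are illustrative only.

QUESTIONS = [
    "Is the company goal broken down into team metrics (MECE)?",
    "Can a new team member describe the current effort after one week?",
    "Are the metrics reported, audited & challenged by everyone?",
    "Do you have an up-to-date metrics dictionary?",
    "Are the reference analytical tables audited daily?",
    "Are simple data requests answered within 10 minutes?",
    "Can a two-line question be answered with a single query?",
    "Do you use version control on the ETL?",
    "Is the ETL updated within three days of a product release?",
    "Do you have estimates of how model quality impacts your metrics?",
    "Do you have a placeholder ML environment fed from the analytics pipeline?",
    "Do you log user input, suggestion, user action & model version?",
]

def readiness_score(answers):
    """answers: list of booleans, one per question, in order."""
    assert len(answers) == len(QUESTIONS)
    return sum(answers)

# A team that can answer yes to the first 8 questions but has no
# placeholder model environment yet:
score = readiness_score([True] * 8 + [False] * 4)
print(f"{score}/12")
```

As with Joel’s test, the point is not the exact number but that each “no” is an individually actionable decision.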
  • 12. 12-point test for readiness to implement data science
  • Three questions on your strategy, effort & product team: do they make sense? Are they aligned?
  • Six questions on your data: is it easy to grasp, up to date? Are your analysts building the right datasets?
  • Three questions on where a model would find its place.

If you notice a heavy weight around data quality, congratulations! You have won six more slides of me being very, very insistent about it. But it’s not just data: clear projects, and an engineering structure ready for data science, are also very helpful. The test reflects those.

Are your product & goals clearly defined?
  • Is the company goal broken down into team metrics in a mutually exclusive, completely exhaustive (MECE) way?
  • Can a new team member describe the current effort & its impact on metrics at the end of their first week?
  • Are the metrics known by everyone, i.e. reported, audited & challenged?
  • Do developers know other teams’ key metrics?
  …

Why care so much about unrelated employees knowing those metrics? Because of unintended consequences: a team focused on growth might push for low-quality profiles, and the team in charge of converting those profiles might object to this objective, and prefer something more balanced towards reasonably relevant incoming traffic. The question really is: do you give employees space to challenge the metrics?

As you can notice, none of those are about machine learning: this is by design. The question here is: are you ready, before starting data science? I will define targets for data science in the next section, for when you have data scientists on board.
  • 13. Suggestions to improve product clarity
  • A “What we do” document for all to read on their first day
  • Introduce all the concepts, the strategy, the teams
  • Include orders of magnitude to explain priorities
  • Quiz employees often on key numbers, other teams
  • Show explicit interest in their overall understanding
  • Detect issues, tension, miscommunications

No comments here. Those are just good ideas.

Is your data easy to get?
  • Do you have an up-to-date metrics dictionary with a detailed explanation of each number, incl. edge cases?
  • Are the reference analytical tables audited daily? Totals add up with production, trends make sense, etc.
  • Are simple data requests answered within 10 minutes?
  • Can a question that fits in two lines be answered with a single query?
  …

For all of those, ask the most junior analysts. Managers are often removed from the grind of getting data: they ask, go away and wonder why it takes so long. You want analytics managers to understand the struggle, so they can invest in fixing it and making analysts productive. “10 minutes” sounds ridiculous to many analysts: they typically have to work for days for any insight. It is not a pipe-dream: it’s the result of significant effort in documenting, pre-processing and training analysts to make each other more efficient. How valuable would it be for your organisation to get there; how nimble and informed could you be? This is why I argued for significant investment in data and reporting, far more than what directly measurable impact would seem to allow.
  • 14. Suggestions to improve data quality
  • Ask an analyst: why is that number hard to compute?
  • Share the pain & build empathy with engineers
  • Analysts value consistency, precision, reusability
  • Ask about 1st, 2nd, 3rd normal form & denormalisation
  • Analysts often have insufficient training in data structure
  • Understand log structure (no edits) vs. status table

It might be counter-intuitive, but most analyst work actually implies spending far more time handling inessential corrections. That imbalance is extremely common and can be compensated by good tooling — but understanding what analysts care about, and what they spend their time on, will open a window into unexpected issues. Spoiler alert: you will hear very confusing things about:
- time zones — and how they relate to a subset of cities, not countries;
- the distinction between countries, territories and locales;
- the spelling of Munich; the status of Taiwan, Hong Kong;
- which week number January 1st falls in (52 or 53, both wrong);
- why Christmas, Easter and Ramadan are so inconvenient;
- finally, if you are lucky, how to represent a discrete log-scale.
This large imbalance between impact and time spent is not ideal, so if you want to compensate for it, invest in tools to handle those confusing details gracefully.

Extract Transform Load
[Diagram: Production database → Read replica → Analytics database, holding normalised tables: references, events]

The classic way to get data is to have a production database, copied into a read replica. You use that replica to keep a copy in your analytics database, considered off-line, as is: if it fails, your server is still up. Those tables, if your engineers do their job well, should:
- be heavily normalised, including reference tables and fact tables, all with strict unique IDs;
- only be appended to, never updated; or you might have complementary state tables.
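To make the denormalisation step concrete, here is a minimal sketch with pandas. All table and column names (`bookings`, `cities`, `created_utc`, and so on) are made up for illustration; the point is joining an append-only event table with a reference table and pre-computing the time-zone-aware columns analysts keep re-deriving:

```python
import pandas as pd

# Hypothetical normalised inputs, as they would land in the analytics
# database: an append-only event table and a reference table.
bookings = pd.DataFrame({
    "booking_id": [1, 2, 3],
    "city_id": [10, 10, 20],
    "created_utc": pd.to_datetime(
        ["2018-05-17 08:30", "2018-05-17 23:45", "2018-05-18 02:15"],
        utc=True),
})
cities = pd.DataFrame({
    "city_id": [10, 20],
    "city": ["Manchester", "Munich"],
    "timezone": ["Europe/London", "Europe/Berlin"],
})

# Denormalise: one wide row per booking, with every relevant attribute.
wide = bookings.merge(cities, on="city_id", how="left")

# Augment with derived columns — note the time zone belongs to the
# city, not the country, as the slide warns.
wide["created_local"] = [
    ts.tz_convert(tz) for ts, tz in zip(wide["created_utc"], wide["timezone"])
]
wide["day_of_week"] = wide["created_utc"].dt.day_name()

print(wide[["booking_id", "city", "created_local", "day_of_week"]])
```

Note how booking 2, created at 23:45 UTC, already lands on the next local day in London: exactly the kind of edge case you want computed once, in the ETL, rather than in every analyst’s query.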
  • 15. Extract Transform Load
[Diagram: Production database → Read replica → Analytics database; the normalised tables (references, events) feed denormalised individual tables — per item with all information; UTC, local time, DoW, ∂; one table per concept — with automated audits & alerts]

The first role of analysts should be to build and maintain ETL processes that denormalise those tables, producing large reference tables that contain all the relevant information in one place. Any reasonably simple question should be answerable with one simple query on one of those reference tables. For that, those tables need to be greatly augmented with things like local time, day of week, the difference between one operation and the next, joins with relevant aggregates, etc. This is presumably a constant effort: keeping up to date with product changes, adding key concepts introduced by influential ad hoc analytics. Those tables can and also should be augmented by machine learning inferences, checks and audits.

[Diagram, continued: the denormalised tables feed aggregated tables — pageviews to sessions; customer first & last; daily metrics (all)]

You can use a distinct service rather than relational databases to do that, but the structure remains the same: a single query should allow you to correlate, say, customer satisfaction with number of calls to the service, day of week, language, etc., all in one single query on what has to be a very wide table, with a lot of columns (or an equivalent data structure). Checks on those tables should be pretty much constant and extensive, and the tests and alerts publicly known, so that if any table isn’t matching expectations — unicity, exhaustivity, etc. — analysts know and don’t assume. All the common aggregates and reports should also be based on a chain of pre-computed aggregations, to save time. Both the denormalisation and the aggregation should be done by a system user with strong priority or a wide swim lane: all uses depend on those tables being available.
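The constant checks mentioned above — unicity, exhaustivity, totals that add up with production — can be sketched as a small audit function. The table shape and check names are illustrative; in practice this would run daily against the analytics database and page someone instead of returning a list:

```python
# Sketch of automated audits on a denormalised table: uniqueness of IDs,
# row counts that match production, and completeness of the joins.

def audit_denormalised(rows, production_count):
    """rows: list of dicts for one denormalised table."""
    failures = []
    ids = [r["booking_id"] for r in rows]
    if len(ids) != len(set(ids)):
        failures.append("unicity: duplicate booking_id")
    if len(rows) != production_count:
        failures.append(
            f"exhaustivity: {len(rows)} rows vs {production_count} in production")
    if any(r.get("city") is None for r in rows):
        failures.append("completeness: booking with no city")
    return failures

rows = [
    {"booking_id": 1, "city": "Manchester"},
    {"booking_id": 2, "city": "Manchester"},
    {"booking_id": 2, "city": None},   # a duplicate, missing its join
]
print(audit_denormalised(rows, production_count=3))
```

Publishing the list of checks (and their failures) is what lets analysts trust the tables instead of re-verifying them query by query.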
  • 16. Extract Transform Load
[Diagram: the same pipeline, now with the aggregated tables feeding reports, and the denormalised tables feeding machine learning training datasets, all under automated audits & alerts]

Each of those levels of abstraction has its own use:
- aggregates should be used exclusively for reports; any report based on detailed reference tables presumably should be automated;
- machine learning training typically needs individual data, so the occasional extensive load there is to be expected, but still controlled by privacy access;
- straight-from-production normalised tables should be accessible, but with clear expectations that they are only used to audit an inconsistency; they should be monitored so that any significant query on them is converted into the denormalisation and data-augmentation process rather than relying on ad hoc integration.
All system and non-system users should actually be monitored for efficiency, and the most expensive queries discussed regularly.

[Diagram: the same pipeline, extended with more services feeding the analytics database]

Finally, that process should be able to handle new, disparate services without much problem, as ingestion and most processing should be done in parallel. That overall structure can obviously be challenged, but the key idea remains: make sure at all times that as much pre-computation as you can anticipate is done, so that analysts have it as easy as possible to get data. This will make even very junior analysts, or project leaders, far more autonomous and effective. This will make many Business Intelligence tools far easier to integrate as well.
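One link in the pre-computed aggregation chain — pageviews rolled up into sessions, sessions into daily metrics — can be sketched as follows. The 30-minute session cut-off is a common industry convention, not something these slides prescribe, and the data shapes are invented:

```python
# Sketch of a pre-computed aggregation chain: pageviews -> sessions ->
# daily metrics, so that reports never touch the detailed tables.
from datetime import datetime, timedelta

pageviews = [  # (user, timestamp), assumed sorted per user
    ("alice", datetime(2018, 5, 17, 9, 0)),
    ("alice", datetime(2018, 5, 17, 9, 10)),
    ("alice", datetime(2018, 5, 17, 14, 0)),   # > 30 min gap: new session
    ("bob",   datetime(2018, 5, 17, 9, 5)),
]

def to_sessions(pageviews, gap=timedelta(minutes=30)):
    sessions = []
    last = {}  # user -> (index of open session, last timestamp seen)
    for user, ts in pageviews:
        if user in last and ts - last[user][1] <= gap:
            idx = last[user][0]
            sessions[idx]["views"] += 1
            sessions[idx]["end"] = ts
        else:
            sessions.append({"user": user, "start": ts, "end": ts, "views": 1})
            idx = len(sessions) - 1
        last[user] = (idx, ts)
    return sessions

sessions = to_sessions(pageviews)

# Second rollup: sessions per day, the kind of number a report reads.
daily = {}
for s in sessions:
    daily[s["start"].date()] = daily.get(s["start"].date(), 0) + 1
print(len(sessions), daily)
```

Each level is cheap to query because the expensive work was done once, upstream, by the system user.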
  • 17. Is your data up-to-date?
  • Do you use version control on the ETL?
  • Do you monitor your analysts’ queries for bad patterns? Talk about improvements at a weekly improvement review?
  • Is there more than three days between a product release & the subsequent ETL update (tested, reviewed, pushed)?
  • Is the team handling the ETL aware of the product schedule, including refactoring & service splits?
  …

Well, some of those questions have been shamelessly taken straight from Joel S. As explained earlier: there should be monitoring of non-system users for expensive joins, new table creation, and use of the normalised (‘production’) tables, to track inefficiencies. Those are less to police good usage than to provide insights into analyst and data scientist querying habits, drive efficiencies and, more importantly, identify good opportunities for learning. Monitoring and identifying expensive queries is a simple, objective pretext to teach good, legible, effective SQL to all analysts.

Suggestions to improve data processing
  • Senior analyst & Data engineer monitor the biggest queries
  • Exclude production, system users
  • Teaching opportunity for all: inefficiencies, indexes, etc.
  • Product documentation is a shared responsibility between the product owner and the product analyst:
  • Stand-up, task board, releases, A/B tests, dashboards
  • Team objectives, product description, prioritisation

One key disruptor for ETL pipelines is product changes. With analysts embedded in product teams, you can prevent bad surprises. You can have a prototype ETL gather data from the testing environment to develop both product and ETL in parallel; you can have tolerance for faulty ETL and a service agreement — but overall, you want to monitor that closely. That is only possible with a tight working relationship between prioritisation, release cycles, the continuous integration tool and the testing framework (typically the product owner’s responsibilities) and dashboards and data definitions (typically the analyst’s role). I don’t really have a trick to handle that, other than having both play with each other’s toys.
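The weekly “biggest queries” review described above reduces to a very small piece of logic: scan the query log, exclude system users, and surface the most expensive queries as teaching material. The log format here is made up for illustration:

```python
# Sketch of the weekly biggest-queries review: filter out system users,
# then rank human queries by cost.

query_log = [
    {"user": "etl_system",  "seconds": 900, "sql": "INSERT INTO denorm ..."},
    {"user": "analyst_ann", "seconds": 480, "sql": "SELECT ... JOIN events ..."},
    {"user": "analyst_bob", "seconds": 12,  "sql": "SELECT ... FROM denorm ..."},
]

SYSTEM_USERS = {"etl_system"}

def biggest_queries(log, top=5):
    human = [q for q in log if q["user"] not in SYSTEM_USERS]
    return sorted(human, key=lambda q: q["seconds"], reverse=True)[:top]

for q in biggest_queries(query_log):
    print(q["user"], q["seconds"], q["sql"][:30])
```

Ann’s 8-minute join on the normalised events table is exactly the kind of query you want to discuss: either it becomes part of the denormalisation, or it becomes a lesson on indexes.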
  • 18. Are you logging your (placeholder) model? Do you have estimates (from past A/B tests) of
 how model quality impact your metrics? Do you have a placeholder machine learning environment with a training server fed from the analytics pipeline? Do you log the user input, suggestion, user action & model version? Are those logs processed & audited? 35 Finally, we can talk about models. Actually, where they should sit — because, remember: we do not have a data scientist or models yet! The best way to be welcoming to a model is to imagine a placeholder: have a service that is what would be the model. That service can be very naive, like recommendations, or decisions without inferences, but have software that is separate enough to have its own trace. That service can we studied through an A/B test. That service can be augmented with the ability to get data from either the analytics server, or a server dedicated to model training. That service can be equipped with a lot of monitoring, SLAs, delays, input and output distribution, missing data, etc. This service can be a third party service, like Google Cloud Platform Cloud ML, or Amazon Web Service SageMaker. Are you logging your (placeholder) model? Do you have estimates (from past A/B tests) of
how model quality impacts your metrics? Do you have a placeholder machine learning environment with a training server fed from the analytics pipeline? Do you log the user input, suggestion, user action & model version? Are those logs processed & audited? 36 All those addenda are presumably not very insightful when running a naive, placeholder model — but they are typically something that most data scientists would take a while to build and integrate, while engineers familiar with the front-end codebase would be at ease setting them up and the dev-ops able to maintain them. These last steps are typically less necessary before a data scientist joins, but as soon as one has, and starts having a relevant model to test, you want to get started on building those.
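The logging just described can be sketched as a minimal append-only log of one record per model call. This is only an illustration under assumptions: the field names, file path and model-version string are hypothetical, not the author's schema.

```python
import json
import time

def log_prediction(features, suggestions, model_version, log_file="predictions.log"):
    """Record one (placeholder) model call: input, suggestion and model version.

    The user action is not known at serving time; it is left empty here
    and would be joined in later from the analytics pipeline.
    """
    record = {
        "timestamp": time.time(),
        "model_version": model_version,   # e.g. "placeholder-0.1"
        "features": features,             # the user input sent to the model
        "suggestions": suggestions,       # what the model returned
        "user_action": None,              # filled in later, once observed
    }
    with open(log_file, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record

record = log_prediction({"country": "UK"}, [42, 7, 13], "placeholder-0.1")
```

Having one such line per call is what later lets you process and audit those logs, and compare model versions through an A/B test.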
  • 19. ML in production Production
server (JS) Client Production db 37 How does that work in practice? Well, you still have your display service, and a production database that is sending information to the analytics server. That information is processed by the pipeline common to analysts and data scientists. ML in production Production
 server (JS) Client ML server
 (Python) Trained
 model Train, test
& copy Production db 38 You want to use that information to train a model and host it on a separate service.
  • 20. ML in production Production
 server (JS) Client ML server
 (Python)Features Prediction Trained
 model Train, test
& copy Production db 39 Now you have a server that, given a set of features from the production display service, can send back predictions, and you can use those predictions to show recommendations or decisions to your user. Can most data scientists handle hosting that server? These are fairly involved tasks, but standard for engineers: - Containers & Kubernetes - Release cycle - Fall back & cut-offs So you can look for people who know both modelling and have these skills, or you hire a statistician and an engineer and have them work together. ML in production Production
 server (JS) Client ML server
(Python)Features Prediction Production db recommendation_placeholder.py

import os
import pymssql

def get_smart_recommendation(input):
    analytics_server = os.environ['ANALYTICS_URL']
    conn = pymssql.connect(analytics_server)
    cursor = conn.cursor()
    cursor.execute('SELECT TOP 5 product_id FROM sales GROUP BY product_id ORDER BY COUNT(1) DESC;')
    five_suggested_product_list = []
    for row in cursor:
        five_suggested_product_list.append(row[0])
    return five_suggested_product_list

Placeholder
 naive model Connection
 naive model placeholder 40 How can an engineer start building all that service without an existing model?
 With a placeholder model. Replace the parts in teal with something naive. Let’s look at how we could implement that in Python: - import libraries to access the data server & get local keys; - have the service query, say: - connect to the data server with local tokens - compute and load the five most popular items:
not a very personalised recommendation, but one that you can live with; - fill those into a method that you can use to send your recommendation. Have all that wrapped into any platform that your engineers like and are willing to support: Python Flask, H2O, etc.
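The "fall back & cut-offs" mentioned earlier can be sketched around that placeholder too. This is a hedged illustration: the function names, the default list and the time budget are hypothetical, not part of the original talk's code.

```python
import time

def most_popular_fallback():
    # Hypothetical precomputed default: the five most popular product ids.
    return [1, 2, 3, 4, 5]

def recommend_with_fallback(model_fn, features, time_budget_s=0.2):
    """Call the (placeholder) model; serve a safe default if it fails or is slow."""
    start = time.monotonic()
    try:
        result = model_fn(features)
    except Exception:
        return most_popular_fallback()   # fall back on any model error
    if time.monotonic() - start > time_budget_s or not result:
        return most_popular_fallback()   # cut-off: too slow, or empty answer
    return result

recs = recommend_with_fallback(lambda f: [9, 8, 7], {"user_id": 1})
```

A wrapper like this is what lets the model service crash or lag without ever breaking the page it feeds.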
  • 21. ML in production Production
 server (JS) Client ML server
(Python)Features Prediction Production db recommendation_placeholder.py

import os
import pymssql

def get_smart_recommendation(input):
    analytics_server = os.environ['ANALYTICS_URL']
    conn = pymssql.connect(analytics_server)
    cursor = conn.cursor()
    cursor.execute('SELECT TOP 5 product_id FROM sales GROUP BY product_id ORDER BY COUNT(1) DESC;')
    five_suggested_product_list = []
    for row in cursor:
        five_suggested_product_list.append(row[0])
    return five_suggested_product_list

Placeholder
 naive model Connection
naive model placeholder 41 That's boilerplate code that any developer can easily produce. It doesn't have a way to import features, but that's easy to add. Any statistician with no knowledge of production-ready code might find that particular snippet a little quaint. It is. They should be at least eager to improve on the naive idea. The great thing is: even if they are familiar with Python and nothing else, pointing them at this file and this file alone will allow them to know how to get data, process it effectively, import all their favourite libraries, create objects to implement a recommendation engine, and surface it using the same or an equivalent method. The overhead from, say, a notebook, should be minimal. That structure (and good directions) allows them to be effective within hours of joining the company. An effective data scientist is a happy data scientist; their usefulness will make them want to stay at the company, where they should improve on their model gradually, and rapidly bring in a lot of value by automating a lot of key decisions. Questions? Overall goal split MECE in team KPI? Can newbies describe current impact? Are the metrics known by everyone? Up-to-date metrics dictionary? Are analytical tables audited daily? Are queries answered in 10 minutes? Version control on ETL? Monitor analysts queries & feedback? Less 3 d. from launch to ETL update? Estimate impact of model quality? Do you log input, & user action? Placeholder ML environment? 42 So, here are all the twelve questions. Go through those again and try to count, honestly, how many points you have. If you don't know about one, assume it's like my friend who maybe speaks Arabic: count those as a No. How many points does your organisation have?
  • 22. What should you fix? • ≤ 3 pt.:
 hire engineering managers—your problem is upstream • ≤ 6 pt.:
 hire an experienced analytics manager—to get data good • 6 - 9 pt.:
 hire an experienced data scientist to plan improvements • > 9 pt.:
get some Master's students from your local university 43 Low is not bad, but low means you want to focus on different aspects. Less than three means dramatic measures: forget data engineering, and start upstream: are your engineers happy? Could you have a more mature engineering culture overall? Actually, if you have less than six points, you probably should run the Spolsky test too — or something equivalent. Read up on technical debt, agility, etc. Talk to CTOs of companies that you admire and see what they care about. If you are low, don't hire an experienced data scientist to "do models". Models are not going to help you. You have a bigger problem than that. Hire someone who, reading this list, will smile and who sounds like they understand those problems and care how to fix them; even better: hire someone who, for most points that I make, snipes: "Pff, that's stupid, you should…" because they are probably right. Give them my details as I can certainly learn from them. Hire someone who says they want to make other data scientists work more effectively. What should you fix? • ≤ 3 pt.:
 hire engineering managers—your problem is upstream • ≤ 6 pt.:
 hire an experienced analytics manager—to get data good • 6 - 9 pt.:
 hire an experienced data scientist to plan improvements • > 9 pt.:
get some Master's students from your local university 44 If you are low, dedicate engineering resources to the data team. What needs to be fixed varies, but engineers will rapidly identify issues likely invisible to management. Keep talking to analysts, see if they get excited by changes like "dynamic table index reallocation based on the production schema" or "much faster type inference on the fly". Notice how they waste less time. Finally, if you have a good score, start hiring. Go to the local grad school with a description of your company's business and problem & a small sample of your data; ask the students how they would address it. Hire those who suggest several options and care to tell early which one would work best. Not all will have a keen business sense, but if some of them do, the rest can focus on models. You'll get a great team in no time.
 • 23. Plan • Who is a data scientist • Questions to measure if you are ready • Five ways to audit denormalised dataset • Nine steps to a model in production • Common questions • What project to start with • The near future for data science 45 Ok. Here we stand at the half-way point: we've looked at what data scientists are and what you should ask yourself before hiring them. We still have another long section on how to get a model into production, and common questions. Before we go there, let me admit: this was a very long and detailed first part. The next part is a tongue-in-cheek look at the kind of actual errors that I saw in the BI of big, respectable corporations with competent analysts. Errors happen, some are funny; but to catch them, fix them and be able to laugh at them, you want to check. This is how. Audit Comedy interlude 46 I mentioned that you have these big, wide denormalised reference tables with all the relevant details. Building them is essential, but building them is… prone to mistakes. Easy to catch, obvious, so stupid you could say they are just plain hurtful when you find them. Those errors generally come from analysts missing something obvious, or sometimes a real problem in the service. What are the kind of things that actual companies' analytics get wrong?
  • 24. All of those examples are real And they happened at big, respectable companies full of very smart people. Industries have been swapped to protect
the guilty. 47 I wanted to use the music from Benny Hill but I don't know how to make that work with Keynote, so you get Blackadder. I've tried to mask which company was responsible for which discrepancy, but everywhere I have looked, I have stumbled on confusing results. Most of the time, it was because I, or the analyst responsible for building the data pipeline, overlooked an edge-case or misunderstood the logging. Sometimes, it was a real problem with actual or attempted fraud, using race conditions, unexpected patterns, etc. The point here is less to audit the website than to audit both the analyst's understanding of the data process and the process itself. THIS ONE IS EASY, YOU'D THINK ADD UP TOTALS SELECT
 SUM(value) AS sum_value
 , COUNT(1) count_total
 FROM denormalised_table;
 SELECT
 SUM(value) AS sum_value
 , COUNT(1) count_total
 FROM original_table; -- How much revenue & orders was generated?
 SELECT
 SUM(subtotal) AS sum_value
 , COUNT(1) count_total
 FROM denormalised_orders;
 SELECT
 SUM(subtotal) AS sum_value
 , COUNT(1) count_total
 FROM orders; -- How much … per day?
 SELECT
 SUM(subtotal) AS sum_value
 , COUNT(1) count_total , order_date
 FROM denormalised_orders GROUP BY order_date;
 SELECT
 SUM(subtotal) AS sum_value
 , COUNT(1) count_total
 , order_date
FROM orders GROUP BY order_date; 48 You would be surprised how often even basic numbers like the raw revenues or order count do not add up. This is because processing involved handling duplicates, cancellations, etc. Often, a fact table was only started later, so the discrepancy comes from missing values before that, because of a strict join. With an international audience, the date of an activity requires a convention: should we associate the date of an event to UTC, or should we associate behaviour to perceived local time (typically before and after 4AM local)?
  • 25. CLASSIC UNICITY SELECT date
 , contact_name , contact_id
 , COUNT(1)
 , COUNT(DISTINCT zone_id) AS zones
 FROM {schema}.table_name
 GROUP BY 1, 2, 3
 HAVING COUNT(1) > 1
 ORDER BY 4 DESC, 1 DESC, 3;
 -- Unique orders & deliveries
 SELECT order_id
 , delivery_id
 , COUNT(1)
 FROM denormalised_deliveries
 GROUP BY 1, 2
 HAVING COUNT(1) > 1;
 -- Unique credits
 SELECT credit_id , order_id
 , COUNT(1)
 FROM denormalised_credit
 GROUP BY 1, 2
HAVING COUNT(1) > 1; 49 The most commonly overlooked aspect of database management by analysts is unicity. Source fact tables, even logs, typically have extensive unique indexes because proper engineering recommends it. When being copied and joined along 1::n and n::n relations, those unicity conditions are commonly challenged. I strongly recommend not using a `count` vs. `unique_count` method, because those are expensive, often silently default to approximations when dealing with large datasets, and only tell you that there is a discrepancy. Using a `having count(1) > 1` allows you to get the exact values that are problematic and, more often than not, diagnose the problem on the spot. YOU'D BE SURPRISED MIN, MAX, AVG, NULL 
 SELECT
 MAX(computed_value) AS max_computed_value
 , MIN(computed_value) AS min_computed_value
 , AVG(computed_value) AS avg_computed_value
 , COUNT(CASE WHEN computed_value IS NULL THEN 1 END) AS computed_value_is_null
 -- , … More values
 , COUNT(1) count_total
 FROM outcome_being_verified; -- Customer agent hours worked
 SELECT
 MAX(sum_hrs_out_of_service) AS max_sum_hrs_out_of_service
 , MIN(hrs_worked_paid) AS min_hrs_worked_paid
 , AVG(log_entry_count) AS avg_log_entry_count
 , COUNT(CASE WHEN hrs_worked_paid IS NULL THEN 1 END) AS computed_value_is_null
 , COUNT(1) count_total
 FROM denormalised_cs_agent_hours_worked 
15.7425 -22.75305 1155 0 3247936 50 There are less obviously failing examples when you look at representative statistics, typically extremes and null counts. I have seen tables where the sum of hours worked was negative, or larger than 24 hours — because of edge cases around the clock. Same thing for non-matching joins: - you generally want to keep at least one side (left or right join) - but that means you might get nulls; - one or a handful is fine, but thousands of mismatching instances can flag a bigger problem.
  • 26. YOU’D BE SURPRISED DISTRIBUTION 
 SELECT
 ROUND(computed_value) AS computed_value
 , computed_boolean
 , COUNT(CASE WHEN hrs_worked_raw IS NULL THEN 1 END) AS computed_value_is_null
 , COUNT(1) count_total
 FROM outcome_being_verified ▸ -- Redeliveries
 SELECT is_fraudulent
 , COUNT(1)
 FROM denormalised_actions
 GROUP BY 1; 
-- is_fraudulent count
-- false 82 206
-- null 1 418 487
-- true 21 477 408
51 Another check without a clear boolean outcome is the data distribution: you want certain values to be very low, but you might not have a clear hard limit on how much is too much. An example of a good inferred feature is a boolean that flags problems. You really want to have one of those for every possible edge-case, type of abuse, etc. `is_fraudulent` was one of those. Can you see why this one was problematic? - First surprise, the amount of nulls. It was meant to have some nulls (the estimate was run once a day, so any transaction on the same day would have a null estimate), but that number was way too high. - Second problem, even more obvious: is it likely that fraudulent transactions outnumber legitimate ones by orders of magnitude? Not in this case: the analyst flipped the branches of the `if` by mistake. TO THE TIME MACHINE, MORTY! SUCCESSIVE TIMESTAMPS -- Timestamp succession
 SELECT
 (datetime_1 < datetime_2)
 , (datetime_2 < datetime_3)
 , COUNT(1)
 FROM {schema}.table_name
 GROUP BY 1, 2; ▸ -- Checking the timeframe of ZenDesk tickets SELECT
 ticket_created_at < ticket_updated_at
 , ticket_received_at < org_created
 , COUNT(1)
 FROM denormalised_zendesk_tickets
 GROUP BY 1, 2; -- Checking the pick-up timing makes sense
 SELECT
 (order_created_at < delivered_at)
 , (order_received_at < driver_confirmed_at)
 , COUNT(1)
 FROM denormalised_orders
GROUP BY 1, 2; 52 One common way to detect issues with data processes in production is successions. Certain actions, like creating an account, have to happen before others, like an action with that account. Not all timestamps have those constraints, but there must be some. I have yet to work for a company without a very odd issue along those lines. Typically, those are race conditions, over-written values or misunderstood event names. I have seen those both for physical logistics, where a parcel was delivered before it was sent, and for paperwork, with ZenDesk tickets resolved before they were created, or even before the type of the ticket, or the organisation handling them, was created. Having a detailed process for all of these tends to reveal edge cases.
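The checks from these last few slides (totals, unicity, timelines) can be sketched in miniature in Python, on an in-memory "denormalised" table. This is only a toy illustration: real checks run as SQL against the warehouse, and the column names and rows here are hypothetical.

```python
# A tiny denormalised table as a list of dicts, with a deliberate duplicate.
rows = [
    {"order_id": 1, "subtotal": 10.0, "created_at": 100, "delivered_at": 160},
    {"order_id": 2, "subtotal": 25.0, "created_at": 110, "delivered_at": 150},
    {"order_id": 2, "subtotal": 25.0, "created_at": 110, "delivered_at": 150},  # duplicate!
]

def audit(rows, expected_total):
    """Run the three classic checks and return the list of failures."""
    failures = []
    # Totals need to add up against the source table.
    if abs(sum(r["subtotal"] for r in rows) - expected_total) > 1e-6:
        failures.append("totals do not add up")
    # Some things have to be unique.
    ids = [r["order_id"] for r in rows]
    if len(ids) != len(set(ids)):
        failures.append("duplicate order_id")
    # A coherent timeline: created before delivered.
    if any(r["created_at"] >= r["delivered_at"] for r in rows):
        failures.append("delivered before created")
    return failures

problems = audit(rows, expected_total=35.0)
```

The duplicate row trips both the total and the unicity check at once, which is exactly why `having count(1) > 1` style checks are so good at pointing at the offending rows.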
 • 27. YOU CAN TAKE NOTES WHAT TO CHECK ▸ Everything you can think of: ▸ Totals need to add up ▸ Some things have to be unique ▸ Min, Max, Avg, Null aka basic distribution ▸ Distributions make sense ▸ A coherent timeline ▸ Keep a record of what you checked & expect 53 This is not an exhaustive list, more a list of suggestions to start checks. Expect to write a dozen tests per key table and have most of them fail initially. Other ways of auditing your data include: - dashboards that break down key metrics; anomaly detection on key averages, ratios; - having one or several small datasets of the same schema as production,
filled with edge-cases; process those through the ETL & compare to an expected output. Not sure if you found that part as funny as I did, but I can promise you: asking that over-confident analyst — the 20-something who swears there is no way they ever did anything wrong — about any of those makes this game really fun. Nine steps to get to a model in production 54 Ok, back to serious long lists. After the overly detailed 12 points to check before you get a data scientist, let's assume you have hired one; actually, you have hired a head of data, several analysts who checked their data several times over and a couple of engineers who know a lot about machine learning. How do you get all those people to work together?
  • 28. Five roles involved • Product owner
& PO Managers • Product analyst • Data scientist • Product engineer • ML service engineer 55 First off: I refer to roughly five types of roles. Those are not exclusive: many companies have more detailed roles, or do not have a distinction between analyst (someone who knows how to build metrics for a product team) and data scientist (in this case, a statistician with an understanding of engineering). I distinguish the product engineer, who knows the code that sends data and would integrate things into the front-end, from a support engineer for machine learning, but either could also be a software architect. Sequential People have tried to
jump to higher steps It hurts harder 56 Unlike the 12 points, those steps are numbered because they have a strict dependency from one to the next. You can try to skip a step but so far, anyone whom I've seen try has failed and had to back-track. Many people want to get to Step 7: build a model. On its own, that won't help. If I had to express how that makes me feel, let me tell the story of a painter, someone who paints buildings:
  • 29. Sequential People have tried to
jump to higher steps It hurts harder 57 The painter sends a bill to re-paint the outside of a tall building, five floors: - £40k for building the scaffolding, and that will take two weeks; - £15k for painting, and that will take three days. The client asks: "Can't we skip the scaffolding and just paint the five floors in three days for £15k?" You'd think most people would know that that doesn't make sense. Skipping straight to building a model at Step 7 makes as much sense. What do you want? 1. Team goals, objectives and key metrics are defined & documented 2. OKR clear & broken down by key dimensions in custom reports
 
 
 
 
 Product
 manager Product
analyst 58 The most important lesson from this framework, if you remember nothing else of the detail, is this:
Product owners and product analysts own all of the early work; actually, most of the first two thirds, and certainly the most difficult and tedious parts. Those points can seem obvious, but for every company that I've worked with, defining those objectives is actually quite hard, technically, and can be rather controversial. Every company, including Facebook, has "fake users", "inactive users", duplicates, etc. Every company has key activity, but which activity actually counts is hard to define: someone who connects to a video game or to Twitter, but doesn't play any game or tweet anything — are they active? If someone cancels their only order on an e-commerce platform, are they an active customer?
  • 30. What do you want? 1. Team goals, objectives and key metrics are defined & documented 2. OKR clear & broken downs by key dimensions in custom reports 3. Solution is clear & the prediction to be made is defined, measured
 
 … Product
 manager Product
 analyst Product
owner 59 On the model: it is very rare that product owners come right away with a solution. Imagine you handle Amazon Mechanical Turk, and you want to retain Turkers, the workers on the platform. You are asked to build a model for the prevention of de-registration. Two options: - you can try to list Turkers likely to leave, and send them a gift, or a good opportunity, to retain them; or - when a Turker wants to de-register, you compute the likelihood that you can convince them otherwise, and if they probably won't be convinced, let them do it with a click; otherwise, have them call a retention specialist. What do you want? 1. Team goals, objectives and key metrics are defined & documented 2. OKR clear & broken downs by key dimensions in custom reports 3. Solution is clear & the prediction to be made is defined, measured
 
 … Product
 manager Product
 analyst Product
owner 60 The first option involves building a daily score; it's a slow (not production-speed) model that someone can audit before acting on; it probably gets a ton of false positives, because few people overall leave every day, etc.; the second has a lot less bias, because you only look at Turkers actively trying to leave and you learn from previous retention efforts, but making it snappy enough might be tricky. Those are completely different models, with different data, development strategies, etc., for completely different services. Understanding that is key to working effectively with machine learning. Naturally, that example was never really about Amazon Turk.
  • 31. Can we do it? 4. Likely improvements and impact from data properly estimated 5. Model, methodology and dataset needed to train defined Product
 analyst (Sr.) Data
scientist 61 Notice how no data scientist was involved yet. Once the product owner has spelled out what to predict and when, and how an inference would help, you can ask: by how much? That's where a detailed understanding of the product's ratios becomes key. Say you find a better way to rank suggestions to a user; they should convert more — but by how much? For that, pick a metric for the ranking (MRR, AUC) and use previous improvements to the ranking to estimate the impact. That leads to a proxy ratio: +0.01 in MRR raises conversions by 0.45%. Those are always approximations, but they allow you to estimate the impact of a better model. You want your busy data scientists to work where they can have the most impact. Can we do it? 4. Likely improvements and impact from data properly estimated 5. Model, methodology and dataset needed to train defined 6. Dataset is available, audited;
 monitored daily updates
 
 … Product
 analyst (Sr.) Data
 scientist ETL Full stack: Data architect,
 Product engineer,
Product analyst 62 Finally, you get data scientists involved, with two key, rapid tasks: - pick a model that could fit the problem, based on what we know at that point (no zipcode for new customers, not until they confirm their first order); - experienced data scientists can predict the likely accuracy; multiplied by the impact sensitivity, that lets us prioritise projects accurately. That leads to a key document: the data needed to train the model. Sometimes, it's all readily available, modulo some minor feature engineering.
More often than not, Step 6 takes months. Getting the right data the first time can be a gruelling, frustrating effort: so many things can go wrong. A reliable dataset supported by analysts is key but not always enough. Before committing, running a basic audit can reveal very confusing contradictions, so be mindful.
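The prioritisation arithmetic behind Steps 4-5 can be sketched in a few lines. The proxy ratio (+0.01 MRR ≈ +0.45% conversion) is the illustrative figure from the talk; the project names and expected gains are hypothetical.

```python
# Percentage-point conversion lift per +1.0 of MRR, from the proxy ratio above.
CONVERSION_LIFT_PER_MRR_POINT = 0.45 / 0.01  # = 45.0

def expected_conversion_lift(expected_mrr_gain):
    """Turn an estimated model-metric gain into a business-metric estimate."""
    return expected_mrr_gain * CONVERSION_LIFT_PER_MRR_POINT

# Rank hypothetical projects by estimated business impact, not by how fun they are.
projects = {"rerank search": 0.02, "better landing page model": 0.005}
ranked = sorted(projects, key=lambda p: expected_conversion_lift(projects[p]), reverse=True)
```

The point of the exercise is the ranking, not the precision: even a rough proxy ratio lets you send your busy data scientists where a better model moves the metric most.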
  • 32. Getting it done 7. Offline model: Objectives defined, quality & impact estimated from training 8. Model is running in production with unit test, skew & balance monitored
 
 
 Data
 scientist ML Engineer
Product eng. 63 Once you have at least partial data (processing months of data once you have found a way to get it consistent can be a slow process), you can start having the data scientist do the 'real' work: training on the dataset, testing, identifying meta-parameters, optimising, etc. As much as that step is the most exciting and the coolest, the first cycles are often very quick, because you generally want to try simple models first. Then you want that model to be served in production, with its own service: a lot of things are needed there, but this is mainly an engineering effort, and that type of service is not yet standard enough to be free of real engineering risk. Getting it done 7. Offline model: Objectives defined, quality & impact estimated from training 8. Model is running in production with unit test, skew & balance monitored 9. Model is self-updating in production; new model schedule Data
 scientist ML Engineer
Product eng. Data scientist ML Engineer 64 Finally, once you are confident that your data pipeline is reliable, and that your training can automatically generate a good model (because you audit and monitor those too), you can confidently let an automated system update the model. That's real machine learning. From there, you might even have it run on its own and free your data scientists to work on something else. Your actual model can become increasingly complex, self-optimising, etc., and there certainly will be major improvements in the concepts that you use within six months (thanks to the machine learning researchers), so you want to go back to it: schedule some time to investigate those possible improvements. To do that, keep monitoring the impact of a better model on your key metric, have data scientists make an informed estimate of how much they can improve models, and prioritise accordingly.
  • 33. “We just want
a model” 65 One of the key reasons that people want "a model" is to serve as a proxy for insights: How much does that influence conversions? Is fixing this source of churn more impactful than this blocker for up-sell? Models can help, but those questions are often better answered by an analyst with a keen product understanding. Even a model would rely on detailed data, categorised appropriately, driven by product insights; a good analyst with an intelligent breakdown and clear odds ratios provides far more useful and legible insights than most models. Typically, models other than linear regressions are difficult to interpret, so you probably want to use those first for insights-driving models. Regressions are the simplest models, so you can trust junior analysts to handle them — just check for collinearity, a common issue that's easy to fix. Questions? 1. Team goals, objectives and key metrics are defined & documented 2. OKR commented & breakdowns by key dimensions have custom reports 3. Problem for data science is properly defined, measured 4. Likely improvements and impact from data properly estimated 5. Model, methodology and dataset needed to train defined 6. Dataset is available, audited; regular updates scheduled 7. Offline model: Objectives defined, quality & impact estimated 8. Model is running in production with unit test, balance audited 9. Model is self-updating in production; new model scheduled 66 This was yet again a long section. Do you have any questions, both on: - any of the individual steps, the person responsible, the success criteria; but also on - the order, how dependent any step is on the next? If you are not tired at this point, I can address individual questions, or I can answer my own.
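The collinearity check just mentioned can be sketched as a scan for pairs of regression features with suspiciously high Pearson correlation. The threshold and feature names below are illustrative assumptions, not a prescription.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def collinear_pairs(features, threshold=0.9):
    """Flag feature pairs whose absolute correlation exceeds the threshold."""
    names = list(features)
    return [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
            if abs(pearson(features[a], features[b])) > threshold]

features = {
    "sessions": [1, 2, 3, 4, 5],
    "page_views": [2, 4, 6, 8, 10],   # perfectly collinear with sessions
    "tenure_days": [30, 5, 80, 12, 44],
}
suspect = collinear_pairs(features)
```

Dropping (or combining) one feature from each flagged pair is usually all the fix a junior analyst's regression needs.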
  • 34. Plan • Who is a data scientist • Questions to measure if you are ready • Five ways to audit denormalised dataset • Nine steps to a model in production • Common questions • What project to start with • The near future for data science 67 This is where we stand. There are two very common questions that I get, so I’ll cover those, but happy to answer more afterwards. What project
to start with? 68 The classic question is: what to start with? There are often so many ideas (or not a lot of inspiration) that it's difficult to pick one. There is also usually quite a bit of pressure to get that first project to be a success, which certainly doesn't help, as the first project will probably reveal a lot more issues than you'd like in the organisation, the clarity or the alignment of the incentives, the quality or consistency of the data, the technical debt, and all the dependencies of a data science project. I might have some controversial responses there too. Once again: your mileage should absolutely vary. In increasing order of urgency and interest:
  • 35. Boring, repetitive, expensive & scaling • More events, better model • Simple interactions, decisions • Low quality: priority for human 69 Who thinks that Machine learning is exciting, flying cars, intelligent robots, etc.?
Well, you would be wrong. I mean: those are true, but the reality is that driving or flying, that's really boring. Machines are not really smart yet, and they don't get bored; they also need a lot of cases to learn from, for now — so naturally, they are really good at anything that is repetitive and boring. Find the least valued role, the team with the highest churn. That's your best candidate project. I don't know your industry, but I'm willing to bet that that is either the Sales team or the Customer service team. There are tons of great template projects around generating leads, or rather filtering through a long list of potential leads and ranking them. Boring, repetitive, expensive & scaling • More events, better model • Simple interactions, decisions • Low quality: priority for human 70 Same thing for customer services: natural language processing can help you prioritise, classify and even suggest default answers to customer contacts; you can infer tone, evaluate agents, estimate an optimal compensation rate for unhappy customers before expensive case processing, etc. All acquisition efforts, social media optimisation and marketing campaigns are also often great candidates; even small changes typically scale quite effectively. There are also great projects to be thinking about around already automated processes: ad auctions, recommendations. Those are typically already fully digital; they use exclusively numbers rather than confusing, ambiguous classifications like customer service or lead evaluation, so they generally have an easier data gathering curve. But the best rule of thumb is: if I did that for 30 minutes, would I be bored?
  • 36. Anomaly detection
Lifetime value, churn • Understanding what is wrong • Not losing a customer is far easier than making a new one • Understand contradictory effects between promotions, efforts to retain • Both are mainly analysts' work rather than modelling 71 Truth is, the best first project for your data scientists might not be a model to serve in production, especially if they are a little junior, you have a lot of contradictory directions you could go, or your service is a little wonky. There are three types of projects with fairly standard deployment that really help organisations structure themselves: 1. Anomaly detection is about monitoring your activity closely, breaking it down in detail and identifying unusual patterns; that allows you to trigger alerts to analysts; you don't need to have a client-facing process attached to it, but it really helps identify issues. If your activity goes up and down, and your analysts spend a lot of time figuring out why, that should relieve them a lot. You should rapidly move from an initial naive model to fairly sophisticated time series forecasting, and that's always a fun learning exercise for analysts. Happy to present a framework to build that with a growing complexity. Anomaly detection
Lifetime value, churn • Understanding what is wrong • Not losing a customer is far easier than making a new one • Understand contradictory effects between promotions and efforts to retain • Both are mainly analysts’ work rather than modelling 72 2. Another similar supporting effort that most analysts can take on is first measuring, then estimating, the sum of actualised profits that a customer brings over an extended period of time, also known as “Lifetime value” or LTV. The first effort should be purely analytical and compute past value, but you will rapidly need to make assumptions about customers who recently joined. That model allows you to estimate the long-term impact of certain decisions and to compare investments that would otherwise seem hard to compare, like improving the customer service vs. spending more on advertising. As for Anomaly detection, there are standard frameworks to build and model that, which I’m happy to share. The key decision associated with LTV is comparing it to how much you can spend in advertising to get a new customer, also known as Cost per acquisition or CPA: you want your CPA to stay below your LTV.
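The first, purely analytical version of LTV is just a discounted sum of past per-period profits. This sketch uses placeholder figures and a placeholder discount rate; the CPA comparison at the end is the key decision the talk describes:

```python
def lifetime_value(period_profits, discount_rate=0.1):
    """Sum of per-period profits, actualised back to the period the customer joined."""
    return sum(p / (1 + discount_rate) ** t for t, p in enumerate(period_profits))

# A customer who brought 100 of profit in each of two years, discounted at 10%/year
ltv = lifetime_value([100.0, 100.0], discount_rate=0.1)

# The key comparison: acquisition spend (CPA) should stay below LTV
cpa = 60.0
worth_acquiring = cpa < ltv
```

For recently joined customers, the `period_profits` tail has to come from a forecast rather than the ledger, which is where the modelling assumptions enter.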
  • 37. Anomaly detection
Lifetime value, churn • Understanding what is wrong • Not losing a customer is far easier than making a new one • Understand contradictory effects between promotions and efforts to retain • Both are mainly analysts’ work rather than modelling 73 The growth of venture-backed companies is more often than not associated with an initial effort to increase LTV and find channels with low CPA; once you have, you spend as much money as you can on acquisition, hoping your CPA stays low and your LTV stays higher. 3. Finally, getting customers in is good, but it’s often far cheaper not to let them out. The best way to do that is to understand who leaves, why, and how to address that. The best approach for that is a framework known as Growth accounting: rather than looking at users, you account for newcomers and departures, identify the last actions and circumstances of leavers, and test retention efforts. Combined with LTV, that effort is generally very compelling. I guess you won’t be surprised if I’m offering a standard approach to that too. A/B testing • Identifying no-log users • Splitting by user type • Including all costs • Balance & Contamination • Peeking & Cancelling early • Supporting clusters • Documenting hypotheses 74 But the truth is, if you have a statistician joining your software company, the most impact they can have is by building a simple, key tool: an A/B testing framework. Initially, you should be able to get all your users in a room. As soon as there are more of them than you can meet regularly, you want to make sure that your changes add value by confronting the old solution with the new. Those are known as A/B tests, and they are a key feature of any good software company. They are actually surprisingly hard to do well, due to a flurry of issues, most of which are engineering work, but some of which are classic statistics. Any statistician is familiar with inferential testing, and some are even able to suggest Bayesian models.
All will want their models and efforts to be tested through that prism, so before you get them started on the real work, get them to build a reliable, robust, adapted A/B testing framework.
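The statistical core of such a framework is small compared to the engineering around it; a classic two-proportion z-test, sketched here with made-up conversion counts, is the usual starting point before anything Bayesian:

```python
import math

def two_proportion_ztest(conversions_a, n_a, conversions_b, n_b):
    """z statistic and two-sided p-value for a difference in conversion rates."""
    p_a, p_b = conversions_a / n_a, conversions_b / n_b
    p_pool = (conversions_a + conversions_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF, via the error function
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# 10% conversion on the old version vs 15% on the new, 1,000 users each
z, p_value = two_proportion_ztest(100, 1000, 150, 1000)
```

The slide’s “Peeking & Cancelling early” warning applies directly here: recomputing this p-value repeatedly as data accrues, and stopping when it dips below the threshold, inflates the false-positive rate.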
  • 38. Where is the
job going to? 75 One question that I get occasionally — although not often — is: can we future-proof our hires? It’s very difficult to anticipate the future of a job that didn’t exist ten years ago, was completely transformed several times since, and whose defining feature is that everyone good at it is actively trying to code themselves out of a job. In spite of all that, the solution is rather obvious, and similar to any other job in the next ten years: specialisation and automation. “Data scientists will go the
way of the webmaster…” — Hilary Mason • Automated easy tasks: • Feature engineering
 Model selection • DataRobot, H2O, Peak.AI • Less & more productive data scientists, ML engineer • Hosted models, auto-testing
standard monitoring & alerts • More analysts, more PO • Can’t automate prioritisation • Data structures make sense 76 Hilary Mason started and used to lead bit.ly’s data science team. She is one of the most talented and eloquent data scientists there are. The best summary of where we are going is a quote from her, from an interview in January. For her, the label “data scientist” will probably disappear in ten years; it will go, I quote, “the way of the webmasters.” I guess that word resonates with most of you who have been “making” websites for a decade or more. Fifteen years ago, every website had its webmaster. This was a generalist trying to handle copy, uploading, fighting with unruly CSS, content strategy, hosting, etc., all by themselves. Since that confusing time, hosting platforms, frameworks and specialisation have turned that one job into different, mostly well-defined specialities: editor, copy-writer, designer, WordPress integrator, SEO expert, front-end and back-end developer, UX researcher and also, in a way, data scientist.
  • 39. “Data scientists will go the
way of the webmaster…” — Hilary Mason • Automated easy tasks: • Feature engineering
 Model selection • DataRobot, H2O, Peak.AI • Less & more productive data scientists, ML engineer • Hosted models, auto-testing
standard monitoring & alerts • More analysts, more PO • Can’t automate prioritisation • Data structures make sense 77 There are no more webmasters, just people doing things far better and more effectively because they have specialised expertise, context and tools. We should expect data scientists to do the same. We will all benefit from an additional layer and several generations of tools; tools to do all the automatable parts of the job in seconds: feature generation and selection, model selection, meta-search, optimisation. That means there should be a small increase in demand for highly paid machine-learning specialists building those tools, but also less need, per company, for data scientists working on picking the relevant features and the perfect model. Some could still do that, but many would rely on having a machine run through a more extensive checklist while they get coffee. At the same time, far more companies will be using machine learning, so I suspect more people will be data scientists, just not the same as today. “Data scientists will go the
way of the webmaster…” — Hilary Mason • Automated easy tasks: • Feature engineering
 Model selection • DataRobot, H2O, Peak.AI • Less & more productive data scientists, ML engineer • Hosted models, auto-testing
standard monitoring & alerts • More analysts, more PO • Can’t automate prioritisation • Data structures make sense 78 Hosting large datasets, monitoring: all that should be following standard frameworks quite rapidly. And those frameworks should have, like Node and React today, and still to a certain extent WordPress, very strong network effects. Current tools have a bright future.
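The “machine running through a checklist while they get coffee” is, in its simplest form, an exhaustive cross-validated search over hyperparameters; this sketch on synthetic data only hints at what tools like DataRobot or H2O industrialise at much larger scale:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for a real training set
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Let the machine try every regularisation strength and keep the best by cross-validation
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
)
search.fit(X, y)
best_model = search.best_estimator_
```

Extending the grid to several model families and feature-engineering steps is mechanical, which is exactly why this part of the job is the first to be automated away.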
The parts that seem harder to automate for now are around feature design, prioritisation, and defining a data structure and a schema that can accommodate a given service, a given feature. Data scientists probably want to spend more quality time understanding data architects and product owners, because their future job might focus on the commonalities with those jobs. More ambitious data scientists might brush up on more abstract mathematical results like generativity, ensembling and convergence to build more universally automated systems. Those aspects, core to data science but often overlooked, could become essential when automating data science itself.