14. What is user research?
Systematic approach to discovering
users' aspirations, goals, tasks, needs,
pain points, and information and
interaction requirements.
User research grounds, verifies, and
validates what a team builds.
15. Where does user research fit in product development?
Foundational research → Iterative research → Evaluative research
(mapped across the product development cycle)
16. Dimensions of user research methods
1. Context: natural or near-natural use, scripted use, not using the product, or a hybrid of the above
2. Attitudinal (what people say) vs. behavioral (what people do)
3. Qualitative (answers why) vs. quantitative (answers how much/how many)
Source: https://www.nngroup.com/articles/which-ux-research-methods/
Experiments
18. What is an experiment?
An experiment is a way to test a hypothesis about the
product (a live experiment).
An experiment may also refer to the gradual launch of a
new feature.
Note: Tests, while they are an important part of the software development
journey, are not experiments, since you know in advance the result you expect.
20. I’m a PM. I know what will happen.
Humans are terrible at
making predictions
1. Hindsight bias
2. Observational selection bias
3. Projection bias
4. Anchoring bias
…and hundreds of other cognitive biases
21. Doing a pre/post analysis is enough
[Chart: Google Search traffic in Brazil, June 2014]
27. Fundamentals of experiment design
The scientific method is an empirical method of acquiring knowledge: the
systematic observation and measurement of the world, and the experimental testing of hypotheses.
1. Observation → 2. Hypothesis → 3. Design → 4. Experiment → 5. Analysis → 6. Prove/Reject
28. PM flavor of scientific method
0. Ask a question → 1. Observation → 2. Hypothesis → 3. Design →
4. Experiment → 5. Analysis → 6. Prove/Reject → 7. Communicate results
29. 0. Ask a question
How can I increase usage of my product?
How can I increase revenue attributed to my product?
How can I increase user happiness?
How can I simplify code without changing metrics?
How can I affect click behavior?
30. 1. Observation: do background research
What others have done before
Are you doing something different?
Did something change since the previous attempt?
Quantitative data
Behavioral metrics
Surveys
Trends
Qualitative data
Perceptions
Attitudes
Assumptions
Preferences
31. 2. Develop a hypothesis
A (1) testable (2) explanation for a
phenomenon.
The goal of an experiment is to prove or
disprove the hypothesis.
AVOID running experiments to see what happens
or to gather data with no hypothesis. Use other
user research methods and have a POV.
32. 2. Develop a hypothesis
Example
1. Ask a question
a. How can I increase sales for Prime users on the mobile
app?
2. Do background research
a. Users had trouble finding filters on mobile
b. Users get overwhelmed with too many results
c. Decreasing options simplifies decision-making
d. BUT, past experiments limiting results had negative
outcomes
33. 2. Develop a hypothesis
Hypothesis:
Prime users will spend more $ if they can easily narrow their
search results to Prime products
Is it valid?
● Is it testable?
● Does it have an explanation?
● Do I have an educated guess?
34. 3. Design experiment
Hypothesis:
Prime users will spend more $ if they can easily narrow their
search results to Prime products
Design experiment:
1. Show a Prime toggle on the navigation bar for all US Prime
users on the iOS app
2. Toggle off by default
3. No changes to
a. Backend algorithms
b. Logic that decides when to enable the Prime filter
c. Current Prime filter behind the filter button
35. 3. Design experiment
Triggering criteria
● Who: US Prime users on the iOS app
● When: If results include a prime product
● How: Session-based
Duration
● 2 weeks
Launch criteria (success metric)
● Statistically significant increase in revenue
● No increase in latency
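To make the design concrete, here is a minimal sketch of how such a spec could be captured as structured data; the class and field names are hypothetical, not any real experimentation platform's schema:

```python
from dataclasses import dataclass, field

@dataclass
class ExperimentSpec:
    """Hypothetical experiment spec; fields are illustrative."""
    name: str
    audience: str                 # who is eligible
    trigger: str                  # when the experiment activates
    unit: str                     # randomization unit (session, user, ...)
    duration_weeks: int
    success_metrics: list = field(default_factory=list)
    guardrail_metrics: list = field(default_factory=list)

prime_toggle = ExperimentSpec(
    name="prime_filter_toggle_ios",
    audience="US Prime users on the iOS app",
    trigger="search results include at least one Prime product",
    unit="session",
    duration_weeks=2,
    success_metrics=["statistically significant increase in revenue"],
    guardrail_metrics=["no increase in latency"],
)
```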
37. 5. Analyze the data
Results
+2.5% revenue, 95% CI [1.9%, 3.1%], p = 0.02
38. 5. Analyze the data
1. Statistical significance is the likelihood that the numeric difference
between a control and treatment outcome is not due to random chance
2. Null hypothesis states there is no significant difference between control and
treatment, any observed difference is due to sampling or experimental error
3. P-value evaluates how well the sample data supports the argument that the
null hypothesis is true. A low p-value suggests you can reject the null hypothesis
4. Confidence interval is a range of values (lower and upper bound) that is
likely to contain an unknown population parameter
40. 6. Draw conclusions
Hypothesis:
Prime users will spend more $ if they can easily narrow
their search results to Prime products
1. Validate data
2. Craft a story
3. Evaluate results
a. Arguments in favor and against it
b. Key observations and durable learning
c. Next steps
44. Choose the right metrics
1. Think both short-term and long-term
2. Use metrics that matter
3. Align on the success metrics beyond your
own team
45. Be a good wannabe scientist
1. The scientific method is not a suggestion
2. Be suspicious if you didn’t predict a specific
result in advance
3. The more you slice and dice your data, the
more false positives you’ll get (quantified in
the sketch after this list)
4. Lean against rolling out flat experiments,
unless there are valid reasons
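Point 3 is easy to quantify. A minimal sketch, assuming independent slices each tested at a significance level of 0.05:

```python
# Chance of at least one false positive when testing k
# independent slices, each at significance level alpha.
alpha = 0.05
for k in (1, 5, 20, 100):
    p_any = 1 - (1 - alpha) ** k
    print(f"{k:>3} slices -> {p_any:.0%} chance of a false positive")
# 1 -> 5%, 5 -> 23%, 20 -> 64%, 100 -> 99%
```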
46. Create and follow templates and processes
1. Set up an intake process to get ideas from
everyone
2. Establish a pre- and post-experiment design
template
3. Document all learnings and make them widely
available
"As you checked in we sent you an email to join our online communities, events, and to apply for product management jobs. As members of the Product School community we'd like to provide you with these resources at your disposal."
Hello everyone, it’s a pleasure to be here.
My name is Ruben Lozano and I’m a Product Manager at Google Maps.
Before Maps, I was a PM at Google Cloud, Amazon, and Microsoft.
And today, I want to talk to you about using the scientific method when conducting experiments as product managers.
From my experience, when people talk about conducting experiments in tech--they talk about A/B testing.
For the few of you who may not be familiar with A/B testing, at its most basic, it is a way to compare two versions of something to figure out which of the two performs better.
There are other more advanced methodologies of experimentation, like Multivariate Testing or Multi-armed Bandit, but I won’t be covering them during this presentation.
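To make the mechanic concrete, here is a minimal sketch of how users might be split deterministically into the two versions; the function and the 50/50 split are illustrative, and real experimentation platforms layer much more on top:

```python
import hashlib

def assign_variant(user_id: str, experiment: str) -> str:
    """Deterministically bucket a user into control or treatment.

    Hashing (experiment name + user_id) keeps the assignment stable
    across sessions and independent across experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 100 < 50 else "control"

print(assign_variant("user-42", "prime_filter_toggle"))  # same answer every call
```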
But in general, experiments are one of many methods within your product management toolkit to conduct user research when building products.
That is why I want to briefly talk to you about user research,
so you understand when it is a good idea to use experiments compared to other research methodologies.
User research is a systematic approach to discovering users’ aspirations, goals, tasks, needs, pain points---you name it.
To me, it is that magical component that helps you ground, verify, and validate what you and your team build.
Research fits in every phase of product development
For example, foundational research usually starts before design and development; but I encourage you to use it even after your product has been launched. Examples of foundational research are diary or ethnographic studies; these help you build empathy towards people, uncover opportunities, and inform your overall product strategy and direction.
Iterative research is commonly used when you have already identified the problem you want to solve, and you may want to conduct an in-lab usability study to gather user input to direct which path your solution should focus on.
Experimentation fits into evaluative research. In other words, you use it when your product is done or almost done, and you want to improve it.
Experiments will provide you rich data--but not in every dimension.
Experiments will provide you “Behavioral” data, in other words--what people do. Experiments will not provide you attitudinal data--like how people feel, what they want, or their aspirations.
Experiments will provide you “Quantitative” data, in other words--they will answer “how much” or “how many”---but not “why” users do what they do or why they like it.
And finally, experiments will provide you data from a natural or near-natural context. In other words, you need to have a product already in the wild to collect this data accurately.
With that in mind, let’s define an experiment.
An experiment is a way to test a hypothesis about your product.
At Google or Amazon, experiments may also refer to the gradual launch of a new feature.
For this talk, I will focus only on the first kind: live experiments.
It is important to note that Tests are not experiments, as in tests, you know in advance the result you expect.
So why run live experiments?
Most of the time, you already built the feature. You did user research, you conducted usability studies.
You are the PM--you are smart. You know what will happen, right?
But let me tell you--humans are terrible at making predictions.
Too soon?
I know. The worst part is that our own mind tricks us with multiple cognitive biases. For example, hindsight bias.
I am confident that you, or many people you know, claim they knew all along the results of the 2016 election. So they feel they are good at making predictions, but they are not. We are not. The same happens with product. We don’t always know.
So what about a pre/post analysis? You already built the feature. Launch it and see what happens.
But the world is complicated. Let me give you an example.
This graph roughly shows Google Search traffic over time
The Google Search team released a feature right when you see a big drop in Searches.
Just by looking at pre/post, the team should have been concerned--but they were not. Why?
Let me give you a hint. This data comes from Brazil, June 2014. Any ideas?
Yes. The World Cup. People were not searching, they were watching soccer--it was not your feature. Thank you A/B experiments.
This is the beauty of A/B testing. It isolates the impact of just the product changes you deploy.
Experiments help you understand if something is a good idea.
For example, you decide to add images to your search results. It seems like better UX; people like images.
But if you think deeply--will it be better? What if the site gets slower, what if you show fewer results in the same screen space, what if the most relevant result doesn’t have an image? Not that straightforward--but if you do an A/B test, you could measure its impact.
A/B tests are very useful; they can help you
Iterate on a good idea
Remove features from your product
Measure the impact of changes.
At some point, you may even feel they are magical.
But it’s not true. They are not magical.
The A/B test concept is very easy to understand and there are tools that make it easy to implement.
Ergo, they are overused and used incorrectly.
And as Maslow wisely said: “if the only tool you have is a hammer, you will treat everything as if it were a nail.”
This is when we bring science.
Conducting experiments means doing science--and science follows a very strict methodology.
If you don’t, you are doing pseudoscience.
Not sure about you, but I don’t trust pseudoscience--not even “directionally” or as a “better than nothing” outcome.
To conduct a sound experiment, we should follow the scientific method. Yes, the one you learned a long time ago.
It follows six steps:
Observe the world
Formulate a hypothesis
Design an experiment
Run an experiment
Analyze that experiment
And prove or reject your hypothesis
For product management, it is basically the same. I would just add two steps.
First, you may have a specific question you want to answer
And last, you should invest in communicating your results.
Let’s start by asking questions
And these questions could be like--how can I increase revenue or usage of my product
Or something more philosophical--like---how can I increase happiness or make people love my product?
Then, you move to the observation step, do background research.
First, look at what others have done before, when, and why--has something changed--should we try it again?
Then, look at quantitative and qualitative data from all those user research methods you conducted. What can you learn about your product?
And after that, you develop a hypothesis
A hypothesis is a testable explanation for a phenomenon. It has two parts
Testable: you should be able to measure it
Explanation: you should have a story that explains it
Before you run an experiment, you actually need to have an educated guess of what you think will happen.
This is required because the goal of an experiment is to prove or disprove a hypothesis.
As an example, let’s use an experiment I conducted at Amazon as the PM of the mobile app.
I asked the question: How can I increase sales for Prime users on the mobile app?
I did background research: I found through different data sources that users get overwhelmed with too many results, that users had trouble finding filters on mobile, but also that past experiments limiting the number of results had negative outcomes, and that psychological research shows decreasing options helps decision-making.
So… let’s try to develop a hypothesis
My hypothesis is that Prime users will spend more $ if they can easily narrow their search results to Prime products.
Let’s check the hypothesis:
Is it testable? Yes, I can measure changes in revenue.
Does it have an explanation? Yes, I am saying the change will happen because Prime users will be able to easily narrow their search results to Prime products.
Do I have an educated guess? Yes, I am saying revenue will increase.
Based on that hypothesis, I designed the experiment in this way
Show a Prime toggle on the navigation bar for all US Prime users on the iOS app
No changes to algorithms
No changes to when the Prime filter is enabled
No changes to the Prime filter within the filter menu
Toggle off by default
Then, you define the triggering of your experiment
Who is going to see it: US Prime users on the iOS app
When: When results include a prime product
How: Session-based--it means each session is a data point
Duration is two weeks.
In most consumer products, you test in full-week increments, as user behavior on a Tuesday and on a Sunday is drastically different.
Be careful which two weeks--avoid experimenting during holidays or anything else that could disrupt regular user behavior.
Think about whether the first two weeks are actually the best. Sometimes you could have features with a novelty effect--in other words, their impact can wear off over time; or with a learnability effect--users require time to adapt.
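Duration is ultimately a sample-size question. Here is a minimal sketch of the standard two-proportion calculation; the baseline conversion rate and the hoped-for lift are made-up numbers, not the actual Amazon figures:

```python
from scipy.stats import norm

# Sessions needed per arm to detect a lift between two conversion
# rates at alpha = 0.05 (two-sided) with 80% power.
alpha, power = 0.05, 0.80
p1 = 0.020                 # assumed baseline conversion rate
p2 = p1 * 1.025            # hoped-for +2.5% relative lift
z_alpha = norm.ppf(1 - alpha / 2)   # ~1.96
z_beta = norm.ppf(power)            # ~0.84
n = (z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p1 - p2) ** 2
print(f"~{n:,.0f} sessions per arm")  # tiny lifts need a lot of traffic
```

Dividing that number by your daily eligible sessions tells you how many weeks the experiment needs to run.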
Launch criteria
This is when you define your success metric
I won’t go over details here because it could be its own session
But whatever you decide on your experiment duration or launch criteria, don’t change it after the experiment starts.
Why? To prevent data manipulation or the perception of data manipulation.
Many experiment owners will be tempted to stop or keep running an experiment, or change the narrative of success, to fit their own agenda.
So you run the experiment. Here you see the only difference between the control and the treatment.
Two weeks pass. You get the results, and they look something like this:
An increase of +2.5% in revenue, with a confidence interval from 1.9% to 3.1% and a p-value of 0.02.
So--how should you read this?
There are four concepts that are important to understand.
Statistical significance. That is the likelihood that the numeric difference between a control and treatment outcome is not due to random chance.
In other words, most of the time you want your results to be statistically significant.
Null hypothesis. The null hypothesis says that there is no significant difference between control and treatment, and that any observed difference is due to sampling or experimental error.
In other words, most of the time, you want to reject the null hypothesis--as you expect a difference between control and treatment.
P-value evaluates how well the sample data supports the argument that the null hypothesis is true. A low p-value suggests you can reject the null hypothesis.
In other words, most of the time, you want a low p-value.
Confidence interval is a range of values (lower and upper bound) that is likely to contain an unknown population parameter
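As a minimal sketch of where numbers like these come from, here is the arithmetic on simulated data; the per-session revenue values are made up, not the real experiment's:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control = rng.normal(10.00, 4.0, 50_000)    # made-up $/session
treatment = rng.normal(10.25, 4.0, 50_000)  # simulated +2.5% lift

diff = treatment.mean() - control.mean()
se = np.sqrt(control.var(ddof=1) / len(control)
             + treatment.var(ddof=1) / len(treatment))
ci = (diff - 1.96 * se, diff + 1.96 * se)   # 95% confidence interval
t_stat, p_value = stats.ttest_ind(treatment, control)

print(f"lift: {diff / control.mean():+.1%}, "
      f"95% CI: [{ci[0]:.3f}, {ci[1]:.3f}] $/session, p = {p_value:.2g}")
```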
If we look at our data, we will see that our result is significantly positive, as the confidence interval is entirely on the positive side. So you can make the assumption that the metric will increase.
If the full confidence interval were on the negative side, you could make the assumption that the metric will decrease.
If the confidence interval crosses zero, you don’t have enough data to know whether your metric will increase or decrease.
And finally, there is an inconclusive result known as “flat”, where the confidence interval crosses zero but its lower bound stays above a (negative) threshold called “practical significance”.
Let’s say you put that threshold at -0.5%. This means you are ok losing up to 0.5% of revenue when launching your feature. Put another way: “do no harm” experiments do not exist--you just need to define how much harm you are ok with.
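That read-out fits in a few lines of logic; the -0.5% bound below is the illustrative practical-significance threshold from above:

```python
def read_result(ci_low: float, ci_high: float,
                practical_bound: float = -0.005) -> str:
    """Classify a 95% CI on relative change (0.019 means +1.9%)."""
    if ci_low > 0:
        return "positive: the metric will likely increase"
    if ci_high < 0:
        return "negative: the metric will likely decrease"
    if ci_low > practical_bound:
        return "flat: crosses zero, but any loss is within tolerance"
    return "inconclusive: not enough data to call it either way"

print(read_result(0.019, 0.031))    # the Prime-toggle result: positive
print(read_result(-0.002, 0.004))   # flat
print(read_result(-0.020, 0.010))   # inconclusive
```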
Also, “Leaning positive” or “Leaning negative” outcomes do not exist. If you hear someone using them, make sure they take some statistics courses.
And finally, it’s time to draw conclusions
First, validate the data.
Do the numbers seem off, or are they too good to be true?
Then, craft a story.
Use the experiment data, but also all your previous data. Does it make sense? Does it prove or reject your hypothesis?
And write it down. I recommend you:
Write arguments in favor of or against launching the feature, based on your pre-defined launch criteria and the other metrics you were tracking.
Record any observations and learnings.
Write down next steps: Will you do another iteration? Will you expand to other markets?
And after you capture everything--share it. Share successful and failed experiments.
Not only because sharing is caring--but because these insights are very helpful. Even to people who were not involved at all in the experiment.
And as one of my heroes, Isaac Asimov, would say: “The most exciting phrase to hear in science, the one that heralds new discoveries, is not ‘Eureka!’ but ‘That’s funny…’”