In this talk, we will walk through the steps of how to build an algorithm to predict property prices from a dataset of property listings, focusing predominantly on finding the right features to include in building the model.
3. What you’ll learn today
1. How to build a predictive model
2. Where in the building process bias can be introduced
3. What the real-world ramifications are
4. What all these “buzzwords” mean
● Data science produces insights
● Machine learning produces predictions
● Artificial intelligence produces actions
16. ◂ The remaining variables can proxy for race
◂ If race is a useful predictor, then you have a hole in the data
◂ Indirect discrimination
Removing ‘race’ from the dataset doesn’t remove the problem
17. Now we know the risks of training data... What do we do now?
21. Examples of data cleaning (sketched in code below)
1. Remove duplicates
2. Remove empty columns
3. Remove irrelevant variables
4. Fill empty rows with averages, or mark them as 0
5. Remove rows that are blank for the features most important to you
6. Standardize units
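A minimal pandas sketch of these cleaning steps, assuming a hypothetical `listings` DataFrame; every file and column name here is an assumption for illustration, not the talk’s actual dataset:

```python
import pandas as pd

# Hypothetical listings data; column names are assumptions for illustration.
listings = pd.read_csv("listings.csv")

# 1. Remove duplicates
listings = listings.drop_duplicates()

# 2. Remove empty columns
listings = listings.dropna(axis=1, how="all")

# 3. Remove irrelevant variables (hypothetical column names)
listings = listings.drop(columns=["listing_id", "agent_notes"], errors="ignore")

# 4. Fill empty rows with averages, or mark them as 0
listings["sqft"] = listings["sqft"].fillna(listings["sqft"].mean())
listings["garage_spaces"] = listings["garage_spaces"].fillna(0)

# 5. Remove rows that are blank for the most important features
listings = listings.dropna(subset=["price", "zip_code"])

# 6. Standardize units, e.g. convert lot size from acres to square feet
listings["lot_sqft"] = listings["lot_acres"] * 43_560
```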
27. Adding additional variables by zip code (join sketched below)
◂ Yelp count of stars
◂ Yelp average of stars
◂ Average household income
◂ Per capita income
◂ High income households (% > $200k/yr)
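As a hedged sketch, these zip-code-level features might be joined onto the listings like this, continuing from the cleaning sketch above (file and column names are hypothetical):

```python
import pandas as pd

# Hypothetical zip-code-level tables; names and columns are assumptions.
yelp_by_zip = pd.read_csv("yelp_by_zip.csv")      # zip_code, star_count, star_avg
income_by_zip = pd.read_csv("income_by_zip.csv")  # zip_code, avg_household_income,
                                                  #   per_capita_income, pct_over_200k

# Left-join so every listing keeps its row even when a zip code has no Yelp data.
listings = listings.merge(yelp_by_zip, on="zip_code", how="left")
listings = listings.merge(income_by_zip, on="zip_code", how="left")
```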
28. Yelp data seems pretty democratic; that can’t cultivate bias, right?
47. Experiment with hyperparameter tuning (sketched below)
◂ Increase or decrease number of trees
◂ 10-fold cross validation
◂ Look at depth
◂ Random seed
◂ Where to split the data
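A minimal scikit-learn sketch of this tuning loop, assuming a feature matrix `X` and target `y` built from the cleaned listings; the talk mentions XGBoost, but a random forest is used here as a stand-in:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Where to split the data, with a fixed random seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

param_grid = {
    "n_estimators": [100, 300, 500],  # increase or decrease number of trees
    "max_depth": [None, 10, 20],      # look at depth
}
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=10,                            # 10-fold cross-validation
    scoring="neg_mean_absolute_error",
)
search.fit(X_train, y_train)
print(search.best_params_)
model = search.best_estimator_
```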
49. “Algorithms will do more justice to the people who are easiest to understand at the expense of those who aren’t.”
- Michael Veale, PhD in Responsible ML at UCL
Hello!
My name is Eva. I am SO humbled and honored to be here today to talk to you about this topic that is so important to me. Two years ago I completed an MSc in Business Analytics and Management Science, where I learned all about algorithms, including their benefits and risks. Now I’m a PMM at Sentry in SF.
Please tweet me with your questions or comments.
A great example to understand the difference is in autonomous cars stopping at a stop sign.
Data Science - understanding false negatives; insights like whether time of day matters for the car to stop
Machine Learning - gathering a dataset to predict which images contain stop signs
AI - takes the action to apply the brakes at the stop sign
Get data; clean, prepare, and manipulate (feature extraction); train; test; deploy and improve.
Used as a framework; a skeleton is sketched below.
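A hedged end-to-end skeleton of that framework; all names are illustrative, not the actual model from the talk:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Get data
listings = pd.read_csv("listings.csv")

# Clean, prepare, and manipulate (feature extraction)
listings = listings.drop_duplicates().dropna(subset=["price"])
X = pd.get_dummies(listings.drop(columns=["price"]))  # encode categoricals
y = listings["price"]

# Train
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
model = RandomForestRegressor(random_state=42).fit(X_train, y_train)

# Test: the held-out portion is only used for evaluation
print("MAE:", mean_absolute_error(y_test, model.predict(X_test)))

# Deploy and improve: monitor errors in production and retrain as data shifts
```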
What are examples of this?
The bias is still there, even though the variable is removed.
How do we move forward with building our model?
Get data; clean, prepare, and manipulate (feature extraction); train; test; deploy and improve.
Many datasets are collected by a particular entity to answer specific types of questions and accomplish a particular goal.
To minimize bias at the data-cleaning stage, get context on how the raw data was collected and how certain variables should be interpreted.
The research on Yelp data posted on Eater shows that Mexican cuisine in the US had the highest number of people talking about “authenticity,” followed by Chinese, Thai, Japanese, and Indian.
How would using this data of Yelp stars by zip code to predict housing prices affect our model?
This is another dataset with human-informed, opt-in decisions. When using both Amazon Express and Yelp, you already have omitted-variable bias, since the service is voluntary and opt-in.
Racial discrimination that comes through location-based opt-in data
So back to our model
Get data; clean, prepare, and manipulate (feature extraction); train; test; deploy and improve.
The “test” data is used for evaluation.
We can’t and shouldn’t blindly look at prediction power; we also need to understand the variables that are driving it.
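One hedged way to inspect which variables are driving the predictions, reusing the fitted model and feature matrix from the sketches above:

```python
import pandas as pd

# Rank features by importance. A proxy for race or income showing up near
# the top is a red flag to investigate, not a win for predictive power.
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))
```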
This will build on itself in a way that the people who created it wouldn’t necessarily have wanted. It will compound and continue a vicious cycle.
Rich people’s houses would have a higher value BECAUSE they are rich.
It’s a self-fulfilling prophecy - it’s reinforcing the vicious cycles of inequality in society, and disadvantages that already exist.
We looked at this a bit in the last section, but we’re going to try to improve our model even more
Get data; clean, prepare, and manipulate (feature extraction); train; test; deploy and improve.
Also notice that this gives rise to new variables for XGBoost.
Algorithms don’t do well with outlier data, as we learned when building our model.
The bell curve is shifted slightly to the left: the model is slightly undervaluing the property prices. This would be favorable to buyers but not favorable to sellers.
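A small sketch of how that left-shifted error distribution could be checked, reusing the fitted model and test split from the sketches above:

```python
import numpy as np

# Residuals: predicted minus actual. A negative mean or median means the
# model systematically undervalues properties (favoring buyers over sellers).
residuals = model.predict(X_test) - y_test
print("mean residual:", np.mean(residuals))
print("median residual:", np.median(residuals))
```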
Be mindful of whom this hurts.
Get data; clean, prepare, and manipulate (feature extraction); train; test; deploy and improve.
This phase is when we bring the bias to life. What’s the problem with using predictive algorithms?
In 2013, Google mispredicted the peak of the flu by 140%.
If multiple companies used that same tool, women would have a hard time getting hired anywhere.
Personal story
We can’t blindly trust the algorithm.
We can check how the decisions we make affect the models we build, which in turn can affect real people’s lives in the world.
Obviously, awareness. We’ve touched on this; that’s the whole point of my talking to you about this.
This is for now.
Right now, it’s a fact: one small, homogeneous group of people makes decisions that affect everybody. The people who create this technology generally look like each other, come from similar upbringings, and look and talk the same way.
When we have more people of color training image-recognition models, we’re less likely to have self-driving cars that can’t recognize people of color. When we have more women writing software and training data models, we’re less likely to have hiring algorithms that discriminate against women.
When you bring your diverse perspective to the conversation, you change the conversation.