Measuring the New
Wikipedia Community
PyData 2013
Ryan Faulkner (rfaulkner@wikimedia.org)
Wikimedia Foundation
Overview
Introduction
Problem & Motivation
Proposed Solution
User Metrics
A Short Example
Extending the Solution
Using the Tool
Live Demo!!
Introduction
Me: Data Analyst at Wikimedia
Machine Learning @ McGill
Fundraising - A/B testing
Editor Experiments - increasing the number of
active editors
Editor Engagement Experiments (E3) team @ the
Wikimedia Foundation
Micro-feature experimentation
Problem
What's wrong with Wikipedia?
Problem - Editor Decline
http://strategy.wikimedia.org/wiki/Editor_Trends_Study
Problem - Approach
Can we stimulate the community of users to become more
numerous and productive?
○ Focus on new users
■ Encourage contribution, make it easier
○ Lower the threshold for account creation
■ Bring more people in.
○ Rapid experimentation on features that retain more
users and stimulate increased participation.
■ This will help us determine what works at lower
cost
Problem - Evaluation
○ Data Consistency
■ Anomaly Detection
■ Auto-correlation (seasonality)
○ "A/B" testing
■ Hypothesis testing - Student's t, chi-square
■ Linear / Logistic regression
○ Multivariate testing
■ Analysis of variance
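As an illustration of the hypothesis tests listed above, here is a minimal sketch of a chi-square test on a 2x2 table (test vs. control group, retained vs. not retained) using only the standard library. The function name and the counts are hypothetical; they are not results from the talk.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].

    Returns (statistic, p_value) with 1 degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs at least one observation")
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom the chi-square survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Hypothetical counts: (retained, not retained) for the control and test groups.
stat, p = chi_square_2x2(120, 880, 150, 850)
```

A small p-value would suggest that the retention difference between the two groups is unlikely to be chance alone; the significance threshold is a choice left to the analyst.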
Problem - What we need
Currently a lot of the work around analysis is done
manually and is a large drain on resources:
○ Faster Data gathering
○ Knowing what we're logging and measuring &
faster ETL
○ Faster Analysis
○ Broadening Service and iterating on results
Problem - What we need
Build better infrastructure around how we interpret and
analyze our data.
○ Determine what to measure.
■ Rigorously define relevant metrics
○ Expose the metrics from our data store
■ Python is great for writing code quickly to handle
tasks with data
■ Library support for data analysis (pandas,
numpy)
Solution
The tools to build.
Solution - Proposed
We need to measure User Behaviour
"User Metrics" & "UMAPI"
User Metrics & UMAPI
Python implementation for gathering data from MediaWiki data stores,
producing well-defined metrics, and facilitating subsequent modelling and
analysis. It also provides an interface for making different types
of requests and returning standard responses.
Solution - Why Bother
What exactly do we gain by building these
classes? Why not just query the database?
1. Reproducibility & Standardization
2. Extensibility
3. Concise definition
4. Faster turnaround
a. Multiprocessing to optimize metrics generation
(e.g. Revert rate on 100K users
via MySQL = 24hrs,
via User Metrics < 10mins)
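The speed-up above comes from spreading the per-user work across processes. A minimal sketch of that idea with the standard `multiprocessing` module follows; the function names, the chunk size, and the placeholder metric are assumptions for illustration and are not taken from the User Metrics code base.

```python
from multiprocessing import Pool

def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def compute_metric_for_chunk(user_ids):
    """Placeholder for the per-user metric work (a database query in practice)."""
    return [(user_id, user_id % 2) for user_id in user_ids]

def compute_metric_parallel(user_ids, workers=4, chunk_size=1000):
    """Fan chunks of users out to worker processes and flatten the results."""
    with Pool(processes=workers) as pool:
        per_chunk = pool.map(compute_metric_for_chunk, chunked(user_ids, chunk_size))
    return [row for rows in per_chunk for row in rows]
```

When run as a script, `compute_metric_parallel` should be called from under an `if __name__ == "__main__":` guard so that worker processes can import the module safely.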
Solution - Why Python?
Why not C++, Java, or PHP?
1. Speed of development
2. Simplify the code base & easy extensibility
a. more "Scientist Friendly"
3. Good support for data processing
4. Better integration for downstream data analysis
5. The way that metrics work lends them to "Pythonic"
artifacts: list comprehensions, decorator patterns,
duck typing, RESTful API.
User Metrics
How do we form a picture about what happens
on Wikipedia?
User Metrics - User activity
Events (not exhaustive):
■ Registration
■ Making an edit
■ Contributions by namespace
■ Reverting edits
■ Blocking
User Metrics - What do we want to
know about users?
○ How much do they contribute?
○ How often do they contribute?
○ Potential vandals. Do they go on to be reverted,
blocked, banned?
User Metrics - Metrics Definitions
https://meta.wikimedia.org/wiki/Research:Metrics
Retention Metrics
Survival(t): Boolean measure of an editor surviving beyond t
Threshold(t,n): Boolean measure of an editor reaching activity threshold n by time t
Live Account(t): Boolean measure of whether the new user clicked the edit button
Volume Metrics
Edit Rate: Float result of user's rate of contribution
Content: Integer bytes added by revision and edit count
Sessions: Average session length (future)
Time to Threshold: Time to reach a threshold (e.g. first edit)
User Metrics - Metrics Definitions
Content Quality
Revert Rate: Float representing the proportion of revisions reverted
Block: Boolean indicating a block event on the user
Content Persistence: Integer indicating how long this user's edits survive (future)
Contribution Type
Namespace of Edits: Integer edit counts in all namespaces
Scale of Change: Float representation of fraction of total page content modified (future)
User Metrics - Bytes Added
User revision history (over a predefined period)
Revision k: byte increase
Output: (user ID, bytes_added, bytes_removed, edit count)
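The bytes-added computation can be sketched as below. The `Revision` record and its field names are hypothetical simplifications (page size after the edit and page size just before it); they are assumptions for illustration rather than the schema used by the tool.

```python
from collections import namedtuple

# Hypothetical, simplified revision record: the size of the page after this
# edit and the size of the page just before it (0 for a page creation).
Revision = namedtuple("Revision", ["user_id", "length", "parent_length"])

def bytes_added(user_id, revisions):
    """Return (user_id, bytes_added, bytes_removed, edit_count) for one user."""
    added = removed = count = 0
    for rev in revisions:
        if rev.user_id != user_id:
            continue
        delta = rev.length - rev.parent_length
        if delta >= 0:
            added += delta
        else:
            removed += -delta
        count += 1
    return (user_id, added, removed, count)

# Usage with made-up revisions.
history = [Revision(1, 150, 100), Revision(1, 120, 150), Revision(2, 10, 0)]
result = bytes_added(1, history)
```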
User Metrics - Threshold
User revision history (over a predefined period)
Events since registration up to time "t"
if len(event_list) >= n:
    threshold_reached = True
else:
    threshold_reached = False
Output: (user ID, threshold_reached={0,1})
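A runnable version of the threshold check might look like the following sketch. The function signature, the default values, and the use of edit timestamps as the event list are assumptions made for illustration.

```python
from datetime import datetime, timedelta

def threshold(user_id, registration, edit_times, t_hours=24, n=1):
    """Return (user_id, 1) if the user made at least n edits within t_hours of registering."""
    cutoff = registration + timedelta(hours=t_hours)
    # Events since registration up to time "t".
    event_list = [ts for ts in edit_times if registration <= ts <= cutoff]
    return (user_id, 1 if len(event_list) >= n else 0)

# Usage with made-up timestamps: one edit an hour after registration.
registered = datetime(2013, 1, 1)
reached = threshold(42, registered, [registered + timedelta(hours=1)])
```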
User Metrics - Revert Rate
User revision history (over a predefined period)
For each revision, look at the page history:
Past revisions: checksum i
Future revisions: checksum k
if checksum i == checksum k:
    # reverted!
Output: (user ID, revert_rate, total_revisions)
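The checksum comparison above can be sketched as follows: a revision counts as reverted when a later revision of the page has the same checksum as an earlier one. The data layout and the look-behind and look-ahead window sizes of 15 are assumptions for illustration, not values stated in the slides.

```python
def is_reverted(page_history, index, look_behind=15, look_ahead=15):
    """True if a later revision restores the page to a state seen before `index`.

    `page_history` is a chronological list of content checksums for one page.
    """
    past = set(page_history[max(0, index - look_behind):index])
    future = page_history[index + 1:index + 1 + look_ahead]
    return any(checksum in past for checksum in future)

def revert_rate(user_id, user_revisions, histories):
    """Return (user_id, revert_rate, total_revisions).

    `user_revisions` is a list of (page_id, index) pairs pointing into
    `histories`, a dict mapping page_id to that page's checksum list.
    """
    total = len(user_revisions)
    reverted = sum(is_reverted(histories[page], idx) for page, idx in user_revisions)
    return (user_id, reverted / total if total else 0.0, total)

# Usage: on page 1 the revision at index 1 ("b") is undone by the next one ("a").
histories = {1: ["a", "b", "a", "c"]}
rate = revert_rate(7, [(1, 1), (1, 3)], histories)
```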
User Metrics - Implementation
https://github.com/wikimedia/user_metrics
1. MySQL & Redis (future) data store
a. All of the backend dependency is abstracted out of
metrics classes
2. Python implementation - MySQLdb (SQLAlchemy)
3. Strategy pattern built around a parent user metrics class
4. Metrics built mainly from four core MediaWiki tables:
a. revision, user, page, logging
5. Python Decorator methods for handling metric
aggregation
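The strategy pattern and the decorator-based aggregation mentioned above could be combined roughly as in this sketch. All class and function names here are hypothetical and the metric reads from an in-memory dict; the real classes live in the user_metrics repository and may look quite different.

```python
from functools import wraps

class UserMetric:
    """Parent class: a shared interface, with the metric logic left to subclasses."""
    def process(self, user_ids):
        raise NotImplementedError

class EditCount(UserMetric):
    """Toy metric backed by an in-memory dict instead of a database."""
    def __init__(self, edits_by_user):
        self.edits_by_user = edits_by_user

    def process(self, user_ids):
        return [(uid, self.edits_by_user.get(uid, 0)) for uid in user_ids]

def aggregator(func):
    """Decorator: turn a function over a list of values into one over metric rows."""
    @wraps(func)
    def wrapper(rows):
        return func([value for _, value in rows])
    return wrapper

@aggregator
def mean(values):
    return sum(values) / len(values) if values else 0.0

def run(metric, user_ids, aggregate=None):
    """Callers depend only on the UserMetric interface, so metrics are swappable."""
    rows = metric.process(user_ids)
    return aggregate(rows) if aggregate else rows
```

Because `run` only calls `process`, adding a new metric means adding a new subclass, with no change to the calling code.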
User Metrics
A Concrete Example
How can we use this
framework?
Example - Post Edit Feedback
What effect does editing feedback (confirmation/gratitude)
have on new editors?
Example - Results
An Extended Solution
Turn the data machine into a service.
Editor Metrics go beyond feature
experimentation ...
It became clear that...
● We needed a service to let clients generate their own
user metrics data sets
● We wanted to add a way for this methodology to
extend beyond E3 and potentially WMF
● A force multiplier was necessary to iterate on editor
data in more interesting ways (Machine Learning &
more sophisticated analyses)
User Metrics API [UMAPI]
Open Source (almost) RESTful API (Flask)
Computes metrics per user (User Metrics)
Combines metrics in different ways depending on
request types
HTTP response in JSON with resulting data
Store data internally for reuse
UMAPI
http://metrics.wikimedia.org/
https://github.com/wikimedia/user_metrics
https://github.com/rfaulkner/E3_analysis
https://pypi.python.org/pypi/wmf_user_metrics/0.1.3-dev
UMAPI - Overview
Service GET requests based on a combination of URL
paths + query params
e.g. /cohort/metric?date_start=..&date_end=...&...
Define user "cohorts" on which to operate
The API engine maps each request to a metrics request object (Mediator
Pattern), which is handed off to a request manager that
builds and runs the request
JSON response
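To make the path-plus-query-parameter scheme concrete, here is a small sketch that builds request URLs with the standard library. The helper name is hypothetical, and the `/cohorts/<cohort>/<metric>` layout is inferred from the demo URLs later in the deck.

```python
from urllib.parse import urlencode

BASE_URL = "http://metrics.wikimedia.org"

def build_request_url(cohort, metric, **params):
    """Build a GET URL of the form /cohorts/<cohort>/<metric>?<query params>."""
    path = "/cohorts/{}/{}".format(cohort, metric)
    # Sorting keeps the URL stable regardless of keyword argument order.
    query = urlencode(sorted(params.items()))
    return BASE_URL + path + ("?" + query if query else "")

url = build_request_url("e3_pef1_confirmation", "threshold", aggregator="average")
```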
UMAPI - Overview
Basic cPickle file cache for responses
Can substitute caching system (e.g. memcached)
Reusing request data where it overlaps
Request Types:
"Raw" - metrics per user
Aggregation over cohorts: mean, sum, median, etc.
Time series requests
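A pickle-backed file cache for responses can be very small. The sketch below uses Python 3's `pickle` (which supersedes `cPickle`); the class name and the one-file-per-key layout are assumptions, and keys are assumed to be filename-safe strings.

```python
import os
import pickle
import tempfile

class FileCache:
    """Store each response as a pickle file named after its cache key."""
    def __init__(self, directory):
        self.directory = directory
        os.makedirs(directory, exist_ok=True)

    def _path(self, key):
        return os.path.join(self.directory, key + ".pkl")

    def get(self, key):
        """Return the cached response, or None on a cache miss."""
        try:
            with open(self._path(key), "rb") as handle:
                return pickle.load(handle)
        except FileNotFoundError:
            return None

    def set(self, key, response):
        with open(self._path(key), "wb") as handle:
            pickle.dump(response, handle)

# Usage with a throwaway directory and a made-up response.
cache = FileCache(tempfile.mkdtemp())
cache.set("example_key", {"metric": "threshold", "data": [1, 0, 1]})
```

Keeping the `get`/`set` interface this narrow is what makes it straightforward to substitute another backend such as memcached.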
UMAPI Architecture
[Architecture diagram. Components shown: HTTP GET request and JSON response; Apache with mod_wsgi; Flask app server; User Metrics API; request notifications listener; request control; response control; cache; messaging queues; asynchronous callbacks; metrics objects running in separate processes; MediaWiki slaves]
UMAPI Architecture - Listeners
Request Notifications Callback
Handles job management and notifications on job status
Request Controller
Queues requests
Spawns jobs from metrics objects
Coordinates parameters
Response Controller
Reconstruct response data
Write to cache
UMAPI - User Cohorts
We will want to consider large groups of users, for instance,
a test or control group in some experiment:
○ Aggregate groups of users
■ lists of user IDs
○ Cohort registration (under construction)
■ adding new cohorts to the model
○ Single user endpoint
○ Boolean expressions over cohorts supported
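Since cohorts are lists of user IDs, boolean expressions over them map naturally onto set operations. The sketch below evaluates expressions that use `&` for AND and `~` for OR, the operators shown on the response slides; the cohort names, the helper name, and the rule that `&` binds tighter than `~` are assumptions.

```python
cohorts = {
    "cohort_a": {1, 2, 3, 4},
    "cohort_b": {3, 4, 5},
}

def combine(expression, cohorts):
    """Evaluate e.g. 'a&b' (intersection) or 'a~b' (union) over named cohorts."""
    result = set()
    for term in expression.split("~"):
        names = [name.strip() for name in term.split("&")]
        members = set(cohorts[names[0]])
        for name in names[1:]:
            members &= cohorts[name]
        result |= members
    return result

both = combine("cohort_a&cohort_b", cohorts)
either = combine("cohort_a~cohort_b", cohorts)
```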
User Metric Periods
How do we define the periods over which metrics are
measured?
○ Registration
■ Look at the "t" hours following user registration
○ User Defined
■ User-supplied start and end dates
○ Conditional Registration
■ Registration as above with condition that registration falls within input
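The three period types can be sketched as small functions that return a (start, end) window. The function names are hypothetical, and the conditional variant assumes that the "input" is the user-supplied start and end dates.

```python
from datetime import datetime, timedelta

def registration_period(registration, t_hours):
    """Window covering the t hours that follow registration."""
    return (registration, registration + timedelta(hours=t_hours))

def user_defined_period(start, end):
    """Window given directly by user-supplied dates."""
    if end < start:
        raise ValueError("end must not precede start")
    return (start, end)

def conditional_registration_period(registration, t_hours, start, end):
    """Registration window, but only for users who registered within [start, end].

    Assumption: the condition is checked against the user-supplied dates.
    """
    if start <= registration <= end:
        return registration_period(registration, t_hours)
    return None

# Usage with made-up dates.
registered = datetime(2013, 1, 10)
window = registration_period(registered, 24)
```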
UMAPI - RequestMeta Module
Mediator Pattern to handle passing request data among
different portions of the architecture
Abstraction allows for easy filtering and default behaviour
of request parameters
Requests can easily be turned into reproducible and unique
hashes for caching
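One way to get filtering, defaults, and a reproducible hash from a single request object is sketched below. The field names, the defaults, and the choice of SHA-256 over a JSON serialization are assumptions for illustration; the actual RequestMeta module may differ.

```python
import hashlib
import json
from collections import namedtuple

RequestMeta = namedtuple(
    "RequestMeta", ["cohort", "metric", "start", "end", "aggregator"]
)
DEFAULTS = {"start": None, "end": None, "aggregator": None}

def build_request_meta(cohort, metric, **params):
    """Drop unknown parameters and fill in defaults for missing ones."""
    known = {key: params.get(key, default) for key, default in DEFAULTS.items()}
    return RequestMeta(cohort=cohort, metric=metric, **known)

def request_hash(meta):
    """Deterministic digest of the request, usable as a cache key."""
    payload = json.dumps(meta._asdict(), sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

meta = build_request_meta("ryan_test_2", "bytes_added", start="20120101", end="20130101")
key = request_hash(meta)
```

Two requests that differ only in parameter order or in unrecognized parameters produce the same key, so they can share a cached response.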
How the Service Works
The user experience with user metrics.
UMAPI - Pipeline
[Pipeline diagram. Labels shown: cohort or combo; raw, time series, and aggregator stages, each with params in and JSON out]
UMAPI - Frontend Flow
Job Queue
As you fire off requests the queue tracks what's running:
Response - Bytes Added
Response - Threshold
Response - Edit Rate
Response - Threshold w/ params
Response - Aggregation
Response - Aggregation
Response - Time series
Response - Combining Cohorts
"usertags_meta" - cohort definitions
Response - Combining Cohorts
Two intersecting cohorts:
Response - Combining Cohorts
AND (&)
Response - Combining Cohorts
OR (~)
Response - Single user endpoint
e.g. http://metrics-api.wikimedia.org/user/Renklauf/threshold?t=10000
Looking ahead ...
Connectivity metrics (additional metrics)
○ Graph database? (Neo4j, Gremlin w/ PostgreSQL)
○ User talk and common article edits
Better in-memory modelling
○ python-memcached
○ better reuse of generated data based on request data
Beyond English Wikipedia
Implemented!
Looking ahead ...
More sophisticated and robust data modelling
○ Modelling richer data: contribution histories, articles
edited, aggregate metrics
○ Classification: Logistic classifiers, Support Vector
Machine, Deep Belief Networks, Dimensionality
Reduction
○ Modelling revision text - Neural Networks, Hidden
Markov Models
DEMO!!
http://metrics.wikimedia.org/cohorts/e3_pef1_confirmation/threshold
http://metrics.wikimedia.org/cohorts/e3_pef1_confirmation/threshold?aggregator=average
http://metrics.wikimedia.org/cohorts/e3_pef1_confirmation/edit_rate
http://metrics.wikimedia.org/cohorts/e3_pef1_confirmation/edit_rate?aggregator=dist
http://metrics.wikimedia.org/cohorts/ryan_test_2/bytes_added?time_series&start=20120101&end=20130101&aggregator=sum&group=input&interval=720
The End
http://metrics.wikimedia.org/
stat1.wikimedia.org:4000
https://github.com/wikimedia/user_metrics
https://github.com/rfaulkner/E3_analysis
https://pypi.python.org/pypi/wmf_user_metrics/0.1.3-dev
Questions?

Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 

Último (20)

CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 

Measuring the New Wikipedia Community (PyData SV 2013)

  • 1. Measuring the New Wikipedia Community PyData 2013 Ryan Faulkner (rfaulkner@wikimedia.org) Wikimedia Foundation
  • 2. Overview Introduction Problem & Motivation Proposed Solution User Metrics A Short Example Extending the Solution Using the Tool Live Demo!!
  • 3. Introduction Me: Data Analyst at Wikimedia Machine Learning @ McGill Fundraising - A/B testing Editor Experiments - increasing the number of Active editors Editor Engagement Experiments (E3) team @ the Wikimedia Foundation Micro-feature experimentation
  • 5. Problem - Editor Decline http://strategy.wikimedia.org/wiki/Editor_Trends_Study
  • 6. Problem - Approach Can we stimulate the community of users to become more numerous and productive? ○ Focus on new users ■ Encourage contribution, make it easier ○ Lower the threshold for account creation ■ Bring more people in. ○ Rapid experimentation on features that retain more users and stimulate increased participation. ■ This will help us determine what works at lower cost
  • 7. Problem - Evaluation ○ Data Consistency ■ Anomaly Detection ■ Auto-correlation (seasonality) ○ "A/B" testing ■ Hypothesis testing - Student's t, chi-square ■ Linear / Logistic regression ○ Multivariate testing ■ Analysis of variance
  • 8. Problem - What we need Currently a lot of the work around analysis is done manually and is a large drain on resources: ○ Faster Data gathering ○ Knowing what we're logging and measuring & faster ETL ○ Faster Analysis ○ Broadening Service and iterating on results
  • 9. Problem - What we need Build better infrastructure around how we interpret and analyze our data. ○ Determine what to measure. ■ Rigorously define relevant metrics ○ Expose the metrics from our data store ■ Python is great for writing code quickly to handle tasks with data ■ Library support for data analysis (pandas, numpy)
  • 11. Solution - Proposed We need to measure User Behaviour: "User Metrics" & "UMAPI". User Metrics & UMAPI: a Python implementation for gathering data from MediaWiki data stores, producing well-defined metrics, and facilitating subsequent modelling and analysis. This includes an interface for making different types of requests and returning standard responses.
  • 12. Solution - Why Bother What exactly do we gain by building these classes? Why not just query the database? 1. Reproducibility & Standardization 2. Extensibility 3. Concise definition 4. Faster turnaround a. Multiprocessing to optimize metrics generation (e.g. Revert rate on 100K users via MySQL = 24 hrs, via User Metrics < 10 mins)
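The multiprocessing point on this slide can be sketched as follows. This is only an illustration of the general idea (split the user list into chunks and compute each chunk in a separate process), not the User Metrics code itself; `REVISIONS`, `chunked`, `revert_rate_for_chunk` and `revert_rate_parallel` are hypothetical names, and the in-memory list stands in for a database query.

```python
from multiprocessing import Pool

# Hypothetical stand-in for revision data: (user_id, was_reverted) pairs.
REVISIONS = [(1, False), (1, True), (2, False), (3, True), (3, True), (4, False)]


def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def revert_rate_for_chunk(user_ids):
    """Compute (user_id, revert_rate, total_revisions) for one chunk of users."""
    results = []
    for uid in user_ids:
        flags = [reverted for u, reverted in REVISIONS if u == uid]
        total = len(flags)
        rate = sum(flags) / total if total else 0.0
        results.append((uid, rate, total))
    return results


def revert_rate_parallel(user_ids, workers=2, chunk_size=2):
    """Fan the chunks out to worker processes and flatten the results."""
    with Pool(processes=workers) as pool:
        per_chunk = pool.map(revert_rate_for_chunk, chunked(user_ids, chunk_size))
    return [row for chunk in per_chunk for row in chunk]
```

When run as a script, call `revert_rate_parallel([1, 2, 3, 4])` under an `if __name__ == "__main__":` guard so that worker processes can import the module safely.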
  • 13. Solution - Why Python? Why not C++, Java, or PHP? 1. Speed of development 2. Simplify the code base & easy extensibility a. more "Scientist Friendly" 3. Good support for data processing 4. Better integration for downstream data analysis 5. The way that metrics work lends them to "Pythonic" artifacts: list comprehensions, decorator patterns, duck typing, RESTful API.
  • 14. User Metrics How do we form a picture about what happens on Wikipedia?
  • 15. User Metrics - User activity Events (not exhaustive): ■ Registration ■ Making an edit ■ Contributions of Namespaces ■ Reverting edits ■ Blocking
  • 16. User Metrics - What do we want to know about users? ○ How much do they contribute? ○ How often do they contribute? ○ Potential vandals. Do they go on to be reverted, blocked, banned?
  • 17. User Metrics - Metrics Definitions https://meta.wikimedia.org/wiki/Research:Metrics Retention Metrics ○ Survival(t): Boolean measure of an editor surviving beyond t ○ Threshold(t,n): Boolean measure of an editor reaching activity threshold n by time t ○ Live Account(t): Boolean measure of whether the new user clicked the edit button Volume Metrics ○ Edit Rate: Float result of user's rate of contribution ○ Content: Integer bytes added by revision and edit count ○ Sessions: Average session length (future) ○ Time to Threshold: Time to reach a threshold (e.g. first edit)
  • 18. User Metrics - Metrics Definitions Content Quality ○ Revert Rate: Float representing the proportion of revisions reverted ○ Block: Boolean indicating a block event on the user ○ Content Persistence: Integer indicating how long this user's edits survive (future) Contribution Type ○ Namespace of Edits: Integer edit counts in all namespaces ○ Scale of Change: Float representation of fraction of total page content modified (future)
  • 19. User Metrics - Bytes Added user revision history (over a predefined period) Revision k: byte increase (user ID, bytes_added, bytes_removed, edit count)
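A minimal sketch of the bytes-added computation, assuming each revision is available as a pair of page lengths (the revision's length and its parent's length, in bytes). The function name and input shape are hypothetical; only the output tuple follows the slide.

```python
def bytes_added(user_id, revisions):
    """Summarise byte changes over a user's revisions.

    `revisions` is a list of (rev_len, parent_len) pairs, an assumed input
    shape. Returns (user ID, bytes_added, bytes_removed, edit count).
    """
    added = 0
    removed = 0
    for rev_len, parent_len in revisions:
        delta = rev_len - parent_len
        if delta >= 0:
            added += delta
        else:
            removed += -delta
    return (user_id, added, removed, len(revisions))


# Three edits: +20 bytes, -30 bytes, and a new 90-byte page.
summary = bytes_added(42, [(120, 100), (90, 120), (90, 0)])
```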
  • 20. User Metrics - Threshold ○ Input: user revision history (over a predefined period) ○ Events since registration up to time "t" ○ if len(event_list) >= n: threshold_reached = True; else: threshold_reached = False ○ Output: (user ID, threshold_reached={0,1})
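The threshold logic on this slide could look roughly like this in Python. It is a sketch under assumptions: events are plain `datetime` values, "t" is given in hours, and the function and argument names are hypothetical.

```python
from datetime import datetime, timedelta


def threshold(user_id, registration, event_times, t_hours, n):
    """Return (user ID, threshold_reached) with threshold_reached in {0, 1}.

    Counts the events between registration and registration + t_hours and
    checks whether there are at least n of them.
    """
    cutoff = registration + timedelta(hours=t_hours)
    event_list = [e for e in event_times if registration <= e <= cutoff]
    threshold_reached = 1 if len(event_list) >= n else 0
    return (user_id, threshold_reached)


registration = datetime(2013, 1, 1)
# Edits made 1, 5 and 30 hours after registration.
events = [registration + timedelta(hours=h) for h in (1, 5, 30)]
```

With these inputs, `threshold(7, registration, events, 24, 2)` counts two events inside the first 24 hours, so the threshold n=2 is reached, while n=3 is not.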
  • 21. User Metrics - Revert Rate ○ Input: user revision history (over a predefined period) ○ For each revision, look at the page history: past revisions (checksum i) and future revisions (checksum k) ○ if checksum i == checksum k: # reverted! ○ Output: (user ID, revert_rate, total_revisions)
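One way to read the checksum comparison: a revision counts as reverted when some later revision of the page restores the exact content (same checksum) of some earlier revision. The sketch below implements that reading; the look-back and look-ahead window sizes, the function names, and the in-memory page histories are all assumptions made for illustration.

```python
def is_reverted(page_checksums, index, look_back=15, look_ahead=15):
    """True if a later revision matches the checksum of an earlier one.

    `page_checksums` is the page history as an ordered list of content
    checksums; `index` is the position of the revision being tested.
    """
    past = set(page_checksums[max(0, index - look_back):index])
    future = page_checksums[index + 1:index + 1 + look_ahead]
    return any(checksum in past for checksum in future)


def revert_rate(user_id, user_revisions, page_histories):
    """Return (user ID, revert_rate, total_revisions).

    `user_revisions` is a list of (page_id, index_in_page_history) pairs.
    """
    total = len(user_revisions)
    reverted = sum(
        is_reverted(page_histories[page_id], index)
        for page_id, index in user_revisions
    )
    rate = reverted / total if total else 0.0
    return (user_id, rate, total)


# Page 10: content "a" is restored after the user's edit "b", so it is a revert.
histories = {10: ["a", "b", "a", "c"], 11: ["x", "y", "z"]}
result = revert_rate(5, [(10, 1), (11, 1)], histories)
```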
  • 22. User Metrics - Implementation https://github.com/wikimedia/user_metrics 1. MySQL & Redis (future) data store a. All of the backend dependency is abstracted out of metrics classes 2. Python implementation - MySQLdb (SQLAlchemy) 3. Strategy Pattern of Parent user metrics class 4. Metrics built mainly from four core MediaWiki tables: a. revision, user, page, logging 5. Python decorator methods for handling metric aggregation
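Points 3 and 5 can be illustrated with a small sketch: a parent class that fixes the processing loop while subclasses supply the per-user computation, and a decorator that adapts a plain function over values into an aggregator over metric rows. The class and function names here are hypothetical and do not mirror the actual user_metrics package.

```python
from functools import wraps
from statistics import mean


class UserMetric:
    """Parent class: subclasses supply the per-user strategy in `compute`."""

    def __init__(self, data):
        # Hypothetical backend: {user_id: list of that user's edits}.
        self.data = data

    def compute(self, user_id):
        raise NotImplementedError

    def process(self, user_ids):
        """Shared loop: one (user_id, value) row per user."""
        return [(uid, self.compute(uid)) for uid in user_ids]


class EditCount(UserMetric):
    def compute(self, user_id):
        return len(self.data.get(user_id, []))


def aggregator(func):
    """Decorator: let a function over plain values accept metric rows."""
    @wraps(func)
    def wrapper(rows):
        return func([value for _, value in rows])
    return wrapper


@aggregator
def mean_value(values):
    return mean(values)


metric = EditCount({1: ["e1", "e2", "e3"], 2: ["e4"], 3: ["e5", "e6"]})
rows = metric.process([1, 2, 3])
```

Adding a new metric then only requires a new subclass with its own `compute`, and any decorated aggregator can be applied to its rows, e.g. `mean_value(rows)`.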
  • 24. A Concrete Example How can we use this framework?
  • 25. Example - Post Edit Feedback What effect does editing feedback (confirmation/gratitude) have on new editors?
  • 27. An Extended Solution Turn the data machine into a service.
  • 28. Editor Metrics go beyond feature experimentation ... It became clear that... ● We needed a service to let clients generate their own user metrics data sets ● We wanted to add a way for this methodology to extend beyond E3 and potentially WMF ● A force multiplier was necessary to iterate on editor data in more interesting ways (Machine Learning & more sophisticated analyses)
  • 29. User Metrics API [UMAPI] Open Source (almost) RESTful API (Flask) Computes metrics per user (User Metrics) Combines metrics in different ways depending on request types HTTP response in JSON with resulting data Store data internally for reuse
  • 31. UMAPI - Overview Service GET requests based on a combination of URL paths + query params e.g. /cohort/metric?date_start=..&date_end=...&... Define user "cohorts" on which to operate API engine maps to metrics request object (Mediator Pattern) which is handed off to a request manager which builds and runs request JSON response
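As a sketch of how a client might assemble such a GET request URL: the host is the one shown in the single-user example later in the talk, while the cohort name, the metric name, the date format and the helper `build_request_url` are assumptions for illustration. No request is actually sent.

```python
from urllib.parse import quote, urlencode

# Host as it appears in the single-user endpoint example in the slides.
BASE_URL = "http://metrics-api.wikimedia.org"


def build_request_url(cohort, metric, **params):
    """Build '<base>/<cohort>/<metric>?<query>' with sorted query params."""
    path = "/{}/{}".format(quote(cohort), quote(metric))
    query = urlencode(sorted(params.items()))
    return BASE_URL + path + ("?" + query if query else "")


# Hypothetical cohort name and date format.
url = build_request_url(
    "test_cohort", "threshold", date_start="20130101", date_end="20130201"
)
```

Sorting the query parameters gives the same URL for the same request regardless of argument order, which is convenient when responses are cached by request.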
  • 32. UMAPI - Overview Basic cPickle file cache for responses Can substitute caching system (e.g. memcached) Reusing request data where it overlaps Request Types: "Raw" - metrics per user Aggregation over cohorts: mean, sum, median, etc. Time series requests
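A rough sketch of the two ideas on this slide: aggregating per-user ("raw") rows over a cohort, and a simple pickle-backed file cache for responses. It uses Python 3's `pickle` module (the successor of `cPickle`); `AGGREGATORS`, `aggregate` and `PickleCache` are hypothetical names, not the UMAPI implementation.

```python
import os
import pickle
import tempfile
from statistics import mean, median

# Aggregation functions selectable by name in a request.
AGGREGATORS = {"mean": mean, "sum": sum, "median": median}


def aggregate(rows, how):
    """Reduce raw (user_id, value) rows to a single value for the cohort."""
    return AGGREGATORS[how]([value for _, value in rows])


class PickleCache:
    """Minimal file cache: one pickle file per key in a directory."""

    def __init__(self, directory):
        self.directory = directory

    def _path(self, key):
        return os.path.join(self.directory, key + ".pkl")

    def get(self, key):
        try:
            with open(self._path(key), "rb") as handle:
                return pickle.load(handle)
        except FileNotFoundError:
            return None

    def set(self, key, value):
        with open(self._path(key), "wb") as handle:
            pickle.dump(value, handle)


raw_rows = [(1, 2.0), (2, 4.0), (3, 9.0)]
cache = PickleCache(tempfile.mkdtemp())
cache.set("raw_response", raw_rows)
```

Because the cache only exposes `get` and `set`, it could be swapped for another backend such as memcached without touching the callers.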
  • 33. UMAPI Architecture (diagram components): HTTP GET request; JSON response; Apache; mod_wsgi; Flask / App Server; Request Notifications Listener; Request Control; Response Control; Cache; MediaWiki Slaves; User Metrics API; Messaging Queues; Metrics objects - Separate Processes; Asynchronous Callbacks
  • 34. UMAPI Architecture - Listeners ○ Request Notifications Callback ■ Handles management of, and notifications on, job status ○ Request Controller ■ Queues requests ■ Spawns jobs from metrics objects ■ Coordinates parameters ○ Response Controller ■ Reconstructs response data ■ Writes to cache
  • 35. UMAPI - User Cohorts We will want to consider large groups of users, for instance, a test or control group in some experiment: ○ Aggregate groups of users: lists of user IDs ○ Cohort registration (under construction): adding new cohorts to the model ○ Single user endpoint ○ Boolean expressions over cohorts supported
  • 36. User Metric Periods How do we define the periods over which metrics are measured? ○ Registration: Look "t" hours since user registration ○ User Defined: User supplied start and end dates ○ Conditional Registration: Registration as above with condition that registration falls within input
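The three period types could be expressed as small helpers like the ones below. This is a sketch under assumptions: "t" is in hours, the "input" for conditional registration is taken to be a start/end date pair, and all function names are hypothetical.

```python
from datetime import datetime, timedelta


def registration_period(registration, t_hours):
    """Registration: from registration to registration + t hours."""
    return (registration, registration + timedelta(hours=t_hours))


def user_defined_period(start, end):
    """User Defined: the caller supplies both ends of the period."""
    return (start, end)


def conditional_registration_period(registration, t_hours, start, end):
    """Conditional Registration: as registration_period, but only when the
    registration date falls inside [start, end] (assumed interpretation);
    otherwise the user is excluded and None is returned."""
    if start <= registration <= end:
        return registration_period(registration, t_hours)
    return None


registered = datetime(2013, 3, 1, 12)
period = registration_period(registered, 24)
```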
  • 37. UMAPI - RequestMeta Module Mediator Pattern to handle passing request data among different portions of the architecture Abstraction allows for easy filtering and default behaviour of request parameters Requests can easily be turned into reproducible and unique hashes for caching
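To illustrate the last two points (filtering with defaults, and reproducible hashes for caching), here is a minimal sketch. The parameter names in `DEFAULTS`, the helper names and the choice of SHA-1 over a sorted JSON encoding are assumptions, not the RequestMeta implementation.

```python
import hashlib
import json

# Hypothetical known parameters and their default values.
DEFAULTS = {"t": 24, "aggregator": None, "time_series": False}


def normalise(params):
    """Drop unknown keys and fill in defaults for missing ones."""
    return {key: params.get(key, default) for key, default in DEFAULTS.items()}


def request_hash(cohort, metric, params):
    """Return a stable hex digest identifying a request.

    Serialising with sorted keys makes the hash independent of the order
    in which parameters were supplied.
    """
    canonical = json.dumps(
        {"cohort": cohort, "metric": metric, "params": normalise(params)},
        sort_keys=True,
    )
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


key_explicit = request_hash("test", "threshold", {"t": 24, "unknown": 1})
key_default = request_hash("test", "threshold", {})
key_other = request_hash("test", "threshold", {"t": 48})
```

Two requests that differ only in unknown or defaulted parameters map to the same key, so a cached response can be reused for both.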
  • 38. How the Service Works The user experience with user metrics.
  • 39. UMAPI - Pipeline (diagram labels): Cohort or combo; Raw; Time Series; Aggregator; Params; JSON
  • 41. Job Queue As you fire off requests the queue tracks what's running:
  • 45. Response - Threshold w/ params
  • 48. Response - Time series
  • 49. Response - Combining Cohorts "usertags_meta" - cohort definitions
  • 50. Response - Combining Cohorts Two intersecting cohorts:
  • 51. Response - Combining Cohorts AND (&)
  • 52. Response - Combining Cohorts OR (~)
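Treating each cohort as a set of user IDs, the two combinations shown in these slides map directly onto set intersection and union. The cohort names, the user IDs and the `combine` helper are made up for illustration; only the operator symbols `&` (AND) and `~` (OR) come from the slides.

```python
# Hypothetical cohort definitions: cohort name -> set of user IDs.
COHORTS = {
    "cohort_a": {1, 2, 3, 4},
    "cohort_b": {3, 4, 5},
}


def combine(left, operator, right):
    """Combine two named cohorts with '&' (AND) or '~' (OR)."""
    if operator == "&":
        return COHORTS[left] & COHORTS[right]
    if operator == "~":
        return COHORTS[left] | COHORTS[right]
    raise ValueError("unsupported operator: {!r}".format(operator))


both = combine("cohort_a", "&", "cohort_b")
either = combine("cohort_a", "~", "cohort_b")
```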
  • 53. Response - Single user endpoint e.g. http://metrics-api.wikimedia.org/user/Renklauf/threshold?t=10000
  • 54. Looking ahead ... Connectivity metrics (additional metrics) ○ Graph database? (Neo4j, Gremlin w/ PostgreSQL) ○ User talk and common article edits Better in-memory modelling ○ python-memcached ○ better reuse of generated data based on request data Beyond English Wikipedia: implemented!
  • 55. Looking ahead ... More sophisticated and robust data modelling ○ Modelling richer data: contribution histories, articles edited, aggregate metrics ○ Classification: Logistic classifiers, Support Vector Machine, Deep Belief Networks, Dimensionality Reduction ○ Modelling revision text - Neural Networks, Hidden Markov Models