2. About me
B.S. degree in Management Science; Ph.D. in Statistics;
Data scientist at Twitter, ads ranking;
HP Labs : pricing & portfolio management, marketing;
USDA : yield forecasting with satellite & survey data;
Instructor at Colorado State University;
Innovations at the intersection of statistics, computer science,
and business
Applications in online advertising and e-commerce.
7. What is data science?
Is the traffic on 101-N heavier on Wednesdays? Why?
Why is swipe to dismiss decreasing ad engagements?
Analytical : think like a data scientist
8. Finding patterns in data
Tease out signals from noise
Educating engineers about variation (e.g. conversion)
Delineate the effects of various factors
Hypothesize root causes and estimate the contribution of each
possibility (e.g., swipe to dismiss in the image viewer)
Prediction, forecasting, optimization
Building data products
Analytical : think like a data scientist
70% data munging + EDA, 20% modeling, 10% viz &
presentation, reporting
9. Data munging
Data
Transactional, web clicks and logs, sensor data (satellite,
wearable devices, ...), ...
Docs, emails, social feeds, ...
What questions to ask about a data source?
Munging process:
Extracting from raw form, ...
Filtering, selecting, transforming, ...
Restructuring, aggregating, sinking, ...
Techniques
SQL or similar, ETL tools in data warehouse, Hadoop
MapReduce, dimension reduction, sampling, R (*apply, plyr), ...
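The munging steps above (extract, filter, aggregate, transform) can be sketched in a few lines of Python. This is only an illustration: the click log, its column names, and the `munge` function are made up for the example and do not come from any real pipeline.

```python
import csv
import io
from collections import defaultdict

# Hypothetical raw click log; column names and values are illustrative assumptions.
RAW_LOG = """timestamp,user_id,event,ad_id
2014-10-01T09:00:00,u1,impression,a1
2014-10-01T09:00:05,u1,click,a1
2014-10-01T09:01:00,u2,impression,a1
2014-10-01T09:02:00,u2,impression,a2
2014-10-01T09:02:30,,click,a2
2014-10-01T09:03:00,u3,click,a2
"""


def munge(raw_text):
    """Return click-through rate per ad from a raw CSV click log."""
    counts = defaultdict(lambda: {"impression": 0, "click": 0})
    # Extract: parse the raw text into one dict per row.
    for row in csv.DictReader(io.StringIO(raw_text)):
        # Filter: drop records with a missing user id.
        if not row["user_id"]:
            continue
        # Aggregate: count events per ad.
        counts[row["ad_id"]][row["event"]] += 1
    # Transform: turn raw counts into a rate (None when there are no impressions).
    return {
        ad: (c["click"] / c["impression"] if c["impression"] else None)
        for ad, c in counts.items()
    }


ctr_by_ad = munge(RAW_LOG)
```

At real scale the same steps would more likely run in SQL or MapReduce; the point here is only the shape of the process.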
10. Techniques
Distribution & summary statistics: centrality, variation,
outliers
Scatterplot, side-by-side boxplot, histogram
PCA, multidimensional scaling, projection pursuit, ...
Toolset
Hadoop & equivalents: read terabytes of data and
aggregate
R, Python, Ruby, Excel, …
Exploratory Data Analysis
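As a small illustration of the summary statistics listed above (centrality, variation, outliers), the sketch below uses only Python's standard library. The sample values and the `summarize_sample` name are hypothetical, and the 1.5 x IQR fence is one common rule of thumb for flagging outliers, not the only one.

```python
import statistics

# Hypothetical sample (e.g. commute times in minutes); values are made up.
COMMUTE_MINUTES = [12, 14, 15, 15, 16, 17, 18, 19, 20, 95]


def summarize_sample(values):
    """Return centrality, variation, and outliers flagged by the 1.5 * IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
        "outliers": [v for v in values if v < low or v > high],
    }


summary = summarize_sample(COMMUTE_MINUTES)
```

Comparing the mean with the median in `summary` shows how a single extreme value pulls the mean while leaving the median nearly unchanged.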
11. 42 heads out of 100 coin flips: does that indicate the
coin is unfair?
Is the traffic on 101-N heavier on Wednesdays?
Techniques
A/B testing
Time series analysis
Toolset : statistical packages like R
Teasing out signal from noise
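One way to approach the coin-flip question is an exact two-sided binomial test. The sketch below is a minimal version using only the standard library; the function name is an assumption, and the two-sided p-value is defined here as the total probability of all outcomes no more likely than the observed one.

```python
from math import comb


def binom_two_sided_p(heads, flips, p=0.5):
    """Exact two-sided binomial p-value under the null hypothesis P(heads) = p."""
    pmf = [comb(flips, k) * p**k * (1 - p) ** (flips - k) for k in range(flips + 1)]
    observed = pmf[heads]
    # Sum the probability of every outcome at most as likely as the observed one
    # (small tolerance guards against floating-point ties).
    return sum(x for x in pmf if x <= observed * (1 + 1e-9))


p_value = binom_two_sided_p(42, 100)
```

For 42 heads in 100 flips the p-value comes out at roughly 0.13, so this result alone is not strong evidence against a fair coin at the usual 5% level.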
14. ● Visualization of analytics data demand in US
https://carterlin.shinyapps.io/brilent/
● Topsy : social search and analytics, drawing insights from the
entirety of Twitter data
● Placepicker : helps couples decide where to live
o Commute times, rent or house prices, safety, school quality,
walkability
● Tools for interactive visualization : R shiny package, Tableau,
D3.js, Ruby/Python
Building Data Products
15. Healthcare : drug development; patient monitoring; electronic medical records
Utilities : smart grid optimization (generation, transmission, distribution, demand)
Retail & marketing : customer loyalty and churn analysis; targeted product and service offerings; product sentiment analysis; marketing campaign optimization
Financial services : fraud detection & prevention; anti-money laundering
Telecom : customer churn mitigation; geospatial analytics; call data record (CDR) analysis
Analytics Use Cases by Industry
16. Crawl Twitter data in R (or Python)
user info
user tweets
user network
Search results
Text analytics and unsupervised learning, interactive
visualization
Organize Twitter users into groups based on similarity of their tweets
Display search results on chosen topic (e.g. iPhone 6) with sentiment
analysis
Phase 1: data crawling and parsing, word cloud and frequency;
Phase 2: several similarity metrics; extract sentiment from
tweets;
Mine Twitter on a Topic
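For the similarity metrics in Phase 2, cosine similarity over bag-of-words counts is one simple starting point. The sketch below assumes the tweets have already been crawled and are available as plain strings; the function names and the tokenization rule are illustrative choices, not part of any Twitter API.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count word tokens (a deliberately simple tokenizer)."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical tweets, made up for the example.
user_a = bag_of_words("Loving the new phone camera, great photos")
user_b = bag_of_words("The new phone camera takes great photos")
user_c = bag_of_words("Traffic on the highway was terrible today")

sim_ab = cosine_similarity(user_a, user_b)
sim_ac = cosine_similarity(user_a, user_c)
```

Users could then be grouped by feeding the pairwise similarities into a clustering method; which method fits best is left open here.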
17. Anonymized bike trip data :
Trip start/end time
Trip start/end station
Rider type and member gender & birth year
Visualization and prediction :
Where are riders going? When are they going there? How far do they
ride?
Top stations? Interesting usage pattern?
Similar : Hubway bike trip history (metro-Boston)
Phase 1: exploratory data analysis, design doc of visualization;
Phase 2: EDA; iterate on design doc, simple examples using
interactive viz tools;
Chicago Divvy Bike Usage
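A first pass at the "where, when, how long" questions might look like the sketch below. The trip records, station names, and field order are invented for illustration and do not reflect the actual Divvy file layout.

```python
from collections import Counter
from datetime import datetime
from statistics import fmean

# Hypothetical trips: (start time, end time, start station, end station).
TRIPS = [
    ("2014-06-01 08:05", "2014-06-01 08:25", "Clark St", "Lake St"),
    ("2014-06-01 08:40", "2014-06-01 08:50", "Clark St", "State St"),
    ("2014-06-01 17:15", "2014-06-01 17:45", "Lake St", "Clark St"),
    ("2014-06-02 08:10", "2014-06-02 08:30", "State St", "Lake St"),
]


def summarize_trips(trips, fmt="%Y-%m-%d %H:%M"):
    """Return the busiest start station, trip counts by start hour, and mean duration."""
    start_counts = Counter()
    by_hour = Counter()
    minutes = []
    for start, end, station_from, _station_to in trips:
        t0 = datetime.strptime(start, fmt)
        t1 = datetime.strptime(end, fmt)
        start_counts[station_from] += 1
        by_hour[t0.hour] += 1
        minutes.append((t1 - t0).total_seconds() / 60)
    return {
        "top_start": start_counts.most_common(1)[0][0],
        "trips_by_hour": dict(by_hour),
        "mean_minutes": fmean(minutes),
    }


trip_summary = summarize_trips(TRIPS)
```

The hourly counts are the kind of table that an interactive visualization in Phase 2 could be built on.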
18. Public dataset of startup ecosystem
Company: name, homepage, category, total funding;
Rounds: funding amount at each round (seed, A, B, ...);
Investments : investor info & raised amount at each round;
Acquisitions : acquisition and acquirer information.
Problems:
Interactive visualization : rounds of funding? What is the total funding
distribution for each category? The distribution for a location?
Predict : total funding amount with missing values, or whether a company is
acquired within k years (k = 2, 5, ...), more?
Phase 1: exploratory data analysis, design doc of visualization,
and scope of prediction
Crunchbase Startup Data
19. Predict monthly sales of consumer products following
initial advertising campaign
Monthly online sales for the first 12 months after the product
launches.
Product and campaign features.
EDA, statistical modeling, visualization
Phase 1: exploratory data analysis, 12-month sales curve or
time series
Phase 2: extract features from 12-month sales curve,
predict with off-the-shelf methods;
Predicting Consumer Product Sales Based on Features
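For the Phase 2 step of extracting features from a 12-month sales curve, the sketch below computes a few plausible candidates: total sales, peak month, a least-squares trend slope, and the share of sales in the first three months. The choice of features and the function name are assumptions made for illustration, not a prescribed feature set.

```python
def sales_curve_features(monthly_sales):
    """Summarize a 12-month sales curve as a small dict of features."""
    if len(monthly_sales) != 12:
        raise ValueError("expected exactly 12 monthly values")
    months = range(1, 13)
    total = sum(monthly_sales)
    mean_x = sum(months) / 12
    mean_y = total / 12
    # Ordinary least-squares slope of sales against month number.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(months, monthly_sales))
    sxx = sum((x - mean_x) ** 2 for x in months)
    return {
        "total": total,
        "peak_month": monthly_sales.index(max(monthly_sales)) + 1,  # 1-based
        "slope": sxy / sxx,
        "first_quarter_share": sum(monthly_sales[:3]) / total if total else None,
    }


# Hypothetical product whose sales grow by 100 units every month.
features = sales_curve_features([100 * m for m in range(1, 13)])
```

Features like these, joined with the product and campaign features, could then be passed to an off-the-shelf regression method.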
20. How does your smartphone know what you are doing now?
Activity label : walking, walking upstairs/downstairs, sitting, standing, lying
Galaxy S II : acceleration and angular velocity
Subject identifier, time & frequency domain variables
Supervised machine learning, feature engineering
Phase 1: exploratory data analysis, off-the-shelf ML
Phase 2: more off-the-shelf ML and performance comparison;
ensemble methods?
Human activity recognition using smartphone data
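To make "feature engineering plus off-the-shelf ML" concrete, the sketch below derives two time-domain features from a window of accelerometer samples and classifies with a nearest-centroid rule. The sample values are invented, and nearest centroid is a stand-in chosen for brevity; the project itself does not prescribe a particular classifier.

```python
import math
import statistics


def window_features(samples):
    """Mean and standard deviation of acceleration magnitude over one window."""
    mags = [math.sqrt(x * x + y * y + z * z) for x, y, z in samples]
    return (statistics.fmean(mags), statistics.pstdev(mags))


def nearest_centroid(train, query):
    """Return the label whose mean feature vector is closest to the query."""
    centroids = {
        label: tuple(statistics.fmean(col) for col in zip(*feats))
        for label, feats in train.items()
    }
    return min(centroids, key=lambda label: math.dist(centroids[label], query))


# Hypothetical labeled windows of (x, y, z) acceleration in m/s^2.
TRAIN = {
    "sitting": [
        window_features([(0, 0, 9.8)] * 4),
        window_features([(0, 0, 9.7), (0, 0, 9.9)] * 2),
    ],
    "walking": [
        window_features([(0, 0, 8.0), (0, 0, 12.0)] * 2),
        window_features([(0, 0, 7.0), (0, 0, 13.0)] * 2),
    ],
}

predicted = nearest_centroid(TRAIN, window_features([(0, 0, 8.5), (0, 0, 11.5)] * 2))
```

The intuition is that a still phone shows almost no variation in acceleration magnitude while a walking subject shows a lot, which is why even two simple features separate these toy windows.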
21. Resources
tryr.codeschool.com
Coursera classes
Intro to statistics
R/Python programming
Machine learning
Intro to data science
Web intelligence and big data (DS)
Books
The Statistical Sleuth
Big data governance (quality, privacy, application in various verticals)
Data Just Right (DS)
The Start-up of You
The 7 Habits of Highly Effective People
Glassdoor, CareerCup, ...