Identification and UGC

IS Economics Research Seminar

By Beibei Li
May-11-2012

What is Identification?

• Understanding the causal relationship behind empirical results.

e.g., suppose variables Y_t and X_t are correlated. There can be three
reasons for this, which are not mutually exclusive:
• Cause: X_t → Y_t
• Reverse cause: Y_t → X_t
• Correlated variable: Z_t → both X_t and Y_t
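
The third case can be illustrated with a small simulation. This is only a sketch under assumed numbers: Z_t drives both X_t and Y_t, there is no causal link between X_t and Y_t, and the sample size, noise levels, and helper name `pearson` are all hypothetical choices made for illustration.

```python
import random

def pearson(a, b):
    """Sample Pearson correlation of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

rng = random.Random(0)
n = 20000
z = [rng.gauss(0, 1) for _ in range(n)]   # common driver Z_t (assumed)
x = [zi + rng.gauss(0, 1) for zi in z]    # X_t depends only on Z_t
y = [zi + rng.gauss(0, 1) for zi in z]    # Y_t depends only on Z_t, not on X_t

# X and Y are correlated even though neither causes the other.
corr_xy = pearson(x, y)

# Once Z's contribution is removed, only independent noise is left.
corr_resid = pearson([xi - zi for xi, zi in zip(x, z)],
                     [yi - zi for yi, zi in zip(y, z)])

print(round(corr_xy, 2), round(corr_resid, 2))
```

In practice Z_t is usually unobserved, which is why the strategies discussed later are needed instead of simply subtracting it out.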

Identification is essential for empirical research!

Agenda

• Major Research Questions
• Why Is Identification Important for UGC Research
• Overview of Econometric Identification Strategies
• Examples (Archak et al. 2011, Ghose, Ipeirotis and Li 2012, Luca 2011)
• Discussions

Major Research Questions
• Economic Effect
  - Product sales, pricing power, new product adoptions

• User Behavior, Motivation, Social Dynamics
  - Dynamics of online reviews (e.g., how they evolve over time)
  - How do previous opinions affect subsequent behavior?
  - How are ratings influenced by public opinion (e.g., existing ratings, professional ratings)?

• Firm Perspective, Marketing Strategies, Managerial Implications
  - Social media vs. traditional marketing campaigns
  - How should firms respond to the existence of social media (e.g., stimulate additional WOM, adapt pricing/ads to UGC)?
  - Positive and negative publicity

Why Identification? – Causality

• Economic Effect
  - Unobserved product heterogeneity (e.g., product quality)
  - Publicity, advertising…

• User Behavior, Motivation, Social Dynamics
  - Online reviews may not convey true opinions (e.g., social influence: cascade/herding, differentiating).
  - Online reviews may not reveal true quality (e.g., early self-selection bias, review dynamics).

• Firm Perspective, Marketing Strategies, Managerial Implications
  - Social media vs. traditional marketing campaigns

Overview of Identification Strategies

• Fixed Effect: Control for unobserved characteristics that are time-invariant
(e.g., product fixed effect, location fixed effect). e.g., Ghose et al. 2007.

• Diff-in-Diff: Difference out time-invariant unobservables and time-varying
unobservables that are common to the treated and control groups.
e.g., Chevalier and Mayzlin 2006.

• Regression Discontinuity: Examine the treatment effect by observing a
“discontinuous jump” while controlling for the continuous score and other covariates.
e.g., Luca 2011.

• Natural Experiment: Treatment is assigned by external events, not manipulated by the
researchers (e.g., government interventions, policy changes). e.g., Chan and Ghose 2012.

• Instrumental Variables: Variables that are correlated with the endogenous
explanatory variables but not with the error. e.g., Ghose, Ipeirotis & Li 2012.

• Propensity Score Matching: Match a treated sample with an untreated sample
based on their predicted propensities to be treated: units that “would have been treated but were not.”
e.g., Aral, Muchnik and Sundararajan 2009, Rhue and Sundararajan 2010.
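
As one concrete illustration of the list above, here is a minimal Diff-in-Diff sketch on simulated data. All numbers (`TRUE_EFFECT`, `GROUP_GAP`, `TIME_SHOCK`) and the helper names are assumptions chosen for illustration; they do not come from any of the cited papers.

```python
import random

rng = random.Random(1)
TRUE_EFFECT = 2.0   # assumed treatment effect
GROUP_GAP = 5.0     # time-invariant difference between the two groups
TIME_SHOCK = 3.0    # shock that hits both groups in the post period

def outcome(treated, post):
    """One simulated outcome for a unit in the given group and period."""
    y = 10.0 + GROUP_GAP * treated + TIME_SHOCK * post
    y += TRUE_EFFECT * treated * post
    return y + rng.gauss(0, 1)

def cell_mean(treated, post, n=5000):
    """Average outcome in one (group, period) cell."""
    return sum(outcome(treated, post) for _ in range(n)) / n

# Naive post-period comparison mixes the effect with the group gap.
naive = cell_mean(1, 1) - cell_mean(0, 1)

# Diff-in-Diff: change in treated group minus change in control group.
did = (cell_mean(1, 1) - cell_mean(1, 0)) - (cell_mean(0, 1) - cell_mean(0, 0))

print(round(naive, 2), round(did, 2))
```

The naive comparison lands near `GROUP_GAP + TRUE_EFFECT`, while the double difference lands near `TRUE_EFFECT`. The sketch relies on the shock being common to both groups; a shock that hits only the treated group would not be differenced out.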

Archak, Ghose & Ipeirotis (Mgt Sci 2011)

Motivation:
What is the economic impact of UGC on product sales?
Using only numeric ratings has limitations:
• Quality is not one-dimensional;
• Reviewers and readers may have different tastes;
• Ratings may not convey consumers’ true opinions
(e.g., social influence);
• Ratings may not capture true quality information
(e.g., early self-selection bias, Li & Hitt 2008;
bimodal distribution, Hu et al. 2008);
• Ratings are discrete: a “4” review may read like a “3” or a “5”.


Research Questions:

• What is the economic impact of UGC on product
sales beyond the effect of numeric review ratings?

• How can product reviews help us learn consumer
preferences for different product attributes, and how
consumers make trade-offs between those attributes?


Main Idea:
• Identify which product attributes (e.g., nouns/noun phrases)
are most frequently discussed in product reviews.
  Method: fully automated (POS tagger) vs. crowdsourcing

• Extract opinions (e.g., adjectives that refer to those nouns)
about these product attributes.
  Method: fully automated (syntactic dependency parser) vs. crowdsourcing

• Estimate the economic impact of the extracted opinions.
  Method: dynamic panel data model + System GMM


Data:
• Sales rank, price and consumer reviews from Amazon.com
• Two product categories (digital cameras and camcorders)
• 15 months (2005/3-2006/5)

Model:


Identification:

• Price endogeneity: lagged price as an instrument (Villas-Boas and Winer 1999)
• UGC endogeneity: Google Trends product search volume as a
control (Luan & Neslin 2009)
• Autocorrelation: lagged dependent variable as a control

First paper to bridge the qualitative nature of UGC
and the quantitative nature of consumer choice.

Ghose, Ipeirotis & Li (Mkt Sci 2012)

Motivation:

• Content beyond text? Images, geo-maps, social-geo tags…
• Social media → product search engines: search engines fail to efficiently
leverage information created across multiple social media channels;
• Ranking mechanisms cannot capture multidimensional preferences.


Research Questions:

• What is consumers’ willingness-to-pay for
different product attributes?

• Is there a better method for product search
engines for ranking products?
Consumers’ decision: “best value”
Search engines’ decision: “most relevant”


Main Idea:
1. Identify the important product characteristics that
influence demand.
2. Use a choice model to precisely estimate how these
product characteristics influence demand.
3. Impute the expected utility gain (surplus) from each
product and propose a ranking framework based on surplus.

Product “value-for-money”: characteristics vs. price


Transaction data: Travelocity.com, 1,497 US hotels, 2008/11-2009/1
Location Characteristics:
• Social geo-tags: Geonames.org, “Public transportation”
• GeoMapping Search Tools: Microsoft Virtual Earth SDK, “Restaurants”
• Image Classification: “Beach”, “Downtown”
• On-Demand Survey: Amazon Mechanical Turk (AMT), “Highway”
Service Characteristics:
• JavaScript parsing engines: TripAdvisor & Travelocity,
“# of Internal amenities”, “Reviewer Rating”, “# of online reviews”
Additional Review Characteristics:
• Text Mining: Review-based content from TripAdvisor & Travelocity,
Text features (e.g., “Breakfast”, “Staff”), “Subjectivity”,
“Readability”, “Disclosure of Reviewer Identity”


A Structural Model for Demand Estimation:

u_{ijkt} = X_{jkt} \beta_i - \alpha_i P_{jkt} + \xi_{jkt} + \varepsilon_{ikt}

• u_{ijkt}: hotel utility
• \beta_i, \alpha_i: consumer-specific random coefficients
• \varepsilon_{ikt}: error term, Type I EV

Random Coefficient Logit Model (Song 2011, PCM 2007, BLP 1995)

How to capture consumer heterogeneity?
• Each individual consumer has different \beta_i, \alpha_i
• Each individual consumer has a different error \varepsilon_i
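
To make the two sources of heterogeneity concrete, the following sketch simulates market shares under a random coefficient logit. It is an illustration only: the three hotels, the distributions of the coefficients, the negative sign on price, and the outside option with utility 0 are all assumptions, not the specification estimated in the paper.

```python
import math
import random

# Hypothetical hotels: (quality index X_j, price P_j in $100s)
hotels = [(3.0, 1.0), (4.0, 1.8), (5.0, 3.0)]

def logit_probs(beta, alpha):
    """Choice probabilities for one consumer.

    The Type I EV error gives the closed-form logit expression;
    the outside option (no booking) has utility normalized to 0.
    """
    v = [beta * x - alpha * p for x, p in hotels]
    denom = 1.0 + sum(math.exp(u) for u in v)
    return [math.exp(u) / denom for u in v]

def simulated_shares(n_draws=5000, seed=0):
    """Average individual probabilities over draws of the coefficients."""
    rng = random.Random(seed)
    totals = [0.0] * len(hotels)
    for _ in range(n_draws):
        beta_i = rng.gauss(1.0, 0.5)            # taste for quality (assumed)
        alpha_i = rng.lognormvariate(0.0, 0.5)  # price sensitivity, always > 0
        for j, prob in enumerate(logit_probs(beta_i, alpha_i)):
            totals[j] += prob
    return [t / n_draws for t in totals]

shares = simulated_shares()
outside = 1.0 - sum(shares)   # share of consumers choosing no hotel
print([round(s, 3) for s in shares], round(outside, 3))
```

The individual error is integrated out analytically inside `logit_probs`, while the random coefficients are integrated out by simulation, which is the usual division of labor in this class of models.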


Identification – Price Endogeneity:
IV for price: variables that are correlated with price but not with the error.

[Diagram: the IV (e.g., cost) moves price; unobserved factors in the error \varepsilon_i (advertising, publicity, …) are also correlated with price.]

Stage 1: Regress Price on X and the IV;
Stage 2: Predict ^Price from X and the IV only, and
substitute ^Price for Price.
→ ^Price is uncorrelated with the error, provided the IV is valid.
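
The two stages can be sketched on simulated data as follows. This is a simplified illustration, not the paper's estimation: it uses a single instrument (`cost`), omits the other covariates X, and every coefficient and variable name is an assumption.

```python
import random

def ols(x, y):
    """Intercept and slope of a one-regressor least-squares fit."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope

rng = random.Random(42)
n = 20000
TRUE_PRICE_COEF = -2.0   # assumed causal effect of price on demand

cost = [rng.gauss(0, 1) for _ in range(n)]    # instrument: shifts price only
shock = [rng.gauss(0, 1) for _ in range(n)]   # unobserved (e.g., advertising)
price = [c + 0.8 * s + rng.gauss(0, 0.5) for c, s in zip(cost, shock)]
demand = [5.0 + TRUE_PRICE_COEF * p + 1.5 * s + rng.gauss(0, 0.5)
          for p, s in zip(price, shock)]

# Plain OLS is biased because price is correlated with the unobserved shock.
_, ols_slope = ols(price, demand)

# Stage 1: regress price on the instrument and form predicted price.
a1, b1 = ols(cost, price)
price_hat = [a1 + b1 * c for c in cost]

# Stage 2: regress demand on predicted price.
_, iv_slope = ols(price_hat, demand)

print(round(ols_slope, 2), round(iv_slope, 2))
```

The second-stage slope lands close to `TRUE_PRICE_COEF`, while the OLS slope is pulled toward zero. Running the two stages by hand gives the right point estimate but not the right standard errors, so a dedicated IV routine is preferable for inference.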


Identification – Price Endogeneity:

• Average price of “same-star rating” hotels in other markets as an
instrument for price (Hausman et al. 1994).
• BLP-style instruments: average characteristics of same-star rating
hotels in other markets (BLP 1995).
• Lagged prices as instruments, in conjunction with Google Trends data to
control for correlated demand shocks (similar to Archak et al. 2011).
• Region dummies as proxies for cost (e.g., the cost of transportation,
labor, etc.) (Nevo 2001).


Identification – UGC Endogeneity:
[Diagram: UGC rating is correlated with the error \varepsilon_i through unobserved factors (advertising, publicity, unobserved quality, …), both time-variant and time-invariant.]

• Product-Fixed Effect
• Diff-in-Diff
• IV
• Regression Discontinuity (Luca 2011)

Summary:
1. Identify the important product characteristics that influence
demand → machine learning for social media variables.
2. Random coefficient logit model to estimate how these
product characteristics influence demand.
Identification: Price/UGC Endogeneity!
3. Derive the expected utility gain (surplus) from each product
and propose a ranking framework based on surplus.
4. Randomized experiments for ranking validation.

Luca (HBS Working Paper 2011)

Research Question:
How do online reviews affect product demand?
Challenge:
Causal relationship → UGC endogeneity
Identification:
Regression Discontinuity

Data:
• Reviews from Yelp.com, 3,582 Seattle restaurants;
• Revenue from the Washington State Department of
Revenue, 2003-2009.


Identification:
• Unobserved factors that are correlated with both Yelp rating
and demand (e.g., restaurant quality).
[Diagram: UGC rating is correlated with the error \varepsilon_i through unobserved factors (advertising, publicity, unobserved quality, …), both time-variant and time-invariant.]

Main Idea:
• Rounding mechanism: ratings are rounded to the nearest half-star.
• Seek discontinuous jumps in revenue that follow
discontinuous changes in rating.


RD Design:


Model:

• Restaurant and quarter fixed effects
• Continuous unrounded rating
• Coefficient of interest: impact of moving from just below a
discontinuity to just above a discontinuity, controlling for the
continuous change in unrounded rating.
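
The rounding design can be sketched with simulated restaurants around a single threshold. This is an illustration under assumptions, not the paper's specification: it fits a separate line on each side of the cutoff rather than one regression with fixed effects, and the cutoff, the size of the jump (`JUMP`), and the noise level are made up.

```python
import math
import random

def ols(x, y):
    """Intercept and slope of a one-regressor least-squares fit."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope

def displayed(avg):
    """Round an average rating to the nearest half star (ties round up)."""
    return math.floor(avg * 2 + 0.5) / 2

rng = random.Random(7)
CUTOFF = 3.25   # averages at or above this display as 3.5 stars
JUMP = 0.07     # assumed effect of the extra half star on log revenue

below_x, below_y, above_x, above_y = [], [], [], []
for _ in range(20000):
    avg = rng.uniform(3.0, 3.5)            # continuous unrounded rating
    treated = displayed(avg) >= 3.5        # shown 3.5 instead of 3.0
    log_rev = 10.0 + 0.10 * avg + JUMP * treated + rng.gauss(0, 0.1)
    xs, ys = (above_x, above_y) if treated else (below_x, below_y)
    xs.append(avg - CUTOFF)                # center the score at the cutoff
    ys.append(log_rev)

# With the score centered, each intercept is the fitted value at the cutoff.
a_below, _ = ols(below_x, below_y)
a_above, _ = ols(above_x, above_y)
rd_estimate = a_above - a_below

print(round(rd_estimate, 3))
```

The difference between the two fitted values at the cutoff recovers the assumed jump, because the smooth dependence of revenue on the unrounded rating is absorbed by the two fitted lines.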


Key Identification Assumption:
- Restaurants on either side of the threshold become increasingly
similar as they approach it.
- Assignment of restaurants to either side of the rounding
threshold is as good as random.

McCrary density test for “gaming”:
- Selection bias: the thresholds can also be seen by the
restaurants, so restaurants may submit reviews themselves
to pass the rounding threshold.
- If so, one would expect to see a disproportionately large
number of restaurants just above the rounding thresholds.
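
The intuition behind this check can be sketched with a simple count comparison. Note that this is a simplified stand-in and not the McCrary test itself, which estimates the density on each side with local linear smoothing. The window width, the simulated gaming behavior, and the function name `density_z` are assumptions for illustration.

```python
import random

def density_z(ratings, cutoff, window=0.05):
    """Z-score comparing counts just above vs. just below the cutoff.

    With no manipulation and a locally flat density, the two counts
    should be about equal, so the statistic is roughly standard normal.
    """
    below = sum(1 for r in ratings if cutoff - window <= r < cutoff)
    above = sum(1 for r in ratings if cutoff <= r < cutoff + window)
    return (above - below) / (above + below) ** 0.5

rng = random.Random(3)
honest = [rng.uniform(3.0, 3.5) for _ in range(20000)]

# Hypothetical gaming: half of the restaurants just below the threshold
# add enough favorable reviews to land just above it.
gamed = [r + 0.05 if 3.20 <= r < 3.25 and rng.random() < 0.5 else r
         for r in honest]

z_honest = density_z(honest, 3.25)
z_gamed = density_z(gamed, 3.25)
print(round(z_honest, 2), round(z_gamed, 2))
```

A large positive value, as in the gamed sample, is the bunching pattern described above; a value near zero is consistent with no manipulation.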


Conclusion:
A one-star increase in Yelp rating causes a 5-9% increase in
revenue!

Note:
When using an RD design, carefully consider:
• Cost of agents’ “gaming” behavior: RD is only valid when
agents face a sufficiently high cost of selection (e.g., geographic or age
thresholds).
• Knowledge of agents: RD is valid when agents do not know the
cutoff threshold, or their own score, or both (e.g., McCrary density
test, Luca 2011).

Discussions

Aspects of social media content that are examined:
- Online ratings (valence, volume, variance, helpfulness)
- Review text (length, sentiments, readability and linguistic styles)
- Reviewer information (identity disclosure)
- Social-tags
- Blogs (music blogs, enterprise blogs, microblogging)
- Discussion forums
- Mobile UGC

Discussions

Product categories that are examined:
- Books
- Electronics, digital cameras, etc.
- Software
- TV shows
- Movie box office
- Video games
- Mobile phones
- Hotels
- Restaurants
- Bath & home products
- Stocks

Discussions

Identification strategies that are most commonly used:
- Fixed-Effect
- Diff-in-Diff
- Regression Discontinuity
- Natural Experiment
- Instrumental Variable
- Propensity Score Matching

- Randomized Experiment

Discussions
Data-Driven Identification?
• Natural Experiment Setting

Research Question-Driven Identification?
• Regression Discontinuity Design
• Diff-in-Diff
• Instrumental Variable

There is a range of approaches, but they all
require some prior economic reasoning.
