Yelp Dataset
Challenge
ANWAR SHAIKH
ASHWIN NIMHAN
MANASHREE RAO
SHRIJIT PILLAI
TEJAS SHAH
Project Tasks
• Task 1
  • Assign Categories to Businesses in the Yelp Data Set
• Task 2
  • Recommend Food Items and/or Services in a Restaurant
  • Determine Influential Factors in a City affecting Restaurants
Task 1
...
Task 1 : Methodology
[Diagram labels: Business to Review Map; Business to Category Map; Mapping Phase (Category to Review Mapping); Training Set (Lucene Index); Testing Set (Lucene Index); Tf-Idf similarity (1. Default, 2. BM25, 3. Dirichlet); Predicted Categories]
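The similarity-based ranking named in the methodology can be illustrated with a small scorer. This is only a sketch of classic Okapi BM25 in plain Python, not Lucene's implementation and not the authors' code; the toy category documents, the parameter values `k1` and `b`, and the function name `bm25_scores` are assumptions made for illustration.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs, k1=1.2, b=0.75):
    """Score each document (a list of tokens) against the query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical toy data: one pseudo-document of review text per category.
category_docs = {
    "Pizza": "pizza crust cheese slice pizza oven".split(),
    "Coffee & Tea": "coffee latte espresso tea barista".split(),
    "Bars": "beer cocktail bartender beer tap".split(),
}
review = "great pizza with extra cheese".split()
names = list(category_docs)
scores = bm25_scores(review, [category_docs[n] for n in names])
ranked = sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)
top_categories = [name for name, _ in ranked[:3]]
```

Taking the top 3, 5, or 7 entries of `ranked` corresponds to the Top 3 / Top 5 / Top 7 cut-offs used in the evaluation slides.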
Evaluation: BM25 Similarity
                 Top 3   Top 5   Top 7
Precision        0.54    0.38    0.33
Recall           0.55    0.66    0.72
F2-Measure       0.55    0.57    0.58
At least 1 TP    0.85    0.88    0.89
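For reference, the F2-Measure reported in these tables is the F-beta score with beta = 2, which weights recall more heavily than precision. The sketch below uses the standard definition; the function name `f_beta` is an assumption, and the slides do not say whether the reported values were computed per business and then averaged.

```python
def f_beta(precision, recall, beta=2.0):
    """Standard F-beta score; beta=2 gives the F2-Measure."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Top 3 column of the BM25 table: precision 0.54, recall 0.55.
f2_top3 = f_beta(0.54, 0.55)
```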
Evaluation: Default Similarity
                 Top 3   Top 5   Top 7
Precision        0.51    0.36    0.33
Recall           0.53    0.62    0.66
F2-Measure       0.53    0.54    0.55
At least 1 TP    0.84    0.85    0.87
Evaluation: LMDirichlet Similarity
                 Top 3   Top 5   Top 7
Precision        0.42    0.32    0.3
Recall           0.58    0.6     0.55
F2-Measure       0.53    0.51    0.47
At least 1 TP    0.81    0.84    0.86
Task 2: Recommend Restaurant
Food Items or Services
...
Task 2 : Methodology
Feature Extraction
• Every token has an associated POS tag
• Tokens tagged “NN” are nouns and tokens tagged “JJ” are adjectives
• Nouns are considered as features and adjectives as sentiments
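The extraction rule above can be sketched on tokens that have already been tagged. The tagger itself is not shown; the tagged sentence below is hand-written for illustration, and treating every tag that starts with "NN" or "JJ" (for example "NNS" or "JJR") as a noun or adjective is an assumption.

```python
def extract_features_and_sentiments(tagged_tokens):
    """Split (token, POS tag) pairs into noun features and adjective sentiments."""
    features = [tok.lower() for tok, tag in tagged_tokens if tag.startswith("NN")]
    sentiments = [tok.lower() for tok, tag in tagged_tokens if tag.startswith("JJ")]
    return features, sentiments

# Hand-tagged example sentence (Penn Treebank style tags).
tagged = [
    ("The", "DT"), ("food", "NN"), ("was", "VBD"), ("great", "JJ"),
    ("but", "CC"), ("the", "DT"), ("service", "NN"), ("was", "VBD"),
    ("bad", "JJ"),
]
features, sentiments = extract_features_and_sentiments(tagged)
```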
Feature Filtering
• Noise is present in the features obtained from the Feature Extraction phase
• Using the Task 1 solution, the categories of the input features are determined
• Features whose categories are related to restaurants are considered for further processing

Before Feature Filtering:
• cheese
• burger
• ones
• menu
• combinations
• idea
• commission

After Feature Filtering:
• cheese
• burger
• menu
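The filtering step can be sketched as a set intersection between the categories predicted for a feature and a set of restaurant-related categories. Everything named here is hypothetical: `predict_categories` stands in for the Task 1 classifier, and the category names and lookup table are invented so that the example reproduces the before/after lists above.

```python
# Hypothetical set of categories treated as restaurant-related.
RESTAURANT_CATEGORIES = {"Restaurants", "Food"}

# Hypothetical stand-in for the Task 1 predictions, keyed by feature word.
HYPOTHETICAL_PREDICTIONS = {
    "cheese": {"Food"},
    "burger": {"Restaurants", "Food"},
    "menu": {"Restaurants"},
    "commission": {"Financial Services"},
}

def predict_categories(feature):
    """Placeholder for the Task 1 category predictor."""
    return HYPOTHETICAL_PREDICTIONS.get(feature, set())

def filter_features(features):
    """Keep only features whose predicted categories are restaurant-related."""
    return [f for f in features if predict_categories(f) & RESTAURANT_CATEGORIES]

before = ["cheese", "burger", "ones", "menu", "combinations", "idea", "commission"]
after = filter_features(before)
```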
Feature Processing
[Diagram labels: Sentence; Stanford CoreNLP; Dependency type NSUBJ; Dependent Tag; Governor Tag]
• Problem: The relationship between noun and adjective was ambiguous for some sentences.
• Example: “The food was great but the service was bad”
• After parsing, does “bad” belong to food or service?
[Flowchart: New Review → Adjective → Positive or Negative? → Negative word within 4-word distance? → Decision (Recommended or Not Recommended)]
Classification of reviews
1. For each sentence, the noun is extracted through feature extraction
2. The corresponding adjective is identified as positive or negative
3. A negation is searched for within a 4-word distance of the adjective
4. A feature is classified as Recommended if the number of positive sentiments associated with it is greater than the number of negative sentiments
• All the above steps are repeated for each review
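The four steps above can be sketched as follows. The word lists are tiny hypothetical lexicons, not the ones used in the project, and searching for a negation on both sides of the adjective is an assumption, since the slide only says "within 4 word distance".

```python
# Hypothetical mini-lexicons for illustration only.
POSITIVE = {"great", "good", "nice", "fresh", "flavorful", "decent"}
NEGATIVE = {"bad", "bland", "chewy", "sorry"}
NEGATIONS = {"not", "no", "never", "n't"}

def adjective_polarity(tokens, adj_index, window=4):
    """Return +1, -1, or 0 for the adjective, flipping the sign if negated nearby."""
    word = tokens[adj_index].lower()
    if word in POSITIVE:
        polarity = 1
    elif word in NEGATIVE:
        polarity = -1
    else:
        return 0
    start = max(0, adj_index - window)
    end = min(len(tokens), adj_index + window + 1)
    nearby = tokens[start:adj_index] + tokens[adj_index + 1:end]
    if any(t.lower() in NEGATIONS for t in nearby):
        polarity = -polarity
    return polarity

def is_recommended(observations):
    """observations: list of (tokens, adjective index) pairs for one feature."""
    polarities = [adjective_polarity(tokens, idx) for tokens, idx in observations]
    positives = sum(1 for p in polarities if p > 0)
    negatives = sum(1 for p in polarities if p < 0)
    return positives > negatives

bread_mentions = [
    ("the bread was great".split(), 3),
    ("bread is fresh".split(), 2),
    ("the bread was not good".split(), 4),
]
bread_recommended = is_recommended(bread_mentions)
```

Here "bread" collects two positive mentions and one negated positive, so it is classified as Recommended.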
Sample Result
Predicted Feature   Predicted Feature Sentiments                 Predicted as Recommended?   Actually Recommended?
sub                 next, decent                                 Y                           Y
bread               flavorful, bland, fresh, great, nice         Y                           Y
peppercorn          nice                                         Y                           Y
stuff-it            chewy                                        Y                           N
sandwich            mayo/mustard/vinegar, east, good, unknown    Y                           Y
menu                decent                                       Y                           Y
bacon               real                                         Y                           Y
bite                huge                                         Y                           N
veggies             sorry                                        N                           Y
Evaluation
• Set 1: Recommended features are obtained from 60% of the reviews of a particular restaurant.
• Set 2: The remaining 40% of the reviews are used for testing.
• If a recommended feature from Set 1 is also present as a recommended feature in Set 2, it is a True Positive.
• Evaluation Metrics: Precision, Recall
[Bar chart: Precision 0.53, Recall 0.67]
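The true-positive rule above can be sketched with set operations. The slide does not spell out the denominators, so this assumes precision is taken over the Set 1 features and recall over the Set 2 features; the feature sets are invented for illustration.

```python
def precision_recall(predicted, reference):
    """Precision and recall of predicted features against reference features."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall

# Hypothetical recommended features from the 60% split and the 40% split.
set1 = {"bread", "sandwich", "menu", "bacon"}
set2 = {"bread", "menu", "veggies"}
precision, recall = precision_recall(set1, set2)
```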
Identifying Influential topics
“Identify features from reviews which are relevant city wide
and influence the user’s choice and restaurant’s popularity”
Phases
I. Business classification by city
II. Popular item word-count
III. NLP feature extraction
IV. Feature re-ranking model
V. Model fitness evaluation
Business Classification Phase I
Issue: Reviews specify neighborhood, not city (~150).
Solution:
1. Identify city based on geo-code through mapping service.
2. K-means clustering
   1. Data point features (Business Id, Latitude, Longitude)
   2. Dissimilarity metric (Euclidean distance)
   3. Cluster count: k (10)
   4. Centroid labeling
3. Data persistence and indexing
   1. Split reviews based on clustered business ids
   2. Save & index for next phase.
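The clustering step can be sketched in plain Python with Euclidean distance on (latitude, longitude). This is a minimal illustration under assumptions: the coordinates are invented, k is 3 rather than the 10 used on the slide, the business id is treated as a label rather than a distance feature, and the deterministic farthest-point initialisation is a choice made here, not something the slides specify.

```python
import math

def farthest_point_init(points, k):
    """Pick k starting centroids: the first point, then repeatedly the farthest point."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    return centroids

def kmeans(points, k, iterations=20):
    """Plain k-means with Euclidean distance; returns (centroids, assignment)."""
    centroids = farthest_point_init(points, k)
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return centroids, assignment

# Hypothetical (latitude, longitude) pairs near Phoenix, Las Vegas, and Madison.
business_ids = ["b1", "b2", "b3", "b4", "b5", "b6"]
points = [(33.45, -112.07), (33.51, -112.10), (36.17, -115.14),
          (36.11, -115.17), (43.07, -89.40), (43.08, -89.38)]
centroids, assignment = kmeans(points, k=3)
cluster_of = dict(zip(business_ids, assignment))
```

The resulting `cluster_of` mapping is what a later step would use to split reviews by clustered business id.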
Word-count Phase II
Issue: How do we get the influential factors of a city
Solution: Word count as first pass
Observation: Noise (adjectives, verbs, expressions)
Proposal: Include features derived through NLP
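A first-pass word count of this kind can be sketched with `collections.Counter`. The tokenisation rule (lowercase runs of letters and apostrophes) and the sample reviews are assumptions for illustration; as the slide observes, such a list is dominated by generic words until NLP-derived features are used.

```python
import re
from collections import Counter

def top_words(reviews, n=25):
    """Count lowercase word tokens across all reviews and return the n most common."""
    counts = Counter()
    for text in reviews:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts.most_common(n)

# Hypothetical review snippets.
reviews = ["The food was good", "Good food, great place"]
top_two = top_words(reviews, n=2)
```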
NLP Features Phase III
Issue: Noise reduction and contextual awareness
Solution: Use NLP to identify features in the reviews
Observation: Subtle change in ordering of words
Proposal: Re-ranking the words using metrics from user and review.
Mathematical Formula
$$\sum_{R \in R_{1000}} tf \cdot \log\left(1 - \frac{df}{D_{count}}\right) \times \left[\, 0.15\, R_v + 0.15\, R_s + 0.7 \left( 0.25\, U_e + 0.55\, \frac{U_v}{U_{rc}} + 0.20\, U_f \right) \right]$$
Who is Important? Elite User
What is Important? Useful Review
Mathematical Formula
• Features from NLP do take word count and context into account, but do NOT consider user weight and review weight
[Diagram labels: Word list from NLP; Solr Index; Top 1K Relevant Reviews; Program with Mathematical Formula; Scored word]
User
• Review Count = U_rc
• Average Stars
• Votes = U_v
• Friends
• Elite = U_e
• Yelping Since
• Compliments
• Fans = U_f
Mathematical Formula: Normalization of votes
$$U_{v,norm} = \frac{U_{TotalVotes}}{U_{ReviewCount}}$$
$$0.25\, U_e + 0.55\, \frac{U_v}{U_{rc}} + 0.20\, U_f$$

User   Review Count   Votes
U1     10             1000
U2     1000           1000
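The two users in the table show why votes are divided by review count: both have 1000 votes, but U1 earned them over 10 reviews and U2 over 1000. The sketch below applies the normalisation and the user-weight term as written; encoding Elite as 0 or 1 and using the raw fan count are assumptions, since the slides do not give the scales, and the function names are invented here.

```python
def normalized_votes(total_votes, review_count):
    """Votes per review (U_v / U_rc); 0.0 for users with no reviews."""
    return total_votes / review_count if review_count else 0.0

def user_weight(is_elite, total_votes, review_count, fans):
    """0.25*U_e + 0.55*(U_v/U_rc) + 0.20*U_f, with U_e encoded as 0 or 1 (assumption)."""
    return (0.25 * float(is_elite)
            + 0.55 * normalized_votes(total_votes, review_count)
            + 0.20 * fans)

u1_norm = normalized_votes(1000, 10)    # U1 from the table
u2_norm = normalized_votes(1000, 1000)  # U2 from the table
elite_weight = user_weight(True, 1000, 1000, fans=0)
```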
Mathematical Formula
$$0.15\, R_v + 0.15\, R_s + 0.7 \left( 0.25\, U_e + 0.55\, \frac{U_v}{U_{rc}} + 0.20\, U_f \right)$$
Review
• User
• Stars = R_s
• Text
• Date
• Votes = R_v

User Stars   Sentiment
1            Very Strong
2            Inclined negative
3            Ambivalent
4            Inclined positive
5            Very Strong
Mathematical Formula
$$tf \cdot \log\left(1 - \frac{df}{D_{count}}\right) \times \left[\, 0.15\, R_v + 0.15\, R_s + 0.7 \left( 0.25\, U_e + 0.55\, \frac{U_v}{U_{rc}} + 0.20\, U_f \right) \right]$$
Review Relevance
• Term Frequency = tf
• Document Frequency = df
• Document Count = D_count
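The complete score can be sketched by combining the relevance term with the review and user weights. The expression is implemented exactly as it reads on the slide; read literally, log(1 - df/D_count) is never positive, so the scores come out negative, and the sketch does not guess at any other intended form. Using the raw star rating for R_s, the sample numbers, and the function names are all assumptions.

```python
import math

def review_weight(review_votes, review_stars, user_w):
    """0.15*R_v + 0.15*R_s + 0.7*(user weight)."""
    return 0.15 * review_votes + 0.15 * review_stars + 0.7 * user_w

def term_score(tf, df, doc_count, review_votes, review_stars, user_w):
    """Contribution of one review to a word's score, as written on the slide."""
    if not 0 <= df < doc_count:
        raise ValueError("df must satisfy 0 <= df < doc_count")
    relevance = tf * math.log(1 - df / doc_count)
    return relevance * review_weight(review_votes, review_stars, user_w)

def word_score(per_review_stats):
    """Sum the per-review contributions over the relevant reviews for one word."""
    return sum(term_score(**stats) for stats in per_review_stats)

# Hypothetical statistics for one word in two reviews.
stats = [
    dict(tf=2, df=10, doc_count=100, review_votes=4, review_stars=5, user_w=0.8),
    dict(tf=1, df=10, doc_count=100, review_votes=0, review_stars=3, user_w=0.25),
]
single = term_score(**stats[0])
total = word_score(stats)
```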
Output
[Figure: items numbered 1 to 17]
Madison
Rank | Wordcount List | NLP list (Unformatted) | NLP list (Model)
1 food food pizza
2 place beer cheese
3 like cheese coffee
4 from menu breakfast
5 service curds burger
6 go atmosphere taco
7 time burger sushi
8 madison dane chocolate
9 been drinks beer
10 cheese beers sandwich
11 menu restaurant curds
12 bar table ice
13 restaurant coffee wine
14 ordered pizza store
15 love something cream
16 order sandwich lunch
17 chicken dinner rolls
18 beer lunch atmosphere
19 pizza meal tea
20 sauce sauce curries
21 night burgers steak
22 people drink noodle
23 make bread spot
24 staff server soup
25 made chicken egg
Phoenix
Rank | Wordcount List | NLP list (Unformatted) | NLP list (Model)
1 food food donut
2 good pizza bagel
3 place burger cupcake
4 great menu gelato
5 like restaurant gyro
6 service fries yogurt
7 time atmosphere buffet
8 go chicken boba
9 back patio pizza
10 from breakfast sushi
11 been table coffee
12 love lunch sub
13 ordered dinner wing
14 chicken meal crepe
15 nice salad burger
16 order cheese burrito
17 restaurant potato taco
18 little server cookie
19 menu something gluten
20 pizza sauce breakfast
21 bar drinks coffee-shop
22 delicious rice hash-brown
23 friendly burgers cake
24 first beer vegan
25 pretty spot teas
Las Vegas
Rank | Wordcount List | NLP list (Unformatted) | NLP list (Model)
1 food food donuts
2 good beer bagel
3 place sushi crepe
4 like restaurant pizza
5 great meal oyster
6 service menu yogurt
7 from atmosphere shrimp
8 time table burger
9 vegas steak gelato
10 go dinner sushi
11 back server wings
12 ordered salad sandwich
13 restaurant tables pancake
14 nice rib coffee
15 been buffet burrito
16 order dining curry
17 chicken breakfast buffet
18 little waitress waffle
19 pretty shrimp chocolate
20 love something cake
21 menu beers breakfast
22 eat dishes tea
23 delicious dish cookies
24 first restaurants gluten
25 people sauce pastrami
Evaluation Metric: NDCG
• Predicted topics for Phoenix under the categories Bakery, Breakfast and Brunch
• To capture the strongest sentiments about these topics, we analyzed the top 1000 features for businesses predicted under Bakery, Breakfast and Brunch for the specific city, in this case Phoenix.
• Using these features as input for the relevance score, we analyze the top 30 topics predicted by the model:
• NDCG = 18.80190835 / 21.8978282 = 0.8586
Rank | NLP list (Output From Model) | Relevance Score | log2(i) | DCG = rel(i)/log2(i)
1 donut 3 0 3
2 bagel 3 1 3
3 cupcake 3 1.584963 1.892789
4 gelato 0 2 0
5 gyro 2 2.321928 0.861353
6 yogurt 2 2.584963 0.773706
7 buffet 0 2.807355 0
8 boba 1 3 0.333333
9 pizza 0 3.169925 0
10 sushi 0 3.321928 0
11 coffee 3 3.459432 0.867194
12 sub 2 3.584963 0.557886
13 wing 1 3.70044 0.270238
14 crepe 2 3.807355 0.525299
15 burger 2 3.906891 0.511916
16 burrito 2 4 0.5
17 taco 2 4.087463 0.489301
18 cookie 2 4.169925 0.479625
19 gluten 0 4.247928 0
20 breakfast 2 4.321928 0.462756
21 coffee-shop 2 4.392317 0.45534
22 hash-brown 1 4.459432 0.224244
23 cake 3 4.523562 0.663194
24 vegan 1 4.584963 0.218104
25 teas 2 4.643856 0.430677
26 bruschetta 1 4.70044 0.212746
27 waffle 3 4.754888 0.63093
28 pancake 3 4.807355 0.624044
29 subway 1 4.857981 0.205847
30 latte 3 4.906891 0.611385
Rank Relevance Score Log Ideal DCG (IDCG)
1 3 0 3
2 3 1 3
3 3 1.5849625 1.89278926
4 3 2 1.5
5 3 2.32192809 1.29202967
6 3 2.5849625 1.16055842
7 3 2.80735492 1.06862156
8 3 3 1
9 3 3.169925 0.94639463
10 2 3.32192809 0.60205999
11 2 3.45943162 0.57812965
12 2 3.5849625 0.55788589
13 2 3.70043972 0.54047631
14 2 3.80735492 0.52529907
15 2 3.9068906 0.51191605
16 2 4 0.5
17 2 4.08746284 0.48930108
18 2 4.169925 0.47962493
19 2 4.24792751 0.47081783
20 2 4.32192809 0.46275643
21 1 4.39231742 0.22767025
22 1 4.45943162 0.22424382
23 1 4.52356196 0.22106473
24 1 4.5849625 0.21810429
25 1 4.64385619 0.21533828
26 1 4.70043972 0.21274605
27 0 4.7548875 0
28 0 4.80735492 0
29 0 4.857981 0
30 0 4.9068906 0
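The two tables can be reproduced with a few lines of Python. The sketch follows the convention visible in the tables: the first position contributes its relevance score directly, and position i >= 2 contributes rel(i) / log2(i). The ideal relevance list is copied from the IDCG table as printed rather than re-derived by sorting the predicted list; the function name `dcg` is an assumption.

```python
import math

def dcg(relevances):
    """DCG as in the tables: rel(1) + sum over i >= 2 of rel(i) / log2(i)."""
    total = 0.0
    for rank, rel in enumerate(relevances, start=1):
        total += rel if rank == 1 else rel / math.log2(rank)
    return total

# Relevance scores of the 30 predicted topics, in rank order.
predicted = [3, 3, 3, 0, 2, 2, 0, 1, 0, 0, 3, 2, 1, 2, 2,
             2, 2, 2, 0, 2, 2, 1, 3, 1, 2, 1, 3, 3, 1, 3]
# Ideal ordering as printed in the IDCG table.
ideal = [3] * 9 + [2] * 11 + [1] * 6 + [0] * 4

ndcg = dcg(predicted) / dcg(ideal)
```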
Things to Note!
• Based on results, identified categories: Breakfast and Brunch, Bakery
• Keywords:
donut
bagel
cupcake
gelato
gyro
yogurt
buffet
boba
pizza
sushi
coffee
sub
wing
crepe
burger
burrito
taco
cookie
gluten
breakfast
coffee-shop
hash-brown
cake
Thank You!
