Performed data analytics on the massive Yelp academic dataset, which has almost 2.2 million reviews for 77k businesses, using technologies such as Hadoop Pig, Hive, SQL, R, and Tableau.
2. [AUTHOR NAME] 2
YELP DATASET ANALYSIS REPORT
1. Summary of number of reviews by US City, by Categories
For this analysis, I used two of the datasets from the Yelp Academic Challenge Dataset: the Business and Reviews datasets.
Both datasets were loaded in Pig using Twitter's elephant-bird JsonLoader, because the schema of the datasets is highly nested
with mixed data types. The .jar files for the elephant-bird components were loaded through the properties tab of the Pig Editor
in the Hue web UI. The three .jar files were /user/cloudera/elephant-bird-core-4.13.jar,
/user/cloudera/elephant-bird-hadoop-compat-4.13.jar and /user/cloudera/elephant-bird-pig-4.13.jar, available respectively at:
I. http://mvnrepository.com/artifact/com.twitter.elephantbird/elephant-bird-core/4.13
II. http://mvnrepository.com/artifact/com.twitter.elephantbird/elephant-bird-hadoop-compat/4.13
III. http://mvnrepository.com/artifact/com.twitter.elephantbird/elephant-bird-pig/4.13
NOTE: The above three .jar files have been used in all of the questions.
Once the .json file for a dataset has been loaded as a map, it is stored in a generic variable. From that variable we then
generate the fields required for analysis using the format name_of_map#'field_name' as field_name. For example, if we
loaded the business dataset with the map name business and wish to generate the business_id field, the syntax is
business#'business_id' as business_id.
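The map#'field_name' projection described above can be illustrated outside Pig. The following is a minimal sketch in plain Python rather than Pig; the sample record and the helper name generate are made up for illustration and are not part of the original analysis.

```python
import json

# Hypothetical one-line record shaped like the business dataset (values are made up).
line = ('{"business_id": "b1", "city": "Phoenix", "state": "AZ", '
        '"latitude": 33.45, "longitude": -112.07, "categories": ["Food", "Bakeries"]}')

# Load the line as a generic map, comparable to declaring (business: map[]) in Pig.
business = json.loads(line)

def generate(record, *field_names):
    """Pick named fields out of a map; a missing key yields None, much like a null in Pig."""
    return {name: record.get(name) for name in field_names}

# Comparable to: business#'business_id' as business_id, business#'city' as city
row = generate(business, "business_id", "city")
print(row)
```

Fields that are absent from a record come back as None here, which is worth keeping in mind when later steps group or filter on them.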
In this case I generated the fields categories, city, business_id, state, latitude and longitude from the business dataset and
the fields business_id and review_id from the reviews dataset. To obtain US cities, the businesses are first filtered on the edge
coordinates of the US mainland. However, because Canadian cities such as Waterloo fall inside that box, they are then filtered by
state to remove the Canadian provinces of Ontario and Quebec. The two datasets are then joined on their common field business_id
in a new variable, joined. Once they were joined, I generated the city and categories for each record in joined. As the categories
in the business dataset are nested and each business can be classified under several categories, I flattened the categories so
that each category associated with a business can be identified individually. I then grouped the flattened relation (flatting) by
city and categories, so that the results are grouped accordingly.
However, once a relation is grouped, its schema changes. To extract the desired result, for each record in the grouped variable I
flattened the group key back into city and categories and then generated the count of reviews associated with it.
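The filter, join, flatten and count steps described above can be sketched in plain Python (not Pig). The bounding-box values are the ones used in the Pig script; the sample businesses and reviews are made up, and the state check here is an exact match on the codes ON and QC, which is an assumption that simplifies the pattern match used in the script.

```python
from collections import Counter

# Edge coordinates of the US mainland, as used in the Pig FILTER.
LAT_MIN, LAT_MAX = 24.520833, 49.384472
LON_MIN, LON_MAX = -124.766667, -66.950
EXCLUDED_STATES = {"ON", "QC"}  # Ontario and Quebec

def in_us(b):
    """True if the business lies inside the bounding box and is not in ON or QC."""
    inside = LAT_MIN < b["latitude"] < LAT_MAX and LON_MIN < b["longitude"] < LON_MAX
    return inside and b["state"] not in EXCLUDED_STATES

def review_counts(businesses, reviews):
    """Count reviews per (city, category) for US businesses."""
    us = {b["business_id"]: b for b in businesses if in_us(b)}
    counts = Counter()
    for r in reviews:
        b = us.get(r["business_id"])       # inner join on business_id
        if b is None:
            continue
        for category in b["categories"]:   # one row per category, like FLATTEN
            counts[(b["city"], category)] += 1
    return counts

# Made-up sample data for illustration only.
businesses = [
    {"business_id": "b1", "city": "Phoenix", "state": "AZ",
     "latitude": 33.45, "longitude": -112.07, "categories": ["Food", "Bakeries"]},
    {"business_id": "b2", "city": "Waterloo", "state": "ON",
     "latitude": 43.46, "longitude": -80.52, "categories": ["Food"]},
    {"business_id": "b3", "city": "Edinburgh", "state": "EDH",
     "latitude": 55.95, "longitude": -3.19, "categories": ["Pubs"]},
]
reviews = [
    {"business_id": "b1", "review_id": "r1"},
    {"business_id": "b1", "review_id": "r2"},
    {"business_id": "b2", "review_id": "r3"},
    {"business_id": "b3", "review_id": "r4"},
]
counts = review_counts(businesses, reviews)
print(dict(counts))
```

In this sample the Waterloo business passes the coordinate filter and is only removed by the state check, which is the case the second filter is meant to handle.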
Finally, I ordered the results by city, so that the final output shows the number of reviews for each business category within
each city in the dataset. I then stored the final relation in a folder in HDFS using PigStorage with a tab delimiter, which
produces a tab-separated values file.
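As a rough illustration of the ordering and storage step, the sketch below sorts made-up (city, category, count) rows by city and writes them as tab-separated text with Python's csv module. It writes to an in-memory buffer rather than HDFS, and the counts are illustrative only.

```python
import csv
import io

# Made-up counts keyed by (city, category), for illustration only.
counts = {
    ("Phoenix", "Food"): 2,
    ("Ahwatukee", "Self Storage"): 6,
    ("Phoenix", "Bakeries"): 2,
}

# Order by city only, comparable to ORDER results BY city.
rows = sorted(
    ((city, category, n) for (city, category), n in counts.items()),
    key=lambda row: row[0],
)

# Write tab-separated values, comparable to storing with a tab delimiter.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerows(rows)
tsv_text = buffer.getvalue()
print(tsv_text)
```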
A few exceptions I noted while analyzing the output: a few records have no city in their city field, while for some records
the same city is specified differently, such as 110. Las Vegas and Las Vegas. Such discrepancies can cause minor fluctuations
when analyzing the output dataset.
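To show how such variants split a count, here is a small Python sketch of one possible cleanup. It was not part of the original analysis: the function normalize_city, the UNKNOWN label for blank cities and the sample counts are all assumptions made for illustration.

```python
import re
from collections import Counter

# Matches a leading number followed by a dot, as in "110. Las Vegas".
_LEADING_NUMBER = re.compile(r"^\d+\.\s*")

def normalize_city(city):
    """Strip a leading 'NNN.' prefix and label blank cities as UNKNOWN."""
    cleaned = _LEADING_NUMBER.sub("", (city or "").strip())
    return cleaned or "UNKNOWN"

# Made-up review counts keyed by (city, category).
raw_counts = {
    ("110. Las Vegas", "Automotive"): 12,
    ("Las Vegas", "Automotive"): 100,
    ("", "Magicians"): 3,
}

# Merge the counts of city variants that normalize to the same name.
merged = Counter()
for (city, category), n in raw_counts.items():
    merged[(normalize_city(city), category)] += n
print(dict(merged))
```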
A basic analysis of the number of reviews in Tableau suggests that the largest number of reviews comes from the city of Las
Vegas, as shown in Figure 1, and that the largest number of reviews for any individual category is for the Restaurants category,
as shown in Figure 2.
Figure 1
Figure 2
PIG SCRIPT:
-- Load the business dataset as a generic map and project the required fields
A = LOAD './yelp_academic_dataset_business.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad=true') AS (yelp: map[]);
business = FOREACH A GENERATE yelp#'categories' AS categories, yelp#'business_id' AS business_id, yelp#'city' AS city, yelp#'state' AS state, (float)yelp#'latitude' AS latitude, (float)yelp#'longitude' AS longitude;
-- Keep businesses inside the US mainland bounding box, then drop Ontario and Quebec
coordinates_business = FILTER business BY (latitude < 49.384472) AND (latitude > 24.520833) AND (longitude < -66.950) AND (longitude > -124.766667);
us_business = FILTER coordinates_business BY NOT ((state matches '.*ON.*') OR (state matches '.*QC.*'));
businesses = FOREACH us_business GENERATE categories, business_id, city;
-- Load the reviews dataset and join it to the businesses on business_id
B = LOAD './yelp_academic_dataset_review.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad=true') AS (review: map[]);
revie = FOREACH B GENERATE review#'business_id' AS business_id, review#'review_id' AS review_id;
joined = JOIN businesses BY business_id, revie BY business_id;
-- One row per (city, category) for each review, then count per group
flatting = FOREACH joined GENERATE city, FLATTEN(categories);
grouped = GROUP flatting BY (city, categories);
results = FOREACH grouped GENERATE FLATTEN(group) AS (city, categories), COUNT(flatting);
-- Order by city and store as tab-separated output
finals = ORDER results BY city;
STORE finals INTO './Q1' USING PigStorage('\t');
TRUNCATED OUTPUT (city, category, number of reviews; the first two rows have an empty city field):
                  Magicians                    3
                  Event Planning & Services    3
110. Las Vegas    Automotive                   12
110. Las Vegas    Auto Repair                  12
Ahwatukee         Pet Boarding/Pet Sitting     10
Ahwatukee         Fitness & Instruction        20
Ahwatukee         Sewing & Alterations         20
Ahwatukee         Health & Medical             13
Ahwatukee         Hotels & Travel              14
Ahwatukee         Eyelash Service              4
Ahwatukee         Carpet Cleaning              4
Ahwatukee         Specialty Food               3
Ahwatukee         Local Services               30
Ahwatukee         Health Markets               3
Ahwatukee         Pediatricians                13
Ahwatukee         Home Services                6
Ahwatukee         Beauty & Spas                4
Ahwatukee         Truck Rental                 6
Ahwatukee         Self Storage                 6
2. Ranking of cities on the basis of stars in each category
For this analysis, I again used two of the datasets from the Yelp Academic Challenge Dataset: the Business and Reviews
datasets, as in the previous question.
Both datasets were loaded in Pig using Twitter's elephant-bird JsonLoader, because the schema of the datasets is highly nested
with mixed data types. The .jar files for the elephant-bird components were loaded through the properties tab of the Pig Editor
in the Hue web UI. Once the .json file for a dataset has been loaded as a map, it is stored in a generic variable.
From that variable we then generate the fields required for analysis using the format name_of_map#'field_name' as
field_name. For example, if we loaded the business dataset with the map name business and wish to generate the
categories field, the syntax is business#'categories' as categories.
In this case I generated the fields categories, city and business_id from the business dataset and the fields business_id and
stars from the reviews dataset. Because the data from the .json file is loaded as a map of key-value pairs, any numeric value
must be typecast by specifying a data type such as int or float when the field is generated from the data loaded with
elephant-bird. The two datasets are then joined on their common field, business_id. Once they were joined, I generated the city,
stars and categories for each record in the joined variable. As the categories in the business dataset are nested and each
business can be classified under several categories, I flattened the categories so that each category associated with a business
can be identified individually.
Once the categories were flattened, I grouped the flattened relation by city and categories, so that the results are grouped
accordingly. However, once a relation is grouped, its schema changes. To extract the desired result, for each record in the
grouped variable I flattened the group key back into city and categories, generated the average value of the stars within that
group and named the calculated field rankings. Note that to access the stars field we have to name the variable in which the
field exists; in this case the stars field was referenced as flattened_join.stars. As noted in the previous question, the
problem of the same city appearing under different names, such as Las Vegas and 110. Las Vegas, exists here as well.
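The typecast and averaging steps described above can be sketched in plain Python (not Pig). The function name average_stars and the sample rows are made up for illustration; one star value is deliberately given as a string to show why an explicit cast is needed before averaging.

```python
from collections import defaultdict

def average_stars(joined_rows):
    """Average the stars per (city, category) over already-joined rows."""
    totals = defaultdict(lambda: [0.0, 0])  # (city, category) -> [sum, count]
    for row in joined_rows:
        stars = float(row["stars"])          # explicit cast, like (float) in Pig
        for category in row["categories"]:   # one row per category, like FLATTEN
            entry = totals[(row["city"], category)]
            entry[0] += stars
            entry[1] += 1
    return {key: total / n for key, (total, n) in totals.items()}

# Made-up joined rows: one entry per review, carrying the business's city and categories.
joined_rows = [
    {"city": "Phoenix", "categories": ["Food", "Bakeries"], "stars": "4"},
    {"city": "Phoenix", "categories": ["Food"], "stars": 5},
    {"city": "Tempe", "categories": ["Food"], "stars": 3},
]
rankings = average_stars(joined_rows)
print(rankings)
```

Each review contributes once to every category of its business, so a business listed under several categories influences several averages.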
The mean rating is 3.747, with minimum and maximum rating values of 1.0 and 5.0. The median average
rating is 3.758.
Figure 3 shows the number of cities grouped by the average rating of their businesses across all categories, in the bands
less than 1.5 stars, 1.5-3 stars, 3-4.5 stars and greater than 4.5 stars. Most of the cities have businesses in the 3-4.5
range.
Figure 4 similarly shows the number of categories grouped by their average rating, in the bands Not Good (less than 1.5 stars),
Fair (1.5-3 stars), Good (3-4.5 stars) and Excellent (above 4.5 stars). Almost 40% of the categories have a Good rating,
i.e. 3-4.5 stars.
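The banding used for Figures 3 and 4 can be sketched in Python as follows. The report does not state which band a value exactly on a boundary belongs to, so the boundary handling below (1.5 counts as Fair, 3 and 4.5 count as Good) is an assumption, and the sample averages are made up.

```python
from collections import Counter

def rating_band(average):
    """Map an average star rating to one of the four bands used in the figures."""
    if average < 1.5:
        return "Not Good"
    if average < 3.0:
        return "Fair"
    if average <= 4.5:   # assumption: 4.5 itself is still Good
        return "Good"
    return "Excellent"

# Made-up average ratings per category, for illustration only.
averages = {"Food": 3.8, "Bakeries": 4.7, "Towing": 1.2, "Parking": 2.9, "Cafes": 4.5}

# Number of categories falling into each band.
band_counts = Counter(rating_band(a) for a in averages.values())
print(dict(band_counts))
```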
Figure 3
Figure 4