Project Final Report
Exploratory Analysis of Social Media
Images to Inform Product Innovation,
Marketing & Promotions
Big Data Analytics, Summer 2015
Matthew Blough, Eric DeFina, Zixin Mao, Sandilya Tumma
8/12/2015
Abstract
Social Listening is an established activity that allows organizations to generate
consumer and customer insights and make more informed business decisions from public
social media data. While traditionally based on text analytics tools, the rise of platforms such
as Instagram, Pinterest, Snapchat, and Tumblr has transformed the content, and therefore the
data, generated in social media. As such, the analysis of unstructured data from images will be
critical to "social listening" on today's platforms in order to fully understand context, sentiment,
meaning, and more. Through this research, we explore whether big data platforms can be used
to read, analyze (for trends and commonalities), and summarize unstructured data from social
media images to develop insights that feed business and marketing decisions for an online travel
agency (e.g., Travelocity).
Introduction
Social Listening is an activity that allows brands and organizations to learn from public
data generated by consumers on social networks. By mining this unstructured data, companies
can generate insights from observing online consumer conversation and then use those insights
to make smarter, more informed business decisions, such as product innovations and changes,
marketing campaigns, promotional offers, and more.
Over the past several years, however, there has been a transformational shift in the
content consumers publish to social networks. With the rise of platforms such as Instagram,
Pinterest, and Snapchat, social conversation has become dominated by visual communication
and content. In addition, "traditional" social networks such as Facebook and Twitter have seen
an influx of visual posts versus traditional comments, tags, and other text-based content. In
2014, 500 million image-based posts were shared each day in social media, often without any
accompanying text to provide context. This shift has been largely enabled by the advancement
and adoption of smartphones, as well as faster data connection speeds. For users, visuals are
easier and faster to consume.
In order to continue to mine the full sphere of social media for business insights and
questions, we must go beyond text analytics and use big data tools to collect and analyze
imagery quickly and efficiently. As image content now dominates the social web, it will be
critical to understand the context, sentiment and meaning of images in the same way tools have
historically parsed this data from text.
Business Significance
Market research, and specifically understanding consumer needs and the market
environment, has long been a tenet of running a successful and profitable business. In 2014
alone, the market research industry boasted over $40 billion in sales globally. Today, the
Internet and the rise of social media have created new opportunities for research and insights.
In many cases, it is no longer necessary to set up formal, and expensive, studies in order to
understand and listen to consumers. In addition, the scale of the data offers the ability to
analyze the behavior of much larger groups of people compared to the smaller sample sizes of
traditional research studies. Since access to social media has become freely available to
interested organizations, many have turned to the analysis of this massive public data set as a
new source of consumer, market, and competitive insights.
Advertising has been a key activity for large online travel agencies to convert consumers
and drive sales. Expedia and Travelocity alone spent over $4 billion on advertising in 2014.
To make that advertising most effective, it is critical to understand consumer insights and
create advertising plans that drive consumers down the purchase funnel, from product
awareness to actually purchasing a trip. Analysis of social media images can provide key
insights that bolster advertising effectiveness and, ultimately, sales. By knowing what types of
images consumers are posting, and what those images consist of, we can draw conclusions
about what travel options consumers are looking for, who they most commonly travel with,
and what activities they are doing. This information, far more insightful than the transactional
data we have traditionally had access to, can be used to create more engaging advertising
creative and promotional bundles that better meet the wants and needs of our target audience.
Problem Statement
An online travel agency, TravelWeb, would like to determine the most effective new
advertising and promotion campaign, based on consumers' travel behavior, activities, and
trends, in order to increase sales. Based on information extracted from social media imagery,
we want to answer:
• What imagery and creative should we be using for our marketing campaigns, ads,
website imagery and social media content?
• What deals and packages should we be offering?
• How should we structure our offers to best meet the wants and desires of
consumers?
• What bundles and deals should we create?
Methodology
This project takes a large set of unstructured social media data in the form of images,
transforms it into a structured data set using a computer vision algorithm, and then analyzes
the structured data set with data mining techniques in order to gain insights into consumer
composition and preferences.
The first step of our methodology is hypothesis development; in other words, we needed
to outline our business interests. Our hypotheses span several topics. One is to understand how
the type of imagery, whether hiking, camping, skiing, or cruise travel, can be used to create
marketing campaigns or ads for prospective clients. Another is to see who people are traveling
with, in order to understand how to cater to their needs and interests on vacation trips. A third
is what kinds of bundles should be offered in terms of activities, food, and excursions based on
these social media images. For example, if people generally take more pictures while water
skiing than while sitting around a campfire, owing to the thrill of the activity, travel companies
can offer water-skiing packages to maximize revenue from that specific activity.
The dataset of images has already been given to us. Following Figure 1.2, the next task is
to run the images through the Microsoft ComputerVision API to extract structured quantitative
and qualitative information from each picture. This provides information such as facial
recognition features (gender and age), image colors, object categories, and a confidence score
for each prediction. The dataset contains over 150,000 images, and a big data platform is useful
for processing these pictures quickly and efficiently. Since one API call is made per image and
one JSON output is produced for each image, we end up with over 150,000 individual JSON
records.
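This per-image loop can be sketched in Python. The endpoint URL, request headers, and response field names below are illustrative assumptions rather than the exact ComputerVision API contract:

```python
import json
import os
import urllib.request

# Endpoint, key, and field names are placeholders, not the real API contract.
API_URL = "https://example-vision-api/analyze"
API_KEY = "YOUR_SUBSCRIPTION_KEY"

def analyze_image(path):
    """POST one image file to the vision API and return its JSON record."""
    with open(path, "rb") as f:
        data = f.read()
    req = urllib.request.Request(
        API_URL, data=data,
        headers={"Content-Type": "application/octet-stream",
                 "Ocp-Apim-Subscription-Key": API_KEY})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

def summarize_record(record):
    """Pull the fields we analyze later out of one JSON record."""
    top = max(record.get("categories", []),
              key=lambda c: c["score"], default={})
    return {"category": top.get("name"),
            "score": top.get("score"),
            "face_ages": [f["age"] for f in record.get("faces", [])]}

def save_records(image_dir, out_dir):
    """One API call per image: loop over the directory, write one JSON each."""
    for name in os.listdir(image_dir):
        record = analyze_image(os.path.join(image_dir, name))
        with open(os.path.join(out_dir, name + ".json"), "w") as out:
            json.dump(record, out)
```

On a big data platform this loop would be parallelized across workers, since each image is processed independently.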
Figure 1.1
While the JSON records we obtained from the ComputerVision API are structured data,
the data set needs to be further transformed into a relational structure in order to conveniently
perform analysis. This takes two steps. The first is to aggregate the individual JSON records
into one single file. This is necessary because of the flexible nature of the JSON format: it
does not require individual files to share the same set of fields, so the files must be aggregated
carefully to make sure every field from all the files is included. The second step is converting
the single JSON file into a simple relational structure with columns and rows. To accomplish
this, we utilized a tool named Konklone.
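The aggregation and flattening can be sketched in Python (the nested field names are assumptions; in the project itself the final conversion was done with Konklone):

```python
import csv
import glob
import json

def flatten(record, prefix=""):
    """Flatten nested JSON into dotted column names, e.g. faces.0.age."""
    flat = {}
    if isinstance(record, dict):
        for k, v in record.items():
            flat.update(flatten(v, f"{prefix}{k}."))
    elif isinstance(record, list):
        for i, v in enumerate(record):
            flat.update(flatten(v, f"{prefix}{i}."))
    else:
        flat[prefix.rstrip(".")] = record
    return flat

def json_dir_to_csv(pattern, out_path):
    """Union the fields of every record so no column is dropped, then write rows."""
    rows = [flatten(json.load(open(p))) for p in sorted(glob.glob(pattern))]
    columns = sorted({c for row in rows for c in row})
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=columns, restval="")
        writer.writeheader()
        writer.writerows(rows)
```

Taking the union of all field names is the point of the aggregation step: a record that lacks a field simply gets an empty cell rather than breaking the table.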
[Figure 1.1: project workflow. Hypothesis (background information; hypothesis statement;
business insight; big data usefulness) → Data Collection/ETL (Vision API; Python to extract
raw data; clean up the data for analysis) → Analysis (data mining tool IBM SPSS Modeler;
classification; clustering; association; exploratory analysis) → Expected
Conclusion/Implication (find insight for business value; business decisions for advertising
companies)]
After a relational database is constructed, it is time to perform analytics. The platform
we selected is Microsoft Azure, together with its Machine Learning Studio component. Azure
is a powerful big data platform with easy navigation and access to numerous plug-ins, and
Machine Learning Studio allows us to apply different types of analytical techniques to the
data. We can easily perform descriptive analytics by slicing and dicing the data set with
SQLite queries and then calculating statistics. At a more advanced level, we can create and run
data mining models such as classification, clustering, and association, looking for patterns,
trends, and correlations that bear on the business question at hand. The business implication is
to see what kinds of images consumers are taking and begin advising travel companies and
agencies on how to better target their advertisements to specific activities and leisure events.
The goal is to help these companies increase revenue and maximize profits so that no money is
wasted promoting the wrong activities: why promote parasailing at a location that is not suited
to it, rather than at offshore islands where consumers are photographing it daily?
Figure 1.2
Project Domain
The project domain is broken down by the ETL process: Extraction, Transformation, and
Load. Extraction is the most challenging aspect of the process, where we must connect the
online vision API to Microsoft Azure. This allows us to feed images into Python so that the
script can pull information from the API and output a JSON file for each image. Python runs a
loop over all the images in the directory and produces thousands of files for analysis. Once we
have compiled these JSON files, we are ready to transform them into useful data. We have a
couple of options here: we can go through Amazon Web Services' MapReduce to compress the
data into one big file, then open that file in Microsoft Excel as a CSV and clean it up into a
proper dataset. This dataset is our primary source of analysis once we load it into IBM SPSS
Modeler. The load process takes us to the modeler,
where we can perform exploratory analysis and find key insights in the data.

[Figure 1.2: ETL pipeline. Data Extraction (Vision API; Python; JSON) → Data
Transformation/Classification (clean up data from the JSON output; AWS/Cloudera; Microsoft
Excel) → Data Loading (creation of CSV files; load into IBM SPSS Modeler; use of the SPSS
Modeler text mining application) → Data Analysis (classification; clustering; association;
word count/frequency)]

The key findings
are what will be useful to businesses in promoting their vacation packages more efficiently
and spending resources where they can best maximize revenue.
Analytical Methods
We are looking to use three main categories of analytics: classification, clustering, and
association. Classification will allow us to understand how categories of color, objects, age,
and scores appear together across a large collection of images. Which image characteristics
are most common in social media images? Are pictures more commonly taken of the younger
generation than the older generation? Classification can help us understand these trends.
Another tool is clustering, which lets us group together characteristics of images that are
closely related to each other, such as colors or image categories. The association method gives
us connection analysis on images of various categories: age, gender, and color attributes can
be analyzed to see which combinations of characteristics occur together. Rules associating
image characteristics should have a high confidence percentage to indicate how closely they
are related. One example: are buildings more often photographed on their own, or as a
background behind faces?
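As a toy illustration of the confidence measure, with invented tag sets, the confidence of a rule A → B is the share of images containing A that also contain B:

```python
def confidence(images, antecedent, consequent):
    """confidence(A -> B) = support(A and B) / support(A),
    where each image is represented by its set of tags."""
    with_a = [tags for tags in images if antecedent <= tags]
    if not with_a:
        return 0.0
    return sum(1 for tags in with_a if consequent <= tags) / len(with_a)

# Invented tag sets: of the three images tagged "building",
# two also contain a face.
images = [
    {"building", "face"},
    {"building"},
    {"building", "face", "street"},
    {"beach", "water"},
]
rule_conf = confidence(images, {"building"}, {"face"})
```

A real association miner (such as Apriori in SPSS Modeler) additionally prunes rules by support before computing confidence.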
Output/Results
We were interested in uncovering insightful results through exploratory analysis and
seeing whether we could identify any patterns or trends throughout the dataset of image
information. To start, we built a simple model in Microsoft Azure to produce descriptive
statistics for the variables most relevant to the insight. Figure 1.3 shows the different nodes
connected as we imported our dataset in the reader node and connected it to the Project
Columns node to select the columns we were most interested in. The Project Columns node
resembles a "filter" node in other applications and lets us concentrate on the specific variables
of value to us. Lastly, the Descriptive Statistics node had to be connected to the Project
Columns node in order to output the statistical values of our categories for analysis.
Figure 1.3
In order to get deeper insights, we needed to drill down into the data set by slicing and
dicing it. For instance, an interesting aspect to examine is consumer composition by gender,
gender association, and age group. Figure 1.4 shows the different slices we created for our
analysis. For example, we ran the following query to isolate records about images with two
male faces.
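The query itself is not reproduced here; a minimal SQLite sketch of the idea, run through Python's sqlite3, might look as follows (the table layout and the columns faces_count, face1_gender, and face2_gender are assumptions about the flattened schema):

```python
import sqlite3

# In-memory stand-in for the flattened image table (schema assumed).
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE images (
    id INTEGER, faces_count INTEGER, face1_gender TEXT, face2_gender TEXT)""")
conn.executemany("INSERT INTO images VALUES (?, ?, ?, ?)", [
    (1, 2, "Male", "Male"),
    (2, 2, "Male", "Female"),
    (3, 1, "Female", None),
])

# Isolate records about images with exactly two male faces.
two_males = conn.execute("""
    SELECT id FROM images
    WHERE faces_count = 2
      AND face1_gender = 'Male'
      AND face2_gender = 'Male'
""").fetchall()
```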
Another important use for queries is to separate trustworthy data from noisy data. This
matters because, although computer vision is getting more accurate day by day, it is still far
from 100% accurate. Therefore, we need to filter out data with very low prediction accuracy,
as well as categories that are too generic for any meaningful analysis. Take the following
query as an example: it removes records in which the ComputerVision API produced a
prediction accuracy of lower than 10% in the first object category, and also takes out records
categorized as "abstract" or "others".
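A hedged sketch of such a filter in SQLite, again through Python's sqlite3 (the column names category1 and category1_score are assumed, with scores taken as fractions between 0 and 1):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE images (id INTEGER, category1 TEXT, category1_score REAL)")
conn.executemany("INSERT INTO images VALUES (?, ?, ?)", [
    (1, "outdoor_water", 0.82),
    (2, "abstract", 0.55),
    (3, "building", 0.04),
    (4, "others", 0.90),
])

# Keep only records with a usable first-category prediction:
# a score of at least 10%, and not a catch-all label.
trusted = conn.execute("""
    SELECT id FROM images
    WHERE category1_score >= 0.10
      AND category1 NOT IN ('abstract', 'others')
""").fetchall()
```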
Figure 1.5 shows the detailed results of the various statistics. The count, median, mode,
range, min, max, and average are displayed for numeric variables. For variables such as
category score and face age, where numeric values are given, we can identify certain patterns:
for instance, the average face age produced from this collection of images is around 30, though
face ages vary from as low as 1 to as high as 96.
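These descriptive statistics are easy to reproduce with Python's statistics module; the ages below are an invented sample, not the real data:

```python
import statistics

# Illustrative face-age sample, not the real 150,000-record data set.
face_ages = [1, 22, 28, 30, 31, 35, 96]

summary = {
    "count": len(face_ages),
    "min": min(face_ages),
    "max": max(face_ages),
    "mean": statistics.mean(face_ages),
    "median": statistics.median(face_ages),
}
```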
Clustering analysis helps identify similar characteristics and group them together. With
image clustering, one can identify the different types of images that are similar to each other
through various measurements, one of them being the Euclidean distance, which allows us to
determine how far one cluster is from another. Figure 1.6 shows how the cluster model was
built in Microsoft Azure. A reader and a Project Columns node were once again inserted to
filter down to the selected variables; we were most interested in three category columns along
with their main color categorization. Then we added the Train Clustering Model module in
order to train the variables into forming four different clusters. This process took some extra
time, since the model had to be trained before we could extract results from it. Even so,
running it through Azure was about five times as fast as running it through other platforms
such as IBM SPSS Modeler, a major advantage from a big data perspective: 120,000+ images
can be processed more quickly through Azure than through other platforms. The K-Means
Clustering module was added to cluster our final results together, and the Metadata Editor was
used to label the clusters with the numeric values 1 through 4.
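K-Means itself alternates two steps: assign each point to its nearest centroid, then move each centroid to the mean of its members. A minimal pure-Python sketch on toy two-dimensional features (Azure ML Studio's implementation is, of course, more elaborate):

```python
import math
import random

def kmeans(points, k, iterations=20, seed=0):
    """Minimal K-Means: returns a cluster index for each point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(d) / len(members)
                                     for d in zip(*members))
    return labels

# Two well-separated toy groups, standing in for encoded image features.
points = [(0.0, 0.1), (0.1, 0.0), (0.0, 0.0),
          (5.0, 5.1), (5.1, 5.0), (5.0, 5.0)]
labels = kmeans(points, k=2)
```

Real image features (categories and colors) would first be one-hot encoded into numeric vectors before distances are meaningful.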
Figure 1.6
The clustering results were extracted into CSV format through the K-Means Clustering node.
From there we compiled the CSV table into the clustering results shown in Figure 1.7. Each of
the four clusters groups categories and colors whose characteristics are in close proximity to
each other. The clusters are all distinctly different and show the types of category names
associated with each set of colors.
Figure 1.7
Cluster 1: Outdoor, Building, Street, Tree, Text (colors: Grey, Black)
Cluster 2: Food, Drinks, Crowd, People (colors: Yellow, Blue, Green, Black)
Cluster 3: Abstract, Others
Cluster 4: Beach, Water, Sky (colors: Blue, White)
Scope & Limitations
The main limitation of this project is the amount of data: with more data, the scores
generated would be more robust, resulting in greater precision and accuracy in analyzing the
images. That our data consists of images from just one source is another limitation. While the
old adage that "a picture is worth a thousand words" may hold true, we hope to narrow results
down to the most important descriptive elements for analyzing an image. Additionally, due to
the lack of specificity of the ComputerVision API, some classifications of the data could not
be performed for enhanced insights.
Policy/Managerial Implications
Improved picture recognition and description can allow managers to bolster "social
listening" on today's media platforms to better understand the context, sentiment, and meaning
behind why a picture is shared. With enhanced image recognition, more informed business
decisions, such as product innovations and changes, marketing campaigns, and promotional
offers, can be modeled on successfully targeted segments. Moreover, this project can help
identify which elements of an image make it go viral. Greater analysis and understanding of
what customers photograph and share also allows a business to build better product search for
its customers.
Conclusions & Future Research
Images are the new text on the web. They are easy to share and more engaging than text,
the trend will continue in their favor, and we believe that analysis of images will grow
tremendously in the coming years. Expanding on the importance of social listening, more
insights can be drawn from interpreting pictures. Enhancements in image perspective analysis,
such as GPS and sentiment overlays, will improve clustering and classification in order to
better predict what appeals to specific customers and thereby increase sales. Snapchat, an
ephemeral photo and video sharing app, currently charges $400,000 for ad space on a story
generating 20 million views. Meanwhile, Facebook has used its massive store of photos to
develop a way to recognize people even when their faces are obstructed, identifying
individuals with 83% accuracy using a method dubbed PIPER, an acronym for pose invariant
person recognition. As the quantity of images shared online increases, the quality of the
algorithms processing photos will bolster analysis. Why pictures were taken, what their
important elements are, what sparked the moment, and how to better react and cater to
customer desires are the driving forces behind how image analytics will proceed in the future.
Sources
1. http://www.fastcompany.com/3000794/rise-visual-social-media
2. http://blogs.adobe.com/digitalmarketing/social-media/visual-social-snapchat-pinterest-and-the-rise-of-media-rich-marketing/
3. http://wersm.com/visual-web-the-next-big-thing/
4. http://blogs.wsj.com/digits/2015/06/23/facebook-claims-photo-recognition-breakthrough/
5. http://recode.net/2015/06/17/snapchats-making-some-pretty-serious-money-from-live-stories/
6. https://www.esomar.org/uploads/industry/reports/global-market-research-2014/ESOMAR-GMR2014-Preview.pdf
7. http://skift.com/2015/02/20/priceline-and-expedias-advertising-arms-race-in-2014/
8. Mary Meeker, 2014 Internet Trends Report
9. https://www.forrester.com/Big+Datas+Big+Meaning+For+Marketing/quickscan/-/E-res114782
10. http://www.forbes.com/sites/groupthink/2015/05/01/visual-listening-social-medias-next-frontier/3/