Project report on the design and build of a data warehouse from unstructured and structured data sources (Quandl, yelp and UK Office for National Statistics) using SQL Server 2016, MongoDB and IBM Watson. Design and implementation of business intelligence visualisations using Tableau to answer cross domain business questions
Data Warehouse Project Report
Tom Donoghue
x16103491
19 December 2016
MSCDAD
Data Warehousing and Business Intelligence
2. CA2 Data Warehouse Project Report
Tom Donoghue v1.0 Page 1
Table of Contents
Introduction.....................................................................................................................2
Objectives................................................................................................................................2
Project Scope ...........................................................................................................................2
Data Warehouse Architecture and Implementation ..........................................................3
The Data Model .......................................................................................................................3
Slowly Changing Dimensions ..........................................................................................................5
Type of Fact table............................................................................................................................5
High Level Model Diagram..............................................................................................................5
ETL Method and Strategy..................................................................................................8
ETL Environment ......................................................................................................................8
Data Sources ...................................................................................................................................9
Staging and Data Warehouse ETL............................................................................................10
Visits..............................................................................................................................................10
Currency Strength.........................................................................................................................11
Business Reviews ..........................................................................................................................13
Edinburgh Visits ............................................................................................................................15
Time ..............................................................................................................................................16
Case Studies...................................................................................................................17
Visitor Nationalities Traveling to the UK and Edinburgh...........................................................17
Currency Strength Impact on Visits and Spend ........................................................................18
Business Review Entity Extraction...........................................................................................19
References .....................................................................................................................20
Introduction
The purpose of this document is to report on the Data Warehousing project undertaken to deliver a
proof of concept data warehouse. This report is split into the following sections: Data Warehouse
Architecture and Implementation, ETL Method and Strategy, and Case Studies.
Objectives
The objectives of the project are outlined below:
Design and implement a data warehouse to answer 3 case studies, illustrating the usefulness of a
data warehousing solution
Use 3 or more sources of data
Use Business Intelligence queries and outputs to demonstrate and support the case studies
Project Scope
The scope of the project covers the 3 case studies which are described below and in the following
context diagram.
HandleBig Events want to know whether they should seriously consider holding their next US-Australian
trade symposium in Edinburgh. They have offices in New York, Sydney and Dublin and would like to provide
some useful feedback to these offices to help them build initial promotional ideas. Our task is to help
them make better-informed decisions using the case studies (described in the Case Studies section)
and the prototype data warehouse containing the sourced data.
Data Warehouse Architecture and Implementation
The architecture and design approach taken for this project follow the principles of data warehousing
promoted by Kimball, Ross, Thornthwaite, Mundy and Becker (2008). The primary reason for taking
the Kimball approach is based on the need to swiftly design and implement a working proof of concept
data warehouse. The scope of the project is narrow with a tight timescale, which favours using
dimensional modelling over normalised relational modelling.
Data warehouse data functions as a story about past events, designed to support decision making by
serving up answers in grouped and aggregated forms which are more meaningful, and therefore more
important, to the business. Providing rollup, drilldown and cross views of the data (typical OLAP
operations) requires complex queries which impact performance and may also add a maintenance overhead
each time a new business question occurs. The data warehouse must also ingest data from disparate
sources, which need to be merged to create the desired outcomes. To overcome these issues, the data
warehouse is designed using dimensional modelling. Data organised multidimensionally is fashioned in
such a way that it serves a different business purpose to the usual OLTP operational database
(Chaudhuri and Dayal, 1997).
Adopting a methodology will produce a result, but the success of the result depends on how the
methodology is executed to meet a set of business requirements. As Ariyachandra and Watson (2006)
note, whether the data warehouse architecture proposed by Kimball or that proposed by Inmon is
better was, and still is, an ongoing debate. The authors investigated five main data warehouse
architectures in their studies. In terms of their classification, our prototype data warehouse
implementation is probably closest to the type described as an Independent Data Mart.
Independent Data Marts were often frowned upon as an inferior architectural solution in operational
production environments. However, they represent a good fit for prototyping and proof of concept
work due to their relative simplicity and short lead time to deploy, and, as the authors conclude,
they may make a valid contribution as part of a larger hybrid data warehouse solution.
The diagram below shows the elements which comprise our prototype data warehouse architecture:
Source data is ingested and processed by the Extract, Transform and Load (ETL) process, which
populates the staging area (detailed in the ETL section below) and subsequently the data
warehouse. The data warehouse provides the business intelligence results to business user queries.
The Data Model
The data model was constructed using dimensional modelling, which according to Kimball et al. (2008)
is an applicable way to best satisfy business intelligence needs, as it meets the underlying objectives
of timely query performance and unambiguous meaningful results. The dimensional model contains
dimensions and facts. Facts record business measurements that tend to be numeric and additive.
Dimensions record logical sets of descriptive attributes and are bound to the facts, enabling the fact
measurements to be viewed in various descriptive combinations. The benefits of dimensional
modelling are twofold: it facilitates a multidimensional analysis domain via the exploration of fact
measures using dimensions, and the schema is far simpler because the dimensions are denormalised, which
in turn improves query performance and serves data that is instantly recognisable to the business user.
The resulting schema resembles a star shape, with the dimensions surrounding a single fact entity
(Kimball et al., 2008; Rowen, Song, Medsker and Ewen, 2001). Many data warehouse
implementations follow the star schema when describing and constructing the data model, as again
it addresses the goals of fast query performance, ease and speed of populating the data warehouse
(Chaudhuri and Dayal, 1997).
In the Kimball dimensional design process our first step is to choose the business process or
measurement event to be modelled, which in this case is Passenger Visits. To obtain an understanding
of this, a simple business statement was made:
“I want to be able to see the number of visits made by nationality, when they visited, how long they
stayed and how much they spent. I also want to get a handle on their mode of travel, purpose of visit
and how many people visit Edinburgh.”
This is a powerful way of identifying possible facts and dimensions associated with the visits data
source. However, the fact table grain needs to be defined before advancing further. Examining the
visits source data helped to define the grain, as each visit is recorded quarterly.
The grain should be defined as finely as possible: it is always possible to roll up from it (e.g.
quarters into half years and higher into years), but we will not be able to drill down any lower than
the selected grain. In this case, it is not possible to drill down lower than quarters (e.g. to months
or weeks, as these attributes are not present in the data). Therefore, the finest grain available in
the visits data is quarters.
Looking at the business statement above the dimensions start to appear:
Visits
Country
Nationality
Mode of Travel
Purpose of Visit
Edinburgh Visits
Time
Identifying the Facts can also be drawn from the statement:
Visits
Spend
Nights Stayed
There are also the three remaining data sources to cater for: Currency Rates, Business Reviews and
Edinburgh Visits. As the grain has been declared, these entities also need to follow the grain and
be at a quarterly level. This raised the following issues:
Reviews are recorded for any given date and therefore need to be massaged to fit the quarterly
grain, which is achieved by transforming the review data in the ETL stage.
Currency FX rates are obtained by quarter, which fits, but we have multiple currencies, and that
creates a many-to-many relationship. What does dimensional modelling offer to resolve this
dilemma? As this is a prototype, we strive to keep things simple by ensuring a one-to-many
relationship between dimensions and facts and maintaining the desired star schema. There are
alternatives, but these break our simple design and extend the effort needed to build the
additional joins required to satisfy the business queries (Rowen et al., 2001). To resolve this issue,
the currency data was transformed in the ETL stage and repurposed as “Currency Strength” (described
in detail in the ETL section) to adhere to the one-to-many objective and match the grain.
Edinburgh Visits data had the same many-to-many dilemma as the currency rates. Although rows are
recorded quarterly, there are multiple countries per quarter. The same solution of transforming
the data to match the grain was applied (also described in further detail in the ETL section).
The Time dimension also needs to follow the quarterly grain. SQL Server SSAS was used to generate a
Time dimension. However, the resulting dimension needed to be modified with an extra column to
cater for the exact quarterly representation required to join to the fact table.
Slowly Changing Dimensions
What method of updating the data in the dimensions and facts best suits the prototype data
warehouse? Keeping the objective of simplicity in mind, we opt for Kimball Type 1: overwrite the
dimension attribute. Type 1 means that the data warehouse will be completely overwritten each time
the data requires a refresh. The impact of a Type 1 slowly changing dimension is that we lose all
history of the previous state of the data prior to the reload (Kimball et al., 2008). It is unlikely
that this would be the desired approach in a production data warehouse (depending on business
requirements), but it is acceptable for this proof of concept piece, as our source data are a
snapshot covering 2010 to 2016, comprising 27 quarters in total.
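The Type 1 behaviour described above can be sketched in a few lines of Python (an illustration only, using an in-memory dict to stand in for a dimension table; the actual warehouse performs the equivalent truncate-and-reload in its SSIS packages):

```python
# Hypothetical sketch of Type 1 "truncate and reload": the dimension is
# completely overwritten on each refresh and no prior state survives.
def type1_reload(dimension, fresh_rows):
    dimension.clear()             # truncate the table
    dimension.update(fresh_rows)  # reload from the fresh extract
    return dimension

dim_country = {1: {"code": "FR", "name": "France"}}
type1_reload(dim_country, {1: {"code": "FR", "name": "France (Republic of)"}})
# the previous attribute value has been overwritten; no history is retained
```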
Type of Fact table
According to Kimball et al. (2008), measured facts fall into one of three types of grain:
transactions, periodic snapshots or accumulating snapshots. Our prototype model is aligned to the
periodic snapshot type, as measures are recorded each quarter for a set number of quarters (the visits
data source is by quarter). No further updates are applied to the fact table rows once the table has
been populated.
High Level Model Diagram
Using the dimensions that were identified from the earlier business statement a high level model was
created and is illustrated below:
This is our star schema, comprising the central fact table “Travel” surrounded by the dimensions. The
grain is also defined.
The next stage is to identify the dimension attributes and the fact measures. This was achieved by
taking each data source in turn and asking whether the associated attributes and measures contributed to
the questions being asked in the case studies. The following images show the source data and the
dimension attributes (refer to the ETL section for further detail).
Visitor data
The following dimensions were created from the Visitor source data during the dimensional
modelling:
Country Attribute Format
Country Id Integer PK
Country Code Text
Country Name Text
Mode Attribute Format
Mode Id Integer PK
Mode Code Text
Mode Name Text
Mode Detail Text
Nationality Attribute Format
Nationality Id Integer PK
Nationality Code Text
Nationality Name Text
Purpose Attribute Format
Purpose Id Integer PK
Purpose Code Text
Purpose Name Text
For the prototype, two separate Country and Nationality dimensions were created rather than a
single dimension. The reason for this was that the data is grouped inconsistently (e.g. a
nationality of “Other EU”, with no information as to which countries this refers to) and to retain
the data’s original meaning. In a production scenario, the countries and nationalities would possibly
be rationalised and consolidated into a single dimension and transformed to use an ISO country code as
a key.
Some of the data from the data source was excluded as it was not required to satisfy the 3
business cases. However, this is not to undervalue its potential contribution in a full production
data warehouse.
Edinburgh Visits
Edinburgh Visits Attribute Format
Visit Id Integer PK
Visit Date YYYYQQ
Visit Count Integer
Currency Rates
By quarter for US Dollar, Australian Dollar and Euro.
Currency Strength Attribute Format
Currency Strength Id Integer PK
Currency Strength Date YYYYQQ
Currency Strength Text
Business Reviews
This data source comprises unstructured data which will undergo entity extraction to derive the
following required attributes:
Review Attribute Format
Review Id Integer PK
Review Date YYYYQQ
Review Count Integer
Name of Business Text
Nationality Id Integer
Entity Text Text
Entity Type Text
The Fact Table
The fact table is required to store the following measures:
Fact Measure
Visits Units (visits)
Spend Units (GBP)
Nights Stayed Units (nights)
Edinburgh Visits Units (visits)
The motivation for dimensional modelling in the context of data warehouse architecture may be
summarised as follows. Understandability: the dimensional view of consolidated data is already
recognisable to the business user. Query performance: gains are obtained using star joins and
flatter denormalised table structures. Dimensions are the pathway to measures in the fact table,
which can absorb a myriad of unknown queries that users may devise over time. Extensibility: as new
data arrives, the dimension is capable of taking on the change either as a new row of data or by
altering the table (Kimball et al., 2008). Finally, the Business Intelligence tools used to answer
the 3 case studies make use of the dimensional model designed in this project.
ETL Method and Strategy
This section describes the data sources, how they were extracted, and the steps taken to transform
and load the required data into the data warehouse. This phase of the project took a considerable
amount of time to complete, which, as Kimball et al. (2008) point out, may swallow up to 70% of the
time and work expended in implementing a data warehouse. Kimball et al. (2008) suggest that taking a
haphazard approach to the ETL is likely to end in a tangle of objects which have multiple points of
failure and are difficult to fathom out. There are many tools which can assist the ETL phase.
According to Vassiliadis, Simitsis, Georgantas, Terrovitis and Skiadopoulos (2005), the primary
activities that these tools cover are: (a) recognition of viable data in the source data,
(b) obtaining this information, (c) creating a tailored and consolidated view of numerous data
sources resulting in a unified format, (d) cleansing and massaging data into shape to fit the
business and target database logic and (e) populating the data warehouse.
The diagram below illustrates a high level view of the ETL landscape covered by the project scope:
ETL Environment
Prior to performing the extraction, the database environment was created. This consisted of two
databases: staging and data warehouse. The databases were partitioned to ensure that data
undergoing further exploration, cleaning and transformation was kept separate from the “clean”,
prepared data in the data warehouse environment. The purpose was to assist overall ETL management
using a simple 2-phase approach. Source data is extracted, undergoes initial transformation and is
loaded into the staging tables.
The data is further examined and then undergoes a second transformation before finally being loaded
into the data warehouse database.
This iterative approach was followed to examine and refine the quality of the data destined for the
data warehouse. On early ETL runs as new issues occurred, the incidents were investigated, resolution
sought and modification made to the appropriate ETL package to resolve the incident. The various ETL
changes are discussed in the following sections.
When the ETL packages were fully tested and producing the expected results, they were merged into
logical steps to form an ETL workflow. This resulted in a workflow for each of the data sources
and a separate ETL package to load the data warehouse’s fact table.
The diagram below illustrates the Visits ETL, using this phased ETL design process (authored in SSIS).
As mentioned in the dimensional modelling section, the tables are truncated on each package
execution; no history is retained.
Data Sources
The table below shows the datasets that were sourced.
Name       Description                            Source                          Type of Data
Visits     International Passenger (IPS) and
           Edinburgh visits                       Visit Britain (2016)            Structured
Currency   Currency FX Rates                      QuandlAPI (2016)                Semi-Structured
Reviews    Business Reviews                       Yelp Dataset Challenge (2016)   Unstructured
Visits
The IPS visit data (uk_trend_unfiltered_report) was obtained as a CSV containing quarterly rows from
2002 to 2015. The Edinburgh visit data (detailed_towns_data_2010_-_2015) was also obtained as a
CSV. The files were downloaded from Visit Britain (2016). The datasets were originally created
from the International Passenger Survey data (UK Office for National Statistics, 2016).
Currency
Currency FX rates were obtained using QuandlAPI (2016) to extract average quarterly FX rates for the
Pound Sterling against the US Dollar, Australian Dollar and the Euro. Quarterly data was extracted for
the period 2009 to 2016.
Reviews
Business reviews were obtained from round 8 of the Yelp dataset challenge download (Yelp Dataset
Challenge, 2016). The dataset was downloaded and unzipped to produce a JSON file for each entity.
Staging and Data Warehouse ETL
The ETL process for each of the data sources is described as follows:
Visits
The source CSV files were examined in OpenRefine (2016) to identify the data to be extracted and to
quickly check for format inconsistencies and missing data. OpenRefine was used to reformat the
quarter rows from quarters represented as month names, e.g. “January-March”, to QQ format, e.g.
“01”.
The decimal values were converted back to integers and the input data was mapped to the respective
columns of the Visits table in the staging database.
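The quarter relabelling step can be expressed as a simple lookup; the sketch below is illustrative only, as the actual transformation was performed interactively in OpenRefine:

```python
# Illustrative mapping of the quarter labels found in the source CSV
# (e.g. "January-March") to the QQ format used in the warehouse (e.g. "01").
QUARTER_LABELS = {
    "January-March": "01",
    "April-June": "02",
    "July-September": "03",
    "October-December": "04",
}

def quarter_to_qq(label):
    # raises KeyError for any unexpected label, surfacing bad source rows
    return QUARTER_LABELS[label]
```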
The staging dimensional tables Country, Nationality, Mode and Purpose were populated using the
Visits staging table from the previous step.
The Country ETL is described below (the same process was followed for the Nationality, Mode and
Purpose tables). The target Country table was truncated; the country narratives were taken from the
Visits table, sorted and de-duplicated. A business country code column was assigned a value
of “Unknown” (this column was created for downstream use to hold business-friendly values, as none
were available at ingestion; the default of “Unknown” was assigned rather than leaving it blank
or NULL). The rows were then inserted into the Country table with a unique integer key assigned by
SQL on insert. The data warehouse ETL package truncates the DimCountry table and loads it using the
staging Country table as the source. Again, SQL assigns a unique integer key to each row inserted,
and this is the surrogate key that will be used as the foreign key in the fact table.
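The dedupe-and-key steps of the Country load can be illustrated on plain Python data (a sketch with assumed column names; the real load is an SSIS package inserting into SQL Server, where the key is assigned on insert):

```python
# Build a small country dimension from staged visit rows: sort, de-duplicate,
# default the business code to "Unknown" and assign an incrementing surrogate key.
def build_country_dim(visit_rows):
    names = sorted({row["country"] for row in visit_rows})
    return [{"country_id": key, "country_name": name, "country_code": "Unknown"}
            for key, name in enumerate(names, start=1)]

staged = [{"country": "France"}, {"country": "Australia"}, {"country": "France"}]
country_dim = build_country_dim(staged)
```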
Currency Strength
The Currency Strength ETL is shown in the diagram below.
A script created in R was used to obtain average quarterly currency rates via QuandlAPI (2016),
as shown in the code snippet below. The call is repeated to get the US and Australian Dollar
values. The quarterly difference for each currency is calculated. The last row of the 2009 quarter
used in the calculation contained “NA” and was replaced with a dummy value (the entire year 2009
is discarded downstream as it is not required). The currency code and narrative are added to the data
frame before it is written out to the respective currency CSV file.
The R script is called by the Currency ETL package.
Once the CSV files are created, the data is extracted, the date is reformatted to the desired
quarterly format YYYYQQ, and the rows are inserted into the staging Currency table.
The desired currency rows are selected from the Currency table, grouped by date, and the rate
differences are summed. The Currency Strength is calculated and the rows are then inserted into the
staging Currency Strength table. The final package is run to load the Currency Strength dimension
table in the data warehouse.
Currency Strength is a measure of the strength of GBP against a basket of 3 currencies, namely USD,
EUR and AUD. The value of the indicator is either “UP” or “DOWN”. “UP” indicates a strong pound
relative to the basket, and “DOWN” indicates a weak pound relative to the basket of currencies. For
overseas visitors to the UK, a “DOWN” position should be more favourable (bearing in mind that the
basket could be shielding a currency that has moved the other way, e.g. USD and EUR are strong but a
very weak AUD has caused the overall value of the basket to be negative).
The Currency Strength is calculated by taking the average quarterly exchange rate of Pound Sterling
against the 3 major currencies (USD, EUR and AUD) and obtaining the quarterly difference for each
currency pair. The currency pair differences are summed to provide the basket value which, if
positive, sets the Currency Strength indicator to “UP”; otherwise it is set to “DOWN”. In the
currency dataset no quarterly difference of zero was found; had this been the case, the indicator
would have been set to “NO CHANGE”.
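The indicator logic just described can be captured in a short function (a sketch only; the actual calculation runs inside the Currency Strength SSIS package):

```python
# Classify the GBP basket from the summed quarterly rate differences of the
# three currency pairs (GBP/USD, GBP/EUR, GBP/AUD).
def currency_strength(pair_differences):
    basket = sum(pair_differences)
    if basket > 0:
        return "UP"        # pound strengthened against the basket
    if basket < 0:
        return "DOWN"      # pound weakened against the basket
    return "NO CHANGE"     # never observed in our dataset, but handled

currency_strength([0.021, 0.008, -0.004])  # strong USD/EUR moves outweigh a weak AUD
```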
Business Reviews
To facilitate extraction of the business review data (the project’s unstructured data, supplied
in the downloaded JSON files), a document-based database, MongoDB, was used. MongoDB was installed
on the same virtual machine as SQL Server to maintain a self-contained environment. The files were
imported into the yelp database using mongoimport, based on a tip from Eniod's Blog (2015) on
working with the Yelp dataset.
mongoimport --db yelp --collection businesses --file yelp_academic_dataset_business.json
mongoimport --db yelp --collection review --file yelp_academic_dataset_review.json
Using Python and pymongo, two scripts were created. The first script extracts Edinburgh businesses,
retrieves the associated reviews dated from 2010 to 2016 and inserts them into a new collection. The
second script reads the new collection and sends each review text for entity extraction using the
AlchemyAPI (2016). The result of each entity extraction is stored in a dataframe, to which a random
nationality code is added (to associate a review with the Visits nationality data; this addition
makes our reporting more interesting as it provides a link to the nationality of the reviewer).
Once the entity extraction is complete, the results are written to a CSV file which is then
processed through SSIS. The scripts can be configured to set the number of businesses and
associated reviews to extract (this assisted testing and limited the API calls, as AlchemyAPI (2016)
sets a daily transaction limit).
It was noticed that the Yelp dataset had businesses with a review count greater than zero but no
corresponding documents in the review collection. The scripts could be improved in the future to
handle this exception; the workaround for the few businesses in error was to update their review
count to zero in the business collection.
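A consistency check for this exception might look like the following (a sketch operating on plain dicts rather than live pymongo cursors; field names follow the Yelp dataset):

```python
# Find businesses whose stated review_count is positive but which have no
# documents in the review collection.
def orphan_businesses(businesses, reviewed_business_ids):
    return [b["business_id"] for b in businesses
            if b["review_count"] > 0 and b["business_id"] not in reviewed_business_ids]

sample = [{"business_id": "b1", "review_count": 3},
          {"business_id": "b2", "review_count": 0},
          {"business_id": "b3", "review_count": 5}]
orphans = orphan_businesses(sample, {"b1"})  # only b1 has review documents
```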
AlchemyAPI (2016) provides an entity extraction API which is used to discover objects in the textual
business reviews such as people, names, places and businesses (Meo, Ferrara, Abel, Aroyo and
Houben, 2013).
The Python script used to obtain Edinburgh business reviews from MongoDB appears below, followed by the entity extraction script:
#!/usr/bin/env python2.7
# Connects to MongoDB and extracts Edinburgh businesses. The number of
# businesses and of associated reviews is limited for testing; the extracted
# reviews are inserted into a new collection for downstream entity extraction.
from random import randint
import pymongo
from pymongo import MongoClient

client = MongoClient('localhost', 27017)
db = client.yelp
businesses = db.businesses
reviews = db.review

# get the top-starred Edinburgh businesses that have at least one review
aBus = businesses.find(
    {"city": "Edinburgh", "review_count": {"$gt": 0}},
    {"business_id": 1, "name": 1, "categories": 1}
).sort("stars", pymongo.DESCENDING).limit(2)  # set to 80 for live run

collReviews = []
for busKey in aBus:
    print busKey['business_id']
    print busKey['name']
    # for each business get its reviews from 2010 onwards, newest first
    aReview = reviews.find(
        {"business_id": busKey['business_id'],
         "review_id": {"$exists": True},
         "date": {"$gt": "2009-12-31"}},
        {"review_id": 1, "date": 1, "text": 1, "business_id": 1}
    ).sort("date", pymongo.DESCENDING).limit(3)  # set to 100 for live run
    for item in aReview:
        # randomly assign a nationality code to each review so it can be
        # linked to the Visits nationality data in the warehouse
        nationalityId = randint(1, 75)
        collReviews.append({"text": item['text'],
                            "review_id": item['review_id'],
                            "date": item['date'],
                            "name": busKey['name'],
                            "business_id": item['business_id'],
                            "nationality_id": nationalityId})

# insert the extracted reviews into the new collection
for rec in collReviews:
    db.dwreviewvideo.insert(rec)
print 'End of Pgm'
Extract Entities Script
#!/usr/bin/env python2.7
# Reads the extracted review collection, sends each review text to the
# AlchemyAPI entity extraction service and writes the results to a CSV.
import time
import pandas as pd
import pymongo
from pymongo import MongoClient
from watson_developer_cloud import AlchemyLanguageV1

alchemy_language = AlchemyLanguageV1(api_key='deleted')
client = MongoClient('localhost', 27017)
db = client.yelp
# select which collection to run entity extraction on
reviews = db.dwreviewvideo

# get reviews, newest first
curReview = reviews.find(
    {}, {"text": 1, "date": 1, "name": 1, "nationality_id": 1}
).sort("date", pymongo.DESCENDING).limit(521)  # set to 521 for live run

mylist = []
# loop through the cursor and call the entity extraction API
for yReview in curReview:
    text = yReview['text'].encode('utf-8')
    # get the entities for this review
    response = alchemy_language.entities(text)
    # wait between calls to stay within the API rate limit
    time.sleep(2)
    # add each extracted entity to a list of dicts
    for item in response['entities']:
        mylist.append({'type': item['type'],
                       'text': item['text'].encode('latin-1'),
                       'count': item['count'],
                       'date': yReview['date'],
                       'name': yReview['name'],
                       'nationality_id': yReview['nationality_id']})

# assign the list to a dataframe for ease of outputting a CSV of the results
df = pd.DataFrame(mylist)
df.to_csv('C:dwDataSetsyelpEntities2.csv', index=False)
print 'End of entity extraction'
Using the created CSV, the data is extracted, the date is reformatted to YYYYQQ, and the Nationality
Id is used to look up the nationality name and add it to the output flow. The data is then inserted
into the staging Review table.
To update the data warehouse Review dimension, the reviews are transformed to obtain the review
with the highest count for each quarter (one to match each of the 27 quarters), using a crafted SQL
script to update the staging table with an incremented row count. The row number in the subselect is
set to limit the rows selected so that a review row matches each quarter.
update review
set reviewDateNo = Crownumber
from (
    select reviewId, reviewDate, reviewCount,
           ROW_NUMBER() over (partition by reviewDate
                              order by reviewDate, reviewCount desc) as Crownumber
    from (
        select reviewId, reviewDate, reviewCount,
               ROW_NUMBER() over (partition by reviewCount
                                  order by reviewDate, reviewCount desc) as rownumber
        from review
        group by reviewId, reviewCount, reviewDate
    ) tempQuery
    where tempQuery.rownumber < 200
    group by reviewDate, reviewCount, reviewId
) as reviewz
where reviewz.reviewId = review.reviewId
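The intended effect of this SQL, keeping for each quarter the review with the highest count, can be mirrored in a few lines of Python for clarity (illustrative row dicts, not the actual staging table):

```python
# For each quarterly reviewDate keep the row with the highest reviewCount.
def top_review_per_quarter(rows):
    best = {}
    for row in rows:
        quarter = row["reviewDate"]
        if quarter not in best or row["reviewCount"] > best[quarter]["reviewCount"]:
            best[quarter] = row
    return best

rows = [{"reviewId": 1, "reviewDate": "201001", "reviewCount": 4},
        {"reviewId": 2, "reviewDate": "201001", "reviewCount": 9},
        {"reviewId": 3, "reviewDate": "201002", "reviewCount": 2}]
top = top_review_per_quarter(rows)
```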
Edinburgh Visits
A mixture of Excel and OpenRefine (2016) was used to reshape the data. A row for each of the 27 quarters is required to meet the grain. Visits from the following countries are summed to provide a quarterly count for each country: US, Australia, France, Germany, Ireland, Spain, Netherlands, Italy, Poland, Belgium, Greece, Austria and Portugal.
The summed and reshaped data is shown below. The original visit count was in thousands, so it was multiplied by 1000. If a blank was found in the original data, it was assigned a zero.
There was no data for 2016, so the 2016 quarters were created by averaging each quarter across 2010 to 2015. The result was a total count of visitors (for the selected basket of countries) by quarter. Visits to towns are based on the towns visitors report spending at least one night in during their trip.
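The reshaping steps above (scaling the thousands, zero-filling blanks, and averaging the prior years to create the 2016 quarters) can be sketched in pandas. The sample figures and column names here are purely illustrative, not the actual ONS data.

```python
import pandas as pd

# illustrative quarterly visit counts (in thousands) for one country;
# the real data comes from the reshaped ONS spreadsheet
visits = pd.DataFrame({
    'year':    [2010, 2010, 2011, 2011],
    'quarter': [1, 2, 1, 2],
    'US':      [120.0, None, 130.0, 150.0],
})

# the source figures are in thousands, so zero-fill blanks and scale up
visits['US'] = visits['US'].fillna(0) * 1000

# 2016 had no data: create it by averaging each quarter over the prior years
avg_2016 = visits.groupby('quarter', as_index=False)['US'].mean()
avg_2016['year'] = 2016
visits = pd.concat([visits, avg_2016], ignore_index=True, sort=False)
```

The `groupby` on quarter mirrors the averaging done manually in Excel: each synthetic 2016 quarter is the mean of that quarter over 2010-2015.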
Time
The Time dimension was generated in SSAS and exists only in the data warehouse database. However, as mentioned above, a new column was needed to hold the exact quarterly representation (in the YYYYQQ date format) required to join to the fact table. This was achieved using the following SQL, which was run in SSMS.
update t
set t.quarterFactDate =
    CONVERT(varchar(4), DATEPART(year, t.PK_Date)) +
    RIGHT('0' + CONVERT(varchar(2), DATEPART(quarter, t.PK_Date)), 2)
from Time t
Fact Table - Travel Fact
The Travel Fact table also exists only in the data warehouse database. The ETL created for the fact
table is shown below.
The ETL must extract the surrogate key from each dimension, gather the measures and merge the
data into the Travel Fact table. Each row inserted into the fact table must match the quarterly grain
defined during the dimensional modelling.
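The surrogate-key lookup step can be sketched in pandas. The dimension, staging and column names below are illustrative stand-ins for the SSIS lookups, not the actual warehouse schema.

```python
import pandas as pd

# illustrative dimension tables; real surrogate keys come from the warehouse
dim_nationality = pd.DataFrame({'nationalitySK': [10, 11],
                                'nationality': ['US', 'France']})
dim_time = pd.DataFrame({'timeSK': [1, 2],
                         'quarterFactDate': ['201001', '201002']})

# illustrative staging rows at the quarterly grain
staging = pd.DataFrame({'nationality': ['US', 'France'],
                        'quarterFactDate': ['201001', '201002'],
                        'visits': [120000, 95000],
                        'spend': [55000000, 40000000]})

# look up each dimension's surrogate key, then keep the keys plus measures
fact = (staging
        .merge(dim_nationality, on='nationality')
        .merge(dim_time, on='quarterFactDate')
        [['nationalitySK', 'timeSK', 'visits', 'spend']])
```

Each merge replaces a business key with its dimension surrogate key, so the resulting fact rows carry only surrogate keys and measures, matching the quarterly grain.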
The result of the ETL is the data warehouse database which is illustrated below.
The ETL made use of several methods and tools: manual operations with OpenRefine (2016) and Excel, automation via custom R and Python programs, and the integrated MongoDB and SQL Server tools, SSIS and SSMS. The process reflects the observations of Vassiliadis et al. (2005): the required data was identified in the sources and extracted; a unified format was created by consolidating across the various data sources (matching the grain); the data was cleansed and shaped to fit the business requirements; and finally it was loaded into the data warehouse.
Case Studies
The deployed cube is shown below; it was connected to Tableau Desktop (2016) to produce the
business intelligence charts that support the following case studies:
Visitor Nationalities Traveling to the UK and Edinburgh
How many US and Australian nationals travel to the UK, and how does this compare with several other EU nationalities? What are they spending? Of these visitors, how many visit Edinburgh? This information will assist our local offices to better assess and address the target market on their home ground.
The prototype data warehouse shows the amount spent and the visit figures for US, Australian and a
selection of EU nationalities (France, Germany, Ireland, Spain, Netherlands, Italy, Poland, Belgium) for
visits to the UK between 2010 and 2015.
The bar chart to the right compares the visit numbers, for each quarter for the same basket of nationalities, with figures for visits to Edinburgh between 2010 and 2015. There appears to be a positive correlation between visits to the UK and visits to Edinburgh. Further analysis would need to be conducted, examining possible causes of the fluctuations; for example, obtaining data about major events that may draw visitors to Edinburgh, or keep them away, would add value to the analysis. Further charts showing trend lines and variance (e.g. quarter on quarter and year on year) within and between both sets of visit data would also be of interest.
Currency Strength Impact on Visits and Spend
The business is concerned about the impact of Brexit and that overseas visitors may stay away due to the volatility of Sterling in its wake. Is it possible to provide any information from our data warehouse to allay these fears?
The charts above indicate the visitor and spend numbers in the light of the strength of Sterling in relation to the US Dollar, Australian Dollar and Euro basket of currencies. It appears that the currency strength does not deter visits or spend. Visitor numbers have increased over the five-year period, with clear seasonal fluctuations, and there appears to be a positive correlation between visits and spend. However, quarters 201502 and 201403 may warrant investigation: visits were higher in 201502 (6.407M) with lower spend (3.085B), while 201403 had fewer visits (6.232M) but higher spend (3.883B).
Business Review Entity Extraction
Finally, away from the symposium, it would be helpful to provide visitors with places to go and things
to see and do when in Edinburgh. Can we provide any points of interest in Edinburgh that will assist
them?
The treemap above shows entities extracted from Edinburgh business reviews between 2010 and 2015. The chart provides the entity name, business name, entity type, the reviewer's nationality and total visits to the UK for the quarter that the review relates to (data is not displayed if the space is not available, which is an issue when attempting to compare entities). Taking the entity Hanedan as an example: AlchemyAPI (2016) returned the entity as both a person and a city, when it is in fact a Turkish restaurant. The treemap highlighted this unusual pattern and prompted a web search to discover what Hanedan was. A treemap visualisation is useful for exposing patterns that could be of interest and warrant further investigation, and it works well for presenting small numbers of items. However, treemaps may present a confusing picture when the number of items displayed increases substantially (Tu and Shen, 2008).
References
AlchemyAPI (2016) Entity Extraction API [Online] Available at:
http://www.alchemyapi.com/products/alchemylanguage/entity-extraction [Accessed 10 November
2016].
Ariyachandra, T. and Watson, H.J. (2006) ‘Which Data Warehouse Architecture Is Most Successful?’.
Business Intelligence Journal, 11(1): p. 4.
Chaudhuri, S. and Dayal, U. (1997) ‘An overview of data warehousing and OLAP technology’. ACM
SIGMOD Record, 26(1): pp. 65-74.
Eniod's Blog (2015) Import Yelp dataset to MongoDB [Online] Available at:
https://haduonght.wordpress.com/2015/02/10/import-yelp-dataset-to-mongodb [Accessed 10
November 2016].
Kimball, R., Ross, M., Thornthwaite, W., Mundy, J. and Becker, B. (2008) The data warehouse lifecycle
toolkit. 2nd ed. Indianapolis: Wiley Publishing, Inc.
Meo, P., Ferrara, E., Abel, F., Aroyo, L. and Houben, G. (2013) ‘Analyzing user behavior across social
sharing environments’. ACM Transactions on Intelligent Systems and Technology (TIST), 5(1): pp. 14-
31.
OpenRefine (2016) A free, open source, powerful tool for working with messy data [Online] Available
at: http://openrefine.org/ [Accessed 10 November 2016].
QuandlAPI (2016) Quandl API Introduction [Online] Available at: https://www.quandl.com/docs/api
[Accessed 10 November 2016].
Rowen, W., Song, I.Y., Medsker, C. and Ewen, E. (2001) ‘An analysis of many-to-many relationships
between fact and dimension tables in dimensional modeling’. Proceedings of the International
Workshop on Design and Management of Data Warehouses (DMDW 2001). Interlaken, Switzerland,
4 June 2001.
Tableau Desktop (2016) Analytics that work the way you think [Online] Available at:
http://www.tableau.com/products/desktop [Accessed 10 November 2016].
Tu, Y. and Shen, H. (2008) ‘Balloon Focus: a Seamless Multi-Focus+Context Method for Treemaps’.
IEEE Transactions on Visualization and Computer Graphics, 14(6): pp. 1157-1164.
UK Office for National Statistics (2016) Methodology: International Passenger Survey background
notes [Online] Available at:
https://www.ons.gov.uk/peoplepopulationandcommunity/leisureandtourism/methodologies/internationalpassengersurveybackgroundnotes#sample-methodology [Accessed 10 November 2016].
Vassiliadis, P., Simitsis, A., Georgantas, P., Terrovitis, M. & Skiadopoulos, S. (2005) ‘A generic and
customizable framework for the design of ETL scenarios’. Information Systems, 30(7): pp. 492-525.
Visit Britain (2016) Inbound tourism trends by market [Online] Available at:
https://www.visitbritain.org/inbound-tourism-trends [Accessed 10 November 2016].
Yelp Dataset Challenge (2016) Yelp Dataset Challenge [Online] Available at:
https://www.yelp.com/dataset_challenge [Accessed 10 November 2016].