Data Warehouse Project Report
Tom Donoghue
x16103491
19 December 2016
MSCDAD
Data Warehousing and Business Intelligence
CA2 Data Warehouse Project Report
Table of Contents
Introduction
  Objectives
  Project Scope
Data Warehouse Architecture and Implementation
  The Data Model
    Slowly Changing Dimensions
    Type of Fact table
    High Level Model Diagram
ETL Method and Strategy
  ETL Environment
    Data Sources
  Staging and Data Warehouse ETL
    Visits
    Currency Strength
    Business Reviews
    Edinburgh Visits
    Time
Case Studies
  Visitor Nationalities Traveling to the UK and Edinburgh
  Currency Strength Impact on Visits and Spend
  Business Review Entity Extraction
References
Introduction
The purpose of this document is to report on the Data Warehousing project undertaken to deliver a
proof of concept data warehouse. This report is split into the following sections: Data Warehouse
Architecture and Implementation, ETL Method and Strategy, and Case Studies.
Objectives
The objectives of the project are outlined below:
• Design and implement a data warehouse to answer 3 case studies to illustrate the usefulness of a data warehousing solution
• Use 3 or more sources of data
• Use Business Intelligence queries and outputs to demonstrate and support the case studies
Project Scope
The scope of the project covers the 3 case studies which are described below and in the following
context diagram.
HandleBig Events want to know whether they should seriously consider holding their next US-Australian trade
symposium in Edinburgh. They have offices in New York, Sydney and Dublin and would like to provide
some useful feedback to these offices to help them build initial promotional ideas. Our task is to help
them make better informed decisions using the case studies (described in the Case Studies section)
and the prototype data warehouse containing the sourced data.
Data Warehouse Architecture and Implementation
The architecture and design approach taken for this project follow the principles of data warehousing
promoted by Kimball, Ross, Thornthwaite, Mundy and Becker (2008). The primary reason for taking
the Kimball approach is based on the need to swiftly design and implement a working proof of concept
data warehouse. The scope of the project is narrow with a tight timescale, which favours using
dimensional modelling over a normalised relational model.
Data warehouse data functions as a story about past events, designed to support decision making by
serving up a digest of answers in grouped and aggregated forms which are more meaningful, and
therefore more important, to the business. Providing rollup, drilldown and cross views of the data
(typical of OLAP operations) requires complex queries which impact performance and may also add
a maintenance overhead each time a new business question occurs. The data warehouse must also
ingest data from disparate sources which need to be merged to create the desired outcomes. To
overcome these issues, the data warehouse is designed using dimensional modelling. Data
organised multidimensionally is fashioned in such a way that it serves a different business purpose
to the usual OLTP operational database (Chaudhuri and Dayal, 1997).
Adopting a methodology will produce a result, but the success of the result depends on how the
methodology is executed to meet a set of business requirements. As Ariyachandra and
Watson (2006) note, which of the data warehouse architectures proposed by Kimball and Inmon is
better is, and remains, an ongoing debate. The authors investigated five main data warehouse
architectures in their study. With respect to their classification, our prototype data warehouse
implementation is probably closest to the type described as an Independent Data Mart.
Independent Data Marts were often frowned upon as an inferior architectural solution in operational
production environments. However, they do represent a good fit for prototyping and proof of concept
executions due to their relative simplicity and short lead time to deploy. Independent Data Marts may
make a valid contribution as part of a larger hybrid data warehouse solution as the authors conclude.
The diagram below shows the elements which comprise our prototype data warehouse architecture:
Source data is ingested and processed by the Extract, Transform and Load (ETL) packages, which populate the
staging area (this process is detailed in the ETL section below) and subsequently the data
warehouse. The data warehouse provides the business intelligence results to business user queries.
The Data Model
The data model was constructed using dimensional modelling, which according to Kimball et al. (2008)
is an applicable way to best satisfy business intelligence needs, as it meets the underlying objectives
of timely query performance and unambiguous meaningful results. The dimensional model contains
dimensions and facts. Facts record business measurements that tend to be numeric and additive.
Dimensions record logical sets of descriptive attributes and are bound to the facts, enabling the fact
measurements to be viewed in various descriptive combinations. The benefits of dimensional
modelling are that it facilitates a multidimensional analysis domain via the exploration of fact
measures using dimensions, and that the schema is far simpler because the dimensions are denormalised, which in
turn improves query performance and serves data that is instantly recognisable to the business user.
The resulting schema resembles a star shape, with the dimensions surrounding a single fact entity
(Kimball et al., 2008; Rowen, Song, Medsker and Ewen, 2001). Many data warehouse
implementations follow the star schema when describing and constructing the data model, as again
it addresses the goals of fast query performance, ease and speed of populating the data warehouse
(Chaudhuri and Dayal, 1997).
In the Kimball dimensional design process our first step is to choose the business process or
measurement event to be modelled, which in this case is Passenger Visits. To obtain an understanding
of this, a simple business statement was made:
“I want to be able to see the number of visits made by nationality, when they visited, how long they
stayed, how much did they spend? I also want to get a handle on their mode of travel, purpose of visit
and how many people visit Edinburgh.”
This is a powerful way of identifying possible facts and dimensions associated with the visits data
source. However, the fact table grain needs to be defined before advancing further. Examining the
appearance of the visits source data helped to define the grain, as each visit is recorded quarterly.
The grain should be defined as finely as possible: it is always possible to roll up from it (e.g. quarters into half
years and higher into years), but we will not be able to drill down any lower than the selected grain.
In this case, it is not possible to drill down lower than quarters (e.g. to months or weeks, as
these attributes are not present in the data). Therefore, the finest grain available in the visits data is
quarters.
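To make the roll-up behaviour of this grain concrete, the short sketch below (a minimal pandas illustration only; the column names and figures are assumed and are not part of the project ETL) shows quarter-grain rows keyed as YYYYQQ being aggregated up to years, while nothing finer than a quarter can be recovered from them.

import pandas as pd

# Quarter-grain rows keyed as YYYYQQ (illustrative values only).
visits = pd.DataFrame({
    "quarter_key": ["201001", "201002", "201003", "201004", "201101"],
    "visits": [1200, 1500, 1800, 1300, 1250],
})

# Rolling up is straightforward: derive the year from the quarter key and aggregate.
visits["year"] = visits["quarter_key"].str[:4]
print(visits.groupby("year", as_index=False)["visits"].sum())

# Drilling down below the grain is impossible: months and weeks were never captured,
# so no transformation of these rows can recover them.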
Looking at the business statement above, the dimensions start to appear:
• Visits
• Country
• Nationality
• Mode of Travel
• Purpose of Visit
• Edinburgh Visits
• Time
The facts can also be identified from the statement:
• Visits
• Spend
• Nights Stayed
There are also the three remaining data sources to cater for: Currency Rates, Business Reviews and
Edinburgh Visits. As the grain has been declared, these entities also need to follow the grain and
be at a quarterly level. This raised the following issues:
• Reviews are recorded for any given date and therefore need to be massaged to fit the quarterly
grain, which is achieved by transforming the review data in the ETL stage.
• Currency FX rates are obtained by quarter, which fits, but we have multiple currencies and that
creates a many-to-many relationship. What does dimensional modelling offer to resolve this
dilemma? As this is a prototype we strive to keep things simple, by ensuring a one-to-many
relationship between dimensions and facts and by maintaining the desired star schema. There are
alternatives, but these break our simple design and extend the effort needed to build in the
additional joins required to satisfy the business queries (Rowen et al., 2001). To resolve this issue
currency data was transformed in the ETL stage and repurposed as “Currency Strength” (described
in detail in the ETL section) to adhere to the one-to-many objective and match the grain.
• Edinburgh Visits data had the same many-to-many dilemma as the currency rates. Although rows are
recorded quarterly, there are multiple countries per quarter. The same solution of transforming
the data to match the grain was applied (this is also described in further detail in the ETL section).
The Time dimension also needs to follow the quarter grain. SQL Server SSAS was used to generate a
Time dimension. However, the resulting dimension needed to be modified to add an extra column to
cater for the exact quarterly representation required to join to the Facts table.
Slowly Changing Dimensions
What method of updating the data in the dimensions and facts best suits the prototype data
warehouse? Keeping the objective of simplicity in mind we opt for Kimball Type 1 – overwrite the
dimension attribute. Type 1 means that the data warehouse will be completely overwritten each time
the data requires a refresh. The impact of a Type 1 slowly changing dimension is that we lose all history
of the previous state of the data prior to the reload (Kimball et al., 2008). It is unlikely that this would
be the desired approach in a production data warehouse (depending on business requirements), but
it is acceptable for this proof of concept piece as our source data are a snapshot of a set number of
years from 2010 to 2016 comprising 27 quarters in total.
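As an illustration of the Type 1 behaviour, the sketch below uses an in-memory SQLite table rather than the project's SQL Server packages, and the table and column names are assumed; it simply reloads a dimension in full, so the previous attribute values are lost.

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE DimCountry (countryId INTEGER PRIMARY KEY, countryName TEXT)")

def full_reload(rows):
    # Type 1 refresh: wipe the dimension and reload it from the latest source extract.
    # No history of the previous attribute values is kept.
    con.execute("DELETE FROM DimCountry")
    con.executemany("INSERT INTO DimCountry (countryId, countryName) VALUES (?, ?)", rows)
    con.commit()

full_reload([(1, "France"), (2, "Germani")])   # first load contains a misspelling
full_reload([(1, "France"), (2, "Germany")])   # the next refresh simply overwrites it
print(con.execute("SELECT * FROM DimCountry").fetchall())
# [(1, 'France'), (2, 'Germany')] -- the earlier 'Germani' value is unrecoverable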
Type of Fact table
According to Kimball et al. (2008), measured facts fall into one of three types of grain:
transactions, periodic snapshots or accumulating snapshots. Our prototype model is aligned to the
periodic snapshot type, as measures are recorded each quarter for a set number of quarters (the visits
data source is by quarter). No further updates are applied to the fact table rows once the table has
been populated.
High Level Model Diagram
Using the dimensions that were identified from the earlier business statement a high level model was
created and is illustrated below:
This is our star schema, comprising the central fact table “Travel” surrounded by the dimensions. The
grain is also defined.
The next stage is to identify the dimension attributes and the fact measures. This was achieved taking
each data source in turn and asking whether the associated attributes and measures contributed to
the questions being asked in the case studies. The following images show the source data and the
dimension attributes (refer to the ETL section for further detail).
Visitor data
The following dimensions were created from the Visitor source data during the dimensional
modelling:
Country Attribute Format
Country Id Integer PK
Country Code Text
Country Name Text
Mode Attribute Format
Mode Id Integer PK
Mode Code Text
Mode Name Text
Mode Detail Text
Nationality Attribute Format
Nationality Id Integer PK
Nationality Code Text
Nationality Name Text
Purpose Attribute Format
Purpose Id Integer PK
Purpose Code Text
Purpose Name Text
For the prototype, two separate Country and Nationality dimensions were created rather than a
single dimension. The reason was that the data is grouped inconsistently (e.g. a
nationality of “Other EU”, with no information as to which countries this refers to) and to retain
the data’s original meaning. In a production scenario, the country and nationalities would possibly be
rationalised and consolidated into a single dimension and transformed to use an ISO country code as
a key.
Some of the data from the data source was excluded as it was not required to satisfy the 3
business cases. However, this is not to undervalue its potential contribution in a full production data
warehouse.
Edinburgh Visits
Edinburgh Visits Attribute Format
Visit Id Integer PK
Visit Date YYYYQQ
Visit Count Integer
Currency Rates
By quarter for US Dollar, Australian Dollar and Euro.
Currency Strength Attribute Format
Currency Strength Id Integer PK
Currency Strength Date YYYYQQ
Currency Strength Text
Business Reviews
This data source comprises unstructured data which will undergo entity extraction to gain the
following required attributes:
Review Attribute Format
Review Id Integer PK
Review Date YYYYQQ
Review Count Integer
Name of Business Text
Nationality Id Integer
Entity Text Text
Entity Type Text
The Fact Table
The fact table is required to store the following measures:
Fact Measure
Visits Units (days)
Spend Units (GBP)
Nights Stayed Units (days)
Edinburgh Visits Units (days)
The motivation for dimensional modelling in the context of data warehouse architecture may be
summarised as follows. Understandability: the dimensional view of consolidated data is already
recognisable to the business user. Query performance: gains in performance are obtained using star
joins and flatter denormalised table structures. Dimensions are the pathway to measures in the fact
table, which can absorb a myriad of unknown queries that users may devise over time. Dimensional
extensibility: as new data arrives, the dimension can take on the change either as a new row
of data or by altering the table (Kimball et al., 2008). Finally, the Business Intelligence tools used to
answer the 3 case studies make use of the dimensional model designed in this project.
ETL Method and Strategy
This section describes the data sources, how they were extracted, and the steps taken to transform and
load the required data into the data warehouse. This phase of the project took a considerable amount
of time to complete, which, as Kimball et al. (2008) point out, may swallow up to 70% of the time and work
expended in the implementation of a data warehouse. Kimball et al. (2008) suggest that taking a
haphazard approach to the ETL is likely to end in a tangle of objects which have multiple points of
failure and are difficult to fathom. There are many ETL tools which can be used to assist the ETL
phase. The primary activities that these tools cover in terms of their functionality according to
Vassiliadis, Simitsis, Georgantas, Terrovitis and Skiadopoulos (2005) are: (a) recognition of viable data
in the source data, (b) obtaining this information, (c) creating a tailored and consolidated view of
numerous data sources resulting in a unified format, (d) cleansing and massaging data into shape to
fit the business and target database logic and (e) populating the data warehouse.
The diagram below illustrates a high level view of the ETL landscape covered by the project scope:
ETL Environment
Prior to performing the extraction, the database environment was created. This consisted of two
databases: staging and data warehouse. The databases were partitioned to ensure that data
undergoing further exploration, cleaning and transformation was kept separate from the “clean”
and prepared data that exists in the data warehouse environment. The purpose was to assist overall
ETL management using a simple two-phase approach. Source data is extracted, undergoes initial
transformation and is loaded into the staging tables.
The data is further examined and then undergoes a second transformation before finally being loaded into
the data warehouse database.
This iterative approach was followed to examine and refine the quality of the data destined for the
data warehouse. On early ETL runs, as new issues occurred, the incidents were investigated, a resolution
sought and a modification made to the appropriate ETL package to resolve the incident. The various ETL
changes are discussed in the following sections.
When the ETL packages were fully tested and producing the expected results, they were merged into
logical steps to form an ETL workflow. This resulted in a workflow to cater for each of the data sources
and a separate ETL package to load the data warehouse’s Fact table.
The diagram below illustrates the Visits ETL, using this phased ETL design process (authored in SSIS).
As mentioned in the dimensional modelling section, the tables are truncated on each package
execution; no history is retained.
Data Sources
The table below shows the datasets that were sourced.
Name | Description | Source | Type of Data
Visits | International Passenger Survey (IPS) visits and Edinburgh visits | Visit Britain (2016) | Structured
Currency | Currency FX rates | QuandlAPI (2016) | Semi-Structured
Reviews | Business reviews | Yelp Dataset Challenge (2016) | Unstructured
Visits
The IPS Visit data: uk_trend_unfiltered_report was obtained as a CSV containing quarterly rows from
2002 to 2015. The Edinburgh visit data: detailed_towns_data_2010_-_2015 was also obtained as a
CSV. The files were downloaded from Visit Britain (2016). The datasets were originally created
from the International Passenger Survey data (UK Office for National Statistics, 2016).
Currency
Currency FX rates were obtained using QuandlAPI (2016) to extract average quarterly FX rates for the
Pound Sterling against the US Dollar, Australian Dollar and the Euro. Quarterly data was extracted for
the period 2009 to 2016.
Reviews
Business reviews were obtained from round 8 of the Yelp dataset challenge download (Yelp Dataset
Challenge, 2016). The dataset was downloaded and unzipped to produce a JSON file for each entity.
Staging and Data Warehouse ETL
The ETL process for each of the data sources is described as follows:
Visits
The source CSV files were examined in OpenRefine (2016) to identify the data to be extracted, and to
quickly perform checks for format inconsistencies and missing data. OpenRefine was used to reformat
the quarter rows from quarters represented by month names, e.g. “January-March”, to QQ format, e.g.
“01”.
The decimal values were converted back to integers and the input data was mapped to the respective
columns of the Visits table in the staging database.
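The same reshaping could equally have been scripted; the sketch below is a minimal pandas equivalent of these OpenRefine steps (the month-range labels, column names and figures are assumptions made purely for illustration).

import pandas as pd

# Map the month-name quarter labels used in the source CSV to a two-digit quarter number.
QUARTER_MAP = {"January-March": "01", "April-June": "02",
               "July-September": "03", "October-December": "04"}

visits = pd.DataFrame({
    "year": [2010, 2010],
    "quarter": ["January-March", "April-June"],
    "visits": [1200.0, 1500.0],                  # decimal values in the source
})

visits["quarter"] = visits["quarter"].map(QUARTER_MAP)                  # "January-March" -> "01"
visits["quarter_key"] = visits["year"].astype(str) + visits["quarter"]  # YYYYQQ
visits["visits"] = visits["visits"].round().astype(int)                 # back to whole numbers
print(visits[["quarter_key", "visits"]])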
The staging dimensional tables Country, Nationality, Mode and Purpose were populated using the
Visits staging table from the previous step.
The Country ETL is described below (the same process was followed for the Nationality, Mode and
Purpose tables). The target Country table was truncated, the country narratives were taken from the
Visits table, sorted and the duplicates removed. A business country code column was assigned a value
of “Unknown” (this column was created for use downstream to hold business-friendly values, as none
were available at ingestion; the default value of “Unknown” was assigned rather than leaving it blank
or NULL). The rows were then inserted into the Country table with a unique integer key assigned by
SQL on insert. The data warehouse ETL package truncates the DimCountry table and loads it using the
staging Country table as the source. Again, SQL assigns a unique integer key to each row inserted and
this is the surrogate key that will be used as the foreign key in the fact table.
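The staging Country load described above can be sketched as follows; this is a pandas illustration of the same logic with assumed column names, whereas the project itself performed these steps in SSIS and SQL Server, where the integer keys are assigned on insert.

import pandas as pd

# Visits staging rows carry a free-text country narrative (illustrative values).
staging_visits = pd.DataFrame({"country": ["France", "Germany", "France", "Other EU"]})

# Take the distinct country narratives, sorted, and attach the default business code.
country = (staging_visits[["country"]]
           .drop_duplicates()
           .sort_values("country")
           .reset_index(drop=True))
country["countryCode"] = "Unknown"   # no business-friendly code is available at ingestion

# Assign a surrogate key; in the real packages SQL Server generates this on insert,
# and the same key later acts as the foreign key in the fact table.
country.insert(0, "countryId", list(range(1, len(country) + 1)))
print(country)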
Currency Strength
The Currency Strength ETL is shown in the diagram below.
A script created in R was used to obtain average quarterly currency rates using the QuandlAPI (2016).
The QuandlAPI (2016) call is repeated to get the US and Australian
Dollar values. The quarterly difference for each currency is calculated. The last row of the 2009 quarter
used in the calculation contained “NA” and was replaced with a dummy value (the entire year 2009
is discarded downstream as it is not required). The currency code and narrative are added to the data
frame before it is written out to the respective currency CSV file.
The R script is called by the Currency ETL package.
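The R snippet itself is not reproduced in this extract; an equivalent pull using the Quandl Python client might look like the sketch below, where the dataset code, date range and API key are placeholders rather than the values used in the project.

import quandl

quandl.ApiConfig.api_key = "YOUR_API_KEY"   # placeholder; the project's key is not shown

# Daily GBP/USD spot rates; "BOE/XUDLUSS" is used here purely as a placeholder dataset code.
gbp_usd = quandl.get("BOE/XUDLUSS", start_date="2009-01-01", end_date="2016-09-30")

# Average the daily rates up to quarters and take the quarter-on-quarter differences,
# which are the figures that feed the Currency Strength calculation below. The first
# 2009 quarter has no prior value, mirroring the "NA" handled in the R script.
quarterly = gbp_usd.iloc[:, 0].resample("Q").mean()
quarterly.diff().to_csv("gbp_usd_quarterly.csv")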
Once the CSV files are created, the data is extracted and the date is reformatted to the desired
quarterly format YYYYQQ and inserted into the staging Currency table.
The desired Currency rows are selected from the Currency table, grouped by date and the rate
difference is summed.
The Currency Strength is calculated and the rows are then inserted into the staging Currency Strength
table. The final package is run to load the Currency Strength dimension table in the data warehouse.
Currency Strength is a measure of the strength of GBP against a basket of 3 currencies, namely USD,
EUR and AUD. The value of the indicator is either “UP” or “DOWN”. “UP” indicates a strong pound
relative to the basket, and “DOWN” indicates a weak pound relative to the basket of currencies. For
overseas visitors to the UK a “DOWN” position should be more favourable (bearing in mind that the
basket could be shielding a currency that has moved the other way e.g. USD and EUR are strong but a
very weak AUD has caused the overall value of the basket to be negative).
The Currency Strength is calculated by taking the average quarterly exchange rate of Pound Sterling
against 3 major currencies (i.e. USD, EUR and AUD) and obtaining the quarterly differences for
each currency pair. The currency pair differences are summed to provide the basket value which, if
positive, sets the Currency Strength indicator to “UP”; otherwise it is set to “DOWN”. In the currency
dataset no quarterly difference of zero was found; had this been the case the indicator would have
been set to “NO CHANGE”.
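A minimal sketch of this basket calculation follows (pandas, with illustrative figures; the project performed the same logic over the staging Currency table in SSIS and SQL). The first quarter in the example also illustrates the shielding effect described above: the USD and EUR differences are positive but a larger negative AUD move pushes the basket to “DOWN”.

import pandas as pd

# Quarter-on-quarter rate differences for each currency pair (illustrative values only).
fx = pd.DataFrame({
    "quarter_key": ["201001", "201001", "201001", "201002", "201002", "201002"],
    "currency":    ["USD", "EUR", "AUD", "USD", "EUR", "AUD"],
    "rate_diff":   [0.02, 0.01, -0.04, 0.03, 0.02, 0.01],
})

# Sum the pair differences per quarter to form the basket value.
basket = fx.groupby("quarter_key", as_index=False)["rate_diff"].sum()

def strength(value):
    # Positive basket -> "UP" (sterling strengthened), negative -> "DOWN";
    # a zero difference would map to "NO CHANGE", although none occurred in the dataset.
    if value > 0:
        return "UP"
    if value < 0:
        return "DOWN"
    return "NO CHANGE"

basket["currencyStrength"] = basket["rate_diff"].apply(strength)
print(basket)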
Business Reviews
To facilitate extraction of the business review data (the project’s unstructured data, supplied
in the downloaded JSON files), a suitable document-based database, MongoDB, was used.
MongoDB was installed on the same virtual machine as SQL Server to maintain a self-contained
environment. The files were imported into the yelp database using mongoimport, based on a tip from
Eniod's Blog (2015) on working with the Yelp dataset.
mongoimport --db yelp --collection businesses --file yelp_academic_dataset_business.json
mongoimport --db yelp --collection review --file yelp_academic_dataset_review.json
Using python and pymongo, two scripts were created. The first script extracts reviews related to
Edinburgh businesses, retrieves the associated reviews dated from 2010 to 2016 and inserts them into
a new collection. The second script reads the new collection and sends each text review for entity
extraction using the AlchemyAPI (2016). The result of each entity extraction is stored in a dataframe
to which a random Nationality code is added (to associate a review with the Visits nationality data,
this addition to the data makes our reporting more interesting as it provides a link to the nationality
of the reviewer). Once the entity extraction is complete the results are written to a CSV file which is
then processed through SSIS. The scripts can be configured to set the count of businesses and
associated reviews to extract (this assisted testing and limited the API calls as AlchemyAPI (2016) sets
a daily transaction limit).
It was noticed that the Yelp dataset had businesses with a review count greater than zero but no
document existing in the review collection. The scripts could be improved in the future to handle this
exception. The workaround for the few businesses in error was to update the review count to zero in
the business collection.
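A sketch of that workaround appears below (pymongo against the same yelp database; the exact statement the project ran is not shown, so this is illustrative only).

from pymongo import MongoClient

client = MongoClient('localhost', 27017)
db = client.yelp

# Find Edinburgh businesses that claim reviews but have no documents in the review
# collection, and reset their review_count so the extraction scripts skip them cleanly.
for bus in db.businesses.find({"city": "Edinburgh", "review_count": {"$gt": 0}},
                              {"business_id": 1}):
    if db.review.find({"business_id": bus["business_id"]}).count() == 0:
        db.businesses.update_one({"_id": bus["_id"]},
                                 {"$set": {"review_count": 0}})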
AlchemyAPI (2016) provides an entity extraction API which is used to discover objects in the textual
business reviews such as people, names, places and businesses (Meo, Ferrara, Abel, Aroyo and
Houben, 2013).
The two Python scripts used to obtain Edinburgh business reviews from MongoDB and extract entities appear below:
#!/usr/bin/env python2.7
# This program connects to MongoDB and extracts Edinburgh businesses. We limit the number of
# businesses extracted and then get a limited number of associated reviews. The extracted
# reviews are finally inserted into a new collection.
from random import randint
import pymongo
from pymongo import MongoClient

client = MongoClient('localhost', 27017)
db = client.yelp
businesses = db.businesses
reviews = db.review
dwreviews = db.dwreview

# get Edinburgh businesses by limit, best rated first
aBus = (businesses.find({"city": "Edinburgh", "review_count": {"$gt": 0}},
                        {"business_id": 1, "name": 1, "categories": 1})
        .sort("stars", pymongo.DESCENDING).limit(2))  # set to 80 for live run

# create list and dict to hold the extracted reviews
collReviews = []
mybus = {}

# loop through the business cursor
for busKey in aBus:
    mybus['business_id'] = busKey['business_id']
    mybus['name'] = busKey['name']
    # for each business key get the reviews and write them out to a new collection;
    # we also randomly assign a nationality code to each review to indicate the reviewer's nationality
    print mybus['business_id'] + " " + "**"
    print mybus['name']
    aReview = (reviews.find({"business_id": mybus['business_id'], "review_id": {"$exists": True},
                             "date": {"$gt": "2009-12-31"}},
                            {"review_id": 1, "date": 1, "text": 1, "business_id": 1})
               .sort("date", pymongo.DESCENDING).limit(3))  # set to 100 for live run
    reviewer = []
    for item in aReview:
        nationalityId = randint(1, 75)
        print (item['business_id'] + "^^ " + item['text'])
        reviewer.append({"text": item['text'], "review_id": item['review_id'], "date": item['date'],
                         "name": mybus['name'], "business_id": item['business_id'],
                         "nationality_id": nationalityId})
    collReviews += [reviewer]
# insert each extracted batch of review documents into the new collection
for rec in collReviews:
    db.dwreviewvideo.insert(rec)

print ('End of Pgm ')
Extract Entities Script
#!/usr/bin/env python2.7
import time
import pandas as pd
import pymongo
from pymongo import MongoClient
from watson_developer_cloud import AlchemyLanguageV1

alchemy_language = AlchemyLanguageV1(api_key='deleted')

client = MongoClient('localhost', 27017)
db = client.yelp
# select which collection to do entity extraction on
reviews = db.dwreviewvideo

# get some reviews by limit, most recent first
curReview = (reviews.find({}, {"text": 1, "date": 1, "name": 1, "nationality_id": 1})
             .sort("date", pymongo.DESCENDING).limit(521))  # set to 521 for live run

mylist = []
# loop through the cursor and call the entity extraction API
for yReview in curReview:
    print yReview
    text = yReview['text'].encode('utf-8')
    # get entities for each review
    response = alchemy_language.entities(text)
    # wait for Alchemy to do its thing
    time.sleep(2)
    # add the results to a list of dicts
    for item in response['entities']:
        textLatin1 = item['text'].encode('latin-1')
        mylist.append({'type': item['type'], 'text': textLatin1,
                       'count': item['count'], 'date': yReview['date'], 'name': yReview['name'],
                       'nationality_id': yReview['nationality_id']})

# assign the list to a dataframe for ease of outputting the results as a CSV
df = pd.DataFrame(mylist)
df.to_csv('C:dwDataSetsyelpEntities2.csv', index=False)
print ('End of entity extraction')
Using the created CSV, the data is extracted and the date is reformatted to YYYYQQ; the Nationality
Id is used to look up the nationality name and add it to the output flow. The data is then inserted into the
staging Review table.
To update the data warehouse Review dimension, the reviews are transformed to obtain the reviews
with the highest count for each quarter (one to match each of the 27 quarters), using a crafted SQL
script to update the staging table with an incremented rowcount. The rownumber in the subselect is
set to limit the rows selected to satisfy a review row match for each quarter.
update review
set reviewDateNo = Crownumber
from (
    select reviewId, reviewDate, reviewCount,
           ROW_NUMBER() over (PARTITION BY reviewDate order by reviewDate, reviewCount DESC) as Crownumber
    from (
        select reviewId, reviewDate, reviewCount,
               ROW_NUMBER() over (PARTITION BY reviewCount
                                  order by reviewDate, reviewCount DESC) as rownumber
        from review
        group by reviewId, reviewCount, reviewDate
    ) tempQuery
    where tempQuery.rownumber < 200
    group by reviewDate, reviewCount, reviewId
) as reviewz
where reviewz.reviewId = review.reviewId
Edinburgh Visits
A mixture of Excel and OpenRefine (2016) was used to reshape the data. A row for each of the 27
quarters is required to meet the grain. Counts for the following countries: US, Australia, France, Germany, Ireland,
Spain, Netherlands, Italy, Poland, Belgium, Greece, Austria and Portugal, are summed to provide a
single count per quarter.
The summed and reshaped data is shown below; the original visit count was in thousands and was
multiplied by 1000. If a blank was found in the original data it was assigned a zero.
There was no data for 2016, so an average of each quarter was taken between 2010 and 2015 to
create the 2016 quarters. The result was a total count of visitors (for the selected basket of countries)
by quarter. Visits to towns are based on the towns visitors report spending at least one night in during
their trip.
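The reshaping could also be expressed in code; the sketch below reproduces the logic in pandas with illustrative column names and figures (the project used Excel and OpenRefine rather than a script).

import pandas as pd

# One row per quarter and country, visit counts expressed in thousands (illustrative).
towns = pd.DataFrame({
    "quarter_key": ["201001", "201001", "201002", "201002"],
    "country":     ["USA", "Australia", "USA", "Australia"],
    "visits_000s": [12.0, None, 15.0, 3.0],
})

towns["visits_000s"] = towns["visits_000s"].fillna(0)          # blanks become zero
towns["visits"] = (towns["visits_000s"] * 1000).astype(int)    # convert from thousands

# Sum the basket of countries to a single count per quarter.
edinburgh = towns.groupby("quarter_key", as_index=False)["visits"].sum()
print(edinburgh)

# 2016 quarters were not available, so each was approximated by the 2010-2015 average
# for the corresponding quarter (the QQ part of the key).
edinburgh["qq"] = edinburgh["quarter_key"].str[-2:]
print(edinburgh.groupby("qq")["visits"].mean().round().astype(int))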
Time
The Time dimension was generated in SSAS and only exists in the data warehouse database. However,
as mentioned above, a new column was needed to cater for the exact quarterly representation
required to join to the Facts table (in the date format YYYYQQ). This was achieved using the following
crafted SQL code which was run in SSMS.
update t
set t.quarterFactDate = (
select CONVERT(varchar(4),DATEPART("YYYY", t2.PK_Date)) +
RIGHT('0' + CONVERT(varchar(2),DATEPART("QQ", t2.PK_Date)),2)
from Time t2
where t2.PK_Date = t.PK_Date)
from Time t
Fact Table - Travel Fact
The Travel Fact table also only exists in the data warehouse database. The ETL created for the fact
table is shown below.
The ETL must extract the surrogate key from each dimension, gather the measures and merge the
data into the Travel Fact table. Each row inserted into the Fact table must match the quarterly grain
that was defined during the dimensional modelling.
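The join logic can be sketched as follows; this is a pandas illustration using assumed column names, whereas the project performed the surrogate key lookups and merge in SSIS against the SQL Server dimensions.

import pandas as pd

# Dimension extracts carrying their surrogate keys (illustrative rows only).
dim_country = pd.DataFrame({"countryId": [1, 2], "countryName": ["France", "Germany"]})
dim_time = pd.DataFrame({"timeId": [10, 11], "quarterFactDate": ["201001", "201002"]})

# Quarter-grain measures from staging, keyed by the business values.
staging = pd.DataFrame({
    "countryName": ["France", "Germany"],
    "quarter_key": ["201001", "201001"],
    "visits": [1200, 900],
    "spend": [3100000, 2400000],
    "nights": [8400, 6100],
})

# Swap the business keys for surrogate keys and keep only keys and measures;
# every resulting row sits at the quarterly grain defined in the model.
fact = (staging
        .merge(dim_country, on="countryName")
        .merge(dim_time, left_on="quarter_key", right_on="quarterFactDate")
        [["countryId", "timeId", "visits", "spend", "nights"]])
print(fact)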
The result of the ETL is the data warehouse database which is illustrated below.
The ETL made use of several methods and tools: manual operations with OpenRefine (2016) and
Excel, automation via custom R and Python programs integrated with MongoDB, and the SQL Server
tools SSIS and SSMS. The ETL process reflects the activities observed by Vassiliadis et al. (2005): the data
required was recognised in the source data, this data was obtained, a unified format was created
through consolidation across the various sources of data (matching the grain), the data was cleansed and
shaped to fit the business requirements, and finally it populated the data warehouse.
Case Studies
The deployed cube is shown below; it was connected to Tableau Desktop (2016) to produce the
business intelligence charts that support the following case studies:
Visitor Nationalities Traveling to the UK and Edinburgh
What number of US and Australian nationals travel to the UK, and how does this compare with several other EU
nationalities? What are they spending? Of these visitors, what sort of numbers visit Edinburgh?
This information will help our local offices to better assess and address the target market on
their home ground.
The prototype data warehouse shows the amount spent and the visit figures for US, Australian and a
selection of EU nationalities (France, Germany, Ireland, Spain, Netherlands, Italy, Poland, Belgium) for
visits to the UK between 2010 and 2015.
The bar chart to the right compares the visit numbers, for each quarter for the same basket of
nationalities, with figures for visits to Edinburgh between 2010 and 2015. There appears to be a
positive correlation between the visits to the UK and visits to Edinburgh. Further analysis would need to
be conducted examining possible causes of the fluctuations; for example, obtaining data about major events
that may draw visitors to Edinburgh or keep them away would add value to the analysis. Further charts
showing trend lines and variance (e.g. quarter on quarter and year on year) within and between both
sets of visit data would be interesting to see.
Currency Strength Impact on Visits and Spend
The business is concerned about the impact of Brexit and that overseas visitors may stay away due to the
volatility of Sterling in its wake. Is it possible to provide any information from our data
warehouse to allay these fears?
The charts above indicate the visitor and spend numbers in light of the strength of Sterling in
relation to the US Dollar, Australian Dollar and Euro basket of currencies. It appears that the currency
strength does not deter visits or spend. Visitor numbers have increased over the 5-year period
and seasonal fluctuations are clear to see. There appears to be a positive correlation between visits
and spend. However, quarters 201502 and 201403 may warrant investigation. Visits (6.407M) were
higher in 201502 with lower spend (3.085B) than the lower visits (6.232M) in 201403 with a higher spend
(3.883B).
Business Review Entity Extraction
Finally, away from the symposium, it would be helpful to provide visitors with places to go and things
to see and do when in Edinburgh. Can we provide any points of interest in Edinburgh that will assist
them?
The treemap above shows the entities extracted from Edinburgh business reviews between 2010 and 2015.
The chart provides the entity name, business name, entity type, the reviewer’s
nationality and the total visits to the UK for the quarter that the review relates to (data is not displayed if
the space is not available, which is an issue when attempting to make a comparison between entities).
Taking the entity Hanedan as an example, the AlchemyAPI (2016) returned the entity as a person and
a city; it is in fact a Turkish restaurant. But the treemap highlighted this unusual pattern and provoked
a web search to discover what Hanedan was. Using a treemap visualisation is useful for exposing
patterns that could be of interest and warrant further investigation. The treemap chart works well for
presentation of small numbers. However, treemaps may present a confusing picture when the
number of items displayed increases substantially (Tu and Shen, 2008).
References
AlchemyAPI (2016) Entity Extraction API [Online] Available at:
http://www.alchemyapi.com/products/alchemylanguage/entity-extraction [Accessed 10 November
2016].
Ariyachandra, T. and Watson, H.J. (2006) ‘Which Data Warehouse Architecture Is Most Successful?’.
Business Intelligence Journal, 11(1): pp. 4.
Chaudhuri, S. and Dayal, U. (1997) ‘An overview of data warehousing and OLAP technology’. ACM
SIGMOD Record, 26(1): pp. 65-74.
Eniod's Blog (2015) Import Yelp dataset to MongoDB [Online] Available at:
https://haduonght.wordpress.com/2015/02/10/import-yelp-dataset-to-mongodb [Accessed 10
November 2016].
Kimball, R., Ross, M., Thornthwaite, W., Mundy, J. and Becker, B. (2008) The data warehouse lifecycle
toolkit. 2nd ed. Indianapolis: Wiley Publishing, Inc.
Meo, P., Ferrara, E., Abel, F., Aroyo, L. and Houben, G. (2013) ‘Analyzing user behavior across social
sharing environments’. ACM Transactions on Intelligent Systems and Technology (TIST), 5(1): pp. 14-
31.
OpenRefine (2016) A free, open source, powerful tool for working with messy data [Online] Available
at: http://openrefine.org/ [Accessed 10 November 2016].
QuandlAPI (2016) Quandl API Introduction [Online] Available at: https://www.quandl.com/docs/api
[Accessed 10 November 2016].
Rowen, W., Song, I.Y., Medsker, C. and Ewen, E. (2001) ‘An analysis of many-to-many relationships
between fact and dimension tables in dimensional modeling’. Proceedings of the International
Workshop on Design and Management of Data Warehouses (DMDW 2001). Interlaken, Switzerland,
4 June 2001.
Tableau Desktop (2016) Analytics that work the way you think [Online] Available at:
http://www.tableau.com/products/desktop [Accessed 10 November 2016].
Tu, Y. and Shen, H. (2008) ‘Balloon Focus: a Seamless Multi-Focus+Context Method for Treemaps’.
IEEE Transactions on Visualization and Computer Graphics, 14(6): pp. 1157-1164.
UK Office for National Statistics (2016) Methodology: International Passenger Survey background
notes [Online] Available at:
https://www.ons.gov.uk/peoplepopulationandcommunity/leisureandtourism/methodologies/intern
ationalpassengersurveybackgroundnotes#sample-methodology [Accessed 10 November 2016].
Vassiliadis, P., Simitsis, A., Georgantas, P., Terrovitis, M. & Skiadopoulos, S. (2005) ‘A generic and
customizable framework for the design of ETL scenarios’. Information Systems, 30(7): pp. 492-525.
Visit Britain (2016) Inbound tourism trends by market [Online] Available at:
https://www.visitbritain.org/inbound-tourism-trends [Accessed 10 November 2016].
Yelp Dataset Challenge (2016) Yelp Dataset Challenge [Online] Available at:
https://www.yelp.com/dataset_challenge [Accessed 10 November 2016].
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 

Data Warehouse Project Report

  • 1. CA Data Warehouse Project Report Tom Donoghue x16103491 19 December 2016 MSCDAD Data Warehousing and Business Intelligence
  • 2. CA2 Data Warehouse Project Report Tom Donoghue v1.0 Page 1 Table of Contents Introduction.....................................................................................................................2 Objectives................................................................................................................................2 Project Scope ...........................................................................................................................2 Data Warehouse Architecture and Implementation ..........................................................3 The Data Model .......................................................................................................................3 Slowly Changing Dimensions ..........................................................................................................5 Type of Fact table............................................................................................................................5 High Level Model Diagram..............................................................................................................5 ETL Method and Strategy..................................................................................................8 ETL Environment ......................................................................................................................8 Data Sources ...................................................................................................................................9 Staging and Data Warehouse ETL............................................................................................10 Visits..............................................................................................................................................10 Currency Strength.........................................................................................................................11 Business Reviews ..........................................................................................................................13 Edinburgh Visits ............................................................................................................................15 Time ..............................................................................................................................................16 Case Studies...................................................................................................................17 Visitor Nationalities Traveling to the UK and Edinburgh...........................................................17 Currency Strength Impact on Visits and Spend ........................................................................18 Business Review Entity Extraction...........................................................................................19 References .....................................................................................................................20
• Design and implement a data warehouse to answer 3 case studies, illustrating the usefulness of a data warehousing solution
• Use 3 or more sources of data
• Use Business Intelligence queries and outputs to demonstrate and support the case studies

Project Scope
The scope of the project covers the 3 case studies which are described below and in the following context diagram. HandleBig Events want to know whether they should seriously consider holding their next US/Australian trade symposium in Edinburgh. They have offices in New York, Sydney and Dublin and would like to provide some useful feedback to these offices to help them build initial promotional ideas. Our task is to help them make better informed decisions using the case studies (described in the Case Studies section) and the prototype data warehouse containing the sourced data.
Data Warehouse Architecture and Implementation
The architecture and design approach taken for this project follow the principles of data warehousing promoted by Kimball, Ross, Thornthwaite, Mundy and Becker (2008). The primary reason for taking the Kimball approach is the need to swiftly design and implement a working proof of concept data warehouse. The scope of the project is narrow with a tight timescale, which favours dimensional modelling over a normalised relational model.

Data warehouse data functions as a story about past events, designed to support decision making, serving up a digest of answers in grouped and aggregated ways which are more meaningful, and therefore more important, to the business. Providing rollup, drilldown and cross views of the data (typical of OLAP operations) requires complex queries which impact performance and may also add a maintenance overhead each time a new business question arises. The data warehouse must also ingest data from disparate sources which need to be merged to create the desired outcomes. To overcome these issues, the data warehouse is designed using dimensional modelling. Data organised multidimensionally is fashioned in such a way that it serves a different business purpose to the usual OLTP operational database (Chaudhuri and Dayal, 1997).

Adopting a methodology will produce a result, but the success of the result depends on how the methodology is executed to meet a set of business requirements. As Ariyachandra and Watson (2006) note, whether the data warehouse architecture proposed by Kimball or by Inmon is better was, and still is, an ongoing debate. The authors investigated five main data warehouse architectures in their studies. In terms of their study, our prototype data warehouse architecture is probably closest to the type described as an Independent Data Mart. Independent Data Marts were often frowned upon as an inferior architectural solution in operational production environments. However, they are a good fit for prototyping and proof of concept work due to their relative simplicity and short lead time to deploy, and, as the authors conclude, they may make a valid contribution as part of a larger hybrid data warehouse solution.

The diagram below shows the elements which comprise our prototype data warehouse architecture. Source data is ingested and processed by the Extract, Transform and Load (ETL) process, which populates the staging area (detailed in the ETL section below) and subsequently populates the data warehouse. The data warehouse provides the business intelligence results to business user queries.

The Data Model
The data model was constructed using dimensional modelling, which according to Kimball et al. (2008) is the most applicable way to satisfy business intelligence needs, as it meets the underlying objectives of timely query performance and unambiguous, meaningful results. The dimensional model contains dimensions and facts. Facts record business measurements that tend to be numeric and additive. Dimensions record logical sets of descriptive attributes and are bound to the facts, enabling the fact measurements to be viewed in various descriptive combinations. The benefits of dimensional modelling are that:
• It facilitates a multidimensional analysis domain, via the exploration of fact measures using dimensions.
• The schema is far simpler: the dimensions are denormalised, which in turn improves query performance and serves data that is instantly recognisable to the business user.

The resulting schema resembles a star shape, with the dimensions surrounding a single fact entity (Kimball et al., 2008; Rowen, Song, Medsker and Ewen, 2001). Many data warehouse implementations follow the star schema when describing and constructing the data model, as again it addresses the goals of fast query performance and the ease and speed of populating the data warehouse (Chaudhuri and Dayal, 1997).
In the Kimball dimensional design process our first step is to choose the business process or measurement event to be modelled, which in this case is Passenger Visits. To obtain an understanding of this, a simple business statement was made: "I want to be able to see the number of visits made by nationality, when they visited, how long they stayed, and how much they spent. I also want to get a handle on their mode of travel, purpose of visit and how many people visit Edinburgh." This is a powerful way of identifying possible facts and dimensions associated with the visits data source. However, the fact table grain needs to be defined before advancing further. Examining the visits source data helped to define the grain, as each visit is recorded quarterly. The grain should be defined as finely as possible: it is possible to roll up from it (e.g. quarters into half years and then into years), but we will not be able to drill down any lower than the selected grain. In this case, it is not possible to drill down lower than quarters (e.g. to months or weeks, as these attributes are not present in the data). Therefore, the finest grain available in the visits data is quarters.

Looking at the business statement above, the dimensions start to appear:
• Visits
• Country
• Nationality
• Mode of Travel
• Purpose of Visit
• Edinburgh Visits
• Time

The facts can also be drawn from the statement:
• Visits
• Spend
• Nights Stayed

There are also the three remaining data sources to cater for: Currency Rates, Business Reviews and Edinburgh Visits. As the grain has been declared, these entities also need to follow the grain and be at a quarterly level. This raised the following issues:
• Reviews are recorded for any given date and therefore need to be massaged to fit the quarterly grain, which is achieved by transforming the review data in the ETL stage (a short sketch of this date-to-quarter mapping appears after this list).
• Currency FX rates are obtained by quarter, which fits, but we have multiple currencies and that creates a many-to-many relationship; what does dimensional modelling offer to resolve this dilemma? As this is a prototype we strive to keep things simple, by ensuring a one-to-many relationship between dimensions and facts and maintaining the desired star schema. There are alternatives, but these break our simple design and extend the effort needed to build the additional joins required to satisfy the business queries (Rowen et al., 2001). To resolve this issue the currency data was transformed in the ETL stage and repurposed as "Currency Strength" (described in detail in the ETL section) to adhere to the one-to-many objective and match the grain.
• Edinburgh Visits data had the same many-to-many dilemma as the currency rates. Although rows are recorded quarterly, there are multiple countries per quarter. The same solution of transforming the data to match the grain was applied (this is also described in further detail in the ETL section).
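To make the grain matching concrete, the sketch below shows the kind of date-to-quarter mapping applied to the review dates. It is a minimal illustration only; the to_quarter_key function is hypothetical and is not part of the project's SSIS packages.

from datetime import date

def to_quarter_key(d):
    # Map a calendar date to the YYYYQQ grain key, e.g. 2014-08-07 -> "201403"
    quarter = (d.month - 1) // 3 + 1
    return "{0}{1:02d}".format(d.year, quarter)

print(to_quarter_key(date(2014, 8, 7)))   # 201403
print(to_quarter_key(date(2010, 1, 15)))  # 201001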
The Time dimension also needs to follow the quarter grain. SQL Server SSAS was used to generate a Time dimension. However, the resulting dimension needed to be modified to add an extra column to cater for the exact quarterly representation required to join to the Facts table.

Slowly Changing Dimensions
What method of updating the data in the dimensions and facts best suits the prototype data warehouse? Keeping the objective of simplicity in mind we opt for Kimball Type 1 – overwrite the dimension attribute. Type 1 means that the data warehouse will be completely overwritten each time the data requires a refresh. The impact of a Type 1 slowly changing dimension is that we lose all history of the previous state of the data prior to the reload (Kimball et al., 2008). It is unlikely that this would be the desired approach in a production data warehouse (depending on business requirements), but it is acceptable for this proof of concept piece as our source data are a snapshot of a set number of years from 2010 to 2016, comprising 27 quarters in total.

Type of Fact table
According to Kimball et al. (2008), the measured facts fall into one of three types of grain: transactions, periodic snapshots or accumulating snapshots. Our prototype model is aligned to the periodic snapshot type, as measures are recorded each quarter for a set number of quarters (the visits data source is by quarter). No further updates are applied to the fact table rows once the table has been populated.

High Level Model Diagram
Using the dimensions that were identified from the earlier business statement, a high level model was created and is illustrated below. This is our star schema, comprising the central fact table "Travel" surrounded by the dimensions; the grain is also defined. The next stage is to identify the dimension attributes and the fact measures. This was achieved by taking each data source in turn and asking whether the associated attributes and measures contributed to the questions being asked in the case studies. The following images show the source data and the dimension attributes (refer to the ETL section for further detail).
Visitor data
The following dimensions were created from the Visitor source data during the dimensional modelling:

Country
Attribute            Format
Country Id           Integer PK
Country Code         Text
Country Strength     Text

Mode
Attribute            Format
Mode Id              Integer PK
Mode Code            Text
Mode Name            Text
Mode Detail          Text

Nationality
Attribute            Format
Nationality Id       Integer PK
Nationality Code     Text
Nationality Strength Text

Purpose
Attribute            Format
Purpose Id           Integer PK
Purpose Code         Text
Currency Strength    Text

For the prototype, two separate Country and Nationality dimensions were created rather than using a single dimension. The reason for this was that the data is grouped inconsistently (e.g. a nationality of "Other EU", with no information as to which countries this refers to) and to retain the data's original meaning. In a production scenario, the countries and nationalities would possibly be rationalised and consolidated into a single dimension and transformed to use an ISO country code as a key. Some of the data from the data source has been excluded as it was not required to satisfy the 3 business cases. However, this is not to undervalue its potential contribution in a full production data warehouse.
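As an illustration of how such a dimension is realised physically, the sketch below declares a Country-style dimension in SQL Server with an integer surrogate key assigned on insert. The connection string, table and column names are hypothetical and do not reproduce the project's exact attribute list.

import pyodbc

# Illustrative connection; adjust the server and database names as required.
conn = pyodbc.connect("DRIVER={SQL Server};SERVER=localhost;"
                      "DATABASE=DataWarehouse;Trusted_Connection=yes")
cursor = conn.cursor()

# Hypothetical DDL: the IDENTITY column supplies the integer surrogate key
# that the fact table later references as a foreign key.
cursor.execute("""
    CREATE TABLE DimCountry (
        CountryId    INT IDENTITY(1,1) PRIMARY KEY,
        CountryCode  NVARCHAR(20)  NOT NULL DEFAULT 'Unknown',
        CountryName  NVARCHAR(100) NOT NULL
    )
""")
conn.commit()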
Edinburgh Visits
Attribute              Format
Visit Id               Integer PK
Visit Date             YYYYQQ
Visit Count            Integer

Currency Rates
By quarter for US Dollar, Australian Dollar and Euro.

Currency Strength
Attribute              Format
Currency Strength Id   Integer PK
Currency Strength Date YYYYQQ
Currency Strength      Text

Business Reviews
This data source comprises unstructured data which will undergo entity extraction to obtain the following required attributes:

Review
Attribute              Format
Review Id              Integer PK
Review Date            YYYYQQ
Review Count           Integer
Name of Business       Text
Nationality Id         Integer
Entity Text            Text
Entity Type            Text

The Fact Table
The fact table is required to store the following measures:

Fact               Measure
Visits             Units (days)
Spend              Units (GBP)
Nights Stayed      Units (days)
Edinburgh Visits   Units (days)
The motivation for dimensional modelling in the context of data warehouse architecture may be summarised as follows:
• Understandability: the dimensional view of consolidated data is already recognisable to the business user.
• Query performance: gains in performance are obtained using star joins and flatter denormalised table structures. Dimensions are the pathway to measures in the fact table, which can absorb a myriad of unknown queries that users may devise over time.
• Dimensional extensibility: as new data arrives, the dimension is capable of taking on the change either as a new row of data or by altering the table (Kimball et al., 2008).

Finally, the Business Intelligence tools used to answer the 3 case studies make use of the dimensional model designed in this project.

ETL Method and Strategy
This section describes the data sources, how they were extracted, and the steps taken to transform and load the required data into the data warehouse. This phase of the project took a considerable amount of time to complete, which, as Kimball et al. (2008) point out, may swallow up to 70% of the time and work expended in the implementation of the data warehouse. Kimball et al. (2008) suggest that taking a haphazard approach to the ETL is likely to end in a tangle of objects which have multiple points of failure and are difficult to fathom out. There are many ETL tools which can be used to assist the ETL phase. The primary activities that these tools cover, in terms of their functionality, according to Vassiliadis, Simitsis, Georgantas, Terrovitis and Skiadopoulos (2005), are: (a) recognition of viable data in the source data, (b) obtaining this information, (c) creating a tailored and consolidated view of numerous data sources resulting in a unified format, (d) cleansing and massaging data into shape to fit the business and target database logic, and (e) populating the data warehouse. The diagram below illustrates a high level view of the ETL landscape covered by the project scope.

ETL Environment
Prior to performing the extraction, the database environment was created. This consisted of two databases, staging and data warehouse. The databases were partitioned to ensure that data undergoing further exploration, cleaning and transformation was kept separate from the "clean" and prepared data that exists in the data warehouse environment. The purpose was to assist overall ETL management using a simple two-phase approach. Source data is extracted, undergoes initial transformation and is loaded into the staging tables.
The data is further examined and then undergoes a second transformation before finally being loaded into the data warehouse database. This iterative approach was followed to examine and refine the quality of the data destined for the data warehouse. On early ETL runs, as new issues occurred, the incidents were investigated, a resolution was sought and modifications were made to the appropriate ETL package to resolve the incident. The various ETL changes are discussed in the following sections. When the ETL packages were fully tested and producing the expected results, they were merged into logical steps to form an ETL workflow. This resulted in a workflow to cater for each of the data sources and a separate ETL package to load the data warehouse's Facts table. The diagram below illustrates the Visits ETL, using this phased ETL design process (authored in SSIS). As mentioned in the dimensional modelling section, the tables are truncated on each package execution; no history is retained.

Data Sources
The table below shows the datasets that were sourced.

Name       Description                                              Source                          Type of Data
Visits     International Passenger (IPS) Visits, Edinburgh Visits   Visit Britain (2016)            Structured
Currency   Currency FX Rates                                        QuandlAPI (2016)                Semi-Structured
Reviews    Business Reviews                                         Yelp Dataset Challenge (2016)   Unstructured

Visits
The IPS visit data (uk_trend_unfiltered_report) was obtained as a CSV containing quarterly rows from 2002 to 2015. The Edinburgh visit data (detailed_towns_data_2010_-_2015) was also obtained as a CSV. The files were downloaded from Visit Britain (2016). The datasets were originally created from the International Passenger Survey data (UK Office for National Statistics, 2016).
Currency
Currency FX rates were obtained using QuandlAPI (2016) to extract average quarterly FX rates for the Pound Sterling against the US Dollar, Australian Dollar and the Euro. Quarterly data was extracted for the period 2009 to 2016.

Reviews
Business reviews were obtained from round 8 of the Yelp dataset challenge download (Yelp Dataset Challenge, 2016). The dataset was downloaded and unzipped to produce a JSON file for each entity.

Staging and Data Warehouse ETL
The ETL process for each of the data sources is described as follows.

Visits
The source CSV files were examined in OpenRefine (2016) to identify the data to be extracted, and to quickly perform checks for format inconsistencies and missing data. OpenRefine was used to reformat the quarter rows from quarters represented as month names, e.g. "January-March", to QQ format, e.g. "01". The decimal values were converted back to integers and the input data was mapped to the respective columns of the Visits table in the staging database. The staging dimension tables Country, Nationality, Mode and Purpose were populated using the Visits staging table from the previous step. The Country ETL is described below (the same process was followed for the Nationality, Mode and Purpose tables). The target Country table was truncated, the country narratives were taken from the Visits table, sorted and the duplicates removed. A business country code column was assigned a value of "Unknown" (this column was created for use downstream to hold business friendly values, as none were available at ingestion; the default value of "Unknown" was assigned rather than leaving it blank or NULL).
The rows were then inserted into the Country table with a unique integer key assigned by SQL on insert. The data warehouse ETL package truncates the DimCountry table and loads it using the staging Country table as the source. Again, SQL assigns a unique integer key to each row inserted, and this is the surrogate key that will be used as the foreign key in the fact table.

Currency Strength
The Currency Strength ETL is shown in the diagram below. A script created in R was used to obtain average quarterly currency rates using the QuandlAPI (2016) (the R snippet is included as an image in the original report). The QuandlAPI (2016) call is repeated to get the US and Australian Dollar values. The quarterly difference for each currency is calculated. The last row of the 2009 quarter used in the calculation contained "NA" and was replaced with a dummy value (the entire year 2009 is discarded downstream as it is not required). The currency code and narrative are added to the data frame before it is written out to the respective currency CSV file.
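As a rough illustration of that extraction step, a Python equivalent using the quandl package might look like the following. It is a sketch only, not the author's R script; the Bank of England series codes and output file names are assumptions.

import quandl

quandl.ApiConfig.api_key = "YOUR_API_KEY"  # placeholder

# Assumed Bank of England GBP spot rate series on Quandl.
series = {"USD": "BOE/XUDLUSS", "AUD": "BOE/XUDLADS", "EUR": "BOE/XUDLERS"}

for code, dataset in series.items():
    # quandl.get returns a DataFrame indexed by date; these series expose a single "Value" column.
    daily = quandl.get(dataset, start_date="2009-01-01", end_date="2016-09-30")
    # Average the daily spot rates within each calendar quarter.
    quarterly = daily["Value"].resample("Q").mean().to_frame("avg_rate")
    # Quarter-on-quarter difference, later fed into the Currency Strength calculation.
    quarterly["diff"] = quarterly["avg_rate"].diff()
    quarterly["currency"] = code
    quarterly.to_csv("gbp_{0}_quarterly.csv".format(code.lower()))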
The R script is called by the Currency ETL package. Once the CSV files are created, the data is extracted, the date is reformatted to the desired quarterly format YYYYQQ, and the rows are inserted into the staging Currency table. The desired Currency rows are selected from the Currency table, grouped by date, and the rate difference is summed. The Currency Strength is calculated and the rows are then inserted into the staging Currency Strength table. The final package is run to load the Currency Strength dimension table in the data warehouse.

Currency Strength is a measure of the strength of GBP against a basket of 3 currencies, namely USD, EUR and AUD. The value of the indicator is either "UP" or "DOWN". "UP" indicates a strong pound relative to the basket, and "DOWN" indicates a weak pound relative to the basket of currencies. For overseas visitors to the UK a "DOWN" position should be more favourable (bearing in mind that the basket could be shielding a currency that has moved the other way, e.g. USD and EUR are strong but a very weak AUD has caused the overall value of the basket to be negative). The Currency Strength is calculated by taking the average quarterly exchange rate of Pound Sterling against the 3 major currencies (i.e. USD, EUR and AUD) and obtaining the quarterly differences for each currency pair. The currency pair differences are summed to provide the basket value which, if positive, sets the Currency Strength indicator to "UP"; otherwise it is set to "DOWN". In the currency dataset no quarterly difference of zero was found; had this been the case, the indicator would have been set to "NO CHANGE". A minimal sketch of this calculation follows.
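The sketch below illustrates that basket logic in Python; the quarterly differences are hypothetical sample values and the real calculation is performed in the SSIS packages.

import pandas as pd

# Hypothetical per-currency quarterly differences for two quarters.
diffs = pd.DataFrame({
    "quarter":  ["201501", "201501", "201501", "201502", "201502", "201502"],
    "currency": ["USD", "EUR", "AUD", "USD", "EUR", "AUD"],
    "diff":     [0.03, -0.01, 0.02, -0.04, -0.02, 0.01],
})

# Sum the per-currency differences into a single basket value per quarter.
basket = diffs.groupby("quarter")["diff"].sum()

def strength(value):
    if value > 0:
        return "UP"         # Sterling strengthened against the basket
    if value < 0:
        return "DOWN"       # Sterling weakened against the basket
    return "NO CHANGE"

print(basket.apply(strength))
# 201501 -> UP   (0.03 - 0.01 + 0.02 =  0.04)
# 201502 -> DOWN (-0.04 - 0.02 + 0.01 = -0.05)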
Business Reviews
To facilitate extraction of the business review data (which is the project's unstructured data, supplied in the downloaded JSON files), a suitable document based database such as MongoDB was used. MongoDB was installed on the same virtual machine as SQL Server to maintain a self-contained environment. The files were imported into the yelp database using mongoimport, based on a tip from Eniod's Blog (2015) on working with the Yelp dataset.

mongoimport --db yelp --collection businesses --file yelp_academic_dataset_business.json
mongoimport --db yelp --collection review --file yelp_academic_dataset_review.json

Using Python and pymongo, two scripts were created. The first script extracts Edinburgh businesses, retrieves the associated reviews dated from 2010 to 2016 and inserts them into a new collection. The second script reads the new collection and sends each text review for entity extraction using the AlchemyAPI (2016). The result of each entity extraction is stored in a dataframe, to which a random Nationality code is added (to associate a review with the Visits nationality data; this addition makes our reporting more interesting as it provides a link to the nationality of the reviewer). Once the entity extraction is complete, the results are written to a CSV file which is then processed through SSIS. The scripts can be configured to set the number of businesses and associated reviews to extract (this assisted testing and limited the API calls, as AlchemyAPI (2016) sets a daily transaction limit). It was noticed that the Yelp dataset had businesses with a review count greater than zero but no corresponding document in the review collection. The scripts could be improved in the future to handle this exception. The workaround for the few businesses in error was to update the review count to zero in the business collection. AlchemyAPI (2016) provides an entity extraction API which is used to discover objects in the textual business reviews such as people, names, places and businesses (Meo, Ferrara, Abel, Aroyo and Houben, 2013). The two Python scripts used to obtain Edinburgh business reviews from MongoDB appear below.

#!Python2.7python
# This program connects to MongoDB and extracts Edinburgh businesses. We limit the number
# of businesses extracted and then get a limited number of associated reviews.
# The extracted reviews are finally inserted into a new collection.
from random import randint
import pymongo
from pymongo import MongoClient

client = MongoClient('localhost', 27017)
db = client.yelp
businesses = db.businesses
reviews = db.review
dwreviews = db.dwreviewvideo  # target collection, read later by the entity extraction script

# get Edinburgh businesses, best rated first, limited for testing
aBus = businesses.find(
    {"city": "Edinburgh", "review_count": {"$gt": 0}},
    {"business_id": 1, "name": 1, "categories": 1}
).sort("stars", pymongo.DESCENDING).limit(2)  # set to 80 for live run
sort("stars", pymongo.DESCENDING).limit(2) # set 80 for live run # create list and dict collReviews = [] mybus = {} # loop through business cursor for busKey in aBus: #print (busKey['busKey['business_id']']) #+ " " + busKey['categories']) #mybus.append(busKey['business_id']) mybus['business_id'] = busKey['business_id'] mybus['name'] = busKey['name'] #collReviews += [mybus] #for each business key get the reviews and write them out to a new collection #we also want to randomly assign a country code to each review to indicate nationality of reviewer print mybus['business_id'] + " " + "**" print mybus['name'] aReview = reviews.find({"business_id": mybus['business_id'], "review_id" : {"$exists" : True}, "date": {"$gt": "2009-12-31"}}, {"review_id": 1, "date": 1, "text": 1, "business_id": 1}). sort("date", pymongo.DESCENDING).limit(3) # set to 100 for live run reviewer = [] for item in aReview: nationalityId = randint(1,75) print (item['business_id'] + "^^ " + item['text']) reviewer.append({"text": item['text'],"review_id": item['review_id'], "date": item['date'], "name": mybus['name'], "business_id": item['business_id'], "nationality_id": nationalityId}) #reviewer['text'] = item['text'] collReviews += [reviewer]
# insert the extracted review documents into the new collection
for rec in collReviews:
    dwreviews.insert(rec)
print('End of Pgm')

Extract Entities Script

#!Python2.7python
# This program reads the extracted Edinburgh reviews from MongoDB, sends each review text
# to the AlchemyAPI entity extraction service and writes the results to a CSV file.
import time
import pandas as pd
from watson_developer_cloud import AlchemyLanguageV1
import pymongo
from pymongo import MongoClient

alchemy_language = AlchemyLanguageV1(api_key='deleted')
client = MongoClient('localhost', 27017)
db = client.yelp

# select which collection to run entity extraction on
reviews = db.dwreviewvideo

# get some reviews, newest first, limited for testing
curReview = reviews.find(
    {}, {"text": 1, "date": 1, "name": 1, "nationality_id": 1}
).sort("date", pymongo.DESCENDING).limit(521)  # set to 521 for live run

mylist = []
# loop through the cursor and call the entity extraction API
for yReview in curReview:
    print(yReview)
    text = yReview['text'].encode('utf-8')
    # get entities for each review
    response = alchemy_language.entities(text)
    # wait for Alchemy to do its thing
    time.sleep(2)
    # add the results to a list of dicts
    for item in response['entities']:
        textLatin1 = item['text'].encode('latin-1')
        mylist.append({'type': item['type'], 'text': textLatin1, 'count': item['count'],
                       'date': yReview['date'], 'name': yReview['name'],
                       'nationality_id': yReview['nationality_id']})

# assign the list to a dataframe for ease of outputting a CSV of the results
df = pd.DataFrame(mylist)
df.to_csv('C:dwDataSetsyelpEntities2.csv', index=False)
print('End of entity extraction')

Using the created CSV, the data is extracted, the date is reformatted to YYYYQQ, and the Nationality Id is used to look up the nationality name and add it to the output flow. The data is then inserted into the staging Review table.
To update the data warehouse Review dimension, the reviews are transformed to obtain the review with the highest count for each quarter (one to match each of the 27 quarters), using a crafted SQL script to update the staging table with an incremented row count. The row number in the subselect is set to limit the rows selected so that one review row matches each quarter.

update review
set reviewDateNo = Crownumber
from (
    select reviewId, reviewDate, reviewCount,
           ROW_NUMBER() over (PARTITION BY reviewDate order by reviewDate, reviewCount DESC) as Crownumber
    from (
        select reviewId, reviewDate, reviewCount,
               ROW_NUMBER() over (PARTITION BY reviewCount order by reviewDate, reviewCount DESC) as rownumber
        from review
        group by reviewId, reviewCount, reviewDate
        -- order by reviewDate, reviewCount DESC
    ) tempQuery
    where tempQuery.rownumber < 200
    group by reviewDate, reviewCount, reviewId
    -- order by Crownumber
) as reviewz
where reviewz.reviewId = review.reviewId

Edinburgh Visits
A mixture of Excel and OpenRefine (2016) was used to reshape the data. A row for each of the 27 quarters is required to meet the grain. The counts for the following countries are summed to provide a quarterly count: US, Australia, France, Germany, Ireland, Spain, Netherlands, Italy, Poland, Belgium, Greece, Austria and Portugal. The summed and reshaped data is shown below; the original visit counts were in thousands and were multiplied by 1000. If a blank was found in the original data it was assigned a zero.
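A minimal pandas sketch of this reshaping step is shown below. The file layout (one row per quarter with a column per country) and the quarter column name are assumptions, as the actual work was carried out in Excel and OpenRefine (2016).

import pandas as pd

# Assumed layout: one row per quarter, with a visit-count column per country (values in thousands).
towns = pd.read_csv("detailed_towns_data_2010_-_2015.csv")
countries = ["US", "Australia", "France", "Germany", "Ireland", "Spain", "Netherlands",
             "Italy", "Poland", "Belgium", "Greece", "Austria", "Portugal"]

# Blanks become zero and counts are scaled from thousands to units,
# then the selected countries are summed into a single quarterly visit count.
towns[countries] = towns[countries].fillna(0) * 1000
towns["visit_count"] = towns[countries].sum(axis=1)
edinburgh_visits = towns[["quarter", "visit_count"]]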
There was no data for 2016, so an average of each quarter between 2010 and 2015 was taken to create the 2016 quarters. The result was a total count of visitors (for the selected basket of countries) by quarter. Visits to towns are based on the towns visitors report spending at least one night in during their trip.

Time
The Time dimension was generated in SSAS and only exists in the data warehouse database. However, as mentioned above, a new column was needed to cater for the exact quarterly representation required to join to the Facts table (in the date format YYYYQQ). This was achieved using the following crafted SQL code, which was run in SSMS.

update t
set t.quarterFactDate = (
    select CONVERT(varchar(4), DATEPART("YYYY", t2.PK_Date)) +
           RIGHT('0' + CONVERT(varchar(2), DATEPART("QQ", t2.PK_Date)), 2)
    from Time t2
    where t2.PK_Date = t.PK_Date)
from Time t

Fact Table - Travel Fact
The Travel Fact table also only exists in the data warehouse database. The ETL created for the fact table is shown below. The ETL must extract the surrogate key from each dimension, gather the measures and merge the data into the Travel Fact table (a minimal sketch of this surrogate key lookup follows below). Each row inserted into the Fact table must match the quarterly grain that was defined during the dimensional modelling. The result of the ETL is the data warehouse database, which is illustrated below.
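As a rough illustration of the surrogate key lookup, the sketch below expresses the load as a single INSERT ... SELECT issued from Python. The table and column names are hypothetical, and the real lookups are performed inside the SSIS package.

import pyodbc

conn = pyodbc.connect("DRIVER={SQL Server};SERVER=localhost;"
                      "DATABASE=DataWarehouse;Trusted_Connection=yes")
cursor = conn.cursor()

# Hypothetical fact load: each staging row is joined to the dimensions on their
# business keys so that only surrogate keys and measures land in the fact table.
cursor.execute("""
    INSERT INTO FactTravel (TimeId, CountryId, NationalityId, ModeId, PurposeId,
                            CurrencyStrengthId, Visits, Spend, NightsStayed, EdinburghVisits)
    SELECT t.TimeId, c.CountryId, n.NationalityId, m.ModeId, p.PurposeId,
           cs.CurrencyStrengthId, s.Visits, s.Spend, s.NightsStayed, s.EdinburghVisits
    FROM   StagingVisits s
           JOIN DimTime t              ON t.QuarterFactDate       = s.VisitQuarter
           JOIN DimCountry c           ON c.CountryCode           = s.CountryCode
           JOIN DimNationality n       ON n.NationalityCode       = s.NationalityCode
           JOIN DimMode m              ON m.ModeCode              = s.ModeCode
           JOIN DimPurpose p           ON p.PurposeCode           = s.PurposeCode
           JOIN DimCurrencyStrength cs ON cs.CurrencyStrengthDate = s.VisitQuarter
""")
conn.commit()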
The ETL has made use of several methods and tools: manual operations with OpenRefine (2016) and Excel, automation via custom programs in R and Python, integration with MongoDB, and the SQL Server tools SSIS and SSMS. The ETL process appears to bear out the observations of Vassiliadis et al. (2005): the required data was recognised in the source data, this data was obtained, a unified format was created through consolidation across the various sources of data (matching the grain), the data was cleansed and shaped to fit the business requirements, and finally it populated the data warehouse.

Case Studies
The deployed cube is shown below; it was connected to Tableau Desktop (2016) to produce the business intelligence charts that support the following case studies.

Visitor Nationalities Traveling to the UK and Edinburgh
How many US and Australian nationals travel to the UK, and how does this compare with several other EU nationalities? What are they spending? Of these visitors, how many visit Edinburgh? This information will help our local offices better assess and address the target market on their home ground.
The prototype data warehouse shows the amount spent and the visit figures for US, Australian and a selection of EU nationalities (France, Germany, Ireland, Spain, Netherlands, Italy, Poland, Belgium) for visits to the UK between 2010 and 2015. The bar chart to the right compares the visit numbers for each quarter, for the same basket of nationalities, with figures for visits to Edinburgh between 2010 and 2015. There appears to be a positive correlation between visits to the UK and visits to Edinburgh. Further analysis would need to be conducted, examining possible causes for fluctuations; for example, obtaining data about major events that may draw visitors to Edinburgh or keep them away would add value to the analysis. Further charts showing trend lines and variance (e.g. quarter on quarter and year on year) within and between both sets of visit data would also be interesting to see.

Currency Strength Impact on Visits and Spend
The business is concerned about the impact of Brexit, and that overseas visitors may stay away due to the volatility of Sterling in its wake. Is it possible to provide any information from our data warehouse to allay these fears?
The charts above indicate the visitor and spend numbers in the light of the strength of Sterling in relation to the US Dollar, Australian Dollar and Euro basket of currencies. It appears that the currency strength does not deter visits or spend. Visitor numbers have increased over the 5 year period and seasonal fluctuations are clearly visible. There appears to be a positive correlation between visits and spend. However, quarters 201502 and 201403 may warrant investigation: visits were higher in 201502 (6.407M) with lower spend (3.085B), whereas 201403 had lower visits (6.232M) but higher spend (3.883B).

Business Review Entity Extraction
Finally, away from the symposium, it would be helpful to provide visitors with places to go and things to see and do when in Edinburgh. Can we provide any points of interest in Edinburgh that will assist them?

The treemap above shows the entities extracted from Edinburgh business reviews between 2010 and 2015. The chart provides the entity name, business name, entity type, the reviewer's nationality and the total visits to the UK for the quarter that the review relates to (data is not displayed if the space is not available, which is an issue when attempting to make comparisons between entities). Taking the entity Hanedan as an example, the AlchemyAPI (2016) returned the entity as a person and a city; it is in fact a Turkish restaurant. The treemap highlighted this unusual pattern and provoked a web search to discover what Hanedan was. A treemap visualisation is useful for exposing patterns that could be of interest and warrant further investigation. The treemap chart works well for presenting small numbers of items. However, treemaps may present a confusing picture when the number of items displayed increases substantially (Tu and Shen, 2008).
References

AlchemyAPI (2016) Entity Extraction API [Online] Available at: http://www.alchemyapi.com/products/alchemylanguage/entity-extraction [Accessed 10 November 2016].

Ariyachandra, T. and Watson, H.J. (2006) 'Which Data Warehouse Architecture Is Most Successful?'. Business Intelligence Journal, 11(1): pp. 4.

Chaudhuri, S. and Dayal, U. (1997) 'An overview of data warehousing and OLAP technology'. ACM SIGMOD Record, 26(1): pp. 65-74.

Eniod's Blog (2015) Import Yelp dataset to MongoDB [Online] Available at: https://haduonght.wordpress.com/2015/02/10/import-yelp-dataset-to-mongodb [Accessed 10 November 2016].

Kimball, R., Ross, M., Thornthwaite, W., Mundy, J. and Becker, B. (2008) The data warehouse lifecycle toolkit. 2nd ed. Indianapolis: Wiley Publishing, Inc.

Meo, P., Ferrara, E., Abel, F., Aroyo, L. and Houben, G. (2013) 'Analyzing user behavior across social sharing environments'. ACM Transactions on Intelligent Systems and Technology (TIST), 5(1): pp. 14-31.

OpenRefine (2016) A free, open source, powerful tool for working with messy data [Online] Available at: http://openrefine.org/ [Accessed 10 November 2016].

QuandlAPI (2016) Quandl API Introduction [Online] Available at: https://www.quandl.com/docs/api [Accessed 10 November 2016].

Rowen, W., Song, I.Y., Medsker, C. and Ewen, E. (2001) 'An analysis of many-to-many relationships between fact and dimension tables in dimensional modeling'. Proceedings of the International Workshop on Design and Management of Data Warehouses (DMDW 2001). Interlaken, Switzerland, 4 June 2001.

Tableau Desktop (2016) Analytics that work the way you think [Online] Available at: http://www.tableau.com/products/desktop [Accessed 10 November 2016].

Tu, Y. and Shen, H. (2008) 'Balloon Focus: a Seamless Multi-Focus+Context Method for Treemaps'. IEEE Transactions on Visualization and Computer Graphics, 14(6): pp. 1157-1164.

UK Office for National Statistics (2016) Methodology: International Passenger Survey background notes [Online] Available at: https://www.ons.gov.uk/peoplepopulationandcommunity/leisureandtourism/methodologies/internationalpassengersurveybackgroundnotes#sample-methodology [Accessed 10 November 2016].

Vassiliadis, P., Simitsis, A., Georgantas, P., Terrovitis, M. and Skiadopoulos, S. (2005) 'A generic and customizable framework for the design of ETL scenarios'. Information Systems, 30(7): pp. 492-525.

Visit Britain (2016) Inbound tourism trends by market [Online] Available at: https://www.visitbritain.org/inbound-tourism-trends [Accessed 10 November 2016].

Yelp Dataset Challenge (2016) Yelp Dataset Challenge [Online] Available at: https://www.yelp.com/dataset_challenge [Accessed 10 November 2016].