Large Scale Data Management - Spring 2018

Outline
• IMDb
• CSV description
• GAV
• Source Schema
• Global Schema
• Mapping
• Talend
• Data pre-processing and cleaning
• Data integration process
• Results
IMDb
IMDb, also known as the Internet Movie Database, is an online database of information related to world films, television programs, home videos, video games, and internet streams, including cast, production crew and personnel, fictional character biographies, plot summaries, trivia, and fan reviews and ratings.
Used dataset

This data integration project uses four files in .csv format:
• movies.csv
• rating_I.csv
• rating_II.csv
• rating_III.csv
These data sources are available on kaggle.com.
Movies.csv description
Fields: movieId, title, genres
It contains 27779 entries.
Rating_I.csv description
Fields: userId, movieId, rating, timestamp
It contains 20000264 entries.
GAV - Global as view
An information integration system I is a triple ⟨G, S, M⟩, where G is the global schema, S is the source schema, and M is the mapping between them.
The most common scenario is the one in which the global schema is created on the basis of observing the data-source schemas, through an intensional integration process of those schemas (think also of a consolidation process, or of a situation in which we want to represent, in an integrated way, the whole information content of an organization's data architecture).
In the GAV approach, the global schema is expressed in terms of the local schemas.
GAV - Global as view
Purpose:
• task based: a data integration program for a specific purpose
• service based: a data integration query with parameters
• domain based: general-purpose data integration (supports any query on that domain)
Type:
• Materialized: a copy of the data is kept in order to manipulate it
• Virtualized: the source is queried each time data is needed; no maintenance policy is required, but it is riskier
Approach:
• axioms
• no axioms
Source Schema

r1: movieId | title | genres | year
r2: userId | movieId | rating | timestamp
Global Schema

movie: movieId | title | genres | year | userId | rating | timestamp | rating_avg
rating: movieId | rating_avg
Mapping

r1: movieId | title | genres
r2: userId | movieId | rating | timestamp
→ r1 and r2 are joined on movieId (the join key)

r2: userId | movieId | rating | timestamp
rating: movieId | rating_avg
→ r2 is grouped by movieId and aggregated (the group function)
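This GAV mapping can be sketched in pandas (a minimal sketch, not the actual Talend jobs): each global relation is written as a view, i.e. a query over the source relations r1 and r2, with the column names taken from the slides.

```python
import pandas as pd

# Sketch of the GAV mapping: every global relation is a view over the sources.

def rating_view(r2: pd.DataFrame) -> pd.DataFrame:
    """rating(movieId, rating_avg): group r2 by movieId, average the ratings."""
    return (r2.groupby("movieId", as_index=False)["rating"]
              .mean()
              .rename(columns={"rating": "rating_avg"}))

def movie_view(r1: pd.DataFrame, r2: pd.DataFrame) -> pd.DataFrame:
    """movie(...): join r1 and r2 on movieId, extended with rating_avg."""
    return r1.merge(r2, on="movieId").merge(rating_view(r2), on="movieId")
```

Under this definition, querying the global schema means evaluating these views over the sources, which is exactly what the Talend jobs below materialize step by step.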
Talend
Talend is a software platform that provides data integration solutions to gain instant value from data by delivering timely and easy access to all historical, live, and emerging data. Talend runs natively in Hadoop using the latest innovations from the Apache ecosystem.
Talend combines big data components for Hadoop MapReduce 2.0 (YARN), Hadoop, HBase, HCatalog, Sqoop, Hive, Oozie, and Pig into a unified open-source environment, to process large data sets quickly.
Talend Interface

Screenshot of the Talend interface, showing the data sources, instruments, workflow, terminal, and component-settings panels.
Data Pre-Processing

The field title in the file movies.csv also contains information regarding the year of the movie.
In order to extract this information, tJavaRow (a Talend component that allows you to enter customized code and integrate it into job workflows) was used.
However, once the code was written and compiled, the Talend shell returned an error on the conversion from type String to int.
Data Pre-Processing

The second attempt used pandas, a Python data-analysis library.
A Python script extracts the year of each movie from the title field and generates a new field called year.
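The extraction step can be sketched as follows (a minimal sketch, assuming titles end with the year in parentheses, e.g. "Toy Story (1995)"; the inline rows stand in for movies.csv):

```python
import pandas as pd

# Tiny inline data standing in for movies.csv.
movies = pd.DataFrame({
    "movieId": [1, 2],
    "title": ["Toy Story (1995)", "Jumanji (1995)"],
    "genres": ["Animation", "Adventure"],
})

# Extract the trailing 4-digit year from title into a new field called year.
movies["year"] = movies["title"].str.extract(r"\((\d{4})\)\s*$")[0].astype(int)
```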
Data Integration

The data integration process was done using just the Talend tools. In Talend it is possible to create a workflow in order to manage and integrate the data.
The high number of entries, both in the .csv files and in the database tables, saturated the memory, and the terminal returned the error:
java.lang.OutOfMemoryError: GC overhead limit exceeded
So the workflow was divided into four parts, because the configuration used was not powerful enough.
Data Integration: Job I

Union with duplicates of rating_I, rating_II, and rating_III.
A new table called r2 was generated in the IMDB database.
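In pandas terms, Job I is a UNION ALL (a minimal sketch; the tiny inline frames stand in for rating_I.csv, rating_II.csv, and rating_III.csv):

```python
import pandas as pd

cols = ["userId", "movieId", "rating", "timestamp"]
rating_I = pd.DataFrame([[1, 10, 4.0, 100]], columns=cols)
rating_II = pd.DataFrame([[2, 10, 3.5, 200]], columns=cols)
rating_III = pd.DataFrame([[1, 10, 4.0, 100]], columns=cols)  # duplicate of rating_I's row

# concat keeps duplicate rows, matching the "union with duplicates" of Job I.
r2 = pd.concat([rating_I, rating_II, rating_III], ignore_index=True)
```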
Data Integration: Job II

A new table called rating was generated in the IMDB database.
Data Integration: Job II

• From database table r2 to database table rating.
r2 contains, for each user, the movies that he or she rated. Entries were grouped by movieId, and rating_avg is now the average rating of the entries that share the same movieId.
The fields userId and timestamp were, of course, removed.
Table rating
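The grouping step of Job II can be sketched like this (a minimal pandas sketch, not the Talend job itself; tiny inline data stands in for table r2):

```python
import pandas as pd

r2 = pd.DataFrame({
    "userId": [1, 2, 3],
    "movieId": [10, 10, 20],
    "rating": [4.0, 3.0, 5.0],
    "timestamp": [100, 200, 300],
})

# Drop userId and timestamp, then average the ratings per movieId.
rating = (r2.drop(columns=["userId", "timestamp"])
            .groupby("movieId", as_index=False)
            .mean()
            .rename(columns={"rating": "rating_avg"}))
```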
Data Integration: Second problem

I tried to integrate the data contained in movies.csv with the new table r2 created in the previous step.
With the lookup-model option “Load once” set, the Talend shell returned the memory error:
java.lang.OutOfMemoryError: GC overhead limit exceeded
I then tried the lookup-model option “Reload at each row (cache)”. In this case the job works, but from the rows/s I estimated the time to complete it at 78 weeks (with my configuration).
The only way to continue with the data integration project was to reduce the number of records in the table r2, so it was cut from 20 million to 80,000.
Data Integration: Job III

Join between movies.csv and the r2 table through the tMap instrument.
Data Integration: Job III
Inside the tMap instrument.
Data Integration: Job III
A new table called IMDBresults was generated in the IMDB database, starting from movies.csv and the table r2; it contains:
movieId, title, genres, year, userId, rating, timestamp
The tMap component was used, setting an inner join on movieId with the “All matches” option activated.
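The tMap inner join with “All matches” corresponds to an ordinary inner merge that keeps one output row per matching rating (a minimal sketch with hypothetical inline data):

```python
import pandas as pd

movies = pd.DataFrame({
    "movieId": [10, 20],
    "title": ["Toy Story", "Jumanji"],
    "genres": ["Animation", "Adventure"],
    "year": [1995, 1995],
})
r2 = pd.DataFrame({
    "userId": [1, 2],
    "movieId": [10, 10],
    "rating": [4.0, 3.0],
    "timestamp": [100, 200],
})

# Inner join on movieId, all matches: movie 10 appears once per rating,
# movie 20 is dropped because it has no rating.
IMDBresults = movies.merge(r2, on="movieId", how="inner")
```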
Data Integration: Job IV

Join between the IMDBresults and rating tables through the tMap instrument.
Data Integration: Job IV
Inside the tMap instrument.
Data Integration: Job IV

A new table called movie was generated in the IMDB2 database, starting from the table IMDBresults and the table rating contained in the IMDB database; it contains:
movieId, title, genres, year, userId, rating, timestamp, rating_avg
The tMap component was used, setting an inner join on movieId with the “Unique match” option activated.
This operation took about 1 hour of computation.
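Since the rating table already holds exactly one row per movieId (it was built by grouping), the “Unique match” inner join of Job IV behaves like a plain inner merge (a minimal sketch with hypothetical inline data):

```python
import pandas as pd

IMDBresults = pd.DataFrame({
    "movieId": [10, 10],
    "title": ["Toy Story", "Toy Story"],
    "genres": ["Animation", "Animation"],
    "year": [1995, 1995],
    "userId": [1, 2],
    "rating": [4.0, 3.0],
    "timestamp": [100, 200],
})
rating = pd.DataFrame({"movieId": [10], "rating_avg": [3.5]})

# One lookup row per movieId, so every IMDBresults row gains its rating_avg.
movie = IMDBresults.merge(rating, on="movieId", how="inner")
```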
Results
movie table
Screenshot from Sequel Pro
Results
rating table
Screenshot from Sequel Pro