IMDb Data Integration
Large Scale Data Management - Spring 2018
Giuseppe Andreetti
Outline
• IMDb
• CSV description
• GAV
• Source Schema
• Global Schema
• Mapping
• Talend
• Data pre-processing and cleaning
• Data integration process
• Results
IMDb
IMDb (the Internet Movie Database) is an online database of
information related to films, television programs, home videos,
video games, and internet streams, including cast, production crew,
personnel and fictional character biographies, plot
summaries, trivia, and fan reviews and ratings.
Used dataset
This data integration project uses four files in .csv format:
• movies.csv
• rating_I.csv
• rating_II.csv
• rating_III.csv
These data sources are available on kaggle.com.
Movies.csv description
Fields: movieId, title, genres

It contains 27779 entries.
Rating_I.csv description
Fields: userId, movieId, rating, timestamp

It contains 20000264 entries.
GAV - Global as view
An information integration system I is a triple <G, S, M>, where G is the
global schema, S is the source schema and M is the mapping between them.
The most common scenario is the one in which the global schema is built
by observing the data source schemas, through an intensional integration
of those schemas (for example a consolidation process, or a situation in
which we want to represent in an integrated way the whole information
content of the data architecture of an organization).
In this case the global schema is expressed in terms of the local
schemas: each global relation is defined as a view over the sources.
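As a compact restatement of the definition above, a GAV mapping associates each element of the global schema with a query over the sources. The notation below is one common way to write this; the symbol q_S, the arrow and the avg aggregate are notational assumptions, not taken from the slides. The two example assertions use the relations r1, r2, movie and rating from the Source Schema and Global Schema slides, and are a sketch rather than the exact mapping used in the project.

```latex
% An integration system: global schema G, source schema S, mapping M.
\[ \mathcal{I} = \langle \mathcal{G}, \mathcal{S}, \mathcal{M} \rangle \]

% GAV: every global element g is defined by a query q_S over the sources.
\[ \mathcal{M} = \{\, g \mapsto q_{\mathcal{S}} \;\mid\; g \in \mathcal{G} \,\} \]

% Sketch for this project; avg ranges over the multiset of ratings
% of all r2 tuples with the same movieId m.
\[ \mathit{rating}(m, a) \mapsto a = \mathrm{avg}\{\, r \mid r_2(u, m, r, t) \,\} \]
\[ \mathit{movie}(m, ti, g, y, u, r, t, a) \mapsto
   r_1(m, ti, g, y) \land r_2(u, m, r, t) \land
   a = \mathrm{avg}\{\, r' \mid r_2(u', m, r', t') \,\} \]
```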
GAV - Global as view
Purpose:
• task based: a data integration program built for a specific purpose
• service based: a data integration query with parameters
• domain based: general-purpose data integration (supports any
query on that domain)
Type:
• Materialized: a copy of the data is kept so that it can be manipulated
• Virtualized: the data is requested from the sources at each query. No
maintenance policy is needed, but it is dangerous.
Approach:
• axioms
• no axioms
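The difference between the two types can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: the class names, the fetch callback and the sample rows are hypothetical and are not part of the project or of Talend.

```python
from typing import Callable, List, Tuple

Row = Tuple[int, float]  # (movieId, rating): hypothetical toy rows


class VirtualView:
    """Asks the source again on every query; nothing is stored locally."""

    def __init__(self, fetch: Callable[[], List[Row]]) -> None:
        self._fetch = fetch

    def rows(self) -> List[Row]:
        return self._fetch()


class MaterializedView:
    """Keeps a local copy; the copy stays stale until refresh() is called."""

    def __init__(self, fetch: Callable[[], List[Row]]) -> None:
        self._fetch = fetch
        self._copy = list(fetch())

    def rows(self) -> List[Row]:
        return list(self._copy)

    def refresh(self) -> None:
        self._copy = list(self._fetch())


source: List[Row] = [(1, 4.0)]
virtual = VirtualView(lambda: list(source))
materialized = MaterializedView(lambda: list(source))

source.append((2, 3.5))  # the source changes after both views were created
```

After the append, virtual.rows() includes the new row, while materialized.rows() still returns the old copy until refresh() is called. That refresh step is the maintenance cost that the virtualized type avoids.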
Source Schema
r1: movieId, title, genres, year
r2: userId, movieId, rating, timestamp
Global Schema
movie: movieId, title, genres, year, userId, rating, timestamp, rating_avg
rating: movieId, rating_avg
Mapping
r1: movieId, title, genres
r2: userId, movieId, rating, timestamp
join key: movieId (the field shared by r1 and r2)
r2: userId, movieId, rating, timestamp
rating: movieId, rating_avg
group function: r2 is grouped by movieId to obtain rating_avg
Talend
Talend is a software platform that provides data integration solutions,
giving timely and easy access to historical, live and emerging data.
Talend runs natively in Hadoop using components from the
Apache ecosystem.
Talend combines big data components for Hadoop
MapReduce 2.0 (YARN), Hadoop, HBase, HCatalog, Sqoop,
Hive, Oozie, and Pig into a unified open source environment,
to process large volumes of data quickly.
Talend Interface
[Screenshot of the Talend interface; labeled areas: Data sources, Instruments, Workflow, Terminal, Component settings]
Data Pre-Processing
The field title in the file movies.csv also contains the year of the movie.
The first attempt to extract this information used tJavaRow, a Talend
component that lets you enter custom code and integrate it into job
workflows.
However, once the code was written and compiled, the Talend shell returned
an error on the conversion from type String to int.
Data Pre-Processing
The second attempt used Pandas, a Python data analysis library.
A Python script extracted the year of each movie from the field title and
stored it in a new field called year.
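The extraction step can be sketched as follows. The original script is not shown in the slides, so this is only an illustration: it uses the standard library re module instead of Pandas so that it runs without dependencies, and it assumes that the year appears as four digits in parentheses at the end of the title. The function name and the sample titles are hypothetical.

```python
import re
from typing import Optional

# Assumption: the year is written as "(YYYY)" at the end of the title.
YEAR_AT_END = re.compile(r"\((\d{4})\)\s*$")


def extract_year(title: str) -> Optional[int]:
    """Return the year found at the end of the title, or None if there is none."""
    match = YEAR_AT_END.search(title)
    if match is None:
        return None
    return int(match.group(1))


# Hypothetical sample titles, not rows copied from movies.csv.
for raw in ["Some Movie (1995)", "Another Movie (2001) ", "Untitled Project"]:
    print(raw, "->", extract_year(raw))
```

Titles with no recognizable year yield None instead of failing on the string-to-integer conversion. With Pandas, the same regular expression could be passed to Series.str.extract to fill the year column.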
Data Integration
The data integration process was carried out using only the Talend tools.
In Talend it is possible to create a workflow to manage and integrate
the data.
The large number of entries in both the .csv files and the database tables
saturated the memory, and the terminal returned the error:
java.lang.OutOfMemoryError: GC overhead limit exceeded
The workflow was therefore divided into four parts, because the
configuration used was not powerful enough.
Data Integration: Job I
Union, keeping duplicates, of rating_I, rating_II and rating_III.
A new table called r2 was generated in the IMDB database.
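A union that keeps duplicates can be sketched as a row-by-row concatenation of the inputs. This is not the Talend job itself. It assumes that all three rating files share the header described for rating_I.csv, and it uses in-memory strings as hypothetical stand-ins for the real files.

```python
import csv
import io
from itertools import chain
from typing import Iterable, Iterator, List, TextIO

FIELDS = ["userId", "movieId", "rating", "timestamp"]


def read_ratings(handle: TextIO) -> Iterator[dict]:
    """Yield one dict per CSV row, checking that the header matches FIELDS."""
    reader = csv.DictReader(handle)
    if reader.fieldnames != FIELDS:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    yield from reader


def union_all(handles: Iterable[TextIO]) -> Iterator[dict]:
    """Concatenate the inputs lazily; duplicate rows are kept, not removed."""
    return chain.from_iterable(read_ratings(h) for h in handles)


# Hypothetical stand-ins for rating_I.csv, rating_II.csv and rating_III.csv.
part_1 = io.StringIO("userId,movieId,rating,timestamp\n1,10,4.0,100\n")
part_2 = io.StringIO("userId,movieId,rating,timestamp\n1,10,4.0,100\n2,10,3.0,200\n")
part_3 = io.StringIO("userId,movieId,rating,timestamp\n3,20,5.0,300\n")

r2: List[dict] = list(union_all([part_1, part_2, part_3]))
```

Because the rows are produced lazily, the union itself never needs to hold more than one row in memory; only the final list() call here materializes the result.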
Data Integration: Job II
A new table called rating was generated in the IMDB database.
Data Integration: Job II
• From database table r2 to database table rating.
r2 contains, for each user, the movies that the user rated.
Entries were grouped by movieId; rating_avg is the average rating of
the entries that have the same movieId.
The fields userId and timestamp were removed, because they do not apply
to a per-movie aggregate.
Table rating
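The group function can be sketched as a single pass that keeps a running sum and count per movie. This is an illustration, not the Talend component used in the job; the function name and the sample rows are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def average_by_movie(rows: Iterable[Tuple[int, int, float, int]]) -> Dict[int, float]:
    """Group (userId, movieId, rating, timestamp) rows by movieId and average the rating."""
    # movieId -> [sum of ratings, number of ratings]
    totals: Dict[int, List[float]] = defaultdict(lambda: [0.0, 0])
    for _user_id, movie_id, rating, _timestamp in rows:
        acc = totals[movie_id]
        acc[0] += rating
        acc[1] += 1
    return {movie_id: total / count for movie_id, (total, count) in totals.items()}


# Hypothetical r2 rows.
r2 = [(1, 10, 4.0, 100), (2, 10, 3.0, 200), (3, 20, 5.0, 300)]
rating = average_by_movie(r2)
```

The accumulator grows with the number of distinct movies rather than with the number of ratings, so the input can be streamed instead of loaded at once.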
Data Integration: Second problem
I tried to integrate the data contained in movies.csv with the new table r2
created in Job I.
With the lookup model option “Load once” set, the Talend shell returned the
memory error:
java.lang.OutOfMemoryError: GC overhead limit exceeded
I then tried the lookup model option “Reload at each row (cache)”.
This works, but the completion time I estimated from the rows/s rate
was 78 weeks (with my configuration).
The only way to continue with the data integration project was to reduce the
number of records in the table r2.
So the number of records was reduced from 20 million to 80000.
Data Integration: Job III
Join between movies.csv and r2 table through the tMap instrument.
Data Integration: Job III
Inside the tMap instrument.
Data Integration: Job III
A new table called IMDBresults was generated in the IMDB database,
starting from movies.csv and the table r2. It contains:
movieId, title, genres, year, userId, rating, timestamp
The tMap component was used, with an Inner Join on movieId and
the “All matches” option activated.
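An inner join that emits every matching pair can be sketched as a generic hash join. This is not a description of how tMap works internally; the function name, the type aliases and the sample rows are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Movie = Tuple[int, str, str, int]      # movieId, title, genres, year
Rating = Tuple[int, int, float, int]   # userId, movieId, rating, timestamp
Joined = Tuple[int, str, str, int, int, float, int]


def inner_join_all_matches(movies: Iterable[Movie], ratings: Iterable[Rating]) -> Iterator[Joined]:
    """Hash join on movieId: one output row per (movie, rating) pair with the same key."""
    by_movie: Dict[int, List[Rating]] = defaultdict(list)
    for row in ratings:
        by_movie[row[1]].append(row)
    for movie_id, title, genres, year in movies:
        # Movies without ratings produce no output (inner join), and
        # ratings whose movieId is not in movies are never emitted.
        for user_id, _movie_id, rating, timestamp in by_movie.get(movie_id, []):
            yield (movie_id, title, genres, year, user_id, rating, timestamp)


# Hypothetical inputs shaped like movies.csv (with year) and r2.
movies = [(10, "Some Movie", "Drama", 1995), (30, "Unrated Movie", "Comedy", 2001)]
r2 = [(1, 10, 4.0, 100), (2, 10, 3.0, 200), (3, 20, 5.0, 300)]
imdb_results = list(inner_join_all_matches(movies, r2))
```

In this sketch the whole ratings side is indexed in memory before the join starts, which illustrates why the size of r2 matters for a join of this kind.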
Data Integration: Job IV
Join between IMDBresults and rating tables through the tMap instrument.
Data Integration: Job IV
Inside the tMap instrument.
Data Integration: Job IV
A new table called movie was generated in the IMDB2 database, starting
from the tables IMDBresults and rating contained in the IMDB
database. It contains:
movieId, title, genres, year, userId, rating, timestamp, rating_avg
The tMap component was used, with an Inner Join on movieId and
the “Unique match” option activated.
This operation took about 1 hour of computation.
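Since the table rating holds one row per movieId, this last join reduces to a key lookup. The sketch below illustrates that idea with a dictionary; it is not the Talend job, and the function name and sample data are hypothetical.

```python
from typing import Dict, Iterable, Iterator, Tuple

Joined = Tuple[int, str, str, int, int, float, int]
Final = Tuple[int, str, str, int, int, float, int, float]


def add_rating_avg(rows: Iterable[Joined], rating_avg: Dict[int, float]) -> Iterator[Final]:
    """Inner join on movieId against a table with at most one row per movieId."""
    for row in rows:
        movie_id = row[0]
        if movie_id in rating_avg:  # rows without an average are dropped (inner join)
            yield row + (rating_avg[movie_id],)


# Hypothetical inputs shaped like IMDBresults and rating.
imdb_results = [
    (10, "Some Movie", "Drama", 1995, 1, 4.0, 100),
    (10, "Some Movie", "Drama", 1995, 2, 3.0, 200),
]
rating = {10: 3.5}
movie = list(add_rating_avg(imdb_results, rating))
```

Each input row produces at most one output row, so the result has the same granularity as IMDBresults, with rating_avg repeated for every rating of the same movie.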
Results
movie table
Screenshot from Sequel Pro
Results
rating table
Screenshot from Sequel Pro

More Related Content

What's hot

Anomaly detection (Unsupervised Learning) in Machine Learning
Anomaly detection (Unsupervised Learning) in Machine LearningAnomaly detection (Unsupervised Learning) in Machine Learning
Anomaly detection (Unsupervised Learning) in Machine LearningKuppusamy P
 
diabetic retinopathy.pptx
diabetic retinopathy.pptxdiabetic retinopathy.pptx
diabetic retinopathy.pptxKomal Naphade
 
NETFLIX (BIG DATA ANALYTICS )
NETFLIX (BIG DATA ANALYTICS )NETFLIX (BIG DATA ANALYTICS )
NETFLIX (BIG DATA ANALYTICS )ANKUSH
 
Hiring Process Analytics .pdf
Hiring Process Analytics .pdfHiring Process Analytics .pdf
Hiring Process Analytics .pdfVaibhaviKhedekar1
 
Diabetic Retinopathy
Diabetic RetinopathyDiabetic Retinopathy
Diabetic RetinopathyAtif Khan
 
Object Detection and Recognition
Object Detection and Recognition Object Detection and Recognition
Object Detection and Recognition Intel Nervana
 
Object recognition
Object recognitionObject recognition
Object recognitionsaniacorreya
 
Anomaly detection with machine learning at scale
Anomaly detection with machine learning at scaleAnomaly detection with machine learning at scale
Anomaly detection with machine learning at scaleImpetus Technologies
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine LearningShahar Cohen
 
Netflix Recommender System : Big Data Case Study
Netflix Recommender System : Big Data Case StudyNetflix Recommender System : Big Data Case Study
Netflix Recommender System : Big Data Case StudyKetan Patil
 
EDA_Case_Study_PPT.pptx
EDA_Case_Study_PPT.pptxEDA_Case_Study_PPT.pptx
EDA_Case_Study_PPT.pptxAmitDas125851
 
Deep Learning in Computer Vision
Deep Learning in Computer VisionDeep Learning in Computer Vision
Deep Learning in Computer VisionSungjoon Choi
 
Data mining to predict academic performance.
Data mining to predict academic performance. Data mining to predict academic performance.
Data mining to predict academic performance. Ranjith Gowda
 
Students academic performance using clustering technique
Students academic performance using clustering techniqueStudents academic performance using clustering technique
Students academic performance using clustering techniquesaniacorreya
 
Modelos De Data Mining
Modelos De Data MiningModelos De Data Mining
Modelos De Data Miningbrobelo
 

What's hot (20)

Anomaly detection (Unsupervised Learning) in Machine Learning
Anomaly detection (Unsupervised Learning) in Machine LearningAnomaly detection (Unsupervised Learning) in Machine Learning
Anomaly detection (Unsupervised Learning) in Machine Learning
 
Three Big Data Case Studies
Three Big Data Case StudiesThree Big Data Case Studies
Three Big Data Case Studies
 
Hiring Process Analytics.pptx
Hiring Process Analytics.pptxHiring Process Analytics.pptx
Hiring Process Analytics.pptx
 
diabetic retinopathy.pptx
diabetic retinopathy.pptxdiabetic retinopathy.pptx
diabetic retinopathy.pptx
 
NETFLIX (BIG DATA ANALYTICS )
NETFLIX (BIG DATA ANALYTICS )NETFLIX (BIG DATA ANALYTICS )
NETFLIX (BIG DATA ANALYTICS )
 
Hiring Process Analytics .pdf
Hiring Process Analytics .pdfHiring Process Analytics .pdf
Hiring Process Analytics .pdf
 
Diabetic Retinopathy
Diabetic RetinopathyDiabetic Retinopathy
Diabetic Retinopathy
 
Object Detection and Recognition
Object Detection and Recognition Object Detection and Recognition
Object Detection and Recognition
 
Object recognition
Object recognitionObject recognition
Object recognition
 
Anomaly detection with machine learning at scale
Anomaly detection with machine learning at scaleAnomaly detection with machine learning at scale
Anomaly detection with machine learning at scale
 
Object detection
Object detectionObject detection
Object detection
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 
Image captioning
Image captioningImage captioning
Image captioning
 
Netflix Recommender System : Big Data Case Study
Netflix Recommender System : Big Data Case StudyNetflix Recommender System : Big Data Case Study
Netflix Recommender System : Big Data Case Study
 
EDA_Case_Study_PPT.pptx
EDA_Case_Study_PPT.pptxEDA_Case_Study_PPT.pptx
EDA_Case_Study_PPT.pptx
 
Deep Learning in Computer Vision
Deep Learning in Computer VisionDeep Learning in Computer Vision
Deep Learning in Computer Vision
 
Data mining to predict academic performance.
Data mining to predict academic performance. Data mining to predict academic performance.
Data mining to predict academic performance.
 
Tableau - bar chart
Tableau - bar chartTableau - bar chart
Tableau - bar chart
 
Students academic performance using clustering technique
Students academic performance using clustering techniqueStudents academic performance using clustering technique
Students academic performance using clustering technique
 
Modelos De Data Mining
Modelos De Data MiningModelos De Data Mining
Modelos De Data Mining
 

Similar to IMDb Data Integration

Gimel at Dataworks Summit San Jose 2018
Gimel at Dataworks Summit San Jose 2018Gimel at Dataworks Summit San Jose 2018
Gimel at Dataworks Summit San Jose 2018Romit Mehta
 
Dataworks | 2018-06-20 | Gimel data platform
Dataworks | 2018-06-20 | Gimel data platformDataworks | 2018-06-20 | Gimel data platform
Dataworks | 2018-06-20 | Gimel data platformDeepak Chandramouli
 
Building a Data Platform Strata SF 2019
Building a Data Platform Strata SF 2019Building a Data Platform Strata SF 2019
Building a Data Platform Strata SF 2019mark madsen
 
BDVe Webinar Series: DataBench – Benchmarking Big Data. Arne Berre. Tue, Oct ...
BDVe Webinar Series: DataBench – Benchmarking Big Data. Arne Berre. Tue, Oct ...BDVe Webinar Series: DataBench – Benchmarking Big Data. Arne Berre. Tue, Oct ...
BDVe Webinar Series: DataBench – Benchmarking Big Data. Arne Berre. Tue, Oct ...Big Data Value Association
 
Big Data Technical Benchmarking, Arne Berre, BDVe Webinar series, 09/10/2018
Big Data Technical Benchmarking, Arne Berre, BDVe Webinar series, 09/10/2018 Big Data Technical Benchmarking, Arne Berre, BDVe Webinar series, 09/10/2018
Big Data Technical Benchmarking, Arne Berre, BDVe Webinar series, 09/10/2018 DataBench
 
QCon 2018 | Gimel | PayPal's Analytic Platform
QCon 2018 | Gimel | PayPal's Analytic PlatformQCon 2018 | Gimel | PayPal's Analytic Platform
QCon 2018 | Gimel | PayPal's Analytic PlatformDeepak Chandramouli
 
IRJET- Analysis of Boston’s Crime Data using Apache Pig
IRJET- Analysis of Boston’s Crime Data using Apache PigIRJET- Analysis of Boston’s Crime Data using Apache Pig
IRJET- Analysis of Boston’s Crime Data using Apache PigIRJET Journal
 
Tiered Data Sets in Amazon Redshift (ANT321) - AWS re:Invent 2018
Tiered Data Sets in Amazon Redshift (ANT321) - AWS re:Invent 2018Tiered Data Sets in Amazon Redshift (ANT321) - AWS re:Invent 2018
Tiered Data Sets in Amazon Redshift (ANT321) - AWS re:Invent 2018Amazon Web Services
 
Improving Business Performance Through Big Data Benchmarking, Todor Ivanov, B...
Improving Business Performance Through Big Data Benchmarking, Todor Ivanov, B...Improving Business Performance Through Big Data Benchmarking, Todor Ivanov, B...
Improving Business Performance Through Big Data Benchmarking, Todor Ivanov, B...DataBench
 
Relating Big Data Business and Technical Performance Indicators, Barbara Pern...
Relating Big Data Business and Technical Performance Indicators, Barbara Pern...Relating Big Data Business and Technical Performance Indicators, Barbara Pern...
Relating Big Data Business and Technical Performance Indicators, Barbara Pern...DataBench
 
Custom Reports & Integrations with GraphQL
Custom Reports & Integrations with GraphQLCustom Reports & Integrations with GraphQL
Custom Reports & Integrations with GraphQLLeanIX GmbH
 
Stream processing for the practitioner: Blueprints for common stream processi...
Stream processing for the practitioner: Blueprints for common stream processi...Stream processing for the practitioner: Blueprints for common stream processi...
Stream processing for the practitioner: Blueprints for common stream processi...Aljoscha Krettek
 
Adding Velocity to BigBench
Adding Velocity to BigBenchAdding Velocity to BigBench
Adding Velocity to BigBencht_ivanov
 
Adding Velocity to BigBench, Todor Ivanov, Patrick Bedué, Roberto Zicari, Ahm...
Adding Velocity to BigBench, Todor Ivanov, Patrick Bedué, Roberto Zicari, Ahm...Adding Velocity to BigBench, Todor Ivanov, Patrick Bedué, Roberto Zicari, Ahm...
Adding Velocity to BigBench, Todor Ivanov, Patrick Bedué, Roberto Zicari, Ahm...DataBench
 
Maximize Efficiency with Minitab Workspace and Minitab Statistical Software -...
Maximize Efficiency with Minitab Workspace and Minitab Statistical Software -...Maximize Efficiency with Minitab Workspace and Minitab Statistical Software -...
Maximize Efficiency with Minitab Workspace and Minitab Statistical Software -...Minitab, LLC
 
About The Event-Driven Data Layer & Adobe Analytics
About The Event-Driven Data Layer & Adobe AnalyticsAbout The Event-Driven Data Layer & Adobe Analytics
About The Event-Driven Data Layer & Adobe AnalyticsKevin Haag
 
Airline Reservations and Routing: A Graph Use Case
Airline Reservations and Routing: A Graph Use CaseAirline Reservations and Routing: A Graph Use Case
Airline Reservations and Routing: A Graph Use CaseJason Plurad
 
Tableau @ Spil Games
Tableau @ Spil GamesTableau @ Spil Games
Tableau @ Spil GamesRob Winters
 
How Financial Services can Save On File Storage
How Financial Services can Save On File Storage How Financial Services can Save On File Storage
How Financial Services can Save On File Storage Charly Mostert
 

Similar to IMDb Data Integration (20)

Hadoop Training
Hadoop TrainingHadoop Training
Hadoop Training
 
Gimel at Dataworks Summit San Jose 2018
Gimel at Dataworks Summit San Jose 2018Gimel at Dataworks Summit San Jose 2018
Gimel at Dataworks Summit San Jose 2018
 
Dataworks | 2018-06-20 | Gimel data platform
Dataworks | 2018-06-20 | Gimel data platformDataworks | 2018-06-20 | Gimel data platform
Dataworks | 2018-06-20 | Gimel data platform
 
Building a Data Platform Strata SF 2019
Building a Data Platform Strata SF 2019Building a Data Platform Strata SF 2019
Building a Data Platform Strata SF 2019
 
BDVe Webinar Series: DataBench – Benchmarking Big Data. Arne Berre. Tue, Oct ...
BDVe Webinar Series: DataBench – Benchmarking Big Data. Arne Berre. Tue, Oct ...BDVe Webinar Series: DataBench – Benchmarking Big Data. Arne Berre. Tue, Oct ...
BDVe Webinar Series: DataBench – Benchmarking Big Data. Arne Berre. Tue, Oct ...
 
Big Data Technical Benchmarking, Arne Berre, BDVe Webinar series, 09/10/2018
Big Data Technical Benchmarking, Arne Berre, BDVe Webinar series, 09/10/2018 Big Data Technical Benchmarking, Arne Berre, BDVe Webinar series, 09/10/2018
Big Data Technical Benchmarking, Arne Berre, BDVe Webinar series, 09/10/2018
 
QCon 2018 | Gimel | PayPal's Analytic Platform
QCon 2018 | Gimel | PayPal's Analytic PlatformQCon 2018 | Gimel | PayPal's Analytic Platform
QCon 2018 | Gimel | PayPal's Analytic Platform
 
IRJET- Analysis of Boston’s Crime Data using Apache Pig
IRJET- Analysis of Boston’s Crime Data using Apache PigIRJET- Analysis of Boston’s Crime Data using Apache Pig
IRJET- Analysis of Boston’s Crime Data using Apache Pig
 
Tiered Data Sets in Amazon Redshift (ANT321) - AWS re:Invent 2018
Tiered Data Sets in Amazon Redshift (ANT321) - AWS re:Invent 2018Tiered Data Sets in Amazon Redshift (ANT321) - AWS re:Invent 2018
Tiered Data Sets in Amazon Redshift (ANT321) - AWS re:Invent 2018
 
Improving Business Performance Through Big Data Benchmarking, Todor Ivanov, B...
Improving Business Performance Through Big Data Benchmarking, Todor Ivanov, B...Improving Business Performance Through Big Data Benchmarking, Todor Ivanov, B...
Improving Business Performance Through Big Data Benchmarking, Todor Ivanov, B...
 
Relating Big Data Business and Technical Performance Indicators, Barbara Pern...
Relating Big Data Business and Technical Performance Indicators, Barbara Pern...Relating Big Data Business and Technical Performance Indicators, Barbara Pern...
Relating Big Data Business and Technical Performance Indicators, Barbara Pern...
 
Custom Reports & Integrations with GraphQL
Custom Reports & Integrations with GraphQLCustom Reports & Integrations with GraphQL
Custom Reports & Integrations with GraphQL
 
Stream processing for the practitioner: Blueprints for common stream processi...
Stream processing for the practitioner: Blueprints for common stream processi...Stream processing for the practitioner: Blueprints for common stream processi...
Stream processing for the practitioner: Blueprints for common stream processi...
 
Adding Velocity to BigBench
Adding Velocity to BigBenchAdding Velocity to BigBench
Adding Velocity to BigBench
 
Adding Velocity to BigBench, Todor Ivanov, Patrick Bedué, Roberto Zicari, Ahm...
Adding Velocity to BigBench, Todor Ivanov, Patrick Bedué, Roberto Zicari, Ahm...Adding Velocity to BigBench, Todor Ivanov, Patrick Bedué, Roberto Zicari, Ahm...
Adding Velocity to BigBench, Todor Ivanov, Patrick Bedué, Roberto Zicari, Ahm...
 
Maximize Efficiency with Minitab Workspace and Minitab Statistical Software -...
Maximize Efficiency with Minitab Workspace and Minitab Statistical Software -...Maximize Efficiency with Minitab Workspace and Minitab Statistical Software -...
Maximize Efficiency with Minitab Workspace and Minitab Statistical Software -...
 
About The Event-Driven Data Layer & Adobe Analytics
About The Event-Driven Data Layer & Adobe AnalyticsAbout The Event-Driven Data Layer & Adobe Analytics
About The Event-Driven Data Layer & Adobe Analytics
 
Airline Reservations and Routing: A Graph Use Case
Airline Reservations and Routing: A Graph Use CaseAirline Reservations and Routing: A Graph Use Case
Airline Reservations and Routing: A Graph Use Case
 
Tableau @ Spil Games
Tableau @ Spil GamesTableau @ Spil Games
Tableau @ Spil Games
 
How Financial Services can Save On File Storage
How Financial Services can Save On File Storage How Financial Services can Save On File Storage
How Financial Services can Save On File Storage
 

Recently uploaded

Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareGraham Ware
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraGovindSinghDasila
 
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...gajnagarg
 
Kings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about themKings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about themeitharjee
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...gajnagarg
 
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...kumargunjan9515
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...nirzagarg
 
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowgargpaaro
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabiaahmedjiabur940
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNKTimothy Spann
 
Dubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls DubaiDubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls Dubaikojalkojal131
 
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...kumargunjan9515
 
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptxRESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptxronsairoathenadugay
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...nirzagarg
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...nirzagarg
 
7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.pptibrahimabdi22
 
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...HyderabadDolls
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxchadhar227
 
Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1ranjankumarbehera14
 
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...HyderabadDolls
 

Recently uploaded (20)

Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham Ware
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - Almora
 
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
 
Kings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about themKings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about them
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
 
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
 
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
Dubai Call Girls Peeing O525547819 Call Girls Dubai

IMDb Data Integration

  • 1. IMDb Data Integration Large Scale Data Management - Spring 2018 Giuseppe Andreetti
  • 2. Outline: IMDb; CSV description; GAV; Source Schema; Global Schema; Mapping; Talend; Data pre-processing and cleaning; Data integration process; Results
  • 3. IMDb: IMDb, also known as the Internet Movie Database, is an online database of information related to world films, television programs, home videos, video games, and internet streams, including cast, production crew, personnel and fictional character biographies, plot summaries, trivia, and fan reviews and ratings.
  • 4. Used dataset: This data integration project uses four files in .csv format: movies.csv, rating_I.csv, rating_II.csv, and rating_III.csv. These data sources are available on kaggle.com.
  • 5. Movies.csv description: Fields: movieId, title, genres. It contains 27779 entries.
  • 6. Rating_I.csv description: Fields: userId, movieId, rating, timestamp. It contains 20000264 entries.
  • 7. GAV - Global as view: An information integration system I is a triple <G, S, M>. The most usual scenario is the one in which the global schema is created by observing the data source schemas, through an intensional integration process of those schemas (think also of the consolidation process, or of a situation in which we want to represent in an integrated way the whole information content of the data architecture of an organization). In this case the global schema is expressed in terms of the local schemas.
  • 8. GAV - Global as view: Purpose: task based (a data integration program for a specific purpose); service based (a data integration query with parameters); domain based (general-purpose data integration that supports any query on that domain). Type: materialized (a copy of the data is kept so that it can be manipulated); virtualized (the data is requested from the source each time; no maintenance policy is needed, but it is dangerous). Approach: axioms; no axioms.
  • 9. Source Schema: r1(movieId, title, genres, year); r2(userId, movieId, rating, timestamp)
  • 10. Global Schema: movie(movieId, title, genres, year, userId, rating, timestamp, rating_avg); rating(movieId, rating_avg)
  • 11. Mapping: r1(movieId, title, genres) and r2(userId, movieId, rating, timestamp) are combined through a join key; r2(userId, movieId, rating, timestamp) is mapped to rating(movieId, rating_avg) through a group function.
  • 12. Talend: Talend is software that provides data integration solutions to gain instant value from data by delivering timely and easy access to all historical, live, and emerging data. Talend runs natively in Hadoop using the latest innovations from the Apache ecosystem. Talend combines big data components for Hadoop MapReduce 2.0 (YARN), Hadoop, HBase, HCatalog, Sqoop, Hive, Oozie, and Pig into a unified open source environment to process large datasets quickly.
  • 13. Talend Interface: [Screenshot of the Talend interface with labelled areas: Data sources, Instruments, Workflow, Terminal, Component settings]
  • 14. Data Pre-Processing: The title field in movies.csv also contains the year of the movie. To extract it, the first attempt used tJavaRow, a Talend component that lets you enter custom code and integrate it into job workflows. Once the code was written and compiled, however, the Talend shell returned an error on the conversion from String to int.
  • 15. Data Pre-Processing: The second attempt used Pandas, a Python data analysis library. A Python script extracted the year of each movie from the title field and generated a new field called year.
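The extraction script itself is not shown in the deck. As a minimal sketch of the idea, the example below pulls a trailing four-digit year out of each title using only the standard library `re` module rather than Pandas. The title format `"Name (YYYY)"`, the sample rows, and the helper name `extract_year` are assumptions made for illustration.

```python
import re

# Assumption: the year is the last parenthesised four-digit group in the title.
YEAR_PATTERN = re.compile(r"\((\d{4})\)\s*$")


def extract_year(title):
    """Return the trailing year as an int, or None when the title has none."""
    match = YEAR_PATTERN.search(title)
    return int(match.group(1)) if match else None


# Illustrative rows only; they are not taken from the real movies.csv.
rows = [
    {"movieId": 1, "title": "Toy Story (1995)", "genres": "Animation|Children"},
    {"movieId": 2, "title": "Untitled Project", "genres": "Drama"},
]

# Add the new field "year" to every row.
for row in rows:
    row["year"] = extract_year(row["title"])
```

Returning None for titles without a year keeps the field numeric where a value exists, which avoids the kind of String-to-int conversion failure reported for the tJavaRow attempt.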
  • 16. Data Integration: The data integration itself was done using only the Talend tools. In Talend it is possible to create a workflow to manage and integrate the data. The large number of entries in both the .csv files and the database tables saturated the memory, and the terminal returned the error: java.lang.OutOfMemoryError: GC overhead limit exceeded. The workflow was therefore divided into four parts, because the configuration used was not powerful enough.
  • 17. Data Integration: Job I: Union, keeping duplicates, of rating_I, rating_II, and rating_III. A new table called r2 was generated in the IMDB database.
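Job I was built with Talend components, so the following is only a sketch of the equivalent relational operation, not the author's workflow. It assumes that "union with duplicate" corresponds to SQL `UNION ALL`, and it uses an in-memory SQLite database with made-up rows in place of the real IMDB database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the IMDB database

# One table per source file, all sharing the rating schema.
for name in ("rating_I", "rating_II", "rating_III"):
    conn.execute(
        f"CREATE TABLE {name} "
        "(userId INTEGER, movieId INTEGER, rating REAL, timestamp INTEGER)"
    )

# Illustrative rows; the third repeats the first on purpose.
conn.execute("INSERT INTO rating_I VALUES (1, 10, 4.0, 1000)")
conn.execute("INSERT INTO rating_II VALUES (2, 10, 3.0, 2000)")
conn.execute("INSERT INTO rating_III VALUES (1, 10, 4.0, 1000)")

# UNION ALL keeps duplicate rows, unlike plain UNION.
conn.execute(
    """
    CREATE TABLE r2 AS
    SELECT * FROM rating_I
    UNION ALL SELECT * FROM rating_II
    UNION ALL SELECT * FROM rating_III
    """
)

row_count = conn.execute("SELECT COUNT(*) FROM r2").fetchone()[0]
```

With plain `UNION` the repeated row would be collapsed and r2 would hold two rows instead of three.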
  • 18. Data Integration: Job II: A new table called rating was generated in the IMDB database.
  • 19. Data Integration: Job II: From database table r2 to database table rating. r2 contains, for each user, the movies that the user rated. Entries were grouped by movieId, so rating_avg is the average rating of the entries that share the same movieId. The fields userId and timestamp were therefore removed. [Screenshot: table rating]
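The grouping step of Job II can be sketched as a `GROUP BY` with an `AVG` aggregate. This is an illustration in SQLite with made-up rows, not the Talend job itself; the table and column names follow the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the IMDB database
conn.execute(
    "CREATE TABLE r2 "
    "(userId INTEGER, movieId INTEGER, rating REAL, timestamp INTEGER)"
)
# Illustrative rows: two ratings for movie 10, one for movie 20.
conn.executemany(
    "INSERT INTO r2 VALUES (?, ?, ?, ?)",
    [(1, 10, 4.0, 1000), (2, 10, 3.0, 2000), (1, 20, 5.0, 3000)],
)

# Group by movieId and average the ratings; userId and timestamp
# are not selected, so they do not appear in the new table.
conn.execute(
    "CREATE TABLE rating AS "
    "SELECT movieId, AVG(rating) AS rating_avg FROM r2 GROUP BY movieId"
)

averages = dict(conn.execute("SELECT movieId, rating_avg FROM rating"))
```

Here `averages` maps each movieId to its mean rating, one entry per movie.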
  • 20. Data Integration: Second problem: I tried to integrate the data contained in movies.csv with the new table r2 created in the previous step. With the lookup model option set to "Load once", the Talend shell returned the memory error java.lang.OutOfMemoryError: GC overhead limit exceeded. I then tried the lookup model option "reload at each row (cache)". In this case the job works, but the completion time I estimated from the rows/s rate was 78 weeks (with my configuration). The only way to continue with the data integration project was to reduce the number of records in the table r2, so the number of records was reduced from 20 million to 80000.
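The deck does not report the observed rows/s figure, so the arithmetic behind the estimate can only be sketched. The helper names below are hypothetical, and the row count of about 20 million is taken from the slide's own rounding.

```python
SECONDS_PER_WEEK = 7 * 24 * 60 * 60


def estimated_weeks(total_rows, rows_per_second):
    """Weeks needed to process total_rows at a constant throughput."""
    return total_rows / rows_per_second / SECONDS_PER_WEEK


def implied_rows_per_second(total_rows, weeks):
    """Throughput that would make the job last the given number of weeks."""
    return total_rows / (weeks * SECONDS_PER_WEEK)


# Assumption: roughly 20 million rows to process, 78 weeks estimated.
rate = implied_rows_per_second(20_000_000, 78)
```

Under these assumptions the 78-week estimate corresponds to a throughput of roughly 0.42 rows per second.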
  • 21. Data Integration: Job III: Join between movies.csv and the r2 table through the tMap instrument.
  • 22. Data Integration: Job III: [Screenshot: inside the tMap instrument]
  • 23. Data Integration: Job III: A new table called IMDBresults was generated in the IMDB database from movies.csv and the table r2. It contains: movieId, title, genres, year, userId, rating, timestamp. The tMap component was used, with an inner join on movieId and the "All matches" option activated.
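The effect of an inner join with all matches kept can be sketched in plain Python. This is not how tMap is implemented; it assumes the movies are the main flow and the ratings are the lookup, and the sample rows are made up.

```python
from collections import defaultdict

# Illustrative main flow (movies) and lookup flow (ratings from r2).
movies = [
    {"movieId": 1, "title": "Movie A", "genres": "Drama", "year": 2001},
    {"movieId": 2, "title": "Movie B", "genres": "Comedy", "year": 2005},
]
ratings = [
    {"userId": 7, "movieId": 1, "rating": 4.0, "timestamp": 1000},
    {"userId": 8, "movieId": 1, "rating": 2.0, "timestamp": 2000},
    {"userId": 9, "movieId": 3, "rating": 5.0, "timestamp": 3000},
]

# Index the lookup rows by the join key.
ratings_by_movie = defaultdict(list)
for r in ratings:
    ratings_by_movie[r["movieId"]].append(r)

# Inner join, all matches: one output row per (movie, rating) pair;
# movies without ratings and ratings without a movie are dropped.
imdb_results = [
    {**movie, "userId": r["userId"], "rating": r["rating"],
     "timestamp": r["timestamp"]}
    for movie in movies
    for r in ratings_by_movie.get(movie["movieId"], [])
]
```

Movie 1 produces two output rows, one per rating, while movie 2 and the rating for the unknown movie 3 produce none.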
  • 24. Data Integration: Job IV: Join between the IMDBresults and rating tables through the tMap instrument.
  • 25. Data Integration: Job IV: [Screenshot: inside the tMap instrument]
  • 26. Data Integration: Job IV: A new table called movie was generated in the IMDB2 database from the tables IMDBresults and rating contained in the IMDB database. It contains: movieId, title, genres, year, userId, rating, timestamp, rating_avg. The tMap component was used, with an inner join on movieId and the "unique match" option activated. This operation took about 1 hour of computation.
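Because the rating table holds one row per movieId, each IMDBresults row can match at most one lookup row. A minimal sketch of that join follows, again in plain Python with made-up rows rather than the actual tMap configuration.

```python
# Illustrative IMDBresults rows (main flow).
imdb_results = [
    {"movieId": 1, "title": "Movie A", "genres": "Drama", "year": 2001,
     "userId": 7, "rating": 4.0, "timestamp": 1000},
    {"movieId": 1, "title": "Movie A", "genres": "Drama", "year": 2001,
     "userId": 8, "rating": 2.0, "timestamp": 2000},
    {"movieId": 4, "title": "Movie D", "genres": "Action", "year": 2010,
     "userId": 9, "rating": 5.0, "timestamp": 3000},
]

# Illustrative rating table: exactly one average per movieId.
rating_avg_by_movie = {1: 3.0}

# Inner join with a unique match: append rating_avg when the movie
# has an entry in the lookup, otherwise drop the row.
movie_table = [
    {**row, "rating_avg": rating_avg_by_movie[row["movieId"]]}
    for row in imdb_results
    if row["movieId"] in rating_avg_by_movie
]
```

Every surviving row keeps its own userId, rating, and timestamp and gains the per-movie rating_avg.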
  • 27. Results: [Screenshot from Sequel Pro: movie table]
  • 28. Results: [Screenshot from Sequel Pro: rating table]