SlideShare una empresa de Scribd logo
1 de 18
BIG DATA ANALYTICS WITH HADOOP
BY-
SWAMIL SINGH
VIPLAV MANDAL
GUIDED BY-DR S.SRIVATAVA
AGENDA
• Design of website clickstream data and example.
• How to load data into sandbox.
• Load data using flume and the process.
• About flume
• Flumes working process.
• Process to refine data.
• Map reduce
• Hcatalog and hcatalog work
• Hive
• Hive work & their process
• Queries
DESIGN OF WEBSITE CLICKSTREAM DATA
• Clickstream data is an information trail a user leaves behind while visiting a website. It is
typically captured in semi-structured website log files.
• These website log files contain data elements such as a date and time stamp, the visitor’s IP
address, the destination URLs of the pages visited, and a user ID that uniquely identifies the
website visitor.
• One of the original uses of Hadoop at Yahoo was to store and process their massive volume of
clickstream data
EXAMPLE OF WEBSITE CLICK STREAM DATA
HOW TO LOAD DATA INTO SANDBOX
• The sandbox is a fully contained Data Platform environment
• The sandbox includes the core Hadoop components (HDFS and MapReduce), as well as all
the tools needed for data ingestion and processing.
• You can access and analyze sandbox data with many Business Intelligence (BI) applications.
• By combining web logs with more traditional customer data, we can better understand our
customers, and also understand how to optimize future promotions and advertising.
ABOUT FLUME
• Flume’s high-level architecture is built on a streamlined codebase that is easy to use and
extend.
• . The project is highly reliable, without the risk of data loss. Flume also supports dynamic
reconfiguration without the need for a restart, which reduces downtime for its agents.
• Flume components interact in the following way
• A flow in Flume starts from the Client.
• The Client transmits the Event to a Source operating within the Agent
• The Source receiving this Event then delivers it to one or more Channels.
• One or more Sinks operating within the same Agent drains these Channels.
• Channels decouple the ingestion rate from drain rate using the familiar producer-consumer
model of data exchange.
LOAD DATA USING FLUME ,THE PROCESS
SEQUENCE DIAGRAM OF FLUME
• Enterprises use Flume’s powerful streaming capabilities to land data from high-throughput
streams in the HDFS. These different types of data can be landed in Hadoop for future analysis
using interactive queries in Apache Hive.
• In one specific example,
-Flume is used to log manufacturing operations. When one run of product comes off the line,
it generates a log file about that run.
• The large volume log file data can stream through Flume into a tool for same-day analysis with
Apache Storm or months or years of production runs can be stored in HDFS and analyzed by a
quality assurance engineer using Apache Hive..
THE PROCESS TO REFINE DATA
• Omniture logs* – website log files containing information such as URL, timestamp, IP address,
geocoded IP address, and user ID (SWID).
• Users* – CRM user data listing SWIDs (Software User IDs) along with date of birth and gender.
• Products* – CMS
• data that maps product categories to website URLs
MAP REDUCE
ABOUT THE MAP REDUCE
• A MapReduce job splits a large data set into independent chunks and organizes them into key,
value pairs for parallel processing.
• The Map function divides the input into ranges by the Input Format and creates a map task for
each range in the input
• The output of each map task is partitioned into a group of key-value pairs for each reduce.
• The Reduce function then collects the various results and combines them to answer the larger
problem that the master node needs to solve
HCATALOG
• Apache HCatalog is a table management layer that exposes Hive metadata to other Hadoop
applications
• HCatalog’s table abstraction presents users with a relational view of data in the Hadoop
Distributed File System (HDFS) and ensures that users need not worry about where or in what
format their data is stored
• HCatalog displays data from RCFile format, text files, or sequence files in a tabular view.
HOW HCATALOG WORKS
• HCatalog supports reading and writing files in any format for which a Hive SerDe (serializer-
deserializer) can be written.
• By default, HCatalog supports RCFile, CSV, JSON, and Sequence File formats. To use a
custom format, you must provide the Input Format, Output Format, and SerDe.
• HCatalog is built on top of the Hive metastore and incorporates components from the Hive
DDL.
• HCatalog provides read and write interfaces for Pig and MapReduce and uses Hive’s command
line interface for issuing data definition and metadata exploration commands.
HIVE
• Hive is a component of Data Platform. Hive provides a SQL-like interface to data stored in DP.
• Hive provides a database query interface to Apache Hadoop.
• Hive because of its SQL like query language is often used as the interface to an Apache
Hadoop based data warehouse.
• Pig fits in through its data flow strengths where it takes on the tasks of bringing data into
Apache Hadoop and working with it to get it into the form for querying.
HOW TO WORK HIVE
• The tables in Hive are similar to tables in a relational database, and data units are organized in
a taxonomy from larger to more granular units.
• Databases are comprised of tables, which are made up of partitions.
• Data can be accessed via a simple query language and Hive supports overwriting or appending
data
• Hive supports all the common primitive data formats such as BIGINT, BINARY, BOOLEAN,
CHAR, DECIMAL, DOUBLE, FLOAT, INT, SMALLINT, STRING, TIMESTAMP, and TINYINT.
• In addition, analysts can combine primitive data types to form complex data types, such as
structs, maps and arrays.
WORKING PROCESS OF HIVE
• Any queries…..
• THANK YOU

Más contenido relacionado

La actualidad más candente

Hadoop: An Industry Perspective
Hadoop: An Industry PerspectiveHadoop: An Industry Perspective
Hadoop: An Industry PerspectiveCloudera, Inc.
 
Big data processing with apache spark part1
Big data processing with apache spark   part1Big data processing with apache spark   part1
Big data processing with apache spark part1Abbas Maazallahi
 
Big data with Hadoop - Introduction
Big data with Hadoop - IntroductionBig data with Hadoop - Introduction
Big data with Hadoop - IntroductionTomy Rhymond
 
Emergent Distributed Data Storage
Emergent Distributed Data StorageEmergent Distributed Data Storage
Emergent Distributed Data Storagehybrid cloud
 
Big Data Real Time Applications
Big Data Real Time ApplicationsBig Data Real Time Applications
Big Data Real Time ApplicationsDataWorks Summit
 
Introduction to Apache Hadoop Eco-System
Introduction to Apache Hadoop Eco-SystemIntroduction to Apache Hadoop Eco-System
Introduction to Apache Hadoop Eco-SystemMd. Hasan Basri (Angel)
 
Introduction to Bigdata and HADOOP
Introduction to Bigdata and HADOOP Introduction to Bigdata and HADOOP
Introduction to Bigdata and HADOOP vinoth kumar
 
Introduction of Big data, NoSQL & Hadoop
Introduction of Big data, NoSQL & HadoopIntroduction of Big data, NoSQL & Hadoop
Introduction of Big data, NoSQL & HadoopSavvycom Savvycom
 
Introduction to BIg Data and Hadoop
Introduction to BIg Data and HadoopIntroduction to BIg Data and Hadoop
Introduction to BIg Data and HadoopAmir Shaikh
 
Big data introduction, Hadoop in details
Big data introduction, Hadoop in detailsBig data introduction, Hadoop in details
Big data introduction, Hadoop in detailsMahmoud Yassin
 
Big Data and Hadoop Basics
Big Data and Hadoop BasicsBig Data and Hadoop Basics
Big Data and Hadoop BasicsSonal Tiwari
 
Intro to Big Data Hadoop
Intro to Big Data HadoopIntro to Big Data Hadoop
Intro to Big Data HadoopApache Apex
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with HadoopPhilippe Julio
 
Big data technologies and Hadoop infrastructure
Big data technologies and Hadoop infrastructureBig data technologies and Hadoop infrastructure
Big data technologies and Hadoop infrastructureRoman Nikitchenko
 

La actualidad más candente (20)

Hadoop: An Industry Perspective
Hadoop: An Industry PerspectiveHadoop: An Industry Perspective
Hadoop: An Industry Perspective
 
Big data processing with apache spark part1
Big data processing with apache spark   part1Big data processing with apache spark   part1
Big data processing with apache spark part1
 
Big data with Hadoop - Introduction
Big data with Hadoop - IntroductionBig data with Hadoop - Introduction
Big data with Hadoop - Introduction
 
Emergent Distributed Data Storage
Emergent Distributed Data StorageEmergent Distributed Data Storage
Emergent Distributed Data Storage
 
Big Data Real Time Applications
Big Data Real Time ApplicationsBig Data Real Time Applications
Big Data Real Time Applications
 
Introduction to Apache Hadoop Eco-System
Introduction to Apache Hadoop Eco-SystemIntroduction to Apache Hadoop Eco-System
Introduction to Apache Hadoop Eco-System
 
Introduction to Bigdata and HADOOP
Introduction to Bigdata and HADOOP Introduction to Bigdata and HADOOP
Introduction to Bigdata and HADOOP
 
Introduction of Big data, NoSQL & Hadoop
Introduction of Big data, NoSQL & HadoopIntroduction of Big data, NoSQL & Hadoop
Introduction of Big data, NoSQL & Hadoop
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
 
Big data concepts
Big data conceptsBig data concepts
Big data concepts
 
Introduction to BIg Data and Hadoop
Introduction to BIg Data and HadoopIntroduction to BIg Data and Hadoop
Introduction to BIg Data and Hadoop
 
Big data introduction, Hadoop in details
Big data introduction, Hadoop in detailsBig data introduction, Hadoop in details
Big data introduction, Hadoop in details
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
Big Data Concepts
Big Data ConceptsBig Data Concepts
Big Data Concepts
 
Big Data and Hadoop Basics
Big Data and Hadoop BasicsBig Data and Hadoop Basics
Big Data and Hadoop Basics
 
Big data Analytics Hadoop
Big data Analytics HadoopBig data Analytics Hadoop
Big data Analytics Hadoop
 
Intro to Big Data Hadoop
Intro to Big Data HadoopIntro to Big Data Hadoop
Intro to Big Data Hadoop
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 
Big data technologies and Hadoop infrastructure
Big data technologies and Hadoop infrastructureBig data technologies and Hadoop infrastructure
Big data technologies and Hadoop infrastructure
 
BIG DATA
BIG DATABIG DATA
BIG DATA
 

Similar a Big data analytics with hadoop volume 2

Big Data and Cloud Computing
Big Data and Cloud ComputingBig Data and Cloud Computing
Big Data and Cloud ComputingFarzad Nozarian
 
Foxvalley bigdata
Foxvalley bigdataFoxvalley bigdata
Foxvalley bigdataTom Rogers
 
Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics PlatformN Masahiro
 
BDA R20 21NM - Summary Big Data Analytics
BDA R20 21NM - Summary Big Data AnalyticsBDA R20 21NM - Summary Big Data Analytics
BDA R20 21NM - Summary Big Data AnalyticsNetajiGandi1
 
Research on vector spatial data storage scheme based
Research on vector spatial data storage scheme basedResearch on vector spatial data storage scheme based
Research on vector spatial data storage scheme basedAnant Kumar
 
Hadoop Master Class : A concise overview
Hadoop Master Class : A concise overviewHadoop Master Class : A concise overview
Hadoop Master Class : A concise overviewAbhishek Roy
 
What are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
What are Hadoop Components? Hadoop Ecosystem and Architecture | EdurekaWhat are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
What are Hadoop Components? Hadoop Ecosystem and Architecture | EdurekaEdureka!
 
Big Data App servor by Lance Riedel, CTO, The Hive for The Hive India event
Big Data App servor by Lance Riedel, CTO, The Hive for The Hive India eventBig Data App servor by Lance Riedel, CTO, The Hive for The Hive India event
Big Data App servor by Lance Riedel, CTO, The Hive for The Hive India eventThe Hive
 
New big data architecture in hadoop.pptx
New big data architecture in hadoop.pptxNew big data architecture in hadoop.pptx
New big data architecture in hadoop.pptxVanshGupta597842
 
A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0 A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0 DataWorks Summit
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : BeginnersShweta Patnaik
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : BeginnersShweta Patnaik
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : BeginnersShweta Patnaik
 
hive architecture and hive components in detail
hive architecture and hive components in detailhive architecture and hive components in detail
hive architecture and hive components in detailHariKumar544765
 
Big Data Simplified - Is all about Ab'strakSHeN
Big Data Simplified - Is all about Ab'strakSHeNBig Data Simplified - Is all about Ab'strakSHeN
Big Data Simplified - Is all about Ab'strakSHeNDataWorks Summit
 

Similar a Big data analytics with hadoop volume 2 (20)

Big Data and Cloud Computing
Big Data and Cloud ComputingBig Data and Cloud Computing
Big Data and Cloud Computing
 
SQL Server 2012 and Big Data
SQL Server 2012 and Big DataSQL Server 2012 and Big Data
SQL Server 2012 and Big Data
 
Foxvalley bigdata
Foxvalley bigdataFoxvalley bigdata
Foxvalley bigdata
 
Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics Platform
 
BDA R20 21NM - Summary Big Data Analytics
BDA R20 21NM - Summary Big Data AnalyticsBDA R20 21NM - Summary Big Data Analytics
BDA R20 21NM - Summary Big Data Analytics
 
Research on vector spatial data storage scheme based
Research on vector spatial data storage scheme basedResearch on vector spatial data storage scheme based
Research on vector spatial data storage scheme based
 
Hadoop Master Class : A concise overview
Hadoop Master Class : A concise overviewHadoop Master Class : A concise overview
Hadoop Master Class : A concise overview
 
What are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
What are Hadoop Components? Hadoop Ecosystem and Architecture | EdurekaWhat are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
What are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
 
Big data Hadoop
Big data  Hadoop   Big data  Hadoop
Big data Hadoop
 
Big Data App servor by Lance Riedel, CTO, The Hive for The Hive India event
Big Data App servor by Lance Riedel, CTO, The Hive for The Hive India eventBig Data App servor by Lance Riedel, CTO, The Hive for The Hive India event
Big Data App servor by Lance Riedel, CTO, The Hive for The Hive India event
 
Hadoop
HadoopHadoop
Hadoop
 
New big data architecture in hadoop.pptx
New big data architecture in hadoop.pptxNew big data architecture in hadoop.pptx
New big data architecture in hadoop.pptx
 
A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0 A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : Beginners
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : Beginners
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : Beginners
 
Getting started big data
Getting started big dataGetting started big data
Getting started big data
 
Cheetah:Data Warehouse on Top of MapReduce
Cheetah:Data Warehouse on Top of MapReduceCheetah:Data Warehouse on Top of MapReduce
Cheetah:Data Warehouse on Top of MapReduce
 
hive architecture and hive components in detail
hive architecture and hive components in detailhive architecture and hive components in detail
hive architecture and hive components in detail
 
Big Data Simplified - Is all about Ab'strakSHeN
Big Data Simplified - Is all about Ab'strakSHeNBig Data Simplified - Is all about Ab'strakSHeN
Big Data Simplified - Is all about Ab'strakSHeN
 

Último

Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17  How to Extend Models Using Mixin ClassesMixin Classes in Odoo 17  How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17 How to Extend Models Using Mixin ClassesCeline George
 
Understanding Accommodations and Modifications
Understanding  Accommodations and ModificationsUnderstanding  Accommodations and Modifications
Understanding Accommodations and ModificationsMJDuyan
 
psychiatric nursing HISTORY COLLECTION .docx
psychiatric  nursing HISTORY  COLLECTION  .docxpsychiatric  nursing HISTORY  COLLECTION  .docx
psychiatric nursing HISTORY COLLECTION .docxPoojaSen20
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfagholdier
 
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...pradhanghanshyam7136
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.christianmathematics
 
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptxSKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptxAmanpreet Kaur
 
How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17Celine George
 
How to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POSHow to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POSCeline George
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxAreebaZafar22
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfciinovamais
 
ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701bronxfugly43
 
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdfUGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdfNirmal Dwivedi
 
Seal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxSeal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxnegromaestrong
 
How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17Celine George
 
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptxMaritesTamaniVerdade
 
General Principles of Intellectual Property: Concepts of Intellectual Proper...
General Principles of Intellectual Property: Concepts of Intellectual  Proper...General Principles of Intellectual Property: Concepts of Intellectual  Proper...
General Principles of Intellectual Property: Concepts of Intellectual Proper...Poonam Aher Patil
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxVishalSingh1417
 
Unit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxUnit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxVishalSingh1417
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...Nguyen Thanh Tu Collection
 

Último (20)

Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17  How to Extend Models Using Mixin ClassesMixin Classes in Odoo 17  How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
 
Understanding Accommodations and Modifications
Understanding  Accommodations and ModificationsUnderstanding  Accommodations and Modifications
Understanding Accommodations and Modifications
 
psychiatric nursing HISTORY COLLECTION .docx
psychiatric  nursing HISTORY  COLLECTION  .docxpsychiatric  nursing HISTORY  COLLECTION  .docx
psychiatric nursing HISTORY COLLECTION .docx
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.
 
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptxSKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
 
How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17
 
How to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POSHow to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POS
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptx
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701
 
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdfUGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
 
Seal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxSeal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptx
 
How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17
 
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
 
General Principles of Intellectual Property: Concepts of Intellectual Proper...
General Principles of Intellectual Property: Concepts of Intellectual  Proper...General Principles of Intellectual Property: Concepts of Intellectual  Proper...
General Principles of Intellectual Property: Concepts of Intellectual Proper...
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptx
 
Unit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxUnit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptx
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
 

Big data analytics with hadoop volume 2

  • 1. BIG DATA ANALYTICS WITH HADOOP BY- SWAMIL SINGH VIPLAV MANDAL GUIDED BY-DR S.SRIVATAVA
  • 2. AGENDA • Design of website clickstream data and example. • How to load data into sandbox. • Load data using flume and the process. • About flume • Flumes working process. • Process to refine data. • Map reduce • Hcatalog and hcatalog work • Hive • Hive work & their process • Queries
  • 3. DESIGN OF WEBSITE CLICKSTREAM DATA • Clickstream data is an information trail a user leaves behind while visiting a website. It is typically captured in semi-structured website log files. • These website log files contain data elements such as a date and time stamp, the visitor’s IP address, the destination URLs of the pages visited, and a user ID that uniquely identifies the website visitor. • One of the original uses of Hadoop at Yahoo was to store and process their massive volume of clickstream data
  • 4. EXAMPLE OF WEBSITE CLICK STREAM DATA
  • 5. HOW TO LOAD DATA INTO SANDBOX • The sandbox is a fully contained Data Platform environment • The sandbox includes the core Hadoop components (HDFS and MapReduce), as well as all the tools needed for data ingestion and processing. • You can access and analyze sandbox data with many Business Intelligence (BI) applications. • By combining web logs with more traditional customer data, we can better understand our customers, and also understand how to optimize future promotions and advertising.
  • 6. ABOUT FLUME • Flume’s high-level architecture is built on a streamlined codebase that is easy to use and extend. • . The project is highly reliable, without the risk of data loss. Flume also supports dynamic reconfiguration without the need for a restart, which reduces downtime for its agents. • Flume components interact in the following way • A flow in Flume starts from the Client. • The Client transmits the Event to a Source operating within the Agent • The Source receiving this Event then delivers it to one or more Channels. • One or more Sinks operating within the same Agent drains these Channels. • Channels decouple the ingestion rate from drain rate using the familiar producer-consumer model of data exchange.
  • 7. LOAD DATA USING FLUME ,THE PROCESS
  • 9. • Enterprises use Flume’s powerful streaming capabilities to land data from high-throughput streams in the HDFS. These different types of data can be landed in Hadoop for future analysis using interactive queries in Apache Hive. • In one specific example, -Flume is used to log manufacturing operations. When one run of product comes off the line, it generates a log file about that run. • The large volume log file data can stream through Flume into a tool for same-day analysis with Apache Storm or months or years of production runs can be stored in HDFS and analyzed by a quality assurance engineer using Apache Hive..
  • 10. THE PROCESS TO REFINE DATA • Omniture logs* – website log files containing information such as URL, timestamp, IP address, geocoded IP address, and user ID (SWID). • Users* – CRM user data listing SWIDs (Software User IDs) along with date of birth and gender. • Products* – CMS • data that maps product categories to website URLs
  • 12. ABOUT THE MAP REDUCE • A MapReduce job splits a large data set into independent chunks and organizes them into key, value pairs for parallel processing. • The Map function divides the input into ranges by the Input Format and creates a map task for each range in the input • The output of each map task is partitioned into a group of key-value pairs for each reduce. • The Reduce function then collects the various results and combines them to answer the larger problem that the master node needs to solve
  • 13. HCATALOG • Apache HCatalog is a table management layer that exposes Hive metadata to other Hadoop applications • HCatalog’s table abstraction presents users with a relational view of data in the Hadoop Distributed File System (HDFS) and ensures that users need not worry about where or in what format their data is stored • HCatalog displays data from RCFile format, text files, or sequence files in a tabular view.
  • 14. HOW HCATALOG WORKS • HCatalog supports reading and writing files in any format for which a Hive SerDe (serializer- deserializer) can be written. • By default, HCatalog supports RCFile, CSV, JSON, and Sequence File formats. To use a custom format, you must provide the Input Format, Output Format, and SerDe. • HCatalog is built on top of the Hive metastore and incorporates components from the Hive DDL. • HCatalog provides read and write interfaces for Pig and MapReduce and uses Hive’s command line interface for issuing data definition and metadata exploration commands.
  • 15. HIVE • Hive is a component of Data Platform. Hive provides a SQL-like interface to data stored in DP. • Hive provides a database query interface to Apache Hadoop. • Hive because of its SQL like query language is often used as the interface to an Apache Hadoop based data warehouse. • Pig fits in through its data flow strengths where it takes on the tasks of bringing data into Apache Hadoop and working with it to get it into the form for querying.
  • 16. HOW TO WORK HIVE • The tables in Hive are similar to tables in a relational database, and data units are organized in a taxonomy from larger to more granular units. • Databases are comprised of tables, which are made up of partitions. • Data can be accessed via a simple query language and Hive supports overwriting or appending data • Hive supports all the common primitive data formats such as BIGINT, BINARY, BOOLEAN, CHAR, DECIMAL, DOUBLE, FLOAT, INT, SMALLINT, STRING, TIMESTAMP, and TINYINT. • In addition, analysts can combine primitive data types to form complex data types, such as structs, maps and arrays.