4. What is Big Data?
Big Data is data that exceeds the storage capacity and the processing power of traditional systems.
VOLUME VELOCITY VARIETY
Assume the world's data adds up to 100%: roughly 90% of it was generated in the last 3-4 years, and only 10% in all the years since computing systems were first introduced.
5. Volume
1. Transaction-based data stored in relational databases for years.
2. Unstructured data generated as part of social media.
3. Sensor and machine-to-machine generated data.
4. Facebook alone generates about 600 terabytes of data every day.
6. Velocity
1. Reacting fast enough to incoming data is one of the challenges.
2. Computation is process-bound.
3. Processing unstructured data as it arrives.
7. Variety
1. Data arrives in many different formats.
2. Structured data residing in traditional RDBMSs or flat files.
3. Unstructured data such as text documents, videos, email, audio, and log files.
4. Managing, merging, and governing these different varieties of data is the biggest challenge.
5. The data must be connected and correlated to extract useful information from it.
8. Challenges reading from a single disk
[Diagram] A central data warehouse is fed by regional sources (FB India, FB US, FB Japan, FB UK, FB China) at about 200 GB each, adding up to 1 TB and more.
Reading even 100 GB of that data over a 100 Mbps link takes

(100 GB × 1000 MB/GB × 8 bits/byte) / 100 Mbps = 8000 secs, i.e. about 2.2 hours.
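The same back-of-the-envelope arithmetic as a small Python sketch (the sizes and the link speed are simply the figures from the slide):

def read_time_seconds(size_gb, link_mbps):
    """Time to move size_gb gigabytes over a link of link_mbps megabits/second."""
    megabits = size_gb * 1000 * 8  # GB -> MB -> megabits
    return megabits / link_mbps

print(read_time_seconds(100, 100))          # 8000.0 seconds
print(read_time_seconds(100, 100) / 3600)   # ~2.2 hours
print(read_time_seconds(1000, 100) / 3600)  # a full terabyte: ~22 hours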
9. Traditional workflow data
[Diagram] Two kinds of sources, OLTP systems (RDBMS) and social-networking logs (XML/.txt files, e.g. Apache logs), feed a data warehouse (expensive storage; storage spread across systems, not easily accessible, with limited capacity), from which reports are generated.
14. Hadoop
Hadoop is an Apache software framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model.
It is designed to scale up from a single server to thousands of machines, each offering local computation and storage.
Rather than relying on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, delivering a highly available service on top of a cluster of computers, each of which may be prone to failure.
15. Hadoop Core Features
HDFS - used for storing data on a cluster of machines.
MapReduce - a technique for processing the data stored in HDFS.
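A minimal feel for HDFS from the command line (a sketch, assuming a running Hadoop installation with the hadoop client on the PATH):

$ hadoop fs -mkdir /demo                 # create a directory in HDFS
$ hadoop fs -put words_count.txt /demo   # copy a local file into the cluster
$ hadoop fs -ls /demo                    # list its contents
$ hadoop fs -cat /demo/words_count.txt   # read the file back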
17. Map Reduce
What is MapReduce?
MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster.
Example: counting the occurrences of each word in a text file, small or large (word count).
MapReduce programs can be written in Perl, Python, Java, or Ruby.
For scripting languages, Hadoop Streaming (hadoop-streaming.jar) runs the Python/Perl programs as parallel mapper and reducer tasks, and the final counts are produced by the reducer.
18. Mapper code in Python
#!/usr/bin/env python
import sys

# Read lines from standard input (Hadoop Streaming feeds the input split here)
for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        # Emit "word<TAB>1" for every word seen
        print('%s\t%s' % (word, 1))
words_count.txt
pramati yahoo facebook aol facebook IBM
kony google pramati
19. Reducer code in Python
#!/usr/bin/env python
import sys

current_word = None
current_count = 0
word = None

# Hadoop sorts the mapper output by key, so all counts for a word arrive together
for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        # Skip malformed lines whose count is not a number
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # A new word has started: emit the total for the previous one
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# Emit the total for the last word
if current_word == word:
    print('%s\t%s' % (current_word, current_count))
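Because both scripts only read stdin and write stdout, they can be tested locally before touching the cluster; sort stands in for Hadoop's shuffle phase. Assuming the scripts above are saved as mapper.py and reducer.py:

$ cat words_count.txt | python mapper.py | sort | python reducer.py

On the cluster, Hadoop Streaming runs them in parallel. The exact path of hadoop-streaming.jar varies by installation, and newer Hadoop versions use -files in place of -file:

$ hadoop jar hadoop-streaming.jar \
    -input /demo/words_count.txt \
    -output /demo/wordcount_out \
    -mapper mapper.py \
    -reducer reducer.py \
    -file mapper.py -file reducer.py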
23. Pig Latin
● Apache Pig is a tool used to analyze large amounts of data by representing them as data flows. Using the Pig Latin scripting language, operations like ETL (Extract, Transform, and Load), ad-hoc data analysis, and iterative processing can be achieved easily.
● It can address the variety problem: structured, semi-structured, and unstructured data.
● Pig was first built at Yahoo! and later became a top-level Apache project. In this series we will walk through different features of Pig using a sample dataset.
24. Pig Access
● Interactive mode
● Batch mode
● $ pig -x local - runs the Grunt shell in local file system mode
● $ pig - runs the Grunt shell in HDFS mode
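In batch mode the same statements go into a script file and are submitted in one shot (the script name wordcount.pig is just a placeholder here):

$ pig -x local wordcount.pig   # run a Pig Latin script against the local file system
$ pig wordcount.pig            # run the same script against HDFS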
25. Execution of Pig
Pig can perform joins (an example over the two sample files is sketched after the data below):
● Self join
● Equi join
● Left outer join
● Right outer join
Customers.txt
1,Ramesh,32,Ahmedabad,2000.00
2,Khilan,25,Delhi,1500.00
3,kaushik,23,Kota,2000.00
4,Chaitali,25,Mumbai,6500.00
5,Hardik,27,Bhopal,8500.00
6,Komal,22,MP,4500.00
7,Muffy,24,Indore,10000.00
Orders.txt
102,2009-10-08 00:00:00,3,3000
100,2009-10-08 00:00:00,3,1500
101,2009-11-20 00:00:00,2,1560
103,2008-05-20 00:00:00,4,2060
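A minimal Pig Latin sketch of an equi join over the two files, assuming the columns are (id, name, age, city, salary) for customers and (oid, date, customer_id, amount) for orders; the field meanings are inferred from the sample data:

customers = LOAD 'Customers.txt' USING PigStorage(',')
            AS (id:int, name:chararray, age:int, city:chararray, salary:double);
orders    = LOAD 'Orders.txt' USING PigStorage(',')
            AS (oid:int, date:chararray, customer_id:int, amount:int);
-- pair each order with the customer who placed it
joined = JOIN orders BY customer_id, customers BY id;
DUMP joined;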
27. Hive
● Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on
top of Hadoop to summarize Big Data, and makes querying and analyzing easy.
● Hive is a database technology that can define databases and tables to analyze structured data.
The theme for structured data analysis is to store the data in a tabular manner, and pass
queries to analyze it.
● Hive was initially developed by Facebook; later the Apache Software Foundation took it up and developed it further as open source under the name Apache Hive.
28. Hive Features
Hive is not:
1. A relational database.
2. Designed for online transaction processing (OLTP).
3. A language for real-time queries and row-level updates.
Hive is:
1. A system that stores schema in a database and processed data in HDFS.
2. Designed for online analytical processing (OLAP).
3. A provider of an SQL-like query language called HiveQL or HQL.
29. Loading structured data into a table using Hive
hive> CREATE DATABASE EMP;
hive> USE EMP;
hive> CREATE TABLE IF NOT EXISTS employee (eid int, name String, salary String, deptno int)
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY ','
    > LINES TERMINATED BY '\n'
    > STORED AS TEXTFILE;
hive> LOAD DATA LOCAL INPATH '/home/purnar/emp_data.txt' INTO TABLE employee;
hive> SELECT * FROM employee;
(Follow-up queries are sketched after the data file below.)
EMP_DATA.txt
1201,chandu,10000.00,20
1202,shekar,2000.00,10
1203,ravi,1000.00,10
1204,kiran,2000.00,20
1205,sharma,30000.00,30
1206,sri,4000.00,40
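Once the file is loaded, ordinary HiveQL runs against the table; two sanity-check queries, sketched with the columns defined above:

hive> SELECT name, salary FROM employee WHERE deptno = 20;
hive> SELECT deptno, COUNT(*) FROM employee GROUP BY deptno;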
30. Difference between Hive and Pig
● Hive is mainly used by data analysts, whereas Pig is used by researchers and programmers.
● Hive is mainly used for structured data, whereas Pig is used for semi-structured and unstructured data.
● Hive is mainly used for creating reports, whereas Pig is used by programmers to build data pipelines.
● Hive provides partitions, so you can process a subset of the data by date or in alphabetical order, whereas Pig has no notion of partitions, though one can achieve something similar with filters (see the sketch below).
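To illustrate the partition point, a hedged sketch with a hypothetical logs table; the Hive half uses real PARTITIONED BY syntax, and the Pig half approximates the same subset with FILTER:

-- Hive: declare a partition column, then prune by it in queries
hive> CREATE TABLE logs (username String, action String)
    > PARTITIONED BY (dt String);
hive> SELECT * FROM logs WHERE dt = '2015-01-01';

-- Pig: no partitions, but a FILTER yields a similar subset
logs = LOAD 'logs.txt' USING PigStorage(',') AS (username:chararray, action:chararray, dt:chararray);
jan1 = FILTER logs BY dt == '2015-01-01';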
31. SQOOP
● Sqoop is a tool designed to transfer data between Hadoop and relational database servers. It is used to import data from relational databases such as MySQL or Oracle into Hadoop HDFS, and to export data from the Hadoop file system back into relational databases. The next two slides show how to use it in the Hadoop ecosystem.
● The traditional application-management setup, that is, applications interacting with a relational database through an RDBMS, is one of the sources that generate Big Data. Such Big Data is stored in relational database servers in the relational database structure.
● SQOOP = “SQL to Hadoop and Hadoop to SQL”
32. Export data from HDFS to MySQL
1. Create a database (hive_db, to match the connect string below) and a table employee:

CREATE TABLE employee (id INT, name VARCHAR(20), deg VARCHAR(20), salary INT, dept VARCHAR(10));

2. Create a text file with the data given below and copy it into the Hadoop file system.

Emp.txt
========
1201, gopal, manager,50000, TP
1202, manisha, preader,50000, TP
1203, kalil, php dev,30000, AC
1204, prasanth, php dev,30000, AC
1205, kranthi, admin,20000, TP
1206, satish p, grp des,20000, GR

3. hadoop fs -mkdir /emp
4. hadoop fs -put emp.txt /emp
5. Execute the sqoop command below to export the data from the text file to MySQL:

sqoop export --verbose \
  --connect jdbc:mysql://localhost/hive_db \
  --username ***** --password ****** \
  -m 4 \
  --table employee \
  --export-dir /emp/emp.txt
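Back on the MySQL side, a quick check confirms the rows arrived:

mysql> SELECT * FROM employee;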
33. Import data from MySQL to a flat file

sqoop import \
  --connect jdbc:mysql://localhost/test \
  --username ******* \
  --password ******* \
  --table employee \
  -m 1 \
  --target-dir /chandu
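The import lands as part files under the target directory; assuming Sqoop's default output naming for a single map task, the result can be inspected with:

$ hadoop fs -ls /chandu
$ hadoop fs -cat /chandu/part-m-00000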
35. Hadoop Testing
● HDFS shell commands such as mkdir, ls, cat, and so on.
● Testing the MapReduce scripts.
● Test the mapper and reducer scripts separately with different input files.
● Example: parse an apache.log file for Gmail users and count the number of times each logged in on a particular day.
● Here we pass a 12 MB file to the Hadoop file system, extract only the Gmail users, and count how many times they logged in on that day (a mapper sketch follows below).
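A minimal streaming-mapper sketch for that scenario. The log layout is an assumption: all we rely on is that each relevant apache.log line contains the user's Gmail address somewhere in it. The word-count reducer shown earlier then sums the logins per user.

#!/usr/bin/env python
import re
import sys

# match Gmail addresses anywhere in the line (log format assumed, not fixed)
GMAIL = re.compile(r'[\w.+-]+@gmail\.com')

for line in sys.stdin:
    for user in GMAIL.findall(line):
        # emit "user<TAB>1"; the reducer aggregates the totals per user
        print('%s\t%s' % (user, 1))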
36. Test Scenarios
1. Add special characters to the pattern object (sketched below).
2. Add extra spaces to the patterns.
3. Test the boundary conditions of the pattern.
4. Add special characters in between the pattern.
5. Count the number of patterns using the reducer.
...and so on.
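Scenario 1, for instance, checks that special characters in a pattern are treated literally; in Python that means escaping them before matching (a sketch):

import re

# '+' and '.' are regex metacharacters; re.escape makes them literal
pattern = re.escape('user+test@gmail.com')
assert re.search(pattern, 'login ok: user+test@gmail.com')
assert not re.search(pattern, 'login ok: userXtest@gmail.com')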
37. Hive Test
hive> create database test;
hive> use test;
hive> create table emp(id int, name string, salary float, deptno int, designation string);
(The ROW FORMAT clause on the slide was mistyped as "FIELDS TARMINATED BY", so it never took effect and the table kept Hive's default field delimiter.)
hive> load data local inpath '/home/purnar/emp.txt' into table emp;
hive> select * from emp;
OK
NULL NULL NULL NULL NULL
NULL NULL NULL NULL NULL
NULL NULL NULL NULL NULL
NULL NULL NULL NULL NULL
NULL NULL NULL NULL NULL
NULL NULL NULL NULL NULL
Time taken: 0.1 seconds, Fetched: 6 row(s)
Because the table does not split fields on ',', none of the comma-separated values can be parsed into the typed columns, and every column comes back NULL.
EMP.txt
1201, pavan,4000,30,Dev
1202, ravi,3000, 10,QA
1203, kalil,30000,10,phpdev
1204, prasanth,30000,20,QA
1205, kranthi,20000,30,QA
1206, satishp,20000,40,Admin
38. Hive Test 2
hive> create database test;
hive> use test;
hive> create table emp(id int, name string, salary float, deptno int, designation string)
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY ','
    > LINES TERMINATED BY '\n'
    > STORED AS TEXTFILE;
hive> load data local inpath '/home/purnar/emp.txt' into table emp;
hive> select * from emp;
OK
1201 pavan 4000.0 30 Dev
1202 ravi 3000.0 NULL QA
1203 kalil 30000.0 10 phpdev
1204 prasanth 30000.0 20 QA
1205 kranthi 20000.0 30 QA
1206 satishp 20000.0 40 Admin
Time taken: 0.096 seconds, Fetched: 6 row(s)
EMP.txt
1201, pavan,4000,30,Dev
1202, ravi,3000, 10,QA
1203, kalil,30000,10,phpdev
1204, prasanth,30000,20,QA
1205, kranthi,20000,30,QA
1206, satishp,20000,40,Admin
Note the NULL in the ravi row: the slide's callout marks the space before "10" in the source line, and " 10" does not cast cleanly to int, so Hive displays NULL for that deptno value.