What is Big Data, why is it needed by organizations that generate huge amounts of data, and when should it be used?
2. Author
• Astute corporate resource with 10+ years of experience, with emphasis on database management, programming, software
development, testing, web technologies and product improvement for corporations. Combines expert software and database management
expertise with strong qualifications in Software, Data Engineering & Information Management.
Concurrently manages all database functions for his current company. Industry experience in Information Technology, with a strong
understanding of the complex challenges in software development and troubleshooting. Expert at identifying and solving
problems, gaining new business contacts, reducing costs, coordinating staff and evaluating performance. Professional traits include:
problem-solving, decision-making, time management, multitasking, analytical thinking, effective communication, and computer
competencies.
• Oracle Certified Associate (OCA) on 9i
• Oracle Certified Professional OCP on 9i
• Oracle Certified Professional OCP on 10g
• Oracle Certified Professional OCP on 11g
• Oracle Certified Professional OCP on 12c
• Oracle Certified Professional OCP on MySQL 5
• Oracle Certified Expert (OCE) on 10g, Managing on Linux
• Oracle Certified Professional OCP on E-Business Apps DBA
• Microsoft Certified Technology Specialist on SQL Server 2005
• Microsoft Certified Technology Specialist on SQL Server 2008
• Microsoft Certified IT Professional on SQL Server 2005
• Microsoft Certified IT Professional on SQL Server 2008
• Sun Certified Java Programmer 5.0
• IBM Certified Database(DB2) Associate 9.0
• ITIL V3 Foundation Certified
• COBIT 5 Foundation Certified
• PRINCE2 Foundation Certified
3. Agenda
• What is Big Data
• Why Big Data
• When Big Data
• Traditional Databases
• Hadoop
• Hadoop Projects
• Big Data and TPL Holdings
• Hadoop Distributions
By JBH Syed | BSCS | MSDEIM | MCTS | MCITP | OCA | OCP | OCE | SCJP | ITIL V3F | COBIT 5F | PRINCE2
4. What is Big Data ?
• Big data is an all-encompassing term for any collection of data sets
so large and complex that it becomes difficult to process using
traditional data processing applications. The challenges include analysis,
capture, search, sharing, storage, transfer, visualization, and information
privacy.
• Big Data is commonly defined by the three Vs: Volume, Velocity and Variety.
• Big data is data sets that are so voluminous and complex that traditional
data-processing application software is inadequate to deal with them.
Big data challenges include capturing data, data storage, data analysis,
search, sharing, transfer, visualization, querying, updating, information
privacy and data source. There are a number of concepts associated with
big data: originally there were three concepts, volume, variety and velocity.
Other concepts later attributed to big data include veracity. (Wikipedia)
5. What is Big Data ?
• Volume. Many factors contribute to the increase in data volume:
transaction-based data stored through the years, unstructured data streaming
in from social media, and increasing amounts of sensor and machine-to-machine
data being collected. In the past, excessive data volume was a storage issue.
But with decreasing storage costs, other issues emerge, including how to
determine relevance within large data volumes and how to use analytics to
create value from relevant data.
• Velocity. Data is streaming in at unprecedented speed and must be dealt with in
a timely manner. RFID tags, sensors and smart metering are driving the need to
deal with torrents of data in near-real time. Reacting quickly enough to deal
with data velocity is a challenge for most organizations.
• Variety. Data today comes in all types of formats. Structured, numeric data in
traditional databases. Information created from line-of-business applications.
Unstructured text documents, email, video, audio, stock ticker data and
financial transactions. Managing, merging and governing different varieties of
data is something many organizations still grapple with.
6. Why Big Data
• The hopeful vision is that organizations will be able to take data from any
source, harness the relevant data, and analyze it to find answers that enable:
• 1) Overall cost reductions
• 2) Time reductions
• 3) New product development and optimized offerings
• 4) Smarter business decision making, for instance by combining big data with
high-powered analytics
• 5) Faster resolutions
7. When Big Data ?
• It depends on the requirements of the organization and the available
organization data, as explained earlier with the 3Vs.
• The real issue is not that you are acquiring large amounts of data. It's what you
do with the data that counts.
• What actions you can take with the huge data stream.
• Industry leaders like China Mobile generate about 7 terabytes of data per
day, and Facebook about 10 terabytes per day.
• Analysis on calls records.
• Analysis on sentiments.
• Analysis on weather information.
• Analysis on vehicles traffic and location trend.
• Analysis on years of sales trends, targets and glitches.
• Analysis on biological data for example DNA , RNA etc.
• Analysis on Customers Information
• Analysis on operating system and hardware logs to prevent attacks and
take action before the actual failure occurs
• And much more.
8. Traditional Databases and Hadoop
• Mr. Ahmed Waleed has described very well the difference between RDBMS and
Hadoop at www.w3trainingschool.com.
• Unlike Hadoop, a traditional RDBMS cannot be used to process and store a large
amount of data, or simply big data. Following are some differences between Hadoop
and a traditional RDBMS.
• Data Volume
• Data volume means the quantity of data that is being stored and processed. RDBMS works better when
the volume of data is low (in gigabytes). But when the data size is huge, i.e. in terabytes and petabytes,
RDBMS fails to give the desired results.
• On the other hand, Hadoop works better when the data size is big. It can easily process and store large
amounts of data quite effectively compared to a traditional RDBMS.
• Architecture
• If we talk about the architecture, Hadoop has the following core components:
• HDFS (Hadoop Distributed File System), Hadoop MapReduce (a programming model to process large data
sets) and Hadoop YARN (used to manage computing resources in computer clusters).
• A traditional RDBMS possesses the ACID properties: Atomicity, Consistency, Isolation, and Durability.
• These properties are responsible for maintaining and ensuring data integrity and accuracy when a transaction
takes place in a database.
• These transactions may be related to banking systems, the manufacturing industry, the telecommunication
industry, online shopping, the education sector, etc.
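The atomicity property above can be seen in a tiny, self-contained sketch using Python's built-in sqlite3 module (the table and function names here are illustrative, not from any production system): a transfer that would overdraw an account raises an error inside the transaction, and the rollback leaves both rows untouched.

```python
import sqlite3

# In-memory database standing in for a banking system's accounts table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Move funds atomically: either both updates apply, or neither does."""
    try:
        with conn:  # opens a transaction; commits on success, rolls back on error
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
            (balance,) = conn.execute("SELECT balance FROM accounts WHERE name = ?",
                                      (src,)).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
    except ValueError:
        pass  # the rollback has already restored both rows

transfer(conn, "alice", "bob", 500)  # fails: alice only has 100
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # balances unchanged: {'alice': 100, 'bob': 50}
```

The `with conn:` block is what provides the Atomicity and Durability here: sqlite3 commits the transaction if the block completes and rolls it back if an exception escapes.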
• Throughput
• Throughput means the total volume of data processed in a particular period of time so that the output is
maximum. RDBMS fails to achieve a higher throughput compared to the Apache Hadoop framework.
• This is one of the reasons behind the heavier usage of Hadoop than the traditional Relational Database
Management System.
9. • Data Variety
• Data Variety generally means the type of data to be processed. It may be structured, semi-structured and
unstructured.
• Hadoop has the ability to process and store all varieties of data, whether structured, semi-structured or
unstructured, although it is mostly used to process large amounts of unstructured data.
• A traditional RDBMS is used only to manage structured and semi-structured data. It cannot be used to manage
unstructured data. So we can say Hadoop is way better than the traditional Relational Database Management
System.
• Latency / Response Time
• Hadoop has higher throughput: you can access batches of large data sets more quickly than with a traditional
RDBMS, but you cannot access a particular record from the data set very quickly. Thus Hadoop is said to have
high latency.
• But the RDBMS is comparatively faster in retrieving information from the data sets. It takes very little time
to perform the same function, provided that there is a small amount of data.
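The throughput-versus-latency trade-off above can be illustrated with a toy sketch (plain Python, not real Hadoop or RDBMS code): an index answers a single-record query in one step, while a full scan must touch every record but delivers a whole-data-set aggregate in one pass.

```python
# Toy data set: 100,000 small records.
records = [{"id": i, "amount": i % 7} for i in range(100_000)]

# RDBMS-style: build an index once, then point lookups are direct.
index = {r["id"]: r for r in records}
one_record = index[42_317]  # answered in a single step

# Hadoop-style: no index; a full scan is the natural unit of work,
# which is slow for one record but efficient for whole-data-set analytics.
total = sum(r["amount"] for r in records)

steps_for_point_lookup = 1          # the index answers directly
steps_for_scan = len(records)       # a scan touches every record
print(one_record["amount"], total, steps_for_point_lookup, steps_for_scan)
```

This is why the same data set can be "fast" on both systems: it depends on whether the workload is one record (low latency wins) or all records (high throughput wins).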
• Scalability
• RDBMS provides vertical scalability, which is also known as 'scaling up' a machine. It means you can add more
resources or hardware, such as memory or CPU, to a machine in the computer cluster.
• Hadoop, on the other hand, provides horizontal scalability, which is also known as 'scaling out'. It means
adding more machines to the existing computer clusters, as a result of which Hadoop becomes fault tolerant:
there is no single point of failure. Due to the presence of more machines in the cluster, you can easily
recover data irrespective of the failure of one of the machines.
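The scaling-out idea can be sketched in a few lines (a toy model, not Hadoop's actual placement algorithm; node names and the replication factor are made up for illustration): records are hash-partitioned across machines, each block is written to more than one machine, and losing a node loses no data.

```python
REPLICATION = 2  # each block is stored on this many distinct nodes

def place(key, nodes, replication=REPLICATION):
    """Pick `replication` distinct nodes for a key by hashing."""
    start = hash(key) % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replication)]

nodes = ["node1", "node2", "node3"]
store = {n: {} for n in nodes}  # each node's local storage

def put(key, value):
    for n in place(key, nodes):
        store[n][key] = value

def get(key, failed=()):
    for n in place(key, nodes):
        if n not in failed and key in store[n]:
            return store[n][key]
    raise KeyError(key)

for i in range(10):
    put(f"block-{i}", f"data-{i}")

# Even with node2 down, every block is still readable from a replica.
survivors = [get(f"block-{i}", failed={"node2"}) for i in range(10)]
print(len(survivors))  # 10
```

Adding capacity means appending another node name and rebalancing, rather than buying a bigger machine, which is exactly the 'scaling out' contrast the slide describes.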
• Data Processing
• Apache Hadoop supports OLAP (Online Analytical Processing), which is used in data mining techniques.
• OLAP involves very complex queries and aggregations. Processing speed depends on the amount of data and can
take several hours. The database design is de-normalized, having fewer tables; OLAP uses star schemas.
• On the other hand, RDBMS supports OLTP (Online Transaction Processing), which involves comparatively fast
query processing. The database design is highly normalized, having a large number of tables. OLTP generally
uses 3NF (an entity model) schemas.
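The OLAP/OLTP contrast can be made concrete with a miniature star schema in sqlite3 (the table names and data are invented for illustration): an OLAP-style query joins a fact table to a dimension table and aggregates, while an OLTP-style operation is one small, targeted write.

```python
import sqlite3

# A tiny star schema: one fact table referencing one dimension table.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE fact_sales  (product_id INTEGER, amount INTEGER);
INSERT INTO dim_product VALUES (1, 'phones'), (2, 'laptops');
INSERT INTO fact_sales  VALUES (1, 100), (1, 250), (2, 900);
""")

# OLAP-style query: join the fact table to its dimension and aggregate.
rows = db.execute("""
    SELECT d.category, SUM(f.amount)
    FROM fact_sales f JOIN dim_product d USING (product_id)
    GROUP BY d.category ORDER BY d.category
""").fetchall()
print(rows)  # [('laptops', 900), ('phones', 350)]

# OLTP-style operation: one fast, targeted write to a single row.
db.execute("UPDATE fact_sales SET amount = amount + 50 WHERE rowid = 1")
```

At real scale the aggregate query scans the entire fact table (hours on big data, hence Hadoop), while the single-row update is exactly what a normalized OLTP design keeps fast.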
• Cost
• Hadoop is a free and open-source software framework; you don't have to pay to buy a software license.
• Whereas RDBMS is licensed software; you have to pay to buy the complete software license.
10. Hadoop
• The Apache Hadoop software library is a framework that allows for the
distributed processing of large data sets across clusters of computers using
simple programming models. It is designed to scale up from single servers
to thousands of machines, each offering local computation and storage.
Rather than rely on hardware to deliver high-availability, the library itself is
designed to detect and handle failures at the application layer, so delivering
a highly-available service on top of a cluster of computers, each of which
may be prone to failures.
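The "simple programming models" mentioned above refers chiefly to MapReduce. A single-process sketch (plain Python, not the Hadoop API; the function names are illustrative) shows the model Hadoop then distributes across a cluster: map emits key/value pairs, shuffle groups them by key, and reduce aggregates each group.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit (word, 1) for every word in a line of input."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: aggregate one key's values into a final result."""
    return key, sum(values)

lines = ["big data needs big tools", "hadoop handles big data"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["big"], counts["data"])  # 3 2
```

In real Hadoop the map tasks run on the nodes holding the input blocks, the framework performs the shuffle over the network, and reduce tasks run in parallel; the per-record logic stays this simple.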
11. Hadoop Projects
• Hadoop Common: The common utilities that support the other Hadoop
modules.
• Hadoop Distributed File System (HDFS™): A distributed file system that
provides high-throughput access to application data.
• Hadoop YARN: A framework for job scheduling and cluster resource
management.
• Hadoop MapReduce: A YARN-based system for parallel processing of large
data sets.
12. Hadoop Distributions
• Cloudera Enterprise
• www.cloudera.com (online training available)
• Hortonworks Enterprise
• www.hortonworks.com (online training available)
• MapR Enterprise
• www.mapr.com (only classroom training available)
13. Cloudera, Hortonworks and MapR Fight for
Hadoop Supremacy
• Who's going to win, Cloudera, Hortonworks or MapR? All three are battling
for Hadoop supremacy in terms of prominent customers, funding and
market share.
• The latest blow was figuratively struck by Cloudera as Intel yesterday
announced it was quitting on its own distribution and joining forces with
the Hadoop pioneer.
• http://adtmag.com/blogs/dev-watch/2014/03/hadoop-war.aspx