Big data refers to large and complex datasets that are difficult to process using traditional database management tools. It involves 4 V's: volume, velocity, variety, and veracity. MapReduce is a programming model used for processing large datasets in parallel across clusters of computers. It involves Map and Reduce procedures. Hadoop is an open-source software framework that implements MapReduce and provides a distributed file system (HDFS) to support data-intensive applications on large clusters. Data science draws on techniques from many fields to extract meaning from data and create data products.
2. What is Big Data?
Big data is the term for a collection of data sets so
large and complex that it becomes difficult to process
using on-hand database management tools or
traditional data processing applications.
10. … recruit
sponsors and
become an
advocate
…to be the
first human
sensor and
crowd-
source
knowledge
…to learn
about how
technology
enables big
data
analytics
What they want to do??
Employees
Digital Nomads
Technology
Professionals
Data Scientist
Students Influencers
… explore
new, diverse
data to
improve
human
knowledge
… explore
how data is
impacting
mankind
and be a data
scientist for
a day
…share this
with my
network, and
participate
in the
experience
11. Who is a Data Scientist?
Data science incorporates varying
elements and builds on techniques and
theories from many fields, including
mathematics, statistics, data
engineering, pattern recognition and
learning, advanced computing,
visualization, uncertainty modeling,
data warehousing, and high
performance computing with the goal
of extracting meaning from data and
creating data products. A practitioner of
data science is called a data scientist
19. MapReduce & HDFS
• MapReduce is a programming model for processing large data sets
with a parallel, distributed algorithm on a cluster (a group of
connected computers that work together so that in many respects
they can be viewed as a single system)
• A MapReduce program comprises a Map() procedure that performs
filtering & sorting and a Reduce() procedure that performs a
summary operation.
• The "MapReduce System" orchestrates by marshaling the
distributed servers, running the various tasks in parallel, managing
all communications and data transfers between the various parts of
the system, providing for redundancy and fault tolerance, and
overall management of the whole process.
• HDFS is a distributed, scalable, and portable file system written in
Java for the Hadoop framework.
20. Hadoop
• Apache Hadoop is an open-source software framework that
supports data-intensive distributed applications, licensed under the
Apache v2 license.
• Effectively, it implements MapReduce & provides a distributed file
system (HDFS)
• It supports the running of applications on large clusters of
commodity hardware.
• Hadoop makes it possible to run applications on systems with
thousands of nodes involving thousands of terabytes. Its distributed
file system facilitates rapid data transfer rates among nodes and
allows the system to continue operating uninterrupted in case of a
node failure. This approach lowers the risk of catastrophic system
failure, even if a significant number of nodes become inoperative.