A short overview of Big Data and its popularity, its ups and downs from past to present. We also take a look at its needs, challenges, and risks, the architectures involved in it, and the vendors associated with it.
3. Big Data
Big Data is data whose scale, diversity and complexity require new
architecture, techniques, algorithms and analytics to manage it and
extract value and hidden knowledge from it.
Simply put, Big Data is similar to "small data", but bigger in size.
Because the data is bigger, it requires different approaches:
techniques, tools, and architectures.
Big Data aims to solve new problems, or old problems in a better
way.
Big Data generates value from the storage and processing of very
large quantities of digital information that cannot be analyzed with
traditional computing techniques.
5. Analysis - Big Data Generation
Walmart handles more than 1 million customer transactions every
hour.
Facebook handles 40 billion photos from its user base.
Facebook generates 10 TB of data daily.
Twitter generates 7 TB of data daily.
Decoding the human genome originally took 10 years to process;
now it can be achieved in one week.
7. Big Data to Value
Big Data is not about the size of the data; it is
mainly about the value within the data.
8. Why Is Big Data Needed?
Big Data growth is driven by:
Increase of storage capacities.
Increase of processing power.
Availability of data (different data types).
Every day we create 2.5 quintillion bytes of data.
IBM claims that 90% of the data in the world today has been
created in the last two years alone.
9. Big Data Analytics
Examining huge amounts of data.
Extracting accurate information.
Identifying hidden patterns and unknown correlations.
Staying ahead in a competitive environment.
Making better business decisions, both strategic and operational.
Achieving effective marketing, customer satisfaction, and increased
revenue.
11. Risks of Big Data
It can be overwhelming: it needs the right people
solving the right problems.
Costs can escalate too fast: it is not necessary to
capture 100% of the data.
Many sources of Big Data raise privacy concerns:
self-regulation and legal regulation are needed.
12. Challenges of Big Data
Uncertainty of the Data Management Landscape
The Big Data talent gap
Getting data into the Big Data platform
Synchronization across the data sources
Getting useful information out of the Big Data platform
13. Big Data Analytics Technologies
NoSQL: non-relational, or at least non-SQL, database
solutions such as HBase (also a part of the Hadoop
ecosystem), Cassandra, MongoDB, Riak, CouchDB, and
many others.
Hadoop: an ecosystem of software packages,
including MapReduce, HDFS, and a whole host of
other software packages.
14. Apache Hadoop is a framework that allows for the distributed
processing of large data sets across clusters of commodity
computers using a simple programming model.
It is an open-source data management system with scale-out storage
and distributed processing.
Hadoop is a system for large-scale data processing.
It has two main components:
Hadoop = HDFS + MapReduce
15. HDFS - Hadoop Distributed File System
HDFS (storage and file system): HDFS is a scalable,
fault-tolerant, reliable distributed file system that
provides high-throughput access to data.
NameNode:
The master of the system.
Maintains and manages the blocks that are
present on the DataNodes.
DataNodes:
Slaves that are deployed on each machine
and provide the actual storage.
Responsible for serving read and write
requests from the clients.
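As a minimal sketch of how a client talks to HDFS, the Java snippet below writes and reads a file through Hadoop's FileSystem API. The file path and cluster address are illustrative assumptions, not values from these slides.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from core-site.xml; the cluster address is assumed.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Write: the NameNode allocates blocks, DataNodes store the bytes.
        Path path = new Path("/user/demo/hello.txt");  // illustrative path
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.writeBytes("Hello HDFS\n");
        }

        // Read: the client streams block data directly from the DataNodes.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(path)))) {
            System.out.println(in.readLine());
        }
    }
}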
17. MapReduce
A MapReduce job usually splits the input data-set into independent
chunks which are processed by the map tasks in a completely parallel
manner. The framework sorts the outputs of the maps, which are then
input to the reduce tasks. Typically both the input and the output of
the job are stored in a file-system.
It has two phases.
Mapper phase:
Processes a key/value pair to generate intermediate key/value pairs.
Reducer phase:
Merges all intermediate values associated with the same key.
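To make the two phases concrete, here is a minimal word-count sketch using the Hadoop MapReduce Java API; the input and output paths are assumptions for illustration.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Mapper phase: emit (word, 1) for every word in an input line.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer phase: merge all counts emitted for the same word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) sum += val.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));    // assumed
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output")); // assumed
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}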
21. PIG:
Pig was initially developed at Yahoo Research around 2006 but moved into the
Apache Software Foundation in 2007, allowing individuals using Apache Hadoop to
focus more on analyzing massive data sets and spend less time writing
mapper and reducer programs.
The Pig programming language is meant to handle any kind of data, hence the
name!
Pig consists of two components: the first is the language, called Pig Latin, and
the second is an execution environment where Pig Latin programs are executed.
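As a small illustration, the sketch below drives a Pig Latin word count from Java through the PigServer execution environment; the input file name and relation aliases are assumptions for illustration.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigWordCount {
    public static void main(String[] args) throws Exception {
        // Local execution environment; use ExecType.MAPREDUCE on a cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);
        // Each registerQuery call adds one Pig Latin statement to the dataflow.
        pig.registerQuery("lines = LOAD 'input.txt' AS (line:chararray);");
        pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
        pig.registerQuery("grouped = GROUP words BY word;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");
        // store() triggers execution: Pig compiles the dataflow into MapReduce jobs.
        pig.store("counts", "wordcount_out");
    }
}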
HIVE:
Apache Hive is a data warehouse system for Apache Hadoop.
Hive is a technology developed by Facebook that turns Hadoop into
a data warehouse, complete with an extension of SQL for querying.
Hive uses HiveQL, which is a declarative language.
In Pig Latin, the dataflow is described, but in Hive the desired results
are described; Hive by itself works out a dataflow to produce those results.
Hive must have a schema, but it can have more than one.
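A minimal sketch of querying Hive from Java over JDBC (HiveServer2); the host, port, credentials, and table name are assumptions for illustration. Note how the HiveQL is declarative: it states the result, not the dataflow.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        // The HiveServer2 JDBC driver; older setups load it explicitly.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Endpoint, database, and credentials here are assumed.
        Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hive", "");
        try (Statement stmt = conn.createStatement()) {
            ResultSet rs = stmt.executeQuery(
                    "SELECT word, COUNT(*) AS freq FROM words GROUP BY word");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        } finally {
            conn.close();
        }
    }
}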
22. OOZIE:
Oozie is a Java-based web application that runs in a Java servlet container and
uses a database to store definitions of workflows, where a workflow is a
collection of actions. Hadoop jobs are managed by it.
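For a sense of what a stored workflow definition looks like, here is a hedged sketch of a minimal Oozie workflow.xml with a single MapReduce action; the workflow name, paths, and properties are assumptions for illustration.

<workflow-app xmlns="uri:oozie:workflow:0.5" name="demo-wf">
  <start to="mr-node"/>
  <action name="mr-node">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property>
          <name>mapred.input.dir</name>
          <value>/user/demo/input</value>
        </property>
        <property>
          <name>mapred.output.dir</name>
          <value>/user/demo/output</value>
        </property>
      </configuration>
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>MapReduce action failed</message>
  </kill>
  <end name="end"/>
</workflow-app>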
HBASE:
HBase is a non-relational, distributed, column-oriented database,
whereas HDFS is a file system.
It is built and runs on top of the HDFS system.
It is a management system that is open-source, versioned, and
distributed, based on Google's Bigtable.
It is written in Java and serves as both input and output for
MapReduce.
For instance, read and write operations can involve all rows but only a small
subset of all columns.
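A minimal sketch of the HBase Java client API writing and reading a single cell; the table name and column family are assumptions for illustration, and the table is assumed to already exist.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml from the classpath; cluster location is assumed.
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {
            // Write one cell: row key "row1", column family "info", column "name".
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
                    Bytes.toBytes("Alice"));
            table.put(put);
            // Read the same cell back by row key.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
        }
    }
}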
23. SQOOP:
Sqoop is a tool used to transfer data from relational database environments such as
Oracle, MySQL, and PostgreSQL into the Hadoop environment.
It is a command-line interface platform used for transferring data between
relational databases and Hadoop.
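As an illustrative sketch, a typical Sqoop import invocation looks like the following; the connection string, credentials, table name, and target directory are assumptions, not values from these slides.

sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username demo --password '****' \
  --table orders \
  --target-dir /user/demo/orders \
  --num-mappers 4

Under the hood, Sqoop turns this into parallel MapReduce map tasks, each copying a slice of the table into HDFS.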
MAHOUT:
Mahout is a library for machine learning and data mining, which is divided
into four main groups: collaborative filtering, categorization, clustering, and
parallel frequent pattern mining.
The Mahout library belongs to the subset that can be executed in a
distributed mode and executed by MapReduce.
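As a minimal sketch of the collaborative filtering group, the snippet below builds a user-based recommender with Mahout's Taste API; the ratings file (one userID,itemID,rating per line), the neighborhood size, and the user ID are assumptions for illustration.

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class MahoutRecommend {
    public static void main(String[] args) throws Exception {
        // CSV of userID,itemID,rating; the file name is an assumption.
        DataModel model = new FileDataModel(new File("ratings.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood =
                new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender =
                new GenericUserBasedRecommender(model, neighborhood, similarity);
        // Top 3 item recommendations for user 1.
        List<RecommendedItem> recs = recommender.recommend(1L, 3);
        for (RecommendedItem rec : recs) {
            System.out.println(rec.getItemID() + " -> " + rec.getValue());
        }
    }
}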
FLUME:
Flume is open-source software created by Cloudera to act as a service for
collecting and moving large amounts of data into a Hadoop cluster as the data
is produced, or shortly afterwards.
The crucial use case of Flume is gathering log records from all machines in a
cluster and persisting them in a centralized store.
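To make the log-collection use case concrete, here is a hedged sketch of a Flume agent configuration that tails a log file into HDFS; the agent name, log path, and HDFS path are assumptions for illustration.

# One source, one memory channel, one HDFS sink (names are illustrative).
agent1.sources = tail1
agent1.channels = mem1
agent1.sinks = hdfs1

# Source: follow an application log as new lines arrive.
agent1.sources.tail1.type = exec
agent1.sources.tail1.command = tail -F /var/log/app/app.log
agent1.sources.tail1.channels = mem1

# Channel: buffer events in memory between source and sink.
agent1.channels.mem1.type = memory
agent1.channels.mem1.capacity = 10000

# Sink: write events into date-partitioned HDFS directories.
agent1.sinks.hdfs1.type = hdfs
agent1.sinks.hdfs1.hdfs.path = hdfs://namenode/flume/logs/%Y-%m-%d
agent1.sinks.hdfs1.hdfs.fileType = DataStream
agent1.sinks.hdfs1.hdfs.useLocalTimeStamp = true
agent1.sinks.hdfs1.channel = mem1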
24. Conclusion
Real-time Big Data is not just a process for storing petabytes or
exabytes of data in a data warehouse; it is about the ability to make
better decisions and take meaningful actions at the right time.
Fast forward to the present, and technologies like Hadoop give you
the scale and flexibility to store data before you know how you are
going to process it.
Technologies such as MapReduce, Hive, and Impala enable you to
run queries without changing the data structures underneath.
Big Data offers commercial opportunities of a comparable scale to
enterprise software in the late 1980s.
26. Future
Our new research finds that organizations use Big Data to:
target customer-centric outcomes,
tap into internal data, and
build a better information ecosystem.