This document provides an overview of big data concepts including definitions of big data, sources of big data, and uses of big data analytics. It discusses technologies used for big data including Hadoop, MapReduce, Hive, Mahout, MATLAB, and Revolution R. It also addresses challenges around big data such as lack of standardization and extracting meaningful insights from large datasets.
2. Understanding
Do we know Big Data?
What is Big Data?
Where does Big Data come from?
Uses of Big Data?
Technology
Big data in action
Big Data analytics Technologies
3. Data: Collected facts.
Information: Meaning derived from data; meaningful data.
Source: Any book on databases…..
4. Big Data is not new.
It has just grown so big that we have started noticing it.
It is the same old small chunks of data, in large volumes.
Big Data is not only about
Larger volumes of data
Unmanaged data
Social media
Then what is it?
5.
6. Data Sources, Process, Analytics
Data Sources: web logs, click streams, ERP, CRM, RSS feeds, social networks
Process: pre-process (capture, store, integrate), then Hadoop cluster (map, transform, clean)
Analytical data storage
Analytics: reports, scorecards, forecasting, SQL queries, real-time systems
7. Big Data is a new way to look through the data we already have.
It looks at the data with more insight, rather than relying on a specific set of values.
Thus it is used to derive more results from given data sets.
9. Numerous Sources
Cookies, IP Tracking
Person tracking
Social messages on social network websites (e.g. Facebook, Twitter)
Stock market trades
And counting….
10. Origin: Uses
Websites: User preferences, shopping interests
Social messages: Public interests, opinions
Digital receipts: Personalized purchase suggestions
Healthcare data: Preparing for diseases, prediction
Telecom data: New technologies
Space data: Inventions of new space technology
11. We have a large amount of data (!!!).
The problem now is that an analyst can discover “meaningless” patterns.
Statisticians call this Bonferroni's Principle.
Roughly: if you look in more places for important patterns than your amount of data can support, you can find almost anything.
Source: taken from Rajaraman, Ullman: Mining of Massive Datasets
12. We want to find (unrelated) people who at least twice have stayed at the same hotel on the same day
10^9 people being tracked
1000 days
Each person stays in a hotel 1% of the time (1 day out of 100)
Hotels hold 100 people (so 10^5 hotels)
If everyone behaves randomly (i.e., no terrorists), will the data mining detect anything suspicious?
Expected number of "suspicious" pairs of people: 250,000
…too many combinations to check; we need some additional evidence to find "suspicious" pairs of people in a more efficient way
Source: taken from Rajaraman, Ullman: Mining of Massive Datasets
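The 250,000 figure can be checked with a few lines of arithmetic. The sketch below is only an illustration of the calculation, assuming independent random behaviour as the slide does; the function and parameter names are made up for this example. With exact binomial counts it gives about 249,750, which the slide rounds to 250,000.

```python
from math import comb

def expected_suspicious_pairs(n_people, n_days, p_visit, n_hotels):
    """Expected number of (pair of people, pair of days) coincidences
    when everyone picks hotels independently at random."""
    # Two given people are both in some hotel on a given day: p_visit ** 2.
    # They are in the SAME hotel on that day: divide by the number of hotels.
    p_same_hotel_one_day = p_visit ** 2 / n_hotels
    # The coincidence has to happen on two different days.
    p_same_hotel_two_days = p_same_hotel_one_day ** 2
    # Multiply by the number of pairs of people and pairs of days.
    return comb(n_people, 2) * comb(n_days, 2) * p_same_hotel_two_days

expected = expected_suspicious_pairs(
    n_people=10**9, n_days=1000, p_visit=0.01, n_hotels=10**5
)
print(round(expected))
```

Even with no real suspects in the data, roughly a quarter of a million pairs would look "suspicious", which is the point of Bonferroni's Principle.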
13. As the Big Data concept is new, there are no specific standards available.
Big data working groups and initiatives
Open Data Center Alliance (ODCA)
TMF Big Data Analytics Reference Architecture
Research Data Alliance (RDA)
NIST Big Data Working Group (NBD-WG)
14. The Apache Hadoop software library is a
framework that allows for the distributed
processing of large data sets across clusters of
computers using simple programming
models.[from http://hadoop.apache.org/]
IBM, Yahoo and Microsoft have their own products and technology for Big Data.
The Hadoop project was created by Doug Cutting and Mike Cafarella, and much of its early development took place at Yahoo.
15. Hadoop is a scalable, reliable, fault-tolerant and simple software library framework.
Logically, Hadoop is a computing cluster that provides a storage layer and an execution layer.
Storage layer: Hadoop Distributed File System (HDFS)
Runs on a regular OS file system, such as Linux ext3
Fixed-size blocks, normally 64 MB in size, are replicated
Execution layer: Hadoop MapReduce
Runs on many servers
A job consists of special “Map” and “Reduce” functions
Source: A (very) short intro to Hadoop, Ken Krugler's talk at BigDataCamp held in Washington DC, November 2011
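To make the storage layer concrete, here is a minimal sketch of how a file maps onto fixed-size, replicated blocks. It uses the 64 MB block size quoted above; the replication factor of 3 is an assumption (it is the usual HDFS default, but the slide does not state it), and the function name is hypothetical.

```python
import math

BLOCK_SIZE_MB = 64  # block size quoted on the slide

def hdfs_block_layout(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=3):
    """Estimate how a file is split into blocks and replicated.

    replication=3 is an assumed value, not taken from the slide.
    """
    n_blocks = math.ceil(file_size_mb / block_size_mb)
    return {
        "blocks": n_blocks,                      # logical blocks in the file
        "block_replicas": n_blocks * replication,  # copies spread over servers
        # The last block only occupies the bytes it actually holds,
        # so raw storage is simply file size times replication.
        "raw_storage_mb": file_size_mb * replication,
    }

layout = hdfs_block_layout(1000)
print(layout)
```

Because every block exists on several servers, losing one machine does not lose data, which is where the fault tolerance mentioned above comes from.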
16. Source: A (very) short intro to Hadoop, Ken Krugler's talk at BigDataCamp held in Washington DC, November 2011
17. Google published a research paper describing MapReduce, a technology that spreads processing across hundreds or thousands of machines to provide faster execution.
It has two main functions, Map and Reduce.
Map processes key/value pairs and produces a set of intermediate key/value pairs.
Reduce merges all intermediate values that share the same key and produces the merged output.
Source: http://research.google.com/archive/mapreduce.html
18. Data Collection
Cust_id: A123, Amount: 500
Cust_id: A123, Amount: 250
Cust_id: B212, Amount: 200
Cust_id: A223, Amount: 250
Query (customers A123 and B212)
Cust_id: A123, Amount: 500
Cust_id: A123, Amount: 250
Cust_id: B212, Amount: 200
Map (Cust_id with Amount)
A123 {500, 250}
B212 {200}
Reduce (sum of Amount for given Cust_id)
Cust_id: A123, Amount: 750
Cust_id: B212, Amount: 200
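The same flow can be imitated in plain Python. This is only a single-machine sketch of the map, group and reduce steps for customers A123 and B212, not Hadoop code; the function names are chosen for this example.

```python
from collections import defaultdict

# Records from the Data Collection step: (cust_id, amount).
records = [("A123", 500), ("A123", 250), ("B212", 200), ("A223", 250)]

def map_phase(records, wanted):
    """Emit one (cust_id, amount) pair per record that matches the query."""
    for cust_id, amount in records:
        if cust_id in wanted:
            yield cust_id, amount

def group_by_key(pairs):
    """Collect all intermediate values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

def reduce_phase(groups):
    """Merge the values of each key into a single total."""
    return {key: sum(values) for key, values in groups.items()}

grouped = group_by_key(map_phase(records, wanted={"A123", "B212"}))
totals = reduce_phase(grouped)
print(grouped)  # {'A123': [500, 250], 'B212': [200]}
print(totals)   # {'A123': 750, 'B212': 200}
```

In a real cluster the map calls run on many servers at once and the framework does the grouping between the two phases; the logic per key stays the same.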
19. Hive
Apache Mahout
Processing Big Data with MATLAB
Revolution R
20. Hive is a SQL-like technology that sits on top of Hadoop clusters.
Hive provides the Hive Query Language (HQL), which allows SQL developers to write queries similar to SQL.
HQL queries can be run in the Hive shell, or over JDBC/ODBC using drivers called Hive Thrift clients.
Hive is based on Hadoop and MapReduce.
The key difference between HQL and SQL is that Hadoop is intended for long sequential scans, so query latency can be in the minutes.
21. Apache Mahout is a scalable machine learning library.
Uses of Machine Learning
Generation of Recommendations based on previous clicks
Classifying DNA sequences
Bioinformatics, Natural Language Processing
A mahout is a person who keeps and drives an
elephant. The name Mahout comes from the
project's use of Apache Hadoop — which has a
yellow elephant as its logo — for scalability and
fault tolerance
22. Apache Mahout's algorithms for clustering, classification and batch-based collaborative filtering are implemented on top of Apache Hadoop using the map/reduce paradigm.
Mahout provides business intelligence features such as collaborative filtering and clustering.
Collaborative filtering (CF) is a technique, popularized by Amazon and others, that uses user information such as ratings, clicks, and purchases to provide recommendations to other site users.
Clustering is a technique for grouping the items of a data set by a given condition, e.g. given all the news for a day from all newspapers across India, one might want to automatically group all articles related to the same story.
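The idea behind collaborative filtering can be shown with a tiny co-occurrence recommender. This is a plain-Python sketch, not Mahout's API: the purchase data is invented, and scoring items by how often they were bought together is only one simple variant of the technique.

```python
from collections import Counter
from itertools import permutations

# Hypothetical purchase history: user -> set of items bought.
purchases = {
    "alice": {"book", "lamp", "pen"},
    "bob": {"book", "pen"},
    "carol": {"book", "lamp"},
    "dave": {"pen", "mug"},
}

def cooccurrence_counts(purchases):
    """Count, for each item, how often every other item was bought with it."""
    counts = {}
    for items in purchases.values():
        for a, b in permutations(sorted(items), 2):
            counts.setdefault(a, Counter())[b] += 1
    return counts

def recommend(user, purchases, top_n=2):
    """Suggest items the user does not own, ranked by co-occurrence score."""
    owned = purchases[user]
    counts = cooccurrence_counts(purchases)
    scores = Counter()
    for item in owned:
        for other, n in counts.get(item, {}).items():
            if other not in owned:
                scores[other] += n
    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]

print(recommend("bob", purchases))  # ['lamp', 'mug']
```

Bob is offered the lamp first because other buyers of his items bought it most often, which is the "uses other users' behaviour" part of the definition above.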
24. Memory Mapped Variables. This allows you to
efficiently access big data sets on disk that are too
large to hold in memory or that take too long to
load.
Intrinsic Multicore Math. Many of the built-in
mathematical functions in MATLAB, such as fft,
inv, and eig, are multithreaded.
Cloud Computing. You can run MATLAB computations in parallel using MATLAB Distributed Computing Server on Amazon's Elastic Compute Cloud (EC2) for on-demand parallel processing on hundreds or thousands of computers.
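The memory-mapping idea in the first point is not specific to MATLAB. As a rough analogue, and not MATLAB's own interface, the Python sketch below reads one element of an on-disk array through the standard `mmap` module without loading the whole file; the file name and helper functions are hypothetical.

```python
import mmap
import os
import struct
import tempfile

DOUBLE = struct.Struct("<d")  # one little-endian 8-byte float

def write_doubles(path, values):
    """Write the values to disk as a flat array of doubles."""
    with open(path, "wb") as f:
        for value in values:
            f.write(DOUBLE.pack(value))

def read_double_at(path, index):
    """Read a single element through a memory map.

    Only the pages that are touched get loaded by the operating system,
    so this works even when the file is larger than available memory.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return DOUBLE.unpack_from(mm, index * DOUBLE.size)[0]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "values.bin")
    write_doubles(path, [1.5, 2.5, 3.5])
    third = read_double_at(path, 2)

print(third)  # 3.5
```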
25. R is a statistical analysis language developed by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand.
It is called “R” after the first initial of its developers' names.
R can perform statistical and graphical analysis, and provides clustering and classification on given data sets.
R is an object-oriented programming language, and it is highly extensible because users can submit packages for specific areas of interest.
26. Revolution R is developed by a company called Revolution Analytics.
The company developed an “open core” solution based on R.
R is built on the concept that all the data to be analyzed is held in memory.
This is not possible in the case of large data sets.
Revolution R provides a new file format for large data sets.
It also provides a parallel external-memory implementation and parallel algorithms for Big Data.
27. As there is no standardization and data sets grow larger day by day, everybody is suggesting new solutions.
The trend is to combine existing technologies and provide new architectures.
The situation is that we don't know what we could already know.
Big Data is like a junction where multiple roads from very different directions intersect.
Big Data is certainly the future, with new possibilities and opportunities.
28. Hsinchun Chen, Roger H. L. Chiang & Veda C. Storey (December 2012). MIS Quarterly, Vol. 36, 1165-1188.
Phillip Redman, John Girard & Leif-Olof Wallin (13 April 2011). Magic Quadrant for Mobile Device Management Software. Gartner Research, ID no. G00211101, 1-25.
Adam Jacobs (August 2009). The Pathologies of Big Data. Communications of the ACM, Vol. 52, No. 8, 36-44.
Jeffrey Dean & Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Google Inc. research paper, OSDI 2004, 1-12.
Samet Ayhan, Johnathan Pesce, Paul Comitz, Gary Gerberick & Steve Bliesner. Predictive Analytics with Surveillance Big Data. 81-90.
Divyakant Agrawal, Sudipto Das & Amr El Abbadi. Big Data and Cloud Computing: Current State and Future. 530-533.
Jeffrey Cohen, Brian Dolan, Mark Dunlap, Joseph M. Hellerstein & Caleb Welton. MAD Skills: New Analysis Practices for Big Data. 1481-1492.
http://blog.cloudera.com/wp-content/uploads/2010/01/6-IntroToHive.pdf (accessed on 02/10/2013)
http://www.mathworks.com/discovery/big-data-matlab.html (accessed on 02/10/2013)