By
Gaurav Chauhan
(121060753005)
Guided By
Prof Rajesh Ingle
Pune Institute of Computing Technology
 Understanding
 Do we know Big Data?
 What is Big Data?
 Where is Big Data coming from ?
 Uses Of Big Data?
 Technology
 Big data in action
 Big Data analytics Technologies
 Data : Collected Facts.
 Information :
 Derived meaning from data.
 Meaningful data
Source : Any book of database…..
 Big Data is not new.
 It has just grown so big that we started noticing it.
 It is the same old small chunks of data, in large volumes.
 Big Data is not only about
 Larger Volume of Data
 Unmanaged data
 Only for Social Media
 Then what is it?
Data Sources : Web logs, click streams; ERP, CRM; RSS feeds; social networks
Process :
Pre-process : Capture, Store, Integrate
Hadoop Cluster : Map, Transform, Clean
Analytical Data Storage
Analytics : Reports, scorecards; Forecasting; SQL queries; Real-time systems
 Big data is a new way to look at the data we already have.
 It is a way to see the data with more insight, without relying on a specific set of values.
 Thus it is used to get more results from given data sets.
Image Source: http://bigdatawg.nist.gov/_uploadfiles/M0055_v1_7606723276.pdf
 Numerous Sources
 Cookies, IP Tracking
 Person tracking
 Social messages on social networking websites (e.g., Facebook, Twitter)
 Stock market trades
 And counting….
Origin : Uses
Websites : User preferences, shopping interests
Social Messages : Public interests, opinions
Digital Receipts : Personalized purchase suggestions
Healthcare Data : Preparing for diseases, prediction
Telecom Data : New technologies
Space Data : Inventions of new space technology
 We have a large amount of data (!!!).
 The problem now is that an analyst can discover “meaningless” patterns.
 Statisticians call this Bonferroni's Principle.
 Roughly: if you look in more places for interesting patterns than your amount of data can support, you are bound to find patterns that mean nothing.
Source: Rajaraman, Ullman: Mining of Massive Datasets
 We want to find (unrelated) people who at least twice have
stayed at the same hotel on the same day
 10^9 people being tracked
 1000 days
 Each person stays in a hotel 1% of the time (1 day out of 100)
 Hotels hold 100 people (so 10^5 hotels)
 If everyone behaves randomly (i.e., no terrorists) will the data
mining detect anything suspicious?
 Expected number of "suspicious" pairs of people:
 250,000
 …too many combinations to check - we need to have some
additional evidence to find "suspicious" pairs of people in some
more efficient way
Source: Rajaraman, Ullman: Mining of Massive Datasets
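The figure of 250,000 can be checked with a few lines of arithmetic. The sketch below reads the slide's numbers as 10^9 people and 10^5 hotels, as in the cited book; the function name and parameters are hypothetical and only serve the illustration. With exact pair counts the result comes out just under 250,000; the book's approximation (n^2/2 for the number of pairs) rounds it to 250,000.

```python
from math import comb


def expected_suspicious_pairs(people, days, stay_prob, hotels):
    """Expected number of pairs that share a hotel on two different days,
    assuming everyone picks days and hotels independently at random."""
    # Both people are in some hotel on a given day, and it is the same hotel.
    p_same_hotel_one_day = stay_prob * stay_prob / hotels
    # The same coincidence has to happen on two different days.
    p_two_days = p_same_hotel_one_day ** 2
    # Multiply by the number of pairs of people and pairs of days.
    return comb(people, 2) * comb(days, 2) * p_two_days


expected = expected_suspicious_pairs(
    people=10**9, days=1000, stay_prob=0.01, hotels=10**5
)
print(f"Expected suspicious pairs: {expected:,.0f}")
```

Even with purely random behavior, roughly a quarter of a million pairs look "suspicious", which is the point of Bonferroni's Principle.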
 As the Big Data concept is new, no specific standards are available yet.
 Big data working groups and initiatives
 Open Data Center Alliance (ODCA)
 TMF Big Data Analytics Reference Architecture
 Research Data Alliance (RDA)
 NIST Big Data Working Group (NBD-WG)
 The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. [from http://hadoop.apache.org/]
 IBM, Yahoo, Microsoft have their own products
and technology for Big Data.
 The Hadoop project was created by Doug Cutting and Mike Cafarella, and much of its early development took place at Yahoo.
 Hadoop is a scalable, reliable, fault-tolerant and simple software framework.
 Logically, Hadoop is a computing cluster that provides a storage layer and an execution layer.
Source: A (very) short intro to Hadoop, talk by Ken Krugler at BigDataCamp, Washington DC, November 2011
Storage layer : Hadoop Distributed File System (HDFS)
 Runs on a regular OS file system, such as Linux ext3
 Fixed-size blocks, normally 64 MB, are replicated
Execution layer : Hadoop MapReduce
 Runs on many servers
 A job consists of special “Map” and “Reduce” functions
Source: A (very) short intro to Hadoop, talk by Ken Krugler at BigDataCamp, Washington DC, November 2011
 Google published a research paper describing MapReduce, a technology that spreads processing across hundreds or thousands of CPUs to provide faster execution.
 It has two main functions: Map and Reduce.
 Map processes key/value pairs and produces a set of intermediate key/value pairs.
 Reduce merges all intermediate values that share the same key and produces the merged output.
Source: http://research.google.com/archive/mapreduce.html
Data Collection
Cust_id: A123, Amount: 500
Cust_id: A123, Amount: 250
Cust_id: B212, Amount: 200
Cust_id: A223, Amount: 250
Query (Customers A123 and B212)
Cust_id: A123, Amount: 500
Cust_id: A123, Amount: 250
Cust_id: B212, Amount: 200
Map (Cust_id with Amount)
A123 {500, 250}
B212 {200}
Reduce (Sum of Amount for given Cust_id)
Cust_id: A123, Amount: 750
Cust_id: B212, Amount: 200
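The query, map and reduce steps of this example can be sketched in plain Python. This is a single-process illustration of the idea, not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`) are hypothetical, and the customer id A123 is used as it appears in the data collection.

```python
from collections import defaultdict

# The four records from the "Data Collection" step.
orders = [
    {"cust_id": "A123", "amount": 500},
    {"cust_id": "A123", "amount": 250},
    {"cust_id": "B212", "amount": 200},
    {"cust_id": "A223", "amount": 250},
]


def map_phase(records, wanted):
    """Query + Map: keep the wanted customers and emit (key, value) pairs."""
    for record in records:
        if record["cust_id"] in wanted:
            yield record["cust_id"], record["amount"]


def shuffle(pairs):
    """Group the intermediate values by key, as the framework would do."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(grouped):
    """Reduce: merge the values of each key into a single sum."""
    return {key: sum(values) for key, values in grouped.items()}


grouped = shuffle(map_phase(orders, wanted={"A123", "B212"}))
totals = reduce_phase(grouped)
print(grouped)
print(totals)
```

In a real cluster the map calls run in parallel on the nodes that hold the data blocks, and the grouping step moves intermediate pairs between machines; here everything runs in one process to keep the data flow visible.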
 Hive
 Apache Mahout
 Processing Big Data with MATLAB
 Revolution R
 Hive is a SQL-like technology that sits on top of Hadoop clusters.
 Hive provides Hive Query Language (HQL), which allows SQL developers to write queries similar to SQL.
 HQL queries can be run from the Hive shell, or from JDBC/ODBC applications through drivers known as Hive Thrift clients.
 Hive is based on Hadoop and MapReduce.
 The key difference between HQL and SQL is that Hadoop is intended for long sequential scans, so query latency can be measured in minutes.
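To give a feel for the SQL-style queries that HQL is meant to allow, the sketch below runs a sum-per-customer aggregation over a small order table. Python's built-in sqlite3 module is used purely as a local stand-in, since no Hive cluster is assumed here; the table name `orders` and its columns are hypothetical. A query of this shape should carry over to HQL largely unchanged, but that has not been checked against a Hive installation.

```python
import sqlite3

# sqlite3 is only a stand-in so the query can run locally; it is not Hive.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (cust_id TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("A123", 500), ("A123", 250), ("B212", 200), ("A223", 250)],
)

# Sum the amounts of two customers, one output row per customer.
QUERY = """
SELECT cust_id, SUM(amount) AS total
FROM orders
WHERE cust_id IN ('A123', 'B212')
GROUP BY cust_id
ORDER BY cust_id
"""

rows = conn.execute(QUERY).fetchall()
conn.close()
print(rows)
```

On Hive, a query like this is compiled into MapReduce jobs rather than answered from an index, which is why latency is measured in minutes rather than milliseconds.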
 Apache Mahout is a scalable machine learning library.
 Uses of Machine Learning
 Generation of Recommendations based on previous clicks
 Classifying DNA sequences
 Bioinformatics, Natural Language Processing
 A mahout is a person who keeps and drives an
elephant. The name Mahout comes from the
project's use of Apache Hadoop — which has a
yellow elephant as its logo — for scalability and
fault tolerance
 Apache Mahout's algorithms for clustering, classification and batch-based collaborative filtering are implemented on top of Apache Hadoop using the map/reduce paradigm.
 Mahout provides features that are very useful for business intelligence, such as collaborative filtering and clustering.
 Collaborative filtering (CF) is a technique, popularized by Amazon and others, that uses user information such as ratings, clicks, and purchases to provide recommendations to other site users.
 Clustering is a technique for grouping data sets on a given condition, e.g., given all the news for a day from all newspapers across India, one might want to group all articles related to the same story automatically.
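The idea behind collaborative filtering can be shown with a very small user-based sketch. This is not Mahout's API or one of its algorithms: the purchase data, the `recommend` function and the overlap-based scoring are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical purchase histories: user -> set of items bought.
purchases = {
    "alice": {"book", "lamp", "pen"},
    "bob": {"book", "pen", "desk"},
    "carol": {"lamp", "chair"},
}


def recommend(user, history, top_n=2):
    """Suggest items bought by users whose purchases overlap with `user`'s."""
    own = history[user]
    scores = Counter()
    for other, items in history.items():
        if other == user:
            continue
        overlap = len(own & items)  # similarity = number of shared items
        if overlap == 0:
            continue
        for item in items - own:  # only items the user does not have yet
            scores[item] += overlap
    return [item for item, _ in scores.most_common(top_n)]


print(recommend("alice", purchases))
```

Here bob shares two items with alice and carol shares one, so bob's extra item ranks above carol's. A production system works from far larger rating matrices, which is where a map/reduce implementation becomes useful.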
 MATLAB (Matrix Laboratory) is a numerical computing environment and fourth-generation programming language developed by MathWorks.
 Memory Mapped Variables. This allows you to
efficiently access big data sets on disk that are too
large to hold in memory or that take too long to
load.
 Intrinsic Multicore Math. Many of the built-in
mathematical functions in MATLAB, such as fft,
inv, and eig, are multithreaded.
 Cloud Computing. You can run MATLAB computations in parallel using MATLAB Distributed Computing Server on Amazon’s Elastic Compute Cloud (EC2) for on-demand parallel processing on hundreds or thousands of computers.
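The memory-mapping idea from the first point above is not specific to MATLAB. As a rough analogue, the sketch below uses Python's standard `mmap` module to read one value from a binary file without loading the whole file. The file name, the little-endian double layout and the helper names are assumptions for illustration, and this does not show MATLAB's own interface.

```python
import mmap
import os
import struct
import tempfile

ITEM = struct.Struct("<d")  # one little-endian 8-byte float per value


def write_doubles(path, values):
    """Write the values to disk as a flat array of doubles."""
    with open(path, "wb") as f:
        for v in values:
            f.write(ITEM.pack(v))


def read_double_at(path, index):
    """Read a single element through a memory map; the OS pages in
    only the part of the file that is actually touched."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return ITEM.unpack_from(mm, index * ITEM.size)[0]


path = os.path.join(tempfile.mkdtemp(), "samples.bin")
write_doubles(path, (i * 0.5 for i in range(100_000)))
value = read_double_at(path, 99_999)
print(value)
```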
 R is a statistical analysis language developed by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand.
 It is called “R” after the first initial of its two developers.
 R can perform statistical and graphical analysis, and provides clustering and classification on given data sets.
 R is an object-oriented programming language, and it is highly extensible: users can submit packages for specific areas of interest.
 Revolution R is developed by a company called Revolution Analytics.
 The company built its “Open Core” solution on R, whose underlying concept is that all the data to be analyzed is held in memory.
 This is not feasible for large data sets.
 Revolution R therefore provides a new file format for large data sets.
 It also provides parallel external-memory implementations and parallel algorithms for Big Data.
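An external-memory algorithm processes data in pieces instead of holding all of it in memory. The sketch below shows the principle with a chunked mean over a CSV stream. It is a generic illustration, not Revolution R's file format or algorithms; the function name, the column name and the chunk size are hypothetical.

```python
import csv
import io


def chunked_mean(rows, column, chunk_size=1000):
    """Mean of one column, keeping at most `chunk_size` values in memory."""
    total = 0.0
    count = 0
    chunk = []
    for row in rows:
        chunk.append(float(row[column]))
        if len(chunk) == chunk_size:
            # Fold the full chunk into the running totals and release it.
            total += sum(chunk)
            count += len(chunk)
            chunk.clear()
    # Fold in the last, possibly partial, chunk.
    total += sum(chunk)
    count += len(chunk)
    return total / count if count else float("nan")


# A small in-memory stand-in for a file that would not fit in RAM.
data = io.StringIO("amount\n500\n250\n200\n250\n")
mean_amount = chunked_mean(csv.DictReader(data), "amount", chunk_size=2)
print(mean_amount)
```

Because each chunk only contributes a partial sum and a count, chunks could also be handled by separate workers and combined at the end, which is what makes this style of algorithm easy to parallelize.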
 As there is no standardization and data sets grow larger day by day, everybody is suggesting new solutions.
 The trend is to combine existing technologies into new architectures.
 The situation is that we don’t know what we
could already know.
 Big data is like a junction where roads from very different directions intersect.
 Big Data is certainly the future, with new possibilities and opportunities.
 Hsinchun Chen, Roger H. L. Chiang, & Veda C. Storey (2012, December).
MIS Quarterly, Vol. 36, 1165-1188
 Phillip Redman, John Girard, Leif-Olof Wallin (13 April 2011). Magic
Quadrant for Mobile Device Management Software, Gartner Research, ID
no: G00211101, 1-25
 Adam Jacobs (August 2009). The Pathologies of Big Data. Communications of the ACM, Vol 52, No 8. 36-44
 Jeffrey Dean & Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Google Inc Research Paper, OSDI 2004. 1-12
 Samet Ayhan , Johnathan Pesce, Paul Comitz, Gary Gerberick & Steve
Bliesner . Predictive Analytics with Surveillance Big Data. 81-90
 Divyakant Agrawal, Sudipto Das & Amr El Abbadi. Big Data and Cloud
Computing: Current State and Future.530-533
 Jeffrey Cohen, Brian Dolan, Mark Dunlap, Joseph M. Hellerstein, Caleb
Welton, MAD Skills: New Analysis Practices for Big Data, 1481-1492
 http://blog.cloudera.com/wp-content/uploads/2010/01/6-IntroToHive.pdf (accessed on 02/10/2013)
 http://www.mathworks.com/discovery/big-data-matlab.html (accessed
on 02/10/2013)
Big data analytics 1
Big data analytics 1
Big data analytics 1

Más contenido relacionado

La actualidad más candente

La actualidad más candente (20)

ANALYTICS OF DATA USING HADOOP-A REVIEW
ANALYTICS OF DATA USING HADOOP-A REVIEWANALYTICS OF DATA USING HADOOP-A REVIEW
ANALYTICS OF DATA USING HADOOP-A REVIEW
 
Tools and Methods for Big Data Analytics by Dahl Winters
Tools and Methods for Big Data Analytics by Dahl WintersTools and Methods for Big Data Analytics by Dahl Winters
Tools and Methods for Big Data Analytics by Dahl Winters
 
Big data mining
Big data miningBig data mining
Big data mining
 
Data mining with big data
Data mining with big dataData mining with big data
Data mining with big data
 
Big data
Big dataBig data
Big data
 
Big data-analytics-cpe8035
Big data-analytics-cpe8035Big data-analytics-cpe8035
Big data-analytics-cpe8035
 
Big Data
Big DataBig Data
Big Data
 
Big data 101
Big data 101Big data 101
Big data 101
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Research issues in the big data and its Challenges
Research issues in the big data and its ChallengesResearch issues in the big data and its Challenges
Research issues in the big data and its Challenges
 
Bigdata analytics
Bigdata analyticsBigdata analytics
Bigdata analytics
 
Data mining on big data
Data mining on big dataData mining on big data
Data mining on big data
 
Introduction to big data
Introduction to big dataIntroduction to big data
Introduction to big data
 
Big Data
Big DataBig Data
Big Data
 
Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...
Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...
Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...
 
Big data-ppt
Big data-pptBig data-ppt
Big data-ppt
 
The Future Of Big Data
The Future Of Big DataThe Future Of Big Data
The Future Of Big Data
 
big data overview ppt
big data overview pptbig data overview ppt
big data overview ppt
 
Big Data: The 4 Layers Everyone Must Know
Big Data: The 4 Layers Everyone Must KnowBig Data: The 4 Layers Everyone Must Know
Big Data: The 4 Layers Everyone Must Know
 
Big Data 101
Big Data 101Big Data 101
Big Data 101
 

Destacado

Big data analytics in payments
Big data analytics in payments Big data analytics in payments
Big data analytics in payments Ashish Anand
 
Payments Key Performance Indicators (KPIs): A Basic Perspective
Payments Key Performance Indicators (KPIs):  A Basic PerspectivePayments Key Performance Indicators (KPIs):  A Basic Perspective
Payments Key Performance Indicators (KPIs): A Basic PerspectiveChristopher Uriarte
 
Big Data in Financial Services: How to Improve Performance with Data-Driven D...
Big Data in Financial Services: How to Improve Performance with Data-Driven D...Big Data in Financial Services: How to Improve Performance with Data-Driven D...
Big Data in Financial Services: How to Improve Performance with Data-Driven D...Perficient, Inc.
 
Coastal Urban DEM project - Mapping the vulnerability of Australia's Coast
Coastal Urban DEM project - Mapping the vulnerability of Australia's CoastCoastal Urban DEM project - Mapping the vulnerability of Australia's Coast
Coastal Urban DEM project - Mapping the vulnerability of Australia's CoastFungis Queensland
 
Digital image processing - What is digital image processign
Digital image processing - What is digital image processignDigital image processing - What is digital image processign
Digital image processing - What is digital image processignE2MATRIX
 
Data science-toolchain
Data science-toolchainData science-toolchain
Data science-toolchainJie-Han Chen
 
Up and Down the Python Data & Web Visualization Stack by Rob Story PyData SV ...
Up and Down the Python Data & Web Visualization Stack by Rob Story PyData SV ...Up and Down the Python Data & Web Visualization Stack by Rob Story PyData SV ...
Up and Down the Python Data & Web Visualization Stack by Rob Story PyData SV ...PyData
 
Data Visualization(s) Using Python
Data Visualization(s) Using PythonData Visualization(s) Using Python
Data Visualization(s) Using PythonAniket Maithani
 
Data visualization in python/Django
Data visualization in python/DjangoData visualization in python/Django
Data visualization in python/Djangokenluck2001
 
The New Role of Billing & Charging Systems In The Face Of IoT Challenges
The New Role of Billing & Charging Systems In The Face Of IoT ChallengesThe New Role of Billing & Charging Systems In The Face Of IoT Challenges
The New Role of Billing & Charging Systems In The Face Of IoT ChallengesComarch
 
BigData Analytics with Hadoop and BIRT
BigData Analytics with Hadoop and BIRTBigData Analytics with Hadoop and BIRT
BigData Analytics with Hadoop and BIRTAmrit Chhetri
 
Data visualization with Python and SVG
Data visualization with Python and SVGData visualization with Python and SVG
Data visualization with Python and SVGSukjun Kim
 
Matlab Working With Images
Matlab Working With ImagesMatlab Working With Images
Matlab Working With Imagesmatlab Content
 
Introduction to Big Data Analytics and Data Science
Introduction to Big Data Analytics and Data ScienceIntroduction to Big Data Analytics and Data Science
Introduction to Big Data Analytics and Data ScienceData Science Thailand
 
Planning & Network Transformation
Planning & Network TransformationPlanning & Network Transformation
Planning & Network TransformationComarch
 

Destacado (17)

Big data analytics in payments
Big data analytics in payments Big data analytics in payments
Big data analytics in payments
 
Payments Key Performance Indicators (KPIs): A Basic Perspective
Payments Key Performance Indicators (KPIs):  A Basic PerspectivePayments Key Performance Indicators (KPIs):  A Basic Perspective
Payments Key Performance Indicators (KPIs): A Basic Perspective
 
Big Data in Financial Services: How to Improve Performance with Data-Driven D...
Big Data in Financial Services: How to Improve Performance with Data-Driven D...Big Data in Financial Services: How to Improve Performance with Data-Driven D...
Big Data in Financial Services: How to Improve Performance with Data-Driven D...
 
MATLAB Fundamentals (1)
MATLAB Fundamentals (1)MATLAB Fundamentals (1)
MATLAB Fundamentals (1)
 
Coastal Urban DEM project - Mapping the vulnerability of Australia's Coast
Coastal Urban DEM project - Mapping the vulnerability of Australia's CoastCoastal Urban DEM project - Mapping the vulnerability of Australia's Coast
Coastal Urban DEM project - Mapping the vulnerability of Australia's Coast
 
Digital image processing - What is digital image processign
Digital image processing - What is digital image processignDigital image processing - What is digital image processign
Digital image processing - What is digital image processign
 
Data science-toolchain
Data science-toolchainData science-toolchain
Data science-toolchain
 
Geomagic_Control (EN)
Geomagic_Control (EN)Geomagic_Control (EN)
Geomagic_Control (EN)
 
Up and Down the Python Data & Web Visualization Stack by Rob Story PyData SV ...
Up and Down the Python Data & Web Visualization Stack by Rob Story PyData SV ...Up and Down the Python Data & Web Visualization Stack by Rob Story PyData SV ...
Up and Down the Python Data & Web Visualization Stack by Rob Story PyData SV ...
 
Data Visualization(s) Using Python
Data Visualization(s) Using PythonData Visualization(s) Using Python
Data Visualization(s) Using Python
 
Data visualization in python/Django
Data visualization in python/DjangoData visualization in python/Django
Data visualization in python/Django
 
The New Role of Billing & Charging Systems In The Face Of IoT Challenges
The New Role of Billing & Charging Systems In The Face Of IoT ChallengesThe New Role of Billing & Charging Systems In The Face Of IoT Challenges
The New Role of Billing & Charging Systems In The Face Of IoT Challenges
 
BigData Analytics with Hadoop and BIRT
BigData Analytics with Hadoop and BIRTBigData Analytics with Hadoop and BIRT
BigData Analytics with Hadoop and BIRT
 
Data visualization with Python and SVG
Data visualization with Python and SVGData visualization with Python and SVG
Data visualization with Python and SVG
 
Matlab Working With Images
Matlab Working With ImagesMatlab Working With Images
Matlab Working With Images
 
Introduction to Big Data Analytics and Data Science
Introduction to Big Data Analytics and Data ScienceIntroduction to Big Data Analytics and Data Science
Introduction to Big Data Analytics and Data Science
 
Planning & Network Transformation
Planning & Network TransformationPlanning & Network Transformation
Planning & Network Transformation
 

Similar a Big data analytics 1

The book of elephant tattoo
The book of elephant tattooThe book of elephant tattoo
The book of elephant tattooMohamed Magdy
 
Big data with hadoop
Big data with hadoopBig data with hadoop
Big data with hadoopAnusha sweety
 
TCS_DATA_ANALYSIS_REPORT_ADITYA
TCS_DATA_ANALYSIS_REPORT_ADITYATCS_DATA_ANALYSIS_REPORT_ADITYA
TCS_DATA_ANALYSIS_REPORT_ADITYAAditya Srinivasan
 
Big data analytics, survey r.nabati
Big data analytics, survey r.nabatiBig data analytics, survey r.nabati
Big data analytics, survey r.nabatinabati
 
Hadoop for beginners free course ppt
Hadoop for beginners   free course pptHadoop for beginners   free course ppt
Hadoop for beginners free course pptNjain85
 
Lecture 5 - Big Data and Hadoop Intro.ppt
Lecture 5 - Big Data and Hadoop Intro.pptLecture 5 - Big Data and Hadoop Intro.ppt
Lecture 5 - Big Data and Hadoop Intro.pptalmaraniabwmalk
 
A Glimpse of Bigdata - Introduction
A Glimpse of Bigdata - IntroductionA Glimpse of Bigdata - Introduction
A Glimpse of Bigdata - Introductionsaisreealekhya
 
Big data and data mining
Big data and data miningBig data and data mining
Big data and data miningPolash Halder
 
The Future of Data Science
The Future of Data ScienceThe Future of Data Science
The Future of Data Sciencesarith divakar
 
2013 International Conference on Knowledge, Innovation and Enterprise Presen...
2013  International Conference on Knowledge, Innovation and Enterprise Presen...2013  International Conference on Knowledge, Innovation and Enterprise Presen...
2013 International Conference on Knowledge, Innovation and Enterprise Presen...oj08
 
Big Data Basic Concepts | Presented in 2014
Big Data Basic Concepts  | Presented in 2014Big Data Basic Concepts  | Presented in 2014
Big Data Basic Concepts | Presented in 2014Kenneth Igiri
 
Cloud and Bid data Dr.VK.pdf
Cloud and Bid data Dr.VK.pdfCloud and Bid data Dr.VK.pdf
Cloud and Bid data Dr.VK.pdfkalai75
 
Big Data Mining, Techniques, Handling Technologies and Some Related Issues: A...
Big Data Mining, Techniques, Handling Technologies and Some Related Issues: A...Big Data Mining, Techniques, Handling Technologies and Some Related Issues: A...
Big Data Mining, Techniques, Handling Technologies and Some Related Issues: A...IJSRD
 
Big Data Mining, Techniques, Handling Technologies and Some Related Issues: A...
Big Data Mining, Techniques, Handling Technologies and Some Related Issues: A...Big Data Mining, Techniques, Handling Technologies and Some Related Issues: A...
Big Data Mining, Techniques, Handling Technologies and Some Related Issues: A...IJSRD
 
Big data and Hadoop overview
Big data and Hadoop overviewBig data and Hadoop overview
Big data and Hadoop overviewNitesh Ghosh
 

Similar a Big data analytics 1 (20)

The book of elephant tattoo
The book of elephant tattooThe book of elephant tattoo
The book of elephant tattoo
 
Big data with hadoop
Big data with hadoopBig data with hadoop
Big data with hadoop
 
TCS_DATA_ANALYSIS_REPORT_ADITYA
TCS_DATA_ANALYSIS_REPORT_ADITYATCS_DATA_ANALYSIS_REPORT_ADITYA
TCS_DATA_ANALYSIS_REPORT_ADITYA
 
Big data analytics, survey r.nabati
Big data analytics, survey r.nabatiBig data analytics, survey r.nabati
Big data analytics, survey r.nabati
 
Hadoop for beginners free course ppt
Hadoop for beginners   free course pptHadoop for beginners   free course ppt
Hadoop for beginners free course ppt
 
Lecture 5 - Big Data and Hadoop Intro.ppt
Lecture 5 - Big Data and Hadoop Intro.pptLecture 5 - Big Data and Hadoop Intro.ppt
Lecture 5 - Big Data and Hadoop Intro.ppt
 
A Glimpse of Bigdata - Introduction
A Glimpse of Bigdata - IntroductionA Glimpse of Bigdata - Introduction
A Glimpse of Bigdata - Introduction
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
Big data and data mining
Big data and data miningBig data and data mining
Big data and data mining
 
Big data business case
Big data   business caseBig data   business case
Big data business case
 
The Future of Data Science
The Future of Data ScienceThe Future of Data Science
The Future of Data Science
 
Big Data
Big DataBig Data
Big Data
 
2013 International Conference on Knowledge, Innovation and Enterprise Presen...
2013  International Conference on Knowledge, Innovation and Enterprise Presen...2013  International Conference on Knowledge, Innovation and Enterprise Presen...
2013 International Conference on Knowledge, Innovation and Enterprise Presen...
 
Big Data Basic Concepts | Presented in 2014
Big Data Basic Concepts  | Presented in 2014Big Data Basic Concepts  | Presented in 2014
Big Data Basic Concepts | Presented in 2014
 
Cloud and Bid data Dr.VK.pdf
Cloud and Bid data Dr.VK.pdfCloud and Bid data Dr.VK.pdf
Cloud and Bid data Dr.VK.pdf
 
Big Data Mining, Techniques, Handling Technologies and Some Related Issues: A...
Big Data Mining, Techniques, Handling Technologies and Some Related Issues: A...Big Data Mining, Techniques, Handling Technologies and Some Related Issues: A...
Big Data Mining, Techniques, Handling Technologies and Some Related Issues: A...
 
Big Data Mining, Techniques, Handling Technologies and Some Related Issues: A...
Big Data Mining, Techniques, Handling Technologies and Some Related Issues: A...Big Data Mining, Techniques, Handling Technologies and Some Related Issues: A...
Big Data Mining, Techniques, Handling Technologies and Some Related Issues: A...
 
Big data and Hadoop overview
Big data and Hadoop overviewBig data and Hadoop overview
Big data and Hadoop overview
 
Big Data & Hadoop
Big Data & HadoopBig Data & Hadoop
Big Data & Hadoop
 
Big data
Big dataBig data
Big data
 

Último

mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docxPoojaSen20
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104misteraugie
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAssociation for Project Management
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityGeoBlogs
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Sapana Sha
 
Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Disha Kariya
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphThiyagu K
 
Russian Call Girls in Andheri Airport Mumbai WhatsApp 9167673311 💞 Full Nigh...
Russian Call Girls in Andheri Airport Mumbai WhatsApp  9167673311 💞 Full Nigh...Russian Call Girls in Andheri Airport Mumbai WhatsApp  9167673311 💞 Full Nigh...
Russian Call Girls in Andheri Airport Mumbai WhatsApp 9167673311 💞 Full Nigh...Pooja Nehwal
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxSayali Powar
 
Disha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfDisha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfchloefrazer622
 
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...Sapna Thakur
 
9548086042 for call girls in Indira Nagar with room service
9548086042  for call girls in Indira Nagar  with room service9548086042  for call girls in Indira Nagar  with room service
9548086042 for call girls in Indira Nagar with room servicediscovermytutordmt
 
The byproduct of sericulture in different industries.pptx
The byproduct of sericulture in different industries.pptxThe byproduct of sericulture in different industries.pptx
The byproduct of sericulture in different industries.pptxShobhayan Kirtania
 
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...fonyou31
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfchloefrazer622
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsTechSoup
 

Último (20)

mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docx
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across Sectors
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptxINDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
 
Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
Big data analytics 1

  • 1. By Gaurav Chauhan (121060753005). Guided by Prof. Rajesh Ingle, Pune Institute of Computing Technology.
  • 2. Understanding
      - Do we know Big Data?
      - What is Big Data?
      - Where is Big Data coming from?
      - Uses of Big Data
    Technology
      - Big Data in action
      - Big Data analytics technologies
  • 3.
      - Data: collected facts.
      - Information: meaning derived from data (meaningful data).
      Source: any database textbook.
  • 4.
      - Big Data is not new.
      - It has simply grown big enough that we started noticing it.
      - It is the same old small chunks of data, in large volumes.
      - Big Data is not only about:
          - larger volumes of data
          - unmanaged data
          - social media
      - Then what is it?
  • 5.
  • 6. [Diagram: from data sources to analytics. Data sources (web logs, click streams, ERP, CRM, RSS feeds, social networks); process (pre-process, capture, store, integrate); Hadoop cluster (map, transform, clean); analytical data storage; analytics (reports, scorecards, forecasting, SQL queries, real-time systems).]
  • 7.
      - Big Data is a new way of looking at the data we already have.
      - It means examining the data for deeper insight rather than relying on a specific set of values.
      - It is thus used to derive more results from given data sets.
  • 9. Numerous sources:
      - Cookies, IP tracking
      - Person tracking
      - Social messages on social networking websites (e.g. Facebook, Twitter)
      - Stock market trades
      - And counting...
  • 10.
      Origin            | Uses
      Websites          | User preferences, shopping interests
      Social messages   | Public interests, opinions
      Digital receipts  | Personalized purchase suggestions
      Healthcare data   | Preparing for diseases, prediction
      Telecom data      | New technologies
      Space data        | Invention of new space technology
  • 11.
      - We have a large amount of data.
      - The problem now is that an analyst can discover "meaningless" patterns.
      - Statisticians call this Bonferroni's Principle.
      - Roughly: if you look in more places for interesting patterns than your amount of data can support, you are bound to find some that mean nothing.
      Source: Rajaraman and Ullman, Mining of Massive Datasets
  • 12.
      - We want to find (unrelated) people who have stayed at the same hotel on the same day at least twice.
          - 10^9 people being tracked
          - 1000 days
          - Each person stays in a hotel 1% of the time (1 day out of 100)
          - Hotels hold 100 people (so 10^5 hotels)
      - If everyone behaves randomly (i.e., no terrorists), will the data mining detect anything suspicious?
      - Expected number of "suspicious" pairs of people: 250,000
      - ...too many combinations to check; we need additional evidence to find "suspicious" pairs of people in a more efficient way.
      Source: Rajaraman and Ullman, Mining of Massive Datasets
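The figure of 250,000 can be reproduced with a few lines of arithmetic. The sketch below is an illustration rather than code from the source: the function and parameter names are assumptions, the slide's figures are read as 10^9 people and 10^5 hotels, and each person is assumed to pick a hotel uniformly at random.

```python
from math import comb

def expected_suspicious_pairs(people, days, p_hotel_visit, hotels):
    """Expected number of pairs seen at the same hotel on two different days."""
    # Both people visit some hotel on a given day, and it is the same hotel.
    p_same_hotel_one_day = p_hotel_visit ** 2 / hotels
    # The same coincidence happens on two specific, different days.
    p_same_hotel_two_days = p_same_hotel_one_day ** 2
    # Multiply by the number of pairs of people and pairs of days.
    return comb(people, 2) * comb(days, 2) * p_same_hotel_two_days

expected = expected_suspicious_pairs(
    people=10**9, days=1000, p_hotel_visit=0.01, hotels=10**5
)
print(f"Expected suspicious pairs: about {expected:,.0f}")
```

The result is close to 250,000 even though nobody in the model is doing anything suspicious, which is the point of the example.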
  • 13.
      - Because the Big Data concept is new, no specific standards are available yet.
      - Big Data working groups and initiatives:
          - Open Data Center Alliance (ODCA)
          - TMF Big Data Analytics Reference Architecture
          - Research Data Alliance (RDA)
          - NIST Big Data Working Group (NBD-WG)
  • 14.
      - "The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models." [from http://hadoop.apache.org/]
      - IBM, Yahoo and Microsoft have their own Big Data products and technologies.
      - The Hadoop project was created by Doug Cutting and Mike Cafarella, and much of its early development took place at Yahoo.
  • 15.
      - Hadoop is a scalable, reliable, fault-tolerant and simple software framework.
      - Logically, Hadoop is a computing cluster that provides a storage layer and an execution layer.
      Storage layer                                          | Execution layer
      Hadoop Distributed File System (HDFS)                  | Hadoop MapReduce
      Runs on a regular OS file system, such as Linux ext3   | Runs on many servers
      Fixed-size blocks, normally 64 MB, are replicated      | A job consists of special "Map" and "Reduce" functions
      Source: A (very) short intro to Hadoop, Ken Krugler's talk at BigDataCamp, Washington DC, November 2011
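To make the fixed-size, replicated blocks concrete, here is a small back-of-the-envelope sketch. The 64 MB block size is the one quoted on the slide; the replication factor of 3 is an assumption (a common HDFS default, but not stated in the source), and the function name is hypothetical.

```python
import math

BLOCK_SIZE_MB = 64       # block size quoted on the slide
REPLICATION_FACTOR = 3   # assumption: not stated on the slide

def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB,
                   replication=REPLICATION_FACTOR):
    """Return (blocks, block replicas, raw storage in MB) for one file."""
    # The file is cut into fixed-size blocks; the last one may be partly full.
    blocks = math.ceil(file_size_mb / block_size_mb)
    # Every block is stored `replication` times across the cluster.
    replicas = blocks * replication
    # A partly full last block only occupies the bytes it actually holds.
    raw_storage_mb = file_size_mb * replication
    return blocks, replicas, raw_storage_mb

blocks, replicas, raw_mb = hdfs_footprint(1000)
print(blocks, replicas, raw_mb)
```

Under these assumptions a 1000 MB file becomes 16 blocks, stored as 48 block replicas that take up 3000 MB of raw disk.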
  • 16. Source: A (very) short intro to Hadoop, Ken Krugler's talk at BigDataCamp, Washington DC, November 2011
  • 17.
      - Google published a research paper describing MapReduce, a technology that can spread processing across hundreds of thousands of CPUs for faster execution.
      - It has two main functions: Map and Reduce.
      - Map processes input key/value pairs and produces a set of intermediate key/value pairs.
      - Reduce merges all intermediate values that share the same key and produces the merged output.
      Source: http://research.google.com/archive/mapreduce.html
  • 18.
      Data collection:
          Cust_id: A123, Amount: 500
          Cust_id: A123, Amount: 250
          Cust_id: B212, Amount: 200
          Cust_id: A223, Amount: 250
      Query (customers A123 and B212):
          Cust_id: A123, Amount: 500
          Cust_id: A123, Amount: 250
          Cust_id: B212, Amount: 200
      Map (Cust_id with Amount):
          A123 {500, 250}
          B212 {200}
      Reduce (sum of Amount for the given Cust_id):
          Cust_id: A123, Amount: 750
          Cust_id: B212, Amount: 200
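The customer example can be traced in plain Python. This is a single-process sketch of the map, group and reduce steps, not Hadoop code; the function names are hypothetical, and the query is taken to select customers A123 and B212.

```python
from collections import defaultdict

records = [
    {"cust_id": "A123", "amount": 500},
    {"cust_id": "A123", "amount": 250},
    {"cust_id": "B212", "amount": 200},
    {"cust_id": "A223", "amount": 250},
]

def map_phase(records, wanted_ids):
    """Emit one (cust_id, amount) pair for each record the query selects."""
    for record in records:
        if record["cust_id"] in wanted_ids:
            yield record["cust_id"], record["amount"]

def group_by_key(pairs):
    """Collect all intermediate values that share a key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)

def reduce_phase(grouped):
    """Merge each key's values into a single sum."""
    return {key: sum(values) for key, values in grouped.items()}

grouped = group_by_key(map_phase(records, {"A123", "B212"}))
totals = reduce_phase(grouped)
print(grouped)
print(totals)
```

In a real cluster the map calls run on many machines and the framework does the grouping between the two phases; here everything runs in one process to keep the data flow visible.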
  • 19.
      - Hive
      - Apache Mahout
      - Processing Big Data with MATLAB
      - Revolution R
  • 20.
      - Hive is a SQL-like technology that sits on top of Hadoop clusters.
      - Hive provides the Hive Query Language (HQL), which lets SQL developers write queries similar to SQL.
      - HQL queries can be run in the Hive shell, or from JDBC/ODBC applications using drivers called Hive Thrift clients.
      - Hive is based on Hadoop and MapReduce.
      - The key difference between HQL and SQL is that Hadoop is designed for long sequential scans, so query latency can be in minutes.
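Running HQL needs a Hive installation, so the sketch below uses Python's built-in sqlite3 module as a stand-in to show the kind of SQL-style aggregate query a SQL developer would write. The table name `orders` and its rows are hypothetical, borrowed from the customer example on slide 18; that Hive would accept a statement of this shape is an assumption and is not checked here.

```python
import sqlite3

# In-memory SQLite database standing in for a Hive table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (cust_id TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("A123", 500), ("A123", 250), ("B212", 200), ("A223", 250)],
)

# Declarative version of the map/reduce job: filter, group, sum.
query = """
SELECT cust_id, SUM(amount) AS total
FROM orders
WHERE cust_id IN ('A123', 'B212')
GROUP BY cust_id
ORDER BY cust_id
"""
rows = conn.execute(query).fetchall()
conn.close()
print(rows)
```

The query states what result is wanted and leaves the execution plan to the engine; on Hive that plan would be a set of MapReduce jobs, which is where the latency of minutes comes from.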
  • 21.
      - Apache Mahout is a scalable machine learning library.
      - Uses of machine learning:
          - Generating recommendations based on previous clicks
          - Classifying DNA sequences
          - Bioinformatics, natural language processing
      - A mahout is a person who keeps and drives an elephant. The name Mahout comes from the project's use of Apache Hadoop, which has a yellow elephant as its logo, for scalability and fault tolerance.
  • 22.
      - Apache Mahout's algorithms for clustering, classification and batch-based collaborative filtering are implemented on top of Apache Hadoop using the map/reduce paradigm.
      - Mahout provides features useful for business intelligence, such as collaborative filtering and clustering.
      - Collaborative filtering (CF) is a technique, popularized by Amazon and others, that uses user information such as ratings, clicks, and purchases to provide recommendations to other site users.
      - Clustering is a technique for grouping the items of a data set by a given condition, e.g. given all the news for a day from every newspaper in India, one might want to automatically group all articles related to the same story.
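Collaborative filtering can be illustrated with a very small item co-occurrence recommender. This is a plain-Python sketch of the idea, not Mahout's API or its actual algorithm; the users, items and function names are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical purchase history: user -> set of items bought.
purchases = {
    "alice": {"book", "lamp", "pen"},
    "bob": {"book", "pen"},
    "carol": {"lamp", "desk"},
    "dave": {"book"},
}

def cooccurrence(purchases):
    """Count how often each pair of items was bought by the same user."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in purchases.values():
        for a, b in combinations(sorted(items), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts

def recommend(user, purchases, counts, top_n=2):
    """Rank items the user lacks by co-occurrence with items they own."""
    owned = purchases[user]
    scores = defaultdict(int)
    for item in owned:
        for other, count in counts[item].items():
            if other not in owned:
                scores[other] += count
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda i: (-scores[i], i))[:top_n]

counts = cooccurrence(purchases)
dave_recs = recommend("dave", purchases, counts)
bob_recs = recommend("bob", purchases, counts)
print(dave_recs, bob_recs)
```

Dave has only bought a book, so he is offered the items other book buyers picked most often. On real data the co-occurrence counting is the expensive step, which is why it is run as a batch job on a cluster.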
  • 23. MATLAB (Matrix Laboratory) is a numerical computing environment and fourth-generation programming language developed by MathWorks.
  • 24.
      - Memory-mapped variables. These let you efficiently access big data sets on disk that are too large to hold in memory or that take too long to load.
      - Intrinsic multicore math. Many of the built-in mathematical functions in MATLAB, such as fft, inv, and eig, are multithreaded.
      - Cloud computing. You can run MATLAB computations in parallel using MATLAB Distributed Computing Server on Amazon's Elastic Compute Cloud (EC2) for on-demand parallel processing on hundreds or thousands of computers.
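The idea behind memory-mapped variables is not specific to MATLAB. The sketch below shows the same idea with Python's standard mmap module, not MATLAB's own interface; the file name, record layout and function names are assumptions chosen for illustration.

```python
import mmap
import os
import struct
import tempfile

RECORD = struct.Struct("<d")  # one little-endian 64-bit float per record

def write_values(path, values):
    """Write the values to disk as fixed-size binary records."""
    with open(path, "wb") as f:
        for v in values:
            f.write(RECORD.pack(v))

def read_value(path, index):
    """Read one record through a memory map without loading the whole file."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            start = index * RECORD.size
            return RECORD.unpack(mm[start:start + RECORD.size])[0]

# Small demo file; a real data set would be far larger than memory.
path = os.path.join(tempfile.mkdtemp(), "values.bin")
write_values(path, [float(i) for i in range(1000)])
value = read_value(path, 750)
print(value)
```

Only the pages that are actually touched get read from disk, so looking up a single record costs about the same whether the file holds a thousand values or a billion.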
  • 25.
      - R is a statistical analysis language developed by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand.
      - It is called "R" after the first initial of its developers' names.
      - R can perform statistical and graphical analysis, including clustering and classification of given data sets.
      - R is an object-oriented programming language and is highly extensible: users can contribute packages for specific areas of interest.
  • 26.
      - Revolution R is developed by a company called Revolution Analytics.
      - R, on which the company based its "open core" solution, assumes that all the data to be analyzed is held in memory.
      - That is not possible for large data sets.
      - Revolution R provides a new file format for large data sets.
      - It also provides a parallel external-memory implementation and parallel algorithms for Big Data.
  • 27.
      - With no standardization, and data sets growing larger day by day, everybody is proposing new solutions.
      - The trend is to combine existing technologies into new architectures.
      - The situation is that we don't know what we could already know.
      - Big Data is like a junction where roads from very different directions intersect.
      - Big Data is certainly the future, with new possibilities and opportunities.
  • 28.
      - Hsinchun Chen, Roger H. L. Chiang & Veda C. Storey (December 2012). MIS Quarterly, Vol. 36, 1165-1188.
      - Phillip Redman, John Girard & Leif-Olof Wallin (13 April 2011). Magic Quadrant for Mobile Device Management Software. Gartner Research, ID no. G00211101, 1-25.
      - Adam Jacobs (August 2009). The Pathologies of Big Data. Communications of the ACM, Vol. 52, No. 8, 36-44.
      - Jeffrey Dean & Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Google Inc. research paper, OSDI 2004, 1-12.
      - Samet Ayhan, Johnathan Pesce, Paul Comitz, Gary Gerberick & Steve Bliesner. Predictive Analytics with Surveillance Big Data. 81-90.
      - Divyakant Agrawal, Sudipto Das & Amr El Abbadi. Big Data and Cloud Computing: Current State and Future. 530-533.
      - Jeffrey Cohen, Brian Dolan, Mark Dunlap, Joseph M. Hellerstein & Caleb Welton. MAD Skills: New Analysis Practices for Big Data. 1481-1492.
      - http://blog.cloudera.com/wp-content/uploads/2010/01/6-IntroToHive.pdf (accessed on 02/10/2013)
      - http://www.mathworks.com/discovery/big-data-matlab.html (accessed on 02/10/2013)