SlideShare una empresa de Scribd logo
1 de 79
COMP9313: Big Data ManagementCOMP9313: Big Data Management
Lecturer: Xin CaoLecturer: Xin Cao
Course web site:Course web site: http://www.cse.unsw.edu.au/~cs9313/
1.2
Chapter 1: Course Infomation and Introduction toChapter 1: Course Infomation and Introduction to
Big DataBig Data
1.3
Course InfoCourse Info
Lectures : 6 : 00 – 9:00 pm (Thursday)
Location: Ainsworth Building 202 (K-J17-202)
Lab: Weeks 2-13
Consultation (Weeks 1-12): Questions regarding lectures, course
materials, assignements, exam, etc.
Time: 3:00 – 4:00 pm (Thursday)
Place: 201D K-17,
Tutors
Longbin Lai, llai@cse.unsw.edu.au
Jianye Yang, jianyey@cse.unsw.edu.au
Fei Bi, z3492319@cse.unsw.edu.au
Discucssion and QA: https://groups.google.com/forum/#!
forum/comp9313-qa
1.4
Lecturer in ChargeLecturer in Charge
Lecturer: Xin Cao
Office: 201D K17 (outside the lift turn left)
Email: xin.cao@unsw.edu.au
Ext: 55932
Research interests
Spatial Database
Data Mining
Data Management
Big Data Technologies
My publications list at google scholar:
https://scholar.google.com.au/citations?user=kJIkUagAAAAJ&hl=en
1.5
Course AimsCourse Aims
This course aims to introduce you to the concepts behind Big Data,
the core technologies used in managing large-scale data sets, and a
range of technologies for developing solutions to large-scale data
analytics problems.
This course is intended for students who want to understand modern
large-scale data analytics systems. It covers a wide range of topics
and technologies, and will prepare students to be able to build such
systems as well as use them efficiently and effectively address
challenges in big data management.
Not possible to cover every aspect of big data management.
1.6
LecturesLectures
Lectures focusing on the frontier technologies on big data
management and the typical applications
Try to run in more interactive mode
A few lectures may run in more practical manner (e.g., like a lab/demo)
to cover the applied aspects
Lecture length varies slightly depending on the progress (of that
lecture)
Note: attendance to every lecture is assumed
BIG DATA BUG
DATA
1.7
ResourcesResources
Text Books
Hadoop: The Definitive Guide. Tom White. 4th Edition - O'Reilly
Media
Mining of Massive Datasets. Jure Leskovec, Anand Rajaraman,
Jeff Ullman. 2nd edition - Cambridge University Press
Reference Books and other readings
Advanced Analytics with Spark. Josh Wills, Sandy Ryza, Sean
Owen, and Uri Laserson. O'Reilly Media
Apache MapReduce Tutorial
Apache Spark Quick Start
Many other online tutorials … …
Big Data is a relatively new topic (so no fixed syllabus)
1.8
PrerequisitePrerequisite
Official prerequisite of this course is COMP9024 (Data Structures and
Algorithms) and COMP9311 (Database Systems).
Before commencing this course, you should:
have experiences and good knowledge of algorithm design
(equivalent to COMP9024 )
have a solid background in database systems (equivalent to
COMP9311)
have solid programming skills in Java
be familiar with working on a Unix-style operating systems
have basic knowledge of linear algebra (e.g., vector spaces, matrix
multiplication), probability theory and statistics , and graph theory
No previous experience necessary in
MapReduce
Parallel and distributed programming
1.9
Please do not enrol if youPlease do not enrol if you
Don’t have COMP9024/9311 knowledge
Cannot produce correct Java program on your own
Never worked on Unix-style operating systems
Have poor time management
Are too busy to attend lectures/labs
Otherwise, you are likely to perform badly in this subject
1.10
Learning outcomesLearning outcomes
After completing this course, you are expected to:
elaborate the important characteristics of Big Data
develop an appropriate storage structure for a Big Data repository
utilize the map/reduce paradigm and the Spark platform to
manipulate Big Data
use a high-level query language to manipulate Big Data
develop efficient solutions for analytical problems involving Big
Data
1.11
AssessmentAssessment
1.12
AssignmentsAssignments
1 warm-up programming assignment on Hadoop
1 programming assignment on HBase/Hive/Pig
1 warm-up programming assignment on Spark
Another harder assignment on Hadoop
Another harder assignment on Spark
Both results and source codes will be checked.
If not able to run your codes due to some bugs, you will not lose all
marks.
1.13
Final examFinal exam
Final written exam (100 pts)
If you are ill on the day of the exam, do not attend
the exam – I will not accept any medical special
consideration claims from people who already
attempted the exam.
1.14
You May Fail Because …You May Fail Because …
*Plagiarism*
Code failed to compile due to a mistake of 1 char or 1 word
Late submission
1 sec late = 1 day late
submit wrong files
Program did not follow the spec
I am unlikely to accept the following excuses:
“Too busy”
“It took longer than I thought it would take”
“It was harder than I initially thought”
“My dog ate my homework” and modern variants thereof
1.15
Tentative course scheduleTentative course schedule
Week Topic Assignment
1 Course info and introduction to big data
2 Hadoop MapReduce 1
3 Hadoop MapReduce 2 Ass1
4 HDFS and Hadoop I/O
5 NoSQL and Hbase Ass2
6 Hive and Pig
7 Spark Ass3
8 Link analysis
9 Graph data processing Ass4
10 Data stream mining Ass5
11 Large-scale machine learning
12 Revision and exam preparation
1.16
Your Feedbacks Are ImportantYour Feedbacks Are Important
Big data is a new topic, and thus the course is tentative
The technologies keep evolving, and the course materials need to be
updated correspondingly
Please advise where I can improve after each lecturer, at the
discussion and QA website
CATEI system
1.17
Why Attend the Lectures?Why Attend the Lectures?
1.18
What is Big Data?What is Big Data?
Big data is like teenage sex:
everyone talks about it
nobody really knows how to do it
everyone thinks everyone else is doing it
so everyone claims they are doing it...
--Dan Ariely, Professor at Duke University
1.19
What is Big Data?What is Big Data?
No standard definition! here is from Wikipedia:
Big data is a term for data sets that are so large or complex that
traditional data processing applications are inadequate.
Challenges include analysis, capture, data curation, search,
sharing, storage, transfer, visualization, querying, updating and
information privacy.
Analysis of data sets can find new correlations to "spot business
trends, prevent diseases, combat crime and so on."
1.20
Who is generating Big Data?Who is generating Big Data?
Homeland Security
Real Time Search
Social
eCommerce
User Tracking &
Engagement
Financial Services
1.21
Big Data Characteristics: 3VBig Data Characteristics: 3V
1.22
Volume (Scale)Volume (Scale)
Data Volume
Growth 40% per year
From 8 zettabytes (2016) to 44zb (2020)
Data volume is increasing exponentially
Exponential increase in
collected/generated data
1.23
How much data?How much data?
Hadoop: 10K nodes, 150K
cores, 150 PB (4/2014)
Processes 20 PB a day (2008)
Crawls 20B web pages a day (2012)
Search index is 100+ PB (5/2014)
Bigtable serves 2+ EB, 600M QPS (5/2014)
300 PB data in Hive +
600 TB/day (4/2014)
400B pages, 10+
PB (2/2014)
LHC: ~15 PB a year
LSST: 6-10 PB a year
(~2020)640K ought to be
enough for
anybody.
150 PB on 50k+ servers
running 15k apps (6/2011)
S3: 2T objects, 1.1M
request/second (4/2013)
SKA: 0.3 – 1.5 EB
per year (~2020)
Hadoop: 365 PB, 330K
nodes (6/2014)
1.24
Variety (Complexity)Variety (Complexity)
Different Types:
Relational Data (Tables/Transaction/Legacy Data)
Text Data (Web)
Semi-structured Data (XML)
Graph Data
 Social Network, Semantic Web (RDF), …
Streaming Data
 You can only scan the data once
A single application can be generating/collecting many types of
data
Different Sources :
Movie reviews from IMDB and Rotten Tomatoes
Product reviews from different provider websites
To extract knowledge all these types of
data need to linked together
To extract knowledge all these types of
data need to linked together
1.25
A Single View to the CustomerA Single View to the Customer
Customer
Social
Media
Social
Media
GamingGaming
EntertainEntertain
Banking
Finance
Banking
Finance
Our
Known
History
Our
Known
History
PurchasePurchase
1.26
A Global View of Linked Big DataA Global View of Linked Big Data
patient
doctors
gene
protein
drug
“Ebola”
mutation
diagnosis
prescription
target
tissue
Heterogeneous information networkDiversified social network
1.27
Velocity (Speed)Velocity (Speed)
Data is begin generated fast and need to be processed fast
Online Data Analytics
Late decisions  missing opportunities
Examples
E-Promotions: Based on your current location, your
purchase history, what you like  send promotions right now
for store next to you
Healthcare monitoring: sensors monitoring your activities
and body  any abnormal measurements require immediate
reaction
Disaster management and response
1.28
Real-Time Analytics/DecisionReal-Time Analytics/Decision
RequirementRequirement
Customer
Influence
Behavior
Product
Recommendations
that are Relevant
& Compelling
Friend Invitations
to join a
Game or Activity
that expands
business
Preventing Fraud
as it is Occurring
& preventing more
proactively
Learning why Customers
Switch to competitors
and their offers; in
time to Counter
Improving the
Marketing
Effectiveness of a
Promotion while it
is still in Play
1.29
ExtendedExtended Big Data Characteristics:Big Data Characteristics:
6V6V
Volume: In a big data environment, the amounts of data collected and
processed are much larger than those stored in typical relational
databases.
Variety: Big data consists of a rich variety of data types.
Velocity: Big data arrives to the organization at high speeds and from
multiple sources simultaneously.
Veracity: Data quality issues are particularly challenging in a big data
context.
Visibility/Visualization: After big data being processed, we need a way
of presenting the data in a manner that’s readable and accessible.
Value: Ultimately, big data is meaningless if it does not provide value
toward some meaningful goal.
1.30
Veracity (Quality & Trust)Veracity (Quality & Trust)
Data = quantity + quality
When we talk about big data, we typically mean its quantity:
What capacity of a system provides to cope with the sheer size of
the data?
Is a query feasible on big data within our available resources?
How can we make our queries tractable on big data?
. . .
Can we trust the answers to our queries?
Dirty data routinely lead to misleading financial reports, strategic
business planning decision ⇒ loss of revenue, credibility and
customers, disastrous consequences
The study of data quality is as important as data quantity
1.31
Data in real-life is often dirtyData in real-life is often dirty
500,000 dead people retain
active Medicare cards
81 million National Insurance
numbers but only 60 million
eligible citizens
98000 deaths each year,
caused by errors in
medical data
1.32
Visibility/VisualizationVisibility/Visualization
Visible to the process of big data management
Big Data – visibility = Black Hole?
Big data visualization tools:
A visualization of Divvy bike rides across Chicago
1.33
ValueValue
Big data is meaningless if it does not provide value toward some
meaningful goal
1.34
Big Data: 6V in SummaryBig Data: 6V in Summary
Transforming Energy and Utilities through Big Data & Analytics. By Anders Quitzau@IBM
1.35
Other V’sOther V’s
Variability
Variability refers to data whose meaning is constantly changing. This is
particularly the case when gathering data relies on language processing.
Viscosity
This term is sometimes used to describe the latency or lag time in the data
relative to the event being described. We found that this is just as easily
understood as an element of Velocity.
Virality
Defined by some users as the rate at which the data spreads; how often it
is picked up and repeated by other users or events.
Volatility
Big data volatility refers to how long is data valid and how long
should it be stored. You need to determine at what point is data no
longer relevant to the current analysis.
More V’s in the future …
1.36
Big Data Tag CloudBig Data Tag Cloud
1.37
Cloud ComputingCloud Computing
The buzz word before “Big Data”
Larry Ellison’s response in 2009
Cloud Computing is a general term used to describe a new class of
network based computing that takes place over the Internet
A collection/group of integrated and networked hardware, software
and Internet infrastructure (called a platform).
Using the Internet for communication and transport provides
hardware, software and networking services to clients
These platforms hide the complexity and details of the underlying
infrastructure from users and applications by providing very simple
graphical interface or API
A technical point of view
Internet-based computing (i.e., computers attached to network)
A business-model point of view
Pay-as-you-go (i.e., rental)
1.38
Cloud Computing ArchitectureCloud Computing Architecture
1.39
Cloud Computing ServicesCloud Computing Services
1.40
Cloud Computing ServicesCloud Computing Services
Infrastructure as a service (IaaS)
Offering hardware related services using the principles of cloud
computing. These could include storage services (database or
disk storage) or virtual servers.
Amazon EC2, Amazon S3
Platform as a Service (PaaS)
Offering a development platform on the cloud.
Google’s Application Engine, Microsofts Azure
Software as a service (SaaS)
Including a complete software offering on the cloud. Users can
access a software application hosted by the cloud vendor on pay-
per-use basis. This is a well-established sector.
Googles gmail and Microsofts hotmail, Google docs
1.41
Cloud ServicesCloud Services
Software as a
Service (SaaS)
Platform as a
Service (PaaS)
Infrastructure as a
Service (IaaS)
Google
App
Engine
SalesForce CRM
LotusLive
1.42
Why Study Big Data Technologies?Why Study Big Data Technologies?
The hottest topic in both research and industry
Highly demanded in real world
A promising future career
Research and development of big data systems:
distributed systems (eg, Hadoop), visualization tools, data
warehouse, OLAP, data integration, data quality control, …
Big data applications:
social marketing, healthcare, …
Data analysis: to get values out of big data
discovering and applying patterns, predicative analysis, business
intelligence, privacy and security, …
Graduate from UNSW
1.43
Big Data Open Source ToolsBig Data Open Source Tools
1.44
What will the course coverWhat will the course cover
Topic 1. Big data management tools
Apache Hadoop
 MapReduce
 HDFS
 HBase
 Hive and Pig
 Mahout
 Spark
Topic 2. Big data typical applications
Link analysis
Graph data processing
Data stream mining
Some machine learning topics
1.45
Philosophy to Scale for Big DataPhilosophy to Scale for Big Data
ProcessingProcessing
Divide Work
Combine
Results
1.46
Distributed processing is non-trivialDistributed processing is non-trivial
How to assign tasks to different workers in an efficient way?
What happens if tasks fail?
How do workers exchange results?
How to synchronize distributed tasks allocated to different workers?
1.47
Big data storage is challengingBig data storage is challenging
Data Volumes are massive
Reliability of Storing PBs of data is challenging
All kinds of failures: Disk/Hardware/Network Failures
Probability of failures simply increase with the number of machines …
1.48
What is HadoopWhat is Hadoop
Open-source data storage and processing platform
Before the advent of Hadoop, storage and processing of big data was
a big challenge
Massively scalable, automatically parallelizable
Based on work from Google
 Google: GFS + MapReduce + BigTable (Not open)
 Hadoop: HDFS + Hadoop MapReduce +
HBase ( opensource)
Named by Doug Cutting in 2006 (worked at Yahoo! at that time), after
his son's toy elephant.
1.49
Hadoop offersHadoop offers
Redundant, Fault-tolerant data storage
Parallel computation framework
Job coordination
Programmer
s
Q: Where file is
located?
Q: How to
handle failures
& data lost?
Q: How to divide
computation?
Q: How to
program for
scaling?
No longer need
to worry about
1.50
Why Use Hadoop?Why Use Hadoop?
Cheaper
Scales to Petabytes or more easily
Faster
Parallel data processing
Better
Suited for particular types of big data problems
1.51
Companies Using HadoopCompanies Using Hadoop
1.52
Forecast growth of Hadoop JobForecast growth of Hadoop Job
MarketMarket
1.53
Data storage (HDFS)
Runs on commodity hardware (usually Linux)
Horizontally scalable
Processing (MapReduce)
Parallelized (scalable) processing
Fault Tolerant
Other Tools / Frameworks
Data Access
 HBase, Hive, Pig, Mahout
Tools
 Hue, Sqoop
Monitoring
 Greenplum, Cloudera
Hadoop is a set of Apache Frameworks andHadoop is a set of Apache Frameworks and
more…more…
Hadoop Core - HDFS
MapReduce API
Data Access
Tools & Libraries
Monitoring & Alerting
1.54
What are the core parts of a HadoopWhat are the core parts of a Hadoop
distribution?distribution?
HDFS Storage
Redundant (3 copies)
For large files – large blocks
64 or 128 MB / block
Can scale to 1000s of
nodes
MapReduce API
Batch (Job) processing
Distributed and Localized to
clusters (Map)
Auto-Parallelizable for huge
amounts of data
Fault-tolerant (auto retries)
Adds high availability and
more
Pig
Hive
HBase
Others
Other Libraries
1.55
Hadoop 2.0Hadoop 2.0
Hadoop YARN (Yet Another Resource Negotiator): a resource-
management platform responsible for managing computing resources in
clusters and using them for scheduling of users' applications
Single Use System
Batch apps
Multi-Purpose Platform
Batch, Interactive, Online, Streaming
1.56
Hadoop EcosystemHadoop Ecosystem
A combination of technologies which have proficient advantage in solving business problems.
http://www.edupristine.com/blog/hadoop-ecosystem-and-components
1.57
Common Hadoop DistributionsCommon Hadoop Distributions
Open Source
Apache
Commercial
Cloudera
Hortonworks
MapR
AWS MapReduce
Microsoft Azure HDInsight (Beta)
1.58
Setting up Hadoop DevelopmentSetting up Hadoop Development
Hadoop
Binaries
Local install
• Linux
• Windows
Cloudera’s Demo
VM
• Need Virtualization
software, i.e. VMware,
etc…
Cloud
• AWS
• Microsoft (Beta)
• Others
Data Storage
Local
• File System
• HDFS Pseudo-
distributed (single-
node)
Cloud
• AWS
• Azure
• Others
MapReduce
Local
Cloud
Other
Libraries &
Tools
Vendor Tools
Libraries
1.59
Comparing: RDBMS vs. HadoopComparing: RDBMS vs. Hadoop
Traditional RDBMS Hadoop / MapReduce
Data Size Gigabytes (Terabytes) Petabytes (Hexabytes)
Access Interactive and Batch Batch – NOT Interactive
Updates Read / Write many times Write once, Read many times
Structure Static Schema Dynamic Schema
Integrity High (ACID) Low
Scaling Nonlinear Linear
Query Response
Time
Can be near immediate Has latency (due to batch processing)
1.60
The Changing Data ManagementThe Changing Data Management
LandscapeLandscape
1.61
MapReduceMapReduce
Typical big data problem
Iterate over a large number of records
Extract something of interest from each
Shuffle and sort intermediate results
Aggregate intermediate results
Generate final output
Programmers specify two functions:
map (k1, v1) [<k→ 2, v2>]
reduce (k2, [v2]) [<k→ 3, v3>]
All values with the same key are sent to the same reducer
The execution framework handles everything else…
Map
Reduce
Key idea: provide a functional abstraction
for these two operations
1.62
Philosophy to Scale for Big DataPhilosophy to Scale for Big Data
ProcessingProcessing
Divide Work
Combine
Results
1.63
Understanding MapReduceUnderstanding MapReduce
Map>>
(K1, V1) 
 Info in
 Input Split
list (K2, V2)
 Key / Value out
(intermediate values)
 One list per local
node
 Can implement local
Reducer (or
Combiner)
Reduce
(K2, list(V2)) 
 Shuffle / Sort phase
precedes Reduce phase
 Combines Map output
into a list
list (K3, V3)
 Usually aggregates
intermediate values
(input) <k1, v1>  map  <k2, v2>  combine  <k2, list(V2)>  reduce  <k3, v3> (output)
Shuffle/Sort>>
1.64
WordCount - MapperWordCount - Mapper
Reads in input pair <k1,v1>
Outputs a pair <k2, v2>
Let’s count number of each word in user queries (or Tweets/Blogs)
The input to the mapper will be <queryID, QueryText>:
<Q1,“The teacher went to the store. The store was closed; the
store opens in the morning. The store opens at 9am.” >
The output would be:
<The, 1> <teacher, 1> <went, 1> <to, 1> <the, 1> <store,1>
<the, 1> <store, 1> <was, 1> <closed, 1> <the, 1> <store,1>
<opens, 1> <in, 1> <the, 1> <morning, 1> <the 1> <store, 1>
<opens, 1> <at, 1> <9am, 1>
1.65
WordCount - ReducerWordCount - Reducer
Accepts the Mapper output (k2, v2), and aggregates values on the key
to generate (k3, v3)
For our example, the reducer input would be:
<The, 1> <teacher, 1> <went, 1> <to, 1> <the, 1> <store, 1>
<the, 1> <store, 1> <was, 1> <closed, 1> <the, 1> <store, 1>
<opens,1> <in, 1> <the, 1> <morning, 1> <the 1> <store, 1>
<opens, 1> <at, 1> <9am, 1>
The output would be:
<The, 6> <teacher, 1> <went, 1> <to, 1> <store, 4> <was, 1>
<closed, 1> <opens, 2> <in, 1> <morning, 1> <at, 1> <9am, 1>
1.66
MapReduce Example - WordCountMapReduce Example - WordCount
Hadoop MapReduce is an implementation of MapReduce
MapReduce is a computing paradigm (Google)
Hadoop MapReduce is an open-source software
1.67
AWS (Amazon Web Services)AWS (Amazon Web Services)
Amazon
From Wikipedia 2006
From Wikipedia 2016
1.68
AWS (Amazon Web Services)AWS (Amazon Web Services)
AWS is a subsidiary of Amazon.com, which offers a suite of cloud
computing services that make up an on-demand computing platform.
Amazon Web Services (AWS) provides a number of different services,
including:
Amazon Elastic Compute Cloud (EC2)
Virtual machines for running custom software
Amazon Simple Storage Service (S3)
Simple key-value store, accessible as a web service
Amazon Elastic MapReduce (EMR)
Scalable MapReduce computation
Amazon DynamoDB
Distributed NoSQL database, one of several in AWS
Amazon SimpleDB
Simple NoSQL database
...
1.69
Cloud Computing Services in AWSCloud Computing Services in AWS
IaaS
EC2, S3, …
Highlight: EC2 and S3 are two of the earliest products in AWS
PaaS
Aurora, Redshift, …
Highlight: Aurora and Redshift are two of the fastest growing
products in AWS
SaaS
WorkDocs, WorkMail
Highlight: May not be the main focus of AWS
1.70
Setting up an AWS accountSetting up an AWS account
Sign up for an account on aws.amazon.com
You need to choose an username and a password
These are for the management interface only
Your programs will use other credentials (RSA keypairs, access
keys, ...) to interact with AWS
aws.amazon.com
1.71
Signing up for AWS EducateSigning up for AWS Educate
Complete the web form on
https://aws.amazon.com/education/awseducate/
Assumes you already have an AWS account
Use your Penn email address!
Amazon says it should only take 2-5 minutes (but don’t rely on
this!!)
This should give you $100/year in AWS credits. Be careful!!!
1.72
Big Data ApplicationsBig Data Applications
Link analysis
Graph data processing
Data stream mining
Large-scale machine learning
End of IntroductionEnd of Introduction
1.74
NoSQLNoSQL
Stands for Not Only SQL
Class of non-relational data storage systems
Usually do not require a fixed table schema nor do they use the
concept of joins
All NoSQL offerings relax one or more of the ACID properties (will talk
about the CAP theorem)
1.75
Why NoSQL?Why NoSQL?
For data storage, an RDBMS cannot be the be-all/end-all
Just as there are different programming languages, need to have other
data storage tools in the toolbox
A NoSQL solution is more acceptable to a client now than even a year
ago
Think about proposing a Ruby/Rails or Groovy/Grails solution now
versus a couple of years ago
1.76
What kinds of NoSQLWhat kinds of NoSQL
NoSQL solutions fall into two major areas:
Key/Value or ‘the big hash table’.
 Amazon S3 (Dynamo)
 Voldemort
 Scalaris
 Memcached (in-memory key/value store)
 Redis
Schema-less which comes in multiple flavors, column-
based, document-based or graph-based.
 Cassandra (column-based)
 CouchDB (document-based)
 MongoDB(document-based)
 Neo4J (graph-based)
 HBase (column-based)
1.77
Key/ValueKey/Value
Pros:
very fast
very scalable
simple model
able to distribute horizontally
Cons:
- many data structures (objects) can't be easily modeled
as key value pairs
1.78
Common AdvantagesCommon Advantages
Cheap, easy to implement (open source)
Data are replicated to multiple nodes (therefore identical and fault-
tolerant) and can be partitioned
Down nodes easily replaced
No single point of failure
Easy to distribute
Don't require a schema
Can scale up and down
Relax the data consistency requirement (CAP)
1.79
What am I giving up?What am I giving up?
joins
group by
order by
ACID transactions
SQL as a sometimes frustrating but still powerful
query language
easy integration with other applications that support
SQL

Más contenido relacionado

La actualidad más candente

Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...Simplilearn
 
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...Simplilearn
 
Hadoop And Their Ecosystem ppt
 Hadoop And Their Ecosystem ppt Hadoop And Their Ecosystem ppt
Hadoop And Their Ecosystem pptsunera pathan
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big DataVipin Batra
 
Chicago Data Summit: Apache HBase: An Introduction
Chicago Data Summit: Apache HBase: An IntroductionChicago Data Summit: Apache HBase: An Introduction
Chicago Data Summit: Apache HBase: An IntroductionCloudera, Inc.
 
Introduction to Apache Hadoop Eco-System
Introduction to Apache Hadoop Eco-SystemIntroduction to Apache Hadoop Eco-System
Introduction to Apache Hadoop Eco-SystemMd. Hasan Basri (Angel)
 
Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)Prashant Gupta
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File SystemRutvik Bapat
 
Hadoop Presentation - PPT
Hadoop Presentation - PPTHadoop Presentation - PPT
Hadoop Presentation - PPTAnand Pandey
 
Hadoop Architecture and HDFS
Hadoop Architecture and HDFSHadoop Architecture and HDFS
Hadoop Architecture and HDFSEdureka!
 
Introduction to NOSQL databases
Introduction to NOSQL databasesIntroduction to NOSQL databases
Introduction to NOSQL databasesAshwani Kumar
 

La actualidad más candente (20)

Hadoop Tutorial For Beginners
Hadoop Tutorial For BeginnersHadoop Tutorial For Beginners
Hadoop Tutorial For Beginners
 
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...
 
PPT on Hadoop
PPT on HadoopPPT on Hadoop
PPT on Hadoop
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...
 
Apache HBase™
Apache HBase™Apache HBase™
Apache HBase™
 
Hadoop And Their Ecosystem ppt
 Hadoop And Their Ecosystem ppt Hadoop And Their Ecosystem ppt
Hadoop And Their Ecosystem ppt
 
Sqoop
SqoopSqoop
Sqoop
 
Bigtable and Dynamo
Bigtable and DynamoBigtable and Dynamo
Bigtable and Dynamo
 
Big data and Hadoop
Big data and HadoopBig data and Hadoop
Big data and Hadoop
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Chicago Data Summit: Apache HBase: An Introduction
Chicago Data Summit: Apache HBase: An IntroductionChicago Data Summit: Apache HBase: An Introduction
Chicago Data Summit: Apache HBase: An Introduction
 
Introduction to Apache Hadoop Eco-System
Introduction to Apache Hadoop Eco-SystemIntroduction to Apache Hadoop Eco-System
Introduction to Apache Hadoop Eco-System
 
Hadoop technology
Hadoop technologyHadoop technology
Hadoop technology
 
Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
 
Hadoop Presentation - PPT
Hadoop Presentation - PPTHadoop Presentation - PPT
Hadoop Presentation - PPT
 
HDFS Architecture
HDFS ArchitectureHDFS Architecture
HDFS Architecture
 
Hadoop Architecture and HDFS
Hadoop Architecture and HDFSHadoop Architecture and HDFS
Hadoop Architecture and HDFS
 
Introduction to NOSQL databases
Introduction to NOSQL databasesIntroduction to NOSQL databases
Introduction to NOSQL databases
 

Similar a COMP9313: Understanding Key Concepts of Big Data Management

Data scientist enablement dse 400 week 8 roadmap
Data scientist enablement   dse 400   week 8 roadmap Data scientist enablement   dse 400   week 8 roadmap
Data scientist enablement dse 400 week 8 roadmap Dr. Mohan K. Bavirisetty
 
Cse443 Project Report - LPU (Modern Big Data Analysis with SQL Specialization)
Cse443 Project Report - LPU (Modern Big Data Analysis with SQL Specialization)Cse443 Project Report - LPU (Modern Big Data Analysis with SQL Specialization)
Cse443 Project Report - LPU (Modern Big Data Analysis with SQL Specialization)Qazi Maaz Arshad
 
Past, Present, and Future of Analyzing Software Data
Past, Present, and Future of Analyzing Software DataPast, Present, and Future of Analyzing Software Data
Past, Present, and Future of Analyzing Software DataJeongwhan Choi
 
Getstarteddssd12717sd
Getstarteddssd12717sdGetstarteddssd12717sd
Getstarteddssd12717sdThinkful
 
Thinkful DC - Intro to Data Science
Thinkful DC - Intro to Data Science Thinkful DC - Intro to Data Science
Thinkful DC - Intro to Data Science TJ Stalcup
 
Data sci sd-11.6.17
Data sci sd-11.6.17Data sci sd-11.6.17
Data sci sd-11.6.17Thinkful
 
Data Science: Harnessing Open Data for High Impact Solutions
Data Science: Harnessing Open Data for High Impact SolutionsData Science: Harnessing Open Data for High Impact Solutions
Data Science: Harnessing Open Data for High Impact SolutionsMohd Izhar Firdaus Ismail
 
Spark Social Media
Spark Social Media Spark Social Media
Spark Social Media suresh sood
 
Current and future challenges in data science
Current and future challenges in data scienceCurrent and future challenges in data science
Current and future challenges in data scienceNathaniel Shimoni
 
Paradigm4 Research Report: Leaving Data on the table
Paradigm4 Research Report: Leaving Data on the tableParadigm4 Research Report: Leaving Data on the table
Paradigm4 Research Report: Leaving Data on the tableParadigm4
 
Data management plans
Data management plansData management plans
Data management plansBrad Houston
 
Göteborg university(condensed)
Göteborg university(condensed)Göteborg university(condensed)
Göteborg university(condensed)Zenodia Charpy
 
L2 DS Tools and Application.pptx
L2 DS Tools and Application.pptxL2 DS Tools and Application.pptx
L2 DS Tools and Application.pptxShambhavi Vats
 
Data science presentation
Data science presentationData science presentation
Data science presentationMSDEVMTL
 
Data+Science+in+Python+-+Data+Prep+&+EDA.pdf
Data+Science+in+Python+-+Data+Prep+&+EDA.pdfData+Science+in+Python+-+Data+Prep+&+EDA.pdf
Data+Science+in+Python+-+Data+Prep+&+EDA.pdfneelakandan2001kpm
 
Luciano uvi hackfest.28.10.2020
Luciano uvi hackfest.28.10.2020Luciano uvi hackfest.28.10.2020
Luciano uvi hackfest.28.10.2020Joanne Luciano
 
How to Prepare for a Career in Data Science
How to Prepare for a Career in Data ScienceHow to Prepare for a Career in Data Science
How to Prepare for a Career in Data ScienceJuuso Parkkinen
 

Similar a COMP9313: Understanding Key Concepts of Big Data Management (20)

Data scientist enablement dse 400 week 8 roadmap
Data scientist enablement   dse 400   week 8 roadmap Data scientist enablement   dse 400   week 8 roadmap
Data scientist enablement dse 400 week 8 roadmap
 
Cse443 Project Report - LPU (Modern Big Data Analysis with SQL Specialization)
Cse443 Project Report - LPU (Modern Big Data Analysis with SQL Specialization)Cse443 Project Report - LPU (Modern Big Data Analysis with SQL Specialization)
Cse443 Project Report - LPU (Modern Big Data Analysis with SQL Specialization)
 
Past, Present, and Future of Analyzing Software Data
Past, Present, and Future of Analyzing Software DataPast, Present, and Future of Analyzing Software Data
Past, Present, and Future of Analyzing Software Data
 
Bayesian reasoning
Bayesian reasoningBayesian reasoning
Bayesian reasoning
 
Getstarteddssd12717sd
Getstarteddssd12717sdGetstarteddssd12717sd
Getstarteddssd12717sd
 
Thinkful DC - Intro to Data Science
Thinkful DC - Intro to Data Science Thinkful DC - Intro to Data Science
Thinkful DC - Intro to Data Science
 
Data sci sd-11.6.17
Data sci sd-11.6.17Data sci sd-11.6.17
Data sci sd-11.6.17
 
How to crack down big data?
How to crack down big data? How to crack down big data?
How to crack down big data?
 
Data Science: Harnessing Open Data for High Impact Solutions
Data Science: Harnessing Open Data for High Impact SolutionsData Science: Harnessing Open Data for High Impact Solutions
Data Science: Harnessing Open Data for High Impact Solutions
 
Spark Social Media
Spark Social Media Spark Social Media
Spark Social Media
 
Current and future challenges in data science
Current and future challenges in data scienceCurrent and future challenges in data science
Current and future challenges in data science
 
Paradigm4 Research Report: Leaving Data on the table
Paradigm4 Research Report: Leaving Data on the tableParadigm4 Research Report: Leaving Data on the table
Paradigm4 Research Report: Leaving Data on the table
 
TSE_Pres12.pptx
TSE_Pres12.pptxTSE_Pres12.pptx
TSE_Pres12.pptx
 
Data management plans
Data management plansData management plans
Data management plans
 
Göteborg university(condensed)
Göteborg university(condensed)Göteborg university(condensed)
Göteborg university(condensed)
 
L2 DS Tools and Application.pptx
L2 DS Tools and Application.pptxL2 DS Tools and Application.pptx
L2 DS Tools and Application.pptx
 
Data science presentation
Data science presentationData science presentation
Data science presentation
 
Data+Science+in+Python+-+Data+Prep+&+EDA.pdf
Data+Science+in+Python+-+Data+Prep+&+EDA.pdfData+Science+in+Python+-+Data+Prep+&+EDA.pdf
Data+Science+in+Python+-+Data+Prep+&+EDA.pdf
 
Luciano uvi hackfest.28.10.2020
Luciano uvi hackfest.28.10.2020Luciano uvi hackfest.28.10.2020
Luciano uvi hackfest.28.10.2020
 
How to Prepare for a Career in Data Science
How to Prepare for a Career in Data ScienceHow to Prepare for a Career in Data Science
How to Prepare for a Career in Data Science
 

Último

welding defects observed during the welding
welding defects observed during the weldingwelding defects observed during the welding
welding defects observed during the weldingMuhammadUzairLiaqat
 
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024Mark Billinghurst
 
Katarzyna Lipka-Sidor - BIM School Course
Katarzyna Lipka-Sidor - BIM School CourseKatarzyna Lipka-Sidor - BIM School Course
Katarzyna Lipka-Sidor - BIM School Coursebim.edu.pl
 
Crushers to screens in aggregate production
Crushers to screens in aggregate productionCrushers to screens in aggregate production
Crushers to screens in aggregate productionChinnuNinan
 
Engineering Drawing section of solid
Engineering Drawing     section of solidEngineering Drawing     section of solid
Engineering Drawing section of solidnamansinghjarodiya
 
Transport layer issues and challenges - Guide
Transport layer issues and challenges - GuideTransport layer issues and challenges - Guide
Transport layer issues and challenges - GuideGOPINATHS437943
 
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTIONTHE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTIONjhunlian
 
Input Output Management in Operating System
Input Output Management in Operating SystemInput Output Management in Operating System
Input Output Management in Operating SystemRashmi Bhat
 
Virtual memory management in Operating System
Virtual memory management in Operating SystemVirtual memory management in Operating System
Virtual memory management in Operating SystemRashmi Bhat
 
National Level Hackathon Participation Certificate.pdf
National Level Hackathon Participation Certificate.pdfNational Level Hackathon Participation Certificate.pdf
National Level Hackathon Participation Certificate.pdfRajuKanojiya4
 
BSNL Internship Training presentation.pptx
BSNL Internship Training presentation.pptxBSNL Internship Training presentation.pptx
BSNL Internship Training presentation.pptxNiranjanYadav41
 
"Exploring the Essential Functions and Design Considerations of Spillways in ...
"Exploring the Essential Functions and Design Considerations of Spillways in ..."Exploring the Essential Functions and Design Considerations of Spillways in ...
"Exploring the Essential Functions and Design Considerations of Spillways in ...Erbil Polytechnic University
 
Autonomous emergency braking system (aeb) ppt.ppt
Autonomous emergency braking system (aeb) ppt.pptAutonomous emergency braking system (aeb) ppt.ppt
Autonomous emergency braking system (aeb) ppt.pptbibisarnayak0
 
Indian Dairy Industry Present Status and.ppt
Indian Dairy Industry Present Status and.pptIndian Dairy Industry Present Status and.ppt
Indian Dairy Industry Present Status and.pptMadan Karki
 
Risk Management in Engineering Construction Project
Risk Management in Engineering Construction ProjectRisk Management in Engineering Construction Project
Risk Management in Engineering Construction ProjectErbil Polytechnic University
 
Industrial Safety Unit-IV workplace health and safety.ppt
Industrial Safety Unit-IV workplace health and safety.pptIndustrial Safety Unit-IV workplace health and safety.ppt
Industrial Safety Unit-IV workplace health and safety.pptNarmatha D
 
Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...
Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...
Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...Erbil Polytechnic University
 
home automation using Arduino by Aditya Prasad
home automation using Arduino by Aditya Prasadhome automation using Arduino by Aditya Prasad
home automation using Arduino by Aditya Prasadaditya806802
 
Research Methodology for Engineering pdf
Research Methodology for Engineering pdfResearch Methodology for Engineering pdf
Research Methodology for Engineering pdfCaalaaAbdulkerim
 

Último (20)

welding defects observed during the welding
welding defects observed during the weldingwelding defects observed during the welding
welding defects observed during the welding
 
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024
 
Katarzyna Lipka-Sidor - BIM School Course
Katarzyna Lipka-Sidor - BIM School CourseKatarzyna Lipka-Sidor - BIM School Course
Katarzyna Lipka-Sidor - BIM School Course
 
POWER SYSTEMS-1 Complete notes examples
POWER SYSTEMS-1 Complete notes  examplesPOWER SYSTEMS-1 Complete notes  examples
POWER SYSTEMS-1 Complete notes examples
 
Crushers to screens in aggregate production
Crushers to screens in aggregate productionCrushers to screens in aggregate production
Crushers to screens in aggregate production
 
Engineering Drawing section of solid
Engineering Drawing     section of solidEngineering Drawing     section of solid
Engineering Drawing section of solid
 
Transport layer issues and challenges - Guide
Transport layer issues and challenges - GuideTransport layer issues and challenges - Guide
Transport layer issues and challenges - Guide
 
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTIONTHE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
 
Input Output Management in Operating System
Input Output Management in Operating SystemInput Output Management in Operating System
Input Output Management in Operating System
 
Virtual memory management in Operating System
Virtual memory management in Operating SystemVirtual memory management in Operating System
Virtual memory management in Operating System
 
National Level Hackathon Participation Certificate.pdf
National Level Hackathon Participation Certificate.pdfNational Level Hackathon Participation Certificate.pdf
National Level Hackathon Participation Certificate.pdf
 
BSNL Internship Training presentation.pptx
BSNL Internship Training presentation.pptxBSNL Internship Training presentation.pptx
BSNL Internship Training presentation.pptx
 
"Exploring the Essential Functions and Design Considerations of Spillways in ...
"Exploring the Essential Functions and Design Considerations of Spillways in ..."Exploring the Essential Functions and Design Considerations of Spillways in ...
"Exploring the Essential Functions and Design Considerations of Spillways in ...
 
Autonomous emergency braking system (aeb) ppt.ppt
Autonomous emergency braking system (aeb) ppt.pptAutonomous emergency braking system (aeb) ppt.ppt
Autonomous emergency braking system (aeb) ppt.ppt
 
Indian Dairy Industry Present Status and.ppt
Indian Dairy Industry Present Status and.pptIndian Dairy Industry Present Status and.ppt
Indian Dairy Industry Present Status and.ppt
 
Risk Management in Engineering Construction Project
Risk Management in Engineering Construction ProjectRisk Management in Engineering Construction Project
Risk Management in Engineering Construction Project
 
Industrial Safety Unit-IV workplace health and safety.ppt
Industrial Safety Unit-IV workplace health and safety.pptIndustrial Safety Unit-IV workplace health and safety.ppt
Industrial Safety Unit-IV workplace health and safety.ppt
 
Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...
Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...
Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...
 
home automation using Arduino by Aditya Prasad
home automation using Arduino by Aditya Prasadhome automation using Arduino by Aditya Prasad
home automation using Arduino by Aditya Prasad
 
Research Methodology for Engineering pdf
Research Methodology for Engineering pdfResearch Methodology for Engineering pdf
Research Methodology for Engineering pdf
 

COMP9313: Understanding Key Concepts of Big Data Management

  • 1. COMP9313: Big Data ManagementCOMP9313: Big Data Management Lecturer: Xin CaoLecturer: Xin Cao Course web site:Course web site: http://www.cse.unsw.edu.au/~cs9313/
  • 2. 1.2 Chapter 1: Course Infomation and Introduction toChapter 1: Course Infomation and Introduction to Big DataBig Data
  • 3. 1.3 Course InfoCourse Info Lectures : 6 : 00 – 9:00 pm (Thursday) Location: Ainsworth Building 202 (K-J17-202) Lab: Weeks 2-13 Consultation (Weeks 1-12): Questions regarding lectures, course materials, assignements, exam, etc. Time: 3:00 – 4:00 pm (Thursday) Place: 201D K-17, Tutors Longbin Lai, llai@cse.unsw.edu.au Jianye Yang, jianyey@cse.unsw.edu.au Fei Bi, z3492319@cse.unsw.edu.au Discucssion and QA: https://groups.google.com/forum/#! forum/comp9313-qa
  • 4. 1.4 Lecturer in ChargeLecturer in Charge Lecturer: Xin Cao Office: 201D K17 (outside the lift turn left) Email: xin.cao@unsw.edu.au Ext: 55932 Research interests Spatial Database Data Mining Data Management Big Data Technologies My publications list at google scholar: https://scholar.google.com.au/citations?user=kJIkUagAAAAJ&hl=en
  • 5. 1.5 Course AimsCourse Aims This course aims to introduce you to the concepts behind Big Data, the core technologies used in managing large-scale data sets, and a range of technologies for developing solutions to large-scale data analytics problems. This course is intended for students who want to understand modern large-scale data analytics systems. It covers a wide range of topics and technologies, and will prepare students to be able to build such systems as well as use them efficiently and effectively address challenges in big data management. Not possible to cover every aspect of big data management.
  • 6. 1.6 LecturesLectures Lectures focusing on the frontier technologies on big data management and the typical applications Try to run in more interactive mode A few lectures may run in more practical manner (e.g., like a lab/demo) to cover the applied aspects Lecture length varies slightly depending on the progress (of that lecture) Note: attendance to every lecture is assumed BIG DATA BUG DATA
  • 7. 1.7 ResourcesResources Text Books Hadoop: The Definitive Guide. Tom White. 4th Edition - O'Reilly Media Mining of Massive Datasets. Jure Leskovec, Anand Rajaraman, Jeff Ullman. 2nd edition - Cambridge University Press Reference Books and other readings Advanced Analytics with Spark. Josh Wills, Sandy Ryza, Sean Owen, and Uri Laserson. O'Reilly Media Apache MapReduce Tutorial Apache Spark Quick Start Many other online tutorials … … Big Data is a relatively new topic (so no fixed syllabus)
  • 8. 1.8 PrerequisitePrerequisite Official prerequisite of this course is COMP9024 (Data Structures and Algorithms) and COMP9311 (Database Systems). Before commencing this course, you should: have experiences and good knowledge of algorithm design (equivalent to COMP9024 ) have a solid background in database systems (equivalent to COMP9311) have solid programming skills in Java be familiar with working on a Unix-style operating systems have basic knowledge of linear algebra (e.g., vector spaces, matrix multiplication), probability theory and statistics , and graph theory No previous experience necessary in MapReduce Parallel and distributed programming
  • 9. 1.9 Please do not enrol if youPlease do not enrol if you Don’t have COMP9024/9311 knowledge Cannot produce correct Java program on your own Never worked on Unix-style operating systems Have poor time management Are too busy to attend lectures/labs Otherwise, you are likely to perform badly in this subject
  • 10. 1.10 Learning outcomesLearning outcomes After completing this course, you are expected to: elaborate the important characteristics of Big Data develop an appropriate storage structure for a Big Data repository utilize the map/reduce paradigm and the Spark platform to manipulate Big Data use a high-level query language to manipulate Big Data develop efficient solutions for analytical problems involving Big Data
  • 12. 1.12 AssignmentsAssignments 1 warm-up programming assignment on Hadoop 1 programming assignment on HBase/Hive/Pig 1 warm-up programming assignment on Spark Another harder assignment on Hadoop Another harder assignment on Spark Both results and source codes will be checked. If not able to run your codes due to some bugs, you will not lose all marks.
  • 13. 1.13 Final examFinal exam Final written exam (100 pts) If you are ill on the day of the exam, do not attend the exam – I will not accept any medical special consideration claims from people who already attempted the exam.
  • 14. 1.14 You May Fail Because …You May Fail Because … *Plagiarism* Code failed to compile due to a mistake of 1 char or 1 word Late submission 1 sec late = 1 day late submit wrong files Program did not follow the spec I am unlikely to accept the following excuses: “Too busy” “It took longer than I thought it would take” “It was harder than I initially thought” “My dog ate my homework” and modern variants thereof
  • 15. 1.15 Tentative course scheduleTentative course schedule Week Topic Assignment 1 Course info and introduction to big data 2 Hadoop MapReduce 1 3 Hadoop MapReduce 2 Ass1 4 HDFS and Hadoop I/O 5 NoSQL and Hbase Ass2 6 Hive and Pig 7 Spark Ass3 8 Link analysis 9 Graph data processing Ass4 10 Data stream mining Ass5 11 Large-scale machine learning 12 Revision and exam preparation
  • 16. 1.16 Your Feedbacks Are ImportantYour Feedbacks Are Important Big data is a new topic, and thus the course is tentative The technologies keep evolving, and the course materials need to be updated correspondingly Please advise where I can improve after each lecturer, at the discussion and QA website CATEI system
  • 17. 1.17 Why Attend the Lectures?Why Attend the Lectures?
  • 18. 1.18 What is Big Data?What is Big Data? Big data is like teenage sex: everyone talks about it nobody really knows how to do it everyone thinks everyone else is doing it so everyone claims they are doing it... --Dan Ariely, Professor at Duke University
  • 19. 1.19 What is Big Data?What is Big Data? No standard definition! here is from Wikipedia: Big data is a term for data sets that are so large or complex that traditional data processing applications are inadequate. Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, querying, updating and information privacy. Analysis of data sets can find new correlations to "spot business trends, prevent diseases, combat crime and so on."
  • 20. 1.20 Who is generating Big Data?Who is generating Big Data? Homeland Security Real Time Search Social eCommerce User Tracking & Engagement Financial Services
  • 21. 1.21 Big Data Characteristics: 3VBig Data Characteristics: 3V
  • 22. 1.22 Volume (Scale)Volume (Scale) Data Volume Growth 40% per year From 8 zettabytes (2016) to 44zb (2020) Data volume is increasing exponentially Exponential increase in collected/generated data
  • 23. 1.23 How much data?How much data? Hadoop: 10K nodes, 150K cores, 150 PB (4/2014) Processes 20 PB a day (2008) Crawls 20B web pages a day (2012) Search index is 100+ PB (5/2014) Bigtable serves 2+ EB, 600M QPS (5/2014) 300 PB data in Hive + 600 TB/day (4/2014) 400B pages, 10+ PB (2/2014) LHC: ~15 PB a year LSST: 6-10 PB a year (~2020)640K ought to be enough for anybody. 150 PB on 50k+ servers running 15k apps (6/2011) S3: 2T objects, 1.1M request/second (4/2013) SKA: 0.3 – 1.5 EB per year (~2020) Hadoop: 365 PB, 330K nodes (6/2014)
  • 24. 1.24 Variety (Complexity)Variety (Complexity) Different Types: Relational Data (Tables/Transaction/Legacy Data) Text Data (Web) Semi-structured Data (XML) Graph Data  Social Network, Semantic Web (RDF), … Streaming Data  You can only scan the data once A single application can be generating/collecting many types of data Different Sources : Movie reviews from IMDB and Rotten Tomatoes Product reviews from different provider websites To extract knowledge all these types of data need to linked together To extract knowledge all these types of data need to linked together
  • 25. 1.25 A Single View to the CustomerA Single View to the Customer Customer Social Media Social Media GamingGaming EntertainEntertain Banking Finance Banking Finance Our Known History Our Known History PurchasePurchase
  • 26. 1.26 A Global View of Linked Big DataA Global View of Linked Big Data patient doctors gene protein drug “Ebola” mutation diagnosis prescription target tissue Heterogeneous information networkDiversified social network
  • 27. 1.27 Velocity (Speed)Velocity (Speed) Data is begin generated fast and need to be processed fast Online Data Analytics Late decisions  missing opportunities Examples E-Promotions: Based on your current location, your purchase history, what you like  send promotions right now for store next to you Healthcare monitoring: sensors monitoring your activities and body  any abnormal measurements require immediate reaction Disaster management and response
  • 28. 1.28 Real-Time Analytics/DecisionReal-Time Analytics/Decision RequirementRequirement Customer Influence Behavior Product Recommendations that are Relevant & Compelling Friend Invitations to join a Game or Activity that expands business Preventing Fraud as it is Occurring & preventing more proactively Learning why Customers Switch to competitors and their offers; in time to Counter Improving the Marketing Effectiveness of a Promotion while it is still in Play
  • 29. 1.29 ExtendedExtended Big Data Characteristics:Big Data Characteristics: 6V6V Volume: In a big data environment, the amounts of data collected and processed are much larger than those stored in typical relational databases. Variety: Big data consists of a rich variety of data types. Velocity: Big data arrives to the organization at high speeds and from multiple sources simultaneously. Veracity: Data quality issues are particularly challenging in a big data context. Visibility/Visualization: After big data being processed, we need a way of presenting the data in a manner that’s readable and accessible. Value: Ultimately, big data is meaningless if it does not provide value toward some meaningful goal.
  • 30. 1.30 Veracity (Quality & Trust)Veracity (Quality & Trust) Data = quantity + quality When we talk about big data, we typically mean its quantity: What capacity of a system provides to cope with the sheer size of the data? Is a query feasible on big data within our available resources? How can we make our queries tractable on big data? . . . Can we trust the answers to our queries? Dirty data routinely lead to misleading financial reports, strategic business planning decision ⇒ loss of revenue, credibility and customers, disastrous consequences The study of data quality is as important as data quantity
  • 31. 1.31 Data in real-life is often dirtyData in real-life is often dirty 500,000 dead people retain active Medicare cards 81 million National Insurance numbers but only 60 million eligible citizens 98000 deaths each year, caused by errors in medical data
  • 32. 1.32 Visibility/VisualizationVisibility/Visualization Visible to the process of big data management Big Data – visibility = Black Hole? Big data visualization tools: A visualization of Divvy bike rides across Chicago
  • 33. 1.33 ValueValue Big data is meaningless if it does not provide value toward some meaningful goal
  • 34. 1.34 Big Data: 6V in SummaryBig Data: 6V in Summary Transforming Energy and Utilities through Big Data & Analytics. By Anders Quitzau@IBM
  • 35. 1.35 Other V’sOther V’s Variability Variability refers to data whose meaning is constantly changing. This is particularly the case when gathering data relies on language processing. Viscosity This term is sometimes used to describe the latency or lag time in the data relative to the event being described. We found that this is just as easily understood as an element of Velocity. Virality Defined by some users as the rate at which the data spreads; how often it is picked up and repeated by other users or events. Volatility Big data volatility refers to how long is data valid and how long should it be stored. You need to determine at what point is data no longer relevant to the current analysis. More V’s in the future …
  • 36. 1.36 Big Data Tag CloudBig Data Tag Cloud
  • 37. 1.37 Cloud ComputingCloud Computing The buzz word before “Big Data” Larry Ellison’s response in 2009 Cloud Computing is a general term used to describe a new class of network based computing that takes place over the Internet A collection/group of integrated and networked hardware, software and Internet infrastructure (called a platform). Using the Internet for communication and transport provides hardware, software and networking services to clients These platforms hide the complexity and details of the underlying infrastructure from users and applications by providing very simple graphical interface or API A technical point of view Internet-based computing (i.e., computers attached to network) A business-model point of view Pay-as-you-go (i.e., rental)
  • 38. 1.38 Cloud Computing ArchitectureCloud Computing Architecture
  • 40. 1.40 Cloud Computing ServicesCloud Computing Services Infrastructure as a service (IaaS) Offering hardware related services using the principles of cloud computing. These could include storage services (database or disk storage) or virtual servers. Amazon EC2, Amazon S3 Platform as a Service (PaaS) Offering a development platform on the cloud. Google’s Application Engine, Microsofts Azure Software as a service (SaaS) Including a complete software offering on the cloud. Users can access a software application hosted by the cloud vendor on pay- per-use basis. This is a well-established sector. Googles gmail and Microsofts hotmail, Google docs
  • 41. 1.41 Cloud ServicesCloud Services Software as a Service (SaaS) Platform as a Service (PaaS) Infrastructure as a Service (IaaS) Google App Engine SalesForce CRM LotusLive
  • 42. 1.42 Why Study Big Data Technologies?Why Study Big Data Technologies? The hottest topic in both research and industry Highly demanded in real world A promising future career Research and development of big data systems: distributed systems (eg, Hadoop), visualization tools, data warehouse, OLAP, data integration, data quality control, … Big data applications: social marketing, healthcare, … Data analysis: to get values out of big data discovering and applying patterns, predicative analysis, business intelligence, privacy and security, … Graduate from UNSW
  • 43. 1.43 Big Data Open Source ToolsBig Data Open Source Tools
  • 44. 1.44 What will the course coverWhat will the course cover Topic 1. Big data management tools Apache Hadoop  MapReduce  HDFS  HBase  Hive and Pig  Mahout  Spark Topic 2. Big data typical applications Link analysis Graph data processing Data stream mining Some machine learning topics
  • 45. 1.45 Philosophy to Scale for Big DataPhilosophy to Scale for Big Data ProcessingProcessing Divide Work Combine Results
  • 46. 1.46 Distributed processing is non-trivialDistributed processing is non-trivial How to assign tasks to different workers in an efficient way? What happens if tasks fail? How do workers exchange results? How to synchronize distributed tasks allocated to different workers?
  • 47. 1.47 Big data storage is challengingBig data storage is challenging Data Volumes are massive Reliability of Storing PBs of data is challenging All kinds of failures: Disk/Hardware/Network Failures Probability of failures simply increase with the number of machines …
  • 48. 1.48 What is HadoopWhat is Hadoop Open-source data storage and processing platform Before the advent of Hadoop, storage and processing of big data was a big challenge Massively scalable, automatically parallelizable Based on work from Google  Google: GFS + MapReduce + BigTable (Not open)  Hadoop: HDFS + Hadoop MapReduce + HBase ( opensource) Named by Doug Cutting in 2006 (worked at Yahoo! at that time), after his son's toy elephant.
  • 49. 1.49 Hadoop offersHadoop offers Redundant, Fault-tolerant data storage Parallel computation framework Job coordination Programmer s Q: Where file is located? Q: How to handle failures & data lost? Q: How to divide computation? Q: How to program for scaling? No longer need to worry about
  • 50. 1.50 Why Use Hadoop?Why Use Hadoop? Cheaper Scales to Petabytes or more easily Faster Parallel data processing Better Suited for particular types of big data problems
  • 52. 1.52 Forecast growth of Hadoop JobForecast growth of Hadoop Job MarketMarket
  • 53. 1.53 Data storage (HDFS) Runs on commodity hardware (usually Linux) Horizontally scalable Processing (MapReduce) Parallelized (scalable) processing Fault Tolerant Other Tools / Frameworks Data Access  HBase, Hive, Pig, Mahout Tools  Hue, Sqoop Monitoring  Greenplum, Cloudera Hadoop is a set of Apache Frameworks andHadoop is a set of Apache Frameworks and more…more… Hadoop Core - HDFS MapReduce API Data Access Tools & Libraries Monitoring & Alerting
  • 54. 1.54 What are the core parts of a HadoopWhat are the core parts of a Hadoop distribution?distribution? HDFS Storage Redundant (3 copies) For large files – large blocks 64 or 128 MB / block Can scale to 1000s of nodes MapReduce API Batch (Job) processing Distributed and Localized to clusters (Map) Auto-Parallelizable for huge amounts of data Fault-tolerant (auto retries) Adds high availability and more Pig Hive HBase Others Other Libraries
  • 55. 1.55 Hadoop 2.0Hadoop 2.0 Hadoop YARN (Yet Another Resource Negotiator): a resource- management platform responsible for managing computing resources in clusters and using them for scheduling of users' applications Single Use System Batch apps Multi-Purpose Platform Batch, Interactive, Online, Streaming
  • 56. 1.56 Hadoop EcosystemHadoop Ecosystem A combination of technologies which have proficient advantage in solving business problems. http://www.edupristine.com/blog/hadoop-ecosystem-and-components
  • 57. 1.57 Common Hadoop DistributionsCommon Hadoop Distributions Open Source Apache Commercial Cloudera Hortonworks MapR AWS MapReduce Microsoft Azure HDInsight (Beta)
  • 58. 1.58 Setting up Hadoop DevelopmentSetting up Hadoop Development Hadoop Binaries Local install • Linux • Windows Cloudera’s Demo VM • Need Virtualization software, i.e. VMware, etc… Cloud • AWS • Microsoft (Beta) • Others Data Storage Local • File System • HDFS Pseudo- distributed (single- node) Cloud • AWS • Azure • Others MapReduce Local Cloud Other Libraries & Tools Vendor Tools Libraries
  • 59. 1.59 Comparing: RDBMS vs. HadoopComparing: RDBMS vs. Hadoop Traditional RDBMS Hadoop / MapReduce Data Size Gigabytes (Terabytes) Petabytes (Hexabytes) Access Interactive and Batch Batch – NOT Interactive Updates Read / Write many times Write once, Read many times Structure Static Schema Dynamic Schema Integrity High (ACID) Low Scaling Nonlinear Linear Query Response Time Can be near immediate Has latency (due to batch processing)
  • 60. 1.60 The Changing Data ManagementThe Changing Data Management LandscapeLandscape
  • 61. 1.61 MapReduceMapReduce Typical big data problem Iterate over a large number of records Extract something of interest from each Shuffle and sort intermediate results Aggregate intermediate results Generate final output Programmers specify two functions: map (k1, v1) [<k→ 2, v2>] reduce (k2, [v2]) [<k→ 3, v3>] All values with the same key are sent to the same reducer The execution framework handles everything else… Map Reduce Key idea: provide a functional abstraction for these two operations
  • 62. 1.62 Philosophy to Scale for Big DataPhilosophy to Scale for Big Data ProcessingProcessing Divide Work Combine Results
  • 63. 1.63 Understanding MapReduceUnderstanding MapReduce Map>> (K1, V1)   Info in  Input Split list (K2, V2)  Key / Value out (intermediate values)  One list per local node  Can implement local Reducer (or Combiner) Reduce (K2, list(V2))   Shuffle / Sort phase precedes Reduce phase  Combines Map output into a list list (K3, V3)  Usually aggregates intermediate values (input) <k1, v1>  map  <k2, v2>  combine  <k2, list(V2)>  reduce  <k3, v3> (output) Shuffle/Sort>>
  • 64. 1.64 WordCount - MapperWordCount - Mapper Reads in input pair <k1,v1> Outputs a pair <k2, v2> Let’s count number of each word in user queries (or Tweets/Blogs) The input to the mapper will be <queryID, QueryText>: <Q1,“The teacher went to the store. The store was closed; the store opens in the morning. The store opens at 9am.” > The output would be: <The, 1> <teacher, 1> <went, 1> <to, 1> <the, 1> <store,1> <the, 1> <store, 1> <was, 1> <closed, 1> <the, 1> <store,1> <opens, 1> <in, 1> <the, 1> <morning, 1> <the 1> <store, 1> <opens, 1> <at, 1> <9am, 1>
  • 65. 1.65 WordCount - ReducerWordCount - Reducer Accepts the Mapper output (k2, v2), and aggregates values on the key to generate (k3, v3) For our example, the reducer input would be: <The, 1> <teacher, 1> <went, 1> <to, 1> <the, 1> <store, 1> <the, 1> <store, 1> <was, 1> <closed, 1> <the, 1> <store, 1> <opens,1> <in, 1> <the, 1> <morning, 1> <the 1> <store, 1> <opens, 1> <at, 1> <9am, 1> The output would be: <The, 6> <teacher, 1> <went, 1> <to, 1> <store, 4> <was, 1> <closed, 1> <opens, 2> <in, 1> <morning, 1> <at, 1> <9am, 1>
  • 66. 1.66 MapReduce Example - WordCountMapReduce Example - WordCount Hadoop MapReduce is an implementation of MapReduce MapReduce is a computing paradigm (Google) Hadoop MapReduce is an open-source software
  • 67. 1.67 AWS (Amazon Web Services)AWS (Amazon Web Services) Amazon From Wikipedia 2006 From Wikipedia 2016
  • 68. 1.68 AWS (Amazon Web Services)AWS (Amazon Web Services) AWS is a subsidiary of Amazon.com, which offers a suite of cloud computing services that make up an on-demand computing platform. Amazon Web Services (AWS) provides a number of different services, including: Amazon Elastic Compute Cloud (EC2) Virtual machines for running custom software Amazon Simple Storage Service (S3) Simple key-value store, accessible as a web service Amazon Elastic MapReduce (EMR) Scalable MapReduce computation Amazon DynamoDB Distributed NoSQL database, one of several in AWS Amazon SimpleDB Simple NoSQL database ...
  • 69. 1.69 Cloud Computing Services in AWSCloud Computing Services in AWS IaaS EC2, S3, … Highlight: EC2 and S3 are two of the earliest products in AWS PaaS Aurora, Redshift, … Highlight: Aurora and Redshift are two of the fastest growing products in AWS SaaS WorkDocs, WorkMail Highlight: May not be the main focus of AWS
  • 70. 1.70 Setting up an AWS accountSetting up an AWS account Sign up for an account on aws.amazon.com You need to choose an username and a password These are for the management interface only Your programs will use other credentials (RSA keypairs, access keys, ...) to interact with AWS aws.amazon.com
  • 71. 1.71 Signing up for AWS EducateSigning up for AWS Educate Complete the web form on https://aws.amazon.com/education/awseducate/ Assumes you already have an AWS account Use your Penn email address! Amazon says it should only take 2-5 minutes (but don’t rely on this!!) This should give you $100/year in AWS credits. Be careful!!!
  • 72. 1.72 Big Data ApplicationsBig Data Applications Link analysis Graph data processing Data stream mining Large-scale machine learning
  • 73. End of IntroductionEnd of Introduction
  • 74. 1.74 NoSQLNoSQL Stands for Not Only SQL Class of non-relational data storage systems Usually do not require a fixed table schema nor do they use the concept of joins All NoSQL offerings relax one or more of the ACID properties (will talk about the CAP theorem)
  • 75. 1.75 Why NoSQL?Why NoSQL? For data storage, an RDBMS cannot be the be-all/end-all Just as there are different programming languages, need to have other data storage tools in the toolbox A NoSQL solution is more acceptable to a client now than even a year ago Think about proposing a Ruby/Rails or Groovy/Grails solution now versus a couple of years ago
  • 76. 1.76 What kinds of NoSQLWhat kinds of NoSQL NoSQL solutions fall into two major areas: Key/Value or ‘the big hash table’.  Amazon S3 (Dynamo)  Voldemort  Scalaris  Memcached (in-memory key/value store)  Redis Schema-less which comes in multiple flavors, column- based, document-based or graph-based.  Cassandra (column-based)  CouchDB (document-based)  MongoDB(document-based)  Neo4J (graph-based)  HBase (column-based)
  • 77. 1.77 Key/ValueKey/Value Pros: very fast very scalable simple model able to distribute horizontally Cons: - many data structures (objects) can't be easily modeled as key value pairs
  • 78. 1.78 Common AdvantagesCommon Advantages Cheap, easy to implement (open source) Data are replicated to multiple nodes (therefore identical and fault- tolerant) and can be partitioned Down nodes easily replaced No single point of failure Easy to distribute Don't require a schema Can scale up and down Relax the data consistency requirement (CAP)
  • 79. 1.79 What am I giving up?What am I giving up? joins group by order by ACID transactions SQL as a sometimes frustrating but still powerful query language easy integration with other applications that support SQL

Notas del editor

  1. Enterprise resource planning Customer relationship management
  2. Atomicity, Consistency, Isolation, Durability
  3. -&amp;gt; No JDBC -&amp;gt; Data integrity at the application layer