1. Introduction to Big Data and Use Cases on Hadoop
Jongwook Woo (PhD)
High-Performance Information Computing Center (HiPIC)
California State University Los Angeles
Cloudera Academic Partner and Grants Awardee of Amazon AWS
Seoul Technology Society Meetup: Hack'n'Tell night #3
Seoul, Korea
July 25th, 2014
2. Contents
Introduction
Big Data Use Cases
Hadoop 2.0
Training in Big Data
3. Me
Name: Jongwook Woo, PhD
Background:
Since 1998, consulting for companies in Hollywood
– Implementing eBusiness applications using J2EE
– Search applications using FAST, Lucene/Solr, Sphinx
• Data Integration, Data Feed
– Warner Bros (Matrix online game), E!, citysearch.com, ARM
Teaching since 2002:
– California State University Los Angeles
Exposed to Hadoop since 2008
Exposed to Cloudera since 2010
4. Experience in Big Data
Certificate
Certified Cloudera Instructor
Certified Cloudera Hadoop Developer / Administrator
Partnership
Academic Education Partnership with Cloudera since
June 2012
Grants
Received Microsoft Windows Azure Educator Grant (Oct 2013 -
July 2014)
Received Amazon AWS in Education Research Grant (July
2012 - July 2014)
Received Amazon AWS in Education Coursework Grants (July
2012 - July 2013, Jan 2011 - Dec 2011)
5. What is Big Data, Map/Reduce, Hadoop, NoSQL DB on Cloud Computing
6. Data
Google
“We don’t have a better algorithm
than others but we have more data
than others”
7. Data Issues
Large-scale data
Terabytes (10^12 bytes), Petabytes (10^15 bytes)
– Because of the web
• Sensor data, Bioinformatics, Social Computing,
smartphones, online games…
Cannot be handled with the legacy approach
Too big
Un-/Semi-structured data
Too expensive
Need new systems
Inexpensive
8. Two Cores in Big Data
How to store Big Data
How to compute Big Data
Google
How to store Big Data
– GFS
– On inexpensive commodity computers
How to compute Big Data
– MapReduce
– Parallel computing with multiple inexpensive
computers
• In effect, its own supercomputers
9. Hadoop 1.0
Hadoop
Doug Cutting
– Hadoop founder
– Initiated the Apache Lucene, Nutch, Avro, and Hadoop
projects
– Board member of the Apache Software Foundation
– Chief Architect at Cloudera
MapReduce
HDFS
Restricted Parallel Programming
– Not for iterative algorithms
– Not for graph processing
10. MapReduce
Provides Restricted Parallel Programming
model on Hadoop
User implements Map() and Reduce()
Libraries (Hadoop) take care of
EVERYTHING else
–Parallelization
–Fault Tolerance
–Data Distribution
–Load Balancing
Now you can own a supercomputer
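To make the Map() / Reduce() split concrete, here is a minimal word-count sketch in the Hadoop Streaming style (Python). The file names mapper.py and reducer.py and the tab-separated key/value convention are illustrative assumptions of this sketch, not code from the talk.

#!/usr/bin/env python
# mapper.py - emit (word, 1) for every word read from standard input
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))

#!/usr/bin/env python
# reducer.py - sum the counts for each word
# Hadoop Streaming delivers mapper output sorted by key,
# so identical words arrive on consecutive lines.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.strip().split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))

Hadoop handles the parallelization, shuffle, and fault tolerance around these two small scripts.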
11. Definition: Big Data
Inexpensive frameworks that can
store large-scale data and
process it quickly in parallel
Hadoop
–You can build and run your applications
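As a rough usage sketch (the streaming jar path and HDFS paths below are placeholders that differ per distribution), the word-count mapper and reducer above could be run on a cluster with the standard Hadoop Streaming options:

hadoop fs -put articles.txt /user/demo/input
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -files mapper.py,reducer.py \
    -mapper "python mapper.py" -reducer "python reducer.py" \
    -input /user/demo/input -output /user/demo/output
hadoop fs -cat /user/demo/output/part-00000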
12. Legacy Example
In late 2007, the New York Times
wanted to make available over the web
its entire archive of articles,
11 million in all, dating back to 1851.
A four-terabyte pile of images in TIFF format
Needed to translate that four-terabyte pile of TIFFs
into more web-friendly PDF files
– Not a particularly complicated computing chore, but a large one,
• requiring a whole lot of computer processing time
13. Legacy Example (Cont'd)
In late 2007, the New York Times
wanted to make available over the web
its entire archive of articles,
A software programmer at the Times, Derek Gottfrid,
– playing around with Amazon Web Services' Elastic
Compute Cloud (EC2),
• uploaded the four terabytes of TIFF data into Amazon's
Simple Storage Service (S3)
• In less than 24 hours, he had 11 million PDFs, all stored
neatly in S3 and ready to be served up to visitors to the
Times site
The total cost for the computing job? $240
– 10 cents per computer-hour times 100 computers times 24 hours
14. HuffPost | AOL
Two Machine Learning Use Cases
Comment Moderation
Evaluate All New HuffPost User Comments
Every Day
Identify Abusive / Aggressive Comments (sketch below)
Auto Delete / Publish ~25% of Comments Every Day
Article Classification
Tag Articles for Advertising
E.g.: scary, salacious, …
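This is not HuffPost's actual pipeline; purely as an illustration of the comment-moderation idea, a minimal abusive-comment classifier can be sketched with scikit-learn (the library, the training phrases, and the labels here are all assumptions of this sketch):

# Minimal comment-classifier sketch: 1 = abusive, 0 = publishable.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_comments = [
    "great article, thanks for sharing",    # publishable
    "you are a complete idiot",             # abusive
    "interesting point about the economy",  # publishable
    "shut up, nobody wants your garbage",   # abusive
]
train_labels = [0, 1, 0, 1]

# Bag-of-words features feeding a Naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_comments, train_labels)

# Predicted labels would drive the auto-publish / auto-delete decision.
print(model.predict(["thanks, well written", "you idiot, this is garbage"]))

A production system would train on far more moderator-labeled comments and leave uncertain cases to human review, consistent with only ~25% of comments being handled automatically.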
15. Use Cases experienced
Log Analysis
Log files from IPS and IDS
– 1.5 GB per day for each system
Extracting unusual cases using Hadoop, Solr,
Flume on Cloudera
Customer Behavior Analysis
Market Basket Analysis Algorithm (sketch below)
Machine Learning for Image Processing
with Texas A&M
Hadoop Streaming API
Movie Data Analysis
Hive, Impala
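Following up on the Market Basket Analysis item above: in the same Hadoop Streaming style as the earlier word count, a mapper can emit every pair of items bought together in one transaction, and the earlier reducer can sum the pair counts unchanged. The comma-separated input format is an assumption of this sketch, not the project's actual data layout.

#!/usr/bin/env python
# mb_mapper.py - emit (item_a,item_b <tab> 1) for each item pair in a transaction
# Assumes one transaction per input line, e.g. "bread,milk,eggs"
import sys
from itertools import combinations

for line in sys.stdin:
    items = sorted(set(line.strip().split(",")))
    for a, b in combinations(items, 2):
        print("%s,%s\t%d" % (a, b, 1))

The highest-count pairs are the items most often purchased together.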
16. Hadoop 2.0: YARN
Data processing applications and services
Interactive SQL: Impala (MPP query engine)
Tez – generic framework to run complex DAGs of tasks
Machine Learning, Data Streaming: Spark (see the sketch below)
Graph processing: Giraph
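For contrast with Hadoop 1.0's restricted MapReduce model, here is a minimal word count in Spark's Python API (PySpark); the HDFS paths are placeholders, and the sketch is illustrative rather than taken from the talk.

from pyspark import SparkContext

sc = SparkContext(appName="WordCount")

counts = (sc.textFile("hdfs:///user/demo/input")    # read lines from HDFS
            .flatMap(lambda line: line.split())     # split lines into words
            .map(lambda word: (word, 1))            # pair each word with a count of 1
            .reduceByKey(lambda a, b: a + b))       # sum the counts per word

counts.saveAsTextFile("hdfs:///user/demo/output")
sc.stop()

Because Spark keeps intermediate data in memory across operations, the iterative machine-learning algorithms that plain MapReduce handles poorly fit it much better.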
17. Training in Big Data
Learn by yourself?
You may miss many important topics
Cloudera: a leading Big Data Hadoop distributor
With hands-on exercises
Cloudera Training series
Hadoop Developer
Hadoop Systems Administrator
Hadoop Data Analyst/Scientist
18. Conclusion
Era of Big Data
Need to store and compute Big Data
Many solutions but Hadoop is the way
to go
Hadoop is a supercomputer that you
can own
Hadoop 2.0
Training is important