1. Introduction to Big Data and Use Cases on Hadoop
Jongwook Woo (PhD)
High-Performance Information Computing Center (HiPIC)
California State University Los Angeles
Cloudera Academic Partner and Grants Awardee of Amazon AWS
Seoul Technology Society Meetup: Hack'n'Tell night #3
Seoul, Korea
July 25th, 2014
2. Contents
Introduction
Big Data Use Cases
Hadoop 2.0
Training in Big Data
3. Me
Name: Jongwook Woo, PhD
Background:
Since 1998, consulting for companies in Hollywood
– Implementing eBusiness applications using J2EE
– Search applications using FAST, Lucene/Solr, Sphinx
• Data Integration, Data Feed
– Warner Bros (Matrix online game), E!, citysearch.com, ARM
Teaching since 2002:
– California State University Los Angeles
Exposed to Hadoop since 2008
Exposed to Cloudera since 2010
4. Experience in Big Data
Certificate
Certified Cloudera Instructor
Certified Cloudera Hadoop Developer / Administrator
Partnership
Academic Education Partnership with Cloudera since
June 2012
Grants
Received Microsoft Windows Azure Educator Grant (Oct 2013 -
July 2014)
Received Amazon AWS in Education Research Grant (July
2012 - July 2014)
Received Amazon AWS in Education Coursework Grants (July
2012 - July 2013, Jan 2011 - Dec 2011)
5. What is Big Data, Map/Reduce, Hadoop, NoSQL DB on Cloud Computing
6. Data
Google
“We don’t have a better algorithm
than others but we have more data
than others”
7. Data Issues
Large-scale data
Terabytes (10^12 bytes), Petabytes (10^15 bytes)
– Because of the web
• Sensor data, Bioinformatics, Social Computing,
smartphones, online games…
Cannot be handled with the legacy approach
Too big
Un-/Semi-structured data
Too expensive
Need new systems
Inexpensive
8. Two Cores in Big Data
How to store Big Data
How to compute Big Data
Google
How to store Big Data
– GFS
– On inexpensive commodity computers
How to compute Big Data
– MapReduce
– Parallel computing with multiple inexpensive
computers
• In effect, its own supercomputers
9. Hadoop 1.0
Hadoop
Doug Cutting
– Hadoop founder
– Initiated the Apache Lucene, Nutch, Avro, and Hadoop
projects
– Board member of the Apache Software Foundation
– Chief Architect at Cloudera
MapReduce
HDFS
Restricted Parallel Programming
– Not for iterative algorithms
– Not for graph processing
10. MapReduce
Provides Restricted Parallel Programming
model on Hadoop
User implements Map() and Reduce()
Libraries (Hadoop) take care of
EVERYTHING else
–Parallelization
–Fault Tolerance
–Data Distribution
–Load Balancing
Now you can own a supercomputer
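To make the Map() / Reduce() split concrete, here is a minimal word-count sketch in the Hadoop Streaming style (Python). The file names mapper.py and reducer.py and the tab-separated key/value convention are illustrative assumptions of this sketch, not code from the talk.

#!/usr/bin/env python
# mapper.py - emit (word, 1) for every word read from standard input
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))

#!/usr/bin/env python
# reducer.py - sum the counts for each word
# Hadoop Streaming delivers mapper output sorted by key,
# so identical words arrive on consecutive lines.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.strip().split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))

Hadoop handles the parallelization, shuffle, and fault tolerance around these two small scripts.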
11. Definition: Big Data
Inexpensive frameworks that can
store large-scale data and
process it quickly in parallel
Hadoop
–You can build and run your applications
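As a rough usage sketch (the streaming jar path and HDFS paths below are placeholders that differ per distribution), the word-count mapper and reducer above could be run on a cluster with the standard Hadoop Streaming options:

hadoop fs -put articles.txt /user/demo/input
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -files mapper.py,reducer.py \
    -mapper "python mapper.py" -reducer "python reducer.py" \
    -input /user/demo/input -output /user/demo/output
hadoop fs -cat /user/demo/output/part-00000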
12. Legacy Example
In late 2007, the New York Times
wanted to make available over the web
its entire archive of articles,
11 million in all, dating back to 1851.
A four-terabyte pile of images in TIFF format
Needed to translate that four-terabyte pile of TIFFs
into more web-friendly PDF files
– Not a particularly complicated computing chore, but a large one,
• requiring a whole lot of computer processing time
13. Legacy Example (Cont'd)
In late 2007, the New York Times
wanted to make available over the web
its entire archive of articles,
A software programmer at the Times, Derek Gottfrid,
– playing around with Amazon Web Services' Elastic
Compute Cloud (EC2),
• uploaded the four terabytes of TIFF data into Amazon's
Simple Storage Service (S3)
• In less than 24 hours, he had 11 million PDFs, all stored
neatly in S3 and ready to be served up to visitors to the
Times site
The total cost for the computing job? $240
– 10 cents per computer-hour times 100 computers times 24 hours
14. HuffPost | AOL
Two Machine Learning Use Cases
Comment Moderation
Evaluate All New HuffPost User Comments
Every Day
Identify Abusive / Aggressive Comments (sketch below)
Auto Delete / Publish ~25% of Comments Every Day
Article Classification
Tag Articles for Advertising
E.g.: scary, salacious, …
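This is not HuffPost's actual pipeline; purely as an illustration of the comment-moderation idea, a minimal abusive-comment classifier can be sketched with scikit-learn (the library, the training phrases, and the labels here are all assumptions of this sketch):

# Minimal comment-classifier sketch: 1 = abusive, 0 = publishable.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_comments = [
    "great article, thanks for sharing",    # publishable
    "you are a complete idiot",             # abusive
    "interesting point about the economy",  # publishable
    "shut up, nobody wants your garbage",   # abusive
]
train_labels = [0, 1, 0, 1]

# Bag-of-words features feeding a Naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_comments, train_labels)

# Predicted labels would drive the auto-publish / auto-delete decision.
print(model.predict(["thanks, well written", "you idiot, this is garbage"]))

A production system would train on far more moderator-labeled comments and leave uncertain cases to human review, consistent with only ~25% of comments being handled automatically.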
15. Use Cases experienced
Log Analysis
Log files from IPS and IDS
– 1.5 GB per day for each system
Extracting unusual cases using Hadoop, Solr,
Flume on Cloudera
Customer Behavior Analysis
Market Basket Analysis Algorithm (sketch below)
Machine Learning for Image Processing
with Texas A&M
Hadoop Streaming API
Movie Data Analysis
Hive, Impala
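Following up on the Market Basket Analysis item above: in the same Hadoop Streaming style as the earlier word count, a mapper can emit every pair of items bought together in one transaction, and the earlier reducer can sum the pair counts unchanged. The comma-separated input format is an assumption of this sketch, not the project's actual data layout.

#!/usr/bin/env python
# mb_mapper.py - emit (item_a,item_b <tab> 1) for each item pair in a transaction
# Assumes one transaction per input line, e.g. "bread,milk,eggs"
import sys
from itertools import combinations

for line in sys.stdin:
    items = sorted(set(line.strip().split(",")))
    for a, b in combinations(items, 2):
        print("%s,%s\t%d" % (a, b, 1))

The highest-count pairs are the items most often purchased together.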
16. Hadoop 2.0: YARN
Data processing applications and services
Interactive SQL: Impala (MPP query engine)
Tez – generic framework to run complex DAGs of tasks
Machine Learning, Data Streaming: Spark (see the sketch below)
Graph processing: Giraph
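For contrast with Hadoop 1.0's restricted MapReduce model, here is a minimal word count in Spark's Python API (PySpark); the HDFS paths are placeholders, and the sketch is illustrative rather than taken from the talk.

from pyspark import SparkContext

sc = SparkContext(appName="WordCount")

counts = (sc.textFile("hdfs:///user/demo/input")    # read lines from HDFS
            .flatMap(lambda line: line.split())     # split lines into words
            .map(lambda word: (word, 1))            # pair each word with a count of 1
            .reduceByKey(lambda a, b: a + b))       # sum the counts per word

counts.saveAsTextFile("hdfs:///user/demo/output")
sc.stop()

Because Spark keeps intermediate data in memory across operations, the iterative machine-learning algorithms that plain MapReduce handles poorly fit it much better.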
17. Training in Big Data
Learn by yourself?
You may miss many important topics
Cloudera: a leading Big Data Hadoop distributor
With hands-on exercises
Cloudera Training series
Hadoop Developer
Hadoop Systems Administrator
Hadoop Data Analyst/Scientist
18. Conclusion
Era of Big Data
Need to store and compute Big Data
Many solutions but Hadoop is the way
to go
Hadoop is a supercomputer that you
can own
Hadoop 2.0
Training is important