This is a supplementary slide deck to the introduction to Big Data Analytics material (in the next file), which invites us to get hands-on with several topics around Machine/Deep Learning, Big Data (batch/streaming), and AI using TensorFlow
2. Hands-on Agenda
• Machine Learning Re-Visited
• Python Example of Machine Learning
• Introduction to Deep Learning
• Implementation of 'Big Data' (Hadoop Ecosystem)
• Hadoop File System
• Hadoop Map Reduce
• Case Study
• More advanced implementations of Big Data
3. The Learning Problem
The essence of ML:
1. We have data
2. Patterns exist in data
3. We cannot write an explicit mathematical formula for it
(the formula is not known yet)
Examples:
Movie Rating
Credit Approval
Handwriting Recognition
Domain Areas
Computer Vision
Natural Language Processing
Business Intelligence
4. Components of Learning
Example in Banking: Credit Card Approval
Input : x (customer application)
Output : y (good/bad customer)
Unknown Target Function f : X → Y
Dataset {x, y} (customers record database)
Hypothesis Set H : X → Y
Final Hypothesis g
Learning Model = Hypothesis Set + Learning Algorithm
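The credit-approval setup above can be sketched in code. This is a minimal sketch, assuming scikit-learn is available; the two features and the synthetic data are purely illustrative, not a real credit dataset:

```python
# Hypothesis set H: all linear decision boundaries over the inputs.
# Learning algorithm: logistic regression fitting, which selects the
# final hypothesis g from H using the dataset {x, y}.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# x: illustrative customer features (e.g. income, debt); y: good (1) / bad (0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] - X[:, 1] > 0).astype(int)  # stand-in for the unknown target f

model = LogisticRegression()      # hypothesis set H + learning algorithm
g = model.fit(X, y)               # final hypothesis g
print(g.predict([[1.0, -1.0]]))   # decision for a new applicant
```

The learned model `g` is one concrete hypothesis picked out of `H` by the algorithm, which is exactly the "Learning Model = Hypothesis Set + Learning Algorithm" decomposition above.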
5. Machine Learning Model
• Input data: spatial data (text, images) as {x, y}, or sequence/time-series data as {x, t}
• Model type: a classifier (outputs class scores) or a regression model (outputs continuous values)
6. Main Paradigms
Automatic discovery of patterns in data through computer algorithms and the use of
those patterns to take actions such as classifying or clustering the data into
categories.
Supervised Learning: Learning by labeled example
E.g. An email spam detector
We have (input, correct output), and we can predict (new input, predicted output)
Amazingly effective if you have lots of data
Unsupervised Learning: Discovering Patterns
E.g. Data clustering
Instead of (input, correct output), we get (input, ?)
Difficult in practice, but useful when we lack labeled data
Reinforcement Learning: Feedback & Error
E.g. Learning to play chess
Instead of (input, correct output), we get (input, only some output, grade of this output)
Works well in some domains, becoming more important
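The supervised/unsupervised contrast above can be shown on the same toy data. A minimal sketch assuming scikit-learn; the two well-separated synthetic clusters are illustrative only:

```python
# Supervised vs. unsupervised learning on the same toy data.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)),    # group A around (0, 0)
               rng.normal(5, 0.5, (50, 2))])   # group B around (5, 5)
y = np.array([0] * 50 + [1] * 50)              # labels: only supervised sees these

# Supervised: learn from (input, correct output) pairs
clf = KNeighborsClassifier().fit(X, y)
print(clf.predict([[5.1, 4.9]]))               # predicted label for a new input

# Unsupervised: discover structure from inputs alone, i.e. (input, ?)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_[:5])                          # discovered cluster assignments
```

Note the only difference in the calls: the classifier is fitted on `(X, y)`, the clusterer on `X` alone.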
7. What/why is Python
Python is an interpreted, high-level, general-purpose programming
language.
Its high-level built-in data structures, combined with dynamic typing and
dynamic binding, make it very attractive for Rapid Application Development,
as well as for use as a scripting or glue language to connect existing
components together. Python's simple, easy to learn syntax emphasizes
readability and therefore reduces the cost of program maintenance. Python
supports modules and packages, which encourages program modularity and
code reuse. The Python interpreter and the extensive standard library are
available in source or binary form without charge for all major platforms, and
can be freely distributed.
8. Machine Learning with Python
• We need Python 2.7.x or 3.7.x
• Libraries, ex.:
• numpy (fundamental package for scientific computing with Python)
• matplotlib (plotting library for the Python programming language and its
numerical mathematics extension NumPy)
• pandas (software library written for the Python programming language for
data manipulation and analysis)
• seaborn (Python data visualization library based on matplotlib)
• sklearn (Scikit-learn is a machine learning library for the Python programming
language)
• IDE, ex: pycharm
• Alternatively, install Anaconda (a distribution of the Python programming
language for scientific computing: data science, machine learning applications,
large-scale data processing, predictive analytics, etc.)
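The libraries listed above fit together in a short end-to-end flow. A minimal sketch, assuming numpy, pandas, and scikit-learn are installed; the tiny table of numbers is made up for illustration:

```python
# pandas for data handling, scikit-learn for the model, numpy underneath.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative dataset (not real data)
df = pd.DataFrame({
    "income": [20, 35, 50, 65, 80, 95, 30, 70],
    "debt":   [15, 30, 10,  5, 40,  8, 25, 12],
    "good":   [ 0,  0,  1,  1,  0,  1,  0,  1],
})

# Split into training and test sets, then fit a simple tree classifier
X_train, X_test, y_train, y_test = train_test_split(
    df[["income", "debt"]], df["good"], test_size=0.25, random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("accuracy:", clf.score(X_test, y_test))
```

matplotlib/seaborn would enter the same flow for plotting (e.g. `df.plot()` or `seaborn.pairplot(df)`), omitted here to keep the sketch short.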
9. Machine Learning with Python
• Python 3 installation
• Introduction to pip (python package installer)
• Install PyCharm
• or install Anaconda
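On a typical system, the installation steps above might look like the following; the package names are the standard PyPI ones, and the exact commands can differ per OS and Python setup:

```shell
# Check the interpreter and pip, then install the hands-on libraries
python3 --version
python3 -m pip --version
python3 -m pip install numpy pandas matplotlib seaborn scikit-learn
```

With Anaconda, the equivalent would be `conda install` of the same packages, and PyCharm can then be pointed at that interpreter.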
15. Introduction to Deep Learning
• Deep learning has produced good results for a few applications
such as computer vision, language translation, image captioning,
audio transcription, molecular biology, speech recognition,
natural language processing, self-driving cars, brain tumour
detection, real-time speech translation, music composition,
automatic game playing and so on.
• Deep learning is the next big leap after machine learning with a
more advanced implementation. Currently, it is heading towards
becoming an industry standard bringing a strong promise of
being a game changer when dealing with raw unstructured data.
16. Introduction to Deep Learning
• Deep learning is currently one of the best solution providers for a wide
range of real-world problems. Developers are building AI programs that,
instead of using previously given rules, learn from examples to solve
complicated tasks. With deep learning being used by many data scientists,
deeper neural networks are delivering results that are ever more accurate.
• The idea is to develop deep neural networks by increasing the number of
training layers for each network; machine learns more about the data until
it is as accurate as possible. Developers can use deep learning techniques
to implement complex machine learning tasks, and train AI networks to
have high levels of perceptual recognition.
17. Introduction to Deep Learning
• Deep learning is especially popular in computer vision. One task it
achieves is image classification, where input images are assigned
the class or label (cat, dog, etc.) that best describes them. We
humans learn this task very early in our lives and quickly acquire
the skills of recognizing patterns, generalizing from prior
knowledge, and adapting to different image environments.
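For a small, self-contained taste of image classification, here is a sketch using scikit-learn's bundled 8×8 digit images with a classical model (not a deep network, but the same task shape: image in, label out):

```python
# Classify tiny 8x8 grayscale digit images into labels 0-9.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

digits = load_digits()                       # 1797 images with labels 0-9
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, random_state=0)

clf = SVC(gamma=0.001).fit(X_train, y_train) # learn pixel patterns per digit
print("test accuracy:", clf.score(X_test, y_test))
```

A deep network would replace the SVC with stacked layers that learn the features themselves, which is what the TensorFlow sections below move toward.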
19. Deep Learning with TensorFlow
• Google's TensorFlow is a Python library, and a great choice for
building commercial-grade deep learning applications.
• TensorFlow grew out of DistBelief V2, an earlier library that was part of
the Google Brain project. It aims to extend the portability of
machine learning so that research models can be applied to
commercial-grade applications.
• Much like the Theano library, TensorFlow is based on computational
graphs, where a node represents persistent data or a math operation and
edges represent the flow of data between nodes. The data flowing along an
edge is a multidimensional array, or tensor; hence the name TensorFlow.
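The node/edge picture above can be seen in a few lines. A minimal sketch, assuming TensorFlow 2.x is installed:

```python
# Nodes hold data or operations; the edges between them carry tensors.
import tensorflow as tf

a = tf.constant([[1.0, 2.0]])     # data node: a 1x2 tensor
b = tf.constant([[3.0], [4.0]])   # data node: a 2x1 tensor
c = tf.matmul(a, b)               # operation node consuming both tensors

print(c.numpy())                  # [[11.]]  (1*3 + 2*4)
```

In TF 2.x the graph is built and executed eagerly; wrapping the computation in `@tf.function` would trace it into an explicit graph as in the description above.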
20. Deep Learning Implementation with
Tensorflow and Python
• Preparation (Python + libraries)
• Installing Tensorflow
• Running Several Tensorflow built-in example, ex.:
• Regression
• Image Classification
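As an illustration of the regression item above, here is a minimal sketch (assuming TensorFlow 2.x with Keras) that fits a one-neuron network, i.e. plain linear regression, on synthetic data; the target function y = 2x + 1 is made up for the demo:

```python
# Fit y ~ 2x + 1 with a single Dense neuron (linear regression).
import numpy as np
import tensorflow as tf

x = np.linspace(-1, 1, 64).astype("float32").reshape(-1, 1)
y = 2 * x + 1

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.1), loss="mse")
history = model.fit(x, y, epochs=100, verbose=0)

print("final loss:", history.history["loss"][-1])
print("prediction at x=0.5:", model.predict(np.array([[0.5]]), verbose=0))
```

The image-classification examples follow the same compile/fit pattern, just with image inputs and a softmax output layer.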
22. Hadoop
Hadoop is:
• scalable
• a "framework"
• not a drop-in replacement for an RDBMS
• great for pipelining massive amounts of data to achieve the
end result
34. Hadoop Map Reduce
• MapReduce is a processing technique and a programming model for distributed computing
based on Java. The MapReduce algorithm contains two important tasks, namely Map
and Reduce. Map takes a set of data and converts it into another set of data, where
individual elements are broken down into tuples (key/value pairs). The reduce task then
takes the output from a map as its input and combines those data tuples into a
smaller set of tuples. As the sequence of the name MapReduce implies, the reduce task
is always performed after the map job.
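The map and reduce steps just described can be simulated in plain Python (no Hadoop required) with word count, the canonical MapReduce demo:

```python
from collections import defaultdict

def mapper(line):
    """Map: break a line into (key, value) tuples -> (word, 1)."""
    return [(word, 1) for word in line.split()]

def reducer(pairs):
    """Reduce: combine the values for each key into a smaller set of tuples."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

lines = ["big data big ideas", "big data tools"]
mapped = [kv for line in lines for kv in mapper(line)]  # shuffle phase omitted
counts = reducer(mapped)
print(counts)  # {'big': 3, 'data': 2, 'ideas': 1, 'tools': 1}
```

On a real cluster the framework runs many mappers in parallel, groups the pairs by key (the shuffle), and runs reducers per key; the logic per record is the same.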
35. Hadoop Map Reduce
• The MapReduce framework operates on <key, value> pairs, that is, the framework views the input to the job
as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably
of different types.
• The key and value classes have to be serializable by the framework and hence need to
implement the Writable interface. Additionally, the key classes have to implement the WritableComparable
interface to facilitate sorting by the framework. Input and output types of a MapReduce job: (Input) <k1, v1>
→ map → <k2, v2> → reduce → <k3, v3> (Output).
39. Hadoop Map Reduce
• Run on HadoopMR
input file from local or HDFS
mapper application (see prev. slide)
reducer application (see prev. slide)
*mapper and reducer apps can be written in Python, R, Java, Scala, etc.
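With Hadoop Streaming, the mapper and reducer are just programs that read stdin and write tab-separated key/value lines to stdout. A minimal sketch of both roles in one file; the shuffle/sort phase that Hadoop runs between them is simulated here with `sorted()`:

```python
from io import StringIO

def run_mapper(stream):
    """Emit 'word<TAB>1' for every word, as a Streaming mapper would."""
    for line in stream:
        for word in line.split():
            yield f"{word}\t1"

def run_reducer(lines):
    """Sum counts per key, assuming input is sorted by key (Hadoop's shuffle)."""
    current, total = None, 0
    for line in lines:
        key, value = line.rsplit("\t", 1)
        if key != current and current is not None:
            yield f"{current}\t{total}"
            total = 0
        current = key
        total += int(value)
    if current is not None:
        yield f"{current}\t{total}"

input_data = StringIO("big data\nbig tools\n")
mapped = sorted(run_mapper(input_data))   # simulate the shuffle/sort phase
for out in run_reducer(mapped):
    print(out)                            # big  2 / data  1 / tools  1
```

On a real cluster these two functions would live in separate `mapper.py` and `reducer.py` scripts passed to the Hadoop Streaming jar, with HDFS paths as input and output.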
41. Hadoop Map Reduce
• MapReduce is not magic; it's a method.
• MapReduce is not always about big data (e.g., estimating the value of pi).
• MapReduce is not a silver bullet (e.g., batch vs. streaming data).
• MapReduce is usually applied to:
• batch processing flows
• unstructured/semi-structured data
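The "find pi" remark above refers to Monte Carlo estimation, a classic example of the MapReduce shape applied to data that is not big at all; a plain-Python sketch:

```python
import random
from functools import reduce

random.seed(0)  # fixed seed so the run is reproducible

def mapper(_):
    """Map: one random point in the unit square -> 1 if inside the quarter circle."""
    x, y = random.random(), random.random()
    return 1 if x * x + y * y <= 1.0 else 0

n = 100_000
hits = reduce(lambda a, b: a + b, map(mapper, range(n)))  # reduce: sum the hits
pi_estimate = 4 * hits / n
print(pi_estimate)  # close to 3.14159
```

The map step is trivially parallel (each point is independent) and the reduce step is a sum, which is why this toy problem distributes so naturally.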
43. Data Stream
Why Stream Processing?
• Processing unbounded data sets, or "stream processing", is a new way
of looking at what has always been done as batch in the past. Whilst
intra-day ETL and frequent batch executions have brought latencies
down, they are still independent executions with optional bespoke code
in place to handle intra-batch accumulations. With a platform such as
Spark Streaming we have a framework that natively supports
processing both within-batch and across-batch (windowing).
• By taking a stream processing approach we can benefit in several ways.
The most obvious is reducing latency between an event occurring and
taking an action driven by it, whether automatic or via analytics
presented to a human. Other benefits include a more smoothed out
resource consumption profile.
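The within-batch vs. across-batch (windowing) idea can be illustrated without Spark. A minimal plain-Python sketch of a tumbling-window count over a stream of events; the event times, event types, and 10-second window are made up for illustration:

```python
from collections import defaultdict

def windowed_counts(events, window_seconds):
    """Group (timestamp, key) events into fixed-size (tumbling) windows."""
    windows = defaultdict(lambda: defaultdict(int))
    for ts, key in events:
        window_start = ts - (ts % window_seconds)  # which window this event falls in
        windows[window_start][key] += 1
    return {w: dict(counts) for w, counts in sorted(windows.items())}

# (timestamp_seconds, event_type) pairs arriving over time
events = [(1, "click"), (4, "click"), (7, "buy"), (11, "click"), (14, "buy")]
print(windowed_counts(events, window_seconds=10))
# {0: {'click': 2, 'buy': 1}, 10: {'click': 1, 'buy': 1}}
```

A stream processor such as Spark Streaming maintains state like this incrementally and continuously, rather than recomputing from a finite list, and also supports sliding (overlapping) windows.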
44. Introducing Spark
• Better speed compared to HadoopMR
• Minimized disk read-write (on memory processing)
• Comes with Spark Streaming (the Hadoop ecosystem later added
streaming tooling as well)
• Still in Hadoop Ecosystem
46. Simple Spark Streaming Implementation Example
• near-realtime dashboard
• data stream processing and analytics (bigger/more reliable capabilities)
• multiple channels/types of data