This is a supplementary slide deck to the introduction to Big Data Analytics material (in the next file), which invites us to get hands-on with several topics around Machine/Deep Learning, Big Data (batch/streaming), and AI using TensorFlow
2. Hands-on Agenda
• Machine Learning Re-Visited
• Python Example of Machine Learning
• Introduction to Deep Learning
• Implementation of 'Big Data' (Hadoop Ecosystem)
• Hadoop File System
• Hadoop Map Reduce
• Case Study
• More advanced implementations of Big Data
3. The Learning Problem
The essence of ML:
1. We have data
2. Patterns exist in data
3. We cannot write an explicit mathematical formula for it
(the formula is not known yet)
Examples:
Movie Rating
Credit Approval
Handwriting Recognition
Domain Areas
Computer Vision
Natural Language Processing
Business Intelligence
4. Components of Learning
Example in Banking: Credit Card Approval
Input : x (customer application)
Output : y (good/bad customer)
Unknown Target Function f : X → Y
Dataset {x, y} (customers record database)
Hypothesis Set H : X → Y
Final Hypothesis g
Learning Model = Hypothesis Set + Learning Algorithm
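The credit-approval setup above can be sketched in code. This is a minimal sketch, assuming scikit-learn is available; the two features and the synthetic data are purely illustrative, not a real credit dataset:

```python
# Hypothesis set H: all linear decision boundaries over the inputs.
# Learning algorithm: logistic regression fitting, which selects the
# final hypothesis g from H using the dataset {x, y}.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# x: illustrative customer features (e.g. income, debt); y: good (1) / bad (0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] - X[:, 1] > 0).astype(int)  # stand-in for the unknown target f

model = LogisticRegression()      # hypothesis set H + learning algorithm
g = model.fit(X, y)               # final hypothesis g
print(g.predict([[1.0, -1.0]]))   # decision for a new applicant
```

The learned model `g` is one concrete hypothesis picked out of `H` by the algorithm, which is exactly the "Learning Model = Hypothesis Set + Learning Algorithm" decomposition above.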
5. Machine Learning Model
• Input data: spatial data (text, images) as {x, y}, or sequence/time-series data as {x, t}
• Model type: a classifier (outputs class scores) or a regression model (outputs continuous values)
6. Main Paradigms
Automatic discovery of patterns in data through computer algorithms and the use of
those patterns to take actions such as classifying or clustering the data into
categories.
Supervised Learning: Learning by labeled example
E.g. An email spam detector
We have (input, correct output), and we can predict (new input, predicted output)
Amazingly effective if you have lots of data
Unsupervised Learning: Discovering Patterns
E.g. Data clustering
Instead of (input, correct output), we get (input, ?)
Difficult in practice, but useful when we lack labeled data
Reinforcement Learning: Feedback & Error
E.g. Learning to play chess
Instead of (input, correct output), we get (input, only some output, grade of this output)
Works well in some domains, becoming more important
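The supervised/unsupervised contrast above can be shown on the same toy data. A minimal sketch assuming scikit-learn; the two well-separated synthetic clusters are illustrative only:

```python
# Supervised vs. unsupervised learning on the same toy data.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)),    # group A around (0, 0)
               rng.normal(5, 0.5, (50, 2))])   # group B around (5, 5)
y = np.array([0] * 50 + [1] * 50)              # labels: only supervised sees these

# Supervised: learn from (input, correct output) pairs
clf = KNeighborsClassifier().fit(X, y)
print(clf.predict([[5.1, 4.9]]))               # predicted label for a new input

# Unsupervised: discover structure from inputs alone, i.e. (input, ?)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_[:5])                          # discovered cluster assignments
```

Note the only difference in the calls: the classifier is fitted on `(X, y)`, the clusterer on `X` alone.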
7. What/why is Python
Python is an interpreted, high-level, general-purpose programming
language.
Its high-level built-in data structures, combined with dynamic typing and
dynamic binding, make it very attractive for Rapid Application Development,
as well as for use as a scripting or glue language to connect existing
components together. Python's simple, easy to learn syntax emphasizes
readability and therefore reduces the cost of program maintenance. Python
supports modules and packages, which encourages program modularity and
code reuse. The Python interpreter and the extensive standard library are
available in source or binary form without charge for all major platforms, and
can be freely distributed.
8. Machine Learning with Python
• We need Python 2.7.x or 3.7.x
• Libraries, ex.:
• numpy (fundamental package for scientific computing with Python)
• matplotlib (plotting library for the Python programming language and its
numerical mathematics extension NumPy)
• pandas (software library written for the Python programming language for
data manipulation and analysis)
• seaborn (Python data visualization library based on matplotlib)
• sklearn (Scikit-learn is a machine learning library for the Python programming
language)
• IDE, ex: pycharm
• Alternatively, install Anaconda (a distribution of the Python programming
language for scientific computing: data science, machine learning applications,
large-scale data processing, predictive analytics, etc.)
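The libraries listed above fit together in a short end-to-end flow. A minimal sketch, assuming numpy, pandas, and scikit-learn are installed; the tiny table of numbers is made up for illustration:

```python
# pandas for data handling, scikit-learn for the model, numpy underneath.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative dataset (not real data)
df = pd.DataFrame({
    "income": [20, 35, 50, 65, 80, 95, 30, 70],
    "debt":   [15, 30, 10,  5, 40,  8, 25, 12],
    "good":   [ 0,  0,  1,  1,  0,  1,  0,  1],
})

# Split into training and test sets, then fit a simple tree classifier
X_train, X_test, y_train, y_test = train_test_split(
    df[["income", "debt"]], df["good"], test_size=0.25, random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("accuracy:", clf.score(X_test, y_test))
```

matplotlib/seaborn would enter the same flow for plotting (e.g. `df.plot()` or `seaborn.pairplot(df)`), omitted here to keep the sketch short.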
9. Machine Learning with Python
• Python 3 installation
• Introduction to pip (python package installer)
• Install PyCharm
• or install Anaconda
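On a typical system, the installation steps above might look like the following; the package names are the standard PyPI ones, and the exact commands can differ per OS and Python setup:

```shell
# Check the interpreter and pip, then install the hands-on libraries
python3 --version
python3 -m pip --version
python3 -m pip install numpy pandas matplotlib seaborn scikit-learn
```

With Anaconda, the equivalent would be `conda install` of the same packages, and PyCharm can then be pointed at that interpreter.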
15. Introduction to Deep Learning
• Deep learning has produced good results for a few applications
such as computer vision, language translation, image captioning,
audio transcription, molecular biology, speech recognition,
natural language processing, self-driving cars, brain tumour
detection, real-time speech translation, music composition,
automatic game playing and so on.
• Deep learning is the next big leap after machine learning with a
more advanced implementation. Currently, it is heading towards
becoming an industry standard bringing a strong promise of
being a game changer when dealing with raw unstructured data.
16. Introduction to Deep Learning
• Deep learning is currently one of the best solution providers for a wide
range of real-world problems. Developers are building AI programs that,
instead of using previously given rules, learn from examples to solve
complicated tasks. With deep learning being used by many data scientists,
deeper neural networks are delivering results that are ever more accurate.
• The idea is to develop deep neural networks by increasing the number of
training layers for each network; machine learns more about the data until
it is as accurate as possible. Developers can use deep learning techniques
to implement complex machine learning tasks, and train AI networks to
have high levels of perceptual recognition.
17. Introduction to Deep Learning
• Deep learning is especially popular in computer vision. One task it
achieves is image classification, where input images are assigned
the class or label (cat, dog, etc.) that best describes them. We
humans learn this task very early in our lives and quickly acquire
the skills of recognizing patterns, generalizing from prior
knowledge, and adapting to different image environments.
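For a small, self-contained taste of image classification, here is a sketch using scikit-learn's bundled 8×8 digit images with a classical model (not a deep network, but the same task shape: image in, label out):

```python
# Classify tiny 8x8 grayscale digit images into labels 0-9.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

digits = load_digits()                       # 1797 images with labels 0-9
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, random_state=0)

clf = SVC(gamma=0.001).fit(X_train, y_train) # learn pixel patterns per digit
print("test accuracy:", clf.score(X_test, y_test))
```

A deep network would replace the SVC with stacked layers that learn the features themselves, which is what the TensorFlow sections below move toward.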
19. Deep Learning with TensorFlow
• Google's TensorFlow is a Python library, and a great choice for
building commercial-grade deep learning applications.
• TensorFlow grew out of DistBelief V2, an earlier library that was part of
the Google Brain project. It aims to extend the portability of
machine learning so that research models can be applied to
commercial-grade applications.
• Much like the Theano library, TensorFlow is based on computational
graphs, where a node represents persistent data or a math operation and
edges represent the flow of data between nodes. The data flowing along an
edge is a multidimensional array, or tensor; hence the name TensorFlow.
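The node/edge picture above can be seen in a few lines. A minimal sketch, assuming TensorFlow 2.x is installed:

```python
# Nodes hold data or operations; the edges between them carry tensors.
import tensorflow as tf

a = tf.constant([[1.0, 2.0]])     # data node: a 1x2 tensor
b = tf.constant([[3.0], [4.0]])   # data node: a 2x1 tensor
c = tf.matmul(a, b)               # operation node consuming both tensors

print(c.numpy())                  # [[11.]]  (1*3 + 2*4)
```

In TF 2.x the graph is built and executed eagerly; wrapping the computation in `@tf.function` would trace it into an explicit graph as in the description above.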
20. Deep Learning Implementation with
Tensorflow and Python
• Preparation (Python + libraries)
• Installing Tensorflow
• Running Several Tensorflow built-in example, ex.:
• Regression
• Image Classification
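As an illustration of the regression item above, here is a minimal sketch (assuming TensorFlow 2.x with Keras) that fits a one-neuron network, i.e. plain linear regression, on synthetic data; the target function y = 2x + 1 is made up for the demo:

```python
# Fit y ~ 2x + 1 with a single Dense neuron (linear regression).
import numpy as np
import tensorflow as tf

x = np.linspace(-1, 1, 64).astype("float32").reshape(-1, 1)
y = 2 * x + 1

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.1), loss="mse")
history = model.fit(x, y, epochs=100, verbose=0)

print("final loss:", history.history["loss"][-1])
print("prediction at x=0.5:", model.predict(np.array([[0.5]]), verbose=0))
```

The image-classification examples follow the same compile/fit pattern, just with image inputs and a softmax output layer.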
22. Hadoop
Hadoop is:
• scalable
• a "framework"
• not a drop-in replacement for an RDBMS
• great for pipelining massive amounts of data to achieve the
end result
34. Hadoop Map Reduce
• MapReduce is a processing technique and a programming model for distributed computing
based on Java. The MapReduce algorithm contains two important tasks, namely Map
and Reduce. Map takes a set of data and converts it into another set of data, where
individual elements are broken down into tuples (key/value pairs). The reduce task then
takes the output from a map as its input and combines those data tuples into a
smaller set of tuples. As the sequence of the name MapReduce implies, the reduce task
is always performed after the map job.
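The map and reduce steps just described can be simulated in plain Python (no Hadoop required) with word count, the canonical MapReduce demo:

```python
from collections import defaultdict

def mapper(line):
    """Map: break a line into (key, value) tuples -> (word, 1)."""
    return [(word, 1) for word in line.split()]

def reducer(pairs):
    """Reduce: combine the values for each key into a smaller set of tuples."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

lines = ["big data big ideas", "big data tools"]
mapped = [kv for line in lines for kv in mapper(line)]  # shuffle phase omitted
counts = reducer(mapped)
print(counts)  # {'big': 3, 'data': 2, 'ideas': 1, 'tools': 1}
```

On a real cluster the framework runs many mappers in parallel, groups the pairs by key (the shuffle), and runs reducers per key; the logic per record is the same.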
35. Hadoop Map Reduce
• The MapReduce framework operates on <key, value> pairs, that is, the framework views the input to the job
as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably
of different types.
• The key and value classes have to be serializable by the framework and hence need to
implement the Writable interface. Additionally, the key classes have to implement the WritableComparable
interface to facilitate sorting by the framework. Input and output types of a MapReduce job: (Input) <k1, v1>
→ map → <k2, v2> → reduce → <k3, v3> (Output).
39. Hadoop Map Reduce
• Run on HadoopMR
input file from local or HDFS
mapper application (see prev. slide)
reducer application (see prev. slide)
*mapper and reducer apps can be written in Python, R, Java, Scala, etc.
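With Hadoop Streaming, the mapper and reducer are just programs that read stdin and write tab-separated key/value lines to stdout. A minimal sketch of both roles in one file; the shuffle/sort phase that Hadoop runs between them is simulated here with `sorted()`:

```python
from io import StringIO

def run_mapper(stream):
    """Emit 'word<TAB>1' for every word, as a Streaming mapper would."""
    for line in stream:
        for word in line.split():
            yield f"{word}\t1"

def run_reducer(lines):
    """Sum counts per key, assuming input is sorted by key (Hadoop's shuffle)."""
    current, total = None, 0
    for line in lines:
        key, value = line.rsplit("\t", 1)
        if key != current and current is not None:
            yield f"{current}\t{total}"
            total = 0
        current = key
        total += int(value)
    if current is not None:
        yield f"{current}\t{total}"

input_data = StringIO("big data\nbig tools\n")
mapped = sorted(run_mapper(input_data))   # simulate the shuffle/sort phase
for out in run_reducer(mapped):
    print(out)                            # big  2 / data  1 / tools  1
```

On a real cluster these two functions would live in separate `mapper.py` and `reducer.py` scripts passed to the Hadoop Streaming jar, with HDFS paths as input and output.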
41. Hadoop Map Reduce
• MapReduce is not magic; it's a method.
• MapReduce is not always about big data (e.g., estimating the value of pi).
• MapReduce is not a silver bullet (e.g., batch vs. streaming data).
• MapReduce is usually applied to:
• batch processing flows
• unstructured/semi-structured data
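The "find pi" remark above refers to Monte Carlo estimation, a classic example of the MapReduce shape applied to data that is not big at all; a plain-Python sketch:

```python
import random
from functools import reduce

random.seed(0)  # fixed seed so the run is reproducible

def mapper(_):
    """Map: one random point in the unit square -> 1 if inside the quarter circle."""
    x, y = random.random(), random.random()
    return 1 if x * x + y * y <= 1.0 else 0

n = 100_000
hits = reduce(lambda a, b: a + b, map(mapper, range(n)))  # reduce: sum the hits
pi_estimate = 4 * hits / n
print(pi_estimate)  # close to 3.14159
```

The map step is trivially parallel (each point is independent) and the reduce step is a sum, which is why this toy problem distributes so naturally.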
43. Data Stream
Why Stream Processing?
• Processing unbounded data sets, or "stream processing", is a new way
of looking at what has always been done as batch in the past. Whilst
intra-day ETL and frequent batch executions have brought latencies
down, they are still independent executions with optional bespoke code
in place to handle intra-batch accumulations. With a platform such as
Spark Streaming we have a framework that natively supports
processing both within-batch and across-batch (windowing).
• By taking a stream processing approach we can benefit in several ways.
The most obvious is reducing latency between an event occurring and
taking an action driven by it, whether automatic or via analytics
presented to a human. Other benefits include a more smoothed out
resource consumption profile.
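The within-batch vs. across-batch (windowing) idea can be illustrated without Spark. A minimal plain-Python sketch of a tumbling-window count over a stream of events; the event times, event types, and 10-second window are made up for illustration:

```python
from collections import defaultdict

def windowed_counts(events, window_seconds):
    """Group (timestamp, key) events into fixed-size (tumbling) windows."""
    windows = defaultdict(lambda: defaultdict(int))
    for ts, key in events:
        window_start = ts - (ts % window_seconds)  # which window this event falls in
        windows[window_start][key] += 1
    return {w: dict(counts) for w, counts in sorted(windows.items())}

# (timestamp_seconds, event_type) pairs arriving over time
events = [(1, "click"), (4, "click"), (7, "buy"), (11, "click"), (14, "buy")]
print(windowed_counts(events, window_seconds=10))
# {0: {'click': 2, 'buy': 1}, 10: {'click': 1, 'buy': 1}}
```

A stream processor such as Spark Streaming maintains state like this incrementally and continuously, rather than recomputing from a finite list, and also supports sliding (overlapping) windows.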
44. Introducing Spark
• Better speed compared to HadoopMR
• Minimized disk read-write (on memory processing)
• Comes with Spark Streaming (the Hadoop ecosystem later added
streaming tooling as well)
• Still in Hadoop Ecosystem
46. Simple Spark Streaming Implementation Example
• near-realtime dashboard
• data stream processing and analytics (bigger/more reliable capabilities)
• multiple channels/types of data