This slide deck was presented by Jony Sugianto at the "Pengenalan & Potensi Big Data & Machine Learning" (Introduction & Potential of Big Data & Machine Learning) seminar and workshop organized by KUDO on 14 May 2016.
What is Big Data?
1. Big Data
Dipl. Inform.(FH) Jony Sugianto, M. Comp. Sc.
HP: 0838-98355491
WA: 0812-13086659
Email: jonysugianto@gmail.com
2. Agenda
● What is Big Data?
● Analytic
● Big Data Platforms
● Questions
3. What is Big Data?
● The basic idea behind the phrase Big Data is that everything we do increasingly leaves a digital trace (or data), which we (and others) can use and analyse
● Big Data therefore refers to our ability to make use of these ever-increasing volumes of data
● Big Data is not about the size of the data; it is about the value within the data
4. Datafication of the world
● Activities
- Web Browser
- Credit Cards
- E-Commerce
● Conversations
- WhatsApp
- Email
- Twitter
● Photos/Videos
- Instagram
- YouTube
● Sensors
- GPS
● Etc...
5. Turning Big Data into Value
Datafication of our world
● Activity
● Conversation
● Sensors
● Photo/Video
● Etc...
→ Analysing Big Data
● Text Analytics
● Sentiment Analysis
● Movement Analytics
● Face/Voice Recognition
● Etc...
→ Value
6. Webdata
● Log Data (all users)
- Anonymous ID from cookie data
- LoginID (if it exists)
- ArticleID
- Channel / Category
- Browser
- IP
- Etc...
● Registered User Data (10%)
- Login ID
- Name
- Age
- Gender
- Education
- Etc...
7. Valuable data
● User activeness
● User interests based on reading behaviour
● Personal profiles for all users
16. How to define the similarity?
● Linear: |x1 − x2|
● Square: (x1 − x2)^2
● Exponential: 10^f(|x1 − x2|)
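As a rough sketch, the three measures above could be implemented as follows. The function f in the exponential variant is not specified on the slide, so the identity f(d) = d is assumed here.

```python
# Sketch of the three similarity (distance) measures from the slide.
def linear(x1, x2):
    return abs(x1 - x2)

def square(x1, x2):
    return (x1 - x2) ** 2

# f is not specified on the slide; the identity function is assumed.
def exponential(x1, x2, f=lambda d: d):
    return 10 ** f(abs(x1 - x2))

print(linear(3, 5))       # 2
print(square(3, 5))       # 4
print(exponential(3, 5))  # 100
```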
17. Complexity Analysis
● Assume 30,000,000 clicks a day
● A week: 210,000,000 clicks
● Size of a log entry: 1 KB
● Total size: 210,000,000,000 bytes = 210 GB
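The data-volume estimate above can be checked with a few lines of arithmetic (decimal units, 1 KB = 1,000 bytes):

```python
# Back-of-the-envelope check of the weekly log volume.
clicks_per_day = 30_000_000
clicks_per_week = clicks_per_day * 7           # 210,000,000
entry_size_bytes = 1_000                       # 1 KB per log entry
total_bytes = clicks_per_week * entry_size_bytes

print(clicks_per_week)             # 210000000
print(total_bytes)                 # 210000000000
print(total_bytes / 10**9, "GB")   # 210.0 GB
```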
18. Complexity Analysis
● All users: 10,000,000
● Login users: 1,000,000
● Comparisons per second per CPU: 1,000,000
● Total comparisons: 9,000,000,000,000 (9,000,000 anonymous users × 1,000,000 login users)
● Total time: 9,000,000 seconds ≈ 104 days
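The totals on this slide can be reproduced directly, assuming the 9 trillion comparisons come from matching each of the 9,000,000 anonymous users against all 1,000,000 login users:

```python
# Reproducing the comparison-count and runtime estimate from the slide.
all_users = 10_000_000
login_users = 1_000_000
anonymous_users = all_users - login_users      # 9,000,000

comparisons = anonymous_users * login_users    # 9 * 10**12
cmp_per_sec_per_cpu = 1_000_000
seconds = comparisons // cmp_per_sec_per_cpu   # 9,000,000 s on one CPU
days = seconds / 86_400                        # 86,400 s per day

print(comparisons)     # 9000000000000
print(seconds)         # 9000000
print(round(days, 1))  # 104.2
```

This is exactly the kind of workload that motivates the distributed platforms on the following slides: a single CPU needs over three months, while the work is trivially parallel.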
21. What is Hadoop?
● Hadoop:
an open-source framework that supports data-intensive distributed applications, licensed under the Apache v2 license
● Goals:
- Abstract and facilitate the storage and processing of large and/or rapidly growing data sets
- High scalability and availability
- Use commodity hardware
- Fault tolerance
- Move computation rather than data
22. Hadoop Components
● Hadoop Distributed File System (HDFS)
A distributed file system that provides high-throughput access to application data
● Hadoop YARN
A framework for job scheduling and cluster resource management
● Hadoop MapReduce
A YARN-based system for parallel processing of large data sets
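The MapReduce model itself can be illustrated without a cluster. The following is a minimal pure-Python sketch of the map → shuffle → reduce phases applied to word counting; it is an illustration of the model, not the Hadoop API:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in the input.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_phase(pairs):
    # Shuffle: group all emitted values by their key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["hi", "how are you", "hi"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)  # {'hi': 2, 'how': 1, 'are': 1, 'you': 1}
```

In Hadoop, the map and reduce functions run on many machines in parallel and the shuffle happens over the network, but the data flow is the same.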
23. What is Hive?
● Hive is a data warehouse infrastructure built on top of Hadoop
● Hive stores its data in HDFS
● Hive compiles SQL queries into MapReduce jobs
25. What is Pig?
● Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs
● Pig generates and compiles MapReduce programs on the fly
27. What is Spark?
● A fast and general-purpose cluster computing system
● 10x (on disk) to 100x (in-memory) faster than Hadoop MapReduce
● Provides high-level APIs in
- Scala
- Java
- Python
● Can be deployed through Apache Mesos, Apache Hadoop via YARN, or Spark's own cluster manager
28. Resilient Distributed Datasets
● Written in Scala
● The fundamental unit of data in Spark
● A distributed collection of objects
● Resilient: the ability to recompute missing partitions (node failure)
● Distributed: split across multiple partitions
● Dataset: can contain values of any type — Scala/Java/Python objects or user-defined objects
● Operations
- Transformations (map, filter, groupBy, ...)
- Actions (count, collect, save, ...)
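The key point about the two kinds of operations is that transformations are lazy: they only describe a computation, and nothing runs until an action asks for a result. A loose pure-Python analogy using generators (this is an analogy, not the Spark API):

```python
# Transformations in Spark build a plan; nothing executes until an action.
# Python generators behave analogously: they are lazy until consumed.
data = ["hi", "how are you", "hi"]

# "Transformations": these lines compute nothing yet.
words = (w for line in data for w in line.split())
pairs = ((w, 1) for w in words)

# "Action": consuming the generator finally triggers the computation.
result = list(pairs)
print(result)  # [('hi', 1), ('how', 1), ('are', 1), ('you', 1), ('hi', 1)]
```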
29. Spark Example
// Spark wordcount
import org.apache.spark.SparkContext

object WordCount {
  def main(args: Array[String]): Unit = {
    val env = new SparkContext("local", "wordCount")
    val data = List("hi", "how are you", "hi")
    val dataSet = env.parallelize(data)
    val words = dataSet.flatMap(value => value.split("\\s+"))
    val mappedWords = words.map(value => (value, 1))
    val sum = mappedWords.reduceByKey(_ + _)
    sum.collect().foreach(println)
  }
}
30. What is Flink?
● Written in Java
● An open source platform for distributed stream and batch data processing
● Several APIs in Java/Scala/Python
- DataSet API: batch processing
- DataStream API: stream processing
- Table API: relational queries
31. Flink Example
// Flink wordcount
import org.apache.flink.api.scala._

object WordCount {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val data = List("hi", "how are you", "hi")
    val dataSet = env.fromCollection(data)
    val words = dataSet.flatMap(value => value.split("\\s+"))
    val mappedWords = words.map(value => (value, 1))
    val grouped = mappedWords.groupBy(0)
    val sum = grouped.sum(1)
    sum.collect().foreach(println)
  }
}