5. What Do We Need?
• Spark as the data-processing engine on the
cluster; originally written in Scala, which allows
concise function syntax and interactive use
• Mesos as the cluster manager
• ZooKeeper as a highly reliable distributed
coordinator
• HDFS as distributed storage
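To make the stack concrete, here is a minimal sketch of a standalone Scala program wired to all four pieces; the ZooKeeper host names are hypothetical placeholders, and the HDFS path reuses the deck's example file.

import org.apache.spark.{SparkConf, SparkContext}

// Connect to a Mesos master registered in ZooKeeper
// (zk1/zk2 are hypothetical hosts).
val conf = new SparkConf()
  .setAppName("StackDemo")
  .setMaster("mesos://zk://zk1:2181,zk2:2181/mesos")
val sc = new SparkContext(conf)

// Read from HDFS, the distributed storage layer.
val lines = sc.textFile("hdfs://localhost/test/tobe.txt")
println(lines.count())
sc.stop()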
6. What Do We Need?
• Pure functions
• Atomic operations
• Parallel patterns or skeletons
• Lightweight algorithms
The only thing that works for parallel programming
is functional programming.
--Carnegie Mellon Professor Bob Harper
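As a small illustration of why purity matters (the function names here are mine, not from the deck): a pure function can be mapped over a collection in any order, on any machine, while an impure one races on shared state.

// Pure: the result depends only on the input, so calls can run
// in parallel with no locks -- exactly what Spark exploits.
def square(x: Int): Int = x * x

// Impure: reads and writes shared mutable state, so parallel
// calls would race and the result depends on execution order.
var total = 0
def addToTotal(x: Int): Int = { total += x; total }

List(1, 2, 3).map(square)      // always List(1, 4, 9)
List(1, 2, 3).map(addToTotal)  // order-dependent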
8. FP Quick Tour In Scala
• Creating collections:
val array = new Array[Int](10)
val list = List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
• Indexing:
array(0) = 1
println(list(0))
• Anonymous functions:
val multiply = (x: Int, y: Int) => x * y
val procedure = { x: Int =>
println("Hello, " + x)
println(x * 10)
}
9. FP Quick Tour In Scala
• Scala closure syntax:
(x: Int) => x * 10 // full version
x => x * 10 // type inference
_ * 10 // underscore syntax
x => { // body is block of code
val y = 10
x * y
}
10. FP Quick Tour In Scala
• Processing collections:
val list = List(1, 2, 3, 4, 5, 6, 7, 8, 9)
list.foreach(x => println(x))
list.map(_ * 10)
list.filter(x => x % 2 == 0)
list.reduce((x, y) => x + y)
list.reduce(_ + _)
def f(x: Int) = List(x-1, x, x+1)
list.map(x => f(x))
list.map(f(_))
list.flatMap(x => f(x))
list.map(x => f(x)).reduce(_ ++ _)
11. Spark Quick Tour
• Spark context:
• Entry point to Spark functionality
• In spark-shell, created automatically as sc
• In a standalone Spark program, we must create it ourselves
• Resilient distributed datasets (RDDs):
• A distributed memory abstraction
• A logically centralized entity, but physically partitioned across multiple
machines in a cluster, based on some notion of key
• Immutable
• Automatically rebuilt on failure
• Cached partitions evicted with an LRU (Least Recently Used) policy
(see the sketch below)
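A short sketch of these properties in spark-shell (the file is the deck's example; asking for 4 partitions is my assumption):

// Partitioned: ask for 4 partitions, spread across the cluster.
val rdd = sc.textFile("hdfs://localhost/test/tobe.txt", 4)
println(rdd.partitions.length) // 4

// Immutable: map returns a new RDD; rdd itself never changes.
val upper = rdd.map(_.toUpperCase)

// If a cached partition is evicted or lost, Spark rebuilds it
// from the lineage of transformations rather than replicating data.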
14. Spark Quick Tour
• Transformations:
• Lazy operations to build RDDs from other RDDs
• Narrow transformations (involve no data shuffling):
• map
• flatMap
• filter
• Wide transformations (involve data shuffling):
• sortByKey
• reduceByKey
• groupByKey
• Actions:
• Return a result or write it to storage
• collect
• count
• take(n)
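The distinction shows up directly in code; a minimal sketch (the sample words are mine): nothing runs until the action, and only reduceByKey forces a shuffle.

val words = sc.parallelize(Seq("to", "be", "or", "not", "to", "be"))
val pairs = words.map(w => (w, 1))     // narrow: no shuffle
val counts = pairs.reduceByKey(_ + _)  // wide: shuffles data by key
counts.collect()                       // action: triggers the lazy
                                       // transformations above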
16. Spark Quick Tour
• Creating RDDs:
val numbers = sc.parallelize(List(1, 2, 3, 4, 5))
val textFile = sc.textFile("hdfs://localhost/test/tobe.txt")
val textFile = sc.textFile("hdfs://localhost/test/*.txt")
• Basic transformations:
val squares = numbers.map(x => x * x)
val evens = squares.filter(_ % 2 == 0)
val mapto = numbers.flatMap(x => 1 to x)
val words = textFile.flatMap(_.split(" ")).cache()
(Slide callouts: parallelize turns a local collection into a base RDD;
map, filter, and flatMap produce transformed RDDs.)
17. Spark Quick Tour
• Basic actions:
words.collect()
words.take(5)
words.count()
words.reduce(_ + _)
words.filter(_ == "be").count()
words.filter(_ == "or").count()
words.saveAsTextFile("hdfs://localhost/test/result")
(Slide callout: because words was cached, the repeated filter-and-count
actions reuse the in-memory data instead of re-reading the file from HDFS.)
18. Spark Quick Tour
• Pair syntax:
val pair = ("cat", 1)
• Accessing pair elements:
pair._1
pair._2
• Key-value operations:
val pets = sc.parallelize(List(("cat", 1), ("dog", 2), ("cat", 3)))
pets.reduceByKey(_ + _)
pets.groupByKey()
pets.sortByKey()
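A sketch of what collect() would return for the pets RDD above (exact ordering across partitions may vary):

pets.reduceByKey(_ + _).collect()  // sums per key: (cat,4), (dog,2)
pets.groupByKey().collect()        // values grouped per key: (cat,[1, 3]), (dog,[2])
pets.sortByKey().collect()         // ordered by key: (cat,1), (cat,3), (dog,2)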
19. Hello World
val logFile = "hdfs://localhost/test/tobe.txt"
val logData = sc.textFile(logFile).cache()
val wordCount = logData.flatMap(_.split(" "))
.map((_, 1))
.reduceByKey(_ + _)
wordCount.saveAsTextFile("hdfs://localhost/wordcount/result")
sc.stop()