Remove Duplicates
Basic Spark Functionality
Spark
Spark Core
• Spark Core is the base engine for large-scale
parallel and distributed data processing. It is
responsible for:
• memory management and fault recovery
• scheduling, distributing and monitoring jobs
on a cluster
• interacting with storage systems
Spark Core
• Spark introduces the concept of an RDD (Resilient
Distributed Dataset)
• an immutable, fault-tolerant, distributed collection of objects
that can be operated on in parallel.
• contains any type of object and is created by loading an
external dataset or distributing a collection from the driver
program.
• RDDs support two types of operations:
• Transformations are operations (such as map, filter, join, union,
and so on) that are performed on an RDD and which yield a
new RDD containing the result.
• Actions are operations (such as reduce, count, first, and so
on) that return a value after running a computation on an RDD (a sketch follows).
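A minimal sketch of both kinds of operation (the collection here is illustrative):
val nums = sc.parallelize(1 to 100)  // distribute a collection from the driver program
val evens = nums.filter(_ % 2 == 0)  // transformation: lazy, only records the lineage
val doubled = evens.map(_ * 2)       // transformation: still nothing has executed
val total = doubled.reduce(_ + _)    // action: runs the computation and returns a value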
Spark DataFrames
• The DataFrames API is inspired by data frames in R and Python
(Pandas), but designed from the ground up to support
modern big data and data science applications:
• Ability to scale from kilobytes of data on a single laptop to
petabytes on a large cluster
• Support for a wide array of data formats and storage systems (see the sketch below)
• State-of-the-art optimization and code generation through the
Spark SQL Catalyst optimizer
• Seamless integration with all big data tooling and
infrastructure via Spark
• APIs for Python, Java, Scala, and R (in development via
SparkR)
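A small sketch of the "wide array of data formats" point (the paths and the age column are hypothetical):
val jsonDf = sqlContext.read.json("/tmp/people.json")      // JSON input
val pqDf = sqlContext.read.parquet("/tmp/people_parquet")  // Parquet input, same API
jsonDf.filter(jsonDf("age") > 21).show()                   // Catalyst optimizes the query either way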
Remove Duplicates
val college = sc.textFile("/Users/marksmith/TulsaTechFest2016/colleges.csv")
college: org.apache.spark.rdd.RDD[String]
val cNoDups = college.distinct
cNoDups: org.apache.spark.rdd.RDD[String]
college: RDD
“as,df,asf”
“qw,e,qw”
“mb,kg,o”
“as,df,asf”
“q3,e,qw”
“mb,kg,o”
“as,df,asf”
“qw,e,qw”
“mb,k2,o”
cNoDups: RDD
“as,df,asf”
“qw,e,qw”
“mb,kg,o”
“q3,e,qw”
“mb,k2,o”
val cRows = college.map(x => x.split(",",-1))
cRows: org.apache.spark.rdd.RDD[Array[String]]
val cKeyRows = cRows.map(x => "%s_%s_%s_%s".format(x(0),x(1),x(2),x(3)) -> x )
cKeyRows: org.apache.spark.rdd.RDD[(String, Array[String])]
college: RDD
“as,df,asf”
“qw,e,qw”
“mb,kg,o”
“as,df,asf”
“q3,e,qw”
“mb,kg,o”
“as,df,asf”
“qw,e,qw”
“mb,k2,o”
cRows: RDD
Array(as,df,asf)
Array(qw,e,qw)
Array(mb,kg,o)
Array(as,df,asf)
Array(q3,e,qw)
Array(mb,kg,o)
Array(as,df,asf)
Array(qw,e,qw)
Array(mb,k2,o)
cKeyRows: RDD
key->Array(as,df,asf)
key->Array(qw,e,qw)
key->Array(mb,kg,o)
key->Array(as,df,asf)
key->Array(q3,e,qw)
key->Array(mb,kg,o)
key->Array(as,df,asf)
key->Array(qw,e,qw)
key->Array(mb,k2,o)
val cGrouped = cKeyRows
.groupBy(x => x._1)
.map(x => (x._1,x._2.to[scala.collection.mutable.ArrayBuffer]))
cGrouped: org.apache.spark.rdd.RDD[(String, scala.collection.mutable.ArrayBuffer[(String, Array[String])])]
val cDups = cGrouped.filter(x => x._2.length > 1)
cKeyRows: RDD
key->Array(as,df,asf)
key->Array(qw,e,qw)
key->Array(mb,kg,o)
key->Array(as,df,asf)
key->Array(q3,e,qw)
key->Array(mb,kg,o)
key->Array(as,df,asf)
key->Array(qw,e,qw)
key->Array(mb,k2,o)
cGrouped: RDD
key->Array(as,df,asf)
Array(as,df,asf)
Array(as,df,asf)
key->Array(mb,kg,o)
Array(mb,kg,o)
key->Array(mb,k2,o)
key->Array(qw,e,qw)
Array(qw,e,qw)
key->Array(q3,e,qw)
val cDups = cGrouped.filter(x => x._2.length > 1)
cDups: org.apache.spark.rdd.RDD[(String, scala.collection.mutable.ArrayBuffer[(String, Array[String])])]
val cNoDups = cGrouped.map(x => x._2(0)._2)
cNoDups: org.apache.spark.rdd.RDD[Array[String]]
cGrouped: RDD
key->Array(as,df,asf)
Array(as,df,asf)
Array(as,df,asf)
key->Array(mb,kg,o)
Array(mb,kg,o)
key->Array(mb,k2,o)
key->Array(qw,e,qw)
Array(qw,e,qw)
key->Array(q3,e,qw)
cNoDups: RDD
“as,df,asf”
“qw,e,qw”
“mb,kg,o”
“q3,e,qw”
“mb,k2,o”
cDups: RDD
key->Array(as,df,asf)
Array(as,df,asf)
Array(as,df,asf)
key->Array(mb,kg,o)
Array(mb,kg,o)
key->Array(qw,e,qw)
Array(qw,e,qw)
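A side note on the groupBy approach above: grouping buffers every duplicate row for a key. A sketch of the same dedupe using reduceByKey, which keeps only one row per key as it goes and can scale better on skewed data:
val cNoDups2 = cKeyRows
  .reduceByKey((a, b) => a)  // keep the first row seen for each key; no group buffering
  .map(_._2)                 // drop the key, yielding RDD[Array[String]]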
Previously the RDD was Spark's primary abstraction, but currently the Spark
DataFrames API is considered the primary interaction point with Spark; the
RDD API is still available if needed.
What is partitioning in Apache Spark?
Partitioning is how Spark splits your data so that a job can use all of your hardware
resources while executing.
More partitions = more parallelism
So check the number of task slots in your hardware, i.e., how many tasks each of your
executors can handle; the partitions are spread across the executors (a sketch follows).
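A small sketch of inspecting and tuning the partition count (the numbers are illustrative):
college.partitions.size               // how many partitions the RDD currently has
val wider = college.repartition(16)   // shuffle into more partitions for more parallelism
val narrower = wider.coalesce(4)      // shrink the partition count without a full shuffle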
DataFrames
• A DataFrame has a columnar structure, and each record is a row.
• You can run statistics naturally, since it works much like SQL or a
Python/R data frame (see the example below).
• With an RDD, processing the last 7 days of data means Spark scans the
entire dataset to get the details; with a DataFrame you already have a
time column to handle the situation, so Spark won’t even look at data
older than 7 days.
• Easier to program.
• Better performance and storage in the executor heap.
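For instance, a hedged example against the demo DataFrame shown later in this deck (df, with columns UNITID and STABBR):
df.describe("UNITID").show()         // count, mean, stddev, min, max for a numeric column
df.groupBy("STABBR").count().show()  // row counts per state, SQL-style, in one line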
How does a DataFrame read less data?
• You can skip partitions while reading the data using a DataFrame.
• Using Parquet
• Skipping data using statistics (e.g., min, max)
• Using partitioning (e.g., year = 2015/month = 06…)
• Pushing predicates into storage systems (a sketch follows this list).
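A sketch of partition pruning plus predicate pushdown (the path and the year/month columns are hypothetical):
import sqlContext.implicits._
// write partitioned by year and month so readers can skip whole directories
df.write.partitionBy("year", "month").parquet("/tmp/events_parquet")
// the year filter prunes partitions and is pushed down into the Parquet scan
val recent = sqlContext.read.parquet("/tmp/events_parquet").filter($"year" === 2015)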
What is Parquet?
• Parquet should be the source format for any operation or ETL: if the data
is in a different format, the preferred approach is to convert the source
to Parquet and then process it.
• If a dataset is in JSON or a comma-separated file, first ETL it to
convert it to Parquet (a sketch follows this list).
• It limits I/O, so it scans/reads only the columns that are needed.
• Parquet is columnar-layout based, so it compresses better and saves space.
• Conceptually, Parquet stores the first column’s values together, then the
next, and so on; so if a table has 3 columns and a SQL query touches only
2 of them, Parquet won’t even consider reading the 3rd.
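A minimal sketch of that CSV-to-Parquet ETL step, reusing the spark-csv reader from the demo (the output path is hypothetical):
val csvDf = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true").option("inferSchema", "true")
  .load("/Users/marksmith/TulsaTechFest2016/colleges.csv")
csvDf.write.parquet("/tmp/colleges_parquet")  // columnar, compressed copy for later queries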
Demo RDD Code
val college = sc.textFile("/Users/marksmith/TulsaTechFest2016/CollegeScoreCard.csv")
college.count
res2: Long = 7805
val collegeNoDups = college.distinct
collegeNoDups.count
res3: Long = 7805
val college = sc.textFile("/Users/marksmith/TulsaTechFest2016/colleges.csv")
college: org.apache.spark.rdd.RDD[String] = /Users/marksmith/TulsaTechFest2016/colleges.csv MapPartitionsRDD[17] at textFile at <console>:27
val cNoDups = college.distinct
cNoDups: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[20] at distinct at <console>:29
cNoDups.count
res7: Long = 7805
college.count
res8: Long = 9000
val cRows = college.map(x => x.split(",",-1))
cRows: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[21] at map at <console>:29
val cKeyRows = cRows.map(x => "%s_%s_%s_%s".format(x(0),x(1),x(2),x(3)) -> x )
cKeyRows: org.apache.spark.rdd.RDD[(String, Array[String])] = MapPartitionsRDD[22] at map at <console>:31
cKeyRows.take(2)
res11: Array[(String, Array[String])] = Array((UNITID_OPEID_opeid6_INSTNM,Array(
val cGrouped = cKeyRows.groupBy(x => x._1).map(x => (x._1,x._2.to[scala.collection.mutable.ArrayBuffer]))
cGrouped: org.apache.spark.rdd.RDD[(String, scala.collection.mutable.ArrayBuffer[(String, Array[String])])] = MapPartitionsRDD[27] at map at <console>:33
val cDups = cGrouped.filter(x => x._2.length > 1)
cDups: org.apache.spark.rdd.RDD[(String, scala.collection.mutable.ArrayBuffer[(String, Array[String])])] = MapPartitionsRDD[28] at filter at <console>:35
cDups.count
res12: Long = 1195
val cNoDups = cGrouped.map(x => x._2(0))
cNoDups: org.apache.spark.rdd.RDD[(String, Array[String])] = MapPartitionsRDD[29] at map at <console>:35
cNoDups.count
res13: Long = 7805
val cNoDups = cGrouped.map(x => x._2(0)._2)
cNoDups: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[30] at map at <console>:35
cNoDups.take(5)
16/08/04 16:44:24 ERROR Executor: Managed memory leak detected; size = 41227428 bytes, TID = 28
res16: Array[Array[String]] = Array(Array(145725, 00169100, 001691, Illinois Institute of Technology, Chicago, IL, www.iit.edu, npc.collegeboard.org/student/app/iit, 0, 3, 2, 11, 0, 0, 0, 0, 0, 0, 0, 0, 0,
NULL, 520, 640, 640, 740, 520, 640, 580, 690, 580, 25, 30, 24, 31, 26, 33, NULL, NULL, 28, 28, 30, NULL, 1252, 1252, 0, 0, 0.2026, 0, 0, 0, 0.1225, 0, 0, 0.4526, 0.0245
Demo DataFrame Code
import org.apache.spark.sql.SQLContext
val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("/Users/marksmith/TulsaTechFest2016/colleges.csv")
df: org.apache.spark.sql.DataFrame = [UNITID: int, OPEID: int, opeid6: int, INSTNM: string, CITY: string, STABBR: string, INSTURL: string, NPCURL: string, HCM2: int, PREDDEG: int, CONTROL: int, LOCALE: string, HBCU: string, PBI: string, ANNHI:
string, TRIBAL: string, AANAPII: string, HSI: string, NANTI: string, MENONLY: string, WOMENONLY: string, RELAFFIL: string, SATVR25: string, SATVR75: string, SATMT25: string, SATMT75: string, SATWR25: string, SATWR75: string, SATVRMID: string,
SATMTMID: string, SATWRMID: string, ACTCM25: string, ACTCM75: string, ACTEN25: string, ACTEN75: string, ACTMT25: string, ACTMT75: string, ACTWR25: string, ACTWR75: string, ACTCMMID: string, ACTENMID: string, ACTMTMID: string,
ACTWRMID: string, SAT_AVG: string, SAT_AVG_ALL: string, PCIP01: string, PCIP03: stri...
val dfd = df.distinct
dfd.count
res0: Long = 7804
df.count
res1: Long = 8998
val dfdd = df.dropDuplicates(Array("UNITID", "OPEID", "opeid6", "INSTNM"))
dfdd.count
res2: Long = 7804
val dfCnt = df.groupBy("UNITID", "OPEID", "opeid6", "INSTNM").agg(count("UNITID").alias("cnt"))
res8: org.apache.spark.sql.DataFrame = [UNITID: int, OPEID: int, opeid6: int, INSTNM: string, cnt: bigint]
dfCnt.show
+--------+-------+------+--------------------+---+
| UNITID| OPEID|opeid6| INSTNM|cnt|
+--------+-------+------+--------------------+---+
|10236801| 104703| 1047|Troy University-P...| 2|
|11339705|3467309| 34673|Marinello Schools...| 2|
| 135276| 558500| 5585|Lively Technical ...| 2|
| 145682| 675300| 6753|Illinois Central ...| 2|
| 151111| 181300| 1813|Indiana Universit...| 1|
df.registerTempTable("colleges")
val dfCnt2 = sqlContext.sql("select UNITID, OPEID, opeid6, INSTNM, count(UNITID) as cnt from colleges group by UNITID, OPEID, opeid6, INSTNM")
dfCnt2: org.apache.spark.sql.DataFrame = [UNITID: int, OPEID: int, opeid6: int, INSTNM: string, cnt: bigint]
dfCnt2.show
+--------+-------+------+--------------------+---+
| UNITID| OPEID|opeid6| INSTNM|cnt|
+--------+-------+------+--------------------+---+
|10236801| 104703| 1047|Troy University-P...| 2|
|11339705|3467309| 34673|Marinello Schools...| 2|
| 135276| 558500| 5585|Lively Technical ...| 2|
| 145682| 675300| 6753|Illinois Central ...| 2|
| 151111| 181300| 1813|Indiana Universit...| 1|
| 156921| 696100| 6961|Jefferson Communi...| 1|