Distributed DataFrame (DDF)
Simplifying Big Data
For The Rest Of Us
Christopher Nguyen, PhD
Co-Founder & CEO
DATA INTELLIGENCE FOR ALL
Agenda
1. Motivation: Problem & Solution
2. DDF Features & Benefits
3. Demo
• Former Engineering Director of Google Apps (Google Founders’ Award)
• Former Professor and Co-Founder of the Computer Engineering program at HKUST
• PhD Stanford, BS U.C. Berkeley Summa cum Laude
Christopher Nguyen, PhD
Adatao Inc.
Co-Founder & CEO
Small Data
World
RHadoop
HDFS
MapReduce
Hive
HBASE
Pig
SummingBird
Crunch
Storm
Scalding
Cascading
import scala.util.Random
import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD

// SeededPartition is not shown on the slide; it is assumed to pair a parent
// partition (prev) with a per-partition seed.
class RandomSplitRDD(prev: RDD[Array[Double]], seed: Long, lower: Double,
    upper: Double, isTraining: Boolean)
  extends RDD[Array[Double]](prev) {

  // Give every parent partition its own seed, derived from the RDD-level seed.
  override def getPartitions: Array[Partition] = {
    val rg = new Random(seed)
    firstParent[Array[Double]].partitions.map(x => new SeededPartition(x, rg.nextInt))
  }

  override def getPreferredLocations(split: Partition): Seq[String] =
    firstParent[Array[Double]].preferredLocations(split.asInstanceOf[SeededPartition].prev)

  // Draw one random number per row; rows whose draw falls in [lower, upper)
  // go to the test set, all other rows go to the training set.
  override def compute(splitIn: Partition, context: TaskContext): Iterator[Array[Double]] = {
    val split = splitIn.asInstanceOf[SeededPartition]
    val rand = new Random(split.seed)
    if (isTraining) {
      firstParent[Array[Double]].iterator(split.prev, context).filter { x =>
        val z = rand.nextDouble
        z < lower || z >= upper
      }
    } else {
      firstParent[Array[Double]].iterator(split.prev, context).filter { x =>
        val z = rand.nextDouble
        lower <= z && z < upper
      }
    }
  }
}

// Produce numSplits (training, test) pairs; both RDDs of a pair share a seed,
// so they are complementary.
def randomSplit(rdd: RDD[Array[Double]], numSplits: Int,
    trainingSize: Double, seed: Long): Iterator[(RDD[Array[Double]], RDD[Array[Double]])] = {
  require(0 < trainingSize && trainingSize < 1)
  val rg = new Random(seed)
  (1 to numSplits).map(_ => rg.nextInt).map(z =>
    (new RandomSplitRDD(rdd, z, 0, 1.0 - trainingSize, true),
     new RandomSplitRDD(rdd, z, 0, 1.0 - trainingSize, false))).iterator
}
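The core idea in the code above, stripped of the RDD plumbing, can be sketched on a local collection. This is a minimal illustration, not the code from the talk: the object name RandomSplitSketch and the generic signature are assumptions. Both sides replay the same seeded random stream, so every row lands in exactly one of the two sets.

```scala
import scala.util.Random

object RandomSplitSketch {
  // Returns (training, test). A row whose draw z falls in [0, upper) goes to
  // the test set; all other rows go to the training set.
  def randomSplit[T](rows: Seq[T], trainingSize: Double, seed: Long): (Seq[T], Seq[T]) = {
    require(0 < trainingSize && trainingSize < 1)
    val data = rows.toVector        // strict, so each row gets exactly one draw
    val upper = 1.0 - trainingSize

    // Each side starts a fresh generator from the same seed, so the two
    // filters see identical draws and the resulting sets are complementary.
    def side(keep: Double => Boolean): Seq[T] = {
      val rand = new Random(seed)
      data.filter(_ => keep(rand.nextDouble()))
    }

    (side(z => z >= upper), side(z => z < upper))
  }
}
```

Calling RandomSplitSketch.randomSplit(1 to 1000, 0.8, 42L) twice yields the same pair, since the split depends only on the seed and the row order.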
Can’t It Be
Even
Simpler?
WHY
table = load("housePrices")
table.dropNA
table.train.glm("price", "sf,zip,beds,baths")
What
Happiness
Looks Like
vs.
HDFS
DAG Engine
MapReduce
Monad
FP
Bottom-Up
Top-Down
We begin to dream of a solution to fight this evil
Design Principles
Sophistication
Power & Speed
Ease & Simplicity
HDFS
Web Browser R-Studio Python
Business Analyst Data Scientist Data Engineer
API
API
API
DDFClient
PAClient
PIClient
Example:
Demo Deployment Diagram: 1 CLIENT, 1 MASTER, 4 WORKERS
Airline Dataset, 12GB, 123M rows
DDF Offers
1 Table-Like Abstraction
on Top of Big Data
DDF Offers
2 Native R Data.frame
Experience
ddf.getSummary()
ddf.getFiveNum()
ddf.binning("distance", 10)
ddf.scale(SCALE.MINMAX)
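To make the operations named above concrete, here is a rough single-machine sketch of a five-number summary, equal-width binning, and min-max scaling on one numeric column. It is not the DDF implementation: the object ColumnStatsSketch and its method names are assumptions, and the quantile rule (linear interpolation between order statistics) is only one of several conventions.

```scala
object ColumnStatsSketch {
  // Five-number summary: min, lower quartile, median, upper quartile, max.
  def fiveNum(xs: Seq[Double]): (Double, Double, Double, Double, Double) = {
    require(xs.nonEmpty)
    val s = xs.sorted.toVector
    // Quantile by linear interpolation between neighbouring order statistics.
    def q(p: Double): Double = {
      val pos = p * (s.length - 1)
      val lo = math.floor(pos).toInt
      val hi = math.ceil(pos).toInt
      s(lo) + (pos - lo) * (s(hi) - s(lo))
    }
    (s.head, q(0.25), q(0.5), q(0.75), s.last)
  }

  // Equal-width binning: maps each value to a bin index in 0 until numBins.
  def binning(xs: Seq[Double], numBins: Int): Seq[Int] = {
    require(xs.nonEmpty && numBins > 0)
    val (lo, hi) = (xs.min, xs.max)
    val width = (hi - lo) / numBins
    xs.map { x =>
      if (width == 0.0) 0                                   // constant column
      else math.min(((x - lo) / width).toInt, numBins - 1)  // max goes in last bin
    }
  }

  // Min-max scaling: rescales the column linearly onto [0, 1].
  def scaleMinMax(xs: Seq[Double]): Seq[Double] = {
    require(xs.nonEmpty)
    val (lo, hi) = (xs.min, xs.max)
    if (hi == lo) xs.map(_ => 0.0) else xs.map(x => (x - lo) / (hi - lo))
  }
}
```

On a distributed table the same quantities would have to be computed per partition and then merged, which is the part the table-level calls are meant to hide.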
DDF Offers
3 Focus on Analytics,
Not MapReduce
Data Wrangling
ddf.dropNA()
ddf.transform("speed = distance/duration")
Model Building & Validation
ddf.lm(0.1, 10)
ddf.roc(testddf)
Real-Time Scoring
manager.loadModel(uri)
lm.predict(point)
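As a small illustration of what the data-wrangling calls on this slide express, the sketch below drops rows with missing values and derives speed = distance/duration on a local collection. The Flight row type and the object WranglingSketch are hypothetical and exist only for this example; how DDF represents missing values is not stated in the slides.

```scala
object WranglingSketch {
  // Hypothetical row type; None marks a missing (NA) value.
  final case class Flight(distance: Option[Double], duration: Option[Double])

  // dropNA: keep only rows in which every field is present.
  def dropNA(rows: Seq[Flight]): Seq[Flight] =
    rows.filter(r => r.distance.isDefined && r.duration.isDefined)

  // transform: derive speed = distance / duration for each complete row,
  // skipping rows with a zero duration to avoid dividing by zero.
  def speeds(rows: Seq[Flight]): Seq[Double] =
    dropNA(rows).collect { case Flight(Some(d), Some(t)) if t != 0.0 => d / t }
}
```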
DDF Offers
4 Seamlessly Integrate
with External ML Libraries
Data + Algorithm = Intelligence
DDF Offers
5 Share DDFs
Seamlessly & Efficiently
Can I see
your data?
ddf://adatao/airline
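A URI like the one above can be taken apart with the standard java.net.URI class. The sketch below is illustrative only: reading the host part as a namespace and the path as a DDF name is an assumption, since the slides do not define the URI scheme, and DdfUriSketch is a hypothetical name.

```scala
import java.net.URI

object DdfUriSketch {
  // Splits a ddf:// URI into (namespace, name).
  // Assumption: host = namespace, path = DDF name.
  def parse(uri: String): (String, String) = {
    val u = new URI(uri)
    require(u.getScheme == "ddf", s"not a ddf URI: $uri")
    (u.getHost, u.getPath.stripPrefix("/"))
  }
}
```

For example, DdfUriSketch.parse("ddf://adatao/airline") returns ("adatao", "airline").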
DDF Offers
6 Work in Your
Preferred Language
…even Plain English!
1 Table-Like Abstraction on Top of Big Data
2 Native R Data.frame Experience
3 Focus on Analytics, Not MapReduce
4 Seamlessly Integrate with External ML Libraries
5 Share DDFs Seamlessly & Efficiently
6 Work in Your Preferred Language
DDF is to Big Data
as
SQL+R is to Small Data
www.adatao.com/ddf
DATA INTELLIGENCE FOR ALL
We’re Open Sourcing DDF!!!
1. Accept Core Committers
2. Clean up & Prep
3. Open up for public access
