SlideShare una empresa de Scribd logo
1 de 28
Descargar para leer sin conexión
FUSING APACHE SPARK AND
LUCENE FOR NEAR-REALTIME
PREDICTIVE MODEL BUILDING
Debasish Das
Principal Engineer
Verizon
Contributors
Platform: Pankaj Rastogi, Venkat Chunduru, Ponrama Jegan, Masoud Tavazoei
Algorithm: Santanu Das, Debasish Das (Dave)
Frontend: Altaff Shaik, Jon Leonhardt
Pramod Lakshmi Narasimha
Principal Engineer
Verizon
© Verizon 2016All Rights Reserved
Informationcontained herein is providedAS ISandsubjectto changewithoutnotice.All trademarks usedhereinare property of their respective owners.
Data Overview
• Location data
• Each srcIp defined as unique row key
• Provides approximate location of each key
• Timeseries containing latitude, longitude, error bound, duration, timezone for
each key
• Clickstreamdata
• Contains clickstream data of each row key
• Contains startTime, duration, httphost, httpuri, upload/download bytes,
httpmethod
• Compatible with IPFIX/Netflow formats
2
© Verizon 2016All Rights Reserved
Informationcontained herein is providedAS ISandsubjectto changewithoutnotice.All trademarks usedhereinare property of their respective owners.
Marketing Analytics
3
Lookalike modeling
Churn reduction
Competitive analysis
Increased share of
stomach
• Anonymous aggregate analysis for customer insights
© Verizon 2016All Rights Reserved
Informationcontained herein is providedAS ISandsubjectto changewithoutnotice.All trademarks usedhereinare property of their respective owners.
Data Model
• Dense dimension, dense measure
Schema: srcip, date, hour, tld, zip, tldvisits, zipvisits
Data: 10.1.13.120, d1, H2, macys.com, 94555, 2, 4
• Sparse dimension, dense measure
Schema: srcip, date, tld, zip, clickstreamvisits, zipvisits
Data: 10.1.13.120, d1, {macys.com, kohls.com}, {94555, 94301}, 10, 15
• Sparse dimension, sparse measure
Schema: srcip, date, tld, zip, tldvisits, zipvisits
Data: 10.1.13.120, d1, {macys.com, kohls.com}, {94555, 94301}, {macys.com:4, kohls.com:6}, {94555:8, 94301:7}
Schema: srcip, week, tld, zip, tldvisits, zipvisits
Data: 10.1.13.120, week1, {macys.com, kohls.com}, {94555, 94301}, {macys.com:4, kohls.com:6}, {94555:8, 94301:7}
• Sparse dimension, sparse measure, last N days
Schema: srcip, tld, zip, tldvisits, zipvisits
Data: 10.1.13.120, {macys.com, kohls.com}, {94555, 94301}, {macys.com:4, kohls.com:6}, {94555:8, 94301:7}
• Competing technologies: PowerDrill, Druid, LinkedIn Pinot, EssBase
4
© Verizon 2016All Rights Reserved
Informationcontained herein is providedAS ISandsubjectto changewithoutnotice.All trademarks usedhereinare property of their respective owners.
Document Dataset Representation
• Example
Schema: srcip, tld, zip, tldvisits, zipvisits
Data: 10.1.13.120, {macys.com,kohls.com}, {94555, 94301}, {macys.com:4, kohls.com:6}, {94555:8,94301:7}
• DataFrame row to Lucene Document mapping
5
Store/schema Row Document
srcip primary key docId
tld
zip
String
Array[String]
SingleValue/MultiValue
Indexed Fields
tldvisits
zipvisits
Double
Map[String, Double]
SparseVector
StoredField
• Distributed collection of srcIp as RDD[Document]
• ~100M srcip, 1M+ terms (sparse dimensions)
© Verizon 2016All Rights Reserved
Informationcontained herein is providedAS ISandsubjectto changewithoutnotice.All trademarks usedhereinare property of their respective owners.
DeviceAnalyzer
6
• DeviceAnalyzer goals
– Search and retrieve devices that
matched query
– Generate statistical and predictive
models on retrieved devices
© Verizon 2016All Rights Reserved
Informationcontained herein is providedAS ISandsubjectto changewithoutnotice.All trademarks usedhereinare property of their respective owners.
What is Trapezium ?
7
DAIS Open Source framework to build batch, streaming and API services
https://github.com/Verizon/trapezium
© Verizon 2016All Rights Reserved
Informationcontained herein is providedAS ISandsubjectto changewithoutnotice.All trademarks usedhereinare property of their respective owners.
Trapezium Architecture
8
Trapezium
D1
D2
D3
O1
O2
O3
Validation
D1
V1
V1
O1
D2
O2
D3
O1
VARIOUS TRANSACTIONS
© Verizon 2016All Rights Reserved
Informationcontained herein is providedAS ISandsubjectto changewithoutnotice.All trademarks usedhereinare property of their respective owners.
Lucene Overview
9
• Scalable, full-text search library
• Focus: Indexing + searching documents
© Verizon 2016All Rights Reserved
Informationcontained herein is providedAS ISandsubjectto changewithoutnotice.All trademarks usedhereinare property of their respective owners.
Trapezium LuceneDAO
• SparkSQL and MLlib optimized for full scan, column indexing not supported
• Why Spark + Lucene integration
• Lucene is battle tested Apache Licensed Open Source Project
• Adds column search capabilities to Spark
• Adds spark operators (treeAggregate, treeReduce, map) to Lucene
• LuceneDAO features
• Build distributed lucene shards from Dataframe
• Save shards to HDFS for QueryProcessor (CloudSolr)
• Access saved shards through LuceneDAO for ML pipelines
10
© Verizon 2016All Rights Reserved
Informationcontained herein is providedAS ISandsubjectto changewithoutnotice.All trademarks usedhereinare property of their respective owners.
Trapezium Batch
11
runMode = "BATCH"
dataSource = “HDFS”
dependentWorkflows= {
workflows=[aggregate]
frequencyToCheck=100
}
hdfsFileBatch = {
batchTime = 86400
timerStartDelay = 1
batchInfo = [{
name = "DeviceStore"
dataDirectory = {saiph-devqa=/aggregates}
fileFormat = "parquet"
}]
}
transactions = [{
transactionName = “DeviceIndexer”
inputData = [{name = "DeviceStore"}]
persistDataName = “indexed"
}]
© Verizon 2016All Rights Reserved
Informationcontained herein is providedAS ISandsubjectto changewithoutnotice.All trademarks usedhereinare property of their respective owners.
DeviceAnalyzer: Indexing
12
/?ref=1108&?url=http://www.macys.c
om&id=5
www.walmart.com%2Fc%2Fep%2Frange-
hood-filters&sellermemid=459
http%3A%2F%2Fm.macys.com%2Fshop%2F
product%2Fjockey-elance-cotton
/?ref=1108&?url=http://www.macys.c
om&id=5
m.amazon.com%2Fshop%2Fproduct%2Fjo
ckey-elance-cotton
https://www.walmart.com/ip/Women-
Pant-Suit-Roundtree
walmart://ip/?veh=dsn&wmlspartner
m.macys.com%2Fshop%2Fsearch%3Fkeyw
ord%3DDress
ip1, macys.com : 2
ip1, walmart.com: 1
ip1, macys.com : 1
ip2, walmart.com: 1
ip1, amazon.com: 1
ip1, macys.com : 2
ip2, walmart.com: 1
0 1 2
5 1 1
Macys, 0
Walmart, 1
Amazon, 2
object DeviceIndexer extend BatchTransaction {
process(dfs: Map[String, DataFrame], batchTime: Time): {
df = dfs(“DeviceStore”)
dm = generateDictionary(df)
vectorizedDf = transform(df, dm)
}
persist(df: DataFrame, batchTime: Time): {
converter = SparkLuceneConverter(dm.size)
dao = LuceneDAO(batchTime,…).setConverter(converter)
dm.save(path, batchTime)
dao.index(df, numShards)
}
1
2
ip1 ip2
© Verizon 2016All Rights Reserved
Informationcontained herein is providedAS ISandsubjectto changewithoutnotice.All trademarks usedhereinare property of their respective owners.
LuceneDAO Index Size
13
0.0
75.0
150.0
225.0
300.0
1M 4M 8M 16M 73M 73M all
rows
InputSize(gb) IndexSize(gb)
rows InputSize(gb) IndexSize(gb)
1M 4.0 5.1
4M 14.4 19.0
8M 27.9 35.7
16M 58.8 63.2
73M 276.5 228.0
73M	all 276.5 267.1
© Verizon 2016All Rights Reserved
Informationcontained herein is providedAS ISandsubjectto changewithoutnotice.All trademarks usedhereinare property of their respective owners.
LuceneDAO Shuffle Size
14
0.
250.
500.
750.
1000.
1M 4M 8M 16M 73M 73M all
Dictionary(mb) ShuffleWrite(mb)
rows ShuffleWrite(mb) Dictionary(mb)
1M 25 22.0
4M 56 30.0
8M 85 31.6
16M 126 32.2
73M 334 32.4
73M	all 921 146.5
© Verizon 2016All Rights Reserved
Informationcontained herein is providedAS ISandsubjectto changewithoutnotice.All trademarks usedhereinare property of their respective owners.
LuceneDAO Index Runtime
15
rows Runtime	(s)
1M 135
4M 228
8M 434
16M 571
73M 1726
73M	all 2456
0
750
1500
2250
3000
1M 4M 8M 16M 73M 73M all
#rows
Runtime (s)
20 executors 16 cores
Executor RAM 16 GB
Driver RAM 8g
© Verizon 2016All Rights Reserved
Informationcontained herein is providedAS ISandsubjectto changewithoutnotice.All trademarks usedhereinare property of their respective owners.
Trapezium Api
runMode = "BATCH"
dataSource = "HDFS"
httpServer = {
provider = "akka"
hostname = "localhost"
port = 19999
contextPath = "/"
endPoints = [{
path = “analyzer-api"
className = "TopKEndPoint"
}]
}
16
© Verizon 2016All Rights Reserved
Informationcontained herein is providedAS ISandsubjectto changewithoutnotice.All trademarks usedhereinare property of their respective owners.
DeviceAnalyzer: Topk
• Given a query select * from devices
where tld=‘macys.com’ OR
‘nordstorm.com’ AND
(city=‘SanFrancisco’ OR ‘Brussels’) AND
(device=‘Android’) …
• ML: Find topk dimensions
highly correlated with
selected device
• BI: group by tld order by
sum(visits) as tldVisits limit
topk
17
class TopkController(sc: SparkContext) extends
SparkServiceEndPoint(sc) {
override def route : topkRoute
converter = SparkLuceneConverter(dm.size)
batchTime = Trapezium.getSyncTime(“indexer”)
dao = LuceneDAO(batchTime…)
.setConverter(converter).load(sc, indexPath)
dict = loadDictionary(sc, indexPath, batchTime)
def topkRoute : {
post { request => {
devices = dao.search(request)
response = getCorrelates(devices, dict, topk)
}
}
df[deviceId, vector]
sum, support
mean, median, stddev
© Verizon 2016All Rights Reserved
Informationcontained herein is providedAS ISandsubjectto changewithoutnotice.All trademarks usedhereinare property of their respective owners.
Trapezium Stream
18
runMode = "STREAM"
dataSource = “KAFKA"
kafkaTopicInfo = {
consumerGroup = "KafkaStreamGroup"
maxRatePerPartition = 970
batchTime = "5"
streamsInfo = [{
name = "queries"
topicName = “deviceanalyzer”
}]
}
transactions = [{
transactionName = DeviceAnalyzer"
inputStreams = [{name:"queries"}]
persistStreamName= "deviceanalyzer"
isPersist= "true"
}]
© Verizon 2016All Rights Reserved
Informationcontained herein is providedAS ISandsubjectto changewithoutnotice.All trademarks usedhereinare property of their respective owners.
DeviceAnalyzer: Compare
19
• Given two queries
select * from Devices where
tld=‘macys.com’ OR ‘nordstorm.com’ AND
(city=‘SanFrancisco’) AND (device=‘Android’)
select * from Devices where
tld=‘macys.com’ OR ‘nordstorm.com’ AND
(city=‘Brussels’) AND (device=‘Android’)
• Find the dimensions that
discriminate the devices
associated with two groups
def processStream(streams: Map[String,
DStream[Row]], workflowTime: Time): {
streams(“queries”).collect().map{ requests =>
group1 = dao.search(requests(0))
group2 = dao.search(requests(1))
response = runLDA(aud1, aud2, dict)
}
def persistStream(responses: RDD[Row],
batchTime: Time) {
HBaseDAO.write(responses)
}
• Sparse weighted least squares
using Breeze QuadraticMinimizer
• L1 Regularized logistic regression
© Verizon 2016All Rights Reserved
Informationcontained herein is providedAS ISandsubjectto changewithoutnotice.All trademarks usedhereinare property of their respective owners.
DeviceAnalyzer: Augment
20
• Given a query
select * from Devices where
tld=‘macys.com’ OR ‘nordstorm.com’
AND (city=‘SanFrancisco’ OR ‘Brussels’)
AND (device=‘Android’)…
• Find devices similar to
seed as lookalikes
• Find dimensions that
represent lookalikes
object DeviceAnalyzer extends StreamingTransaction {
converter = SparkLuceneConverter(dm.size)
batchTime = Trapezium.getSyncTime(“indexer”)
dao = LuceneDAO(batchTime…)
.setConverter(converter).load(sc, indexPath)
dict = loadDictionary(sc, indexPath, batchTime)
all = dao.search(“*:*”)
def processStream(streams: Map[String, DStream[Row]]) :
{
streams(“queries”).collect().map{ request =>
audience = dao.search(request)
response = getLookalikeDimensions(all, audience, dict)
} • Sparse weighted least squares using
Breeze QuadraticMinimizer
• L2 regularized linear regression
© Verizon 2016All Rights Reserved
Informationcontained herein is providedAS ISandsubjectto changewithoutnotice.All trademarks usedhereinare property of their respective owners.
FastSummarizer
• Statistical and predictive operators
• sum: sum over numeric measures
• support: sum over distinct docID
• sumSquared: L2 norm
• gram: Uses BLAS sspr
• solve: Uses BreezeQuadraticMinimizer to support L1
• Implemented using Array[Float] for shuffle opt
• Scala/Java for Level1 operations
• OpenBLAS for Level3 operations
21
© Verizon 2016All Rights Reserved
Informationcontained herein is providedAS ISandsubjectto changewithoutnotice.All trademarks usedhereinare property of their respective owners.
Sync API Benchmark
73M rows 1M+ search terms
1 measure on 250K sparse dimensions
20 executors 8 cores
32 GB driver RAM 16 GB executor RAM
akka-http cores: 24 default
22
topk
qps runtime(s)
1 1.389
5 1.663
10 3.214
20 5.992
40 12.174
0.
3.5
7.
10.5
14.
1 5 10 20 40
qps
runtime(s)
© Verizon 2016All Rights Reserved
Informationcontained herein is providedAS ISandsubjectto changewithoutnotice.All trademarks usedhereinare property of their respective owners.
Async API Benchmark
73M rows, 1M+ search terms
1 measure on 250K sparse dimensions
20 executors 8 cores
32 GB driver RAM 16 GB executor RAM
forkjoinpool = 40
Kafka Fetch + compare/augment + HBase Persist
23
predictions
qps compare(s) augment(s)
1 9 16
5 13 36
10 23 70
20 42 142
0
40
80
120
160
1 5 10 20
qps
compare augment
© Verizon 2016All Rights Reserved
Informationcontained herein is providedAS ISandsubjectto changewithoutnotice.All trademarks usedhereinare property of their respective owners.
topk tld + apps
24
© Verizon 2016All Rights Reserved
Informationcontained herein is providedAS ISandsubjectto changewithoutnotice.All trademarks usedhereinare property of their respective owners.
25
Augment: Auto Enthusiastic
© Verizon 2016All Rights Reserved
Informationcontained herein is providedAS ISandsubjectto changewithoutnotice. All trademarks used hereinare property oftheir respectiveowners.
26
Augment Model Performance
© Verizon 2016All Rights Reserved
Informationcontained herein is providedAS ISandsubjectto changewithoutnotice.All trademarks usedhereinare property of their respective owners.
27
Compare: Leisure vs Business Travellers
THANK YOU.
Q&A
Join us and make machines intelligent
Data & Artificial Intelligence Systems
499 Hamilton Ave, Palo Alto
California

Más contenido relacionado

La actualidad más candente

Have your Cake and Eat it Too - Architecture for Batch and Real-time processing
Have your Cake and Eat it Too - Architecture for Batch and Real-time processingHave your Cake and Eat it Too - Architecture for Batch and Real-time processing
Have your Cake and Eat it Too - Architecture for Batch and Real-time processing
DataWorks Summit
 
Procella: A fast versatile SQL query engine powering data at Youtube
Procella: A fast versatile SQL query engine powering data at YoutubeProcella: A fast versatile SQL query engine powering data at Youtube
Procella: A fast versatile SQL query engine powering data at Youtube
DataWorks Summit
 
Spark Summit EU talk by Miklos Christine paddling up the stream
Spark Summit EU talk by Miklos Christine paddling up the streamSpark Summit EU talk by Miklos Christine paddling up the stream
Spark Summit EU talk by Miklos Christine paddling up the stream
Spark Summit
 
Mobility insights at Swisscom - Understanding collective mobility in Switzerland
Mobility insights at Swisscom - Understanding collective mobility in SwitzerlandMobility insights at Swisscom - Understanding collective mobility in Switzerland
Mobility insights at Swisscom - Understanding collective mobility in Switzerland
François Garillot
 

La actualidad más candente (20)

HPE Hadoop Solutions - From use cases to proposal
HPE Hadoop Solutions - From use cases to proposalHPE Hadoop Solutions - From use cases to proposal
HPE Hadoop Solutions - From use cases to proposal
 
Powering a Virtual Power Station with Big Data
Powering a Virtual Power Station with Big DataPowering a Virtual Power Station with Big Data
Powering a Virtual Power Station with Big Data
 
Have your Cake and Eat it Too - Architecture for Batch and Real-time processing
Have your Cake and Eat it Too - Architecture for Batch and Real-time processingHave your Cake and Eat it Too - Architecture for Batch and Real-time processing
Have your Cake and Eat it Too - Architecture for Batch and Real-time processing
 
Scaling Data Analytics Workloads on Databricks
Scaling Data Analytics Workloads on DatabricksScaling Data Analytics Workloads on Databricks
Scaling Data Analytics Workloads on Databricks
 
Achieving 100k Queries per Hour on Hive on Tez
Achieving 100k Queries per Hour on Hive on TezAchieving 100k Queries per Hour on Hive on Tez
Achieving 100k Queries per Hour on Hive on Tez
 
What's new in Hadoop Common and HDFS
What's new in Hadoop Common and HDFS What's new in Hadoop Common and HDFS
What's new in Hadoop Common and HDFS
 
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
 
Disaster Recovery and Cloud Migration for your Apache Hive Warehouse
Disaster Recovery and Cloud Migration for your Apache Hive WarehouseDisaster Recovery and Cloud Migration for your Apache Hive Warehouse
Disaster Recovery and Cloud Migration for your Apache Hive Warehouse
 
Tachyon and Apache Spark
Tachyon and Apache SparkTachyon and Apache Spark
Tachyon and Apache Spark
 
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
 
To The Cloud and Back: A Look At Hybrid Analytics
To The Cloud and Back: A Look At Hybrid AnalyticsTo The Cloud and Back: A Look At Hybrid Analytics
To The Cloud and Back: A Look At Hybrid Analytics
 
Apache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the CloudApache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the Cloud
 
Procella: A fast versatile SQL query engine powering data at Youtube
Procella: A fast versatile SQL query engine powering data at YoutubeProcella: A fast versatile SQL query engine powering data at Youtube
Procella: A fast versatile SQL query engine powering data at Youtube
 
Hive on Spark, production experience @Uber
 Hive on Spark, production experience @Uber Hive on Spark, production experience @Uber
Hive on Spark, production experience @Uber
 
Tez Shuffle Handler: Shuffling at Scale with Apache Hadoop
Tez Shuffle Handler: Shuffling at Scale with Apache HadoopTez Shuffle Handler: Shuffling at Scale with Apache Hadoop
Tez Shuffle Handler: Shuffling at Scale with Apache Hadoop
 
Spark Summit EU talk by Miklos Christine paddling up the stream
Spark Summit EU talk by Miklos Christine paddling up the streamSpark Summit EU talk by Miklos Christine paddling up the stream
Spark Summit EU talk by Miklos Christine paddling up the stream
 
Unified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache SamzaUnified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache Samza
 
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
 
Mobility insights at Swisscom - Understanding collective mobility in Switzerland
Mobility insights at Swisscom - Understanding collective mobility in SwitzerlandMobility insights at Swisscom - Understanding collective mobility in Switzerland
Mobility insights at Swisscom - Understanding collective mobility in Switzerland
 
Kerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-Malla
Kerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-MallaKerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-Malla
Kerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-Malla
 

Destacado

Activity plan blossoming tree
Activity plan blossoming treeActivity plan blossoming tree
Activity plan blossoming tree
wiosenka
 
Zachodniopomorskie Kozakowski Ist Bariery 20080417
Zachodniopomorskie Kozakowski Ist Bariery 20080417Zachodniopomorskie Kozakowski Ist Bariery 20080417
Zachodniopomorskie Kozakowski Ist Bariery 20080417
martin.zawisza
 
5. Wykonywanie analizy jakościowej i ilościowej produktów leczniczych
5. Wykonywanie analizy jakościowej i ilościowej produktów leczniczych 5. Wykonywanie analizy jakościowej i ilościowej produktów leczniczych
5. Wykonywanie analizy jakościowej i ilościowej produktów leczniczych
Jakub Duda
 

Destacado (20)

Spark summit talk, july 2014 powered by reveal
Spark summit talk, july 2014 powered by revealSpark summit talk, july 2014 powered by reveal
Spark summit talk, july 2014 powered by reveal
 
Spark Meetup July 2015
Spark Meetup July 2015Spark Meetup July 2015
Spark Meetup July 2015
 
mis amigos
mis amigosmis amigos
mis amigos
 
Analitzem versions caputxeta
Analitzem versions caputxetaAnalitzem versions caputxeta
Analitzem versions caputxeta
 
Troubleshoot Your Mind For Better Business Ideas
Troubleshoot Your Mind For Better Business IdeasTroubleshoot Your Mind For Better Business Ideas
Troubleshoot Your Mind For Better Business Ideas
 
Activity plan blossoming tree
Activity plan blossoming treeActivity plan blossoming tree
Activity plan blossoming tree
 
Incrediblue
Incrediblue Incrediblue
Incrediblue
 
Zachodniopomorskie Kozakowski Ist Bariery 20080417
Zachodniopomorskie Kozakowski Ist Bariery 20080417Zachodniopomorskie Kozakowski Ist Bariery 20080417
Zachodniopomorskie Kozakowski Ist Bariery 20080417
 
School 4
School 4School 4
School 4
 
Nanotechnology companies in Europe | Glonatech SA Insights
Nanotechnology companies in Europe | Glonatech SA InsightsNanotechnology companies in Europe | Glonatech SA Insights
Nanotechnology companies in Europe | Glonatech SA Insights
 
Win Marketing portfolio
Win Marketing portfolioWin Marketing portfolio
Win Marketing portfolio
 
Компетентностный подход в образовании
Компетентностный подход в образованииКомпетентностный подход в образовании
Компетентностный подход в образовании
 
Light
LightLight
Light
 
Antarctic- Guess the Climate
Antarctic- Guess the ClimateAntarctic- Guess the Climate
Antarctic- Guess the Climate
 
5. Wykonywanie analizy jakościowej i ilościowej produktów leczniczych
5. Wykonywanie analizy jakościowej i ilościowej produktów leczniczych 5. Wykonywanie analizy jakościowej i ilościowej produktów leczniczych
5. Wykonywanie analizy jakościowej i ilościowej produktów leczniczych
 
Yoganant 2016
Yoganant 2016Yoganant 2016
Yoganant 2016
 
Drg kcs-magnetic-8-25-16-final deck
Drg kcs-magnetic-8-25-16-final deckDrg kcs-magnetic-8-25-16-final deck
Drg kcs-magnetic-8-25-16-final deck
 
Spark Summit EU talk by Oscar Castaneda
Spark Summit EU talk by Oscar CastanedaSpark Summit EU talk by Oscar Castaneda
Spark Summit EU talk by Oscar Castaneda
 
Diagnostyka stanu skóry
Diagnostyka stanu skóryDiagnostyka stanu skóry
Diagnostyka stanu skóry
 
User education and information literacy - Innovative strategies and practices
User education and information literacy - Innovative strategies and practicesUser education and information literacy - Innovative strategies and practices
User education and information literacy - Innovative strategies and practices
 

Similar a Spark Summit EU talk by Debasish Das and Pramod Narasimha

Realtime Analytical Query Processing and Predictive Model Building on High Di...
Realtime Analytical Query Processing and Predictive Model Building on High Di...Realtime Analytical Query Processing and Predictive Model Building on High Di...
Realtime Analytical Query Processing and Predictive Model Building on High Di...
Spark Summit
 
Realtime Analytical Query Processing and Predictive Model Building on High Di...
Realtime Analytical Query Processing and Predictive Model Building on High Di...Realtime Analytical Query Processing and Predictive Model Building on High Di...
Realtime Analytical Query Processing and Predictive Model Building on High Di...
Spark Summit
 
Visual Mapping of Clickstream Data
Visual Mapping of Clickstream DataVisual Mapping of Clickstream Data
Visual Mapping of Clickstream Data
DataWorks Summit
 

Similar a Spark Summit EU talk by Debasish Das and Pramod Narasimha (20)

Spark Summit EU talk by Debasish Das and Pramod Narasimha
Spark Summit EU talk by Debasish Das and Pramod NarasimhaSpark Summit EU talk by Debasish Das and Pramod Narasimha
Spark Summit EU talk by Debasish Das and Pramod Narasimha
 
Spark Summit EU talk by Debasish Das and Pramod Narasimha
Spark Summit EU talk by Debasish Das and Pramod NarasimhaSpark Summit EU talk by Debasish Das and Pramod Narasimha
Spark Summit EU talk by Debasish Das and Pramod Narasimha
 
Realtime Analytical Query Processing and Predictive Model Building on High Di...
Realtime Analytical Query Processing and Predictive Model Building on High Di...Realtime Analytical Query Processing and Predictive Model Building on High Di...
Realtime Analytical Query Processing and Predictive Model Building on High Di...
 
Realtime Analytical Query Processing and Predictive Model Building on High Di...
Realtime Analytical Query Processing and Predictive Model Building on High Di...Realtime Analytical Query Processing and Predictive Model Building on High Di...
Realtime Analytical Query Processing and Predictive Model Building on High Di...
 
Introduction to Apache NiFi 1.11.4
Introduction to Apache NiFi 1.11.4Introduction to Apache NiFi 1.11.4
Introduction to Apache NiFi 1.11.4
 
Oracle Cloud - Infrastruktura jako kód
Oracle Cloud - Infrastruktura jako kódOracle Cloud - Infrastruktura jako kód
Oracle Cloud - Infrastruktura jako kód
 
Visual Mapping of Clickstream Data
Visual Mapping of Clickstream DataVisual Mapping of Clickstream Data
Visual Mapping of Clickstream Data
 
CISCO - Presentation at Hortonworks Booth - Strata 2014
CISCO - Presentation at Hortonworks Booth - Strata 2014CISCO - Presentation at Hortonworks Booth - Strata 2014
CISCO - Presentation at Hortonworks Booth - Strata 2014
 
WarsawITDays_ ApacheNiFi202
WarsawITDays_ ApacheNiFi202WarsawITDays_ ApacheNiFi202
WarsawITDays_ ApacheNiFi202
 
Learning the basics of Apache NiFi for iot OSS Europe 2020
Learning the basics of Apache NiFi for iot OSS Europe 2020Learning the basics of Apache NiFi for iot OSS Europe 2020
Learning the basics of Apache NiFi for iot OSS Europe 2020
 
HTrace: Tracing in HBase and HDFS (HBase Meetup)
HTrace: Tracing in HBase and HDFS (HBase Meetup)HTrace: Tracing in HBase and HDFS (HBase Meetup)
HTrace: Tracing in HBase and HDFS (HBase Meetup)
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
Karunakar Kondam
Karunakar KondamKarunakar Kondam
Karunakar Kondam
 
Securing Big Data at rest with encryption for Hadoop, Cassandra and MongoDB o...
Securing Big Data at rest with encryption for Hadoop, Cassandra and MongoDB o...Securing Big Data at rest with encryption for Hadoop, Cassandra and MongoDB o...
Securing Big Data at rest with encryption for Hadoop, Cassandra and MongoDB o...
 
Apache Drill - Why, What, How
Apache Drill - Why, What, HowApache Drill - Why, What, How
Apache Drill - Why, What, How
 
Discover HDP2.1: Apache Storm for Stream Data Processing in Hadoop
Discover HDP2.1: Apache Storm for Stream Data Processing in HadoopDiscover HDP2.1: Apache Storm for Stream Data Processing in Hadoop
Discover HDP2.1: Apache Storm for Stream Data Processing in Hadoop
 
2013-05-22 RedHatGov Partner Event
2013-05-22 RedHatGov Partner Event2013-05-22 RedHatGov Partner Event
2013-05-22 RedHatGov Partner Event
 
Streaming Solutions for Real time problems
Streaming Solutions for Real time problemsStreaming Solutions for Real time problems
Streaming Solutions for Real time problems
 
Cassandra Tuning - above and beyond
Cassandra Tuning - above and beyondCassandra Tuning - above and beyond
Cassandra Tuning - above and beyond
 
Cassandra Tuning - Above and Beyond (Matija Gobec, SmartCat) | Cassandra Summ...
Cassandra Tuning - Above and Beyond (Matija Gobec, SmartCat) | Cassandra Summ...Cassandra Tuning - Above and Beyond (Matija Gobec, SmartCat) | Cassandra Summ...
Cassandra Tuning - Above and Beyond (Matija Gobec, SmartCat) | Cassandra Summ...
 

Más de Spark Summit

Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Spark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Spark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
Spark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Spark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
Spark Summit
 

Más de Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 

Último

Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
amitlee9823
 
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
gajnagarg
 
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night StandCall Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...
gajnagarg
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
amitlee9823
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
amitlee9823
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
amitlee9823
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
amitlee9823
 
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men 🔝Sambalpur🔝 Esc...
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men  🔝Sambalpur🔝   Esc...➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men  🔝Sambalpur🔝   Esc...
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men 🔝Sambalpur🔝 Esc...
amitlee9823
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
karishmasinghjnh
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
amitlee9823
 

Último (20)

Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
 
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night StandCall Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
 
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
 
Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men 🔝Sambalpur🔝 Esc...
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men  🔝Sambalpur🔝   Esc...➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men  🔝Sambalpur🔝   Esc...
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men 🔝Sambalpur🔝 Esc...
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - Almora
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 

Spark Summit EU talk by Debasish Das and Pramod Narasimha

  • 1. FUSING APACHE SPARK AND LUCENE FOR NEAR-REALTIME PREDICTIVE MODEL BUILDING Debasish Das Principal Engineer Verizon Contributors Platform: Pankaj Rastogi, Venkat Chunduru, Ponrama Jegan, Masoud Tavazoei Algorithm: Santanu Das, Debasish Das (Dave) Frontend: Altaff Shaik, Jon Leonhardt Pramod Lakshmi Narasimha Principal Engineer Verizon
  • 2. © Verizon 2016All Rights Reserved Informationcontained herein is providedAS ISandsubjectto changewithoutnotice.All trademarks usedhereinare property of their respective owners. Data Overview • Location data • Each srcIp defined as unique row key • Provides approximate location of each key • Timeseries containing latitude, longitude, error bound, duration, timezone for each key • Clickstreamdata • Contains clickstream data of each row key • Contains startTime, duration, httphost, httpuri, upload/download bytes, httpmethod • Compatible with IPFIX/Netflow formats 2
  • 3. © Verizon 2016All Rights Reserved Informationcontained herein is providedAS ISandsubjectto changewithoutnotice.All trademarks usedhereinare property of their respective owners. Marketing Analytics 3 Lookalike modeling Churn reduction Competitive analysis Increased share of stomach • Anonymous aggregate analysis for customer insights
  • 4. © Verizon 2016All Rights Reserved Informationcontained herein is providedAS ISandsubjectto changewithoutnotice.All trademarks usedhereinare property of their respective owners. Data Model • Dense dimension, dense measure Schema: srcip, date, hour, tld, zip, tldvisits, zipvisits Data: 10.1.13.120, d1, H2, macys.com, 94555, 2, 4 • Sparse dimension, dense measure Schema: srcip, date, tld, zip, clickstreamvisits, zipvisits Data: 10.1.13.120, d1, {macys.com, kohls.com}, {94555, 94301}, 10, 15 • Sparse dimension, sparse measure Schema: srcip, date, tld, zip, tldvisits, zipvisits Data: 10.1.13.120, d1, {macys.com, kohls.com}, {94555, 94301}, {macys.com:4, kohls.com:6}, {94555:8, 94301:7} Schema: srcip, week, tld, zip, tldvisits, zipvisits Data: 10.1.13.120, week1, {macys.com, kohls.com}, {94555, 94301}, {macys.com:4, kohls.com:6}, {94555:8, 94301:7} • Sparse dimension, sparse measure, last N days Schema: srcip, tld, zip, tldvisits, zipvisits Data: 10.1.13.120, {macys.com, kohls.com}, {94555, 94301}, {macys.com:4, kohls.com:6}, {94555:8, 94301:7} • Competing technologies: PowerDrill, Druid, LinkedIn Pinot, EssBase 4
  • 5. © Verizon 2016All Rights Reserved Informationcontained herein is providedAS ISandsubjectto changewithoutnotice.All trademarks usedhereinare property of their respective owners. Document Dataset Representation • Example Schema: srcip, tld, zip, tldvisits, zipvisits Data: 10.1.13.120, {macys.com,kohls.com}, {94555, 94301}, {macys.com:4, kohls.com:6}, {94555:8,94301:7} • DataFrame row to Lucene Document mapping 5 Store/schema Row Document srcip primary key docId tld zip String Array[String] SingleValue/MultiValue Indexed Fields tldvisits zipvisits Double Map[String, Double] SparseVector StoredField • Distributed collection of srcIp as RDD[Document] • ~100M srcip, 1M+ terms (sparse dimensions)
  • 6. © Verizon 2016All Rights Reserved Informationcontained herein is providedAS ISandsubjectto changewithoutnotice.All trademarks usedhereinare property of their respective owners. DeviceAnalyzer 6 • DeviceAnalyzer goals – Search and retrieve devices that matched query – Generate statistical and predictive models on retrieved devices
  • 7. © Verizon 2016All Rights Reserved Informationcontained herein is providedAS ISandsubjectto changewithoutnotice.All trademarks usedhereinare property of their respective owners. What is Trapezium ? 7 DAIS Open Source framework to build batch, streaming and API services https://github.com/Verizon/trapezium
  • 8. © Verizon 2016All Rights Reserved Informationcontained herein is providedAS ISandsubjectto changewithoutnotice.All trademarks usedhereinare property of their respective owners. Trapezium Architecture 8 Trapezium D1 D2 D3 O1 O2 O3 Validation D1 V1 V1 O1 D2 O2 D3 O1 VARIOUS TRANSACTIONS
  • 9. © Verizon 2016All Rights Reserved Informationcontained herein is providedAS ISandsubjectto changewithoutnotice.All trademarks usedhereinare property of their respective owners. Lucene Overview 9 • Scalable, full-text search library • Focus: Indexing + searching documents
  • 10. © Verizon 2016All Rights Reserved Informationcontained herein is providedAS ISandsubjectto changewithoutnotice.All trademarks usedhereinare property of their respective owners. Trapezium LuceneDAO • SparkSQL and MLlib optimized for full scan, column indexing not supported • Why Spark + Lucene integration • Lucene is battle tested Apache Licensed Open Source Project • Adds column search capabilities to Spark • Adds spark operators (treeAggregate, treeReduce, map) to Lucene • LuceneDAO features • Build distributed lucene shards from Dataframe • Save shards to HDFS for QueryProcessor (CloudSolr) • Access saved shards through LuceneDAO for ML pipelines 10
  • 11. © Verizon 2016All Rights Reserved Informationcontained herein is providedAS ISandsubjectto changewithoutnotice.All trademarks usedhereinare property of their respective owners. Trapezium Batch 11 runMode = "BATCH" dataSource = “HDFS” dependentWorkflows= { workflows=[aggregate] frequencyToCheck=100 } hdfsFileBatch = { batchTime = 86400 timerStartDelay = 1 batchInfo = [{ name = "DeviceStore" dataDirectory = {saiph-devqa=/aggregates} fileFormat = "parquet" }] } transactions = [{ transactionName = “DeviceIndexer” inputData = [{name = "DeviceStore"}] persistDataName = “indexed" }]
  • 12. © Verizon 2016All Rights Reserved Informationcontained herein is providedAS ISandsubjectto changewithoutnotice.All trademarks usedhereinare property of their respective owners. DeviceAnalyzer: Indexing 12 /?ref=1108&?url=http://www.macys.c om&id=5 www.walmart.com%2Fc%2Fep%2Frange- hood-filters&sellermemid=459 http%3A%2F%2Fm.macys.com%2Fshop%2F product%2Fjockey-elance-cotton /?ref=1108&?url=http://www.macys.c om&id=5 m.amazon.com%2Fshop%2Fproduct%2Fjo ckey-elance-cotton https://www.walmart.com/ip/Women- Pant-Suit-Roundtree walmart://ip/?veh=dsn&wmlspartner m.macys.com%2Fshop%2Fsearch%3Fkeyw ord%3DDress ip1, macys.com : 2 ip1, walmart.com: 1 ip1, macys.com : 1 ip2, walmart.com: 1 ip1, amazon.com: 1 ip1, macys.com : 2 ip2, walmart.com: 1 0 1 2 5 1 1 Macys, 0 Walmart, 1 Amazon, 2 object DeviceIndexer extend BatchTransaction { process(dfs: Map[String, DataFrame], batchTime: Time): { df = dfs(“DeviceStore”) dm = generateDictionary(df) vectorizedDf = transform(df, dm) } persist(df: DataFrame, batchTime: Time): { converter = SparkLuceneConverter(dm.size) dao = LuceneDAO(batchTime,…).setConverter(converter) dm.save(path, batchTime) dao.index(df, numShards) } 1 2 ip1 ip2
  • 13. © Verizon 2016All Rights Reserved Informationcontained herein is providedAS ISandsubjectto changewithoutnotice.All trademarks usedhereinare property of their respective owners. LuceneDAO Index Size 13 0.0 75.0 150.0 225.0 300.0 1M 4M 8M 16M 73M 73M all rows InputSize(gb) IndexSize(gb) rows InputSize(gb) IndexSize(gb) 1M 4.0 5.1 4M 14.4 19.0 8M 27.9 35.7 16M 58.8 63.2 73M 276.5 228.0 73M all 276.5 267.1
  • 14. © Verizon 2016All Rights Reserved Informationcontained herein is providedAS ISandsubjectto changewithoutnotice.All trademarks usedhereinare property of their respective owners. LuceneDAO Shuffle Size 14 0. 250. 500. 750. 1000. 1M 4M 8M 16M 73M 73M all Dictionary(mb) ShuffleWrite(mb) rows ShuffleWrite(mb) Dictionary(mb) 1M 25 22.0 4M 56 30.0 8M 85 31.6 16M 126 32.2 73M 334 32.4 73M all 921 146.5
  • 15. © Verizon 2016All Rights Reserved Informationcontained herein is providedAS ISandsubjectto changewithoutnotice.All trademarks usedhereinare property of their respective owners. LuceneDAO Index Runtime 15 rows Runtime (s) 1M 135 4M 228 8M 434 16M 571 73M 1726 73M all 2456 0 750 1500 2250 3000 1M 4M 8M 16M 73M 73M all #rows Runtime (s) 20 executors 16 cores Executor RAM 16 GB Driver RAM 8g
  • 16. © Verizon 2016All Rights Reserved Informationcontained herein is providedAS ISandsubjectto changewithoutnotice.All trademarks usedhereinare property of their respective owners. Trapezium Api runMode = "BATCH" dataSource = "HDFS" httpServer = { provider = "akka" hostname = "localhost" port = 19999 contextPath = "/" endPoints = [{ path = “analyzer-api" className = "TopKEndPoint" }] } 16
  • 17. © Verizon 2016All Rights Reserved Informationcontained herein is providedAS ISandsubjectto changewithoutnotice.All trademarks usedhereinare property of their respective owners. DeviceAnalyzer: Topk • Given a query select * from devices where tld=‘macys.com’ OR ‘nordstorm.com’ AND (city=‘SanFrancisco’ OR ‘Brussels’) AND (device=‘Android’) … • ML: Find topk dimensions highly correlated with selected device • BI: group by tld order by sum(visits) as tldVisits limit topk 17 class TopkController(sc: SparkContext) extends SparkServiceEndPoint(sc) { override def route : topkRoute converter = SparkLuceneConverter(dm.size) batchTime = Trapezium.getSyncTime(“indexer”) dao = LuceneDAO(batchTime…) .setConverter(converter).load(sc, indexPath) dict = loadDictionary(sc, indexPath, batchTime) def topkRoute : { post { request => { devices = dao.search(request) response = getCorrelates(devices, dict, topk) } } df[deviceId, vector] sum, support mean, median, stddev
  • 18. © Verizon 2016All Rights Reserved Informationcontained herein is providedAS ISandsubjectto changewithoutnotice.All trademarks usedhereinare property of their respective owners. Trapezium Stream 18 runMode = "STREAM" dataSource = “KAFKA" kafkaTopicInfo = { consumerGroup = "KafkaStreamGroup" maxRatePerPartition = 970 batchTime = "5" streamsInfo = [{ name = "queries" topicName = “deviceanalyzer” }] } transactions = [{ transactionName = DeviceAnalyzer" inputStreams = [{name:"queries"}] persistStreamName= "deviceanalyzer" isPersist= "true" }]
  • 19. © Verizon 2016All Rights Reserved Informationcontained herein is providedAS ISandsubjectto changewithoutnotice.All trademarks usedhereinare property of their respective owners. DeviceAnalyzer: Compare 19 • Given two queries select * from Devices where tld=‘macys.com’ OR ‘nordstorm.com’ AND (city=‘SanFrancisco’) AND (device=‘Android’) select * from Devices where tld=‘macys.com’ OR ‘nordstorm.com’ AND (city=‘Brussels’) AND (device=‘Android’) • Find the dimensions that discriminate the devices associated with two groups def processStream(streams: Map[String, DStream[Row]], workflowTime: Time): { streams(“queries”).collect().map{ requests => group1 = dao.search(requests(0)) group2 = dao.search(requests(1)) response = runLDA(aud1, aud2, dict) } def persistStream(responses: RDD[Row], batchTime: Time) { HBaseDAO.write(responses) } • Sparse weighted least squares using Breeze QuadraticMinimizer • L1 Regularized logistic regression
  • 20. © Verizon 2016All Rights Reserved Informationcontained herein is providedAS ISandsubjectto changewithoutnotice.All trademarks usedhereinare property of their respective owners. DeviceAnalyzer: Augment 20 • Given a query select * from Devices where tld=‘macys.com’ OR ‘nordstorm.com’ AND (city=‘SanFrancisco’ OR ‘Brussels’) AND (device=‘Android’)… • Find devices similar to seed as lookalikes • Find dimensions that represent lookalikes object DeviceAnalyzer extends StreamingTransaction { converter = SparkLuceneConverter(dm.size) batchTime = Trapezium.getSyncTime(“indexer”) dao = LuceneDAO(batchTime…) .setConverter(converter).load(sc, indexPath) dict = loadDictionary(sc, indexPath, batchTime) all = dao.search(“*:*”) def processStream(streams: Map[String, DStream[Row]]) : { streams(“queries”).collect().map{ request => audience = dao.search(request) response = getLookalikeDimensions(all, audience, dict) } • Sparse weighted least squares using Breeze QuadraticMinimizer • L2 regularized linear regression
  • 21. © Verizon 2016All Rights Reserved Informationcontained herein is providedAS ISandsubjectto changewithoutnotice.All trademarks usedhereinare property of their respective owners. FastSummarizer • Statistical and predictive operators • sum: sum over numeric measures • support: sum over distinct docID • sumSquared: L2 norm • gram: Uses BLAS sspr • solve: Uses BreezeQuadraticMinimizer to support L1 • Implemented using Array[Float] for shuffle opt • Scala/Java for Level1 operations • OpenBLAS for Level3 operations 21
  • 22. © Verizon 2016All Rights Reserved Informationcontained herein is providedAS ISandsubjectto changewithoutnotice.All trademarks usedhereinare property of their respective owners. Sync API Benchmark 73M rows 1M+ search terms 1 measure on 250K sparse dimensions 20 executors 8 cores 32 GB driver RAM 16 GB executor RAM akka-http cores: 24 default 22 topk qps runtime(s) 1 1.389 5 1.663 10 3.214 20 5.992 40 12.174 0. 3.5 7. 10.5 14. 1 5 10 20 40 qps runtime(s)
  • 23. © Verizon 2016All Rights Reserved Informationcontained herein is providedAS ISandsubjectto changewithoutnotice.All trademarks usedhereinare property of their respective owners. Async API Benchmark 73M rows, 1M+ search terms 1 measure on 250K sparse dimensions 20 executors 8 cores 32 GB driver RAM 16 GB executor RAM forkjoinpool = 40 Kafka Fetch + compare/augment + HBase Persist 23 predictions qps compare(s) augment(s) 1 9 16 5 13 36 10 23 70 20 42 142 0 40 80 120 160 1 5 10 20 qps compare augment
  • 24. © Verizon 2016All Rights Reserved Informationcontained herein is providedAS ISandsubjectto changewithoutnotice.All trademarks usedhereinare property of their respective owners. topk tld + apps 24
  • 25. © Verizon 2016All Rights Reserved Informationcontained herein is providedAS ISandsubjectto changewithoutnotice.All trademarks usedhereinare property of their respective owners. 25 Augment: Auto Enthusiastic
  • 26. © Verizon 2016All Rights Reserved Informationcontained herein is providedAS ISandsubjectto changewithoutnotice. All trademarks used hereinare property oftheir respectiveowners. 26 Augment Model Performance
  • 27. © Verizon 2016All Rights Reserved Informationcontained herein is providedAS ISandsubjectto changewithoutnotice.All trademarks usedhereinare property of their respective owners. 27 Compare: Leisure vs Business Travellers
  • 28. THANK YOU. Q&A Join us and make machines intelligent Data & Artificial Intelligence Systems 499 Hamilton Ave, Palo Alto California