SlideShare una empresa de Scribd logo
1 de 112
Descargar para leer sin conexión
Analyzing log data with
Apache Spark
William Benton
Red Hat Emerging Technology
BACKGROUND
Challenges of log data
Challenges of log data
SELECT hostname, DATEPART(HH, timestamp) AS hour, COUNT(msg)
FROM LOGS WHERE level='CRIT' AND msg LIKE '%failure%'
GROUP BY hostname, hour
Challenges of log data
11:00 12:00 13:00 14:00 15:00 16:00 17:00 18:00
SELECT hostname, DATEPART(HH, timestamp) AS hour, COUNT(msg)
FROM LOGS WHERE level='CRIT' AND msg LIKE '%failure%'
GROUP BY hostname, hour
Challenges of log data
postgres
httpd
syslog
INFO INFO WARN CRIT DEBUG INFO
GET GET GET POST
WARN WARN INFO INFO INFO
GET (404)
INFO
(ca. 2000)
Challenges of log data
postgres
httpd
syslog
INFO WARN
GET (404)
CRIT INFO
GET GET GET POST
INFO INFO INFO WARN
CouchDB
httpd
Django
INFO CRITINFO
GET POST
INFO INFO INFO WARN
haproxy
k8s
INFO INFO WARN CRITDEBUG
WARN WARN INFO INFOINFO
INFO
Cassandra
nginx
Rails
INFO CRIT INFO
GET POST PUT POST
INFO INFO INFOWARN
INFO
redis INFO CRIT INFOINFO
PUT (500)httpd
syslog
GET PUT
INFO INFO INFOWARN
(ca. 2016)
Challenges of log datapostgres
httpd
syslog
INFO WARN
GET (404)
CRIT INFO
GET GET GET POST
INFO INFO INFO WARN
CouchDB
httpd
Django
INFO CRITINFO
GET POST
INFO INFO INFO WARN
haproxy
k8s
INFO INFO WARN CRITDEBUG
WARN WARN INFO INFOINFO
INFO
Cassandra
nginx
Rails
INFO CRIT INFO
GET POST PUT POST
INFO INFO INFOWARN
INFO
redis INFO CRIT INFOINFO
PUT (500)httpd
syslog
GET PUT
INFO INFO INFOWARN
postgres
httpd
syslog
INFO WARN
GET (404)
CRIT INFO
GET GET GET POST
INFO INFO INFO WARN
CouchDB
httpd
Django
INFO CRITINFO
GET POST
INFO INFO INFO WARN
haproxy
k8s
INFO INFO WARN CRITDEBUG
WARN WARN INFO INFOINFO
INFO
Cassandra
nginx
Rails
INFO CRIT INFO
GET POST PUT POST
INFO INFO INFOWARN
INFO
redis INFO CRIT INFOINFO
PUT (500)httpd
syslog
GET PUT
INFO INFO INFOWARN
postgres
httpd
syslog
INFO WARN
GET (404)
CRIT INFO
GET GET GET POST
INFO INFO INFO WARN
CouchDB
httpd
Django
INFO CRITINFO
GET POST
INFO INFO INFO WARN
haproxy
k8s
INFO INFO WARN CRITDEBUG
WARN WARN INFO INFOINFO
INFO
Cassandra
nginx
Rails
INFO CRIT INFO
GET POST PUT POST
INFO INFO INFOWARN
INFO
redis INFO CRIT INFOINFO
PUT (500)httpd
syslog
GET PUT
INFO INFO INFOWARN
postgres
httpd
syslog
INFO WARN
GET (404)
CRIT INFO
GET GET GET POST
INFO INFO INFO WARN
CouchDB
httpd
Django
INFO CRITINFO
GET POST
INFO INFO INFO WARN
haproxy
k8s
INFO INFO WARN CRITDEBUG
WARN WARN INFO INFOINFO
INFO
Cassandra
nginx
Rails
INFO CRIT INFO
GET POST PUT POST
INFO INFO INFOWARN
INFO
redis INFO CRIT INFOINFO
PUT (500)httpd
syslog
GET PUT
INFO INFO INFOWARN
postgres
httpd
syslog
INFO WARN
GET (404)
CRIT INFO
GET GET GET POST
INFO INFO INFO WARN
CouchDB
httpd
Django
INFO CRITINFO
GET POST
INFO INFO INFO WARN
haproxy
k8s
INFO INFO WARN CRITDEBUG
WARN WARN INFO INFOINFO
INFO
Cassandra
nginx
Rails
INFO CRIT INFO
GET POST PUT POST
INFO INFO INFOWARN
INFO
redis INFO CRIT INFOINFO
PUT (500)httpd
syslog
GET PUT
INFO INFO INFOWARN
postgres
httpd
syslog
INFO WARN
GET (404)
CRIT INFO
GET GET GET POST
INFO INFO INFO WARN
CouchDB
httpd
Django
INFO CRITINFO
GET POST
INFO INFO INFO WARN
haproxy
k8s
INFO INFO WARN CRITDEBUG
WARN WARN INFO INFOINFO
INFO
Cassandra
nginx
Rails
INFO CRIT INFO
GET POST PUT POST
INFO INFO INFOWARN
INFO
redis INFO CRIT INFOINFO
PUT (500)httpd
syslog
GET PUT
INFO INFO INFOWARN
How many services are
generating logs in your
datacenter today?
DATA INGEST
Collecting log data
collecting
Ingesting live log
data via rsyslog,
logstash, fluentd
normalizing
Reconciling log
record metadata
across sources
warehousing
Storing normalized
records in ES indices
analysis
cache warehoused
data as Parquet files
on Gluster volume
local to Spark cluster
Collecting log data
warehousing
Storing normalized
records in ES indices
analysis
cache warehoused
data as Parquet files
on Gluster volume
local to Spark cluster
Collecting log data
warehousing
Storing normalized
records in ES indices
analysis
cache warehoused
data as Parquet files
on Gluster volume
local to Spark cluster
Schema mediation
Schema mediation
Schema mediation
Schema mediation
timestamp, level, host, IP
addresses, message, &c.
rsyslog-style metadata, like
app name, facility, &c.
logs
.select("level").distinct
.map { case Row(s: String) => s }
.collect
Exploring structured data
logs
.groupBy($"level", $"rsyslog.app_name")
.agg(count("level").as("total"))
.orderBy($"total".desc)
.show
info kubelet 17933574
info kube-proxy 10961117
err journal 6867921
info systemd 5184475
…
debug, notice, emerg,
err, warning, crit, info,
severe, alert
Exploring structured data
logs
.groupBy($"level", $"rsyslog.app_name")
.agg(count("level").as("total"))
.orderBy($"total".desc)
.show
info kubelet 17933574
info kube-proxy 10961117
err journal 6867921
info systemd 5184475
…
logs
.select("level").distinct
.as[String].collect
debug, notice, emerg,
err, warning, crit, info,
severe, alert
Exploring structured data
logs
.groupBy($"level", $"rsyslog.app_name")
.agg(count("level").as("total"))
.orderBy($"total".desc)
.show
info kubelet 17933574
info kube-proxy 10961117
err journal 6867921
info systemd 5184475
…
logs
.select("level").distinct
.as[String].collect
debug, notice, emerg,
err, warning, crit, info,
severe, alert
This class must be declared outside the REPL!
FEATURE ENGINEERING
From log records to vectors
What does it mean for two sets of categorical features to be similar?
red
green
blue
orange
-> 000
-> 010
-> 100
-> 001
pancakes
waffles
aebliskiver
omelets
bacon
hash browns
-> 10000
-> 01000
-> 00100
-> 00001
-> 00000
-> 00010
From log records to vectors
What does it mean for two sets of categorical features to be similar?
red
green
blue
orange
-> 000
-> 010
-> 100
-> 001
pancakes
waffles
aebliskiver
omelets
bacon
hash browns
-> 10000
-> 01000
-> 00100
-> 00001
-> 00000
-> 00010
red pancakes
orange waffles
-> 00010000
-> 00101000
Similarity and distance
Similarity and distance
Similarity and distance
(q - p) • (q - p)
Similarity and distance
pi - qi
i=1
n
Similarity and distance
pi - qi
i=1
n
Similarity and distance
p • q
p q
WARN INFO INFOINFO
WARN DEBUGINFOINFOINFO
WARN WARNINFO INFO INFO
WAR
INFO
INFO
Other interesting features
host01
host02
host03
INFO INFOINFO
DEBUGINFOINFO
WARNNFO INFO INFO
WARN INFO INFOINFO
INFO INFO
INFOINFOINFO
INFOINFO INFO INFO
INFO INFO
WARN DEBUG
Other interesting features
host01
host02
host03
INFO INFOINFO
INFO INFO
INFOINFO
INFO INFO INFO
WARN
DEBUG
WARN
INFO
INFO
EBUG
WARN
INFO
INFO
INFO
INFO
INFO
INFO INFO INFO
WARN
WARN
INFO
WARN
INFO
Other interesting features
host01
host02
host03
Other interesting features
: Great food, great service, a must-visit!
: Our whole table got gastroenteritis.
: This place is so wonderful that it has ruined all
other tacos for me and my family.
Other interesting features
INFO: Everything is great! Just checking in to let you know I’m OK.
Other interesting features
INFO: Everything is great! Just checking in to let you know I’m OK.
CRIT: No requests in last hour; suspending running app containers.
Other interesting features
INFO: Everything is great! Just checking in to let you know I’m OK.
CRIT: No requests in last hour; suspending running app containers.
INFO: Phoenix datacenter is on fire; may not rise from ashes.
Other interesting features
INFO: Everything is great! Just checking in to let you know I’m OK.
CRIT: No requests in last hour; suspending running app containers.
INFO: Phoenix datacenter is on fire; may not rise from ashes.
See https://links.freevariable.com/nlp-logs/ for more!
VISUALIZING STRUCTURE
and FINDING OUTLIERS
Multidimensional data
Multidimensional data
[4,7]
Multidimensional data
[4,7]
Multidimensional data
[4,7] [2,3,5]
Multidimensional data
[4,7] [2,3,5]
Multidimensional data
[4,7] [2,3,5]
[7,1,6,5,12,

8,9,2,2,4,
7,11,6,1,5]
Multidimensional data
[4,7] [2,3,5]
[7,1,6,5,12,

8,9,2,2,4,
7,11,6,1,5]
A linear approach: PCA
0 0 0 1 1 0 1 0 1 0
0 0 1 0 0 0 1 1 0 0
1 0 1 1 0 1 0 0 0 0
0 0 0 0 0 0 1 1 0 1
0 1 0 0 1 0 0 1 0 0
1 0 0 0 0 1 0 1 1 0
0 0 1 0 1 0 1 0 0 0
0 1 0 0 0 1 0 0 1 1
0 0 0 0 1 0 0 1 0 1
1 1 0 0 0 0 0 0 0 1
A linear approach: PCA
0 0 0 1 1 0 1 0 1 0
0 0 1 0 0 0 1 1 0 0
1 0 1 1 0 1 0 0 0 0
0 0 0 0 0 0 1 1 0 1
0 1 0 0 1 0 0 1 0 0
1 0 0 0 0 1 0 1 1 0
0 0 1 0 1 0 1 0 0 0
0 1 0 0 0 1 0 0 1 1
0 0 0 0 1 0 0 1 0 1
1 1 0 0 0 0 0 0 0 1
Tree-based approaches
Tree-based approaches
yes
no
yes
no
if orange
if !orange
if red
if !red
if !gray
if !gray
Tree-based approaches
yes
no
yes
no
if orange
if !orange
if red
if !red
if !gray
if !gray
yes
no
no
yes
yes
no
yes
no
yes
no
no
yes
yes
no
yes
no
yes
no
no
yes
yes
no
yes
no
yes
no
no
yes
yes
no
yes
no
yes
no
no
yes
yes
no
yes
no
yes
no
no
yes
yes
no
yes
no
Tree-based approaches
yes
no
yes
no
if orange
if !orange
if red
if !red
if !gray
if !gray
yes
no
no
yes
yes
no
yes
no
yes
no
no
yes
yes
no
yes
no
yes
no
no
yes
yes
no
yes
no
yes
no
no
yes
yes
no
yes
no
yes
no
no
yes
yes
no
yes
no
yes
no
no
yes
yes
no
yes
no
Self-organizing maps
Self-organizing maps
Finding outliers with SOMs
Finding outliers with SOMs
Finding outliers with SOMs
Finding outliers with SOMs
Finding outliers with SOMs
Outliers in log data
Outliers in log data
0.95
Outliers in log data
0.95
0.97
Outliers in log data
0.95
0.97
0.92
Outliers in log data
0.95
0.97
0.92
0.37
An outlier is any
record whose best
match was at least
4σ below the mean.
0.94
0.89
0.91
0.93
0.96
Out of 310 million log
records, we identified
0.0012% as outliers.
Thirty most extreme outliers
10 Can not communicate with power supply 2.
9 Power supply 2 failed.
8 Power supply redundancy is lost.
1 Drive A is removed.
1 Can not communicate with power supply 1.
1 Power supply 1 failed.
SOM TRAINING in SPARK
On-line SOM training
On-line SOM training
On-line SOM training
On-line SOM training
On-line SOM training
while t < iterations:
for ex in examples:
t = t + 1
if t == iterations:
break
bestMatch = closest(somt, ex)
for (unit, wt) in neighborhood(bestMatch, sigma(t)):
somt+1[unit] = somt[unit] + ex * alpha(t) * wt
On-line SOM training
while t < iterations:
for ex in examples:
t = t + 1
if t == iterations:
break
bestMatch = closest(somt, ex)
for (unit, wt) in neighborhood(bestMatch, sigma(t)):
somt+1[unit] = somt[unit] + ex * alpha(t) * wt
at each step, we update each unit by
adding its value from the previous step…
On-line SOM training
while t < iterations:
for ex in examples:
t = t + 1
if t == iterations:
break
bestMatch = closest(somt, ex)
for (unit, wt) in neighborhood(bestMatch, sigma(t)):
somt+1[unit] = somt[unit] + ex * alpha(t) * wt
to the example that we considered…
On-line SOM training
while t < iterations:
for ex in examples:
t = t + 1
if t == iterations:
break
bestMatch = closest(somt, ex)
for (unit, wt) in neighborhood(bestMatch, sigma(t)):
somt+1[unit] = somt[unit] + ex * alpha(t) * wt
scaled by a learning factor and the
distance from this unit to its best match
On-line SOM training
On-line SOM training
sensitive to
learning rate
not parallel
sensitive to
example order
Batch SOM training
for t in (1 to iterations):
state = newState()
for ex in examples:
bestMatch = closest(somt-1, ex)
hood = neighborhood(bestMatch, sigma(t))
state.matches += ex * hood
state.hoods += hood
somt = newSOM(state.matches / state.hoods)
Batch SOM training
for t in (1 to iterations):
state = newState()
for ex in examples:
bestMatch = closest(somt-1, ex)
hood = neighborhood(bestMatch, sigma(t))
state.matches += ex * hood
state.hoods += hood
somt = newSOM(state.matches / state.hoods)
update the state of every cell in the neighborhood
of the best matching unit, weighting by distance
Batch SOM training
for t in (1 to iterations):
state = newState()
for ex in examples:
bestMatch = closest(somt-1, ex)
hood = neighborhood(bestMatch, sigma(t))
state.matches += ex * hood
state.hoods += hood
somt = newSOM(state.matches / state.hoods)
keep track of the distance weights
we’ve seen for a weighted average
Batch SOM training
for t in (1 to iterations):
state = newState()
for ex in examples:
bestMatch = closest(somt-1, ex)
hood = neighborhood(bestMatch, sigma(t))
state.matches += ex * hood
state.hoods += hood
somt = newSOM(state.matches / state.hoods)
since we can easily merge multiple states, we
can train in parallel across many examples
Batch SOM training
over all partitions
Batch SOM training
over all partitions
Batch SOM training
over all partitions
Batch SOM training
over all partitions
Batch SOM training
over all partitions
Batch SOM training
over all partitions
Batch SOM training
driver (using aggregate)
workers
driver (using aggregate)
workers
driver (using aggregate)
workers
driver (using aggregate)
workers
What if you have a 3 mb model and 2,048 partitions?
driver (using treeAggregate)
workers
driver (using treeAggregate)
workers
driver (using treeAggregate)
workers
driver (using treeAggregate)
workers
driver (using treeAggregate)
workers
SHARING MODELS
BEYOND SPARK
Sharing models
class Model(private var entries: breeze.linalg.DenseVector[Double],
/* ... lots of (possibly) mutable state ... */ )
implements java.io.Serializable {
// lots of implementation details here
}
Sharing models
class Model(private var entries: breeze.linalg.DenseVector[Double],
/* ... lots of (possibly) mutable state ... */ )
implements java.io.Serializable {
// lots of implementation details here
}
case class FrozenModel(entries: Array[Double], /* ... */ ) { }
Sharing models
case class FrozenModel(entries: Array[Double], /* ... */ ) { }
class Model(private var entries: breeze.linalg.DenseVector[Double],
/* ... lots of (possibly) mutable state ... */ )
implements java.io.Serializable {
// lots of implementation details here
def freeze: FrozenModel = // ...
}
object Model {
def thaw(im: FrozenModel): Model = // ...
}
Sharing models
import org.json4s.jackson.Serialization
import org.json4s.jackson.Serialization.{read=>jread, write=>jwrite}
implicit val formats = Serialization.formats(NoTypeHints)
def toJson(m: Model): String = {
jwrite(som.freeze)
}
def fromJson(json: String): Try[Model] = {
Try({
Model.thaw(jread[FrozenModel](json))
})
}
Sharing models
import org.json4s.jackson.Serialization
import org.json4s.jackson.Serialization.{read=>jread, write=>jwrite}
implicit val formats = Serialization.formats(NoTypeHints)
def toJson(m: Model): String = {
jwrite(som.freeze)
}
def fromJson(json: String): Try[Model] = {
Try({
Model.thaw(jread[FrozenModel](json))
})
}
Also consider how you’ll
share feature encoders
and other parts of your
learning pipeline!
PRACTICAL MATTERS
Spark and ElasticSearch
Data locality is an issue and caching is even more
important than when running from local storage.
If your data are write-once, consider exporting ES
indices to Parquet files and analyzing those instead.
Structured queries in Spark
Always program defensively: mediate schemas,
explicitly convert null values, etc.
Use the Dataset API whenever possible to minimize
boilerplate and benefit from query planning without
(entirely) forsaking type safety.
Memory and partitioning
Large JVM heaps can lead to appalling GC pauses and
executor timeouts.
Use multiple JVMs or off-heap storage (in Spark 2.0!)
Tree aggregation can save you both memory and
execution time by partially aggregating at worker nodes.
Interoperability
Avoid brittle or language-specific model serializers
when sharing models with non-Spark environments.
JSON is imperfect but ubiquitous. However, json4s
will serialize case classes for free!
See also SPARK-13944, merged recently into 2.0.
Feature engineering
Favor feature engineering effort over complex or
novel learning algorithms.
Prefer approaches that train interpretable models.
Design your feature engineering pipeline so you can
translate feature vectors back to factor values.
@willb • willb@redhat.com

https://chapeau.freevariable.com
THANKS!

Más contenido relacionado

La actualidad más candente

Presentation capacity management for oracle exadata database machine v2
Presentation   capacity management for oracle exadata database machine v2Presentation   capacity management for oracle exadata database machine v2
Presentation capacity management for oracle exadata database machine v2xKinAnx
 
MySQL 8.0 Document Store - Discovery of a New World
MySQL 8.0 Document Store - Discovery of a New WorldMySQL 8.0 Document Store - Discovery of a New World
MySQL 8.0 Document Store - Discovery of a New WorldFrederic Descamps
 
Oracle Database performance tuning using oratop
Oracle Database performance tuning using oratopOracle Database performance tuning using oratop
Oracle Database performance tuning using oratopSandesh Rao
 
Presentation database security audit vault & database firewall
Presentation   database security audit vault & database firewallPresentation   database security audit vault & database firewall
Presentation database security audit vault & database firewallxKinAnx
 
SQL vs NoSQL | MySQL vs MongoDB Tutorial | Edureka
SQL vs NoSQL | MySQL vs MongoDB Tutorial | EdurekaSQL vs NoSQL | MySQL vs MongoDB Tutorial | Edureka
SQL vs NoSQL | MySQL vs MongoDB Tutorial | EdurekaEdureka!
 
Ms sql server architecture
Ms sql server architectureMs sql server architecture
Ms sql server architectureAjeet Singh
 
Why shift from ETL to ELT?
Why shift from ETL to ELT?Why shift from ETL to ELT?
Why shift from ETL to ELT?HEXANIKA
 
Introduction to Power BI and Data Visualization
Introduction to Power BI and Data VisualizationIntroduction to Power BI and Data Visualization
Introduction to Power BI and Data VisualizationSwapnil Jadhav
 
Troubleshooting Complex Performance issues - Oracle SEG$ contention
Troubleshooting Complex Performance issues - Oracle SEG$ contentionTroubleshooting Complex Performance issues - Oracle SEG$ contention
Troubleshooting Complex Performance issues - Oracle SEG$ contentionTanel Poder
 
Machine Learning and AI at Oracle
Machine Learning and AI at OracleMachine Learning and AI at Oracle
Machine Learning and AI at OracleSandesh Rao
 
Creación Indices y Constraints en bases de datos de SQL Server.pptx
Creación Indices y Constraints en bases de datos de SQL Server.pptxCreación Indices y Constraints en bases de datos de SQL Server.pptx
Creación Indices y Constraints en bases de datos de SQL Server.pptxCesarUlisesLopez1
 
Introduction to Dimesional Modelling
Introduction to Dimesional ModellingIntroduction to Dimesional Modelling
Introduction to Dimesional ModellingAshish Chandwani
 
Ash masters : advanced ash analytics on Oracle
Ash masters : advanced ash analytics on Oracle Ash masters : advanced ash analytics on Oracle
Ash masters : advanced ash analytics on Oracle Kyle Hailey
 
What’s New in Oracle Database 19c - Part 1
What’s New in Oracle Database 19c - Part 1What’s New in Oracle Database 19c - Part 1
What’s New in Oracle Database 19c - Part 1Satishbabu Gunukula
 
Five_Things_You_Might_Not_Know_About_Oracle_Database_v2.pptx
Five_Things_You_Might_Not_Know_About_Oracle_Database_v2.pptxFive_Things_You_Might_Not_Know_About_Oracle_Database_v2.pptx
Five_Things_You_Might_Not_Know_About_Oracle_Database_v2.pptxMaria Colgan
 
Oracle dba training
Oracle  dba    training Oracle  dba    training
Oracle dba training P S Rani
 
Data warehousing PRACTICAL (Sem - VI)
Data warehousing PRACTICAL (Sem - VI)Data warehousing PRACTICAL (Sem - VI)
Data warehousing PRACTICAL (Sem - VI)Satyendra Singh
 

La actualidad más candente (20)

Presentation capacity management for oracle exadata database machine v2
Presentation   capacity management for oracle exadata database machine v2Presentation   capacity management for oracle exadata database machine v2
Presentation capacity management for oracle exadata database machine v2
 
obiee basics ppt
obiee basics pptobiee basics ppt
obiee basics ppt
 
MySQL 8.0 Document Store - Discovery of a New World
MySQL 8.0 Document Store - Discovery of a New WorldMySQL 8.0 Document Store - Discovery of a New World
MySQL 8.0 Document Store - Discovery of a New World
 
Oracle Database performance tuning using oratop
Oracle Database performance tuning using oratopOracle Database performance tuning using oratop
Oracle Database performance tuning using oratop
 
Presentation database security audit vault & database firewall
Presentation   database security audit vault & database firewallPresentation   database security audit vault & database firewall
Presentation database security audit vault & database firewall
 
Sql dml & tcl 2
Sql   dml & tcl 2Sql   dml & tcl 2
Sql dml & tcl 2
 
SQL vs NoSQL | MySQL vs MongoDB Tutorial | Edureka
SQL vs NoSQL | MySQL vs MongoDB Tutorial | EdurekaSQL vs NoSQL | MySQL vs MongoDB Tutorial | Edureka
SQL vs NoSQL | MySQL vs MongoDB Tutorial | Edureka
 
Ms sql server architecture
Ms sql server architectureMs sql server architecture
Ms sql server architecture
 
Why shift from ETL to ELT?
Why shift from ETL to ELT?Why shift from ETL to ELT?
Why shift from ETL to ELT?
 
Introduction to Power BI and Data Visualization
Introduction to Power BI and Data VisualizationIntroduction to Power BI and Data Visualization
Introduction to Power BI and Data Visualization
 
Troubleshooting Complex Performance issues - Oracle SEG$ contention
Troubleshooting Complex Performance issues - Oracle SEG$ contentionTroubleshooting Complex Performance issues - Oracle SEG$ contention
Troubleshooting Complex Performance issues - Oracle SEG$ contention
 
Machine Learning and AI at Oracle
Machine Learning and AI at OracleMachine Learning and AI at Oracle
Machine Learning and AI at Oracle
 
Creación Indices y Constraints en bases de datos de SQL Server.pptx
Creación Indices y Constraints en bases de datos de SQL Server.pptxCreación Indices y Constraints en bases de datos de SQL Server.pptx
Creación Indices y Constraints en bases de datos de SQL Server.pptx
 
Introduction to Dimesional Modelling
Introduction to Dimesional ModellingIntroduction to Dimesional Modelling
Introduction to Dimesional Modelling
 
Ash masters : advanced ash analytics on Oracle
Ash masters : advanced ash analytics on Oracle Ash masters : advanced ash analytics on Oracle
Ash masters : advanced ash analytics on Oracle
 
What’s New in Oracle Database 19c - Part 1
What’s New in Oracle Database 19c - Part 1What’s New in Oracle Database 19c - Part 1
What’s New in Oracle Database 19c - Part 1
 
Sql views
Sql viewsSql views
Sql views
 
Five_Things_You_Might_Not_Know_About_Oracle_Database_v2.pptx
Five_Things_You_Might_Not_Know_About_Oracle_Database_v2.pptxFive_Things_You_Might_Not_Know_About_Oracle_Database_v2.pptx
Five_Things_You_Might_Not_Know_About_Oracle_Database_v2.pptx
 
Oracle dba training
Oracle  dba    training Oracle  dba    training
Oracle dba training
 
Data warehousing PRACTICAL (Sem - VI)
Data warehousing PRACTICAL (Sem - VI)Data warehousing PRACTICAL (Sem - VI)
Data warehousing PRACTICAL (Sem - VI)
 

Similar a Analyzing Log Data With Apache Spark

Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...Databricks
 
OSMC 2013 | Making monitoring simple? by Michael Medin
OSMC 2013 | Making monitoring simple? by Michael MedinOSMC 2013 | Making monitoring simple? by Michael Medin
OSMC 2013 | Making monitoring simple? by Michael MedinNETWAYS
 
Fire-fighting java big data problems
Fire-fighting java big data problemsFire-fighting java big data problems
Fire-fighting java big data problemsgrepalex
 
Juggling Chainsaws: Perl and MongoDB
Juggling Chainsaws: Perl and MongoDBJuggling Chainsaws: Perl and MongoDB
Juggling Chainsaws: Perl and MongoDBDavid Golden
 
Drilling Cyber Security Data With Apache Drill
Drilling Cyber Security Data With Apache DrillDrilling Cyber Security Data With Apache Drill
Drilling Cyber Security Data With Apache DrillCharles Givre
 
OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic
OC Big Data Monthly Meetup #5 - Session 2 - Sumo LogicOC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic
OC Big Data Monthly Meetup #5 - Session 2 - Sumo LogicBig Data Joe™ Rossi
 
Keep it simple web development stack
Keep it simple web development stackKeep it simple web development stack
Keep it simple web development stackEric Ahn
 
Informix SQL & NoSQL: Putting it all together
Informix SQL & NoSQL: Putting it all togetherInformix SQL & NoSQL: Putting it all together
Informix SQL & NoSQL: Putting it all togetherKeshav Murthy
 
Rest API using Flask & SqlAlchemy
Rest API using Flask & SqlAlchemyRest API using Flask & SqlAlchemy
Rest API using Flask & SqlAlchemyAlessandro Cucci
 
NSClient++: Monitoring Simplified at OSMC 2013
NSClient++: Monitoring Simplified at OSMC 2013NSClient++: Monitoring Simplified at OSMC 2013
NSClient++: Monitoring Simplified at OSMC 2013Michael Medin
 
Py conkr 20150829_docker-python
Py conkr 20150829_docker-pythonPy conkr 20150829_docker-python
Py conkr 20150829_docker-pythonEric Ahn
 
Py conkr 20150829_docker-python
Py conkr 20150829_docker-pythonPy conkr 20150829_docker-python
Py conkr 20150829_docker-pythonEric Ahn
 
RDF Analytics... SPARQL and Beyond
RDF Analytics... SPARQL and BeyondRDF Analytics... SPARQL and Beyond
RDF Analytics... SPARQL and BeyondFadi Maali
 
Introduction to Riak - Red Dirt Ruby Conf Training
Introduction to Riak - Red Dirt Ruby Conf TrainingIntroduction to Riak - Red Dirt Ruby Conf Training
Introduction to Riak - Red Dirt Ruby Conf TrainingSean Cribbs
 
How We Learned To Love The Data Center Operating System
How We Learned To Love The Data Center Operating SystemHow We Learned To Love The Data Center Operating System
How We Learned To Love The Data Center Operating Systemsaulius_vl
 
Context-Aware Access Control for RDF Graph Stores
Context-Aware Access Control for RDF Graph StoresContext-Aware Access Control for RDF Graph Stores
Context-Aware Access Control for RDF Graph StoresSerena Villata
 
Bridging Structured and Unstructred Data with Apache Hadoop and Vertica
Bridging Structured and Unstructred Data with Apache Hadoop and VerticaBridging Structured and Unstructred Data with Apache Hadoop and Vertica
Bridging Structured and Unstructred Data with Apache Hadoop and VerticaSteve Watt
 
breed_python_tx_redacted
breed_python_tx_redactedbreed_python_tx_redacted
breed_python_tx_redactedRyan Breed
 
Spark Summit EU talk by Debasish Das and Pramod Narasimha
Spark Summit EU talk by Debasish Das and Pramod NarasimhaSpark Summit EU talk by Debasish Das and Pramod Narasimha
Spark Summit EU talk by Debasish Das and Pramod NarasimhaSpark Summit
 
Spark Summit EU talk by Debasish Das and Pramod Narasimha
Spark Summit EU talk by Debasish Das and Pramod NarasimhaSpark Summit EU talk by Debasish Das and Pramod Narasimha
Spark Summit EU talk by Debasish Das and Pramod NarasimhaSpark Summit
 

Similar a Analyzing Log Data With Apache Spark (20)

Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...
 
OSMC 2013 | Making monitoring simple? by Michael Medin
OSMC 2013 | Making monitoring simple? by Michael MedinOSMC 2013 | Making monitoring simple? by Michael Medin
OSMC 2013 | Making monitoring simple? by Michael Medin
 
Fire-fighting java big data problems
Fire-fighting java big data problemsFire-fighting java big data problems
Fire-fighting java big data problems
 
Juggling Chainsaws: Perl and MongoDB
Juggling Chainsaws: Perl and MongoDBJuggling Chainsaws: Perl and MongoDB
Juggling Chainsaws: Perl and MongoDB
 
Drilling Cyber Security Data With Apache Drill
Drilling Cyber Security Data With Apache DrillDrilling Cyber Security Data With Apache Drill
Drilling Cyber Security Data With Apache Drill
 
OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic
OC Big Data Monthly Meetup #5 - Session 2 - Sumo LogicOC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic
OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic
 
Keep it simple web development stack
Keep it simple web development stackKeep it simple web development stack
Keep it simple web development stack
 
Informix SQL & NoSQL: Putting it all together
Informix SQL & NoSQL: Putting it all togetherInformix SQL & NoSQL: Putting it all together
Informix SQL & NoSQL: Putting it all together
 
Rest API using Flask & SqlAlchemy
Rest API using Flask & SqlAlchemyRest API using Flask & SqlAlchemy
Rest API using Flask & SqlAlchemy
 
NSClient++: Monitoring Simplified at OSMC 2013
NSClient++: Monitoring Simplified at OSMC 2013NSClient++: Monitoring Simplified at OSMC 2013
NSClient++: Monitoring Simplified at OSMC 2013
 
Py conkr 20150829_docker-python
Py conkr 20150829_docker-pythonPy conkr 20150829_docker-python
Py conkr 20150829_docker-python
 
Py conkr 20150829_docker-python
Py conkr 20150829_docker-pythonPy conkr 20150829_docker-python
Py conkr 20150829_docker-python
 
RDF Analytics... SPARQL and Beyond
RDF Analytics... SPARQL and BeyondRDF Analytics... SPARQL and Beyond
RDF Analytics... SPARQL and Beyond
 
Introduction to Riak - Red Dirt Ruby Conf Training
Introduction to Riak - Red Dirt Ruby Conf TrainingIntroduction to Riak - Red Dirt Ruby Conf Training
Introduction to Riak - Red Dirt Ruby Conf Training
 
How We Learned To Love The Data Center Operating System
How We Learned To Love The Data Center Operating SystemHow We Learned To Love The Data Center Operating System
How We Learned To Love The Data Center Operating System
 
Context-Aware Access Control for RDF Graph Stores
Context-Aware Access Control for RDF Graph StoresContext-Aware Access Control for RDF Graph Stores
Context-Aware Access Control for RDF Graph Stores
 
Bridging Structured and Unstructred Data with Apache Hadoop and Vertica
Bridging Structured and Unstructred Data with Apache Hadoop and VerticaBridging Structured and Unstructred Data with Apache Hadoop and Vertica
Bridging Structured and Unstructred Data with Apache Hadoop and Vertica
 
breed_python_tx_redacted
breed_python_tx_redactedbreed_python_tx_redacted
breed_python_tx_redacted
 
Spark Summit EU talk by Debasish Das and Pramod Narasimha
Spark Summit EU talk by Debasish Das and Pramod NarasimhaSpark Summit EU talk by Debasish Das and Pramod Narasimha
Spark Summit EU talk by Debasish Das and Pramod Narasimha
 
Spark Summit EU talk by Debasish Das and Pramod Narasimha
Spark Summit EU talk by Debasish Das and Pramod NarasimhaSpark Summit EU talk by Debasish Das and Pramod Narasimha
Spark Summit EU talk by Debasish Das and Pramod Narasimha
 

Más de Spark Summit

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang Spark Summit
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...Spark Summit
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang WuSpark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya RaghavendraSpark Summit
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...Spark Summit
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...Spark Summit
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakSpark Summit
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimSpark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraSpark Summit
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Spark Summit
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...Spark Summit
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spark Summit
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovSpark Summit
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Spark Summit
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkSpark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...Spark Summit
 

Más de Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 

Último

Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...karishmasinghjnh
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...amitlee9823
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...amitlee9823
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...amitlee9823
 
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men 🔝Thrissur🔝 Escor...
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men  🔝Thrissur🔝   Escor...➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men  🔝Thrissur🔝   Escor...
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men 🔝Thrissur🔝 Escor...amitlee9823
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsJoseMangaJr1
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...amitlee9823
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...amitlee9823
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...amitlee9823
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramMoniSankarHazra
 

Último (20)

Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
 
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men 🔝Thrissur🔝 Escor...
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men  🔝Thrissur🔝   Escor...➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men  🔝Thrissur🔝   Escor...
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men 🔝Thrissur🔝 Escor...
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter Lessons
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics Program
 

Analyzing Log Data With Apache Spark

  • 1. Analyzing log data with Apache Spark William Benton Red Hat Emerging Technology
  • 4. Challenges of log data SELECT hostname, DATEPART(HH, timestamp) AS hour, COUNT(msg) FROM LOGS WHERE level='CRIT' AND msg LIKE '%failure%' GROUP BY hostname, hour
  • 5. Challenges of log data 11:00 12:00 13:00 14:00 15:00 16:00 17:00 18:00 SELECT hostname, DATEPART(HH, timestamp) AS hour, COUNT(msg) FROM LOGS WHERE level='CRIT' AND msg LIKE '%failure%' GROUP BY hostname, hour
  • 6. Challenges of log data postgres httpd syslog INFO INFO WARN CRIT DEBUG INFO GET GET GET POST WARN WARN INFO INFO INFO GET (404) INFO (ca. 2000)
  • 7. Challenges of log data postgres httpd syslog INFO WARN GET (404) CRIT INFO GET GET GET POST INFO INFO INFO WARN CouchDB httpd Django INFO CRITINFO GET POST INFO INFO INFO WARN haproxy k8s INFO INFO WARN CRITDEBUG WARN WARN INFO INFOINFO INFO Cassandra nginx Rails INFO CRIT INFO GET POST PUT POST INFO INFO INFOWARN INFO redis INFO CRIT INFOINFO PUT (500)httpd syslog GET PUT INFO INFO INFOWARN (ca. 2016)
  • 8. Challenges of log datapostgres httpd syslog INFO WARN GET (404) CRIT INFO GET GET GET POST INFO INFO INFO WARN CouchDB httpd Django INFO CRITINFO GET POST INFO INFO INFO WARN haproxy k8s INFO INFO WARN CRITDEBUG WARN WARN INFO INFOINFO INFO Cassandra nginx Rails INFO CRIT INFO GET POST PUT POST INFO INFO INFOWARN INFO redis INFO CRIT INFOINFO PUT (500)httpd syslog GET PUT INFO INFO INFOWARN postgres httpd syslog INFO WARN GET (404) CRIT INFO GET GET GET POST INFO INFO INFO WARN CouchDB httpd Django INFO CRITINFO GET POST INFO INFO INFO WARN haproxy k8s INFO INFO WARN CRITDEBUG WARN WARN INFO INFOINFO INFO Cassandra nginx Rails INFO CRIT INFO GET POST PUT POST INFO INFO INFOWARN INFO redis INFO CRIT INFOINFO PUT (500)httpd syslog GET PUT INFO INFO INFOWARN postgres httpd syslog INFO WARN GET (404) CRIT INFO GET GET GET POST INFO INFO INFO WARN CouchDB httpd Django INFO CRITINFO GET POST INFO INFO INFO WARN haproxy k8s INFO INFO WARN CRITDEBUG WARN WARN INFO INFOINFO INFO Cassandra nginx Rails INFO CRIT INFO GET POST PUT POST INFO INFO INFOWARN INFO redis INFO CRIT INFOINFO PUT (500)httpd syslog GET PUT INFO INFO INFOWARN postgres httpd syslog INFO WARN GET (404) CRIT INFO GET GET GET POST INFO INFO INFO WARN CouchDB httpd Django INFO CRITINFO GET POST INFO INFO INFO WARN haproxy k8s INFO INFO WARN CRITDEBUG WARN WARN INFO INFOINFO INFO Cassandra nginx Rails INFO CRIT INFO GET POST PUT POST INFO INFO INFOWARN INFO redis INFO CRIT INFOINFO PUT (500)httpd syslog GET PUT INFO INFO INFOWARN postgres httpd syslog INFO WARN GET (404) CRIT INFO GET GET GET POST INFO INFO INFO WARN CouchDB httpd Django INFO CRITINFO GET POST INFO INFO INFO WARN haproxy k8s INFO INFO WARN CRITDEBUG WARN WARN INFO INFOINFO INFO Cassandra nginx Rails INFO CRIT INFO GET POST PUT POST INFO INFO INFOWARN INFO redis INFO CRIT INFOINFO PUT (500)httpd syslog GET PUT INFO INFO INFOWARN postgres httpd syslog INFO WARN GET (404) CRIT INFO GET GET GET POST INFO INFO INFO WARN CouchDB httpd Django INFO CRITINFO GET POST INFO INFO INFO WARN haproxy k8s INFO INFO WARN CRITDEBUG WARN WARN INFO INFOINFO INFO Cassandra nginx Rails INFO CRIT INFO GET POST PUT POST INFO INFO INFOWARN INFO redis INFO CRIT INFOINFO PUT (500)httpd syslog GET PUT INFO INFO INFOWARN How many services are generating logs in your datacenter today?
  • 10. Collecting log data collecting Ingesting live log data via rsyslog, logstash, fluentd normalizing Reconciling log record metadata across sources warehousing Storing normalized records in ES indices analysis cache warehoused data as Parquet files on Gluster volume local to Spark cluster
  • 11. Collecting log data warehousing Storing normalized records in ES indices analysis cache warehoused data as Parquet files on Gluster volume local to Spark cluster
  • 12. Collecting log data warehousing Storing normalized records in ES indices analysis cache warehoused data as Parquet files on Gluster volume local to Spark cluster
  • 16. Schema mediation timestamp, level, host, IP addresses, message, &c. rsyslog-style metadata, like app name, facility, &c.
  • 17. logs .select("level").distinct .map { case Row(s: String) => s } .collect Exploring structured data logs .groupBy($"level", $"rsyslog.app_name") .agg(count("level").as("total")) .orderBy($"total".desc) .show info kubelet 17933574 info kube-proxy 10961117 err journal 6867921 info systemd 5184475 … debug, notice, emerg, err, warning, crit, info, severe, alert
  • 18. Exploring structured data logs .groupBy($"level", $"rsyslog.app_name") .agg(count("level").as("total")) .orderBy($"total".desc) .show info kubelet 17933574 info kube-proxy 10961117 err journal 6867921 info systemd 5184475 … logs .select("level").distinct .as[String].collect debug, notice, emerg, err, warning, crit, info, severe, alert
  • 19. Exploring structured data logs .groupBy($"level", $"rsyslog.app_name") .agg(count("level").as("total")) .orderBy($"total".desc) .show info kubelet 17933574 info kube-proxy 10961117 err journal 6867921 info systemd 5184475 … logs .select("level").distinct .as[String].collect debug, notice, emerg, err, warning, crit, info, severe, alert This class must be declared outside the REPL!
  • 21. From log records to vectors What does it mean for two sets of categorical features to be similar? red green blue orange -> 000 -> 010 -> 100 -> 001 pancakes waffles aebliskiver omelets bacon hash browns -> 10000 -> 01000 -> 00100 -> 00001 -> 00000 -> 00010
  • 22. From log records to vectors What does it mean for two sets of categorical features to be similar? red green blue orange -> 000 -> 010 -> 100 -> 001 pancakes waffles aebliskiver omelets bacon hash browns -> 10000 -> 01000 -> 00100 -> 00001 -> 00000 -> 00010 red pancakes orange waffles -> 00010000 -> 00101000
  • 25. Similarity and distance (q - p) • (q - p)
  • 29. WARN INFO INFOINFO WARN DEBUGINFOINFOINFO WARN WARNINFO INFO INFO WAR INFO INFO Other interesting features host01 host02 host03
  • 30. INFO INFOINFO DEBUGINFOINFO WARNNFO INFO INFO WARN INFO INFOINFO INFO INFO INFOINFOINFO INFOINFO INFO INFO INFO INFO WARN DEBUG Other interesting features host01 host02 host03
  • 31. INFO INFOINFO INFO INFO INFOINFO INFO INFO INFO WARN DEBUG WARN INFO INFO EBUG WARN INFO INFO INFO INFO INFO INFO INFO INFO WARN WARN INFO WARN INFO Other interesting features host01 host02 host03
  • 32. Other interesting features : Great food, great service, a must-visit! : Our whole table got gastroenteritis. : This place is so wonderful that it has ruined all other tacos for me and my family.
  • 33. Other interesting features INFO: Everything is great! Just checking in to let you know I’m OK.
  • 34. Other interesting features INFO: Everything is great! Just checking in to let you know I’m OK. CRIT: No requests in last hour; suspending running app containers.
  • 35. Other interesting features INFO: Everything is great! Just checking in to let you know I’m OK. CRIT: No requests in last hour; suspending running app containers. INFO: Phoenix datacenter is on fire; may not rise from ashes.
  • 36. Other interesting features INFO: Everything is great! Just checking in to let you know I’m OK. CRIT: No requests in last hour; suspending running app containers. INFO: Phoenix datacenter is on fire; may not rise from ashes. See https://links.freevariable.com/nlp-logs/ for more!
  • 45. A linear approach: PCA 0 0 0 1 1 0 1 0 1 0 0 0 1 0 0 0 1 1 0 0 1 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 1 0 0 1 0 0 1 0 0 1 0 0 0 0 1 0 1 1 0 0 0 1 0 1 0 1 0 0 0 0 1 0 0 0 1 0 0 1 1 0 0 0 0 1 0 0 1 0 1 1 1 0 0 0 0 0 0 0 1
  • 46. A linear approach: PCA 0 0 0 1 1 0 1 0 1 0 0 0 1 0 0 0 1 1 0 0 1 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 1 0 0 1 0 0 1 0 0 1 0 0 0 0 1 0 1 1 0 0 0 1 0 1 0 1 0 0 0 0 1 0 0 0 1 0 0 1 1 0 0 0 0 1 0 0 1 0 1 1 1 0 0 0 0 0 0 0 1
  • 47.
  • 49. Tree-based approaches yes no yes no if orange if !orange if red if !red if !gray if !gray
  • 50. Tree-based approaches yes no yes no if orange if !orange if red if !red if !gray if !gray yes no no yes yes no yes no yes no no yes yes no yes no yes no no yes yes no yes no yes no no yes yes no yes no yes no no yes yes no yes no yes no no yes yes no yes no
  • 51. Tree-based approaches yes no yes no if orange if !orange if red if !red if !gray if !gray yes no no yes yes no yes no yes no no yes yes no yes no yes no no yes yes no yes no yes no no yes yes no yes no yes no no yes yes no yes no yes no no yes yes no yes no
  • 60. Outliers in log data 0.95
  • 61. Outliers in log data 0.95 0.97
  • 62. Outliers in log data 0.95 0.97 0.92
  • 63. Outliers in log data 0.95 0.97 0.92 0.37 An outlier is any record whose best match was at least 4σ below the mean. 0.94 0.89 0.91 0.93 0.96
  • 64.
  • 65. Out of 310 million log records, we identified 0.0012% as outliers.
  • 66.
  • 67.
  • 68. Thirty most extreme outliers 10 Can not communicate with power supply 2. 9 Power supply 2 failed. 8 Power supply redundancy is lost. 1 Drive A is removed. 1 Can not communicate with power supply 1. 1 Power supply 1 failed.
  • 74. On-line SOM training while t < iterations: for ex in examples: t = t + 1 if t == iterations: break bestMatch = closest(somt, ex) for (unit, wt) in neighborhood(bestMatch, sigma(t)): somt+1[unit] = somt[unit] + ex * alpha(t) * wt
  • 75. On-line SOM training while t < iterations: for ex in examples: t = t + 1 if t == iterations: break bestMatch = closest(somt, ex) for (unit, wt) in neighborhood(bestMatch, sigma(t)): somt+1[unit] = somt[unit] + ex * alpha(t) * wt at each step, we update each unit by adding its value from the previous step…
  • 76. On-line SOM training while t < iterations: for ex in examples: t = t + 1 if t == iterations: break bestMatch = closest(somt, ex) for (unit, wt) in neighborhood(bestMatch, sigma(t)): somt+1[unit] = somt[unit] + ex * alpha(t) * wt to the example that we considered…
  • 77. On-line SOM training while t < iterations: for ex in examples: t = t + 1 if t == iterations: break bestMatch = closest(somt, ex) for (unit, wt) in neighborhood(bestMatch, sigma(t)): somt+1[unit] = somt[unit] + ex * alpha(t) * wt scaled by a learning factor and the distance from this unit to its best match
  • 79. On-line SOM training sensitive to learning rate not parallel sensitive to example order
  • 80. Batch SOM training for t in (1 to iterations): state = newState() for ex in examples: bestMatch = closest(somt-1, ex) hood = neighborhood(bestMatch, sigma(t)) state.matches += ex * hood state.hoods += hood somt = newSOM(state.matches / state.hoods)
  • 81. Batch SOM training for t in (1 to iterations): state = newState() for ex in examples: bestMatch = closest(somt-1, ex) hood = neighborhood(bestMatch, sigma(t)) state.matches += ex * hood state.hoods += hood somt = newSOM(state.matches / state.hoods) update the state of every cell in the neighborhood of the best matching unit, weighting by distance
  • 82. Batch SOM training for t in (1 to iterations): state = newState() for ex in examples: bestMatch = closest(somt-1, ex) hood = neighborhood(bestMatch, sigma(t)) state.matches += ex * hood state.hoods += hood somt = newSOM(state.matches / state.hoods) keep track of the distance weights we’ve seen for a weighted average
  • 83. Batch SOM training for t in (1 to iterations): state = newState() for ex in examples: bestMatch = closest(somt-1, ex) hood = neighborhood(bestMatch, sigma(t)) state.matches += ex * hood state.hoods += hood somt = newSOM(state.matches / state.hoods) since we can easily merge multiple states, we can train in parallel across many examples
  • 94. driver (using aggregate) workers What if you have a 3 mb model and 2,048 partitions?
  • 101. Sharing models class Model(private var entries: breeze.linalg.DenseVector[Double], /* ... lots of (possibly) mutable state ... */ ) implements java.io.Serializable { // lots of implementation details here }
  • 102. Sharing models class Model(private var entries: breeze.linalg.DenseVector[Double], /* ... lots of (possibly) mutable state ... */ ) implements java.io.Serializable { // lots of implementation details here } case class FrozenModel(entries: Array[Double], /* ... */ ) { }
  • 103. Sharing models case class FrozenModel(entries: Array[Double], /* ... */ ) { } class Model(private var entries: breeze.linalg.DenseVector[Double], /* ... lots of (possibly) mutable state ... */ ) implements java.io.Serializable { // lots of implementation details here def freeze: FrozenModel = // ... } object Model { def thaw(im: FrozenModel): Model = // ... }
  • 104. Sharing models import org.json4s.jackson.Serialization import org.json4s.jackson.Serialization.{read=>jread, write=>jwrite} implicit val formats = Serialization.formats(NoTypeHints) def toJson(m: Model): String = { jwrite(som.freeze) } def fromJson(json: String): Try[Model] = { Try({ Model.thaw(jread[FrozenModel](json)) }) }
  • 105. Sharing models import org.json4s.jackson.Serialization import org.json4s.jackson.Serialization.{read=>jread, write=>jwrite} implicit val formats = Serialization.formats(NoTypeHints) def toJson(m: Model): String = { jwrite(som.freeze) } def fromJson(json: String): Try[Model] = { Try({ Model.thaw(jread[FrozenModel](json)) }) } Also consider how you’ll share feature encoders and other parts of your learning pipeline!
  • 107. Spark and ElasticSearch Data locality is an issue and caching is even more important than when running from local storage. If your data are write-once, consider exporting ES indices to Parquet files and analyzing those instead.
  • 108. Structured queries in Spark Always program defensively: mediate schemas, explicitly convert null values, etc. Use the Dataset API whenever possible to minimize boilerplate and benefit from query planning without (entirely) forsaking type safety.
  • 109. Memory and partitioning Large JVM heaps can lead to appalling GC pauses and executor timeouts. Use multiple JVMs or off-heap storage (in Spark 2.0!) Tree aggregation can save you both memory and execution time by partially aggregating at worker nodes.
  • 110. Interoperability Avoid brittle or language-specific model serializers when sharing models with non-Spark environments. JSON is imperfect but ubiquitous. However, json4s will serialize case classes for free! See also SPARK-13944, merged recently into 2.0.
  • 111. Feature engineering Favor feature engineering effort over complex or novel learning algorithms. Prefer approaches that train interpretable models. Design your feature engineering pipeline so you can translate feature vectors back to factor values.