Unsupervised learning refers to a family of algorithms that try to find structure in unlabeled data. Clustering algorithms, for example, try to partition the elements of a dataset into related groups. Dimensionality reduction algorithms search for a simpler representation of a dataset. Spark's MLlib module contains implementations of several unsupervised learning algorithms that scale to huge datasets. In this talk, we'll dive into uses and implementations of Spark's K-means clustering and Singular Value Decomposition (SVD).
Bio:
Sandy Ryza is an engineer on the data science team at Cloudera. He is a committer on Apache Hadoop and recently led Cloudera's Apache Spark development.
Unsupervised Learning with Apache Spark
2. ● Data scientist at Cloudera
● Recently led Apache Spark development at Cloudera
● Before that, a committer on Apache Hadoop
● Before that, studied combinatorial optimization and distributed systems at Brown
14. ● How many kinds of stuff are there?
● Why is some stuff not like the others?
● How do I contextualize new stuff?
● Is there a simpler way to represent this stuff?
15. ● Learn hidden structure of your data
● Interpret new data as it relates to this structure
16. ● Clustering
○ Partition data into categories
● Dimensionality reduction
○ Find a condensed representation of your data
17. ● Designing a system for processing huge data in parallel
● Taking advantage of it with algorithms that work well in parallel
21. (Diagram: bigfile.txt in HDFS becomes the lines RDD and then the numbers RDD, each split across partitions; the sum is returned to the Driver.)
val lines = sc.textFile("bigfile.txt")
val numbers = lines.map((x) => x.toDouble)
numbers.sum()
23. (Diagram: the same dataflow, with the numbers RDD now kept in memory by cache().)
val lines = sc.textFile("bigfile.txt")
val numbers = lines.map((x) => x.toInt)
numbers.cache().sum()
37. import org.apache.spark.mllib.clustering.KMeans

val data = sc.textFile("kmeans_data.txt")
val parsedData = data.map(_.split(' ').map(_.toDouble))
// Cluster the data into two classes using KMeans
val numIterations = 20
val numClusters = 2
val clusters = KMeans.train(parsedData, numClusters, numIterations)
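The trained model can then contextualize new data, one of the questions from earlier. A minimal sketch, assuming the KMeansModel API of this Spark era (predict and computeCost; the point values here are made up):

// Assign a previously unseen point to its nearest learned cluster
val newPoint = Array(1.5, 2.0, 0.5)
val clusterId = clusters.predict(newPoint)
// Total within-cluster sum of squared distances; lower means tighter clusters
val wssse = clusters.computeCost(parsedData)
println("Point maps to cluster " + clusterId + "; cost = " + wssse)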
38. ● Alternate between two steps (a local sketch follows below):
○ Assign each point to a cluster based on existing centers
○ Recompute cluster centers from the points in each cluster
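Before the distributed version, a minimal local sketch of one such iteration (hypothetical standalone Scala, not the MLlib code):

def lloydStep(points: Array[Array[Double]],
    centers: Array[Array[Double]]): Array[Array[Double]] = {
  def sqDist(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum
  // Step 1: assign each point to its closest center
  val assigned = points.groupBy(p => centers.indices.minBy(i => sqDist(p, centers(i))))
  // Step 2: recompute each center as the mean of its assigned points
  centers.indices.map { i =>
    assigned.get(i) match {
      case Some(ps) => ps.transpose.map(_.sum / ps.length)
      case None => centers(i) // keep a center that attracted no points
    }
  }.toArray
}

Iterating lloydStep until the centers stop moving is Lloyd's algorithm; the following slides show how MLlib parallelizes exactly these two steps.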
44. ● Alternate between two steps:
○ Assign each point to a cluster based on existing centers
■ Process each data point independently
○ Recompute cluster centers from the points in each cluster
■ Average across partitions
45. // Find the sum and count of points mapping to each center
// (BDV / BV are aliases for Breeze's dense and generic vectors;
// centers and costAccum come from the enclosing scope)
val totalContribs = data.mapPartitions { points =>
  val k = centers.length
  val dims = centers(0).vector.length
  val sums = Array.fill(k)(BDV.zeros[Double](dims).asInstanceOf[BV[Double]])
  val counts = Array.fill(k)(0L)
  points.foreach { point =>
    val (bestCenter, cost) = KMeans.findClosest(centers, point)
    costAccum += cost
    sums(bestCenter) += point.vector
    counts(bestCenter) += 1
  }
  val contribs = for (j <- 0 until k) yield {
    (j, (sums(j), counts(j)))
  }
  contribs.iterator
}.reduceByKey(mergeContribs).collectAsMap()
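mergeContribs isn't shown on the slide; in the Spark source it is roughly the following, an element-wise, in-place sum of the partial coordinate sums plus the partial point counts for each center:

def mergeContribs(
    p1: (BV[Double], Long),
    p2: (BV[Double], Long)): (BV[Double], Long) = {
  (p1._1 += p2._1, p1._2 + p2._2)
}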
46. // Update the cluster centers and costs
// (k, run, iteration, and epsilon come from the enclosing scope)
var changed = false
var j = 0
while (j < k) {
  val (sum, count) = totalContribs(j)
  if (count != 0) {
    sum /= count.toDouble
    val newCenter = new BreezeVectorWithNorm(sum)
    if (KMeans.fastSquaredDistance(newCenter, centers(j)) > epsilon * epsilon) {
      changed = true
    }
    centers(j) = newCenter
  }
  j += 1
}
if (!changed) {
  logInfo("Run " + run + " finished in " + (iteration + 1) + " iterations")
}
cost = costAccum.value
48. ● K-means is very sensitive to the initial set of center points chosen.
● The best existing algorithm for choosing centers is highly sequential.
50. ● Start with a random point from the dataset
● Pick another one randomly, with probability proportional to its distance from the closest center already chosen
● Repeat until all initial centers are chosen (sketched below)
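A minimal local sketch of this initialization (hypothetical standalone Scala; note the standard k-means++ rule weights by squared distance to the closest chosen center):

import scala.util.Random
import scala.collection.mutable.ArrayBuffer

def kMeansPlusPlusInit(points: Array[Array[Double]], k: Int, rand: Random): Array[Array[Double]] = {
  def sqDist(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum
  // Start with a uniformly random point
  val centers = ArrayBuffer(points(rand.nextInt(points.length)))
  while (centers.length < k) {
    // Weight each point by its squared distance to the closest chosen center
    val weights = points.map(p => centers.map(c => sqDist(p, c)).min)
    // Sample the next center with probability proportional to its weight
    val target = rand.nextDouble() * weights.sum
    var cum = 0.0
    var i = 0
    while (i < points.length - 1 && cum + weights(i) < target) {
      cum += weights(i); i += 1
    }
    centers += points(i)
  }
  centers.toArray
}

MLlib's own initializer is k-means|| (Bahmani et al.), a parallel variant of this idea that samples many candidate centers per pass instead of one.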
67. ● Find a lower-dimensional representation that lets you visualize the data
● Feature learning: find a representation that's good for clustering or classification
● Latent Semantic Analysis
68. val data: RDD[Vector] = ...
val mat = new RowMatrix(data)
// compute the top 5 principal components
val principalComponents = mat.computePrincipalComponents(5)
// project the rows into the principal-component subspace
val transformed = mat.multiply(principalComponents)
69. ● Center the data
● Find its covariance matrix
● Its eigenvectors are the principal components (a local sketch follows below)
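A minimal local sketch of these three steps using Breeze (hypothetical standalone code, not the distributed MLlib implementation):

import breeze.linalg._
import breeze.stats.mean

def localPca(x: DenseMatrix[Double], k: Int): DenseMatrix[Double] = {
  // Center: subtract each column's mean from every row
  val colMeans: DenseVector[Double] = mean(x(::, *)).t
  val centered = x(*, ::) - colMeans
  // Sample covariance matrix of the centered data
  val cov = (centered.t * centered) / (x.rows - 1).toDouble
  // eigSym returns eigenvalues in ascending order, so the top-k
  // principal components are the last k eigenvector columns
  val es = eigSym(cov)
  es.eigenvectors(::, (cov.cols - k) until cov.cols)
}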
77. def computeGramianMatrix(): Matrix = {
  val n = numCols().toInt
  // The upper triangle of an n × n symmetric matrix has n(n+1)/2 entries
  val nt: Int = n * (n + 1) / 2
  // Compute the upper triangular part of the Gram matrix in packed storage
  val GU = rows.aggregate(new BDV[Double](new Array[Double](nt)))(
    seqOp = (U, v) => {
      // dspr: symmetric rank-1 update, U += 1.0 * v * vᵀ (upper triangle only)
      RowMatrix.dspr(1.0, v, U.data)
      U
    },
    combOp = (U1, U2) => U1 += U2
  )
  // Expand the packed upper triangle into a full symmetric matrix
  RowMatrix.triuToFull(n, GU.data)
}
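A note on why the Gramian matters for the SVD use case: for a tall-and-skinny RowMatrix A, the n × n Gramian AᵀA fits on the driver, and its eigendecomposition yields A's right singular vectors and squared singular values. A local sketch of that last step with Breeze (a simplification for illustration, not MLlib's actual computeSVD code):

import breeze.linalg.{DenseMatrix, eigSym}

def svdFromGramian(gram: DenseMatrix[Double], k: Int): (DenseMatrix[Double], Array[Double]) = {
  val es = eigSym(gram) // eigenvalues in ascending order
  val n = gram.cols
  val topIdx = ((n - k) until n).reverse // indices of the k largest eigenvalues
  // Singular values of A are square roots of the Gramian's eigenvalues
  val sigmas = topIdx.map(i => math.sqrt(math.max(es.eigenvalues(i), 0.0))).toArray
  // The matching eigenvectors are A's right singular vectors V
  val v = DenseMatrix.horzcat(topIdx.map(i => es.eigenvectors(::, i).toDenseMatrix.t): _*)
  (v, sigmas)
}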