SlideShare una empresa de Scribd logo
1 de 75
Chu-Cheng Hsieh
Modern technologies
in data science
www.linkedin.com/in/chucheng
3
Task: Find “Error” in /var/log/system.log
4
/*Java*/
BufferedReader br = new BufferedReader(
new FileReader(”system.log"));
try {
StringBuilder sb = new StringBuilder();
String line = br.readLine();
while (line != null) {
if (line.contains(“Error”) {
System.out.println(line);
}
String everything = sb.toString();
} finally {
br.close();
}
5
# Python
with open(”system.log", "r") as ins:
for line in ins:
if “Error” in line:
print line[:-1]
6
# Bash
grep "Error" system.log
# !! Best !!
grep –B 3 –A 2 "Error" system.log|less
7
Task: Find “Error” in /var/log/system.log
What if,
system.log > 1TB? 10TB? … 1PB?
8
More CPUsMore Data
9credit: http://www.cse.wustl.edu/~jain/
10
(1) Multi-thread coding is really really
painful.
(2) Max # of cores in a machines in
limited
11
12
$US 390 billion
Map Reduce
13 13
14
An easy way of “multi-core”
programming.
Map
15 15
16
Q. How to eat a pizza in 1 minute?
17
Ans.
18
Map = “things can be divide-and-conquer”
Reduce
19 19
20
Reduce =
“merge pieces to make result meaningful”
21
“Word Count”
4 Nodes
22
“map”
23
“reduce”
24
Map-reduce - an easy way to
“collaborate”.
25
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;
public class WordCount {
//////////// MAPPER function
//////////////
public static class Map extends
MapReduceBase implements
Mapper<LongWritable, Text, Text,
IntWritable> {
public void map(LongWritable key, Text
value, OutputCollector<Text, IntWritable>
output,
Reporter reporter) throws IOException {
String line = value.toString();
StringTokenizer tokenizer = new
StringTokenizer(line);
Text word = new Text();
while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
output.collect(word, new IntWritable(1));
}
}
}
//////////// REDUCER function /////////////
public static class Reduce extends MapReduceBase
implements Reducer<Text, IntWritable, Text,
IntWritable> {
public void reduce(Text key, Iterator<IntWritable>
values, OutputCollector<Text,
IntWritable> output, Reporter reporter) throws
IOException {
int sum = 0;
while (values.hasNext()) {
sum += values.next().get();
}
output.collect(key, new IntWritable(sum));
}
}
public static void main(String[] args) throws
Exception {
/////////// JOB description ///////////
JobConf conf = new JobConf(WordCount.class);
conf.setJobName("wordcount");
conf.setMapperClass(Map.class);
conf.setReducerClass(Reduce.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new
Path(args[0]));
FileOutputFormat.setOutputPath(conf, new
Path(args[1]));
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
JobClient.runJob(conf);
}
}
26
a = load ’input.txt';
#map
b = foreach a generate
flatten(TOKENIZE((chararray)$0)) as word;
#reduce
c = group b by word;
d = foreach c generate COUNT(b), group;
store d into ‘wordcount.txt';
27
What’s wrong with PIG?
28
Word count + “case insensitive”
29
REGISTER lib.jar;
a = load ’input.txt';
#map
b = foreach a generate
flatten(TOKENIZE((chararray)$0)) as word;
b2 = foreach b generate lib.UPPER(word)
#reduce
c = group b2 by word;
d = foreach c generate COUNT(b), group;
store d into ‘wordcount.txt';
Case In sensitive
30
package myudfs;
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
public class UPPER extends EvalFunc<String>
{
public String exec(Tuple input) throws IOException {
if (input == null || input.size() == 0 || input.get(0) == null)
return null;
try{
String str = (String)input.get(0);
return str.toUpperCase();
}catch(Exception e){
throw new IOException(”Error in input row ", e);
}
}
}
31
32 https://spark.apache.org/
33
chsieh@dev-sfear01:~$ spark-shell --master local[4]
Spark assembly has been built with Hive, including Datanucleus jars on classpath
15/03/26 14:32:12 WARN NativeCodeLoader: Unable to load native-hadoop library for
your platform... using builtin-java classes where applicable
Welcome to
____ __
/ __/__ ___ _____/ /__
_ / _ / _ `/ __/ '_/
/___/ .__/_,_/_/ /_/_ version 1.3.0
/_/
Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_76)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
SQL context available as sqlContext.
scala>
34
chsieh@dev-sfear01:~/SparkTutorial$ cat wc.txt
one
two two
three three three
four four four four
five five five five five
35
// RDD[String] = MapPartitionsRDD
val file = sc.textFile("wc.txt")
val counts = file.flatMap(line => line.split(" "))
// RDD[String] = MapPartitionsRDD
.map(word => (word, 1))
// RDD[(String, Int)] = MapPartitionsRDD
.reduceByKey( (l, r) => l + r)
// RDD[(String, Int)] = ShuffledRDD
counts.collect()
// Array[(String, Int)] =
// Array((two,2), (one,1), (three,2), (five,5), (four,4))
counts.saveAsTextFile(”file:///path/to/wc_result")
// Use hdfs:// for hadoop file system
36
mapFlat( … )
map( … ) and then flatten( )
scala> List(List(1, 2), Set(3, 4)).flatten
res0: List[Int] = List(1, 2, 3, 4)
// create an empty list
// travel each item once
val l = List(List(1,2), List(3,4,List(4, 5))).flatten
l: List[Any] = List(1, 2, 3, 4, List(4, 5))
37
// ReduceByKey
(”two”, 1) (”two”, 1)
^^^^^ ^^^^^
key key
l r
(”two”, 2)
38
Why Spark?
39
It’s super fast !!
40
How can it so fast?
41
RDD (Resilient Distributed Datasets)
42
How to create an RDD?
43
Method #1 (Parallelized)
scala> val data = Array(1, 2, 3, 4, 5)
data: Array[Int] = Array(1, 2, 3, 4, 5)
scala> val distData = sc.parallelize(data)
distData: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at
parallelize
44
Method #2 (External Dataset)
scala> val distFile = sc.textFile("data.txt")
distFile: RDD[String] = MappedRDD@1d4cee08
 Local Path
 Amazon S3 => s3n://
 Hadoop => hdfs://
 etc. URI
Desired Properties for map-reduce
 Distributed
 Lazy
(Optimize as much as you can)
 Persistence
(Caching)
45
46
Input
(disk)
Tuples
(disk)
Tuples
(disk)
Tuples
(disk)
Output
(disk)
MR
1
MR
2
MR
3
MR
4
Input
(disk)
RDD1
(in memory)
RDD2
(in disk)
RDD3
(in memory)
Output
(disk)
t1 t2 t3 t4
47
We have RDD, than how to “operate” it?
48
# Transformation (can wait)
RDD1 => “map” => RDD2
# Action (cannot wait)
RDD1 => “reduce” => ??
49
scala> val data = Array(1, 2, 3, 4, 5)
data: Array[Int] = Array(1, 2, 3, 4, 5)
scala> val distData = sc.parallelize(data)
distData: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[5] at parallelize at
<console>:23
scala> val addone = distData.map(x => x + 1) // Transformation
addone: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[6] at map at
<console>:25
scala> val back = addone.map(x => x - 1) // Transformation
back: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[7] at map at <console>:27
scala> val sum = back.reduce( (l, r) => l + r) // Action
sum: Int = 15
50
Passing functions
Object Util {
def addOne(x:Int) = {
x + 1
}
}
val addone_v1 = distData.map(x => x + 1)
// or
val addone_v2 = distData.map(Util.addOne)
51
Popular Transformations
map(func):
run func(x)
filter(func)
return x if func(x) is true
sample(withReplacement, fraction, seed)
union(otherDataset)
intersection(otherDataset)
52
Popular Transformations (cont.)
Assuming RDD[(K, V)]
groupByKey([numTasks])
return a dataset of (K, Iterable<V>) pairs.
reduceByKey(func, [numTasks])
groupByKey and then reduce by “func”
join(otherDataset, [numTasks])
(K, V) join (K, W) => (K, (V, W))
53
Popular Actions
reduce(func):
(left, right) => func(left, right)
collect()
force computing transformations
count()
first()
take(n)
persist()
54
More about “persist” -- reuse
Input
(disk)
RDD1
(in memory)
RDD2
(in disk)
RDD3
(in memory)
Output
(disk)
t1 t2 t3
t4
RDD4
(in memory)
persist()
t5
55
You are master now.
Journey Continue
56 56
57
58
val file = sc.textFile("wc.txt")
val counts = file.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey( (l, r) => l + r)
// Using SQL syntax is possible
import sqlContext.implicits._
val sq = new org.apache.spark.sql.SQLContext(sc)
case class WC(word: String, count: Int)
val wordcount = counts.map(col => WC(col._1, col._2))
val df = wordcount.toDF()
df.registerTempTable("tbl")
val avg = sq.sql("SELECT AVG(count) FROM tbl")
59
Why Spark SQL?
60
Guess what I’m doing here?
// row: (country, city, profit)
data .filter( _._1 == “us”)
.map ( (a, b, c) => (b, c))
.groupBy( _._1 )
.mapValues( v =>
v.reduce(
(a, b) => (a._1, a._2+b._2)
)
)
.values
.sortBy (x => x._2, false)
.take(3)
61
sq.sql(”
SELECT city, SUM(profit) as p
FROM data
WHERE country=‘us’
GROUP BY city
ORDER BY p DESC
LIMIT 3
")
//(country, city, profit)
data .filter( _._1 == “us”)
.map ( (a, b, c) => (b, c))
.groupBy( _._1 )
.mapValues( v =>
v.reduce(
(a, b) => (a._1,
a._2+b._2)
)
)
.values
.sortBy (x => x._2, false)
.take(3)
Find top 3 cities in US with highest profit
62
63
64
chsieh@dev-sfear01:~/SparkTutorial$ cat kmeans_data.txt
0.0 0.0 0.0
0.1 0.1 0.1
0.5 0.5 0.8
9.0 9.0 9.0
9.1 9.1 9.1
9.2 9.2 9.2
65
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
// Load and parse the data
val data = sc.textFile("kmeans_data.txt")
val parsedData = data.map(s => Vectors.dense(s.split(' ')
.map(_.toDouble))).cache()
// Cluster the data into two classes using KMeans
val numClusters = 2
val numIterations = 20
val clusters = KMeans.train(
parsedData, numClusters, numIterations
)
// Show results
scala> clusters.clusterCenters
res0: Array[org.apache.spark.mllib.linalg.Vector] =
Array([9.099999999999998,9.099999999999998,9.099999999999998],
[0.19999999999999998,0.19999999999999998,0.3])
66
67
68
PageRank: Random Surfer Model
The probability of a Web surfer to reach a page
after many clicks, following random links
Random Click
70
Credit: http://en.wikipedia.org/wiki/PageRank
PageRank
 PR(p) = PR(p1)/c1 + … + PR(pk)/ck
pi : page pointing to p,
ci : number of links in pi
 One equation for every page
 N equations, N unknown variables
Credit: Prof. John Cho / CS144 (UCLA)
Users.txt
1,BarackObama,Barack Obama
2,ladygaga,Goddess of Love
3,jeresig,John Resig
4,justinbieber,Justin Bieber
6,matei_zaharia,Matei Zaharia
7,odersky,Martin Odersky
8,anonsys
72
followers.txt
2 1
4 1
1 2
6 3
7 3
7 6
6 7
3 7
73
74
// Load the edges as a graph
val graph = GraphLoader.edgeListFile(sc, "followers.txt")
// Run PageRank
val ranks = graph.pageRank(0.0001).vertices // id, rank
// Join the ranks with the usernames
val users = sc.textFile("users.txt")
.map { line =>
val fields = line.split(",")
(fields(0).toLong, fields(1)) // id, username
}
val ranksByUsername = users.join(ranks).map {
case (id, (username, rank)) => (username, rank)
}
// Print the result
println(ranksByUsername.collect().mkString("n"))
75

Más contenido relacionado

La actualidad más candente

Poor Man's Functional Programming
Poor Man's Functional ProgrammingPoor Man's Functional Programming
Poor Man's Functional ProgrammingDmitry Buzdin
 
ITT 2015 - Saul Mora - Object Oriented Function Programming
ITT 2015 - Saul Mora - Object Oriented Function ProgrammingITT 2015 - Saul Mora - Object Oriented Function Programming
ITT 2015 - Saul Mora - Object Oriented Function ProgrammingIstanbul Tech Talks
 
Psycopg2 - Connect to PostgreSQL using Python Script
Psycopg2 - Connect to PostgreSQL using Python ScriptPsycopg2 - Connect to PostgreSQL using Python Script
Psycopg2 - Connect to PostgreSQL using Python ScriptSurvey Department
 
"PostgreSQL and Python" Lightning Talk @EuroPython2014
"PostgreSQL and Python" Lightning Talk @EuroPython2014"PostgreSQL and Python" Lightning Talk @EuroPython2014
"PostgreSQL and Python" Lightning Talk @EuroPython2014Henning Jacobs
 
Python postgre sql a wonderful wedding
Python postgre sql   a wonderful weddingPython postgre sql   a wonderful wedding
Python postgre sql a wonderful weddingStéphane Wirtel
 
Codepot - Pig i Hive: szybkie wprowadzenie / Pig and Hive crash course
Codepot - Pig i Hive: szybkie wprowadzenie / Pig and Hive crash courseCodepot - Pig i Hive: szybkie wprowadzenie / Pig and Hive crash course
Codepot - Pig i Hive: szybkie wprowadzenie / Pig and Hive crash courseSages
 
JAVA 8 : Migration et enjeux stratégiques en entreprise
JAVA 8 : Migration et enjeux stratégiques en entrepriseJAVA 8 : Migration et enjeux stratégiques en entreprise
JAVA 8 : Migration et enjeux stratégiques en entrepriseSOAT
 
John Melesky - Federating Queries Using Postgres FDW @ Postgres Open
John Melesky - Federating Queries Using Postgres FDW @ Postgres OpenJohn Melesky - Federating Queries Using Postgres FDW @ Postgres Open
John Melesky - Federating Queries Using Postgres FDW @ Postgres OpenPostgresOpen
 
From Zero to Application Delivery with NixOS
From Zero to Application Delivery with NixOSFrom Zero to Application Delivery with NixOS
From Zero to Application Delivery with NixOSSusan Potter
 
Haskell in the Real World
Haskell in the Real WorldHaskell in the Real World
Haskell in the Real Worldosfameron
 
Writing Domain-Specific Languages for BeepBeep
Writing Domain-Specific Languages for BeepBeepWriting Domain-Specific Languages for BeepBeep
Writing Domain-Specific Languages for BeepBeepSylvain Hallé
 
Java 8 Puzzlers [as presented at OSCON 2016]
Java 8 Puzzlers [as presented at  OSCON 2016]Java 8 Puzzlers [as presented at  OSCON 2016]
Java 8 Puzzlers [as presented at OSCON 2016]Baruch Sadogursky
 
2017 02-07 - elastic & spark. building a search geo locator
2017 02-07 - elastic & spark. building a search geo locator2017 02-07 - elastic & spark. building a search geo locator
2017 02-07 - elastic & spark. building a search geo locatorAlberto Paro
 
The Ring programming language version 1.5.3 book - Part 25 of 184
The Ring programming language version 1.5.3 book - Part 25 of 184The Ring programming language version 1.5.3 book - Part 25 of 184
The Ring programming language version 1.5.3 book - Part 25 of 184Mahmoud Samir Fayed
 
Clojure: The Art of Abstraction
Clojure: The Art of AbstractionClojure: The Art of Abstraction
Clojure: The Art of AbstractionAlex Miller
 
Joker 2015 - Валеев Тагир - Что же мы измеряем?
Joker 2015 - Валеев Тагир - Что же мы измеряем?Joker 2015 - Валеев Тагир - Что же мы измеряем?
Joker 2015 - Валеев Тагир - Что же мы измеряем?tvaleev
 

La actualidad más candente (20)

Poor Man's Functional Programming
Poor Man's Functional ProgrammingPoor Man's Functional Programming
Poor Man's Functional Programming
 
ITT 2015 - Saul Mora - Object Oriented Function Programming
ITT 2015 - Saul Mora - Object Oriented Function ProgrammingITT 2015 - Saul Mora - Object Oriented Function Programming
ITT 2015 - Saul Mora - Object Oriented Function Programming
 
Psycopg2 - Connect to PostgreSQL using Python Script
Psycopg2 - Connect to PostgreSQL using Python ScriptPsycopg2 - Connect to PostgreSQL using Python Script
Psycopg2 - Connect to PostgreSQL using Python Script
 
"PostgreSQL and Python" Lightning Talk @EuroPython2014
"PostgreSQL and Python" Lightning Talk @EuroPython2014"PostgreSQL and Python" Lightning Talk @EuroPython2014
"PostgreSQL and Python" Lightning Talk @EuroPython2014
 
Python postgre sql a wonderful wedding
Python postgre sql   a wonderful weddingPython postgre sql   a wonderful wedding
Python postgre sql a wonderful wedding
 
Codepot - Pig i Hive: szybkie wprowadzenie / Pig and Hive crash course
Codepot - Pig i Hive: szybkie wprowadzenie / Pig and Hive crash courseCodepot - Pig i Hive: szybkie wprowadzenie / Pig and Hive crash course
Codepot - Pig i Hive: szybkie wprowadzenie / Pig and Hive crash course
 
JAVA 8 : Migration et enjeux stratégiques en entreprise
JAVA 8 : Migration et enjeux stratégiques en entrepriseJAVA 8 : Migration et enjeux stratégiques en entreprise
JAVA 8 : Migration et enjeux stratégiques en entreprise
 
Dotnet 18
Dotnet 18Dotnet 18
Dotnet 18
 
John Melesky - Federating Queries Using Postgres FDW @ Postgres Open
John Melesky - Federating Queries Using Postgres FDW @ Postgres OpenJohn Melesky - Federating Queries Using Postgres FDW @ Postgres Open
John Melesky - Federating Queries Using Postgres FDW @ Postgres Open
 
From Zero to Application Delivery with NixOS
From Zero to Application Delivery with NixOSFrom Zero to Application Delivery with NixOS
From Zero to Application Delivery with NixOS
 
Pune Clojure Course Outline
Pune Clojure Course OutlinePune Clojure Course Outline
Pune Clojure Course Outline
 
Haskell in the Real World
Haskell in the Real WorldHaskell in the Real World
Haskell in the Real World
 
Writing Domain-Specific Languages for BeepBeep
Writing Domain-Specific Languages for BeepBeepWriting Domain-Specific Languages for BeepBeep
Writing Domain-Specific Languages for BeepBeep
 
Java 8 Puzzlers [as presented at OSCON 2016]
Java 8 Puzzlers [as presented at  OSCON 2016]Java 8 Puzzlers [as presented at  OSCON 2016]
Java 8 Puzzlers [as presented at OSCON 2016]
 
2017 02-07 - elastic & spark. building a search geo locator
2017 02-07 - elastic & spark. building a search geo locator2017 02-07 - elastic & spark. building a search geo locator
2017 02-07 - elastic & spark. building a search geo locator
 
The Ring programming language version 1.5.3 book - Part 25 of 184
The Ring programming language version 1.5.3 book - Part 25 of 184The Ring programming language version 1.5.3 book - Part 25 of 184
The Ring programming language version 1.5.3 book - Part 25 of 184
 
Clojure: The Art of Abstraction
Clojure: The Art of AbstractionClojure: The Art of Abstraction
Clojure: The Art of Abstraction
 
ES6 in Real Life
ES6 in Real LifeES6 in Real Life
ES6 in Real Life
 
Oop lecture9 13
Oop lecture9 13Oop lecture9 13
Oop lecture9 13
 
Joker 2015 - Валеев Тагир - Что же мы измеряем?
Joker 2015 - Валеев Тагир - Что же мы измеряем?Joker 2015 - Валеев Тагир - Что же мы измеряем?
Joker 2015 - Валеев Тагир - Что же мы измеряем?
 

Similar a Modern technologies in data science

Introduction to Scalding and Monoids
Introduction to Scalding and MonoidsIntroduction to Scalding and Monoids
Introduction to Scalding and MonoidsHugo Gävert
 
User Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love StoryUser Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love StoryDatabricks
 
User Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love StoryUser Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love StoryDatabricks
 
Python 101 language features and functional programming
Python 101 language features and functional programmingPython 101 language features and functional programming
Python 101 language features and functional programmingLukasz Dynowski
 
Byterun, a Python bytecode interpreter - Allison Kaptur at NYCPython
Byterun, a Python bytecode interpreter - Allison Kaptur at NYCPythonByterun, a Python bytecode interpreter - Allison Kaptur at NYCPython
Byterun, a Python bytecode interpreter - Allison Kaptur at NYCPythonakaptur
 
GE8151 Problem Solving and Python Programming
GE8151 Problem Solving and Python ProgrammingGE8151 Problem Solving and Python Programming
GE8151 Problem Solving and Python ProgrammingMuthu Vinayagam
 
Clojure Intro
Clojure IntroClojure Intro
Clojure Introthnetos
 
Apache Commons - Don\'t re-invent the wheel
Apache Commons - Don\'t re-invent the wheelApache Commons - Don\'t re-invent the wheel
Apache Commons - Don\'t re-invent the wheeltcurdt
 
Real Time Big Data Management
Real Time Big Data ManagementReal Time Big Data Management
Real Time Big Data ManagementAlbert Bifet
 
The Curious Clojurist - Neal Ford (Thoughtworks)
The Curious Clojurist - Neal Ford (Thoughtworks)The Curious Clojurist - Neal Ford (Thoughtworks)
The Curious Clojurist - Neal Ford (Thoughtworks)jaxLondonConference
 
Java 7 Launch Event at LyonJUG, Lyon France. Fork / Join framework and Projec...
Java 7 Launch Event at LyonJUG, Lyon France. Fork / Join framework and Projec...Java 7 Launch Event at LyonJUG, Lyon France. Fork / Join framework and Projec...
Java 7 Launch Event at LyonJUG, Lyon France. Fork / Join framework and Projec...julien.ponge
 
A gentle introduction to functional programming through music and clojure
A gentle introduction to functional programming through music and clojureA gentle introduction to functional programming through music and clojure
A gentle introduction to functional programming through music and clojurePaul Lam
 
Big Data Analytics with Scala at SCALA.IO 2013
Big Data Analytics with Scala at SCALA.IO 2013Big Data Analytics with Scala at SCALA.IO 2013
Big Data Analytics with Scala at SCALA.IO 2013Samir Bessalah
 
sonam Kumari python.ppt
sonam Kumari python.pptsonam Kumari python.ppt
sonam Kumari python.pptssuserd64918
 

Similar a Modern technologies in data science (20)

Spark workshop
Spark workshopSpark workshop
Spark workshop
 
Introduction to Scalding and Monoids
Introduction to Scalding and MonoidsIntroduction to Scalding and Monoids
Introduction to Scalding and Monoids
 
User Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love StoryUser Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love Story
 
User Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love StoryUser Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love Story
 
Python 101 language features and functional programming
Python 101 language features and functional programmingPython 101 language features and functional programming
Python 101 language features and functional programming
 
Byterun, a Python bytecode interpreter - Allison Kaptur at NYCPython
Byterun, a Python bytecode interpreter - Allison Kaptur at NYCPythonByterun, a Python bytecode interpreter - Allison Kaptur at NYCPython
Byterun, a Python bytecode interpreter - Allison Kaptur at NYCPython
 
Spark_Documentation_Template1
Spark_Documentation_Template1Spark_Documentation_Template1
Spark_Documentation_Template1
 
GE8151 Problem Solving and Python Programming
GE8151 Problem Solving and Python ProgrammingGE8151 Problem Solving and Python Programming
GE8151 Problem Solving and Python Programming
 
Full Stack Clojure
Full Stack ClojureFull Stack Clojure
Full Stack Clojure
 
Clojure Intro
Clojure IntroClojure Intro
Clojure Intro
 
Apache Commons - Don\'t re-invent the wheel
Apache Commons - Don\'t re-invent the wheelApache Commons - Don\'t re-invent the wheel
Apache Commons - Don\'t re-invent the wheel
 
Real Time Big Data Management
Real Time Big Data ManagementReal Time Big Data Management
Real Time Big Data Management
 
Java VS Python
Java VS PythonJava VS Python
Java VS Python
 
The Curious Clojurist - Neal Ford (Thoughtworks)
The Curious Clojurist - Neal Ford (Thoughtworks)The Curious Clojurist - Neal Ford (Thoughtworks)
The Curious Clojurist - Neal Ford (Thoughtworks)
 
A bit about Scala
A bit about ScalaA bit about Scala
A bit about Scala
 
Java 7 Launch Event at LyonJUG, Lyon France. Fork / Join framework and Projec...
Java 7 Launch Event at LyonJUG, Lyon France. Fork / Join framework and Projec...Java 7 Launch Event at LyonJUG, Lyon France. Fork / Join framework and Projec...
Java 7 Launch Event at LyonJUG, Lyon France. Fork / Join framework and Projec...
 
A gentle introduction to functional programming through music and clojure
A gentle introduction to functional programming through music and clojureA gentle introduction to functional programming through music and clojure
A gentle introduction to functional programming through music and clojure
 
Scala @ TomTom
Scala @ TomTomScala @ TomTom
Scala @ TomTom
 
Big Data Analytics with Scala at SCALA.IO 2013
Big Data Analytics with Scala at SCALA.IO 2013Big Data Analytics with Scala at SCALA.IO 2013
Big Data Analytics with Scala at SCALA.IO 2013
 
sonam Kumari python.ppt
sonam Kumari python.pptsonam Kumari python.ppt
sonam Kumari python.ppt
 

Último

The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 

Último (20)

The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 

Modern technologies in data science

  • 3. 3 Task: Find “Error” in /var/log/system.log
  • 4. 4 /*Java*/ BufferedReader br = new BufferedReader( new FileReader(”system.log")); try { StringBuilder sb = new StringBuilder(); String line = br.readLine(); while (line != null) { if (line.contains(“Error”) { System.out.println(line); } String everything = sb.toString(); } finally { br.close(); }
  • 5. 5 # Python with open(”system.log", "r") as ins: for line in ins: if “Error” in line: print line[:-1]
  • 6. 6 # Bash grep "Error" system.log # !! Best !! grep –B 3 –A 2 "Error" system.log|less
  • 7. 7 Task: Find “Error” in /var/log/system.log What if, system.log > 1TB? 10TB? … 1PB?
  • 10. 10 (1) Multi-thread coding is really really painful. (2) Max # of cores in a machines in limited
  • 11. 11
  • 14. 14 An easy way of “multi-core” programming.
  • 16. 16 Q. How to eat a pizza in 1 minute?
  • 18. 18 Map = “things can be divide-and-conquer”
  • 20. 20 Reduce = “merge pieces to make result meaningful”
  • 24. 24 Map-reduce - an easy way to “collaborate”.
  • 25. 25 import java.io.IOException; import java.util.*; import org.apache.hadoop.fs.Path; import org.apache.hadoop.conf.*; import org.apache.hadoop.io.*; import org.apache.hadoop.mapred.*; import org.apache.hadoop.util.*; public class WordCount { //////////// MAPPER function ////////////// public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> { public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { String line = value.toString(); StringTokenizer tokenizer = new StringTokenizer(line); Text word = new Text(); while (tokenizer.hasMoreTokens()) { word.set(tokenizer.nextToken()); output.collect(word, new IntWritable(1)); } } } //////////// REDUCER function ///////////// public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> { public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { int sum = 0; while (values.hasNext()) { sum += values.next().get(); } output.collect(key, new IntWritable(sum)); } } public static void main(String[] args) throws Exception { /////////// JOB description /////////// JobConf conf = new JobConf(WordCount.class); conf.setJobName("wordcount"); conf.setMapperClass(Map.class); conf.setReducerClass(Reduce.class); conf.setInputFormat(TextInputFormat.class); conf.setOutputFormat(TextOutputFormat.class); FileInputFormat.setInputPaths(conf, new Path(args[0])); FileOutputFormat.setOutputPath(conf, new Path(args[1])); conf.setOutputKeyClass(Text.class); conf.setOutputValueClass(IntWritable.class); JobClient.runJob(conf); } }
  • 26. 26 a = load ’input.txt'; #map b = foreach a generate flatten(TOKENIZE((chararray)$0)) as word; #reduce c = group b by word; d = foreach c generate COUNT(b), group; store d into ‘wordcount.txt';
  • 28. 28 Word count + “case insensitive”
  • 29. 29 REGISTER lib.jar; a = load ’input.txt'; #map b = foreach a generate flatten(TOKENIZE((chararray)$0)) as word; b2 = foreach b generate lib.UPPER(word) #reduce c = group b2 by word; d = foreach c generate COUNT(b), group; store d into ‘wordcount.txt'; Case In sensitive
  • 30. 30 package myudfs; import java.io.IOException; import org.apache.pig.EvalFunc; import org.apache.pig.data.Tuple; public class UPPER extends EvalFunc<String> { public String exec(Tuple input) throws IOException { if (input == null || input.size() == 0 || input.get(0) == null) return null; try{ String str = (String)input.get(0); return str.toUpperCase(); }catch(Exception e){ throw new IOException(”Error in input row ", e); } } }
  • 31. 31
  • 33. 33 chsieh@dev-sfear01:~$ spark-shell --master local[4] Spark assembly has been built with Hive, including Datanucleus jars on classpath 15/03/26 14:32:12 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Welcome to ____ __ / __/__ ___ _____/ /__ _ / _ / _ `/ __/ '_/ /___/ .__/_,_/_/ /_/_ version 1.3.0 /_/ Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_76) Type in expressions to have them evaluated. Type :help for more information. Spark context available as sc. SQL context available as sqlContext. scala>
  • 34. 34 chsieh@dev-sfear01:~/SparkTutorial$ cat wc.txt one two two three three three four four four four five five five five five
  • 35. 35 // RDD[String] = MapPartitionsRDD val file = sc.textFile("wc.txt") val counts = file.flatMap(line => line.split(" ")) // RDD[String] = MapPartitionsRDD .map(word => (word, 1)) // RDD[(String, Int)] = MapPartitionsRDD .reduceByKey( (l, r) => l + r) // RDD[(String, Int)] = ShuffledRDD counts.collect() // Array[(String, Int)] = // Array((two,2), (one,1), (three,2), (five,5), (four,4)) counts.saveAsTextFile(”file:///path/to/wc_result") // Use hdfs:// for hadoop file system
  • 36. 36 mapFlat( … ) map( … ) and then flatten( ) scala> List(List(1, 2), Set(3, 4)).flatten res0: List[Int] = List(1, 2, 3, 4) // create an empty list // travel each item once val l = List(List(1,2), List(3,4,List(4, 5))).flatten l: List[Any] = List(1, 2, 3, 4, List(4, 5))
  • 37. 37 // ReduceByKey (”two”, 1) (”two”, 1) ^^^^^ ^^^^^ key key l r (”two”, 2)
  • 40. 40 How can it so fast?
  • 42. 42 How to create an RDD?
  • 43. 43 Method #1 (Parallelized) scala> val data = Array(1, 2, 3, 4, 5) data: Array[Int] = Array(1, 2, 3, 4, 5) scala> val distData = sc.parallelize(data) distData: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize
  • 44. 44 Method #2 (External Dataset) scala> val distFile = sc.textFile("data.txt") distFile: RDD[String] = MappedRDD@1d4cee08  Local Path  Amazon S3 => s3n://  Hadoop => hdfs://  etc. URI
  • 45. Desired Properties for map-reduce  Distributed  Lazy (Optimize as much as you can)  Persistence (Caching) 45
  • 47. 47 We have RDD, than how to “operate” it?
  • 48. 48 # Transformation (can wait) RDD1 => “map” => RDD2 # Action (cannot wait) RDD1 => “reduce” => ??
  • 49. 49 scala> val data = Array(1, 2, 3, 4, 5) data: Array[Int] = Array(1, 2, 3, 4, 5) scala> val distData = sc.parallelize(data) distData: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[5] at parallelize at <console>:23 scala> val addone = distData.map(x => x + 1) // Transformation addone: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[6] at map at <console>:25 scala> val back = addone.map(x => x - 1) // Transformation back: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[7] at map at <console>:27 scala> val sum = back.reduce( (l, r) => l + r) // Action sum: Int = 15
  • 50. 50 Passing functions Object Util { def addOne(x:Int) = { x + 1 } } val addone_v1 = distData.map(x => x + 1) // or val addone_v2 = distData.map(Util.addOne)
  • 51. 51 Popular Transformations map(func): run func(x) filter(func) return x if func(x) is true sample(withReplacement, fraction, seed) union(otherDataset) intersection(otherDataset)
  • 52. 52 Popular Transformations (cont.) Assuming RDD[(K, V)] groupByKey([numTasks]) return a dataset of (K, Iterable<V>) pairs. reduceByKey(func, [numTasks]) groupByKey and then reduce by “func” join(otherDataset, [numTasks]) (K, V) join (K, W) => (K, (V, W))
  • 53. 53 Popular Actions reduce(func): (left, right) => func(left, right) collect() force computing transformations count() first() take(n) persist()
  • 54. 54 More about “persist” -- reuse Input (disk) RDD1 (in memory) RDD2 (in disk) RDD3 (in memory) Output (disk) t1 t2 t3 t4 RDD4 (in memory) persist() t5
  • 57. 57
  • 58. 58 val file = sc.textFile("wc.txt") val counts = file.flatMap(line => line.split(" ")) .map(word => (word, 1)) .reduceByKey( (l, r) => l + r) // Using SQL syntax is possible import sqlContext.implicits._ val sq = new org.apache.spark.sql.SQLContext(sc) case class WC(word: String, count: Int) val wordcount = counts.map(col => WC(col._1, col._2)) val df = wordcount.toDF() df.registerTempTable("tbl") val avg = sq.sql("SELECT AVG(count) FROM tbl")
  • 60. 60 Guess what I’m doing here? // row: (country, city, profit) data .filter( _._1 == “us”) .map ( (a, b, c) => (b, c)) .groupBy( _._1 ) .mapValues( v => v.reduce( (a, b) => (a._1, a._2+b._2) ) ) .values .sortBy (x => x._2, false) .take(3)
  • 61. 61 sq.sql(” SELECT city, SUM(profit) as p FROM data WHERE country=‘us’ GROUP BY city ORDER BY p DESC LIMIT 3 ") //(country, city, profit) data .filter( _._1 == “us”) .map ( (a, b, c) => (b, c)) .groupBy( _._1 ) .mapValues( v => v.reduce( (a, b) => (a._1, a._2+b._2) ) ) .values .sortBy (x => x._2, false) .take(3) Find top 3 cities in US with highest profit
  • 62. 62
  • 63. 63
  • 64. 64 chsieh@dev-sfear01:~/SparkTutorial$ cat kmeans_data.txt 0.0 0.0 0.0 0.1 0.1 0.1 0.5 0.5 0.8 9.0 9.0 9.0 9.1 9.1 9.1 9.2 9.2 9.2
  • 65. 65 import org.apache.spark.mllib.clustering.KMeans import org.apache.spark.mllib.linalg.Vectors // Load and parse the data val data = sc.textFile("kmeans_data.txt") val parsedData = data.map(s => Vectors.dense(s.split(' ') .map(_.toDouble))).cache() // Cluster the data into two classes using KMeans val numClusters = 2 val numIterations = 20 val clusters = KMeans.train( parsedData, numClusters, numIterations ) // Show results scala> clusters.clusterCenters res0: Array[org.apache.spark.mllib.linalg.Vector] = Array([9.099999999999998,9.099999999999998,9.099999999999998], [0.19999999999999998,0.19999999999999998,0.3])
  • 66. 66
  • 67. 67
  • 68. 68
  • 69. PageRank: Random Surfer Model The probability of a Web surfer to reach a page after many clicks, following random links Random Click
  • 71. PageRank  PR(p) = PR(p1)/c1 + … + PR(pk)/ck pi : page pointing to p, ci : number of links in pi  One equation for every page  N equations, N unknown variables Credit: Prof. John Cho / CS144 (UCLA)
  • 72. Users.txt 1,BarackObama,Barack Obama 2,ladygaga,Goddess of Love 3,jeresig,John Resig 4,justinbieber,Justin Bieber 6,matei_zaharia,Matei Zaharia 7,odersky,Martin Odersky 8,anonsys 72
  • 73. followers.txt 2 1 4 1 1 2 6 3 7 3 7 6 6 7 3 7 73
  • 74. 74 // Load the edges as a graph val graph = GraphLoader.edgeListFile(sc, "followers.txt") // Run PageRank val ranks = graph.pageRank(0.0001).vertices // id, rank // Join the ranks with the usernames val users = sc.textFile("users.txt") .map { line => val fields = line.split(",") (fields(0).toLong, fields(1)) // id, username } val ranksByUsername = users.join(ranks).map { case (id, (username, rank)) => (username, rank) } // Print the result println(ranksByUsername.collect().mkString("n"))
  • 75. 75