
# Algebird : Abstract Algebra for big data analytics. Devoxx 2014

Algebird: abstract algebra for analytics.

Devoxx 2014, Antwerp, Belgium.

### Algebird : Abstract Algebra for big data analytics. Devoxx 2014

1. Algebird: Abstract Algebra for Analytics. Sam BESSALAH @samklr. Room 4 #Devoxx #algebird #scalding #monoid #hadoop
5. Abstract Algebra
6. From Wikipedia
7. Algebraic structure: "Set of values, coupled with one or more finitary operations, and a set of laws those operations must obey."
8. Algebraic structure: "Set of values, coupled with one or more finitary operations, and a set of laws those operations must obey." E.g. Sum, Magma, Semigroup, Group, Monoid, Abelian Group, Semilattice, Ring, Monad, etc.
9. Semigroup. Semigroup law: `(x <> y) <> z = x <> (y <> z)` (associativity).
10. Semigroup. Semigroup law: `(x <> y) <> z = x <> (y <> z)` (associativity). In code: `trait Semigroup[T] { def aggregate(x: T, y: T): T }`
11. Monoids. Monoid laws: `(x <> y) <> z = x <> (y <> z)` (associativity); `identity <> x = x` and `x <> identity = x` (identity).
12. Monoids. Monoid laws: `(x <> y) <> z = x <> (y <> z)` (associativity); `identity <> x = x` and `x <> identity = x` (identity / zero). In code: `trait Monoid[T] { def identity: T; def aggregate(x: T, y: T): T }`
13. Monoids. Monoid laws: `(x <> y) <> z = x <> (y <> z)`; `identity <> x = x` and `x <> identity = x`. In code, reusing the semigroup: `trait Monoid[T] extends Semigroup[T] { def identity: T }`
14. Groups. Group laws: `(x <> y) <> z = x <> (y <> z)` (associativity); `identity <> x = x` and `x <> identity = x` (identity); `x <> inverse x = identity` and `inverse x <> x = identity` (invertibility).
15. Groups. Group laws: `(x <> y) <> z = x <> (y <> z)`; `identity <> x = x` and `x <> identity = x`; `x <> inverse x = identity` and `inverse x <> x = identity`. In code: `trait Group[T] extends Monoid[T] { def inverse(v: T): T }`
16. Many more (many of these are in Algebird):
    - Abelian groups (commutative groups)
    - Rings
    - Semilattices
    - Ordered semigroups
    - Fields
17. Examples:
    - `(a min b) min c = a min (b min c)` with Int
    - `a max (b max c) = (a max b) max c`
    - `a or (b or c) = (a or b) or c`
    - `a and (b and c) = (a and b) and c`
    - Int addition
    - Set union
    - Harmonic sum
    - Integer mean
    - Priority queue
19. Why do we need those algebraic structures?
20. We want to build scalable analytics systems and leverage distributed computing to aggregate really large data sets. At the end of the day, a lot of operations in analytics are just sorting and counting.
21. Distributed computing → parallelism
22. Distributed computing → parallelism. Associativity → enables parallelism.
24. Distributed computing → parallelism. Associativity enables parallelism: partial results can be grouped and combined in any bracketing. Identity means we can ignore some data: empty or missing inputs contribute nothing. Commutativity helps us ignore order.
25. Typical MapReduce ...
26. Finding Top-K elements in Scalding ... `class TopKJob(args: Args) extends Job(args) { Tsv(args("input"), visitScheme).filter(…).leftJoinWithTiny(…).filter(…).groupBy('fieldOne) { _.sortWithTake(visitScheme -> top)(biggerSale) }.write(Tsv(...)) }`
27. `.sortWithTake(…)`: looking into `.sortWithTake` in Scalding, there's one nice thing: `class PriorityQueueMonoid[T](max: Int)(implicit order: Ordering[T]) extends Monoid[PriorityQueue[T]]`
28. `class PriorityQueueMonoid[T](max: Int)(implicit order: Ordering[T]) extends Monoid[PriorityQueue[T]]`. Let's take a look: PQ1: 55, 45, 21, 3; PQ2: 100, 80, 40, 3; top-4(PQ1 ∪ PQ2): 100, 80, 55, 45. Priority queue: can be empty; two priority queues can be "added" in any order; associative + commutative.
29. `class PriorityQueueMonoid[T](max: Int)(implicit order: Ordering[T]) extends Monoid[PriorityQueue[T]]`. Let's take a look: PQ1: 55, 45, 21, 3; PQ2: 100, 80, 40, 3; top-4(PQ1 ∪ PQ2): 100, 80, 55, 45. Priority queue: makes Scalding go fast by doing sorting, filtering and extracting in one single "map" step; can be empty; two priority queues can be "added" in any order; associative + commutative.
30. Stream mining challenges:
    - Update predictions after each observation
    - Single pass: can't read old data or replay the stream
    - Full size of the stream often unknown
    - Limited time for computation per observation
    - O(1) memory size
31. Stream mining challenges: http://radar.oreilly.com/2013/10/stream-mining-essentials.html
32. Tradeoff: space and speed over accuracy.
33. Tradeoff: space and speed over accuracy. Use sketches.
34. Sketches: probabilistic data structures that store a (mostly hashed) summary of a data set that would be costly to store in its entirety, which in most cases gives them sublinear space and time costs. E.g. Bloom filters, Counter Sketch, KMV counters, Count-Min Sketch, HyperLogLog, MinHashes.
35. Bloom filters: an approximate data structure for set membership that behaves like an approximate set. `BloomFilter.contains(x)` => NO | Maybe. P(false positive) > 0, P(false negative) = 0.
36. Internally: a bit array of fixed size. `add(x)`: for each hash function i, set `b[h(x, i)] = 1` (setting bits is a Boolean OR). `contains(x)`: TRUE if `b[h(x, i)] == 1` for all i (Boolean AND => associative). Both are associative => a Bloom filter can be designed as a Monoid.
37. Bloom filters in Algebird. Imports: `import com.twitter.algebird._` and `import com.twitter.algebird.Operators._`. Generate a list: `val A = (1 to 300).toList`. Configure the Bloom filter: `val NUM_HASHES = 6`, `val WIDTH = 6000` (bits), `val SEED = 1`, `implicit val bfm = new BloomFilterMonoid(NUM_HASHES, WIDTH, SEED)`. Build the approximate set: `val A_bf = A.map { i => bfm.create(i.toString) }.reduce(_ + _)`. Query it: `val approxBool = A_bf.contains("150")` ---> `ApproximateBoolean(true, 0.9995…)`
38. Count-Min Sketch: gives an approximation of the number of occurrences of an element in a data set.
39. Count-Min Sketch: adding an element is a numerical addition; querying uses a MIN function. Both are associative. Useful for detecting heavy hitters, top-K, LSH. Available in Algebird.
40. HyperLogLog: a popular sketch for cardinality estimation. Gives the number of distinct values in a data set, within a probabilistic error bound. `HLL.size = Approx[Number]`. Intuition: long runs of trailing 0s in a random bit string are rare, but the more bit strings you look at, the more likely you are to find a long one. The longest run of trailing 0-bits seen can therefore serve as an estimator of the number of unique bit strings observed.
41. Adding an element uses a Max and a Sum function. Both are associative and Monoids (Max is really an ordered semigroup in Algebird). Querying the cardinality uses a harmonic mean, which is a Monoid. Available in Algebird.
42. Many more juicy sketches ...
    - MinHashes to compute Jaccard similarity
    - QTree for quantile estimation; neat for anomaly detection
    - SpaceSaverMonoid, awesome for finding the approximate most frequent and top-K elements
    - TopKMonoid
    - SGD, PriorityQueues, Histograms, etc.
43. SummingBird: Lambda in a box
44. Heard of the Lambda Architecture?
45. SummingBird: same code for both batch and real-time processing.
46. SummingBird: same code for both batch and real-time processing, but it works only on Monoids. Uses Storehaus as a mergeable store layer.
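
The traits on slides 10 to 15 and the claim on slides 22 to 24 that associativity enables parallelism can be tried out in plain Scala. The sketch below is only an illustration: `AlgebraSketch`, `intAddition`, `intMax`, `combineAll` and `combineChunked` are hypothetical names, not Algebird's API, and the "chunks" stand in for partitions that a real job would process on separate workers.

```scala
// Illustrative only: trait names follow the slides, not Algebird's real API.
object AlgebraSketch {
  // One associative binary operation.
  trait Semigroup[T] { def aggregate(x: T, y: T): T }
  // A semigroup with an identity element.
  trait Monoid[T] extends Semigroup[T] { def identity: T }
  // A monoid in which every element has an inverse.
  trait Group[T] extends Monoid[T] { def inverse(v: T): T }

  // Int under addition: identity 0, inverse is negation.
  val intAddition: Group[Int] = new Group[Int] {
    def identity: Int = 0
    def aggregate(x: Int, y: Int): Int = x + y
    def inverse(v: Int): Int = -v
  }

  // Int under max: Int.MinValue acts as the identity.
  val intMax: Monoid[Int] = new Monoid[Int] {
    def identity: Int = Int.MinValue
    def aggregate(x: Int, y: Int): Int = math.max(x, y)
  }

  // Sequential left fold, starting from the identity.
  def combineAll[T](xs: Seq[T], m: Monoid[T]): T =
    xs.foldLeft(m.identity)((acc, x) => m.aggregate(acc, x))

  // Reduce each chunk on its own, then combine the partial results.
  // Associativity is what guarantees the same answer as combineAll.
  def combineChunked[T](xs: Seq[T], chunkSize: Int, m: Monoid[T]): T = {
    require(chunkSize > 0, "chunkSize must be positive")
    val partials = xs.grouped(chunkSize).map(chunk => combineAll(chunk, m)).toList
    combineAll(partials, m)
  }
}
```

Because an empty chunk reduces to the identity, empty partitions need no special handling, which is one reading of "identity means we can ignore some data".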
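
The priority-queue example on slides 28 and 29 can be reproduced with a small top-k monoid. This is a simplified sketch that keeps the k largest values in a sorted `List` instead of a real priority queue; `TopKSketch` and `TopKMonoid` here are hypothetical names and do not reflect how Algebird or Scalding implement `PriorityQueueMonoid`.

```scala
// Illustrative only: a top-k "monoid" over descending lists, not Algebird's class.
object TopKSketch {
  final class TopKMonoid[T](k: Int)(implicit ord: Ordering[T]) {
    require(k > 0, "k must be positive")

    // The empty summary is the identity element.
    val identity: List[T] = Nil

    // A summary holding a single observation.
    def create(x: T): List[T] = List(x)

    // Merge two summaries and keep only the k largest values.
    // The result does not depend on argument order or bracketing.
    def aggregate(a: List[T], b: List[T]): List[T] =
      (a ++ b).sorted(ord.reverse).take(k)
  }
}
```

With `k = 4`, merging `List(55, 45, 21, 3)` and `List(100, 80, 40, 3)` yields `List(100, 80, 55, 45)`, matching the slide. Each mapper can emit a small summary and the summaries can be merged in any order.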
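
Slides 35 and 36 describe a Bloom filter as a fixed-size bit array that can be treated as a Monoid. The following is a minimal sketch of that idea, assuming the i-th hash function is `MurmurHash3.stringHash` seeded with `i`; the names `BloomSketch`, `Bloom`, `mightContain` and `merge` are made up for illustration and this is not Algebird's `BloomFilterMonoid`.

```scala
import scala.collection.immutable.BitSet
import scala.util.hashing.MurmurHash3

// Illustrative only: a tiny immutable Bloom filter, not Algebird's implementation.
object BloomSketch {
  final case class Bloom(numHashes: Int, width: Int, bits: BitSet) {
    // Bit positions for an item: one per hash function (seeded by its index).
    private def positions(item: String): Seq[Int] =
      (0 until numHashes).map(i => Math.floorMod(MurmurHash3.stringHash(item, i), width))

    // add(x): set every position to 1.
    def add(item: String): Bloom =
      copy(bits = positions(item).foldLeft(bits)((acc, p) => acc + p))

    // contains(x): true only if every position is set (may be a false positive).
    def mightContain(item: String): Boolean =
      positions(item).forall(p => bits.contains(p))

    // Monoid "plus": bitwise OR of two filters with the same shape.
    def merge(other: Bloom): Bloom = {
      require(numHashes == other.numHashes && width == other.width, "shapes must match")
      copy(bits = bits | other.bits)
    }
  }

  // Monoid identity: a filter with no bits set.
  def empty(numHashes: Int, width: Int): Bloom = Bloom(numHashes, width, BitSet.empty)
}
```

Since the merge is a bitwise OR, filters built on separate partitions combine into exactly the filter that a single pass over all the data would have produced.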
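
Slide 39 summarises Count-Min Sketch as "add is a numerical addition, query is a MIN". A compact sketch of that structure is shown below, under the assumption that row `r` hashes with `MurmurHash3.stringHash` seeded by `r`. `CountMinDemo`, `CMS`, `estimate` and `merge` are illustrative names only, and Algebird's own Count-Min Sketch is parameterised differently.

```scala
import scala.util.hashing.MurmurHash3

// Illustrative only: a small immutable Count-Min Sketch, not Algebird's implementation.
object CountMinDemo {
  final case class CMS(depth: Int, width: Int, table: Vector[Vector[Long]]) {
    // Column hit by an item in a given row (one hash function per row).
    private def column(item: String, row: Int): Int =
      Math.floorMod(MurmurHash3.stringHash(item, row), width)

    // Adding an element increments one counter in every row.
    def add(item: String, count: Long = 1L): CMS =
      copy(table = table.zipWithIndex.map { case (cells, row) =>
        val c = column(item, row)
        cells.updated(c, cells(c) + count)
      })

    // Querying takes the minimum over the rows; it never underestimates.
    def estimate(item: String): Long =
      (0 until depth).map(row => table(row)(column(item, row))).min

    // Monoid "plus": cell-wise addition of two sketches with the same shape.
    def merge(other: CMS): CMS = {
      require(depth == other.depth && width == other.width, "shapes must match")
      copy(table = table.zip(other.table).map { case (a, b) =>
        a.zip(b).map { case (x, y) => x + y }
      })
    }
  }

  // Monoid identity: all counters at zero.
  def empty(depth: Int, width: Int): CMS = {
    require(depth > 0 && width > 0, "depth and width must be positive")
    CMS(depth, width, Vector.fill(depth, width)(0L))
  }
}
```

Collisions can only inflate a counter, so the estimate is an upper bound on the true count; taking the minimum across rows picks the least inflated one.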
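
The HyperLogLog intuition on slides 40 and 41 (keep a per-register maximum, then query with a harmonic mean) can be sketched as follows. Several things here are assumptions rather than anything from the talk: the sketch counts leading zeros of a 32-bit `MurmurHash3` hash instead of trailing ones, uses the raw estimator with the usual constant for 128 or more registers, and applies no small-range or large-range corrections. `HllSketch` and `Hll` are hypothetical names, not Algebird's `HyperLogLogMonoid`.

```scala
import scala.util.hashing.MurmurHash3

// Illustrative only: raw HyperLogLog registers with a max-based merge.
object HllSketch {
  final case class Hll(bits: Int, registers: Vector[Int]) {
    // Adding an element: the first `bits` bits pick a register,
    // the rest contribute the position of their first 1-bit.
    def add(item: String): Hll = {
      val h = MurmurHash3.stringHash(item)
      val idx = h >>> (32 - bits)
      val rest = h << bits
      val rank = math.min(Integer.numberOfLeadingZeros(rest) + 1, 32 - bits + 1)
      copy(registers = registers.updated(idx, math.max(registers(idx), rank)))
    }

    // Monoid "plus": register-wise max, which is associative and commutative.
    def merge(other: Hll): Hll = {
      require(bits == other.bits, "register counts must match")
      copy(registers = registers.zip(other.registers).map { case (a, b) => math.max(a, b) })
    }

    // Raw estimate: a normalised harmonic mean of 2^register values.
    def estimate: Double = {
      val m = registers.size.toDouble
      val alpha = 0.7213 / (1.0 + 1.079 / m)
      alpha * m * m / registers.map(r => math.pow(2.0, -r.toDouble)).sum
    }
  }

  // Monoid identity: all registers at zero (7 to 16 index bits, i.e. 128+ registers).
  def empty(bits: Int): Hll = {
    require(bits >= 7 && bits <= 16, "bits must be between 7 and 16")
    Hll(bits, Vector.fill(1 << bits)(0))
  }
}
```

Merging with max means duplicates never change the registers, which is why the structure counts distinct values and why sketches from separate partitions combine losslessly.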