
Big Data - An Overview

An overview of what makes big data challenging, highlighting the reasoning behind some of the recent trends in the area.


Big Data - An Overview

  1. big data - an overview. Arvind Kalyan, Engineer at LinkedIn
  2. What makes data big?
  3. Size? The 1980 U.S. Census database, at roughly 100 GB, was considered ‘big’ at the time and required sophisticated machinery.
  4. Size? There are about 7 billion people on earth. Storing a name (~20 chars), age (7 bits) and gender (1 bit) for everyone on earth takes 7 billion * (20 * 8 + 7 + 1) / 8 = 147 GB. In the last 25 years that number hasn’t grown much and is still in the same order of magnitude.
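A quick back-of-envelope check of the arithmetic on the slide above; a minimal sketch, where the per-person field sizes are the slide's own assumptions:

```scala
// Back-of-envelope: bytes needed to store name, age and gender for everyone on earth.
object PopulationSize extends App {
  val people     = 7e9                  // ~7 billion people
  val bitsPerRow = 20 * 8 + 7 + 1       // name (~20 chars), age (7 bits), gender (1 bit)
  val totalBytes = people * bitsPerRow / 8
  println(f"${totalBytes / 1e9}%.0f GB") // prints "147 GB"
}
```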
  5. Size? i.e., the cardinality of real-world entities doesn’t change or grow very fast, so that by itself really is not that much data.
  6. Size? A server with 128 GB of RAM and multiple TB of disk space is not difficult to find in 2015.
  7. Size? But the observations on those entities can be too many. Those 7 billion people interacting with other people, web-sites, products, etc. at different points in time and at different locations quickly explode into a huge number of data points.
  8. Analysis. ‘Big data’ becomes a challenge when you want to analyze all of those observations to identify trends & patterns.
  9. RDBMS for performing analysis. An RDBMS these days can hold *much* larger volumes on a single machine.
  10. RDBMS for performing analysis. An RDBMS is good at storing data & fetching the individual rows satisfying your query.
  11. RDBMS for performing analysis. MySQL, for example, when used right can guarantee data durability and at the same time provide some of the lowest read latencies.
  12. RDBMS for performing analysis. An RDBMS can also be used for ad-hoc analysis; the performance of such analysis can be improved by sacrificing some ‘relational’ principles, for example by denormalizing tables (trading the cost of copies for the cost of seeks).
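A minimal, hypothetical illustration of the copy-vs-seek trade-off on slide 12, written in plain Scala rather than SQL: the normalized layout looks up user attributes at read time, while the denormalized layout copies them onto every event row so an aggregation scans only one table. All table and field names here are made up for the example.

```scala
// Normalized: events reference users by id; each read does an extra lookup ("seek").
case class User(id: Long, country: String)
case class Event(userId: Long, page: String)

val users  = Map(1L -> User(1L, "US"), 2L -> User(2L, "BR"))
val events = Seq(Event(1L, "/home"), Event(2L, "/jobs"), Event(1L, "/feed"))

val viewsByCountryNormalized =
  events.groupBy(e => users(e.userId).country).map { case (c, es) => c -> es.size }

// Denormalized: the country is copied onto every event row ("copy"),
// so the aggregation scans a single, wider table with no lookups.
case class EventDenorm(userId: Long, country: String, page: String)
val eventsDenorm = events.map(e => EventDenorm(e.userId, users(e.userId).country, e.page))

val viewsByCountryDenorm =
  eventsDenorm.groupBy(_.country).map { case (c, es) => c -> es.size }
```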
  13. RDBMS for performing analysis. But this doesn’t scale when you want to run the analysis across all rows for every user; some queries can end up running for seconds or even days, depending on the size of the tables.
  14. RDBMS for performing analysis. … and then consider the overhead of doing this on an ongoing basis.
  15. RDBMS for performing analysis. An RDBMS is not the right system if you want to look at & process the full data-set.
  16. RDBMS for performing analysis. These days data is an asset, and businesses & organizations need to extrapolate trends, patterns, etc. from that data.
  17. What’s the solution, if not an RDBMS?
  18. “Numbers Everyone Should Know”
      L1 cache reference: 0.5 ns
      Branch mispredict: 5 ns
      L2 cache reference: 7 ns
      Mutex lock/unlock: 100 ns
      Main memory reference: 100 ns
      Compress 1K bytes with Zippy: 10,000 ns
      Send 2K bytes over 1 Gbps network: 20,000 ns
      Read 1 MB sequentially from memory: 250,000 ns
      Round trip within same datacenter: 500,000 ns
      Disk seek: 10,000,000 ns
      Read 1 MB sequentially from network: 10,000,000 ns
      Read 1 MB sequentially from disk: 30,000,000 ns
      Send packet CA->Netherlands->CA: 150,000,000 ns
  19. Let’s look closely at disk vs memory. Source:
  20. Let’s look closely at disk vs memory. In general, disk is slower than SSD, and SSD is slower than memory; but more importantly…
  21. Let’s look closely at disk vs memory. Memory is not always faster than disk; SSD is not always faster than disk.
  22. … and at network vs disk: the network is not always slower than disk. Plus, the local machine and its disk have limits, but the network enables you to grow!
  23. Lesson learned: the access pattern is important. Depending on the data-set (size) & use-case, the access pattern alone can make the difference between possible & not possible!
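A rough sketch of why the access pattern alone can decide feasibility, plugging in the latency numbers from slide 18: reading 1 GB from a spinning disk as one sequential scan versus fetching it through many small random reads. The 4 KB read size is an assumption made only for this example.

```scala
// Compare sequential vs random access for 1 GB on spinning disk,
// using the rough "numbers everyone should know" from slide 18.
object AccessPattern extends App {
  val seqReadPerMbNs = 30e6   // read 1 MB sequentially from disk: ~30 ms
  val seekNs         = 10e6   // one disk seek: ~10 ms

  val totalMb     = 1024                              // 1 GB
  val sequential  = totalMb * seqReadPerMbNs / 1e9    // seconds for one linear scan
  val randomReads = totalMb * 1024 / 4                // number of 4 KB random reads
  val random      = randomReads * seekNs / 1e9        // seek-dominated, seconds

  println(f"sequential scan:   ~$sequential%.0f s")                              // ~31 s
  println(f"random 4 KB reads: ~$random%.0f s (~${random / 3600}%.1f h)")        // ~2600 s
}
```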
  24. The solution. More often these days, it is better to have a distributed system crunch the data: one node gathers partial results from many other nodes and produces the final result; the individual nodes are in turn optimized to return results quickly from their partial, local data.
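A toy scatter-gather sketch of the idea on slide 24: "nodes" are simulated with local partitions and futures, each one aggregates its own slice, and a coordinator combines the partial results. The names are illustrative and not any particular framework's API.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ScatterGather extends App {
  // Pretend each partition lives on a different node.
  val partitions: Seq[Seq[Int]] = (1 to 1000).grouped(250).toSeq

  // Each "node" computes a partial sum over its local data, in parallel.
  val partials: Seq[Future[Long]] =
    partitions.map(part => Future(part.map(_.toLong).sum))

  // The coordinator gathers the partial results and produces the final answer.
  val total: Long = Await.result(Future.sequence(partials), 10.seconds).sum
  println(total) // 500500
}
```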
  25. big-data tech. ‘Big data’ technologies & tools help you define your task in a higher-level language, abstracting away the details of the distributed systems underneath.
  26. big-data tech. i.e., they encourage (or force) you to follow a certain access pattern.
  27. big-data tech. These are still tools: following the suggested best practices helps you make the best use of the tool; at the same time, following anti-patterns can make the situation appear more challenging.
  28. big-data tech: DSLs. Some of the most popular frameworks happen to be DSLs that generate another set of instructions that actually execute: Pig, Hive, Scalding, Cascading, etc.
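For a concrete feel of such a DSL, here is roughly what a word count looks like in Scalding (Scala code that Cascading turns into Hadoop map/reduce stages). This follows the well-known Scalding tutorial example and is a sketch, not a tested job.

```scala
import com.twitter.scalding._

// You write Scala; Scalding/Cascading generate and run the Hadoop job.
class WordCountJob(args: Args) extends Job(args) {
  TextLine(args("input"))
    .flatMap('line -> 'word) { line: String => line.split("""\s+""") }
    .groupBy('word) { _.size }
    .write(Tsv(args("output")))
}
```

If this job fails at runtime, the stack trace typically comes from the generated Cascading/Hadoop layer, which is exactly the translation problem described on slide 30.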
  29. big-data tech: DSLs. The biggest challenges with these DSLs are getting them to work, and long-term maintenance.
  30. big-data tech: DSLs. When using DSLs, code is written in one language and executed in another. If anything fails, the error message is usually associated with the latter, and you have to know enough about the abstracted layer to be able to translate it back to your code.
  31. big-data tech: DSLs. But it usually works for the most part, because some of these popular frameworks have active user groups and blog posts that’ll help you get the job done.
  32. big-data tech: DSLs. So, the more popular the technology, the better the documentation & support; also, the bigger the community, the better the likelihood that the technology keeps evolving. So start with the most popular/common technology that does the job, even if it means giving up some ‘cool’ feature provided by another, currently less popular tool, unless that feature is absolutely necessary.
  33. big-data tech: map/reduce. Most of the current (as of Feb 2015) ‘big-data’ frameworks revolve around the map/reduce paradigm… and use hadoop technologies underneath.
  34. big-data tech: map/reduce. Hadoop technologies for big-data processing can be seen as 2 major components: hdfs => a distributed filesystem for storage; ‘map/reduce’ => a programming model.
  35. big-data tech: hdfs. hdfs stores data in an immutable format; data is ‘partitioned’ & stored on different machines, and there are also multiple copies of each ‘part’, i.e., it is replicated.
  36. big-data tech: hdfs. ‘Partitioning’ enables faster processing in parallel; it also helps with data locality: code can run where the data is, and not the other way around.
  37. big-data tech: hdfs. ‘Replication’ increases the availability of the data itself; it also helps with the overall performance & availability of tasks running on it (speculative execution): run the same task on multiple replicas, and wait for one of them to finish.
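A toy sketch of the partitioning-plus-replication idea from slides 35-37: a file is split into fixed-size blocks and each block's replicas are placed on distinct "nodes". The block size, replication factor and placement policy here are example values, not hdfs internals.

```scala
object BlockPlacement extends App {
  val nodes       = Vector("node-a", "node-b", "node-c", "node-d", "node-e")
  val blockSizeMb = 128   // illustrative block size
  val replication = 3     // illustrative replication factor

  val fileSizeMb = 1000
  val numBlocks  = math.ceil(fileSizeMb.toDouble / blockSizeMb).toInt

  // Spread each block's replicas over distinct nodes (round-robin for simplicity).
  val placement: Map[Int, Seq[String]] =
    (0 until numBlocks).map { b =>
      b -> (0 until replication).map(r => nodes((b + r) % nodes.size))
    }.toMap

  placement.toSeq.sortBy(_._1).foreach { case (b, replicas) =>
    println(s"block $b -> ${replicas.mkString(", ")}")
  }
}
```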
  38. big-data tech: map/reduce. ‘map/reduce’ is a functional programming concept. map => process & transform each set of data points in parallel, emitting (key, value) pairs. reduce => gather those partial results and come up with the final result.
  39. big-data tech: map/reduce. ‘map/reduce’ in hadoop also has one more important step between map and reduce. shuffle: makes sure all values for a given ‘key’ end up on the same reducer.
  40. big-data tech: map/reduce. ‘map/reduce’ in hadoop also has a slightly different ‘reduce’. reduce: values are aggregated per key, not across the whole dataset.
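To make the map -> shuffle -> reduce flow on slides 38-40 concrete, here is a single-machine word count in plain Scala with the three phases spelled out. The real thing runs distributed on hadoop; this is only a sketch of the data flow.

```scala
object WordCountPhases extends App {
  val documents = Seq("to be or not to be", "to do is to be")

  // map: each input record is processed independently and emits (key, value) pairs.
  val mapped: Seq[(String, Int)] =
    documents.flatMap(doc => doc.split(" ").map(word => (word, 1)))

  // shuffle: all values for the same key are brought together on the same "reducer".
  val shuffled: Map[String, Seq[Int]] =
    mapped.groupBy(_._1).map { case (word, pairs) => word -> pairs.map(_._2) }

  // reduce: values are aggregated per key, not across the whole dataset.
  val counts: Map[String, Int] =
    shuffled.map { case (word, ones) => word -> ones.sum }

  println(counts) // e.g. Map(to -> 4, be -> 3, or -> 1, not -> 1, do -> 1, is -> 1)
}
```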
  41. big-data tech: map/reduce. In general, map/reduce shines for ‘embarrassingly parallel’ problems. Trying to run non-parallelizable jobs on hadoop (like anything requiring global ordering, or something similar) might work now, but may not scale in the long run.
  42. big-data tech: map/reduce. But surprisingly, a *lot* of ‘big-data’ problems can be modeled directly on map/reduce with little to no change.
  43. big-data tech: map/reduce. map/reduce on hadoop is a multi-tenant distributed system running on disk-local data.
  44. non-interactive analysis. Big-data analysis/processing has typically been associated with non-interactive jobs, i.e., the user doesn’t expect the results to come back in a few seconds; the job usually takes a few minutes to a few hours, or even days.
  45. the need for speed. What if we made it run on in-memory data?
  46. the need for speed. This is the current trend; Spark and Presto are some noteworthy examples. They are not based on the map/reduce programming paradigm, but they still take advantage of the underlying distributed filesystem (hdfs).
  47. faster big-data: spark. Spark is a Scala DSL. Resilient Distributed Datasets (RDDs) are the primary abstraction in Spark: collections of data kept in-memory. Spark provides a collections API, comparable to the Scala language’s, that transforms one RDD into another.
  48. faster big-data: spark. Fault-tolerance: it retries the computation on certain failures, so it’s also good for ‘backend’ / scheduled jobs in addition to interactive usage.
  49. faster big-data: spark. Where it shines: complex, iterative algorithms like in machine learning. Since RDDs can be cached in-memory and reused, the computation speeds up considerably in the absence of disk I/O.
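A small Spark (Scala) sketch of the RDD API described on slides 47-49: the input RDD is cached in memory so repeated passes skip the disk, which is what makes iterative algorithms fast. The input path and the specific computations are illustrative only.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SparkSketch extends App {
  val sc = new SparkContext(new SparkConf().setAppName("rdd-sketch"))

  // An RDD is a distributed collection with a Scala-like collections API.
  val words = sc.textFile("hdfs:///data/docs")   // illustrative input path
    .flatMap(_.split("\\s+"))
    .cache()                                     // keep the RDD in memory for reuse

  // Repeated passes over the cached RDD avoid going back to disk each time.
  val totalWords    = words.count()
  val distinctWords = words.distinct().count()
  println(s"$totalWords words, $distinctWords distinct")
}
```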
  50. faster big-data: presto. Presto is from Facebook, written in Java: essentially a distributed, read-only SQL query engine designed specifically for interactive ad-hoc analysis over petabytes of data. Data stays on disk, but the processing pipeline is fully in-memory.
  51. faster big-data: presto. No fault-tolerance, but extremely fast: ideal for ad-hoc queries on extremely large data that finish fast, but not for long-running or scheduled jobs. As of today there is no UDF support.
  52. references
  53. Leave questions & comments below, or reach out through LinkedIn! Arvind Kalyan