Big Data and Hadoop - History, Technical Deep Dive, and Industry Trends

An overview of the history of Big Data, followed by a deep dive into the Hadoop ecosystem. A detailed explanation of how HDFS, MapReduce, and HBase work, followed by a discussion of how to tune HBase performance. Finally, a look at industry trends, including the challenges Bloomberg faces, and is solving, in using Hadoop for financial data.

  1. 1. BIG DATA AND HADOOP: History, Technical Deep Dive, and Industry Trends • Esther Kundin, Bloomberg LP
  2. 2. About Me
  3. 3. Big Data – What Is It?
  4. 4. Outline • What Is Big Data? • A History Lesson • Hadoop – Dive into the details • HDFS • MapReduce • HBase • Industry Trends • Questions
  5. 5. What is Big Data?
  6. 6. A History Lesson
  7. 7. Big Data Origins • Indexing the web requires lots of storage • Petabytes of data! • Economic problem – reliable servers expensive! • Solution: • Cram in as many cheap machines as possible • Replace them when they fail • Solve reliability via software!
  8. 8. Big Data Origins Cont’d • DBs are slow and expensive • Lots of unneeded features • RDBMS vs. NoSQL: ACID vs. eventual consistency; strongly typed vs. no type checking; complex joins vs. Get/Put; RAID storage vs. commodity hardware
  9. 9. Big Data Origins Cont’d • Google publishes papers about: • GFS (2003) • MapReduce (2004) • BigTable (2006) • Hadoop, which grew out of the Nutch project and was developed largely at Yahoo, accepted as an Apache top-level project in 2008
  10. 10. Translation • GFS → HDFS • MapReduce → Hadoop MapReduce • BigTable → HBase
  11. 11. Why Hadoop? • Huge and growing ecosystem of services • Pace of development is swift • Tons of money and talent pouring in
  12. 12. Diving into the details!
  13. 13. Hadoop Ecosystem • HDFS – Hadoop Distributed File System • Pig: a scripting language that simplifies the creation of MapReduce jobs and excels at exploring and transforming data. • Hive: provides SQL-like access to your Big Data. • HBase: Hadoop database. • HCatalog: for defining and sharing schemas. • Ambari: for provisioning, managing, and monitoring Apache Hadoop clusters. • ZooKeeper: an open-source server which enables highly reliable distributed coordination. • Sqoop: for efficiently transferring bulk data between Hadoop and relational databases. • Oozie: a workflow scheduler system to manage Apache Hadoop jobs. • Mahout: scalable machine learning library.
  14. 14. HDFS • Hadoop Distributed File System • Basis for all the other tools, which are built on top of it • Allows for distributed workloads
  15. 15. HDFS details
  16. 16. HDFS Demo
  17. 17. MapReduce
  18. 18. MapReduce demo • To run, can use: • Custom Java application • Pig – nice interface • Hadoop Streaming + any executable, like Python • Thanks to: http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/ • Hive – SQL over MapReduce – “we put the SQL in NoSQL”
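
The Hadoop Streaming option on the MapReduce demo slide can be illustrated with a word count. This is a minimal sketch rather than the demo from the talk: the function names (`mapper`, `reducer`, `run_local`) are assumptions, and the whole map, shuffle/sort, and reduce flow is simulated in one process. Under real Hadoop Streaming, the mapper and reducer would be separate executables reading lines from standard input and writing tab-separated key/value pairs to standard output.

```python
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Map phase: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def reducer(sorted_pairs):
    """Reduce phase: sum the counts for each word.

    Expects pairs sorted by key, which is what the shuffle/sort phase
    provides before a reducer sees its input.
    """
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


def run_local(lines):
    """Simulate map -> shuffle/sort -> reduce in a single process."""
    return dict(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    print(run_local(["big data and hadoop", "hadoop and hbase"]))
```

Sorting the mapper output before reducing is the important step: `groupby` only merges adjacent equal keys, just as a reducer relies on the framework to deliver each key's values together.
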
  19. 19. HBase • Database running on top of HDFS • NoSQL – key/value store • Distributed • Good for sparse, random-access requests, rather than full scans like MapReduce • Sorted • Strongly consistent for single-row reads and writes
  20. 20. HBase Architecture • [Diagram: Client; ZK Quorum (three ZK Peers); two HMasters; Meta Region Server; three RegionServers; HDFS]
  21. 21. HBase Read • [Same diagram] • Client requests the Meta Region Server address
  22. 22. HBase Architecture • [Same diagram] • Client determines which RegionServer to contact and caches that data
  23. 23. HBase Architecture • [Same diagram] • Client requests data from the Region Server, which gets data from HDFS
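
The three read steps above (find the meta information, work out which RegionServer owns the row, cache that answer) can be sketched as a toy lookup. Everything here is hypothetical and simplified: `ToyClient`, the `meta_table` list, and the server names are illustrative stand-ins, and the sketch caches the whole table at once rather than individual region locations.

```python
import bisect


class ToyClient:
    """Toy model of a client locating the RegionServer for a row key."""

    def __init__(self, meta_table):
        # meta_table: (region_start_key, region_server_name) pairs.
        # It stands in for what the Meta Region Server would return.
        self.meta_table = meta_table
        self.cache = None        # locally cached copy of region locations
        self.meta_lookups = 0    # how many times the meta data was fetched

    def _load_meta(self):
        self.meta_lookups += 1
        self.cache = sorted(self.meta_table)

    def locate(self, row_key):
        """Return the name of the server whose region contains row_key."""
        if self.cache is None:
            self._load_meta()
        start_keys = [start for start, _ in self.cache]
        # The owning region is the last one whose start key is <= row_key.
        index = bisect.bisect_right(start_keys, row_key) - 1
        return self.cache[index][1]


if __name__ == "__main__":
    client = ToyClient([("", "rs1"), ("m", "rs2"), ("t", "rs3")])
    for key in ("apple", "mango", "zebra"):
        print(key, "->", client.locate(key))
```

Because regions hold sorted, contiguous key ranges, a binary search over region start keys is enough to route a request, and after the first lookup the client does not need to ask again.
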
  24. 24. HBase Demo
  25. 25. HMaster • Only one main master at a time – ensured by ZooKeeper • Keeps track of all table metadata • Used in table creation, modification, and deletion • Not used for reads
  26. 26. Region Server • This is the worker node of HBase • Performs Gets, Puts, and Scans for the regions it handles • Multiple regions are handled by each Region Server • On startup • Registers with ZooKeeper • HMaster assigns it regions • Physical blocks on HDFS may or may not be on the same machine • Regions are split if they get too big • Data stored in a format called HFile • Cache of data is what gives good performance. Cache is based on blocks, not rows
  27. 27. HBase Write – step 1 • [Diagram: Region Server with WAL (on HDFS), MemStore, and three HFiles] • Region Server persists the write at the end of the WAL
  28. 28. HBase Write – step 2 • [Same diagram] • Region Server saves the write in a sorted map in memory in the MemStore
  29. 29. HBase Write – offline • [Same diagram] • When the MemStore reaches a configurable size, it is flushed to an HFile
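
The write steps above (append to the WAL, update the MemStore, flush to an HFile once the MemStore is large enough) can be modelled in a few lines. This is a toy sketch, not HBase code: `ToyRegionServer` and its fields are hypothetical, and the flush threshold counts entries, whereas the real threshold is a configurable size in bytes.

```python
class ToyRegionServer:
    """Toy model of the write path: WAL -> MemStore -> HFile."""

    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold  # entries, not bytes (assumption)
        self.wal = []        # append-only log, stands in for the WAL on HDFS
        self.memstore = {}   # in-memory map of the latest writes
        self.hfiles = []     # immutable, sorted lists of (key, value)

    def put(self, key, value):
        self.wal.append((key, value))   # step 1: persist at the end of the WAL
        self.memstore[key] = value      # step 2: save in the MemStore
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Write the MemStore out as one sorted, immutable HFile."""
        if self.memstore:
            self.hfiles.append(sorted(self.memstore.items()))
            self.memstore = {}

    def get(self, key):
        """Check the MemStore first, then HFiles from newest to oldest."""
        if key in self.memstore:
            return self.memstore[key]
        for hfile in reversed(self.hfiles):
            for stored_key, value in hfile:
                if stored_key == key:
                    return value
        return None


if __name__ == "__main__":
    server = ToyRegionServer(flush_threshold=2)
    server.put("b", "1")
    server.put("a", "2")   # reaches the threshold and triggers a flush
    server.put("b", "3")   # newer value stays in the MemStore
    print(server.hfiles, server.memstore, server.get("b"))
```

Note that a read may have to consult several HFiles, which is the cost that compaction reduces.
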
  30. 30. Minor Compaction • Writing a MemStore to an HFile may trigger a Minor Compaction • Combines many small HFiles into one large one • Saves disk reads • May block further MemStore flushes, so try to keep to a minimum
  31. 31. Major Compaction • Happens at configurable times for the system • E.g., once a week on weekends • Defaulted to once every 24 hrs in older releases (7 days since HBase 0.96) • Resource-intensive • Don’t set it to “never” • Reads in all HFiles and makes sure there is one HFile per Region per column family • Purges deleted records • Ensures that HDFS files are local
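
The difference between the two compaction types can be sketched as a merge of sorted files. This is a simplified illustration under stated assumptions: `compact` and `TOMBSTONE` are hypothetical names, each HFile is a sorted list of (key, value) pairs, and a delete is represented by a marker value that only a major compaction removes.

```python
TOMBSTONE = object()  # hypothetical delete marker


def compact(hfiles, major=False):
    """Merge HFiles (ordered oldest first) into a single sorted HFile.

    A minor compaction keeps delete markers, because older files that were
    not part of the merge might still hold the deleted key. A major
    compaction reads every file, so the markers can be purged.
    """
    merged = {}
    for hfile in hfiles:              # oldest first, so newer values win
        for key, value in hfile:
            merged[key] = value
    if major:
        merged = {k: v for k, v in merged.items() if v is not TOMBSTONE}
    # Sort by key only; values (including the marker) are never compared.
    return sorted(merged.items(), key=lambda item: item[0])


if __name__ == "__main__":
    older = [("a", "1"), ("b", "2")]
    newer = [("b", TOMBSTONE), ("c", "3")]
    print(compact([older, newer], major=True))
```
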
  32. 32. Tuning your DB - HBase Keys • Row Key – byte array • Best performance for Single Row Gets • Best Caching Performance • Key Design: keys should distribute well across regions, usually accomplished by hashing the natural key • MD5 • SHA1
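
One common way to apply the key-design advice above is to prefix the natural key with part of its hash, so that sequential natural keys land in different parts of the sorted key space. The sketch below is an assumption about how that might look, not the scheme used in the talk: `hashed_row_key`, the prefix length, and the `-` separator are all illustrative choices.

```python
import hashlib


def hashed_row_key(natural_key: str, prefix_len: int = 4) -> bytes:
    """Build a row key as <hash prefix>-<natural key>.

    The MD5 prefix spreads keys across regions; keeping the natural key as
    a suffix means the original value can still be read back from the key.
    """
    digest = hashlib.md5(natural_key.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}-{natural_key}".encode("utf-8")


if __name__ == "__main__":
    for package_id in ("pkg-0001", "pkg-0002", "pkg-0003"):
        print(hashed_row_key(package_id))
```

The trade-off is that range scans over the natural key order are no longer possible, which fits the slide's point that single-row Gets are the best-performing access pattern.
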
  33. 33. Tuning your DB - BlockCache • Each region server has a BlockCache where it stores file blocks that it has already read • Every read served from a cached block improves performance • Don’t want your blocks to be much bigger than your rows • Modes of caching: • 2-level LRU cache, by default • Other options: BucketCache – can use DirectByteBuffers to manage off-heap RAM – better Garbage Collection stats on the region server
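
The point that the cache works on blocks rather than rows can be seen in a small least-recently-used cache. This is a single-level toy, not the 2-level cache named on the slide: `ToyBlockCache`, `load_block`, and the fixed number of rows per block are hypothetical simplifications.

```python
from collections import OrderedDict


def load_block(block_id, rows_per_block):
    """Hypothetical stand-in for reading one block of an HFile from disk."""
    first = block_id * rows_per_block
    return [f"row-{i}" for i in range(first, first + rows_per_block)]


class ToyBlockCache:
    """Single-level LRU cache keyed by block, not by row."""

    def __init__(self, capacity_blocks, rows_per_block):
        self.capacity_blocks = capacity_blocks
        self.rows_per_block = rows_per_block
        self.blocks = OrderedDict()   # block_id -> rows, least recent first
        self.hits = 0
        self.misses = 0

    def get_row(self, row_index):
        block_id, offset = divmod(row_index, self.rows_per_block)
        if block_id in self.blocks:
            self.hits += 1
            self.blocks.move_to_end(block_id)     # mark as most recently used
        else:
            self.misses += 1
            self.blocks[block_id] = load_block(block_id, self.rows_per_block)
            if len(self.blocks) > self.capacity_blocks:
                self.blocks.popitem(last=False)   # evict least recently used
        return self.blocks[block_id][offset]


if __name__ == "__main__":
    cache = ToyBlockCache(capacity_blocks=2, rows_per_block=4)
    for row in (0, 1, 5, 9, 0):
        cache.get_row(row)
    print("hits:", cache.hits, "misses:", cache.misses)
```

Reading one row pulls in its whole block, so neighbouring rows become cache hits, while a block much larger than the rows being requested fills the cache with data nobody asked for.
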
  34. 34. Tuning your DB - Columns and Column Families • All columns in a column family are accessed together for reads • Different column families are stored in different HFiles • All column families are flushed at once when any MemStore is full • Example: • Storing package tracking information: • Need package shipping info • Need to store each location in the path
  35. 35. Tuning your DB – Bloom Filters • Can be set on rows or on row + column combinations • Keep an extra index of available keys • Slows down reads and writes a bit • Increases storage • Saves time checking if keys exist • Turn on if it is likely that the client will request missing data
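
Why a Bloom filter helps with requests for missing data can be shown with a minimal implementation. This is a generic sketch, not HBase's implementation: the class name, bit-array size, and the use of SHA-1 with a one-byte seed to derive several hash positions are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)   # one byte per bit, for simplicity

    def _positions(self, key: bytes):
        # Derive several positions by hashing the key with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha1(bytes([seed]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: bytes):
        for position in self._positions(key):
            self.bits[position] = 1

    def might_contain(self, key: bytes) -> bool:
        """False means the key is definitely absent; True means 'maybe'."""
        return all(self.bits[position] for position in self._positions(key))


if __name__ == "__main__":
    row_filter = BloomFilter()
    row_filter.add(b"row-1")
    print(row_filter.might_contain(b"row-1"))
```

A "definitely absent" answer lets a read skip an HFile without touching the disk, which is where the time saving comes from; the cost is the extra storage for the bit array and the work of maintaining it.
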
  36. 36. Tuning your DB – Short-Circuit Reads • HDFS exposes a service interface • If the file is actually local, it is much faster to just read the HFile directly off of the disk
  37. 37. Current Industry Trends
  38. 38. Big Data in Finance – the challenges • Real-Time financial analysis • Reliability • “medium-data”
  39. 39. What Bloomberg is Working on • Working with Hortonworks on fixing real-time issues in Hadoop • Creating a framework for reliably serving real-time data • Presenting at Hadoop World and Hadoop Summit • Open source Chef recipes for running a Hadoop cluster on OpenStack-managed VMs
  40. 40. Questions? • Thank you!

Editor's notes

  • Thanks to Matt Hunt for this slide: http://www.slideshare.net/MatthewHunt1/hadoop-at-bloombergmedium-data-for-the-financial-industry
  • Name node is the manager, data node is the worker
  • Job Tracker = Resource Manager
    Task Tracker = Node Manager
    Number of jobs depends on the range of keys. Number of reducers is set by the user – you’d want it to correspond to the set of possible values. So, if the values are ASCII, you won’t want reducers to exceed 256. You also don’t want them to exceed the number of data nodes you have.
  • Remember, HBase treats everything as a file system
  • The ZooKeeper quorum should have an odd number of members, as a majority is needed for consensus. Znode is the name of each attribute that is managed by ZooKeeper.
  • All columns in a column family are read for a get – but not all column families unless specified
  • Although there is a separate MemStore per column family, as soon as one is full, all of them are written to HFiles. Note also that deletes are handled with a marker, and are only really purged at a major compaction
