
Big Data & Hadoop Framework


  1. BIG DATA & HADOOP FRAMEWORK
  2. Who am I?
     • Name: Pham Phuong Tu
     • R&D developer at VC Corp
     • Related projects: muachung.vn, enbac.com, rongbay.com, …
     • Interested in system & software architecture, big data, data statistics & analytics
     • Email: phamptu@gmail.com
  3. Statistics System
  4. Statistics System: Measuring Marketing Effectiveness
  5. User Activity Recorder: Mouse Clicks
  6. User Activity Recorder: Mouse Clicks
  7. User Activity Recorder: Mouse Scroll
  8. Mining System Logs
  9. Table of Contents
     • Challenges
     • Big Data: overview, data types, what – who – why, extract-transform-load, data operations, big data platform
     • Hadoop Framework: overview, history, users, architecture, MapReduce, Hadoop environment
     • Q&A
     • Demo
  10. CHALLENGES
  14. BIG DATA
  16. Analyze All Available Data: data warehouse, social media, website, billing, ERP, CRM, devices, network switches
  17. Types of Data
     • Plain text data (web)
     • Semi-structured data (XML, JSON)
     • Relational data (tables / transactions / legacy data)
     • Graph data: social networks, Semantic Web (RDF), …
     • Multimedia data: images, video, …
     • Streaming data: you can only scan the data once
  18. What is collecting all this data?
     • Web browsers
     • Web sites (search engines, social networks, e-commerce platforms, …)
     • Applications on computers, smartphones, tablets, game consoles
     • Other systems (banking, phone, medical, GPS)
     • Internet service providers
  19. Who is collecting your data?
     • Government agencies
     • Companies
     • Service providers
     • Big stores
  20. Why are they collecting your data?
     • Search
     • Recommendation systems: New York Times, Eyealike
     • Target marketing: Facebook, AOL
     • Video and image analysis: Facebook, Yahoo, Google
     • Data warehousing: Facebook, Amazon
     • Log analytics: Yahoo, Amazon, Zvents
     • Ads: Google Ads, Facebook Ads
     • Business strategy: Walmart
  21. ETL (a minimal code sketch follows this slide)
     • Extract: convert the data into a single format suitable for transformation processing.
     • Transform: apply a series of rules or functions to the data extracted from the source.
     • Load: load the data into the end target, usually the data warehouse.
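As a concrete illustration, here is a minimal file-to-file sketch of the three stages in plain Java; the file names and the three-column row format are hypothetical, not part of the original slides:

    import java.io.IOException;
    import java.nio.file.*;
    import java.util.List;
    import java.util.stream.Collectors;

    // Minimal ETL sketch: extract raw CSV rows, transform them into a
    // uniform format, and load the result into a target file standing
    // in for a warehouse table.
    public class MiniEtl {
        public static void main(String[] args) throws IOException {
            // Extract: read every raw row from the (hypothetical) source file.
            List<String> raw = Files.readAllLines(Paths.get("orders_source.csv"));

            // Transform: trim whitespace, lower-case the email column, and
            // drop rows that do not have exactly three columns.
            List<String> clean = raw.stream()
                    .map(String::trim)
                    .map(line -> line.split(","))
                    .filter(cols -> cols.length == 3)
                    .map(cols -> cols[0].trim() + ","
                            + cols[1].trim().toLowerCase() + ","
                            + cols[2].trim())
                    .collect(Collectors.toList());

            // Load: write the cleaned rows to the (hypothetical) target.
            Files.write(Paths.get("orders_target.csv"), clean);
        }
    }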
  22. Real-life ETL cycle. The typical real-life ETL cycle consists of the following execution steps:
     1. Cycle initiation
     2. Build reference data
     3. Extract (from sources)
     4. Validate
     5. Transform (clean, apply business rules, check for data integrity, create aggregates or disaggregates)
     6. Stage (load into staging tables, if used)
     7. Audit reports (for example, on compliance with business rules; in case of failure, they help to diagnose and repair)
     8. Publish (to target tables)
     9. Archive
     10. Clean up
  23. Operational data with the current implementation
  24. Operational data with a big data implementation
  25. Big Data Platform
     • Federated discovery and navigation: understand and navigate federated big data sources
     • Hadoop File System and MapReduce: manage and store huge volumes of any data
     • Data warehousing: structure and control data
     • Stream computing: manage streaming data
     • Text analytics engine: analyze unstructured data
     • Integration, data quality, security, lifecycle management, MDM: integrate and govern all data sources
  26. HADOOP FRAMEWORK
  27. Single-node architecture: one machine (CPU, memory, disk) is enough for machine learning, statistics, and “classical” data mining.
  28. Cluster architecture: each rack contains 16-64 nodes (CPU, memory, disk) on a shared switch, giving 1 Gbps between any pair of nodes in a rack; a 2-10 Gbps backbone switch connects the racks.
  29. Hadoop History
     • Dec 2004: Google GFS paper published
     • Jul 2005: Nutch uses MapReduce
     • Feb 2006: becomes a Lucene subproject
     • Apr 2007: Yahoo! runs it on a 1000-node cluster
     • Jan 2008: becomes an Apache top-level project
     • Jul 2008: a 4000-node test cluster
  30. Who uses Hadoop?
     • Google, Yahoo, Bing
     • Amazon, eBay, Alibaba
     • Facebook, Twitter
     • IBM, HP, Toshiba, Intel
     • New York Times, BBC
     • Line, WeChat
     • VC Corp, FPT, VNG, VTC
  31. Hadoop Components
     • Distributed file system (HDFS): a single namespace for the entire cluster; replicates data 3x for fault tolerance
     • MapReduce framework: executes user jobs specified as “map” and “reduce” functions; manages work distribution and fault tolerance
  33. Goals of HDFS
     • Very large distributed file system: 10K nodes, 100 million files, 10 PB (roughly 1 TB per node)
     • Assumes commodity hardware: files are replicated to handle hardware failure; failures are detected and recovered from
     • Optimized for batch processing: data locations are exposed so that computations can move to where the data resides; provides very high aggregate bandwidth
     • Runs in user space on heterogeneous operating systems
  34. HDFS Multi-Cluster
  35. Hadoop 1.0 vs 2.0
  36. NameNode
     • Metadata in memory: the entire metadata is kept in main memory; no demand paging of metadata
     • Types of metadata: the list of files; the list of blocks for each file; the list of DataNodes for each block; file attributes, e.g. creation time and replication factor
     • A transaction log records file creations, file deletions, etc.
  37. DataNode
     • A block server: stores data in the local file system (e.g. ext3); stores metadata of each block; serves data and metadata to clients
     • Block report: periodically sends a report of all existing blocks to the NameNode
     • Facilitates pipelining of data: forwards data to other specified DataNodes
  38. Block Placement
     • Current strategy: one replica on the local node, the second replica on a remote rack, the third replica on that same remote rack; additional replicas are placed randomly
     • Clients read from the nearest replica
     • The developers would like to make this policy pluggable (a sketch of setting a file's replication factor follows this slide)
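Replication is set per file, so the placement policy above can be combined with a custom replication factor. A minimal sketch against the standard Hadoop FileSystem API, assuming the Hadoop client libraries are on the classpath, fs.defaultFS points at the cluster, and the path is hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Sketch: raise the replication factor of one file to 3.
    public class SetReplication {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);       // handle to HDFS
            boolean ok = fs.setReplication(new Path("/data/events.log"), (short) 3);
            System.out.println("replication change accepted: " + ok);
        }
    }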
  39. Data Correctness
     • Uses checksums (CRC32) to validate data
     • File creation: the client computes a checksum per 512 bytes; the DataNode stores the checksums
     • File access: the client retrieves the data and checksums from the DataNode; if validation fails, the client tries other replicas (a small checksum sketch follows this slide)
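The per-chunk checksumming can be mimicked with java.util.zip.CRC32; a self-contained sketch (the 1300-byte payload is arbitrary, chosen to produce two full chunks and one partial chunk):

    import java.util.zip.CRC32;

    // Sketch: compute one CRC32 checksum per 512-byte chunk of a buffer,
    // mirroring the per-chunk checksumming HDFS performs on write.
    public class ChunkChecksums {
        static long[] crcPerChunk(byte[] data) {
            int chunks = (data.length + 511) / 512;
            long[] crcs = new long[chunks];
            CRC32 crc = new CRC32();
            for (int i = 0; i < chunks; i++) {
                int off = i * 512;
                int len = Math.min(512, data.length - off);
                crc.reset();
                crc.update(data, off, len);
                crcs[i] = crc.getValue();   // stored alongside the block
            }
            return crcs;
        }

        public static void main(String[] args) {
            byte[] data = new byte[1300];   // toy payload: 512 + 512 + 276 bytes
            for (long c : crcPerChunk(data)) System.out.println(Long.toHexString(c));
        }
    }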
  40. NameNode Failure
     • The NameNode is a single point of failure
     • The transaction log is stored in multiple directories: a directory on the local file system and a directory on a remote file system (NFS/CIFS)
     • A real HA solution still needs to be developed (a configuration sketch follows this slide)
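In practice the multi-directory setup is expressed in configuration: the NameNode writes its image and transaction log to every listed directory. A hedged sketch for hdfs-site.xml, assuming Hadoop 2.x property names (Hadoop 1.x used dfs.name.dir) and hypothetical mount points:

    <property>
      <name>dfs.namenode.name.dir</name>
      <!-- one local disk plus one remote NFS mount; hypothetical paths -->
      <value>/data/hdfs/name,/mnt/nfs/hdfs/name</value>
    </property>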
  41. Data Pipelining
     • The client retrieves a list of DataNodes on which to place replicas of a block
     • The client writes the block to the first DataNode
     • The first DataNode forwards the data to the next DataNode in the pipeline
     • When all replicas are written, the client moves on to the next block in the file (a write sketch follows this slide)
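From the client's point of view the pipeline is invisible: user code writes to a single stream and HDFS fans the data out to the replicas. A hedged sketch using the standard FileSystem API (hypothetical path; assumes a reachable cluster):

    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Sketch: write a file to HDFS. The client only talks to the first
    // DataNode of each block's pipeline; forwarding to the remaining
    // replicas happens inside HDFS, not in user code.
    public class PipelinedWrite {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            try (FSDataOutputStream out = fs.create(new Path("/tmp/pipeline-demo.txt"))) {
                out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
            }   // close() blocks until the last block's replicas are acknowledged
        }
    }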
  42. Rebalancer
     • Goal: the percentage of disk in use should be similar across DataNodes
     • Usually run when new DataNodes are added
     • The cluster stays online while the Rebalancer is active
     • The Rebalancer is throttled to avoid network congestion
     • It is a command-line tool (see the invocation after this slide)
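On recent Hadoop versions the invocation looks roughly like this; the threshold is the allowed deviation, in percentage points, of each DataNode's utilization from the cluster average (assumes the hdfs wrapper script is on the PATH):

    hdfs balancer -threshold 10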
  43. What is MapReduce?
     • A simple data-parallel programming model designed for scalability and fault tolerance
     • Pioneered by Google, which processes 20 petabytes of data per day with it
     • Popularized by the open-source Hadoop project; used at Yahoo!, Facebook, Amazon, … (the function signatures follow this slide)
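At its core the model is just two user-supplied functions; their shapes, as given in Google's original MapReduce paper:

    map    (k1, v1)         -> list(k2, v2)
    reduce (k2, list(v2))   -> list(v2)

The framework groups all map outputs by k2 and hands each key, with its list of values, to a single reduce call; the word-count example on slide 47 instantiates exactly these signatures.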
  44. MapReduce Data Flow
  45. Execution
  46. Parallel Execution
  47. Example: word count process (a complete Java example follows this slide)
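Word count is the canonical MapReduce example; the sketch below stays close to the WordCount program shipped with Hadoop (new MapReduce API; input and output paths are taken from the command line):

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map: emit (word, 1) for every token in the input line.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce: sum the counts for each word.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) sum += val.get();
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }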
  48. Hadoop Environment
