



  2.  Introduction  Evolution of big data  Characteristics  Examples of big data generation  Big data vs. RDBMS  Hadoop  HDFS  MapReduce  References
  3.  Big data is a term for datasets that are so large or complex that traditional data-processing applications are inadequate.  Big data is the capability to manage huge volumes of disparate data, at the right speed and within the right time frame, to allow real-time analysis.
  4. Wave 1: Creating manageable data structures Wave 2: Web and content management Wave 3: Managing big data
  5. source: Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data
  7. Black box data Social media data Stock exchange data Power grid data Transport data Search engine data
  8.  Huge competition in the market: retail, customer analytics and predictive analytics; travel, travel patterns of customers; websites, understanding users' navigation patterns, interests, conversions, etc.  Sensors, satellite and geospatial data  Military and intelligence
  9.  Big data includes huge volume, high velocity, and an extensible variety of data. The data in it will be of three types. • Structured data: relational data. • Semi-structured data: XML data. • Unstructured data: Word, PDF, text, media logs.
  10. Relational database vs. big data:  Platform: a single-computer platform that scales with better CPUs and centralized processing, vs. cluster platforms that scale to thousands of nodes with distributed processing.  Storage: relational (SQL) databases with centralized storage, vs. non-relational (NoSQL) databases that manage varied data types and formats, with distributed storage.  Analytics: batched, descriptive, and centralized, vs. real-time, predictive and prescriptive, and distributed.
  11.  An open-source Apache Software Foundation framework.  It allows distributed processing of large datasets across clusters of computers using simple programming models.  Hadoop runs applications using the MapReduce algorithm, where the data is processed in parallel.  It uses the concept of data locality.
  12.  Processing/computation layer (MapReduce), and  Storage layer (Hadoop Distributed File System). source: Hadoop tutorial on
  13.  The Hadoop framework allows the user to quickly write and test distributed systems. It is efficient, automatically distributing the data and work across the machines, and in turn it utilizes the underlying parallelism of the CPU cores.  Hadoop does not rely on hardware to provide fault tolerance and high availability (FTHA); rather, the Hadoop library itself is designed to detect and handle failures at the application layer.  Hadoop is designed to be self-healing.
  14.  HDFS is a file system designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.  It can be defined as "a reliable, high-bandwidth, low-cost data-storage cluster that facilitates the management of related files across machines."
  15. Basic architecture of HDFS source: J. Hurwitz, et al., “Big Data for Dummies,” Wiley, 2013, ISBN:978-1-118-50422-2.
  16. Replica placement source: Hadoop: The Definitive Guide, by Tom White, 2015, ISBN: 978-1-491-90163-2
  17.  Hadoop MapReduce is an implementation of the MapReduce algorithm.  MapReduce is a batch query processor, and the ability to run an ad hoc query against the whole dataset and get the results in a reasonable time is transformative.
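The MapReduce model described above can be illustrated with a minimal, self-contained Python sketch (not Hadoop's actual API): a map function emits key–value pairs, a shuffle step groups them by key, and a reduce function aggregates each group. The function names here (map_phase, shuffle, reduce_phase) are illustrative, not part of any Hadoop interface.

```python
from collections import defaultdict

# Map: emit a (word, 1) pair for every word in every input line.
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

# Shuffle: group intermediate values by key, as the framework would
# between the map and reduce stages.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce: sum the counts for each word.
def reduce_phase(groups):
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data needs hadoop", "hadoop runs mapreduce"]
result = reduce_phase(shuffle(map_phase(lines)))
print(result["hadoop"])  # 2
```

In a real cluster the map and reduce calls run in parallel on different nodes, and the shuffle happens over the network; the logical data flow is the same.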
  18. source: J. Hurwitz, et al., “Big Data for Dummies,” Wiley, 2013, ISBN:978-1-118-50422-2.
  19.  Example of air temperature analysis.  Problems:  Dividing the work into equal-size pieces is not easy.  Combining the results from independent processes may require further processing.  The processing capacity of a single machine is limited. source: Hadoop: The Definitive Guide, by Tom White, 2015, ISBN: 978-1-491-90163-2
  20. source: Hadoop: The Definitive Guide, by Tom White, 2015, ISBN: 978-1-491-90163-2
  21. i. J. Hurwitz et al., "Big Data for Dummies," Wiley, 2013, ISBN: 978-1-118-50422-2. ii. 13/ iii. Hadoop: The Definitive Guide, by Tom White, 2015, ISBN: 978-1-491-90163-2. iv. Hadoop tutorials on

Editor's notes

  1. There is a data explosion: according to a figure from IBM, 2.5 quintillion bytes of data are created each day, which is a huge amount of data. And according to Oracle, in 2012 the data growth rate was a 40% compound annual rate. Data is growing exponentially.
  2. In the late 1960s, data was stored in flat files that imposed no structure. Later, in the 1970s, things changed with the invention of the relational data model and the relational database management system (RDBMS), which imposed structure and a method for improving performance. Enterprise content management systems evolved in the 1980s to provide businesses with the capability to better manage unstructured data, mostly documents. In the 1990s, with the rise of the web, organizations wanted to move beyond documents and store and manage web content, images, audio, and video. As with other waves in data management, big data is built on top of the evolution of data management practices over the past five decades. With big data, it is now possible to virtualize data so that it can be stored efficiently and, utilizing cloud-based storage, more cost-effectively as well.
  3. Volume: how much data. Velocity: how fast that data is processed. Variety: the various types of data. Even more important is the fourth V, veracity: how accurate is that data in predicting business value? Variability: inconsistency of the data set can hamper processes to handle and manage it.
  4. Data processing Data management Analytics
  5. Hadoop is designed to parallelize data processing across computing nodes to speed computations and hide latency. Hadoop has two primary layers, MapReduce and HDFS, alongside YARN and the common utilities.
  6. Servers can be added to or removed from the cluster dynamically, and Hadoop continues to operate without interruption. In other words, Hadoop is able to detect changes, including failures, adjust to them, and continue to operate without interruption.
  7. Very large files: "very large" in this context means files that are hundreds of megabytes, gigabytes, or terabytes in size. There are Hadoop clusters running today that store petabytes of data. Streaming data access: HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern. A dataset is typically generated or copied from source, and then various analyses are performed on that dataset over time. Each analysis will involve a large proportion, if not all, of the dataset, so the time to read the whole dataset is more important than the latency in reading the first record. Commodity hardware: Hadoop doesn't require expensive, highly reliable hardware. It's designed to run on clusters of commodity hardware (commonly available hardware that can be obtained from multiple vendors) for which the chance of node failure across the cluster is high, at least for large clusters. HDFS is designed to carry on working without a noticeable interruption to the user in the face of such failure. HDFS is not a good fit, however, for applications needing low-latency data access, lots of small files, or multiple writers with arbitrary file modifications.
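A quick back-of-the-envelope sketch of how such large files map onto HDFS storage, assuming the common defaults of a 128 MB block size and a replication factor of 3 (both are configurable; these values are assumptions, not read from any cluster):

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024   # assumed 128 MB block size
REPLICATION = 3                  # assumed replication factor

def hdfs_footprint(file_size_bytes):
    # A file occupies ceil(size / block_size) blocks; the final block
    # is not padded to a full block on disk.
    blocks = math.ceil(file_size_bytes / BLOCK_SIZE)
    raw_storage = file_size_bytes * REPLICATION
    return blocks, raw_storage

blocks, raw = hdfs_footprint(1 * 1024**3)  # a 1 GB file
print(blocks)  # 8 blocks, each stored 3 times across the cluster
```

This also shows why lots of small files are a poor fit: every file, however tiny, costs at least one block's worth of NameNode metadata.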
  8. Placing replicas in different data centers may maximize redundancy, but at the cost of bandwidth. Even in the same data center (which is what all Hadoop clusters to date have run in), there are a variety of possible placement strategies. Hadoop's default strategy is to place the first replica on the same node as the client (for clients running outside the cluster, a node is chosen at random, although the system tries not to pick nodes that are too full or too busy). The second replica is placed on a different rack from the first (off-rack), chosen at random. The third replica is placed on the same rack as the second, but on a different node chosen at random.
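The default placement policy just described can be sketched as a toy simulation. The rack and node names below are hypothetical, and the real NameNode logic also weighs free space and load; this sketch only captures the rack-awareness rules (first replica on the writer's node, second off-rack, third on the second's rack but a different node).

```python
import random

# Hypothetical two-rack cluster topology.
cluster = {
    "rack1": ["r1n1", "r1n2", "r1n3"],
    "rack2": ["r2n1", "r2n2", "r2n3"],
}

def place_replicas(client_node, rng=random):
    # First replica: on the same node as the client.
    first = client_node
    client_rack = next(r for r, nodes in cluster.items() if first in nodes)
    # Second replica: on a different rack, node chosen at random.
    other_rack = rng.choice([r for r in cluster if r != client_rack])
    second = rng.choice(cluster[other_rack])
    # Third replica: same rack as the second, but a different node.
    third = rng.choice([n for n in cluster[other_rack] if n != second])
    return [first, second, third]

replicas = place_replicas("r1n2")
print(replicas)
```

The resulting spread (two racks, three nodes) balances redundancy against cross-rack write bandwidth, which is the trade-off the slide notes open with.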