This document provides an overview of Hadoop storage from the perspectives of different stakeholders. The Hadoop application team prefers direct attached storage for performance reasons, because Hadoop was designed for affordable internet-scale analytics in which data locality matters. IT operations, however, has valid concerns about reliability, manageability, utilization, and integration with other systems when data sits on direct attached storage rather than shared storage. Both approaches involve tradeoffs that depend on the organization's infrastructure, workload characteristics, and priorities.
Traditional applications work on a model where data is loaded from wherever it is stored into memory on the computer where the application runs. As Google processed ever-increasing amounts of internet data, its engineers quickly realized that this centralized approach to computation was not sustainable, so they moved to a model that scales out both processing and storage: data is processed on the machine where it is stored. The processing technology became MapReduce, and the storage model became the Google File System (GFS), the direct ancestor of today's HDFS.

Hadoop is a top-level Apache project built and used by a global community of contributors. Yahoo has been the largest contributor to the project and uses Hadoop extensively across its businesses. One of its employees, Doug Cutting, reviewed key papers from Google and concluded that the technologies they described could solve the scalability problems of Nutch, an open source web search technology. Cutting then led the effort to develop Hadoop (which, incidentally, he named after his son's stuffed elephant).

Hadoop is particularly well suited to batch-oriented, read-intensive applications. Key features include the ability to distribute and manage data across a large number of nodes and disks. By using the MapReduce programming model with the Hadoop framework, programmers can create applications that automatically take advantage of parallel processing. A single commodity box with, say, one CPU and one disk forms a node in Hadoop. Such boxes can be combined into clusters, and new nodes can be added to a cluster without an administrator or programmer changing the format of the data, how the data is loaded, or how the jobs (programming logic) are written.

The following overview of Hadoop was extracted from the Hadoop wiki at http://wiki.apache.org/hadoop/: Apache Hadoop is a framework for running applications on large clusters built of commodity hardware. The Hadoop framework transparently provides applications both reliability and data motion. Hadoop implements a computational paradigm named Map/Reduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both Map/Reduce and the distributed file system are designed so that node failures are automatically handled by the framework.
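To make the Map/Reduce paradigm concrete, here is a minimal, self-contained Python sketch of the classic word-count pattern. Real Hadoop jobs are normally written in Java and run across a cluster; this simulates the map, shuffle, and reduce phases in a single process purely to illustrate the programming model (the sample input lines are made up for the example).

```python
# A minimal, self-contained sketch of the Map/Reduce word-count pattern.
# Real Hadoop jobs run each phase on many nodes in parallel; here the
# "map", "shuffle", and "reduce" phases are simulated in-process.
from collections import defaultdict

def map_phase(lines):
    """Map: split each input line into (word, 1) pairs."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_phase(pairs):
    """Shuffle: group all values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped.items()

def reduce_phase(grouped):
    """Reduce: combine the values for each key into a single result."""
    for key, values in grouped:
        yield (key, sum(values))

if __name__ == "__main__":
    sample = ["hadoop stores data on compute nodes",
              "hadoop moves computation to the data"]
    for word, count in sorted(reduce_phase(shuffle_phase(map_phase(sample)))):
        print(word, count)
```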
http://wikibon.org/wiki/v/Big_Data:_Hadoop%2C_Business_Analytics_and_Beyond

A Hadoop "stack" is made up of a number of components. They include:
- Hadoop Distributed File System (HDFS): The default storage layer in any given Hadoop cluster.
- Name Node: The node in a Hadoop cluster that tells clients where in the cluster particular data is stored and whether any nodes have failed.
- Secondary Node: A backup to the Name Node; it periodically replicates and stores data from the Name Node in case the Name Node fails.
- Job Tracker: The node in a Hadoop cluster that initiates and coordinates MapReduce jobs, i.e., the processing of the data.
- Slave Nodes: The grunts of any Hadoop cluster, slave nodes store data and take direction from the Job Tracker on how to process it.

In addition to the above, the Hadoop ecosystem is made up of a number of complementary sub-projects. NoSQL data stores such as Cassandra and HBase are also used to store the results of MapReduce jobs in Hadoop. In addition to Java, some MapReduce jobs and other Hadoop functions are written in Pig, an open source language designed specifically for Hadoop. Hive is an open source data warehouse originally developed by Facebook that allows for analytic modeling within Hadoop.

Following is a guide to Hadoop's components:
- Hadoop Distributed File System (HDFS): The storage layer of Hadoop, a distributed, scalable, Java-based file system adept at storing large volumes of unstructured data.
- MapReduce: A software framework that serves as the compute layer of Hadoop. MapReduce jobs are divided into two (obviously named) parts: the "Map" function divides a query into multiple parts and processes data at the node level, and the "Reduce" function aggregates the results of the "Map" function to determine the "answer" to the query.
- Hive: A Hadoop-based data warehouse developed by Facebook. It allows users to write queries in SQL, which are then converted to MapReduce. This lets SQL programmers with no MapReduce experience use the warehouse and makes it easier to integrate with business intelligence and visualization tools such as MicroStrategy, Tableau, Revolution Analytics, etc.
- Pig: Pig Latin is a Hadoop-based language developed by Yahoo. It is relatively easy to learn and is adept at very deep, very long data pipelines (a limitation of SQL).
- HBase: A non-relational database that allows for low-latency, quick lookups in Hadoop. It adds transactional capabilities to Hadoop, allowing users to conduct updates, inserts, and deletes. eBay and Facebook use HBase heavily.
- Flume: A framework for populating Hadoop with data. Agents are deployed throughout one's IT infrastructure – inside web servers, application servers, and mobile devices, for example – to collect data and integrate it into Hadoop.
- Oozie: A workflow processing system that lets users define a series of jobs written in multiple languages – such as MapReduce, Pig, and Hive – and then intelligently link them to one another. Oozie allows users to specify, for example, that a particular query is only to be initiated after the previous jobs on which it relies for data have completed.
- Whirr: A set of libraries that allows users to easily spin up Hadoop clusters on top of Amazon EC2, Rackspace, or any virtual infrastructure. It supports all major virtualized infrastructure vendors on the market.
- Avro: A data serialization system that allows for encoding the schema of Hadoop files. It is adept at parsing data and performing remote procedure calls.
- Mahout: A data mining library. It takes the most popular data mining algorithms for performing clustering, regression testing, and statistical modeling and implements them using the MapReduce model.
- Sqoop: A connectivity tool for moving data from non-Hadoop data stores – such as relational databases and data warehouses – into Hadoop. It allows users to specify the target location inside Hadoop and instruct Sqoop to move data from Oracle, Teradata, or other relational databases to that target.
- BigTop: An effort to create a more formal process and framework for packaging and interoperability testing of Hadoop's sub-projects and related components, with the goal of improving the Hadoop platform as a whole.
Direct attached storage: 5 cents/GB = $50/TB = $500 for 10 TB = $2,500 for 50 TB; x 3 Hadoop copies = $7,500 of storage in the Hadoop servers + $2,500 for the Hadoop server = $10,000.
External array: 26 cents/GB = $260/TB = $2,600 for 10 TB = $13,000 for 50 TB of storage in an external array, plus the cost of the SAN or NAS network and the cost of rebalancing the Hadoop cluster for equivalent performance.
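The comparison above can be reproduced with a few lines of arithmetic. This is only a back-of-the-envelope sketch using the slide's own assumptions (5 cents/GB for server-internal disk, 26 cents/GB for external array capacity, a $2,500 server, 50 TB of user data, 3-way Hadoop replication); it ignores the SAN/NAS network and rebalancing costs noted above.

```python
# Rough cost comparison from the slide: 50 TB of user data,
# 3-way Hadoop replication on local disk vs. a single copy on an external array.
USER_DATA_TB = 50
REPLICAS = 3
LOCAL_DISK_PER_TB = 50.0      # 5 cents/GB  = $50/TB (internal server disk)
ARRAY_DISK_PER_TB = 260.0     # 26 cents/GB = $260/TB (external array)
SERVER_COST = 2500.0          # the Hadoop server itself

local = USER_DATA_TB * REPLICAS * LOCAL_DISK_PER_TB + SERVER_COST
array = USER_DATA_TB * ARRAY_DISK_PER_TB   # + SAN/NAS network + rebalancing costs

print("Local (DAS) storage + server: $%.0f" % local)   # $10,000
print("External array storage only:  $%.0f" % array)   # $13,000
```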
There are two aspects of Hadoop that are important to understand. MapReduce is a software framework introduced by Google to support distributed computing on large data sets across clusters of computers. The Hadoop Distributed File System (HDFS) is where Hadoop stores its data. This file system spans all the nodes in a cluster; effectively, HDFS links together the data that resides on many local nodes, making the data part of one big file system. Furthermore, HDFS assumes nodes will fail, so it replicates a given chunk of data across multiple nodes to achieve reliability. The degree of replication can be customized by the Hadoop administrator or programmer; the default, however, is to replicate every chunk of data across three nodes: two on the same rack and one on a different rack. You can use other file systems with Hadoop (for example, GPFS), but HDFS is the most common. The key to understanding Hadoop lies in the MapReduce programming model. This is essentially a representation of the divide-and-conquer processing model: the input is split into many small pieces, the Hadoop nodes process those pieces in parallel (the map step), and once they have been processed, the results are distilled down to a single answer (the reduce step).
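The default placement described above (three copies of each chunk: two on one rack and one on another) can be illustrated with a small sketch. The cluster topology, node names, and selection logic below are hypothetical and simplified; this is not the actual HDFS placement code, which also considers the writer's node, available space, and load.

```python
import random

# Hypothetical cluster topology: rack name -> list of data nodes.
CLUSTER = {
    "rack1": ["node1", "node2", "node3"],
    "rack2": ["node4", "node5", "node6"],
}

def place_replicas(cluster, replication=3):
    """Mimic the default policy: two replicas on one rack, one on another."""
    assert replication == 3, "sketch only handles the default replication factor"
    first_rack, second_rack = random.sample(list(cluster), 2)
    two_local = random.sample(cluster[first_rack], 2)   # two copies share a rack
    one_remote = random.choice(cluster[second_rack])    # third copy on a different rack
    return two_local + [one_remote]

print(place_replicas(CLUSTER))   # e.g. ['node2', 'node1', 'node5']
```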
The filer is present just for completeness. It plays no part in the story we are telling about GPFS and GPFS-FPO as a replacement for HDFS. There is no need to talk to it, and if you wish to remove it from your personal copies of the deck, that is fine.
Policy-based ingest – move data into and out of the FPO pool using GPFS policies.
Summary: When you break it down even further, IBM has constructed a portfolio of software and solutions with the breadth and depth to meet the needs of all organizations today, combined with unique synergies across this portfolio. These synergies enable organizations to start with their most pressing needs, knowing that they will be able to leverage their skills and investment in future projects to reduce risk, lower costs, and achieve faster time to value in meeting the needs of the business. There are multiple "entry points", driven by your most pressing needs, that help you start moving down the path of an information-led transformation. (Note: describe this slide from the bottom up.)

When you think about an information-led transformation, you need to ensure that your infrastructure and systems are optimized to handle the various workloads demanded of them. Especially today, when you are faced with a glut of new information, you need to ensure that relevant information is available, that it is secure, and that you are able to retrieve it in a timely manner – not only for analytical, operational, and transactional systems, but also for regulatory compliance. That is why IBM Software Group and our Systems & Technology Group are working together to provide optimized solutions focused on delivering greater business value to our customers, faster, for increased return on investment. From the new IBM Smart Analytics System, to the new DB2 pureScale for continuous availability, unlimited capacity, and application transparency, to the deep integration of System z, IBM has unparalleled expertise in designing and implementing workload-optimized systems and services.

On top of that infrastructure, there is also the need to ensure that you can bring all of those sources of information together to create a single, trusted view of information from across your business – regardless of whether that information is structured or unstructured – and then manage it over time. From data warehousing, Master Data Management, and information integration to Agile ECM and integrated data management, IBM's InfoSphere portfolio ensures that organizations will be able to leverage their information over time to drive innovation across their business.

Armed with this single view of your business, you can then look to optimize business processes and drive greater performance across your organization. Decision makers will have the right information, at the right time, in the right context to make better, more informed decisions, and even to anticipate new opportunities or counter potential threats more effectively. The Business Analytics and Optimization platform supports an information-led transformation in that it focuses on establishing well-constructed processes and empowering individuals throughout the organization with pervasive, predictive, real-time analytics. With Cognos and the newly acquired SPSS portfolios, organizations can now be more proactive and predictive in innovating their business.
The IBM Big Data Platform extends the traditional warehouse in two ways:

Big Data in Motion – streaming data, such as securities data (stock tickers) or sensor data (temperature readings, heart rates, or the revolutions per second of a piece of machinery). This data can stream at a very high rate and vary greatly in its structure. Our product offering for Big Data in Motion is InfoSphere Streams, which is capable of performing analytics on the streaming data in real time.

Big Data at Rest – a set of data in static storage, for instance a large set of log files from a web site's click-stream analysis, or pools of raw text from service engagements with customers. Our product offering for Big Data at Rest is InfoSphere BigInsights, which is capable of performing analytics on this large set of varied data.

Streams and BigInsights interface with each other and can use existing data warehouses as data sources for their analytics; alternatively, the data warehouse can pull data from Streams and BigInsights.

Transcript: BIG Data's all about internet scale. In fact, I think these two ideas sort of end up being kind of synonymous. When everybody thinks of BIG Data, they think of internet data. They think of all the external data sources that can be pulled together. And we in fact are combining our capability in data warehousing together with this internet scale capability around Hadoop MapReduce. (Author's original notes: IBM IOD 2010, GS Day 2.)
IBM offers two basic models – the GSS24 and GSS26 – with 4 or 6 JBODs, respectively. These two basic configurations are scalable into larger storage solutions by using them as building blocks.
IBM InfoSphere BigInsights brings the power of Hadoop to the enterprise. Apache Hadoop is the open source software framework used to reliably manage large volumes of structured and unstructured data. BigInsights enhances this technology to withstand the demands of your enterprise, adding administrative, workflow, provisioning, and security features, along with best-in-class analytical capabilities from IBM Research. The result is a more developer- and user-friendly solution for complex, large-scale analytics. InfoSphere BigInsights allows enterprises of all sizes to cost-effectively manage and analyze the massive volume, variety, and velocity of data that consumers and businesses create every day.

InfoSphere Streams: Part of IBM's platform for big data, IBM InfoSphere Streams allows you to capture and act on all of your business data – all of the time, just in time. InfoSphere Streams radically extends the state of the art in big data processing; it is a high-performance computing platform that allows user-developed applications to rapidly ingest, analyze, and correlate information as it arrives from thousands of real-time sources. Users are able to:
- Continuously analyze massive volumes of data at rates up to petabytes per day.
- Perform complex analytics of heterogeneous data types including text, images, audio, voice, VoIP, video, police scanners, web traffic, email, GPS data, financial transaction data, satellite data, sensors, and any other type of digital information that is relevant to your business.
- Leverage sub-millisecond latencies to react to events and trends as they are unfolding, while it is still possible to improve business outcomes.
- Adapt to rapidly changing data forms and types.
- Seamlessly deploy applications on any size computer cluster.
- Meet current reaction-time and scalability requirements with the flexibility to evolve with future changes in data volumes and business rules.
- Quickly develop new applications that can be mapped to a variety of hardware configurations and adapted to shifting priorities.
- Provide security and information confidentiality for shared information.
Learn more about how InfoSphere Streams aligns with any industry.
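InfoSphere Streams applications are typically written in its own Streams Processing Language and run on a cluster; the sketch below is only a generic Python illustration of the underlying idea of acting on data as it arrives, keeping a small sliding-window aggregate in memory instead of landing the data first. The sensor readings, window size, and alert threshold are invented for the example.

```python
from collections import deque

class SlidingWindowAverage:
    """Keep a rolling average over the last N readings of a live stream."""
    def __init__(self, size):
        self.window = deque(maxlen=size)

    def add(self, value):
        self.window.append(value)
        return sum(self.window) / len(self.window)

# Simulated sensor feed: react as each reading arrives, without storing it first.
monitor = SlidingWindowAverage(size=5)
for reading in [70.1, 70.3, 71.0, 74.8, 79.5, 83.2]:
    avg = monitor.add(reading)
    if avg > 75.0:                       # illustrative alert threshold
        print("alert: rolling average %.1f exceeds threshold" % avg)
```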
Link to enter your email address and download a free copy of this book: https://www14.software.ibm.com/webapp/iwm/web/signup.do?source=sw-infomgt&S_PKG=500016891&S_CPM=is_bdebook1_biginsightsfp
Direct URL for the book (3.5 MB Acrobat Reader file): http://public.dhe.ibm.com/common/ssi/ecm/en/iml14297usen/IML14297USEN.PDF
InfoSphere BigInsights features Apache Hadoop as a core component. There are two releases of InfoSphere BigInsights: Basic and Enterprise.

Basic Edition – a free offering. It has open source components as well as IBM value-add (maintenance console, DB2 integration, integrated installation), and you can purchase support for it. This is an excellent choice for companies that want a Hadoop environment up and running or are conducting a POC – and it lays the foundation to turn that POC into a pilot or full enterprise deployment.

Enterprise Edition – adds significant value on the same base platform, so you can grow into it. There are two main value-adds: this offering hardens Hadoop, providing enterprise-quality stability, and it provides an analytics layer. Specifically, it includes a rock-solid file system alternative to what is included in open source Hadoop, text analytics, analytics visualization, security, an integrated web console, workflow and scheduling, indexing, and documentation.

Transcript: What we're doing is putting together a comprehensive solution around BIG Data, and if you're going to the sessions here you'll hear more about this in some of the breakouts, certainly in the expo demonstration capability. But we're looking at and starting to deliver scenarios that combine both non-traditional BIG Data types of information together with traditional data. And putting those together in creative ways to allow you to navigate and mine and understand the patterns across this very large corpus of information. And of course, as required, we're applying real-time stream processing into that flow because we see many of these scenarios demanding real-time analytics. (Author's original notes: IBM IOD 2010, GS Day 2.)