This document discusses IBM's vision for combining Hadoop and data warehousing (DW) platforms into a unified "Hadoop DW". It describes how big data is driving new use cases that require analyzing diverse data types at extreme scales. Hadoop provides a massively parallel processing framework for advanced analytics on polystructured data, while DW focuses on structured data. The emergence of Hadoop DW will provide a single platform for all data types and workloads through tight integration of Hadoop and DW capabilities.
Now if you recall, I talked about the EDW not going away and the Big Data system working with it. Just a couple of slides ago I talked about the IBM Big Data platform and I included commentary about IBM Information Server for integration and that’s what this slide is showing here. We know that we are now faced with two complementary analytical approaches – we have this traditional approach, we have this new approach – and when we bring these together, we need some help to figure out a way to get from the left sphere to the right sphere and that’s going to be enterprise integration. So IBM provides that; for example IIS has readers for HDFS and natively within DB2 is a UDF that can call a MapReduce program, and more. If you look at this slide, you can see that if you live in the SQL world, you can talk to the Big Data world, and vice versa.
Key Points
- IBM Research developed a sophisticated text analytics engine, similar technology to what was demonstrated in Watson.
- Its purpose is to identify meaning within text.
- We have pre-built hundreds of rules (annotators) that understand textual meaning: names (e.g., what is a first name vs. a last name) and addresses (what is a street, an apartment), among others.
- The annotators are context sensitive and discover the relationship between terms even if they are separated by text. For example, the engine discovers that Iker Casillas is a “keeper” even though the phrase “for Spain” sits between them.
- Accuracy: our text analytics engine is very accurate, and our testing indicates it is 2-3x more accurate than some alternatives.
- Performance: it is also highly performant, designed for use in Big Data and MapReduce parallel processing.
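To make the idea of context-sensitive annotators concrete, here is a minimal Python sketch. It is not IBM's text analytics engine or its rule language. It is a toy illustration in which the dictionaries, the sample sentence, and all names (`Span`, `annotate_roles`, `annotate_persons`, `link_person_roles`, `max_gap`) are assumptions made up for this example. It shows two simple annotators (one for roles, one for person names) and a relation rule that links a person to a role when the two are separated by only a few intervening tokens, such as "for Spain".

```python
import re
from dataclasses import dataclass

# Hypothetical dictionaries; a real engine would ship far richer rules.
ROLES = {"keeper", "striker", "defender"}
FIRST_NAMES = {"Iker", "Xavi", "Sergio"}


@dataclass(frozen=True)
class Span:
    label: str
    text: str
    start: int  # index of the first token
    end: int    # index one past the last token


def tokenize(text):
    """Split text into alphabetic tokens (a deliberate simplification)."""
    return re.findall(r"[A-Za-z]+", text)


def annotate_roles(tokens):
    """Dictionary annotator: mark every token that is a known role."""
    return [Span("Role", t.lower(), i, i + 1)
            for i, t in enumerate(tokens) if t.lower() in ROLES]


def annotate_persons(tokens):
    """Mark a known first name followed by a capitalized token."""
    spans = []
    for i in range(len(tokens) - 1):
        if tokens[i] in FIRST_NAMES and tokens[i + 1][0].isupper():
            spans.append(Span("Person", f"{tokens[i]} {tokens[i + 1]}", i, i + 2))
    return spans


def link_person_roles(text, max_gap=3):
    """Pair each person with each role at most max_gap tokens away."""
    tokens = tokenize(text)
    pairs = []
    for person in annotate_persons(tokens):
        for role in annotate_roles(tokens):
            if role.start >= person.end:
                gap = role.start - person.end   # role follows the person
            else:
                gap = person.start - role.end   # role precedes the person
            if 0 <= gap <= max_gap:
                pairs.append((person.text, role.text))
    return pairs


# Assumed sample sentence; "for Spain" sits between the role and the name.
sentence = "The keeper for Spain, Iker Casillas, made a late save."
print(link_person_roles(sentence))
```

Because each annotator processes one document independently of the others, this style of rule can run inside the map step of a parallel job, which is one plausible reading of why such engines suit MapReduce-style processing.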
Key Points
- Integrate v3: the point is to have one platform to manage all of the data. There’s no point in having separate silos of data, each creating separate silos of insight. From the customer POV (a solution POV), big data has to be bigger than just one technology.
- Analyze v3: a very important point. We see big data as a viable place to analyze and store data. New technology is not just a pre-processor to get data into a structured DW for analysis. This is a significant area of value add by IBM, and the game has changed: unlike DBs/SQL, the market is asking who gets the better answer, and therefore the sophistication and accuracy of the analytics matter.
- Visualization: need to bring big data to the users; the spreadsheet metaphor is the key to doing so.
- Development: need sophisticated development tools for the engines and across them to enable the market to develop analytic applications.
- Workload optimization: improvements upon open source for efficient processing and storage.
- Security and Governance: many are rushing into big data like the wild west. But there is sensitive data that needs to be protected and retention policies that need to be determined. All of the maturity of governance for the structured world can benefit the big data world.