Situation: Over the past few years we have seen a major transformation in next-gen analytics.
- Big Data is a major focus of your business and application teams
- 2013 EMC World: we announced Isilon HDFS support and the launch of Pivotal
- 2014 EMC World: ViPR HDFS, Pivotal momentum, industry investment in Cloudera and MongoDB
- Today, 1/3 of Amazon's sales come from personalization and recommendation systems
- Not just big companies like Amazon and eBay, but even your local grocery
- Weekly sales circular
- Now: enter the store and you get targeted marketing messages with discounts for you
- Stores are collecting data on every shopper using loyalty programs and Wi-Fi
- Every industry: healthcare, insurance, financial
Problem: Confusion. What do I need to do? Almost 40% EBC
Stake: To determine the best IT infrastructure for these services, you need to understand the key enablers.
<next>
The first evolution is data volume: EMC Digital Universe, 7th update
- Digital data stored will double every two years for the next decade
- Data growth from emerging markets is exploding
- 60% of data is generated in mature markets: US, Japan, Germany
- By 2016, 60% of data will be generated in emerging economies such as Brazil, China, and Mexico
The second evolution is the impact of the Internet of Things, or the Industrial Internet. Data collection is accelerating:
- 14 billion internet-connected devices today, 2% of all data
- 32 billion internet-connected devices in 2020, 10% of all data
- GE wind turbine example: 20,000 sensors, 400 updates per second
The third is analysis of unstructured data, including images, video, and audio. We are no longer just analyzing neat tables of data organized in columns and rows.
- The NYC exploding manhole cover
- Messy data: records from 1880
- 51,000 manholes
- Enough cable to wrap the earth 2.5 times
- 44% reduction in disasters
The fourth is new tools optimized to analyze large, complete data sets that are often dense with frequently collected data from sensors and devices, and that include unstructured data such as images, video, and audio.
- These tools are inexpensive and leverage open source
- Easy to deploy: local grocer
The combination of collecting and storing more data cost-effectively with new tools is creating a perfect storm for Big Data.
<click>
My current storage architecture can't meet all these requirements. What should a storage architect do?
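A quick back-of-the-envelope check on the growth claim above, as a sketch only: if stored data doubles every two years, a decade is five doublings. The function name and the 1 PB starting point are illustrative assumptions, not figures from the Digital Universe study.

```python
def projected_capacity(start, years, doubling_period_years=2.0):
    """Capacity after `years` if data doubles every `doubling_period_years`."""
    return start * 2 ** (years / doubling_period_years)

# Hypothetical starting point: 1 PB stored today.
growth_factor = projected_capacity(1.0, 10)
print(growth_factor)  # 32.0, i.e. 32x the starting capacity after a decade
```

The takeaway for the storage architect: whatever is deployed today has to grow by more than an order of magnitude without a forklift upgrade.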
The first step is that you need a content repository, or Data Lake. Most of the new analytics tools, such as Hadoop, rely on HDFS and its API. Several great attributes of HDFS:
- Scales from terabytes to petabytes easily
- Optimized for big-block I/O: 64 MB block size
- Supports structured and unstructured data
- Open source, low cost, hardware independent
Let's look at this simple HDFS block diagram.
- Highly distributed processing
But what about those wind turbines streaming 400 data points a second?
Customers are combining in-memory database technologies such as GemfireXD and Impala
<click>
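To make the big-block point above concrete, here is a minimal sketch of how many blocks a file occupies at the 64 MB block size quoted on the slide. The helper name is hypothetical, and real clusters may be configured with a different block size.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the block size quoted above


def block_count(file_size_bytes, block_size=BLOCK_SIZE):
    """Number of blocks needed to hold a file; the last block may be partial."""
    if file_size_bytes <= 0:
        return 0
    # Ceiling division using integers only.
    return -(-file_size_bytes // block_size)


one_gib = 1024 ** 3
print(block_count(one_gib))  # 16
```

Large files map onto a small number of large blocks, which is why HDFS suits sequential, high-throughput reads better than many small random reads.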
An in-memory data grid (IMDG) provides the fast ingest and query performance. IMDG technologies such as GemfireXD will write copies of their data to HDFS for persistence and deeper analytics.
IMDG + HDFS supports storage and analysis of large data sets, streaming ingest, and analysis of structured and unstructured data. Tools like Pivotal HAWQ allow you to access data in both the IMDG and the HDFS Data Lake.
What are the storage requirements for a DL?
- Cost optimized: We recommend HDFS and IMDG to manage storage costs at scale with a hot-edge, cold-core architecture. Minimize $/GB. Data will double every two years.
- No silos: The content repository/DL must be accessible by all protocols. Write with one protocol, read with any. Ready for the next big thing.
- Scalability: From terabytes to hundreds of petabytes. Non-disruptive capacity growth. No-downtime migrations.
Piece of cake? How many storage solutions can do this today?
No one storage platform provides all this. EMC believes in building blocks and options. There are four common DL storage options today. Each has its pluses and minuses.
First one: Hadoop HDFS on server storage
- Most start here
- Experience issues with scale
- Poor capacity utilization
Disadvantages:
- Low efficiency
- Hardware support at scale
- Limited to Hadoop distro
- Hadoop silo
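One likely contributor to the low efficiency noted above: stock HDFS protects data by keeping full copies of every block, three by default (the `dfs.replication` setting). A rough sketch of what that means for raw capacity follows; the function names and the 100 TB figure are illustrative assumptions.

```python
def usable_fraction(replication_factor=3):
    """Fraction of raw disk that holds unique data under N-way replication."""
    return 1.0 / replication_factor


def raw_capacity_needed(usable_tb, replication_factor=3):
    """Raw TB to purchase in order to store `usable_tb` of unique data."""
    return usable_tb * replication_factor


# Hypothetical example: 100 TB of unique data at the default replication of 3.
print(raw_capacity_needed(100))  # 300
```

This ignores temporary job space and operating system overhead, so real-world utilization on server storage is typically lower still.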
Advantages:
- Access data already stored
- Leverage existing investment
- Enterprise reliability, security, and availability
** EMC Hadoop Starter Kit
<<talk about EMC Elastic Cloud Storage (ECS)
Common concerns:
- Limited high-performance options
- Storage hardware lock-in
- HDFS compatibility with Hadoop distros
ViPR architecture
Hadoop Starter Kit, ViPR edition
Lots to like:
- Leverage existing investment
- Centralized management/provisioning
- Leverage reliability, security, and availability of storage HW
- Flexibility of Data Services
Common concerns:
- New, with HDFS data services GA in Feb 2014
- HDFS compatibility with Hadoop distros (HCFS)
Mature: Greenplum DCA and VCE Vblock for Big Data
- Large enterprise and SP customers
- Fast to deploy, predictable performance
Common concerns:
- Hardware vendor lock-in
- Inflexible modules
- Slower innovation
These four options all have strengths and weaknesses. The most mature for grown-up HDFS is our Isilon solution, with many happy customers. The most compelling is our storage software virtualization solution, ViPR, but it is new and building traction. With the 2.0 release it is gaining many of the features customers need now. Things like additional protocol support are on the roadmap for the next 12 months.
Do you want to see this in action? I'd like to introduce Jim Ruddy, Lead EMC OCTO Big Data Architect, to demo a Data Lake in action with the Pivotal Analytics suite from a recent customer deployment. Jim, what are you going to show us?
Demo: Retail Use Case
1) Data enters through adapters. These adapters can receive data from multiple sources like Twitter, POS, manufacturing devices, or sources on "the Internet of Things".
2) The adapter is written in SpringXD, which can be a single node or scale to multiple nodes.
3) The first analytics of the data is done at this level. Where does the data go? Does it need instant analysis, or does it need to be compared to a history of transactions?
4) There are two ways data can be written at this point. It can be written directly from the adapter to GemfireXD for in-memory analytics, or a tap can be done, where data is written to GemfireXD and HDFS at the same time. The adapter can also decide whether some data goes to GemfireXD and some goes to HDFS, and how to make this determination. This would be the first level of analytics.
5) Once data is in GemfireXD, it is stored in in-memory tables, or you can persist very large tables to local disk store files or to HDFS. How long and where the data is kept is variable and can be tuned per table. The Pivotal Extension Framework (PXF) allows HAWQ to query data in memory. As the data is persisted to HDFS, HAWQ can query the data there as well.
6) GemfireXD is built as a cluster. There is one locator server and one or more data servers that host data. These servers keep the tables in memory, have local storage to persist data, can write and read data to/from HDFS, and can run YARN/MapReduce jobs.
7) Once data is persisted to HDFS, YARN (MapReduce version 2) can run batch jobs against the data.
8) Every node in Pivotal Hadoop that is a YARN node manager is also a HAWQ segment. This is how HAWQ accesses data in HDFS.
9) Once data is persisted to HDFS, HAWQ and Hadoop can do historical analysis.
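The routing decision in steps 3 and 4 can be sketched in a few lines. This is not the SpringXD or GemfireXD API: the sink names, the rule, and the record fields are all hypothetical, and plain Python lists stand in for the in-memory grid and the Data Lake.

```python
IN_MEMORY = "in_memory"  # stands in for GemfireXD
DATA_LAKE = "data_lake"  # stands in for HDFS


def example_rule(record):
    """Hypothetical rule: decide where each record goes based on its source."""
    source = record.get("source")
    if source == "pos":
        # Tap: needs instant analysis and comparison to transaction history.
        return (IN_MEMORY, DATA_LAKE)
    if source == "sensor":
        # Kept for historical analysis only.
        return (DATA_LAKE,)
    # Everything else is analyzed in memory only.
    return (IN_MEMORY,)


def route(records, rule):
    """Deliver each record to every sink chosen by `rule`."""
    sinks = {IN_MEMORY: [], DATA_LAKE: []}
    for record in records:
        for target in rule(record):
            sinks[target].append(record)
    return sinks


incoming = [
    {"source": "pos", "id": 1},
    {"source": "sensor", "id": 2},
    {"source": "twitter", "id": 3},
]
sinks = route(incoming, example_rule)
print(len(sinks[IN_MEMORY]), len(sinks[DATA_LAKE]))  # 2 2
```

Keeping the rule as a separate function mirrors the idea that the adapter itself performs the first level of analytics.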
Awesome, Jim. As you can see, Next Gen Analytics is very powerful for your business. It is the top priority for many of our customers' application teams.
EMC is uniquely qualified as the industry leader in data storage, with a 30+ year history of innovation helping our customers and the industry through these evolutions. We have also learned a great deal from our experience with Pivotal.
We believe the key is to architect your content repository using a combination of storage technologies optimized for both $/GB and performance, to support the new analytics tools. These tools require access via a variety of protocols, including legacy file, SQL, and new storage protocols such as Object and HDFS.
In closing, EMC provides highly scalable and cost-efficient storage solutions that are part of our building block approach. We have proven solutions to help you deploy a DL that scales effortlessly and cost-effectively across geos. Thank you.
We have time for some questions… Jim, Dan, please join me.