Skillwise Big Data part 2

  2. IBM Big Data Platform Overview
  3. Big Data is a Hot Topic Because Technology Makes it Possible to Analyze ALL Available Data. Cost-effectively manage and analyze all available data in its native form: unstructured, structured, streaming. Sources: ERP, CRM, RFID, website, network switches, social media, billing.
  4. BIG DATA is not just HADOOP. Manage & store huge volume of any data: Hadoop File System, MapReduce. Manage streaming data: Stream Computing. Analyze unstructured data: Text Analytics Engine. Structure and control data: Data Warehousing. Integrate and govern all data sources: Integration, Data Quality, Security, Lifecycle Management, MDM. Understand and navigate federated big data sources: Federated Discovery and Navigation.
  5. Business-Centric Big Data Enables You to Start With a Critical Business Pain and Expand the Foundation for Future Requirements • “Big data” isn’t just a technology; it’s a business strategy for capitalizing on information resources • Getting started is crucial • Success at each entry point is accelerated by products within the Big Data platform • Build the foundation for future requirements by expanding further into the big data platform
  6. 1 – Unlock Big Data. Customer need: • Understand existing data sources • Search and navigate data within existing systems • No copying of data. Value statement: • Get up and running quickly • Discover and retrieve big data • Work even with big data sources, by business users. Solution: Vivisimo Velocity, renamed to IBM InfoSphere DataDiscovery
  7. 2 – Analyze Raw Data. Customer need: • Ingest data as-is into Hadoop • Combine it with data from the DWH • Process very large volumes of data. Value statement: • Gain new insight • Overcome the high cost of converting data from unstructured to structured format • Experiment with analysis on different data and combine it with other sources. Solution: IBM InfoSphere BigInsights
  8. Merging the Traditional and Big Data Approaches. Traditional Approach (Structured & Repeatable Analysis): business users determine what question to ask; IT structures the data to answer that question. Examples: monthly sales reports, profitability analysis, customer surveys. Big Data Approach (Iterative & Exploratory Analysis): IT delivers a platform to enable creative discovery; business explores what questions could be asked. Examples: brand sentiment, product strategy, maximum asset utilization.
  9. InfoSphere BigInsights is more than just HADOOP • IBM InfoSphere BigInsights is much more than Hadoop • The IBM Big Data platform includes much more than IBM InfoSphere BigInsights
  10. Hadoop • Open-source software framework from Apache • Inspired by Google MapReduce and GFS (Google File System) • Components: HDFS and Map/Reduce
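The Map/Reduce model named on this slide can be sketched in a few lines. The following is a single-process illustration of the map, shuffle, and reduce phases using word counting; it does not use Hadoop or HDFS, and all function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical names chosen for this sketch.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group emitted values by key (a real framework does this between phases)."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over in-memory lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "big insights"]))
```

In a real cluster the map and reduce functions run in parallel on many nodes and the input comes from a distributed file system; only the division of work into the three phases carries over from this sketch.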
  11. InfoSphere BigInsights: Platform for volume, variety, velocity • Enhanced Hadoop foundation. Analytics: • Text analytics & tooling • Application accelerators. Usability: • Web console • Spreadsheet-style tool • Ready-made “apps”. Enterprise Class: • Storage, security, cluster management. Integration: • Connectivity to Netezza, DB2, JDBC databases, etc. Editions (diagram, breadth of capabilities vs. enterprise class): Apache Hadoop; Basic Edition (free download, integrated install, Online InfoCenter, BigData Univ.); Enterprise Edition (licensed): application accelerators, pre-built applications, text analytics, spreadsheet-style tool, RDBMS and warehouse connectivity, administrative tools and security, Eclipse development tools, performance enhancements, . . . . Can run also on top of
  12. Spreadsheet-style Analysis: web-based analysis and visualization with a spreadsheet-like interface • Define and manage long-running data collection jobs • Analyze the text content of the pages that have been retrieved
  13. Build a Big Data Program – MapReduce example. Eclipse tools for Jaql, Hive, Pig, Java MapReduce, BigSheets plug-ins, text analytics, etc.
  14. Jaql – IBM’s programming language in the Hadoop world • Jaql is a complete solutions environment supporting all other BigInsights components • Integration point for various analytics: text analytics, statistical analysis, machine learning, ad-hoc analysis • Integration point for various data sources: local and distributed file systems, NoSQL databases, content repositories, relational sources (warehouses, operational databases). Diagram: BigInsights components connected through Jaql: Text Analytics, Statistical Analysis (R module), Machine Learning (SystemML), Ad-Hoc Analysis (BigSheets), Integration (DB2, Netezza, Streams, …). Jaql layers: Jaql Modules, Jaql Core Operators, Jaql I/O. Storage: DFS, NoSQL, RDBMS, File System.
  15. BigInsights and the data warehouse. Diagram labels: BigInsights; Data warehouse; Filter, Transform, Aggregate; Big Data analytic applications; Traditional analytic tools.
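The Filter, Transform, Aggregate steps on this slide can be illustrated with a small in-memory example. This is only a sketch of the idea of reducing raw events to warehouse-ready totals; the record layout (`customer`, `amount`) and the function name are assumptions made for illustration and are not part of BigInsights.

```python
from collections import defaultdict
from typing import Dict, Iterable


def filter_transform_aggregate(raw_events: Iterable[dict]) -> Dict[str, float]:
    """Reduce raw events to one total per customer, ready to load into a warehouse table."""
    totals: Dict[str, float] = defaultdict(float)
    for event in raw_events:
        # Filter: skip events that lack the fields we need.
        if "customer" not in event or event.get("amount") is None:
            continue
        # Transform: normalize the key and convert the amount to a number.
        customer = str(event["customer"]).strip().lower()
        amount = float(event["amount"])
        # Aggregate: accumulate one running total per customer.
        totals[customer] += amount
    return dict(totals)


if __name__ == "__main__":
    events = [
        {"customer": " Alice ", "amount": "10.5"},
        {"customer": "alice", "amount": 4.5},
        {"customer": "bob"},  # dropped by the filter step
        {"customer": "Bob", "amount": 1},
    ]
    print(filter_transform_aggregate(events))
```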
  16. 3 – Simplify your warehouse. Customer need (SIGNIFICANTLY): • Improve DWH performance • Reduce DWH administration costs. Value statement: • Speed: 10–100x better performance • Simplicity: administration costs reduced by 75%–90% • Scalability • Smart system • In-database analytics • Out-of-the-box integration with SPSS. Solution: IBM Netezza, renamed to PureData System for Analytics
  17. Analyst: “I need to evaluate the possible relationship between client salary and overdrafts.” IT: “OK. We have to evaluate a lot of statistics and set the correct DB indexes and DB partitioning. It will take us 5 days.”
  18. IT: “Done. You can run your analytical query.” Analyst: “Great. Thanks a lot. I’m going to check the results.”
  19. Analyst: “Great. I can see some nice correlations here. Now I need to look at it from a different perspective.” IT: “Ohhh, welcome, dear friend. Understood. So, it’s … another 5 days of our work.” Analyst: “Noooo!!! It’s not possible to work here!”
  20. And now with Netezza ...
  21. Analyst: “I need to evaluate the possible relationship between client salary and overdrafts. I will use Netezza.”
  22. Analyst: “Great. I can see some nice correlations here. Now I need to look at it from a different perspective. With Netezza I can run the query immediately. The response time will be the same.” IT can do something else, something much more useful.
  23. Built-In Expertise Makes This as Simple as an Appliance • Dedicated device • Optimized for purpose • Complete solution • Fast installation • Very easy operation • Standard interfaces • Low cost
  24. IBM Netezza was renamed to IBM PureData System for Analytics in October 2012
  25. Netezza Genesis in T-Mobile CZ: Proof-of-Concept Project – New Enterprise Data Warehouse platform selection – Comparison of existing and other platforms – Selection criteria • Performance • Operational savings … and the winner was: Netezza
  26. Netezza Genesis in T-Mobile CZ: Expectations. Significant response improvement: a faster platform means better report response times. Direct data availability: higher trust in data (one version of truth), aggregation reduction, any attribute available. Operational benefits: storage savings (no data replicas), administration cost reduction (DBA). Infrastructure simplification: lower environment complexity.
  27. Netezza Genesis in T-Mobile CZ: Project Implementation – EDW platform migration • Netezza platform implementation • ETL graphs/processes redesign – BI front-end tool migration • SAP BusinessObjects implementation • All reports redesigned. Main integration partner: T-Systems CZ
  28. Netezza Genesis in T-Mobile CZ: Current Status • All relevant ETL processing redesigned • Parallel run on the original and Netezza platforms finished • Netezza is now the only primary platform
  29. Real Netezza experience from T-Mobile Czech Rep.: RESPONSE TIME MASSIVELY IMPROVED (original platform vs. Netezza). Workflow reporting: 2 hours vs. 1 minute. Invoicing and payments reporting: payment discipline of current month invoices, 33 minutes vs. 17 seconds; overdue debt of invoices in current month, 10 hours vs. 23 seconds; average monthly invoice figures, 50 minutes vs. 38 seconds.
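To put the timings on this slide into perspective, the speedup factor for each report can be computed as original time divided by Netezza time. The sketch below only restates the slide's figures in seconds; the dictionary and function names are hypothetical.

```python
from typing import Dict, Tuple

# (original platform, Netezza) response times in seconds, taken from the slide.
TIMINGS_SECONDS: Dict[str, Tuple[int, int]] = {
    "Workflow reporting": (2 * 3600, 60),
    "Payment discipline of current month invoices": (33 * 60, 17),
    "Overdue debt of invoices in current month": (10 * 3600, 23),
    "Average monthly invoice figures": (50 * 60, 38),
}


def speedup(original_seconds: float, netezza_seconds: float) -> float:
    """Return how many times faster the Netezza run was."""
    return original_seconds / netezza_seconds


def all_speedups() -> Dict[str, float]:
    """Speedup factor per report, rounded to one decimal place."""
    return {
        report: round(speedup(original, netezza), 1)
        for report, (original, netezza) in TIMINGS_SECONDS.items()
    }


if __name__ == "__main__":
    for report, factor in all_speedups().items():
        print(f"{report}: {factor}x faster")
```

With these figures the factors range from roughly 79x (average monthly invoice figures) to roughly 1,565x (overdue debt of invoices).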
  30. 4 – Reduce costs with Hadoop. Customer need (SIGNIFICANTLY): • Too much data => too expensive to store and maintain • A big portion is kept “just in case” • Data volume is still growing => it gets even more expensive • => Too expensive to keep all data in a standard DWH. Value statement: • Leverage Hadoop’s parallel processing architecture • Hadoop uses cheap commodity HW • Enable business users to keep working in the same or a similar way. Solution: IBM InfoSphere BigInsights
  31. BigInsights and the data warehouse • BigInsights: query-ready archive for “cold” warehouse data • Access from Cognos BI via Hive JDBC. Diagram labels: Data Warehouse, Big Data analytic applications, Traditional analytic tools.
  32. Future: The SQL interface • Rich SQL query capabilities – SQL '92 and 2011 features – Correlated subqueries – Windowed aggregates • SQL access to all data stored in InfoSphere BigInsights • Robust JDBC/ODBC support • Take advantage of key features of each data source • Leverage MapReduce parallelism OR achieve low latency. Diagram: Application, SQL language, JDBC / ODBC driver, JDBC / ODBC server, SQL interface engine (InfoSphere BigInsights), data sources: Hive tables, HBase tables, CSV files.
  33. 5 – Analyze Streaming Data. Customer need: • Process and leverage streaming data • Select valuable data from the data stream for future processing • Quickly process data that becomes useless if it is not processed immediately. Value statement: • React in real time to take an opportunity before it expires • Periodically adjust streaming models based on analysis of data at rest. Solution: IBM InfoSphere Streams. Diagram: Streaming data sources, Streams Computing, ACTION.
  34. Why and when to use InfoSphere Streams? For applications that need on-the-fly processing, filtering and analysis of streaming data. At least 2 criteria from the list below should be fulfilled. Sensors: • Environmental, industrial, GPS, … • Images, videos, … Data exhaust: • Network data • System logs (web server, app server), … High-rate transaction data: • Financial transactions • CDRs. Isolation: • Processing in isolation • … or in limited windows (time / number of records). Non-traditional formats included: • Spatial data, images, text, voice, … Integration challenges: • Different connection methods • Different data rates • Different processing requirements. Multiple processing nodes: • Volume / rate very high => scalability required. Sub-millisecond latency: • Immediate analysis and response. Store & mine approach doesn’t work: • Because of the very high volume of data (and its rates)
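The "limited windows (time / number of records)" criterion can be illustrated with a count-based sliding window. The sketch below is plain Python, not InfoSphere Streams code; it keeps only the last `window_size` readings in memory, which is the point of processing data in motion instead of storing everything first. The function names and the threshold-based alert are assumptions made for illustration.

```python
from collections import deque
from typing import Deque, Iterable, Iterator


def sliding_average(readings: Iterable[float], window_size: int) -> Iterator[float]:
    """Yield the average of the most recent `window_size` readings after each new reading."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    window: Deque[float] = deque(maxlen=window_size)
    total = 0.0
    for value in readings:
        if len(window) == window_size:
            # The oldest reading is about to be evicted; remove it from the running total.
            total -= window[0]
        window.append(value)
        total += value
        yield total / len(window)


def alert_positions(readings: Iterable[float], window_size: int, threshold: float) -> Iterator[int]:
    """Yield the index of every reading at which the windowed average exceeds the threshold."""
    for index, average in enumerate(sliding_average(readings, window_size)):
        if average > threshold:
            yield index


if __name__ == "__main__":
    sensor_values = [1.0, 2.0, 3.0, 4.0]
    print(list(sliding_average(sensor_values, window_size=2)))
    print(list(alert_positions(sensor_values, window_size=2, threshold=2.0)))
```

Because both functions are generators, they accept an unbounded input iterator and never hold more than one window of data, in contrast to a store-and-mine approach.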
  35. Streams and BigInsights – Integrated Analytics on Data in Motion & Data at Rest. 1. Data Ingest (InfoSphere Streams): data ingest, preparation, online analysis, model validation. 2. Bootstrap/Enrich (InfoSphere BigInsights, Database & Warehouse): data integration, data mining, machine learning, statistical modeling. 3. Adaptive Analytics Model. Visualization of real-time and historical insights. Diagram flows: data, control flow.
  36. The Platform Advantage. BENEFITS IN DETAIL: • Increase over time: by moving from an entry project to a 2nd and 3rd project • Lowering deployment costs: shared components, integration • Points of leverage: shared text analytics for Streams and BigInsights; HDFS connectors (data integration (ETL, …), Streams); accelerators; build across multiple engines. Diagram: Analytic Applications (BI / Reporting, Exploration / Visualization, Functional App, Industry App, Predictive Analytics, Content Analytics) on top of the IBM Big Data Platform (Visualization & Discovery, Application Development, Systems Management, Accelerators, Hadoop System, Stream Computing, Data Warehouse, Information Integration & Governance).