Informatica & Big Data - Leveraging Hadoop for Enterprise Analytics
1. Informatica & Big Data. Sanjeev Kumar, VP & MD, Informatica India. Apache Hadoop India Summit 2011.
2. Agenda: Big Data; Big Data in Enterprise; Informatica & Data; Informatica & Big Data
3.
4. Digital universe grew by 62% last year to 800K petabytes and will grow to 1.2 “zettabytes” this year. Source: An IDC White Paper - sponsored by EMC. As the Economy Contracts, the Digital Universe Expands. May 2009.
5. Why Now? Exploding Data Volumes. Explosion in user-generated content (e.g. blogs, Twitter, Facebook). Proliferation of web-connected devices (smartphone interactions with the web). Increased consumption of digital content (Netflix, Hulu, Pandora etc.). Internet of things (smart-grid and smart-meters; machine-generated data via the web).
6. Why Now? New Apps/Use-cases. Analyze customer/market sentiment (text analytics on social media, blogs). Achieve operational efficiency (e.g. analyze CDRs to optimize cell tower placements). Make recommendations (data mining on click-stream, purchase history). Predict the future (e.g. Flightcast predicts flight delays).
7. Big Data Challenges. Storage: cost-effective scalability to multi-terabytes and petabytes; non-traditional data models (complex, semi-structured data). Processing: data mining and collaborative filtering for structured data; text analytics, classification etc. for unstructured data. Regulatory compliance: data privacy / masking; data archival.
8. Addressing Big Data Challenges. Storage: parallel databases (Greenplum (EMC), Vertica, AsterData); distributed key/value stores (HBase, Google’s BigTable, Amazon’s SimpleDB); distributed file systems (HDFS, GFS, ParAccel). Analytics: SQL with extensions; MapReduce; dataflow languages (Pig, Sawzall etc.).
12. Big Data in the Enterprise. Case Studies: Hadoop World 2009. Yahoo!: Social Graph Analysis. VISA: Large Scale Transaction Analysis. China Mobile: Data Mining Platform for Telecom Industry. JP Morgan Chase: Data Processing for Financial Services. eHarmony: Matchmaking in the Hadoop Cloud. Rackspace: Cross Data Center Log Processing. Visible Technologies: Real-Time Business Intelligence. Booz Allen Hamilton: Protein Alignment using Hadoop. Slides and videos at http://www.cloudera.com/hadoop-world-nyc
13. Big Data in the Enterprise. Case Studies: Hadoop World 2010. eBay: Hadoop at eBay. Twitter: The Hadoop Ecosystem at Twitter. General Electric: Sentiment Analysis powered by Hadoop. Yale University: MapReduce and Parallel Database Systems. AOL: AOL’s Data Layer. Facebook: HBase in Production. Bank of America: The Business of Big Data. StumbleUpon: Mixing Real-Time and Batch Processing. Raytheon: SHARD: Storing and Querying Large-Scale Data. More info at http://www.cloudera.com/company/press-center/hadoop-world-nyc/
14. Agenda: Big Data; Big Data in Enterprise; Informatica & Data; Informatica & Big Data
15. Informatica – Our Singular Mission: Enabling the Information Economy. We enable organizations to gain a competitive advantage from all their information assets to drive their top business imperatives.
16. Informatica – What We Do: Comprehensive, Unified, Open and Economical platform. Data sources: Application; Partner Data (SWIFT, NACHA, HIPAA, …); Cloud Computing; Unstructured; Database. Capabilities: Complex Event Processing; Data Warehouse; Data Migration; Test Data Management & Archiving; Master Data Management; Data Synchronization; B2B Data Exchange; Data Consolidation; UltraMessaging.
17. Informatica & Data. Verbs on Data – We do things to data! INFA = Data + [ Archival | As a Service | Cleansing | Clustering | Consolidation | Conversion | De-duping | Exchange | Extraction | Federation | Hub | Identity | Integration | Life-cycle Management | Loading | Masking | Mastering | Matching | Migration | On Demand | Privacy | Profiling | Provisioning | Quality | Quality Assessment | Registry | Replication | Retirement | Services | Stewardship | Sub-setting | Synchronization | Test Management | Transformation | Validation | Virtualization | Warehousing ]
18. Informatica & Big Data. HDFS as a source and a target: enable universal data connectivity for Hadoop developers. Enable Hadoop developers to leverage prebuilt Data Transformation and Data Quality logic. Lower the barrier to Hadoop entry by using Informatica Developer as a development tool. Support virtualized access to data split across HDFS and (relational) data warehouses.
19. Informatica & Hadoop – Big Picture. [Architecture diagram.] Informatica components: enterprise connectivity for Hadoop programs; graphical IDE for Hadoop development; transformation engine for custom data processing; metadata repository. Connected data: weblogs, databases, BI, DW/DM, enterprise applications, semi-structured and unstructured data. Hadoop cluster: Job Tracker, HDFS Name Node, Data Node, HDFS.
Editor's notes
Map/Reduce implementation. Apache open source project: Yahoo dominated. Two major components: HDFS (failure-resilient distributed file system) and Map/Reduce (failure-resilient distributed computing framework). Scales to thousand+ node clusters. Used by Yahoo, Facebook etc.
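To make the Map/Reduce half of that note concrete, here is a minimal single-process sketch of the programming model in Python. It is an illustration only, not Hadoop's API: the names run_map_reduce, map_words and reduce_counts are hypothetical, and the word-count task is an assumed example that does not come from the slides. A real cluster would distribute the map and reduce calls across nodes and read its input from HDFS.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Callable, Iterable, Iterator


def map_words(line: str) -> Iterator[tuple[str, int]]:
    """Map step (hypothetical example): emit (word, 1) for each word in a line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word: str, counts: list[int]) -> tuple[str, int]:
    """Reduce step (hypothetical example): sum all counts seen for one word."""
    return word, sum(counts)


def run_map_reduce(
    records: Iterable[str],
    mapper: Callable[[str], Iterable[tuple[str, int]]],
    reducer: Callable[[str, list[int]], tuple[str, int]],
) -> dict[str, int]:
    """Run map, shuffle (group by key), and reduce in a single process."""
    groups: dict[str, list[int]] = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # shuffle: collect values per key
    # Reduce each key's values; keys are sorted so the result order is stable.
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


if __name__ == "__main__":
    lines = ["big data big clusters", "data in hdfs"]
    print(run_map_reduce(lines, map_words, reduce_counts))
```

Because the mapper handles one record at a time and the reducer handles one key at a time, the work can be split across many machines, which is the property the note's "scales to thousand+ node clusters" refers to.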