Data Lake Architecture – Modern Strategies & Approaches

  1. Data Lake Architecture: Modern Strategies & Approaches. Donna Burbank, Managing Director, Global Data Strategy, Ltd. August 23rd, 2018. Follow on Twitter: @donnaburbank. Twitter event hashtag: #DAStrategies
  2. Global Data Strategy, Ltd. 2018. Donna Burbank
     Donna is a recognised industry expert in information management with over 20 years of experience in data strategy, information management, data modeling, metadata management, and enterprise architecture. Her background is multi-faceted across consulting, product development, product management, brand strategy, marketing, and business leadership. She is currently the Managing Director at Global Data Strategy, Ltd., an international information management consulting company that specializes in the alignment of business drivers with data-centric technology. In past roles, she has served in key brand strategy and product management roles at CA Technologies and Embarcadero Technologies for several of the leading data management products in the market.
     As an active contributor to the data management community, she is a long-time DAMA International member, Past President and Advisor to the DAMA Rocky Mountain chapter, and was awarded the Excellence in Data Management Award from DAMA International in 2016. Donna is also an analyst at the Boulder BI Brain Trust (BBBT), where she provides advice and gains insight on the latest BI and Analytics software in the market. She was on several review committees for the Object Management Group's key information management and process modeling notations. She has worked with dozens of Fortune 500 companies worldwide in the Americas, Europe, Asia, and Africa and speaks regularly at industry conferences. She has co-authored two books, Data Modeling for the Business and Data Modeling Made Simple with ERwin Data Modeler, and is a regular contributor to industry publications. She can be reached at donna.burbank@globaldatastrategy.com. Donna is based in Boulder, Colorado, USA.
  3. DATAVERSITY Data Architecture Strategies: This Year's Line Up for 2018
     • January (on demand): Panel: Emerging Trends in Data Architecture – What's the Next Big Thing?
     • February (on demand): Building an Enterprise Data Strategy – Where to Start?
     • March (on demand): Modern Metadata Strategies
     • April (on demand): The Rise of the Graph Database
     • May (on demand): Data Architecture Best Practices for Today's Rapidly Changing Data Landscape
     • June (on demand): Artificial Intelligence: Real-World Applications for Your Organization
     • July (on demand): Data as a Profit Driver – Emerging Techniques to Monetize Data as a Strategic Asset
     • August: Data Lake Architecture – Modern Strategies & Approaches
     • Sept: Master Data Management: Practical Strategies for Integrating into Your Data Architecture
     • October: Business-Centric Data Modeling: Strategies for Maximizing Business Benefit
     • December: Panel: Self-Service Reporting and Data Prep – Benefits & Risks
  4. Today's Topic: Building a Successful Data Lake Architecture
     • Data Lake or Data Swamp? By now, we've likely all heard the comparison.
     • Data Lake architectures can integrate vast amounts of disparate data from across the organization for strategic business analytic value.
     • But without a proper architecture and metadata management strategy in place, a Data Lake can quickly devolve into a swamp of information that is difficult to understand.
     • This webinar will offer practical strategies to architect and manage your Data Lake in a way that optimizes its success.
  5. Data Lakes – the Opportunity: Opportunity and Complexity
     • Data Lakes provide a response to the opportunity & reality of today's data-focused world.
       - Consumer data provides a myriad of opportunities
       - IoT data for machine logs, sensors, etc.
       - And more…
     • aka “Big Data”
     Diagram labels: Purchasing Patterns; Photos & Video; Support Call Logs; Web Click Activity; Social Media Interactions; Consumer IoT data from wearable tech; Location data from phone; Etc…
  6. What is Big Data?
     • Big Data is often characterised by the “3 Vs”:
       - Volume: Is there a high volume of data? (e.g. terabytes per day)
       - Velocity: Is data generated or changed at a rapid pace? (e.g. per second, sub-second)
       - Variety: Is data stored across multiple formats? (e.g. machine data, media files, log files)
     • Understanding and managing these sources, and integrating them into the larger Business Intelligence ecosystem, makes it possible to gain valuable insights from data.
       - Social Media Sentiment Analysis – e.g. What are customers saying about our products?
       - Web Browsing Analytics – Customer usage patterns
       - Internet of Things (IoT) Analysis – e.g. Sensor data, Machine log data
       - Customer Support – e.g. Call log analysis
     • This ability leads to the “4th V” of Big Data: Value.
       - Value: Valuable insights gained from the ability to analyze and discover new patterns and trends from high-volume and/or cross-platform systems.
  7. The Business Need Spans Traditional and Modern Technology
     • The request: “Tell me what customers are saying about our product.”
     • Traditional Databases & DW (Sybase, SAP, DB2, Oracle, SQL Server, SQL Azure, Informix, Teradata)
       - DBA: “Which customer database do you want me to pull this from? We have 25.”
       - Data Architect: “And, by the way, the databases all store customer information in a different format. “CUST_NM” on DB2, “cust_last_nm” on Oracle, etc. It's a mess.”
     • Big Data & Data Lake
       - Data Scientist: “I'll need to input the raw data from thousands of sources, and write a program to parse and analyze the relevant information.”
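The Data Architect's complaint on this slide (the same customer attribute stored as “CUST_NM” on DB2 and “cust_last_nm” on Oracle) can be illustrated with a small renaming step. This is only a sketch: the two column names come from the slide, while the system keys, the canonical name `customer_last_name`, and the `harmonize` helper are assumptions made up for illustration.

```python
# Hypothetical mapping from (source system, source column) to one canonical name.
# "CUST_NM" and "cust_last_nm" are from the slide; everything else is assumed.
CANONICAL_COLUMNS = {
    ("db2", "CUST_NM"): "customer_last_name",
    ("oracle", "cust_last_nm"): "customer_last_name",
}


def harmonize(system: str, row: dict) -> dict:
    """Rename known source columns to canonical names; keep unknown columns as-is."""
    return {
        CANONICAL_COLUMNS.get((system, column), column): value
        for column, value in row.items()
    }


db2_row = harmonize("db2", {"CUST_NM": "Smith", "CUST_ID": 17})
oracle_row = harmonize("oracle", {"cust_last_nm": "Jones", "cust_id": 42})
```

In practice a mapping like this would live in shared metadata rather than in code, so that every team resolves the names the same way.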
  8. The 5th “V” - Veracity
     • Only through proper Governance, Data Quality Management, Metadata Management, etc., can organizations achieve the 5th “V”: Veracity.
       - Veracity: Trust in the accuracy, quality and content of the organization's information assets.
       - i.e. The hard work doesn't go away with Big Data
     • Data Science: Raw data used in Self-Service Analytics and BI environments is often so poor that many data scientists and BI professionals spend an estimated 50 – 90% of their time cleaning and reformatting data to make it fit for purpose. Source: DataCenterJournal.com
     • Data Lakes: The absence of commonly understood and shared metadata and data definitions is cited as one of the main impediments to the success of Data Lakes. Source: Radiant Advisors
     • Data Science: Correcting poor data quality is a Data Scientist's least favorite task, consuming on average 80% of their working day. Source: Forbes 2016
     • Digitization & Data Quality: 71% of interviewees expect digitization to grow their business. But 70% say the biggest barrier is finding the right data; 62% cite inconsistent data. Source: Stibo Systems
  9. Big Data a Growing Trend: Analysis & Discovery are Key Drivers
     • Over 70% of organizations are either using Big Data solutions, or planning to in the future.
     • Analysis & Discovery are leading trends including:
       - Data Science & Discovery
       - Reporting & Analytics
       - “Sandbox” Exploration
     Source: Trends in Data Architecture, 2017, DATAVERSITY, by Donna Burbank and Charles Roe
  10. Big Data Concerns
     • The Complexity of current Big Data solutions & the Skills Required to manage them were also common issues.
     • Security is a leading concern, and Data Governance was a top write-in response.
     Source: Trends in Data Architecture, 2017, DATAVERSITY, by Donna Burbank and Charles Roe
  11. Balance Opportunity & Risk: With the Opportunity of Big Data Comes Risk
     • With the opportunity from Big Data comes a myriad of risks and concerns such as scalability, security, etc.
     • These concerns can be addressed through a combination of Data Lake architecture and the supporting governance mechanisms.
     • Architecture
       - Scalability
       - Cost Considerations
       - Latency
       - Storage of Diverse Data Sources
     • Governance
       - Privacy
       - Security
       - Compliance
       - Collaboration between New Roles
  12. A Successful Data Strategy links Business Goals with Technology Solutions
     • Aligning Business Strategy and Data Strategy
       - “Top-Down” alignment with business priorities
       - “Bottom-Up” management & inventory of data sources
     • Managing the people, process, policies & culture around data
     • Coordinating & integrating disparate data sources
     • Leveraging & managing data for strategic advantage
     Copyright 2018 Global Data Strategy, Ltd
  13. Traditional Relational Technologies and “Big Data”: a Paradigm Shift
     • Traditional
       - Top-Down, Hierarchical
       - Design, then Implement
       - “Passive”, Push technology
       - “Manageable” volumes of information
       - “Stable” rate of change
       - Data Warehouse
       - Business Intelligence
     • Big Data
       - Distributed, Democratic
       - Discover and Analyze
       - Collaborative, Interactive
       - Massive volumes of information
       - Rapid and Exponential rate of growth
       - Data Lake
       - Statistical Analysis
  14. “Traditional” way of Looking at the World: Hierarchies
     • Carolus Linnaeus in 1735 established a hierarchy/taxonomy for organizing and identifying biological systems.
     • Levels, from broadest to narrowest: Kingdom, Phylum, Class, Order, Family, Genus, Species
  15. “New” Way of Looking at the World - Emergence
     • “In philosophy, systems theory, science, and art, emergence is the way complex systems and patterns arise out of a multiplicity of relatively simple interactions.” (Wikipedia)
     • Sample messages:
       - “I love my new Levis jeans.”
       - “Is Levi coming to my party?”
       - “Sale #LEVIS 20% at Macys.”
       - “LOL. TTYL. Leving soon.”
  16. Data Warehouse vs. Data Lake
     • Data Warehouse: A Data Warehouse is a storage repository that holds current and historical data used for creating analytical reports. Data structures & requirements are pre-defined, and data is organized & stored according to these definitions.
     • Data Lake: A Data Lake is a storage repository that holds a vast amount of raw data in its native format, including structured, semi-structured, and unstructured data. The data structure & requirements are not defined until the data is needed.
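The difference between the two definitions on this slide is often described as schema-on-write versus schema-on-read (those terms are not on the slide). A minimal sketch of the contrast follows; the field names, the sample records, and both helper functions are hypothetical and only meant to make the distinction concrete.

```python
import json

# Warehouse style: the structure is defined before loading, and records
# that do not match it are rejected at load time.
WAREHOUSE_SCHEMA = {"customer_id": int, "amount": float}


def load_into_warehouse(record: dict) -> dict:
    if set(record) != set(WAREHOUSE_SCHEMA):
        raise ValueError(f"fields do not match the schema: {sorted(record)}")
    return {name: cast(record[name]) for name, cast in WAREHOUSE_SCHEMA.items()}


# Lake style: raw lines are stored untouched in their native format, and
# each consumer imposes the structure it needs only when reading.
raw_lake = [
    '{"customer_id": "17", "amount": "9.50", "channel": "web"}',
    '{"customer_id": "42", "comment": "love the product"}',
]


def read_amounts(raw_lines):
    for line in raw_lines:
        record = json.loads(line)
        if "amount" in record:  # structure is applied at read time
            yield int(record["customer_id"]), float(record["amount"])


amounts = list(read_amounts(raw_lake))
```

The second raw record would be rejected by `load_into_warehouse`, yet it stays available in the lake for a different analysis, such as one that reads the `comment` field.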
  17. Combining DW & Big Data Provides Value
     • There are numerous ways to gain value from data
     • Relational Database and Data Warehouse systems are one key source of value
       - Customer information
       - Product information
     • Big Data can offer new insights from data
       - From new data sources (e.g. social media, IoT)
       - By correlating multiple new and existing data sources (e.g. network patterns & customer data)
     • Integrating DW and Big Data can provide valuable new insights. Examples include:
       - Customer Experience Optimization
       - Churn Management
       - Products & Services Innovation
     Diagram labels: Data Warehouse; Data Lake; New Insights
  18. Data Lake Adoption is Varied
     • Most are using a Data Lake along with a Data Warehouse
     • Many are not currently using a Data Lake
     Source: Trends in Data Architecture, 2017, DATAVERSITY, by Donna Burbank and Charles Roe
  19. Poll: Are you currently implementing a Data Lake?
     • YES
     • NO
  20. Integrating the Data Lake & Traditional Data Sources
     • The Data Lake has a different architecture & purpose than traditional data sources such as data warehouses.
     • But the two environments can co-exist to share relevant information.
     Diagram labels: Data Analysis & Discovery – Data Lake; Enterprise Systems of Record; Data Governance & Collaboration; Master & Reference Data; Data Warehouse; Data Marts; Operational Data; Security & Privacy; Sandbox; Lightly Modeled Data; Data Exploration; Reporting & Analytics; Advanced Analytics; Self-Service BI; Standard BI Reports
  21. The Data Ecosystem
     • Know what to manage closely and what to leave alone
     • The more the data is shared across & beyond the organization, the more formal governance needs to be
     • Master & Reference Data
       - Common data elements used by multiple stakeholders across functional areas, applications, etc.
       - Highly governed
       - Highly published & shared
       - Examples: Reference Data (Department Codes, Country Codes, etc.); Master Data (Customer, Product, Student, Supplier, etc.)
     • Core Enterprise Data
       - Common data elements used by multiple stakeholders, departments, etc. (e.g. DW)
       - Highly governed
       - Highly published & shared
       - Examples: Common Financial Metrics (for Financial & Regulatory Reporting); Common Attributes (core attributes reused across multiple areas)
     • Functional & Operational Data
       - Lightly modeled & prepared data for limited sharing & reuse
       - Collaboration-based governance
       - May be future candidates for core data
       - Examples: Operational Reporting; Non-productionized analytical model data; Ad hoc reporting & discovery
     • Exploratory Data
       - Raw or lightly prepped data for exploratory analysis
       - Mainly ad hoc, one-off analysis
       - Light touch governance
       - Examples: Raw data sets for exploratory analytics; External & Open data sources
     • Publish: Exploratory analysis uses core data sets when applicable
     • Promote: Derived variables of value can be fed into Core Enterprise, or even Master Data.
  22. Governance Requires Interaction Between Roles
     • Role groups: Data Warehouse-centric roles; Data Lake-centric roles; Cross-cutting Governance & Architecture Roles
     • Roles shown: Data Scientist; “Citizen Data Scientist”; Data Architect; BI Reporting Analyst; ETL Developer; Data Steward; DW Developer; Data Lake Platform Administrator; Data Governance Manager
     • Alignment is needed between the Data Warehouse-centric and Data Lake-centric roles.
  23. Metadata Repository vs. Data Catalogue: Different Data Sources Require Different Ways of Working
     • The collaboration paradigm of the data lake can require a different way of managing metadata
     • Encyclopedia – Metadata Repository (for Standardized, Enterprise Data Sets; Data Warehouse)
       - Created by a few, then published as read-only
       - Single source of “vetted” truth
       - Slowly-changing
     • Wikipedia – Data Catalogue (for Data Exploration, Self Service; Data Lake)
       - Created by many, edited by many
       - Eventual consistency with multiple inputs
       - Dynamic
  24. Collaboration, Governance & Metadata: Data Lakes require new ways of collaborating
     • Data tiers: Core Enterprise Data; Functional & Operational Data; Exploratory Data; Reference & Master Data
     • Metadata Repository, Stricter Governance
       - Glossary: Strictly vetted
       - Data Dictionary: approved sources
       - Data Lineage: detailed source/target mapping at field level
       - Audit Trails
       - PII mapping and audit
       - Data classification
     • Data Catalogue – Collaborative Governance
       - Glossary: Crowdsourced & open
       - Data Dictionary: exploratory sources
       - Data Lineage: high-level data flow and lineage between source and target
       - Usage ranking
       - Usefulness ranking and “likes”
       - Tagging
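The collaborative catalogue features listed on this slide (tagging and usage ranking) can be pictured with a very small in-memory model. This is a sketch under stated assumptions, not a description of any real catalogue product: the `DataCatalogue` class, its methods, and the data set names are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class CatalogueEntry:
    name: str
    tags: set = field(default_factory=set)


class DataCatalogue:
    """Toy catalogue: anyone can register and tag data sets; usage is counted."""

    def __init__(self):
        self.entries = {}
        self.usage = Counter()

    def register(self, name, *tags):
        self.entries[name] = CatalogueEntry(name, set(tags))

    def record_access(self, name):
        self.usage[name] += 1

    def most_used(self, n=3):
        # Usage ranking: the data sets accessed most often come first.
        return [name for name, _ in self.usage.most_common(n)]

    def with_tag(self, tag):
        return sorted(e.name for e in self.entries.values() if tag in e.tags)


catalogue = DataCatalogue()
catalogue.register("clickstream_raw", "web", "exploratory")
catalogue.register("support_calls", "customer", "exploratory")
for _ in range(3):
    catalogue.record_access("clickstream_raw")
catalogue.record_access("support_calls")
```

Here `catalogue.most_used(1)` returns the most frequently accessed data set and `catalogue.with_tag("exploratory")` lists everything carrying that tag.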
  25. Data Catalogue: Harnessing “Tribal Knowledge”
     • Usage Ranking. Which:
       - Queries are others using?
       - Tables are accessed the most?
       - Glossary terms are most often searched?
       - Etc.
     • Helpfulness Ranking. Which:
       - Definitions are most complete & helpful?
       - Algorithms offer a helpful starting point?
       - Queries offer great logic to share?
       - Etc.
     • Collaboration & Crowdsourcing
       - Term: Part Number
       - Alternate Names: Component Number
       - Definition: A part number is an 8 digit alphanumeric field that uniquely identifies a machine part used in the manufacturing process.
       - Comment: “Is this truly the same as the old Component Number? That was a 10 digit numeric field. It didn't have letters.”
       - Reply: “Yes, it is. I had the same problem for the finance app, and I wrote a quick program to convert the numbers. We just strip off the first two chars now. Click here to find it.”
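The “quick program” mentioned in the catalogue reply is not shown on the slide. The following is a guess at what such a conversion might look like, based only on the reply's description (a 10 digit numeric Component Number with the first two characters stripped off); the function name and the validation are assumptions.

```python
def component_to_part_number(component_number: str) -> str:
    """Turn a legacy 10-digit Component Number into an 8-character Part Number
    by dropping the first two characters, as the catalogue reply describes."""
    if len(component_number) != 10 or not component_number.isdigit():
        raise ValueError("expected a 10-digit numeric Component Number")
    return component_number[2:]


part_number = component_to_part_number("0012345678")
```

Sharing a snippet like this through the catalogue is the kind of tribal knowledge the slide is about: the next team with the same question finds the answer next to the glossary term.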
  26. Avoiding Silos
     • Don't create Data Lily Pads, i.e. disparate Data Lakes not connected with a wider Data Strategy.
     • Often, teams create their own “stealth” data lakes in order to solve an immediate, tactical problem.
     • This approach loses the value of cross-functional data sharing.
     • Cost issues and redundancy are also a concern.
  27. Considerations & Risks to Avoid
     • The World of Data Lakes brings with it new risks and concerns
     • Platform
       - On Prem
       - Cloud
       - Provider selection
     • Skills
       - Outsourced
       - In House
       - Training Requirements
     • Cost
       - Is Cloud the right model for our scalable usage?
       - Are we shutting off sandboxes when we're done?
     • Data Lifecycle
       - What can be cold storage vs hot storage?
       - When can data be deleted?
       - How do we move from Exploration to Enterprise?
     • Data Security
       - Who has access?
       - How is PII managed?
     • Data Governance
       - Is there common semantic meaning?
       - How do teams work together – operating model?
       - Policies & Procedures
       - Who is spinning up a sandbox & why?
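The Data Lifecycle questions on this slide (hot vs cold storage, and when data can be deleted) usually end up as an explicit policy. A minimal sketch of such a rule follows. The slide gives no numbers, so both thresholds, the tier names, and the `storage_tier` helper are purely hypothetical; real retention periods depend on regulation and business need.

```python
from datetime import date, timedelta

# Hypothetical thresholds, chosen only for illustration.
COLD_AFTER = timedelta(days=90)
DELETE_AFTER = timedelta(days=365 * 7)


def storage_tier(last_accessed: date, today: date) -> str:
    """Classify a data set by how long ago it was last accessed."""
    age = today - last_accessed
    if age >= DELETE_AFTER:
        return "delete-candidate"
    if age >= COLD_AFTER:
        return "cold"
    return "hot"


today = date(2018, 8, 23)
tiers = [
    storage_tier(date(2018, 8, 1), today),
    storage_tier(date(2018, 1, 1), today),
    storage_tier(date(2010, 1, 1), today),
]
```

Writing the rule down, even this crudely, forces the governance conversation about who owns the thresholds and who reviews the delete candidates.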
  28. Summary
     • Data Lakes can provide significant opportunity for an organization to gain value from cross-functional, disparate data sources.
     • Data Warehouses and Data Lakes work well together for a comprehensive enterprise view.
     • Data Governance is critical for the success of data lakes:
       - Collaboration and sharing of information
       - Access control and security
       - Lifecycle and promotion to enterprise data assets
       - Operating model and ways of working between roles and departments
  29. DATAVERSITY Data Architecture Strategies: This Year's Line Up for 2018 – Join Us Next Month
     • January (on demand): Panel: Emerging Trends in Data Architecture – What's the Next Big Thing?
     • February (on demand): Building an Enterprise Data Strategy – Where to Start?
     • March (on demand): Modern Metadata Strategies
     • April (on demand): The Rise of the Graph Database: Practical Use Cases & Approaches to Benefit your Business
     • May (on demand): Data Architecture Best Practices for Today's Rapidly Changing Data Landscape
     • June (on demand): Artificial Intelligence: Real-World Applications for Your Organization
     • July (on demand): Data as a Profit Driver – Emerging Techniques to Monetize Data as a Strategic Asset
     • August (soon on demand): Data Lake Architecture – Modern Strategies & Approaches
     • Sept: Master Data Management: Practical Strategies for Integrating into Your Data Architecture
     • October: Business-Centric Data Modeling: Strategies for Maximizing Business Benefit
     • December: Panel: Self-Service Reporting and Data Prep – Benefits & Risks
  30. White Paper: Trends in Data Architecture (Free Download)
     • Download from www.globaldatastrategy.com
     • Under ‘Resources/Whitepapers’
  31. Questions? Thoughts? Ideas?