
datavault2.pptx

  1. DATA VAULT 2.0: Big Data Meets Data Warehousing DEAN HALLMAN WIRESOFT, LLC
  2. DATA WAREHOUSING VS BIG DATA • Does Big Data replace Data Warehousing, or do I need both? • What’s the difference: • Between the data flowing into a data warehouse vs big data tools? • Between the ingestion processes and infrastructure? • Data Lakes arrived with Big Data, so are they useful in Data Warehousing? • How should I model my data in the EDW? • 3NF, Star Schema, same as my operational data stores? • Data Vault 2.0 • Graph Databases • What architecture allows both to coexist effectively?
  3. Architecture diagram. Labels: Clients; Internet; clickstream (SerDe); Impressions (Big Data); Core Business Services; Operational Data Stores; External Data Sources; CDC, snapshot; DATA LAKE; Data Source Landing; Big Data Toolchain; Batch (SerDe); Streaming (Kafka); Streaming Analytics; Batch Analytics (Hadoop); Schema-on-Read; Schema-on-Write; ETL; ELT; Enterprise Data Warehouse; Staging Vault; Raw Vault; Business Vault; Information Mart; BI Tools; Monitoring; Discovery; Audit.
  4. THE DATA MODEL (architecture diagram from slide 3, repeated)
  5. DATA VAULT 2.0 COMMON FOUNDATIONAL WAREHOUSE ARCHITECTURE • “The Data Vault Model is a detail oriented, historical tracking and uniquely linked set of normalized tables that support one or more functional areas of business. It is a hybrid approach encompassing the best of breed between 3rd normal form (3NF) and star schema. The design is flexible, scalable, consistent and adaptable to the needs of the enterprise” -- Dan Linstedt, Creator of Data Vault • Data loaded as-is from sources, no edits or cleanup • Append-only to afford highest performance • Agile & agnostic to changes in the operational store’s data model • Essentially, a prescription for Layered Graph to Relational Mapping
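
A minimal sketch of the "loaded as-is, append-only" idea from slide 5, assuming SQLite as the store. The table names (`hub_aircraft`, `sat_aircraft_base`), the column names, and the helper functions are hypothetical, and the MD5 hash key follows common Data Vault 2.0 practice rather than anything stated on the slide. The aircraft values echo the sample data shown on slide 13.

```python
import hashlib
import sqlite3


def hash_key(*business_keys: str) -> str:
    """Derive a deterministic key from one or more business keys (assumed convention)."""
    normalized = "|".join(k.strip().upper() for k in business_keys)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def create_vault(conn: sqlite3.Connection) -> None:
    """Create one hub and one satellite; both carry load date and record source."""
    conn.executescript(
        """
        CREATE TABLE hub_aircraft (
            aircraft_hk   TEXT PRIMARY KEY,
            tail_no       TEXT NOT NULL,
            load_date     TEXT NOT NULL,
            record_source TEXT NOT NULL
        );
        CREATE TABLE sat_aircraft_base (
            aircraft_hk   TEXT NOT NULL REFERENCES hub_aircraft(aircraft_hk),
            load_date     TEXT NOT NULL,
            record_source TEXT NOT NULL,
            model         TEXT,
            PRIMARY KEY (aircraft_hk, load_date)
        );
        """
    )


def load_aircraft(conn, tail_no, model, load_date, record_source) -> str:
    """Append source data unchanged: no UPDATE, no DELETE, no cleanup."""
    hk = hash_key(tail_no)
    # The hub row is written once per business key; later loads leave it alone.
    conn.execute(
        "INSERT OR IGNORE INTO hub_aircraft VALUES (?, ?, ?, ?)",
        (hk, tail_no, load_date, record_source),
    )
    # Every load adds a new satellite row, so history is preserved.
    conn.execute(
        "INSERT INTO sat_aircraft_base VALUES (?, ?, ?, ?)",
        (hk, load_date, record_source, model),
    )
    return hk


def current_model(conn, tail_no):
    """Read the most recent satellite row for a business key."""
    row = conn.execute(
        "SELECT model FROM sat_aircraft_base "
        "WHERE aircraft_hk = ? ORDER BY load_date DESC LIMIT 1",
        (hash_key(tail_no),),
    ).fetchone()
    return row[0] if row else None


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_vault(conn)
    load_aircraft(conn, "1477", "767", "2017-02-11", "United")
    print(current_model(conn, "1477"))
```

Because nothing is updated in place, the "current" state is a query over history rather than a stored value, which is what makes the model act as a historical record.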
  6. DATA WAREHOUSING & DATA VAULT 2.0 • 60’s, 70’s, 80’s: • E.F. Codd => 3NF • Bill Inmon invents Data Warehousing concept • Dr. Ralph Kimball popularizes Star Schema design • 90’s, 00’s: • Dan Linstedt creates Data Vault Model @ DOD • 2014: • Dan Linstedt introduces Data Vault 2.0
  7. Source: “What are Graph Databases and Why should I care?“, by Dave Bechberger of Expero
  8. SOLVE BY STAR SCHEMA ?
  9. RELATIONAL VS GRAPH DATABASES • Enterprise Grade • Well-worn path • SQL has been relatively stagnant vs programming languages
  10. GRAPH DATA MODEL Source: https://neo4j.com/developer/graph-database/
  11. GRAPH DATABASE VS DATA VAULT
  12. GRAPH DATABASE VS DATA VAULT
  13. Diagram (legend: Hubs, Links, Satellites, Tab):
      Flight: Record Source: Airport CAE; Load Date: 2018-11-17; Source Id: 20181117-32-983. Tabs: Base, Dest, Forecast.
        Record Source | LoadDate   | Depart | Gate
        LGA           | 2018-10-11 | 1:25PM | B27
        CAE           | 2018-10-24 | 3:30PM | A14
        SFO           | 2018-09-06 | 8:55PM | G19
        RDU           | 2018-08-12 | 4:45PM | C22
      Aircraft: Record Source: United Airlines; Load Date: 2018-01-17; Source Id: 2412c. Tabs: Base, Service, FAA, NTSB.
        Record Source | LoadDate   | Model | Tailno
        United        | 2017-02-11 | 767   | 1477
        Delta         | 2015-11-04 | A6    | 2381
        Alaska        | 2013-08-28 | 747   | 8312
        Frontier      | 2016-07-19 | 182   | 1438
      SERVICED_BY: Record Source: United Airlines; Load Date: 2018-09-17. Tabs: Base, Dest, Manifest.
        Record Source | LoadDate   | Begin      | End
        United        | 2017-02-11 | 2017-04-23 | 2017-09-23
        Delta         | 2015-11-04 | 2015-12-01 | 2017-04-22
        Alaska        | 2013-08-28 | 2013-09-14 | 2016-05-04
        Frontier      | 2016-07-19 | 2016-08-02 | 2018-04-11
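
One way to read slides 11 to 13 is as a mapping from a property graph onto Data Vault structures: nodes become hub rows, edges become link rows, and properties become satellite rows. The sketch below illustrates that reading only. The function name `graph_to_vault`, the dictionary layout, and the direction of the `SERVICED_BY` edge are assumptions, not taken from the deck.

```python
def graph_to_vault(nodes, edges, load_date, record_source):
    """Map a small property graph onto hub, link, and satellite rows.

    nodes: dicts with "label", "key" (business key), and optional "props".
    edges: dicts with "type", "from", "to" ((label, key) pairs), optional "props".
    Every emitted row is stamped with the load date and record source.
    """
    meta = {"load_date": load_date, "record_source": record_source}
    hubs, links, satellites = [], [], []

    for node in nodes:
        # A node is identified by its business key, so it becomes a hub row.
        hubs.append({"hub": node["label"], "business_key": node["key"], **meta})
        if node.get("props"):
            # Descriptive properties hang off the hub as a satellite row.
            satellites.append(
                {"parent": (node["label"], node["key"]),
                 "attrs": dict(node["props"]), **meta}
            )

    for edge in edges:
        # An edge relates two business keys, so it becomes a link row.
        links.append(
            {"link": edge["type"], "from": edge["from"], "to": edge["to"], **meta}
        )
        if edge.get("props"):
            # Properties on the relationship become a satellite on the link.
            satellites.append(
                {"parent": (edge["type"], edge["from"], edge["to"]),
                 "attrs": dict(edge["props"]), **meta}
            )

    return hubs, links, satellites


if __name__ == "__main__":
    flight = ("Flight", "20181117-32-983")
    aircraft = ("Aircraft", "2412c")
    hubs, links, sats = graph_to_vault(
        nodes=[
            {"label": flight[0], "key": flight[1], "props": {"gate": "B27"}},
            {"label": aircraft[0], "key": aircraft[1], "props": {"model": "767"}},
        ],
        edges=[{"type": "SERVICED_BY", "from": flight, "to": aircraft}],
        load_date="2018-11-17",
        record_source="Airport CAE",
    )
    print(len(hubs), len(links), len(sats))
```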
  14. • Organizations which design systems ... are constrained to produce designs which are copies of the communication structures of these organizations - Mel Conway
  15. Diagram (legend: Hubs, Links, Satellites, Tab):
      FLIGHT: Record Source: Airport CAE; Load Date: 2018-11-17; Source Id: 20181117-32-983. Tabs: Base, Dest, Forecast.
        Record Source | LoadDate   | Depart | Gate
        LGA           | 2018-10-11 | 1:25PM | B27
        CAE           | 2018-10-24 | 3:30PM | A14
      Aircraft: Record Source: United Airlines; Load Date: 2018-01-17; Source Id: 2412c. Tabs: Base, Service, FAA, NTSB.
        Record Source | LoadDate   | Model | Tailno
        United        | 2017-02-11 | 767   | 1477
        Delta         | 2015-11-04 | A6    | 2381
        Alaska        | 2013-08-28 | 747   | 8312
        Frontier      | 2016-07-19 | 182   | 1438
      Airport: Record Source: United Airlines; Load Date: 2018-09-17. Tabs: Base, Dest, Manifest.
        Record Source | LoadDate   | Begin      | End
        United        | 2017-02-11 | 2017-04-23 | 2017-09-23
        Delta         | 2015-11-04 | 2015-12-01 | 2017-04-22
        Alaska        | 2013-08-28 | 2013-09-14 | 2016-05-04
        Frontier      | 2016-07-19 | 2016-08-02 | 2018-04-11
      Airline: Record Source: United Airlines; Load Date: 2018-01-17; Source Id: 2412c. Tabs: Base, Service, FAA, NTSB.
        Record Source | LoadDate   | Model | Tailno
        United        | 2017-02-11 | 767   | 1477
        Delta         | 2015-11-04 | A6    | 2381
  16. Source: https://www.wherescape.com/solutions/project-types/data-vault-automation/
  17. • Modeled after self-organizing networks • A Business Key identifies a key concept in the business. • Business keys have a business meaning • They are unique and have a very low propensity to change • Business keys change only when the business changes • Enables (forces) cross-source modeling Source: http://www.di.univr.it/documenti/OccorrenzaIns/matdid/matdid232240.pdf
  18. Source: http://www.di.univr.it/documenti/OccorrenzaIns/matdid/matdid232240.pdf
  19. Source: http://www.di.univr.it/documenti/OccorrenzaIns/matdid/matdid232240.pdf
  20. DATA VAULT 2.0 MODELING: HUBS, LINKS & SATELLITES
  21. @wiresoft/Pathfinder
  22. THE DATA: Impressions vs Business Data (architecture diagram from slide 3, repeated)
  23. ENTERPRISE DATA SILOS Small Data Large Data Big Data Describes the user base Describes the Enterprise Describes the Product
  24. DATA GRANULARITY FUNNEL. Diagram labels: Instance Grain; Transaction Grain; Audit Grain; Impression Grain; Big Data; Enterprise Data Warehouse; Operational Data Stores; Impression Analytics; Business Analytics; External Data Sources.
  25. DATA INGESTION: ETL vs ELT vs SerDe (architecture diagram from slide 3, repeated)
  26. ETL VS ELT VS SerDe • “Beware of the Turing tar-pit in which everything is possible but nothing of interest is easy.” - Alan Perlis
  27. DATA CLASSIFICATION MATRIX: DECLARATIVE VS INTERPRETIVE. Matrix labels: Declarative, Interpretive, Hadoop, RDBMS, Web Events, Media Player.
  28. DATA WAREHOUSING • Deep Topic • 60’s, 70’s, 80’s: • E.F. Codd => 3NF • Bill Inmon invents Data Warehousing concept • Dr. Ralph Kimball popularizes Star Schema design • 90’s, 00’s: • Dan Linstedt creates Data Vault Model @ DOD • 2014: • Dan Linstedt introduces Data Vault 2.0 • Data Warehouse vs Operational Data Stores • Data Warehouse as Version Control System BIG DATA • MapReduce, 2004, Google: Jeffrey Dean and Sanjay Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters”; GFS • Nutch 2005, Hadoop 2006, 2007 - Doug Cutting • What exactly is “Big Data”?
  29. UNSTRUCTURED USER EXPERIENCE (lossy). Diagram elements: Client, User, Interpreter, Analysis; labels: L, L n, L i.
  30. STRUCTURED USER EXPERIENCE (lossless). Diagram elements: Client, User, Time Series Event Record, Analysis; labels: L p, L p, L e.
  31. ETL OR SERDE? Diagram elements: Client, User, Serializer, Internet, Kafka/Kinesis, S3, Hadoop, Deserializer, Time Series Event Record, Analysis; files: Eventlog.e, Eventlog.d; note: Single Source (Version Locked); labels: L p, L e, L d, L m.
  32. ETL (Schema On Write) vs ELT (SerDe) (Schema On Read). Source: https://www.ironsidegroup.com/2015/03/01/etl-vs-elt-whats-the-big-difference/
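
The schema-on-write versus schema-on-read contrast on slide 32 can be illustrated with a small sketch. It assumes JSON as the serialized form and uses in-memory lists as stand-ins for the warehouse and the data lake. The field names and the functions `write_with_schema`, `write_raw`, and `read_with_schema` are hypothetical.

```python
import json

# Assumed target schema for the schema-on-write path.
SCHEMA = {"user_id": int, "event": str, "ts": str}


def write_with_schema(store, record):
    """Schema-on-write: validate and coerce before storing; reject bad records."""
    row = {}
    for field, field_type in SCHEMA.items():
        if field not in record:
            raise ValueError(f"missing field: {field}")
        row[field] = field_type(record[field])
    store.append(row)


def write_raw(store, payload):
    """Schema-on-read, write side: land the serialized payload untouched."""
    store.append(payload)


def read_with_schema(store, fields):
    """Schema-on-read, read side: deserialize and project the requested fields.

    Fields absent from a payload come back as None instead of failing the load.
    """
    for payload in store:
        doc = json.loads(payload)
        yield {field: doc.get(field) for field in fields}


if __name__ == "__main__":
    warehouse, lake = [], []
    write_with_schema(
        warehouse, {"user_id": "7", "event": "play", "ts": "2018-11-17T13:25:00"}
    )
    write_raw(lake, '{"user_id": 7, "event": "play", "volume": 3}')
    print(warehouse)
    print(list(read_with_schema(lake, ["user_id", "volume", "ts"])))
```

The write-side check pays the modeling cost up front and keeps bad rows out; the read-side projection defers that cost and tolerates fields the writer never promised.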
  33. OTHER CHALLENGES • Satellites must be loaded chronologically • Time-based scheduling vs data-availability scheduling
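
To see why satellite load order matters, consider a loader that appends a row only when the attributes differ from the latest stored row for the same key. The sketch below assumes that delta-detection rule, ISO-formatted load dates (so string comparison matches chronological order), and a history list that is already in chronological order. `SatRow` and `load_satellite` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SatRow:
    hub_key: str
    load_date: str  # ISO date, so lexical order equals chronological order
    attrs: tuple    # (name, value) pairs describing the key at that time


def load_satellite(history, batch):
    """Append changed rows from a batch, oldest first.

    Raises ValueError if a row is not newer than the latest stored row for
    its key, since comparing against the wrong predecessor corrupts history.
    """
    latest = {}
    for row in history:  # history is assumed to be in chronological order
        latest[row.hub_key] = row

    for row in sorted(batch, key=lambda r: r.load_date):
        prev = latest.get(row.hub_key)
        if prev is not None and row.load_date <= prev.load_date:
            raise ValueError(
                f"out-of-order load for {row.hub_key}: {row.load_date}"
            )
        if prev is None or prev.attrs != row.attrs:
            history.append(row)  # append-only: a change adds a row
            latest[row.hub_key] = row
    return history


if __name__ == "__main__":
    history = []
    batch = [
        SatRow("flight-1", "2018-10-24", (("gate", "A14"),)),
        SatRow("flight-1", "2018-10-11", (("gate", "B27"),)),
    ]
    load_satellite(history, batch)
    print([(r.load_date, r.attrs) for r in history])
```

Sorting within a batch is easy; the harder case is a late-arriving batch, which this sketch rejects rather than repairs. That is one reason scheduling by data availability, not only by clock time, comes up as a challenge.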
  34. QUESTIONS? • Contact: Dean Hallman • rdhallman@gmail.com • LinkedIn: https://www.linkedin.com/in/dean-hallman/