Building the Data Warehouse
        by Inmon



Chapter 5: The Data Warehouse and Technology




                                http://it-slideshares.blogspot.com/
5.0 Overview
The data warehouse requires a simpler set of
 technological features than its operational
 predecessors:
 ◦ Online updating: not needed.
 ◦ Locking, integrity: needs are minimal.
 ◦ Teleprocessing interface: only a very basic
   one is required.
This chapter outlines some of the
 technological requirements for the data
 warehouse.
MANAGING LARGE AMOUNTS OF DATA

 1.   Manage volumes
 2.   Manage multiple media technologies
 3.   Index and monitor data
 4.   Interface to retrieve and pass data
Managing Multiple Media
Following is a hierarchy of storage of data in terms
  of speed of access and cost of storage:
     Medium             Speed of access   Cost of storage
     Main memory        Very fast         Very expensive
     Expanded memory    Very fast         Expensive
     Cache              Very fast         Expensive
     DASD               Fast              Moderate
     Magnetic tape      Not fast          Not expensive
     Near line          Not fast*         Not expensive
     Optical disk       Not slow          Not expensive
     Fiche              Slow              Cheap
*Not fast to find first record sought; very fast to find all other records in the block.
Indexing and Monitoring Data
Monitoring data warehouse data
 determines such factors as the following:
 ◦ If a reorganization needs to be done
 ◦ If an index is poorly structured
 ◦ If too much or not enough data is in overflow
 ◦ The statistical composition of the access of
   the data
 ◦ Available remaining space
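The checks above can be sketched as a small monitoring routine. This is only an illustration: the `TableStats` fields, the `findings` function, and the 10% and 15% thresholds are assumptions made for the example, not values from the chapter.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TableStats:
    """Hypothetical per-table counters a monitor might collect."""
    name: str
    primary_rows: int      # rows stored in their home blocks
    overflow_rows: int     # rows that spilled into overflow
    used_bytes: int
    allocated_bytes: int


def findings(stats: TableStats,
             max_overflow: float = 0.10,   # assumed threshold
             min_free: float = 0.15) -> List[str]:  # assumed threshold
    """Return human-readable warnings for one table."""
    total = stats.primary_rows + stats.overflow_rows
    overflow_ratio = stats.overflow_rows / total if total else 0.0
    free_ratio = 1 - stats.used_bytes / stats.allocated_bytes
    out = []
    if overflow_ratio > max_overflow:
        out.append(f"{stats.name}: {overflow_ratio:.0%} of rows in overflow, "
                   "consider a reorganization")
    if free_ratio < min_free:
        out.append(f"{stats.name}: only {free_ratio:.0%} of space remaining")
    return out


crowded = TableStats("sales", primary_rows=800, overflow_rows=200,
                     used_bytes=90, allocated_bytes=100)
healthy = TableStats("customer", primary_rows=1000, overflow_rows=10,
                     used_bytes=50, allocated_bytes=100)
crowded_findings = findings(crowded)
healthy_findings = findings(healthy)
```

Here the crowded table produces two warnings and the healthy one produces none; a real monitor would read these counters from the DBMS catalog rather than from hand-built objects.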
Interfaces to Many Technologies
The interface to different technologies requires
 several considerations:
  Does the data pass from one DBMS to another
   easily?
  Does it pass from one operating system to another
   easily?
  Does it change its basic format in passage (EBCDIC,
   ASCII, and so forth)?
  Can passage into multidimensional processing be
   done easily?
  Can selected increments of data, such as those
   found by changed data capture (CDC), be passed
   rather than entire tables?
  Is the context of data lost in translation as data is
   moved to other environments?
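Passing selected increments rather than entire tables can be sketched with a simple watermark. This is one possible approach under stated assumptions: the `updated_at` column, the `changed_since` helper, and the sample rows are all hypothetical, and a real CDC facility may read the DBMS log instead of comparing timestamps.

```python
from datetime import datetime


def changed_since(rows, watermark):
    """Return only the rows modified after the previous extract."""
    return [r for r in rows if r["updated_at"] > watermark]


# Hypothetical source table with a last-modified timestamp per row.
source_rows = [
    {"id": 1, "balance": 100, "updated_at": datetime(2024, 1, 1)},
    {"id": 2, "balance": 250, "updated_at": datetime(2024, 1, 5)},
    {"id": 3, "balance": 75,  "updated_at": datetime(2024, 1, 9)},
]

last_extract = datetime(2024, 1, 3)          # watermark saved by the last run
delta = changed_since(source_rows, last_extract)

# The next run starts from the newest change that was just passed along.
new_watermark = max(r["updated_at"] for r in delta)
```

Only rows 2 and 3 are passed to the warehouse, and the watermark advances so the next extract skips them.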
PROGRAMMER OR DESIGNER CONTROL OF DATA PLACEMENT


 Place data at the block/page level
 Manage data in parallel
 Solid metadata control
 Rich language interface
Parallel Storage and Management of
Data
Metadata Management
 Data warehouse table structures
 Data warehouse table attribution
 Data warehouse source data (the system of
  record)
 Mapping from the system of record to the data
  warehouse
 Data model specification
 Extract logging
 Common routines for access of data
 Definitions and/or descriptions of data
 Relationships of one unit of data to another
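Several of the items above (table structure, attribution, source mapping, extract logging) could be held in one record per warehouse table. The sketch below is an assumption about how such a record might look; the class names, fields, and sample values are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ColumnMapping:
    """Mapping from the system of record to one warehouse column."""
    source_system: str
    source_field: str
    target_column: str
    transformation: str = "copy as-is"


@dataclass
class TableMetadata:
    table: str
    description: str
    columns: Dict[str, str]                       # column name -> data type
    mappings: List[ColumnMapping] = field(default_factory=list)
    extract_log: List[str] = field(default_factory=list)

    def lineage(self, column: str) -> List[str]:
        """List the source fields that feed a warehouse column."""
        return [f"{m.source_system}.{m.source_field}"
                for m in self.mappings if m.target_column == column]


meta = TableMetadata(
    table="account_snapshot",
    description="Month-end balance per account",
    columns={"account_id": "INTEGER", "balance": "REAL"},
)
meta.mappings.append(
    ColumnMapping("billing", "ACCT_BAL", "balance", "convert cents to dollars"))
meta.extract_log.append("2024-01-31 extract completed")

balance_sources = meta.lineage("balance")
```

Asking for the lineage of `balance` returns the source field it was mapped from, which is the kind of question an analyst typically brings to the metadata.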
Language Interface
Typically, the language interface to the data
 warehouse should do the following:
  ◦ Be able to access data a set at a time
  ◦ Be able to access data a record at a time
  ◦ Specifically ensure that one or more indexes
    will be used in the satisfaction of a query
  ◦ Have an SQL interface
  ◦ Be able to insert, delete, or update data
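These requirements can be tried out with Python's built-in `sqlite3` module, used here only as a convenient stand-in for a warehouse DBMS. The `sales` table and its index are invented for the example, and `INDEXED BY` is SQLite-specific syntax; other DBMSs offer different ways to direct index use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")
conn.execute("CREATE INDEX idx_sales_region ON sales (region)")
conn.executemany("INSERT INTO sales (region, amount) VALUES (?, ?)",
                 [("east", 10.0), ("west", 20.0), ("east", 5.0)])

# Set-at-a-time access: one statement answers for the whole set of rows.
total_east = conn.execute(
    "SELECT SUM(amount) FROM sales WHERE region = 'east'").fetchone()[0]

# Record-at-a-time access through a cursor. INDEXED BY makes SQLite fail
# the query rather than silently ignore the named index.
cursor = conn.execute(
    "SELECT id, amount FROM sales INDEXED BY idx_sales_region "
    "WHERE region = ? ORDER BY id", ("east",))
first = cursor.fetchone()
second = cursor.fetchone()

# Insert, update, and delete are available through the same interface.
conn.execute("UPDATE sales SET amount = amount + 1 WHERE id = ?", (first[0],))
conn.execute("DELETE FROM sales WHERE region = 'west'")
remaining = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
```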
EFFICIENT LOADING OF DATA


 Load efficiently
 Use indexes efficiently
 Store data in a compact way
 Support compound keys
Efficient Index Utilization

Technology can support efficient index access in
 several ways:
 ◦ Using bit maps
 ◦ Having multileveled indexes
 ◦ Storing all or parts of an index in main memory
 ◦ Compacting the index entries when the order of
   the data being indexed allows such compaction
 ◦ Creating selective indexes and range indexes
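The bit map technique in the list above can be illustrated in a few lines. This is a toy sketch, not how any particular DBMS stores its bit maps: each distinct column value gets a Python integer whose bit i is set when row i holds that value, and the column data is made up.

```python
from collections import defaultdict


def build_bitmap_index(values):
    """Map each distinct value to an int with bit i set for matching row i."""
    index = defaultdict(int)
    for row_id, value in enumerate(values):
        index[value] |= 1 << row_id
    return index


def rows_of(bitmap):
    """Expand a bitmap back into a list of row numbers."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Two low-cardinality columns of a five-row table.
region = ["east", "west", "east", "north", "west"]
status = ["open", "open", "closed", "open", "closed"]

region_idx = build_bitmap_index(region)
status_idx = build_bitmap_index(status)

# region = 'west' AND status = 'open' becomes a single bitwise AND.
west_and_open = rows_of(region_idx["west"] & status_idx["open"])

# region = 'east' OR region = 'north' becomes a bitwise OR.
east_or_north = rows_of(region_idx["east"] | region_idx["north"])
```

Combining predicates costs one bitwise operation per pair of bit maps, which is why the technique suits columns with few distinct values.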
Compaction of Data
Compaction helps manage large amounts of data.
The programmer gets the most out of a given
 I/O when data is stored compactly.
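A rough way to see the effect is to count how many records fit in one block under two layouts. The 4096-byte block size, the record fields, and both layouts below are assumptions chosen for the illustration.

```python
import struct

BLOCK_SIZE = 4096  # assumed block size in bytes

# The same record as padded text and as a packed binary layout.
text_record = "0000012345,2024-01-15,0000199.99\n".encode("ascii")

# "<" disables alignment padding: uint32 id, uint16 year,
# uint8 month, uint8 day, float64 amount.
packed = struct.Struct("<IHBBd")
packed_record = packed.pack(12345, 2024, 1, 15, 199.99)

text_per_block = BLOCK_SIZE // len(text_record)
packed_per_block = BLOCK_SIZE // packed.size

# The packed form loses nothing: it unpacks to the original values.
restored = packed.unpack(packed_record)
```

With these assumed layouts one block read returns roughly twice as many packed records as text records, so the same I/O does about twice the work.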
Compound Keys
Compound keys occur throughout the data warehouse because of:
 The time variancy of data warehouse data.
 Key-foreign key relationships, which are quite
  common in the atomic data.
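A compound key that combines a business key with a date can be sketched as follows, again using `sqlite3` as a stand-in. The `customer` and `account_snapshot` tables and their columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""
    CREATE TABLE account_snapshot (
        customer_id   INTEGER NOT NULL REFERENCES customer (customer_id),
        snapshot_date TEXT    NOT NULL,
        balance       REAL    NOT NULL,
        PRIMARY KEY (customer_id, snapshot_date)  -- compound key
    )
""")
conn.execute("INSERT INTO customer VALUES (1, 'Acme')")
conn.executemany("INSERT INTO account_snapshot VALUES (?, ?, ?)",
                 [(1, "2024-01-31", 100.0), (1, "2024-02-29", 140.0)])

# The same customer may appear many times, but only once per date.
try:
    conn.execute("INSERT INTO account_snapshot VALUES (1, '2024-01-31', 999.0)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

history = conn.execute(
    "SELECT snapshot_date, balance FROM account_snapshot "
    "WHERE customer_id = 1 ORDER BY snapshot_date").fetchall()
```

The date component is what lets the warehouse keep a history for each customer, and the foreign key ties each snapshot back to its parent row.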
VARIABLE-LENGTH DATA
Manage variable-length data efficiently
Lock manager under explicit control at the programmer level
Able to do index-only processing
Restore data in bulk efficiently
Lock Management
Ensures that two or more people are not
 updating the same record at the same
 time.
The ability to turn the lock manager off and
 on is necessary.
Index-Only Processing
Satisfying a query by looking in an index
 (or indexes), without going to the primary
 source of data
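SQLite reports index-only access in its query plan as a "covering index", which gives a quick way to observe the idea. The `orders` table, the index, and the `plan` helper are invented for this sketch, and the exact plan wording varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL, note TEXT)")
conn.execute("CREATE INDEX idx_region_amount ON orders (region, amount)")
conn.executemany("INSERT INTO orders (region, amount, note) VALUES (?, ?, ?)",
                 [("east", 10.0, "a"), ("west", 20.0, "b"), ("east", 5.0, "c")])


def plan(sql):
    """Return the plan text; the last column of each row is the description."""
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))


# Every column the query needs is in the index, so the table is never read.
covered = plan("SELECT amount FROM orders WHERE region = 'east'")

# 'note' is not in the index, so the table itself must be visited.
not_covered = plan("SELECT note FROM orders WHERE region = 'east'")

amounts = sorted(r[0] for r in conn.execute(
    "SELECT amount FROM orders WHERE region = 'east'"))
```

The first plan mentions a covering index and the second does not, even though both queries filter on the same column.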
Fast Restore
The capability to quickly restore a data
 warehouse table from non-DASD storage
Other Technological Features
Some features of the classical transaction-processing
 DBMS play only a small role (if any) in the data
 warehouse. They include the following:
 ◦   Transaction integrity
 ◦   High-speed buffering
 ◦   Row- or page-level locking
 ◦   Referential integrity
 ◦   VIEWs of data
 ◦   Partial block loading
DBMS Types and the Data
Warehouse
Data warehouses manage massive amounts of data
 because they contain:
   Granular, atomic detail
   Historical information
   Summary as well as detailed data
Because record-level, transaction-based updates are a
  regular feature of the general-purpose DBMS, it must
  offer facilities for the following:
   Locking
   COMMITs
   Checkpoints
   Log tape processing
   Deadlock
   Backout
Changing DBMS Technology
Such a change may be in order for several reasons:
  DBMS technologies may be available today that were
   not an option when the warehouse was first built.
  The size of the warehouse has grown.
  Use of the warehouse has escalated and changed.
  The basic DBMS decision must be revisited from
   time to time.
Should the decision be made to go to a new DBMS
  technology, what are the considerations?
  Will the new DBMS technology meet the foreseeable
   requirements?
  How will the conversion from the older DBMS
   technology to the newer DBMS technology be done?
  How will the transformation programs be converted?
Multidimensional DBMS and the
Data Warehouse
The multidimensional DBMS                The data warehouse
1.   holds at least an order of          1.   holds massive amounts of data
     magnitude less data
2.   is geared for very heavy and        2.   is geared for a limited amount of
     unpredictable access and analysis        flexible access
     of data
3.   holds a much shorter time           3.   contains data with a very lengthy
     horizon of data                          time horizon (from 5 to 10
                                              years)
4.   allows unfettered access            4.   allows analysts to access its data
                                              in a constrained fashion
5.   enjoys a complementary              5.   being housed in a
     relationship with the                    multidimensional DBMS
     data warehouse
Multidimensional DBMS and the
Data Warehouse (cont'd)
Following is the relational foundation for
 multidimensional DBMS data marts:
Strengths:
  Can support a lot of data.
  Can support dynamic joining of data.
  Has proven technology.
   Is capable of supporting general-purpose update
   processing.
  If there is no known pattern of usage of data,
   then the relational structure is as good as any
   other.
Weaknesses:
  Has performance that is less than optimal.
  Cannot be purely optimized for access
Multidimensional DBMS and the
Data Warehouse (cont'd)
Following is the cube foundation for multidimensional
  DBMS data marts:
Strengths:
  Performance that is optimal for DSS processing.
  Can be optimized for very fast access of data.
  If the pattern of access of data is known, then the
   structure of data can be optimized.
  Can easily be sliced and diced.
  Can be examined in many ways.
Weaknesses:
  Cannot handle nearly as much data as a standard
   relational format.
  Does not support general-purpose update processing.
  May take a long time to load.
  If access is desired on a path not supported by the design
   of the data, the structure is not flexible.
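The trade-off above can be seen in a toy cube that pre-aggregates a measure for every combination of dimension values. This is a simplified sketch, not the storage scheme of any real multidimensional DBMS; the fact rows and the use of "*" for "all values" are assumptions of the example.

```python
from collections import defaultdict

# Hypothetical fact rows: (region, product, quarter, amount).
facts = [
    ("east", "widget", "2024-Q1", 100),
    ("east", "gadget", "2024-Q1", 40),
    ("west", "widget", "2024-Q1", 70),
    ("west", "widget", "2024-Q2", 90),
]


def build_cube(rows):
    """Pre-aggregate amounts for each combination of dimension values.

    '*' stands for 'all values' of a dimension, so every fact row
    contributes to 2**n cells for n dimensions.
    """
    cube = defaultdict(int)
    for *dims, amount in rows:
        for mask in range(2 ** len(dims)):
            key = tuple(d if (mask >> i) & 1 else "*"
                        for i, d in enumerate(dims))
            cube[key] += amount
    return cube


cube = build_cube(facts)

# Slicing and dicing become single lookups instead of scans.
widget_total = cube[("*", "widget", "*")]
east_q1 = cube[("east", "*", "2024-Q1")]
grand_total = cube[("*", "*", "*")]
```

Lookups are fast because the answers were computed at load time, but each fact row updates eight cells here, which hints at why loading takes long and why the cube holds less data than a relational table.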
Data Warehousing across Multiple
Storage Media
A large amount of data is spread across
 more than one storage medium.
 ◦ One processing environment is the DASD
   environment where online, interactive
   processing is done.
 ◦ The other processing environment is often a
   tape or mass store environment
The Role of Metadata in the Data
Warehouse Environment
Context and Content
The context of the reports explains
 their contents.
Three Types of Contextual
Information
Three levels of contextual information must be
 managed:
  Simple contextual information
  Complex contextual information
  External contextual information
Simple contextual information relates to the basic
 structure of data itself, and includes such things
 as these:
  The structure of data
  The encoding of data
  The naming conventions used for data
  The metrics describing the data, such as:
   How much data there is
   How fast the data is growing
   What sectors of the data are growing
Three Types of Contextual
Information (cont'd)
Complex contextual information addresses such
 aspects of data as these:
 ◦ Product definitions
 ◦ Marketing territories
 ◦ Pricing
 ◦ Packaging
 ◦ Organization structure
 ◦ Distribution
Three Types of Contextual
Information (cont'd)
Some examples of external contextual
 information include the following:
  Economic forecasts:
   Inflation
   Financial trends
   Taxation
   Economic growth
  Political information
  Competitive information
  Technological advancements
  Consumer demographic movements
Capturing and Managing Contextual
Information
Complex and external contextual types
 of information are hard to capture and
 quantify because they are so
 unstructured.
Looking at the Past
Earlier attempts at managing contextual information
 had shortcomings, including the following:
The information management attempts
 were aimed at the information systems
 developer, not the end user.
Attempts at contextual management
 were passive.
Attempts at contextual information
 management were in many cases
 removed from the development effort.
Attempts to manage contextual
Refreshing the Data Warehouse
Reading a log tape is no small matter,
 however. Many obstacles are in the way,
 including the following:
  The log tape contains much extraneous
   data.
  The log tape format is often arcane.
  The log tape contains spanned records.
  The log tape often contains addresses
   instead of data values.
    The log tape reflects the idiosyncrasies of
Testing
It is very unusual to find a similar test
  environment in the world of the data
  warehouse, for the following reasons:
Data warehouses are so large that a
  corporation has a hard time justifying one
  of them, much less two of them.
The nature of the development life cycle
  for the data warehouse is iterative.
For the most part, programs are run in a
  heuristic manner, not in a repetitive
Summary
 Some technological features are
  required:
       Robust language interface
       Compound keys
       Variable-length data
       The abilities to do the following:
         Manage large amounts of data
         Manage data on diverse media
         Easily index and monitor data
         Interface with a wide number of technologies
         Allow the programmer to place the data directly
          on the physical device
         Store and access data in parallel
         Have metadata control of the warehouse
         Efficiently load the warehouse
         Efficiently use indexes
         Store data in a compact way
         Support compound keys
         Selectively turn off the lock manager
         Do index-only processing
         Quickly restore from bulk storage
Summary (cont'd)
The data architect must recognize the
 differences between a transaction-based
 DBMS and a data warehouse-based
 DBMS.
Summary (cont'd)
Multidimensional OLAP technology is suited for
 data mart processing and not data warehouse
 processing.

When the data mart approach is used as a basis for
 data warehousing, many problems become evident:
  The number of extract programs grows large.
  Each new multidimensional database must return to
   the legacy operational environment for its own data.
  There is no basis for reconciliation of differences in
   analysis.
  A tremendous amount of redundant data among
   different multidimensional DBMS environments
   exists.
Summary (cont'd)
Metadata in the data warehouse
 environment plays a very different role
 than metadata in the operational legacy
 environment.
 
Learning spark ch04 - Working with Key/Value Pairs
Learning spark ch04 - Working with Key/Value PairsLearning spark ch04 - Working with Key/Value Pairs
Learning spark ch04 - Working with Key/Value Pairs
 
Learning spark ch01 - Introduction to Data Analysis with Spark
Learning spark ch01 - Introduction to Data Analysis with SparkLearning spark ch01 - Introduction to Data Analysis with Spark
Learning spark ch01 - Introduction to Data Analysis with Spark
 
Hướng Dẫn Đăng Ký LibertaGia - A guide and introduciton about Libertagia
Hướng Dẫn Đăng Ký LibertaGia - A guide and introduciton about LibertagiaHướng Dẫn Đăng Ký LibertaGia - A guide and introduciton about Libertagia
Hướng Dẫn Đăng Ký LibertaGia - A guide and introduciton about Libertagia
 
Lecture 1 - Getting to know XML
Lecture 1 - Getting to know XMLLecture 1 - Getting to know XML
Lecture 1 - Getting to know XML
 
Lecture 4 - Adding XTHML for the Web
Lecture  4 - Adding XTHML for the WebLecture  4 - Adding XTHML for the Web
Lecture 4 - Adding XTHML for the Web
 

Lecture 05 - The Data Warehouse and Technology

  • 1. Building Data WareHouse by Inmon Chapter 5: The Data Warehouse and Technology http://it-slideshares.blogspot.com/
  • 2. 5.0 Overview The data warehouse requires a simpler set of technological features than its operational predecessors: ◦ Online updating: not needed. ◦ Locking and integrity: needs are minimal. ◦ Teleprocessing interface: only a very basic one is required. This chapter outlines some of the technological requirements for the data warehouse.
  • 3. MANAGING LARGE AMOUNTS OF DATA 1. Manage volumes 2. Manage multiple media technologies 3. Index and monitor data 4. Interface to retrieve and pass data
  • 4. Managing Multiple Media Following is a hierarchy of storage of data in terms of speed of access and cost of storage:
       Medium             Speed of access    Cost of storage
       Main memory        Very fast          Very expensive
       Expanded memory    Very fast          Expensive
       Cache              Very fast          Expensive
       DASD               Fast               Moderate
       Magnetic tape      Not fast           Not expensive
       Near line          Not fast*          Not expensive
       Optical disk       Not slow           Not expensive
       Fiche              Slow               Cheap
       *Not fast to find first record sought; very fast to find all other records in the block.
  • 5. Indexing and Monitoring Data Monitoring data warehouse data determines such factors as the following: ◦ If a reorganization needs to be done ◦ If an index is poorly structured ◦ If too much or not enough data is in overflow ◦ The statistical composition of the access of the data ◦ Available remaining space
  • 6. Interfaces to Many Technologies The interface to different technologies requires several considerations: Does the data pass from one DBMS to another easily? Does it pass from one operating system to another easily? Does it change its basic format in passage (EBCDIC, ASCII, and so forth)? Can passage into multidimensional processing be done easily? Can selected increments of data, such as changed data capture (CDC) be passed rather than entire tables? Is the context of data lost in translation as data is moved to other environments?
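The question on slide 6 about data changing its basic format in passage can be illustrated with a small sketch. It assumes the source system uses the EBCDIC code page `cp037` (one of several EBCDIC variants; the slides do not name one), and the record contents and the helper name `ebcdic_to_ascii` are hypothetical.

```python
# Hypothetical record; the field contents are illustrative only.
record_text = "CUST0042 SMITH"

# cp037 is one EBCDIC code page available in Python's standard codecs
# (an assumption; a real source system may use a different code page).
ebcdic_bytes = record_text.encode("cp037")
ascii_bytes = record_text.encode("ascii")


def ebcdic_to_ascii(data: bytes, codepage: str = "cp037") -> bytes:
    """Decode EBCDIC bytes to text, then re-encode the text as ASCII."""
    return data.decode(codepage).encode("ascii")


# The same text has different byte values in the two encodings,
# so a conversion step is needed when data moves between platforms.
converted = ebcdic_to_ascii(ebcdic_bytes)
```

The text survives the move unchanged, but only because the code page is known on the receiving side; that knowledge is part of the context that can be lost in translation.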
  • 7. PROGRAMMER OR DESIGNER CONTROL OF DATA PLACEMENT ◦ Place data at the block/page level ◦ Manage data in parallel ◦ Solid metadata control ◦ Rich language interface
  • 8. Parallel Storage and Management of Data. Metadata Management: metadata for the data warehouse typically covers the following: ◦ Data warehouse table structures ◦ Data warehouse table attribution ◦ Data warehouse source data (the system of record) ◦ Mapping from the system of record to the data warehouse ◦ Data model specification ◦ Extract logging ◦ Common routines for access of data ◦ Definitions and/or descriptions of data ◦ Relationships of one unit of data to another
  • 9. Language Interface Typically, the language interface to the data warehouse should do the following: ◦ Be able to access data a set at a time ◦ Be able to access data a record at a time ◦ Specifically ensure that one or more indexes will be used in the satisfaction of a query ◦ Have an SQL interface ◦ Be able to insert, delete, or update data
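The capabilities listed on slide 9 can be sketched with SQLite through Python's built-in `sqlite3` module. This is only an illustration under stated assumptions: SQLite stands in for a warehouse DBMS, and the `sales` table, its columns, and the index name `idx_sales_region` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")
conn.execute("CREATE INDEX idx_sales_region ON sales (region)")

# Insert data (the interface must also allow update and delete).
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("east", 10.0), ("west", 20.0), ("east", 5.0)],
)

# Set-at-a-time access: one statement operates on every qualifying row.
# INDEXED BY is SQLite's way of insisting that a named index is used;
# the statement fails if the index cannot serve the query.
total_east = conn.execute(
    "SELECT SUM(amount) FROM sales INDEXED BY idx_sales_region WHERE region = ?",
    ("east",),
).fetchone()[0]

# Record-at-a-time access: fetch rows one by one from a cursor.
cursor = conn.execute("SELECT id, region, amount FROM sales ORDER BY id")
first_record = cursor.fetchone()

# Confirm from the query plan that the index is actually used.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT amount FROM sales INDEXED BY idx_sales_region WHERE region = 'east'"
).fetchall()
index_used = any("idx_sales_region" in row[-1] for row in plan)
```

Other DBMSs express the "ensure an index is used" requirement differently, typically through optimizer hints, so the `INDEXED BY` clause here should be read as one example of the idea and not as a general syntax.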
  • 10. EFFICIENT LOADING OF DATA ◦ Load efficiently ◦ Use indexes efficiently ◦ Store data in a compact way ◦ Support compound keys
  • 11. Efficient Index Utilization Technology can support efficient index access in several ways: ◦ Using bit maps ◦ Having multileveled indexes ◦ Storing all or parts of an index in main memory ◦ Compacting the index entries when the order of the data being indexed allows such compaction ◦ Creating selective indexes and range indexes
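Of the techniques on slide 11, bit maps are the easiest to show in a few lines. The following is a minimal sketch, not how any particular DBMS implements them: each distinct column value gets an integer used as a bit string, with bit i set when row i holds that value. The column data and function names are hypothetical.

```python
from collections import defaultdict


def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is set if row i has that value."""
    index = defaultdict(int)
    for row_id, value in enumerate(values):
        index[value] |= 1 << row_id
    return dict(index)


def rows_from_bitmap(bitmap):
    """Return the row ids whose bits are set in the bitmap, in ascending order."""
    rows = []
    row_id = 0
    while bitmap:
        if bitmap & 1:
            rows.append(row_id)
        bitmap >>= 1
        row_id += 1
    return rows


# Two hypothetical low-cardinality columns of a five-row table.
regions = ["east", "west", "east", "north", "west"]
status = ["open", "open", "closed", "open", "closed"]

region_index = build_bitmap_index(regions)
status_index = build_bitmap_index(status)

# A two-condition query becomes a single bitwise AND of two bitmaps.
east_and_open = rows_from_bitmap(region_index["east"] & status_index["open"])
west_and_closed = rows_from_bitmap(region_index["west"] & status_index["closed"])
```

Bit maps of this kind work best on columns with few distinct values, which is why they suit the read-mostly, analytical access typical of a warehouse.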
  • 12. Compaction of Data Compaction is central to managing large amounts of data: the programmer gets the most out of a given I/O when data is stored compactly.
  • 13. Compound Keys Compound keys are common in the data warehouse because of the time variancy of data warehouse data and because key-foreign key relationships are quite common in the atomic data.
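A compound key that reflects time variancy, as described on slide 13, might look like the sketch below. It uses SQLite via Python's `sqlite3` module purely for illustration; the `customer_snapshot` table, its columns, and the sample values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The key combines the business key with a date, so the same customer
# can appear once per snapshot (time-variant data).
conn.execute(
    """
    CREATE TABLE customer_snapshot (
        customer_id   INTEGER NOT NULL,
        snapshot_date TEXT    NOT NULL,
        credit_limit  REAL,
        PRIMARY KEY (customer_id, snapshot_date)
    )
    """
)
conn.executemany(
    "INSERT INTO customer_snapshot VALUES (?, ?, ?)",
    [(42, "2023-01-31", 1000.0), (42, "2023-02-28", 1500.0)],
)

# A second row for the same customer AND the same date violates the key.
try:
    conn.execute(
        "INSERT INTO customer_snapshot VALUES (?, ?, ?)",
        (42, "2023-02-28", 9999.0),
    )
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# The history of one customer is read back in date order.
history = conn.execute(
    "SELECT snapshot_date, credit_limit FROM customer_snapshot "
    "WHERE customer_id = ? ORDER BY snapshot_date",
    (42,),
).fetchall()
```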
  • 14. VARIABLE-LENGTH DATA ◦ Manage variable-length data efficiently ◦ Lock manager: explicit control at the programmer level ◦ Able to do index-only processing ◦ Restore data in bulk efficiently
  • 15. Lock Management The lock manager ensures that two or more people are not updating the same record at the same time. The ability to turn the lock manager off and on is necessary.
  • 16. Index-Only Processing Servicing a request by looking in an index (or indexes) without going to the primary source of data
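Index-only processing, as described on slide 16, can be observed in SQLite, which reports a "covering index" in its query plan when every column a query needs is present in the index. This is a sketch under the assumption that SQLite's behavior is a fair stand-in for the feature; the `orders` table and index name are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region TEXT, amount REAL, notes TEXT)"
)
# The index holds both the filter column and the selected column,
# so the query below never has to read the table rows themselves.
conn.execute("CREATE INDEX idx_orders_region_amount ON orders (region, amount)")
conn.executemany(
    "INSERT INTO orders (region, amount, notes) VALUES (?, ?, ?)",
    [("east", 10.0, "a"), ("west", 20.0, "b"), ("east", 5.0, "c")],
)

query = "SELECT amount FROM orders WHERE region = 'east'"

# The last column of each EXPLAIN QUERY PLAN row is a human-readable description.
plan_details = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query)]
served_from_index = any("COVERING INDEX" in detail for detail in plan_details)

amounts = sorted(row[0] for row in conn.execute(query))
```

Adding `notes` to the select list would force a lookup in the table, since that column is not part of the index.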
  • 17. Fast Restore The capability to quickly restore a data warehouse table from non-DASD storage
  • 18. Other Technological Features Some of those features include the following: ◦ Transaction integrity ◦ High-speed buffering ◦ Row- or page-level locking ◦ Referential integrity ◦ VIEWs of data ◦ Partial block loading
  • 19. DBMS Types and the Data Warehouse Data warehouses manage massive amounts of data because they contain: ◦ Granular, atomic detail ◦ Historical information ◦ Summary as well as detailed data. Because record-level, transaction-based updates are a regular feature of the general-purpose DBMS, it must offer facilities for: ◦ Locking ◦ COMMITs ◦ Checkpoints ◦ Log tape processing ◦ Deadlock ◦ Backout
  • 20. Changing DBMS Technology Such a change may be in order for several reasons: ◦ New DBMS technologies may be available. ◦ The size of the warehouse has grown. ◦ Use of the warehouse has escalated and changed. ◦ The basic DBMS decision must be revisited from time to time. Should the decision be made to go to a new DBMS technology, what are the considerations? ◦ Will the new DBMS technology meet the foreseeable requirements? ◦ How will the conversion from the older DBMS technology to the newer DBMS technology be done? ◦ How will the transformation programs be converted?
  • 21. Multidimensional DBMS and the Data Warehouse
  • 22. Multidimensional DBMS and the Data Warehouse con’t
       The multidimensional DBMS:
       1. holds at least an order of magnitude less data.
       2. is geared for very heavy and unpredictable access and analysis of data.
       3. holds a much shorter time horizon of data.
       4. allows unfettered access.
       The data warehouse:
       1. holds massive amounts of data.
       2. is geared for a limited amount of flexible access.
       3. contains data with a very lengthy time horizon (from 5 to 10 years).
       4. allows analysts to access its data in a constrained fashion.
       5. Instead of the data warehouse being housed in a multidimensional DBMS, the two enjoy a complementary relationship.
  • 23. Multidimensional DBMS and the Data Warehouse con’t Following is the relational foundation for multidimensional DBMS data marts: Strengths: ◦ Can support a lot of data. ◦ Can support dynamic joining of data. ◦ Has proven technology. ◦ Is capable of supporting general-purpose update processing. ◦ If there is no known pattern of usage of data, then the relational structure is as good as any other. Weaknesses: ◦ Has performance that is less than optimal. ◦ Cannot be purely optimized for access.
  • 24. Multidimensional DBMS and the Data Warehouse con’t Following is the cube foundation for multidimensional DBMS data marts: Strengths: ◦ Performance that is optimal for DSS processing. ◦ Can be optimized for very fast access of data. ◦ If the pattern of access of data is known, then the structure of data can be optimized. ◦ Can easily be sliced and diced. ◦ Can be examined in many ways. Weaknesses: ◦ Cannot handle nearly as much data as a standard relational format. ◦ Does not support general-purpose update processing. ◦ May take a long time to load. ◦ If access is desired on a path not supported by the design of the data, the structure is not flexible.
  • 25. Multidimensional DBMS and the Data Warehouse con’t
  • 26. Multidimensional DBMS and the Data Warehouse con’t
  • 27. MULTIDIMENSIONAL DBMS AND THE DATA WAREHOUSE CON’T
  • 28. Data Warehousing across Multiple Storage Media A large amount of data is spread across more than one storage medium. ◦ One processing environment is the DASD environment where online, interactive processing is done. ◦ The other processing environment is often a tape or mass store environment
  • 29. The Role of Metadata in the Data Warehouse Environment
  • 30. The Role of Metadata in the Data Warehouse Environment
  • 31. The Role of Metadata in the Data Warehouse Environment
  • 32. Context and Content The contents of the reports can be interpreted only when their context is explained.
  • 33. Three Types of Contextual Information Three levels of contextual information must be managed: ◦ Simple contextual information ◦ Complex contextual information ◦ External contextual information. Simple contextual information relates to the basic structure of data itself, and includes such things as these: ◦ The structure of data ◦ The encoding of data ◦ The naming conventions used for data ◦ The metrics describing the data, such as how much data there is, how fast the data is growing, and what sectors of the data are growing
  • 34. Three Types of Contextual Information con’t Complex contextual information addresses such aspects of data as these: ◦ Product definitions ◦ Marketing territories ◦ Pricing ◦ Packaging ◦ Organization structure ◦ Distribution
  • 35. Three Types of Contextual Information con’t Some examples of external contextual information include the following: ◦ Economic forecasts (inflation, financial trends, taxation, economic growth) ◦ Political information ◦ Competitive information ◦ Technological advancements ◦ Consumer demographic movements
  • 36. Capturing and Managing Contextual Information Complex and external contextual types of information are hard to capture and quantify because they are so unstructured.
  • 37. Looking at the Past Some of these shortcomings are as follows: ◦ The information management attempts were aimed at the information systems developer, not the end user. ◦ Attempts at contextual management were passive. ◦ Attempts at contextual information management were in many cases removed from the development effort. ◦ Attempts to manage contextual
  • 38. Refreshing the Data Warehouse Reading a log tape is no small matter, however. Many obstacles are in the way, including the following: ◦ The log tape contains much extraneous data. ◦ The log tape format is often arcane. ◦ The log tape contains spanned records. ◦ The log tape often contains addresses instead of data values. ◦ The log tape reflects the idiosyncrasies of
  • 39. Testing It is very unusual to find a similar test environment in the world of the data warehouse, for the following reasons: ◦ Data warehouses are so large that a corporation has a hard time justifying one of them, much less two of them. ◦ The nature of the development life cycle for the data warehouse is iterative. ◦ For the most part, programs are run in a heuristic manner, not in a repetitive
  • 40. Summary Some technological features are required: ◦ Robust language interface ◦ Compound keys ◦ Variable-length data. The abilities to do the following are also required: ◦ Manage large amounts of data ◦ Manage data on a diverse media ◦ Easily index and monitor data ◦ Interface with a wide number of technologies ◦ Allow the programmer to place the data directly on the physical device ◦ Store and access data in parallel ◦ Have metadata control of the warehouse ◦ Efficiently load the warehouse ◦ Efficiently use indexes ◦ Store data in a compact way ◦ Support compound keys ◦ Selectively turn off the lock manager ◦ Do index-only processing ◦ Quickly restore from bulk storage
  • 41. Summary con’t The data architect must recognize the differences between a transaction-based DBMS and a data warehouse-based DBMS.
  • 42. Summary con’t MultidimensionalOLAP technology is suited for data mart processing and not data warehouse processing. When the data mart approach is used, many problems become evident: The number of extract programs grows large. Each new multidimensional database must return to the legacy operational environment for its own data. There is no basis for reconciliation of differences in analysis. A tremendous amount of redundant data among different multidimensional DBMS environments exists.
  • 43. Summary con’t Metadata in the data warehouse environment plays a very different role than metadata in the operational legacy environment. http://it-slideshares.blogspot.com/