DWH Concepts
What is a DATA WAREHOUSE?

A data warehouse is a relational database that is designed for query and analysis
rather than for transaction processing. It usually contains historical data derived from
transaction data, but it can include data from other sources. It separates analysis
workload from transaction workload and enables an organization to consolidate data
from several sources. In addition to a relational database, a data warehouse
environment includes an extraction, transportation, transformation, and loading (ETL)
solution, an online analytical processing (OLAP) engine, client analysis tools, and
other applications that manage the process of gathering data and delivering it to
business users.

® A data warehouse is a database designed to support a broad range of decision
tasks in a specific organization. It is usually batch updated and structured for rapid
online queries and managerial summaries. Data warehouses contain large amounts
of historical data. The term data warehousing is often used to describe the process
of creating, managing and using a data warehouse.

What are the characteristics of a DATA WAREHOUSE?

The characteristics of a DWH are

•   Subject-Oriented: DWHs are designed to help you analyze data. For example,
    to learn more about the company’s sales data, you can build a warehouse that
    concentrates on sales. This ability to define a DWH by subject matter (sales,
    in this case) makes the DWH subject-oriented.
•   Integrated: Integration is closely related to subject orientation. DWHs put
    data from disparate sources into a consistent format. They must resolve such
    problems as naming conflicts and inconsistencies among units of measure. When
    they achieve this, they are said to be integrated.
•   Nonvolatile: Once entered into the warehouse, data should not change. This is
    logical because the purpose of a warehouse is to enable you to analyze what
    has occurred, and what has already happened never changes.
•   Time-Variant: In order to discover trends, analysts need large amounts of
    historical data. This is very much in contrast to OLTP systems, where
    performance requirements demand that historical data be moved to an archive.
    A DWH’s focus on change over time is what is meant by the term time-variant.
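
The "Integrated" characteristic above can be illustrated with a small sketch. The two source systems, their field names and the sample values are hypothetical, invented only to show a naming conflict and a unit-of-measure conflict being resolved into one warehouse format.

```python
# Hypothetical source records: two systems describe the same kind of sale
# with different field names and different units of measure.
crm_rows = [{"cust_id": 17, "weight_lb": 22.0, "sale_date": "2024-01-05"}]
erp_rows = [{"customer": 17, "weight_kg": 4.5, "date": "2024-01-06"}]

LB_TO_KG = 0.45359237  # definition of the international pound in kilograms


def from_crm(row):
    """Map a CRM row onto the warehouse naming and unit convention."""
    return {
        "customer_id": row["cust_id"],
        "weight_kg": round(row["weight_lb"] * LB_TO_KG, 3),
        "sale_date": row["sale_date"],
    }


def from_erp(row):
    """Map an ERP row onto the same convention (already in kilograms)."""
    return {
        "customer_id": row["customer"],
        "weight_kg": round(row["weight_kg"], 3),
        "sale_date": row["date"],
    }


# After integration every row has the same keys and the same unit.
integrated = [from_crm(r) for r in crm_rows] + [from_erp(r) for r in erp_rows]
```

In a real warehouse this mapping would normally live in the ETL layer; the point here is only that each source gets its own mapping onto one agreed format.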




What are the goals of a DATA WAREHOUSE?
The goals of a DATA WAREHOUSE are

•  To provide a reliable, single integrated source of key corporate information.
•  To give end users access to their data without a reliance on reports produced by
   the information system department.
• To allow analysts to analyze corporate data and even produce predictive “what if”
   models from that data.
The data warehouse is simply one component of modern reporting architectures.
The real goal of reporting systems is decision support, or its modern equivalent,
business intelligence: to help people make better, more intelligent decisions.

When should a company consider implementing a data warehouse?

Data warehouses or a more focused database called a data mart should be
considered when a significant number of potential users are requesting access to a
large amount of related historical information for analysis and reporting purposes.
So-called active or real-time data warehouses can provide advanced decision
support capabilities.

What are the uses of a DATA WAREHOUSE?

•   It separates analysis workload from transaction workload and enables an
    organization to consolidate data from several sources.
•   It manages the process of gathering data and delivering it to business users.
•   It is used to analyze data.
•   It puts data from disparate sources into a consistent format.


What are the benefits of data warehousing?

Some of the potential benefits of putting data into a data warehouse include:

1. Improving turnaround time for data access and reporting;
2. Standardizing data across the organization so there will be one view of the
   "truth";
3. Merging data from various source systems to create a more comprehensive
   information source;
4. Lowering costs to create and distribute information and reports;
5. Sharing data and allowing others to access and analyze the data;
6. Encouraging and improving fact-based decision-making.




What are the limitations of data warehousing?
The major limitations associated with data warehousing are related to user
expectations, lack of data and poor data quality. Building a data warehouse creates
some unrealistic expectations that need to be managed. A data warehouse doesn't
meet all decision support needs. If needed data is not currently collected, transaction
systems need to be altered to collect the data. If data quality is a problem, the
problem should be corrected in the source system before the data warehouse is
built. Software can provide only limited support for cleaning and transforming data.
Missing and inaccurate data cannot be "fixed" using software. Historical data can be
collected manually, coded and "fixed", but at some point source systems need to
provide quality data that can be loaded into the data warehouse without manual
clerical intervention.

What data is stored in a data warehouse?

In general, organized data about business transactions and business operations is
stored in a data warehouse. But, any data used to manage a business or any type of
data that has value to a business should be evaluated for storage in the warehouse.
Some static data may be compiled for initial loading into the warehouse. Any data
that comes from mainframe, client/server, or web-based systems can then be
periodically loaded into the warehouse. The idea behind a data warehouse is to
capture and maintain useful data in a central location. Once data is organized,
managers and analysts can use software tools like OLAP to link different types of
data together and potentially turn that data into valuable information that can be used
for a variety of business decision support needs, including analysis, discovery,
reporting and planning. Database administrators (DBAs) have always said that
having non-normalized or de-normalized data is bad.

What are the methodologies of Data Warehousing?

Every company has a methodology of its own. To name a few, the SDLC methodology
and the AIM methodology are widely used. Other methodologies include AMM, the
World Class methodology and many more.

How does my company get started with data warehousing?

Build one! The easiest way to get started with data warehousing is to analyze some
existing transaction processing systems and see what type of historical trends and
comparisons might be interesting to examine to support decision making. See if
there is a "real" user need for integrating the data. If there is, then IS/IT staff can
develop a data model for a new schema and load it with some current data and start
creating a decision support data store using a database management system
(DBMS). Find some software for query and reporting and build a decision support
interface that's easy to use. Although the initial data warehouse/data-driven DSS
may seem to meet only limited needs, it is a "first step". Start small and build more
sophisticated systems based upon experience and successes.

What are the Data Warehouse Implementation Schemes?

What type of Indexing mechanism do we need to use for a typical data
warehouse?

On the fact table it is best to use bitmap indexes. Dimension tables can use bitmap
and/or the other types of clustered/non-clustered, unique/non-unique indexes.

To my knowledge, SQL Server does not offer a bitmap index type that you can create
yourself, whereas Oracle does support bitmap indexes.
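
To show why bitmap indexes suit fact-table columns with few distinct values, here is a minimal pure-Python sketch of the idea. It is not how Oracle or any other DBMS implements them; the column names and data are hypothetical, and each bitmap is simply a Python integer whose bit i is set when row i holds that value.

```python
def build_bitmap_index(values):
    """Return {distinct value: int bitmap}; bit i is set when row i has that value."""
    index = {}
    for row_id, value in enumerate(values):
        index[value] = index.get(value, 0) | (1 << row_id)
    return index


def rows_from_bitmap(bitmap):
    """Decode a bitmap back into a sorted list of row ids."""
    rows, row_id = [], 0
    while bitmap:
        if bitmap & 1:
            rows.append(row_id)
        bitmap >>= 1
        row_id += 1
    return rows


# Hypothetical low-cardinality fact-table columns.
region = ["EAST", "WEST", "EAST", "NORTH", "WEST", "EAST"]
channel = ["WEB", "WEB", "STORE", "WEB", "STORE", "WEB"]

region_idx = build_bitmap_index(region)
channel_idx = build_bitmap_index(channel)

# WHERE region = 'EAST' AND channel = 'WEB' becomes a single bitwise AND.
matches = rows_from_bitmap(region_idx["EAST"] & channel_idx["WEB"])
```

Combining several filter conditions with cheap bitwise operations is what makes this structure attractive for ad hoc warehouse queries.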

What are the steps to build the data warehouse?

Gathering business requirements

Identifying Sources

Identifying Facts

Defining Dimensions

Defining Attributes

Redefining Dimensions & Attributes

Organizing Attribute Hierarchy & Defining Relationships

Assigning Unique Identifiers

Additional conventions: Cardinality/Adding ratios
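
As a rough illustration of where these steps lead, the sketch below builds a tiny star schema with Python's built-in sqlite3 module: one fact table, two dimension tables, unique identifiers (keys) on the dimensions, and a simple attribute hierarchy on the date dimension. All table names, column names and rows are assumptions made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (
    product_key INTEGER PRIMARY KEY,   -- unique identifier for the dimension row
    product_name TEXT,
    category TEXT                      -- hierarchy: category > product
);
CREATE TABLE dim_date (
    date_key INTEGER PRIMARY KEY,
    full_date TEXT,
    month TEXT,                        -- hierarchy: year > month > date
    year INTEGER
);
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    date_key INTEGER REFERENCES dim_date(date_key),
    quantity INTEGER,                  -- facts (measures)
    amount REAL
);
""")
conn.executemany("INSERT INTO dim_product VALUES (?, ?, ?)",
                 [(1, "Widget", "Hardware"), (2, "Gadget", "Hardware")])
conn.executemany("INSERT INTO dim_date VALUES (?, ?, ?, ?)",
                 [(20240105, "2024-01-05", "2024-01", 2024),
                  (20240210, "2024-02-10", "2024-02", 2024)])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
                 [(1, 20240105, 3, 30.0), (2, 20240105, 1, 25.0),
                  (1, 20240210, 2, 20.0)])

# Analyze the fact by an attribute of a dimension.
sales_by_month = conn.execute("""
    SELECT d.month, SUM(f.amount)
    FROM fact_sales f JOIN dim_date d ON f.date_key = d.date_key
    GROUP BY d.month ORDER BY d.month
""").fetchall()
```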

How often should data be loaded into a data warehouse from transaction
processing and other source systems?

It all depends on the needs of the users, how fast data changes and the volume of
information that is to be loaded into the data warehouse. It is common to schedule
daily, weekly or monthly dumps from operational data stores during periods of low
activity (for example, at night or on weekends). The longer the gap between loads,
the longer the processing times for the load when it does run. A technical IS/IT
staffer should make some calculations and consult with potential users to develop a
schedule to load new data.




What are the different architectures of a data warehouse? ® What are the
different approaches to a data warehouse?

There are two main approaches:

Top down - (Bill Inmon)

Bottom up - (Ralph Kimball)

What are the types of a data warehouse?

What is the main difference between Inmon and Kimball philosophies of data
warehousing?

The two differ in their approach to building the data warehouse.

Kimball views the data warehouse as a collection of data marts. Data marts are
focused on delivering business objectives for departments in the organization, and
the data warehouse is formed by tying the data marts together through conformed
(shared) dimensions. Hence a unified view of the enterprise can be obtained from
the dimensional modeling done at a local departmental level.

Inmon believes in creating a data warehouse on a subject-by-subject area basis.
Hence the development of the data warehouse can start with data from, say, the
online store. Other subject areas can be added to the data warehouse as the need
arises. Point-of-sale (POS) data can be added later if management decides it is
necessary.

i.e., Kimball: first the data marts, then combined into a data warehouse.

      Inmon: first the data warehouse, later the data marts.
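
The Kimball idea of getting an enterprise view out of departmental data marts depends on the marts sharing a conformed dimension. The following is only a sketch with made-up data: two hypothetical marts (sales and inventory) use the same product keys, so their facts can be lined up side by side.

```python
# Conformed dimension shared by both data marts (hypothetical data).
dim_product = {1: "Widget", 2: "Gadget"}

# Two departmental data marts, each with its own fact rows keyed by product.
sales_mart = [(1, 30.0), (2, 25.0), (1, 20.0)]   # (product_key, sales amount)
inventory_mart = [(1, 40), (2, 15)]              # (product_key, units on hand)


def total_by_key(rows):
    """Sum the measure of each fact row per dimension key."""
    totals = {}
    for key, value in rows:
        totals[key] = totals.get(key, 0) + value
    return totals


sales = total_by_key(sales_mart)
stock = total_by_key(inventory_mart)

# Because both marts use the same product keys, their results can be combined.
enterprise_view = {
    name: {"sales_amount": sales.get(key, 0), "units_on_hand": stock.get(key, 0)}
    for key, name in dim_product.items()
}
```

If each mart had invented its own product codes, this combination step would first need the kind of integration work described under the DWH characteristics.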

When should I consider a Data warehouse solution?

What is the process of warehousing data?

Explain the architecture of a data warehouse with the diagram.

What is Staging Area?

What is a general purpose scheduling tool?

The basic purpose of the scheduling tool in a DW application is to streamline the
flow of data from source to target at a specific time or based on some condition.

What is real time data warehousing?

Real-time data warehousing is a combination of two things:

1. Real-time activity, and
2. Data warehousing.

Real-time activity is activity that is happening right now. The activity could be
anything such as the sale of widgets. Once the activity is complete, there is data
about it. Data warehousing captures business activity data. Real-time data
warehousing captures business activity data as it occurs. As soon as the business
activity is complete and there is data about it, the completed activity data flows into
the data warehouse and becomes available instantly. In other words, real-time data
warehousing is a framework for deriving information from data as the data becomes
available.

What is ODS?

ODS means Operational Data Store. It is a collection of operational or base data that
is extracted from operational databases and standardized, cleansed, consolidated,
transformed, and loaded into an enterprise data architecture. An ODS is used to
support data mining of operational data, or as the store for base data that is
summarized for a data warehouse. The ODS may also be used to audit the data
warehouse to ensure that summarized and derived data is calculated properly. The
ODS may further become the enterprise shared operational database, allowing
operational systems that are being reengineered to use the ODS as their operational
database.

What is Active data warehousing?

An active data warehouse provides information that enables decision-makers within
an organization to manage customer relationships nimbly, efficiently and proactively.
Active data warehousing is all about integrating advanced decision support with
day-to-day, even minute-to-minute, decision making in a way that increases the
quality of those customer touches, which encourages customer loyalty and thus
secures an organization's bottom line. The marketplace is coming of age as we
progress from
first-generation "passive" decision-support systems to current- and next-generation
"active" data warehouse implementations.

® An active data warehouse means that every user can access the database at any
time, 24/7.

® Active transformation means that data can be changed as it passes through.




What is meant by OLTP?

OLTP stands for On-Line Transaction Processing. It uses a standard, normalized
database structure. OLTP is designed for transactions, i.e., day-to-day transactions.
An OLTP database typically has hundreds of users connected to it. These databases
are normalized to reduce the redundancy of the data and to increase performance
while inserting data. The number of records being inserted is greater than the
number of records being updated or deleted. OLTP systems are not designed for
analysis, reporting and decision support. Examples: ATMs, Online Shopping, Online
Application Filling, and Online Railway Reservations.

Why are OLTP database designs not generally a good idea for a Data
Warehouse?

In OLTP, tables are normalized, so analytical queries that must join many tables
respond slowly for the end user. In addition, an OLTP system does not contain years
of historical data, so trends cannot be analyzed.

Why is de-normalized data now ok when it's used for Decision Support?

Normalization of a relational database for transaction processing avoids processing
anomalies and results in the most efficient use of database storage. A data
warehouse for Decision Support is not intended to achieve these same goals. For
Data-driven Decision Support, the main concern is to provide information to the user
as fast as possible. Because of this, storing data in a de-normalized fashion,
including storing redundant data and pre-summarizing data, provides the best
retrieval results. Also, data warehouse data is usually static so anomalies will not
occur from operations like add, delete and update a record or field.
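
A minimal sketch of the "pre-summarizing" idea, using Python's built-in sqlite3 module. The table names and rows are hypothetical. The summary table is a redundant copy built once at load time, so a report can read a few summary rows instead of re-aggregating every detail row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales_detail (sale_date TEXT, region TEXT, amount REAL);
INSERT INTO sales_detail VALUES
    ('2024-01-05', 'EAST', 10.0),
    ('2024-01-05', 'WEST', 20.0),
    ('2024-01-06', 'EAST', 5.0);

-- Pre-summarized, redundant copy built at load time.
CREATE TABLE sales_by_region AS
    SELECT region, SUM(amount) AS total_amount, COUNT(*) AS sale_count
    FROM sales_detail GROUP BY region;
""")

# A report reads the small summary table instead of scanning the detail rows.
summary = dict(conn.execute(
    "SELECT region, total_amount FROM sales_by_region").fetchall())
```

Because warehouse data is loaded periodically rather than updated row by row, the summary only has to be rebuilt (or extended) when a load runs.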

Why should you put your data warehouse on a different system than your
OLTP system?

An OLTP system is basically “data oriented” (ER model) and not “subject oriented”
(dimensional model). That is why we design a separate system to host the
subject-oriented OLAP system. Moreover, a complex query fired on an OLTP system
causes heavy overhead on the OLTP server, which directly affects day-to-day
business.

What is Business Intelligence?

Business intelligence (BI) is a broad category of applications and technologies for
gathering, storing, analyzing, and providing access to data to help enterprise users
make better business decisions.



What are the important concerns of OLTP and DSS systems?

               OLTP                                  DSS

No. of users   Many                                  Few

Data           1. Stored in a complex data format.   1. Stored in multidimensional
                                                        structures, e.g. a cube
                                                        (3-dimensional).
               2. Stored in a normalized form,       2. Stored in a de-normalized
                  normally third normal form.           format.
                  Normalization enhances
                  performance.
               3. Small volumes of data.             3. Large volumes of data.
               4. Data is volatile in nature.        4. Static in nature, with
                                                        periodic loads.

Operations     Transactions                          Reporting

Indexes        Few                                   Many

Joins          Many (because it is normalized)       Few (because it is de-normalized)

Performance    Concurrency and availability are      Response time is the most
               the more important aspects,           important aspect.
               e.g. ATMs.

OLTP                                                  DSS

Complex data structures                               Multidimensional data structures

Few                       INDEXES                     Many

Many                      JOINS                       Some

Normalized DBMS           DUPLICATED DATA             De-normalized DBMS

Rare                      DERIVED DATA AND            Common
                          AGGREGATES

Many                      NUMBER OF USERS             Few

Predefined operations     WORKLOAD                    Ad hoc queries

Volatile                  DATA MODIFICATIONS          Updated on a regular basis

Small volumes             DATA                        Large volumes (historical data)

Availability must be high                             Response time must be good



What is the difference between ODS and OLTP?

ODS: It is a collection of tables created in the data warehouse that maintains only
current data, whereas OLTP maintains data only for transactions; OLTP systems are
designed for recording the daily operations and transactions of a business.

® ODS: Holds data within the data warehouse and stands alone. No further
transactions take place against the current data that is part of the data warehouse.
The current data changes only when new data is uploaded through ETL on a
scheduled basis.

OLTP: Holds data in an online system that is connected to the network, where every
transaction update happens within seconds. Summarized values can change every
second.

What is an OLAP? What are the types of OLAP?

OLAP is software for manipulating multidimensional data from a variety of sources.
The data is often stored in data warehouse. OLAP software helps a user create
queries, views, representations and reports. OLAP tools can provide a "front-end" for
a data-driven DSS.

® OLAP: On-Line Analytical Processing (OLAP) is
a category of software technology that enables analysts, managers and executives
to gain insight into data through fast, consistent, interactive access to a wide variety
of possible views of information that has been transformed from raw data to reflect
the real dimensionality of the enterprise as understood by the user.

® OLAP stands for On-Line Analytical Processing. An OLAP system stores data in
multidimensional databases. You then access these databases to perform financial
and statistical analysis on different combinations of the data. An OLAP database is
generally used to analyze data. It is optimized so that you can quickly retrieve data.
An OLAP database is generally created from the information you have put in an OLTP
database. OLAP products can be grouped into 3 categories.

MOLAP: (Multidimensional OLAP)

o Data is stored in multidimensional arrays in order to be viewed in a
  multidimensional manner.
o Multidimensional arrays provide efficiency in storage and operations.
o Examples: Oracle Express Server, Essbase by Hyperion Software, PowerPlay
  by Cognos.
o MOLAP does not support ad hoc queries effectively because it is optimized for
  multidimensional operations.
o Retrieval is fast.
o Storage is very efficient.

ROLAP: (Relational OLAP)

o Data is stored in a relational model, on the premise that OLAP capabilities are
  best provided directly against the relational database.
o Examples: Oracle, SQL Server, etc.
o ROLAP integrates naturally with existing technology and standards.
o ROLAP can readily take advantage of parallel relational technology.

HOLAP: (Hybrid OLAP)

o These products combine MOLAP and ROLAP.
o With HOLAP products, a relational database stores most of the data.
o A separate multidimensional database stores a small portion of the data.

Are OLAP databases called decision support systems? True/false?

True

What does the term ‘Metadata’ mean?

Very loosely, it is documentation about data; it is how you provide context for data
people might be using. Metadata is basically the wrapping you put around data you
use in everyday life to transform it into meaningful information.

What is the difference between data warehousing and OLAP?

The terms data warehousing and OLAP are often used interchangeably. As the
definitions suggest, warehousing refers to the organization and storage of data from
a variety of sources so that it can be analyzed and retrieved easily. OLAP deals with
the software and the process of analyzing data, managing aggregations, and
partitioning information into cubes for in-depth analysis, retrieval and visualization.
Some vendors are replacing the term OLAP with the terms analytical software and
business intelligence.

® A data warehouse is the place where data is stored for analysis, whereas
OLAP is the process of analyzing the data, managing aggregations, and partitioning
information into cubes for in-depth visualization.

What is OLAP, MOLAP, ROLAP, DOLAP, and HOLAP?

OLAP - On-Line Analytical Processing: Designates a category of applications and
technologies that allow the collection, storage, manipulation and presentation of
multidimensional data, with the goal of analysis.

MOLAP - Multidimensional OLAP: This term designates, more specifically, a
Cartesian (multidimensional array) data structure. In effect, MOLAP contrasts with
ROLAP. In the former, joins between tables are already resolved when the data is
stored, which enhances performance. In the latter, joins are computed at query
time. MOLAP is targeted at groups of users because it is a shared environment.
Data is stored in an exclusive server-based format, and it supports more complex
analysis of data.

ROLAP - Relational OLAP: Designates one or several star schemas stored in
relational databases. This technology permits multidimensional analysis with data
stored in relational databases. It is used for large departments or groups because it
supports large amounts of data and many users.

DOLAP - Desktop OLAP: Small OLAP products for local multidimensional analysis.
There can be a mini multidimensional database (using Personal Express), or
extraction of a data cube (using Business Objects). It is designed for a low-end,
single departmental user. Data is stored in cubes on the desktop. It's like
having your own spreadsheet. Since the data is local, end users don't have to worry
about performance hits against the server.

HOLAP - Hybrid OLAP: A hybridization of OLAP, which can include any of the above.

What is meant by metadata in context of a Data warehouse and how it is
important?

Metadata is data about data. A business analyst or data modeler usually captures
information about data, namely the source (where and how the data originated), the
nature of the data (char, varchar, nullable, existence, valid values, etc.) and the
behavior of the data (how it is modified or derived, and its life cycle), in a data
dictionary, a.k.a. metadata. Metadata is also presented at the data mart level, and
for subsets, facts and dimensions, the ODS, etc. For a DW user, metadata provides
vital information for analysis / DSS.




What is difference between MOLAP, ROLAP?

                         ROLAP                        MOLAP

Use                      Tactical                     Strategic
                         • Detailed data              • Summary data
                         • Simple calculations        • Complex calculations
                         • Analyze past trends        • Predict future trends

Data storage structure   • Tables                     • Cube

Advantages               • Requires less memory       • Data access is faster
                           storage space

Disadvantages            • Data access is slow        • Requires more memory
                                                        storage space
                                                      • Is sparsely filled as the
                                                        number of dimensions in
                                                        the cube increases
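
The storage difference in the comparison above can be sketched in a few lines of Python. This is an illustration only, not how any OLAP product works internally: the "ROLAP" function scans detail rows and aggregates at query time, while the "MOLAP" cube precomputes every rollup so a query is a single lookup. The data and the `ALL` marker are made up for the example.

```python
from itertools import product

# Hypothetical detail rows, as a ROLAP system would keep them in a table.
rows = [
    ("Widget", "EAST", "2024", 10.0),
    ("Widget", "WEST", "2024", 20.0),
    ("Gadget", "EAST", "2023", 5.0),
]
ALL = "*"  # marker meaning "rolled up over this dimension"


def rolap_total(rows, prod=ALL, region=ALL, year=ALL):
    """ROLAP style: scan the detail table and aggregate at query time."""
    return sum(
        amount for p, r, y, amount in rows
        if prod in (ALL, p) and region in (ALL, r) and year in (ALL, y)
    )


def build_cube(rows):
    """MOLAP style: precompute every rollup once, then answer by lookup."""
    cube = {}
    for p, r, y, amount in rows:
        for cell in product((p, ALL), (r, ALL), (y, ALL)):
            cube[cell] = cube.get(cell, 0.0) + amount
    return cube


cube = build_cube(rows)

# Sparsity: base cells that hold data versus cells the dimensions allow.
members = [{row[i] for row in rows} for i in range(3)]
possible_cells = len(members[0]) * len(members[1]) * len(members[2])
filled_cells = len({row[:3] for row in rows})
```

Here `rolap_total(rows, region="EAST")` and `cube[(ALL, "EAST", ALL)]` give the same total, but the cube stores more cells than there are detail rows, and only `filled_cells` of the `possible_cells` base combinations actually contain data. That gap grows quickly as dimensions are added.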


What is the Difference between OLTP and OLAP?

Main Differences between OLTP and OLAP are:-

1. User and System Orientation

OLTP: customer-oriented, used for transaction and query processing by clerks,
clients and IT professionals.

OLAP: market-oriented, used for data analysis by knowledge workers (managers,
executives, analysts).

2. Data Contents

OLTP: manages current data, very detail-oriented.

OLAP: manages large amounts of historical data, provides facilities for
summarization and aggregation, stores information at different levels of granularity to
support decision making process.

3. Database Design

OLTP: adopts an entity relationship(ER) model and an application-oriented database
design.

OLAP: adopts star, snowflake or fact constellation model and a subject-oriented
database design.

4. View

OLTP: focuses on the current data within an enterprise or department.

OLAP: spans multiple versions of a database schema due to the evolutionary
process of an organization; integrates information from many organizational
locations and data stores.

What types of Metadata are there and when will they be available?

Metadata will be made available on the Decision Support website as each increment
'goes live'. We have two classifications of metadata: one that is business and one
that is technical. Technical metadata is fairly clear-cut: where did the data come from
or how was it transformed along the way? Business metadata deals more with the
possible meaning of the data and how it can be used.

Why is Metadata important to the DWH User?

Metadata is what makes the data in the Data Warehouse meaningful. The Data
Warehouse is very different from an operational application. When you're using an
operational application, you can get clues from the screen that tell you to update a
particular field on the window. If I’m processing a new employee, I know exactly what
needs to be updated for that new employee record, and can move through the
process based on the context that the application provides. In a data-warehousing
environment, you don’t have that context or workflow. You have data that is
interrelated, and it is raw out there in a form, but there is no application between you
and the data. Basically, you have a number of tables and structures that you have
access to without a business layer, without a definition on top of it. So metadata is
very important to be able to provide that context to people so they know how to go
between subject areas or how data within a subject area is related and what it
defines and represents.

Is Metadata a description of what the data represents?

In the simplest terms it is. As an example, if a user of the Data Warehouse is
interested in a field called "campus code", then the metadata might have a definition
of what the campus code represents, such as "an indicator for one of the three
campuses". That is a form of metadata, although it is not a complete picture of what
metadata can be.



What types of Metadata will be made available to the User?

Decision Support has identified several kinds of metadata that will be published on
the website. Some basic categories are the data model, source-to-target mapping,
and the logical & physical model. The logical model gives more of a grouping or
identifies logically what would be expected from the business side. The physical
model goes into more detail with more of the data dictionary definition, but it gives
the user a pictorial representation of the data, not just a list of columns and tables. It
provides a visual so people can see how data elements relate to each other. There is
also a category of metadata that we call usage notes. These go into expanding on
how someone might query the Data Warehouse or use a query against a data mart.
Based on going through the requirements process and working with the focus
groups, as data is available, we expect to expand the metadata categories.

Is Metadata also useful to the average User of the DWH, in addition to a
department’s technical staff?

Yes. For an "ad hoc" user, there may be questions as to what a field represents.
Another form of metadata at a business user level would be sample queries that
Decision Support’s Services area would publish based on findings from the
requirements process and focus groups. These queries provide samples of relating
data to answer a business question.

What Challenges are involved when providing Metadata?

Historically organizations find it a challenge to manage metadata over time. So I
think the biggest challenge that we face at Decision Support is learning from those
mistakes and from what we’ve read in the industry. We need to make sure the
metadata we have is ‘live’; that it’s not something that is static and put on the shelf.
Decision Support has formed a Custodial Data Council that will take ownership in
making sure we have business definitions and work with the user community. I think
we also need to technically streamline those processes as much as possible, publish
the metadata, and make it as consistent as possible.

What is the difference between DWH and BI?

There may be a feature film (movie) without a trailer, but there can be no trailer
without a movie. Similarly, data warehousing is the practice of extracting a client's
business data, applying business processing to that data according to user needs,
and finally loading the processed data into a database; this database is what we
call a warehouse or data warehouse. Once the data warehouse is complete, the
business user ultimately wants to view the data (precise and summarized), but as a
business person he or she may not know how to access a database (a technical
person can access the database with SQL). This is where OLAP tools come in: they
help that person to access the database. We can call these OLAP tools business
intelligence tools; they are "intelligent" in the sense that they generate SQL queries
internally and give report developers many facilities for formatting the data and
presenting it in a highly convenient manner. So the data warehouse (the movie) is a
database, and business intelligence tools (the trailers) present the content of that
database in an efficient manner.

® Simply speaking, BI is the capability of analyzing the data of a data warehouse to
the advantage of the business. A BI tool analyzes the data of a data warehouse so
that business decisions can be made based on the results of the analysis.

® Data warehousing deals with all aspects of managing the development,
implementation and operation of a data warehouse or data mart including meta data
management, data acquisition, data cleansing, data transformation, storage
management, data distribution, data archiving, operational reporting, analytical
reporting, security management, backup/recovery planning, etc. Business
intelligence, on the other hand, is a set of software tools that enable an organization
to analyze measurable aspects of their business such as sales performance,
profitability, operational efficiency, effectiveness of marketing campaigns, market
penetration among certain customer groups, cost trends, anomalies and exceptions,
etc. Typically, the term “business intelligence” is used to encompass OLAP, data
visualization, data mining and query/reporting tools. Think of the data warehouse as
the back office and business intelligence as the entire business including the back
office. The business needs the back office on which to function, but the back office
without a business to support, makes no sense.

® DATA WAREHOUSE: A data warehouse is an integrated, time-variant,
subject-oriented and non-volatile collection of data in support of the management
decision-making process.

BUSINESS INTELLIGENCE: Business intelligence is the process of extracting data,
converting it into information and then into a knowledge base.

® A data warehouse is a database geared towards the business intelligence
requirements of an organization. It integrates data from the various operational
systems and is typically loaded from these systems at regular intervals.

BI is a category of technologies that allow for gathering, storing, accessing and
analyzing data to help business users make better decisions.

® To make business analysis effective and efficient we require a specialized form of
storage. This special form of data storage is called a Data Warehouse, and the
process is called Data Warehousing.

Business Intelligence is the mechanism of using data, according to the type of
industry, for predictive analysis, fault finding, process improvement, etc.

What is a Data Dictionary?

A data dictionary is a kind of metadata. A data dictionary explains how data
physically resides in an environment. A data dictionary identifies the type of column it
is, whether it is character or numeric or some other value. It identifies the width of a
column as well as the name of the column. Sometimes in data dictionaries you see
descriptions; sometimes you don’t. But basically it is how that field is physically
represented in Oracle or Sybase or some other platform, if that’s where the data
resides. It's difficult to do any meaningful query or report without basic metadata.

What are the possible data marts in Retail sales?

Product information, sales information.

What are data validation strategies for data mart validation after loading
process?
Data validation is to make sure that the loaded data is accurate and meets the
business requirements.

Strategies are different methods followed to meet the validation requirements.

What is a Data Mart?

A Data Mart is a focused subset of a DWH that deals with a single area of data and
is organized for quick analysis. It contains summarized data from the warehouse.
Data marts are referred to as High Performance Query Structures; they consist of
Materialized Views and Special Indexes. In some businesses data marts are
maintained within the warehouse, whereas in other scenarios they are maintained
apart from the DWH.

® A data mart is a repository of data gathered from operational data and other
sources that is designed to serve a particular community of knowledge workers.

® The systems designed for a particular line of business.

What are Data Marts?

Data Marts are designed to help managers make strategic decisions about their
business. Data Marts are subsets of the corporate-wide data that are of value to a
specific group of users.

There are two types of Data Marts:

1. Independent data marts: sourced from data captured from OLTP systems,
external providers or from data generated locally within a particular department or
geographic area.

2. Dependent data marts: sourced directly from enterprise data warehouses.

What are the levels of Data mart?

What is the difference between a Database, a DATAWAREHOUSE and a Data
Mart?

A Database is an organized collection of data.

A DWH is a very large database with special set of tools to extract and cleanse data
from operational systems and to analyze data.

A Data Mart is a focused subset of a DWH that deals with a single area of data and
is organized for quick analysis.

What is Data Sampling?

What is Data Scrubbing?

What is Data Acquisition Process?

What is data mining?

Data mining is the process of extracting hidden trends from a data warehouse. For
example, an insurance data warehouse can be mined to find the highest-risk
people to insure in a certain geographical area.

What is a transformation?

It is a repository object that generates, modifies or passes data.

Transformations: Transformations are the manipulation of data from how it appears
in the source systems into another form in the DWH or data mart in a way that
enhances or simplifies its meaning. In another way, you transform data into
information. This includes the following:

Data Merging: It is the process of standardizing data types and fields. Suppose one
source system stores integer data as smallint whereas another stores the same data
as decimal. The data from the two source systems needs to be rationalized when moved
into the Oracle data type called NUMBER.

Cleansing: It is the process of validating the data brought from multiple sources.
This involves identifying and changing any inconsistencies or inaccuracies.

•   Eliminates inconsistencies in the data from multiple sources.
•   Converts data from different systems into a single consistent data set suitable for
    analysis.
•   Meets a standard for establishing data elements, codes, domains, formats and
    naming conventions.
•   Corrects data errors and fills in missing data values.

Aggregation: The process whereby multiple detailed values are combined into a
single summary value, typically by summing numbers representing dollars spent or
units sold.

Generate summarized data for use in aggregate fact and dimension tables.
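
The three transformation steps above (merging, cleansing, aggregation) can be sketched in a few lines of Python. This is only an illustration: the two source lists, the field names and the rule of filling a missing amount with zero are assumptions made for the example, not part of any particular ETL tool.

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical rows from two source systems: one stores units as an
# integer (smallint), the other as a decimal string.
source_a = [{"product": "Pen ", "units": 3, "amount": "4.50"}]
source_b = [
    {"product": "pen", "units": "2.0", "amount": "3.00"},
    {"product": "Ink", "units": "1.0", "amount": None},
]

def merge_types(row):
    """Data merging: coerce both sources to one target type per field."""
    return {
        "product": row["product"],
        "units": int(Decimal(str(row["units"]))),
        "amount": None if row["amount"] is None else Decimal(row["amount"]),
    }

def cleanse(row):
    """Cleansing: standardize names and fill in missing values."""
    cleaned = dict(row)
    cleaned["product"] = row["product"].strip().lower()
    if cleaned["amount"] is None:
        cleaned["amount"] = Decimal("0")  # assumed default for the sketch
    return cleaned

def aggregate(rows):
    """Aggregation: combine detail rows into one summary row per product."""
    totals = defaultdict(lambda: {"units": 0, "amount": Decimal("0")})
    for row in rows:
        totals[row["product"]]["units"] += row["units"]
        totals[row["product"]]["amount"] += row["amount"]
    return dict(totals)

detail = [cleanse(merge_types(r)) for r in source_a + source_b]
summary = aggregate(detail)
```

After the run, `summary` holds one entry per product, with the "Pen " and "pen" rows from the two sources combined into a single total.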

What are the advantages of data mining over traditional approaches?

Data Mining is used for estimating the future. For example, for a
company/business organization, by using Data Mining we can predict
the future of the business in terms of Revenue, Employees, Customers,
Orders, etc.

Traditional approaches use simple algorithms for estimating the future, and they do
not give results as accurate as those of Data Mining.

What is ETL?

ETL stands for extraction, transformation and loading.

ETL tools provide developers with an interface for designing source-to-target mappings,
transformations and job control parameters.

• Extraction: Take data from an external source and move it to the warehouse
  pre-processor database.
• Transformation: Transform data task allows point-to-point generating, modifying
  and transforming data.
• Loading: Load data task adds records to a database table in a warehouse.
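
As a rough illustration of the three steps, the sketch below runs a tiny extract-transform-load pipeline against an in-memory SQLite database using Python's standard `sqlite3` module. The `sales` table, its columns and the sample rows are hypothetical; a real ETL tool would add mappings, scheduling and error handling.

```python
import sqlite3

def extract(source_rows):
    """Extraction: pull raw records from an external source (a list here)."""
    return list(source_rows)

def transform(rows):
    """Transformation: standardize text and convert amounts to numbers."""
    return [(name.strip().title(), float(amount)) for name, amount in rows]

def load(conn, rows):
    """Loading: add the transformed records to a warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()

raw = [("  alice ", "10.5"), ("BOB", "4")]
conn = sqlite3.connect(":memory:")
load(conn, transform(extract(raw)))
loaded = conn.execute(
    "SELECT customer, amount FROM sales ORDER BY customer"
).fetchall()
```
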
Explain the classification of Tables in a Data warehouse?

What is Fact table?

A Fact Table contains the measurements, metrics or facts of a business process. If
your business process is "Sales", then a measurement of this business process such
as "monthly sales number" is captured in the Fact table. The Fact table also contains the
foreign keys for the dimension tables.




Why is a fact table in normal form?

Basically the fact table consists of the keys of the dimension/lookup tables
and the measures. Every measure depends on the full combination of those keys and
on nothing else, which implies that the table is in normal form.

What is a level of Granularity of a fact table?

Level of granularity means the level of detail that you put into the fact table in a data
warehouse. For example, based on the design you can decide to store the sales data for
each transaction. The level of granularity is then the detail you are willing to
store for each transactional fact: product sales recorded for each individual
transaction, or aggregated up to the minute and stored at that level.

What does level of Granularity of a fact table signify?

Granularity: The first step in designing a fact table is to determine the granularity of
the fact table. By granularity, we mean the lowest level of information that will be
stored in the fact table. This constitutes two steps:

Determine which dimensions will be included.

Determine where along the hierarchy of each dimension the information will be kept.

The determining factors usually go back to the requirements

What is an aggregate fact table?

An aggregate table contains the measure values, aggregated (grouped or summed) up to
some level of a hierarchy.
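
A small sketch of the idea, assuming hypothetical transaction-grain rows of the form (date, product key, units sold): the aggregate table is the same measure summed up to a coarser grain, here one row per date and product.

```python
from collections import defaultdict

# Hypothetical transaction-grain fact rows: (date, product_key, units_sold)
transaction_facts = [
    ("2024-01-01", 1, 2),
    ("2024-01-01", 1, 3),
    ("2024-01-01", 2, 1),
    ("2024-01-02", 1, 4),
]

def build_aggregate(rows):
    """Sum units to one row per (date, product): a coarser grain."""
    totals = defaultdict(int)
    for sale_date, product_key, units in rows:
        totals[(sale_date, product_key)] += units
    return dict(totals)

daily_product_facts = build_aggregate(transaction_facts)
```

Queries that only need daily totals can read the smaller aggregate instead of scanning every transaction row.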

What is a factless fact table? Where have you used it in your project?

A factless fact table is a fact table in which only the keys are available; there are
no measures.



What is the common use of creating a Factless Fact Table?

What are the different types of Fact Table? Explain with an example.

1. Cumulative Fact Table:
2. Snapshot Fact Table:




What are the types of Facts?

Additive: A fact that can be summed up across all of the dimensions is called an
Additive Fact.

® A measure that can participate in arithmetic calculations using all or any
dimensions. Ex: Sales profit

Semi additive: A fact that can be summed up across some of the dimensions, but not
others, is called a Semi-additive Fact.

® A measure that can participate in arithmetic calculations using only some
dimensions. Ex: account balance (it can be summed across accounts, but not across time)

Non Additive: A fact that cannot be summed up across any of the dimensions is called
a Non-additive Fact.

® A measure that can’t participate in arithmetic calculations using dimensions. Ex:
temperature
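
The difference between summing across one dimension and across another can be sketched as follows. The daily balance snapshots and the two helper functions are hypothetical, chosen only to show that a balance is meaningful to add across accounts but not across days.

```python
# Hypothetical daily balance snapshots: (day, account, balance)
snapshots = [
    ("Mon", "A", 100), ("Mon", "B", 50),
    ("Tue", "A", 120), ("Tue", "B", 50),
]

def total_on_day(rows, day):
    """Summing across accounts for a single day is meaningful."""
    return sum(balance for d, _, balance in rows if d == day)

def average_over_days(rows, account):
    """Across time, average (or take the last value) instead of summing."""
    balances = [balance for _, a, balance in rows if a == account]
    return sum(balances) / len(balances)

monday_total = total_on_day(snapshots, "Mon")
account_a_average = average_over_days(snapshots, "A")
```

Adding account A's Monday and Tuesday balances would give 220, a figure that corresponds to no real amount of money, which is why the time dimension is handled with an average here.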

What are Semi-additive and factless facts and in which scenario will you use
such kinds of fact tables?

Snapshot facts are semi-additive; when we maintain aggregated facts such as balances
we go for semi-additive facts. Ex: Average daily balance

A fact table without numeric fact columns is called a factless fact table. Ex: Promotion
facts, which record the promotions applied to a transaction (ex: product samples);
this table doesn’t contain any measures.

What are non-additive facts in detail?

A fact may be a measure, a metric or a dollar value. Measures and metrics are
non-additive facts.

A dollar value is an additive fact. If we want to find out the amount for a particular place
for a particular period of time, we can add the dollar amounts and come up with the
total amount.

A non-additive fact is, e.g., the measure height(s) for 'citizens by geographical location':
when we roll up 'city' data to 'state' level data we should not add the heights of the
citizens; rather, we may want to use them to derive a 'count'.



What is conformed fact?

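A conformed fact is a measure that has the same definition and units in every fact
table where it appears, so it can be compared across Data Marts. Similarly, conformed
dimensions are the dimensions which can be used across multiple Data Marts in
combination with multiple fact tables.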

What is a continuously valued fact?

What is Centipede Fact Table?

What is Fact Constellation?

What are the categories of Snapshot Fact Table Grains?

What is a dimension table?

A dimensional table is a collection of hierarchies and categories along which the user
can drill down and drill up. It contains only the textual attributes.

How are the Dimension tables designed?

Most dimension tables are designed using Normalization principles up to 2NF. In
some instances they are further normalized to 3NF.

Find where data for this dimension are located.

Figure out how to extract this data.

Determine how to maintain changes to this dimension (see more on this in the next
section).

Change fact table and DW population routines.

What are the Different methods of loading Dimension tables?

Conventional Load: Before loading the data, all the Table constraints are checked
against the data.

Direct load (faster loading): All the constraints are disabled and the data is loaded
directly. Later the data is checked against the table constraints, and the bad data
won't be indexed.

Can a dimension table contain numeric values?

What is hierarchy relationship in a dimension? Whether it is:
1. 1:1
2. 1: m
3. M: m

What are the different types of dimensions? Explain with examples.

1. Regular Dimensions
2. Shared dimensions


What are the different types of dimension tables? Explain with examples.

Why dimensions are de-normalized in nature?

Can 2 fact tables share same dimension tables?

What is junk dimension?

Junk dimension: Grouping of Random flags and text attributes in a dimension and
moving them to a separate sub dimension.

® A dimension which does not change the grain level is called a junk dimension.

Grain: the lowest level of reporting.

(Or) The junk dimension is simply a structure that provides a convenient place to
store the junk attributes

(Or) A junk dimension is a convenient grouping of flags and indicators.

What are Conformed Dimensions?

A dimension that is used in more than one cube.

® The use of conformed dimensions and shared measures is the primary way a set
of data marts can be united into one consolidated data warehouse.

® Conformed dimensions are dimensions which are common to the cubes (cubes
are the schemas that contain fact and dimension tables).
Consider Cube-1, which contains F1, D1, D2, D3, and Cube-2, which contains F2, D1, D2, D4,
where F denotes the Facts and D the Dimensions. Here D1 and D2 are the Conformed Dimensions.

® Conformed dimensions mean the exact same thing with every possible fact table
to which they are joined. Ex: The Date Dimension is connected to all facts, like Sales facts,
Inventory facts, etc.

What is degenerated dimension?

Degenerate Dimension: Keeping the control information in the Fact table. Ex: Consider
a Dimension table with fields like order number and order line number that has a 1:1
relationship with the Fact table. In this case the dimension is removed and the order
information is stored directly in the Fact table, in order to eliminate unnecessary joins
while retrieving order information.

What is degenerate dimension table?

Degenerate Dimensions: Values in a table which are neither dimensions
nor measures are called degenerate dimensions. Ex: invoice id, empno.

What is Audit dimension? Explain with an example.

What is a Fact Dimension?

What is a Mini Dimension?

What are Role-playing dimensions?

What is a Mystery Dimension?

How do you connect the facts and dimensions in the tables?

1. Smart Matching columns
2. Manually you can link


Which columns go to the fact table and which columns go to the dimension
table?

The Primary Key columns of the Tables (Entities) go to the Dimension Tables as
Foreign Keys.

The Primary Key columns of the Dimension Tables go to the Fact Tables as Foreign
Keys.

What is Associate Table?

What is Bridge Table?

What is a cross reference table?

What is Event-Tracking Table?




What is a lookup table?

A lookup table is the one which is used when updating a warehouse. When the
lookup is placed on the target table (fact table / warehouse) based upon the primary
key of the target, it just updates the table by allowing only new records or updated
records based on the lookup condition.

What is the data type of the surrogate key?

Data type of the surrogate key is either integer or numeric or number.

What is a Schema?

What is a Star Schema?

A star schema is a way of organizing tables so that results can be retrieved from
the database easily and quickly in the warehouse environment. Usually a star
schema consists of one or more dimension tables arranged around a fact table, which
looks like a star; that is how it got its name.
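
A minimal sketch of such a layout, using Python's standard `sqlite3` module with an in-memory database. The table names (`fact_sales`, `dim_product`, `dim_date`), their columns and the sample rows are assumptions made for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INTEGER);
-- The fact table holds the measure plus a foreign key to each dimension.
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    date_key INTEGER REFERENCES dim_date(date_key),
    amount REAL
);
INSERT INTO dim_product VALUES (1, 'Pen'), (2, 'Ink');
INSERT INTO dim_date VALUES (10, 2023), (11, 2024);
INSERT INTO fact_sales VALUES (1, 10, 5.0), (1, 11, 7.0), (2, 11, 3.0);
""")

# Each dimension is one join away from the fact table.
sales_by_product_year = conn.execute("""
    SELECT p.name, d.year, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    JOIN dim_date d ON d.date_key = f.date_key
    GROUP BY p.name, d.year
    ORDER BY p.name, d.year
""").fetchall()
```

Every dimension joins directly to the fact table, so a typical report needs only one join per dimension it uses.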

Differences between star and snowflake schemas?

Star schema: A single fact table with N dimensions.

Snowflake schema: Any schema whose dimensions have extended (further normalized)
dimensions is known as a snowflake schema.

® Star schema - all dimensions are linked directly to a fact table.

Snowflake schema - dimensions may be interlinked or may have one-to-many relationships
with other tables.

What is Snow-Flake Schema?

When do you go for Star Schema, and when do you go for Snow-Flake Schema?

What is the main difference between schema in RDBMS and schemas in Data
Warehouse?

RDBMS Schema
•   Used for OLTP systems
•   Traditional and old schema
•   Normalized
•   Difficult to understand and navigate
•   Cannot solve extract and complex problems
•   Poorly modeled



DWH Schema

•   Used for OLAP systems
•   New generation schema
•   De Normalized
•   Easy to understand and navigate
•   Extract and complex problems can be easily solved
•   Very good model


Why did you choose STAR SCHEMA only? What are the benefits of STAR
SCHEMA?

Because of its de-normalized structure, i.e., Dimension Tables are de-normalized. The
first (and often only) reason to de-normalize is speed. An OLTP structure is
designed for data inserts, updates, and deletes, but not for data retrieval. Therefore, we
can often squeeze some speed out of it by de-normalizing some of the tables and
having queries go against fewer tables. These queries are faster because they
perform fewer joins to retrieve the same record set. Joins are also confusing to many
end users. By de-normalizing, we can present the user with a view of the data that is
far easier for them to understand.

Benefits of STAR SCHEMA:

Far fewer Tables.

Designed for analysis across time.

Simplifies joins.

Less database space.

Supports “drilling” in reports.

Flexibility to meet business and technical needs.

Difference between Snowflake and Star Schema. In what situations is
Snowflake Schema better to use than Star Schema, and when is the opposite
true?

Star schema: It contains the dimension tables mapped around one or more fact tables.
It is a denormalized model. There is no need for complicated joins, so queries return
results quickly.

Snowflake schema: It is the normalized form of the star schema. Because the tables are
split into many pieces, modifications can easily be made directly in the tables, but
queries need deeper, more complicated joins across more tables, so there is some
delay in processing them.

Which is preferable? Star Schema or Snow-Flake Schema?

If you have 2 fact tables connected in the schema, do you know the name of the
schema?

What is Galaxy Schema?

What is Multi-Star Schema?

How do you load the time dimension?

Time dimensions are usually loaded by a program that loops through all possible
dates that may appear in the data. It is not unusual for 100 years to be represented
in a time dimension, with one row per day.
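
Such a loading program can be sketched as a simple loop over calendar days. The column names (`date_key`, `quarter`, and so on) and the choice of a YYYYMMDD integer key are assumptions for this example; a real time dimension usually carries many more attributes, such as fiscal periods and holiday flags.

```python
from datetime import date, timedelta

def build_time_dimension(start, end):
    """Yield one row per calendar day from start to end, inclusive."""
    current = start
    while current <= end:
        yield {
            "date_key": int(current.strftime("%Y%m%d")),  # e.g. 20240101
            "date": current.isoformat(),
            "year": current.year,
            "quarter": (current.month - 1) // 3 + 1,
            "month": current.month,
            "day_of_week": current.strftime("%A"),
        }
        current += timedelta(days=1)

rows = list(build_time_dimension(date(2024, 1, 1), date(2024, 12, 31)))
```

Covering 100 years this way produces roughly 36,500 rows, which is small next to a typical fact table.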

What are slowly changing dimensions?

SCD stands for Slowly Changing Dimensions. Slowly changing dimensions are of
three types:

SCD1: only the updated (current) values are maintained.

Ex: when a customer's address is modified, we update the existing record with the new
address.

SCD2: historical and current information are both maintained by using

A) Effective Date

B) Versions

C) Flags

or a combination of these.

SCD3: historical and current information are maintained by adding new columns to the
target table.

® Type-1: Most Recent Value

Type-2: Full History, using

i) Version Number

ii) Flag

iii) Date

Type-3: Current and one Previous value

® Type 1: data is overwritten, so only the current data is there.

Type 2: current, recent and history data are all there.

Type 3: current and recent (previous) data are there.
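
The contrast between Type 1 and Type 2 can be sketched on a list of dictionaries standing in for a customer dimension. All column names (`surrogate_key`, `version`, `is_current`, `start_date`, `end_date`) are hypothetical, and this sketch combines the effective date, version and flag techniques in one row layout.

```python
from datetime import date

def apply_type1(rows, customer_id, new_address):
    """Type 1: overwrite in place; no history is kept."""
    for row in rows:
        if row["customer_id"] == customer_id:
            row["address"] = new_address

def apply_type2(rows, customer_id, new_address, change_date):
    """Type 2: close the current row and append a new version."""
    current = next(
        r for r in rows if r["customer_id"] == customer_id and r["is_current"]
    )
    current["is_current"] = False
    current["end_date"] = change_date
    rows.append({
        "surrogate_key": max(r["surrogate_key"] for r in rows) + 1,
        "customer_id": customer_id,
        "address": new_address,
        "version": current["version"] + 1,
        "start_date": change_date,
        "end_date": None,
        "is_current": True,
    })

dim_customer = [{
    "surrogate_key": 1, "customer_id": "C1", "address": "Old Street",
    "version": 1, "start_date": date(2020, 1, 1), "end_date": None,
    "is_current": True,
}]
apply_type2(dim_customer, "C1", "New Avenue", date(2024, 6, 1))
```

After the Type 2 change the dimension holds two rows for the same customer: the closed historical row and the new current one, each with its own surrogate key.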

What is BUS Schema?

A BUS Schema is composed of a master suite of conformed dimensions and
standardized definitions of facts.

What is hybrid slowly changing dimension?

What are Critical columns?

What is a surrogate key? Why is it used? What is its need? Give an example.

Explain in detail what do you mean by Slicing and Dicing?

Slicing and dicing refers to the ability to combine and re-combine the dimensions to
see different slices of the information. Picture slicing a three-dimensional cube of
information, in order to see what values are contained in the middle layer. Dicing is
the ability to view the cube from different perspectives. Slicing and dicing a cube
allows an end-user to do the same thing with multiple dimensions.
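
A toy illustration of the two operations, with the cube held as a Python dictionary keyed by (product, region, year). The data and the function names are hypothetical, and dicing is shown here in the common OLAP sense of selecting a sub-cube over several dimensions, which is an assumption that goes slightly beyond the description above.

```python
# Hypothetical cube cells keyed by (product, region, year) -> sales
cube = {
    ("Pen", "East", 2023): 10, ("Pen", "West", 2023): 20,
    ("Pen", "East", 2024): 30, ("Ink", "East", 2024): 5,
}
DIMENSIONS = ("product", "region", "year")

def slice_cube(cells, dimension, value):
    """Slice: fix one dimension to a single value."""
    i = DIMENSIONS.index(dimension)
    return {k: v for k, v in cells.items() if k[i] == value}

def dice_cube(cells, **allowed):
    """Dice: keep cells whose members fall in the allowed sets."""
    def keep(key):
        return all(
            key[DIMENSIONS.index(d)] in values for d, values in allowed.items()
        )
    return {k: v for k, v in cells.items() if keep(k)}

year_2024 = slice_cube(cube, "year", 2024)
east_pens = dice_cube(cube, product={"Pen"}, region={"East"})
```
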


What is a Measure? What are the types of Measures?

How can you create Measures & Dimensions?

Can we group a measure?

What do you mean by Multi-dimensional Analysis?

What is a Grain?

What is Drill-up, Drill-down & Drill-Across?

Differentiate between Level and Category?

Level is a logical subdivision of a dimension

e.g.: if orderdate is a dimension, the levels are year, quarter, month, week, day etc.

Categories are the different instances of a level

E.g. if year is a level, the categories are 1996, 1997, 1998 etc.
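
To make the distinction concrete, the sketch below derives the levels of a hypothetical order date dimension and then collects the categories found at the year level. The `levels` helper and the sample dates are assumptions for illustration.

```python
from datetime import date

def levels(order_date):
    """Return the value of each level of the order date dimension."""
    quarter = (order_date.month - 1) // 3 + 1
    return {
        "year": order_date.year,
        "quarter": f"Q{quarter}",
        "month": order_date.month,
        "day": order_date.day,
    }

order_dates = [date(1996, 3, 14), date(1997, 7, 1), date(1997, 11, 30)]

# The categories of the "year" level are the distinct years that occur.
year_categories = sorted({levels(d)["year"] for d in order_dates})
```
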

What is a CUBE in data warehousing concept?

Cubes are logical representations of multidimensional data. The edges of the cube
contain dimension members and the body of the cube contains data values.

What is a Virtual Cube?

Difference between filter and condition?

Parameter is the only difference

® The difference between Filter and Condition: a Condition returns true or false. Ex: if
Country = 'India' then ... A Filter will return two types of results:

1. Detail information, which is equivalent to the WHERE clause in a SQL statement

2. Summary information, which is equivalent to the GROUP BY and HAVING clauses in a
SQL statement

® In a filter we just create a parameter on which we can filter the fields, but in a
condition we can have static functions like: if yes then color it green, if no then
color it red, etc. So here we can create conditions for filtering in the report; that is,
we can apply different filtering functions at the same time by using conditional formatting.

What is snapshot?

You can disconnect the report from the catalog to which it is attached by saving the
report with a snapshot of the data. However, you must reconnect to the catalog if you
want to refresh the data.

What is a linked cube?

A linked cube is a cube in which a subset of the data can be analyzed in great detail.
The linking ensures that the data in the cubes remains consistent.

What is VLDB?

VLDB stands for Very Large Database.

It is an environment or storage space managed by a relational database
management system (RDBMS) consisting of vast quantities of information. In this
definition, VLDB doesn’t refer to the size of the database or the vast amount of
information stored. It refers to the window of opportunity to take a backup of the
database.

Window of opportunity refers to the time interval available for the backup: if the DBA
is unable to take a backup within the specified time, the database is considered a VLDB.



What is batch processing?

What is incremental loading?

Incremental loading means loading only the ongoing changes from the OLTP system.
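
One common way to pick up only the changes is to remember a watermark from the previous load. The sketch below assumes each hypothetical OLTP row carries an `updated_at` value (a plain integer here) and that the warehouse is a dictionary keyed by `id`; these are illustration choices, not a description of any specific tool.

```python
def incremental_load(source_rows, target, last_loaded_at):
    """Copy only rows changed after the previous load; return new watermark."""
    watermark = last_loaded_at
    for row in source_rows:
        if row["updated_at"] > last_loaded_at:
            target[row["id"]] = row  # insert or overwrite by key
            watermark = max(watermark, row["updated_at"])
    return watermark

oltp_orders = [
    {"id": 1, "status": "shipped", "updated_at": 5},
    {"id": 2, "status": "new", "updated_at": 12},
    {"id": 3, "status": "new", "updated_at": 15},
]
warehouse = {1: {"id": 1, "status": "shipped", "updated_at": 5}}
new_watermark = incremental_load(oltp_orders, warehouse, last_loaded_at=10)
```

The returned watermark is stored and passed in as `last_loaded_at` on the next run, so rows already loaded are skipped.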

Explain the advantages of RAID 1, 1/0, and 5. What type of RAID setup would
you put your TX logs on?

Transaction logs write sequentially and don't need to be read at all. The ideal is to
have each on RAID 1/0 because it has much better write performance than RAID 5.

RAID 1 is also better for TX logs and costs less than 1/0 to implement. It has a tad
less reliability and performance is a little worse generally speaking.

RAID 5 is best for data generally because of cost and the fact it provides great read
capability.

What is BAS? What is the function?

® The Business Application Support (BAS) functional area at SLAC provides
administrative computing services to the Business Services Division and Human
Resources Department. We are responsible for software development and
maintenance of the PeopleSoft applications and consultation to customers with their
computer-related tasks.

® BAS is also called Broadcast Agent Server. Its function is to run the scheduled
jobs or reports, which can be monitored using the Broadcast Agent Console.

What are modeling tools available in the Market?

There are a number of data modeling tools

Tool Name           Company Name
Erwin               Computer Associates
Embarcadero         Embarcadero Technologies
Rational Rose       IBM Corporation
Power Designer      Sybase Corporation
Oracle Designer     Oracle Corporation

What are the various Reporting tools in the Market?

1. MS-Excel

2. Business Objects (Crystal Reports)

3. Cognos (Impromptu, Power Play)

4. Microstrategy

5. MS reporting services
6. Informatica Power Analyzer

7. Actuate

8. Hyperion (BRIO)

9. Oracle Express OLAP

10. Proclarity

® Some of the standard Business Intelligence tools in the market According to their
performance

1) MICROSTRATEGY
2) BUSINESS OBJECTS, CRYSTAL REPORTS
3) COGNOS REPORT NET
4) MS-OLAP SERVICES
Or
1. Seagate Crystal report

2. SAS

3. Business objects

4. Microstrategy

5. Cognos

6. Microsoft OLAP

7. Hyperion

8. Microsoft integrated services and some more.

What are the various ETL tools in the Market?

Various ETL tools used in market are:

Informatica.

Data Stage.

Oracle Warehouse Builder.

Ab Initio.

Data Junction.

Name some of the real time data-warehousing tools?

What is Outsourcing, Offshoring & Insourcing, and what is the difference
between them?

Outsourcing is not strictly IT. Any function of an organization that is executed by
non-employees is essentially an Outsourced task.

Insourcing is the use of external resources (not employees of the Organization) to
accomplish some function, but they are predominately carrying out the function at
the client’s site. So, the function is “sourced” but not “out” sourced. These resources
are also typically managed more closely by the client directly with little management
involvement from the supplier.

Offshoring is a subset of Outsourcing which is generally understood to involve a
country in which costs remain lower than in the client’s country of operations.

While most Offshoring situations are indeed an example of Outsourcing, for those
companies (HP for example) who now own their offshore operations and have folded
them into the company, the line gets blurred. In other words, Offshoring is not always
outsourcing anymore.

What is ER Diagram?

The Entity-Relationship (ER) model was originally proposed by Peter Chen in 1976
[Chen76] as a way to unify the network and relational database views. Simply stated,
the ER model is a conceptual data model that views the real world as entities and
relationships. A basic component of the model is the Entity-Relationship diagram
which is used to visually represent data objects.

Since Chen wrote his paper the model has been extended, and today it is commonly
used for database design. For the database designer, the utility of the ER model is:

It maps well to the relational model. The constructs used in the ER model can easily
be transformed into relational tables.

It is simple and easy to understand with a minimum of training. Therefore, the model
can be used by the database designer to communicate the design to the end user.

In addition, the model can be used as a design plan by the database developer to
implement a data model in specific database management software.

What Oracle tools can be used to build and design a warehouse?

What Oracle features can be used to optimize my warehouse system?

What is Data Modeling?

Data modeling represents information in terms of entities, attributes and relationships.

It is a visual representation of the information.

What are the different steps for Data Modeling?

1. Define the problem and the scope of the problem.
2. Information gathering.
3. Analysis (normalization).
4. Create a logical data model (independent of platform).
5. Decision about the physical platform, like Oracle or SQL, etc.
6. Create a physical data model, which is platform specific.
7. Database creation.

What is Dimensional Modeling?

Dimensional Modeling is a design concept used by many data warehouse designers
to build their data warehouse. In this design model all the data is stored in two types
of tables: Fact tables and Dimension tables. The fact table contains the
facts/measurements of the business and the dimension table contains the context of
the measurements, i.e., the dimensions on which the facts are calculated.

Data modeling is probably the most labor intensive and time consuming part of the
development process. Why bother, especially if you are pressed for time? A common
response by practitioners who write on the subject is that you should no more build a
database without a model than you should build a house without blueprints. The goal of
the data model is to make sure that all the data objects required by the database are
completely and accurately represented. Because the data model uses easily
understood notations and natural language, it can be reviewed and verified as
correct by the end-users. The data model is also detailed enough to be used by the
database developers as a "blueprint" for building the physical database. The
information contained in the data model will be used to define the relational tables,
primary and foreign keys, stored procedures, and triggers.

A poorly designed database will require more time in the long term. Without careful
planning you may create a database that omits data required to create critical reports,
produces results that are incorrect or inconsistent, and is unable to accommodate
changes in the user's requirements.




What is Logical Modeling?

The Logical Model: In Erwin, the logical model is the version of the model that
represents all of the logical business requirements of an organization. There are
three levels of logical models that are used to capture these requirements:

The Entity Relationship Diagram: A high-level data model that includes all major
entities and relationships. The Entity Relationship Diagram does not contain much
detail and is often used in the initial planning phase.

The Key Based Model: A model that describes major data structures such as
entities, primary keys, and sample attributes.

The Fully Attributed Model: A complete model that includes all required entities,
attributes, key groups, and relationships.

In Erwin, a logical model can be created in conjunction with the physical model, or
independent of the physical model. Logical models can also be derived from other
models using the Derive Model Wizard.

In addition, Erwin supports the definition of model objects in a logical model as
logical only and in a physical model as physical only. These options allow for the
logical model to be fully normalized and for the corresponding physical model to be
de-normalized. Erwin also allows for the automatic conversion of many-to-many and
super type/subtype relationships when you change from a logical model to a physical
model.

What are the types of Dimensional Modeling?

What is Conceptual Modeling?

What is Physical Modeling?

Comparing Logical and Physical Models in a Logical/Physical Model:

In an Erwin logical/physical model, each model that you create automatically
includes both a logical and a physical model. By default, the logical model is closely
related to the physical model. If you make a change in the logical model, the change
is automatically reflected in the physical model and vice-versa.

You can use either the logical model or the physical model to define and document
database structures; although the model you use typically depends on the type of
work you want to perform. You can use the logical model to represent business
information and define business rules in a fully normalized model, while the physical
model supports the needs of the database administrator, who focuses on the
physical implementation of the model in a database.

Comparing Logical and Physical Model Objects:

Most of the objects in the logical model correspond to a related object in the physical
model. For example, the logical model contains entities, attributes, and key groups,
which are represented in the physical model as tables, columns, and indexes,
respectively. The following table compares the logical and physical components in
an Erwin model.

What is Difference between E-R Modeling and Dimensional Modeling?

The basic difference is that E-R modeling has both a logical and a physical model,
while dimensional modeling has only a physical model.

E-R modeling is used for normalizing the OLTP database design.

Dimensional modeling is used for de-normalizing the ROLAP/MOLAP design.

What is Entity, Attribute and Relationship?

Entity: An entity is an object about which an organization wants to maintain information.

E.g.: Employee.

Attribute: An attribute is a property of an entity that holds a piece of that information.

Key attribute: A key attribute consists of one or more attributes of an entity which
uniquely identify the entity. E.g.: a bank account number identifies an account.

Relationship: Defines the association between different entities. A relationship can be
one to one, one to many, many to one, or many to many.

What is meant by De-Normalization?

What is the definition of normalized and denormalized view and what are the
differences between them?

Normalization is the process of removing redundancies.

Denormalization is the process of allowing redundancies.

Why Denormalization is promoted in Universe Designing?

In a relational data model, for normalization purposes, some lookup tables are not
merged into a single table. In dimensional data modeling (star schema), these
tables would be merged into a single table called a DIMENSION table for performance
and for slicing data. Merging the tables into one large dimension table avoids
complex intermediate joins: dimension tables are joined directly to fact
tables. Though redundancy of data occurs in the DIMENSION table, the size of a
DIMENSION table is only about 15% of that of the FACT table. That is why
denormalization is promoted in Universe Designing.

What is Cardinality?

What is Referential Integrity?

What are Integrity Constraints?

What is the difference between view and materialized view?

View - stores the SQL statement in the database and lets you use it as a table. Every
time you access the view, the SQL statement executes.

Materialized view - stores the results of the SQL statement in table form in the
database. The SQL statement executes only when the materialized view is created or
refreshed; after that, every time you run the query, the stored result set is used.
The main advantage is quick query results; the trade-off is that the stored results
can be out of date until the next refresh.
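
The difference can be sketched with Python's built-in sqlite3 module. SQLite has no
materialized views, so the stored result set is only emulated here with
CREATE TABLE ... AS SELECT. The names sales, v_total and mv_total are made up for
this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (item TEXT, amount INTEGER);
    INSERT INTO sales VALUES ('pen', 10), ('ink', 5);

    -- A view stores only the SELECT statement.
    CREATE VIEW v_total AS SELECT SUM(amount) AS total FROM sales;

    -- Stand-in for a materialized view: the result rows themselves are stored.
    CREATE TABLE mv_total AS SELECT SUM(amount) AS total FROM sales;
""")

# New data arrives after both objects were created.
conn.execute("INSERT INTO sales VALUES ('pad', 20)")

# The view re-runs its query on every access and sees the new row.
view_total = conn.execute("SELECT total FROM v_total").fetchone()[0]

# The stored result is not recomputed until it is refreshed.
snapshot_total = conn.execute("SELECT total FROM mv_total").fetchone()[0]

print(view_total, snapshot_total)
```

In Oracle, a materialized view is brought up to date by a refresh rather than by
re-creating it as this sketch would have to.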

What is Normalization, First Normal Form, Second Normal Form, Third Normal
Form?

1. Normalization is the process of assigning attributes to entities. It:

   •   Reduces data redundancies
   •   Helps eliminate data anomalies
   •   Produces controlled redundancies to link tables

2. Normalization is the analysis of functional dependency between attributes / data
items of user views. It reduces a complex user view to a set of small and stable
subgroups of fields / relations.

1NF: Repeating groups must be eliminated, dependencies can be identified, and all
key attributes are defined.

2NF: The table is already in 1NF and includes no partial dependencies, that is, no
attribute is dependent on a portion of the primary key. It is still possible for the
table to exhibit transitive dependency: attributes may be functionally dependent on
non-key attributes.

3NF: The table is already in 2NF and contains no transitive dependencies.
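
As a rough illustration of removing a partial dependency, the sketch below uses
Python's built-in sqlite3 module. The tables order_item_flat, product and order_item
and their data are hypothetical and chosen only for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- 1NF only: product_name depends on product_id alone, which is just
    -- part of the key (order_id, product_id). That is a partial dependency.
    CREATE TABLE order_item_flat (
        order_id INTEGER, product_id INTEGER, product_name TEXT, qty INTEGER,
        PRIMARY KEY (order_id, product_id)
    );
    INSERT INTO order_item_flat VALUES
        (1, 10, 'Pen', 2), (1, 11, 'Ink', 1), (2, 10, 'Pen', 5);

    -- 2NF: move the partially dependent attribute into its own table.
    CREATE TABLE product (product_id INTEGER PRIMARY KEY, product_name TEXT);
    CREATE TABLE order_item (
        order_id INTEGER,
        product_id INTEGER REFERENCES product (product_id),
        qty INTEGER,
        PRIMARY KEY (order_id, product_id)
    );
    INSERT INTO product
        SELECT DISTINCT product_id, product_name FROM order_item_flat;
    INSERT INTO order_item
        SELECT order_id, product_id, qty FROM order_item_flat;
""")

flat_rows = conn.execute(
    "SELECT order_id, product_id, product_name, qty FROM order_item_flat "
    "ORDER BY order_id, product_id").fetchall()

# Joining the two normalized tables reproduces the original rows.
joined_rows = conn.execute(
    "SELECT oi.order_id, oi.product_id, p.product_name, oi.qty "
    "FROM order_item oi JOIN product p ON p.product_id = oi.product_id "
    "ORDER BY oi.order_id, oi.product_id").fetchall()

# The product name is now stored once instead of once per order line.
pen_copies = conn.execute(
    "SELECT COUNT(*) FROM product WHERE product_name = 'Pen'").fetchone()[0]
```

Because neither resulting table has a non-key attribute that depends on another
non-key attribute, this small example also ends up in 3NF.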

What is a Table space? What does it contain?

What is a Composite Key or Concatenated Key? What is its use?

What are Unique Identifiers?

What is an Index? What are the types of Indexes?

What do you mean by Partitioned Indexes?

What is partitioning? What are the methods of partitioning?

What is Parallelism?

What are the advantages and disadvantages of reporting directly against the
database? Do you always need to copy the data before reporting on it?
(Example: real-time and on-demand reporting is a requirement)

There is no need to copy the data before reporting on it as long as the data is
clean. If the data is not clean, it should be cleansed first, which calls for an ETL
process.

Advantage of reporting directly against the database (OLTP): There is no need to
maintain a separate database for reporting, so space consumption is reduced.

Disadvantage of reporting directly against the database (OLTP): It slows down the
online application, because an OLTP system is designed for online transactions,
whereas a data warehouse style analysis of the same data runs long queries.

What are the most frequent data errors that slow down the data input process?

Data mining is the process of data selection, exploration and building models
using vast data stores to uncover previously unknown patterns. What does
this mean to you?

You can produce new knowledge to better inform decision makers before they act.
Build a model of the real world based on data collected from a variety of sources,
including corporate transactions, customer histories and demographics, even
external sources such as credit bureaus. Then use this model to produce patterns in
the information that can support decision making and predict new business
opportunities. Text mining capabilities enable you to apply such analyses to text-
based documents. With SAS's rich suite of text processing and analysis tools, you
can uncover underlying themes or concepts contained in large document collections,
group documents into topical clusters, classify documents into predefined categories
and integrate text data with structured data for enriched predictive modeling
endeavors.

Before you begin, you should know the answers to the following questions.

    •   What is Data?
    •   What is a Database?
    •   What is an RDBMS?
    •   What is a Data Model?
    •   Why do we follow Normalization while designing a data model?
    •   What is an OLTP system?

WHAT IS DATA WAREHOUSING:

    •  A data warehouse is a relational database that is designed for query and
       analysis rather than for transaction processing. It usually contains historical
       data derived from transaction data, but it can include data from other sources.
       It separates analysis workload from transaction workload and enables an
       organization to consolidate data from several sources.
    • In addition to a relational database, a data warehouse environment includes
       an extraction, transportation, transformation, and loading (ETL) solution, an
       online analytical processing (OLAP) engine, client analysis tools, and other
       applications that manage the process of gathering data and delivering it to
       business users.
    •  A data warehouse is a complete set of subject-oriented, integrated, time-variant
       and nonvolatile data that helps the business take organizational decisions.

Subject Oriented

Data warehouses are designed to help you analyze data. For example, to learn more
about your company's sales data, you can build a warehouse that concentrates on
sales. Using this warehouse, you can answer questions like "Who was our best
customer for this item last year?" This ability to define a data warehouse by subject
matter, sales in this case, makes the data warehouse subject oriented.

Integrated

Integration is closely related to subject orientation. Data warehouses must put data
from disparate sources into a consistent format. They must resolve such problems
as naming conflicts and inconsistencies among units of measure. When they achieve
this, they are said to be integrated.

Nonvolatile

Nonvolatile means that, once entered into the warehouse, data should not change.
This is logical because the purpose of a warehouse is to enable you to analyze what
has occurred.

Time Variant

In order to discover trends in business, analysts need large amounts of data. This is
very much in contrast to online transaction processing (OLTP) systems, where
performance requirements demand that historical data be moved to an archive. A
data warehouse's focus on change over time is what is meant by the term time
variant.

When should an organization create a Data Warehouse?

An organization should create one once it has so much information that it becomes
too difficult to get the meaningful information the business needs to take strategic
decisions. The decisions we make using data warehouse data affect the entire
organization instead of one customer or one employee. Examples of decisions we
make with a DW are: should we continue with specific product offerings to our
customers or not? Should we move the customer support department to a different
location to save costs?

Data warehouses and OLTP systems have very different requirements. Here are
some examples of differences between typical data warehouses and OLTP systems:

   •   Workload

       Data warehouses are designed to accommodate ad hoc queries. You might
       not know the workload of your data warehouse in advance, so a data
       warehouse should be optimized to perform well for a wide variety of possible
       query operations.

       OLTP systems support only predefined operations. Your applications might be
       specifically tuned or designed to support only these operations.

  •   Data modifications

      A data warehouse is updated on a regular basis by the ETL process (run
      nightly or weekly) using bulk data modification techniques. The end users of a
      data warehouse do not directly update the data warehouse.

      In OLTP systems, end users routinely issue individual data modification
      statements to the database. The OLTP database is always up to date, and
      reflects the current state of each business transaction.

  •   Schema design

      Data warehouses often use denormalized or partially denormalized schemas
      (such as a star schema) to optimize query performance.

      OLTP systems often use fully normalized schemas to optimize
      update/insert/delete performance, and to guarantee data consistency.

  •   Typical operations

      A typical data warehouse query scans thousands or millions of rows. For
      example, "Find the total sales for all customers last month."

      A typical OLTP operation accesses only a handful of records. For example,
      "Retrieve the current order for this customer."

  •   Historical data

      Data warehouses usually store many months or years of data. This is to
      support historical analysis.

      OLTP systems usually store data from only a few weeks or months. The
      OLTP system stores only historical data as needed to successfully meet the
      requirements of the current transaction.
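
The contrast in typical operations can be sketched with Python's built-in sqlite3
module. The orders table and its rows are hypothetical, and a real warehouse query
would scan far more rows than this toy example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer TEXT, order_month TEXT, amount INTEGER
    );
    INSERT INTO orders VALUES
        (1, 'ABC Inc.', '2007-01', 100),
        (2, 'XYZ Ltd.', '2007-01', 250),
        (3, 'ABC Inc.', '2007-02', 300);
""")

# OLTP style: fetch a handful of records through a key.
current_order = conn.execute(
    "SELECT order_id, amount FROM orders WHERE order_id = ?", (3,)).fetchone()

# Warehouse style: scan many rows and summarize them.
monthly_totals = conn.execute(
    "SELECT order_month, SUM(amount) FROM orders "
    "GROUP BY order_month ORDER BY order_month").fetchall()

print(current_order)
print(monthly_totals)
```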

END USER OF APPLICATION:

                 What do you mean by end user in an OLTP system?


  •   An end user is someone who enters data into the system or reads a particular
      report from it.
  •   A bank teller enters the account number to see the balance, deposit a
      cheque, etc.
  •   A customer representative must see the customer's information to be
      more effective.

                 What kind of information does management want to know? (DW data
                    is primarily used by management.)

   •   Which are our lowest/highest margin customers?
   •   What is the most effective distribution channel?
   •   What product promotions have the biggest impact on revenue?
   •   What impact will new products/services have on revenue and margins?
   •   Which customers are most likely to go to the competition?
   •   Who are my customers and what products are they buying?

In OLTP applications, end users are individuals who take care of day-to-day
operations.

In DW applications, end users are managers and above who take decisions based
on trends, history, predictions, etc.

If end users are not satisfied with the application, the product is considered a
failure even if it is a great achievement technology-wise.

Data Warehouse Architecture:




Source Data:
An organization will have many OLTP applications; all of this operational data
becomes the source for the Data Warehouse database.

ETL: (Extract, Transform and Load)

We extract data from various operational systems and clean the data so that we keep
only the information that makes sense to have in the Data Warehouse. While cleansing
the data we may reject some records or fill in missing information. Once we have
transformed the operational data into the format the DW expects, we load the
data into the DW. This process takes most of the time while developing DW applications.

DW Database

This is the area where we store the data required by the business so that
they can run any report against it. In data warehouses we have current
and historical information, which is very useful for trend analysis, behavioral
analysis, etc.

What is Data Mart?

A data mart is a simple form of a data warehouse that is focused on a single subject
(or functional area), such as Sales or Finance or Marketing. Data marts are often
built and controlled by a single department within an organization. Given their single-
subject focus, data marts usually draw data from only a few sources. The sources
could be internal operational systems, a central data warehouse, or external data.

Difference between Data Warehouse and Data Mart

   Data Warehouse                           Data Mart
   Enterprise-wide                          Departmental
   Structure for corporate view of data     Star schema based (facts and dimensions)
   Organized E-R model or galaxy of stars   Quick turnaround (up and running quickly,
   (multiple star schemas in the data       as there are fewer stakeholders)
   model)
   Long turnaround time

Data Granularity
What is the granularity of your DW?

    •   Granularity is the level of detail we want to store in the data warehouse.
    •   For a retail store, Point of Sale (POS) is the lowest-granularity information
        available.
    •   For banking, it is the account-level detail based on every day's transactions.
    •   As a DSS leans towards analyzing the data as a whole, the data warehouse
        will not necessarily hold all the details down to daily transactions.

Examples of granularity levels:

    •   Daily sales by date, product and customer
    •   Weekly sales by product and customer
    •   Monthly sales by product and customer
    •   Quarterly sales by product and customer
    •   Yearly sales by product and customer

Usually in Data Warehouses (EDW) we tend to keep POS-level data, whereas in Data
marts we keep it aggregated by week or month, so that we never lose the
detailed information. This detail-level data can be used to study the micro behaviors
of our customers (especially in Data Mining).

Data Warehousing Objects:

    Data warehousing consists of only two kinds of objects

                   Fact
                   Dimension

Fact Tables:

A fact table typically has two types of columns: those that contain numeric facts
(often called measurements), and those that are foreign keys to dimension tables. A
fact table contains either detail-level facts or facts that have been aggregated. Fact
tables that contain aggregated facts are often called summary tables. A fact table
usually contains facts with the same level of aggregation. Though most facts are
additive, they can also be semi-additive or non-additive. Additive facts can be
aggregated by simple arithmetical addition. A common example of this is sales. Non-
additive facts cannot be added at all. An example of this is averages. Semi-additive
facts can be aggregated along some of the dimensions and not along others. An
example of this is inventory levels, where you cannot tell what a level means simply
by looking at it.
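
A minimal sketch of the additive versus semi-additive distinction, using Python's
built-in sqlite3 module. The daily_fact table, its columns and its numbers are
assumptions made only for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE daily_fact (
        day TEXT, store TEXT,
        sales_amount INTEGER,   -- additive measure
        stock_on_hand INTEGER   -- semi-additive measure
    );
    INSERT INTO daily_fact VALUES
        ('2007-01-01', 'Dallas', 100, 40),
        ('2007-01-01', 'Plano',   50, 10),
        ('2007-01-02', 'Dallas',  70, 30),
        ('2007-01-02', 'Plano',   80, 20);
""")

# Sales can be summed across every dimension.
total_sales = conn.execute(
    "SELECT SUM(sales_amount) FROM daily_fact").fetchone()[0]

# Stock can be summed across stores for a single day ...
stock_by_day = conn.execute(
    "SELECT day, SUM(stock_on_hand) FROM daily_fact "
    "GROUP BY day ORDER BY day").fetchall()

# ... but across days it is usually averaged (or the last value is taken),
# because adding one day's stock to the next day's counts items twice.
avg_stock_by_store = conn.execute(
    "SELECT store, AVG(stock_on_hand) FROM daily_fact "
    "GROUP BY store ORDER BY store").fetchall()
```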

Dimension Tables:

A dimension is a structure, often composed of one or more hierarchies, that
categorizes data. Dimensional attributes help to describe the dimensional value.
They are normally descriptive, textual values. Several distinct dimensions, combined
with facts, enable you to answer business questions. Commonly used dimensions
are customers, products, and time.

Dimension data is typically collected at the lowest level of detail and then aggregated
into higher level totals that are more useful for analysis. These natural rollups or
aggregations within a dimension table are called hierarchies.

Hierarchies:

Hierarchies are logical structures that use ordered levels as a means of organizing
data. A hierarchy can be used to define data aggregation. For example, in a time
dimension, a hierarchy might aggregate data from the month level to the quarter
level to the year level. A hierarchy can also be used to define a navigational drill path
and to establish a family structure.

Within a hierarchy, each level is logically connected to the levels above and below it.
Data values at lower levels aggregate into the data values at higher levels. A
dimension can be composed of more than one hierarchy. For example, in the
product dimension, there might be two hierarchies--one for product categories and
one for product suppliers.

Dimension hierarchies also group levels from general to granular. Query tools use
hierarchies to enable you to drill down into your data to view different levels of
granularity. This is one of the key benefits of a data warehouse.

When designing hierarchies, you must consider the relationships in business
structures. For example, a divisional multilevel sales organization.

Hierarchies impose a family structure on dimension values. For a particular level
value, a value at the next higher level is its parent, and values at the next lower level
are its children. These familial relationships enable analysts to access data quickly.
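
The roll-up and drill-down behavior described above can be sketched as a small star
schema in Python's built-in sqlite3 module. The tables time_dim and sales_fact, the
helper sales_by and all data values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Time dimension: each row carries its parent levels.
    CREATE TABLE time_dim (
        time_key INTEGER PRIMARY KEY,
        month TEXT, quarter TEXT, year INTEGER
    );
    INSERT INTO time_dim VALUES
        (1, '2006-11', '2006-Q4', 2006),
        (2, '2006-12', '2006-Q4', 2006),
        (3, '2007-01', '2007-Q1', 2007);

    -- Fact table: a foreign key to the dimension plus a numeric measure.
    CREATE TABLE sales_fact (
        time_key INTEGER REFERENCES time_dim (time_key),
        amount INTEGER
    );
    INSERT INTO sales_fact VALUES (1, 100), (2, 150), (3, 200), (3, 50);
""")

def sales_by(level):
    """Total sales at one hierarchy level: 'month', 'quarter' or 'year'."""
    if level not in ("month", "quarter", "year"):
        raise ValueError("unknown level: " + level)
    sql = (
        "SELECT t.{0}, SUM(f.amount) FROM sales_fact f "
        "JOIN time_dim t ON t.time_key = f.time_key "
        "GROUP BY t.{0} ORDER BY t.{0}"
    ).format(level)
    return conn.execute(sql).fetchall()

by_year = sales_by("year")        # top of the hierarchy
by_quarter = sales_by("quarter")  # drill down one level
by_month = sales_by("month")      # lowest level in this sketch
```

The level name is checked against a fixed list before it is placed in the SQL text,
since column names cannot be passed as bound parameters.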

[Figure: time hierarchy with the levels YEAR, QUARTER, MONTH and WEEK]

How to handle Slowly Changing Dimensions (SCDs) in data model design?

Posted by Dylan Wan on January 13, 2007

There are multiple methods to handle slowly changing dimensions. Which
technique to use depends on your business requirements. The choice among these
three methods is not purely a technical design decision, since their behaviors are
different.

Type One: Overwrite the old data with new data

Using this method, you do not store the history. For example, say that each customer
can have one salesrep at any given point in time. When the salesrep of ABC Inc.
changes from Sandy to Laura, the fact that Sandy was a salesrep of ABC is not kept
anywhere. Any report by salesrep will assume that Laura has been the salesrep of
ABC Inc. forever and will count all the sales done by Sandy as Laura's.

The above example may not seem to make business sense. However, if you only
report the sales of the current period, and the salesrep does not change during the
period, this method is acceptable.

Many OLTP tables do not need to track the history of changes, and thus this
method may be used by the source application. However, if you want to report on
historical data, even if your OLTP system does not track history, the data warehouse
can still use other methods to track the history.

Type Two: Add a new record at the time of the change

Using this method, all prior history is saved. There are alternative methods to
model the key of this table.

Method A – No surrogate key – Use timestamp

When a change happens, a new record is added to the table. All the attributes are
copied from the previous record except the changed values. The natural key is
copied as well, so the timestamp is used to differentiate the records.

When a fact table is joined with the dimension, if you are interested in the historical
data, the timestamp is used as part of the join condition. To ease the join, the
record typically uses two date columns: the effective start date and the effective end
date.

Method B – No surrogate key – Use version number

Instead of using the date column, a version number is used to differentiate the
different versions of the records.

This technique requires the fact table to store both the natural key and the version
number to retrieve a given version of the dimension data.

Method C – Use a surrogate key

When an attribute changes, a new record is added with a sequence-generated key, and
the fact table uses this key column as the foreign key.
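
A minimal sketch of Type Two with a surrogate key (Method C), using Python's built-in
sqlite3 module and the Sandy/Laura example from above. The table layout, the
is_current flag and the helper change_salesrep are assumptions made for this
illustration, not part of the original post.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer_dim (
        customer_key INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
        customer_id  TEXT,                               -- natural key
        salesrep     TEXT,
        is_current   INTEGER                             -- assumed helper flag
    );
    CREATE TABLE sales_fact (
        customer_key INTEGER REFERENCES customer_dim (customer_key),
        amount INTEGER
    );
""")

def change_salesrep(customer_id, new_rep):
    """Type Two change: retire the current row and add a new version."""
    conn.execute(
        "UPDATE customer_dim SET is_current = 0 "
        "WHERE customer_id = ? AND is_current = 1", (customer_id,))
    cur = conn.execute(
        "INSERT INTO customer_dim (customer_id, salesrep, is_current) "
        "VALUES (?, ?, 1)", (customer_id, new_rep))
    return cur.lastrowid  # the new surrogate key

sandy_key = change_salesrep("ABC", "Sandy")
conn.execute("INSERT INTO sales_fact VALUES (?, 100)", (sandy_key,))
laura_key = change_salesrep("ABC", "Laura")
conn.execute("INSERT INTO sales_fact VALUES (?, 40)", (laura_key,))

# Each fact row points at the dimension version that was current when it
# was loaded, so Sandy's sales are not counted as Laura's.
sales_by_rep = conn.execute(
    "SELECT d.salesrep, SUM(f.amount) FROM sales_fact f "
    "JOIN customer_dim d ON d.customer_key = f.customer_key "
    "GROUP BY d.salesrep ORDER BY d.salesrep").fetchall()
```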

Type Three: Track changes using a separate column

Using this method, you use a separate column of the dimension table to store the
value of the previous year, in addition to the current year's data.

This method does not track all the history, just one prior version.

If the data is changed, the old value needs to be moved from the current value column
to the prior column, and the new value overwrites the current column.

This method is used when the changes are not random but occur at a predefined
interval, such as annually.
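
The move from the current column to the prior column can be sketched with Python's
built-in sqlite3 module. The column names current_salesrep and prior_salesrep, the
helper change_salesrep and the third salesrep Maria are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer_dim (
        customer_id      TEXT PRIMARY KEY,
        current_salesrep TEXT,
        prior_salesrep   TEXT
    );
    INSERT INTO customer_dim VALUES ('ABC', 'Sandy', NULL);
""")

def change_salesrep(customer_id, new_rep):
    """Type Three change: shift current to prior, then overwrite current."""
    # The right-hand sides of SET see the row as it was before the UPDATE,
    # so prior_salesrep receives the old current value.
    conn.execute(
        "UPDATE customer_dim "
        "SET prior_salesrep = current_salesrep, current_salesrep = ? "
        "WHERE customer_id = ?", (new_rep, customer_id))

change_salesrep("ABC", "Laura")
after_first = conn.execute(
    "SELECT current_salesrep, prior_salesrep FROM customer_dim").fetchone()

# A second change pushes Sandy out: only one prior version is kept.
change_salesrep("ABC", "Maria")
after_second = conn.execute(
    "SELECT current_salesrep, prior_salesrep FROM customer_dim").fetchone()
```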

Structured Query Language

SQL is a database language used to create, manipulate and control access to
database objects. SQL is a non-procedural language used to access relational
databases. It is a flexible, efficient language with features designed to manipulate
and examine relational data.

SQL is used only for the definition and manipulation of database objects. It cannot
be used for application development, such as form definitions or the creation of
procedures. For that you need a 3GL such as COBOL or a 4GL such as dBase to
provide front-end support to the database.

Key features of SQL are:

   •   Non-procedural language
   •   Unified language
   •   Common language for all relational databases (syntax may vary
       between different RDBMSs)

SQL is made up of three sub-languages:

   •   Data Definition Language (DDL)
   •   Data Manipulation Language (DML)
   •   Data Control Language (DCL)

Data Definition Language (DDL): allows you to define database objects at the
conceptual level. It consists of commands to create objects and alter the structure of
objects, such as tables, views and indexes. Commonly used DDL statements are
CREATE, ALTER and DROP.

If you want to create a table STUDENT, use the following syntax:

CREATE TABLE STUDENT
( STUDENT_ID INTEGER PRIMARY KEY,
STUDENT_NM VARCHAR(30),
COURSE_ID VARCHAR(15),
PHONE VARCHAR(12),
ADDRESS VARCHAR(50) );

To drop a table from the database

DROP TABLE STUDENT;

Data Manipulation Language (DML): allows you to retrieve or update data within a
database. It is used for querying, inserting, deleting and updating information stored
in databases. E.g.: SELECT, INSERT, UPDATE, DELETE.

The examples that follow use this STUDENT table:

STUDENT_ID  STUDENT_NM  COURSE_ID     PHONE         ADDRESS
1001        JAMES       Oracle        972-888-9018  888, North Central Exp, Dallas, TX- 75089
1002        JIM         MSSql Server  972-678-8909  567, Preston Road, Dallas, TX - 75240
1003        BRUCE       Java          214-571-1567  1234, Elm Street, Dallas, TX - 75039

Select statement:
The SELECT statement is used to display certain data from the table. For
example, suppose you want to know what course Jim is taking. The SELECT statement
fetches the information you want by using the information you have. In this
scenario, the information you have is student_nm = 'JIM' and the information you
want is course_id; the intersection of those two columns in the table is what you
are looking for.

SELECT (what you want)
FROM (which tables)
WHERE (what you have)

The SELECT statement to find Jim's course_id looks like this:

SELECT COURSE_ID
FROM STUDENT
WHERE STUDENT_NM = 'JIM'

You will get the result as:

COURSE_ID
MSSql Server

If you want to see all the rows in the table then your select will be:

SELECT * FROM STUDENT;

If you would like to show the student_nm and address of whoever is attending the
Oracle course in the form of a report, your select will look like:

SELECT STUDENT_NM, ADDRESS
FROM STUDENT
WHERE COURSE_ID = 'Oracle'

The result will be

STUDENT_NM ADDRESS
JAMES           888, North Central Exp, Dallas, TX- 75089

Insert Statement

The INSERT statement is used to insert a new row into the table. For example, if a
new student DAVE is joining the Java course, use the INSERT statement:

INSERT INTO STUDENT (STUDENT_ID, STUDENT_NM, COURSE_ID, PHONE, ADDRESS)
VALUES (1004, 'DAVE', 'Java', '972-912-4008', '567, Washington Ave, Dallas - 75543')

After executing the insert statement, the table should look like this when you
select from the student table:

STUDENT_ID  STUDENT_NM  COURSE_ID     PHONE         ADDRESS
1001        JAMES       Oracle        972-888-9018  888, North Central Exp, Dallas, TX- 75089
1002        JIM         MSSql Server  972-678-8909  567, Preston Road, Dallas, TX - 75240
1003        BRUCE       Java          214-571-1567  1234, Elm Street, Dallas, TX - 75039
1004        DAVE        Java          972-912-4008  567, Washington Ave, Dallas - 75543

Update Statement

The UPDATE statement is used to change existing information in the table. For
example, if DAVE moved to another address, we need to change the ADDRESS column of
DAVE's record. If the new address is 146, Dallas Parkway, Dallas - 75240, then your
update should be:

UPDATE STUDENT SET ADDRESS = '146, Dallas Parkway, Dallas - 75240'
WHERE STUDENT_NM = 'DAVE'

To make sure you updated the ADDRESS column for DAVE, issue the following SQL:

SELECT * FROM STUDENT WHERE STUDENT_NM = 'DAVE'

You should then see the following result:

STUDENT_ID  STUDENT_NM  COURSE_ID  PHONE         ADDRESS
1004        DAVE        Java       972-912-4008  146, Dallas Parkway, Dallas - 75240

Delete Statement

The DELETE statement is used to remove rows from the table. For example, JAMES
moved to a different city and does not want to take the course. To remove JAMES's
record from the table we use the DELETE statement:

DELETE FROM STUDENT
WHERE STUDENT_NM = 'JAMES'

Once you delete the record and select all the information from the student table,
you should see the following:

STUDENT_ID  STUDENT_NM  COURSE_ID     PHONE         ADDRESS
1002        JIM         MSSql Server  972-678-8909  567, Preston Road, Dallas, TX - 75240
1003        BRUCE       Java          214-571-1567  1234, Elm Street, Dallas, TX - 75039
1004        DAVE        Java          972-912-4008  146, Dallas Parkway, Dallas - 75240

If you do not include a WHERE clause in the DELETE statement, it will remove all the
rows from the table.

Data Control Language (DCL)
One of the main advantages of an RDBMS is the security of the data in the database.
You can allow a user to perform a specific operation, or all operations, on certain
objects. Examples of DCL statements are GRANT and REVOKE.

GRANT is used to grant a permission to a user so that the user can perform that
operation.
REVOKE is used to take back that permission from that user on that object.

For example, we have two users, JAMES and DAVID.
If JAMES creates a table called ITEMS, JAMES becomes the owner of that table.
DAVID cannot access the ITEMS table because he is not the owner of that table.
DAVID can access ITEMS if JAMES gives him permission on the table.
JAMES can give DAVID different types of access on the ITEMS table, such as Select,
Update, Delete and Insert.
For example:
If JAMES wants to provide only Select on ITEMS to DAVID, he can issue:
GRANT SELECT ON ITEMS TO DAVID

If JAMES wants to provide only Select and Insert on ITEMS to DAVID, he can
issue: GRANT SELECT, INSERT ON ITEMS TO DAVID

If JAMES wants to provide all the operations on ITEMS to DAVID, he can issue:
GRANT ALL ON ITEMS TO DAVID

Once you provide all permissions on an object to a user, that user can do any
manipulation of the table's data, much as the owner can, although ownership itself
stays with the user who created the table.

Oracle datatypes

Data in a database is stored in the form of tables. Each table consists of rows and
columns to store the data.
A particular column in a table must contain the same type of data. For example:

PLAYER_NAME (char)  COUNTRY (char)  DATE_OF_BIRTH (date)  ROOM_NO (number)
AGASSI              USA             10/12/1969            1004
WILLIAM             USA             01/15/1975            1006
JIM                 RUSSIA          05/25/1980            1007
HINGIS              SWITZERLAND     06/25/1979            1009

Every column holds a certain type of information: PLAYER_NAME is a char column,
DATE_OF_BIRTH is a date column, and ROOM_NO is a number column.

Different datatypes available in Oracle database:

CHAR: Stores character data, for example the name of a person (you can save
anything in a character field).

VARCHAR: Same as CHAR. The only difference between CHAR and VARCHAR is
the way the database stores the data.

To understand the difference better, take the following example.

CREATE TABLE EMPLOYEE (EMP_NO NUMBER(4), ENAME CHAR(15))

EMP_NO   ENAME
888      CLARK
889      KING
890      DAVID COOPER

As the ENAME column is defined as CHAR(15), every value you put in that column
occupies all 15 bytes; CLARK is a 5-byte string, so the database pads it with 10
spaces.

CREATE TABLE EMPLOYEE (EMP_NO NUMBER(4), ENAME VARCHAR(15))

EMP_NO   ENAME
888      CLARK
889      KING
890      DAVID COOPER

Here ENAME is defined as VARCHAR(15), so it occupies only the required space. In
the above table, the ename CLARK occupies only 5 bytes in the database.

So what are the advantages and disadvantages? The rule of thumb is that if you
are using a character column as a primary key, it is better to make it a CHAR field.
If you are using a column to hold comments, you should use VARCHAR.

NUMBER: Used to store numbers. For example, if you want to store employee
numbers, define the column's data type as NUMBER. If you want a
column to store currency, you can define it as NUMBER(7,2).

DATE: Used to store dates, such as a person's date of birth or the date of joining
a company.

LONG: Stores variable-length character data.

RAW: Stores short binary data of variable length.

LONG RAW: Stores binary data of variable length.

LOB: Large objects, used to store large character or binary data.
In addition, Oracle 8 supports CLOB, BLOB and BFILE.

CLOB - stores large character data. A table can have multiple columns of this type.

BLOB - can store large binary objects such as graphics, video and sound files.

BFILE - stores file pointers to LOBs managed by file systems external to the database.




                                     Constraints

When you bind a business rule to a column in a table, that rule is called a
constraint. Constraints are usually defined while creating the table. For example, if
you cannot have an employee who does not have a name, then the employee name
column in the employee table should be a NOT NULL column. NOT NULL is a
constraint.

The following table shows the constraint types and short descriptions.

Constraint Type  Description
NOT NULL         You must provide a value in that column; you cannot leave the
                 column blank.
PRIMARY KEY      No duplicate values allowed; for example, Empno in the Employee
                 table should be unique.
CHECK            Checks the value and controls the values that can be inserted or
                 updated.
DEFAULT          Assigns a default value if no value is given.
REFERENCES       Maintains referential integrity (foreign key).

Examples of business rules that are usually implemented through constraints:

NOT NULL
If we have a business rule saying that all customers should have a name, we cannot
have any customer without a name. To implement that business rule we can
create the customer table and specify the customer name column as NOT NULL
(constraint).
Example
CREATE TABLE EMPLOYEE (EMPNO NUMBER(4) PRIMARY KEY, ENAME
VARCHAR(15) NOT NULL);

CHECK
A check constraint defines a condition on a column. It consists of the keyword
CHECK followed by a condition in parentheses:

col_name datatype CHECK (col_name in(value1, value2))

Example
If you have a business rule saying that all employees in the organization should get
at least $500, we can use a CHECK constraint while creating the table.

CREATE TABLE EMPLOYEE ( EMPNO NUMBER(4) PRIMARY KEY, ENAME
VARCHAR(15) NOT NULL, SALARY NUMBER(7,2) CHECK (SALARY >= 500) );

DEFAULT
While inserting a row into a table without giving values for every column, SQL must
insert a default value to fill in the excluded columns, or the command will be rejected.
The most common default value is NULL. This can be used with columns not defined
with a NOT NULL.

Default value assigned to a column while creating the table using CREATE TABLE
operation.

Example
CREATE TABLE ITEM (ITEM_ID NUMBER(4) PRIMARY KEY, ITEM_NAME
VARCHAR(15),
ITEM_DESC VARCHAR(100), QOH NUMBER(4) DEFAULT 100)

Assigning a default value 0 for numeric columns makes the computation.

PRIMARY KEY
Primary Key in a table is a unique identifier of a row. For example,if you are
maintaning the customer profiles, you should assign particular number to each one.
So customer_number should be defined as a Primary key in Customer table.
REFERENCES
is a Foreign key. A foreign key column value refers a column in another table to
check whether the value exists or not.

UNIQUE
The values entered into the column must be unique, i.e., no duplicate values may
exist. This constraint enforces business rules that do not allow duplicates.
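For instance, a rule that no two users may share an e-mail address could be sketched as below. The APP_USER table, its columns, and the addresses are hypothetical.

```sql
CREATE TABLE APP_USER (
  USER_ID NUMBER(6) PRIMARY KEY,
  EMAIL   VARCHAR2(60) UNIQUE
);

INSERT INTO APP_USER (USER_ID, EMAIL) VALUES (1, 'a@example.com');

-- Rejected: the same EMAIL value is already stored.
-- Oracle reports ORA-00001 (unique constraint violated).
INSERT INTO APP_USER (USER_ID, EMAIL) VALUES (2, 'a@example.com');
```

Unlike a primary key, a UNIQUE column may still hold NULL unless it is also declared NOT NULL.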

                             Data Definition Language

DDL is the part of the SQL language that creates, alters, and drops database
objects. Examples of database objects are tables, procedures, functions, packages,
etc. When you create or drop a table you are modifying the structure of the
database, and that is why it is called data definition language. When you issue a
CREATE, ALTER, or DROP statement, the database internally performs a commit, and
that is why DDL cannot be included as part of a transaction. Following are a few
DDL statements.
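The implicit commit can be observed with a small experiment. This is a sketch only; DEMO_LOG and DEMO_OTHER are hypothetical table names.

```sql
CREATE TABLE DEMO_LOG (MSG VARCHAR2(20));

INSERT INTO DEMO_LOG (MSG) VALUES ('first');   -- not yet committed

-- Any DDL statement commits the pending INSERT implicitly.
CREATE TABLE DEMO_OTHER (ID NUMBER);

ROLLBACK;                                      -- nothing is left to undo

SELECT COUNT(*) FROM DEMO_LOG;                 -- returns 1: the row survived
```

Because of this behavior, DDL is best kept out of code paths that rely on being able to roll back earlier DML.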

Create table
CREATE TABLE course (
  course_id   NUMBER(5) NOT NULL PRIMARY KEY,
  course_name VARCHAR2(30) NOT NULL,
  start_date  DATE
);

Alter table
ALTER TABLE course MODIFY (start_date DATE NOT NULL);
-- The datatype below is an assumption; the original omitted it.
ALTER TABLE course ADD (instructor_id NUMBER(5) NULL);

Drop table
DROP TABLE course;

Create table with tablespace and storage parameters
CREATE TABLE course (
  course_id   NUMBER(5) NOT NULL PRIMARY KEY,
  course_name VARCHAR(30),
  start_date  DATE
)
TABLESPACE course_info
STORAGE (INITIAL 1024K NEXT 1024 PCTINCREASE 10);

                           Data Manipulation Language

Data manipulation in an RDBMS means maintaining the data in the database. There
are three DML statements: INSERT, UPDATE, and DELETE. The INSERT statement adds a
new record to a table. The UPDATE statement changes existing information in a
table. The DELETE statement removes information from a table.

As an example, suppose you run an apartment complex where you rent out
apartments. The day-to-day records might look like this:

tenant_id   aptno   tenant_name   home_phone     work_phone     apt_rent   no_of_pets
1000        888     SMITH         881-890-9000   767-908-5432   900        1
1001        889     STEVE         881-909-8971   898-543-9032   890        0
1002        890     BILL          781-897-9011   567-891-9108   880        2

INSERT Statement
If a person named JAMES rents an apartment, we need to add his information to
the table. We use an INSERT because the information does not yet exist in the
table. The following values have to be entered into the database:
tenant_name = JAMES, aptno = 891, home_phone = 676-789-9011,
work_phone = 777-567-1234, apt_rent = 880, and no_of_pets = 1.

The INSERT statement can be written as follows:

INSERT INTO TENANT
(tenant_id, aptno, tenant_name, home_phone, work_phone, apt_rent, no_of_pets)
VALUES
(1003, 891, 'JAMES', '676-789-9011', '777-567-1234', 880, 1);

After executing the INSERT statement, the table should have four rows, as shown
below:

tenant_id   aptno   tenant_name   home_phone     work_phone     apt_rent   no_of_pets
1000        888     SMITH         881-890-9000   767-908-5432   900        1
1001        889     STEVE         881-909-8971   898-543-9032   890        0
1002        890     BILL          781-897-9011   567-891-9108   880        2
1003        891     JAMES         676-789-9011   777-567-1234   880        1

The following INSERT syntaxes are available.

Syntax 1
INSERT INTO table_name (col1, col2, col3, ...) VALUES (value1, value2,
value3, ...)

In Syntax 1 we specify the column names of the table and the corresponding
values. This syntax is recommended in application development: if a column is
later added to the table, the statement does not raise an error and the program
keeps running; the new column is simply not supplied a value (provided it allows
NULL or has a default).

Syntax 2
INSERT INTO table_name VALUES (value1, value2, ...)

In Syntax 2 we do not specify the column names; a value must be passed for every
column, in the order in which the columns are defined in the table.

Syntax 3
INSERT INTO table_name (col1, col2, col3, ...)
SELECT col1, col2, col3, ... FROM source_table

In Syntax 3 a single INSERT INTO statement can insert multiple rows, whereas
Syntax 1 and Syntax 2 insert only one row at a time.
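The three forms can be sketched against the TENANT table from the example above. The tenants, phone numbers, and rents below are made-up sample values, and TENANT_ARCHIVE is a hypothetical table assumed to have the same columns as TENANT.

```sql
-- Syntax 1: column list plus matching values.
INSERT INTO TENANT
  (tenant_id, aptno, tenant_name, home_phone, work_phone, apt_rent, no_of_pets)
VALUES
  (1004, 892, 'MARY', '555-010-0001', '555-010-0002', 870, 0);

-- Syntax 2: no column list; one value per column, in table order.
INSERT INTO TENANT
VALUES (1005, 893, 'JOHN', '555-010-0003', '555-010-0004', 860, 1);

-- Syntax 3: insert every row returned by a query (possibly many rows).
INSERT INTO TENANT_ARCHIVE
  (tenant_id, aptno, tenant_name, home_phone, work_phone, apt_rent, no_of_pets)
SELECT tenant_id, aptno, tenant_name, home_phone, work_phone, apt_rent, no_of_pets
FROM   TENANT
WHERE  no_of_pets = 0;
```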

UPDATE Statement
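As a minimal sketch of an UPDATE against the TENANT table shown earlier (the apartment chosen and the new rent are illustrative assumptions, not values from the original):

```sql
-- Change the rent for the tenant in apartment 889.
UPDATE TENANT
SET    apt_rent = 910
WHERE  aptno = 889;

-- Make the change permanent; until then it can be undone with ROLLBACK.
COMMIT;
```

The WHERE clause limits the change to matching rows; without it, every row in the table would be updated.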
Oracle sql plsql & dw

  • 1.
  • 2.
  • 3. DWH Concepts What is a DATA WAREHOUSE? A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data, but it can include data from other sources. It separates analysis workload from transaction workload and enables an organization to consolidate data from several sources. In addition to a relational database, a data warehouse environment includes an extraction, transportation, transformation, and loading (ETL) solution, an online analytical processing (OLAP) engine, client analysis tools, and other applications that manage the process of gathering data and delivering it to business users. ® A data warehouse is a database designed to support a broad range of decision tasks in a specific organization. It is usually batch updated and structured for rapid online queries and managerial summaries. Data warehouses contain large amounts of historical data. The term data warehousing is often used to describe the process of creating, managing and using a data warehouse. What are the characteristics of a DATA WAREHOUSE? The characteristics of a DWH are • Subject-Oriented: DWH’s are designed to help you analyze data. For example, to learn more about the company’s sales data, you can build a warehouse that concentrates on sales. This ability to define a DWH by subject matter, sales in this case makes the DWH subject oriented. • Integrated: It is closely related to subject orientation. DWH’s put data from desperate sources into a consistent format. They must resolve such problems as naming conflicts and inconsistencies among units of measure. When they achieve this, they are said be integrated. • Nonvolatile: It means that, once entered into the warehouse, data should not change. This is logical because the purpose of a warehouse is to enable you to analyze what has occurred and whatever once happened never changes. 
• Time-Variant: In order to discover trends, analysts need large amounts of data. This is very much in contrast to OLTP systems, where performance requirements demand that historical data be moved to an archive. A DWH focus on change over time is what is meant by the term time variant. What are the goals of a DATA WAREHOUSE?
  • 4. The goals of a DATA WAREHOUSE are • To provide a reliable, single integrated source of key corporate information. • To give end users access to their data without a reliance on reports produced by the information system department. • To allow analysts to analyze corporate data and even produce predictive “what if” models from that data. The data warehouse is simply one component of modern reporting architectures. The real goal of reporting systems are decision support –or its modern equivalent Business intelligence-to help people makes better, more intelligent decision. When should a company consider implementing a data warehouse? Data warehouses or a more focused database called a data mart should be considered when a significant number of potential users are requesting access to a large amount of related historical information for analysis and reporting purposes. So-called active or real-time data warehouses can provide advanced decision support capabilities. What are the uses of DATAWAREHOUSE? • It separates analysis workload and enables an organization to consolidate data from several sources. • It manages the process of gathering data and delivering to business users. • It is used to analyze data. • It puts data from desperate sources into a consistent format. What are the benefits of data warehousing? Some of the potential benefits of putting data into a data warehouse include: 1. Improving turnaround time for data access and reporting; 2. Standardizing data across the organization so there will be one view of the "truth"; 3. Merging data from various source systems to create a more comprehensive information source; 4. Lowering costs to create and distribute information and reports; 5. Sharing data and allowing others to access and analyze the data; 6. Encouraging and improving fact-based decision-making. What are the limitations of data warehousing?
  • 5. The major limitations associated with data warehousing are related to user expectations, lack of data and poor data quality. Building a data warehouse creates some unrealistic expectations that need to be managed. A data warehouse doesn't meet all decision support needs. If needed data is not currently collected, transaction systems need to be altered to collect the data. If data quality is a problem, the problem should be corrected in the source system before the data warehouse is built. Software can provide only limited support for cleaning and transforming data. Missing and inaccurate data can not be "fixed" using software. Historical data can be collected manually, coded and "fixed", but at some point source systems need to provide quality data that can be loaded into the data warehouse without manual clerical intervention. What data is stored in a data warehouse? In general, organized data about business transactions and business operations is stored in a data warehouse. But, any data used to manage a business or any type of data that has value to a business should be evaluated for storage in the warehouse. Some static data may be compiled for initial loading into the warehouse. Any data that comes from mainframe, client/server, or web-based systems can then be periodically loaded into the warehouse. The idea behind a data warehouse is to capture and maintain useful data in a central location. Once data is organized, managers and analysts can use software tools like OLAP to link different types of data together and potentially turn that data into valuable information that can be used for a variety of business decision support needs, including analysis, discovery, reporting and planning. Database administrators (DBAs) have always said that having non-normalized or de-normalized data is bad. What are the methodologies of Data Warehousing? Every company has methodology of their own. But to name a few SDLC Methodology, AIM methodology are sturdily used. 
Other methodologies are AMM, World class methodology and many more. How does my company get started with data warehousing? Build one! The easiest way to get started with data warehousing is to analyze some existing transaction processing systems and see what type of historical trends and comparisons might be interesting to examine to support decision making. See if there is a "real" user need for integrating the data. If there is, then IS/IT staff can develop a data model for a new schema and load it with some current data and start creating a decision support data store using a database management system (DBMS). Find some software for query and reporting and build a decision support interface that's easy to use. Although the initial data warehouse/data-driven DSS may seem to meet only limited needs, it is a "first step". Start small and build more sophisticated systems based upon experience and successes.
  • 6. What is the Data warehouse Implementation Schemes? What type of Indexing mechanism do we need to use for a typical data warehouse? On the fact table it is best to use bitmap indexes. Dimension tables can use bitmap and/or the other types of clustered/non-clustered, unique/non-unique indexes. To my knowledge, SQLServer does not support bitmap indexes. Only Oracle supports bitmaps. What are the steps to build the data warehouse? Gathering business requirements Identifying Sources Identifying Facts Defining Dimensions Define Attributes Redefine Dimensions & Attributes Organize Attribute Hierarchy & Define Relationship Assign Unique Identifiers Additional conventions: Cardinality/Adding ratios How often should data be loaded into a data warehouse from transaction processing and other source systems? It all depends on the needs of the users, how fast data changes and the volume of information that is to be loaded into the data warehouse. It is common to schedule daily, weekly or monthly dumps from operational data stores during periods of low activity (for example, at night or on weekends). The longer the gap between loads, the longer the processing times for the load when it does run. A technical IS/IT staffer should make some calculations and consult with potential users to develop a schedule to load new data. What are the different architectures of data warehouse? ® What are the different approaches of a Data warehouse? There are two main things
  • 7. Top down - (bill Inmon) Bottom up - (Ralph Kimball) What are the types of a data warehouse? What is the main difference between Inmon and Kimball philosophies of data warehousing? Both differed in the concept of building the data warehouse. Kimball views data warehousing as a constituency of data marts. Data marts are focused on delivering business objectives for departments in the organization. And the data warehouse is a conformed dimension of the data marts. Hence a unified view of the enterprise can be obtained from the dimension modeling on a local departmental level. Inmon beliefs in creating a data warehouse on a subject-by-subject area basis. Hence the development of the data warehouse can start with data from the online store. Other subject areas can be added to the data warehouse as their needs arise. Point-of-sale (POS) data can be added later if management decides it is necessary. i.e., Kimball--First Data Marts--Combined way ---Data warehouse Inmon---First Data warehouse--Later----Data marts When should I consider a Data warehouse solution? What is the process of warehousing data? Explain the architecture of a data warehouse with the diagram. What is Staging Area? What is a general purpose scheduling tool? The basic purpose of the scheduling tool in a DW Application is to stream line the flow of data from Source to Target at specific time or based on some condition. What is real time data warehousing? Real-time data warehousing is a combination of two things: 1. real-time activity and 2. Data warehousing. Real-time activity is activity that is happening right now. The activity could be anything such as the sale of widgets. Once the activity is complete, there is data about it. Data warehousing captures business activity data. Real-time data warehousing captures business activity data as it occurs. As soon as the business activity is complete and there is data about it, the completed activity data flows into
  • 8. the data warehouse and becomes available instantly. In other words, real-time data warehousing is a framework for deriving information from data as the data becomes available. What is ODS? ODS means Operational Data Store. A collection of operation or bases data that is extracted from operation databases and standardized, cleansed, consolidated, transformed, and loaded into enterprise data architecture. An ODS is used to support data mining of operational data, or as the store for base data that is summarized for a data warehouse. The ODS may also be used to audit the data warehouse to assure summarized and derived data is calculated properly. The ODS may further become the enterprise shared operational database, allowing operational systems that are being reengineered to use the ODS as there operation databases. What is Active data warehousing? An active data warehouse provides information that enables decision-makers within an organization to manage customer relationships nimbly, efficiently and proactively. Active data warehousing is all about integrating advanced decision support with day- to-day-even minute-to-minute-decision making in a way that increases quality of those customer touches which encourages customer loyalty and thus secure an organization's bottom line. The marketplace is coming of age as we progress from first-generation "passive" decision-support systems to current- and next-generation "active" data warehouse implementations. ® Active Data ware house means every user can access the database any time 24/7 that is called Active DWH. ® Active Transformation means data can change and pass. What is meant by OLTP? OLTP stands for On-Line Transaction Processing. This is a standard, normalized database structure. OLTP is designed for Transactions i.e., day-to-day transactions. OLTP database has hundreds of users connected to it. 
These databases are normalized to reduce the redundancy of the data & increase the performance while inserting the data. The ratio of no. of records being inserted is more than the ration of no. of records being updated or deleted. OLTP systems are not designed for analysis, reporting and decision support. Examples: ATM Machines, Online Shopping, Online Application Filling, and Online Railway Reservations. Why OLTP database are designs not generally a good idea for a Data Warehouse?
  • 9. Since in OLTP, tables are normalized and hence query response will be slow for end user and OLTP doesn’t contain years of data and hence cannot be analyzed. Why is de-normalized data now ok when it's used for Decision Support? Normalization of a relational database for transaction processing avoids processing anomalies and results in the most efficient use of database storage. A data warehouse for Decision Support is not intended to achieve these same goals. For Data-driven Decision Support, the main concern is to provide information to the user as fast as possible. Because of this, storing data in a de-normalized fashion, including storing redundant data and pre-summarizing data, provides the best retrieval results. Also, data warehouse data is usually static so anomalies will not occur from operations like add, delete and update a record or field. Why should you put your data warehouse on a different system than your OLTP system? A OLTP system is basically “data oriented” (ER model) and not “Subject oriented "(Dimensional Model) .That is why we design a separate system that will have a subject oriented OLAP system...Moreover if a complex query is fired on a OLTP system will cause a heavy overhead on the OLTP server that will affect the day-to- day business directly. What is Business Intelligence? Business intelligence (BI) is a broad category of applications and technologies for gathering, storing, analyzing, and providing access to data to help enterprise users make better business decisions. What are the important concerns of OLTP and DSS systems? OLTP DSS No. of users Many FEW
Data
    OLTP:
    1. Stored in a complex data format (normalized).
    2. Stored in a normalized form, normally third normal form (3NF). Normalization enhances performance.
    3. Small volumes of data.
    4. Data is volatile in nature.
    DWH:
    1. Stored in multidimensional structures, e.g. a cube (3-dimensional).
    2. Stored in de-normalized format.
    3. Large volumes of data.
    4. Static in nature, with periodic loads.
Operations
    OLTP: Transactions.
    DWH: Reporting.
Indexes
    OLTP: Few.
    DWH: Many.
Joins
    OLTP: Many (because it is normalized).
    DWH: Few (because it is de-normalized).
Performance
    OLTP: Concurrency and availability are the more important aspects, e.g. ATMs.
    DWH: Response time is most important.

OLTP                        Attribute                     DSS
Complex data structures     DATA STRUCTURES               Multidimensional data structures
Few                         INDEXES                       Many
Many                        JOINS                         Some
Normalized DBMS             DUPLICATED DATA               De-normalized DBMS
Rare                        DERIVED DATA AND AGGREGATES   Common
Many                        NUMBER OF USERS               Few
Predefined operations       WORKLOAD                      Ad-hoc queries
Volatile                    DATA MODIFICATIONS            Updated on a regular basis
Small volumes               DATA                          Large volumes (historical data)
Availability must be high   PERFORMANCE                   Response time must be good

What is the difference between ODS and OLTP?

ODS: A collection of tables created in the data warehouse that maintains only current data. OLTP: Maintains data only for transactions; OLTP systems are designed for recording the daily operations and transactions of a business.

® ODS: Holds data within the data warehouse that stands alone. No further transactions take place against the current data that is part of the data warehouse; the current data changes only when it is uploaded through ETL on a scheduled basis.
OLTP: Holds data in an online system that is connected to the network, where every transaction update happens within seconds, so summarized values change from second to second.

What is an OLAP? What are the types of OLAP?

OLAP is software for manipulating multidimensional data from a variety of sources. The data is often stored in a data warehouse. OLAP software helps a user create queries, views, representations and reports. OLAP tools can provide a "front-end" for a data-driven DSS.

® OLAP: On-Line Analytical Processing (OLAP) is a category of software technology that enables analysts, managers and executives to gain insight into data through fast, consistent, interactive access to a wide variety of possible views of information that has been transformed from raw data to reflect the real dimensionality of the enterprise as understood by the user.

® OLAP stands for On-Line Analytical Processing. An OLAP system stores data in multidimensional databases. You then access these databases to perform financial and statistical analysis on different combinations of the data. An OLAP database is generally used to analyze data. It is optimized so that you can quickly retrieve data.
An OLAP database is generally created from the information you have put in an OLTP database. OLAP products can be grouped into 3 categories.
MOLAP: (Multidimensional OLAP)
o   Data is stored in multidimensional arrays in order to be viewed in a multidimensional manner.
o   Multidimensional arrays provide efficiency in storage and operations.
o   Examples: Oracle Express Server, Essbase by Hyperion Software, PowerPlay by Cognos.
o   MOLAP is poorly suited to ad-hoc queries because it is optimized for multidimensional operations.
o   Retrieval is fast.
o   Storage is very efficient.

ROLAP: (Relational OLAP)
o   Data is stored in a relational model, on the view that OLAP capabilities are best provided against the relational database.
o   Examples: Oracle, SQL Server, etc.
o   ROLAP integrates naturally with existing technology and standards.
o   ROLAP can readily take advantage of parallel relational technology.

HOLAP: (Hybrid OLAP)
o   These products combine MOLAP and ROLAP.
o   With HOLAP products, a relational database stores most of the data.
o   A separate multidimensional database stores a small portion of the data.

Are OLAP databases called decision support systems? True/false?

True

What does the term ‘Metadata’ mean?

Very loosely, it is documentation about data; it is how you provide context for data people might be using. Metadata is basically the wrapping you put around data you use in everyday life to transform it into meaningful information.

What is the difference between data warehousing and OLAP?

The terms data warehousing and OLAP are often used interchangeably. As the definitions suggest, warehousing refers to the organization and storage of data from a variety of sources so that it can be analyzed and retrieved easily. OLAP deals with the software and the process of analyzing data, managing aggregations, and partitioning information into cubes for in-depth analysis, retrieval and visualization. Some vendors are replacing the term OLAP with the terms analytical software and business intelligence.

® A data warehouse is the place where the data is stored for analysis, whereas OLAP is the process of analyzing the data, managing aggregations, and partitioning information into cubes for in-depth visualization.

What is OLAP, MOLAP, ROLAP, DOLAP, and HOLAP?
OLAP - On-Line Analytical Processing: Designates a category of applications and technologies that allow the collection, storage, manipulation and reproduction of multidimensional data, with the goal of analysis.

MOLAP - Multidimensional OLAP: This term designates, more specifically, a Cartesian data structure. In effect, MOLAP contrasts with ROLAP. In the former, joins between tables are already established, which enhances performance. In the latter, joins are computed during the request. MOLAP is targeted at groups of users because it is a shared environment. Data is stored in an exclusive server-based format. It performs more complex analysis of data.

ROLAP - Relational OLAP: Designates one or several star schemas stored in relational databases. This technology permits multidimensional analysis with data stored in relational databases. It is used for large departments or groups because it supports large amounts of data and users.

DOLAP - Desktop OLAP: Small OLAP products for local multidimensional analysis. There can be a mini multidimensional database (using Personal Express), or extraction of a data cube (using Business Objects). It is designed for a low-end, single, departmental user. Data is stored in cubes on the desktop. It is like having your own spreadsheet. Since the data is local, end users don't have to worry about performance hits against the server.

HOLAP: Hybridization of OLAP, which can include any of the above.

What is meant by metadata in context of a Data warehouse and how it is important?

Metadata is data about data. A business analyst or data modeler usually captures information about data in a data dictionary, a.k.a. metadata: the source (where and how the data originated), the nature of the data (char, varchar, nullable, existence, valid values, etc.) and the behavior of the data (how it is modified or derived, and its life cycle). Metadata is also presented at the data mart level, for subsets, facts and dimensions, the ODS, etc.
For a DW user, metadata provides vital information for analysis / DSS.

What is difference between MOLAP, ROLAP?

ROLAP (Tactical)
•   Detailed data
•   Simple calculations
•   Analyze past trends
Data storage structure
•   Tables
Advantages
•   Requires less memory storage space
Disadvantages
•   Data access is slow

MOLAP (Strategic)
•   Summary data
•   Complex calculations
•   Predict future trends
Data storage structure
•   Cube
Advantages
•   Data access is faster
Disadvantages
•   Requires more memory storage space
•   Is sparsely filled as the number of dimensions in the cube increases

What is the Difference between OLTP and OLAP?

The main differences between OLTP and OLAP are:

1. User and System Orientation
OLTP: customer-oriented, used for transaction and query processing by clerks, clients and IT professionals.
OLAP: market-oriented, used for data analysis by knowledge workers (managers, executives, analysts).

2. Data Contents
OLTP: manages current data, very detail-oriented.
OLAP: manages large amounts of historical data, provides facilities for summarization and aggregation, and stores information at different levels of granularity to support the decision-making process.

3. Database Design
OLTP: adopts an entity relationship (ER) model and an application-oriented database design.
OLAP: adopts a star, snowflake or fact constellation model and a subject-oriented database design.

4. View
OLTP: focuses on the current data within an enterprise or department.
OLAP: spans multiple versions of a database schema due to the evolutionary process of an organization; integrates information from many organizational locations and data stores.
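The contrast between OLTP and OLAP workloads described above can be sketched with a few lines of Python and the standard-library sqlite3 module. This is only an illustration under stated assumptions: the `sales` table, its columns and the sample rows are hypothetical, and a real warehouse would keep the analytical data in a separate, subject-oriented schema rather than in the transactional table itself.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a small, hypothetical sales table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales ("
        "order_id INTEGER PRIMARY KEY, region TEXT, year INTEGER, amount REAL)"
    )
    rows = [
        (1, "East", 2023, 100.0),
        (2, "East", 2024, 150.0),
        (3, "West", 2023, 200.0),
        (4, "West", 2024, 50.0),
    ]
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def oltp_update(conn, order_id, new_amount):
    """OLTP-style work: change one current, detailed row in a short transaction."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "UPDATE sales SET amount = ? WHERE order_id = ?",
            (new_amount, order_id),
        )


def olap_summary(conn):
    """OLAP-style work: read across all history and aggregate it by region."""
    cur = conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
    )
    return cur.fetchall()


if __name__ == "__main__":
    conn = build_demo_db()
    oltp_update(conn, 4, 75.0)
    print(olap_summary(conn))
```

The update touches a single row by key, which is the access pattern OLTP systems are tuned for, while the summary scans every row and returns one aggregated value per region, which is the pattern that star schemas and cubes are meant to speed up.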
What types of Metadata are there and when will they be available?

Metadata will be made available on the Decision Support website as each increment 'goes live'. We have two classifications of metadata: one that is business and one that is technical. Technical metadata is fairly clear-cut: where did the data come from, or how was it transformed along the way? Business metadata deals more with the possible meaning of the data and how it can be used.

Why is Metadata important to the DWH User?

Metadata is what makes the data in the Data Warehouse meaningful. The Data Warehouse is very different from an operational application. When you're using an operational application, you can get clues from the screen that tell you to update a particular field on the window. If I’m processing a new employee, I know exactly what needs to be updated for that new employee record, and can move through the process based on the context that the application provides. In a data-warehousing environment, you don’t have that context or workflow. You have data that is interrelated, and it is raw out there in a form, but there is no application between you and the data. Basically, you have a number of tables and structures that you have access to without a business layer, without a definition on top of it. So metadata is very important to be able to provide that context to people so they know how to go between subject areas or how data within a subject area is related and what it defines and represents.

Is Metadata a description of what the data represents?

In the simplest terms it is. As an example, if a user of the Data Warehouse is interested in a field called "campus code", then the metadata might have a definition of what the campus code represents, such as "an indicator for one of the three campuses". That is a form of metadata, although it is not a complete picture of what metadata can be.

What types of Metadata will be made available to the User?

Decision Support has identified several kinds of metadata that will be published on the website. Some basic categories are the data model, source-to-target mapping, and the logical & physical model. The logical model gives more of a grouping or identifies logically what would be expected from the business side. The physical model goes into more detail with more of the data dictionary definition, but it gives the user a pictorial representation of the data, not just a list of columns and tables. It provides a visual so people can see how data elements relate to each other. There is also a category of metadata that we call usage notes. These expand on how someone might query the Data Warehouse or use a query against a data mart. Based on going through the requirements process and working with the focus groups, as data is available, we expect to expand the metadata categories.
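To make the split between technical and business metadata more concrete, here is a minimal Python sketch of one metadata entry for the "campus code" field mentioned above. Everything apart from the quoted business definition is an assumption made for illustration: the class name `ColumnMetadata`, the table and system names, and the transformation text are hypothetical and do not describe any particular metadata tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnMetadata:
    """One source-to-target mapping entry plus its business description."""

    # Technical metadata: where the data came from and how it was transformed.
    target_table: str
    target_column: str
    source_system: str
    source_field: str
    transformation: str
    # Business metadata: what the data means and how it can be used.
    business_definition: str
    usage_note: str = ""


def describe(meta):
    """Render a one-line description that an ad hoc user could read."""
    return (
        f"{meta.target_table}.{meta.target_column}: {meta.business_definition} "
        f"(from {meta.source_system}.{meta.source_field}; {meta.transformation})"
    )


# Hypothetical entry; only the business definition is taken from the text above.
campus_code = ColumnMetadata(
    target_table="student_dim",
    target_column="campus_code",
    source_system="registration_system",
    source_field="CAMPUS_CD",
    transformation="trimmed and upper-cased",
    business_definition="an indicator for one of the three campuses",
    usage_note="Filter on this column to compare campuses.",
)

if __name__ == "__main__":
    print(describe(campus_code))
```

Keeping both kinds of metadata in one record is a design choice for the sketch only; the point is that a user who sees the column name can look up both its meaning and its lineage.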
Is Metadata also useful to the average User of the DWH, in addition to a department’s technical staff?

Yes. For an "ad hoc" user, there may be questions as to what a field represents. Another form of metadata at a business user level would be sample queries that Decision Support’s Services area would publish based on findings from the requirements process and focus groups. These queries provide samples of relating data to answer a business question.

What Challenges are involved when providing Metadata?

Historically, organizations find it a challenge to manage metadata over time. So I think the biggest challenge that we face at Decision Support is learning from those mistakes and from what we’ve read in the industry. We need to make sure the metadata we have is ‘live’; that it’s not something that is static and put on the shelf. Decision Support has formed a Custodial Data Council that will take ownership in making sure we have business definitions and work with the user community. I think we also need to technically streamline those processes as much as possible, publish the metadata, and make it as consistent as possible.

What is the difference between DWH and BI?

There may be a feature film (movie) without a trailer, but there will be no trailer without a movie. Similarly, data warehousing is a concept related to extracting a client's business data, applying business processing to that data according to user needs, and finally loading the processed data into a database. This database is what we call a warehouse or data warehouse. After the data warehouse is complete, the business user ultimately wants to view his data (precise, summarized data), but as a business person he may not know how to access a database (a computer person can access the database with SQL).
This is where OLAP tools come in, which help that person access the database. We can call these OLAP tools Business Intelligence tools (intelligent in the sense that they generate SQL queries internally and provide report developers with many facilities and privileges for formatting the data and presenting it in a highly convenient manner). So the data warehouse (movie) is a database, and business intelligence tools (trailers) present the content of a database in an efficient manner.

® Simply speaking, BI is the capability of analyzing the data of a data warehouse to the advantage of the business. A BI tool analyzes the data of a data warehouse so that business decisions can be made depending on the result of the analysis.

® Data warehousing deals with all aspects of managing the development, implementation and operation of a data warehouse or data mart, including metadata management, data acquisition, data cleansing, data transformation, storage management, data distribution, data archiving, operational reporting, analytical reporting, security management, backup/recovery planning, etc. Business intelligence, on the other hand, is a set of software tools that enable an organization to analyze measurable aspects of their business such as sales performance, profitability, operational efficiency, effectiveness of marketing campaigns, market penetration among certain customer groups, cost trends, anomalies and exceptions, etc. Typically, the term “business intelligence” is used to encompass OLAP, data visualization, data mining and query/reporting tools. Think of the data warehouse as the back office and business intelligence as the entire business including the back office. The business needs the back office on which to function, but the back office without a business to support makes no sense.

® DATA WAREHOUSE: A data warehouse is an integrated, time-variant, subject-oriented and non-volatile collection of data in support of the management decision-making process.
BUSINESS INTELLIGENCE: Business Intelligence is the process of extracting the data, converting it into information and then into a knowledge base.

® A data warehouse is a database geared towards the business intelligence requirements of an organization. It integrates data from the various operational systems and is typically loaded from these systems at regular intervals.
BI: A category of technologies that allows for gathering, storing, accessing and analyzing data to help business users make better decisions.

® To make business analysis effective and efficient, we require a specialized form of storage. This special form of storage is called a Data Warehouse, and the process is called Data Warehousing. Business Intelligence is the mechanism of using data, according to the type of industry, for predictive analysis, fault finding, process improvement, etc.

What is a Data Dictionary?

A data dictionary is a kind of metadata. A data dictionary explains how data physically resides in an environment.
A data dictionary identifies the type of a column, whether it is character or numeric or some other type. It identifies the width of a column as well as the name of the column. Sometimes in data dictionaries you see descriptions; sometimes you don’t. But basically it is how that field is physically represented in Oracle or Sybase or some other platform, if that’s where the data resides. It's difficult to do any meaningful query or report without basic metadata.

What are the possible data marts in Retail sales?

Product information, sales information.

What are data validation strategies for data mart validation after loading process?

Data validation makes sure that the loaded data is accurate and meets the business requirements. Strategies are the different methods followed to meet the validation requirements.

What is a Data Mart?

A Data Mart is a focused subset of a DWH that deals with a single area of data and is organized for quick analysis. It contains the summarized data of the warehouse and is referred to as a High Performance Query Structure. Data marts consist of materialized views and special indexes. In some businesses these data marts may be maintained within the warehouse, whereas in other scenarios they may be maintained apart from the DWH.

® A data mart is a repository of data gathered from operational data and other sources that is designed to serve a particular community of knowledge workers.

® Systems designed for a particular line of business.

What are Data Marts?

Data Marts are designed to help managers make strategic decisions about their business. Data Marts are subsets of the corporate-wide data that are of value to a specific group of users.
There are two types of Data Marts:
1. Independent data marts: sourced from data captured from OLTP systems or external providers, or from data generated locally within a particular department or geographic area.
2. Dependent data marts: sourced directly from enterprise data warehouses.

What are the levels of Data mart?

What are the difference between Database, DATAWAREHOUSE and Data Marts?

A Database is an organized collection of data.
A DWH is a very large database with a special set of tools to extract and cleanse data from operational systems and to analyze data.
A Data Mart is a focused subset of a DWH that deals with a single area of data and is organized for quick analysis.

What is Data Sampling?

What is Data Scrubbing?
What is Data Acquisition Process?

What is data mining?

Data mining is a process of extracting hidden trends within a data warehouse. For example, an insurance data warehouse can be used to mine data for the highest-risk people to insure in a certain geographical area.

What is a transformation?

It is a repository object that generates, modifies or passes data.

Transformations: Transformations are the manipulation of data from how it appears in the source systems into another form in the DWH or data mart, in a way that enhances or simplifies its meaning. Put another way, you transform data into information. This includes the following:

Data Merging: The process of standardizing data types and fields. Suppose one source system stores integer data as smallint whereas another stores the same data as decimal. The data from the two source systems needs to be rationalized when it is moved into the Oracle data type called NUMBER.

Cleansing: The process of validating the data brought from multiple sources. This involves identifying and changing any inconsistencies or inaccuracies.
•   Eliminates inconsistencies in the data from multiple sources.
•   Converts data from different systems into a single consistent data set suitable for analysis.
•   Meets a standard for establishing data elements, codes, domains, formats and naming conventions.
•   Corrects data errors and fills in missing data values.

Aggregation: The process whereby multiple detailed values are combined into a single summary value, typically by summing numbers that represent dollars spent or units sold. It generates summarized data for use in aggregate fact and dimension tables.

What are the advantages of data mining over traditional approaches?

Data mining is used for estimating the future. For example, for a company or business organization, data mining can be used to predict the future of the business in terms of revenue, employees, customers, orders, etc.
Traditional approaches use simple algorithms for estimating the future, but they do not give accurate results when compared to data mining.

What is ETL?
ETL stands for extraction, transformation and loading. ETL tools provide developers with an interface for designing source-to-target mappings, transformations and job control parameters.
•   Extraction: Takes data from an external source and moves it to the warehouse pre-processor database.
•   Transformation: The transform data task allows point-to-point generating, modifying and transforming of data.
•   Loading: The load data task adds records to a database table in a warehouse.

Explain the classification of Tables in a Data warehouse?

What is Fact table?

A fact table contains the measurements, metrics or facts of a business process. If your business process is "Sales", then a measurement of this business process such as "monthly sales number" is captured in the fact table. The fact table also contains the foreign keys for the dimension tables.

Why fact table is in normal form?

Basically, the fact table consists of the index keys of the dimension/lookup tables and the measures. Whenever a table consists of such keys, that itself implies that the table is in normal form.

What is a level of Granularity of a fact table?

Level of granularity means the level of detail that you put into the fact table in a data warehouse. For example, based on the design you can decide to store the sales data for each transaction. The level of granularity then means how much detail you are willing to store for each transactional fact: product sales for each individual transaction, or sales aggregated up to, say, the minute.

What does level of Granularity of a fact table signify?

Granularity: The first step in designing a fact table is to determine the granularity of the fact table. By granularity, we mean the lowest level of information that will be stored in the fact table. This involves two steps:
•   Determine which dimensions will be included.
•   Determine where along the hierarchy of each dimension the information will be kept.
The determining factors usually go back to the requirements.

What is aggregate fact table?
An aggregate table contains the measure values, aggregated/grouped/summed up to some level of a hierarchy.

What is fact less fact table? Where you have used it in your project?

A factless fact table contains only keys; there are no measures available in it.

What is the common use of creating a Factless Fact Table?

What are the different types of Fact Table? Explain with an example.

1. Cumulative Fact Table:
2. Snapshot Fact Table:

What are the types of Facts?

Additive: A fact that can be summed up across any of the dimensions is called an additive fact.
® A measure that can participate in arithmetic calculations using all or any dimensions.
Ex: Sales profit

Semi-additive: A fact that can be summed up across some of the dimensions is called a semi-additive fact.
® A measure that can participate in arithmetic calculations using some dimensions.
Ex: Account balance (it can be summed across accounts but not across time)

Non-additive: A fact that can be summed up across none of the dimensions is called a non-additive fact.
® A measure that can’t participate in arithmetic calculations using dimensions.
Ex: Temperature

What are Semi-additive and factless facts and in which scenario will you use such kinds of fact tables?

Snapshot facts are semi-additive; when we maintain aggregated facts we go for semi-additive facts.
Ex: Average daily balance
A fact table without numeric fact columns is called a factless fact table.
Ex: Promotion facts, used to record the promotions applied to a transaction (ex: product samples); such a table doesn’t contain any measures.

What are non-additive facts in detail?

A fact may be a measure, a metric or a dollar value. Measures and metrics are non-additive facts. A dollar value is an additive fact: if we want to find the amount for a particular place for a particular period of time, we can add the dollar amounts and come up with the total amount.
A non-additive fact is, for example, the measured height of 'citizens by geographical location': when we roll up 'city' data to 'state' level we should not add the heights of the citizens; rather, we may want to use the data to derive a 'count'.

What is conformed fact?

By analogy with conformed dimensions, which are dimensions that can be used across multiple data marts in combination with multiple fact tables, a conformed fact is one that is defined consistently across those data marts.

What is a continuously valued fact?

What is Centipede Fact Table?

What is Fact Constellation?

What are the categories of Snapshot Fact Table Grains?

What is a dimension table?

A dimension table is a collection of hierarchies and categories along which the user can drill down and drill up. It contains only the textual attributes.

How are the Dimension tables designed?

Most dimension tables are designed using normalization principles up to 2NF. In some instances they are further normalized to 3NF.
•   Find where the data for this dimension is located.
•   Figure out how to extract this data.
•   Determine how to maintain changes to this dimension (see more on this in the next section).
•   Change the fact table and DW population routines.

What are the Different methods of loading Dimension tables?
Conventional load: Before the data is loaded, all the table constraints are checked against the data.
Direct load (faster loading): All the constraints are disabled and the data is loaded directly. Later the data is checked against the table constraints, and the bad data won't be indexed.

Can a dimension table contain numeric values?

What is hierarchy relationship in a dimension? Whether it is:
1. 1:1
2. 1:m
3. m:m

What are the different types of dimensions? Explain with examples.

1. Regular dimensions
2. Shared dimensions

What are the different types of dimension tables? Explain with examples.

Why dimensions are de-normalized in nature?

Can 2 fact tables share same dimension tables?

What is junk dimension?

Junk dimension: A grouping of random flags and text attributes that are moved out of a dimension into a separate sub-dimension.
® A dimension that does not change the grain level is called a junk dimension. (Grain is the lowest level of reporting.)
(Or) The junk dimension is simply a structure that provides a convenient place to store the junk attributes.
(Or) A junk dimension is a convenient grouping of flags and indicators.

What are Conformed Dimensions?

A dimension that is used in more than one cube.
® The use of conformed dimensions and shared measures is the primary way a set of data marts can be united into one consolidated data warehouse.
® Conformed dimensions are dimensions which are common to the cubes (cubes are the schemas that contain fact and dimension tables).
Consider Cube-1, which contains F1, D1, D2, D3, and Cube-2, which contains F2, D1, D2, D4, where the Fs are facts and the Ds are dimensions. Here D1 and D2 are the conformed dimensions.
® Conformed dimensions mean the exact same thing with every possible fact table to which they are joined.
Ex: The Date dimension is connected to all facts, such as Sales facts, Inventory facts, etc.

What is degenerated dimension?

Degenerate dimension: Keeping the control information in the fact table. For example, consider a dimension table with fields like order number and order line number that has a 1:1 relationship with the fact table. In this case the dimension is removed and the order information is stored directly in the fact table, in order to eliminate unnecessary joins while retrieving order information.

What is degenerate dimension table?

Degenerate dimensions: Values in a table that are neither dimensions nor measures are called degenerate dimensions.
Ex: invoice id, empno.

What is Audit dimension? Explain with an example.

What is a Fact Dimension?

What is a Mini Dimension?

What are Role-playing dimensions?

What is a Mystery Dimension?

How do you connect the facts and dimensions in the tables?

1. Smart matching of columns
2. Manual linking

Which columns go to the fact table and which columns go the dimension table?

The primary key columns of the tables (entities) go to the dimension tables as foreign keys.
The primary key columns of the dimension tables go to the fact tables as foreign keys.

What is Associate Table?

What is Bridge Table?

What is cross-reference table?
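As a rough illustration of the conformed and degenerate dimensions described earlier in this section, the following Python sketch builds a tiny schema with the standard-library sqlite3 module. The table names, column names and sample rows are hypothetical. A single `date_dim` table is shared by a sales fact table and an inventory fact table (a conformed dimension), and the order number is stored directly in the sales fact table instead of in its own dimension (a degenerate dimension).

```python
import sqlite3


def build_star():
    """Create two fact tables that share one conformed date dimension."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE date_dim (
            date_key INTEGER PRIMARY KEY,
            calendar_date TEXT
        );
        -- order_number is kept on the fact row: a degenerate dimension.
        CREATE TABLE sales_fact (
            date_key INTEGER REFERENCES date_dim(date_key),
            order_number TEXT,
            sales_amount REAL
        );
        CREATE TABLE inventory_fact (
            date_key INTEGER REFERENCES date_dim(date_key),
            units_on_hand INTEGER
        );
        INSERT INTO date_dim VALUES (1, '2024-01-01');
        INSERT INTO date_dim VALUES (2, '2024-01-02');
        INSERT INTO sales_fact VALUES (1, 'SO-100', 40.0);
        INSERT INTO sales_fact VALUES (1, 'SO-101', 60.0);
        INSERT INTO sales_fact VALUES (2, 'SO-102', 25.0);
        INSERT INTO inventory_fact VALUES (1, 500);
        INSERT INTO inventory_fact VALUES (2, 480);
        """
    )
    return conn


def sales_by_date(conn):
    """Join the sales facts to the shared date dimension and sum the measure."""
    cur = conn.execute(
        "SELECT d.calendar_date, SUM(f.sales_amount) "
        "FROM sales_fact f JOIN date_dim d ON d.date_key = f.date_key "
        "GROUP BY d.calendar_date ORDER BY d.calendar_date"
    )
    return cur.fetchall()


def inventory_by_date(conn):
    """Join the inventory facts to the same date dimension."""
    cur = conn.execute(
        "SELECT d.calendar_date, f.units_on_hand "
        "FROM inventory_fact f JOIN date_dim d ON d.date_key = f.date_key "
        "ORDER BY d.calendar_date"
    )
    return cur.fetchall()


if __name__ == "__main__":
    conn = build_star()
    print(sales_by_date(conn))
    print(inventory_by_date(conn))
```

Because both fact tables carry the same `date_key`, results from the two queries line up on identical dates, which is what lets separate data marts be combined. Looking up an order by `order_number` needs no join at all.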
What is Event-Tracking Table?

What is a lookup table?

A lookup table is one that is used when updating a warehouse. When the lookup is placed on the target table (fact table / warehouse) based on the primary key of the target, it updates the table by allowing only new or updated records through, based on the lookup condition.

What is the data type of the surrogate key?

The data type of the surrogate key is either integer, numeric or number.

What is a Schema?

What is a Star Schema?

A star schema is a way of organizing the tables such that we can retrieve results from the database easily and quickly in the warehouse environment. Usually a star schema consists of one or more dimension tables around a fact table, which looks like a star; that is how it got its name.

Differences between star and snowflake schemas?

Star schema: A single fact table with N dimensions.
Snowflake schema: Any schema in which dimensions have extended dimensions is known as a snowflake schema.
® Star schema: all dimensions are linked directly to a fact table.
Snowflake schema: dimensions may be interlinked or may have one-to-many relationships with other tables.

What is Snow-Flake Schema?

When do you go for Star Schema? And when do you go for Snow-Flake Schema?

What is the main difference between schema in RDBMS and schemas in Data Warehouse?

RDBMS Schema
•   Used for OLTP systems
•   Traditional and old schema
•   Normalized
•   Difficult to understand and navigate
•   Cannot solve extract and complex problems
•   Poorly modeled

DWH Schema
•   Used for OLAP systems
•   New generation schema
•   De-normalized
•   Easy to understand and navigate
•   Extract and complex problems can be easily solved
•   Very good model

Why did you choose STAR SCHEMA only? What are the benefits of STAR SCHEMA?

Because of its de-normalized structure, i.e., the dimension tables are de-normalized. As for why we de-normalize, the first (and often only) answer is: speed. An OLTP structure is designed for data inserts, updates and deletes, but not data retrieval. Therefore, we can often squeeze some speed out of it by de-normalizing some of the tables and having queries go against fewer tables. These queries are faster because they perform fewer joins to retrieve the same record set. Joins are also confusing to many end users. By de-normalizing, we can present the user with a view of the data that is far easier for them to understand.

Benefits of STAR SCHEMA:
•   Far fewer tables.
•   Designed for analysis across time.
•   Simplifies joins.
•   Less database space.
•   Supports “drilling” in reports.
•   Flexibility to meet business and technical needs.

Difference between Snow flake and Star Schema. What are situations where Snow flake Schema is better than Star Schema to use and when the opposite is true?
Star schema: It contains the dimension tables mapped around one or more fact tables. It is a de-normalized model. There is no need to use complicated joins. Queries return results quickly.
Snowflake schema: It is the normalized form of the star schema. It contains in-depth joins, because the tables are split into many pieces. We can easily make modifications directly in the tables. We have to use complicated joins, since we have more tables. There will be some delay in processing the query.

Which is preferable? Star Schema or Snow-Flake Schema?

If you have 2 fact tables connected in the schema, do you know the name of the schema?

What is Galaxy Schema?

What is Multi-Star Schema?

How do you load the time dimension?

Time dimensions are usually loaded by a program that loops through all possible dates that may appear in the data. It is not unusual for 100 years to be represented in a time dimension, with one row per day.

What are slowly changing dimensions?

SCD stands for slowly changing dimensions. Slowly changing dimensions are of three types:
SCD1: Only the updated values are maintained.
Ex: When a customer's address is modified, we update the existing record with the new address.
SCD2: Historical information and current information are maintained by using
A) Effective date
B) Versions
C) Flags
or a combination of these.
SCD3: Historical information and current information are maintained by adding new columns to the target table.

® Type-1: Most recent value
Type-2 (full history):
i) Version number
ii) Flag
iii) Date
Type-3: Current and one previous value

® Type 1: The data is overwritten.
Type 2: Current, recent and history data should be there.
Type 3: Current and recent data should be there.

What is BUS Schema?

A BUS schema is composed of a master suite of conformed dimensions and standardized definitions of facts.

What is hybrid slowly changing dimension?

What are Critical columns?

What is a surrogate key? Why is it used? What is its need? Give an example.

Explain in detail what do you mean by Slicing and Dicing?

Slicing and dicing refers to the ability to combine and re-combine the dimensions to see different slices of the information. Picture slicing a three-dimensional cube of information in order to see what values are contained in the middle layer. Dicing is the ability to view the cube from different perspectives. Slicing and dicing a cube allows an end user to do the same thing with multiple dimensions.

What is a Measure?

What are the types of Measures?

How can you create Measures & Dimensions?

Can we group a measure?

What do you mean by Multi-dimensional Analysis?

What is a Grain?

What is Drill-up, Drill-down & Drill-Across?

Differentiate between Level and Category?

A level is a logical subdivision of a dimension, e.g. if orderdate is a dimension, the levels are year, quarter, month, week, day, etc. Categories are the different instances of a level, e.g. if year is a level, the categories are 1996, 1997, 1998, etc.

What is a CUBE in data warehousing concept?
  • 29. Cubes are logical representation of multidimensional data. The edge of the cube contains dimension members and the body of the cube contains data values. What is a Virtual Cube? Difference between filter and condition? Parameter is the only difference ® The difference between Filter and Condition: Condition returns true or false Ex: if Country = 'India' then ...Filter will return two types of results. 1. Detail information which is equal to where clause in SQL statement 2. Summary information which is equal to Group by and having clause in SQL statement ® I filter we just create a parameter on which we can filter the fields. but in condition we can have the static functions like if yes then color it green, if no then color it as red etc. so here we can create conditions for filtering in the report. Mean we can make different filtering function at the same time by using conditional formatting. What is snapshot? You can disconnect the report from the catalog to which it is attached by saving the report with a snapshot of the data. However, you must reconnect to the catalog if you want to refresh the data. What is a linked cube? Linked cube in which a sub-set of the data can be analyzed into great detail. The linking ensures that the data in the cubes remain consistent. What is VLDB? VLDB stands for Very Large Database. It is an environment or storage space managed by a relational database management system (RDBMS) consisting of vast quantities of information. VLDB doesn’t refer to size of database or vast amount of information stored. It refers to the window of opportunity to take back up the database. Window of opportunity refers to the time of interval and if the DBA was unable to take back up in the specified time then the database was considered as VLDB. What is batch processing? What is incremental loading?
  • 30. Incremental loading means loading the ongoing changes in the OLTP system.

Explain the advantages of RAID 1, 1/0, and 5. What type of RAID setup would you put your TX logs on?
Transaction logs are written sequentially and rarely need to be read. The ideal is to put them on RAID 1/0 because it has much better write performance than RAID 5. RAID 1 is also a good choice for TX logs and costs less to implement than 1/0, though it is a little less reliable and its performance is generally a little worse. RAID 5 is generally best for data because of its cost and the fact that it provides great read capability.

What is BAS? What is the function?
The Business Application Support (BAS) functional area at SLAC provides administrative computing services to the Business Services Division and Human Resources Department. We are responsible for software development and maintenance of the PeopleSoft applications and consultation to customers with their computer-related tasks.
BAS is also the Broadcast Agent Server. Its function is to run scheduled jobs or reports, which can be monitored using the Broadcast Agent Console.

What are modeling tools available in the Market?
There are a number of data modeling tools:
Tool Name          Company Name
Erwin              Computer Associates
Embarcadero        Embarcadero Technologies
Rational Rose      IBM Corporation
Power Designer     Sybase Corporation
Oracle Designer    Oracle Corporation

What are the various Reporting tools in the Market?
1. MS-Excel
2. Business Objects (Crystal Reports)
3. Cognos (Impromptu, Power Play)
4. Microstrategy
5. MS reporting services
  • 31. 6. Informatica Power Analyzer
7. Actuate
8. Hyperion (BRIO)
9. Oracle Express OLAP
10. Proclarity
® Some of the standard Business Intelligence tools in the market, according to their performance:
1) MICROSTRATEGY
2) BUSINESS OBJECTS, CRYSTAL REPORTS
3) COGNOS REPORT NET
4) MS-OLAP SERVICES
Or
1. Seagate Crystal report
2. SAS
3. Business objects
4. Microstrategy
5. Cognos
6. Microsoft OLAP
7. Hyperion
8. Microsoft integrated services
and some more.

What are the various ETL tools in the Market?
Various ETL tools used in the market are:
Informatica
Data Stage
Oracle Warehouse Builder
Ab Initio
Data Junction

Name some of the real time data-warehousing tools?

What is Outsourcing, Offshoring & Insourcing? And what is the difference between them.
  • 32. Outsourcing is not strictly IT. Any function of an organization that is executed by non-employees is essentially an outsourced task.
Insourcing is the use of external resources (not employees of the organization) to accomplish some function, but they predominantly carry out the function at the client's site. So the function is "sourced" but not "out" sourced. These resources are also typically managed more closely by the client directly, with little management involvement from the supplier.
Offshoring is a subset of outsourcing which is generally understood to involve a country in which costs remain lower than in the client's country of operations. While most offshoring situations are indeed examples of outsourcing, for those companies (HP for example) who now own their offshore operations and have folded them into the company, the line gets blurred. In other words, offshoring is not always outsourcing anymore.

What is ER Diagram?
The Entity-Relationship (ER) model was originally proposed by Peter Chen in 1976 [Chen76] as a way to unify the network and relational database views. Simply stated, the ER model is a conceptual data model that views the real world as entities and relationships. A basic component of the model is the Entity-Relationship diagram, which is used to visually represent data objects. Since Chen wrote his paper the model has been extended, and today it is commonly used for database design. For the database designer, the utility of the ER model is:
It maps well to the relational model. The constructs used in the ER model can easily be transformed into relational tables.
It is simple and easy to understand with a minimum of training. Therefore, the model can be used by the database designer to communicate the design to the end user.
In addition, the model can be used as a design plan by the database developer to implement a data model in specific database management software.

What Oracle tools can be used to build and design a warehouse?
What Oracle features can be used to optimize my warehouse system?

What is Data Modeling?
Data modeling represents information as entities, attributes and relationships. It is a visual representation of the information.

What are the different steps for Data Modeling?
1. Define the problem and scope of the problem.
  • 33. 2. Information gathering.
3. Analysis (normalization).
4. Create a logical data model (independent of platform).
5. Decision about the physical platform, such as Oracle or SQL Server.
6. Create a physical data model, which is platform specific.
7. Database creation.

What is Dimensional Modeling?
Dimensional modeling is a design concept used by many data warehouse designers to build their data warehouse. In this design model all the data is stored in two types of tables: fact tables and dimension tables. The fact table contains the facts/measurements of the business, and the dimension table contains the context of the measurements, i.e. the dimensions on which the facts are calculated.

Data modeling is probably the most labor-intensive and time-consuming part of the development process. Why bother, especially if you are pressed for time? A common response by practitioners who write on the subject is that you should no more build a database without a model than you should build a house without blueprints.

The goal of the data model is to make sure that all the data objects required by the database are completely and accurately represented. Because the data model uses easily understood notations and natural language, it can be reviewed and verified as correct by the end users. The data model is also detailed enough for the database developers to use as a "blueprint" for building the physical database. The information contained in the data model will be used to define the relational tables, primary and foreign keys, stored procedures, and triggers. A poorly designed database will require more time in the long term. Without careful planning you may create a database that omits data required to create critical reports, produces results that are incorrect or inconsistent, and is unable to accommodate changes in the user's requirements.

What is Logical Modeling?
The Logical Model: In Erwin, the logical model is the version of the model that represents all of the logical business requirements of an organization. There are three levels of logical models that are used to capture these requirements:
The Entity Relationship Diagram: A high-level data model that includes all major entities and relationships. The Entity Relationship Diagram does not contain much detail and is often used in the initial planning phase.
The Key Based Model: A model that describes major data structures such as entities, primary keys, and sample attributes.
  • 34. The Fully Attributed Model A complete model that includes all required entities, attributes, key groups, and relationships. In Erwin, a logical model can be created in conjunction with the physical model, or independent of the physical model. Logical models can also be derived from other models using the Derive Model Wizard. In addition, Erwin supports the definition of model objects in a logical model as logical only and in a physical model as physical only. These options allow for the logical model to be fully normalized and for the corresponding physical model to be de-normalized. Erwin also allows for the automatic conversion of many-to-many and super type/subtype relationships when you change from a logical model to a physical model. What are the types of Dimensional Modeling? What is Conceptual Modeling? What is Physical Modeling? Comparing Logical and Physical Models in a Logical/Physical Model: In an Erwin logical/physical model, each model that you create automatically includes both a logical and a physical model. By default, the logical model is closely related to the physical model. If you make a change in the logical model, the change is automatically reflected in the physical model and vice-versa. You can use either the logical model or the physical model to define and document database structures; although the model you use typically depends on the type of work you want to perform. You can use the logical model to represent business information and define business rules in a fully normalized model, while the physical model supports the needs of the database administrator, who focuses on the physical implementation of the model in a database. Comparing Logical and Physical Model Objects: Most of the objects in the logical model correspond to a related object in the physical model. 
For example, the logical model contains entities, attributes, and key groups, which are represented in the physical model as tables, columns, and indexes, respectively. The following table compares the logical and physical components in an Erwin model.

What is Difference between E-R Modeling and Dimensional Modeling?
The basic difference is that E-R modeling has both a logical and a physical model, while a dimensional model has only a physical model. E-R modeling is used for normalizing the OLTP database design.
  • 35. Dimensional modeling is used for de-normalizing the ROLAP/MOLAP design.

What is Entity, Attribute and Relationship?
Entity: An entity is an object about which an organization wants to maintain information, e.g. Employee.
Attribute: An attribute is a property of an entity that holds a piece of information about it.
Key attribute: A key attribute consists of one or more attributes of an entity which uniquely identify the entity, e.g. the bank account number identifies an account.
Relationship: Defines the association between different entities: one to one, one to many, many to one, many to many.

What is meant by De-Normalization?
What is the definition of normalized and denormalized view and what are the differences between them?
Normalization is the process of removing redundancies. Denormalization is the process of allowing redundancies.

Why Denormalization is promoted in Universe Designing?
In a relational data model, for normalization purposes, some lookup tables are not merged into a single table. In dimensional data modeling (star schema), these tables are merged into a single table called a DIMENSION table, for performance and for slicing data. Merging the tables into one large dimension table removes the complex intermediate joins: dimension tables are joined directly to fact tables. Although redundancy of data occurs in the DIMENSION table, its size is only about 15% of that of the FACT table. That is why denormalization is promoted in Universe designing.

What is Cardinality?
What is Referential Integrity?
What are Integrity Constraints?

What is the difference between view and materialized view?
View: stores the SQL statement in the database and lets you use it as a table. Every time you access the view, the SQL statement executes.
Materialized view: stores the results of the SQL in table form in the database. The SQL statement executes only once, and after that the stored result set is used every time you run the query, until the materialized view is refreshed. Pros include quick query results.
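The view versus materialized view distinction above can be tried out in a few lines. The following is a minimal sketch, not a vendor feature demo: it uses Python's built-in sqlite3 module, and because SQLite has no materialized views, a table created from the query stands in for one. The `sales` table and the `v_total` / `mv_total` names are hypothetical, chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical SALES table, used only for illustration.
cur.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
cur.executemany("INSERT INTO sales VALUES (?, ?)",
                [("EAST", 100), ("WEST", 250)])

# A view stores only the SELECT statement; it is re-run on every access.
cur.execute("CREATE VIEW v_total AS SELECT SUM(amount) AS total FROM sales")

# Stand-in for a materialized view (an assumption of this sketch):
# the result set is computed once and stored as a table.
cur.execute("CREATE TABLE mv_total AS SELECT SUM(amount) AS total FROM sales")

# New data arrives after both objects were created.
cur.execute("INSERT INTO sales VALUES ('EAST', 50)")

view_total = cur.execute("SELECT total FROM v_total").fetchone()[0]
stored_total = cur.execute("SELECT total FROM mv_total").fetchone()[0]
print("view:", view_total)             # re-executed, so it sees the new row
print("stored result:", stored_total)  # stale until it is refreshed

# Refreshing the stand-in means recomputing and replacing the stored rows.
cur.execute("DELETE FROM mv_total")
cur.execute("INSERT INTO mv_total SELECT SUM(amount) FROM sales")
refreshed_total = cur.execute("SELECT total FROM mv_total").fetchone()[0]
print("after refresh:", refreshed_total)
```

The trade-off is the one stated in the answer: the stored result is fast to read but only as fresh as its last refresh, while the view is always current but pays the query cost on each access.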
  • 36. What is Normalization, First Normal Form, Second Normal Form, Third Normal Form?
1. Normalization is the process of assigning attributes to entities. It reduces data redundancies, helps eliminate data anomalies, and produces controlled redundancies to link tables.
2. Normalization is the analysis of functional dependencies between attributes / data items of user views. It reduces a complex user view to a set of small and stable subgroups of fields / relations.

1NF: Repeating groups must be eliminated, dependencies can be identified, all key attributes are defined, and there are no repeating groups in the table.
2NF: The table is already in 1NF and includes no partial dependencies, i.e. no attribute is dependent on only a portion of the primary key. It is still possible for the table to exhibit transitive dependencies: attributes may be functionally dependent on non-key attributes.
3NF: The table is already in 2NF and contains no transitive dependencies.

What is a Table space? What does it contain?
What is a Composite Key or Concatenated Key? What is its use?
What are Unique Identifiers?
What is an Index? What are the types of Indexes?
What do you mean by Partitioned Indexes?
What is partitioning? What are the methods of partitioning?
What is Parallelism?

What are the advantages and disadvantages of reporting directly against the database?
Do you always need to copy the data before reporting on it? (Example: real-time & on-demand reporting is a requirement.)
There is no need to copy the data before reporting on it as long as the data is clean. If the data is not clean, it should be cleansed, which calls for an ETL process.
Advantage of reporting directly against the database (OLTP): there is no need to maintain a separate database for reporting, so space consumption is reduced.
Disadvantage of reporting directly against the database (OLTP): it slows down the system, because an OLTP system is designed for online transactions, whereas analytical reporting of the kind a data warehouse supports reads the same data but runs for a long time.
  • 37. What are the most frequent data errors that slow down data input process?

Data mining is the process of data selection, exploration and building models using vast data stores to uncover previously unknown patterns. What does this mean to you? You can produce new knowledge to better inform decision makers before they act. Build a model of the real world based on data collected from a variety of sources, including corporate transactions, customer histories and demographics, even external sources such as credit bureaus. Then use this model to produce patterns in the information that can support decision making and predict new business opportunities.

Text mining capabilities enable you to apply such analyses to text-based documents. With SAS's rich suite of text processing and analysis tools, you can uncover underlying themes or concepts contained in large document collections, group documents into topical clusters, classify documents into predefined categories and integrate text data with structured data for enriched predictive modeling endeavors.
  • 38. Before you begin, you should know the answers to the following questions.
• What is Data?
• What is a Database?
• What is an RDBMS?
• What is a Data Model?
• Why do we follow Normalization while designing a data model?
• What is an OLTP system?

WHAT IS DATA WAREHOUSING:
• A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data, but it can include data from other sources. It separates analysis workload from transaction workload and enables an organization to consolidate data from several sources.
• In addition to a relational database, a data warehouse environment includes an extraction, transportation, transformation, and loading (ETL) solution, an online analytical processing (OLAP) engine, client analysis tools, and other applications that manage the process of gathering data and delivering it to business users.
• A data warehouse is a subject-oriented, integrated, time-variant, nonvolatile collection of data that helps the business make organizational decisions.

Subject Oriented
Data warehouses are designed to help you analyze data. For example, to learn more about your company's sales data, you can build a warehouse that concentrates on sales. Using this warehouse, you can answer questions like "Who was our best customer for this item last year?" This ability to define a data warehouse by subject matter, sales in this case, makes the data warehouse subject oriented.
  • 39. Integrated
Integration is closely related to subject orientation. Data warehouses must put data from disparate sources into a consistent format. They must resolve such problems as naming conflicts and inconsistencies among units of measure. When they achieve this, they are said to be integrated.

Nonvolatile
Nonvolatile means that, once entered into the warehouse, data should not change. This is logical because the purpose of a warehouse is to enable you to analyze what has occurred.

Time Variant
In order to discover trends in business, analysts need large amounts of data. This is very much in contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive. A data warehouse's focus on change over time is what is meant by the term time variant.

When should an organization create a Data Warehouse?
Once an organization has so much information that it becomes too difficult to extract the meaningful information the business needs to take strategic decisions. The decisions we make using data warehouse data affect the entire organization rather than one customer or one employee. Examples of decisions we make with a DW are: should we continue with specific product offerings to our customers or not? Should we move the customer support department to a different location to save costs?

Data warehouses and OLTP systems have very different requirements. Here are some examples of differences between typical data warehouses and OLTP systems:

• Workload
Data warehouses are designed to accommodate ad hoc queries. You might not know the workload of your data warehouse in advance, so a data warehouse should be optimized to perform well for a wide variety of possible query operations. OLTP systems support only predefined operations. Your applications might be specifically tuned or designed to support only these operations.

• Data modifications
  • 40. A data warehouse is updated on a regular basis by the ETL process (run nightly or weekly) using bulk data modification techniques. The end users of a data warehouse do not directly update the data warehouse. In OLTP systems, end users routinely issue individual data modification statements to the database. The OLTP database is always up to date, and reflects the current state of each business transaction.

• Schema design
Data warehouses often use denormalized or partially denormalized schemas (such as a star schema) to optimize query performance. OLTP systems often use fully normalized schemas to optimize update/insert/delete performance, and to guarantee data consistency.

• Typical operations
A typical data warehouse query scans thousands or millions of rows. For example, "Find the total sales for all customers last month." A typical OLTP operation accesses only a handful of records. For example, "Retrieve the current order for this customer."

• Historical data
Data warehouses usually store many months or years of data. This is to support historical analysis. OLTP systems usually store data from only a few weeks or months. The OLTP system stores historical data only as needed to successfully meet the requirements of the current transaction.

END USER OF APPLICATION:
• What do you mean by end user in an OLTP system?
  • An end user is someone who enters data into the system or reads a particular report from it.
  • A bank teller should be able to enter the account number, see the balance, deposit a cheque, etc.
  • A customer representative must see the customer information to be more effective.
• What kind of information does management want to know, given that the DW data is primarily used by management?
  • 41. • Which are our lowest/highest margin customers?
• What is the most effective distribution channel?
• What product promotions have the biggest impact on revenue?
• What impact will new products/services have on revenue and margins?
• Which customers are most likely to go to the competition?
• Who are my customers and what products are they buying?

In OLTP applications, end users are individuals who take care of day-to-day operations. In DW applications, end users are managers and above, who take decisions based on trends, history, predictions, etc.
If end users are not satisfied with the application, then the product is considered a failure, even if it is a great achievement technology-wise.

Data Warehouse Architecture:
Source Data:
  • 42. An organization will have many OLTP applications; all of this operational data becomes the source for the data warehouse database.

ETL: (Extract, Transform and Load)
We extract data from various operational systems and clean the data so that we keep only the information that makes sense to have in the data warehouse. While cleansing the data we may reject some records or fill in missing information. Once we have transformed the operational data into the format the DW expects, we load the data into the DW. This process takes most of the time when developing DW applications.

DW Database
This is the area where we store the data required by the business so that they can run any report against it. In data warehouses we have current and historical information, which is very useful for trend analysis, behavioral analysis, etc.

What is Data Mart?
A data mart is a simple form of a data warehouse that is focused on a single subject (or functional area), such as Sales or Finance or Marketing. Data marts are often built and controlled by a single department within an organization. Given their single-subject focus, data marts usually draw data from only a few sources. The sources could be internal operational systems, a central data warehouse, or external data.

Difference between Data Warehouse and Data Mart
Data Warehouse:
• Enterprise-wide
• Structure for corporate view of data
• Organized on an E-R model or a galaxy of stars (multiple star schemas in the data model)
• Long turnaround time
Data Mart:
• Departmental
• Star schema based (facts and dimensions)
• Quick turnaround (up and running sooner, as there are fewer stakeholders)

Data Granularity
  • 43. What is Granularity of your DW?
• Granularity is the level of detail we want to store in the data warehouse.
• For a retail store, Point of Sale (POS) data is the lowest-granularity information available.
• For banking, it is the account-level detail based on every day's transactions.
• As a DSS leans towards analyzing the data as a whole, the data warehouse will not necessarily hold all the details down to daily transactions. Possible grains include:
  Daily sales by date, product and customer
  Weekly sales by product and customer
  Monthly sales by product and customer
  Quarterly sales by product and customer
  Yearly sales by product and customer

Usually in data warehouses (EDW) we tend to keep POS-level data, whereas in data marts we keep it aggregated by week or month, so that we never lose the detailed information. This detail-level data can be used to get at the micro behaviors of our customers (especially in data mining).

Data Warehousing Objects:
Data warehousing consists of only two objects:
• Fact
• Dimension

Fact Tables:
A fact table typically has two types of columns: those that contain numeric facts (often called measurements), and those that are foreign keys to dimension tables. A fact table contains either detail-level facts or facts that have been aggregated. Fact tables that contain aggregated facts are often called summary tables. A fact table usually contains facts with the same level of aggregation. Though most facts are additive, they can also be semi-additive or non-additive. Additive facts can be aggregated by simple arithmetical addition. A common example of this is sales. Non-additive facts cannot be added at all. An example of this is averages. Semi-additive
  • 44. facts can be aggregated along some of the dimensions and not along others. An example of this is inventory levels, where you cannot tell what a level means simply by looking at it. Dimension Tables: A dimension is a structure, often composed of one or more hierarchies, that categorizes data. Dimensional attributes help to describe the dimensional value. They are normally descriptive, textual values. Several distinct dimensions, combined with facts, enable you to answer business questions. Commonly used dimensions are customers, products, and time. Dimension data is typically collected at the lowest level of detail and then aggregated into higher level totals that are more useful for analysis. These natural rollups or aggregations within a dimension table are called hierarchies. Hierarchies: Hierarchies are logical structures that use ordered levels as a means of organizing data. A hierarchy can be used to define data aggregation. For example, in a time dimension, a hierarchy might aggregate data from the month level to the quarter level to the year level. A hierarchy can also be used to define a navigational drill path and to establish a family structure. Within a hierarchy, each level is logically connected to the levels above and below it. Data values at lower levels aggregate into the data values at higher levels. A dimension can be composed of more than one hierarchy. For example, in the product dimension, there might be two hierarchies--one for product categories and one for product suppliers. Dimension hierarchies also group levels from general to granular. Query tools use hierarchies to enable you to drill down into your data to view different levels of granularity. This is one of the key benefits of a data warehouse. When designing hierarchies, you must consider the relationships in business structures. For example, a divisional multilevel sales organization. Hierarchies impose a family structure on dimension values. 
For a particular level value, a value at the next higher level is its parent, and values at the next lower level are its children. These familial relationships enable analysts to access data quickly.
[Figure: time hierarchy with the levels Year, Quarter, Month, Week]
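The month to quarter to year rollup described above can be sketched in a few lines of plain Python. This is an illustrative sketch only: the monthly sales figures are made up, and `quarter_of`, `roll_up` and `drill_down` are hypothetical helper names, not part of any query tool. It relies on sales being an additive fact, so child values can simply be summed into their parent.

```python
from collections import defaultdict

# Hypothetical monthly sales facts: (year, month) -> amount.
monthly_sales = {
    (2006, 1): 120, (2006, 2): 80, (2006, 3): 100,
    (2006, 4): 90, (2006, 5): 110,
    (2007, 1): 150,
}

def quarter_of(month):
    """Parent level of a month in the time hierarchy (quarters 1..4)."""
    return (month - 1) // 3 + 1

def roll_up(facts, parent_of):
    """Aggregate an additive fact from one level to its parent level."""
    totals = defaultdict(int)
    for key, amount in facts.items():
        totals[parent_of(key)] += amount
    return dict(totals)

def drill_down(facts, parent_of, parent):
    """Return the children of one parent value: one step down the drill path."""
    return {k: v for k, v in facts.items() if parent_of(k) == parent}

def month_to_quarter(key):
    year, month = key
    return (year, quarter_of(month))

def quarter_to_year(key):
    year, _quarter = key
    return year

# month -> quarter -> year
quarterly_sales = roll_up(monthly_sales, month_to_quarter)
yearly_sales = roll_up(quarterly_sales, quarter_to_year)

print(quarterly_sales)
print(yearly_sales)
print(drill_down(monthly_sales, month_to_quarter, (2006, 2)))
```

A semi-additive fact such as an inventory level could not be passed through `roll_up` along the time dimension in this way, since summing month-end levels does not give a meaningful quarter-end level.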
  • 45. How to handle Slowly Changing Dimensions (SCDs) in data model design?
Posted by Dylan Wan on January 13, 2007

There are multiple methods to handle slowly changing dimensions. Which technique to use depends on your business requirements. The choice among these three methods is not a technical design decision, since their behaviors are different.

Type One: Overwrite the old data with new data
Using this method, you do not store the history. For example, say that each customer can have one salesrep at any given point in time. When the salesrep of ABC Inc. changes from Sandy to Laura, the fact that Sandy was a salesrep of ABC Inc. is not kept anywhere. Any report by salesrep will assume that Laura has always been the salesrep of ABC Inc. and will count all the sales made by Sandy as Laura's.

The above example may not seem to make business sense. However, if you only report the sales of the current period, and the salesrep does not change during the period, this method is acceptable. Many OLTP tables do not need to track the history of changes, and thus this method may be used by the source application. However, if you want to report on historical data, the data warehouse can still use other methods to track the history even if your OLTP system does not.

Type Two: Add a new record at the time of the change
Using this method, all prior history is saved. There are alternative methods to model the key of this table.

Method A - No surrogate key - Use timestamp
When a change happens, a new record is added to the table. All the attributes are copied from the previous record except the changed values. The natural key is copied as well, so the timestamp is used to differentiate the records. When a fact table is joined with the dimension and you are interested in the historical data, the timestamp is used as part of the join condition. To ease the join, the record typically uses two date columns: the effective start date and the effective end date.
  • 46. Method B - No surrogate key - Use version number
Instead of using the date column, a version number is used to differentiate the different versions of the records. This technique requires the fact table to store both the natural key and the version number in order to retrieve a given version of the dimension data.

Method C - Use a surrogate key
When an attribute changes, a new record is added with a sequence-generated key, and the fact table uses this key column as the foreign key.

Type Three: Track changes using a separate column
Using this method, you use a separate column of the dimension table to store the value of the previous year, in addition to the current year's data. This method does not track all the history, just one prior version. When the data changes, the old value needs to be moved from the current value column to the prior column, and the new value overwrites the current column. This method is used when the changes are not random but occur at a predefined interval, such as annually.
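The three types can be compared side by side on the Sandy-to-Laura example. The sketch below is a simplified illustration in plain Python, with dictionaries standing in for dimension rows. The column names (`customer_key`, `prior_salesrep`, `start_date`, `end_date`), the function names and the dates are all assumptions made for the example; Type Two is shown in its Method C form, with a surrogate key plus effective dates.

```python
from datetime import date

def scd_type1(row, new_rep):
    """Type One: overwrite in place; the old value is lost."""
    row["salesrep"] = new_rep
    return row

def scd_type2(rows, customer, new_rep, change_date):
    """Type Two (Method C): close the current row and add a new row with a
    new surrogate key. Effective intervals are treated as [start, end)."""
    next_key = max(r["customer_key"] for r in rows) + 1
    for r in rows:
        if r["customer"] == customer and r["end_date"] is None:
            r["end_date"] = change_date
    rows.append({"customer_key": next_key, "customer": customer,
                 "salesrep": new_rep, "start_date": change_date,
                 "end_date": None})
    return rows

def scd_type3(row, new_rep):
    """Type Three: move the current value to the prior column, then overwrite."""
    row["prior_salesrep"] = row["salesrep"]
    row["salesrep"] = new_rep
    return row

# Hypothetical starting rows and change date, for illustration only.
t1 = scd_type1({"customer": "ABC Inc.", "salesrep": "Sandy"}, "Laura")

t2 = scd_type2(
    [{"customer_key": 1, "customer": "ABC Inc.", "salesrep": "Sandy",
      "start_date": date(2006, 1, 1), "end_date": None}],
    "ABC Inc.", "Laura", date(2007, 1, 13))

t3 = scd_type3({"customer": "ABC Inc.", "salesrep": "Sandy",
                "prior_salesrep": None}, "Laura")

print(t1)
for r in t2:
    print(r)
print(t3)
```

With Type Two, fact rows loaded before the change keep pointing at `customer_key` 1 and so stay attributed to Sandy, while new fact rows use key 2. That is the behavioral difference from Type One that the text describes.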
  • 47.
  • 48. Structured Query Language
SQL is a database language used to create, manipulate and control access to database objects. SQL is a non-procedural language used to access relational databases. It is a flexible, efficient language with features designed to manipulate and examine relational data. SQL is only used for the definition and manipulation of database objects. It cannot be used for application development tasks such as form definitions or the creation of procedures. For that you need a 3GL such as COBOL or a 4GL such as dBase to provide front-end support to the database.

Key features of SQL are:
• Non-procedural language
• Unified language
• Common language for all relational databases (syntax may change between different RDBMSs)

SQL is made of three sub-languages:
• Data Definition Language (DDL)
• Data Manipulation Language (DML)
• Data Control Language (DCL)

Data Definition Language (DDL): allows you to define database objects at the conceptual level. It consists of commands to create objects and alter the structure of objects such as tables, views, indexes, etc. Commonly used DDL statements are CREATE, DROP, etc.
If you want to create a table STUDENT, use the following syntax (PHONE is 12 characters wide so that it can hold values such as 972-912-4008):
    CREATE TABLE STUDENT (
        STUDENT_ID INTEGER PRIMARY KEY,
        STUDENT_NM VARCHAR(30),
        COURSE_ID VARCHAR(15),
        PHONE VARCHAR(12),
        ADDRESS VARCHAR(50)
    );
To drop a table from the database:
    DROP TABLE STUDENT;

Data Manipulation Language (DML): allows you to retrieve or update data within a database. It is used for query, insertion, deletion and updating of information stored in databases, e.g. Select, Insert, Update, Delete.
  • 49.
STUDENT_ID  STUDENT_NM  COURSE_ID     PHONE         ADDRESS
1001        JAMES       Oracle        972-888-9018  888, North Central Exp, Dallas, TX - 75089
1002        JIM         MSSql Server  972-678-8909  567, Preston Road, Dallas, TX - 75240
1003        BRUCE       Java          214-571-1567  1234, Elm Street, Dallas, TX - 75039

Select statement:
The select statement in SQL is used to display certain data from a table. For example, suppose you want to know what course Jim is taking. A select statement fetches the information you want, given the information you have. In this scenario, the information you have is student_nm (Jim) and the information you want is course_id; the intersection of that row and that column in the table is what you are looking for.
    SELECT (what you want) FROM (which tables) WHERE (what you have)
The select statement to find Jim's course_id looks like this:
    SELECT COURSE_ID FROM STUDENT WHERE STUDENT_NM = 'JIM'
You will get the result as:
    COURSE_ID
    MSSql Server
If you want to see all the rows in the table, your select will be:
    SELECT * FROM STUDENT;
If you would like to show the student_nm and address of everyone attending the Oracle course, in the form of a report, your select will look like:
    SELECT STUDENT_NM, ADDRESS FROM STUDENT WHERE COURSE_ID = 'Oracle'
The result will be:
    STUDENT_NM  ADDRESS
  • 50.     JAMES       888, North Central Exp, Dallas, TX - 75089

Insert Statement
The insert statement is used to insert a new row into the table. For example, if a new student DAVE is joining the Java course, use the INSERT SQL statement:
    INSERT INTO STUDENT (STUDENT_ID, STUDENT_NM, COURSE_ID, PHONE, ADDRESS)
    VALUES (1004, 'DAVE', 'Java', '972-912-4008', '567, Washington Ave, Dallas - 75543')
After executing the insert statement, your table should look like the one below when you issue a select from the student table:
STUDENT_ID  STUDENT_NM  COURSE_ID     PHONE         ADDRESS
1001        JAMES       Oracle        972-888-9018  888, North Central Exp, Dallas, TX - 75089
1002        JIM         MSSql Server  972-678-8909  567, Preston Road, Dallas, TX - 75240
1003        BRUCE       Java          214-571-1567  1234, Elm Street, Dallas, TX - 75039
1004        DAVE        Java          972-912-4008  567, Washington Ave, Dallas - 75543

Update Statement
The update statement is used to change existing information in the table. For example, if DAVE moved to another address, we need to change the ADDRESS column of DAVE's record. If the new address is 146, Dallas Parkway, Dallas - 75240, then your update should be:
    UPDATE STUDENT SET ADDRESS = '146, Dallas Parkway, Dallas - 75240' WHERE STUDENT_NM = 'DAVE'
To make sure you updated the ADDRESS column for DAVE, issue the following SQL:
    SELECT * FROM STUDENT WHERE STUDENT_NM = 'DAVE'
You should then see the following result:
STUDENT_ID  STUDENT_NM  COURSE_ID     PHONE         ADDRESS
1004        DAVE        Java          972-912-4008  146, Dallas Parkway, Dallas - 75240

Delete Statement
  • 51. The DELETE statement is used to delete a row from the table, i.e. to remove records from the table. For example, JAMES moved to a different city and no longer wants to take the course. To remove JAMES's record from the table, we use the DELETE statement:

        DELETE FROM STUDENT WHERE STUDENT_NM = 'JAMES'

    Once you delete the record and select all the information from the student table, you should see the following:

        STUDENT_ID  STUDENT_NM  COURSE_ID     PHONE         ADDRESS
        1002        JIM         MSSql Server  972-678-8909  567, Preston Road, Dallas, TX - 75240
        1003        BRUCE       Java          214-571-1567  1234, Elm Street, Dallas, TX - 75039
        1004        DAVE        Java          972-912-4008  146, Dallas Parkway, Dallas - 75240

    If you do not include a WHERE clause in the delete statement, it will remove all the rows from the table.

    Data Control Language (DCL)

    One of the main advantages of an RDBMS is the security it provides for the data in the database. You can allow a user to perform a specific operation, or all operations, on certain objects. Examples of DCL statements are GRANT and REVOKE. GRANT is used to give a permission to a user so that the user can perform that operation. REVOKE is used to take that permission on that object back from the user.

    For example, suppose we have two users, JAMES and DAVID. If JAMES created a table called ITEMS, then JAMES becomes the owner of that table. DAVID cannot access the ITEMS table because he is not its owner. DAVID can access ITEMS only if JAMES gives him permission on the table. JAMES can give DAVID different types of access on the ITEMS table, such as Select, Update, Delete and Insert.
    For example, if JAMES wants to give DAVID only Select on ITEMS, he can issue:

        GRANT SELECT ON ITEMS TO DAVID

    If JAMES wants to give DAVID only Select and Insert on ITEMS, he can issue:

        GRANT SELECT, INSERT ON ITEMS TO DAVID

    If JAMES wants to give DAVID all the operations on ITEMS, he can issue:

        GRANT ALL ON ITEMS TO DAVID

    Once you grant all permissions on an object to a user, that user can perform any of those operations on the table, much as the owner can, although ownership of the table itself does not change.
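
    The SELECT, INSERT, UPDATE and DELETE statements from the STUDENT walkthrough can be tried end to end with a short script. This is only a minimal sketch: it assumes Python's built-in sqlite3 module with an in-memory SQLite database standing in for Oracle, so the column types (INTEGER, TEXT) are SQLite's and not the ones an Oracle table would use. GRANT and REVOKE are not covered because SQLite has no user permissions.

```python
import sqlite3

# In-memory SQLite database standing in for the Oracle database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE STUDENT ("
    " STUDENT_ID INTEGER PRIMARY KEY,"
    " STUDENT_NM TEXT, COURSE_ID TEXT, PHONE TEXT, ADDRESS TEXT)"
)

# Load the three starting rows of the STUDENT table.
conn.executemany(
    "INSERT INTO STUDENT (STUDENT_ID, STUDENT_NM, COURSE_ID, PHONE, ADDRESS)"
    " VALUES (?, ?, ?, ?, ?)",
    [
        (1001, "JAMES", "Oracle", "972-888-9018",
         "888, North Central Exp, Dallas, TX - 75089"),
        (1002, "JIM", "MSSql Server", "972-678-8909",
         "567, Preston Road, Dallas, TX - 75240"),
        (1003, "BRUCE", "Java", "214-571-1567",
         "1234, Elm Street, Dallas, TX - 75039"),
    ],
)

# SELECT: what you want (COURSE_ID) from what you have (STUDENT_NM).
jim_course = conn.execute(
    "SELECT COURSE_ID FROM STUDENT WHERE STUDENT_NM = ?", ("JIM",)
).fetchone()[0]

# INSERT: add the new student DAVE.
conn.execute(
    "INSERT INTO STUDENT (STUDENT_ID, STUDENT_NM, COURSE_ID, PHONE, ADDRESS)"
    " VALUES (1004, 'DAVE', 'Java', '972-912-4008',"
    " '567, Washington Ave, Dallas - 75543')"
)

# UPDATE: change DAVE's address, then read it back to verify.
conn.execute(
    "UPDATE STUDENT SET ADDRESS = ? WHERE STUDENT_NM = ?",
    ("146, Dallas Parkway, Dallas - 75240", "DAVE"),
)
dave_address = conn.execute(
    "SELECT ADDRESS FROM STUDENT WHERE STUDENT_NM = 'DAVE'"
).fetchone()[0]

# DELETE: remove JAMES; the WHERE clause keeps the other rows intact.
conn.execute("DELETE FROM STUDENT WHERE STUDENT_NM = 'JAMES'")
remaining = [
    row[0]
    for row in conn.execute("SELECT STUDENT_NM FROM STUDENT ORDER BY STUDENT_ID")
]

print(jim_course)
print(dave_address)
print(remaining)
```

    The values are passed as parameters (the ? placeholders) where practical, which is the usual way to supply data from a program; the literal SQL text otherwise follows the statements in the walkthrough.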
  • 52. Oracle datatypes

    Data in a database is stored in the form of tables. Each table consists of rows and columns that store the data. A particular column in a table must contain the same type of data. For example:

        PLAYER_NAME (char)  COUNTRY (char)  DATE_OF_BIRTH (date)  ROOM_NO (number)
        AGASSI              USA             10/12/1969            1004
        WILLIAM             USA             01/15/1975            1006
        JIM                 RUSSIA          05/25/1980            1007
        HINGIS              SWITZERLAND     06/25/1979            1009

    Every column holds a certain kind of information: PLAYER_NAME is a char column, DATE_OF_BIRTH is a date column, and ROOM_NO is a number column.

    Different datatypes available in an Oracle database:

    CHAR: Stores character data, for example the name of a person (you can save anything in a character field).

    VARCHAR: Also stores character data. The only difference between CHAR and VARCHAR is the way the database stores the data. To understand the difference better, take the following example.

        CREATE TABLE EMPLOYEE (EMP_NO NUMBER(4), ENAME CHAR(15))

        EMP_NO  ENAME
        888     CLARK
        889     KING
        890     DAVID COOPER

    Because the ENAME column is defined as CHAR(15), every value you put in that column occupies all 15 bytes; i.e. CLARK is a 5-byte string, so the database pads it with 10 spaces.

        CREATE TABLE EMPLOYEE (EMP_NO NUMBER(4), ENAME VARCHAR(15))

        EMP_NO  ENAME
        888     CLARK
        889     KING
        890     DAVID COOPER
  • 53. Here, because ENAME is defined as VARCHAR(15), each value occupies only the space it requires, so in the above table the ename CLARK occupies only 5 bytes in the database.

    So what are the advantages and disadvantages? The rule of thumb is that if you are using a character column as a primary key, it is better to make it a CHAR field. If you are using a column to hold comments, use VARCHAR.

    NUMBER: Used to store numbers. For example, if you want to store employee numbers, define the column's data type as NUMBER. If you want a column to store currency, you can define it as NUMBER(7,2).

    DATE: Used to store dates, such as a person's date of birth or the date of joining a company.

    LONG: Stores variable-length character data.

    RAW, LONG RAW: Store binary data of variable length.

    LOB: Large objects, used to store binary files. In addition, Oracle 8 supports CLOB, BLOB and BFILE:

        CLOB  - stores large character data; a table can have multiple columns of this type.
        BLOB  - can store large binary objects such as graphics, video and sound files.
        BFILE - stores file pointers to LOBs managed by file systems external to the database.

    Constraints

    When you bind a business rule to a column in a table, those rules are called constraints. Constraints are defined while creating the table. For example, if you cannot have an employee who does not have a name, then the employee name column in the employee table should be a NOT NULL column. NOT NULL is a constraint. The following table shows the constraint types with short descriptions.

        Constraint Type  Description
        NOT NULL         You must provide a value in that column; you cannot leave that column blank.
        PRIMARY KEY      No duplicate values allowed; for example, Empno in the Employee table
  • 54. should be unique.
        CHECK            Checks the value and controls which values can be inserted or updated.
        DEFAULT          Assigns a default value if no value is given.
        REFERENCES       Maintains referential integrity (foreign key).

    Examples of constraints commonly used to implement business rules:

    NOT NULL

    If we have a business rule saying that all customers should have a name, we cannot have any customer without a name. To implement that business rule, we can create the customer table and specify the customer name column as NOT NULL. Example:

        CREATE TABLE EMPLOYEE (EMPNO NUMBER(4) PRIMARY KEY, ENAME VARCHAR(15) NOT NULL);

    CHECK

    A check constraint is used where we define a condition on a column. It consists of the keyword CHECK followed by the condition:

        col_name datatype CHECK (col_name IN (value1, value2))

    Example: if you have a business rule saying that all employees in the organization should get at least $500, we can use a CHECK constraint while creating the table.

        CREATE TABLE EMPLOYEE (
            EMPNO NUMBER(4) PRIMARY KEY,
            ENAME VARCHAR(15) NOT NULL,
            SALARY NUMBER(7,2) CHECK (SALARY >= 500)
        );

    DEFAULT

    When a row is inserted into a table without values for every column, SQL must insert a default value to fill in the omitted columns, or the command will be rejected. The most common default value is NULL, which can be used only with columns that are not defined as NOT NULL. A default value is assigned to a column when the table is created with CREATE TABLE. Example:

        CREATE TABLE ITEM (ITEM_ID NUMBER(4) PRIMARY KEY, ITEM_NAME VARCHAR(15), ITEM_DESC VARCHAR(100), QOH NUMBER(4) DEFAULT 100)

    Assigning a default value of 0 to numeric columns makes computations simpler, because arithmetic involving NULL yields NULL.

    PRIMARY KEY

    The primary key of a table is a unique identifier of a row. For example, if you are maintaining customer profiles, you should assign a particular number to each customer. So customer_number should be defined as the primary key of the Customer table.
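
    The NOT NULL, CHECK, DEFAULT and PRIMARY KEY rules described so far can be watched in action with a short script. This is only a sketch under stated assumptions: it uses Python's built-in sqlite3 module with an in-memory SQLite database in place of Oracle, so the column types are SQLite's; the CHECK is written as SALARY >= 500 to match the "at least $500" rule; and the sample rows (including the item 'PEN') are hypothetical.

```python
import sqlite3

# In-memory SQLite database; SQLite stands in for Oracle here (assumption),
# so the column types are SQLite's rather than NUMBER/VARCHAR.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE EMPLOYEE ("
    " EMPNO INTEGER PRIMARY KEY,"
    " ENAME TEXT NOT NULL,"
    " SALARY NUMERIC CHECK (SALARY >= 500))"
)
conn.execute(
    "CREATE TABLE ITEM ("
    " ITEM_ID INTEGER PRIMARY KEY,"
    " ITEM_NAME TEXT,"
    " QOH INTEGER DEFAULT 100)"
)


def try_insert(sql):
    """Run one INSERT; return 'ok', or 'rejected' if a constraint blocks it."""
    try:
        conn.execute(sql)
        return "ok"
    except sqlite3.IntegrityError:
        return "rejected"


results = {
    # Satisfies every constraint.
    "valid row": try_insert("INSERT INTO EMPLOYEE VALUES (1, 'CLARK', 800)"),
    # PRIMARY KEY: EMPNO 1 already exists.
    "duplicate EMPNO": try_insert("INSERT INTO EMPLOYEE VALUES (1, 'KING', 900)"),
    # NOT NULL: ENAME must be supplied.
    "missing ENAME": try_insert("INSERT INTO EMPLOYEE VALUES (2, NULL, 900)"),
    # CHECK: salary must be at least 500.
    "salary below 500": try_insert("INSERT INTO EMPLOYEE VALUES (3, 'KING', 100)"),
}

# DEFAULT: QOH is omitted from the column list, so the default is stored.
conn.execute("INSERT INTO ITEM (ITEM_ID, ITEM_NAME) VALUES (10, 'PEN')")
default_qoh = conn.execute(
    "SELECT QOH FROM ITEM WHERE ITEM_ID = 10"
).fetchone()[0]

print(results)
print(default_qoh)
```

    Each rejected insert leaves the table unchanged, which is the point of declaring the rule in the table definition instead of relying on every program to check it.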
  • 55. REFERENCES

    REFERENCES defines a foreign key. A foreign key column value refers to a column in another table, which is checked to see whether the value exists there.

    UNIQUE

    The values entered into the column are unique, i.e. no duplicate values exist. This constraint assures the business that no duplicates are allowed.

    Data Definition Language

    This is the part of the SQL language that creates database objects. Examples of database objects are tables, procedures, functions and packages. When you create or drop a table you are modifying the structure of the database, which is why it is called data definition language. When you issue a CREATE, ALTER or DROP statement, the database internally performs a commit, which is why DDL cannot be included as part of a transaction. Following are a few DDL statements.

        Create table course ( course_id number(5) not null primary key, course_name varchar2(30) not null, start_date date );

        Alter table course modify ( start_date date not null );

        Alter table course add ( instructor_id number(5) null );

        Drop table course

        Create table course ( course_id number(5) not null primary key, course_name varchar(30), start_date date ) tablespace course_info storage (initial 1024k next 1024 pctincrease 10)

    Data Manipulation Language

    Data manipulation in an RDBMS means maintaining the data in the database. There are three DML statements: INSERT, UPDATE and DELETE. The INSERT statement is used to insert a new record into a table. The UPDATE statement is used to change existing information in a table. The DELETE statement is used to remove certain information from a table.

    Take an example: if you are running an apartment complex where you rent apartments, the day-to-day records would look like this.

        tenant_id  aptno  tenant_name  home_phone    work_phone    apt_rent  no_of_pets
        1000       888    SMITH        881-890-9000  767-908-5432  900       1
        1001       889    STEVE        881-909-8971  898-543-9032  890       0
        1002       890    BILL         781-897-9011  567-891-9108  880       2

    INSERT Statement
  • 56. If a person named JAMES rented an apartment, we need to add his information to the table. We have to do an INSERT because the information does not yet exist in the table. The following information has to be entered into the database: tenant_name = JAMES, aptno = 891, home_phone = 676-789-9011, work_phone = 777-567-1234, apt_rent = 880 and no_of_pets = 1. The INSERT statement can be written as:

        INSERT into TENANT (tenant_id, aptno, tenant_name, home_phone, work_phone, apt_rent, no_of_pets)
        VALUES (1003, 891, 'JAMES', '676-789-9011', '777-567-1234', 880, 1)

    After executing the insert statement, the table should have four rows as shown below.

        tenant_id  aptno  tenant_name  home_phone    work_phone    apt_rent  no_of_pets
        1000       888    SMITH        881-890-9000  767-908-5432  900       1
        1001       889    STEVE        881-909-8971  898-543-9032  890       0
        1002       890    BILL         781-897-9011  567-891-9108  880       2
        1003       891    JAMES        676-789-9011  777-567-1234  880       1

    The different INSERT syntaxes available are shown below.

    Syntax 1

        INSERT into table_name (col1, col2, col3....) values (value1, value2, value3.....)

    In Syntax 1 we specify the column names of the table and their respective values. In application development this syntax is recommended for inserts: if a column is later added to the table, the statement does not raise an error; the new column simply receives no value (provided it allows NULL or has a default), and the program keeps running.

    Syntax 2

        INSERT into table_name values (value1, value2.....)

    In Syntax 2 we do not specify the column names; the values are passed to all the columns in order.

    Syntax 3

        INSERT into table_name (col1, col2, col3...) SELECT col1, col2, col3........ FROM table

    With Syntax 3 we can insert multiple rows using one INSERT statement, whereas with Syntax 1 and Syntax 2 you can insert only one row at a time.

    UPDATE Statement