2. Why data reduction?
Huge amounts of data are being created every day.
Big data platforms have been developed.
Older algorithms perform poorly on data of this size.
Most data mining algorithms are implemented column-wise.
Together, these factors have pushed the need for data reduction procedures.
3. What is data reduction?
Data reduction is a process that reduces the volume of the
original data and represents it in a much smaller volume.
It maintains the integrity of the data while reducing its volume.
The time required for data reduction should not overshadow the time
saved by data mining on the reduced data set.
Data reduction should not affect the results obtained from data mining, or should affect them only minimally.
Data reduction increases the efficiency of data mining.
4. Data reduction strategies
1. Data cube aggregation
2. Attribute subset selection
3. Dimensionality reduction
4. Numerosity reduction
5. Discretization and concept hierarchy generation
5. Data Cube Aggregation
This technique is used to aggregate
(combine) data into a simpler form. The data is
summarized in such a way that the summary itself
can be used as the result.
6. Data Cube Aggregation
The data gives the gross profit, in dollars, that each state
earned from selling laptops. The states are listed in a
separate table for each country.
7. Gross profit ($) by state, aggregated by country

Country: USA
States        Gross Profit($)
Arizona       500
Texas         320
Illinois      430

Country: India
States        Gross Profit($)
Kerala        245
Tamil Nadu    380
Goa           950

Country: Canada
States        Gross Profit($)
Alberta       420
Manitoba      200
Ontario       300

Aggregated by country
Country       Gross Profit($)
USA           1250
India         1575
Canada        920
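The roll-up from state-level profit to country-level profit shown in the tables above can be sketched in a few lines of Python. This is a minimal illustration, not a full data cube: the names `state_profit` and `roll_up` are invented for this sketch, and the figures are copied from the tables.

```python
from collections import defaultdict

# State-level gross profit ($), copied from the example tables.
# Each row is (country, state, profit).
state_profit = [
    ("USA", "Arizona", 500),
    ("USA", "Texas", 320),
    ("USA", "Illinois", 430),
    ("India", "Kerala", 245),
    ("India", "Tamil Nadu", 380),
    ("India", "Goa", 950),
    ("Canada", "Alberta", 420),
    ("Canada", "Manitoba", 200),
    ("Canada", "Ontario", 300),
]


def roll_up(rows):
    """Aggregate state-level profit into one total per country."""
    totals = defaultdict(int)
    for country, _state, profit in rows:
        totals[country] += profit
    return dict(totals)


country_profit = roll_up(state_profit)
print(country_profit)  # {'USA': 1250, 'India': 1575, 'Canada': 920}
```

Nine state-level rows are reduced to three country-level rows, and a query about profit per country can be answered from the smaller table alone.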
8. Attribute Subset Selection
From a large number of attributes, a minimal
attribute set is obtained by eliminating
the irrelevant attributes, whose removal has
little effect on the data. Mining the reduced data
makes the results easier to understand.
9. Methods of Attribute Subset Selection are:
1. Stepwise forward selection: It starts with an empty set and adds the
relevant attributes one at a time, ignoring the rest.
2. Stepwise backward elimination: It starts with the full set and removes
the irrelevant attributes one at a time, keeping the rest.
3. Combining forward selection and backward elimination: At each step it
selects the best remaining attribute and removes the worst.
4. Decision tree induction: It builds a flowchart-like structure that chooses
the best attribute to partition the data at each node. Attributes that do
not appear in the tree are treated as irrelevant.
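As an illustration of the first method, stepwise forward selection can be sketched as a greedy loop. This is a minimal sketch under stated assumptions: the `score` function that rates an attribute subset is supplied by the caller, and the `toy_weights` table below is invented purely for the demonstration. In practice the score would come from a statistical test or from a model's accuracy on the data.

```python
def forward_selection(attributes, score, min_gain=0):
    """Greedy stepwise forward selection.

    Starts from the empty set and repeatedly adds the attribute that
    improves score(subset) the most. Stops when no remaining attribute
    improves the score by more than min_gain.
    """
    selected = []
    remaining = list(attributes)
    best = score(frozenset(selected))
    while remaining:
        # Score the current selection extended by each candidate attribute.
        candidates = [(score(frozenset(selected + [a])), a) for a in remaining]
        top_score, top_attr = max(candidates, key=lambda pair: pair[0])
        if top_score - best <= min_gain:
            break  # no candidate is worth adding
        selected.append(top_attr)
        remaining.remove(top_attr)
        best = top_score
    return selected


# Hypothetical relevance weights, made up for this demo only.
toy_weights = {"Name": 0, "Age": 3, "Gender": 2, "Address": 0, "Phone": 0}


def toy_score(subset):
    """Toy scoring function: sum of the made-up relevance weights."""
    return sum(toy_weights[a] for a in subset)


chosen = forward_selection(toy_weights.keys(), toy_score)
print(chosen)  # ['Age', 'Gender']
```

Stepwise backward elimination follows the same pattern in reverse: start from the full set and repeatedly drop the attribute whose removal costs the least.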
10. Example
A data set is given from which we need to count the
male, female and transgender individuals who are
eligible to vote.
Initial attribute set = {Name, Age, Gender, Address, Phone}
13. Decision Tree Induction
Initial attribute set = {Name, Age, Gender, Address, Phone}

Age
  < 18   ->  Not a voter
  >= 18  ->  Gender
               Male | Female | Transgender

Reduced attribute set = {Age, Gender}
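The decision tree above can be written as a small function that reads only the two attributes kept in the reduced set. This is a sketch: the function name `voter_category` and the sample records are invented for illustration and are not part of the original data set.

```python
from collections import Counter


def voter_category(age, gender):
    """Follow the tree: test Age first, then branch on Gender."""
    if age < 18:
        return "Not a voter"
    return gender  # "Male", "Female" or "Transgender"


# Made-up records with the full initial attribute set.
records = [
    {"Name": "A", "Age": 25, "Gender": "Female", "Address": "n/a", "Phone": "n/a"},
    {"Name": "B", "Age": 17, "Gender": "Male", "Address": "n/a", "Phone": "n/a"},
    {"Name": "C", "Age": 40, "Gender": "Male", "Address": "n/a", "Phone": "n/a"},
    {"Name": "D", "Age": 19, "Gender": "Transgender", "Address": "n/a", "Phone": "n/a"},
]

# Only Age and Gender are read; Name, Address and Phone are never used.
counts = Counter(voter_category(r["Age"], r["Gender"]) for r in records)
print(counts["Male"], counts["Female"], counts["Transgender"], counts["Not a voter"])
# 1 1 1 1
```

Because Name, Address and Phone never appear in the tree, dropping them does not change the counts, which is why the reduced attribute set is {Age, Gender}.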