Conjoint Analysis
Conjoint analysis is a marketing research technique for determining customer preferences. It is used
to analyse how customers value the different attributes of a product (or service), and thus gives
insight into the trade-offs they are willing to make among those attributes. Put simply, it tells how
much each feature of a product is worth to consumers.
The study involves surveying people with a set of attribute combinations, which the survey-takers
rank or rate by preference. The analysis then models customer preferences for different combinations
of attributes. The attributes are termed factors and their different values are termed levels.
In our example, which applies conjoint analysis using SPSS, we analyse data on a carpet cleaner, using
price, brand, money-back guarantee, package design and seal as the attributes on which consumers base
their preferences. Using two data sets (the plan and the preferences), we calculate the part-worths
and decide on the weightage of each attribute from the preferences that the users have provided.
Variable name   Variable label            Value label
package         package design            A*, B*, C*
brand           brand name                K2R, Glory, Bissell
price           price                     $1.19, $1.39, $1.59
seal            Good Housekeeping seal    no, yes
money           money-back guarantee      no, yes
Code to import the data and run the analysis:
* Load the plan file and the preference data.
GET
  FILE='C:\Users\Abhi\Desktop\carpet_plan.sav'.
DATASET NAME DataSet1 WINDOW=FRONT.
GET
  FILE='C:\Users\Abhi\Desktop\carpet_prefs.sav'.
DATASET NAME DataSet2 WINDOW=FRONT.
* Estimate the part-worths from the ranked profiles PREF1 to PREF22.
CONJOINT PLAN='C:\Users\Abhi\Desktop\carpet_plan.sav'
  /DATA='C:\Users\Abhi\Desktop\carpet_prefs.sav'
  /SEQUENCE=PREF1 PREF2 PREF3 PREF4 PREF5 PREF6 PREF7 PREF8 PREF9 PREF10 PREF11
    PREF12 PREF13 PREF14 PREF15 PREF16 PREF17 PREF18 PREF19 PREF20 PREF21 PREF22
  /SUBJECT=ID
  /FACTORS=PACKAGE BRAND (DISCRETE)
    PRICE (LINEAR LESS)
    SEAL (LINEAR MORE) MONEY (LINEAR MORE)
  /PRINT=SUMMARYONLY.
Model Description
          N of Levels   Relation to Ranks or Scores
package   3             Discrete
brand     3             Discrete
price     3             Linear (less)
seal      2             Linear (more)
money     2             Linear (more)
Calculation of the part-worth of each attribute
Utilities
                      Utility Estimate   Std. Error
package   A*          -2.233             .192
          B*          1.867              .192
          C*          .367               .192
brand     K2R         .367               .192
          Glory       -.350              .192
          Bissell     -.017              .192
price     $1.19       -6.595             .988
          $1.39       -7.703             1.154
          $1.59       -8.811             1.320
seal      no          2.000              .287
          yes         4.000              .575
money     no          1.250              .287
          yes         2.500              .575
(Constant)            12.870             1.282
This table shows the utility (part-worth) scores and their standard errors for each factor level. Higher
utility values indicate greater preference. For the discrete factors (package and brand), the part-worths
of the levels sum to zero, apart from rounding. Among the brands, K2R is preferred over Bissell and
Glory. As expected, there is an inverse relationship between price and utility, with higher prices
corresponding to lower utility. The presence of a seal of approval or a money-back guarantee
corresponds to a higher utility. The total utility of a combination is the sum of the part-worths of its
levels plus the constant.
For example, if the cleaner had package design C*, brand Bissell, price $1.59, a seal of approval, and a
money-back guarantee, the total utility would be:
0.367 + (−0.017) + (−8.811) + 4.000 + 2.500 + 12.870 = 10.909.
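The total-utility calculation above can be sketched in a few lines of Python. This is only an illustration outside SPSS: the dictionary simply copies the part-worths from the Utilities table, and the names `UTILITIES`, `CONSTANT` and `total_utility` are made up for this sketch.

```python
# Part-worths copied from the Utilities table (names are illustrative only).
UTILITIES = {
    "package": {"A*": -2.233, "B*": 1.867, "C*": 0.367},
    "brand": {"K2R": 0.367, "Glory": -0.350, "Bissell": -0.017},
    "price": {"$1.19": -6.595, "$1.39": -7.703, "$1.59": -8.811},
    "seal": {"no": 2.000, "yes": 4.000},
    "money": {"no": 1.250, "yes": 2.500},
}
CONSTANT = 12.870


def total_utility(profile):
    """Add the part-worth of each chosen level to the constant."""
    return CONSTANT + sum(UTILITIES[factor][level] for factor, level in profile.items())


profile = {"package": "C*", "brand": "Bissell", "price": "$1.59",
           "seal": "yes", "money": "yes"}
u = total_utility(profile)
print(round(u, 3))  # 10.909, matching the hand calculation
```

Changing any level in `profile` gives the total utility of a different product combination.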
Importance:
Importance Values
package   35.635
brand     14.911
price     29.410
seal      11.172
money     8.872
We can see that package has the highest importance, followed by price. The money-back guarantee is of
least concern to the consumer. The values are computed by taking the utility range for each factor
separately and dividing by the sum of the utility ranges for all factors. The values thus represent
percentages and have the property that they sum to 100.
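The range formula can be sketched as follows. Note the assumption: the sketch applies the formula to the summary utilities from the Utilities table, so its figures do not reproduce the Importance Values table (presumably SPSS computes the importance for each subject and then averages, which cannot be redone from the summary output). It only illustrates the arithmetic; the variable names are made up.

```python
# Summary part-worths per factor, copied from the Utilities table.
UTILITIES = {
    "package": [-2.233, 1.867, 0.367],
    "brand": [0.367, -0.350, -0.017],
    "price": [-6.595, -7.703, -8.811],
    "seal": [2.000, 4.000],
    "money": [1.250, 2.500],
}

# Utility range of each factor: highest part-worth minus lowest part-worth.
ranges = {factor: max(values) - min(values) for factor, values in UTILITIES.items()}
total_range = sum(ranges.values())

# Importance = factor range as a percentage of the sum of all ranges.
importance = {factor: 100 * r / total_range for factor, r in ranges.items()}

for factor, value in importance.items():
    print(f"{factor:8s} {value:6.2f}")
```

By construction the percentages sum to 100, and package still comes out as the most important factor.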
Coefficients
          B Coefficient Estimate
price     -5.542
seal      2.000
money     1.250
The utility for a particular factor level is determined by multiplying the level by the coefficient. For
example, the predicted utility for a price of $1.19 was listed as −6.595 in the utilities table. This is
simply the value of the price level, 1.19, multiplied by the price coefficient, −5.542.
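As a rough check, the linear utilities can be recomputed from the B coefficients. The coding of seal and money as 1 (no) and 2 (yes) is an inference from the Utilities table (2.000 and 4.000, 1.250 and 2.500), not something stated in the output; small differences in the third decimal come from the rounded coefficient.

```python
# B coefficients from the Coefficients table.
COEFFICIENTS = {"price": -5.542, "seal": 2.000, "money": 1.250}

# Level values; the 1/2 coding for seal and money is an assumption.
LEVELS = {"price": [1.19, 1.39, 1.59], "seal": [1, 2], "money": [1, 2]}

# Utility of a level of a linear factor = level value * coefficient.
predicted = {
    factor: [COEFFICIENTS[factor] * level for level in levels]
    for factor, levels in LEVELS.items()
}

for factor, values in predicted.items():
    print(factor, [round(v, 3) for v in values])
```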
This table provides measures of the correlation between the observed and estimated preferences.
Preference Scores of Simulations
Card Number   ID   Score
1             1    10.258
2             2    14.292
The real power of conjoint analysis is the ability to predict preference for product profiles that
weren't rated by the subjects. These are referred to as simulation cases.
Preference Probabilities of Simulations
Card Number   ID   Maximum Utility   Bradley-Terry-Luce   Logit
1             1    30.0%             43.1%                30.9%
2             2    70.0%             56.9%                69.1%
The maximum utility model determines the probability as the number
of respondents predicted to choose the profile divided by the total
number of respondents. For each respondent, the predicted choice is
simply the profile with the largest total utility.
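A minimal sketch of the maximum utility rule is given below. The per-respondent utilities are hypothetical (the SPSS output above only reports the averaged scores), and they were chosen so that three of the ten respondents favour profile 1, which mirrors the 30.0% and 70.0% in the table.

```python
from collections import Counter

# Hypothetical total utilities of simulation profiles 1 and 2 for ten respondents.
respondent_utilities = [
    {1: 12.4, 2: 9.8},
    {1: 8.1, 2: 15.0},
    {1: 9.5, 2: 14.2},
    {1: 13.3, 2: 11.9},
    {1: 7.7, 2: 16.4},
    {1: 10.2, 2: 13.8},
    {1: 11.6, 2: 10.9},
    {1: 9.0, 2: 15.5},
    {1: 10.4, 2: 17.1},
    {1: 10.4, 2: 14.3},
]


def maximum_utility_shares(rows):
    """Share of respondents whose highest-utility profile is each profile."""
    choices = Counter(max(row, key=row.get) for row in rows)
    return {profile: 100 * choices[profile] / len(rows) for profile in rows[0]}


shares = maximum_utility_shares(respondent_utilities)
print(shares)  # {1: 30.0, 2: 70.0}
```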
Number of Reversals
Factor    price         3
          money         2
          seal          2
          brand         0
          package       0
Subject   1    Subject 1     1
          2    Subject 2     2
          3    Subject 3     0
          4    Subject 4     0
          5    Subject 5     0
          6    Subject 6     1
          7    Subject 7     0
          8    Subject 8     0
          9    Subject 9     1
          10   Subject 10    2
This table displays the number of reversals for each factor and for each subject. For example, three
subjects showed a reversal for price. That is, they preferred product profiles with higher prices.
Reversal Summary
N of Reversals   N of Subjects
1                3
2                2
Q. Perform Discriminant Analysis on the given dataset.
The chosen dataset contains the characteristics of people who have been given bank loans, together with whether or not they defaulted.
GET
  FILE='E:VGSOMSTUDYSECOND SEMBRMSPSS16Samplesbankloan.sav'.
DATASET NAME DataSet1 WINDOW=FRONT.
Discriminant
Notes
Output Created                               04-Apr-2013 18:39:05
Comments
Input      Data                              E:VGSOMSTUDYSECOND SEMBRMSPSS16Samplesbankloan.sav
           Active Dataset                    DataSet1
           File Label                        Bank Loan Default
           Filter                            <none>
           Weight                            <none>
           Split File                        <none>
           N of Rows in Working Data File    850
Missing Value Handling
           Definition of Missing             User-defined missing values are treated as missing in the analysis phase.
           Cases Used                        In the analysis phase, cases with no user- or system-missing values for
                                             any predictor variable are used. Cases with user-, system-missing, or
                                             out-of-range values for the grouping variable are always excluded.
Syntax
DISCRIMINANT
  /GROUPS=default(0 1)
  /VARIABLES=employ address age
  /ANALYSIS ALL
  /PRIORS EQUAL
  /STATISTICS=MEAN STDDEV UNIVF BOXM COEFF CORR TABLE
  /PLOT=COMBINED
  /PLOT=CASES
  /CLASSIFY=NONMISSING POOLED MEANSUB.
Resources  Processor Time                    00:00:00.047
           Elapsed Time                      00:00:00.121
[DataSet1] E:VGSOMSTUDYSECOND SEMBRMSPSS16Samplesbankloan.sav
Warnings
All-Groups Stacked Histogram is no longer displayed.
Analysis Case Processing Summary
Unweighted Cases                                                        N     Percent
Valid                                                                   700   82.4
Excluded   Missing or out-of-range group codes                          150   17.6
           At least one missing discriminating variable                 0     .0
           Both missing or out-of-range group codes and at least
           one missing discriminating variable                          0     .0
           Total                                                        150   17.6
Total                                                                   850   100.0
Group Statistics
                                                                     Valid N (listwise)
Previously defaulted                      Mean    Std. Deviation   Unweighted   Weighted
No      Years with current employer       9.51    6.664            517          517.000
        Years at current address          8.95    7.001            517          517.000
        Age in years                      35.51   7.708            517          517.000
Yes     Years with current employer       5.22    5.543            183          183.000
        Years at current address          6.39    5.925            183          183.000
        Age in years                      33.01   8.518            183          183.000
Total   Years with current employer       8.39    6.658            700          700.000
        Years at current address          8.28    6.825            700          700.000
        Age in years                      34.86   7.997            700          700.000
Tests of Equality of Group Means
                              Wilks' Lambda   F        df1   df2   Sig.
Years with current employer   .920            60.759   1     698   .000
Years at current address      .973            19.402   1     698   .000
Age in years                  .981            13.482   1     698   .000
Pooled Within-Groups Matrices
                                            Years with         Years at
                                            current employer   current address   Age in years
Correlation   Years with current employer   1.000              .292              .524
              Years at current address      .292               1.000             .588
              Age in years                  .524               .588              1.000
This matrix shows the correlations between the predictors. The largest correlation (.588) is between
years at current address and age in years.
Analysis 1
Box's Test of Equality of Covariance Matrices
Log Determinants
Previously defaulted    Rank   Log Determinant
No                      3      11.012
Yes                     3      10.501
Pooled within-groups    3      10.919
The ranks and natural logarithms of determinants printed are those of the group covariance matrices.
Test Results
Box's M          28.171
F     Approx.    4.665
      df1        6
      df2        7.335E5
      Sig.       .000
Tests null hypothesis of equal population covariance matrices.
Summary of Canonical Discriminant Functions
Eigenvalues
Function   Eigenvalue   % of Variance   Cumulative %   Canonical Correlation
1          .100(a)      100.0           100.0          .301
a. First 1 canonical discriminant functions were used in the analysis.
Wilks' Lambda
Test of Function(s)   Wilks' Lambda   Chi-square   df   Sig.
1                     .909            66.251       3    .000
Standardized Canonical Discriminant Function Coefficients
                              Function 1
Years with current employer   .980
Years at current address      .436
Age in years                  -.330
Structure Matrix
                              Function 1
Years with current employer   .934
Years at current address      .528
Age in years                  .440
Pooled within-groups correlations between discriminating variables and standardized canonical
discriminant functions.
Variables ordered by absolute size of correlation within function.
Functions at Group Centroids
Previously defaulted   Function 1
No                     .188
Yes                    -.530
Unstandardized canonical discriminant functions evaluated at group means.
Classification Statistics
Classification Processing Summary
Processed                                                  850
Excluded   Missing or out-of-range group codes             0
           At least one missing discriminating variable    0
Used in Output                                             850
Prior Probabilities for Groups
                               Cases Used in Analysis
Previously defaulted   Prior   Unweighted   Weighted
No                     .500    517          517.000
Yes                    .500    183          183.000
Total                  1.000   700          700.000
Classification Function Coefficients
                              Previously defaulted
                              No        Yes
Years with current employer   -.192     -.302
Years at current address      -.302     -.348
Age in years                  .797      .827
(Constant)                    -12.588   -12.444
Fisher's linear discriminant functions
Classification Results(a)
                                        Predicted Group Membership
           Previously defaulted         No      Yes     Total
Original   Count   No                   300     217     517
                   Yes                  44      139     183
                   Ungrouped cases      81      69      150
           %       No                   58.0    42.0    100.0
                   Yes                  24.0    76.0    100.0
                   Ungrouped cases      54.0    46.0    100.0
a. 62.7% of original grouped cases correctly classified.
The discriminant analysis classifies most previous defaulters as defaulters (139 of 183, or 76.0%) and
most non-defaulters as non-defaulters (300 of 517, or 58.0%). Persons who have previously defaulted
are thus predicted as likely to default this time as well, and those who have not defaulted earlier are
predicted as less likely to default. The conclusion follows from the correctly classified cases
outnumbering the misclassified ones in each group (139 > 44 and 300 > 217).
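The percentages in the classification table can be re-derived from the counts. The following is only an illustrative recomputation; the dictionary layout and variable names are made up for this sketch.

```python
# Counts from the Classification Results table: actual group -> predicted group.
confusion = {
    "No": {"No": 300, "Yes": 217},
    "Yes": {"No": 44, "Yes": 139},
}

# Correctly classified cases sit on the diagonal (actual == predicted).
correct = sum(confusion[group][group] for group in confusion)
total = sum(sum(row.values()) for row in confusion.values())
accuracy = 100 * correct / total

# Share of each actual group that was classified correctly.
hit_rates = {
    group: 100 * confusion[group][group] / sum(confusion[group].values())
    for group in confusion
}

print(round(accuracy, 1))                            # 62.7
print({g: round(r, 1) for g, r in hit_rates.items()})  # {'No': 58.0, 'Yes': 76.0}
```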
Q. Perform Factor Analysis on the given dataset.
The chosen dataset is a fictional statistics anxiety questionnaire. It contains the responses given
by students regarding their ease of use, liking and usage of SPSS in statistics.
Using the scree plot, I have chosen 5 factors.
Since a student's answers to related questions are likely to be related, I considered the
variables to be inter-related and hence used Oblimin rotation. For example, a student who gives a
high score to “I have little experience of computers” is likely to give a high score to “All
computers hate me”, as the variables are somewhat correlated.
Using the options of SPSS, the following Pattern Matrix was generated.
Pattern Matrix(a)
Item                                                                     Loadings (in column order of components 1 to 5; blank cells not shown)
I have little experience of computers                                    .903
SPSS always crashes when I try to use it                                 .732
All computers hate me                                                    .684
I worry that I will cause irreparable damage because of my
  incompetence with computers                                            .662
Computers have minds of their own and deliberately go wrong
  whenever I use them                                                    .581
People try to tell you that SPSS makes statistics easier to
  understand but it doesn't                                              .446
Computers are out to get me                                              .333
My friends are better at SPSS than I am                                  .661
My friends are better at statistics than me                              .655
If I'm good at statistics my friends will think I'm a nerd               .622
My friends will think I'm stupid for not being able to cope with SPSS    .504   .330
Everybody looks at me when I use SPSS                                    .358   .358
I can't sleep for thoughts of eigen vectors                              -.728
I wake up under my duvet thinking that I am trapped under a
  normal distribution                                                    .324   -.543
Computers are useful only for playing games                              .359   .393   -.366
Standard deviations excite me                                            .301   .356   .315
I have never been good at mathematics                                    -.855
I did badly at mathematics at school                                     -.736
I slip into a coma whenever I see an equation                            -.722
Statistics makes me cry                                                  -.772
I don't understand statistics                                            -.730
I weep openly at the mention of central tendency                         -.664
I dream that Pearson is attacking me with correlation coefficients       -.564
Extraction Method: Principal Component Analysis.
Rotation Method: Oblimin with Kaiser Normalization.
a. Rotation converged in 15 iterations.
The rotation sums of squared loadings for each factor are given below.
Total Variance Explained
Component   Rotation Sums of Squared Loadings (Total)
1           5.522
2           2.452
3           2.383
4           3.535
5           4.913
Extraction Method: Principal Component Analysis.
The percentage of variance attributed to a factor is calculated by dividing its sum of squared loadings by
the number of variables (23) and multiplying by 100.
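The calculation just described can be sketched as below, taking the 23 items of the pattern matrix as the number of variables. This is only an illustration; with an oblique (Oblimin) rotation the factors are correlated, so the resulting percentages overlap and should not be added into a single total.

```python
# Rotation sums of squared loadings from the Total Variance Explained table.
rotation_ssl = {1: 5.522, 2: 2.452, 3: 2.383, 4: 3.535, 5: 4.913}

# Number of questionnaire items listed in the pattern matrix.
N_VARIABLES = 23

# Percentage of variance = sum of squared loadings / number of variables * 100.
pct_variance = {
    factor: round(100 * ssl / N_VARIABLES, 2) for factor, ssl in rotation_ssl.items()
}

for factor, pct in pct_variance.items():
    print(f"Factor {factor}: {pct}%")
```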
Hence, depending on the loading values, the variables are grouped into factors as follows.
Factor   Variable Nos.
1        1,2,3,4,5,6,7,14
2        8,9,10
3        13
4        17,18,19
5        20,21,22,23
Variables 11, 12, 15 and 16 have very close loadings on more than one factor, which is undesirable
because each of these variables is assessing more than one construct. Variable 12 has exactly the same
loading (.358) on two factors. These variables are said to have split loadings.
They are hence listed separately.
Factor   Variable No.
2        11,16,15
3        12,15
As split loadings are present, this is not a simple structure.
Factor 1: Anxiety about the usage of computers. Its sum of squared loadings (5.522) corresponds to
about 24.0% of the variance, and 8 of the variables load on it.
Factor 2: Students' view of their understanding of statistics and SPSS relative to their peers. It
corresponds to about 10.7% of the variance (2.452) and loads 3 variables. It also split loads variables
11, 16 and 15.
Factor 3: Anxiety about eigenvectors. It corresponds to only about 10.4% of the variance (2.383) and
loads only 1 variable directly, while it split loads variables 12 and 15.
Factor 4: Students' interest in mathematics. It corresponds to about 15.4% of the variance (3.535) and
loads 3 variables.
Factor 5: Dislike for statistics. It corresponds to about 21.4% of the variance (4.913) and loads 4
variables.
Because the factors are correlated after Oblimin rotation, these percentages overlap and should not be
added up to a total.
CLUSTER ANALYSIS
Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same
group (called cluster) are more similar (in some sense or another) to each other than to those in other groups
(clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis
used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, and
bioinformatics.
Proximities
Notes
Output Created                               02-Apr-2013 22:00:05
Comments
Input      Data                              C:Users dev maletiaDownloadsClusterAnonFaculty.sav
           Active Dataset                    DataSet3
           Filter                            <none>
           Weight                            <none>
           Split File                        <none>
           N of Rows in Working Data File    44
Missing Value Handling
           Definition of Missing             User-defined missing values are treated as missing.
           Cases Used                        Statistics are based on cases with no missing values
                                             for any variable used.
Syntax     PROXIMITIES Salary FTE Rank Articles Experience
             /MATRIX OUT('C:UsersDEVMAL~1AppDataLocalTempspss6496spssclus.tmp')
             /VIEW=CASE
             /MEASURE=SEUCLID
             /PRINT NONE
             /ID=Name
             /STANDARDIZE=VARIABLE Z.
Resources  Processor Time                    00:00:00.078
           Elapsed Time                      00:00:00.082
           Workspace Bytes                   11152
Files Saved   Matrix File                    C:UsersDEVMAL~1AppDataLocalTempspss6496spssclus.tmp
The variables which I have used in the dataset are as follows:
• Name -- Although faculty salaries are public information under North Carolina state law
• Salary – annual salary in dollars, from the university report available in One Stop.
• FTE – Full time equivalent work load for the faculty member.
• Rank – where 1 = adjunct, 2 = visiting, 3 = assistant, 4 = associate, 5 = professor
• Articles – number of published scholarly articles, excluding things like comments in newsletters,
abstracts in proceedings, and the like.
• Experience – Number of years working as a full time faculty member in a Department of Psychology.
• ArticlesAPD – number of published articles as listed in the university’s Academic Publications
• Sex – biological sex from physical appearance.
In the first step SPSS computes, for each pair of cases, the squared Euclidean distance between the cases. This
is quite simply the sum across variables (from i = 1 to v) of the squared difference between the score on
variable i for the one case (Xi) and the score on variable i for the other case (Yi). The two cases which are
separated by the smallest squared Euclidean distance are identified and then classified together into the first
cluster. At this point there is one cluster with two cases in it.
Next SPSS re-computes the squared Euclidean distances between each entity (case or cluster) and each other
entity. When one or both of the compared entities is a cluster, SPSS computes the average squared Euclidean
distance between members of the one entity and members of the other entity. The two entities with the
smallest squared Euclidean distance are classified together. SPSS then re-computes the squared Euclidean
distances between each entity and each other entity, and the two with the smallest squared Euclidean distance
are classified together. This continues until all of the cases have been clustered into one big cluster.
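The procedure described above can be sketched in plain Python. This is a naive illustration of average linkage (between groups) on squared Euclidean distances, not SPSS's implementation: the four two-dimensional points are hypothetical, and the z-score standardization requested in the syntax (/STANDARDIZE=VARIABLE Z) is left out for brevity.

```python
from itertools import combinations


def sq_euclid(x, y):
    """Squared Euclidean distance: sum over variables of squared differences."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))


def avg_linkage(c1, c2, points):
    """Average squared distance between all members of c1 and all members of c2."""
    total = sum(sq_euclid(points[i], points[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerate(points):
    """Merge the two closest entities until one cluster is left."""
    clusters = [[i] for i in range(len(points))]
    schedule = []  # one (members_a, members_b, coefficient) entry per stage
    while len(clusters) > 1:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: avg_linkage(clusters[pair[0]], clusters[pair[1]], points),
        )
        coefficient = avg_linkage(clusters[a], clusters[b], points)
        schedule.append((clusters[a][:], clusters[b][:], coefficient))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return schedule


# Hypothetical data: two tight pairs of cases that lie far apart.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
schedule = agglomerate(points)
for stage, (left, right, coefficient) in enumerate(schedule, start=1):
    print(stage, left, right, coefficient)
```

With n cases the schedule always has n − 1 stages, which is why the 44 faculty cases below are fully merged at stage 43.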
The output obtained can be seen below:
Case Processing Summary(a)
Cases
Valid            Missing          Total
N    Percent     N    Percent     N    Percent
44   100.0%      0    .0%         44   100.0%
a. Squared Euclidean Distance used
At the first stage SPSS clustered case 32 with case 33. The squared Euclidean distance between these two cases
is 0.000. At stages 2-4 SPSS creates three more clusters, each containing two cases. At stage 5 SPSS adds case
39 to the cluster that already contains cases 37 and 38. By the 43rd stage all cases have been clustered into
one entity.
The results can be seen below:
Average Linkage (Between Groups)
Agglomeration Schedule
        Cluster Combined                      Stage Cluster First Appears
Stage   Cluster 1   Cluster 2   Coefficients   Cluster 1   Cluster 2   Next Stage
1 32 33 .000 0 0 9
2 41 42 .000 0 0 6
3 43 44 .000 0 0 6
4 37 38 .000 0 0 5
5 37 39 .001 4 0 7
6 41 43 .002 2 3 27
7 36 37 .003 0 5 27
8 20 22 .007 0 0 11
9 30 32 .012 0 1 13
10 21 26 .012 0 0 14
11 20 25 .031 8 0 12
12 16 20 .055 0 11 14
13 29 30 .065 0 9 26
14 16 21 .085 12 10 20
15 11 18 .093 0 0 22
16 8 9 .143 0 0 25
17 17 24 .144 0 0 20
18 13 23 .167 0 0 22
19 14 15 .232 0 0 32
20 16 17 .239 14 17 23
21 7 12 .279 0 0 28
22 11 13 .441 15 18 29
23 16 27 .451 20 0 26
24 3 10 .572 0 0 28
25 6 8 .702 0 16 36
26 16 29 .768 23 13 35
27 36 41 .858 7 6 33
28 3 7 .904 24 21 31
29 11 28 .993 22 0 30
30 5 11 1.414 0 29 34
31 3 4 1.725 28 0 36
32 14 31 1.928 19 0 34
33 36 40 2.168 27 0 40
34 5 14 2.621 30 32 35
35 5 16 2.886 34 26 37
36 3 6 3.089 31 25 38
37 5 19 4.350 35 0 39
38 1 3 4.763 0 36 41
39 5 34 5.593 37 0 42
40 35 36 8.389 0 33 43
41 1 2 8.961 38 0 42
42 1 5 11.055 41 39 43
43 1 35 17.237 42 40 0
Cluster Membership
Case             5 Clusters   4 Clusters   3 Clusters   2 Clusters
1:Rosalyn        1            1            1            1
2:Lawrence       2            2            1            1
3:Sunila         1            1            1            1
4:Randolph       1            1            1            1
5:Mickey         3            3            2            1
6:Louis          1            1            1            1
7:Tony           1            1            1            1
8:Raul           1            1            1            1
9:Catalina       1            1            1            1
10:Johnson       1            1            1            1
11:Beulah        3            3            2            1
12:Martina       1            1            1            1
13:Marie         3            3            2            1
14:Ernest        3            3            2            1
15:Christopher   3            3            2            1
16:Ernie         3            3            2            1
17:Christa       3            3            2            1
18:Linette       3            3            2            1
19:Bo            3            3            2            1
20:Carla         3            3            2            1
21:Alberto       3            3            2            1
22:Christina     3            3            2            1
23:Jonah         3            3            2            1
24:Tucker        3            3            2            1
25:Shanta        3            3            2            1
26:Melissa       3            3            2            1
27:Jenna         3            3            2            1
28:Johnny        3            3            2            1
29:Cleatus       3            3            2            1
30:Jonas         3            3            2            1
31:Tad           3            3            2            1
32:Amaryllis     3            3            2            1
33:Nathan        3            3            2            1
34:Deanna        3            3            2            1
35:Willy         4            4            3            2
36:Deana         5            4            3            2
37:Dea           5            4            3            2
38:Claude        5            4            3            2
39:Amanda        5            4            3            2
40:Boris         5            4            3            2
41:Garrett       5            4            3            2
42:Stew          5            4            3            2
43:Bree          5            4            3            2
44:Karma         5            4            3            2
Vertical Icicle:
It is not possible to display the full vertical icicle plot in this document, but its results are
described below.
For the two cluster solution, one cluster consists of ten cases (Boris through Willy, followed in the plot by
a column with no X's). These were our adjunct (part-time) faculty (excepting one), and the second cluster
consists of everybody else.
For the three cluster solution, you can see the cluster of adjunct faculty, with the others split into two.
Deanna through Mickey were our junior faculty and Lawrence through Rosalyn our senior faculty.
For the four cluster solution, one case (Lawrence) forms a cluster of his own.
Dendrogram
It displays essentially the same information that is found in the agglomeration schedule but in graphic form.
* * * * * * * * * * * H I E R A R C H I C A L C L U S T E R A N A L Y S I S * * * * * * * * * *
Dendrogram using Average Linkage (Between Groups)
Rescaled Distance Cluster Combine
C A S E 0 5 10 15 20 25
Label Num +---------+---------+---------+---------+---------+
Amaryllis 32 ─┐
Nathan 33 ─┤
Jonas 30 ─┼─┐
Cleatus 29 ─┘ │
Alberto 21 ─┐ │
Melissa 26 ─┤ │
Carla 20 ─┤ ├─────┐
Christina 22 ─┤ │ │
Shanta 25 ─┤ │ │
Ernie 16 ─┤ │ │
Christa 17 ─┼─┘ │
Tucker 24 ─┤ │
Jenna 27 ─┘ ├───┐
Beulah 11 ─┐ │ │
Linette 18 ─┼─┐ │ │
Marie 13 ─┤ ├─┐ │ │
Jonah 23 ─┘ │ ├─┐ │ │
Johnny 28 ───┘ │ │ │ ├───┐
Mickey 5 ─────┘ ├─┘ │ │
Ernest 14 ─┬───┐ │ │ │
Christopher 15 ─┘ ├─┘ │ ├───────────────┐
Tad 31 ─────┘ │ │ │
Bo 19 ─────────────┘ │ │
Deanna 34 ─────────────────┘ │
Raul 8 ─┬─┐ │
Catalina 9 ─┘ ├─────┐ ├───────────────┐
Louis 6 ───┘ │ │ │
Tony 7 ─┬─┐ ├───┐ │ │
Martina 12 ─┘ ├─┐ │ │ │ │
Sunila 3 ─┬─┘ ├───┘ ├───────────┐ │ │
Johnson 10 ─┘ │ │ │ │ │
Randolph 4 ─────┘ │ ├───────┘ │
Rosalyn 1 ─────────────┘ │ │
Lawrence 2 ─────────────────────────┘ │
Garrett 41 ─┐ │
Stew 42 ─┼─┐ │
Bree 43 ─┤ │ │
Karma 44 ─┘ ├───┐ │
Dea 37 ─┐ │ │ │
Claude 38 ─┤ │ ├─────────────────┐ │
Amanda 39 ─┼─┘ │ │ │
Deana 36 ─┘ │ ├───────────────────────┘
Boris 40 ───────┘ │
Willy 35 ─────────────────────────┘
Multiple Regression Analysis
In this analysis we use a data file that was created by randomly sampling 400 elementary
schools from the California Department of Education's API 2000 dataset. This data file contains a
measure of school academic performance as well as other attributes of the elementary schools, such
as class size, enrolment and poverty.
We now perform a regression analysis using api00 as the outcome variable and the
variables acs_k3, meals and full as predictors. These measure the academic performance of the
school (api00), the average class size in kindergarten through 3rd grade (acs_k3), the percentage of
students receiving free meals (meals), which is an indicator of poverty, and the percentage of
teachers who have full teaching credentials (full). We expect that better academic performance would
be associated with lower class size, fewer students receiving free meals, and a higher percentage of
teachers having full teaching credentials. The output is as follows:
Regression
Notes
Output Created                               02-Apr-2013 21:48:19
Comments
Input      Data                              C:UsersDivijDesktopSPSS Dataelemapi.sav
           Active Dataset                    DataSet5
           Filter                            <none>
           Weight                            <none>
           Split File                        <none>
           N of Rows in Working Data File    400
Missing Value Handling
           Definition of Missing             User-defined missing values are treated as missing.
           Cases Used                        Statistics are based on cases with no missing
                                             values for any variable used.
Syntax     regression
             /dependent api00
             /method=enter acs_k3 meals full
             .
Resources  Processor Time                    00:00:00.063
           Elapsed Time                      00:00:00.026
           Memory Required                   2284 bytes
           Additional Memory Required for Residual Plots   0 bytes
Variables Entered/Removed(b)
Model   Variables Entered                                             Variables Removed   Method
1       pct full credential, avg class size k-3, pct free meals(a)    .                   Enter
a. All requested variables entered.
b. Dependent Variable: api 2000
Model Summary
Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .821(a)   .674       .671                64.153
a. Predictors: (Constant), pct full credential, avg class size k-3, pct free meals
ANOVA(b)
Model            Sum of Squares   df    Mean Square   F         Sig.
1   Regression   2634884.261      3     878294.754    213.407   .000(a)
    Residual     1271713.209      309   4115.577
    Total        3906597.470      312
a. Predictors: (Constant), pct full credential, avg class size k-3, pct free meals
b. Dependent Variable: api 2000
Coefficients(a)
                            Unstandardized Coefficients   Standardized Coefficients
Model                       B         Std. Error          Beta                        t         Sig.
1   (Constant)              906.739   28.265                                          32.080    .000
    avg class size k-3      -2.682    1.394               -.064                       -1.924    .055
    pct free meals          -3.702    .154                -.808                       -24.038   .000
    pct full credential     .109      .091                .041                        1.197     .232
a. Dependent Variable: api 2000
Let's check whether the three predictors are statistically significant and, if so, the direction of the
relationship. The average class size (acs_k3, b=-2.682) is not significant (p=0.055), but only just so,
and the coefficient is negative, which would indicate that larger class sizes are related to lower
academic performance, which is what we would expect. Next, the effect of meals (b=-3.702, p=.000)
is significant and its coefficient is negative, indicating that the greater the proportion of students
receiving free meals, the lower the academic performance. We cannot say that free meals are causing
lower academic performance. The meals variable is highly related to income level and functions more as a
proxy for poverty. Thus, higher levels of poverty are associated with lower academic performance.
Finally, the percentage of teachers with full credentials (full, b=0.109, p=.232) seems to be unrelated
to academic performance. This would seem to indicate that the percentage of teachers with full
credentials is not an important factor in predicting academic performance, which is unexpected.
From these results, we would conclude that lower class sizes are related to higher performance, that
fewer students receiving free meals is associated with higher performance, and that the percentage of
teachers with full credentials was not related to academic performance in the schools. Before we
write this up as our finding, we should do checks to make sure we can firmly stand behind these
results.
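To make the meaning of the unstandardized coefficients concrete, the fitted equation can be evaluated for one school. The sketch below uses the B values from the Coefficients table; the input values (class size 16, 67% free meals, 76% fully credentialed teachers) are chosen purely for illustration, and the function name is made up.

```python
# Unstandardized coefficients (B) from the first regression's Coefficients table.
INTERCEPT = 906.739
B = {"acs_k3": -2.682, "meals": -3.702, "full": 0.109}


def predict_api00(acs_k3, meals, full):
    """Predicted api00 = constant + sum of (coefficient * predictor value)."""
    return INTERCEPT + B["acs_k3"] * acs_k3 + B["meals"] * meals + B["full"] * full


prediction = predict_api00(acs_k3=16, meals=67, full=76)
print(round(prediction, 3))  # 624.077

# Raising meals by one point, other predictors fixed, changes the prediction by B["meals"].
change = predict_api00(16, 68, 76) - prediction
print(round(change, 3))  # -3.702
```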
Examining Data
Step 1)
To start examining the data, we look at the first 10 data points for the variables included in our
regression analysis. We pay particular attention to the number of missing data points.
api00 acs_k3 meals full
693 16 67 76.00
570 15 92 79.00
546 17 97 68.00
571 20 90 87.00
478 18 89 87.00
858 20 . 100.00
918 19 . 100.00
831 20 . 96.00
860 20 . 100.00
737 21 29 96.00
Number of cases read: 10 Number of cases listed: 10
We see that among the first 10 observations, we have four missing values for meals. Keeping this in
mind, we can use the descriptives command with /var=all to get descriptive statistics for all of the
variables, and pay special attention to the number of valid cases for meals.
Step 2)
Descriptive Statistics
N Minimum Maximum Mean Std. Deviation
school number 400 58 6072 2866.81 1543.811
district number 400 41 796 457.73 184.823
api 2000 400 369 940 647.62 142.249
api 1999 400 333 917 610.21 147.136
growth 1999 to 2000 400 -69 134 37.41 25.247
pct free meals 315 6 100 71.99 24.386
english language learners 400 0 91 31.45 24.839
year round school 400 0 1 .23 .421
pct 1st year in school 399 2 47 18.25 7.485
avg class size k-3 398 -21 25 18.55 5.005
avg class size 4-6 397 20 50 29.69 3.841
parent not hsg 400 0 100 21.25 20.676
parent hsg 400 0 100 26.02 16.333
parent some college 400 0 67 19.71 11.337
parent college grad 400 0 100 19.70 16.471
parent grad school 400 0 67 8.64 12.131
avg parent ed 381 1.00 4.62 2.6685 .76379
pct full credential 400 .42 100.00 66.0568 40.29793
pct emer credential 400 0 59 12.66 11.746
number of students 400 130 1570 483.47 226.448
Percentage free meals in 3 categories 400 1 3 2.02 .819
Valid N (listwise) 295
We now examine the output for the variables we used in our regression analysis above,
namely api00, acs_k3, meals and full. For api00, we see that the values range from 369 to 940 and
there are 400 valid values. For acs_k3, the average class size ranges from -21 to 25 and there are 2
missing values. An average class size of -21 is clearly wrong. The variable meals ranges from 6%
getting free meals to 100% getting free meals, so these values seem reasonable, but there are only
315 valid values for this variable. The percentage of teachers being fully credentialed ranges from .42
to 100, and all of the values are valid.
This has uncovered a number of peculiarities worthy of further examination. We now obtain a
corrected data set from the same source, which is free from the shortcomings diagnosed above, and
run another multiple regression on the new data set.
New Multiple regression analysis
For this multiple regression example, we will regress the dependent variable, api00, on all of the
predictor variables in the data set.
Regression
Notes
Output Created                               02-Apr-2013 22:54:47
Comments
Input      Data                              C:UsersDivijDesktopSPSS Dataelemapi2.sav
           Active Dataset                    DataSet8
           Filter                            <none>
           Weight                            <none>
           Split File                        <none>
           N of Rows in Working Data File    400
Missing Value Handling
           Definition of Missing             User-defined missing values are treated as missing.
           Cases Used                        Statistics are based on cases with no missing
                                             values for any variable used.
Syntax     regression
             /dependent api00
             /method=enter ell meals yr_rnd mobility
               acs_k3 acs_46 full emer enroll .
Resources  Processor Time                    00:00:00.031
           Elapsed Time                      00:00:00.022
           Memory Required                   4724 bytes
           Additional Memory Required for Residual Plots   0 bytes
Variables Entered/Removed(b)
Model   Variables Entered                                           Variables Removed   Method
1       number of students, avg class size 4-6, pct 1st year in     .                   Enter
        school, avg class size k-3, pct emer credential, english
        language learners, year round school, pct free meals,
        pct full credential(a)
a. All requested variables entered.
b. Dependent Variable: api 2000
Model Summary
Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .919(a)   .845       .841                56.768
a. Predictors: (Constant), number of students, avg class size 4-6, pct 1st year in school, avg class
size k-3, pct emer credential, english language learners, year round school, pct free meals, pct full
credential
ANOVA(b)
Model            Sum of Squares   df    Mean Square   F         Sig.
1   Regression   6740702.006      9     748966.890    232.409   .000(a)
    Residual     1240707.781      385   3222.618
    Total        7981409.787      394
a. Predictors: (Constant), number of students, avg class size 4-6, pct 1st year in school, avg
class size k-3, pct emer credential, english language learners, year round school, pct free
meals, pct full credential
b. Dependent Variable: api 2000
Coefficients(a)
                                  Unstandardized Coefficients   Standardized Coefficients
Model                             B         Std. Error          Beta                        t         Sig.
1   (Constant)                    758.942   62.286                                          12.185    .000
    english language learners     -.860     .211                -.150                       -4.083    .000
    pct free meals                -2.948    .170                -.661                       -17.307   .000
    year round school             -19.889   9.258               -.059                       -2.148    .032
    pct 1st year in school        -1.301    .436                -.069                       -2.983    .003
    avg class size k-3            1.319     2.253               .013                        .585      .559
    avg class size 4-6            2.032     .798                .055                        2.546     .011
    pct full credential           .610      .476                .064                        1.281     .201
    pct emer credential           -.707     .605                -.058                       -1.167    .244
    number of students            -.012     .017                -.019                       -.724     .469
a. Dependent Variable: api 2000
1) We now examine the output from this regression analysis. As with the first regression, we look to
the p-value of the F-test to see if the overall model is significant. With a p-value of zero to
three decimal places, the model is statistically significant. The R-squared is 0.845, meaning
that approximately 85% of the variability of api00 is accounted for by the variables in the
model. In this case, the adjusted R-squared indicates that about 84% of the variability
of api00 is accounted for by the model, even after taking into account the number of predictor
variables in the model. The coefficient for each of the variables indicates the amount of
change one could expect in api00 given a one-unit change in the value of that variable, given
that all other variables in the model are held constant. For example, consider the
variable ell. We would expect a decrease of 0.86 in the api00 score for every one-unit
increase in ell, assuming that all other variables in the model are held constant.
2) R-square is the proportion of variance in the dependent variable (api00) which can be
predicted from the independent variables (ell, meals, yr_rnd,
mobility, acs_k3, acs_46, full, emer and enroll). The value of .845 indicates that about 84.5% of
the variance in api00 can be predicted from these variables.
3) The beta coefficients are used by some researchers to compare the relative strength of the
various predictors within the model. Because the beta coefficients are all measured in
standard deviations, instead of the units of the variables, they can be compared to one
another. In other words, the beta coefficients are the coefficients that you would obtain if the
outcome and predictor variables were all transformed to standard scores, also called z-scores,
before running the regression. In this example, meals has the largest beta coefficient in
absolute value, -0.661, and acs_k3 has the smallest, 0.013. Thus, a one standard deviation
increase in meals leads to a 0.661 standard deviation decrease in predicted api00, with the other
variables held constant. And a one standard deviation increase in acs_k3, in turn, leads to a
0.013 standard deviation increase in predicted api00 with the other variables in the model held
constant.
4) The adjusted R-square attempts to yield a more honest value to estimate the R-squared for
the population. The value of R-square was .8446, while the value of adjusted R-square was
.8409.
5) The F value is the Mean Square Regression (748966.89) divided by the Mean Square
Residual (3222.61761), yielding F=232.41. The p-value associated with this F value is very
small (0.0000). These values are used to answer the question "Do the independent variables
reliably predict the dependent variable?". The p-value is compared to your alpha level
(typically 0.05) and, if smaller, you can conclude "Yes, the independent variables reliably
predict the dependent variable".
6) The df column gives the degrees of freedom associated with the sources of variance. The Total
variance has N-1 degrees of freedom (DF). In this case, there were N=395 observations, so
the DF for total is 394.
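The summary figures quoted in points 1) to 6) can be re-derived from the ANOVA table alone. The sketch below is a plain recomputation under the usual definitions of R-square, adjusted R-square and F; the variable names are made up.

```python
# Values from the ANOVA table of the second regression.
ss_regression = 6740702.006
ss_residual = 1240707.781
df_regression = 9      # number of predictors
df_residual = 385

ss_total = ss_regression + ss_residual
n = df_regression + df_residual + 1          # total df is N - 1

# R-square: share of the total sum of squares explained by the model.
r_square = ss_regression / ss_total

# Adjusted R-square penalizes R-square for the number of predictors.
adj_r_square = 1 - (1 - r_square) * (n - 1) / df_residual

# F: mean square regression divided by mean square residual.
f_value = (ss_regression / df_regression) / (ss_residual / df_residual)

print(n)                        # 395
print(round(r_square, 4))       # 0.8446
print(round(adj_r_square, 4))   # 0.8409
print(round(f_value, 2))        # 232.41
```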