Research Review VI
Critical Review of a Published Quantitative Research Study

Klimczak, A.K., & Wedman, J. (1996). Instructional design project indicators: An empirical basis. Performance Improvement Quarterly, 9 (4), 5-18.

Lorraine Sherry

March, 1997

Purpose and Major Research Questions

The purposes of this study were:
  1. to establish empirically a set of instructional design (ID) project success indicators, and
  2. to determine if stakeholder perspective influences the perceived importance attached to these indicators.
The first research question was essentially: Can we isolate a small number of independent indicators of successful ID projects? This had to be done qualitatively.

The second research question was essentially: Are there any significant differences between the mean perceived importance ratings of the individual success indicators that were reported by subjects representing four independent groups of stakeholders: designers, sponsors, trainers, and learners? Here, they did have a tacit research hypothesis. For the estimates of perceived importance on their success indicators, they were looking to reject:

H0: mu (item 1) = mu (item 2) = ... = mu (item 7).
Also, since they were dealing with four different groups, they were looking for a main effect by group, i.e. to reject:
H0: mu (designers) = mu (trainers) = mu (sponsors) = mu (learners).
They also wanted to know if there was any interaction of group x item.

Research Question 1

From the review of relevant literature, the authors found 578 success indicators, which they then tried to cluster. I'll discuss their process in the Measurement Strategies section. A successful ID project is indicated by the following:
  1. its impact on on-the-job performance,
  2. its usefulness throughout its intended life span,
  3. its effect on the organization's bottom line,
  4. learners' attainment of desired learning outcomes,
  5. learners' desire to explore and share ideas,
  6. learners' enjoyment of the training experience, and
  7. its adherence to budgetary constraints.
These seven success indicators then became the seven levels of the repeated (within-subjects) factor in the design.

Research Question 2

The next phase of the study was to assess the relative importance of the seven success indicators and to determine if the importance attached to the indicators was related to stakeholder role. The authors constructed a 30-item instrument in which they randomly interspersed the seven success indicators among the 23 success factors, and then asked 64 subjects to rate each item's importance to ID project success on a Likert scale from 1 (not important at all) to 10 (ultimately important!). The 64 subjects were divided into four independent groups of 16: designers, sponsors, trainers, and learners. The authors then gathered all 64 responses and ran a two-way ANOVA on the ratings of the seven success indicators.

Design

The design was a two-way ANOVA with repeated measures. Here, the repeated factor is the success indicator, with seven levels (see Table 1 below):

                    Success indicator
Group   Subjects    1   2   3   4   5   6   7
1       1..16       *   *   *   *   *   *   *
2       17..32      *   *   *   *   *   *   *
3       33..48      *   *   *   *   *   *   *
4       49..64      *   *   *   *   *   *   *

Table 1: Two-way ANOVA with repeated measures

If N had been greater, perhaps gender and state could have been added as additional factors.

Analysis

The two-way ANOVA table clearly differentiates the between-subjects sources of variance (the groups) and the within-subjects sources of variance (the seven indicators). The degrees of freedom add up correctly for a repeated measures two-way ANOVA. The authors used a nested design: subjects nested within groups (one subject can be a member of one and only one group), groups as a fixed factor, and the seven success indicators as the seven levels of the within-subjects factor. Since each subject answered seven questions, each pertaining to a different success indicator, on a single 30-item questionnaire, we are dealing with a repeated measures design, with the repeated measure being the individual indicator items on the questionnaire.

Thus, the design was a 4 x 7 two-way ANOVA where J = 4 was the number of groups and K = 7 was the number of success indicators, giving 28 cells in all. The descriptive statistics from the questionnaire responses are reported in Table 5, with means and standard deviations by each cell (group and success indicator). Looking at Table 5 in the article, you can see that the ANOVA design is clearly described (once you set aside the "all groups" and "all indicators" entries). They also report the means for all groups and all indicators, but they did not include these in the ANOVA.
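The bookkeeping for this layout can be checked mechanically. The following Python sketch is only an illustration under the stated assumptions (J = 4 groups, n = 16 subjects per group as in Table 1, K = 7 indicators); the function name mixed_anova_df is made up for this example and does not come from the article.

```python
def mixed_anova_df(groups, per_group, levels):
    """Degrees of freedom for a two-way design with one between-subjects
    factor (groups) and one repeated factor (levels), equal group sizes."""
    subjects = groups * per_group
    return {
        "between_subjects": subjects - 1,
        "groups": groups - 1,
        "subjects_within_groups": subjects - groups,
        "within_subjects": subjects * (levels - 1),
        "indicators": levels - 1,
        "groups_x_indicators": (groups - 1) * (levels - 1),
        "indicators_x_subjects_within_groups": (levels - 1) * (subjects - groups),
        "total": subjects * levels - 1,
    }


# Assumed layout of the study: 4 groups, 16 subjects each, 7 indicators.
df = mixed_anova_df(groups=4, per_group=16, levels=7)

# The between-subjects and within-subjects parts should sum to the total.
parts_add_up = df["between_subjects"] + df["within_subjects"] == df["total"]

for source, value in df.items():
    print(f"{source}: {value}")
print("parts add up:", parts_add_up)
```

With these inputs the partition gives 63 between-subjects and 384 within-subjects degrees of freedom, for a total of 447.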

The design certainly addresses Research Question 2, which was the authors' intent. Using a two-way ANOVA was certainly preferable to using seven (or four) one-way ANOVAs, because it enabled them to test for a group x indicator interaction, which turned out to be significant. There was no reason to use an ANCOVA, since there were no covariates. Adding state or gender to a design with 28 cells would probably have resulted in unfilled cells, so this did not seem appropriate. Table 2 summarizes the two-way ANOVA:

SV                                   SS   df                    MS   F ratio   Signif?
Between subjects                 386.41   64-1=63
  Groups (G)                      16.85   4-1=3               5.62      0.91   no
  Subjects within Groups (s:G)   369.56   64-4=60             6.16
Within subjects                  763.93   (64)*(7-1)=384
  Indicators (I)                 144.04   7-1=6              24.01     16.03   p < .01
  Groups x Indicators (IxG)       80.84   (4-1)*(7-1)=18      4.49      3.00   p < .01
  Ixs:G                          539.05   (6)*(60)=360        1.50
Total                           1150.34   (7)*(64)-1=447

Table 2: Summary of Two-Way ANOVAs
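The arithmetic behind a summary table like this is easy to verify. The sketch below is a plain recomputation, not the authors' analysis: it takes only the sums of squares and degrees of freedom reported in Table 2, derives each mean square as SS / df, and forms the F ratios using the error terms that are conventional for this kind of design (subjects within groups for the group effect, Ixs:G for the two within-subjects effects). The variable names are my own.

```python
# (sum of squares, degrees of freedom) as reported in Table 2
rows = {
    "G": (16.85, 3),        # groups
    "s:G": (369.56, 60),    # subjects within groups
    "I": (144.04, 6),       # indicators
    "IxG": (80.84, 18),     # groups x indicators
    "Ixs:G": (539.05, 360), # indicators x subjects within groups
}

# Mean square = sum of squares / degrees of freedom
ms = {name: ss / df for name, (ss, df) in rows.items()}

# Between-subjects effect is tested against subjects within groups;
# within-subjects effects are tested against Ixs:G.
f_groups = ms["G"] / ms["s:G"]
f_indicators = ms["I"] / ms["Ixs:G"]
f_interaction = ms["IxG"] / ms["Ixs:G"]

# The component sums of squares should reproduce the reported total.
ss_total = sum(ss for ss, _ in rows.values())

for name, value in ms.items():
    print(f"MS {name}: {value:.2f}")
print(f"F groups: {f_groups:.2f}")
print(f"F indicators: {f_indicators:.2f}")
print(f"F interaction: {f_interaction:.2f}")
print(f"SS total: {ss_total:.2f}")
```

Note that 369.56 / 60 comes to about 6.16, and that value is also the one consistent with the reported group F ratio of 0.91.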

There was no significant difference among the groups. No group main effect means that, averaged across the seven success indicators, the four groups gave similar overall importance ratings. All groups ranked "job performance" as the most important success indicator. However, when the authors investigated the responses of different groups to different indicators, they found a significant group x indicators interaction, as shown in the ANOVA table.

One of the strong points of the paper is that the authors took the time to explain the interaction: stakeholder perspectives can potentially be a confounding influence when evaluating ID project success. For example, sponsors placed higher importance on "project budget" and lower importance on "job performance", whereas trainers, being less critical of the project than sponsors, placed higher importance on "job performance" and lowest importance on "project budget". (From a practical standpoint, trainers would very much like to keep their jobs!)

A graph of the cell means to illustrate the interaction would have been a useful addition to the article. Also, the follow-up one-way ANOVAs to the interaction were not such a good idea.

There was a significant main effect among the seven success indicators at the p < .01 level, so the authors used a Tukey pairwise comparison for the necessary follow-up multiple comparison procedure (MCP). This was sensible. However, I would have chosen Newman-Keuls rather than Tukey, because for simple pairwise comparisons Newman-Keuls is more powerful than Tukey, although it also has a higher Type I error rate. Since the calculated p value was so low, the Type I error rate wasn't an issue here, so I'd have gone for more power. There was a significant difference between groups on a single success indicator ("bottom line"): the Tukey MCP revealed that designers differed significantly from sponsors and trainers in terms of the importance attached to this indicator.
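To make the power argument concrete: both procedures compare the difference between two indicator means with a critical range of the form q * sqrt(MS_error / n). Tukey uses the studentized range value q for all K = 7 means in every comparison, whereas Newman-Keuls uses the q for the r means spanned by the pair being compared (2 <= r <= 7), so its critical range is never larger than Tukey's. The Python sketch below assumes that the error term is the Ixs:G mean square from Table 2 (1.50 on 360 df) and that each indicator mean rests on all 64 subjects. The value of q must be read from a studentized range table and is deliberately left as an argument; the function names are hypothetical.

```python
from math import sqrt


def standard_error(ms_error, n_per_mean):
    """Standard error of a mean used in studentized range procedures."""
    return sqrt(ms_error / n_per_mean)


def critical_range(q_crit, ms_error, n_per_mean):
    """Smallest difference between two means declared significant.

    q_crit is the tabled studentized range value:
      - Tukey: q for all K means, used for every pair
      - Newman-Keuls: q for the r means spanned by the pair (r <= K)
    """
    return q_crit * standard_error(ms_error, n_per_mean)


# Assumed inputs taken from Table 2: MS for Ixs:G and 64 ratings per indicator.
se = standard_error(ms_error=1.50, n_per_mean=64)
print(f"standard error of an indicator mean: {se:.3f}")
```

Because the standard error is the same for both procedures, any difference in power comes entirely from the smaller q values that Newman-Keuls uses for means that are close together in rank.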

Measurement Strategies

Research Question 1

Isolating the seven success indicators from the literature review was a very involved procedure, and the authors really made an excellent attempt to establish content validity and independence of the set of success indicators. One author clustered the first 100; the second author reviewed the clustering to establish inter-rater reliability. (What about the other 478?) The initial clustering was by Gagne's five learning domains, plus "other". The "other" category was further subdivided into four more indicators: employee turnover, supervisor rating, employee productivity, and corporate productivity.

To validate the review and subsequent clustering, the authors mailed the results of the literature review to six ID professionals (i.e. independent SMEs) who were asked to provide additions to and elaborations on the review results. The authors also conducted three independent focus group interviews, each with four to eight stakeholders involved in ID projects, to build on the list of project success indicators generated through the literature review. Combining the information from the three sources, namely, the literature review, the professionals' responses, and the focus group comments, they identified 31 tentative success indicators.

As with any qualitative coding scheme, this procedure took several iterations until the final set of success indicators and factors was isolated. I have been through this, and I know how much work it takes! In a qualitative methods dissertation, one doctoral student took four complete iterations through all the data to isolate three independent themes relating to girls' reactions to reading stories. These authors did a good job in the two iterations that they described.

There are six criteria upon which the validity of a qualitative research study rests, which I thought the authors addressed adequately:

  1. adequacy of relationships developed between the researcher and the participants (did not apply here)
  2. extent and adequacy of data produced (64 data points was a decent N)
  3. variety of methods used to gather data (their single instrument - the questionnaire - was good for gathering initial data; follow-up interviews might be helpful afterwards)
  4. adequacy of the analysis in terms of identifying recurrent patterns of action and meaning (this was very well done using converging lines of inquiry)
  5. rigorous search for evidence that disconfirms these patterns of meaning (this was not done, though they found that their study disconfirmed the previous model that was popular throughout the literature)
  6. credibility of the ultimate account (they agreed with other researchers who posed a similar challenge to the established models).
To estimate content validity, a verification letter listing and defining the 31 tentative success indicators was sent to the focus group participants and the six professionals. Responses pointed to a need to distinguish between ID project success indicators (evidence of success) and ID project success factors (things that contribute to success). With the assistance of the focus group moderator and the feedback from the verification letter, the 31 tentative success indicator clusters were re-analyzed and re-sorted into seven success indicators and 23 success factors. (This adds up to 30, not 31!)

It might have been nice if they had done a factor analysis on these seven success indicators to find out if they were truly independent or whether they clustered in any way.

Research Question 2

Finding the answer to this question meant that the authors assessed the relative importance of the seven success indicators and determined whether or not the importance attached to the indicators was related to stakeholder role. Subjects were not randomly selected; they volunteered to participate in the study. They were all involved in technical training in for-profit corporations or state and federal agencies.

Few women were represented in the sample. Gender couldn't be used as another independent variable because the authors wouldn't have been able to fill all cells in the design. Here is the gender information for the four independent groups of stakeholders: designers: 7 M, 9 F; trainers: 15 M, 1 F; sponsors: 15 M, 1 F; learners: 13 M, 3 F. This is far from equal representation; we do not know if this imbalance also held true in the population from which the sample of volunteers was selected.
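A quick tally of these counts shows how thin the cells would become if gender were crossed with group. This is a small Python sketch using only the numbers quoted above; the variable names are my own.

```python
# (men, women) per stakeholder group, as quoted from the article
gender_counts = {
    "designers": (7, 9),
    "trainers": (15, 1),
    "sponsors": (15, 1),
    "learners": (13, 3),
}

total_men = sum(men for men, _ in gender_counts.values())
total_women = sum(women for _, women in gender_counts.values())
total = total_men + total_women
share_women = total_women / total

# Smallest group-by-gender cell if gender were added as a factor
smallest_cell = min(min(pair) for pair in gender_counts.values())

print(f"women: {total_women} of {total} ({share_women:.1%})")
print(f"smallest group x gender cell: {smallest_cell}")
```

Women make up 14 of the 64 subjects, and two of the four groups contain only a single woman, so a group x gender x indicator design would have cells resting on one person.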

The volunteers also came from different states, but the authors did not indicate what states they came from. Knowing the ID field, I am sure Florida respondents would have answered differently from California or Massachusetts respondents, especially on the budget issue, since Florida instructional designers boast about being the low bidders on ID projects. Since selection was not random, there is certain to be some nonrepresentative sample bias.

Internal validity deals with cause and effect, so it was not an issue here. External validity means that you want to generalize to the population from which the respondents were selected. This is questionable, because the subjects were not randomly selected (i.e. nonrepresentative sample). There is also some question of the nongeneralizability of the dependent variable (the ratings on the seven success indicators) because their results did not replicate those of other researchers. They explain this in the discussion section. Usually "enjoyment" ranks first for learners; in this study it ranked next to last in terms of importance as an indicator of project success. The discrepancy between the themes in the literature and the results of the current study sparked lengthy discussion and conclusions sections. I think that a lot of this is related to external validity questions.

The authors say that the literature does not appear to focus on the most important success indicators, because it is primarily concerned with learning outcomes rather than project success indicators. In the conclusions, they say that the common evaluation frameworks (listed on p. 14 of the article) fail to include important success indicators that were used in their own study - specifically, "intended lifespan", "knowledge sharing", and "project budget". They also say that the Kirkpatrick model stated that results is the most important success indicator, followed by behavior, whereas their study showed that behavior was more important than results. Kirkpatrick's model had already been challenged by two other researchers in 1989, and the current study upheld the view of the challengers.

No test reliability information was given. Internal consistency reliability refers to how consistently the items within a single instrument measure the same construct. That is the only type of reliability that would have been appropriate for a single instrument given once. Since the items on the test were from different domains (seven hopefully different success indicators and 23 success factors), and the authors wanted to measure seven different constructs for this report, we wouldn't expect very high test reliability - although it would have been appropriate to estimate it. However, mixing up these assorted factors did have one salutary effect - it made the test longer and enabled the researchers to get two different journal articles out of a single instrument! Whether this actually increased reliability is doubtful.
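One common way to estimate internal consistency from a single administration is Cronbach's alpha. Neither the article nor this review names a specific coefficient, so alpha is my own choice for illustration. The Python sketch below implements the standard formula, alpha = k/(k-1) * (1 - sum of item variances / variance of total scores), and the ratings in it are invented purely to show the calculation; they are not data from the study.

```python
from statistics import variance


def cronbach_alpha(ratings):
    """Cronbach's alpha for a list of respondents' item scores.

    ratings: one row per respondent, each row a list of k item scores.
    """
    k = len(ratings[0])
    # Sample variance of each item across respondents
    item_variances = [variance([row[i] for row in ratings]) for i in range(k)]
    # Sample variance of each respondent's total score
    total_variance = variance([sum(row) for row in ratings])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical ratings (4 respondents x 3 items), NOT data from the article.
hypothetical_ratings = [
    [8, 7, 9],
    [5, 6, 5],
    [9, 9, 10],
    [3, 4, 2],
]

alpha = cronbach_alpha(hypothetical_ratings)
print(f"alpha for the hypothetical ratings: {alpha:.2f}")
```

With items drawn from deliberately different constructs, as in this instrument, a low alpha across all 30 items would not be surprising; it would be more informative computed within each cluster of related items, if such clusters contain more than one item.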

In all, I thought the report was up to the high standards of PIQ, which is a reputable, refereed journal. The discussion was in depth, and the description of the qualitative methods used for Question 1 was given in substantial detail. The ANOVA was described nicely, and the complete summary table was given, rather than just the findings that were significant. The fact that these results backed up the challenges made to the established models seven years earlier further strengthened the study.



Copyright © 1997 Lorraine Sherry
lsherry@carbon.cudenver.edu
April 27, 1997