This report was written by Dr. Bob Ewell of Creative Solutions, a statistical analysis service specializing in helping organizations use data to make good decisions.

Introduction

The United States Air Force Academy (USAFA) desired to convert its mid- and end-of-course evaluations from paper-pencil to computer-administered methodology. Issues involved in administering all mid- and end-of-course surveys on-line center on four areas:

The mechanism solved the access and confidentiality issues, at least technically. We chose the mechanism to be a web-based survey accessible by a browser. We used Question Mark® software to generate and display the survey and process the results. We designed a front-end access page to the survey which solved the access issue. Students, upon logging onto the page, were given a list of courses they were authorized to assess. When the student chose a course and instructor, the survey was displayed. The data collected on the survey did not include any student identifying information. Therefore, the access and confidentiality issues were technically solved for this research. One question in the on-line survey asked if the students believed that their data were confidential. Results are reported later.

Logistics were not tested in this research.

The remainder of this research focused on the issue of possible response differences by mode of survey administration.

Other than some evidence from Texas A&M (cited later in the literature review), we were not able to find research on comparability of course evaluation surveys by mode of administration. Here are some of the potential factors which could skew results:

- Accessibility to computers: this is an issue often raised in the literature (see, for example, Bracken (1994)). Does the entire population have equal access to a computer? In the case of the Academy, this is a non-issue. All students have a computer.

- Propensity to use computers: while equal accessibility is guaranteed, equal desire is not. There may be some students who would prefer one mode over the other.

- Time: students completing a course evaluation in class are on school time. If the same students complete evaluations in their rooms, they are on personal time. How will this perceived incursion into their already limited time affect their attitudes? One solution to this might be to give some kind of compensatory time or have them do it during class time and report late.

- Scheduling: when on-line evaluations are implemented school wide, how will the students complete these course evaluations? Will they sit down at the last minute and do them all at once? How will this affect their attitudes toward the courses or the surveys?

- Faculty attitudes: while not directly impacting on student ratings, if the faculty perceive that some of the factors listed above will impact course ratings, will the faculty support computerizing the process?

- Other: we don't know what will impact ratings. For example, Texas A&M noticed lower ratings on the computer-administered survey than the paper-pencil. This was attributed to the fact that each question on the computer had to be answered individually whereas on paper-pencil, students could put their pencils on a column and mark it all the way down. Although we noticed differences by mode in this research, we noticed no difference in the propensity to string like responses together.

Review of Literature

The articles fall under the following headings:

Issues

David Bracken (1994), studying different modes of administration, concluded that the effect of method on the responses is unclear. Until that ambiguity is resolved by research, it's probably best to pick one administration technology and stick with it. He raised issues such as:

Mark Troy of Texas A&M commented in June 1996 on the EVALTALK listserv (internet discussion group) on what he observed in A&M's initial trials with on-line surveys. Specifically,

There is no difference

In a test designed to measure Type A characteristics, Holden and Hickman (1987) concluded

Katz and Dalby (1981), testing the Eysenck Personality Inventory found no difference between paper-pencil and computerized administrations.

George Huba (1988) tested the Western Personnel Test and found no mean differences in the number of items correct between paper-pencil and computer-administered versions. In addition, he concluded that equivalent correlations between computerized and standard forms also suggest that computer and paper-pencil versions yield comparable results.

There is a difference: computers are "better"

Kapes and Vansickle (1992), using the Harrington-O'Shea Career Decision-Making System, found the computer-based administrations significantly more reliable than paper-pencil.

A group assessing the equivalence of the paper and on-line formats of the Quis 5.5 (Slaughter et al., 1995) found that

There is a difference, but who knows what it is?

There was a series of conflicting reports, mostly on "social desirability" of responses.

In 1986, Kiesler and Sproull conducted a study using an 18-item questionnaire on health and personal characteristics administered with paper-pencil or by computer with results stored on a disk. They found that:

While Kiesler and Sproull found the computer-administered surveys lower in socially desirable responses, Davis and Cowles (1989) found just the opposite in a personality inventory. In addition, they found that the test/retest reliabilities were comparable across method although the pattern of correlation shows greater consistency as well as higher values in the group that used the computer in two administrations. Among those who used both administration methods, the computer was preferred.

Using the Balanced Inventory of Desirable Responding (BIDR) instrument, Lautenschlager and Flaherty (1990) discovered a significant difference by mode of administration with the computer scores higher.

In a response to the Lautenschlager and Flaherty study, another group of researchers discovered that there were no differences by mode of administration (Impression Management, 1992). [NOTE TO CEE: I can't find my copy of this article, and my notes have left off the authors. Please refer to your Volume 2 of last summer's report]

Methodology

Again, our purpose was to determine if mode of end-of-course evaluation made a difference in the results. Differences could come from any or all of the following factors:

Ideally, because each class differs (instructor, course material, students), we wanted to compare within specific classes, not between classes. However, we ruled out having students do evaluations twice. Therefore, we settled on comparing matching courses and instructors.

A total of 17 instructors were recruited across several disciplines:

Each instructor taught at least two sections of the same course. In addition, we tried to pick instructor/class pairings taught at comparable times of day rather than one morning and one afternoon, for example. Class compositions tend to be different by times of day, with inter-scholastic athletes taking mostly morning classes, for example. Detailed information about numbers of students in the classes used in the study is in the Results section.

All courses were taught in the Spring semester 1997. Toward the end of the semester, we randomly picked one of the classes for each instructor to do the end-of-course evaluation using the computer. All others completed the evaluations in class in the usual manner.

The instrument was the standard end-of-course evaluation form used at USAFA. The on-line version was converted to HTML for administration on the web. See Appendix 1 for a copy of both instruments.

For analysis purposes, we used only questions 1 - 23 which divide into three sections:

Students rated various aspects of the course or instructor on a 6-point scale: very poor, poor, fair, good, very good, excellent.

Questions 24 - 26 were more about the students; subsequent questions were unique to each course. The on-line version contained three additional questions all scored on a 7-point strongly disagree/strongly agree scale:

Data were extracted, reduced, and processed in a variety of ways using SPSS®. See the Results section.

 

Results

Returns

One event we had not anticipated was students' not completing the on-line survey. Knowledgeable people advised us this would be a problem, but since this is a military school, we had assumed we could make the evaluations a "duty" and expect compliance. Unfortunately, all evaluations are voluntary, including the paper-pencil ones done in class. When the on-line version was produced, it copied the paper-pencil version exactly resulting in the following heading:

Completion of this questionnaire is voluntary.

You are free to leave some or all items unanswered.

This information will remain anonymous at all times.

Many students took us at our word, and did not complete the evaluations. (In class, they are offered the same options, but since they are there anyway, they usually complete the evaluations.) We generally picked for analysis instructors who had at least 9 students respond to each survey. The table below shows the 10 instructor/class pairs we used for analysis. (Numbers are shown for the first question only. Responses to other questions vary.) Additional analysis comparing total students with numbers of returns revealed that in the 10 classes used in the study, 82 percent of the paper students returned evaluations compared with 73 percent of the on-line students. [NOTE TO CEE: if you want to try to get all the numbers, that might be interesting. I have original class sizes only from the 7 or 8 instructors I have been able to talk with about ratings.]

Instructor ID  Paper-pencil  On-line  Analyzed paper-pencil     Analyzed on-line
1              20            11       20                        11
2              12            9        12                        9
3              13            4        -                         -
4              0             8        -                         -
5              15            16       15                        16
6              48            15       48 (multiple sections)    15
7              6             8        -                         -
8              35            6        -                         -
9              33            16       33 (multiple sections)    16
10             10            9        10                        9
11             20            8        -                         -
12             15            14       15                        14
13             -             3        -                         -
14             32            9        -                         -
15             24            14       24                        14
16             14            10       14                        10
17             17            25       17                        25
Total          314           185      208                       139

The ten instructors left still represented a cross-section of disciplines:

Differences

It appears that mode does make a difference and that paper-pencil comes in higher than on-line. Since we compared pairs of classes, and the differences could reflect actual differences within each pairing, we cannot make this statement categorically. However, we can build a good case.

First, the mean of all questions was higher for paper-pencil than for on-line.

Second, the frequencies of differences between questions for instructor/course pairings are strongly skewed toward paper-pencil higher than on-line. For each instructor/class pairing, we looked at the average for each question for each mode and computed the difference between paper-pencil average and on-line average. With 10 instructor/course pairings and 23 questions, there were 230 differences to look at. If the classes in all the pairings were identical and there were no differences by mode of administration, we would expect most differences to be near zero and equal numbers of inequalities in both directions: on-line higher and paper-pencil higher. In fact, most differences are in favor of paper-pencil.
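As an illustration only, the per-question comparison described above can be sketched in Python. The ratings below are invented, not data from the study, and the function names are hypothetical. The sketch also assumes every student answered every question, which the report notes was not always the case.

```python
from statistics import mean

def question_means(responses):
    """Average each question (column) over all students (rows)."""
    return [mean(column) for column in zip(*responses)]

def mode_differences(paper, online):
    """Paper-pencil mean minus on-line mean, one value per question."""
    return [p - o for p, o in zip(question_means(paper), question_means(online))]

def sign_counts(diffs):
    """Count questions where paper is higher, on-line is higher, or tied."""
    paper_higher = sum(d > 0 for d in diffs)
    online_higher = sum(d < 0 for d in diffs)
    ties = len(diffs) - paper_higher - online_higher
    return paper_higher, online_higher, ties

# Invented ratings for one instructor/course pairing on the 6-point scale:
# each row is a student, each column a question (three questions for brevity).
paper = [[5, 4, 6], [4, 4, 5], [6, 5, 5]]
online = [[4, 5, 5], [5, 4, 5], [3, 5, 6]]

diffs = mode_differences(paper, online)
counts = sign_counts(diffs)
print(counts)  # (paper higher, on-line higher, ties)
```

Repeating this for each of the 10 pairings and all 23 questions would give the 230 differences discussed above. If mode had no effect, the first two counts would be expected to be roughly equal.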

Counting significant differences among the 23 questions per instructor shows the vast majority of significant differences tilted toward paper greater than on-line. Only two of the 10 instructors had more on-line averages greater than paper-pencil.

More composite instructor averages (questions 1-12) are higher for paper-pencil than on-line. However, we noticed that the differences are relatively small. The biggest average difference is for instructor 9, and it is less than one point on a six-point scale.

The same imbalance is true in questions 13 - 18, the course questions. More instructors have their paper-pencil averages higher than their on-line averages.

The imbalance is more pronounced in the last four questions, the general ones. Overall, we must look at the fact that of the ten instructors, only one did not have significant differences by mode. Seven of the remaining nine had significant differences in favor of paper. If we assume, not that the classes in each pair are equal, but that they are unequal, we would expect a more even distribution of inequality since the classes for each mode were chosen randomly. To have a seven to two imbalance has probability less than 0.12: not significant at the 5% level, but low nonetheless.
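The report does not say how the probability figure was computed. One plausible reading, offered here only as an assumption, is a one-sided sign test: if each of the nine instructors were equally likely to favor either mode, how likely is it that seven or more favor paper? A minimal Python sketch of that calculation:

```python
from math import comb

def upper_tail(successes, trials, p=0.5):
    """P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )

# Assumption: nine instructors with significant differences, each equally
# likely to favor paper or on-line if mode had no effect.
one_sided = upper_tail(7, 9)
print(f"P(7 or more of 9 favor paper) = {one_sided:.3f}")  # 0.090
```

Under this assumption the tail probability is 46/512, or about 0.09, which is consistent with the statement that the probability is less than 0.12 but above the conventional 5% level.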

The following data summarize the differences by instructor. Note again that only two of the ten have on-line higher than paper.

Finally, this chart shows graphically the number of instructors with differences; most are in favor of paper over on-line.

Differences are apparently not due to more students marking consecutive numbers on the paper survey. Since there were 23 questions, we counted the number of students who had consecutive strings of 12 or more identical responses. (Counting strings of 12 or more means we couldn't count any students twice.) About 13 percent of the paper-pencil students and nearly 9 percent of the on-line students had such a consecutive identical-response string, a nonsignificant difference.
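The string-counting check can be sketched as follows. This is an illustrative Python version, not the SPSS procedure actually used. The response sheets are invented and the function names are hypothetical. Two separate runs of 12 would need at least 24 items, so with 23 questions a student can have at most one such run.

```python
from itertools import groupby

def longest_run(responses):
    """Length of the longest string of consecutive identical responses."""
    return max((len(list(group)) for _, group in groupby(responses)), default=0)

def has_long_run(responses, threshold=12):
    """True if the sheet contains a run of at least `threshold` identical marks."""
    return longest_run(responses) >= threshold

# Invented 23-item response sheets on the 6-point scale.
straight_liner = [5] * 14 + [4, 6, 5, 5, 4, 3, 5, 6, 4]
varied = [5, 4, 5, 6, 4] * 4 + [5, 5, 6]

print(longest_run(straight_liner), has_long_run(straight_liner))  # 14 True
print(longest_run(varied), has_long_run(varied))                  # 2 False
```

Applying `has_long_run` to every student in each mode and comparing the two proportions would reproduce the kind of comparison reported above.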

Reaction to the on-line survey

In addition to the difficult question of whether or not mode of survey administration makes a difference in the ratings, there remain issues associated with student acceptance of the technique and the promise of anonymity.

When asked if they preferred on-line to paper-pencil, over half said they did. Eighteen percent were undecided, and fewer than 30 percent did not prefer the on-line method.

Strongly disagree   11%
2                   13%
3                   4%
4                   18%
5                   7%
6                   20%
Strongly agree      26%

 

Nearly 90 percent of the on-line students said the mechanism was easy to use.

Strongly disagree   2%
2                   3%
3                   3%
4                   5%
5                   7%
6                   41%
Strongly agree      40%

 

Slightly over half the students believed us when we said they would be anonymous. However, nearly 30 percent did not with 14 percent undecided. One would think that a perception of non-anonymity would lead to higher ratings instead of the lower ratings that we received from the on-line students.

Strongly disagree   8%
2                   9%
3                   12%
4                   14%
5                   7%
6                   31%
Strongly agree      18%
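The summary percentages quoted in this section can be checked by collapsing each 7-point distribution into three groups. The Python sketch below assumes that points 1-3 count as disagreement, point 4 as undecided, and points 5-7 as agreement. It also assumes that the three tables correspond, in order, to the preference, ease-of-use, and anonymity questions. Both assumptions are inferred from the surrounding text rather than stated in it.

```python
def collapse(percentages):
    """Collapse a 7-point distribution into (disagree, undecided, agree)."""
    if len(percentages) != 7:
        raise ValueError("expected seven scale points")
    return sum(percentages[:3]), percentages[3], sum(percentages[4:])

# Percentages as reported, from "strongly disagree" (1) to "strongly agree" (7).
prefer_online = [11, 13, 4, 18, 7, 20, 26]
easy_to_use = [2, 3, 3, 5, 7, 41, 40]
anonymous = [8, 9, 12, 14, 7, 31, 18]

for label, dist in [
    ("prefer on-line", prefer_online),
    ("easy to use", easy_to_use),
    ("believed anonymous", anonymous),
]:
    print(label, collapse(dist))
# prefer on-line (28, 18, 53)
# easy to use (8, 5, 88)
# believed anonymous (29, 14, 56)
```

The columns do not sum to exactly 100 because the reported percentages are rounded.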

 

Conclusions

The concept of on-line evaluations works. Students in the test classes were able to log onto the network, access the survey using their browser and complete the evaluation.

However, data from this study could not be used to say that there would be no differences in ratings if the mode of survey administration were changed. The real question then becomes, if the ratings are different, which is more "accurate"? Just because the on-line ratings are lower doesn't mean the paper ones are correct. In fact, there is evidence that the on-line may be the better data:

The mechanism itself appears to be appropriate. Most students found the on-line survey easy to use.

Getting students to do the evaluation is more of a challenge than we thought. In this test, one instructor got a good response by promising the students compensatory time out of class. When the students did not log on right away, he reminded them of his commitment and suggested they live up to theirs. Another instructor happened to have a computer lab scheduled, so those students actually did the survey on-line during class time. If the evaluations are administered across the school, USAFA will have to devise a way to get more response than we got from some classes.

The logistics issue has not been resolved. While it is ecologically satisfying to think of administering the end-of-course surveys without cutting down trees, there is a nontrivial strain on network resources of 4,000 students filling out approximately six evaluations each in a relatively short time interval. This challenge needs further study.

Finally, the critical issue of anonymity needs work. Perhaps with time, students will become used to the on-line idea and won't even think about it. Also, education appears to be in order. If students understood how the response database is built, and how it is separate from the access step, they might better understand how they are technologically anonymous. Once their responses go into the database, it is literally impossible to match their responses with their identity (unless exactly one student filled out a survey for a course).

References

Bracken, David (1994). Straight talk about multirater feedback. Training & Development (48), 09-01-1994, 44.

Davis, C. & Cowles, M. (1989). Automated psychological testing: method of administration, need for approval and measures of anxiety. Educational and Psychological Measurement, 49, 311-320.

Holden, R. R. & Hickman, D. (1987). Computerized versus Standard Administration of the Jenkins Activity Survey (Form T). Journal of Human Stress, 13(4), 175-179.

Huba, G.J. (1988). Comparability of Traditional and Computer Western Personnel Test (WPT) Versions. Educational and Psychological Measurement, 48, 957-959.

[NEEDS AUTHORS] Impression Management, Social Desirability, and Computer Administration of Attitude Questionnaires: Does the Computer make a difference? (1992), Journal of Applied Psychology, 77(4), 562-566.

Kapes, J. T. & Vansickle, T. R. (1992). Comparing Paper-Pencil and Computer-Based Versions of the Harrington-O'Shea Career Decision-Making System. Measurement and Evaluation in Counseling and Development, 25, 5-13.

Katz, L. & Dalby, J. T. (1981). Computer and Manual Administration of the Eysenck Personality Inventory. Journal of Clinical Psychology, 37, 586-592.

Kiesler, S. & Sproull, L. S. (1986). Response Effects in the Electronic Survey. Public Opinion Quarterly, 50, 402-413.

Kiesler, S., Walsh, J., & Sproull, L. (1992). Computer Networks in Field Research, (1986, 1987, 1991), in Methodological Issues in Applied Psychology, edited by Bryant et al. Plenum Press, New York, 1992.

Lautenschlager, G. J. & Flaherty, V. L. (1990). Computer Administration of Questions: More Desirable or More Social Desirability? Journal of Applied Psychology, 75(3), 310-314.

Slaughter, L., Harber, B., & Norman, K. (1995). Assessing the Equivalence of the paper and on-line formats of the Quis 5.5. Laboratory for Automation Psychology, University of Maryland, College Park.