Student Learning Outcomes
At the completion of this unit of instruction students will be able to:
Recognize that measurement is both a research tool and a research area.
Remember that scientific problem solving involves four steps (1. Developing the problem; 2. Formulating the hypotheses; 3. Gathering the data; 4. Analyzing and interpreting the results).
Step 3 necessitates an understanding of measurement.
Remember we discussed concerns related to internal and external validity. We have similar concerns about the validity of our measurements. Specifically, "Does the test or instrument measure what it is supposed to measure?"
Q: What is reliability?
A: The consistency or repeatability of a measure
For example, if I use the measurement twice (e.g. take a test twice) would my scores be the same?
Returning to the different types of validity distinguished in the text...
Four basic types of measurement validity
1. Logical validity
2. Content validity
3. Criterion validity (concurrent and predictive)
4. Construct validity
Logical Validity is also referred to as face validity. Does the measure obviously measure the intended performance? A pull-up test obviously measures pull-ups, but is it a valid measure of strength? Is the frequently used maximum bench press a valid measurement of strength? Does my tennis skills test measure tennis skill? Often measurements are difficult to justify on the basis of logical validity.
Content Validity is of great interest to you as students in this class. You are obviously concerned that the assignments given to you in this class, and the midterm, and final examinations cover the content of the class and are a valid representation of your learning.
Criterion Validity involves measurements that can be validated against some criterion.
Concurrent validity exists when a test that can easily be administered is validated by a high correlation with another (often difficult to administer) test that is known to be valid. For example, in running, the shift in the body's principal fuel toward carbohydrate metabolism is associated with the anaerobic threshold. This change could probably be most accurately measured with blood samples. However, taking blood from athletes is not very convenient, so instead we measure the heart rate. We know as the result of measurement research that HR is a valid way of predicting the change in energy sources.
The same would be true with a tennis test. The most valid way of assessing playing ability would perhaps be to have skilled observers watch players and give ratings. A tennis test that any coach could administer would be much more convenient. However, in creating this test we would be wise to be sure that the results compared similarly with the observers' ratings. Having found that concurrent validity exists we could claim that our test was valid.
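The tennis example can be sketched numerically. A minimal check of concurrent validity correlates the convenient test with the criterion measure; all scores below are hypothetical, and the correlation function is written by hand so no libraries are needed.

```python
def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical data for 8 players
skills_test = [55, 62, 70, 48, 80, 66, 59, 74]             # easy-to-give skills test
expert_ratings = [3.0, 3.5, 4.0, 2.5, 4.8, 3.8, 3.2, 4.4]  # observers' ratings

r = pearson_r(skills_test, expert_ratings)
# A high positive r is evidence of concurrent validity for the skills test.
```

The closer r is to 1.0, the stronger the case that the convenient test can stand in for the observers' ratings.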
Predictive validity refers to the validity of a measurement to be used for the prediction of future performance. The GRE, for example, is used as a measure for predicting future college success. Maybe health educators would like to predict the likelihood of future drug dependency of elementary-aged children. To attempt to find a valid measure we would need to select several criteria and then, through the use of correlational statistics, examine the relationship. The major question (that statistics help answer) is whether our measure has much predictive value.
If successful in finding measures that prove to be valid in terms of prediction, we would like to think that our measure could be used elsewhere. What happens, however, is that the validity tends to decrease when the measure is used with a different sample. This phenomenon is known as shrinkage.
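Shrinkage can be sketched with a small numerical example. All data below are hypothetical: a prediction equation is derived from one sample by least squares, and it typically correlates less well with outcomes in a second sample.

```python
def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def fit_line(x, y):
    """Least-squares slope and intercept for predicting y from x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return slope, my - slope * mx

# Original sample: predictor scores and later outcomes (hypothetical)
x1, y1 = [10, 12, 15, 9, 14, 11], [20, 25, 31, 18, 29, 23]
# A different sample, measured later with the same instruments (hypothetical)
x2, y2 = [10, 12, 15, 9, 14, 11], [24, 21, 30, 20, 25, 26]

slope, intercept = fit_line(x1, y1)  # prediction equation from the first sample
r_original = pearson_r([slope * a + intercept for a in x1], y1)
r_new = pearson_r([slope * a + intercept for a in x2], y2)
# r_new is typically smaller than r_original: that drop is shrinkage.
```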
Construct validity is a concern when we attempt to measure something that is not observable but which we attempt to infer. We do this all the time with concepts such as intelligence, anxiety, arousal, learning, attitude, etc. To validate tests of these variables we again need some type of comparison, usually in relation to an observable behavior. For example, past assessments of a person's teaching effectiveness have often been made by observers simply watching a teaching episode, taking notes, and writing up a critique. Such assessments may or may not be valid. They tend to be very subjective, and often two observers will focus on different aspects, thereby sometimes producing contradictory evidence. More recently, many assessment tools have been developed that direct observers to record specific, observable behaviors, e.g. time spent organizing, time spent managing, number of feedback statements, number of student names used. Based on these types of observations we attempt to infer "teaching effectiveness" even though teaching effectiveness is not itself a directly observable behavior.
Be sure you can distinguish between validity and reliability. Validity is whether or not a measurement is really measuring the item of interest. In contrast, reliability focuses on the consistency of the measurement. If a measurement is reliable you should get the same results if you repeat it.
With any measurement the score you get is the observed score. This score is a combination of the true score and error score. As researchers we would of course like to eliminate or at least minimize the error score.
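The observed = true + error model can be illustrated with a small simulation. The true score and the error distribution below are assumptions chosen for illustration only.

```python
import random

random.seed(1)  # fixed seed so the simulation is repeatable
true_score = 50  # hypothetical true score
# 100 observed scores: the true score plus random error (mean 0, sd 2)
observed = [true_score + random.gauss(0, 2) for _ in range(100)]

mean_observed = sum(observed) / len(observed)
# Random errors average toward zero, so the mean of repeated
# measurements approaches the true score.
```

Any single observed score may be off by a couple of points, but averaging repeated measurements cancels much of the random error.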
Four sources of measurement error include:
1. Subjects - variations in their mood, physical condition, mental state, motivation
2. Testing - poor directions, different expressions of interest or attempts to motivate
3. Scoring - use of inexperienced scorers, errors in recording
4. Instrumentation - inaccuracies, poor tests, calibration
We can establish the reliability of a measurement by various statistical techniques, all of which attempt to see the extent to which similar results can be obtained twice. One obvious method is the same-day test-retest, in which the same subjects are tested twice on the same day and the results compared. This method works best when the quality being measured is unlikely to be influenced by exposure to the test. In other words, there is a question whether any learning might occur that would influence a person's response the second time around.
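A minimal same-day test-retest sketch, using hypothetical pull-up counts: the reliability coefficient here is simply the correlation between the two administrations.

```python
def pearson_r(x, y):
    """Pearson correlation; here it serves as the reliability coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

# Hypothetical pull-up counts for 7 subjects, tested twice on the same day
trial_1 = [12, 15, 9, 14, 11, 16, 10]
trial_2 = [13, 15, 10, 13, 11, 17, 9]

reliability = pearson_r(trial_1, trial_2)
# A coefficient near 1.0 indicates the measure is highly reliable.
```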
We need to remember that all tests (measurements) are likely to include some degree of error and, when appropriate, to calculate data expressing the error. We also need to be careful to minimize this error to the greatest extent possible.
Four Types of Measurement Scales
When researchers construct tests, they have to first decide on the appropriate scale of measurement. In the text four scales are presented.
1. Nominal - classification by name (males/females, teenagers/adults are examples of categories). Can also be formed on the basis of some measurement criterion (high/low achievers, high/low skilled, although these are to some extent also of an ordinal nature). The purpose of the scale is just for identification.
2. Ordinal - provides a rank order, e.g. percentile or a numbered list of students in order of achievement on some measure. However, knowing the ranking of a score doesn't provide information of differences between scores. In other words, knowing that Sally was first in a quiz and Bill was second gives the rank order but not any measurement of the difference between Sally and Bill's scores.
3. Interval - this measure provides both an order and the size of the difference between scores, but has no true zero point. Temperature in degrees Celsius is an example: the difference between 10 and 20 degrees equals the difference between 20 and 30 degrees, but 20 degrees is not "twice as hot" as 10 because zero does not mean an absence of heat.
4. Ratio - these scores have the qualities of interval scores but also have a true zero, so ratios between scores are meaningful. Force, time, and distance all have true zero points. A count of pull-ups is another example: 2 pull-ups really is twice as many as 1.
Often we want to compare scores on one measure to scores on a different measure. This is impossible to do unless we can convert the two scores to a similar scale. In other words you can't compare apples and oranges unless you convert the apple to the orange - that sounds more confusing than I anticipated!
Anyway, there are ways to convert scores and the z score and T scale represent commonly used standard scores.
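A brief sketch of the conversion, using hypothetical pull-up and sprint data: z = (score - mean) / sd places any measure on a mean-0, sd-1 scale, and the T scale rescales z to mean 50, sd 10, so scores from different tests become comparable.

```python
def z_scores(scores):
    """Convert raw scores to z scores (mean 0, standard deviation 1)."""
    n = len(scores)
    mean = sum(scores) / n
    sd = (sum((s - mean) ** 2 for s in scores) / n) ** 0.5
    return [(s - mean) / sd for s in scores]

pullups = [5, 8, 12, 6, 9]            # raw pull-up counts (hypothetical)
sprint = [8.2, 7.5, 7.0, 8.0, 7.8]    # sprint times in seconds (lower is better)

z_pullups = z_scores(pullups)
z_sprint = z_scores([-t for t in sprint])   # negate times so higher = better
t_pullups = [50 + 10 * z for z in z_pullups]  # T scale: mean 50, sd 10

# Student 3 (index 2) can now be compared across the two tests,
# since z_pullups[2] and z_sprint[2] are on the same scale.
```

This is the "converting the apple to the orange" step: once both measures are expressed as z or T scores, a student's relative standing on each can be compared directly.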
As noted in the text, in PE especially we are often concerned with measuring movement. The main point to appreciate is that it is important that we should always be concerned about the validity and reliability of such measures.
Measuring Written Responses
Another common area of PEHL measurement. We always seem to be interested in items such as attitudes, self-concept, anxiety, stress, motivation, communication and so on.
The biggest problem faced in this area is defining the behavior we wish to measure and then producing a valid and reliable measure. The main difficulty is that the quality described by these words (attitude, self-concept, etc.) is rather intangible. In fact it only exists to the extent that we define it. If we change our definition we have in effect changed the quality itself!
It is for these reasons that I strongly advise graduate students to use preexisting measures that have been proven to be somewhat valid and reliable whenever possible, rather than attempt to design their own measurements.
Lots of energy has been devoted to research on attitudes and personality. Do athletes have special personality characteristics in comparison to non-athletes? Does athletics develop these characteristics or are certain personalities attracted to athletics? Are there differences between sports? Do sports develop character or characters? You can apply these and lots more examples to your own area of specialization.
Four commonly used measurement scales are the Likert, Semantic Differential, Thurstone, and Rating scales.
The Likert Scale typically involves a 5-7 point scale on which subjects respond according to levels of agreement. For example:
"I have enjoyed and learned a lot from participating in my graduate research methods class."
Strongly Agree Agree Undecided Disagree Strongly Disagree
This scale gives a wider range of expression than a simple yes/no answer.
The Semantic Differential Scale uses bipolar adjectives (e.g. beautiful-ugly, skilled-unskilled, supportive-critical), at the end of a 7 point scale. Subjects score 7 for the most positive and 1 for the least positive.
In the Thurstone Scale subjects express agreement or disagreement with a written statement. For example:
"PEHL 557 should be a 4 credit class."
These are harder to construct because they involve the use of judges in weighting each statement for use in scoring.
Rating Scales are frequently used in research (e.g. Borg's Rating of Perceived Exertion, ALT-PE scales, and many more). As pointed out in the text, when "experts" are involved in ratings, various types of inconsistencies sometimes emerge.
Leniency = overgenerous
Central tendency = tendency to grade everyone as average
Halo effect = use of prior knowledge about a subject can influence judgment
Proximity errors = ratings influenced by the location of the rating criteria on the rating sheet (items placed close together tend to receive similar ratings)
Observer bias = personal biases the judges may have
Observer expectation = the rater's knowledge of the experimental arrangements may lead to different expectations, and hence different ratings.
If possible it is better to create evaluation devices that reduce the need for subjective value judgments and increase objective measurements. This trend has occurred in evaluating teaching effectiveness. For example, we now count specific behaviors exhibited by teachers rather than try to judge whether the behavior is good or bad. If I told you that you said "um" forty times during your 5-minute presentation, you would probably conclude a need to improve communication without me having to say that I think your communication skills rate a 3 on a 5-point rating scale.
Whenever we take tests or give our students tests, we would like to believe that the questions we pose are valid measurements of their knowledge. Sometimes subjects don't have the opportunity to express their concerns to the test creator. Fortunately, it is possible for test creators to objectively evaluate the validity of their own measurements.
Item difficulty is a way of assessing the value of a question. If everyone answers a question correctly, the thought arises as to whether there is any point including the question as a measure. Think about this...maybe we actually do want everyone to answer the question correctly...or maybe we want to differentiate between the level of knowledge of our students.
Anyway, as explained in the text, we can calculate a difficulty index. Many test makers will eliminate questions with a difficulty index below .10 or above .90.
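The difficulty index itself is just the proportion of test takers who answer the item correctly. A minimal sketch, with hypothetical responses coded 1 = correct, 0 = incorrect:

```python
def difficulty_index(responses):
    """Proportion of test takers answering the item correctly (1 = correct)."""
    return sum(responses) / len(responses)

# Hypothetical responses of 10 students to one question
item = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]

d = difficulty_index(item)  # 0.8: inside the .10-.90 range, so keep the item
```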
Item discrimination is a way of learning how well our tests discriminate between high achievers and low achievers. Many test makers strive for discrimination indexes of .20 or higher for each question.
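One common form of the discrimination index compares performance on an item between the top and bottom scorers on the whole test. A sketch with hypothetical responses (1 = correct, 0 = incorrect):

```python
def discrimination_index(top_group, bottom_group):
    """Proportion correct among top scorers minus proportion correct among bottom scorers."""
    p_top = sum(top_group) / len(top_group)
    p_bottom = sum(bottom_group) / len(bottom_group)
    return p_top - p_bottom

# Hypothetical responses to one item, grouped by total test score
top = [1, 1, 1, 0, 1]     # the 5 highest overall scorers
bottom = [0, 1, 0, 0, 1]  # the 5 lowest overall scorers

D = discrimination_index(top, bottom)  # about 0.4, above the .20 target
```

An index near zero (or negative) would mean the item fails to separate strong from weak students and is a candidate for removal.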