5.5 Measurement quality
- Define reliability and describe the types of reliability
- Define validity and describe the types of validity
In quantitative research, once we’ve defined our terms and specified the operations for measuring them, how do we know that our measures are any good? Without some assurance of the quality of our measures, we cannot be certain that our findings have any meaning or, at the least, that our findings mean what we think they mean. When social scientists measure concepts, they aim to achieve reliability and validity in their measures. These two aspects of measurement quality are the focus of this section. We’ll consider reliability first and then take a look at validity. For both aspects of measurement quality, let’s say our interest is in measuring the concepts of alcoholism and alcohol intake. What are some potential problems that could arise when attempting to measure this concept, and how might we work to overcome those problems?
First, let’s say we’ve decided to measure alcoholism by asking people to respond to the following question: Have you ever had a problem with alcohol? If we measure alcoholism in this way, it seems likely that anyone who identifies as an alcoholic would respond with a yes to the question. So, this must be a good way to identify our group of interest, right? Well, maybe. Think about how you or others you know would respond to this question. Would responses differ after a wild night out from what they would have been the day before? Might an infrequent drinker’s current headache from the single glass of wine she had last night influence how she answers the question this morning? How would that same person respond to the question before consuming the wine? In each of these cases, if the same person would respond differently to the same question at different points, it is possible that our measure of alcoholism has a reliability problem. Reliability in measurement is about consistency.
One common problem of reliability with social scientific measures is memory. If we ask research participants to recall some aspect of their own past behavior, we should try to make the recollection process as simple and straightforward for them as possible. Sticking with the topic of alcohol intake, if we ask respondents how much wine, beer, and liquor they’ve consumed each day over the course of the past 3 months, how likely are we to get accurate responses? Unless a person keeps a journal documenting their intake, there will very likely be some inaccuracies in their responses. If, on the other hand, we ask a person how many drinks of any kind they have consumed in the past week, we might get a more accurate set of responses.
Reliability can be an issue even when we’re not reliant on others to accurately report their behaviors. Perhaps a researcher is interested in observing how alcohol intake influences interactions in public locations. She may decide to conduct observations at a local pub, noting how many drinks patrons consume and how their behavior changes as their intake changes. But what if the researcher has to use the restroom and misses the three shots of tequila that the person next to her downs during the brief period she is away? The reliability of this researcher’s measure of alcohol intake, counting numbers of drinks she observes patrons consume, depends on her ability to actually observe every instance of patrons consuming drinks. If she is unlikely to be able to observe every such instance, then perhaps her mechanism for measuring this concept is not reliable.
If a measure is reliable, it means that if the measure is given multiple times, the results will be consistent each time. For example, if you took the SATs on multiple occasions before coming to school, your scores should be relatively the same from test to test. This is what is known as test-retest reliability. In the same way, if a person is clinically depressed, a depression scale should give similar (though not necessarily identical) results today that it does two days from now.
If your study involves observing people’s behaviors, for example watching sessions of mothers playing with infants, you may also need to assess inter-rater reliability. Inter-rater reliability is the degree to which different observers agree on what happened. Did you miss when the infant offered an object to the mother and the mother dismissed it? Did the other person rating miss that event? Do you both similarly rate the parent’s engagement with the child? Again, scores of multiple observers should be consistent, though perhaps not perfectly identical.
Finally, for scales, internal consistency reliability is an important concept. The scores on each question of a scale should be correlated with each other, as they all measure parts of the same concept. Think about a scale of depression, like Beck’s Depression Inventory. A person who is depressed would score highly on most of the measures, but there would be some variation. If we gave a group of people that scale, we would imagine there should be a correlation between scores on, for example, mood disturbance and lack of enjoyment. They aren’t the same concept, but they are related. So, there should be a mathematical relationship between them. A specific statistic known as Cronbach’s Alpha provides a way to measure how well each question of a scale is related to the others. Cronbach’s alpha (sometimes shown as α) can range from 0 to 1.0. As a general rule, Cronbach’s alpha should be at least .7 to reflect acceptable internal consistency with scores of .8 or higher considered an indicator of good internal consistency.
Test-retest, inter-rater, and internal consistency are three important subtypes of reliability. Researchers use these types of reliability to make sure their measures are consistently measuring the concepts in their research questions.
While reliability is about consistency, validity is about accuracy. What image comes to mind for you when you hear the word alcoholic? Are you certain that the image you conjure up is similar to the image others have in mind? If not, then we may be facing a problem of validity.
For a measure to have validity, we must be certain that our measures accurately get at the meaning of our concepts. Think back to the first possible measure of alcoholism we considered in the previous few paragraphs. There, we initially considered measuring alcoholism by asking research participants the following question: Have you ever had a problem with alcohol? We realized that this might not be the most reliable way of measuring alcoholism because the same person’s response might vary dramatically depending on how they are feeling that day. Likewise, this measure of alcoholism is not particularly valid. What is “a problem” with alcohol? For some, it might be having had a single regrettable or embarrassing moment that resulted from consuming too much. For others, the threshold for “problem” might be different; perhaps a person has had numerous embarrassing drunken moments but still gets out of bed for work every day, so they don’t perceive themselves as having a problem. Because what each respondent considers to be problematic could vary so dramatically, our measure of alcoholism isn’t likely to yield any useful or meaningful results if our aim is to objectively understand, say, how many of our research participants have alcoholism. 
Types of validity
Below are some basic subtypes of validity, though there are certainly others you can read more about. One way to think of validity is to think of it as you would a portrait. Some portraits of people look just like the actual person they are intended to represent. But other representations of people’s images, such as caricatures and stick drawings, are not nearly as accurate. While a portrait may not be an exact representation of how a person looks, what’s important is the extent to which it approximates the look of the person it is intended to represent. The same goes for validity in measures. No measure is exact, but some measures are more accurate than others.
In the last paragraph, critical engagement with our measure for alcoholism “Do you have a problem with alcohol?” was shown to be flawed. We assessed its face validity or whether it is plausible that the question measures what it intends to measure. Face validity is a subjective process. Sometimes face validity is easy, as a question about height wouldn’t have anything to do with alcoholism. Other times, face validity can be more difficult to assess. Let’s consider another example.
Perhaps we’re interested in learning about a person’s dedication to healthy living. Most of us would probably agree that engaging in regular exercise is a sign of healthy living, so we could measure healthy living by counting the number of times per week that a person visits their local gym. But perhaps they visit the gym to use their tanning beds or to flirt with potential dates or sit in the sauna. These activities, while potentially relaxing, are probably not the best indicators of healthy living. Therefore, recording the number of times a person visits the gym may not be the most valid way to measure their dedication to healthy living.
Another problem with this measure of healthy living is that it is incomplete. Content validity assesses for whether the measure includes all of the possible meanings of the concept. Think back to the previous section on multidimensional variables. Healthy living seems like a multidimensional concept that might need an index, scale, or typology to measure it completely. Our one question on gym attendance doesn’t cover all aspects of healthy living. Once you have created one, or found one in the existing literature, you need to assess for content validity. Are there other aspects of healthy living that aren’t included in your measure?
Let’s say you have created (or found) a good scale, index, or typology for your measure of healthy living. A valid measure of healthy living would have scores that are similar to other measures of healthy living. Criterion validity occurs when the results from the measure are similar to those from an external criterion (that, ideally, has already been validated or is a more direct measure of the variable).
There are two types of criterion validity — predictive validity and concurrent validity. They are distinguished by timing — whether or not the measure is similar to something that is measured in the future or at the same time. Predictive validity means that your measure predicts things it should be able to predict. A valid measure of healthy living would be able to predict, for example, scores of a blood panel test during a patient’s annual physical. In this case, the assumption is that if you have a healthy lifestyle, a standard blood test done a few months later during an annual checkup would show healthy results. On the other hand, if we were to administer the blood panel measure at the same time as the scale of healthy living, we would be assessing concurrent validity. Concurrent validity is the same as predictive validity—the scores on your measure should be similar to an established measure—except that both measures are given at the same time.
Another closely related concept is construct validity. The logic behind construct validity is that sometimes there is no established criterion to use for comparison. However, there may be a construct that is theoretically related to the variable being measured. The measure could then be compared to that construct, even though it isn’t exactly the same concept as what’s being measured. In other words, construct validity exists when the measure is related to other measures that are hypothesized to be related to it.
It might be helpful to look at two types of construct validity – convergent validity and discriminant validity. Convergent validity takes an existing measure of the same concept and compares your measure to it. If their scores are similar, then it’s probably likely that they are both measuring the same concept.In assessing for convergent validity, one should look for different methods of measuring the same concept. If someone filled out a scale about their substance use and the results from the self-reported scale consistently matched the results of a lab test, then the scale about substance use would demonstrate convergent validity. Discriminant validity is a similar concept, except you would be comparing your measure to one that is expected to be unrelated. A participant’s scores on a healthy lifestyle measure shouldn’t be too closely correlated with a scale that measures self-esteem because you want the measure to discriminate between the two constructs.
Reliability versus Validity
If you are still confused about validity and reliability, Figure 5.2 shows an example of what validity and reliability look like. On the first target, our shooter’s aim is all over the place. It is neither reliable (consistent) nor valid (accurate). The second (middle) target demonstrates consistency…but it is reliably off-target, or invalid. The third and final target (bottom right) represents a reliable and valid result. The person is able to hit the target accurately and consistently. This is what you should aim for in your research. An instrument can be reliable without being valid (target 2), but it cannot be valid without also being reliable (target 3).
- Reliability is a matter of consistency.
- Validity is a matter of accuracy.
- There are many types of validity and reliability.
- Concurrent validity- if a measure is able to predict outcomes from an established measure given at the same time
- Content validity- if the measure includes all of the possible meanings of the concept
- Convergent validity- if a measure is conceptually similar to an existing measure of the same concept
- Discriminant validity- if a measure is not related to measures to which it shouldn’t be statistically correlated
- Face validity- if it is plausible that the measure measures what it intends to
- Internal consistency reliability- degree to which scores on each question of a scale are correlated with each other
- Inter-rater reliability- the degree to which different observers agree on what happened
- Predictive validity- if a measure predicts things it should be able to predict in the future
- Reliability- a measure’s consistency.
- Test-retest reliability- if a measure is given multiple times, the results will be consistent each time
- Validity- a measure’s accuracy
Figure 5.2 was adapted from Nevit Dilmen’s “Reliability and validity” (2012) Shared under a CC-BY-SA 3.0 license
- Of course, if our interest is in how many research participants perceive themselves to have a problem, then our measure may be just fine. ↵