10.5 Measurement quality: Validity

Learning Objectives

Learners will be able to…

  • Define and describe the types of validity through the lens of traditional validity theory
  • Explain how contemporary validity theory differs from traditional validity theory
  • Describe evidence for validity in contemporary validity theory

Validity is a quality of a measure in the context for which it is being used: the extent to which the scores from a measure represent the variable they are intended to represent. But how do researchers make this judgment? It is more than whether an instrument measures consistently, because a measure can be extremely consistent yet have no validity whatsoever. As an absurd example, imagine someone who believes that people’s index finger length reflects their self-esteem and therefore tries to measure self-esteem by holding a ruler up to people’s index fingers. Although this measure could be very consistent, the interpretation would have absolutely no validity. The fact that one person’s index finger is a centimeter longer than another’s would indicate nothing about which one had higher self-esteem.

Traditional discussions of validity often divide it into several distinct “types.” But in our contemporary understanding, a good way to interpret these types is that they are the kinds of evidence that should be taken into account when judging the validity of a measure’s interpretation or use.


Validity can be assessed using theoretical or empirical approaches, and ideally it should be assessed using both.

Contemporary theory of validity

In the contemporary theory of validity, construct validity is the unifying concept: every piece of validity evidence bears on whether inferences from a measure’s scores to the underlying construct are justified. Traditionally, however, validity was broken into separate “types.” Below we discuss these traditional “types,” which, in the current understanding, are better read as different sources of evidence for construct validity rather than as independent kinds of validity.

Theoretical assessment of validity

Theoretical assessment of validity focuses on how well the idea of a theoretical construct is translated into, or represented in, an operational measure. This type of validity is called translational validity (or representational validity) and, in traditional validity theory, consists of two subtypes: face validity and content validity. Translational validity is typically assessed by a panel of expert judges who rate each item (indicator) on how well it fits the conceptual definition of the construct, sometimes using a qualitative technique called Q-sort.

Face validity

Face validity is the extent to which a measurement method appears “on its face” to measure the construct of interest. Most people would expect a self-esteem questionnaire to include items about whether they see themselves as a person of worth and whether they think they have good qualities. So a questionnaire that included these kinds of items would have good face validity. The finger-length method of measuring self-esteem, on the other hand, seems to have nothing to do with self-esteem and therefore has poor face validity. Although face validity can be assessed quantitatively—for example, by having a large sample of people rate a measure in terms of whether it appears to measure what it is intended to—it is usually assessed informally.

Face validity is at best a very weak kind of evidence of a measure’s validity. One reason is that it is based on people’s intuitions about human behavior, which are frequently wrong. It is also the case that many established measures in psychology work quite well despite lacking face validity. The Minnesota Multiphasic Personality Inventory-2 (MMPI-2) measures many personality characteristics and disorders by having people decide whether each of 567 statements applies to them, even though many of the statements have no obvious relationship to the construct they measure. For example, the items “I enjoy detective or mystery stories” and “The sight of blood doesn’t frighten me or make me sick” both measure the suppression of aggression. In this case, it is not the participants’ literal answers to these questions that are of interest, but rather whether the pattern of the participants’ responses to a series of questions matches those of individuals who tend to suppress their aggression.

Evidence for appropriate content (traditionally “Content Validity”)

Content validity is the extent to which a measure “covers” all aspects of a construct. For example, if a researcher conceptually defines test anxiety as involving both sympathetic nervous system activation (leading to nervous feelings) and negative thoughts, then their measure of test anxiety should include items about both nervous feelings and negative thoughts. Or consider that attitudes are usually defined as involving thoughts, feelings, and actions toward something. By this conceptual definition, a person has a positive attitude toward exercise to the extent that they think positive thoughts about exercising, feel good about exercising, and actually exercise. So to have good content validity, a measure of people’s attitudes toward exercise would have to reflect all three of these aspects. Like face validity, content validity is not usually assessed quantitatively. Instead, it is assessed by carefully checking the measurement method against the conceptual definition of the construct, often by experts in the field.

Empirical assessment of the validity of measures


While translational validity examines whether a measure is a good reflection of its underlying construct, empirical assessments of validity examine whether a given measure behaves the way it should, given the construct or theories related to the construct. This assessment is based on quantitative analysis of observed data (i.e., empirical evidence) using statistical techniques such as correlational analysis and factor analysis. Below we discuss three types of validity based on empirical assessment: criterion validity, construct validity, and factorial validity.

Criterion validity

Empirical assessment of validity examines how well a given measure relates to one or more external criteria, based on empirical observations. This type of validity is called criterion-related validity, or simply criterion validity: the extent to which people’s scores on a measure of a construct are associated with other variables (known as criteria) that exemplify the construct. For example, people’s scores on a new measure of depressive symptoms should be correlated with their scores on an established measure of depressive symptoms or associated with a physician’s diagnosis of depression. If this were the case, then it would be a piece of evidence that the scores on the new measure really represent levels of depression. But if it were found that people diagnosed with major depressive disorder did not score as having high levels of depressive symptoms on the new instrument, then this would cast doubt on the validity of the measure.

A criterion can be any variable that one has reason to think would be an indicator of the construct being measured, and there will usually be many of them. For example, one would expect test anxiety scores to be positively correlated with general anxiety and with blood pressure during an exam. Or imagine that a researcher develops a new measure of physical risk taking. People’s scores on this measure should be correlated with their participation in “extreme” activities such as snowboarding and rock climbing, the number of speeding tickets they have received, and even the number of broken bones they have had over the years.
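
To make this concrete, here is a minimal sketch in Python of how a criterion validity coefficient might be computed. The score arrays are hypothetical illustration data, not values from any real instrument.

    # Minimal sketch: criterion validity as the correlation between scores
    # on a new measure and scores on an established criterion measure.
    # The data below are hypothetical, for illustration only.
    from scipy.stats import pearsonr

    # Ten respondents' scores on a (hypothetical) new depression measure
    new_measure = [4, 12, 7, 15, 9, 3, 18, 11, 6, 14]
    # The same respondents' scores on an established criterion measure
    criterion = [5, 14, 6, 17, 10, 2, 19, 12, 8, 13]

    r, p_value = pearsonr(new_measure, criterion)
    print(f"criterion validity coefficient r = {r:.2f} (p = {p_value:.3f})")
    # A strong positive r is one piece of evidence that the new measure's
    # scores track the construct; a weak or negative r casts doubt on it.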

When the criterion is measured at the same time as the construct, criterion validity is referred to as concurrent validity. Concurrent validity examines how well one measure relates to another concrete criterion that is presumed to occur at the same time. For instance, do students’ scores in a calculus class correlate well with their scores in a linear algebra class? These scores should be related concurrently because both are tests of mathematics.

When the criterion is measured at some point in the future (after the construct has been measured), it is referred to as predictive validity, because scores on the measure have “predicted” a future outcome. Predictive validity is the degree to which a measure successfully predicts a future outcome that it is theoretically expected to predict. For instance, can standardized test scores (e.g., Scholastic Aptitude Test scores) correctly predict academic success in college (e.g., as measured by college grade point average)?

Construct validity

Construct validity is the extent to which the scores on a measure can legitimately be interpreted as reflecting the theoretical construct the measure is intended to represent. The concept was introduced in a classic paper by Cronbach and Meehl (1955), and its role in contemporary validity theory is discussed by Smith (2005):

https://psycnet.apa.org/record/1956-03730-001

https://psycnet.apa.org/doiLanding?doi=10.1037%2F1040-3590.17.4.396

Convergent and discriminant validity

There are two types of construct validity: convergent and discriminant.  Convergent validity refers to the closeness with which a measure relates to (or converges on) the construct that it is purported to measure, and discriminant validity refers to the degree to which a measure does not measure (or discriminates from) other constructs that it is not supposed to measure. Usually, convergent validity and discriminant validity are assessed jointly for a set of related constructs. For instance, if you expect that an organization’s knowledge is related to its performance, how can you assure that your measure of organizational knowledge is indeed measuring organizational knowledge (for convergent validity) and not organizational performance (for discriminant validity)?

Convergent validity can be established by comparing the observed values of one indicator of a construct with those of other indicators of the same construct and demonstrating similarity (i.e., high correlation) among the values of these indicators.

Discriminant validity is established by demonstrating that indicators of one construct are dissimilar from (i.e., have low correlations with) indicators of other constructs. In the above example, if we have a three-item measure of organizational knowledge and three more items for organizational performance, we can compute bivariate correlations between each pair of knowledge and performance items from observed sample data. If this correlation matrix shows high correlations among items within the organizational knowledge construct and within the organizational performance construct, but low correlations between items of the two constructs, then we have simultaneously demonstrated convergent and discriminant validity.
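
Here is a minimal sketch in Python of that correlation-matrix check. The item data are simulated from two hypothetical latent constructs, so the expected convergent and discriminant pattern is built in by construction.

    # Simulate three "knowledge" items and three "performance" items,
    # each equal to its latent construct plus item-specific noise.
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(seed=42)
    n = 200
    knowledge = rng.normal(size=n)    # latent organizational knowledge
    performance = rng.normal(size=n)  # latent organizational performance

    items = pd.DataFrame({
        "know1": knowledge + rng.normal(scale=0.5, size=n),
        "know2": knowledge + rng.normal(scale=0.5, size=n),
        "know3": knowledge + rng.normal(scale=0.5, size=n),
        "perf1": performance + rng.normal(scale=0.5, size=n),
        "perf2": performance + rng.normal(scale=0.5, size=n),
        "perf3": performance + rng.normal(scale=0.5, size=n),
    })

    # High correlations within each construct's items (convergent validity)
    # and low correlations across constructs (discriminant validity).
    print(items.corr().round(2))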

Let’s look at an example of the need for discriminant validity in measurements. One of our authors was a dialysis social worker. She would administer the PHQ-9[1] to her patients in dialysis to assess for depressive symptoms. The PHQ-9 has nine items that are rated on a four-point scale from 0 (“not at all”) to 3 (“nearly every day”). The scores on the items are summed into a composite score, which is then associated with a level of severity of depressive symptoms. In her experience, patients would often select “not at all” for items related to mood (e.g., “feeling down, depressed, or hopeless” or “feeling bad about yourself”) but would select “nearly every day” for somatic or cognitive items (i.e., “trouble falling asleep, staying asleep, or sleeping too much,” “feeling tired or having little energy,” “poor appetite or overeating,” or “trouble concentrating on things…”). These responses would often identify them as having moderate levels of depression. However, as she knew, and as noted by the National Kidney Foundation (n.d.)[2], symptoms of chronic kidney disease include feeling tired, trouble concentrating, loss of appetite, and trouble sleeping. In her practice setting it was difficult for the PHQ-9 to differentiate between depression and the effects of chronic kidney disease.
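
To illustrate the scoring logic in this example, here is a minimal sketch in Python of PHQ-9 composite scoring, using the severity bands reported by Kroenke et al. (2001). The response pattern is hypothetical, mimicking the dialysis patients described above.

    # PHQ-9 scoring sketch: nine items rated 0 ("not at all") to 3
    # ("nearly every day") are summed and mapped to a severity band.
    def phq9_severity(item_scores):
        assert len(item_scores) == 9
        assert all(0 <= s <= 3 for s in item_scores)
        total = sum(item_scores)
        if total <= 4:
            band = "minimal"
        elif total <= 9:
            band = "mild"
        elif total <= 14:
            band = "moderate"
        elif total <= 19:
            band = "moderately severe"
        else:
            band = "severe"
        return total, band

    # Hypothetical dialysis patient: mood items rated 0, but the sleep,
    # fatigue, appetite, and concentration items rated 3. The composite
    # lands in the "moderate" band even though the elevation may reflect
    # kidney disease rather than depression.
    responses = [0, 0, 3, 3, 3, 0, 3, 0, 0]
    print(phq9_severity(responses))  # (12, 'moderate')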

An alternative and more common statistical method used to demonstrate convergent and discriminant validity is exploratory factor analysis. This is a data reduction technique that aggregates a given set of items into a smaller set of factors based on the bivariate correlation structure discussed above, using a statistical technique such as principal components analysis. These factors should ideally correspond to the underlying theoretical constructs that we are trying to measure.
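
Here is a minimal sketch of this factor-analytic check, assuming the third-party factor_analyzer package is installed (pip install factor-analyzer). The six items are simulated as in the earlier example.

    # Exploratory factor analysis sketch: extract two factors from the six
    # simulated items and inspect the rotated loadings.
    import numpy as np
    from factor_analyzer import FactorAnalyzer

    rng = np.random.default_rng(seed=42)
    n = 200
    knowledge = rng.normal(size=n)
    performance = rng.normal(size=n)
    X = np.column_stack(
        [knowledge + rng.normal(scale=0.5, size=n) for _ in range(3)]
        + [performance + rng.normal(scale=0.5, size=n) for _ in range(3)]
    )

    fa = FactorAnalyzer(n_factors=2, rotation="varimax")
    fa.fit(X)
    # Rows are items, columns are factors. Ideally the first three items
    # load highly on one factor and the last three on the other, mirroring
    # the two theoretical constructs.
    print(np.round(fa.loadings_, 2))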

A more sophisticated technique for evaluating convergent and discriminant validity is the multi-trait multi-method (MTMM) approach. This technique requires measuring each construct (trait) using two or more different methods (e.g., a survey and personal observation, or perhaps surveys of two different respondent groups, such as teachers and parents, for evaluating academic quality).
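
A minimal sketch of the MTMM logic follows, using simulated data for two hypothetical traits each measured by two methods. Correlations between different methods measuring the same trait should exceed correlations between different traits measured by the same method.

    # MTMM sketch: two traits (A, B), each measured by a survey and by
    # observation. All data are simulated for illustration.
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(seed=1)
    n = 200
    trait_a = rng.normal(size=n)
    trait_b = rng.normal(size=n)

    mtmm = pd.DataFrame({
        "A_survey":  trait_a + rng.normal(scale=0.6, size=n),
        "A_observe": trait_a + rng.normal(scale=0.6, size=n),
        "B_survey":  trait_b + rng.normal(scale=0.6, size=n),
        "B_observe": trait_b + rng.normal(scale=0.6, size=n),
    })

    corr = mtmm.corr()
    # Same trait, different methods: should be high.
    print(round(corr.loc["A_survey", "A_observe"], 2))
    # Different traits, same method: should be low.
    print(round(corr.loc["A_survey", "B_survey"], 2))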

Factorial validity

Factorial validity is the extent to which the internal structure of a measure, as revealed by factor analysis, matches the theoretical structure of the construct. If a construct is theorized to have two dimensions, then a factor analysis of the measure’s items should recover two factors, with each item loading on the factor corresponding to the dimension it was written to capture. Factorial validity is thus closely related to the convergent and discriminant evidence discussed above and is typically assessed using exploratory or confirmatory factor analysis.

How do reliability and validity relate?

Although we have thus far presented reliability and validity as if they were two distinct and independent concepts, many psychometricians see the two as interrelated. Defining reliability in terms of the true score and error helps clarify the nature of this relationship. For example, it can be shown that one aspect of validity, namely criterion validity, can never be greater than the reliability index (the square root of the reliability coefficient). As such, high reliability is a necessary condition for high validity in measurement (Raykov & Marcoulides, 2011, pp. 192–194).
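
A small simulation illustrates this bound. Observed scores are constructed as true score plus random error, and the criterion is taken to be the true score itself, the most favorable case possible; even then, the criterion validity coefficient cannot exceed the reliability index.

    # Reliability bounds criterion validity: with observed = true + error,
    # corr(observed, criterion) <= sqrt(reliability), with equality when
    # the criterion is the true score itself.
    import numpy as np

    rng = np.random.default_rng(seed=0)
    n = 100_000
    true_score = rng.normal(size=n)
    error = rng.normal(scale=0.75, size=n)
    observed = true_score + error

    reliability = true_score.var() / observed.var()  # true / total variance
    reliability_index = np.sqrt(reliability)
    criterion_validity = np.corrcoef(observed, true_score)[0, 1]

    print(f"reliability coefficient: {reliability:.3f}")       # ~0.64
    print(f"reliability index:       {reliability_index:.3f}")  # ~0.80
    print(f"criterion validity:      {criterion_validity:.3f}") # ~0.80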

Key Takeaways

  • Validity is a matter of how well an instrument measures what it is supposed to measure.
  • There are many types of validity evidence that can support the overall validity of an instrument.

Exercises

TRACK 1 (IF YOU ARE CREATING A RESEARCH PROPOSAL FOR THIS CLASS):

Use the measurement tools you located in the previous exercise. Evaluate the reliability and validity of these tools. Hint: You will need to go into the literature to “research” these tools.

  • Provide a clear statement regarding the reliability and validity of these tools. What strengths did you notice? What were the limitations?
  • Think about your target population. Are there changes that need to be made in order for one of these tools to be appropriate for your population?
  • If you decide to create your own tool, how will you assess its validity and reliability?

TRACK 2 (IF YOU AREN’T CREATING A RESEARCH PROPOSAL FOR THIS CLASS): 

You are interested in studying older adults’ social-emotional well-being. Specifically, you would like to research the impact on levels of older adult loneliness of an intervention that pairs older adults living in assisted living communities with university student volunteers for a weekly conversation.

Use the measurement tool you located in the previous exercise. Evaluate its reliability and validity. Hint: You will need to go into the literature to “research” this tool.

  • Provide a clear statement regarding the reliability and validity of this tool. What strengths did you notice? What were the limitations?
  • Think about your target population. Are there changes that need to be made in order for this tool to be appropriate for your population?

  1. Kroenke, K., Spitzer, R. L., & Williams, J. B. (2001). The PHQ-9: Validity of a brief depression severity measure. Journal of General Internal Medicine, 16(9), 606–613. https://doi.org/10.1046/j.1525-1497.2001.016009606.x
  2. National Kidney Foundation. (n.d.). Chronic kidney disease. [Webpage]. https://www.kidney.org/atoz/content/about-chronic-kidney-disease#what-are-symptoms
