Discussion on this Learning Activity
Range 300 words
Include in-text citations and peer-reviewed references in APA format
Integrate theory, research, and/or professional experience
Include specific examples and/or substantiating evidence from course readings and research
Stay on topic and address the course objectives
Provide a new thought, idea, or perspective.
Demonstrate critical thinking skills and application of Bloom’s Taxonomy [Bloom’s Taxonomy for distinctions of writing that are expected: 1. Knowledge, 2. Comprehension, 3. Application, 4. Analysis, 5. Synthesis, 6. Evaluation]
Cite a workplace application or organizational example of what we are learning.
Add a new twist or interpretation on a reference perspective.
Use critical thinking about an idea/concept or comparison and contrast.
Question or challenge a principle/perspective with sound rationale
Demonstrate proper spelling, grammar, and scholarly tone
W6DQ1: How do research questions frame and guide studies?
W6DQ2: What type of research questions would lead to a qualitative study? To a quantitative one? How would the wording differ? Discuss these questions relative to your field of study.
W6DQ3: How does a researcher determine how many research questions and hypotheses are needed for a particular study? Do all research studies require hypotheses?
W6DQ4: Why might a hypothesis be inappropriate for a qualitative study?
W6DQ5: Is a quantitative study that has a research question but no hypothesis weaker than one with a hypothesis? Why or why not?
W6DQ6: How are research questions different from survey and/or interview questions?
Surveyors and sailors measure distances between objects by taking observations from multiple positions. By observing the object from several different angles or viewpoints, the surveyors and sailors can obtain a good fix on an object’s true location (see Figure 6.1). Social researchers employ a similar process of triangulation. In social research, we build on the principle that we learn more by observing from multiple perspectives than by looking from only a single perspective.
Triangulation
The idea that looking at something from multiple points of view improves accuracy.
Social researchers use several types of triangulation (see Expansion Box 6.1, Example of Four Types of Triangulation). The most common type is triangulation of measure, meaning that we take multiple measures of the same phenomenon. For example, you want to learn about a person’s health. First, you ask the person to complete a questionnaire with multiple-choice answers. Next you conduct an open-ended informal interview. You also ask a live-in partner/caregiver about the person’s health. You interview the individual’s physician and together examine his or her medical records and lab test results. Your confidence that you have an accurate picture grows from the multiple measures you used compared to relying on just one, especially if each measure offers a similar picture. Differences you see among the measures stimulate questions as well.
figure 6.1 Triangulation: Observing from Different Viewpoints
expansion box 6.1 Example of Four Types of Triangulation
The amount of violence in popular American films
Measures: Create three quantitative measures of violence: the frequency (e.g., number of killings, punches), intensity (e.g., volume and length of time screaming, amount of pain shown in face or body movement), and level of explicit, graphic display (e.g., showing a corpse with blood flowing, amputated body parts, close-ups of injury) in films.
Observers: Have five different people independently watch, evaluate, and record the forms and degrees of violence in a set of ten highly popular American films.
Theory: Compare how feminist, functional, and symbolic interaction theories explain the forms, causes, and societal results of the violence in popular films.
Method: Conduct a content analysis of a set of ten popular films, an experiment to measure the responses of experimental subjects to violence in each film, a survey of attitudes toward film violence among the movie-going public, and field observations of audience behavior during and immediately after showings of the films.
Triangulation of observers is a variation on the first type. In many studies, we conduct interviews or are the lone observer of events and behavior. Any limitations of a single observer (e.g., lack of skill in an area, a biased view on an issue, inattention to certain details) become restrictions of the study. Multiple observers bring alternative perspectives, backgrounds, and social characteristics. They thereby reduce the limitations. For example, two people interact with and observe the behavior of ten 5-year-old children at a child care center. One of the observers is a 60-year-old White male pediatrician with 25 years of experience working in a large city hospital. The other is a 31-year-old Hispanic female mother of two children who has 6 years of experience as an elementary school teacher in a small town. Each observer may notice and record different data. Combining what both see and experience will produce a fuller picture than relying on either one alone.
Triangulation of theory requires using multiple theoretical perspectives to plan a study or interpret the data. Each theoretical perspective has assumptions and concepts. They operate as a lens through which to view the social world. For example, a study of work relations in a bank could use conflict theory with its emphasis on power differences and inequality. The study could highlight the pay and working condition inequalities based on positions of authority (e.g., manager versus teller). The study reveals relevant differences in social backgrounds: a middle-aged White male manager with an MBA and a young African American female teller with an associate’s degree. Next, rational choice theory is applied to focus on decision-making and rational strategies individuals use to maximize personal benefits. This perspective highlights how the bank manager varies the time/effort he devotes to various customers depending on their loan or savings account size. It also presents a better picture of how the teller invests her time and energy differently with various supervisors, depending on whether she believes they might help her get a promotion. Each perspective guides the study: It identifies relevant data, provides a set of concepts, and helps to interpret the meaning and significance of the data.
Triangulation of method mixes the qualitative and quantitative research approaches and data. Most researchers develop an expertise in one approach, but the approaches have complementary strengths. A study that combines both tends to be richer and more comprehensive. Mixing them occurs in several ways:1 by using the approaches sequentially, first one and then the other, or by using them in parallel or simultaneously. In the study that opened this chapter, Klinenberg mixed a statistical analysis of quantitative data on deaths with interviews and document analysis (see Example Box 6.1, A Multimethod Study, on page 166).
QUALITATIVE AND QUANTITATIVE ORIENTATIONS TOWARD RESEARCH
In all research, we strive to collect empirical data systematically and to examine data patterns so we can better understand and explain social life, yet differences between research approaches can create miscommunication and misunderstandings. The approaches are mutually intelligible; grasping both and seeing how each complements the other simply takes more time and effort. Next we will look at some sources of the differences.
A first difference originates in the nature of the data itself. Soft data (i.e., words, sentences, photos, symbols) dictate qualitative research strategies and data collection techniques that differ from hard data (in the form of numbers) for which quantitative approaches are used. Such differences may make the tools for a quantitative study inappropriate or irrelevant for a qualitative study and vice versa.
Another difference between qualitative and quantitative research originates in principles about the research process and assumptions about social life. Qualitative and quantitative research principles give rise to different “languages of research” with different emphases. In a quantitative study, we rely more on positivist principles and use a language of variables and hypotheses. Our emphasis is on precisely measuring variables and testing hypotheses. In a qualitative study, we rely more on the principles from interpretive or critical social science. We speak a language of “cases and contexts” and of cultural meaning. Our emphasis is on conducting detailed examinations of specific cases that arise in the natural flow of social life. Interestingly, more female than male social researchers adopt the qualitative approach.2
example box 6.1 A Multimethod Study
Lee and Bean (2007) mixed quantitative and qualitative research approaches in a study of multiracial identity in the United States. They observed that social diversity has increased because of growing immigration since 1970, and for the first time in 2000, the United States census offered the option of classifying oneself as multiracial. The new diversity contrasts to the long history of single-race categories and a dominant White-Black dichotomous racial division. Lee and Bean asked whether multiracial people feel free or highly constrained when they pick a single racial-ethnic or multiracial identity. They also asked whether selecting a multiracial category on the census form is a symbolic action or a reflection of a person’s multiracial daily existence. In the quantitative part of the study, the authors statistically analyzed 2000 census data on the numbers and mixes of people who classified themselves as multiracial. In the qualitative part of the study, they conducted forty-six in-depth semi-structured interviews with multiracial adults from northern and southern California. In the interviews, Lee and Bean asked how and why a person chose to identify herself or himself as she or he did, whether that identity changed over time or by context, and about language use and other practices associated with race and ethnicity. They interviewed adults of various mixtures of Asian, White, Latino, and Black races. Based on the interviews, Lee and Bean found that multiracial Blacks were less likely to call themselves multiracial than people of other mixed race categories. This restriction is consistent with the U.S. historical pattern of the public identifying a person with only some Black heritage as being Black. Persons of mixed White and Asian or Latino or Latino-Asian heritage had more flexibility. 
Some mixed Asian-White or Latino-White people self-identified as White because of public perceptions and a narrow stereotypical definition of proper Asian or Latino appearance. Other White-Asian and White-Latino people said that they are proud of their mixed heritage even if it made little difference in their daily encounters. People did not stick with one label but claimed different racial-ethnic backgrounds in different situations. Pulling together the quantitative and qualitative findings, Lee and Bean suggested that racial-ethnic group boundaries are fading faster for Latinos and Asians than for Blacks. They concluded that a new Black versus non-Black divide is emerging to replace the old White-Black division but that Blacks are still in a disadvantaged position relative to all racial categories.
A third difference between qualitative and quantitative research lies in what we try to accomplish in a study. “The heart of good work”—whether it is quantitative or qualitative—“is a puzzle and an idea” (Abbott, 2003:xi). In all studies, we try to solve a puzzle or answer a question, but depending on the approach, we do this in different ways. In the heat wave study that opened this chapter, Klinenberg (2002) asked why so many people died. But he also asked how they died, and why some categories of people were greatly affected but others were not. In a quantitative study, we usually try to verify or falsify a relationship or hypothesis we already have in mind. We focus on an outcome or effect found across numerous cases. The test of a hypothesis may be more than a simple true or false answer; frequently it includes learning that a hypothesis is true for some cases or under certain conditions but not others. In the heat wave study, Klinenberg asked whether a person’s social class influenced an outcome: being likely to die during the heat wave. Using quantitative data, he tested the relationship between class and death rate by comparing the social class of the roughly 700 who died with thousands who did not.
In many qualitative studies, we often generate new hypotheses and describe details of the causal mechanism or process for a narrow set of cases. Returning to the heat wave study, Klinenberg (2002) tested existing hypotheses about class and death rates. He also developed several new hypotheses as he looked closely into the mechanism that caused some to die but not others. He learned that high death rates occurred in poverty- and crime-ridden neighborhoods. More males than females died, and more African Americans died than Latinos or Whites. By walking around in different low-income neighborhoods and interviewing many people firsthand, he identified the mechanisms of urban isolation that accounted for very different heat wave survival rates among people of the same social class. He examined the social situations of older African American men and discovered the local social environment to be the critical causal mechanism. He also looked at larger forces that created the social situations and local environments in Chicago in the mid-1990s.
A fourth difference between quantitative and qualitative studies is that each has a distinct “logic” and path of conducting research. In a quantitative study, we employ a logic that is systematic and follows a linear research path. In a qualitative study, the logic arises from ongoing practice and we follow a nonlinear research path. In the next section, we examine the logics and paths of research.
Reconstructed Logic and Logic in Practice
How we learn and discuss research tends to follow one of two logics.3 The logics summarize the degree to which our research strategy is explicit, codified, and standardized. In specific studies, we often mix the two logics, but the proportion of each varies widely by study.
A reconstructed logic emphasizes using an explicit research process. Reconstructed logic has been “reconstructed” or restated from the many messy details of doing a real-life study into an idealized, formal set of steps with standard practices and consistent principles, terms, and rules. You can think of it as a “cleansed model” of how best to do a high-quality study. Following this logic is like cooking by exactly following a printed recipe. Thus, the way to conduct a simple random sample (discussed in Chapter 7) is straightforward and follows a clear step-by-step procedure.
Reconstructed logic
A logic of research based on reorganizing, standardizing, and codifying research knowledge and practices into explicit rules, formal procedures, and techniques; it is characteristic of quantitative research.
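The step-by-step character of a simple random sample can even be written out as an explicit, repeatable procedure, which is exactly what reconstructed logic emphasizes. The sketch below is illustrative only: the sampling frame of 1,000 numbered survey respondents and the sample size of 50 are hypothetical, not drawn from any study in the chapter.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw a simple random sample of size n: every member of the
    population has an equal chance of selection, without replacement."""
    rng = random.Random(seed)  # fixed seed makes the draw repeatable
    return rng.sample(list(population), n)

# A hypothetical sampling frame of 1,000 numbered respondents.
frame = range(1, 1001)
sample = simple_random_sample(frame, 50, seed=42)

print(len(sample))       # 50 cases drawn
print(len(set(sample)))  # all 50 are distinct (no replacement)
```

Because every step is codified (define the frame, fix the sample size, draw without replacement), another researcher can follow the same "recipe" and reproduce the procedure exactly.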
The logic in practice is messy and closer to the concrete practice of doing research. Logic in practice includes advice that comes from the practical activities of doing specific real-life studies more than a set of restated, ideal rules. This logic relies heavily on “judgment calls” and “tricks of the trade” that active, experienced researchers share. We learn it best by reading many studies and being an apprentice researcher and from the folk wisdom that passes informally among experienced researchers. It is like cooking without a written recipe—adding a pinch of an ingredient here, stirring until something “looks right,” and adjusting while cooking until we reach a certain smell or taste.
Logic in practice
A logic of research based on an apprenticeship model and the sharing of implicit knowledge about practical concerns and specific experiences; it is characteristic of qualitative research.
You can see the reconstructed logic in the distinct research methods section of a quantitative research report. In contrast, in qualitative research reports you may find the research method not discussed at all (common for historical-comparative research) or mixed with a personal autobiographical account of a particular study (common for field research). The absence of a standard method does not make a qualitative study less valid; however, it often requires more time and a different style of thinking for the newcomer to master.
Linear and Nonlinear Paths
The path is a metaphor for a sequence of things to do: what you finish first or where you have been and what comes next. You can follow a straight, well-worn, and marked path that has clear signposts and is where many others have trod before. Alternatively, you may follow a path that meanders into unknown territory where few others have gone. The path has few signs, so you move forward, veer off to the side, and sometimes backtrack a little before going forward again.
When using the linear research path, we follow a fixed sequence of steps that are like a staircase that leads upward in one direction. By following a linear path, we move in a direct, narrow, and straight way toward a conclusion. This pathway toward task completion is the dominant approach in western European and North American cultures. It is most widely used in quantitative research. By contrast, a nonlinear research path requires us to make successive passes through the steps. We may move forward, backward, and sideways before advancing again. It is more of a spiral than a straight staircase. We move upward but slowly and indirectly. With each cycle or repetition, we may collect new data and gain new insights.
Nonlinear research path
Research that proceeds in a cyclical, iterative, or back-and-forth pattern and is often used in qualitative research.
Linear research path
Research that proceeds in a clear, logical, step-by-step straight line; often used in quantitative research.
People who are accustomed to a direct, linear approach often become impatient with a less direct cyclical path. Although a nonlinear path is not disorganized, undefined chaos, to them the cyclical path appears inefficient and lacking in rigor. People who are used to a nonlinear path often feel stifled and “boxed in” by a linear approach. To them, a linear path feels artificial or rigid. They believe that this approach prevents them from being naturally creative and spontaneous.
Each path has its strengths. The linear path is logical, easy to follow, and efficient. The nonlinear path can be highly effective in creating an authentic feeling for understanding an entire setting, for grasping subtle shades of meaning, for integrating divergent bits of information, and for switching perspectives. Each path has its own discipline and rigor. The linear path borrows from the natural sciences with their emphasis on logic and precision. A nonlinear path borrows devices from the humanities (e.g., metaphor, analogy, theme, motif, and irony) and is suited for tasks such as translating languages, a process in which delicate shades of meaning, subtle connotations, or contextual distinctions can be important (see Figure 6.2 for a graphic representation of each path).
Objectivity and Integrity
We try to be fair, honest, truthful, and unbiased in our research activity, yet we also have opportunities to be biased, dishonest, or unethical in all knowledge production, including social research. The two major research approaches address the issue of reducing difficulties and ensuring honest, truthful studies in different ways.
In qualitative research, we often try to acquire intimate, firsthand knowledge of the research setting. Thus, we do not want to distance ourselves from the people or events we are studying. Acquiring an intimate understanding of a setting does not mean that we can arbitrarily interject personal opinion, be sloppy about data collection, or use evidence selectively to support our prejudices. Rather, we take maximum advantage of personal insight, inner feelings, and life perspective to understand social life. We “walk a fine line” between intimacy and detachment and place personal integrity and honesty at the forefront. Some techniques may help us walk a fine line. One technique is to become highly sensitive to our own views, preconceptions, and prior assumptions and then “bracket” them, or put them aside, so we can see beyond them better. Instead of trying to bury or deny our assumptions, viewpoints, and values, we find that acknowledging them and being open about them is best. We can then recognize how they might influence us. We try to be forthright and candid in our involvement in the research setting, in dealing with the people in the study, and with any relevant issues that arise. We do this in the way that we conduct the study and report on the findings.
Personal openness and integrity by the individual researcher are central to a qualitative study. By contrast, in a quantitative study, we stress neutrality and objectivity. In a quantitative study, we rely on the principle of replication, adhere to standardized procedures, measure with numbers, and analyze the data with statistics.4 In a sense, we try to minimize or eliminate the subjective human factor in a quantitative study. As Porter (1995:7, 74) has argued,
Ideally, expertise should be mechanized and objectified … grounded in specific techniques…. This ideal of objectivity is a political as well as scientific one. Objectivity means rule of law, not of men. It implies the subordination of personal interests and prejudices to public standards.
figure 6.2 Graphic Representation of Linear and Nonlinear Paths
The issue of integrity in quantitative research mirrors the natural science approach. It relies on using an explicit and objective technology, such as making statements in precise neutral terms, using well-documented standard techniques, and making replicable, objective numerical measures.
Quantitative social research shares the hallmarks of natural science validation: explicit, standard procedures; precise numerical measurement; and replication. By contrast, validation in qualitative research relies more on a dependable, credible researcher and her or his personal integrity, self-discipline, and trustworthiness.5 Four other forms of validation in qualitative research somewhat parallel the objective procedures found in quantitative studies.6
The first form indicates that the researcher has carefully evaluated various forms of evidence and checked them for consistency. For example, a field researcher listens to and records a student who says, “Professor Smith threw an eraser at Professor Jones.” The researcher must consider the evidence carefully. This includes considering what other people say about the event. The field researcher also looks for confirming evidence and checks for internal consistency. The researcher asks whether the student has firsthand knowledge of the event, that is, directly witnessed it, and asks whether the student’s feelings or self-interest might lead him or her to lie (e.g., the student dislikes Professor Smith).
A second form of validation arises from the great volume of detailed written notes in most qualitative studies. In addition to verbatim description of the evidence, other documentation includes references to sources, commentaries by the researcher, and quotes, photographs, videos, maps, diagrams, paraphrasing, and counts. The huge volume of information, its great diversity, and its interlocking and mutually reinforcing presentation help to validate its authenticity.
A third kind of validation comes from other observers. Most qualitative researchers work alone, but many others know about the evidence. For example, we study people in a specific setting who are alive today. Other researchers can visit the same setting and talk to the same people. The people we studied can read study details and verify or raise questions about it. Likewise, historical-comparative researchers cite historical documents, archival sources, or visual material. By leaving a careful “audit trail” with precise citations, others can check the references and verify sources.
A fourth type of truthfulness is created by the way we publicly disclose results. In a quantitative study, we adhere to a standard format for writing a research report. We explain in detail how we followed accepted procedures. We describe each step of the study, display the quantitative data in charts, graphs, or tables, and make data files available to others to reanalyze. We offer to answer any questions about the study. In a qualitative study, we cannot publicly display or share the many mountains of detailed notes, recorded interviews, photos, or original source materials in a research report. They might fill an entire room! Instead, we “spin a web” of interlocking details and use tightly cross-referenced material. Through our writing and presentation, we provide sufficient texture and detail to build an “I-was-there” sense within readers. By providing rich specific descriptions supplemented with maps, photos, and verbatim quotations, we convey an intimate knowledge of a setting. We build a sense of shared familiarity in readers. A skilled qualitative researcher can recreate the visual images, voices, smells, sounds, tensions, and entire atmosphere that existed by referring to the mountains of empirical evidence.
Preplanned and Emergent Research Questions
Studies start in many ways, but the usual first step is to select a topic.7 We have no formula for how to do this task. Whether we have experience or are just beginning as researchers, the best guide is to pick something that interests us. There are many ways to select topics (see Expansion Box 6.2, Sources of Topics). We may begin with one topic, but it is too large and is only a starting point. We must narrow it into a focused research question. How we do this varies by whether our study is primarily qualitative or quantitative. Both kinds of studies work well with some topics; we can study poverty by examining official statistics, conducting a survey, doing ethnographic field research, or completing a historical-comparative analysis. Some topics are best suited for a qualitative study (e.g., how do people reshape their self-identity through participating in the goth youth subculture) and others for a quantitative study (e.g., how has public opinion on the death penalty shifted over the past 50 years, and is one’s opinion on this issue influenced by views on related issues or by the amount of exposure the news media gives to certain topics).
Most qualitative studies start with a vague or loosely defined topic. The specific topic emerges slowly during the study, and it may change direction based on new evidence. This was the case for Venkatesh’s study (2008) that opened Chapter 5. He began with an interest in studying poverty in an inner-city housing project but shifted to studying a drug-selling gang. Focusing on a specific research question continues while we gather data. Venkatesh increasingly focused his topic of gang activity into sharper questions: How and why did gangs in a low-income housing project sustain an underground economy and provide housing project residents with protection and aid services?
Flexibility in qualitative research encourages us to continuously focus throughout a study. An emergent research question may become clear only during the research process. We can focus and refine the research question after we gather some data and begin a preliminary analysis. In many qualitative studies, the most important issues and most interesting questions become clear only after we become immersed in the data. We need to remain open to unanticipated ideas, data, and issues. We should periodically reevaluate our focus early in a study and be ready to change direction and follow new lines of evidence. At the same time, we must exercise self-restraint and discipline. If we constantly change the focus of our research without end, we will never complete a study. As with most things, a balance is required.
expansion box 6.2 Sources of Topics
1. Personal experience. You can choose a topic based on something that happens to you or those you know. For example, while you work a summer job at a factory, the local union calls a strike. You do not have strong feelings either way, but you are forced to choose sides. You notice that tensions rise. Both management and labor become hostile toward each other. This experience suggests unions or organized labor as a topic.
2. Curiosity based on something in the media. Sometimes you read a newspaper or magazine article or see a television program that leaves you with questions. What you read raises questions or suggests replicating what others’ research found. For example, you read a Newsweek article on people who are homeless, but you do not really know much about who they are, why they are homeless, whether this has always been a problem, and so forth. This suggests homeless people as a topic.
3. The state of knowledge in a field. Basic research is driven by new research findings and theories that push at the frontiers of knowledge. As theoretical explanations are elaborated and expanded, certain issues or questions need to be answered for the field to move forward. As such issues are identified and studied, knowledge advances. For example, you read about attitudes toward capital punishment and realize that most research points to an underlying belief in the innate wickedness of criminals among capital punishment supporters. You notice that no one has yet examined whether people who belong to certain religious groups that teach such a belief in wickedness support capital punishment, nor has anyone mapped the geographic location of these religious groups. Your knowledge of the field suggests a topic for a research project: beliefs about capital punishment and religion in different regions.
4. Solving a problem. Applied research topics often begin with a problem that needs a solution. For example, as part of your job as a dorm counselor, you want to help college freshmen establish friendships with each other. Your problem suggests friendship formation among new college students as a topic.
5. Social premiums. This is a term suggested by Singleton and colleagues (1988:68). It means that some topics are “hot” or offer an opportunity. For example, you read that a lot of money is available to conduct research on nursing homes, but few people are interested in doing so. Your need of a job suggests nursing homes as a topic.
6. Personal values. Some people are highly committed to a set of religious, political, or social values. For example, you are strongly committed to racial equality and become morally outraged whenever you hear about racial discrimination. Your strong personal belief suggests racial discrimination as a topic.
7. Everyday life. Potential topics can be found throughout everyday life in old sayings, novels, songs, statistics, and what others say (especially those who disagree with you). For example, you hear that the home court advantage is very important in basketball. This statement suggests home court advantage as a topic for research.
Typical qualitative research questions include these: How did a certain condition or social situation originate? How do people, events, and conditions sustain a situation over time? By what processes does the situation change, develop, or end? Another type of question seeks to confirm existing beliefs or assumptions (e.g., do Southern and Northern Whites act differently around people of other races, as those in McDermott’s study of working-class neighborhoods in Atlanta and Boston did). A last type of research question tries to discover new ideas.8
In a quantitative study, we narrow a topic into a focused question as a discrete planning step before we finalize the study design. Focusing the question is a step in the process of developing a testable hypothesis (to be discussed later). It guides the study design before you collect any data.9
In a qualitative study, we can use the data to help narrow the focus. In a quantitative study, we must focus without the benefit of data and use other techniques. After picking a topic, we ask ourselves: What is it about the topic that is of greatest interest? For a topic about which we know little, we must first acquire background knowledge by reading studies about the topic. Reading the research literature can stimulate many ideas for how to focus a research question.
In most quantitative studies, research questions refer to relationships among a small number of variables. This means that we should list variables as we try to focus the topic into a research question (see Expansion Box 6.3, Techniques for Narrowing a Topic into a Research Question). For example, the question “What causes divorce?” is not a good research question. A better one is “Is age at marriage associated with divorce?” The second question has two variables: age at marriage and whether or not a divorce occurred (also see Example Box 6.2, Examples of Bad and Good Research Questions).
Personal experience can suggest topics. Perhaps personal experience suggests people released from prison as a topic, as it did for Pager (2007). We can read about former inmates and their reentry and about probation in dozens of books and hundreds of articles. A focused research question might be whether it is more difficult for someone with a nonviolent criminal record to get a job offer than for someone without a criminal record. This question is more specific in terms of type of criminal record and the specific outcome for a former prisoner. It focuses on two variables: whether a person has a criminal record and whether the person gets a job offer. A common type of research question asks which factor among several had the most significant impact on an outcome. We might ask, as Pager did: How do racial category (Black versus White) and having a criminal record affect the chances of getting a job? Did race make a difference, did being a former prisoner make a difference, and did the two factors operate separately, cancel each other out, or intensify one another in their impact on getting a job offer?
expansion box 6.3 Techniques for Narrowing a Topic into a Research Question
• 1. Examine the literature. Published articles are excellent sources of ideas for research questions. They are usually at an appropriate level of specificity and suggest research questions that focus on the following:
o a. Replicating a previous research project exactly or with slight variations.
o b. Exploring unexpected findings discovered in previous research.
o c. Following suggestions an author gives for future research at the end of an article.
o d. Extending an existing explanation or theory to a new topic or setting.
o e. Challenging the findings or attempting to refute a relationship.
o f. Specifying the intervening process and considering any linking relations.
• 2. Talk over ideas with others.
o a. Ask people who are knowledgeable about the topic for questions about it that they have thought of.
o b. Seek out those who hold opinions that differ from yours on the topic and discuss possible research questions with them.
• 3. Apply to a specific context.
o a. Focus the topic onto a specific historical period or time period.
o b. Narrow the topic to a specific society or geographic unit.
o c. Consider which subgroups or categories of people/units are involved and whether there are differences among them.
• 4. Define the aim or desired outcome of the study.
o a. Will the research question be for an exploratory, explanatory, or descriptive study?
o b. Will the study involve applied or basic research?
example box 6.2 Examples of Bad and Good Research Questions
BAD RESEARCH QUESTIONS
Not Empirically Testable, Nonscientific Questions
• ¦ Should abortion be legal?
• ¦ Is it right to have capital punishment?
General Topics, Not Research Questions
• ¦ Treatment of alcohol and drug abuse
• ¦ Sexuality and aging
Set of Variables, Not Questions
• ¦ Capital punishment and racial discrimination
• ¦ Urban decay and gangs
Too Vague, Ambiguous
• ¦ Do police affect delinquency?
• ¦ What can be done to prevent child abuse?
Need to Be Still More Specific
• ¦ Has the incidence of child abuse risen?
• ¦ How does poverty affect children?
• ¦ What problems do children who grow up in poverty experience that others do not?
GOOD RESEARCH QUESTIONS
• ¦ Has the incidence of new forms of child abuse appeared in Wisconsin in the past 10 years?
• ¦ Is child abuse, violent or sexual, more common in families that have experienced a divorce than in intact, never-divorced families?
• ¦ Are the children raised in impoverished households more likely to have medical, learning, and social-emotional adjustment difficulties than children who are not living in poverty?
• ¦ Does the emotional instability created by experiencing a divorce increase the chances that divorced parents will physically abuse their children?
• ¦ Is a lack of sufficient funds for preventive treatment a major cause of more serious medical problems among children raised in families in poverty?
We also want to specify the universe to which we generalize answers to a research question. All research questions and studies apply to some category of people, organizations, or other units. The universe is the set of all units that the research question covers or to which we can generalize. For example, in Pager’s (2007) study, his units were individuals, specifically young White and Black men. The universe to which we might generalize his findings includes all U.S. males in their twenties of these two racial categories.
Universe: The entire category or class of units that is covered or explained by a relationship or hypothesis.
As we refine a topic into a research question and design a study, we also need to consider practical limitations. Designing the perfect research project is an interesting academic exercise, but if we expect to carry out a study, practical limitations must shape its design. Major limitations include time, costs, access to resources, approval from authorities, ethical concerns, and expertise. If we have 10 hours a week for 5 weeks to conduct a research project but answering the research question will require 2 years, we must narrow the question to fit the practical limitations.
Time is always a consideration. However, it is very difficult to estimate the time required for a study. A specific research question, the research techniques used, the complexity of the study, and the amount and types of data we plan to collect all affect the amount of time required. Experienced researchers are the best source for getting good estimates of time requirements.
Cost is another limitation, and we cannot answer some research questions because of the great expense involved. For example, our research question asks whether sports fans develop strong positive feelings toward team mascots if the team has a winning season but negative feelings if it has a losing season. To examine the question for all sports teams across a nation across a decade would require a great investment of time and money. The focus could be narrowed to one sport (football), to sports played in college, and to student fans at just four colleges across three seasons. As with time, experienced researchers can help provide estimates of the cost to conduct a study.
table 6.1 Quantitative Research versus Qualitative Research
Quantitative Research | Qualitative Research
Researchers test hypotheses that are stated at the beginning. | Researchers capture and discover meaning once they become immersed in the data.
Concepts are in the form of distinct variables. | Concepts are in the form of themes, motifs, generalizations, and taxonomies.
Measures are systematically created before data collection and are standardized. | Measures are created in an ad hoc manner and are often specific to the individual setting or researcher.
Data are in the form of numbers from precise measurement. | Data are in the form of words and images from documents, observations, and transcripts.
Theory is largely causal and is deductive. | Theory can be causal or noncausal and is often inductive.
Procedures are standard, and replication is frequent. | Research procedures are particular, and replication is very rare.
Analysis proceeds by using statistics, tables, or charts and discussing how what they show relates to hypotheses. | Analysis proceeds by extracting themes or generalizations from evidence and organizing data to present a coherent, consistent picture.
Access to resources is a common limitation. Resources include expertise, special equipment, and information. For example, a research question about burglary rates and family income in many different nations is nearly impossible to answer. Data on burglary and income are not collected or available for many countries. Other questions require the approval of authorities (e.g., to see medical records) or involve violating basic ethical principles (e.g., lying to a person and endangering her or him). Our expertise or background as researchers is also a limitation. Answering some research questions involves the use of data collection techniques, statistical methods, knowledge of a foreign language, or skills we may not have. Unless we acquire the necessary training or can pay for another person’s services, the research question may not be practical.
In sum, qualitative and quantitative studies share a great deal, but they differ on several design issues: logic, research path, mode of verification, and way to arrive at a research question (see Table 6.1). In addition, the research approaches speak different “languages” and emphasize distinct study design features, issues that we consider in the next section.
QUALITATIVE DESIGN ISSUES
The Language of Cases and Contexts
Most qualitative studies involve a language of cases and contexts, employ bricolage (discussed later in this chapter), examine social processes and cases in their social context, and study interpretations or meanings in specific socio-cultural settings. We examine social life from multiple points of view and explain how people construct identities. Only rarely do we use variables, test hypotheses, or create precise measures in the form of numbers.
Most qualitative studies build on the assumption that certain areas of social life are intrinsically qualitative. For this reason, qualitative data are not imprecise or deficient but are very meaningful. Instead of trying to convert fluid, active social life into variables or numbers, we borrow ideas and viewpoints from the people we study and situate them in a fluid natural setting. Instead of variables, we examine motifs, themes, distinctions, and perspectives. Most often, our approach is inductive and relies on a form of grounded theory (discussed in Chapter 3).
Qualitative data may appear to be soft, intangible, and elusive. This does not mean that we cannot capture them. We gather qualitative data by documenting real events, recording what actual people say (with words, gestures, and tone), observing specific behaviors, examining written documents, and studying visual images. These are specific, concrete aspects of the social world. As we closely scrutinize photos or videotapes of people or social events, we are looking at “hard” physical evidence.10 The evidence is just as “hard” and physical as the numeric measures of attitudes, social pressure, intelligence, and the like found in a quantitative study.
In qualitative research, we may develop theory during the data collection process. This largely inductive method means that we build theory from data or ground the theory in the data. Grounded theory adds flexibility and allows the data and theory to interact. This process also helps us remain open to the unexpected. We can change the direction of the study and even abandon the original research question in the middle of a project if we discover something new and exciting.11
We build theory by making comparisons. For example, we observe an event (e.g., a police officer confronting a speeding motorist who has stopped). We may ponder questions and look for similarities and differences. When watching a police officer, we ask: Does the police officer always radio in the car’s license number before proceeding? After radioing the car’s location, does the officer ask the motorist to get out of the car or sometimes casually walk up to the car and talk to the seated driver? When we intersperse data collection and theorizing, new theoretical questions may arise that suggest future observations. In this way, we tailor new data to answer theoretical questions that arose only from thinking about previous data.
In grounded theory, we build from specific observations to broader concepts that organize observational data and then continue to build principles or themes that connect the concepts. Compared to other ways of theorizing, grounded theory tends to be less abstract and closer to concrete observations or specific events. Building inductively from the data to theory creates strong data-theory linkages. However, this can be a weakness as well. It may make connecting concepts and principles across many diverse settings difficult, and it may slow the development of concepts that build toward creating general, abstract knowledge. To counteract this weakness, we become familiar with the concepts and theories developed in other studies to apply shared concepts when appropriate and to note any similarities and differences. In this way, we can establish cross-study interconnections and move toward generalized knowledge.
The Context Is Critical
In qualitative research, we usually emphasize the social context because the meaning of a social action, event, or statement greatly depends on the context in which it appears. If we strip social context from an event, social action, or conversation, it is easy to distort its meaning and alter its social significance.
Social context includes time context (when something occurs), spatial context (where something occurs), emotional context (the feelings regarding how something occurs), and socio-cultural context (the social situation and cultural milieu in which something occurs). For example, a social activity (a card game, sexual act, or disagreement) occurs late at night on the street in a low-income area of a large city, a setting for drug use, fear and anger, violent crime, and prostitution within a cultural milieu of extreme racial-economic inequality. The same activity occurs midday in the backyard of a large house in an affluent suburban neighborhood in a social setting of relaxation and leisure, surrounded by trust and emotional closeness, and within a cultural milieu of established affluence and privilege. The context will significantly color the activity’s meaning. With different contextual meanings, the same activity or behavior may have different consequences.
In a quantitative study, we rarely treat context as important. We often strip it away as being “messy” or just “noise” and instead concentrate on precise counts or numerical measures. Thus, what a qualitative study might treat as essential may be seen as irrelevant noise in a quantitative study. For example, if a quantitative study counts the number of votes across time or cultures, a qualitative researcher might consider what voting means in the context. He or she may treat the same behavior (e.g., voting for a presidential candidate) differently depending on the social context in which it occurs (see Example Box 6.3, Example of Importance of Context for Meaning).
Context goes beyond social events, behaviors, and statements to include physical objects. One handgun could be an art object, part of a recreational hobby, a key element in committing a violent crime, evidence of an irresponsible parent, a suicide facilitator, or a means of social peace and community protection, each depending on the context. Without including the surrounding context, we cannot assign meaning to an object.
example box 6.3 Example of the Importance of Context for Meaning
“Voting in a national election” has different meanings in different contexts:
• 1. A one-party dictatorship with unopposed candidates, where people are required by law to vote. The names of nonvoters are recorded by the police. Nonvoters are suspected of being antigovernment subversives. They face fines and possible job loss for not voting.
• 2. A country in the midst of violent conflict between rebels and those in power. Voting is dangerous because the armed soldiers on either side may shoot voters they suspect of opposing their side. The outcome of the vote will give power to one or the other group and dramatically restructure the society. Anyone over the age of 16 can vote.
• 3. A context in which people choose between a dozen political parties of roughly equal power that represent very different values and policies. Each party has a sizable organization with its own newspapers, social clubs, and neighborhood organizers. Election days are national holidays when no one has to work. A person votes by showing up with an identification card at any of many local voting locations. Voting itself is by secret ballot, and everyone over age 18 can vote.
• 4. A context in which voting is conducted in public by White males over age 21 who have regular jobs. Family, friends, and neighbors see how one another vote. Political parties do not offer distinct policies; instead, they are tied to ethnic or religious groups and are part of a person’s ethnic-religious identity. Ethnic and religious group identities are very strong. They affect where one lives, where one works, whom one marries, and the like. Voting follows massive parades and week-long community events organized by ethnic and religious groups.
• 5. A context in which one political party is very powerful and is challenged by one or two very small, weak alternatives. The one party has held power for the past 60 years through corruption, bribery, and intimidation. It has the support of leaders throughout society (in religious organizations, educational institutions, businesses, unions, and the mass media). The jobs of anyone working in any government job (e.g., every police officer, post office clerk, schoolteacher, and garbage collector) depend on the political party staying in power.
• 6. A context in which the choice is between two parties with little difference between them. People select candidates primarily on the basis of television advertising. Candidates pay for advertising with donations by wealthy people or powerful organizations. Voting is a vague civic obligation that few people take seriously. Elections are held on a workday. In order to vote, a person must meet many requirements and register to vote several weeks in advance. Recent immigrants and anyone arrested for a crime are prohibited from voting.
A bricoleur is someone who has learned to be adept in diverse areas, can draw on a variety of sources, and makes do with whatever is at hand.12 The bricolage technique involves working with one’s hands and combining odds and ends in a practical, skilled, and inventive way to accomplish a task. A successful bricoleur possesses a deep knowledge of materials, a set of esoteric skills, and a capacity to combine or create flexibly. The typical bricoleur is often a highly inventive and skilled craftsperson, repairperson, or jack-of-all-trades.
Bricolage: Improvisation by drawing on diverse materials that are lying about and using them in creative ways to accomplish a pragmatic task.
A qualitative study draws on a variety of skills, materials, and approaches as needed. This usually happens when we are unable to anticipate the need for them. The process of mixing diverse source materials, applying disparate approaches, and assembling bits and pieces into a whole is analogous to the bricolage of a skilled craftsperson who is able to create or repair many things by using whatever is available at the time.
The Case and Process
We can divide all empirical social research into two groups: case study (with one or a few cases) or cross-case (comprising many cases).13 Most qualitative studies use a “case-oriented approach [that] places cases, not variables, center stage” (Ragin, 1992a:5). Thus, we examine many aspects of a few cases. The intensive, in-depth study of a handful of cases replaces the extensive, surface-level study of numerous cases that is typical in quantitative research. Often a case-oriented analysis emphasizes contingencies in “messy” natural settings (i.e., the co-occurrence of many specific factors and events in one place and at one time). Rather than obtaining precise measures of a huge number of cases, as is typical of quantitative research, we acquire in-depth knowledge of and astute insight into a small number of cases.
The study of cases tends to produce complex explanations or interpretations in the form of an unfolding plot or a narrative story about particular people or specific events. This makes the passage of time integral to the explanation. Often the emphasis becomes the sequence of events: what occurred first, second, third, and so on. This focus on process helps to reveal how an issue evolves, a conflict emerges, or a social relationship develops.
To interpret means to assign significance or coherent meaning. In quantitative research, meaning comes from using numbers (e.g., percentages or statistical coefficients), and we explain how the numerical data relate to the hypotheses. Qualitative studies rarely include tables with numbers. The only visual presentations of data may be maps, photographs, or diagrams showing how ideas are related. We instead weave the data into discussions of the ideas’ significance. The data are in the form of words, including quotes or descriptions of particular events. Any numerical information is supplementary to the textual evidence.
Qualitative studies give data meaning, translate them, or make them understandable. We begin with the point of view of the people we study and then find out how they see the world and define situations. We learn what events, behaviors, and activities mean for them. To begin qualitative interpretation, we first must learn the meanings of things for the people we are studying.14
People who create social activities and behavior have personal reasons or motives for what they do. This is first-order interpretation. As we discover and reconstruct this first-order interpretation, it becomes a second-order interpretation because we come from the outside to discover what has occurred. In a second-order interpretation, we elicit an underlying coherence or sense of meaning in the data. Meaning develops only in relation to a large set of other meanings, not in a vacuum. In a second-order interpretation, we place the human action being studied into a “stream of behavior” or events to which it is related: its context.
First-order interpretation: Interpretations from the point of view of the people being studied.
Second-order interpretation: Qualitative interpretations from the point of view of the researcher who conducted a study.
If we were to adopt a very strict interpretive approach, we might stop at a second-order interpretation, that is, once we understand the significance of the action for the people we study. Most qualitative researchers go further. They want to generalize or link the second-order interpretation to a theory or general knowledge. They move to a broad level of interpretation, or third-order interpretation, by which they assign general theoretical significance to the data.
Third-order interpretation: Qualitative interpretations made by the readers of a research report.
Because interpreting social meaning in context is often a major purpose and outcome of qualitative studies, keep in mind that the three steps or orders of interpretation help provide a way to organize the research process.
QUANTITATIVE DESIGN ISSUES
The Language of Variables and Hypotheses
Variation and Variables.
Simply defined, a variable is a concept that varies. In quantitative research, we use a language of variables and relationships among variables.
Variable: A concept or its empirical measure that can take on multiple values.
In Chapter 3, we discussed two types of concepts: those that refer to a fixed phenomenon (e.g., the ideal type of bureaucracy) and those that vary in quantity, intensity, or amount (e.g., amount of education). Variables are this second type of concept and measures of the concepts.
A variable must have two or more values. Once we become aware of them, we see variables everywhere. For example, gender is a variable; it can take one of two values: male or female. Marital status is a variable; it can take the value of never married, married, divorced, or widowed. Type of crime committed is a variable; it can take values of robbery, burglary, theft, murder, and so forth. Family income is a variable; it can take values from zero to billions of dollars. A person’s attitude toward abortion is a variable; it can range from strongly favoring legal abortion as a woman’s basic right to strongly believing in the sanctity of fetal life.
A variable’s values or categories are its attributes. It is easy to confuse variables with attributes. The confusion arises because one variable’s attribute can itself be a separate variable in its own right with only a slight change in definition. This rests on a distinction between concepts that vary and the conditions within concepts that vary. For example, “male” is not a variable; it describes a category of gender. Male is an attribute of the variable gender, yet a related idea, degree of masculinity, is a variable. It describes the intensity or strength of attachment to a set of beliefs, orientations, and behaviors that are associated with the concept of masculine within a culture. Likewise, “married” is not a variable; it is an attribute of the variable marital status. Related ideas such as number of years married or depth of commitment to a marriage are variables. In a third example, “robbery” is not a variable but an attribute of the variable type of crime. Number of robberies, robbery rate, amount taken during a robbery, and type of robbery are all variables because they vary or take on a range of values.
Attributes: The categories or levels of a variable.
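The variable/attribute distinction can be made concrete with a small data structure. This is only an illustrative sketch (the dictionary and function names are invented for this example, not part of the text): each variable maps to the list of attributes it can take, and a concept qualifies as a variable only if it has two or more values.

```python
# Hypothetical illustration of the variable/attribute distinction:
# a variable is a concept that varies; its attributes are the values it can take.
VARIABLES = {
    "gender": ["male", "female"],  # "male" alone is an attribute, not a variable
    "marital_status": ["never married", "married", "divorced", "widowed"],
    "type_of_crime": ["robbery", "burglary", "theft", "murder"],
}

def is_variable(concept: str) -> bool:
    """A concept qualifies as a variable only if it has two or more attributes."""
    return len(VARIABLES.get(concept, [])) >= 2

def is_attribute(value: str, variable: str) -> bool:
    """An attribute is one category (level) of a given variable."""
    return value in VARIABLES.get(variable, [])
```

Under this sketch, `is_variable("gender")` is true, while `is_variable("male")` is false: "male" appears only as a category inside another variable, which mirrors the textbook's point that attributes are not themselves variables without a change in definition.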
In quantitative research, we redefine all concepts into the language of variables. As the examples of variables and attributes illustrate, the redefinition often requires only a slight change in definition. As noted in Chapter 3, concepts are the building blocks of theory; they organize thinking about the social world. Clear concepts with careful definitions are essential in theory.
Types of Variables.
As we focus on causal relations among variables, we usually begin with an effect and then search for its cause(s). We can classify variables depending on their location in a causal relationship or chain of causality. The cause variable, or the force or condition that acts on something else, is the independent variable. The variable that is the effect, result, or outcome of another variable is the dependent variable. The independent variable is “independent of” prior causes that have acted on it, whereas the dependent variable depends on the cause.
Independent variable: A type of variable that produces an effect or results on a dependent variable in a causal hypothesis.
Dependent variable: The effect or result variable that is caused by an independent variable in a causal hypothesis.
It is not always easy to determine whether a variable is independent or dependent. Two questions can help to identify the independent variable. First, does it come before other variables in time? Independent variables must come before any other type. Second, if two variables occur at the same time, does one variable have an impact on another variable? Independent variables affect or have an impact on other variables. We often phrase research topics and questions in terms of the dependent variable because dependent variables are the phenomena we want to explain. For example, an examination of the reasons for an increase in the crime rate in Dallas, Texas, would have the crime rate in Dallas as its dependent variable.
A simple causal relationship requires only an independent and a dependent variable. A third variable type, the intervening variable, appears in more complex causal relations. Coming between the independent and dependent variables, this variable helps to show the link or mechanism between them. As noted in Chapter 3, advances in knowledge depend not only on documenting cause-and-effect relationships but also on specifying the mechanisms that account for the causal relation. In a sense, the intervening variable acts as a dependent variable with respect to the independent variable and acts as an independent variable toward the dependent variable.
Intervening variable: A variable that comes logically or temporally after the independent variable and before the dependent variable and through which their causal relation operates.
For example, French sociologist Émile Durkheim developed a theory of suicide that specified a causal relationship between marital status and suicide rate. Durkheim found evidence that married people are less likely to commit suicide than single people. He believed that married people have more social integration (i.e., feelings of belonging to a group or family). He thought that a major cause of one type of suicide was that people lacked a sense of belonging to a group. Thus, his theory can be restated as a three-variable relationship: marital status (independent variable) causes the degree of social integration (intervening variable), which affects suicide (dependent variable). Specifying the chain of causality makes the linkages in a theory clearer and helps a researcher test complex explanations.15
Simple theories have one dependent and one independent variable whereas complex ones can contain dozens of variables with multiple independent, intervening, and dependent variables. For example, a theory of criminal behavior (dependent variable) identifies four independent variables: an individual’s economic hardship, opportunities to commit crime easily, membership in a deviant subgroup that does not disapprove of crime, and lack of punishment for criminal acts. A multicause explanation usually specifies which independent variable has the most significant causal effect.
A complex theoretical explanation has a string of multiple intervening variables. For example, family disruption causes lower self-esteem among children, which causes depression, which causes poor grades in school, which causes reduced prospects for a good job, which causes a lower adult income. The chain of variables is family disruption (independent), childhood self-esteem (intervening), depression (intervening), grades in school (intervening), job prospects (intervening), adult income (dependent).
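A causal chain like the one above can be pictured as an ordered list in which position determines a variable's role. The following is a minimal illustrative sketch only (the list and function names are invented here, not taken from the text): the first element is the independent variable, the last is the dependent variable, and everything in between is intervening.

```python
# Hypothetical representation of the family-disruption causal chain.
# Order encodes the chain: independent -> intervening... -> dependent.
chain = [
    "family disruption",      # independent variable
    "childhood self-esteem",  # intervening
    "depression",             # intervening
    "grades in school",       # intervening
    "job prospects",          # intervening
    "adult income",           # dependent variable
]

def role(variable: str, chain: list) -> str:
    """Classify a variable by its position in a causal chain."""
    if variable == chain[0]:
        return "independent"
    if variable == chain[-1]:
        return "dependent"
    if variable in chain:
        return "intervening"
    raise ValueError(f"{variable!r} is not in this causal chain")
```

The sketch also captures the text's observation that an intervening variable is dependent with respect to what precedes it and independent with respect to what follows: any middle element sits on both sides of a link in the list.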
Two theories on the same topic can differ as to the number of independent variables. In addition, theories might agree about the independent and dependent variables but differ on the intervening variable or causal mechanism. For example, two theories say that family disruption causes lower adult income, each for different reasons. One theory holds that disruption encourages children to join deviant peer groups, which are not socialized to the norms of work and thrift. Another theory emphasizes the impact of the disruption on childhood depression and poor academic performance. In the second theory, depression and limited school learning directly cause poor job performance.
In one study, we usually test only one or a few parts of a causal chain. For example, a research project examining six variables may take the six from a large, complex theory with two dozen variables. Explicit links to a larger theory strengthen and clarify a research project.
Causal Theory and Hypotheses
The Hypothesis and Causality.
A causal hypothesis is a proposition to be tested or a tentative statement of a relationship between two variables. Hypotheses are guesses about how the social world works; they are stated in a value-neutral form. Kerlinger (1979:35) noted:
• Hypotheses are much more important in scientific research than they would appear to be just by knowing what they are and how they are constructed. They have a deep and highly significant purpose of taking man out of himself…. Hypotheses are powerful tools for the advancement of knowledge, because, although formulated by man, they can be tested and shown to be correct or incorrect apart from man’s values and beliefs.
Causal hypothesis: A statement of a causal explanation or proposition that has at least one independent and one dependent variable and has yet to be empirically tested.
A causal hypothesis has five characteristics (see Expansion Box 6.4, Five Characteristics of Causal Hypotheses). For example, we can restate the hypothesis that attending religious services reduces the probability of divorce as a prediction: Couples who attend religious services frequently have a lower divorce rate than do couples who rarely attend religious services. We can test the prediction against the empirical evidence. We should logically connect the hypothesis to a research question and to a broader theory; after all, we test hypotheses to answer the research question or to find empirical support for a theory. Statements that are logically or necessarily true, or questions that are impossible to answer through empirical observation (e.g., What is the “good life”? Is there a God?) are not scientific hypotheses.
expansion box 6.4 Five Characteristics of Causal Hypotheses
• 1. They have at least two variables.
• 2. They express a causal or cause–effect relationship between the variables.
• 3. They can be expressed as a prediction or an expected future outcome.
• 4. They are logically linked to a research question and a theory.
• 5. They are falsifiable; that is, they are capable of being tested against empirical evidence and shown to be true or false.
We can state causal hypotheses in several ways. Sometimes we use the word cause, but it is not necessary. For example, we can state a causal hypothesis between religious attendance and a reduced likelihood of divorce in ten different ways (see Example Box 6.4, Ways to State Causal Relations).
In scientific research, we avoid using the term proved when talking about testing hypotheses. Journalism, courts of law, and advertisements use the word proof, but a research scientist almost never uses it. A jury says that the evidence “proves” someone guilty, or a television commercial will state, “Studies prove that our aspirin cures headaches the fastest.” This is not the language of scientific research. In science, we recognize that knowledge is tentative and that creating knowledge is an ongoing process that avoids premature closure. The word proof implies finality, absolute certainty, or something that does not need further investigation. It is too strong a term for the cautious world of science. We might say that the evidence supports or confirms, but does not prove, the hypothesis. Even after hundreds of studies show the same results, such as the link between cigarette smoking and lung cancer, scientists do not say that we have absolute proof. Instead we can say that overwhelming evidence, or all studies to date, support or are consistent with the hypothesis. Scientists never want to close off the possibility of discovering new evidence that might contradict past findings. They do not want to cut off future inquiry or stop exploring intervening mechanisms. History contains many examples of relationships that people once thought to be proved but were later found to be in error. We can use proof when referring to logical or mathematical relations, as in a mathematical proof, but not for empirical research.
example box 6.4 Ways to State Causal Relations
• Religious attendance causes reduced divorce.
• Religious attendance leads to reduced divorce.
• Religious attendance is related to reduced divorce.
• Religious attendance influences the reduction of divorce.
• Religious attendance is associated with reduced divorce.
• Religious attendance produces reduced divorce.
• Religious attendance results in reduced divorce.
• If people attend religious services, then the likelihood of divorce will be reduced.
• The higher the religious attendance, the lower the likelihood of divorce.
• Religious attendance reduces the likelihood of divorce.
Testing and Refining a Hypothesis.
Knowledge rarely advances on the basis of one test of a single hypothesis. In fact, researchers can get a distorted picture of the research process by focusing on a single study that tests one hypothesis. Knowledge develops over time as many researchers across the scientific community test many hypotheses. It slowly grows from sifting and winnowing through many hypotheses. Each hypothesis represents an explanation of a dependent variable. If the evidence fails to support some hypotheses, they are gradually eliminated from consideration. Those that receive support remain in contention. Theorists and researchers constantly create new hypotheses to challenge those that have received support (see Figure 6.3 on page 182). From Figure 6.3 we see that in 2010, three hypotheses are in contention, but from 1970 to 2010, eleven hypotheses were considered, and over time, eight of them were rejected in one or more tests.
Scientists are a skeptical group. Supporting a hypothesis in one study is not sufficient for them to accept it. The principle of replication says that a hypothesis needs several tests with consistent and repeated support before it can gain broad acceptance. Another way to strengthen confidence in a hypothesis is to test related causal linkages in the theory from which it comes.
As scientists, we accept the strongest contender with the greatest empirical support as the best explanation at the time. The more alternatives we test a hypothesis against, the more confidence we have in it. Some tests are called crucial experiments or crucial studies. This is a type of study whereby
• two or more alternative explanations for some phenomenon are available, each being compatible with the empirically given data; the crucial experiment is designed to yield results that can be accounted for by only one of the alternatives, which is thereby shown to be “the correct explanation.” (Kaplan, 1964:151–152)
Crucial experiment: A direct comparison and evaluation of competing explanations of the same phenomenon designed to show that one is superior to the other.
Thus, the infrequent crucial experiment is an important test of theory. Hypotheses from two different theories confront each other in crucial experiments, and one is knocked out of the competition. It is rare, but significant, when it occurs.
Types of Hypotheses.
Hypotheses are links in a theoretical causal chain and are used to test the direction and strength of a relationship between variables. When a hypothesis defeats its competitors, it supports the researcher’s explanation. A curious aspect of hypothesis testing is that researchers treat evidence that supports a hypothesis differently from evidence that opposes it: They give negative evidence more importance. The idea that negative evidence is critical when evaluating a hypothesis comes from the logic of disconfirming hypotheses.16 It is associated with Karl Popper’s idea of falsification (see Chapter 4 under positivism) and with the use of null hypotheses (see later in this section).
Logic of the disconfirming hypothesis: The logic for the null hypothesis, based on the idea that confirming empirical evidence makes a weak case for the existence of a relationship; instead of gathering supporting evidence, testing that no relationship exists provides more cautious, indirect support for its possible existence.
figure 6.3 How the Process of Hypothesis Testing Operates over Time
Recall the preceding discussion of proof. We never prove a hypothesis; however, we can disprove it. With supporting evidence, we can say only that the hypothesis remains a possibility or that it is still being considered. Negative evidence is more significant. With it, the hypothesis becomes “tarnished” or “soiled” because a hypothesis makes predictions, and negative or disconfirming evidence shows that those predictions are wrong. Positive or confirming evidence for a hypothesis is less critical because various alternative hypotheses may make the same prediction. When we find confirming evidence for a prediction, we cannot elevate one explanation over alternatives that could claim the same confirming evidence.
For example, a man stands on a street corner with an umbrella and claims that his umbrella protects him from falling elephants. He has supporting evidence for his hypothesis that the umbrella provides protection. He has not had a single elephant fall on him in all of the time he has had his umbrella open, yet such supportive evidence is weak; it also is consistent with an alternative hypothesis: elephants do not fall from the sky. Both hypotheses predict that the man will be safe from falling elephants. Negative evidence for the hypothesis—the one elephant that falls on him and his umbrella, crushing both—would destroy the hypothesis for good!
We can test hypotheses in two ways: in a straightforward way and in a null hypothesis way. Many quantitative researchers, especially experimenters, frame hypotheses in terms of a null hypothesis, based on the logic of the disconfirming hypothesis. They look for evidence that will allow them to accept or reject the null hypothesis. Most people talk about a hypothesis as a way to predict a relationship. The null hypothesis does the opposite: it predicts no relationship. For example, Sarah believes that students who live on campus in dormitories get higher grades than students who live off campus and commute to college. Her null hypothesis is that there is no relationship between residence and grades. Researchers pair the null hypothesis with a corresponding alternative hypothesis (or experimental hypothesis), which says that a relationship exists. Sarah’s alternative hypothesis is that students’ on-campus residence has a positive effect on grades.
Null hypothesis: A hypothesis stating that there is no significant effect of an independent variable on a dependent variable.
Alternative hypothesis: A hypothesis, paired with the null hypothesis, that says an independent variable has a significant effect on a dependent variable.
For most people, the null hypothesis approach seems like a backward way to think about hypothesis testing. Using a null hypothesis rests on the assumption that we want to discover a relationship. Because of our inner desire to find relationships, we need to design hypothesis testing to make finding relationships very demanding. When we use the null hypothesis approach, we directly test only the null hypothesis. If evidence supports or leads us to accept the null hypothesis, we conclude that the tested relationship does not exist, which implies that the alternative hypothesis is false. On the other hand, if we find evidence to reject the null hypothesis, the alternative hypothesis remains a possibility. We cannot prove the alternative; rather, by testing the null hypothesis, we keep the alternative hypothesis in contention. When we add null hypothesis testing to confirming evidence, the argument for the alternative hypothesis can become stronger over time.
If all this discussion of the null hypothesis is confusing to you, remember that the scientific community is extremely cautious. After all, it is in the business of creating genuine, verified knowledge. It would prefer to consider a causal relationship as false until mountains of evidence show it to be true. This is similar to the Anglo-American legal idea of innocent until proved guilty. We assume, or act as though, the null hypothesis is correct until reasonable doubt suggests otherwise. When we use null hypotheses, we can also use statistical tests (e.g., a t-test or F-test) designed for this way of thinking. We treat a null hypothesis as being in reasonable doubt when a statistical test shows that, if the null hypothesis were true, results as extreme as those observed would occur less than 1 time in 100. This is what we mean when we say that statistical tests allow us to “reject the null hypothesis at the .01 level of significance” (we will discuss statistical significance further in Chapter 12).
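The null hypothesis logic can be made concrete with a small simulation. The sketch below is not from the text; it uses invented GPA figures for Sarah's dormitory and commuter students and a permutation test: if the null hypothesis of no relationship were true, residence labels would be interchangeable, so we count how often random relabelings of the pooled grades produce a difference in means at least as large as the one actually observed.

```python
import random

# Hypothetical GPAs (all numbers invented for illustration).
on_campus = [3.4, 3.6, 3.5, 3.8, 3.2, 3.7]   # dormitory residents
commuters = [2.9, 3.1, 3.0, 3.3, 2.8, 3.2]   # off-campus students

def mean(xs):
    return sum(xs) / len(xs)

# Difference the alternative hypothesis predicts (on-campus higher).
observed = mean(on_campus) - mean(commuters)

# Permutation test of the null hypothesis "residence has no effect":
# pool all grades, shuffle, and re-split them at random many times.
random.seed(42)
pooled = on_campus + commuters
n = len(on_campus)
trials = 10_000
extreme = 0
for _ in range(trials):
    random.shuffle(pooled)
    diff = mean(pooled[:n]) - mean(pooled[n:])
    if diff >= observed - 1e-12:   # at least as large as what we saw
        extreme += 1

p_value = extreme / trials
# A small p-value gives grounds to reject the null hypothesis; it does
# not "prove" the alternative, it only keeps it in contention.
print(f"observed difference = {observed:.3f}, p = {p_value:.4f}")
```

With these invented numbers the estimated p-value falls below the conventional cutoffs, so we would reject the null hypothesis while still speaking of support, not proof, for Sarah's alternative.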
Another type of hypothesis is the double-barreled hypothesis.17 It shows unclear thinking, creates unnecessary confusion, and should be avoided. A double-barreled hypothesis puts two separate relationships into one hypothesis. For example, we say that poverty and a high concentration of teenagers in an area cause property crime to increase. This is double barreled: we might mean either that poverty or a high concentration of teenagers alone causes property crime, or that only the combination of poverty with a high concentration of teenagers causes it. If we intend “either one” and only one independent variable turns out to have an effect, the results of hypothesis testing are unclear. For example, if the evidence shows that poverty causes crime but a concentration of teenagers does not, is the hypothesis supported? If we intend the combination meaning, that the joint occurrence of poverty with a high concentration of teenagers, but neither alone, causes property crime, the hypothesis is not double barreled, but we need to be very clear and state the combination explicitly. The term for a combination hypothesis is the interaction effect (interaction effects are discussed later; also see Figure 6.4).
Double-barreled hypothesis: A confusing and poorly designed hypothesis with two independent variables in which it is unclear whether one or the other variable or both in combination produce an effect.
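The difference between the "either one" reading and the combination (interaction) reading of the poverty and teenage-concentration example can be illustrated with a few hypothetical crime rates (all numbers invented for this sketch): crime barely responds to either factor alone but jumps when both occur together.

```python
# Hypothetical property-crime rates (per 1,000 residents; invented)
# for four kinds of neighborhoods, crossing the two independent
# variables from the double-barreled hypothesis.
crime_rate = {
    # (poverty, high_teen_concentration): crime rate
    (False, False):  5,
    (True,  False):  6,
    (False, True):   6,
    (True,  True):  25,   # elevated only when BOTH factors co-occur
}

# "Either one" reading: each variable should raise crime on its own.
poverty_alone_effect = crime_rate[(True, False)] - crime_rate[(False, False)]
teens_alone_effect   = crime_rate[(False, True)] - crime_rate[(False, False)]

# Interaction (combination) reading: the joint effect exceeds the sum
# of the two separate effects.
joint_effect = crime_rate[(True, True)] - crime_rate[(False, False)]
interaction  = joint_effect - (poverty_alone_effect + teens_alone_effect)

print(f"poverty alone: +{poverty_alone_effect}, teens alone: +{teens_alone_effect}")
print(f"joint: +{joint_effect}, interaction term: +{interaction}")
```

In this sketch the separate effects are trivial (+1 each) while the interaction term is large, so only the explicitly stated combination hypothesis would be supported.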
Potential Errors in Causal Explanation
Developing a good explanation for any theory (i.e., causal, interpretive, or network) requires avoiding some common logical errors. These errors can enter while starting a study, while interpreting and analyzing quantitative data, or while collecting and analyzing qualitative data. Such errors can be referred to as fallacies or false explanations that may deceptively appear to be legitimate on the surface but have serious problems once they are more deeply investigated.
A tautology is a form of circular reasoning. We appear to say something new but are really talking in circles and making a statement that is true by definition. We cannot test tautologies with empirical data. For example, I heard a news report about a representative in the U.S. Congress who argued for a new crime law that would send many more 14- and 15-year-olds to adult courts. When asked why he was interested only in harsh punishment, not prevention, the representative said that offenders would learn that crime does not pay and that would prevent crime. He believed that the only prevention that worked was harsh punishment. This sounded a bit odd when I heard it. So, I reexamined the argument and realized it was tautological (i.e., it contained a logic error). The representative essentially said punishment resulted in prevention because he had redefined prevention as being the same as punishment. Logically, he said punishment caused prevention because harsh punishment was prevention. Politicians may confuse the public with circular reasoning, but social researchers need to learn how to see through and avoid such garble.
Tautology: An error in explanation in which the causal factor (independent variable) and the result (dependent variable) are actually the same or restatements of one another, making an apparent causal relationship true by definition.
A conservative is a person with certain attitudes, beliefs, and values (desires less government regulation, no taxes on upper income people, a strong military, religion taught in public schools, an end to antidiscrimination laws, etc.). It is a tautology to say that wanting less regulation, a strong military, and so on causes conservatism. In sloppy everyday usage, we can say, “Sally is conservative because she believes that there should be less regulation.” This appears to be a causal statement, but it is not. The set of attitudes is a reason to label Sally as a conservative, but those attitudes cannot be the cause of Sally’s conservatism. Her attitudes are conservatism, so the statement is true by definition. It would be impossible ever to obtain evidence showing that those attitudes were not associated with conservatism.
A teleology is something directed by an ultimate purpose or goal. It can take two forms. First, it is associated with an event that occurs because it is in “God’s plan” or in some overarching, mysterious unseen and unknowable force. In other words, an event occurs because God, or an unseen, unknowable master force has predetermined that it must occur. It is a teleology to say that something occurs because it is part of the “natural unfolding” of some all-powerful inner spirit or Geist (German for spirit). Thus, it is a teleology to say that a society develops in a certain direction because of the “spirit of the nation” or a “manifest destiny.” Similar teleological arguments rely on human nature as a cause, such as “Crime occurs because it is just human nature.” Teleology has appeared in theories of history when someone says we are moving toward an “ideal society” or a utopia, and this movement explains events that are occurring today. Teleology has also been found in functional arguments. It is a teleology to say the family takes a certain form (e.g., nuclear) because the nuclear family fulfills social system “needs” for societal continuation. Logically, this says that the functional needs of the social system’s survival into the distant future are the cause of the family form we see today. It is impossible to measure the cause and empirically test teleologies.
Teleology: An error in explanation in which the causal relationship is empirically untestable because the causal factor does not come earlier in time than the result or because the causal factor is a vague, general force that cannot be empirically measured.
figure 6.4 Double-Barreled Hypothesis versus Interaction Effect
Teleology violates the temporal order requirement of causality. There is no true independent variable because the “causal factor” is extremely vague, distant, and unseen. Many people confuse goal motivation (i.e., a desire for something yet to occur) with teleology. I might say a goal causes an action. For example, my goal to get an A in a class caused me to get a good grade. My conscious goal or desire could be a legitimate cause and not be teleological. To show this, I need to outline the causal chain. First, we can empirically measure my mental condition (e.g., goals, desires, or aspirations) at some time point. This clarifies both the empirical evidence and temporal order issue. Second, we can compare my mental condition to future events that may or may not occur, such as getting a specific grade in a course. The mental condition can be a motivation that causes me to engage in certain behaviors, such as studying (an intervening variable). The studying behaviors could increase the chances that a future event (a course grade) will occur. Conscious human goals differ from the will of God, a society’s Geist, or system needs, which we cannot empirically measure, have no fixed existence in time, and always match what occurs.
The statement “The nuclear family is the dominant family form in Western industrial societies because it is functional for the survival of the society” is an untestable teleological statement from structural functional theory. It says “society’s survival” causes “development of family form,” yet the only way we can observe whether a society survives is after the fact, as a consequence of its having had a form of the family. Here is another example of a teleological statement: Because it was the destiny of the United States to become a major world power, we find thousands of immigrants entering the Western frontier during the early nineteenth century. This says that “becoming a major world power,” which occurred from 1920 to 1945, caused “westward migration,” which took place between 1850 and 1890. It uses the obscure term destiny, which, like other similar terms (e.g., “in God’s plan”), cannot be observed in causal relationships.
The ecological fallacy arises from a mismatch of units of analysis. It refers to a poor fit between the units for which we have empirical evidence and the units for which we want to make general statements. Ultimately, it comes down to imprecise reasoning and generalizing well beyond what the evidence warrants. The ecological fallacy occurs when we gather data at a higher or an aggregated unit of analysis but want to say something about a lower or disaggregated unit. It is a fallacy because what happens in one unit of analysis does not always hold for a different unit of analysis.18 Thus, when we gather data for large aggregates (e.g., organizations, entire countries) and draw conclusions about the behavior of individuals from those data, we are creating an ecological fallacy. To avoid this error, we must ensure that the unit of analysis we use in an explanation is the same as or very close to the unit on which we collect data (see Example Box 6.5, The Ecological Fallacy).
Ecological fallacy: An error in explanation in which empirical data about associations found among large-scale units of analysis are greatly overgeneralized and treated as evidence for statements about relationships among much smaller units.
example box 6.5 The Ecological Fallacy
About 45,000 people live in Tomsville and in Joansville. Tomsville has a high percentage of upper income people. More than half of the households in the town have family incomes of over $160,000. The town also has more motorcycles registered in it than any other town of its size. The town of Joansville has many poor people. Half of its households live below the poverty line. The town also has fewer motorcycles registered in it than any other town of its size. But it is a fallacy to say, on the basis of this information alone, that rich people are more likely to own motorcycles or that the evidence shows a relationship between family income and motorcycle ownership. The reason is that we do not know which families in Tomsville or Joansville own motorcycles. We know about only the two variables—average income and number of motorcycles—for the towns as a whole. The unit of analysis for observing variables is each town as a whole. Perhaps all of the low- and middle-income families in Tomsville belong to a motorcycle club, but not a single upper income family belongs to one. Or perhaps one rich family and five poor ones in Joansville own motorcycles. To make a statement about the relationship between family ownership of motorcycles and family income, we have to collect information on families, not on towns as a whole.
Researchers have criticized the famous study Suicide (1957) by Émile Durkheim for the ecological fallacy of treating group data as though they were individual-level data. In the study, Durkheim compared the suicide rates of Protestant and Catholic districts in nineteenth-century western Europe and explained observed differences as due to dissimilarity between people’s beliefs and practices in the two religions. He said that Protestants had a higher suicide rate than Catholics because the Protestants were more individualistic and had lower social integration. Durkheim and early researchers had data only by district. Because people tended to reside with others of the same religion, Durkheim used group-level data (i.e., region) for individuals.
Later researchers (van Poppel and Day, 1996) reexamined nineteenth-century suicide rates with only individual-level data that they discovered for some areas. They compared the death records and looked at the official reason of death and religion, but their results differed from Durkheim’s. Apparently, local officials at that time recorded deaths differently for people of different religions. They recorded “unspecified” as a reason for death far more often for Catholics because of the religion’s strong moral prohibition against suicide. Durkheim’s larger theory may be correct, yet the evidence he had to test it was weak because he used data aggregated at the group level while trying to explain the actions of individuals.
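A minimal sketch, using invented family-level records, shows how the Tomsville/Joansville mismatch can arise: at the town level the richer town has more motorcycles, while at the family level the owners are mostly the poorer families. All names and numbers below are hypothetical.

```python
# Invented family-level records: (town, family income in $1,000s, owns a motorcycle).
families = [
    # Tomsville: mostly rich, but here only the modest-income families ride.
    ("Tomsville", 200, False), ("Tomsville", 180, False), ("Tomsville", 170, False),
    ("Tomsville", 60, True),   ("Tomsville", 55, True),   ("Tomsville", 50, True),
    # Joansville: mostly poor, few motorcycles overall.
    ("Joansville", 20, False), ("Joansville", 18, False), ("Joansville", 25, False),
    ("Joansville", 22, False), ("Joansville", 90, True),  ("Joansville", 24, False),
]

# Town-level (aggregate) view: the richer town has more motorcycles.
def town_stats(town):
    rows = [f for f in families if f[0] == town]
    avg_income = sum(f[1] for f in rows) / len(rows)
    bikes = sum(1 for f in rows if f[2])
    return avg_income, bikes

tom = town_stats("Tomsville")    # high average income, more motorcycles
joan = town_stats("Joansville")  # low average income, fewer motorcycles

# Family-level (disaggregated) view: owners are NOT the rich families.
owner_incomes     = [f[1] for f in families if f[2]]
non_owner_incomes = [f[1] for f in families if not f[2]]
avg_owner = sum(owner_incomes) / len(owner_incomes)
avg_non   = sum(non_owner_incomes) / len(non_owner_incomes)

print(f"Aggregate: Tomsville {tom}, Joansville {joan}")
print(f"Individual: owners average ${avg_owner:.0f}k, non-owners ${avg_non:.0f}k")
```

Concluding from the town-level figures that rich families ride motorcycles would commit the ecological fallacy; the family-level records point the other way.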
Another problem that involves a mismatch of units of analysis and imprecise reasoning about evidence is reductionism, also called the fallacy of nonequivalence (see Example Box 6.6, Error of Reductionism on page 188). This error occurs in an explanation of macro-level events using evidence about specific individuals. It occurs when a person observes a lower or disaggregated unit of analysis but makes statements about the operations of higher or aggregated units. In a way, it is a mirror image of the mismatch error in the ecological fallacy. A person makes this error when he or she has data on how individuals behave but wants to talk about the dynamics of macro-level units. It occurs because it is often easier to obtain data on individuals. Also, the operation of macro-level units is more abstract and nebulous. Lieberson argued that this error produces inconsistencies, contradictions, and confusion. He (1985:108, 113–114) forcefully stated:
• Associations on the lower level are irrelevant for determining the validity of a proposition about processes operating on the higher level. As a matter of fact, no useful understanding of the higher-level structure can be obtained from lower-level analysis.… If we are interested in the higher-level processes and events, it is because we operate with the understanding that they have distinct qualities that are not simply derived by summing up the subunits.
Reductionism: An error in explanation in which empirical data about associations found among small-scale units of analysis are greatly overgeneralized and treated as evidence for statements about relationships among much larger units.
example box 6.6 Error of Reductionism
Suppose you pick up a book and read the following:
• American race relations changed dramatically during the Civil Rights Era of the 1960s. Attitudes among the majority White population shifted to greater tolerance as laws and court rulings changed across the nation. Opportunities that had been legally and officially closed to all but the White population—in the areas of housing, jobs, schooling, voting rights, and so on—were opened to people of all races. From the Brown vs. Board of Education decision in 1954, to the Civil Rights Act of 1964, to the War on Poverty from 1966 to 1968, a new, dramatic outlook swept the country. This was the result of the vision, dedication, and actions of America’s foremost civil rights leader, Dr. Martin Luther King, Jr.
This says: dependent variable = major change in U.S. race relations over a 10- to 13-year period; independent variable = King’s vision and actions.
If you know much about the civil rights era, you see a problem. The entire civil rights movement and its successes are attributed to a single individual. Yes, one individual does make a difference and helps build and guide a movement, but the movement is missing. The idea of a social-political movement as a causal force is reduced to its major leader. The distinct social phenomenon—a movement—is obscured. Lost are the actions of hundreds of thousands of people (marches, court cases, speeches, prayer meetings, sit-ins, rioting, petitions, beatings, etc.) involved in advancing a shared goal and the responses to them. The movement’s ideology, popular mobilization, politics, organization, and strategy are absent. Related macro-level historical events and trends that may have influenced the movement (e.g., Vietnam War protest, mood shift with the killing of John F. Kennedy, African American separatist politics, African American migration to urban North) are also ignored.
This error is not unique to historical explanations. Many people think in terms of only individual actions and have an individualist bias, sometimes called methodological individualism. This is especially true in the extremely individualistic U.S. culture. The error is that it disregards units of analysis or forces beyond the individual. The error of reductionism shifts explanation to a much lower unit of analysis. One could continue to reduce from an individual’s behavior to biological processes in a person, to micro-level neurochemical activities, to the subatomic level.
Most people live in “social worlds” focused on local, immediate settings and their interactions with a small set of others, so their everyday sense of reality encourages seeing social trends or events as individual actions or psychological processes. Often, they become blind to more abstract, macro-level entities—social forces, processes, organizations, institutions, movements, or structures. The idea that all social actions cannot be reduced to individuals alone is the core of sociology. In his classic work Suicide, Émile Durkheim fought methodological individualism and demonstrated that larger, unrecognized social forces explain even highly individual, private actions.
As with the ecological fallacy, to avoid the error of reductionism, we must make certain that the unit of analysis in our explanation and the unit for which we have empirical evidence are the same or very close. When we fail to think precisely about units of analysis and fail to couple the data closely with the theory, we might commit the ecological fallacy or the error of reductionism. Both are mistakes of having data that are inappropriate for the research question and of seriously overgeneralizing from the data.
It is possible to make assumptions about units of analysis other than the ones we study empirically. Thus, research on individuals rests on assumptions that individuals act within a set of social institutions. We base research on social institutions on assumptions about individual behavior. We know that many micro-level units join to form macro-level units. The danger is that it is easy to slide into using the behavior of micro units, such as individuals, to explain the actions of macro units, such as social institutions. What happens among units at one level does not necessarily hold for different units of analysis. Sociology as a field rests on the belief that a distinct level of social reality exists beyond the individual. Explanations of this level require data and theory that go beyond the individual alone. We cannot reduce the causes, forces, structures, or processes that exist among macro units to individual behavior.
Why did World War I occur? You may have heard that it was because a Serbian shot an archduke in the Austro-Hungarian Empire in 1914. This is reductionism. Yes, the assassination was a factor, but the macro-political event between nations—war—cannot be reduced to a specific act of one individual. If it could, we could also say that the war occurred because the assassin’s alarm clock worked and woke him up that morning. If it had not worked, there would have been no assassination, so the alarm clock caused the war! The cause of the event, World War I, was much more complex and was due to many social, political, and economic forces that came together at a point in history. The actions of specific individuals had a role, but only a minor one compared to these macro forces. Individuals affect events, which eventually, in combination with large-scale social forces and organizations, affect others and move nations, but individual actions alone are not the cause. Thus, it is likely that a war would have broken out at about that time even if the assassination had not occurred.
To call a relationship between variables spurious means that it is false, a mirage. We often get excited if we think we have found a spurious relationship because we can show the world to be more complex than it appears on the surface. Because any association between two variables might be spurious, we must be cautious when we discover that two variables are associated; upon further investigation, the association may turn out not to be the basis for a causal relationship. It may be an illusion, just like the mirage that resembles a pool of water on a road during a hot day.
Spuriousness occurs when two variables are associated but are not causally related because an unseen third factor is the real cause (see Example Box 6.7, Spuriousness and Example Box 6.8, Night-Lights and Spuriousness on page 190). The third variable is the cause of both the apparent independent and the dependent variable. It accounts for the observed association. In terms of conditions for causality, the unseen third factor represents a more powerful alternative explanation.
An apparent causal relationship that is illusory due to the effect of an unseen or initially hidden causal factor; the unseen factor has a causal impact on both the independent and dependent variable and produces the false impression that a relationship between them exists.
How can you tell whether a relationship is spurious? How do you find out what the mysterious third factor might be? You will need to use statistical techniques (discussed later in this book) to test whether an association is spurious. To use them, you need a theory or at least a guess about possible third factors. Actually, spuriousness is based on some commonsense logic that you already use. For example, you know that an association exists between the use of air conditioners and ice cream cone consumption. If you measured the number of air conditioners in use and the number of ice cream cones sold each day, you would find a strong correlation with more cones being sold on the days when more air conditioners are in use. But you know that eating ice cream cones does not cause people to turn on air conditioners. Instead, a third variable, hot days, causes both variables. You could verify this by measuring the daily temperature, ice cream consumption, and air conditioner use. In social research, opposing theories help us figure out which third factors are relevant for many topics (e.g., the causes of crime or the reasons for war or child abuse).
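The air-conditioner and ice cream example can be checked with exactly the logic just described: measure all three variables and see whether the association survives once the third factor is controlled. The sketch below is a hypothetical simulation (all numbers are invented for illustration); it generates data in which temperature drives both variables and then computes a partial correlation that removes temperature's influence.

```python
import random

random.seed(42)

# Hidden third variable: daily temperature drives BOTH air-conditioner
# use and ice cream sales; neither causes the other.
temps = [random.gauss(65, 15) for _ in range(365)]
ac_use = [t + random.gauss(0, 5) for t in temps]
cones = [t + random.gauss(0, 5) for t in temps]

def corr(x, y):
    """Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

r_raw = corr(ac_use, cones)  # strong: the association looks causal

# Partial correlation: the association between AC use and cone sales
# after statistically controlling for temperature.
r_at = corr(ac_use, temps)
r_ct = corr(cones, temps)
r_partial = (r_raw - r_at * r_ct) / (
    ((1 - r_at ** 2) * (1 - r_ct ** 2)) ** 0.5
)

print(f"raw r = {r_raw:.2f}, controlling for temperature r = {r_partial:.2f}")
```

The raw correlation is strong, but the partial correlation collapses toward zero: the relationship is spurious. This is the same logic the statistical techniques discussed later in the book formalize.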
example box 6.7 Spuriousness
In their study of the news media, Neuman and colleagues (1992) found a correlation between type of news source and knowledge. People who prefer to get their news from television are less knowledgeable than those who get it from print sources. This correlation is often interpreted as the “dumbing down” of information. In other words, television news causes people to know little.
The authors found that the relationship was spurious, however. “We were able to show that the entire relationship between television news preference and lower knowledge scores is spurious” (p. 113). They found that a third variable, initially unseen, explained both a preference for television news and a level of knowledge about current events. They said, “We find that what is really causing the television-is-the-problem effect is the preference for people with lower cognitive skill to get their news from television” (p. 98). The missing or hidden variable was “cognitive skill.” The authors defined cognitive skill as a person’s ability to use reason and manipulate abstract ideas. In other words, people who find it difficult to process abstract, complex information turn to television news. Others may also use the high-impact, entertaining television news sources, but they use them less and heavily supplement them with other more demanding, information-rich print sources. People who have weak information skills also tend to be less knowledgeable about current events and about other topics that require abstract thought or deal with complex information.
example box 6.8 Night-Lights and Spuriousness
For many years, researchers observed a strong positive association between the use of a night-light and children who were nearsighted. Many thought that the night-light was somehow causing the children to develop vision problems (illustrated below). Other researchers could think of no reason for a causal link between night-light use and developing nearsightedness. A 1999 study provided the answer. It found that nearsighted parents are more likely to use night-lights; they also genetically pass on their vision deficiency to their children. The study found no link between night-light use and nearsightedness once parental vision was added to the explanation (see b below). Thus the initial causal link was misleading or spurious (from New York Times, May 22, 2001).
Source: “Vital Signs: Update; New Study Vindicates Night Lights” from The New York Times, Health Section, 5/22/2001 Issue, Page(s) 6.
Some people argue that taking illegal drugs causes suicide, school dropouts, and violent acts. Advocates of the "drugs-are-the-problem" position point to the positive correlations between taking drugs and being suicidal, dropping out of school, and engaging in violence. The supporters argue that ending drug use will greatly reduce suicide, dropouts, and violence. Others argue that many people turn to drugs because of their emotional problems or the high level of disorder in their communities (e.g., high unemployment, unstable families, high crime, few community services, lack of civility). People with emotional problems or who live in disordered communities are also more likely to commit suicide, drop out, and engage in violence. This means that reducing emotional problems and community disorder will cause illegal drug use, dropping out, suicide, and violence to decline greatly. Reducing drug taking alone will have only a limited effect because it ignores the root cause, which is not drugs. The "drugs-are-the-problem" argument is spurious because the initial relationship between taking illegal drugs and the problems its advocates identify is misleading; emotional problems and community disorder are the true, often unseen causal variables.
We can now turn from the errors in causal explanation to avoid and move to other issues involving hypotheses. Table 6.2 provides a review of the major errors, and Figure 6.5 illustrates them.
From the Research Question to Hypotheses
It is difficult to move from a broad topic to hypotheses, but the leap from a well-formulated research question to hypotheses is a short one. A good research question has hypotheses embedded within it. In addition, hypotheses are tentative answers to research questions.
table 6.2 Summary of Errors in Explanation
Type of Error | Short Definition | Example
Tautology The relationship is true by definition and involves circular reasoning. Poverty is caused by having very little money.
Teleology The cause is an intention that is inappropriate, or it has misplaced temporal order. People get married in religious ceremonies because society wants them to.
Ecological fallacy The empirical observations are at too high a level for the causal relationship that is stated. New York has a high crime rate. Joan lives in New York. Therefore, she probably stole my watch.
Reductionism The empirical observations are at too low a level for the causal relationship that is stated. Because Steven lost his job and did not buy a new car, the country entered a long economic recession.
Spuriousness An unseen third variable is the actual cause of both the independent and dependent variable. Hair length is associated with TV programs. People with short hair prefer watching football; people with long hair prefer romance stories. (Unseen: Gender)
Consider this example of a research question: "Is age at marriage associated with divorce?" The question has two variables: "age at marriage" and "divorce." To develop a hypothesis, we must determine which is the independent variable. The independent variable is age at marriage because marriage must logically precede divorce. We may also ask what the direction of the relationship is. The hypothesis could be the following: "The lower the age at time of marriage, the higher the chances that the marriage will end in divorce." This hypothesis answers the research question and makes a prediction. Notice that we can now reformulate the research question into a more focused form: "Are couples who marry younger more likely to divorce?"
figure 6.5 Five Errors in Explanation to Avoid
We can create several hypotheses for one research question. Another hypothesis from the same research question is as follows: “The smaller the difference between the ages of the marriage partners at the time of marriage, the less likely that the marriage will end in divorce.” In this case, we specify the variable age at marriage differently.
We can have a hypothesis that specifies that a relationship holds under some conditions but not others. As Lieberson (1985:198) remarked, "In order to evaluate the utility of a given causal proposition, it is important that there be a clear-cut statement of the conditions under which it will operate." For example, a hypothesis states: The lower the age of the partners at time of marriage, the higher are the chances that the marriage will end in divorce, unless it is a marriage between members of a tight-knit traditional religious community in which early marriage is the norm.
Formulating a research question and a hypothesis does not have to proceed in fixed stages. We can formulate a tentative research question and then develop possible hypotheses; the hypotheses will help us to state the research question more precisely. The process is interactive and requires our creativity.
You may be wondering where theory fits into the process of moving from a topic to a testable hypothesis. Recall from Chapter 3 that theory takes many forms. We use general theoretical issues as a source of topics. Theories provide concepts that we turn into variables as well as the reasoning or mechanism that helps us connect variables together to produce a research question. A hypothesis can both answer a research question and be an untested proposition from a theory. We can express a hypothesis at an abstract, conceptual level or restate it in a more concrete, measurable form. Examples of specific studies may help to illustrate the parts of the research process. For examples of three quantitative studies, see Chart 6.1 on page 194; for two qualitative studies, see Chart 6.2 on page 195.
In this chapter, you encountered the groundwork needed to begin a study. You saw how differences in the qualitative and quantitative styles direct us to prepare for a study differently. In all types of research, you must narrow a topic into a more specific, focused research question. Each of the major approaches to doing research implies a different form and sequence of decisions as well as different answers as to when and how to focus on a research question. The most effective approach will depend on the topic you select, your purpose and intended use of study results, the orientation toward social science you adopt, and your own assumptions and beliefs.
A quantitative study generally takes a linear path and emphasizes objectivity. In it you will use explicit, standardized procedures and a causal explanation. It uses the language of variables and hypotheses found across many areas of science based on a positivist tradition. The process is often deductive, with a sequence of discrete steps that precede data collection: Narrow the topic to a more focused question, transform nebulous theoretical concepts into more exact variables, and develop one or more hypotheses to test. In actual practice, you will move back and forth, but the general process flows in a single, linear direction. In addition, you should take special care to avoid logical errors in hypothesis development and causal explanation.
In a qualitative study, you will likely follow a nonlinear path and emphasize becoming intimate with the details of a natural setting or a particular cultural-historical context. There are fewer standardized procedures or explicit steps, and you must often devise on-the-spot techniques for one situation or study. The language of cases and contexts directs you to conduct detailed investigations of particular cases or processes in a search for authenticity. Planning and design decisions are rarely separated into a distinct predata collection stage but continue to develop throughout early data collection. In fact, you use a more inductive qualitative style that encourages a slow, flexible evolution toward a specific focus based on what you learn from the data. Grounded theory emerges from your continuous reflections on the data and the context.
The qualitative and quantitative distinction is often overdrawn. Too often, it appears as a rigid dichotomy, and adherents of one approach judge studies from the other by the assumptions and standards of their own approach. The quantitative researcher demands to know the variables used and the hypothesis tested. The qualitative researcher balks at turning humanity into cold numbers. A well-versed, prudent social researcher will understand and appreciate each approach on its own terms and recognize the strengths and limitations of each. The ultimate goal of developing a better understanding and explanation of the social world comes from an appreciation of what each has to offer.
chart 6.1 Examples of Quantitative Studies
Study citation and title Ridgeway and Erickson (2000), “Creating and Spreading Status Beliefs” Musick, Wilson, and Bynum (2000), “Race and Formal Volunteering: The Differential Effects of Class and Religion” Barlow, Barlow, and Chiricos (1995), “Economic Conditions and Ideologies of Crime in the Media”
Methodological technique used Experiment Survey Content analysis
Topic Processes by which people develop beliefs about the social status of others Rates of volunteering by White and Black adults U.S. mass media portrayals of lawbreakers
Research question As individuals interact, do external, structural factors that affect the interaction mold the beliefs they come to hold about entire categories of people in the future? What different kinds of resources are available to Blacks and Whites that explain why Blacks are less likely to volunteer? Do economic conditions affect how the media portray offenders?
Main hypothesis tested People can be “taught” to make status distinctions among categories of people, who are actually equal, based on limited interaction in which one category exerts more skill. Social class and religion affect whether Blacks volunteer differently than Whites. The media distortion of crime shows offenders in a more negative way (blames them) when economic conditions are bad.
Main independent variable(s) Whether a person’s interaction with someone in a category that shows members of the category to have superior or inferior skill at tasks Social class, religious attendance, race Unemployment rate in several years, 1953–1982
Main dependent variable Whether individuals develop and apply a belief of inequality to an entire category of people Whether a person said he or she volunteered for any of five organizations (religious, education, political or labor, senior citizen, or local) Whether distortion occurred, measured as a mismatch between media attention (articles in Time magazine) and crime statistics for several years
Unit of analysis Individual undergraduate student Individual adults The media report
Universe All individuals All adult Whites and Blacks in the United States All U.S. mass media reports
chart 6.2 Examples of Qualitative Studies
Study citation and title Lu and Fine (1995), “The Presentation of Ethnic Authenticity: Chinese Food as a Social Accomplishment” Molotch, Freudenburg, and Paulsen (2000), “History Repeats Itself, but How? City Character, Urban Tradition, and the Accomplishment of Place”
Methodological technique used Field research Historical-comparative research
Topic The ways ethnic cultures are displayed within the boundaries of being acceptable in the United States and how they deploy cultural resources The ways cities develop a distinct urban “character”
Research question How do Chinese restaurants present food to balance authenticity and to satisfy non-Chinese U.S. customers? Why did the California cities of Santa Barbara and Ventura, which appear very similar on the surface, develop very different characters?
Grounded theory Ethnic restaurants Americanize their food to fit local tastes but also construct an impression of authenticity. This is a negotiated process of meeting the customer’s expectations/taste conventions and the desire for an exotic and authentic eating experience. The authors use two concepts, “lash up” (interaction of many factors) and structure (past events create constraints on subsequent ones), to elaborate on character and tradition. Economic, political, cultural, and social factors combine to create distinct cultural-economic places. Similar forces can have opposite results depending on context.
Bricolage The authors observed and interviewed at four Chinese restaurants but relied on evidence from past studies. The authors used historical records, maps, photos, official statistical information, and interviews. In addition to economic and social conditions, they examined voluntary associations and physical materials.
Process Restaurants make modifications to fit available ingredients, their market niche, and the cultural and food tastes of local customers. Conditions in the two cities contributed to two different economic development responses to oil and highways. Ventura formed an industrial-employment base around oil and allowed new highways. Santa Barbara limited both and instead focused on creating a tourism industry.
Context Chinese restaurants, especially four in Athens, Georgia The middle part of California’s coast over the past 100 years
linear research path
logic in practice
logic of disconfirming hypothesis
nonlinear research path
• 1. What are the implications of saying that qualitative research uses more logic in practice than a reconstructed logic?
• 2. What does it mean to say that qualitative research follows a nonlinear path? In what ways is a nonlinear path valuable?
• 3. Describe the differences between independent, dependent, and intervening variables.
• 4. Why don’t we prove results in social research?
• 5. Take a topic of interest and develop two research questions for it. For each research question, specify the units of analysis and universe.
• 6. What two hypotheses are used if a researcher uses the logic of disconfirming hypotheses? Why is negative evidence stronger?
• 7. Restate the following in terms of a hypothesis with independent and dependent variables: The number of miles a person drives in a year affects the number of visits a person makes to filling stations, and there is a positive unidirectional relationship between the variables.
• 8. Compare the ways in which quantitative and qualitative researchers deal with personal bias and the issue of trusting the researcher.
• 9. How do qualitative and quantitative researchers use theory?
• 10. Explain how qualitative researchers approach the issue of interpreting data. Refer to first-, second-, and third-order interpretations.
See Tashakkori and Teddlie (1998).
Ward and Grant (1985) and Grant and colleagues (1987) analyzed research in sociology journals and suggested that journals with a higher proportion of qualitative research articles address gender topics but that studies of gender are not themselves more likely to be qualitative.
See Kaplan (1964:3–11) for a discussion.
On the issue of using quantitative, statistical techniques as a substitute for trust, see Collins (1984), Porter (1995), and Smith and Heshusius (2004).
For discussion, see Schwandt (1997), Swanborn (1996), and Tashakkori and Teddlie (1998:90–93).
For examples of checking, see Agar (1980) and Becker (1970c).
Problem choice and topic selection are discussed in Campbell and associates (1982) and Zuckerman (1978).
See Flick (1998:51).
Exceptions are secondary data analysis and existing statistics research. In working with them, a quantitative researcher often focuses the research question and develops a specific hypothesis to test after she or he examines the available data.
See Ball and Smith (1992) and Harper (1994).
For place of theory in qualitative research, see Hammersley (1995).
See Harper (1987:9, 74–75) and Schwandt (1997: 10–11).
See Gerring (2007:20) and George and Bennett (2005).
See Blee and Billings (1986), Ricoeur (1970), and Schneider (1987) on the interpretation of text in qualitative research.
See Lieberson (1985:185–187) for a discussion of basic and superficial variables in a set of causal linkages. Davis (1985) and Stinchcombe (1968) provide good general introductions to making linkages among variables in social theory.
The logic of disconfirming hypothesis is discussed in Singleton and associates (1988:56–60).
See Bailey (1987:43) for a discussion of this term.
The general problem of aggregating observation and making causal inferences is discussed in somewhat technical terms in Blalock (1982:237–264) and in Hannan (1985). O’Brien (1992) argues that the ecological fallacy is one of a whole group of logical fallacies in which levels and units of analysis are confused and over-generalized.
CHAPTER 7 Qualitative and Quantitative Measurement
The Need for Measurement
Quantitative and Qualitative Measurement
The Measurement Process
Reliability and Validity
A Guide to Quantitative Measurement
Scales and Indexes
Measurement, in short, is not an end in itself. Its scientific worth can be appreciated only in an instrumentalist perspective, in which we ask what ends measurement is intended to serve, what role it is called upon to play in the scientific situation, what functions it performs in inquiry.
—Abraham Kaplan, The Conduct of Inquiry, p. 171
Who is poor and how much poverty exists? U.S. government officials in the 1960s answered these questions using the poverty line to measure poverty. New programs were to provide aid to poor people (for schooling, health care, housing assistance, and so forth). They began with the idea of being so impoverished that a family was unable to buy enough food to prevent malnourishment. Studies at the time showed that low-income people were spending one-third of their income on food. Officials visited grocery stores, calculated how much low-cost nutritional food for a family would cost, and multiplied the amount by 3 to create a poverty line. Since then, the number has been adjusted for inflation. When Brady (2003:730) reviewed publications from 1990–2001, he found that 69.8 percent of poverty studies in the United States used the official government rate. However, numerous studies found that the official U.S. measure of poverty has major deficiencies. When the National Research Council examined the measure in 1995, members declared it outdated and said it should not be retained. The poverty measure sets an arbitrary income level, and "it obscures differences in the extent of poverty among population groups and across geographic contexts and provides an inaccurate picture of trends over time" (Brady, 2003:718). It fails to capture the complex nature of poverty and does not take into account new family situations, new aid programs, changes in taxes, and new living expenses. Adding to the confusion, we cannot compare U.S. poverty reduction over time with that in other countries because each country uses a different poverty measure. Methodological improvements in how we measure poverty would result in counting far more people as poor, so few government officials want to change the measure.
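The original poverty-line arithmetic can be sketched in a few lines: since low-income families spent roughly one-third of income on food, officials priced a minimal food budget and multiplied by 3, then carried the threshold forward by adjusting for inflation. The figures below are hypothetical, not the actual historical dollar amounts or price indexes.

```python
def poverty_line(annual_food_cost, food_share=1 / 3):
    """Income level at which the minimal food budget equals
    `food_share` of total income (1/3 implies multiplying by 3)."""
    return annual_food_cost / food_share

def adjust_for_inflation(base_line, cpi_base, cpi_now):
    """Carry an old threshold forward using a price-index ratio."""
    return base_line * (cpi_now / cpi_base)

base = poverty_line(annual_food_cost=1000)            # hypothetical food budget
updated = adjust_for_inflation(base, cpi_base=30.0, cpi_now=90.0)
print(base, updated)  # 3000.0 9000.0
```

Note what the sketch makes plain: every number the measure produces is driven by two frozen assumptions, the one-third food share and the original food budget, which is exactly the critique the National Research Council raised.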
THE NEED FOR MEASUREMENT
As researchers, we encounter measures every day, such as the Stanford-Binet IQ test to measure intelligence, the index of dissimilarity to measure racial segregation, or uniform crime reports to measure the amount of crime. We need measures to test a hypothesis, evaluate an explanation, provide empirical support for a theory, or study an applied issue. The way we measure a range of social life—aspects such as self-esteem, political power, alienation, or racial prejudice—is the focus of this chapter. We measure in both quantitative and qualitative studies, but quantitative researchers are most concerned with measurement. In quantitative studies, measurement is a distinct step in the research process that occurs prior to data collection. Quantitative measurement has a special terminology and set of techniques because the goal is to precisely capture details of the empirical social world and express what we find in numbers.
In qualitative studies, we measure with alternatives to numbers, and measurement is less a separate research step. Because the process is more inductive, we are measuring and creating new concepts simultaneously with the process of gathering data.
Measuring is not some arcane, technical issue (like pulling out a tape measure to determine an object’s length or putting an object on a scale to check its weight) that we can skip over quickly. Measurement intimately connects how we perceive and think about the social world with what we find in it. Poor-quality measures can quickly destroy an otherwise good study. Measurement also has consequences in everyday life. For example, psychologists and others debate the meaning and measures of intelligence. We use IQ “tests” to measure a person’s intelligence in schools, on job applications, and in statements about racial or other inherited superiority. But what is intelligence? Most such IQ “tests” measure only analytic reasoning (i.e., one’s capacity to think abstractly and to infer logically). However, we recognize other types of intelligence: artistic, practical, mechanical, and creative. Some people suggest even more types, such as social-interpersonal, emotional, body-kinesthetic, musical, or spatial. If there are many forms of intelligence but we narrowly measure only one type, we limit the way schools identify and nurture learning; the way we select, evaluate, and promote employees; and the way society as a whole values diverse human capabilities.
As the chapter opening indicated, the way we measure poverty determines whether people receive assistance from numerous social programs (e.g., subsidized housing, food aid, health care, child-care). Some say that people are poor if they cannot afford to buy food required to prevent malnutrition. Others say that poor means having an annual income that is less than one-half of the average (median) income. Still others say that poor means someone who earns less than a “living wage” based on a judgment about an income needed to meet minimal community standards of health, safety and decency in hygiene, housing, clothing, diet, transportation, and so forth. Decisions about measuring poverty can greatly influence the daily living conditions of millions of people.
We use many measures in daily life. For example, this morning I woke up and hopped onto a bathroom scale to see how well my diet is working. I glanced at a thermometer to find out whether to wear a coat. Next, I got into my car and checked the gas gauge to be sure I could make it to campus. As I drove, I watched the speedometer so I would not get a speeding ticket. By 8:00 a.m., I had measured weight, temperature, gasoline volume, and speed—all measures about the physical world. Such precise, well-developed measures of daily life are fundamental in the natural sciences.
Our everyday measures of the nonphysical world are usually less exact. We are measuring when we say that a restaurant has excellent food, that Pablo is really smart, that Karen has a negative attitude toward life, that Johnson is really prejudiced, or that last night’s movie contained lots of violence. Such everyday judgments as “really prejudiced” or “lots of violence” are sloppy and imprecise.
Measurement instruments also extend our senses. The astronomer or biologist uses the telescope or the microscope to extend natural vision. Measuring helps us see what is otherwise invisible, and it lets us observe things that were once unseen and unknown but predicted by theory. For example, we may not see or feel magnetism with our natural senses. Magnetism comes from a theory about the physical world. We see its effects indirectly; for instance, metal flecks move near a magnet. The magnet allows us to "see" or measure the magnetic fields. In contrast to our natural senses, scientific measurement is more sensitive, varies less with the specific observer, and yields more exact information. We recognize that a thermometer gives more specific, precise information about temperature than touch can. Likewise, a good bathroom scale gives us more specific, constant, and precise information about the weight of a 5-year-old girl than we can get by lifting her and then calling her "heavy" or "light."
Before we can measure, we need to have a very clear idea about what we are interested in. This is a key principle; measurement connects ideas we carry in our heads with specific things we do in the empirical world to make those ideas visible. Natural scientists use many theories, and they created measures to “see” very tiny things (molecules or insect organs) or very large things (huge geological land masses or planets) that are not observable through ordinary senses. All researchers are constantly creating new measures.1
We might easily see age, sex, and race that are measured in social research (e.g., physical wrinkles of age, body parts of each sex, skin tones, and eye shape), but many aspects of the social world (e.g., attitudes, ideology, divorce rates, deviance, social roles) are difficult to observe directly. Just as natural scientists created indirect measures of the “invisible” molecules and the force of gravity, social scientists created measures for difficult-to-observe parts of the social world.
QUANTITATIVE AND QUALITATIVE MEASUREMENT
In all social research—both qualitative and quantitative studies—we connect data to ideas or concepts. We can think of the data in a study as the empirical representation of a concept. Measurement links the data to the concepts, yet the measurement process differs depending on whether our data and research approach are primarily quantitative or qualitative. Three features separate quantitative from qualitative approaches to measurement.
The first difference is timing. In quantitative research, we think about variables and convert them into specific actions during a planning stage that is before and separate from gathering or analyzing data. In qualitative research, we measure while in the data collection phase.
A second difference involves the data themselves. In a quantitative study, we use techniques that produce data in the form of numbers, usually by moving deductively from abstract ideas to specific data collection techniques and to the precise numerical information those techniques yield. Numerical data represent a uniform, standardized, and compact way to represent abstract ideas empirically. In a qualitative study, data sometimes come in the form of numbers; more often, they are written or spoken words, actions, sounds, symbols, physical objects, or visual images (e.g., maps, photographs, videos). Unlike a quantitative study, a qualitative study does not convert all observations into a single, common medium such as numbers but leaves the data in a variety of nonstandard shapes, sizes, and forms. While numerical data convert information into a standard and condensed format, qualitative data are voluminous, diverse, and nonstandard.
A third difference involves how we connect concepts with data. In quantitative research, we contemplate and reflect on concepts before we gather data. We select measurement techniques to bridge the abstract concepts with the empirical data. Of course, after we collect and examine the data, we do not shut off our minds and continue to develop new ideas, but we begin with clearly thought-out concepts and consider how we might measure them.
In qualitative research, we also reflect on concepts before gathering data. However, many of the concepts we use are developed and refined during or after the process of data collection. We reexamine and reflect on the data and concepts simultaneously and interactively. As we gather data, we are simultaneously reflecting on it and generating new ideas. The new ideas provide direction and suggest new ways to measure. In turn, the new ways to measure shape how we will collect additional data. In short, we bridge ideas with data in an ongoing, interactive process.
To summarize, we think about and make decisions regarding measurement in quantitative studies before we gather data. The data are in a standardized, uniform format: numbers. In contrast, in a qualitative study, most of our thinking and measurement decisions occur in the midst of gathering data, and the data are in diffuse, nonstandard forms.
THE MEASUREMENT PROCESS
When we measure, we connect an invisible concept, idea, or construct in our minds with a technique, process, or procedure with which we observe the idea in the empirical world.2 In quantitative studies, we tend to start with abstract ideas and end with empirical data. In qualitative studies, we mix data and ideas while gathering data. However, in a specific study, things are messy and tend to be more interactive than this general statement suggests.
We use two major processes in measurement: conceptualization and operationalization. Conceptualization refers to taking an abstract construct and refining it by giving it a conceptual or theoretical definition. A conceptual definition is a statement of the idea in your head in specific words or theoretical terms that are linked to other ideas or constructs. There is no magical way to turn a construct into a precise conceptual definition; doing so involves thinking carefully, observing directly, consulting with others, reading what others have said, and trying possible definitions.
Conceptualization: The process of developing clear, rigorous, systematic conceptual definitions for abstract ideas/concepts.
Conceptual definition: A careful, systematic definition of a construct that is explicitly written down.
A good definition has one clear, explicit, and specific meaning. There is no ambiguity or vagueness. Sometimes conceptualization is highly creative and produces new insights. Some scholarly articles have been devoted to conceptualizing key concepts. Melbin (1978) conceptualized night as a frontier, Gibbs (1989) analyzed the meaning of the concept of terrorism, and Ball and Curry (1995) discussed what street gang means. The key point is this: We need clear, unambiguous definitions of concepts to develop sound explanations.
A single construct can have several definitions, and people may disagree over definitions. Conceptual definitions are linked to theoretical frameworks. For example, a conflict theorist may define social class as the power and property that a group of people in society has or lacks. A structural functionalist defines social class in terms of individuals who share a social status, lifestyle, or subjective identification. Although people disagree over definitions, we as researchers should always state explicitly which definition we are using.
Some constructs (e.g., alienation) are highly abstract and complex. They contain lower level concepts within them (e.g., powerlessness), which can be made even more specific (e.g., a feeling of little power concerning where one can live). Other constructs are concrete and simple (e.g., age). We need to be aware of how complex and abstract a construct is. For example, it is easier to define a concrete construct such as age (e.g., number of years that have passed since birth) than a complex, abstract concept such as morale.
Before we can measure, we must distinguish exactly what we are interested in from other nearby things. This is common sense. How can we measure something unless we know what we are looking for? For example, a biologist cannot observe a cancer cell unless he or she first knows what a cancer cell is, has a microscope, and can distinguish the cell from noncell “stuff” under the microscope. The process of measurement involves more than simply having a measurement instrument (e.g., a microscope). We need three things in the measurement process: a construct, a measure, and the ability to recognize what we are looking for.3
For example, let us say that I want to measure teacher morale. I must first define teacher morale. What does the construct of morale mean? As a variable construct, morale takes on different values: high versus low or good versus bad. Next I must create a measure of my construct. This could take the form of survey questions, an examination of school records, or observations of teachers. Finally, I must distinguish morale from other things in the answers to survey questions, school records, or observations.
The social researcher’s job is more difficult than that of the natural scientist because social measurement involves talking with people or observing their behavior. Unlike the planets, cells, or chemicals, the answers people give and their actions can be ambiguous. People can react to the very fact that they are being asked questions or observed. Thus, the social researcher has a double burden: first, to have a clear construct, a good measure, and an ability to recognize what is being looked for, and second, to try to measure fluid and confusing social life that may change just because of an awareness that a researcher is trying to measure.
How can I develop a conceptual definition of teacher morale, or at least a tentative working definition to get started? I begin with my everyday understanding of morale: something vague such as “how people feel about things.” I ask some of my friends how they define it. I also look at an unabridged dictionary and a thesaurus. They give definitions or synonyms such as “confidence, spirit, zeal, cheerfulness, esprit de corps, mental condition toward something.” I go to the library and search the research literature on morale or teacher morale to see how others have defined it. If someone else has already given an excellent definition, I might borrow it (citing the source, of course). If I do not find a definition that fits my purposes, I turn to theories of group behavior, individual mental states, and the like for ideas. As I collect various definitions, parts of definitions, and related ideas, I begin to see the boundaries of the core idea.
By now, I have many definitions and need to sort them out. Most of them say that morale is a spirit, feeling, or mental condition toward something, or a group feeling. I separate the two extremes of my construct. This helps me turn the concept into a variable. High morale involves confidence, optimism, cheerfulness, feelings of togetherness, and willingness to endure hardship for the common good. Low morale is the opposite; it is a lack of confidence, pessimism, depression, isolation, selfishness, and an unwillingness to put forth effort for others.
Because I am interested in teacher morale, I learn about teachers to specify the construct to them. One strategy is to make a list of examples of high or low teacher morale. High teacher morale includes saying positive things about the school, not complaining about extra work, or enjoying being with students. Low morale includes complaining a lot, not attending school events unless required to, or looking for other jobs.
Morale involves a feeling toward something else; a person has morale with regard to something. I list the various “somethings” toward which teachers have feelings (e.g., students, parents, pay, the school administration, other teachers, the profession of teaching). This raises an issue that frequently occurs when developing a definition. Are there several types of teacher morale, or are all of these “somethings” aspects of one construct? There is no perfect answer. I have to decide whether morale means a single, general feeling with different parts or dimensions or several distinct feelings.
What unit of analysis does my construct apply to: a group or an individual? Is morale a characteristic of an individual, of a group (e.g., a school), or of both? I decide that for my purposes, morale applies to groups of people. This tells me that my unit of analysis will be a group: all teachers in a school.
I must distinguish the construct of interest from related ideas. How is my construct of teacher morale similar to or different from related concepts? For example, does morale differ from mood? I decide that mood is more individual and temporary than morale. Likewise, morale differs from optimism and pessimism. Those are outlooks about the future that individuals hold. Morale is a group feeling. It may include positive or negative feelings about the future as well as related beliefs and feelings.
Conceptualization is the process of thinking through the various possible meanings of a construct. By now, I know that teacher morale is a mental state or feeling that ranges from high (optimistic, cheerful) to low (pessimistic, depressed); morale has several dimensions (regarding students, regarding other teachers); it is a characteristic of a group; and it persists for a period of months. I have a much more specific mental picture of what I want to measure than when I began. If I had not conceptualized, I would have tried to measure what I started with: “how people feel about things.”
Even with all of the conceptualization, some ambiguity remains. To complete the conceptualization process, boundaries are necessary. I must decide exactly what I intend to include and exclude. For example, what is a teacher? Does a teacher include guidance counselors, principals, athletic coaches, and librarians? What about student teachers or part-time or substitute teachers? Does the word teachers include everyone who teaches for a living, even if someone is not employed by a school (e.g., a corporate trainer, an on-the-job supervisor who instructs an apprentice, a hospital physician who trains residents)? Even if I restrict my definition to people in schools, what is a school? It could include a nursery school, a training hospital, a university’s Ph.D. program, a for-profit business that prepares people to take standardized tests, a dog obedience school, a summer camp that teaches students to play basketball, and a vocational school that teaches how to drive semitrailer trucks.
Some people assume teacher means a full-time, professionally trained employee of a school teaching grades 1 through 12 who spends most of the day in a classroom with students. Others use a legal or official government definition that could include people certified to teach, even if they are not in classrooms. It excludes people who are uncertified, even if they are working in classrooms with students. The central point is that conceptualization requires me to be very clear in my own thinking. I must know exactly what I mean by teachers and morale before I can begin to measure. I must state what I think in very clear and explicit terms that other people can understand.
Operationalization links a conceptual definition to a set of measurement techniques or procedures, the construct’s operational definition (i.e., a definition in terms of the specific operations or actions). An operational definition could be a survey questionnaire, a method of observing events in a field setting, a way to measure symbolic content in the mass media, or any process that reflects, documents, or represents the abstract construct as it is expressed in the conceptual definition.
Operationalization: The process of moving from a construct’s conceptual definition to specific activities or measures that allow a researcher to observe it empirically.
Operational definition: A variable in terms of the specific actions to measure or indicate it in the empirical world.
We often can measure a construct in several ways; some are better and more practical than others. The key point is that we must fit the measure to the specific conceptual definition while working within the practical constraints of the study (e.g., time, money, available participants). We can develop a new measure from scratch or use one that other researchers are using (see Expansion Box 7.1, Five Suggestions for Coming Up with a Measure).
expansion box 7.1 Five Suggestions for Coming Up with a Measure
• 1. Remember the conceptual definition. The underlying principle for any measure is to match it to the specific conceptual definition of the construct that will be used in the study.
• 2. Keep an open mind. Do not get locked into a single measure or type of measure. Be creative and constantly look for better measures. Avoid what Kaplan (1964:28) called the “law of the instrument,” which means being locked into using one measurement instrument for all problems.
• 3. Borrow from others. Do not be afraid to borrow from other researchers, as long as credit is given. Good ideas for measures can be found in other studies or modified from other measures.
• 4. Anticipate difficulties. Logical and practical problems often arise when trying to measure variables of interest. Sometimes a problem can be anticipated and avoided with careful forethought and planning.
• 5. Do not forget your units of analysis. Your measure should fit with the units of analysis of the study and permit you to generalize to the universe of interest.
Operationalization connects the language of theory with the language of empirical measures. Theory has many abstract concepts, assumptions, definitions, and cause-and-effect relations. By contrast, empirical measures are very concrete actions in specific, real situations with actual people and events. Measures are specific to the operations or actions we engage in to indicate the presence or absence of a construct as it exists in concrete, observable reality.
Quantitative Conceptualization and Operationalization
Quantitative measurement proceeds in a straightforward sequence: first conceptualization, next operationalization, and then application of the operational definition or the collection of data. We must rigorously link abstract ideas to measurement procedures that can produce precise information in the form of numbers. One way to do this is with rules of correspondence or an auxiliary theory. The purpose of the rules is to link the conceptual definitions of constructs to concrete operations for measuring the constructs.4
Rules of correspondence are logical statements of the way an indicator corresponds to an abstract construct. For example, a rule of correspondence says that we will accept a person’s verbal agreement with a set of ten specific statements as evidence that the person strongly holds an anti-feminist attitude. An auxiliary theory may explain how and why indicators and constructs connect. Carmines and Zeller (1979:11) noted, “The auxiliary theory specifying the relationship between concepts and indicators is equally important to social research as the substantive theory linking concepts to one another.” Perhaps we want to measure alienation. Our definition of alienation has four parts, each in a different sphere of life: family relations, work relations, relations with community, and relations with friends. An auxiliary theory may specify that certain behaviors or feelings in each sphere of life are solid evidence of alienation. In the sphere of work, the theory says that if a person feels a total lack of control over when, where, and with whom he or she works, what he or she does when working, or how fast he or she must work, that person is alienated.
Rules of correspondence
Standards that researchers use to connect abstract constructs with measurement operations in empirical social reality.
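The work-sphere rule described above can be sketched in code. The following is a minimal illustration, not a validated instrument: the dictionary keys and the all-or-nothing decision rule are assumptions made for the example.

```python
# A sketch of an auxiliary theory's correspondence rule for alienation in
# the work sphere. The aspects of control and the all-or-nothing rule are
# illustrative assumptions based on the example in the text.

def alienated_in_work_sphere(controls):
    """Apply the rule: a person who reports a total lack of control over
    every aspect of work (when, where, with whom, what, how fast) is
    counted as alienated in the work sphere."""
    return not any(controls.values())

# One hypothetical respondent who reports no control over any aspect of work
respondent = {"when": False, "where": False, "with_whom": False,
              "what": False, "how_fast": False}
print(alienated_in_work_sphere(respondent))  # True under this rule
```

The point of writing the rule out is that it makes the link between indicator and construct explicit and checkable, which is exactly what a rule of correspondence is meant to do.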
Figure 7.1 illustrates the measurement process linking two variables in a theory and a hypothesis. We must consider three levels: conceptual, operational, and empirical.5 At the most abstract level, we may be interested in the causal relationship between two constructs, or a conceptual hypothesis. At the level of operational definitions, we are interested in testing an empirical hypothesis to determine the degree of association between indicators. This is the level at which we consider correlations, statistics, questionnaires, and the like. The third level is the empirical reality of the lived social world. As we link the operational indicators (e.g., questionnaire items) to a construct (e.g., alienation), we capture what is taking place in the lived social world and relate it back to the conceptual level.
Conceptual hypothesis: A type of hypothesis that expresses variables and the relationships among them in abstract, conceptual terms.
Empirical hypothesis: A type of hypothesis in which the researcher expresses variables in specific empirical terms and expresses the association among the measured indicators in observable, empirical terms.
As we measure, we link the three levels together and move deductively from the abstract to the concrete. First, we conceptualize a variable, giving it a clear conceptual definition; next we operationalize it by developing an operational definition or set of indicators for it; and lastly, we apply indicators to collect data and test empirical hypotheses.
Let us return to the example mentioned earlier. How do I give my teacher morale construct an operational definition? First, I read the research reports of others and see whether a good indicator already exists. If there are no existing indicators, I must invent one from scratch. Morale is a mental state or feeling, so I measure it indirectly through people’s words and actions. I might develop a questionnaire for teachers and ask them about their feelings toward the dimensions of morale in my definition. I might go to the school and observe the teachers in the teachers’ lounge, interacting with students, and attending school activities. I might use school personnel records on teacher behaviors for statements that indicate morale (e.g., absences, requests for letters of recommendation for other jobs, performance reports). I might survey students, school administrators, and others to find out what they think about teacher morale. Whichever indicator I choose, I further refine my conceptual definition as I develop it (e.g., write specific questionnaire questions).
figure 7.1 Conceptualization and Operationalization
Conceptualization and operationalization are necessary for each variable. In the preceding example, morale is one variable, not a hypothesis. It could be a dependent variable caused by something else, or it could be an independent variable causing something else. It depends on my theoretical explanation.
Qualitative Conceptualization and Operationalization
In qualitative research, instead of refining abstract ideas into theoretical definitions early in the research process, we refine rudimentary “working ideas” during the data collection and analysis process. Conceptualization is a process of forming coherent theoretical definitions as we struggle to “make sense of” or organize the data and our preliminary ideas about it.
As we gather and analyze qualitative data, we develop new concepts, formulate definitions for major constructs, and consider relationships among them. Eventually, we link concepts and constructs to create theoretical relationships. We form and refine constructs while examining data (e.g., field notes, photos and maps, historical documents), and we ask theoretical questions about the data (e.g., Is this a case of class conflict? What is the sequence of events and could it be different? Why did this happen here but not somewhere else?).
We need clear, explicit definitions expressed in words and descriptions of specific actions that link to other ideas and are tied to the data. In qualitative research, conceptualization flows largely from the data.
In qualitative studies, operationalization often precedes conceptualization, giving inductive measurement (see Figure 7.3; Figure 7.2 shows the contrasting deductive process). We may create conceptual definitions out of rudimentary “working ideas” while we are making observations or gathering data. Instead of turning refined conceptual definitions into measurement operations, we operationalize by describing how specific observations and thoughts about the data contribute to the working ideas that are the basis of conceptual definitions.
figure 7.2 Example of the Deductive Measurement Process for the Hypothesis: A Professional Work Environment Increases the Level of Teacher Morale
Thus, qualitative operationalization largely involves developing a description of how we use working ideas while making observations. Operationalization describes how we gathered specific observations and struggled to understand them as they evolved into abstract constructs. In this way, qualitative operationalization is more an after-the-fact description than a preplanned technique.
Just as quantitative operationalization deviates from a rigid deductive process, qualitative researchers may draw on ideas from beyond the data of a specific research setting. Qualitative operationalization includes using preexisting techniques and concepts that we blend with those that emerged during the data collection process.
Fantasia’s (1988) field research on contested labor actions illustrates qualitative operationalization. Fantasia used cultures of solidarity as a central construct. He related this construct to ideas of conflict-filled workplace relations and growing class consciousness among nonmanagerial workers. He defined a culture of solidarity as a type of cultural expression created by workers that evolves in particular places over time. The workers develop shared feelings and a sense of unity that is in opposition to management and business owners. It is an interactive process. Slowly, over time, the workers arrive at common ideas, understandings, and actions. It is “less a matter of disembodied mental attitude than a broader set of practices and repertoires available for empirical investigation” (Fantasia, 1988:14).
figure 7.3 Example of the Inductive Measurement Process for the Proposition: Radical Labor Action Is Likely to Occur Where a Culture of Solidarity Has Been Created
To operationalize the construct, Fantasia describes how he gathered data. He presents them to illustrate the construct, and explains his thinking about the data. He describes his specific actions to collect the data (e.g., he worked in a particular factory, attended a press conference, and interviewed people). He also shows us the data in detail (e.g., he describes specific events that document the construct by showing several maps indicating where people stood during a confrontation with a foreperson, retelling the sequence of events at a factory, recounting actions by management officials, and repeating statements that individual workers made). He gives us a look into his thinking process as he reflected and tried to understand his experiences and developed new ideas drawing on older ideas.
In qualitative research, ideas and evidence are mutually interdependent. This applies particularly to case study analysis. Cases are not given, preestablished empirical units or theoretical categories that exist apart from the data; they are defined by data and theory. By analyzing a situation, the researcher organizes data and applies ideas simultaneously to create or specify a case. Making or creating a case, called casing, brings the data and theory together. Determining what to treat as a case resolves a tension or strain between what the researcher observes and his or her ideas about it. “Casing, viewed as a methodological step, can occur at any phase of the research process, but occurs especially at the beginning of the project and at the end” (Ragin, 1992b:218).
Casing: Developing cases in qualitative research.
RELIABILITY AND VALIDITY
All of us as researchers want reliability and validity, which are central concerns in all measurement. Both connect measures to constructs. It is not possible to have perfect reliability and validity, but they are ideals toward which we strive. Reliability and validity are salient because our constructs are usually ambiguous, diffuse, and not observable. Reliability and validity are ideas that help to establish the truthfulness, credibility, or believability of findings. Both terms also have multiple meanings. As used here, they refer to related, desirable aspects of measurement.
Reliability means dependability or consistency. It suggests that the same thing is repeated or recurs under the identical or very similar conditions. The opposite of reliability is an erratic, unstable, or inconsistent result that happens because of the measurement itself. Validity suggests truthfulness. It refers to how well an idea “fits” with actual reality. The absence of validity means that the fit between the ideas we use to analyze the social world and what actually occurs in the lived social world is poor. In simple terms, validity addresses the question of how well we measure social reality using our constructs about it.
All researchers want reliable and valid measurement, but beyond an agreement on the basic ideas at a general level, qualitative and quantitative researchers see reliability and validity differently.
Reliability and Validity in Quantitative Research
Measurement reliability means that the numerical results an indicator produces do not vary because of characteristics of the measurement process or measurement instrument itself. For example, I get on my bathroom scale and read my weight. I get off and get on again and again. I have a reliable scale if it gives me the same weight each time, assuming, of course, that I am not eating, drinking, changing clothing, and so forth. An unreliable scale registers different weights each time, even though my “true” weight does not change. Another example is my car speedometer. If I am driving at a constant slow speed on a level surface but the speedometer needle jumps from one end to the other, the speedometer is not a reliable indicator of how fast I am traveling. Actually, there are three types of reliability.6
Reliability: The dependability or consistency of the measure of a variable.
Three Types of Reliability
1. Stability reliability is reliability across time. It addresses the question: Does the measure deliver the same answer when applied in different time periods? The weight-scale example just given is of this type of reliability. We can verify an indicator’s degree of stability reliability with the test-retest method, which requires retesting or re-administering the indicator to the same group of people. If what is being measured is stable and the indicator has stability reliability, then I will have the same results each time. A variation of the test-retest method is to give an alternative form of the test, which must be very similar to the original. For example, I have a hypothesis about gender and seating patterns in a college cafeteria. I measure my dependent variable (seating patterns) by observing and recording the number of male and female students at tables, and noting who sits down first, second, third, and so on for a 3-hour period. If, as I am observing, I become tired or distracted or I forget to record and miss more people toward the end of the 3 hours, my indicator does not have a high degree of stability reliability.
Stability reliability: Measurement reliability across time; a measure that yields consistent results at different time points assuming what is being measured does not itself change.
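The test-retest logic can be made concrete with a small computation: administer the same indicator twice to the same people and see how closely the two sets of scores agree. The scores below are invented for illustration; a common (though not the only) way to summarize agreement is a correlation coefficient.

```python
# A minimal sketch of the test-retest check for stability reliability.
# The scores are hypothetical; a correlation near 1.0 suggests the
# indicator gives consistent answers across time.

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

test   = [4, 7, 6, 5, 8, 3]  # first administration, six respondents
retest = [4, 7, 5, 5, 8, 3]  # same respondents at a later time point
print(round(pearson_r(test, retest), 2))  # close to 1.0 => high stability
```

The same function could compare an original test with an alternative form, the variation mentioned above.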
2. Representative reliability is reliability across subpopulations or different types of cases. It addresses the question: Does the indicator deliver the same answer when applied to different groups? An indicator has high representative reliability if it yields the same result for a construct when applied to different subpopulations (e.g., different classes, races, sexes, age groups). For example, I ask a question about a person’s age. If people in their twenties answered my question by overstating their true age whereas people in their fifties understated their true age, the indicator has a low degree of representative reliability. To have representative reliability, the measure needs to give accurate information for every age group.
Representative reliability: Measurement reliability across groups; a measure that yields consistent results for various social groups.
A subpopulation analysis verifies whether an indicator has this type of reliability. The analysis compares the indicator across different subpopulations or subgroups and uses independent knowledge about them. For example, I want to test the representative reliability of a questionnaire item that asks about a person’s education. I conduct a subpopulation analysis to see whether the question works equally well for men and women. I ask men and women the question and then obtain independent information (e.g., check school records) and check to see whether the errors in answering the question are equal for men and women. The item has representative reliability if men and women have the same error rate.
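The subpopulation analysis just described can be sketched as a comparison of error rates. The data below are invented: self-reported years of schooling are checked against (hypothetical) school records, separately for two groups.

```python
# A minimal sketch of a subpopulation analysis for representative
# reliability: compare an indicator's error rate across groups.
# All values are invented for illustration.

def error_rate(reported, records):
    """Share of cases where the self-report disagrees with the record."""
    errors = sum(1 for r, t in zip(reported, records) if r != t)
    return errors / len(reported)

men_reported   = [12, 16, 12, 14, 16]
men_records    = [12, 16, 11, 14, 16]  # from school records
women_reported = [12, 14, 16, 12, 18]
women_records  = [12, 14, 16, 11, 18]

# Similar error rates across the groups suggest representative reliability.
print(error_rate(men_reported, men_records),
      error_rate(women_reported, women_records))
```

If one group’s error rate were much larger than the other’s, the item would lack representative reliability for that comparison, as in the age-question example above.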
3. Equivalence reliability applies when researchers use multiple indicators—that is, when a construct is measured with multiple specific measures (e.g., several items in a questionnaire all measure the same construct). Equivalence reliability addresses the question: Does the measure yield consistent results across different indicators? If several different indicators measure the same construct, then a reliable measure gives the same result with all indicators.
Equivalence reliability: Measurement reliability across indicators; a measure that yields consistent results using different specific indicators, assuming that all measure the same construct.
Multiple indicators: The use of multiple procedures or several specific measures to provide empirical evidence of the levels of a variable.
We verify equivalence reliability with the split-half method. This involves dividing the indicators of the same construct into two groups, usually by a random process, and determining whether both halves give the same results. For example, I have fourteen items on a questionnaire. All measure political conservatism among college students. If my indicators (i.e., questionnaire items) have equivalence reliability, then I can randomly divide them into two groups of seven and get the same results. For example, I use the first seven questions and find that a class of fifty business majors is twice as conservative as a class of fifty education majors. I get the same results using the second seven questions. Special statistical measures (e.g., Cronbach’s alpha) also can determine this type of reliability.

A special type of equivalence reliability, intercoder reliability, can be used when there are several observers, raters, or coders of information (explained in Chapter 11). In a sense, each observer is an indicator. A measure is reliable if the observers, raters, or coders agree with each other. This type of reliability is commonly reported in content analysis studies. For example, I hire six students to observe student seating patterns in a cafeteria. If all six are equally skilled at observing and recording, I can combine the information from all six into a single reliable measure. But if one or two students are lazy, inattentive, or sloppy, my measure will have lower reliability. Intercoder reliability is tested by having several coders measure the exact same thing and then comparing the measures. For instance, I have three coders independently code the seating patterns during the same hour on three different days. I compare the recorded observations. If they agree, I can be confident of my measure’s intercoder reliability. Special statistical techniques measure the degree of intercoder reliability.
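Both statistics mentioned above can be computed directly. The sketch below uses invented scores: Cronbach’s alpha summarizes how consistently a set of questionnaire items varies together, and simple percent agreement is one basic intercoder check (more refined statistics, such as chance-corrected agreement, exist but are not shown here).

```python
# A minimal sketch of two equivalence-reliability statistics: Cronbach's
# alpha across questionnaire items and percent agreement between coders.
# All scores are invented for illustration.

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def cronbach_alpha(items):
    """items: one list of scores per questionnaire item, aligned by
    respondent. Alpha rises as the items vary together consistently."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]
    sum_item_var = sum(variance(item) for item in items)
    return (k / (k - 1)) * (1 - sum_item_var / variance(totals))

def percent_agreement(coder_a, coder_b):
    """Intercoder check: share of units two coders coded identically."""
    matches = sum(1 for a, b in zip(coder_a, coder_b) if a == b)
    return matches / len(coder_a)

# Six conservatism items answered by eight students (1-5 scale)
items = [
    [4, 5, 3, 4, 2, 5, 4, 3],
    [4, 4, 3, 5, 2, 5, 4, 2],
    [5, 5, 2, 4, 1, 4, 5, 3],
    [4, 5, 3, 4, 2, 5, 3, 3],
    [3, 4, 3, 4, 2, 4, 4, 2],
    [4, 5, 2, 5, 1, 5, 4, 3],
]
print(round(cronbach_alpha(items), 2))  # high alpha suggests equivalence
print(percent_agreement(["M", "F", "M", "F"], ["M", "F", "F", "F"]))  # 0.75
```

A split-half check could be run with the same data by correlating respondents’ totals on one random half of the items with their totals on the other half.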
How to Improve Reliability.
It is rare to have perfect reliability. We can do four things to improve reliability: (1) clearly conceptualize constructs, (2) use a precise level of measurement, (3) use multiple indicators, and (4) use pilot tests.
1. Clearly conceptualize all constructs. Reliability increases when each measure indicates one and only one concept. This means we must develop unambiguous, clear theoretical definitions. Constructs should be specified to eliminate “noise” (i.e., distracting or interfering information) from other constructs. For example, a measure of a pure chemical compound is more reliable than one taken when the chemical is mixed with other material or dirt. In the latter case, separating the “noise” of the other material from the pure chemical is difficult.
Let us return to the example of teacher morale. I should separate morale from related ideas (e.g., mood, personality, spirit, job attitude). If I did not do this, I could not be sure what I was really measuring. I might develop an indicator for morale that also indicates personality; that is, the construct of personality contaminates that of morale and produces a less reliable indicator. Bad measurement occurs by using one indicator to operationalize different constructs (e.g., using the same questionnaire item to indicate morale and personality).
2. Increase the level of measurement. Levels of measurement are discussed later in this chapter. Indicators at higher or more precise levels of measurement are more likely to be reliable than less precise measures because the latter pick up less detailed information. If more specific information is measured, it is less likely that anything other than the construct will be captured. The general principle is: Try to measure at the most precise level possible. However, quantifying at higher levels of measurement is more difficult. For example, if I have a choice of measuring morale as either high or low, or in ten categories from extremely low to extremely high, it would be better to measure it in ten refined categories.
3. Use multiple indicators of a variable. A third way to increase reliability is to use multiple indicators because two (or more) indicators of the same construct are better than one.7 Figure 7.4 illustrates the use of multiple indicators in hypothesis testing. Three indicators of the one independent variable construct are combined into an overall measure, A, and two indicators of a dependent variable are combined into a single measure, B. For example, I have three specific measures of A, which is teacher morale: (a1) the answers to a survey question on attitudes about school, (a2) the number of absences for reasons other than illness, and (a3) the number of complaints a teacher made that others overheard. I also have two measures of my dependent variable B, giving students extra attention: (b1) the number of hours a teacher spends staying after school hours to meet individually with students and (b2) whether the teacher inquires frequently about a student’s progress in other classes.
figure 7.4 Measurement Using Multiple Indicators
With multiple indicators, we can build on triangulation and take measurements from a wider range of the content of a conceptual definition (i.e., sample from the conceptual domain). We can measure each aspect of the construct with its own indicator. Also, one indicator may be imperfect, but several measures are less likely to have the same error. James (1991) provides a good example of this principle applied to counting persons who are homeless. If we consider only where people sleep (e.g., using sweeps of streets and parks and counting people in official shelters), we miss some because many people who are homeless have temporary shared housing (e.g., sleep on the floor of a friend or family member). We also miss some by using records of official service agencies because many people who are homeless avoid involvement with government and official agencies. However, if we combine the official records with counts of people sleeping in various places and conduct surveys of people who use a range of services (e.g., street clinics, food lines, temporary shelters), we can get a more accurate picture of the number of people who are homeless. In addition to capturing the entire picture, multiple-indicator measures tend to be more stable than single-item measures.
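The combination of indicators shown in Figure 7.4 can be sketched as follows. The indicator names and values are hypothetical. Standardizing each indicator before averaging is one common approach (not the only one); it keeps indicators measured on different scales from dominating the composite. Absences and complaints are reverse-scored because high values signal low morale.

```python
# Hedged sketch (hypothetical names and values): combining multiple
# indicators of teacher morale into one composite measure, as in Figure 7.4.
from statistics import mean, pstdev

def zscores(values):
    """Standardize a list of scores to mean 0, standard deviation 1."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

def composite(*indicators):
    # Standardize each indicator, then average across indicators per case
    standardized = [zscores(ind) for ind in indicators]
    return [mean(case) for case in zip(*standardized)]

# Three morale indicators for five hypothetical teachers
attitude   = [4, 2, 5, 3, 1]   # survey item on attitudes about school (1-5)
absences   = [1, 6, 0, 3, 8]   # non-illness absences
complaints = [0, 4, 1, 2, 5]   # complaints others overheard

# Absences and complaints signal LOW morale, so reverse them before combining
morale = composite(attitude, [-a for a in absences], [-c for c in complaints])
```

A teacher who scores well on all three indicators lands at the top of the composite; a single aberrant indicator pulls the score less than it would in a single-item measure.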
4. Use pilot studies and replication. You can improve reliability by first using a pilot version of a measure. Develop one or more draft or preliminary versions of a measure and try them before applying the final version in a hypothesis-testing situation. This takes more time and effort. Returning to the example discussed earlier, in my survey of teacher morale, I go through many drafts of a question before the final version. I test early versions by asking people the question and checking to see whether it is clear.
The principle of using pilot tests extends to replicating measures that other researchers have used. For example, I search the literature and find measures of morale from past research. I may want to build on and use a previous measure if it is a good one, citing the source, of course. In addition, I may want to add new indicators and compare them to the previous measure (see Example Box 7.1, Improving the Measure of U.S. Religious Affiliation). In this way, the quality of the measure can improve over time as long as the same definition is used (see Table 7.1 on page 212 for a summary of reliability and validity types).
Validity is an overused term. Sometimes, it is used to mean “true” or “correct.” There are several general types of validity. Here we are concerned with measurement validity, which also has several types. Nonmeasurement types of validity are discussed later.
How well an empirical indicator and the conceptual definition of the construct that the indicator is supposed to measure “fit” together.
When we say that an indicator is valid, it is valid for a particular purpose and definition. The same indicator may be less valid or invalid for other purposes. For example, the measure of morale discussed above (e.g., questions about feelings toward school) might be valid for measuring morale among teachers but invalid for measuring morale among police officers.8
example box 7.1 Improving the Measure of U.S. Religious Affiliation
Quantitative researchers measure individual religious beliefs (e.g., Do you believe in God? in a devil? in life after death? What is God like to you?), religious practices (e.g., How often do you pray? How frequently do you attend services?), and religious affiliation (e.g., If you belong to a church or religious group, which one?). They have categorized the hundreds of U.S. religious denominations into either a three-part grouping (Protestant, Catholic, Jewish) or a three-part classification of fundamentalist, moderate, or liberal that was introduced in 1990.
Steensland and colleagues (2000) reconceptualized affiliation, and, after examining trends in religious theology and social practices, argued for classifying all American denominations into six major categories: Mainline Protestant, Evangelical Protestant, Black Protestant, Roman Catholic, Jewish, and Other (including Mormon, Jehovah’s Witnesses, Muslim, Hindu, and Unitarian). The authors evaluated their new six-category classification by examining people’s religious views and practices as well as their views about contemporary social issues. Among national samples of Americans, they found that the new classification better distinguished among religious denominations than did previous measures.
At its core, measurement validity tells us how well the conceptual and operational definitions mesh with each other: The better the fit, the higher the measurement validity. Validity is more difficult to achieve than reliability. We cannot have absolute confidence about validity, but some measures are more valid than others. The reason is that constructs are abstract ideas, whereas indicators refer to concrete observations. This is the gap between our mental pictures about the world and the specific things we do at particular times and places. Validity is part of a dynamic process that grows by accumulating evidence over time, and without it, all measurement becomes meaningless.
table 7.1 Summary of Measurement Reliability and Validity Types
reliability (dependable measure)
- Stability—over time (verify using test-retest method)
- Representative—across subgroups (verify using subpopulation analysis)
- Equivalence—across indicators (verify using split-half method)

validity (true measure)
- Face—makes sense in the judgment of others
- Content—captures the entire meaning
- Criterion—agrees with an external source
  - Concurrent—agrees with a preexisting measure
  - Predictive—agrees with future behavior
- Construct—has consistent multiple indicators
  - Convergent—alike ones are similar
  - Discriminant—different ones differ
Some researchers use rules of correspondence (discussed earlier) to reduce the gap between abstract ideas and specific indicators. For example, a rule of correspondence is: A teacher who agrees with statements that “things have gotten worse at this school in the past 5 years” and that “there is little hope for improvement” is indicating low morale. Some researchers talk about the epistemic correlation, a hypothetical correlation between an indicator and the construct that the indicator measures. We cannot empirically measure such correlations, but they can be estimated.9
Four Types of Measurement Validity.
1. Face validity is the most basic and easiest type of validity to achieve. It is a judgment by the scientific community that the indicator really measures the construct. It addresses the question: On the face of it, do people believe that the definition and method of measurement fit? For example, few people would accept a measure of college student math ability by asking students what 2 + 2 equals. This is not a valid measure of college-level math ability on the face of it. Recall that the principle of organized skepticism in the scientific community means that others scrutinize aspects of research.10
A type of measurement validity in which an indicator “makes sense” as a measure of a construct in the judgment of others, especially in the scientific community.
2. Content validity addresses this question: Is the full content of a definition represented in a measure? A conceptual definition holds ideas; it is a “space” containing ideas and concepts. Measures should sample or represent all ideas or areas in the conceptual space. Content validity involves three steps. First, specify the content in a construct’s definition. Next, sample from all areas of the definition. Finally, develop one or more indicators that tap all of the parts of the definition.
A type of measurement validity that requires that a measure represent all aspects of the conceptual definition of a construct.
Let us consider an example of content validity. I define feminism as a person’s commitment to a set of beliefs creating full equality between men and women in areas of the arts, intellectual pursuits, family, work, politics, and authority relations. I create a measure of feminism in which I ask two survey questions: (1) Should men and women get equal pay for equal work? and (2) Should men and women share household tasks? My measure has low content validity because the two questions ask only about pay and household tasks. They ignore the other areas (intellectual pursuits, politics, authority relations, and other aspects of work and family). For a content-valid measure, I must either expand the measure or narrow the definition.11
3. Criterion validity uses some standard or criterion to indicate a construct accurately. The validity of an indicator is verified by comparing it with another measure of the same construct in which a researcher has confidence. The two subtypes of this type of validity are concurrent and predictive.12
Measurement validity that relies on some independent, outside verification.
To have concurrent validity, we need to associate an indicator with a preexisting indicator that we already judge to be valid (i.e., it has face validity). For example, we create a new test to measure intelligence. For it to be concurrently valid, it should be highly associated with existing IQ tests (assuming the same definition of intelligence is used). This means that most people who score high on the old measure should also score high on the new one, and vice versa. The two measures may not be perfectly associated, but if they measure the same or a similar construct, it is logical for them to yield similar results.
Measurement validity that relies on a preexisting and already accepted measure to verify the indicator of a construct.
Criterion validity by which an indicator predicts future events that are logically related to a construct is called predictive validity. It cannot be used for all measures. The measure and the action predicted must be distinct from but indicate the same construct. Predictive measurement validity should not be confused with prediction in hypothesis testing in which one variable predicts a different variable in the future. For example, the Scholastic Assessment Test (SAT) that many U.S. high school students take measures scholastic aptitude: the ability of a student to perform in college. If the SAT has high predictive validity, students who achieve high SAT scores will subsequently do well in college. If students with high scores perform at the same level as students with average or low scores, the SAT has low predictive validity.
Measurement validity that relies on the occurrence of a future event or behavior that is logically consistent to verify the indicator of a construct.
Another way to test predictive validity is to select a group of people who have specific characteristics and predict how they will score (very high or very low) vis-à-vis the construct. For example, I create a measure of political conservatism. I predict that members of conservative groups (e.g., John Birch Society, Conservative Caucus, Daughters of the American Revolution, Moral Majority) will score high on it whereas members of liberal groups (e.g., Democratic Socialists, People for the American Way, Americans for Democratic Action) will score low. I “validate” it by pilot-testing it on members of the groups. It can then be used as a measure of political conservatism for the public.
4. Construct validity is for measures with multiple indicators. It addresses this question: If the measure is valid, do the various indicators operate in a consistent manner? It requires a definition with clearly specified conceptual boundaries. The two types of construct validity are convergent and discriminant.
A type of measurement validity that uses multiple indicators and has two subtypes: how well the indicators of one construct converge or how well the indicators of different constructs diverge.
Convergent validity applies when multiple indicators converge or are associated with one another. It means that multiple measures of the same construct hang together or operate in similar ways. For example, I measure the construct “education” by asking people how much education they have completed, looking up school records, and asking the people to complete a test of school knowledge. If the measures do not converge (i.e., people who claim to have a college degree but have no records of attending college or those with college degrees perform no better than high school dropouts on my tests), my measure has weak convergent validity, and I should not combine all three indicators into one measure.
A type of measurement validity for multiple indicators based on the idea that indicators of one construct will act alike or converge.
Discriminant validity is the opposite of convergent validity and means that the indicators of one construct “hang together,” or converge, but also are negatively associated with opposing constructs. Discriminant validity says that if two constructs A and B are very different, measures of A and B should not be associated. For example, I have ten items that measure political conservatism. People answer all ten in similar ways. But I also put five questions that measure political liberalism on the same questionnaire. My measure of conservatism has discriminant validity if the ten conservatism items converge and are negatively associated with the five liberalism ones. (See Figure 7.5 for a review of measurement validity.)
A type of measurement validity for multiple indicators based on the idea that indicators of different constructs diverge.
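Both subtypes of construct validity can be checked with simple correlations. In this hedged sketch, the items and responses are invented; in practice, researchers examine full correlation matrices or factor analyses rather than single pairs of indicators.

```python
# Hedged sketch (invented items and responses): checking convergent and
# discriminant validity with Pearson correlations.
from statistics import mean

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

# Two conservatism items and one liberalism item, six hypothetical respondents
conserv1 = [5, 4, 2, 1, 5, 3]
conserv2 = [4, 5, 1, 2, 4, 3]
liberal1 = [1, 2, 5, 5, 1, 3]

convergent = pearson(conserv1, conserv2)    # alike indicators: strongly positive
discriminant = pearson(conserv1, liberal1)  # opposing construct: strongly negative
```

A strongly positive `convergent` value supports combining the conservatism items into one measure; a strongly negative `discriminant` value shows the conservatism items pull apart from the liberalism items, as the text's example requires.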
Reliability and Validity in Qualitative Research
Qualitative research embraces the core principles of reliability and validity, but we rarely see the terms in this approach because they are so closely associated with quantitative measurement. In addition, in qualitative studies, we apply the principles differently.
Recall that reliability means dependability or consistency. We use a wide variety of techniques (e.g., interviews, participation, photographs, document studies) to record observations consistently in qualitative studies. We want to be consistent (i.e., not vacillating or being erratic) in how we make observations, similar to the idea of stability reliability. One difficulty with reliability is that we often study processes that are unstable over time. Moreover, we emphasize the value of a changing or developing interaction between us as researchers and the people we study. We believe that the subject matter and our relationship to it are an evolving process. A metaphor for the relationship is one of an evolving relationship or living organism (e.g., a plant) that naturally matures over time. Many qualitative researchers see the quantitative approach to reliability as a cold, fixed mechanical instrument that one applies repeatedly to static, lifeless material.
In qualitative studies, we consider a range of data sources and employ multiple measurement methods. We do not become locked into the quantitative-positivist ideas of replication, equivalence, and subpopulation reliability. We accept that different researchers or researchers who use alternative measures may find distinctive results. This happens because data collection is an interactive process in which particular researchers operate in an evolving setting whose context dictates using a unique mix of measures that cannot be repeated. The diverse measures and interactions with different researchers are beneficial because they can illuminate different facets or dimensions of a subject matter. Many qualitative researchers question the quantitative researcher’s quest for standard, fixed measures and fear that such measures ignore the benefits of having a variety of researchers with many approaches and may neglect key aspects of diversity that exist in the social world.
Validity means truthfulness. In qualitative studies, we are more interested in achieving authenticity than realizing a single version of “Truth.” Authenticity means offering a fair, honest, and balanced account of social life from the viewpoint of the people who live it every day. We are less concerned with matching an abstract construct to empirical data than with giving a candid portrayal of social life that is true to the lived experiences of the people we study. In most qualitative studies, we emphasize capturing an inside view and providing a detailed account of how the people we study understand events (see Expansion Box 7.2, Meanings of Validity in Qualitative Research).
There are qualitative research substitutes for the quantitative approach to validity: ecological validity or natural history methods (see Chapter 13). Both emphasize conveying an insider’s view to others. Historical researchers use internal and external criticisms (see Chapter 14) to determine whether the evidence is real. Qualitative researchers adhere to the core principle of validity, to be truthful (i.e., avoid false or distorted accounts) and try to create a tight fit between understandings, ideas, and statements about the social world and what is actually occurring in it.
figure 7.5 Types of Validity
expansion box 7.2 Meanings of Validity in Qualitative Research
Measurement validity in qualitative research does not require demonstrating a fixed correspondence between a carefully defined abstract concept and a precisely calibrated measure of its empirical appearance. Other features of the research measurement process are important for establishing validity.
First, to be considered valid, a researcher’s truth claims need to be plausible and, as Fine (1999) argued, intersubjectively “good enough” (i.e., understandable by many other people). Plausible means that the data and statements about them are not exclusive; they are not the only possible claims, nor are they exact accounts of the one truth in the world. This does not make them mere inventions or arbitrary. Instead, they are powerful, persuasive descriptions that reveal a researcher’s genuine experiences with the empirical data.
Second, a researcher’s empirical claims gain validity when supported by numerous pieces of diverse empirical data. Any one specific empirical detail alone may be mundane, ordinary, or “trivial.” Validity arises out of the cumulative impact of hundreds of small, diverse details that only together create a heavy weight of evidence.
Third, validity increases as researchers search continuously in diverse data and consider the connections among them. Raw data in the natural social world are not in neatly prepackaged systematic scientific concepts; rather, they are numerous disparate elements that “form a dynamic and coherent ensemble” (Molotch et al., 2000:816). Validity grows as a researcher recognizes a dense connectivity in disparate details. It grows with the creation of a web of dynamic connections across diverse realms, not only with the number of specifics that are connected.
Relationship between Reliability and Validity
Reliability is necessary for validity and is easier to achieve than validity. Although reliability is necessary to have a valid measure of a concept, it does not guarantee that the measure will be valid. It is not a sufficient condition for validity. A measure can yield a result over and over (i.e., has reliability), but what it truly measures may not match a construct’s definition (i.e., validity).
For example, I get on a scale to check my weight. The scale registers the same weight each time I get on and off during a 2-hour period. I next go to another scale—an “official” one at a medical clinic—and it reports my weight to be twice as much. The first scale yielded reliable (i.e., dependable and consistent) results, but it was not a valid measure of my weight. A diagram might help you see the relationship between reliability and validity. Figure 7.6 illustrates the relationship between the concepts by using the analogy of a target. The bull’s-eye represents a fit between a measure and the definition of the construct.
Validity and reliability are usually complementary concepts, but in some situations, they conflict with each other. Sometimes, as validity increases, reliability becomes more difficult to attain and vice versa. This situation occurs when the construct is highly abstract and not easily observable but captures the “true essence” of an idea. Reliability is easiest to achieve when a measure is precise, concrete, and observable. For example, alienation is a very abstract, subjective construct. We may define it as a deep inner sense of loss of one’s core humanity; it is a feeling of detachment and being without purpose that diffuses across all aspects of life (e.g., the sense of self, relations with other people, work, society, and even nature). While it is not easy, most of us can grasp the idea of alienation, a directionless disconnection that pervades a person’s existence. As we get more deeply into the true meaning of the concept, measuring it precisely becomes more difficult. Specific questions on a questionnaire may produce reliable measures more than other methods, yet the questions cannot capture the idea’s essence.
Other Uses of the Words Reliable and Valid
Many words have multiple definitions, creating confusion among various uses of the same word. This happens with reliability and validity. We use reliability in everyday language. A reliable person is a dependable, stable, and responsible person who responds in similar, predictable ways in different times and conditions. A reliable car is dependable and trustworthy; it starts and performs in a predictable way. Sometimes, we say that a study or its results are reliable. This means that other researchers can reproduce the study and will get similar results.
figure 7.6 Illustration of Relationship between Reliability and Validity
Source: Adapted version of Figure 5-2 An Analogy to Validity and Reliability, page 155 from Babbie, E. R. 1986. The Practice of Social Research, Fourth Edition. Belmont, CA: Wadsworth Publishing Company.
Internal validity means we have not made errors internal to the design of a research project that might produce false conclusions.13 In experimental research, we primarily talk about possible alternative causes of results that arise despite our attempts to institute controls (see Chapter 9 for discussion).
External validity is also used primarily in experimental research. It refers to whether we can generalize a result that we found in a specific setting with a particular small group beyond that situation or externally to a wider range of settings and many different people. External validity addresses this question: If something happens in a laboratory or among a particular set of research participants (e.g., college students), does it also happen in the “real” (nonlaboratory) world or among the general population (nonstudents) (discussed in Chapter 9)? External validity has serious implications for evaluating theory. If a general theory is true, it implies that we can generalize findings from a single test of the theory to many other situations and populations (see Lucas, 2003).
Statistical validity means that we used the proper statistical procedure for a particular purpose and have met the procedure’s mathematical requirements. This validity arises because different statistical tests or procedures are appropriate for different situations as is discussed in textbooks on statistical procedures. All statistical procedures rest on assumptions about the mathematical properties of the numbers being used. A statistic will yield nonsense results if we use it for inappropriate situations or seriously violate its assumptions even if the computation of the numbers is correct. This is why we must know the purposes for which a statistical procedure is designed and its assumptions to use it. This is also why computers can do correct computations but produce output that is nonsense.
A GUIDE TO QUANTITATIVE MEASUREMENT
Thus far, we have discussed principles of measurement. Quantitative researchers have specialized measures that assist in the process of creating operational definitions for reliable and valid measures. This section of the chapter is a brief guide to these ideas and a few of the specific measures.
Levels of Measurement
We can array possible measures on a continuum. At one end are “higher” ones. These measures contain a great amount of highly specific information with many exact and refined distinctions. At the opposite end are “lower” ones. These are rough, less precise measures with minimal information and a few basic distinctions. The level of measurement affects how much we can learn when we measure features of the social world and limits the types of indicators we can use as we try to capture empirical details about a construct.
The level of measurement is determined by how refined, exact, and precise a construct is in our assumptions about it. This means that how we conceptualize a construct carries serious implications. It influences how we can measure the construct and restricts the range of statistical procedures that we can use after we have gathered data. Often we see a trade-off between the level of measurement and the ease of measuring. Measuring at a low level is simpler and easier than it is at a high level; however, a low level of measurement offers us the least refined information and allows the fewest statistical procedures during data analysis. We can look at the issue in two ways: (1) continuous versus discrete variable, and (2) the four levels of measurement.
Levels of measurement
A system for organizing information in the measurement of variables into four levels, from nominal level to ratio level.
Continuous and Discrete Variables.
Variables can be continuous or discrete. Continuous variables contain a large number of values or attributes that flow along a continuum. We can divide a continuous variable into many smaller increments; in mathematical theory, the number of increments is infinite. Examples of continuous variables include temperature, age, income, crime rate, and amount of schooling. For example, we can measure the amount of your schooling as the years of schooling you completed. We can subdivide this into the total number of hours you have spent in classroom instruction and out-of-class assignments or preparation. We could further refine this into the number of minutes you devoted to acquiring and processing information and knowledge in school or due to school assignments. We could further refine this into all of the seconds that your brain was engaged in specific cognitive activities as you were acquiring and processing information.
Variables that are measured on a continuum in which an infinite number of finer gradations between variable attributes are possible.
Discrete variables have a relatively fixed set of separate values or variable attributes. Instead of a smooth continuum of numerous values, discrete variables contain a limited number of distinct categories. Examples of discrete variables include gender (male or female), religion (Protestant, Catholic, Jew, Muslim, atheist), marital status (never-married single, married, divorced or separated, widowed), and academic degrees (high school diploma, community college associate degree, four-year college degree, master’s or doctoral degree). Whether a variable is continuous or discrete affects its level of measurement.
Variables in which the attributes can be measured with only a limited number of distinct, separate categories.
Four Levels of Measurement.
Levels of measurement build on the difference between continuous and discrete variables. Higher level measures are continuous and lower level ones are discrete. The four levels of measurement categorize a measure’s degree of precision.14
Deciding on the appropriate level of measurement for a construct is not always easy. It depends on two things: how we understand a construct (its definition and assumptions), and the type of indicator or measurement procedure.
The way we conceptualize a construct can limit how precisely we can measure it. For example, we might reconceptualize as discrete some of the variables listed earlier as continuous. We can think of temperature as a continuous variable with thousands of refined distinctions (e.g., degrees and fractions of degrees). Alternatively, we can think of it more crudely as five discrete categories (e.g., very hot, hot, cool, cold, very cold). We can think of age as continuous (in years, months, days, hours, minutes, or seconds) or as discrete categories (infancy, childhood, adolescence, young adulthood, middle age, old age).
While we can convert continuous variables into discrete ones, we cannot go the other way around, that is, convert discrete variables into continuous ones. For example, we cannot turn sex, religion, and marital status into continuous variables. We can, however, treat related constructs with slightly different definitions and assumptions as being continuous (e.g., amount of masculinity or femininity, degree of religiousness, commitment to a marital relationship). There is a practical reason to conceptualize and measure at higher levels of measurement: We can collapse higher levels of measurement to lower levels, but the reverse is not true.
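The one-way nature of this conversion can be made concrete in code. This sketch uses arbitrary, illustrative cut points to collapse ratio-level ages into ordinal categories; the reverse operation is impossible because the exact ages are discarded in the collapse.

```python
# Hedged sketch: collapsing a ratio-level variable (age in years) into
# ordinal categories. The cut points are arbitrary illustrations; once
# collapsed, the exact ages cannot be recovered.
def age_category(age_years):
    if age_years < 13:
        return "childhood"
    elif age_years < 20:
        return "adolescence"
    elif age_years < 40:
        return "young adulthood"
    elif age_years < 65:
        return "middle age"
    return "old age"

ages = [7, 16, 25, 50, 70]                     # ratio level: true zero, exact distances
categories = [age_category(a) for a in ages]   # ordinal level: rank order only
```

Knowing only that someone falls in “middle age” cannot tell us whether they are 41 or 64, which is why it pays to collect data at the highest level of measurement available.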
table 7.2 Characteristics of the Four Levels of Measurement

level      different categories   ranked   distance between categories measured   true zero
Nominal    Yes
Ordinal    Yes                    Yes
Interval   Yes                    Yes      Yes
Ratio      Yes                    Yes      Yes                                    Yes
Distinguishing among the Four Levels.
The four levels from lowest to highest precision are nominal, ordinal, interval, and ratio. Each level provides a different type of information (see Table 7.2). Nominal-level measurement indicates only that a difference exists among categories (e.g., religion: Protestant, Catholic, Jew, Muslim; racial heritage: African, Asian, Caucasian, Hispanic, other). Ordinal-level measurement indicates a difference and allows us to rank order the categories (e.g., letter grades: A, B, C, D, F; opinion measures: strongly agree, agree, disagree, strongly disagree). Interval-level measurement does everything the first two do and allows us to specify the amount of distance between categories (e.g., Fahrenheit or Celsius temperature: 5°, 45°, 90°; IQ scores: 95, 110, 125). Ratio-level measurement does everything the other levels do, and it has a true zero. This feature makes it possible to state relationships in terms of proportions or ratios (e.g., money income: $10, $100, $500; years of formal schooling: 1, 10, 13). In most practical situations, the distinction between interval and ratio levels makes little difference.
The lowest, least precise level of measurement for which there is a difference in type only among the categories of a variable.
A level of measurement that identifies a difference among categories of a variable and allows the categories to be rank ordered as well.
A level of measurement that identifies differences among variable attributes, ranks categories, and measures distance between categories but has no true zero.
The highest, most precise level of measurement; variable attributes can be rank ordered, the distance between them precisely measured, and there is an absolute zero.
One source of confusion is that we sometimes use arbitrary zeros in interval measures, but the zeros are only there to help keep score. For example, a rise in temperature from 30 to 60 degrees is not really a doubling of the temperature, although the numbers appear to double. Zero degrees Fahrenheit or Celsius is not the absence of all heat but just a placeholder to make counting easier. For example, water freezes at 32° on the Fahrenheit scale, 0° on the Celsius or centigrade scale, and 273.15 on the Kelvin scale. Water boils at 212°, 100°, or 373.15, respectively. If there were a true zero, the actual relation among temperature numbers would be a ratio. For example, 25° to 50° Fahrenheit would be "twice as warm," but this is not true because a ratio relationship does not exist without a true zero. We can see this in the ratio of boiling to freezing water temperatures: 6.625 in Fahrenheit, undefined in Celsius (freezing sits at the arbitrary zero), and 1.366 in Kelvin. Only the Kelvin scale has an absolute zero (the absence of all heat), so only its ratio corresponds to actual physical conditions. While this physical world example may be familiar, another example of arbitrary—not true—zeros occurs when measuring attitudes with numbers. We may assign a value to statements in a survey questionnaire (e.g., –1 = disagree, 0 = no opinion, +1 = agree). Just because our data are in the form of numbers does not allow us to use statistical procedures that require the mathematical assumption of a true zero.
Discrete variables are nominal and ordinal, whereas we can measure continuous variables at the interval or ratio level. There is an interesting unidirectional relationship among the four levels. We can convert a ratio-level measure into the interval, ordinal, or nominal level; an interval level into an ordinal or nominal level; and an ordinal into a nominal level; but the process does not work in the opposite way! This happens because higher levels of measurement contain more refined information than lower levels. We can always toss out or ignore the refined information of a high-level measure, but we cannot squeeze additional refined information out of a low-level measure.
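The one-way nature of this conversion can be sketched in code. This is a minimal illustration, and the category labels and cut points below are invented for the example, not taken from the text:

```python
# Sketch: collapsing a ratio-level measure (age in years) down to
# ordinal and then nominal levels. Labels and cut points are illustrative.

def age_to_ordinal(age_years: int) -> str:
    """Ratio -> ordinal: ranked life-stage categories."""
    if age_years < 13:
        return "childhood"
    elif age_years < 20:
        return "adolescence"
    elif age_years < 40:
        return "young adulthood"
    elif age_years < 65:
        return "middle age"
    return "old age"

def age_to_nominal(age_years: int) -> str:
    """Collapse further to an unranked (nominal) distinction."""
    return "minor" if age_years < 18 else "adult"

ages = [7, 16, 34, 52, 71]
print([age_to_ordinal(a) for a in ages])
print([age_to_nominal(a) for a in ages])
# The reverse conversion is impossible: from "middle age" alone we
# cannot recover the exact age in years.
```

Collapsing always discards refined information, which is exactly why the process cannot be run in reverse.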
For ordinal measures, we generally want at least five ordinal categories and many observations for each. A distortion occurs when we collapse a continuous construct into a small number of ordered categories; we minimize the distortion as the number of ordinal categories and the number of observations increase.15 (See Example Box 7.2, Example of Four Levels of Measurement.)
Before continuing, keep two things in mind. First, we can measure nearly any social phenomenon. We can measure some constructs directly and create precise numerical values (e.g., family income) while other constructs are less precise and require the use of surrogates or proxies to indirectly measure a variable (e.g., predisposition to commit a crime). Second, we can learn a great deal from the measures created by other researchers. We are fortunate to have the work of other researchers to draw on. It is not always necessary to start from scratch. We can use a past scale or index or modify it for our own purposes. Measuring aspects of social life is an ongoing process. We are constantly creating ideas, refining theoretical definitions, and improving measures of old or new constructs.
example box 7.2 Example of Four Levels of Measurement
variable (level of measurement) how variable is measured
Religion (nominal) Different religious denominations (Jewish, Catholic, Lutheran, Baptist) are not ranked but are only different (unless one belief is conceptualized as closer to heaven).
Attendance (ordinal) “How often do you attend religious services? (0) Never, (1) less than once a year, (3) several times a year, (4) about once a month, (5) two or three times a week, or (8) several times a week.” This might have been measured at a ratio level if the exact number of times a person attended were asked instead.
IQ score (interval) Most intelligence tests are organized with 100 as average, middle, or normal. Scores higher or lower indicate distance from the average. Someone with a score of 115 has somewhat above average measured intelligence for people who took the test, whereas 90 is slightly below. Scores of below 65 or above 140 are rare.
Age (ratio) Age is measured by years. There is a true zero (birth). Note that a 40-year-old has lived twice as long as a 20-year-old.
Principles of Good Measurement.
Three features of good measurement, whether we are considering a single indicator or a scale or index (discussed next) to measure a variable, are that (1) the attributes or categories of a variable should be mutually exclusive, (2) they should also be exhaustive, and (3) the measurement should be unidimensional.
1. Mutually exclusive attributes means that an individual or a case will go into one and only one variable category. For example, we wish to measure the variable type of religion using the four attributes Christian, non-Christian, Jewish, and Muslim. Our measure is not mutually exclusive. Both Islam and Judaism are non-Christian religious faiths. A Jewish person and a Muslim fit into two categories: (1) the non-Christian and (2) the Jewish or Muslim category. Another example without mutually exclusive attributes is to measure the type of city using the three categories of river port city, state capital, and access to an international airport. A city could be all three (a river port state capital with an international airport), any combination of the three, or none of the three. To have mutually exclusive attributes, we must create categories so that cases cannot be placed into more than one category.
Mutually exclusive attribute
The principle that variable attributes or categories in a measure are organized so that responses fit into only one category and there is no overlap.
2. Exhaustive attributes means that every case has a place to go, that is, fits into at least one of a variable's categories. Returning to the example of the variable religion, with the four categorical attributes of Christian, non-Christian, Jewish, and Muslim, say we drop the non-Christian category to make the attributes mutually exclusive: Christian, Jewish, or Muslim. These are not exhaustive attributes. The Buddhist, Hindu, atheist, and agnostic do not fit anywhere. We must create attributes to cover every possible situation. For example, Christian, Jewish, Muslim, or Other attributes for religion would be both exhaustive and mutually exclusive.
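A coding scheme can be checked mechanically against these two principles. The sketch below uses the religion example from the text; the classifier and the sample responses are invented for illustration:

```python
# Check that a coding scheme is mutually exclusive (each case fits
# exactly one category) and exhaustive (every case fits somewhere).

def matching_categories(respondent: str) -> list:
    """Return every category a respondent could be placed in."""
    matches = []
    if respondent in ("Catholic", "Protestant", "Lutheran", "Baptist"):
        matches.append("Christian")
    if respondent == "Jewish":
        matches.append("Jewish")
    if respondent == "Muslim":
        matches.append("Muslim")
    if not matches:
        # The residual "Other" category is what makes the scheme
        # exhaustive: Buddhists, Hindus, atheists, etc. all fit here.
        matches.append("Other")
    return matches

sample = ["Catholic", "Jewish", "Muslim", "Buddhist", "atheist"]
for person in sample:
    cats = matching_categories(person)
    # Exactly one match per case => mutually exclusive AND exhaustive.
    assert len(cats) == 1, f"{person} fits {len(cats)} categories"
print("Every case fits exactly one category.")
```

Had we kept a "non-Christian" category alongside "Jewish" and "Muslim," a Muslim respondent would match twice and the assertion would fail, which is precisely the violation the text describes.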
The principle that attributes or categories in a measure should provide a category for all possible responses.
3. Unidimensionality means that a measure fits together or measures one single, coherent construct. Unidimensionality was hinted at in the previous discussions of construct and content validity. Unidimensionality states that if we combine several specific pieces of information into a single score or measure, all of the pieces should measure the same thing. We sometimes use a more advanced technique—factor analysis—to test for the unidimensionality of data.
The principle that when using multiple indicators to measure a construct, all indicators should consistently fit together and indicate a single construct.
We may see an apparent contradiction between the idea of using multiple indicators or a scale or index (see next section) to capture diverse parts of a complex construct and the criterion of unidimensionality. The contradiction is only apparent because constructs vary theoretically by level of abstraction. We may define a complex, abstract construct using multiple subdimensions, each being a part of the complex construct's overall content. In contrast, simple, low-level constructs that are concrete typically have just one dimension. For example, "feminist ideology" is a highly abstract and complex construct. It includes specific beliefs and attitudes toward social, economic, political, family, and sexual relations. The ideology's belief areas are parts of the single, more abstract and general construct. The parts fit together as a whole. They are mutually reinforcing and collectively form one set of beliefs about the dignity, strength, and power of women. To create a unidimensional measure of feminist ideology requires us to conceptualize it as a unified belief system that might vary from very antifeminist to very profeminist. We can test the convergent validity of our measure with multiple indicators that tap the construct's subparts. If one belief area (e.g., sexual relations) is consistently distinct from all other areas in empirical tests, then we question its unidimensionality.
It is easy to become confused about unidimensionality because an indicator we use for a simple construct in one situation might indicate one part of a different, complex construct in another situation. We can combine multiple simple, concrete constructs into a complex, more abstract construct. The principle of unidimensionality in measurement means that for us to measure a construct, we must conceptualize it as one coherent, integrated core idea for its level of abstraction. This shows the way that the processes of conceptualization and measurement are tightly interwoven.
Here is a specific example. A person's attitude about gender equality with regard to getting equal pay for work is a simpler, more specific, and less abstract idea than gender ideology (i.e., a general set of beliefs about gender relations in all areas of life). We might measure attitude regarding equal pay as a unidimensional construct on its own or as a less abstract subpart of the complex, broader construct of gender ideology. This does not mean that gender ideology ceases to be unidimensional. It is a complex idea with several parts but can be unidimensional at a more abstract level.
SCALES AND INDEXES
In this section, we look at scales and indexes, specialized measures from among the hundreds created by researchers.16 We have scales and indexes to measure many things: the degree of formalization in bureaucratic organizations, the prestige of occupations, the adjustment of people to a marriage, the intensity of group interaction, the level of social activity in a community, the degree to which a state’s sexual assault laws reflect feminist values, and the level of socioeconomic development of a nation. We will examine principles of measurement, consider principles of index and scale construction, and then explore a few major types of index and scale.
You might find the terms index and scale confusing because people use them interchangeably. One researcher's scale is another's index. Both produce ordinal- or interval-level measures. To add to the confusion, we can combine scale and index techniques into a single measure. Nonetheless, scales and indexes are very valuable. They give us more information about a variable and expand the quality of measurement (i.e., increase reliability and validity) over using a simple, single-indicator measure. Scales and indexes also aid in data reduction by condensing and simplifying information (see Expansion Box 7.3, Scales and Indexes: Are They Different?).
expansion box 7.3 Scales and Indexes: Are They Different?
For most purposes, researchers can treat scales and indexes as being interchangeable. Social researchers do not use a consistent nomenclature to distinguish between them.
A scale is a measure in which a researcher captures the intensity, direction, level, or potency of a variable construct and arranges responses or observations on a continuum. A scale can use a single indicator or multiple indicators. Most are at the ordinal level of measurement.
An index is a measure in which a researcher adds or combines several distinct indicators of a construct into a single score. This composite score is often a simple sum of the multiple indicators. It is used for content and convergent validity. Indexes are often measured at the interval or ratio level.
Researchers sometimes combine the features of scales and indexes in a single measure. This is common when a researcher has several indicators that are scales (i.e., that measure intensity or direction). He or she then adds these indicators together to yield a single score, thereby creating an index.
You hear about indexes all the time. For example, U.S. newspapers report the Federal Bureau of Investigation (FBI) crime index and the consumer price index (CPI). The FBI index is the sum of police reports on seven so-called index crimes (criminal homicide, aggravated assault, forcible rape, robbery, burglary, larceny of $50 or more, and auto theft). The index began as part of the Uniform Crime Report in 1930 (see Rosen, 1995). The CPI, which is a measure of inflation, is created by totaling the cost of buying a list of goods and services (e.g., food, rent, and utilities) and comparing the total to the cost of buying the same list in the previous period. The CPI has been used by the U.S. Bureau of Labor Statistics since 1919; wage increases, union contracts, and social security payments are based on it. An index is a combination of items into a single numerical score. Various components or subparts of a construct are each measured and then combined into one measure.
The summing or combining of many separate measures of a construct or variable to create a single score.
There are many types of indexes. For example, the total number of questions correct on an exam with 25 questions is a type of index. It is a composite measure in which each question measures a small piece of knowledge and all questions scored correct or incorrect are totaled to produce a single measure. Indexes measure the most desirable place to live (based on unemployment, commuting time, crime rate, recreation opportunities, weather, and so on), the degree of crime (based on combining the occurrence of different specific crimes), the mental health of a person (based on the person’s adjustment in various areas of life), and the like.
Creating indexes is so easy that we must be careful to check that every item in an index has face validity and to exclude any item that lacks it. We want to measure each part of the construct with at least one indicator. Of course, it is better to measure the parts of a construct with multiple indicators.
An example of an index is a college quality index (see Example Box 7.3, Example of Index). A theoretical definition says that a high-quality college has six distinguishing characteristics: (1) few students per faculty member, (2) a highly educated faculty, (3) high number of books in the library, (4) few students dropping out of college, (5) many students who go on to seek advanced degrees, and (6) faculty members who publish books or scholarly articles. We score 100 colleges on each item and then add the scores for each to create an index score of college quality that can be used to compare colleges.
We can combine indexes. For example, to strengthen my college quality index, I add a subindex on teaching quality. The index contains eight items: (1) average size of classes, (2) percentage of class time devoted to discussion, (3) number of different classes each faculty member teaches, (4) availability of faculty to students outside the classroom, (5) currency and amount of reading assigned, (6) degree to which assignments promote learning, (7) degree to which faculty get to know each student, and (8) student ratings of instruction. Similar subindex measures can be created for other parts of the college quality index. They can be combined into a more global measure of college quality. This further elaborates the definition of the construct “quality of college.”
Next we look at three issues involved when we construct an index: weight of items, missing data, and the use of rates and standardization.
1. Weighting is an important issue in index construction. Unless otherwise stated, we assume that the items in an index are unweighted. Likewise, unless we have a good theoretical reason for assigning different weights to items, we use equal weights. An unweighted index gives each item equal weight. We simply sum the items without modification, as if each were multiplied by 1 (or –1 for items that are negative). A weighted index values or weights some items more than others. The size of weights can come from theoretical assumptions, the theoretical definition, or a statistical technique such as factor analysis.
For example, we can elaborate the theoretical definition of the college quality index. We decide that the student/faculty ratio and the percentage of faculty with Ph.D.s are twice as important as the number of books in the library per student or the percentage of students pursuing advanced degrees. Also, the percentage of freshmen who drop out and the number of publications per faculty member are three times more important than books in the library or percentage of students pursuing an advanced degree. This is easier to see when it is expressed as a formula (refer to Example Box 7.3).
The number of students per faculty member and the percentage who drop out have negative signs because, as they increase, the quality of the college declines. The weighted and unweighted indexes can produce different results. Consider Old Ivy College, Local College, and Big University. All have identical unweighted index scores, but the colleges have different quality scores after weighting.
example box 7.3 Example of Index
A quality-of-college index is based on the following six items. In symbolic form, where:
• Q = overall college quality
• R = number of students per faculty member
• F = percentage of faculty with Ph.D.s
• B = number of books in library per student
• D = percentage of freshmen who drop out or do not finish
• A = percentage of graduates who seek an advanced degree
• P = number of publications per faculty member
Unweighted formula: (–1) R + (1) F + (1) B + (–1) D + (1) A + (1) P = Q
Weighted formula: (–2) R + (2) F + (1) B + (–3) D + (1) A + (3) P = Q
Old Ivy College
Unweighted: (–1) 13 + (1) 80 + (1) 334 + (–1) 14 + (1) 28 + (1) 4 = 419
Weighted: (–2) 13 + (2) 80 + (1) 334 + (–3) 14 + (1) 28 + (3) 4 = 466
Local College
Unweighted: (–1) 20 + (1) 82 + (1) 365 + (–1) 25 + (1) 15 + (1) 2 = 419
Weighted: (–2) 20 + (2) 82 + (1) 365 + (–3) 25 + (1) 15 + (3) 2 = 435
Big University
Unweighted: (–1) 38 + (1) 95 + (1) 380 + (–1) 48 + (1) 24 + (1) 6 = 419
Weighted: (–2) 38 + (2) 95 + (1) 380 + (–3) 48 + (1) 24 + (3) 6 = 392
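The box's arithmetic can be reproduced directly. A minimal sketch using only the item values and weights given above:

```python
# Recompute the college quality index from Example Box 7.3.
# Item order follows the box: R, F, B, D, A, P.

UNWEIGHTED = (-1, 1, 1, -1, 1, 1)
WEIGHTED   = (-2, 2, 1, -3, 1, 3)

colleges = {
    "Old Ivy College": (13, 80, 334, 14, 28, 4),
    "Local College":   (20, 82, 365, 25, 15, 2),
    "Big University":  (38, 95, 380, 48, 24, 6),
}

def index_score(items, weights):
    """Sum each item multiplied by its weight."""
    return sum(w * x for w, x in zip(weights, items))

for name, items in colleges.items():
    print(name,
          index_score(items, UNWEIGHTED),  # 419 for all three colleges
          index_score(items, WEIGHTED))    # 466, 435, and 392
```

Running this confirms the point of the example: the three colleges are indistinguishable on the unweighted index (all score 419) yet clearly ranked once theoretically motivated weights are applied.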
Weighting produces different index scores in this example, but in most cases, weighted and unweighted indexes yield similar results. Researchers are concerned with the relationship between variables, and weighted and unweighted indexes usually give similar results for the relationships between variables.17
2. Missing data can be a serious problem when constructing an index. Validity and reliability are threatened whenever data for some cases are missing. There are four ways to attempt to resolve the problem (see Expansion Box 7.4, Ways to Deal with Missing Data), but none fully solves it.
For example, I construct an index of the degree of societal development in 1985 for 50 nations. The index contains four items: life expectancy, percentage of homes with indoor plumbing, percentage of population that is literate, and number of telephones per 100 people. I locate a source of United Nations statistics for my information. The values for Belgium are 68 + 87 + 97 + 28 and for Turkey are 55 + 36 + 49 + 3; for Finland, however, I discover that literacy data are unavailable. I check other sources of information, but none has the data because they were not collected.
3. Rates and standardization are related ideas. You have heard of crime rates, rates of population growth, or the unemployment rate. Some indexes and single-indicator measures are expressed as rates. Rates involve standardizing the value of an item to make comparisons possible. The items in an index frequently need to be standardized before they can be combined.
expansion box 7.4 Ways to Deal with Missing Data
• 1. Eliminate all cases for which any information is missing. If one nation in the discussion is removed from the study, the index will be reliable for the nations on which information is available. This is a problem if other nations have missing information. A study of 50 nations may become a study of 20 nations. Also, the cases with missing information may be similar in some respect (e.g., all are in eastern Europe or in the Third World), which limits the generalizability of findings.
• 2. Substitute the average score for cases in which data are present. The average literacy score from the other nations is substituted. This “solution” keeps Finland in the study but gives it an incorrect value. For an index with few items or for a case that is not “average,” this creates serious validity problems.
• 3. Insert data based on nonquantitative information about the case. Other information about Finland (e.g., percentage of 13- to 18-year-olds in high school) is used to make an informed guess about the literacy rate. This "solution" is marginally acceptable in this situation. It is not as good as measuring Finland's literacy, and it relies on an untested assumption—that the literacy rate can be predicted from the high school attendance rate.
• 4. Insert a random value. This is unwise for the development index example. It might be acceptable if the index had a very large number of items and the number of cases was very large. If that were the situation, however, then eliminating the case is probably a better “solution” that produces a more reliable measure.
Source: Allison (2001).
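Strategies 1 and 2 from the box are easy to express in code. The sketch below uses the development-index example; Belgium's and Turkey's values are those reported in the text, but Finland's non-missing values are invented placeholders, since the text does not report them:

```python
# Two missing-data strategies from Expansion Box 7.4, applied to the
# four-item development index (life expectancy, plumbing, literacy,
# telephones). None marks the missing value.

nations = {
    "Belgium": [68, 87, 97, 28],
    "Turkey":  [55, 36, 49, 3],
    "Finland": [69, 88, None, 33],  # literacy missing; other values hypothetical
}
LITERACY = 2  # position of the missing item

# Strategy 1: eliminate all cases with any missing information.
complete = {name: vals for name, vals in nations.items() if None not in vals}

# Strategy 2: substitute the average score from cases with data present.
known = [v[LITERACY] for v in nations.values() if v[LITERACY] is not None]
mean_literacy = sum(known) / len(known)  # (97 + 49) / 2 = 73.0
imputed = {name: [mean_literacy if x is None else x for x in vals]
           for name, vals in nations.items()}

print(sorted(complete))              # Finland is dropped entirely
print(imputed["Finland"][LITERACY])  # Finland kept, but with a guessed value
```

The code makes the trade-off concrete: strategy 1 shrinks the study, while strategy 2 keeps Finland in the analysis at the cost of assigning it a value we know is probably wrong.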
Standardization involves selecting a base and dividing a raw measure by the base. For example, City A had ten murders and City B had thirty murders in the same year. In order to compare murders in the two cities, we will need to standardize the raw number of murders by the city population. If the cities are the same size, City B is more dangerous. But City B may be safer if it is much larger. For example, if City A has 100,000 people and City B has 600,000, then the murder rate per 100,000 is ten for City A and five for City B.
Procedures to adjust measures statistically to permit making an honest comparison by giving a common basis to measures of different units.
Standardization makes it possible for us to compare different units on a common base. The process of standardization, also called norming, removes the effect of relevant but different characteristics in order to make the important differences visible. For example, there are two classes of students. An art class has twelve smokers and a biology class has twenty-two smokers. We can compare the rate or incidence of smokers by standardizing the number of smokers by the size of the classes. The art class has 32 students and the biology class has 143 students. One method of standardization that you already know is the use of percentages, whereby measures are standardized to a common base of 100. In terms of percentages, it is easy to see that the art class has more than twice the rate of smokers (37.5 percent) than the biology class (15.4 percent).
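Both examples above reduce to the same operation: divide a raw count by the chosen base and multiply by a common standard. A minimal sketch using the figures from the text:

```python
# Standardize raw counts on a common base: murders per 100,000
# residents, and smokers as a percentage (per 100 students).

def rate(count, base, per=100_000):
    """Raw count standardized to a common base of `per` units."""
    return count / base * per

# City murder rates (per 100,000 residents)
print(rate(10, 100_000))   # City A: 10.0 per 100,000
print(rate(30, 600_000))   # City B: 5.0 per 100,000

# Class smoking rates (percentages, i.e., per 100 students)
print(round(rate(12, 32, per=100), 1))    # art class: 37.5
print(round(rate(22, 143, per=100), 1))   # biology class: 15.4
```

The standardized figures reverse the impression given by the raw counts: City B has three times as many murders but half the murder rate, and the small art class has more than twice the smoking rate of the much larger biology class.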
A critical question in standardization is deciding what base to use. In the examples given, how did I know to use city size or class size as the base? The choice is not always obvious; it depends on the theoretical definition of a construct. Different bases can produce different rates. For example, the unemployment rate can be defined as the number of people in the workforce who are out of work. The overall unemployment rate is:

Unemployment rate = Number of unemployed people / Total number of people in the labor force
We can divide the total population into subgroups to get rates for subgroups in the population such as White males, African American females, African American males between the ages of 18 and 28, or people with college degrees. Rates for these subgroups may be more relevant to the theoretical definition or research problem. For example, we may believe that unemployment is an experience that affects an entire household or family and that the base should therefore be households, not individuals. The rate will look like this:

Unemployment rate = Number of households with at least one unemployed person / Total number of households
Different conceptualizations suggest different bases and different ways to standardize. When combining several items into an index, it is best to standardize items on a common base (see Example Box 7.4, Standardization and the Real Winners at the 2000 Olympics).
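The effect of choosing a different base can be shown with hypothetical counts; none of the figures below appear in the text, and the two formulas simply follow the individual-base and household-base definitions of the unemployment rate:

```python
# Hypothetical counts illustrating how the choice of base changes
# the unemployment rate.

unemployed_people = 6_000
labor_force = 100_000

households_with_unemployed = 4_500
total_households = 40_000

# Individual base: unemployed people / labor force
individual_rate = unemployed_people / labor_force * 100       # 6.0 percent

# Household base: affected households / all households
household_rate = (households_with_unemployed
                  / total_households * 100)                   # 11.25 percent

print(individual_rate, household_rate)
```

With these invented numbers, the same labor market looks nearly twice as troubled when households rather than individuals form the base, which is why the theoretical definition must drive the choice.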
We often use scales when we want to measure how an individual feels or thinks about something. Some call this the hardness or potency of feelings. Scales also help in the conceptualization and operationalization processes. For example, you believe a single ideological dimension underlies people’s judgments about specific policies (e.g., housing, education, foreign affairs). Scaling can help you determine whether a single construct—for instance, “conservative/liberal ideology”—underlies the positions that people take on specific policies.
Scaling measures the intensity, direction, level, or potency of a variable. Graphic rating scales are an elementary form of scaling. People indicate a rating by checking a point on a line that runs from one extreme to another. This type of scale is easy to construct and use. It conveys the idea of a continuum, and assigning numbers helps people think about quantities. Scales assume that people with the same subjective feeling mark the graphic scale at the same place. Figure 7.7 is an example of a "feeling thermometer" scale that is used to find out how people feel about various groups in society (e.g., the National Organization for Women, the Ku Klux Klan, labor unions, physicians). Political scientists have used this type of measure in the national election study since 1964 to measure attitudes toward candidates, social groups, and issues.18
A class of quantitative data measures often used in survey research that captures the intensity, direction, level, or potency of a variable construct along a continuum; most are at the ordinal level of measurement.
We next look at five commonly used social science scales: the Likert, Thurstone, Bogardus social distance, semantic differential, and Guttman scales. Each illustrates a somewhat different logic of scaling.
figure 7.7 “Feeling Thermometer” Graphic Rating Scale
1. Likert scaling. You have probably used Likert scales; they are widely used in survey research. They were developed in the 1930s by Rensis Likert to provide an ordinal-level measure of a person’s attitude.19 Likert scales are called summated-rating or additive scales because a person’s score on the scale is computed by summing the number of responses he or she gives. Likert scales usually ask people to indicate whether they agree or disagree with a statement. Other modifications are possible; people might be asked whether they approve or disapprove or whether they believe something is “almost always true” (see Example Box 7.5, Examples of Types of Likert Scales on page 228).
A scale often used in survey research in which people express attitudes or other responses in terms of ordinal-level categories (e.g., agree, disagree) that are ranked along a continuum.
example box 7.4 Standardization and the Real Winners at the 2000 Olympics
Sports fans in the United States were jubilant about “winning” at the 2000 Olympics by carrying off the most gold medals. However, because they failed to standardize, the “win” is an illusion. Of course, the world’s richest nation with the third largest population does well in one-on-one competition among all nations. To see what really happened, one must standardize on a base of the population or wealth. Standardization yields a more accurate picture by adjusting the results as if the nations had equal populations and wealth. The results show that the Bahamas, with fewer than 300,000 citizens (smaller than a medium-sized U.S. city), proportionately won the most gold. Adjusted for its population size or wealth, the United States is not even near the top; it appears to be the leader only because of its great size and wealth. Sports fans in the United States can perpetuate the illusion of being at the top only if they ignore the comparative advantage of the United States.
top ten gold medal winning countries at the 2000 olympics in sydney
        unstandardized rank          standardized rank*
rank    country        total         country        total   population   gdp
1       USA            39            Bahamas        1       33.3         20.0
2       Russia         32            Slovenia       2       10.0         10.0
3       China          28            Cuba           11      9.9          50.0
4       Australia      16            Norway         4       9.1          2.6
5       Germany        14            Australia      16      8.6          4.1
6       France         13            Hungary        8       7.9          16.7
7       Italy          13            Netherlands    12      7.6          3.0
8       Netherlands    12            Estonia        1       7.1          20.0
9       Cuba           11            Bulgaria       5       6.0          41.7
10      Britain        11            Lithuania      2       5.4          18.2

For comparison:                      EU15**         80      2.1          0.9
                                     USA            39      1.4          0.4
*Population is gold medals per 10 million people and GDP is gold medals per $10 billion.
**EU15 is the 15 nations of the European Union treated as a single unit.
Source: Adapted from The Economist, October 7, 2000, p. 52. Copyright 2000 by Economist Newspaper Group. Reproduced with permission of Economist Newspaper Group in the format Textbook via Copyright Clearance Center.
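The box's population standardization can be sketched as follows. The population figures are rough year-2000 estimates (in millions) supplied here for illustration; they do not appear in the box itself:

```python
# Standardize gold medal counts by population, following the logic of
# Example Box 7.4. Populations (millions) are rough 2000 estimates.

golds = {"USA": 39, "Australia": 16, "Bahamas": 1}
population_millions = {"USA": 281.0, "Australia": 19.2, "Bahamas": 0.3}

# Gold medals per 10 million people, matching the box's footnote.
per_10m = {country: golds[country] / (population_millions[country] / 10)
           for country in golds}

# Ranked by standardized score rather than raw medal count.
for country, score in sorted(per_10m.items(), key=lambda kv: -kv[1]):
    print(country, round(score, 1))
```

Even with approximate population figures, the reversal is dramatic: the Bahamas' single gold medal translates into roughly 33 golds per 10 million people, while the United States' 39 golds amount to only about 1.4.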
To create a Likert scale, you need a minimum of two categories, such as “agree” and “disagree.” Using only two choices creates a crude measure and forces distinctions into only two categories. It is usually better to use four to eight categories. You can combine or collapse categories after the data have been collected, but once you collect them using crude categories, you cannot make them more precise later. You can increase the number of categories at the end of a scale by adding “strongly agree,” “somewhat agree,” “very strongly agree,” and so forth. You want to keep the number of choices to eight or nine at most. More distinctions than that are not meaningful, and people will become confused. The choices should be evenly balanced (e.g., “strongly agree,” “agree,” “strongly disagree,” “disagree”). Nunnally (1978:521) stated:
• As the number of scale steps is increased from 2 up through 20, the increase in reliability is very rapid at first. It tends to level off at about 7, and after about 11 steps, there is little gain in reliability from increasing the number of steps.
example box 7.5 Examples of Types of Likert Scales
the rosenberg self-esteem scale
All in all, I am inclined to feel that I am a failure:
• (1) Almost always true
• (2) Often true
• (3) Sometimes true
• (4) Seldom true
• (5) Never true
a student evaluation of instruction scale
Overall, I rate the quality of instruction in this course as:
Excellent Good Average Fair Poor
a market research mouthwash rating scale
Brand Dislike Completely Dislike Somewhat Dislike a Little Like a Little Like Somewhat Like Completely
X ____________ ____________ ____________ ____________ ____________ ____________
Y ____________ ____________ ____________ ____________ ____________ ____________
work group supervisor scale
Never Seldom Sometimes Often Always
Lets members know what is expected of them 1 2 3 4 5
Is friendly and approachable 1 2 3 4 5
Treats all unit members as equals 1 2 3 4 5
Researchers have debated about whether to offer a neutral category (e.g., “don’t know,” “undecided,” “no opinion”) in addition to the directional categories (e.g., “disagree,” “agree”). A neutral category implies an odd number of categories.
We can combine several Likert scale items into a composite index if they all measure the same construct. Consider the Index of Equal Opportunity for Women and the Self-Esteem Index created by Sniderman and Hagen (1985) (see Example Box 7.6, Examples of Using the Likert Scale to Create Indexes). In the middle of large surveys, they asked respondents three questions about the position of women. The researchers later scored answers and combined items into an index that ranged from 3 to 15. Respondents also answered questions about self-esteem. Notice that when scoring these items, they scored one item (question 2) in reverse. The reason for switching directions in this way is to avoid the problem of the response set. The response set, also called response style and response bias, is the tendency of some people to answer a large number of items in the same way (usually agreeing) out of laziness or a psychological predisposition. For example, if items are worded so that saying “strongly agree” always indicates self-esteem, we would not know whether a person who always strongly agreed had high self-esteem or simply a habitual tendency to agree with questions. We word statements in alternative directions so that anyone who agrees all the time appears to answer inconsistently or to hold a contradictory opinion.
response set
A tendency to agree with every question in a series rather than carefully thinking through one’s answer to each.
example box 7.6 Examples of Using the Likert Scale to Create Indexes
Sniderman and Hagen (1985) created indexes to measure beliefs about equal opportunity for women and self-esteem. For both indexes, scores were added to create an unweighted index.
index of equal opportunity for women
• 1. Women have less opportunity than men to get the education they need to be hired in top jobs.
Strongly Agree Somewhat Agree Somewhat Disagree Disagree a Great Deal Don’t Know
• 2. Many qualified women cannot get good jobs; men with the same skills have less trouble.
Strongly Agree Somewhat Agree Somewhat Disagree Disagree a Great Deal Don’t Know
• 3. Our society discriminates against women.
Strongly Agree Somewhat Agree Somewhat Disagree Disagree a Great Deal Don’t Know
Scoring: For all items, Strongly Agree = 1, Somewhat Agree = 2, Somewhat Disagree = 4, Disagree a Great Deal = 5, Don’t Know = 3.
Highest Possible Index Score = 15, respondent feels opportunities for women are equal
Lowest Possible Index Score = 3, respondent feels opportunities are not equal
self-esteem index
• 1. On the whole, I am satisfied with myself. Agree Disagree Don’t Know
2. At times, I think I am no good at all. Agree Disagree Don’t Know
3. I sometimes feel that (other) men do not take my opinion seriously. Agree Disagree Don’t Know
Scoring: Items 1 and 3: 1 = Disagree, 2 = Don’t Know, 3 = Agree; Item 2: 3 = Disagree, 2 = Don’t Know, 1 = Agree.
Highest Possible Index Score = 9, high self-esteem
Lowest Possible Index Score = 3, low self-esteem
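A sketch of how the self-esteem items in the box above might be scored, with item 2 reverse-coded to guard against response set. The function name and input format are illustrative; the score mappings follow the box's scoring scheme.

```python
# Hypothetical scoring sketch for the three self-esteem items, with
# item 2 reverse-coded so that habitual agreement does not look
# like high self-esteem. Mappings follow the box's scoring.
FORWARD = {"Agree": 3, "Don't Know": 2, "Disagree": 1}   # items 1 and 3
REVERSED = {"Agree": 1, "Don't Know": 2, "Disagree": 3}  # item 2

def self_esteem_index(answers):
    """answers: responses to items 1-3, in order. Returns 3..9."""
    item1, item2, item3 = answers
    return FORWARD[item1] + REVERSED[item2] + FORWARD[item3]

# A respondent who agrees with everything lands mid-range (3 + 1 + 3),
# not at the top -- blanket agreement does not read as high esteem.
print(self_esteem_index(["Agree", "Agree", "Agree"]))  # 7
```

Reversing one item is what makes the response-set pattern distinguishable from a genuine opinion.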
We often combine many Likert-scaled attitude indicators into an index. Scales and indexes can improve reliability and validity. An index uses multiple indicators, which improves reliability. The use of multiple indicators that measure several aspects of a construct or opinion improves content validity. Finally, the index scores give a more precise quantitative measure of a person’s opinion. For example, we can measure a person’s opinion with a number from 10 to 40 instead of in four categories: “strongly agree,” “agree,” “disagree,” and “strongly disagree.”
Instead of scoring Likert items, as in the previous example, we could use the scores –2, –1, +1, +2. This scoring has an advantage in that a zero implies neutrality or complete ambiguity whereas a high negative number means an attitude that opposes the opinion represented by a high positive number.
The numbers we assign to the response categories are arbitrary. Remember that the use of a zero does not give the scale or index a ratio level of measurement. Likert scale measures are at the ordinal level of measurement because responses indicate only a ranking. Instead of 1 to 4 or –2 to +2, the numbers 100, 70, 50, and 5 would have worked. Also, we should not be fooled into thinking that the distances between the ordinal categories are intervals just because numbers are assigned. The numbers are used for convenience only. The fundamental measurement is only ordinal.20
The real strength of the Likert Scale is its simplicity and ease of use. When we combine several ranked items, we get a more comprehensive multiple indicator measurement. The scale has two limitations: Different combinations of several scale items produce the same overall score, and the response set is a potential danger.
2. Thurstone scaling. This scale is for situations when we are interested in something with many ordinal aspects but would like a measure that combines all information into a single interval-level continuum. For example, a dry cleaning business, Quick and Clean, contacts us; the company wants to identify its image in Greentown compared to that of its major competitor, Friendly Cleaners. We conceptualize a person’s attitude toward the business as having four aspects: attitude toward location, hours, service, and cost. We learn that people see Quick and Clean as having more convenient hours and locations but higher costs and discourteous service. People see Friendly Cleaners as having low cost and friendly service but inconvenient hours and locations. Unless we know how the four aspects relate to the core attitude (image of the dry cleaner), we cannot say which business is generally viewed more favorably. During the late 1920s, Louis Thurstone developed scaling methods for assigning numerical values in such situations. These are now called Thurstone scaling or the method of equal-appearing intervals.21
Thurstone scaling
A measurement process in which the researcher gives a group of judges many items, asks them to sort the items into categories along a continuum, and then uses the sorting results to select items on which the judges agree.
Thurstone scaling uses the law of comparative judgment to address the issue of comparing ordinal attitudes when each person makes a unique judgment. The law anchors or fixes the position of one person’s attitude relative to that of others as each makes an individual judgment. The law of comparative judgment states that we can identify the “most common response” for each object or concept being judged. Although different people arrive at different judgments, the individual judgments cluster around a single most common response. The dispersion of individual judgments around the common response follows a statistical pattern called the normal distribution. According to the law, if many people agree that two objects differ, then the most common responses for the two objects will be distant from each other. By contrast, if many people are confused or disagree, the common responses of the two objects will be closer to each other.
With Thurstone scaling, we develop many statements (e.g., more than 100) regarding the object of interest and then use judges to reduce the number to a smaller set (e.g., 20) by eliminating ambiguous statements. Each judge rates the statements on an underlying continuum (e.g., favorable to unfavorable). We examine the ratings and keep some statements based on two factors: (1) agreement among the judges and (2) the statement’s location on a range of possible values. The final set of statements is a measurement scale that spans a range of values.
Thurstone scaling begins with a large number of statements that cover all shades of opinion. Each statement should be clear and precise. “Good” statements refer to the present and are not capable of being interpreted as facts. They are unlikely to be endorsed by everyone, are stated as simple sentences, and avoid words such as always and never. We can get ideas for writing the statements from reviewing the literature, from the mass media, from personal experience, and from asking others. For example, statements about the dry cleaning business might include the four aspects listed before plus the following:
• ¦ I think X Cleaners dry cleans clothing in a prompt and timely manner.
• ¦ In my opinion, X Cleaners keeps its stores looking neat and attractive.
• ¦ I do not think that X Cleaners does a good job of removing stains.
• ¦ I believe that X Cleaners charges reasonable prices for cleaning coats.
• ¦ I believe that X Cleaners returns clothing clean and neatly pressed.
• ¦ I think that X Cleaners has poor delivery service.
We would next locate 50 to 300 judges who should be familiar with the object or concept in the statements. Each judge receives a set of statement cards and instructions. Each card has one statement on it, and the judges place each card in one of several piles. The number of piles is usually 7, 9, 11, or 13. The piles represent a range of values (e.g., favorable to neutral to unfavorable) with regard to the object or concept being evaluated. Each judge places cards in rating piles independently of the other judges.
After the judges place all cards in piles, we create a chart cross-classifying the piles and the statements. For example, 100 statements and 11 piles results in an 11 × 100 chart, or a chart with 11 × 100 = 1,100 boxes. The number of judges who assigned a rating to a given statement is written into each box. Statistical measures (beyond the present discussion) are used to compute the average rating of each statement and the degree to which the judges agree or disagree. We keep the statements with the highest between-judge agreement, or interrater reliability, as well as statements that represent the entire range of values. (See Example Box 7.7, Example of Thurstone Scaling on page 232.)
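The text leaves the exact statistics aside; as a rough sketch, the mean of the judges' pile placements can locate each statement on the continuum, with the standard deviation standing in for the degree of (dis)agreement among judges. The ratings below are hypothetical.

```python
# Sketch of summarizing judges' pile placements for two statements.
# Piles run 1 (unfavorable) to 11 (favorable); the standard
# deviation is a simple stand-in for the agreement measure.
import statistics

ratings = {
    "statement_1": [2, 1, 2, 3, 2, 2, 1, 2],   # judges cluster: high agreement
    "statement_2": [1, 5, 9, 3, 7, 2, 10, 6],  # judges scatter: low agreement
}

def summarize(piles):
    """Return (scale value, spread); lower spread = more agreement."""
    return statistics.mean(piles), statistics.stdev(piles)

for name, piles in ratings.items():
    value, spread = summarize(piles)
    print(f"{name}: scale value {value:.2f}, spread {spread:.2f}")
```

Statements kept for the final scale would be those with low spread, chosen so that their scale values span the whole continuum from unfavorable to favorable.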
With Thurstone scaling, we can construct an attitude scale or select statements from a larger collection of attitude statements. The method has four limitations:
• ¦ It measures agreement or disagreement with statements but not the intensity of agreement or disagreement.
• ¦ It assumes that judges and others agree on where statements appear in a rating system.
• ¦ It is time consuming and costly.
• ¦ It is possible to get the same overall score in several ways because agreement or disagreement with different combinations of statements can produce the same average.
3. Bogardus social distance scale. A measure of the “social distance” that separates social groups from each other is the Bogardus social distance scale. We use it with one group to learn how much distance its members feel toward a target or “out-group.” Emory Bogardus developed this technique in the 1920s to measure the willingness of members of different ethnic groups to associate with each other. Since then it has been used to see how close or distant people in one group feel toward some other group (e.g., a religious minority or a deviant group).22
Bogardus social distance scale
A scale measuring the social distance between two or more social groups by having members of one group indicate the limit of their comfort with various types of social interaction or closeness with members of the other group(s).
example box 7.7 Example of Thurstone Scaling
Variable Measured: Opinion with regard to the death penalty.
Step 1: Develop 120 statements about the death penalty using personal experience, the popular and professional literature, and statements by others.
• 1. I think that the death penalty is cruel and unnecessary punishment.
• 2. Without the death penalty, there would be many more violent crimes.
• 3. I believe that the death penalty should be used only for a few extremely violent crimes.
• 4. I do not think that anyone was ever prevented from committing a murder because of fear of the death penalty.
• 5. I do not think that people should be exempt from the death penalty if they committed a murder even if they are insane.
• 6. I believe that the Bible justifies the use of the death penalty.
• 7. The death penalty itself is not the problem for me, but I believe that electrocuting people is a cruel way to put them to death.
Step 2: Place each statement on a separate card or sheet of paper and make 100 sets of the 120 statements.
Step 3: Locate 100 persons who agree to serve as judges. Give each judge a set of the statements and instructions to place them in one of 11 piles, from 1 = highly unfavorable statement through 11 = highly favorable statement.
Step 4: The judges place each statement into one of the 11 piles (e.g., Judge 1 puts statement 1 into pile 2; Judge 2 puts the same statement into pile 1; Judge 3 also puts it into pile 2, Judge 4 puts it in pile 3, and so on).
Step 5: Collect piles from judges and create a chart summarizing their responses. See the example chart that follows.
[Chart: number of judges rating each statement in each rating pile]
Step 6: Compute the average rating and degree of agreement by judges. For example, the average for question 1 is about 2, so there is high agreement; the average for question 3 is closer to 5, and there is much less agreement.
Step 7: Choose the final 20 statements to include in the death penalty opinion scale. Choose statements if the judges showed agreement (most placed an item in the same or a nearby pile) and ones that reflect the entire range of opinion, from favorable to neutral to unfavorable.
Step 8: Prepare a 20-statement questionnaire, and ask people in a study whether they agree or disagree with the statements.
The scale has a simple logic. We ask people to respond to a series of ordered statements. We place more socially intimate or close situations at one end and the least socially threatening situations at the opposite end. The scale’s logic assumes that a person who is uncomfortable with another social group and might accept a few nonthreatening (socially distant) situations will express discomfort or refusal regarding the more threatening (socially intimate) situations.
We can use the scale in several ways. For example, we give people a series of statements: People from Group X are entering your country, are in your town, work at your place of employment, live in your neighborhood, become your personal friends, and marry your brother or sister. We ask people whether they feel comfortable with the situation in the statement or the contact is acceptable. We ask people to respond to all statements until they are at a situation with which they do not feel comfortable. No set number of statements is required; the number usually ranges from five to nine.
We can use the Bogardus scale to see how distant people feel from one outgroup versus another (see Example Box 7.8, Example of Bogardus Social Distance Scale). We can use the measure of social distance as either an independent or a dependent variable. For example, we might believe that social distance from a group is highest for people who have some other characteristic, such as education. Our hypothesis might be that White people’s feelings of social distance toward Vietnamese people are negatively associated with education; that is, the least educated Whites feel the most social distance. In this situation, social distance is the dependent variable, and amount of education is the independent variable.
The social distance scale has two potential limitations. First, we must tailor the categories to a specific outgroup and social setting. Second, it is not easy for us to compare how a respondent feels toward several different groups unless the respondent completes a similar social distance scale for all outgroups at the same time. Of course, how a respondent completes the scale and the respondent’s actual behavior in specific social situations may differ.
4. Semantic differential. Developed in the 1950s as an indirect measure of a person’s feelings about a concept, object, or other person, semantic differential measures subjective feelings by using many adjectives because people usually communicate evaluations through adjectives. Most adjectives have polar opposites (e.g., light/dark, hard/soft, slow/fast). The semantic differential attempts to capture evaluations by relying on the connotations of adjectives. In this way, it measures a person’s feelings and evaluations in an indirect manner.
semantic differential
A scale that indirectly measures feelings or thoughts by presenting people a topic or object and a list of polar opposite adjectives or adverbs and then having them indicate feelings by marking one of several spaces between the two adjectives or adverbs.
To use the semantic differential, we offer research participants a list of paired opposite adjectives with a continuum of 7 to 11 points between them. We ask participants to mark the spot on the continuum between the adjectives that best expresses their evaluation or feelings. The adjectives can be very diverse and should be mixed (e.g., positive items should not be located mostly on either the right or the left side). Adjectives in English tend to fall into three major classes of meaning: evaluation (good–bad), potency (strong–weak), and activity (active–passive). Of the three classes, evaluation is usually the most significant.
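Full analysis of semantic differential data uses advanced statistical procedures, as discussed below; a minimal sketch of the first step is simply recoding each mark toward the positive pole and averaging within the three classes of meaning. The adjective pairs and clusters below are illustrative.

```python
# A minimal scoring sketch for a 7-point semantic differential:
# each response (1..7, 7 = right-hand pole) is recentered to -3..+3
# and averaged within the evaluation/potency/activity clusters.
PAIRS = {
    "bad/good": "evaluation",
    "dark/light": "evaluation",
    "weak/strong": "potency",
    "passive/active": "activity",
}

def profile(responses):
    """responses: mapping adjective pair -> marked position 1..7."""
    sums, counts = {}, {}
    for pair, pos in responses.items():
        cluster = PAIRS[pair]
        sums[cluster] = sums.get(cluster, 0) + (pos - 4)  # center at 0
        counts[cluster] = counts.get(cluster, 0) + 1
    return {c: sums[c] / counts[c] for c in sums}

print(profile({"bad/good": 6, "dark/light": 5, "weak/strong": 2,
               "passive/active": 4}))
# → {'evaluation': 1.5, 'potency': -2.0, 'activity': 0.0}
```

Such a profile (positive on evaluation, negative on potency, neutral on activity) is the kind of per-concept summary the candidate example below describes.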
The most difficult part of the semantic differential is analyzing the results. We need to use advanced statistical procedures to do so. Results from the procedures inform us as to how a person perceives different concepts or how people view a concept, object, or person. For example, political analysts might discover that young voters perceive their candidate to be traditional, weak, and slow, and midway between good and bad. Elderly voters perceive the candidate as leaning toward strong, fast, and good, and midway between traditional and modern. In Example Box 7.9, Example of Semantic Differential on page 235, a person rated two concepts. The pattern of responses for each concept illustrates how this individual feels. This person views the two concepts differently and appears to feel negatively about divorce.
example box 7.8 Example of Bogardus Social Distance Scale
A researcher wants to find out how socially distant freshmen college students feel from exchange students from two different countries: Nigeria and Germany. She wants to see whether students feel more distant from students coming from Africa or from Europe. She uses the following series of questions in an interview:
Please give me your first reaction, yes or no, whether you personally would feel comfortable having an exchange student from (name of country):
• ____________ As a visitor to your college for a week
• ____________ As a full-time student enrolled at your college
• ____________ Taking several of the same classes you are taking
• ____________ Sitting next to you in class and studying with you for exams
• ____________ Living a few doors down the hall on the same floor in your dormitory
• ____________ As a same-sex roommate sharing your dorm room
• ____________ As someone of the opposite sex who has asked you to go out on a date
Percentage of Freshmen Who Report Feeling Comfortable
Situation        Nigeria   Germany
Visitor           100%      100%
Enrolled           98       100
Same class         95        98
Study together     82        88
Same dorm          71        83
Roommate           50        76
Go on date         42        64
The results suggest that freshmen feel more distant from Nigerian students than from German students. Almost all feel comfortable having the international students as visitors, enrolled in the college, and taking classes. Feelings of distance increase as interpersonal contact increases, especially if the contact involves personal living settings or activities not directly related to the classroom.
Statistical techniques can create three-dimensional diagrams of results.23 The three aspects are diagrammed in a three-dimensional “semantic space.” In the diagram, “good” is up and “bad” is down, “active” is left and “passive” is right, “strong” is away from the viewer and “weak” is close.
5. Guttman scaling. Also called cumulative scaling, the Guttman scaling index differs from the previous scales or indexes in that we use it to evaluate data after collecting them. This means that we must design a study with the Guttman scaling technique in mind. Louis Guttman developed the scale in the 1940s to determine whether there was a structured relationship among a set of indicators. He wanted to learn whether multiple indicators about an issue had an underlying single dimension or cumulative intensity.24
Guttman scaling index
A scale that researchers use after data are collected to reveal whether a hierarchical pattern exists among responses so that people who give responses at a “higher level” also tend to give “lower level” ones.
example box 7.9 Example of Semantic Differential
Please read each pair of adjectives below and then place a mark on the blank space that comes closest to your first impression feeling. There are no right or wrong answers.
How do you feel about the idea of divorce?
How do you feel about the idea of marriage?
To use Guttman scaling, we begin by measuring a set of indicators or items. These can be questionnaire items, votes, or observed characteristics. We usually measure three to twenty indicators in a simple yes/no or present/absent fashion. We select items for which we believe there could be a logical relationship among all of them. We place the results into a Guttman scale chart and next determine whether there is a hierarchical pattern among items.
After we have the data, we can consider all possible combinations of responses. For example, we have three items: whether a child knows (1) her age, (2) her telephone number, and (3) three local elected political officials. The little girl could know her age but no other answer, or all three, or only her age and telephone number. Three items have eight possible combinations of answers or patterns of responses, from not knowing any through knowing all three. There is a mathematical way to compute the number of combinations (two raised to the number of items, here 2³ = 8); you can also write down all combinations of yes or no for three questions and see the eight possibilities.
An application of Guttman scaling known as scalogram analysis allows us to test whether a patterned hierarchical relationship exists in the data. We can divide response patterns into scaled items and errors (or nonscalable). A scaled pattern for the child’s knowledge example would be as follows: not knowing any item, knowing age only, knowing age plus phone number, and knowing all three. All other combinations of answers (e.g., knowing the political leaders but not her age) are logically possible but nonscalable. If we find a hierarchical relationship, most answers fit into the scalable patterns, and the items are capable of forming a Guttman scale. In a hierarchical pattern, fewer people agree with each higher-order item, but everyone who agrees with a higher-order item also agrees with the lower-order ones, not vice versa. In other words, higher-order items build on the middle-level ones, and middle-level items build on lower ones.
Statistical procedures indicate the degree to which items fit the expected hierarchical pattern. Such procedures produce a coefficient that ranges from zero to 100 percent. A score of zero indicates a random pattern without hierarchical structure; one of 100 percent indicates that all responses fit the hierarchical pattern. Alternative statistics to measure scalability have also been suggested.25 (See Example Box 7.10, Guttman Scale Example.)
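A rough sketch of such a reproducibility-style coefficient is shown below: it counts the share of individual answers that match the nearest perfect Guttman pattern. Counting errors against the closest perfect pattern is one common convention, not the only one, and the data are hypothetical.

```python
# A rough sketch of a reproducibility-style coefficient for Guttman
# scaling. For 3 yes/no items ordered easiest first, the perfect
# scale types are (0,0,0), (1,0,0), (1,1,0), (1,1,1); the
# coefficient is the fraction of answers matching the nearest type.
def reproducibility(responses):
    """responses: list of tuples of 0/1, items ordered easiest first."""
    n_items = len(responses[0])
    scale_types = [tuple(1 if i < k else 0 for i in range(n_items))
                   for k in range(n_items + 1)]
    total = errors = 0
    for resp in responses:
        # count mismatches against the best-fitting perfect pattern
        best = min(sum(a != b for a, b in zip(resp, st))
                   for st in scale_types)
        errors += best
        total += len(resp)
    return 1 - errors / total

# Four scalable respondents and one nonscalable pattern (1, 0, 1):
data = [(1, 1, 1), (1, 1, 0), (1, 0, 0), (0, 0, 0), (1, 0, 1)]
print(round(reproducibility(data), 2))  # 0.93
```

A result near 1.0 corresponds to the "100 percent" end described above, where responses fit the hierarchical pattern; random answering drives the coefficient down.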
example box 7.10 Guttman Scale Example
Crozat (1998) examined public responses to various forms of political protest. He looked at survey data on the public’s acceptance of forms of protest in Great Britain, Germany, Italy, the Netherlands, and the United States in 1974 and 1990. He found that the pattern of the public’s acceptance formed a Guttman scale. Those who accepted more intense forms of protest (e.g., strikes and sit-ins) almost always accepted more modest forms (e.g., petitions or demonstrations), but not all who accepted modest forms accepted the more intense forms. In addition to showing the usefulness of the Guttman scale, Crozat also found that people in different nations saw protest similarly and the degree of Guttman scalability increased over time. Thus, the pattern of acceptance of protest activities was Guttman “scalable” in both time periods, but it more closely followed the Guttman pattern in 1990 than in 1974.
form of protest
Petitions Demonstrations Boycotts Strikes Sit-Ins
N N N N N
Y N N N N
Y Y N N N
Y Y Y N N
Y Y Y Y N
Y Y Y Y Y
Other Patterns (examples only)
N Y N Y N
Y N Y Y N
Y N Y N N
N Y Y N N
Y N N Y Y
Clogg and Sawyer (1981) studied U.S. attitudes toward abortion using Guttman scaling. They examined the different conditions under which people thought abortion was acceptable (e.g., mother’s health in danger, pregnancy resulting from rape). They discovered that 84.2 percent of responses fit into a scaled response pattern.
This chapter discussed the principles and processes of measurement. Central to measurement are conceptualization, in which we refine and clarify ideas into conceptual definitions, and operationalization, in which we develop procedures that link conceptual definitions to empirical reality. How we approach these processes varies depending on whether a study is primarily qualitative or quantitative. In a quantitative study, we usually adopt a more deductive path, whereas in a qualitative study the path is more inductive. Nonetheless, both share the same goal: to establish an unambiguous connection between abstract ideas and empirical data.
The chapter also discussed the principles of reliability and validity. Reliability refers to a measure’s dependability; validity refers to its truthfulness or the fit between a construct and data. In both quantitative and qualitative studies, we try to measure in a consistent way and seek a tight fit between the abstract ideas and the empirical social world. In addition, the principles of measurement are applied in quantitative studies to build indexes and scales. The chapter also discussed some major scales in use.
Beyond the core ideas of reliability and validity, we now know principles of sound measurement: Create clear definitions for concepts, use multiple indicators, and, as appropriate, weight and standardize the data. These principles hold across all fields of study (e.g., family, criminology, inequality, race relations) and across the many research techniques (e.g., experiments, surveys).
As you are probably beginning to realize, a sound research project involves doing a good job in each phase of research. Serious mistakes or sloppiness in any one phase can do irreparable damage to the results, even if the other phases of the research project were conducted in a flawless manner.
bogardus social distance scale
guttman scaling index
level of measurement
mutually exclusive attributes
rules of correspondence
• 1. What are the three basic parts of measurement, and how do they fit together?
• 2. What is the difference between reliability and validity, and how do they complement each other?
• 3. What are ways to improve the reliability of a measure?
• 4. How do the levels of measurement differ from each other?
• 5. What are the differences between convergent, content, and concurrent validity? Can you have all three at once? Explain your answer.
• 6. Why are multiple indicators usually better than one indicator?
• 7. What is the difference between the logic of a scale and that of an index?
• 8. Why is unidimensionality an important characteristic of a scale?
• 9. What are advantages and disadvantages of weighting indexes?
• 10. How does standardization make comparison easier?
Duncan (1984:220–239) presented cautions from a positivist approach on the issue of measuring anything.
The terms concept, construct, and idea are used more or less interchangeably, but their meanings have some differences. An idea is any mental image, belief, or impression. It refers to any vague impression, opinion, or thought. A concept is a thought, a general notion, or a generalized idea about a class of objects. A construct is a thought that is systematically put together, an orderly arrangement of ideas, facts, and impressions. The term construct is used here because its emphasis is on taking vague concepts and turning them into systematically organized ideas.
See Grinnell (1987:5–18) for further discussion.
See Blalock (1982:25–27) and Costner (1985) on the rules of correspondence or the auxiliary theories that connect an abstract concept with empirical indicators. Also see Zeller and Carmines (1980:5) for a diagram that illustrates the place of the rules in the measurement process. In his presidential address to the American Sociological Association in 1979, Hubert Blalock (1979a:882) said, “I believe that the most serious and important problems that require our immediate and concerted attention are those of conceptualization and measurement.”
See Bailey (1984, 1986) for a discussion of the three levels.
See Bohrnstedt (1992a,b) and Carmines and Zeller (1979) for discussions of reliability and its various types.
See Sullivan and Feldman (1979) on multiple indicators. A more technical discussion can be found in Herting (1985), Herting and Costner (1985), and Scott (1968).
See Carmines and Zeller (1979:17). For a discussion of the many types of validity, see Brinberg and McGrath (1982).
The epistemic correlation is discussed in Costner (1985) and in Zeller and Carmines (1980:50–51, 137–139).
Kidder (1982) discussed the issue of disagreements over face validity, such as acceptance of a measure’s meaning by the scientific community but not the subjects being studied.
This was adapted from Carmines and Zeller (1979:20–21).
For a discussion of types of criterion validity, see Carmines and Zeller (1979:17–19) and Fiske (1982) for construct validity.
See Cook and Campbell (1979) for elaboration.
See Borgatta and Bohrnstedt (1980) and Duncan (1984:119–155) for a discussion and critique of the topic of levels of measurement.
Johnson and Creech (1983) examined the measurement errors that occur when variables that are conceptualized as continuous are operationalized in a series of ordinal categories. They argued that errors are not serious if more than four categories and large samples are used.
For compilations of indexes and scales used in social research, see Brodsky and Smitherman (1983), Miller (1991), Robinson and colleagues (1972), Robinson and Shaver (1969), and Schuessler (1982).
For a discussion of weighted and unweighted index scores, see Nunnally (1978:534).
Feeling thermometers are discussed in Wilcox and associates (1989).
For more information on Likert scales, see Anderson and associates (1983:252–255), Converse (1987:72–75), McIver and Carmines (1981:22–38), and Spector (1992).
Some researchers treat Likert scales as interval-level measures, but there is disagreement on this issue. Statistically, it makes little difference provided that the Likert scale has at least five response categories and an approximately even proportion of people answers in each category.
McIver and Carmines (1981:16–21) have an excellent discussion of Thurstone scaling. Also see discussions in Anderson and colleagues (1983:248–252), Converse (1987:66–77), and Edwards (1957). The example used here is partially borrowed from Churchill (1983:249–254), who described the formula for scoring Thurstone scaling.
The social distance scale is described in Converse (1987:62–69). The most complete discussion can be found in Bogardus (1959).
The semantic differential is discussed in Nunnally (1978:535–543). Also see Heise (1965, 1970) on the analysis of scaled data.
See Guttman (1950).
See Bailey (1987:349–351) for a discussion of an improved method for determining scalability called minimal marginal reproducibility. Guttman scaling can involve more than yes/no choices and a large number of items, but the complexity increases quickly. A more elaborate discussion of Guttman scaling can be found in Anderson and associates (1983:256–260), Converse (1987:189–195), McIver and Carmines (1981:40–71), and Nunnally (1978:63–66). Clogg and Sawyer (1981) presented alternatives to Guttman scaling.