Item Analysis
Content Validity
Criterion-Related Validity
Construct Validity
Practical Considerations
100

Before performing item analysis, researchers often remove data from participants who failed to respond appropriately to an attention check item. 

What is an attention check item? Why do researchers remove data from participants who respond inappropriately?

In surveys, researchers often include attention check items (e.g., "Select 'agree' to demonstrate you are paying attention."). Researchers remove responses from individuals who respond incorrectly to such items to help ensure high-quality data. We can't be confident an individual provided high-quality data if they weren't paying close attention.

100

What does it mean for test content to be representative?

Content is covered in a reasonable proportion. The most important content is covered more than the least important content. If everything is equally important, questions should cover content evenly (e.g., 5 questions about item analysis, 5 questions about content validity, 5 questions about criterion-related validity).

100

What is a criterion?

An outcome we expect to be associated with test scores

For example, your performance on this jeopardy game may be associated with your Exam 3 score. Your Exam 3 score would be the criterion.

100

What is a construct?

an attribute, trait, or characteristic that is not directly observable but can be inferred by looking at observable behaviors

Examples: Aggression, Intelligence, Dog Lover, Environmental Activism, Videogame Fan, USF School Spirit, Pop Music Fan, Sports Enthusiast, Job satisfaction

100

What are some controversies relevant to psychological testing? 

arbitrary cut scores, low-quality tests, low-quality inferences and decisions, testing discrimination, testing fairness (and how to handle it)

200

When conducting item analysis, testing professionals may examine Cronbach's alpha with item removed. What is that? When should it be examined? What might suggest you should remove an item?

What is that? Cronbach's alpha with item removed tells you the internal consistency reliability of a scale if an item is removed. 

When should it be examined? When a scale is homogeneous (i.e., the items measure one underlying construct).

What might suggest you should remove an item? You may want to remove an item if Cronbach's alpha with the item removed is high (well above .70), because the scale would still have adequate internal consistency reliability without that item. In particular, if removing an item would actually increase Cronbach's alpha of the scale, that is a strong signal to consider removing it.
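
A minimal numpy sketch of these computations, using made-up 5-point responses (the data and the function names are ours, not from the course):

    import numpy as np

    def cronbach_alpha(items):
        # rows = respondents, columns = items
        items = np.asarray(items, dtype=float)
        k = items.shape[1]
        item_vars = items.var(axis=0, ddof=1)
        total_var = items.sum(axis=1).var(ddof=1)
        return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

    def alpha_if_item_deleted(items):
        # alpha of the scale with each item removed in turn
        items = np.asarray(items, dtype=float)
        return [cronbach_alpha(np.delete(items, j, axis=1))
                for j in range(items.shape[1])]

    # Hypothetical data: 5 respondents x 4 items
    data = [[4, 5, 4, 3], [2, 2, 3, 2], [5, 4, 5, 4], [1, 2, 1, 2], [3, 3, 4, 3]]
    print(cronbach_alpha(data))
    print(alpha_if_item_deleted(data))  # a value above the full-scale alpha flags that item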

200

What 3 questions are relevant to content validity?

Is the test content representative? Does it leave out anything important? Does it measure anything irrelevant?

200

What are the two types of criterion-related validity studies?

predictive validity study - the criterion is measured after the predictor is measured

concurrent validity study - the "predictor" and the criterion are measured at the same time

200

What are convergent evidence of validity and discriminant evidence of validity?

Convergent evidence of validity - test scores are strongly, positively associated with scores on tests measuring similar constructs

Discriminant evidence of validity - test scores are unrelated to scores on tests measuring dissimilar constructs

200

What are some challenges testing professionals face?

- informing clients on why tests take so long to develop and why rigorous procedures are so important

- facing pressure to break promises of confidentiality

300

When conducting item analysis, testing professionals may examine item-total correlations. What are those? When should you examine them? What might suggest you should remove an item?

What are those? Item-total correlations tell you how well an item relates to the average of the other items in a scale. 

When should you examine them? When a scale is homogeneous (i.e., measures one underlying construct). Items in a homogeneous scale should have high item-total correlations because they should be measuring the same thing.

What might suggest you should remove an item? You may want to remove an item in a homogeneous scale with a low item-total correlation.
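
A minimal sketch of the computation (hypothetical data; assumes numpy). Each item is correlated with the mean of the remaining items, i.e., the "corrected" item-total correlation:

    import numpy as np

    def item_total_correlations(items):
        # rows = respondents, columns = items
        items = np.asarray(items, dtype=float)
        r = []
        for j in range(items.shape[1]):
            others = np.delete(items, j, axis=1).mean(axis=1)
            r.append(np.corrcoef(items[:, j], others)[0, 1])
        return r

    data = [[4, 5, 4, 3], [2, 2, 3, 2], [5, 4, 5, 4], [1, 2, 1, 2], [3, 3, 4, 3]]
    print(item_total_correlations(data))  # a low value flags a possible misfit item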

300

What are the 4 steps to establish content evidence of validity before a test is developed?

1. Define the testing universe

2. Develop test specifications

3. Establish a test format

4. Construct test questions

Every step should consider the previous step(s). For example, the test specifications should be based on the testing universe.

300

What is a validity coefficient? How do you interpret it?

Validity coefficient - the correlation between scores on the predictor and the criterion (e.g., the correlation between performance on this jeopardy game and scores on Exam 3). You interpret it just like any other correlation. It is a value between -1 and 1, and both the direction and magnitude of the validity coefficient should be interpreted. 
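
Computing one takes a single numpy call; the scores below are made up purely for illustration:

    import numpy as np

    # Hypothetical: jeopardy-game scores (predictor) and Exam 3 scores (criterion)
    predictor = np.array([12, 15, 9, 20, 17, 11])
    criterion = np.array([78, 85, 70, 95, 88, 74])

    r = np.corrcoef(predictor, criterion)[0, 1]
    print(round(r, 2))  # interpret both the sign (direction) and the size (magnitude)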

300

What is each of the following: monotrait-monomethod correlations; monotrait-heteromethod correlations; heterotrait-monomethod correlations; heterotrait-heteromethod correlations?

Monotrait-monomethod - correlations between scores on the same trait measured with the same method (i.e., reliability)

Monotrait-heteromethod - correlations between scores on the same trait measured with different methods

Heterotrait-monomethod - correlations between scores on different traits measured with the same method

Heterotrait-heteromethod - correlations between scores on different traits measured with different methods
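
A small sketch (the traits and methods are hypothetical) that classifies every cell of a 2-trait x 2-method matrix; note that in a real MTMM matrix the monotrait-monomethod cells hold reliability coefficients:

    # Each measure is a (trait, method) pair -- both chosen for illustration
    measures = [("aggression", "self-report"),
                ("aggression", "peer rating"),
                ("satisfaction", "self-report"),
                ("satisfaction", "peer rating")]

    for i, (t1, m1) in enumerate(measures):
        for t2, m2 in measures[i:]:
            trait = "monotrait" if t1 == t2 else "heterotrait"
            method = "monomethod" if m1 == m2 else "heteromethod"
            print(f"{t1}/{m1} vs {t2}/{m2}: {trait}-{method}")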

300

What is testing bias? What causes it?

Testing bias occurs when one or more groups of individuals are less likely to perform well on a test for reasons that have nothing to do with the construct being measured. Bias arises when a test requires knowledge, skills, or abilities that are irrelevant to the construct being measured (e.g., requiring high-level vocabulary on a math test; including culturally specific knowledge on an intelligence test).

400

What is the difference between interitem correlations and item-total correlations?

Interitem correlations tell you the relationship between scores on an item and scores on every other item in a scale. (e.g., The correlation between scores on Item 1 and Item 2 is .86. The correlation between scores on Item 1 and Item 3 is .92. The correlation between scores on Item 1 and Item 4 is .67.) This provides detailed information at the item level.

Item-total correlations tell you the relationship between scores on an item and the average of scores on the other items in a scale. (e.g., The correlation between scores on Item 1 and average scores on the other items is .86.) This provides a bigger picture overview.
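
To make the contrast concrete, here is a brief sketch (hypothetical data, assuming numpy): the interitem view yields one correlation per pair of items, while the item-total view collapses that to one summary value per item:

    import numpy as np

    items = np.array([[4, 5, 4, 3],   # hypothetical: rows = respondents
                      [2, 2, 3, 2],
                      [5, 4, 5, 4],
                      [1, 2, 1, 2],
                      [3, 3, 4, 3]])

    # Interitem correlations: a full k x k matrix (detailed view)
    print(np.corrcoef(items, rowvar=False).round(2))

    # Item-total correlation for Item 1: one summary value (big-picture view)
    others = items[:, 1:].mean(axis=1)
    print(round(np.corrcoef(items[:, 0], others)[0, 1], 2))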

400

What is the testing universe? What are some ways test developers define the testing universe?

Testing universe - the body of knowledge or behaviors that a test represents

To effectively define the testing universe, you must be an expert on the construct you want to measure. Testing professionals may conduct literature reviews, interview experts, survey experts, etc.

400

What is a common problem in criterion-related validity studies?

range restriction - we lack data on people who are low on the predictor (e.g., because only applicants who scored well were selected); this attenuates (lessens) the validity coefficient
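
A quick simulation (made-up numbers, assuming numpy) shows the attenuation: in a population where the validity coefficient is about .50, keeping only the top half of the predictor distribution shrinks the observed correlation noticeably:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 10_000

    # Simulate a population correlation of ~.50
    predictor = rng.normal(size=n)
    criterion = 0.5 * predictor + rng.normal(scale=np.sqrt(0.75), size=n)

    full_r = np.corrcoef(predictor, criterion)[0, 1]

    # Range restriction: we only observe people high on the predictor (e.g., those hired)
    keep = predictor > np.median(predictor)
    restricted_r = np.corrcoef(predictor[keep], criterion[keep])[0, 1]

    print(round(full_r, 2), round(restricted_r, 2))  # the restricted r is smaller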

400

What types of correlations provide convergent evidence of validity? What types of correlations provide discriminant evidence of validity?

convergent evidence of validity - monotrait-monomethod; monotrait-heteromethod

discriminant evidence of validity - heterotrait-monomethod; heterotrait-heteromethod

400

Suppose you know the answers to all of the questions in this jeopardy game, and you conclude you are a genius. Thoroughly evaluate relevant evidence, and explain the quality of the inference.

Inference: I am a genius.

Evidence based on test content: The content does not representatively capture content relevant to being a genius (it only captures content relevant to psych tests and measurement). It leaves out a lot of things that are important (e.g., verbal reasoning, spatial intelligence, problem-solving). It measures things that are irrelevant (specific knowledge about tests and measurement).

Evidence based on relations with criteria: There is no evidence to suggest that performance on this jeopardy game is associated with genius outcomes (e.g., winning a Nobel Peace Prize, being deemed an expert in your field).

Evidence based on relations with constructs: There is no evidence to suggest that your performance on this jeopardy game is associated with your performance on an IQ test or another measure of genius-ness.

There is no evidence to suggest you are a genius. (You might be, but your score on this jeopardy game is irrelevant.)

500

What is item difficulty? How do you interpret it? What is the ideal average item difficulty of a test?

Item difficulty is the percentage of test-takers who answer a test question correctly (converted to a decimal; e.g., If 90% of test-takers get an item correct, the item difficulty is .90). Values above .9 may indicate an item is too easy (i.e., a large percentage of test-takers are getting it correct). Values below .2 may indicate an item is too hard (i.e., a small percentage of test-takers are getting it correct). 

Items that are too easy or too hard do not help to distinguish between test-takers because everyone tends to answer similarly. You want diversity in item difficulty, with most items having an item difficulty between .2 and .9. Ideally, the average item difficulty is around .5 (50% of test-takers answer correctly) to optimally distinguish between test-takers.
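
Item difficulty is just a column mean of 0/1 scored responses; a minimal sketch with hypothetical data (assuming numpy):

    import numpy as np

    # 1 = correct, 0 = incorrect; rows = test-takers, columns = items
    scored = np.array([[1, 1, 0, 1],
                       [1, 0, 0, 1],
                       [1, 1, 1, 0],
                       [1, 1, 0, 1],
                       [1, 0, 0, 1]])

    p = scored.mean(axis=0)   # item difficulty per item
    print(p)                  # here: [1.0, 0.6, 0.2, 0.8] -- the first item may be too easy
    print(p.mean())           # average difficulty; ~.5 is the usual target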

500

What are 3 ways to examine content evidence of validity after a test is developed? Which of the 3 is the least informative, and why?

1. Experts review and rate how relevant each test item is to the underlying construct(s).

2. Experts match each test item to the construct it seems to be measuring. They should recreate the test developer's test specifications.

3. Ask test-takers the relevance of each test item (i.e., face validity).

Face validity is not strong evidence of validity. Test-takers may not have the expertise to recognize whether an item/question is relevant. However, high face validity is still valuable for gaining test-takers' acceptance of a test.

500

Compare the advantages and disadvantages of objective criteria versus subjective criteria. Consider at least 4 differences.

Objective measures: Less subject to rater biases; Prone to recording errors; Deficient (Miss important performance aspects); Contaminated (Reflect factors outside of individuals’ control)

Subjective measures: Prone to rater biases; Not very prone to recording errors; Capture a broader criterion domain (less deficient); Can be contaminated by rater biases but may be less contaminated by other factors

500

Rank the correlations you would find in a multitrait-multimethod (MTMM) correlation matrix from the smallest you would expect (and hope) to see to the largest.

1. heterotrait-heteromethod correlations

2. heterotrait-monomethod correlations

3. monotrait-heteromethod correlations

4. monotrait-monomethod correlations

500

According to the American Psychological Association (APA), what are the 11 ethical principles relevant to assessments?

Bases of Assessments

Use of Assessments

Informed Consent in Assessments

Release of Test Data

Test Construction

Interpreting Assessment Results

Assessment by Unqualified Persons

Obsolete Tests and Outdated Test Results

Test Scoring and Interpretation Services

Explaining Assessment Results

Maintaining Test Security
