Chapters 4 & 5: The ABCs of Educational Testing

Reliability

Validity

Reliability or Validity?

Measuring

Examples of compromising the data

100

A teacher gives a reading assessment on Monday and then gives the exact same test on Friday to ensure the scores are stable over time.

What is Test-Retest Reliability?

100

A History teacher reviews their final exam to ensure every question aligns specifically with the state standards taught during the semester.

What is Content Validity?

100

A scale that consistently weighs a 5lb bag of flour as 7lbs every single time.

What is Reliable but not Valid?

100

A teacher calculates the difference between a student's "True Score" and their "Observed Score" to identify this.

What is Measurement Error?

100

A fire alarm goes off in the middle of a high-stakes exam, causing scores to fluctuate wildly and inconsistently.

What is a threat to Reliability?

200

To avoid students sharing answers between periods, a math teacher creates "Form A" and "Form B" of the same algebra unit test, ensuring both versions are equally difficult.

What is Alternate-Form Reliability?

200

A high school uses students' PSAT scores to estimate how well those same students will likely perform on the actual SAT six months later.

What is Predictive Validity?

200

If a teacher makes a test extremely long (100+ questions), they are likely trying to increase this specific metric.

What is Reliability?
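
The link between test length and reliability is usually estimated with the Spearman-Brown prophecy formula. The sketch below is illustrative only: the function name and the sample numbers are hypothetical, and the formula assumes the added items are comparable in quality and content to the existing ones.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability after changing test length.

    reliability   -- reliability of the current test (0 to 1)
    length_factor -- new length divided by old length (e.g. 4 = four times as long)
    Assumes the added items behave like the existing ones.
    """
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


# Hypothetical example: a 25-item quiz with reliability 0.60 is expanded to 100 items.
predicted = spearman_brown(0.60, 4)
print(round(predicted, 2))
```

The gain levels off as the test grows, so doubling an already long test adds less reliability than doubling a short one.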

200

This statistical coefficient, often denoted as $r$, is used to show the strength of the relationship between two sets of test scores.

What is a Correlation Coefficient?
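
As a rough illustration, the Pearson form of $r$ can be computed by hand from two sets of scores. The function name and the score lists below are made up for the example; real data would come from the same students taking two administrations or two forms.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products of deviations from each mean
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each set of scores
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("scores must vary in both lists")
    return cov / math.sqrt(ss_x * ss_y)


# Hypothetical Monday and Friday scores for the same five students
monday = [72, 85, 90, 65, 78]
friday = [70, 88, 91, 63, 80]
r = pearson_r(monday, friday)
print(round(r, 2))
```

Values near +1 mean students keep roughly the same rank order across the two sets of scores; values near 0 mean the two sets are essentially unrelated.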

200

A teacher includes a "trick question" with a double negative that confuses students who actually know the material.

What is a threat to Validity?

300

Two different teachers use the same rubric to grade a student’s open-ended essay and arrive at the exact same score.

What is Inter-Rater Reliability?

300

A student complains that a "word problem" on a math test is actually testing their reading level rather than their ability to multiply fractions.

What is Construct-Irrelevant Variance (a threat to Construct Validity)?

300

If a teacher allows a calculator on a test meant to measure "mental math fluency," they have compromised this metric.

What is Validity?

300

A teacher looks at a "Table of Specifications" to ensure that the number of test items matches the amount of instructional time spent on each topic.

What is a tool for ensuring Content Validity?

300

On a multi-page exam, the teacher notices that students' scores significantly drop on the last five questions because they ran out of time.

What is the "Speededness" effect (impacting both Reliability and Validity)?

400

A standardized test developer analyzes if the first half of a science exam yields similar results to the second half for the same group of students.

What is Split-Half Reliability?
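
A minimal sketch of a split-half check, assuming each student's answers are stored as a list of 0/1 item scores: total the first and second halves, correlate the two half-scores, then apply the Spearman-Brown correction to estimate full-length reliability. The function names and the data are hypothetical, and many developers split by odd and even items instead of first and second half.

```python
import math


def _pearson_r(x, y):
    """Pearson correlation between two equal-length lists (assumes both vary)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)


def split_half_reliability(item_scores):
    """Estimate reliability from a first-half vs. second-half split.

    item_scores -- one list of item scores per student, all the same length.
    """
    half = len(item_scores[0]) // 2
    first = [sum(row[:half]) for row in item_scores]
    second = [sum(row[half:]) for row in item_scores]
    r_halves = _pearson_r(first, second)
    # Spearman-Brown correction: each half is only half as long as the full test
    return (2 * r_halves) / (1 + r_halves)


# Hypothetical 0/1 item scores for six students on a six-item exam
scores = [
    [1, 1, 1, 0, 1, 1],
    [1, 0, 1, 1, 0, 1],
    [0, 0, 1, 0, 1, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0],
]
estimate = split_half_reliability(scores)
print(round(estimate, 2))
```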

400

An evaluator checks to see if a new "Social-Emotional Learning" survey actually correlates with other established, proven psychological scales of empathy.

What is Concurrent Validity?

400

This concept is the "ceiling" of the other; a test cannot be more [X] than it is [Y].

What is Reliability (a test cannot be more Valid than it is Reliable)?

400

An assessment is compared to a "Gold Standard" measure already in use by the district to see if the two yield the same ranking of students.

What is Criterion-Related Validity?

400

An essay prompt about "sailing a yacht" is given to a group of students from a landlocked, low-income community who have never seen a boat.

What is Assessment Bias (a threat to Validity)?

500

This statistical value, expressed in the same units as the test score, tells a teacher how much "error" is likely present in a student’s observed score.

What is the Standard Error of Measurement (SEM)?
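
The SEM is commonly estimated from the standard deviation of the test scores and the test's reliability coefficient, as SD times the square root of (1 minus reliability). The sketch below uses hypothetical numbers and a hypothetical function name to show the arithmetic.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """SEM in score points, from the score SD and a reliability coefficient (0 to 1)."""
    if not 0 <= reliability <= 1:
        raise ValueError("reliability must be between 0 and 1")
    return sd * math.sqrt(1 - reliability)


# Hypothetical test: scores have SD = 10 points and reliability = 0.91
sem = standard_error_of_measurement(10, 0.91)

# Rough band of about one SEM around a hypothetical observed score of 75
observed = 75
band = (observed - sem, observed + sem)
print(round(sem, 1), tuple(round(v, 1) for v in band))
```

The higher the reliability, the smaller the SEM, and the narrower the band around a student's observed score.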

500

A district leader asks, "Is the way we are using these test scores to place students in Remedial Reading actually helping them, or is it unfairly labeling them?"

What is Consequential Validity?

500

These are "unsystematic" fluctuations in scores caused by things like a loud hallway or a student having a headache.

What is Random Error (which impacts Reliability)?

500

This specific type of evidence asks: "Does the test look like it measures what it’s supposed to measure to the students taking it?"

What is Face Validity?

500

A teacher gives a "Pre-test" on Monday and the exact same "Post-test" on Tuesday; students score higher simply because they remember the questions.

What is the Practice Effect (or Memory Effect)?