Educational Data Systems
biserial values. A low point-biserial implies that students who answered the item incorrectly tended to score high on
the test overall, while students who answered it correctly tended to score low. Therefore, items with low
point-biserial values need further examination. Something in the wording, presentation, or content of such items may
explain the low point-biserial correlation. However, even if nothing appears visibly faulty with such items, it is
recommended that they be removed from scoring and future testing. When evaluating items, it is helpful to use a
minimum threshold value for the point-biserial correlation. A point-biserial value of at least 0.15 is recommended,
though our experience has shown that “good” items have point-biserials above 0.25.
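The threshold check described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used to produce the article's tables: the function names and the small score matrix are hypothetical, and the sketch assumes the point-biserial is the Pearson correlation between an item's 0/1 scores and each student's total score with the item itself included in the total.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one sequence has no variation
    return cov / sqrt(var_x * var_y)


def point_biserials(matrix):
    """Correlate each item's 0/1 column with the students' total scores.

    Assumption: the total score includes the item being evaluated.
    """
    totals = [sum(row) for row in matrix]
    n_items = len(matrix[0])
    return [pearson([row[j] for row in matrix], totals) for j in range(n_items)]


def flag_low(values, threshold=0.15):
    """Return 1-based item numbers whose point-biserial is below threshold.

    Undefined (NaN) values are flagged as well, since they cannot pass.
    """
    return [j + 1 for j, r in enumerate(values) if not r >= threshold]


# Hypothetical data: five students (rows) by three items (columns).
demo = [
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
    [0, 0, 0],
]

print(flag_low(point_biserials(demo)))
```

In this hypothetical matrix only the third item falls below 0.15, because the one student who answered it correctly has a below-average total score.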
Interpretation of p-values
The p-value of an item is the proportion of students who answered the item correctly, and it is a proxy for item difficulty
(or, more precisely, item easiness). Refer to Table 4 and note that Item 10 has the lowest p-value (0.11). A brief
examination of the data matrix explains why – only one student got that item correct. The highest p-value, 0.89, is
associated with three items: 1, 2 and 3. Eight out of nine students got each of these three items correct. The higher
the p-value, the easier the item. Low p-values indicate a difficult item. In general, tests are more reliable when the p-
values are spread across the entire 0.0 to 1.0 range with a larger concentration toward the center, around 0.50. (Note:
Better item difficulty statistics can be computed using psychometric models, but p-values give a reasonable estimate.)
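As a quick check of the arithmetic, the p-value calculation can be sketched as follows. The function name and the nine-student matrix are hypothetical; the matrix only reproduces the two counts mentioned above (eight of nine correct, and one of nine correct), not the article's actual data matrix.

```python
def p_values(matrix):
    """Return the proportion of students answering each item correctly.

    matrix: list of rows, one per student; each row holds 0/1 item scores.
    """
    n_students = len(matrix)
    n_items = len(matrix[0])
    return [sum(row[j] for row in matrix) / n_students for j in range(n_items)]


# Hypothetical scores for nine students on two items:
# the first item is answered correctly by eight students, the second by one.
demo = [[1, 0]] * 7 + [[1, 1]] + [[0, 0]]

for item, p in enumerate(p_values(demo), start=1):
    print(f"Item {item}: p-value = {p:.2f}")
# prints:
# Item 1: p-value = 0.89
# Item 2: p-value = 0.11
```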
The relationship between point-biserial correlations and p-values
Problematic items (items with a low point-biserial correlation) may show high p-values, but the high p-values should
not be taken as indicative of item quality. Only the point-biserial should be used to judge item quality. Our sample
data matrix contains two items that appear to have “conflicting” p-value and point-biserial statistics. One is Item 3,
which has a low point-biserial (0.12) but a high p-value (0.89); the second is Item 10, which has a high point-biserial
(0.40) but a low p-value (0.11).
Examination of the data matrix shows that Item 3 was answered incorrectly by Kid-G. Even though Kid-G did not
correctly answer Item 3, she did correctly respond to Items 4 and 6, which are harder items (as indicated by their
lower p-values). One explanation of how Kid-G could get the harder items correct but the easier item wrong is that
she guessed on Items 4 and 6. Let us assume, however, that she did not guess and that she actually did answer Items 4
and 6 correctly. This would suggest that Item 3 measures something different from the rest of the test, at least for
Kid-G, as she was unable to respond to this item correctly even though it is a relatively easy item. Although in this
article we are dealing with a very small data matrix, in real life there may be a group of students like Kid-G. When
faced with statistics such as we see here for Item 3 (high p-value, low point-biserial), it is recommended that the item
be qualitatively reviewed for content and wording. Something in the wording of the item, its presentation, or its
content may have caused Kid-G to get it wrong and thereby produced the low item point-biserial. In our little sample matrix, Item 3 is a
problematic item because it does not fit the model, meaning that this item behaves differently from other items for
Kid-G. The model says that Kid-G should have got that item right, but she got it wrong. Even if qualitative review of
the item does not reveal any obvious reason for the low point-biserial, it is often advisable that this item be removed
from future testing. Items that measure different content from the rest of the test (a situation known as
multidimensionality) often show low point-biserials.
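The screening rule discussed in this section, judging quality by the point-biserial alone and treating the p-value only as a difficulty indicator, can be sketched as below. The function name and the 0.80 cutoff for calling an item "easy" are assumptions made for illustration; the 0.15 minimum and the statistics for Items 3 and 10 are the values quoted in the text.

```python
def review_note(p_value, point_biserial, min_rpb=0.15, easy_cutoff=0.80):
    """Classify an item from its two statistics.

    min_rpb is the minimum point-biserial recommended in the text;
    easy_cutoff is a hypothetical cutoff for calling an item "easy".
    Only the point-biserial decides whether the item needs review.
    """
    if point_biserial < min_rpb:
        if p_value >= easy_cutoff:
            return "easy but low point-biserial: review content and wording"
        return "low point-biserial: review content and wording"
    return "point-biserial acceptable"


# (p-value, point-biserial) pairs quoted in the text for Items 3 and 10.
items = {3: (0.89, 0.12), 10: (0.11, 0.40)}

for number, (p, rpb) in items.items():
    print(f"Item {number}: {review_note(p, rpb)}")
```

Under these assumptions Item 3 is sent for qualitative review despite its high p-value, while Item 10 passes despite being the hardest item on the test.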