Introduction
The word recognition score (WRS) is one of the most frequently used measures in speech audiometry. Generally, several monosyllabic word lists (MWLs) of similar difficulty are used to obtain the WRS. Korean MWLs for adults (MWL-A) were recently developed [1] and selected as a Korean standard (KS) for speech audiometry [2]. The KS-MWL-A is widely used in hearing clinics, hearing aid centers, and auditory rehabilitation centers in Korea. In clinical settings, the WRS provides valuable information on how much each individual has improved after treatment, hearing aid fitting, aural rehabilitation, etc. [3-5]. However, we cannot be sure whether an improvement is significant unless test-retest reliability, which refers to the repeatability of a measure, has been established [3-12]. It is well known that the parameters affecting WRS include the number of test words, the stimulus presentation level and mode, the difficulty level of the word lists, etc. Although a few studies [1,3,12,13] have examined the test-retest reliability of Korean WRS for adults, their data were not sufficient for clearly interpreting retest results of the KS-MWL-A with respect to the aforementioned parameters, because of differences between the old and newly developed word lists, small numbers of subjects, skewed distributions of WRSs, or homogeneity problems in age.
The indices used in this study to express test-retest results are the correlation, the confidence interval (CI), and the prediction interval (PI). The CI can be described as an estimate of the interval within which the population mean is expected to lie, given the sample mean, and the PI as an estimate of the interval within which retest results will fall with a certain probability, given the results of the previous test [3,8,9]. The PI is useful for inferring whether the change in WRS at retest is significant for each individual. Therefore, this study investigated the test-retest reliability of the KS-MWL-A according to the recommendations of both international and Korean standards for speech audiometry [2,14]. More specifically, first, correlations between test and retest results were analyzed as a function of the number of test words. Second, CIs were calculated over the whole range of WRS for interpreting group data. Finally, PIs were obtained at each WRS score for the clinical interpretation of individual retest results.
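To make the distinction between the three indices concrete, the following minimal sketch computes a Pearson correlation, a 95% CI for the mean test-retest difference, and a 95% PI for a single new difference. It uses textbook normal-approximation formulas, which are an assumption and not necessarily the formulas used in this study; the function names and the scores are hypothetical and serve only as an illustration.

```python
import math
import statistics


def pearson_r(x, y):
    """Pearson correlation between paired test and retest scores."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def difference_intervals(test, retest, z=1.96):
    """Return (CI, PI) for retest-minus-test differences.

    CI: interval for the mean difference (group-level interpretation).
    PI: interval for one new individual's difference.
    Both use the normal approximation (z = 1.96), an assumption.
    """
    diffs = [r - t for t, r in zip(test, retest)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)
    ci_half = z * sd_d / math.sqrt(n)          # shrinks as n grows
    pi_half = z * sd_d * math.sqrt(1 + 1 / n)  # stays near z * SD
    ci = (mean_d - ci_half, mean_d + ci_half)
    pi = (mean_d - pi_half, mean_d + pi_half)
    return ci, pi


# Hypothetical WRSs (%) for six listeners, not data from this study.
test_scores = [60, 72, 80, 88, 92, 96]
retest_scores = [64, 68, 84, 88, 96, 92]

r = pearson_r(test_scores, retest_scores)
ci, pi = difference_intervals(test_scores, retest_scores)
print(f"r = {r:.2f}, 95% CI = {ci}, 95% PI = {pi}")
```

Under these formulas the PI is always wider than the CI, because the CI narrows with the number of subjects while the PI remains tied to the spread of individual differences.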
Discussion
In this study, we sought to establish the test-retest reliability of the KS-MWL-A for each WRS score as well as for the whole range of WRS, as a function of the number of test words. Results for the whole range of WRS indicated that test-retest reliability was high for 25 and 50 test words, based on high correlations and narrow CIs. As expected, the retest reliability of WRS for 10 test words was lower than that for 25 and 50 test words. Previous studies [3,12] also reported that, as the number of test words increased in WRS testing, the correlation became higher, the SD became smaller, and the CI became narrower. Both this study and the aforementioned studies therefore recommend 25 or more test words for obtaining a reliable WRS.
As the presentation level increased from 0 to 30 dB HL, mean WRSs increased at both test and retest, whereas the variation of the differences between test and retest WRSs became smaller, probably because of the ceiling effect toward the extreme band of 86-100%. These results are consistent with previous studies [3,4,6,11]. However, the correlation coefficients in this study are higher and the CIs narrower than those of You and Lee [3] for all test conditions. This is considered to be mainly due to the large group of subjects and their homogeneity in age in this study. As seen in Table 2, the 95% PIs for the whole range of WRS are wider than the 95% CIs, which suggests that individual variance is greater than group variance. These results are also consistent with the previous studies. In both large-group and small-group studies, PIs narrowed as the number of test words increased, which suggests that further analysis of the PI at each WRS score is needed for clinical use.
The whole range of WRS can be divided into 9 bands, namely 0-14%, 15-24%, 25-34%, 35-44%, 45-55%, 56-65%, 66-75%, 76-85%, and 86-100%, so that the 45-55% band lies at the center. In this study, as expected, the SD of the differences between test and retest WRSs was largest at the center band and gradually decreased as the band level rose to the highest level, for all three numbers of test items. Assuming a normal distribution, it can be inferred theoretically that, had data been obtained at WRS bands below the center band, the SDs at those lower bands would also have been smaller than the SD at the center, just as the SDs at the upper bands were. That is, the variances of the upper bands of 86-100%, 76-85%, 66-75%, and 56-65% would be equal, or at least similar, to those of the lower bands of 0-14%, 15-24%, 25-34%, and 35-44%, respectively. It can thus also be inferred that, as the WRS band rises above the center band, the 95% PI of each band narrows as the SD does, because the PI is calculated from the SEM, which is directly affected by the SD.
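The symmetry argument above can be illustrated with a simple binomial model, in which each test word is treated as an independent trial with the same probability of being recognized. This model is an assumption introduced here for illustration and is not the analysis performed in this study; the function name is hypothetical.

```python
import math


def binomial_sd_percent(score_percent, n_words):
    """SD (in percentage points) of a WRS under a binomial model.

    Assumes every word is an independent trial with recognition
    probability p = score_percent / 100 (an illustrative assumption).
    """
    p = score_percent / 100
    return 100 * math.sqrt(p * (1 - p) / n_words)


# The SD peaks at 50% and is mirrored around it; it also shrinks
# as the number of test words increases.
for n_words in (10, 25, 50):
    row = [round(binomial_sd_percent(s, n_words), 1) for s in (10, 30, 50, 70, 90)]
    print(n_words, row)
```

Under this model the variability at 30% equals that at 70%, and the variability at 10% equals that at 90%, which mirrors the inference that lower bands would behave like their upper counterparts.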
In this study, the intra-subject variability in WRS is described by ±2 SEM for the 95% PI in Tables 2 and 3, as recommended by previous studies [3,8,9]. The SEM differs from the SE, which refers to the SD of sample means, as explained earlier. The SEM is directly related to the reliability of a test with respect to individual performance; that is, the wider the PI, the lower the reliability of the test. Thus, it can also be asserted that the greater the number of test words, the higher the reliability of the test. However, testing time is also an important factor for clinical efficiency. That is why it is valuable to provide a table showing the upper and lower limits of the 95% PI as a function of the number of test items, which can easily be used in clinical settings when interpreting individual retest results. If the difference between test and retest WRSs is greater than twice the SEM, the change is statistically significant with respect to the 95% PI. The upper and lower limits of the 95% PIs for each WRS score in this study show trends similar to those of the 95% critical differences for English WRS for adults reported by Thornton and Raffin [6], although they calculated the 95% critical differences from binomial confidence intervals.
As mentioned above, PIs are affected by the number of test words as well as by the WRS band level, as seen in Table 3. For example, if the WRS measured with 25 test words was 60% before auditory training, the upper limit of the PI for this condition would be 76%, as seen in Table 4. Thus, a WRS of 80% or greater would be interpreted as a significant improvement after training. If 50 test words were used, the upper limit of the PI would also be 76%, so 78% or greater at retest would be accepted as a significant improvement. For 10 test words, however, the upper limit of the PI would be 80%, so only 90% or 100% at retest would be accepted as a significant improvement. As another example, if the WRS for 50 test words was 30% without hearing aids, the upper limit of the PI for this condition would be 44%, as seen in Table 4. Thus, a WRS of 46% or greater would be interpreted as a significant improvement after hearing aid fitting. If 10 test words were used, the upper limit of the PI would be 50%, so 60% or greater at retest would be accepted as a significant improvement. In sum, it is important to apply the PI values for the corresponding number of test words in Table 4 when interpreting individual retest results.
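The worked examples above follow a simple rule: a retest score counts as a significant improvement only if it exceeds the upper PI limit, and because scores move in steps of one word, the smallest such score depends on the number of test words. The sketch below encodes only the five upper limits quoted in the examples; the dictionary, function names, and layout are hypothetical and do not reproduce Table 4.

```python
# Upper 95% PI limits (%) quoted in the worked examples above.
# Keys are (number of test words, baseline WRS in %). Only these five
# conditions are included; the full table is not reproduced here.
UPPER_PI_LIMIT = {
    (25, 60): 76,
    (50, 60): 76,
    (10, 60): 80,
    (50, 30): 44,
    (10, 30): 50,
}


def smallest_significant_score(n_words, upper_limit):
    """Smallest attainable WRS (%) strictly above the upper PI limit.

    Attainable scores are multiples of 100 / n_words. Returns None
    when even a perfect score does not exceed the limit.
    """
    correct_needed = (upper_limit * n_words) // 100 + 1
    if correct_needed > n_words:
        return None
    return correct_needed * 100 / n_words


def is_significant_improvement(n_words, baseline, retest):
    """True if the retest WRS exceeds the upper PI limit for this condition."""
    return retest > UPPER_PI_LIMIT[(n_words, baseline)]


for (n_words, baseline), limit in UPPER_PI_LIMIT.items():
    needed = smallest_significant_score(n_words, limit)
    print(f"{n_words} words, baseline {baseline}%: need {needed}% or greater")
```

For the conditions listed, the computed thresholds agree with the scores stated in the text (80%, 78%, 90%, 46%, and 60%).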
Conclusion
This study investigated the test-retest reliability of WRS testing as a function of the number of test words. Twenty-five or more test words are recommended for reliable WRS measurement in adults, based on higher correlations and narrower CIs and PIs than those for 10 test words. When interpreting retest results, the 95% CI for the whole range of WRS for each number of test words is useful for group data. For individual data, however, the 95% PI at each WRS score for each number of test words is more useful. If WRS testing with 10 test words is necessary for a particular individual, the 95% PI for 10 test words should be applied when interpreting that individual's retest results.