Understanding the Alternative Mantel-Haenszel Statistic: Factors Affecting Its Robustness to Detect Non-Uniform DIF


Mohammad Mollazehi a and Abdel-Salam G. Abdel-Salam a,b 

a Department of Mathematics, Statistics and Physics, Qatar University, Doha, Qatar; 

b Student Experience Department, Student Affairs Sector, VPSA Office, Qatar University, Qatar. 

Correspondence should be addressed to Abdel-Salam Gomaa Abdel-Salam, Associate Professor of Statistics, Section Head of Student Data Management, Student Experience Department, Qatar University, E-mail: abdo@qu.edu.qa


Understanding the Alternative Mantel-Haenszel Statistic: Factors Affecting Its Robustness to Detect Non-Uniform DIF 

Test-item bias has become an increasingly challenging issue in statistics and education. A popular method, the Mantel-Haenszel (MH) test, can be used to detect non-uniform differential item functioning (DIF), but it requires constructing several performance tiers to maintain robustness. The Alternative Mantel-Haenszel (AMH) test, proposed by Mazor, Clauser, and Hambleton (1994), is a proxy procedure that requires only two scoring tiers. There is, however, inadequate information on how important factors such as comparison group sizes and item discrimination affect its ability to detect bias. In this study, we investigate how item difficulty, item discrimination, and the sample size ratio between the reference and focal groups influence the likelihood that the AMH procedure detects non-uniform DIF. A comprehensive simulation study was conducted in which test scores were generated under three commonly used difficulty levels (easy, medium, and hard), two discrimination levels ('low' and 'high'), and three group comparison ratios (1:1, 2:1, and 5:1). The simulation showed that the detection rates of the AMH test are comparable to, and often better than, those of other common tests such as the Breslow-Day test. The study identifies and examines the factors that drive the AMH procedure's detection behavior.

Keywords: Differential item functioning, non-uniform DIF, discrimination, item difficulty, Breslow-Day, Mantel-Haenszel 

INTRODUCTION

Differential item functioning (DIF) analysis is an assessment tool that has been used extensively in quantitative psychology, educational measurement, business management, insurance, and the healthcare sector (Holland & Wainer, 2012). Test analysts need to identify items that create bias as a function of examinee characteristics in large-scale assessments (Jensen, 1980; Scheuneman & Bleistein, 1989). Several studies have shown that conditions such as large differences in sample sizes, item performance, and ability distributions between groups of examinees can substantially affect how often the Mantel-Haenszel (MH) procedure correctly detects non-uniform DIF (Herrera & Gómez, 2008; Marañón, Garcia, & Costas, 1997; Mazor, Clauser, & Hambleton, 1992; Narayanan & Swaminathan, 1994; Narayanon & Swaminathan, 1996; Swaminathan & Rogers, 1990). Mazor, Clauser, and Hambleton (1994) examined whether a simple adjustment of the MH procedure could improve detection rates for items showing non-uniform DIF. They developed an alternative Mantel-Haenszel (AMH) procedure and suggested that by partitioning examinees on total test score, specifically separating examinees into a 'high' and a 'low' scoring group, the MH procedure could be used to identify non-uniform DIF. Their results led to the conclusion that the procedure increases detection rates without increasing the Type I error rate.

The Breslow-Day (BD) test is nonetheless well suited to analysing non-uniform DIF because it can assess trends in odds ratio heterogeneity (Aguerri et al., 2009). Although the test has several weaknesses, it has contributed significantly to DIF detection. Compared to other alternatives, however, it is less accurate, although it can be combined with other tests to produce more accurate results and thereby better achieve its intended purpose of detecting DIF (Aguerri et al., 2009).

Aguerri et al. (2009) used the BD test to detect non-uniform DIF when the average ability of one group was considerably higher than that of the other, and to determine the factors that affect the rate of detection of non-uniform DIF. The results from the BD test were compared with those from logistic regression (LR) analysis and the standard Mantel-Haenszel (MH) procedure, and the BD test performed better than both. The method was also examined across several other conditions, including sample size and item characteristics, to determine what affects the BD test's detection of non-uniform DIF. According to Aguerri et al. (2009), when the item with the largest discrimination and difficulty parameters for equally sized groups was omitted from the goodness-of-fit test to the binomial distribution, the Type I error rate was similar to the nominal one.

In another study, Penfield (2001) performed a DIF analysis comparing a single reference group and multiple focal groups. According to Penfield, conducting a separate test of DIF for each comparison had several undesirable qualities, including an inflated Type I error rate and the need for substantial time and computing resources. Penfield suggested using a procedure able to assess DIF across multiple groups simultaneously to address these drawbacks, and compared three MH procedures: the Mantel-Haenszel chi-square statistic with no adjustment to the alpha level, the Mantel-Haenszel chi-square statistic with a Bonferroni-adjusted alpha level, and the Generalized Mantel-Haenszel (GMH) statistic. The study found the GMH to be the most appropriate procedure for assessing DIF among multiple groups.

Since the development of the AMH procedure, studies investigating it have been rare. Fidalgo and Mellenbergh (1995) compared the performance of the AMH procedure to that of the standard MH and iterative logit procedures. They found that the AMH procedure had a higher power rate than the other two procedures but cautioned that it was not as robust. Their study showed significant relationships between sample size and the effect of DIF on the performance of the AMH procedure. Several limitations of that study, nevertheless, warrant further investigation. For instance, while the study considered two sample sizes (200 and 1,000), it constrained the compared groups to be of equal size. Additionally, the study investigated non-uniform DIF as a composite of both difficulty and discrimination. Finally, the researchers investigated only symmetric non-uniform DIF, leaving gaps concerning how these factors affect asymmetric non-uniform DIF.

The present research extends the study by Mazor et al. (1994) by using predictive models to explain the rate at which a non-uniform DIF item is detected when examinees are partitioned into high and low ability levels. In this respect, the study addresses several objectives to better understand DIF and the approaches used to detect it. There are three research questions:

How do sample size ratio, item difficulty, and item discrimination affect the detection of non-uniform DIF using the AMH procedure?

What factors affect the rate of detection of non-uniform DIF using the AMH procedure?

Under what conditions does the AMH procedure work best in detecting non-uniform DIF compared to the BD procedure?

Several predictors were used in these models: item difficulty, item discrimination, and sample size ratio. As an extension of the work of Mazor et al. (1994), the results also consider cases in which the ability distributions of the reference and focal groups are equal and unequal. The study identifies the factors that contribute significantly to detection rates and the conditions under which the highest non-uniform DIF detection rates occur.

METHOD

The current study combines methods described by Mazor et al. (1994) and Penfield (2003). There are three steps: simulating the data, assessing the non-uniform DIF items, and creating the regression models. Each step is explained in the following subsections.

Simulating the Data 

Each simulation contains the item scores for a 75-item examination in which the last item is constructed to contain non-uniform DIF; this is the studied item. The number of items reflects that of similar large-scale assessments. For the non-studied items, difficulty parameters are generated from a normal distribution with a mean of zero and a standard deviation of one, and discrimination parameters are generated from a lognormal distribution with a mean of 0 and a standard deviation of 0.35, following a similar study by Penfield (2003). The guessing parameter for the non-studied items is 0.2. This setting is also used for the performance comparisons.
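The paper does not include its simulation code, so the following is a minimal Python sketch of how item scores of this kind could be generated under a three-parameter logistic (3PL) IRT model. The 74 non-studied items follow the distributions described above; the group sizes and the studied item's group-specific parameters (0.8 and 1.6) are illustrative placeholders, not the values used by Mazor et al. (1994).

```python
import numpy as np

rng = np.random.default_rng(2022)

def p_3pl(theta, a, b, c):
    """3PL probability of a correct response for each examinee/item pair."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta[:, None] - b)))

n_items = 74                 # non-studied items
n_ref, n_foc = 500, 500      # illustrative group sizes (1:1 ratio)

# Ability levels: reference ~ N(0, 1); the focal mean can be shifted for the unequal case
theta = np.concatenate([rng.normal(0, 1, n_ref), rng.normal(0, 1, n_foc)])
group = np.concatenate([np.zeros(n_ref, int), np.ones(n_foc, int)])   # 0 = reference, 1 = focal

# Non-studied item parameters as described in the text
b = rng.normal(0, 1, n_items)                         # difficulty
a = rng.lognormal(mean=0, sigma=0.35, size=n_items)   # discrimination
c = np.full(n_items, 0.2)                             # guessing

responses = (rng.random((n_ref + n_foc, n_items)) < p_3pl(theta, a, b, c)).astype(int)

# Studied (75th) item: non-uniform DIF induced via group-specific discrimination.
# The values 0.8 and 1.6 are placeholders, not the Mazor et al. (1994) parameters.
a_item = np.where(group == 0, 0.8, 1.6)
p_item = 0.2 + 0.8 / (1.0 + np.exp(-a_item * (theta - 0.0)))   # difficulty 0, guessing 0.2
studied = (rng.random(n_ref + n_foc) < p_item).astype(int)

item_scores = np.column_stack([responses, studied])   # 75-item score matrix
```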

Parameters for the studied item are manipulated to create non-uniform DIF based on six combinations of item discrimination and difficulty: two discrimination levels (low and high) and three difficulty levels (easy, medium, and difficult), with parameter values taken from Mazor et al. (1994). The guessing parameter, as for the non-studied items, is set at 0.2. Three reference-to-focal sample size ratios are considered in this study, namely 1:1, 2:1, and 5:1, similar to those used by Penfield (2003).

A total of 10,000 simulations are performed for each of the eighteen combinations of item discrimination, item difficulty, and reference-to-focal group ratio. These simulations are carried out under two ability distribution conditions (see the loop sketch below). In the first, the reference and focal groups are both drawn from normal distributions with a mean of 0 and a standard deviation of 1; in the second, the ability distribution of the reference group remains unchanged while the focal group has a mean one standard deviation lower than the reference group. This design follows the methods of Penfield (2003) and Mazor et al. (1994).
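The paper does not report the exact group sizes behind the ratios or its simulation code; the sketch below only illustrates how the 18 conditions, the two ability-distribution cases, and the 10,000 replications could be organized. The functions simulate_test, amh_flags_dif, and bd_flags_dif are hypothetical wrappers around the data-generation and testing steps sketched in this section, and the group sizes are placeholders.

```python
from itertools import product

# Illustrative group sizes for the three ratios; the paper does not report the exact n's.
ratios = {"1:1": (500, 500), "2:1": (500, 250), "5:1": (500, 100)}
disc_levels = ["low", "high"]
diff_levels = ["easy", "medium", "hard"]
ability_cases = {"equal": 0.0, "unequal": -1.0}   # focal-group mean shift in SD units

n_reps = 10_000
results = []
for (rf, sizes), al, bl, (case, shift) in product(
        ratios.items(), disc_levels, diff_levels, ability_cases.items()):
    for _ in range(n_reps):
        data = simulate_test(sizes, al, bl, focal_shift=shift)   # hypothetical wrapper around the 3PL sketch
        results.append({"RF": rf, "AL": al, "BL": bl, "ability": case,
                        "AMH": amh_flags_dif(data),              # hypothetical: True if AMH flags the studied item
                        "BD": bd_flags_dif(data)})               # hypothetical: True if BD flags the studied item
```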

Assessing the Non-uniform DIF

After creating the simulated item scores, the AMH and BD tests are performed to determine whether non-uniform DIF is present in the studied item. A significance level of five percent is used for both procedures for two main reasons. First, most studies in quantitative research use a five percent significance level; second, Type I error rates have been shown to remain at or below the nominal five percent level whether the group ability distributions are equal or unequal (Penfield, 2003). In the current study, a Type I error occurs when DIF is detected although no DIF is actually present.
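The paper does not specify the software used for these tests. Below is a hedged illustration using the StratifiedTable class from statsmodels: the Breslow-Day test corresponds to test_equal_odds() applied across total-score strata (in the spirit of Penfield, 2003), and the AMH part follows one common reading of Mazor et al. (1994), splitting examinees at the median rest score into low- and high-scoring halves, running an MH test within each half, and flagging the item if either half is significant. The helper name, binning, and flagging rule are illustrative rather than the authors' exact implementation.

```python
import numpy as np
from statsmodels.stats.contingency_tables import StratifiedTable

def score_strata_tables(item, group, rest_score, bins):
    """Build one 2x2 (group x correct/incorrect) table per total-score stratum."""
    tables, strata = [], np.digitize(rest_score, bins)
    for s in np.unique(strata):
        m = strata == s
        tbl = np.array([
            [np.sum((group == 0) & (item == 1) & m), np.sum((group == 0) & (item == 0) & m)],
            [np.sum((group == 1) & (item == 1) & m), np.sum((group == 1) & (item == 0) & m)],
        ])
        if tbl.sum(axis=1).min() > 0:      # skip strata with an empty group
            tables.append(tbl)
    return tables

rest_score = responses.sum(axis=1)          # total score on the 74 non-studied items
bins = np.arange(5, 75, 5)                  # illustrative score strata

# Breslow-Day-type test of odds-ratio homogeneity across score strata (non-uniform DIF)
bd = StratifiedTable(score_strata_tables(studied, group, rest_score, bins)).test_equal_odds()
bd_flag = bd.pvalue < 0.05

# AMH (one reading of Mazor et al., 1994): split at the median rest score and run an
# MH test within each half; flag the item if either half shows significant DIF.
amh_pvalues = []
for half in (rest_score <= np.median(rest_score), rest_score > np.median(rest_score)):
    tabs = score_strata_tables(studied[half], group[half], rest_score[half], bins)
    amh_pvalues.append(StratifiedTable(tabs).test_null_odds().pvalue)
amh_flag = min(amh_pvalues) < 0.05
```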

Creating the Regression Models

Four logistic regression models will be created to assess the likelihood of detecting non-uniform DIF in the studied item: two models assess detection with the AMH procedure and two with the BD procedure, and within each pair one model corresponds to equal and one to unequal ability distributions. Each logistic regression model uses item discrimination, item difficulty, and group ratio as explanatory variables to predict the likelihood of detecting non-uniform DIF. The logit form of the logistic regression model is

$$\ln\!\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1\,\mathrm{RF} + \beta_2\,\mathrm{AL} + \beta_3\,\mathrm{BL},$$

where $\pi$ is the probability of detecting non-uniform DIF, so the left-hand side represents the log of the odds of detection; RF, AL, and BL denote the reference-to-focal ratio, discrimination level, and difficulty level, entered as categorical predictors, with interaction terms among them added in the full-effects models. Regression estimates are produced from each logistic model. The Akaike Information Criterion (Portet, 2020), c-statistic (Caetano et al., 2018), and correct classification rate (Yan et al., 2017) are reported as diagnostic statistics for the models.
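A hedged sketch of how one of these models and its diagnostics could be fit is shown below, assuming a results table like the one assembled in the simulation-loop sketch above; the column names (detected, RF, AL, BL, ability) are illustrative.

```python
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.metrics import roc_auc_score

df = pd.DataFrame(results)                      # one row per simulation (see the loop sketch above)
df["detected"] = df["AMH"].astype(int)          # outcome for Models 4a/4c; use df["BD"] for 4b/4d

# Full-effects model for the equal-ability case: main effects plus all interactions
fit = smf.logit("detected ~ C(RF) * C(AL) * C(BL)",
                data=df[df["ability"] == "equal"]).fit()

p_hat = fit.predict()                                    # fitted probabilities
y = fit.model.endog                                      # observed 0/1 outcomes
print(fit.summary())
print(f"AIC = {fit.aic:.2f}")
print(f"c-statistic = {roc_auc_score(y, p_hat):.3f}")    # area under the ROC curve
print(f"CCR = {((p_hat >= 0.5) == y).mean():.2%}")       # correct classification rate
```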

ANALYSIS AND RESULTS

Simulation Study

The detection rates of DIF, defined as the percentage of simulations in which the AMH and BD procedures successfully detected non-uniform DIF in the studied item, are shown in Tables 1 and 2. Both tables report percentages as a function of sample size ratio (RF), item discrimination (AL), and item difficulty (BL); the two tables differ in whether the ability distributions are equal or unequal.

TABLE 1

Non-uniform DIF Detection Rates from Equal Ability Distributions

R:F Ratio   Discrimination   Difficulty   AMH Detection Rate   BD Detection Rate
1:1         Low              Easy          97.87%               47.28%
2:1         Low              Easy          91.34%               34.70%
5:1         Low              Easy          70.27%               22.28%
1:1         Low              Medium        49.95%               43.68%
2:1         Low              Medium        37.40%               29.76%
5:1         Low              Medium        25.86%               18.27%
1:1         Low              Hard          89.65%               18.53%
2:1         Low              Hard          75.16%               14.05%
5:1         Low              Hard          48.88%                9.74%
1:1         High             Easy         100.00%               99.98%
2:1         High             Easy         100.00%               99.16%
5:1         High             Easy         100.00%               90.51%
1:1         High             Medium        99.88%               99.94%
2:1         High             Medium        98.32%               99.22%
5:1         High             Medium        84.07%               87.58%
1:1         High             Hard          99.95%               44.43%
2:1         High             Hard          99.35%               29.69%
5:1         High             Hard          89.87%               19.44%

For reference and focal groups with equal ability distributions, detection rates ranged from 25.86% to 100.00% for the AMH procedure and from 9.74% to 99.98% for the BD procedure. The variability of detection rates was higher for the BD procedure than for the AMH procedure. The three highest detection rates for the AMH procedure occurred with items having high ALs and easy BLs, while the three lowest detection rates occurred with items having low ALs and medium BLs. Fewer combinations with high detection rates were observed for the BD procedure: its three highest detection rates occurred for items with high ALs and easy or medium BLs, and its three lowest detection rates occurred for items with low ALs, 2:1 or 5:1 RFs, and medium or hard BLs. Regardless of the RF, both procedures showed relatively low potential for detecting non-uniform DIF when the AL was low and the BL was medium.

TABLE 2

Non-uniform DIF Detection Rates from Unequal Ability Distributions

R:F Ratio   Discrimination   Difficulty   AMH Detection Rate   BD Detection Rate
1:1         Low              Easy          93.03%               47.47%
2:1         Low              Easy          85.37%               31.48%
5:1         Low              Easy          68.29%               21.95%
1:1         Low              Medium        48.90%               36.26%
2:1         Low              Medium        34.91%               24.72%
5:1         Low              Medium        22.95%               14.54%
1:1         Low              Hard          89.40%               11.79%
2:1         Low              Hard          73.06%                9.86%
5:1         Low              Hard          47.14%                7.68%
1:1         High             Easy         100.00%               99.97%
2:1         High             Easy         100.00%               99.23%
5:1         High             Easy          99.87%               89.06%
1:1         High             Medium        99.58%               99.39%
2:1         High             Medium        95.15%               95.63%
5:1         High             Medium        74.62%               73.98%
1:1         High             Hard          99.98%               16.16%
2:1         High             Hard          99.20%               11.87%
5:1         High             Hard          89.73%                7.80%

Similar behavior was observed for the two procedures when the ability distributions were unequal. Detection rates ranged from 22.95% to 100.00% for the AMH procedure and from 7.68% to 99.97% for the BD procedure, with more variability in the detection rates of the BD procedure. The three highest detection rates for the AMH procedure were found for items with high ALs and easy BLs, although high detection rates were also observed as the BL increased. The lowest detection rates for the AMH procedure occurred when the AL was low and the BL was medium, regardless of the RF. For the BD procedure, the highest detection rates were observed for items with high ALs and easy or medium BLs, and the lowest detection rates occurred for items with hard BLs and large RFs (7.68% and 7.80%). The results from both tables suggest that equal versus unequal ability distributions do not substantially affect non-uniform DIF detection rates; rather, the key factors are the RF and the item characteristics.

To further understand how the AL, BL, and RF influence non-uniform DIF detection rates, four full-effects logistic regression models were created to predict the likelihood of detecting non-uniform DIF in the studied item: the AMH procedure with equal ability distributions (Model 4a), the BD procedure with equal ability distributions (Model 4b), the AMH procedure with unequal ability distributions (Model 4c), and the BD procedure with unequal ability distributions (Model 4d). Table 3 shows the estimates and standard errors from the four models, together with the AIC, c-statistic, and CCR as model diagnostics.

It is important to note that the current study focused on Models 4a and 4c. Models 4b and 4d were created to compare the significance found in these models to those of Models 4a and 4c, thus identifying concordant and discordant predictor behavior when detecting non-uniform DIF using the AMH versus the BD procedures.

Tables 4 to 7 in the Appendix show the model-building, main-effects models, demonstrating the inclusion of each predictor and whether a predictor's inclusion affects the significance or the size of its effect.

TABLE 3

AMH & BD Full-Effects Logistic Models

 

Predictor                               Equal: AMH (4a)    Equal: BD (4b)     Unequal: AMH (4c)   Unequal: BD (4d)
Intercept                                2.89               -0.19               1.32               -0.10
Reference-Focal Ratio (RF)
  RF = 2:1                              -1.26*** (0.06)    -0.52*** (0.03)    -0.58*** (0.09)     -0.68*** (0.08)
  RF = 5:1                              -2.62*** (0.06)    -1.14*** (0.03)    -1.77*** (0.33)     -1.17** (0.38)
Discrimination Level (AL)
  AL = high                             15.32 (89.84)       8.63*** (0.71)    17.00 (95.12)        8.21*** (0.58)
Difficulty Level (BL)
  BL = medium                           -5.86*** (0.07)    -0.15** (0.03)     -3.40*** (0.09)     -0.46*** (0.07)
  BL = hard                             -0.87*** (0.17)    -1.34*** (0.14)     1.08*** (0.09)     -1.91*** (0.07)
Two-Way Interactions
  RF = 2:1 x AL = high                   1.26 (145.10)     -3.22*** (0.72)   -10.61 (95.12)       -2.57*** (0.59)
  RF = 5:1 x AL = high                 -11.16 (89.84)      -5.12*** (0.72)   -13.03 (95.12)       -4.84*** (0.69)
  RF = 2:1 x BL = medium                 1.17*** (0.14)    -0.08 (0.07)        0.12 (0.13)         0.13 (0.09)
  RF = 2:1 x BL = hard                   0.09 (0.17)        0.19 (0.14)       -0.63*** (0.10)      0.48*** (0.09)
  RF = 5:1 x BL = medium                 2.60*** (0.10)    -0.10 (0.06)        1.14*** (0.34)     -0.34 (0.39)
  RF = 5:1 x BL = hard                   0.30 (0.17)        0.40** (0.14)     -0.70* (0.33)        0.69 (0.39)
  AL = high x BL = medium              -15.38 (89.84)      -0.95 (0.82)      -15.29 (95.12)       -2.55 (0.59)
  AL = high x BL = hard                 -9.78 (89.84)      -7.34*** (0.72)     1.08 (134.50)      -7.84 (0.58)
Three-Way Interactions
  RF = 2:1 x AL = high x BL = medium    -0.94 (145.10)      1.25 (0.84)       10.16 (95.12)        1.11 (0.61)
  RF = 2:1 x AL = high x BL = hard      -1.91 (145.10)      2.91 (0.74)        0.07 (134.50)       2.41 (0.60)
  RF = 5:1 x AL = high x BL = medium    11.56 (89.84)       0.90 (0.83)       11.95 (95.12)        2.00** (0.71)
  RF = 5:1 x AL = high x BL = hard       8.18 (89.84)       4.67*** (0.73)    -0.13 (134.50)       4.49*** (0.70)
Model Diagnostics
  AIC                                   69,363.72          82,660.50          86,121.56           91,351.22
  c-statistic                            0.935              0.895              0.938               0.924
  CCR                                   85.88%             82.01%             86.61%              88.08%

Note. Standard errors are in parentheses. *, **, *** indicate statistical significance.

The AIC for Model 4a was 69,363.72, and the c-statistic and CCR of 0.935 and 85.88%, respectively, suggest that this model possesses strong predictive power. The RF was found to have a very strong negative effect on the detection of non-uniform DIF: non-uniform DIF items from RFs of 2:1 and 5:1 had lower likelihoods of being detected than items from an RF of 1:1, decreasing the log odds of detection by 1.26 and 2.62 logits, respectively. The behavior of the RF predictor was concordant with that of the RF predictor in Model 4b. The AL was found to be a poor predictor of non-uniform DIF detection, strongly discordant with the behavior of the AL predictor in Model 4b. Significant negative effects were present for the BL predictor. These effects suggest that non-uniform DIF items with hard BLs have lower detection rates than items with easy BLs when using the AMH procedure (β = -0.87), and that items with medium BLs have even lower detection rates (β = -5.86). Most two-way and three-way interactions among these predictors were not significant; significant positive effects were present only for the interaction between the RF and BL, particularly for items with medium BLs at the 2:1 RF (β = 1.17) or the 5:1 RF (β = 2.60). These significant interactions were discordant with the behaviors and significant interactions found in Model 4b for the RF and AL predictors (negative effects for the 2:1 and 5:1 RF levels with the high AL), the RF and BL predictors (positive effect for the 5:1 RF with the hard BL), the AL and BL predictors (negative effect for the high AL with the hard BL), and all three predictors (positive effect for the 5:1 RF, high AL, and hard BL).

Model 4c had an AIC of 86,121.56, and its c-statistic and CCR were 0.938 and 86.61%, respectively, slightly higher than those of Model 4d. The RF predictor had a strong negative effect on non-uniform DIF detection: items created from 2:1 or 5:1 RFs decreased the log odds of detection by 0.58 and 1.77 logits, respectively. Interestingly, the behaviors and significances observed for the RF predictor were similar in Models 4c and 4d, while the standard errors differed, plausibly because of the differences in the ability distributions; the corresponding RF effects in Models 4a and 4b were likewise negative, with roughly equivalent standard errors. Discrimination was not statistically significant in Model 4c, the opposite of the effect observed in Model 4d. Strong significant effects of the BL on detection rates were present in Models 4c and 4d. Items with medium BLs decreased the log odds of detection by 3.40 logits, but hard items tended to increase it by 1.08 logits. These results were somewhat discordant with those in Model 4d, in which both medium and hard items had significant negative effects on the likelihood of detecting non-uniform DIF.

The similarities between Models 4a and 4c suggest that some factors contribute to non-uniform DIF detection rates regardless of differences in the ability distributions. As the ratio of the reference and focal group sizes increases, the likelihood of detecting non-uniform DIF with the AMH procedure decreases: items created from groups with a 2:1 ratio significantly decreased the log odds of detection by between 0.58 and 1.26 logits, while items created from groups with a 5:1 ratio decreased it by between 1.77 and 2.62 logits. Medium-difficulty items also significantly decreased detection with the AMH procedure, by between 3.40 and 5.86 logits. However, the interaction of the RF and BL factors had a significant positive effect on non-uniform DIF detection, increasing the log odds by between 1.14 and 2.60 logits.
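For interpretation, a coefficient of β on the logit scale multiplies the odds of detection by e^β, with the other factors held at their reference levels (low discrimination, easy difficulty). Using the Model 4a estimates above, for example,

$$e^{-1.26} \approx 0.28, \qquad e^{-2.62} \approx 0.07,$$

so a 2:1 group ratio is associated with roughly 0.28 times the odds of detecting the non-uniform DIF item relative to a 1:1 ratio, and a 5:1 ratio with roughly 0.07 times the odds.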

DISCUSSION, CONCLUSION AND LIMITATIONS

The purpose of this study was to determine whether changes in sample size ratios of the reference and focal groups, item discrimination, and item difficulty affected how often the AMH and BD procedures detected non-uniform DIF. It also examined the factors that significantly affected the detection rate in the AMH procedure, and whether particular combinations of these factors yielded higher detection rates with the AMH procedure compared to the BD procedure.

With equal ability distributions, the results suggest that the detection rate of non-uniform DIF using the AMH procedure is driven most strongly by high item discrimination, followed by easy item difficulty, and then by equal reference and focal group sizes. This conclusion is based on comparing the six largest detection rates in Table 1. Using the AMH procedure, items with high discrimination have an 84% chance or more of being detected, items with easy difficulty a 70% chance or more, and items with equal reference and focal group ratios a 50% chance or more. Items possessing all three characteristics have the highest chance of being detected. The ordering of these factors may be of secondary importance, since item characteristics have been shown to play an important role in detecting non-uniform DIF with several other DIF procedures.

Differences in reference and focal group ratios have a significant negative effect when modeling non-uniform DIF detection rates, and the effect grows as the magnitude of the difference increases. Item difficulty also has a significant negative effect on non-uniform DIF detection rates with the AMH procedure: items with medium or hard difficulty have lower chances of being detected, with medium-difficulty items having the lowest chance. For medium-difficulty items, the positive ratio-by-difficulty interaction slightly offsets the negative effect of larger group ratios.

For the unequal ability distribution case, the results are broadly similar. Items with high discrimination, followed by items with easy difficulty, and then items answered by reference and focal groups of equal size, yield the highest non-uniform DIF detection rates with the AMH procedure. Items with high discrimination tend to have about a 75% chance or more of being detected, items with easy difficulty 68% or more, and items with equal reference and focal group ratios about 49% or more. These percentages are similar to those of the equal-ability case, suggesting that non-uniform DIF detection with the AMH procedure is relatively insensitive to unequal ability levels between the reference and focal groups.

Group ratio and item difficulty still have significant effects on non-uniform DIF detection rates. Group ratio in particular has a dominant negative effect, with higher ratios having stronger negative effects. With regard to item difficulty, medium items hurt non-uniform DIF detection rates compared with easy items, but hard items have a positive impact. This behavior contrasts with what was observed in the equal ability distribution case and is also discordant with the behavior that would be observed using the BD procedure, for which item difficulty has a uniformly negative effect.

The results suggest that the effects of item difficulty and group ratio are similar to those that would be observed using the BD procedure, but the effects of item discrimination differ. One possible reason is that the test statistics of the two methods depend on sample size. The AMH procedure is a special case of the MH statistic and, like IRT-based methods, is affected by item difficulty; with regard to sample size, the results show how decreases in group size substantially reduce the MH statistic's ability to detect DIF-affected items.

Results from the simulation indicate that the AMH procedure detects non-uniform DIF best for items with easy difficulty or high discrimination. Test analysts should expect at least a 2.36:1 chance of successfully detecting non-uniform DIF when either property is present in an item.
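Read as odds (which is how the 2.36:1 figure appears to be intended), this corresponds to a detection probability of

$$\frac{2.36}{1 + 2.36} \approx 0.70,$$

that is, roughly a 70% chance, consistent with the detection rates reported in Tables 1 and 2 for easy or highly discriminating items.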

In comparison to the BD procedure, the AMH procedure has a higher chance of detecting non-uniform DIF items except when items have high discrimination and medium difficulty. The AMH procedure shows the strongest potential to detect non-uniform DIF for items with easy difficulty or high discrimination, and the weakest detection rates for highly discriminating, medium-difficulty items.

One limitation concerns the types of DIF tests performed. While the current study tested the effects of several factors on non-uniform DIF detection rates using the MH and BD procedures, additional tests could be considered; comparing the MH and BD procedures with other methods, such as IRT-based and LR methods, may be of interest to education analysts.

REFERENCES

Aguerri, M. E., Galibert, M. S., Attorresi, H. F., & Marañón, P. P. (2009). Erroneous detection of non-uniform DIF using the Breslow-Day test in a short test. Quality & Quantity, 43(1), 35-44.

Caetano, S. J., Sonpavde, G., & Pond, G. R. (2018). C-statistic: A brief explanation of its construction, interpretation and limitations. European Journal of Cancer, 90, 130-132. https://doi.org/10.1016/j.ejca.2017.10.027

Fidalgo, A., & Mellenbergh, G. (1995). Evaluación del procedimiento Mantel-Haenszel frente al método logit iterativo en la detección del funcionamiento diferencial de los ítems uniforme y no uniforme. Paper presented at the IV Simposio de Metodología de las Ciencias del Comportamiento, La Manga del Mar Menor.

Herrera, A.-N., & Gómez, J. (2008). Influence of equal or unequal comparison group sample sizes on the detection of differential item functioning using the Mantel-Haenszel and logistic regression techniques. Quality & Quantity, 42(6), 739.

Holland, P. W., & Wainer, H. (2012). Differential item functioning. Routledge.

Jensen, A. R. (1980). Bias in mental testing. Free Press.

Marañón, P. P., Garcia, M. I. B., & Costas, C. S. L. (1997). Identification of nonuniform differential item functioning: A comparison of Mantel-Haenszel and item response theory analysis procedures. Educational and Psychological Measurement, 57(4), 559-568.

Mazor, K. M., Clauser, B. E., & Hambleton, R. K. (1992). The effect of sample size on the functioning of the Mantel-Haenszel statistic. Educational and Psychological Measurement, 52(2), 443-451. https://doi.org/10.1177/0013164492052002020

Mazor, K. M., Clauser, B. E., & Hambleton, R. K. (1994). Identification of nonuniform differential item functioning using a variation of the Mantel-Haenszel procedure. Educational and Psychological Measurement, 54(2), 284-291.

Narayanan, P., & Swaminathan, H. (1994). Performance of the Mantel-Haenszel and simultaneous item bias procedures for detecting differential item functioning. Applied Psychological Measurement, 18(4), 315-328. https://doi.org/10.1177/014662169401800403

Narayanon, P., & Swaminathan, H. (1996). Identification of items that show nonuniform DIF. Applied Psychological Measurement, 20(3), 257-274.

Penfield, R. D. (2001). Assessing differential item functioning among multiple groups: A comparison of three Mantel-Haenszel procedures. Applied Measurement in Education, 14(3), 235-259.

Penfield, R. D. (2003). Applying the Breslow-Day test of trend in odds ratio heterogeneity to the analysis of nonuniform DIF. Alberta Journal of Educational Research, 49(3).

Portet, S. (2020). A primer on model selection using the Akaike Information Criterion. Infectious Disease Modelling, 5, 111-128. https://doi.org/10.1016/j.idm.2019.12.010

Scheuneman, J. D., & Bleistein, C. A. (1989). A consumer's guide to statistics for identifying differential item functioning. Applied Measurement in Education, 2(3), 255-275.

Swaminathan, H., & Rogers, H. J. (1990). Detecting differential item functioning using logistic regression procedures. Journal of Educational Measurement, 27(4), 361-370.

Yan, K., Ji, Z., & Shen, W. (2017). Online fault detection methods for chillers combining extended Kalman filter and recursive one-class SVM. Neurocomputing, 228, 205-212. https://doi.org/10.1016/j.neucom.2016.09.076

Appendix: Main Effects Logistic Models

TABLE 4

AMH Main Effects Logistic Models: Equal Ability Distributions

Predictor                 Model 1             Model 2             Model 3
Intercept                  0.02                0.11                2.37
Reference-Focal Ratio
  RF = 2:1                 0.51*** (0.01)      0.50*** (0.01)     -0.68*** (0.03)
  RF = 5:1                -0.41*** (0.02)     -0.46*** (0.02)     -1.91*** (0.03)
Discrimination Level
  AL = high                                   -0.16*** (0.01)      2.13*** (0.04)
Difficulty Level
  BL = medium                                                     -6.49*** (0.04)
  BL = hard                                                        -0.73*** (0.02)
Model Diagnostics
  AIC                     152,078.76          151,921.62           72,911.56
  c-statistic               0.599               0.606               0.934
  CCR                      57.69%              55.19%              85.88%

Note. Standard errors are in parentheses. *, **, *** indicate statistical significance.

TABLE 5

BD Main Effects Logistic Models: Equal Ability Distributions

Predictor                 Model 1             Model 2             Model 3
Intercept                  0.02               -0.16                2.37
Reference-Focal Ratio
  RF = 2:1                -0.83*** (0.02)     -0.88*** (0.02)     -0.35*** (0.02)
  RF = 5:1                -1.81*** (0.02)     -1.50*** (0.02)     -1.05*** (0.02)
Discrimination Level
  AL = high                                    2.91*** (0.02)      3.16*** (0.02)
Difficulty Level
  BL = medium                                                      0.01 (0.02)
  BL = hard                                                       -2.54*** (0.03)
Model Diagnostics
  AIC                     142,446.36          103,656.00           90,064.18
  c-statistic               0.681               0.854               0.889
  CCR                      64.85%              79.59%              80.26%

Note. Standard errors are in parentheses. *, **, *** indicate statistical significance.

TABLE 6

AMH Main Effects Logistic Models: Unequal Ability Distributions

Predictor                 Model 1             Model 2             Model 3
Intercept                  1.18***             0.53***             0.82***
Reference-Focal Ratio
  RF = 2:1                -0.46*** (0.01)     -0.52*** (0.01)     -0.76*** (0.02)
  RF = 5:1                -1.07*** (0.01)     -1.23*** (0.01)     -1.79*** (0.02)
Discrimination Level
  AL = high                                    1.65*** (0.01)      2.42*** (0.02)
Difficulty Level
  BL = medium                                                     -2.04*** (0.02)
  BL = hard                                                        1.86*** (0.02)
Model Diagnostics
  AIC                     224,423.86          200,806.62          145,088.81
  c-statistic               0.617               0.741               0.888
  CCR                      65.46%              70.62%              80.26%

Note. Standard errors are in parentheses. *, **, *** indicate statistical significance.

TABLE 7

BD Main Effects Logistic Models: Unequal Ability Distributions

Predictor                 Model 1             Model 2             Model 3
Intercept                  0.13***            -0.65***             0.50***
Reference-Focal Ratio
  RF = 2:1                -0.27*** (0.01)     -0.32*** (0.01)     -0.51*** (0.02)
  RF = 5:1                -0.75*** (0.01)     -0.87*** (0.01)     -1.40*** (0.02)
Discrimination Level
  AL = high                                    1.59*** (0.01)      2.58*** (0.02)
Difficulty Level
  BL = medium                                                     -0.70*** (0.01)
  BL = hard                                                       -4.34*** (0.02)
Model Diagnostics
  AIC                     243,495.60          218,229.33          145,181.07
  c-statistic               0.582               0.724               0.903
  CCR                      53.27%              68.17%              83.63%

Note. Standard errors are in parentheses. *, **, **** indicate statistical significance.
