Determination of differential item functioning (DIF) according to SIBTEST, Lord's χ², Raju's area measurement and Breslow-Day methods

The aim of this study is to examine whether the items in the mathematics subtest of the Centralized High School Entrance Placement Test [HSEPT], administered in 2012 by the Ministry of National Education in Turkey, show DIF according to gender and type of school. For this purpose, the SIBTEST, Breslow-Day, Lord's χ² and Raju's area measurement methods were applied to the 20 items in the 2012 HSEPT mathematics subtest, and it was determined whether each item showed DIF under each method. The research was based on data from the HSEPT taken by eighth grade students in 2012. After cases with missing data were removed from the data set, DIF analyses were performed on the mathematics subtest responses of 1,063,570 students in total (n female = 523,939 and n male = 539,631; n state school = 1,025,979 and n private school = 37,591). Since the aim is to examine the current situation, the study uses a descriptive research design. The number of items with DIF and their DIF levels differed across methods for both gender and type of school. In line with these findings, researchers are advised to use at least two methods when determining DIF.


Introduction
In Turkey, various examinations (TEOG, KPSS, LGS, YKS, etc.) are administered at different grade levels and in different subject areas for the purposes of diagnosis, selection and placement. The accuracy of the decisions based on the results of these examinations directly affects individuals' lives. Therefore, it is very important that the test results are valid and reliable. The biggest concern in test development and implementation is to demonstrate that the uses and interpretations of the test scores are valid (Bachman, 1990).
Students who have the same ability level on the construct measured by an educational assessment tool are expected to receive similar scores on a given item. In addition, the test should be fair for different groups of participants. DIF analysis is one of the statistical approaches proposed to determine whether test items function differently between the subgroups taking the test and to identify the sources of variance (Geranpayeh & Kunnan, 2007). DIF occurs when the probability of responding correctly to a particular item differs between individuals in different subgroups after they have been matched on the ability that the item is intended to measure (Holland & Wainer, 1993). If an item is identified as having DIF, this is due to a source of variance unrelated to the construct measured by the test. In other words, because of the grouping factor, groups perform differently on a given item, and this difference is a source of variance unrelated to the construct (Messick, 1989, 1994). In DIF studies, the performances of at least two groups, focal and reference, are compared (Finch & French, 2013; Karami, 2012). The focal group is the minority group, and the reference group is the majority group (Finch & French, 2013). Two types of DIF are distinguished: uniform and non-uniform. Uniform DIF occurs when one group performs better than the other group at all ability levels. More specifically, almost all members of one group perform better than members of the other group at the same ability level (Karami, 2012). Non-uniform DIF arises when the difference between the groups' probabilities of responding correctly to a particular item is not the same across ability levels (Camilli & Shepard, 1994; Zumbo, 1999). In other words, there is an interaction between group membership and ability level (Karami, 2012).
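The distinction between uniform and non-uniform DIF can be illustrated with a minimal Python sketch. It assumes a two-parameter logistic item characteristic curve and uses invented (a, b) parameter pairs; the function names are hypothetical and are not part of any package used in this study. The sign-change check is a simplification: it only recognizes non-uniform DIF when the two curves cross within the ability grid.

```python
import math

def icc_2pl(theta, a, b):
    """Probability of a correct response under a 2PL model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def dif_profile(ref, focal, thetas):
    """Reference-minus-focal probability gap at each ability level."""
    return [icc_2pl(t, *ref) - icc_2pl(t, *focal) for t in thetas]

def classify_dif(gaps, tol=1e-9):
    """Label a gap profile as 'none', 'uniform', or 'non-uniform'."""
    has_pos = any(g > tol for g in gaps)
    has_neg = any(g < -tol for g in gaps)
    if not has_pos and not has_neg:
        return "none"
    # A gap that changes sign means the curves cross (group x ability interaction).
    return "non-uniform" if has_pos and has_neg else "uniform"

thetas = [x / 10.0 for x in range(-40, 41)]  # ability grid from -4 to 4

# Hypothetical (a, b) parameters, chosen only for illustration.
same = classify_dif(dif_profile((1.0, 0.0), (1.0, 0.0), thetas))
uniform = classify_dif(dif_profile((1.0, 0.0), (1.0, 0.5), thetas))
crossing = classify_dif(dif_profile((1.5, 0.0), (0.7, 0.0), thetas))
print(same, uniform, crossing)
```

In this sketch, equal discriminations with different difficulties give a gap of constant sign (uniform DIF), whereas different discriminations make the curves cross, so that the favored group changes along the ability scale.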
DIF studies have an important role in assessing the validity of test scores (French & Finch, 2013; French & Finch, 2015), because the presence of DIF in test items may reduce the validity of the test (French & Finch, 2015; Li & Zumbo, 2009). In addition, since studies on DIF have led to the identification of bias, DIF has received increasing interest in measurement practice in education and psychology over the last two decades (Millsap & Everson, 1993). Extensive research has been conducted to identify DIF, and methods have been developed for this purpose (Wiberg, 2007). There are also many studies on the development of the statistical methods used in determining DIF (Clauser & Mazor, 1998; Hidalgo & Gomez-Benito, 2010; Osterlind & Everson, 2009; Raju, et al., 2009; Steinberg & Thissen, 2006; Zumbo, 2007). Basically, these methods assume that individuals with similar abilities or knowledge will perform similarly on the relevant test items (Dorans & Holland, 1993). DIF detection methods include the Mantel-Haenszel statistic (Holland & Thayer, 1988), Logistic Regression (Swaminathan & Rogers, 1990), the Simultaneous Item Bias Test (SIBTEST) (Shealy & Stout, 1993), Raju's Area Measurement (Raju, 1988, 1990), Breslow-Day (Breslow & Day, 1980), Standardization (Dorans & Holland, 1993; Dorans & Kulick, 1986), and Confirmatory Factor Analysis and multidimensional approaches (Jöreskog & Sörbom, 1989). Because each of these methods has advantages and disadvantages, it is recommended to use more than one method in DIF studies (Camilli & Shepard, 1994; Holland & Wainer, 1993).
Within the scope of this research, the SIBTEST and Breslow-Day methods, based on Classical Test Theory (CTT), and Lord's χ² and Raju's Area Measurement methods, based on Item Response Theory (IRT), were chosen. In DIF analyses, the CTT-based methods match the groups on the observed scores, whereas the IRT-based methods match the groups on the latent variable. These methods were applied together because they use different matching criteria. The four DIF detection methods are explained below.

Simultaneous Item Bias Test (SIBTEST) Method
SIBTEST is a statistical method proposed by Shealy and Stout (1993) to determine DIF in dichotomous data. In this method, an estimate of the latent score rather than the observed score is used as the matching criterion (Clauser & Mazor, 1998). Using a true score estimate derived from the observed scores as the matching criterion makes it possible to control Type I error in determining DIF (Gierl, 2005; Shealy & Stout, 1993). The magnitude of DIF is determined by the statistic obtained from the analysis. The statistic is given in Equation 1.

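$$\hat{\beta} = \int B(\theta)\, f_F(\theta)\, d\theta \qquad (1)$$
In Equation 1, $B(\theta) = P(\theta, R) - P(\theta, F)$ is the difference between the reference and focal groups' probabilities of responding to the item correctly at ability level $\theta$, and $f_F(\theta)$ is the density function of ability in the focal group.
The criteria proposed by Roussos and Stout (1996) for interpreting the magnitude of this statistic are as follows: A level: $|\hat{\beta}| < .059$; B level: $.059 \le |\hat{\beta}| < .088$; and C level: $|\hat{\beta}| \ge .088$.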
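As a rough illustration of how such an effect size is formed and classified, the following Python sketch computes a weighted difference in proportions correct across matching-score strata and applies the .059 and .088 cut-offs attributed to Roussos and Stout (1996). It is a simplification under stated assumptions: the regression correction that SIBTEST applies to the matching scores is omitted, the weights are taken from the focal group (one common choice), and the strata values are invented for illustration.

```python
def weighted_beta(strata):
    """Uncorrected SIBTEST-style effect size.

    strata: list of (n_focal, p_ref, p_focal), one tuple per matching-score
    level, where p_* is the proportion answering the studied item correctly.
    Each stratum is weighted by the focal group's share of examinees in it.
    """
    total_focal = sum(n for n, _, _ in strata)
    return sum((n / total_focal) * (p_ref - p_foc) for n, p_ref, p_foc in strata)

def sibtest_level(beta):
    """A/B/C category for |beta| using the .059 and .088 cut-offs."""
    size = abs(beta)
    if size < 0.059:
        return "A"
    if size < 0.088:
        return "B"
    return "C"

# Hypothetical strata, invented for illustration only.
strata = [(100, 0.30, 0.25), (200, 0.58, 0.50), (100, 0.80, 0.74)]
beta = weighted_beta(strata)
print(round(beta, 4), sibtest_level(beta))
```

A positive value in this sketch indicates that, among matched examinees, the reference group answers the item correctly more often than the focal group.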

Lord's Method
In this method, the difference between the item parameter estimates of the focal and reference groups is tested (Magis, Béland, Tuerlinckx & De Boeck, 2010). The variance-covariance values of the difficulty and discrimination parameters are taken into account, and the area between the item characteristic curves of the groups is calculated (Hambleton, Swaminathan & Rogers, 1991). The method was proposed by Lord to determine both uniform and non-uniform DIF (Wiberg, 2007). Lord's statistic is given in Equation 2.
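$$\chi^2 = (v_R - v_F)^{\top} (\Sigma_R + \Sigma_F)^{-1} (v_R - v_F) \qquad (2)$$
In Equation 2, $v_F$ and $v_R$ are the vectors of item parameter estimates for the focal and reference groups, and $\Sigma_F$ and $\Sigma_R$ are the corresponding variance-covariance matrices.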
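Lord's statistic is a quadratic form in the difference between the two groups' item parameter estimates, weighted by the inverse of the summed covariance matrices. A minimal Python sketch for a single item with two parameters (discrimination and difficulty) is given below. The function name, the parameter values and the covariance matrices are hypothetical, and the estimates are assumed to have already been placed on a common metric.

```python
def lord_chi2(v_ref, v_foc, cov_ref, cov_foc):
    """Lord's chi-square for one item with two parameters (a, b).

    v_*: (a, b) estimates for each group, already on a common metric.
    cov_*: 2x2 covariance matrices of those estimates (nested lists).
    """
    d0 = v_ref[0] - v_foc[0]
    d1 = v_ref[1] - v_foc[1]
    # Sum the two covariance matrices element by element.
    s00 = cov_ref[0][0] + cov_foc[0][0]
    s01 = cov_ref[0][1] + cov_foc[0][1]
    s10 = cov_ref[1][0] + cov_foc[1][0]
    s11 = cov_ref[1][1] + cov_foc[1][1]
    det = s00 * s11 - s01 * s10
    if det == 0:
        raise ValueError("summed covariance matrix is singular")
    # Quadratic form d' S^{-1} d using the closed-form 2x2 inverse.
    return (d0 * (s11 * d0 - s01 * d1) + d1 * (-s10 * d0 + s00 * d1)) / det

# Hypothetical estimates and covariances, invented for illustration only.
cov = [[0.01, 0.0], [0.0, 0.02]]
chi2 = lord_chi2((1.2, 0.0), (1.0, 0.3), cov, cov)
print(round(chi2, 2))
```

With two parameters per item the statistic is referred to a χ² distribution with two degrees of freedom, whose critical value at the .05 level is 5.991, so the hypothetical item above would not be flagged.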

Raju's Area Measurement Method
In this method, the area between the item characteristic curves of the focal and reference groups is examined to determine whether the item shows DIF (Magis, Béland, Tuerlinckx & De Boeck, 2010). If the area between the item characteristic curves is zero, the item does not show DIF.
As the area between the curves moves away from zero, the bias in the item increases (Lord, 1980; Raju, 1988). Different indices are used to calculate the area between the curves, including signed and unsigned area indices as well as weighted and unweighted versions of these indices (Crocker & Algina, 1986; Raju & Arenson, 2002). The Z statistic for the one-parameter logistic model is given in Equation 3.

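$$Z = \frac{\hat{b}_F - \hat{b}_R}{\sqrt{\hat{\sigma}_F^2 + \hat{\sigma}_R^2}} \qquad (3)$$
In Equation 3, $\hat{b}$ and $\hat{\sigma}$ represent the item parameter estimates and the standard errors of those estimates, respectively; the subscripts F and R denote the focal and reference groups.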
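The idea of an area between two item characteristic curves can be checked numerically. The Python sketch below integrates the reference-minus-focal gap with the trapezoid rule and returns both a signed and an unsigned area. It assumes logistic curves without the scaling constant D, a finite integration range of -20 to 20, and invented (a, b) parameters; the function names are hypothetical.

```python
import math

def icc(theta, a, b):
    """Logistic item characteristic curve (scaling constant D omitted)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def areas_between(ref, focal, lo=-20.0, hi=20.0, steps=40000):
    """Signed and unsigned area between two ICCs (trapezoid rule).

    The gap is taken as reference minus focal, so a positive signed
    area means the item is easier for the reference group overall.
    """
    h = (hi - lo) / steps
    signed = unsigned = 0.0
    prev = icc(lo, *ref) - icc(lo, *focal)
    for i in range(1, steps + 1):
        theta = lo + i * h
        cur = icc(theta, *ref) - icc(theta, *focal)
        signed += 0.5 * (prev + cur) * h
        unsigned += 0.5 * (abs(prev) + abs(cur)) * h
        prev = cur
    return signed, unsigned

# Hypothetical (a, b) parameters, chosen only for illustration.
shift_signed, shift_unsigned = areas_between((1.0, 0.0), (1.0, 0.4))
cross_signed, cross_unsigned = areas_between((1.5, 0.0), (0.7, 0.0))
print(round(shift_signed, 3), round(shift_unsigned, 3))
print(round(cross_signed, 3), round(cross_unsigned, 3))
```

When the discriminations are equal, the area reduces to the difference between the difficulties, so the shifted pair gives about 0.4 for both indices. For the crossing pair the signed area is close to zero while the unsigned area is clearly positive, which is why unsigned indices are needed to capture non-uniform DIF.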
The criteria proposed by Wright and Oshima (2015) for evaluating the noncompensatory DIF (NCDIF) statistics obtained from the analysis with Raju's area measurement method are as follows: A level: NCDIF < .003; B level: .003 ≤ NCDIF < .008; C level: NCDIF ≥ .008.

Breslow-Day Method
This method, developed by Breslow and Day (1980), evaluates the homogeneity of the association between group membership (focal or reference) and item response across the total test score range. In the absence of homogeneity, there is non-uniform DIF (Aguerri, Galibert, Attorresi & Marañón, 2009). The statistic follows a χ² distribution with one degree of freedom. In addition, the Breslow-Day method has superior statistical power and Type I error rate compared to other proposed methods (Penfield, 2003). The Breslow-Day statistic is given in Equation 4 (Aguerri, Galibert, Attorresi & Marañón, 2009):

Equation (4)
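The homogeneity idea behind the Breslow-Day method can be sketched in Python as follows. This is an illustration under stated assumptions rather than the exact statistic used in this study: it implements the classic homogeneity form, which is referred to a χ² distribution with K - 1 degrees of freedom for K score strata, whereas a one-degree-of-freedom statistic corresponds to the trend variant, which is not shown. Each stratum is a 2x2 table (reference correct, reference incorrect, focal correct, focal incorrect); the function names and counts are hypothetical.

```python
import math

def mh_odds_ratio(tables):
    """Mantel-Haenszel common odds ratio; each table is (a, b, c, d)."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return num / den

def expected_a(n_ref, n_foc, m1, psi):
    """Expected reference-correct count if the stratum odds ratio is psi."""
    if abs(psi - 1.0) < 1e-12:
        return n_ref * m1 / (n_ref + n_foc)
    # Solve A(n_foc - m1 + A) = psi (n_ref - A)(m1 - A) for A.
    qa = 1.0 - psi
    qb = (n_foc - m1) + psi * (n_ref + m1)
    qc = -psi * n_ref * m1
    disc = math.sqrt(qb * qb - 4.0 * qa * qc)
    lo, hi = max(0.0, m1 - n_foc), min(n_ref, m1)
    for root in ((-qb + disc) / (2 * qa), (-qb - disc) / (2 * qa)):
        if lo <= root <= hi:
            return root
    raise ValueError("no admissible root")

def breslow_day(tables):
    """Classic Breslow-Day homogeneity statistic and its degrees of freedom."""
    psi = mh_odds_ratio(tables)
    stat = 0.0
    for a, b, c, d in tables:
        n_ref, n_foc, m1 = a + b, c + d, a + c
        e = expected_a(n_ref, n_foc, m1, psi)
        var = 1.0 / (1 / e + 1 / (n_ref - e) + 1 / (m1 - e) + 1 / (n_foc - m1 + e))
        stat += (a - e) ** 2 / var
    return stat, len(tables) - 1

# Hypothetical counts: both strata of `homog` have an odds ratio of 2.
homog = [(20, 10, 10, 10), (30, 15, 20, 20)]
heter = [(20, 10, 10, 10), (15, 30, 20, 20)]
bd_homog, df_homog = breslow_day(homog)
bd_heter, df_heter = breslow_day(heter)
print(round(bd_homog, 6), df_homog)
print(round(bd_heter, 3), df_heter)
```

When the odds ratio is the same in every stratum, the statistic is essentially zero; it grows as the strata-level odds ratios diverge, which is the pattern associated with non-uniform DIF.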
In this study, whether the items in the 2012 HSEPT mathematics subtest showed DIF according to students' gender and school type was examined with the Breslow-Day, SIBTEST, Lord's χ² and Raju's area measurement methods. The literature includes DIF and bias studies conducted on HSEPT data in terms of gender and school type in Turkey (Arıkan, Uğurlu, & Atar, 2016; Kan, Sünbül, & Ömür, 2013; Karakaya, 2012; Karakaya & Kutlu, 2012; Kelecioğlu, Karabay, & Karabay, 2014; Terzi & Yakar, 2018; Toprak & Yakar, 2017; Yıldırım & Büyüköztürk, 2018). Karakaya (2012) conducted DIF analyses by gender on the items of the 2009 HSEPT 6th, 7th and 8th grade science and technology and mathematics subtests using the Mantel-Haenszel method. Expert opinion then indicated that none of the items with DIF were biased. Karakaya and Kutlu (2012) carried out DIF analyses with the Logistic Regression and Mantel-Haenszel methods according to gender and school type on the 2009 HSEPT Turkish subtest, and then conducted a bias study based on expert opinion. They found that only one of the items showing DIF by gender and school type was biased, and that this bias was related to gender. Kan, Sünbül and Ömür (2013) used the Transformed Item Difficulty, Mantel-Haenszel, Logistic Regression, Lord's χ² and Raju's area measurement methods. According to their results, the majority of the items in the subtests did not show DIF under the methods based on Classical Test Theory, whereas the majority of the items showed DIF under the methods based on Item Response Theory. Kelecioğlu, Karabay and Karabay (2014) performed DIF analyses using the SIBTEST, Mantel-Haenszel and Logistic Regression methods. Working with the 2009 8th grade HSEPT data, they examined the existence of DIF in the Turkish, mathematics, science and technology, and social studies subtests according to school type and gender. In addition, they conducted a bias study based on expert opinion for the items that showed DIF according to at least two of the methods used.
Arıkan, Uğurlu and Atar (2016) carried out DIF analyses using the Mantel-Haenszel, SIBTEST, MIMIC and Logistic Regression methods and then conducted bias studies based on expert opinion. They examined whether the items in the 8th grade science and technology subtest of the 2009 HSEPT showed DIF by gender in subsamples of 300, 600, 1200 and 2000 participants, and found that differing numbers of items showed DIF. Toprak and Yakar (2017) conducted a DIF study using the Logistic Regression, Mantel-Haenszel, SIBTEST, Likelihood Ratio and Wald statistic methods, and determined the existence of DIF by gender in the 8th grade Turkish subtest of the 2011 HSEPT. Terzi and Yakar (2018) conducted DIF analyses by gender using the Mantel-Haenszel and Logistic Regression methods. Yıldırım and Büyüköztürk (2018) carried out DIF and bias studies according to gender and school type using the Mantel-Haenszel and Logistic Regression methods.
A review of the related research shows that some studies use the SIBTEST and Mantel-Haenszel methods together (Arıkan, Uğurlu & Atar, 2016; Kelecioğlu, Karabay, & Karabay, 2014; Toprak & Yakar, 2017) and that Raju's area measurement and Lord's χ² methods have also been used together (Kan, Sünbül & Ömür, 2013). In this study, DIF analyses were carried out with the SIBTEST, Breslow-Day, Lord's χ² and Raju's area measurement methods according to the gender and school type variables. In the CTT-based SIBTEST and Breslow-Day methods, the groups are matched on the observed score (total score). In the IRT-based Lord's χ² and Raju's area measurement methods, the groups are matched on the latent variable. Using a true score estimate derived from the observed scores as the matching criterion makes it possible to control Type I error in DIF determination (Gierl, 2005; Shealy & Stout, 1993). By using CTT-based and IRT-based methods together, the study attempts to describe the similarities and differences between methods that match on observed scores and methods that match on the latent variable. The research is important in this respect, since no study in the literature uses these four methods together to determine DIF. For this reason, it is thought that this research, conducted on an empirical data set, will contribute to the literature.
The research seeks to answer the following questions:
1. Are there any items with DIF by gender in the 2012 HSEPT mathematics subtest according to the SIBTEST, Lord's χ², Raju's area measurement and Breslow-Day methods?
2. Are there any items with DIF by school type in the 2012 HSEPT mathematics subtest according to the SIBTEST, Lord's χ², Raju's area measurement and Breslow-Day methods?

Research Model
The study examined whether the items in the 2012 HSEPT mathematics subtest show DIF by gender and school type according to various methods. Since the aim is to examine the current situation, the study uses a descriptive research design.

Participants
The research was conducted with the eighth grade students who took the HSEPT in 2012. After cases with missing data were removed from the data set, DIF analyses were performed on the responses of 1,063,570 eighth grade students to the mathematics subtest (n female = 523,939 and n male = 539,631; n state school = 1,025,979 and n private school = 37,591).

Data Collection
In this research, data from the eighth grade mathematics subtest of the 2012 Placement Test, which was used to select students for secondary education, were analyzed. The data were accessed with the permission of the General Directorate of Innovation and Educational Technologies of the Ministry of National Education. The mathematics subtest consists of 20 questions and shows a unidimensional structure with normally distributed scores.

Data Analysis
To answer the research questions, the suitability of the data for the DIF detection methods was examined first. LISREL 8.51 and SPSS 21 were used to determine whether the items in the mathematics subtest met the unidimensionality and normality assumptions. The results indicated that the mathematics subtest data met the unidimensionality (RMSEA = .03; CFI = .90; GFI = .98; AGFI = .97) and normality assumptions. Descriptive statistics for the mathematics subtest are given in Table 1. As can be seen in Table 1, the average difficulty of the test was .35 for female students and approximately the same, .36, for male students. The average difficulty values indicate that the test was easy for private school students and difficult for state school students. The skewness and kurtosis coefficients indicate that the scores are normally distributed. In addition, the reliability coefficient of the scores was sufficiently high, at .86 for the whole group.
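The descriptive quantities mentioned above can be illustrated with a short Python sketch that computes item difficulties (proportions correct) and an internal-consistency coefficient from a 0/1 response matrix. The text does not name the reliability coefficient that was used, so KR-20 is assumed here purely for illustration; the response matrix is invented and the function names are hypothetical.

```python
def item_difficulties(responses):
    """Proportion of correct (1) answers for each item (column)."""
    n = len(responses)
    k = len(responses[0])
    return [sum(row[j] for row in responses) / n for j in range(k)]

def kr20(responses):
    """KR-20 reliability for dichotomous items (assumed coefficient)."""
    n = len(responses)
    k = len(responses[0])
    p = item_difficulties(responses)
    totals = [sum(row) for row in responses]
    mean = sum(totals) / n
    # Population variance of the total scores.
    var = sum((t - mean) ** 2 for t in totals) / n
    return (k / (k - 1)) * (1 - sum(pj * (1 - pj) for pj in p) / var)

# Hypothetical responses of four examinees to three items.
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
difficulties = item_difficulties(responses)
mean_difficulty = sum(difficulties) / len(difficulties)
reliability = kr20(responses)
print(difficulties, mean_difficulty, round(reliability, 2))
```

An average difficulty well below .50, such as the .35 and .36 reported for the two gender groups, means that fewer than half of the responses were correct on average.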
Mathematics subtest scores differed significantly by gender (t = -11.74, p < .01). The mean math test score of female students was higher than that of male students. Mathematics subtest scores also differed significantly by school type (t = 298.97, p < .01). The mean math test score of private school students was higher than that of students attending state schools.
In this research, DIF analyses were carried out with the SIBTEST and Breslow-Day methods, based on CTT, and with Lord's χ² and Raju's area measurement methods, based on IRT. For the gender variable, female students were specified as the focal group and male students as the reference group (n female = 523,939; n male = 539,631). For school type, private schools were specified as the focal group and state schools as the reference group (n private school = 37,591; n state school = 1,025,979). In the IRT-based DIF analyses, the estimations were made according to the 2PL model. All DIF analyses were run with the R packages "difR" and "mirt" in RStudio.

Results
The results of the DIF analyses by gender for the items in the HSEPT 2012 math subtest, obtained with the SIBTEST, Raju's area measurement, Lord's χ² and Breslow-Day methods, are given in Tables 2, 3 and 4.
Table 2 shows the SIBTEST findings for the items in the HSEPT 2012 math subtest by gender. According to Table 2, 15 of the 20 items in the mathematics subtest show DIF by gender at the A level, and one item shows DIF at the C level. Of the items with A-level DIF, 10 are in favor of males and 5 are in favor of females. The 4th item, which shows DIF at the C level, is in favor of females.
Table 3 shows the findings of Raju's area measurement method for the items in the HSEPT 2012 math subtest by gender. According to Table 3, 17 of the 20 items in the mathematics subtest show DIF by gender at the A level, one item at the B level, and two items at the C level.
Table 4 presents the findings of Lord's χ² and Breslow-Day methods by gender for the items in the HSEPT 2012 math subtest, in the form of χ² and p values. According to Lord's χ² and Breslow-Day methods, every item in the mathematics subtest shows DIF by gender.
The results of the DIF analyses by school type for the items in the HSEPT 2012 math subtest, obtained with the SIBTEST, Raju's area measurement, Lord's χ² and Breslow-Day methods, are given in Tables 5, 6 and 7.
Table 5 shows the SIBTEST findings for the items in the HSEPT 2012 math subtest by school type. According to Table 5, 19 of the 20 items in the mathematics subtest show DIF by school type at the A level. Of these items, 10 are in favor of private schools and 9 are in favor of state schools.
Table 6 shows the findings of Raju's area measurement method for the items in the HSEPT 2012 math subtest by school type. According to Table 6, 16 of the 20 items in the mathematics subtest show DIF by school type at the A level, and four items at the B level.
Table 7 shows the findings of the Breslow-Day and Lord's χ² methods by school type for the items in the HSEPT 2012 math subtest, in the form of χ² and p values. Every item except the 18th showed DIF by school type according to the Breslow-Day method (p < .05). All items showed DIF according to Lord's χ² method (p < .05).

Discussion and Conclusion
In this study, whether the items in the 2012 8th grade HSEPT mathematics subtest showed DIF according to the gender and school type variables was examined with the Breslow-Day, SIBTEST, Lord's χ² and Raju's area measurement methods. The findings can be summarized as follows.
For the gender variable, the SIBTEST method identified one item with high-level DIF and fifteen items with low-level DIF, whereas Raju's area measurement method identified two items with high-level DIF, one item with moderate-level DIF and seventeen items with low-level DIF. According to Lord's χ² and Breslow-Day methods, all items showed DIF. In addition, according to the SIBTEST method, ten of the items with DIF were in favor of males and six were in favor of females, including the one item with high-level DIF. For the school type variable, the SIBTEST method identified nineteen items with low-level DIF, whereas Raju's area measurement method identified four items with moderate-level DIF and sixteen items with low-level DIF. According to Lord's χ² method, all items showed DIF, whereas nineteen items showed DIF according to the Breslow-Day method. In addition, according to the SIBTEST method, ten of the items with low-level DIF were in favor of private schools and nine were in favor of state schools. In line with these findings, it can be said that the four methods give moderately similar results in determining DIF. The items identified as having DIF were similar across the four methods, but the DIF levels differed. This difference may be due to the different criteria that the methods use to evaluate DIF.
Previous DIF studies conducted on HSEPT data from different years (Arıkan, Uğurlu & Atar, 2016; Kan, Sünbül & Ömür, 2013; Karakaya, 2012; Karakaya & Kutlu, 2012; Kelecioğlu, Karabay & Karabay, 2014; Terzi & Yakar, 2018; Toprak & Yakar, 2017; Yıldırım & Büyüköztürk, 2018) found that the number and level of items showing DIF varied with the DIF detection method used. These results are consistent with the present study. In addition, Karakaya and Kutlu (2012) concluded that the Logistic Regression and Mantel-Haenszel methods showed moderate similarity when they examined the items in the 2009 HSEPT Turkish subtest in terms of gender and school type. Arıkan, Uğurlu and Atar (2016) examined the similarities and differences of the MIMIC, SIBTEST, Logistic Regression and Mantel-Haenszel methods across different sample sizes (300, 600, 1000, 1200 and 2000) and found that the number of items with DIF determined by the SIBTEST method increased as the sample size increased. Çepni (2011) carried out DIF studies with SIBTEST, Mantel-Haenszel, Logistic Regression and IRT-based methods and stated that the methods determined DIF in partly the same and partly different items. In the present study, for both gender and school type, the four methods determined DIF in almost all items. This finding is consistent with the findings of Çepni (2011) and Kan, Sünbül and Ömür (2013), who reported that the hypothesis tests were significant for almost every item when IRT methods were used. In addition, the Breslow-Day method, which has been used less frequently in the literature, was used in this study. Similar to the other three methods, the Breslow-Day method also determined DIF in almost all items.

Suggestions
The four methods used in this study assigned different DIF levels to the items with DIF. In light of these findings, researchers are advised to use at least two methods when determining DIF.
In line with its purpose, this study investigated whether the items showed DIF under different methods, but no bias study was conducted. Whether an item shows DIF, and at what level, may vary even within the same DIF detection method when different criteria are applied. For this reason, further studies can determine through a judgmental process whether the items with DIF are biased. In addition, DIF and bias studies can be conducted with variables other than gender and type of school (socio-economic status, region, etc.).