An extended McNemar test for comparing correlated proportion of positive responses

doi:10.15406/bbij.2019.08.00281

eISSN: 2378-315X

Biometrics & Biostatistics International Journal

Review Article Volume 8 Issue 4


Okeh Uchechukwu Marius,¹


Obiora-Ilouno Happiness²

¹Department of Mathematics and Computer Science, Ebonyi State University Abakaliki, Nigeria
²Department of Statistics, Nnamdi Azikiwe University Awka, Nigeria

Correspondence: Okeh Uchechukwu, Department of Mathematics and Computer Science, Ebonyi State University Abakaliki, Nigeria

Received: May 09, 2019 | Published: July 8, 2019

Citation: Marius OU, Happiness OI. An extended McNemar test for comparing correlated proportion of positive responses. Biom Biostat Int J. 2019;8(4):125-137. DOI: 10.15406/bbij.2019.08.00281


Abstract

The area under a ROC curve (AUC) is an important summary measure for assessing how accurately a diagnostic test discriminates true disease status when the data are paired. This assessment matters most when the AUCs of different diagnostic test procedures are compared. Such comparisons are not without problems: some tests, such as the McNemar test, cannot adjust for the possible presence of ties in the data, which leads to erroneous conclusions through frequent Type II errors. This is evident when the traditional McNemar test yields a high variance and a low chi-square value, so that a false null hypothesis is accepted more often than expected. To tackle this challenge, we extend the usual McNemar test by adjusting for the possible presence of ties in the data, for measurements that may be on any scale. The extended McNemar test makes it easy to estimate the probability that a randomly selected pair of subjects from two diagnostic test procedures both respond positive or both respond negative, and it can be used to test the null hypothesis of equal proportions of positive responses in two diagnostic test procedures. An extensive simulation study was carried out to determine the Type I error and statistical power of the existing and extended tests, and the tests were applied to standard and real data. The results showed that in all cases the extended McNemar test demonstrates superior statistical power and a less conservative Type I error compared with the DeLong area test, the Bandos et al. area test and the usual McNemar area test, and so compares favorably.

Keywords: extended mcnemar test, positive response, correlated data, nonparametric test, diagnostic tests, type ii error

Introduction

The receiver operating characteristic (ROC) curve is a standard tool for evaluating the performance of a diagnostic test when test results are either continuous or ordinal.¹ ROC methodology was first developed by electrical and radar engineers during World War II for detecting signals on the battlefield, and was taken up in signal detection theory in the 1950s.² In an ROC curve, the true positive rate (TPR) is plotted against the false positive rate (FPR) across all possible cut-off values in order to support a meaningful decision. The area under the ROC curve (AUC) is a summary index of diagnostic accuracy. The AUC ranges from 0 to 1 inclusive, and the closer it is to 1, the better the discriminatory power of the diagnostic procedure. The aim of many diagnostic studies is to compare the accuracy of diagnostic tests, to determine the superiority of one test over another for a certain condition or disease, when measurements may be on any scale. Statistical inference may be based on parametric, nonparametric or semi-parametric statistics. For nonparametric inference, a test of the difference between correlated AUCs for paired data was first proposed by DeLong et al.,³ and it is based upon asymptotic theory for U-statistics.⁴ The validity of this and similar methods relies on a large sample size; when the sample size is small, the validity of the test for the difference between two or more AUCs may not be achieved. Two permutation tests for paired ROC studies currently exist: one proposed by Venkatraman & Begg⁵ and the more recent test of Bandos et al.⁶ The test of Bandos et al.⁶ directly tests for equality of AUCs, while the test of Venkatraman & Begg⁵ is more general and tests for equality of the underlying ROC curves. As a result, the test of Venkatraman & Begg⁵ is less powerful for testing equality of AUCs.
Both permutation tests are executed by permuting the labels of the two tests within each diseased and non-diseased subject. Such an approach implicitly assumes that both tests are exchangeable within subject and requires an appropriate transformation, such as ranks, for tests differing in scale. Bandos et al.⁶ compared the performance of their test with that of DeLong et al.³ using simulation and found that the permutation test had greater power than the nonparametric test developed by DeLong et al.³ when there was moderate correlation between the two tests, large AUCs, and small sample sizes.

When comparing two diagnostic procedures, the difference between AUCs is often used. Variability between subjects accounts for a sizeable share of the overall variability of the AUC, and a paired design is recommended to control for it, because paired data usually induce positive correlation between the test results of the same subjects. Based on the use of paired data, Sumi et al.⁷ adopted the usual McNemar⁸ test for comparing two correlated marginal probabilities of positive responses in diagnostic test procedures. This paper extends that work to evaluate the performance of two diagnostic tests in terms of the proportion of positive responses, and compares the resulting method with the existing tests of DeLong et al.,³ Bandos et al.⁶ and Sumi et al.⁷

Estimation of AUC

In estimating the AUC, two main factors have to be considered: the design of the study and the distribution of the test results.⁹ By study design, datasets can be classified into three types: (i) paired data, (ii) unpaired data and (iii) partially paired data. For paired and partially paired data, the correlation between AUCs must be taken into account. By the distribution of the test results, three approaches to estimating the AUC are considered: (i) a parametric approach, (ii) a semi-parametric approach and (iii) a nonparametric approach. In this paper, our focus is on the nonparametric method. The approaches differ in the way the distribution functions of both populations are estimated from their sample values. The nonparametric (empirical) method of estimating the AUC is as follows.

Given two diagnostic tests, let n be the total number of subjects without disease and m the total number of subjects with disease, and let $i = 1, 2, \ldots, n$ and $j = 1, 2, \ldots, m$ index the subjects without and with disease respectively. The corresponding bivariate outcomes for the two diagnostic procedures on the same n non-diseased and m diseased subjects are $(X_{i}^{1}, X_{i}^{2})$ and $(Y_{j}^{1}, Y_{j}^{2})$, each group having a bivariate cumulative distribution function with corresponding marginal distributions. Bamber¹⁰ noted that the AUC is equal to $P (Y > X) + \frac{1}{2} P (Y = X)$, which reduces to $P (Y > X)$ for continuous test results. Let $AUC_{1}$ and $AUC_{2}$ be the AUCs of the two diagnostic procedures. The formula suggested by Hanley and McNeil (1982) for computing the AUC is given as

$A U C = \frac{1}{n m} \sum_{i = 1}^{n} \sum_{j = 1}^{m} g (X_{i}, Y_{j})$ (1)

where m is the number of diseased subjects, n is the number of non-diseased subjects, $X_{i}$ and $Y_{j}$ are respectively the test results of the ith subject without disease and the jth subject with disease, and g is the indicator function comparing $X_{i}$ with $Y_{j}$ such that

$g (X_{i}, Y_{j}) = \begin{cases} 1, & \text{if } Y_{j} > X_{i} \\ 0.5, & \text{if } Y_{j} = X_{i} \\ 0, & \text{otherwise} \end{cases}$ (2)

Therefore, for the $b^{th}$ diagnostic test procedure the AUC can be computed as

$A U C_{b} = \frac{1}{n m} \sum_{i = 1}^{n} \sum_{j = 1}^{m} g (X_{i}^{b}, Y_{j}^{b})$ (3)
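The empirical estimator in equations (1) to (3) can be sketched in a few lines of Python. This is only an illustration: the function name `empirical_auc` and the toy data are assumptions introduced here, not part of the original paper.

```python
def empirical_auc(x, y):
    """Empirical AUC of equations (1)-(2).

    x: test results of the n non-diseased subjects
    y: test results of the m diseased subjects
    """
    if len(x) == 0 or len(y) == 0:
        raise ValueError("both groups must contain at least one subject")
    total = 0.0
    for xi in x:
        for yj in y:
            if yj > xi:
                total += 1.0   # diseased subject scores higher: g = 1
            elif yj == xi:
                total += 0.5   # tie: g = 0.5
            # otherwise g = 0 and nothing is added
    return total / (len(x) * len(y))


if __name__ == "__main__":
    # Hypothetical toy data with two tied pairs.
    print(empirical_auc([1, 2, 3], [2, 3, 4]))
```

For the $b^{th}$ procedure in equation (3), the same function would simply be called on that procedure's own non-diseased and diseased results.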

To test the significance of differences between two or more correlated AUCs, it is necessary to consider the distribution of the test results, which also determines the procedure adopted for estimating the AUCs and their variance-covariance matrix. By comparing the areas under two ROC curves, we can judge which of two diagnostic tests is more suitable for discriminating non-diseased subjects from diseased subjects, or any other two conditions of interest.⁷ Braun & Alonzo¹¹ proposed a modified rank test that does not require the transformation needed by the permutation tests and showed that the modified test has the same power as that of Bandos et al.⁶ Bantis & Feng¹² focused on comparing two correlated ROC curves at a given specificity level. They proposed parametric approaches, transformations to normality, and nonparametric kernel-based approaches. Extensions of their methods also involved inference for the AUC and accommodating covariates. They evaluated the robustness of their techniques through simulations, compared them with other known approaches and presented a real data application involving prostate cancer screening. Their approaches perform satisfactorily in terms of size and power. A limitation of the method of Bantis & Feng¹² is that its Box-Cox version does not take into account the variability of the transformation parameter. Finally, to increase the ability to detect crossing alternatives, Yu et al.¹³ suggested a two-stage test, where the first stage uses the test derived by DeLong et al.³ to test the equality of the two AUCs and the second stage uses a modified area test to test two partial AUCs.

Existing nonparametric tests for comparing correlated AUCs in paired sample design

A number of tests exist for comparing two or more AUCs or proportion of positive responses for the matched sample case.

DeLong et al’s conventional nonparametric method for comparing AUCs

DeLong et al.³ developed a fully nonparametric approach to compare two correlated AUCs of two diagnostic tests for paired samples of subjects by using the theory of generalized U-statistics, leading to an asymptotically normal test statistic. This method is important because it allows the type I error and statistical power of the conventional nonparametric test for comparing two AUCs to be studied over a wide range of relevant parameters and against various alternatives. The test is limited by the fact that the unbiased nonparametric estimator of the AUC is built from an indicator variable that must be evaluated for every pairing of subjects from the two groups, so the number of comparisons, and hence the computational time, can be large. In estimating the AUC, a sigmoid function is sometimes used instead of the indicator function.¹⁴ Moreover, the method of DeLong et al.³ is based only on a continuous scale of measurement. The method of structural components is used to generate an estimated covariance matrix, and the resulting test statistic asymptotically has a chi-square distribution.

Suppose $X_{i}, i = 1, 2, ...., n$ denote test results for a sample of n non-diseased subjects, and $Y_{j}, j = 1, 2, ...., m$ denote the test results for m diseased subjects. For each $(X_{i}, Y_{j})$ pair, an indicator function is defined as follows:

$I (X_{i}, Y_{j}) = \begin{cases} 1, & \text{if } Y_{j} > X_{i} \\ 0.5, & \text{if } Y_{j} = X_{i} \\ 0, & \text{if } Y_{j} < X_{i} \end{cases}$ (4)

The average of these values for I over all nm comparisons is the Wilcoxon or Mann-Whitney U statistic:

$U = \frac{1}{m n} \sum_{i = 1}^{n} \sum_{j = 1}^{m} I (X_{i}, Y_{j}) .$ (5)

where U is equivalent to the area under the trapezoidal ROC curve (Wieand et al.¹⁵), obtained by connecting the ROC data points by straight lines. According to Hajian-Tilaki & Hanley,¹⁶ the expected value of U is the area under the theoretical (population) ROC curve $θ$: $E (U) = θ = P (Y > X) .$

An alternative representation, used by DeLong et al.,³ is to define the components of the U statistic for each of the n non-diseased subjects and for each of the m diseased subjects:

$Var_{N} (X_{i}) = \frac{1}{m} \sum_{j = 1}^{m} I (X_{i}, Y_{j}) \quad \text{and} \quad Var_{D} (Y_{j}) = \frac{1}{n} \sum_{i = 1}^{n} I (X_{i}, Y_{j}) .$ (6)

Here $Var_{N} (X_{i})$ and $Var_{D} (Y_{j})$ are called “pseudo-values” or “pseudo-accuracies.” The pseudo-value for the ith subject in the non-diseased group is the proportion of Y’s in the sample of diseased subjects that are greater than $X_{i}$, while the pseudo-value for the jth subject in the diseased group is the proportion of X’s in the sample of non-diseased subjects that are less than $Y_{j}$. The pseudo-values can be used in place of the original diagnostic test results {X} and {Y} to construct the empirical ROC curve. Their sample averages are respectively given as

$\bar{V} a r_{N} = \frac{1}{n} \sum_{i = 1}^{n} V a r_{N} (X_{i}) = \frac{1}{n m} \sum_{i = 1}^{n} \sum_{j = 1}^{m} I (X_{i}, Y_{j}) = U .$ (7)

and

$\bar{V} a r_{D} = \frac{1}{m} \sum_{j = 1}^{m} V a r_{D} (Y_{j}) = \frac{1}{n m} \sum_{i = 1}^{n} \sum_{j = 1}^{m} I (X_{i}, Y_{j}) = U .$ (8)

Therefore

$A \hat{U} C = U = \frac{1}{n} \sum_{i = 1}^{n} Var_{N} (X_{i}) = \frac{1}{m} \sum_{j = 1}^{m} Var_{D} (Y_{j})$ (9)

Thus, the average of the n values of $Var_{N}$ and the average of the m values of $Var_{D}$ are both equal to the U statistic, which is why they are called pseudo-accuracy measures. As was shown by Hettmansperger,¹⁷ the estimate of the variance of the U statistic (which he called W instead of U) can be expressed as the sum of the variances of $Var_{N}$ and $Var_{D}$ and a third component, $U (1 - U) / n m .$ DeLong et al.³ omitted the third component, since it is negligible when n and m are large. They explained that for a single diagnostic test, the variance of the AUC is given as

$V a r [A \hat{U} C] = V a r [\hat{U}] = \frac{S_{T_{D}}^{2}}{m} + \frac{S_{T_{N}}^{2}}{n} .$ (10)

where $S_{T_{D}}^{2}$ and $S_{T_{N}}^{2}$ are respectively the sample variances of the diseased and non-diseased components, defined as

$S_{T_{N}}^{2} = \frac{\sum_{i = 1}^{n} {[Var_{N} (X_{i}) - AUC]}^{2}}{n - 1} \quad \text{and} \quad S_{T_{D}}^{2} = \frac{\sum_{j = 1}^{m} {[Var_{D} (Y_{j}) - AUC]}^{2}}{m - 1}$ (11)
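As a minimal sketch of equations (6), (10) and (11) for a single diagnostic test, the components and the variance estimate might be computed as below. The function names and the toy data are assumptions made for illustration, and the third variance component is omitted, as in the text.

```python
def delong_components(x, y):
    """Components of equation (6): one value per non-diseased and per diseased subject."""
    def indicator(xi, yj):
        if yj > xi:
            return 1.0
        if yj == xi:
            return 0.5
        return 0.0

    n, m = len(x), len(y)
    var_n = [sum(indicator(xi, yj) for yj in y) / m for xi in x]
    var_d = [sum(indicator(xi, yj) for xi in x) / n for yj in y]
    return var_n, var_d


def delong_single_variance(x, y):
    """Return (AUC, estimated variance of AUC) following equations (9)-(11)."""
    n, m = len(x), len(y)
    if n < 2 or m < 2:
        raise ValueError("each group needs at least two subjects")
    var_n, var_d = delong_components(x, y)
    auc = sum(var_n) / n                                  # equation (9)
    s2_n = sum((v - auc) ** 2 for v in var_n) / (n - 1)   # equation (11)
    s2_d = sum((v - auc) ** 2 for v in var_d) / (m - 1)
    return auc, s2_d / m + s2_n / n                       # equation (10)


if __name__ == "__main__":
    print(delong_single_variance([1, 2, 3], [2, 3, 4]))
```

Both lists of components average to the same AUC, which is the property stated in equation (9).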

The null hypothesis of interest is the equality of the AUCs of two diagnostic test procedures when the data are paired (and, by extension, when the test results are measured over the same period). The test statistic according to DeLong et al.³ is the Z-test given as

$Z = \frac{A U C_{1} - A U C_{2}}{\sqrt{V a r (A U C_{1} - A U C_{2})}}$ (12)

where $Var (AUC_{1} - AUC_{2}) = Var (AUC_{1}) + Var (AUC_{2}) - 2 Cov (AUC_{1}, AUC_{2})$.

If the two diagnostic tests are not applied to the same subjects, the two AUCs are independent and the covariance term is zero. To estimate the AUCs of the two diagnostic test procedures and their variances, DeLong et al.³ defined the variance of each AUC as

$Var (AUC_{b}) = \frac{S_{T_{D b}}^{2}}{m_{b}} + \frac{S_{T_{N b}}^{2}}{n_{b}}$ (13)

where $S_{T_{N b}}^{2} = \frac{\sum_{i = 1}^{n_{b}} {[Var_{N} (X_{b i}) - AUC_{b}]}^{2}}{n_{b} - 1}, b = 1, 2,$

and $S_{T_{D b}}^{2} = \frac{\sum_{j = 1}^{m_{b}} {[Var_{D} (Y_{b j}) - AUC_{b}]}^{2}}{m_{b} - 1}, b = 1, 2.$

The components $Var_{N} (X_{b i})$ and $Var_{D} (Y_{b j})$ are respectively defined, as in equation (6), as

$Var_{N} (X_{b i}) = \frac{1}{m_{b}} \sum_{j = 1}^{m_{b}} I (X_{b i}, Y_{b j}) \quad \text{and} \quad Var_{D} (Y_{b j}) = \frac{1}{n_{b}} \sum_{i = 1}^{n_{b}} I (X_{b i}, Y_{b j}), \quad b = 1, 2,$ (14)

where $I (X_{b i}, Y_{b j}) = \begin{cases} 1, & \text{if } Y_{b j} > X_{b i} \\ 0.5, & \text{if } Y_{b j} = X_{b i} \\ 0, & \text{if } Y_{b j} < X_{b i} \end{cases}$

$A U C_{b} = \frac{1}{m_{b}} \sum_{j = 1}^{m_{b}} V a r_{D} (Y_{b j}) = \frac{1}{n_{b}} \sum_{i = 1}^{n_{b}} V a r_{N} (X_{b i}), b = 1, 2.$ (15)

Note here that $Y_{b j}$ and $X_{b i}$ are the observed results of diagnostic test procedure b for the diseased and non-diseased subjects respectively.

Also $Cov (AUC_{1}, AUC_{2}) = \frac{S_{T_{D_{1}} T_{D_{2}}}}{m} + \frac{S_{T_{N_{1}} T_{N_{2}}}}{n}$ (16)

where

$S_{T_{D_{1}} T_{D_{2}}} = \frac{1}{m - 1} \sum_{j = 1}^{m} [Var_{D} (Y_{1 j}) - AUC_{1}] [Var_{D} (Y_{2 j}) - AUC_{2}]$

and $S_{T_{N_{1}} T_{N_{2}}} = \frac{1}{n - 1} \sum_{i = 1}^{n} [Var_{N} (X_{1 i}) - AUC_{1}] [Var_{N} (X_{2 i}) - AUC_{2}]$

Here $S_{T_{D_{1}} T_{D_{2}}}$ is the sample covariance between the diseased-subject components of the first and second diagnostic test procedures, and $S_{T_{N_{1}} T_{N_{2}}}$ is the sample covariance between the non-diseased-subject components of the two procedures. $Var_{D} (Y_{1 j})$ and $Var_{D} (Y_{2 j})$ are the components for the jth diseased subject under the first and second diagnostic test procedures, and $Var_{N} (X_{1 i})$ and $Var_{N} (X_{2 i})$ are the components for the ith non-diseased subject under the first and second procedures. Once the variances and the covariance are estimated, one can compute the test statistic of equation (12) and compare the AUCs of the two diagnostic tests.
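Putting equations (12) to (16) together, a paired comparison might be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' code: the function names are hypothetical, the same n non-diseased and m diseased subjects are assumed to be measured by both tests in the same order, and the two-sided p-value uses the standard normal approximation.

```python
import math


def _components(x, y):
    """Per-subject components of equation (14) for one diagnostic test."""
    def ind(xi, yj):
        return 1.0 if yj > xi else (0.5 if yj == xi else 0.0)
    n, m = len(x), len(y)
    var_n = [sum(ind(xi, yj) for yj in y) / m for xi in x]
    var_d = [sum(ind(xi, yj) for xi in x) / n for yj in y]
    return var_n, var_d


def _cov(a, mean_a, b, mean_b):
    """Sample covariance with divisor (length - 1), as in equation (16)."""
    return sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b)) / (len(a) - 1)


def delong_paired_test(x1, y1, x2, y2):
    """Return (AUC1, AUC2, Z, two-sided p) for two tests on the same subjects."""
    if len(x1) != len(x2) or len(y1) != len(y2):
        raise ValueError("paired design: both tests must cover the same subjects")
    n, m = len(x1), len(y1)
    vn1, vd1 = _components(x1, y1)
    vn2, vd2 = _components(x2, y2)
    auc1, auc2 = sum(vn1) / n, sum(vn2) / n
    var1 = _cov(vd1, auc1, vd1, auc1) / m + _cov(vn1, auc1, vn1, auc1) / n   # eq. (13)
    var2 = _cov(vd2, auc2, vd2, auc2) / m + _cov(vn2, auc2, vn2, auc2) / n
    cov12 = _cov(vd1, auc1, vd2, auc2) / m + _cov(vn1, auc1, vn2, auc2) / n  # eq. (16)
    var_diff = var1 + var2 - 2.0 * cov12
    if var_diff <= 0:
        raise ValueError("estimated variance of the AUC difference is not positive")
    z = (auc1 - auc2) / math.sqrt(var_diff)                                  # eq. (12)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail area
    return auc1, auc2, z, p_value


if __name__ == "__main__":
    # Hypothetical data: test 1 separates the groups perfectly, test 2 does not.
    print(delong_paired_test([1, 2, 3, 4], [5, 6, 7, 8],
                             [1, 2, 3, 4], [2.5, 3.5, 5, 6]))
```

The guard on `var_diff` is a design choice of this sketch: when both tests give identical components the statistic is undefined, and raising an error seems safer than returning a misleading value.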

Bandos et al permutation nonparametric test for comparing AUCs

Bandos et al.⁶ derived exact and asymptotic permutation tests of the equality of two correlated ROC curves that are designed to have increased power to detect a difference in the AUC; their test directly tests for equality of AUCs. This approach implicitly assumes that both diagnostic test procedures are exchangeable within subject and requires an appropriate transformation, such as ranks, for diagnostic test procedures differing in scale. Bandos et al.⁶ compared the performance of their test with that of DeLong et al.³ via simulation and found that the permutation test had greater power than the nonparametric test developed by DeLong et al.³ when there was moderate correlation between diagnostic tests, large AUCs, and small sample sizes. The test of Bandos et al.⁶ is limited by the fact that it requires exchangeability of the diagnostic test procedures and transformation of the original data. It also requires diagnostic tests that are measured on identical scales, and so may prove less powerful in settings in which the diagnostic test results are skewed (Braun & Alonzo¹¹). Let $\{X_{i}^{b}\}_{i = 1}^{n}, \{Y_{j}^{b}\}_{j = 1}^{m}$ be the test results of diagnostic procedure b for the n truly non-diseased and m truly diseased subjects, and let $\{x_{i}^{b}\}_{i = 1}^{n}, \{y_{j}^{b}\}_{j = 1}^{m}$ be the appropriately transformed test results. An unbiased nonparametric estimator of the AUC for diagnostic procedure b can then be written as $A \hat{U} C_{b} .$ For a paired sample design, the difference between the two AUCs can be estimated as

$A \hat{U} C_{1} - A \hat{U} C_{2} = \frac{\sum_{i = 1}^{n} \sum_{j = 1}^{m} ψ (X_{i}^{1}, Y_{j}^{1})}{n m} - \frac{\sum_{i = 1}^{n} \sum_{j = 1}^{m} ψ (X_{i}^{2}, Y_{j}^{2})}{n m}$ (17)

where

$ψ (X_{i}^{1}, Y_{j}^{1}) - ψ (X_{i}^{2}, Y_{j}^{2}) = \begin{cases} 1, & \text{if } x_{i}^{1} < y_{j}^{1}, x_{i}^{2} > y_{j}^{2} \\ 0.5, & \text{if } x_{i}^{1} < y_{j}^{1}, x_{i}^{2} = y_{j}^{2} \text{ or } x_{i}^{1} = y_{j}^{1}, x_{i}^{2} > y_{j}^{2} \\ 0, & \text{if } x_{i}^{1} < y_{j}^{1}, x_{i}^{2} < y_{j}^{2} \text{ or } x_{i}^{1} > y_{j}^{1}, x_{i}^{2} > y_{j}^{2} \text{ or } x_{i}^{1} = y_{j}^{1}, x_{i}^{2} = y_{j}^{2} \\ - 0.5, & \text{if } x_{i}^{1} > y_{j}^{1}, x_{i}^{2} = y_{j}^{2} \text{ or } x_{i}^{1} = y_{j}^{1}, x_{i}^{2} < y_{j}^{2} \\ - 1, & \text{if } x_{i}^{1} > y_{j}^{1}, x_{i}^{2} < y_{j}^{2} \end{cases}$

Being a U-statistic, the nonparametric estimator of the AUC difference is known to be asymptotically normally distributed under quite general conditions (Hoeffding⁴). Based on this property and the additional assumption of exchangeability, they constructed a simple asymptotic test procedure with test statistic

$\frac{A \hat{U} C_{1} - A \hat{U} C_{2}}{\sqrt{V a r_{Ω} (A \hat{U} C_{1} - A \hat{U} C_{2})}} \overset{d}{\to} N (0, 1)$ (18)

where $Ω$ denotes the space of permutations over which the variance is computed.
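The permutation idea can be illustrated with a Monte Carlo approximation, in which the two test results of each subject are swapped with probability one half. The sketch below is not the exact or asymptotic procedure of Bandos et al.; the function names, the number of permutations and the fixed seed are assumptions chosen for illustration, and both tests are assumed to be on a common scale already (for example after a rank transformation).

```python
import random


def _auc(x, y):
    """Empirical AUC with ties counted as one half."""
    total = 0.0
    for xi in x:
        for yj in y:
            total += 1.0 if yj > xi else (0.5 if yj == xi else 0.0)
    return total / (len(x) * len(y))


def _swap_within_subject(a, b, rng):
    """Exchange the two test results of each subject with probability 1/2."""
    out_a, out_b = [], []
    for u, v in zip(a, b):
        if rng.random() < 0.5:
            u, v = v, u
        out_a.append(u)
        out_b.append(v)
    return out_a, out_b


def permutation_auc_test(x1, y1, x2, y2, n_perm=2000, seed=0):
    """Return (observed AUC difference, Monte Carlo two-sided p-value)."""
    rng = random.Random(seed)
    observed = _auc(x1, y1) - _auc(x2, y2)
    extreme = 0
    for _ in range(n_perm):
        px1, px2 = _swap_within_subject(x1, x2, rng)   # non-diseased subjects
        py1, py2 = _swap_within_subject(y1, y2, rng)   # diseased subjects
        diff = _auc(px1, py1) - _auc(px2, py2)
        if abs(diff) >= abs(observed) - 1e-12:
            extreme += 1
    # Adding 1 to numerator and denominator counts the observed labelling itself.
    return observed, (extreme + 1) / (n_perm + 1)


if __name__ == "__main__":
    print(permutation_auc_test([1, 2, 3, 4], [5, 6, 7, 8],
                               [1, 2, 3, 4], [2.5, 3.5, 5, 6], n_perm=500))
```

With small samples the full set of label swaps could be enumerated instead of sampled, which would give the exact permutation distribution.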

Sumi et al (McNemar Test) nonparametric method for comparing AUCs

Sumi et al.⁷ proposed a method for comparing two proportions of positive responses. The test is based on McNemar⁸ and compares two diagnostic tests on matched data measured on a continuous or a discrete binary scale. It tests the equality of the proportions of positive responses in the two diagnostic tests. Each subject's test result is either positive (coded 1) or negative (coded 0) on each of the two diagnostic procedures, and interest is in testing whether the proportions of positive responses are the same on the first and second diagnostic procedures, taking into account the correlation between the two test results. This test is limited by the fact that it does not provide evidence of inferiority or superiority of one diagnostic test over another; any test capable of this should have a one-sided alternative hypothesis (Zhou et al.¹⁸). The test also assumes summarized data, which leads to loss of information and less reliable decisions about the data analyzed. Such summarized data can contain many ties which, if not adjusted for, reduce the power of any test statistic employed for the analysis. It is worth mentioning that the McNemar⁸ test is concerned with matched pairs of dichotomous test results: the results of each diagnostic test fall into two categories, positive (coded 1) and negative (coded 0). The resulting data are presented in a 2x2 contingency table in which the rows represent the results of one diagnostic test and the columns the results of the other. Each cell gives the number of observed cases with that particular combination of test results.
Whether the test results are continuous or binary, one can compare the two test procedures by constructing a 2x2 contingency table, applying the McNemar⁸ test, and comparing the result with those of the conventional nonparametric test of DeLong et al.³ and the permutation test of Bandos et al.⁶ For two diagnostic tests producing continuous test results $\{X_{i}^{b}\}_{i = 1}^{n}$ and $\{Y_{j}^{b}\}_{j = 1}^{m}$ in the bth diagnostic test, the subjects are ordered so that $\{x_{i}^{b}\}_{i = 1}^{n}$ and $\{y_{j}^{b}\}_{j = 1}^{m}$ are the transformed results in the bth diagnostic test for the n truly negative and m truly positive subjects. Suppose we have an optimal cut-off value $c_{b}$ for the bth diagnostic test; we then classify all results above $c_{b}$ as positive and all results less than or equal to $c_{b}$ as negative, so that a 2x2 contingency table can be constructed for each diagnostic procedure. The resulting table is Table 1, in which $A_{b}$ = number of subjects with disease who tested positive $(y_{j}^{b} > c_{b}),$ $B_{b}$ = number of subjects without disease who tested positive $(x_{i}^{b} > c_{b}),$ $C_{b}$ = number of subjects with disease who tested negative $(y_{j}^{b} \leq c_{b}),$ and $D_{b}$ = number of subjects without disease who tested negative $(x_{i}^{b} \leq c_{b}) .$ Each diagnostic test result is thus used to obtain a 2x2 contingency table based on the optimal cut-off value, so that one can verify whether the diagnostic test result is associated with the true observed status. To test for the significance of any observed change using the McNemar⁸ test, one sets up a fourfold table of frequencies representing the first and the second sets of responses (test results) from the same subjects.
If both diagnostic test procedures have significant effects, in other words if they are correlated, we can combine the two diagnostic test procedures, obtaining matched-pair data that are summarized in contingency Table 2.

Test result for diagnostic procedure b	Observed (True) status		Total
	Diseased (+)	Non-diseased (-)	
Positive (+ve)	$A_{b}$	$B_{b}$	$A_{b} + B_{b}$
Negative (-ve)	$C_{b}$	$D_{b}$	$C_{b} + D_{b}$
Total	$A_{b} + C_{b}$	$B_{b} + D_{b}$	$n_{b}$

Table 1 A 2x2 contingency table for bth (b=1, 2) diagnostic test procedure

	Diagnostic test 1		Total
Diagnostic test 2	Positive (+ve)	Negative (-ve)	
Positive (+ve)	A ($P_{A}$)	B ($P_{B}$)	A + B ($P_{A} + P_{B}$)
Negative (-ve)	C ($P_{C}$)	D ($P_{D}$)	C + D ($P_{C} + P_{D}$)
Total	A + C ($P_{A} + P_{C}$)	B + D ($P_{B} + P_{D}$)	N (1)

Table 2 A 2x2 contingency table for two diagnostic test procedures
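The construction of the matched-pair table from continuous results and cut-off values might be sketched as follows. The function name, the cut-offs and the data are hypothetical; the cell convention is an assumption taken from the proportions defined for this table, where A + B counts the positives on diagnostic test 2 and A + C the positives on diagnostic test 1.

```python
def paired_table(t1, t2, c1, c2):
    """Cross-classify paired results of two tests after dichotomising at cut-offs.

    t1, t2: results of diagnostic tests 1 and 2 on the same N subjects
    c1, c2: cut-offs; a result strictly above its cut-off is called positive
    Returns (A, B, C, D):
      A = both tests positive
      B = test 2 positive, test 1 negative
      C = test 2 negative, test 1 positive
      D = both tests negative
    """
    if len(t1) != len(t2):
        raise ValueError("paired design: both tests must cover the same subjects")
    a = b = c = d = 0
    for r1, r2 in zip(t1, t2):
        pos1, pos2 = r1 > c1, r2 > c2
        if pos1 and pos2:
            a += 1
        elif pos2:
            b += 1
        elif pos1:
            c += 1
        else:
            d += 1
    return a, b, c, d


if __name__ == "__main__":
    # Hypothetical measurements on four subjects, cut-off 3 for both tests.
    print(paired_table([1, 5, 6, 2], [4, 7, 1, 0], 3, 3))
```

Only the discordant cells B and C enter the McNemar statistic, so the choice of cut-offs matters mainly through how many pairs it moves into those two cells.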

$P_{A}$ represents the probability of positive test results on both test procedures, $P_{B}$ is the probability of a positive test result in diagnostic test procedure 2 but a negative test result in diagnostic test procedure 1, and $P_{C}$ and $P_{D}$ are similarly defined. A, B, C and D are the corresponding frequencies of test results on both diagnostic tests. For instance, A is the number of pairs in which diagnostic test 1 and diagnostic test 2 subjects both respond positive, D is the number of pairs in which both respond negative, and N is the number of diagnostic test 1 - diagnostic test 2 pairs studied. From Table 2, the proportion of diagnostic test 1 subjects studied who respond positive is

$p_{1} = \frac{A + C}{N}$ (19)

while the proportion of diagnostic test 2 studied who respond positive is

$p_{2} = \frac{A + B}{N}$ (20)

The difference between the proportions of diagnostic test 1 and diagnostic test 2 subjects who respond positive is

$p_{2} - p_{1} = \frac{B - C}{N}$ (21)

which is independent of A and D, the number of test results in which the diagnostic test 1 and diagnostic test 2 subjects both respond positive or both respond negative respectively.

The standard error of the difference between the two proportions of positive responses is

$S e (p_{2} - p_{1}) = \frac{\sqrt{B + C}}{N}$ (22)

which is also unaffected by A and D.

If $π_{1}$ and $π_{2}$ are respectively the proportions of the sampled populations who respond positive on diagnostic test 1 and diagnostic test 2, then a null hypothesis of interest is that the two diagnostic test procedures perform equally:

$H_{0} : π_{2} - π_{1} = 0 \quad \text{versus} \quad H_{1} : π_{2} - π_{1} \neq 0$ (23)

Equivalently, following Sumi et al.,⁷ one tests whether the marginal probabilities of a positive result on the two diagnostic tests in Table 2 are equal:

$H_{0} : P_{A} + P_{B} = P_{A} + P_{C} \quad \text{versus} \quad H_{1} : P_{A} + P_{B} \neq P_{A} + P_{C}$ (24)

The McNemar (1947) test statistic for testing the null hypothesis of equations (23) and (24) is

$χ^{2} = {\left( \frac{p_{2} - p_{1}}{Se (p_{2} - p_{1})} \right)}^{2} = \frac{{(B - C)}^{2}}{B + C}$ (without continuity correction) (25)

$χ^{2} = \frac{{(| B - C | - 1)}^{2}}{B + C}$ (with continuity correction) (26)

which has a chi-square distribution with 1 degree of freedom. The null hypothesis of equal population proportions is rejected at the $α$ level of significance in favour of the alternative hypothesis if

$χ^{2} \geq χ_{1 - α; 1}^{2}$ (27)

The McNemar test used here employs a continuous distribution to approximate a discrete probability distribution, which is why a continuity correction is recommended in calculating the test statistic. When the sample size is small, the exact binomial probability for the data should be used in the interest of accuracy (Sumi et al.⁷). Unlike the methods of DeLong et al.³ and Bandos et al.,⁶ the McNemar test is applicable to both continuous and discrete binary scale data, whether or not the true disease status (gold standard) is known.
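Equations (25) to (27) and the exact binomial alternative can be sketched with the Python standard library alone. The function names are hypothetical. Two choices here are assumptions of this sketch rather than statements from the paper: the corrected numerator is floored at zero so that B = C gives a statistic of zero, and the exact p-value doubles the smaller binomial tail.

```python
import math


def mcnemar_chi2(b, c, continuity=True):
    """McNemar statistic of equation (25) or (26) and its chi-square(1) p-value."""
    if b + c == 0:
        raise ValueError("no discordant pairs: the statistic is undefined")
    diff = abs(b - c)
    if continuity:
        diff = max(diff - 1, 0)   # equation (26), floored at zero (assumption)
    stat = diff ** 2 / (b + c)
    # For 1 degree of freedom, P(chi2 >= s) = erfc(sqrt(s / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


def mcnemar_exact(b, c):
    """Two-sided exact p-value: under H0, B ~ Binomial(B + C, 1/2)."""
    n = b + c
    if n == 0:
        raise ValueError("no discordant pairs: the test is undefined")
    k = min(b, c)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


if __name__ == "__main__":
    # Hypothetical discordant counts B = 10 and C = 2.
    print(mcnemar_chi2(10, 2, continuity=False))
    print(mcnemar_chi2(10, 2, continuity=True))
    print(mcnemar_exact(10, 2))
```

Rejecting when the p-value is at most $α$ is equivalent to the critical-value rule of equation (27).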

The problem addressed in this study is that the usual McNemar⁸ test cannot adjust for the possible presence of ties in the data, so that the variance is high while the chi-square value remains low and a Type II error is often committed. To solve this problem, this study compares correlated proportions of positive responses in two diagnostic test procedures by extending the usual McNemar test statistic to accommodate ties in the data.

Extended McNemar test

This extension is based on the previous work of Sumi et al.,⁷ who applied the usual McNemar⁸ test in comparing correlated marginal probabilities of positive responses from two diagnostic test procedures. The usual McNemar⁸ test assumes that the data are presented in summarized form rather than in raw form that needs to be processed. Often these data are quantitative and may be continuous, meaning that the chance of obtaining tied values is zero in theory; in practice, however, ties do occur. This is one of the limitations of the usual McNemar test that needs attention. If these ties are adjusted for, the power of the test statistic used for data analysis is increased. To extend the usual McNemar test adopted by Sumi et al.⁷ to allow for the possible presence of ties in the data, let $(t_{v 2}, t_{v 1})$ be the test results from diagnostic test 2 and diagnostic test 1 respectively for the vth pair of subjects, where v = 1, 2, ..., N indexes the pairs of subjects undergoing diagnostic tests 2 and 1. Assume that the data are measured on at least an interval scale, and let

$T_{v} = \begin{cases} 1, & \text{if } t_{v 2} \text{ and } t_{v 1} \text{ are test results of subjects responding positive on diagnostic test 2 and negative on diagnostic test 1} \\ 0, & \text{if } t_{v 2} \text{ and } t_{v 1} \text{ are test results of subjects responding both positive or both negative} \\ - 1, & \text{if } t_{v 2} \text{ and } t_{v 1} \text{ are test results of subjects responding negative on diagnostic test 2 and positive on diagnostic test 1} \end{cases}$ (28)

for the vth pair of subjects in diagnostic tests 2 and 1, where v = 1, 2, ..., N and N is the total number of pairs. Let

$π^{+} = P (T_{v} = 1); \quad π^{0} = P (T_{v} = 0); \quad \text{and} \quad π^{-} = P (T_{v} = - 1)$ (29)

where

$π^{+} + π^{0} + π^{-} = 1$ (30)

Therefore let $W = \sum_{v = 1}^{N} T_{v}$ (31)

Here W is the number of pairs in which only diagnostic test 2 responds positive minus the number of pairs in which only diagnostic test 1 responds positive. Based on the above specifications, the expected value of $T_{v}$ is

$E (T_{v}) = π^{+} - π^{-}$ (32)

While $V a r (T_{v}) = π^{+} + π^{-} - {(π^{+} - π^{-})}^{2}$ (33)
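As a short worked step, added here for illustration and using only the definitions in equations (28) to (30), the mean and variance of $T_{v}$ follow because $T_{v}$ takes the values 1, 0 and -1, so that $T_{v}^{2}$ equals 1 exactly when the pair is discordant:

$E (T_{v}) = (1) π^{+} + (0) π^{0} + (- 1) π^{-} = π^{+} - π^{-}$

$E (T_{v}^{2}) = (1)^{2} π^{+} + (0)^{2} π^{0} + (- 1)^{2} π^{-} = π^{+} + π^{-}$

$Var (T_{v}) = E (T_{v}^{2}) - {[E (T_{v})]}^{2} = π^{+} + π^{-} - {(π^{+} - π^{-})}^{2}$

The concordant probability $π^{0}$ enters the variance only through the constraint $π^{+} + π^{-} = 1 - π^{0}$.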

From equations 31 and 32, the expected value of W is

$E (W) = N (π^{+} - π^{-})$ (34)

and, from equation 33, assuming the N pairs are independent,

$Var (W) = N (π^{+} + π^{-} - {(π^{+} - π^{-})}^{2})$ (35)

Note that $π^{+}$, $π^{0}$ and $π^{-}$ are respectively the probabilities that, for a randomly selected pair of subjects from diagnostic tests 2 and 1, the subject from diagnostic test 2 responds positive and the subject from diagnostic test 1 responds negative; the subjects from diagnostic tests 2 and 1 both respond positive or both respond negative; or the subject from diagnostic test 2 responds negative and the subject from diagnostic test 1 responds positive. The sample estimates of these probabilities are respectively defined as

$\hat{π}^{+} = \frac{p^{+}}{N}; \; \hat{π}^{0} = \frac{p^{0}}{N} \text{ and } \hat{π}^{-} = \frac{p^{-}}{N}$ (36)

where $p^{+}$, $p^{0}$ and $p^{-}$ represent respectively the frequencies of 1's, 0's and -1's in the distribution of $T_{v}, v = 1, 2, ..., N$. That is, $p^{+}$, $p^{0}$ and $p^{-}$ are respectively the number of diagnostic test 2 and 1 subject pairs in which the diagnostic test 2 subject responds positive and the diagnostic test 1 subject responds negative; the diagnostic test 2 and 1 subjects both respond positive or both respond negative; or the diagnostic test 2 subject responds negative and the diagnostic test 1 subject responds positive. These frequencies are expressed in terms of diagnostic tests 2 and 1 in Table 3.

Diagnostic test 2			Total
Diagnostic test 1	Positive response ($+ve$)	Negative response ($-ve$)	Total
Positive response ($+ve$)	$n_{11} = p^{0+} = A$	$n_{12} = p^{+} = B$	$n_{11} + n_{12} = A + B$
Negative response ($-ve$)	$n_{21} = p^{-} = C$	$n_{22} = p^{0-} = D$	$n_{21} + n_{22} = C + D$
Total	$n_{11} + n_{21} = A + C$	$n_{12} + n_{22} = B + D$	$n_{..} (= N)$

Table 3 Fourfold Table for presenting Data on paired samples

These are respectively represented from Table 3 as

$p^{+} = n_{12}; \; p^{0} = n_{11} + n_{22} = p^{0+} + p^{0-}; \; p^{-} = n_{21}$ (37)

where $p^{0+} = n_{11}; \; p^{0-} = n_{22}$ (38)

are respectively the number of diagnostic test 2 and 1 subject pairs in which both subjects respond positive and the number in which both respond negative, and $\hat{π}^{0+}$ and $\hat{π}^{0-}$ are the corresponding relative frequencies.

But $π^{+} - π^{-}$ measures the difference in the rate of positive responses by subjects in the diagnostic test 2 and diagnostic test 1 procedures, and its sample estimate is

$\hat{π}^{+} - \hat{π}^{-} = \frac{W}{N} = \frac{p^{+} - p^{-}}{N}$ (39)

and the variance is estimated from Equation 35 as

$\mathrm{Var}(\hat{π}^{+} - \hat{π}^{-}) = \frac{\mathrm{Var}(W)}{N^{2}} = \frac{\hat{π}^{+} + \hat{π}^{-} - (\hat{π}^{+} - \hat{π}^{-})^{2}}{N}$ (40)

But the McNemar test statistic is

$χ^{2} = \left(\frac{p_{2} - p_{1}}{Se(p_{2} - p_{1})}\right)^{2} = \frac{(B - C)^{2}}{B + C}$

with the numerator given as

$W^{2} = (N(\hat{π}^{+} - \hat{π}^{-}))^{2} = (p^{+} - p^{-})^{2}$ (41)

Now a test statistic for the difference between positive response rates for diagnostic test 2 and 1 subjects can be developed by noting that $π^{+}$ represents the proportion of pairs, out of a total of N pairs, in which the subject from the diagnostic test 2 procedure (given, say, treatment T2) responds positive and the subject from diagnostic test 1 (given, say, treatment T1) responds negative; $π^{0}$ represents the proportion of the N pairs in which the members of the pair both respond positive or both respond negative; and $π^{-}$ is the proportion of pairs, out of a total of N pairs, in which the subject from the diagnostic test 2 procedure (given T2) responds negative and the subject from diagnostic test 1 (given T1) responds positive. The diagnostic test 2 and 1 differential positive response rate is $π^{+} - π^{-}$, with its sample estimate and variance given respectively by Equations 39 and 40. If the sample proportions are given respectively as $p_{1} = \frac{A + C}{N}$ and $p_{2} = \frac{A + B}{N}$ based on Table 1, we obtain more detailed information as

$P_{1} = \frac{n_{11} + n_{21}}{N} = \frac{p^{0+} + p^{-}}{N} = \hat{π}^{0+} + \hat{π}^{-}$ (42)

and $P_{2} = \frac{n_{11} + n_{12}}{N} = \frac{p^{0+} + p^{+}}{N} = \hat{π}^{0+} + \hat{π}^{+}$ (43)

where $\hat{π}^{0+} = \frac{p^{0+}}{N}$ and $\hat{π}^{0-} = \frac{p^{0-}}{N}$ (44)

such that $\hat{π}^{0} = \hat{π}^{0+} + \hat{π}^{0-}$ (45)
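The bookkeeping in Equations 36 to 38 and 42 to 45 can be illustrated with a small Python sketch. The function name `fourfold_summary` and the cell counts are hypothetical, chosen only to show how the fourfold cells map onto the estimates; they are not data from the study.

```python
def fourfold_summary(n11: int, n12: int, n21: int, n22: int) -> dict:
    """Map the fourfold cell counts onto the estimates in Equations 36-45."""
    n = n11 + n12 + n21 + n22          # total number of pairs, N
    p_plus, p_minus = n12, n21         # discordant pairs (Equation 37)
    p_zero_pos, p_zero_neg = n11, n22  # concordant pairs (Equation 38)
    return {
        "N": n,
        "pi_plus": p_plus / n,
        "pi_zero": (p_zero_pos + p_zero_neg) / n,
        "pi_minus": p_minus / n,
        "P1": (n11 + n21) / n,         # Equation 42
        "P2": (n11 + n12) / n,         # Equation 43
    }


# Hypothetical counts: n11 = 40, n12 = 25, n21 = 10, n22 = 25.
summary = fourfold_summary(40, 25, 10, 25)
print(summary)
# P2 - P1 and pi_plus - pi_minus are the same quantity, since the
# concordant-positive cell n11 cancels in the difference.
print(summary["P2"] - summary["P1"], summary["pi_plus"] - summary["pi_minus"])
```

The final line makes explicit why the difference in marginal proportions depends only on the discordant cells.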

Now the null hypothesis H0 of interest is that the proportions of subjects responding positive in the diagnostic test 2 and 1 procedures, or treatment conditions T2 and T1, differ by some value $β_{0}$. This is equivalent to testing

$H_{0} : π^{+} - π^{-} = β_{0} \text{ versus } H_{1} : π^{+} - π^{-} \neq β_{0} \; (-1 \leq β_{0} \leq 1)$ (46)

The test statistic is given by

$χ^{2} = \frac{(W - N β_{0})^{2}}{N(\hat{π}^{+} + \hat{π}^{-} - (\hat{π}^{+} - \hat{π}^{-})^{2})}$ (47)

or equivalently

$χ^{2} = \frac{N((\hat{π}^{+} - \hat{π}^{-}) - β_{0})^{2}}{\hat{π}^{+} + \hat{π}^{-} - (\hat{π}^{+} - \hat{π}^{-})^{2}}$ (48)

which is approximately chi-square distributed with 1 degree of freedom for sufficiently large N. The null hypothesis of equal population proportions of positive responses is rejected at the $α$ level of significance in favour of the alternative hypothesis if

$χ^{2} \geq χ^{2}_{1 - α; 1}$ (49)
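A minimal Python sketch of Equation 48 and the decision rule in Equation 49 is given here. The function name, the counts and the hard-coded critical value 3.841 (the upper 5% point of the chi-square distribution with 1 degree of freedom) are illustrative assumptions rather than part of the original analysis.

```python
def extended_mcnemar_statistic(p_plus: int, p_minus: int, n: int,
                               beta0: float = 0.0) -> float:
    """Chi-square statistic of Equation 48 from the discordant counts."""
    pi_plus = p_plus / n
    pi_minus = p_minus / n
    diff = pi_plus - pi_minus
    # Per-pair variance estimate: pi+ + pi- - (pi+ - pi-)^2
    unit_var = pi_plus + pi_minus - diff ** 2
    if unit_var <= 0:
        raise ValueError("variance estimate is not positive")
    return n * (diff - beta0) ** 2 / unit_var


CHI2_CRIT = 3.841  # assumed alpha = 0.05, 1 degree of freedom

# Hypothetical counts: 25 pairs with T_v = +1, 10 with T_v = -1, N = 100.
stat = extended_mcnemar_statistic(25, 10, 100)
reject = stat >= CHI2_CRIT
print(round(stat, 4), reject)
```

Only the two discordant counts and N enter the calculation, which mirrors the remark that concordant pairs affect the statistic only through the total number of pairs.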

Note therefore that under the null hypothesis H0, the numerators of the extended test statistic of Equations 47 and 48 are, as in the usual McNemar⁸ test statistic, independent of $n_{11} = p^{0+}$ and $n_{22} = p^{0-}$, the numbers of pairs in which the diagnostic test 2 and 1 subjects both respond positive or both respond negative to the conditions of interest, while for Equations 47 and 48 the denominator is also independent of $n_{11}$ and $n_{22}$. Hence both the extended test statistic and the usual McNemar⁸ test statistic are not affected by those pairs in which the subjects both respond positive or both respond negative to the disease or treatment condition. Unlike the usual McNemar test statistic, the extended McNemar⁸ test has by specification been adjusted and corrected for the possible presence of ties in the data. In addition, the variance of the extended McNemar test statistic in Equation 48 is smaller than the variance of the usual McNemar test statistic stated between Equations 40 and 41. This is because $\mathrm{Var}(\hat{π}^{+} - \hat{π}^{-}) = \frac{\mathrm{Var}(W)}{N^{2}} = \frac{\hat{π}^{+} + \hat{π}^{-} - (\hat{π}^{+} - \hat{π}^{-})^{2}}{N}$ and $Se(p_{2} - p_{1}) = \frac{\sqrt{B + C}}{N}$, so that

$\mathrm{Var}(\hat{π}^{+} - \hat{π}^{-}) = \frac{\hat{π}^{+} + \hat{π}^{-} - (\hat{π}^{+} - \hat{π}^{-})^{2}}{N} = \frac{n_{12} + n_{21}}{N^{2}} - \frac{(n_{12} - n_{21})^{2}}{N^{3}} \leq \frac{n_{12} + n_{21}}{N^{2}} = \mathrm{Var}(P_{2} - P_{1}),$

since $\frac{(\hat{π}^{+} - \hat{π}^{-})^{2}}{N} = \frac{(n_{12} - n_{21})^{2}}{N^{3}} \geq 0$, with strict inequality whenever $\hat{π}^{+} \neq \hat{π}^{-}$, that is, whenever $n_{12} \neq n_{21}$.
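The same comparison can be written directly on the scale of the test statistics. The worked step below assumes $β_{0} = 0$ and a positive denominator, and uses only $B = n_{12}$, $C = n_{21}$ and the definitions above:

```latex
\begin{aligned}
\chi^{2}_{\text{extended}}
  &= \frac{W^{2}}{N\left(\hat{\pi}^{+} + \hat{\pi}^{-} - (\hat{\pi}^{+} - \hat{\pi}^{-})^{2}\right)}
   = \frac{(B - C)^{2}}{(B + C) - \dfrac{(B - C)^{2}}{N}} \\
  &\geq \frac{(B - C)^{2}}{B + C} = \chi^{2}_{\text{usual}},
\end{aligned}
```

with equality only when $B = C$, in which case both statistics are zero.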

In conclusion, the extended McNemar test statistic is relatively more efficient, and so is likely to be more powerful, than the usual McNemar test statistic whenever the diagnostic test 2 and 1 results of subjects differ in positive response rates $(\hat{π}^{+} \neq \hat{π}^{-}, \text{ or } P_{1} \neq P_{2})$ to the conditions of interest. It is noteworthy that $N(\hat{π}^{+} - \hat{π}^{-})^{2}$ is the reduction in the variance of W, since by the specification of Equation 28 it has been adjusted for the possible presence of ties between the responses of the diagnostic test 2 and 1 procedures. The major difference between the usual McNemar⁸ test and the extended McNemar⁸ test is the adjustment for the possible presence of tied observations in the latter. The extended McNemar⁸ test statistic will therefore likely have a smaller variance and a larger calculated chi-square value than the usual McNemar⁸ test statistic, so that a Type II error is committed more often with the usual McNemar⁸ test than with the extended McNemar⁸ test.

Application to simulation study

We carried out computer simulations to evaluate the performance of the extended McNemar test. We performed extensive simulations to evaluate and compare the Type I errors (empirical test sizes) and statistical power of the extended McNemar⁸ test, the usual (traditional) McNemar test, the conventional nonparametric test of DeLong et al.,³ and the asymptotic test of Bandos et al.⁶ Here we assumed an equal correlation coefficient across the two diagnostic test procedures for diseased and non-diseased test results of subjects measured on continuous and discrete binary scales, with sample sizes of 20, 60, 100, 140 and 180. When the data are measured on a continuous scale, the test results were generated from a standard bivariate normal distribution with means and variances for the two diagnostic tests given respectively as $μ_{1}, σ_{1}^{2}$ and $μ_{2}, σ_{2}^{2}$. The AUCs for the diagnostic test 1 and 2 procedures are respectively $AUC_{1} = Φ\left(\frac{μ_{1}}{\sqrt{1 + σ_{1}^{2}}}\right)$ and $AUC_{2} = Φ\left(\frac{μ_{2}}{\sqrt{1 + σ_{2}^{2}}}\right)$, where $Φ$ is the standard normal cumulative distribution function. For a binary random variable X representing one diagnostic test procedure, a positive test result is coded 1 and a negative test result is coded 0. If binary variables (X, Y) are assumed for correlated diagnostic test procedures, the joint distribution of X and Y is determined, together with their correlation coefficient $(r)$, which has the range $-1 \leq r \leq +1$. For data on a binary scale of measurement, correlated binary test results were generated with the required probabilities of positive responses $(P_{1}, P_{2})$ to obtain a specific difference $(π^{+} - π^{-} = β_{0} \text{ or } π_{b} - π_{t} \geq β_{0})$ between the probabilities of positive responses for the two diagnostic test procedures for the extended McNemar test and the proposed chi-square test respectively. The binary test results for the non-diseased subjects are generated by fixing the probability of positive responses at 0.30 and 0.35. This procedure for simulating binary data is in line with the previous works of Leisch et al.,¹⁹ and Islam et al.,²⁰ who discussed the algorithm for simulating correlated binary test results. SAS version 9 is the statistical software used to perform the simulation study.
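The AUC expression quoted above is straightforward to evaluate. The sketch below is an illustration in Python rather than the SAS code used for the study; the function names are hypothetical, and the means are those listed in the first block of Table 4 with unit variance.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function via math.erf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def binormal_auc(mu: float, sigma2: float) -> float:
    """AUC = Phi(mu / sqrt(1 + sigma^2)) as stated in the text."""
    return normal_cdf(mu / math.sqrt(1.0 + sigma2))


# Means from the first block of Table 4, with sigma^2 = 1.0.
aucs = {mu: binormal_auc(mu, 1.0) for mu in (0.38, 0.76, 1.23, 1.84)}
for mu, auc in aucs.items():
    print(mu, round(auc, 3))
```

Under these settings the four means give AUC values close to 0.6, 0.7, 0.8 and 0.9 respectively.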

For continuous test results, the correlation coefficient r and the parameters (a and b) used to estimate the means $(μ_{1}, μ_{2})$ and variances $(σ_{1}^{2}, σ_{2}^{2})$ parametrically were chosen so that the difference between the two AUCs ranges from 0 to 0.3. For binary test results, the correlation coefficient r is taken to range from 0.25 to 0.75, and the probabilities of positive responses were drawn so that the difference between the probabilities of positive responses for the two diagnostic test procedures ranged from 0 to 0.2. For both the binary and the continuous scenarios, we used 2000 replications in running the simulations. Table 4 compares the empirical test size (Type I error) and the statistical power of the extended McNemar⁸ test with the usual McNemar⁸ test adopted by Sumi et al.,⁷ the conventional nonparametric test developed by DeLong et al.,³ and the asymptotic permutation test developed by Bandos et al.,⁶ for comparing two diagnostic test procedures with continuous test results. The same comparison was carried out for binary test results. The estimates of Type I error and of statistical power are obtained when the proportions of positive responses or the true AUCs for the two diagnostic test procedures are the same and different respectively, as can be seen in Tables 4 and 5. The rejection regions for the tests are determined using a 5% level of significance.
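To make the replication scheme concrete, the following Python sketch estimates the rejection rate of the extended statistic for correlated binary results. It is not the SAS program or the Leisch et al. algorithm used in the study: it draws pairs directly from the joint distribution implied by two marginal probabilities and a correlation, and all function names, parameter values and the 3.841 critical value are assumptions for illustration.

```python
import math
import random


def sample_pair(p1: float, p2: float, r: float, rng: random.Random):
    """Draw (x1, x2) with P(x1=1)=p1, P(x2=1)=p2 and correlation r."""
    p11 = p1 * p2 + r * math.sqrt(p1 * (1 - p1) * p2 * (1 - p2))
    p10 = p1 - p11            # x1 = 1, x2 = 0
    p01 = p2 - p11            # x1 = 0, x2 = 1
    p00 = 1.0 - p11 - p10 - p01
    if min(p11, p10, p01, p00) < 0:
        raise ValueError("infeasible combination of p1, p2 and r")
    u = rng.random()
    if u < p11:
        return 1, 1
    if u < p11 + p10:
        return 1, 0
    if u < p11 + p10 + p01:
        return 0, 1
    return 0, 0


def rejection_rate(p1, p2, r, n, reps=2000, crit=3.841, seed=1):
    """Share of replications in which the extended test rejects H0 (beta0 = 0)."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        b = c = 0
        for _ in range(n):
            x1, x2 = sample_pair(p1, p2, r, rng)
            if x2 == 1 and x1 == 0:
                b += 1        # T_v = +1
            elif x2 == 0 and x1 == 1:
                c += 1        # T_v = -1
        denom = (b + c) - (b - c) ** 2 / n   # estimated Var(W)
        if denom > 0 and (b - c) ** 2 / denom >= crit:
            rejections += 1
    return rejections / reps


size = rejection_rate(0.6, 0.6, 0.5, n=100, reps=500)   # equal proportions
power = rejection_rate(0.6, 0.8, 0.5, n=100, reps=500)  # difference of 0.2
print(size, power)
```

The first call estimates an empirical test size and the second an empirical power; the replication count is reduced here only to keep the example quick to run.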

For smaller AUCs, the extended McNemar test shows a more conservative empirical test size (Type I error) and an increased statistical power when compared to the traditional McNemar⁸ test by Sumi et al.,⁷ the conventional nonparametric method by DeLong et al.,³ and the asymptotic permutation test by Bandos et al.,⁶ when the test results are continuous. But when the correlation coefficient is moderate and the sample size for the two diagnostic test procedures increases, the results are more stable in the continuous case and the four tests mentioned above tend to be very close in terms of their empirical test size and statistical power. The extended McNemar⁸ test shows a higher false positive rate (FPR) when the correlation coefficient r is smaller. This is because the McNemar⁸ test is most suitable when the data are correlated. However, when the correlation coefficient r is increased, the estimate of the FPR reduces drastically. In the same way, when the AUCs are increased, the estimates of the empirical test size (Type I error) for every sample size and all values of the correlation coefficient can be compared. The extended McNemar test discriminates better than the traditional McNemar test by Sumi et al.,⁷ the conventional nonparametric test by DeLong et al.,³ and the permutation test by Bandos et al.,⁶ when the AUCs are higher and the correlation coefficients are lower.

When the AUC values are high and the correlation coefficient is moderate, the other three tests, namely the usual McNemar⁸ test by Sumi et al.,⁷ the test by DeLong et al.,³ and the test by Bandos et al.,⁶ give better statistical power than the extended McNemar test, but as the sample size increases the extended McNemar⁸ test provides statistical power very close to the others. For the binary test results, in all parameter settings and for both large and small sample sizes, the extended McNemar⁸ test shows a less conservative empirical test size (Type I error) and higher statistical power when compared to the tests by Sumi et al.,⁷ DeLong et al.,³ and Bandos et al.⁶ Finally, in the continuous case, the results of the simulation show that the proposed chi-square test and the extended McNemar⁸ test give a Type I error in close agreement with the significance level $α$; when the values of the AUCs are low, this agreement is obtained for moderate and very high correlation coefficients between the diagnostic test procedures. Larger sample sizes in the continuous case also give the extended McNemar⁸ test statistical power that is very comparable to other existing nonparametric methods of comparing correlated AUCs. In addition, for the discrete binary case, the extended McNemar⁸ test possesses better operating characteristics than the other existing tests considered in all parameter settings. The performance of the extended McNemar⁸ test may be impaired in a simulation study when the test result is continuous because of the problem of choosing an optimal cut-off value for classifying the test results of subjects. To make this point clearer, in the next section we adopt a known standard data set that already has a real cut-off value and conduct a bootstrap power analysis to compare the statistical power of all four tests, namely the extended McNemar⁸ test, the usual McNemar⁸ test by Sumi et al.,⁷ the conventional test by DeLong et al.,³ and the permutation test by Bandos et al.⁶ (Tables 4 and 5).

AUC	Mean		Variance	Sample size		$ρ = 0.25$				$ρ = 0.50$				$ρ = 0.75$
$AUC_{1}$, $AUC_{2}$		$μ_{1}$, $μ_{2}$	$σ_{1}^{2} = σ_{2}^{2}$	N	M	D^a	B^b	S^c	EM^d	D^a	B^b	S^c	EM^d	D^a	B^b	S^c	EM^d
Type I error and statistical power
0.6, 0.7 Type I error	.38	.38	1.0	20	20	.049	.040	.065	.069	.048	.044	.059	.061	.051	.052	.050	.049
				60	60	.045	.043	.072	.080	.047	.048	.054	.067	.050	.050	.048	.049
				100	100	.058	.057	.095	.096	.040	.040	.061	.063	.045	.045	.056	.057
				140	140	.047	.047	.087	.091	.043	.042	.083	.086	.043	.042	.076	.083
				180	180	.043	.042	.097	.099	.042	.042	.072	.078	.046	.046	.071	.080
0.6, 0.8 Power	.38	.76	1.0	20	20	.121	.090	.183	.189	.171	.162	.204	.240	.225	.214	.199	.209
				60	60	.188	.177	.334	.357	.297	.287	.387	.398	.397	.386	.453	.462
				100	100	.229	.085	.458	.472	.449	.439	.553	.572	.587	.575	.632	.641
				140	140	.441	.430	.678	.692	.637	.628	.781	.796	.800	.791	.876	.886
				180	180	.608	.604	.841	.855	.808	.801	.914	.935	.936	.932	.962	.978
0.6, 0.9 Power	.38	1.23	1.0	20	20	.404	.364	.468	.472	.570	.523	.558	.576	.723	.626	.603	.589
				60	60	.705	.678	.803	.825	.870	.838	.883	.898	.955	.942	.926	.918
				100	100	.682	.849	.939	.952	.975	.967	.978	.989	.997	.991	.990	.997
				140	140	.978	.976	.995	.898	.998	.998	.998	.998	1.000	1.000	1.000	1.000
				180	180	.996	.996	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000
0.7 , 0.8 Power	.38	1.84	1.0	20	20	.816	.766	.762	.778	.938	.903	.835	.907	.985	.968	.883	.878
				60	60	.990	.983	.982	.986	.998	.998	.991	.996	1.000	1.000	.998	.998
				100	100	.998	.997	.999	.998	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000
				140	140	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000
				180	180	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000
0.7, 0.9 Type I error	.79	.80	1.0	20	20	.047	.041	.048	.049	.046	.044	.039	.049	.049	.051	.029	0.019
				60	60	0.40	.038	.044	.048	.049	.048	.047	.057	.050	.049	.041	.032
				100	100	.061	.057	.047	.058	.036	.035	.046	.048	.046	.048	.039	.028
				140	140	.033	.033	.065	.072	.051	.050	.051	.056	.049	.049	.050	.042
				180	180	.048	.048	.064	.066	.041	.041	.051	.048	.054	.054	.049	.047
0.7, 0.9 Power	.79	1.25	1.0	20	20	.136	.125	.117	.123	.196	.186	.128	.120	.253	.245	.150	.140
				60	60	.231	.220	.228	.236	.350	.339	.271	.243	.470	.459	.324	.309
				100	100	.362	.348	.353	.361	.526	.512	.417	.401	.678	.668	.493	.417
				140	140	.561	.551	.576	.583	.744	.733	.679	.662	.870	.858	.769	.703
				180	180	.729	.723	.755	.763	.903	.898	.669	.619	.969	.966	.911	.879
0.8, 0.9 Power	.79	1.85	1.0	20	20	.531	.497	.356	.462	.696	.656	.414	.389	.824	.780	.467	.412
				60	60	.857	.832	.693	.721	.959	.946	.778	.742	.990	.980	.841	.810
				100	100	.953	.943	.892	.898	.995	.993	.929	.911	1.000	.998	.969	.931
				140	140	.998	.997	.984	.998	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000
				180	180	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000

Table 4 Empirical type I error and statistical power when comparing two diagnostic tests for continuous test results. [$AUC_{1}$, Area of diagnostic test 1; $AUC_{2}$, Area of diagnostic test 2; D, DeLong et al.,³ test; B, Bandos et al.,⁶ test; S, Sumi et al.,⁷ test; EM, Extended McNemar⁸ test]

D^a, Conventional AUC test of DeLong et al.,³; B^b, Approximation to permutation AUC test of Bandos et al.,⁶; S^c, McNemar⁸ test of Sumi et al.,⁷; EM^d, Extended McNemar⁸ test (new method)

AUC	Mean		Variance	Sample size		$ρ = 0.25$				$ρ = 0.50$				$ρ = 0.75$
$AUC_{1}$, $AUC_{2}$		$μ_{1}$, $μ_{2}$	$σ_{1}^{2} = σ_{2}^{2}$	N	M	D^a	B^b	S^c	EM^d	D^a	B^b	S^c	EM^d	D^a	B^b	S^c	EM^d
Type I error and statistical power
0.60 Type I error	0.60	0.00	20	20	.065	.059	.027	.019	.077	.062	.022	.017	.071	.054	.024	.018
			60	60	.059	.054	.038	.027	.069	.068	.039	.024	.069	.065	.048	.032
			100	100	.068	.066	.048	.034	.080	.074	.056	.037	.093	.092	.059	.037
			140	140	.084	.081	.068	.047	.097	.095	.080	.062	.107	.104	.091	.078
			180	180	.115	.112	.079	.058	.124	.122	.093	.056	.135	.132	.120	.96
0.60 Power	0.70	0.10	20	20	.061	.054	.062	.051	.062	.049	.069	.078	.076	.052	.080	.093
			60	60	.071	.064	.131	.109	.073	.069	.146	.163	.075	.068	.220	.287
			100	100	.079	.069	.204	.178	.076	.068	.248	.287	.097	.094	.336	.413
			140	140	.089	.087	.298	.242	.092	.087	.380	.421	.117	.109	.559	.624
			180	180	.102	.098	.439	.217	.112	.110	.557	.734	.146	.140	.764	.813
0.60 Power	0.80	0.20	20	20	.112	.102	.147	.181	.146	.106	.183	.192	.184	.140	.236	.261
			60	60	.182	.165	.343	.479	.231	.213	.408	.524	.303	.268	.584	.692
			100	100	.243	.222	.510	.611	.320	.293	.609	.741	.445	.422	.794	.847
			140	140	.376	.263	.719	.876	.489	.459	.846	.919	.643	.609	.959	.980
			180	180	.521	.497	.907	.968	.625	.603	.960	.987	.806	.787	.893	.907
0.70 Type I error	0.70	0.00	20	20	.065	.056	.031	.027	.071	.057	.023	.019	.069	.060	.024	.020
			60	60	.057	.054	.042	.035	.059	.058	.041	.017	.066	.060	.048	.034
			100	100	.076	.071	.055	.036	.084	.079	.058	.041	.095	.094	.061	.043
			140	140	0.85	.084	.060	.041	.094	.090	.075	.035	.136	.130	.086	.063
			180	180	.098	.096	.086	.063	.118	.116	.098	.062	.163	.159	.129	.108
0.70 Power	0.80	0.10	20	20	.062	.051	.064	.073	.060	.046	.075	.092	.076	.054	.085	.092
			60	60	.069	.065	.137	.148	.070	.068	.160	.177	.078	.068	.345	.269
			100	100	.076	.072	.214	.265	.086	.082	.267	.281	.098	.095	.381	.420
			140	140	.089	.084	.352	.368	.097	.092	.432	.525	.130	.122	.583	.674
			180	180	.110	.103	.480	.519	.110	.106	.606	.718	.166	.157	.784	.819
0.70 Power	0.90	0.20	20	20	.127	.108	.157	.168	.152	.112	.196	.227	.198	.153	.256	.280
			60	60	.198	.184	.372	.428	.251	.238	.445	.632	.336	.301	.627	.684
			100	100	.278	.259	.564	.687	.357	.325	.665	.728	.473	.444	.839	.872
			140	140	.422	.406	.785	.938	.501	.473	.883	.921	.704	.671	.973	.981
			180	180	.584	.565	.931	.981	.696	.674	.778	.835	.852	.820	.989	.991

Table 5 Empirical type I error and statistical power when comparing two diagnostic tests for discrete binary test results. [$AUC_{1}$, Area of diagnostic test 1; $AUC_{2}$, Area of diagnostic test 2; D, DeLong et al.,³ test; B, Bandos et al.,⁶ test; S, Sumi et al.,⁷ test; EM, Extended McNemar⁸ test]

D^a, Conventional AUC test of DeLong et al.,³; B^b, Approximation to permutation AUC test of Bandos et al.,⁶; S^c, McNemar⁸ test of Sumi et al.,⁷; EM^d, Extended McNemar⁸ test (new method)

Application of tests to standard data

In order to demonstrate the workability of the new nonparametric method (extended McNemar test) for comparing correlated proportions of positive responses, we consider a practical data set adopted from Venkatraman & Begg,⁵ who carried out a distribution-free procedure for comparing ROC curves from a paired experiment. This study was aimed at evaluating the performance of two diagnostic test results obtained from the anterior and posterior nodes in the course of diagnosing melanoma.

The data presented in Table 4 of Venkatraman & Begg⁵ provide the results using a clinical scoring system and a dermoscopic scoring scheme. The purpose of the analysis is to determine whether the dermoscope contributes similar diagnostic information. The null hypothesis is that the dermoscope contributes the same information as the clinical scoring system. This is the same as testing the null hypothesis that the sizes of the anterior and posterior nodes possess equivalent diagnostic information. Using these data, the estimated proportions of positive responses for the diagnostic test 1 and 2 procedures are 0.725 and 0.652 respectively, and the estimated correlation coefficient between the two diagnostic tests is 0.157. In testing equivalence of the accuracy of these two diagnostic tests, the conventional test by DeLong et al.,³ the asymptotic permutation test by Bandos et al.,⁶ the usual McNemar⁸ test by Sumi et al.,⁷ and the extended McNemar⁸ test all indicate significantly different performance, yielding two-tailed p-values of 0.0048, 0.017, 0.0028 and 0.0019 respectively.

Bootstrap power analysis for comparing the statistical power of tests

The bootstrap is a powerful nonparametric approach (Efron²¹). In an effort to obtain better and more specific knowledge of the statistical power of the tests, we conducted a bootstrap study in which, for each of the sample sizes considered, 2000 random samples were taken from the data and the rejection rates were computed.
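The resampling scheme can be sketched as follows. This is an illustration under stated assumptions, not the procedure or data of the study: it resamples pair scores $T_{v}$ with replacement, applies the extended statistic with $β_{0} = 0$, and uses hypothetical scores, function names and a 3.841 critical value.

```python
import math
import random


def extended_stat(t_scores) -> float:
    """Extended McNemar chi-square (beta0 = 0) from a list of T_v scores."""
    n = len(t_scores)
    w = sum(t_scores)
    discordant = sum(1 for t in t_scores if t != 0)
    denom = discordant - w * w / n        # estimated Var(W)
    if denom <= 0:
        # No discordant pairs gives 0; all pairs discordant in one
        # direction gives an unbounded statistic.
        return math.inf if w != 0 else 0.0
    return w * w / denom


def bootstrap_rejection_rate(t_scores, m, reps=2000, crit=3.841, seed=1):
    """Share of bootstrap samples of size m in which H0 is rejected."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        resample = [rng.choice(t_scores) for _ in range(m)]
        if extended_stat(resample) >= crit:
            hits += 1
    return hits / reps


# Hypothetical pair scores: 30 pairs with +1, 55 ties, 15 pairs with -1.
scores = [1] * 30 + [0] * 55 + [-1] * 15
rate_small = bootstrap_rejection_rate(scores, m=20, reps=500)
rate_large = bootstrap_rejection_rate(scores, m=100, reps=500)
print(rate_small, rate_large)
```

Repeating the call over a grid of sample sizes gives a rejection-rate profile of the kind tabulated for each test.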

Table 6 shows that, for all sample sizes, the extended McNemar test provides the highest rejection rate, followed by the McNemar⁸ test by Sumi et al.,⁷ and so on. At increased sample sizes, the tests by DeLong et al.,³ Bandos et al.,⁶ and Sumi et al.,⁷ show rejection rates very close to that of the extended McNemar⁸ test.

Sample size		Rejection rate
N	M	D^a	B^b	S^c	EM^d
20	20	0.67	0.538	0.679	0.685
60	60	0.769	0.737	0.819	0.827
100	100	0.869	0.857	0.889	0.89
140	140	0.919	0.911	0.929	0.931
180	180	0.946	0.938	0.977	0.994

Table 6 Bootstrapping Test for obtaining the statistical power of different tests

D^a, Conventional AUC test of DeLong et al.,³; B^b, Approximation to permutation AUC test of Bandos et al.,⁶; S^c, McNemar⁸ test of Sumi et al.,⁷; EM^d, Extended McNemar⁸ test (new method)

Application to real life data

The new test for comparing correlated proportions of positive responses can be applied to real life data on gestational diabetes mellitus (GDM). A random sample of 1113 pregnant women who tested positive on the 50g Glucose Challenge Test (GCT), indicating that their plasma glucose level was at least 140 mg/dl after 1 hour, was used. These same pregnant women were subsequently recalled and subjected to two competing diagnostic test procedures, namely the 2-hour 75g OGTT and the 3-hour 100g OGTT, at various gestation periods according to the standards of the World Health Organization²² and the National Diabetes Data Group.²³ These two diagnostic test procedures are paired. Women who were known diabetics, or who were suffering from any chronic illness, were excluded from the study. The data are measured on a continuous scale and are dichotomized using 7.8 mmol/l (at least 140 mg/dl) as the cut-off value, which is the recommended cut-off value for diagnosing GDM (WHO²²). Pregnant women whose test result is at least 7.8 mmol/l are considered diseased (positive, coded 1); otherwise they are not diseased (negative, coded 0). The GDM response variables (test results) for the diagnostic test 1 and 2 procedures, namely the 75g OGTT and the 100g OGTT, are paired and hence correlated for the 1113 pregnant women considered in this study. The null hypothesis of interest is the equality of the proportions of positive responses for the two diagnostic test procedures. The dichotomized data for the two diagnostic tests are cross-classified and presented in a contingency table to demonstrate the feasibility of the new nonparametric method as well as the existing methods considered. We therefore obtain the sample estimates $\hat{π}^{+}$, $\hat{π}^{0}$ and $\hat{π}^{-}$, the variance estimates and the McNemar test statistic, and test the null hypothesis. In applying the extended McNemar test to the data, we evaluate the values of $T_{v}$ of Equation 28, where $t_{v1}$ and $t_{v2}$ are the test results of the subjects in the vth pair for the diagnostic test 1 and diagnostic test 2 procedures respectively, for $v = 1, 2, ..., 1113$. From the values of $T_{v}$, we have that

$p^{+} = n_{12} = 270; \; p^{0} = n_{11} + n_{22} = p^{0+} + p^{0-} = 134 + 157 = 291; \; p^{-} = n_{21} = 556$

From Equation 36, we have the sample estimates as

$\hat{π}^{+} = \frac{270}{1113} = 0.2426; \; \hat{π}^{0} = \frac{291}{1113} = 0.2615; \; \hat{π}^{-} = \frac{556}{1113} = 0.4995$

But $\hat{π}^{0} = \frac{291}{1113} = \frac{134}{1113} + \frac{157}{1113} = 0.1204 + 0.1411 = \hat{π}^{0+} + \hat{π}^{0-}$.

Also $W = p^{+} - p^{-} = n_{12} - n_{21} = 270 - 556 = -286$.

From Equation 35, we have the estimated variance of W as

$\mathrm{Var}(W) = (1113)(0.2426 + 0.4995 - (0.2426 - 0.4995)^{2}) = (1113)(0.7421 - 0.0660) = (1113)(0.6761) = 752.4993.$

Therefore, to test the null hypothesis of Equation 46 using the extended McNemar test statistic, we have from Equation 47 with $β_{0} = 0$ that $χ^{2} = \frac{(270 - 556)^{2}}{752.4993} = \frac{81796}{752.4993} = 108.69$ (P-value = 0.0012), which with 1 degree of freedom is statistically significant, showing that diagnostic test 1 and diagnostic test 2 differ in their detection of GDM in pregnant women. In other words, the probabilities of positive responses from the two diagnostic test procedures for the pregnant women differ significantly. To compare with this result, we use the usual McNemar⁸ test, as adopted by Sumi et al.,⁷ to analyze the GDM data. The estimated variance of $P_{2} - P_{1}$ is $\mathrm{Var}(P_{2} - P_{1}) = \frac{p^{+} + p^{-}}{N^{2}} = \frac{n_{12} + n_{21}}{N^{2}} = \frac{270 + 556}{(1113)^{2}} = \frac{826}{1238769} = 0.000667.$ Its test statistic for the H0 of Equation 46 with $β_{0} = 0$ is $χ^{2} = \frac{(270 - 556)^{2}}{270 + 556} = \frac{81796}{826} = 99.03$ (P-value = 0.0028), which with 1 degree of freedom is also statistically significant. Even though the extended McNemar⁸ test statistic and the usual McNemar⁸ test statistic both led to the rejection of the null hypothesis, the relative sizes of the calculated chi-square values and the p-values obtained indicate that the usual McNemar⁸ test statistic, as adopted by Sumi et al.,⁷ has a greater chance of leading to a Type II error than the extended McNemar⁸ test statistic. Also, we note that the estimated variance of $\hat{π}^{+} - \hat{π}^{-}$ is $\mathrm{Var}(\hat{π}^{+} - \hat{π}^{-}) = \frac{0.2426 + 0.4995 - (0.2426 - 0.4995)^{2}}{1113} = \frac{0.7421 - 0.0660}{1113} = \frac{0.6761}{1113} = 0.000607$, which is $0.000667 - 0.000607 = 0.00006 = \frac{0.0660}{1113} = \frac{(\hat{π}^{+} - \hat{π}^{-})^{2}}{N}$ smaller, as expected, than the variance of $P_{2} - P_{1}$ obtained when the usual or unmodified McNemar test is used.
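The two chi-square values above can be checked with a short Python sketch that uses the discordant counts 270 and 556 and N = 1113 from the text. The function names are hypothetical. The sketch works with unrounded proportions, so its variance of W differs slightly from the figure in the text, which rounds the proportions to four decimals.

```python
def extended_mcnemar_chi2(p_plus: int, p_minus: int, n: int,
                          beta0: float = 0.0):
    """Return (chi-square of Equation 47, estimated Var(W))."""
    w = p_plus - p_minus
    # N * (pi+ + pi- - (pi+ - pi-)^2), written in terms of counts
    var_w = (p_plus + p_minus) - w ** 2 / n
    return (w - n * beta0) ** 2 / var_w, var_w


def usual_mcnemar_chi2(b: int, c: int) -> float:
    """Usual McNemar statistic (B - C)^2 / (B + C)."""
    return (b - c) ** 2 / (b + c)


ext_chi2, var_w = extended_mcnemar_chi2(270, 556, 1113)
usual_chi2 = usual_mcnemar_chi2(270, 556)
print(round(var_w, 2), round(ext_chi2, 2), round(usual_chi2, 2))
```

Both statistics share the numerator $(270 - 556)^{2} = 81796$; only the variance term in the denominator differs.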

Application of existing tests to the real life data

Applying the tests to the real life data, we obtain estimates of the AUCs for the two diagnostic tests, the correlation coefficients between the test results of the two diagnostic test procedures, and the p-values from testing the equality of performance of the two diagnostic test procedures, as shown in Table 7.

The results in Table 7 indicate that all tests show a significant difference, since the p-values are less than the chosen significance level of 5 percent at the sample size of 1113 for the GDM data. Overall, the tests agree that the two procedures differ significantly in performance, and the extended McNemar⁸ test outperforms the other tests considered in this work.

S/n	Tests	$p_{1}$	$p_{2}$	$\widehat{AUC}_{1}$	$\widehat{AUC}_{2}$	Correlation coefficient (r)	p-value
1	Extended McNemar⁸	0.7214	0.7022	0.91183	0.9012	0.1654	0.0007
2	Sumi et al.,⁷	0.6765	0.6532	0.8675	0.8564	0.1754	0.0012
3	Bandos et al.,⁶	0.6375	0.6253	0.7392	0.7235	0.2732	0.00014
4	DeLong et al.,³	0.6453	0.6359	0.6443	0.6248	0.2401	0.0016

Table 7 Comparison of the tests by estimates obtained from the data on GDM

Discussion

The extended McNemar⁸ test statistic presented in this work, apart from being simple to calculate, easy to understand and readily applicable, has proved more powerful than the usual McNemar⁸ test because it provides for the possible presence of ties in the data used for analysis. From the analysis, it was seen that even though the extended McNemar⁸ test statistic and the usual McNemar⁸ test statistic both led to the rejection of the null hypothesis, the relative sizes of the calculated chi-square values and the p-values obtained indicate that the usual McNemar⁸ test statistic, as adopted by Sumi et al.,⁷ has a greater chance of leading to a Type II error than the extended McNemar⁸ test statistic. The proposed chi-square test does not require knowledge of the true disease status, so it can be used when the gold standard is not known. This is not the case with other traditional tests such as those of Bandos et al.,⁶ and DeLong et al.,³ which require knowledge of the true status (gold standard) in estimating the AUC.

The extended McNemar⁸ test, as an alternative method of evaluating the accuracy of diagnostic tests, can be used to test the null hypothesis that the proportions of positive responses are equal in two diagnostic test procedures.
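For orientation, the null hypothesis of equal proportions of positive responses can be illustrated with the classical (unadjusted) McNemar chi-square statistic. The sketch below is not the extended statistic proposed in this paper: it is the textbook form based only on the two discordant counts, without continuity correction. The function name `mcnemar_classical` and the counts `b = 25` and `c = 10` are hypothetical and chosen purely for illustration.

```python
import math


def mcnemar_classical(b, c):
    """Classical McNemar chi-square test (1 degree of freedom, no continuity correction).

    b: pairs where test 1 is positive and test 2 is negative (hypothetical input)
    c: pairs where test 1 is negative and test 2 is positive (hypothetical input)
    Returns (chi_square, p_value).
    """
    if b < 0 or c < 0:
        raise ValueError("counts must be non-negative")
    if b + c == 0:
        raise ValueError("no discordant pairs: the statistic is undefined")
    chi2 = (b - c) ** 2 / (b + c)
    # Upper-tail probability of a chi-square variable with 1 degree of freedom:
    # P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Illustrative counts only; they are not taken from the GDM data.
chi2, p = mcnemar_classical(b=25, c=10)
print(f"chi-square = {chi2:.4f}, p-value = {p:.4f}")
```

Concordant pairs do not enter this classical form at all, which is the feature the extended test is designed to address through its adjustment for ties.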

It is known that, in the study of statistical methods for diagnosis, one of the most interesting topics is the comparison of the accuracy of two binary diagnostic tests in relation to the same gold standard. The extended McNemar⁸ test compares the accuracy of two diagnostic tests without making any reference to the gold standard. This is indeed an innovation in statistical methods for diagnosis.

Summary and conclusions

The extended McNemar⁸ test is applied to correlated data to compare the discriminatory abilities of two different test procedures. The data analysis involved computer simulation, standard data and real-life data, and the results showed that the extended McNemar⁸ test can be a good alternative to the tests by Sumi et al.,⁷ DeLong et al.,³ and Bandos et al.,⁶ whose limitations were outlined in this paper. The extended McNemar test is therefore simple to communicate to potential users of the procedures and easy to apply in discriminating between diagnostic test procedures, even by non-statisticians. The findings are summarized as follows:

In comparison with the other tests, the extended McNemar⁸ test statistic is a very suitable alternative, having the highest statistical power among the analyses carried out, and so has the capacity to discriminate better between diseased and non-diseased subjects.
The extended McNemar test does not require the knowledge of true status of subjects or any other gold standard in carrying out its analysis.
The proposed extended McNemar⁸ test offers reliable statistical inferences even in small-sample problems and avoids the long computation time and heavy computer memory use normally experienced when estimating the test statistics of DeLong et al.³ and Bandos et al.⁶
The extended McNemar⁸ test adjusts for the possible presence of ties in the data and therefore eliminates erroneous conclusions occasioned by using data without adjustment.
The variance of the extended McNemar⁸ test statistic is smaller than that of the usual McNemar⁸ test statistic, so the extended test is relatively more efficient and more powerful. The calculated chi-square value of the extended McNemar⁸ test is larger than that of the usual McNemar⁸ test, so the chances of committing a Type II error are reduced.
The extended McNemar⁸ test shows a higher false positive rate (FPR) than the other tests considered when the correlation coefficient r is small. This is because the McNemar⁸ tests are most suitable when the data are correlated.
Considering all the applications to data, the results showed that the extended McNemar test discriminates better than the traditional McNemar test of Sumi et al.,⁷ the conventional nonparametric test of DeLong et al.,³ and the permutation test of Bandos et al.⁶ as the AUCs get higher and for lower values of the correlation coefficient.
The extended McNemar test enables the researcher to readily estimate the probability that, for a randomly selected pair of test results on a subject, diagnostic test 1 responds positive and diagnostic test 2 responds negative, or diagnostic test 1 responds negative and diagnostic test 2 responds positive. When both diagnostic tests give the same response, it also enables one to estimate the probability that both respond positive or both respond negative.

We therefore conclude as follows: The extended McNemar test statistic is more powerful than the usual McNemar test and indeed than the tests of DeLong et al.³ and Bandos et al.⁶ Whatever test statistic is used, ties in the data need to be adjusted for before carrying out data analysis, to reduce Type II errors as far as possible so that decisions based on the analysis are not erroneous.
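The pair probabilities mentioned in the summary above can be illustrated with simple sample proportions from a paired 2×2 table. This is a minimal sketch assuming that each probability is estimated by its observed relative frequency; the function name `paired_response_probabilities`, the dictionary keys and the counts used are hypothetical and are not taken from the paper's data.

```python
def paired_response_probabilities(n_pp, n_pn, n_np, n_nn):
    """Estimate pair-response probabilities by relative frequencies.

    n_pp: both tests positive          n_pn: test 1 positive, test 2 negative
    n_np: test 1 negative, test 2 positive   n_nn: both tests negative
    (All argument names are illustrative assumptions.)
    """
    counts = (n_pp, n_pn, n_np, n_nn)
    if any(k < 0 for k in counts):
        raise ValueError("counts must be non-negative")
    n = sum(counts)
    if n == 0:
        raise ValueError("at least one pair is required")
    return {
        "both_positive": n_pp / n,
        "test1_pos_test2_neg": n_pn / n,
        "test1_neg_test2_pos": n_np / n,
        "both_negative": n_nn / n,
        # Marginal proportions of positive responses for each test.
        "p1": (n_pp + n_pn) / n,
        "p2": (n_pp + n_np) / n,
    }


# Illustrative counts only (100 hypothetical subjects).
probs = paired_response_probabilities(n_pp=60, n_pn=25, n_np=10, n_nn=5)
for name, value in probs.items():
    print(f"{name}: {value:.2f}")
```

The two concordant entries (`both_positive` and `both_negative`) correspond to the tied pairs, while the difference `p1 - p2` equals the difference between the two discordant proportions.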

Acknowledgments

I wish to thank Dr. Happiness Ilouno and Dr. C.H. Nwankwo of the Department of Statistics, Nnamdi Azikiwe University, Awka, for their valuable moral support during the preparation of this work. Their advice and contributions cannot be forgotten in a hurry.