Research Article Volume 9 Issue 4
Department of Statistics, Faculty of Science, King Abdulaziz University, Saudi Arabia
Correspondence: Ezz H. Abdelfattah, Department of Statistics, Faculty of Science, King Abdulaziz University, Saudi Arabia
Received: June 18, 2020 | Published: August 31, 2020
Citation: Abdelfattah EH. Reclassifying inferential statistics into diagnostic and predictive statistics with an application on gynecologic cancer. Biom Biostat Int J. 2020;9(4):146-150. DOI: 10.15406/bbij.2020.09.00312
Statisticians usually classify Statistics into two main parts, namely Descriptive and Inferential Statistics. Here, we suggest reclassifying Inferential Statistics into two parts, namely Diagnostic Statistics and Predictive Statistics.
Based on that, we have four levels of data analysis (Descriptive, Diagnostic, Predictive and Prescriptive Statistics). Descriptive statistics is mainly related to Graphs, Frequency tables, Measures of Central Tendency, Measures of Variation and Measures of Shape. Diagnostic statistics is mainly related to the effects of the Independent variables (inputs) on the Dependent (Target) variable, based on Tests of Correlation or Association, Tests for differences in Means and Tests for Classification. Predictive statistics is mainly related to Estimation, Regression techniques and Time series Analysis for the Dependent (Target) variable. Prescriptive statistics builds on the previous three levels and acts as a prescription for how to solve or prevent the problem. In this paper, we clarify the statistical tests used at each level of statistical analysis and give an example using real data related to Gynecologic Cancer.
Keywords: inferential statistics, diagnostic statistics, predictive statistics, perspective statistics, gynecologic cancer
Statisticians usually classify Statistics into two main parts, namely Descriptive and Inferential Statistics. Here, we suggest reclassifying Inferential Statistics into two parts, namely Diagnostic Statistics and Predictive Statistics. Diagnostic statistics depends on Tests of Differences and Associations, while Predictive statistics depends on Estimation, Prediction and Forecasting, as shown in Figure 1.
We consider four levels of statistical analysis, namely Descriptive, Diagnostic, Predictive and Prescriptive statistics, and summarize the statistical tools that should be used at each level. In terms of the complexity of the algorithms and techniques involved, descriptive analytics is the simplest. Both Descriptive and Diagnostic statistics relate to data already collected, and hence to the “past”, while Predictive and Prescriptive statistics relate to what is expected to happen, and hence to the “future”. Prescriptive analytics has the most impact on decision making, as it helps to identify the best action for the future. Prescriptive statistics builds on the previous three levels and acts as a prescription for how to solve or prevent the problem. Figure 2 shows a relative comparison of the four types of analytics with respect to time.
Descriptive analysis comprises the statistical tools that answer the question “What happened?”. This form of analytics mainly deals with understanding the data already gathered. It is mainly related to Graphs, Frequency tables, Measures of Central Tendency, Measures of Variation and Measures of Shape. It involves the use of tools and algorithms to understand the internal structure of the data and to find categorical or temporal patterns or trends in it. The statistical tools are summarized in Table 1, based on the type of variable and its measurement level:
| Tool | Qualitative: Nominal | Qualitative: Ordinal | Quantitative: Interval or Ratio |
|---|---|---|---|
| Basic Graphs | Bar, Pie | Bar, Pie | Bar (for discrete), Histogram, Polygon, Curve, Ogive (for continuous), Line (for time), Scatter diagram (for bivariate data) |
| Measures of Central Tendency | Mode | Mode, Median | Mode, Median, Mean, Geometric mean, Harmonic mean, Trimmed mean |
| Measures of Variation | - | Quartile range | Range, Variance, Standard deviation, Coefficient of variation |
| Measures of Position | - | Quartiles | Standard Scores, Percentiles, Quartiles and Deciles, Skewness, Kurtosis |
Table 1 Basic statistical tools for descriptive statistics
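As a minimal sketch of how the last column of Table 1 maps onto code, the Python snippet below computes several of the listed measures for an interval or ratio variable using only the standard library. The function name `describe` and the sample ages are illustrative assumptions and are not taken from the paper's data.

```python
import statistics as st

def describe(values):
    """Return several Table 1 measures for an interval/ratio variable."""
    # Quartiles using the statistics module's default ("exclusive") method.
    q1, q2, q3 = st.quantiles(values, n=4)
    mean = st.mean(values)
    sd = st.stdev(values)  # sample standard deviation
    return {
        "mode": st.mode(values),
        "median": st.median(values),
        "mean": mean,
        "geometric_mean": st.geometric_mean(values),
        "harmonic_mean": st.harmonic_mean(values),
        "range": max(values) - min(values),
        "variance": st.variance(values),  # sample variance
        "sd": sd,
        "cv_percent": 100 * sd / mean,    # coefficient of variation
        "quartiles": (q1, q2, q3),
    }

# Hypothetical ages, for illustration only.
ages = [34, 41, 41, 47, 52, 58, 63]
summary = describe(ages)
```

For a nominal variable only the mode applies, and for an ordinal variable the mode, median and quartiles, which is why Table 1 lists fewer tools in those columns.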
Once the data are described, the next step is to seek the independent variables affecting the Target (Dependent) variable, by answering the question “Why did it happen?”. Diagnostic analytics focuses on the reasons behind the patterns observed in descriptive analytics. The principal point here is the Target variable’s measurement level and its relation with the Independent variables (inputs). This may be checked mainly through tests of differences in the Target across groups and tests of association between the Target and the inputs. Based on the Target’s measurement level, the statistical tools for differences (where the inputs are categorical) are summarized in Table 2, while the statistical tools for association (for any measurement level of the input) are summarized in Table 3.
Target (Dependent) measurement level by groups of the categorical Independent variable:

| Groups of the Categorical Independent variable | Qualitative: Nominal | Qualitative: Ordinal (Rank) | Quantitative: Scale (from Non-Normal Population) | Quantitative: Scale (from Normal Population) |
|---|---|---|---|---|
| 2 independent groups | Chi-square test | Mann-Whitney | Mann-Whitney | Independent sample t test |
| 3+ independent groups | Chi-square test | Kruskal-Wallis | Kruskal-Wallis | One-way ANOVA |
| 2 matched groups | McNemar | Wilcoxon test | Wilcoxon test | Paired sample t test |
| 3+ matched groups | Chi-square test | Friedman test | Friedman test | Repeated Measurements |
Table 2 Basic Diagnostic statistics tools for checking the differences in the target
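The selection logic of Table 2 can be sketched as a simple lookup. The snippet below is only an illustration: the names `LEVELS`, `TABLE_2` and `choose_test`, and the level keys such as `"scale_normal"`, are assumptions introduced here, while the test names are copied from the table.

```python
# Column order of Table 2 (target measurement level). Key names are hypothetical.
LEVELS = ("nominal", "ordinal", "scale_non_normal", "scale_normal")

# Rows of Table 2: study design -> test for each target level, in LEVELS order.
TABLE_2 = {
    "2 independent groups": (
        "Chi-square test", "Mann-Whitney", "Mann-Whitney",
        "Independent sample t test"),
    "3+ independent groups": (
        "Chi-square test", "Kruskal-Wallis", "Kruskal-Wallis",
        "One-way ANOVA"),
    "2 matched groups": (
        "McNemar", "Wilcoxon test", "Wilcoxon test",
        "Paired sample t test"),
    "3+ matched groups": (
        "Chi-square test", "Friedman test", "Friedman test",
        "Repeated Measurements"),
}

def choose_test(design, target_level):
    """Return the test of differences that Table 2 lists for this case."""
    return TABLE_2[design][LEVELS.index(target_level)]

# Example: a binary target (tumor type) compared across two independent groups.
suggested = choose_test("2 independent groups", "nominal")
```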
Target (Dependent) measurement level by Input (Independent) measurement level:

| Input (Independent) | Target: Nominal | Target: Ordinal (Rank) | Target: Interval or Ratio |
|---|---|---|---|
| Nominal | Phi | Bi-serial | Point Bi-serial |
| Ordinal (Rank) | Bi-serial | Kendall tau | Ordinal Bi-serial |
| Interval or Ratio | Point Bi-serial | Ordinal Bi-serial | Pearson |
Table 3 Basic diagnostic statistics tools for checking the association with the target
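To illustrate one cell of Table 3, the sketch below computes the point bi-serial coefficient for a binary variable and a quantitative variable, and shows that it coincides with Pearson's coefficient when the binary variable is coded 0/1. The function names and the small data set are hypothetical and serve only as an example.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def point_biserial(group, score):
    """Point bi-serial r for a 0/1 group indicator and a numeric score."""
    n = len(score)
    ones = [s for g, s in zip(group, score) if g == 1]
    zeros = [s for g, s in zip(group, score) if g == 0]
    p = len(ones) / n                      # proportion coded 1
    mean_all = sum(score) / n
    # Population (divide-by-n) standard deviation of all scores.
    sd_pop = math.sqrt(sum((s - mean_all) ** 2 for s in score) / n)
    diff = sum(ones) / len(ones) - sum(zeros) / len(zeros)
    return diff / sd_pop * math.sqrt(p * (1 - p))

# Hypothetical data: group indicator (0/1) and a quantitative measurement.
group = [0, 0, 0, 1, 1, 1]
score = [1, 2, 3, 4, 5, 6]
r_pb = point_biserial(group, score)
r_pearson = pearson_r(group, score)
```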
Given the current trends in the data identified by both descriptive and diagnostic analytics, what might happen in the future is a crucial question. Predictive analytics tools provide insights into possible future scenarios. Predictive analytics uses the outcomes of descriptive and diagnostic analytics to create a model for the future. In other words, analyzing what happened gives insights for preparing a model of what is possible in the future, by answering the question “What is likely to happen?”. The principal point here, as in Diagnostic analytics, is the Target variable’s measurement level and its relation with the Independent variables (inputs). This may be addressed mainly through Estimation (in the case of an unknown population parameter), Prediction (mainly based on regression techniques) or Forecasting of future values (based on time series techniques), as indicated in Figure 1 and summarized in Table 4.
Target (Dependent) measurement level by objective:

| Objective | Qualitative: Nominal | Qualitative: Ordinal (Rank) | Quantitative: Interval or Ratio |
|---|---|---|---|
| Estimation | Confidence interval for Proportion | Confidence interval for Median | Confidence interval for Mean |
| Prediction | Logistic Regression, Generalized Linear Mixed Model | Ordinal Regression, Generalized Linear Model, Generalized Linear Mixed Model | Linear, Non linear, General linear model, Generalized Linear Model, Generalized Linear Mixed Model |
| Forecasting | NA | NA | Exponential Smoothing, ARMA, ARIMA, SARIMA |
Table 4 Basic predictive statistics tools for checking the association with the target
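As an illustration of the Estimation row of Table 4 for a nominal target, the sketch below computes a confidence interval for a proportion. It assumes the normal-approximation (Wald) interval, which is one common choice and is not necessarily the method used in the paper. The function name and the counts are hypothetical.

```python
import math
from statistics import NormalDist

def proportion_ci(successes, n, level=0.95):
    """Wald (normal-approximation) confidence interval for a proportion."""
    z = NormalDist().inv_cdf(0.5 + level / 2)   # e.g. about 1.96 for 95%
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

# Hypothetical sample: 60 of 100 cases show the outcome of interest.
low, high = proportion_ci(60, 100)
```

With these illustrative counts the interval runs from roughly 0.50 to 0.70. For small samples or proportions near 0 or 1, other intervals are usually preferred over the Wald form.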
Prescriptive analytics tools provide a “what if” kind of analysis capability. They address which options are available and which among them is best suited, given the predictions and other constraints, by answering the question “What should happen?”. Prescriptive analytics is a result of Diagnostic and Predictive analytics. Through Prescriptive analytics, we advise an action or a solution to be taken before the problem occurs. Simulation, Decision Modelling and Expert systems play the main role in Prescriptive statistics.
Gynecologic cancer (malignant tumor) is any cancer that starts in a woman’s reproductive organs. The types of Gynecologic cancer are: Cervical cancer, Ovarian cancer, Uterine cancer (which can be one of two types: endometrial cancer (common) and uterine sarcoma (rare)), Vaginal cancer and Vulvar cancer. Each Gynecologic cancer is unique, with different signs and symptoms, different risk factors (things that may increase the chance of getting a disease), and different prevention strategies. All women are at risk for Gynecologic cancers, and risk increases with age. When Gynecologic cancers are found early, treatment is most effective. The following analyses are based on data collected from King Abdulaziz University Hospital, Saudi Arabia, during the period from the beginning of the year 2000 to the end of 2016.
Descriptive analytics
We have initial routine tests of 513 patients (228 Benign, 285 Malignant), with 118 fields: Age, Nationality, Body Mass Index (BMI), Parity, Miscarriage, Date of admission, Marital status (married before or not), Medical illness (60 types of illness), Previous surgery (47 types of surgery) and Heart block (HB). Dates of admission range from 2000-01-16 to 2016-11-09. Table 5 summarizes the highest frequencies for the patients’ illnesses and previous surgeries, while Table 6 summarizes the descriptive statistics for the continuous variables.
| Variable | Category | n | % |
|---|---|---|---|
| Tumor Type | Benign | 228 | 44.4 |
| | Malignant | 285 | 55.6 |
| Nationalities | Asian | 427 | 83.2 |
| | African | 84 | 16.4 |
| | Missing | 2 | 0.4 |
| Marital status | Unmarried before | 29 | 5.7 |
| | Married before | 481 | 93.8 |
| | Missing | 3 | 0.6 |
| Most repeated Medical illness | I.HTN | 175 | 34.1 |
| | I.DM | 121 | 23.6 |
| | I.BA | 30 | 5.8 |
| Most repeated Previous surgery | D&C | 53 | 10.3 |
| | C/S | 37 | 7.2 |
| | Laparoscopic cholecystectomy | 18 | 3.5 |
| | Myomectomy | 12 | 2.3 |
| | Hernia rep | 9 | 1.8 |
| | Appendectomy | 8 | 1.6 |
| | Vaginal repair | 8 | 1.6 |
Table 5 Descriptive statistics for the categorical variables
| Variable | Min | Max | Mean | SD | Skewness |
|---|---|---|---|---|---|
| AGE | 13 | 95 | 51.9 | 11.693 | 0.468 |
| Parity | 0 | 13 | 4.4 | 3.356 | 0.418 |
| Miscarriage | 0 | 8 | 0.5 | 1.099 | 2.761 |
| BMI | 14.6 | 168.4 | 31.3 | 9.808 | 6.107 |
| HB | 6.8 | 18.7 | 11.4 | 1.741 | -0.006 |
Table 6 Descriptive statistics for the continuous variables
Diagnostic analytics
Among all the variables measured, Age, Parity and Miscarriage are the only continuous variables with a significant effect on the dependent variable (Tumor). This was established using the independent-samples t-test, with p-values less than 0.05, as shown in Table 7. Previous marriage, DM, hernia rep and myomectomy are the only discrete variables with a significant effect on the dependent variable (Tumor). This was established using Chi-square tests, with p-values less than 0.05, as shown in Table 8.
| Variable | Tumor type | N | Mean | SD | t | P-value |
|---|---|---|---|---|---|---|
| Age | Benign | 327 | 49.58 | 10.558 | -2.722 | 0.007 |
| | Malignant | 350 | 52.06 | 12.923 | | |
| parity | Benign | 315 | 4.81 | 3.224 | 2.974 | 0.003 |
| | Malignant | 332 | 4.04 | 3.357 | | |
| Miscarriage | Benign | 314 | 0.69 | 1.171 | 3.874 | 0.000 |
| | Malignant | 332 | 0.36 | 0.987 | | |
Table 7 t-tests for the continuous variables
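The t statistics in Table 7 can be approximately recomputed from the tabulated means, standard deviations and group sizes. The sketch below assumes the equal-variance (pooled) form of the independent-samples t-test. The paper does not state which form was used, so this is an assumption, and the function name `pooled_t` is introduced here for illustration.

```python
import math

def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Independent-samples t statistic with a pooled variance estimate."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / se

# Summary statistics from Table 7 (Benign group first, then Malignant).
t_age = pooled_t(49.58, 10.558, 327, 52.06, 12.923, 350)
t_parity = pooled_t(4.81, 3.224, 315, 4.04, 3.357, 332)
t_miscarriage = pooled_t(0.69, 1.171, 314, 0.36, 0.987, 332)
```

Under this assumption the recomputed values come out close to the tabulated t values. Small differences are expected because the means and standard deviations in Table 7 are rounded.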
| Variable | Level | Malignant n | Malignant % | Benign n | Benign % | Total n | Total % | χ² | P-value | Odds Ratio |
|---|---|---|---|---|---|---|---|---|---|---|
| Previous marriage | Yes | 320 | 49.90% | 321 | 50.10% | 641 | 100% | 15.238 | 0.000 | 0.20 |
| | No | 30 | 83.30% | 6 | 16.70% | 36 | 100% | | | |
| DM | Yes | 95 | 62.50% | 57 | 37.50% | 152 | 100% | 9.533 | 0.002 | 1.78 |
| | No | 256 | 48.30% | 274 | 51.70% | 530 | 100% | | | |
| hernia rep | Yes | 9 | 90.00% | 1 | 10.00% | 10 | 100% | 6.033 | 0.014 | 8.68 |
| | No | 342 | 50.90% | 330 | 49.10% | 672 | 100% | | | |
| myomectomy | Yes | 4 | 25.00% | 12 | 75.00% | 16 | 100% | 4.595 | 0.032 | 0.31 |
| | No | 347 | 52.10% | 319 | 47.90% | 666 | 100% | | | |

Table 8 χ²-tests for the categorical variables
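The odds ratios and chi-square statistics in Table 8 follow directly from the 2×2 counts. The sketch below recomputes them for the DM and Previous marriage rows, assuming the Pearson chi-square without continuity correction. The function names are introduced here for illustration.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table [[a, b], [c, d]].

    a, b: malignant and benign counts when the factor is present ("Yes");
    c, d: malignant and benign counts when the factor is absent ("No").
    """
    return (a * d) / (b * c)

def pearson_chi2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction), 2x2 table."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Counts from Table 8.
or_dm = odds_ratio(95, 57, 256, 274)
chi2_dm = pearson_chi2(95, 57, 256, 274)
or_marriage = odds_ratio(320, 321, 30, 6)
chi2_marriage = pearson_chi2(320, 321, 30, 6)
```

An odds ratio above 1 indicates higher odds of a Malignant tumor when the factor is present, and a value below 1 indicates lower odds.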
For example, the value 1.78 can be interpreted as an estimate of the ratio, in the population, of the odds of a diabetic patient developing a Malignant tumor to the odds of a non-diabetic patient developing this type of tumor. In other words, the odds of a Malignant tumor for a diabetic patient are 1.78 times those for a non-diabetic patient. That means diabetes has a bad (positive) effect.
A similar interpretation can be given for hernia rep, with a positive effect, and for myomectomy, with a negative effect, on developing a Malignant tumor.
Predictive analytics
Predictive analytics mainly focuses on developing a predictive model that can be used to “predict” the Target (Dependent) variable. We consider the Tumor type (i.e. Benign/Malignant) as the Dependent variable. Since it is nominal, we use Logistic regression (Table 4). Stepwise logistic regression is used to include only the significant variables affecting the patient’s tumor type, in descending order of importance. Table 9 summarizes the parameter estimates (B) and the odds ratios (Exp(B)) for the significant variables.
| Variable | B | S.E. | Sig. | Exp(B) |
|---|---|---|---|---|
| Miscarriage | -0.260 | 0.083 | 0.002 | 0.771 |
| DM | 0.593 | 0.212 | 0.006 | 1.809 |
| Previous Marriage | -1.145 | 0.489 | 0.020 | 0.318 |
| hernia rep | 2.062 | 1.086 | 0.058 | 7.864 |
| Age | 0.019 | 0.008 | 0.014 | 1.019 |
| parity | -0.070 | 0.027 | 0.011 | 0.932 |
| myomectomy | -1.214 | 0.613 | 0.046 | 0.297 |
| Constant | 0.529 | 1.329 | 0.539 | 1.697 |
Table 9 The significance parameter estimates for the logistic regression
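As a rough sketch of how the coefficients in Table 9 could be turned into a predicted probability, the snippet below applies the logistic function to a linear combination of the inputs. Several assumptions are made here that the paper does not state: that Malignant is coded 1 and Benign 0, that the binary inputs are coded 1 for "yes" and 0 for "no", and that Age is in years. The patient profile is hypothetical.

```python
import math

# Coefficients (B) copied from Table 9.
COEF = {
    "Miscarriage": -0.260,
    "DM": 0.593,
    "Previous Marriage": -1.145,
    "hernia rep": 2.062,
    "Age": 0.019,
    "parity": -0.070,
    "myomectomy": -1.214,
}
INTERCEPT = 0.529  # "Constant" row of Table 9

def malignant_probability(patient):
    """Predicted probability of Malignant, assuming Malignant is coded 1."""
    # Inputs missing from the dict are treated as 0 (assumed coding).
    logit = INTERCEPT + sum(b * patient.get(name, 0) for name, b in COEF.items())
    return 1 / (1 + math.exp(-logit))

# Hypothetical patient: 50 years old, previously married, 3 births, diabetic.
with_dm = malignant_probability(
    {"Age": 50, "Previous Marriage": 1, "parity": 3, "DM": 1})
without_dm = malignant_probability(
    {"Age": 50, "Previous Marriage": 1, "parity": 3, "DM": 0})
```

Under these assumptions, the same profile with and without DM gives predicted probabilities of about 0.67 and 0.53, which illustrates the direction of the DM effect reported in Table 9.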
For example, for Age, the odds ratio (Exp(B)) of 1.019 means that each additional year of age multiplies the odds of a Malignant tumor by 1.019, provided all other factors are kept constant. Since a one-year increase gives only a small change, the effect is easier to see over 10 years. This is calculated as:
$e^{\text{years} \times b} = e^{10 \times 0.019} = 1.21$. This indicates that with an increase of 10 years in age, the odds of a Malignant tumor increase 1.21 times.
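The same calculation can be written as a small helper that rescales a logistic regression coefficient to any change in the input. The function name is introduced here for illustration, and the coefficient is the Age estimate from Table 9.

```python
import math

def odds_ratio_for_change(b, delta):
    """Odds ratio implied by coefficient b for a change of delta units."""
    return math.exp(b * delta)

b_age = 0.019  # Age coefficient (B) from Table 9
or_1_year = odds_ratio_for_change(b_age, 1)
or_10_years = odds_ratio_for_change(b_age, 10)
```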
Prescriptive analytics
Prescriptive analytics is the result of all the previous analytics and gives advance guidance that can be used to “avoid” the problem. Here, the problem is having a Malignant tumor. Based on the results obtained, we may suggest that marriage, parity and myomectomy help decrease the chance of having a Malignant tumor, while patients should avoid DM and hernia rep, which increase the chance of having a Malignant tumor.
©2020 Abdelfattah. This is an open access article distributed under the terms of the, which permits unrestricted use, distribution, and build upon your work non-commercially.