Research Article
Predicting the Lengths of Gestation for the Most Recent Births from Odds, Odds Ratios and Dummy Variable Regression Model
- Uchechukwu Marius Okeh *
International Institute for Nuclear Medicine and Allied Health Research, David Umahi Federal University of Health Sciences Uburu Ebonyi State, Nigeria.
*Corresponding Author: Uchechukwu Marius Okeh, International Institute for Nuclear Medicine and Allied Health Research, David Umahi Federal University of Health Sciences Uburu Ebonyi State, Nigeria.
Citation: Uchechukwu M. Okeh. (2025). Predicting the Lengths of Gestation for the Most Recent Births from Odds, Odds Ratios and Dummy Variable Regression Model, Journal of Women Health Care and Gynecology, BioRes Scientia Publishers. 5(6):1-8. DOI: 10.59657/2993-0871.brs.25.099
Copyright: © 2025 Uchechukwu Marius Okeh, this is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Received: July 07, 2025 | Accepted: October 10, 2025 | Published: October 17, 2025
Abstract
Background: One important advantage of dummy variable regression over ordinary regression is that, instead of accommodating only quantitative response and explanatory variables, it allows qualitative explanatory variables to be incorporated into a linear model. A dummy-variable regressor is coded to represent a dichotomous outcome. This paper proposes a method of estimating probabilities, odds and odds ratios of dichotomized outcomes using dummy variable regression.
Methodology: Dummy variable regression in which both the dependent and independent variables are binary is used to estimate the probabilities, odds and odds ratios of dichotomous outcomes. To do this, we first partition each parent independent variable into a set of mutually exclusive categories or subgroups and then use dummy variables to represent these categories in a regression model. In such a model, each parent independent variable is represented by dummy variables of 1's and 0's numbering one fewer than its categories. Any level of a parent independent variable that is not specifically represented by a dummy variable is referred to as the excluded level of that parent variable, while the others are termed the included levels in the regression model. Data were collected retrospectively and analysed using SPSS version 14.
Results: We illustrated the proposed method using data on lengths of gestation to obtain the probabilities, odds and odds ratios of dichotomous outcomes. Among others, we estimated the probability that a randomly selected mother has a gestation period of a stated number of weeks for her last birth, the odds that a randomly selected mother from the population has such a gestation period, and the odds that the randomly selected mother has a male birth with such a gestation period.
Conclusions: We conclude that dummy variable regression, like logistic regression, enables one to estimate the probabilities, odds and odds ratios of dichotomous outcomes.
Keywords: odds; odds ratio; dichotomous response; dummy variable regression; probabilities
Introduction
Many researchers use multiple regression analysis and ANOVA in their statistical work, and multiple regression is applied in many fields. Oberkirchner et al. (2010) used multiple regression to analyse and model the effect of essential material and process parameters on the weight and moisture content of impregnated papers. Bajpai (2013) analysed a university model using multiple regression and ANCOVA and found the approach very useful. Syla (2013) studied the significance of active-employment programs for employment levels using multiple regression. Everarda et al. (2005) used multiple regression results to study the importance of engaging students with a graphical user interface in teaching statistics courses. Oswald (2012) showed how viewing multiple regression results through multiple lenses can give researchers a better assessment. Kelley et al. (2003) showed that, in multiple regression, obtaining accurate parameter estimates matters more than achieving statistical significance. Pazzani et al. (1981) showed how independent sign regression generates linear models that are almost as accurate as multiple regression. Ludlow (2014) studied suppressor variables and suppression effects in building regression models. Moya-Laraño et al. (2008) encouraged ecological researchers to use partial regression in their studies. Breheny et al. (2013) used the visreg package, a useful tool for visualizing the estimated relationship between explanatory variables and the response, which provides convenient support for regression models.
Often a subject or candidate for an examination or job interview may wish to estimate the probability of success given some predisposing factors, such as the number of hours studied per day or per week; the nature, type and duration of the examination; and the candidate's prior qualifications, age, gender, ethnic group, state of origin, etc. A clinician conducting a diagnostic test or drug trial for a certain condition may wish to know the odds that subjects or patients respond positive given characteristics such as age, gender, body weight, family history, etc. A gynecologist or pediatrician may wish to estimate the odds that a newborn baby is underweight or has a longer than normal gestation period given the mother's age, parity and body weight and the child's gender. In each case, the response to the condition of interest is dichotomous, assuming one of two possible values, and the predisposing factors are either categorical variables or can be subdivided into a number of mutually exclusive subgroups or classes. This enables the fitting of a multiple regression model in which both the dependent and independent variables are categorical, and its use in estimating the odds of occurrence of the outcome of interest, as discussed below.
The Proposed Method
Suppose a researcher collects a random sample of n respondents, subjects or patients from a certain population to investigate the presence or absence of some condition. Let $y_i$ be the response of the ith subject to the condition under study in the presence of some predisposing factors or parent explanatory variables A, B, C, … with a, b, c, … levels respectively, for i = 1, 2, …, n.
Let
$$y_i = \begin{cases} 1, & \text{if the } i\text{th subject responds positive} \\ 0, & \text{otherwise} \end{cases} \qquad (1)$$
Interest is in representing each of the parent explanatory variables A, B, C, … as dummy variables of 1s and 0s and using them in a multiple dummy variable regression model in which y is the dependent variable with two mutually exclusive outcomes. To do this, each of the parent independent or explanatory variables is represented by one fewer dummy variable than the number of its categories or levels. This avoids linear dependence among the columns of the design matrix X of the regression model and hence ensures that X is of full column rank (Boyle, 1970; Neter and Wasserman, 1974; Oyeka, 1992).
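
The coding rule can be illustrated with a minimal Python sketch. It is not the SPSS procedure used in this paper; the factor labels and the helper `dummy_columns` are hypothetical and chosen only for illustration. It shows that coding a dummy for every level reproduces the intercept column of 1's, which is the linear dependence that dropping one level avoids.

```python
def dummy_columns(values, included_levels):
    """Code each value as a row of 0/1 indicators, one per included level."""
    return [[1 if v == lev else 0 for lev in included_levels] for v in values]

# Hypothetical factor A with a = 3 levels observed on six subjects.
factor_a = ["low", "mid", "high", "mid", "low", "high"]

# a - 1 = 2 dummies: "high" is treated as the excluded level (an assumption).
reduced = dummy_columns(factor_a, ["low", "mid"])

# One dummy per level: every row sums to 1, duplicating the intercept
# column of 1's, so the design matrix would lose full column rank.
full = dummy_columns(factor_a, ["low", "mid", "high"])

# Design matrix rows: intercept followed by the a - 1 dummies.
design = [[1] + row for row in reduced]

print(design[2])                              # subject in the excluded level
print(all(sum(row) == 1 for row in full))     # True: dummies sum to intercept
print(all(sum(row) == 1 for row in reduced))  # False: no such dependence
```

A subject in the excluded level is represented by zeros in all of that factor's dummy columns, so its effect is absorbed by the intercept.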
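Thus, let
$$x_{ij;A} = \begin{cases} 1, & \text{if the } i\text{th subject is in the } j\text{th level of factor } A \\ 0, & \text{otherwise} \end{cases} \qquad (2)$$
for j = 1, 2, …, a−1, with the dummy variables $x_{il;B}$ (l = 1, 2, …, b−1), $x_{is;C}$ (s = 1, 2, …, c−1), … for the other factors defined similarly.
Following these specifications, we may now fit the multiple dummy variable regression model expressing the dependence of $y_i$ on the $x$s as
$$y_i = \beta_0 + \sum_{j=1}^{a-1} \beta_{j;A}x_{ij;A} + \sum_{l=1}^{b-1} \beta_{l;B}x_{il;B} + \sum_{s=1}^{c-1} \beta_{s;C}x_{is;C} + \dots + e_i \qquad (3)$$
where the $\beta$s are regression parameters or coefficients and the $e_i$ are error terms uncorrelated with the $x$s, with $E(e_i) = 0$, for i = 1, 2, …, n. Note that Equation 3 may alternatively be represented in matrix form as
$$\underline{y} = X\underline{\beta} + \underline{e} \qquad (4)$$
where $\underline{y}$ is an n×1 column vector of 1's and 0's representing the n scores or responses of subjects to the condition of interest, X is an n×r design matrix of 1's and 0's, $\underline{\beta}$ is an r×1 column vector of regression parameters and $\underline{e}$ is an n×1 column vector of error terms uncorrelated with X, with $E(\underline{e}) = \underline{0}$, where r is the number of parameters (regression coefficients) in the model (Equation 3).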
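Applying the method of least squares to either Equation 3 or 4 yields an unbiased estimator of $\underline{\beta}$ as
$$\underline{b} = \hat{\underline{\beta}} = (X'X)^{-1}X'\underline{y} \qquad (5)$$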
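
As a minimal sketch of the least-squares step, assuming hypothetical data on eight subjects and a single two-level factor, the snippet below solves the normal equations (X'X)b = X'y exactly. The function `ols` and the data are illustrative assumptions, not the computation carried out in this paper, where a statistical package was used. With a binary response and one dummy regressor, the intercept comes out as the proportion of positive responses in the excluded level and the slope as the difference in proportions between the two levels.

```python
from fractions import Fraction

def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    r = len(X[0])
    # Augmented matrix [X'X | X'y], kept exact with rational arithmetic.
    A = [[Fraction(sum(row[j] * row[k] for row in X)) for k in range(r)]
         + [Fraction(sum(row[j] * yi for row, yi in zip(X, y)))]
         for j in range(r)]
    # Gauss-Jordan elimination; assumes X has full column rank.
    for col in range(r):
        pivot = next(i for i in range(col, r) if A[i][col] != 0)
        A[col], A[pivot] = A[pivot], A[col]
        A[col] = [v / A[col][col] for v in A[col]]
        for i in range(r):
            if i != col:
                factor = A[i][col]
                A[i] = [vi - factor * vc for vi, vc in zip(A[i], A[col])]
    return [A[j][r] for j in range(r)]

# Hypothetical data: eight subjects, one dummy x (1 = included level).
x = [1, 1, 1, 1, 0, 0, 0, 0]
y = [1, 1, 1, 0, 1, 0, 0, 0]
X = [[1, xi] for xi in x]   # intercept column plus the dummy

b0, b1 = ols(X, y)
print(float(b0), float(b1))   # 0.25 0.5
```

Here 1 of the 4 subjects in the excluded level and 3 of the 4 in the included level respond positive, so the fitted probabilities are 0.25 and 0.25 + 0.5 = 0.75.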
The following analysis of variance (ANOVA) Table is used to test the adequacy of Equation 3 or 4 based on the F-test statistic.
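| Source of Variation | Sum of Squares (SS) | Degrees of Freedom (DF) | Mean Sum of Squares (MS) | F-Ratio |
| --- | --- | --- | --- | --- |
| Regression | $SSR = \underline{b}'X'\underline{y} - n\bar{y}^2$ | r−1 | $MSR = SSR/(r-1)$ | $F = MSR/MSE$ |
| Error | $SSE = \underline{y}'\underline{y} - \underline{b}'X'\underline{y}$ | n−r | $MSE = SSE/(n-r)$ | |
| Total | $SST = \underline{y}'\underline{y} - n\bar{y}^2$ | n−1 | | |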
Table 1: ANOVA Table for Equation 4.
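The null hypothesis to be tested for the adequacy of Equation 4 using the results of Table 1 is
$$H_0: \beta_1 = \beta_2 = \dots = \beta_{r-1} = 0 \quad \text{versus} \quad H_1: \text{not all } \beta_j = 0 \qquad (6)$$
where $\beta_1, \dots, \beta_{r-1}$ denote the r−1 coefficients of the dummy variables in the model. $H_0$ is rejected at the $\alpha$ level of significance if
$$F \geq F_{1-\alpha;\, r-1,\, n-r} \qquad (7)$$
Otherwise $H_0$ is not rejected, where $F_{1-\alpha;\, r-1,\, n-r}$ is the critical value of the F distribution with r−1 and n−r degrees of freedom for a specified $\alpha$ level of significance. If the model fits, that is, if $H_0$ is rejected so that not all the $\beta_j$ are zero, then we may proceed to estimate the required probabilities and odds of positive responses to the condition of interest.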
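Now from Equation 4 we have that the expected value of $\underline{y}$ is
$$E(\underline{y}) = X\underline{\beta} \qquad (8)$$
Or equivalently, from Equation 3, we have that
$$P_i = E(y_i) = \beta_0 + \sum_{j=1}^{a-1} \beta_{j;A}x_{ij;A} + \sum_{l=1}^{b-1} \beta_{l;B}x_{il;B} + \sum_{s=1}^{c-1} \beta_{s;C}x_{is;C} + \dots \qquad (9)$$
where $x_{ij;A}$ is the dummy variable for the jth level of factor A and $\beta_{j;A}$ its coefficient (similarly for B, C, …). This is the expected proportion of positive responses, or the probability that the ith subject responds positive (1) to the condition of interest.
The expected probability that the ith subject responds negative (0) is
$$1 - P_i = 1 - E(y_i) \qquad (10)$$
Hence the odds that the ith randomly selected subject responds positive to the condition under study is
$$\Omega_i = \frac{P_i}{1 - P_i} = \frac{E(y_i)}{1 - E(y_i)} \qquad (11)$$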
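
The step from a fitted probability to odds can be sketched in a few lines of Python. The function names are hypothetical and the probability 0.375 is used purely as an illustrative input. Because a least-squares fit on a binary response is not constrained to the interval from 0 to 1, the sketch guards against fitted values for which the odds are undefined.

```python
def odds(p):
    """Odds of a positive response for a probability 0 <= p < 1."""
    if not 0 <= p < 1:
        raise ValueError("p must lie in [0, 1)")
    return p / (1 - p)

def per_thousand(odds_value):
    """Positive responses expected per 1000 negative responses."""
    return round(1000 * odds_value)

p_positive = 0.375             # illustrative fitted probability
p_negative = 1 - p_positive    # complementary probability, 0.625

print(odds(p_positive), per_thousand(odds(p_positive)))   # 0.6 600
```

An odds of 0.6 is read as 600 positive responses for every 1000 negative responses.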
In particular, interest may be in some specific level of a parent explanatory variable, say the jth level of factor A. To find the probability that the ith randomly selected subject in the jth level of factor A responds positive to the condition, we set $x_{ij;A} = 1$ and all other dummy variables equal to 0 in Equation 9, yielding
$$P_{j;A} = \beta_0 + \beta_{j;A} \qquad (12)$$
for j = 1, 2, …, a−1.
This is the probability that a subject in the jth level of factor A together with the omitted levels of all the other factors in the model (the levels omitted in the specifications of Equation 2) responds positive to the condition under study. Similarly, the probability that this subject (in the jth level of factor A and the omitted levels of the other factors) responds negative is
$$1 - P_{j;A} = 1 - \beta_0 - \beta_{j;A} \qquad (13)$$
Hence the odds that the ith randomly selected subject in the jth level of factor A and the omitted levels of the other factors responds positive to the condition under study is
$$\Omega_{j;A} = \frac{\beta_0 + \beta_{j;A}}{1 - \beta_0 - \beta_{j;A}} \qquad (14)$$
Note that Equations 12, 13 and 14 are obtained from Equations 9, 10 and 11 respectively by these choices of the dummy variable values.
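Equations 12 and 13 are respectively the probability that a randomly selected subject responds positive and the probability that the subject responds negative to the condition under study, while Equation 14 is the odds of positive response. In general, if interest is in determining the odds of positive response by a randomly selected subject in the jth level of factor A, lth level of factor B, sth level of factor C and omitted levels of other factors in the model, then
$$\Omega_{jls} = \frac{\beta_0 + \beta_{j;A} + \beta_{l;B} + \beta_{s;C}}{1 - (\beta_0 + \beta_{j;A} + \beta_{l;B} + \beta_{s;C})} \qquad (15)$$
Finally, the odds ratio of positive response, or of experiencing the condition, by randomly selected subjects in the jth and kth levels of factor A and omitted levels of other factors in the model is
$$OR_{jk;A} = \frac{\Omega_{j;A}}{\Omega_{k;A}} = \frac{(\beta_0 + \beta_{j;A})(1 - \beta_0 - \beta_{k;A})}{(1 - \beta_0 - \beta_{j;A})(\beta_0 + \beta_{k;A})} \qquad (16)$$
where $\Omega_{j;A}$ denotes the odds of positive response for a subject in the jth level of factor A and the omitted levels of the other factors.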
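
The way fitted coefficients combine into probabilities, odds and odds ratios for a chosen profile can be sketched as follows. The coefficient values and the names `A1`, `A2`, `B1` are hypothetical, invented only for illustration, and are not estimates from this paper. A profile is given by the included levels a subject belongs to; omitted levels add nothing beyond the intercept.

```python
# Hypothetical fitted coefficients: intercept plus included levels of A and B.
coef = {"intercept": 0.40, "A1": -0.10, "A2": 0.05, "B1": 0.15}

def probability(levels):
    """Fitted probability for a subject in the given included levels."""
    return coef["intercept"] + sum(coef[name] for name in levels)

def odds(p):
    """Odds of a positive response for probability p."""
    return p / (1 - p)

def odds_ratio(levels_j, levels_k):
    """Ratio of the odds for profile j to the odds for profile k."""
    return odds(probability(levels_j)) / odds(probability(levels_k))

p_a1 = probability(["A1"])             # level 1 of A, omitted level of B
p_a2_b1 = probability(["A2", "B1"])    # level 2 of A, level 1 of B

print(round(p_a1, 3), round(odds(p_a1), 3))
print(round(p_a2_b1, 3), round(odds(p_a2_b1), 3))
print(round(odds_ratio(["A1"], ["A2"]), 3))
```

An odds ratio below 1 indicates that a positive response is less likely, in odds terms, for the first profile than for the second.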
Results
We here illustrate the present method using data on the lengths of gestation (in weeks) for the most recent births of a random sample of n = 41 women, by age and parity of mother and gender of the last birth (Table 2).
Table 2: Data on Lengths of Gestation for last Births by Maternal Age, Parity and Gender of last birth.
| S/N | Mother's Age (years) | Parity | Gender of Last Birth | Length of Gestation for Last Birth (weeks+days) |
| --- | --- | --- | --- | --- |
| 1 | 28 | 3 | F | 40 |
| 2 | 36 | 8 | M | 34+1 |
| 3 | 30 | 1 | F | 39+2 |
| 4 | 25 | 3 | F | 41+5 |
| 5 | 27 | 1 | M | 40 |
| 6 | 30 | 1 | M | 38+6 |
| 7 | 27 | 3 | M | 40 |
| 8 | 20 | 1 | M | 40+5 |
| 9 | 31 | 6 | M | 39 |
| 10 | 31 | 6 | F | 39 |
| 11 | 27 | 1 | M | 40 |
| 12 | 19 | 0 | M | 38+2 |
| 13 | 30 | 3 | F | 41 |
| 14 | 30 | 5 | M | 39 |
| 15 | 39 | 2 | M | 41+5 |
| 16 | 25 | 1 | F | 40+5 |
| 17 | 29 | 4 | M | 40+3 |
| 18 | 23 | 2 | M | 39 |
| 19 | 30 | 2 | M | 37+5 |
| 20 | 28 | 0 | M | 40+4 |
| 21 | 24 | 0 | M | 37+5 |
| 22 | 20 | 1 | M | 40+4 |
| 23 | 30 | 6 | M | 42+3 |
| 24 | 32 | 4 | F | 41+2 |
| 25 | 22 | 1 | F | 37+3 |
| 26 | 25 | 0 | F | 38+5 |
| 27 | 22 | 0 | F | 39+3 |
| 28 | 33 | 4 | F | 39 |
| 29 | 29 | 1 | M | 40+1 |
| 30 | 29 | 3 | F | 39+5 |
| 31 | 25 | 0 | M | 37 |
| 32 | 28 | 2 | F | 38+3 |
| 33 | 26 | 1 | F | 40 |
| 34 | 28 | 4 | M | 37+4 |
| 35 | 35 | 2 | M | 39+2 |
| 36 | 25 | 0 | M | 40+3 |
| 37 | 34 | 0 | F | 40 |
| 38 | 26 | 0 | F | 36 |
| 39 | 30 | 6 | M | 42+3 |
| 40 | 32 | 7 | F | 40+2 |
| 41 | 25 | 6 | M | 38 |
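
A minimal Python sketch of how rows of Table 2 might be turned into binary variables is given below. The cut points (gestation over 39.5 weeks; age 25 years or less and 30 years or more; parity of at most one, and two or three) are those used in the discussion that follows. Several details are assumptions made only for illustration: that an entry such as 34+1 means 34 weeks and 1 day, that the middle age group is 26 to 29 years, and that the excluded levels are the ones marked in the comments. The function names are hypothetical.

```python
def gestation_weeks(text):
    """Convert 'weeks+days' (e.g. '34+1') to decimal weeks.
    Assumes the figure after '+' is a number of days."""
    weeks, _, days = text.partition("+")
    return int(weeks) + (int(days) if days else 0) / 7

def code_row(age, parity, gender, gestation):
    """Return (y, dummies) for one mother; excluded levels are assumptions."""
    y = 1 if gestation_weeks(gestation) > 39.5 else 0
    dummies = {
        "age_le_25": int(age <= 25),
        "age_ge_30": int(age >= 30),          # ages 26-29 assumed excluded
        "parity_le_1": int(parity <= 1),
        "parity_2_3": int(parity in (2, 3)),  # parity > 3 assumed excluded
        "male": int(gender == "M"),           # female assumed excluded
    }
    return y, dummies

# First four rows of Table 2: (age, parity, gender, gestation).
rows = [(28, 3, "F", "40"), (36, 8, "M", "34+1"),
        (30, 1, "F", "39+2"), (25, 3, "F", "41+5")]

for row in rows:
    print(code_row(*row))
```

Three age groups, three parity groups and two genders give 2 + 2 + 1 = 5 dummy variables, which matches the 5 regression degrees of freedom reported for the fitted model.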
Using length of gestation for last birth as the dependent variable and the mother's age, parity and gender of last birth as the independent variables, we may proceed as follows. Let
The resulting dummy variable multiple regression model is then 
The regression coefficients of Equation 18 are estimated from Equation 5 yielding the predicted regression model 
The corresponding analysis of variance table is presented in Table 3.
Table 3: ANOVA Table for Equation 19.
| Source of Variation | Sum of Squares | Degrees of Freedom | Mean Sum of Squares | F-Ratio |
| --- | --- | --- | --- | --- |
| Regression | 0.672 | 5 | 0.134 | 0.491 |
| Error | 9.572 | 35 | 0.273 | |
| Total | 10.244 | 40 | | |
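The F-ratio of 0.491 with 5 and 35 degrees of freedom is well below the critical value at conventional levels of significance, and hence the null hypothesis of Equation 6 is not rejected. The present model explains only about 6.6% of the total variation in length of gestation.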
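
As a quick arithmetic check, offered as a sketch and not as a reproduction of the SPSS output, the snippet below recomputes the mean squares, the F-ratio and the proportion of variation explained from the total and error sums of squares and the degrees of freedom in Table 3, taking the regression sum of squares as the difference between the total and error rows.

```python
# Quantities read from the Error and Total rows of Table 3.
ss_total, ss_error = 10.244, 9.572
df_regression, df_error = 5, 35

ss_regression = ss_total - ss_error            # implied by the other two rows
ms_regression = ss_regression / df_regression  # mean square for regression
ms_error = ss_error / df_error                 # mean square for error
f_ratio = ms_regression / ms_error
r_squared = ss_regression / ss_total           # share of variation explained

print(round(ss_regression, 3), round(ms_regression, 3), round(ms_error, 3))
print(round(f_ratio, 3), round(100 * r_squared, 1))
```

The recomputed values agree with the reported mean squares of 0.134 and 0.273, the F-ratio of 0.491 and the figure of about 6.6%, and they correspond to a regression sum of squares of 0.672.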
Now the finding of no association between length of gestation and the independent variables age, parity and gender of last birth, that is, the non-rejection of H0, would ordinarily signal the end of statistical analysis. However, we present here, for illustration purposes only, the calculation of the probabilities, odds and odds ratios of occurrence of the condition under study, namely that a randomly selected mother has a gestation period of more than 39.5 weeks for her last birth.
The probability that the ith randomly selected mother has a gestation period of over 39.5 weeks for her last birth is estimated from Equation 19, and the probability that her gestation period lasted 39.5 weeks or less is obtained using Equation 19 in Equation 10. 
Hence, the odds that a randomly selected mother from the population has a gestation period of more than 39.5 weeks is estimated from equations 11 and 19 to 20 as 
In particular, the odds that the randomly selected mother has a male birth with a gestation period of more than 39.5 weeks is obtained using Equation 21 in Equation 14 by setting the corresponding dummy variable values
in Equation 21, yielding an estimated odds of 0.636.
This means that for every one thousand male births with a gestation period equal to or less than 39.5 weeks, 636 have a gestation period of over 39.5 weeks. The odds that the last female birth of a randomly selected mother has a gestation period of over 39.5 weeks is obtained by setting all the corresponding dummy variable values
in Equations 19 and 20 and taking the ratio, yielding an estimated odds of 0.639.
This means that for every one thousand female births with a gestation period of at most 39.5 weeks, 639 have a gestation period of over 39.5 weeks.
The odds ratio is therefore 0.636/0.639 ≈ 0.995.
In other words, for every 1000 female births with a gestation period of more than 39.5 weeks, there are about 995 male births with a gestation period of more than 39.5 weeks. Note that the estimated regression coefficient
of −0.074 for mothers aged 25 years or less means that, if the mother's parity and the gender of the last birth are held constant, the probability that the length of gestation for the birth by a randomly selected mother is over 39.5 weeks is expected to be lower on average by 7.4 percentage points if the woman is aged 25 years or less than if she belongs to any other age group. Similarly,
the estimated coefficient of 0.145 for a parity of two or three children implies that, if the age of the mother and the gender of the child are held constant, the probability that the length of gestation of her last birth exceeds 39.5 weeks is 14.5 percentage points higher on average for a randomly selected mother with a parity of two or three children than for other mothers.
However, interpreting these estimated regression coefficients in terms of absolute probabilities and odds would seem more illuminating. Thus, the probability that the most recent male birth by a randomly selected woman aged 30 years or more with more than three births has a gestation period of over 39.5 weeks is obtained as in Equation 2 by
in Equation 19 yielding 
The probability that the length of gestation for the male birth is equal to or less than 39.5 weeks (see Equation 15) is therefore 
Hence the corresponding odds for this event is estimated as 
This means that for every 1000 most recent male births with a gestation period of at most 39.5 weeks by women aged 30 years or more with more than three children we would expect about 637 of these most recent births to have a gestation period of over 39.5 weeks. Also, the probability that the most recent female birth by a randomly selected mother aged 30 years or more with more than three children has a gestation period of over 39.5 weeks is obtained by setting
in Equation 19 yielding 
The complementary probability is 
The corresponding odds is estimated as 
This means that for every 1000 most recent female births by a randomly selected mother aged 30 years or more with more than three children with a length of gestation of at most 39.5 weeks 639 of these female births are expected to have a length of gestation of over 39.5 weeks. Thus, the estimated odds ratio is 
This means that for every 1000 most recent female births with a length of gestation of over 39.5 weeks, we would expect about 997 most recent male births by mothers aged 30 years or over and with more than three children to also exceed a length of gestation of 39.5 weeks. The probability that a randomly selected mother aged 25 years or less with a parity of at most one child has a male birth after 39.5 weeks of gestation, obtained by setting the corresponding dummy variable values
in Equation 19, is 0.375.
The complementary probability is 1 − 0.375 = 0.625.
In other words, the most recent birth by a randomly selected mother aged 25 years or less with a parity of at most one child, if male, has a probability of 37.5 percent of being born after, and a probability of 62.5 percent of being born at or before, a gestation period of 39.5 weeks. The corresponding odds is 0.375/0.625 = 0.600.
That is, for every 1000 most recent male births by mothers aged at most 25 years with not more than one child following a gestation period of at most 39.5 weeks, there are 600 male births by these women with a gestation period of over 39.5 weeks. If the most recent birth by the randomly selected mother is female, we set
in Equation 19 and hence in Equation 21 to obtain the odds that the most recent female birth by the randomly selected mother has a gestation period of over 39.5 weeks as 
Therefore, the corresponding odds ratio for this event is 
This means that for every 1000 female births with a gestation period of over 39.5 weeks, we would expect 995 male births to also have a gestation period of over 39.5 weeks among births to mothers aged at most 25 years with not more than one child.
Finally, the odds that a randomly selected mother aged 30 years or more with 2 or 3 children has a male child after over 39.5 weeks of gestation
in Equation 19 is from Equation 15 
Similarly, the odds that the most recent female birth by this randomly selected mother has a gestation period of over 39.5 weeks
in Equation 19 is 
The corresponding odds ratio is 
The odds that the length of gestation for the most recent birth by a randomly selected mother aged 30 years or more with a parity of over 3 children exceeds 39.5 weeks is
0.637 if the child is male, and
0.639 if the child is female.
Hence, for the most recent male birth, the ratio of the odds that a randomly selected woman aged 30 years or more with a parity of 2 or 3 children has a gestation period of over 39.5 weeks to the odds that her counterpart with more than 3 children has a gestation period of over 39.5 weeks is, using Equation 16, 1.799.
This means that for every 1000 most recent male births born after over 39.5 weeks of gestation to mothers aged 30 years or more with more than 3 children, there are about 1799 such male births to their counterparts with a parity of two or three children. The odds ratio for female births is 1.801.
In other words, for every 1000 female births born after over 39.5 weeks of gestation to mothers aged 30 years or more with a parity of more than 3 children, there are about 1801 female births born to their counterparts with a parity of 2 or 3 children after over 39.5 weeks of gestation. Other probabilities, odds and odds ratios can similarly be estimated.
Conclusion
We have in this paper developed a method of estimating the probabilities, odds and odds ratios of the occurrence of positive responses in dichotomized data using multiple dummy variable regression in which the dependent and independent variables are all binary. This approach enables the resulting partial regression coefficients to be interpreted in terms of probabilities. The proposed method does not require the often-restrictive assumptions of normality and homogeneity usually necessary when the variables used in a regression model are assumed to be continuous. The method is illustrated with sample data in which lengths of gestation for last births are regressed against maternal age, parity and gender of last birth.
References
- Bajpai, P. (2013). Multiple regression analysis using ANCOVA in University model. International Journal of Applied Physics and Mathematics, 3(5):336.
- Beri, G. C. (2005). Business Statistics. Second Edition. McGraw-Hill.
- Breheny, P., Burchett, W. (2017). Visualization of regression models using visreg.
- Cunningham, E., Wang, W. (2005). Using AMOS graphics to enhance the understanding and communication of multiple regression.
- Kelley, K., Maxwell, S. E. (2003). Sample size for multiple regression: obtaining regression coefficients that are accurate, not simply significant. Psychological Methods, 8(3):305.
- Ludlow, L., Klein, K. (2014). Suppressor variables: The difference between ‘is’ versus ‘acting as’. Journal of Statistics Education, 22(2).
- Moya-Laraño, J., Corcobado, G. (2008). Plotting partial correlation and regression in ecological studies. Web Ecology, 8(1):35-46.
- Nathans, L. L., Oswald, F. L., Nimon, K. (2012). Interpreting multiple linear regression: a guidebook of variable importance. Practical Assessment, Research & Evaluation, 17(9):n9.
- Pazzani, M. J., Bay, S. D. (2020). The independent sign bias: Gaining insight from multiple linear regression. In Proceedings of the Twenty-first Annual Conference of the Cognitive Science Society (pp. 525-530). Psychology Press.
- Shpresa, S. Y. L. A. (2013). Application of Multiple Linear Regression Analysis of Employment through ALMP. International Journal of Academic Research in Business and Social Sciences, 3(12):2222-6990.



