Fetal Health Classification using Machine Learning
A subset of advanced analytics called predictive analytics uses historical data, statistical modeling, data mining, and machine learning to forecast future events. Fetal health classification is a crucial task in the field of obstetrics as it enables early detection and prevention of potential complications during pregnancy. However, traditional fetal health assessment methods such as fetal heart rate monitoring and ultrasound imaging are subjective and may not provide accurate predictions. Therefore, there is a need to develop a machine-learning model that can accurately classify fetal health based on various fetal health indicators.
What do we wish to achieve through this project?
This project aims to develop a machine-learning model that can accurately classify fetal health as normal, suspicious, or pathological based on features such as fetal heart rate, fetal movement, uterine contractions, and other clinical data. The model should be able to identify patterns and relationships between the features and the fetal health status and make accurate predictions based on the input data. The successful development of such a model can potentially improve prenatal care and reduce the incidence of adverse fetal outcomes, ultimately improving the health and well-being of both mother and child.
Approach
Before proceeding with our analysis, we made sure that the dataset contained no inconsistencies. We checked for null values and found none. We then removed duplicate samples to avoid redundancy.
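The two checks described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the code used in the project; the column names (baseline_value, fetal_health) and the three sample rows are hypothetical.

```python
def count_missing(rows):
    """Count cells whose value is None or an empty string."""
    return sum(
        1 for row in rows for value in row.values()
        if value is None or value == ""
    )


def drop_duplicates(rows):
    """Remove exact duplicate rows, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for row in rows:
        # Column names are unique, so sorting the items gives a stable key.
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Hypothetical sample rows; the real dataset has 22 numerical columns.
rows = [
    {"baseline_value": 120.0, "fetal_health": 1},
    {"baseline_value": 132.0, "fetal_health": 2},
    {"baseline_value": 120.0, "fetal_health": 1},  # exact duplicate of row 1
]

missing = count_missing(rows)
cleaned = drop_duplicates(rows)
print(f"missing cells: {missing}, rows before: {len(rows)}, after: {len(cleaned)}")
```

In practice a dataframe library would typically do both steps in one call each; the plain-Python version is shown only to make the logic explicit.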
To better understand how the features were distributed in the sample space and how they correlated with each other and with the target variable, we performed univariate analysis of each feature and generated bar plots, histograms, count plots, heatmaps and scatter matrices. This step showed that most of the features were not normally distributed and were left skewed, which guided the rest of our exploratory data analysis (EDA).
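One way to quantify the skew seen in the histograms is the moment coefficient of skewness, where a negative value indicates a longer left tail. The sketch below is an illustration on made-up numbers, not on the project data, and the function name is our own.

```python
def skewness(values):
    """Fisher-Pearson moment coefficient of skewness (population form)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    if m2 == 0:
        return 0.0  # constant feature: no spread, so no skew
    return m3 / m2 ** 1.5


# Made-up samples for illustration only.
symmetric = [1, 2, 3, 4, 5]
left_tailed = [1, 8, 9, 9, 10, 10, 10]

print("symmetric:", skewness(symmetric))
print("left-tailed:", skewness(left_tailed))
```

A value near zero suggests a roughly symmetric distribution, while a clearly negative value matches the left-skewed shape described above.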
Dataset
The data used for this project was sourced from Kaggle.
Dataset source citation link: https://www.kaggle.com/datasets/andrewmvd/fetal-health-classification
The dataset contains 2126 records of features extracted from cardiotocogram (CTG) exams, which were then classified by three expert obstetricians into three classes:
(i) Normal - No hypoxia or acidosis; no intervention necessary to improve fetal oxygenation state.
(ii) Suspect - Low probability of hypoxia/acidosis; warrants action to correct reversible causes.
(iii) Pathological - High probability of hypoxia/acidosis; requires immediate action to correct reversible causes.
The data contains 22 columns, all of which are numerical.
Machine Learning Models
Support Vector Machine
Support Vector Machine (SVM) is a popular supervised machine learning algorithm used for classification and regression analysis. SVM is based on the idea of finding a hyperplane (decision boundary) that best separates the classes in a given dataset.
SVM has several advantages over other classification algorithms, including its ability to handle high-dimensional data, its effectiveness in dealing with small and noisy datasets, and its ability to handle non-linear decision boundaries. SVM is widely used in various applications, including text classification, image classification, and bioinformatics.
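To make the hyperplane idea concrete, the sketch below shows the decision rule of a linear SVM for two classes labelled -1 and +1, together with the hinge loss that penalizes points on the wrong side of the margin. The weights and bias are hand-picked for illustration, not learned from the project data, and the function names are our own.

```python
def decision_value(weights, bias, x):
    """Signed score w.x + b; its sign tells which side of the hyperplane x is on."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def predict(weights, bias, x):
    """Classify x as +1 or -1 according to the side of the hyperplane."""
    return 1 if decision_value(weights, bias, x) >= 0 else -1


def hinge_loss(weights, bias, x, y):
    """Hinge loss for a label y in {-1, +1}; zero once x is beyond the margin."""
    return max(0.0, 1.0 - y * decision_value(weights, bias, x))


# Hand-picked hyperplane x1 - x2 = 0, for illustration only.
weights, bias = [1.0, -1.0], 0.0

print(predict(weights, bias, [2.0, 0.0]))
print(predict(weights, bias, [0.0, 2.0]))
print(hinge_loss(weights, bias, [0.5, 0.0], 1))
```

Training an SVM amounts to choosing the weights and bias that keep the total hinge loss low while keeping the margin wide; multi-class problems such as ours are usually handled by combining several such binary classifiers.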
K-Nearest Neighbor
K-Nearest Neighbor (KNN) is a popular supervised machine learning algorithm used for classification and regression analysis. It is a simple and intuitive algorithm that works by finding the k nearest neighbors of a given data point in the feature space and classifying or predicting the target value of that point based on the class or average value of its neighbors.
In the case of classification, KNN classifies a new data point based on the majority class of its k-nearest neighbors. In the case of regression, KNN predicts the target value of a new data point based on the average of the target values of its k nearest neighbors.
KNN has several advantages, including its simplicity, flexibility, and effectiveness in dealing with non-linear and complex decision boundaries. It is widely used in various applications, including image recognition, recommendation systems, and bioinformatics.
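The majority-vote rule described above can be sketched in a few lines. This is a minimal illustration with Euclidean distance and made-up two-dimensional points; the function name and the choice of k=3 are assumptions, not the settings used in the project.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest points."""
    # Pair each training point's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Made-up points forming two well-separated groups.
train_points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_labels = ["normal"] * 3 + ["pathological"] * 3

print(knn_predict(train_points, train_labels, (0.5, 0.5)))
print(knn_predict(train_points, train_labels, (5.5, 5.5)))
```

Because the vote depends on raw distances, features on very different scales are usually standardized before KNN is applied.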
Decision Tree
The decision tree algorithm is a popular supervised machine learning algorithm used for both classification and regression analysis. It works by recursively partitioning the feature space into subsets based on the values of the features and their relationship with the target variable.
In a decision tree, each internal node represents a test on a feature, and each branch represents the outcome of the test. The leaves of the tree represent the class or value of the target variable for a given combination of feature values. The tree is constructed by selecting the best feature to split the data at each node based on a certain criterion, such as information gain or Gini impurity.
The decision tree algorithm has several advantages, including its simplicity, interpretability, and effectiveness in dealing with both categorical and continuous features. Decision trees can also handle missing values and outliers and can be used for feature selection and feature engineering.
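The two split criteria mentioned above can be computed directly from the class labels at a node. The sketch below is illustrative only; the function names and the toy labels are our own.

```python
import math
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy of the class distribution, in bits."""
    n = len(labels)
    return -sum(
        (count / n) * math.log2(count / n) for count in Counter(labels).values()
    )


def information_gain(parent, left, right):
    """Entropy reduction achieved by splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted


# Toy node with two classes, split perfectly into two pure children.
parent = ["normal", "normal", "suspect", "suspect"]
left, right = ["normal", "normal"], ["suspect", "suspect"]

print(gini_impurity(parent))
print(information_gain(parent, left, right))
```

At each node the tree-building procedure evaluates candidate splits and keeps the one with the highest information gain (or, equivalently for the Gini criterion, the lowest weighted impurity).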
Logistic Regression
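Logistic regression is a popular supervised machine learning algorithm used for binary classification problems. It models the relationship between a binary dependent variable (also known as the response or target variable) and one or more independent variables (also known as predictors or features) using a logistic function. For a problem with more than two classes, such as the three fetal health classes considered here, it can be extended through a one-vs-rest or multinomial (softmax) formulation.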
The logistic function, also called the sigmoid function, maps any real-valued input to a value between 0 and 1, which can be interpreted as the probability of the dependent variable being in a particular class. The logistic regression algorithm estimates the parameters of the logistic function using a maximum likelihood method, which involves minimizing the negative log-likelihood of the data.
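The sigmoid function and the negative log-likelihood it is fitted with can be written out directly for the binary case. The sketch below is illustrative: the data points and the two candidate weight settings are made up, and no optimizer is included.

```python
import math


def sigmoid(z):
    """Map a real value to (0, 1). Very negative z can overflow in this simple form."""
    return 1.0 / (1.0 + math.exp(-z))


def negative_log_likelihood(weights, bias, features, targets):
    """Sum of -log p(y | x) over the data, for binary targets in {0, 1}."""
    total = 0.0
    for x, y in zip(features, targets):
        p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total


# Made-up one-feature data: negative values are class 0, positive are class 1.
features = [[-2.0], [-1.0], [1.0], [2.0]]
targets = [0, 0, 1, 1]

nll_uninformed = negative_log_likelihood([0.0], 0.0, features, targets)
nll_better = negative_log_likelihood([1.0], 0.0, features, targets)
print(nll_uninformed, nll_better)
```

The weight that separates the two classes gives a lower negative log-likelihood than the uninformed zero weight, which is exactly what maximum likelihood estimation searches for.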
Naive Bayes Classifier
The Naive Bayes Classifier is a popular supervised machine learning algorithm used for classification problems. It is based on Bayes' theorem, which states that the probability of a hypothesis (in this case, the class of a data point) given the observed evidence (in this case, the features of the data point) is proportional to the product of the prior probability of the hypothesis and the likelihood of the evidence given the hypothesis.
The Naive Bayes Classifier estimates the prior probabilities of the classes and the conditional probabilities of the features given the classes using a training dataset. Then, it applies Bayes' theorem to calculate the posterior probabilities of the classes for a new data point and selects the class with the highest posterior probability as the predicted class.
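The estimate-then-apply procedure described above can be sketched as follows. We assume a Gaussian likelihood for each feature, which suits numerical columns but is our assumption rather than a detail stated in this report; the function names and the toy data are also illustrative.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(features, labels):
    """Estimate class priors and per-feature (mean, variance) for each class."""
    grouped = defaultdict(list)
    for x, y in zip(features, labels):
        grouped[y].append(x)
    model = {}
    for label, rows in grouped.items():
        prior = len(rows) / len(features)
        stats = []
        for column in zip(*rows):
            mean = sum(column) / len(column)
            # A small constant keeps the variance strictly positive.
            var = sum((v - mean) ** 2 for v in column) / len(column) + 1e-9
            stats.append((mean, var))
        model[label] = (prior, stats)
    return model


def predict_gaussian_nb(model, x):
    """Return the class with the highest log-posterior (up to a shared constant)."""
    best_label, best_score = None, -math.inf
    for label, (prior, stats) in model.items():
        score = math.log(prior)
        for value, (mean, var) in zip(x, stats):
            score += -0.5 * math.log(2 * math.pi * var) - (value - mean) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy two-feature data with two well-separated classes.
features = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [5.0, 6.0], [5.2, 5.8], [4.8, 6.2]]
labels = ["normal"] * 3 + ["pathological"] * 3

model = fit_gaussian_nb(features, labels)
print(predict_gaussian_nb(model, [1.1, 2.0]))
print(predict_gaussian_nb(model, [5.1, 6.0]))
```

Working with log-probabilities turns the product of likelihoods into a sum, which avoids numerical underflow when there are many features.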
Results
Regarding supervised learning, we were able to confirm that, in general, models trained on the balanced dataset performed better than models trained on the imbalanced dataset. In particular, balancing reduced the number of false negatives in almost all cases. This is an important result, because predicting a pathological case as normal could be very risky.
• Looking at the outcomes of the models implemented above, we can see that the number of false negatives is comparatively much lower for K-NN and Decision Trees.
• Based on the classification reports, we can see that K-NN (balanced dataset) has an accuracy of 95.951% and Decision Tree (original dataset) has an accuracy of 95.03%.
• Thus, the best model for predicting fetal health is K-NN (K-Nearest Neighbor), and it should perform well as long as new test data follows the same distribution as the cross-validation data.
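The balancing step and the false-negative count discussed in this section can be illustrated with a short sketch. The balancing method used in the project is not specified here, so random oversampling of the minority classes is an assumption made for illustration; the function names, class labels as strings, and the toy data are likewise hypothetical.

```python
import random
from collections import Counter


def random_oversample(features, labels, seed=0):
    """Duplicate random minority-class rows until every class matches the largest."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(features, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(rows) for rows in by_class.values())
    out_x, out_y = [], []
    for label, rows in by_class.items():
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for x in rows + extra:
            out_x.append(x)
            out_y.append(label)
    return out_x, out_y


def count_false_negatives(y_true, y_pred, positive="pathological"):
    """Count cases of the positive class that the model failed to flag."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)


# Toy imbalanced labels: 6 normal, 2 suspect, 1 pathological.
features = [[i] for i in range(9)]
labels = ["normal"] * 6 + ["suspect"] * 2 + ["pathological"]

balanced_x, balanced_y = random_oversample(features, labels)
print(Counter(balanced_y))

# One pathological case is predicted as normal, so one false negative.
y_true = ["pathological", "pathological", "normal"]
y_pred = ["normal", "pathological", "normal"]
print(count_false_negatives(y_true, y_pred))
```

Oversampling should be applied to the training split only, so that duplicated rows do not leak into the data used for evaluation.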