Diabetic vs. Non-Diabetic

Data Analysis

Report



Introduction

Diabetes is a rapidly growing global health concern, affecting millions of individuals across all age groups. Early identification of individuals at risk is crucial for timely intervention, effective management, and the prevention of severe complications associated with the disease. With the increasing availability of clinical and demographic data, data-driven approaches offer powerful tools to uncover hidden patterns and risk factors linked to diabetes development. This analysis seeks to explore and understand the complex relationships among various health indicators — including glucose levels, blood pressure, insulin, body mass index (BMI), skinfold thickness, family history, pregnancies, and age — that contribute to the risk of diabetes. Using Principal Component Analysis (PCA) as the primary method, this study aims to reduce the complexity of the dataset while preserving the essential variance, allowing for clearer visualization and interpretation of key trends. Furthermore, the potential applications of factor analysis and multivariate analysis of variance (MANOVA) are discussed to emphasize the broader statistical frameworks that can deepen our understanding of diabetes risk patterns. Through this approach, the study not only identifies dominant factors influencing diabetes but also highlights the multifactorial nature of the disease, offering insights for future predictive modeling and clinical decision-making.

Background

Diabetes is a chronic metabolic disorder characterized by elevated blood glucose levels, resulting from impairments in insulin production, insulin action, or both. Early identification and understanding of the risk factors associated with diabetes are critical for effective prevention, intervention, and management strategies. The dataset used in this analysis includes a variety of clinical and demographic variables, such as glucose concentration, blood pressure, insulin levels, body mass index (BMI), skinfold thickness, diabetes pedigree function (family history measure), number of pregnancies, and age — all well-established predictors of diabetes risk.

Principal Component Analysis (PCA) is employed in this study as an unsupervised dimensionality reduction method to explore the data structure without considering the outcome label. PCA transforms the original correlated variables into new uncorrelated principal components that capture the maximum variance within the dataset, facilitating better visualization and interpretation of complex relationships among variables. Through PCA, key patterns and dominant variables contributing to the data's variability can be uncovered.
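
The transformation described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code used for this report: the function name pca and the synthetic demonstration data are assumptions, and the columns are standardized before decomposition because the health indicators are measured on different scales.

```python
import numpy as np


def pca(X, n_components=None):
    """PCA via SVD on standardized columns.

    Returns (scores, loadings, explained_variance_ratio).
    """
    X = np.asarray(X, dtype=float)
    # Standardize each column so that variables on different scales
    # contribute comparably to the components.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Rows of Vt are the principal directions, ordered by singular value.
    _, S, Vt = np.linalg.svd(Z, full_matrices=False)
    explained_variance = S**2 / (Z.shape[0] - 1)
    ratio = explained_variance / explained_variance.sum()
    k = n_components or Vt.shape[0]
    loadings = Vt[:k].T          # shape (n_features, k)
    scores = Z @ loadings        # shape (n_samples, k)
    return scores, loadings, ratio[:k]


# Synthetic stand-in data (NOT the diabetes dataset): two pairs of
# correlated columns, so two components should dominate.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
X_demo = np.column_stack([
    base[:, 0],
    base[:, 0] + 0.3 * rng.normal(size=200),
    base[:, 1],
    base[:, 1] + 0.3 * rng.normal(size=200),
])
scores, loadings, ratio = pca(X_demo)
```

The resulting score columns are mutually uncorrelated, and the explained variance ratios are sorted from largest to smallest, which is what allows the first few components to be used for plotting.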

Additionally, factor analysis is recognized as an important complementary technique. While PCA focuses on maximizing total variance, factor analysis aims to uncover latent underlying constructs (unobserved factors) that explain the correlations among observed variables. Although factor analysis has not been conducted in this study, it represents a valuable future approach for modeling the hidden dimensions driving diabetes-related health indicators.

Furthermore, multivariate analysis of variance (MANOVA) offers another relevant analytical perspective. MANOVA is designed to assess whether multiple dependent variables differ across groups defined by categorical predictors. In the context of diabetes research, MANOVA could test whether a combination of clinical measures (such as glucose, insulin, and BMI) differs significantly between diabetic and non-diabetic groups. Although MANOVA has not been implemented here, its consideration highlights an important inferential tool that could strengthen conclusions by evaluating group-level multivariate differences.
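
As an illustration of what such a test would compute, the sketch below evaluates Wilks' lambda, one of the standard MANOVA test statistics, as the ratio det(W) / det(T) of the within-group and total sums-of-squares-and-cross-products matrices. It was not part of this study; the function name wilks_lambda and the synthetic two-group data are assumptions, and no p-value is derived here.

```python
import numpy as np


def wilks_lambda(X, groups):
    """Wilks' lambda = det(W) / det(T) for a one-way MANOVA layout.

    Values near 1 suggest little multivariate group difference;
    values near 0 suggest strong separation.
    """
    X = np.asarray(X, dtype=float)
    groups = np.asarray(groups)
    centered = X - X.mean(axis=0)
    T = centered.T @ centered                 # total SSCP matrix
    W = np.zeros_like(T)
    for g in np.unique(groups):
        Xg = X[groups == g]
        d = Xg - Xg.mean(axis=0)
        W += d.T @ d                          # within-group SSCP matrix
    return float(np.linalg.det(W) / np.linalg.det(T))


# Synthetic data (NOT the diabetes dataset): 50 rows per group, 3 measures.
rng = np.random.default_rng(42)
labels = np.repeat([0, 1], 50)
X_null = rng.normal(size=(100, 3))            # no real group difference
X_shift = X_null + 3.0 * labels[:, None]      # group 1 shifted on every measure

lam_null = wilks_lambda(X_null, labels)
lam_shift = wilks_lambda(X_shift, labels)
```

In practice the statistic would be converted to an approximate F value and checked against MANOVA's assumptions (multivariate normality, homogeneous covariance matrices) before drawing conclusions.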

By applying PCA and considering the future use of factor analysis and MANOVA, this study aims to deepen the understanding of the complex interrelationships among diabetes risk factors, providing a foundation for more informed predictive modeling and clinical decision-making strategies.

Data Description

The diabetes dataset contains 768 observations and 9 variables: eight health indicators used as predictors and one outcome label. Each row represents an individual case, and the columns are Pregnancies (number of times pregnant), Glucose (plasma glucose concentration), BloodPressure (diastolic blood pressure), SkinThickness (triceps skinfold thickness), Insulin (2-hour serum insulin), BMI (body mass index), DiabetesPedigreeFunction (a measure of family diabetes history), Age (in years), and Outcome (diabetes presence: 1 for diabetic, 0 for non-diabetic). All variables are numeric, with Outcome being a binary integer variable. No values are explicitly coded as missing, but zeros in some medical measurements (such as Glucose and Insulin) are not physiologically plausible and may represent missing or unrecorded data. Summary statistics show, for example, that Glucose has a mean of about 121 mg/dL with a standard deviation of about 32. Overall, the dataset provides a comprehensive set of clinical and demographic predictors for studying patterns related to diabetes risk.
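
One way to make the zero-as-missing issue explicit is to recode those zeros before any analysis. The sketch below is a hedged illustration only: the text names Glucose and Insulin, while treating BloodPressure, SkinThickness, and BMI the same way is an assumption, as are the helper names and the made-up demonstration rows.

```python
import math

# Columns where a value of 0 is treated as "not recorded". Glucose and Insulin
# come from the text; the other three are an assumption for illustration.
ZERO_AS_MISSING = ("Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI")


def mark_zeros_missing(rows, columns=ZERO_AS_MISSING):
    """Return copies of the rows with zeros in `columns` replaced by NaN."""
    cleaned = []
    for row in rows:
        new_row = dict(row)
        for col in columns:
            if new_row.get(col) == 0:
                new_row[col] = math.nan
        cleaned.append(new_row)
    return cleaned


def missing_counts(rows, columns=ZERO_AS_MISSING):
    """Count NaN entries per column."""
    return {
        col: sum(
            1 for r in rows
            if isinstance(r.get(col), float) and math.isnan(r[col])
        )
        for col in columns
    }


# Made-up rows for demonstration (not taken from the dataset).
demo_rows = [
    {"Pregnancies": 0, "Glucose": 148, "Insulin": 0, "BMI": 33.6},
    {"Pregnancies": 1, "Glucose": 0, "Insulin": 94, "BMI": 28.1},
]
cleaned = mark_zeros_missing(demo_rows)
counts = missing_counts(cleaned)
```

Pregnancies is deliberately left out of the list, since zero is a valid count there.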

Results

Principal Component Analysis (PCA), a method based purely on variance, produced several notable findings. First, Skin Thickness, Insulin, and BMI were highly correlated with one another and loaded primarily onto the first principal component (PC1), which captures the largest proportion of variance in the dataset. Second, almost all predictors appeared across multiple principal components, suggesting that the variables in this dataset take part in several distinct patterns of variation rather than following a single trend. This complexity explains why the data do not align along a simple trend line.

Among all predictors, Glucose (plasma glucose concentration; for reference, normal fasting blood glucose is typically 70–99 mg/dL) was the most prominent variable in the principal component loadings. This indicates that blood glucose levels account for much of the variation in the dataset. However, it is important to note that glucose alone is not sufficient to determine diabetes status, and PCA does not use the outcome label. Additional modeling and validation steps are required to establish predictive indicators.

In contrast, Insulin and Age contributed the least overall across the principal components, implying that these factors are less strongly associated with the primary patterns of variation in this dataset. The relatively minor role of Age is consistent with current medical observations that early-onset diabetes in young adults and even children has become more common, although PCA alone cannot confirm this link.
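
Statements about which variables are most or least prominent in a component are typically read off the loading matrix by ranking absolute loadings. A minimal sketch of that ranking follows; the function name top_loadings is hypothetical, and the small loading matrix uses placeholder variable names and made-up numbers rather than this study's results.

```python
def top_loadings(loadings, feature_names, component, k=3):
    """Return the k variables with the largest absolute loading on one component.

    `loadings` is a list of rows, one per variable, one column per component.
    """
    column = [row[component] for row in loadings]
    order = sorted(range(len(column)), key=lambda i: -abs(column[i]))[:k]
    return [(feature_names[i], column[i]) for i in order]


# Placeholder matrix: 3 variables x 2 components, values invented for the demo.
demo_names = ["a", "b", "c"]
demo_loadings = [
    [0.80, 0.10],
    [-0.60, 0.20],
    [0.05, -0.97],
]
pc1_top = top_loadings(demo_loadings, demo_names, component=0, k=2)
pc2_top = top_loadings(demo_loadings, demo_names, component=1, k=1)
```

The sign of a loading is kept in the output because it indicates direction: variables with opposite signs on the same component move in opposite directions along it.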

To further explore the PCA results, both two-dimensional and three-dimensional visualizations were used. The 2D PCA biplot (PC1 and PC2) captures only 48.5% of the total variance, so more than half of the variation is not visible in it, which limits interpretation. A 3D PCA plot that adds PC3 was therefore used to better illustrate the relationships among the eight predictors.
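
The choice between a 2D and a 3D view comes down to cumulative explained variance. The sketch below shows the arithmetic for picking the number of components that reaches a target share; the helper name components_needed is hypothetical, and the ratios are invented for the demonstration and are not the per-component values from this analysis.

```python
from itertools import accumulate


def components_needed(ratios, threshold):
    """Smallest number of leading components whose cumulative share >= threshold."""
    for k, total in enumerate(accumulate(ratios), start=1):
        if total >= threshold:
            return k
    return len(ratios)


# Invented explained-variance ratios, used only to show the calculation.
demo_ratios = [0.4, 0.3, 0.2, 0.1]
k_half = components_needed(demo_ratios, 0.5)
k_most = components_needed(demo_ratios, 0.8)
```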

The PCA biplot shows that all eight predictors point in broadly similar directions along PC1 and are spread across the range of PC2. This suggests that PC2 separates the variables more clearly, whereas PC1, despite capturing the most variance, appears to reflect general correlation among the variables rather than coherent groupings. This highlights the importance of examining secondary principal components (PC2, PC3) for meaningful structure.

According to the PCA loadings, PC2 is most influenced by Age and Pregnancies. This finding is consistent with known medical relationships: higher pregnancy counts can correlate with increased risk of gestational diabetes, and age and pregnancy history are naturally interrelated, as older individuals are more likely to have higher pregnancy counts.

Meanwhile, PC3 is primarily driven by Blood Pressure, Glucose, and Diabetes Pedigree Function (a measure of family diabetes history or genetic predisposition).
To fully capture the relationships between PC1, PC2, and PC3, a three-dimensional PCA plot was generated.


3D scatter plot of PCA

According to the 3D PCA plot, non-diabetic cases form a clear cluster, located primarily at positive PC1, negative PC2, and negative PC3, indicating a relatively homogeneous group. In contrast, diabetic cases are more widely scattered across all three dimensions and lack a clear cluster, reflecting greater variability across risk factors.

Overall, non-diabetic individuals exhibit uniformity across PC2 and PC3, while diabetic individuals show complex and diverse profiles influenced by multiple factors such as Glucose, BMI, Skin Thickness, Insulin, and family diabetes history.
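
The visual impression of a tight cluster versus a scattered one can be quantified, for example as the mean distance of each group's points from that group's centroid in PC space. The sketch below is an assumption-laden illustration: the function name group_dispersion is hypothetical, and the scores are synthetic rather than the PCA scores from this report.

```python
import numpy as np


def group_dispersion(scores, labels):
    """Mean Euclidean distance to the group centroid, per group label."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    result = {}
    for g in np.unique(labels):
        pts = scores[labels == g]
        distances = np.linalg.norm(pts - pts.mean(axis=0), axis=1)
        result[int(g)] = float(distances.mean())
    return result


# Synthetic 3D scores (NOT the study's): label 0 is tight, label 1 is spread out.
rng = np.random.default_rng(1)
tight = rng.normal(scale=0.5, size=(100, 3))
wide = rng.normal(scale=2.0, size=(100, 3))
demo_scores = np.vstack([tight, wide])
demo_labels = np.array([0] * 100 + [1] * 100)
dispersion = group_dispersion(demo_scores, demo_labels)
```

A larger value for one group would support the reading that its members have more varied profiles across the plotted components.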

Conclusion

This analysis provided meaningful insights into the complex interrelationships among clinical and demographic variables associated with diabetes. Through Principal Component Analysis (PCA), it was revealed that while the first principal component captured the highest variance, it did not necessarily represent a simple or strongly interpretable pattern. In contrast, the second and third components offered clearer groupings, highlighting important factors such as age, number of pregnancies, glucose levels, blood pressure, and family history of diabetes. The 3D PCA visualization further demonstrated that non-diabetic individuals tended to cluster closely, reflecting more uniform and balanced health profiles, whereas diabetic individuals exhibited greater variability and complexity across the principal components. These findings emphasize the multifactorial nature of diabetes and highlight the value of multivariate techniques like PCA in uncovering hidden structures within health data, ultimately contributing to a deeper understanding of diabetes risk patterns.

Thank you
