[Project Repository] Predicting cardiovascular diseases.
Cardio Catch Diseases (CCD) is a company that specializes in detecting heart disease at an early stage. It is a service business that offers early diagnosis of cardiovascular disease for a fee.
Currently, cardiovascular disease is diagnosed manually by a team of specialists. The current diagnostic accuracy lies between 55% and 65%, due to the complexity of the diagnosis and to the fatigue of the team, whose members work in shifts to minimize risk. Each diagnosis costs R$ 1,000.00, including the devices and the analysts' payroll.
The price paid by the client varies with the precision achieved by the team of specialists: the client pays R$ 500.00 for each 5% of accuracy above 50%. For example, a diagnosis with 55% accuracy costs the client R$ 500.00, one with 60% accuracy costs R$ 1,000.00, and so on. If the diagnostic accuracy is 50%, the diagnosis is free of charge.
As we can see, the variation in diagnostic accuracy makes each CCD diagnosis either a profitable or an unprofitable operation, which leaves the company with unpredictable revenue.
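The pricing rule can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above, not code from the project: the function names are hypothetical, and it assumes that only complete 5-point steps above 50% are billed.

```python
import math

COST_PER_DIAGNOSIS = 1000.00  # R$, cost of one diagnosis as stated in the text
PRICE_PER_STEP = 500.00       # R$ charged per 5 points of accuracy above 50%


def diagnosis_price(accuracy_pct: float) -> float:
    """Price paid by the client for a given accuracy, in percent.

    Assumption: only complete 5-point steps above 50% are billed.
    """
    steps = math.floor((accuracy_pct - 50.0) / 5.0)
    return PRICE_PER_STEP * max(steps, 0)


def diagnosis_profit(accuracy_pct: float) -> float:
    """Price minus the fixed cost of one diagnosis."""
    return diagnosis_price(accuracy_pct) - COST_PER_DIAGNOSIS


# Accuracy levels covering the range quoted for the specialists team.
for accuracy in (50, 55, 60, 65):
    print(accuracy, diagnosis_price(accuracy), diagnosis_profit(accuracy))
```

At 55% the diagnosis loses money, at 60% it breaks even, and at 65% it turns a profit, which is the instability the project sets out to remove.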
Increase the profit of the Cardio Catch Diseases (CCD) company by increasing the precision and stability of its diagnostic tests.
Create a binary classification tool, using statistical models and machine learning, to increase the precision and stability of the diagnostic tests.
- id [int]: Patient's ID registered in the system
- age [int]: Patient's age in days
- height [float]: Patient's height in cm
- weight [float]: Patient's weight in kg
- gender [binary]: Patient's gender
- ap_hi [float]: Patient's systolic blood pressure
- ap_lo [float]: Patient's diastolic blood pressure
- cholesterol [categorical]: Patient's cholesterol level
- gluc [categorical]: Patient's glucose level
- smoke [binary]: Whether the patient is a smoker
- alco [binary]: Whether the patient is a drinker
- active [binary]: Whether the patient practices physical activities
- cardio [binary]: Whether the patient has a cardiovascular disease
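Since age is stored in days and height and weight are stored separately, derived features such as age in years and BMI have to be computed from the raw columns. The sketch below shows one way to do it; the function names, the 365.25-day year, and the sample values are assumptions made for illustration, not the project's actual feature engineering.

```python
DAYS_PER_YEAR = 365.25  # assumption: average year length, including leap years


def age_in_years(age_days: int) -> float:
    """Convert the `age` field (days) to years."""
    return age_days / DAYS_PER_YEAR


def body_mass_index(height_cm: float, weight_kg: float) -> float:
    """BMI = weight in kg divided by the square of height in metres."""
    height_m = height_cm / 100.0
    return weight_kg / height_m ** 2


# Made-up patient record using the field names from the data dictionary.
patient = {"age": 18250, "height": 170.0, "weight": 70.0}
print(round(age_in_years(patient["age"]), 1))
print(round(body_mass_index(patient["height"], patient["weight"]), 2))
```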
All numerical variables except age still have a large number of outliers.
FALSE
The proportion of sick to healthy men is almost the same as the proportion of sick to healthy women.
TRUE
The proportion of sick to healthy people increases as the BMI level grows.
TRUE
The proportion of sick to healthy people increases as the cholesterol level grows.
TRUE
The proportion of sick to healthy people increases as the glucose level grows.
FALSE
The proportion of sick to healthy non-smokers is almost the same as the proportion of sick to healthy smokers.
FALSE
The proportion of sick to healthy non-drinkers is almost the same as the proportion of sick to healthy drinkers.
TRUE
The proportion of sick to healthy people is lower among those who do not practice physical activities than among those who do.
TRUE
The proportion of sick to healthy people increases as the age range grows.
The proportion of sick to healthy people increases as the hypertension level grows.
Among the elderly, the proportion of sick to healthy people increases as the hypertension level grows.
This increase is far steeper than the increase among people in general.
Prehypertensive elderly people are more likely to have cardiovascular diseases than prehypertensive people in general.
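Each hypothesis above compares the proportion of sick to healthy people across the levels of one variable. A minimal sketch of that comparison is shown here in plain Python; the helper name and the toy records are hypothetical and exist only to show the computation, not to reproduce the project's results.

```python
from collections import defaultdict


def sick_to_healthy_ratio(records, group_key, target_key="cardio"):
    """Return {group value: sick count / healthy count} for one variable."""
    counts = defaultdict(lambda: [0, 0])  # group -> [healthy, sick]
    for row in records:
        counts[row[group_key]][row[target_key]] += 1
    return {
        group: (sick / healthy if healthy else float("inf"))
        for group, (healthy, sick) in counts.items()
    }


# Toy records, invented for illustration only.
toy_records = [
    {"cholesterol": 1, "cardio": 0},
    {"cholesterol": 1, "cardio": 0},
    {"cholesterol": 1, "cardio": 0},
    {"cholesterol": 1, "cardio": 1},
    {"cholesterol": 3, "cardio": 0},
    {"cholesterol": 3, "cardio": 1},
    {"cholesterol": 3, "cardio": 1},
    {"cholesterol": 3, "cardio": 1},
]

print(sick_to_healthy_ratio(toy_records, "cholesterol"))
```

A hypothesis of the form "the proportion increases with the level" holds when the ratio rises from one level to the next.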
Some variables have a considerable impact on the ‘cardio_disease’ result values:
high_pressure
hypertension_level
low_pressure
age
cholesterol
BMI
age_range
weight
To start, the following machine learning models were tested:
We therefore take the four models with the best F1 score for further analysis. The F1 score is a metric that takes both precision and recall into account.
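For reference, the F1 score is the harmonic mean of precision and recall. The sketch below computes all three from confusion-matrix counts. It uses the standard definitions; the function name and the example counts are illustrative and are not taken from the project's results.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Return (precision, recall, f1) from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative counts: 70 sick patients found, 30 false alarms, 20 sick patients missed.
print(precision_recall_f1(tp=70, fp=30, fn=20))
```

Because it is a harmonic mean, the F1 score stays low whenever either precision or recall is low, which makes it a stricter ranking criterion than either metric alone.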
We therefore propose the tuned LGBM as the final model.
Note: for more on the decisions made and how the work was done, see the Cardio Catch Diseases notebook.
The diagnostic test is carried out through manual analysis by a team of specialists.
It depends on the precision achieved by the team and on properly working devices, which causes a significant variation in precision.
Precision Interval Achieved: 55% to 65%
The diagnostic test is carried out through automatic analysis, independent of human analysis.
It depends on the machine learning algorithm running on a cloud platform. This implementation produces only a small variation in precision.
Precision Interval Achieved: 73.33% to 75.37%
Given the considerable variation and range in precision, revenue is unpredictable, with both positive and negative scenarios.
Each diagnosis costs R$ 1,000.00, which includes the specialists team's pay and the cost of operating the devices.
Worst Scenario: - R$ 500.00
Best Scenario: R$ 500.00
Given the small variation and range in precision compared to the current business method, revenue is still not fully predictable, but all resulting scenarios are positive.
Comparing the best scenarios of both cases (R$ 2,500.00 against R$ 500.00), we have a 400% revenue increase.
Worst Scenario: R$ 2,000.00
Best Scenario: R$ 2,500.00
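The 400% figure follows from the two best-scenario values quoted in this README. The sketch below only reproduces that arithmetic, taking both figures exactly as stated; the function name is hypothetical.

```python
def relative_increase_pct(baseline: float, new: float) -> float:
    """Percentage increase of `new` over `baseline`."""
    return (new - baseline) / baseline * 100.0


current_best = 500.00   # R$, best scenario of the manual diagnosis
model_best = 2500.00    # R$, best scenario of the machine learning model

print(relative_increase_pct(current_best, model_best))  # 400.0
```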
The model application can be found here:
Optimize the machine learning model so that its precision interval stays above 75%. The diagnostic test price would then always remain R$ 2,500.00, giving the company predictable revenue.