Kenny Vang’s Data Science Portfolio

Liver Disease Analysis

This project analyzes liver disease risk factors using the Indian Liver Patient Dataset (ILPD). The notebook focuses on cleaning, age and protein patterns, gender differences, and hypothesis tests. The interactive app at the end is a demo only, not a clinical model.

  • Analysis Report And Findings
  • 🩺 Interactive Predictor

Dataset Overview

The analysis is based on the Indian Liver Patient Dataset (ILPD), containing 583 patient records collected from Andhra Pradesh, India. Each record is labeled as either healthy or diagnosed with liver disease.

Cleaning

The notebook found 4 missing values in the A/G Ratio column and removed those rows before analysis.

Code
pd.DataFrame({
    "Missing values in A/G Ratio": [ag_ratio_missing]
})
   Missing values in A/G Ratio
0                            4
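The cleaning step can be sketched as follows. This is a minimal illustration on a made-up four-row frame, not the notebook's own code: the column name `A/G Ratio` and the names `toy` and `cleaned` are assumptions, since the loading and cleaning cells are not shown on this page.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the ILPD frame (values are made up for illustration).
toy = pd.DataFrame({
    "Age": [34, 51, 62, 45],
    "A/G Ratio": [0.9, np.nan, 1.1, np.nan],
})

# Count the missing values in the column, then drop the affected rows.
ag_ratio_missing = int(toy["A/G Ratio"].isna().sum())
cleaned = toy.dropna(subset=["A/G Ratio"]).reset_index(drop=True)

print(ag_ratio_missing)  # rows with a missing A/G Ratio
print(len(cleaned))      # rows remaining after the drop
```

Dropping rows is reasonable when only a handful are affected, as here; with more missing data, imputation might be worth considering instead.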

1. Age Distribution

We first analyzed the distribution of patient ages.

Code
fig_age = px.histogram(
    df, x="Age", nbins=20, 
    title="Distribution of Patient Ages",
    color_discrete_sequence=['#636EFA']
)
fig_age.show()
Code
age_stats
   Statistic      Value
0  Mean       44.746141
1  Median     45.000000
2  Min         4.000000
3  Max        90.000000

Most patients are middle-aged, clustering between 30 and 60 years old. The average age is about 44.7, the median is 45, and the full range in the dataset is 4 to 90.
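A summary table like `age_stats` could be assembled along these lines. The page does not show how the notebook builds it, so this is only a sketch under that assumption, using a small made-up series in place of `df["Age"]`.

```python
import pandas as pd

# Made-up ages for illustration; the notebook would use df["Age"].
ages = pd.Series([4, 30, 45, 45, 60, 90], name="Age")

# One row per statistic, matching the layout of the table above.
age_stats = pd.DataFrame({
    "Statistic": ["Mean", "Median", "Min", "Max"],
    "Value": [ages.mean(), ages.median(), ages.min(), ages.max()],
})

print(age_stats)
```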

2. Are Protein Levels a Useful Indicator?

We compared Total Proteins (TP) against Albumin (ALB) to see how closely they move together.

Code
fig_corr = px.scatter(
    df, x="TP", y="ALB", 
    title="Total Proteins vs Albumin Correlation",
    opacity=0.6,
    color_discrete_sequence=['#EF553B']
)
fig_corr.show()

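Total Proteins and Albumin move closely together: when one is higher, the other tends to be higher too, with a correlation coefficient of 0.78 (1.0 would be a perfect linear relationship). Part of this is expected, since Albumin is itself a major component of Total Proteins. Because the liver produces Albumin, low Albumin is still a plausible signal of impaired liver function, but the correlation alone does not establish that.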
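The correlation figure can be computed directly with pandas. The page does not say which coefficient the notebook used; the sketch below assumes Pearson correlation (the pandas default) and uses made-up values for the `TP` and `ALB` columns.

```python
import pandas as pd

# Made-up lab values for illustration only.
toy = pd.DataFrame({
    "TP":  [5.5, 6.0, 6.8, 7.2, 8.0],
    "ALB": [2.4, 3.1, 3.0, 3.6, 4.1],
})

# Series.corr uses Pearson correlation unless told otherwise.
r = toy["TP"].corr(toy["ALB"])

print(round(r, 2))
```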

3. Demographic Risks (Men vs Women)

We filtered the data to patients under 60 and calculated, for each gender, the share diagnosed with liver disease.

Code
fig_prob = px.bar(
    gender_prob, x="Probability of Liver Disease (%)", y="Group", orientation='h',
    title="Probability of Liver Disease by Gender (Under 60)",
    color="Group"
)
fig_prob.show()
Code
gender_prob
   Group       Probability of Liver Disease (%)
0  Men < 60                           27.146814
1  Women < 60                         34.146341

In this dataset, about 27% of men under 60 and 34% of women under 60 were diagnosed with liver disease. Women under 60 show the higher rate in this sample.
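The filter-and-group step might look like the sketch below. The column names `Gender` and `Disease` (1 = diagnosed, 0 = not) and the toy rows are assumptions made for illustration; the notebook's actual column names and label coding are not shown here.

```python
import pandas as pd

# Made-up rows; "Disease" is assumed to be coded 1 = diagnosed, 0 = not.
toy = pd.DataFrame({
    "Age":     [25, 40, 58, 61, 33, 47, 59, 70],
    "Gender":  ["Male", "Male", "Male", "Male",
                "Female", "Female", "Female", "Female"],
    "Disease": [1, 0, 0, 1, 1, 0, 1, 0],
})

# Keep patients under 60, then take the mean of the 0/1 label per gender.
under_60 = toy[toy["Age"] < 60]
rate = under_60.groupby("Gender")["Disease"].mean() * 100

gender_prob = pd.DataFrame({
    "Group": ["Men < 60", "Women < 60"],
    "Probability of Liver Disease (%)": [rate["Male"], rate["Female"]],
})

print(gender_prob)
```

Because the label is 0/1, its group mean is the diagnosed share, so it is worth confirming which label value means "diagnosed" before reading the percentages.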

4. Do These Patterns Hold Up to Statistical Testing?

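To check whether the patterns above could simply be due to chance, the notebook ran three statistical tests. A p-value is the probability of seeing a difference at least this large if there were no real underlying effect, so it works as a “could this be a fluke?” score. By convention, a p-value below 0.05 is treated as statistically significant, meaning chance alone is an unlikely explanation.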

  • Men vs. women disease rates (under 60): The difference is unlikely to be due to chance alone (p = 0.03). ✓ Significant at the 5% level.
  • Albumin levels in healthy vs. diseased female patients: No significant difference found (p = 0.18). This one could be chance.
  • Albumin levels by gender among diagnosed patients: Male and female patients with liver disease had significantly different albumin levels (p = 0.02). ✓ Significant at the 5% level.
Code
test_results
   Test                                   Observed Result  P-Value  Interpretation
0  Gender disease rate, under 60                   8.5763   0.0328  Significant at 5%
1  Albumin difference in female patients           0.1391   0.1750  Not significant at 5%
2  Custom albumin by gender test                   0.3122   0.0229  Significant at 5%
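The page does not show which tests the notebook ran. As one common way to get a p-value of this kind, the sketch below runs a two-sided permutation test on the difference between two group means (for 0/1 data, the difference in rates). The function name `permutation_p_value` and the toy groups are hypothetical and are not taken from the notebook.

```python
import random

def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)
    n_a = len(group_a)
    observed = abs(sum(group_a) / n_a - sum(group_b) / len(group_b))

    pooled = list(group_a) + list(group_b)
    n_b = len(pooled) - n_a
    hits = 0
    for _ in range(n_permutations):
        # Shuffle the pooled values to simulate "group labels do not matter".
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        # Small tolerance guards against floating-point ties.
        if diff >= observed - 1e-12:
            hits += 1
    return observed, hits / n_permutations

# Made-up 0/1 outcomes: 30% positive in one group, 60% in the other.
group_a = [1] * 30 + [0] * 70
group_b = [1] * 60 + [0] * 40
observed_diff, p_value = permutation_p_value(group_a, group_b)

print(observed_diff, p_value)
```

The p-value here is the share of shuffles that produce a difference at least as large as the observed one, which matches the "could this be a fluke?" reading used above.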

Key Takeaways

  • Most patients in this dataset are middle-aged, with an average age of about 45.
  • Women under 60 show a higher liver disease diagnosis rate than men in this sample (34% vs 27%).
  • Total Proteins and Albumin move closely together (correlation 0.78). Albumin is produced by the liver, so it remains a potentially useful health marker, though this correlation alone does not show that.
  • Gender appears to be associated with albumin levels among diagnosed patients. This is a finding that could be worth investigating further in a clinical setting.

Note: This is a demo app. It uses a synthetic training set inside the app code, so it should not be read as a real ILPD-based medical predictor.
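To make the note concrete, the synthetic labeling rule from the app code can be run on its own to see how the training labels are distributed. This sketch copies the feature scaling and the rule from the app; the local `RandomState` generator and the variable name `positive_share` are additions made for illustration.

```python
import numpy as np

# Same column order as the app: age, gender, total bilirubin,
# alkaline phosphatase, SGPT. Values are uniform random, not patient data.
rng = np.random.RandomState(42)
X = rng.rand(500, 5) * [90, 1, 10, 500, 100]

# Labeling rule copied from the app's train_model().
y = (
    (X[:, 2] > 2.5)
    | (X[:, 3] > 250)
    | (X[:, 4] > 60)
    | ((X[:, 0] > 50) & (X[:, 2] > 1.5))
).astype(int)

# Share of synthetic rows labeled as "disease".
positive_share = y.mean()
print(f"{positive_share:.1%} of synthetic rows are labeled positive")
```

Because bilirubin is drawn uniformly from 0 to 10, the first condition alone marks about three quarters of the rows as positive, so the rule labels the large majority of the synthetic set as disease. The app's risk scores reflect this hand-written rule rather than anything learned from ILPD.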

#| standalone: true
#| components: [viewer]
#| viewerHeight: 650

from shiny import App, render, ui, reactive
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Define the user interface
app_ui = ui.page_fluid(
    ui.h2("Liver Disease Risk Demo"),
    ui.layout_sidebar(
        ui.sidebar(
            ui.h4("Patient Health Metrics"),
            ui.input_select("preset", "🧪 Choose a Patient Profile:", 
                            {"custom": "Custom", "healthy": "Healthy Adult", "high_risk": "High-Risk Patient"}),
            ui.input_slider("age", "Age", min=4, max=90, value=30),
            ui.input_select("gender", "Gender", {"1": "Male", "0": "Female"}),
            ui.input_slider("tb", "Total Bilirubin", min=0.4, max=75.0, value=0.8, step=0.1),
            ui.input_slider("alkphos", "Alkaline Phosphatase", min=63, max=2110, value=150),
            ui.input_slider("sgpt", "Alanine Aminotransferase (SGPT)", min=10, max=2000, value=25),
            ui.input_action_button("predict", "Calculate Risk Score", class_="btn-primary")
        ),
        ui.card(
            ui.h3("Prediction Result:"),
            ui.output_ui("prediction_result")
        )
    )
)

def server(input, output, session):

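    @reactive.calc
    def train_model():
        # Synthetic training data, not ILPD. Columns follow the order used at
        # prediction time: age, gender, total bilirubin, alkaline phosphatase, SGPT.
        np.random.seed(42)
        X = np.random.rand(500, 5) * [90, 1, 10, 500, 100]
        # Hand-written labeling rule: positive when any lab value is high,
        # or when age > 50 combines with moderately raised bilirubin.
        y = (X[:, 2] > 2.5) | (X[:, 3] > 250) | (X[:, 4] > 60) | ((X[:, 0] > 50) & (X[:, 2] > 1.5))
        y = y.astype(int)

        # Small, shallow forest; the reactive calc caches it, so it is fit once per session.
        clf = RandomForestClassifier(n_estimators=50, max_depth=5, class_weight='balanced', random_state=42)
        clf.fit(X, y)
        return clf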

    @reactive.effect
    @reactive.event(input.preset)
    def update_sliders():
        # Apply a preset profile to the inputs; "custom" leaves them untouched.
        p = input.preset()
        if p == "healthy":
            ui.update_slider("age", value=32)
            ui.update_select("gender", selected="0")
            ui.update_slider("tb", value=0.8)
            ui.update_slider("alkphos", value=160)
            ui.update_slider("sgpt", value=22)
        elif p == "high_risk":
            ui.update_slider("age", value=65)
            ui.update_select("gender", selected="1")
            ui.update_slider("tb", value=7.5)
            ui.update_slider("alkphos", value=480)
            ui.update_slider("sgpt", value=85)

    @render.ui
    @reactive.event(input.predict)
    def prediction_result():
        clf = train_model()
        features = [[
            input.age(),
            int(input.gender()),
            input.tb(),
            input.alkphos(),
            input.sgpt()
        ]]
        
        prob = clf.predict_proba(features)[0][1]
        
        reasons = []
        if input.tb() > 1.2:
            reasons.append(f"**Elevated Total Bilirubin** ({input.tb()} > 1.2 mg/dL)")
        if input.alkphos() > 147:
            reasons.append(f"**Elevated Alkaline Phosphatase** ({input.alkphos()} > 147 IU/L)")
        if input.sgpt() > 56:
            reasons.append(f"**Elevated SGPT** ({input.sgpt()} > 56 U/L)")
        
        if prob > 0.5:
            msg = f"#### ⚠️ High Likelihood of Liver Disease (Risk Score: {prob:.2%})\n\n"
            if reasons:
                msg += "**Primary Risk Factors Identified:**\n\n- " + "\n- ".join(reasons)
        else:
            msg = f"#### ✅ Low Likelihood of Liver Disease (Risk Score: {prob:.2%})\n\n"
            if reasons:
                msg += "*(Note: Some values are outside optimal ranges:)*\n\n- " + "\n- ".join(reasons)
                
        return ui.markdown(msg)

app = App(app_ui, server)