In the United States, healthcare can be extremely expensive: a doctor's visit can cost several hundred dollars, and a hospital stay tens of thousands. Many of us could not pay these charges out of pocket, especially since illnesses and injuries are unpredictable and intermittent, so health insurance makes costs more manageable for patients/consumers. A patient picks a health insurance plan and agrees to pay a premium for the policy; in return, the insurance company agrees to pay a specified percentage of covered medical expenses. This works by sharing risk: most people are healthy most of the time and need little medical care, so their premium dollars go toward covering the expenses of the relatively few people who are sick or injured and need care.
Using data on policyholders and their billed costs, such as the dataset loaded below, insurance companies can set premiums based on how often they expect households in a given cluster or demographic group to use their insurance.
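The risk-sharing idea can be made concrete with a little arithmetic. The sketch below computes a break-even premium for one risk pool. Every number in it (pool size, claim probability, average claim, coverage share) is made up for illustration, and `break_even_premium` is a hypothetical helper, not how any real insurer prices policies.

```python
# Hypothetical risk pool: all numbers below are assumptions for illustration only.
pool_size = 1000            # people paying premiums into the pool
claim_probability = 0.05    # assumed share of members needing costly care in a year
average_claim = 20_000.0    # assumed average cost of that care, in dollars
coverage_share = 0.8        # assumed fraction of covered costs the insurer pays


def break_even_premium(pool_size, claim_probability, average_claim, coverage_share):
    """Annual premium per member at which total premiums equal expected payouts."""
    expected_claims = pool_size * claim_probability
    expected_payout = expected_claims * average_claim * coverage_share
    return expected_payout / pool_size


premium = break_even_premium(pool_size, claim_probability, average_claim, coverage_share)
print(f"Break-even annual premium per member: ${premium:,.2f}")
```

With these made-up inputs, each member pays far less than the cost of a single claim, which is the point of pooling. A cluster with a higher expected claim probability would get a proportionally higher break-even premium.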
```python
import pandas as pd

# source: https://www.kaggle.com/datasets/teertha/ushealthinsurancedataset
df = pd.read_csv('insurance.csv')

# show first 20 rows of the data
df.head(20)
```

| | age | sex | bmi | children | smoker | region | charges |
---|---|---|---|---|---|---|---|
0 | 19 | female | 27.900 | 0 | yes | southwest | 16884.92400 |
1 | 18 | male | 33.770 | 1 | no | southeast | 1725.55230 |
2 | 28 | male | 33.000 | 3 | no | southeast | 4449.46200 |
3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
4 | 32 | male | 28.880 | 0 | no | northwest | 3866.85520 |
5 | 31 | female | 25.740 | 0 | no | southeast | 3756.62160 |
6 | 46 | female | 33.440 | 1 | no | southeast | 8240.58960 |
7 | 37 | female | 27.740 | 3 | no | northwest | 7281.50560 |
8 | 37 | male | 29.830 | 2 | no | northeast | 6406.41070 |
9 | 60 | female | 25.840 | 0 | no | northwest | 28923.13692 |
10 | 25 | male | 26.220 | 0 | no | northeast | 2721.32080 |
11 | 62 | female | 26.290 | 0 | yes | southeast | 27808.72510 |
12 | 23 | male | 34.400 | 0 | no | southwest | 1826.84300 |
13 | 56 | female | 39.820 | 0 | no | southeast | 11090.71780 |
14 | 27 | male | 42.130 | 0 | yes | southeast | 39611.75770 |
15 | 19 | male | 24.600 | 1 | no | southwest | 1837.23700 |
16 | 52 | female | 30.780 | 1 | no | northeast | 10797.33620 |
17 | 23 | male | 23.845 | 0 | no | northeast | 2395.17155 |
18 | 56 | male | 40.300 | 0 | no | southwest | 10602.38500 |
19 | 30 | male | 35.300 | 0 | yes | southwest | 36837.46700 |

- age (integer): age of primary beneficiary
- sex (male/female): sex of the insurance contractor
- bmi (float): body mass index
- children (integer): number of dependents
- smoker (yes/no): smoker/non-smoker
- region (string): residential area in the US (northeast, northwest, southeast, southwest)
- charges (float): individual medical costs billed by insurance

This data is sufficient because it includes information about patients that may affect their insurance needs, as well as the actual costs billed to their insurance. Factors like BMI and smoking status can help predict whether a patient is likely to need more medical care in the long term. Generally, as age increases, the amount of care a patient needs also increases. Families with children and/or a higher average age would likely need more insurance coverage for medical services such as preventive care and checkups.
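A quick way to sanity-check these expectations is to compare charges across the factors directly. The sketch below inlines the 20 rows shown in the table so that it runs without `insurance.csv`; the names `sample`, `mean_by_smoker`, and `numeric_corr` are introduced here for illustration. Twenty rows are far too few to draw conclusions from, so treat the output as a demonstration of the approach only.

```python
import io

import pandas as pd

# The 20 rows displayed above, inlined so this snippet is self-contained.
sample_csv = """age,sex,bmi,children,smoker,region,charges
19,female,27.900,0,yes,southwest,16884.92400
18,male,33.770,1,no,southeast,1725.55230
28,male,33.000,3,no,southeast,4449.46200
33,male,22.705,0,no,northwest,21984.47061
32,male,28.880,0,no,northwest,3866.85520
31,female,25.740,0,no,southeast,3756.62160
46,female,33.440,1,no,southeast,8240.58960
37,female,27.740,3,no,northwest,7281.50560
37,male,29.830,2,no,northeast,6406.41070
60,female,25.840,0,no,northwest,28923.13692
25,male,26.220,0,no,northeast,2721.32080
62,female,26.290,0,yes,southeast,27808.72510
23,male,34.400,0,no,southwest,1826.84300
56,female,39.820,0,no,southeast,11090.71780
27,male,42.130,0,yes,southeast,39611.75770
19,male,24.600,1,no,southwest,1837.23700
52,female,30.780,1,no,northeast,10797.33620
23,male,23.845,0,no,northeast,2395.17155
56,male,40.300,0,no,southwest,10602.38500
30,male,35.300,0,yes,southwest,36837.46700
"""
sample = pd.read_csv(io.StringIO(sample_csv))

# Mean billed charges for smokers vs. non-smokers in the sample.
mean_by_smoker = sample.groupby("smoker")["charges"].mean()
print(mean_by_smoker)

# Linear correlation of each numeric factor with charges.
numeric_corr = (
    sample[["age", "bmi", "children", "charges"]].corr()["charges"].drop("charges")
)
print(numeric_corr)
```

To run the same comparison on the full dataset, replace `sample` with the `df` loaded earlier.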
Clustering: identify households that are similar in terms of risk (e.g., a higher average age across all members of a risk pool or household, extremely high or low BMI, smokers), then adjust premium costs based on how often each cluster of households is expected to use its insurance.
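One possible way to sketch this clustering step is k-means from scikit-learn, which is a choice made here for illustration and not something the dataset or this write-up prescribes. The example treats each row as one unit (the data has no household identifier), uses only the risk factors as features, and picks `n_clusters=2` arbitrarily. It inlines the 20 rows shown in the table so that it runs without `insurance.csv`.

```python
import io

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# The 20 rows displayed above, inlined so this snippet is self-contained.
sample_csv = """age,sex,bmi,children,smoker,region,charges
19,female,27.900,0,yes,southwest,16884.92400
18,male,33.770,1,no,southeast,1725.55230
28,male,33.000,3,no,southeast,4449.46200
33,male,22.705,0,no,northwest,21984.47061
32,male,28.880,0,no,northwest,3866.85520
31,female,25.740,0,no,southeast,3756.62160
46,female,33.440,1,no,southeast,8240.58960
37,female,27.740,3,no,northwest,7281.50560
37,male,29.830,2,no,northeast,6406.41070
60,female,25.840,0,no,northwest,28923.13692
25,male,26.220,0,no,northeast,2721.32080
62,female,26.290,0,yes,southeast,27808.72510
23,male,34.400,0,no,southwest,1826.84300
56,female,39.820,0,no,southeast,11090.71780
27,male,42.130,0,yes,southeast,39611.75770
19,male,24.600,1,no,southwest,1837.23700
52,female,30.780,1,no,northeast,10797.33620
23,male,23.845,0,no,northeast,2395.17155
56,male,40.300,0,no,southwest,10602.38500
30,male,35.300,0,yes,southwest,36837.46700
"""
sample = pd.read_csv(io.StringIO(sample_csv))

# Risk-related features only; charges are left out so clusters reflect risk factors.
features = pd.DataFrame({
    "age": sample["age"],
    "bmi": sample["bmi"],
    "children": sample["children"],
    "smoker": (sample["smoker"] == "yes").astype(int),  # encode yes/no as 1/0
})

# Standardize so that no feature dominates the distance just because of its scale.
scaled = StandardScaler().fit_transform(features)

# n_clusters=2 is an arbitrary illustrative choice, not a tuned value.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
sample["cluster"] = kmeans.fit_predict(scaled)

# Summarize each cluster, including the charges its members were actually billed.
cluster_summary = sample.groupby("cluster").agg(
    members=("charges", "size"),
    mean_age=("age", "mean"),
    mean_bmi=("bmi", "mean"),
    smoker_share=("smoker", lambda s: (s == "yes").mean()),
    mean_charges=("charges", "mean"),
)
print(cluster_summary)
```

Comparing `mean_charges` across clusters gives a rough sense of how much each group uses its insurance, which is the quantity a premium adjustment would be based on. On the full dataset, the number of clusters would need to be chosen more carefully, for example with an elbow plot or silhouette scores.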