Healthcare cost

  1. Describes and motivates a real-world problem where data science may provide helpful insights. Your description should be easily understood by a casual reader and include citations to motivating sources or relevant information (e.g. news articles, further reading links…Wikipedia makes for a poor reference but the links it cites are usually promising).

In the United States, healthcare can be extremely expensive: a doctor's visit can cost several hundred dollars, and a hospital stay tens of thousands. Many of us could not pay these charges out of pocket, especially since illness and injury are unpredictable and intermittent, so health insurance spreads the cost into more manageable payments for patients. A patient picks a health insurance plan and agrees to pay a premium for the policy; in return, the insurance company agrees to pay a specific percentage of covered medical expenses. This works by sharing risk: since most people are healthy most of the time and do not need medical care, their premium dollars go toward covering the expenses of the relatively few people who are sick or injured and need care.

Using data about the people they cover, insurance companies can set premiums based on how often they expect households that fit into certain clusters or demographics to use their insurance.

In [6]:
import pandas as pd

# source: https://www.kaggle.com/datasets/teertha/ushealthinsurancedataset
df = pd.read_csv('insurance.csv')

# show first 20 rows of the data
df.head(20)
Out[6]:
    age     sex     bmi  children  smoker     region      charges
0    19  female  27.900         0     yes  southwest  16884.92400
1    18    male  33.770         1      no  southeast   1725.55230
2    28    male  33.000         3      no  southeast   4449.46200
3    33    male  22.705         0      no  northwest  21984.47061
4    32    male  28.880         0      no  northwest   3866.85520
5    31  female  25.740         0      no  southeast   3756.62160
6    46  female  33.440         1      no  southeast   8240.58960
7    37  female  27.740         3      no  northwest   7281.50560
8    37    male  29.830         2      no  northeast   6406.41070
9    60  female  25.840         0      no  northwest  28923.13692
10   25    male  26.220         0      no  northeast   2721.32080
11   62  female  26.290         0     yes  southeast  27808.72510
12   23    male  34.400         0      no  southwest   1826.84300
13   56  female  39.820         0      no  southeast  11090.71780
14   27    male  42.130         0     yes  southeast  39611.75770
15   19    male  24.600         1      no  southwest   1837.23700
16   52  female  30.780         1      no  northeast  10797.33620
17   23    male  23.845         0      no  northeast   2395.17155
18   56    male  40.300         0      no  southwest  10602.38500
19   30    male  35.300         0     yes  southwest  36837.46700
  1. Explicitly load and show your dataset. Provide a data dictionary which explains the meaning of each feature present. Demonstrate that this data is sufficient to make progress on your real-world problem described above.

age (integer): age of primary beneficiary
sex (male/female): sex of the insurance contractor
bmi (float): body mass index
children (integer): number of dependents
smoker (yes/no): smoker/non-smoker
region (string): residential area in the US (northeast, northwest, southeast, southwest)
charges (float): individual medical costs billed by insurance

This data is sufficient because it includes information about patients that may affect their insurance needs, as well as the actual costs they were billed by insurance. Factors like BMI and smoking status can help predict whether a patient is likely to need more medical care in the long term. Generally, the amount of care a patient needs increases with age. Families with children and/or a higher average age would likely need more coverage for medical services such as preventive care and checkups.
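As a quick sanity check on the sufficiency claim, the sketch below compares average charges for smokers and non-smokers and measures how strongly each numeric feature correlates with charges. It is illustrative only: it inlines the 20 rows displayed above so it runs without insurance.csv, and the names SAMPLE_CSV, sample, mean_by_smoker and corr_with_charges are invented for this example.

```python
import io

import pandas as pd

# The 20 rows displayed above, inlined so this sketch runs without insurance.csv.
# (Assumption: in the real notebook the same calls would be run on the full df.)
SAMPLE_CSV = """age,sex,bmi,children,smoker,region,charges
19,female,27.900,0,yes,southwest,16884.92400
18,male,33.770,1,no,southeast,1725.55230
28,male,33.000,3,no,southeast,4449.46200
33,male,22.705,0,no,northwest,21984.47061
32,male,28.880,0,no,northwest,3866.85520
31,female,25.740,0,no,southeast,3756.62160
46,female,33.440,1,no,southeast,8240.58960
37,female,27.740,3,no,northwest,7281.50560
37,male,29.830,2,no,northeast,6406.41070
60,female,25.840,0,no,northwest,28923.13692
25,male,26.220,0,no,northeast,2721.32080
62,female,26.290,0,yes,southeast,27808.72510
23,male,34.400,0,no,southwest,1826.84300
56,female,39.820,0,no,southeast,11090.71780
27,male,42.130,0,yes,southeast,39611.75770
19,male,24.600,1,no,southwest,1837.23700
52,female,30.780,1,no,northeast,10797.33620
23,male,23.845,0,no,northeast,2395.17155
56,male,40.300,0,no,southwest,10602.38500
30,male,35.300,0,yes,southwest,36837.46700
"""
sample = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Average billed charges for smokers vs. non-smokers.
mean_by_smoker = sample.groupby("smoker")["charges"].mean()

# Pearson correlation of each numeric feature with charges.
numeric_cols = ["age", "bmi", "children", "charges"]
corr_with_charges = sample[numeric_cols].corr()["charges"].drop("charges")

print(mean_by_smoker)
print(corr_with_charges)
```

Twenty rows are far too few to draw conclusions from; the point is only that the features needed for this kind of comparison are present in the dataset.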

  1. Write one or two sentences about how the data will be used to solve the problem. Earlier in the semester, we won’t have studied the Machine Learning methods just yet but you should have a general idea of what the ML will set out to do.

Clustering: identify households that are similar in terms of risk (e.g. a higher average age across all members of a risk pool or household, extremely high or low BMI, smokers), then adjust premiums based on how often each cluster of households is expected to use its insurance.
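One possible shape for the clustering step is sketched below, assuming scikit-learn's KMeans as the method (the proposal does not commit to a specific algorithm). Because the dataset has one row per primary beneficiary rather than per household, the sketch clusters individuals as a stand-in. The choice of k=2, the feature set, and the names SAMPLE_CSV, people, features and cluster_summary are all assumptions made for illustration.

```python
import io

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# The 20 rows displayed earlier, inlined so this sketch runs without insurance.csv.
SAMPLE_CSV = """age,sex,bmi,children,smoker,region,charges
19,female,27.900,0,yes,southwest,16884.92400
18,male,33.770,1,no,southeast,1725.55230
28,male,33.000,3,no,southeast,4449.46200
33,male,22.705,0,no,northwest,21984.47061
32,male,28.880,0,no,northwest,3866.85520
31,female,25.740,0,no,southeast,3756.62160
46,female,33.440,1,no,southeast,8240.58960
37,female,27.740,3,no,northwest,7281.50560
37,male,29.830,2,no,northeast,6406.41070
60,female,25.840,0,no,northwest,28923.13692
25,male,26.220,0,no,northeast,2721.32080
62,female,26.290,0,yes,southeast,27808.72510
23,male,34.400,0,no,southwest,1826.84300
56,female,39.820,0,no,southeast,11090.71780
27,male,42.130,0,yes,southeast,39611.75770
19,male,24.600,1,no,southwest,1837.23700
52,female,30.780,1,no,northeast,10797.33620
23,male,23.845,0,no,northeast,2395.17155
56,male,40.300,0,no,southwest,10602.38500
30,male,35.300,0,yes,southwest,36837.46700
"""
people = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Risk-related features only; charges is held out so it can be used
# afterwards to see whether the clusters differ in cost.
features = people[["age", "bmi", "children"]].copy()
features["smoker"] = (people["smoker"] == "yes").astype(int)

# Standardize so that no single feature dominates the distance computation.
scaled = StandardScaler().fit_transform(features)

# k=2 is an arbitrary illustrative choice, not a tuned value.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
people["cluster"] = kmeans.fit_predict(scaled)

# Average profile and average billed charges per cluster.
cluster_summary = people.groupby("cluster")[["age", "bmi", "charges"]].mean()
print(cluster_summary)
```

Keeping charges out of the clustering features is a deliberate choice in this sketch: if clusters formed from risk factors alone also differ in average charges, that supports using them to inform premiums.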

sources: https://www.uhcsr.com/insurance101#:~:text=Just%20like%20car%20or%20home,medical%20services%20(covered%20services).

https://www.harvardmagazine.com/2020/05/feature-forum-costliest-health-care