Water is essential for people across the world, especially in towns and remote areas where access to it depends on piped supply fed by water pumps. When these pumps fail, the impact on remote communities can be devastating, and families suffer. What if we could predict when these failures were likely to happen?
Based on the given sensor data, when will a water pump fail?
Back home in India, access to water at a steady rate, even in the city, is a major problem. It is especially an issue for agriculture because of how heavily villages rely on water for harvests and food. In my dad's village, water outages happen at what seem to be random times. I'd like to model this problem on a smaller scale, in the hope of eventually helping my dad's village on a larger scale.
This data set was obtained via Kaggle, but the interesting part is that I get to work with IoT sensor data. The entries here are raw outputs from 52 different sensors, so I have to develop my own data pipeline to clean and use the data to predict an outage (I consider it an interesting processing and prediction challenge).
Source: https://www.kaggle.com/datasets/nphantawee/pump-sensor-data
import pandas as pd
df = pd.read_csv('sensor.csv')
df.head(5)
| | Unnamed: 0 | timestamp | sensor_00 | sensor_01 | sensor_02 | sensor_03 | sensor_04 | sensor_05 | sensor_06 | sensor_07 | ... | sensor_43 | sensor_44 | sensor_45 | sensor_46 | sensor_47 | sensor_48 | sensor_49 | sensor_50 | sensor_51 | machine_status |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 2018-04-01 00:00:00 | 2.465394 | 47.09201 | 53.2118 | 46.310760 | 634.3750 | 76.45975 | 13.41146 | 16.13136 | ... | 41.92708 | 39.641200 | 65.68287 | 50.92593 | 38.194440 | 157.9861 | 67.70834 | 243.0556 | 201.3889 | NORMAL |
1 | 1 | 2018-04-01 00:01:00 | 2.465394 | 47.09201 | 53.2118 | 46.310760 | 634.3750 | 76.45975 | 13.41146 | 16.13136 | ... | 41.92708 | 39.641200 | 65.68287 | 50.92593 | 38.194440 | 157.9861 | 67.70834 | 243.0556 | 201.3889 | NORMAL |
2 | 2 | 2018-04-01 00:02:00 | 2.444734 | 47.35243 | 53.2118 | 46.397570 | 638.8889 | 73.54598 | 13.32465 | 16.03733 | ... | 41.66666 | 39.351852 | 65.39352 | 51.21528 | 38.194443 | 155.9606 | 67.12963 | 241.3194 | 203.7037 | NORMAL |
3 | 3 | 2018-04-01 00:03:00 | 2.460474 | 47.09201 | 53.1684 | 46.397568 | 628.1250 | 76.98898 | 13.31742 | 16.24711 | ... | 40.88541 | 39.062500 | 64.81481 | 51.21528 | 38.194440 | 155.9606 | 66.84028 | 240.4514 | 203.1250 | NORMAL |
4 | 4 | 2018-04-01 00:04:00 | 2.445718 | 47.13541 | 53.2118 | 46.397568 | 636.4583 | 76.58897 | 13.35359 | 16.21094 | ... | 41.40625 | 38.773150 | 65.10416 | 51.79398 | 38.773150 | 158.2755 | 66.55093 | 242.1875 | 201.3889 | NORMAL |
5 rows × 55 columns
df.shape
(220320, 55)
df.columns
Index(['Unnamed: 0', 'timestamp', 'sensor_00', 'sensor_01', 'sensor_02', 'sensor_03', 'sensor_04', 'sensor_05', 'sensor_06', 'sensor_07', 'sensor_08', 'sensor_09', 'sensor_10', 'sensor_11', 'sensor_12', 'sensor_13', 'sensor_14', 'sensor_15', 'sensor_16', 'sensor_17', 'sensor_18', 'sensor_19', 'sensor_20', 'sensor_21', 'sensor_22', 'sensor_23', 'sensor_24', 'sensor_25', 'sensor_26', 'sensor_27', 'sensor_28', 'sensor_29', 'sensor_30', 'sensor_31', 'sensor_32', 'sensor_33', 'sensor_34', 'sensor_35', 'sensor_36', 'sensor_37', 'sensor_38', 'sensor_39', 'sensor_40', 'sensor_41', 'sensor_42', 'sensor_43', 'sensor_44', 'sensor_45', 'sensor_46', 'sensor_47', 'sensor_48', 'sensor_49', 'sensor_50', 'sensor_51', 'machine_status'], dtype='object')
df.describe()
| | Unnamed: 0 | sensor_00 | sensor_01 | sensor_02 | sensor_03 | sensor_04 | sensor_05 | sensor_06 | sensor_07 | sensor_08 | ... | sensor_42 | sensor_43 | sensor_44 | sensor_45 | sensor_46 | sensor_47 | sensor_48 | sensor_49 | sensor_50 | sensor_51 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 220320.000000 | 210112.000000 | 219951.000000 | 220301.000000 | 220301.000000 | 220301.000000 | 220301.000000 | 215522.000000 | 214869.000000 | 215213.000000 | ... | 220293.000000 | 220293.000000 | 220293.000000 | 220293.000000 | 220293.000000 | 220293.000000 | 220293.000000 | 220293.000000 | 143303.000000 | 204937.000000 |
mean | 110159.500000 | 2.372221 | 47.591611 | 50.867392 | 43.752481 | 590.673936 | 73.396414 | 13.501537 | 15.843152 | 15.200721 | ... | 35.453455 | 43.879591 | 42.656877 | 43.094984 | 48.018585 | 44.340903 | 150.889044 | 57.119968 | 183.049260 | 202.699667 |
std | 63601.049991 | 0.412227 | 3.296666 | 3.666820 | 2.418887 | 144.023912 | 17.298247 | 2.163736 | 2.201155 | 2.037390 | ... | 10.259521 | 11.044404 | 11.576355 | 12.837520 | 15.641284 | 10.442437 | 82.244957 | 19.143598 | 65.258650 | 109.588607 |
min | 0.000000 | 0.000000 | 0.000000 | 33.159720 | 31.640620 | 2.798032 | 0.000000 | 0.014468 | 0.000000 | 0.028935 | ... | 22.135416 | 24.479166 | 25.752316 | 26.331018 | 26.331018 | 27.199070 | 26.331018 | 26.620370 | 27.488426 | 27.777779 |
25% | 55079.750000 | 2.438831 | 46.310760 | 50.390620 | 42.838539 | 626.620400 | 69.976260 | 13.346350 | 15.907120 | 15.183740 | ... | 32.812500 | 39.583330 | 36.747684 | 36.747684 | 40.509258 | 39.062500 | 83.912030 | 47.743060 | 167.534700 | 179.108800 |
50% | 110159.500000 | 2.456539 | 48.133678 | 51.649300 | 44.227428 | 632.638916 | 75.576790 | 13.642940 | 16.167530 | 15.494790 | ... | 35.156250 | 42.968750 | 40.509260 | 40.219910 | 44.849540 | 42.534720 | 138.020800 | 52.662040 | 193.865700 | 197.338000 |
75% | 165239.250000 | 2.499826 | 49.479160 | 52.777770 | 45.312500 | 637.615723 | 80.912150 | 14.539930 | 16.427950 | 15.697340 | ... | 36.979164 | 46.614580 | 45.138890 | 44.849540 | 51.215280 | 46.585650 | 208.333300 | 60.763890 | 219.907400 | 216.724500 |
max | 220319.000000 | 2.549016 | 56.727430 | 56.032990 | 48.220490 | 800.000000 | 99.999880 | 22.251160 | 23.596640 | 24.348960 | ... | 374.218800 | 408.593700 | 1000.000000 | 320.312500 | 370.370400 | 303.530100 | 561.632000 | 464.409700 | 1000.000000 | 1000.000000 |
8 rows × 53 columns
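Because this data consists entirely of raw sensor readings, there is a lot of pre-processing work to do to standardize the values. The count row above also shows that several sensors have missing readings (sensor_50, for example, has 143,303 values out of 220,320 rows), so those gaps need handling too. Some precision may be lost to truncation or rounding during conversion. Nonetheless, we expect to be able to predict outages with reasonable accuracy.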
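The pre-processing described here could look something like the sketch below. It is a minimal illustration, not the final pipeline: the helper names `clean_sensor_frame` and `standardize`, the 50% missing-value threshold, and the fill strategy are all assumptions, and the tiny `demo` frame is made up to mirror the column layout of `sensor.csv`.

```python
import numpy as np
import pandas as pd


def clean_sensor_frame(df, max_missing_frac=0.5):
    """Hypothetical helper: drop the index column, parse timestamps, fill gaps."""
    out = df.drop(columns=["Unnamed: 0"], errors="ignore").copy()
    out["timestamp"] = pd.to_datetime(out["timestamp"])
    out = out.sort_values("timestamp").set_index("timestamp")

    sensor_cols = [c for c in out.columns if c.startswith("sensor_")]
    # Drop sensors with too many missing readings (the threshold is an assumption).
    too_sparse = [c for c in sensor_cols if out[c].isna().mean() > max_missing_frac]
    out = out.drop(columns=too_sparse)
    kept = [c for c in sensor_cols if c not in too_sparse]
    # Carry the last reading forward, then back-fill any gap at the very start.
    out[kept] = out[kept].ffill().bfill()
    return out


def standardize(train, test):
    """Z-score the sensor columns using statistics from the training rows only."""
    cols = [c for c in train.columns if c.startswith("sensor_")]
    mean = train[cols].mean()
    std = train[cols].std(ddof=0).replace(0, 1.0)  # avoid dividing by zero
    train_z, test_z = train.copy(), test.copy()
    train_z[cols] = (train[cols] - mean) / std
    test_z[cols] = (test[cols] - mean) / std
    return train_z, test_z


# Tiny made-up frame with the same column layout as sensor.csv.
demo = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3],
    "timestamp": ["2018-04-01 00:00:00", "2018-04-01 00:01:00",
                  "2018-04-01 00:02:00", "2018-04-01 00:03:00"],
    "sensor_00": [2.0, np.nan, 4.0, 6.0],
    "sensor_01": [np.nan, np.nan, np.nan, np.nan],  # an entirely empty sensor
    "machine_status": ["NORMAL"] * 4,
})

cleaned = clean_sensor_frame(demo)
train_z, test_z = standardize(cleaned.iloc[:3], cleaned.iloc[3:])
print(cleaned)
```

Computing the mean and standard deviation from the training rows only keeps information about later readings out of the scaled training data, which is the usual meaning of data leakage in this setting.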
At a very high level, having 52 sensors is useful because it gives us 52 different measurements for each time interval, which should give the model plenty of signal to work with. With 220,320 rows, we should also have enough data to split into training and testing sets, although how well the model predicts outages will depend on how many failure events the data actually contains.
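Since the rows in the preview are one minute apart, one reasonable way to create the training and testing sets is to split by time instead of shuffling, so the model is always tested on readings that come after the ones it was trained on. The sketch below assumes a hypothetical helper `time_ordered_split` and a 20% test fraction. The `demo` frame is made up, and its non-`NORMAL` labels are placeholders (the real values can be checked with `df['machine_status'].unique()`).

```python
import pandas as pd


def time_ordered_split(df, test_frac=0.2):
    """Hypothetical helper: earliest rows for training, latest rows for testing."""
    ordered = df.sort_values("timestamp")
    n_test = int(len(ordered) * test_frac)
    cut = len(ordered) - n_test
    return ordered.iloc[:cut], ordered.iloc[cut:]


# Made-up readings; the status labels other than NORMAL are placeholders.
demo = pd.DataFrame({
    "timestamp": pd.date_range("2018-04-01", periods=10, freq="min"),
    "sensor_00": [2.4, 2.5, 2.4, 2.5, 2.4, 0.0, 0.1, 2.3, 2.4, 2.5],
    "machine_status": ["NORMAL"] * 5 + ["BROKEN", "RECOVERING"] + ["NORMAL"] * 3,
})

train, test = time_ordered_split(demo)

# Check how many rows of each status land in each split.
print(train["machine_status"].value_counts())
print(test["machine_status"].value_counts())
```

Printing the status counts per split is a quick way to see whether both sets contain failure rows at all. In this toy example the test set ends up with only `NORMAL` rows, which is exactly the kind of problem worth catching before training.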
The data also contains our target variable (`machine_status`), which is what we are trying to predict using this data set. Because the target can take only 3 different status values, this project becomes a classification problem. Hence, we can use a variety of ML models, including but not limited to KNN, Random Forest, and SVM. We will likely implement several of these models to see which one is most accurate.
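As a rough sketch of how such a comparison might be set up with scikit-learn, the example below fits the three models on synthetic three-class data from `make_classification`, which stands in for the pump data here. The hyperparameters, the 80/20 split, and the choice of macro-averaged F1 as the metric are assumptions; macro F1 is used because plain accuracy can look good even when a rare class is never predicted.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the sensor features and the 3-value status target.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5, n_classes=3, random_state=0
)
cut = 480  # first 80% of rows for training, the rest for testing
X_train, X_test = X[:cut], X[cut:]
y_train, y_test = y[:cut], y[cut:]

# KNN and SVM are distance-based, so they get a scaler fitted on training data.
models = {
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "SVM": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    scores[name] = f1_score(y_test, predictions, average="macro")

for name, score in scores.items():
    print(f"{name}: macro F1 = {score:.3f}")
```

Putting the scaler inside the pipeline means it is fitted only on the training rows each time, so the same code can later be reused with cross-validation without changes.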