Predicting Water Pump Failures In The Agriculture Sector

Summary

Water is essential for people across the world, especially in towns and remote areas where access to this resource is limited to a small number of pumps and pipes. When these pumps fail, the impact on remote communities can be devastating, and families suffer. What if we could predict when these failures are likely to happen?

Problem Statement

Based on the given sensor data, when will a water pump fail?

Motivation

Back home in India, access to water at a steady rate, even in the city, is a major problem. It is especially an issue for agriculture because of how reliant villages are on water for harvesting and food consumption. In my dad's village, water outages happen regularly, at what seem to be random times. I'd like to model this problem on a smaller scale first, in the hope of eventually helping my dad's village on a larger scale.

Reference Links:

Looking at the Data

This data set was obtained from Kaggle, and the interesting part is that I get to work with IoT sensor data. The entries are raw outputs from 52 different sensors, so I have to develop my own data pipeline to clean the data and use it to predict an outage (I consider it an interesting processing and prediction challenge).

Data Dictionary:

Unnamed: 0 (int): Unnamed row-index column, to be removed
timestamp (datetime): Date and time at which the measurement was taken
sensor_00 to sensor_51 (float): 52 series of sensor measurements, described as water flow rate (the exact quantity and units are still unknown)
machine_status (string): Status of the machine, one of NORMAL, RECOVERING, or BROKEN
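As a minimal sketch of the first cleaning step implied by this dictionary, the snippet below drops the unnamed index column and parses the timestamps. The helper name `clean_sensor_frame` is hypothetical, and indexing by timestamp is an assumption rather than a step the project has committed to. The three-row sample only mimics the column layout so the example runs without `sensor.csv`.

```python
import pandas as pd


def clean_sensor_frame(raw: pd.DataFrame) -> pd.DataFrame:
    """Drop the unnamed index column and index the rows by parsed timestamp."""
    # errors="ignore" keeps this safe if the column was already removed.
    df = raw.drop(columns=["Unnamed: 0"], errors="ignore")
    df["timestamp"] = pd.to_datetime(df["timestamp"])
    # Assumption: a sorted DatetimeIndex is convenient for later time-based steps.
    return df.set_index("timestamp").sort_index()


# Small stand-in with the same column layout as the real file (subset of columns).
sample = pd.DataFrame(
    {
        "Unnamed: 0": [0, 1, 2],
        "timestamp": [
            "2018-04-01 00:00:00",
            "2018-04-01 00:01:00",
            "2018-04-01 00:02:00",
        ],
        "sensor_00": [2.465394, 2.465394, 2.444734],
        "machine_status": ["NORMAL", "NORMAL", "NORMAL"],
    }
)

cleaned = clean_sensor_frame(sample)
print(cleaned.dtypes)
```

On the real data, the same function could be applied to the frame returned by `pd.read_csv('sensor.csv')`.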

Data Overview:

Source: https://www.kaggle.com/datasets/nphantawee/pump-sensor-data

In [1]:

import pandas as pd
df = pd.read_csv('sensor.csv')
df.head(5)

Out[1]:

	Unnamed: 0	timestamp	sensor_00	sensor_01	sensor_02	sensor_03	sensor_04	sensor_05	sensor_06	sensor_07	...	sensor_43	sensor_44	sensor_45	sensor_46	sensor_47	sensor_48	sensor_49	sensor_50	sensor_51	machine_status
0	0	2018-04-01 00:00:00	2.465394	47.09201	53.2118	46.310760	634.3750	76.45975	13.41146	16.13136	...	41.92708	39.641200	65.68287	50.92593	38.194440	157.9861	67.70834	243.0556	201.3889	NORMAL
1	1	2018-04-01 00:01:00	2.465394	47.09201	53.2118	46.310760	634.3750	76.45975	13.41146	16.13136	...	41.92708	39.641200	65.68287	50.92593	38.194440	157.9861	67.70834	243.0556	201.3889	NORMAL
2	2	2018-04-01 00:02:00	2.444734	47.35243	53.2118	46.397570	638.8889	73.54598	13.32465	16.03733	...	41.66666	39.351852	65.39352	51.21528	38.194443	155.9606	67.12963	241.3194	203.7037	NORMAL
3	3	2018-04-01 00:03:00	2.460474	47.09201	53.1684	46.397568	628.1250	76.98898	13.31742	16.24711	...	40.88541	39.062500	64.81481	51.21528	38.194440	155.9606	66.84028	240.4514	203.1250	NORMAL
4	4	2018-04-01 00:04:00	2.445718	47.13541	53.2118	46.397568	636.4583	76.58897	13.35359	16.21094	...	41.40625	38.773150	65.10416	51.79398	38.773150	158.2755	66.55093	242.1875	201.3889	NORMAL

5 rows × 55 columns

In [2]:

df.shape

Out[2]:

(220320, 55)

In [3]:

df.columns

Out[3]:

Index(['Unnamed: 0', 'timestamp', 'sensor_00', 'sensor_01', 'sensor_02',
       'sensor_03', 'sensor_04', 'sensor_05', 'sensor_06', 'sensor_07',
       'sensor_08', 'sensor_09', 'sensor_10', 'sensor_11', 'sensor_12',
       'sensor_13', 'sensor_14', 'sensor_15', 'sensor_16', 'sensor_17',
       'sensor_18', 'sensor_19', 'sensor_20', 'sensor_21', 'sensor_22',
       'sensor_23', 'sensor_24', 'sensor_25', 'sensor_26', 'sensor_27',
       'sensor_28', 'sensor_29', 'sensor_30', 'sensor_31', 'sensor_32',
       'sensor_33', 'sensor_34', 'sensor_35', 'sensor_36', 'sensor_37',
       'sensor_38', 'sensor_39', 'sensor_40', 'sensor_41', 'sensor_42',
       'sensor_43', 'sensor_44', 'sensor_45', 'sensor_46', 'sensor_47',
       'sensor_48', 'sensor_49', 'sensor_50', 'sensor_51', 'machine_status'],
      dtype='object')

In [4]:

df.describe()

Out[4]:

	Unnamed: 0	sensor_00	sensor_01	sensor_02	sensor_03	sensor_04	sensor_05	sensor_06	sensor_07	sensor_08	...	sensor_42	sensor_43	sensor_44	sensor_45	sensor_46	sensor_47	sensor_48	sensor_49	sensor_50	sensor_51
count	220320.000000	210112.000000	219951.000000	220301.000000	220301.000000	220301.000000	220301.000000	215522.000000	214869.000000	215213.000000	...	220293.000000	220293.000000	220293.000000	220293.000000	220293.000000	220293.000000	220293.000000	220293.000000	143303.000000	204937.000000
mean	110159.500000	2.372221	47.591611	50.867392	43.752481	590.673936	73.396414	13.501537	15.843152	15.200721	...	35.453455	43.879591	42.656877	43.094984	48.018585	44.340903	150.889044	57.119968	183.049260	202.699667
std	63601.049991	0.412227	3.296666	3.666820	2.418887	144.023912	17.298247	2.163736	2.201155	2.037390	...	10.259521	11.044404	11.576355	12.837520	15.641284	10.442437	82.244957	19.143598	65.258650	109.588607
min	0.000000	0.000000	0.000000	33.159720	31.640620	2.798032	0.000000	0.014468	0.000000	0.028935	...	22.135416	24.479166	25.752316	26.331018	26.331018	27.199070	26.331018	26.620370	27.488426	27.777779
25%	55079.750000	2.438831	46.310760	50.390620	42.838539	626.620400	69.976260	13.346350	15.907120	15.183740	...	32.812500	39.583330	36.747684	36.747684	40.509258	39.062500	83.912030	47.743060	167.534700	179.108800
50%	110159.500000	2.456539	48.133678	51.649300	44.227428	632.638916	75.576790	13.642940	16.167530	15.494790	...	35.156250	42.968750	40.509260	40.219910	44.849540	42.534720	138.020800	52.662040	193.865700	197.338000
75%	165239.250000	2.499826	49.479160	52.777770	45.312500	637.615723	80.912150	14.539930	16.427950	15.697340	...	36.979164	46.614580	45.138890	44.849540	51.215280	46.585650	208.333300	60.763890	219.907400	216.724500
max	220319.000000	2.549016	56.727430	56.032990	48.220490	800.000000	99.999880	22.251160	23.596640	24.348960	...	374.218800	408.593700	1000.000000	320.312500	370.370400	303.530100	561.632000	464.409700	1000.000000	1000.000000

8 rows × 53 columns

Sources of Data Problems

Because this data consists entirely of raw sensor readings, there is a lot of pre-processing work to do to standardize the values. This could result in data truncation and information loss, as some precision may be lost during conversion. Several sensors also have missing readings: for example, sensor_50 has only 143,303 non-null values out of 220,320 rows. Nonetheless, we expect to obtain relatively accurate outage predictions.
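One possible shape for this pre-processing is sketched below: measure how much of each sensor column is missing, fill the gaps, and rescale each column to zero mean and unit variance. The function names are hypothetical, and forward/backward filling is only one assumed imputation strategy, not necessarily the one this project will use. The toy frame stands in for the real sensor columns.

```python
import numpy as np
import pandas as pd


def missing_fraction(df: pd.DataFrame) -> pd.Series:
    """Fraction of missing values per column, largest first."""
    return df.isna().mean().sort_values(ascending=False)


def impute_and_standardize(df: pd.DataFrame, sensor_cols: list) -> pd.DataFrame:
    """Fill gaps in time order, then z-score each sensor column."""
    out = df.copy()
    # Assumption: carry the last reading forward, then back-fill any leading gap.
    out[sensor_cols] = out[sensor_cols].ffill().bfill()
    mean = out[sensor_cols].mean()
    # Constant columns have zero spread; divide by 1 instead of 0.
    std = out[sensor_cols].std(ddof=0).replace(0, 1.0)
    out[sensor_cols] = (out[sensor_cols] - mean) / std
    return out


# Toy data with gaps, standing in for a couple of sensor columns.
sample = pd.DataFrame(
    {
        "sensor_00": [2.0, np.nan, 4.0, 6.0],
        "sensor_01": [10.0, 10.0, np.nan, 10.0],
    }
)

missing = missing_fraction(sample)
scaled = impute_and_standardize(sample, ["sensor_00", "sensor_01"])
print(missing)
print(scaled)
```

In a real pipeline the mean and standard deviation would normally be computed on the training portion only and then reused on the test portion, so that no information from the test period influences the scaling.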

Data Analysis & Modeling Overview

At a very high level, having 52 sensors is useful because it gives us 52 different measurements for each time interval, which should give the model a rich set of features to learn from. The data set also has 220,320 rows, which should provide enough training and testing data to predict when an outage is likely to occur.
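Since the rows are ordered in time, one way to carve out training and testing data is a chronological split, where the model is tested only on readings that come after everything it was trained on. The sketch below assumes this approach and an 80/20 ratio; neither is fixed by the project yet, and `chronological_split` is a hypothetical helper. The ten-row frame is synthetic.

```python
import pandas as pd


def chronological_split(df: pd.DataFrame, train_fraction: float = 0.8):
    """Split rows into earlier (train) and later (test) parts by timestamp."""
    ordered = df.sort_values("timestamp")
    cut = int(len(ordered) * train_fraction)
    return ordered.iloc[:cut], ordered.iloc[cut:]


# Synthetic one-minute readings, shuffled to show that the split re-sorts them.
sample = pd.DataFrame(
    {
        "timestamp": pd.date_range("2018-04-01", periods=10, freq="min"),
        "sensor_00": [float(i) for i in range(10)],
    }
).sample(frac=1.0, random_state=0)

train, test = chronological_split(sample, train_fraction=0.8)
print(len(train), len(test))
print(train["timestamp"].max(), "<", test["timestamp"].min())
```

A random split would also work mechanically, but with time-series data it can let the model see readings from just before and just after each test row, which tends to make test scores look better than they would be in practice.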

The data also contains our target variable, machine_status, which is what we are trying to predict with this data set. Because the target can take only three values (NORMAL, RECOVERING, or BROKEN), this project becomes a classification problem. Hence, we can use a variety of ML models, including but not limited to KNN, Random Forest, and SVM. We will likely implement several of these models and compare them to see which is most accurate.
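A comparison of these three model families could look roughly like the sketch below, written with scikit-learn. It runs on synthetic three-class data from `make_classification`, not on the pump data, so the printed scores say nothing about the real problem. The hyperparameters and the choice of macro-averaged F1 are assumptions; macro F1 is used here because, if one status turns out to be rare, plain accuracy could look good while that class is never predicted.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the sensor features and the three status classes.
X, y = make_classification(
    n_samples=600,
    n_features=10,
    n_informative=5,
    n_classes=3,
    n_clusters_per_class=1,
    random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# KNN and SVM are distance-based, so they are wrapped with a scaler
# that is fitted on the training data only.
models = {
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "SVM": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = f1_score(y_test, model.predict(X_test), average="macro")

# Report the models from best to worst macro F1.
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: macro F1 = {score:.3f}")
```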