Air Quality Analysis

By: Aaroh Jugulum and Madhav Kapa

Description of our Real-World Problem:

The dataset we chose covers several factors that can affect air quality. Developing countries, and even some cities in developed nations, suffer from poor air conditions and could benefit from a project that predicts air quality from those factors. Air pollution has a severe impact in these countries: it accounts for roughly 6% of deaths in developing nations and has long-lasting effects such as slowing the development of children's lungs and potentially causing premature births. Poor air quality is also linked to climate change, so work on one issue can help with another that has the potential to affect the world as a whole. These figures show the severity of the situation and motivated us to look into the issue further. Given measurements such as temperature, carbon monoxide levels, and atmospheric pressure, we can look for the conditions associated with poor air quality in these places, which is a first step toward finding remedies. Using different measurements to predict air quality can help us pinpoint trends within developing countries and keep these problems from persisting and costing more lives.

Sources:

  • 5 Facts About the Effects of Pollution in Developing Countries

  • WHO Air Pollution

  • Climate Change and Air Pollution

Datasets:

  • Air Quality Data (UCI Machine Learning Repository): This dataset contains hourly measurements of air quality from an Italian city, including temperature, humidity, carbon monoxide levels, and other atmospheric gases. The data covers the period from March 2004 to February 2005 and was recorded by a multisensor device deployed at road level, together with a co-located reference analyzer. The dataset contains 9358 instances and can be used to train machine learning models to predict air quality from the factors provided.

  • Real-time Air Quality Index (AQI): This dataset provides real-time air quality data from over 10,000 monitoring stations in over 100 countries around the world. The data covers various pollutants such as PM2.5, PM10, and ozone, and it is updated hourly. The AQI data can be accessed through the AQICN website or through their API.

  • Air Quality System (AQS) EPA: The EPA AQS is a large database containing air quality measurements from over 10,000 monitoring stations across the United States. The data covers various pollutants such as ozone, particulate matter, and carbon monoxide, and it includes hourly, daily, and annual measurements. The AQS data can be accessed through the EPA's website or through the AQS Data Mart. This dataset is a valuable resource for researchers and policymakers who are working to address air quality issues in the United States.

In [1]:
# Importing Libraries
import pandas as pd
In [2]:
# Importing the dataset
dataset = pd.read_csv('AirQualityUCI.csv', sep=';')
# Dropping the last two columns
dataset = dataset.drop(['Unnamed: 15', 'Unnamed: 16'], axis=1)
# Displaying the first 5 rows of the dataset
dataset.head()
Out[2]:
   Date        Time      CO(GT)  PT08.S1(CO)  NMHC(GT)  C6H6(GT)  PT08.S2(NMHC)  NOx(GT)  PT08.S3(NOx)  NO2(GT)  PT08.S4(NO2)  PT08.S5(O3)  T     RH    AH
0  10/03/2004  18.00.00  2,6     1360.0       150.0     11,9      1046.0         166.0    1056.0        113.0    1692.0        1268.0       13,6  48,9  0,7578
1  10/03/2004  19.00.00  2       1292.0       112.0     9,4       955.0          103.0    1174.0        92.0     1559.0        972.0        13,3  47,7  0,7255
2  10/03/2004  20.00.00  2,2     1402.0       88.0      9,0       939.0          131.0    1140.0        114.0    1555.0        1074.0       11,9  54,0  0,7502
3  10/03/2004  21.00.00  2,2     1376.0       80.0      9,2       948.0          172.0    1092.0        122.0    1584.0        1203.0       11,0  60,0  0,7867
4  10/03/2004  22.00.00  1,6     1272.0       51.0      6,5       836.0          131.0    1205.0        116.0    1490.0        1110.0       11,2  59,6  0,7888
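
The preview shows values such as 2,6 and 0,7578: the file uses a comma as the decimal mark, so pandas keeps those columns as strings unless told otherwise. Below is a minimal sketch of one way to handle this, assuming the same ';'-separated layout. The inline sample is a stand-in for AirQualityUCI.csv and keeps only a few columns for brevity; dropping all-empty columns is an alternative to naming the 'Unnamed' columns explicitly.

```python
import io
import pandas as pd

# Two rows copied from the preview, restricted to a subset of columns.
# The trailing ';;' mimics the empty columns at the end of each line in the file.
sample = io.StringIO(
    "Date;Time;CO(GT);PT08.S1(CO);T;RH;AH;;\n"
    "10/03/2004;18.00.00;2,6;1360;13,6;48,9;0,7578;;\n"
    "10/03/2004;19.00.00;2;1292;13,3;47,7;0,7255;;\n"
)

# decimal=',' makes pandas parse "2,6" as the float 2.6 instead of a string.
parsed = pd.read_csv(sample, sep=';', decimal=',')

# Drop columns that contain no data at all (the 'Unnamed: N' columns).
parsed = parsed.dropna(axis=1, how='all')

print(parsed.dtypes)
```

With the real file, the same two arguments can be passed to the `pd.read_csv('AirQualityUCI.csv', ...)` call used above.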

Data Dictionary

  • Date: Date in format dd/mm/yyyy
  • Time: Time in format hh.mm.ss
  • CO(GT): True hourly averaged concentration CO in mg/m^3 (reference analyzer)
  • PT08.S1(CO): PT08.S1 (tin oxide) hourly averaged sensor response (nominally CO targeted)
  • NMHC(GT): True hourly averaged overall Non-Methane Hydrocarbons concentration in microg/m^3 (reference analyzer)
  • C6H6(GT): True hourly averaged Benzene concentration in microg/m^3 (reference analyzer)
  • PT08.S2(NMHC): PT08.S2 (titania) hourly averaged sensor response (nominally NMHC targeted)
  • NOx(GT): True hourly averaged NOx concentration in ppb (reference analyzer)
  • PT08.S3(NOx): PT08.S3 (tungsten oxide) hourly averaged sensor response (nominally NOx targeted)
  • NO2(GT): True hourly averaged NO2 concentration in microg/m^3 (reference analyzer)
  • PT08.S4(NO2): PT08.S4 (tungsten oxide) hourly averaged sensor response (nominally NO2 targeted)
  • PT08.S5(O3): PT08.S5 (indium oxide) hourly averaged sensor response (nominally O3 targeted)
  • T: Temperature in °C
  • RH: Relative Humidity (%)
  • AH: Absolute Humidity
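
Before modeling, the Date and Time columns can be combined into a single timestamp, and missing readings can be marked explicitly. The sketch below assumes the formats listed above and relies on the UCI dataset documentation, which states that missing values are tagged with -200. The small frame is illustrative only: its second CO(GT) value is a made-up -200 entry to show the replacement, and the `Timestamp` column name is our own choice.

```python
import numpy as np
import pandas as pd

# Illustrative frame; the -200.0 entry is a hypothetical missing reading.
frame = pd.DataFrame({
    "Date": ["10/03/2004", "10/03/2004"],
    "Time": ["18.00.00", "19.00.00"],
    "CO(GT)": [2.6, -200.0],
    "T": [13.6, 13.3],
})

# Combine day-first dates and dot-separated times into one datetime column.
frame["Timestamp"] = pd.to_datetime(
    frame["Date"] + " " + frame["Time"], format="%d/%m/%Y %H.%M.%S"
)

# Replace the -200 sentinel with NaN in numeric columns only.
numeric_cols = frame.select_dtypes(include="number").columns
frame[numeric_cols] = frame[numeric_cols].replace(-200, np.nan)

print(frame)
```

Marking the sentinel as NaN keeps it from being treated as a real measurement when computing averages or fitting a model.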

The air quality datasets offer a lot of information on factors that can affect air quality, including temperature, humidity, atmospheric gases, and various pollutants. By analyzing this data with a regression model, we can identify patterns and trends associated with poor air quality in different regions around the world, including developing countries. This information can inform strategies to combat air pollution and improve air quality.

The Air Quality Data from the UCI Machine Learning Repository could be used to train a regression model that predicts air quality from temperature, humidity, and various atmospheric gases. Once trained on historical data, the model can make predictions on new data and flag potential areas with poor air quality. Similarly, the Real-time Air Quality Index and the Air Quality System (AQS) EPA datasets could be used to continuously monitor air quality and feed real-time predictions from the regression model, allowing us to track changes in air quality over time and across regions. By combining and analyzing these datasets, we can gain a more comprehensive understanding of the factors that contribute to poor air quality and develop effective strategies to address it. We therefore believe the air quality data is sufficient to make progress on the real-world problem of improving air quality in developing countries using a regression model.
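
As a rough illustration of the regression idea, the sketch below fits an ordinary least squares model with NumPy on the five rows shown in the preview. The choice of CO(GT) as the target and of PT08.S1(CO) and T as features is our assumption for the example, not a decision made in this notebook, and five rows are far too few for a meaningful model.

```python
import numpy as np
import pandas as pd

# The five preview rows, with decimal commas already converted to floats.
rows = pd.DataFrame({
    "CO(GT)": [2.6, 2.0, 2.2, 2.2, 1.6],
    "PT08.S1(CO)": [1360.0, 1292.0, 1402.0, 1376.0, 1272.0],
    "T": [13.6, 13.3, 11.9, 11.0, 11.2],
})

features = ["PT08.S1(CO)", "T"]  # assumed predictors
target = "CO(GT)"                # assumed target

# Design matrix with a leading column of ones for the intercept.
X = np.column_stack([np.ones(len(rows)), rows[features].to_numpy()])
y = rows[target].to_numpy()

# Ordinary least squares: minimize ||X @ coef - y||^2.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# In-sample error only; a real evaluation needs held-out data.
predictions = X @ coef
rmse = float(np.sqrt(np.mean((y - predictions) ** 2)))

print(dict(zip(["intercept"] + features, coef)))
print("in-sample RMSE:", rmse)
```

On the full dataset, the rows would be split into training and test sets (ideally by time, since the measurements are hourly and ordered) so that the error is measured on data the model has not seen.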