Each individual student will submit a project proposal (3% of final grade) in .ipynb format which:
(1%) Describes and motivates a real-world problem where data science may provide helpful insights. Your description should be easily understood by a casual reader and include citations to motivating sources or relevant information (e.g. news articles, further reading links … Wikipedia makes for a poor reference but the links it cites are usually promising).
(1%) Explicitly load and show your dataset. Provide a data dictionary which explains the meaning of each feature present. Demonstrate that this data is sufficient to make progress on your real-world problem described above.
(1%) Write one or two sentences about how the data will be used to solve the problem. Earlier in the semester, we won’t have studied the Machine Learning methods just yet but you should have a general idea of what the ML will set out to do. For example:
“We’ll cluster the movies into sets of movies which are often watched by the same users. Doing so allows us to discover if there is a more natural grouping of movies rather than the traditional genres: horror, comedy, romantic-comedy, etc”.
Claire Pan DS2500
Many people ingest news on a daily basis, though there can be issues with detecting fake news from real news. This poses a problem because people may be taking news from untrustworthy sources, which imposes cerain ideas or political agendas. These ideas, with the help of more fake news sources and articles, are spread further, and can end up clouding the judgements of individuals. For this reason, it is important that we are able to detect fraud news sources that are writing articles about fake stories.
To read more on the dangers of fake news, see this link: https://www.cits.ucsb.edu/fake-news/danger-social
For this project, I have decided to use a dataset that I found on kaggle. There is a 'True.csv' and a 'Fake.csv', and combining these 2 csv files will give us the dataset that we will use for the project.
import pandas as pd
true = pd.read_csv('True.csv')
true
title | text | subject | date | |
---|---|---|---|---|
0 | As U.S. budget fight looms, Republicans flip t... | WASHINGTON (Reuters) - The head of a conservat... | politicsNews | December 31, 2017 |
1 | U.S. military to accept transgender recruits o... | WASHINGTON (Reuters) - Transgender people will... | politicsNews | December 29, 2017 |
2 | Senior U.S. Republican senator: 'Let Mr. Muell... | WASHINGTON (Reuters) - The special counsel inv... | politicsNews | December 31, 2017 |
3 | FBI Russia probe helped by Australian diplomat... | WASHINGTON (Reuters) - Trump campaign adviser ... | politicsNews | December 30, 2017 |
4 | Trump wants Postal Service to charge 'much mor... | SEATTLE/WASHINGTON (Reuters) - President Donal... | politicsNews | December 29, 2017 |
... | ... | ... | ... | ... |
21412 | 'Fully committed' NATO backs new U.S. approach... | BRUSSELS (Reuters) - NATO allies on Tuesday we... | worldnews | August 22, 2017 |
21413 | LexisNexis withdrew two products from Chinese ... | LONDON (Reuters) - LexisNexis, a provider of l... | worldnews | August 22, 2017 |
21414 | Minsk cultural hub becomes haven from authorities | MINSK (Reuters) - In the shadow of disused Sov... | worldnews | August 22, 2017 |
21415 | Vatican upbeat on possibility of Pope Francis ... | MOSCOW (Reuters) - Vatican Secretary of State ... | worldnews | August 22, 2017 |
21416 | Indonesia to buy $1.14 billion worth of Russia... | JAKARTA (Reuters) - Indonesia will buy 11 Sukh... | worldnews | August 22, 2017 |
21417 rows × 4 columns
fake = pd.read_csv('Fake.csv')
fake
title | text | subject | date | |
---|---|---|---|---|
0 | Donald Trump Sends Out Embarrassing New Year’... | Donald Trump just couldn t wish all Americans ... | News | December 31, 2017 |
1 | Drunk Bragging Trump Staffer Started Russian ... | House Intelligence Committee Chairman Devin Nu... | News | December 31, 2017 |
2 | Sheriff David Clarke Becomes An Internet Joke... | On Friday, it was revealed that former Milwauk... | News | December 30, 2017 |
3 | Trump Is So Obsessed He Even Has Obama’s Name... | On Christmas day, Donald Trump announced that ... | News | December 29, 2017 |
4 | Pope Francis Just Called Out Donald Trump Dur... | Pope Francis used his annual Christmas Day mes... | News | December 25, 2017 |
... | ... | ... | ... | ... |
23476 | McPain: John McCain Furious That Iran Treated ... | 21st Century Wire says As 21WIRE reported earl... | Middle-east | January 16, 2016 |
23477 | JUSTICE? Yahoo Settles E-mail Privacy Class-ac... | 21st Century Wire says It s a familiar theme. ... | Middle-east | January 16, 2016 |
23478 | Sunnistan: US and Allied ‘Safe Zone’ Plan to T... | Patrick Henningsen 21st Century WireRemember ... | Middle-east | January 15, 2016 |
23479 | How to Blow $700 Million: Al Jazeera America F... | 21st Century Wire says Al Jazeera America will... | Middle-east | January 14, 2016 |
23480 | 10 U.S. Navy Sailors Held by Iranian Military ... | 21st Century Wire says As 21WIRE predicted in ... | Middle-east | January 12, 2016 |
23481 rows × 4 columns
news = true.merge(fake, how='outer')
news
title | text | subject | date | |
---|---|---|---|---|
0 | As U.S. budget fight looms, Republicans flip t... | WASHINGTON (Reuters) - The head of a conservat... | politicsNews | December 31, 2017 |
1 | U.S. military to accept transgender recruits o... | WASHINGTON (Reuters) - Transgender people will... | politicsNews | December 29, 2017 |
2 | Senior U.S. Republican senator: 'Let Mr. Muell... | WASHINGTON (Reuters) - The special counsel inv... | politicsNews | December 31, 2017 |
3 | FBI Russia probe helped by Australian diplomat... | WASHINGTON (Reuters) - Trump campaign adviser ... | politicsNews | December 30, 2017 |
4 | Trump wants Postal Service to charge 'much mor... | SEATTLE/WASHINGTON (Reuters) - President Donal... | politicsNews | December 29, 2017 |
... | ... | ... | ... | ... |
44893 | McPain: John McCain Furious That Iran Treated ... | 21st Century Wire says As 21WIRE reported earl... | Middle-east | January 16, 2016 |
44894 | JUSTICE? Yahoo Settles E-mail Privacy Class-ac... | 21st Century Wire says It s a familiar theme. ... | Middle-east | January 16, 2016 |
44895 | Sunnistan: US and Allied ‘Safe Zone’ Plan to T... | Patrick Henningsen 21st Century WireRemember ... | Middle-east | January 15, 2016 |
44896 | How to Blow $700 Million: Al Jazeera America F... | 21st Century Wire says Al Jazeera America will... | Middle-east | January 14, 2016 |
44897 | 10 U.S. Navy Sailors Held by Iranian Military ... | 21st Century Wire says As 21WIRE predicted in ... | Middle-east | January 12, 2016 |
44898 rows × 4 columns
“We’ll cluster the movies into sets of movies which are often watched by the same users. Doing so allows us to discover if there is a more natural grouping of movies rather than the traditional genres: horror, comedy, romantic-comedy, etc”.
For the project, I propose that we cluster the news into 'Real' and 'Fake' based on just the text that is in the articles. Doing this allows us to analyze types of articles, titles, and subjects that pertain to fake news; we could potentially also see what types of key words are in these articles.