Can tweets help data scientists beat the market?

Predicting a stock price is notoriously difficult.

Many factors contribute to the difficulty, including time, market sentiment, future cash flows, and earnings growth. I'm sure there are various metrics that can each bring you closer to a fair price.

But if it were easy to figure out, we would all be rich, and we can't all be rich, right? Or can we?

The scope of the project: can we use tweets to determine whether the price of a stock will go up or down tomorrow?

While doing research, I found that there are ways to estimate sentiment from tweets. This is helpful because you can add up all of the sentiment scores for a day and check whether the price went up or down that day. Source: https://towardsdatascience.com/can-we-beat-the-stock-market-using-twitter-ef8465fd12e2
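A minimal sketch of the daily sum, assuming each tweet already has a numeric sentiment score. The `sentiment` column, the sample values, and the `daily_sentiment` helper are hypothetical; `post_date` is assumed to hold epoch seconds.

```python
import pandas as pd

def daily_sentiment(tweets: pd.DataFrame) -> pd.Series:
    """Sum per-tweet sentiment scores for each calendar day (UTC)."""
    # post_date is assumed to be seconds since the Unix epoch
    day = pd.to_datetime(tweets["post_date"], unit="s").dt.date
    return tweets.groupby(day)["sentiment"].sum()

# Made-up scores for three tweets: two on 2015-01-01, one on 2015-01-02
sample = pd.DataFrame({
    "post_date": [1420070457, 1420070496, 1420156800],
    "sentiment": [0.5, -0.2, 0.4],
})
daily = daily_sentiment(sample)
print(daily)
```

The resulting series has one row per day, which can then be lined up against that day's price move.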

The data set below contains tweets from 2015 to 2020 that mention the following stocks: Apple, Google Inc (which appears under two tickers), Amazon.com, Tesla Inc, and Microsoft.

In [24]:
import pandas as pd

df_main = pd.read_csv("/Users/Emre/Desktop/DS2501/Tweet.csv")
In [25]:
df_company = pd.read_csv("/Users/Emre/Desktop/DS2501/Company_Tweet.csv")

Lots of data

Over 3 million tweets

Description: tweet_id (int): unique identifier for a tweet, used to match a ticker with a tweet | ticker_symbol (str): ticker symbol of the company

In [29]:
df_company.head()
Out[29]:
tweet_id ticker_symbol
0 550803612197457920 AAPL
1 550803610825928706 AAPL
2 550803225113157632 AAPL
3 550802957370159104 AAPL
4 550802855129382912 AAPL
In [28]:
df_company.shape
Out[28]:
(4336445, 2)

Description: tweet_id (int): unique identifier for a tweet, used to match a ticker with a tweet | writer: account that posted the tweet | post_date (int): post time in epoch seconds | body: text of the tweet | comment_num: number of comments | retweet_num: number of retweets | like_num: number of likes

In [15]:
df_main.head()
Out[15]:
tweet_id writer post_date body comment_num retweet_num like_num
0 550441509175443456 VisualStockRSRC 1420070457 lx21 made $10,008 on $AAPL -Check it out! htt... 0 0 1
1 550441672312512512 KeralaGuy77 1420070496 Insanity of today weirdo massive selling. $aap... 0 0 0
2 550441732014223360 DozenStocks 1420070510 S&P100 #Stocks Performance $HD $LOW $SBUX $TGT... 0 0 0
3 550442977802207232 ShowDreamCar 1420070807 $GM $TSLA: Volkswagen Pushes 2014 Record Recal... 0 0 1
4 550443807834402816 i_Know_First 1420071005 Swing Trading: Up To 8.91% Return In 14 Days h... 0 0 1
In [30]:
df_main.shape
Out[30]:
(3717964, 7)

How the data will be used to solve the problem: first, the data sets need some manipulation. Each tweet has to be matched with its ticker, and we will need the stock price for the day of every tweet.
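The manipulation could look roughly like the sketch below. The small frames stand in for `df_main` and `df_company`; the `prices` table, its `date` and `close` columns, and all sample values are assumptions, since price data is not part of the tweet data set.

```python
import pandas as pd

# Tiny stand-ins with the same columns as df_main and df_company
tweets = pd.DataFrame({
    "tweet_id": [1, 2],
    "post_date": [1420070457, 1420156800],  # epoch seconds
    "body": ["$AAPL up", "$GM $TSLA recall"],
})
companies = pd.DataFrame({
    "tweet_id": [1, 2],
    "ticker_symbol": ["AAPL", "TSLA"],
})
# Hypothetical daily closing prices (made-up values)
prices = pd.DataFrame({
    "ticker_symbol": ["AAPL"],
    "date": ["2015-01-01"],
    "close": [100.0],
})

# One row per (tweet, ticker) pair
merged = tweets.merge(companies, on="tweet_id", how="inner")
# Reduce the timestamp to a YYYY-MM-DD string so it lines up with daily prices
merged["date"] = pd.to_datetime(merged["post_date"], unit="s").dt.strftime("%Y-%m-%d")
# A left join keeps tweets from days that have no price row (close becomes NaN)
with_price = merged.merge(prices, on=["ticker_symbol", "date"], how="left")
print(with_price[["tweet_id", "ticker_symbol", "date", "close"]])
```

Tweets posted on days without a price row, such as weekends, end up with a missing `close` and would need to be dropped or mapped to the next trading day.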

Additionally, VADER (Valence Aware Dictionary and sEntiment Reasoner) will be useful for determining the sentiment of every tweet, but we must assign a score to each of the more than 3 million data points.
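A rough sketch of the scoring step, assuming the third-party `vaderSentiment` package is installed. The helper names `make_vader_scorer` and `score_bodies` are made up for illustration; the scorer returns VADER's `compound` value, which ranges from -1 (most negative) to 1 (most positive).

```python
def make_vader_scorer():
    """Return a function mapping text to VADER's compound score."""
    # Imported here so the rest of the cell runs without the package
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
    analyzer = SentimentIntensityAnalyzer()  # build once, reuse for every tweet
    return lambda text: analyzer.polarity_scores(text)["compound"]

def score_bodies(bodies, scorer):
    """Apply a text -> float scorer to every tweet body."""
    return [scorer(text) for text in bodies]

# Usage on the real data (assumes df_main is loaded as above):
# df_main["sentiment"] = score_bodies(df_main["body"].astype(str), make_vader_scorer())
```

Creating the analyzer once and reusing it matters at this scale, since the lexicon would otherwise be reloaded for each of the millions of tweets.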

Next, I was thinking of using a logistic regression to determine whether or not we should buy today.
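To make the idea concrete, here is a small self-contained sketch of a one-feature logistic regression fitted by batch gradient descent. It is hand-rolled only to avoid extra dependencies; a library such as scikit-learn would be the more usual choice. The feature (daily summed sentiment), the labels, and every number below are made up.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.5, epochs=2000):
    """Fit p(up) = sigmoid(w * x + b) by batch gradient descent on log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the mean log loss with respect to w and b
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum((sigmoid(w * x + b) - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Made-up daily sentiment sums and next-day labels (1 = price went up)
sentiment = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
went_up = [0, 0, 1, 0, 1, 1]

w, b = fit_logistic(sentiment, went_up)
# Estimated probability that the price rises after a day with sentiment 1.5
prob_up = sigmoid(w * 1.5 + b)
print(w, b, prob_up)
```

A simple decision rule would be to buy only when `prob_up` exceeds some threshold such as 0.5, though on real data the model would need a held-out test period before trusting it.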