Can tweets help data scientists beat the market?¶

Predicting a stock price is a problem that is notoriously difficult to solve.¶

There are many factors that make this difficult, but some include: time, market sentiment, future cashflows or growth in earning. Im sure there are various metrics that can all get you closer to finding out a value to give a fair price.

But if it were easy to figure out we would all be rich, and we cant all be rich right........? or can we?

The scope of the project is, Can we use Tweets to determain wether the price of a stock will go up or down tomorrow?¶

In doing reasearch I found that there is a way to accurately determain sentiment from tweets. This will be helpful as you can add up all of the sentiment scores for a day and see if the price goes up or down that day maybe! https://towardsdatascience.com/can-we-beat-the-stock-market-using-twitter-ef8465fd12e2

The data set below contains tweets that mention the following stocks from 2015 - 2020: apple , Google Inc , Google Inc , Amazon.com , Tesla Inc and Microsoft

In [ ]:

In [24]:

import pandas as pd

df_main = pd.read_csv("/Users/Emre/Desktop/DS2501/Tweet.csv")

In [25]:

df_company = pd.read_csv("/Users/Emre/Desktop/DS2501/Company_Tweet.csv")

Lots of Data¶

over 3 million tweets¶

Description: Tweet_id (int): unique idenitfier for a tweet to match ticker with tweet|ticker_symbol(str): ticker symbol of company

In [29]:

df_company.head()

Out[29]:

	tweet_id	ticker_symbol
0	550803612197457920	AAPL
1	550803610825928706	AAPL
2	550803225113157632	AAPL
3	550802957370159104	AAPL
4	550802855129382912	AAPL

In [28]:

df_company.shape

Out[28]:

(4336445, 2)

In [15]:

df_main.head()

Out[15]:

	tweet_id	writer	post_date	body	like_num
0	550441509175443456	VisualStockRSRC	1420070457	lx21 made $10,008 on $AAPL -Check it out! htt...	1
1	550441672312512512	KeralaGuy77	1420070496	Insanity of today weirdo massive selling. $aap...	0
2	550441732014223360	DozenStocks	1420070510	S&P100 #Stocks Performance $HD $LOW $SBUX $TGT...	0
3	550442977802207232	ShowDreamCar	1420070807	$GM $TSLA: Volkswagen Pushes 2014 Record Recal...	1
4	550443807834402816	i_Know_First	1420071005	Swing Trading: Up To 8.91% Return In 14 Days h...	1

In [30]:

df_main.shape

Out[30]:

(3717964, 7)

Write one or two sentences about how the data will be used to solve the problem. First there will have to be some data set manipulation. we will need to get the price for all of the tweets.

Additionally, VADER (Valence Aware Dictionary and Sentiment Reasoner) will be useful in determaining the sentiment of every tweet, but we must assaign a score to all 3 million data points

Next, I was thinking of using a logistic regression to determain wether or not we should buy today.