Analyzing S&P500 Stock Sectors¶

I would like to analyze the various sectors within the S&P500 over time. This could be used to get a better idea of how markets and stocks are moving, and could be used by stock traders to make more efficient investment decisions. By having a better understanding of financial metrics of companies within sectors and how that impact stock price historically, they would be better suited to make investment decisions going forward.

Here is a report from S&P Global analyzing the relationship between sector and S&P500 returns: https://www.spglobal.com/spdji/en/documents/research/research-sector-effects-in-the-sp-500.pdf I would like to get more granular and examine how similar (financially) the stocks within each are and to examine which financial metrics exhibit the strongest relationship with stock price change.

Here is article from Investopedia about which metrics are the best for valuing a stock: https://www.investopedia.com/articles/fundamental-analysis/09/five-must-have-metrics-value-investors.asp A quick google search will yield hundreds of articles (this is one of the better ones) telling you how to accurately value a stock. However, most of the articles I clicked on were 'get rich quick' schemes and very few seemed to cite actual research or analysis. By analyzing the S&P500 we can get a better idea of which metrics were, historically, the most related to future stock movement and how those metrics vary by sector.

Data¶

Firstly, I would have to get a list of all the stocks in the S&P500 as well as their respective sectors. Since the composition of the S&P500 changes over time, it would make sense to remove stocks that weren't in the S&P500 for the selected analysis period. Since the composition of the S&P500 tends to stay roughly the same, this could be achieved manually with relative ease. Alternatively, we could find the composition of the S&P500 at the start of the analysis period and then track the movement of those stocks throughout the period, regardless of whether or not they stay in the index. For now, I have only linked data for the current S&P500, but historical data is incredibly easy to find and is listed on the S&P500 website.

In [1]:
#Import pandas
import pandas as pd

#Get data for stocks and sectors
#Given that is is a relatively small dataset, it would probably be more efficient to just download the csv for the project
url="https://datahub.io/core/s-and-p-500-companies-financials/r/constituents.csv"
c=pd.read_csv(url)
c.head()
Out[1]:
Symbol Name Sector
0 MMM 3M Company Industrials
1 AOS A.O. Smith Corp Industrials
2 ABT Abbott Laboratories Health Care
3 ABBV AbbVie Inc. Health Care
4 ACN Accenture plc Information Technology

Columns: 3 [Symbol, Name, Sector] Rows: 505

Each row contains the stock ticker (Symbol), company name (Name), and respective sector(Sector) for a company in the S&P500. There are 505 rows, because several companies in the S&P500 have multiple classes of shares.

Next I would like to get historical financial data and stock movement for each of these companies. This used to be really easy, but Google discontinued their API in 2018. Another solution would be scraping data from Yahoo Finance using the yfinance library.

In [2]:
#Import yfinance
import yfinance as yf
#Get historic stock data
aapl = yf.Ticker("AAPL")
stock_data = aapl.history()
stock_data.head()
Out[2]:
Open High Low Close Volume Dividends Stock Splits
Date
2023-01-30 00:00:00-05:00 144.739016 145.328113 142.632232 142.781998 64015300 0.0 0.0
2023-01-31 00:00:00-05:00 142.482450 144.119949 142.063092 144.070023 65874500 0.0 0.0
2023-02-01 00:00:00-05:00 143.750517 146.386491 141.104563 145.208282 77663600 0.0 0.0
2023-02-02 00:00:00-05:00 148.673001 150.949524 147.944119 150.590088 118339000 0.0 0.0
2023-02-03 00:00:00-05:00 147.804328 157.140080 147.604636 154.264465 154279900 0.0 0.0

If it works properly this should show the Date (as the data index) as well as the corresponding Open, High, Low, Close, Adj. Close, Volume (how many shares were traded in the day), Dividends, and Stock Splits. The most important column is Adj.Close as it takes stock splits and dividends into account to create a historically comparable

In [3]:
#Get historic income statement data
financial_data = aapl.income_stmt
financial_data
https://s.yimg.com/uc/finance/dd-site/js/main.00a1b2c52cf830cd64e7.min.js
WARNING: No decryption keys could be extracted from JS file. Falling back to backup decrypt methods.
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_64996/2499431118.py in <module>
      1 #Get historic income statement data
----> 2 financial_data = aapl.income_stmt
      3 financial_data

~\anaconda3\lib\site-packages\yfinance\ticker.py in income_stmt(self)
    156     @property
    157     def income_stmt(self) -> _pd.DataFrame:
--> 158         return self.get_income_stmt(pretty=True)
    159 
    160     @property

~\anaconda3\lib\site-packages\yfinance\base.py in get_income_stmt(self, proxy, as_dict, pretty, freq, legacy)
   1567             data = self._fundamentals.financials.get_income_scrape(freq=freq, proxy=proxy)
   1568         else:
-> 1569             data = self._fundamentals.financials.get_income_time_series(freq=freq, proxy=proxy)
   1570 
   1571         if pretty:

~\anaconda3\lib\site-packages\yfinance\scrapers\fundamentals.py in get_income_time_series(self, freq, proxy)
    111         res = self._income_time_series
    112         if freq not in res:
--> 113             res[freq] = self._fetch_time_series("income", freq, proxy=None)
    114         return res[freq]
    115 

~\anaconda3\lib\site-packages\yfinance\scrapers\fundamentals.py in _fetch_time_series(self, name, timescale, proxy)
    140 
    141         try:
--> 142             statement = self._create_financials_table(name, timescale, proxy)
    143 
    144             if statement is not None:

~\anaconda3\lib\site-packages\yfinance\scrapers\fundamentals.py in _create_financials_table(self, name, timescale, proxy)
    153             name = "financials"
    154 
--> 155         keys = self._get_datastore_keys(name, proxy)
    156         try:
    157             return self.get_financials_time_series(timescale, keys, proxy)

~\anaconda3\lib\site-packages\yfinance\scrapers\fundamentals.py in _get_datastore_keys(self, sub_page, proxy)
    160 
    161     def _get_datastore_keys(self, sub_page, proxy) -> list:
--> 162         data_stores = self._data.get_json_data_stores(sub_page, proxy)
    163 
    164         # Step 1: get the keys:

~\anaconda3\lib\site-packages\yfinance\data.py in wrapped(*args, **kwargs)
     39         args = tuple([tuple(arg) if isinstance(arg, list) else arg for arg in args])
     40         kwargs = {k: tuple(v) if isinstance(v, list) else v for k, v in kwargs.items()}
---> 41         return func(*args, **kwargs)
     42 
     43     # copy over the lru_cache extra methods to this wrapper to be able to access them

~\anaconda3\lib\site-packages\yfinance\data.py in get_json_data_stores(self, sub_page, proxy)
    310 
    311         # Decrypt!
--> 312         stores = decrypt_cryptojs_aes_stores(data, keys)
    313         if stores is None:
    314             # Maybe Yahoo returned old format, not encrypted

~\anaconda3\lib\site-packages\yfinance\data.py in decrypt_cryptojs_aes_stores(data, keys)
    160                 pass
    161         if not success:
--> 162             raise Exception("yfinance failed to decrypt Yahoo data response")
    163 
    164     decoded_stores = json.loads(plaintext)

Exception: yfinance failed to decrypt Yahoo data response

This is supposed to pull data from the selected company's income statement. Namely it's revenue, various expenses, and profit. This feature is currently broken (it was working a few days ago), so for now this isn't a good data source.

Unfortunately, yfinance and Yahoo Finance are fighting, with Yahoo Finance trying to actively inhibit web scrapers like yfinance. The library is actively managed and they've been pretty good about updating the decrypt keys, but it is not super reliable as of yet. This would be by far the easiest way to get financial and stock data, but given that this is a relatively recent problem I'm unsure if this is a viable method. As of now using the fundamentalanalysis website API works well and while it only lets you pull data for one stock at a time, we can just loop through all the tickers in the list and store them.

In [4]:
#Import funamentalanalysis to access the fundamental analysis website api
import fundamentalanalysis as fa

#Establish ticker and api_key
#Loop through all tickers in S&P500 to get all data
ticker = "AAPL"
api_key = "6eb8c70712a889b4f7efbf8d8ad902d4"

#Get historical stock price data
stock_data = fa.stock_data(ticker, period="10y", interval="1mo")
stock_data.head()
Out[4]:
low volume close open high adjclose
2013-03-01 14.964286 9176876800 15.809286 15.642857 16.783930 13.654180
2013-04-01 13.753571 10963489600 15.813571 15.782143 15.901786 13.657886
2013-05-01 14.960714 9447530400 16.061787 15.873571 16.633928 13.872261
2013-06-01 13.888214 7018536000 14.161786 16.097500 16.229643 12.312227
2013-07-01 14.329286 6538114800 16.161785 14.381786 16.333570 14.051022

Columns: 6 ['low', 'volume', 'close', 'open', 'high', 'adjclose'] Rows: 121

This gets the stock data of a company for each month. The total time period and the frequency of the data can be changed, but for now it is set to monthly stock data for 10 years which leads to 121 rows of data. The data includes the stocks high, low, open and close price for that month. It also has share volume and the adjclose which would be the most important metric for analysis. To make the analysis more data efficient we can only call for the "adjclose" column since the others are unlikely to be relevant.

In [5]:
#Get historical financial ratios and metrics
financial_metrics = fa.key_metrics(ticker, api_key, period="annual").transpose()
financial_metrics.head()
Out[5]:
period revenuePerShare netIncomePerShare operatingCashFlowPerShare freeCashFlowPerShare cashPerShare bookValuePerShare tangibleBookValuePerShare shareholdersEquityPerShare interestDebtPerShare ... averagePayables averageInventory daysSalesOutstanding daysPayablesOutstanding daysOfInventoryOnHand receivablesTurnover payablesTurnover inventoryTurnover roe capexPerShare
2022 FY 24.317273 6.154614 7.532763 6.872426 2.978793 3.124822 3.124822 3.124822 7.585118 ... 59439000000 5763000000 56.400205 104.685277 8.075698 6.471608 3.486641 45.197331 1.969589 -0.660337
2021 FY 21.903541 5.669029 6.229346 5.565624 3.750553 3.777557 3.777557 3.777557 7.626006 ... 48529500000 5320500000 51.390969 93.851071 11.276593 7.102415 3.88914 32.367933 1.500713 -0.663722
2020 FY 15.820258 3.308587 4.64923 4.228014 5.241031 3.765477 3.765477 3.765477 6.64524 ... 44266000000 4083500000 49.787534 91.04819 8.741883 7.331152 4.008866 41.753016 0.878664 -0.421217
2019 FY 14.085283 2.991446 3.756685 3.188508 5.443948 4.898834 4.898834 4.898834 6.043039 ... 51062000000 4031000000 64.258765 104.314077 9.263639 5.680159 3.499048 39.401364 0.610645 -0.568178
2018 FY 13.399334 3.003354 3.906565 3.23492 3.344902 5.405593 5.405593 5.405593 5.939155 ... 52468500000 4405500000 67.332499 124.570214 8.817631 5.420859 2.930074 41.394338 0.555601 -0.671644

5 rows × 58 columns

Index: Year (representing the fiscal year of the companies financial fillings) Columns: 58 ['period', 'revenuePerShare', 'netIncomePerShare', 'operatingCashFlowPerShare', 'freeCashFlowPerShare', 'cashPerShare', 'bookValuePerShare', 'tangibleBookValuePerShare', 'shareholdersEquityPerShare', 'interestDebtPerShare', 'marketCap', 'enterpriseValue', 'peRatio', 'priceToSalesRatio', 'pocfratio', 'pfcfRatio', 'pbRatio', 'ptbRatio', 'evToSales', 'enterpriseValueOverEBITDA', 'evToOperatingCashFlow', 'evToFreeCashFlow', 'earningsYield', 'freeCashFlowYield', 'debtToEquity', 'debtToAssets', 'netDebtToEBITDA', 'currentRatio', 'interestCoverage', 'incomeQuality', 'dividendYield', 'payoutRatio', 'salesGeneralAndAdministrativeToRevenue', 'researchAndDdevelopementToRevenue', 'intangiblesToTotalAssets', 'capexToOperatingCashFlow', 'capexToRevenue', 'capexToDepreciation', 'stockBasedCompensationToRevenue', 'grahamNumber', 'roic', 'returnOnTangibleAssets', 'grahamNetNet', 'workingCapital', 'tangibleAssetValue', 'netCurrentAssetValue', 'investedCapital', 'averageReceivables', 'averagePayables', 'averageInventory', 'daysSalesOutstanding', 'daysPayablesOutstanding', 'daysOfInventoryOnHand', 'receivablesTurnover', 'payablesTurnover', 'inventoryTurnover', 'roe', 'capexPerShare'] Rows: 38

This represents the financial profile of a company in terms of its 58 key financial metrics. Over the course of the analysis we are like to remove some columns that do not add value to our analysis. Additionally, we would only pull data for the relevant analysis year rather than every year that is recorded (in this case 1985-2022)

Machine Learning¶

One thing we could do is use agglomerative clustering to group companies by their financial metrics. We could then compare these groups to the sector groups defined by the S&P500. If there is a high level of similarity, it would show that companies in the same sector tend to have similar financial profiles. This would be important to know, because it would allow a stock trader to better understand and identify the financial metrics of a particular sector. Alternatively, if there is no link between metrics and sectors, it would show that serving similar end markets does not mean that companies have the same business model and financial make-up. Additionally, I could take financial data from a base year, say 2015, and then train a model using the metrics as the x variable and stock data from say 2016-2020 as the y variable. Then could apply feature selection to identify which metrics are the best indicator for a stocks price change over the following years and if those features vary by sector. Finally, we could test the finished model by feeding in metrics from 2020, forecasting stock movement, and then testing it against actual movement from 2021-2023.