1. Final Project - %Change in Stock Price
(Technology Service Industry) Analysis
Name: Natsarankorn Kijtorntham
Packages
In [1]:
%matplotlib inline
from datetime import datetime
import time
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import requests
from bs4 import BeautifulSoup as bs
from scipy.stats import pearsonr
from patsy import dmatrix
from patsy import dmatrices
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflat
ion_factor
import itertools
import random
import warnings
warnings.filterwarnings('ignore')
2. 1. Introduction
What is the importance of your data set?
This project scraped the data from finance.yahoo.com. The main data set was the
companies in the Technology Services sector. This analysis aims to predict the
%Change (dependent variable) of the stock price, with the OLS model. The features
are the statistical key of each stock. For example, Market Cap, P/E Ratio, Price/Sales,
Enterprise Value, ROE, ROA, etc. Since there're a lot of indices to the stock price (high
dimensionality), the significant level of independent variables as well as the subset
selection method would preliminarily filter out unnecessary variables.
Which question(s) can it help us understand?
Which indices (variables) are statistically important to the Change in percentage
of stock price in the Technology industry?
What is the magnitude of each variable against Change of stock price in the
Technology Services industry?
====================================================================
2. Data Scraping
Where and how are you getting the data?
The data set is tocks from the Technology Services Sector of Yahoo Finance
(https://finance.yahoo.com/screener/predefined/ms_technology). There're
approximately 390 companies in this data set.
Scraping Steps
Part 1:
Scrape the main dataframe df_1 of all companies containing 'Symbol',
'Name of the company', 'Price', 'Change', '% Change', 'Volume', 'Avg
Vol', 'Market Cap', 'PE Ratio'.
Part 2:
3. Get the 'href' link for each company.
Get the tables from the 'Statistical Keys' page using the modified 'href'
links, e.g. AAPL (https://finance.yahoo.com/quote/AAPL/key-statistics?
p=AAPL), MSFT (https://finance.yahoo.com/quote/MSFT/key-statistics?
p=MSFT), TSM (https://finance.yahoo.com/quote/TSM/key-statistics?
p=TSM).
Run comprehensive for loops to build another data frame, df_2 , from
the second part.
Joining two data frames ( df_1 , and df_2 ) using an index, 'Symbol'.
What data are available?
The whole data contains approximately 190 observations after omitting NAs, and
20 variables to run the full OLS model.
The dependent variable is Change which is the percentage of the stock price of
that day (Change (USD)/Price (Intraday)).
The independent variables are Price, Mkt_Cap, PEG_Ratio, PE_Ratio, PriceSales,
PriceBook, EV, EVRevenue, Payout_Ratio, Profit_Margin, Operating_Margin, ROA,
ROE, Revenue, RevenueShare, Gross_Profit, EBITDA, NItoCommon, Diluted_EPS,
Earnings_Growth_Q, Health.
EV/EBITDA is transformed into Health , a binary variable
For a company who has EV/EBITDA > industrial average, it will be indicated
as 1 = 'healthy' company
For a company who has EV/EBITDA < industrial average, it will be indicated
as 0 = 'unhealthy' company
What relationships do you expect to see in the data?
The expected relationships are both positives and negatives as shown below:
Independent Variables E(Relationships)
Market Cap +
PEG Ratio -
P/E Ratio +
Price/Sales -
Price/Book +
EV +
4. EV/Revenue -
Payout Ratio(%) +
Profit Margin(%) +
Operating Margin(%) +
ROA +
ROE +
Revenue +
Revenue/Share +
Gross Profit(%) +
EBITDA +
Net Income to Common +
Diluted EPS +
Quarterly Earnings Growth (yoy) +
Health (healthy) +
Steps:
In [2]:
# Time of data scraping
now = datetime.now()
dt = now.strftime("%d/%m/%Y %H:%M:%S")
print("This data was scraped on", dt)
# This data was scraped on 23/11/2019 22:25:07
PART 1
Getting a main dataframe (df_1)
This data was scraped on 23/11/2019 22:25:07
5. In [3]:
# Scraping the main dataframe (df_1) by setting the parameters f
or 100 counts each page with approximately 390 rows.
url = 'https://finance.yahoo.com/screener/predefined/ms_technolo
gy'
rows = np.arange(0,301,100).tolist()
# rows = [100,200,300]
url_list = []
tech_df = []
for i in rows:
r = requests.get(url, params = {'count' : '100', 'offset' :
i})
link = r.url
url_list.append(link)
for link in url_list:
df = pd.read_html(link)
tb = df[0]
tech_df.append(tb)
df_1 = pd.concat(tech_df)
In [4]:
# Setting 'Symbol' as an index
df = df_1.set_index('Symbol')
df.to_csv('df_1.csv')
In [5]:
url_list
Out[5]:
['https://finance.yahoo.com/screener/predefined/ms_t
echnology?count=100&offset=0',
'https://finance.yahoo.com/screener/predefined/ms_t
echnology?count=100&offset=100',
'https://finance.yahoo.com/screener/predefined/ms_t
echnology?count=100&offset=200',
'https://finance.yahoo.com/screener/predefined/ms_t
echnology?count=100&offset=300']
6. In [6]:
df.head()
In [7]:
# The dimension for the main dataframe
df.shape
PART 2
Scraping from Key-Statistics page
Out[6]:
Name
Price
(Intraday)
Change
%
Change
Volume
Avg Vol
(3
month)
Symbol
AAPL Apple Inc. 261.78 -0.23 -0.09% 16.331M 25.857M
MSFT
Microsoft
Corporation
149.59 0.11 +0.07% 15.842M 22.825M
TSM
Taiwan
Semiconductor
Manufacturing
Company
Lim...
52.79 -0.19 -0.36% 4.103M 6.848M
INTC
Intel
Corporation
57.61 -0.61 -1.05% 15.69M 18.498M
CSCO
Cisco
Systems, Inc.
44.85 0.01 +0.02% 16.516M 19.124M
Out[7]:
(393, 9)
7. In [8]:
# get thae 'a' tag from web element
table = []
tag = []
# get text from the main page
for url in url_list:
txt = requests.get(url).text
soup = bs(txt)
t = soup.find('div', {'id':'scr-res-table'})
table.append(t)
for i in range(0,4):
t = table[i].find_all('a')
tag.append(t)
In [9]:
# Get the href link for key statistics page of each ticker to ex
tract table
link = []
for e in range(0,4):
for i in tag[e]:
l = 'https://finance.yahoo.com'+i.get('href')
l_kstat = l.split('?')[0]+'/key-statistics?'+l.split('?'
)[1]
link.append(l_kstat)
Note
Some HTML links don't work (404) when running the code for some specific time
(around closing time of the stock market). So, the code chunk below will prevent the
error when scraping.
8. In [10]:
connection = []
for l in link:
if requests.get(l).status_code == 200:
status = ['good', l]
else: status = ['404', l]
connection.append(status)
# Get the responded links, '200', and company tickers
links = []
tickers = []
for status in range(0,len(connection)):
if connection[status][0] == 'good':
good_link = connection[status][1]
else: bad_link = connection[status][1]
links.append(good_link)
tickers.append(good_link.split('=')[1])
In [11]:
print('There are',len(links),'links that responded (200)')
There are 393 links that responded (200)
9. In [12]:
tic = time.time()
data = []
tables = (0,3,5,6,7) # This indicates the specific table in stat
istical key page used in the analysis
for url in links[:len(links)]:
for table in tables:
d = pd.read_html(url)[table]
data.append(d)
matrix = pd.concat(data)
matrix.shape
m = matrix.set_index(0)
toc = time.time()
print("Total scraping time:", (toc-tic)/60, "minutes.")
# Total scraping time: 35.854390549659726 minutes.
In [13]:
# Build a second dataframe from concatenated matrix
df_2 = pd.DataFrame()
for i in range(0,len(m),31):
m_m = m.iloc[i:i+31]
n = (i+31)/31-1
m_m.columns = [tickers[int(n)]]
df_2[tickers[int(n)]] = m_m[tickers[int(n)]]
df_2 = df_2.transpose()
In [14]:
df_2.to_csv('df_2.csv')
Total scraping time: 35.854390549659726 minutes.
10. Joining Data frames
In [15]:
df = df.join(df_2)
In [16]:
df = df.iloc[:len(df_2)]
df.shape
In [17]:
df.head()
Out[16]:
(393, 40)
11. In [18]:
df.to_csv('tech_390.csv')
====================================================================
3. Data Cleaning
Rename Variables in The Data Frame, df
Out[17]:
Name
Price
(Intraday)
Change
%
Change
Volume
Avg Vol
(3
month)
Symbol
AAPL Apple Inc. 261.78 -0.23 -0.09% 16.331M 25.857M
MSFT
Microsoft
Corporation
149.59 0.11 +0.07% 15.842M 22.825M
TSM
Taiwan
Semiconductor
Manufacturing
Company
Lim...
52.79 -0.19 -0.36% 4.103M 6.848M
INTC
Intel
Corporation
57.61 -0.61 -1.05% 15.69M 18.498M
CSCO
Cisco
Systems, Inc.
44.85 0.01 +0.02% 16.516M 19.124M
5 rows × 40 columns
12. In [19]:
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_rows', 400)
In [20]:
df = pd.read_csv('tech_390.csv')
df = df.set_index('Symbol')
# df.head()
Out[20]:
Name
Price
(Intraday)
Change
%
Change
Volume
Avg Vol
(3
month)
Symbol
AAPL Apple Inc. 261.78 -0.23 -0.09% 16.331M 25.857M
MSFT
Microsoft
Corporation
149.59 0.11 +0.07% 15.842M 22.825M
TSM
Taiwan
Semiconductor
Manufacturing
Company
Lim...
52.79 -0.19 -0.36% 4.103M 6.848M
INTC
Intel
Corporation
57.61 -0.61 -1.05% 15.69M 18.498M
CSCO
Cisco
Systems, Inc.
44.85 0.01 +0.02% 16.516M 19.124M
13. In [21]:
data = df[['% Change', 'Price (Intraday)', 'Market Cap', 'PE Rat
io (TTM)',
'PEG Ratio (5 yr expected) 1', 'Price/Sales (ttm)', '
Price/Book (mrq)',
'Enterprise Value 3', 'Enterprise Value/Revenue 3',
'Enterprise Value/EBITDA 6', 'Payout Ratio 4', 'Profi
t Margin',
'Operating Margin (ttm)', 'Return on Assets (ttm)', '
Return on Equity (ttm)',
'Revenue (ttm)', 'Revenue Per Share (ttm)', 'Gross Pr
ofit (ttm)', 'EBITDA',
'Net Income Avi to Common (ttm)', 'Diluted EPS (ttm)'
,
'Quarterly Earnings Growth (yoy)']]
data.columns = ['Change', 'Price', 'Mkt_Cap', 'PE_Ratio',
'PEG_Ratio', 'PriceSales', 'PriceBook',
'EV', 'EVRevenue',
'EV/EBITDA', 'Payout_Ratio', 'Profit_Margin',
'Operating_Margin', 'ROA', 'ROE',
'Revenue', 'RevenueShare', 'Gross_Profit', 'EBIT
DA',
'NItoCommon', 'Diluted_EPS' ,
'Earnings_Growth_Q']
data.head()
Out[21]:
Change Price Mkt_Cap PE_Ratio PEG_Ratio PriceSales PriceBook
Symbol
AAPL -0.09% 261.78 1.183T 22.02 2.04 4.55
MSFT +0.07% 149.59 1.141T 28.22 1.91 8.79
TSM -0.36% 52.79 263.261B 23.67 2.39 NaN
INTC -1.05% 57.61 250.604B 13.49 1.79 3.56
CSCO +0.02% 44.85 190.265B 17.85 1.97 3.66
14. Data Cleaning & Transformation
Converting str to float by using
replace() to replace abbreviations (T, B, M ,and K) to Scientific Notation
(e) for Mkt_Cap, EV, Revenue, Gross_Profit, EBITDA, and NItoCommon.
strip() to strip the unnecessary symbols which are ',', and '%'.
astype() to change string to float
Creating categorical variable (Binary) for company health based on industrial
average of EV/EBITDA .
In [22]:
# Check the type of variables
columns = ['Change', 'Price', 'Mkt_Cap', 'PE_Ratio',
'PEG_Ratio', 'PriceSales', 'PriceBook',
'EV', 'EVRevenue',
'EV/EBITDA', 'Payout_Ratio', 'Profit_Margin',
'Operating_Margin', 'ROA', 'ROE',
'Revenue', 'RevenueShare', 'Gross_Profit', 'EBITDA',
'NItoCommon', 'Diluted_EPS',
'Earnings_Growth_Q']
for i in columns:
print(type(data[i].values[0]), i)
17. In [25]:
# Creating binary variable based on 'data['EV/EBITDA'].mean()'
health = []
print('The industrial average of EV/EBITDA is' ,data['EV/EBITDA'
].mean())
for i in data['EV/EBITDA']:
if i > data['EV/EBITDA'].mean():
h = 1
else:
h = 0
health.append(h)
data['Health'] = health
# Since the new categorical data was created from 'EV/EBITDA', i
t will be taken out from the dataframe.
del data['EV/EBITDA']
In [26]:
# Drop NAs
data = data.dropna()
print('After steps of data cleaning, and manipulating, the dataf
rame used in the model has', data.shape[0],
'observations (companies) with', data.shape[1]-1, 'feature
s.')
In [27]:
a = data['Health'] == 1
print('The number of observations defined as healthy are',a.sum(
))
The industrial average of EV/EBITDA is 7.90497282608
6954
After steps of data cleaning, and manipulating, the
dataframe used in the model has 186 observations (co
mpanies) with 21 features.
The number of observations defined as healthy are 17
0
20. Table Summary
The correlation table for numeric variables indicates that positive relationships are
PE_Ratio, Earnings_Growth_Q, PriceSales, EVRevenue, and Profit_Margin,
respectively. For negative relationships variables are, Operating_Margin, Gross_Profit,
EBITDA, EV, Mkt_Cap, NItoCommon, PriceBook, PEG_Ratio, Revenue, Price,
Diluted_EPS, ROE, RevenueShare, Payout_Ratio, and ROA, respectively.
In [5]:
plt.figure(figsize=(15,15))
plt.imshow(correl) # show as image
plt.colorbar()
# To set the label on the axes
plt.xticks(range(21), variables, rotation='vertical') # Require
list of variables list creation
plt.yticks(range(21), variables)
Out[5]:
([<matplotlib.axis.YTick at 0x1c227f7c18>,
<matplotlib.axis.YTick at 0x1c227f7550>,
<matplotlib.axis.YTick at 0x1c22721550>,
<matplotlib.axis.YTick at 0x1c2275c588>,
<matplotlib.axis.YTick at 0x1c22755be0>,
<matplotlib.axis.YTick at 0x1c22765c88>,
<matplotlib.axis.YTick at 0x1c227654e0>,
<matplotlib.axis.YTick at 0x1c22aeb550>,
<matplotlib.axis.YTick at 0x1c22aebac8>,
<matplotlib.axis.YTick at 0x1c22af00f0>,
<matplotlib.axis.YTick at 0x1c22af05f8>,
<matplotlib.axis.YTick at 0x1c22af0b70>,
<matplotlib.axis.YTick at 0x1c22af8160>,
<matplotlib.axis.YTick at 0x1c22af09e8>,
<matplotlib.axis.YTick at 0x1c227650b8>,
<matplotlib.axis.YTick at 0x1c22af85c0>,
<matplotlib.axis.YTick at 0x1c22af8b38>,
<matplotlib.axis.YTick at 0x1c22afe160>,
<matplotlib.axis.YTick at 0x1c22afe668>,
<matplotlib.axis.YTick at 0x1c22afebe0>,
<matplotlib.axis.YTick at 0x1c22b06198>],
<a list of 21 Text yticklabel objects>)
22. In [6]:
plt.hist(data['Change'], bins=20)
plt.title('Histogram of % Change in Stock Price')
plt.xlabel('% Change in Stock Price')
The histogram of Change indicates that the independent variable (Y) is approximately
normally distributed.
Example for Plots of independent variables against Y
Out[6]:
Text(0.5, 0, '% Change in Stock Price')
23. In [7]:
# Enterprise Value/Revenue
plt.scatter(data['EVRevenue'], data['Change'])
plt.xlabel('Enterprise Value/Revenue')
plt.ylabel('% Change in Stock Price')
plt.title('Enterprise Value/Revenue vs Change')
From the plot above, there's no obvious direction whether it's an upward or downward
slope. However, there's a slightly non-linear relationship between this feature and
response. Hence, in further analysis, if this variable is statistically significant in the
model, the polynomial term of EV/Revenue will be generated in order to improve the
model.
Out[7]:
Text(0.5, 1.0, 'Enterprise Value/Revenue vs Change')
24. In [8]:
# Payout Ratio (%)
plt.figure(figsize=(15,5))
plt.subplot(121)
plt.scatter(data['Payout_Ratio'], data['Change'])
plt.xlabel('Payout Ratio (%)')
plt.ylabel('% Change in Stock Price')
plt.title('Payout_Ratio vs Change')
plt.subplot(122)
plt.scatter(np.log(data['Payout_Ratio']), data['Change'])
plt.xlabel('Log of Payout Ratio (%)')
plt.ylabel('% Change in Stock Price')
plt.title('log(Payout_Ratio) vs Change')
From the left panel, the distribution is dense when the payout ratio (%) is less than
200. After taking logarithm function to the observation, the scatter plot (right panel)
shows a slightly negative relationship but no non-linear relationship is detected. Taking
a log() function into the model might improve the model. Unfortunately, there's some
infinite value after taking log() in the observations. Dropping more observations (infinite
values) from the approximately existing 190 observations would potentially reduce the
model accuracy. Hence, this variable will be kept as it is.
Out[8]:
Text(0.5, 1.0, 'log(Payout_Ratio) vs Change')
25. In [9]:
# Operating Margin (%)
plt.scatter(data['Operating_Margin'], data['Change'])
plt.xlabel('Operating Margin (%)')
plt.ylabel('% Change in Stock Price')
plt.title('Operating_Margin vs Change')
The plot above shows a slight positive relationship between Operating Margin (%) and
% change in stock price.
Out[9]:
Text(0.5, 1.0, 'Operating_Margin vs Change')
26. In [10]:
# ROA (%)
plt.scatter(data['ROA'], data['Change'])
plt.xlabel('Return on Asset (%)')
plt.ylabel('% Change in Stock Price')
plt.title('ROA (%) vs Change')
The scatter plot of ROA (%) and % Change in stock price shows a negative
relationship with no evidence of non-linearity.
The Boxplot (for Binary Variable)
Out[10]:
Text(0.5, 1.0, 'ROA (%) vs Change')
27. In [11]:
plt.figure(figsize=(5,5))
sns.boxplot(data['Health'],data['Change'])
The boxplot of this binary variable Health shows a slight difference between the
mean and 50% of observations for each group where the 'healthy' companies (Health
=1) are slightly higher and have a wider range of distribution (whiskers). Hence, this
variable is potentially statistical significant in the model.
====================================================================
4. Predictive Modeling
Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c2410a4
38>
28. The Multicollinearity Among Variables (VIF)
Calculating the variance inflation factor
*source
(https://etav.github.io/python/vif_factor_python.html)
In [37]:
y, X_vif = dmatrices('Change ~' + 'Price+Mkt_Cap+PEG_Ratio+PE_Ra
tio+PriceSales+PriceBook+EV+EVRevenue+Payout_Ratio+Profit_Margin
+Operating_Margin+ROA+ROE+Revenue+RevenueShare+Gross_Profit+EBIT
DA+NItoCommon+Earnings_Growth_Q+Health', data = data, return_typ
e='dataframe')
# For each X, calculate VIF and save in dataframe
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X_vif.values, i)
for i in range(X_vif.shape[1])]
vif["features"] = X_vif.columns
vif.round(2).set_index('features')
30. From the result above, there're high multicollinearities (VIF > 10) in the features
among Mkt_Cap , Price/Sales , EV , EV/Revenue , Revenue ,
Gross_Profit , EBITDA , and NItoCommon . In which, these varaibles are not
independent. In the next step, the subset selection method will help filter out
highly correlated and unnecessary variables from the model.
Best Subset Selection Method
Since there're many variables (high dimension) with multicollinearity in the data
set. Including all the variables may lead to high varaince in the model. To reduce
the model variance, this method could help select the best set of variables that
yields a higher Adjusted R squared by minimizing RSS.
* source
(http://www.science.smith.edu/~jcrouser/SDS293/labs/lab8-
py.html)
31. In [13]:
y = data.Change
X = data[['Price','Mkt_Cap','PEG_Ratio', 'PE_Ratio', 'PriceSales
', 'PriceBook', 'EV', 'EVRevenue',
'Payout_Ratio', 'Profit_Margin', 'Operating_Margin', 'R
OA', 'ROE',
'Revenue', 'RevenueShare', 'Gross_Profit', 'EBITDA', 'N
ItoCommon',
'Earnings_Growth_Q', 'Health']]
X = pd.concat([X], axis=1)
X.head()
In [14]:
def processSubset(feature_set):
# Fit model on feature_set and calculate RSS
model = sm.OLS(y,X[list(feature_set)])
regr = model.fit()
RSS = ((regr.predict(X[list(feature_set)]) - y) ** 2).sum()
return {"model":regr, "RSS":RSS}
Out[13]:
Price Mkt_Cap PEG_Ratio PE_Ratio PriceSales PriceBook
Symbol
AAPL 261.78 1.183000e+12 2.04 22.02 4.55 12.85
MSFT 149.59 1.141000e+12 1.91 28.22 8.79 10.77
INTC 57.61 2.506040e+11 1.79 13.49 3.56 3.38
CSCO 44.85 1.902650e+11 1.97 17.85 3.66 5.53
ORCL 56.39 1.851010e+11 1.45 18.46 4.68 10.08
32. In [15]:
def getBest(k):
tic = time.time()
results = []
for combo in itertools.combinations(X.columns, k):
results.append(processSubset(combo))
models = pd.DataFrame(results)
# Choose model with the lowest RSS
best_model = models.loc[models['RSS'].argmin()]
toc = time.time()
print("Processed", models.shape[0], "models on", k, "predict
ors in", (toc-tic), "seconds.")
return best_model
In [16]:
def getBest(k):
tic = time.time()
results = []
for combo in itertools.combinations(X.columns, k):
results.append(processSubset(combo))
models = pd.DataFrame(results)
best_model = models.loc[models['RSS'].argmin()]
toc = time.time()
print("Processed", models.shape[0], "models on", k, "predict
ors in", (toc-tic), "seconds.")
return best_model
33. In [17]:
models_best = pd.DataFrame(columns=['RSS', 'model'])
tic = time.time()
for i in range(1,10):
models_best.loc[i] = getBest(i)
toc = time.time()
print("Total elapsed time:", (toc-tic)/60, "minutes.")
Processed 20 models on 1 predictors in 0.06646728515
625 seconds.
Processed 190 models on 2 predictors in 0.4089689254
760742 seconds.
Processed 1140 models on 3 predictors in 2.546672344
2077637 seconds.
Processed 4845 models on 4 predictors in 11.26060104
3701172 seconds.
Processed 15504 models on 5 predictors in 34.6961269
3786621 seconds.
Processed 38760 models on 6 predictors in 88.0667259
6931458 seconds.
Processed 77520 models on 7 predictors in 613.451443
9105988 seconds.
Processed 125970 models on 8 predictors in 403.86032
41443634 seconds.
Processed 167960 models on 9 predictors in 419.71901
79824829 seconds.
Total elapsed time: 26.632335432370503 minutes.
36. In [20]:
print('The model has the highest Adjusted R squared at', '{0:.4f
}'.format(models_best.loc[7, "model"].rsquared_adj), 'when it ha
s', rsquared_adj.argmax() , 'variables' )
print('The model has the lowest AIC at', '{0:.4f}'.format(models
_best.loc[7, "model"].aic), 'when it has', aic.argmin() , 'varia
bles' )
print('The model has the lowest BIC at', '{0:.4f}'.format(models
_best.loc[7, "model"].bic), 'when it has', bic.argmin() , 'varia
bles' )
Out[19]:
Text(0, 0.5, 'BIC')
The model has the highest Adjusted R squared at 0.06
37 when it has 7 variables
The model has the lowest AIC at 602.5338 when it has
4 variables
The model has the lowest BIC at 625.1140 when it has
2 variables
38. Criteria
The subset selection computes or models were computed by minimizing
the RSS.
Adjusted R squared:
According to the formula above the smaller RSS would yield a higher
Adjusted R squared. As the result of the subset selection method, the 7
variable-model yields the highest adjusted R squared at 0.067, where the
variables are Price/Sales, Price/Book, EV/Revenue, Payout Ratio, Operating
Margin, ROA, and Health. There're 5 variables that are statistically significant
at p-value < 0.05, where R squared is 0.099. (The result shown below)
AIC
From the result, AIC criterion yields the model with 4 variables which are
Payout Ratio, Operating Margin, ROA, and Health. However, there's only one
variable (ROA) that is statistically significant at 95% confident interval, with R
squared equal to 0.071.
BIC
Since BIC criterion is more restricted (higher penalty term; ), it
yields a smaller model with two significant variables which are
Operating_Margin, and ROA. The R squared is 0.045.
Criteria # of Optimal Variables
7 0.099 0.064
AIC 4 0.071 0.051
BIC 2 0.045 0.035
To conclude, based on Adjusted R squared criterion, the optimal non-linear
model is OLS model with 7 variables, shown below.
( )
𝑝
𝑘
𝑝!
𝑘!(𝑝−𝑘)!
𝐴𝑑𝑗𝑢𝑠𝑡𝑒𝑑 = 1 −𝑅2
𝑅𝑆𝑆
𝑛−𝑝−1
𝑇𝑆𝑆
𝑛−1
log(𝑛) ∗ 𝑝 ∗ σ̂2
𝑅2
𝐴𝑑𝑗𝑢𝑠𝑡𝑒𝑑𝑅2
𝐴𝑑𝑗𝑢𝑠𝑡𝑒𝑑𝑅2
40. The Optimal Linear Model (7 Variables)
In [3]:
y = data.Change
X_7 = dmatrix('1 + PriceSales + PriceBook + EVRevenue + Payout_R
atio + Operating_Margin + ROA + Health', data = data)
m = sm.OLS(y, X_7)
m.data.xnames = X_7.design_info.column_names
m = m.fit()
print(m.summary())
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is
correctly specified.
OLS Regression Results
====================================================
==========================
Dep. Variable: Change R-squared:
0.098
Model: OLS Adj. R-squar
ed: 0.063
Method: Least Squares F-statistic:
2.774
Date: Thu, 28 Nov 2019 Prob (F-stat
istic): 0.00924
Time: 21:50:30 Log-Likeliho
od: -294.25
No. Observations: 186 AIC:
604.5
Df Residuals: 178 BIC:
630.3
Df Model: 7
Covariance Type: nonrobust
====================================================
================================
coef std err t
41. VIF After The Subset Selection
P>|t| [0.025 0.975]
----------------------------------------------------
--------------------------------
Intercept 0.0664 0.326 0.204
0.839 -0.576 0.709
PriceSales 0.2852 0.153 1.863
0.064 -0.017 0.587
PriceBook 0.0310 0.021 1.481
0.140 -0.010 0.072
EVRevenue -0.3609 0.170 -2.124
0.035 -0.696 -0.026
Payout_Ratio -0.0020 0.001 -2.041
0.043 -0.004 -6.67e-05
Operating_Margin 0.0474 0.017 2.713
0.007 0.013 0.082
ROA -0.1509 0.039 -3.909
0.000 -0.227 -0.075
Health 0.4269 0.332 1.286
0.200 -0.228 1.082
====================================================
==========================
Omnibus: 8.080 Durbin-Watso
n: 1.953
Prob(Omnibus): 0.018 Jarque-Bera
(JB): 14.644
Skew: -0.096 Prob(JB):
0.000661
Kurtosis: 4.361 Cond. No.
493.
====================================================
==========================
Warnings:
[1] Standard Errors assume that the covariance matri
x of the errors is correctly specified.
42. In [36]:
y, X_vif = dmatrices('Change ~' + 'PriceSales + PriceBook + EVRe
venue + Payout_Ratio + Operating_Margin + ROA + Health', data =
data, return_type='dataframe')
# For each X, calculate VIF and save in dataframe
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X_vif.values, i)
for i in range(X_vif.shape[1])]
vif["features"] = X_vif.columns
vif.round(2).set_index('features')
As the result, the high VIF variables are eliminated. Even though the model still has
some multicollinearity beween variables, Price/Sales and EV/Revenue, it is
moderately acceptable.
Out[36]:
VIF Factor
features
Intercept 13.63
PriceSales 44.86
PriceBook 2.12
EVRevenue 51.57
Payout_Ratio 1.07
Operating_Margin 3.46
ROA 2.95
Health 1.11
43. The Non-Linear Model with Polynomial Term
Based on the data visualization above, there's an evidence that EV/Revenue
could have non-linearity.
Below, a model with polynomial term is performed, along with other selected
variables.
In [25]:
y = data.Change
X_new = dmatrix('1 + PriceSales + PriceBook + EVRevenue + I(EVR
evenue**2) + Payout_Ratio + Operating_Margin + ROA + Health', da
ta = data)
m_new = sm.OLS(y, X_new)
m_new.data.xnames = X_new.design_info.column_names
m_new = m_new.fit()
print(m_new.summary())
OLS Regression Results
====================================================
==========================
Dep. Variable: Change R-squared:
0.115
Model: OLS Adj. R-squar
ed: 0.075
Method: Least Squares F-statistic:
2.872
Date: Wed, 27 Nov 2019 Prob (F-stat
istic): 0.00499
Time: 00:10:37 Log-Likeliho
od: -292.52
No. Observations: 186 AIC:
603.0
Df Residuals: 177 BIC:
632.1
Df Model: 8
Covariance Type: nonrobust
====================================================
=================================
coef std err t
45. Compared with linear model, this non-linear model with the polynomial term has a
higher Adjusted R squared at 0.075 ( > 0.063 ), and R squared at 0.115. This means
the explanation of variation for the dependent variable (%Change in stock price) has
been improved by the polynomial term of EV/Revenue. However, the model will be
evaluated by the cross-validation to compare the predictive performance between the
linear and non-linear model.
Prediction Accuracy Between Models
Cross-Validation: Using the random sample cross-validition with 80:20 partitioning,
and random.seed(1) to validate the model predictive power.
In [26]:
# Create a training and testing set
random.seed(1)
train = random.sample(range(0,len(data)), round(len(data)*0.8))
test = []
for n in range(0,len(data)):
if n not in train:
test.append(n)
y_training = data['Change'].iloc[train]
x_training = data[['PriceSales', 'PriceBook', 'EVRevenue', 'Payo
ut_Ratio', 'Operating_Margin', 'ROA', 'Health']].iloc[train]
y_testing = data['Change'].iloc[test]
46. In [27]:
# Build a model with training set from the best subset model (7
Variables)
y = y_training
X_7 = dmatrix('1 + PriceSales + PriceBook + EVRevenue + Payout_R
atio + Operating_Margin + ROA + Health', data = x_training)
m_7_cv = sm.OLS(y, X_7)
m_7_cv.data.xnames = X_7.design_info.column_names
m_7_cv = m_7_cv.fit()
print(m_7_cv.summary())
OLS Regression Results
====================================================
==========================
Dep. Variable: Change R-squared:
0.122
Model: OLS Adj. R-squar
ed: 0.079
Method: Least Squares F-statistic:
2.802
Date: Wed, 27 Nov 2019 Prob (F-stat
istic): 0.00922
Time: 00:10:37 Log-Likeliho
od: -227.28
No. Observations: 149 AIC:
470.6
Df Residuals: 141 BIC:
494.6
Df Model: 7
Covariance Type: nonrobust
====================================================
================================
coef std err t
P>|t| [0.025 0.975]
----------------------------------------------------
--------------------------------
Intercept -0.0203 0.329 -0.061
0.951 -0.672 0.631
PriceSales 0.1883 0.155 1.213
0.227 -0.119 0.495
PriceBook 0.0267 0.023 1.167
47. In [28]:
# Build a model with training set from a model with polynomial t
erm (EV/Revenue^2)
y = y_training
X_new = dmatrix('1 + PriceSales + PriceBook + EVRevenue + I(EVR
evenue**2) + Payout_Ratio + Operating_Margin + ROA + Health', da
ta = x_training)
m_new_cv = sm.OLS(y, X_new)
m_new_cv.data.xnames = X_new.design_info.column_names
m_new_cv = m_new_cv.fit()
print(m_new_cv.summary())
0.245 -0.018 0.072
EVRevenue -0.2841 0.170 -1.672
0.097 -0.620 0.052
Payout_Ratio -0.0023 0.001 -2.375
0.019 -0.004 -0.000
Operating_Margin 0.0370 0.018 2.008
0.047 0.001 0.073
ROA -0.1325 0.040 -3.283
0.001 -0.212 -0.053
Health 0.7233 0.344 2.102
0.037 0.043 1.403
====================================================
==========================
Omnibus: 8.706 Durbin-Watso
n: 1.839
Prob(Omnibus): 0.013 Jarque-Bera
(JB): 15.574
Skew: 0.193 Prob(JB):
0.000415
Kurtosis: 4.536 Cond. No.
522.
====================================================
==========================
Warnings:
[1] Standard Errors assume that the covariance matri
x of the errors is correctly specified.
OLS Regression Results
====================================================
==========================
49. Mean Squared Error
𝑀𝑆𝐸 =
∑
𝑖=1
𝑛
( −𝑦̂ 𝑦𝑖)2
𝑛
In [29]:
# Calculate the test MSEs
x_testing = dmatrix('1+PriceSales+PriceBook+EVRevenue+Payout_Rat
io+Operating_Margin+ROA+Health',data = data.iloc[test])
predicted_7 = m_7_cv.predict(x_testing)
x_testing = dmatrix('1+PriceSales+PriceBook+EVRevenue+I(EVRevenu
e**2)+Payout_Ratio+Operating_Margin+ROA+Health',data = data.iloc
[test])
predicted_new = m_new_cv.predict(x_testing)
mse = pd.DataFrame()
mse['Actual Value'] = y_testing
mse['Predicted Value (m_7)'] = predicted_7
mse['Predicted Value (m_new)'] = predicted_new
mse['Squared Error (m_7)'] = (mse['Predicted Value (m_7)'] - mse
['Actual Value'])**2
mse['Squared Error (m_new)'] = (mse['Predicted Value (m_new)'] -
mse['Actual Value'])**2
MSE_7 = mse['Squared Error (m_7)'].sum()/len(mse)
MSE_new = mse['Squared Error (m_new)'].sum()/len(mse)
(JB): 18.892
Skew: 0.166 Prob(JB):
7.90e-05
Kurtosis: 4.713 Cond. No.
524.
====================================================
==========================
Warnings:
[1] Standard Errors assume that the covariance matri
x of the errors is correctly specified.
50. In [31]:
mse.T
In [32]:
print('The model test MSE for the linear model with 7 variable i
s', MSE_7)
print('The model test MSE for the model with polynomial term is'
, MSE_new)
Out[31]:
Symbol ACN AVGO IBM NOW MU AMD
Actual
Value
-0.060000 -0.100000 0.370000 0.310000 0.700000 -0.940000
Predicted
Value
(m_7)
-0.780854 -0.253737 0.194448 0.138732 -0.359416 0.184337
Predicted
Value
(m_new)
-0.790585 -0.126589 0.075529 -0.300076 -0.351424 0.357285
Squared
Error
(m_7)
0.519630 0.023635 0.030819 0.029333 1.122362 1.264133
Squared
Error
(m_new)
0.533755 0.000707 0.086713 0.372193 1.105493 1.682949
5 rows × 37 columns
The model test MSE for the linear model with 7 varia
ble is 2.1063194289258402
The model test MSE for the model with polynomial ter
m is 2.154634174218139
51. The Optimal Model Recall
According to the MSEs value above, the model with a lower error, a non-linear model
with 7 selected variables, has been recalled below.
In [4]:
print(m.summary())
OLS Regression Results
====================================================
==========================
Dep. Variable: Change R-squared:
0.098
Model: OLS Adj. R-squar
ed: 0.063
Method: Least Squares F-statistic:
2.774
Date: Thu, 28 Nov 2019 Prob (F-stat
istic): 0.00924
Time: 21:51:37 Log-Likeliho
od: -294.25
No. Observations: 186 AIC:
604.5
Df Residuals: 178 BIC:
630.3
Df Model: 7
Covariance Type: nonrobust
====================================================
================================
coef std err t
P>|t| [0.025 0.975]
----------------------------------------------------
--------------------------------
Intercept 0.0664 0.326 0.204
0.839 -0.576 0.709
PriceSales 0.2852 0.153 1.863
0.064 -0.017 0.587
PriceBook 0.0310 0.021 1.481
0.140 -0.010 0.072
EVRevenue -0.3609 0.170 -2.124
0.035 -0.696 -0.026
52. Regression Diagnosis
*Source (https://robert-alvarez.github.io/2018-06-04-diagnostic_plots/)
Payout_Ratio -0.0020 0.001 -2.041
0.043 -0.004 -6.67e-05
Operating_Margin 0.0474 0.017 2.713
0.007 0.013 0.082
ROA -0.1509 0.039 -3.909
0.000 -0.227 -0.075
Health 0.4269 0.332 1.286
0.200 -0.228 1.082
====================================================
==========================
Omnibus: 8.080 Durbin-Watso
n: 1.953
Prob(Omnibus): 0.018 Jarque-Bera
(JB): 14.644
Skew: -0.096 Prob(JB):
0.000661
Kurtosis: 4.361 Cond. No.
493.
====================================================
==========================
Warnings:
[1] Standard Errors assume that the covariance matri
x of the errors is correctly specified.
53. In [5]:
# Residual Plot
sns.residplot(m.fittedvalues, 'Change', data=data, lowess=True,
scatter_kws={'alpha': 0.5},
line_kws={'color': 'red', 'lw'
: 1, 'alpha': 0.8})
plt.title('Residuals vs Fitted')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
Residuals and fited value plot shows there's some nonlinearity that this linear model
couldn't capture.
Out[5]:
Text(0, 0.5, 'Residuals')
54. In [6]:
# Normal Q-Q plot
sm.qqplot(m.resid, line='45', color='cornflowerblue', alpha=0.6)
plt.title('Normal Q-Q')
plt.xlabel('Theoretical Quantiles')
plt.ylabel('Standardized Residuals')
The Q-Q plot indicates that approimately more than 85% of the residuals align along
the line, which means the errors are being normally distributed.
Out[6]:
Text(0, 0.5, 'Standardized Residuals')
55. In [7]:
# Scale-Location Plot
norm_res_abs_sqrt = np.sqrt(np.abs(m.get_influence().resid_stude
ntized_internal))
plt.scatter(m.fittedvalues, norm_res_abs_sqrt, alpha=0.5);
sns.regplot(m.fittedvalues, norm_res_abs_sqrt, scatter=False, ci
=False, lowess=True,
line_kws={'color': 'red', 'lw': 1, 'alpha': 0.8});
plt.xlabel('Fitted values')
plt.ylabel('$sqrt{|Standardized Residuals|}$')
The residual plot shows a slightliness of uneven cloud of the residual. This model might
suffer from heteroscedasticiy.
Out[7]:
Text(0, 0.5, '$sqrt{|Standardized Residuals|}$')
56. In [8]:
# Residual and Leverage
leverage = m.get_influence().hat_matrix_diag
norm_res = m.get_influence().resid_studentized_internal
plt.scatter(leverage, norm_res, alpha=0.5);
sns.regplot(leverage, norm_res, scatter=False, ci=False, lowess=
True,
line_kws={'color': 'red', 'lw': 1, 'alpha': 0.8})
plt.xlim(0, max(leverage)+0.01)
plt.ylim(-3, 5)
plt.title('Residuals vs Leverage')
plt.xlabel('Leverage')
plt.ylabel('Standardized Residuals');
The residual and leverage plot shows that there's no outlier.
57. Model Conclusion
From the training set, the model with polynomial term seems to perform better
than the linear model, due to a higher Adjusted R squared as well as R squared,
which means that the variation of %Change on stock price is better explained by
the additional polynomial term. However, the training error tends to underestimate
the testing error.
According to the test MSEs for both models, the model without polynomial term
yields a slightly lower MSE ( 2.1063 < 2.1546 ). This indicates that the model with
7 variables (non-linear) has a stronger predictive power.
The non-linear model:
𝐶ℎ𝑎𝑛𝑔𝑒 = 0.0664 + 0.2852(𝑃 𝑟𝑖𝑐𝑒𝑆𝑎𝑙𝑒𝑠) + 0.0310(𝑃 𝑟𝑖𝑐𝑒𝐵𝑜𝑜𝑘) − 0.3609(𝐸𝑉 𝑅𝑒
− 0.1509(𝑅𝑂𝐴) + 0.4269(
The Optimal Model Interpretation
Independent Variables Relationships Coefficient P-Value
Intercept + 0.0664 0.839
Price/Sales + 0.2852 0.064 (.)
Price/Book + 0.0310 0.140
EV/Revenue - 0.3609 0.035 (*)
Payout Ratio (%) - 0.0020 0.043 (*)
Operating Margin (%) + 0.0474 0.007 (**)
Return on Asset (ttm) - 0.1509 0.000 (***)
Health + 0.4269 0.200
R squared
The 9.8% variation of the percentage change in stock price is explained by
independent variables in this ordinary least squares model.
In order to improve the R sqaured value, the model might need other vairable
that is more correlated to the response. Because the stock data has high
variation as well as high randomness, besides, numeric data we might need
58. other data such as, daily news, financial report, 10-K, index companies
performance, and so on to improve the change in stock price evaluation.
Coefficients (Significant at 95% confident interval)
EVRevenue : The coefficient indicates that, on average, when the Enterprise
Value/Revenue increases by 1 unit, the stock price will decline by 0.3609%,
holding other variables constant, at p-value 0.035 < 0.05.
Payout_Ratio : On average, when the Payout Ratio increases by 1%, the
stock price will decrease by 0.002%, while holding others constant, at p-value
0.045 < 0.05.
Operating_Margin : The coefficient indicates that, on average, when the
Operating Margin (ttm) increases by 1%, the stock price will also increase by
0.0474%, holding others constant, at p-value 0.007 (< 0.05).
ROA : ROA is highly significant with p-value at 0.000. On average, when the
Return on Asset (ttm) increases by 1% while holding others constant, the
stock price will decreases by 0.1509%.
PriceSales has a p-value at 0.064 which is statistically significant at 99%
confidence level, which means it isn't highly correlated with the dependent
variable.
PriceBook , and Health are not statistically significant.
In [33]:
# # Use this code in order to predict a specific scenerio
# PriceSales =
# PriceBook =
# EVRevenue =
# Payout_Ratio =
# Operating_Margin =
# ROA =
# Health =
# data_new = [1, PriceSales, PriceBook, EVRevenue, Payout_Ratio,
Operating_Margin, ROA, Health]
# predicted = m_7_cv.predict(data_new)[0]
# predicted
====================================================================
59. 5. Conclusions
What have we seen based on the data?
From joining the two data frames (one from yahoo finance by the technology
services sector, and another from the key-statistic page), the data set had
approximately 35 numeric variables with 390 companies. After cleaning the data
the observations were reduced to approximately 190 companies.
Building a correlation table and plot, as well as taking those variables into the
scatter plot, the results showed that most of them had vague relationships (low
magnitude of correlation) between the response (% Change in stock price). Also,
there's a sign of non-linearity between the response and EV/Revenue . Hence
the polynomial term of this variable was performed in further progress.
Since there's a high dimensionality in the model, the best subset selection method
was performed. Then, some variables (21 variables including binary variable) are
selected to run the best subset selection model. According to the lowest RSS and
highest Adjusted R squared, 7 variables were selected which are Price/Sales,
Price/Book, EV/Revenue, Payout Ratio, Operating Margin, ROA, and Health
(Binary variable created based on the industrial average of EV/EBITDA)
Due to the non-linearity of EV/Revenue , the model with an additional polynomial
term was performed. The result turns out that the model's Adjusted R squared has
improved. As a result, the % Change of stock price variation is better explained by
dependent variables and the additional polynomial term, .
However, the predictive accuracy was further investigated
Model predictive accuracy:
To validate the model accuracy between these two models, the cross-validation is
performed. The data set was randomly divided into 80% of the training set and
20% of the test set. (Set the seed equal to 1). From the result, the test MSE of the
linear model is slightly lower than the model with a non-linear term. Even though
the non-linear model has a higher Adjusted R squared indicating a better
describing of the relationship between predictors and response, the linear model
has a slightly stronger predictive power.
(The model result comparision shown in the table below)
(𝐸𝑉 /𝑅𝑒𝑣𝑒𝑛𝑢𝑒)2
61. How has our understanding of the original question changed?
Recall the question(s):
Which indices (variables) are statistically important to the Change in percentage
of stock price in Technology Services industry?
The statistically significant indices are Price/Share (+), EV/Revenue (-), Payout
Ratio (%) (-), Operating Margin (%) (+), and Return on Asset (%) (-). Besides
these significant variables in the model, to determine the change in stock
price, some additional factors need to be considered. In the stock market,
there're many types of information the stock analyst could use for decision
making. For instance, reading an annual report like 10-K as well as news and
integrate with numeric data would help them gain more advantage over a
person who only relies on less source.
What is the magnitude of each variable against Change of stock price in the
Technology Services industry?
On the basis, I expected that the market capitalization would pay a significant
role with a positive magnitude as a predictor since most of the companies
that are mostly paid attention, like S&P500 have a high market capitalization.
However, this variable is not statistically significant in the model where the
dependent variable is percentage change in stock price. Also, the result of
ROA is not the same as expected. The more return on asset the more profit a
company can generate from its source. Surprisingly, this variable has negative
relationship in the model.
However, the actual relationship of EV/Revenue is the same as expected
(negative relationship). Since the EV/Revenue is used to compare the
company's revenue with the enterprise value. The lower of the multiple would
mean it's undervalued, the more attraction is drawing to the company. Also,
other variables such as Operating Margin(%), Payout Ratio(%), and
Price/Book are the same as expectation because these variables are the
indices that can draw the attention from investor (the higher the value, the
more attractive stock would gain).