Final Project - %Change in Stock Price (Technology Services Industry) Analysis
Name: Natsarankorn Kijtorntham
Packages
In [1]:
%matplotlib inline
from datetime import datetime
import time
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import requests
from bs4 import BeautifulSoup as bs
from scipy.stats import pearsonr
from patsy import dmatrix
from patsy import dmatrices
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
import itertools
import random
import warnings
warnings.filterwarnings('ignore')
1. Introduction
What is the importance of your data set?
This project scraped data from finance.yahoo.com. The main data set consists of the companies in the Technology Services sector. The analysis aims to predict the % Change (dependent variable) of the stock price with an OLS model. The features are the key statistics of each stock, for example Market Cap, P/E Ratio, Price/Sales, Enterprise Value, ROE, and ROA. Since there are many indicators of the stock price (high dimensionality), the significance levels of the independent variables, together with a subset-selection method, are used to preliminarily filter out unnecessary variables.
Which question(s) can it help us understand?
Which indices (variables) are statistically important to the % Change of the stock price in the Technology Services industry?
What is the magnitude of each variable's effect on the % Change of the stock price in the Technology Services industry?
====================================================================
2. Data Scraping
Where and how are you getting the data?
The data set consists of stocks from the Technology Services sector of Yahoo Finance (https://finance.yahoo.com/screener/predefined/ms_technology). There are approximately 390 companies in this data set.
Scraping Steps
Part 1:
Scrape the main dataframe df_1 of all companies, containing 'Symbol', 'Name of the company', 'Price', 'Change', '% Change', 'Volume', 'Avg Vol', 'Market Cap', and 'PE Ratio'.
Part 2:
Get the 'href' link for each company.
Get the tables from the 'Key Statistics' page using the modified 'href' links, e.g. AAPL (https://finance.yahoo.com/quote/AAPL/key-statistics?p=AAPL), MSFT (https://finance.yahoo.com/quote/MSFT/key-statistics?p=MSFT), TSM (https://finance.yahoo.com/quote/TSM/key-statistics?p=TSM).
Run for loops to build another data frame, df_2, from this second part.
Join the two data frames (df_1 and df_2) on the index 'Symbol'.
What data are available?
The whole data set contains approximately 190 observations after omitting NAs, and 20 variables for the full OLS model.
The dependent variable is Change, the percentage change of the stock price on that day (Change (USD) / Price (Intraday)).
The independent variables are Price, Mkt_Cap, PEG_Ratio, PE_Ratio, PriceSales, PriceBook, EV, EVRevenue, Payout_Ratio, Profit_Margin, Operating_Margin, ROA, ROE, Revenue, RevenueShare, Gross_Profit, EBITDA, NItoCommon, Diluted_EPS, Earnings_Growth_Q, and Health.
EV/EBITDA is transformed into Health, a binary variable (a vectorized sketch of this coding is shown below):
A company whose EV/EBITDA is above the industry average is coded 1 = 'healthy'.
A company whose EV/EBITDA is below the industry average is coded 0 = 'unhealthy'.
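A one-line pandas equivalent of this coding (a sketch; the notebook builds the same flag with an explicit loop in the cleaning section, and this form assumes the EV/EBITDA column has already been made numeric):
# Sketch: flag companies whose EV/EBITDA exceeds the industry (column) average.
data['Health'] = (data['EV/EBITDA'] > data['EV/EBITDA'].mean()).astype(int)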
What relationships do you expect to see in the data?
The expected relationships, both positive and negative, are shown below:
Independent Variables    Expected Relationship
Market Cap +
PEG Ratio -
P/E Ratio +
Price/Sales -
Price/Book +
EV +
EV/Revenue -
Payout Ratio(%) +
Profit Margin(%) +
Operating Margin(%) +
ROA +
ROE +
Revenue +
Revenue/Share +
Gross Profit(%) +
EBITDA +
Net Income to Common +
Diluted EPS +
Quarterly Earnings Growth (yoy) +
Health (healthy) +
Steps:
In [2]:
# Time of data scraping
now = datetime.now()
dt = now.strftime("%d/%m/%Y %H:%M:%S")
print("This data was scraped on", dt)
# This data was scraped on 23/11/2019 22:25:07
This data was scraped on 23/11/2019 22:25:07
PART 1
Getting the main dataframe (df_1)
In [3]:
# Scrape the main dataframe (df_1), requesting 100 rows per page,
# for approximately 390 companies in total.
url = 'https://finance.yahoo.com/screener/predefined/ms_technology'
rows = np.arange(0, 301, 100).tolist()
# rows = [0, 100, 200, 300]
url_list = []
tech_df = []
for i in rows:
    r = requests.get(url, params={'count': '100', 'offset': i})
    link = r.url
    url_list.append(link)
for link in url_list:
    df = pd.read_html(link)
    tb = df[0]
    tech_df.append(tb)
df_1 = pd.concat(tech_df)
In [4]:
# Setting 'Symbol' as an index
df = df_1.set_index('Symbol')
df.to_csv('df_1.csv')
In [5]:
url_list
Out[5]:
['https://finance.yahoo.com/screener/predefined/ms_technology?count=100&offset=0',
 'https://finance.yahoo.com/screener/predefined/ms_technology?count=100&offset=100',
 'https://finance.yahoo.com/screener/predefined/ms_technology?count=100&offset=200',
 'https://finance.yahoo.com/screener/predefined/ms_technology?count=100&offset=300']
In [6]:
df.head()
Out[6]:
                                                     Name  Price (Intraday)  Change  % Change   Volume  Avg Vol (3 month)
Symbol
AAPL                                           Apple Inc.            261.78   -0.23    -0.09%  16.331M            25.857M
MSFT                                Microsoft Corporation            149.59    0.11    +0.07%  15.842M            22.825M
TSM     Taiwan Semiconductor Manufacturing Company Lim...             52.79   -0.19    -0.36%   4.103M             6.848M
INTC                                    Intel Corporation             57.61   -0.61    -1.05%   15.69M            18.498M
CSCO                                  Cisco Systems, Inc.             44.85    0.01    +0.02%  16.516M            19.124M
In [7]:
# The dimensions of the main dataframe
df.shape
Out[7]:
(393, 9)
PART 2
Scraping from Key-Statistics page
In [8]:
# Get the 'a' tags from the web elements
table = []
tag = []
# Get the text from each page of the screener
for url in url_list:
    txt = requests.get(url).text
    soup = bs(txt)
    t = soup.find('div', {'id': 'scr-res-table'})
    table.append(t)
for i in range(0, 4):
    t = table[i].find_all('a')
    tag.append(t)
In [9]:
# Get the href link to the key-statistics page of each ticker to extract tables
link = []
for e in range(0, 4):
    for i in tag[e]:
        l = 'https://finance.yahoo.com' + i.get('href')
        l_kstat = l.split('?')[0] + '/key-statistics?' + l.split('?')[1]
        link.append(l_kstat)
Note
Some HTML links return 404 errors at certain times (around the stock market's closing time). The code chunk below prevents such errors during scraping.
In [10]:
connection = []
for l in link:
    if requests.get(l).status_code == 200:
        status = ['good', l]
    else:
        status = ['404', l]
    connection.append(status)
# Keep only the links that responded (200) and their company tickers
links = []
tickers = []
for status in range(0, len(connection)):
    if connection[status][0] == 'good':
        good_link = connection[status][1]
        links.append(good_link)
        tickers.append(good_link.split('=')[1])
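A more defensive variant could also guard against connection errors and time-outs, not only 404 responses. A minimal sketch (not part of the original notebook; the 10-second timeout is an arbitrary choice):
# Sketch: also skip links that fail with a network error, not only those returning 404.
good_links = []
for l in link:
    try:
        resp = requests.get(l, timeout=10)
        if resp.status_code == 200:
            good_links.append(l)
    except requests.RequestException:
        # Connection errors, time-outs, etc. -- skip this link.
        pass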
In [11]:
print('There are',len(links),'links that responded (200)')
There are 393 links that responded (200)
In [12]:
tic = time.time()
data = []
tables = (0, 3, 5, 6, 7)  # the specific tables on the key-statistics page used in the analysis
for url in links[:len(links)]:
    for table in tables:
        d = pd.read_html(url)[table]
        data.append(d)
matrix = pd.concat(data)
matrix.shape
m = matrix.set_index(0)
toc = time.time()
print("Total scraping time:", (toc - tic) / 60, "minutes.")
# Total scraping time: 35.854390549659726 minutes.
In [13]:
# Build a second dataframe from the concatenated matrix
df_2 = pd.DataFrame()
for i in range(0, len(m), 31):  # each company contributes 31 rows of key statistics
    m_m = m.iloc[i:i+31]
    n = (i + 31) / 31 - 1
    m_m.columns = [tickers[int(n)]]
    df_2[tickers[int(n)]] = m_m[tickers[int(n)]]
df_2 = df_2.transpose()
In [14]:
df_2.to_csv('df_2.csv')
Total scraping time: 35.854390549659726 minutes.
Joining Data frames
In [15]:
df = df.join(df_2)
In [16]:
df = df.iloc[:len(df_2)]
df.shape
Out[16]:
(393, 40)
In [17]:
df.head()
Out[17]:
                                                     Name  Price (Intraday)  Change  % Change   Volume  Avg Vol (3 month)  ...
Symbol
AAPL                                           Apple Inc.            261.78   -0.23    -0.09%  16.331M            25.857M  ...
MSFT                                Microsoft Corporation            149.59    0.11    +0.07%  15.842M            22.825M  ...
TSM     Taiwan Semiconductor Manufacturing Company Lim...             52.79   -0.19    -0.36%   4.103M             6.848M  ...
INTC                                    Intel Corporation             57.61   -0.61    -1.05%   15.69M            18.498M  ...
CSCO                                  Cisco Systems, Inc.             44.85    0.01    +0.02%  16.516M            19.124M  ...
5 rows × 40 columns
In [18]:
df.to_csv('tech_390.csv')
====================================================================
3. Data Cleaning
Rename Variables in The Data Frame, df
In [19]:
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_rows', 400)
In [20]:
df = pd.read_csv('tech_390.csv')
df = df.set_index('Symbol')
# df.head()
In [21]:
data = df[['% Change', 'Price (Intraday)', 'Market Cap', 'PE Ratio (TTM)',
           'PEG Ratio (5 yr expected) 1', 'Price/Sales (ttm)', 'Price/Book (mrq)',
           'Enterprise Value 3', 'Enterprise Value/Revenue 3',
           'Enterprise Value/EBITDA 6', 'Payout Ratio 4', 'Profit Margin',
           'Operating Margin (ttm)', 'Return on Assets (ttm)', 'Return on Equity (ttm)',
           'Revenue (ttm)', 'Revenue Per Share (ttm)', 'Gross Profit (ttm)', 'EBITDA',
           'Net Income Avi to Common (ttm)', 'Diluted EPS (ttm)',
           'Quarterly Earnings Growth (yoy)']]
data.columns = ['Change', 'Price', 'Mkt_Cap', 'PE_Ratio',
                'PEG_Ratio', 'PriceSales', 'PriceBook',
                'EV', 'EVRevenue',
                'EV/EBITDA', 'Payout_Ratio', 'Profit_Margin',
                'Operating_Margin', 'ROA', 'ROE',
                'Revenue', 'RevenueShare', 'Gross_Profit', 'EBITDA',
                'NItoCommon', 'Diluted_EPS',
                'Earnings_Growth_Q']
data.head()
Out[21]:
        Change   Price   Mkt_Cap  PE_Ratio  PEG_Ratio  PriceSales  PriceBook  ...
Symbol
AAPL    -0.09%  261.78    1.183T     22.02       2.04        4.55        ...
MSFT    +0.07%  149.59    1.141T     28.22       1.91        8.79        ...
TSM     -0.36%   52.79  263.261B     23.67       2.39         NaN        ...
INTC    -1.05%   57.61  250.604B     13.49       1.79        3.56        ...
CSCO    +0.02%   44.85  190.265B     17.85       1.97        3.66        ...
Data Cleaning & Transformation
Converting str to float by using:
replace() to replace the abbreviations (T, B, M, and K) with scientific notation (e) for Mkt_Cap, EV, Revenue, Gross_Profit, EBITDA, and NItoCommon.
strip() to strip the unnecessary symbols, ',' and '%'.
astype() to change string to float.
Creating a categorical (binary) variable for company health based on the industry average of EV/EBITDA. (A compact helper for the numeric conversion is sketched below.)
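As a compact alternative to the chained string replacements in the next cell, the abbreviation-to-float conversion could be wrapped in a small helper (a sketch; the name to_number is not from the original notebook):
# Sketch: convert Yahoo-style abbreviated numbers ('1.183T', '263.261B', '16.331M') to floats.
def to_number(s):
    multipliers = {'T': 1e12, 'B': 1e9, 'M': 1e6, 'k': 1e3}
    s = str(s)
    if s and s[-1] in multipliers:
        return float(s[:-1]) * multipliers[s[-1]]
    return float(s)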
In [22]:
# Check the type of variables
columns = ['Change', 'Price', 'Mkt_Cap', 'PE_Ratio',
'PEG_Ratio', 'PriceSales', 'PriceBook',
'EV', 'EVRevenue',
'EV/EBITDA', 'Payout_Ratio', 'Profit_Margin',
'Operating_Margin', 'ROA', 'ROE',
'Revenue', 'RevenueShare', 'Gross_Profit', 'EBITDA',
'NItoCommon', 'Diluted_EPS',
'Earnings_Growth_Q']
for i in columns:
print(type(data[i].values[0]), i)
In [23]:
for c in ['Mkt_Cap', 'EV', 'Revenue', 'Gross_Profit', 'EBITDA', 'NItoCommon']:
    data[c] = (data[c].astype(str)
                      .str.replace("T", "e+12")
                      .str.replace("B", "e+9")
                      .str.replace("M", "e+6")
                      .str.replace("k", "e+3")
                      .astype(float))
data['ROE'] = data['ROE'].str.replace(",", "")
data['Earnings_Growth_Q'] = data['Earnings_Growth_Q'].str.replace(",", "")
col = ['Change', 'Profit_Margin', 'Payout_Ratio', 'Operating_Margin', 'ROA', 'ROE', 'Earnings_Growth_Q']
for c in col:
    data[c] = data[c].str.replace('%', '').astype(float)
<class 'str'> Change
<class 'numpy.float64'> Price
<class 'str'> Mkt_Cap
<class 'numpy.float64'> PE_Ratio
<class 'numpy.float64'> PEG_Ratio
<class 'numpy.float64'> PriceSales
<class 'numpy.float64'> PriceBook
<class 'str'> EV
<class 'numpy.float64'> EVRevenue
<class 'numpy.float64'> EV/EBITDA
<class 'str'> Payout_Ratio
<class 'str'> Profit_Margin
<class 'str'> Operating_Margin
<class 'str'> ROA
<class 'str'> ROE
<class 'str'> Revenue
<class 'numpy.float64'> RevenueShare
<class 'str'> Gross_Profit
<class 'str'> EBITDA
<class 'str'> NItoCommon
<class 'numpy.float64'> Diluted_EPS
<class 'str'> Earnings_Growth_Q
In [24]:
for i in columns:
print(type(data[i].values[0]), i)
<class 'numpy.float64'> Change
<class 'numpy.float64'> Price
<class 'numpy.float64'> Mkt_Cap
<class 'numpy.float64'> PE_Ratio
<class 'numpy.float64'> PEG_Ratio
<class 'numpy.float64'> PriceSales
<class 'numpy.float64'> PriceBook
<class 'numpy.float64'> EV
<class 'numpy.float64'> EVRevenue
<class 'numpy.float64'> EV/EBITDA
<class 'numpy.float64'> Payout_Ratio
<class 'numpy.float64'> Profit_Margin
<class 'numpy.float64'> Operating_Margin
<class 'numpy.float64'> ROA
<class 'numpy.float64'> ROE
<class 'numpy.float64'> Revenue
<class 'numpy.float64'> RevenueShare
<class 'numpy.float64'> Gross_Profit
<class 'numpy.float64'> EBITDA
<class 'numpy.float64'> NItoCommon
<class 'numpy.float64'> Diluted_EPS
<class 'numpy.float64'> Earnings_Growth_Q
In [25]:
# Creating a binary variable based on data['EV/EBITDA'].mean()
health = []
print('The industrial average of EV/EBITDA is', data['EV/EBITDA'].mean())
for i in data['EV/EBITDA']:
    if i > data['EV/EBITDA'].mean():
        h = 1
    else:
        h = 0
    health.append(h)
data['Health'] = health
# Since the new categorical variable was created from 'EV/EBITDA',
# the original column is dropped from the dataframe.
del data['EV/EBITDA']
In [26]:
# Drop NAs
data = data.dropna()
print('After the data cleaning and manipulation steps, the dataframe used in the model has', data.shape[0],
      'observations (companies) with', data.shape[1] - 1, 'features.')
In [27]:
a = data['Health'] == 1
print('The number of observations defined as healthy is', a.sum())
The industrial average of EV/EBITDA is 7.904972826086954
After the data cleaning and manipulation steps, the dataframe used in the model has 186 observations (companies) with 21 features.
The number of observations defined as healthy is 170
In [28]:
data.to_csv('data.csv')
Data Visualization
Correlation Table (for Numeric Variables)
In [2]:
data = pd.read_csv('data.csv')
data = data.set_index('Symbol')
data.head()
Out[2]:
        Change   Price       Mkt_Cap  PE_Ratio  PEG_Ratio  PriceSales  ...
Symbol
AAPL     -0.09  261.78  1.183000e+12     22.02       2.04        4.55  ...
MSFT      0.07  149.59  1.141000e+12     28.22       1.91        8.79  ...
INTC     -1.05   57.61  2.506040e+11     13.49       1.79        3.56  ...
CSCO      0.02   44.85  1.902650e+11     17.85       1.97        3.66  ...
ORCL      0.28   56.39  1.851010e+11     18.46       1.45        4.68  ...
5 rows × 22 columns
In [3]:
# Excluding the binary variable (Health)
variables = ['Change', 'Price', 'Mkt_Cap', 'PE_Ratio', 'PEG_Ratio', 'PriceSales', 'PriceBook',
             'EV', 'EVRevenue', 'Payout_Ratio', 'Profit_Margin', 'Operating_Margin',
             'ROA', 'ROE', 'Revenue', 'RevenueShare', 'Gross_Profit', 'EBITDA',
             'NItoCommon', 'Diluted_EPS', 'Earnings_Growth_Q']
In [4]:
# The correlation table ordered by its magnitude
correl = data.loc[:,variables].corr()
correl[:1].T.sort_values(by=['Change'], ascending=False)
Out[4]:
Change
Change 1.000000
PE_Ratio 0.094638
Earnings_Growth_Q 0.042029
PriceSales 0.025884
EVRevenue 0.019835
Profit_Margin 0.017111
Operating_Margin -0.004450
Gross_Profit -0.008711
EBITDA -0.009072
EV -0.009756
Mkt_Cap -0.010124
NItoCommon -0.010922
PriceBook -0.014621
PEG_Ratio -0.023386
Revenue -0.023475
Price -0.027291
Diluted_EPS -0.059097
ROE -0.083297
RevenueShare -0.095545
Payout_Ratio -0.097380
ROA -0.166271
Table Summary
The correlation table for the numeric variables shows positive relationships for PE_Ratio, Earnings_Growth_Q, PriceSales, EVRevenue, and Profit_Margin, in that order. The negative relationships, from weakest to strongest, are Operating_Margin, Gross_Profit, EBITDA, EV, Mkt_Cap, NItoCommon, PriceBook, PEG_Ratio, Revenue, Price, Diluted_EPS, ROE, RevenueShare, Payout_Ratio, and ROA. All the magnitudes are small (|r| < 0.17).
In [5]:
plt.figure(figsize=(15,15))
plt.imshow(correl)  # show the correlation matrix as an image
plt.colorbar()
# Set the labels on the axes (requires the list of variables created above)
plt.xticks(range(21), variables, rotation='vertical')
plt.yticks(range(21), variables)
Out[5]:
[Figure: heatmap of the correlation matrix for the 21 numeric variables]
Histogram of dependent variable
In [6]:
plt.hist(data['Change'], bins=20)
plt.title('Histogram of % Change in Stock Price')
plt.xlabel('% Change in Stock Price')
The histogram of Change indicates that the dependent variable (Y) is approximately normally distributed.
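A quick numeric check of this impression could use D'Agostino's normality test from scipy.stats (a sketch; not part of the original notebook):
# Sketch: test the null hypothesis that 'Change' is drawn from a normal distribution.
from scipy import stats
stat, p = stats.normaltest(data['Change'])
print('normaltest p-value:', p)  # a large p-value gives no evidence against normality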
Out[6]:
Text(0.5, 0, '% Change in Stock Price')
Example for Plots of independent variables against Y
In [7]:
# Enterprise Value/Revenue
plt.scatter(data['EVRevenue'], data['Change'])
plt.xlabel('Enterprise Value/Revenue')
plt.ylabel('% Change in Stock Price')
plt.title('Enterprise Value/Revenue vs Change')
The plot above shows no obvious upward or downward trend. However, there is a slightly non-linear relationship between this feature and the response. Hence, if this variable turns out to be statistically significant in the model, a polynomial term of EV/Revenue will be added in order to improve the model.
Out[7]:
Text(0.5, 1.0, 'Enterprise Value/Revenue vs Change')
In [8]:
# Payout Ratio (%)
plt.figure(figsize=(15,5))
plt.subplot(121)
plt.scatter(data['Payout_Ratio'], data['Change'])
plt.xlabel('Payout Ratio (%)')
plt.ylabel('% Change in Stock Price')
plt.title('Payout_Ratio vs Change')
plt.subplot(122)
plt.scatter(np.log(data['Payout_Ratio']), data['Change'])
plt.xlabel('Log of Payout Ratio (%)')
plt.ylabel('% Change in Stock Price')
plt.title('log(Payout_Ratio) vs Change')
In the left panel, the distribution is dense where the payout ratio (%) is below 200. After applying a logarithm to the observations, the scatter plot (right panel) shows a slightly negative relationship, and no non-linear relationship is detected. Taking a log() in the model might improve it. Unfortunately, the log() produces infinite values for some observations (e.g., zero payout ratios). Dropping more observations from the roughly 190 available would potentially reduce the model's accuracy, so this variable is kept as it is.
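If the log transform were still desired, np.log1p would avoid the infinities at zero (a sketch; this was not done in the original analysis, and the new column name log_Payout is illustrative):
# Sketch: log(1 + x) is finite at x = 0, so no observations are lost to -inf.
data['log_Payout'] = np.log1p(data['Payout_Ratio'].clip(lower=0))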
Out[8]:
Text(0.5, 1.0, 'log(Payout_Ratio) vs Change')
In [9]:
# Operating Margin (%)
plt.scatter(data['Operating_Margin'], data['Change'])
plt.xlabel('Operating Margin (%)')
plt.ylabel('% Change in Stock Price')
plt.title('Operating_Margin vs Change')
The plot above shows a slight positive relationship between Operating Margin (%) and
% change in stock price.
Out[9]:
Text(0.5, 1.0, 'Operating_Margin vs Change')
In [10]:
# ROA (%)
plt.scatter(data['ROA'], data['Change'])
plt.xlabel('Return on Asset (%)')
plt.ylabel('% Change in Stock Price')
plt.title('ROA (%) vs Change')
The scatter plot of ROA (%) and % Change in stock price shows a negative
relationship with no evidence of non-linearity.
Out[10]:
Text(0.5, 1.0, 'ROA (%) vs Change')
The Boxplot (for Binary Variable)
In [11]:
plt.figure(figsize=(5,5))
sns.boxplot(data['Health'],data['Change'])
The boxplot of the binary variable Health shows a slight difference in the median and the middle 50% of observations between the two groups: the 'healthy' companies (Health = 1) sit slightly higher and have a wider spread (longer whiskers). Hence, this variable is potentially statistically significant in the model.
Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c2410a438>
====================================================================
4. Predictive Modeling
The Multicollinearity Among Variables (VIF)
Calculating the variance inflation factor
*source
(https://etav.github.io/python/vif_factor_python.html)
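For reference, the VIF of feature $j$ is computed from the $R^2$ of regressing that feature on all the other features:
$\mathrm{VIF}_j = \frac{1}{1 - R_j^2}$
A common rule of thumb, used below, flags VIF > 10 as serious multicollinearity.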
In [37]:
y, X_vif = dmatrices('Change ~ Price + Mkt_Cap + PEG_Ratio + PE_Ratio + PriceSales + PriceBook + EV + EVRevenue + Payout_Ratio + Profit_Margin + Operating_Margin + ROA + ROE + Revenue + RevenueShare + Gross_Profit + EBITDA + NItoCommon + Earnings_Growth_Q + Health', data=data, return_type='dataframe')
# For each X, calculate the VIF and save it in a dataframe
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X_vif.values, i) for i in range(X_vif.shape[1])]
vif["features"] = X_vif.columns
vif.round(2).set_index('features')
Out[37]:
VIF Factor
features
Intercept 18.79
Price 2.00
Mkt_Cap 1423.99
PEG_Ratio 1.06
PE_Ratio 1.91
PriceSales 82.42
PriceBook 5.48
EV 1611.13
EVRevenue 95.27
Payout_Ratio 1.12
Profit_Margin 4.02
Operating_Margin 7.23
ROA 3.72
ROE 5.17
Revenue 31.00
RevenueShare 1.82
Gross_Profit 52.42
EBITDA 197.00
NItoCommon 223.06
Earnings_Growth_Q 1.38
Health 1.29
From the result above, there is high multicollinearity (VIF > 10) among Mkt_Cap, Price/Sales, EV, EV/Revenue, Revenue, Gross_Profit, EBITDA, and NItoCommon; these variables are not independent of one another. In the next step, the subset selection method will help filter the highly correlated and unnecessary variables out of the model.
Best Subset Selection Method
Since there are many variables (high dimensionality) with multicollinearity in the data set, including all of them may lead to high variance in the model. To reduce the model variance, this method selects the set of variables that yields the highest Adjusted R squared by minimizing the RSS.
* source
(http://www.science.smith.edu/~jcrouser/SDS293/labs/lab8-
py.html)
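For a sense of the search size: with $p = 20$ candidate predictors, exhaustive best-subset selection fits $\binom{20}{k}$ models for each subset size $k$. For $k = 1, \dots, 9$ as run below, that is $20 + 190 + 1{,}140 + 4{,}845 + 15{,}504 + 38{,}760 + 77{,}520 + 125{,}970 + 167{,}960 = 431{,}909$ OLS fits, which accounts for the roughly 27-minute total runtime reported below.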
In [13]:
y = data.Change
X = data[['Price', 'Mkt_Cap', 'PEG_Ratio', 'PE_Ratio', 'PriceSales', 'PriceBook', 'EV', 'EVRevenue',
          'Payout_Ratio', 'Profit_Margin', 'Operating_Margin', 'ROA', 'ROE',
          'Revenue', 'RevenueShare', 'Gross_Profit', 'EBITDA', 'NItoCommon',
          'Earnings_Growth_Q', 'Health']]
X.head()
In [14]:
def processSubset(feature_set):
# Fit model on feature_set and calculate RSS
model = sm.OLS(y,X[list(feature_set)])
regr = model.fit()
RSS = ((regr.predict(X[list(feature_set)]) - y) ** 2).sum()
return {"model":regr, "RSS":RSS}
Out[13]:
         Price       Mkt_Cap  PEG_Ratio  PE_Ratio  PriceSales  PriceBook  ...
Symbol
AAPL    261.78  1.183000e+12       2.04     22.02        4.55      12.85  ...
MSFT    149.59  1.141000e+12       1.91     28.22        8.79      10.77  ...
INTC     57.61  2.506040e+11       1.79     13.49        3.56       3.38  ...
CSCO     44.85  1.902650e+11       1.97     17.85        3.66       5.53  ...
ORCL     56.39  1.851010e+11       1.45     18.46        4.68      10.08  ...
In [15]:
def getBest(k):
    tic = time.time()
    results = []
    for combo in itertools.combinations(X.columns, k):
        results.append(processSubset(combo))
    models = pd.DataFrame(results)
    # Choose the model with the lowest RSS
    best_model = models.loc[models['RSS'].argmin()]
    toc = time.time()
    print("Processed", models.shape[0], "models on", k, "predictors in", (toc - tic), "seconds.")
    return best_model
In [17]:
models_best = pd.DataFrame(columns=['RSS', 'model'])
tic = time.time()
for i in range(1, 10):
    models_best.loc[i] = getBest(i)
toc = time.time()
print("Total elapsed time:", (toc - tic) / 60, "minutes.")
Processed 20 models on 1 predictors in 0.06646728515625 seconds.
Processed 190 models on 2 predictors in 0.4089689254760742 seconds.
Processed 1140 models on 3 predictors in 2.5466723442077637 seconds.
Processed 4845 models on 4 predictors in 11.260601043701172 seconds.
Processed 15504 models on 5 predictors in 34.69612693786621 seconds.
Processed 38760 models on 6 predictors in 88.06672596931458 seconds.
Processed 77520 models on 7 predictors in 613.4514439105988 seconds.
Processed 125970 models on 8 predictors in 403.8603241443634 seconds.
Processed 167960 models on 9 predictors in 419.7190179824829 seconds.
Total elapsed time: 26.632335432370503 minutes.
In [18]:
models_best
Out[18]:
RSS model
1 282.423717 <statsmodels.regression.linear_model.Regressio...
2 273.200854 <statsmodels.regression.linear_model.Regressio...
3 270.210182 <statsmodels.regression.linear_model.Regressio...
4 265.668865 <statsmodels.regression.linear_model.Regressio...
5 264.720145 <statsmodels.regression.linear_model.Regressio...
6 260.887309 <statsmodels.regression.linear_model.Regressio...
7 257.760233 <statsmodels.regression.linear_model.Regressio...
8 256.530992 <statsmodels.regression.linear_model.Regressio...
9 255.976209 <statsmodels.regression.linear_model.Regressio...
In [19]:
plt.figure(figsize=(20,10))
plt.rcParams.update({'font.size': 18, 'lines.markersize': 10})
plt.subplot(2, 2, 1)
plt.plot(models_best["RSS"])
plt.xlabel('# Predictors')
plt.ylabel('RSS')
# Adjusted R squared Plot
rsquared_adj = models_best.apply(lambda row: row[1].rsquared_adj, axis=1)
plt.subplot(2, 2, 2)
plt.plot(rsquared_adj)
plt.plot(rsquared_adj.argmax(), rsquared_adj.max(), "or")
plt.xlabel('# Predictors')
plt.ylabel('adjusted rsquared')
# AIC Plot
aic = models_best.apply(lambda row: row[1].aic, axis=1)
plt.subplot(2, 2, 3)
plt.plot(aic)
plt.plot(aic.argmin(), aic.min(), "or")
plt.xlabel('# Predictors')
plt.ylabel('AIC')
# BIC plot
bic = models_best.apply(lambda row: row[1].bic, axis=1)
plt.subplot(2, 2, 4)
plt.plot(bic)
plt.plot(bic.argmin(), bic.min(), "or")
plt.xlabel('# Predictors')
plt.ylabel('BIC')
Out[19]:
Text(0, 0.5, 'BIC')
In [20]:
print('The model has the highest Adjusted R squared at', '{0:.4f}'.format(models_best.loc[7, "model"].rsquared_adj), 'when it has', rsquared_adj.argmax(), 'variables')
print('The model has the lowest AIC at', '{0:.4f}'.format(models_best.loc[7, "model"].aic), 'when it has', aic.argmin(), 'variables')
print('The model has the lowest BIC at', '{0:.4f}'.format(models_best.loc[7, "model"].bic), 'when it has', bic.argmin(), 'variables')
The model has the highest Adjusted R squared at 0.0637 when it has 7 variables
The model has the lowest AIC at 602.5338 when it has 4 variables
The model has the lowest BIC at 625.1140 when it has 2 variables
In [21]:
print('*** 7 variable-model ***')
print(models_best.loc[7, "model"].params)
print('')
print('*** 4 variable-model ***')
print(models_best.loc[4, "model"].params)
print('')
print('*** 2 variable-model ***')
print(models_best.loc[2, "model"].params)
*** 7 variable-model ***
PriceSales 0.283440
PriceBook 0.030632
EVRevenue -0.358678
Payout_Ratio -0.001987
Operating_Margin 0.047336
ROA -0.148925
Health 0.480233
dtype: float64
*** 4 variable-model ***
Payout_Ratio -0.001794
Operating_Margin 0.022919
ROA -0.100080
Health 0.350337
dtype: float64
*** 2 variable-model ***
Operating_Margin 0.030559
ROA -0.083739
dtype: float64
Criteria
The subset-selection models were computed by minimizing the RSS.
Adjusted R squared:
According to the Adjusted R squared formula (shown with the other formulas below), a smaller RSS yields a higher Adjusted R squared. As a result of the subset-selection method, the 7-variable model yields the highest Adjusted R squared, 0.064; its variables are Price/Sales, Price/Book, EV/Revenue, Payout Ratio, Operating Margin, ROA, and Health. Five of these variables are statistically significant at p-value < 0.05, and the R squared is 0.099. (The result is shown below.)
AIC
From the result, the AIC criterion yields the model with 4 variables: Payout Ratio, Operating Margin, ROA, and Health. However, only one variable (ROA) is statistically significant at the 95% confidence level, with an R squared of 0.071.
BIC
Since the BIC criterion is more restrictive (it carries the heavier penalty term shown below), it yields a smaller model with two significant variables, Operating_Margin and ROA. The R squared is 0.045.
Criteria       # of Optimal Variables   R²      Adjusted R²
Adjusted R²    7                        0.099   0.064
AIC            4                        0.071   0.051
BIC            2                        0.045   0.035
To conclude, based on the Adjusted R squared criterion, the optimal model is the OLS model with 7 variables, shown below.
Formulas referenced above:
$\binom{p}{k} = \frac{p!}{k!\,(p-k)!}$ (the number of size-$k$ subsets of $p$ predictors)
$\text{Adjusted } R^2 = 1 - \frac{RSS/(n-p-1)}{TSS/(n-1)}$
BIC penalty term: $\log(n)\,p\,\hat{\sigma}^2$
In [22]:
models_best.loc[7, "model"].summary()
Out[22]:
OLS Regression Results
==============================================================================
Dep. Variable:                 Change   R-squared (uncentered):          0.099
Model:                            OLS   Adj. R-squared (uncentered):     0.064
Method:                 Least Squares   F-statistic:                     2.807
Date:                Wed, 27 Nov 2019   Prob (F-statistic):            0.00851
Time:                        00:10:37   Log-Likelihood:                -294.27
No. Observations:                 186   AIC:                             602.5
Df Residuals:                     179   BIC:                             625.1
Df Model:                           7
Covariance Type:            nonrobust
coef std err t P>|t| [0.025 0.975]
PriceSales 0.2834 0.152 1.860 0.065 -0.017 0.584
PriceBook 0.0306 0.021 1.474 0.142 -0.010 0.072
EVRevenue -0.3587 0.169 -2.121 0.035 -0.692 -0.025
Payout_Ratio -0.0020 0.001 -2.037 0.043 -0.004 -6.17e-05
Operating_Margin 0.0473 0.017 2.718 0.007 0.013 0.082
ROA -0.1489 0.037 -3.992 0.000 -0.223 -0.075
Health 0.4802 0.204 2.358 0.019 0.078 0.882
Omnibus: 8.154 Durbin-Watson: 1.957
Prob(Omnibus): 0.017 Jarque-Bera (JB): 14.803
Skew: -0.101 Prob(JB): 0.000610
Kurtosis: 4.367 Cond. No. 272.
The Optimal Linear Model (7 Variables)
In [3]:
y = data.Change
X_7 = dmatrix('1 + PriceSales + PriceBook + EVRevenue + Payout_Ratio + Operating_Margin + ROA + Health', data=data)
m = sm.OLS(y, X_7)
m.data.xnames = X_7.design_info.column_names
m = m.fit()
print(m.summary())
OLS Regression Results
==============================================================================
Dep. Variable:                 Change   R-squared:                       0.098
Model:                            OLS   Adj. R-squared:                  0.063
Method:                 Least Squares   F-statistic:                     2.774
Date:                Thu, 28 Nov 2019   Prob (F-statistic):            0.00924
Time:                        21:50:30   Log-Likelihood:                -294.25
No. Observations:                 186   AIC:                             604.5
Df Residuals:                     178   BIC:                             630.3
Df Model:                           7
Covariance Type:            nonrobust
====================================================================================
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
Intercept            0.0664      0.326      0.204      0.839      -0.576       0.709
PriceSales           0.2852      0.153      1.863      0.064      -0.017       0.587
PriceBook            0.0310      0.021      1.481      0.140      -0.010       0.072
EVRevenue           -0.3609      0.170     -2.124      0.035      -0.696      -0.026
Payout_Ratio        -0.0020      0.001     -2.041      0.043      -0.004   -6.67e-05
Operating_Margin     0.0474      0.017      2.713      0.007       0.013       0.082
ROA                 -0.1509      0.039     -3.909      0.000      -0.227      -0.075
Health               0.4269      0.332      1.286      0.200      -0.228       1.082
==============================================================================
Omnibus:                        8.080   Durbin-Watson:                   1.953
Prob(Omnibus):                  0.018   Jarque-Bera (JB):               14.644
Skew:                          -0.096   Prob(JB):                     0.000661
Kurtosis:                       4.361   Cond. No.                         493.
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
VIF After The Subset Selection
In [36]:
y, X_vif = dmatrices('Change ~ PriceSales + PriceBook + EVRevenue + Payout_Ratio + Operating_Margin + ROA + Health', data=data, return_type='dataframe')
# For each X, calculate the VIF and save it in a dataframe
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X_vif.values, i) for i in range(X_vif.shape[1])]
vif["features"] = X_vif.columns
vif.round(2).set_index('features')
As a result, the variables with the highest VIFs have been eliminated. Even though the model still has some multicollinearity between two variables, Price/Sales and EV/Revenue, it is moderately acceptable.
Out[36]:
VIF Factor
features
Intercept 13.63
PriceSales 44.86
PriceBook 2.12
EVRevenue 51.57
Payout_Ratio 1.07
Operating_Margin 3.46
ROA 2.95
Health 1.11
The Non-Linear Model with Polynomial Term
Based on the data visualization above, there is evidence that EV/Revenue could have a non-linear relationship with the response.
Below, a model with a polynomial term is fitted, along with the other selected variables.
In [25]:
y = data.Change
X_new = dmatrix('1 + PriceSales + PriceBook + EVRevenue + I(EVRevenue**2) + Payout_Ratio + Operating_Margin + ROA + Health', data=data)
m_new = sm.OLS(y, X_new)
m_new.data.xnames = X_new.design_info.column_names
m_new = m_new.fit()
print(m_new.summary())
OLS Regression Results
==============================================================================
Dep. Variable:                 Change   R-squared:                       0.115
Model:                            OLS   Adj. R-squared:                  0.075
Method:                 Least Squares   F-statistic:                     2.872
Date:                Wed, 27 Nov 2019   Prob (F-statistic):            0.00499
Time:                        00:10:37   Log-Likelihood:                -292.52
No. Observations:                 186   AIC:                             603.0
Df Residuals:                     177   BIC:                             632.1
Df Model:                           8
Covariance Type:            nonrobust
=====================================================================================
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
Intercept            -0.0138      0.327     -0.042      0.966      -0.658       0.631
PriceSales            0.3043      0.152      1.996      0.047       0.003       0.605
PriceBook             0.0297      0.021      1.431      0.154      -0.011       0.071
EVRevenue            -0.2612      0.178     -1.472      0.143      -0.612       0.089
I(EVRevenue ** 2)    -0.0067      0.004     -1.819      0.071      -0.014       0.001
Payout_Ratio         -0.0019      0.001     -1.908      0.058      -0.004    6.45e-05
Operating_Margin      0.0413      0.018      2.337      0.021       0.006       0.076
ROA                  -0.1474      0.038     -3.839      0.000      -0.223      -0.072
Health                0.2658      0.341      0.778      0.437      -0.408       0.940
==============================================================================
Omnibus:                        7.753   Durbin-Watson:                   1.932
Prob(Omnibus):                  0.021   Jarque-Bera (JB):               14.037
Skew:                          -0.061   Prob(JB):                     0.000895
Kurtosis:                       4.340   Cond. No.                         496.
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Compared with the linear model, this model with the polynomial term has a higher Adjusted R squared, 0.075 (> 0.063), and an R squared of 0.115. This means the polynomial term of EV/Revenue improves how much of the variation in the dependent variable (% Change in stock price) is explained. However, the two models will also be evaluated by cross-validation to compare their predictive performance.
Prediction Accuracy Between Models
Cross-Validation: using random-sample cross-validation with an 80:20 partition and random.seed(1) to validate the models' predictive power. An equivalent split with scikit-learn is sketched below.
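For comparison, the same kind of 80:20 split could be produced with scikit-learn (a sketch; the original notebook uses the manual random.sample approach in the next cell, and the two seeds do not produce identical partitions):
# Sketch: 80:20 split with a fixed seed, assuming scikit-learn is available.
from sklearn.model_selection import train_test_split
train_df, test_df = train_test_split(data, test_size=0.2, random_state=1)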
In [26]:
# Create a training set and a testing set
random.seed(1)
train = random.sample(range(0, len(data)), round(len(data) * 0.8))
test = []
for n in range(0, len(data)):
    if n not in train:
        test.append(n)
y_training = data['Change'].iloc[train]
x_training = data[['PriceSales', 'PriceBook', 'EVRevenue', 'Payout_Ratio', 'Operating_Margin', 'ROA', 'Health']].iloc[train]
y_testing = data['Change'].iloc[test]
In [27]:
# Fit the best-subset model (7 variables) on the training set
y = y_training
X_7 = dmatrix('1 + PriceSales + PriceBook + EVRevenue + Payout_Ratio + Operating_Margin + ROA + Health', data=x_training)
m_7_cv = sm.OLS(y, X_7)
m_7_cv.data.xnames = X_7.design_info.column_names
m_7_cv = m_7_cv.fit()
print(m_7_cv.summary())
OLS Regression Results
==============================================================================
Dep. Variable:                 Change   R-squared:                       0.122
Model:                            OLS   Adj. R-squared:                  0.079
Method:                 Least Squares   F-statistic:                     2.802
Date:                Wed, 27 Nov 2019   Prob (F-statistic):            0.00922
Time:                        00:10:37   Log-Likelihood:                -227.28
No. Observations:                 149   AIC:                             470.6
Df Residuals:                     141   BIC:                             494.6
Df Model:                           7
Covariance Type:            nonrobust
====================================================================================
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
Intercept           -0.0203      0.329     -0.061      0.951      -0.672       0.631
PriceSales           0.1883      0.155      1.213      0.227      -0.119       0.495
PriceBook            0.0267      0.023      1.167      0.245      -0.018       0.072
EVRevenue           -0.2841      0.170     -1.672      0.097      -0.620       0.052
Payout_Ratio        -0.0023      0.001     -2.375      0.019      -0.004      -0.000
Operating_Margin     0.0370      0.018      2.008      0.047       0.001       0.073
ROA                 -0.1325      0.040     -3.283      0.001      -0.212      -0.053
Health               0.7233      0.344      2.102      0.037       0.043       1.403
==============================================================================
Omnibus:                        8.706   Durbin-Watson:                   1.839
Prob(Omnibus):                  0.013   Jarque-Bera (JB):               15.574
Skew:                           0.193   Prob(JB):                     0.000415
Kurtosis:                       4.536   Cond. No.                         522.
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [28]:
# Fit the model with the polynomial term (EV/Revenue^2) on the training set
y = y_training
X_new = dmatrix('1 + PriceSales + PriceBook + EVRevenue + I(EVRevenue**2) + Payout_Ratio + Operating_Margin + ROA + Health', data=x_training)
m_new_cv = sm.OLS(y, X_new)
m_new_cv.data.xnames = X_new.design_info.column_names
m_new_cv = m_new_cv.fit()
print(m_new_cv.summary())
OLS Regression Results
==============================================================================
Dep. Variable:                 Change   R-squared:                       0.145
Model:                            OLS   Adj. R-squared:                  0.096
Method:                 Least Squares   F-statistic:                     2.961
Date:                Wed, 27 Nov 2019   Prob (F-statistic):            0.00431
Time:                        00:10:37   Log-Likelihood:                -225.34
No. Observations:                 149   AIC:                             468.7
Df Residuals:                     140   BIC:                             495.7
Df Model:                           8
Covariance Type:            nonrobust
=====================================================================================
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
Intercept            -0.1252      0.331     -0.378      0.706      -0.779       0.529
PriceSales            0.2243      0.155      1.448      0.150      -0.082       0.531
PriceBook             0.0196      0.023      0.854      0.394      -0.026       0.065
EVRevenue            -0.1754      0.178     -0.988      0.325      -0.526       0.176
I(EVRevenue ** 2)    -0.0085      0.004     -1.923      0.057      -0.017       0.000
Payout_Ratio         -0.0021      0.001     -2.222      0.028      -0.004      -0.000
Operating_Margin      0.0304      0.019      1.633      0.105      -0.006       0.067
ROA                  -0.1268      0.040     -3.163      0.002      -0.206      -0.048
Health                0.5493      0.353      1.558      0.122      -0.148       1.246
==============================================================================
Omnibus:                        9.464   Durbin-Watson:                   1.856
Prob(Omnibus):                  0.009   Jarque-Bera (JB):               18.892
Skew:                           0.166   Prob(JB):                     7.90e-05
Kurtosis:                       4.713   Cond. No.                         524.
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Mean Squared Error
$MSE = \frac{1}{n}\sum_{i=1}^{n} (\hat{y}_i - y_i)^2$
In [29]:
# Calculate the test MSEs
x_testing = dmatrix('1 + PriceSales + PriceBook + EVRevenue + Payout_Ratio + Operating_Margin + ROA + Health', data=data.iloc[test])
predicted_7 = m_7_cv.predict(x_testing)
x_testing = dmatrix('1 + PriceSales + PriceBook + EVRevenue + I(EVRevenue**2) + Payout_Ratio + Operating_Margin + ROA + Health', data=data.iloc[test])
predicted_new = m_new_cv.predict(x_testing)
mse = pd.DataFrame()
mse['Actual Value'] = y_testing
mse['Predicted Value (m_7)'] = predicted_7
mse['Predicted Value (m_new)'] = predicted_new
mse['Squared Error (m_7)'] = (mse['Predicted Value (m_7)'] - mse['Actual Value'])**2
mse['Squared Error (m_new)'] = (mse['Predicted Value (m_new)'] - mse['Actual Value'])**2
MSE_7 = mse['Squared Error (m_7)'].sum() / len(mse)
MSE_new = mse['Squared Error (m_new)'].sum() / len(mse)
In [31]:
mse.T
Out[31]:
Symbol                        ACN       AVGO       IBM        NOW        MU        AMD  ...
Actual Value            -0.060000  -0.100000  0.370000   0.310000  0.700000  -0.940000  ...
Predicted Value (m_7)   -0.780854  -0.253737  0.194448   0.138732 -0.359416   0.184337  ...
Predicted Value (m_new) -0.790585  -0.126589  0.075529  -0.300076 -0.351424   0.357285  ...
Squared Error (m_7)      0.519630   0.023635  0.030819   0.029333  1.122362   1.264133  ...
Squared Error (m_new)    0.533755   0.000707  0.086713   0.372193  1.105493   1.682949  ...
5 rows × 37 columns
In [32]:
print('The model test MSE for the linear model with 7 variables is', MSE_7)
print('The model test MSE for the model with the polynomial term is', MSE_new)
The model test MSE for the linear model with 7 variables is 2.1063194289258402
The model test MSE for the model with the polynomial term is 2.154634174218139
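Equivalently, each test MSE could be computed in one line with NumPy (a sketch of the same calculation):
# Sketch: mean of the squared errors on the held-out set.
MSE_7 = np.mean((predicted_7 - y_testing.values) ** 2)
MSE_new = np.mean((predicted_new - y_testing.values) ** 2)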
The Optimal Model Recall
According to the MSE values above, the model with the lower test error, the linear model with 7 selected variables, is recalled below.
In [4]:
print(m.summary())
OLS Regression Results
==============================================================================
Dep. Variable:                 Change   R-squared:                       0.098
Model:                            OLS   Adj. R-squared:                  0.063
Method:                 Least Squares   F-statistic:                     2.774
Date:                Thu, 28 Nov 2019   Prob (F-statistic):            0.00924
Time:                        21:51:37   Log-Likelihood:                -294.25
No. Observations:                 186   AIC:                             604.5
Df Residuals:                     178   BIC:                             630.3
Df Model:                           7
Covariance Type:            nonrobust
====================================================================================
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
Intercept            0.0664      0.326      0.204      0.839      -0.576       0.709
PriceSales           0.2852      0.153      1.863      0.064      -0.017       0.587
PriceBook            0.0310      0.021      1.481      0.140      -0.010       0.072
EVRevenue           -0.3609      0.170     -2.124      0.035      -0.696      -0.026
Payout_Ratio        -0.0020      0.001     -2.041      0.043      -0.004   -6.67e-05
Operating_Margin     0.0474      0.017      2.713      0.007       0.013       0.082
ROA                 -0.1509      0.039     -3.909      0.000      -0.227      -0.075
Health               0.4269      0.332      1.286      0.200      -0.228       1.082
==============================================================================
Omnibus:                        8.080   Durbin-Watson:                   1.953
Prob(Omnibus):                  0.018   Jarque-Bera (JB):               14.644
Skew:                          -0.096   Prob(JB):                     0.000661
Kurtosis:                       4.361   Cond. No.                         493.
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Regression Diagnosis
*Source (https://robert-alvarez.github.io/2018-06-04-diagnostic_plots/)
In [5]:
# Residual Plot
sns.residplot(m.fittedvalues, 'Change', data=data, lowess=True,
              scatter_kws={'alpha': 0.5},
              line_kws={'color': 'red', 'lw': 1, 'alpha': 0.8})
plt.title('Residuals vs Fitted')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
The residuals-vs-fitted plot shows some nonlinearity that this linear model could not capture.
Out[5]:
Text(0, 0.5, 'Residuals')
In [6]:
# Normal Q-Q plot
sm.qqplot(m.resid, line='45', color='cornflowerblue', alpha=0.6)
plt.title('Normal Q-Q')
plt.xlabel('Theoretical Quantiles')
plt.ylabel('Standardized Residuals')
The Q-Q plot indicates that approximately 85% or more of the residuals align along the line, suggesting the errors are approximately normally distributed.
Out[6]:
Text(0, 0.5, 'Standardized Residuals')
In [7]:
# Scale-Location Plot
norm_res_abs_sqrt = np.sqrt(np.abs(m.get_influence().resid_studentized_internal))
plt.scatter(m.fittedvalues, norm_res_abs_sqrt, alpha=0.5);
sns.regplot(m.fittedvalues, norm_res_abs_sqrt, scatter=False, ci=False, lowess=True,
            line_kws={'color': 'red', 'lw': 1, 'alpha': 0.8});
plt.xlabel('Fitted values')
plt.ylabel('$\sqrt{|Standardized Residuals|}$')
The scale-location plot shows a slightly uneven cloud of residuals; this model might suffer from heteroscedasticity.
Out[7]:
Text(0, 0.5, '$\sqrt{|Standardized Residuals|}$')
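A formal check for this could use the Breusch-Pagan test from statsmodels (a sketch; not part of the original diagnosis, and it assumes X_7 is the full-data design matrix used to fit m):
# Sketch: Breusch-Pagan test; a small p-value suggests the residual
# variance depends on the regressors (heteroscedasticity).
from statsmodels.stats.diagnostic import het_breuschpagan
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(m.resid, X_7)
print('Breusch-Pagan LM p-value:', lm_pvalue)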
In [8]:
# Residuals and Leverage
leverage = m.get_influence().hat_matrix_diag
norm_res = m.get_influence().resid_studentized_internal
plt.scatter(leverage, norm_res, alpha=0.5);
sns.regplot(leverage, norm_res, scatter=False, ci=False, lowess=True,
            line_kws={'color': 'red', 'lw': 1, 'alpha': 0.8})
plt.xlim(0, max(leverage) + 0.01)
plt.ylim(-3, 5)
plt.title('Residuals vs Leverage')
plt.xlabel('Leverage')
plt.ylabel('Standardized Residuals');
The residuals-vs-leverage plot shows no influential outliers.
Model Conclusion
On the training set, the model with the polynomial term seems to perform better than the linear model, due to its higher Adjusted R squared and R squared; the variation of the % Change in stock price is better explained with the additional polynomial term. However, the training error tends to underestimate the testing error.
According to the test MSEs of both models, the model without the polynomial term yields a slightly lower MSE (2.1063 < 2.1546). This indicates that the linear model with 7 variables has the stronger predictive power.
The selected model:
Change = 0.0664 + 0.2852(PriceSales) + 0.0310(PriceBook) − 0.3609(EVRevenue) − 0.0020(Payout_Ratio) + 0.0474(Operating_Margin) − 0.1509(ROA) + 0.4269(Health)
The Optimal Model Interpretation
Independent Variables     Relationship   Coefficient   P-Value
Intercept                 +              0.0664        0.839
Price/Sales               +              0.2852        0.064 (.)
Price/Book                +              0.0310        0.140
EV/Revenue                -              0.3609        0.035 (*)
Payout Ratio (%)          -              0.0020        0.043 (*)
Operating Margin (%)      +              0.0474        0.007 (**)
Return on Assets (ttm)    -              0.1509        0.000 (***)
Health                    +              0.4269        0.200
R squared
The independent variables in this ordinary least squares model explain 9.8% of the variation in the percentage change in stock price.
To improve the R squared value, the model might need other variables that are more correlated with the response. Because stock data has high variation and high randomness, other sources besides numeric data, such as daily news, financial reports (10-K), and index/company performance, might be needed to improve the evaluation of the change in stock price.
Coefficients (significant at the 95% confidence level)
EVRevenue: On average, when the Enterprise Value/Revenue increases by 1 unit, the stock price declines by 0.3609%, holding the other variables constant (p-value 0.035 < 0.05).
Payout_Ratio: On average, when the Payout Ratio increases by 1%, the stock price decreases by 0.002%, holding the others constant (p-value 0.043 < 0.05).
Operating_Margin: On average, when the Operating Margin (ttm) increases by 1%, the stock price increases by 0.0474%, holding the others constant (p-value 0.007 < 0.05).
ROA: ROA is highly significant, with a p-value of 0.000. On average, when the Return on Assets (ttm) increases by 1%, holding the others constant, the stock price decreases by 0.1509%.
PriceSales has a p-value of 0.064, which is statistically significant only at the 90% confidence level, so the evidence of its association with the dependent variable is weak.
PriceBook and Health are not statistically significant.
In [33]:
# Use this code to predict a specific scenario.
# The feature values below are hypothetical, for illustration only.
PriceSales = 4.5
PriceBook = 3.0
EVRevenue = 4.0
Payout_Ratio = 20.0
Operating_Margin = 15.0
ROA = 5.0
Health = 1
data_new = [1, PriceSales, PriceBook, EVRevenue, Payout_Ratio, Operating_Margin, ROA, Health]
predicted = m_7_cv.predict(data_new)[0]
predicted
====================================================================
5. Conclusions
What have we seen based on the data?
From joining the two data frames (one from the Yahoo Finance Technology Services screener, the other from the key-statistics pages), the data set had approximately 35 numeric variables for 390 companies. After cleaning, the observations were reduced to approximately 190 companies.
Building the correlation table and heatmap, and plotting the variables against the response in scatter plots, showed that most of them have only a vague relationship (low correlation) with the response (% Change in stock price). There is also a sign of non-linearity between the response and EV/Revenue, so a polynomial term of this variable was added at a later stage.
Since the model has high dimensionality, the best subset selection method was performed: 20 of the 21 candidate variables (including the binary variable; Diluted_EPS was excluded) were considered. Based on the lowest RSS and highest Adjusted R squared, 7 variables were selected: Price/Sales, Price/Book, EV/Revenue, Payout Ratio, Operating Margin, ROA, and Health (the binary variable created from the industry average of EV/EBITDA).
Due to the non-linearity of EV/Revenue, a model with an additional polynomial term, (EV/Revenue)², was fitted. The result shows that the model's Adjusted R squared improved; the variation of the % Change in stock price is better explained by the predictors with the additional polynomial term. However, the predictive accuracy was investigated further.
Model predictive accuracy:
To validate the accuracy of these two models, cross-validation was performed. The data set was randomly divided into an 80% training set and a 20% test set (with the seed set to 1). From the result, the test MSE of the linear model is slightly lower than that of the model with the non-linear term. Even though the non-linear model has a higher Adjusted R squared, indicating a better description of the relationship between predictors and response, the linear model has slightly stronger predictive power. (The model comparison is shown in the table below.)
Model          Linear Model                                 Non-linear Model
Formula        Change = 0.0664 + 0.2852(PriceSales)         Change = -0.0138 + 0.3043(PriceSales)
               + 0.0310(PriceBook) - 0.3609(EVRevenue)      + 0.0297(PriceBook) - 0.2612(EVRevenue)
               - 0.0020(Payout_Ratio)                       - 0.0067(EVRevenue^2) - 0.0019(Payout_Ratio)
               + 0.0474(Operating_Margin)                   + 0.0413(Operating_Margin) - 0.1474(ROA)
               - 0.1509(ROA) + 0.4269(Health)               + 0.2658(Health)
Adjusted R²    0.063                                        0.075
R²             0.098                                        0.115
Test MSE       2.106319                                     2.154634
How has our understanding of the original question changed?
Recall the question(s):
Which indices (variables) are statistically important to the Change in percentage
of stock price in Technology Services industry?
The statistically significant indices are Price/Share (+), EV/Revenue (-), Payout
Ratio (%) (-), Operating Margin (%) (+), and Return on Asset (%) (-). Besides
these significant variables in the model, to determine the change in stock
price, some additional factors need to be considered. In the stock market,
there're many types of information the stock analyst could use for decision
making. For instance, reading an annual report like 10-K as well as news and
integrate with numeric data would help them gain more advantage over a
person who only relies on less source.
What is the magnitude of each variable against the % Change of stock price in the Technology Services industry?
Initially, I expected that market capitalization would play a significant role as a predictor with a positive sign, since the most widely followed companies, such as those in the S&P 500, have high market capitalizations. However, this variable is not statistically significant in a model whose dependent variable is the percentage change in stock price. The result for ROA is also not as expected: the higher the return on assets, the more profit a company generates from its resources, yet this variable has a negative relationship in the model.
However, the actual relationship of EV/Revenue matches the expectation (negative). Since EV/Revenue compares a company's revenue with its enterprise value, a lower multiple suggests the company is undervalued, drawing more attention to it. Other variables, such as Operating Margin (%) and Price/Book, also matched expectations, because these indices can draw investors' attention (the higher the value, the more attractive the stock).
Lecture 1 Pandas Basics.pptx machine learning
 

Último

Vasai-Virar High Profile Model Call Girls📞9833754194-Nalasopara Satisfy Call ...
Vasai-Virar High Profile Model Call Girls📞9833754194-Nalasopara Satisfy Call ...Vasai-Virar High Profile Model Call Girls📞9833754194-Nalasopara Satisfy Call ...
Vasai-Virar High Profile Model Call Girls📞9833754194-Nalasopara Satisfy Call ...priyasharma62062
 
VIP Independent Call Girls in Andheri 🌹 9920725232 ( Call Me ) Mumbai Escorts...
VIP Independent Call Girls in Andheri 🌹 9920725232 ( Call Me ) Mumbai Escorts...VIP Independent Call Girls in Andheri 🌹 9920725232 ( Call Me ) Mumbai Escorts...
VIP Independent Call Girls in Andheri 🌹 9920725232 ( Call Me ) Mumbai Escorts...dipikadinghjn ( Why You Choose Us? ) Escorts
 
Call Girls Koregaon Park Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Koregaon Park Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Koregaon Park Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Koregaon Park Call Me 7737669865 Budget Friendly No Advance Bookingroncy bisnoi
 
Kopar Khairane Russian Call Girls Number-9833754194-Navi Mumbai Fantastic Unl...
Kopar Khairane Russian Call Girls Number-9833754194-Navi Mumbai Fantastic Unl...Kopar Khairane Russian Call Girls Number-9833754194-Navi Mumbai Fantastic Unl...
Kopar Khairane Russian Call Girls Number-9833754194-Navi Mumbai Fantastic Unl...priyasharma62062
 
VIP Kalyan Call Girls 🌐 9920725232 🌐 Make Your Dreams Come True With Mumbai E...
VIP Kalyan Call Girls 🌐 9920725232 🌐 Make Your Dreams Come True With Mumbai E...VIP Kalyan Call Girls 🌐 9920725232 🌐 Make Your Dreams Come True With Mumbai E...
VIP Kalyan Call Girls 🌐 9920725232 🌐 Make Your Dreams Come True With Mumbai E...roshnidevijkn ( Why You Choose Us? ) Escorts
 
Best VIP Call Girls Morni Hills Just Click Me 6367492432
Best VIP Call Girls Morni Hills Just Click Me 6367492432Best VIP Call Girls Morni Hills Just Click Me 6367492432
Best VIP Call Girls Morni Hills Just Click Me 6367492432motiram463
 
Technology industry / Finnish economic outlook
Technology industry / Finnish economic outlookTechnology industry / Finnish economic outlook
Technology industry / Finnish economic outlookTechFinland
 
VIP Call Girl Service Andheri West ⚡ 9920725232 What It Takes To Be The Best ...
VIP Call Girl Service Andheri West ⚡ 9920725232 What It Takes To Be The Best ...VIP Call Girl Service Andheri West ⚡ 9920725232 What It Takes To Be The Best ...
VIP Call Girl Service Andheri West ⚡ 9920725232 What It Takes To Be The Best ...dipikadinghjn ( Why You Choose Us? ) Escorts
 
VIP Call Girl in Mumbai 💧 9920725232 ( Call Me ) Get A New Crush Everyday Wit...
VIP Call Girl in Mumbai 💧 9920725232 ( Call Me ) Get A New Crush Everyday Wit...VIP Call Girl in Mumbai 💧 9920725232 ( Call Me ) Get A New Crush Everyday Wit...
VIP Call Girl in Mumbai 💧 9920725232 ( Call Me ) Get A New Crush Everyday Wit...dipikadinghjn ( Why You Choose Us? ) Escorts
 
VIP Independent Call Girls in Mumbai 🌹 9920725232 ( Call Me ) Mumbai Escorts ...
VIP Independent Call Girls in Mumbai 🌹 9920725232 ( Call Me ) Mumbai Escorts ...VIP Independent Call Girls in Mumbai 🌹 9920725232 ( Call Me ) Mumbai Escorts ...
VIP Independent Call Girls in Mumbai 🌹 9920725232 ( Call Me ) Mumbai Escorts ...dipikadinghjn ( Why You Choose Us? ) Escorts
 
Diva-Thane European Call Girls Number-9833754194-Diva Busty Professional Call...
Diva-Thane European Call Girls Number-9833754194-Diva Busty Professional Call...Diva-Thane European Call Girls Number-9833754194-Diva Busty Professional Call...
Diva-Thane European Call Girls Number-9833754194-Diva Busty Professional Call...priyasharma62062
 
cost-volume-profit analysis.ppt(managerial accounting).pptx
cost-volume-profit analysis.ppt(managerial accounting).pptxcost-volume-profit analysis.ppt(managerial accounting).pptx
cost-volume-profit analysis.ppt(managerial accounting).pptxazadalisthp2020i
 
Lion One Corporate Presentation May 2024
Lion One Corporate Presentation May 2024Lion One Corporate Presentation May 2024
Lion One Corporate Presentation May 2024Adnet Communications
 
Mira Road Memorable Call Grls Number-9833754194-Bhayandar Speciallty Call Gir...
Mira Road Memorable Call Grls Number-9833754194-Bhayandar Speciallty Call Gir...Mira Road Memorable Call Grls Number-9833754194-Bhayandar Speciallty Call Gir...
Mira Road Memorable Call Grls Number-9833754194-Bhayandar Speciallty Call Gir...priyasharma62062
 
Vip Call US 📞 7738631006 ✅Call Girls In Sakinaka ( Mumbai )
Vip Call US 📞 7738631006 ✅Call Girls In Sakinaka ( Mumbai )Vip Call US 📞 7738631006 ✅Call Girls In Sakinaka ( Mumbai )
Vip Call US 📞 7738631006 ✅Call Girls In Sakinaka ( Mumbai )Pooja Nehwal
 
Stock Market Brief Deck (Under Pressure).pdf
Stock Market Brief Deck (Under Pressure).pdfStock Market Brief Deck (Under Pressure).pdf
Stock Market Brief Deck (Under Pressure).pdfMichael Silva
 
Webinar on E-Invoicing for Fintech Belgium
Webinar on E-Invoicing for Fintech BelgiumWebinar on E-Invoicing for Fintech Belgium
Webinar on E-Invoicing for Fintech BelgiumFinTech Belgium
 
Navi Mumbai Cooperetive Housewife Call Girls-9833754194-Natural Panvel Enjoye...
Navi Mumbai Cooperetive Housewife Call Girls-9833754194-Natural Panvel Enjoye...Navi Mumbai Cooperetive Housewife Call Girls-9833754194-Natural Panvel Enjoye...
Navi Mumbai Cooperetive Housewife Call Girls-9833754194-Natural Panvel Enjoye...priyasharma62062
 

Último (20)

Vasai-Virar High Profile Model Call Girls📞9833754194-Nalasopara Satisfy Call ...
Vasai-Virar High Profile Model Call Girls📞9833754194-Nalasopara Satisfy Call ...Vasai-Virar High Profile Model Call Girls📞9833754194-Nalasopara Satisfy Call ...
Vasai-Virar High Profile Model Call Girls📞9833754194-Nalasopara Satisfy Call ...
 
VIP Independent Call Girls in Andheri 🌹 9920725232 ( Call Me ) Mumbai Escorts...
VIP Independent Call Girls in Andheri 🌹 9920725232 ( Call Me ) Mumbai Escorts...VIP Independent Call Girls in Andheri 🌹 9920725232 ( Call Me ) Mumbai Escorts...
VIP Independent Call Girls in Andheri 🌹 9920725232 ( Call Me ) Mumbai Escorts...
 
Call Girls Koregaon Park Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Koregaon Park Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Koregaon Park Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Koregaon Park Call Me 7737669865 Budget Friendly No Advance Booking
 
Kopar Khairane Russian Call Girls Number-9833754194-Navi Mumbai Fantastic Unl...
Kopar Khairane Russian Call Girls Number-9833754194-Navi Mumbai Fantastic Unl...Kopar Khairane Russian Call Girls Number-9833754194-Navi Mumbai Fantastic Unl...
Kopar Khairane Russian Call Girls Number-9833754194-Navi Mumbai Fantastic Unl...
 
(INDIRA) Call Girl Srinagar Call Now 8617697112 Srinagar Escorts 24x7
(INDIRA) Call Girl Srinagar Call Now 8617697112 Srinagar Escorts 24x7(INDIRA) Call Girl Srinagar Call Now 8617697112 Srinagar Escorts 24x7
(INDIRA) Call Girl Srinagar Call Now 8617697112 Srinagar Escorts 24x7
 
VIP Kalyan Call Girls 🌐 9920725232 🌐 Make Your Dreams Come True With Mumbai E...
VIP Kalyan Call Girls 🌐 9920725232 🌐 Make Your Dreams Come True With Mumbai E...VIP Kalyan Call Girls 🌐 9920725232 🌐 Make Your Dreams Come True With Mumbai E...
VIP Kalyan Call Girls 🌐 9920725232 🌐 Make Your Dreams Come True With Mumbai E...
 
Best VIP Call Girls Morni Hills Just Click Me 6367492432
Best VIP Call Girls Morni Hills Just Click Me 6367492432Best VIP Call Girls Morni Hills Just Click Me 6367492432
Best VIP Call Girls Morni Hills Just Click Me 6367492432
 
Technology industry / Finnish economic outlook
Technology industry / Finnish economic outlookTechnology industry / Finnish economic outlook
Technology industry / Finnish economic outlook
 
VIP Call Girl Service Andheri West ⚡ 9920725232 What It Takes To Be The Best ...
VIP Call Girl Service Andheri West ⚡ 9920725232 What It Takes To Be The Best ...VIP Call Girl Service Andheri West ⚡ 9920725232 What It Takes To Be The Best ...
VIP Call Girl Service Andheri West ⚡ 9920725232 What It Takes To Be The Best ...
 
(Vedika) Low Rate Call Girls in Pune Call Now 8250077686 Pune Escorts 24x7
(Vedika) Low Rate Call Girls in Pune Call Now 8250077686 Pune Escorts 24x7(Vedika) Low Rate Call Girls in Pune Call Now 8250077686 Pune Escorts 24x7
(Vedika) Low Rate Call Girls in Pune Call Now 8250077686 Pune Escorts 24x7
 
VIP Call Girl in Mumbai 💧 9920725232 ( Call Me ) Get A New Crush Everyday Wit...
VIP Call Girl in Mumbai 💧 9920725232 ( Call Me ) Get A New Crush Everyday Wit...VIP Call Girl in Mumbai 💧 9920725232 ( Call Me ) Get A New Crush Everyday Wit...
VIP Call Girl in Mumbai 💧 9920725232 ( Call Me ) Get A New Crush Everyday Wit...
 
VIP Independent Call Girls in Mumbai 🌹 9920725232 ( Call Me ) Mumbai Escorts ...
VIP Independent Call Girls in Mumbai 🌹 9920725232 ( Call Me ) Mumbai Escorts ...VIP Independent Call Girls in Mumbai 🌹 9920725232 ( Call Me ) Mumbai Escorts ...
VIP Independent Call Girls in Mumbai 🌹 9920725232 ( Call Me ) Mumbai Escorts ...
 
Diva-Thane European Call Girls Number-9833754194-Diva Busty Professional Call...
Diva-Thane European Call Girls Number-9833754194-Diva Busty Professional Call...Diva-Thane European Call Girls Number-9833754194-Diva Busty Professional Call...
Diva-Thane European Call Girls Number-9833754194-Diva Busty Professional Call...
 
cost-volume-profit analysis.ppt(managerial accounting).pptx
cost-volume-profit analysis.ppt(managerial accounting).pptxcost-volume-profit analysis.ppt(managerial accounting).pptx
cost-volume-profit analysis.ppt(managerial accounting).pptx
 
Lion One Corporate Presentation May 2024
Lion One Corporate Presentation May 2024Lion One Corporate Presentation May 2024
Lion One Corporate Presentation May 2024
 
Mira Road Memorable Call Grls Number-9833754194-Bhayandar Speciallty Call Gir...
Mira Road Memorable Call Grls Number-9833754194-Bhayandar Speciallty Call Gir...Mira Road Memorable Call Grls Number-9833754194-Bhayandar Speciallty Call Gir...
Mira Road Memorable Call Grls Number-9833754194-Bhayandar Speciallty Call Gir...
 
Vip Call US 📞 7738631006 ✅Call Girls In Sakinaka ( Mumbai )
Vip Call US 📞 7738631006 ✅Call Girls In Sakinaka ( Mumbai )Vip Call US 📞 7738631006 ✅Call Girls In Sakinaka ( Mumbai )
Vip Call US 📞 7738631006 ✅Call Girls In Sakinaka ( Mumbai )
 
Stock Market Brief Deck (Under Pressure).pdf
Stock Market Brief Deck (Under Pressure).pdfStock Market Brief Deck (Under Pressure).pdf
Stock Market Brief Deck (Under Pressure).pdf
 
Webinar on E-Invoicing for Fintech Belgium
Webinar on E-Invoicing for Fintech BelgiumWebinar on E-Invoicing for Fintech Belgium
Webinar on E-Invoicing for Fintech Belgium
 
Navi Mumbai Cooperetive Housewife Call Girls-9833754194-Natural Panvel Enjoye...
Navi Mumbai Cooperetive Housewife Call Girls-9833754194-Natural Panvel Enjoye...Navi Mumbai Cooperetive Housewife Call Girls-9833754194-Natural Panvel Enjoye...
Navi Mumbai Cooperetive Housewife Call Girls-9833754194-Natural Panvel Enjoye...
 

Final project kijtorntham n

Price/Book +
Price/Earnings +
EV +
EV/Revenue -
Payout Ratio (%) +
Profit Margin (%) +
Operating Margin (%) +
ROA +
ROE +
Revenue +
Revenue/Share +
Gross Profit (%) +
EBITDA +
Net Income to Common +
Diluted EPS +
Quarterly Earnings Growth (yoy) +
Health (healthy) +

Steps:

In [2]:
# Time of data scraping
now = datetime.now()
dt = now.strftime("%d/%m/%Y %H:%M:%S")
print("This data was scraped on", dt)

This data was scraped on 23/11/2019 22:25:07

PART 1
Getting a main dataframe (df_1)
In [3]:
# Scrape the main dataframe (df_1), requesting 100 rows per page
# across the approximately 390 listed companies.
url = 'https://finance.yahoo.com/screener/predefined/ms_technology'
rows = np.arange(0, 301, 100).tolist()  # offsets: [0, 100, 200, 300]
url_list = []
tech_df = []
for i in rows:
    r = requests.get(url, params={'count': '100', 'offset': i})
    link = r.url
    url_list.append(link)
for link in url_list:
    df = pd.read_html(link)
    tb = df[0]
    tech_df.append(tb)
df_1 = pd.concat(tech_df)

In [4]:
# Set 'Symbol' as the index
df = df_1.set_index('Symbol')
df.to_csv('df_1.csv')

In [5]:
url_list

Out[5]:
['https://finance.yahoo.com/screener/predefined/ms_technology?count=100&offset=0',
 'https://finance.yahoo.com/screener/predefined/ms_technology?count=100&offset=100',
 'https://finance.yahoo.com/screener/predefined/ms_technology?count=100&offset=200',
 'https://finance.yahoo.com/screener/predefined/ms_technology?count=100&offset=300']
In [6]:
df.head()

Out[6]:
Symbol  Name                                                Price (Intraday)  Change  % Change  Volume   Avg Vol (3 month)
AAPL    Apple Inc.                                          261.78            -0.23   -0.09%    16.331M  25.857M
MSFT    Microsoft Corporation                               149.59            0.11    +0.07%    15.842M  22.825M
TSM     Taiwan Semiconductor Manufacturing Company Lim...   52.79             -0.19   -0.36%    4.103M   6.848M
INTC    Intel Corporation                                   57.61             -0.61   -1.05%    15.69M   18.498M
CSCO    Cisco Systems, Inc.                                 44.85             0.01    +0.02%    16.516M  19.124M

In [7]:
# The dimension of the main dataframe
df.shape

Out[7]:
(393, 9)

PART 2
Scraping from the Key-Statistics page
In [8]:
# Get the 'a' tags from the web element
table = []
tag = []
# Get the text from each main page
for url in url_list:
    txt = requests.get(url).text
    soup = bs(txt)
    t = soup.find('div', {'id': 'scr-res-table'})
    table.append(t)
for i in range(0, 4):
    t = table[i].find_all('a')
    tag.append(t)

In [9]:
# Build the href link to the key-statistics page of each ticker
link = []
for e in range(0, 4):
    for i in tag[e]:
        l = 'https://finance.yahoo.com' + i.get('href')
        l_kstat = l.split('?')[0] + '/key-statistics?' + l.split('?')[1]
        link.append(l_kstat)

Note
Some HTML links return 404 when the code runs at certain times (around the stock market's closing time), so the code chunk below prevents that error during scraping.
In [10]:
connection = []
for l in link:
    if requests.get(l).status_code == 200:
        status = ['good', l]
    else:
        status = ['404', l]
    connection.append(status)

# Keep the responding links (status 200) and their company tickers
links = []
tickers = []
for status in range(0, len(connection)):
    if connection[status][0] == 'good':
        good_link = connection[status][1]
        links.append(good_link)
        tickers.append(good_link.split('=')[1])
    else:
        bad_link = connection[status][1]  # keep the failing link for inspection

In [11]:
print('There are', len(links), 'links that responded (200)')

There are 393 links that responded (200)
In [12]:
tic = time.time()
data = []
tables = (0, 3, 5, 6, 7)  # the specific tables on the key-statistics page used in the analysis
for url in links:
    for table in tables:
        d = pd.read_html(url)[table]
        data.append(d)
matrix = pd.concat(data)
m = matrix.set_index(0)
toc = time.time()
print("Total scraping time:", (toc - tic) / 60, "minutes.")

Total scraping time: 35.854390549659726 minutes.

In [13]:
# Build a second dataframe from the concatenated matrix
df_2 = pd.DataFrame()
for i in range(0, len(m), 31):
    m_m = m.iloc[i:i + 31]
    n = (i + 31) / 31 - 1
    m_m.columns = [tickers[int(n)]]
    df_2[tickers[int(n)]] = m_m[tickers[int(n)]]
df_2 = df_2.transpose()

In [14]:
df_2.to_csv('df_2.csv')
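A sketch of one way to shorten the roughly 36-minute scrape (an assumption, not the author's code): pd.read_html downloads the page again on every call, so in the loop above each ticker is fetched five times. Parsing each page once and indexing the parsed list avoids the repeated HTTP requests.

data = []
for url in links:
    all_tables = pd.read_html(url)   # one request and parse per ticker
    for t in (0, 3, 5, 6, 7):        # the same five tables used above
        data.append(all_tables[t])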
Joining Data Frames

In [15]:
df = df.join(df_2)

In [16]:
df = df.iloc[:len(df_2)]
df.shape

Out[16]:
(393, 40)

In [17]:
df.head()
Out[17]:
Symbol  Name                                                Price (Intraday)  Change  % Change  Volume   Avg Vol (3 month)
AAPL    Apple Inc.                                          261.78            -0.23   -0.09%    16.331M  25.857M
MSFT    Microsoft Corporation                               149.59            0.11    +0.07%    15.842M  22.825M
TSM     Taiwan Semiconductor Manufacturing Company Lim...   52.79             -0.19   -0.36%    4.103M   6.848M
INTC    Intel Corporation                                   57.61             -0.61   -1.05%    15.69M   18.498M
CSCO    Cisco Systems, Inc.                                 44.85             0.01    +0.02%    16.516M  19.124M
5 rows × 40 columns

In [18]:
df.to_csv('tech_390.csv')

====================================================================

3. Data Cleaning

Rename Variables in the Data Frame, df
In [19]:
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_rows', 400)

In [20]:
df = pd.read_csv('tech_390.csv')
df = df.set_index('Symbol')
df.head()

Out[20]:
Symbol  Name                                                Price (Intraday)  Change  % Change  Volume   Avg Vol (3 month)
AAPL    Apple Inc.                                          261.78            -0.23   -0.09%    16.331M  25.857M
MSFT    Microsoft Corporation                               149.59            0.11    +0.07%    15.842M  22.825M
TSM     Taiwan Semiconductor Manufacturing Company Lim...   52.79             -0.19   -0.36%    4.103M   6.848M
INTC    Intel Corporation                                   57.61             -0.61   -1.05%    15.69M   18.498M
CSCO    Cisco Systems, Inc.                                 44.85             0.01    +0.02%    16.516M  19.124M
In [21]:
data = df[['% Change', 'Price (Intraday)', 'Market Cap', 'PE Ratio (TTM)',
           'PEG Ratio (5 yr expected) 1', 'Price/Sales (ttm)', 'Price/Book (mrq)',
           'Enterprise Value 3', 'Enterprise Value/Revenue 3',
           'Enterprise Value/EBITDA 6', 'Payout Ratio 4', 'Profit Margin',
           'Operating Margin (ttm)', 'Return on Assets (ttm)',
           'Return on Equity (ttm)', 'Revenue (ttm)', 'Revenue Per Share (ttm)',
           'Gross Profit (ttm)', 'EBITDA', 'Net Income Avi to Common (ttm)',
           'Diluted EPS (ttm)', 'Quarterly Earnings Growth (yoy)']]
data.columns = ['Change', 'Price', 'Mkt_Cap', 'PE_Ratio', 'PEG_Ratio',
                'PriceSales', 'PriceBook', 'EV', 'EVRevenue', 'EV/EBITDA',
                'Payout_Ratio', 'Profit_Margin', 'Operating_Margin', 'ROA', 'ROE',
                'Revenue', 'RevenueShare', 'Gross_Profit', 'EBITDA', 'NItoCommon',
                'Diluted_EPS', 'Earnings_Growth_Q']
data.head()

Out[21]:
Symbol  Change  Price   Mkt_Cap   PE_Ratio  PEG_Ratio  PriceSales  ...
AAPL    -0.09%  261.78  1.183T    22.02     2.04       4.55        ...
MSFT    +0.07%  149.59  1.141T    28.22     1.91       8.79        ...
TSM     -0.36%  52.79   263.261B  23.67     2.39       NaN         ...
INTC    -1.05%  57.61   250.604B  13.49     1.79       3.56        ...
CSCO    +0.02%  44.85   190.265B  17.85     1.97       3.66        ...
Data Cleaning & Transformation

Converting str to float by using:
    replace() to convert the abbreviations (T, B, M, and K) to scientific notation (e) for Mkt_Cap, EV, Revenue, Gross_Profit, EBITDA, and NItoCommon.
    strip() to strip the unnecessary symbols, ',' and '%'.
    astype() to change string to float.
Creating a categorical (binary) variable for company health based on the industrial average of EV/EBITDA.

In [22]:
# Check the type of each variable
columns = ['Change', 'Price', 'Mkt_Cap', 'PE_Ratio', 'PEG_Ratio', 'PriceSales',
           'PriceBook', 'EV', 'EVRevenue', 'EV/EBITDA', 'Payout_Ratio',
           'Profit_Margin', 'Operating_Margin', 'ROA', 'ROE', 'Revenue',
           'RevenueShare', 'Gross_Profit', 'EBITDA', 'NItoCommon', 'Diluted_EPS',
           'Earnings_Growth_Q']
for i in columns:
    print(type(data[i].values[0]), i)
<class 'str'> Change
<class 'numpy.float64'> Price
<class 'str'> Mkt_Cap
<class 'numpy.float64'> PE_Ratio
<class 'numpy.float64'> PEG_Ratio
<class 'numpy.float64'> PriceSales
<class 'numpy.float64'> PriceBook
<class 'str'> EV
<class 'numpy.float64'> EVRevenue
<class 'numpy.float64'> EV/EBITDA
<class 'str'> Payout_Ratio
<class 'str'> Profit_Margin
<class 'str'> Operating_Margin
<class 'str'> ROA
<class 'str'> ROE
<class 'str'> Revenue
<class 'numpy.float64'> RevenueShare
<class 'str'> Gross_Profit
<class 'str'> EBITDA
<class 'str'> NItoCommon
<class 'numpy.float64'> Diluted_EPS
<class 'str'> Earnings_Growth_Q

In [23]:
for c in ['Mkt_Cap', 'EV', 'Revenue', 'Gross_Profit', 'EBITDA', 'NItoCommon']:
    data[c] = (data[c].astype(str)
               .str.replace("T", "e+12").str.replace("B", "e+9")
               .str.replace("M", "e+6").str.replace("k", "e+3")
               .astype(float))

data['ROE'] = data['ROE'].str.replace(",", "")
data['Earnings_Growth_Q'] = data['Earnings_Growth_Q'].str.replace(",", "")

col = ['Change', 'Profit_Margin', 'Payout_Ratio', 'Operating_Margin', 'ROA',
       'ROE', 'Earnings_Growth_Q']
for c in col:
    data[c] = data[c].str.replace('%', '').astype(float)
In [24]:
for i in columns:
    print(type(data[i].values[0]), i)

<class 'numpy.float64'> Change
<class 'numpy.float64'> Price
<class 'numpy.float64'> Mkt_Cap
<class 'numpy.float64'> PE_Ratio
<class 'numpy.float64'> PEG_Ratio
<class 'numpy.float64'> PriceSales
<class 'numpy.float64'> PriceBook
<class 'numpy.float64'> EV
<class 'numpy.float64'> EVRevenue
<class 'numpy.float64'> EV/EBITDA
<class 'numpy.float64'> Payout_Ratio
<class 'numpy.float64'> Profit_Margin
<class 'numpy.float64'> Operating_Margin
<class 'numpy.float64'> ROA
<class 'numpy.float64'> ROE
<class 'numpy.float64'> Revenue
<class 'numpy.float64'> RevenueShare
<class 'numpy.float64'> Gross_Profit
<class 'numpy.float64'> EBITDA
<class 'numpy.float64'> NItoCommon
<class 'numpy.float64'> Diluted_EPS
<class 'numpy.float64'> Earnings_Growth_Q
In [25]:
# Create a binary variable based on data['EV/EBITDA'].mean()
health = []
print('The industrial average of EV/EBITDA is', data['EV/EBITDA'].mean())
for i in data['EV/EBITDA']:
    if i > data['EV/EBITDA'].mean():
        h = 1
    else:
        h = 0
    health.append(h)
data['Health'] = health

# Since the new categorical variable was created from 'EV/EBITDA',
# the original column is removed from the dataframe.
del data['EV/EBITDA']

The industrial average of EV/EBITDA is 7.904972826086954

In [26]:
# Drop NAs
data = data.dropna()
print('After the data cleaning and manipulation steps, the dataframe used in the model has',
      data.shape[0], 'observations (companies) with', data.shape[1] - 1, 'features.')

After the data cleaning and manipulation steps, the dataframe used in the model has 186 observations (companies) with 21 features.

In [27]:
a = data['Health'] == 1
print('The number of observations defined as healthy is', a.sum())

The number of observations defined as healthy is 170
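An equivalent vectorized sketch of the Health flag above (an alternative, not the author's code): comparing the column against its mean gives a boolean Series that casts directly to 0/1. Note this assumes it runs before the 'EV/EBITDA' column is deleted.

data['Health'] = (data['EV/EBITDA'] > data['EV/EBITDA'].mean()).astype(int)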
In [28]:
data.to_csv('data.csv')

Data Visualization

Correlation Table (for Numeric Variables)

In [2]:
data = pd.read_csv('data.csv')
data = data.set_index('Symbol')
data.head()

Out[2]:
Symbol  Change  Price   Mkt_Cap       PE_Ratio  PEG_Ratio  PriceSales
AAPL    -0.09   261.78  1.183000e+12  22.02     2.04       4.55
MSFT    0.07    149.59  1.141000e+12  28.22     1.91       8.79
INTC    -1.05   57.61   2.506040e+11  13.49     1.79       3.56
CSCO    0.02    44.85   1.902650e+11  17.85     1.97       3.66
ORCL    0.28    56.39   1.851010e+11  18.46     1.45       4.68
5 rows × 22 columns

In [3]:
# Excluding the binary variable (Health)
variables = ['Change', 'Price', 'Mkt_Cap', 'PE_Ratio', 'PEG_Ratio', 'PriceSales',
             'PriceBook', 'EV', 'EVRevenue', 'Payout_Ratio', 'Profit_Margin',
             'Operating_Margin', 'ROA', 'ROE', 'Revenue', 'RevenueShare',
             'Gross_Profit', 'EBITDA', 'NItoCommon', 'Diluted_EPS',
             'Earnings_Growth_Q']
In [4]:
# The correlation table ordered by magnitude
correl = data.loc[:, variables].corr()
correl[:1].T.sort_values(by=['Change'], ascending=False)

Out[4]:
                     Change
Change               1.000000
PE_Ratio             0.094638
Earnings_Growth_Q    0.042029
PriceSales           0.025884
EVRevenue            0.019835
Profit_Margin        0.017111
Operating_Margin    -0.004450
Gross_Profit        -0.008711
EBITDA              -0.009072
EV                  -0.009756
Mkt_Cap             -0.010124
NItoCommon          -0.010922
PriceBook           -0.014621
PEG_Ratio           -0.023386
Revenue             -0.023475
Price               -0.027291
Diluted_EPS         -0.059097
ROE                 -0.083297
RevenueShare        -0.095545
Payout_Ratio        -0.097380
ROA                 -0.166271
Table Summary

The correlation table for the numeric variables shows that the variables positively correlated with Change are, in order of magnitude, PE_Ratio, Earnings_Growth_Q, PriceSales, EVRevenue, and Profit_Margin. The negatively correlated variables are, in order, Operating_Margin, Gross_Profit, EBITDA, EV, Mkt_Cap, NItoCommon, PriceBook, PEG_Ratio, Revenue, Price, Diluted_EPS, ROE, RevenueShare, Payout_Ratio, and ROA.

In [5]:
plt.figure(figsize=(15, 15))
plt.imshow(correl)   # show the correlation matrix as an image
plt.colorbar()
# Label the axes with the variable names
plt.xticks(range(21), variables, rotation='vertical')
plt.yticks(range(21), variables)

[Figure: heatmap of the correlation matrix of the numeric variables]
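A minimal alternative sketch using seaborn's heatmap (an assumption, not the author's code): it draws the color bar and tick labels automatically and centers the color scale at zero, which makes the sign of each correlation easier to read.

plt.figure(figsize=(15, 15))
sns.heatmap(correl, center=0, xticklabels=variables, yticklabels=variables)
plt.title('Correlation Matrix of Numeric Variables')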
In [6]:
plt.hist(data['Change'], bins=20)
plt.title('Histogram of % Change in Stock Price')
plt.xlabel('% Change in Stock Price')

[Figure: histogram of % Change in Stock Price]

The histogram of Change indicates that the dependent variable (Y) is approximately normally distributed.

Example Plots of Independent Variables Against Y
In [7]:
# Enterprise Value/Revenue
plt.scatter(data['EVRevenue'], data['Change'])
plt.xlabel('Enterprise Value/Revenue')
plt.ylabel('% Change in Stock Price')
plt.title('Enterprise Value/Revenue vs Change')

[Figure: scatter plot of Enterprise Value/Revenue vs Change]

The plot above shows no obvious upward or downward slope, but the relationship between this feature and the response looks slightly non-linear. Hence, if this variable turns out to be statistically significant in the model, a polynomial term of EV/Revenue will be added later to try to improve the model.
In [8]:
# Payout Ratio (%)
plt.figure(figsize=(15, 5))
plt.subplot(121)
plt.scatter(data['Payout_Ratio'], data['Change'])
plt.xlabel('Payout Ratio (%)')
plt.ylabel('% Change in Stock Price')
plt.title('Payout_Ratio vs Change')
plt.subplot(122)
plt.scatter(np.log(data['Payout_Ratio']), data['Change'])
plt.xlabel('Log of Payout Ratio (%)')
plt.ylabel('% Change in Stock Price')
plt.title('log(Payout_Ratio) vs Change')

[Figure: Payout_Ratio vs Change (left) and log(Payout_Ratio) vs Change (right)]

In the left panel, the distribution is dense where the payout ratio (%) is below 200. After taking the logarithm (right panel), the scatter plot shows a slightly negative relationship, and no non-linearity is detected, so a log() transform might improve the model. Unfortunately, taking log() produces infinite values for some observations (payout ratios of zero), and dropping more rows from the roughly 190 available observations would likely reduce the model's accuracy. Hence, this variable is kept as it is.
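One possible workaround, sketched under the assumption that the infinite values come from zero payout ratios: np.log1p is defined at 0, so the transform can be applied without dropping any rows.

log_payout = np.log1p(data['Payout_Ratio'].where(data['Payout_Ratio'] >= 0))
plt.scatter(log_payout, data['Change'])   # NaNs, if any, are simply skipped
plt.xlabel('log(1 + Payout Ratio (%))')
plt.ylabel('% Change in Stock Price')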
In [9]:
# Operating Margin (%)
plt.scatter(data['Operating_Margin'], data['Change'])
plt.xlabel('Operating Margin (%)')
plt.ylabel('% Change in Stock Price')
plt.title('Operating_Margin vs Change')

[Figure: scatter plot of Operating Margin (%) vs Change]

The plot above shows a slight positive relationship between Operating Margin (%) and the % change in stock price.
In [10]:
# ROA (%)
plt.scatter(data['ROA'], data['Change'])
plt.xlabel('Return on Assets (%)')
plt.ylabel('% Change in Stock Price')
plt.title('ROA (%) vs Change')

[Figure: scatter plot of ROA (%) vs Change]

The scatter plot of ROA (%) against the % change in stock price shows a negative relationship with no evidence of non-linearity.

The Boxplot (for the Binary Variable)
In [11]:
plt.figure(figsize=(5, 5))
sns.boxplot(data['Health'], data['Change'])

[Figure: boxplot of Change by Health]

The boxplot of the binary variable Health shows a slight difference in the median and interquartile range between the two groups: the 'healthy' companies (Health = 1) sit slightly higher and have a wider distribution (longer whiskers). Hence, this variable is potentially statistically significant in the model.

====================================================================

4. Predictive Modeling
The Multicollinearity Among Variables (VIF)

Calculating the variance inflation factor
*source (https://etav.github.io/python/vif_factor_python.html)

In [37]:
y, X_vif = dmatrices('Change ~ Price + Mkt_Cap + PEG_Ratio + PE_Ratio'
                     ' + PriceSales + PriceBook + EV + EVRevenue + Payout_Ratio'
                     ' + Profit_Margin + Operating_Margin + ROA + ROE + Revenue'
                     ' + RevenueShare + Gross_Profit + EBITDA + NItoCommon'
                     ' + Earnings_Growth_Q + Health',
                     data=data, return_type='dataframe')

# For each X, calculate the VIF and save it in a dataframe
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X_vif.values, i)
                     for i in range(X_vif.shape[1])]
vif["features"] = X_vif.columns
vif.round(2).set_index('features')
Out[37]:
features            VIF Factor
Intercept           18.79
Price               2.00
Mkt_Cap             1423.99
PEG_Ratio           1.06
PE_Ratio            1.91
PriceSales          82.42
PriceBook           5.48
EV                  1611.13
EVRevenue           95.27
Payout_Ratio        1.12
Profit_Margin       4.02
Operating_Margin    7.23
ROA                 3.72
ROE                 5.17
Revenue             31.00
RevenueShare        1.82
Gross_Profit        52.42
EBITDA              197.00
NItoCommon          223.06
Earnings_Growth_Q   1.38
Health              1.29
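For reference, the VIF reported for each feature j follows the standard definition (not stated in the original), where $R_j^2$ is the R squared from regressing feature j on all the other features:

\[ VIF_j = \frac{1}{1 - R_j^2} \]

A VIF near 1 means the feature is nearly uncorrelated with the rest, while values above 10 are commonly read as problematic multicollinearity.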
The result above shows high multicollinearity (VIF > 10) among Mkt_Cap, Price/Sales, EV, EV/Revenue, Revenue, Gross_Profit, EBITDA, and NItoCommon; these variables are not independent of one another. In the next step, the subset selection method will help filter highly correlated and unnecessary variables out of the model.

Best Subset Selection Method

Since there are many variables (high dimensionality) with multicollinearity in the data set, including all of them may lead to high variance in the model. To reduce the model variance, this method selects the set of variables that yields the highest Adjusted R squared by minimizing the RSS.
* source (http://www.science.smith.edu/~jcrouser/SDS293/labs/lab8-py.html)
In [13]:
y = data.Change
X = data[['Price', 'Mkt_Cap', 'PEG_Ratio', 'PE_Ratio', 'PriceSales', 'PriceBook',
          'EV', 'EVRevenue', 'Payout_Ratio', 'Profit_Margin', 'Operating_Margin',
          'ROA', 'ROE', 'Revenue', 'RevenueShare', 'Gross_Profit', 'EBITDA',
          'NItoCommon', 'Earnings_Growth_Q', 'Health']]
X = pd.concat([X], axis=1)
X.head()

Out[13]:
Symbol  Price   Mkt_Cap       PEG_Ratio  PE_Ratio  PriceSales  PriceBook
AAPL    261.78  1.183000e+12  2.04       22.02     4.55        12.85
MSFT    149.59  1.141000e+12  1.91       28.22     8.79        10.77
INTC    57.61   2.506040e+11  1.79       13.49     3.56        3.38
CSCO    44.85   1.902650e+11  1.97       17.85     3.66        5.53
ORCL    56.39   1.851010e+11  1.45       18.46     4.68        10.08

In [14]:
def processSubset(feature_set):
    # Fit a model on feature_set and calculate its RSS
    model = sm.OLS(y, X[list(feature_set)])
    regr = model.fit()
    RSS = ((regr.predict(X[list(feature_set)]) - y) ** 2).sum()
    return {"model": regr, "RSS": RSS}
In [15]:
def getBest(k):
    tic = time.time()
    results = []
    for combo in itertools.combinations(X.columns, k):
        results.append(processSubset(combo))
    models = pd.DataFrame(results)
    # Choose the model with the lowest RSS
    best_model = models.loc[models['RSS'].argmin()]
    toc = time.time()
    print("Processed", models.shape[0], "models on", k, "predictors in",
          (toc - tic), "seconds.")
    return best_model
In [17]:
models_best = pd.DataFrame(columns=['RSS', 'model'])
tic = time.time()
for i in range(1, 10):
    models_best.loc[i] = getBest(i)
toc = time.time()
print("Total elapsed time:", (toc - tic) / 60, "minutes.")

Processed 20 models on 1 predictors in 0.06646728515625 seconds.
Processed 190 models on 2 predictors in 0.4089689254760742 seconds.
Processed 1140 models on 3 predictors in 2.5466723442077637 seconds.
Processed 4845 models on 4 predictors in 11.260601043701172 seconds.
Processed 15504 models on 5 predictors in 34.69612693786621 seconds.
Processed 38760 models on 6 predictors in 88.06672596931458 seconds.
Processed 77520 models on 7 predictors in 613.4514439105988 seconds.
Processed 125970 models on 8 predictors in 403.8603241443634 seconds.
Processed 167960 models on 9 predictors in 419.7190179824829 seconds.
Total elapsed time: 26.632335432370503 minutes.
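The exhaustive search above fits over 400,000 candidate models and takes close to half an hour. A hedged sketch of forward stepwise selection, a cheaper greedy alternative that is not part of the original analysis, reusing processSubset from above: at each step it adds the single predictor that lowers the RSS the most, fitting at most a few hundred models in total.

def forwardSelect(max_k):
    selected, results = [], []
    for _ in range(max_k):
        remaining = [c for c in X.columns if c not in selected]
        # Score every one-variable extension of the current subset
        scores = [(processSubset(selected + [c])['RSS'], c) for c in remaining]
        best_rss, best_col = min(scores)
        selected.append(best_col)
        results.append({'k': len(selected), 'features': list(selected),
                        'RSS': best_rss})
    return results

forward_models = forwardSelect(9)  # runs in seconds rather than minutes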
In [18]:
models_best

Out[18]:
   RSS         model
1  282.423717  <statsmodels.regression.linear_model.Regressio...
2  273.200854  <statsmodels.regression.linear_model.Regressio...
3  270.210182  <statsmodels.regression.linear_model.Regressio...
4  265.668865  <statsmodels.regression.linear_model.Regressio...
5  264.720145  <statsmodels.regression.linear_model.Regressio...
6  260.887309  <statsmodels.regression.linear_model.Regressio...
7  257.760233  <statsmodels.regression.linear_model.Regressio...
8  256.530992  <statsmodels.regression.linear_model.Regressio...
9  255.976209  <statsmodels.regression.linear_model.Regressio...
In [19]:
plt.figure(figsize=(20, 10))
plt.rcParams.update({'font.size': 18, 'lines.markersize': 10})

# RSS plot
plt.subplot(2, 2, 1)
plt.plot(models_best["RSS"])
plt.xlabel('# Predictors')
plt.ylabel('RSS')

# Adjusted R squared plot
rsquared_adj = models_best.apply(lambda row: row[1].rsquared_adj, axis=1)
plt.subplot(2, 2, 2)
plt.plot(rsquared_adj)
plt.plot(rsquared_adj.argmax(), rsquared_adj.max(), "or")
plt.xlabel('# Predictors')
plt.ylabel('adjusted rsquared')

# AIC plot
aic = models_best.apply(lambda row: row[1].aic, axis=1)
plt.subplot(2, 2, 3)
plt.plot(aic)
plt.plot(aic.argmin(), aic.min(), "or")
plt.xlabel('# Predictors')
plt.ylabel('AIC')

# BIC plot
bic = models_best.apply(lambda row: row[1].bic, axis=1)
plt.subplot(2, 2, 4)
plt.plot(bic)
plt.plot(bic.argmin(), bic.min(), "or")
plt.xlabel('# Predictors')
plt.ylabel('BIC')
[Figure: RSS, adjusted R squared, AIC, and BIC against the number of predictors, with the optimum of each criterion marked in red]

In [20]:
print('The model has the highest Adjusted R squared at',
      '{0:.4f}'.format(models_best.loc[rsquared_adj.argmax(), "model"].rsquared_adj),
      'when it has', rsquared_adj.argmax(), 'variables')
print('The model has the lowest AIC at',
      '{0:.4f}'.format(models_best.loc[aic.argmin(), "model"].aic),
      'when it has', aic.argmin(), 'variables')
print('The model has the lowest BIC at',
      '{0:.4f}'.format(models_best.loc[bic.argmin(), "model"].bic),
      'when it has', bic.argmin(), 'variables')

The model has the highest Adjusted R squared at 0.0637 when it has 7 variables
The model has the lowest AIC at 602.5338 when it has 4 variables
The model has the lowest BIC at 625.1140 when it has 2 variables
In [21]:
print('*** 7 variable-model ***')
print(models_best.loc[7, "model"].params)
print('')
print('*** 4 variable-model ***')
print(models_best.loc[4, "model"].params)
print('')
print('*** 2 variable-model ***')
print(models_best.loc[2, "model"].params)

*** 7 variable-model ***
PriceSales          0.283440
PriceBook           0.030632
EVRevenue          -0.358678
Payout_Ratio       -0.001987
Operating_Margin    0.047336
ROA                -0.148925
Health              0.480233
dtype: float64

*** 4 variable-model ***
Payout_Ratio       -0.001794
Operating_Margin    0.022919
ROA                -0.100080
Health              0.350337
dtype: float64

*** 2 variable-model ***
Operating_Margin    0.030559
ROA                -0.083739
dtype: float64
Criteria

The best subset models were computed by minimizing the RSS. For each model size k, the search fits all

\[ \binom{p}{k} = \frac{p!}{k!(p-k)!} \]

candidate models and keeps the one with the lowest RSS.

Adjusted R squared:

\[ \text{Adjusted } R^2 = 1 - \frac{RSS/(n-p-1)}{TSS/(n-1)} \]

According to the formula above, a smaller RSS yields a higher Adjusted R squared. As a result of the subset selection, the 7-variable model yields the highest Adjusted R squared at 0.064; the variables are Price/Sales, Price/Book, EV/Revenue, Payout Ratio, Operating Margin, ROA, and Health. Five of these variables are statistically significant at p-value < 0.05, and the R squared is 0.099 (result shown below).

AIC: The AIC criterion selects the model with 4 variables: Payout Ratio, Operating Margin, ROA, and Health. However, only one variable (ROA) is statistically significant at the 95% confidence level, and the R squared is 0.071.

BIC: Since the BIC criterion carries a heavier penalty term, \( \log(n) \cdot p \cdot \hat{\sigma}^2 \), it yields a smaller model with two significant variables, Operating_Margin and ROA. The R squared is 0.045.

Criteria             # of Optimal Variables   R squared   Adjusted R squared
Adjusted R squared   7                        0.099       0.064
AIC                  4                        0.071       0.051
BIC                  2                        0.045       0.035

To conclude, based on the Adjusted R squared criterion, the optimal model is the OLS model with 7 variables, shown below.
In [22]:
models_best.loc[7, "model"].summary()

Out[22]:
OLS Regression Results
Dep. Variable: Change          R-squared (uncentered): 0.099
Model: OLS                     Adj. R-squared (uncentered): 0.064
Method: Least Squares          F-statistic: 2.807
Date: Wed, 27 Nov 2019         Prob (F-statistic): 0.00851
Time: 00:10:37                 Log-Likelihood: -294.27
No. Observations: 186          AIC: 602.5
Df Residuals: 179              BIC: 625.1
Df Model: 7
Covariance Type: nonrobust

                   coef      std err   t        P>|t|   [0.025    0.975]
PriceSales         0.2834    0.152     1.860    0.065   -0.017    0.584
PriceBook          0.0306    0.021     1.474    0.142   -0.010    0.072
EVRevenue         -0.3587    0.169    -2.121    0.035   -0.692    -0.025
Payout_Ratio      -0.0020    0.001    -2.037    0.043   -0.004    -6.17e-05
Operating_Margin   0.0473    0.017     2.718    0.007    0.013    0.082
ROA               -0.1489    0.037    -3.992    0.000   -0.223    -0.075
Health             0.4802    0.204     2.358    0.019    0.078    0.882

Omnibus: 8.154          Durbin-Watson: 1.957
Prob(Omnibus): 0.017    Jarque-Bera (JB): 14.803
Skew: -0.101            Prob(JB): 0.000610
Kurtosis: 4.367         Cond. No. 272.
The Optimal Linear Model (7 Variables)

In [3]:
y = data.Change
X_7 = dmatrix('1 + PriceSales + PriceBook + EVRevenue + Payout_Ratio'
              ' + Operating_Margin + ROA + Health', data=data)
m = sm.OLS(y, X_7)
m.data.xnames = X_7.design_info.column_names
m = m.fit()
print(m.summary())

OLS Regression Results
Dep. Variable: Change          R-squared: 0.098
Model: OLS                     Adj. R-squared: 0.063
Method: Least Squares          F-statistic: 2.774
Date: Thu, 28 Nov 2019         Prob (F-statistic): 0.00924
Time: 21:50:30                 Log-Likelihood: -294.25
No. Observations: 186          AIC: 604.5
Df Residuals: 178              BIC: 630.3
Df Model: 7
Covariance Type: nonrobust

                   coef      std err   t        P>|t|   [0.025    0.975]
Intercept          0.0664    0.326     0.204    0.839   -0.576    0.709
PriceSales         0.2852    0.153     1.863    0.064   -0.017    0.587
PriceBook          0.0310    0.021     1.481    0.140   -0.010    0.072
EVRevenue         -0.3609    0.170    -2.124    0.035   -0.696    -0.026
Payout_Ratio      -0.0020    0.001    -2.041    0.043   -0.004    -6.67e-05
Operating_Margin   0.0474    0.017     2.713    0.007    0.013    0.082
ROA               -0.1509    0.039    -3.909    0.000   -0.227    -0.075
Health             0.4269    0.332     1.286    0.200   -0.228    1.082

Omnibus: 8.080          Durbin-Watson: 1.953
Prob(Omnibus): 0.018    Jarque-Bera (JB): 14.644
Skew: -0.096            Prob(JB): 0.000661
Kurtosis: 4.361         Cond. No. 493.

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

VIF After the Subset Selection
In [36]:
y, X_vif = dmatrices('Change ~ PriceSales + PriceBook + EVRevenue + Payout_Ratio'
                     ' + Operating_Margin + ROA + Health',
                     data=data, return_type='dataframe')

# For each X, calculate the VIF and save it in a dataframe
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X_vif.values, i)
                     for i in range(X_vif.shape[1])]
vif["features"] = X_vif.columns
vif.round(2).set_index('features')

Out[36]:
features           VIF Factor
Intercept          13.63
PriceSales         44.86
PriceBook          2.12
EVRevenue          51.57
Payout_Ratio       1.07
Operating_Margin   3.46
ROA                2.95
Health             1.11

As a result, the variables with the highest VIFs have been eliminated. The model still carries some multicollinearity between Price/Sales and EV/Revenue, but it is moderately acceptable.
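A minimal follow-up sketch (not part of the original analysis): dropping one of the remaining collinear pair, for example EVRevenue, and recomputing the VIFs shows how much of the residual inflation comes from that pair alone.

y_chk, X_chk = dmatrices('Change ~ PriceSales + PriceBook + Payout_Ratio'
                         ' + Operating_Margin + ROA + Health',
                         data=data, return_type='dataframe')
vif_chk = pd.Series([variance_inflation_factor(X_chk.values, i)
                     for i in range(X_chk.shape[1])], index=X_chk.columns)
print(vif_chk.round(2))  # PriceSales should fall well below 10 without EVRevenue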
The Non-Linear Model with a Polynomial Term

Based on the data visualization above, there is evidence that EV/Revenue could have a non-linear relationship with the response. Below, a model with a polynomial term is fitted, along with the other selected variables.

In [25]:
y = data.Change
X_new = dmatrix('1 + PriceSales + PriceBook + EVRevenue + I(EVRevenue**2)'
                ' + Payout_Ratio + Operating_Margin + ROA + Health', data=data)
m_new = sm.OLS(y, X_new)
m_new.data.xnames = X_new.design_info.column_names
m_new = m_new.fit()
print(m_new.summary())

OLS Regression Results
Dep. Variable: Change          R-squared: 0.115
Model: OLS                     Adj. R-squared: 0.075
Method: Least Squares          F-statistic: 2.872
Date: Wed, 27 Nov 2019         Prob (F-statistic): 0.00499
Time: 00:10:37                 Log-Likelihood: -292.52
No. Observations: 186          AIC: 603.0
Df Residuals: 177              BIC: 632.1
Df Model: 8
Covariance Type: nonrobust

                    coef      std err   t        P>|t|   [0.025   0.975]
Intercept          -0.0138    0.327    -0.042    0.966   -0.658   0.631
PriceSales          0.3043    0.152     1.996    0.047    0.003   0.605
PriceBook           0.0297    0.021     1.431    0.154   -0.011   0.071
EVRevenue          -0.2612    0.178    -1.472    0.143   -0.612   0.089
I(EVRevenue ** 2)  -0.0067    0.004    -1.819    0.071   -0.014   0.001
Payout_Ratio       -0.0019    0.001    -1.908    0.058   -0.004   6.45e-05
Operating_Margin    0.0413    0.018     2.337    0.021    0.006   0.076
ROA                -0.1474    0.038    -3.839    0.000   -0.223   -0.072
Health              0.2658    0.341     0.778    0.437   -0.408   0.940

Omnibus: 7.753          Durbin-Watson: 1.932
Prob(Omnibus): 0.021    Jarque-Bera (JB): 14.037
Skew: -0.061            Prob(JB): 0.000895
Kurtosis: 4.340         Cond. No. 496.

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Compared with the linear model, this non-linear model with the polynomial term has a higher Adjusted R squared, 0.075 (> 0.063), and an R squared of 0.115. This means the polynomial term of EV/Revenue improves the explained variation of the dependent variable (% Change in stock price). However, both models will be evaluated by cross-validation to compare their predictive performance.

Prediction Accuracy Between Models

Cross-Validation: A random-sample cross-validation with an 80:20 partition and random.seed(1) is used to validate each model's predictive power.

In [26]:
# Create a training and a testing set
random.seed(1)
train = random.sample(range(0, len(data)), round(len(data) * 0.8))
test = []
for n in range(0, len(data)):
    if n not in train:
        test.append(n)

y_training = data['Change'].iloc[train]
x_training = data[['PriceSales', 'PriceBook', 'EVRevenue', 'Payout_Ratio',
                   'Operating_Margin', 'ROA', 'Health']].iloc[train]
y_testing = data['Change'].iloc[test]
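A single 80:20 split gives a noisy estimate of the test error. A hedged sketch of a K-fold alternative (an assumption, not the author's method; it assumes scikit-learn is available, which is not used elsewhere in this notebook): the held-out MSE of the 7-variable specification is computed on each of 5 folds and averaged.

from sklearn.model_selection import KFold

formula_7 = ('1 + PriceSales + PriceBook + EVRevenue + Payout_Ratio'
             ' + Operating_Margin + ROA + Health')
fold_mse = []
for tr, te in KFold(n_splits=5, shuffle=True, random_state=1).split(data):
    fit = sm.OLS(data['Change'].iloc[tr],
                 dmatrix(formula_7, data=data.iloc[tr])).fit()
    pred = fit.predict(dmatrix(formula_7, data=data.iloc[te]))
    fold_mse.append(((pred - data['Change'].iloc[te].values) ** 2).mean())
print('Mean 5-fold test MSE:', np.mean(fold_mse))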
In [27]:
# Fit the best subset model (7 variables) on the training set
y = y_training
X_7 = dmatrix('1 + PriceSales + PriceBook + EVRevenue + Payout_Ratio'
              ' + Operating_Margin + ROA + Health', data=x_training)
m_7_cv = sm.OLS(y, X_7)
m_7_cv.data.xnames = X_7.design_info.column_names
m_7_cv = m_7_cv.fit()
print(m_7_cv.summary())

OLS Regression Results
Dep. Variable: Change          R-squared: 0.122
Model: OLS                     Adj. R-squared: 0.079
Method: Least Squares          F-statistic: 2.802
Date: Wed, 27 Nov 2019         Prob (F-statistic): 0.00922
Time: 00:10:37                 Log-Likelihood: -227.28
No. Observations: 149          AIC: 470.6
Df Residuals: 141              BIC: 494.6
Df Model: 7
Covariance Type: nonrobust

                   coef      std err   t        P>|t|   [0.025   0.975]
Intercept         -0.0203    0.329    -0.061    0.951   -0.672   0.631
PriceSales         0.1883    0.155     1.213    0.227   -0.119   0.495
PriceBook          0.0267    0.023     1.167    0.245   -0.018   0.072
EVRevenue         -0.2841    0.170    -1.672    0.097   -0.620   0.052
Payout_Ratio      -0.0023    0.001    -2.375    0.019   -0.004   -0.000
Operating_Margin   0.0370    0.018     2.008    0.047    0.001   0.073
ROA               -0.1325    0.040    -3.283    0.001   -0.212   -0.053
Health             0.7233    0.344     2.102    0.037    0.043   1.403

Omnibus: 8.706          Durbin-Watson: 1.839
Prob(Omnibus): 0.013    Jarque-Bera (JB): 15.574
Skew: 0.193             Prob(JB): 0.000415
Kurtosis: 4.536         Cond. No. 522.

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In [28]:
# Fit the model with the polynomial term (EV/Revenue^2) on the training set
y = y_training
X_new = dmatrix('1 + PriceSales + PriceBook + EVRevenue + I(EVRevenue**2)'
                ' + Payout_Ratio + Operating_Margin + ROA + Health',
                data=x_training)
m_new_cv = sm.OLS(y, X_new)
m_new_cv.data.xnames = X_new.design_info.column_names
m_new_cv = m_new_cv.fit()
print(m_new_cv.summary())

OLS Regression Results
Dep. Variable: Change          R-squared: 0.145
Model: OLS                     Adj. R-squared: 0.096
Method: Least Squares          F-statistic: 2.961
Date: Wed, 27 Nov 2019         Prob (F-statistic): 0.00431
Time: 00:10:37                 Log-Likelihood: -225.34
No. Observations: 149          AIC: 468.7
Df Residuals: 140              BIC: 495.7
Df Model: 8
Covariance Type: nonrobust

                    coef      std err   t        P>|t|   [0.025   0.975]
Intercept          -0.1252    0.331    -0.378    0.706   -0.779   0.529
PriceSales          0.2243    0.155     1.448    0.150   -0.082   0.531
PriceBook           0.0196    0.023     0.854    0.394   -0.026   0.065
EVRevenue          -0.1754    0.178    -0.988    0.325   -0.526   0.176
I(EVRevenue ** 2)  -0.0085    0.004    -1.923    0.057   -0.017   0.000
Payout_Ratio       -0.0021    0.001    -2.222    0.028   -0.004   -0.000
Operating_Margin    0.0304    0.019     1.633    0.105   -0.006   0.067
ROA                -0.1268    0.040    -3.163    0.002   -0.206   -0.048
Health              0.5493    0.353     1.558    0.122   -0.148   1.246

Omnibus: 9.464          Durbin-Watson: 1.856
Prob(Omnibus): 0.009    Jarque-Bera (JB): 18.892
Skew: 0.166             Prob(JB): 7.90e-05
Kurtosis: 4.713         Cond. No. 524.

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Mean Squared Error

\[ MSE = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2 \]

In [29]:
# Calculate the test MSEs
x_testing = dmatrix('1 + PriceSales + PriceBook + EVRevenue + Payout_Ratio'
                    ' + Operating_Margin + ROA + Health', data=data.iloc[test])
predicted_7 = m_7_cv.predict(x_testing)

x_testing = dmatrix('1 + PriceSales + PriceBook + EVRevenue + I(EVRevenue**2)'
                    ' + Payout_Ratio + Operating_Margin + ROA + Health',
                    data=data.iloc[test])
predicted_new = m_new_cv.predict(x_testing)

mse = pd.DataFrame()
mse['Actual Value'] = y_testing
mse['Predicted Value (m_7)'] = predicted_7
mse['Predicted Value (m_new)'] = predicted_new
mse['Squared Error (m_7)'] = (mse['Predicted Value (m_7)'] - mse['Actual Value']) ** 2
mse['Squared Error (m_new)'] = (mse['Predicted Value (m_new)'] - mse['Actual Value']) ** 2
MSE_7 = mse['Squared Error (m_7)'].sum() / len(mse)
MSE_new = mse['Squared Error (m_new)'].sum() / len(mse)
In [31]:
mse.T

Out[31]:
Symbol                   ACN        AVGO       IBM       NOW        MU         AMD
Actual Value             -0.060000  -0.100000  0.370000  0.310000   0.700000   -0.940000
Predicted Value (m_7)    -0.780854  -0.253737  0.194448  0.138732   -0.359416  0.184337
Predicted Value (m_new)  -0.790585  -0.126589  0.075529  -0.300076  -0.351424  0.357285
Squared Error (m_7)      0.519630   0.023635   0.030819  0.029333   1.122362   1.264133
Squared Error (m_new)    0.533755   0.000707   0.086713  0.372193   1.105493   1.682949
5 rows × 37 columns

In [32]:
print('The test MSE for the linear model with 7 variables is', MSE_7)
print('The test MSE for the model with the polynomial term is', MSE_new)

The test MSE for the linear model with 7 variables is 2.1063194289258402
The test MSE for the model with the polynomial term is 2.154634174218139
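A quick cross-check of the manual computation above (a sketch; it assumes scikit-learn is installed, which is not used elsewhere in this notebook):

from sklearn.metrics import mean_squared_error
print(mean_squared_error(y_testing, predicted_7))    # should match MSE_7
print(mean_squared_error(y_testing, predicted_new))  # should match MSE_new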
The Optimal Model Recall

According to the MSE values above, the model with the lower test error, the linear model with 7 selected variables, is recalled below.

In [4]:
print(m.summary())

OLS Regression Results
Dep. Variable: Change          R-squared: 0.098
Model: OLS                     Adj. R-squared: 0.063
Method: Least Squares          F-statistic: 2.774
Date: Thu, 28 Nov 2019         Prob (F-statistic): 0.00924
Time: 21:51:37                 Log-Likelihood: -294.25
No. Observations: 186          AIC: 604.5
Df Residuals: 178              BIC: 630.3
Df Model: 7
Covariance Type: nonrobust

                   coef      std err   t        P>|t|   [0.025    0.975]
Intercept          0.0664    0.326     0.204    0.839   -0.576    0.709
PriceSales         0.2852    0.153     1.863    0.064   -0.017    0.587
PriceBook          0.0310    0.021     1.481    0.140   -0.010    0.072
EVRevenue         -0.3609    0.170    -2.124    0.035   -0.696    -0.026
Payout_Ratio      -0.0020    0.001    -2.041    0.043   -0.004    -6.67e-05
Operating_Margin   0.0474    0.017     2.713    0.007    0.013    0.082
ROA               -0.1509    0.039    -3.909    0.000   -0.227    -0.075
Health             0.4269    0.332     1.286    0.200   -0.228    1.082

Omnibus: 8.080          Durbin-Watson: 1.953
Prob(Omnibus): 0.018    Jarque-Bera (JB): 14.644
Skew: -0.096            Prob(JB): 0.000661
Kurtosis: 4.361         Cond. No. 493.

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Regression Diagnosis
*Source (https://robert-alvarez.github.io/2018-06-04-diagnostic_plots/)
In [5]:
# Residual plot
sns.residplot(m.fittedvalues, 'Change', data=data, lowess=True,
              scatter_kws={'alpha': 0.5},
              line_kws={'color': 'red', 'lw': 1, 'alpha': 0.8})
plt.title('Residuals vs Fitted')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')

[Figure: Residuals vs Fitted]

The residuals vs fitted values plot shows some non-linearity that this linear model could not capture.
• 54.
In [6]: # Normal Q-Q plot
sm.qqplot(m.resid, line='45', color='cornflowerblue', alpha=0.6)
plt.title('Normal Q-Q')
plt.xlabel('Theoretical Quantiles')
plt.ylabel('Standardized Residuals')

The Q-Q plot indicates that most of the residuals (roughly 85% or more) align along the line, suggesting the errors are approximately normally distributed.

Out[6]: Text(0, 0.5, 'Standardized Residuals')
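Normality can also be checked formally rather than by eye. A minimal sketch using the Jarque-Bera statistic that the summary already reports, assuming m is the fitted model:

from statsmodels.stats.stattools import jarque_bera

# Jarque-Bera tests H0: the residuals are normally distributed
jb_stat, jb_pvalue, skew, kurtosis = jarque_bera(m.resid)
print(f'JB = {jb_stat:.3f}, p = {jb_pvalue:.4g}')
# A small p-value (the summary reports Prob(JB) = 0.000661) rejects exact
# normality, driven mainly by the heavy tails (kurtosis > 3).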
• 55.
In [7]: # Scale-Location Plot
norm_res_abs_sqrt = np.sqrt(np.abs(m.get_influence().resid_studentized_internal))
plt.scatter(m.fittedvalues, norm_res_abs_sqrt, alpha=0.5);
sns.regplot(m.fittedvalues, norm_res_abs_sqrt, scatter=False, ci=False, lowess=True,
            line_kws={'color': 'red', 'lw': 1, 'alpha': 0.8});
plt.xlabel('Fitted values')
plt.ylabel(r'$\sqrt{|Standardized Residuals|}$')

The scale-location plot shows a slightly uneven spread of the residuals, so this model might suffer from heteroscedasticity.

Out[7]: Text(0, 0.5, '$\\sqrt{|Standardized Residuals|}$')
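The suspicion of heteroscedasticity can be tested formally. A hedged sketch using the Breusch-Pagan test from statsmodels, assuming m is the fitted model:

from statsmodels.stats.diagnostic import het_breuschpagan

# Breusch-Pagan tests H0: constant error variance (homoscedasticity)
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(m.resid, m.model.exog)
print(f'LM = {lm_stat:.3f}, p = {lm_pvalue:.4f}')
# A small p-value would confirm the uneven spread seen in the plot; robust
# (HC) standard errors, e.g. m.get_robustcov_results('HC3'), are one remedy.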
• 56.
In [8]: # Residuals vs Leverage
leverage = m.get_influence().hat_matrix_diag
norm_res = m.get_influence().resid_studentized_internal
plt.scatter(leverage, norm_res, alpha=0.5);
sns.regplot(leverage, norm_res, scatter=False, ci=False, lowess=True,
            line_kws={'color': 'red', 'lw': 1, 'alpha': 0.8})
plt.xlim(0, max(leverage)+0.01)
plt.ylim(-3, 5)
plt.title('Residuals vs Leverage')
plt.xlabel('Leverage')
plt.ylabel('Standardized Residuals');

The residuals vs. leverage plot shows no influential outliers: no observation combines high leverage with a large standardized residual.
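Leverage alone does not flag influence; Cook's distance combines leverage with residual size. A minimal sketch, assuming m is the fitted model:

import numpy as np

# Cook's distance flags points that would materially change the fit if removed
influence = m.get_influence()
cooks_d, _ = influence.cooks_distance

# A common rule of thumb flags observations with D > 4/n
n = len(cooks_d)
flagged = np.where(cooks_d > 4 / n)[0]
print('Potentially influential observations:', flagged)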
• 57. Model Conclusion
On the training set, the model with the polynomial term seems to perform better than the linear model, due to a higher R squared and Adjusted R squared, meaning the variation of the %Change in stock price is better explained once the polynomial term is added. However, the training error tends to underestimate the testing error. According to the test MSEs of both models, the model without the polynomial term yields a slightly lower MSE (2.1063 < 2.1546). This indicates that the linear model with 7 variables has the stronger predictive power.

The selected linear model:

Change = 0.0664 + 0.2852(PriceSales) + 0.0310(PriceBook) - 0.3609(EVRevenue) - 0.0020(Payout_Ratio) + 0.0474(Operating_Margin) - 0.1509(ROA) + 0.4269(Health)

The Optimal Model Interpretation

Independent Variables       Relationship   Coefficient   P-Value
Intercept                   +              0.0664        0.839
Price/Sales                 +              0.2852        0.064 (.)
Price/Book                  +              0.0310        0.140
EV/Revenue                  -              0.3609        0.035 (*)
Payout Ratio (%)            -              0.0020        0.043 (*)
Operating Margin (%)        +              0.0474        0.007 (**)
Return on Assets (ttm)      -              0.1509        0.000 (***)
Health                      +              0.4269        0.200

R squared
This ordinary least squares model explains 9.8% of the variation in the percentage change in stock price. To improve the R squared, the model might need other variables that are more correlated with the response. Because stock data has high variation and high randomness, data beyond these numeric features, such as daily news, financial reports, 10-K filings, and index or company performance, might be needed to improve the evaluation of the change in stock price.
• 58. Coefficients (significant at the 95% confidence level)

EVRevenue: The coefficient indicates that, on average, when Enterprise Value/Revenue increases by 1 unit, the stock price declines by 0.3609%, holding other variables constant (p-value 0.035 < 0.05).

Payout_Ratio: On average, when the Payout Ratio increases by 1%, the stock price decreases by 0.002%, holding other variables constant (p-value 0.043 < 0.05).

Operating_Margin: The coefficient indicates that, on average, when the Operating Margin (ttm) increases by 1%, the stock price increases by 0.0474%, holding other variables constant (p-value 0.007 < 0.05).

ROA: ROA is highly significant, with a p-value of 0.000. On average, when the Return on Assets (ttm) increases by 1%, holding other variables constant, the stock price decreases by 0.1509%.

PriceSales has a p-value of 0.064, which is statistically significant only at the 90% confidence level, not at 95%, so the evidence of its association with the response is weaker.

PriceBook and Health are not statistically significant.

In [33]: # Use this code in order to predict a specific scenario
# PriceSales =
# PriceBook =
# EVRevenue =
# Payout_Ratio =
# Operating_Margin =
# ROA =
# Health =
# data_new = [1, PriceSales, PriceBook, EVRevenue, Payout_Ratio, Operating_Margin, ROA, Health]
# predicted = m_7_cv.predict(data_new)[0]
# predicted
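For illustration, the template above can be filled in with made-up inputs (every value below is hypothetical, not taken from the data set), reusing the cross-validated model m_7_cv:

# Hypothetical scenario -- all input values below are illustrative only
PriceSales = 5.0         # Price/Sales
PriceBook = 3.0          # Price/Book
EVRevenue = 4.0          # EV/Revenue
Payout_Ratio = 20.0      # Payout Ratio (%)
Operating_Margin = 15.0  # Operating Margin (%)
ROA = 8.0                # Return on Assets (%)
Health = 1               # 1 = EV/EBITDA above the industry average

# Order matches the design matrix: intercept first, then the 7 predictors
data_new = [1, PriceSales, PriceBook, EVRevenue, Payout_Ratio, Operating_Margin, ROA, Health]
predicted = m_7_cv.predict(data_new)[0]
predicted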
====================================================================
• 59. 5. Conclusions
What have we seen based on the data?
From joining the two data frames (one from the Yahoo Finance Technology Services sector screen, and another from the key-statistics pages), the data set had approximately 35 numeric variables for 390 companies. After cleaning the data, the observations were reduced to approximately 190 companies.

From the correlation table and plot, as well as the scatter plots, most variables had only vague relationships (low correlation) with the response (%Change in stock price). There was also a sign of non-linearity between the response and EV/Revenue, so a polynomial term for this variable was added in a later step.

Since the model is high-dimensional, best subset selection was performed on 21 variables (including the binary variable). Based on the lowest RSS and highest Adjusted R squared, 7 variables were selected: Price/Sales, Price/Book, EV/Revenue, Payout Ratio, Operating Margin, ROA, and Health (a binary variable created from the industry average of EV/EBITDA).

Due to the non-linearity of EV/Revenue, a model with the additional polynomial term (EV/Revenue)^2 was also fit. Its Adjusted R squared improved, meaning the variation in %Change of the stock price is better explained by the predictors once the polynomial term is added. However, the predictive accuracy still had to be investigated.

Model predictive accuracy: To compare the accuracy of the two models, cross-validation was performed, as sketched below. The data set was randomly divided into an 80% training set and a 20% test set (with the seed set to 1). The test MSE of the linear model is slightly lower than that of the model with the non-linear term. Even though the non-linear model has a higher Adjusted R squared, indicating a better description of the relationship between predictors and response, the linear model has slightly stronger predictive power. (The model comparison is shown in the table below.)
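A minimal sketch of the 80/20 split described above. The exact construction is not shown in this section, so this is one plausible version: random is assumed as the sampling tool and data as the cleaned DataFrame.

import random

# Reproduce the 80/20 split with seed 1 (one plausible construction)
random.seed(1)
n = len(data)
train = random.sample(range(n), int(0.8 * n))   # 80% of row positions for training
test = [i for i in range(n) if i not in train]  # remaining 20% for testing

training_set = data.iloc[train]
testing_set = data.iloc[test]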
• 60. Model Comparison

Linear Model
Formula: Change = 0.0664 + 0.2852(PriceSales) + 0.0310(PriceBook) - 0.3609(EVRevenue) - 0.0020(Payout_Ratio) + 0.0474(Operating_Margin) - 0.1509(ROA) + 0.4269(Health)
Adjusted R²: 0.063
R²: 0.098
Test MSE: 2.106319

Non-linear Model
Formula: Change = -0.0138 + 0.3043(PriceSales) + 0.0297(PriceBook) - 0.2612(EVRevenue) - 0.0067(EVRevenue^2) - 0.0019(Payout_Ratio) + 0.0413(Operating_Margin) - 0.1474(ROA) + 0.2658(Health)
Adjusted R²: 0.075
R²: 0.115
Test MSE: 2.154634
• 61. How has our understanding of the original question changed?
Recall the question(s):

Which indices (variables) are statistically important to the Change in percentage of stock price in the Technology Services industry?
The statistically significant indices are Price/Sales (+), EV/Revenue (-), Payout Ratio (%) (-), Operating Margin (%) (+), and Return on Assets (%) (-). Beyond these significant variables, additional factors need to be considered to determine the change in stock price. In the stock market, there are many types of information a stock analyst can use for decision making. For instance, reading an annual report such as a 10-K, as well as the news, and integrating these with the numeric data would give an analyst an advantage over someone who relies on fewer sources.

What is the magnitude of each variable against the Change of stock price in the Technology Services industry?
Initially, I expected market capitalization to play a significant role as a predictor with a positive sign, since the companies that receive the most attention, such as S&P 500 constituents, have high market capitalizations. However, this variable is not statistically significant in the model where the dependent variable is the percentage change in stock price. The result for ROA is also not as expected: the higher the return on assets, the more profit a company generates from its resources, yet, surprisingly, this variable has a negative relationship in the model. The actual relationship of EV/Revenue, however, is as expected (negative). Since EV/Revenue compares a company's enterprise value with its revenue, a lower multiple suggests the company is undervalued, which draws more attention to it. Other variables, such as Operating Margin (%), Payout Ratio (%), and Price/Book, also match expectations, because these are indices that draw attention from investors (the higher the value, the more attractive the stock).