Forecasting the Stock Market (Python)


Project Summary

Abstract

This project focuses on deriving the best statistical-learning model to predict future values for the S&P 500 Stock Index with Python.

Understanding the S&P 500 Stock Index is highly relevant to understanding the health of the U.S. economy, since it is strongly correlated (statistically) with other U.S. economic indicators such as other stock indices (e.g. the NASDAQ Stock Index and the NYSE Stock Exchange), the U.S. Housing Price Index, and the U.S. Gross Domestic Product (U.S. GDP).

Time-series analysis is a core technique in statistical learning, and it is well suited to analyzing the S&P 500 Stock Index.

For this project we leverage the power of Python and deliver, where appropriate, gorgeous data visualizations using matplotlib.

We encourage you to try replicating this project and make your own contributions! You can fork this project on GitHub.

Requirements

Steps

Contributors

1. Load Modules

First, we want to load the appropriate modules into our Python environment. For this we use an import statement, followed by the module name and, optionally, an alias.

Make sure to have these modules installed in your global environment first. For this, you can use sudo pip install MODULE_NAME in your console; replace MODULE_NAME with the package you would like to install.

# START A script.py FILE

# IMPORT YOUR MODULES

import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.tsa.arima_model import ARIMA, ARIMAResults
from statsmodels.tsa.stattools import acf, pacf
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import matplotlib.pylab as plt
import matplotlib.dates as dates
from matplotlib.pylab import rcParams

rcParams['figure.figsize'] = 15, 6

We encourage you to research these modules so that you can get a grasp on what each library is helping you accomplish.

2. Get Data

Collecting Data

This section mirrors the Time Series in R project, since we used the same data set and the same collection process. We want reliable sources of complete and accurate data, so we collected 21 years (1995-2015) of S&P 500 Stock Index data at a monthly frequency (a total of 252 observations) from Yahoo Finance. You can do the same; if you prefer to pull the series programmatically, a sketch follows.
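If you would rather not export a CSV by hand, a library such as pandas-datareader can fetch the same series. This is only a rough sketch of our own: the '^GSPC' ticker, the Yahoo source, and the monthly aggregation are assumptions, and the availability of the Yahoo endpoint changes over time. The manual CSV route above is what the rest of this project uses.

# OPTIONAL: pull the S & P 500 directly instead of exporting a CSV by hand.
# '^GSPC' and the 'yahoo' source are assumptions; endpoint availability varies.
import pandas as pd
from pandas_datareader import data as web

raw = web.DataReader('^GSPC', 'yahoo', start = '1995-01-01', end = '2015-12-31')
# Aggregate daily closes to a monthly frequency to match the 252 observations above.
sp_500_monthly = raw['Close'].resample('M').mean()
print(sp_500_monthly.head())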

Cleaning Data

We cleaned our data manually using Google Sheets. We made sure that all fields contained data (i.e. no missing values) and that the headers (i.e. column names) were labeled appropriately. This can also be done programmatically with Python, which is outside the scope of this project, although we plan to cover it in a later tutorial; a rough sketch is shown below.
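For reference, here is a minimal sketch of what that cleaning could look like in pandas. The file name raw.csv, the header tidying, and the drop/fill strategy are placeholders of our own, not the exact steps we took in Google Sheets.

# A rough idea of programmatic cleaning; 'raw.csv' is a placeholder file name.
import pandas as pd

df = pd.read_csv('raw.csv')
# Tidy up the headers (column names).
df.columns = [c.strip().lower().replace(' ', '_') for c in df.columns]
# Check for missing values, then drop them (or forward-fill with df.fillna(method='ffill')).
print(df.isnull().sum())
df = df.dropna()
df.to_csv('sp_500_ts.csv', index = False)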

We also included several other variables (for exploratory analysis) such as the NASDAQ Stock Index, the NYSE Stock Exchange, the U.S. Housing Price Index, and the U.S. Gross Domestic Product. Here is our data as a .csv file, ready to be used with Python.

Loading Data

Next we load our data set into our working Python environment. We use the pandas module to read the csv file we compiled into a data frame, and then convert the column we care about into a time series object: a pandas Series indexed by dates.


dataMaster = pd.read_csv('sp_500_ts.csv')
sp_500 = dataMaster['sp_500']
# Let's print the first 12 observations!
print sp_500.head(12)
0     464.547500
1     479.072510
2     493.987503
3     507.725006
4     523.650009
5     538.805008
6     553.680008
7     560.649994
8     573.727493
9     582.029983
10    594.149994
11    612.307495
Name: sp_500, dtype: float64

For the next part we create a date range so pandas knows the interval at which our observations were taken. Using the pd.date_range function we build ran, which starts at January 1995, stops just before January 2016 (i.e. its last entry is December 2015), and has freq set to 'M' for monthly. We then pass ran as the index when constructing the time series object with pd.Series.

ran = pd.date_range('1995-01', '2016-1', freq = 'M')
ts = pd.Series(dataMaster['sp_500'].values, index = ran)

Now we can call our S & P 500 Stock Index data by typing ts into our terminal/respective working environment.

print ts.head(12)
print ts.dtypes # This will output the structure of the pandas object
# We do this to make sure we have a time series object!
print ran # We print the range to be sure we have it formatted correctly
1995-01-31    464.547500
1995-02-28    479.072510
1995-03-31    493.987503
1995-04-30    507.725006
1995-05-31    523.650009
1995-06-30    538.805008
1995-07-31    553.680008
1995-08-31    560.649994
1995-09-30    573.727493
1995-10-31    582.029983
1995-11-30    594.149994
1995-12-31    612.307495
Freq: M, dtype: float64
float64
<class 'pandas.tseries.index.DatetimeIndex'>
[1995-01-31, ..., 2015-12-31]
Length: 252, Freq: M, Timezone: None

From this output we can see that we now have a time series object as opposed to a data frame, so we can begin our time series analysis!

3. Do Exploratory Analysis

Plotting Our Time-Series

Plotting the data is arguably the most critical step in the exploratory analysis phase. This enables us to make inferences about important components of the time-series data, such as trend, seasonality, heteroskedasticity, and stationarity. Here is a quick summary of each:

- Trend: a long-run increase or decrease in the level of the series.
- Seasonality: a repeating pattern tied to the calendar (e.g. monthly or quarterly effects).
- Heteroskedasticity: the variance of the series changes over time.
- Stationarity: the mean, variance, and autocorrelation structure of the series stay constant over time.

We can plot our ts object in Python using matplotlib's plot() function.

plt.plot(ts)
plt.title('Time Series Plot for S & P 500')
plt.xlim([0, 255])
plt.show()
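Beyond eyeballing the plot, stationarity can also be checked numerically with the augmented Dickey-Fuller test from statsmodels. This is an optional sketch of our own, not a step from the original write-up.

# Optional: augmented Dickey-Fuller test as a numeric check for stationarity.
from statsmodels.tsa.stattools import adfuller

adf_stat, p_value = adfuller(ts)[:2]
print('ADF statistic: %f' % adf_stat)
print('p-value: %f' % p_value)  # a large p-value suggests the series is non-stationary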

Here we create a training time series that excludes the year 2015, so that later we can compare actual vs. predicted values. Notice we use the same notation as the R version for simplicity's sake.

sp500_TR = ts['1995':'2014']
print sp500_TR
1995-01-31    464.547500
1995-02-28    479.072510
1995-03-31    493.987503
1995-04-30    507.725006
1995-05-31    523.650009
1995-06-30    538.805008
1995-07-31    553.680008
1995-08-31    560.649994
1995-09-30    573.727493
1995-10-31    582.029983
1995-11-30    594.149994
1995-12-31    612.307495
1996-01-31    621.354996
1996-02-29    643.597504
1996-03-31    642.632492
...
2013-10-31    1715.160004
2013-11-30    1781.065003
2013-12-31    1818.084991
2014-01-31    1812.434967
2014-02-28    1811.992523
2014-03-31    1862.107483
2014-04-30    1867.387482
2014-05-31    1897.945007
2014-06-30    1942.062500
2014-07-31    1953.755036
2014-08-31    1960.747528
2014-09-30    1989.915009
2014-10-31    1957.084991
2014-11-30    2040.635010
2014-12-31    2047.697510
Freq: M, Length: 240
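If you also want the 2015 values on hand for the comparison in step 6, they can be sliced out the same way. This is a small sketch of our own; the name sp500_TE is just our convention for the hold-out slice.

# Keep the year 2015 as a hold-out set to compare against the forecasts later.
sp500_TE = ts['2015']
print(sp500_TE.head())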

4. Model Estimation

Diagnosing the ACF and PACF plots

Our next step is to plot the ACF and PACF plots to diagnose and deduce an appropriate ARIMA model. For this step we use plot_acf and plot_pacf from the statsmodels module. We assign the resulting figures to acf_fig and pacf_fig (avoiding the names acf and pacf, which would shadow the functions we imported earlier):

acf_fig = plot_acf(ts, lags = 20)
plt.title("ACF Plot of S & P 500")
acf_fig.show()
pacf_fig = plot_pacf(ts, lags = 20)
plt.title("PACF Plot of S & P 500")
pacf_fig.show()

From the ACF and PACF plots we can see the geometric decay of our time series, so we decide to transform our data.

Transforming our data to adjust for non-stationarity

Because we are reiterating our previous project we won't go into too much detail about the methods, but here we difference our time series object to account for the non-stationary process.

It is important to note that we use the dropna function to get rid of the NaN values produced by differencing, which would otherwise cause problems in our analysis.

sp500_diff = ts - ts.shift()
diff = sp500_diff.dropna()
print diff.head(12)
1995-02-28    14.525010
1995-03-31    14.914993
1995-04-30    13.737503
1995-05-31    15.925003
1995-06-30    15.154999
1995-07-31    14.875000
1995-08-31     6.969986
1995-09-30    13.077499
1995-10-31     8.302490
1995-11-30    12.120010
1995-12-31    18.157502
1996-01-31     9.047501
Freq: M, dtype: float64

Next we plot our differenced time series to see if we achieved stationarity!

plt.plot(diff)
plt.title('First Difference Time Series Plot')
plt.show()

Diagnosing the acf and pacf of our transformed Time-Series Object

Here we check the ACF and PACF plots for diff in order to identify the number of terms for the ARIMA model we will use. Since we differenced our time series, the "integrated" part of the model is necessary, so we are going to use an ARIMA(p, 1, q) model (that is why we look at the plots: to estimate p and q)!

We follow the exact same process as before, except now we use diff instead of ts.

Here we plot the ACF for our differenced time series.

acfDiff = plot_acf(diff, lags = 20)
plt.title("ACF Plot of S & P 500 (Difference)")
acfDiff.show()

Next, our PACF plot.

pacfDiff = plot_pacf(diff, lags = 20)
plt.title("PACF Plot of S & P 500 (Difference)")
pacfDiff.show()
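If you prefer numbers over pictures, the acf and pacf functions imported in step 1 return the raw correlation values. This is a small optional check of our own, using the same 20 lags as the plots above.

# Optional: numeric ACF/PACF values for the differenced series (first few lags).
acf_vals = acf(diff, nlags = 20)
pacf_vals = pacf(diff, nlags = 20)
print(acf_vals[:5])
print(pacf_vals[:5])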

From these plots we decide on an ARIMA(0, 1, 1) model, a choice we discuss in more detail in the R version!
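If you want a more systematic check than reading the plots, one common approach is to fit a few candidate orders and compare their AIC values (lower is better). This sketch is our own addition and simply previews the ARIMA call used in the next step.

# Optional: compare a few candidate ARIMA orders by AIC (lower is better).
for order in [(0, 1, 1), (1, 1, 0), (1, 1, 1)]:
    try:
        candidate = ARIMA(sp500_TR, order = order, freq = 'M').fit()
        print(order, candidate.aic)
    except Exception as e:
        print(order, 'did not converge:', e)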

5. Build Model

So we decided on an ARIMA(0, 1, 1) model, and we create this model using the ARIMA function in the statsmodels module along with ARIMAResults!

mod = ARIMA(sp500_TR, order = (0, 1, 1), freq = 'M')

results = mod.fit()
print results.summary()

From the results we can see that the constant's 95% confidence interval suggests it is not statistically significant. The p-value for the MA term is effectively 0, which is a good sign. I will have to research time series results in Python further since I am relatively new to the language. I would also like to look into including drift, which is something we did in the R project but was not able to do in Python.

This is an iterative process, so as I learn more about time series applications in Python I will keep adding diagnostics, such as residual plots; a first sketch follows.
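In the meantime, a basic residual check is straightforward with the fitted results object. This is a minimal sketch of our own: the residuals of a well-specified model should look like white noise.

# Basic residual diagnostics: residuals should resemble white noise.
resids = results.resid
plt.plot(resids)
plt.title('ARIMA(0, 1, 1) Residuals')
plt.show()

resid_acf = plot_acf(resids, lags = 20)
plt.title('ACF Plot of Residuals')
resid_acf.show()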

6. Forecasting

Finally we plot our predictions and compare it to our actual values. We do so using the predict method after we fit our model!

# Here's us predicting the year 2015
predVals = results.predict(239, 251, typ='levels')
print predVals

2014-12-31    2107.259397
2015-01-31    2020.444050
2015-02-28    2026.936909
2015-03-31    2033.429767
2015-04-30    2039.922626
2015-05-31    2046.415484
2015-06-30    2052.908343
2015-07-31    2059.401201
2015-08-31    2065.894060
2015-09-30    2072.386918
2015-10-31    2078.879777
2015-11-30    2085.372635
2015-12-31    2091.865494
Freq: M, dtype: float64

There were a number of discrepancies in the documentation for some of the modules, so there were some issues we had to work around; we hope someone can suggest how to fix some of the issues we faced. We removed the first predicted value since it was not in the interval we wanted.

predVals = predVals.drop(predVals.index[0])
print predVals

2015-01-31    2020.444050
2015-02-28    2026.936909
2015-03-31    2033.429767
2015-04-30    2039.922626
2015-05-31    2046.415484
2015-06-30    2052.908343
2015-07-31    2059.401201
2015-08-31    2065.894060
2015-09-30    2072.386918
2015-10-31    2078.879777
2015-11-30    2085.372635
2015-12-31    2091.865494
dtype: float64

Now that we have the predicted values, we want to compare them to the actual 2015 closing values! For this part we create a pandas object that contains two time series objects. We are essentially merging them (through the pd.concat function in pandas) on their shared trait, the date range; every date before Jan 1 2015 has no predicted value, so it shows up as NaN. Here we print the years 2014 and 2015 to give you a better visual idea of what I'm talking about.

sp500_for = pd.concat([ts, predVals], axis = 1, keys=['original', 'predicted'])
print sp500_for['2014':'2015']

               original    predicted
2014-01-31  1812.434967          NaN
2014-02-28  1811.992523          NaN
2014-03-31  1862.107483          NaN
2014-04-30  1867.387482          NaN
2014-05-31  1897.945007          NaN
2014-06-30  1942.062500          NaN
2014-07-31  1953.755036          NaN
2014-08-31  1960.747528          NaN
2014-09-30  1989.915009          NaN
2014-10-31  1957.084991          NaN
2014-11-30  2040.635010          NaN
2014-12-31  2047.697510          NaN
2015-01-31  2028.592499  2020.444050
2015-02-28  2050.415039  2026.936909
2015-03-31  2082.582459  2033.429767
2015-04-30  2081.859925  2039.922626
2015-05-31  2099.354920  2046.415484
2015-06-30  2089.485046  2052.908343
2015-07-31  2086.920044  2059.401201
2015-08-31  2014.084992  2065.894060
2015-09-30  1945.722504  2072.386918
2015-10-31  1996.757538  2078.879777
2015-11-30  2074.259979  2085.372635
2015-12-31  2056.099976  2091.865494

[24 rows x 2 columns]

Now our objective is to graph this object and see how well the model was able to predict. To reiterate, we do plan on continually grabbing the closing values of the S & P 500 and comparing at least biannually how our model is doing, but for now we just compare against the year 2015! Again we use matplotlib to visualize our data.

plt.plot(sp500_for)
plt.xlim([0, 255])
plt.title("Actual Vs. Forecasted Values")
plt.show()

Now we want to zoom in since we can't really see the predicted vs. actual values!

plt.plot(sp500_for)
plt.xlim([230, 255])
plt.ylim([1800, 2200])
plt.title('Real Vs. Predicted Values for 2015')
plt.show()
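As a quick numeric complement to the plots, we can also score the 2015 forecasts against the actual values. This sketch is our own addition; RMSE and MAPE are simply the metrics we chose, computed with numpy from the sp500_for object built above.

# Score the 2015 forecasts against the actual closing values.
comparison = sp500_for['2015':'2015'].dropna()
errors = comparison['original'] - comparison['predicted']
rmse = np.sqrt(np.mean(errors ** 2))
mape = np.mean(np.abs(errors / comparison['original'])) * 100
print('RMSE: %.2f' % rmse)
print('MAPE: %.2f%%' % mape)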

7. Conclusions

Since I just started learning Python, this tutorial is not as detailed as the R version, but it should help you get started with a time series analysis project in Python. I will keep iterating on this project to improve it and add more descriptive plots, much like the R version. If you can give feedback or tips for time series analysis in Python, please feel free to contact us! We are more than willing to learn and want other people's feedback, especially since Python is a relatively new language to us!

8. Resources

We faced some documentation issues: when we tried following other people's time series analyses, some functions were written a different way and some of the processes they used were deprecated, so we worked our way around these issues. These resources were nevertheless a big contribution to this time series analysis in Python:

- Aarshay Jain. Complete Guide to Create a Time Series Forecast (with Codes in Python). Analytics Vidhya. 06 Feb. 2016. Web. 07 Oct. 2016.

CONGRATULATIONS!

Congratulations for getting this far! We hope you enjoyed this project. Please reach out to us here if you have any feedback or would like to publish your own project.

Fork this project on GitHub.

Try this project next:


Forecasting the Stock Market (R)

Time-Series Analysis of the S&P 500 Stock Index with R