This project focuses on deriving the best statistical-learning model to predict future values for the S&P 500 Stock Index with Python.
Understanding the S&P 500 Stock Index is highly-relevant in understanding the health of the U.S. economy as it is highly-correlated (statistically) with other U.S. economic indicators such as other stock indices (e.g. the NASDAQ Stock Index and the NYSE Stock Exchange), the U.S. Housing Price Index, and the U.S. Gross Domestic Product (U.S. GDP).
Time-series analysis is a basic concept within the field of statistical-learning, which is appropriate for the analysis of the S&P 500 Stock Index.
For this project we leverage the horse-power of Python and deliver, where appropriate, gorgeous data visualizations using matplotlib.
We encourage you to try replicating this project and make your own contributions! You can fork this project on GitHub.
First, we want to load the appropriate modules into our python environment. For this we use the
import method, followed by the module argument, and alias.
Make sure to have these modules installed in your global environment first. For this, you can use
sudo pip install MODULE_NAME in your console; replace
MODULE_NAME with the package you would like to install.
# START A script.py FILE # IMPORT YOUR MODULES import pandas as pd import numpy as np import statsmodels.api as sm from statsmodels.tsa.arima_model import ARIMA, ARIMAResults from statsmodels.tsa.stattools import acf, pacf from statsmodels.graphics.tsaplots import plot_acf, plot_pacf import matplotlib.pylab as plt import matplotlib.dates as dates from matplotlib.pylab import rcParams rcParams['figure.figsize'] = 15, 6
We encourage you to research these modules so that you can get a grasp on what each library is helping you accomplish.
For now we just copy pasted from the Time Series in R project since we used the same data set and process to collect that data set! Now we collect our data. We want to use reliable sources of complete and accurate data. We collected 21 years (1995-2015) of S & P 500 Stock Index data at a monthly frequency (a total of 252 observations) from Yahoo Finance. You can do the same too.
We cleaned our data manually using Google Sheets. For this we made sure that all fields contained data (i.e. no missing values) and that the headers (i.e. column names) were labeled appropriately. This can be done programmatically with python, but this is outside of the scope of this project. Although we will make a tutorial later on!
We also included a bunch of other variables (for exploratory analysis) such as the NASDAQ Stock Index, the NYSE Stock Exchange, the U.S. Housing Price Index, and the U.S. Gross Domestic Product. Here is our data in file. csv, which is ready to be used with Python.
Then we must include our data set within our working python environment. For this we are using the
pandas module to load our data first as a data frame from the csv file we compiled then we convert it to a time series object! The time series object being an extension of a data frame in the
dataMaster = pd.read_csv('sp_500_ts.csv') sp_500 = dataMaster['sp_500'] # Let's print the first 12 observations! print sp_500.head(12) 0 464.547500 1 479.072510 2 493.987503 3 507.725006 4 523.650009 5 538.805008 6 553.680008 7 560.649994 8 573.727493 9 582.029983 10 594.149994 11 612.307495 Name: sp_500, dtype: float64
So for the next part we are creating a range for our data to interpret/know what intervals it will be taken in as. We create
ran where we specify we want it to start at Jan 1995 and end up until before Jan 2016 (i.e. We want it to stop at Dec 2015) and the
freq is set to 'M' for monthly. Then we use the
pd.date_range function to create the time series object where
ran is our range!
ran = pd.date_range('1995-01', '2016-1', freq = 'M') ts = pd.Series(dataMaster['sp_500'].values, index = ran)
Now we can call our S & P 500 Stock Index data by typing
sp_500 into our terminal/respective working environment.
print ts.head(12) print ts.dtypes # This will output the structure of the pandas object # We do this to make sure we have a time series object! print ran # We print the range to be sure we have it formatted correctly class 'pandas.tseries.index.DatetimeIndex'> [1995-01-31, ..., 2015-12-31] Length: 252, Freq: M, Timezone: None 1995-01-31 464.547500 1995-02-28 479.072510 1995-03-31 493.987503 1995-04-30 507.725006 1995-05-31 523.650009 1995-06-30 538.805008 1995-07-31 553.680008 1995-08-31 560.649994 1995-09-30 573.727493 1995-10-31 582.029983 1995-11-30 594.149994 1995-12-31 612.307495 Freq: M, dtype: float64 float64 class 'pandas.tseries.index.DatetimeIndex'> [1995-01-31, ..., 2015-12-31] Length: 252, Freq: M, Timezone: None
From this output we can see that we know have a time series object as opposed to a data frame so we can begin our time series analysis!
Plotting the data is arguably the most critical step in the exploratory analysis phase. This enables us to make inferences about important components of the time-series data, such as trend, seasonality, heteroskedasticity, and stationarity. Here is a quick summary of each:
We can plot our
sp_500 data object in R using the
plt.plot(ts) plt.title('Time Series Plot for S & P 500') plt.xlim([0, 255]) plt.show()
Here we create a time series object where we remove the year 2015 to compare the actual vs predicted values! Notice we use the same notation as the R version for simplicity's sake.
sp500_TR = ts['1995':'2014'] print sp500_TR 1995-01-31 464.547500 1995-02-28 479.072510 1995-03-31 493.987503 1995-04-30 507.725006 1995-05-31 523.650009 1995-06-30 538.805008 1995-07-31 553.680008 1995-08-31 560.649994 1995-09-30 573.727493 1995-10-31 582.029983 1995-11-30 594.149994 1995-12-31 612.307495 1996-01-31 621.354996 1996-02-29 643.597504 1996-03-31 642.632492 ... 2013-10-31 1715.160004 2013-11-30 1781.065003 2013-12-31 1818.084991 2014-01-31 1812.434967 2014-02-28 1811.992523 2014-03-31 1862.107483 2014-04-30 1867.387482 2014-05-31 1897.945007 2014-06-30 1942.062500 2014-07-31 1953.755036 2014-08-31 1960.747528 2014-09-30 1989.915009 2014-10-31 1957.084991 2014-11-30 2040.635010 2014-12-31 2047.697510 Freq: M, Length: 240
Our next step is to plot our ACF and PACF plots to diagnose and deduce an appropriate ARIMA model. For this step we use
plot_pacf from the
statsmodels module! We assign the names
acf = plot_acf(ts, lags = 20) plt.title("ACF Plot of S & P 500") acf.show()
pacf = plot_pacf(ts, lags = 20) plt.title("PACF Plot of S & P 500") pacf.show()
Thus we can see the geometric decay of our time series through the ACF and PACF plot so we decide transform our data.
Because we are reiterating our previous project we won't go to much into detail over the methods. But here we difference our time series object to account for non-stationary processes
Important to note is that we use the
dropna function to get rid of the
NaN produced by differencing. Which would cause problems in our analysis!
sp500_diff = ts - ts.shift() diff = sp500_diff.dropna() print diff.head(12) 1995-02-28 14.525010 1995-03-31 14.914993 1995-04-30 13.737503 1995-05-31 15.925003 1995-06-30 15.154999 1995-07-31 14.875000 1995-08-31 6.969986 1995-09-30 13.077499 1995-10-31 8.302490 1995-11-30 12.120010 1995-12-31 18.157502 1996-01-31 9.047501 Freq: M, dtype: float64
Next we plot our differenced time series to see if we achieved stationarity!
plt.plot(diff) plt.title('First Difference Time Series Plot') plt.show()
Here we are checking the ACF and PACF plots for
diff in order to identify the number of terms for the ARIMA model we will use. Since we differenced our time series this makes the inclusion of the "integrated" part necessary so we are going to use an ARIMA(p, 1, q) model(That's why we look at the plots to estimate the p and q)!
Same exact process except we're using
diff instead of
plt.plot(diff) plt.title('First Difference Time Series Plot') plt.title("ACF Plot of S & P 500(Difference)") acfDiff.savefig("images/timeSeriesACFDiff.png", format = 'png') plt.show()
Here we plot the ACF for our differenced time series.
acfDiff = plot_acf(diff, lags = 20) plt.title("ACF Plot of S & P 500(Difference)") acfDiff.show()
Next our PACF plot.
acfDiff = plot_acf(diff, lags = 20) plt.title("ACF Plot of S & P 500(Difference)") acfDiff.show()
From these plots we decide on an ARIMA(0, 1, 1) model which again we went into more detail in the R version!
So we decided on an ARIMA(0, 1, 1) model and we create this model using the
ARIMA function in the
statsmodelsmodule along with
mod = ARIMA(sp500_TR, order = (0, 1, 1), freq = 'M') results = mod.fit() print results.summary()
From the results we can see that the constant's 95% confidence interval suggests that it is not statistically significant. The p value for the Ma process is 0 that's a good sign. I will have to research more about the time series results with python since I am relatively new to python. I would also look into the issue of including drift which is something we did in the R project, but I was't able to do so on python.
This is an iterative process so as I learn time series application in python I will keep adding more diagnostics like plotting residuals diagnostics!
Finally we plot our predictions and compare it to our actual values. We do so using the
predict method after we fit our model!
# Here's us predicting the year 2015 predVals = results.predict(239, 251, typ='levels') print predVals 2014-12-31 2107.259397 2015-01-31 2020.444050 2015-02-28 2026.936909 2015-03-31 2033.429767 2015-04-30 2039.922626 2015-05-31 2046.415484 2015-06-30 2052.908343 2015-07-31 2059.401201 2015-08-31 2065.894060 2015-09-30 2072.386918 2015-10-31 2078.879777 2015-11-30 2085.372635 2015-12-31 2091.865494 Freq: M, dtype: float64
There was a lot of discrepancies with respect to the documenatation for some of the modules so I there was some issues I was able to navigate, but I hope with the process someone can suggest how to fix some issues we faced. We removed the first value since it was not in the interval we wanted
predVals = predVals.drop(predVals.index) print predVals 2015-01-31 2020.444050 2015-02-28 2026.936909 2015-03-31 2033.429767 2015-04-30 2039.922626 2015-05-31 2046.415484 2015-06-30 2052.908343 2015-07-31 2059.401201 2015-08-31 2065.894060 2015-09-30 2072.386918 2015-10-31 2078.879777 2015-11-30 2085.372635 2015-12-31 2091.865494 dtype: float64
Now that we have the predicted values we want to compare to see how they did with respect to the actual 2015 closing values! For this part we create a pandas object that includes two time series object. We are esentially merging them (through the
pd.concat function in pandas) by their similar trait, the date range. We merge based on that with everything before the date Jan 1 2015 for the predicted values producing NaN's. Here we print out the years 2014 and 2015 to give you a better visual idea of what I'm talking about.
sp500_for = pd.concat([ts, predVals], axis = 1, keys=['original', 'predicted']) print sp500_for['2014':'2015'] original predicted 2014-01-31 1812.434967 NaN 2014-02-28 1811.992523 NaN 2014-03-31 1862.107483 NaN 2014-04-30 1867.387482 NaN 2014-05-31 1897.945007 NaN 2014-06-30 1942.062500 NaN 2014-07-31 1953.755036 NaN 2014-08-31 1960.747528 NaN 2014-09-30 1989.915009 NaN 2014-10-31 1957.084991 NaN 2014-11-30 2040.635010 NaN 2014-12-31 2047.697510 NaN 2015-01-31 2028.592499 2020.444050 2015-02-28 2050.415039 2026.936909 2015-03-31 2082.582459 2033.429767 2015-04-30 2081.859925 2039.922626 2015-05-31 2099.354920 2046.415484 2015-06-30 2089.485046 2052.908343 2015-07-31 2086.920044 2059.401201 2015-08-31 2014.084992 2065.894060 2015-09-30 1945.722504 2072.386918 2015-10-31 1996.757538 2078.879777 2015-11-30 2074.259979 2085.372635 2015-12-31 2056.099976 2091.865494 [24 rows x 2 columns]
Now our objective is to graph this object and compare how well it was capable of predicting. And to reiterate we do plan on continually grabbing the closing values of the S & P 500 and compare at least biannually how our model is doing! But for now we're just going to compare to the year 2015! Again we're using
matplotlib to visualize our data.
plt.plot(concan) plt.xlim([0, 255]) plt.title("Actual Vs. Forecasted Values") plt.show()
Now we want to zoom in since we can't really see the predicted vs. actual values!
plt.plot(concan) plt.xlim([230, 255]) plt.ylim([1800, 2200]) plt.title('Real Vs. Predicted Values for 2015') plt.show()
Since I just started learning python this tutorial is not as detailed as the R version, but should help in starting a project with respect to Time Series Analysis in Python. I will be reiterating this project to improve and add more descriptive plots much like the R version. But if you can give feedback and tips for time Series Analysis in python please feel free to contact us! We are more than willing to learn and want other people's feedback. Especially since python is a relatively new language to us!
We faced some issues with documentation in that when we tried following someone's time series analysis some functions were written a different way and some of the processes they did were defunct, so we worked our way around these issues, but these resources were a big contribution to this time series analysis in Python:
- Aarshay Jain. Complete Guide to Create a Time Series Forecast (with Codes in Python). Analytics Vidhya. 06 Feb. 2016. Web. 07 Oct. 2016.
Congratulations for getting this far! We hope you enjoyed this project. Please reach out to us here if you have any feedback or would like to publish your own project.