Time-Series Forecasting with Spark ML: Part

1. Introduction

Momentum investing (or momentum trading) is a strategy in which an investor buys liquid securities, such as stocks, whose prices are trending upward and sells those whose prices are trending downward. One of its main goals is to forecast the price trend of a stock so that the investor can enter and exit a position in a timely manner.

In this work, we seek to develop short-term momentum trading strategies for stocks. We investigate a group of popular blue-chip stocks, referred to here as the FANG stocks (Facebook, Apple, Netflix and Google). The objective is to forecast the near-future prices (1-day, 3-day, 5-day and 7-day forecasts) of these stocks using the Apache Spark machine learning libraries and historical daily-price data from 2008 to 2018 obtained from Nasdaq.com. The forecasts will be used to formulate simple short-term trading strategies.

2. Description of Software/Tools

Quandl package

We use data from Nasdaq.com, which we access through a package called quandl [1]. Quandl, a popular database of financial time-series data, has been acquired by Nasdaq. It provides Python and R APIs for downloading the data and for performing simple manipulations on it [2]. We use the Python time-series API to obtain the data [3].

Installation:

We can install the package from PyPI (the source is also available in a GitHub repository) using

$ pip install quandl

Using the package:

We simply import the quandl package using the import command in our Python client program.

import quandl

Authentication:

To access the data through the quandl APIs, we need to create an account on the quandl website [1] and generate an authentication key. We then set the key in the Python client program as follows:

quandl.ApiConfig.api_key = "your API key"

Querying data:

We query the data using the quandl Python APIs. The result is returned as a pandas dataframe. For example, we can obtain the time series of Apple's stock price as follows:

apple_data = quandl.get("WIKI/AAPL")

apple_data is a pandas dataframe whose index is the "Date" column (of datetime data type). We need to reset the index if we want to use "Date" as an ordinary column in our calculations (Figure 1a).

Figure 1a. Obtaining Apple stock price data from Nasdaq.com using quandl package
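The index reset mentioned above can be sketched without a network call. The snippet below is a minimal illustration that assumes pandas is installed; the small frame `apple_like` and its prices are made up to mimic the shape of what quandl returns, they are not real quotes.

```python
import pandas as pd

# Tiny stand-in for the frame returned by quandl.get: a DatetimeIndex named "Date".
# The prices are illustrative values, not real quotes.
dates = pd.to_datetime(["2018-03-23", "2018-03-26", "2018-03-27"])
apple_like = pd.DataFrame(
    {"Close": [100.0, 102.5, 101.0]},
    index=pd.DatetimeIndex(dates, name="Date"),
)

# reset_index() moves the index into an ordinary "Date" column
# and replaces it with a default 0..n-1 integer index.
apple_flat = apple_like.reset_index()

print(list(apple_flat.columns))  # column names after the reset
```

After the reset, "Date" can be used like any other column, for example when building features or passing the data on to Spark.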

If we need data from a specified time frame (Figure 1b), we can make a filtered time-series call [3]:

apple_data_timeframe = quandl.get("WIKI/AAPL", start_date='2014-09-01', end_date='2015-09-01')

Figure 1b. Obtaining time-series data for a specified time window; here we obtain the Apple stock price between 2014-09-01 and 2015-09-01.

If we need to transform the data while querying, we can add the transformation type to the call [3]. For example, if we need differenced data (i.e. $y_t - y_{t-1}$), the command is (Figure 1c):

apple_data_diff = quandl.get("WIKI/AAPL", transformation='diff')

Figure 1c. Obtaining data after minor manipulations; here we perform first-order differencing (i.e. current_value - previous_timestep_value) on Apple stock price data.

Furthermore, we can change the frequency of the time-series data by adding a frequency argument (named collapse in the Python API) to the get call. For instance, we can obtain data at daily, weekly or monthly frequency. We can also download the data as CSV files or through an Excel add-in. Thus, the quandl package provides a set of easy-to-use APIs for obtaining financial time-series data from Nasdaq.com.
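The query options described in this section (time window, frequency, transformation) can be combined in a single call. The sketch below is one way to organize that; `build_query` and `fetch` are hypothetical helpers written for this illustration, not part of quandl, while `start_date`, `end_date`, `collapse` and `transformation` are keyword arguments accepted by `quandl.get`. Running `fetch` requires the quandl package, network access and an API key.

```python
def build_query(ticker, start_date=None, end_date=None,
                collapse=None, transformation=None):
    """Return the dataset code and keyword arguments for quandl.get.

    Hypothetical helper for this sketch; it only assembles arguments.
    """
    options = {
        "start_date": start_date,          # e.g. "2014-09-01"
        "end_date": end_date,              # e.g. "2015-09-01"
        "collapse": collapse,              # e.g. "weekly" or "monthly"
        "transformation": transformation,  # e.g. "diff"
    }
    # Drop the options that were not supplied so quandl uses its defaults.
    kwargs = {key: value for key, value in options.items() if value is not None}
    return "WIKI/" + ticker, kwargs


def fetch(ticker, **options):
    """Download one ticker as a pandas DataFrame (needs network and an API key)."""
    import quandl  # imported lazily so build_query works without the package
    code, kwargs = build_query(ticker, **options)
    return quandl.get(code, **kwargs)


code, kwargs = build_query("AAPL", start_date="2014-09-01",
                           end_date="2015-09-01", collapse="monthly")
print(code, kwargs)
```

Keeping the argument assembly separate from the download makes it easy to inspect exactly what will be requested before any call is made.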

In addition to the quandl package, we use PySpark in Anaconda (Spyder version 3.3.2, Jupyter version 5.6.0). We also use VMware Workstation with a CentOS 7.5 VM.

3. Description of Data and Exploratory data analysis

We look at the prices of four stocks (Apple, Facebook, Netflix and Google) and perform a brief exploratory data analysis. We perform simple tests to determine the stationarity of each time series, treating each stock price as a univariate time series.

We first check for null values, starting with Facebook's stock price data. We have daily closing prices from 2012-05-18 to 2018-03-27, and there are no null values in the dataset.

Similarly, we check the datasets for Apple, Netflix and Google stock prices.

For Netflix, we have daily closing prices from 2002-05-23 to 2018-03-27. For Apple, the data range from 1980-12-12 to 2018-03-27. For Google, the data range from 2004-08-19 to 2018-03-27. None of these three datasets contains null values.

Figure 2a. Checking for null values in Facebook stock price data

Figure 2b. Checking for null values in Netflix stock price dataset.

Figure 2c. Checking for null values in Apple stock price dataset.

Figure 2d. Checking for null values in Google stock price dataset
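A null check of the kind shown in Figures 2a to 2d can be sketched as follows. This assumes pandas is installed; the frame `prices` is a made-up example with one deliberately missing value, so that the check has something to find (the actual stock datasets contain no nulls).

```python
import pandas as pd

# Illustrative frame with one deliberately missing value; not real price data.
prices = pd.DataFrame({
    "Open": [10.0, 10.5, 11.0],
    "Close": [10.2, None, 11.1],
})

# isnull() marks missing cells; sum() counts them per column.
null_counts = prices.isnull().sum()
print(null_counts)

# A single yes/no answer for the whole frame.
has_nulls = bool(prices.isnull().values.any())
print(has_nulls)
```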

Checking for stationarity of time-series:

We perform the augmented Dickey-Fuller (ADF) test to determine whether each time series is stationary. We observe that all four stock-price time series are non-stationary.

Figure 2e. We see that all the four stock price time-series are non-stationary

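The null hypothesis of the ADF test is that the series has a unit root, i.e. that it is non-stationary. We fail to reject this null hypothesis, and therefore treat the series as non-stationary, if the p-value is greater than 0.05. The results are summarized below: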

Figure 2f. Summary of augmented Dickey-Fuller’s test performed on all the stock prices
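The decision rule for the ADF test can be written down explicitly. The sketch below is not the code used to produce the figures: `interpret_adf` and `run_adf` are hypothetical helpers, and `run_adf` assumes the statsmodels package is available, using its `adfuller` function. The numbers passed to `interpret_adf` at the end are made up for illustration.

```python
def interpret_adf(adf_stat, p_value, critical_values, alpha=0.05):
    """Apply the ADF decision rule.

    Null hypothesis: the series has a unit root (it is non-stationary).
    A p-value below alpha rejects the null, suggesting stationarity.
    """
    return {
        "stationary": p_value < alpha,
        "below_1pct_critical": adf_stat < critical_values["1%"],
    }


def run_adf(series):
    """Run the test with statsmodels (assumed installed) and interpret it."""
    from statsmodels.tsa.stattools import adfuller  # lazy import
    adf_stat, p_value, _, _, critical_values, _ = adfuller(series)
    return interpret_adf(adf_stat, p_value, critical_values)


# Made-up test output resembling a trending price series.
illustrative_critical = {"1%": -3.43, "5%": -2.86, "10%": -2.57}
example = interpret_adf(-0.5, 0.89, illustrative_critical)
print(example)
```

A large p-value together with a statistic above the critical values, as in this made-up example, is the pattern that leads us to treat the raw price series as non-stationary.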

Although the complete time series data is non-stationary, there could be parts of data that are stationary.

We make the time-series data approximately stationary by obtaining the first difference of the data. First difference is defined as follows,

If $S_t$ is the stock price at time-step $t$, then the first difference is $Z_t = S_t - S_{t-1}$.
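As a small worked example of this definition, the following sketch computes the first difference of a short list of made-up closing prices in plain Python. The function name `first_difference` is chosen for this illustration only.

```python
def first_difference(prices):
    """Return Z_t = S_t - S_{t-1}; the result is one element shorter than the input."""
    return [curr - prev for prev, curr in zip(prices, prices[1:])]


closing = [100, 103, 101, 106]     # made-up closing prices
print(first_difference(closing))   # → [3, -2, 5]
```

With pandas, `Series.diff()` computes the same quantity but keeps the original length, leaving a missing value in the first position.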

This method of making the series stationary is approximate and works only if the time series is difference-stationary. Stock price data, however, are known to be highly non-stationary and can exhibit trends, seasonality and other structure. We do not explore these aspects in this work.

After first-differencing the raw stock price data, we apply machine learning algorithms to the differenced data. Modeling the differenced series rather than the raw prices is a common way to obtain more stable forecasts. To check whether the differenced data are stationary, we again perform the augmented Dickey-Fuller test, this time on the differenced data. We see that the differenced data are stationary: all the p-values are less than 0.05, and the ADF statistics are highly negative and below the 1% critical value (see Figures 2h and 2i). This suggests that we should work with differenced data for time-series forecasting.

Figure 2g. Obtaining differenced data using quandl API

Figure 2h. Differenced data for Apple and Google stock price are stationary, indicated by p-value < 0.05 and ADF statistic highly negative and less than 1% critical value

Figure 2i. Differenced data for Netflix and Facebook stock price are stationary, indicated by p-value < 0.05 and ADF statistic highly negative and less than 1% critical value

We compare the 30-day rolling mean of the raw price data with that of the differenced data for Google stock. After differencing, the rolling mean is roughly constant, which is consistent with the differenced data being stationary and the trend having been removed.

Figure 2j. Comparison of 30-day rolling mean for raw price data (top figure) and differenced price data (bottom figure) for Google stock; x-axis is number of time-windows.

However, differencing does not change the 30-day rolling standard deviation much.

Figure 2k. Comparison of 30-day rolling standard deviation for raw price data (top figure) and differenced price data (bottom figure) for Google stock; x-axis is number of time-windows
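The rolling statistics compared in these figures can be sketched in plain Python. The helper `rolling` is hypothetical and written for this illustration; the series is made up and uses a window of 3 instead of 30 to keep the example short.

```python
from statistics import mean, stdev


def rolling(values, window, stat):
    """Apply stat to every full window of consecutive values."""
    return [stat(values[i - window:i]) for i in range(window, len(values) + 1)]


prices = [10, 12, 14, 16, 18, 20]   # made-up, steadily trending series
diffs = [b - a for a, b in zip(prices, prices[1:])]

print(rolling(prices, 3, mean))   # drifts upward with the trend
print(rolling(diffs, 3, mean))    # flat once the trend is differenced away
print(rolling(diffs, 3, stdev))   # spread of the differenced series per window
```

With pandas, the equivalent calls on a Series are `rolling(30).mean()` and `rolling(30).std()`.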

Similarly, for Apple stock, differencing removes the trend in the stock price and makes the series stationary. The 30-day rolling mean of the differenced data is mostly constant (except toward the end of the series).

Figure 2l. Comparison of 30-day rolling mean for raw price data (top figure) and differenced price data (bottom figure) for Apple stock; x-axis is number of time-windows of size 30 days.

We see that the variance of the Apple stock price does not change much upon differencing, as we saw earlier for the Google stock price.

Figure 2m. Comparison of 30-day rolling standard deviation for raw price data (top figure) and differenced price data (bottom figure) for Apple stock; x-axis is number of time-windows.

Figure 2n. Comparison of 30-day rolling mean between raw data and differenced data for Netflix and Facebook stock prices.

Thus, we conclude that by differencing the time-series data we can make it stationary and hence avoid spurious regression. In the next part of the series, we will describe, in detail, the classes and utility functions we developed to perform time-series forecasting using Spark MLlib.

References:

[1] www.quandl.com

[2] http://docs.quandl.com

[3] https://docs.quandl.com/docs/python-time-series
