No time to waste, let's get to it.
Backdrop
The global arena is heavily influenced by crude oil, an indispensable commodity that permeates the economic, political, and technological realms. Given the volatility inherent in the energy resource market, forecasting the trajectory of crude oil prices is of paramount importance: it equips governments and energy departments with the information needed for sound policy-making and paves the way for sustainable economic growth. The ability to accurately predict crude oil prices is therefore invaluable.
Researchers employ two predominant approaches to prediction: one qualitative and the other quantitative, the latter encompassing econometric and statistical models. The majority of scholars opt for the quantitative methodology.
Nevertheless, the crude oil market is strongly non-linear, which complicates the task of forecasting its movements. Neural network techniques offer a promising tool for tackling such nonlinear time series forecasting challenges.
Project Objective
In this endeavor, I shall harness the power of deep learning, utilizing recurrent neural networks (e.g., LSTM, LSTM with dropout, etc.) and feed-forward neural networks (dense layer), to perform time series forecasting for WTI crude oil prices. The predictions generated by these models will offer valuable insights into the crude oil industry.
Data Synopsis
The behavior of crude oil prices is influenced by an intricate web of factors, resulting in a seemingly mysterious dance of price movements. My analysis will take into account energy resource prices and oil-sensitive stock prices, as I believe these two dimensions are inextricably linked to fluctuations in crude oil prices.
I have assembled data from various sources to create a comprehensive dataset:
Energy Resource Price
Crude Oil Prices: WTI (West Texas Intermediate) crude oil daily price data, spanning from the 1980s to 2020, were obtained from the API of the U.S. Energy Information Administration (EIA).
Propane & Natural Gas:
Propane and natural gas are both prominent energy sources.
Propane, a byproduct of crude oil refining and natural gas processing, is heavily influenced by crude oil prices. Natural gas, meanwhile, had a correlation coefficient of roughly 0.25 with crude oil over the study period, indicating a modest positive relationship between the two price series (note that a correlation of 0.25 does not mean 25% of natural gas price changes are explained by oil prices; the share of variance explained would be closer to the square of that figure, about 6%).
Daily price data for propane and natural gas, spanning from the 1990s to 2020, were also sourced from the EIA.
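As a quick sanity check on that relationship, here is a minimal sketch of how the correlation could be verified once combined_quandl_data (the price frame built in the code later in this post) is available; looking at daily returns as well as price levels is my own addition, not something the original analysis depends on.
# Quick check of the WTI / natural gas relationship (run after combined_quandl_data is built below)
print(combined_quandl_data['WTI'].corr(combined_quandl_data['HEN_Nat_GAS']))  # correlation of daily price levels
daily_returns = combined_quandl_data.pct_change().dropna()
print(daily_returns['WTI'].corr(daily_returns['HEN_Nat_GAS']))  # correlation of daily returns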
Oil-sensitive Stocks
Stock prices reflect real-time market information and are not subject to revision, so they hold potential as valuable predictors for crude oil prices. Oil-sensitive stock prices were therefore also included.
Oil Company Stock Prices: Stock prices for the following oil companies, sourced from Yahoo Finance, were used as predictors for oil market volatility:
British Petroleum (BP): A British multinational oil and gas company and the world's sixth-largest.
ExxonMobil (XOM): An American multinational oil company headquartered in Texas.
Chevron (CVX): A global petroleum industry leader, producing an average of 791,000 barrels of net oil-equivalent per day in the U.S. in 2018.
The "adjclose" value was used as the "stock price" in predictors, as it represents the closing price adjusted for splits and dividend distributions. Stock price history spans from 1997 to 2020.
Solar Company Stock Prices: Share prices of renewable energy companies such as NextEra Energy (NEE) are closely correlated with crude oil prices. This U.S. company's stock prices from 1997 to 2020 were included in the final dataset.
Time Series Adjustment: Upon gathering data from all sources, I aligned the time range of the dataset to form a time series spanning from January 8, 1997, to November 3, 2020.
Final Dataset
The resulting dataset comprises 5951 rows × 8 columns (a quick way to inspect it is sketched after the column list below). The multivariate predictor columns include:
Date
BP Stock Price
XOM Stock Price
CVX Stock Price
NEE Stock Price
Propane Price
Natural Gas Price
The target variable: Crude Oil Price
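To double-check the shape and columns yourself, a minimal sketch along these lines should work once the dataset has been saved to CSV as in the code below (the file name here is a placeholder; adjust it to wherever you saved the file).
# Inspect the assembled dataset (file name is a placeholder)
import pandas as pd
data = pd.read_csv("Final_dataset.csv", index_col=0, parse_dates=True)
print(data.shape)             # 5951 rows; 7 value columns once Date becomes the index
print(data.columns.tolist())  # WTI, TX_PROP, HEN_Nat_GAS, XOM, CVX, BP, NEE
print(data.head())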
By leveraging this comprehensive dataset and employing deep learning techniques, this project endeavors to unravel the enigma of crude oil price fluctuations. Through the predictions generated, valuable business insights can be gleaned for the crude oil industry, potentially guiding policy-making and economic development in a world where energy resources remain indispensable.
Let us write the code now:
# Install required libraries (run these before the imports below)
!pip install quandl
!pip install yahoo_fin
!pip install requests_html
# Import required libraries
import pandas as pd
import numpy as np
import quandl
import requests_html  # dependency used by yahoo_fin
from yahoo_fin import stock_info as si
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, classification_report
# Keras building blocks used by the models further down
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, SimpleRNN, LSTM, GRU, Bidirectional, Dropout, Conv1D, MaxPooling1D
from tensorflow.keras.callbacks import EarlyStopping
# API configuration to authorize the connection; add your API key in quotes
quandl.ApiConfig.api_key = "XXX-"
# Obtain WTI spot price data, below is the URL of the documentation
# https://www.quandl.com/data/EIA/PET_RWTC_D-Cushing-OK-WTI-Spot-Price-FOB-Daily
wti_data = quandl.get("EIA/PET_RWTC_D", authtoken="XXXX-")
# Pull Texas Propane price data
tx_propane_data = quandl.get("EIA/PET_EER_EPLLPA_PF4_Y44MB_DPG_D", authtoken="XXXXX-")
# Pull Natural Gas price data
natural_gas_data = quandl.get("FRED/DHHNGSP", authtoken="XXXXX-")
# Combine the three datasets and assign proper column names
combined_quandl_data = pd.concat([wti_data, tx_propane_data, natural_gas_data], axis=1)
combined_quandl_data = combined_quandl_data.dropna()
combined_quandl_data.columns = ['WTI', 'TX_PROP', 'HEN_Nat_GAS']
# Assign the ticker list that we want to scrape
tickers_list = ['XOM', 'CVX', 'BP', 'NEE']
# Pull historical daily price data for each stock
dow_prices = {ticker : si.get_data(ticker, start_date='01/08/1997', end_date='11/04/2020', interval='1d') for ticker in tickers_list}
# Create a dataframe with stock prices
prep_data = pd.DataFrame(dow_prices['XOM']['adjclose']).rename(columns={"adjclose": "XOM"})
for i in tickers_list[1:]:
    prep_data[i] = dow_prices[i]['adjclose']
# Combine the stock prices dataframe with the quandl data
final_dataset = pd.concat([combined_quandl_data, prep_data], axis=1)
final_dataset = final_dataset.dropna()
# Save the data
final_dataset.to_csv("/content/drive/My Drive/Colab Notebooks/Final_dataset.csv")
# Create return features for each ticker, use a pct_change as the return
return_data = final_dataset.pct_change()
return_data.dropna(inplace=True)
# Create a dataframe with the percentage changes for every series and a column for whether WTI increased or decreased
df = return_data.copy()
df['Increase'] = np.where(df['WTI'] > 0, 1, 0)
# Drop the last row as a safeguard against an incomplete final observation
df.drop(df.tail(1).index, inplace=True)
# Plot the data
df.plot(subplots=True, grid=True, layout=(3,4), figsize=(15,15))
plt.show()
# Correlation analysis
df_corr = df.corr()
fig, ax = plt.subplots(figsize=(10, 10))
sns.heatmap(df_corr, annot=True)
# Prepare the data for training and testing
y = df['Increase']
X = df.drop(['Increase', 'WTI'], axis=1)
# Split the data into train and test partitions
# Use 80% of the data for training
# and the remaining 20% for testing
train_pct_index = int(0.8 * len(X))
X_train, X_test = X[:train_pct_index], X[train_pct_index:]
y_train, y_test = y[:train_pct_index], y[train_pct_index:]
#Scale the data using StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
#Convert the arrays back to dataframes
X_train = pd.DataFrame(X_train)
X_test = pd.DataFrame(X_test)
y_train = pd.DataFrame(y_train)
y_test = pd.DataFrame(y_test)
#Print the shapes of the train and test data
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
# Reset the index on everything to avoid any issues with dates and integers
X_train.reset_index(inplace=True, drop=True)
X_test.reset_index(inplace=True, drop=True)
y_train.reset_index(inplace=True, drop=True)
y_test.reset_index(inplace=True, drop=True)
# Put X_train and y_train together (did not scale Y)
# Put X_test and y_test together (again, did not scale Y before)
df_train = pd.concat([X_train, y_train], axis=1)
df_test = pd.concat([X_test, y_test], axis=1)
# Print the shapes of the train and test data
print(df.shape, df_train.shape, df_test.shape)
df_train.head()
# Print the count of each value of Increase column
print(df['Increase'].value_counts())
# Split a multivariate sequence into samples
def split_sequences(sequences, n_steps):
    X, y = [], []
    for i in range(0, len(sequences) - n_steps):
        # Find the end of this pattern
        end_ix = i + n_steps
        # Gather input and output parts of the pattern
        seq_x, seq_y = sequences[i:end_ix, :-1], sequences[end_ix-1, -1]
        X.append(seq_x)
        y.append(seq_y)
    return np.array(X), np.array(y)
# Setting time look back of 10
n_steps = 10
X_train, y_train = split_sequences(np.array(df_train), n_steps)
X_test, y_test = split_sequences(np.array(df_test), n_steps)
# Check the shape of the train and test data
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
# Verify no NaN values
print(np.isnan(X_train).sum())
print(np.isnan(y_train).sum())
print(np.isnan(X_test).sum())
print(np.isnan(y_test).sum())
# Print the first element of X_train
print(X_train[0])
# Print the first element of y_train and the first 10 elements of df
print(y_train[0], df.head(10))
# Confirm the correct features and steps
n_steps = X_train.shape[1]
n_features = X_train.shape[2]
print(n_steps, n_features)
# Now let's build a model
# Binary classification setup: sigmoid output with binary cross-entropy loss
# Define the number of steps and features
n_steps = X_train.shape[1]
n_features = X_train.shape[2]
# Define the model
model = Sequential()
model.add(SimpleRNN(60, input_shape=(n_steps, n_features), activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.summary()
# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
# Set up early stopping
es = EarlyStopping(monitor='val_acc', mode='max', patience=10, verbose=1, restore_best_weights=True)
# Fit the model
model.fit(X_train, y_train, epochs=500, batch_size=5, validation_split=0.2, verbose=1, callbacks=[es], shuffle=True)
# Make a prediction
pred = model.predict(X_test)
print(pred)
# Round the predictions to 0 or 1
pred = np.round(pred, 0)
pred
# Import confusion matrix and classification report from sklearn metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
# Print the confusion matrix and classification report
print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred))
# Plot the test results
plt.figure(figsize=(20, 10))
plt.plot(np.arange(X_test[900:].shape[0]), y_test[900:], color='blue') # actual data
plt.plot(np.arange(X_test[900:].shape[0]), pred[900:], color='grey') # predicted data
plt.suptitle('Test Results')
plt.xlabel('Time')
plt.ylabel('Increase?')
plt.show()
# Now let's build an LSTM model
# n_steps and n_features were already defined above from X_train's shape
# Define the model
model = Sequential()
model.add(LSTM(30, input_shape=(n_steps, n_features), activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
model.summary()
# Set up early stopping
es = EarlyStopping(monitor='val_acc', mode='max', patience=10, verbose=1, restore_best_weights=True)
# Fit the model
model.fit(X_train, y_train, epochs=500, batch_size=5, validation_split=0.2, verbose=1, callbacks=[es], shuffle=True)
# Make a prediction
pred = model.predict(X_test)
print(pred)
# Round the predictions to 0 or 1
pred = np.round(pred, 0)
print(pred)
# Print the confusion matrix and classification report
print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred))
# Plot the timeseries of actual vs. predicted values on the test data
plt.figure(figsize=(20, 10))
plt.plot(np.arange(X_test.shape[0]), y_test, color='blue') # actual data
plt.plot(np.arange(X_test.shape[0]), pred, color='red') # predicted data
plt.suptitle('Test Results')
plt.xlabel('Time')
plt.ylabel('Increase?')
plt.show()
# Zoom in to see last few values
plt.figure(figsize=(20, 10))
plt.plot(np.arange(X_test[900:].shape[0]), y_test[900:], color='blue') # actual data
plt.plot(np.arange(X_test[900:].shape[0]), pred[900:], color='grey') # predicted data
plt.suptitle('Test Results')
plt.xlabel('Time')
plt.ylabel('Increase?')
plt.show()
# Define the number of steps and features
n_steps = X_train.shape[1]
n_features = X_train.shape[2]
# Define the model
model = Sequential()
model.add(LSTM(64, return_sequences=True, activation='relu', input_shape=(n_steps, n_features)))
model.add(Dropout(0.1))
model.add(Bidirectional(LSTM(32, activation='relu')))
model.add(Dropout(0.1))
model.add(Dense(32, activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(1, activation='sigmoid'))
model.summary()
# Compile the model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# early stopping
es = EarlyStopping(monitor='val_accuracy',
mode='max',
patience=20,
verbose=1,
restore_best_weights=True)
# Fit the model (results will vary from run to run unless you fix the random seed)
model.fit(X_train, y_train,
epochs=100,
batch_size=5,
validation_split=0.2,
verbose=1,
callbacks=[es],
shuffle=True)
The EarlyStopping callback is created with monitor='val_accuracy', which means it monitors the validation accuracy for improvements. The fit method is then called on the model with the training data (X_train, y_train), epochs=100, batch_size=5, and validation_split=0.2, so 20% of the training data is used for validation during training. The es callback is also passed in to stop training early, restoring the best weights, if validation accuracy does not improve for 20 epochs.
# Make a prediction on the test set
pred = model.predict(X_test)
# Round the predicted probabilities to 0 or 1
pred = np.round(pred, 0)
# Confusion matrix and classification report (these imports are also included at the top)
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred))
plt.figure(figsize=(20,10))
# Show the timeseries plot of actual vs. predicted values on the test data
plt.plot(np.arange(X_test.shape[0]), y_test, color='blue') # actual data
plt.plot(np.arange(X_test.shape[0]), pred, color='red') # predicted data
plt.suptitle('Test Results')
plt.xlabel('Time')
plt.ylabel('Increase?')
plt.show()
#Zooming in to see last few values
plt.figure(figsize=(20,10))
plt.plot(np.arange(X_test[900:].shape[0]), y_test[900:], color='blue') # actual data
plt.plot(np.arange(X_test[900:].shape[0]), pred[900:], color='grey') # predicted data
plt.suptitle('Test Results')
plt.xlabel('Time')
plt.ylabel('Increase?')
plt.show()
# Finally, a hybrid model combining a 1D convolution with recurrent layers
# (n_steps and n_features were defined above from X_train's shape)
# Define the model
model = Sequential()
model.add(Conv1D(filters=64, kernel_size=3, input_shape=(n_steps, n_features)))  # input shape goes in the first layer
model.add(MaxPooling1D(2))
model.add(Bidirectional(LSTM(30,
                             return_sequences=True,  # when stacking recurrent layers, return sequences
                             activation='relu',
                             recurrent_dropout=0.2)))
model.add(GRU(20, activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(1, activation='sigmoid'))
model.summary()
# Compile the model (the earlier compile call applied to the previous model object)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
# set up early stopping
es = EarlyStopping(monitor='val_acc', mode='max', patience=10, verbose=1, restore_best_weights=True)
# fit the model (using early stopping)
model.fit(X_train, y_train, epochs=500, batch_size=5, validation_split=0.2, verbose=1, callbacks=[es], shuffle=True)
# make predictions
pred = model.predict(X_test)
pred = np.round(pred, 0)
# calculate and print confusion matrix and classification report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred))
# show timeseries plot on the test data
plt.figure(figsize=(20, 10))
plt.plot(np.arange(X_test.shape[0]), y_test, color='blue')
plt.plot(np.arange(X_test.shape[0]), pred, color='grey')
plt.suptitle('Test Results')
plt.xlabel('Time')
plt.ylabel('Increase?')
plt.show()
# zoom in to see last few values
plt.figure(figsize=(20, 10))
plt.plot(np.arange(X_test[900:].shape[0]), y_test[900:], color='blue')
plt.plot(np.arange(X_test[900:].shape[0]), pred[900:], color='grey')
plt.suptitle('Test Results')
plt.xlabel('Time')
plt.ylabel('Increase?')
plt.show()
In this code, we predict the direction of WTI crude oil price movements (an increase or not) using several types of neural networks: a simple recurrent neural network (RNN), a Long Short-Term Memory (LSTM) network, a stacked bidirectional LSTM with dropout, and a hybrid model combining a 1D convolution with recurrent layers. These networks are designed to model sequential data, making them well suited to time series problems like this one.
First, we acquire the crude oil, propane, and natural gas prices from the Quandl API and the stock prices from Yahoo Finance. We then preprocess the data: converting prices to daily percentage changes, scaling the predictors, and reshaping them into overlapping windows that can be fed into the networks.
Next, we create the models. The LSTM variants are designed to better capture long-term dependencies in time series data, and the dropout layers aim to reduce overfitting so the models generalize better to unseen data.
Once the models are built, we train them on the historical data, with early stopping on validation accuracy so that each model keeps the weights from its best epoch.
After training, we evaluate each model on the held-out test set using a confusion matrix and classification report, and we plot the predicted against the actual increase/decrease labels. These visualizations make it easy to compare the models and assess how well each captured the directional behavior of the WTI series.
In summary, this code demonstrates how different types of neural networks can be used to predict the direction of crude oil price movements, providing insight into potential trends in the market. By understanding these trends, one can make more informed decisions regarding investments, trading, or policy-making in the energy sector.
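The code above evaluates each model separately right after it is trained. If you would rather see a single side-by-side comparison, a minimal sketch could look like the following; note that the model variable names here (rnn_model, lstm_model, bilstm_model, cnn_rnn_model) are hypothetical, since the code above reuses the name model for each architecture, so you would need to keep a separate reference to each fitted model.
# Hypothetical side-by-side comparison of test accuracy (assumes each fitted model was kept under its own name)
from sklearn.metrics import accuracy_score
models = {
    'SimpleRNN': rnn_model,          # hypothetical reference to the fitted SimpleRNN
    'LSTM': lstm_model,              # hypothetical reference to the fitted LSTM
    'Stacked BiLSTM': bilstm_model,  # hypothetical reference to the stacked bidirectional LSTM
    'CNN + RNN': cnn_rnn_model,      # hypothetical reference to the hybrid model
}
for name, m in models.items():
    preds = np.round(m.predict(X_test)).flatten()
    print(f"{name}: test accuracy = {accuracy_score(y_test, preds):.3f}")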
If you are reading this post and thinking “What the Hell did I just read?”, that is okay. Learning Python can seem like a daunting task at the beginning, but you can work your way through it.
Fortunately, there are plenty of free courses for learning Python online:
https://www.udemy.com/course/python-hackcc/
https://www.freecodecamp.org/learn/data-analysis-with-python/
https://www.freecodecamp.org/learn/machine-learning-with-python/
https://www.coursera.org/learn/python
Learning Python in and of itself will not make you a good trader. There are many amazing coders who cannot trade and many great traders who cannot code, and coding is not the only way to get good at trading. However, the more you automate your trading, the less likely you are to “tilt” or trade based on emotions.
The weekly plan will be out, per usual, before the Globex open.
-Fin